This article explains the concept of disaster testing and how we apply disaster tests to the Trendyol network infrastructure.
Disaster, by definition, is an event that results in great harm, damage, or serious difficulty. If you look it up in the network engineer’s dictionary, some potential definitions would be:
- An event that results in the loss of equipment or an entire site in a non-redundant system
- A failure where redundancy exists, but failover to the backup did not happen
- An outage where services are down and you have no idea where the problem is
To avoid all the circumstances listed above, we decided to test our network infrastructure against potential disaster scenarios and get ready for the worst.
“It’s too risky; we can’t afford it.”
Let’s start with the sentence that you’ll hear the most when you start talking about disaster tests. I will explain our testing phases later in this article, but long story short, we are talking about powering off random network equipment in production and cutting off DC uplinks. So, there is a risk.
However, "it's too risky, we can't afford it" is a contradiction when applied to failover and redundancy. You rely on the very redundant design that you find too risky to test?
First, assess the risk you are so afraid to take by testing failover to backup in a maintenance window. Then consider the risk of hitting the same failure during an important marketing event. Imagine that the master node dies during Black Friday and the redundancy mechanism doesn't kick in as it was supposed to. Is that more affordable?
The bright side of working at a company like Trendyol is that we still maintain the startup spirit. Corporate culture creeps in as the company grows, but the agile mindset remains in charge and prevents cascaded approval chains. The teams have autonomy and authority over the domains they are responsible for.
The network team decided that the disaster tests were necessary, discussed the potential impact and consequences with other teams, and scheduled the tests accordingly.
Network disaster test plan
1. Categorize first
Certain tests can be done in the same maintenance window. Therefore, we decided to categorize the tests to avoid excessive night work and not occupy too many change windows that we share with other infrastructure teams.
We divided our tests into four groups:
I. Out of band management tests — This is more like preparation for disaster testing. You should make sure that the equipment is accessible via console servers in case something goes wrong during the tests.
The team first checked the physical connections and configuration on the management switches and console servers. Later on, they verified that every device was actually accessible out of band.
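That verification can be automated with a lightweight probe of each console-server endpoint before the maintenance window. The sketch below is illustrative, not Trendyol's actual tooling: the device names, console-server hosts, and ports are made up, and the connector is injectable so the logic can be exercised without real hardware.

```python
import socket

def check_oob_access(targets, connect=socket.create_connection, timeout=3):
    """Probe each console-server TCP endpoint and report unreachable devices.

    `targets` maps a device name to its (console_server, port) endpoint.
    All names and ports here are hypothetical examples. `connect` defaults
    to a real TCP connection attempt but can be replaced for dry runs.
    """
    unreachable = []
    for device, endpoint in targets.items():
        try:
            conn = connect(endpoint, timeout=timeout)
            conn.close()
        except OSError:
            unreachable.append(device)
    return sorted(unreachable)
```

Running this against the full inventory right before the tests gives a simple go/no-go signal: an empty result means every device can still be reached if its in-band path is lost.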
II. DC fabric failover tests — Tests to be done in a single night per datacenter:
- TOR pair switch host port and uplink failover
- TOR pair switch power off and failover
- Super spine, spine and border leaf power off and failover
III. DC WAN tests — Tests to be done in a single night per datacenter:
- WAN routers, WAN edge switch, firewall uplink failure
- WAN routers, WAN edge switch, firewall power off and failover
- Security chain deactivation/reactivation tests
IV. DC interconnection tests — Tests to be done on the same night for all three datacenters:
- DC interconnection DWDM Link Failure
- DC interconnection router power off and failover
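A simple way to quantify any of the power-off and link-failure tests above is to run a continuous ping against an affected prefix for the whole test and measure the longest gap between replies. A minimal sketch, assuming the reply timestamps have already been parsed out of the ping capture (the parsing itself depends on your tooling and is omitted):

```python
def worst_downtime(reply_timestamps, interval=1.0):
    """Estimate the worst outage (in seconds) from ping reply timestamps.

    `reply_timestamps` are the times, in seconds, at which replies arrived
    during the test. One probe interval is subtracted from each gap, since
    consecutive replies are expected to be one interval apart anyway.
    """
    worst = 0.0
    for prev, cur in zip(reply_timestamps, reply_timestamps[1:]):
        gap = cur - prev
        if gap > interval:
            worst = max(worst, gap - interval)
    return worst
```

For example, replies at seconds 0, 1, 2, 6, 7 with a one-second interval indicate roughly three seconds of downtime, which you can compare against the failover target you committed to before the test.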
2. Do virtual tests before real ones
You can do many things to check your network's health before running around shutting down routers or unplugging uplinks. There is no use in testing scenarios in real life without reviewing them offline first. Study the scenarios on paper, then test them in the lab. Only if everything is perfect should you consider running them in production.
We call all these studies and non-prod tests "virtual tests." If you fail virtually, you'll fail in real life, so there is no point in confirming the same failure on-site; focus on solving the problem in the lab instead. Production tests only make sense once you're sure everything is flawless in the lab environment.
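Part of the on-paper study can itself be automated. As a hedged sketch (the topology here is a toy two-spine fabric, not Trendyol's real design), you can model the network as a graph and check whether any single node failure would disconnect two endpoints, before touching either the lab or production:

```python
from collections import deque

def reachable(adjacency, src, dst, failed=frozenset()):
    """BFS reachability over the topology, skipping failed nodes."""
    if src in failed or dst in failed:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nbr in adjacency.get(node, ()):
            if nbr not in seen and nbr not in failed:
                seen.add(nbr)
                queue.append(nbr)
    return False

def single_failure_weak_points(adjacency, src, dst):
    """Return nodes whose individual loss disconnects src from dst."""
    middle = set(adjacency) - {src, dst}
    return sorted(n for n in middle
                  if not reachable(adjacency, src, dst, failed={n}))
```

On a fabric where hosts are single-homed to one leaf, this immediately flags the leaves as weak points while the spines survive the check, which is exactly the kind of result you want on paper before scheduling a power-off test.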
3. Decide what data you should collect during the tests
Data is everything; it holds every answer within. How will you decide whether the tests were successful? You'll only know if you collect the right data.
Prepare the test documentation to ensure you’ll get all the data you need at the end of the tests. The documentation should include pre-check and post-check steps and the data you should collect at each testing phase. Decide what numbers you expect to see beforehand so that you’ll have a chance to compare the results on the fly and troubleshoot if you notice any unexpected results.
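Comparing expected numbers against post-check results on the fly only needs a small helper. The metric names below ("bgp_neighbors_up" and so on) are hypothetical placeholders for whatever your own documentation tracks:

```python
def compare_checks(expected, observed, tolerances=None):
    """Compare post-check data against expected values; return deviations.

    `expected` and `observed` map a metric name (illustrative examples:
    'bgp_neighbors_up', 'default_routes') to a number. `tolerances` allows
    per-metric slack for values that legitimately fluctuate. A metric that
    is expected but missing from `observed` is reported as (want, None).
    """
    tolerances = tolerances or {}
    deviations = {}
    for metric, want in expected.items():
        got = observed.get(metric)
        if got is None or abs(got - want) > tolerances.get(metric, 0):
            deviations[metric] = (want, got)
    return deviations
```

An empty result means the post-check matches the plan; anything else is an immediate prompt to start troubleshooting while you are still in the maintenance window.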
Disaster testing is an excellent opportunity to test your monitoring systems and alerts. Keep an eye on your dashboards and alert panels to verify whether your monitoring system reflects on-site events.
4. Review the outcomes and take necessary actions
This is the most crucial step. If you faced unexpected results in the field, now it's time to fix them.
Sometimes it may not be possible to fix the issue. For example, say you hit a bug that prevents a smooth failover, and the fix is not yet released. In that case, you can work on a workaround to decrease the failover time. If no such workaround exists, you can prepare an action plan to run if you face such an incident in the future, so that you will know exactly what to do. Whether the tests fail or succeed, it's a win.
Some thoughts about proactivity
Documentation, planning, analysis, briefing other teams, all the maintenance windows, hours of testing, and sleepless nights… It took a lot of effort. The most amazing thing is that the team didn't have to do it at all. Everything had been working fine, and they already had too much work to do. So why did they bother?
The answer is a sense of responsibility. Stephen R. Covey defines the word "responsibility" as follows:
“Look at the word responsibility — “response-ability” — the ability to choose your response. Highly proactive people recognize that responsibility. They do not blame circumstances, conditions, or conditioning for their behavior.”
Due to unforeseen events beyond your control, you may lose a whole site, master nodes, or the primary path. Depending on the cause of the failure, one can easily blame the vendor for the faulty chip or the DC facility for problematic voltage.
The network team refused to make such excuses. Instead, they took full responsibility and chose to seek a “response” to unexpected circumstances. That’s why I believe performing disaster testing is the ultimate level of proactivity a network team can achieve.
Thanks for reading to the end. Let me know your thoughts in the comments.