What does Black Friday mean for the infrastructure teams?

Ezgi Küşüm
Oct 26, 2021

Cabin crew slides armed and cross-check. We’re ready for flight! As prices go down, customer traffic goes up to the sky.

If you think that Trendyol’s legendary November events are solely about influencer advertisements and TV commercials, you’re wrong. This article explains how we, the Trendyol system infrastructure teams, got ready for the most prominent e-commerce event of the year.

Photo by John Salvino on Unsplash

Prologue

My professional experience is mostly with service providers and mobile operators; I am new to the world of e-commerce. Let me brief you a bit before I get to the point, as I know most of my contacts don’t have an e-commerce background.

The peak traffic mostly depends on how aggressive the marketing will be

Trendyol runs many campaigns throughout the year, the most aggressive being on Black Friday, which means we’ll see the highest traffic that day. We live and get ready for those rush hours.

When estimating traffic load, the number of active shoppers (customers who bought at least one product within a year) is important. However, it is impossible to guess the peak traffic by looking at existing customers or trends. For example, the traffic on a significant event such as Black Friday can be 10 times that of a regular evening or twice that of a standard campaign. Considering that the number of active shoppers between campaigns has not increased that much, we can conclude that traffic is primarily determined by marketing.
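As a rough illustration of that arithmetic, here is a minimal sketch of how a peak target could be derived from the two multipliers above. The function name, its parameters, and the request rates in the example are assumptions made for illustration only; they are not Trendyol figures or tooling.

```python
def estimate_peak_rps(regular_evening_rps: float,
                      standard_campaign_rps: float,
                      evening_multiplier: float = 10.0,
                      campaign_multiplier: float = 2.0) -> float:
    """Return the larger of the two rule-of-thumb estimates.

    The default multipliers come from the text: a major event can reach
    10 times a regular evening or twice a standard campaign.
    """
    from_evening = regular_evening_rps * evening_multiplier
    from_campaign = standard_campaign_rps * campaign_multiplier
    # Plan for the more pessimistic (higher) of the two estimates.
    return max(from_evening, from_campaign)


# Hypothetical baseline numbers, chosen only to show the calculation.
target = estimate_peak_rps(regular_evening_rps=1_000,
                           standard_campaign_rps=6_000)
print(f"Plan capacity for about {target:,.0f} requests per second")
```

Taking the maximum of the two estimates is a deliberately conservative choice: since marketing drives the peak, it is safer to size for the higher number.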

Event dates are set

You may receive bad news from the vendor that the delivery of your order will be delayed by several weeks. The data center facility may say that the new hall, where you planned to begin installations next month, will not be ready for another month.

These are very valid excuses to postpone a project’s delivery date. However, if Black Friday is around the corner and you know that you need more capacity and resources to handle customer traffic during the event, you should find a way and get ready. Unfortunately, you cannot postpone the event.

I will write another article about the impact of Covid-19 on delivery times and the workarounds we have implemented to overcome delays. Stay tuned.

Some “decacorn” problems…

A decacorn is a privately owned company with a valuation of more than $10 billion. You may have heard the news:

“Trendyol is now the first decacorn from Turkey and the most valuable private internet company in EMEA! After raising $1.5 billion from a roster of high-profile investors, we are now valued at $16.5 billion USD.”

Let’s look at some figures and news shared with the press in recent years:

“In 2021 Trendyol has evolved into a super-app, combining its marketplace platform powered by its own last-mile delivery solution (Trendyol Express), with instant grocery and food delivery through its own courier network (Trendyol Go), its digital wallet (Trendyol Pay), consumer-to-consumer channel (Dolap) and many other services.”

Ongoing investments in Trendyol mean new business lines and growth, which require an equivalent extension in infrastructure. As a result, new deployments should continue throughout the year and be available to handle the highest traffic we’ll have during November events.

Photo by Isaac Smith on Unsplash

Getting ready

We mainly focused on three items: capacity and resources, infrastructure health check, and backlog cleanup.

1. Do you have enough capacity and resources?

The best way to answer this question is to run load tests, where we send steadily increasing traffic to each service and discover bottlenecks.

Services are individual functions running on Trendyol, and you use many of them throughout your visit to the application. For example, you search for a product and add it to the cart while the app suggests other products to you; later on, you pay for the product via credit card or digital wallet. All these actions are performed by different services running on the app. During load tests, services are hit simultaneously, and as infrastructure teams, we try to understand whether our domain creates any bottleneck.

Photo by Alora Griffiths on Unsplash

How we perform load tests:

  • The traffic is sent from a test installation on a web cloud outside our data centers to simulate customer traffic.
  • Tests are done during the night to avoid service interruption during day hours and may continue till morning.
  • All technical teams participate in load tests and collect data. At the end of the tests, they share findings regarding the bottlenecks in their domain and action points.
  • We run at least four load tests. Teams work on the action points and fix the detected issues before the next one.
  • If some problems remain, we can arrange further load tests until all is well.
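The idea of sending incrementing traffic and looking for the first step where a service degrades can be sketched as follows. This is a minimal illustration, not our actual test tooling: the class, the function names, the thresholds, and all measurements are made up, and a real test would collect the numbers from monitoring rather than from a hard-coded list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    rps: int               # request rate sent during this step
    p95_latency_ms: float  # 95th percentile latency observed
    error_rate: float      # fraction of failed requests (0.0 to 1.0)


def ramp_schedule(start_rps: int, peak_rps: int, steps: int) -> List[int]:
    """Evenly spaced, increasing request rates from start_rps to peak_rps."""
    if steps < 2:
        raise ValueError("a ramp needs at least two steps")
    increment = (peak_rps - start_rps) / (steps - 1)
    return [round(start_rps + i * increment) for i in range(steps)]


def first_bottleneck(results: List[StepResult],
                     max_latency_ms: float,
                     max_error_rate: float) -> Optional[StepResult]:
    """Return the first step that breaches a threshold, or None."""
    for result in results:
        if (result.p95_latency_ms > max_latency_ms
                or result.error_rate > max_error_rate):
            return result
    return None


# Illustrative, invented measurements for one hypothetical service.
rates = ramp_schedule(start_rps=1_000, peak_rps=5_000, steps=5)
latencies = [80.0, 95.0, 140.0, 420.0, 900.0]
errors = [0.0, 0.0, 0.001, 0.02, 0.08]
results = [StepResult(r, l, e) for r, l, e in zip(rates, latencies, errors)]

hit = first_bottleneck(results, max_latency_ms=300.0, max_error_rate=0.01)
if hit is not None:
    print(f"Service degrades at about {hit.rps} requests per second")
```

Each team could apply the same check to the metrics of its own domain, which is how the per-domain findings and action points are produced after each test night.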

Here is a snapshot showing the effect of testing on traffic. Test traffic reaches four to five times that of a regular evening. A similar pattern appears on almost every service and uplink.

2. Health check

If you are going to run a marathon, you have to make sure that you will be in your best shape that day. It’s not about getting a good night’s sleep and eating well the day before. Getting in shape takes time and extensive training. It’s the same when it comes to the health of your infrastructure. The infrastructure should be in its best condition during Black Friday, and it takes time to get ready. For the network infrastructure team, the health check process spanned three months.

Photo by Quino Al on Unsplash

We conduct comprehensive testing and collaborate with vendors to check the health of our network infrastructure. We aim to pinpoint existing issues, identify potential solutions, and apply the best one.

Infrastructure teams’ practices may differ for the health check. As the network team, we focused on two elements:

I. Bug scrubbing and support tickets

Vendor collaboration is crucial at this stage. We benefit from the experience and expertise of our vendors.

Bug scrubbing is going through the bugs and vulnerabilities in the software release and filtering the bugs against features used. After discovering what may impact your production, you can upgrade the devices to a release with the fix or apply workarounds on the existing release.
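Filtering bugs against the features in use is essentially a set intersection. The sketch below shows the idea under stated assumptions: the bug IDs, feature names, and release number are hypothetical placeholders, and a real bug scrub relies on the vendor’s release notes and advisories rather than a hand-written list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Bug:
    bug_id: str
    affected_features: FrozenSet[str]
    fixed_in: Optional[str]  # release containing the fix, if any


def scrub(bugs: List[Bug], features_in_use: Set[str]) -> List[Bug]:
    """Keep only bugs that touch at least one feature we actually run."""
    return [bug for bug in bugs if bug.affected_features & features_in_use]


def next_action(bug: Bug) -> str:
    """Upgrade when a fixed release exists, otherwise use a workaround."""
    if bug.fixed_in is not None:
        return f"upgrade to {bug.fixed_in}"
    return "apply workaround"


# Hypothetical data for illustration only.
known_bugs = [
    Bug("BUG-1", frozenset({"bgp"}), "9.2"),
    Bug("BUG-2", frozenset({"multicast"}), None),
    Bug("BUG-3", frozenset({"vxlan", "evpn"}), None),
]
features_in_use = {"bgp", "vxlan"}

for bug in scrub(known_bugs, features_in_use):
    print(bug.bug_id, "->", next_action(bug))
```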

We also try to eliminate all kinds of suspicious cases by raising support tickets. They should all be concluded, and recovery actions should be implemented, before Black Friday, when we’ll be in a complete change freeze.

II. Disaster tests

We performed thorough disaster testing on our data centers to ensure the failover to backup is flawless in case of an unexpected failure of the master equipment.

You can check my previous article for further details on disaster testing.

3. Backlog cleanup

We live in the upper left corner of the Eisenhower matrix due to aggressive deadlines driven by the growing business. Our sprints are often filled with important and urgent tasks. Non-urgent but important tasks can pile up as we try to meet project demands on time.

The tasks in the upper right corner are important, which means that you should do them at some point, but they are not time-sensitive, so they are not prioritized over the time-sensitive ones.

Eisenhower matrix

Backlog cleanup means focusing on the upper right quadrant. If you don’t work on those tasks early enough, they may become urgent and migrate to the upper left quadrant. We don’t want that, especially during November events.

Some example tasks that we eliminated during backlog cleanup:

  • Trying to reproduce a past problem in the lab that was encountered in production once.
  • Testing and applying a better solution for layer 2 loops.
  • Physically checking and verifying inventory and stock records.
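The quadrant logic behind backlog cleanup can be sketched in a few lines. This is only an illustration of the Eisenhower matrix as described above (urgent tasks on the left, important tasks on the top); the function names are hypothetical, and the importance and urgency flags are assigned by hand here.

```python
from typing import List, Tuple


def quadrant(important: bool, urgent: bool) -> str:
    """Map a task to its position in the Eisenhower matrix."""
    row = "upper" if important else "lower"
    column = "left" if urgent else "right"
    return f"{row} {column}"


def backlog_cleanup(tasks: List[Tuple[str, bool, bool]]) -> List[str]:
    """Return the names of important but not urgent tasks."""
    return [name for name, important, urgent in tasks
            if quadrant(important, urgent) == "upper right"]


# (task name, important, urgent); the first two are examples from the list
# above, the third is an invented urgent task for contrast.
tasks = [
    ("Reproduce a past production problem in the lab", True, False),
    ("Verify inventory and stock records", True, False),
    ("Deliver capacity for a new business line", True, True),
]

for name in backlog_cleanup(tasks):
    print("Cleanup candidate:", name)
```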

Finally, change freeze!

Let’s visit the marathon metaphor one more time. You don’t want to eat something you’re unfamiliar with right before a race because you don’t know if it will give you a sour stomach. Likewise, you wouldn’t change your running shoes and join the race with brand new ones as they can pinch your feet. This is the main idea of a change freeze.

At the beginning of November, we stop new deployments and avoid configuration changes and physical operations in our data centers to minimize risk.
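A change freeze can also be enforced in tooling, for example as a simple date check before any deployment. The sketch below is an assumption-laden illustration: the freeze dates and the function name are hypothetical, chosen only to show the shape of such a guard, and do not describe our actual process.

```python
from datetime import date

# Hypothetical freeze window; the real dates are set per event.
FREEZE_START = date(2021, 11, 1)
FREEZE_END = date(2021, 11, 30)


def change_allowed(day: date,
                   freeze_start: date = FREEZE_START,
                   freeze_end: date = FREEZE_END) -> bool:
    """Return False for any day inside the freeze window (inclusive)."""
    return not (freeze_start <= day <= freeze_end)


for day in (date(2021, 10, 29), date(2021, 11, 26)):
    status = "allowed" if change_allowed(day) else "blocked by change freeze"
    print(day.isoformat(), status)
```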

The team is currently working day and night to perform final changes before the freeze starts. But I’m sure that we’ll make it on time.

I hope you enjoyed reading. Please let me know your comments.

ezgi
