What your team can learn from traffic mega-surges like Amazon Prime Day

Christopher Null, freelance writer
 

This year's Amazon Prime Day may be delayed (reportedly until August or September), but when it does arrive, it is certain to bring with it the usual flood of web traffic from hungry shoppers looking for killer deals.

Now one of the biggest annual shopping bonanzas on the web—for Amazon, it's bigger than Black Friday and Cyber Monday combined—Prime Day doesn't just bring visitors flocking to Amazon. It also delivers an onslaught of traffic to other retailers, which wisely piggyback on Amazon with their own promotional activities. Last year, more than 250 retailers joined in, with their average revenue jumping a whopping 68% over the two days of the sale.

The bad news is that massive traffic spikes such as this can wreak havoc on your network if you're unprepared. The good news is that events such as Cyber Monday and Amazon Prime Day are announced well in advance, so you can plan for them accordingly.

With Prime Day likely a couple of months away, you have plenty of time to get your network in order so the business can plan for the surge of visitors from a place of relative comfort. Here's how.

Cache is king

Stephan Roussan, president of ICVM Group, a web development and hosting agency, said his company's client base includes high-profile corporate clients that experience high-visibility events and have a very low tolerance for downtime.

"Whenever we are preparing for an exponential traffic or load event, we start with a basic assessment."
Stephan Roussan

That assessment asks a series of questions that are instructive to any enterprise preparing for a traffic surge, including:

  • What is the baseline health of the environment, and how much ceiling is currently available in terms of available resources?
  • Can you calculate the potential peak load and anticipate when it might occur? Will there be more than one peak? How long will the peaks last?
  • If the environment is already load-balanced, how often is it rolling over to secondary and tertiary instances? Is there a chance it will exhaust all resources?

The answers to these questions help set a base understanding for capacity planning for the surge event.
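
The arithmetic behind those questions can be sketched in a few lines. This is a rough illustration, not a method from Roussan's firm: the function name, the 70% utilization target, and every number in the example are assumptions to be replaced with figures measured in your own environment.

```python
import math
from dataclasses import dataclass


@dataclass
class CapacityEstimate:
    peak_rps: float        # projected requests per second at the surge peak
    instances_needed: int  # instances required to serve that peak with headroom
    instances_to_add: int  # shortfall versus what is running today


def estimate_capacity(baseline_rps, surge_multiplier, rps_per_instance,
                      current_instances, target_utilization=0.7):
    """Estimate the instance count a projected peak would require.

    target_utilization is a hypothetical safety margin: each instance is
    planned to run at 70% of its measured limit, not at 100%.
    """
    if baseline_rps < 0 or surge_multiplier <= 0 or rps_per_instance <= 0:
        raise ValueError("traffic and per-instance figures must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    peak_rps = baseline_rps * surge_multiplier
    usable_rps_per_instance = rps_per_instance * target_utilization
    instances_needed = math.ceil(peak_rps / usable_rps_per_instance)
    instances_to_add = max(0, instances_needed - current_instances)
    return CapacityEstimate(peak_rps, instances_needed, instances_to_add)


if __name__ == "__main__":
    # Made-up inputs: 200 rps on a normal day, a 10x surge,
    # 150 rps per instance at its limit, 6 instances running now.
    print(estimate_capacity(200, 10, 150, 6))
```

With these made-up inputs the estimate comes to 20 instances, 14 more than are running. The sketch treats the peak as a single number; multiple peaks or long plateaus would need the same calculation per time window.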

As another safety measure, Roussan said, his firm will look to reschedule any processes that would normally be running during the event—such as backups and index builds. This will help minimize resource consumption during the spike. Disallowing spider and bot traffic during this time is another smart idea.

As a final tactic, Roussan said, he can't overemphasize the importance of a smart caching regimen.

"Lengthen your cache life for the anticipated spike window. You can always dial it back afterwards."
—Stephan Roussan

Just make sure to be in close contact with your content publishing teams so that everyone is on the same page with regard to cache-flush protocol and any lag time relating to making content updates visible publicly, he advised.
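
As a minimal sketch of that idea, the function below picks a longer cache lifetime only while the current time falls inside a planned surge window and returns a standard Cache-Control header value. The TTL values, the window dates, and the function name are all hypothetical; where the header is actually set (application, CDN, or reverse proxy) depends on your stack.

```python
from datetime import datetime, timezone

# Hypothetical cache lifetimes, in seconds.
NORMAL_TTL_SECONDS = 300   # 5 minutes on an ordinary day
SURGE_TTL_SECONDS = 3600   # 1 hour during the anticipated spike


def cache_control_header(now, surge_start, surge_end,
                         normal_ttl=NORMAL_TTL_SECONDS,
                         surge_ttl=SURGE_TTL_SECONDS):
    """Return a Cache-Control value with a longer max-age inside the surge window."""
    in_surge_window = surge_start <= now < surge_end
    ttl = surge_ttl if in_surge_window else normal_ttl
    return f"public, max-age={ttl}"


if __name__ == "__main__":
    # Placeholder window; substitute the dates of your own event.
    start = datetime(2030, 1, 10, tzinfo=timezone.utc)
    end = datetime(2030, 1, 12, tzinfo=timezone.utc)
    print(cache_control_header(datetime.now(timezone.utc), start, end))
```

Because the window has an end time, the cache life dials itself back afterwards without a second deployment. A longer max-age also lengthens the delay before content updates appear, which is the lag publishing teams need to know about.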

Test extensively and often

The counterpart to planning is testing, and plenty of it. Daniel Spoonhower, CTO of performance management software developer Lightstep, said that part of being ready means testing—both software systems and your staff's readiness. 

Spoonhower recommends starting with a standard barrage of load testing. Use synthetic but realistic application loads to understand what services will fail first and what their breaking points are. He also suggests taking advantage of squeeze testing—a newer tactic popularized by Netflix to determine the threshold at which a service "breaks" by gradually applying additional traffic in incremental steps.
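
The stepping logic behind that kind of test might look like the sketch below. It is an illustration under stated assumptions, not Netflix's or Lightstep's tooling: `measure_error_rate` stands in for whatever load generator and monitoring you use, and `simulated_service` is a made-up model with an arbitrary capacity of 450 requests per second so the example runs on its own.

```python
def find_breaking_point(measure_error_rate, start_rps, step_rps, max_rps,
                        error_threshold=0.01):
    """Raise load in fixed steps until the error rate crosses the threshold.

    Returns (last_healthy_rps, first_failing_rps). Either value is None if
    no step was healthy, or if the service never failed below max_rps.
    """
    if step_rps <= 0:
        raise ValueError("step_rps must be positive")

    last_healthy = None
    rps = start_rps
    while rps <= max_rps:
        if measure_error_rate(rps) > error_threshold:
            return last_healthy, rps
        last_healthy = rps
        rps += step_rps
    return last_healthy, None


def simulated_service(rps, capacity_rps=450):
    """Hypothetical stand-in: requests beyond capacity are counted as errors."""
    if rps <= capacity_rps:
        return 0.0
    return (rps - capacity_rps) / rps


if __name__ == "__main__":
    print(find_breaking_point(simulated_service, start_rps=100,
                              step_rps=100, max_rps=1000))
```

Smaller steps narrow the gap between the last healthy level and the first failing one, at the cost of a longer test run.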

And remember that testing also means testing your staff.

"Game-day testing provides a safe way for your staff to practice responding to events and ensure that they have the documentation, training, and processes necessary to respond quickly and confidently."
Daniel Spoonhower

On a similar note, scaling your services is easier if you embrace infrastructure as code, said Goutham Belliappa, Capgemini's vice president of AI engineering, so you can use simple, automated mechanisms to add capacity.

You don't want to leave 10 times your normal infrastructure idling and accruing charges when it’s not needed.

"Build in auto-scaling code routines that are thoroughly tested, and don't forget to test the throttle-down routines." 
Goutham Belliappa
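
A simplified scaling rule shows why the throttle-down path deserves its own test. The function and its limits are hypothetical and not tied to any cloud provider's auto-scaling API; the point is only that the same rule that adds capacity must also return to the minimum when traffic falls.

```python
import math


def desired_instances(current_rps, rps_per_instance, min_instances,
                      max_instances, target_utilization=0.7):
    """Return the instance count for the current load, clamped to the given limits.

    target_utilization is an assumed safety margin, not a recommended value.
    """
    if rps_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rps_per_instance and target_utilization must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances cannot exceed max_instances")

    needed = math.ceil(current_rps / (rps_per_instance * target_utilization))
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Walk a made-up traffic curve up and back down again.
    for rps in (100, 1000, 5000, 10000, 1000, 100):
        print(rps, desired_instances(rps, rps_per_instance=150,
                                     min_instances=2, max_instances=50))
```

A production routine would normally add a cooldown so brief dips do not trigger repeated scale-downs; that is left out here to keep the sketch short.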

Belliappa added that you should really double down on your worst-case scenarios when performing end-to-end testing. If you think your surge will be 10 times your monthly peak volume, push your system to 100 times so you can understand and remediate points of fragility.

"If you don't thoroughly test your infrastructure and ecosystem, you are planning to fail."

Prepare for the worst

Despite all the planning and testing, even the most robust of systems are going to fail sometime. Last year, both Costco and Nordstrom experienced online outages during Black Friday, sending thousands of potential shoppers to the competition. The solution to this is to have a strong contingency plan in place that ensures you can get back up and running quickly should a system fail.

When a system experiences a dramatic change in input, as e-commerce platforms often do on major shopping holidays, the main risk is that a set of unforeseen errors can quickly cascade across the environment, leading to systemic failure, said Tal Weiss, CTO of OverOps.

"At that point, what began as a local surge of errors in one or more services or components can rapidly begin to generate errors in downstream or dependent services."
Tal Weiss

The challenge in responding to a failure event such as this, Weiss said, is to determine which of the errors that you're experiencing are new. In a scenario where literally millions of alerts could be flooding the system at once, only a handful may be relevant to finding a solution.
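
One way to separate new errors from familiar ones is to compare error counts during the surge with counts recorded beforehand. The sketch below is a simplified illustration, not OverOps' product logic: it fingerprints each error as an (error type, code location) pair, and the sample events, file names, and threefold spike factor are invented for the example. It assumes both samples cover time windows of the same length.

```python
from collections import Counter


def count_fingerprints(events):
    """Count errors by fingerprint, here simply (error_type, code_location)."""
    return Counter((error_type, location) for error_type, location in events)


def find_deviations(baseline, current, spike_factor=3.0):
    """Return fingerprints that are new, and known ones that spiked.

    baseline and current are Counters built over equal-length windows.
    """
    new_errors = sorted(fp for fp in current if fp not in baseline)
    spiking = sorted(fp for fp, count in current.items()
                     if fp in baseline and count > baseline[fp] * spike_factor)
    return {"new": new_errors, "spiking": spiking}


if __name__ == "__main__":
    # Invented sample data for illustration only.
    before = count_fingerprints(
        [("TimeoutError", "cart.py:88")] * 10
        + [("KeyError", "search.py:12")] * 2
    )
    during = count_fingerprints(
        [("TimeoutError", "cart.py:88")] * 12
        + [("KeyError", "search.py:12")] * 9
        + [("ConnectionResetError", "checkout.py:140")] * 5
    )
    print(find_deviations(before, during))
```

In this invented data the familiar timeout stays near its usual rate and is ignored, while the previously unseen checkout error and the sharply rising search error are surfaced.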

"Your team will need to have a process in place to proactively establish a baseline ahead of the surge, as well as the right set of tools to fingerprint and map errors to their code location and alert based on deviations."

Consider outsourcing the problem

Planning for a traffic surge is an incredibly complex endeavor, and if your organization doesn't have the necessary skills and tools, you can quickly find yourself in over your head.

Brian Lim is CEO of EmazingGroup, which sells club and festival outfits and accessories. He appeared on Shark Tank in 2015 to pitch EmazingLights, lighted gloves designed for the rave scene. Well aware of the date and time when the show would air, Lim knew in advance when traffic would hit the company's website almost down to the minute.

"We knew that being featured on a highly viewed television show would attract quite a bit of website traffic," said Lim. In the six months leading up to the episode's air date, the company purchased 30 servers, simulated load tests, and invested over $500,000 to make sure the site could handle the massive influx of traffic that would be coming their way.

But when the episode aired, the site went down anyway, he said.

Lim said he considers the company's choice of platform a "near fatal" error. The site was originally built on a Magento backbone.

"A lot of money was lost that day, and it made what should have been one of the happiest days one of the worst."
Brian Lim

To solve the problem, EmazingGroup surrendered, dropping its DIY e-commerce ambitions and moving everything to a hosted site on Shopify.

"Our lives have been made immeasurably easier, and we likely would never have had this issue if we had been with Shopify from the beginning. But we didn't let that situation sink our company. The experience was a great reminder to focus on our products and marketing above all other business functions as an e-commerce brand."

Learn from the big fish

What can you learn from such mega-traffic events and the planning that goes into them? Follow the lead of the largest retailers, adapt their practices to a level appropriate for your online services, and you will be one step ahead. After all, customers consider performance to be a primary reflection on your brand.
