The AWS Outage: How We Were Prepared for the Worst

Editor’s note: With GCP’s region-wide outage on June 29th and the outages it caused for other service providers in our industry, we think it’s worth revisiting this piece on what TUNE does to run highly stable, fault-tolerant services in the cloud. The issue that occurred with GCP is something that we’re fully prepared to handle with AWS, and this blog post is an example of that preparation paying off for our customers. And we haven’t rested on our laurels since this piece was published in 2017 – we’ve been tireless in ensuring that our technology uses every avenue available to stay rock solid, even when the major cloud service providers have issues.

Yesterday saw one of the worst Amazon Web Services (AWS) outages to date: the S3 service in the us-east-1 AWS region experienced errors for several hours, creating a cascade of errors and outages across the entire AWS ecosystem. You may have noticed, given the sudden proliferation of Twitter apologies from many of the largest SaaS businesses on the internet.

Like many tech companies, we make extensive use of S3: for logging, to serve creative, and as the single source of truth against which we can compare disparate data sets. During an average day, we sync every event TUNE receives to S3 for logging and continuity purposes. These logged events let us replay data and verify data correctness, and they are a critical part of our disaster recovery strategy. Normally we log all of this data to our S3 buckets in, you guessed it, us-east-1. On a normal day, this works great. But what about yesterday?

TUNE weathered the storm. Because TUNE engineering planned ahead.

As a contingency for this exact scenario, the TUNE engineering team had already split our critical operations over multiple AWS data centers – which is a prerequisite if you truly want your service to be highly available in the modern, distributed internet. That includes creating failover versions of our S3 buckets, reducing a potentially time-consuming operation to just a quick configuration change. We also handle any and all failure cases from both S3 and our statistics pipeline by logging and replaying all failures. This keeps our tracking instances safe from any short- to medium-term outages in either system, and allows our clients’ offers to keep tracking like nothing happened. Finally, we have the ability to shift traffic from one AWS region to our other tracking regions, with minimal delay.
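To make the "log and replay all failures" idea concrete, here is a minimal sketch in Python. It is not TUNE's actual implementation: the `ReplayingEventLog` class and the `Writer` callables are hypothetical names, and a real system would wrap an S3 client and persist the queue to disk rather than hold it in memory. It only illustrates the pattern of trying the primary store, falling back to a failover store, and queueing anything that still cannot be written.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Hypothetical writer signature: takes (key, payload) and raises on failure.
Writer = Callable[[str, bytes], None]


class ReplayingEventLog:
    """Write events to the first working store; queue anything that cannot be stored."""

    def __init__(self, writers: List[Writer]) -> None:
        self.writers = writers  # ordered: primary first, then failovers
        self.pending: Deque[Tuple[str, bytes]] = deque()

    def _try_write(self, key: str, payload: bytes) -> bool:
        for write in self.writers:
            try:
                write(key, payload)
                return True
            except Exception:
                continue  # this store is failing; try the next one
        return False

    def log(self, key: str, payload: bytes) -> None:
        if not self._try_write(key, payload):
            self.pending.append((key, payload))  # keep the event for later replay

    def replay(self) -> int:
        """Retry queued events in order; return how many were stored."""
        stored = 0
        while self.pending:
            key, payload = self.pending[0]
            if not self._try_write(key, payload):
                break  # still failing; leave the rest queued
            self.pending.popleft()
            stored += 1
        return stored
```

In this sketch, adding a failover bucket is just a matter of appending another writer to the list, which mirrors the "quick configuration change" described above.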

All of this preparation meant that when S3 failed, we could easily:

Capture the data that couldn’t be stored, and queue it for replay.
Shift our traffic to our other data centers.
Redirect our services to S3 in another region.

This all allowed us to keep our links tracking and conversions rolling in, despite the ongoing outage.
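The traffic shift can be pictured as re-weighting how requests are distributed across regions. The sketch below is an illustration only: the `shift_traffic` function, the weight values, and every region name other than us-east-1 are assumptions, and in practice this kind of change is usually made in DNS or load balancer configuration rather than in application code.

```python
from typing import Dict, Set


def shift_traffic(weights: Dict[str, float], unhealthy: Set[str]) -> Dict[str, float]:
    """Zero out unhealthy regions and rescale the remaining weights to sum to 1."""
    healthy_total = sum(w for region, w in weights.items() if region not in unhealthy)
    if healthy_total <= 0:
        raise RuntimeError("no healthy region left to receive traffic")
    return {
        region: 0.0 if region in unhealthy else w / healthy_total
        for region, w in weights.items()
    }


# Hypothetical starting weights; only us-east-1 is named in this post.
normal = {"us-east-1": 0.5, "us-west-2": 0.25, "eu-west-1": 0.25}
during_outage = shift_traffic(normal, unhealthy={"us-east-1"})
```

Rescaling the remaining weights proportionally keeps the relative balance between the healthy regions unchanged while they absorb the traffic from the failed one.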

Is there room for improvement here? Of course – we’re never satisfied. We’ll be working on our alerting, automated responses, and deploy procedures over the coming weeks to make a failover of this magnitude more seamless. We’re also already in the process of improving how we enforce caps to make them more resilient in the face of issues like these.

We’re delighted that the foresight and hard work of our engineers paid off, and we’re sure the more than 250 million clicks we tracked and reported on during the outage have our clients feeling that way too.

Author

Becky is the Senior Content Marketing Manager at TUNE. Before TUNE, she handled content strategy and marketing communications at several tech startups in the Bay Area. Becky received her bachelor's degree in English from Wake Forest University. After a decade in San Francisco and Seattle, she has returned home to Charleston, SC, where you can find her strolling through Hampton Park with her pup and enjoying the simple things in life.
