Yesterday saw one of the worst Amazon Web Services (AWS) outages to date: the S3 service in the us-east-1 AWS region experienced errors for several hours, creating a cascade of errors and outages across the entire AWS ecosystem. You may have noticed, given the sudden proliferation of Twitter apologies from many of the largest SaaS businesses on the internet.
As many tech companies do, we make extensive use of S3: for logging, to serve creative, and as the single source of truth against which we can compare disparate data sets. During an average day, we sync every event HasOffers receives to S3 for logging and continuity purposes. We use these logged events to replay data and ensure data correctness, and they are a critical part of our disaster recovery strategy. We log all of this data to our S3 buckets in, you guessed it, us-east-1. On a normal day, this works great. But what about yesterday?
HasOffers weathered the storm. Because HasOffers engineering planned ahead.
As a contingency for this exact scenario, the HasOffers engineering team had already split our critical operations over multiple AWS data centers – a prerequisite if you truly want your service to be highly available in the modern, distributed internet. That includes creating failover versions of our S3 buckets, which reduces a potentially time-consuming operation to a quick configuration change. We also handle failures from both S3 and our statistics pipeline by logging every failed write and replaying it later. This keeps our tracking instances safe from any short- to medium-term outage in either system, and allows our clients’ offers to keep tracking as if nothing had happened. Finally, we have the ability to shift traffic from one AWS region to our other tracking regions with minimal delay.
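To make the log-and-replay idea concrete, here is a minimal sketch in Python. It is not our production code: the `ReplayingWriter` class, the injected `store` callable, and the in-memory queue are all assumptions made for illustration. A real system would likely persist the queue to local disk so that queued events survive a process restart.

```python
from collections import deque
from typing import Callable, Deque


class ReplayingWriter:
    """Tries to store each event; keeps failed events for later replay.

    `store` is a hypothetical callable that writes one event to the
    backing store (for example, an S3 upload) and raises on failure.
    """

    def __init__(self, store: Callable[[bytes], None]) -> None:
        self._store = store
        # In-memory queue for illustration only (assumption).
        self._pending: Deque[bytes] = deque()

    def write(self, event: bytes) -> bool:
        """Store the event, or queue it for replay if the store fails."""
        try:
            self._store(event)
            return True
        except Exception:
            # Broad catch keeps the sketch short; real code would
            # catch the specific errors its storage client raises.
            self._pending.append(event)
            return False

    def replay(self) -> int:
        """Retry queued events in order; stop at the first failure."""
        replayed = 0
        while self._pending:
            try:
                self._store(self._pending[0])
            except Exception:
                break  # Store is still unavailable; try again later.
            self._pending.popleft()
            replayed += 1
        return replayed

    @property
    def pending(self) -> int:
        """Number of events still waiting to be replayed."""
        return len(self._pending)
```

The caller keeps accepting events while the store is down, then calls `replay()` periodically. Events are removed from the queue only after a successful write, so a failed replay attempt loses nothing.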
All of this preparation meant that when S3 failed, we could easily:
- Capture the data that couldn’t be stored, and queue it for replay.
- Shift our traffic to our other data centers.
- Redirect our services to S3 in another region.
This all allowed us to keep our links tracking and conversions rolling in, despite the ongoing outage.
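The "redirect to S3 in another region" step can be reduced to a configuration lookup when failover buckets already exist. The sketch below is one hypothetical way to express that in Python. The bucket names, the us-west-2 secondary region, and the `pick_target` helper are illustrative assumptions, not our actual configuration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class BucketTarget:
    region: str
    bucket: str


# Hypothetical targets in order of preference: primary first, failover second.
LOG_BUCKETS: Sequence[BucketTarget] = (
    BucketTarget(region="us-east-1", bucket="example-events-primary"),
    BucketTarget(region="us-west-2", bucket="example-events-failover"),
)


def pick_target(
    targets: Sequence[BucketTarget], healthy: Dict[str, bool]
) -> BucketTarget:
    """Return the first target whose region is marked healthy.

    `healthy` maps region name to a flag set by an operator or a
    health check; regions missing from the map count as unhealthy.
    """
    for target in targets:
        if healthy.get(target.region, False):
            return target
    raise RuntimeError("no healthy region available")
```

With a layout like this, failing over means flipping one flag in `healthy`. Callers that resolve their bucket through `pick_target` pick up the new region on their next write.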
Is there room for improvement here? Of course – we’re never satisfied. We’ll be working on our alerting, automated responses, and deploy procedures over the coming weeks to make a failover of this magnitude more seamless. We’re also already in the process of improving how we enforce caps to make them more resilient in the face of issues like these.
We’re delighted that the foresight and hard work of our engineers paid off, and we’re sure the more than 250 million clicks we tracked and reported on during the outage have our clients feeling that way too.
Dan Koch is TUNE's Chief Technology Officer. Previously, he was TUNE's Director of Marketing Automation, and before that the Director of Engineering at Artisan. Artisan is the industry’s first mobile experience management (MEM) platform, allowing businesses to analyze, manage, and enhance their existing mobile applications in real time without writing code or resubmitting to app stores. Dan is a graduate of both the University of Pennsylvania and the Villanova University School of Business, and built systems for Best Buy and the United States Air Force in a previous life. It's been a wild ride. Pester him on LinkedIn!