February 28th Downtime Postmortem
While we are still a small company, our customers trust our APIs and client applications to be available 100% of the time. We strive every day to meet that benchmark and, in turn, to earn your trust. We are now at a scale where seconds of downtime can equate to hundreds of lost support questions. Just last week, we did a full database migration with no downtime (we'll do a post about using replica sets in MongoDB to achieve this a little later). Yesterday, we learned that not even Amazon can achieve 100% availability and that we were wrong to assume it could. We also learned where we need to improve our codebase to better handle large-scale cloud failures such as the one the internet experienced yesterday.
In hindsight, it seems obvious that S3 could go down, but as cloud developers, we are used to trusting hosting giants like AWS to always be available. We've learned our lesson from yesterday's S3 downtime and will be providing redundancy where we had none. Our team would like to sincerely apologize to our customers and to our customers' customers for the 25-minute loss of our core services between 2:56PM and 3:21PM EST yesterday. If you're interested, you can keep reading to learn about the time leading up to the downtime, what we did to get things back online as quickly as possible, and what we'll be doing to mitigate our risk going forward.
All times are PM and Eastern Standard Time
Our exception logging service started reporting that our API's task queue was timing out while writing our learned models to AWS S3, specifically on PUT calls to our S3 bucket in the us-east region. These are asynchronous tasks, so exceptions during these tasks do not affect inbound or outbound requests to our APIs, even when they bunch up in scenarios like this one. We wrongly assumed that the S3 issue was transient and that Amazon would fix it momentarily. We checked the pulse of all of our infrastructure and everything looked fine.
Twitter started blowing up with S3 issues. We continued to monitor the situation and perform regular health checks on all of our infrastructure.
Our website went down. We wrongly assumed it was our CMS' API breaking due to the S3 outage. We assumed this because the CMS provider is a new company and had a large failure event two weeks prior. We decided not to worry about it for the time being, as our site is a non-critical part of our infrastructure, and assumed the CMS provider would resolve things shortly.
In hindsight, we should have realized that this could have been an issue with Heroku/EC2, which would have been a good signal to move our production API off of Heroku. In the future, we will treat any potential AWS failure event as a signal to move core services to our standby host (Digital Ocean). We could not confirm whether the cause was our CMS' API or EC2, because Heroku had disabled their dashboard and monitoring APIs.
David and Anson decided that, while failing training tasks on the production API aren't inherently a problem, it would be good practice to fix them. At this point, we were surprised S3 was still down and decided to assume the worst. We shipped a patch which updated the caches in the event of a failed write to S3. In hindsight, it would have been a good idea to update the socket timeout on our training task queue so that it wouldn't wait so long before failing.
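For readers who want to picture the patch, here is a minimal sketch of the idea, not our actual code. The names `save_model`, `StoreUnavailable`, `FailingStore` and `MemoryStore` are made up for illustration, and a plain dict stands in for the cache. The point is only that a failed write to the persistent store is logged and no longer prevents the cache from being updated.

```python
import logging

logger = logging.getLogger("training")


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) persistent store when a write fails."""


def save_model(model_id, payload, store, cache):
    """Try the persistent store first, but always update the cache.

    Returns True if the persistent write succeeded, or False if only the
    cache was updated.
    """
    persisted = True
    try:
        store.put(model_id, payload)
    except StoreUnavailable as exc:
        # The store is down: log it and carry on, so the API can still
        # serve the newest model from the cache.
        logger.warning("persistent write failed for %s: %s", model_id, exc)
        persisted = False
    cache[model_id] = payload
    return persisted


class FailingStore:
    """Stand-in for S3 during an outage (illustration only)."""

    def put(self, key, value):
        raise StoreUnavailable("simulated outage")


class MemoryStore:
    """Stand-in for a healthy persistent store (illustration only)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value
```

Calling `save_model("bot-1", b"weights", FailingStore(), cache)` returns `False` but still leaves the model in `cache`. A later call against a healthy store returns `True` and overwrites the stale persisted copy.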
The S3 timeouts turned into instant 500 errors.
2:56 - Start of downtime
Heroku, our cloud provider, decided to turn off all dynos (or the EC2 instances running our app were taken down by Amazon -- Heroku to confirm), presumably because EC2 had now become widely unavailable. This immediately took down our production API. Thankfully, Jason had been working on a bot and noticed the app wasn't working. He alerted me immediately.
Anson worked to restore our production API by pointing our NGINX router at our old production API VPS which, by chance, had not yet been taken down after we moved to Heroku. We updated this instance with the latest API codebase and then switched the router over to it. Having a backup of our NGINX config file from before our move to Heroku would have been useful at this moment.
3:21 - End of downtime
Our production API was back up and all of our core services were working again, but the caches had no learned models in them and reads from S3 were still failing. The team worked to restore the most frequently used models first by triggering training events manually which, with the new patch, inserted them quickly into the caches and skipped writing to S3. A script to do this automatically would have been useful, but building a better hose while your house is on fire did not seem like a good use of our time.
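The kind of script we wished we had could look roughly like the sketch below. It is an illustration under assumptions: `usage_counts` is a hypothetical mapping of model ID to recent request count, and `retrain` is whatever function triggers a training event for one model. It retrains the busiest models first and keeps going when a single model fails.

```python
def rewarm_caches(usage_counts, retrain, limit=None):
    """Retrain models in descending order of recent usage.

    usage_counts: dict of model_id -> request count (hypothetical metric)
    retrain: callable that trains one model and fills the cache
    limit: optionally only retrain the top N models
    Returns a list of (model_id, error) pairs for models that failed.
    """
    # Busiest first; ties are broken by model ID so the order is stable.
    ordered = sorted(usage_counts, key=lambda m: (-usage_counts[m], m))
    if limit is not None:
        ordered = ordered[:limit]

    failures = []
    for model_id in ordered:
        try:
            retrain(model_id)
        except Exception as exc:
            # One bad model should not stop the rest from being restored.
            failures.append((model_id, exc))
    return failures
```

The returned list of failures can be fed back into a second pass once the underlying problem is resolved.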
All client learned models are retrained and cached, circumventing S3 reliance once again. All services are running smoothly thanks to the work of our entire team.
- We trusted Heroku to stay available during this loss. This was a mistake. We will keep a second instance of our API in "hot-standby" on our old hosting provider, Digital Ocean, so we can quickly switch to it during these types of failures before Heroku goes down. Heroku gave us no notice of downtime despite seemingly making a conscious decision to put their own services in maintenance mode. We'll be talking with them about that.
- We had written code which required a write to our persistent store (S3) to succeed before a write to our cache could happen. This code ran on a queue whenever you trained or untrained your bot. We've already changed this code to attempt the write to the persistent store and, if that fails, to continue on and write to our caches. This will provide for zero downtime when the persistent store is failing, as the cache will be used. When S3 becomes available again, future writes will succeed, replacing the stale version of the file. We also plan to write to the caches first, since this is an asynchronous task, making the most recent version of your learned models available quickly and without reliance on the persistent store.
- We will be persisting our learned models across multiple regions so that in the event that a region becomes unavailable we can quickly switch to another one.
- While Jason makes an excellent canary, we will be setting up integration health checks that run once every 10 seconds with a reputable service provider to be chosen today.
- We will lower the socket timeout in boto so that tasks fail quickly when S3 downtime causes network timeouts.
- We will keep backups of all useful NGINX configurations for quicker routing changes.
- We will build a script which we can use to retrain all learned models in the event of our caches needing to be flushed.
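To make the hot-standby and health-check items above a little more concrete, here is a rough sketch of the decision logic, not a description of the service we will end up using. Everything in it is an assumption for illustration: the `FailoverMonitor` class, the `probe` callable (which would make a real request against the API), the host names, and the threshold of three consecutive failures.

```python
import time


class FailoverMonitor:
    """Tracks primary health and decides which host should take traffic."""

    def __init__(self, probe, primary, standby, max_failures=3):
        self.probe = probe              # callable(host) -> bool, hypothetical
        self.primary = primary
        self.standby = standby
        self.max_failures = max_failures
        self.failures = 0               # consecutive failed probes
        self.active = primary

    def check_once(self):
        """Probe the primary once and return the host traffic should use."""
        if self.probe(self.primary):
            # A healthy probe resets the counter and fails back to primary.
            self.failures = 0
            self.active = self.primary
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.active = self.standby
        return self.active

    def run(self, interval_seconds=10, sleep=time.sleep):
        """Check forever at a fixed interval (10 seconds, as planned above)."""
        while True:
            self.check_once()
            sleep(interval_seconds)
```

Requiring several consecutive failures avoids flipping to the standby on a single dropped request. Failing back on the first healthy probe is a simplification, and a real setup would likely want a human to confirm that step.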
We are sincerely sorry for any inconvenience yesterday's downtime caused your company. We take our customers' trust very seriously. If you'd like to discuss anything further or have questions, concerns or feedback on how we can improve things more than we've outlined above, please email me (David Hariri) directly at firstname.lastname@example.org.