In the wee hours of Sunday morning something went very wrong in an Amazon Web Services data center.
At 6 AM ET, error rates for the company’s massive NoSQL database, DynamoDB, began skyrocketing in AWS’s US-East Virginia region – the oldest and largest of its nine global regions. By 7:52 AM ET, AWS had determined the cause of the problems: the service’s internal metadata management had gone awry, impacting DynamoDB’s partitions and tables.
Amazon Web Services’ Health Dashboard shows the timeline of events from Sunday’s outage, including the root cause.
Because of the intricate interconnectivity of AWS’s services, the issue snowballed to impact 34 of the 117 services that the company’s Service Health Dashboard monitors. Everything from Elastic Compute Cloud (EC2) virtual machines to the Glacier storage service to the Relational Database Service was impacted. According to media reports, other companies that rely on AWS experienced outages too, including Netflix, IMDb, Tinder, Pocket, and Buffer.
By noon on Sunday, AWS reported the issue was resolved, but not without numerous complaints and musings on Twitter and elsewhere.
What can we take away from this event? Below are some thoughts.
Even the big boys fail
Amazon Web Services is the kingpin of the public IaaS cloud market – although Microsoft seems to be giving the company a run for its money. Sunday’s events remind us that even big, established cloud vendors are still vulnerable to outages.
Prepare for outages
Given that even the most mature cloud offering on the market can still suffer a six-plus-hour service disruption, customers should plan for outages. AWS has long advised customers to architect their systems to handle virtual machines and other services going down.
DownDetector.com showed higher-than-normal error reports for Netflix on Sunday morning. A company spokesperson denied that the service was significantly impacted.
Netflix, perhaps one of Amazon’s biggest brand-name cloud customers, said via a spokesperson that the impact of the outage on its services was minimal because it automatically migrated workloads from the US-East region to another region when the outage happened. Anyone who uses AWS for mission-critical apps should architect their systems with the expectation that the underlying services could fail at any time, as Netflix has done. Even so, despite the company not acknowledging a major outage, third-party tracking sites showed higher-than-normal customer reports of disruption to Netflix’s service.
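The failover pattern described above can be sketched generically: try the primary region first, and fall back to a standby when calls start failing. This is a minimal illustration only; the region names and the `fetch` function are hypothetical placeholders, not Netflix’s actual tooling or an AWS API.

```python
# Minimal sketch of cross-region failover. Region names and the fetch
# function are hypothetical placeholders for illustration only.

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then standbys


def call_with_failover(fetch, regions=REGIONS):
    """Invoke fetch(region), moving to the next region when one fails."""
    last_error = None
    for region in regions:
        try:
            return fetch(region)
        except ConnectionError as exc:  # treat network errors as region failure
            last_error = exc
    raise RuntimeError("all regions failed") from last_error


# Example: a fake fetch where us-east-1 is down, so the standby serves the call.
def fetch(region):
    if region == "us-east-1":
        raise ConnectionError("us-east-1 unavailable")
    return f"served from {region}"


print(call_with_failover(fetch))  # -> served from us-west-2
```

Real deployments automate this with health checks and DNS or load-balancer routing rather than in-application retry loops, but the principle is the same: no single region is assumed to be up.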
“I told you so”
A blogger at Forbes argues that this outage changes nothing. I basically agree. If you’re an AWS fanboy, you will say that these outages are less frequent than they used to be and that if you heed AWS’s best practices, these situations will not impact you.
On the other side of the coin, outages like Sunday’s will only be further fodder for folks who are wary of sending workloads to the public cloud.
The fact is outages happen. They happen in the public cloud, across any and all providers, and they happen in internal data centers that companies run too. They’re just a fact of life for IT.