5 lessons from Amazon’s S3 cloud blunder

According to internet monitoring platform Catchpoint, Amazon Web Services’ Simple Storage Service (S3) experienced a three-hour, 39-minute disruption on Tuesday that had cascading effects across other Amazon cloud services and many internet sites that rely on the popular cloud platform.

“S3 is like air in the cloud,” says Forrester analyst Dave Bartoletti; when it goes down, many websites can’t breathe. But disruptions, errors and outages are a fact of life in the cloud. Bartoletti says there’s no reason to panic: “This is not a trend,” he notes. “S3 has been so reliable, so secure, it’s been the sort of crown jewel of Amazon’s cloud.”

What this week should be, though, is a wake-up call to make sure your cloud-based applications are ready for the next time the cloud hiccups. Here are five tips for preparing for a cloud outage:

Don’t keep all your eggs in one basket

This advice will mean different things for different users, but the basic idea is that an application or piece of data deployed to a single point in the cloud will not be very fault tolerant. How highly available you want your application to be will determine how many baskets you spread your workloads across. There are multiple options:

  • At a minimum, AWS recommends spreading workloads across multiple Availability Zones (AZs). Each of the 16 regions that make up AWS is broken down into at least two, and sometimes as many as five, AZs. Each AZ is meant to be isolated from other AZs in the same region. AWS provides low-latency connections between the AZs in a region, which makes this the most basic way to distribute your workloads.
  • For increased protection, users can spread their applications across multiple regions.
  • The ultimate protection would be to deploy the application across multiple providers, for example using Microsoft Azure, Google Cloud Platform or some internal or hosted infrastructure resource as a backup.
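
To get a rough feel for how much each extra basket buys, the short Python sketch below computes the chance that at least one copy of a workload is reachable. It assumes every basket fails independently and has the same availability, which real zones, regions and providers only approximate, and the 99.9% figure is an illustrative input rather than a number published by any provider. The function name is made up for this example.

```python
def combined_availability(per_basket: float, baskets: int) -> float:
    """Chance that at least one of `baskets` copies is up.

    Assumes independent failures and identical availability per basket;
    both are simplifications made for this illustration.
    """
    if not 0.0 <= per_basket <= 1.0:
        raise ValueError("per_basket must be between 0 and 1")
    if baskets < 1:
        raise ValueError("baskets must be at least 1")
    # All copies must be down at the same time for the workload to be down.
    return 1.0 - (1.0 - per_basket) ** baskets


if __name__ == "__main__":
    # Illustrative input only: 99.9% availability for each basket.
    for count in (1, 2, 3):
        print(count, f"{combined_availability(0.999, count):.9f}")
```

Correlated failures, such as a shared dependency that takes out several zones at once, would make the real figure worse than this estimate.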

Bartoletti says different customers will have different levels of urgency for doing this. If you rely on the cloud to make money for your business or it’s integral to productivity, you’d better make sure it’s fault tolerant and highly available. If you use it to back up files that aren’t accessed frequently, then you may be able to live with the occasional service disruption.

ID failures ASAP

One key to responding to a cloud failure is knowing when one happens. AWS has a series of ways to do this. One of the most basic is to use what it calls Health Checks, which provide a customized view of the status of AWS resources used by each account. Amazon CloudWatch can be configured to automatically track service availability, monitor log files, create alarms and react to failures. One important precursor to this working is having a thorough analysis of what “normal” behavior is so that the AWS cloud tools can detect “abnormal” behavior.
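
The idea of establishing “normal” before hunting for “abnormal” can be illustrated with a small Python sketch. It does not use CloudWatch or any other AWS API; it simply builds a baseline from past latency samples and flags a new reading that strays too far from it. The function names, the sample values and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize 'normal' behavior as (mean, standard deviation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return mean(samples), stdev(samples)


def is_abnormal(value: float, baseline: Tuple[float, float],
                threshold: float = 3.0) -> bool:
    """Flag a reading more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds from a healthy period.
    history = [100, 102, 98, 101, 99, 100, 103, 97]
    baseline = build_baseline(history)
    for reading in (101, 450):
        print(reading, "abnormal" if is_abnormal(reading, baseline) else "normal")
```

A production monitor would also need to account for daily and weekly traffic patterns, which a single mean and deviation cannot capture.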

Once an error is identified, there is a range of domino-effect reactions that need to be preconfigured to respond to the situation (see above on multi-AZ, multi-region, or multi-cloud). Load balancers can be put in place to redirect traffic, and backup systems can kick in if they’ve been set up to do so (see below).
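
A preconfigured reaction of this kind can be sketched in Python as an ordered list of backends that is tried in turn. This is a minimal illustration, not how an AWS load balancer works internally; `call_with_failover`, `failing_primary` and `standby` are hypothetical names, and the simulated `ConnectionError` stands in for whatever failure a real primary would raise.

```python
from typing import Callable, List, Sequence


class AllBackendsFailed(Exception):
    """Raised when every configured backend has failed."""


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful response."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise AllBackendsFailed(errors)


def failing_primary() -> str:
    """Hypothetical primary that is currently suffering an outage."""
    raise ConnectionError("primary region unreachable")


def standby() -> str:
    """Hypothetical backup in another zone, region or provider."""
    return "served by standby"


if __name__ == "__main__":
    print(call_with_failover([failing_primary, standby]))
```

The order of the list encodes the failover policy, so it has to be decided and tested before the outage rather than during it.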

Build redundant systems from the start

Trying to respond to an outage in real time will not be very useful. Preparation before the outage is what will save you when it inevitably comes. There are two basic ways to build redundancy into cloud systems:

  • Standby: When a failure occurs, the application automatically detects it and fails over to a redundant backup system. In this scenario, the backup system can be off, but ready to spin up when an error is detected. Alternatively, the standby backup can run idle in the background the entire time (this costs more but reduces failover time). The downside to these standby approaches is that there could be a lag between when an error is detected and when the failover system kicks in.

  • Active redundancy: To (theoretically) avoid downtime, users can architect their application to have active redundancy. In this scenario, the application is distributed across multiple redundant resources: When one fails, the rest of the resources absorb a larger share of the workload. A sharding technique can be used in which services are broken up into components. Say, for example, an application runs across eight virtual machine instances. Those eight instances can be broken up into four groups of two each, and traffic can be load balanced between them. If one shard goes down, the other three can pick up the traffic.
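
The eight-instance example can be sketched in Python as follows. The instance names, the `build_shards` and `route` helpers, and the simple modulo routing rule are assumptions made for illustration; a real load balancer would use its own health checks and routing algorithm.

```python
from typing import List, Set


def build_shards(instances: List[str], shard_size: int) -> List[List[str]]:
    """Split the instance list into consecutive groups of `shard_size`."""
    return [instances[i:i + shard_size]
            for i in range(0, len(instances), shard_size)]


def route(request_id: int, shard_count: int, healthy: Set[int]) -> int:
    """Pick a shard index for a request, skipping shards marked unhealthy."""
    live = [index for index in range(shard_count) if index in healthy]
    if not live:
        raise RuntimeError("no healthy shards available")
    # Spread requests evenly over whichever shards are still up.
    return live[request_id % len(live)]


if __name__ == "__main__":
    instances = [f"vm-{n}" for n in range(8)]      # eight hypothetical VMs
    shards = build_shards(instances, shard_size=2)  # four groups of two
    healthy = {0, 1, 2, 3}

    healthy.discard(1)  # simulate shard 1 going down
    for request_id in range(6):
        index = route(request_id, len(shards), healthy)
        print(f"request {request_id} -> shard {index} {shards[index]}")
```

With one shard removed, the same requests are divided among the remaining three, which is why each shard needs enough spare capacity to absorb the extra share.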