Amazon Failure Caused by Weak Change Management

Amazon just released its summary of last week’s web services outage. It is an incredibly long explanation that ends with an apology. The bottom line for this outage is contained in the following statement: “The trigger for this event was a network configuration change.”

When I first became an “EDP Auditor” (showing my age), the first project I did was a program change control review. Change control is the most basic of all IT controls. So how does a juggernaut like Amazon allow its service to be crippled by a basic network configuration change? Doesn’t it have redundancy and failover for critical services? Nothing is as simple as it seems these days.

The system administrator who attempted to make the change made a mistake:

“During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”

So, yes, they had redundancy. Unfortunately, the redundancy is what led to the failure. The redundant network couldn’t handle the load, and certain devices could not find their redundant pair for data mirroring. As a result, those devices consumed all the available local resources, which overloaded the management layer. The failure of the management layer then began shutting down entire segments and the application programming interfaces (APIs). The mirroring architecture put everything on hold until the devices could find a new “partner” to mirror with.
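To make the change control point concrete, here is a minimal sketch of the kind of pre-change check that could flag a traffic shift onto a path that is too small for the load. This is not Amazon's tooling. The `Route` class, the `validate_traffic_shift` function, the route names, the 20% headroom, and every capacity figure are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A hypothetical traffic destination with a fixed capacity."""
    name: str
    capacity_gbps: float


def validate_traffic_shift(current_load_gbps: float, target: Route,
                           headroom: float = 0.2) -> bool:
    """Return True only if the target can carry the load with spare headroom."""
    usable = target.capacity_gbps * (1.0 - headroom)
    return current_load_gbps <= usable


# Illustrative numbers only; these are assumptions, not published figures.
primary_peer = Route("primary-router-b", capacity_gbps=100.0)
redundant_net = Route("redundant-ebs-network", capacity_gbps=10.0)
load = 60.0

print(validate_traffic_shift(load, primary_peer))   # the intended target
print(validate_traffic_shift(load, redundant_net))  # the mistaken target
```

With these made-up numbers the check passes for the other primary router and fails for the lower capacity redundant network. A gate like this is only as good as the process that forces it to run before the change, which is the whole point of change control.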

In much the same way that the Fukushima catastrophe was the result of multiple failures of controls and redundancy, so was Amazon’s failure.  Sometimes our systems are so complex that we can’t foresee the true impact of a simple error.  Amazon has taken several steps to prevent this type of failure from happening again, but I suspect we will continue to see unprecedented and unforeseen errors in the cloud as our systems continue to grow in complexity.
