AWS September 26th Disruption

On 26 September 2018, one of AWS’s Availability Zones in the Ireland region (EU-WEST-1) suffered an incident that led to “increased error rates”. At Infinity Works we operate systems for multiple clients in the AWS Ireland region, and during this event our teams observed various symptoms that resulted in some unplanned work. This post details what happened, what we saw, and how we plan to mitigate the impact of a similar incident in the future.

So first, what did AWS publish about this on the AWS status page?

  • 1:06 AM PDT

We are investigating increased error rates for new launches in the EU-WEST-1 Region.

  • 1:37 AM PDT

We can confirm increased API error rates and connectivity issues for some instances in a single Availability Zone in the EU-WEST-1 Region.

  • 2:14 AM PDT

We continue to investigate connectivity issues from some instances to some AWS services in the EU-WEST-1 Region. We have identified the root cause and are taking steps to resolve the issue.

  • 3:03 AM PDT

We have resolved the connectivity issues from EC2 instances to the affected AWS services in the EU-WEST-1 Region. We continue to see elevated error rates for the RunInstances EC2 API, which we are working to resolve. Internet connectivity and connectivity between EC2 instances remain unaffected.

  • 3:28 AM PDT

Starting at 12:15 AM PDT we experienced increased API error rates for the EC2 API, and connectivity issues between EC2 instances and AWS services in the EU-WEST-1 Region. At 2:29 AM PDT, the connectivity issues between EC2 instances and AWS services were resolved. At 2:59 AM PDT, the increased API error rates for the EC2 APIs were fully resolved. Internet connectivity and connectivity between EC2 instances was not affected. The issue has been resolved and the service is operating normally.

What did we see across accounts and clients?

Not all of our accounts and clients were in the affected Availability Zone. So which zone was impacted? Each AWS account gets its own random mapping of zone names to physical zones, so you cannot simply say eu-west-1a was impacted: the zone labelled eu-west-1a in one account may be a different physical zone from eu-west-1a in another. We suspect AWS does this because customers mostly choose the first zone, and usage would not be evenly distributed across all three zones in Ireland if everyone had the same labels.
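
To compare notes across accounts, it helps to translate each account's zone names into something account-independent. The sketch below assumes the EC2 DescribeAvailabilityZones response includes a ZoneId field alongside ZoneName, which is the case for the API today; the helper name zone_name_to_id and the sample pairings are made up for illustration and do not describe which zone was affected.

```python
from typing import Dict


def zone_name_to_id(response: dict) -> Dict[str, str]:
    """Map per-account zone names (eu-west-1a) to zone IDs (euw1-az1).

    `response` is expected to have the shape returned by the EC2
    DescribeAvailabilityZones call. Zones without a ZoneId are skipped.
    """
    return {
        zone["ZoneName"]: zone["ZoneId"]
        for zone in response.get("AvailabilityZones", [])
        if "ZoneId" in zone
    }


# Illustrative data only: these name-to-ID pairings are invented and
# will differ from account to account.
sample_response = {
    "AvailabilityZones": [
        {"ZoneName": "eu-west-1a", "ZoneId": "euw1-az3"},
        {"ZoneName": "eu-west-1b", "ZoneId": "euw1-az1"},
        {"ZoneName": "eu-west-1c", "ZoneId": "euw1-az2"},
    ]
}

mapping = zone_name_to_id(sample_response)

# Against a real account (assumes boto3 is installed and credentials are set):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="eu-west-1")
#   mapping = zone_name_to_id(ec2.describe_availability_zones())
```

Running this in each account gives a table that can be lined up by zone ID, so two teams can tell whether "their" eu-west-1a is the same physical zone.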

In UK time (BST), the disruption was first reported at 9:06am; it started at 8:15am and was resolved at 10:59am. Major IT news sites such as The Verge and The Register were only reporting that Amazon’s Alexa was down, while social IT sites such as Reddit had accounts of more specific issues.
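
For anyone checking the conversion from the status page's PDT timestamps, here is a small sketch. It uses fixed offsets (PDT as UTC-7, BST as UTC+1) rather than a timezone database, which is an assumption that only holds for this date; the function and variable names are ours.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid on 26 September 2018 (both regions on summer time).
PDT = timezone(timedelta(hours=-7), "PDT")
BST = timezone(timedelta(hours=1), "BST")


def pdt_to_bst(hour: int, minute: int) -> str:
    """Convert a PDT wall-clock time on 2018-09-26 to BST as HH:MM."""
    stamp = datetime(2018, 9, 26, hour, minute, tzinfo=PDT)
    return stamp.astimezone(BST).strftime("%H:%M")


# Times taken from the AWS status page updates quoted above.
status_page_times = {
    "started": (0, 15),
    "first reported": (1, 6),
    "resolved": (2, 59),
}

converted = {label: pdt_to_bst(h, m) for label, (h, m) in status_page_times.items()}
# converted == {"started": "08:15", "first reported": "09:06", "resolved": "10:59"}
```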

Some of our teams were quickly aware of the issues, whilst others who were not impacted were unaware there was a disruption. Some accounts had direct contact with their AWS TAM (Technical Account Manager) to understand the impact and help deal with the issues caused. Here is a summary of the symptoms we encountered:

  • Direct Connect - connections down
  • EC2 - Launching instances slow, API calls slow or failing intermittently
  • EC2 - Connectivity between availability zones impacted
  • EC2 - Unable to RDP into instances in impacted Availability Zone
  • SQS - Requests not making it into SQS - could not establish tunnels
  • API Gateway - Requests into ALBs resulted in 5xx errors

Once everything was back up and running, many teams tried to work out how they could have dealt with the issue better. AWS did not publish the root cause of the incident, so the focus was on how to be more aware of similar events rather than how to avoid them altogether.

From this, the following suggestions arose:

  • Make use of the AWS Health API - consider adding CloudWatch Events to integrate it into your current monitoring systems.
  • For Direct Connect or VPN connections into customer data centers, have them monitor the status of the connection from the DC side - as most connections only allow one-way traffic out of the on-prem systems.
  • External monitoring outside of your operating region is important - if you use a third party, pick a region that is not the same as your production services.
  • Ensure that critical systems are deployed across multiple AZs.
  • Test deployment code with AWS API failures in mind.
  • Monitor other nearby AWS Regions, as customers may be moving to your Region during service-affecting disruptions.
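
As an illustration of testing deployment code with API failures in mind, here is a minimal sketch, not taken from any of our client systems. It wraps a call in retries with exponential backoff and jitter, and uses a fake launcher to simulate the intermittent RunInstances failures we saw. TransientApiError, call_with_backoff and FlakyLauncher are hypothetical names, and the instance ID is made up; real code would catch the SDK's own error types instead.

```python
import random
import time


class TransientApiError(Exception):
    """Hypothetical stand-in for a throttling or 5xx error from a cloud API."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call `operation`, retrying on TransientApiError.

    Waits a random time between 0 and an exponentially growing ceiling
    (capped at `max_delay`) between attempts. Re-raises the last error
    once `max_attempts` calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientApiError:
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


class FlakyLauncher:
    """Fake launch call that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TransientApiError("simulated RunInstances failure")
        return "i-0123456789abcdef0"  # made-up instance ID


# Simulate three failed launches followed by a success, recording the
# delays instead of actually sleeping so the check runs instantly.
launcher = FlakyLauncher(failures_before_success=3)
recorded_delays = []
instance_id = call_with_backoff(launcher, sleep=recorded_delays.append)
```

Passing the sleep function in as a parameter keeps the test fast and lets you assert on how many retries happened. The AWS SDKs also ship their own retry settings, so a wrapper like this is most useful around higher-level deployment steps rather than single SDK calls.
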
Written on October 12, 2018 by:
Steven Harper