Amazon outage caused by single failure in AWS network

A software bug in Amazon Web Services' DynamoDB DNS management system triggered a 15-hour outage affecting millions of users worldwide. The failure originated in the US-East-1 region and cascaded to services like Snapchat and Roblox. Amazon engineers identified the root cause as a race condition that left DNS records for the regional endpoint in an inconsistent state.

The outage began in Amazon's US-East-1 region, the company's oldest and most heavily used hub, with a race condition in DynamoDB's DNS management system. That system pairs a DNS Planner, which monitors load balancer health and generates DNS plans, with DNS Enactor components that apply those plans to AWS endpoints. As described by Amazon engineers, one Enactor experienced unusually high delays while applying an older plan, during which the Planner continued to generate newer plans. A second Enactor applied one of those newer plans and then invoked a cleanup process, just as the delayed first Enactor finished applying its stale plan and overwrote the newer one.

"When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them," Amazon explained. This left the system in an inconsistent state, removing all IP addresses for the regional endpoint and preventing further updates, requiring manual intervention to resolve.

The DynamoDB failure disrupted connections for systems relying on the US-East-1 endpoint, affecting both customer traffic and internal AWS services. It strained the EC2 subsystems that launch instances, delaying network state propagation even after DynamoDB was restored. "While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation," engineers noted. The backlog spilled over to the Network Load Balancer service, leading to errors in AWS functions such as creating Redshift clusters, Lambda invocations, Fargate tasks, and operations in Managed Workflows for Apache Airflow and the AWS Support Center.

The incident lasted 15 hours and 32 minutes, with Ookla's DownDetector recording over 17 million reports from 3,500 organizations, primarily in the US, UK, and Germany. Snapchat, AWS, and Roblox drew the most reports, making this one of the largest outages on record. In response, Amazon disabled the DynamoDB DNS Planner and Enactor automation globally while it fixes the race condition and adds safeguards. Ookla highlighted the risks of regional concentration, noting that global apps often route through US-East-1, amplifying impacts and underscoring the need for multi-region designs to contain failures.
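Ookla's point about containment can be illustrated with a minimal sketch, assuming boto3 and a table replicated across two regions. The region list, table name, and fallback logic are illustrative only; production designs would more commonly rely on DynamoDB global tables with health-based routing rather than ad-hoc client fallback.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]   # primary first, then fallback

def get_item_with_fallback(table_name: str, key: dict) -> dict:
    """Try each region in order, falling back when the endpoint is unreachable."""
    last_error: Exception | None = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            return client.get_item(TableName=table_name, Key=key)
        except (EndpointConnectionError, ClientError) as exc:
            last_error = exc           # this region failed; try the next one
    raise last_error

# Example call (assumes a table named "sessions" replicated in both regions):
# item = get_item_with_fallback("sessions", {"id": {"S": "abc123"}})
```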
