Amazon Web Services (AWS) has issued an apology to customers following a widespread outage on October 20 that brought down more than a thousand websites and services globally. The disruption affected major platforms including Snapchat, Reddit, Lloyds Bank, Venmo, and several gaming and payment applications, underscoring the heavy dependence of the modern internet on a few dominant cloud providers. The outage originated in AWS’s Northern Virginia region (US-EAST-1), which powers a significant portion of global online infrastructure.
According to Amazon’s official statement, the outage stemmed from internal errors that prevented systems from properly linking domain names to the IP addresses required to locate them. This technical fault caused a cascade of connectivity failures across multiple services. “We apologize for the impact this event caused our customers,” AWS said. “We know how critical our services are to our customers, their applications, and their businesses. We are committed to learning from this and improving our availability.”
While some platforms like Fortnite and Roblox recovered within a few hours, others faced extended downtime. Lloyds Bank customers, for instance, reported continued access issues well into the afternoon. Similarly, services like Reddit and Venmo were affected for longer durations. The outage even extended to connected devices such as Eight Sleep’s smart mattresses, which rely on internet access to adjust temperature and elevation.
Eight Sleep said it would work to make its systems more resilient after some users reported overheating or malfunctioning devices during the outage.
AWS’s detailed incident summary attributed the issue to a “latent race condition” in the systems managing the Domain Name System (DNS) records in the affected region. Essentially, one of the automated processes responsible for maintaining synchronization between critical database systems malfunctioned, triggering a chain reaction that disrupted multiple dependent services. Because many of AWS’s internal processes are automated, the problem propagated without human intervention until it was detected and mitigated.
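To make the idea of a latent race condition concrete, the following is a minimal, deterministic sketch of a check-then-act race between two automated workers updating the same DNS record. It is an illustration only and not AWS's actual system: the RecordStore class, the plan versions, the IP addresses, and the name service.example.internal are all hypothetical. A slow worker validates an old plan, stalls, and then overwrites a newer plan written in the meantime; re-checking the version at write time avoids the stale overwrite.

```python
from dataclasses import dataclass, field


@dataclass
class RecordStore:
    """Hypothetical stand-in for a DNS record table: name -> (plan version, IPs)."""
    records: dict = field(default_factory=dict)

    def current_version(self, name):
        # Version 0 means no plan has been applied yet.
        return self.records.get(name, (0, []))[0]

    def write(self, name, version, ips):
        # Unconditional write: trusts that the caller's earlier check still holds.
        self.records[name] = (version, ips)

    def write_if_newer(self, name, version, ips):
        # Guarded write: re-checks the version at the moment of writing.
        if version > self.current_version(name):
            self.records[name] = (version, ips)
            return True
        return False


def simulate(guarded):
    """Replay the interleaving step by step; returns the final record."""
    store = RecordStore()
    name = "service.example.internal"  # hypothetical record name

    # Step 1: the slow worker checks that plan 1 is newer than what is stored,
    # then stalls before writing.
    slow_check_passed = 1 > store.current_version(name)

    # Step 2: a fast worker applies the newer plan 2 in the meantime.
    store.write_if_newer(name, 2, ["10.0.0.2"])

    # Step 3: the slow worker resumes and acts on its now-stale check.
    if slow_check_passed:
        if guarded:
            store.write_if_newer(name, 1, ["10.0.0.1"])
        else:
            store.write(name, 1, ["10.0.0.1"])

    return store.records[name]


if __name__ == "__main__":
    print("unguarded:", simulate(guarded=False))
    print("guarded:  ", simulate(guarded=True))
```

In the unguarded run the record ends up pointing at the outdated plan even though every individual step looked valid when it was checked, which is why such bugs can stay hidden until an unusual delay exposes them.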
Dr. Junade Ali, a software engineer and fellow at the Institution of Engineering and Technology, explained that “faulty automation” was central to the failure. He noted that the internal “address book” system in the region broke down, preventing key infrastructure components from locating each other. “This incident demonstrates how businesses relying on a single cloud provider remain vulnerable to regional failures,” Dr. Ali added, emphasizing the importance of diversifying cloud service providers to improve resilience.
The event once again highlights the concentration of digital infrastructure within a few dominant providers, primarily AWS and Microsoft Azure. Experts warn that such dependency increases systemic risk, as disruptions in one region can have global ripple effects. Amazon has stated that it will take measures to strengthen fault detection, introduce greater redundancy, and enhance the reliability of automated processes in its network.
As the world grows increasingly reliant on cloud computing, the AWS outage serves as a critical reminder of the fragility of internet infrastructure and the urgent need for redundancy and diversification.
