Cloudflare Outage Traced to Internal File Error After Initial Fears of Massive DDoS Attack

Cloudflare experienced a major disruption yesterday that knocked numerous websites and online services offline. At first, the company suspected it was under a massive “hyper-scale” DDoS attack.

“I worry this is the big botnet flexing,” Cloudflare co-founder and CEO Matthew Prince wrote in an internal chat, referring to concerns that the Aisuru botnet might be responsible. However, the team later confirmed that the issue originated from within Cloudflare’s own infrastructure: a critical configuration file unexpectedly grew in size and spread across the network.

This oversized file caused failures in software responsible for reading the data used by Cloudflare’s bot management system, which relies on machine learning to detect harmful traffic. As a result, Cloudflare’s core CDN, security tools, and other services were impacted.

“After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file,” Prince explained in a post-mortem.

According to Prince, the issue began when changes to database permissions caused the system to generate duplicate entries inside a “feature file” used by the company’s bot detection model. The file then doubled in size and automatically replicated across Cloudflare’s global network.
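One way a permissions change can produce duplicate entries is sketched below. It rests on an assumption not stated in the summary above: that the metadata query selects column names by table without filtering by database, so each name comes back once per database the account can see. The database, table, and feature names are hypothetical.

```python
# Hypothetical metadata rows: (database, table, column_name).
# All names are illustrative, not Cloudflare's real schema.
METADATA = [
    ("primary", "features", "ua_entropy"),
    ("primary", "features", "req_rate"),
    ("replica", "features", "ua_entropy"),
    ("replica", "features", "req_rate"),
]


def feature_names(rows, visible_databases, table="features"):
    """Return column names for `table` with no filter on the database name.

    This mirrors a query that silently relies on only one database
    being visible to the querying account.
    """
    return [col for db, tbl, col in rows if tbl == table and db in visible_databases]


def feature_names_deduplicated(rows, visible_databases, table="features"):
    """Same lookup, but each name is kept once, in first-seen order."""
    return list(dict.fromkeys(feature_names(rows, visible_databases, table)))


before = feature_names(METADATA, {"primary"})            # old permissions
after = feature_names(METADATA, {"primary", "replica"})  # new permissions
print(len(before), len(after))  # 2 4
```

In this toy setup, widening what the account can see doubles the row count without any change to the query itself, which matches the doubling described in the article.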

Machines that route traffic through Cloudflare read this file to keep the bot management system updated. But the software had a strict size limit for this configuration file, and the bloated version exceeded that threshold, causing widespread failures. Once the old version was restored, traffic began returning to normal — though it took another 2.5 hours to stabilize the network after the sudden surge in requests.

Prince apologized for the downtime, noting the heavy dependence many online platforms have on Cloudflare. “On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today,” he wrote, adding that outages are especially serious due to “Cloudflare’s importance in the Internet ecosystem.”

Cloudflare’s bot management system assigns bot scores using machine learning, helping customers filter legitimate traffic from malicious requests. The configuration file powering this system is updated every five minutes to adapt quickly to changing bot behaviors.

The faulty file was generated by a query on a ClickHouse database cluster. After new permissions were added, the query began returning additional metadata—duplicating columns and producing more rows than expected. Because the system caps features at 200, the oversized file triggered a panic state once deployed across Cloudflare’s servers.
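The article says the oversized file triggered a panic once it hit the 200-feature cap. As a contrast, here is a minimal sketch of a loader that enforces the same cap but keeps the last known good configuration when an update is rejected. It is not Cloudflare's implementation; the class and function names, the one-feature-per-line format, and the count of 120 features are all assumptions made for illustration.

```python
MAX_FEATURES = 200  # the cap stated in the article


class FeatureFileTooLarge(Exception):
    """Raised when a feature file exceeds the configured limit."""


def parse_feature_file(lines, max_features=MAX_FEATURES):
    """Parse one feature name per line (assumed format) and enforce the cap."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    return features


class BotConfig:
    """Holds the active feature list and keeps the last good one on failure."""

    def __init__(self, initial_features):
        self.features = list(initial_features)
        self.rejected_updates = 0

    def try_update(self, lines):
        """Apply a new file if it is valid; otherwise keep the current one."""
        try:
            self.features = parse_feature_file(lines)
            return True
        except FeatureFileTooLarge:
            self.rejected_updates += 1
            return False


good = [f"feature_{i}" for i in range(120)]  # illustrative count
bloated = good + good                        # duplicated entries double the file
config = BotConfig(good)
print(config.try_update(bloated), len(config.features))  # False 120
```

The design choice here is to treat a bad configuration push as a rejected update rather than a fatal error, so traffic keeps flowing on slightly stale data.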

The result was a dramatic surge in 5xx server errors. The pattern appeared irregular at first because only some database nodes were generating the bad file. Every five minutes, the system could push either a correct or incorrect version depending on which node handled the query, creating cyclical failures that initially resembled a distributed attack.
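The on-and-off pattern can be reproduced with a small simulation. This is a sketch under simplifying assumptions that are not in the article: four database nodes, queries assigned round-robin (one per five-minute cycle), and nodes receiving the new permissions in index order.

```python
def simulate_rollout(updated_per_cycle, total_nodes):
    """Return 'good' or 'bad' for each five-minute cycle.

    updated_per_cycle[i] is how many nodes have the new permissions at
    cycle i. The query for cycle i is assumed to land on node
    (i mod total_nodes), and nodes 0..updated-1 are the updated ones.
    """
    outcomes = []
    for cycle, updated in enumerate(updated_per_cycle):
        node = cycle % total_nodes
        outcomes.append("bad" if node < updated else "good")
    return outcomes


# One node updated at first, then two, then all four.
rollout = [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4]
outcomes = simulate_rollout(rollout, total_nodes=4)
print("".join("B" if o == "bad" else "G" for o in outcomes))  # BGGGBBGGBBBB
```

The output alternates between good and bad files while the rollout is partial and settles into a constant failure once every node is updated, which is the shape the article describes.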

Eventually, all ClickHouse nodes began producing the faulty file consistently. Cloudflare resolved the issue by stopping the distribution of the corrupted file, manually injecting a stable version, and restarting its core proxy services. The network returned to normal later that day.

Prince called this Cloudflare’s most significant outage since 2019. To prevent similar incidents, the company plans to strengthen safeguards around internal configuration files, introduce more global kill switches, prevent system overloads caused by error logs, and review failure points across core components.

While Prince emphasized that no system can be guaranteed immune to outages, he noted that past failures have led Cloudflare to build more resilient systems each time.

Cloudflare Explains Major Service Outage: Not a Security Breach, No Data Lost

Cloudflare has clarified that a widespread outage affecting its global services was not the result of a cyberattack or data breach. The company confirmed that no customer data was compromised during the disruption, which significantly impacted numerous platforms, including major edge computing services and some Google Cloud infrastructure.

The issue began at approximately 17:52 UTC and was primarily caused by a complete failure of Workers KV, Cloudflare’s globally distributed key-value storage system. As a backbone for its serverless computing platform, Workers KV plays a crucial role in supporting configuration, identity management, and content delivery across many of Cloudflare’s offerings. When it went offline, critical functions across the ecosystem were immediately affected. 

In a post-incident analysis, Cloudflare revealed that the root cause was a malfunction in the storage infrastructure that underpins Workers KV. This backend is partially hosted by a third-party cloud service, which experienced its own outage—directly leading to the failure of the KV system. The ripple effects were far-reaching, disrupting Cloudflare services for nearly two and a half hours. 

Key services impacted included authentication platforms like Access and Gateway, which saw major breakdowns in login systems, session handling, and policy enforcement. Cloudflare’s WARP service was unable to register new devices, while Gateway experienced failures in DNS-over-HTTPS queries. CAPTCHA and login tools such as Turnstile and Challenges also malfunctioned, with a temporary kill switch introducing token reuse risks.

Media services like Stream and Images were hit particularly hard, with all live streaming and media uploads failing during the incident. Other offerings such as Workers AI, Pages, and the AutoRAG AI system were rendered entirely unavailable. Even backend systems like Durable Objects, D1 databases, and Queues registered elevated error rates or became completely unresponsive.

Cloudflare’s response plan now includes a significant architectural shift. The company will begin migrating Workers KV from its current third-party dependency to its in-house R2 object storage solution. This move is designed to reduce reliance on external providers and improve the overall resilience of Cloudflare’s services. 

In addition, Cloudflare will implement a series of safeguards to mitigate cascading failures in future outages. This includes new cross-service protections and controlled service restoration tools that will help stabilize systems more gradually and prevent sudden traffic overloads. 
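Gradual restoration of this kind is often done by admitting only a growing fraction of traffic after a service comes back. The sketch below shows one simple version, a linear ramp. The article does not describe Cloudflare's actual tooling, so the function names, the 10% starting floor, and the 600-second ramp are all assumptions.

```python
def allowed_fraction(elapsed_s, ramp_s, floor=0.1):
    """Fraction of traffic to admit `elapsed_s` seconds after restoration starts.

    Rises linearly from `floor` to 1.0 over `ramp_s` seconds.
    """
    if ramp_s <= 0 or elapsed_s >= ramp_s:
        return 1.0
    if elapsed_s <= 0:
        return floor
    return floor + (1.0 - floor) * (elapsed_s / ramp_s)


def admit(request_id, fraction):
    """Admit a stable subset of requests, chosen by ID modulo 100."""
    return (request_id % 100) < round(fraction * 100)


for t in (0, 150, 300, 600):
    frac = allowed_fraction(t, ramp_s=600)
    admitted = sum(admit(i, frac) for i in range(1000))
    print(f"t={t}s fraction={frac:.2f} admitted={admitted}/1000")
```

Selecting requests by a stable ID rather than at random means the same clients stay admitted as the ramp widens, which avoids flapping for individual users.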

While the outage was severe, Cloudflare’s transparency and swift action to redesign its infrastructure aim to minimize similar disruptions in the future and reinforce trust in its platform.

Cloudflare Outage Disrupts Website Access in Multiple Regions, Affecting Global Users

A widespread Cloudflare outage is affecting access to websites globally, including BleepingComputer. While some regions can still access these sites, others are experiencing disruptions.

Cloudflare has listed ongoing scheduled maintenance in Singapore and Nashville, but its status page shows no indication of any issues. Despite this, many users around the world are encountering error messages when trying to visit websites that use Cloudflare, with browsers unable to connect to the servers.

BleepingComputer is among the affected websites, with users facing intermittent access problems. However, its monitoring tools indicate the site is still receiving traffic, suggesting the outage is limited to specific regions. For instance, its staff are not able to access the site from the U.S., but some staff members in other countries can.

Downdetector recorded a surge in complaints about Cloudflare starting at approximately 1:45 PM ET, which aligns with when BleepingComputer began experiencing connectivity issues.

Reports on X (formerly Twitter) also indicate that some websites are unreachable over IPv4, though still accessible via IPv6. NodeJS.org has reported similar issues, stating the outage is preventing access to their website and hindering the ability to download Node.js.

BleepingComputer has reached out to Cloudflare for further information but has yet to receive a response.