Cloudflare experienced a major disruption yesterday that knocked numerous websites and online services offline. At first, the company suspected it was under a massive “hyper-scale” DDoS attack.
“I worry this is the big botnet flexing,” Cloudflare co-founder and CEO Matthew Prince wrote in an internal chat, referring to concerns that the Aisuru botnet might be responsible. However, the team later confirmed that the issue originated from within Cloudflare’s own infrastructure: a critical configuration file unexpectedly grew in size and spread across the network.
This oversized file caused failures in the software that reads the data feeding Cloudflare's bot management system, which relies on machine learning to detect harmful traffic. As a result, Cloudflare's core CDN, security tools, and other services were impacted.
“After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file,” Prince explained in a post-mortem.
According to Prince, the issue began when changes to database permissions caused the system to generate duplicate entries inside a “feature file” used by the company’s bot detection model. The file then doubled in size and automatically replicated across Cloudflare’s global network.
Machines that route traffic through Cloudflare read this file to keep the bot management system updated. But the software had a strict size limit for this configuration file, and the bloated version exceeded that threshold, causing widespread failures. Once the old version was restored, traffic began returning to normal — though it took another 2.5 hours to stabilize the network after the sudden surge in requests.
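The failure mode can be sketched as a loader that enforces a hard cap on the configuration it will accept. This is an illustrative model, not Cloudflare's actual code; the function names and the exact limit-checking behavior here are assumptions:

```python
FEATURE_LIMIT = 200  # hard cap in the consuming software (illustrative value from the post-mortem)

def load_feature_file(lines):
    """Parse a feature file; fail hard if it exceeds the preallocated limit."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > FEATURE_LIMIT:
        # In the outage this surfaced as a crash rather than a graceful
        # fallback, taking traffic-serving machines down with it.
        raise RuntimeError(
            f"feature file has {len(features)} entries, limit is {FEATURE_LIMIT}"
        )
    return features

# A normal-sized file loads fine...
ok = load_feature_file([f"feature_{i}" for i in range(150)])

# ...but a file whose entries were duplicated blows past the cap.
duplicated = [f"feature_{i}" for i in range(150)] * 2
try:
    load_feature_file(duplicated)
except RuntimeError as exc:
    print("refused:", exc)
```

The design question the incident raises is what a consumer should do when internally generated configuration exceeds its limits: crash, or keep serving with the last known-good file.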
Prince apologized for the downtime, noting the heavy dependence many online platforms have on Cloudflare. “On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today,” he wrote, adding that outages are especially serious due to “Cloudflare’s importance in the Internet ecosystem.”
Cloudflare’s bot management system assigns bot scores using machine learning, helping customers filter legitimate traffic from malicious requests. The configuration file powering this system is updated every five minutes to adapt quickly to changing bot behaviors.
The faulty file was generated by a query on a ClickHouse database cluster. After new permissions were added, the query began returning additional metadata, duplicating columns and producing more rows than expected. The bot management software caps the number of features at 200; once the oversized file exceeded that limit and was deployed across Cloudflare's servers, it triggered a panic.
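A minimal sketch of how widened permissions can silently double a metadata query's result set, assuming the query filtered on table name but not on database (the table, column, and database names here are hypothetical):

```python
# Each catalog entry is (database, table, column). The feature-file query
# filters only on the table name, so every database visible to the account
# contributes its own copy of each column.
catalog = [
    ("default", "http_features", "col_a"),
    ("default", "http_features", "col_b"),
]

def feature_columns(catalog):
    # Missing database filter: one row per (database, column) pair.
    return [col for (_db, table, col) in catalog if table == "http_features"]

before = feature_columns(catalog)  # 2 rows, as expected

# After the permissions change, an underlying replica database becomes
# visible to the same account, and the query silently returns duplicates.
catalog += [
    ("replica", "http_features", "col_a"),
    ("replica", "http_features", "col_b"),
]
after = feature_columns(catalog)  # 4 rows: the generated file doubles in size
```

The general fix is to scope metadata queries explicitly (filter on the database as well as the table) so that access changes cannot alter what the query returns.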
The result was a dramatic surge in 5xx server errors. The pattern appeared irregular at first because only some database nodes were generating the bad file. Every five minutes, the system could push either a correct or incorrect version depending on which node handled the query, creating cyclical failures that initially resembled a distributed attack.
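The flapping described above can be modeled with a toy refresh loop where only some database nodes emit the oversized file; which file gets pushed depends on which node handles that cycle's query (node names and the round-robin selection are illustrative assumptions):

```python
# Toy model: every five-minute refresh, the query lands on one of several
# database nodes; only the updated nodes generate the bad file.
nodes = ["node-1", "node-2", "node-3", "node-4"]
updated = {"node-2", "node-4"}  # nodes already producing the oversized file

def refresh(cycle):
    node = nodes[cycle % len(nodes)]  # which node handles this refresh
    return "BAD" if node in updated else "OK"

history = [refresh(c) for c in range(8)]
# Alternating good and bad pushes make the network recover and fail in
# waves, which is why the symptoms initially resembled an external attack.
print(history)  # ['OK', 'BAD', 'OK', 'BAD', 'OK', 'BAD', 'OK', 'BAD']
```

Once every node was updated and produced the bad file, the failures became constant, which is what finally pointed to an internal cause.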
Eventually, all ClickHouse nodes began producing the faulty file consistently. Cloudflare resolved the issue by stopping the distribution of the corrupted file, manually injecting a stable version, and restarting its core proxy services. The network returned to normal later that day.
Prince called this Cloudflare’s most significant outage since 2019. To prevent similar incidents, the company plans to strengthen safeguards around internal configuration files, introduce more global kill switches, prevent system overloads caused by error logs, and review failure points across core components.
While Prince emphasized that no system can be guaranteed immune to outages, he noted that past failures have led Cloudflare to build more resilient systems each time.
