Cloudflare outage caused by oversized bot management file

A major outage at Cloudflare on November 18, 2025, disrupted numerous websites and services and was initially mistaken for a massive DDoS attack. The issue stemmed from an internal database change that doubled the size of a critical feature file used in the company's bot management system. Cloudflare resolved the problem by reverting to a previous version of the file, though full recovery took additional time because of surging traffic.

Cloudflare's outage began when a change to database permissions in its ClickHouse cluster caused a query to write duplicate entries into a 'feature file' essential to the bot management system. This file, which describes the traits a machine learning model uses to score bots and decide whether requests may reach customer sites, unexpectedly doubled in size. The software that routes traffic across Cloudflare's network enforces a limit of 200 features, and the bloated file exceeded it, causing failures in the core CDN, security services, and other components.
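The failure mode described above can be sketched in a few lines of Python. This is an illustration only, not Cloudflare's code: the 200-feature limit comes from the article, while the function names, the starting count of 120 features, and the idea that each feature becomes visible through a second source (`view_a`, `view_b`) are assumptions made for the example.

```python
FEATURE_LIMIT = 200  # limit reported in the article; everything else is hypothetical


class FeatureLimitExceeded(Exception):
    """Raised when a feature file carries more entries than the consumer allows."""


def build_feature_file(rows):
    # Hypothetical generator: copies every row the query returns,
    # with no deduplication, so duplicate rows become duplicate features.
    return [name for name, _source in rows]


def load_features(features, limit=FEATURE_LIMIT):
    # Hypothetical consumer: rejects any file larger than the fixed limit.
    if len(features) > limit:
        raise FeatureLimitExceeded(
            f"{len(features)} features exceeds limit of {limit}"
        )
    return list(features)


# Before the change: one row per feature (120 is an illustrative count).
rows_before = [(f"feature_{i}", "view_a") for i in range(120)]

# After the change: every feature is returned a second time via another source.
rows_after = rows_before + [(name, "view_b") for name, _source in rows_before]

file_before = build_feature_file(rows_before)  # under the limit
file_after = build_feature_file(rows_after)    # doubled, over the limit
```

Calling `load_features(file_before)` succeeds, while `load_features(file_after)` raises `FeatureLimitExceeded`. Nothing in the generator looks wrong in isolation; the trouble comes from an unchecked assumption about how many rows the query can return.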

CEO Matthew Prince initially suspected a hyper-scale DDoS attack from the Aisuru botnet, writing in an internal chat: “I worry this is the big botnet flexing.” The investigation instead showed that the problem was self-inflicted. The file is regenerated every five minutes and propagated network-wide so the system can keep up with evolving bot threats. Because the database cluster was being updated gradually, bad files were generated only intermittently, producing fluctuating 5xx errors that resembled an attack pattern.
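A toy simulation shows why a gradual rollout produces an on-and-off error pattern. It assumes, purely for illustration, that each generation cycle runs its query on one node chosen round-robin and that only already-updated nodes yield a bad file; the article does not describe how nodes are actually selected, and `simulate_cycles` is a hypothetical helper.

```python
def simulate_cycles(updated_nodes, total_nodes, cycles):
    """Return 'good' or 'bad' for each file-generation cycle.

    Assumption for illustration: cycle k runs on node k % total_nodes,
    and a node produces a bad file only if it has received the update.
    """
    results = []
    for cycle in range(cycles):
        node = cycle % total_nodes
        results.append("bad" if node in updated_nodes else "good")
    return results


# Early in the rollout: one of three nodes updated, so output alternates.
early = simulate_cycles(updated_nodes={0}, total_nodes=3, cycles=6)

# Rollout complete: every node updated, so every file is bad.
late = simulate_cycles(updated_nodes={0, 1, 2}, total_nodes=3, cycles=6)
```

Under these assumptions `early` alternates between bad and good files, while `late` is bad on every cycle. That matches the article's account of errors that fluctuated at first, which made the incident look like an external attack.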

Cloudflare halted propagation of the faulty file, replaced it with an earlier good version, and restarted its core proxy. This restored most traffic, but it took another two and a half hours to absorb the surge in load as services came back online. Prince described it as the worst outage since 2019, apologizing: “On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today.”

The company confirmed that no attack was involved and attributed the outage solely to the internal system error. To prevent a recurrence, Cloudflare plans to harden configuration ingestion, add global kill switches, stop error reports from overwhelming system resources, and review failure modes across its proxy modules. Prince noted that past outages have driven resilience improvements, though he could not guarantee that an outage of this scale will never happen again.
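Two of the planned measures, hardened configuration ingestion and a kill switch, could look roughly like the sketch below. It is a minimal example under stated assumptions, not Cloudflare's design: the class `FeatureConfigStore`, its fields, and its validation rules are hypothetical, and only the 200-feature limit is taken from the article.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureConfigStore:
    """Hypothetical holder for the active feature list."""

    limit: int = 200           # limit reported in the article
    kill_switch: bool = False  # when True, all updates are refused
    active: list = field(default_factory=list)

    def _is_valid(self, candidate):
        # Treat internally generated files like untrusted input:
        # non-empty, within the limit, and free of duplicates.
        return (
            0 < len(candidate) <= self.limit
            and len(set(candidate)) == len(candidate)
        )

    def ingest(self, candidate):
        """Adopt the candidate if allowed and valid; otherwise keep the
        last known good list. Returns True only when adopted."""
        if self.kill_switch or not self._is_valid(candidate):
            return False
        self.active = list(candidate)
        return True


store = FeatureConfigStore()
good = [f"feature_{i}" for i in range(120)]
store.ingest(good)             # adopted
rejected = store.ingest(good + good)  # 240 entries: rejected, old list kept
```

The key design choice is that a rejected file leaves `store.active` untouched, so a bad update degrades to stale data instead of a crash. Setting `store.kill_switch = True` blocks even valid updates, which gives operators a way to freeze propagation while they investigate.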
