Incidents: the exceptional as routine

In yesterday’s post, I looked at Cloudflare’s public incident data to see whether the time-to-resolve was under statistical control. Today I want to look at just the raw counts.

Here’s a graph that shows a count of incidents reported per day, color-coded by impact.

Cloudflare is reporting just under two incidents per day for the time period I looked at (2025-01-01 to 2025-11-27), for minor, major, and critical incidents that are not billing-related.
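For readers who want to reproduce a count like this, here is a minimal sketch of the tallying step. The record shape (day, impact, billing flag) and the sample rows are assumptions made up for illustration; they are not Cloudflare's data or its status page format.

```python
from collections import Counter
from datetime import date

# Hypothetical record shape: (day, impact, is_billing_related).
# These rows are made-up placeholders, not real Cloudflare data.
SAMPLE = [
    (date(2025, 1, 1), "minor", False),
    (date(2025, 1, 1), "major", False),
    (date(2025, 1, 1), "minor", True),   # billing-related, so excluded
    (date(2025, 1, 2), "critical", False),
    (date(2025, 1, 2), "none", False),   # not minor/major/critical, so excluded
]

COUNTED_IMPACTS = {"minor", "major", "critical"}


def count_per_day(records):
    """Count incidents per (day, impact), skipping billing-related ones."""
    counts = Counter()
    for day, impact, is_billing in records:
        if impact in COUNTED_IMPACTS and not is_billing:
            counts[(day, impact)] += 1
    return counts


def mean_per_day(counts, start, end):
    """Average incidents per calendar day over an inclusive date range."""
    days = (end - start).days + 1
    return sum(counts.values()) / days


counts = count_per_day(SAMPLE)
print(mean_per_day(counts, date(2025, 1, 1), date(2025, 1, 2)))
```

Dividing by calendar days in the window, rather than by days that had at least one incident, keeps quiet days in the average.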

I spot checked the data to verify I wasn’t making any obvious mistakes. For example, there were, indeed, eight reported incidents on November 12, 2025:

Network Performance Issues in Madrid, Spain
Issues with Cloudflare Images Plans
Network performance issues in Singapore
Cloudflare Page Shield Issues
Network Connectivity Issues in Chicago, US
Issues with Zero Trust DNS-over-TLS
WARP connectivity in South America
Cloudflare Dashboard and Cloudflare API service issues

(Now, you might be wondering: are these all “distinct” incidents, or are they related? I can’t tell from the information provided on the Cloudflare status pages. The question itself also illustrates the folly of counting incidents. A discrete incident is not a well-defined thing: you might want to call something “one incident” for one purpose but “multiple incidents” for another.)

Two incidents per day sounds like a lot, doesn’t it? Contrast this with AWS, which reports significantly fewer incidents than Cloudflare, despite offering a broader array of services. You can see on the AWS service health page (click on “List of events” and set “Locales” to “all locales”, or look at the Google sheet I copy-pasted this data into) that they reported only 36 events in that same time period, an average of about one incident every nine days.
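The “one incident every nine days” figure is simple arithmetic on the numbers above, and can be checked with a short sketch. It assumes the date range is inclusive of both endpoints; the variable names are just for illustration.

```python
from datetime import date

# Reporting window from the post, assumed inclusive of both endpoints.
START = date(2025, 1, 1)
END = date(2025, 11, 27)

# Number of AWS events reported in that window, per the post.
AWS_EVENTS = 36

days_in_window = (END - START).days + 1
days_per_aws_event = days_in_window / AWS_EVENTS

print(f"{days_in_window} days, one AWS event every {days_per_aws_event:.1f} days")
```

By the same arithmetic, “just under two incidents per day” for Cloudflare means somewhat fewer than twice that many days' worth of reported incidents over the window.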

(AWS doesn’t classify impact, so I marked the Oct 20 incident as critical and the others as minor, to make the visualization consistent with the Cloudflare graph.)

But don’t let the difference in reported incidents fool you into thinking that Cloudflare deals with many more incidents than AWS does. Instead, what’s almost certainly going on is that Cloudflare is more open about reporting incidents than AWS is. I am convinced that Cloudflare’s incident reporting is much closer to reality than AWS’s. In fact, if you walked into any large tech company on any day of the week, I have high confidence that someone would be working on resolving an ongoing incident.

Incidents are always exceptional, by definition: they are events with negative impact that we did not expect to happen. But the thing is, they’re also normal, in the sense that they happen all the time. Now, most of these incidents are minor, which is why you aren’t constantly reading about them in the press: it’s only the large-scale conflagrations that you’ll hear about. But there are always small fires burning, along with engineers who are in the process of fighting them. This is the ongoing reliability work that is all too often invisible.

Published on November 28, 2025 15:33