Brief thoughts on the recent Cloudflare outage
I was at QCon SF during the recent Cloudflare outage (I was hosting the Stories Behind the Incidents track), so I hadn’t had a real chance to sit down and do a proper read-through of their public writeup and capture my thoughts until now. As always, I recommend you read through the writeup first before you read my take.
All quotes are from the writeup unless indicated otherwise.
Hello saturation my old friend
The software had a limit on the size of the feature file, and that limit was smaller than the file’s doubled size. That caused the software to fail.
One thing I hope readers take away from this blog post is the complex systems failure mode pattern that resilience engineering researchers call saturation. Every complex system out there has limits, no matter how robust that system is. And the systems we deal with have many, many different kinds of limits, some of which you might only learn about once you’ve breached that limit. How well a system is able to perform as it approaches one of its limits is what resilience engineering is all about.
Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.
In this particular case, the limit was set explicitly.
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
As sparse as the panic message is, it does explicitly tell you that the problematic call site was an unwrap call. And this is one of the reasons I’m a fan of explicit limits over implicit limits: you tend to get better error messages than when breaching an implicit limit (e.g., of your language runtime, the operating system, the hardware).
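To make the contrast concrete, here is a minimal Rust sketch. It is my own illustration, not Cloudflare’s code: the 200-feature limit is the value quoted above, but the types, function names, and file format are invented. It shows the difference between bailing out with unwrap() and surfacing a descriptive error when the explicit limit is breached.

use std::fmt;

/// Illustrative error type for a breached feature-count limit.
#[derive(Debug)]
struct TooManyFeatures {
    got: usize,
    limit: usize,
}

impl fmt::Display for TooManyFeatures {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "feature file contains {} features, which exceeds the limit of {}",
            self.got, self.limit
        )
    }
}

// Hypothetical constant; the value 200 is the limit mentioned in the writeup.
const FEATURE_LIMIT: usize = 200;

/// Parse a feature file (one feature name per line), enforcing the explicit limit.
fn load_features(contents: &str) -> Result<Vec<String>, TooManyFeatures> {
    let features: Vec<String> = contents.lines().map(str::to_owned).collect();
    if features.len() > FEATURE_LIMIT {
        return Err(TooManyFeatures {
            got: features.len(),
            limit: FEATURE_LIMIT,
        });
    }
    Ok(features)
}

fn main() {
    // Simulate a "doubled" feature file that blows past the limit.
    let doubled: String = (0..400).map(|i| format!("feature_{i}\n")).collect();

    // Option 1: unwrap() would panic with the generic message quoted above,
    // "called `Result::unwrap()` on an `Err` value: ...", taking the thread down.
    // let features = load_features(&doubled).unwrap();

    // Option 2: handle the Err and emit something a responder can act on directly.
    match load_features(&doubled) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("refusing to load feature file: {e}"),
    }
}

The second branch is the kind of message you want in the logs during an incident: it names the limit, the observed value, and the decision the code made, rather than a bare panic from somewhere inside a worker thread.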
A subsystem designed to protect surprisingly inflicts harm
Identify and mitigate automated traffic to protect your domain from bad bots. – Cloudflare Docs
The problematic behavior was in the Cloudflare Bot Management system. Specifically, it was in the bot scoring functionality, which estimates the likelihood that a request came from a bot rather than a human.
This is a system that is designed to help protect their customer from malicious bots, and yet it ended up hurting their customers in this case rather than helping them.
As I’ve mentioned previously, once your system achieves a certain level of reliability, it’s the protective subsystems that end up being the things that bite you! These subsystems are a net positive: they help much more than they hurt. But they also add complexity, and complexity introduces new, confusing failure modes into the system.
The Cloudflare case is a more interesting one than the typical instances of this behavior I’ve seen, because Cloudflare’s whole business model is to offer different kinds of protection, as products for their customers. It’s protection-as-a-service, not an internal system for self-protection. But even though their customers are purchasing this from a vendor rather than building it in-house, it’s still an auxiliary system intended to improve reliability and security.
Confusion in the moment
What impressed me the most about this writeup is that they documented some aspects of what it was like responding to this incident: what they were seeing, and how they tried to make sense of it.
In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:
Man, if I had a nickel for every time I saw someone Slack “Is it DDOS?” in response to a surprising surge of errors returned by the system, I could probably retire at this point.
The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file. What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.
We humans are excellent at recognizing patterns based on our experience, and that generally serves us well during incidents. Someone who is really good at operations can frequently diagnose the problem very quickly just by, say, the shape of a particular graph on a dashboard, or by seeing a specific symptom and recalling similar failures that happened recently.
However, sometimes we encounter a failure mode that we haven’t seen before, which means that we don’t recognize the signals. Or we might have seen a cluster of problems recently that followed a certain pattern, and assume that the latest one looks like the last one. And these are the hard ones.
This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network. Initially, this led us to believe this might be caused by an attack.
This incident was one of those hard ones: the symptoms were confusing. The “problem went away, then came back, then went away again, then came back again” type of unstable incident behavior is generally much harder to diagnose than one where the symptoms are stable.
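To get a feel for why that kind of oscillation is so disorienting, here is a purely hypothetical sketch of the dynamic. The names, the numbers, and the every-third-cycle rhythm are all invented; the only detail taken from the writeup is that sometimes-good, sometimes-bad files were being generated and pushed out.

// Hypothetical simulation of the "sometimes good, sometimes bad" dynamic described
// in the writeup. Nothing here reflects Cloudflare's actual pipeline; the numbers
// and the regeneration rhythm are invented for illustration.

const FEATURE_LIMIT: usize = 200;

// Pretend generator: on some cycles the file comes from a source that emits roughly
// twice as many entries as expected, pushing it over the limit.
fn generated_feature_count(cycle: u32) -> usize {
    if cycle % 3 == 0 {
        120 // a "good" file, comfortably under the limit
    } else {
        240 // a "bad" doubled file
    }
}

fn main() {
    for cycle in 0..9 {
        let count = generated_feature_count(cycle);
        if count <= FEATURE_LIMIT {
            println!("cycle {cycle}: {count} features loaded, traffic serving normally");
        } else {
            println!("cycle {cycle}: {count} features exceeds limit {FEATURE_LIMIT}, requests failing");
        }
    }
}

From inside the incident, every “good” cycle looks like a recovery, which is exactly the kind of signal that nudges responders away from “we shipped a bad file” and toward “something external is hitting us intermittently.”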
Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page.
Here they got bit by a co-incident, an unrelated failure of their status page that led them to believe (reasonably!) that the problem must have been external.
I’m still curious as to what happened with their status page. The error message they were getting mentions CloudFront, so I assume they were hosting their status page on AWS. But their writeup doesn’t go into any additional detail on what the status page failure mode was.
But the general takeaway here is that even the most experienced operators are going to take longer to deal with a complex, novel failure mode, precisely because it is complex and novel! As the resilience engineering folks say, prepare to be surprised! (Because I promise, it’s going to happen).
A plea: assume local rationality
The writeup included a screenshot of the code that had an unhandled error. Unfortunately, there’s nothing in the writeup that tells us what the programmer was thinking when they wrote that code.
In the absence of any additional information, a natural human reaction is to just assume that the programmer was sloppy. But if you want to understand how these sorts of incidents actually happen, you have to fight this reaction.
People always make decisions that make sense to them in the moment, based on what they know and what constraints they are operating under. After all, if that wasn’t true, then they wouldn’t have made that decision. The only way we can actually understand the conditions that enable incidents is to try as hard as we can to put ourselves into the shoes of the person who made that call, to understand what their frame of mind was in the moment.
If we don’t do that, we risk the problem of distancing through differencing. We say, “oh, those devs were bozos, I would never have made that kind of mistake”. This is a great way to limit how much you can learn from an incident.
Detailed public writeups as evidence of good engineering
The writeup produced by Cloudflare (signed by the CEO, no less!) was impressively detailed. It even includes a screenshot of a snippet of code that contributed to the incident! I can’t recall ever reading another public writeup with that level of detail.
Companies generally err on the side of saying less rather than more. After all, if you provide more detail, you open yourself up to criticism that the failure was due to poor engineering. The fewer details you provide, the fewer things people can call you out on. It’s not hard to find people online criticizing Cloudflare using the details they provided as the basis for their criticism.
Now, I think it would advance our industry if people held the opposite view: the more details that are provided in an incident writeup, the higher the esteem in which we should hold that organization. I respect Cloudflare as an engineering organization a lot more precisely because they are willing to provide these sorts of details. I don’t want to hear what Cloudflare should have done from people who weren’t there; I want us to hold other companies to Cloudflare’s standard for describing the details of a failure mode and the inherently confusing nature of incident response.


