The hidden trade-offs of fine-grained progressive rollouts
A progressive rollout refers to the act of rolling out some new functionality gradually rather than all at once. This means that, when you initially deploy it, the change only impacts a fraction of your users. The idea behind a progressive rollout is to reduce the risk of a deployment by reducing the blast radius: if something goes wrong with the new thing during deployment, then the impact is much smaller than if you had deployed it all-at-once, to all of the traffic.
The impact of a bad rollout is shown in redThere are two general strategies for doing a progressive rollout. One strategy is coarse grained, where you stage your deploys across domains. For example, deploying the new functionality to one geographic region at a time. The second strategy is more fine-grained, where you define a ramp up schedule (e.g., 1% of traffic to the new thing, then 5%, then 10%, etc.).
Note that the two strategies aren’t mutually exclusive: you can stage your deploy across regions, and within each region, you can do a fine-grained ramp-up within each regions. And you can also think of it as a spectrum rather than two separate categories, since you can control the granularity. But I make the distinction here because I want to talk specifically about the fine-grained approach, where we use a ramp.
The ramp is clearly superior if you’re able to detect a problem during deployment, as shown in the diagram above. It’s a real win if you have automation that can automatically detect based on a metric like error rate. The problem with the ramp is the scenario when you don’t detect that there’s a problem with the deployment.
My claim here in this post is that if you don’t detect a problem with a fine-grained progressive rollout until after the rollout has completed, then it will tend to take you longer to diagnose what the problem is:
Paradoxically, progressive rollout can increase the blast radius by making after-the-fact diagnosis harderHere’s my argument: once you know something is wrong with your system, but you don’t know what it is that has gone wrong, one of the things you’ll do is to look at dashboard graphs to look for a signal that identifies when the problem started, such as an increase in error rate or request latency. When you do a fine-grained progressive rollout, if something has gone wrong, then the impact will get smeared out over time, and it will be harder to identify the rollout as the relevant change by looking at a dashboard. If you’re lucky, your observability tools will let you slice on the rollout dimension. This is why I like coarse-grained rollouts, because if you have explicit deployment domains like geographical regions, then your observability tools will almost certainly let you slice the data based on those. Heck, you should have existing dashboards that already slice on it. But for fine-grained rolled-out, you may not think to slice on a particular rollout dimension (especially if you’re rolling out a bunch of things at once, all of them doing fine-grained deployments), and you might not even be able to.
To determine whether fine-grained rollouts are a net win depends on a number of factors whose values are not obvious, including:
the probability you detect a problem during the rollout vs after the rollouthow much longer it takes to diagnose the problem if not caught during rolloutyour cost model for an incidentOn the third bullet: the above diagram implicitly assumes that impact to the business is linear with respect to time. However, it might be non-linear: an hour-long incident may turn out to be more than twice as expensive as two half-hour-long incidents.
As someone who works in the reliability space, I’m acutely aware of the pain of incidents that take a long time to mitigate because they are difficult to diagnose. But I think that the trade-off of fine-grained progressive rollouts are generally not recognized as such: it’s easy to imagine the benefits when the problems are caught earlier, it’s harder to imagine the scenarios where the problem isn’t caught until later, and how harder things get because of it.


