On June 12, 2025, most of GCP went offline. This led to downstream outages in a multitude of websites and services, such as Cloudflare, Spotify, OpenAI, Anthropic, Replit, and many others.

With a few days of hindsight, GCP published a quite detailed postmortem. Frankly, I’m impressed by the depth of this PM and the quantity of technical details that they released publicly. Given this level of detail, it’s feasible to piece together a reasonably full picture of what happened and what ultimately went wrong.

In what follows, I’m only going on what was publicly released in the postmortem.

The Underlying Cause

GCP has an internal service called “Service Control”, which handles policy checks like authorization and quotas for the GCP APIs. A data-dependent bug was introduced into this service that caused it to crash and fail closed when a particular code path was reached; that code path could only be exercised by a Google-driven global quota policy change. Because the bad code path was data-dependent, it was never exercised as the change gradually rolled out to production – it was triggered all at once, globally, when a new global quota policy version was applied.
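
The postmortem doesn’t include code, but the failure shape it describes translates into a minimal sketch along these lines (all types and names below are hypothetical, and Go is just standing in for whatever Service Control is actually written in): the new code path only runs for policies that exercise it, and it dereferences a field that can be unpopulated.

```go
package main

import "fmt"

// Hypothetical, simplified types -- the real Service Control schema is not public.
type QuotaLimits struct {
	RequestsPerMinute int
}

type QuotaPolicy struct {
	Name             string
	EnforceNewLimits bool         // only set by the new global quota policy
	Limits           *QuotaLimits // may be nil if the policy arrives with blank/unpopulated fields
}

// checkQuota sketches the failing shape described in the PM: the new path is
// data-dependent, so it never runs until a policy actually opts into it.
func checkQuota(p *QuotaPolicy) bool {
	if p.EnforceNewLimits {
		// BUG: no handling for a policy with missing limit data; dereferencing
		// the nil field crashes the process (the Go analogue of the null pointer).
		return p.Limits.RequestsPerMinute > 0
	}
	return true
}

func main() {
	// Existing policies never reach the new path, so the binary rollout looks clean...
	fmt.Println(checkQuota(&QuotaPolicy{Name: "old-policy"}))
	// ...until a new global policy with unpopulated fields hits every region at once.
	fmt.Println(checkQuota(&QuotaPolicy{Name: "global-quota-v2", EnforceNewLimits: true})) // panics here
}
```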

This matches some of the provisional rumors (later ~confirmed) that the outage was “related to IAM”.

Lack of Gradual Rollout

The biggest egg-on-face from this incident is the following admission, worth quoting in full:

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

So, effectively, the code path that caused this global outage appears never to have been exercised in production before it was enabled. On its face, this seems… bad. Reading between the lines of this paragraph, it’s unclear to me to what extent they’re throwing themselves under the bus. Saying “If this had been flag protected, the issue would have been caught in staging” could be read either as “we messed up and should have had this flag protected” or just literally as “if there had been a flag, this wouldn’t have happened”.

The change was hard to feature flag. There are certainly times when it’s quite difficult to feature flag a change: if it requires a binary restart to take effect, or if it’s part of a large refactor without isolated pieces over which to place a flag, guarding it effectively can be hard. Giving Google the benefit of the doubt, it’s possible that this change was hard to flag, and so instead of using standard flags, they relied on their “red button” mechanism for safety. I give relatively low probability to this possibility, since the change could be disabled with the “red button” and was data-dependent.

The change relied on their “red-button”1 instead of standard feature flags. The PM notes that “this code change came with a red-button to turn off that particular policy serving path” and “Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place”. So it seems like the rollout plan for this feature relied heavily on the existence of this “red-button” for safety. The fact that it was identified as the correct mitigation approach within 10 minutes updates me in the direction that this was their rollout safety plan: if it breaks, we’ll kill-switch it. Perhaps they thought the red-button would take effect more quickly than it actually did, and/or that the change was low-risk enough that the red-button approach was sufficiently safe.
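
For readers less steeped in rollout tooling, here’s a hedged sketch of the distinction the PM is drawing (none of this is Google’s actual mechanism; the names are made up): a feature flag defaults to off and is ramped up gradually, so broken code is never reached until someone opts a project or region in, while a red-button defaults to the path being live and is flipped after something breaks, so it bounds the duration of an incident rather than its blast radius.

```go
package main

import "fmt"

// Hypothetical rollout controls -- not Google's actual configuration surface.
type rolloutControls struct {
	newQuotaChecksEnabled bool // feature flag: off by default, ramped per project/region
	redButtonPressed      bool // kill switch: flipped by operators after a failure is detected
}

// shouldRunNewQuotaPath shows how the two controls differ: the flag prevents
// the risky path from ever running; the red-button only stops it after the fact.
func shouldRunNewQuotaPath(c rolloutControls) bool {
	if !c.newQuotaChecksEnabled {
		return false // flag off: the new path is simply never exercised
	}
	if c.redButtonPressed {
		return false // red-button: disable the policy serving path mid-incident
	}
	return true
}

func main() {
	fmt.Println(shouldRunNewQuotaPath(rolloutControls{}))                            // false: gated by default
	fmt.Println(shouldRunNewQuotaPath(rolloutControls{newQuotaChecksEnabled: true})) // true: path live
	fmt.Println(shouldRunNewQuotaPath(rolloutControls{
		newQuotaChecksEnabled: true,
		redButtonPressed:      true,
	})) // false: killed only after detection
}
```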

There was a process failure. At least twice in the PM, it’s listed as a follow-up that “We will enforce all changes to critical binaries to be feature flag protected and disabled by default.” My knowledge of change management practices at GCP is now stale, but I would strongly suspect that it was already a best practice to feature flag changes to critical binaries – for the policy to be otherwise would be courting disaster, near malpractice for a set of services running at their scale. I put higher weight on this being an individual or team-specific process failure. There should have been a plan to roll out this code gradually, but either that plan rested on a faulty assumption (i.e. thinking that the gradual binary rollout protected them, not realizing that the new code wasn’t actually being executed) or it was insufficiently paranoid (i.e. assuming that the red-button approach provided sufficient safety).

Ultimately, these things happen. I’ve been cc’d on many PMs that have a similar shape to this one. Process compliance is a hard problem, because it grounds out in the cultural norms and practices of the engineering environment – down to individual contributors and tech leads – rather than in a technical or architectural decision that can be made and enforced in a fully top-down way.

Other Followups

Fail Open:

We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.

This seems wise. For a single-point-of-failure system like Service Control, adding isolation and (well-considered) ways to fail open makes sense.
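
The PM doesn’t say what that modularization will look like, but one hedged sketch of “isolated and fails open” (hypothetical names, not Service Control’s actual API): contain failures inside the check itself, and have the caller treat an internal error as “allow and alert” rather than crashing or rejecting everything.

```go
package main

import (
	"errors"
	"log"
)

// quotaCheck stands in for the isolated policy/quota module (hypothetical --
// the real Service Control interface is not public).
func quotaCheck(req string) (allowed bool, err error) {
	// Isolation: a panic inside the check becomes an error for the caller
	// instead of crashing the whole binary.
	defer func() {
		if r := recover(); r != nil {
			err = errors.New("quota check failed internally")
		}
	}()
	// ... real policy evaluation would go here ...
	return true, nil
}

func handleRequest(req string) bool {
	allowed, err := quotaCheck(req)
	if err != nil {
		// Fail open: if the check itself is broken, serve the request and
		// alert, rather than turning every API call into an error.
		log.Printf("quota check error for %q, failing open: %v", req, err)
		return true
	}
	return allowed
}

func main() {
	log.Println("request allowed:", handleRequest("projects/demo:books.list"))
}
```

The “well-considered” caveat is doing real work here: failing open on a quota check trades enforcement for availability, which is presumably an acceptable trade for quotas but would not be for every check Service Control performs.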

Globally replicated data:

We will audit all systems that consume globally replicated data. Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues.

This also seems quite wise, and honestly is what I’d consider the most important followup from this incident, based on what has been publicly released. Consumption of globally replicated data is always going to be tricky. Even with a feature flag, doing a globally impactful operation is always going to be significantly riskier than doing the same thing gradually.

Google undoubtedly has the internal tools to do this sort of gradual propagation, so I’m guessing that in this case the global nature of the data was an intentional business requirement (i.e. quotas need to be updated globally within a short window of time for something something compliance reasons). If true, reconsidering or challenging those requirements would also be helpful.
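
To make the followup concrete, here is a hedged sketch (in no way Google’s actual tooling; every name below is made up) of what “propagate incrementally with sufficient time to validate and detect issues” can look like for config-like data such as quota policies: push to a canary region, bake, check health, then widen.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical rollout order: smaller/canary regions first.
var regions = []string{"region-canary", "region-small-1", "region-large-1", "region-large-2"}

func pushPolicyTo(region, policyVersion string) error {
	fmt.Printf("applying %s in %s\n", policyVersion, region)
	return nil // stand-in for the real replication step
}

func regionHealthy(region string) bool {
	return true // stand-in for error-rate / crash-loop checks after the push
}

// rolloutPolicy propagates a policy version region by region, pausing to let
// monitoring catch a data-dependent failure before widening the blast radius.
func rolloutPolicy(policyVersion string, bake time.Duration) error {
	for _, r := range regions {
		if err := pushPolicyTo(r, policyVersion); err != nil {
			return err
		}
		time.Sleep(bake) // validation window before the next region
		if !regionHealthy(r) {
			return fmt.Errorf("halting rollout of %s: %s unhealthy", policyVersion, r)
		}
	}
	return nil
}

func main() {
	_ = rolloutPolicy("quota-policy-v2", 10*time.Millisecond) // short bake only for the demo
}
```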

Incident Response:

Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. The red-button was ready to roll out ~25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first.

The incident response here seems quite fast. Two minutes to first triage and ten minutes to root cause is impressive – although the global nature of the impact assuredly got eyes on this quickly.

Incident Communication:

We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage.

Oof. First, the circular dependency here is rough, and I’m surprised there wasn’t an easier way to publish an incident report through an out-of-band channel that doesn’t depend on the affected infrastructure. Second, we’ll never know, but I also wouldn’t be surprised if institutional caution accounted for at least some of this delay. With a substantial impact like this, figuring out who has the authority to green-light the costly “everything is down” message would likely take some time.

Google exposing this level of detail in a retrospective is a positive sign. It rhymes with postmortems that I’ve seen in the past which, while painful, resulted in cultural change that ultimately improved stability long term. Again, I have no insight into the actual internals of what happened here – it’s quite possible that this rollout was a well-designed calculated risk that came up unlucky, or that it was a legitimate own-goal process blunder. In any case, my takeaways are twofold: first, even the hyperscalers make mistakes; and second, kill-switches are necessary but not sufficient – there is no substitute for exercising new code gradually, to avoid global outages like this one.

Obligatory: These are my own opinions and very much do not represent my employer, etc.


  1. I think they effectively mean “kill switch” by “red button”?