In late January, I published a post[1] (archive) on the Databricks engineering blog about “SAFE”, the feature flagging and experimentation platform I’ve been working on for the past few years. SAFE is what I’ve spent most of my time on at Databricks, and it’s been rewarding to see the project grow from an initial prototype to a mature internal platform.

I’ve been the tech lead for SAFE for a while now, and the project has scaled significantly in headcount, scope, and usage. The work described in that post represents the efforts of both an initial core team that got it off the ground (which I was fortunate to be a part of) and a larger group of engineers who’ve shepherded it into a durable platform that has evolved to meet the needs of a now-$134B company.

A few particular things I’m proud of:

𐡸 We really optimized the heck out of the evaluation runtime “SDK”, such that the p95 for flag evaluation is roughly 10μs. After publishing the blog post, someone reached out to me internally and asked me, effectively, “Really? You were able to get evaluation that fast, even in the JVM?” I had a moment of panic thinking maybe I’d grabbed outdated numbers, but then looked at the live prod latency statistics, and yup – we were humming away at around 8μs in prod.

A coworker and I also translated the whole evaluation stack into Rust over the past year[2], and the latency numbers there are even better. In Rust, flag evaluation is pretty dang close to the latency of a hashmap lookup, from the perspective of an RPC service.
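To make the “close to a hashmap lookup” claim concrete, here’s a minimal sketch of why in-process evaluation can be that cheap. This is a hypothetical illustration, not SAFE’s actual API: the key idea is that the hot path reads from an in-memory snapshot, so evaluating a flag involves no I/O at all.

```rust
use std::collections::HashMap;

// Hypothetical sketch (not SAFE's real interface): flags are held in an
// in-memory snapshot, so evaluation is essentially a hashmap lookup.
struct FlagStore {
    // Flag name -> current boolean value. A real system would also hold
    // targeting rules, rollout percentages, and non-boolean value types.
    flags: HashMap<String, bool>,
}

impl FlagStore {
    fn new() -> Self {
        FlagStore { flags: HashMap::new() }
    }

    // Called by a background sync when configuration changes, off the hot path.
    fn set(&mut self, name: &str, value: bool) {
        self.flags.insert(name.to_string(), value);
    }

    // The hot path: no I/O, no allocation; unknown flags fall back to the
    // caller-supplied default.
    fn eval_bool(&self, name: &str, default: bool) -> bool {
        *self.flags.get(name).unwrap_or(&default)
    }
}

fn main() {
    let mut store = FlagStore::new();
    store.set("new-query-planner", true);
    assert!(store.eval_bool("new-query-planner", false));
    // A flag that was never configured resolves to the default.
    assert!(!store.eval_bool("unknown-flag", false));
}
```

In a real SDK the snapshot would be swapped atomically as new configuration arrives, but the cost model on the read path stays the same, which is what keeps p95 in the microsecond range.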

𐡸 We spent a lot of time getting the UX right. As an example, SAFE was essentially the first internal tool at Databricks to have a fully-featured, in-house web UI[3] as a primary means of interacting with it. It felt risky at the time, but the investment in an internal UI as the primary interaction mode proved to be quite high ROI.

UX is the whole end-to-end journey though, not just the fancy chrome you put on top. It took us quite a while to get to a point where the usability of the system was where I wanted it to be, and there’s still a bunch of places we can improve, but on the whole I’m quite proud of the system we’ve ended up with.

𐡸 We spent a lot of time getting the change management guardrails right. SAFE is fundamentally a configuration management system. Configuration changes are a notorious source of outages. As such, most of the dev cycles put into improving SAFE have been into improving guardrails around its usage.

There was definitely a period of “post-mortem-based-development” in SAFE, where we reactively added checks to “fight the last fire”. Over time, though, the team has developed a quite defensible philosophy around change management that has struck a good balance between allowing feature teams to ship quickly, mitigating risk, and reducing the blast radius of incidents.

Each flag flip now runs dozens, if not hundreds, of checks, with teams being able to augment their own flags/rollouts with custom checks. We’ve recently added AI agent-driven checks to enforce best practices for usage. Flag rollouts can have automated monitoring to check for regressions;[4] flags can be used to perform A/B experiments; flags can be used to detect performance changes. There is, of course, still more work to be done here.
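The shape of that pre-flip pipeline can be sketched roughly as follows. This is an illustrative assumption about the structure, not SAFE’s actual code: each check (built-in or team-supplied) inspects a proposed change and can veto it with a reason.

```rust
// Hypothetical sketch of pre-flip guardrail checks; names and structure
// are illustrative, not SAFE's actual API.
struct FlagChange {
    flag: String,
    rollout_pct: u8, // proposed rollout percentage, 0-100
}

trait Check {
    fn name(&self) -> &str;
    // Ok(()) if the change passes; Err(reason) if it should be blocked.
    fn run(&self, change: &FlagChange) -> Result<(), String>;
}

// Example built-in check: disallow jumping straight to a 100% rollout.
struct GradualRolloutCheck;

impl Check for GradualRolloutCheck {
    fn name(&self) -> &str {
        "gradual-rollout"
    }
    fn run(&self, change: &FlagChange) -> Result<(), String> {
        if change.rollout_pct == 100 {
            Err(format!(
                "flag '{}' cannot jump straight to 100%",
                change.flag
            ))
        } else {
            Ok(())
        }
    }
}

// Run every registered check and collect the failures.
fn run_checks(checks: &[Box<dyn Check>], change: &FlagChange) -> Vec<String> {
    checks
        .iter()
        .filter_map(|c| c.run(change).err().map(|e| format!("{}: {}", c.name(), e)))
        .collect()
}

fn main() {
    let checks: Vec<Box<dyn Check>> = vec![Box::new(GradualRolloutCheck)];

    let risky = FlagChange {
        flag: "new-query-planner".to_string(),
        rollout_pct: 100,
    };
    assert_eq!(run_checks(&checks, &risky).len(), 1);

    let gradual = FlagChange {
        flag: "new-query-planner".to_string(),
        rollout_pct: 25,
    };
    assert!(run_checks(&checks, &gradual).is_empty());
}
```

A trait-based design like this is one way custom per-team checks could slot in alongside the built-ins: a team registers its own `Check` implementation, and the pipeline treats it identically to the platform’s checks.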

𐡸 We aren’t sitting still. Projects naturally have a lifecycle. There’s a “0->1” period, which is exciting for obvious reasons, and then a “1->10” period, which can similarly be quite enjoyable, and then a plateauing as the S-curve of the project starts to level out. There was a time around 2 years ago where SAFE had kinda reached its initial “local maximum”. We’d closed the loop, fought the fires, and come to a workable, stable system.

Now what? It took a bit of time for me personally to find that “what next?”, but it’s now super clear to me and I’m unusually energized about it.

SWE as a field, as a practice, as a culture is changing profoundly right now. Teams are shipping quicker, and stability is more important than ever. Agent-based development is allowing us to think significantly larger than we could a few years ago, and putting agents in the loop of production monitoring and change management is overdetermined at this point.

Configuration is an unintuitively high-leverage piece of infrastructure given where things are progressing over the medium-term.

It’s been a joy to work on this system and see it grow, and to work alongside the team of people who’ve built it up.


  1. Obligatory disclaimer: These are my own opinions and do not reflect those of my employer, etc. ↩︎

  2. Not as a “rewrite it in Rust”, but as an additive support for new services being written in Rust and other non-JVM languages. ↩︎

  3. I’m not including OSS UIs like Grafana, OpenSearch, etc. here. ↩︎

  4. This turns out to be a wicked problem. It’s one of those things that sounds super simple, but actually getting right is surprisingly hard to do. ↩︎