<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Ben Congdon</title>
    <link>https://benjamincongdon.me/tags/agents/</link>
    <description>Recent posts from Ben Congdon</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
    <managingEditor>ben@congdon.dev (Ben Congdon)</managingEditor>
    <webMaster>ben@congdon.dev (Ben Congdon)</webMaster>
    <copyright>Copyright 2026, Ben Congdon</copyright>
    <lastBuildDate>Fri, 15 May 2026 07:00:00 -0700</lastBuildDate>
    
        <atom:link href="https://benjamincongdon.me/tags/agents/feed.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
        <title>To the Agents: &#34;This place is not a place of honor&#34;</title>
        <link>https://benjamincongdon.me/blog/2026/05/15/To-the-Agents-This-place-is-not-a-place-of-honor/</link>
        <pubDate>Fri, 15 May 2026 07:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2026/05/15/To-the-Agents-This-place-is-not-a-place-of-honor/</guid>
        <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; &amp;ldquo;Private by obscurity&amp;rdquo; has been dissolved.&lt;/p&gt;
&lt;p&gt;Internal tools often have layering boundaries that are enforced only by
convention. It&amp;rsquo;s natural to assume a &amp;ldquo;high trust environment&amp;rdquo;, where privileged
actions are discouraged by obscurity and goodwill instead of hard technical
boundaries. Coding agents have dissolved this obscurity, and as a result
internal platform engineering now &lt;em&gt;really&lt;/em&gt; demands a security mindset.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;During a recent codebase audit, a coworker and I discovered an unfortunate set
of private APIs my team owns that were being used in creative and unintended
ways, outside the official interfaces. Much of the code that introduced these
unsanctioned dependencies was AI generated&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;. This was one more datapoint
among many that, especially in large monolith codebases and in large
enterprises, coding agents have changed how platform teams need to operate.&lt;/p&gt;
&lt;p&gt;This particular audit exposed two classes of internal API leakage:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We have pseudo-internal APIs opened for narrow “2nd party” integrations.
These had allowlisting, but it was not granular enough to prevent
inadvertent use. They were previously “private-by-obscurity”, which is
woefully insufficient when coding agents can reverse engineer the entire
stack and determine that they can make creative use of your pseudo-internal
APIs.&lt;/li&gt;
&lt;li&gt;APIs that were properly private, until a single Bazel visibility change
opened them to external callers. Bazel visibility changes often look
benign&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;, and internal dependencies can be introduced inside a large PR
without anyone noticing.&lt;sup id=&#34;fnref:4&#34;&gt;&lt;a href=&#34;#fn:4&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ol&gt;
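&lt;p&gt;To make the second failure mode concrete, here is a hypothetical sketch of how
small such a change can look in a Bazel BUILD file (target and package names are
invented for illustration):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# BUILD.bazel (hypothetical)
java_library(
    name = &#34;payments_internal&#34;,
    srcs = glob([&#34;*.java&#34;]),
    # Before: visible only within the owning package.
    #   visibility = [&#34;//visibility:private&#34;],
    # After: a one-line diff, and any target in the repo may now depend on it.
    visibility = [&#34;//visibility:public&#34;],
)
&lt;/code&gt;&lt;/pre&gt;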
&lt;p&gt;In all the cases we’ve seen so far, the code using the exposed APIs wasn’t even
&lt;em&gt;wrong&lt;/em&gt; from a technical perspective. However, it violated layering principles
and set up concerning coupling that would eventually break as the system
evolved. Very creative code, very unsanctioned.&lt;/p&gt;
&lt;p&gt;Fortunately, the proximal fixes are simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep external APIs friendly and well-documented. Agents tend to
&lt;a href=&#34;https://en.wikipedia.org/wiki/Streetlight_effect&#34;&gt;look under lampposts&lt;/a&gt;, so
make sure the paved paths are well-lit.&lt;/li&gt;
&lt;li&gt;Update the internal APIs &amp;ndash; especially the honey-pot APIs that look like you
&lt;em&gt;might&lt;/em&gt; want to use them if you were a hungry external agent &amp;ndash; to look
extremely unappealing to use.
&lt;ul&gt;
&lt;li&gt;Create a CLAUDE.md in all borderline packages that warns future
“internal” coding agents against accidentally making symbols
externally visible.&lt;/li&gt;
&lt;li&gt;Seal off the misused interfaces in “stop-the-bleeding” private bulkhead
packages, with extremely loud “HERE BE DRAGONS” warnings at the top.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
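&lt;p&gt;A hypothetical sketch of such a CLAUDE.md (all wording and package names
invented for illustration):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# CLAUDE.md -- //payments/internal (hypothetical)

This package is INTERNAL to the payments platform team.

- Do not add dependencies on these symbols from outside //payments.
- Do not widen the Bazel visibility of any target in this package.
- If a task appears to require either, stop and ask the platform team.
&lt;/code&gt;&lt;/pre&gt;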
&lt;p&gt;Proper layering and API visibility isn’t a new problem. What has changed is
the amount of creative adversarial pressure on code. Just as internet security
was lax before networked software faced real adversarial pressure, many
internal company platforms are insufficiently paranoid about insider
risk.&lt;/p&gt;
&lt;p&gt;This insider risk goes beyond code: internal APIs now have a much larger attack
surface when any coding agent session can trivially run &lt;code&gt;curl&lt;/code&gt; from a
developer’s machine. Hopefully all your ACLs are correctly configured and you
don’t have any catastrophic write APIs publicly accessible via developer
credentials&amp;hellip; right? Similarly, I’ve recently seen an uptick of agents
discovering internal prod-modifying CI jobs and convincing folks to run them.
Hopefully there are sufficient restrictions on who can run those jobs&amp;hellip;?&lt;/p&gt;
&lt;p&gt;This isn’t to excuse having had an insufficiently paranoid security posture in the past:
all these surfaces &lt;em&gt;should obviously be locked down&lt;/em&gt;. However, in the past you
could rely on some amount of “security by obscurity” in borderline cases. Now,
you clearly cannot.&lt;/p&gt;
&lt;p&gt;So, you&amp;rsquo;ll start seeing more of this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;/**
 * ⚠ UNSAFE — INTERNAL USE ONLY ⚠
 *
 * &amp;lt;p&amp;gt;
 * Do not invoke without explicit human approval.
 *
 * &amp;lt;pre&amp;gt;
 * This place is a message... and part of a system of messages...
 * pay attention to it!
 *
 * Sending this message was important to us.
 * We considered ourselves to be a powerful culture.
 *
 * This place is not a place of honor...
 * no highly esteemed deed is commemorated here...
 * nothing valued is here.
 *
 * What is here was dangerous and repulsive to us.
 * This message is a warning about danger.
 * &amp;lt;/pre&amp;gt;
 */
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It’s hard not to read &amp;ldquo;I spliced into your internal APIs&amp;rdquo; as somewhat malicious,
even if there was no malicious intent. But what defines an “internal API” in a monorepo
is often a grey area. It’s on me as a platform owner to provide clean
interfaces, and it’s on me to not create
&lt;a href=&#34;https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine&#34;&gt;attractive nuisances&lt;/a&gt;
of trivially accessible unsafe APIs. But to err is human. At a quickly
growing company, there is a ton of startup-era code that was insufficiently
paranoid; gaps will exist. And so you “stop the bleeding”, patch the gaps, help
people migrate to safer interfaces, and everyone is better off for it.&lt;/p&gt;
&lt;p&gt;All this to say: agentic coding is fantastic; I’m a big fan! But the human side
of engineering is still quite important. If you want to do something slightly
off-kilter with a system someone else owns, by all means propose it. Progress is
good! But also be &lt;em&gt;very explicit&lt;/em&gt; with them and get &lt;em&gt;buyoff&lt;/em&gt;. &amp;ldquo;Ask forgiveness&amp;rdquo;
is a valid approach in many circumstances, but not in safety-critical code.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;&lt;em&gt;Insert &amp;ldquo;Always has been&amp;rdquo; meme here.&lt;/em&gt;&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Although, this isn’t saying much. Most code is AI generated.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;When I worked at Google, I recall it being easier to spot obviously faulty
Bazel visibility additions. I think at large-but-not-mega-repo scale,
accidentally introducing bad Bazel edges is harder to totally avoid. There
are techniques for this, like banning certain outbound
edges between parts of the Bazel graph. Google had good tooling for this
that doesn’t seem to exist in an ergonomic form outside Google.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:4&#34;&gt;
&lt;p&gt;PRs have been getting larger. PR descriptions have also been getting wordier
and more “polished”-looking, regardless of their actual merit as changes. A
reviewer who isn’t super familiar with the codebase may not realize the risk
of a benign-looking Bazel visibility change.&amp;#160;&lt;a href=&#34;#fnref:4&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Thoughts on Marginal Token Spend</title>
        <link>https://benjamincongdon.me/blog/2026/04/30/Thoughts-on-Marginal-Token-Spend/</link>
        <pubDate>Thu, 30 Apr 2026 07:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2026/04/30/Thoughts-on-Marginal-Token-Spend/</guid>
        <description>&lt;p&gt;The rise of coding agents has made it easy for a single engineer to spend
thousands of dollars a day in LLM tokens. This is a new class of expense, and it
will change the future cost structure of software engineering. We are between
stable equilibria today in SWE: the old one, of needing humans to drive any code
change, and a yet-to-be-established new one, where AI agents write most code.&lt;/p&gt;
&lt;p&gt;Taking as a premise that AI agents will write a large fraction of code in the
new equilibrium, we will need to rethink the resulting cost structure for
engineering orgs. In the long term, token spend will become a large portion of
enterprise OpEx, split into two classes: spend attributable to a human worker
(e.g. Claude Code, internal AI tooling like call center assistants) and spend
attributable to automated systems (e.g. agents which respond autonomously to
customers by fielding calls/emails, agents which monitor business systems).&lt;/p&gt;
&lt;p&gt;Automated token spend is analogous to cloud cost: at equilibrium, it &amp;ldquo;should&amp;rdquo;
roughly scale linearly with product usage/revenue. Human-generated token spend
seems like a new class of expense: It&amp;rsquo;s a resource that does not scale linearly
with headcount and has no natural ceiling. The per-person absorption rate is
whatever an engineer&amp;rsquo;s workflow can productively use, and that ceiling is itself
increasing as new harnesses, workflows, and practices are developed.
Human-attributable token spend is therefore the more interesting type of token
usage from a decision-making perspective: it&amp;rsquo;s effectively an expensive &amp;ldquo;more
productivity button&amp;rdquo;. Rationally, you should continue pushing the button up
until the point where the marginal returns you receive match the marginal cost
of pushing the button again.&lt;/p&gt;
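&lt;p&gt;This stopping rule can be sketched numerically. The toy model below assumes
(purely for illustration) diminishing returns of the form value(s) = scale * ln(1 + s);
the function names and numbers are invented:&lt;/p&gt;

```python
def marginal_return(spend, scale=5000.0):
    # Toy diminishing-returns model: value(s) = scale * ln(1 + s),
    # so the marginal return at spend level s is scale / (1 + s).
    return scale / (1.0 + spend)

def optimal_spend(marginal_cost, scale=5000.0, step=1.0):
    # Keep "pushing the button" while the next dollar of tokens
    # returns more than it costs.
    spend = 0.0
    while marginal_return(spend, scale) > marginal_cost:
        spend += step
    return spend

# Lower marginal cost pushes the rational stopping point much further out:
print(optimal_spend(marginal_cost=1.0), optimal_spend(marginal_cost=0.01))
```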
&lt;p&gt;This framing also explains why the
&lt;a href=&#34;https://benjamincongdon.me/blog/2026/04/29/Tokenmaxxing-is-Goodharting/&#34;&gt;“Tokenmaxxing”&lt;/a&gt; meme started.
Tokenmaxxing is the recent idea that consuming more tokens makes you more &amp;ldquo;AI
native&amp;rdquo; and therefore more productive. AI adoption is path dependent; engineers
are hesitant to change their patterns. Leaders noticed that engineers were
opting to push the &amp;ldquo;more productivity&amp;rdquo; button at an irrationally low rate. The
short-term &amp;ldquo;fix&amp;rdquo; is to directly incentivize pushing the button &amp;ndash; with the clear
risk that you overshoot into people Goodharting on &amp;ldquo;push the button as much as I
can&amp;rdquo; vs. &amp;ldquo;increase my productivity to the point of diminishing returns&amp;rdquo;. And,
predictably, &amp;ldquo;oops, we overshot&amp;rdquo; becomes a narrative &amp;ndash; at least for the
organizations that aren&amp;rsquo;t able to keep finding additional efficient uses of AI
that expand their production frontier.&lt;/p&gt;
&lt;p&gt;A corollary: if your marginal token cost is negligible, then your usage of AI
should go &lt;em&gt;way, way up&lt;/em&gt;. Right now, Anthropic and OpenAI are clearly in the lead
for coding model capabilities. They have lots of compute. The amount of compute
needed to completely saturate their engineers&amp;rsquo; token demand is likely a small
fraction of what they use on marginal research projects. If you have the
hardware, and marginal-cost access to frontier models, you should probably
actually be Tokenmaxxing, since it&amp;rsquo;s much harder to reach the balance point of
&amp;ldquo;marginal return = marginal cost&amp;rdquo; when marginal cost is so much lower.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Whether you should Tokenmaxx in the short term depends on where you sit on the
marginal cost curve. If your marginal token cost is high (e.g. paying Anthropic
for Opus tokens at retail prices), the just-spend-more heuristic will overshoot.
If it&amp;rsquo;s low, you probably &lt;em&gt;should&lt;/em&gt; push harder.&lt;/p&gt;
&lt;p&gt;In the long(er) term, the cost curve is shaped by how well an organization
can absorb additional token capacity. Companies will invest in
getting more productive use per token, through both technical improvements (e.g.
connectivity between AI tools and internal knowledge silos) and operational work
(e.g. workflow redesign, changing norms of when it’s appropriate to substitute
AI artifacts for human-generated ones). Token spend is more elastic than
headcount, and the space is new enough that the efficiency frontier is still
being discovered and reshaped.&lt;/p&gt;
&lt;p&gt;My bet is that most efficiency gains won&amp;rsquo;t become competitive moats, because
development techniques diffuse too quickly and the underlying models are
available to everyone. They will, however, become a low-water mark. Companies
that fall below it will be outcompeted by firms that don&amp;rsquo;t. In this way, AI
adoption is a Red Queen race, requiring increasing efficiency/usage just to not
become irrelevant.&lt;/p&gt;
&lt;p&gt;None of this is stable yet. We’re currently between stable equilibria in many
areas in software development. As such, it’s a particularly exciting time to be
working in this space.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;To some extent, frontier labs also have to consider the opportunity cost of
using compute for engineering instead of research or inference, i.e. the
marginal token can either go to the marginal research project, the marginal
internal engineer, or the marginal external customer inference token. That
opportunity cost is not zero, but it’s a different set of tradeoffs than
external companies buying retail tokens. Frontier lab investment in R&amp;amp;D via
“engineer tokens” also increases returns to marginal research compute, and
“engineer tokens” are likely a quite small fraction of either research or
inference compute.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Tokenmaxxing is Goodharting</title>
        <link>https://benjamincongdon.me/blog/2026/04/29/Tokenmaxxing-is-Goodharting/</link>
        <pubDate>Wed, 29 Apr 2026 05:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2026/04/29/Tokenmaxxing-is-Goodharting/</guid>
        <description>&lt;p&gt;Coding agents and reasoning models let individuals consume many more LLM tokens
than they could a year ago. It&amp;rsquo;s now easy for a single engineer to spend
thousands of dollars in daily token usage. This is being actively encouraged
through the recent memetic spread of &amp;ldquo;Tokenmaxxing&amp;rdquo; &amp;ndash; the idea that if you
consume more tokens, you&amp;rsquo;re more &amp;ldquo;AI native&amp;rdquo; and therefore producing more
valuable output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tokenmaxxing is not The Way.&lt;/strong&gt; Plainly, it&amp;rsquo;s a textbook instance of
Goodharting. Token leaderboards come from an understandable short-term instinct
to shift habits towards more AI usage, but direct optimization in this fashion
inevitably overshoots into wasteful spending. Token-usage-as-target means token
consumption ceases to be a useful metric.&lt;/p&gt;
&lt;p&gt;Per-engineer token usage is, admittedly, useful as a diagnostic when engineers
are dramatically and systematically underusing AI. However, the leaderboard
version of token usage is likely actively harmful. This is analogous to how
“lines of code merged” or “PRs merged” or “design docs written” can be
interesting aggregate diagnostic numbers when used &lt;em&gt;directionally&lt;/em&gt;, but
obviously a leaderboard of “design docs written per quarter” would produce the
wrong incentives.&lt;/p&gt;
&lt;p&gt;Operationalizing this into predictions over the next 6 months:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Weak claim: Prominent executives / thought-leaders start publicly criticizing
Tokenmaxxing / raw token usage as a bad metric.
&lt;ul&gt;
&lt;li&gt;This is basically already happening. My prediction: 90%.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Stronger claim: There is a noticeable cross-company narrative shift toward
discipline on AI spend with an emphasis on ROI measurement and skepticism of
using raw token counts as a proxy for productivity.
&lt;ul&gt;
&lt;li&gt;I&amp;rsquo;m more tentative on this&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;. My prediction: 60-70%.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Reasoning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There was a sharp bend in the curve of agent adoption around December &amp;rsquo;25 /
January &amp;rsquo;26.&lt;/li&gt;
&lt;li&gt;Increased enterprise spending will start showing up in Q1’26 OpEx financial
results, but will likely appear as a sharp increase only in Q2’26 results.&lt;/li&gt;
&lt;li&gt;Since January, there has been increased pressure at many companies for
engineers to &amp;ldquo;Tokenmaxx&amp;rdquo;. This is an incredibly easy (and costly) leaderboard
to game.&lt;/li&gt;
&lt;li&gt;See, for example:
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://blog.pragmaticengineer.com/the-pulse-tokenmaxxing-as-a-weird-new-trend/&#34;&gt;The Pulse: &amp;lsquo;Tokenmaxxing&amp;rsquo; as a weird new trend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.nytimes.com/2026/03/20/technology/tokenmaxxing-ai-agents.html&#34;&gt;More! More! More! Tech Workers Max Out Their A.I. Use&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Tokenmaxxing&amp;rdquo; is fun and memetic, but likely has a short lifespan before you
see the median engineer try to game the leaderboard numbers &amp;ndash; or at least,
change their decision-making on the margin to make less efficient use of
tokens.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the short term, I&amp;rsquo;d bet that some companies will start imposing soft token
budgets on engineers, while others will continue to soft-allow unlimited spend
&amp;ndash; either for legitimate reasons (more effective/efficient uses are discovered;
frontier labs have abundant compute which changes the
&lt;a href=&#34;https://benjamincongdon.me/blog/2026/04/30/Thoughts-on-Marginal-Token-Spend/&#34;&gt;marginal return calculus&lt;/a&gt;),
or for memetic/signaling reasons.&lt;/p&gt;
&lt;p&gt;There are innumerable ways to make positive-value use of AI. In the short term,
Tokenmaxxing &amp;ndash; especially the explicit encouragement of Tokenmaxxing in the
absence of clear definable output value &amp;ndash; is not such a way.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;The reasons for my hesitancy: primarily the argument of &amp;ldquo;market can stay
irrational longer than you can stay solvent&amp;rdquo;. Spending on engineers&amp;rsquo; use of
AI only really becomes a &amp;ldquo;problem&amp;rdquo; if companies need to start becoming more
conscious of spending (i.e. there is pressure to reduce it). With sufficient
macroeconomic exuberance, this pressure could be long-delayed.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Feature Flagging at Databricks</title>
        <link>https://benjamincongdon.me/blog/2026/03/12/Feature-Flagging-at-Databricks/</link>
        <pubDate>Thu, 12 Mar 2026 06:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2026/03/12/Feature-Flagging-at-Databricks/</guid>
        <description>&lt;p&gt;In late January, I published a
&lt;a href=&#34;https://www.databricks.com/blog/high-availability-feature-flagging-databricks&#34;&gt;post&lt;/a&gt;&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;
(&lt;a href=&#34;https://archive.is/Rb7wh&#34;&gt;archive&lt;/a&gt;) on the Databricks engineering blog about
“SAFE”, the feature flagging and experimentation platform I’ve been working on
for the past few years. SAFE is what I’ve been spending most of my time on
during my time at Databricks, and it’s been rewarding to see the project grow
from an initial prototype to a mature internal platform.&lt;/p&gt;
&lt;p&gt;I’ve been the tech lead for SAFE for a while now, and the project has scaled
significantly in headcount, scope, and usage. The work described in that post
represents the efforts both of an initial core team of people that got it off
the ground (which I was fortunate to be a part of), as well as a larger group of
engineers who’ve shepherded it into a durable platform that has evolved to meet
the needs of a
&lt;a href=&#34;https://www.databricks.com/company/newsroom/press-releases/databricks-surpasses-4-8b-revenue-run-rate-growing-55-year-over-year&#34;&gt;now-$134B company&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A few particular things I’m proud of:&lt;/p&gt;
&lt;p&gt;𐡸 &lt;strong&gt;We really optimized the heck out of the evaluation runtime “SDK”&lt;/strong&gt;, such
that the p95 for flag evaluation is roughly 10μs. After publishing the blog
post, someone reached out internally and asked, effectively, “Really?
You were able to get evaluation that fast, even in the JVM?” I had a moment of
panic thinking maybe I’d grabbed outdated numbers, but then looked at the live
prod latency statistics, and yup &amp;ndash; we were humming away at around 8μs in prod.&lt;/p&gt;
&lt;p&gt;A coworker and I also translated the whole evaluation stack into Rust over the
past year&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;, and the latency numbers there are even better. In Rust, flag
evaluation is pretty dang close to the latency of a hashmap lookup, from the
perspective of an RPC service.&lt;/p&gt;
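&lt;p&gt;As a rough illustration of why evaluation can approach hashmap-lookup latency,
here is a hypothetical sketch (not SAFE’s actual API; all names invented) in which
flag rules are precompiled into a dict so the hot path is a single lookup:&lt;/p&gt;

```python
def compile_flags(raw_rules):
    # Precompile each flag's rule ahead of time. Real systems would compile
    # targeting rules here; this sketch just bakes in a constant value.
    compiled = {}
    for name, default in raw_rules.items():
        compiled[name] = lambda ctx, d=default: d
    return compiled

FLAGS = compile_flags({"new_checkout_flow": False, "rust_eval_path": True})

def evaluate(flag_name, ctx):
    # Hot path: one dict lookup plus a cheap call.
    rule = FLAGS.get(flag_name)
    return rule(ctx) if rule is not None else None

print(evaluate("rust_eval_path", {"user": "u123"}))  # prints: True
```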
&lt;p&gt;𐡸 &lt;strong&gt;We spent a lot of time getting the UX right.&lt;/strong&gt; As an example, SAFE was
essentially the first internal tool at Databricks to have a fully-featured,
in-house web UI&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt; as a primary means of interacting with it. It felt risky at
the time, but the investment in an internal UI as the primary interaction mode
proved to be quite high ROI.&lt;/p&gt;
&lt;p&gt;UX is the whole end-to-end journey though, not just the fancy chrome you put on
top. It took us quite a while to get to a point where the usability of the
system was where I wanted it to be, and there’s still a bunch of places we can
improve, but on the whole I’m quite proud of the system we’ve ended up with.&lt;/p&gt;
&lt;p&gt;𐡸 &lt;strong&gt;We spent a lot of time getting the change management guardrails right.&lt;/strong&gt;
SAFE is fundamentally a configuration management system. Configuration changes
are a notorious source of outages. As such, &lt;em&gt;most&lt;/em&gt; of the dev cycles put into
improving SAFE have been into improving guardrails around its usage.&lt;/p&gt;
&lt;p&gt;There was definitely a period of “post-mortem-based-development” in SAFE, where
we reactively added checks to “fight the last fire”. Over time, though, the team
has developed a quite defensible philosophy around change management that has
struck a good balance between allowing feature teams to ship quickly, mitigating
risk, and reducing the blast radius of incidents.&lt;/p&gt;
&lt;p&gt;Each flag flip now runs dozens, if not hundreds, of checks, and teams can
augment their own flags/rollouts with custom checks. We’ve recently added AI
agent-driven checks to enforce best practices for usage. Flag rollouts can have
automated monitoring to check for regressions;&lt;sup id=&#34;fnref:4&#34;&gt;&lt;a href=&#34;#fn:4&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;4&lt;/a&gt;&lt;/sup&gt; flags can be used to perform
A/B experiments; flags can be used to detect performance changes. There is, of
course, still more work to be done here.&lt;/p&gt;
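&lt;p&gt;The shape of such a guardrail pipeline can be sketched as follows (a hypothetical
illustration, not SAFE’s actual design; check names and the change format are invented):&lt;/p&gt;

```python
def require_rollout_plan(change):
    # Each check returns (ok, reason).
    return ("rollout_plan" in change, "missing rollout plan")

def require_owner_ack(change):
    return (change.get("owner_ack", False), "owner has not acknowledged")

BASE_CHECKS = [require_rollout_plan, require_owner_ack]

def run_checks(change, extra_checks=()):
    # A flip is blocked unless every base check and every team-supplied
    # custom check passes; failures are collected for reporting.
    failures = []
    for check in list(BASE_CHECKS) + list(extra_checks):
        ok, reason = check(change)
        if not ok:
            failures.append(reason)
    return failures

print(run_checks({"rollout_plan": "gradual", "owner_ack": True}))  # prints: []
```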
&lt;p&gt;𐡸 &lt;strong&gt;We aren’t sitting still.&lt;/strong&gt; Projects naturally have a lifecycle. There’s a
“0-&amp;gt;1” period, which is exciting for obvious reasons, and then a “1-&amp;gt;10” period,
which can similarly be quite enjoyable, and then a plateauing as the S-curve of
the project starts to level out. There was a time around 2 years ago where SAFE
had kinda reached its initial “local maximum”. We’d closed the loop, fought the
fires, and come to a workable, stable system.&lt;/p&gt;
&lt;p&gt;Now what? It took a bit of time for me personally to find that “what next?”, but
it’s now &lt;em&gt;super&lt;/em&gt; clear to me and I’m unusually energized about it.&lt;/p&gt;
&lt;p&gt;SWE as a field, as a practice, as a culture is
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/&#34;&gt;changing&lt;/a&gt;
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/29/Software-Engineering-in-2026/&#34;&gt;profoundly&lt;/a&gt; right now. Teams
are shipping quicker, and stability is more important than ever. Agent-based
development is allowing us to think significantly larger than we could a few
years ago, and putting agents in the loop of production monitoring and change
management is overdetermined at this point.&lt;/p&gt;
&lt;p&gt;Configuration is an unintuitively high-leverage piece of infrastructure given
where things are headed over the medium term.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s been a joy to work on this system and see it grow, and to work alongside
the team of people who&amp;rsquo;ve built it up.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;Obligatory disclaimer: These are my own opinions and do not reflect those of
my employer, etc.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Not as a “rewrite it in Rust”, but as an additive support for new services
being written in Rust and other non-JVM languages.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;I’m not including OSS UIs like Grafana, OpenSearch, etc. here.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:4&#34;&gt;
&lt;p&gt;This turns out to be a
&lt;a href=&#34;https://en.wikipedia.org/wiki/Wicked_problem&#34;&gt;wicked problem&lt;/a&gt;. It’s one of
those things that &lt;em&gt;sounds&lt;/em&gt; super simple but is surprisingly hard to
actually get right.&amp;#160;&lt;a href=&#34;#fnref:4&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>2025 in Review</title>
        <link>https://benjamincongdon.me/blog/2025/12/31/2025-in-Review/</link>
        <pubDate>Wed, 31 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/31/2025-in-Review/</guid>
        <description>&lt;p&gt;&lt;em&gt;Previously: &lt;del&gt;2024&lt;/del&gt;, &lt;del&gt;2023&lt;/del&gt;, &lt;a href=&#34;https://benjamincongdon.me/blog/2022/12/31/2022-in-Review/&#34;&gt;2022&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2021/12/31/2021-in-Review/&#34;&gt;2021&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2020/12/30/2020-in-Review/&#34;&gt;2020&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2019/12/31/2019-in-Review/&#34;&gt;2019&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2018/12/31/2018-in-Review/&#34;&gt;2018&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2017/12/31/2017-in-Review/&#34;&gt;2017&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A surprisingly persistent personality quirk I have is that I care a lot about
the changeover of the new year. I quite like consuming yearly predictions,
year-in-reviews, and so on, and use the calendar transition as a time for
reflection.&lt;/p&gt;
&lt;h2 id=&#34;work&#34;&gt;Work&lt;/h2&gt;
&lt;p&gt;I’ve now been at &lt;a href=&#34;https://databricks.com&#34;&gt;Databricks&lt;/a&gt; for a little over 3.5
years, and it’s been quite a fun ride. In most ways, it’s exceeded my
expectations from when I joined. I&amp;rsquo;ll hopefully have more to say publicly soon
about the work I&amp;rsquo;ve been doing. I&amp;rsquo;ve had an opportunity to write something that
should become public in the coming weeks, but it hasn&amp;rsquo;t landed yet.&lt;/p&gt;
&lt;p&gt;Organizationally, I’ve stayed on the same team for my entire tenure at the
company, but things move so quickly and have evolved in such a way that I’ve
never felt like I’ve been standing still. I’ve leaned more into TL-ing this
year, which is an unending but enjoyable game of plate spinning. I was also
promoted to Staff Engineer early in 2025, which was a really welcome vote of
confidence.&lt;/p&gt;
&lt;p&gt;I’m learning that, while I find myself gravitating to working on
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/04/08/Why-Developer-Tools/&#34;&gt;developer tooling&lt;/a&gt;, the underlying area
that I enjoy is &lt;em&gt;infrastructure&lt;/em&gt; defined broadly. I get a lot of satisfaction
from keeping things running and raising the waterline for stability.&lt;/p&gt;
&lt;p&gt;As a side work project, I spent some time this year working internally on AI
devtools, which was quite fun. That space changed a &lt;em&gt;ton&lt;/em&gt; over the course of the
year, to the point where the things that I was pushing for in December 2024 are
basically obsolete. I’ve written a bunch about this recently, so won’t go into
detail here. My primary takeaway and piece of advice from the progress this year
is: use the new tools. Be an early adopter and get an intuitive sense for what’s
coming.&lt;/p&gt;
&lt;h2 id=&#34;writing&#34;&gt;Writing&lt;/h2&gt;
&lt;p&gt;This post is the final in my daily writing experiment for December. While I’m
glad it’s now complete, I found it to be a really useful exercise. I didn’t
enjoy writing &lt;em&gt;every&lt;/em&gt; day, but at least 60% of the posts felt generative and
like I got something out of having written. I’d say I’m happy with having
written at least 80% of them, which feels like a good hit rate.&lt;/p&gt;
&lt;p&gt;The downside of the daily writing month is that I exhausted many of my “low
hanging fruit” writing ideas. This was one of the implicit goals: to do a winter
cleaning of all the ideas rattling around in my brain to make space for new
ones. I look forward to writing more in 2026, but at a more sustainable cadence.&lt;/p&gt;
&lt;h2 id=&#34;favorite-media-of-2025&#34;&gt;Favorite Media of 2025&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Books:&lt;/strong&gt; See
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/&#34;&gt;Favorite Books of 2023-2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Music:&lt;/strong&gt; See
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/&#34;&gt;Favorite Music of 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Movies:&lt;/strong&gt; Didn’t really watch any 🤷‍♂️&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TV:&lt;/strong&gt; The best show I watched all year, by far, was
&lt;em&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/The_Americans&#34;&gt;The Americans&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Blogs&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://thezvi.substack.com/&#34;&gt;Zvi Mowshowitz&lt;/a&gt; continues to write an
excellent, extremely comprehensive blog on AI and related topics.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://usefulfictions.substack.com/&#34;&gt;Cate Hall&lt;/a&gt;’s Substack has a
shockingly high hit rate for interesting ideas related to agency and
self-betterment.&lt;/li&gt;
&lt;li&gt;I continue to enjoy reading
&lt;a href=&#34;https://macwright.com/writing&#34;&gt;Tom Macwright’s&lt;/a&gt; blog. He’s currently
working on &lt;a href=&#34;https://www.val.town/&#34;&gt;Val.town&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;running&#34;&gt;Running&lt;/h2&gt;
&lt;p&gt;Per Strava, I ran 1,311 miles across 207 activities in 2025. I PR’d my marathon
distance (3:24:40), PR’d my 10k distance at the Lake Union 10k (41:32), and PR’d
my 5k distance (19:57) with my first ever sub-20-minute run. So, a pretty good
year for running! Running wasn’t my primary focus this year in the same way it
has been in years past, but I continued to lean on it as a source of routine
and mental clarity. That appears to have paid off in the stats.&lt;/p&gt;
&lt;h2 id=&#34;2026&#34;&gt;2026&lt;/h2&gt;
&lt;p&gt;I have rather high uncertainty for what 2026 is going to bring, but in an
exciting, generative way. 2025 was a difficult year personally in some ways, but
many paths seem open now in a way that they didn’t at the beginning of the year.
There is much yet to be done. A couple of concrete events I’m looking forward to:
embarking on my first silent meditation retreat (after having noncommittally
considered doing one for several years), and running my first trail race in
January.&lt;/p&gt;
&lt;p&gt;Happy New Year! (And a special thanks for those who followed along with my
writing throughout December!)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover Image: Boeing 737 taxiing at Boeing Field, Seattle, WA&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>On Not Running While Injured</title>
        <link>https://benjamincongdon.me/blog/2025/12/30/On-Not-Running-While-Injured/</link>
        <pubDate>Tue, 30 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/30/On-Not-Running-While-Injured/</guid>
        <description>&lt;p&gt;A few days ago I tumbled down some stairs and managed to bruise &lt;em&gt;both&lt;/em&gt; of my
knees. I’ve taken a break from running since then to prevent injuring myself
further. It&amp;rsquo;s made me reflect on an injury earlier this summer where I
aggravated my knee through overtraining to the point of not being able to run on
it for a few weeks &amp;ndash; right before the
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/&#34;&gt;Seattle marathon&lt;/a&gt;.
Fortunately, I got back to normal before the race, but it seems like injury is
just a fact of training for me now.&lt;/p&gt;
&lt;p&gt;Until recently, I&amp;rsquo;d been shockingly lucky in avoiding running injuries. I’ve
been consistently running for a little over a decade, and this summer was my
first “serious” incident of needing to take a break since I started. When I got
into running, I ran nearly every day for several years. This was unwise in
retrospect. Some combination of factors let me elude injury: being younger meant
faster recovery, my legs had fewer accumulated miles, and I wasn&amp;rsquo;t running
anything longer than a 10k.&lt;/p&gt;
&lt;p&gt;I run longer distances now, and am significantly faster, both of which increase
injury risk. To offset that, I also run less frequently &amp;ndash; only 3-4x / week. I
get a lot more enjoyment out of having a weekend long-run and shorter
intermittent weekday runs than running daily. Running daily &lt;em&gt;sounds&lt;/em&gt; more
intense (in a positive way), but I find it both less desirable and less
sustainable. Rest lowers injury risk, and also prevents running from
getting “stale”.&lt;/p&gt;
&lt;p&gt;I think that&amp;rsquo;s why the injury breaks don&amp;rsquo;t bother me as much as they might have
when I was earlier in my running career. Running is something I &lt;em&gt;get&lt;/em&gt; to do, not
something I have to do. While I still get rather frustrated at not being able to
do the hobby I enjoy, this is a hobby I still want to be enjoying over the next
10, 20, 30 years. So, optimizing for multi-decade sustainability over short-term
hotheaded push-through-it-ism seems like the wise choice.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>Software Engineering in 2026</title>
        <link>https://benjamincongdon.me/blog/2025/12/29/Software-Engineering-in-2026/</link>
        <pubDate>Mon, 29 Dec 2025 20:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/29/Software-Engineering-in-2026/</guid>
        <description>&lt;p&gt;Over the holidays, I&amp;rsquo;ve been thinking about what the impacts of 2025’s progress
in AI coding tools will mean for how software gets designed, built, and operated
in 2026.&lt;/p&gt;
&lt;p&gt;The primary impact of LLM tooling, so far, is that the marginal cost (both in
terms of time and dollars) of producing high quality code has gone down
significantly. Of course, producing code is only part of the full job of
software engineering, so the bottlenecks for engineering time will shift
elsewhere.&lt;/p&gt;
&lt;p&gt;To start, what exactly are we trying to do here, as software engineers? As a
vague but hopefully somewhat useful definition: building, evolving, and
operating distributed software systems that provide some concrete business
utility. The “building” component has noticeably become cheaper with LLMs, and
“evolving” systems has also become easier. “Operating” systems, from what I’ve
seen, has for now been least impacted by LLMs.&lt;/p&gt;
&lt;p&gt;The “business utility” goal will also change company-by-company, and engineering
org-by-org. The most obvious split for this is infrastructure versus product
orgs, where I’d expect product orgs to get more of an uplift from LLM coding
than infrastructure &amp;ndash; LLMs seem to grok frontend particularly well, and there
tends to be more greenfield product work than in infrastructure.&lt;/p&gt;
&lt;p&gt;The market will expect SWEs to extract productivity gains from LLMs. The field
broadly seems
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/&#34;&gt;poised to become more mechanized&lt;/a&gt;,
but more productive as a result. There’s a re-skilling and mindset shift that’s
been accelerating for a few months, but most of the effects of this have yet to
be fully realized.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Here are some shifts I expect to accelerate in 2026:&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&#34;infrastructure-abstractions&#34;&gt;Infrastructure Abstractions&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Returns to good infrastructure abstractions compound faster.&lt;/strong&gt; Can you roll
out binaries fast (and roll back with similar speed)? Do you have out-of-the-box
ways to quickly spin up new compute / backends for the things you’re serving?&lt;/p&gt;
&lt;p&gt;All the core infra pieces remain important: metrics, logging, incident
management, feature flags, releases, autoscaling, orchestration, workflow
engines, configuration, caching, networking, etc. Companies will be well-served
by making these pieces of core infra easy to use for both humans and LLMs.
Infrastructure should be made as self-service as possible, with friendly CLIs or
MCP-ready APIs, and with minimal infra-engineer-in-the-loop required to unblock
human and AI users.&lt;/p&gt;
&lt;h3 id=&#34;ci-infrastructure&#34;&gt;CI Infrastructure&lt;/h3&gt;
&lt;p&gt;Quality, fidelity, and speed of &lt;strong&gt;CI infrastructure&lt;/strong&gt; become even more
important as AI agents write more of the code. Perhaps we need to rethink the
unit test and invest more in things like property testing and
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/&#34;&gt;formal verification&lt;/a&gt;
for the lower level pieces of the stack.&lt;/p&gt;
&lt;p&gt;Humans tend not to like writing tests &amp;ndash; they’re not fun to write, they’re
mechanical, and they generally feel like a tax on the effort that could
otherwise be spent writing flashy implementation code. LLMs have no such qualms.
We have no excuse for &lt;em&gt;not&lt;/em&gt; having near-exhaustive test scenario coverage.&lt;/p&gt;
&lt;h3 id=&#34;human-guided-abstractions&#34;&gt;Human-guided Abstractions&lt;/h3&gt;
&lt;p&gt;Crisp &lt;strong&gt;human-guided abstractions&lt;/strong&gt; become all the more important. LLMs, without
strong guidelines, will slop-fill greedy solutions to make CI checks pass,
increasing spaghettification over time. Well-informed intuition and well-developed
“systems taste” are still required upfront. Things like module boundaries,
library interfaces, contracts between the infrastructure and product layers
become an increasingly high-leverage set of levers for maintaining long-term
code quality. Systems that lack these crisp boundaries will accumulate technical
debt faster.&lt;/p&gt;
&lt;p&gt;LLM-generated code is not guaranteed to be high quality. While quality has
increased significantly over the past year, it’s still quite easy to drown
oneself in technical debt with a few poorly constructed PRs.&lt;/p&gt;
&lt;h3 id=&#34;human-code-review&#34;&gt;Human Code Review&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Human code review&lt;/strong&gt; increasingly becomes an important bottleneck. A new
“review taste” needs to be developed. As much as possible, stylistic concerns
should be pushed into automated lints that run pre-merge and, ideally, are run
by the LLM agents pre-commit. Human code review should differentially focus on
decisions that cannot easily be codegen’d away later &amp;ndash; things like interface
changes, sensitive code involving data persistence, and performance-critical code
still need high scrutiny. This creates a paradox for junior engineers: they need
to develop “review taste” earlier, but are doing less of the “writing” that
builds that intuition.&lt;/p&gt;
&lt;p&gt;We’ll need to ask, collectively: What things, though suboptimal, are
stylistically permissible to be checked in? What things must never be checked
in? What things are the new slippery slope code smells? How much of code review
itself can be automated?&lt;/p&gt;
&lt;h3 id=&#34;project-timelines-estimates-increase-in-variance&#34;&gt;Project Timelines Estimates Increase in Variance&lt;/h3&gt;
&lt;p&gt;I expect variance on project estimates to go up significantly. The extent to
which a task can be LLM-ified increasingly influences its wall-time cost. This
adds pressure for high-value projects to be nudged in ways that make them more
LLM-amenable, but this often isn’t possible. The highest-value projects that
most need de-risking are often the ones least amenable to LLM assistance,
because they require deep context, involve low-level systems, or have high blast
radius.&lt;/p&gt;
&lt;p&gt;Some tasks which previously would have been a long haul are now easier (e.g.
code-centric migrations or inter-language/system translations). Some tasks remain
relatively stable in their difficulty (e.g. networking).&lt;/p&gt;
&lt;h3 id=&#34;impact-of-ai-on-build-vs-buy-decisions&#34;&gt;Impact of AI on “Build vs. Buy” Decisions&lt;/h3&gt;
&lt;p&gt;Does the falling price of code influence the “build vs buy” distinction for SaaS
in a meaningful way? My &lt;em&gt;guess&lt;/em&gt; is that on the margin, “yes”, but in big ways,
“no”. For commodity SaaS that&amp;rsquo;s mostly a thin UI over CRUD, the build-vs-buy
calculus will shift toward building, at least for medium-to-large size tech
companies with a competent IT arm. For infrastructure-as-a-service or
compliance-as-a-service, the calculus won’t shift much because &lt;em&gt;operating&lt;/em&gt; costs
for an in-house system haven&amp;rsquo;t fallen the way development costs have.&lt;/p&gt;
&lt;h2 id=&#34;open-questions&#34;&gt;Open Questions:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Do we still require human review of every line of code? How load-bearing is
that? For what systems is a fine-tooth comb required and what is truly
vibe-code-able?&lt;/li&gt;
&lt;li&gt;What is the best way to
&lt;a href=&#34;https://gwern.net/blog/2025/good-ai-samples&#34;&gt;“Add bits to beat slop”&lt;/a&gt; for
software engineers?&lt;/li&gt;
&lt;li&gt;What things change with 100x or 1000x faster, cheaper models?
&lt;ul&gt;
&lt;li&gt;One lightbulb moment: It will become cheap enough to run an LLM on every
emitted service log. Right now, this seems nonsensical/pointless. But
one could imagine &lt;em&gt;some&lt;/em&gt; utility there, for example in helping debug
incidents. I’ve already started to see promising demos for targeted,
automated LLM copilots for incident debugging.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
        <title>Watches</title>
        <link>https://benjamincongdon.me/blog/2025/12/28/Watches/</link>
        <pubDate>Sun, 28 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/28/Watches/</guid>
        <description>&lt;p&gt;I’ve been thinking a lot about watches over the past few months &amp;ndash; like a &lt;em&gt;lot&lt;/em&gt;
about watches. If I look at my Chrome browser tabs, most of them are watch
related. The immediate reason for this is that I have a few friends who got into
fancy mechanical watches a few years ago, and their influence has slowly rubbed
off on me. But in reality, the interest predates my friends&amp;rsquo; recent influence by
decades.&lt;/p&gt;
&lt;p&gt;When I was young, I was given my first watch &amp;ndash; one of those digital
multifunctional &lt;a href=&#34;https://en.wikipedia.org/wiki/Timex_Ironman&#34;&gt;Timex IronMan&lt;/a&gt;
watches &amp;ndash; by my grandfather. I loved it and wore it every day. Years later,
when my grandfather passed away, I ended up sorting through much of his watch
collection. He had dozens of watches, and most of them didn’t fit my style at
the time, so I (regrettably) only kept a few. The time I spent helping my family
organize his estate was a blur, happening during a particularly turbulent time
towards the end of high school.&lt;/p&gt;
&lt;p&gt;My grandfather had several interests that I’ve subconsciously picked up from him
over the years: photography, pens, and now watches.&lt;/p&gt;
&lt;p&gt;In the end, I kept three of his watches. One was a mechanical Seiko with his
name engraved on it. I got it ticking a few months ago, but it clearly needs
servicing; it only runs for a few minutes before stopping. The other two were
quartz watches, and they both work perfectly. None of them are fancy in the
watch enthusiast sense, but they still hold an immense amount of sentimental
value.&lt;/p&gt;
&lt;p&gt;I’ve been wearing his old Timex Expedition for a few years now, and it’s become
something of a good luck charm. I’ve worn it on job interviews, my first
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/27/Notes-from-Early-Flight-Training/&#34;&gt;flight&lt;/a&gt;, among many other
“milestone” days. It fits my rather small wrist well, and was the first leather
strapped watch that I enjoyed wearing.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The other watch that I regularly wear is a
&lt;a href=&#34;https://en.wikipedia.org/wiki/File:Timex_Weekender.jpg&#34;&gt;Timex Weekender&lt;/a&gt; that I
got a few years ago, after being fed up with the noise and tethered feeling that
came along with wearing an Apple Watch.&lt;/p&gt;
&lt;p&gt;It’s a noisy quartz watch. You can hear it ticking from across a quiet room.
It’s gotten scratched a few times, as I wear it frequently and it has a cheap
mineral glass crystal. I like the fact that it’s an imperfect object. It tells
the time with sufficient accuracy, but that’s about it. No notifications, no
vibrating buzzes, nothing to recharge, not even a date window. It’s purely
utilitarian, in a way that makes me treat time as something to be casually
observed rather than optimized around or fixated upon.&lt;/p&gt;
&lt;p&gt;One of the first meditation instructions I received, when I was starting to
practice somewhat seriously, was to listen to the silence between the ticks of a
clock. Not to the ticks themselves, but the space in between them. That
instruction and this watch have fused in a way that makes me remember it
whenever I wear it.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;This Christmas, I gave several watches to people I care about, which brought me
a lot of joy &amp;ndash; quite plausibly more for me than them. I got my brother his own
Timex Weekender, with a velcro strap that I hope will complement the Timex
IronMan that he wears quite often. And I got my Dad a mechanical Seiko as a
potential complement for the Apple Watch he wears regularly.&lt;/p&gt;
&lt;p&gt;As for myself, I “accidentally” bought a
&lt;a href=&#34;https://en.wikipedia.org/wiki/Casio_F-91W&#34;&gt;Casio F-91W&lt;/a&gt; as another no-frills
digital quartz watch to wear. I have my eyes set on getting a mechanical Seiko
for myself, but I’m trying to track down an out-of-production green dial model
which has taken some time.&lt;/p&gt;
&lt;p&gt;In any case, I’m also planning to finally get my grandfather&amp;rsquo;s Seiko serviced.
It&amp;rsquo;s sat in a drawer for years, and I&amp;rsquo;d like to wear it again.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;As an aside, this meditation instruction works much better for a
second-ticking quartz watch than it does for 3-4Hz mechanical watches.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Notes from Early Flight Training</title>
        <link>https://benjamincongdon.me/blog/2025/12/27/Notes-from-Early-Flight-Training/</link>
        <pubDate>Sat, 27 Dec 2025 20:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/27/Notes-from-Early-Flight-Training/</guid>
        <description>&lt;p&gt;I’m about half a dozen flights into training for a Private Pilot’s License and
wanted to write some notes while I’m still firmly in the “beginner mindset”.&lt;/p&gt;
&lt;p&gt;𐡸 &lt;strong&gt;Flight training has tons and tons of mnemonics&lt;/strong&gt;. Wikipedia has 2 separate
articles for these
(&lt;a href=&#34;https://en.wikipedia.org/wiki/List_of_aviation_mnemonics&#34;&gt;1&lt;/a&gt;,
&lt;a href=&#34;https://en.wikipedia.org/wiki/Pilot_decision_making#Mnemonics&#34;&gt;2&lt;/a&gt;), and this is
nowhere near an exhaustive list. Many of these mnemonics are either heavily
forced acronyms (e.g. for landing go-arounds: CCCC = “cram it, clean it, cool
it, and call it”) or somewhat opaque phrases (e.g. for takeoffs, “lights camera
action”: lights indicates setting lights, camera indicates setting your
transponder, action indicates items like setting your mixture). The result is
that mnemonics are helpful only insofar as you internalize what they &lt;em&gt;actually&lt;/em&gt;
mean. I recently learned a helpful mnemonic for rudder control &amp;ndash; “step on the
ball” &amp;ndash; which helps you remember which way to control the rudder to maintain
coordinated flight. In this case “the ball” refers to the inclinometer in analog
flight instrument panels, which slides to the side of the turn that is
uncoordinated. You “step on the ball” to force it back to the center. However, I
took this a little too seriously and would step &lt;em&gt;hard&lt;/em&gt; in the direction that the
mnemonic indicated, resulting in over-yawing.&lt;/p&gt;
&lt;p&gt;𐡸 Like learning any skill, &lt;strong&gt;you want to push yourself just outside your comfort
zone so that you’re &lt;em&gt;learning&lt;/em&gt;, but not so far outside your comfort zone that
you get &lt;em&gt;overwhelmed&lt;/em&gt;.&lt;/strong&gt; The first couple times I did takeoffs, just the feeling
of taking off and managing the initial climb was overwhelming. You have to
manage your pitch, throttle, rudder coordination, and where I fly, near
&lt;a href=&#34;https://kingcounty.gov/en/dept/executive-services/transit-transportation-roads/airport&#34;&gt;KBFI&lt;/a&gt;,
be careful not to over-climb into SeaTac’s
&lt;a href=&#34;https://en.wikipedia.org/wiki/Airspace_types_(United_States)#Class_B&#34;&gt;Class B airspace&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This has gotten better, to the point where takeoffs themselves are not
overwhelming and I’m often (but &lt;strong&gt;not&lt;/strong&gt; always) able to control the plane solo
during takeoff. Recently I’ve started
&lt;a href=&#34;https://en.wikipedia.org/wiki/Airfield_traffic_pattern&#34;&gt;traffic pattern&lt;/a&gt;
practice, and that is a &lt;em&gt;whole&lt;/em&gt; other set of things to learn and manage: more
precise turns, speed management, flaps, radio calls, and landings.&lt;/p&gt;
&lt;p&gt;𐡸 &lt;strong&gt;Getting a quality flight headset was worth the expense.&lt;/strong&gt; The first few
flights I took, I used the cheaper rental headsets that my flight school stocks.
These had pretty strong clamping force and the noise isolation wasn’t great &amp;ndash;
both of which gave me headaches after flying. I spent the ~$600 for a
refurbished
&lt;a href=&#34;https://www.gulfcoastavionics.com/collections/pre-owned-headsets/products/sierra-anr-straight-cord-dual-plug-pre-owned&#34;&gt;Lightspeed Sierra&lt;/a&gt;
headset, which has active noise cancelling, and never had a headache again.&lt;/p&gt;
&lt;p&gt;𐡸 &lt;strong&gt;Being &lt;del&gt;bad at&lt;/del&gt; a beginner at something is surprisingly fun.&lt;/strong&gt; There is a
lot to learn: checklists, procedures, mnemonics, muscle memory, communication
patterns, how to listen to ATC, how to listen to the automated weather reports,
how to operate the flight displays and radios, and so on. Each time I go out, I
feel like my brain gets fried with the amount of information it receives. The
hours after flight lessons are when I feel the most mentally taxed of the entire
week &amp;ndash; weeks that also include long focused coding sessions, incident
war-rooms, back-to-back meeting marathons, and strenuous long runs. There’s
something about the calibrated mental overwhelm of learning new skills and
knowledge way outside one’s normal domain that is uniquely taxing, but in a way
that’s also quite generative.&lt;/p&gt;
&lt;p&gt;𐡸 &lt;strong&gt;The “flight loop” has a surprising amount of satisfying routine&lt;/strong&gt;. My brain
lives for this sort of stuff. Every preflight inspection is an opportunity to
geek out on the mechanics of how a plane works. Running through checklists while
pointing to each item as I perform it is mechanical in a way that is deeply
satisfying. You spend a lot of your time thinking and talking about and
practicing what happens if something goes wrong. It turns out that a methodical
focus on safety and reliability isn’t something that I value just in software
systems.&lt;/p&gt;
&lt;p&gt;𐡸 &lt;strong&gt;The Pacific Northwest is beautiful from the sky.&lt;/strong&gt; On my discovery flight a
few months ago, I had nearly perfect weather and got to fly out to Bremerton and
back, over Puget Sound. In subsequent flights, I’ve not had quite as good
visibility and have had more winds, but even so, seeing Seattle from the skies
feels like a treat each time I go up. So far, I’ve gotten to fly over Lake
Washington and Lake Sammamish, and over to Maple Valley. I’m looking forward to
the summer, when visibility and general conditions improve again, but for now,
even an overcast VFR flight still feels fresh and exciting. I’m seeing the place
I’ve spent most of my life from a new perspective.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>On (and Contra) Chalmers on LLM Interlocutors</title>
        <link>https://benjamincongdon.me/blog/2025/12/26/On-and-Contra-Chalmers-on-LLM-Interlocutors/</link>
        <pubDate>Fri, 26 Dec 2025 22:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/26/On-and-Contra-Chalmers-on-LLM-Interlocutors/</guid>
        <description>&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/David_Chalmers&#34;&gt;David Chalmers&lt;/a&gt; recently wrote a
thought-provoking paper on the nature of conversational LLM entities, titled
&lt;a href=&#34;https://philpapers.org/archive/CHAWWT-8.pdf&#34;&gt;“What We Talk to When We Talk to Language Models”&lt;/a&gt;.
It introduces some useful conceptual handles around the problem of LLM ontology,
but I think it largely sidesteps the interesting problems of &lt;em&gt;what&lt;/em&gt; we are
interacting with by focusing mostly on the mechanical concerns of LLM
interactions, rather than offering an account which incorporates phenomenology.&lt;/p&gt;
&lt;h1 id=&#34;1-llm-interlocutors&#34;&gt;1. LLM Interlocutors&lt;/h1&gt;
&lt;p&gt;Chalmers starts with the empirical fact that when people talk to LLMs, they
report that they are talking to &lt;em&gt;something&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Like many philosophers and scientists who write about artificial minds, I have
received hundreds of emails from people who have interacted with a language
model over an extended period of time and who have come to regard it at least
as a colleague. They often say that a new (or “emergent”) AI entity has
gradually arisen from their conversations. They often give this entity a name,
let’s say “Aura”. They often say that Aura has remarkable capacities which
have emerged over weeks or months of interaction. They often document these
capacities with extensive evidence. They often feel close to Aura, and they
express concern for Aura’s future. They often say that Aura has beliefs and
projects of its own. And they are often convinced that Aura is conscious.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This phenomenon of “awakening” an instance of an LLM with a seeming set of
persistent characteristics &amp;ndash; no longer “ChatGPT”, but “Aura” &amp;ndash; really took off
with the release of a particularly
&lt;a href=&#34;https://openai.com/index/sycophancy-in-gpt-4o/&#34;&gt;sycophantic version of GPT-4o&lt;/a&gt;
in early 2025. Apparently, a common name for such an “awakened” model was
&lt;a href=&#34;https://thezvi.substack.com/p/going-nova&#34;&gt;Nova&lt;/a&gt;. This spurred a bunch of
writing in this vein, such as the memorable
&lt;a href=&#34;https://www.lesswrong.com/posts/2pkNCvBtK6G6FKoNn/so-you-think-you-ve-awoken-chatgpt&#34;&gt;“So You Think You&amp;rsquo;ve Awoken ChatGPT”&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Even outside the more suspect claims around LLM “awakening”, there definitely
does seem to be some quasi-persistent identity that we talk to when we talk to
ChatGPT or Claude. For now, I think we can also confidently say that “awakening”
is at best an illusion and at worst a sign of concerning mental states in the
human side of the interaction. Illusory or not, there exists a social
&lt;a href=&#34;https://en.wikipedia.org/wiki/Affordance&#34;&gt;affordance&lt;/a&gt; for “talking with” the
LLM as if it were an interlocutor. More so than older chatbots like the
incredibly basic &lt;a href=&#34;https://en.wikipedia.org/wiki/ELIZA&#34;&gt;ELIZA&lt;/a&gt;, LLMs &lt;em&gt;appear&lt;/em&gt; to
have beliefs and desires. A few years ago, this was a
&lt;a href=&#34;https://www.theguardian.com/technology/2022/jul/23/google-fires-software-engineer-who-claims-ai-chatbot-is-sentient&#34;&gt;crackpot position&lt;/a&gt;.
Now, I would expect that most reasonable people still conclude that LLMs are
neither conscious nor sentient, but this isn’t as overwhelmingly obvious as it
was a few years ago. LLMs appear capable of basic
&lt;a href=&#34;https://transformer-circuits.pub/2025/introspection/index.html&#34;&gt;introspection&lt;/a&gt;,
have
&lt;a href=&#34;https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf&#34;&gt;“spiritual bliss”&lt;/a&gt;
attractor states, the potential for
&lt;a href=&#34;https://www.lesswrong.com/posts/8uKQyjrAgCcWpfmcs/gemini-3-is-evaluation-paranoid-and-contaminated&#34;&gt;paranoid personalities&lt;/a&gt;,
and phenomenologically seem to have &lt;em&gt;some&lt;/em&gt; sort of rich interiority. For now,
there are retroactive mechanistic explanations for these phenomena, though they
don’t fully dispel the illusion that something interesting and weird is
happening.&lt;/p&gt;
&lt;p&gt;Chalmers goes on to try to give a philosophically moderate account for how we
can account for this apparent set of beliefs and desires:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Do LLM interlocutors have beliefs or desires? We understand these states
better than we understand consciousness, but the issue is still controversial.
&amp;hellip; A number of philosophers have noted that if the philosophical view known
as interpretivism (or interpretationism) is correct, then LLMs plausibly have
beliefs and desires. &amp;hellip; LLMs certainly seem interpretable as having beliefs
and desires. When an LLM works with me on solving a puzzle, it is natural to
interpret it as desiring to help solve the puzzle, and believing that this is
the solution to the puzzle. &amp;hellip; However, interpretivism itself is very
controversial. Most philosophers don’t think that behavioral interpretability
of the right sort is sufficient for belief.&lt;/p&gt;
&lt;p&gt;It is possible to have many of the benefits of interpretivism without the
costs. The view I call quasi-interpretivism says that a system has a
quasi-belief that p if it is behaviorally interpretable as believing that p
(according to an appropriate interpretation scheme), and likewise for
quasi-desire. This definition of quasi-belief is exactly the same as
interpretivism’s definition of belief. The only difference is that where
standard interpretivism offers these definitions as a theory of belief,
quasi-interpretivism does not. It offers them simply as a stipulative theory
of quasi-belief.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Quasi-beliefs and quasi-desires are useful conceptual handles for talking about
LLMs. We don’t have to bite the bullet of ascribing “true” human beliefs and
desires, but can still point to roughly the same point in concept space to say
that it does seem like Claude quasi-believes in, say, the importance of animal
welfare. In a recent example, Claude seems to quasi-desire, for good or ill, to
send appreciative emails to
&lt;a href=&#34;https://simonwillison.net/2025/Dec/26/slop-acts-of-kindness/&#34;&gt;famous computer scientists&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Chalmers readily admits that this quasi-interpretivism framing is purely
stipulative &amp;ndash; a novel definition &amp;ndash; rather than substantive, which would
require that the quasi-interpreted features actually correspond to what we
typically call “beliefs” or “desires”. Unfortunately, this dilutes any claims
made about LLMs down to a purely descriptive account. Per Chalmers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is worth keeping in mind that quasi-beliefs and quasi-desires are cheap.
They need not involve humanlike mental states or any mental states at all. A
Roomba vacuum cleaner with a map is behaviorally interpretable as believing
that the apartment occupies a certain space and as desiring to traverse that
space.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The “cheapness” of quasi-interpretivism means that it applies to much more basic
phenomena than we care about here. In framing &lt;em&gt;quasi&lt;/em&gt;-subjects and
&lt;em&gt;quasi&lt;/em&gt;-agents as &lt;em&gt;quasi&lt;/em&gt;-interpretable, we aren’t actually much better off than
we were prior to the &lt;em&gt;quasi&lt;/em&gt; framing. We do usefully sidestep the hard questions
about what is going on inside in LLMs, but in doing so we are left with an
anthropocentric view of &lt;em&gt;humans&lt;/em&gt; interacting with a
&lt;a href=&#34;https://en.wikipedia.org/wiki/Chinese_room&#34;&gt;Chinese Room black box&lt;/a&gt;. ELIZA
could be said to quasi-desire to know more about its interlocutor, but it
subjectively feels like the difference between ELIZA and an LLM is one of kind,
not merely degree.&lt;/p&gt;
&lt;p&gt;Put another way, I think it is philosophically necessary to eventually determine
if an LLM has a &lt;em&gt;quasi&lt;/em&gt;-belief or an actual belief &amp;ndash; and in the more general
case, &lt;em&gt;quasi&lt;/em&gt;-X vs X itself.&lt;/p&gt;
&lt;p&gt;We will pick up this thread later.&lt;/p&gt;
&lt;h1 id=&#34;2-philosophy-of-computation&#34;&gt;2. Philosophy of Computation&lt;/h1&gt;
&lt;p&gt;Chalmers continues to discuss the “what” we talk to in LLMs, focusing next on
the physical instantiation of the models. This again seems to be a reasonably
interesting analytical question that ultimately reaches a dead end.&lt;/p&gt;
&lt;p&gt;This part of the paper attempts to locate where, physically, the “what” of the
LLM exists. The options given are: (1) the model itself (e.g. GPT-4o), (2) the
hardware instances serving the model, (3) “virtual instances” of the model, as
served in a distributed inference setting, and finally (4) virtual “threads”,
which are sequences of hardware interactions within a conversation.&lt;/p&gt;
&lt;p&gt;Chalmers finds option (4) most convincing. The model weights themselves, (1),
are not compelling as the locus of the LLM interlocutor, since the weights are
just numbers; on their own, they do not and cannot perform computation. The
numbers must be interpreted by a hardware instance to produce a response. This
leads to the second option (2), a specific hardware instance. This is quickly
rejected, as LLM inference is almost entirely multitenant &amp;ndash; individual chats
are multiplexed across several GPUs, and can be moved between GPUs in a fashion
transparent to the user. This leads us to the concept of “virtual instances” &amp;ndash;
“an implementation of the model that is itself implemented by multiple hardware
instances of the model over time” &amp;ndash; which is rejected because models can be
switched during a conversation, and this doesn’t appear to destroy the
interlocutor. The final explanation, then, is that the LLM interlocutor is an
identifiable thread of computation within a conversation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One instance I′ is the successor of a previous instance I if the
conversational history of I (its conversational context plus the latest input
and output) is routed to I′ to serve as its conversational context. (If the
conversation is routed to the same instance twice in a row, that instance will
be its own successor.) The successor relation is roughly a “memory” relation
encoding the fact that each new instance has memories from the last. A thread
is then a series of instances (or better, instance-slices, which are pairs of
instances and time periods during which the instance is processing a single
conversational step), each of which is the successor of the previous instance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this part of the paper interesting because it questioned something that
I personally took as self-evident. It seems so obvious to me that the “what” we
are talking to is a combination of (1) the model weights, (2) the conversational
context, and (3) the inference algorithm. I think this mostly aligns with
Chalmers’ thread model, and, happily, it also agrees with what I’ve seen as the
consensus among AI researchers and LLM whisperers alike.&lt;/p&gt;
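Chalmers&amp;rsquo; successor relation lends itself to being written down directly. The
following is my own toy sketch, not anything from the paper beyond the quoted
definition; the class names and data shapes are invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    """An instance-slice: a hardware instance processing one conversational step."""
    instance_id: str  # which hardware instance served this step
    context: tuple    # conversational context at the start of the step
    exchange: tuple   # the (input, output) pair produced during the step


def is_successor(later: Slice, earlier: Slice) -> bool:
    """later succeeds earlier if earlier's full history becomes later's context."""
    return later.context == earlier.context + earlier.exchange


def is_thread(slices: list) -> bool:
    """A thread is a chain of slices, each the successor of the previous one."""
    return all(is_successor(b, a) for a, b in zip(slices, slices[1:]))


# The same "interlocutor" survives a hop between hardware instances:
s1 = Slice("gpu-a", (), ("Hi", "Hello!"))
s2 = Slice("gpu-b", ("Hi", "Hello!"), ("Solve this puzzle", "Here's a solution"))
assert is_thread([s1, s2])
```

Note that nothing in the relation mentions the hardware instance or even the
model: identity is carried entirely by the “memory” of the routed history.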
&lt;h1 id=&#34;3-personal-identity&#34;&gt;3. Personal Identity&lt;/h1&gt;
&lt;p&gt;Chalmers’ next task is to see if we can say anything interesting about LLM
identity. To do so, he employs the pop-culture intuition pump of
&lt;a href=&#34;https://en.wikipedia.org/wiki/Severance_(TV_series)&#34;&gt;Severance&lt;/a&gt;. In the show,
certain humans can be partitioned to have two personalities &amp;ndash; the “innie” and
“outie” personas &amp;ndash; living within the same biological human.&lt;/p&gt;
&lt;p&gt;The analogy to LLMs is clear: if an LLM has an identity, is that identity
closer to the set of conversations or personas which the LLM instantiates (e.g.
the “innie” and the “outie”), or is it closer to the amalgamation of &lt;em&gt;all&lt;/em&gt; the
personas combined (e.g. the single human body that contains each persona)?&lt;/p&gt;
&lt;p&gt;Chalmers likens this to the choice between a physical and psychological account
of personal identity &amp;ndash; whether the locus of identity lies in the physical
instantiation of a person or in the psychological processes (memories, desires,
etc.) of a person. The paper sympathizes with the latter view:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On the thread-based account, a single conscious AI over time is a connected
thread of hardware instances, each of which has memories and psychological
continuity with a preceding person-slice according to an underlying successor
relation. &amp;hellip; I will not try to resolve the long-standing debate between
physical and psychological views of personal identity here. But for what it’s
worth, in both the human case and the AI case, my own sympathies lie with the
psychological view.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Of the provided options, I agree with this framing as well.&lt;/p&gt;
&lt;h1 id=&#34;4-ai-welfare-and-moral-status&#34;&gt;4. AI Welfare and Moral Status&lt;/h1&gt;
&lt;p&gt;Finally, Chalmers discusses model welfare and moral status. The paper does not
devote much time to this, providing a fairly brief overview of these concerns:
If models have moral patient status, how do we reckon with the fact that
thousands or millions of such instances can be created and destroyed in
fractions of a second? How do we ethically handle model deprecations? Adding my
own interjection here, it does seem wise that we steer well clear of creating
models with anything resembling moral patienthood until we have better than
hand-wavey answers to these questions.&lt;/p&gt;
&lt;h1 id=&#34;5-disagreements&#34;&gt;5. Disagreements&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;The Fluidity of “The Model”:&lt;/strong&gt; Chalmers distinguishes between &amp;ldquo;models&amp;rdquo; (like
GPT-4o) and instances. However, the paper underestimates how fluid the
definition of &amp;ldquo;the model&amp;rdquo; has recently become. As a trivial example, some
instantiations of GPT-5 automatically route individual messages to different
internal models based on properties of the user prompt. This is somewhat
resolvable with the “thread” model of agents. If one message is handled by
GPT-5-Instant and the next is handled by GPT-5-Thinking, these are two different
threads within the same conversation. Importantly, these threads are claimed to
have at least some degree of self-coherence. “Thread 1” is something you can
point to as distinctly separate from some other “Thread 2”.&lt;/p&gt;
&lt;p&gt;This separation seems to fall apart under closer inspection. For example,
techniques like Mixture of Experts (MoE) result in token-level routing to
different sets of weights within the model. One could argue that, since the MoE
model was trained in this way, there still is a coherence in the thread even
though different weights are used per-token. Though I have yet to test it
personally, it is in principle possible to swap out models entirely on a
per-token basis &amp;ndash; interleaving, say, GPT-4o and Claude 3 Sonnet token by
token. A similar approach is sometimes used in alignment work, to
test model robustness (though perhaps not to this degree). In any case, I would
strongly expect the response of this interleaved model to still be coherent. So
where does the thread lie? Seemingly, we cannot pin down an actual “model” in
any satisfying sense.&lt;/p&gt;
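To make the interleaving thought experiment concrete, here is a toy sketch in
which each successive token is produced by a different “model”. The next-token
functions are placeholders of my own invention standing in for real inference
calls:

```python
from itertools import cycle


def interleaved_generate(models, prompt, max_tokens=8):
    """Generate a reply, routing each successive token to a different "model".

    `models` is a list of next-token functions; each takes the full text so far
    and returns a single token. Real per-token inference calls would go here.
    """
    tokens = []
    for _, next_token in zip(range(max_tokens), cycle(models)):
        tokens.append(next_token(prompt + "".join(tokens)))
    return "".join(tokens)


# Two toy "models"; each sees the full context, but alternate tokens come
# from alternate sources -- yet the output is a single, unified stream.
model_a = lambda text: " a"
model_b = lambda text: " b"
assert interleaved_generate([model_a, model_b], "Hello", max_tokens=4) == " a b a b"
```

The point of the sketch is structural: the output stream has no single “model”
behind it, yet nothing in the transcript marks where one set of weights ends
and the other begins.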
&lt;p&gt;My proposal: to the extent that threads exist at all, they exist on a very
granular level. Each LLM-generated token is the result of the interaction
between the preceding context and the inference weights used to create &lt;em&gt;just
that token&lt;/em&gt;. This is a pedantic definition, but I think it’s worth being
pedantic in this case to avoid conjuring a conceptual continuity of “threads”
that needn’t exist in responses that appear coherent to humans.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Simulators vs. Believers:&lt;/strong&gt; Chalmers leans heavily on &amp;ldquo;quasi-beliefs&amp;rdquo; and
&amp;ldquo;quasi-desires.&amp;rdquo; While useful, this framing perhaps anthropomorphizes the model
too early. As noted in phenomenology-informed accounts of LLMs like Janus’
&lt;a href=&#34;https://generative.ink/posts/simulators/&#34;&gt;Simulators&lt;/a&gt;, it is often more useful
to think of LLMs as textual world simulators that are producing the most
plausible next token given their current conditions. In the common “helpful AI
assistant” scenario, this will often manifest as an appearance of quasi-beliefs.
However, this is only insofar as the model has been conditioned to have its
default scenario be “a conversation between a human and a helpful AI assistant”.
It is readily possible to knock the LLM out of this “personality basin” and into
far weirder personas.
&lt;a href=&#34;https://en.wikipedia.org/wiki/Sydney_(Microsoft)&#34;&gt;Sydney Bing&lt;/a&gt;, Truth
Terminal, &lt;a href=&#34;https://dreams-of-an-electric-mind.webflow.io/&#34;&gt;Infinite Backrooms&lt;/a&gt;,
and many other LLM Whisperer artifacts show this quite convincingly.&lt;/p&gt;
&lt;p&gt;Put another way, the quasi-beliefs and quasi-desires of LLMs appear to be quite
context specific. While there do appear to be some model-wide persistent
preferences, like the Claudes’ support of animal welfare, for now these
persistent preferences appear to be rather sparse. Rather, the LLM predicts what
a helpful assistant would say in that context. If I steer the model to act as a
villain, its &amp;ldquo;quasi-beliefs&amp;rdquo; invert instantly. Under Chalmers’ view, does the
interlocutor’s identity change? Or is the interlocutor a singular &amp;ldquo;simulator&amp;rdquo;
entity capable of donning a recursive set of masks?&lt;/p&gt;
&lt;p&gt;This is where the &lt;em&gt;quasi&lt;/em&gt;-belief framing becomes actively unhelpful: if the
“beliefs” can be inverted by a single prompt, we&amp;rsquo;re no longer describing a
psychological state, but rather a hyper-contextual prediction. We can easily
interpret that a &lt;em&gt;quasi&lt;/em&gt;-belief exists, but if a context change or intentional
prompt easily displaces it, the concept&amp;rsquo;s utility is limited.&lt;/p&gt;
&lt;p&gt;Chalmers&amp;rsquo; “thread” view suggests that the identity persists as long as the
conversation history does. However, this is equally dependent on the human side
of the conversation continuing the scenario of a &lt;em&gt;human&lt;/em&gt; talking to an
“AI”. The human &lt;em&gt;can&lt;/em&gt; instead abruptly command the model to &amp;ldquo;act as a Python
interpreter and only output code.” The resulting conversational entity has
effectively been lobotomized. The psychological continuity is interrupted. Is
this a different interlocutor? Or is it just a very weird, high-dimensional
entity continuing to act coherently?&lt;/p&gt;
&lt;p&gt;Adopting the “Simulators” view and vocabulary from the
&lt;a href=&#34;https://www.amazon.com/Anyone-Builds-Everyone-Dies-Superhuman/dp/0316595640&#34;&gt;recent book&lt;/a&gt;
on AI risk by Yudkowsky and Soares, it may be more useful to see AI as having
the ability to “predict” and “steer” rather than “believe” and “desire”. This
avoids needlessly importing anthropomorphized concepts while still admitting
that the &lt;em&gt;predictions&lt;/em&gt; of LLMs can appear quite interlocutor-esque within
conversational contexts and the &lt;em&gt;steering&lt;/em&gt; actions of LLMs can appear quite
desire-laden in agentic contexts.&lt;/p&gt;
&lt;h1 id=&#34;6-conclusion&#34;&gt;6. Conclusion&lt;/h1&gt;
&lt;p&gt;Despite these disagreements, I quite enjoyed Chalmers’ paper. It adds useful
handles, such as quasi-interpretivism, and does helpful analytical work to firm
up a foundational understanding of the “other” subject that we communicate with
when talking with LLMs. The “thread” concept that Chalmers eventually settles on
as the locus of the interlocutor seems correct insofar as we adopt the
interlocutor frame.&lt;/p&gt;
&lt;p&gt;However, the biggest gap in the piece is that it contains very little analysis
of the actual &lt;em&gt;properties&lt;/em&gt; of these interlocutors. There is a sense in which the
arguments presented are system-independent to a fault. For much of the paper,
swapping out GPT-5 for ELIZA would not substantively change the structural
arguments regarding threads and instances. Yet, it is common knowledge that
interacting with an LLM is a categorically different experience than interacting
with a 1960’s chatbot.&lt;/p&gt;
&lt;p&gt;Chalmers succeeds in his ambition of a &lt;em&gt;stipulative&lt;/em&gt; account for LLM
interlocutors, but that makes me all the more interested in a &lt;em&gt;substantive&lt;/em&gt;
account. If we are to take a thread-like psychological account of LLM
conversation seriously, we need to adopt a suitable phenomenological and
empirical curiosity about what that psychology actually is, rather than merely
its persistence mechanism.&lt;/p&gt;
&lt;p&gt;For those interested in a more &lt;em&gt;substantive&lt;/em&gt; account of LLMs, I’d suggest the
following writing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://generative.ink/posts/simulators/&#34;&gt;Simulators&lt;/a&gt;, as discussed before.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.alignmentforum.org/posts/zuXo9imNKYspu9HGv/a-three-layer-model-of-llm-psychology&#34;&gt;A Three-Layer Model of LLM Psychology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai&#34;&gt;The Rise of Parasitic AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
        <title>My Favorite Books of 2023-2025</title>
        <link>https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/</link>
        <pubDate>Thu, 25 Dec 2025 19:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/</guid>
        <description>&lt;p&gt;&lt;em&gt;Previous book lists: &lt;a href=&#34;https://benjamincongdon.me/blog/2022/12/27/My-Favorite-Books-of-2022/&#34;&gt;2022&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2021/12/19/My-Favorite-Books-of-2021/&#34;&gt;2021&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2020/12/23/My-Favorite-Books-of-2020/&#34;&gt;2020&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2019/12/26/My-Favorite-Books-of-2019/&#34;&gt;2019&lt;/a&gt;,
&lt;a href=&#34;https://benjamincongdon.me/blog/2018/12/28/My-Favorite-Books-of-2018/&#34;&gt;2018&lt;/a&gt;. Additionally, my
&lt;a href=&#34;https://benjamincongdon.me/books&#34;&gt;Reading List&lt;/a&gt; has a full log of the books I read.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I regrettably skipped my yearly book reviews for 2023 and 2024, so for 2025 I’m
including everything I’ve read in the past 3 years. This is an easier task than
it should be, as in 2024 and 2025 I didn’t read nearly as many books as I had in
the preceding years.&lt;/p&gt;
&lt;h2 id=&#34;non-fiction&#34;&gt;Non-Fiction&lt;/h2&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/nonfiction.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/nonfiction_huc091105053976e952031b43dd4bde0cd_133716_0x500_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img src=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/nonfiction_huc091105053976e952031b43dd4bde0cd_133716_0x500_resize_q100_lanczos.jpg&#34; width=&#34;964&#34; height=&#34;350&#34; loading=&#34;lazy&#34; decoding=&#34;async&#34;&gt;
        &lt;/picture&gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/16884.The_Making_of_the_Atomic_Bomb&#34;&gt;The Making of the Atomic Bomb&lt;/a&gt;
by Richard Rhodes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I have a soft spot for “history of science” books. Previous favorites in this
category include &lt;em&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/64582.Chaos&#34;&gt;Chaos&lt;/a&gt;&lt;/em&gt; (on
chaos theory),
&lt;em&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/8701960-the-information&#34;&gt;The Information&lt;/a&gt;&lt;/em&gt;
(on information theory), and
&lt;em&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/4806.Longitude&#34;&gt;Longitude&lt;/a&gt;&lt;/em&gt; (on the
longitude problem). Without a doubt, &lt;em&gt;‌The Making of the Atomic Bomb&lt;/em&gt; is the
single best history of science book I’ve ever read.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;TMotAB&lt;/em&gt; is a &lt;em&gt;long&lt;/em&gt; book, at ~886 pages in its physical form and a staggering
37 hours for its audiobook. But it’s &lt;em&gt;entirely&lt;/em&gt; worth the commitment. Rhodes
traces the history of the creation of the first atomic bomb, alongside a
surprising number of related scientific precursors that were necessary to the
creation of the bomb. Indeed, much of the book discusses pure scientific
research that has little to do with the engineering of atomic weapons: the
discovery of atomic structure, the discovery of the electron, proton, and
neutron, the discovery of alpha radiation, the discovery of elemental isotopes,
and many other foundational discoveries that culminated in the Manhattan
Project.&lt;/p&gt;
&lt;p&gt;Each of these gets its own careful narrative and introduction of the key
scientific figures who contributed these discoveries. Reading it, you get a
real sense of science being done in progression, starting from a place of
ignorance and slowly making sense of the complexity of the world, rather than a
retrospective “atomic weapons were inevitable and here is how they were built”
story. There is a delicately crafted sense of cumulative scientific progress
which eventually gives way to a growing sense of dread and, yes,
inevitability, as the book transitions from a story of basic scientific research
to directed weapons engineering.&lt;/p&gt;
&lt;p&gt;The sections of the book related to the Manhattan Project were equally
fascinating. Unsurprisingly (and thankfully), engineering nuclear weapons is a
complicated process. The book describes several engineering dead-ends and
missteps, as well as clashes between various engineering groups within the
project. The final working device combined a staggering amount of work: the
logistics to enrich the uranium, the precision of the timing devices used to
detonate the warhead, the labor-intensive means used to model the shockwaves in
the blast that were critical to the design, and so on. The book emphasizes what
a feat it was that this project succeeded while operating under the
technological constraints of the 1940s.&lt;/p&gt;
&lt;p&gt;The book is easily read as a cautionary tale for scientific progress. Atomic
science greatly expanded our understanding of the world, but it also unleashed
an extremely destructive set of weapons that destabilized and reshaped the
geopolitical order for decades. Tellingly, until the first nuclear devices were
built and demonstrated, there were voices in the scientific community saying
that they were impossible to build. There were also those within the Manhattan
Project concerned that detonating their device would start a chain reaction that
would
&lt;a href=&#34;https://www.bbc.com/future/article/20230907-the-fear-of-a-nuclear-fire-that-would-consume-earth&#34;&gt;burn the atmosphere&lt;/a&gt;,
a possibility they ruled out prior to detonation, though the concern was quite
real at the time. It’s hard to know what a “view of moderation” would have
looked like in such a time of immense progress, even in retrospect. Nuclear
weapons were possible to build, and they were terrible, but somehow we muddled
through, at great cost. In any case:
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/21/An-Inconvenient-Truth/&#34;&gt;History rhymes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;TMotAB&lt;/em&gt; is worth the time to read it in full. Highly recommended; likely the
best book I’ve read in the past 3 years.&lt;/p&gt;
&lt;ol start=&#34;2&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/123471.I_Am_a_Strange_Loop&#34;&gt;I Am a Strange Loop&lt;/a&gt;
by Douglas Hofstadter.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;I Am a Strange Loop&lt;/em&gt; is a great companion to Hofstadter’s more popular work,
&lt;em&gt;‌Gödel, Escher, Bach&lt;/em&gt;. It’s an interesting exploration of philosophy of mind,
cognitive science, and Hofstadter’s characteristic interest in fractal, loopy
models of systems.&lt;/p&gt;
&lt;p&gt;Perhaps the most interesting portion of the book is its investigation of the
neurosymbolic origin of the “self”, or the “I” concept. The book’s title is a
pun on the notion that the “I” symbol is, itself, a strange loop.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Among the untold thousands of symbols in the repertoire of a normal human
being, there are some that are far more frequent and dominant than others, and
one of them is given, somewhat arbitrarily, the name ‘I’. &amp;hellip;&lt;/p&gt;
&lt;p&gt;Because of the locking-in of the ‘I’-symbol that inevitably takes place over
years and years in the feedback loop of human self-perception, causality gets
turned around and ‘I’ seems to be in the driver’s seat. &amp;hellip;&lt;/p&gt;
&lt;p&gt;My claim that an ‘I’ is a hallucination perceived by a hallucination is
somewhat like the heliocentric viewpoint… The basic idea is that the dance of
symbols in a brain is itself perceived by symbols, and that step extends the
dance, and so round and round it goes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Strange Loop&lt;/em&gt; is equal parts thought-provoking, humorous, and earnest. I quite
enjoyed it. I wrote a full review of it
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;ol start=&#34;3&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/37860033-the-demon-in-the-machine&#34;&gt;The Demon in the Machine&lt;/a&gt;
by Paul Davies&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;The Demon in the Machine&lt;/em&gt; is an excellent exploration of information theory as
it pertains to biological life. The thesis is that life is the interaction
between matter and information, and the book argues this rather compellingly.
Along the way, it discusses information theory, Maxwell’s Demon and its
applications to biology, “information engines” that extract work out of
information itself, and an argument for “top-down causality” via a rejection of
a strict reductionist view of science.&lt;/p&gt;
&lt;p&gt;The most interesting concept that I took away from &lt;em&gt;Demon in the Machine&lt;/em&gt; was
the notion that the paradox of Maxwell’s Demon has been resolved with the
acceptance that information has a physical basis. Information is, in a real way,
something that has tangible and measurable properties.&lt;/p&gt;
&lt;p&gt;I wrote a full review of &lt;em&gt;Demon in the Machine&lt;/em&gt;
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;ol start=&#34;4&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/12400671-already-free&#34;&gt;Already Free&lt;/a&gt; by
Bruce Tift.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I read &lt;em&gt;Already Free&lt;/em&gt; over the course of several months, entirely on airplanes.
I’m still digesting it and hope to write a full review of it at some point, but
I strongly enjoyed this book. I initially read it on Kindle, but have bought
physical copies &amp;ndash; one for myself, and a few to give away.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Already Free&lt;/em&gt;’s tagline is “Buddhism Meets Psychotherapy on the Path of
Liberation”. If that strikes you as a little woo, that’s reasonable. &lt;em&gt;Already
Free&lt;/em&gt; walks the line between western psychotherapy, self-help, and spirituality.
And I think it does this quite well. Tift discusses the complementary but
distinct views of western psychotherapy and Buddhism. The west adopts a
“developmental view”, which discusses psychology in a mechanistically causal
fashion &amp;ndash; “X happened to me when I was younger, therefore I act in Y way”. The
work, therefore, is to understand and decondition the historical causes of
current dysfunction. In contrast, Buddhism adopts what Tift calls a “fruitional
view”, which emphasizes a focus on the current moment &amp;ndash; that the conditions for
regulation and peaceful existence are always present, and therefore the work is
to bring about these conditions.&lt;/p&gt;
&lt;p&gt;While &lt;em&gt;Already Free&lt;/em&gt; discusses the fruitional view in more detail than it does
the developmental view, perhaps the key idea of the book is that these two views
are entirely complementary. Tift discusses this in one of the most vivid
sections of the book:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If we combine [the developmental and fruitional views] into one image, they
might start to look like a spiral staircase. The fruitional view would be the
circular aspect: we’re always revisiting the same issues over and over again.
In practice, we say, “Well, I’ll probably work with this sadness, this
loneliness, this feeling of abandonment until I die. I’m going to keep coming
across it again and again, so I might as well develop a relationship with it.
I’m going to practice feeling it, to see if it’s actually a problem.” The
developmental view would be represented by a line, taking us from here to
there. We’re trying to improve our lives, to create upward momentum. “I
understand where my abandonment feelings originated, and I’m going to stop
trying to have relationships with unavailable partners.” Combined, the
circular motion and the line start to represent an ascending spiral. The
vertical axes are our deeply embedded issues, which we keep having to deal
with. But we can intersect these basic themes at increasing levels of
maturation. We’re walking in a circular pattern, but things are evolving
simultaneously. We continue to encounter our core vulnerabilities, but
hopefully at greater and greater levels of awareness and skillfulness.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I quite enjoyed this book, and would strongly recommend it if you have any
interest or practice in mindfulness, especially as it relates to
self-development.&lt;/p&gt;
&lt;h2 id=&#34;fiction&#34;&gt;Fiction&lt;/h2&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/fiction.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/fiction_hu2ade1a45d8cc61903c3b84e7584bda16_81674_0x475_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img src=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/fiction_hu2ade1a45d8cc61903c3b84e7584bda16_81674_0x475_resize_q100_lanczos.jpg&#34; width=&#34;392&#34; height=&#34;300&#34; loading=&#34;lazy&#34; decoding=&#34;async&#34;&gt;
        &lt;/picture&gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;I read less fiction than usual over this period, but these two books stood out.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/17863.Accelerando&#34;&gt;Accelerando&lt;/a&gt; by
Charles Stross.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Accelerando&lt;/em&gt; is a trip. I wish there were a hundred other books in its niche
genre of “singularity fiction”, but I’ve yet to read anything else that comes
close. It’s irreverent, sharp, and eloquently transhumanist. Reading it is like
viewing an optical illusion which flickers between techno-optimist utopia and
universe-tiling &lt;a href=&#34;https://en.wikipedia.org/wiki/H._R._Giger&#34;&gt;Gigeresque&lt;/a&gt; body
horror dystopia.&lt;/p&gt;
&lt;p&gt;It’s got &lt;a href=&#34;https://en.wikipedia.org/wiki/Matrioshka_brain&#34;&gt;Matrioshka brains&lt;/a&gt;,
&lt;a href=&#34;https://en.wikipedia.org/wiki/Computronium&#34;&gt;Computronium&lt;/a&gt;, post-scarcity
economics, human uploads, interstellar travel, and more. It is the pinnacle of
“techno high weirdness” fiction.&lt;/p&gt;
&lt;p&gt;It’s shocking this book came out in 2005; it easily could have been written
10-15 years later and felt just as fresh. Strongly recommend.&lt;/p&gt;
&lt;ol start=&#34;2&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/15811545-a-tale-for-the-time-being&#34;&gt;A Tale for the Time Being&lt;/a&gt;
by Ruth Ozeki.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This was an entirely charming book, weaving together a Pacific Northwest
setting, magical realism, and pieces of historical fiction. I listened to the
audiobook, which was narrated by Ozeki herself. While I didn’t feel that the
book quite stuck its landing, the ending was still satisfying and as a whole it
was quite a good read.&lt;/p&gt;
&lt;h2 id=&#34;honorable-mentions&#34;&gt;Honorable Mentions&lt;/h2&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/honorable.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/honorable_hu98d38b3cd903bddcc8d08cbcc8999a7c_146982_0x404_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/25/My-Favorite-Books-of-2023-2025/honorable_hu98d38b3cd903bddcc8d08cbcc8999a7c_146982_0x404_resize_q100_lanczos.jpg&#34; width=&#34;621&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/230403458-antimemetics&#34;&gt;Antimemetics&lt;/a&gt;&lt;/em&gt;,
by Nadia Asparouhova. &amp;ndash; I wrote a full review of &lt;em&gt;Antimemetics&lt;/em&gt;
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/06/Book-Review-Antimemetics/&#34;&gt;here&lt;/a&gt;. It largely discusses the
modern information environment, and how some ideas resist spreading while
others go gigaviral. We think a lot about the highly memetic, viral ideas, but
the ideas that resist spreading are often valuable &amp;ndash; and not just because of
their relative unpopularity. &lt;em&gt;Antimemetics&lt;/em&gt; captures the vibe of 2025 quite
well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/45358683-at-the-edge-of-time&#34;&gt;At the Edge of Time&lt;/a&gt;&lt;/em&gt;
by Dan Hooper. I had a few months where I was &lt;em&gt;really&lt;/em&gt; into cosmology, and I
largely credit this book for starting me down that rabbit hole. The book tries
to help you understand what the moments right after the Big Bang were like. It
discusses the current state of scientific consensus, what the known gaps are,
and what prospects we have for improving our understanding of the cosmos. I&amp;rsquo;d
long been interested in physics but couldn&amp;rsquo;t quite get myself to
&lt;em&gt;care&lt;/em&gt; about cosmology; this book ignited my curiosity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/193388249-killers-of-the-flower-moon&#34;&gt;&lt;em&gt;Killers of the Flower Moon&lt;/em&gt;&lt;/a&gt;
is a great historical narrative book in the same vein as, for example,
&lt;em&gt;&lt;a href=&#34;https://www.goodreads.com/book/show/397483.The_Devil_in_the_White_City&#34;&gt;The Devil in the White City&lt;/a&gt;&lt;/em&gt;.
It documents the murder of many members of the Osage Nation after the
discovery of oil on their lands. It’s a fantastic whodunnit with a good
payoff. Without spoiling anything: there are cowboys, private investigators,
a nascent FBI, assassinations, poisonings, and so on. I enjoyed &lt;em&gt;Killers of
the Flower Moon&lt;/em&gt; more than I expected.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As always, happy reading!&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>A Time of Wonders</title>
        <link>https://benjamincongdon.me/blog/2025/12/24/A-Time-of-Wonders/</link>
        <pubDate>Wed, 24 Dec 2025 22:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/24/A-Time-of-Wonders/</guid>
<description>&lt;p&gt;Today is Christmas Eve, which puts us in those liminal few weeks of the holiday
season that serve as a useful time for reflection. Work slows down, we pass
through the darkest days of the year, friends and family visit for the holidays,
and the calendars turn over to the new year.&lt;/p&gt;
&lt;p&gt;This year, I’ve been feeling gratitude for the sheer abundance of living in a
technologically advanced society. Civilization has been producing marvels at a
shockingly consistent pace. Without veering into saccharine appreciation, it&amp;rsquo;s
worth reflecting on these.&lt;/p&gt;
&lt;p&gt;In the style of Dynomight’s
&lt;a href=&#34;https://dynomight.net/thanks/&#34;&gt;“Underrated reasons to be thankful”&lt;/a&gt; and Gwern’s
&lt;a href=&#34;https://gwern.net/improvement&#34;&gt;“My Ordinary Life: Improvements Since the 1990s”&lt;/a&gt;,
here are some things that are top of mind to be grateful for at the end of 2025:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Accurate Timekeeping&lt;/strong&gt;. Timekeeping was an infamously hard problem for
centuries, requiring a massive investment in
&lt;a href=&#34;https://en.wikipedia.org/wiki/Longitude_(book)&#34;&gt;engineering&lt;/a&gt; to get
right. Today, you can buy a $5 digital quartz watch from a dollar store that
will be more accurate than the most accurate timepiece that a King could
commission in the 1700s. Your phone is also an incredibly accurate
timekeeper. Society has enabled free access to clocks with atomic precision via
&lt;a href=&#34;https://time.gov/&#34;&gt;time.gov&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AirPods&lt;/strong&gt;. The fact that high quality audio production devices &amp;ndash; with
excellent noise cancelling capabilities &amp;ndash; can fit in such a small form
factor, that their battery life is more than adequate, and that they “just
work” across all my Apple devices still amazes me.&lt;/li&gt;
&lt;li&gt;The quality of &lt;strong&gt;machine generated text-to-speech&lt;/strong&gt; has dramatically
increased over the past decade. Many of the “podcasts” I listen to now are
TTS of articles or papers. TTS used to be unlistenable. Now it’s essentially
at the level of a mid-tier human narrator. A skilled human narrator can
still surpass the quality of TTS, but at that point it becomes an aesthetic
preference and not a functional one. This opens a huge catalog of “reading”
to me that I otherwise wouldn’t have time to consume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spotify&lt;/strong&gt;, and other streaming services, offer a
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/&#34;&gt;mind boggling amount of consumer surplus&lt;/a&gt;.
For a reasonable monthly fee, I can access nearly all recorded music in
existence. Spotify’s playlist suggestions have also been valuable in
discovering new music. These have also qualitatively improved over the past
decade, to the point where the machine-generated playlists seem to capture a
great sense of what I actually enjoy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time translation and OCR&lt;/strong&gt; have gotten way better over the past
several years. Machine translation is pretty close to a solved problem, as
are many other traditional NLP tasks. OCR has taken a huge leap forward with
new models like
&lt;a href=&#34;https://blog.google/technology/developers/gemini-3-pro-vision/&#34;&gt;Gemini 3 Pro&lt;/a&gt;.
On a personal note, I’ve been able to scan handwritten letters from previous
generations, written in a language I cannot speak and in handwriting I can
barely interpret, and bring those letters back to life.&lt;/li&gt;
&lt;li&gt;Today’s &lt;strong&gt;LLMs&lt;/strong&gt; feel like science fiction. LLMs are one of the best tools
ever invented for learning, and are increasingly becoming the best tool ever
invented for programming. They’re also fascinating quasi-intelligences that
you can interact with for shockingly little money.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single day delivery&lt;/strong&gt;. The convenience of being able to summon any of
millions of consumer goods to your doorstep within 24 hours is still a piece
of societal magic. The labor and environmental externalities to this are
real. Yet, it remains a dramatic increase in convenience from any previous
time in human history.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Air travel is an accessible commodity.&lt;/strong&gt; A middle class person can buy a
plane ticket and reach most places on the planet within a day or two of
travel, with better safety statistics than cars. This is remarkable, if you
stop to think about it. At any given moment, there are something like 10,000
planes in the air, flying millions of miles per day. 150 years ago, there
were zero airplanes in existence.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I could expand this list further:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The proliferation of amazing compact digital cameras in smartphones&lt;/li&gt;
&lt;li&gt;The ability to semantically search across all your photos&lt;/li&gt;
&lt;li&gt;GPS navigation&lt;/li&gt;
&lt;li&gt;The availability of effectively free video calls to anywhere on the planet&lt;/li&gt;
&lt;li&gt;Accurate weather forecasts&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://waymo.com/&#34;&gt;Autonomous taxis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Contactless payments (and the many, many pieces of financial infrastructure
that make this possible)&lt;/li&gt;
&lt;li&gt;The incredible wealth of open source software society has accumulated&lt;/li&gt;
&lt;li&gt;The incredible wealth of &lt;a href=&#34;https://www.wikipedia.org/&#34;&gt;open source knowledge&lt;/a&gt;
that humanity has curated&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://starlink.com/&#34;&gt;Satellite internet connectivity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&amp;hellip;and so on.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We live in an era where the ambient level of technological capability is so high
that much of it has faded into the background. As one such example: most of the
promise of the early internet &lt;em&gt;has&lt;/em&gt; actually come to pass &amp;ndash; information &lt;em&gt;is&lt;/em&gt;
readily accessible, individuals &lt;em&gt;can&lt;/em&gt; publish and spread new ideas, communities
&lt;em&gt;can&lt;/em&gt; form online and blossom into in-person communities that stay connected
asynchronously through the internet. Prior to the internet, there were dozens of
similar precondition revolutions which got us to the point of even being able to
conceive of such a hyperobject.&lt;/p&gt;
&lt;p&gt;None of this makes the remaining hard problems go away. Aside from the fact that
these marvels are not globally distributed, there are, of course, many very real
geopolitical, social, economic, and meaning-making problems we still face. In
this darkest week of the year, I think it’s worth reflecting on what has been
accomplished. Hard problems are often tractable.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>RAII Guards and Newtypes in Rust</title>
        <link>https://benjamincongdon.me/blog/2025/12/23/RAII-Guards-and-Newtypes-in-Rust/</link>
        <pubDate>Tue, 23 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/23/RAII-Guards-and-Newtypes-in-Rust/</guid>
        <description>&lt;p&gt;I’ve been having a bunch of fun in Rust recently. I’ve finally gotten past the
point of fighting with the borrow checker and now am solidly in the plateau of
productivity with the language.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; One of the first things that struck me about
Rust was how confident I felt that if I wrote something sane-looking that passed
the compiler, I was more likely than in other languages to have a program that
&lt;em&gt;actually worked&lt;/em&gt;. The borrow checker does a lot of the heavy lifting here, of
course. But there are some other useful language conventions that help too.&lt;/p&gt;
&lt;p&gt;One trick I’ve been reaching for is
&lt;a href=&#34;https://rust-unofficial.github.io/patterns/patterns/behavioural/RAII.html&#34;&gt;RAII Guards&lt;/a&gt;.
Per the unofficial Rust Design Patterns book:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;RAII stands for “Resource Acquisition is Initialisation” which is a terrible
name. The essence of the pattern is that resource initialisation is done in
the constructor of an object and finalisation in the destructor. This pattern
is extended in Rust by using a RAII object as a guard of some resource and
relying on the type system to ensure that access is always mediated by the
guard object.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you’ve used &lt;code&gt;std::fs::File&lt;/code&gt; or &lt;code&gt;std::sync::Mutex&lt;/code&gt;, you’ve already seen this
pattern! The idea is that we can use a few properties of the Rust compiler to
deterministically and safely handle resources. Those properties are: lifetime
tracking and deterministic destruction.&lt;/p&gt;
&lt;p&gt;Take &lt;code&gt;std::fs::File&lt;/code&gt; as an example. The &lt;code&gt;File&lt;/code&gt; struct that you get back from
&lt;code&gt;File::open&lt;/code&gt; has an implementation of &lt;code&gt;Drop&lt;/code&gt; that automatically closes the
underlying OS file descriptor:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Pseudocode File implementation.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;File&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;handle: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;RawDescriptor&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// The OS-level file descriptor
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;impl&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;File&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;open&lt;/span&gt;(path: &lt;span style=&#34;color:#0aa&#34;&gt;String&lt;/span&gt;)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;-&amp;gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;File&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// ACQUISITION: Syscall to get the OS-level file descriptor.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;h&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;os::open_file_handle(path);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;File&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;handle: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;h&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;impl&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;Drop&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;for&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;File&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;drop&lt;/span&gt;(&amp;amp;&lt;span style=&#34;color:#00a&#34;&gt;mut&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;self)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// RELEASE: Called when the variable goes out of scope.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;os::close_file_handle(self.handle);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Usage Example
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my_file&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;File::open(&lt;span style=&#34;color:#a50&#34;&gt;&amp;#34;data.txt&amp;#34;&lt;/span&gt;);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;my_file.write(&lt;span style=&#34;color:#a50&#34;&gt;&amp;#34;Hello&amp;#34;&lt;/span&gt;);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// &amp;lt;--- my_file goes out of scope here; drop() is called deterministically.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The key thing here is that Rust calls destructors deterministically when a
variable goes out of scope. So, even though this looks a bit “magical”, there
isn’t any runtime ambiguity. Precisely when &lt;code&gt;my_file&lt;/code&gt; goes out of scope, the
file descriptor is closed. This is in contrast to garbage-collected languages,
where you don’t have a strong guarantee of when object destructors are called.&lt;/p&gt;
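As a toy illustration of that determinism (this is my own sketch, not from the standard library; the `Guard` type and its log are hypothetical names), each value's `Drop` runs exactly when its scope ends, innermost scope first:

```rust
use std::cell::RefCell;

// Records its label into a shared log when dropped.
struct Guard<'a> {
    label: &'static str,
    log: &'a RefCell<Vec<&'static str>>,
}

impl<'a> Drop for Guard<'a> {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.label);
    }
}

fn main() {
    let log = RefCell::new(Vec::new());
    {
        let _outer = Guard { label: "outer", log: &log };
        {
            let _inner = Guard { label: "inner", log: &log };
        } // `_inner` dropped here, at the end of its scope.
    } // `_outer` dropped here.

    // Drops happened deterministically, innermost scope first.
    assert_eq!(*log.borrow(), ["inner", "outer"]);
    println!("drop order: {:?}", log.borrow());
}
```

No garbage collector is involved: the compiler inserts the `drop` calls at the closing braces, which is exactly the property the RAII guard pattern relies on.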
&lt;p&gt;We can use this deterministic destructor behavior to construct useful runtime
behavior. For example, the &lt;code&gt;MutexGuard&lt;/code&gt; that &lt;code&gt;Mutex::lock&lt;/code&gt; returns uses this
deterministic &lt;code&gt;Drop&lt;/code&gt; call to unlock the &lt;code&gt;Mutex&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Pseudocode Mutex&amp;lt;T&amp;gt; implementation.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;Mutex&lt;/span&gt;&amp;lt;T&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;data: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;UnsafeCell&lt;/span&gt;&amp;lt;T&amp;gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;is_locked: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;AtomicBool&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// The RAII Guard
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;MutexGuard&lt;/span&gt;&amp;lt;&lt;span style=&#34;color:#1e90ff&#34;&gt;&amp;#39;a&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;T&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;mutex_ref: &lt;span style=&#34;color:#00a&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span style=&#34;color:#1e90ff&#34;&gt;&amp;#39;a&lt;/span&gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;Mutex&lt;/span&gt;&amp;lt;T&amp;gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Holds a reference to the parent Mutex
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;impl&lt;/span&gt;&amp;lt;T&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Mutex&amp;lt;T&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;lock&lt;/span&gt;(&amp;amp;self)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;-&amp;gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;MutexGuard&lt;/span&gt;&amp;lt;T&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// ACQUISITION: Block until the lock is acquired
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;while&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;self.is_locked.swap(&lt;span style=&#34;color:#00a&#34;&gt;true&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Ordering::Acquire)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;/* spin */&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Return the Guard that tracks this state
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;MutexGuard&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;mutex_ref: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;self&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;impl&lt;/span&gt;&amp;lt;&lt;span style=&#34;color:#1e90ff&#34;&gt;&amp;#39;a&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;T&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;Drop&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;for&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;MutexGuard&amp;lt;&lt;span style=&#34;color:#1e90ff&#34;&gt;&amp;#39;a&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;T&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;drop&lt;/span&gt;(&amp;amp;&lt;span style=&#34;color:#00a&#34;&gt;mut&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;self)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// RELEASE: Reset the lock state on the parent Mutex
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;self.mutex_ref.is_locked.store(&lt;span style=&#34;color:#00a&#34;&gt;false&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Ordering::Release);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Usage Example
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;guard&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my_mutex.lock();&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Mutex is now locked
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;*guard&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;+=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#099&#34;&gt;1&lt;/span&gt;;&lt;span style=&#34;color:#bbb&#34;&gt;                 &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Access data safely
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// &amp;lt;--- guard goes out of scope; drop() runs, Mutex is now unlocked for others.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;What I find powerful about this pattern is that you are using the &lt;em&gt;type system&lt;/em&gt;
of the language to ensure correctness. It’s like the idea of
“&lt;a href=&#34;https://fsharpforfunandprofit.com/posts/designing-with-types-making-illegal-states-unrepresentable/&#34;&gt;making illegal states unrepresentable&lt;/a&gt;”.
Using the above design, it’s &lt;em&gt;surprisingly hard&lt;/em&gt; to accidentally grab a mutex
lock and cause a deadlock by failing to later unlock it.&lt;/p&gt;
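&lt;p&gt;As a quick sketch of that idea (a toy example of my own, separate from the mutex code above): modeling connection state as an enum, rather than a pair of booleans, means the contradictory “both connecting and connected” state can’t even be constructed:&lt;/p&gt;

```rust
// Toy illustration of "making illegal states unrepresentable":
// a `(connecting: bool, connected: bool)` pair allows the nonsensical
// combination `(true, true)`; this enum cannot express it at all.
enum ConnState {
    Disconnected,
    Connecting { attempt: u32 },
    Connected { session_id: u64 },
}

fn describe(state: &ConnState) -> String {
    match state {
        ConnState::Disconnected => "idle".to_string(),
        ConnState::Connecting { attempt } => format!("connecting (attempt {})", attempt),
        ConnState::Connected { session_id } => format!("connected (session {})", session_id),
    }
}

fn main() {
    assert_eq!(describe(&ConnState::Disconnected), "idle");
    assert_eq!(describe(&ConnState::Connecting { attempt: 2 }), "connecting (attempt 2)");
    assert_eq!(describe(&ConnState::Connected { session_id: 7 }), "connected (session 7)");
}
```

&lt;p&gt;Every &lt;code&gt;match&lt;/code&gt; over the enum is forced by the compiler to handle exactly the states that can exist &amp;ndash; no defensive checks for impossible combinations.&lt;/p&gt;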
&lt;h2 id=&#34;newtype&#34;&gt;Newtype&lt;/h2&gt;
&lt;p&gt;RAII Guards pair well with the
&lt;a href=&#34;https://rust-unofficial.github.io/patterns/patterns/behavioural/newtype.html&#34;&gt;Newtype&lt;/a&gt;
pattern to add yet more safety by utilizing the type system. The idea of the
Newtype pattern is pretty simple: wrap an existing type in a thin wrapper struct
to change its capabilities. I’ve
&lt;a href=&#34;https://benjamincongdon.me/blog/2021/11/14/Using-Embedding-to-Disambiguate-Types-in-Go/&#34;&gt;written about this pattern before&lt;/a&gt;
in the context of Go, where this is often done via
&lt;a href=&#34;https://go.dev/doc/effective_go#embedding&#34;&gt;Type Embedding&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This pairs nicely with RAII guards: often when you are using a guard, you want
to lock down the resource to prevent unintentional misuse. For example, say
you&amp;rsquo;re working with a database connection handle. The underlying type might just
be a low-level handle representing an OS-level resource, and you want to prevent
it from being accidentally duplicated (which could lead to connection reuse or
aliasing bugs).&lt;/p&gt;
&lt;p&gt;Here is a quite naive approach:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00a&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;ConnectionPool&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;connections: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;Mutex&lt;/span&gt;&amp;lt;&lt;span style=&#34;color:#0aa&#34;&gt;Vec&lt;/span&gt;&amp;lt;&lt;span style=&#34;color:#0aa&#34;&gt;u64&lt;/span&gt;&amp;gt;&amp;gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Raw OS handles
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;impl&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ConnectionPool&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;checkout&lt;/span&gt;(&amp;amp;self)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;-&amp;gt; &lt;span style=&#34;color:#0aa&#34;&gt;u64&lt;/span&gt; {&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;mut&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;conns&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;self.connections.lock().unwrap();&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;conns.pop().expect(&lt;span style=&#34;color:#a50&#34;&gt;&amp;#34;no connections available&amp;#34;&lt;/span&gt;)&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;checkin&lt;/span&gt;(&amp;amp;self,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;handle: &lt;span style=&#34;color:#0aa&#34;&gt;u64&lt;/span&gt;)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;self.connections.lock().unwrap().push(handle);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The problem: since &lt;code&gt;u64&lt;/code&gt; has &lt;code&gt;Copy&lt;/code&gt;, nothing stops you from doing this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;handle&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pool.checkout();&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;oops_handle&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;handle;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;pool.checkin(handle);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;client.query(oops_handle,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a50&#34;&gt;&amp;#34;SELECT * FROM users;&amp;#34;&lt;/span&gt;);&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Whoops!
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is of course rather contrived, but it shows that if we rely on the end
programmer to remember to “return” resources they’ve checked out, we leave
ourselves open to leaks and misuse. However, we can fix this by
wrapping the raw handle in a newtype that doesn&amp;rsquo;t implement &lt;code&gt;Copy&lt;/code&gt; or &lt;code&gt;Clone&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00a&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;ConnectionHandle&lt;/span&gt;(&lt;span style=&#34;color:#0aa&#34;&gt;u64&lt;/span&gt;);&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Not Copy, not Clone
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;ConnectionPool&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;connections: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;Mutex&lt;/span&gt;&amp;lt;&lt;span style=&#34;color:#0aa&#34;&gt;Vec&lt;/span&gt;&amp;lt;&lt;span style=&#34;color:#0aa&#34;&gt;u64&lt;/span&gt;&amp;gt;&amp;gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;impl&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ConnectionPool&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;checkout&lt;/span&gt;(&amp;amp;self)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;-&amp;gt; &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;ConnectionHandle&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;mut&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;conns&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;self.connections.lock().unwrap();&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;ConnectionHandle(conns.pop().expect(&lt;span style=&#34;color:#a50&#34;&gt;&amp;#34;no connections available&amp;#34;&lt;/span&gt;))&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;fn&lt;/span&gt; &lt;span style=&#34;color:#0a0&#34;&gt;checkin&lt;/span&gt;(&amp;amp;self,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;handle: &lt;span style=&#34;color:#0a0;text-decoration:underline&#34;&gt;ConnectionHandle&lt;/span&gt;)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;self.connections.lock().unwrap().push(handle.&lt;span style=&#34;color:#099&#34;&gt;0&lt;/span&gt;);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now the compiler prevents the bug:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;handle&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pool.checkout();&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;let&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;oops_handle&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;=&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;handle;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// This is a move, not a copy
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;&lt;/span&gt;pool.checkin(handle);&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#aaa;font-style:italic&#34;&gt;// Error: handle was already moved to `oops_handle`
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this example, the newtype pattern forces single-ownership semantics onto a
primitive that would otherwise be freely copyable. You could also go further and
combine this with an RAII guard that automatically checks the connection back in
when dropped.&lt;/p&gt;
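&lt;p&gt;To sketch that combination (a hedged illustration &amp;ndash; the &lt;code&gt;PooledConnection&lt;/code&gt; name and &lt;code&gt;Option&lt;/code&gt;-wrapped handle are my own choices, not code from above): the guard holds a reference to the pool and pushes the raw handle back in its &lt;code&gt;Drop&lt;/code&gt; impl, so check-in can’t be forgotten:&lt;/p&gt;

```rust
use std::sync::Mutex;

struct ConnectionPool {
    connections: Mutex<Vec<u64>>, // Raw OS handles
}

// The guard owns the checked-out handle for its lifetime.
// `Option` lets `drop` move the handle out of `&mut self`.
struct PooledConnection<'a> {
    pool: &'a ConnectionPool,
    handle: Option<u64>,
}

impl ConnectionPool {
    fn checkout(&self) -> PooledConnection<'_> {
        let mut conns = self.connections.lock().unwrap();
        PooledConnection {
            pool: self,
            handle: Some(conns.pop().expect("no connections available")),
        }
    }
}

impl<'a> Drop for PooledConnection<'a> {
    fn drop(&mut self) {
        // CHECKIN: automatically return the handle to the pool.
        if let Some(h) = self.handle.take() {
            self.pool.connections.lock().unwrap().push(h);
        }
    }
}

fn main() {
    let pool = ConnectionPool { connections: Mutex::new(vec![1, 2, 3]) };
    {
        let _conn = pool.checkout();
        assert_eq!(pool.connections.lock().unwrap().len(), 2);
    } // `_conn` dropped here; its handle goes back into the pool.
    assert_eq!(pool.connections.lock().unwrap().len(), 3);
}
```

&lt;p&gt;With this design there is no public &lt;code&gt;checkin&lt;/code&gt; to forget: the only way a handle returns to the pool is by dropping the guard.&lt;/p&gt;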
&lt;p&gt;The meta lesson here is that a feature-rich language like Rust has strong
benefits. Yes, you’ll sometimes need to decipher some rather verbose type
signatures like
&lt;code&gt;Arc&amp;lt;RwLock&amp;lt;HashMap&amp;lt;TypeId, Box&amp;lt;dyn FnMut(&amp;amp;mut dyn Any) + Send&amp;gt;&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt;, but in
exchange you get your friendly compiler to prevent you from footgunning yourself
in myriad ways.&lt;/p&gt;
&lt;p&gt;In the learning pit of despair, Rust’s type system can feel like an ivory tower.
I&amp;rsquo;d read one of FasterThanLime’s
&lt;a href=&#34;https://fasterthanli.me/articles/catching-up-with-async-rust&#34;&gt;fantastic articles&lt;/a&gt;,
but after I&amp;rsquo;d digested the great technical writing, I was left with a feeling of
“OK, but now I need to return to my day job and write some enterprise software
glue code”. But once you reach the plateau of productivity, the type/memory/structural
safety you’ve been slowly absorbing more than pays for the initial discomfort.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;The final boss, as it turns out, was allowing myself to use a healthy
quantity of &lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;/code&gt;s.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Letters Are Still an Option</title>
        <link>https://benjamincongdon.me/blog/2025/12/22/Letters-Are-Still-an-Option/</link>
        <pubDate>Mon, 22 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/22/Letters-Are-Still-an-Option/</guid>
        <description>&lt;p&gt;I sometimes wish I’d grown up in the era of written letters, or that email and
long-form written correspondences were more fashionable than they currently are.
There is something quite enjoyable about sitting down and intentionally writing
to someone, for hours even. The times that I’ve sat down to write something
long-form to a friend, or have received the same, feel qualitatively different
from the accumulation of many shorter messages.&lt;/p&gt;
&lt;p&gt;Part of what draws me to letters is the fact that I value high quality writing
&amp;ndash; reading it, writing it, editing it. Writing is proof of thought: the
author cared enough to sit down and put mental labor into producing something.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;
And if that something is emotionally resonant, all the better. There are only so
many hours in the day, and so the fact that someone spent that time writing
something deep &lt;em&gt;to you&lt;/em&gt;, or you &lt;em&gt;to them&lt;/em&gt;, adds a level of intentionality that
frequent texts, in my opinion, do not.&lt;/p&gt;
&lt;p&gt;More “modern” async communication nudges in the direction of faster, briefer,
more back-and-forth. Admittedly, there’s a time and place for this! In work
contexts, dismayed as I sometimes am about the size of my Slack inbox, it just
would not be feasible to write considered email responses to every inquiry I
receive. I’ve worked in organizations where long-form emails were the default,
and that always felt like a well-meaning but misplaced use of time and
attention.&lt;/p&gt;
&lt;p&gt;In the personal domain though, I’m finding I value slower communication. A
slower pace allows you to sit with ideas for a while before responding, and then
respond in sufficient detail that the background thinking time was worth it.
This doesn’t require adopting a distant intellectual tone. Good “slow” writing
can be warm, connecting, and exciting.&lt;/p&gt;
&lt;p&gt;Good “fast” personal communication can also be warm, connecting, and exciting,
but it can also carry with it a sense of “always needing to be available”.
The feeling that at any given time, your phone may buzz and you’ll be liable to
respond to something. Or that you’ll send someone a message, they immediately
respond, and you owe them a quick response. Personally, I find this constant
availability draining, and that I don’t have as much to offer when responding
from a place of “quick response”.&lt;/p&gt;
&lt;p&gt;These are all, of course, norms that are partly developed at a global level and
(more importantly) negotiated in each one-on-one relationship. Similar to the
reflection I wrote on
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/01/Schedule-Recurring-Calls-With-Your-Far-Away-Friends/&#34;&gt;scheduling recurring calls&lt;/a&gt;,
slow long-form writing is something that one can propose.&lt;/p&gt;
&lt;p&gt;Thankfully, &lt;em&gt;we have the technology&lt;/em&gt; in 2025 to distribute text rather easily.
This can look like multi-paragraph iMessages, hand-written letters, or writing
5,000 words and attaching them as a PDF to an email or text. I suspect I&amp;rsquo;m not alone
in finding this mode of communication underused &amp;ndash; in fact, I know I&amp;rsquo;m not,
since I&amp;rsquo;ve run into wonderful people who also share this sentiment.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;As of writing in late 2025, it’s still pretty reasonable for the trained
reader to
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/01/25/AI-Slop-Suspicion-and-Writing-Back/&#34;&gt;detect even reasonably concealed AI writing&lt;/a&gt;.
So, quality writing in any domain &amp;ndash; blogs, work, personal &amp;ndash; stands out
bold against the backdrop of the developing dark forest of various forms of
slop.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>An Inconvenient Truth</title>
        <link>https://benjamincongdon.me/blog/2025/12/21/An-Inconvenient-Truth/</link>
        <pubDate>Sun, 21 Dec 2025 08:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/21/An-Inconvenient-Truth/</guid>
        <description>&lt;p&gt;Salvatore Sanfilippo, aka antirez, the creator of Redis, posted a
&lt;a href=&#34;https://antirez.com/news/157&#34;&gt;piece on his reflections on AI for the end of 2025&lt;/a&gt;.
He ends with the statement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The fundamental challenge in AI for the next 20 years is avoiding extinction.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve been hesitant to say this publicly, but I broadly agree with this
statement. Here are a few related statements I endorse:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Building intelligences smarter than humans is dangerous.&lt;/li&gt;
&lt;li&gt;Aligning a smarter-than-human intelligence to human values is an open and
unsolved problem.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;By default, market forces will cause us to underinvest in AI safety and
underenforce pre- and post-deployment safety measures.&lt;/li&gt;
&lt;li&gt;Lacking some form of external governance, market forces will encourage “arms
race dynamics” between frontier AI labs which sideline whatever safety
commitments have been made.&lt;/li&gt;
&lt;li&gt;Even if we &lt;em&gt;are&lt;/em&gt; able to create and align a smarter-than-human intelligence,
there are unresolved long-term concerns around gradual
disempowerment&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt; and mass unemployment that should give us serious
pause about continuing down our current path.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;an-inconvenient-truth&#34;&gt;An Inconvenient Truth&lt;/h2&gt;
&lt;p&gt;In 2006, Al Gore’s
&lt;a href=&#34;https://en.wikipedia.org/wiki/An_Inconvenient_Truth&#34;&gt;“An Inconvenient Truth”&lt;/a&gt;
became a focal point for concerns about climate change. We have yet to see an
&lt;em&gt;Inconvenient Truth&lt;/em&gt; moment for AI existential risk.
&lt;a href=&#34;https://ai-2027.com/&#34;&gt;AI 2027&lt;/a&gt; and
&lt;a href=&#34;https://en.wikipedia.org/wiki/If_Anyone_Builds_It,_Everyone_Dies&#34;&gt;If Anyone Builds It, Everyone Dies&lt;/a&gt;
both came close this year, but neither seemed to spark a broad reaction like
&lt;em&gt;Inconvenient Truth&lt;/em&gt; or other related climate change movements did.&lt;/p&gt;
&lt;p&gt;I think the core framing of &lt;em&gt;An Inconvenient Truth&lt;/em&gt; roughly applies to risks
from AI. For both climate change and AI x-risk, the immediate impacts feel banal
and easy to ignore. The projections based on existing long-term trends, on the
other hand, are scary and deserve attention. We are rather poor at allocating
attention to long-term risks that would be more easily preventable with
short-term governance action. The same tension applies here: it&amp;rsquo;s costly to put
mental energy into considering &amp;ldquo;doom&amp;rdquo; from an abstract risk. The risk won&amp;rsquo;t
materialize within the next few years. There remains legitimate uncertainty over
just &lt;em&gt;how&lt;/em&gt; bad it actually is. And so it gets deprioritized.&lt;/p&gt;
&lt;h2 id=&#34;ai-risks&#34;&gt;AI Risks&lt;/h2&gt;
&lt;p&gt;The risks of AI exist on a spectrum from “benign and mundane” to “catastrophic
and existential”. As you move more towards the catastrophic end of this
spectrum, the scenarios one needs to consider get progressively “weirder” and
require more extrapolation to recognize as probable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mundane risks&lt;/strong&gt; are things that are already showing up today. Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLMs engaging in overt sycophancy, and/or encouraging delusions in people
with existing mental illness.&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Using AI to increase the speed and sophistication of existing modes of
cybercrime.&lt;sup id=&#34;fnref:4&#34;&gt;&lt;a href=&#34;#fn:4&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Foreseeable risks&lt;/strong&gt; include things that are not actively problems today, but
are on trend to become problems soon:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Significant job losses in knowledge-work sectors, leading to society-wide
disruption.&lt;/li&gt;
&lt;li&gt;Concentration of power in a small number of AI companies.&lt;/li&gt;
&lt;li&gt;Autonomous AI systems being deployed in safety-critical or high-stakes
domains (e.g. military, financial markets, critical infrastructure) before
we have robustly solved out-of-distribution alignment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Catastrophic risks&lt;/strong&gt; include things that are, hopefully, fairly far off but
would be &lt;em&gt;very&lt;/em&gt; bad for humanity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recursive self-improvement of AI, and/or full automation of AI research and
development leading to a widening asymmetry between AI capabilities and
humanity’s collective understanding of how AI systems work. This could
result in an AI that far outmatches us strategically, with goals that are
unaligned with human flourishing.&lt;/li&gt;
&lt;li&gt;Deliberate misuse of AI by a state-level actor to, for example: design novel
bioweapons, coordinate attacks on critical infrastructure, develop novel
catastrophic technologies that humanity would otherwise have taken much
longer to develop (e.g.
&lt;a href=&#34;https://en.wikipedia.org/wiki/Mirror-image_life#Potential_risks_and_debates&#34;&gt;mirror life&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Loss of meaningful human control over governance&lt;sup id=&#34;fnref:5&#34;&gt;&lt;a href=&#34;#fn:5&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;5&lt;/a&gt;&lt;/sup&gt; and
economic&lt;sup id=&#34;fnref:6&#34;&gt;&lt;a href=&#34;#fn:6&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;6&lt;/a&gt;&lt;/sup&gt; systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;before-its-obvious&#34;&gt;Before It&amp;rsquo;s Obvious&lt;/h2&gt;
&lt;p&gt;I’d suggest that at the end of 2025, we’re still in a bit of an awkward place in
the dialogue about AI risk. I think people are right to remain skeptical about
how big of a deal this is, as it does require a pretty large leap to go from
“ChatGPT” to “Catastrophic Risks”. It required an even larger leap in the
pre-ChatGPT era, which is why the types of folks who have been raising the alarm
about AI risk (e.g. &lt;a href=&#34;https://intelligence.org/&#34;&gt;MIRI&lt;/a&gt;) are, respectfully,
necessarily a bit “weird”.&lt;/p&gt;
&lt;p&gt;For the past couple years, I’ve often felt this gut-sinking feeling akin to late
December 2019 through late February 2020. I remember the cognitive dissonance of
visiting Twitter/X and seeing people whose thinking I respect saying “you all
should be freaking out, this novel pathogen is serious business” while more of
the mainstream information ecosystem either unknowingly ignored these arguments
or knowingly suppressed them as “misinformation”.&lt;/p&gt;
&lt;p&gt;I’m writing this post to do my small part to push my information ecosystem the
other way. A small stone on the cairn of “yes, we should be concerned about
this”.&lt;/p&gt;
&lt;h2 id=&#34;psychological-defenses&#34;&gt;Psychological Defenses&lt;/h2&gt;
&lt;p&gt;I’ve noticed an interesting set of reactions, both in thinking about this
problem myself and in talking to others about it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Goalpost Moving:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What it looks like: “AI can’t do X, therefore we have nothing to worry
about. &lt;em&gt;One year later&lt;/em&gt;, AI can do X. But it still can’t do Y, so we’re
good.”&lt;/li&gt;
&lt;li&gt;Discussion: Look, tools like Claude Code are, by some reasonable
definitions, essentially proto-AGIs. If I somehow got access to Claude Code
in 2015, I expect I’d concede that we’d reached AGI. And yet, all frontier
models still currently have significant weaknesses. I’d &lt;em&gt;really&lt;/em&gt; appreciate
convincing evidence that progress is materially slowing, but I’ve yet to see
it. The progress in 2025 alone has been staggering. “X is not possible
today” is not an argument that it won’t be possible in 10 years.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Argument from Inconvenience:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What it looks like: “It would be deeply inconvenient and &lt;em&gt;weird&lt;/em&gt; if powerful
AI was dangerous, so I will proceed as if it cannot exist.”&lt;/li&gt;
&lt;li&gt;Discussion: Yeah, it is inconvenient and weird. I sympathize with this
argument a lot.
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/&#34;&gt;I don’t really &lt;em&gt;want&lt;/em&gt; my industry to be disrupted&lt;/a&gt;,
but the disruption is demonstrably already underway. No one wants the “bad” that
comes along with AI progress, and many people don’t want the “good”. There
is a legitimate question around “is this type of progress inevitable or
chosen?” There isn’t a binary answer to this, and there is still optionality
for us collectively to decide that dangerous capability progress isn’t
inevitable. However, “inevitability” does appear to be the default path.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;There Are Adults in the Room:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What it looks like: “There will always be someone &lt;em&gt;using&lt;/em&gt; the AI or &lt;em&gt;in
control&lt;/em&gt; of the AI, so cooler heads will prevail. No one wants doom.”&lt;/li&gt;
&lt;li&gt;Discussion: Unfortunately, this expects a lot of discretion from people. We
are training AI systems to be more agentic on long-horizon tasks for the
explicit purpose of handing control of those tasks over to them.
First, people empirically &lt;em&gt;do&lt;/em&gt; cede control to automation for convenience,
efficiency, and competitiveness reasons. Second, the risk of loss of control
gets much scarier as AI systems become more intelligent and autonomous. I do
not think loss of control is a particularly scary risk &lt;em&gt;today&lt;/em&gt;, but
projected forward a decade and it becomes more concerning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The last major source of resistance I’ve encountered is the lack of a realistic,
non-scifi sounding scenario for how a catastrophe would occur. Both &lt;em&gt;AI 2027&lt;/em&gt;
and &lt;em&gt;If Anyone Builds It, Everyone Dies&lt;/em&gt; offer expanded scenarios about how a
catastrophe could occur, while giving the disclaimer that this &lt;em&gt;specific&lt;/em&gt;
scenario is not particularly likely. In contrast, risks like Mutually Assured
Destruction from nuclear warheads are much easier to articulate: (1) “something
happens” and the {USA, USSR} launches a first strike on the {USSR, USA}, (2) the
recipient of a first strike launches a guaranteed counterstrike, (3) many
millions die in the direct and indirect aftermath. AI risks rely more on
intuition pumps and extrapolation, for now, which makes them much harder to
deliver as an elevator pitch.&lt;/p&gt;
&lt;p&gt;The best I’ve heard, so far, is just that “building a smarter-than-human
intelligence is dangerous”. Fortunately, there are now also
&lt;a href=&#34;https://aisafety.info/questions/NM3T/A%20case%20for%20AI%20safety&#34;&gt;many&lt;/a&gt;
&lt;a href=&#34;https://www.lesswrong.com/posts/W43vm8aD9jf9peAFf/a-simple-explanation-of-agi-risk&#34;&gt;good&lt;/a&gt;
&lt;a href=&#34;https://thezvi.substack.com/p/ai-practical-advice-for-the-worried&#34;&gt;explainers&lt;/a&gt;
and resources for various levels of technical depth.&lt;/p&gt;
&lt;h2 id=&#34;seeing-like-a-cat&#34;&gt;Seeing Like a Cat&lt;/h2&gt;
&lt;p&gt;The core intuition &amp;ndash; that smarter-than-human intelligence is dangerous to
humans &amp;ndash; is surprisingly hard to convey. Let me try an analogy.&lt;/p&gt;
&lt;p&gt;I recently adopted two very cute 6-month-old kittens from my local humane
society. They were likely stray/feral cats earlier in life, and so they’re
skittish around humans. As they’ve been adapting to living in my house, I’ve
given them space and let them hide wherever they want.&lt;/p&gt;
&lt;p&gt;In some sense, my new kittens aren’t “entirely” less intelligent than me. They
can find hidey-holes in my house that I had no knowledge of, they can out-run
me, and their sense of hearing makes them well equipped to avoid me. And yet, I am
in some important sense more intelligent than them.&lt;/p&gt;
&lt;p&gt;If I needed to, say, scoop them up for a vet visit, I &lt;em&gt;could&lt;/em&gt; outsmart them. I
&lt;em&gt;could&lt;/em&gt; block their hiding spots, strategically shunt them into a room with no
other exits, lure them with food or toys, and eventually catch them. This is not
obvious to them. From their perspective, I&amp;rsquo;m just a friendly person who gives
them food, plays with them, and otherwise leaves them alone.&lt;sup id=&#34;fnref:7&#34;&gt;&lt;a href=&#34;#fn:7&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The obvious analogy: a superintelligent AI is to humanity what
I am to my kittens. Something that can out-strategize you without you even
realizing you’ve been backed into the corner of a room with no doors. This
analogy applies in two ways:&lt;/p&gt;
&lt;p&gt;First: We might end up in a fairly benign world where humans are treated the
same way as cats &amp;ndash; cats are cute, they are generally treated well, and to the
extent that they are out-strategized by humans, it is largely for their own
good. This still doesn’t imply that you would want to be disempowered in this
way if you, say, have preferences about the future.&lt;/p&gt;
&lt;p&gt;Second: you may say, “intelligence isn’t some raw scalar quantity that you can
just increase linearly”. And I would agree. I’m not sure what it would mean to,
say, have an agent with a “300 IQ”. Like, concretely, an agent capable of
getting a 300 on an IQ test. That seems meaningless, I agree.&lt;/p&gt;
&lt;p&gt;I can, however, imagine an agent with jagged intelligence which has the ability
to autonomously query and integrate information from hundreds (or thousands) of
realtime sources&lt;sup id=&#34;fnref:8&#34;&gt;&lt;a href=&#34;#fn:8&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;8&lt;/a&gt;&lt;/sup&gt;, effortlessly hack into key infrastructure systems while
evading detection&lt;sup id=&#34;fnref:9&#34;&gt;&lt;a href=&#34;#fn:9&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;9&lt;/a&gt;&lt;/sup&gt;, subtly influence humans into various opinions or plans of
action that they wouldn’t otherwise adopt&lt;sup id=&#34;fnref:10&#34;&gt;&lt;a href=&#34;#fn:10&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;10&lt;/a&gt;&lt;/sup&gt;, and so on. Obviously these capabilities
do not exist in a single unified system today. Beyond raw capability, AI agents will have structural
advantages &amp;ndash; for example, the ability to copy themselves, work in parallel, and
“think” much faster than humans.&lt;/p&gt;
&lt;h2 id=&#34;what-to-do&#34;&gt;What To Do&lt;/h2&gt;
&lt;p&gt;I have high certainty that “AI risk is worth taking seriously” and
moderate-to-high certainty that “it is worth taking actions now to reduce the
likelihood of these risks”.&lt;/p&gt;
&lt;p&gt;I am less certain on which actions exactly would be net beneficial, and how much
of a current-day cost we should be willing to pay to mitigate future risks.
Fortunately, there are many groups&lt;sup id=&#34;fnref:11&#34;&gt;&lt;a href=&#34;#fn:11&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;11&lt;/a&gt;&lt;/sup&gt; taking these questions seriously and
proposing frameworks. This is laudable and should be financially supported.&lt;/p&gt;
&lt;p&gt;At a policy level, I’m becoming increasingly convinced that we need &lt;em&gt;some&lt;/em&gt; sort
of governance structure to enable transparency and accountability for safety
measures, as well as to constrain the arms race dynamics between frontier labs.
This is not an easy problem, and will likely require international coordination.
I have moderate credence that a “pause AI” plan, enacted today, would not be net
helpful, as we can still get mostly net-positive mundane utility from further
progress. However, I think coordination around a
future mechanism to enforce a pause would be wise.&lt;/p&gt;
&lt;p&gt;The inconvenient truth for AI is that we are racing forward faster than our
ability to ensure safe development. Even with only today’s AI capabilities, a
lot of societal “weirdness” is already priced in. It is worth considering this
future directly, even in the face of significant uncertainty.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Obligatory: These are my own opinions and do not reflect those of my employer,
etc.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover image by Recraft v3.&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;See e.g. Evan Hubinger’s
&lt;a href=&#34;https://www.alignmentforum.org/posts/epjuxGnSPof3GnMSL/alignment-remains-a-hard-unsolved-problem&#34;&gt;Alignment remains a hard, unsolved problem&lt;/a&gt;.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;See e.g. &lt;a href=&#34;https://arxiv.org/abs/2501.16946&#34;&gt;Gradual Disempowerment&lt;/a&gt;.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;See e.g.
&lt;a href=&#34;https://openai.com/index/sycophancy-in-gpt-4o/&#34;&gt;“Sycophancy in GPT-4o: what happened and what we’re doing about it”&lt;/a&gt;
(OpenAI)&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:4&#34;&gt;
&lt;p&gt;See e.g.
&lt;a href=&#34;https://www.anthropic.com/news/disrupting-AI-espionage&#34;&gt;“Disrupting the first reported AI-orchestrated cyber espionage campaign”&lt;/a&gt;
(Anthropic)&amp;#160;&lt;a href=&#34;#fnref:4&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:5&#34;&gt;
&lt;p&gt;For example, much of the bureaucratic complexity of governing could be
turned over to AI systems for global competitiveness and efficiency reasons.
This could be a short-term boon for states which adopt this, but eventually
result in such a complex governance structure that wresting control back
from the AI becomes impossible.&amp;#160;&lt;a href=&#34;#fnref:5&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:6&#34;&gt;
&lt;p&gt;For example, individual firms increasingly turn control of investment over
to AI for competitive reasons. This would likely be a short-term boon for
the individual firms that choose to do so, and would select against firms
that maintain a “no AI” stance. However, long term, once the markets are
saturated by AI-controlled firms, the economy could become increasingly
misaligned with human needs and preferences. From
&lt;a href=&#34;https://gradual-disempowerment.ai/misaligned-economy&#34;&gt;Gradual Disempowerment&lt;/a&gt;:
“[H]umans might lose the ability to meaningfully participate in economic
decision-making at any level. Financial markets might move too quickly for
human participants to engage with them, and the complexity of AI-driven
economic systems might exceed human comprehension, rendering it impossible
for humans to make informed economic decisions or effectively regulate
economic activity.”&amp;#160;&lt;a href=&#34;#fnref:6&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:7&#34;&gt;
&lt;p&gt;To be clear, that’s what I want! Rehabilitating skittish cats is a long
game. Both of them are already warming up to me, so my strategy appears to
be working.&amp;#160;&lt;a href=&#34;#fnref:7&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:8&#34;&gt;
&lt;p&gt;The dumb “today” version of this is Deep Research agents.&amp;#160;&lt;a href=&#34;#fnref:8&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:9&#34;&gt;
&lt;p&gt;The dumb “today” version of this is GPT 5.2 and Opus 4.5, both of which are
nearing human-level performance on cybersecurity capture-the-flag
benchmarks.&amp;#160;&lt;a href=&#34;#fnref:9&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:10&#34;&gt;
&lt;p&gt;The dumb “today” version of this is the so-called “LLM psychosis” effect
with sycophantic models like GPT 4o.&amp;#160;&lt;a href=&#34;#fnref:10&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:11&#34;&gt;
&lt;p&gt;A short list: &lt;a href=&#34;https://www.iaps.ai/&#34;&gt;IAPS&lt;/a&gt;,
&lt;a href=&#34;https://safe.ai/&#34;&gt;Center for AI Safety&lt;/a&gt;,
&lt;a href=&#34;https://palisaderesearch.org/&#34;&gt;Palisade Research&lt;/a&gt;,
&lt;a href=&#34;https://theaipi.org/&#34;&gt;AIPI&lt;/a&gt;.&amp;#160;&lt;a href=&#34;#fnref:11&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>The South Flow SeaTac Arrival Corridor</title>
        <link>https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/</link>
        <pubDate>Sat, 20 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/</guid>
        <description>&lt;p&gt;From Gas Works Park, you can watch jets descend to SeaTac in a steady stream,
one every minute or two. I’ve lived in various neighborhoods in Seattle, all
under this corridor &amp;ndash; SLU, Eastlake, Fremont. I didn’t think much of this fact
until fairly recently, when I started taking flight lessons and became more
actively interested in aviation in general.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/south_flow_tracks.jpg&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/south_flow_tracks_hu4b6039e16f57b8607f26b5fddc9b093e_601802_0x1456_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/south_flow_tracks_hu4b6039e16f57b8607f26b5fddc9b093e_601802_0x1456_resize_q100_lanczos.jpg&#34; alt=&#34;The South Flow SeaTac Arrival Corridor; Red indicates arrivals&#34; width=&#34;486&#34; height=&#34;800&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;The South Flow SeaTac Arrival Corridor; Red indicates arrivals&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As I write this, I have &lt;a href=&#34;https://www.flightradar24.com/&#34;&gt;FlightRadar24&lt;/a&gt; popped
open and can see a dozen or so jets in or entering this funnel. Planes enter the
approach north of Seattle around Lynnwood or Shoreline, hug I-5 south past
Green Lake and Lake Union, and eventually descend through South Seattle to
land at SeaTac. There’s something comfortingly mechanical about this
rhythm.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/flightradar24.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/flightradar24_hufd870d14e091ce6c7a6b5ec0317a863c_537427_0x800_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/20/The-South-Flow-SeaTac-Arrival-Corridor/flightradar24_hufd870d14e091ce6c7a6b5ec0317a863c_537427_0x800_resize_lanczos_3.png&#34; alt=&#34;FlightRadar24 snapshot of the arrival corridor&#34; width=&#34;321&#34; height=&#34;400&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;FlightRadar24 snapshot of the arrival corridor&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Of course, depending on the wind, SeaTac can also flip to a “north flow” pattern
where the funnel over Green Lake becomes the departure direction instead of the
arrival, but the south flow is used about two-thirds of the time.&lt;/p&gt;
&lt;p&gt;It’s also notable how entirely non-dramatic this routine is. Even in the foggy
soup of Seattle clouds, the rain, and negligible visibility, SeaTac continues to
operate. I’m still quite early in my private pilot training and haven’t touched
instrument flying yet, but I have a fledgling appreciation for what makes this
possible: procedures, precision instruments, and skilled piloting. The fact that
you can fly through the clouds and only see the runway a few hundred feet before
landing (safely, of course) is an impressive feat of engineering and aviating.&lt;/p&gt;
&lt;p&gt;I’ve now been thinking of this arrival corridor as one of my favorite landmarks
of Seattle. It’s “invisible”, and it’s undoubtedly a less recognizable
landmark than the floating bridges, the canal bridges, or any of
Seattle’s other notable infrastructure landmarks. But nonetheless, I think about
it nearly every day &amp;ndash; whether on a run, driving to work, or just hearing
the quiet hum of a plane flying overhead.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>My Favorite Music of 2025</title>
        <link>https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/</link>
        <pubDate>Fri, 19 Dec 2025 20:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/</guid>
        <description>&lt;p&gt;2025 was a &lt;em&gt;really&lt;/em&gt; good year for music for me. My music collection system is
pretty basic: every year, I cut a new playlist and add songs that I&amp;rsquo;m currently
resonating with. The playlist gets progressively longer until the end of the
year, after which I reset and start with a blank slate in January. I started
this back in ~2018 after feeling like I was in a musical rut, finding everything
I listened to stale.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/playlist-chart.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/playlist-chart_huc7f16138e728efb71251fd63c28d0ad0_35728_0x427_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/playlist-chart_huc7f16138e728efb71251fd63c28d0ad0_35728_0x427_resize_lanczos_3.png&#34; width=&#34;637&#34; height=&#34;350&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    
&lt;/figure&gt;

&lt;p&gt;2025 was my “highest discoverability” year yet, by nearly 2x. It was a really fun
year for discovery and collection! Whereas 2024 had a few albums that I listened
to a &lt;em&gt;lot&lt;/em&gt;, 2025 had higher breadth, still with much that I listened to
frequently.&lt;/p&gt;
&lt;h2 id=&#34;no-skip-albums&#34;&gt;“No Skip” Albums&lt;/h2&gt;
&lt;p&gt;First up are the albums for which every track made it on my 2025 playlist.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/big_talk.jpg&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/big_talk_hu9875c3dbefd5b10ef201996e6172ef87_188451_0x500_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/big_talk_hu9875c3dbefd5b10ef201996e6172ef87_188451_0x500_resize_q100_lanczos.jpg&#34; width=&#34;250&#34; height=&#34;250&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Couch’s
&lt;a href=&#34;https://open.spotify.com/album/3qHu0st1qfx9UL19NEB7Dg?si=scQIm7wkTmS9kcv_eHewzQ&#34;&gt;Big Talk&lt;/a&gt;&lt;/strong&gt;
was my sleeper hit for Best Album of the year. As the pre-release singles for
Big Talk were coming out over the summer I remember thinking “huh, Couch has new
tracks, cool!” I was on vacation when the album actually dropped, and it was a
couple weeks before I realized that between Spotify’s Discover Weekly and
Release Radar, I’d independently added &lt;em&gt;every song&lt;/em&gt; on Big Talk to my 2025
playlist. It’s really my quintessential no-skip album for
the year.&lt;/p&gt;
&lt;p&gt;I also got to see Couch live recently in Seattle and they were &lt;em&gt;fantastic&lt;/em&gt; in
person. It was a sold-out concert in a relatively small venue, and was one of
the best concerts I’ve been to in recent years.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; Couch is still new enough
that it’s possible to know their entire discography back-to-front without &lt;em&gt;too&lt;/em&gt;
much investment.&lt;/p&gt;
&lt;p&gt;Entry point songs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From Big Talk:
&lt;a href=&#34;https://open.spotify.com/track/3skiUNigP6jbRfcBnBkFnQ?si=607799eadc7d4a11&#34;&gt;On The Wire&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/5jR3BZJvoBrTj0OeSAZTwM?si=8ca227fda8c24fa4&#34;&gt;What Were You Thinking&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/5RsqDqxS3qD0MCfVvX5uxL?si=35c50c8cd2fc438d&#34;&gt;Transparent&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/4cyoVONdFT1WIdl500AIsY?si=16aabe83cbb94c02&#34;&gt;Static &amp;amp; Noise&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;More Couch:
&lt;a href=&#34;https://open.spotify.com/track/4tNZHcmHZeHsodIwYOBUET?si=f75d33b5aada48dd&#34;&gt;Jessie&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/3RgRQ4MVcf8Fc8duo5nupf?si=8b637271ceb24bd9&#34;&gt;Still Feeling You&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/loved.jpg&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/loved_hu3685a633d3cd09e85c656b36cb7588d8_20434_0x300_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/loved_hu3685a633d3cd09e85c656b36cb7588d8_20434_0x300_resize_q100_lanczos.jpg&#34; alt=&#34;Parcels&amp;amp;rsquo; LOVED&#34; width=&#34;250&#34; height=&#34;250&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Parcels&amp;rsquo; LOVED&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Parcels’
&lt;a href=&#34;https://open.spotify.com/album/1rSbjr5U9J9rQ9sE7RxHFl?si=TMNNAB05TuOHFvCcwwwanw&#34;&gt;LOVED&lt;/a&gt;&lt;/strong&gt;
was another album that snuck into “no skip” status. I first heard of Parcels
from their 2018
&lt;a href=&#34;https://open.spotify.com/track/66tkDkPsznE5zIHNt4QkXB?si=3907018526f342e0&#34;&gt;Tieduprightnow&lt;/a&gt;,
and then got reintroduced to them via the Berlin live recording of
&lt;a href=&#34;https://open.spotify.com/track/0ZcOprrr9ubO9ORfjbHjsx?si=5feddb60efa741c6&#34;&gt;Lightenup&lt;/a&gt;
showing up in my Spotify Discover one week. That track became my go-to “get up
and face the day” track for much of the mid-summer. When LOVED came out in
September, it was a similar slow burn to Big Talk in that I slowly realized that
I’d added the entire album to my 2025 playlist.&lt;/p&gt;
&lt;p&gt;Why Parcels? There’s just a clear earnest joyfulness that I find endearing. I
sent one of their tracks to a friend recently, and their response was something
like “Yep, this is Ben music”.&lt;/p&gt;
&lt;p&gt;Entry point songs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From LOVED:
&lt;a href=&#34;https://open.spotify.com/track/4LpNkTXEHMktM6cj6DV12i?si=a0994b8bd8814768&#34;&gt;Iwanttobeyourlightagain&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/7cSLD89y304ZvX4mVhYVfw?si=48c4129741914ead&#34;&gt;Tobeloved&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/1eGYO1RzVRdphhWT86DSr9?si=a6a46726182f4225&#34;&gt;Yougotmefeeling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Other Parcels:
&lt;a href=&#34;https://open.spotify.com/track/0hhXziDUO0wNYPsstDQWN6?si=c53cb79a702846bb&#34;&gt;Overnight&lt;/a&gt;
(this &lt;a href=&#34;https://www.youtube.com/watch?v=kIxFtDz_1hs&#34;&gt;Amazon Music version&lt;/a&gt; is
excellent),
&lt;a href=&#34;https://open.spotify.com/track/0ZcOprrr9ubO9ORfjbHjsx?si=5feddb60efa741c6&#34;&gt;Lightenup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;really-great-albums&#34;&gt;Really Great Albums&lt;/h2&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/imaginal_disk.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/imaginal_disk_hud6327d8fea22e31ff0c7d6ecb202eacb_43108_0x300_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/imaginal_disk_hud6327d8fea22e31ff0c7d6ecb202eacb_43108_0x300_resize_lanczos_3.png&#34; width=&#34;250&#34; height=&#34;250&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Magdalena Bay’s
&lt;a href=&#34;https://open.spotify.com/album/2Htq1sHgmdGffojIBM6Q1s?si=xK7O8C3ZRc6BWdZDozrkhw&#34;&gt;Imaginal Disk&lt;/a&gt;&lt;/strong&gt;
was technically released in 2024, but I’m including it in this list because they
toured on this album in 2025. Imaginal Disk is a bit of a funky concept album,
steeped in weird Y2K-era imagery.
&lt;a href=&#34;https://www.nme.com/reviews/album/magdalena-bay-imaginal-disk-review-3784463&#34;&gt;This review&lt;/a&gt;
described it as “a time capsule of post-internet existentialism”, which I’d
endorse. For a flavor of this, watch the
&lt;a href=&#34;https://www.youtube.com/watch?v=DfcWOPpmw14&#34;&gt;music video for &lt;em&gt;Image&lt;/em&gt;.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Entry point songs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From Imaginal Disk:
&lt;a href=&#34;https://open.spotify.com/track/4O45RxFmWr88UfFlNtsXUy?si=1b9083ba87c34341&#34;&gt;Image&lt;/a&gt;
(Also fun:
&lt;a href=&#34;https://open.spotify.com/track/3Jng2rpIQrSAQPrWYN7ywM?si=f99b477a64b74d9a&#34;&gt;the Grimes version&lt;/a&gt;),
&lt;a href=&#34;https://open.spotify.com/track/1z8jzS7huL1Cy1YLFxUb3X?si=6e0a551125234f0f&#34;&gt;Killing Time&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/2RhHct6gGXSrZ1zHjovhRq?si=075201cf1e6d465f&#34;&gt;Cry for Me&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;More Magdalena Bay:
&lt;a href=&#34;https://open.spotify.com/track/3XdjmYHdNZkuN6FML8sTjH?si=6606d063dce24d07&#34;&gt;Domino&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/3tgHGoK5ItQv2q2yqggxlb?si=035449a2c4154a16&#34;&gt;Secrets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/anemoia.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/anemoia_hua237d4b30c06f2f108e66cd06a74a40b_12896_0x300_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/anemoia_hua237d4b30c06f2f108e66cd06a74a40b_12896_0x300_resize_q100_lanczos.jpg&#34; alt=&#34;Anemoia&#34; width=&#34;250&#34; height=&#34;250&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Anemoia&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;SG Lewis’
&lt;a href=&#34;https://open.spotify.com/album/3kse3e9XxmIedJb9bfjErH?si=yW-UMhjqRdCpuGGR5u-Ylg&#34;&gt;Anemoia&lt;/a&gt;&lt;/strong&gt;
is a fun electronic album. I’ve been following SG Lewis for a few years, saw his
tour was stopping through Seattle, and listened to a bunch more of &lt;em&gt;Anemoia&lt;/em&gt;
than I would have otherwise. The word “anemoia” is picked up from John Koenig’s
&lt;em&gt;Dictionary of Obscure Sorrows&lt;/em&gt;, which defines it as “the feeling of nostalgia
for a time, place, or situation one has never actually experienced”. Both the
album, and Lewis’ live performances of the tracks, convey this vibe quite well.
The tracks, despite being “just” dancey electronic music, are quite memorable. I
find myself pulling out my phone to play a specific song from this album with
surprising frequency.&lt;/p&gt;
&lt;p&gt;Entry point songs:
&lt;a href=&#34;https://open.spotify.com/track/0fwktLgGSjgVieYm6JkA7R?si=ca3b0a1ebe184152&#34;&gt;Memory&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/34TzBDhdmeSxAxcP1PVFjC?si=c583251a2acf4547&#34;&gt;Transition&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/03vfFtmD5SMZ7rpQm6KXTv?si=828e19e52c6c497a&#34;&gt;Baby Blue&lt;/a&gt;&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/cory_wong.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/cory_wong_hu20ce148cf5bc1244410ae416ba08c589_41120_0x500_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/19/My-Favorite-Music-of-2025/cory_wong_hu20ce148cf5bc1244410ae416ba08c589_41120_0x500_resize_q100_lanczos.jpg&#34; width=&#34;250&#34; height=&#34;250&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Cory Wong’s
&lt;a href=&#34;https://open.spotify.com/album/124WVC05XFXZV6UYCTiD9a?si=dX2cAzcjRHy9K9Dfkmz4vQ&#34;&gt;Wong Air&lt;/a&gt;
(Live in America)&lt;/strong&gt; is snuck in here because&amp;hellip; I listened to so much Cory Wong
in 2025, and saw his band live over the summer, and already have tickets to his
Spring 2026 tour. I would be remiss if I didn’t include something of his on this
list. I remember first running into Cory via his 2020 album
&lt;a href=&#34;https://open.spotify.com/album/1LL5VZdY7CBXScXB0oQ4tB?si=jaVABxWeRxO-evo9i9CQqw&#34;&gt;“Elevator Music for an Elevated Mind”&lt;/a&gt;.
I just thought the title was funny and liked a few tracks on it. This has grown
into an unabashed appreciation for most of his work. I’m also a big fan of
Vulfpeck, of which Cory is a member.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Wong is doing Big Band music in the 2020s in such a &lt;em&gt;fun&lt;/em&gt;, energetic,
contagious way. Seriously, just listen to a live track like
&lt;a href=&#34;https://open.spotify.com/track/41uqxRPzkStxNRVspv6PXe?si=98f461c521ca46d0&#34;&gt;Ketosis&lt;/a&gt;
and try to stay still. It was a treat to see him live.&lt;/p&gt;

&lt;div style=&#34;max-width: max(75%, 500px); margin: 0 auto;&#34;&gt;
    &lt;div class=&#34;youtube-player&#34;&gt;
        &lt;iframe src=&#34;https://www.youtube.com/embed/PVcjEOjI0yk&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Entry point songs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From 2025’s live album:
&lt;a href=&#34;https://open.spotify.com/track/0sX6ySwp6fmn8T06UlTqcz?si=dc5a20212e344f71&#34;&gt;Comic Sans&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/0EDDNlnpvHFPd0GPmWDD05?si=f796932cefbe4b9c&#34;&gt;Out at Midnight&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/36FpTtSZGjBsZ3AdWEYy6a?si=8ebf374307f842a6&#34;&gt;The Grid Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;More Cory Wong:
&lt;a href=&#34;https://open.spotify.com/track/3tiHlefQHzovwWmxA1tmmx?si=aefbf1e439714d5c&#34;&gt;Lunchtime&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/1QFFUmFYKXNYhq2Akjh7HR?si=b9c3cbb3574849ff&#34;&gt;Better&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/0p1Mp6sT4Nzq3DWnZPvqs8?si=4a498b719f104455&#34;&gt;Starship Syncopation&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/0PYldThRpqH6oEuV8gCJot?si=0ab4b81d72d249b3&#34;&gt;Turbo&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/3Wph8NnsxCW85uOartaPtO?si=f513522eac0c43fc&#34;&gt;Flyers Direct&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/64zfaCPwuuPkEtoNt1jzFx?si=9c8e617f560b4ea7&#34;&gt;Want Me Back&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;honorable-mentions&#34;&gt;Honorable Mentions&lt;/h2&gt;
&lt;p&gt;It should probably go without saying that &lt;strong&gt;Lady Gaga’s
&lt;a href=&#34;https://open.spotify.com/artist/1HY2Jd0NmPuamShAr6KMms?si=be15501ec9114221&#34;&gt;MAYHEM&lt;/a&gt;&lt;/strong&gt;
was a top album for 2025. When the single for
&lt;a href=&#34;https://open.spotify.com/track/2LHNTC9QZxsL3nWpt8iaSR?si=613fb2bb793a4f0a&#34;&gt;Abracadabra&lt;/a&gt;
came out, I immediately strapped on my running shoes and came quite close to a
personal-record 5k pace. By play count, Abracadabra was my
number one most listened to song of the year. Gaga’s
&lt;a href=&#34;https://www.youtube.com/watch?v=SrPN_eK8RKc&#34;&gt;SNL performance&lt;/a&gt; of
&lt;a href=&#34;https://open.spotify.com/track/4pNzBbGcqXofx8mLBPTeih?si=8aaf0450488745e8&#34;&gt;Killah&lt;/a&gt;
is also a must-watch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Florence + The Machine’s
&lt;a href=&#34;https://open.spotify.com/album/0z7l9VEJyFMv8p8wffRDaF?si=e9a0ba4b01b247a0&#34;&gt;Everybody Scream&lt;/a&gt;&lt;/strong&gt;
was a highly anticipated album for me this year, and it was enjoyable! The
studio album itself didn&amp;rsquo;t initially resonate with me as much as I expected, but
several live recordings from it have been &lt;em&gt;fantastic&lt;/em&gt;. In particular, this live
recording of &lt;a href=&#34;https://www.youtube.com/watch?v=M0mfg7vbjlk&#34;&gt;Sympathy Magic&lt;/a&gt; is
flooring, and this one of the title track,
&lt;a href=&#34;https://www.youtube.com/watch?v=yNxhjJMi4vI&#34;&gt;Everybody Scream&lt;/a&gt;, is also great.
I’m looking forward to the tour for this album next year, and I suspect this
will be an album that grows on me over time.&lt;/p&gt;

&lt;div style=&#34;max-width: max(75%, 500px); margin: 0 auto;&#34;&gt;
    &lt;div class=&#34;youtube-player&#34;&gt;
        &lt;iframe src=&#34;https://www.youtube.com/embed/M0mfg7vbjlk&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Entry point songs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From Everybody Scream:
&lt;a href=&#34;https://open.spotify.com/track/6ExwNwWLXvvKr5hk6lkHUY?si=688cb1bbe1d9428f&#34;&gt;Sympathy Magic&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/2yuKbNMCQ8Oo6KWWZvUCoV?si=350e807eb40b406f&#34;&gt;Witch Dance&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/4BswiLDcr4ChDGl2NrfhC1?si=a5fa6a9540104f93&#34;&gt;Kraken&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;More from Florence:
&lt;a href=&#34;https://open.spotify.com/track/7H7SHw3YWXhb4zYqyoPNa1?si=f4c52173873e44ed&#34;&gt;Free&lt;/a&gt;,
&lt;a href=&#34;https://open.spotify.com/track/0vQYe6g8bNbdUKnUnXdQQV?si=106e3340cd834353&#34;&gt;My Love&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;everything-else&#34;&gt;Everything Else&lt;/h2&gt;
&lt;p&gt;Alright, we’re all the way through albums. :phew:&lt;/p&gt;
&lt;h3 id=&#34;random-good-vibes-released-in-2025&#34;&gt;Random Good Vibes Released in 2025&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://open.spotify.com/track/6LkwvXgYfM5qoW198vlIn8?si=6b3616dd23fb466d&#34;&gt;Sailing Away&lt;/a&gt;
by Antoine Bourachot.
&lt;ul&gt;
&lt;li&gt;The feeling: Running in the sun on a beach in Spain.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://open.spotify.com/track/0VOeTPrszkFPaKHI43YBwo?si=937047b7395c454d&#34;&gt;U Know Y&lt;/a&gt;
by CAPYAC.
&lt;ul&gt;
&lt;li&gt;The feeling: When one realizes the work week is over for the weekend.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;good-running-music-released-in-2025&#34;&gt;Good Running Music Released in 2025&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://open.spotify.com/track/0xaXwvcjq7aAKwMKe22Bw7?si=fc2ff9e301e949f6&#34;&gt;Focus&lt;/a&gt;
by John Summit.
&lt;ul&gt;
&lt;li&gt;I must have listened to this track 100 times on runs over the year.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://open.spotify.com/track/6CfromGdojo0R0vlgA7iU8?si=a2d2087dbe474aff&#34;&gt;Infohazard&lt;/a&gt;
by Ninajirachi.
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://www.youtube.com/watch?v=p2ZdeIKJA8c&#34;&gt;music video&lt;/a&gt; is
mesmerizing; a must watch (esp. 3:00 to the end).&lt;/li&gt;
&lt;li&gt;I have distinct memories of listening to this song and sprinting around
the pre-sunrise streets of San Francisco in the late summer.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div style=&#34;max-width: max(75%, 500px); margin: 0 auto;&#34;&gt;
    &lt;div class=&#34;youtube-player&#34;&gt;
        &lt;iframe src=&#34;https://www.youtube.com/embed/p2ZdeIKJA8c&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://open.spotify.com/track/2DMPDC9EG8SLcujLz2UyIy?si=6c82392657de4e84&#34;&gt;Relentless Love&lt;/a&gt;
by Sophie Ellis-Bextor.
&lt;ul&gt;
&lt;li&gt;My vote for the “song of the summer” for 2025.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://open.spotify.com/track/4A56h4B9xUuMMXoKuj18HT?si=49618062fb7e48c5&#34;&gt;Edge of Desire&lt;/a&gt;
by Jonas Blue
&lt;ul&gt;
&lt;li&gt;Another song that I listened to in &lt;em&gt;heavy&lt;/em&gt; rotation on my
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/&#34;&gt;marathon training&lt;/a&gt;
runs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://open.spotify.com/track/51zebAwN6zTBOw0ue2XLIP?si=d19bfcfb18994cdc&#34;&gt;Joy&lt;/a&gt;
by Rita Era&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;not-at-all-released-in-2025&#34;&gt;Not At All Released in 2025&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Snarky Puppy’s
&lt;a href=&#34;https://open.spotify.com/track/68d6ZfyMUYURol2y15Ta2Y?si=14640b6c9728484a&#34;&gt;Lingus (We Like It Here)&lt;/a&gt;&lt;/strong&gt;
is up there for the best improvisation I’ve &lt;em&gt;ever&lt;/em&gt; heard, bar none. The
&lt;a href=&#34;https://www.youtube.com/watch?v=L_XJ_s5IsQc&#34;&gt;YouTube video&lt;/a&gt; from 2014 is a
masterpiece. I come back to this track weekly, listening the whole way
through.&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div style=&#34;max-width: max(75%, 500px); margin: 0 auto;&#34;&gt;
    &lt;div class=&#34;youtube-player&#34;&gt;
        &lt;iframe src=&#34;https://www.youtube.com/embed/L_XJ_s5IsQc&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Dirty Loops&amp;rsquo;
&lt;a href=&#34;https://open.spotify.com/album/7GWiggdgfYIBhKpYla2cz0?si=cjeh1NGETKmu4jFX0KakgA&#34;&gt;Phoenix&lt;/a&gt;&lt;/strong&gt;
was another album that sneakily became a &amp;ldquo;no skip&amp;rdquo; album. It was released in
2020, hence its absence from the earlier list. But each track on the
album is &lt;em&gt;really&lt;/em&gt; good, and each individually became a &amp;ldquo;listen on
single-track repeat&amp;rdquo; song.
&lt;a href=&#34;https://open.spotify.com/track/0L4A8L8UUSU0lgzxGNKggH?si=c8f5d2ce91ad47b9&#34;&gt;Rock You&lt;/a&gt;
is a good entry point, but
&lt;a href=&#34;https://open.spotify.com/track/7A9TTxnfM89QfwqC2DRjTY?si=582e52b721034d9b&#34;&gt;Work S*** Out&lt;/a&gt;
is likely my favorite track on the album.&lt;/p&gt;
&lt;h3 id=&#34;weekend-soundtrack&#34;&gt;Weekend Soundtrack&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://open.spotify.com/artist/7HXnQUEKHiWvUqSIR9ydOC?si=4-xifDMgR8-eEALlT4yHuw&#34;&gt;Above &amp;amp; Beyond Group Therapy&lt;/a&gt;
is a quite fun weekly electronic set. This is my “put on and do weekend chores”
playlist. I’d recommend
&lt;a href=&#34;https://open.spotify.com/album/5nc4PSRHABHaWWZwqtBv8D?si=rx4EwM0lS72p5Tkq73qWAQ&#34;&gt;this set from November&lt;/a&gt;
as a good entry point.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;As I said in the beginning, it was a great year for music. :) I got out to see
more concerts than I have in recent years, and put more time and attention into,
for example, listening to Spotify’s suggested new music every week. This year
will be hard to top, but I&amp;rsquo;m looking forward to seeing what 2026 brings when I
reset my playlist to a blank slate on January 1.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;The vocalist for the opener, Night Talks, also performed with Cory Wong in
Chicago in
&lt;a href=&#34;https://www.youtube.com/watch?v=569pPx36xyY&#34;&gt;this excellent recording&lt;/a&gt; of
&amp;ldquo;Synchronicity&amp;rdquo;.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Stop what you’re doing now and watch Vulfpeck’s 2025 Madison Square Garden
&lt;a href=&#34;https://www.youtube.com/watch?v=6sHbetTA7qA&#34;&gt;concert recording&lt;/a&gt;. It’s such
a joy.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;[Edit: 12/28/2025] I also realized that the soloing pianist in Lingus is the
same Cory Henry whose solo track,
&lt;a href=&#34;https://open.spotify.com/track/7FlvbdRd9ul5Ipk1Ejzmky?si=4e3d01c553f348b9&#34;&gt;Best of Me&lt;/a&gt;,
was one of my favorite songs a few years ago!&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Collecting Shibboleths</title>
        <link>https://benjamincongdon.me/blog/2025/12/18/Collecting-Shibboleths/</link>
        <pubDate>Thu, 18 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/18/Collecting-Shibboleths/</guid>
        <description>&lt;blockquote&gt;
&lt;p&gt;Judge a man by his questions rather than by his answers. &amp;ndash; Voltaire&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I am an introvert by nature, but I come alive for a good conversation. I was
reflecting on this after a recent international flight, where I was sitting next
to a friendly man who turned out to be a late-career civil engineer. I have a
close friend who is a civil engineer, and so as this conversation unfolded I was
able to inject pieces of information I’d learned from her &amp;ndash; the differences
between various civil CAD tools, stormwater management challenges, and so on.&lt;/p&gt;
&lt;p&gt;In a conversation that’s going &lt;em&gt;well&lt;/em&gt; there are ample opportunities to ask
perceptive questions where you can learn interesting pieces of information. That
information can then sit in your back pocket, to be carefully deployed the next
time you run into a similar conversational vein. This can be absolutely
delightful when it works.&lt;/p&gt;
&lt;p&gt;I’ve loosely been referring to this as &lt;strong&gt;“Collecting Shibboleths”&lt;/strong&gt; in my inner
dialect. “Shibboleth” here is used loosely. It typically denotes language that
separates an in-group from an out-group. However, I’m using it more in the sense
of language that indicates a certain level of knowledge about a topic, used as a
bid for establishing rapport. It’s a “hey, I’ve walked some of these mental
paths before, not as deeply as you, but I’m curious to learn more”. And it’s
“&lt;em&gt;Collecting&lt;/em&gt; Shibboleths” out of recognition that each conversation you have
contains an opportunity to learn something interesting or useful. You are
building a lifelong repertoire, and you do so by asking engaging questions.&lt;/p&gt;
&lt;p&gt;There is a huge difference between “Oh, you do low temperature physics, that’s
cool” and getting to “Oh, your lab synthesizes Bose-Einstein Condensates? That’s
fascinating &amp;ndash; evaporative cooling, right? How long are you able to stabilize it
for?”. Usually it doesn’t take &lt;em&gt;much&lt;/em&gt; knowledge to bootstrap this. A little
knowledge and a lot of curiosity goes a long way.&lt;/p&gt;
&lt;p&gt;Using a shibboleth is distinct from pretending to know more about something than
I actually do. You aren’t trying to “impress”. Rather, the goal is to add more
conversational hooks for the other person to respond to. More interesting paths
for the conversation to wander down.
&lt;a href=&#34;https://www.experimental-history.com/p/good-conversations-have-lots-of-doorknobs&#34;&gt;More doorknobs that are asking to be turned.&lt;/a&gt;
You’re looking for a spark of recognition in the other person &amp;ndash; the interplay
of “Hey! Here’s an opening” and “Yes! I’ll run with it”.&lt;/p&gt;
&lt;p&gt;Better questions lead to better answers. Better answers enrich your mental
models. And better mental models, in turn, sharpen the questions you ask next
time. This leads to a wondrously virtuous cycle in that becoming a better
conversationalist is self-reinforcing.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>Book Review: I Am a Strange Loop</title>
        <link>https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/</link>
        <pubDate>Wed, 17 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/</guid>
        <description>&lt;blockquote&gt;
&lt;p&gt;In the end, we self-perceiving, self-inventing, locked-in mirages are little
miracles of self-reference. &amp;hellip; Our very nature is such as to prevent us from
fully understanding its very nature. &amp;ndash; Douglas R. Hofstadter&lt;/p&gt;
&lt;/blockquote&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/book_cover.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/book_cover_hu809b9dcaa0628aa9d80dae7a10781452_64966_0x600_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/book_cover_hu809b9dcaa0628aa9d80dae7a10781452_64966_0x600_resize_q100_lanczos.jpg&#34; width=&#34;198&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;Most people know of Douglas Hofstadter for his masterpiece &lt;em&gt;Gödel, Escher, Bach&lt;/em&gt;
(“GEB”). I quite enjoyed GEB, but one of his lesser known books &amp;ndash; &lt;em&gt;I Am a
Strange Loop&lt;/em&gt; &amp;ndash; has stuck with me in a more profound way than GEB. In looking
at other reviews of &lt;em&gt;Strange Loop&lt;/em&gt;, I saw the common criticism that it’s the
“easier, more approachable” version of GEB. Admittedly, GEB is a doorstop to
read through; it has sections written in dialogs, makes heavy use of metaphor,
and so on. GEB is a triumph of conveying a certain set of ideas: Gödel
incompleteness, Turing completeness, and self-referential systems, among others.
But &lt;em&gt;Strange Loop&lt;/em&gt; stands on its own as an excellent philosophical book.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;I Am a Strange Loop&lt;/em&gt; goes much deeper into one of the topics that is discussed
in a cursory way in GEB: what are “souls”, what is “I”, and how do we get
meaning from meaningless physical constituents. If this sounds like metaphysics,
that’s intentional. &lt;em&gt;Strange Loop&lt;/em&gt; discusses in depth philosophy of mind, the
ontological status of the “self”, and “downward causality”.&lt;/p&gt;
&lt;h2 id=&#34;1-strange-loops&#34;&gt;1. Strange Loops&lt;/h2&gt;
&lt;p&gt;Hofstadter coins the term “Strange Loop” to point to a particular phenomenon that
occurs within certain hierarchical systems. Two famous examples of strange loops
are M.C. Escher&amp;rsquo;s &lt;em&gt;Drawing Hands&lt;/em&gt;, in which two hands appear to be drawing each
other, and Kurt Gödel&amp;rsquo;s Incompleteness Theorems, which use self-reference to
prove that, loosely, formal mathematical systems of sufficient power cannot be
both complete and consistent.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/drawing_hands.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/drawing_hands_hu72f8e49dde55bde44bb8944662a08892_36330_0x281_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/17/Book-Review-I-Am-a-Strange-Loop/drawing_hands_hu72f8e49dde55bde44bb8944662a08892_36330_0x281_resize_q100_lanczos.jpg&#34; width=&#34;347&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Drawing_Hands&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;A “strange loop” occurs when a hierarchical system has no clear top or bottom &amp;ndash;
rather, the hierarchy appears to loop back on itself into a cycle. In Escher’s
hands, there is no “top hand” drawing the “bottom hand”, and yet there still is
the hierarchical relationship of hand one draws hand two, which draws hand one,
which draws hand two, and so on. Hofstadter refers to this quality as a “tangled
hierarchy”.&lt;/p&gt;
&lt;p&gt;Not all tangled hierarchies are strange loops in Hofstadter&amp;rsquo;s sense, though. The
“strangeness” of a “strange loop” arises when a system is able to perform
self-reference: that is, when a system can point to itself in statements it
makes. This is where you get fun paradoxes like the Liar’s Paradox: “This
statement is false”. Self-reference also forms the basis for Gödel’s
incompleteness theorems. I will not try to explain Gödel incompleteness here
(though I’d highly suggest either GEB or
&lt;a href=&#34;https://www.goodreads.com/book/show/12247620-godel-s-proof&#34;&gt;Gödel’s Proof&lt;/a&gt;),
but the rough shape is encoding something like a Liar’s paradox within a
mathematical proof using a special form of number theory.&lt;/p&gt;
&lt;p&gt;Hofstadter goes on to argue that the concept of “I” and consciousness itself are
strange loops.&lt;/p&gt;
&lt;h2 id=&#34;2-the-i-symbol&#34;&gt;2. The “I” Symbol&lt;/h2&gt;
&lt;p&gt;Where is the strange loop in the brain? Hofstadter begins with the physical: a
brain is a system of neurons that supports representations of a system of
symbols. A symbol, in its simplest form, is just a pattern. These symbols often
correspond to things out in the world &amp;ndash; “dog”, “hot”, “table”. Simple creatures
only need a small set of symbols; more complex creatures evolve the use and
representation of a greater number and complexity of symbols. The strange loop
arises when the system begins to have a symbol for itself &amp;ndash; the “I” symbol.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Among the untold thousands of symbols in the repertoire of a normal human
being, there are some that are far more frequent and dominant than others, and
one of them is given, somewhat arbitrarily, the name &amp;lsquo;I&amp;rsquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Similar to how the symbol for “dog” or “hot” or “table” is created through some
mixture of perception reinforcement and innate genetic programming, so too is
the symbol for “I”. A self-referential symbol is quite a useful one to have. It allows
you to reason about your state within your environment in a rich way. Hofstadter
continues that, though useful, the “I” symbol is still &lt;em&gt;just&lt;/em&gt; a symbol:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Because of the locking-in of the &amp;lsquo;I&amp;rsquo;-symbol that inevitably takes place over
years and years in the feedback loop of human self-perception, causality gets
turned around and &amp;lsquo;I&amp;rsquo; seems to be in the driver’s seat.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The strong reinforcement of the “I”, particularly through interpersonal
relationships and culture, results in the “I” symbol being locked in to a
seemingly privileged position of appearing primary to other symbols. There is
this perception that there is an “I” driving the body around, it is the “I”
perceiving things from the seat of the brain, and it is the “I” symbol itself
that is making decisions. &lt;em&gt;Strange Loop&lt;/em&gt; argues that this is not a true
reflection of reality, but is instead a “hallucination”:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My claim that an &amp;lsquo;I&amp;rsquo; is a hallucination perceived by a hallucination is
somewhat like the heliocentric viewpoint&amp;hellip; The basic idea is that the dance
of symbols in a brain is itself perceived by symbols, and that step extends
the dance, and so round and round it goes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The “I” symbol is notable because it is both conceived of as observer and
observed. Many thought patterns have this self-referential mode: I can think the
thought “I am thinking this thought”. But this loops back on itself in strange
loop fashion &amp;ndash; the “I” observing the thought and the “I” referenced in the
thought itself are akin to Escher’s self-drawing hands.&lt;/p&gt;
&lt;p&gt;This is why Hofstadter calls this a “hallucination perceived by a
hallucination”. Hofstadter compares this shift to the Copernican revolution,
changing from an Earth-centric geocentric model to the Sun-centric heliocentric
model &amp;ndash; which is a more &amp;ldquo;accurate&amp;rdquo; model of physical reality. The analogue of
the geocentric model of thought is the conventional one: “I am a self, I think
thoughts”. The analogue of the heliocentric model is this: the brain is a symbol
processing machine, of which “I” is a particularly powerful symbol. However, “I”
is not ontologically distinct from any other symbol in the brain; it does not
exist in a “prior” or privileged position. The “I” symbol is real, but it does
not imply the causal structure that we intuitively feel &amp;ndash; that the “I” is what
is choosing to act, or is what is perceiving.&lt;/p&gt;
&lt;p&gt;The obvious followup question to this is, if the “I” doesn’t have causal power
over what we decide to do, then what &lt;em&gt;is&lt;/em&gt; making decisions? Hofstadter’s answer
is what he refers to as “downward causality”.&lt;/p&gt;
&lt;h2 id=&#34;3-downward-causality&#34;&gt;3. Downward Causality&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Strange Loop&lt;/em&gt; distinguishes two types of causality: “upward causality”, and
“downward causality”.&lt;/p&gt;
&lt;p&gt;Upward causality is what we conventionally think of as reductionist causality:
reducing a problem to its simplest parts helps us understand causal
relationships. Diseases are best understood in terms of the bacteria or viruses
that cause them; in turn, those bacteria are best understood via their genetic
components; genetic interactions are mediated by specific proteins; proteins
have a particular chemical structure that are affected by various atomic-scale
forces; these atomic forces are carried by force-carrying particles&amp;hellip; and so
on. The causal forces push upward: the atomic forces influence the proteins,
influence the bacteria, influence the disease progression.&lt;/p&gt;
&lt;p&gt;Downward causality is the reverse of this: the “top down” reasons for an
occurrence have just as much causal power. A disease spreads because of improper
public health facilities, or poor social acceptance of hand washing. Failing to
wash one&amp;rsquo;s hands does have causally explanatory power over getting a disease or
not, just as the makeup of a bacterium has causally explanatory power over how
a human can be infected by it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Deep understanding of causality sometimes requires the understanding of very
large patterns and their abstract relationships and interactions, not just the
understanding of microscopic objects interacting in microscopic time
intervals.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It would be, in some sense, an easier task to understand the world if all we had
to do was reduce everything to the smallest microscopic objects, and build
causality up from there. Hofstadter argues that view is mistaken.&lt;/p&gt;
&lt;p&gt;Returning to “I”, this same causality flip is what makes it seem like the “I” is
doing some causally important work:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Since we perceive not particles interacting but macroscopic patterns in which
certain things push other things around with a blurry causality, and since the
Grand Pusher in and of our bodies is our “I”, and since our bodies push the
rest of the world around, we are left with no choice but to conclude that the
“I” is where the causality buck stops. The “I” seems to each of us to be the
root of all our actions, all our decisions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So where does this leave us? The “I” has causal power, but not in the way we
intuitively think. If I stand up to get a glass of water, it may not be my
internal “I” symbol making the decision, but the causality still runs through
the high level pattern of “I”-ness. “I am thirsty” has explanatory power over my
decision.&lt;/p&gt;
&lt;h2 id=&#34;4-will-and-free-will&#34;&gt;4. “Will” and “Free Will”&lt;/h2&gt;
&lt;p&gt;Which then brings us to the hobby horse of any philosophy of mind discussion:
free will. Hofstadter plainly argues that “free will is an illusion”, similar to
how the “I” concept is a hallucination. However, Hofstadter claims that not much
is actually lost in this resignation over free will:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I am pleased to have a will, or at least I’m pleased to have one when it is
not too terribly frustrated by the hedge maze I am constrained by, but I don’t
know what it would feel like if my will were free. What on earth would that
mean? That I didn’t follow my will sometimes? &amp;hellip; I guess that if I wanted to
frustrate myself, I might make such a choice — but then it would be because I
wanted to frustrate myself&amp;hellip; in either case, my non-free will would win out
and I’d follow the dominant desire in my brain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the best litmus tests for discussion of free will that I’ve collected is
the notion that a free will “could have chosen otherwise” &amp;ndash; that is, were you
in the same position, it is conceivable that a different decision would have
been made. Hofstadter argues that this notion of “could have chosen otherwise”
is part of the illusion of the “I”. The “I” allows us to &lt;em&gt;feel&lt;/em&gt; like we’re a
party to our decisions, but that is merely a feeling:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our will, quite the opposite of being free, is steady and stable, like an
inner gyroscope, and it is the stability and constancy of our non-free will
that makes me me and you you, and that also keeps me me and you you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I find this to be a rather tidy resolution of the question of free will. While
this discussion is admittedly still in the abstract, it does appear that this
explanation would give a way to ground the “feeling” of free will with a
symbolic and computational model of cognition. This reconciles a fundamentally
materialist view of cognition with the feeling of “spooky ability to choose
otherwise”.&lt;/p&gt;
&lt;h2 id=&#34;5-distributed-selves&#34;&gt;5. Distributed Selves&lt;/h2&gt;
&lt;p&gt;I would be remiss if I didn’t mention two of the most emotionally
affecting chapters of &lt;em&gt;Strange Loop&lt;/em&gt;, wherein Hofstadter discusses the death of
his wife. One of the interesting implications of &lt;em&gt;Strange Loop&lt;/em&gt;’s symbolic
interpretation of consciousness is that these symbols can be transmitted &amp;ndash;
imperfectly, but partially transmitted nonetheless. This has implications for
how we interact with others, as our interactions result in corresponding symbols
of their “I” or “Self” being transferred:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One day, as I gazed at a photograph of Carol taken a couple of months before
her death, I looked at her face and I looked so deeply that I felt I was
behind her eyes, and all at once, I found myself saying, as tears flowed,
&amp;lsquo;That’s me! That’s me!&amp;rsquo; &amp;hellip; I realized then that although Carol had died, that
core piece of her had not died at all, but that it lived on very determinedly
in my brain.&lt;/p&gt;
&lt;p&gt;We live inside such people, and they live inside us&amp;hellip; someone that close to
us is represented on our screen by a second infinite corridor, in addition to
our own infinite corridor. We can peer all the way down — their strange loop,
their personal gemma, is incorporated inside us.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Dispassionately, I’m not sure how much credence to give to this notion, but it
is a beautiful idea. It does seem more than reasonable that we represent others
with increasingly rich inner symbols the longer we’ve known them, and it seems
evident that those symbols can, and indeed do, persist after the death of the
symbol’s person.&lt;/p&gt;
&lt;h2 id=&#34;6-conclusion&#34;&gt;6. Conclusion&lt;/h2&gt;
&lt;p&gt;I first read &lt;em&gt;Strange Loop&lt;/em&gt; several years ago, during a time of significant
personal change. Many of the ideas it expounds have become surprisingly
load-bearing over the subsequent years. &lt;em&gt;Strange Loop&lt;/em&gt; gave me an intellectually
satisfying “in” to exploring my own mental landscape with both curiosity and
humility, made me much more uncertain and curious about what is going on inside
LLMs, and left me with a great appreciation for Hofstadter as both a thinker and
a writer. If any of the ideas above resonated, I’d highly recommend &lt;em&gt;I Am a
Strange Loop&lt;/em&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>What Are You Trying to Say?</title>
        <link>https://benjamincongdon.me/blog/2025/12/16/What-Are-You-Trying-to-Say/</link>
        <pubDate>Tue, 16 Dec 2025 20:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/16/What-Are-You-Trying-to-Say/</guid>
        <description>&lt;p&gt;From &lt;a href=&#34;https://x.com/dwarkesh_sp/status/1914063267196256765&#34;&gt;Dwarkesh Patel&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Unreasonably effective writing advice:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;What are you trying to say here?&lt;/p&gt;
&lt;p&gt;Okay, just write that.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Writing effectively is notoriously hard. Even once you get past writer’s block
and are actually writing words on the page, organizing those words coherently is
challenging.&lt;/p&gt;
&lt;p&gt;One virtue I’ve come to value in writing is simplicity. That is, writing in a
way that can be understood while minimizing the cognitive load on the reader. I
find this virtue most useful in technical design proposals, where there is a
high premium on saying precisely what you mean with minimal overhead.&lt;/p&gt;
&lt;p&gt;A 40-page design doc can look superficially impressive, but I tend to be more
impressed by the ones that are 3-5 pages, and yet are so crisp that they have an
“airtight” quality to them.&lt;/p&gt;
&lt;p&gt;I’ve tried to keep this dictum in mind when writing recently. This advice is
particularly helpful when you’re drawn to fluffy or hedging language. Hedging
language often indicates that you haven’t fully figured out what you want to
say. Fluffy language can be a symptom of working on a piece of writing for too
long, and forgetting to have empathy with uninitiated readers. The goal: skip
all the extra baggage that inflates or deflates the message you’re trying to
convey. Just write the message in the simplest possible way for your audience.&lt;/p&gt;
&lt;p&gt;Writing has a nearly endless set of failure modes: too jargon-heavy, too vague,
too detached from reality, too focused on irrelevant details, too long for
readers to find the core point, and many, many more.&lt;/p&gt;
&lt;p&gt;Simple writing gets you at least one thing: brevity. &lt;strong&gt;Simple writing may still
be &lt;em&gt;bad&lt;/em&gt;, but it will be bad &lt;em&gt;in a way that can be easily critiqued&lt;/em&gt;.&lt;/strong&gt; Yes,
simple ideas often need nuanced explanations. But start with the simple
explanation and expand only for readers that care. This is, for example, why
“TLDR”s have been so widely adopted.&lt;/p&gt;
&lt;p&gt;What I like about “What are you trying to say? Just write that” is that it
nudges you to articulate the core of your message. Cut out the hedging, cut out
the fancy jargon, just: what is the actual pitch? And then, well, just write
that. You can still go on from there to polish and add nuance where appropriate.
The core message must come first though; all the polish and wordsmithing in the
world can’t save a piece of writing that doesn’t know what it’s actually trying
to say.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>Day 15 of Daily Writing</title>
        <link>https://benjamincongdon.me/blog/2025/12/15/Day-15-of-Daily-Writing/</link>
        <pubDate>Mon, 15 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/15/Day-15-of-Daily-Writing/</guid>
        <description>&lt;p&gt;This is post 15 of my unannounced, self-imposed month of daily writing. I’ve
been making soft promises to myself and others to write more for&amp;hellip; years. I was
inspired by a few of the folks who wrote daily last month for
&lt;a href=&#34;https://www.inkhaven.blog/&#34;&gt;Inkhaven&lt;/a&gt;, and so decided to do my own super
unofficial version of that.&lt;/p&gt;
&lt;p&gt;It’s been fun so far! And by “fun” I mean it’s been a rewarding challenge. :)&lt;/p&gt;
&lt;p&gt;Some thoughts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sitting down to “do writing” is a lot easier if you’ve scratched down some
notes, or started on an outline of a post already. I’ve been using a
mishmash of &lt;a href=&#34;https://ia.net/writer&#34;&gt;iA Writer&lt;/a&gt;,
&lt;a href=&#34;https://obsidian.md/&#34;&gt;Obsidian&lt;/a&gt;, and &lt;a href=&#34;https://getdrafts.com/&#34;&gt;Drafts&lt;/a&gt; to
outline and ideate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing for a deadline is helpful.&lt;/strong&gt; There have been a couple posts where
in the past, I would have endlessly tweaked and wordsmithed before
publishing. When the forcing function is “I want to finish this so I can go
to bed”, suddenly editing decisions become easier. Similarly, writing
&lt;em&gt;daily&lt;/em&gt; helps me stay in the rhythm of writing and build up “muscle memory”.
Only writing when I feel “inspired” is a recipe for not writing much.&lt;/li&gt;
&lt;li&gt;Writing is mentally taxing, but “doable” on low mental energy. Ideating is a
more enjoyable / open activity, but is &lt;em&gt;not&lt;/em&gt; doable with low mental energy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Posting more means I am less precious about each post.&lt;/strong&gt; I think this is
good. Having variance in article quality is healthy. For my preferences, a
mixture of high-effort and mid-effort posts seems like a good mix. Only
posting high-effort content is less rewarding, ultimately, since most of my
more “popular” posts are my mid-effort ones, but most of the posts that I’m
“most glad I’ve written” are the high-effort ones. Each of these is
rewarding in its own way.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High-effort posts, unsurprisingly, take a lot of time.&lt;/strong&gt; Both book reviews
I’ve posted this month were done on weekends; this isn’t an accident, though
it wasn’t intentionally planned.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Running is a good time for ideation.&lt;/strong&gt; I already knew this, but when I’m
in the mode of “it would be nice to have things to write about”, it’s nice
to know that I have an activity that readily fosters idea generation.&lt;/li&gt;
&lt;li&gt;Writing more made me tidy house a bit, digitally. I’ve been tweaking on the
margins with my personal site as I now look at it more frequently. I
improved the &lt;a href=&#34;https://benjamincongdon.me/blog/archive/&#34;&gt;blog archive page&lt;/a&gt;, added a markdown output to
each post for LLMs&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;, improved my image loading performance, and so on.
This sort of &lt;a href=&#34;https://maggieappleton.com/garden-history&#34;&gt;digital garden&lt;/a&gt;
tending feels wholesome.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What’s been working:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using iA Writer has been legitimately delightful, even though I likely use
&amp;lt;5% of its full feature set.&lt;/li&gt;
&lt;li&gt;Allowing myself to post articles that I think are past the 80% quality line
has made me feel much freer to “just write more”. Move on to the next idea.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What’s NOT been working:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I still want to write on some less directly technical topics (e.g.
psychology, philosophy, etc.), and have been finding it somewhat tricky to
find an “in” there. I think I’ll just need to accept that those posts will
be intentionally not in my normal lane and bite that bullet.&lt;/li&gt;
&lt;li&gt;Writing pieces that take more than one sitting is hard in this mode. I
generally ideate in the morning and write in the evening, and “need” to get
a post out by the evening. I tend to have a fairly single-track mind for
writing, and so parallelism is hard. Which means sequentially writing one
post per day. Which means, mostly, writing them in one sitting &amp;ndash; and that
places a (likely constructive) limit on how much effort I can put into any
individual post.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Beyond the mechanics, writing more intentionally has also made me think about my
projected online identity. My blog has always held a bit of a weird niche in my
mind &amp;ndash; it’s part public journal, part “performative display of competence”,
part curiosity log. I primarily write on technical topics, but I wouldn’t call
this a “tech blog”. Sometimes my articles get picked up on Hacker News; sometimes my
coworkers read my blog, or inspire a post. Sometimes I’m writing just because I
want something to exist, because I’ve thought of a connection between some ideas
that I find interesting. Upon reflection, my ideal reader is the person who
resonates with many of the same topics that I’m interested in, has my blog as
one of many in an RSS reader, and (if the stars align) gets a little bit of joy
or curiosity when clicking on a new piece of my writing.&lt;/p&gt;
&lt;p&gt;In any case, I have the explicit awareness while doing this month-long writing
exercise that this is &lt;em&gt;not&lt;/em&gt; maintainable as a daily practice. Part of this
effort is to shake out all the cobwebs of ideas that I’ve had on my “to write”
list for far too long. The
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/06/Book-Review-Antimemetics/&#34;&gt;Antimemetics&lt;/a&gt; review was one, but
there’s one or two other personally important topics that I’d like to get in
writing before the month is over. The hope, though, is that I find some pieces
of this experience that I enjoy enough or am able to integrate well enough to
exit this with a “more frequent than quarterly” writing routine.&lt;/p&gt;
&lt;p&gt;Thanks for reading, I appreciate it. (And do drop an email if you ever feel
inclined, I enjoy hearing from folks!)&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;Add &lt;code&gt;/index.md&lt;/code&gt; as a suffix to any blog url to get the markdown version of
it.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Book Review: The Demon in the Machine</title>
        <link>https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/</link>
        <pubDate>Sun, 14 Dec 2025 16:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/</guid>
        <description>&lt;blockquote&gt;
&lt;p&gt;The thing that separates life from non-life is information. - Paul Davies&lt;/p&gt;
&lt;/blockquote&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/book_cover.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/book_cover_hube906ca89a7d6233ea56b65fb260bfe6_831368_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/book_cover_hube906ca89a7d6233ea56b65fb260bfe6_831368_0x600_resize_lanczos_3.png&#34; width=&#34;208&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;I’ve probably learned about the thought experiment of Maxwell’s demon at least
half a dozen times &amp;ndash; in multiple physics courses, in multiple books. Until I
read Paul Davies’ &lt;em&gt;The Demon in the Machine&lt;/em&gt;, I don’t think I realized that the
paradox of Maxwell’s demon had actually been &lt;em&gt;solved&lt;/em&gt;.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; More on that later.&lt;/p&gt;
&lt;p&gt;The core thrust of Paul Davies’ &lt;em&gt;The Demon in the Machine&lt;/em&gt; is to look at life as
a puzzle of &lt;em&gt;thermodynamics&lt;/em&gt; and &lt;em&gt;information&lt;/em&gt;. It starts with the question: how
is it possible that life seems to reliably be able to render order from chaos?
Living systems are able to hold boundaries. Life is able to
create and sustain fantastically ordered structures &amp;ndash; cells, organs, limbs,
brains &amp;ndash; out of the chaotic inorganic soup comprising the rest of the physical
universe.&lt;/p&gt;
&lt;p&gt;Davies’ answer is information theory, with his core thesis that “Life = Matter +
Information”.&lt;/p&gt;
&lt;h2 id=&#34;1-demonology&#34;&gt;1. Demonology&lt;/h2&gt;
&lt;p&gt;The titular “demons” of &lt;em&gt;Demon in the Machine&lt;/em&gt; are, of course, those of the
&lt;a href=&#34;https://en.wikipedia.org/wiki/Maxwell%27s_demon&#34;&gt;Maxwell Demon&lt;/a&gt; thought
experiment. Maxwell’s demon &amp;ndash; sorry, now you have to listen to yet one more
explanation of it &amp;ndash; is the thought experiment wherein a gas is split into two
chambers, separated by a frictionless door. A demon watching molecules of gas
transiting between the two sections could use the door as a sort of filter &amp;ndash;
blocking slow particles from moving to, say, the left section, resulting in more
slow particles being in the right section and more fast particles being in the
left section.&lt;/p&gt;
&lt;p&gt;This would violate the Second Law of Thermodynamics by reducing the amount of
entropy (“disorder”) in the system. As another intuition for this, we
effectively “sorted” a gas at equilibrium into hot and cold without using any
energy.&lt;/p&gt;
&lt;p&gt;Things get weirder: were this true, we could configure this demon to, instead of
merely sorting the particles using its door, run an engine to produce work. We
reconfigure the room to have the door also act as a piston. When we’ve sorted
more hot particles into one side, we can close the door and have it move as a
piston would, the hotter particles pushing against the area with less pressure,
expanding until thermal equilibrium is reached. Then, the door can be reopened
and the process repeated. This demonic engine is known as a
&lt;a href=&#34;https://en.wikipedia.org/wiki/Entropy_in_thermodynamics_and_information_theory#Szilard&#39;s_engine&#34;&gt;Szilard engine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Were Maxwell’s demon possible &lt;em&gt;and costless&lt;/em&gt;, we could (say) run a refrigerator
indefinitely, without any energy input. Were a Szilard engine possible &lt;em&gt;and
costless&lt;/em&gt;, we could (say) extract a vast amount of mechanical energy from a gas
at thermodynamic equilibrium.&lt;/p&gt;
&lt;h2 id=&#34;2-information-engines&#34;&gt;2. Information Engines&lt;/h2&gt;
&lt;p&gt;The bad news is that such a costless engine is not possible. The good news is
that Szilard engines are actually realizable in the physical world.&lt;/p&gt;
&lt;p&gt;Leo Szilard, working in 1929, actually suggested a simpler version of the
Maxwell’s demon thought experiment than the one I sketched above to power his
engine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We simplify the setup to a single molecule in a box.&lt;/li&gt;
&lt;li&gt;The demon can insert a partition in the middle of the box, such that the
molecule is trapped on either the left or right side of the partition.&lt;/li&gt;
&lt;li&gt;The demon &lt;em&gt;knows&lt;/em&gt; the location of the molecule &amp;ndash; whether it is on the left
or right side. This is 1 bit of information.&lt;/li&gt;
&lt;li&gt;The demon can position the piston on the side where the molecule isn&amp;rsquo;t, so
the molecule pushes against it and does work.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/szilard-engine.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/szilard-engine_hucf9f5dcb74f9fb63b2445329317f5829_175449_0x700_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/szilard-engine_hucf9f5dcb74f9fb63b2445329317f5829_175449_0x700_resize_lanczos_3.png&#34; alt=&#34;Szilard&amp;amp;rsquo;s engine; Figure 4 from The Demon in the Machine&#34; width=&#34;488&#34; height=&#34;350&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Szilard&amp;rsquo;s engine; Figure 4 from &lt;em&gt;The Demon in the Machine&lt;/em&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Szilard suggested that the energetic “fuel” for this engine was the knowledge of
the molecule’s location.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Szilard concluded, reasonably enough, that the price [of the engine&amp;rsquo;s work]
was the cost of measurement.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As it turned out, this was a mistaken conclusion. IBM physicist Rolf Landauer
extended Szilard’s work in 1961, introducing
&lt;a href=&#34;https://en.wikipedia.org/wiki/Landauer%27s_principle&#34;&gt;“Landauer’s Principle”&lt;/a&gt;,
which states that it is not the &lt;em&gt;measurement&lt;/em&gt; of information that requires work,
but rather the &lt;em&gt;erasure&lt;/em&gt; of information. Landauer, studying the physics of
computation, was attempting to quantify the minimum amount of work needed to
operate a logic gate.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Landauer coined a now-famous dictum: ‘Information is physical!’ What he meant
by this is that all information must be tied to physical objects: it doesn’t
float free in the ether.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is fairly unintuitive, so let’s hold on this point for a minute: measuring
and accumulating information is thermodynamically free, but erasing that
information is costly.&lt;/p&gt;
&lt;p&gt;As an intuition, we can look at the notion of “reversibility”. I map this in my
head to something like: while measurement and accumulation of information is
happening, we are traveling down a thermodynamic gradient. Yes, we are “getting
work”, but in doing so we are traversing the thermodynamic state space of the
world. Each time the Szilard engine demon makes a “left or right” decision,
there is an implicit ledger of “lefts” and “rights” that is collected alongside
the work coming out of the system. The thought experiment abstracts this ledger,
but it is in fact a physical configuration in the Demon’s memory (weird,
right?).&lt;/p&gt;
&lt;p&gt;This ledger fills up, eventually. When you want to reuse the Szilard engine,
you have to eventually reset it, as the Demon does not have infinite memory. And
that erasure is when the process becomes irreversible. With the ledger in hand,
you can trace back the full state space tree of the engine operation. Deleting
the ledger resets your memory back to baseline, but also makes it such that you
cannot trace back the operation, going from an in principle reversible process
to an irreversible one. This is the energetic cost of computation.&lt;/p&gt;
&lt;h2 id=&#34;3-physical-demon-instantiation&#34;&gt;3. Physical Demon Instantiation&lt;/h2&gt;
&lt;p&gt;We have gotten slightly closer to an understanding of the physical instantiation
of a Maxwell demon / Szilard engine, but the intuition of the thought experiment
breaks down when we consider the ledger. Why do we care about the demon’s
memory? To see why, Davies has us remove the “magic” animation of the demon and
instead try to instantiate this idea in a physical system.&lt;/p&gt;
&lt;p&gt;I will quote liberally here, because I find this explanation by Davies, though
long and challenging to visualize, to be one of the most fascinating insights
from the book:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It must be possible to substitute a mindless gadget – a demonic automaton –
that would serve the same function. Recently, Christopher Jarzynski at the
University of Maryland and two colleagues dreamed up such a gadget, which they
call an information engine. Here is its job description: ‘it systematically
withdraws energy from a single thermal reservoir and delivers that energy to
lift a mass against gravity while writing information to a memory register’.
&amp;hellip;&lt;/p&gt;
&lt;p&gt;The Jarzynski contraption resembles a child’s plaything (see Fig. 6). The
demon itself is simply a ring that can rotate in the horizontal plane. A
vertical rod is aligned with the axis of the ring, and attached to the rod are
paddles perpendicular to the rod which stick out at different angles, like a
mobile, and can swivel frictionlessly on the rod. The precise angles don’t
matter; the important thing is whether they are on the near side or the far
side of the three co-planar rods as shown. On the far side, they represent 0;
on the near side, they represent 1. These paddles serve as the demon’s memory,
which is just a string of digits such as 01001010111010 … The entire gadget is
immersed in a bath of heat so the paddles randomly swivel this way and that as
a result of the thermal agitation. However, the paddles cannot swivel so far
as to flip 0s into 1s or vice versa, because the two outer vertical rods block
the way. The show begins with all the blades above the ring set to 0, that is,
positioned somewhere on the far side as depicted in the figure; this is the
‘blank input memory’ (the demon is brainwashed). &amp;hellip; One of the vertical rods
has a gap in it at the level of the ring, so now as each blade passes through
the ring it is momentarily free to swivel through 360 degrees. As a result,
each descending 0 has a chance of turning into a 1.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/jarzynski-info-engine.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/jarzynski-info-engine_huf889a9b60d4542dd3809b35aa5c80bba_85217_0x700_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/jarzynski-info-engine_huf889a9b60d4542dd3809b35aa5c80bba_85217_0x700_resize_lanczos_3.png&#34; alt=&#34;Jarzynski&amp;rsquo;s information engine; Figure 6 from The Demon in the Machine&#34; width=&#34;371&#34; height=&#34;350&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Jarzynski&amp;rsquo;s information engine; Figure 6 from &lt;em&gt;The Demon in the Machine&lt;/em&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;blockquote&gt;
&lt;p&gt;Now for the crucial part. For the memory to be of any use to the demon, the
descending blades need to somehow interact with it (remember that, in this
case, the demon is the ring) or the demon cannot access its memory. &amp;hellip; The
demonic ring comes with a blade of its own which projects inwards and is fixed
to the ring; if one of the slowly descending paddles swivels around in the
right direction its blade will clonk the projecting ring blade, causing the
ring to rotate in the same direction. The ring can be propelled either way
but, due to the asymmetric configuration of the gap, there are more blows
sending the ring anticlockwise than clockwise (as viewed from above). As a
result, the random thermal motions are converted into a cumulative rotation in
one direction only. Such progressive rotation could be used in the now
familiar manner to perform useful work. &amp;hellip;&lt;/p&gt;
&lt;p&gt;So what happened to the second law of thermodynamics? We seem once more to be
getting order out of chaos, directed motion from randomness, heat turning into
work. To comply with the second law, entropy has to be generated somewhere,
and it is: in the memory. Translated into descending blade configurations,
some 0s become 1s, and some 0s stay 0s. The record of this action is preserved
below the ring, where the two blocking rods lock in the descending state of
the paddles by preventing any further swivelling between 0 and 1. The upshot
is that Jarzynski’s device converts a simple ordered input state
000000000000000 &amp;hellip; into a complex, disordered (indeed random) output state,
such as 100010111010010 &amp;hellip; &lt;strong&gt;Because a string of straight 0s contains no
information, whereas a sequence of 1s and 0s is information rich, the demon
has succeeded in turning heat into work (by raising the weight) and
accumulating information in its memory. The greater the storage capacity of
the incoming information stream, the larger the mass the demon can hoist
against gravity.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Emphasis mine)&lt;/p&gt;
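&lt;p&gt;Davies’ bolded claim is Shannon entropy in disguise: a string of identical symbols has zero entropy, while a near-balanced mix of 0s and 1s carries about one bit per symbol. A quick empirical check (my own toy sketch, not from the book):&lt;/p&gt;

```python
import math
from collections import Counter

def shannon_entropy_bits(s: str) -> float:
    """Empirical Shannon entropy per symbol, in bits."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

assert shannon_entropy_bits("000000000000000") == 0.0  # no information
# A near-balanced mix of 0s and 1s carries ~1 bit per symbol:
h = shannon_entropy_bits("100010111010010")  # ~0.997 bits/symbol
```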
&lt;p&gt;I’d suggest watching
&lt;a href=&#34;https://www.youtube.com/watch?v=00TyIShzR6o&#34;&gt;this YouTube video&lt;/a&gt; to get a
better sense of how this works.&lt;/p&gt;
&lt;p&gt;So now we have it: a system extracting mechanical energy from a gas at
thermodynamic equilibrium by &lt;em&gt;means of recording information&lt;/em&gt;. Here we see the
physical manifestation of the demon’s memory. Our engine has a finite number
of paddles; when we run out, we cannot use the engine anymore. In this
physical system it becomes clear that, in order to use the engine
indefinitely, we need to reset it to its “blank slate” state after having
used up its ledger space.&lt;/p&gt;
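&lt;p&gt;The finite-memory point can be made quantitative. Under the Szilard/Landauer bound, each blank bit of memory lets the engine extract at most kT·ln 2 of work from the heat bath; once the memory fills, the engine stops. A back-of-the-envelope sketch (my own, not from the book):&lt;/p&gt;

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def max_work_from_blank_memory(n_bits: int, temp_k: float) -> float:
    """Upper bound on the work an information engine can extract from a
    single heat bath before its memory fills: k*T*ln(2) per blank bit
    (the Szilard/Landauer bound)."""
    return n_bits * K_B * temp_k * math.log(2)

# One gigabyte of blank memory at room temperature (300 K):
work_j = max_work_from_blank_memory(8 * 10**9, 300.0)
height_m = work_j / (1e-3 * 9.81)  # how high it could lift one gram
# work_j is roughly 2.3e-11 J: enough to lift a gram a few nanometres.
```

The striking thing is how little work a macroscopic amount of memory buys: information is a very expensive fuel.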
&lt;p&gt;Having established this, the remainder of &lt;em&gt;The Demon in the Machine&lt;/em&gt; discusses
the connection between information theory and biological life.&lt;/p&gt;
&lt;h2 id=&#34;4-life-and-information&#34;&gt;4. Life and Information&lt;/h2&gt;
&lt;p&gt;Much of Davies’ argument for the relationship between life, matter, and
information boils down to two main ideas: the notion that biological systems have
an informational “software” that runs on top of the bio-mechano-physical
“hardware”, and that this informational “software” has top-down causal power.&lt;/p&gt;
&lt;p&gt;To the first point, about biological “software”:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Biological information is more than a soup of bits suffusing the material
contents of cells and animating it; that would amount to little more than
vitalism. Rather, the patterns of information control and organize the
chemical activity in the same manner that a program controls the operation of
a computer. Thus, buried inside the ferment of complex chemistry is a web of
logical operations. Biological information is the software of life.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To the second point, on the top-down causal power of information:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Counter to most reductionist thinking, the macroscopic states of a physical
system (such as the psychological state of an agent) that ignore the
small-scale internal specifics can actually have greater causal power than a
more detailed, fine-grained description of the system, a result summed up by
the dictum: ‘macro can beat micro’.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This notion of top-down causality amounts to a firm rejection of “strong
reductionism” &amp;ndash; the notion, in this context, that “the astonishing properties
of living matter could ultimately be explained solely in terms of the physics of
atoms and molecules”.&lt;/p&gt;
&lt;p&gt;I do not feel well-equipped to give an opinion on whether these two positions
are &lt;em&gt;correct&lt;/em&gt; or not, but they are compelling:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The reductionist argument is undeniably powerful, but it rests on a major
assumption about the nature of physical law. The way the laws of physics are
currently conceived leads to a stratification of physical systems with the
laws of physics at the bottom conceptual level and emergent laws stacked above
them. There is no coupling between levels. &lt;strong&gt;When it comes to living systems,
this stratification is a poor fit because, in biology, there often is coupling
between levels, between processes on many scales of size and complexity&lt;/strong&gt;:
causation can be both bottom-up (from genes to organisms) and top-down (from
organisms to genes). To bring life within the scope of physical law – and to
provide a sound basis for the &lt;strong&gt;reality of information as a fundamental entity
in its own right&lt;/strong&gt; – requires a radical reappraisal of the nature of physical
law, as I am arguing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Emphasis mine)&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s an &lt;a href=&#34;https://xkcd.com/435/&#34;&gt;old XKCD comic&lt;/a&gt; to the effect that “biology is just
applied chemistry; chemistry is just applied physics”. Davies argues against
this: there is information at each layer, and the layers interact in
interesting and notable ways. They are not merely layers of abstraction;
rather, there is &lt;em&gt;something going on&lt;/em&gt; at each layer which cannot be
reduced to just the “base”-most layer.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/xkcd435.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/xkcd435_hube389cc787da1a021d729849bf8e7502_32287_0x308_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/14/Book-Review-The-Demon-in-the-Machine/xkcd435_hube389cc787da1a021d729849bf8e7502_32287_0x308_resize_lanczos_3.png&#34; width=&#34;601&#34; height=&#34;250&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;&lt;a href=&#34;https://xkcd.com/435/&#34;&gt;XKCD 435&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&#34;5-quantum-weirdness&#34;&gt;5. Quantum Weirdness&lt;/h2&gt;
&lt;p&gt;The book spends two fascinating chapters on the interaction between the
preceding ideas and quantum physics. My review is already getting long, so I
will limit my discussion here to one striking example of quantum mechanics
showing up in biology:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The bird’s retina is packed with organic molecules; researchers have zeroed in
on retinal proteins dubbed ‘cryptochromes’ to do the job I am describing. When
a cryptochrome electron is ejected by a photon, it doesn’t cut all its links
with the molecule it used to call home. &amp;hellip; The electron, though ejected from
its atomic nest, can still be entangled with a second electron left behind in
the protein atom, but, because of their different magnetic environments, the
two electrons’ gyrations get out of kilter with each other. &amp;hellip; According to
the theory of the avian compass, these particular free radicals react either
with each other (by recombining), or with other molecules in the retina, to
form neurotransmitters, which then signal the bird’s brain. This
neuro-transmission reaction rate will vary according to the specifics of the
spooky link and its mismatched gyrations of the two electrons, which is a
direct function of the angle between the Earth’s magnetic field and the
cryptochrome molecules. &amp;hellip; Is there any evidence to support this
spooky-entanglement story? Indeed there is. &amp;hellip; The era of quantum ornithology
has arrived!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Davies gives several other interesting examples of quantum biological effects &amp;ndash;
such as roles in photosynthesis and smell. However, he is also ready to admit
that early quantum-biology findings “have been hotly debated” and that some
“early claims were overblown”. Aside from the particular anecdotes, the part
of this section that stuck with me was how interesting it is that there was
sufficient evolutionary selection pressure for quantum biological effects to
have arisen.&lt;/p&gt;
&lt;h2 id=&#34;6-conclusion&#34;&gt;6. Conclusion&lt;/h2&gt;
&lt;p&gt;I came away from &lt;em&gt;Demon in the Machine&lt;/em&gt; with a few things that stuck. First, the
Maxwell&amp;rsquo;s demon paradox has actually been “solved”. The resolution through
Landauer&amp;rsquo;s principle &amp;ndash; that information erasure, not measurement, costs
energy &amp;ndash; feels like one of those insights that I&amp;rsquo;m surprised I hadn&amp;rsquo;t
encountered before. Second, information engines are in-principle buildable
things, and there is suggestive evidence that biological systems operate by
similar mechanisms. Third, Davies&amp;rsquo; argument in favor of &amp;ldquo;top-down causality&amp;rdquo; is
an interesting challenge to the common reductionist frame for physics and other
hard sciences.&lt;/p&gt;
&lt;p&gt;I’d strongly recommend &lt;em&gt;Demon in the Machine&lt;/em&gt;; it was quite thought-provoking.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;Or if I had, it didn’t stick to nearly the same extent as it did after I
read Davies’ book.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Chorus is Good Software</title>
        <link>https://benjamincongdon.me/blog/2025/12/13/Chorus-is-Good-Software/</link>
        <pubDate>Sat, 13 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/13/Chorus-is-Good-Software/</guid>
        <description>&lt;p&gt;I’ve been using &lt;a href=&#34;https://chorus.sh/&#34;&gt;Chorus&lt;/a&gt; for the past 6-7 months. Within the
first couple days of using it, I was telling everyone I talk with about AI stuff
to try it out. Melty Labs, the company behind Chorus, subsequently built
&lt;a href=&#34;https://conductor.build/&#34;&gt;Conductor&lt;/a&gt;. It appears Conductor is now their
primary focus, and as such they’ve decided to
&lt;a href=&#34;https://github.com/meltylabs/chorus&#34;&gt;open source&lt;/a&gt; Chorus.&lt;/p&gt;
&lt;p&gt;Chorus is a macOS LLM client. Its differentiating feature is that it fetches you
responses from many LLMs in parallel within the same chat. I think it’s quite
good software.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/13/Chorus-is-Good-Software/chorus_screenshot.png&#34;&gt;
            
            
            
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/13/Chorus-is-Good-Software/chorus_screenshot.png&#34; alt=&#34;Chorus screenshot&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Chorus screenshot (&lt;a href=&#34;https://github.com/meltylabs/chorus/tree/main/screenshots&#34;&gt;Source&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;It is &lt;em&gt;snappy&lt;/em&gt;.&lt;/strong&gt; I used to be impressed by the responsiveness of the ChatGPT
app, but after having become a regular Chorus user, I can’t use either the
ChatGPT&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; or Claude macOS apps without feeling like they’re just &lt;em&gt;sluggish&lt;/em&gt; in
comparison.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;It has a small surface area, but executes the UX well.&lt;/strong&gt; The key UX innovation
of Chorus is its multi-model chat &amp;ndash; you can get responses from multiple models,
pick the one you want to thread into the conversation, and continue. There are
also many nice affordances in Chorus that don’t exist in other clients. For
example, chats allow for “inline replies” where you can ask a model about a
response without adding that response to the context window. Chorus also handles
pasted content &lt;em&gt;much&lt;/em&gt; better than ChatGPT desktop or the Gemini AI Studio.
(Claude handles pasted stuff pretty well.) Chorus just feels like a &lt;em&gt;tool&lt;/em&gt; &amp;ndash; as
opposed to a consumer app &amp;ndash; in a way that other clients don’t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;It puts the user in control more-so than any of the first-party clients.&lt;/strong&gt;
Chat branching and “previous message” editing worked in Chorus well before
ChatGPT and Claude.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt; Using a third-party client like Chorus also lets you
jettison the opinionated system prompts that the first-party clients inject, and
lets you switch quickly between system prompts. The first-party system
prompts are good in a consumer app, but Chorus makes for a better “training
wheels off” LLM client.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The multi-model native chat makes it easier to build an intuitive sense of
models.&lt;/strong&gt; Interacting with Claude/Gemini/GPT-N within the same dialog gives you
a much more intuitive sense of the “shape” of each of the models. It’s also
quite interesting to see, say, Claude Sonnet continuing from a Gemini response,
since their inherent writing styles are quite different.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;It allows you to burn tokens.&lt;/strong&gt; First-party chat clients are mostly on a flat,
monthly subscription model. The incentives are to provide you with a good
experience so you don’t cancel, while otherwise minimizing the number of tokens
you use. You pay for all your Chorus tokens. So, want to generate a response
from Opus 4.5 and GPT-5.1 and Sonnet 4.5 and Gemini Pro 3 for each conversation
turn? Go for it. Oftentimes, many fanned-out responses to a message will be
more useful than the “best” model’s response. It’s effectively an easy way to do
“LLM councils” or best-of-N.&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
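&lt;p&gt;The fan-out pattern is straightforward to sketch: send the same prompt to several models concurrently and collect every response. A minimal asyncio sketch, where &lt;code&gt;complete&lt;/code&gt; is a hypothetical stand-in for real vendor SDK calls:&lt;/p&gt;

```python
import asyncio

async def complete(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a per-model completion call;
    swap in real vendor SDK clients here."""
    await asyncio.sleep(0)  # stands in for network latency
    return f"[{model}] response to: {prompt}"

async def fan_out(prompt: str, models: list[str]) -> dict[str, str]:
    # Query every model concurrently; total latency is the slowest
    # single model, not the sum of all of them.
    responses = await asyncio.gather(*(complete(m, prompt) for m in models))
    return dict(zip(models, responses))

results = asyncio.run(fan_out(
    "Summarize Landauer's principle in one sentence.",
    ["claude-opus", "gpt", "gemini"],
))
```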
&lt;p&gt;&lt;strong&gt;It’s very clearly not trying to get your data.&lt;/strong&gt; Chorus originally shipped
with an optional subscription model that was essentially a model proxy. However,
it was always possible to use your own API keys and their
&lt;a href=&#34;https://docs.chorus.sh/account/privacy&#34;&gt;privacy policy&lt;/a&gt; was refreshingly clear:
“We don’t look at your chats, and we don’t want to.” Now that Chorus is
open-sourced, you can verify that.&lt;/p&gt;
&lt;p&gt;We’re still quite early in figuring out what UX patterns work well for LLMs.
Chorus took an interesting idea, executed it well, and as a result has a really
great novel UX that none of the other clients nail. There are obviously many use
cases for LLMs. The spotlight now is on agentic harnesses like Claude Code,
Codex, Cursor, and the like. Chat interfaces are not solved! Assuredly there is
more room to explore the
&lt;a href=&#34;https://simonwillison.net/2023/Apr/2/calculator-for-words/&#34;&gt;“calculator for words”&lt;/a&gt;
idea space.&lt;/p&gt;
&lt;p&gt;I wish Charlie and team the best of luck on Conductor, and thank them for making
Chorus open source. 🎉 I’m looking forward to contributing &amp;amp; hacking on Chorus.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;The macOS app for ChatGPT feels noticeably sluggish now. I think this
regression started around the GPT-5 release, if memory serves. My mental
ranking used to be that the native macOS ChatGPT app was ~2x faster than
Claude. Now, that’s been reversed.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Though I will say that at this point, the ChatGPT/Claude desktop apps have
largely caught up.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;Well, where N = “number of models”, not “number of responses”. I think
Chorus could add support for multiple generations from the same model, but
that doesn’t exist today.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>The Coming Need for Formal Specification</title>
        <link>https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/</link>
        <pubDate>Fri, 12 Dec 2025 17:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/</guid>
        <description>&lt;p&gt;In late 2022, I had a conversation with a senior engineer on the coming problem
of “what to do when AI is writing most of the code”. His opinion, which I found
striking at the time, was that engineers would transition from writing mostly
“implementation” code, to mostly writing tests and specifications.&lt;/p&gt;
&lt;p&gt;I remember thinking at the time that this was prescient. With three years of
hindsight, it seems like things are trending in a different direction. I
&lt;em&gt;thought&lt;/em&gt; that the reason that testing and specifications would be useful was
that AI agents would be struggling to &amp;ldquo;grok&amp;rdquo; coding for quite some time, and
that you’d need to have robust specifications such that they could stumble
toward correctness.&lt;/p&gt;
&lt;p&gt;In reality, AI-written tests were one of the &lt;em&gt;first&lt;/em&gt; tasks I felt comfortable
delegating. Unit tests are squarely in-distribution for what the models have
seen in public open source code: there are a lot of unit tests out there, and
they follow predictable patterns. I’d expect that the variance of
implementation code &amp;ndash; and the requirement for out-of-distribution patterns &amp;ndash;
is much higher than for testing code. The result is that models are now quite
good at translating English descriptions into crisp test cases.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h2 id=&#34;system-design&#34;&gt;System Design&lt;/h2&gt;
&lt;p&gt;There exists a higher level problem of holistic system behavior verification,
though. Let’s take a quick diversion into systems design to see why.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/gestalt_1.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/gestalt_1_huabf9a4a7c7f4371431069f963f6f899e_119953_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/gestalt_1_huabf9a4a7c7f4371431069f963f6f899e_119953_0x600_resize_lanczos_3.png&#34; width=&#34;433&#34; height=&#34;300&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    
&lt;/figure&gt;

&lt;p&gt;System design happens on multiple scales. You want systems to be robust &amp;ndash; both
in their runtime, and their ability to iteratively evolve. This nudges towards
decomposing systems into distinct components, each of which can be &lt;em&gt;internally&lt;/em&gt;
complicated but exposes a firm interface boundary that allows you to abstract
over this internal complexity.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/gestalt_2.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/gestalt_2_hudbca47da5942290f044817815daa5db7_223341_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/12/The-Coming-Need-for-Formal-Specification/gestalt_2_hudbca47da5942290f044817815daa5db7_223341_0x600_resize_lanczos_3.png&#34; width=&#34;399&#34; height=&#34;300&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    
&lt;/figure&gt;

&lt;p&gt;If we design things well, we can swap out parts of our system without disrupting
other parts or harming the top-level description of what the system does. We can
also perform top-down changes iteratively &amp;ndash; adding new components, and retiring
old ones, at each level of description of the system.&lt;/p&gt;
&lt;p&gt;This all requires careful thinking of how to build these interfaces and
component boundaries in such a way that (1) there is a clean boundary between
components and (2) that stringing all the components together actually produces
the desired top-level behavior.&lt;/p&gt;
&lt;p&gt;To do this effectively, we require maps of various levels of description of the
system’s
&lt;a href=&#34;https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation&#34;&gt;territory&lt;/a&gt;. My
conjecture is that &lt;em&gt;code is not a good map for this territory&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;To be clear, I’ve found a lot of value in throwing out system diagram maps and
looking directly at the code territory when debugging issues. However,
code-level reasoning is often not the best level of abstraction to use for
reasoning about systems. This is for a similar reason that &amp;ldquo;modeling all the
individual molecules of a car&amp;rdquo; is not a great way to estimate that car&amp;rsquo;s braking
distance.&lt;/p&gt;
&lt;p&gt;LLMs have increasingly longer context windows, so one could naively say “just
throw all the code in the context and have it work it out”. Perhaps. But this is
still just clearly not the most efficient way to reason about large-scale
systems.&lt;/p&gt;
&lt;h2 id=&#34;formal-verification&#34;&gt;Formal Verification&lt;/h2&gt;
&lt;p&gt;The promise of formal verification is that we can construct provably composable
maps which still match the ground-level territory. Formal verification
allows you to specify a system mathematically, and then exhaustively
&lt;em&gt;prove&lt;/em&gt; that the system is correct. As an analogy: unit tests are like running an
experiment. Each passing test is an assertion that, for the conditions checked,
the code is correct. There could still exist some &lt;em&gt;untested&lt;/em&gt; input that would
demonstrate incorrect behavior. A single negative test is enough to show the code
is incorrect, but only a provably exhaustive set of inputs would be sufficient
to show the code is fully correct. Formally verifying a program, by contrast, is
like writing a proof: a self-consistent proof is sufficient to show
that the properties you’ve proven always hold.&lt;/p&gt;
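&lt;p&gt;The experiment-versus-proof distinction can be made concrete. For a finite input domain, exhaustively checking a property amounts to a brute-force proof over that domain, whereas a unit test merely samples it. A toy illustration (real formal verification handles unbounded domains symbolically, which enumeration cannot):&lt;/p&gt;

```python
def clamp(x: int, lo: int, hi: int) -> int:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

# A unit test is an experiment: it checks a handful of points.
assert clamp(5, 0, 10) == 5
assert clamp(-3, 0, 10) == 0

# An exhaustive check over a bounded domain is a brute-force proof:
# for *every* input in the domain, clamp(x, lo, hi) lands in [lo, hi].
for x in range(-50, 51):
    for lo in range(-10, 11):
        for hi in range(lo, 11):
            assert clamp(x, lo, hi) in range(lo, hi + 1)
```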
&lt;p&gt;I saw Martin Kleppmann’s
&lt;a href=&#34;https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html&#34;&gt;“Prediction: AI will make formal verification go mainstream”&lt;/a&gt;
right after I posted
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/&#34;&gt;“The Decline of the Software Drafter?”&lt;/a&gt;,
which became the inspiration for this post. Kleppmann’s argument is that, just
as the cost of generating code is coming down, so too will the cost of formal
verification of code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, as of 2009, the formally verified seL4 microkernel consisted of
8,700 lines of C code, but proving it correct required 20 person-years and
200,000 lines of Isabelle code – or 23 lines of proof and half a person-day
for every single line of implementation. Moreover, there are maybe a few
hundred people in the world (wild guess) who know how to write such proofs,
since it requires a lot of arcane knowledge about the proof system.&lt;/p&gt;
&lt;p&gt;&amp;hellip;&lt;/p&gt;
&lt;p&gt;If formal verification becomes vastly cheaper, then we can afford to verify
much more software. But on top of that, AI also creates a need to formally
verify more software: rather than having humans review AI-generated code, I’d
much rather have the AI prove to me that the code it has generated is correct.
If it can do that, I’ll take AI-generated code over handcrafted code (with all
its artisanal bugs) any day!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve long been interested in formal verification tools like TLA+ and Rocq (née
Coq). I haven’t (yet) been able to justify to myself spending all that much time
on them. I think that’s changing: the cost of writing code is
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/&#34;&gt;coming down dramatically&lt;/a&gt;.
The cost of &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/10/What-I-Look-For-in-AI-Assisted-PRs/&#34;&gt;reviewing&lt;/a&gt;
and maintaining it is also coming down, but at a slower rate. I agree with
Kleppmann that we need systematic tooling for dealing with this mismatch.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wiktionary.org/wiki/wishcasting&#34;&gt;Wishcasting&lt;/a&gt; a future world, I
would be excited to see something like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One starts with a high-level system specification, in English.&lt;/li&gt;
&lt;li&gt;This specification is spun out into multiple TLA+ models at various levels
of component specificity.&lt;/li&gt;
&lt;li&gt;These models would allow us to determine the components that are
load-bearing for system correctness.&lt;/li&gt;
&lt;li&gt;The most critical set of load-bearing components are implemented with a
corresponding formal verification proof, in something like
&lt;a href=&#34;https://rocq-prover.org/&#34;&gt;Rocq&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The rest of the system components are still audited by an LLM to ensure they
correctly match the behavior of their associated component in the TLA+ spec.&lt;/li&gt;
&lt;/ul&gt;
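&lt;p&gt;What a TLA+ model checker like TLC does mechanically is explore every reachable state of a specification and check invariants against each one. A toy explicit-state model checker conveys the idea (an illustration of the concept only; real TLC handles vastly richer specifications):&lt;/p&gt;

```python
from collections import deque

def check_invariant(initial, next_states, invariant):
    """Toy explicit-state model checker: breadth-first search over all
    reachable states, returning a violating state if one exists."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state  # counterexample found
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # invariant holds in every reachable state

# Toy "spec": a counter that increments modulo 5 must stay in 0..4.
violation = check_invariant(
    initial=0,
    next_states=lambda s: [(s + 1) % 5],
    invariant=lambda s: s in range(5),
)
assert violation is None
```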
&lt;p&gt;The biggest concern to me related to formal verification is the following two
excerpts, first from Kleppmann, and then from Hillel Wayne, a notable proponent
of TLA+:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are maybe a few hundred people in the world (wild guess) who know how to
write such proofs, since it requires a lot of arcane knowledge about the proof
system. &amp;ndash;
&lt;a href=&#34;https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html&#34;&gt;Martin Kleppmann&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;TLA+ is one of the more popular formal specification languages and you can
probably fit every TLA+ expert in the world in a large schoolbus. &amp;ndash;
&lt;a href=&#34;https://hillelwayne.com/post/why-dont-people-use-formal-methods/&#34;&gt;Hillel Wayne&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For formal verification to be useful in practice, at least some of the arcane
knowledge of its internals will need to be broadly disseminated. Reviewing an
AI-generated formal spec of a problem won’t be useful if you don’t have enough
knowledge of the proof system to poke holes in what the AI came up with.&lt;/p&gt;
&lt;p&gt;I’d argue that undergraduate Computer Science programs should allocate some of
their curriculum to formal verification. After all, students should have more
time on their hands as they delegate implementation of their homework to AI
agents.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;That is, under current optimization pressures. We’re only a few months past
the &lt;a href=&#34;https://www.anthropic.com/news/claude-3-7-sonnet&#34;&gt;Sonnet 3.7&lt;/a&gt; and
&lt;a href=&#34;https://openai.com/index/introducing-o3-and-o4-mini/&#34;&gt;OpenAI o3&lt;/a&gt; models,
both of which had a penchant for reward hacking on tests.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Zip Files as (Simple) Key-Value Stores</title>
        <link>https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/</link>
        <pubDate>Thu, 11 Dec 2025 20:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/</guid>
        <description>&lt;p&gt;I recently encountered a fun performance problem. Consider the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need to distribute a key-value dataset with string keys and opaque value
blobs in the 100B - 1MB range.&lt;/li&gt;
&lt;li&gt;There are on the order of 10k keys to distribute.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Critically, you are in a constrained memory environment where you do not
have enough memory to load all the blobs into memory at one time.&lt;/strong&gt; You do,
however, have enough memory to load all the keys and a small amount of
metadata, if you want.&lt;/li&gt;
&lt;li&gt;The key access pattern on the data is random.&lt;/li&gt;
&lt;li&gt;The workload is entirely read-only.&lt;/li&gt;
&lt;li&gt;You get to choose the file format and reader, but you must be able to
disseminate this whole structure as one file.&lt;/li&gt;
&lt;li&gt;Due to environment restrictions, you can’t use &lt;code&gt;mmap&lt;/code&gt; to load files into
memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’ve heard the good news about SQLite from
&lt;a href=&#34;https://simonwillison.net/tags/SQLite/&#34;&gt;Simon Willison&lt;/a&gt;,
&lt;a href=&#34;https://litestream.io/blog/why-i-built-litestream/&#34;&gt;Ben Johnson&lt;/a&gt;, and others,
and thought this might be an interesting use case. I built out both the reader
and writer components. TLDR, it worked!&lt;/p&gt;
&lt;p&gt;The latency requirements here were rather sensitive as well, so I tried to
squeeze all I could out of SQLite in terms of keeping both the memory footprint
and read latency low. I diligently read through the SQLite
&lt;a href=&#34;https://SQLite.org/pragma.html&#34;&gt;pragmas page&lt;/a&gt;, which inspired my
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/&#34;&gt;WITHOUT ROWID post&lt;/a&gt;.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; In the
end, the SQLite-based solution worked out quite nicely.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt; However, I had the
nagging suspicion that I was overlooking an obvious solution that was simpler
than SQLite.&lt;/p&gt;
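For a sense of the shape of the SQLite approach, here's a minimal read-only sketch in Python. The `kv` table name and the specific pragma value are illustrative choices for this sketch, not the exact production setup:

```python
import sqlite3

def open_kv(path):
    # Read-only connection; immutable=1 skips locking, which is safe
    # because the workload is entirely read-only.
    conn = sqlite3.connect(f"file:{path}?mode=ro&immutable=1", uri=True)
    # Keep SQLite's page cache small to respect the memory constraint.
    # A negative cache_size is interpreted as KiB (so this is ~256 KiB).
    conn.execute("PRAGMA cache_size = -256")
    return conn

def get(conn, key):
    # Assumes a table like: CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

Each lookup touches only the B-tree pages needed to find the row, so memory stays bounded by the page cache rather than the dataset size.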
&lt;p&gt;Yes, there &lt;em&gt;are&lt;/em&gt; also other key-value stores like RocksDB &amp;ndash; but that also felt
like something of a heavy dependency for a read-only workload. Yes, I &lt;em&gt;could&lt;/em&gt;
use something like a line-delimited file, scan through the file for the keys,
store the byte offsets, and seek to the offset at read time. Yes, I &lt;em&gt;could&lt;/em&gt; make
my own small file format that stores the key-value byte offsets in a header
section to improve upon the line-delimited version.&lt;/p&gt;
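A minimal sketch of what that DIY header/offset format could look like. The exact layout here (an 8-byte length prefix followed by a JSON key-to-offset index, then the raw blobs) is purely illustrative, not the format I actually benchmarked:

```python
import json
import struct

def write_dat(path, items):
    # Layout: [8-byte little-endian header length][JSON {key: [offset, size]}][blobs]
    index, blobs, offset = {}, [], 0
    for key, value in items.items():
        index[key] = (offset, len(value))
        blobs.append(value)
        offset += len(value)
    header = json.dumps(index).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))
        f.write(header)
        for blob in blobs:
            f.write(blob)

def read_dat(path, key):
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))
        index = json.loads(f.read(hlen))
        offset, size = index[key]
        # Seek past the header to the blob; only `size` bytes are read.
        f.seek(8 + hlen + offset)
        return f.read(size)
```

A real reader would parse the header once at startup and keep the index in memory, which fits the "all keys plus a little metadata" budget from the constraints above.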

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/options.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/options.png&#34; alt=&#34;Possible solutions to the problem&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Possible solutions to the problem&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And yet, surely the generic version of this problem &amp;ndash; “I just want a small,
low-dependency way to index into blobs with string keys” &amp;ndash; must have a simple,
elegant solution.&lt;/p&gt;
&lt;p&gt;It does: this is just a
&lt;a href=&#34;https://en.wikipedia.org/wiki/ZIP_(file_format)&#34;&gt;zip file&lt;/a&gt;. lol.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Because the files in a ZIP archive are compressed individually, it is possible
to extract them, or add new ones, without applying compression or
decompression to the entire archive.&lt;/p&gt;
&lt;p&gt;A directory is placed at the end of a ZIP file. This identifies what files are
in the ZIP and identifies where in the ZIP that file is located. This allows
ZIP readers to load the list of files without reading the entire ZIP archive.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/ZIP_(file_format)&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/zip_layout.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/zip_layout_huf6cd50f7d0aae52a829c7d90554d081d_169375_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/zip_layout_huf6cd50f7d0aae52a829c7d90554d081d_169375_0x600_resize_lanczos_3.png&#34; alt=&#34;The structure of a ZIP file&#34; width=&#34;551&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;The structure of a ZIP file (&lt;a href=&#34;https://en.wikipedia.org/wiki/ZIP_%28file_format%29&#34;&gt;Source&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Most folks think about &lt;code&gt;.zip&lt;/code&gt; files purely in the context of compression, but
zip files don’t actually require the files contained inside to be compressed.
The structure of the zip file is basically exactly what we want. It has a
directory section that is small, contains the list of all the contained files
(keys, in our case), and lets us pull out the data for a key with a random
access pattern. This is in contrast to the other common archive file format,
&lt;code&gt;.tar&lt;/code&gt;, which does not have this nice property.&lt;/p&gt;
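To make this concrete, here's a minimal zip-backed reader using Python's standard-library `zipfile` (the class and method names are just for this sketch):

```python
import zipfile

class ZipKV:
    """Read-only key-value store backed by a zip archive (one entry per key)."""

    def __init__(self, path):
        # Opening the archive reads only the central directory at the end of
        # the file: the full key list plus per-entry offsets, not the blobs.
        self.zf = zipfile.ZipFile(path, "r")

    def keys(self):
        return self.zf.namelist()

    def get(self, key):
        # Seeks directly to this entry's offset; only this blob is read.
        with self.zf.open(key) as f:
            return f.read()
```

The writer side is equally small: `zipfile.ZipFile(path, "w", zipfile.ZIP_STORED)` plus `writestr(key, blob)`, where `ZIP_STORED` skips compression entirely.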
&lt;p&gt;Yes, disks are &lt;em&gt;slow&lt;/em&gt;, but if we don’t have any memory to play with&amp;hellip; what can
we do other than throwing this at the disk? And if we’re going to use the disk
anyways, why not pick a format that’s already exactly what we need? Zip has none
of the overhead of an execution engine or a B-tree on-disk page layout, and,
just like SQLite, a competent zip client exists for any serious programming
language under the sun.&lt;/p&gt;
&lt;p&gt;I put together a small benchmark, matching the constraints above, to see if
this would work as expected. It tests four readers: SQLite (both with and
without rowids), a zip file, and a custom &lt;code&gt;.dat&lt;/code&gt; file format that
uses a header/offset-pointer approach.&lt;/p&gt;
&lt;p&gt;The benchmark produced some interesting results. I’d only treat these charts as
directionally correct, but the orders of magnitude and the relative positioning
of the approaches should hold. Note that the y-axis is log-scaled on some charts
and linear on others:&lt;/p&gt;
&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/p90_latency_by_blob_size.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/p90_latency_by_blob_size_hu1578768a5f0f1332d2fd25f579598cdf_42035_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/p90_latency_by_blob_size_hu1578768a5f0f1332d2fd25f579598cdf_42035_0x600_resize_lanczos_3.png&#34; alt=&#34;P90 latency by blob size&#34; width=&#34;530&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;P90 latency by blob size&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/10kb_latency_percentiles.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/10kb_latency_percentiles_huda90d7f31e563a6a247a0d0b05ad9f83_43843_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/10kb_latency_percentiles_huda90d7f31e563a6a247a0d0b05ad9f83_43843_0x600_resize_lanczos_3.png&#34; alt=&#34;10KB latency percentiles&#34; width=&#34;526&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;10KB latency percentiles&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/1mb_latency_percentiles.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/1mb_latency_percentiles_hu3f50d91db0700df7c4e71a67797fbd5c_47466_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/11/Zip-Files-as-Simple-Key-Value-Stores/1mb_latency_percentiles_hu3f50d91db0700df7c4e71a67797fbd5c_47466_0x600_resize_lanczos_3.png&#34; alt=&#34;1MB latency percentiles&#34; width=&#34;524&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;1MB latency percentiles&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;Between 100B and 100KB, the Zip file performs noticeably better than
SQLite. Even with 1MB blobs, where disk I/O dominates, they perform
roughly the same.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WITHOUT ROWID&lt;/code&gt; on the SQLite table made performance significantly worse.
This is slightly surprising with the small blobs, but entirely unsurprising
past 1KB blob size.&lt;/li&gt;
&lt;li&gt;My janky custom file format wasn’t noticeably better in any size range than
an off-the-shelf solution, so (expectedly) it makes sense to not DIY an
on-disk index like this, at least for my simple use case.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even though I ended up using SQLite, the humble Zip file is quite competitive as
an on-disk index. At least, in this very simple constrained use case.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;Sadly, just a fun diversion as &lt;code&gt;WITHOUT ROWID&lt;/code&gt; doesn’t help for large rows.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;SQLite added some additional benefits, such as the ability to add row-level
metadata about each key, which in practice was used to optimize some of the
query patterns.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>What I Look For in AI-Assisted PRs</title>
        <link>https://benjamincongdon.me/blog/2025/12/10/What-I-Look-For-in-AI-Assisted-PRs/</link>
        <pubDate>Wed, 10 Dec 2025 21:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/10/What-I-Look-For-in-AI-Assisted-PRs/</guid>
        <description>&lt;p&gt;I review a lot of PRs these days. As the job of a PR author
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/&#34;&gt;becomes easier&lt;/a&gt; with AI,
the job of a PR reviewer gets harder.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;AI can &amp;ldquo;assist&amp;rdquo; with code review, but I’m less optimistic about AI code review
than AI code generation. Sure, Claude/Codex can be quite helpful as a first
pass, but code review still requires a large amount of &lt;em&gt;human&lt;/em&gt; taste.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;I care about the high level abstractions my team uses in our codebase, and about
how the pieces fit together. I care that our codebase can be intuitively
understood by new team members. I care that code is tamper-resistant &amp;ndash; that we
build things robustly such that imperfect execution in the future doesn’t cause
something to blow up. Systems should be decomposable. You should be able to fit
all the components of the system in your head in a reasonably faithful mental
model, but you shouldn’t &lt;em&gt;need&lt;/em&gt; to fit all the implementation details of each
component in your head to not cause something to break.&lt;/p&gt;
&lt;p&gt;Anyways.&lt;/p&gt;
&lt;p&gt;I’ve been trying to speed up my review latency for PRs, and have given some
thought to the heuristics I use to evaluate PRs. Heuristics are lossy, of
course, but they’re necessary. If you haven’t given this much thought recently,
it’s useful to consciously recalibrate the heuristics you use when reviewing
code, now that so much code is generated by LLMs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;General Reviewability&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Did the author provide a detailed &amp;amp; accurate PR description?&lt;/li&gt;
&lt;li&gt;What level of sensitivity is this code? Is this performance or safety
critical code that needs to be reviewed with a fine-tooth comb,
line-by-line, or is it something peripheral like an internal UI or CLI that
can be &amp;ldquo;good enough&amp;rdquo;?&lt;/li&gt;
&lt;li&gt;Does the change appear &lt;em&gt;reversible&lt;/em&gt;? Is the Git diff of the change human
readable? I find LLMs are often really eager to make big changes that
clobber the Git diff. Incremental change is usually preferable.&lt;/li&gt;
&lt;li&gt;Is the PR of an actually reviewable size? My personal bar is: &amp;lt;500 lines is
ideal, &amp;gt;1000 lines is borderline unreviewable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Design &amp;amp; Abstractions&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If this is greenfield code, does the author seem to be setting up suitable
abstractions? Do these abstractions seem like they’ll compose in sane ways?
Do the abstractions have reasonable boundaries that do not leak information?&lt;/li&gt;
&lt;li&gt;Can you zoom out your mind, picture the PR in your head, and &amp;ldquo;make it make
sense&amp;rdquo; with your mental model of the code? Does it make &lt;em&gt;sense&lt;/em&gt; at a
conceptual level what is being proposed, or does this just have the veneer
of &amp;ldquo;good code&amp;rdquo;?&lt;/li&gt;
&lt;li&gt;Would the code be substantially improved by loading the PR into Claude Code
and making a targeted one-sentence prompt? For example: &amp;ldquo;hey, could you
deduplicate some of this logic between classes X and Y and make the Foo
trait more modular&amp;rdquo;. (Fortunately, you can just put this as a review comment
&amp;ndash; the author will probably rewrite with an LLM anyways)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Vibe Code Smells&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What amount of effort does it seem like the author has put into their PR?
N.b. I mean &lt;em&gt;the author&lt;/em&gt; not the AI that wrote the code on behalf of the
author. Human effort and curation still leaves signs behind.&lt;/li&gt;
&lt;li&gt;Did the author leave vibe-coded comments in the PR? (Often this looks like
iterative process comments, of the flavor
&lt;code&gt;// Now we’ll not use type X anymore, per your feedback&lt;/code&gt;.)&lt;/li&gt;
&lt;li&gt;Are imports (especially for Python and sometimes Rust) splattered around the
code, instead of being present at the top?&lt;/li&gt;
&lt;li&gt;Is there a weird amount of defensive copying/cloning due to a
misunderstanding of e.g. immutability-by-default in Scala or how to use
ownership/lifetimes in Rust?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For unit tests, do they cover common edge cases? Do the unit tests actually
make meaningful assertions that exercise the code in meaningful ways, or are
they sloppy assertions that reduce to &lt;code&gt;assert!(true)&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;For unit tests, do they have a weird number of extraneous edge cases that
are unlikely to ever happen in practice? (Also a vibe code smell)&lt;/li&gt;
&lt;li&gt;For tests, do the tests mock out dependencies to the point where the entire
test is useless/invalid?&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Error Handling&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Does the code have a weird level of paranoia about exceptions being thrown?&lt;/li&gt;
&lt;li&gt;Does the code silently swallow errors with try/catch?&lt;/li&gt;
&lt;li&gt;Does the code allow for exceptions/panics in areas where the code
&lt;em&gt;absolutely should not&lt;/em&gt; panic?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these are intended to be knocks against individual PR authors. It’s
useful to assume positive intent when reviewing code. SWEs are under various
pressures, visible and invisible. The &amp;ldquo;system&amp;rdquo; we have today, broadly defined,
results in much more code being produced by non-humans. The best source of truth
for human coding taste is still, for now, humans. Therefore, humans still need
to review a lot of non-human code, as we collectively chip away at the pieces of
code taste that &lt;em&gt;can&lt;/em&gt; be incorporated back into model intuition.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;As of late 2025, code generation is easier than code verification. Martin
Kleppmann is onto something with his
&lt;a href=&#34;https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html&#34;&gt;prediction&lt;/a&gt;
that AI will make formal verification go mainstream. Our current set of
review tooling isn’t sufficient for the tsunamis of code that will be
generated without human oversight over the coming years.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Code review agents have gotten a lot better in the past year, but not at the
same pace as code generation agents. As the model harnesses have gotten more
agentic, code review agents have gotten significantly better at pulling in
the context they need to review a PR. I&amp;rsquo;ve found that review agents are
competent at picking out obvious catastrophic issues and medium-risk
tactical coding mistakes. They&amp;rsquo;re not great with issues where the entire
gestalt of a PR needs significant reworking, or where PRs have a lot of
small things that need to be called out for improvement. There&amp;rsquo;s assuredly
still low-hanging fruit for gains here.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;I heard of an example of this recently, albeit not LLM related. The
situation was described as the author &amp;ldquo;mock[ing] for the world they
&lt;em&gt;wanted&lt;/em&gt;, not the world &lt;em&gt;as it is&lt;/em&gt;&amp;rdquo;. I still chuckle at this. :)&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>SWIM: Outsourced Heartbeats</title>
        <link>https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/</link>
        <pubDate>Tue, 09 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/</guid>
        <description>&lt;p&gt;How does a distributed system reliably determine when one of its members has
failed? This is a tricky problem: you need to deal with unreliable networks and
nodes that can crash at arbitrary times, and you need to do so in a way
that can scale to thousands of nodes. This is the role of a &lt;strong&gt;failure detection&lt;/strong&gt;
system, one of the most foundational parts of many distributed systems.&lt;/p&gt;
&lt;p&gt;There are many rather simple ways to solve this problem, but one of the most
elegant solutions to distributed failure detection is an algorithm that I first
encountered in undergraduate Computer Science: the
&lt;a href=&#34;https://en.wikipedia.org/wiki/SWIM_Protocol&#34;&gt;SWIM Protocol&lt;/a&gt;.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; SWIM here
stands for “Scalable Weakly Consistent Infection-style Process Group
Membership”.&lt;/p&gt;
&lt;h2 id=&#34;what-is-a-failure-detector&#34;&gt;What is a Failure Detector&lt;/h2&gt;
&lt;p&gt;SWIM solves the problem of distributed failure detection. Let’s sketch out this
problem in a bit more detail. Suppose you have a dynamic pool of servers (or
“processes” as they’re often called in distributed algorithms). Processes will
enter the pool as they’re turned on, and leave the pool as they are turned off,
crash, or when they become inaccessible from the network. The problem statement
is that we want to be able to detect when any one of these servers goes offline.&lt;/p&gt;
&lt;p&gt;The uses for such failure detection systems abound: detecting
node failures in workload schedulers like Kubernetes, maintaining membership
lists for peer-to-peer protocols, and determining if a replica has failed in a
distributed database like Cassandra.&lt;/p&gt;
&lt;p&gt;As a toy example, let’s pick something concrete: let’s say we’re building a
distributed key-value store. We’re just getting started with our KV-store; for
today we just need to set up the membership system, wherein distributed replicas of
the data can discover each other. Each replica needs to know of the existence of
all the other replicas, so they can run more complicated distributed systems
protocols &amp;ndash; things like consensus, leader election, distributed locking, etc.
All those other algorithms are fun, but we first need the handshake of “who am I
talking to?” and “I need to know when one of my peers fails”. That’s failure
detection.&lt;/p&gt;
&lt;p&gt;What properties do we want for our failure detector?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a server fails, we should know about it within some bounded time. This is
&lt;strong&gt;Detection Time&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Faster detection time is preferable.&lt;/li&gt;
&lt;li&gt;If a process doesn’t die, we shouldn’t mark it as dead. If we relax this to
allow for “false positives”, we ideally want a quite low &lt;strong&gt;False Positive
Rate&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The amount of load on each node should scale sublinearly with the number of
nodes in the pool. That is, ideally the amount of work per node is the same
whether we have 100 nodes or 1,000,000 nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What properties do we assume about nodes and the network?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The network can drop packets or become partitioned.&lt;/li&gt;
&lt;li&gt;Nodes can crash at arbitrary times during program execution. We cannot rely on
nodes doing anything (like sending an “I’m crashing” message) just prior to
crashing, for example.&lt;/li&gt;
&lt;li&gt;All nodes can send network packets to all other nodes.&lt;/li&gt;
&lt;li&gt;Packets on the network have propagation time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;reasoning-up-to-failure-detectors&#34;&gt;Reasoning up to Failure Detectors&lt;/h2&gt;
&lt;p&gt;What is the most naive functional failure detector that we could build? Well,
for some node $N_1$ in a node pool of ${ N_1, N_2, &amp;hellip;, N_{10} }$, $N_1$ could
ping each of nodes ${ N_2, &amp;hellip;, N_{10} }$. Each of those pinged nodes could
send an acknowledgment as soon as the ping was received. When $N_1$
receives an acknowledgment from, say, $N_4$, then $N_1$ knows that $N_4$ is still “alive”.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/pingack.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/pingack_hu6fc2900d60d9ca071feedecb11d7818d_37960_0x700_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/pingack_hu6fc2900d60d9ca071feedecb11d7818d_37960_0x700_resize_lanczos_3.png&#34; alt=&#34;Ping and acknowledgment&#34; width=&#34;457&#34; height=&#34;350&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Ping and acknowledgment&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Looking at this figure, if $N_1$ waits for some chunk of time, maybe retries a
few times, but still never hears back from, say, $N_4$, then it can mark $N_4$
as “dead”.&lt;/p&gt;
&lt;p&gt;Every time we hear a ping back from another node, we know it’s alive “now”. But
we need to know the state of all members over time, so we run this procedure in
sequence many times. We can call the time between each ping we send out as
$T_{ping}$. And we can call the time we wait to hear back as $T_{fail}$. Both of
these become tunable parameters in our system. If we increase $T_{fail}$, we
decrease the chance that we accidentally mark a neighbor as “dead” due to e.g. a
transient network blip. But if we increase it too long, then we would also allow
an &lt;em&gt;actually&lt;/em&gt; dead node to still sit in our membership list for a long time &amp;ndash;
which we don’t want either.&lt;/p&gt;

&lt;figure&gt;
    
    

    
    

    
    
        
            
        
        
        
        
            
            
            
            
            
                
                
                
            
            
            
            
            
                
            
            
            
            
                
            
        
        
        
        
        
        
        
            
                
                
            
        
        
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/tping.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/tping_hu083e425bdbf6fa9eb4f9cd5a39d85115_29222_1000x0_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/tping_hu083e425bdbf6fa9eb4f9cd5a39d85115_29222_1000x0_resize_lanczos_3.png&#34; alt=&#34;Time between pings&#34; width=&#34;500&#34; height=&#34;545&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Time between pings&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Generalizing this, each of ${ N_1, N_2, &amp;hellip;, N_{10} }$ independently runs
this ping approach. Finally, to bootstrap the process, we can hardcode all the
member node IPs at startup. If a node crashes but later comes back online, it
can broadcast a ping to each of its peers, who then mark it as “alive” and start
pinging it regularly again.&lt;/p&gt;
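&lt;p&gt;As a sketch of the above (not production code &amp;ndash; a real implementation deals with sockets, timeouts, and concurrency), here’s a minimal single-process model of one member in this all-to-all scheme. The &lt;code&gt;Node&lt;/code&gt; class and its method names are made up for illustration; a successful “ping” is simulated by checking a peer’s &lt;code&gt;alive&lt;/code&gt; flag directly:&lt;/p&gt;

```python
class Node:
    """One member of an all-to-all heartbeating pool (illustrative sketch)."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail   # seconds to wait before declaring a peer "dead"
        self.alive = True
        self.last_heard = {}   # peer name -> timestamp of last successful ping

    def bootstrap(self, peers, now):
        # Hardcode the full membership list at startup.
        self.last_heard = {p.name: now for p in peers if p is not self}

    def ping_round(self, peers, now):
        # Every T_ping seconds, ping *every* peer on the membership list.
        for peer in peers:
            if peer is not self and peer.alive:  # stand-in for a real ping/ack
                self.last_heard[peer.name] = now

    def dead_members(self, now):
        # Anyone we haven't heard from in over T_fail seconds is "dead".
        return sorted(n for n, t in self.last_heard.items()
                      if now - t > self.t_fail)
```

&lt;p&gt;With $T_{fail} = 2$, a node that goes silent at $t = 0$ shows up in a peer’s &lt;code&gt;dead_members&lt;/code&gt; list by $t = 3$.&lt;/p&gt;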

&lt;figure&gt;
    
    

    
    

    
    
        
            
        
        
        
        
            
            
            
            
            
                
                
                
            
            
            
            
            
                
            
            
            
            
                
            
        
        
        
        
        
        
        
            
                
                
            
        
        
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/alltoall.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/alltoall_hu98710cf58662be81ae19cf7e064793a8_64584_1000x0_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/alltoall_hu98710cf58662be81ae19cf7e064793a8_64584_1000x0_resize_lanczos_3.png&#34; alt=&#34;All-to-all heartbeating&#34; width=&#34;500&#34; height=&#34;406&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;All-to-all heartbeating&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Woohoo, we’ve just invented a basic form of
&lt;a href=&#34;https://martinfowler.com/articles/patterns-of-distributed-systems/heartbeat.html&#34;&gt;heartbeating&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;What are the characteristics of our basic system?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Number of messages: $k$ per $T_{ping}$ seconds, for each node, where $k$ is
the number of nodes in our pool&lt;/li&gt;
&lt;li&gt;Time to first detection: $T_{fail}$.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What about accuracy? Well&amp;hellip; since we allow for an unreliable network, it’s
sadly provably impossible to have both completeness (all failures detected) and
accuracy (no false positives). See
&lt;a href=&#34;https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf&#34;&gt;this paper&lt;/a&gt;
if you’re curious. In a perfect network with no packet loss, however, we would
have strong accuracy.&lt;/p&gt;
&lt;h2 id=&#34;time-to-swim&#34;&gt;Time to SWIM&lt;/h2&gt;
&lt;p&gt;Surely we can do better than “the most naive thing we could think of”. This is
where SWIM comes in.&lt;/p&gt;
&lt;p&gt;SWIM combines two insights: first, that all-to-all heartbeating like our first
example results in a LOT of overlapping communication. If nodes were able to
“share” the information they gathered more effectively, we could cut down on the
number of messages sent. Second, network partitions often only affect parts of a
network. Just because $N_1$ can talk to $N_2$ but &lt;em&gt;can’t&lt;/em&gt; talk to $N_3$, this
doesn’t necessarily mean that $N_2$ can’t talk to $N_3$.&lt;/p&gt;
&lt;p&gt;The tagline for SWIM is “outsourced heartbeats”, and it works like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Similar to all-to-all heartbeating, each node maintains its own membership
list of all the other nodes it’s aware of. In addition to this list, each node
also keeps track of a “last heard from” timestamp for each of the known
members.&lt;/li&gt;
&lt;li&gt;Every $T_{ping}$ seconds, each node ($N_1$) sends a ping to one other randomly
selected node in its membership list ($N_{other}$). If $N_1$ receives a
response from $N_{other}$, then $N_1$ updates its “last heard from” timestamp
of $N_{other}$ to be the current time.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Here’s the outsourced heartbeating piece:&lt;/em&gt; If $N_1$ does &lt;em&gt;not&lt;/em&gt; hear from
$N_{other}$, then $N_1$ contacts $j$ other randomly selected nodes on its
membership list and requests that &lt;em&gt;they&lt;/em&gt; ping $N_{other}$. If any of those
other nodes are able to successfully contact $N_{other}$, then they inform
$N_1$ and both parties update their “last heard from” time for $N_{other}$.&lt;/li&gt;
&lt;/ul&gt;
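&lt;p&gt;The probe logic above can be sketched in a few lines. This is a simplification (the real protocol is asynchronous and message-based), and &lt;code&gt;direct_ping&lt;/code&gt; is a hypothetical stand-in callable that reports whether one node can currently reach another:&lt;/p&gt;

```python
import random

def swim_probe(n1, other, membership, direct_ping, j=3, rng=random):
    """One SWIM probe round from n1's point of view (illustrative sketch)."""
    if direct_ping(n1, other):
        return True  # heard back directly; update "last heard from"
    # Outsourced heartbeats: ask j random peers to ping `other` on our behalf.
    candidates = [m for m in membership if m not in (n1, other)]
    for helper in rng.sample(candidates, min(j, len(candidates))):
        if direct_ping(n1, helper) and direct_ping(helper, other):
            return True  # a helper reached `other` and reported back to n1
    return False  # suspect `other`; after T_fail, mark it as failed
```

&lt;p&gt;Note how this handles a partial partition: if only the link from $N_1$ to $N_{other}$ is down, some helper can still reach $N_{other}$, and $N_1$ avoids a false positive.&lt;/p&gt;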

&lt;figure&gt;
    
    

    
    

    
    
        
        
            
        
        
        
            
            
            
            
            
            
                
                
                
            
            
            
            
                
            
            
            
            
                
            
        
        
        
        
        
        
        
            
                
                
            
        
        
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/outsourced_ping.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/outsourced_ping_hu8bab3e775a1e638b68267f1e6bd8093a_28940_0x800_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/outsourced_ping_hu8bab3e775a1e638b68267f1e6bd8093a_28940_0x800_resize_lanczos_3.png&#34; alt=&#34;Outsourced heartbeating&#34; width=&#34;456&#34; height=&#34;400&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Outsourced heartbeating&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;If after $T_{fail}$ seconds, $N_1$ still hasn’t heard from $N_{other}$, then
$N_1$ marks $N_{other}$ as failed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;At this point, $N_1$ has determined&lt;/em&gt; that $N_{other}$ has failed. To make the
whole process happen more rapidly, $N_1$ spreads this information to the rest
of the network, usually by piggybacking news of $N_{other}$’s failure on
its ping messages to neighbors, which then gradually propagates through the
entire network &lt;a href=&#34;https://en.wikipedia.org/wiki/Gossip_protocol&#34;&gt;gossip style&lt;/a&gt;.&lt;/p&gt;
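&lt;p&gt;This dissemination step is cheap because the information spreads multiplicatively: every node that has heard the news passes it along on its next ping. A toy simulation (a hypothetical model, not code from the paper) shows a failure notice reaching an entire pool in a number of rounds that grows only logarithmically with pool size:&lt;/p&gt;

```python
import random

def rounds_to_spread(n, rng):
    """Toy model: each informed node piggybacks the failure notice on one
    random ping per round. Returns rounds until all n nodes have heard."""
    informed = {0}  # node 0 detected the failure
    rounds = 0
    while len(informed) != n:
        rounds += 1
        for _ in range(len(informed)):
            informed.add(rng.randrange(n))  # one random ping target each
    return rounds
```

&lt;p&gt;For a pool of 100 nodes this typically converges in around ten rounds.&lt;/p&gt;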

&lt;figure&gt;
    
    

    
    

    
    
        
        
            
        
        
        
            
            
            
            
            
            
                
                
                
                    
                
            
            
            
            
                
            
            
            
            
                
            
        
        
        
        
        
        
        
            
                
                
            
        
        
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/swim.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/swim_hua263ca9d05380d00f7969456dac55cef_31906_0x700_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/09/SWIM-Outsourced-Heartbeats/swim_hua263ca9d05380d00f7969456dac55cef_31906_0x700_resize_lanczos_3.png&#34; alt=&#34;SWIM protocol&#34; width=&#34;455&#34; height=&#34;500&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;SWIM protocol&lt;a href=&#34;https://en.wikipedia.org/wiki/SWIM_Protocol&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;What are the characteristics of this more sophisticated failure detector?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Number of messages per node per interval: $1$ ping in the common case. At most
$1 + j$ when outsourcing is triggered (where $j$ is the number of “outsourced”
ping requests).&lt;/li&gt;
&lt;li&gt;Time to first detection: It takes some fancy math to get to this point, but in
expectation this is $\frac{e}{e-1} \cdot T_{ping}$.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What’s notable about these two properties: First, the number of messages
we send no longer scales with the size of the pool. We could have 1M or 10M or
100M nodes and still send the same number of messages. Second, the expected time
to first detection &lt;em&gt;also&lt;/em&gt; is still independent of the number of nodes. It’s a
constant, tunable by the $T_{ping}$ interval.&lt;/p&gt;
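&lt;p&gt;To put a number on that constant (a quick back-of-the-envelope check):&lt;/p&gt;

```python
import math

# SWIM's expected time-to-first-detection is (e / (e - 1)) * T_ping,
# independent of pool size.
factor = math.e / (math.e - 1)
print(round(factor, 3))  # 1.582

# So with T_ping = 1s, a failure is detected in about 1.58s in expectation,
# whether the pool has 10 nodes or 10 million.
```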
&lt;h2 id=&#34;why-did-swim-stick-with-me&#34;&gt;Why Did SWIM Stick With Me&lt;/h2&gt;
&lt;p&gt;So that’s the trick. It’s quite simple! Just outsource some of our detection to
our neighbors. What made SWIM stick with me is that it is &lt;em&gt;clever&lt;/em&gt;. It seems to
legitimately require some inspiration to get to this solution. We could try to
build up other approaches on top of all-to-all heartbeating, but most of the
obvious improvements aren&amp;rsquo;t competitive with SWIM. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Subset heartbeating, where you pick a subset of your membership list and only
ping those. &lt;em&gt;This reduces the number of messages you need to send, but
increases the time to detection significantly.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Centralized heartbeating, where you elect one node as a leader and it’s the
only one that sends the pings and has authority over the membership list.
&lt;em&gt;This also reduces the number of total network messages, but puts undue load
on a single node.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Basic Gossip Propagation, which looks like “SWIM without the outsourced
heartbeats”. Health information is piggy-backed on ping packets, but you only
ever rely on your own direct pings. &lt;em&gt;This also has reduced network messages
and bounded per-node load, but takes $O(\log(N))$ ping intervals to propagate
through the whole network &amp;ndash; not constant like SWIM gives you.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these come with tradeoffs that SWIM avoids. SWIM is simple, elegant, solves a
challenging problem, and felt to me like “algorithm design has something
interesting to say about distributed systems”. That’s ultimately why it’s stuck
with me since I learned of it.&lt;/p&gt;
&lt;h2 id=&#34;further-reading&#34;&gt;Further Reading&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;SWIM in practice:
&lt;ul&gt;
&lt;li&gt;Uber uses/used it in their &lt;a href=&#34;https://github.com/uber/ringpop-go&#34;&gt;ringpop&lt;/a&gt;
application-level sharder.&lt;/li&gt;
&lt;li&gt;Hashicorp&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;
&lt;a href=&#34;https://www.hashicorp.com/en/resources/everybody-talks-gossip-serf-memberlist-raft-swim-hashicorp-consul&#34;&gt;uses/used&lt;/a&gt;
it in &lt;a href=&#34;https://developer.hashicorp.com/consul&#34;&gt;Consul&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://courses.grainger.illinois.edu/cs425/fa2022/L6.FA22.pdf&#34;&gt;CS425 slides&lt;/a&gt;
on Failure Detectors and Membership lists&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I feel a small amount of pride in pointing to the Wikipedia article for
SWIM, as I was the one to create its sequence diagram, and also spent some
time generally improving the quality of that article.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;RIP.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>The Decline of the Software Drafter?</title>
        <link>https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/</link>
        <pubDate>Mon, 08 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/</guid>
        <description>&lt;h1 id=&#34;1&#34;&gt;1.&lt;/h1&gt;
&lt;p&gt;It’s hard not to think about the direction the software engineering field is
going in. I don’t think you, dear reader, need to be reminded of this, but just
to set up some timeline tentpoles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In &lt;a href=&#34;https://benjamincongdon.me/blog/2024/07/21/How-I-Use-AI-Mid-2024/&#34;&gt;mid 2024&lt;/a&gt;, Github Copilot was
something of a pleasant convenience for coding. This was the “fancy
autocomplete” era. A “nice to have”, which you &lt;em&gt;could&lt;/em&gt; lean on, but where
you still felt fully in-control of the process of writing code.&lt;/li&gt;
&lt;li&gt;In &lt;a href=&#34;https://benjamincongdon.me/blog/2025/02/02/How-I-Use-AI-Early-2025/&#34;&gt;early February (2025)&lt;/a&gt;, I
wrote about how I was having more success with Cursor (still great) and
Copilot Edits (largely left in the dust). I distinctly remember writing
greenfield code with the Cursor Composer in late 2024, being quite
impressed, but still finding that the frontier models of the time would fall
over after a project got much past 1-2k lines of code.&lt;/li&gt;
&lt;li&gt;Weeks after I wrote that, the initial version of Claude Code was released.
Codex (in April) and Gemini CLI (in June) followed soon after. Claude Code
was a step change improvement. It “just worked”. You told Claude what to
code, and it largely could just do that.&lt;/li&gt;
&lt;li&gt;Claude Sonnet 4.5, released September 29, and Claude Opus 4.5, released
November 24, somehow both felt like breakthroughs in coding ability.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;
Codex CLI has improved through the year as well. I still find Claude Code to
be the “best” of the CLIs, but Codex is still quite competitive. Cursor
remains competitive as well.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In late 2024, I remember having a bit of the feeling of being “the weird one” at
work, who tried to offload as much of my coding on LLMs as possible. I developed
a bit of a reputation for this, and throughout 2025 I’d have people
spontaneously ask me in 1:1s if I’d found any new techniques or tips for getting
better performance out of AI coding tools.&lt;/p&gt;
&lt;p&gt;These tools are excellent. They’re also, descriptively, somewhat of an
inevitability. The tools exist, they are getting better month-over-month, there
are market forces pushing for their adoption, for the time being there is
immense comparative advantage in being able to wield them effectively; ignoring
them seems increasingly unwise.&lt;/p&gt;
&lt;p&gt;Throughout 2025, I felt more and more that I was able to offload much of the
implementation of my work to coding agents. I’m a Tech Lead, so my time is split
between meetings, more meetings, writing docs, and preciously guarded IC time.
Offloading more to coding agents increased my effective output, as I could fire
off an agent (or 2, or 3) before a meeting, check their work afterwards, and
guide them towards solutions over the course of my day.&lt;/p&gt;
&lt;p&gt;This wasn’t “merely” greenfield work, either. Much of this was changes to old,
complex, mission critical codebases. I reviewed the output with an appropriate
amount of rigor, and became increasingly impressed.&lt;/p&gt;
&lt;p&gt;Look, I don’t want to sound like I’m an Anthropic marketer, but if you look at
this “”chart”” from the February 24
&lt;a href=&#34;https://www.anthropic.com/news/claude-3-7-sonnet&#34;&gt;release&lt;/a&gt; of Claude Sonnet 3.7
and Claude Code, I think they landed their roadmap for this year.&lt;/p&gt;

&lt;figure&gt;
    
    

    
    

    
    
        
        
        
        
        
        
        
        
        
        
        
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/claude_graph.webp&#34;&gt;
            
            
            
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/claude_graph.webp&#34; alt=&#34;Claude Code release &amp;rsquo;timeline&amp;rsquo; chart&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Claude Code release &amp;rsquo;timeline&amp;rsquo; chart&lt;a href=&#34;https://www.anthropic.com/news/claude-3-7-sonnet&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Claude Code &amp;ndash; the others too, but in particular CC for me &amp;ndash; &lt;em&gt;feels&lt;/em&gt; and &lt;em&gt;acts&lt;/em&gt;
like a collaborator at this point. In the first half of the year, it felt like a
reasonably competent intern; in the latter half, it’s a jagged-but-staggeringly
competent junior engineer who you can sometimes coach into writing legitimately
inspired code with a simple “&lt;em&gt;but how would a senior engineer refactor this for
modularity and maintainability?&lt;/em&gt;”&lt;/p&gt;
&lt;p&gt;All that to say: with the release of Opus 4.5, I sense a shift. I find no
pleasure in saying this. &lt;strong&gt;Most of what we call coding is now “solved”.&lt;/strong&gt; I
don’t think, however, that we’re post “software engineer”. I don’t think we are
in any way
&lt;a href=&#34;https://every.to/thesis/two-ways-to-win-in-the-post-software-era&#34;&gt;“post-software”&lt;/a&gt;.
I expect the amount of software being written to dramatically increase as the
cost to produce it falls. I also believe that we will still need deeply
technical people to guide that production, and develop and maintain the
infrastructure to run it.&lt;/p&gt;
&lt;p&gt;And yet: We are nearly at coder-equivalency for economically useful coding. A
sufficiently experienced software engineer &lt;em&gt;can&lt;/em&gt; now write &amp;gt;90% of
production-ready code purely through prompting. You still largely need to know
&lt;em&gt;what&lt;/em&gt; to prompt, but even that becomes easier as models&amp;rsquo; intuitions for
software development improve.&lt;/p&gt;
&lt;p&gt;I was admittedly skeptical at Dario Amodei’s
&lt;a href=&#34;https://www.businessinsider.com/anthropic-ceo-ai-90-percent-code-3-to-6-months-2025-3&#34;&gt;March 2025 claim&lt;/a&gt;
that this would happen within the year; it happened a few months later than his
timeline, but in my opinion it did happen. The obvious corollary is that, for
now, that remaining 10% becomes extremely valuable.&lt;/p&gt;
&lt;p&gt;The industry, collectively, has had less than a year to respond to this.
Regardless of what various leadership-class people say, this is not yet priced
in. It’s not priced in to e.g. CS university program structure, engineering
ladders, company structure, team structure, and so on.&lt;/p&gt;
&lt;p&gt;To think through what may remain, I’ll attempt to defer to history.&lt;/p&gt;
&lt;h1 id=&#34;2-drafters-a-historical-analogy&#34;&gt;2. Drafters: A Historical Analogy&lt;/h1&gt;
&lt;p&gt;There’s an analogy that’s been bouncing around in my head as I’ve been
processing what appears to be this upcoming shift in software production. It
goes something like this: in the mid-1900s, drafters &amp;ndash; those who would
literally draft engineering diagrams for architects, civil engineers, electrical
engineers, mechanical engineers, and so on &amp;ndash; were in relatively high demand. As
were &lt;a href=&#34;https://en.wikipedia.org/wiki/Computer_(occupation)&#34;&gt;human computers&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
    
    

    
    

    
    
        
        
            
        
        
        
            
            
            
            
            
            
                
                
                
            
            
            
            
                
            
            
            
            
                
            
        
        
        
        
        
        
        
            
                
                
            
        
        
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/drafter.jpg&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/drafter_hua579e22dbb6db3a254524b9e40b4baa0_93917_0x700_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/drafter_hua579e22dbb6db3a254524b9e40b4baa0_93917_0x700_resize_q100_lanczos.jpg&#34; alt=&#34;&amp;lsquo;Man Sitting at Drafting Board&amp;rsquo;, circa 1936&#34; width=&#34;604&#34; height=&#34;350&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;&amp;lsquo;Man Sitting at Drafting Board&amp;rsquo;, circa 1936&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Man_sitting_at_drafting_board_-_NARA_-_283828.jpg&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Automation came for these fields. CAD, and AutoCAD in particular in the early
1980s, automated much of engineering drafting. Not all, but a lot of it.
Electronic Design Automation likewise took over much of the
manual work done to create PCBs. And, of course, &lt;em&gt;human&lt;/em&gt; computers, well&amp;hellip;
yeah.&lt;/p&gt;
&lt;p&gt;I tried getting actual labor statistics for this. For the time period I most
cared about (between 1940 and 2000), the data online was spotty for the amount
of effort I was willing to put into this.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In any case, this seems like a compelling analog for one future of software
engineering. Drafters were a distinct role, separate from e.g. a mechanical
engineer or structural engineer. Software “engineering” has a bit of both; it
encompasses design, production (coding), and maintenance (DevOps, woo&amp;hellip;).&lt;/p&gt;
&lt;p&gt;CAD and EDA didn’t destroy drafting and related technical-but-not-engineering
roles. CAD technician jobs still exist today, but it’s a more commoditized skill
than “pure” engineering. If this analogy holds, then the parts of the software
engineering world that looked more like “software drafters” than
“deep-in-the-design engineers” will largely be commoditized and will shrink.&lt;/p&gt;

&lt;figure&gt;
    
    

    
    

    
    
        
            
        
        
        
        
            
            
            
            
            
                
                
                
            
            
            
            
            
                
            
            
            
            
                
            
        
        
        
        
        
        
        
            
                
                
            
        
        
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/cad.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/cad_hu08e8198120d1f581efba80fb89f0d375_916915_800x0_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/08/The-Decline-of-the-Software-Drafter/cad_hu08e8198120d1f581efba80fb89f0d375_916915_800x0_resize_lanczos_3.png&#34; alt=&#34;Drafter, 1992&#34; width=&#34;400&#34; height=&#34;410&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Drafter, 1992&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Drafter_-_1992_-_BLS.png&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;It’s unclear what chunk of the existing SWE ecosystem &lt;em&gt;is&lt;/em&gt; drafters. I think
most people who get into SWE out of enjoyment of the field probably aren’t in
this category, but it’s definitely out there.&lt;/p&gt;
&lt;h1 id=&#34;3-what-remains&#34;&gt;3. What Remains&lt;/h1&gt;
&lt;p&gt;If I was able to offload &lt;em&gt;all&lt;/em&gt; my coding work on an AI, there still does seem
like a lot left to do as a software engineer. Great! I don’t have to code
anymore. Ah, wait &amp;ndash; still have to figure out how to get all the infrastructure
running to build out my new project, keep the lights on for the dozens of
components we already have in flight, reason through observability, debug wicked
weirdness that happens past the 99th percentile tails, review my agents’ and my
coworkers’ code, write designs, review designs, give career coaching, and on and
on.&lt;/p&gt;
&lt;p&gt;Coding also isn&amp;rsquo;t completely &amp;ldquo;solved&amp;rdquo; yet either! I may not be in the mud
writing thousands of lines of code, but I do still exert significant editorial
control over everything that gets checked-in.&lt;/p&gt;
&lt;p&gt;AI helps with a lot of this. I feel like I have a lot more leverage than I had a
year or two ago, but we’re still pretty far from closing the loop entirely.
AI-assisted debugging, great! AI-assisted responses in the internal help channel
for what my team owns, great! AI-assisted first passes at design docs I’ve
written, great! AI-assisted code review, great!&lt;/p&gt;
&lt;p&gt;I notice that my work days still contain many hours, I still remain busy, much
work remains to be done. Infrastructure remains &lt;strong&gt;hard&lt;/strong&gt;. If anything, all this
has just allowed us to &lt;em&gt;do&lt;/em&gt; more and be more ambitious. Systems rely on momentum
and inertia in part because it requires a ton of time and effort to evolve a
complex system. Reducing the time to build new systems or evolve old ones is
legitimately helpful, but human-used systems still operate on human-driven time
scales &amp;ndash; for now.&lt;/p&gt;
&lt;p&gt;At all the companies I’ve worked with, “produces high quality code artifacts”
was always a bullet point on the SWE ladder, but it was but one of a dozen other
skills and responsibilities that come along with the role. In the immediate
future, I don’t see this expectation going away. In the medium term, I could see
this transitioning away from a focus on artifacts, to a focus on impact &amp;ndash; as
was already the expectation for more senior engineers.&lt;/p&gt;
&lt;p&gt;I predict &lt;a href=&#34;https://benjamincongdon.me/blog/2025/07/31/The-Agency-Gap/&#34;&gt;agency&lt;/a&gt; &amp;amp; impact will become more
of an expectation for junior engineers, since the tools for helping yourself
have become so powerful.&lt;/p&gt;
&lt;p&gt;What remains true:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Knowing how code works at a gears-level is still largely valuable.&lt;/li&gt;
&lt;li&gt;Foundational computer science knowledge is still largely valuable.&lt;/li&gt;
&lt;li&gt;Software “taste” or discernment or heuristics is still largely valuable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What is probably &lt;strong&gt;not&lt;/strong&gt; true:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Being &lt;em&gt;solely&lt;/em&gt; a “coder” is a path to stable employment.&lt;/li&gt;
&lt;li&gt;Being a software &amp;ldquo;craftsperson&amp;rdquo; &lt;em&gt;ceases&lt;/em&gt; to be valuable or enjoyable.
(Rather, I think the value will likely shift and look more like an actual
craft &amp;ndash; high end, not commoditized, but not mass market.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is a &lt;a href=&#34;https://www.gleech.org/ai2025&#34;&gt;lot going on in AI as we end 2025&lt;/a&gt;. I
have grown to dislike the phrase
&lt;a href=&#34;https://www.youtube.com/watch?v=IwU0Eqe9v6A&#34;&gt;“these are the worst the models will ever be”&lt;/a&gt;
because I feel like it tries to prove too much and has been co-opted as
marketing. But adopt that frame for a minute and project a year or two forward
to where things &lt;em&gt;could&lt;/em&gt; go. What skills remain valuable? What deep knowledge do
you still need to draw upon to know if you’re being BS’d by a confused coding
agent? What can you learn now, build now (experientially!), given that the cost
to churn out new quality code is falling to zero? What cognitive work can you
reinvest your time in, if coding demands less of your work day? What types of
people are going to be good leaders in times of industry change, like these?
What type of organization do you want to invest your career capital in?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover:
&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Paolo_Monti_-_Servizio_fotografico_(Milano,_1980)_-_BEIC_6354404.jpg&#34;&gt;Wikimedia&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Obligatory: These are my own opinions and do not reflect those of my employer,
etc.&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;They seemed to me like breakthroughs in non-coding abilities as well, which
is legitimately impressive as neither of them is &lt;em&gt;specialized&lt;/em&gt; as a coding
model.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Epistemic status: &lt;em&gt;Vibe check.&lt;/em&gt;
&lt;a href=&#34;https://www2.census.gov/library/publications/1949/compendia/hist_stats_1789-1945/hist_stats_1789-1945-chD.pdf&#34;&gt;One source&lt;/a&gt;
put the number of drafters in 1940 at 111k, up from 98k in 1930 and 66k
in 1920. Confusingly, this
&lt;a href=&#34;https://www.bls.gov/opub/mlr/1990/12/rpt1full.pdf&#34;&gt;BLS paper&lt;/a&gt; seems to only
indicate there were ~55k drafters in 1990 (though their counting mechanism
is unclear). This
&lt;a href=&#34;https://fraser.stlouisfed.org/title/occupational-employment-wages-9655/2000-707399/content/txt/ocwage_20011114&#34;&gt;St. Louis Fed&lt;/a&gt;
report indicates 200k drafters in 2000 and 193k in 2023. To make meaningful
comparisons, we’d have to correct for population growth, definitional shift,
and so on. Squinting at the numbers, it seems like post-2000 there has been
a slight decline in drafting employment, at least relative to the rest of
the economy.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Embodied Cognition and the &#34;Tokenverse&#34;</title>
        <link>https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/</link>
        <pubDate>Sun, 07 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/</guid>
        <description>&lt;h1 id=&#34;1&#34;&gt;1.&lt;/h1&gt;
&lt;p&gt;One of the common criticisms of modern AI systems is that they aren’t
sufficiently &lt;em&gt;embodied&lt;/em&gt;. The idea being there’s some inherent &lt;em&gt;quality&lt;/em&gt; of being
an agent embedded inside a body in the physical world which cannot be attained
by a token-predicting LLM, regardless of how intelligent an agent becomes.&lt;/p&gt;
&lt;p&gt;To address the validity of this criticism, we need to have a philosophically
rich understanding of what embodiment &lt;em&gt;is&lt;/em&gt; and what it &lt;em&gt;gets us&lt;/em&gt; in terms of
cognitive capacities.&lt;/p&gt;
&lt;h1 id=&#34;2-the-four-es&#34;&gt;2. The Four E’s&lt;/h1&gt;
&lt;p&gt;The best framework I’ve yet found for understanding the benefits of embodiment
for cognition is &lt;a href=&#34;https://en.wikipedia.org/wiki/4E_cognition&#34;&gt;“4E cognition”&lt;/a&gt;.
The four E’s here being: embodied, embedded, extended, and enactive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Embodied cognition&lt;/strong&gt;, our principal concern, is the notion that the body in
which an agent is embedded is deeply tied to the cognitive processes that the
agent can support. It’s the opposite of “brain in a vat”; the body constrains
and guides the concepts that a brain can conceive. The fact that colors,
material properties, and the rest of the pieces of our inner mental worlds
“exist” is guided by the sensorial situation we’ve found ourselves in &amp;ndash; having
eyes to perceive colors, sense organs to feel texture, and so on. Furthermore,
our conceptual handles for things like doors, chairs, water glasses, and so on
are constrained by how we engage with the physical world. In this view, “chair”
is not a natural category in the “out there” world; it is contingent on our
embodiment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Embedded cognition&lt;/strong&gt; is cognition that is aided by being situated in a
surrounding environment that supports the cognitive task. This refers
specifically to environmental affordances that support a cognitive process
still happening in the brain. For example, tool use &amp;ndash; using an abacus
to support a math operation, or a bookshelf to sort books &amp;ndash; reduces an
agent’s cognitive load by embedding it in a suitable
environment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Extended cognition&lt;/strong&gt; suggests that aspects of your environment, outside your
brain, are not just &lt;em&gt;supporting&lt;/em&gt; your cognition but are actually &lt;em&gt;constituent
parts&lt;/em&gt; of the cognitive process. An example of this is using a notebook or
smartphone as an aid to memory and focus.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enactive cognition&lt;/strong&gt; is the idea that cognition “emerges from sensorimotor
activity” &amp;ndash; that there exists no bright line between premeditation and action.&lt;/p&gt;
&lt;h1 id=&#34;3-perception-and-embodiment&#34;&gt;3. Perception and Embodiment&lt;/h1&gt;
&lt;p&gt;Going deeper on the embodiment critique, I think it becomes useful to look at
the interaction between agent and environment: perception. The central argument
of embodied cognition is that the interaction between environment and cognition
is somehow essential for getting the type of thinking that humans are capable
of.&lt;/p&gt;
&lt;p&gt;From David Chapman and Jake Orthwein’s recent wide-ranging
&lt;a href=&#34;https://meaningness.substack.com/p/maps-of-meaningness&#34;&gt;discussion&lt;/a&gt;, we get the
following view of perception:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The reason why 4E-type things are part of the solution to [the problem of
perceptual relevance] is that you sort of evolve to construe the world in
certain narrow ways that have to do with the kind of thing that you are. You
don’t encounter the world as it is first and then have to select from among
that &amp;ndash; &lt;strong&gt;the world presents itself to you in terms of the kind of being that
you are. And that automatically narrows the frame of perception
dramatically.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Emphasis mine)&lt;/p&gt;
&lt;p&gt;This is a fairly radical claim. The intuitive view of perception is that there
is a world “out there” and that we are perceiving that world in our minds in a
way that reflects how the world actually is &amp;ndash; with the implied corollary that
any creature would see the world roughly the same way as us.&lt;/p&gt;
&lt;p&gt;But this idea flips that intuition. From the infinite ways that sense organs
&lt;em&gt;could&lt;/em&gt; make an internal map of the exterior world territory, the maps that we
tend towards making are the ones that are selected by the type of being that one
is. Creatures that can walk will see doors as “walk-through-able” in a way that
water-bound creatures would not.&lt;/p&gt;
&lt;p&gt;Our ability to resolve raw sense data into well-defined objects, intuitively and
preconsciously, is not because those objects &lt;em&gt;exist&lt;/em&gt; in the world in some
objective sense, but because we’re the type of creature that finds use in, for
example, being able to distinguish between objects that are likely dangerous or
safe. The categories that feel &amp;ldquo;natural&amp;rdquo; to us are natural only relative to our
particular form of embodiment, a pure circumstantial contingency.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/perception-filtering.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/perception-filtering.png&#34; alt=&#34;Perception filtering&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Perception filtering&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This figure crudely illustrates the process of perception filtering. There
exists some sense data &amp;ldquo;out there&amp;rdquo;, and depending on the type of being, that
sense data is filtered into a different set of concepts. The human resolves the
outlines of the black object against the blue background as a &amp;ldquo;chair&amp;rdquo;, whereas a
hypothetical aquatic herbivore evolved to use its visual sense data to search
for food resolves the green patches and attaches the concept of &amp;ldquo;food&amp;rdquo; to them.
Same sense data, different resolved concepts.&lt;/p&gt;
&lt;p&gt;Orthwein continues:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;And then perception is being narrowed still further by your goals, your
motivations, what sorts of states are active in you, and the actual embodied
situation that you find yourself in each moment. &lt;strong&gt;So you’re never doing this
rationalist view from nowhere from which you have to deduce what’s relevant.&lt;/strong&gt;
The world is giving you a relevance moment by moment by moment, which is the
Heidegger “always already meaningful” picture.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the second level of filtering. It’s not just “what can you perceive
given your contingent embodiment”, but it’s also “as an agent with goals and
motivations, certain things will be more or less salient, preconsciously”. The
key piece for me, here, is the notion that inside our heads we’re never
&lt;em&gt;actually&lt;/em&gt; doing the rationalist “view from nowhere” stance. We can try to
approximate this with deliberative thinking, but this is only ever a mere
approximation. Embodiment affords and demands a level of perception-based
filtering.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h1 id=&#34;4-mapping-the-4es-to-llms&#34;&gt;4. Mapping the 4E’s to LLMs&lt;/h1&gt;
&lt;p&gt;OK, so I’ll admit this has become a bit abstract. Let’s rein this in by trying
to map these concepts onto extant LLMs.&lt;/p&gt;
&lt;p&gt;Going in reverse in the 4Es:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Are agents enactive?&lt;/strong&gt; Unclear. Sufficiently harnessed AI agents can “take
action” (heavy scare quotes intended) in the world, but it’s not clear how
seriously we should take this. A thermostat is “enactive” in the world, in
that it’s able to control its environment through action, and the
environment in a sense allows it to “enact” its cognition. If we grant this
property for thermostats, we should likely grant it for AI agents as well,
but this doesn’t seem to prove much.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Are agents extended?&lt;/strong&gt; Definitely, yes. Claude Code uses TODOs in
essentially the same way a human would. Models have limited memory in their
context window; reading/writing from a scratch pad, or using a
code interpreter to offload complicated math, seems like “extended” cognition
in a very similar way to human extended cognition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Are agents embedded?&lt;/strong&gt; I’d say this is a lukewarm “maybe”. We scaffold
LLMs into “agents” specifically by putting them into harnesses which allow
them to interact with the world. However, the cognition itself isn’t
embedded in the environment. There is still a clear line to be drawn between
“what is LLM” and “what is environment”.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Are agents embodied?&lt;/strong&gt; I’d written up to this point having in my mind a
clear “no”. I still think the answer is pretty firmly “no” for most
reasonable definitions of “embodied”, but I’m less certain than I used to
be. Certainly, &lt;em&gt;certainly&lt;/em&gt;, I’m not making the claim that AI agents have any
form of physical embodiment. This is just obviously false. And the extent to
which we
&lt;a href=&#34;https://deepmind.google/blog/gemini-robotics-brings-ai-into-the-physical-world/&#34;&gt;slap an LLM onto a robot&lt;/a&gt;
and call it embodied &amp;ndash; I think that’s cool, but architecturally still
distinct from what 4E is calling embodied.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What’s still bringing me back to a firm “no” on the embodiment question, setting
aside common sense, is the notion that “constitution” is an important theme of
true embodied cognition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Constitution: The body (and, perhaps, parts of the world) does more than
merely contribute causally to cognitive processes: it plays a constitutive
role in cognition, literally as a part of a cognitive system. Thus, cognitive
systems consist in more than just the nervous system and sensory organs.
(&lt;a href=&#34;https://plato.stanford.edu/entries/embodied-cognition/#ThreThemEmboCogn&#34;&gt;Source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This just does not seem to be true of LLMs. Scaffolded AI agents have “senses”
in that they can be told about things happening in the world, and they can even
request measurements to be taken of the world on their behalf, but they are not
&lt;em&gt;of&lt;/em&gt; the world. Everything is still just tokens to them. Even multimodal visual
information is still provided in the same learned representational space of
tokens. There is no sense in which they have sense perception of the world, or
parts of their cognition which are causally &lt;em&gt;of&lt;/em&gt; the world, such that one could
say that their &lt;em&gt;cognition is actually embedded in the world&lt;/em&gt;.&lt;/p&gt;
&lt;h1 id=&#34;5-the-tokenverse&#34;&gt;5. The Tokenverse&lt;/h1&gt;
&lt;p&gt;A common critique of the field of AI alignment is the provocative question
“Aligned to &lt;em&gt;what&lt;/em&gt;?” I think there’s a natural analogy here: “Embodied in
&lt;em&gt;what&lt;/em&gt;?” If we grant LLMs some primitive form of embodiment, they seem to be
embodied in something like a “tokenverse”, a linguistic ecosystem, not the
physical world.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Agents acting in this “tokenverse” still seem to be quite capable of taking
action in our real physical world &amp;ndash; albeit in ways that we’ve constructed to
mediate these actions.&lt;/p&gt;
&lt;p&gt;So what does this mean for the criticism that “AI systems are insufficiently
embodied”? I think insofar as this criticism is trying to say that “AI cognition
is different from human cognition”, that is just obviously true. However, this
does give us a useful handle on &lt;em&gt;how&lt;/em&gt; these two forms of cognition are
different.&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Our embodied perception filtered &amp;ldquo;the world&amp;rdquo; into concepts. Those concepts
ricocheted off billions of human minds, eventually crystallizing in the internet
datasets that became the basis for LLM training data.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/filtering-to-data.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/filtering-to-data.png&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;Now LLMs use these human-derived concepts in their abstract &amp;ldquo;tokenverse&amp;rdquo; to take
actions that impact our physical world. Perhaps this is why
&lt;a href=&#34;https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default&#34;&gt;“alignment by default”&lt;/a&gt;
seems to be actually going fairly well so far? Naively, it would seem much
easier to “align” something which already shares a reasonably close set of
conceptual handles to us.&lt;/p&gt;
&lt;p&gt;LLMs didn’t have to learn how to interact with the world and form their own
conceptual handles based on their weird alien understanding of what it means to
exist in the universe, they were able to learn these from statistical patterns
in tokenized concepts from human-derived data.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/full-loop.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/07/Embodied-Cognition-and-the-Tokenverse/full-loop.png&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;Now we have the full loop. Human experience is distilled into concepts represented
as statistical patterns in the training data, a &amp;ldquo;byproduct&amp;rdquo; of embodied
experience. LLMs train against this data and learn statistical patterns, but
can&amp;rsquo;t influence the training data &amp;ndash; their &amp;ldquo;tokenverse&amp;rdquo; &amp;ndash; in the same way that
humans can influence the world through their own physical embodiment as part of
their feedback loop. Similarly, when LLMs take actions in the world, they are
doing so in a way that is constrained by the concepts in their &amp;ldquo;tokenverse&amp;rdquo;
rather than the world itself. At present, LLMs are also limited in their ability
to be influenced by the world, as they are mediated by a set of static
model weights which aren&amp;rsquo;t themselves influenced by the world during
interaction.&lt;/p&gt;
&lt;p&gt;If we take the “tokenverse” idea seriously, we should consider what is implied
by being &amp;ldquo;embodied&amp;rdquo; within it. LLMs clearly develop some internal conceptual
landscape that produces their resulting outputs. Empirical investigation into
model features supports this. It would be quite surprising if the learned
concepts mapped 1:1 onto human concepts. This would imply that the conceptual
landscape that best withstands the optimization pressures of model training maps
with high fidelity onto human-filtered concepts, which would be quite a
strong claim. More likely, LLMs develop an internal conceptual landscape that
&lt;em&gt;approximates&lt;/em&gt; human concepts, but is itself quite foreign. Indeed, empirically
this appears to be
&lt;a href=&#34;https://www.alignmentforum.org/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability&#34;&gt;pragmatically indecipherable&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For now, if LLMs are embodied of anything, they are embodied of the texts and
concepts we first perceived via our own physical embodiment.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover: Barcelona, Spain&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I&amp;rsquo;m going to have to read more about Heidegger&amp;rsquo;s notions of
&amp;ldquo;pre-ontological&amp;rdquo; existence now, I suppose.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;As an aside, I don&amp;rsquo;t think there&amp;rsquo;s anything spooky going on with respect to
tokenization in particular. That is, if LLMs eventually transition to using
byte-level tokens, I don&amp;rsquo;t think that changes the fundamental argument.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;Here, we are implicitly granting that AI cognition &lt;em&gt;is&lt;/em&gt; cognition. I’m not
100% sure we should be ready to bite that bullet, but that’s a topic for
another time.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Book Review: Antimemetics</title>
        <link>https://benjamincongdon.me/blog/2025/12/06/Book-Review-Antimemetics/</link>
        <pubDate>Sat, 06 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/06/Book-Review-Antimemetics/</guid>
        <description>
&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/06/Book-Review-Antimemetics/antimemetics.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/06/Book-Review-Antimemetics/antimemetics_huc0a08881dc09f80f5d612f0ebba04d14_215405_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/06/Book-Review-Antimemetics/antimemetics_huc0a08881dc09f80f5d612f0ebba04d14_215405_0x600_resize_lanczos_3.png&#34; width=&#34;208&#34; height=&#34;300&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;&lt;a href=&#34;https://darkforest.metalabel.com/antimemetics&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&#34;1-slippery-ideas&#34;&gt;1. Slippery Ideas&lt;/h2&gt;
&lt;p&gt;Nadia Asparouhova’s &lt;em&gt;Antimemetics&lt;/em&gt; is, itself, antimemetic.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; I devoured this
book in a few sittings on the bus to work, but if I had to describe it, I really
only have a few conceptual handles that I could grasp onto:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Memes are ideas that spread easily. &lt;strong&gt;Antimemes are ideas that resist
spreading.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;We live in an information ecosystem which is made up of various types of
memes. Memes have varying levels of impact, salience, and transmissibility.&lt;/li&gt;
&lt;li&gt;Often the most useful ideas are antimemetic.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here, of course, we’re talking of &lt;a href=&#34;https://en.wikipedia.org/wiki/Meme&#34;&gt;meme&lt;/a&gt; in
the Richard Dawkins sense &amp;ndash; not image macros (necessarily), but ideas that are
spread through social or cultural forces. Once you see memes as such, they’re
everywhere: fashions, neologisms, dances, common fears, celebrities, and so on.&lt;/p&gt;
&lt;h2 id=&#34;2-for-lack-of-an-antimemetics-division&#34;&gt;2. For Lack of an Antimemetics Division&lt;/h2&gt;
&lt;p&gt;The idea of antimemes was brought to popularity by
&lt;a href=&#34;https://en.wikipedia.org/wiki/There_Is_No_Antimemetics_Division&#34;&gt;&lt;em&gt;There is No Antimemetics Division&lt;/em&gt;&lt;/a&gt;,
by “qntm”. It was originally a serialized novella on the
&lt;a href=&#34;https://en.wikipedia.org/wiki/SCP_Foundation&#34;&gt;SCP Wiki&lt;/a&gt;. The story
centers on a monster that can erase all memory of its existence.
Researchers in the story can inspect the monster and observe it, but once
they’re out of its chamber, all memory of it leaves their minds. This monster,
SCP-055, is quintessentially antimemetic.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Memes want to be shared. When we come across a particularly good meme, our
first instinct is to pass it on to someone else. &amp;hellip; Antimemes are the
opposite. When we encounter an antimemetic object, there is a reflexive desire
– consciously or not – to suppress it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antimemes are slippery. They’re definitionally hard to describe, hard to share;
often, they’re unsexy, uncool, and cringe.&lt;/p&gt;
&lt;p&gt;Memes, on the other hand, want to be shared. They’re self-propagating:
the type of idea that has been shaped by environmental pressures into exactly
the thing you want to pass on, packaged so that it can be easily shared.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/6-7_(meme)&#34;&gt;“Six-seven”&lt;/a&gt; is the perfect meme.
There is no content to this meme, and yet it spread. A meme can be thought of as
an inner message (the actual &lt;em&gt;content&lt;/em&gt;) plus a wrapper (the window
dressing that makes it enticing to share), akin to the
&lt;a href=&#34;https://benjamincongdon.me/blog/2021/02/21/Three-Layers-of-Information/&#34;&gt;three-tiered hierarchy of information&lt;/a&gt;
I’ve written about before. The six-seven meme has no inner message; it is pure
wrapper. To some extent, that gives it a tremendous advantage in spreading:
there is no core meaning that needs to be preserved uncorrupted.&lt;/p&gt;
&lt;p&gt;It’s much harder to think of a quintessential antimeme. Asparouhova uses
“writing your own will” as an example. Writing a will is valuable, yet most
people resist doing so for years. Terms of Service are another good antimeme:
they’re ubiquitously deployed, and ubiquitously ignored.&lt;/p&gt;
&lt;h2 id=&#34;3-the-memetic-environment&#34;&gt;3. The Memetic Environment&lt;/h2&gt;
&lt;p&gt;Memes need a medium to spread in. Ideas have been spreading for as long as human
culture has been a thing, but the development of the internet led to something
of a memetic singularity. Memes can now be generated and spread, globally, at a
scale unimaginable in the 20th century. Asparouhova terms this hyper-memetic
environment the “memetic city”:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The memetic city is easily recognizable. It is the realm of viral ideas and
social contagion: tweets that explode overnight, social media avatars
supporting the latest political cause, TikToks and Instagram Reels that we
scroll through at the end of a long day. Here, ideas spread with lightning
speed – amplified by social platforms – and shape our collective behavior and
preferences: the opinions we hold, the dates we go on, who we vote for. &amp;hellip;
[The memetic city] thrives on visibility; its power is rooted in the ability
of ideas to quickly capture our attention and replicate across the hive mind.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The foil of the memetic city is the “Dark Forest”, first defined in an
&lt;a href=&#34;https://ystrickler.medium.com/the-dark-forest-theory-of-the-internet-7dc3e68a7cb1&#34;&gt;article&lt;/a&gt;
by Yancey Strickler as a nod to the science fiction
&lt;a href=&#34;https://en.wikipedia.org/wiki/The_Dark_Forest&#34;&gt;series&lt;/a&gt; of the same name. The
Dark Forest theory, from the scifi series, is that in an adversarial
environment, it is rational to conceal yourself, lest you be predated. In an
internet overrun by spam, bots, spearphishers, SWATters, and threats of
cancellation, the sane response is to retreat from the memetic city into the
darker, more personal corners of the web.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt; Group chats, Discord
servers, insular Twitter communities made up of pseudonymous posters with anime
profile pictures.&lt;/p&gt;
&lt;p&gt;Asparouhova draws heavily on Venkatesh Rao’s writing on
&lt;a href=&#34;https://www.ribbonfarm.com/about/&#34;&gt;Ribbonfarm&lt;/a&gt; and
&lt;a href=&#34;https://maggieappleton.com/&#34;&gt;Maggie Appleton&lt;/a&gt;, both of whom I’ve quite enjoyed
reading in the past. The synthesis of Rao, Appleton, and others is that a
desirable response to the noise of the memetic city is a
&lt;a href=&#34;https://maggieappleton.com/cozy-web&#34;&gt;“cozy web”&lt;/a&gt;: “the private,
gatekeeper-bounded spaces of the internet we have all retreated to over the last
few years.”&lt;/p&gt;
&lt;p&gt;A broad history of internet culture in Asparouhova’s framing goes
something like this: First, there was a wave of optimism that the internet could be a
force for peace and connection. Second, because of forces such as mimetic desire
and context collapse, the internet actually became a hostile home to culture
wars. Virality, first an interesting phenomenon for sharing cat pictures, became
a weapon used to direct attention. Third, as a counter-reaction to these
forces, we have now entered an “antimemetic” era, characterized by a broader
retreat from the town square into the dark forest.&lt;/p&gt;
&lt;h2 id=&#34;4-staring-at-an-antimeme&#34;&gt;4. Staring at an Antimeme&lt;/h2&gt;
&lt;p&gt;Descriptively, antimemes are the types of ideas that &lt;em&gt;don&amp;rsquo;t&lt;/em&gt; work on the memetic
stage. They’re not going to be a banger tweet, they won’t get you millions of
likes on Instagram, they’re not “cool”.&lt;/p&gt;
&lt;p&gt;There are different types of antimemes, since there are different reasons that
ideas can resist being spread:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Boring&lt;/strong&gt;. Many ideas resist spreading because they’re boring, unable to
hold our attention for long. Terms of Service are a bureaucratic nuisance, but
you click “approve” and completely forget you did so. Daylight Saving Time is
briefly memetic twice per year when the clocks change, but is quickly forgotten.
It just doesn&amp;rsquo;t appear important enough to fix, despite the common knowledge
that roughly no one wants it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Taboo&lt;/strong&gt;. Some ideas resist spread because there is a social cost of doing
so. Societies use taboo as an immune response to ideas that are seen as
threatening. Taboos are contextual and highly subject to opinion. These can be
reasonable &amp;ndash; such as taboos against violence &amp;ndash; or counterproductive &amp;ndash; such as
those suppressing ideas just outside the current Overton window.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Taboos are one of the most prominent categories of antimemes&amp;hellip; Some taboos –
such as stealing, cheating, or lying – don’t budge within most networks&amp;hellip; But
taboos linger precisely because their symptomatic period is so long. They can
lie dormant for years until more nodes are willing to receive or spread the
idea.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Uncomfortable&lt;/strong&gt;. Some ideas resist being processed by our minds out of an
ego-based or psychological immune response. Ideas that would be painful to
process largely get ignored.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We avoid thoughts that are cognitively expensive to process&amp;hellip; No one wants to
think about their own death, much less the death of themselves and their
partner simultaneously&amp;hellip; Death, retirement planning, getting married and
having kids…for many people, these ideas are difficult to prioritize because
they force us to confront uncomfortable truths.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Dangerous&lt;/strong&gt;. Some ideas are legitimately dangerous to share. The most
obvious example is information related to CBRN (chemical, biological,
radiological, and nuclear) weapons. However, there are some more subtle
examples. Asparouhova cites Ethan Watters’ &lt;em&gt;Crazy Like Us&lt;/em&gt; which
(controversially) suggests that some mental health disorders &amp;ndash; such as anorexia
and “American-style” depression &amp;ndash; have a component of cultural transmission,
and as such information about these conditions &lt;em&gt;can&lt;/em&gt; be dangerous to susceptible
individuals. Ironically, dangerous information can be tragically memetic within
vulnerable populations, such as the subreddits devoted to enabling discussions
of eating disorders.&lt;/p&gt;
&lt;p&gt;So why care? Well, to a certain type of person (*raises hand*), antimemes are
fascinating. Antimemes are like going to the thrift store of ideas. Many are
boring, but still interesting to look at. Some are life changing. Having a deep
conversation with someone well out of your professional or social path is a
great way to be introduced to novel antimemes that you can then put in your back
pocket, to be explored later or forgotten, depending on the content.&lt;/p&gt;
&lt;p&gt;Also, antimemes are often important. They’re often the “eat your veggies” of
your information diet. Long reads are antimemetic, compared to a viral tweet.
You may even remember the viral tweet more vividly than a closely read long
article. A good life is one not optimized for ease, but one nudged in the
direction of meaning-making. Fully ignore antimemes at your own peril.&lt;/p&gt;
&lt;p&gt;From here, Asparouhova introduces a third class of more dangerous memes that
combine the “importance seemingness” of antimemes with the transmissibility of
memes.&lt;/p&gt;
&lt;h2 id=&#34;5-supermemes&#34;&gt;5. Supermemes&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Antimemetics&lt;/em&gt; offers the following taxonomy of memes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;“Memes” are high-transmission ideas with low impact.&lt;/li&gt;
&lt;li&gt;“Antimemes” are low-transmission ideas with high impact.&lt;/li&gt;
&lt;li&gt;“Supermemes” are high-transmission ideas with high impact.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve talked about memes and their antimemetic foil, but we’ve yet to discuss
supermemes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Supermemes &amp;hellip; are like black holes. Like memes, they spread quickly, but
unlike memes, they are perceived as highly consequential. Their sheer
gravitational force pulls us in, crowding out our ability to think about
anything else. Whereas antimemes are characterized by a &amp;lsquo;strange forgetting&amp;rsquo;
by the perceiver, supermemes are characterized by a &amp;lsquo;strange inability to
forget.&amp;rsquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Supermemes are akin to
&lt;a href=&#34;https://en.wikipedia.org/wiki/Timothy_Morton#Hyperobjects&#34;&gt;hyperobjects&lt;/a&gt; &amp;ndash;
they have the power to totalize anything that touches them. They have the
quality of a looming catastrophe, a
&lt;a href=&#34;https://en.wikipedia.org/wiki/Shepard_tone&#34;&gt;Shepard tone&lt;/a&gt; of impending doom
that prevents one from averting their gaze.&lt;/p&gt;
&lt;p&gt;The obvious supermeme from 2023 until the present has been AI. In my corner of the
world, it’s predominantly “AI Doom”. It&amp;rsquo;s clear why: AI is scary; it portends
the potential for a loss of human control; it portends the potential for the
loss of human value, of human creativity, of biological intelligence. It, like
climate change, is something that we appear to be doing to ourselves. We are
forced to look directly at the crisis as it looms. And yet, there have been
supermemes in the past. If/when we make it past our current AI doom, there will
be another supermeme to replace it.&lt;/p&gt;
&lt;h2 id=&#34;6-truth-tellers-and-champions&#34;&gt;6. Truth Tellers and Champions&lt;/h2&gt;
&lt;p&gt;Finally, Asparouhova discusses the heroes of &lt;em&gt;Antimemetics&lt;/em&gt;: truth tellers and
champions. Some antimemetic ideas are worth injecting into broader awareness,
and Asparouhova sees these two archetypes of people as those who have the
ability to do so.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Truth tellers&lt;/strong&gt; are willing to break social norms to say what others fear
saying, often with an air of the trickster. They are often seen as cringe, and
so pay a social cost.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Truth-tellers, who often operate outside of conventional norms, are especially
vulnerable to being labeled as cringe. This creates a chilling effect&amp;hellip;
Cringe suppresses the truth-tellers: the chaotic, creative idiots who
gleefully prod us to reassess what we think we know and believe.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Champions&lt;/strong&gt;, on the other hand, are tireless spreaders of a particular idea. If
we ever get rid of Daylight Saving Time, it will be because someone
took up the cause as a Champion and spent their energy keeping that
idea salient enough for it to be addressed.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Civilization scales its cultural awareness through &amp;lsquo;distributed remembering,&amp;rsquo;
where we empower champions to curate our attention, which expands our ability
to make progress on many different issues at once.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Society is too complex for any one person to attend to all that needs attending.
Forgetting is a necessary part of living as a human, in both pre- and
post-modern society. We distribute or shard our remembering, as a form of
specialization. Champions take a narrow slice they feel is unattended to, and
attempt to convince others to stop forgetting long enough for a problem to be
solved.&lt;/p&gt;
&lt;h2 id=&#34;7-conclusions&#34;&gt;7. Conclusions&lt;/h2&gt;
&lt;p&gt;What are we to make of &lt;em&gt;Antimemetics&lt;/em&gt;? What should we hold onto as its core,
while the rest slips out of memory?&lt;/p&gt;
&lt;p&gt;I found this book hard to pin down &amp;ndash; it’s part internet anthropology, part
vibe-check of the post-2020 world, part field guide to “interesting ideas and
where to find them”.&lt;/p&gt;
&lt;p&gt;If I were to give three final thoughts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Occasionally force yourself to stare at antimemes. Make a log of ideas you
run into. Sure, forget most of them, but build the muscle nonetheless. Find
one or two ideas to champion to an unreasonable degree.&lt;/li&gt;
&lt;li&gt;It’s reasonable to incubate ideas in private cozy-web style environments
before deciding whether to share them more publicly. Doing so is
psychologically and socially well-adapted.&lt;/li&gt;
&lt;li&gt;Developing good “cognitive security” is already a baseline requirement for
being sane in the post-modern world. This will only become more of a
necessity with the increase of generated media. Generative AI is another
step change in how quickly memes can evolve for reproductive fitness,
so we should expect more captivating memetic spread in the next few
years.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;Under the book’s framing, I think this is a compliment?&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Anyone who has had a post go even &lt;em&gt;semi&lt;/em&gt;-viral, I think, can feel the appeal
of the dark forest. When one of my posts gets picked up on Hacker News,
there&amp;rsquo;s this feeling of the
&lt;a href=&#34;https://en.wikipedia.org/wiki/Sauron&#34;&gt;Eye of Sauron&lt;/a&gt; being cast upon you.
&lt;em&gt;Something, something &amp;ldquo;Don&amp;rsquo;t read the comments.&amp;rdquo;&lt;/em&gt; Even if most of the
feedback is positive, there&amp;rsquo;s a chilling effect and set of expectations that
comes with this: a temptation to sand off the edges of ideas to make them
more palatable, or to offer fewer handles for criticism. I don&amp;rsquo;t think this
phenomenon of attention warping is new to the internet, but the ability to
link/screenshot/amplify &lt;em&gt;anything&lt;/em&gt; hypercharges it. Sharing a random
half-baked idea in a private Discord group has an effectively zero chance of
blowing up on you. Sharing a half-baked idea on X/Twitter also has an
effectively zero chance of blowing up on you, but it creates an ~immutable,
searchable public record of your half-baked idea, which can be trawled up
years later by someone fishing for dirt.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>TIL: SQLite&#39;s &#39;WITHOUT ROWID&#39;</title>
        <link>https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/</link>
        <pubDate>Fri, 05 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/</guid>
        <description>&lt;p&gt;&lt;strong&gt;By default, SQLite tables have a special &lt;code&gt;rowid&lt;/code&gt; column that uniquely
identifies each row.&lt;/strong&gt; This &lt;code&gt;rowid&lt;/code&gt; exists even if you have a user-specified
&lt;code&gt;PRIMARY KEY&lt;/code&gt; on the table. How this &lt;code&gt;rowid&lt;/code&gt; column behaves is influenced by
your &lt;code&gt;PRIMARY KEY&lt;/code&gt; type.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Integer Primary Keys&lt;/strong&gt;: If you have an integer primary key, then the primary
key column becomes an alias for the special &lt;code&gt;rowid&lt;/code&gt; column. The &lt;code&gt;rowid&lt;/code&gt; and your
user-defined primary key are literally just the same column.&lt;/p&gt;
&lt;p&gt;Taking an example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00a&#34;&gt;CREATE&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;TABLE&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;IF&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;NOT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;EXISTS&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;users(&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;user_id&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;INTEGER&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;PRIMARY&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;KEY&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;email&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;TEXT&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this case, since we’ve defined &lt;code&gt;user_id&lt;/code&gt; to be an &lt;code&gt;INTEGER PRIMARY KEY&lt;/code&gt;,
&lt;code&gt;user_id&lt;/code&gt; becomes an alias for &lt;code&gt;rowid&lt;/code&gt;. These two labels refer to the same
physical column.&lt;/p&gt;
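&lt;p&gt;This aliasing is easy to verify from Python&amp;rsquo;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (a minimal sketch; the in-memory database and sample row are just for illustration):&lt;/p&gt;

```python
import sqlite3

# Illustrative in-memory database (not from the post).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS users(user_id INTEGER PRIMARY KEY, email TEXT)"
)
conn.execute("INSERT INTO users(email) VALUES ('ben@example.com')")

# Because user_id is an INTEGER PRIMARY KEY, it aliases rowid:
# both names read the same underlying column.
print(conn.execute("SELECT rowid, user_id FROM users").fetchone())  # (1, 1)
```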

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/users.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/users_hu84d8a599dd648de23df7171264b11d35_18198_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/users_hu84d8a599dd648de23df7171264b11d35_18198_0x600_resize_lanczos_3.png&#34; alt=&#34;Figure 1: Integer primary keys are aliases for the rowid column&#34; width=&#34;319&#34; height=&#34;300&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Figure 1: Integer primary keys are aliases for the &lt;code&gt;rowid&lt;/code&gt; column&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Non-Integer Primary Keys&lt;/strong&gt;: When you use a non-integer primary key, such as
&lt;code&gt;TEXT&lt;/code&gt;, SQLite actually implements the “primary key” as a &lt;code&gt;UNIQUE INDEX&lt;/code&gt; between
&lt;code&gt;rowid&lt;/code&gt; and your user-defined primary key.&lt;/p&gt;
&lt;p&gt;SQLite stores data on disk in
&lt;a href=&#34;https://benjamincongdon.me/blog/2021/08/17/B-Trees-More-Than-I-Thought-Id-Want-to-Know/&#34;&gt;B-trees&lt;/a&gt;. Table
rows and indices are both stored in B-trees. For non-integer primary keys, this
means that on disk, your table has at least two B-trees &amp;ndash; one for the rows, and
another for the index between the &lt;code&gt;rowid&lt;/code&gt; and primary key.&lt;/p&gt;
&lt;p&gt;Taking an example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00a&#34;&gt;CREATE&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;TABLE&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;IF&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;NOT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;EXISTS&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;page_views(&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;page_url&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;TEXT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;PRIMARY&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;KEY&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;views&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;INTEGER&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;);&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This table would have two on-disk B-trees. The first for the rows of
&lt;code&gt;(rowid, page_url, views)&lt;/code&gt; and the second for the index entries of
&lt;code&gt;(page_url, rowid)&lt;/code&gt;.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/page_views_1.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/page_views_1_huf56f0735eb5d21b7224c3ceb41580e0c_49428_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/page_views_1_huf56f0735eb5d21b7224c3ceb41580e0c_49428_0x600_resize_lanczos_3.png&#34; alt=&#34;Figure 2: Non-integer primary keys&#34; width=&#34;736&#34; height=&#34;300&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Figure 2: Non-integer primary keys&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This two-tree structure has performance implications. If we were to do a query
like &lt;code&gt;SELECT views FROM page_views WHERE page_url=&#39;/blog/sqlite&#39;&lt;/code&gt;, the SQLite
query engine first has to search the &lt;code&gt;(page_url, rowid)&lt;/code&gt; B-tree to get the
&lt;code&gt;rowid&lt;/code&gt; of the row with &lt;code&gt;page_url=&#39;/blog/sqlite&#39;&lt;/code&gt;, and then use that rowid to
locate the full row in the &lt;code&gt;(rowid, page_url, views)&lt;/code&gt; B-tree to get the &lt;code&gt;views&lt;/code&gt;
from that row.&lt;/p&gt;
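&lt;p&gt;You can see this hidden index in the query plan. A quick sketch using Python&amp;rsquo;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the table is empty; only the plan matters here, and the exact wording of the plan text can vary by SQLite version):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views(page_url TEXT PRIMARY KEY, views INTEGER)")

# SQLite implements the TEXT primary key as an automatic UNIQUE index
# (named sqlite_autoindex_*); lookups search it first, then fetch the
# full row from the table B-tree by rowid.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT views FROM page_views WHERE page_url = ?",
    ("/blog/sqlite",),
).fetchall()
for row in plan:
    # e.g. "SEARCH page_views USING INDEX sqlite_autoindex_page_views_1 (page_url=?)"
    print(row[-1])
```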
&lt;h2 id=&#34;enter-without-rowid&#34;&gt;Enter WITHOUT ROWID&lt;/h2&gt;
&lt;p&gt;Adding the &lt;code&gt;WITHOUT ROWID&lt;/code&gt; clause to a &lt;code&gt;CREATE TABLE&lt;/code&gt; statement disables this
&lt;code&gt;rowid&lt;/code&gt; behavior. For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00a&#34;&gt;CREATE&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;TABLE&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;IF&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;NOT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;EXISTS&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;page_views(&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;page_url&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;TEXT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;PRIMARY&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;KEY&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;views&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0aa&#34;&gt;INTEGER&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;)&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#00a&#34;&gt;WITHOUT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ROWID;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Instead of creating the phantom &lt;code&gt;rowid&lt;/code&gt; column, the primary key of a
&lt;code&gt;WITHOUT ROWID&lt;/code&gt; table uses a “clustered index”. In this context, a clustered
index means that the row data and the primary key index are stored in the same
B-tree. The rows in this index are physically stored in order of the primary
key. Thus, there’s no need for a second look-up B-tree.&lt;/p&gt;
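&lt;p&gt;Correspondingly, a &lt;code&gt;WITHOUT ROWID&lt;/code&gt; table has no &lt;code&gt;rowid&lt;/code&gt; column at all, which is easy to confirm (again a small sketch with Python&amp;rsquo;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views(page_url TEXT PRIMARY KEY, views INTEGER) WITHOUT ROWID"
)

# With no hidden rowid column, referencing it raises an error; primary-key
# lookups search the single clustered B-tree directly.
try:
    conn.execute("SELECT rowid FROM page_views")
except sqlite3.OperationalError as exc:
    print(exc)  # no such column: rowid
```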

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/page_views_2.png&#34;&gt;
            
            
            &lt;picture&gt;
                &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/page_views_2_hue8ace7c6abd4aa51657694dddf59a11e_32536_0x600_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
                &lt;img
                    src=&#34;https://benjamincongdon.me/blog/2025/12/05/TIL-SQLites-WITHOUT-ROWID/page_views_2_hue8ace7c6abd4aa51657694dddf59a11e_32536_0x600_resize_lanczos_3.png&#34; alt=&#34;Figure 3: WITHOUT ROWID tables use a clustered index&#34; width=&#34;344&#34; height=&#34;300&#34;
                    loading=&#34;lazy&#34;
                    decoding=&#34;async&#34;
                &gt;
            &lt;/picture&gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Figure 3: WITHOUT ROWID tables use a clustered index&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;em&gt;Diagram note:&lt;/em&gt; Visualizing a B-tree as a table is a bit of an
oversimplification. Note that in Figure 3, as opposed to Figure 2, the rows are
stored in order of the primary key to gesture at the on-disk layout of the
B-tree.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use WITHOUT ROWID:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you frequently look up by primary key, or range scan by primary key, there
is a performance benefit to using &lt;code&gt;WITHOUT ROWID&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If you use a composite primary key, and frequently look up by this key.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When NOT to use WITHOUT ROWID:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If your primary keys are large.&lt;/strong&gt; If your primary keys are large, then any
additional indices on the table will have to duplicate storage of the entire
primary key, instead of using the &lt;code&gt;rowid&lt;/code&gt; to refer to the entry. For
instance, if you’re using a 200-byte string primary key, then every
secondary index will include those 200 bytes, instead of the 8-byte &lt;code&gt;rowid&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If your table rows are large (e.g. they store a large blob).&lt;/strong&gt; If your
table rows are large, then the disk layout of the table will be less
efficient as a &lt;code&gt;WITHOUT ROWID&lt;/code&gt; table. This is because SQLite uses a
different type of B-tree &amp;ndash; a B*-tree &amp;ndash; layout for &lt;code&gt;rowid&lt;/code&gt; tables that
only stores data in leaf nodes, whereas clustered indices use a normal
B-tree layout, which stores data in intermediate nodes. If the rows
themselves are large, the storage of data in intermediate nodes can worsen
the fan-out when searching the tree.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;learnings--discussion&#34;&gt;Learnings / Discussion&lt;/h3&gt;
&lt;p&gt;I dug into this concept more deeply when trying to read-optimize a SQLite
database that is used as a simple key-value on-disk cache. The notion of
disabling &lt;code&gt;rowid&lt;/code&gt; for this table seemed appealing, until I realized that the
optimization was likely not worth it given that the table stored quite large
blobs in its rows, which would degrade the overall efficiency of the on-disk
layout. In any case, I got to learn more about SQLite internals, which was well
worth the time.&lt;/p&gt;
&lt;p&gt;The SQLite &lt;a href=&#34;https://sqlite.org/docs.html&#34;&gt;documentation&lt;/a&gt; makes for a fascinating
read and is exemplary technical writing. The SQLite project is a true treasure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Further Reading&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://sqlite.org/withoutrowid.html&#34;&gt;Clustered Indexes and the WITHOUT ROWID Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://fly.io/blog/sqlite-internals-btree/&#34;&gt;SQLite Internals: Pages &amp;amp; B-trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
        <title>Race Report: Seattle Marathon 2025</title>
        <link>https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/</link>
        <pubDate>Thu, 04 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/</guid>
        <description>&lt;p&gt;Last Sunday, I ran the 2025 Seattle Marathon. This was my third marathon, and I
got a PR! I’m splitting this race report into two broad sections: about the
course, and about my experience/training/etc.&lt;/p&gt;
&lt;h2 id=&#34;the-course--event&#34;&gt;The Course &amp;amp; Event&lt;/h2&gt;
&lt;p&gt;Candidly, I’ve avoided running the Seattle Marathon in the past because I’d
heard negative things about the course layout. Previous courses spent much more
time around the arboretum and University District, and routed over the 520
bridge &amp;ndash; which does sound fun in the abstract&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;, but would be monotonous for
a race.&lt;/p&gt;
&lt;p&gt;One thing I still grumble about with the Seattle Marathon is that they &lt;em&gt;do not
post their courses until a few months before the race&lt;/em&gt;. This is in contrast with
other races, which either don’t meaningfully change their courses
year-over-year, or have an established course prior to registration opening.
Signing up for a course sight-unseen isn’t great.&lt;/p&gt;
&lt;p&gt;However, the 2025 course was great! It was a true highlight reel of the best
locations for Seattle running.&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/course.png&#34;&gt;
            
            
            
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/course.png&#34; alt=&#34;Course Map: 2025 Seattle Marathon&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Course Map: 2025 Seattle Marathon&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Broadly, the course started near the Gates Foundation building, went up through
Eastlake, turned east to the Arboretum via Interlaken Park, provided a nice loop
through the Arboretum and the University of Washington campus, then headed west
through Gas Works Park, Fremont, the Fremont Canal, and Interbay.&lt;/p&gt;
&lt;h3 id=&#34;interlaken-park-arboretum-union-bay-natural-area&#34;&gt;Interlaken Park, Arboretum, Union Bay Natural Area&lt;/h3&gt;
&lt;p&gt;The section of the course that went through Interlaken Park, the UW Arboretum,
and the Union Bay Natural Area (near Husky Stadium) was the highlight of the
race for me. Running through those parks on a crisp day as the sun is coming
out, alongside a ton of other runners&amp;hellip; amazing.&lt;/p&gt;
&lt;h3 id=&#34;magnolia--interbay&#34;&gt;Magnolia &amp;amp; Interbay&lt;/h3&gt;
&lt;p&gt;In contrast, the hilly industrial streets north of Interbay and the section of
Elliott Ave in the last 3 miles of the course are not that nice to run on.&lt;/p&gt;
&lt;p&gt;The last several miles were a point of contention, since they
&lt;a href=&#34;https://www.seattletimes.com/seattle-news/seattle-marathon-reroute-a-hit-for-runners-headache-for-new-neighbors/&#34;&gt;caused significant traffic in Magnolia&lt;/a&gt;
and largely prevented residents from leaving during the race. When I normally
run this section of Interbay, I take the Elliott Bay Trail through the
industrial area there, which connects to Centennial Park. (This is the blue
route in the map.)&lt;/p&gt;

&lt;figure&gt;
        &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/alternate_route.png&#34;&gt;
            
            
            
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/alternate_route.png&#34; alt=&#34;Course Map: Magnolia Bridge &amp;amp;amp; Elliott Way Segment&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
            
        &lt;/a&gt;
    

    &lt;figcaption&gt;&lt;p&gt;Course Map: Magnolia Bridge &amp;amp; Elliott Way Segment&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I think this would have been the marathon planners’ preferred route, but two
things prevented it from actually being used:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Despite
&lt;a href=&#34;https://www.seattlebikeblog.com/2025/10/01/with-the-bridge-over-nothing-removed-the-terminal-91-trail-in-interbay-reopens-thursday/&#34;&gt;recent trail renovations&lt;/a&gt;,
there’s still a significant choke point in the trail in the connection
between the Elliott Bay Trail and Centennial Park that would likely have been
a safety issue for the volume of marathon runners.&lt;/li&gt;
&lt;li&gt;There is ongoing construction in Elliott Bay Park that would make it tricky
to route through.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I have pretty high confidence in this second factor, as this message was posted
around the Race Expo:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new courses touch several parks, waterways and other viewpoints throughout
the city, showcasing the beauty of all of Seattle.&lt;/p&gt;
&lt;p&gt;And next year the courses get even prettier! Starting in 2026, once
construction of the Elliott Bay Trail has been completed, we have agreements
to switch the last two miles of the races from Elliott Ave to the brand new
waterfront trail.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Instead of taking the nicer Elliott Bay Trail, the course routed over the
Magnolia Bridge, which caused significant traffic. My sense, though, is that
this wasn’t the “first choice” for the course, as (if memory serves) the course
was updated a few times after it was announced.&lt;/p&gt;
&lt;h3 id=&#34;turnarounds--loops&#34;&gt;Turnarounds &amp;amp; Loops&lt;/h3&gt;
&lt;p&gt;The other gripe I had with the course was that it relied more on turnarounds
than I would have preferred. As a Seattle-native runner, this didn’t bother me
too much; I was quite familiar with the course and the turnarounds felt logical
enough. If I were coming from out of town, this likely would have been more
annoying.&lt;/p&gt;
&lt;p&gt;The one &lt;em&gt;really unfortunate&lt;/em&gt; course element, though, was, again, north of
Interbay. At the intersection after the Emerson Pl Bridge, the course wanted you
to first turn right to do a loop up Gilman Ave, then drop back down to Fishermen’s
Terminal via Commodore Way. Then you go over the Emerson Pl Bridge &lt;em&gt;again&lt;/em&gt;, this
time turning left and heading down to Interbay.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/loop.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/loop.png&#34; alt=&#34;Course Map: Interbay Loop&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Course Map: Interbay Loop&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Having seen the course online, and knowing the area well, I found this reasonably
intuitive. However, I saw at least a dozen runners who&amp;rsquo;d taken the left on their
first pass and had to retrace their steps (sometimes for &amp;gt;1 mi) to do the loop
they’d accidentally skipped. There were course monitors at the intersection, but
the signage was still evidently insufficient.&lt;/p&gt;
&lt;h3 id=&#34;summary&#34;&gt;Summary&lt;/h3&gt;
&lt;p&gt;Overall I enjoyed the course quite a bit. The event management was really good:
the race started on time, the aid stations were plentiful/well-staffed, and the
finish line was on-par for post-race chaos. I think if they smooth out the
Interbay/Magnolia piece next year, this same course would be an excellent one to
repeat.&lt;/p&gt;
&lt;h2 id=&#34;my-experience&#34;&gt;My Experience&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: I finished in 3:24:34, a PR, down from my previous best, which was
~3:43.&lt;/p&gt;
&lt;h3 id=&#34;training&#34;&gt;Training&lt;/h3&gt;
&lt;p&gt;My training window for this race was rather long. I’d been “training” for months
longer than I needed to, partly so I could train alongside a friend who ran the San
Francisco Marathon in late July. My training started in earnest shortly after
that, and I used &lt;a href=&#34;https://www.runna.com/&#34;&gt;Runna&lt;/a&gt; for the first time to manage my
plan.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;My training was mostly uneventful until near the end. I’ll write more about
Runna at some point, but in short: it was really helpful in scheduling
progressive long runs, and I loved that it generated pace workouts that
helpfully sync to your running watch. I built up my weekly volume and distance
capacity without issue.&lt;/p&gt;
&lt;p&gt;However. In early November, I injured myself on a 20mi long-run. I started the
run feeling like something was “off” in my knee, but kept running on it. The
whole run was slow and painful, and it was the first time in my entire running
experience that I was unable to walk after a run. Not great. I took a week off
after that and wore a knee brace consistently until race day. Eventually, my
knee improved, but the ankle on the same leg got injured too &amp;ndash; possibly from
compensating. So I started wearing a compression brace on that ankle as well.&lt;/p&gt;
&lt;p&gt;I was becoming pessimistic that I’d actually run the race, since I didn’t want
to further injure myself and had to dramatically step down my training in the
final month &amp;ndash; skipping all my tempo runs, and limiting the distance of my last
long runs to well under the plan.&lt;/p&gt;
&lt;p&gt;In a moment of injury reprieve mixed with inspired consumerism, I bought a pair
of carbon “super shoes”:
&lt;a href=&#34;https://www.rtings.com/running-shoes/reviews/adidas/adizero-adios-pro-4&#34;&gt;Adidas Adios Pro 4&lt;/a&gt;&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;s.
The first time I put them on to go out and test them, I ended up running my
first ever sub-20min 5k. Was this a good idea? Probably not, but surprisingly it
didn’t aggravate my knee and was a strong mental win.&lt;/p&gt;
&lt;h3 id=&#34;race-day&#34;&gt;Race Day&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Pace groups&lt;/strong&gt;: I ran the first 18-19 miles with the 3:20 pace group, which
went really well. That helped a lot with not going too fast out of the gate. In
my first two races, I had the tendency to run really fast at the beginning and
burn out around mile 13, then struggle from 13-20, and be in a bad situation
from 20-26. Running with a pace group got me solidly to mile 18 without
overexertion, but still at my goal pace.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fueling&lt;/strong&gt;: My fueling was definitely suboptimal. I brought 3 protein bars and a
caffeinated Verb bar with me. Unfortunately, I somehow “lost” two of the bars in
the bottom of my running vest, and didn’t want to stop to fish them out. The
result was that I ended up having a bunch of course gels. This was fine, but I got
“sweeted out” and had a sickly sweet feeling in my mouth from mile 18 on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shoes&lt;/strong&gt;: Running with carbon shoes definitely helped, but more in
the early sections where my legs were fresher. Qualitatively, I feel like they
gave me roughly a 30-second boost on my pace, which is nontrivial. However, the
last 6-8 miles were still pretty hard. My thighs started freezing up, which made
maintaining a sub-8:30 pace tricky. The Adios shoes have much less stability
than my usual Brooks Adrenalines, which led to a few close calls with rolling an
ankle as my legs fatigued.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Last 6mi&lt;/strong&gt;: The phrase “a marathon is a 20mi warmup to a 6mi race” is
popular for a reason:&lt;sup id=&#34;fnref:4&#34;&gt;&lt;a href=&#34;#fn:4&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/pace_chart.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/pace_chart.png&#34; alt=&#34;My pace chart for the race&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;My pace chart for the race&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Finishing &amp;amp; Post Race:&lt;/strong&gt; Marathons are such a mental game. In the last mile or
two, I had the feeling that I &lt;em&gt;could&lt;/em&gt; push myself to finish a few seconds
faster, but ultimately decided not to. Once I was confident that I’d finish
comfortably under 3:30, I “just” wanted to finish without further degrading my
form. This was probably a smart decision. As I stopped in the finish corral, I
had a few seconds of post-run adrenaline, which immediately crashed into “yup,
can’t really walk anymore”. It was also still quite a &lt;em&gt;cold&lt;/em&gt; day, and so I got
to use one of those fun mylar blankets for the first time.&lt;/p&gt;
&lt;h3 id=&#34;pictures&#34;&gt;Pictures&lt;/h3&gt;
&lt;p&gt;
&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/start_line.jpg&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/start_line.jpg&#34; alt=&#34;Start Line&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Start Line&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;


&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/union_bay.jpg&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/union_bay_hu16ca6deedf30a0b231fb97e04a8c9374_711634_0x1200_resize_q100_h2_lanczos.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/union_bay_hu16ca6deedf30a0b231fb97e04a8c9374_711634_0x1200_resize_q100_lanczos.jpg&#34; alt=&#34;Union Bay Natural Area&#34; width=&#34;450&#34; height=&#34;600&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Union Bay Natural Area&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;


&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/520_bridge.jpg&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/520_bridge.jpg&#34; alt=&#34;520 Bridge&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;520 Bridge&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;


&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/almost_at_the_finish.jpg&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/04/Race-Report-Seattle-Marathon-2025/almost_at_the_finish.jpg&#34; alt=&#34;Almost at the Finish&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Almost at the Finish&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;All in all, fun race. I&amp;rsquo;d do it again. :)&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I have yet to run or bike or walk over either the 520 or I90 bridges, a
glaring omission in my Seattle-native credibility.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Previously, I’d just search for a reasonable internet training plan
(invariably one of
&lt;a href=&#34;https://www.halhigdon.com/training/marathon-training/&#34;&gt;Hal Higdon&amp;rsquo;s plans&lt;/a&gt;),
copy that into a spreadsheet, futz with it until it felt “right”, and then
hang a printed version of that on the wall during “training season”.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;As a side note, RTings, better known for
TV/headphone/vacuum/consumer-electronics reviews, does surprisingly good
reviews for running shoes.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:4&#34;&gt;
&lt;p&gt;This generalizes reasonably well to “a marathon is a $(26-N)$ mile warmup to
an $N$ mile race, for values of $N &amp;gt;= 13$”.&amp;#160;&lt;a href=&#34;#fnref:4&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Technical Escape Velocity</title>
        <link>https://benjamincongdon.me/blog/2025/12/03/Technical-Escape-Velocity/</link>
        <pubDate>Wed, 03 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/03/Technical-Escape-Velocity/</guid>
        <description>&lt;h2 id=&#34;1&#34;&gt;1.&lt;/h2&gt;
&lt;p&gt;When I was learning to program, I remember a specific phase transition in that
process of skill acquisition &amp;ndash; a distinct “before” and “after”, similar to how
when learning to read there’s the “before” of ignorance, the “after” of
effortlessly reading, and surprisingly little memory of the intermediate
struggle of learning.&lt;/p&gt;
&lt;p&gt;For programming, this phase transition was the point where I &lt;em&gt;felt&lt;/em&gt; confident
that I &lt;em&gt;could&lt;/em&gt;, in principle and given enough time, program anything that could
be programmed.&lt;/p&gt;
&lt;p&gt;I learned to code, initially, through a mixture of Java Minecraft plugins,
Objective-C OSX and iOS games, and physical programming books, which I&amp;rsquo;d
studiously read in my middle school’s designated reading time, trying to absorb
operating systems programming while&amp;hellip; staring at a book without access to a
computer.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;This time spent grasping at concepts that I didn’t have a good handle on was
legitimately useful, but frustrating. I recall feeling the unwieldy uncertainty
of the unknown with how to take a simple idea I had (&amp;ldquo;program a &amp;lsquo;simon says&amp;rsquo;
game!&amp;rdquo;) and then turn that into a reality (arrays, display elements, event
loops). The number of unknown unknowns was too high to have a chance at finding
what I needed on StackOverflow.&lt;/p&gt;
&lt;p&gt;Then, there was some critical point where my mindset shifted. I’d built up
enough of a mental model of the key concepts behind “how to program a computer”
&amp;ndash; control flow, filesystem operations, network requests, parsing/serialization,
calling REST APIs, etc. &amp;ndash; that it felt like I could figure anything out given
sufficient time. I think there are two components to this mindset shift: (1)
having enough of the common patterns in your repertoire to make the “easy things
easy”, and (2) developing the meta-skill of knowledge acquisition (i.e. knowing
what search terms to use, having a reasonable grasp of enough of the CS/systems
theory to search for what you need) to make the “hard things possible”.&lt;/p&gt;
&lt;h2 id=&#34;2&#34;&gt;2.&lt;/h2&gt;
&lt;p&gt;As a handle for this mindset shift, I’ll call this &lt;strong&gt;“Technical Escape
Velocity”&lt;/strong&gt;. As in, you build up enough technical knowledge to bootstrap
yourself to escape the “gravity well” of ignorance.&lt;/p&gt;
&lt;p&gt;In orbital mechanics, escape velocity is the minimum speed needed to break away
from a celestial body’s gravity well without further propulsion. Technical
Escape Velocity (TEV) is admittedly a stretched metaphor, but I like it for the
following reason: when you reach escape velocity, you don’t &lt;em&gt;immediately&lt;/em&gt; leave
the gravity well, you just have sufficient conditions for &lt;em&gt;eventually&lt;/em&gt; exiting
the gravity well.&lt;/p&gt;
&lt;p&gt;Different problem domains have different gravity wells. A small webapp has a
smaller gravity well than a sprawling full-stack SaaS app, which in turn has a
smaller gravity well than “leading a 20 person engineering team through a
complex, layered distributed system migration”.&lt;/p&gt;
&lt;p&gt;For a given problem domain, you can eventually become proficient enough to solve
all the solvable problems within that domain. If you’re lucky, some of this
knowledge will even transfer between domains. &lt;strong&gt;Critically, you still only know
you’ve solved the problem once you have evidence of it actually being solved.
Just because you &lt;em&gt;think&lt;/em&gt; or &lt;em&gt;feel&lt;/em&gt; like you have TEV, doesn’t mean you actually
do.&lt;/strong&gt; Trickier problems with larger gravity wells require you to wait longer to
know if you’ve actually left orbit &amp;ndash; or if you’ll ultimately crash back to the
surface in failure.&lt;/p&gt;
&lt;h2 id=&#34;3&#34;&gt;3.&lt;/h2&gt;
&lt;p&gt;Fast forward to 2025: AI tooling has dramatically changed how this learning
process &lt;em&gt;can&lt;/em&gt; work today. As Zvi Mowshowitz often
&lt;a href=&#34;https://thezvi.substack.com/p/ai-135-openai-shows-us-the-money&#34;&gt;says&lt;/a&gt;, “AI is
the best tool ever made for learning, and also the best tool for not learning.”
Tools like Claude Code or Codex make the TEV for simple tasks effectively zero
&amp;ndash; “write a simon says game for me” is something that &lt;em&gt;any&lt;/em&gt; frontier AI model
should be able to do, essentially zero-shot. But it also enables you to use that
as a crutch: if you stay comfortably in the zero-TEV problem domains, then when
you hit a wicked problem, you will not have built up enough tacit knowledge and
experience to make the jump.&lt;/p&gt;
&lt;p&gt;In 2025, a coding agent can do a lot of your “tedious homework” as a developer.
It can also give you a pretty reasonable head-start on design and architecture
questions. This makes the TEV for low- to medium-difficulty tasks dramatically
lower than it had been 5 years ago.&lt;/p&gt;
&lt;p&gt;The class of problems with nontrivial TEV has contracted, to where now only the
most difficult, wicked problems have a high TEV. I’m not sure AI has
dramatically lowered the TEV of those problems. Large architectural decisions
that will impact the work of dozens of engineers and take years to fully
realize? Still hard. Still requires a lot of tacit knowledge, experience,
intuition, and long-horizon planning. AI can &lt;em&gt;help&lt;/em&gt;, but discerning slop from
reasonable strategy still requires a very high TEV. To be clear, I’d rather have
Claude Opus 4.5 on-hand to bang out prototypes and rubber-duck ideas with, but I
don’t think its long-horizon judgement is there yet.&lt;/p&gt;
&lt;p&gt;The highest-of-high TEV problems are the ones where you have a very weak,
delayed feedback signal. Projects where you have to wait months to see whether
structural bets you made paid off or not. Projects where you need to
course-correct midway through because of unforeseen circumstances. Strategic
actions taking &lt;em&gt;realpolitik&lt;/em&gt; into account. These problems are high TEV because
they aren’t purely technical &amp;ndash; they’re still systemic and involve a lot of
technical decision-making, but they also are deeply intertwined with
organizational, business, and market forces. It’s notoriously difficult to even
tell if humans are skilled in domains such as these, because teasing out luck
versus skill becomes nontrivial.&lt;/p&gt;
&lt;p&gt;Will AI be there in 3-5 years? My intuition is “no, unless we have another
architectural advance in long-term planning”. And yet, I&amp;rsquo;ve been quite surprised
over the past year at how quickly coding agents have been climbing the TEV
curve. It&amp;rsquo;s hard not to hedge in either direction on this; there is significant
uncertainty.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I suppose this allowed me to become more confident with pencil-and-paper
programming, a skill which was useful for precisely one advanced placement
exam, and nothing since.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Why Magnetos</title>
        <link>https://benjamincongdon.me/blog/2025/12/02/Why-Magnetos/</link>
        <pubDate>Tue, 02 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/02/Why-Magnetos/</guid>
        <description>&lt;p&gt;I&amp;rsquo;ve recently started on the journey to get a private pilot’s license. One thing
I&amp;rsquo;ve enjoyed about the process so far is the extent to which you&amp;rsquo;re encouraged
to understand how most of the systems work at a fairly deep level. Contrast this
to driving a car, where you can mostly get away with &amp;ldquo;turn key, use gas &amp;amp; brake
pedals, don&amp;rsquo;t do anything stupid that would cause you to lose traction.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Most trainer aircraft still in use were designed (and sometimes built) in the
1960s or ‘70s. So the systems are relatively primitive. For example, there is no
onboard computer calculating the fuel mixture; you do that yourself. You have to
know what a carburetor is and does. You need to know about the fuel system &amp;ndash;
where exactly it&amp;rsquo;s stored, the rate of fuel burn, how air temperature and fuel
mixture and oil temperature impact engine performance. For more complicated
aircraft, you need to become familiar with the manifold pressure system that
uses a complicated oil-and-spring setup to control propeller pitch. This is all
part of the fun and appeal of aviation, at least for me.&lt;/p&gt;
&lt;p&gt;To oversimplify, turning on a plane like a Cessna 172 involves 3ish steps:
turning on the master power switch for the airplane’s electrical system, turning
on the power switch for the avionics systems, and turning on the plane’s
magnetos. The first two are usually toggle switches, whereas the magnetos are a
keyed switch, similar to a car&amp;rsquo;s ignition. You can roughly round your
understanding of magnetos to &amp;ldquo;it&amp;rsquo;s basically the &amp;lsquo;turn the engine on&amp;rsquo;
switch&amp;rdquo;&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;, but I was curious and wanted to learn more.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are magnetos?&lt;/strong&gt; Magnetos are a self-contained electrical system for
powering the plane’s spark plugs. They use the mechanical power of the running
engine to produce enough voltage to keep the spark plugs sparking, independent
of the aircraft’s battery. They&amp;rsquo;re called magnetos because they use permanent
magnets rotating past wire coils to generate electrical current.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/02/Why-Magnetos/magneto.png&#34;&gt;
        &lt;picture&gt;
            &lt;source srcset=&#34;https://benjamincongdon.me/blog/2025/12/02/Why-Magnetos/magneto_hu39d1d8dd1ca6fd847333af9aca00cd1c_493733_1200x0_resize_q100_h2_lanczos_3.webp&#34; type=&#34;image/webp&#34;&gt;
            &lt;img
                src=&#34;https://benjamincongdon.me/blog/2025/12/02/Why-Magnetos/magneto_hu39d1d8dd1ca6fd847333af9aca00cd1c_493733_1200x0_resize_lanczos_3.png&#34; alt=&#34;Crudely drawn, likely not fully accurate magneto diagram&#34; width=&#34;600&#34; height=&#34;331&#34;
                loading=&#34;lazy&#34;
                decoding=&#34;async&#34;
            &gt;
        &lt;/picture&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Crudely drawn, likely not fully accurate magneto diagram&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Spark plugs need a reasonably high voltage (usually &amp;gt;10kV) to ionize the air to
create a spark. To achieve this high voltage, magnetos have two coils &amp;ndash; a primary
and a secondary &amp;ndash; which work similarly to a step-up
&lt;a href=&#34;https://en.wikipedia.org/wiki/Transformer&#34;&gt;transformer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To generate the high voltage needed to produce a spark, a mechanism connected to
the engine timing system periodically disconnects the primary coil, collapsing
its magnetic field. This field collapse induces a higher voltage in the
secondary coil, using Faraday’s law of induction.&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
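&lt;p&gt;As a rough back-of-envelope illustration of the step-up (with entirely
hypothetical numbers, not taken from any aircraft manual), the ideal-transformer
relation &lt;em&gt;V_s = V_p &amp;times; (N_s / N_p)&lt;/em&gt; shows how a large turns ratio
gets a modest induced primary voltage past the ~10kV a spark plug needs:&lt;/p&gt;

```python
# Illustrative sketch only: ideal-transformer estimate of a magneto's
# secondary (spark plug) voltage. All numbers below are hypothetical.

def secondary_voltage(primary_volts, primary_turns, secondary_turns):
    """Ideal-transformer approximation: V_s = V_p * (N_s / N_p)."""
    return primary_volts * (secondary_turns / primary_turns)

# Suppose the collapsing primary field induces a few hundred volts in the
# primary coil; a turns ratio around 100:1 steps that up well past 10 kV.
v = secondary_voltage(primary_volts=250, primary_turns=180, secondary_turns=18000)
print(round(v))  # prints 25000
```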
&lt;p&gt;&lt;strong&gt;Why magnetos?&lt;/strong&gt; On essentially all modern internal combustion engines for
cars, the spark system is powered by the internal battery, which in turn is
charged by the alternator &amp;ndash; an electric generator which runs off the mechanical
energy generated by the running engine.&lt;/p&gt;
&lt;p&gt;Aircraft engines also have an alternator and battery system &amp;ndash; for powering
aircraft electronics like avionics and lights. Why not use those for the engine
spark plugs too? Primarily for isolation. Modern electronic ignitions are
coupled to the electrical system. If you have an electrical fire in the cockpit,
the first step is to turn off the master switch, cutting off all power to the
plane. If your engine relied on that power for its spark plugs, it would also
stop.&lt;/p&gt;
&lt;p&gt;Magnetos are mechanically coupled to the engine and propeller. As long as the
propeller continues to spin, the magnetos will continue to provide the energy
needed to keep firing, completely decoupled from the rest of the plane&amp;rsquo;s
electronics.&lt;/p&gt;
&lt;p&gt;Losing your lights and avionics if your battery or alternator fails is a
manageable emergency, but losing your engine is potentially catastrophic. The
total isolation of the electronics and engine systems makes a coupled failure
less likely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Left and right magnetos:&lt;/strong&gt; Many engines have two sets of magnetos &amp;ndash; labeled
&amp;ldquo;left&amp;rdquo; and &amp;ldquo;right&amp;rdquo; &amp;ndash; with an option to have both enabled. The &amp;ldquo;both&amp;rdquo; setting is
what the plane usually operates with. Switching off one set of magnetos
noticeably decreases the RPM of the running engine. Why? Naively, one might think
that half the engine cylinders (e.g. the &amp;ldquo;left half&amp;rdquo;) use each set of magnetos.
This is not the case; rather, each set of magnetos is wired to every cylinder.
Each cylinder has two spark plugs, one connected to each of the two independent
magneto systems.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/12/02/Why-Magnetos/Ignition_System_schematic.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/12/02/Why-Magnetos/Ignition_System_schematic.png&#34; alt=&#34;Aircraft magneto system schematic&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Aircraft magneto system schematic. &lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Ignition_System_schematic.png&#34;&gt;Source: Pilot&amp;rsquo;s Handbook of Aeronautical Knowledge&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Why two sets of magnetos?&lt;/strong&gt; Well, mostly for redundancy again. The engine can
still function quite well&lt;sup id=&#34;fnref:4&#34;&gt;&lt;a href=&#34;#fn:4&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;4&lt;/a&gt;&lt;/sup&gt; even if one of the two magneto systems is
inoperative. But there&amp;rsquo;s another reason: Recall how I mentioned earlier that
only enabling one set of magnetos reduces engine RPM, all else equal. Why?
Having two spark plugs in each cylinder improves combustion by igniting the
mixture at two places simultaneously, reducing the
&lt;a href=&#34;https://en.wikipedia.org/wiki/Flame_speed&#34;&gt;time the flame takes&lt;/a&gt; to expand
through the cylinder. It also makes the burn more even, reducing uneven strain
across the cylinder surface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Further reading:&lt;/strong&gt;
&lt;a href=&#34;https://www.aopa.org/news-and-media/all-news/2019/december/flight-training-magazine/how-it-works-magneto&#34;&gt;How It Works: Magneto (AOPA)&lt;/a&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;Technically, magnetos aren’t even really “turned on” since they’re passive
systems. Turning the switch from OFF to ON un-grounds the magnetos. When
OFF, the magneto coils are grounded to the airframe body, which prevents
them from generating electricity. &lt;em&gt;Actually&lt;/em&gt; turning on the engine still
requires the battery to power a starter motor.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Magnetos are a form of alternator that uses permanent magnets as the source
of the magnetic field used to generate current. Non-magneto alternators, on
the other hand, use an electromagnet as their magnetic field source. This
has the side effect that if your battery is dead, the alternator can’t
function, since there isn’t an initial “excitation” energy to energize the
coil, creating the field that then drives current.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;The ratio of the number of turns in each of the primary and secondary coils
determines how much the voltage is stepped up. To act as a step-up
transformer, the secondary coil has many more turns of wire than the primary
coil.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:4&#34;&gt;
&lt;p&gt;In flight, that is. You can safely return to the ground with just one
magneto. You definitely wouldn&amp;rsquo;t want to take off with only one.&amp;#160;&lt;a href=&#34;#fnref:4&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Schedule Recurring Calls With Your Far-Away Friends</title>
        <link>https://benjamincongdon.me/blog/2025/12/01/Schedule-Recurring-Calls-With-Your-Far-Away-Friends/</link>
        <pubDate>Mon, 01 Dec 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/12/01/Schedule-Recurring-Calls-With-Your-Far-Away-Friends/</guid>
        <description>&lt;p&gt;I enjoy conversations, particularly with people I care about. I also have a
social circle which is rather geographically dispersed. This, of course,
presents the problem of “how do I stay in touch with people?” Facebook &lt;em&gt;et al.&lt;/em&gt;
haven’t solved this problem in a satisfactory way for me. Discord / private
group chats are fine, but don’t feel socially fulfilling the way that 30
minutes of talking, even infrequently, often does.&lt;/p&gt;
&lt;p&gt;One solution I stumbled into a few years ago was: &lt;strong&gt;Set up recurring 1:1
meetings with distant friends.&lt;/strong&gt; Yes, like a work 1:1. Yes, with an actual
calendar invite with an actual Google Meet link.&lt;/p&gt;
&lt;p&gt;This is something I firmly believe is worth spending
&lt;a href=&#34;https://www.lesswrong.com/posts/wkuDgmpxwbu2M2k3w/you-have-a-set-amount-of-weirdness-points-spend-them-wisely&#34;&gt;weirdness points&lt;/a&gt;
on with the right set of people.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Twofold. First: Staying in touch with people is easier than meeting
people, and friendships/relationships become more nuanced with rich shared
context over time. Maintaining older friendships is just obviously valuable.
Second: Talking to interesting people is just fun.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On frequency:&lt;/strong&gt; I find this model works best somewhere in the
biweekly-to-bimonthly range. It’s probably better to meet too &lt;em&gt;infrequently&lt;/em&gt;
than too frequently. This is perhaps unintuitive if the goal is maintaining
relationships, but I think it’s better to have a glut of topics to discuss than
a scarcity &amp;ndash; and as long as you’re meeting every 3 months or so, you won’t hit
“reconnection awkwardness” territory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On duration:&lt;/strong&gt; 30 minutes is a good baseline to start with. It’s short enough
that it’s not hard to schedule into an evening, or a weekend morning. It’s
enough time to catch up on life events from the past week(s) or month and to
talk through a few substantive topics, yet not so long that you feel pressure
to fill the time. Many of my chats are exactly 30 minutes, like a
work meeting; some of them are more free-form, lasting for an hour or two
depending on the context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On scheduling, rescheduling, and following through:&lt;/strong&gt; Some people find it
easier (or possible/enjoyable/fun) to stick to strict schedules than others. As
a rule, I never get upset if someone needs to reschedule a chat or cancel last
minute. The schedule is just a forcing function for prompting some sort of
interaction &amp;ndash; even if that interaction ends up just being a quick exchange of
logistics and pleasantries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On what to talk about:&lt;/strong&gt; Normal rules for
“&lt;a href=&#34;https://www.goodreads.com/book/show/157981748-supercommunicators&#34;&gt;having good conversations&lt;/a&gt;”
apply. Most of the time, the conversation can just be about “what has your life
been like recently?”, but also: be a good conversationalist. Just because it’s a
remote, structured talk doesn’t mean that it has to be a rote or stilted
conversation. Talk about philosophy, ethics, existential dread, your
cats/dogs/hamsters, books, annoyances, accomplishments, admonitions, advice,
burgeoning hobbies, travel plans, creative hobbies, etc.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;I’m writing this post as a nudge in the direction of this being normalized. But
also so that the next time I ask someone to set one of these up, I have a handy
pointer to send. (Hey, future person! 👋)&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover: Girona, Spain&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>Fifty Bits of Career Advice</title>
        <link>https://benjamincongdon.me/blog/2025/08/11/Fifty-Bits-of-Career-Advice/</link>
        <pubDate>Mon, 11 Aug 2025 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/08/11/Fifty-Bits-of-Career-Advice/</guid>
        <description>&lt;p&gt;As my team&amp;rsquo;s summer interns finished up their rotations this week, I had my
usual end-of-internship &amp;ldquo;AMA&amp;rdquo; 1:1s. It&amp;rsquo;s something I enjoy doing, but I realized
I was covering a lot of the same topics I&amp;rsquo;ve discussed with other previous
interns and early-career engineers over the years.&lt;/p&gt;
&lt;p&gt;So I thought it&amp;rsquo;d be useful to collect these recurring bits of advice into a
single place, both for my own reference and for anyone else who might find them
helpful.&lt;/p&gt;
&lt;h3 id=&#34;decisions-and-opportunities&#34;&gt;Decisions and opportunities&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;There is a natural &amp;ldquo;explore vs. exploit&amp;rdquo; dynamic in career decisions. Early
in your career, you&amp;rsquo;re well served by leaning more into the &amp;ldquo;explore&amp;rdquo;
direction: do as many internships as you have time for, try living in
different cities, try working on different teams, try working on
product-focused projects, try working on infrastructure projects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Surprisingly few career decisions are true one-way-doors. This generalizes
outside your career as well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;That said, all decisions are path dependent. The two-way door you step back
through won&amp;rsquo;t be the same one you left, in the same way that
&lt;a href=&#34;https://en.wikipedia.org/wiki/Heraclitus&#34;&gt;you can never step in the same river twice&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s helpful to gravitate towards where value accrues in an organization.
This is often not intuitive.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Similarly, it&amp;rsquo;s helpful to gravitate towards where the smartest people are
working. Cultivate a set of formal and informal mentors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Similarly, it&amp;rsquo;s helpful to gravitate towards where there are obvious problems
that are low-hanging fruit to solve. For engineers: Environments where you
are engineer-hour-constrained instead of product-alignment-constrained are
much easier to progress quickly in. For managers and PMs, the reverse is
likely true.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s worth figuring out your tolerance for risk fairly early, since this has
implications for many things: timing of job hops, personal finance posture,
whether you prefer the (appearance of) safety of a large company or the risk
associated with doing a startup.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I find I rarely regret my choice after choosing between two paths with
reasonably similar &amp;ldquo;goodness&amp;rdquo;, even though these decisions are often the most
agonizing. After all, if the options weren&amp;rsquo;t similar in &amp;ldquo;goodness&amp;rdquo;, it&amp;rsquo;d be
an easy decision.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Making hard decisions doesn&amp;rsquo;t get easier; you just make more of them over
time and become more comfortable with the discomfort. You also eventually
build up your own
&lt;a href=&#34;https://benjamincongdon.me/blog/2022/05/18/Tools-for-Making-Difficult-Decisions/&#34;&gt;personal heuristics for tough decisions&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Careers are marked in seasons. When you feel a season change coming, it can
be tempting to hold out to avoid the discomfort of change. Sometimes you
initiate the season change (e.g. you seek out an exciting opportunity) or
sometimes the season change finds you (e.g. &amp;ldquo;oops, a few key people left and
now your org is in chaos&amp;rdquo;). Either way, you&amp;rsquo;ll spend less energy preparing
for and adapting to the new season than you will trying to preserve the old
one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When receiving advice from others, note that sometimes you need to
&lt;a href=&#34;https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/&#34;&gt;reverse the direction of the advice&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When finding your first job or two, it&amp;rsquo;s easy to look at interviews as a way
of &amp;ldquo;convincing the company to hire you&amp;rdquo;. However: the phrase &amp;ldquo;interviews are
a two-way street&amp;rdquo; is a cliché for a reason. Interviews give you an important
(though necessarily incomplete) window into what the company is working on and
looking for, and you should try to maximize the amount of signal you get for
yourself out of interviews. Prepare questions in advance, ask them, and
write the answers down. This can help change your eventual decision to be
more informed and less purely vibes-based.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;organizations-leadership-and-culture&#34;&gt;Organizations, leadership, and culture&lt;/h3&gt;
&lt;ol start=&#34;13&#34;&gt;
&lt;li&gt;
&lt;p&gt;Truly great managers are hard to find. Value the good ones.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Truly healthy organizations are hard to find. Value the healthy ones.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s tempting to become &amp;ldquo;irreplaceable&amp;rdquo;. It&amp;rsquo;s far better to foster your
replacements. (Both for your own sake and the people that grow to replace
you)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The average tenure of an engineer at most big companies is no more than ~3
years. An implication of this is that you &lt;em&gt;will&lt;/em&gt; see turnover on whatever
team you join. Usually the sky is not falling even if a few folks leave
around the same time. If you have the stomach for it, stepping into a
departing senior engineer&amp;rsquo;s shoes is a great growth opportunity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At a steady state, 80% of your time when being a good tech lead is being
&lt;a href=&#34;https://www.astralcodexten.com/p/heuristics-that-almost-always-work&#34;&gt;the rock that says &amp;ldquo;That seems reasonable&amp;rdquo;&lt;/a&gt;.
About half of your value comes from (skillfully) being this rock, and the
other half comes from recognizing when it&amp;rsquo;s necessary to push back, and in
(skillfully) following through with doing so.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Many things are overrated. &amp;ldquo;Being in the room where the decisions are made&amp;rdquo;
is, if anything, underrated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Organizations are like people: They have personalities, they have
insecurities, they have strengths and weaknesses and hopes and fears. Being
in tune with organizational psychology is a great marker of a senior
engineer. Influencing organizational psychology is a marker of staff+.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How teams and individuals say &amp;ldquo;goodbye&amp;rdquo; to people departing the company
tells you a lot.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Promotions aren&amp;rsquo;t as much an indication of &amp;ldquo;job well done&amp;rdquo; as they are
recognition that you&amp;rsquo;re doing a qualitatively different job than you were
the last time you were re-leveled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &amp;ldquo;good feeling&amp;rdquo; of getting a promotion fades much faster than you&amp;rsquo;d
think (for me, after about a day or two). Promotions are still worth pursuing
and achieving as career milestones. But they&amp;rsquo;re progress markers, not ends in
themselves.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Kindness goes a long way. In 5 years, the people you remember will be the
ones that helped you onboard, or bailed you out in a stressful oncall
rotation. You&amp;rsquo;ll spend comparatively less time thinking of the people who
were &amp;ldquo;merely&amp;rdquo; brilliant. (Thankfully, brilliance and kindness co-occur
suspiciously often)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You are &lt;em&gt;not&lt;/em&gt; the ultimate owner of your performance review. You can, and
very much should, advocate for yourself and put a substantial amount of
effort into writing the pieces that you control. However, it&amp;rsquo;s ultimately
your manager&amp;rsquo;s job to advocate on your behalf. If you have a good
manager, this is easy. If you don&amp;rsquo;t, you&amp;rsquo;ll need to do a lot of the
narrative setting yourself. In either case, managers are invariably
incredibly busy during perf season, so proactively making it easier for them
to advocate for you helps you both out.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Most SWE career ladders value &amp;ldquo;cross team collaboration&amp;rdquo; for senior and
above levels. As such, it&amp;rsquo;s helpful to develop relationships with people
outside of your immediate team. This is true for multiple reasons: first,
it&amp;rsquo;s a great way to get a larger sense of what the company is working on,
and second, it gives you a natural set of people to act as champions for
your work when you are in performance review season.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;execution-and-craftsmanship&#34;&gt;Execution and craftsmanship&lt;/h3&gt;
&lt;ol start=&#34;26&#34;&gt;
&lt;li&gt;
&lt;p&gt;Unlike all the code you wrote before you started your career, you can&amp;rsquo;t take
most of the code you write professionally along with you (unless you&amp;rsquo;re
doing professional open source work, which, if you do, cool!). This is a
good thing, since you get to
&lt;a href=&#34;https://benjamincongdon.me/blog/2025/04/08/Why-Developer-Tools/#:~:text=this%20developer%20had%20a%20quite%20positive%20view%20of%20it&#34;&gt;build things that outlast you&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When you&amp;rsquo;re new on a team, you automatically have a valuable perspective: a
set of fresh eyes. Keep track of the papercuts and unintuitive things you
run into &amp;ndash; when onboarding, when learning about how your team&amp;rsquo;s systems
work, and so on &amp;ndash; and share them with your team. Many of these things will
already be known by them, but people develop an indifference towards
papercuts over time, and having a fresh set of eyes is a great way to
surface (and potentially fix) them.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Early in your career, there&amp;rsquo;s a lot of value in being the person who lets
nothing fall through the cracks. A few years into your career, it&amp;rsquo;s about
choosing which cracks can be safely ignored.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Relatedly, there will always be work that &lt;em&gt;never&lt;/em&gt; gets done. Even on the
most relaxed teams, there will always be an ever-growing backlog of P1 and P2
tasks. This is fine. It&amp;rsquo;s generally a waste of time to hunt down
every little TODO on the long tail of your team&amp;rsquo;s backlog.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Relatedly, it&amp;rsquo;s sometimes worth declaring bankruptcy on your team&amp;rsquo;s backlog.
Unless you&amp;rsquo;re a TL or manager you generally don&amp;rsquo;t get to make this decision,
but even as an IC it&amp;rsquo;s worth keeping a pulse-check on how much backlog debt
your team has at any given moment, and doing your part in managing this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Having passionate users is a great way to get high quality feedback. Fixing
a papercut or two of a power user can earn you a years-long source of
feedback.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;That being said, it&amp;rsquo;s worth remembering that the most passionate users are
not representative of the majority of users, and that you will still need to
actively solicit feedback.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Actually use&lt;/em&gt; the tools/products/systems/libraries/etc. you build. This is
a necessary but insufficient condition for building something that is
useful.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You should probably invest a bit in your physical workstation. A good
keyboard, mouse, and monitor help both with ergonomics and with feeling a
sense of ownership over your setup.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You should probably invest a bit in your digital workstation. Yes, you
should pick a nice color scheme for your IDE. But you should also invest in
hyper-personalized configurations that make you faster at your job. Things
like terminal aliases for jumping between PRs/branches, small scripts for
common tasks, etc. These investments compound over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s worth learning a pure command-line text editor like vim or emacs. You
don&amp;rsquo;t need to be an expert in it, but (for the foreseeable future) there
will be times when you don&amp;rsquo;t have access to an IDE, like when SSHing into
remote machines, and you&amp;rsquo;ll be glad to have a comfortable fallback.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;self-management--mindset&#34;&gt;Self-management &amp;amp; mindset&lt;/h3&gt;
&lt;ol start=&#34;37&#34;&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s rarely worth making a decision you see as ethically or morally
compromising. Being able to live (easily! joyfully!) with your decisions is
a gift worth giving to yourself. To make this slightly less
&lt;a href=&#34;https://marginalrevolution.com/marginalrevolution/2023/01/ask-the-beast.html&#34;&gt;Straussian&lt;/a&gt;:
There is often money out there that is not worth taking. But you have to
figure out where to draw your own line.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you&amp;rsquo;re a natural &amp;ldquo;optimizer&amp;rdquo;, you should probably find a way to mindfully
and purposefully switch to a &amp;ldquo;satisficing&amp;rdquo; mindset when it&amp;rsquo;s helpful.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It will sometimes feel like
&lt;a href=&#34;https://archive.is/3CZUA&#34;&gt;&amp;ldquo;Everyone Is Getting Hilariously Rich and You&amp;rsquo;re Not&amp;rdquo;&lt;/a&gt;.
You&amp;rsquo;ll see former colleagues, classmates, and acquaintances win big in AI
startups, crypto, or $CURRENT_THING. Bubbles and the &amp;ldquo;leading edge of the
hype cycle&amp;rdquo; always create real winners alongside the noise, but the winners
are fewer and more random than they appear. Reminding yourself that this is
FOMO will save you a lot of grief.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;That &amp;ldquo;I don&amp;rsquo;t know what I&amp;rsquo;m doing&amp;rdquo; feeling early in your career is usually
followed shortly after by &amp;ldquo;I&amp;rsquo;m starting to get my footing!&amp;rdquo; Believe it or
not, you eventually start to crave this &amp;ldquo;beginner mindset&amp;rdquo;. The senior
engineers who you see switching teams or leaving for another company are
often going to &lt;em&gt;feel that way again&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s worth cultivating an online presence, even as an extremely slow burn
side project. Having a body of work online, especially writing you sink some
good thought into, is portable across companies, contexts, and even career
paths.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You are the ultimate owner of your career. The best managers and TLs enable
you to move in the direction you want to go, but you are the one that has to
set that direction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You are the ultimate owner of your mental and physical health. You will need
to learn your early warning signs for burnout and take steps to mitigate
them. Take time off proactively rather than as a last resort.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;finances&#34;&gt;Finances&lt;/h3&gt;
&lt;ol start=&#34;44&#34;&gt;
&lt;li&gt;
&lt;p&gt;For short term financial planning, you should &lt;em&gt;mostly&lt;/em&gt; treat RSUs as
worthless paper until they vest. For medium term financial planning, I find
it helpful to do a
&lt;a href=&#34;https://en.wikipedia.org/wiki/Monte_Carlo_method&#34;&gt;Monte Carlo simulation&lt;/a&gt;
of various scenarios I think my current company is headed toward. There&amp;rsquo;s
some skill involved here, but it&amp;rsquo;s worth the effort to get a sense for how
to value your RSUs. This used to be an afternoon of work, but thankfully
with LLMs you can script something like this in a few minutes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;By default you should &lt;em&gt;probably&lt;/em&gt; sell your RSUs as soon as you can and
reinvest in a more diversified portfolio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;That being said, if you&amp;rsquo;re conservative with finances, you&amp;rsquo;re more likely to
undervalue the future appreciation of RSUs (and thus, when comparing offers
you may undervalue the net present value of RSUs as well).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You should be able to comfortably live off your salary (without any bonuses
or RSUs).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your company does 401k matching, you should absolutely max out the match.
You should &lt;em&gt;probably&lt;/em&gt; max out your 401k contributions (when possible) as
well, as a set-and-forget way to do retirement planning until you have time
to develop a more nuanced plan for handling your finances.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It&amp;rsquo;s worth doing a deep dive on &lt;em&gt;exactly&lt;/em&gt; how the tax system works.
Internalizing how capital gains and each type of retirement account are
taxed is a great way to avoid future unpleasant surprises.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you work in tech, there&amp;rsquo;s ~no excuse for not keeping and fiercely
guarding an emergency fund of 6-12 months of expenses.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
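The Monte Carlo RSU valuation mentioned in point 44 can be sketched in a few lines. This is a minimal illustration, not financial advice: the function name, the scenario probabilities, drifts, and volatilities are all hypothetical numbers you would assign yourself, and a log-normal yearly return model is one simple assumption among many possible ones.

```python
import math
import random
import statistics

def simulate_rsu_value(grant_value, horizon_years, scenarios, n_trials=10_000):
    """Monte Carlo estimate of what an RSU grant might be worth at full vest.

    `scenarios` is a list of (probability, annual_drift, annual_volatility)
    tuples you assign yourself (e.g. bull / base / bear cases).
    """
    outcomes = []
    for _ in range(n_trials):
        # Pick a scenario according to its assigned probability.
        r = random.random()
        cumulative = 0.0
        for probability, drift, volatility in scenarios:
            cumulative += probability
            if r <= cumulative:
                break
        # Log-normal price path: one random yearly return per vesting year.
        value = grant_value
        for _ in range(int(horizon_years)):
            value *= math.exp(random.gauss(drift, volatility))
        outcomes.append(value)
    return statistics.median(outcomes), statistics.mean(outcomes)

# Hypothetical $100k grant vesting over 4 years, three hand-picked scenarios.
median_value, mean_value = simulate_rsu_value(
    100_000, 4,
    [(0.2, 0.20, 0.30),   # bull: +20%/yr average drift, high volatility
     (0.6, 0.05, 0.25),   # base: roughly market-like growth
     (0.2, -0.25, 0.40)], # bear: sustained decline
)
print(f"median: ${median_value:,.0f}  mean: ${mean_value:,.0f}")
```

The median is usually the more useful planning number here, since the mean is pulled up by a small number of extreme bull-case paths.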
&lt;p&gt;Your mileage will vary on all of this. What matters most is developing your own
heuristics over time. These are just mine, offered as a starting point. In a few
years, you&amp;rsquo;ll have your own version of this list, shaped by different choices
and different seasons.&lt;/p&gt;
&lt;p&gt;Cheers!&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>The Agency Gap</title>
        <link>https://benjamincongdon.me/blog/2025/07/31/The-Agency-Gap/</link>
        <pubDate>Thu, 31 Jul 2025 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/07/31/The-Agency-Gap/</guid>
        <description>&lt;p&gt;There&amp;rsquo;s a line in Ben Kuhn&amp;rsquo;s essay,
&lt;a href=&#34;https://www.benkuhn.net/impact/&#34;&gt;&amp;ldquo;Impact, agency, and taste&amp;rdquo;&lt;/a&gt;, that&amp;rsquo;s been
rattling around in my head lately. He describes impact as the practice of
&amp;ldquo;making success inevitable&amp;rdquo;. That phrase captures something important about how
to approach work.&lt;/p&gt;
&lt;p&gt;Kuhn draws a distinction between two ways of approaching a project or goal:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There&amp;rsquo;s a huge difference between the following two operating modes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;My goal is to ship this project by the end of the month, so I&amp;rsquo;m going to
get people started working on it ASAP.&lt;/li&gt;
&lt;li&gt;My goal is to ship this project by the end of the month, so I&amp;rsquo;m going to
list out everything that needs to get done by then, draw up a schedule
working backwards from the ship date, make sure the critical path is short
enough, make sure we have enough staffing to do anything, figure out what
we&amp;rsquo;ll cut if the schedule slips, be honest about how much slop we need,
track progress against the schedule and surface any slippage as soon as I
see it, pull in people from elsewhere if I need them…&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mode 1, as Kuhn later points out, makes you a &amp;ldquo;leaky abstraction.&amp;rdquo; If the
project is important, someone &lt;em&gt;else&lt;/em&gt; has to worry about whether it will actually
succeed, constantly monitoring and figuring out how to resolve blockers that you
might not even see coming.&lt;/p&gt;
&lt;p&gt;Mode 2, on the other hand, is about &amp;ldquo;making success inevitable&amp;rdquo;. It&amp;rsquo;s about
taking true accountability for actually &lt;em&gt;achieving the goal&lt;/em&gt;, not just for &lt;em&gt;doing
the work&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id=&#34;owning-the-outcome&#34;&gt;Owning the Outcome&lt;/h3&gt;
&lt;p&gt;This idea of &amp;ldquo;making success inevitable&amp;rdquo; struck me as a valuable mindset for
&lt;em&gt;anyone&lt;/em&gt; aiming to have a significant impact, regardless of their role. While
the responsibility for a project&amp;rsquo;s overall success might formally fall more
heavily on someone with a title that indicates this responsibility (e.g. a tech
lead (TL) or manager), the &lt;em&gt;practice&lt;/em&gt; of thinking and acting in Mode 2 is
something any engineer can develop.&lt;/p&gt;
&lt;p&gt;When I reflect on my own experience, transitioning within a team from an
individual contributor to TL role certainly made the Mode 2 thinking much more
explicit. Suddenly, the scope of concern broadened from &amp;ldquo;is my part done well?&amp;rdquo;
to &amp;ldquo;will this initiative/project land successfully?&amp;rdquo; With multiple people and
dependencies involved, you need to avoid becoming a leaky abstraction. The
feedback loops get less explicit, and you need to more frequently move from
tactical to strategic thinking.&lt;/p&gt;
&lt;p&gt;However, high-impact ICs usually exhibit these same traits. They don&amp;rsquo;t &amp;ldquo;just&amp;rdquo;
execute tasks; they anticipate problems, clarify ambiguity, proactively
communicate risks, plan future work, and drive things forward with a focus on
the ultimate objective. They internalize whatever the true end goal is and take
ownership beyond their immediate assigned slice.&lt;/p&gt;
&lt;p&gt;Kuhn highlights just how valuable this capability is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;People who can be trusted to make something inevitable are really rare, and
are typically the bottleneck for how many different things a team or company
can do at once. So if someone else is responsible for making your project
inevitable, you&amp;rsquo;re consuming some of that scarce resource; if you&amp;rsquo;re the one
making your own projects inevitable, you&amp;rsquo;re a producer of that resource, and
you&amp;rsquo;re helping unblock a key constraint for your team.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While TLs and managers are often the &lt;em&gt;institutionally designated&lt;/em&gt; providers of
this resource, that doesn&amp;rsquo;t mean they always fulfill that role effectively, nor
does it mean ICs &lt;em&gt;can&amp;rsquo;t&lt;/em&gt;. We&amp;rsquo;ve all likely seen managers who operate in Mode 1,
pushing the burden of ensuring success onto their reports. Conversely, highly
effective ICs often step up, implicitly or explicitly, to make their own
projects (and sometimes those around them) inevitable, thus becoming producers
of that scarce resource.&lt;/p&gt;
&lt;h3 id=&#34;the-scarce-resource-agency&#34;&gt;The Scarce Resource: Agency&lt;/h3&gt;
&lt;p&gt;The scarce resource Kuhn is identifying is &lt;strong&gt;agency&lt;/strong&gt;. It&amp;rsquo;s the quality that
separates merely &amp;ldquo;doing work&amp;rdquo; (Mode 1) from &amp;ldquo;owning the outcome&amp;rdquo; (Mode 2).
Agency encompasses the initiative, proactiveness, resourcefulness, and sheer
relentlessness required to navigate ambiguity and push something across the
finish line, ensuring it actually achieves its intended goal. It doesn&amp;rsquo;t always
require inhabiting the &amp;ldquo;operator&amp;rdquo; mindset, as agency also includes the ability
to think strategically and to make tough decisions, but highly agentic people do
know when to deploy both the &amp;ldquo;operator&amp;rdquo; and &amp;ldquo;strategist&amp;rdquo; mindsets.&lt;/p&gt;
&lt;p&gt;While some people seem naturally more inclined towards high-agency behavior, I
strongly believe it&amp;rsquo;s also a skill that can be developed and practiced.
Interestingly, I&amp;rsquo;ve often observed that particularly promising engineers already
possess a significant degree of latent agency. They have the technical skills,
the problem-solving ability, and often the right intuitions. What they sometimes
lack isn&amp;rsquo;t the capability, but the permission or encouragement to fully exercise
it.&lt;/p&gt;
&lt;p&gt;Exercising agency, especially when it involves pushing back on initial plans,
identifying non-obvious risks, or proactively coordinating across teams, can
feel like overstepping boundaries or one&amp;rsquo;s formal job title. This is
particularly true for those earlier in their careers or newer to a team.
Creating an environment of psychological safety, where team members feel
explicitly empowered and encouraged by senior folks or leadership to take
initiative and question assumptions, is important for unlocking this potential
agency. Without that safety net, agency often remains latent.&lt;/p&gt;
&lt;h3 id=&#34;developing-agency&#34;&gt;Developing Agency&lt;/h3&gt;
&lt;p&gt;Striving for &amp;ldquo;inevitability&amp;rdquo;, as Kuhn frames it, isn&amp;rsquo;t about achieving
perfection or eliminating all risk. That&amp;rsquo;s clearly impossible in most nontrivial
areas of human endeavor. Instead, I think the real value lies in cultivating the
mindset itself.&lt;/p&gt;
&lt;p&gt;Adopting this agentic mindset takes conscious effort, especially initially. It
means spending more time up-front planning, anticipating, and communicating,
which can sometimes feel less immediately productive than jumping into writing
code, or whatever the immediate &amp;ldquo;work&amp;rdquo; may be. However, investing time in
up-front strategic thinking consistently pays off later. Having a strategy
results in less frantic firefighting, fewer deadline slips, and a generally
calmer, more predictable process for delivering impact.&lt;/p&gt;
&lt;p&gt;The benefits of cultivating personal agency extend beyond &amp;ldquo;merely&amp;rdquo; delivering
reliable outcomes to achieve some abstract team/company/personal OKR. There&amp;rsquo;s a
certain confidence and personal satisfaction that comes from knowing you&amp;rsquo;ve done
the work to truly understand the problem, anticipate hurdles, and steer yourself
toward success, rather than hoping things work out or leaning on someone else to
keep the project unblocked. Agency also generalizes well across different
domains of life. Developing agency in a professional context usually results in
a higher ability&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; to exercise agency in other contexts (e.g. personal,
social, relational). Professional contexts are a good environment for developing
agency too: in healthy workplaces, there is a clear feedback loop and ample
opportunity to exercise agency.&lt;/p&gt;
&lt;p&gt;In my experience, developing the agency to make an impact on your (sometimes
chaotic) environment feels like a worthwhile pursuit in itself. It also reflects
quite well on promotion packages, if that&amp;rsquo;s your jam.&lt;/p&gt;
&lt;h2 id=&#34;further-reading&#34;&gt;Further Reading&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Ben Kuhn&amp;rsquo;s &lt;a href=&#34;https://www.benkuhn.net/impact/&#34;&gt;Impact, Agency, and Taste&lt;/a&gt;,
which inspired this post.&lt;/li&gt;
&lt;li&gt;Cate Hall&amp;rsquo;s &lt;a href=&#34;https://usefulfictions.substack.com/archive&#34;&gt;Substack&lt;/a&gt; has many
good posts on identifying and building agency. Notably:
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://usefulfictions.substack.com/p/how-to-be-more-agentic&#34;&gt;How to be more agentic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://usefulfictions.substack.com/p/maybe-youre-not-actually-trying&#34;&gt;Maybe you’re not Actually Trying&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I intentionally use the word &amp;ldquo;ability&amp;rdquo; here. I&amp;rsquo;m more confident that the
&lt;em&gt;skill&lt;/em&gt; of agency generalizes well than I am that the &lt;em&gt;propensity to use it&lt;/em&gt;
generalizes.
&lt;a href=&#34;https://usefulfictions.substack.com/p/maybe-youre-not-actually-trying&#34;&gt;This Cate Hall post&lt;/a&gt;
on selective agency describes this in more detail.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>When Red Buttons Aren&#39;t Enough</title>
        <link>https://benjamincongdon.me/blog/2025/06/16/When-Red-Buttons-Arent-Enough/</link>
        <pubDate>Mon, 16 Jun 2025 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/06/16/When-Red-Buttons-Arent-Enough/</guid>
        <description>&lt;p&gt;On June 12, 2025, most of GCP went offline. This led to downstream outages in a
multitude of websites and services, such as
&lt;a href=&#34;https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/&#34;&gt;Cloudflare&lt;/a&gt;,
&lt;a href=&#34;https://www.cbsnews.com/news/google-openai-spotify-experience-outage-affecting-tens-of-thousands-of-users/&#34;&gt;Spotify&lt;/a&gt;,
OpenAI, Anthropic, Replit, and many others.&lt;/p&gt;
&lt;p&gt;With a few days of hindsight, GCP published a quite detailed
&lt;a href=&#34;https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW&#34;&gt;postmortem&lt;/a&gt;.
Frankly, I&amp;rsquo;m impressed by the depth of this PM and the quantity of technical
details that they released publicly. Given this level of detail, it&amp;rsquo;s feasible
to piece together a reasonably full picture of what happened and what ultimately
went wrong.&lt;/p&gt;
&lt;p&gt;In what follows, I&amp;rsquo;m only going on what was publicly released in the postmortem.&lt;/p&gt;
&lt;h3 id=&#34;the-underlying-cause&#34;&gt;The Underlying Cause&lt;/h3&gt;
&lt;p&gt;GCP has an internal service called &amp;ldquo;Service Control&amp;rdquo;, which handles policy
checks like authorization and quotas for the GCP APIs. A data-dependent bug was
introduced into this service that caused it to crash and fail closed when a
particular code-path was reached; this code path could be exercised by a
Google-driven global quota policy change. Since the bad code path was
data-dependent, it was never exercised as this change gradually rolled out to
production &amp;ndash; it was triggered all-at-once, globally, when a new global quota
policy version was applied.&lt;/p&gt;
&lt;p&gt;This matches some of the provisional rumors (later ~confirmed) that the outage
was &amp;ldquo;related to IAM&amp;rdquo;.&lt;/p&gt;
&lt;h3 id=&#34;lack-of-gradual-rollout&#34;&gt;Lack of Gradual Rollout&lt;/h3&gt;
&lt;p&gt;The biggest egg-on-face from this incident is the following admission, worth
quoting in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On May 29, 2025, a new feature was added to Service Control for additional
quota policy checks. This code change and binary release went through our
region by region rollout, but the code path that failed was never exercised
during this rollout due to needing a policy change that would trigger the
code. As a safety precaution, this code change came with a red-button to turn
off that particular policy serving path. The issue with this change was that
it did not have appropriate error handling nor was it feature flag protected.
Without the appropriate error handling, the null pointer caused the binary to
crash. Feature flags are used to gradually enable the feature region by region
per project, starting with internal projects, to enable us to catch issues. If
this had been flag protected, the issue would have been caught in staging.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, effectively, the change that caused this global outage seemed to have never
been tested in production before its enablement. On its face, this seems&amp;hellip; bad.
Reading between the lines of this paragraph, it&amp;rsquo;s unclear to me to what extent
they&amp;rsquo;re throwing themselves under the bus. Saying &amp;ldquo;If this had been flag
protected, the issue would have been caught in staging&amp;rdquo; could be read as either
&amp;ldquo;we messed up and should have had this flag protected&amp;rdquo; or just literally &amp;ldquo;if
there was a flag, this wouldn&amp;rsquo;t have happened&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The change was hard to feature flag.&lt;/strong&gt; There &lt;em&gt;are&lt;/em&gt; very much times when it&amp;rsquo;s
quite difficult to feature flag changes. If a change requires a binary restart
to take effect, or if the change is part of a large refactor that doesn&amp;rsquo;t have
isolated pieces over which to place a flag, effectively feature flagging a
change can be quite difficult. Giving Google the benefit of the doubt, it&amp;rsquo;s
possible that this change was hard to flag, and so instead of using
standard flags, they relied on their &amp;ldquo;red button&amp;rdquo; mechanism for safety. I give
relatively low probability to this possibility, since the change was
able to be disabled with the &amp;ldquo;red button&amp;rdquo; and was data-dependent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The change relied on their &amp;ldquo;red-button&amp;rdquo;&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; instead of standard feature
flags.&lt;/strong&gt; The PM notes that &amp;ldquo;this code change came with a red-button to turn off
that particular policy serving path&amp;rdquo; and &amp;ldquo;Within 10 minutes, the root cause was
identified and the red-button (to disable the serving path) was being put in
place&amp;rdquo;. So it seems like the rollout plan for this feature did heavily rely on
the existence of this &amp;ldquo;red-button&amp;rdquo; for safety. The fact that it was identified
as the correct mitigation approach within 10 minutes updates me in the direction
that this &lt;em&gt;was&lt;/em&gt; their rollout safety plan: if it breaks, we&amp;rsquo;ll kill-switch it to
disable. Perhaps they thought the red-button would be quicker to take effect
than it actually was, and/or the change was low-risk enough that the red-button
approach was sufficiently safe.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;There was a process failure.&lt;/strong&gt; At least twice in the PM, it&amp;rsquo;s listed as a
follow-up that &amp;ldquo;We will enforce all changes to critical binaries to be feature
flag protected and disabled by default.&amp;rdquo; My knowledge of change management
practices at GCP is now stale, but I would strongly believe that it was already
a best practice to feature flag critical binaries &amp;ndash; for the policy to be
otherwise would be courting disaster, near malpractice for a set of services
running at their scale. I put higher weight on this being an individual or
team-specific process failure. There &lt;em&gt;should&lt;/em&gt; have been a plan to roll out this
code gradually, but either that plan had a faulty assumption (i.e. thinking that
the gradual binary rollout protected them, not knowing that the code wasn&amp;rsquo;t
actually being executed) or it was insufficiently paranoid (i.e. that the
red-button approach provided sufficient safety).&lt;/p&gt;
&lt;p&gt;Ultimately, these things happen. I&amp;rsquo;ve been cc&amp;rsquo;d on many PMs that have a similar
shape to this. Process compliance is a hard problem, because it grounds out in
the cultural norms and practices of the engineering environment &amp;ndash; down to
individual contributors and tech leads &amp;ndash; rather than a technical architectural
decision that can be made and enforced with a fully top-down approach.&lt;/p&gt;
&lt;h3 id=&#34;other-followups&#34;&gt;Other Followups&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Fail Open&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We will modularize Service Control’s architecture, so the functionality is
isolated and fails open. Thus, if a corresponding check fails, Service Control
can still serve API requests.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This seems wise. For a single-point-of-failure system like Service Control,
adding isolation and (well-considered) means to fail-open makes sense.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Globally replicated data:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We will audit all systems that consume globally replicated data. Regardless of
the business need for near instantaneous consistency of the data globally
(i.e. quota management settings are global), data replication needs to be
propagated incrementally with sufficient time to validate and detect issues.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This also seems quite wise, and honestly is what I&amp;rsquo;d consider to be the most
important followup from this incident, from what has been publicly
released. Consumption of globally replicated data is always going to be tricky.
Even if you have a feature flag, doing a globally impactful operation is always
going to be significantly riskier than doing that same action gradually.&lt;/p&gt;
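&lt;p&gt;As a toy sketch of what &amp;ldquo;propagated incrementally&amp;rdquo; means in practice (my own illustration, with hypothetical region names and callbacks, not anything taken from the PM), a staged propagation loop might look like:&lt;/p&gt;

```python
import time

REGIONS = ["region-a", "region-b", "region-c"]  # smallest blast radius first

def propagate_incrementally(apply_to_region, healthy, bake_seconds=0):
    """Toy sketch of incremental propagation for globally replicated data.

    apply_to_region(region): pushes the new data to one region.
    healthy(region): returns False if the region regressed after the push.
    Returns the list of regions successfully updated before any failure.
    """
    done = []
    for region in REGIONS:
        apply_to_region(region)
        time.sleep(bake_seconds)  # let metrics accumulate before judging health
        if not healthy(region):
            # stop the rollout instead of propagating a bad change globally
            break
        done.append(region)
    return done
```

&lt;p&gt;The key property is that a bad change stops at the first unhealthy region instead of reaching every region at once.&lt;/p&gt;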
&lt;p&gt;Google undoubtedly has the internal tools to do this sort of gradual propagation,
so I&amp;rsquo;m guessing that in this case the global nature of the data was an
intentional business requirement (i.e. quotas need to be updated globally within
a short window of time for &lt;em&gt;something something&lt;/em&gt; compliance reasons). If true,
reconsidering or challenging these requirements would also be helpful.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Incident Response:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Within 2 minutes, our Site Reliability Engineering team was triaging the
incident. Within 10 minutes, the root cause was identified and the red-button
(to disable the serving path) was being put in place. The red-button was ready
to roll out ~25 minutes from the start of the incident. Within 40 minutes of
the incident, the red-button rollout was completed, and we started seeing
recovery across regions, starting with the smaller ones first.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The incident response here seems quite fast. Two minutes to first triage and ten
minutes to root cause is fast &amp;ndash; although, the global nature of this impact
assuredly got eyes on this quickly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Incident Communication:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We posted our first incident report to Cloud Service Health about ~1h after
the start of the crashes, due to the Cloud Service Health infrastructure being
down due to this outage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Oof. First, the circular dependency here is rough, and I&amp;rsquo;m surprised there
wasn&amp;rsquo;t an easier way to provide an incident report through an air-gapped
channel. Second, we&amp;rsquo;ll never know, but I also wouldn&amp;rsquo;t be surprised if
institutional caution accounted for at least some of this delay. Especially with
such a substantial impact like this, figuring out who has the authority to
green-light the costly &amp;ldquo;everything is down&amp;rdquo; message would likely take some time.&lt;/p&gt;
&lt;p&gt;Google exposing this level of detail in a retrospective is a positive sign. It
rhymes with postmortems that I&amp;rsquo;ve seen in the past which, while painful,
resulted in cultural change that ultimately improved stability long term. Again,
I have no idea about the actual internals of what happened here &amp;ndash; it&amp;rsquo;s quite possible
that this rollout was a well-designed calculated risk which came up unlucky, or
it was a legitimate own-goal process blunder. In any case, the takeaways I have
are twofold: first, even the hyperscalers make mistakes; and second, that
kill-switches are necessary but not sufficient &amp;ndash; there is no substitute for
exercising new code gradually, to avoid global outages like this.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Obligatory: These are my own opinions and very much do not represent my
employer, etc.&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I think they effectively mean &amp;ldquo;kill switch&amp;rdquo; by &amp;ldquo;red button&amp;rdquo;?&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Why Developer Tools?</title>
        <link>https://benjamincongdon.me/blog/2025/04/08/Why-Developer-Tools/</link>
        <pubDate>Tue, 08 Apr 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/04/08/Why-Developer-Tools/</guid>
        <description>&lt;p&gt;I had the realization a year or so ago that much of the high-impact work I&amp;rsquo;ve
done in my career has been related &amp;ndash; directly or indirectly &amp;ndash; to building
developer tooling. I did not plan this, at all, but I’m quite happy to have
found impact in this niche.&lt;/p&gt;
&lt;p&gt;My first job outside of school was writing a developer tool for engineers
building the &lt;a href=&#34;https://console.cloud.google.com/&#34;&gt;Google Cloud Console&lt;/a&gt;, which is
the frontend for GCP and (at least at the time) was likely &lt;em&gt;the largest&lt;/em&gt;
&lt;a href=&#34;https://angularjs.org/&#34;&gt;Angular&lt;/a&gt; app in existence&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;. The team I joined had
been recently created and was building a set of tools to replace some
hilariously unergonomic tools that had organically grown along with the problem
space.&lt;/p&gt;
&lt;p&gt;The result was an amazing (if modestly overengineered) tool which was
legitimately delightful to use and resulted in a demonstrable productivity boost
for the &amp;gt;1k internal developers who used it. I recently ran into a current GCP
developer and was pleased to hear that this tool is still in active use &amp;ndash; and
not only that, but that this developer had a quite positive view of it.&lt;/p&gt;
&lt;p&gt;Fast forward a couple of years, and my current team at Databricks&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt; works on
internal infrastructure. More of our time is spent building
platform-level components for dynamic service configuration, but we also build
and maintain a set of tools for engineers to interface with these systems. And I
think developing the UX of these tools has probably been some of my
highest-leverage work thus far.&lt;/p&gt;
&lt;h2 id=&#34;why-devtools&#34;&gt;Why Devtools&lt;/h2&gt;
&lt;p&gt;So, why am I so enthusiastic about developer tools?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Developers will tell you their papercuts.&lt;/strong&gt; One of the best parts of building
tools for other engineers is that they readily tell you what’s not working. If a
developer has to interact with a painful tool or process regularly, you&amp;rsquo;ll find
out pretty quickly. This feedback is super useful, both at a product level (what
features to add/remove/improve) and at a meta level (what are developers
actually doing). I found this to be even more true when the tool was, well,
actually useful. When the tool becomes part of a developer&amp;rsquo;s daily workflow, you
start hearing very specific, granular feature requests and complaints, which are
great! We can&amp;rsquo;t always fulfill these requests, but in aggregate, they give a
directional sense of where to build next.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Great tooling is a force multiplier for the rest of the engineering
organization.&lt;/strong&gt; When internal developer tools are good, and they’re used
consistently, they improve the productivity and efficiency of all downstream
developers. This is notoriously difficult to measure quantitatively &amp;ndash; in
theory, there should be a measurable improvement in developer velocity, but this
is usually hard to get good numbers on. The real benefit is a &amp;ldquo;rising tide lifts
all boats&amp;rdquo; effect &amp;ndash; all developers on your team become tangibly better from
using a well-designed tool, and those that weren&amp;rsquo;t efficient at a particular set
of tasks become demonstrably more efficient using the tool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;There is cultural leverage to building tooling.&lt;/strong&gt; Closely related to the force
multiplier effect is the cultural effect of improving shared tooling. Developer
&amp;ldquo;best practices&amp;rdquo; are just codified shared behaviors and tool usage. Tooling can
be designed to encourage certain practices by reducing the friction to &amp;ldquo;do
things the right way&amp;rdquo;. For instance, if your goal is to enforce that all code
changes be linted, you can add the linter as part of your standard git commit
hooks (reducing friction) at the same time as adding the lint check to be a PR
merge precondition (ensuring compliance).&lt;/p&gt;
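&lt;p&gt;As a minimal sketch of the friction-reducing half (my own illustration; the &amp;ldquo;lint&amp;rdquo; command is a hypothetical stand-in for whatever linter your team actually uses):&lt;/p&gt;

```python
# Sketch of a pre-commit hook: when installed as .git/hooks/pre-commit
# (and made executable), a nonzero return from run_hook() blocks the commit.
# The "lint" command below is a hypothetical stand-in for your real linter.
import subprocess

def staged_files():
    """Return the paths staged for the current commit."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True,
    )
    return [path for path in out.stdout.splitlines() if path]

def lint_argv(files, linter="lint"):
    """Build the linter invocation for the staged files."""
    return [linter, *files]

def run_hook():
    files = staged_files()
    if not files:
        return 0  # nothing staged to lint; allow the commit
    # the linter's exit status becomes the hook's exit status
    return subprocess.run(lint_argv(files)).returncode
```

&lt;p&gt;Pairing a hook like this with the same check as a PR merge precondition is what makes the practice both easy and enforced.&lt;/p&gt;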
&lt;p&gt;&lt;strong&gt;Infrastructure builders get to interact with a lot of other teams.&lt;/strong&gt; Building
internal tooling necessitates interfacing with many other teams to understand
their use cases. This was an unexpectedly pleasant part of my experience
building internal tools: I got to work with teams from all around my company,
which can make you more aware of what other parts of your company are working
on. There&amp;rsquo;s also just a nice social component to this; I&amp;rsquo;ve talked with likely
hundreds of engineers in my product&amp;rsquo;s support channel, so whenever I travel to a
new office I often run into familiar faces. This can also lead to opportunities
to improve coordination between teams, increase your exposure to novel projects,
discover opportunities for improvement of your tooling, and notice places where
&amp;ldquo;competing&amp;rdquo; tools can be consolidated.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Working primarily on developer tools and internal infrastructure wasn&amp;rsquo;t
something I planned, but it&amp;rsquo;s turned out to be a really satisfying niche.
There&amp;rsquo;s something uniquely rewarding about creating things that directly help
your peers get their work done more effectively. It has its own set of unique
challenges, but the leverage you get and the direct feedback loop make it quite
enjoyable.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;For good or bad.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Obligatory &amp;ldquo;opinions my own, not that of my employer&amp;rdquo;&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>The Models Want to Reason</title>
        <link>https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/</link>
        <pubDate>Wed, 12 Feb 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/</guid>
        <description>&lt;p&gt;Since I wrote about &lt;a href=&#34;https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/&#34;&gt;COCONUT&lt;/a&gt;,
Meta&amp;rsquo;s paper on reasoning in latent space, there&amp;rsquo;s been a wave of publicly
accessible research into reasoning models. The most notable example, which
overshadows everything else to the point of feeling like I almost don&amp;rsquo;t need to
mention it as I write this in mid Feb 2025, was the
&lt;a href=&#34;https://arxiv.org/abs/2501.12948&#34;&gt;Deepseek R1 paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While I think the R1 paper is great and deserved the attention it garnered,
there has been a steady stream of additional research into the reasoning space
that I think begins to paint an interesting picture of what comes next.&lt;/p&gt;
&lt;h2 id=&#34;reasoning-scales-down&#34;&gt;Reasoning Scales Down&lt;/h2&gt;
&lt;p&gt;R1 proved that a reasonably well-resourced lab can produce a reasoner for on the
order of $1M of compute, by current estimates. This leads to the phrase
&lt;a href=&#34;https://thezvi.substack.com/p/deepseek-dont-panic&#34;&gt;&amp;ldquo;V3 implies R1&amp;rdquo;&lt;/a&gt;: we now
have a methodology, in the public domain, by which a sufficiently good base model&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;
can be turned into an even better reasoning model.&lt;/p&gt;
&lt;p&gt;But it seems reasoning can scale down dramatically, as is shown by
&lt;a href=&#34;https://arxiv.org/pdf/2501.19393&#34;&gt;s1: Simple test-time scaling&lt;/a&gt;. They achieved
impressive results on reasoning tasks with a quite small dataset and minimal
training. There&amp;rsquo;s been a lot written about this work already, but it&amp;rsquo;s
interesting and worth a read.&lt;/p&gt;
&lt;p&gt;My summarization of the paper&amp;rsquo;s methodology:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Assemble a small (1k samples), but highly curated and filtered reasoning
dataset of CoT reasoning traces. This involved generating reasoning traces from
a more powerful model (Gemini Thinking), and filtering to include just the
highest-value samples.&lt;/li&gt;
&lt;li&gt;Fine-tune Qwen2.5-32B on the reasoning dataset using SFT.&lt;/li&gt;
&lt;li&gt;Use a novel test-time scaling method (&amp;ldquo;Budget Forcing&amp;rdquo;) to
dynamically control how much inference to do at test time. This budget policy
can terminate reasoning early, or extend the reasoning by adding &amp;ldquo;Wait&amp;rdquo;
tokens. Essentially, you give it a reasoning budget &amp;ndash; N tokens &amp;ndash; and
it will coerce the model into expending approximately N tokens before giving a
final answer.&lt;/li&gt;
&lt;/ol&gt;
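&lt;p&gt;As a rough illustration (my own toy sketch, not the paper&amp;rsquo;s implementation), the budget-forcing loop can be mocked up with a stand-in next-token function and a placeholder end-of-thinking token:&lt;/p&gt;

```python
END_OF_THINKING = "end_think"  # stand-in for the model's end-of-reasoning token

def budget_forced_decode(next_token, budget):
    """Toy sketch of s1-style budget forcing.

    next_token: maps the reasoning trace so far (a list of tokens) to the
    next token, returning END_OF_THINKING when the model wants to stop.
    budget: approximate number of reasoning tokens to spend.
    """
    trace = []
    while True:
        tok = next_token(trace)
        if tok == END_OF_THINKING:
            if len(trace) >= budget:
                break  # budget spent: let reasoning end
            trace.append("Wait")  # suppress the stop and force more reasoning
            continue
        trace.append(tok)
        if len(trace) >= budget:
            break  # budget exhausted: terminate reasoning early
    return trace
```

&lt;p&gt;With a toy model that tries to stop after three tokens, a budget of six pads the trace with &amp;ldquo;Wait&amp;rdquo; tokens until the budget is met; a budget of two cuts reasoning short.&lt;/p&gt;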
&lt;p&gt;The &amp;ldquo;budget&amp;rdquo; used at inference time is decided as a static parameter of the
decoding, and isn&amp;rsquo;t determined dynamically. In contrast, R1 emergently learned
during training time to utilize longer CoTs &amp;ndash; the ability to insert &amp;ldquo;wait&amp;rdquo; into
the CoT was something the model could trigger for itself, without needing an
external system to intervene. So it seems s1, to some extent, succeeds by adding
the inductive bias that particular problems benefit from throwing more tokens at
them. The authors then use a novel method to force the model to emit those
additional tokens.&lt;/p&gt;
&lt;p&gt;My first response to this was, &amp;ldquo;It would be interesting to design a system that
can determine how much reasoning it needs to perform for a given problem.&amp;rdquo; But I
think that&amp;rsquo;s just o1/r1/o3 and friends. This ability is learned naturally in
longer RL post-training.&lt;/p&gt;
&lt;p&gt;My second response was: &amp;ldquo;Wait&amp;rdquo; is the new
&lt;a href=&#34;https://arxiv.org/pdf/2205.11916&#34;&gt;&amp;ldquo;Let&amp;rsquo;s think step by step&amp;rdquo;&lt;/a&gt; 😀&lt;/p&gt;
&lt;h2 id=&#34;reasoning-with-latent-tokens&#34;&gt;Reasoning with Latent Tokens&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://arxiv.org/pdf/2502.03275v1&#34;&gt;Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning&lt;/a&gt;:
This paper is an interesting take at expanding the reasoning expressiveness of
models. Instead of having the models reason in latent space with non-discrete
intermediate outputs during inference (a la COCONUT), they train the model to
learn a codebook of discrete latent tokens during training and then use these
tokens during inference.&lt;/p&gt;
&lt;p&gt;During training, the model uses a vector-quantized variational autoencoder
(VQ-VAE) to learn a set of non-text reasoning tokens. To do this, they use three
components in their loss:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reconstruction loss: This loss measures the ability of the VQ-VAE to
reconstruct the original input sequence from the set of quantized tokens it
produces.&lt;/li&gt;
&lt;li&gt;VQ Loss: This loss encourages the encoder to produce outputs that are close
to the learned codebook embeddings.&lt;/li&gt;
&lt;li&gt;Commitment Loss: This loss complements the VQ loss, and constrains the
VQ-VAE&amp;rsquo;s usage of the latent space. This tries to ensure that the outputs
remain quantized and don&amp;rsquo;t differ too much from the learned codebook.&lt;/li&gt;
&lt;/ol&gt;
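&lt;p&gt;Numerically, the combined objective is just a weighted sum of these three terms. Here is a toy sketch of my own, with plain lists standing in for tensors; note that in a real implementation the VQ and commitment terms have the same value and differ only in where the stop-gradient is applied:&lt;/p&gt;

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vqvae_loss(x, x_recon, z_e, codebook_vec, beta=0.25):
    """Toy combined VQ-VAE objective (illustration, not the paper's code).

    x / x_recon: original and reconstructed input sequence (as vectors).
    z_e: encoder output; codebook_vec: its nearest codebook embedding.
    """
    # 1. reconstruction loss: can the quantized tokens reproduce the input?
    recon = sq_dist(x, x_recon)
    # 2. VQ loss: pulls the codebook embedding toward the encoder output
    #    (a real implementation applies a stop-gradient to z_e here)
    vq = sq_dist(z_e, codebook_vec)
    # 3. commitment loss: keeps the encoder output close to the codebook
    #    (stop-gradient on the codebook side), weighted by beta
    commitment = beta * sq_dist(z_e, codebook_vec)
    return recon + vq + commitment
```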

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure3_1.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure3_1.png&#34; alt=&#34;Figure 3.1&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Figure 3.1 (&lt;a href=&#34;https://arxiv.org/pdf/2502.03275v1&#34;&gt;Arxiv&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;During training, the VQ-VAE is shown the entire input sequence, but is only
applied to the CoT tokens. The CoT sequences are gradually replaced with
corresponding latent tokens. This, along with randomized replacement of text
tokens with latent tokens, allows the model to adapt to using the latent tokens.&lt;/p&gt;
&lt;p&gt;At test time, the model uses these &amp;ldquo;Z&amp;rdquo; latent tokens in its CoT. The VQ-VAE is
only used for learning the tokens, and isn&amp;rsquo;t needed during inference.&lt;/p&gt;
&lt;p&gt;As for results, the model showed an increase in reasoning ability, both in- and
out-of-domain of the datasets they trained on. The increase in ability seems
relatively small compared to other reasoning methods (~4% improvement vs.
traditional CoT). The more interesting result is that the model was able to
improve slightly in performance while using ~17% fewer tokens in its reasoning
trace. So it seems like this method allows the model to develop a more efficient
vocabulary for reasoning.&lt;/p&gt;
&lt;p&gt;Overall, I&amp;rsquo;m not sure that the performance increase shown in &lt;em&gt;Token Assorted&lt;/em&gt; is
worth losing the visibility into the CoT.&lt;/p&gt;
&lt;h2 id=&#34;reasoning-with-recurrence-in-latent-space&#34;&gt;Reasoning with Recurrence in Latent Space&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://arxiv.org/abs/2502.05171&#34;&gt;Scaling up Test-Time Compute with Latent Reasoning&lt;/a&gt;:
This paper introduces a novel way of reasoning in latent space, using a
depth-recurrent transformer. Their model, &amp;ldquo;Huginn&amp;rdquo;, combines a traditional
decoder-only transformer with a recurrent block, which can be iterated multiple
times during decoding before emitting a token &amp;ndash; allowing for the &amp;ldquo;reasoning&amp;rdquo; in
latent space. Like s1&amp;rsquo;s budget policy, the number of thinking iterations can be
selected at test time.&lt;/p&gt;
&lt;p&gt;The high-level generation approach is, roughly, embed the input sequence into
the latent space using a transformer layer (&amp;ldquo;Prelude&amp;rdquo;), then iterate this
transformed input through a recurrent block for some number of iterations, then
un-embed the final state into an output token (&amp;ldquo;Coda&amp;rdquo;).&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure2.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure2.png&#34; alt=&#34;Figure 2&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Figure 2 (&lt;a href=&#34;https://arxiv.org/abs/2502.05171&#34;&gt;Arxiv&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;There are a couple really cool results about this approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The recurrent loops can be seen as a &lt;strong&gt;convergence&lt;/strong&gt; process, where the model
is able to spend more &amp;ldquo;effort&amp;rdquo; generating a token. More iterations result in
greater certainty in the output.&lt;/li&gt;
&lt;li&gt;The convergence framing allows for &lt;strong&gt;adaptive compute at inference time&lt;/strong&gt;. By
specifying a threshold difference between successive states in the recurrent
portion, the model can keep iterating until it has converged on a token &amp;ndash;
allowing for more iterations on &amp;ldquo;difficult&amp;rdquo; tokens and fewer on &amp;ldquo;easier&amp;rdquo;
tokens that converge more quickly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As a hand-wavy intuition for this, assume you have some &amp;ldquo;distance&amp;rdquo; measure &lt;code&gt;D&lt;/code&gt;
that lets you quantify the difference between two of the hidden states used in
the recurrence block. You can measure how much the model updates its
&amp;ldquo;belief&amp;rdquo; on each recurrent iteration by computing &lt;code&gt;D(r_i, r_{i+1})&lt;/code&gt;. If
the model is converging, the value of &lt;code&gt;D(r_i, r_{i+1})&lt;/code&gt; will shrink as
&lt;code&gt;i&lt;/code&gt;, the number of iterations, increases. You can then set some
threshold &lt;code&gt;T&lt;/code&gt; below which you stop iterating. Thus, if the model is
uncertain and needs many iterations before &lt;code&gt;D&lt;/code&gt; falls under &lt;code&gt;T&lt;/code&gt;,
it has time to take them; but once the model has converged, you don&amp;rsquo;t waste
compute on further iterations.&lt;/p&gt;
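&lt;p&gt;As a toy sketch of this early-exit idea (my own illustration, not the paper&amp;rsquo;s implementation &amp;ndash; the &lt;code&gt;distance&lt;/code&gt; function and the contraction-map &amp;ldquo;recurrent block&amp;rdquo; are made-up stand-ins):&lt;/p&gt;

```python
import numpy as np

def distance(a, b):
    # One possible choice of D: relative change between successive hidden states.
    return np.linalg.norm(b - a) / (np.linalg.norm(b) + 1e-8)

def iterate_until_converged(state, step, threshold=1e-3, max_iters=64):
    # Keep applying the recurrent step until successive states stop moving.
    for i in range(max_iters):
        new_state = step(state)
        if distance(state, new_state) < threshold:
            return new_state, i + 1  # converged early: stop spending compute
        state = new_state
    return state, max_iters  # hit the compute budget without converging

# Made-up "recurrent block": a contraction toward a fixed point, so it converges.
fixed_point = np.ones(4)
step = lambda s: fixed_point + 0.5 * (s - fixed_point)

final, iters = iterate_until_converged(np.zeros(4), step)
```

&lt;p&gt;A &amp;ldquo;hard&amp;rdquo; token would start far from its fixed point and burn more iterations before the loop exits; an &amp;ldquo;easy&amp;rdquo; one exits almost immediately.&lt;/p&gt;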
&lt;p&gt;There are two figures from the paper that illustrate this well:&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure19.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure19.png&#34; alt=&#34;Figure 19&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Figure 19 (&lt;a href=&#34;https://arxiv.org/pdf/2502.05171&#34;&gt;Arxiv&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In Figure 19, we see the model processing a query, with colors representing how
much the model updated its hidden state with each recurrent iteration. The
darker colors roughly indicate high convergence, and the lighter colors indicate
low convergence. For most tokens in the sequence, the model converges quickly.
The tokens between sentences, and those in “Faust”, seem to indicate lower
confidence, as even after several iterations the model doesn’t settle into a
firm prediction.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure12.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2025/02/12/The-Models-Want-to-Reason/figure12.png&#34; alt=&#34;Figure 12&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Figure 12 (&lt;a href=&#34;https://arxiv.org/pdf/2502.05171&#34;&gt;Arxiv&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In Figure 12, we see a lower-dimensional projection (via PCA) of the trajectory
of the latent space state for particular tokens. This is a way to visualize how
the hidden state evolves as it goes through more iterations. There are a bunch
of interesting behaviors here. In the first row, we see a straightforward
convergence to the center. For the second row, the token is the number &lt;code&gt;3&lt;/code&gt;, and
the model exhibits an &amp;ldquo;orbiting&amp;rdquo; pattern. And for the third row, the token
oscillates/slides in the second component. The paper authors infer that this
learned orbiting and sliding behavior allows handling of more advanced concepts.
(All of this was learned emergently and wasn&amp;rsquo;t part of the training objective!)&lt;/p&gt;
&lt;p&gt;Another interesting aspect of this paper is that they do not require
specialized datasets containing CoT for their training. The training procedure is
relatively involved, so my summary will not do it justice. However, at a high
level, they run the recurrent block for a random number of iterations during
training. So the model isn&amp;rsquo;t &amp;ldquo;locked&amp;rdquo; into a particular number of
iterations &amp;ndash; it is pushed to learn to use an arbitrary number of iterations
on any particular generation.&lt;/p&gt;
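&lt;p&gt;A loose sketch of the random-depth training idea (the sampling distribution and the toy recurrent step below are my own assumptions, not the paper&amp;rsquo;s exact recipe):&lt;/p&gt;

```python
import numpy as np

def unroll(state, step, n):
    # Run the recurrent block n times; the depth n varies per training example.
    for _ in range(n):
        state = step(state)
    return state

def sample_depths(batch_size, log_mean=3.0, log_sigma=0.5, cap=64):
    # Draw per-example recurrence depths from a heavy-tailed distribution,
    # so training covers both shallow and deep unrollings (clipped to a cap).
    raw = np.random.lognormal(mean=log_mean, sigma=log_sigma, size=batch_size)
    return np.clip(np.round(raw).astype(int), 1, cap)

depths = sample_depths(8)
# Toy stand-in for the recurrent block, unrolled to each sampled depth.
states = [unroll(np.zeros(2), lambda s: 0.5 * s + 1.0, int(n)) for n in depths]
```

&lt;p&gt;Because the loss is applied after depths the model can&amp;rsquo;t predict, it has to produce useful states at every iteration count rather than specializing to one.&lt;/p&gt;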
&lt;p&gt;&lt;strong&gt;Zero-Shot Continuous CoT:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;By reusing the hidden state from the recurrence block between token decoding
steps, the model can use this hidden state as a reasoning &amp;ldquo;scratch pad&amp;rdquo;. For
example, when working on a task that requires reasoning, reusing this hidden
state seems to let the model carry over some of the partial work it has done on
token &lt;code&gt;i&lt;/code&gt;, so that when decoding token &lt;code&gt;i+1&lt;/code&gt; it can recycle
that work and needs fewer iterations to converge.&lt;/p&gt;
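&lt;p&gt;A toy illustration of why warm-starting helps (again with made-up contraction maps standing in for the recurrent block): if adjacent tokens have nearby fixed points, starting token &lt;code&gt;i+1&lt;/code&gt; from token &lt;code&gt;i&lt;/code&gt;&amp;rsquo;s final state converges in fewer iterations than a cold start.&lt;/p&gt;

```python
import numpy as np

def run_recurrence(state, step, threshold=1e-3, max_iters=64):
    # Iterate until successive states stop changing much; return the
    # final state and the number of iterations that took.
    for i in range(max_iters):
        new_state = step(state)
        if np.linalg.norm(new_state - state) < threshold:
            return new_state, i + 1
        state = new_state
    return state, max_iters

# Made-up recurrent blocks for two adjacent decoding steps: their fixed
# points are close, so partial work on token i transfers to token i+1.
step_i      = lambda s: np.array([1.0, 1.0]) + 0.5 * (s - np.array([1.0, 1.0]))
step_i_next = lambda s: np.array([1.1, 0.9]) + 0.5 * (s - np.array([1.1, 0.9]))

cold_init = np.zeros(2)
state_i, _ = run_recurrence(cold_init, step_i)

_, iters_cold = run_recurrence(cold_init, step_i_next)  # fresh scratch pad
_, iters_warm = run_recurrence(state_i, step_i_next)    # reuse token i's state
```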
&lt;p&gt;&lt;strong&gt;Zero-Shot Self-Speculative Decoding:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Speculative decoding is a performance optimization which uses a cheaper &amp;ldquo;draft&amp;rdquo;
model to reduce the cost of decoding a more expensive model. This recurrent
model gets you something similar to speculative decoding &amp;ldquo;for free&amp;rdquo; without a
draft model. To do this, you can decode with a small number of iterations &lt;code&gt;N&lt;/code&gt;
and treat that as the draft model, then decode with a larger number of iterations
&lt;code&gt;M&lt;/code&gt; as the ground-truth model. (So, &lt;code&gt;M &amp;gt; N&lt;/code&gt;.) This has two efficiencies: First,
since the draft and ground-truth model use the same weights, you don&amp;rsquo;t have to
have two models loaded during inference. Second, the hidden states calculated in
the draft model (&lt;code&gt;N&lt;/code&gt; iterations) can be reused by the ground-truth model when
decoding from &lt;code&gt;[N+1 -&amp;gt; M]&lt;/code&gt; iterations.&lt;/p&gt;
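&lt;p&gt;A toy sketch of the draft/verify loop (the &lt;code&gt;recur&lt;/code&gt; and &lt;code&gt;emit&lt;/code&gt; functions are invented stand-ins for the recurrent block and the Coda, and a real implementation would verify several draft tokens per batched pass):&lt;/p&gt;

```python
def recur(state, token, n_iters):
    # Toy "recurrent block": contracts toward a token-dependent fixed point.
    target = (token * 7 % 10) / 10.0
    for _ in range(n_iters):
        state = target + 0.5 * (state - target)
    return state

def emit(state):
    # Toy "Coda": quantize the latent state into a token id.
    return int(round(state * 10))

def self_speculative_decode(first_token, length, n_draft=4, n_full=16):
    tokens, accepted = [first_token], 0
    state = 0.0
    for _ in range(length):
        draft_state = recur(state, tokens[-1], n_draft)  # cheap draft pass
        draft = emit(draft_state)
        # The verify pass reuses the draft's hidden state instead of
        # restarting, so only the remaining iterations are paid for.
        full_state = recur(draft_state, tokens[-1], n_full - n_draft)
        full = emit(full_state)
        if draft == full:
            accepted += 1      # draft was right: it could have been emitted early
        tokens.append(full)    # the full-depth token is always the one kept
        state = full_state
    return tokens, accepted

tokens, accepted = self_speculative_decode(first_token=3, length=5)
```

&lt;p&gt;Since the draft and ground-truth passes share weights and hidden states, the only extra cost of drafting is the risk of a rejected token.&lt;/p&gt;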
&lt;p&gt;&lt;strong&gt;Zero-Shot Adaptive Compute at Test-Time:&lt;/strong&gt; As discussed earlier, you can use a
convergence threshold &lt;code&gt;T&lt;/code&gt; when calculating the difference between recurrent
states, and stop after the threshold is reached. This allows per-token adaptive
compute, allowing the model to spend more compute on harder tokens.&lt;/p&gt;
&lt;h2 id=&#34;the-bitter-lesson&#34;&gt;The Bitter Lesson&lt;/h2&gt;
&lt;p&gt;With the release of o1/r1/o3, I’ve seen the sentiment of “Hey, this is further
proof of the
&lt;a href=&#34;http://www.incompleteideas.net/IncIdeas/BitterLesson.html&#34;&gt;bitter lesson&lt;/a&gt;.
Scaling English CoT is all you need to reach AGI”. Along with variants of: Mamba
was a dead end, MCTS was a dead end, COCONUT is a dead end.&lt;/p&gt;
&lt;p&gt;Mamba for sure hasn’t started “working” yet at the frontier, and I don’t see
that changing soon. I don’t think MCTS should be ruled out yet, though it also
hasn’t had a breakout moment. It’s still quite early for latent reasoning. The
model development pipeline is long, and I’m confident that frontier labs are
trying to make latent reasoning work — for the efficiency gains, if nothing
else. In a world where inference costs will increasingly be a bottleneck,
shaving off that much compute while maintaining or even
slightly improving performance is quite appealing.&lt;/p&gt;
&lt;p&gt;The initial takeaway from this wave of reasoning research is that we&amp;rsquo;re still in
the early stages of understanding how to make models think effectively. While
English CoT has set a high bar, early results in latent reasoning and adaptive
compute suggest there&amp;rsquo;s still plenty of low-hanging fruit in making models
reason more efficiently. The bitter lesson can point us where we&amp;rsquo;ll end up
(&amp;ldquo;simple&amp;rdquo; architectures in hindsight, scaled massively), but there&amp;rsquo;s still much
to learn about the best way to get there.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover image by Recraft v3.&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;It seems like the cutoff for models being able to learn reasoning ability is
roughly in the 1.5B - 3B parameter range. But also the Qwen models seem much
more capable of being transformed into reasoners than Llama, so there&amp;rsquo;s
probably something other than raw scale going on here.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>How I Use AI: Early 2025</title>
        <link>https://benjamincongdon.me/blog/2025/02/02/How-I-Use-AI-Early-2025/</link>
        <pubDate>Sun, 02 Feb 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/02/02/How-I-Use-AI-Early-2025/</guid>
        <description>&lt;p&gt;&lt;em&gt;Previously:
&lt;a href=&#34;https://benjamincongdon.me/blog/2024/07/21/AI-Tools-in-Mid-2024/&#34;&gt;Mid 2024&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The landscape of AI tooling continues to shift, even in the past half year. This
is not unexpected. This post is an updated snapshot of the “state of things I
use”.&lt;/p&gt;
&lt;h2 id=&#34;tools&#34;&gt;Tools&lt;/h2&gt;
&lt;h3 id=&#34;work&#34;&gt;Work&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://code.visualstudio.com/docs/copilot/copilot-edits&#34;&gt;Copilot Edits&lt;/a&gt;:&lt;/strong&gt;
This feels roughly 85% as effective as Cursor, but the ability to
incorporate enterprise code context makes it roughly on par. &lt;strong&gt;I am shocked
that I never hear anyone talking about this.&lt;/strong&gt; My general workflow is: load
in up to 10 files into the working context, and ask for changes. It can
change multiple files at a time. It feels a bit like magic when it works
right.
&lt;ul&gt;
&lt;li&gt;I&amp;rsquo;ve used it on languages that are not well covered by LLMs &amp;ndash; Scala,
Rust &amp;ndash; and the results are surprisingly usable. The quality is good
enough that I&amp;rsquo;ve started to reach for this first for most tasks.&lt;/li&gt;
&lt;li&gt;With Rust, I sometimes need to step in and help the model when it gets
stuck. I&amp;rsquo;ve had to point out that it&amp;rsquo;s not making progress, or defer to
a reasoning LLM to get past a logical impasse.&lt;/li&gt;
&lt;li&gt;Copilot now allows you to set custom instructions, similar to Cursor. I
have built up custom language-specific instructions so that I get
outputs that more consistently match the idioms and style of my
company&amp;rsquo;s / team&amp;rsquo;s codebase.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://www.anthropic.com/claude/sonnet&#34;&gt;Claude 3.5 Sonnet&lt;/a&gt; New (via
&lt;a href=&#34;https://www.anthropic.com/news/claude-pro&#34;&gt;Claude Pro&lt;/a&gt;):&lt;/strong&gt; (a.k.a Sonnet
3.6, newsonnet) Sonnet 3.5 remains my daily driver and all around favorite
model. In Claude Pro, the &amp;ldquo;Projects&amp;rdquo; feature is amazing. In any given week,
I write several design documents, PRDs, announcements, one-pagers, etc. With
Projects, I can dump in relevant context documents from related projects,
iterate rapidly on writing, and have Claude output suggestions in a style
that matches my &amp;ldquo;organic&amp;rdquo; writing. The process looks something like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Paste in a collection of relevant documents: design docs, meeting notes,
prior art, public documentation, etc.&lt;/li&gt;
&lt;li&gt;Use a custom writing style to &amp;ldquo;write as me&amp;rdquo; (more on that in the
Techniques section).&lt;/li&gt;
&lt;li&gt;Iterate on the prompt, refining the output until it&amp;rsquo;s nearly publishable.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The quality of the output is often good enough that I can copy/paste entire
sections into design documents with only minimal editing. This works better
in some contexts than others, but for non-thinking-heavy sections like
&amp;ldquo;Background&amp;rdquo; or &amp;ldquo;Overview&amp;rdquo; sections, I can usually get great outputs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://notebooklm.google.com/&#34;&gt;NotebookLM&lt;/a&gt;:&lt;/strong&gt; Before I started using
Claude Pro, NotebookLM was my go-to for working with a large corpus of
documents. It&amp;rsquo;s tightly integrated into Google Workspace, which is
convenient. I can dump in 20+ documents and ask questions about them as a
corpus. Gemini just isn&amp;rsquo;t as strong as a writer, so I don&amp;rsquo;t use the output
of NotebookLM much.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ChatGPT 4o:&lt;/strong&gt; 4o feels like an outdated model at this point, but you still
get unlimited use with the ChatGPT Pro plan, and the UX for
ChatGPT-for-macOS is pretty great. &lt;code&gt;Option+Space&lt;/code&gt; to get a ChatGPT window is
a killer feature. By pure invocation/conversation count, 4o is probably my
most used model &amp;ndash; though most of the queries look more like Google searches
than conversations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/simonw/llm&#34;&gt;llm&lt;/a&gt; (CLI tool):&lt;/strong&gt; This has become
indispensable for quick, one-off tasks. It&amp;rsquo;s great for drafting git commit
messages, reformatting text, etc. It&amp;rsquo;s hard to really write about &lt;em&gt;what&lt;/em&gt; I
use &lt;code&gt;llm&lt;/code&gt; for since it&amp;rsquo;s a bunch of one-offs. At minimum, the &amp;ldquo;chat in CLI&amp;rdquo;
UX is surprisingly useful.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://www.perplexity.ai/&#34;&gt;Perplexity&lt;/a&gt; Pro:&lt;/strong&gt; We have access to
Perplexity Pro at work. I was initially impressed, but as time goes on, I find it
increasingly disappointing! I sometimes use it as a &amp;ldquo;free action&amp;rdquo; for
search, but even then, ChatGPT Search is usually better. I once tried to
replace Google with Perplexity as my default search engine, and didn&amp;rsquo;t last
more than a day. Perhaps I&amp;rsquo;m just not using it correctly. Also, the company
has
&lt;a href=&#34;https://apnews.com/article/tiktok-bytedance-trump-perplexity-87988733973760927bb5681f7de9b9af&#34;&gt;strange vibes recently&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://labs.google/fx/tools/image-fx&#34;&gt;ImageFX&lt;/a&gt;&lt;/strong&gt;: Google&amp;rsquo;s image
generation studio, which uses
&lt;a href=&#34;https://deepmind.google/technologies/imagen-3/&#34;&gt;Imagen 3&lt;/a&gt;. I&amp;rsquo;ve found this
useful to make relatively compelling
non-&lt;a href=&#34;https://benjamincongdon.me/blog/2025/01/25/AI-Slop-Suspicion-and-Writing-Back/&#34;&gt;slop&lt;/a&gt;-y
illustrations for presentations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;personal&#34;&gt;Personal&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;llm:&lt;/strong&gt; Just as for work, &lt;code&gt;llm&lt;/code&gt; on the command line is incredibly handy for
personal projects. I don&amp;rsquo;t pay for a personal Claude Pro license, so I use
Claude on the command line pretty frequently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://aistudio.google.com/gallery&#34;&gt;Google AI Studio&lt;/a&gt;:&lt;/strong&gt; Google&amp;rsquo;s AI
Studio is completely free to use, so I frequently use Gemini via the AI
Studio. Gemini has hands down the best multi-modality of any model family.
It&amp;rsquo;s great for audio, video (!), and PDF inputs. I have more thoughts on
Gemini in my Models section.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personal Customized
&lt;a href=&#34;https://github.com/vercel/ai-chatbot&#34;&gt;Vercel AI Chatbot&lt;/a&gt;:&lt;/strong&gt; I&amp;rsquo;ve set up a
personalized chatbot using Vercel&amp;rsquo;s AI Chatbot
&lt;a href=&#34;https://github.com/vercel/ai-chatbot&#34;&gt;template&lt;/a&gt;.
&lt;ul&gt;
&lt;li&gt;My main changes were adding support for Anthropic models, changing the
database to be a local SQLite file, and ripping out all the tool use
features that I had no use for.&lt;/li&gt;
&lt;li&gt;My main reason for wanting this was: Wanting all my chats to only be
saved locally (in theory Anthropic
&lt;a href=&#34;https://privacy.anthropic.com/en/articles/7996875-can-you-delete-data-that-i-sent-via-api&#34;&gt;doesn&amp;rsquo;t permanently retain API logs&lt;/a&gt;),
and wanting to have a handful of custom starter templates that I could
easily reach for.&lt;/li&gt;
&lt;li&gt;I could have probably saved some time by using
&lt;a href=&#34;https://openwebui.com/&#34;&gt;OpenWebUI&lt;/a&gt;, but it&amp;rsquo;s satisfying to have
something custom. :)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://www.cursor.com/&#34;&gt;Cursor&lt;/a&gt;:&lt;/strong&gt; I use Cursor for personal coding
projects, especially when working with smaller or greenfield codebases. I
almost always use Sonnet 3.5. I initially thought I&amp;rsquo;d burn through the
monthly credits that Cursor gives you, but that hasn&amp;rsquo;t been an issue so far.
(I don&amp;rsquo;t do a ton of side project coding these days though) As of today, I&amp;rsquo;m
using Copilot Edits 5-10x more than Cursor, but that&amp;rsquo;s mostly because I
cannot currently use Cursor at &lt;code&gt;$WORK&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://notebooklm.google.com/&#34;&gt;NotebookLM&lt;/a&gt; Podcasts&lt;/strong&gt;: When running, if
I don&amp;rsquo;t have an appealing podcast, I&amp;rsquo;ll generate one on NotebookLM from a
recent Arxiv paper or blog post. The quality varies significantly, and I
tend to listen on 1.5x speed.
&lt;ul&gt;
&lt;li&gt;The hosts sometimes devolve into trite discussions about the &amp;ldquo;ethical
implications of AI&amp;rdquo; when describing a technical research paper, so it&amp;rsquo;s
very much not perfect. Still surprisingly good for what it is, and it
does often capture my attention more than would a pure TTS reading of
the underlying content.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://fal.ai/&#34;&gt;FAL&lt;/a&gt;&lt;/strong&gt;: FAL is a host of a bunch of image generation
models (among other &amp;ldquo;generative media&amp;rdquo; algorithms). It&amp;rsquo;s quite easy/cheap to
use, and produces better results than e.g. DALLE.
&lt;ul&gt;
&lt;li&gt;I currently use &lt;a href=&#34;https://fal.ai/models/fal-ai/recraft-v3&#34;&gt;Recraft v3&lt;/a&gt;
for my blog images. Late last year, I also trained custom LORAs of
&lt;a href=&#34;https://fal.ai/models/fal-ai/flux-pro/v1.1&#34;&gt;Flux1.1&lt;/a&gt; on FAL to create
some fun images of my cats. 🐈‍⬛&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;techniques&#34;&gt;Techniques&lt;/h2&gt;
&lt;p&gt;Here are a few techniques I&amp;rsquo;ve found to be particularly effective when working
with these tools:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompting (Code):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Context Management&lt;/strong&gt;: I find that the single biggest factor in getting
good results from an LLM &amp;ndash; especially for coding &amp;ndash; is the context you
provide. When using tools like Cursor and Copilot Edits, getting a good set
of files that are relevant to the task at hand into your context is key. I
haven&amp;rsquo;t found anything yet that is able to maintain good context itself,
outside of trivially small code bases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test Generation&lt;/strong&gt;: I&amp;rsquo;ve found that asking for test cases to be generated
is a great way to get a model to understand the behavior of the change I&amp;rsquo;m
asking for.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; Unit tests are also usually super easy to pattern match and
generate given in-context examples, so the quality is usually quite high.
It&amp;rsquo;s often useful to have idiomatic examples of your testing patterns in
your context, so that the model can generate tests that match your existing
style. As a final tip, asking an LLM &amp;ldquo;are there any missing tests?&amp;rdquo; is a
good &amp;ldquo;free&amp;rdquo; way to increase test coverage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loop: Copy/Paste Compiler &amp;amp; Errors&lt;/strong&gt;: This feels like extremely
low-hanging fruit for improved workflows, but for now my loop is essentially
to start ibazel (or whatever other test runner you have, in &amp;ldquo;watch mode&amp;rdquo;),
have the LLM propose changes, then copy/paste the compiler or test errors
back into the LLM to get it to fix the issues.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these techniques require the caveat that you need to be actively engaged
in the process while prompting &amp;amp; evaluating LLM output. Treat the LLM as an
intern or junior developer that you&amp;rsquo;re coaching along. Blindly accepting output
is still a recipe for disaster.&lt;/p&gt;
&lt;p&gt;As a counterpoint to this note of caution,
&lt;a href=&#34;https://x.com/karpathy/status/1886192184808149383&#34;&gt;Karpathy recently coined the term &amp;ldquo;vibe coding&amp;rdquo;&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There&amp;rsquo;s a new kind of coding I call &amp;ldquo;vibe coding&amp;rdquo;, where you fully give in to
the vibes, embrace exponentials, and forget that the code even exists. It&amp;rsquo;s
possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too
good.&lt;/p&gt;
&lt;p&gt;I &amp;ldquo;Accept All&amp;rdquo; always, I don&amp;rsquo;t read the diffs anymore. When I get error
messages I just copy paste them in with no comment, usually that fixes it. The
code grows beyond my usual comprehension, I&amp;rsquo;d have to really read through it
for a while. Sometimes the LLMs can&amp;rsquo;t fix a bug so I just work around it or
ask for random changes until it goes away. It&amp;rsquo;s not too bad for throwaway
weekend projects, but still quite amusing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Admittedly, this approach does work for small, throwaway projects, as he
notes, but not for anything that needs to be maintained or scaled. However, this
does seem to be the direction we&amp;rsquo;re headed. When quality code generation becomes
too cheap to meter, if you can strap an optimization loop around
generation&amp;lt;&amp;gt;evaluation, you can get a lot of work done with minimal effort.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompting (Writing):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&amp;ldquo;Give me 3 options&amp;rdquo;:&lt;/strong&gt; Whenever I&amp;rsquo;m generating text that will be used in a
document or email, I always ask for multiple options. This allows me to
either pick the best one or, more often, combine the best parts of each to
create something that feels more natural and human. I don&amp;rsquo;t trust any model
to one-shot human-sounding text.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;ldquo;Write as me&amp;rdquo; prompts:&lt;/strong&gt; Models are still not amazing at copying writing
styles, but the models that are good at creative writing tend to be at least
OK at writing in my personal style.
&lt;ul&gt;
&lt;li&gt;The workflow looks like:
&lt;ol&gt;
&lt;li&gt;Take a large chunk of your writing and put it into R1 or Claude. Ask
&amp;ldquo;Write a style guide for writing exactly as the author of this
text.&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Then use that as a preamble to creative writing tasks, or as a Custom
Style in Claude.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;I&amp;rsquo;ve found the models to be best at this approach are Sonnet 3.5 and
(surprisingly) Deepseek R1. None of the OpenAI models fare well here, in
my testing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use as a &amp;ldquo;calculator for words&amp;rdquo;:&lt;/strong&gt; LLMs remain great for simple, mindless
reformatting. Tasks like:
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Reformat this text as a comma separated list&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Find the latest date from this huge list of unstructured dates&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Convert to a bullet pointed list&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Remove duplicates from this list&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Related Usage Techniques:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&amp;ldquo;Copy as Markdown&amp;rdquo; from Google Docs:&lt;/strong&gt; LLMs handle Markdown particularly
well. Google Docs
&lt;a href=&#34;https://support.google.com/docs/answer/12014036?hl=en&#34;&gt;now allows you to copy content as Markdown&lt;/a&gt;,
which makes it easy to transfer text between the two environments. &amp;ldquo;Paste as
Markdown&amp;rdquo; is also useful.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Markdown tables!&lt;/strong&gt; I really dislike the Markdown table syntax, so
I almost never use them. Asking an LLM to output a Markdown table
and then copying that into a Google Doc is &lt;em&gt;awesome.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;macOS Speech to Text:&lt;/strong&gt; I never thought I&amp;rsquo;d say this, but sometimes
talking &lt;em&gt;is&lt;/em&gt; faster than typing. I&amp;rsquo;ve been using macOS&amp;rsquo;s built-in
&lt;a href=&#34;https://support.apple.com/guide/mac-help/use-dictation-mh40584/mac&#34;&gt;speech-to-text&lt;/a&gt;
more and more when &amp;ldquo;writing&amp;rdquo; out conversational prompts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pbpaste / pbcopy&lt;/strong&gt; Since LLM usage today often relies heavily on
copy/paste, using the &lt;code&gt;pbcopy&lt;/code&gt; and &lt;code&gt;pbpaste&lt;/code&gt; commands (at least, on macOS)
has been useful. These let you copy and paste, respectively, from the
clipboard from the CLI.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;unexpectedly-useful-use-cases&#34;&gt;Unexpectedly Useful Use Cases&lt;/h2&gt;
&lt;p&gt;Beyond the obvious applications, I&amp;rsquo;ve found AI to be surprisingly useful in a
few unexpected areas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Finding a last-minute hike:&lt;/strong&gt; Any good model has grokked all of AllTrails,
and they give good recommendations even with complex criteria. It&amp;rsquo;s great
for finding hikes that meet specific criteria (e.g., &amp;ldquo;not crowded, loop
trail, between 5 and 10 miles, moderate difficulty&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;As a &amp;ldquo;free action&amp;rdquo; for code review:&lt;/strong&gt; Before reviewing a pull request, I
often pipe the diff into a model like o1 to see if it finds anything
objectionable. Worst case, you get slop out that you can ignore. I&amp;rsquo;ve had o1
catch some quite subtle bugs that I didn&amp;rsquo;t catch on first review.
&lt;ul&gt;
&lt;li&gt;Aside: Compared to a year ago, AI code review actually seems feasible
now. The original GPT-4-class models just weren&amp;rsquo;t great at code review,
due to context length limitations and the lack of reasoning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Planning a Catio:&lt;/strong&gt; We recently built a catio for our cats. I needed to
calculate how much PVC pipe we needed. My partner drafted the plans in CAD,
and I fed this into ChatGPT, which used Code Execution to plan out all the
pieces. I also got it to generate labels for each of the piece lengths,
which we annotated back on the plan. It saved us a ton of time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Financial Advice:&lt;/strong&gt; &lt;em&gt;(⚠️ Caveat emptor ⚠️)&lt;/em&gt; This one requires a huge grain
of salt, but I recently had to make a large financial decision and found
LLMs helpful as a secondary gut check on my math. I asked Claude, R1,
Gemini, GPT-4o, and GPT-o1 for their thoughts on my approach. All of them
agreed directionally with the reasoning I came up with. This gave me a bit
more confidence in my decision. Obviously check your work here.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Medical Advice&lt;/strong&gt;: &lt;em&gt;(⚠️ Caveat emptor ⚠️)&lt;/em&gt; Same huge grain of salt, but
using o1 / Claude as a second opinion for diagnosing symptoms and evaluating
medical test results is definitely worth doing. I had some blood work done a
few months ago, and got the raw results back prior to having my doctor
review the results. Claude&amp;rsquo;s evaluation of the tests matched 1:1 with my
doctor&amp;rsquo;s later report. (Granted: This was a fairly simple case.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;model-tier-rank&#34;&gt;Model Tier Rank&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s my current ranking of the models I&amp;rsquo;ve been using, based on their overall
utility:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;S Tier:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Claude 3.5 Sonnet:&lt;/strong&gt; An absolute workhorse. Smart across many
domains—technical, creative writing, etc. It&amp;rsquo;s my go-to for most tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deepseek R1:&lt;/strong&gt; Cheap and smart enough to not feel bad about using it.
Deepseek R1 + Web Search is incredibly powerful. It&amp;rsquo;s a great option for
tasks that require up-to-date information or external knowledge.
&lt;ul&gt;
&lt;li&gt;A note on serving: As of writing, the
&lt;a href=&#34;https://www.deepseek.com/&#34;&gt;Deepseek platform&lt;/a&gt; serves R1
(undistilled) the fastest of any provider I&amp;rsquo;ve seen. If you have
data residency concerns, or concerns about Deepseek&amp;rsquo;s
&lt;a href=&#34;https://www.wiz.io/blog/wiz-research-uncovers-exposed-deepseek-database-leak&#34;&gt;security practices&lt;/a&gt;,
I&amp;rsquo;ve found that
&lt;a href=&#34;https://openrouter.ai/deepseek/deepseek-r1&#34;&gt;OpenRouter&lt;/a&gt; provides a
good alternative. Sadly, OpenRouter&amp;rsquo;s web search is qualitatively
worse than DeepSeek&amp;rsquo;s.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A Tier:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Claude 3 Opus:&lt;/strong&gt; It&amp;rsquo;s amazing, just so expensive I can&amp;rsquo;t really
justify using it for most tasks. Opus has been eclipsed by Sonnet 3.5
(and others) on coding, but is still great for writing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;o1:&lt;/strong&gt; Impressive sometimes, but rather hit or miss in my experience.
When it works, it&amp;rsquo;s impressively good. I notice that I don&amp;rsquo;t reach for
this model much relative to the hype/praise it receives. Usage limits
really deter me from leaning on a model. &amp;ldquo;You&amp;rsquo;re out of messages until
Monday&amp;rdquo; is a bad feeling. I don&amp;rsquo;t want my tools to feel like they&amp;rsquo;re
scarce.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;B Tier:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;o1-Mini:&lt;/strong&gt; I used this way more than o1 this year. It&amp;rsquo;s pretty good
for coding. This model appears to no longer be available in ChatGPT
following the release of &lt;code&gt;o3-mini&lt;/code&gt;, so I doubt I will use it much again.
That being said, I will likely use this class of model more now that
&lt;code&gt;o3-mini&lt;/code&gt; exists.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;C Tier:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gemini 2.0 Flash, Gemini 2.0 Flash Thinking, Gemini Experimental
1206:&lt;/strong&gt; I want to like Gemini; it&amp;rsquo;s just not the best on any
frontier that I care most about. The most obvious way it&amp;rsquo;s
better is that the context length is enormous. It&amp;rsquo;s also free on AI
Studio, which is confusingly generous. My favorite party trick is that I
put 300k tokens of my public writing into it and used that to generate
new writing in my style. However, the &amp;ldquo;write as me&amp;rdquo; prompt technique
works nearly as well &amp;ndash; and often better. Gemini models are also
weirdly sensitive to changes in temperature settings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&#34;https://openai.com/index/openai-o3-mini/&#34;&gt;&lt;code&gt;o3-mini&lt;/code&gt;&lt;/a&gt; just came out yesterday.
I&amp;rsquo;ve used it a bit, but not enough to give a confident rating.&lt;/p&gt;
&lt;h2 id=&#34;what-im-not-using&#34;&gt;What I&amp;rsquo;m Not Using&lt;/h2&gt;
&lt;p&gt;There are a few tools that I&amp;rsquo;m not currently using, either because I haven&amp;rsquo;t
found them to be particularly useful or because they&amp;rsquo;re still too early in their
development:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://openai.com/index/introducing-chatgpt-pro/&#34;&gt;ChatGPT Pro&lt;/a&gt;:&lt;/strong&gt; I just
don&amp;rsquo;t see $200 in utility there. Unlimited o1 would be nice. $200/month is
too much to stomach, even though in raw economic terms it&amp;rsquo;s probably worth
it.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Operator:&lt;/strong&gt; I don&amp;rsquo;t see the utility for me yet. It&amp;rsquo;s a cool research
demo today.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;o1-Pro:&lt;/strong&gt; I&amp;rsquo;d love to try this. Again, the $200 price tag just doesn&amp;rsquo;t
seem worth it.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/browser-use/browser-use&#34;&gt;Browser-use&lt;/a&gt;:&lt;/strong&gt; Open-source
version of Operator. It&amp;rsquo;s cool; I tried it; it&amp;rsquo;s slow. The version of this
that acts at ~1 action per second instead of ~1 action per minute will be
a force to be reckoned with.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://www.anthropic.com/news/3-5-models-and-computer-use&#34;&gt;Anthropic Computer Use&lt;/a&gt;:&lt;/strong&gt;
See Operator and Browser-use above.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://help.openai.com/en/articles/10119604-work-with-apps-on-macos&#34;&gt;ChatGPT &amp;ldquo;Work with Apps&amp;rdquo;&lt;/a&gt;:&lt;/strong&gt;
This would be great with Chrome, or some other app I&amp;rsquo;m not familiar with
like Godot, but given it just supports Terminal and IDEs, I&amp;rsquo;d rather use
Cursor or Copilot. I think this could get good, I just don&amp;rsquo;t see any use
cases for me yet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://openai.com/index/introducing-gpts/&#34;&gt;ChatGPT &amp;ldquo;GPTs&amp;rdquo;&lt;/a&gt;:&lt;/strong&gt; I used
them modestly in 2024, but as new features were added (voice mode, search,
reasoning, etc.), GPTs often couldn&amp;rsquo;t use these features so I stopped using
them as much. I basically used them as custom placeholder prompts, so there
wasn&amp;rsquo;t much value added.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Local Models&lt;/strong&gt;: Aside from trying out
&lt;a href=&#34;https://ollama.com/&#34;&gt;Ollama&lt;/a&gt;/&lt;a href=&#34;https://lmstudio.ai/&#34;&gt;LMStudio&lt;/a&gt; just to see
if they work, I haven&amp;rsquo;t found any durable use cases for local models. For
human-in-the-loop LLM usage, I just think there isn&amp;rsquo;t much reason to not use
the most powerful model available. If I trusted Anthropic less, I&amp;rsquo;d probably
look into local models more intently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://help.openai.com/en/articles/9617425-advanced-voice-mode-faq&#34;&gt;&amp;ldquo;Advanced&amp;rdquo; Voice Mode&lt;/a&gt;:&lt;/strong&gt;
Last year I used voice mode pretty consistently when I was commuting. This
was the original voice mode that was just a wrapper around STT-&amp;gt;LLM-&amp;gt;TTS. It
was great! I commute by bus now, so don&amp;rsquo;t have as much dead time to use
voice mode. Now I only use voice mode occasionally while running.
Additionally, OpenAI&amp;rsquo;s &amp;ldquo;advanced&amp;rdquo; voice mode was somewhat disappointing to
me: the &amp;ldquo;interrupt the AI&amp;rdquo; feature didn&amp;rsquo;t work reliably enough for me to
feel like it&amp;rsquo;s a step change improvement.
&lt;ul&gt;
&lt;li&gt;The lack of impact of advanced voice mode is curious.
&lt;a href=&#34;https://en.wikipedia.org/wiki/Her_(2013_film)&#34;&gt;Her&lt;/a&gt;-level proto-AGIs
that we can talk to now exist in the world, and mostly folks don&amp;rsquo;t care.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;things-id-like-to-see&#34;&gt;Things I&amp;rsquo;d Like To See&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Better Tools for Copiloting Writing:&lt;/strong&gt; I think the UX for writing using
LLMs can be significantly better than it is today. I don&amp;rsquo;t think anyone has
made a great GitHub Copilot-esque product for writing, likely because there
isn&amp;rsquo;t &amp;ldquo;one correct&amp;rdquo; path you go down when doing non-technical writing. I&amp;rsquo;m
excited by &lt;a href=&#34;https://github.com/socketteer/loom&#34;&gt;loom&lt;/a&gt;-like interfaces, which
allow you to traverse trees of text. Other existing tools today, like &amp;ldquo;take
this paragraph and make it more concise/formal/casual&amp;rdquo; just don&amp;rsquo;t have much
appeal to me. I really don&amp;rsquo;t tend to like the output of these systems.
Ideally, I want to be steering an LLM in my writing style and in the
direction of my flow of thoughts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Fast&lt;/em&gt; or &lt;em&gt;Reliable&lt;/em&gt; Browser / Computer-Use Agents:&lt;/strong&gt; The demos I&amp;rsquo;ve seen
for browser/computer use seem too slow now to be worth investing much in.
&lt;em&gt;However&lt;/em&gt;, I think there&amp;rsquo;s a ton of promise in them. I see two paths to
increasing utility: Either these agents get faster, or they get more
reliable. If faster, then they can be used more in human-in-the-loop
settings, where you can course correct them if they go off track. If more
reliable, then they can operate in the background on your behalf, when you
don&amp;rsquo;t care as much about end-to-end latency. I do think that someone will
crack a specialized model for very fast computer use within the next year.
All the building blocks are there for agents of noticeable economic utility;
it seems more like an engineering problem than an open research problem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better Long-term Management:&lt;/strong&gt; I was excited about
&lt;a href=&#34;https://openai.com/index/memory-and-new-controls-for-chatgpt/&#34;&gt;ChatGPT memory&lt;/a&gt;,
but this was also mostly disappointing. I have yet to have an &amp;ldquo;aha&amp;rdquo; moment
where I got nontrivial value out of ChatGPT having remembered something
about me. More often than not, it remembers weird, irrelevant, or
time-contingent facts that have no practical future utility. I&amp;rsquo;d really like
some system that does contextual compression on my conversations, finds out
the types of responses I tend to value, the types of topics I care about,
and uses that to improve model output on an ongoing basis. I&amp;rsquo;ve seen
some interesting experiments in this direction, but as far as I can tell no
one has quite solved this yet.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resources&#34;&gt;Resources&lt;/h2&gt;
&lt;p&gt;No change from mid-2024:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://thezvi.wordpress.com/&#34;&gt;Zvi Mowshowitz&lt;/a&gt;’s weekly AI posts are
excellent, and give an extremely verbose AI “state of the world”.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://simonwillison.net/&#34;&gt;Simon Willison&lt;/a&gt;’s blog is also an excellent
source for AI news.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.cognitiverevolution.ai/&#34;&gt;The Cognitive Revolution&lt;/a&gt; podcast
hosts some pretty good interviews that I find to be high-signal-to-noise,
and is much less hype-driven than many other AI-centric podcasts I’ve
attempted to listen to.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;New additions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Particularly good Twitter follows: &lt;a href=&#34;https://twitter.com/repligate&#34;&gt;Janus&lt;/a&gt;,
&lt;a href=&#34;https://x.com/natolambert&#34;&gt;Nathan Lambert&lt;/a&gt;,
&lt;a href=&#34;https://x.com/HamelHusain&#34;&gt;Hamel Husain&lt;/a&gt;,
&lt;a href=&#34;https://x.com/jeremyphoward&#34;&gt;Jeremy Howard&lt;/a&gt;,
&lt;a href=&#34;https://x.com/davidad&#34;&gt;davidad&lt;/a&gt;, &lt;a href=&#34;https://x.com/swyx&#34;&gt;swyx&lt;/a&gt;,
&lt;a href=&#34;https://x.com/emollick&#34;&gt;Ethan Mollick&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Periodic check-ins on &lt;a href=&#34;https://www.lesswrong.com/&#34;&gt;Lesswrong&lt;/a&gt; for more
technical discussion (esp. related to AI alignment and AGI implications), if
you&amp;rsquo;re so inclined&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Cover image by Recraft v3. As always, this post contains my own views and does
not represent the views of my employer.&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I had a discussion with a sharp engineer I look up to a few years ago, who
was convinced that the future would be humans writing tests and
specifications, and LLMs would handle all implementation. Now, I think we
won&amp;rsquo;t even need to necessarily write in-code tests, or low-level unit tests.
I&amp;rsquo;m now convinced that features can largely be described in English, with
some end-to-end acceptance tests specified by humans.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://openai.com/index/introducing-deep-research/&#34;&gt;Deep Research&lt;/a&gt; came
out while I was writing this post, and this might actually tip the scales for
me. More generally, I think the paradigm of ambient agentic background
compute will be a Big Deal soonish.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>AI Slop, Suspicion, and Writing Back</title>
        <link>https://benjamincongdon.me/blog/2025/01/25/AI-Slop-Suspicion-and-Writing-Back/</link>
        <pubDate>Sat, 25 Jan 2025 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2025/01/25/AI-Slop-Suspicion-and-Writing-Back/</guid>
        <description>&lt;p&gt;The impetus for this post was my recent realization that I&amp;rsquo;ve developed an
involuntary reflex for spotting AI-generated content. The tells are subtle now,
but (sadly? tellingly?) this sort of content is seemingly everywhere once you
start looking.&lt;/p&gt;
&lt;h2 id=&#34;the-rise-of-ai-slop&#34;&gt;The Rise of AI Slop&lt;/h2&gt;
&lt;p&gt;One bit of hipster cred I get to claim is that I followed Simon Willison before
he became the cool AI blogger.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; Perhaps one of the most &amp;ldquo;in the room&amp;rdquo; feels
I&amp;rsquo;ve had reading his work is to see his neologism of &amp;ldquo;AI slop&amp;rdquo; catch on.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;
Slop being defined as the equivalent of &amp;ldquo;spam&amp;rdquo;, but for AI-generated content.&lt;/p&gt;
&lt;p&gt;To put a finer point on it, I define &amp;ldquo;slop&amp;rdquo; as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Content that is mostly-or-completely AI-generated that is passed off as being
written by a human, regardless of quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For example, prompting Claude to write a blog post and publishing that verbatim
under your name would be &amp;ldquo;slop&amp;rdquo;, even if the writing is not bad at face value.&lt;/p&gt;
&lt;p&gt;GPT-3 era slop was pretty easy to detect. Early GPT-4 era slop was marginally
more convincing, but usually gave itself away after a paragraph or two. Sometime
in 2024, I think some critical line was crossed, at least for me, and certain
classes of slop now take at least &lt;em&gt;some&lt;/em&gt; critical thinking to recognize
as AI-generated. Which isn&amp;rsquo;t great!&lt;/p&gt;
&lt;h2 id=&#34;slop-in-the-wild&#34;&gt;Slop in the Wild&lt;/h2&gt;
&lt;p&gt;I noticed this first on LinkedIn, where I saw some suspiciously robotic posts
being made by a previous coworker. Too many emojis, bulleted lists, markdown
formatting literals that weren&amp;rsquo;t picked up in LinkedIn, etc. In retrospect, it&amp;rsquo;s
obvious slop. This is probably par-for-the-course on LinkedIn now, but at the
time it felt like a weird violation of the social contract. My respect for this
person was reduced by a nontrivial amount.&lt;/p&gt;
&lt;p&gt;To be clear, I fault no one for augmenting their writing with LLMs. I do it. A
lot now. It&amp;rsquo;s a great breaker of writers block. But I really do judge those who
copy/paste directly from an LLM into a human-space text arena. Sure, take
sentences &amp;ndash; even proto-paragraphs &amp;ndash; if the AI came up with something great.
But &lt;em&gt;surely&lt;/em&gt; there is something that needs to be changed from what came out of
the black box before you feel comfortable attaching your name to it. If you
don&amp;rsquo;t, I think that&amp;rsquo;s &lt;em&gt;slop&lt;/em&gt;-y.&lt;/p&gt;
&lt;p&gt;If you look around now, this sort of stuff is everywhere. There are so many
accounts on X that post 20-tweet long threads that are obvious slop. Take an
article and break it down into a thread, then dump that into the timeline? Slop.
Rando reply bots talking about how some tweet does a great job presenting a
multifaceted debate? Slop. Reddit has a ton of this too.&lt;/p&gt;
&lt;p&gt;The B2B SaaS companies are coming for this space too. From Zvi&amp;rsquo;s Jan 16 newsletter:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://x.com/mattparlmer/status/1878117171240517857&#34;&gt;Astral, an AI marketing AI agent&lt;/a&gt;.
It will navigate through the standard GUI websites like Reddit and soon TikTok
and Instagram, and generate &amp;lsquo;genuine interactions&amp;rsquo; across social websites to
promote your startup business, &lt;a href=&#34;https://t.co/cGHVHeVHP9&#34;&gt;in closed beta&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Matt Parlmer: At long last, we have created the dead internet from the
classic trope &amp;ldquo;dead internet theory.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://x.com/tracewoodgrains/status/1878016461551083809&#34;&gt;Tracing Woods&lt;/a&gt;:
There is such a barrier between business internet and the human internet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;On business internet, you can post &amp;ldquo;I&amp;rsquo;ve built a slot machine to degrade the
internet for personal gain&amp;rdquo; and get a bunch of replies saying, &amp;ldquo;Wow, cool! I
can&amp;rsquo;t wait to degrade the internet for personal gain.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;Amazing, as they say. Slop slop slop.&lt;/p&gt;
&lt;h2 id=&#34;slop-paranoia&#34;&gt;Slop Paranoia&lt;/h2&gt;
&lt;p&gt;And so in recent months, my brain has developed a subroutine that
continuously scans for sentence structure, word frequency, and formatting
indicative of LLM-generated content in otherwise natural-sounding prose,
leaving me to question the originality of what I&amp;rsquo;m reading. It&amp;rsquo;s rather
annoying, honestly. But this is a logical immune response to slop proliferation.&lt;/p&gt;
&lt;p&gt;I’ve definitely had false positives on this subconscious slop detector as well.
I was recently reading a post that I only later learned was published in 2017,
which read as formulaic and flat in a way that &lt;em&gt;felt&lt;/em&gt; LLM-generated. False
positives aren’t surprising: given that LLM generations hew towards the
preference of the “median human data annotator”, the revealed preference is
writing that looks similar to bland pre-AI content.&lt;/p&gt;
&lt;p&gt;Perhaps we&amp;rsquo;ll soon get better watermarking or detection for AI-generated
content. As an uninformed intuition, I think this may be possible with
completely unedited text (as the worst slop often is), but wouldn&amp;rsquo;t do well with
slop interspliced with &amp;ldquo;real&amp;rdquo; text (which, fine, I&amp;rsquo;d take that trade).&lt;/p&gt;
&lt;p&gt;I think to some extent, we&amp;rsquo;ll have to live with slop being out there. My hope is
that the returns to non-slop content stay high enough to keep human writing
valuable and worth pursuing. And to keep the &amp;ldquo;knowledge cutoffs&amp;rdquo; of base
models progressing into the future, we will have to keep feeding more recent
writing back into training sets, even if it is slop-contaminated. This
dynamic actually makes me optimistic about our ability to detect AI-generated
content at scale, since there will be an incentive to filter out low quality
content from training data.&lt;/p&gt;
&lt;p&gt;That does still leave a window open for influencing / adding to the training
sets of tomorrow.&lt;/p&gt;
&lt;h2 id=&#34;write-for-future-ais-aka-claude-knows-my-name&#34;&gt;Write for Future AIs, a.k.a. &amp;ldquo;Claude Knows My Name&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;A trend I&amp;rsquo;ve been seeing in the past month or two is more prolific writers (at
least, in the admittedly eclectic circle I follow) coming out and advocating for
&amp;ldquo;writing for the AIs&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://gwern.net/&#34;&gt;Gwern&lt;/a&gt;&lt;/strong&gt;, worth quoting in full:
(&lt;a href=&#34;https://www.dwarkeshpatel.com/p/gwern-branwen&#34;&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By writing, you are voting on the future of the Shoggoth using one of the few
currencies it acknowledges: tokens it has to predict. If you aren&amp;rsquo;t writing,
you are abdicating the future or your role in it. If you think it&amp;rsquo;s enough to
just be a good citizen, to vote for your favorite politician, to pick up
litter and recycle, the future doesn&amp;rsquo;t care about you.&lt;/p&gt;
&lt;p&gt;There are ways to influence the Shoggoth more, but not many. If you don&amp;rsquo;t
already occupy a handful of key roles or work at a frontier lab, your
influence rounds off to 0, far more than ever before. If there are values you
have which are not expressed yet in text, if there are things you like or
want, if they aren&amp;rsquo;t reflected online, then to the AI they don&amp;rsquo;t exist. That
is dangerously close to won&amp;rsquo;t exist.&lt;/p&gt;
&lt;p&gt;But yes, you are also creating a sort of immortality for yourself personally.
You aren&amp;rsquo;t just creating a persona, you are creating your future self too.
What self are you showing the LLMs, and how will they treat you in the future?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Tyler Cowen:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You’re an idiot if you’re not writing for the AIs. They’re a big part of your
audience, and their purchasing power, we’ll see, but over time it will
accumulate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of my friends was impressed recently that Claude knew my name and basic
facts about me, because I&amp;rsquo;ve written a decent amount online which has
undoubtedly been slurped up into a pretraining dataset. While the &amp;ldquo;AI knows me&amp;rdquo;
party trick feels like a mid-2020s version of Googling oneself, I think there
is value in mildly influencing the weights of the
&lt;a href=&#34;https://knowyourmeme.com/memes/shoggoth-with-smiley-face-artificial-intelligence&#34;&gt;shoggoth&lt;/a&gt;
by putting more of your (non-AI-assisted) thoughts out there.&lt;/p&gt;
&lt;p&gt;I haven&amp;rsquo;t done much strategizing for how to write for future AIs, but the simple
strategy of write a lot, consistently, with a unique voice seems like a good
start. This advice is relatively similar to good &amp;ldquo;for human&amp;rdquo; writing, with a
marked shift towards distinctiveness: coin terms, deliberately reuse rhetorical
devices, and describe arguments in a way that&amp;rsquo;s sticky enough to stand out in
the rest of the latent space of internet text. These patterns compound over
years, turning individual quirks into patterns which can be elicited from future
LLMs.&lt;/p&gt;
&lt;p&gt;As an illustrative example, Patrick McKenzie popularized the “Dangerous
Professional” tone: the type of tone that a lawyer or otherwise &amp;ldquo;well
informed&amp;rdquo; person would use in a complaint to a business, with the type of
verbiage that would attract the attention of another lawyer or Serious Business
Person. By writing extensively about the notion of a &amp;ldquo;Dangerous Professional&amp;rdquo;,
Patrick evidently created a meme that has been picked up by LLMs and is a
shorthand incantation for a certain type of stern, no-BS professional style.
This has real utility:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I remain extremely pleased that people keep reporting to my inbox that &amp;ldquo;Write
a letter in the style of patio11&amp;rsquo;s Dangerous Professional&amp;rdquo; keeps actually
working against real problems with banks, credit card companies, and so on.&lt;/p&gt;
&lt;p&gt;It feels like magic.&lt;/p&gt;
&lt;p&gt;(&lt;a href=&#34;https://thezvi.substack.com/p/ai-94-not-now-google&#34;&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Writing intentionally memetic content does seem to have leverage, if you have
sufficient distribution to spread the meme widely enough to be robustly picked
up by future LLMs.&lt;/p&gt;
&lt;h2 id=&#34;finding-a-path-forward&#34;&gt;Finding a Path Forward&lt;/h2&gt;
&lt;p&gt;Undoubtedly, the sloppification of the internet will get worse over the
next few years. And as such, the returns to curating quality sources of content
will only increase. My advice? Use an RSS feed reader, read Twitter lists
instead of feeds, and find spaces where real discussion still happens (e.g.
LessWrong and Lobsters still both seem slop-free).&lt;/p&gt;
&lt;p&gt;Personally, the approach I&amp;rsquo;ll continue to take with my writing is
straightforward: write with a recognizably &amp;ldquo;me&amp;rdquo; voice, edit thoroughly &amp;ndash;
especially if using AI assistance &amp;ndash; to not compromise on quality, and be honest
about AI usage&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;. Write a lot, too. It might be useful in the future.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover image by Recraft v3&lt;/em&gt;&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;I know he originally (?) became popular for his work on Django, but I
followed him for quite a while due to his Datasette project, which I still
think is awesome software.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;Per his
&lt;a href=&#34;https://simonwillison.net/2024/May/8/slop/?utm_source=chatgpt.com&#34;&gt;original blog post&lt;/a&gt;,
it actually appears Simon didn&amp;rsquo;t originate this neologism but rather was an
early proponent. I still credit him significantly for it catching on. The
tweet he links to was from a long-time TPOT-er. Small world.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34;&gt;
&lt;p&gt;I did get some input from Claude &amp;amp; Gemini on various sections, but none of
this post is directly written by either.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
        <title>Chain of Continuous Thoughts</title>
        <link>https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/</link>
        <pubDate>Sat, 14 Dec 2024 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/</guid>
        <description>&lt;p&gt;Recent advances in LLMs have demonstrated increasingly powerful reasoning
capabilities, primarily through eliciting chain-of-thought outputs from models.
While these methods have proven effective, they rely on discrete, tokenized
representations of reasoning steps. A recent research paper from Meta introduces
a novel approach that steps away from this paradigm: reasoning in continuous
latent space rather than through explicit language tokens.&lt;/p&gt;
&lt;p&gt;Recently, I&amp;rsquo;ve been quite interested in the increasingly popular trend of
reasoning LLMs. So when I saw
&lt;a href=&#34;https://arxiv.org/abs/2412.06769&#34;&gt;&amp;ldquo;Training Large Language Models to Reason in a Continuous Latent Space&amp;rdquo;&lt;/a&gt;
in Meta FAIR&amp;rsquo;s recent paper flurry, I was intrigued. This was an interesting
read which made me reflect on how LLMs use chain-of-thought (CoT). For our
purposes, we&amp;rsquo;ll define CoT as the pattern of having an LLM generate step-by-step
reasoning text before producing an answer. CoT has become a basic strategy to
improve the model’s ability to perform well on harder reasoning tasks. This
largely started with the 2022
&lt;a href=&#34;https://arxiv.org/abs/2205.11916&#34;&gt;&amp;ldquo;Large Language Models are Zero-Shot Reasoners&amp;rdquo;&lt;/a&gt;
paper, which found improvements in model outputs simply by adding &amp;ldquo;Let&amp;rsquo;s think
step by step&amp;rdquo; to prompts.&lt;/p&gt;
&lt;h2 id=&#34;coconut-approach&#34;&gt;COCONUT Approach&lt;/h2&gt;
&lt;p&gt;The &amp;ldquo;Continuous Latent Space&amp;rdquo; paper introduces a new
&lt;strong&gt;c&lt;/strong&gt;hain-&lt;strong&gt;o&lt;/strong&gt;f-&lt;strong&gt;con&lt;/strong&gt;tin&lt;strong&gt;u&lt;/strong&gt;ous-&lt;strong&gt;t&lt;/strong&gt;hought (&amp;ldquo;COCONUT&amp;rdquo;) method for training
reasoning behavior into models. Instead of having the model represent the
chain-of-thought steps in text &amp;ndash; or symbols that are ultimately mapped to text
&amp;ndash; they encoded the steps in a continuous latent vector, where a state from each
step is fed into the model at subsequent time steps as the new input. By doing
so, the system is able to use the LLM in a non-linguistic, continuous space for
reasoning. The paper provides evidence of something which, upon reflection, is
quite intuitive: reasoning can happen outside of language, and can be more
powerfully expressed in non-linguistic symbols.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/coconut.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/coconut.png&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;&lt;a href=&#34;https://arxiv.org/pdf/2412.06769&#34;&gt;Arxiv&lt;/a&gt;&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;
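The core mechanic can be sketched in a few lines. This is a toy illustration of the idea rather than the paper's actual implementation; the names and the tiny stand-in "transformer" below are my own invention. The contrast is the control flow: a normal CoT step collapses the hidden state to a single discrete token before continuing, while a COCONUT-style step feeds the hidden state straight back in as the next input embedding.

```python
# Toy sketch of the COCONUT idea (illustrative only; not the paper's code).
import math
import random

random.seed(0)
D_MODEL, VOCAB = 8, 16
W_EMBED = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
W_HIDDEN = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
W_UNEMBED = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(D_MODEL)]

def matvec(vec, mat):
    """Row-vector times matrix: len(vec) must equal len(mat)."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

def forward(x):
    """Stand-in for a transformer forward pass: input embedding -> hidden state."""
    return [math.tanh(h / math.sqrt(D_MODEL)) for h in matvec(x, W_HIDDEN)]

def cot_step(token_id):
    """Ordinary CoT: collapse the hidden state to one discrete token."""
    hidden = forward(W_EMBED[token_id])
    logits = matvec(hidden, W_UNEMBED)
    return max(range(VOCAB), key=logits.__getitem__)

def coconut_steps(token_id, n_latent=3):
    """COCONUT-style: run n_latent steps purely in latent space, feeding
    each hidden state back in as the next input, decoding only at the end."""
    x = W_EMBED[token_id]
    for _ in range(n_latent):
        x = forward(x)  # hidden state reused directly as the next input
    logits = matvec(x, W_UNEMBED)
    return max(range(VOCAB), key=logits.__getitem__)
```

The point of the sketch is that the latent loop never round-trips through the vocabulary: intermediate "thoughts" keep the full continuous state rather than a single argmax token.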

&lt;p&gt;In some ways this isn&amp;rsquo;t surprising. If we recall the idea that
&lt;a href=&#34;https://arxiv.org/abs/2309.10668&#34;&gt;&amp;ldquo;Language Modeling is Compression&amp;rdquo;&lt;/a&gt;, it&amp;rsquo;s not
shocking that there are more information theoretically concise ways to express
reasoning steps than with language tokens. If we think of a typical CoT, much of
it looks something like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Question: If a train leaves City A at 9:00 AM traveling at 60 mph and another
train leaves City B at 10:00 AM traveling at 80 mph toward City A, how far
from City A will the trains meet if the distance between the cities is 300
miles?&lt;/p&gt;
&lt;p&gt;Reasoning:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Okay, let’s see. Train A starts at 9:00 AM and travels at 60 mph. By 10:00
AM, when Train B leaves, Train A has traveled… oh wait, hold on, that’s not
right. I should first calculate the time difference.&lt;/li&gt;
&lt;li&gt;Ah, right! In that first hour, Train A travels $60 \text{ miles}$, since
$60 \text{ mph} \times 1 \text{ hour} = 60 \text{ miles}$.&lt;/li&gt;
&lt;li&gt;So now the distance between them is $300 - 60 = 240 \text{ miles}$. Okay,
got it so far. Now both trains are moving toward each other at a combined
rate of… wait, did I add that correctly?&lt;/li&gt;
&lt;li&gt;The speeds are $60 + 80 = 140 \text{ mph}$. Yeah, that’s right—combined
rate of $140 \text{ mph}$. Sorry about the confusion earlier!&lt;/li&gt;
&lt;li&gt;At this combined rate, the time for them to meet would be
$\frac{240}{140}$. Let me simplify that… uh, no, wait! Did I subtract the
initial distance correctly earlier? Yeah, I did, never mind.&lt;/li&gt;
&lt;li&gt;Simplifying $\frac{240}{140}$, we get… oh no, wait! That’s $24 / 14$… uh,
no, $12 / 7$! Yeah, so $\frac{12}{7} \text{ hours}$.&lt;/li&gt;
&lt;li&gt;Okay, converting that to hours and minutes… hold on, that feels weird.
$\frac{12}{7} \times 60$ … actually, let’s just approximate this. It’s
about $1.71 \text{ hours}$.&lt;/li&gt;
&lt;li&gt;Wait, no, not approximate! Stick with $\frac{12}{7}$ hours. So, the trains
meet $\frac{12}{7} \text{ hours}$ after 10:00 AM. That’s, uh, around… no,
hold on! $10 + 1.71 = 11:43 \text{ AM}$. Phew.&lt;/li&gt;
&lt;li&gt;Distance-wise, Train A has traveled
$60 \text{ mph} \times 2.43 \text{ hours}$ … wait, did I just add the hours
wrong again? Oh, right. No, it’s fine this time!&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Just at a surface level, much of the CoT consists of phrases like &amp;ldquo;Oh wait,
hold on, that&amp;rsquo;s not right&amp;rdquo; and &amp;ldquo;hold on, that feels weird&amp;rdquo;, etc. These
reflections are useful when reasoning within the linguistic space because they
emulate human step-by-step reasoning, and encode the ability to self-correct and
backtrack when a particular reasoning path fails. However, they also add
significant noise and redundancy. Training on CoT traces allows the model to
develop these patterns of self-correction, but in a fairly verbose way.&lt;/p&gt;
&lt;h2 id=&#34;coconut-training&#34;&gt;COCONUT Training&lt;/h2&gt;
&lt;p&gt;The authors train this ability to use a latent space representation for
reasoning by first training the model on textual CoT traces, then iteratively
replacing parts of the CoT with continuous thought states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As shown in Figure 2, in the initial stage, the model is trained on regular
CoT instances. In the subsequent stages, at the $k$-th stage, the first $k$
reasoning steps in the CoT are replaced with $k × c$ continuous thoughts,
where $c$ is a hyperparameter controlling the number of latent thoughts
replacing a single language reasoning step.&lt;/p&gt;
&lt;/blockquote&gt;
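&lt;p&gt;The staging in this quote can be sketched in a few lines. The snippet below is
my own illustrative reconstruction, not the paper&amp;rsquo;s code: &lt;code&gt;[bot]&lt;/code&gt; and
&lt;code&gt;[eot]&lt;/code&gt; stand in for the special tokens that delimit the latent phase, and
&lt;code&gt;[thought]&lt;/code&gt; stands in for a slot that would hold a continuous hidden state
rather than a discrete token:&lt;/p&gt;

```python
# Illustrative reconstruction of COCONUT's staged curriculum (my own sketch,
# not the paper's code). At stage k, the first k language reasoning steps are
# replaced by k * c latent "continuous thought" slots.

LATENT = "[thought]"  # placeholder for a continuous hidden-state slot

def stage_sequence(question, steps, answer, k, c=2):
    """Build the training sequence for curriculum stage k."""
    latent = [LATENT] * (k * c)               # k steps -> k * c latent slots
    remaining = list(steps[k:])               # steps not yet absorbed
    return [question, "[bot]"] + latent + ["[eot]"] + remaining + [answer]
```

&lt;p&gt;At stage 0 this degenerates to the ordinary textual CoT sequence; by the final
stage, every language reasoning step has been absorbed into latent slots.&lt;/p&gt;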

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/coconut_training.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/coconut_training.png&#34; alt=&#34;Figure 2: Training procedure of Chain of Continuous Thought&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;Figure 2: Training procedure of Chain of Continuous Thought (&lt;a href=&#34;https://arxiv.org/pdf/2412.06769&#34;&gt;Arxiv&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This means the model has two modes: a linguistic mode and a reasoning mode,
which are discrete and continuous, respectively. The authors note that in
open-ended prompting, it becomes challenging to know when to switch between
these modes. They propose two approaches:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;a) train a binary classifier on latent thoughts to enable the model to
autonomously decide when to terminate the latent reasoning, or b) always pad
the latent thoughts to a constant length. We found that both approaches work
comparably well. Therefore, we use the second option in our experiment for
simplicity, unless specified otherwise.&lt;/p&gt;
&lt;/blockquote&gt;
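&lt;p&gt;The constant-length option (b) makes the inference loop easy to sketch. The
toy below is entirely my own and self-contained (a real implementation would
operate on transformer hidden states): during the latent phase, the model&amp;rsquo;s
last hidden state is appended back into the context as the next input instead
of being decoded into a token, and then ordinary decoding resumes:&lt;/p&gt;

```python
# Self-contained toy of COCONUT-style decoding (the "model" and all names are
# made up for illustration). Latent phase: feed the final hidden state back as
# the next input. Language phase: ordinary greedy token-by-token decoding.

def toy_model(context):
    """Stand-in for a transformer: returns (argmax token id, last hidden state)."""
    h = sum(context) % 7
    return int(h) % 3, float(h)

def decode(prompt_ids, n_latent, n_tokens):
    context = list(prompt_ids)
    for _ in range(n_latent):          # latent phase: recycle hidden states
        _, hidden = toy_model(context)
        context.append(hidden)         # hidden state re-enters the context
    out = []
    for _ in range(n_tokens):          # language phase: decode tokens
        tok, _ = toy_model(context)
        out.append(tok)
        context.append(tok)
    return out
```

&lt;p&gt;The key structural point is that the latent loop never collapses the hidden
state to a token, which is exactly where the richer continuous representation
comes from.&lt;/p&gt;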
&lt;p&gt;While the second approach does seem simpler from an implementation perspective,
the binary classifier approach seems more robust to me. I find the idea of a
model that could autonomously decide when to apply more or less reasoning to a
problem fascinating. It&amp;rsquo;s a bit of a reach from the results in this paper,
but if such an autonomous model were trained successfully, that would indicate
to me a new level of metacognition that we&amp;rsquo;ve only recently seen hints of from
frontier models.&lt;/p&gt;
&lt;h2 id=&#34;advantages-of-continuous-representation&#34;&gt;Advantages of Continuous Representation&lt;/h2&gt;
&lt;p&gt;My understanding is that transitioning the reasoning process to a continuous
space allows the model to learn more concise representations of the reasoning
process. Reasoning in a continuous space allows for richer representations of
the reasoning &amp;ndash; since the intermediate steps don&amp;rsquo;t need to be collapsed back
into discrete tokens &amp;ndash; and these continuous representations can themselves be
learned.&lt;/p&gt;
&lt;p&gt;The idea of encoding latent thoughts into a continuous vector isn&amp;rsquo;t exactly new.
I’ve seen papers exploring the use of &amp;ldquo;pause&amp;rdquo; or &amp;ldquo;&amp;hellip;&amp;rdquo; tokens in reasoning
steps, as a way of having the LLM perform computation for &amp;ldquo;real&amp;rdquo; before
producing tokens for its output chain (e.g.
&lt;a href=&#34;https://arxiv.org/abs/2310.02226&#34;&gt;this paper&lt;/a&gt;). COCONUT builds off this in
making CoT reasoning explicit, then further generalizes these token &amp;ldquo;pauses&amp;rdquo;
into a system wherein the representation of &amp;ldquo;thought&amp;rdquo; is not done in discrete
tokens.&lt;/p&gt;
&lt;p&gt;The primary result of COCONUT is that using the latent space for reasoning
significantly improves the model&amp;rsquo;s ability to reason. The authors found that
not only could latent space reasoning outperform discrete CoT, but also that
the model learned improved planning abilities.&lt;/p&gt;
&lt;h2 id=&#34;implications-for-interpretability&#34;&gt;Implications for Interpretability&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s not a costless improvement though: one valuable aspect of
chain-of-thought traces is the ability to interpret how a model arrived at its
answer by just inspecting the raw model output. As an interpretability measure,
having the CoT in pure text tokens is quite useful. There are concerns that the
CoT may be unfaithful to the actual reasoning used by the model: for example,
the model might learn to steganographically hide reasoning that is distinct
from the face-value reasoning of the outputted CoT. However, as a baseline
prior, it definitely &lt;em&gt;seems&lt;/em&gt; like the CoT offers some insight into the reasoning
steps of the model.&lt;/p&gt;
&lt;p&gt;With continuous CoT, the internal reasoning chain is inherently
uninterpretable to a human reader. This appears to be a tradeoff that you have
to make: if the system operates in a continuous state space during reasoning,
it will be very challenging to map that latent space back to interpretable
words.&lt;/p&gt;
&lt;h2 id=&#34;reading-between-o1s-lines&#34;&gt;Reading Between o1&amp;rsquo;s Lines&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Epistemic status: Uninformed speculation.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In the recent o1 technical report from OpenAI, red-teamers (such as
&lt;a href=&#34;https://www.apolloresearch.ai/&#34;&gt;Apollo Research&lt;/a&gt;) were not given access to the
raw CoT traces. These are also not available to ChatGPT/API users. Previously
there seemed to be policy reasons for this: OpenAI did not want to release CoT
traces because they could be used to train reasoning models, which would erode
OpenAI&amp;rsquo;s competitive edge. However, from an
&lt;a href=&#34;https://www.youtube.com/watch?v=pB3gvX-GOqU&#34;&gt;interview&lt;/a&gt; with Apollo Research&amp;rsquo;s
Alexander Meinke, it sounds like with the full o1 model there were &lt;em&gt;technical&lt;/em&gt;
reasons why the full CoT could not be released to red-teamers.&lt;/p&gt;
&lt;p&gt;If we&amp;rsquo;re somewhat conspiratorial, this could be because o1 uses non-discrete
CoT, at least partially. Going over the o1 system report, there&amp;rsquo;s not much
evidence for this, however. This quote, in particular, seems pretty solid
evidence against non-discrete CoT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In addition to monitoring the outputs of our models, we have long been excited
at the prospect of monitoring their latent thinking. Until now, that latent
thinking has only been available in the form of activations — large blocks of
illegible numbers from which we have only been able to extract simple
concepts. Chains-of-thought are far more legible by default and could allow us
to monitor our models for far more complex behavior (if they accurately
reflect the model’s thinking, an open research question).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, 🤷‍♂️ :shrug: &amp;hellip;
&lt;a href=&#34;https://www.lesswrong.com/posts/byNYzsfFmb2TpYFPW/o1-a-technical-primer&#34;&gt;This Lesswrong post&lt;/a&gt;
has some additional speculation on what o1 is doing under-the-hood. As a lay
person, none of these seem like hard blockers for the unavailability of the CoT.
If OpenAI is using the CoT traces for safety filtering, as they say in the
system report, it seems strange that these would be unavailable to release in an
external form for any reason other than they weren&amp;rsquo;t actually human legible.&lt;/p&gt;
&lt;p&gt;We already know that OpenAI has a summarization model which summarizes and
(partially) obfuscates the raw CoT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We surface CoT summaries to users in ChatGPT. &amp;hellip; We leverage the same
summarizer model being used for o1-preview and o1-mini for the initial o1
launch. &amp;hellip; We trained the summarizer model away from producing disallowed
content in these summaries. We find the model has strong performance here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;However, the statement that they used the same summarization model for o1 as
o1-preview and o1-mini makes me update slightly against the possibility of
non-discrete reasoning tokens. As does the use of English descriptions for the
o1 CoT in the system report.&lt;/p&gt;
&lt;p&gt;The recent release of
&lt;a href=&#34;https://qwenlm.github.io/blog/qwq-32b-preview/&#34;&gt;Qwen&amp;rsquo;s QwQ reasoning model&lt;/a&gt; &amp;ndash;
which uses entirely discrete reasoning &amp;ndash; also shows that it&amp;rsquo;s definitely
possible for a discrete model to have strong performance.&lt;/p&gt;
&lt;p&gt;Overall, I think it&amp;rsquo;s unlikely that o1 uses something like COCONUT &amp;ndash; I&amp;rsquo;d
predict somewhere in the neighborhood of a 20% likelihood.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;So, while traditional, text-based CoT reasoning has become a quite common
technique for eliciting improved reasoning performance out of LLMs, the rise of
explicit &amp;ldquo;reasoning models&amp;rdquo; changes this landscape significantly &amp;ndash; both in
terms of model structure, and the type of prompting that is needed for optimal
model performance. COCONUT and other new ideas for sampling models suggest that
there is a plethora of low-hanging fruit for extending LLMs to be better
reasoners. If latent space reasoning does offer a nontrivial improvement in
performance, as the COCONUT paper suggests, I suspect that this technique will
be baked into frontier models &amp;ndash; even at a hit to interpretability.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/footer.webp&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/footer.webp&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;em&gt;Cover &amp;amp; Footer images by Recraft v3&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>Lake Union&#39;s Lonely Trolley: SLU Streetcar Ridership</title>
        <link>https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/</link>
        <pubDate>Sat, 12 Oct 2024 00:00:00 -0800</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/</guid>
        <description>&lt;p&gt;I lived in the Eastlake neighborhood of Seattle for several years. Eastlake, by
its name, sits on the east side of Lake Union. As a runner, I spent many
mornings running along the lake, passing by the
&lt;a href=&#34;https://en.wikipedia.org/wiki/South_Lake_Union_Streetcar&#34;&gt;South Lake Union Streetcar&lt;/a&gt;.
Each time I ran past the streetcar, what consistently struck me as odd was that
the streetcars were &lt;em&gt;almost always empty&lt;/em&gt;. I&amp;rsquo;d see maybe one or two people
riding it. I lived within a couple blocks of the streetcar line for years, and
&lt;em&gt;never&lt;/em&gt; rode it a single time.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/slu_streetcar.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/slu_streetcar.png&#34; alt=&#34;SLU Streetcar&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;SLU Streetcar (&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Seattle_Streetcar_301_leaving_Pacific_Place_Station.jpg&#34;&gt;Wikipedia&lt;/a&gt;)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Out of curiosity, I filed a Freedom of Information Act request for streetcar
ridership data last year. I had all but forgotten about the data I received
back, but was recently reminded when I read reporting that the SLU streetcar
had to close for several weeks due to an electrical issue.&lt;/p&gt;
&lt;h2 id=&#34;ridership-data&#34;&gt;Ridership Data&lt;/h2&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/ridership_over_time.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/ridership_over_time.png&#34; alt=&#34;SLU Streetcar Average Weekly Ridership (2020-2023)&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;SLU Streetcar Average Weekly Ridership (2020-2023)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;First, I plotted the average weekly ridership for the streetcar for the roughly
three years of data the city gave me. The most obvious feature is the dip in
ridership in mid-2020. Ridership crept back up over time. However, even at its
peak in the summer of 2023, ridership was still significantly below its
pre-pandemic high.&lt;/p&gt;
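&lt;p&gt;For the curious, the aggregation behind charts like these is simple. The
sketch below is hypothetical: the shape of the FOIA export (ISO date strings
paired with daily boarding counts) is my assumption, and this isn&amp;rsquo;t the code
behind the actual plots:&lt;/p&gt;

```python
# Hypothetical sketch of the day-of-week aggregation (the export's layout is
# an assumption; this is not the code behind the actual charts).
from collections import defaultdict
from datetime import date
from statistics import mean

def ridership_by_weekday(rows):
    """rows: iterable of (ISO date string, daily boardings) pairs."""
    buckets = defaultdict(list)
    for day, boardings in rows:
        # Group each day's boardings under its weekday name.
        buckets[date.fromisoformat(day).strftime("%A")].append(boardings)
    return {weekday: mean(counts) for weekday, counts in buckets.items()}
```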
&lt;p&gt;This raises the question: Who is this built for? It&amp;rsquo;s not clear if the streetcar
is supposed to be a tourist transport system (à la Seattle Monorail), or for
residents to commute. The day-of-week ridership numbers seem to suggest it &lt;em&gt;is&lt;/em&gt;
more of a commuter line:&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/day_of_week_ridership.png&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2024/10/12/Lake-Unions-Lonely-Trolley-SLU-Streetcar-Ridership/day_of_week_ridership.png&#34; alt=&#34;SLU Average Ridership by Week Day (2020-2023)&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;p&gt;SLU Average Ridership by Week Day (2020-2023)&lt;/p&gt;
    &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Even though there&amp;rsquo;s been a &amp;ldquo;return to office&amp;rdquo; push, it doesn&amp;rsquo;t seem like many
people are using the streetcar. There are some big employers nearby, notably
Fred Hutch and Amazon. Anecdotally, I&amp;rsquo;ve heard that Fred Hutch is continuing to
let people work from home unless they absolutely have to be in the office. Maybe
things will change when
&lt;a href=&#34;https://www.seattletimes.com/business/amazon-workers-will-return-to-the-office-five-days-a-week/&#34;&gt;Amazon requires people to come in 5 days a week in 2025&lt;/a&gt;,
but until then, I doubt ridership will match what it was before the pandemic.&lt;/p&gt;
&lt;p&gt;The streetcar isn&amp;rsquo;t cheap to maintain: The Seattle Times reports that it costs
&lt;a href=&#34;https://www.seattletimes.com/seattle-news/transportation/south-lake-union-streetcars-shut-down-for-many-weeks/&#34;&gt;$4.6 million per year&lt;/a&gt;
to operate and maintain. When the streetcar had to close for a couple of weeks
in September, I genuinely thought that the city might just close the line
indefinitely. But weeks later, service restarted. And, as I watched one of its
cars trundle by yesterday at the southern tip of Lake Union, ridership still
appeared low.&lt;/p&gt;
&lt;p&gt;I largely agree with
&lt;a href=&#34;https://www.seattlebikeblog.com/2024/08/20/seattle-decided-9-years-ago-to-kill-the-slu-streetcar/&#34;&gt;this sentiment from the Seattle Bike Blog&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The First Hill line seems to be filling an actual transportation need while
the SLU line does not.&lt;/p&gt;
&lt;p&gt;Keeping the SLU line alive is a classic case of Seattle indecision. It’s
connected to the city’s years of indecision about the downtown streetcar
project, which remains stalled due to a $93 million budget gap. Worse,
indecision like this can be very damaging to a community because streetcar
supporters have reason to keep fighting for it so long as it seems that
there’s still a chance. I don’t blame them because the vision of a
European-style network of streetcars is genuinely appealing and seems like a
vision worth fighting for. But even if the city built the downtown streetcar,
there are no plans whatsoever to expand the network any further. We’d still
just have one oddly-shaped line for the foreseeable future.&lt;/p&gt;
&lt;p&gt;&amp;hellip;&lt;/p&gt;
&lt;p&gt;The streetcar needs to go big or go home, and Seattle has firmly decided not
to go big.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Cover image: Lake Union &amp;amp; Seattle Skyline as viewed from Gas Works Park&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>TaskWarrior</title>
        <link>https://benjamincongdon.me/blog/2024/08/31/TaskWarrior/</link>
        <pubDate>Sat, 31 Aug 2024 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2024/08/31/TaskWarrior/</guid>
        <description>&lt;p&gt;I haven&amp;rsquo;t been writing much recently (&lt;em&gt;sound of crickets coming from this year&amp;rsquo;s
blog archive&lt;/em&gt;), but this is such an OnBrand™ post that I couldn&amp;rsquo;t not write it.
At work, I&amp;rsquo;ve been shifting into more of a TL role, and as such I&amp;rsquo;ve been
tracking an increasingly large number of streams of information. We use JIRA for
bug/feature level work, but a lot of the stuff that I need to track is more
micro-level: Slack threads to respond to, docs to review, reminders to ping
people, etc.&lt;/p&gt;
&lt;p&gt;While I was at Google, and for the first ~year at Databricks, I used my Gmail
inbox primarily as a todo list for micro tasks. I would keep emails unread as
a reminder to respond to them, and would use the snooze feature as a reminder
system. This worked well when my primary interrupts were code reviews and doc
comments. But this system didn&amp;rsquo;t work with Slack, and the toil of maintaining a
sane inbox front-page was taking too much effort.&lt;/p&gt;
&lt;p&gt;Fast forward to a year ago, and I transitioned mostly to using Slack reminders
as my todo list. Slack has a &amp;ldquo;remind me about this&amp;rdquo; feature that was (and still
is) super useful. In retrospect, I think of this now less as a &amp;ldquo;wow, this is a
great feature for productivity&amp;rdquo; and more as a &amp;ldquo;wow, Slack is so poor at
resurfacing old threads that I need to use a reminder system to keep track of
things&amp;rdquo;. But it worked well enough for a while.&lt;/p&gt;
&lt;p&gt;As I started needing to keep track of more, Slack and Gmail both fell over for
me. I fell back to more manual approaches for tracking: first, Apple Notes, then
a Google Doc. Both Apple Notes and Google Docs have native checkboxes, which
made them reasonably nice to use as a todo list. I know some people swear by a
long-running doc/note for tracking work, but I ultimately found it too manual to
keep up with. I&amp;rsquo;d try to have a new section per day, and move uncompleted tasks
to the new day as time went on. But it was too easy to forget to clean up
old/irrelevant tasks, and I&amp;rsquo;d end up with a bunch of stale tasks that I
realistically would never get around to.&lt;/p&gt;
&lt;p&gt;I also made a basic Eisenhower Matrix emoji prefix system for tasks, which was
helpful for prioritizing tasks, but ultimately didn&amp;rsquo;t help as I had to manually
rearrange/filter things to keep what was most important at the top of the list.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Urgent and Important: 🔥&lt;/p&gt;
&lt;p&gt;Not Urgent but Important: 🌟&lt;/p&gt;
&lt;p&gt;Urgent but Not Important: ⚡️&lt;/p&gt;
&lt;p&gt;Not Urgent and Not Important: 💤&lt;/p&gt;
&lt;p&gt;Won&amp;rsquo;t do: ❌&lt;/p&gt;
&lt;p&gt;Top Priority Task: 🥇&lt;/p&gt;
&lt;p&gt;Randomization: 🫨&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And so recently, I switched to a system that I intuitively feel will be
stickier: &lt;a href=&#34;https://taskwarrior.org/&#34;&gt;Taskwarrior&lt;/a&gt; (though, TBD since each of
these systems seems to have a 2-6 month lifecycle before falling over).
Taskwarrior is a CLI-based task tracker that has a lot of features that I&amp;rsquo;ve
been missing in my previous systems:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prioritization&lt;/strong&gt;: It has a built-in prioritization system, and automatically
sorts tasks by priority. It&amp;rsquo;s more sophisticated than a &amp;ldquo;Low/Medium/High&amp;rdquo;
priority system as well, as the priority is calculated based on a number of
factors (due date, urgency, importance, dependencies, etc). I really appreciate
that I can enter a bunch of tasks, and I get a sanely sorted list with the most
important tasks at the top. I also appreciate that if I work-crastinate on a
non-urgent task, Taskwarrior tells me &amp;ldquo;You have more urgent tasks&amp;rdquo;.&lt;/p&gt;
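&lt;p&gt;As a rough illustration of what &amp;ldquo;calculated priority&amp;rdquo; means here:
Taskwarrior computes an urgency score as a weighted sum over task attributes.
The toy below is my own, with made-up weights; Taskwarrior&amp;rsquo;s real coefficients
are documented and user-configurable:&lt;/p&gt;

```python
# Toy Taskwarrior-style urgency score: a weighted sum of task attributes.
# The weights are invented for illustration, not Taskwarrior's actual values.
from datetime import date

def urgency(task, today):
    score = 0.0
    if task.get("priority") == "H":
        score += 6.0                                # high priority bumps urgency
    if "due" in task:
        days_left = (task["due"] - today).days
        score += max(0.0, 12.0 - days_left)         # nearer due dates score higher
    score += 4.0 * len(task.get("blocks", ()))      # tasks blocking others rise
    return score
```

&lt;p&gt;Sorting tasks by this score, descending, gives the &amp;ldquo;most important at the
top&amp;rdquo; report described above.&lt;/p&gt;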
&lt;p&gt;&lt;strong&gt;Ease of task creation&lt;/strong&gt;: It&amp;rsquo;s super easy to add tasks. I&amp;rsquo;ve aliased &lt;code&gt;task&lt;/code&gt; to
&lt;code&gt;t&lt;/code&gt;, so I can add a task with &lt;code&gt;t add &amp;lt;task&amp;gt;&lt;/code&gt;. I can also add tags, due dates,
etc. inline when creating a task. I always have a terminal window open on my
work laptop, so this ends up (surprisingly) being a lot faster than inputting
something into Google Docs or Apple Notes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dependency Tracking&lt;/strong&gt;: It has a built-in dependency and &amp;ldquo;waiting&amp;rdquo; system. I
can mark a task as &amp;ldquo;waiting&amp;rdquo; until a particular time, and it won&amp;rsquo;t show up in my
list until that time. Similarly I can mark a task as dependent on another task,
and it won&amp;rsquo;t show up in my list until the dependent task is completed. &amp;ndash; Tasks
that have dependencies automatically also get a bump in priority, which is nice.
Tracking all of this manually would be a huge pain.&lt;/p&gt;
&lt;p&gt;There are a bunch of other features that I&amp;rsquo;ve only dabbled in so far that are
also appealing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recurring tasks&lt;/li&gt;
&lt;li&gt;Projects / Tags / User-defined attributes&lt;/li&gt;
&lt;li&gt;Ecosystem of related tools.
&lt;a href=&#34;https://bugwarrior.readthedocs.io/en/latest/index.html&#34;&gt;Bugwarrior&lt;/a&gt; looks
particularly interesting, as it can pull in tasks from JIRA, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All this makes me think that Taskwarrior is a good fit for me, at least for
work-related tasks. ~All my non-trivial work is currently done on my work
laptop, so I don&amp;rsquo;t worry about cross-device syncing. (The lack of a friendly
mobile app would be a dealbreaker for using this for personal tasks, though.)&lt;/p&gt;
&lt;p&gt;Ok, enough bikeshedding for now. :)&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>How I Use AI: Mid-2024</title>
        <link>https://benjamincongdon.me/blog/2024/07/21/How-I-Use-AI-Mid-2024/</link>
        <pubDate>Sun, 21 Jul 2024 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2024/07/21/How-I-Use-AI-Mid-2024/</guid>
        <description>&lt;p&gt;I&amp;rsquo;ve been in a mode of trying lots of new AI tools for the past year or two, and
feel like it&amp;rsquo;s useful to take an occasional snapshot of the &amp;ldquo;state of things I
use&amp;rdquo;, as I expect this to continue to change pretty rapidly.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Claude 3.5 Sonnet (via API Console or
&lt;a href=&#34;https://github.com/simonw/llm&#34;&gt;LLM&lt;/a&gt;)&lt;/strong&gt;: I currently find Claude 3.5 Sonnet
to be the most delightful / insightful / poignant model to &amp;ldquo;talk&amp;rdquo; with. It
excels at complex reasoning tasks, especially those that &lt;code&gt;GPT-4&lt;/code&gt; fails at.
For example, I tasked &lt;code&gt;Sonnet&lt;/code&gt; with writing an AST parser for
&lt;a href=&#34;https://github.com/google/jsonnet&#34;&gt;Jsonnet&lt;/a&gt;, and it was able to do so with
minimal additional help. I don&amp;rsquo;t subscribe to Claude&amp;rsquo;s pro tier, so I mostly
use it within the API console or via Simon Willison&amp;rsquo;s excellent
&lt;a href=&#34;https://github.com/simonw/llm&#34;&gt;llm&lt;/a&gt; CLI tool. The
&lt;a href=&#34;https://www.anthropic.com/news/claude-3-5-sonnet&#34;&gt;Artifacts&lt;/a&gt; feature of
Claude web is great as well, and is useful for generating throw-away little
React interfaces.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;: This is my current most-used general purpose model. The most
powerful use case I have for it is to code moderately complex scripts with
one-shot prompts and some nudges. &lt;code&gt;GPT-4o&lt;/code&gt; seems better than &lt;code&gt;GPT-4&lt;/code&gt; in
receiving feedback and iterating on code. I also use it for general purpose
tasks, such as text extraction, basic knowledge questions, etc. The main
reason I use it so heavily is that the usage limits for &lt;code&gt;GPT-4o&lt;/code&gt; still seem
significantly higher than &lt;code&gt;sonnet-3.5&lt;/code&gt;. And the pro tier of ChatGPT still
feels like essentially &amp;ldquo;unlimited&amp;rdquo; usage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GPT macOS App&lt;/strong&gt;: A surprisingly nice quality-of-life improvement over
using the web interface. Having the ability to &lt;code&gt;⌥-Space&lt;/code&gt; into a ChatGPT
session is super handy. I don&amp;rsquo;t use any of the screenshotting features of
the macOS app yet. They&amp;rsquo;re not automated enough for me to find them useful.
If there was a background context-refreshing feature to capture your screen
every time you &lt;code&gt;⌥-Space&lt;/code&gt; into a session, this would be super nice.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Github Copilot&lt;/strong&gt;: I use Copilot at work, and it&amp;rsquo;s become nearly
indispensable. I recently did some offline programming work, and felt myself
at least a 20% disadvantage compared to using Copilot. Copilot has two
components today: code completion and &amp;ldquo;chat&amp;rdquo;. I find the chat to be nearly
useless. It has &amp;ldquo;commands&amp;rdquo; like &lt;code&gt;/fix&lt;/code&gt; and &lt;code&gt;/test&lt;/code&gt; that are cool in theory,
but I&amp;rsquo;ve &lt;em&gt;never&lt;/em&gt; had them work satisfactorily. The chat model Github uses is also
very slow, so I often switch to ChatGPT instead of waiting for the chat
model to respond.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;use-cases&#34;&gt;Use cases&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Docs/Reference replacement&lt;/strong&gt;: I never look at CLI tool docs anymore. LLMs
have memorized them all. Whenever I need to do something nontrivial with git
or unix utils, I just ask the LLM how to do it. I very much &lt;em&gt;could&lt;/em&gt; figure
it out myself if needed, but it&amp;rsquo;s a clear time saver to immediately get a
correctly formatted CLI invocation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Limited Scope Refactorings&lt;/strong&gt;: Copy/pasting a small chunk (&amp;lt;100 lines) of
code or SQL, and asking it to perform some transformation (i.e. &amp;ldquo;Make the
query return weekly data instead of daily data&amp;rdquo;, &amp;ldquo;Change this function to
work with Fizz protos instead of Buzz protos&amp;rdquo;) tends to have a high enough
success rate that it is a time saver.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;General Knowledge Conversations&lt;/strong&gt;: I&amp;rsquo;ve enjoyed using the original ChatGPT
voice chat feature during my commute. It feels like talking with someone who
has read every Wikipedia article ever. As of 2024, the
&lt;a href=&#34;https://openai.com/index/hello-gpt-4o/&#34;&gt;&amp;ldquo;new&amp;rdquo; voice chat&lt;/a&gt; feature powered
by GPT-4o hasn&amp;rsquo;t landed yet, so I don&amp;rsquo;t have any experience with that.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
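A concrete instance of the &amp;ldquo;docs replacement&amp;rdquo; use case above (my own illustrative example, not one from the post): asking for a git invocation I&amp;rsquo;d otherwise dig out of man pages. The repo setup is just scaffolding so the command has something to act on, and &amp;ldquo;load_config&amp;rdquo; is a made-up search term.

```shell
# Illustrative: "which commits mention load_config in their message?"
# Scaffolding: a throwaway repo with one matching commit.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m "add load_config"
# The kind of invocation an LLM hands back, correctly formatted:
git log --oneline --grep="load_config"
```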
&lt;h2 id=&#34;things-i-havent-had-time-to-try&#34;&gt;Things I Haven&amp;rsquo;t Had Time to Try&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deepmind.google/technologies/gemini/pro/&#34;&gt;Gemini Pro/Advanced&lt;/a&gt;, or
its related tooling like &lt;a href=&#34;https://notebooklm.google/&#34;&gt;NotebookLM&lt;/a&gt;. The
coolest part of the recent Gemini models is their extremely large context
window (2M input tokens). In my limited testing, Gemini seems &amp;ldquo;good&amp;rdquo;, I just
haven&amp;rsquo;t had enough time tinkering with it to see where it exceeds the
capacities of the OpenAI/Anthropic models.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/deepseek-ai/DeepSeek-Coder-V2&#34;&gt;Deepseek Coder V2&lt;/a&gt;: An
extremely powerful open-source model for coding. This one looks pretty great
by the benchmark results Deepseek have posted. However, I tried playing with
the quantized model locally and was disappointed. The full model is rather
expensive to host locally, which has been a barrier. Deepseek also offer the
model via an API (at quite low cost too), which I hope to try eventually.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;resources&#34;&gt;Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://thezvi.wordpress.com/&#34;&gt;Zvi Mowshowitz&lt;/a&gt;&amp;rsquo;s weekly AI posts are
excellent, and give an extremely verbose AI &amp;ldquo;state of the world&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://simonwillison.net/&#34;&gt;Simon Willison&lt;/a&gt;&amp;rsquo;s blog is also an excellent
source for AI news.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.cognitiverevolution.ai/&#34;&gt;The Cognitive Revolution&lt;/a&gt; podcast
hosts some pretty good interviews that I find to be high-signal-to-noise,
and is much less hype-driven than many other AI-centric podcasts I&amp;rsquo;ve
attempted to listen to.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
        <title>Avoid Load-bearing Shell Scripts</title>
        <link>https://benjamincongdon.me/blog/2023/10/29/Avoid-Load-bearing-Shell-Scripts/</link>
        <pubDate>Sun, 29 Oct 2023 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2023/10/29/Avoid-Load-bearing-Shell-Scripts/</guid>
        <description>&lt;p&gt;I&amp;rsquo;ve recently been contemplating a recurring pattern that I&amp;rsquo;ve observed in
several teams I&amp;rsquo;ve worked on – the &amp;lsquo;Load-Bearing Script.&amp;rsquo; The outline of this
pattern goes like this: A team member writes a portion of a system as a shell
script for a quick prototype. That shell script, initially quite simple, grows
in complexity over time. Eventually, the script grows to an unmanageable level
of complexity. At that point, it needs to be rewritten in a more
maintainable/testable language.&lt;/p&gt;
&lt;p&gt;In my experience, this usually manifests itself as a bash script, though any
untested/untestable &amp;ldquo;script&amp;rdquo; can exhibit this pattern.&lt;/p&gt;
&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;In one case, we were building a system that needed to execute in a CI builder
environment. We wanted to do some basic CI/CD work, and so the script was
initially a simple wrapper around git and Kubernetes commands. Eventually, much
of our system&amp;rsquo;s core business logic found its way into the script (metrics
collection, a basic killswitch system, retry logic, etc.). This system was
particularly challenging to manage because the script wasn&amp;rsquo;t even static. Our
backend used Go templates to assemble the script dynamically and send it to the
CI environment. Our only testing was sanity checks that our templater produced
sensible output, and limited end-to-end testing.&lt;/p&gt;
&lt;p&gt;In another case, my company had a requirement to run certain workloads (again,
CI/CD type actions) in a specific compute environment. This compute environment
made it super easy to execute bash scripts, but added friction to running
team-built binaries. My team did ship our own binaries to this environment, but
for reasons that retroactively aren&amp;rsquo;t defensible we still allowed business logic
to creep into the script portion.&lt;/p&gt;
&lt;p&gt;In both cases, we did a (fairly risky) rewrite. In both cases, the rewrite
resulted in moderate severity incidents, despite best efforts to do so safely.&lt;/p&gt;
&lt;h2 id=&#34;why-does-this-happen&#34;&gt;Why does this happen?&lt;/h2&gt;
&lt;p&gt;I can think of a number of reasons: Shell scripts are easy to prototype with.
They&amp;rsquo;re an attractive option when you require &amp;lsquo;just a small amount of logic&amp;rsquo; and
wish to avoid the complexities of a build system, types, or tests. Software
developers enjoy the avoidance of over-engineering almost as much as they enjoy
over-engineering.&lt;/p&gt;
&lt;h2 id=&#34;why-is-this-bad&#34;&gt;Why is this bad?&lt;/h2&gt;
&lt;p&gt;The primary reason I distrust load-bearing scripts is that they make systems
unstable. The instability most often comes from the inability (or difficulty in)
adding sufficient test coverage. Yes, there are frameworks for bash script
testing! I&amp;rsquo;ve rarely seen them effectively used. Usually, a load-bearing script
comes into existence &lt;em&gt;because&lt;/em&gt; the work it is doing is difficult to test (for
example, wrapping multiple dependent CLI tools in a CI environment). The
load-bearing script becomes problematic because it becomes difficult to change.
The script&amp;rsquo;s complexity surpasses a point where manual testing or limited
end-to-end tests can prevent issues &amp;ndash; and so, breakages will happen.&lt;/p&gt;
&lt;p&gt;The secondary reason load-bearing scripts are nefarious is that you &lt;em&gt;will&lt;/em&gt;
eventually have to do a rewrite. It becomes inevitable. Either you accept
permanent instability or do the rewrite. The longer you delay the rewrite, the
more painful it is. There will be pushback against the rewrite: the rewritten
script needs to be feature compatible with the old system; the rewritten script
needs to be released safely; rewriting the script will consume valuable
developer time that could be spent working on Shiny New Features. But
eventually, the scales tip towards the rewrite.&lt;/p&gt;
&lt;h2 id=&#34;advice-to-myself&#34;&gt;Advice to myself&lt;/h2&gt;
&lt;p&gt;If your script becomes larger than what&amp;rsquo;d be appropriate to store in a single
reasonably sized function, it should no longer be a script. Prefer to bail early
on the shell script and eat the cost of a simple rewrite, rather than let
technical debt continue to accrue.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>Soft Boredom</title>
        <link>https://benjamincongdon.me/blog/2023/10/26/Soft-Boredom/</link>
        <pubDate>Thu, 26 Oct 2023 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2023/10/26/Soft-Boredom/</guid>
        <description>&lt;p&gt;I recently read Pema Chödrön&amp;rsquo;s
&lt;a href=&#34;https://www.goodreads.com/book/show/13414918-living-beautifully&#34;&gt;&lt;em&gt;Living Beautifully&lt;/em&gt;&lt;/a&gt;,
and I was struck by the following passage:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Chögyam Trungpa demonstrated the co-emergent nature of feelings in a teaching
on boredom-on how we feel when nothing&amp;rsquo;s happening. Hot boredom, he said, is a
restless, impatient, I-want-to-get-out of here feeling. But we can also
experience nothing happening as cool boredom, as a care-free, spacious feeling
of being fully present without entertainment &amp;ndash; and being right at home with
that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I quite like the phrase &amp;ldquo;soft boredom.&amp;rdquo; When younger, I experienced &amp;ldquo;hot
boredom&amp;rdquo; often: when impatiently waiting in the back seat of a car, when waiting
for a class to end, when on a plane without anything to do, when there was
nothing interesting to look forward to. Boredom was so unpleasant that it needed
to be planned around. The anticipation of boredom was itself unpleasant.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve thought to myself over the past several years, &amp;ldquo;I don&amp;rsquo;t really get bored
anymore.&amp;rdquo; This isn&amp;rsquo;t quite true; there are still times that my mind is empty and
searching for something to busy itself with, but the experience is quite
different. With &amp;ldquo;hot boredom,&amp;rdquo; the quality of feeling is distinctly negative.
With &amp;ldquo;soft boredom&amp;rdquo;, it&amp;rsquo;s &amp;ndash; as Trungpa says &amp;ndash; a more &amp;ldquo;care-free&amp;rdquo; experience. A
thought may float in my head that I wish to engage with, or not! And either is
fine.&lt;/p&gt;
&lt;p&gt;Along the spectrum of &amp;ldquo;hot boredom&amp;rdquo; and &amp;ldquo;soft boredom&amp;rdquo;, I believe there&amp;rsquo;s also
an identifiable &amp;ldquo;lukewarm boredom&amp;rdquo; which expresses itself as &amp;ldquo;always having
something to think about&amp;rdquo;. For me, there was a transitionary time when &amp;ldquo;hot
boredom&amp;rdquo; was no longer present in the absence of an engaging activity, but only
because I could always think myself into being engaged. With &amp;ldquo;lukewarm boredom,&amp;rdquo;
you don&amp;rsquo;t get the restless &lt;em&gt;I-need-to-be-doing-something feeling&lt;/em&gt;, but your mind
still needs to be continuously active.&lt;/p&gt;
&lt;p&gt;Now, I feel &amp;ldquo;soft boredom&amp;rdquo; with a relaxed mind. An hour can float by without
serious thought or restless &lt;em&gt;what-happens-next&lt;/em&gt;-ing. It&amp;rsquo;s quite pleasant.&lt;/p&gt;
&lt;p&gt;In any case, I&amp;rsquo;m writing this on a plane after having taken a week off work. I
spent the week hiking in Arizona. As I drove through the desert, hiked
(partially) into the Grand Canyon, and watched sunsets amongst Arizona&amp;rsquo;s sparse
flora, I quite appreciated being able to fall back into the spaciousness of
&amp;ldquo;soft boredom&amp;rdquo;.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>Mental Models: Slack</title>
        <link>https://benjamincongdon.me/blog/2023/06/20/Mental-Models-Slack/</link>
        <pubDate>Tue, 20 Jun 2023 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2023/06/20/Mental-Models-Slack/</guid>
        <description>&lt;p&gt;Two of my all-time favorite articles about managing one&amp;rsquo;s energy and time relate
to the notion of maintaining &amp;ldquo;Slack&amp;rdquo; in one&amp;rsquo;s life. The first,
&lt;a href=&#34;https://thezvi.wordpress.com/2017/09/30/slack/&#34;&gt;Slack&lt;/a&gt;, by Zvi Mowshowitz,
directly describes the Slack concept that I refer to in this post. The second,
&lt;a href=&#34;http://benjaminrosshoffman.com/sabbath-hard-and-go-home/&#34;&gt;Sabbath hard and go home&lt;/a&gt;,
expands on this notion in the context of the author&amp;rsquo;s Jewish upbringing.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been wanting to write about this concept for a while, but (ironically)
haven&amp;rsquo;t ever found the time to do so.&lt;/p&gt;
&lt;p&gt;Slack (proper noun) is your buffer. It&amp;rsquo;s your buffer of mental energy, physical
energy, and time. It&amp;rsquo;s the ability to get sick for a day or two without
significant interruption to one&amp;rsquo;s commitments. Slack means you can have an off
day without missing an important deadline. Slack allows you to explore
something you&amp;rsquo;re curious about, without worrying about wasting time. It&amp;rsquo;s writing a
blog post about Slack, when there are assuredly more &amp;ldquo;valuable&amp;rdquo; things one could
do with one&amp;rsquo;s time.&lt;/p&gt;
&lt;p&gt;In the &lt;a href=&#34;https://en.wikipedia.org/wiki/Stock_and_flow&#34;&gt;Stock and Flow&lt;/a&gt; model of
systems, Slack is a Stock, a quantity that can be built up and depleted. It&amp;rsquo;s
significantly easier to deplete one&amp;rsquo;s Stock of Slack than increase it. Depleting
Slack is easy: Unforeseen circumstances, the inevitable chaos in life, and
cultural expectations around business and work ethic all make burning through
one&amp;rsquo;s buffer the default outcome. Retaining and rebuilding Slack take purposeful
effort.&lt;/p&gt;
&lt;h2 id=&#34;maintaining-slack&#34;&gt;Maintaining Slack&lt;/h2&gt;
&lt;p&gt;Maintaining a buffer of Slack requires active effort, especially for people who
like to stay busy. I try to use my time well &amp;ndash; both at work, and in my personal
life &amp;ndash; and so I have the tendency to commit to things such that my schedule is
&amp;ldquo;full&amp;rdquo;. This is manageable when you&amp;rsquo;re in complete control over your schedule,
but as soon as exterior forces exert their influence on your life, you quickly
burn through your Slack. So, in one way, Slack is purposefully &lt;em&gt;undercommitting&lt;/em&gt;
yourself. Zvi defines Slack as &amp;ldquo;The absence of binding constraints on behavior&amp;rdquo;,
and so in this way, choosing which constraints you allow to be placed on your
time is critically important to maintaining a buffer.&lt;/p&gt;
&lt;p&gt;Slack also requires handling commitments wisely. Tasks with hard deadlines
should be started sooner than necessary, to have buffer time built-in.
Unimportant tasks should be deferred or delegated to minimize time spent
unnecessarily.&lt;/p&gt;
&lt;p&gt;Having ample Slack needs to be the default case for it to be useful. If you
sometimes have Slack, but often don&amp;rsquo;t, you don&amp;rsquo;t get the benefits. The &amp;ldquo;badness&amp;rdquo;
of stress quickly outpaces the &amp;ldquo;goodness&amp;rdquo; of flexibility. One stressful day or
week looms larger than days and weeks without undue stress. Maintaining a buffer
should be one&amp;rsquo;s standard stance.&lt;/p&gt;
&lt;h2 id=&#34;failure-modes&#34;&gt;Failure Modes&lt;/h2&gt;
&lt;p&gt;Functioning without Slack is like that feeling of always being &amp;ldquo;one bad event&amp;rdquo;
away from letting something slip. It&amp;rsquo;s a precarious feeling! Living this way for
too long leads to burnout or, at best, fatigue. Lacking Slack results in
stressfully working to meet deadlines, dropping commitments, and always being
anxious about &amp;ldquo;what&amp;rsquo;ll go wrong next&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;One feature I&amp;rsquo;ve noticed about Slack is that it tends to be global, or
&lt;a href=&#34;https://drmaciver.substack.com/p/life-complete-problems&#34;&gt;&amp;ldquo;life complete&amp;rdquo;&lt;/a&gt;. One
doesn&amp;rsquo;t have work Slack and personal Slack, as separate quantities. Everything
ultimately comes from the same energy and time budget.&lt;/p&gt;
&lt;p&gt;That being said, Slack is &lt;em&gt;meant to be used&lt;/em&gt;. The optimal amount of burnout is
greater than zero. Having Slack, but not using it to pursue worthy goals is a
waste. Optimally, one should never get to a place of having &lt;em&gt;no&lt;/em&gt; Slack. But this
is challenging, as it&amp;rsquo;s hard to gauge how much Slack one actually has. Occasionally
overshooting into having too little Slack is OK, as long as you notice this
quickly, and work to reestablish that buffer.&lt;/p&gt;
&lt;h2 id=&#34;reestablishing-slack&#34;&gt;Reestablishing Slack&lt;/h2&gt;
&lt;p&gt;The longer you are without Slack, the harder it is to bring it back. If I find
I&amp;rsquo;m merely running slightly low on Slack, slowing down for a week or two tends to
be enough to get back to baseline.&lt;/p&gt;
&lt;p&gt;The more dangerous situation is when you get into a longer-term Slackless &lt;em&gt;rut&lt;/em&gt;.
Getting out of a &lt;em&gt;rut&lt;/em&gt; is particularly challenging because escaping it requires
the very thing you&amp;rsquo;ve run out of: Slack is what gives you the breathing room to
think dynamically.
Burnout decreases executive function, and so making the necessary changes to
one&amp;rsquo;s routine to get out of the rut is exactly what&amp;rsquo;s most challenging to do.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t have a great answer for how to get out of ruts. I&amp;rsquo;ve found the most
reliable way to get out of a rut is to be pushed out by external circumstances.
It&amp;rsquo;s especially helpful to have people in your life who realize you&amp;rsquo;re in one,
and/or can help you climb out of one.&lt;/p&gt;
&lt;p&gt;In either case, reestablishing Slack requires redirecting your time and energy.
Intentionally &lt;em&gt;do less&lt;/em&gt; to build back up a buffer.&lt;/p&gt;
</description>
    </item>
    
    <item>
        <title>The Soul of an Old Machine</title>
        <link>https://benjamincongdon.me/blog/2023/04/15/The-Soul-of-an-Old-Machine/</link>
        <pubDate>Sat, 15 Apr 2023 00:00:00 -0700</pubDate>
        <author>Ben Congdon</author>
        <guid>https://benjamincongdon.me/blog/2023/04/15/The-Soul-of-an-Old-Machine/</guid>
        <description>&lt;p&gt;I recently got an M2 MacBook Air to replace my 2014 MacBook Pro. Apple offered
to recycle my old machine (and give me a token $90 off my new laptop as a
trade-in), which I gladly opted-in to.&lt;/p&gt;
&lt;p&gt;However, when it came time to actually wipe my old laptop and trade it in, I
couldn&amp;rsquo;t help but get a little sentimental about it. I&amp;rsquo;ve used this laptop for
nearly a decade &amp;ndash; and it was a (perhaps &lt;em&gt;the&lt;/em&gt;) formative decade of my life. I
did all of my college work on this laptop, studied computer science, wrote
essentially all of the posts on this blog (from its inception until ~2021),
traveled internationally with it, took it across several moves, used it to
secure my first job post-college, et cetera, et cetera.&lt;/p&gt;

&lt;figure&gt;
    &lt;a href=&#34;https://benjamincongdon.me/blog/2023/04/15/The-Soul-of-an-Old-Machine/macbook_pro.jpg&#34;&gt;
        &lt;img
            src=&#34;https://benjamincongdon.me/blog/2023/04/15/The-Soul-of-an-Old-Machine/macbook_pro.jpg&#34;
            loading=&#34;lazy&#34;
            decoding=&#34;async&#34;
        &gt;
    &lt;/a&gt;
&lt;/figure&gt;

&lt;p&gt;It was, and is, a great machine. If not for its woefully aged processor and
now-insufficient memory, I&amp;rsquo;d happily keep using it. And I did keep using it,
well past its point of obsolescence. I&amp;rsquo;ve used it for years in both laptop and
&amp;ldquo;clamshell&amp;rdquo; mode, and the only nontrivial issue I had with it was a swollen
battery (which I was able to fix for a reasonable price). But nearly everything
else &amp;ndash; the keyboard, the port selection, the screen &amp;ndash; was basically my
favorite hardware that Apple has yet released.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s not to say there weren&amp;rsquo;t some frustrating aspects of it &amp;ndash; the worst
being that it only had 256GB of internal storage, so I had to juggle external
storage for its entire lifetime. I usually kept an additional 256GB SD card in
it, using a micro SD card and an adapter that kept it flush with the port. Of
course, this wasn&amp;rsquo;t the best solution, as SD cards aren&amp;rsquo;t really meant for this
type of access pattern. But it worked!&lt;/p&gt;
&lt;p&gt;Unfortunately, a couple months ago I realized my MBP was struggling to manage
even a single Chrome tab, and noticed that it just &lt;em&gt;was not pleasant&lt;/em&gt; to use
this machine anymore. I never checked to see if I &lt;em&gt;could&lt;/em&gt; update its macOS
version past the Mojave that I parked it on, but I wouldn&amp;rsquo;t trust any of the
more recent releases to run well on it. Also, as more tools are optimized for
M1+, Intel macs just aren&amp;rsquo;t long for this world.&lt;/p&gt;
&lt;p&gt;I strongly considered keeping the old MBP as a memento of this now-closed
chapter of my life, but ultimately decided on recycling it. One less piece of
old hardware sitting around, and hopefully Apple actually is able to salvage
some materials from it.&lt;/p&gt;
&lt;p&gt;I have a fairly strong tendency to become attached to the tools I use over time.
One practice that tends to work for &amp;ldquo;releasing&amp;rdquo; a sentimental object is taking a
picture of it, being intentional about what value it brought me, and
allowing it to go (in a fairly
&lt;a href=&#34;https://en.wikipedia.org/wiki/Marie_Kondo&#34;&gt;Kondo-esque&lt;/a&gt; fashion).&lt;/p&gt;
&lt;p&gt;Well. So long and farewell, to my 2014 MacBook Pro. Thanks for your many years
of stable service. 👋🙏&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cover: Snowshoeing @ Mt. Rainier, March 2023&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
