Previously: Mid 2024
The landscape of AI tooling continues to shift, even in the past half year. This is not unexpected. This post is an updated snapshot of the “state of things I use”.
Tools
Work
- Copilot Edits: This feels roughly 85% as effective as Cursor, but the ability to incorporate enterprise code context makes it roughly on par. I am shocked that I never hear anyone talking about this. My general workflow is: load up to 10 files into the working context, and ask for changes. It can change multiple files at a time. It feels a bit like magic when it works right.
- I’ve used it on languages that are not well covered by LLMs – Scala, Rust – and the results are surprisingly usable. The quality is good enough that I’ve started to reach for this first for most tasks.
- With Rust, I sometimes need to step in and help the model when it gets stuck. I’ve had to point out that it’s not making progress, or defer to a reasoning LLM to get past a logical impasse.
- Copilot now allows you to set custom instructions, similar to Cursor. I have built up custom language-specific instructions so that I get outputs that more consistently match the idioms and style of my company’s / team’s codebase.
- Claude 3.5 Sonnet New (via Claude Pro): (a.k.a. Sonnet 3.6, newsonnet) Sonnet 3.5 remains my daily driver and all-around favorite model. In Claude Pro, the “Projects” feature is amazing. In any given week, I write several design documents, PRDs, announcements, one-pagers, etc. With Projects, I can dump in relevant context documents from related projects, iterate rapidly on writing, and have Claude output suggestions in a style that matches my “organic” writing. The process looks something like this:
- Paste in a collection of relevant documents: design docs, meeting notes, prior art, public documentation, etc.
- Use a custom writing style to “write as me” (more on that in the Techniques section).
- Iterate on the prompt, refining the output until it’s nearly publishable.
The quality of the output is often good enough that I can copy/paste entire sections into design documents with only minimal editing. This works better in some contexts than others, but for non-thinking-heavy sections like “Background” or “Overview”, I can usually get great outputs.
- NotebookLM: Before I started using Claude Pro, NotebookLM was my go-to for working with a large corpus of documents. It’s tightly integrated into Google Workspace, which is convenient. I can dump in 20+ documents and ask questions about them as a corpus. Gemini just isn’t as strong a writer, though, so I don’t use the output of NotebookLM much.
- ChatGPT 4o: 4o feels like an outdated model at this point, but you still get effectively unlimited use with the ChatGPT Plus plan, and the UX for ChatGPT-for-macOS is pretty great. `Option+Space` to get a ChatGPT window is a killer feature. By pure invocation/conversation count, 4o is probably my most used model – though most of the queries look more like Google searches than conversations.
- llm (CLI tool): This has become indispensable for quick, one-off tasks. It’s great for drafting git commit messages, reformatting text, etc. It’s hard to write about exactly what I use `llm` for, since it’s a bunch of one-offs (a sketch of a typical one appears after this list). At minimum, the “chat in CLI” UX is surprisingly useful.
- Perplexity Pro: We have access to Perplexity Pro at work. I was initially impressed, but as time goes on, I find it increasingly disappointing! I sometimes use it as a “free action” for search, but even then, ChatGPT Search is usually better. I once tried to replace Google with Perplexity as my default search engine, and didn’t last more than a day. Perhaps I’m just not using it correctly. Also, the company has had strange vibes recently.
- ImageFX: Google’s image generation studio, which uses Imagen 3. I’ve found it useful for making relatively compelling, non-slop-y illustrations for presentations.
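As a concrete example of the kind of one-off I lean on `llm` for, here’s a minimal sketch of drafting a commit message (the prompt wording is just illustrative, and assumes whatever default model you have configured):

```
# Draft a commit message from the staged diff; llm reads the diff on stdin
# and treats the quoted argument as the instruction.
git diff --staged | llm "Write a concise git commit message for this diff"

# The "chat in CLI" UX: an interactive session with your default model.
llm chat
```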
Personal
- llm: Just as at work, `llm` on the command line is incredibly handy for personal projects. I don’t pay for a personal Claude Pro license, so I use Claude on the command line pretty frequently (a sketch of this setup appears after this list).
- Google AI Studio: Google’s AI Studio is completely free to use, so I frequently use Gemini via AI Studio. Gemini has hands down the best multi-modality of any model family. It’s great for audio, video (!), and PDF inputs. I have more thoughts on Gemini in the Model Tier Rank section below.
- Personal Customized Vercel AI Chatbot: I’ve set up a personalized chatbot using Vercel’s AI Chatbot template.
- My main changes were adding support for Anthropic models, changing the database to be a local SQLite file, and ripping out all the tool use features that I had no use for.
- My main reasons for wanting this were: having all my chats saved only locally (in theory, Anthropic doesn’t permanently retain API logs), and having a handful of custom starter templates that I could easily reach for.
- I could have probably saved some time by using OpenWebUI, but it’s satisfying to have something custom. :)
- Cursor: I use Cursor for personal coding projects, especially when working with smaller or greenfield codebases. I almost always use Sonnet 3.5. I initially thought I’d burn through the monthly credits that Cursor gives you, but that hasn’t been an issue so far. (I don’t do a ton of side project coding these days, though.) As of today, I’m using Copilot Edits 5-10x more than Cursor, but that’s mostly because I cannot currently use Cursor at `$WORK`.
- NotebookLM Podcasts: When running, if I don’t have an appealing podcast, I’ll generate one on NotebookLM from a recent arXiv paper or blog post. The quality varies significantly, and I tend to listen at 1.5x speed.
- The hosts sometimes devolve into trite discussions about the “ethical implications of AI” when describing a technical research paper, so it’s very much not perfect. Still, it’s surprisingly good for what it is, and it often captures my attention more than a pure TTS reading of the underlying content would.
- FAL: FAL hosts a bunch of image generation models (among other “generative media” algorithms). It’s quite easy and cheap to use, and produces better results than e.g. DALL-E.
- I currently use Recraft v3 for my blog images. Late last year, I also trained custom LoRAs of Flux 1.1 on FAL to create some fun images of my cats. 🐈‍⬛
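For the “Claude on the command line” setup mentioned above, here’s a rough sketch of wiring `llm` up to Anthropic models (the plugin and model alias names are from Simon Willison’s `llm-claude-3` plugin and may differ across versions):

```
# Install the Anthropic plugin for llm and store an API key.
llm install llm-claude-3
llm keys set claude

# One-off question against Sonnet 3.5.
llm -m claude-3.5-sonnet "Suggest a simpler way to structure this SQLite schema: ..."
```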
Techniques
Here are a few techniques I’ve found to be particularly effective when working with these tools:
Prompting (Code):
- Context Management: I find that the single biggest factor in getting good results from an LLM – especially for coding – is the context you provide. When using tools like Cursor and Copilot Edits, getting a good set of files that are relevant to the task at hand into your context is key. I haven’t found anything yet that is able to maintain good context itself, outside of trivially small code bases.
- Test Generation: I’ve found that asking for test cases to be generated is a great way to get a model to understand the behavior of the change I’m asking for.¹ Unit tests are also usually super easy to pattern match and generate given in-context examples, so the quality is usually quite high. It’s often useful to have idiomatic examples of your testing patterns in your context, so that the model can generate tests that match your existing style. As a final tip, asking an LLM “are there any missing tests?” is a good “free” way to increase test coverage.
- Loop: Copy/Paste Compiler & Errors: This feels like extremely low-hanging fruit for improved workflows, but for now my loop is essentially: start ibazel (or whatever other test runner you have, in “watch mode”), have the LLM propose changes, then copy/paste the compiler or test errors back into the LLM to get it to fix the issues (a minimal sketch follows this list).
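A minimal sketch of that loop, assuming a Bazel project (the target label is a placeholder; substitute whatever watch-mode test runner you use):

```
# Terminal 1: keep tests re-running on every file save.
ibazel test //myproject:all_tests

# Terminal 2 (conceptually): ask the LLM for a change, apply it, then
# copy/paste any compiler or test failures from Terminal 1 back into the
# LLM, and repeat until green.
```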
All of these techniques come with the caveat that you need to stay actively engaged while prompting and evaluating LLM output. Treat the LLM as an intern or junior developer that you’re coaching along. Blindly accepting output is still a recipe for disaster.
As a counterpoint to this note of caution, Karpathy recently coined the term “vibe coding”:
> There’s a new kind of coding I call “vibe coding”, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It’s possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good.
>
> I “Accept All” always, I don’t read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I’d have to really read through it for a while. Sometimes the LLMs can’t fix a bug so I just work around it or ask for random changes until it goes away. It’s not too bad for throwaway weekend projects, but still quite amusing.
Admittedly, I find this approach works for small, throwaway projects, as he notes, but not for anything that needs to be maintained or scaled. However, this does seem to be the direction we’re headed. When quality code generation becomes too cheap to meter, if you can strap an optimization loop around generation<>evaluation, you can get a lot of work done with minimal effort.
Prompting (Writing):
- “Give me 3 options”: Whenever I’m generating text that will be used in a document or email, I always ask for multiple options. This allows me to either pick the best one or, more often, combine the best parts of each to create something that feels more natural and human. I don’t trust any model to one-shot human-sounding text.
- “Write as me” prompts: Models are still not amazing at copying writing styles, but the models that are good at creative writing tend to be at least OK at writing in my personal style (a CLI sketch of this workflow follows this list).
- The workflow looks like:
- Take a large chunk of your writing and put it into R1 or Claude. Ask “Write a style guide for writing exactly as the author of this text.”
- Then use that as a preamble to creative writing tasks, or as a Custom Style in Claude.
- The models I’ve found to be best at this approach are Sonnet 3.5 and (surprisingly) Deepseek R1. None of the OpenAI models fare well here, in my testing.
- Use as a “calculator for words”: LLMs remain great for simple, mindless reformatting. Tasks like:
- “Reformat this text as a comma separated list”
- “Find the latest date from this huge list of unstructured dates”
- “Convert to a bullet pointed list”
- “Remove duplicates from this list”
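The “write as me” workflow above also translates naturally to the CLI. A sketch using `llm`, assuming a Claude model is installed (file paths and prompt wording are illustrative):

```
# Step 1: distill a style guide from a corpus of your own writing.
cat my-posts/*.md | llm -m claude-3.5-sonnet \
  "Write a style guide for writing exactly as the author of this text." \
  > style-guide.md

# Step 2: use the style guide as a system prompt for new writing tasks,
# combined with the "give me 3 options" trick.
llm -m claude-3.5-sonnet -s "$(cat style-guide.md)" \
  "Draft the Background section for this design doc. Give me 3 options."
```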
Related Usage Techniques:
- “Copy as Markdown” from Google Docs: LLMs handle Markdown particularly well. Google Docs now allows you to copy content as Markdown, which makes it easy to transfer text between the two environments. “Paste as Markdown” is also useful.
- Markdown tables! I really dislike the Markdown table syntax, so I almost never use them. Asking an LLM to output a Markdown table and then copying that into a Google Doc is awesome.
- macOS Speech to Text: I never thought I’d say this, but sometimes talking is faster than typing. I’ve been using macOS’s built-in speech-to-text more and more when “writing” out conversational prompts.
- pbpaste / pbcopy: Since LLM usage today often relies heavily on copy/paste, the `pbcopy` and `pbpaste` commands (at least, on macOS) have been useful. These let you copy to, and paste from, the clipboard from the CLI (a typical pipeline is sketched below).
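Putting the clipboard commands together with the “calculator for words” idea, a typical pipeline looks like this (prompt wording is illustrative):

```
# Clipboard in -> LLM transform -> clipboard out: reformat whatever text
# you last copied, and put the result back on the clipboard.
pbpaste | llm "Reformat this text as a comma separated list" | pbcopy
```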
Unexpectedly Useful Use Cases
Beyond the obvious applications, I’ve found AI to be surprisingly useful in a few unexpected areas:
- Finding a last-minute hike: Any good model has grokked all of AllTrails, and they give good recommendations even for specific, complex criteria (e.g., “not crowded, loop trail, between 5 and 10 miles, moderate difficulty”).
- As a “free action” for code review: Before reviewing a pull request, I often pipe the diff into a model like o1 to see if it finds anything objectionable. Worst case, you get slop out that you can ignore. I’ve had o1 catch some quite subtle bugs that I didn’t catch on first review (a sketch of this pipeline appears after this list).
- Aside: Compared to a year ago, AI code review actually seems feasible now. The original GPT-4 class models just weren’t great at code review, due to context length limitations and the lack of reasoning.
- Planning a Catio: We recently built a catio for our cats, and I needed to calculate how much PVC pipe to buy. My partner drafted the plans in CAD, and I fed this into ChatGPT, which used Code Execution to plan out all the pieces. I also got it to generate labels for each of the piece lengths, which we annotated back onto the plan. It saved us a ton of time.
- Financial Advice: (⚠️ Caveat emptor ⚠️) This one requires a huge grain of salt, but I recently had to make a large financial decision and found LLMs helpful as a secondary gut check on my math. I asked Claude, R1, Gemini, GPT-4o, and o1 for their thoughts on my approach. All of them agreed directionally with the reasoning I came up with, which gave me a bit more confidence in my decision. Obviously check your work here.
- Medical Advice: (⚠️ Caveat emptor ⚠️) Same huge grain of salt, but using o1 / Claude as a second opinion for diagnosing symptoms and evaluating medical test results is definitely worth doing. I had some blood work done a few months ago, and got the raw results back prior to having my doctor review the results. Claude’s evaluation of the tests matched 1:1 with my doctor’s later report. (Granted: This was a fairly simple case.)
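The “free action” code review above is just a pipe. A sketch, assuming your `llm` install has an o1-capable OpenAI model configured (the branch names are placeholders):

```
# Pipe the PR diff into a strong model before doing the human review.
git diff main...my-feature-branch | \
  llm -m o1 "Review this diff. Flag bugs, subtle logic errors, and anything objectionable."
```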
Model Tier Rank
Here’s my current ranking of the models I’ve been using, based on their overall utility:
- S Tier:
- Claude 3.5 Sonnet: An absolute workhorse. Smart across many domains—technical, creative writing, etc. It’s my go-to for most tasks.
- Deepseek R1: Cheap and smart enough to not feel bad about using it. Deepseek R1 + Web Search is incredibly powerful. It’s a great option for tasks that require up-to-date information or external knowledge.
- A note on serving: As of writing, the Deepseek platform serves R1 (undistilled) the fastest of any provider I’ve seen. If you have data residency concerns, or concerns about Deepseek’s security practices, I’ve found that OpenRouter provides a good alternative. Sadly, OpenRouter’s web search is qualitatively worse than DeepSeek’s.
- A Tier:
- Claude 3 Opus: It’s amazing, just so expensive I can’t really justify using it for most tasks. Opus has been eclipsed by Sonnet 3.5 (and others) on coding, but is still great for writing.
- o1: Impressive sometimes, but rather hit or miss in my experience. When it works, it’s impressively good. I notice that I don’t reach for this model much relative to the hype/praise it receives. Usage limits really deter me from leaning on a model. “You’re out of messages until Monday” is a bad feeling. I don’t want my tools to feel like they’re scarce.
- B Tier:
- o1-Mini: I used this way more than o1 this year. It’s pretty good for coding. This model appears to no longer be available in ChatGPT following the release of o3-mini, so I doubt I will use it much again. That being said, I will likely use this class of model more now that o3-mini exists.
- C Tier:
- Gemini 2.0 Flash, Gemini 2.0 Flash Thinking, Gemini Experimental 1206: I want to like Gemini; it’s just not the best on any frontier that I care most about. The most obvious way it’s better is that the context length is enormous. It’s also free on AI Studio, which is confusingly generous. My favorite party trick is that I put 300k tokens of my public writing into it and used that to generate new writing in my style. However, the “write as me” prompt technique works nearly as well – often better. Gemini models are also weirdly sensitive to temperature changes.
o3-mini just came out yesterday. I’ve used it a bit, but not enough to give a confident rating.
What I’m Not Using
There are a few tools that I’m not currently using, either because I haven’t found them to be particularly useful or because they’re still too early in their development:
- ChatGPT Pro: I just don’t see $200 in utility there. Unlimited o1 would be nice, but $200/month is too much to stomach, even though in raw economic terms it’s probably worth it.²
- Operator: I don’t see the utility for me yet. It’s a cool research demo today.
- o1-Pro: I’d love to try this. Again, the $200 price tag just doesn’t seem worth it.
- Browser-use: Open-source version of Operator. It’s cool; I tried it; it’s slow. The version of this that acts at ~1 action per second instead of ~1 action per minute will be a force to be reckoned with.
- Anthropic Computer Use: See Operator and Browser-use above.
- ChatGPT “Work with Apps”: This would be great with Chrome, or some other app I’m not familiar with (like Godot), but given that it just supports Terminal and IDEs, I’d rather use Cursor or Copilot. I think this could get good; I just don’t see any use cases for me yet.
- ChatGPT “GPTs”: I used them modestly in 2024, but as new features were added (voice mode, search, reasoning, etc.), GPTs often couldn’t use these features so I stopped using them as much. I basically used them as custom placeholder prompts, so there wasn’t much value added.
- Local Models: Aside from trying out Ollama/LMStudio just to see if they work, I haven’t found any durable use cases for local models. For human-in-the-loop LLM usage, I just think there isn’t much reason to not use the most powerful model available. If I trusted Anthropic less, I’d probably look into local models more intently.
- “Advanced” Voice Mode: Last year I used voice mode pretty consistently when I was commuting. This was the original voice mode that was just a wrapper around STT->LLM->TTS. It was great! I commute by bus now, so I don’t have as much dead time to use voice mode; now I only use it occasionally while running. Additionally, OpenAI’s “advanced” voice mode was somewhat disappointing to me: the “interrupt the AI” feature didn’t work reliably enough for it to feel like a step change improvement.
- The lack of impact of advanced voice mode is curious. Her-level proto-AGIs that we can talk to now exist in the world, and mostly folks don’t care.
Things I’d Like To See
- Better Tools for Copiloting Writing: I think the UX for writing with LLMs can be significantly better than it is today. I don’t think anyone has made a great GitHub Copilot-esque product for writing, likely because there isn’t “one correct” path you go down when doing non-technical writing. I’m excited by loom-like interfaces, which allow you to traverse trees of text. Other existing tools, like “take this paragraph and make it more concise/formal/casual”, just don’t have much appeal to me; I really don’t tend to like the output of these systems. Ideally, I want to be steering an LLM in my writing style and in the direction of my flow of thoughts.
- Fast or Reliable Browser / Computer-Use Agents: The demos I’ve seen for browser/computer use seem too slow now to be worth investing much in. However, I think there’s a ton of promise in them. I see two paths to increasing utility: Either these agents get faster, or they get more reliable. If faster, then they can be used more in human-in-the-loop settings, where you can course correct them if they go off track. If more reliable, then they can operate in the background on your behalf, when you don’t care as much about end-to-end latency. I do think that someone will crack a specialized model for very fast computer use within the next year. All the building blocks are there for agents of noticeable economic utility; it seems more like an engineering problem than an open research problem.
- Better Long-term Memory: I was excited about ChatGPT memory, but this was also mostly disappointing. I have yet to have an “aha” moment where I got nontrivial value out of ChatGPT having remembered something about me. More often than not, it remembers weird, irrelevant, or time-contingent facts that have no practical future utility. I’d really like a system that does contextual compression on my conversations, figures out the types of responses I tend to value and the topics I care about, and uses that to improve model output on an ongoing basis. I’ve seen some interesting experiments in this direction, but as far as I can tell no one has quite solved this yet.
Resources
No change from mid 2024:
- Zvi Mowshowitz’s weekly AI posts are excellent, and give an extremely verbose AI “state of the world”.
- Simon Willison’s blog is also an excellent source for AI news.
- The Cognitive Revolution podcast hosts some pretty good interviews that I find to be high-signal-to-noise, and is much less hype-driven than many other AI-centric podcasts I’ve attempted to listen to.
New additions:
- Particularly good Twitter follows: Janus, Nathan Lambert, HamelHusain, Jeremy Howard, davidad, swyx, Ethan Mollick
- Periodic check-ins on Lesswrong for more technical discussion (esp. related to AI alignment and AGI implications), if you’re so inclined
Cover image by Recraft v3. As always, this post contains my own views and does not represent the views of my employer.
1. I had a discussion with a sharp engineer I look up to a few years ago, who was convinced that the future would be humans writing tests and specifications, and LLMs would handle all implementation. Now, I think we won’t even necessarily need to write in-code tests, or low-level unit tests. I’m now convinced that features can largely be described in English, with some end-to-end acceptance tests specified by humans. ↩︎
2. Deep Research came out while I was writing this post, and this might actually tip the scale for me. More generally, I think the paradigm of ambient agentic background compute will be a Big Deal soonish. ↩︎