As I’ve written in the past, I’ve been using Obsidian to store notes, links, and other “outboard brain” type information that I want to keep track of. As my Obsidian vault grows, so does the risk of link rot.

Much has been written about the problem of link rot, so I’m going to focus on how I decided to “solve” (or at least, ameliorate) the problem for myself. When a link in my Obsidian notes becomes unavailable, it’s not the end of the world – so instead of doing something more resource-intensive, like actually saving copies of all the pages I link to, I decided that making sure the links exist in the Internet Archive was sufficient.1

A few months ago, I was making some contributions to Wikipedia, and I noticed that someone had written a bot to identify and replace broken links with an archived version from the Internet Archive. This bot – InternetArchiveBot – inspired me to write something similar for my Obsidian vault.

I did a bit of initial research and couldn’t find any great tools for archiving links to the Internet Archive, so I decided to write my own. The result is wayback-archiver: a small Rust CLI program specifically designed to archive a large number of URLs to the Internet Archive’s Wayback Machine.

The tool is pretty simple: you provide it with a list of URLs, and it requests that those URLs be archived. It then saves a JSON blob of the URLs and their archived versions, so that subsequent runs of the tool only archive new URLs.
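
In practice, usage looks something like this (urls.txt is just a placeholder here; the tool reads newline-separated URLs from standard input, and the --out and --merge flags are the same ones used in the scripts below):

cat urls.txt | wayback-archiver --out archive.json --merge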

There’s actually a little bit more going on under the hood: the tool first checks whether there’s a recent (less than 6 months old) snapshot of the page, and if so, it just returns that. Otherwise, it requests that the page be freshly archived and saves the new snapshot. There’s also some logic to throttle archival requests and back off after bandwidth warnings, which becomes necessary when requesting a large number of snapshots. In the future, I might also add the ability to request periodically updated snapshots of links, but for now, once a page is included in the output archive, it will never be updated.
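
The actual implementation is in Rust, but the core check-then-archive flow can be sketched in shell against the Wayback Machine’s public availability and “Save Page Now” endpoints. This is only an illustration of the idea; the cutoff and sleep values are arbitrary, and the real tool’s throttling and backoff handling is more involved:

#!/usr/bin/env bash
# Illustrative sketch only -- not wayback-archiver's actual implementation
url="$1"

# Ask the Wayback availability API for the closest existing snapshot
closest=$(curl -s "https://archive.org/wayback/available?url=${url}")
snapshot_url=$(echo "$closest" | jq -r '.archived_snapshots.closest.url // empty')
snapshot_ts=$(echo "$closest" | jq -r '.archived_snapshots.closest.timestamp // "0"')

# Reuse the snapshot if one exists and is less than ~6 months old (GNU date)
cutoff=$(date -d "6 months ago" +%Y%m%d%H%M%S)
if [ -n "$snapshot_url" ] && [ "$snapshot_ts" -ge "$cutoff" ]; then
  echo "Recent snapshot exists: $snapshot_url"
else
  # Otherwise, request a fresh capture via Save Page Now
  curl -s -o /dev/null "https://web.archive.org/save/${url}"
  sleep 10  # crude throttle; the real tool also backs off on bandwidth warnings
fi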

I then wrote a quick shell script that extracts URLs from my Obsidian vault and runs the tool on them. This was pretty easy, since everything is in Markdown. I added the script to my crontab, so that links get automatically archived soon after they’re added:

#!/usr/bin/env bash
set -e

# cd into the directory where this script lives
cd "$(dirname "$0")"

find . -name "*.md" ! -path "*.obsidian/*" \
  -exec grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" --no-filename {} + |
sort | uniq |
wayback-archiver --out archive.json --merge

The result is a growing archive.json file, which maps each URL to its archived version:

# Example archive.json
{
    ...
    "https://litestream.io/blog/why-i-built-litestream/": {
        "url": "http://web.archive.org/web/20210611183401/https://litestream.io/blog/why-i-built-litestream/",
        "last_archived": "2021-06-11T18:34:01"
    }
    ...
}

I haven’t written any tooling yet to automatically swap out dead links with their archived versions, but that would be a nice follow-up to do eventually.
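
One possible starting point would be to look up a dead URL in archive.json with jq and rewrite it across the vault’s Markdown files. Here’s a minimal sketch (the dead URL is just a placeholder, and actually detecting that a link is dead is left out):

# Hypothetical sketch: swap one known-dead URL for its archived version
dead_url="https://example.com/dead-page"
archived=$(jq -r --arg u "$dead_url" '.[$u].url // empty' archive.json)

if [ -n "$archived" ]; then
  # Rewrite the link in place across all notes (GNU sed; URL characters
  # aren't regex-escaped, so this is naive)
  find . -name "*.md" ! -path "*.obsidian/*" \
    -exec sed -i "s#${dead_url}#${archived}#g" {} +
fi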

After I wrote wayback-archiver, I stumbled upon this excellent Gwern article on link archiving. Unsurprisingly, others have written similar tools, and a few that Gwern linked to are worth pointing out:

  • oduwsdl/archivenow - A multi-archive CLI tool (supports archive.is, Internet Archive, and a few more).
  • linkchecker/linkchecker - A general-purpose link checking tool. Its main purpose is to check that a website’s links are valid/up, but it can also be used to recursively scrape a list of links from a website.

archivenow is pretty similar to wayback-archiver – and it supports more archival destinations than just the Internet Archive – but it isn’t as well suited to archiving a large number of links at once. It also can’t maintain a list of already-archived links and request archival only for the new ones. So, I’m still happy I wrote wayback-archiver, even though in retrospect I was sort of reinventing the wheel.

I was pretty happy with the way wayback-archiver worked for my Obsidian vault, so I also set it up to archive links on this website. I used the linkchecker tool listed above to scrape all the outbound links from this blog, and then set up a similar cron-based workflow to periodically run wayback-archiver on the scraped links. I needed to use linkchecker instead of a simple grep, since reliably extracting links from HTML without a full HTML parser is notoriously difficult.

Here’s the script in its entirety:

#!/usr/bin/env bash
set -ex

# cd into the directory where this script lives
cd "$(dirname "$0")"

# Use hugo to build the site
pushd site && hugo && popd

# Scrape links
linkchecker --verbose --quiet --timeout=35 --no-warnings \
  -F csv --ignore-url=^mailto \
  ./site/public/ || true  # linkchecker exits non-zero if it finds broken links

# Run wayback-archiver on the dumped links
cat linkchecker-out.csv |
cut -d ';' -f1 |
grep -e ^http |
# Exclude some sites that don't need archiving
grep -E -v \
  -e "archive.is/" \
  -e "web.archive.org/" \
  -e "youtube.com" \
  -e "wikipedia.org" \
  -e "benjamincongdon.me/tags/" \
  -e "(png|gif|svg|jpg|jpeg|xml|css)$" |
sed "s#?ref_src=.*##" | # Remove junk from twitter URLs
sort | uniq |
wayback-archiver --out archive.json --merge

# Cleanup
rm -r ./site/public
rm linkchecker-out.csv

We’re lucky that a service like the Internet Archive exists, and I heartily endorse supporting them through their donation page.


  1. Coincidentally, while I was writing this post the Internet Archive briefly went offline due to a power outage. So… yeah, this is by no means a failure-proof solution to link rot. ↩︎