For various reasons, it’s probably a good idea to keep backups of data that is stored in The Cloud™️. I have a bunch of data in my Google account from years of usage – photos and email primarily, but also a bunch of other digital detritus that would make me sad if it went away. Google Takeout is a pretty painless way of getting a bulk export of your account’s data. It isn’t a perfect replica of the data, but for my purposes it’s a good enough fallback in case of disaster (loss of account, accidental deletion, etc.).

My motivation for backing up data is mostly of the “two is one, one is none” variety, so I decided to store my Google Takeout backups on a non-Google cloud provider. Some quick searching revealed that AWS S3’s Glacier “Deep Archive” storage class provides a ridiculously cheap place to store backups. The current Deep Archive price is $0.00099/GB/month. My compressed Takeout archive is ~300GB, so I can store a few backups and still only pay ~$1/month. Excellent.

It’s worth noting that there are some caveats to using S3 Glacier: it’s a “cold storage” class, so you need to make a request to retrieve your backups, and that retrieval takes ~12 hours. You’ll also pay data-transfer (egress) fees to download your backed-up data when recovering. For backups that are mostly a “just in case” fallback, I think these tradeoffs make sense.

For the mechanics of loading my Takeout data into S3, I found this guide1 to be quite useful. The key insight is that it makes sense to use a large EC2 instance to download and transfer your Takeout data, rather than your home internet. Not only is AWS’s networking quite fast, but keeping all the transfers within AWS also means the traffic doesn’t count against your home internet’s bandwidth cap.

As far as I understand, the pricing mechanics for this setup work like this: (1) inbound network traffic to EC2 is free (although your maximum network throughput is linked to the size of the instance you choose), (2) data transfer into S3 is also free. So, the only billable aspects of this setup are the EC2 instance cost2 while copying the backup, long-term S3 storage fees (which we already established are quite low), and charges for data egress if/when you need to download your backup.
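Putting the post’s numbers together, the per-backup cost can be sketched with a quick back-of-envelope calculation (rates as quoted in this post; EBS and S3 request fees ignored):

```shell
# EC2 instance-hours while copying, plus monthly Deep Archive storage.
# Rates from the post: $0.0928/hr for a t2.large (~4 hours of use),
# $0.00099/GB/month for a ~300 GB archive.
awk 'BEGIN {
  printf "EC2 ~$%.2f per run, S3 ~$%.2f per month\n", 4 * 0.0928, 300 * 0.00099
}'
```

Egress is the wildcard: it costs nothing until you actually need to download a backup.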

Backup Procedure

Here’s my process when I run the backup every ~6 months:

  1. Request an export from the Takeout site. There are options for which “products” you want to export. I just keep them all selected for simplicity.
  2. Spin up a t2.large instance in AWS. Make sure its disk is large enough to fit your entire Takeout backup, with some margin.
    • The speed of an EC2 instance’s network interface is correlated with its size, so you could also bump this up to a larger instance to get faster transfer speeds.
  3. Install tmux so you don’t lose progress if your SSH session dies.
  4. Use the trick outlined in this guide: start a download for each Takeout .zip file in your browser, then immediately cancel it. Copy the URLs of these cancelled downloads, and use curl to fetch them from within your EC2 instance.
    • I just made a bunch of panes in tmux and had a separate curl invocation running in each. There’s probably a better way of doing this, but to download ~6 zip files, it was good enough.
    • Tip: You can use iftop to monitor the transfer speed while waiting.
  5. Set up aws CLI credentials. (Guide)
    • If you’re not using Amazon Linux, you may also need to install the aws CLI.
  6. Once all the Takeout zip files are downloaded, move them all into a single directory.
  7. Make an S3 bucket to store your backups in. You only need to do this once. In subsequent backups, I just reuse the same bucket.
  8. Copy all the zip files into S3, setting the storage class to “DEEP_ARCHIVE”:
    $ aws s3 cp /path/to/the/takeout/zips s3://$BUCKET \
        --storage-class DEEP_ARCHIVE --recursive
  9. Wait until the transfer is complete, and then terminate the EC2 instance.
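For step 4, the tmux-pane-per-curl approach can be collapsed into a single parallel invocation; a sketch, assuming the cancelled-download URLs have been pasted one-per-line into a urls.txt file (the demo below substitutes a local file:// URL so the commands are runnable as-is):

```shell
# Demo stand-in: in real use, urls.txt holds one pre-signed Takeout URL
# per line, pasted from the start-and-cancel trick in step 4.
echo "demo archive" > takeout-001.zip
printf 'file://%s/takeout-001.zip\n' "$PWD" > urls.txt

# -n1 passes one URL per curl; -P4 runs up to four downloads in parallel;
# -O names each download after the last path segment of its URL.
mkdir -p downloads
( cd downloads && xargs -n1 -P4 curl -fsSLO < ../urls.txt )
```

This replaces the manual panes, though the tmux session is still worth having so a dropped SSH connection doesn’t kill the downloads.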

The first time you do this procedure, it’s probably also worth convincing yourself that it worked by requesting a retrieval of one of the zip files, downloading it, and inspecting its contents. Always test your backups!
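A byte-level check makes that verification more convincing than eyeballing the zip contents. One way to do it with sha256sum, using a stand-in file so the sketch is runnable (in real use, point it at the directory of Takeout zips from step 6):

```shell
mkdir -p takeout-zips
echo "demo archive" > takeout-zips/takeout-001.zip   # stand-in for a real zip
# Record checksums next to the zips before uploading:
( cd takeout-zips && sha256sum *.zip > SHA256SUMS )
# After a future restore + re-download, verify byte-for-byte:
( cd takeout-zips && sha256sum -c SHA256SUMS )       # expect "takeout-001.zip: OK"
```

Uploading the SHA256SUMS file alongside the zips (it’s tiny, so its storage cost is negligible) means the checksums survive even if the EC2 instance is long gone.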

Some notes:

  • While the transfers were running, I used iftop to see how fast my t2.large was downloading/uploading data. I got almost exactly what this performance chart suggested: 0.51 Gbit/s ≈ 64 MB/s. At that rate, transferring 300GB took ~1.3 hr each way.
  • In total, the full process takes maybe 3-4 instance hours, for about $0.40 in EC2 costs.
  • I had wanted to use rclone instead of the aws CLI for the step of transferring data to S3, but rclone seems to work inconsistently with Glacier.
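For reference, the transfer-time arithmetic from the first note, in decimal units:

```shell
# Convert the chart's 0.51 Gbit/s to MB/s, then compute the time
# to move 300 GB at that rate.
awk 'BEGIN {
  mbs = 0.51e9 / 8 / 1e6
  printf "%.0f MB/s, %.1f h each way\n", mbs, 300e9 / (mbs * 1e6) / 3600
}'
```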

Standard disclaimer that the above are my own opinions, and are not necessarily those of my employer.

  1. I just noticed as I was writing this post that the rest of the articles on this site are about beekeeping. The most recent post is a video of the dissection of a wasp queen. The internet is an eclectic place. 🤷‍♂️ ↩︎

  2. The EC2 charge ends up being mostly negligible too. I used a t2.large instance for ~4 hours. At $0.0928/hr, that ends up being less than $1, even with a beefy 500GB SSD allocated. ↩︎