Simon Willison’s Weblog


Litestream backups for Datasette Cloud (and weeknotes)

11th August 2022

My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.

Datasette Cloud is a SaaS service for Datasette. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using this form.

I’m building Datasette Cloud on Fly, specifically on Fly Machines.

Security is a big concern for Datasette Cloud. Teams should only be able to access their own data—bugs where users accidentally (or maliciously) access data for another team should be protected against as much as possible.

To help guarantee that, I’ve designed Datasette Cloud so that each team gets their own, dedicated instance, running in a Firecracker VM managed by Fly. Their data lives in a dedicated volume.

Fly volumes already implement snapshot backups, but I’m interested in defence in depth. This is where Litestream comes in (coincidentally now part of Fly, although it wasn’t when I first selected it as my backup strategy).

I’m using Litestream to continuously back up the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, such that I can recover a previous version of the data within a configurable retention window.

Keeping backups isolated

Litestream works by writing a constant stream of pages from SQLite’s WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read and write from S3.

This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.

Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.

Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.

I’ve learned an absolute ton about S3 and AWS permissions building my s3-credentials tool for creating credentials for accessing S3.

One of the tricks I’ve learned is that it’s possible to create temporary, time-limited credentials that only work for a prefix (effectively a folder) within an S3 bucket.

This means I can run Litestream with credentials that are specific to the team—that can read and write only from the team-ID/ prefix in the S3 bucket I am using to store the backups.

Obtaining temporary credentials

My s3-credentials tool can create credentials for a prefix within an S3 bucket like this:

s3-credentials create my-bucket-for-backups \
  --duration 12h \
  --prefix team-56/

This command uses the sts.assume_role() AWS method to create credentials that allow access to that bucket, attaching a generated JSON policy in order to restrict access to the provided prefix.
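As a rough illustration of the kind of policy involved, here is a minimal Python sketch that builds a read-write policy limited to one prefix. The function name build_prefix_policy, the bucket name and the exact list of actions are assumptions for illustration; the policy that s3-credentials actually generates may differ in detail.

```python
import json


def build_prefix_policy(bucket, prefix):
    """Return an IAM policy document limiting access to one prefix in a bucket.

    Illustrative only: the statement layout and action list are assumptions,
    not a copy of what s3-credentials emits.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed with a condition
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object-level actions are narrowed by the resource ARN itself
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


# Serialise the policy so it can be sent along with a credentials request
policy_json = json.dumps(build_prefix_policy("example-bucket", "team-56/"))
```

A JSON string like policy_json can be passed as the Policy argument to sts.assume_role(), where it further restricts whatever the assumed role is already allowed to do.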

I extracted the relevant Python code from s3-credentials and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.

With the endpoint in place, my code for launching a team container can do this:

  • Create the volume and machine for that team (if they do not yet exist)
  • Generate a signed secret token that the machine container can exchange for its S3 credentials
  • Launch the machine container, passing it the secret token
  • On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created
  • Those credentials are used to populate the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables used by Litestream
  • Start Litestream, which then starts Datasette
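The launch steps above could be sketched in Python along these lines. The endpoint URL, the TEAM_TOKEN variable, the JSON field names and the Datasette command line are all hypothetical placeholders; this post does not show the real startup script.

```python
import json
import os
import urllib.request

# Hypothetical endpoint on the management server; not given in the post
CREDENTIALS_URL = "https://management.example.com/-/s3-credentials"


def fetch_credentials(token, url=CREDENTIALS_URL):
    """Exchange the long-lived secret token for temporary S3 credentials."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def credentials_to_env(credentials, base_env=None):
    """Return a copy of the environment with the AWS variables Litestream reads."""
    env = dict(os.environ if base_env is None else base_env)
    # The field names in the JSON response are assumed for this sketch
    env["AWS_ACCESS_KEY_ID"] = credentials["AccessKeyId"]
    env["AWS_SECRET_ACCESS_KEY"] = credentials["SecretAccessKey"]
    env["AWS_SESSION_TOKEN"] = credentials["SessionToken"]
    return env


def main():
    token = os.environ["TEAM_TOKEN"]  # hypothetical name for the secret token
    env = credentials_to_env(fetch_credentials(token))
    # "litestream replicate -exec" runs a child process alongside replication;
    # the Datasette command shown here is a placeholder
    command = ["litestream", "replicate", "-exec", "datasette serve /data/team.db"]
    # Replace this process with Litestream, passing the populated environment
    os.execvpe(command[0], command, env)
```

A startup script would call main() as its final step. Keeping credentials_to_env() separate from the network call makes the environment handling easy to check on its own.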

Restarting every 12 hours

You may be wondering why I bothered with that initial secret token—why not just pass the temporary AWS credentials to the container when I launch it?

The reason for this is that I need to be able to obtain fresh credentials every 12 hours.

A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they receive traffic.

All you need to do is call sys.exit(0) in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.

So far I’m mainly using this to avoid the cost of running containers when they aren’t actually in use. But there’s a neat benefit when it comes to Litestream too.

I’m using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.

After considering a few ways of doing this, I selected the simplest to implement: have Datasette call sys.exit(0) after ten hours, and let Fly restart the container, causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.

I implemented this by adding it as a new setting to my existing datasette-scale-to-zero plugin. You can now configure that with "max-age": "10h" and it will shut down Datasette once the server has been running for that long.
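Here is a minimal sketch of the idea, assuming a duration string such as "10h" and a check that runs periodically. The names parse_duration, should_exit and exit_if_too_old are illustrative, not the plugin's actual internals.

```python
import re
import sys
import time

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert a string such as "10h" or "30m" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"Unsupported duration: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def should_exit(started_at, now, max_age):
    """Return True once the server has been running for at least max_age."""
    return now - started_at >= parse_duration(max_age)


def exit_if_too_old(started_at, max_age="10h"):
    """Call periodically; exits cleanly so the container can be restarted later."""
    if should_exit(started_at, time.monotonic(), max_age):
        sys.exit(0)
```

Choosing ten hours against 12 hour credentials leaves a two hour margin, so the restart happens well before Litestream loses its ability to write.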

Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve those fresh S3 credentials. Fly persists environment variable secrets between restarts to the container, so that secret can be long-lived even while it is exchanged for short-term S3 credentials.
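One way to build a long-lived token like this is an HMAC signature over the team ID. This is a sketch using only the Python standard library; the token format, the function names and the handling of the signing key are assumptions, since the post does not describe how the real tokens are signed.

```python
import hashlib
import hmac


def sign_team_token(team_id, signing_key):
    """Return a token of the form "<team_id>.<hex signature>"."""
    signature = hmac.new(signing_key, team_id.encode(), hashlib.sha256).hexdigest()
    return f"{team_id}.{signature}"


def verify_team_token(token, signing_key):
    """Return the team ID if the signature is valid, otherwise None."""
    team_id, _, signature = token.rpartition(".")
    expected = hmac.new(signing_key, team_id.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    if team_id and hmac.compare_digest(signature.encode(), expected.encode()):
        return team_id
    return None
```

The management server would sign the token once at launch and verify it each time a container asks for fresh credentials, using the recovered team ID to decide which prefix the credentials cover.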

I only just put the new backup system in place, so I’m exercising it a bit before I open things up to trial users—but so far it’s looking like a very robust solution to the problem.

s3-ocr improvements

I released a few new versions of s3-ocr this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.

The two main improvements are:

  • A new --dry-run option to s3-ocr start which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. #22
  • s3-ocr start used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. #21
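The pause-and-retry behaviour could look roughly like this. ConcurrencyLimitError, start_job and the delay values are stand-ins invented for this sketch; the real tool has to recognise whatever error the OCR service actually returns.

```python
import time


class ConcurrencyLimitError(Exception):
    """Stand-in for the service's too-many-concurrent-jobs error."""


def start_with_retry(start_job, key, max_attempts=5, delay=30.0, sleep=time.sleep):
    """Start one job, pausing and retrying when the concurrency limit is hit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_job(key)
        except ConcurrencyLimitError:
            if attempt == max_attempts:
                # Give up after the final attempt and surface the error
                raise
            # Wait for running jobs to finish before trying again
            sleep(delay)
```

Passing sleep in as a parameter keeps the function easy to exercise without real waiting.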

The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!

Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use @cli.command as a decorator instead of @cli.command(). s3-ocr used the new syntax, so installing it in an environment that already had an older version of Click would result in silent errors.

The solution is simple: pin to click>=8.1.0 in the project dependencies if you plan to use this new syntax.
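To see why a failure like this can be silent, here is a simplified stand-in for the two behaviours, written without Click itself. OldStyleGroup and NewStyleGroup are invented classes that only mimic how a bare decorator is handled, based on my understanding of the change; they are not Click's code.

```python
class OldStyleGroup:
    """Mimics a group whose command() must be called: @cli.command()."""

    def __init__(self):
        self.commands = {}

    def command(self, name=None):
        def decorator(func):
            self.commands[name or func.__name__] = func
            return func

        return decorator


class NewStyleGroup(OldStyleGroup):
    """Mimics a group that also accepts the bare form: @cli.command."""

    def command(self, name=None):
        if callable(name):
            # Bare usage: the function itself arrived as the first argument
            func = name
            self.commands[func.__name__] = func
            return func
        return super().command(name)


old_cli = OldStyleGroup()
new_cli = NewStyleGroup()


@old_cli.command  # the function is mistaken for the name; nothing is registered
def start_old():
    return "started"


@new_cli.command  # the bare form is recognised, so the command is registered
def start():
    return "started"
```

In the old-style case no exception is raised at import time: the command simply never appears in old_cli.commands.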

If I’d read the Click changelog more closely I would have saved myself a whole lot of time.

Issues #25 and #26 detail the many false turns I took trying to figure this out.

More fun with GPT-3 and DALL-E

This tweet scored over a million impressions on Twitter:

As this got retweeted outside of my usual circles it started confusing people who thought the “prototype” was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn’t kidding when I said I spent 60 seconds on this.

I also figured out how to use GPT-3 to write jq one-liners. I love jq but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: Using GPT-3 to figure out jq recipes

Releases this week

TIL this week