Simon Willison’s Weblog



14th April 2020

I released a fun little Datasette utility today: datasette-clone.

It’s a command-line tool for cloning a Datasette instance down to your local hard drive—the name is inspired by the git clone command.

Here’s how to use it to create a local clone of all of the data on (discussed previously):

pip install datasette-clone

If you give the command the URL to a public Datasette instance it will iterate through the list of available SQLite database files (by hitting the /-/databases.json endpoint) and download each of them.
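To make that flow concrete, here is a minimal sketch of the listing step in Python. It is not the tool's actual source: the helper names are hypothetical, `https://example.com/` stands in for a real instance, and it assumes each database file is served at `<base URL>/<name>.db`.

```python
import json
import urllib.request


def databases_json_url(base_url):
    # Normalise the trailing slash so we never build "//-/databases.json".
    return base_url.rstrip("/") + "/-/databases.json"


def download_urls(base_url, databases):
    # Assumption: each database file lives at <base>/<name>.db.
    # In-memory databases have no file on disk, so they are skipped.
    base = base_url.rstrip("/")
    return {
        db["name"]: "{}/{}.db".format(base, db["name"])
        for db in databases
        if not db.get("is_memory")
    }


def list_databases(base_url):
    # Fetch and decode the list of databases from the live instance.
    with urllib.request.urlopen(databases_json_url(base_url)) as response:
        return json.load(response)
```

Calling `download_urls(url, list_databases(url))` would then give a mapping of database names to the files that need fetching.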

You can give it an optional second argument for the directory you would like to store the data in:

datasette-clone covid-19

Add -v to see debugging output showing what it’s doing.

The tool also pulls a copy of that databases.json file, and stores it alongside the downloaded database files. That file looks something like this:

[
  {
    "name": "covid",
    "path": "covid.db",
    "size": 12038144,
    "is_mutable": false,
    "is_memory": false,
    "hash": "453e1090ca379bde05d86c2db35f80235a58a2b52c92dda4463c25f6b3a9211d"
  }
]

That "hash" key is the sha256 hash of the file contents. The next time you run the datasette-clone command it will compare the cached databases.json file with the live one, and only download database files that have changed.
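The comparison can be sketched roughly like this. Again, this is an illustration rather than the tool's real code: the function names are hypothetical, and treating a database with no hash as always needing a fresh download is my own assumption.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so large databases need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def databases_to_fetch(cached, live):
    # cached and live are the decoded databases.json lists.
    cached_hashes = {db["name"]: db.get("hash") for db in cached}
    return [
        db["name"]
        for db in live
        # Assumption: with no hash to compare, always download again.
        if db.get("hash") is None
        or cached_hashes.get(db["name"]) != db["hash"]
    ]
```

`sha256_of_file` could also be used to check that a file already on disk still matches the hash recorded in the cached databases.json.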

In this way, datasette-clone can be easily used to maintain a mirror of any public Datasette instance that you find interesting.

I built the command with the intention of using it in a GitHub Action: I’m increasingly using Actions to generate or update databases, and I often find myself wanting to download the previous database copy, update it in some way and then deploy the result.

My plan was to use datasette-clone in conjunction with the actions/cache action to cache copies of the database files locally (actions have a 5GB cache storage limit) and make my download step more efficient.

Unfortunately it turns out that doesn’t work for most of my projects, because actions/cache currently only works for push and pull_request events, and most of my repositories are updated by scheduled workflows!

Hopefully they’ll fix that limitation at some point in the future. In the meantime, datasette-clone is still a useful tool for creating clones of public Datasette instances for other reasons.

Update, 23rd June 2020: They fixed that limitation in actions/cache@v2.