datasette-clone
14th April 2020
I released a fun little Datasette utility today: datasette-clone.
It’s a command-line tool for cloning a Datasette instance down to your local hard drive. The name is inspired by the git clone command.
Here’s how to use it to create a local clone of all of the data on covid-19.datasettes.com (discussed previously):
pip install datasette-clone
datasette-clone https://covid-19.datasettes.com/
If you give the command the URL to a public Datasette instance it will iterate through the list of available SQLite database files (by hitting the /-/databases.json endpoint) and download each of them.
You can give it an optional second argument for the directory you would like to store the data in:
datasette-clone https://covid-19.datasettes.com/ covid-19
Add -v to see debugging output showing what it’s doing.
The tool also pulls a copy of that databases.json file, and stores it alongside the downloaded database files. That file looks something like this:
[
  {
    "name": "covid",
    "path": "covid.db",
    "size": 12038144,
    "is_mutable": false,
    "is_memory": false,
    "hash": "453e1090ca379bde05d86c2db35f80235a58a2b52c92dda4463c25f6b3a9211d"
  }
]
That "hash" key is the sha256 hash of the file contents. The next time you run the datasette-clone command it will compare the cached databases.json file with the live one, and only download database files that have changed.
In this way, datasette-clone can be easily used to maintain a mirror of any public Datasette instance that you find interesting.
I built the command with the intention of using it in a GitHub Action: I’m increasingly using Actions to generate or update databases, and I often find myself wanting to download the previous database copy, update it in some way and then deploy the result.
My plan was to use datasette-clone in conjunction with the actions/cache action to cache copies of the database files locally (actions have a 5GB cache storage limit) and make my download step more efficient.
Unfortunately it turns out that doesn’t work for most of my projects, because actions/cache currently only works for push and pull_request events, and most of my repositories are updated by scheduled workflows!
Hopefully they’ll fix that limitation at some point in the future. In the meantime, datasette-clone is still a useful tool for creating clones of public Datasette instances for other reasons.
Update, 23rd June 2020: They fixed that limitation in actions/cache@v2.