datasette-clone
14th April 2020
I released a fun little Datasette utility today: datasette-clone.
It’s a command-line tool for cloning a Datasette instance down to your local hard drive. The name is inspired by the git clone command.
Here’s how to use it to create a local clone of all of the data on covid-19.datasettes.com (discussed previously):
pip install datasette-clone
datasette-clone https://covid-19.datasettes.com/
If you give the command the URL to a public Datasette instance it will iterate through the list of available SQLite database files (by hitting the /-/databases.json endpoint) and download each of them.
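That flow can be sketched in a few lines of Python. This is a rough illustration, not the tool's actual source: the function names are hypothetical, and it assumes the instance allows database downloads and serves each file at /<name>.db.

```python
import json
import urllib.request
from urllib.parse import urljoin


def normalise_base(base_url):
    """Ensure the base URL ends with a slash so urljoin appends to it."""
    return base_url if base_url.endswith("/") else base_url + "/"


def fetch_database_list(base_url):
    """Fetch and parse the instance's /-/databases.json listing."""
    url = urljoin(normalise_base(base_url), "-/databases.json")
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def database_urls(base_url, databases):
    """Return (name, download_url) pairs for every on-disk database.

    Assumption: each database file is downloadable at <base>/<name>.db.
    In-memory databases have no file to download, so they are skipped.
    """
    base = normalise_base(base_url)
    return [
        (db["name"], urljoin(base, db["name"] + ".db"))
        for db in databases
        if not db.get("is_memory")
    ]
```

Calling database_urls(base_url, fetch_database_list(base_url)) would then give the list of files to download; the download itself is left out to keep the sketch short.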
You can give it an optional second argument for the directory you would like to store the data in:
datasette-clone https://covid-19.datasettes.com/ covid-19
Add -v to see debugging output showing what it’s doing.
The tool also pulls a copy of that databases.json file, and stores it alongside the downloaded database files. That file looks something like this:
[
    {
        "name": "covid",
        "path": "covid.db",
        "size": 12038144,
        "is_mutable": false,
        "is_memory": false,
        "hash": "453e1090ca379bde05d86c2db35f80235a58a2b52c92dda4463c25f6b3a9211d"
    }
]
That "hash" key is the sha256 hash of the file contents. The next time you run the datasette-clone command it will compare the cached databases.json file with the live one, and only download database files that have changed.
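The comparison step might look something like the following sketch. The function names are hypothetical and this is not the tool's actual code. It also assumes that a database with no hash in the live listing should always be downloaded again, since there is nothing to compare against.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def databases_to_download(cached, live):
    """Return names of databases that are new or whose hash has changed.

    `cached` and `live` are lists of dicts shaped like databases.json.
    Assumption: a missing or null hash means "always download".
    """
    cached_hashes = {db["name"]: db.get("hash") for db in cached}
    return [
        db["name"]
        for db in live
        if db.get("hash") is None or cached_hashes.get(db["name"]) != db["hash"]
    ]
```

On a first run there is no cached databases.json yet, so passing an empty list for cached marks every database for download.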
In this way, datasette-clone can be easily used to maintain a mirror of any public Datasette instance that you find interesting.
I built the command with the intention of using it in a GitHub Action: I’m increasingly using Actions to generate or update databases, and I often find myself wanting to download the previous database copy, update it in some way and then deploy the result.
My plan was to use datasette-clone in conjunction with the actions/cache action to cache copies of the database files locally (actions have a 5GB cache storage limit) and make my download step more efficient.
Unfortunately it turns out that doesn’t work for most of my projects, because actions/cache currently only works for push and pull_request events, and most of my repositories are updated by scheduled workflows!
Hopefully they’ll fix that limitation at some point in the future. In the meantime, datasette-clone is still a useful tool for creating clones of public Datasette instances for other reasons.
Update, 23rd June 2020: They fixed that limitation in actions/cache@v2.