datasette-clone
14th April 2020
I released a fun little Datasette utility today: datasette-clone.
It’s a command-line tool for cloning a Datasette instance down to your local hard drive. The name is inspired by the git clone command.
Here’s how to use it to create a local clone of all of the data on covid-19.datasettes.com (discussed previously):
pip install datasette-clone
datasette-clone https://covid-19.datasettes.com/
If you give the command the URL to a public Datasette instance it will iterate through the list of available SQLite database files (by hitting the /-/databases.json endpoint) and download each of them.
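As a rough illustration of that loop, here is a minimal Python sketch. It is not the tool's actual source: the function names are hypothetical, and it assumes the instance allows downloads and serves each database file at /<name>.db, which is how Datasette normally exposes them.

```python
import json
import pathlib
import urllib.request
from urllib.parse import quote, urljoin


def database_urls(base_url, databases):
    """Map each database name to the URL its file is assumed to live at."""
    base = base_url.rstrip("/") + "/"
    return {
        db["name"]: urljoin(base, quote(db["name"]) + ".db")
        for db in databases
        if not db.get("is_memory")  # in-memory databases have no file to fetch
    }


def clone_sketch(base_url, directory="."):
    """Fetch /-/databases.json, then download every database it lists."""
    base = base_url.rstrip("/") + "/"
    with urllib.request.urlopen(urljoin(base, "-/databases.json")) as response:
        databases = json.load(response)
    target = pathlib.Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    for name, url in database_urls(base_url, databases).items():
        # Assumption: the file is saved locally as <name>.db
        urllib.request.urlretrieve(url, str(target / (name + ".db")))
    return databases
```

Calling clone_sketch("https://covid-19.datasettes.com/", "covid-19") would attempt the same kind of download as the command above, without any of the real tool's caching or error handling.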
You can give it an optional second argument for the directory you would like to store the data in:
datasette-clone https://covid-19.datasettes.com/ covid-19
Add -v to see debugging output showing what it’s doing.
The tool also pulls a copy of that databases.json file, and stores it alongside the downloaded database files. That file looks something like this:
[
    {
        "name": "covid",
        "path": "covid.db",
        "size": 12038144,
        "is_mutable": false,
        "is_memory": false,
        "hash": "453e1090ca379bde05d86c2db35f80235a58a2b52c92dda4463c25f6b3a9211d"
    }
]
That "hash" key is the sha256 hash of the file contents. The next time you run the datasette-clone command it will compare the cached databases.json file with the live one, and only download database files that have changed.
In this way, datasette-clone can easily be used to maintain a mirror of any public Datasette instance that you find interesting.
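If you want to check a mirrored file against the "hash" value in databases.json yourself, a short sketch like this should do it. The helper names are hypothetical and this is not part of datasette-clone.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, entry):
    """True if the local file's digest equals the entry's "hash" value."""
    return sha256_of_file(path) == entry.get("hash")
```

Reading in chunks keeps memory use flat even for database files that run to several gigabytes.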
I built the command with the intention of using it in a GitHub Action: I’m increasingly using Actions to generate or update databases, and I often find myself wanting to download the previous database copy, update it in some way and then deploy the result.
My plan was to use datasette-clone in conjunction with the actions/cache action to cache copies of the database files locally (actions have a 5GB cache storage limit) and make my download step more efficient.
Unfortunately it turns out that doesn’t work for most of my projects, because actions/cache currently only works for push and pull_request events, and most of my repositories are updated by scheduled workflows!
Hopefully they’ll fix that limitation at some point in the future. In the meantime, datasette-clone is still a useful tool for creating clones of public Datasette instances for other reasons.
Update, 23rd June 2020: They fixed that limitation in actions/cache@v2.