Using “import refs” to iteratively import data into Django
4th November 2017
I’ve been writing a few scripts to backfill my blog with content I originally posted elsewhere. So far I’ve imported answers I posted on Quora (background), answers I posted on Ask MetaFilter and content I recovered from the Internet Archive.
I started out writing custom import scripts (like this Quora one), but I’ve now built a generalized mechanism for this which I thought was worth writing up.
All of my content imports now take the form of a JSON document, which looks something like this:
    [
      {
        "body": "<p><em>My answer to ...</em></p>",
        "tags": [
          "backpacks",
          "laptops",
          "style",
          "accessories",
          "bags"
        ],
        "title": "I need a new backpack",
        "datetime": "2005-01-16T14:08:00",
        "import_ref": "askmetafilter:14075",
        "type": "entry",
        "slug": "i-need-a-new-backpack"
      }
    ]
Two larger examples: the missing content I extracted from the Internet Archive, and the answers I scraped from Ask MetaFilter.
The type property can be set to entry, quotation or blogmark and specifies which type of content should be imported. The datetime, slug and tags fields are common across all three types; the other fields differ for each type.
The most interesting field here is import_ref. This is optional, but if provided it forms a unique reference associated with that item of content. I then use that reference in a call to Django’s update_or_create() method. This means I can run the same import multiple times: the first run will create objects, while subsequent runs update objects in place.
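To make the create-then-update behaviour concrete, here is a minimal sketch of the pattern. It is not the actual import command: InMemoryManager is a hypothetical stand-in for a Django model manager (something like Entry.objects) so the example runs without a database, and import_item is an illustrative helper name. The update_or_create(defaults=..., **lookup) call shape and its (object, created) return value match Django's.

```python
class InMemoryManager:
    """Hypothetical stand-in for a Django model manager such as Entry.objects."""

    def __init__(self):
        self.rows = []

    def create(self, **fields):
        row = dict(fields)
        self.rows.append(row)
        return row

    def update_or_create(self, defaults=None, **lookup):
        # Same shape as Django's update_or_create(): find a row matching
        # the lookup, update it with defaults, or create a new one.
        defaults = defaults or {}
        for row in self.rows:
            if all(row.get(key) == value for key, value in lookup.items()):
                row.update(defaults)
                return row, False
        return self.create(**lookup, **defaults), True


def import_item(manager, item):
    """Create or update one item, keyed on import_ref when it is present."""
    # Assumption: type picks the model and tags need separate handling,
    # so neither is passed through as a plain field here.
    fields = {
        key: value
        for key, value in item.items()
        if key not in ("type", "tags", "import_ref")
    }
    import_ref = item.get("import_ref")
    if import_ref is None:
        # No reference to match against, so every run creates a new object.
        return manager.create(**fields), True
    return manager.update_or_create(import_ref=import_ref, defaults=fields)


manager = InMemoryManager()
item = {
    "type": "entry",
    "title": "I need a new backpack",
    "slug": "i-need-a-new-backpack",
    "import_ref": "askmetafilter:14075",
}
_, created_first = import_item(manager, item)

# Re-running with an improved version of the same item updates it in place.
item["title"] = "I need a new backpack (updated)"
_, created_second = import_item(manager, item)
```

With a real model the manager argument would simply be the model's objects manager, and the second run would issue an UPDATE rather than inserting a duplicate row.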
The end result is that I can incrementally improve the scrapers I am writing, re-importing the resulting JSON to update previously imported records in-place. In addition to hacking on my blog, I’ve been using this pattern for some API integrations at work recently and it’s worked out very well.
import_ref is defined on my models as a unique, nullable text field:
import_ref = models.TextField(max_length=64, null=True, unique=True)
Since the Django admin doesn’t handle nullable fields well by default, I added import_ref to the readonly_fields property in my admin configuration to avoid accidentally setting it to a blank string when editing through the admin interface.
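The blank string matters because of how unique constraints treat NULL: a unique column accepts any number of NULL values but only a single empty string, so a second record saved with a blank import_ref would collide with the first. Here is a small sketch using Python's built-in sqlite3 module purely for illustration; the blog_entry table name is hypothetical and the database behind the blog may differ.

```python
import sqlite3


def demo_null_vs_blank():
    """Show that a UNIQUE column allows repeated NULLs but not repeated ''."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blog_entry (id INTEGER PRIMARY KEY, import_ref TEXT UNIQUE)"
    )

    # Two rows with no import_ref: NULLs never conflict with each other.
    conn.execute("INSERT INTO blog_entry (import_ref) VALUES (NULL)")
    conn.execute("INSERT INTO blog_entry (import_ref) VALUES (NULL)")
    null_rows = conn.execute(
        "SELECT COUNT(*) FROM blog_entry WHERE import_ref IS NULL"
    ).fetchone()[0]

    # The first blank string is accepted; the second violates the constraint.
    conn.execute("INSERT INTO blog_entry (import_ref) VALUES ('')")
    try:
        conn.execute("INSERT INTO blog_entry (import_ref) VALUES ('')")
        blank_rejected = False
    except sqlite3.IntegrityError:
        blank_rejected = True

    conn.close()
    return null_rows, blank_rejected


null_rows, blank_rejected = demo_null_vs_blank()
```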
Here’s my completed import_blog_json management command.
My workflow for importing data is now pretty streamlined. I write the scrapers in a Jupyter notebook and use that to generate a list of importable items as Python dictionaries. I run open('/tmp/items.json', 'w').write(json.dumps(items, indent=2)) to dump the items to a JSON file. Then I can run ./manage.py import_blog_json /tmp/items.json to import them into my local development environment. Thanks to the import_ref I can do this as many times as I like until I’m pleased with the result.
Once it’s ready, I run !cat /tmp/items.json | pbcopy in Jupyter to copy the JSON to my clipboard, then paste the JSON into a new GitHub Gist. I then copy the URL to that raw JSON and run the import against my production instance.
Heroku tip: running heroku run bash will start a bash prompt in a dyno hooked up to your application. You can then run ./manage.py ... commands against your production environment.
So… I just have to run heroku run bash followed by ./manage.py import_blog_json https://gist.github.com/path-to-json --tag_with=askmetafilter and the new content will be live on my site.
The tag_with option allows me to specify a tag to apply to all of that imported content, useful for checking that everything worked as expected.
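As a rough illustration of how a command like this can accept either a local path or a URL plus an optional tag, here is a sketch using plain argparse (Django management commands declare their arguments through the same parser interface). This is not the linked command itself: load_items, build_parser and apply_tag are hypothetical helper names, and only the positional argument and the --tag_with option come from the invocations shown above.

```python
import argparse
import json
from urllib.request import urlopen


def load_items(url_or_path):
    """Read the list of items from an http(s) URL or a local file path."""
    if url_or_path.startswith(("http://", "https://")):
        with urlopen(url_or_path) as response:
            return json.loads(response.read().decode("utf-8"))
    with open(url_or_path) as fp:
        return json.load(fp)


def build_parser():
    parser = argparse.ArgumentParser(prog="import_blog_json")
    parser.add_argument("url_or_path", help="URL or file path of the JSON to import")
    parser.add_argument(
        "--tag_with",
        default=None,
        help="Extra tag to apply to every imported item",
    )
    return parser


def apply_tag(items, tag):
    """Return copies of the items with the tag appended where it is missing."""
    if tag is None:
        return items
    tagged = []
    for item in items:
        tags = list(item.get("tags", []))
        if tag not in tags:
            tags.append(tag)
        tagged.append({**item, "tags": tags})
    return tagged


args = build_parser().parse_args(["/tmp/items.json", "--tag_with=askmetafilter"])
example = apply_tag([{"title": "I need a new backpack", "tags": ["bags"]}], args.tag_with)
```

Tagging everything from one run makes it easy to pull up the whole batch on a single tag page and spot anything that imported badly.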