Recovering missing content from the Internet Archive
8th October 2017
When I restored my blog last weekend I used the most recent SQL backup of my blog’s database, from back in 2010. I thought it had all of my content from before I started my 7-year hiatus, but while watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. It turns out the SQL backup I was working from was missing some content.
Thank goodness then for the Wayback Machine at the Internet Archive! I tried some of the missing URLs there and found they had been captured and preserved. But how to get them back?
A quick search turned up wayback-machine-downloader, an open-source Ruby script that claims to be able to “download an entire website from the Internet Archive Wayback Machine”. I gem installed it and tried it out (after some cargo-cult incantations to work around some weird certificate errors I was seeing):
rvm osx-ssl-certs update all
gem update --system
gem install wayback_machine_downloader
wayback_machine_downloader http://simonwillison.net/
And it worked! I left it running overnight and came back to a folder containing 18,952 HTML files, neatly arranged in a directory structure that matched my site:
$ find . | more
.
./simonwillison.net
./simonwillison.net/2002
./simonwillison.net/2002/Aug
./simonwillison.net/2002/Aug/1
./simonwillison.net/2002/Aug/1/cetis
./simonwillison.net/2002/Aug/1/cetis/index.html
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial/index.html
...
I tarred them up into an archive and backed them up to Dropbox.
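For completeness, here’s a minimal sketch of that tarring step in Python; the archive name is a hypothetical, not the one I actually used:

import tarfile

# Bundle the downloaded site into a single compressed archive.
# "wayback-backup.tar.gz" is an illustrative name.
with tarfile.open("wayback-backup.tar.gz", "w:gz") as tar:
    tar.add("simonwillison.net")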
Next challenge: how to restore the missing content?
I’m a recent and enthusiastic adopter of Jupyter notebooks. As a huge fan of development in a REPL I’m shocked I was so late to this particular party. So I fired up Jupyter and used it to start playing with the data.
Here’s the final version of my notebook. I ended up with a script that did the following:
- Load in the full list of paths from the tar archive, and filter for just the ones matching the /YYYY/Mon/DD/slug/ format used for my blog content (see the first sketch after this list)
- Talk to my local Django development environment and load in the full list of actual content URLs represented in that database.
- Calculate the difference between the two—those are the 213 items that need to be recovered.
- For each of those 213 items, load the full HTML that had been saved by the Internet Archive and feed it into the BeautifulSoup HTML parsing library.
- Detect whether each one is an entry, a blogmark or a quotation, then scrape the key content out of each one based on its type (see the second sketch after this list)
- Scrape the tags for each item, using this delightful one-liner:
[a.text for a in soup.findAll('a', {'rel': 'tag'})]
- Scrape the comments for each item separately. These were mostly spam, so I haven’t yet recovered them for publication (I need to do some aggressive spam filtering first). I have, however, stashed them in the database for later processing.
- Write all of the scraped data out to a giant JSON file and upload it to a gist (a nice cheap way of giving it a URL).
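Here’s a minimal sketch of the first three steps (the first sketch mentioned in the list). The archive name, the exact regular expression and the source of the existing paths are all assumptions for illustration; in the real notebook the existing URLs came from my local Django development environment:

import re
import tarfile

# Hypothetical archive name from the tar step earlier
archive = tarfile.open("wayback-backup.tar.gz")

# Match the /YYYY/Mon/DD/slug/ content format; the leading "./"
# depends on how the archive was created
pattern = re.compile(
    r"^\./simonwillison\.net/\d{4}/[A-Z][a-z]{2}/\d{1,2}/[^/]+/index\.html$"
)
wayback_paths = {name for name in archive.getnames() if pattern.match(name)}

# In the notebook this came from my Django database; here it is
# just an assumed set of paths in the same format
existing_paths = set()

missing = wayback_paths - existing_paths
print(len(missing))  # 213 in my case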
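And the second sketch, covering the BeautifulSoup scraping. Only the rel="tag" one-liner comes straight from the notebook; the class names used to detect the item type are assumptions about my old markup:

import json
from bs4 import BeautifulSoup

def scrape_item(html):
    soup = BeautifulSoup(html, "html.parser")
    # Detect the item type; these class names are guesses at the old markup
    item_type = next(
        (c for c in ("entry", "blogmark", "quotation")
         if soup.find("div", {"class": c})),
        None,
    )
    # The tag-scraping one-liner from the list above
    tags = [a.text for a in soup.findAll("a", {"rel": "tag"})]
    return {"type": item_type, "tags": tags}

# Run scrape_item over every missing page, then dump the results
items = []  # e.g. [scrape_item(html) for html in missing_pages]
with open("recovered-blog-content.json", "w") as fp:
    json.dump(items, fp)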
Having executed the above script, I had a JSON file containing the parsed content for all of the missing items found in the Wayback Machine. All I needed then was a script that could take that JSON and turn it into records in the database. I implemented that as a custom Django management command and deployed it to Heroku.
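As a rough sketch of the shape such a management command takes (the module path is hypothetical, and the record-creation logic is elided because it depends on my schema):

# Hypothetical location: blog/management/commands/import_blog_json.py
import json
from urllib.request import urlopen

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Import recovered blog content from a JSON file at a URL"

    def add_arguments(self, parser):
        parser.add_argument("--url_to_json", required=True)
        parser.add_argument("--tag_with", default=None)

    def handle(self, *args, **options):
        items = json.load(urlopen(options["url_to_json"]))
        for item in items:
            # Creating the entry/blogmark/quotation records and applying
            # the tag happens here; the details depend on the schema
            self.stdout.write("Importing: %s" % item.get("type"))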
Last step: shell into a Heroku dyno (using heroku run bash) and run the following:
./manage.py import_blog_json \
--url_to_json=https://gist.github.com/simonw/5a5bc1f58297d2c7d68dd7448a4d6614/raw/28d5d564ae3fe7165802967b0f9c4eff6091caf0/recovered-blog-content.json \
--tag_with=recovered
The result: 213 recovered items (which I tagged with recovered so I could easily browse them). Including the most important entry on my whole site, my write-up of my wedding!
So thank you very much to the Internet Archive team, and thank you Hartator for your extremely useful wayback-machine-downloader tool.