Recovering missing content from the Internet Archive
8th October 2017
When I restored my blog last weekend I used the most recent SQL backup of my blog’s database from back in 2010. I thought it had all of my content from before I started my seven-year hiatus, but while watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. Turns out the SQL backup I was working from was missing some content.
Thank goodness then for the Wayback Machine at the Internet Archive! I tried some of the missing URLs there and found they had been captured and preserved. But how to get them back?
A quick search turned up wayback-machine-downloader, an open-source Ruby script that claims to be able to “download an entire website from the Internet Archive Wayback Machine”. I installed it with gem and tried it out (after some cargo-cult incantations to work around some weird certificate errors I was seeing):
rvm osx-ssl-certs update all
gem update --system
gem install wayback_machine_downloader
wayback_machine_downloader http://simonwillison.net/
And it worked! I left it running overnight and came back to a folder containing 18,952 HTML files, neatly arranged in a directory structure that matched my site:
$ find . | more
.
./simonwillison.net
./simonwillison.net/2002
./simonwillison.net/2002/Aug
./simonwillison.net/2002/Aug/1
./simonwillison.net/2002/Aug/1/cetis
./simonwillison.net/2002/Aug/1/cetis/index.html
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial/index.html
...
I tarred them up into an archive and backed them up to Dropbox.
Next challenge: how to restore the missing content?
I’m a recent and enthusiastic adopter of Jupyter notebooks. As a huge fan of development in a REPL, I’m shocked I was so late to this particular party. So I fired up Jupyter and used it to start playing with the data.
Here’s the final version of my notebook. I ended up with a script that did the following:
- Load in the full list of paths from the tar archive, and filter for just the ones matching the /YYYY/Mon/DD/slug/ format used for my blog content (sketched in the first snippet after this list).
- Talk to my local Django development environment and load in the full list of actual content URLs represented in that database.
- Calculate the difference between the two—those are the 213 items that need to be recovered.
- For each of those 213 items, load the full HTML that had been saved by the Internet Archive and feed it into the BeautifulSoup HTML parsing library.
- Detect whether each one is an entry, a blogmark or a quotation, and scrape the key content out of each one based on the type (see the second snippet after this list).
- Scrape the tags for each item, using this delightful one-liner:
[a.text for a in soup.findAll('a', {'rel': 'tag'})]
- Scrape the comments for each item separately. These were mostly spam, so I haven’t yet recovered them for publication (I need to do some aggressive spam filtering first). I have, however, stashed them in the database for later processing.
- Write all of the scraped data out to a giant JSON file and upload it to a gist (a nice cheap way of giving it a URL).
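Roughly, the first three of those steps looked something like this. This is a sketch rather than the exact notebook code: the archive filename, the regular expression and the local JSON endpoint for exporting known URLs are all assumptions.

import re
import tarfile

import requests

# Load every file path stored in the tar archive of Wayback Machine HTML
# (the archive filename here is illustrative).
with tarfile.open('simonwillison-net-wayback.tar') as tar:
    paths = [member.name for member in tar.getmembers() if member.isfile()]

# Keep just the paths matching the /YYYY/Mon/DD/slug/ blog content format.
entry_re = re.compile(r'simonwillison\.net(/\d{4}/\w{3}/\d{1,2}/[^/]+/)index\.html$')
archived_urls = set()
for path in paths:
    match = entry_re.search(path)
    if match:
        archived_urls.add(match.group(1))

# Load the full list of content URLs already in my local Django database
# (this JSON endpoint is hypothetical - any way of exporting the URLs would do).
known_urls = set(requests.get('http://localhost:8000/all-content-urls.json').json())

# The difference between the two is the set of items that need recovering.
missing_urls = sorted(archived_urls - known_urls)
print(len(missing_urls), 'items to recover')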
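The scraping itself, again as a rough sketch: the div class names used to tell entries, blogmarks and quotations apart are assumptions based on my templates, not the notebook’s exact selectors.

import json

from bs4 import BeautifulSoup

def scrape_item(html, url):
    soup = BeautifulSoup(html, 'html.parser')
    # Detect the content type - the class names checked here are assumptions;
    # the real notebook keyed off the markup my actual templates produced.
    if soup.find('div', {'class': 'entry'}):
        item_type = 'entry'
    elif soup.find('div', {'class': 'blogmark'}):
        item_type = 'blogmark'
    else:
        item_type = 'quotation'
    heading = soup.find('h2')
    return {
        'url': url,
        'type': item_type,
        'title': heading.text.strip() if heading else None,
        # The tag-scraping one-liner from above:
        'tags': [a.text for a in soup.findAll('a', {'rel': 'tag'})],
    }

# Parse every missing item and write the results out as one big JSON file,
# ready to be uploaded to a gist.
recovered = []
for url in missing_urls:
    with open('./simonwillison.net' + url + 'index.html') as fp:
        recovered.append(scrape_item(fp.read(), url))

with open('recovered-blog-content.json', 'w') as fp:
    json.dump(recovered, fp, indent=2)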
Having executed the above script, I now have a JSON file containing the parsed content for all of the missing items found in the Wayback Machine. All I needed then was a script which could take that JSON and turn it into records in the database. I implemented that as a custom Django management command and deployed it to Heroku.
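The command looks roughly like this. Again, a simplified sketch: the argument names match the invocation below, but the model names, fields and the handling of each item type are stripped-down assumptions rather than the real code.

# blog/management/commands/import_blog_json.py - a simplified sketch
import requests
from django.core.management.base import BaseCommand

from blog.models import Entry, Tag  # model and field names are assumptions

class Command(BaseCommand):
    help = 'Import recovered blog content from a JSON file hosted at a URL'

    def add_arguments(self, parser):
        parser.add_argument('--url_to_json', required=True)
        parser.add_argument('--tag_with', default=None)

    def handle(self, *args, **options):
        items = requests.get(options['url_to_json']).json()
        for item in items:
            # The real command branches on item['type'] to create entries,
            # blogmarks or quotations; this sketch only creates entries.
            entry = Entry.objects.create(
                title=item.get('title') or '',
                body=item.get('body') or '',
            )
            tags = list(item.get('tags', []))
            if options['tag_with']:
                tags.append(options['tag_with'])
            for tag_name in tags:
                tag, _ = Tag.objects.get_or_create(tag=tag_name)
                entry.tags.add(tag)
            self.stdout.write('Imported %s' % item.get('url', ''))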
Last step: shell into a Heroku dyno (using heroku run bash) and run the following:
./manage.py import_blog_json \
    --url_to_json=https://gist.github.com/simonw/5a5bc1f58297d2c7d68dd7448a4d6614/raw/28d5d564ae3fe7165802967b0f9c4eff6091caf0/recovered-blog-content.json \
    --tag_with=recovered
The result: 213 recovered items (which I tagged with recovered so I could easily browse them). Including the most important entry on my whole site, my write-up of my wedding!
So thank you very much to the Internet Archive team, and thank you Hartator for your extremely useful wayback-machine-downloader tool.