Recovering missing content from the Internet Archive

8th October 2017

When I restored my blog last weekend I used the most recent SQL backup of my blog’s database, from back in 2010. I thought it had all of my content from before I started my 7-year hiatus, but while watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. It turns out the SQL backup I was working from was missing some content.

Thank goodness then for the Wayback Machine at the Internet Archive! I tried some of the missing URLs there and found they had been captured and preserved. But how to get them back?

A quick search turned up wayback-machine-downloader, an open-source Ruby script that claims to be able to “Download an entire website from the Internet Archive Wayback Machine”. I installed the gem and tried it out (after some cargo cult incantations to work around some weird certificate errors I was seeing):

rvm osx-ssl-certs update all
gem update --system
gem install wayback_machine_downloader

wayback_machine_downloader http://simonwillison.net/

And it worked! I left it running overnight and came back to a folder containing 18,952 HTML files, neatly arranged in a directory structure that matched my site:

$ find . | more
.
./simonwillison.net
./simonwillison.net/2002
./simonwillison.net/2002/Aug
./simonwillison.net/2002/Aug/1
./simonwillison.net/2002/Aug/1/cetis
./simonwillison.net/2002/Aug/1/cetis/index.html
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial/index.html
...

I tarred them up into an archive and backed them up to Dropbox.

Next challenge: how to restore the missing content?

I’m a recent and enthusiastic adopter of Jupyter notebooks. As a huge fan of development in a REPL, I’m shocked I was so late to this particular party. So I fired up Jupyter and used it to start playing with the data.

Here’s the final version of my notebook. I ended up with a script that did the following:

  • Load in the full list of paths from the tar archive, and filter for just the ones matching the /YYYY/Mon/DD/slug/ format used for my blog content (there’s a rough sketch of these steps after this list)
  • Talk to my local Django development environment and load in the full list of actual content URLs represented in that database.
  • Calculate the difference between the two—those are the 213 items that need to be recovered.
  • For each of those 213 items, load the full HTML that had been saved by the Internet Archive and feed it into the BeautifulSoup HTML parsing library.
  • Detect whether each one is an entry, a blogmark or a quotation, and scrape the key content out of each one based on the type.
  • Scrape the tags for each item, using this delightful one-liner: [a.text for a in soup.findAll('a', {'rel': 'tag'})]
  • Scrape the comments for each item separately. These were mostly spam, so I haven’t yet recovered these for publication (I need to do some aggressive spam filtering first). I have however stashed them in the database for later processing.
  • Write all of the scraped data out to a giant JSON file and upload it to a gist (a nice cheap way of giving it a URL).
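
Here’s a rough sketch of what those steps look like in Python. This is illustrative rather than the real notebook code: the tar filename, the regular expression and the Entry model (with its get_absolute_url() method) are all assumptions.

import json
import re
import tarfile

from bs4 import BeautifulSoup

from blog.models import Entry  # hypothetical Django model

# Content paths look like /YYYY/Mon/DD/slug/
content_re = re.compile(r'(/\d{4}/[A-Z][a-z]{2}/\d{1,2}/[^/]+/)index\.html$')

with tarfile.open('wayback-archive.tar') as tar:
    # Map each content URL path to its file inside the archive
    archived = {}
    for name in tar.getnames():
        match = content_re.search(name)
        if match:
            archived[match.group(1)] = name

    # URLs already represented in the Django database
    existing = {entry.get_absolute_url() for entry in Entry.objects.all()}

    # The difference is the set of items that need to be recovered
    missing = set(archived) - existing

    records = []
    for path in sorted(missing):
        html = tar.extractfile(archived[path]).read()
        soup = BeautifulSoup(html, 'html.parser')
        records.append({
            'url': path,
            # The real notebook also detects whether each page is an
            # entry, blogmark or quotation and scrapes accordingly
            'tags': [a.text for a in soup.findAll('a', {'rel': 'tag'})],
            'html': html.decode('utf-8', 'ignore'),
        })

with open('recovered-blog-content.json', 'w') as fp:
    json.dump(records, fp, indent=2)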

Having executed the above script, I now have a JSON file containing the parsed content for all of the missing items found in the Wayback Machine. All I needed then was a script that could take that JSON and turn it into records in the database. I implemented that as a custom Django management command (sketched below) and deployed it to Heroku.
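
For illustration, here’s a minimal sketch of what such a management command can look like. The Entry and Tag models, their field names and the tags relation are hypothetical stand-ins for my actual schema.

# blog/management/commands/import_blog_json.py (sketch)
import json
from urllib.request import urlopen

from django.core.management.base import BaseCommand

from blog.models import Entry, Tag  # hypothetical models


class Command(BaseCommand):
    help = 'Import blog content from a JSON file hosted at a URL'

    def add_arguments(self, parser):
        parser.add_argument('--url_to_json', required=True)
        parser.add_argument('--tag_with', default=None)

    def handle(self, *args, **options):
        records = json.loads(
            urlopen(options['url_to_json']).read().decode('utf-8'))
        for record in records:
            # Field names here are illustrative guesses at the schema
            entry, created = Entry.objects.get_or_create(
                url=record['url'],
                defaults={'html': record['html']},
            )
            for name in record['tags']:
                entry.tags.add(Tag.objects.get_or_create(tag=name)[0])
            if options['tag_with']:
                entry.tags.add(
                    Tag.objects.get_or_create(tag=options['tag_with'])[0])
        self.stdout.write('Imported %d items' % len(records))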

Last step: shell into a Heroku dyno (using heroku run bash) and run the following:

./manage.py import_blog_json \
    --url_to_json=https://gist.github.com/simonw/5a5bc1f58297d2c7d68dd7448a4d6614/raw/28d5d564ae3fe7165802967b0f9c4eff6091caf0/recovered-blog-content.json \
    --tag_with=recovered

The result: 213 recovered items (which I tagged with recovered so I could easily browse them), including the most important entry on my whole site: my write-up of my wedding!

So thank you very much to the Internet Archive team, and thank you Hartator for your extremely useful wayback-machine-downloader tool.