Simon Willison’s Weblog


36 posts tagged “scraping”

2008

lxml: an underappreciated web scraping library. I just wish I could get the wretched thing to install on OS X Leopard without resorting to MacPorts.

# 11th December 2008, 9:54 am / lxml, macports, python, ian-bicking, scraping

Data Scraping Wikipedia with Google Spreadsheets. I hadn’t played with =importHTML in Google Spreadsheets, which lets you suck in data from an HTML table or list somewhere on the web. This tutorial takes it further, bringing Wikipedia, Yahoo! Pipes and KML into the mix.

# 16th October 2008, 2:37 pm / mashups, importhtml, google-docs, googlespreadsheet, wikipedia, yahoopipes, kml, scraping
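
For a rough idea of what =importHTML does with a table, here is a minimal sketch in Python using only the standard library. This is not how Google implements it: the `TableExtractor` class, the `import_html_table` helper and the sample markup are all made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of every <table> in a document as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index=1):
    """Return the index-th table (1-based, like importHTML) as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Hypothetical sample markup standing in for a fetched page.
SAMPLE = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Brighton</td><td>UK</td></tr>
  <tr><td>Lawrence</td><td>USA</td></tr>
</table>
"""

rows = import_html_table(SAMPLE)
for row in rows:
    print(row)
```

Each row comes back as a list of strings, which is roughly the grid of cells the spreadsheet function fills in. Fetching the page itself (for example with `urllib.request`) is left out to keep the sketch self-contained.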

PDFMiner. Useful-looking PDF parsing library in Python. It can produce an XML representation of the text and style information in a PDF document.

# 3rd August 2008, 3:29 pm / pdf, python, xml, pdfminer, scraping

2007

/trunk/jl/scraper. journa-list.com is open source, and the screen scrapers are written in Python.

# 11th October 2007, 4:10 pm / python, open-source, journalist, scraping

2005

scrape.py. A clever Python screen-scraping module, with similarities to WWW::Mechanize.

# 25th March 2005, 5:09 am / scraping, python

2004

WWW::Odeon (via) A simple API for screen-scraping the www.odeon.co.uk website.

# 21st July 2004, 3:27 pm / scraping, perl