Simon Willison on scraping

37 posts tagged “scraping”

2008

YQL—converting the web to JSON with mock SQL. YQL just got a whole lot more interesting to me—I had no idea they were exposing an HTML and RSS scraping tool over a JSONP API in addition to all of the Yahoo! web service methods.

# 13th December 2008, 9:39 am / html, json, jsonp, scraping, sql, yahoo, yql

lxml: an underappreciated web scraping library. I just wish I could get the wretched thing to install on OS X Leopard without resorting to MacPorts.

# 11th December 2008, 9:54 am / ian-bicking, lxml, macports, python, scraping

Data Scraping Wikipedia with Google Spreadsheets. I hadn’t played with =importHTML in Google Spreadsheets, which lets you suck in data from an HTML table or list somewhere on the web. This tutorial takes it further, bringing Wikipedia, Yahoo! Pipes and KML into the mix.

# 16th October 2008, 2:37 pm / google-docs, googlespreadsheet, importhtml, kml, mashups, scraping, wikipedia, yahoopipes
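
For a rough idea of what a function like =importHTML does with a table, here is a minimal sketch in Python using only the standard library. It is not how Google implements it: the names `TableExtractor`, `import_html_table` and `SAMPLE` are hypothetical, the sample markup is made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of every <table> in a document as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index=1):
    """Return table number `index` (1-based, as in =importHTML) as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[index - 1]


# Made-up sample markup (an assumption, not taken from any real page).
SAMPLE = """
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

rows = import_html_table(SAMPLE)
for row in rows:
    print(row)
```

To run this against a live page you would fetch the HTML first (for example with `urllib.request.urlopen`) and pass the decoded text to `import_html_table`.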

PDFMiner. Useful-looking PDF parsing library in Python: it can produce an XML representation of the text and style information in a PDF document.

# 3rd August 2008, 3:29 pm / pdf, pdfminer, python, scraping, xml

2007

/trunk/jl/scraper. journa-list.com is open source, and the screen scrapers are written in Python.

# 11th October 2007, 4:10 pm / journalist, open-source, python, scraping

2005

scrape.py. A clever Python screen-scraping module, with similarities to WWW::Mechanize.

# 25th March 2005, 5:09 am / python, scraping

2004

WWW::Odeon. A simple API for screen-scraping the www.odeon.co.uk website.

# 21st July 2004, 3:27 pm / perl, scraping

