
Simon Willison’s Weblog

5 items tagged “screenscraping”

Crowbar. A headless Gecko/XULRunner application that exposes a web service API for screen scraping using a real browser DOM: pass it the URL of a page and the URL of a screen-scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML. 24th January 2009, 11:52 pm

YQL: converting the web to JSON with mock SQL. YQL just got a whole lot more interesting to me. I had no idea they were exposing an HTML and RSS scraping tool over a JSONP API in addition to all of the Yahoo! web service methods. 13th December 2008, 9:39 am
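As a rough illustration of what consuming a JSONP API looks like outside the browser, here is a minimal Python sketch that strips the callback wrapper from a JSONP body and parses the JSON inside. The `unwrap_jsonp` helper, the `handleResults` callback name and the sample payload are all hypothetical; nothing here calls a real YQL endpoint or reflects its actual response format.

```python
import json
import re

# Matches "callbackName( ...payload... );" with optional whitespace/semicolon.
_JSONP_RE = re.compile(r"^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(body):
    """Return (callback_name, parsed_payload) for a JSONP response body.

    Hypothetical helper for illustration only.
    """
    match = _JSONP_RE.match(body)
    if match is None:
        raise ValueError("not a JSONP response")
    return match.group(1), json.loads(match.group(2))


# Hypothetical payload; the shape is an assumption, not a real service reply.
sample = 'handleResults({"query": {"count": 1, "results": {"title": "Example"}}});'
callback, data = unwrap_jsonp(sample)
```

In a browser the wrapper would be executed as a script and the named callback invoked; server-side code only needs the JSON between the parentheses.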

lxml: an underappreciated web scraping library. I just wish I could get the wretched thing to install on OS X Leopard without resorting to MacPorts. 11th December 2008, 9:54 am
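For a sense of why lxml suits scraping, here is a small sketch that parses an HTML fragment with `lxml.html` and pulls out links via XPath. The sample markup and the `extract_links` helper are made up for illustration, and the code assumes lxml is already installed.

```python
from lxml import html

# Made-up markup standing in for a fetched page.
SAMPLE = """
<html><body>
  <ul id="links">
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""


def extract_links(markup):
    """Return (text, href) pairs for every anchor that has an href."""
    tree = html.fromstring(markup)
    return [
        (a.text_content().strip(), a.get("href"))
        for a in tree.xpath("//a[@href]")
    ]


links = extract_links(SAMPLE)
```

The same `tree` object also accepts CSS-style queries through `cssselect` if that optional package is available; XPath is used here because it needs nothing beyond lxml itself.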

PDFMiner. Useful-looking PDF parsing library in Python; it can produce an XML representation of the text and style information in a PDF document. 3rd August 2008, 3:29 pm

/trunk/jl/scraper. journa-list.com is open source, and the screen scrapers are written in Python. 11th October 2007, 4:10 pm
