5 items tagged “screenscraping”
Crowbar. Headless Gecko/XULRunner that exposes a web service API for screen scraping using a real browser DOM: just pass it the URL of a page and the URL of a JavaScript scraping script (a bit like a Greasemonkey user script) and get back RDF/XML.
24th January 2009, 11:52 pm
YQL—converting the web to JSON with mock SQL. YQL just got a whole lot more interesting to me—I had no idea they were exposing an HTML and RSS scraping tool over a JSONP API in addition to all of the Yahoo! web service methods.
13th December 2008, 9:39 am
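For anyone who hasn't consumed a JSONP API outside a browser, here is a rough sketch of the two halves of the YQL link above: encoding a SQL-like query into a GET URL, and stripping the callback padding off the response to get at the JSON. The endpoint, the `q`/`format`/`callback` parameter names, the query text and the sample response body are all illustrative assumptions rather than anything taken from the YQL documentation, and nothing here touches the network.

```python
import json
import re
from urllib.parse import urlencode

# Hypothetical endpoint, used only to show the shape of a request.
EXAMPLE_ENDPOINT = "https://yql.example.invalid/query"


def build_query_url(yql, callback):
    """Return a GET URL carrying the SQL-like query and the JSONP callback name.

    The parameter names (q, format, callback) are assumptions for this sketch.
    """
    params = {"q": yql, "format": "json", "callback": callback}
    return EXAMPLE_ENDPOINT + "?" + urlencode(params)


def unwrap_jsonp(body, callback):
    """Strip the `callback( ... );` padding and parse the JSON inside."""
    pattern = r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$"
    match = re.match(pattern, body, re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in the expected callback")
    return json.loads(match.group(1))


# Illustrative query text; the table and column names are assumed.
url = build_query_url('select * from html where url="http://example.com/"', "handle")

# Made-up response body standing in for what a JSONP service would send back.
sample_body = 'handle({"query": {"count": 1, "results": {"title": "Example Domain"}}});'
data = unwrap_jsonp(sample_body, "handle")
print(data["query"]["results"]["title"])
```

In a browser the padding is the whole point, since a script tag will call `handle(...)` for you; from Python it is just noise to remove before calling `json.loads`.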
lxml: an underappreciated web scraping library. I just wish I could get the wretched thing to install on OS X Leopard without resorting to MacPorts.
11th December 2008, 9:54 am
PDFMiner. Useful-looking PDF parsing library in Python: it can produce an XML representation of the text and style information in a PDF document.
3rd August 2008, 3:29 pm
/trunk/jl/scraper. journa-list.com is open source, and the screen scrapers are written in Python.
11th October 2007, 4:10 pm