Crowbar. Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML.
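As a rough sketch of what "pass it two URLs, get back RDF/XML" might look like from a client, here is a small Python helper. The endpoint address and the form parameter names (`url`, `scraper`) are assumptions for illustration rather than something stated above; check the Crowbar documentation for the real interface.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a Crowbar instance is listening locally; host and port are illustrative.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_request(page_url, scraper_url, endpoint=CROWBAR_ENDPOINT):
    """Build a form-encoded POST carrying the page URL and the scraper script URL."""
    # Assumption: the parameter names "url" and "scraper" are hypothetical.
    body = urlencode({"url": page_url, "scraper": scraper_url}).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    # Supplying a request body makes urllib send this as a POST.
    return Request(endpoint, data=body, headers=headers)


def scrape(page_url, scraper_url, timeout=60):
    """Send the request and return the response body (expected to be RDF/XML) as text."""
    request = build_crowbar_request(page_url, scraper_url)
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the service answers in UTF-8.
        return response.read().decode("utf-8")
```

Calling `scrape("http://example.com/page", "http://example.com/scraper.js")` only works with a running server, so building the request is kept separate and can be inspected without any network access.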
Impressive ... and a little bit weird. I can't imagine that using Gecko/XULRunner is very efficient.
And having to run another server process, which has to be managed as well ... a bit clunky.
Clearly there must be better options out there for parsing tag soup and returning well-formed HTML, when the Ruby community can spawn both Hpricot and Nokogiri ... or am I missing the point of Crowbar entirely?
Oh, and is OpenID commenting broken?
Morgan Roderick - 25th January 2009 17:09
The big difference is that this runs JavaScript. You can inject clicks into the DOM that will cause JavaScript events to fire. Where BeautifulSoup will help you scrape a static blog page, this will help you scrape Yahoo! Mail.
It's very similar to how I hear Wesabe interacts with bank websites, actually.
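To make the static-versus-scripted distinction concrete, here is a minimal Python sketch. It uses the standard library's `html.parser` as a stand-in for a static parser such as BeautifulSoup, and the page markup is invented for illustration. The message count only exists once the script runs, so a parser that never executes JavaScript finds no text at all.

```python
from html.parser import HTMLParser

# Invented example page: the inbox text is filled in by JavaScript at load time.
PAGE = """<html><body>
<div id="inbox"></div>
<script>
document.getElementById("inbox").textContent = "3 unread messages";
</script>
</body></html>"""


class TextOutsideScripts(HTMLParser):
    """Collect text nodes, skipping the source code inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep only non-blank text that is not script source.
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a static parser can see, without running any scripts."""
    parser = TextOutsideScripts()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Here `visible_text(PAGE)` comes back empty, because the `div` is still blank in the markup as delivered. A real browser engine would run the script and expose the filled-in DOM to the scraper.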