Simon Willison’s Weblog

Crowbar. A headless Gecko/XULRunner application that exposes a web service API for screen scraping with a real browser DOM: pass it the URL of a page and the URL of a JavaScript scraping script (a bit like a Greasemonkey user script) and get back RDF/XML.
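As a rough sketch of what "pass it two URLs" could look like from a client, the snippet below builds a request URL for a locally running scraping service. The endpoint address and the `url` and `scraper` parameter names are assumptions made for illustration, not taken from Crowbar's documentation.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the host, port, and query parameter names here
# are illustrative assumptions, not Crowbar's documented interface.
SCRAPER_ENDPOINT = "http://127.0.0.1:10000/"


def build_scrape_request(page_url, scraper_url, endpoint=SCRAPER_ENDPOINT):
    """Return a GET URL carrying the page URL and the scraper script URL."""
    # urlencode percent-encodes both values so they survive as query parameters.
    query = urlencode({"url": page_url, "scraper": scraper_url})
    return f"{endpoint}?{query}"


if __name__ == "__main__":
    print(build_scrape_request("http://example.com/a", "http://example.com/s.js"))
```

Fetching the resulting URL (for example with `urllib.request.urlopen`) would then return whatever the service produces, which the post describes as RDF/XML.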

2 comments

  1. Impressive ... and a little bit weird. I can't imagine that using Gecko / XULRunner is very efficient.

    And having to run another server process, which has to be managed as well ... a bit clunky.

    Clearly there must be better options out there for parsing tag soup and returning well-formed HTML, when the Ruby community can spawn both Hpricot and Nokogiri ... or am I missing the point of Crowbar entirely?

    Oh, and is OpenID commenting broken?

    Morgan Roderick - 25th January 2009 17:09

  2. The big difference is that this runs JavaScript. You can inject clicks into the DOM that will cause JavaScript events to fire. Where BeautifulSoup will help you scrape a static blog page, this will help you scrape Yahoo! Mail.

    It's very similar to how I hear Wesabe interacts with bank websites, actually.

    Richard Crowley - 3rd February 2009 17:42
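The distinction drawn in the second comment can be illustrated with a small sketch using Python's standard-library `html.parser`. The sample page and the helper names are made up for the example. A static parser only sees markup present in the source, so a paragraph that a script would add at runtime never shows up.

```python
from html.parser import HTMLParser

# Made-up sample page: one paragraph is in the markup, the other would
# only exist after a browser executed the inline script.
PAGE = """<html><body>
<p id="static">Visible without JavaScript</p>
<script>document.write('<p id="dynamic">Added by JavaScript</p>');</script>
</body></html>"""


class ParagraphIds(HTMLParser):
    """Collect the id attribute of every <p> start tag in the source."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.ids.append(dict(attrs).get("id"))


def static_paragraph_ids(html):
    parser = ParagraphIds()
    parser.feed(html)
    parser.close()
    return parser.ids


if __name__ == "__main__":
    # The script body is treated as raw text, so "dynamic" is not reported.
    print(static_paragraph_ids(PAGE))
```

A tool that drives a real browser engine would run the script first and so would see both paragraphs.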
