Simon Willison’s Weblog

Mechanize the web

Via Keith Devens, Screen-scraping with WWW::Mechanize describes how Perl’s WWW::Mechanize module can be used to grab information from sites that require a user login. I’ve always dismissed screen scraping as something of a wasted effort, since the scraper needs a major rewrite whenever the target site tweaks its HTML. This article has encouraged me to reconsider; some of the functionality in WWW::Mechanize is fantastic:

We create a WWW::Mechanize object and tell it the address of the site we’ll be working from. The Radio Times’ front page has an image link with an ALT text of “My Diary”, so we can use that to get to the right section of the site:


  my $agent = WWW::Mechanize->new();
  $agent->get("http://www.radiotimes.beeb.com/");
  $agent->follow("My Diary");

The returned page contains two forms: one to allow you to choose from a list box of program types, and then a login form for the diary function. We tell WWW::Mechanize to use the second form for input. (Something to remember here is that WWW::Mechanize’s list of forms, unlike an array in Perl, is indexed starting at 1 rather than 0. Our index is, therefore, ‘2’.)


  $agent->form(2);

Now we can fill in our e-mail address for the '<INPUT name="email" type="text">' field, and click the submit button. Nothing too complicated.


  $agent->field("email", $email);
  $agent->click();

I’m still not quite impressed enough to learn Perl, but I’m very tempted to borrow some of the ideas and re-implement them in PHP or Python.
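
As a rough idea of what the Python side of that might involve, here is a minimal sketch of just the "pick the Nth form and fill in a field" step, using only the standard library's html.parser and urllib.parse. The names FormCollector and select_form, and the sample PAGE markup, are made up for illustration; they are not part of WWW::Mechanize or of any existing library. Nothing here fetches a page, follows links or handles cookies, and only <input> fields are collected.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormCollector(HTMLParser):
    """Collect each <form> on a page along with its named <input> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per form, in document order
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            # Unnamed inputs (such as a bare submit button) are skipped.
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def select_form(html, number):
    """Return form `number`, counting from 1 in the style of WWW::Mechanize."""
    collector = FormCollector()
    collector.feed(html)
    if not 1 <= number <= len(collector.forms):
        raise IndexError("no form number %d on this page" % number)
    return collector.forms[number - 1]


# Hypothetical markup standing in for a page with two forms.
PAGE = """
<form action="/listings" method="get">
  <select name="genre"><option>Film</option></select>
</form>
<form action="/diary/login" method="post">
  <INPUT name="email" type="text">
  <input type="submit" value="Go">
</form>
"""

form = select_form(PAGE, 2)                  # the login form
form["fields"]["email"] = "me@example.com"   # fill in the e-mail field
body = urlencode(form["fields"])             # data a POST would send
```

A real agent would still need to send body to the form's action URL and keep track of cookies between requests, which is where most of the work lies.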

This is Mechanize the web by Simon Willison, posted on 3rd February 2003.

3 comments

  1. I've already written code that does this. See my related projects, http://mechanicalcat.net/tech/webunit/ and http://sourceforge.net/projects/pywebperf/. They both implement an "agent" class like the one you describe above, and although they've got additional functionality for testing or timing, you can still use either to fetch a page, submit a retrieved form with login info, and automatically handle cookies etc. I've used it to test and perform timing analysis of the website http://www.ekit.com/, which has a complex SSL/non-SSL cookie login procedure.

    Richard Jones - 3rd February 2003 22:24 - #

  2. If you need to add parsing, to get that "Nth form" functionality, try out twisted.web.microdom's beExtremelyLenient option. I haven't used it myself, but the #twisted crew on irc.freenode.net swear by it.

    Garth T Kidd - 3rd February 2003 23:30 - #

  3. The classes used in my code do the form extraction and resubmission stuff for you. The parser is probably not as lenient as it could be, though.

    Richard Jones - 4th February 2003 01:58 - #

Previously hosted at http://simon.incutio.com/archive/2003/02/03/mechanizeTheWeb
