Mechanize the web
3rd February 2003
Via Keith Devens, “Screen-scraping with WWW::Mechanize” describes how Perl’s WWW::Mechanize module can be used to grab information from sites that require a user login. I’ve always dismissed screen scraping as something of a wasted effort, given that a major rewrite of the scraper is required whenever the target site tweaks its HTML. This article has encouraged me to reconsider, because some of the functionality in WWW::Mechanize is fantastic:
We create a WWW::Mechanize object and tell it the address of the site we’ll be working from. The Radio Times’ front page has an image link with an ALT text of “My Diary”, so we can use that to get to the right section of the site:
my $agent = WWW::Mechanize->new();
$agent->get("http://www.radiotimes.beeb.com/");
$agent->follow("My Diary");
The returned page contains two forms: one to allow you to choose from a list box of program types, and then a login form for the diary function. We tell WWW::Mechanize to use the second form for input. (Something to remember here is that WWW::Mechanize’s list of forms, unlike an array in Perl, is indexed starting at 1 rather than 0. Our index is, therefore, ‘2’.)
$agent->form(2);
Now we can fill in our e-mail address for the '<INPUT name="email" type="text">' field, and click the submit button. Nothing too complicated.
$agent->field("email", $email);
$agent->click();
I’m still not quite impressed enough to learn Perl, but I’m very tempted to borrow some of the ideas and re-implement them in PHP or Python.
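As a rough idea of what borrowing the form-handling part might look like in Python, here is a minimal sketch using only the standard library. It is not a port of WWW::Mechanize: the FormCollector class, the select_form helper, and the sample page are all made up for illustration, and it only builds the request body rather than sending anything over the network. It does keep the quoted article's convention of counting forms from 1.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormCollector(HTMLParser):
    """Collect every <form> on a page, with its action and named inputs."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form, in document order
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <INPUT> works too.
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action", ""), "fields": {}}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def select_form(html, number):
    """Return form `number`, counting from 1 as the quoted article describes."""
    collector = FormCollector()
    collector.feed(html)
    if not 1 <= number <= len(collector.forms):
        raise IndexError(
            f"page has {len(collector.forms)} form(s), asked for {number}"
        )
    return collector.forms[number - 1]


# Hypothetical page standing in for the real one: a search form, then a login form.
PAGE = """
<form action="/search"><input name="genre" type="text"></form>
<form action="/login"><INPUT name="email" type="text"><input type="submit"></form>
"""

form = select_form(PAGE, 2)                 # the second form is the login form
form["fields"]["email"] = "me@example.com"  # fill in the e-mail field
body = urlencode(form["fields"])            # what a POST to form["action"] would carry
print(form["action"], body)
```

A real scraper would still need to fetch pages, follow links, and keep cookies between requests, which is exactly the bookkeeping the Perl module hides.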