
Simon Willison’s Weblog

Using XPath to mine XHTML

This morning, I finally decided to install libxml2 and see what all the fuss was about, in particular with respect to XPath. What followed is best described as an enlightening experience.

XPath is a beautifully elegant way of addressing “nodes” within an XML document. XPath expressions look a little like file paths, for example:

/first/second
Match any <second> elements that are direct children of a <first> element that is the root element of the document
//second
Match all <second> elements irrespective of their place in the document
//second[@hi]
Match all <second> elements with a ’hi’ attribute
//second[@hi="there"]
Match all <second> elements with a ’hi’ attribute that equals “there”
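
To make those four expressions concrete, here is a minimal sketch that runs their closest equivalents against a tiny made-up document. It uses Python 3's standard-library ElementTree rather than libxml2, so it runs without installing anything. ElementTree only supports a subset of XPath and evaluates paths relative to the element you call it on, so the expressions are adapted slightly (`.//second` instead of `//second`). The sample document and variable names are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document, used only to illustrate the four expressions above.
doc = ET.fromstring(
    '<first>'
    '<second hi="there">a</second>'
    '<second hi="you">b</second>'
    '<third><second>c</second></third>'
    '</first>'
)

# /first/second: ElementTree paths are relative to the element findall()
# is called on, so "second" from the root <first> plays the same role.
children = [e.text for e in doc.findall('second')]

# //second: every <second>, wherever it sits in the document.
anywhere = [e.text for e in doc.findall('.//second')]

# //second[@hi]: only the <second> elements that carry a 'hi' attribute.
with_hi = [e.text for e in doc.findall('.//second[@hi]')]

# //second[@hi="there"]: only those whose 'hi' attribute equals "there".
hi_there = [e.text for e in doc.findall(".//second[@hi='there']")]

print(children)   # ['a', 'b']
print(anywhere)   # ['a', 'b', 'c']
print(with_hi)    # ['a', 'b']
print(hi_there)   # ['a']
```

A full XPath engine such as libxml2 accepts the absolute forms exactly as written in the list above.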

A full XPath tutorial is available.

The Python libxml2 bindings make running XPath expressions incredibly simple. Here’s some code that extracts the titles of all of the entries on my Kansas blog from the site’s RSS feed:

>>> import libxml2
>>> import urllib
>>> rss = libxml2.parseDoc(
      urllib.urlopen('http://www.a-year-in-kansas.com/syndicate/').read())
>>> rss.xpathEval('//item/title')
[<xmlNode (title) object at 0xb4b260>, <xmlNode (title) object at 0xa99968>, 
<xmlNode (title) object at 0x10dce68>]
>>> [node.content for node in rss.xpathEval('//item/title')]
['Music and Brunch', 'House hunting', 'Arrival']
>>> 
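
The session above is Python 2 and fetches a live feed. As a rough sketch of the same idea that runs offline on Python 3, the snippet below applies an equivalent path to an inline sample feed using the standard-library ElementTree module (a swap-in for libxml2, not the same API). The feed markup is an assumption made up for illustration; only the three item titles come from the output shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real RSS feed, so no network access is needed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Music and Brunch</title></item>
    <item><title>House hunting</title></item>
    <item><title>Arrival</title></item>
  </channel>
</rss>"""

rss = ET.fromstring(SAMPLE_RSS)

# './/item/title' is ElementTree's spelling of '//item/title': it matches
# <title> elements directly inside an <item>, so the channel's own <title>
# is left out.
titles = [node.text for node in rss.findall('.//item/title')]
print(titles)  # ['Music and Brunch', 'House hunting', 'Arrival']
```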

Why is this so exciting? I’ve been saying for over a year that XHTML is an ideal format for storing pieces of content in a database or content management system. Serving content to browsers as HTML 4 makes perfect sense, but storing your actual content as XML gives you the ability to process that content in the future using XML tools.

So far, the best example of a powerful tool for manipulating this stored XML has been XSLT. XSLT has its fans, but is also often criticised as being unintuitive and having a steep learning curve. XPath is a far better example of a powerful, easy-to-use tool that can be brought to bear on XHTML content.

Enough talk, here’s an example of what I mean. The following code snippet creates a Python dictionary of all of the acronyms currently visible on the front page of my blog, mapping their shortened version to the expanded text (extracted from the title attribute):


>>> blog = libxml2.parseDoc(
    urllib.urlopen('http://simon.incutio.com/').read())
>>> ctxt = blog.xpathNewContext()
>>> ctxt.xpathRegisterNs('xhtml', 'http://www.w3.org/1999/xhtml')
0
>>> acronyms = dict([(a.content, a.prop('title')) 
    for a in ctxt.xpathEval('//xhtml:acronym')])
>>> for acronym, fulltext in acronyms.items():
	print acronym, ':', fulltext


DHTML : Dynamic HyperText Markup Language
URL : Universal Republic of Love
HTML : HyperText Markup Language
SIG : Special Interest Group
PHP : PHP: Hypertext Preprocessor
CSS : Cascading Style Sheets
>>> 

The above code is slightly more complicated than the first example, because using XPath with a document that uses XML namespaces requires some extra work: you have to create an XPath context and register a prefix for the namespace before evaluating the expression. Still, it’s a pretty short piece of code considering what it does.
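
The namespace step is the part that tends to trip people up, so here is a small offline sketch of why it is needed. It uses Python 3's standard-library ElementTree instead of libxml2, and an inline XHTML fragment that I made up for illustration. In ElementTree the prefix-to-namespace mapping is passed to `findall()` directly rather than registered on a context, but the principle is the same: the prefix in the expression must be bound to the document's namespace URI.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Hypothetical XHTML fragment standing in for the blog's front page.
page = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>Styled with <acronym title="Cascading Style Sheets">CSS</acronym> '
    'and written in '
    '<acronym title="HyperText Markup Language">HTML</acronym>.</p>'
    '</body></html>'
)

# Without a namespace the path matches nothing, because every element in
# the document lives in the XHTML namespace.
unprefixed = page.findall('.//acronym')

# Binding the 'xhtml' prefix to the namespace URI makes the path match.
ns = {'xhtml': XHTML_NS}
acronyms = {
    ''.join(a.itertext()): a.get('title')
    for a in page.findall('.//xhtml:acronym', ns)
}

print(len(unprefixed))  # 0
for acronym, fulltext in acronyms.items():
    print(acronym, ':', fulltext)
```

The prefix itself is arbitrary. Only the URI it is bound to has to match the one declared in the document.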

For an example of how powerful XPath can be on a much larger scale, take a look at Sam Ruby’s XPath-enabled blog search feature.

This is Using XPath to mine XHTML by Simon Willison, posted on 21st October 2003.


10 comments

  1. Just want to state the fact that XPath is a very essential part of the "XSLT Learning Curve" - you almost made it sound like they're not related to each other :-)

    Chriztian Steinmeier - 21st October 2003 09:30 - #

  2. I like it. You've just given me a reason to learn Python. Cheers. :)

    Andrew Sidwell - 21st October 2003 12:00 - #

  3. I've always used XPath more than XSLT, as XPath is such an integral part of i don't think the sentance So far, the best example of a powerful tool for manipulating this stored XML has been XSLT is strictly true. The So far implies that Xpath is a relatively new kid on the block when it isn't.

    P.S. I think it is "brought to bear" not "brought to bare".

    BenM - 21st October 2003 12:56 - #

  4. Please excuse my obvious typo!

    BenM - 21st October 2003 12:57 - #

  5. I don't understand the "store stuff as XHTML" mindset. I prefer "store stuff as XML". My own blog content is stored in XML files using an ad-hoc schema drawing heavily on XHTML, but not limited to it. If I want a new construct (edit history, for example), I can create tags that match my desired semantics, adjust the transforms to show them the way I want in HTML, and I can start creating content.

    If you are going to transform everything for display anyway, why limit yourself to tags that only browsers understand?

    Ned Batchelder - 21st October 2003 13:10 - #

  6. One thing I find curious. On your site URL == Universal Republic of Love. Is there something significant (perhaps a message you are spreading) to this?

    Tzicha - 21st October 2003 13:44 - #

  7. The Universal Republic of Love thing is Tim Bray's fault :)

    Ned, custom XML is great if you're using an XML-style database for an application, but many apps (my blog included) have a relational database at their core, which takes care of the metadata and specialist tags. If you have the time and the inclination, creating a custom XML-based markup scheme for content is a great idea - but most people are comfortable with HTML. Almost every content management system I've ever played with stores its content as HTML - my argument is that by moving to XHTML (a very small step) the content stored in the CMS can be processed with tools like XPath essentially for free.

    Incidentally, if you're going to go that route, HTMLTidy's almost demonic ability to turn any old garbage into valid XHTML is a God-send.

    Simon Willison - 21st October 2003 15:10 - #

  8. Now take another look at Syncato.

    Sam - 21st October 2003 20:19 - #

  9. Thanks for that link to the definition of URL, made for a good laugh this morning.

    Tzicha - 22nd October 2003 13:24 - #

  10. Ned, custom XML is great if you're using an XML style database for an application, but many apps (my blog included) have a relational database at their core, which takes case of the metadata and specialist tags.

    David - 8th December 2003 21:00 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2003/10/21/xpathRocks
