Simon Willison’s Weblog

Letting off some steam

I spent most of today knee-deep in RSS, writing an aggregator for a project at work. It has quickly become apparent that “Really Simple Syndication” is anything but! There are currently three major (and goodness knows how many minor) specifications doing the rounds, and the majority of feeds seem to pick and choose between the three at will. Even the three core elements that describe an item (title, link and description) are both optional and heavily overloaded.

Consider, for example, the seemingly simple task of extracting the URL of an item. All three specifications define a <link> element for this, but RSS 2.0 introduces guid which can also be used to define a permalink (unless its isPermaLink attribute is set to false). The Scripting News RSS Feed provides a guid rather than a link, and some Radio Userland feeds provide both but leave the link element blank. Introduce the ongoing discussion on how relative URLs should be resolved and things get even nastier.
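To make the link/guid fallback concrete, here is a minimal Python sketch. It assumes an un-namespaced RSS 2.0 style feed (RSS 1.0 puts its elements in a namespace and would need qualified names), and the `item_url` helper and sample feed are made up for illustration. It does not attempt to resolve relative URLs.

```python
import xml.etree.ElementTree as ET


def item_url(item):
    """Return a best-guess URL for an un-namespaced RSS <item>, or None."""
    # Prefer <link>, but treat a missing or blank one as absent.
    link = (item.findtext("link") or "").strip()
    if link:
        return link
    guid = item.find("guid")
    if guid is not None and guid.text and guid.text.strip():
        # RSS 2.0: a guid is a permalink unless isPermaLink="false".
        if guid.get("isPermaLink", "true").strip().lower() != "false":
            return guid.text.strip()
    return None


# Hypothetical feed covering the awkward cases described above.
SAMPLE = """<rss version="2.0"><channel>
<item><link></link><guid>http://example.com/a</guid></item>
<item><guid isPermaLink="false">tag:example.com,2003:1</guid></item>
<item><link>http://example.com/c</link><guid>http://example.com/other</guid></item>
</channel></rss>"""

urls = [item_url(item) for item in ET.fromstring(SAMPLE).iter("item")]
print(urls)
```

The second item yields no URL at all, which is a case the aggregator has to be prepared for.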

Next up, extracting the actual content of an entry. Traditionally, this occurs as an HTML entity-encoded string in the description element. Recently, however, a new element, content:encoded, has started to become fashionable (this uses a CDATA section). Even more recently, xhtml:body has started gaining ground; it uses namespaces to embed unencoded XHTML, making event-based parsing of content that much more difficult...
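One way to cope is to try each of the three elements in turn. The sketch below is an assumption-laden illustration rather than anything the specifications prescribe: the order of preference (xhtml:body, then content:encoded, then description) is my own choice, `item_content` is a hypothetical helper, and it uses a tree parser rather than an event-based one to keep the example short.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def item_content(item):
    """Return (source element name, markup string) for an un-namespaced <item>."""
    body = item.find(f"{{{XHTML_NS}}}body")
    if body is not None:
        # Unencoded XHTML: serialise the subtree back into a string.
        return "xhtml:body", ET.tostring(body, encoding="unicode")
    encoded = item.findtext(f"{{{CONTENT_NS}}}encoded")
    if encoded:
        # The XML parser has already unwrapped the CDATA section.
        return "content:encoded", encoded
    description = item.findtext("description")
    if description:
        # The XML parser has already undone one level of entity encoding.
        return "description", description
    return None, ""


# Hypothetical feed with one item per style of content.
SAMPLE = """<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"><channel>
<item><description>&lt;p&gt;One&lt;/p&gt;</description></item>
<item><description>Short</description>
  <content:encoded><![CDATA[<p>Two</p>]]></content:encoded></item>
<item><xhtml:body><xhtml:p>Three</xhtml:p></xhtml:body></item>
</channel></rss>"""

results = [item_content(item) for item in ET.fromstring(SAMPLE).iter("item")]
```

Note that the serialised xhtml:body comes back with whatever namespace prefixes the serialiser chooses, so it may need tidying before display.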

The aggregator I’m building is similar in style to Spycyroll, and as such needs to know the date that an entry was posted. On this point the specifications start to differ dramatically: RSS 2.0 uses pubDate, while RSS 1.0 relies on the Dublin Core element dc:date. In the wild this gets really messy: in a survey of the feeds linked to by Python Programmer Weblogs I found no fewer than five subtly (and not so subtly) different ways of representing dates. Here are some examples I picked up:

pubDate

  • 2003-03-21T16:28:40
  • 2003-04-03T07:45:57-08:00
  • Fri, 04 Apr 2003 05:04:39 GMT
  • Fri, 28 Mar 2003 05:18:59 -0800
  • 1049379042.0

dc:date

  • 2003-03-21T16:28:40
  • 2003-01-17T13:03:00+00:00
  • 2003-03-27T19:41:49-06:00
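All of the formats above can be handled by trying a few parsers in sequence. This is a rough Python sketch using only the standard library; `parse_feed_date` is a hypothetical helper, and treating a date with no timezone as UTC is purely an assumption on my part (the feed gives no way of knowing).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_feed_date(value):
    """Parse a feed date string into an aware UTC datetime, or return None."""
    value = value.strip()
    try:
        # Seconds since the epoch, e.g. "1049379042.0".
        return datetime.fromtimestamp(float(value), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        pass
    try:
        # ISO 8601 style, with or without a UTC offset.
        parsed = datetime.fromisoformat(value)
    except ValueError:
        try:
            # RFC 822 style, e.g. "Fri, 04 Apr 2003 05:04:39 GMT".
            parsed = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None
    if parsed.tzinfo is None:
        # Assumption: a date with no timezone is taken to be UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


for sample in (
    "2003-03-21T16:28:40",
    "2003-04-03T07:45:57-08:00",
    "Fri, 04 Apr 2003 05:04:39 GMT",
    "Fri, 28 Mar 2003 05:18:59 -0800",
    "1049379042.0",
):
    print(sample, "->", parse_feed_date(sample))
```

Normalising everything to UTC means entries from different feeds can be sorted against each other, which is the whole point for this style of aggregator.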

Having battled through that lot, the conscientious aggregator writer hits the next big hurdle: approximately 10% of RSS feeds are badly formed XML! This issue is covered by Mark Pilgrim in Parsing RSS at all costs, where he presents an ultra-liberal Python RSS parser which uses Python’s relatively forgiving sgmllib module. Great, except PHP doesn’t have one of those... enter REX, a technique for “shallow parsing” of XML using regular expressions (no, it’s not as kludgy as it sounds; in fact Python’s sgmllib module is built on the same principles). Martin Spernau has an excellent article showing how REX can be implemented in PHP and demonstrates the technique in a modified version of the MagpieRSS library. Of course, XML purists (with very good reason) advocate ignoring badly formed feeds, but as Mark points out, this really isn’t a very practical approach.
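To give a flavour of the shallow parsing idea, here is a small Python sketch. It is not REX itself (the real REX expressions are far more thorough), just a simplified tokenizer in the same spirit: one regular expression splits the input into markup and text tokens without ever checking that the document is well formed. The `TOKEN` pattern, `item_titles` helper and broken sample feed are all made up for illustration.

```python
import re

# One alternative per token kind; the pattern never checks nesting.
TOKEN = re.compile(
    r"<!\[CDATA\[.*?\]\]>"  # CDATA section
    r"|<!--.*?-->"          # comment
    r"|<[^<>]*>"            # any other tag-like markup
    r"|[^<]+"               # run of text
    r"|<",                  # stray '<' that opens nothing
    re.DOTALL,
)
TAG_NAME = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)")


def item_titles(xml_text):
    """Collect the raw text of each <title> that appears inside an <item>."""
    titles = []
    in_item = False
    current = None  # text fragments collected while inside a <title>
    for token in TOKEN.findall(xml_text):
        tag = TAG_NAME.match(token)
        if tag is None:
            # Not an element tag: text, CDATA, comment or other markup.
            if current is not None:
                if token.startswith("<![CDATA["):
                    current.append(token[9:-3])
                elif not token.startswith("<!--"):
                    current.append(token)
            continue
        closing, name = tag.group(1) == "/", tag.group(2).lower()
        if name == "item":
            in_item = not closing
        elif name == "title" and in_item:
            if not closing:
                current = []
            elif current is not None:
                titles.append("".join(current).strip())
                current = None
    return titles


# Badly formed on purpose: a bare '&', an unclosed <item>, no </rss>.
BROKEN = """<rss><channel>
<item><title>Fish & chips</title><link>http://example.com/1</link>
<item><title><![CDATA[A < B]]></title></item>
</channel>"""

titles = item_titles(BROKEN)
print(titles)
```

A strict XML parser would reject this feed outright, whereas the tokenizer recovers both titles. The trade-off is that entities are left undecoded and nothing guarantees the structure means what it appears to mean.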

So what have I done? For the moment, I’ve gone with the path of least resistance. Onyx RSS is a well-designed parser based on PHP’s XML support (so no invalid feed support) which works just fine for now. Unfortunately it is licensed under the GPL, and since this is likely to end up as a commercial project I’ll have to find something else for the final cut, or more likely implement something from scratch that uses REX.

This is Letting off some steam by Simon Willison, posted on 4th April 2003.

Previously hosted at http://simon.incutio.com/archive/2003/04/04/lettingOffSomeSteam