Simon Willison’s Weblog

Letting off some steam

I spent most of today knee deep in RSS, writing an aggregator for a project at work. It has quickly become apparent that “Really Simple Syndication” is anything but! There are currently three major (and goodness knows how many minor) specifications doing the rounds, and the majority of feeds seem to pick and choose between the three at will. Even the three core elements that describe an item (title, link and description) are both optional and heavily overloaded.

Consider, for example, the seemingly simple task of extracting the URL of an item. All three specifications define a <link> element for this, but RSS 2.0 introduces guid which can also be used to define a permalink (unless its isPermaLink attribute is set to false). The Scripting News RSS Feed provides a guid rather than a link, and some Radio Userland feeds provide both but leave the link element blank. Introduce the ongoing discussion on how relative URLs should be resolved and things get even nastier.
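
The fallback order described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: it covers un-namespaced RSS 0.9x/2.0 items (RSS 1.0 puts link in its own namespace), the function name item_url and the sample feed are made up for the example, and relative URLs are not resolved at all.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def item_url(item: ET.Element) -> Optional[str]:
    """Return a best-guess URL for an RSS <item>, or None if there is none."""
    # Prefer <link>, but only if it has content: some feeds include the
    # element and leave it blank.
    link = (item.findtext("link") or "").strip()
    if link:
        return link

    # Fall back to <guid>, which RSS 2.0 treats as a permalink unless
    # isPermaLink is explicitly set to "false".
    guid = item.find("guid")
    if guid is not None:
        text = (guid.text or "").strip()
        if text and guid.get("isPermaLink", "true").lower() != "false":
            return text
    return None


# Hypothetical sample feed covering the three cases discussed above.
SAMPLE = """<rss version="2.0"><channel>
<item><link>http://example.com/a</link></item>
<item><link></link><guid>http://example.com/b</guid></item>
<item><guid isPermaLink="false">tag:example.com,2003:c</guid></item>
</channel></rss>"""

urls = [item_url(item) for item in ET.fromstring(SAMPLE).iter("item")]
print(urls)
```

The third item yields None because its guid is explicitly marked as not being a permalink, so there is nothing safe to link to.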

Next up, extracting the actual content of an entry. Traditionally, this appears as an entity-encoded HTML string in the description element. Recently, however, a new element, content:encoded, has started to become fashionable (this uses a CDATA section). Even more recently, xhtml:body has started gaining ground; it uses namespaces to embed unencoded XHTML, making event-based parsing of content that much more difficult...
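
To make the three cases concrete, here is a rough Python sketch that looks for each element in turn. The order of preference (xhtml:body, then content:encoded, then description) is my own assumption rather than anything the specifications mandate, and item_content and the sample feed are hypothetical names for the example. It uses a tree parser instead of an event-based one, which sidesteps the difficulty mentioned above at the cost of loading the whole feed.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for the content module and XHTML.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def item_content(item):
    """Return (kind, markup) for an item, where kind is "xhtml", "html" or None."""
    # Inline XHTML arrives as real child elements, so re-serialise the subtree.
    body = item.find(f"{{{XHTML_NS}}}body")
    if body is not None:
        return "xhtml", ET.tostring(body, encoding="unicode")

    # content:encoded (CDATA) and description (entity-encoded) both come out
    # of the XML parser as plain strings containing HTML markup.
    encoded = item.findtext(f"{{{CONTENT_NS}}}encoded")
    if encoded:
        return "html", encoded
    description = item.findtext("description")
    if description:
        return "html", description
    return None, None


# Hypothetical sample feed with one item per style.
SAMPLE = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"><channel>
<item><description>&lt;p&gt;One&lt;/p&gt;</description></item>
<item><content:encoded><![CDATA[<p>Two</p>]]></content:encoded></item>
<item><xhtml:body><xhtml:p>Three</xhtml:p></xhtml:body></item>
</channel></rss>"""

results = [item_content(item) for item in ET.fromstring(SAMPLE).iter("item")]
for kind, markup in results:
    print(kind, markup)
```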

The aggregator I’m building is similar in style to Spycyroll, and as such needs to know the date that an entry was posted. On this point the specifications start to differ dramatically: RSS 2.0 uses pubDate, while RSS 1.0 relies on the Dublin Core element dc:date. In the wild this gets really messy: in a survey of the feeds linked to by Python Programmer Weblogs I found no fewer than five subtly (and not so subtly) different ways of representing dates. Here are some examples I picked up:

pubDate

  • 2003-03-21T16:28:40
  • 2003-04-03T07:45:57-08:00
  • Fri, 04 Apr 2003 05:04:39 GMT
  • Fri, 28 Mar 2003 05:18:59 -0800
  • 1049379042.0

dc:date

  • 2003-03-21T16:28:40
  • 2003-01-17T13:03:00+00:00
  • 2003-03-27T19:41:49-06:00
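
All of the formats listed above can be normalised with Python's standard library alone. The following is a sketch, not a complete date parser: parse_feed_date is a hypothetical name, treating timestamps without a zone as UTC is an assumption (the feed may well have meant local time), and a bare number is assumed to be seconds since the Unix epoch.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# A bare integer or decimal is assumed to be a Unix timestamp.
_EPOCH = re.compile(r"\d+(\.\d+)?")


def parse_feed_date(raw):
    """Return an aware UTC datetime for a feed date string, or None."""
    text = raw.strip()
    if _EPOCH.fullmatch(text):
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    try:
        # ISO 8601 style, with or without a UTC offset.
        parsed = datetime.fromisoformat(text)
    except ValueError:
        try:
            # RFC 822 style, e.g. "Fri, 04 Apr 2003 05:04:39 GMT".
            parsed = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            return None
    if parsed.tzinfo is None:
        # Assumption: treat zone-less timestamps as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


for example in [
    "2003-03-21T16:28:40",
    "2003-04-03T07:45:57-08:00",
    "Fri, 04 Apr 2003 05:04:39 GMT",
    "Fri, 28 Mar 2003 05:18:59 -0800",
    "1049379042.0",
]:
    print(example, "->", parse_feed_date(example))
```

Converting everything to UTC up front means entries from different feeds can be sorted against each other without caring which format they arrived in.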

Having battled through that lot, the conscientious aggregator writer hits the next big hurdle: approximately 10% of RSS feeds are badly formed XML! This issue is covered by Mark Pilgrim in Parsing RSS at all costs, where he presents an ultra-liberal Python RSS parser which uses Python’s relatively forgiving sgmllib module. Great, except PHP doesn’t have one of those... enter REX, a technique for “shallow parsing” of XML using regular expressions (no, it’s not as kludgy as it sounds; in fact Python’s sgmllib module is built on the same principles). Martin Spernau has an excellent article showing how REX can be implemented in PHP and demonstrates the technique in a modified version of the MagpieRSS library. Of course, XML purists (with very good reason) advocate ignoring badly formed feeds, but as Mark points out, this really isn’t a very practical approach.
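
To give a flavour of the liberal approach, here is a small Python sketch that tries a strict XML parse first and only falls back to regular expressions when that fails. To be clear, this is not REX and not Mark's parser: the two patterns are a deliberately crude stand-in that only pulls out item titles, and item_titles is a hypothetical name for the example.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape

# Crude stand-in patterns for the fallback path; real shallow parsing is
# considerably more careful than this.
_ITEM = re.compile(r"<item\b[^>]*>(.*?)</item>", re.DOTALL | re.IGNORECASE)
_TITLE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def item_titles(feed_text):
    """Return (titles, strict); strict is False if the regex fallback was used."""
    try:
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        # Badly formed XML: scrape what we can rather than giving up.
        titles = []
        for block in _ITEM.findall(feed_text):
            match = _TITLE.search(block)
            titles.append(unescape(match.group(1).strip()) if match else "")
        return titles, False
    titles = [(item.findtext("title") or "").strip() for item in root.iter("item")]
    return titles, True


good = "<rss><channel><item><title>A &amp; B</title></item></channel></rss>"
# The unescaped ampersand makes this feed badly formed.
bad = "<rss><channel><item><title>Fish & chips</title></item></channel></rss>"

print(item_titles(good))
print(item_titles(bad))
```

Returning the strict flag alongside the titles lets the caller log or report which feeds needed the fallback.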

So what have I done? For the moment, I’ve gone with the path of least resistance. Onyx RSS is a well-designed parser based on PHP’s XML support (so it cannot handle invalid feeds), which works just fine for now. Unfortunately it is licensed under the GPL, and since this is likely to end up as a commercial project I’ll have to find something else for the final cut, or more likely implement something from scratch that uses REX.

This is Letting off some steam by Simon Willison, posted on 4th April 2003.

12 comments

  1. I've been talking with the author of Onyx (Ed Swindelles) and he's thinking of changing the license (we use it for serendipity). It might help to send him a mail, asking if he'd be willing to use another license, like the BSD or MIT licenses.

    Sterling Hughes - 4th April 2003 23:38 - #

  2. When they said "simple", they meant "simple for producers". Not consumers.

    Mark - 5th April 2003 01:44 - #

  3. I think there's only one way out of this madness. Decide on a single implementation (my preference being RSS 1.0 + a few XHTML modules in namespaces), and write an "rsstidy" to take junk, and output your decided implementation.

    It's a clean separation that allows the tidy tool to be re-used across many different applications, and keeps your application code simple. It's also advocating your preferred format.

    Jim - 5th April 2003 01:57 - #

  4. That's a pretty smart idea. I was thinking earlier that a solution to the badly formed RSS problem could be to have two parsing classes with identical interfaces, one using an XML parser and one using regular expressions. If the XML parsing fails, the regular expression one could be used as a backup option. Building a single parser and then writing something that can take garbage feeds / different versions and reformat them sounds like a much nicer option - far easier to maintain for one thing.

    I've got quite specific information requirements from an RSS feed, so I think I'll work out a subset of RSS 2.0 that fits them and write a converter to get stuff to behave in line with what I want. In fact, if I do that I can write the rsstidy code in Python which we've established has much better tools for this kind of work :)

    Thanks for the suggestion.

    Simon Willison - 5th April 2003 02:08 - #

  5. Hi, thanks for the post. I was looking for different options for parsing and generating RSS when I came across it. I posted the list I found to my blog as well; most of them you already mentioned. There is also a PHP-based aggregator called Rippy, though it may not scale up well for many users: http://ansuz.sooke.bc.ca/rippy.html

    Doug - 5th April 2003 07:04 - #

  6. Interesting to hear of your experiences. I recently had the same problem, working with Java. My current solution is to clean the raw XML with JTidy (a port of HTML Tidy, I bet this is available for PHP) then either read directly as RSS 1.0 or run through Sjoerd Visscher's RSS 2.0 -> RSS 1.0 XSLT stylesheet. (I'm using RDF internally in the app, so RSS 1.0 was the obvious choice, though I would have preferred this anyway - it's much better defined than RSS 2.0, it's richer thanks to the modules, and it's easier to extend.)

    Danny - 5th April 2003 10:06 - #

  7. Sure, the trouble is getting the developers of the applications creating the RSS to stop being pigheaded. This is, unfortunately, a non-trivial matter. Several battles have been waged, to varying degrees of success. Fundamentally, there's a faction that naively worries about 'readability' of the XML. Then there's a crowd that worries about the verbosity. Over in another corner is the 'but I want to theoretically be able to use mod_kitchensink in my RSS'. Meanwhile, vendor jockeying with proprietary, half-baked extensions keeps happening, over and over... The fortunate thing is diversity. As more tools come online it becomes apparent that giving the users what they want often means switching tools. When one tool doesn't cut it, the users switch. Witness the incredible growth of MovableType. It defaults to creating RSS-1.0 files and makes a lot of use of RDF (via trackbacks). The users never see the gritty details, they just see nice functionality and use it.

    Bill Kearney - 5th April 2003 13:03 - #

  8. Oh, and when you find a bad feed, report it as such to Syndic8. You can also programmatically use the Syndic8 database to see if a feed is or isn't known to be in good working order.

    Bill Kearney - 5th April 2003 13:04 - #

  9. I was going to suggest the same method as Jim, but I see I am too late :)

    I'd probably create: different "readers" that took the input and turned it into my internal representation and different "views" that could output my internal representation to my choice of format.

    Your internal representation being something nice and flexible, with good expressibility, so I'd probably choose a simple data structure (or the "best" RSS format would probably work).

    Not sure if this is the best way to do it, but it sounds good to me :)

    Swannie - 5th April 2003 14:30 - #

  10. Bill,

    I think the "mod_kitchensink" guys are given an unfair rap. When people wring their hands over inline HTML, it's already been solved by the XHTML namespace. When I see new elements go into the RSS 2.0 spec, I notice that they already exist as part of the Dublin Core or similar, and are already usable with RSS 1.0. Now RSS 2.0 includes namespaces, so as far as I can tell, the developers of RSS 2.0 argued against RSS 1.0, and then proceeded to try and copy it (badly).

    Jim - 5th April 2003 14:40 - #

  11. I will be lifting the GPL license off of Onyx or at least changing it to something like MIT or BSD. Look for an update on the site soon.

    Ed Swindelles - 6th April 2003 16:29 - #

  12. I agree with Jim re. 'mod_kitchensink': RSS 1.0 modules make it easy to add new elements in a consistently defined way (the joy of frameworks, something lacking from RSS 2.0). But while it is easy for a feed producer to include material using kitchensink elements, there is no need for a consumer to interpret them, unless they really want to. I reckon this gets the best of both worlds.

    Danny - 7th April 2003 09:32 - #

Previously hosted at http://simon.incutio.com/archive/2003/04/04/lettingOffSomeSteam
