Letting off some steam
I spent most of today knee-deep in RSS, writing an aggregator for a project at work. It is quickly becoming apparent that “Really Simple Syndication” is anything but! There are currently three major (and goodness knows how many minor) specifications doing the rounds, and the majority of feeds seem to pick and choose between the three at will. Even the three core elements that describe an item (title, link and description) are both optional and heavily overloaded.
Consider, for example, the seemingly simple task of extracting the URL of an item. All three specifications define a <link> element for this, but RSS 2.0 introduces guid which can also be used to define a permalink (unless its isPermaLink attribute is set to false). The Scripting News RSS Feed provides a guid rather than a link, and some Radio Userland feeds provide both but leave the link element blank. Introduce the ongoing discussion on how relative URLs should be resolved and things get even nastier.
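The lookup order described above can be sketched in a few lines of Python. This is only an illustration, not what the aggregator actually does: the function name `item_url` is made up, and it assumes the item has already been parsed with the standard library's `xml.etree.ElementTree`. Relative URLs are not resolved.

```python
import xml.etree.ElementTree as ET

# RSS 1.0 puts its elements in a namespace; RSS 0.9x/2.0 elements have none.
RSS10 = "{http://purl.org/rss/1.0/}"


def item_url(item):
    """Return a best-guess URL for an <item> element, or None (hypothetical helper)."""
    # Prefer a non-blank <link>, whichever flavour of RSS it comes from.
    for tag in ("link", RSS10 + "link"):
        link = (item.findtext(tag) or "").strip()
        if link:
            return link
    # Fall back to <guid>, which RSS 2.0 treats as a permalink
    # unless isPermaLink is explicitly set to "false".
    guid = item.find("guid")
    if guid is not None and guid.text and guid.text.strip():
        if guid.get("isPermaLink", "true").strip().lower() != "false":
            return guid.text.strip()
    return None
```

A feed that leaves link blank but supplies a guid still yields a URL, while a guid marked `isPermaLink="false"` is ignored.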
Next up, extracting the actual content of an entry. Traditionally, this occurs as an HTML entity encoded string in the description element. Recently, however, a new element, content:encoded, has started to become fashionable (this uses a CDATA section). Even more recently, xhtml:body has started gaining ground; it uses namespaces to embed unencoded XHTML, making event based parsing of content that much more difficult...
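As a rough Python sketch of coping with all three, the helper below tries the richest element first and falls back to description. The order of preference and the name `item_content` are assumptions made for the example, and a tree-based parser is used here precisely because it sidesteps the event based parsing problem.

```python
import xml.etree.ElementTree as ET

# Namespace-qualified tag names as ElementTree spells them.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"
XHTML_BODY = "{http://www.w3.org/1999/xhtml}body"


def item_content(item):
    """Return (source, markup) for an <item> element (hypothetical helper)."""
    # xhtml:body holds real child elements, so it has to be serialised back to text.
    body = item.find(XHTML_BODY)
    if body is not None:
        return "xhtml:body", ET.tostring(body, encoding="unicode")
    # The parser has already undone CDATA sections and entity encoding,
    # so both of these come back as plain strings of HTML.
    for source, tag in (("content:encoded", CONTENT_ENCODED),
                        ("description", "description")):
        text = item.findtext(tag)
        if text and text.strip():
            return source, text.strip()
    return None, ""
```

Returning the source alongside the markup lets the caller decide how far to trust it, since a description is often just a summary.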
The aggregator I’m building is similar in style to Spycyroll, and as such needs to know the date that an entry was posted. On this point the specifications start to differ dramatically: RSS 2.0 uses pubDate, while RSS 1.0 relies on the Dublin Core element dc:date. In the wild this gets really messy: in a survey of the feeds linked to by Python Programmer Weblogs I found no fewer than five subtly (and not so subtly) different ways of representing dates. Here are some examples I picked up:
pubDate
- 2003-03-21T16:28:40
- 2003-04-03T07:45:57-08:00
- Fri, 04 Apr 2003 05:04:39 GMT
- Fri, 28 Mar 2003 05:18:59 -0800
- 1049379042.0
dc:date
- 2003-03-21T16:28:40
- 2003-01-17T13:03:00+00:00
- 2003-03-27T19:41:49-06:00
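All of the formats listed above can be handled by trying a handful of parsers in turn. The following is a minimal Python sketch using only the standard library; the function name `parse_feed_date` is invented for the example, and treating dates without a timezone as UTC is an assumption, since the feeds themselves give no hint.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _from_epoch(text):
    # Covers values such as "1049379042.0" (seconds since 1970).
    return datetime.fromtimestamp(float(text), tz=timezone.utc)


# Tried in order: epoch seconds, ISO 8601 (dc:date style), RFC 822 (pubDate style).
_PARSERS = (_from_epoch, datetime.fromisoformat, parsedate_to_datetime)


def parse_feed_date(text):
    """Return an aware UTC datetime, or None if nothing matched (hypothetical helper)."""
    text = text.strip()
    for parser in _PARSERS:
        try:
            parsed = parser(text)
        except (TypeError, ValueError, OverflowError, OSError):
            continue
        if parsed.tzinfo is None:
            # Assumption: a date with no timezone is taken to be UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    return None
```

Normalising everything to UTC means entries from different feeds can be sorted together without worrying about which format each one arrived in.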
Having battled through that lot, the conscientious aggregator writer hits the next big hurdle: approximately 10% of RSS feeds are badly formed XML! This issue is covered by Mark Pilgrim in Parsing RSS at all costs, where he presents an ultra-liberal Python RSS parser which uses Python’s relatively forgiving sgmllib module. Great, except PHP doesn’t have one of those... enter REX, a technique for “shallow parsing” of XML using regular expressions (no, it’s not as kludgy as it sounds; in fact Python’s sgmllib module is built on the same principles). Martin Spernau has an excellent article showing how REX can be implemented in PHP and demonstrates the technique in a modified version of the MagpieRSS library. Of course, XML purists (with very good reason) advocate ignoring badly formed feeds, but as Mark points out, this really isn’t a very practical approach.
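To give a flavour of the liberal approach, here is a small Python sketch that tries a strict XML parse first and only falls back to regular expressions when that fails. It is far cruder than REX or Mark's parser and is not taken from either; the names `parse_items`, `_strict_items` and `_loose_items` are made up, and only un-namespaced title, link and description elements are handled.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape

FIELDS = ("title", "link", "description")
FLAGS = re.DOTALL | re.IGNORECASE
ITEM_RE = re.compile(r"<item\b[^>]*>(.*?)</item\s*>", FLAGS)
FIELD_RES = {f: re.compile(rf"<{f}\b[^>]*>(.*?)</{f}\s*>", FLAGS) for f in FIELDS}
CDATA_RE = re.compile(r"\s*<!\[CDATA\[(.*?)\]\]>\s*", re.DOTALL)


def _strict_items(text):
    # Raises ET.ParseError if the feed is not well-formed XML.
    root = ET.fromstring(text)
    return [{f: (item.findtext(f) or "").strip() for f in FIELDS}
            for item in root.iter("item")]


def _loose_items(text):
    # Shallow, regex-only extraction: no nesting, no namespaces.
    items = []
    for block in ITEM_RE.findall(text):
        entry = {}
        for field, pattern in FIELD_RES.items():
            match = pattern.search(block)
            raw = match.group(1) if match else ""
            cdata = CDATA_RE.fullmatch(raw)
            # CDATA is taken verbatim; anything else has its entities decoded.
            entry[field] = (cdata.group(1) if cdata else unescape(raw)).strip()
        items.append(entry)
    return items


def parse_items(text):
    """Return (items, mode), where mode is "strict" or "loose"."""
    try:
        return _strict_items(text), "strict"
    except ET.ParseError:
        return _loose_items(text), "loose"
```

Reporting which mode was used makes it easy to log the feeds that needed rescuing, which keeps the purists' complaint visible even while the aggregator carries on.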
So what have I done? For now, I’ve gone with the path of least resistance. Onyx RSS is a well-designed parser based on PHP’s XML support (so no support for invalid feeds), and it works just fine for the moment. Unfortunately it is licensed under the GPL, and since this is likely to end up as a commercial project I’ll have to find something else for the final cut, or more likely implement something from scratch that uses REX.
Further reading:
- RSS Parsing in PHP—the blog entry that got me started.
- Mark’s Ultra Liberal RSS Parser.
- RSS Auto Discovery—one of the few things in the RSS world that appears to be a stationary target.
- The RSS 2.0 Specification (mirrored here).
- The RDF Site Summary (RSS) 1.0 specification.
- The invaluable RSS Validator.
- syndic8 has a bunch of articles on RSS.
Sterling Hughes - 4th April 2003 23:38 - #
Mark - 5th April 2003 01:44 - #
I think there's only one way out of this madness. Decide on a single implementation (my preference being RSS 1.0 + a few XHTML modules in namespaces), and write an "rsstidy" to take junk, and output your decided implementation.
It's a clean separation that allows the tidy tool to be re-used across many different applications, and keeps your application code simple. It's also advocating your preferred format.
Jim - 5th April 2003 01:57 - #
That's a pretty smart idea. I was thinking earlier that a solution to the badly formed RSS problem could be to have two parsing classes with identical interfaces, one using an XML parser and one using regular expressions. If the XML parsing fails, the regular expression one could be used as a backup option. Building a single parser and then writing something that can take garbage feeds / different versions and reformat them sounds like a much nicer option - far easier to maintain for one thing.
I've got quite specific information requirements from an RSS feed, so I think I'll work out a subset of RSS 2.0 that fits them and write a converter to get stuff to behave in line with what I want. In fact, if I do that I can write the rsstidy code in Python which we've established has much better tools for this kind of work :)
Thanks for the suggestion.
Simon Willison - 5th April 2003 02:08 - #
Doug - 5th April 2003 07:04 - #
Danny - 5th April 2003 10:06 - #
Bill Kearney - 5th April 2003 13:03 - #
Bill Kearney - 5th April 2003 13:04 - #
I was going to suggest the same method as Jim, but I see I am too late :)
I'd probably create different "readers" that take the input and turn it into my internal representation, and different "views" that output my internal representation in my choice of format.
The internal representation should be something nice and flexible, with good expressiveness, so I'd probably choose a simple data structure (or the "best" RSS format would probably work).
Not sure if this is the best way to do it, but it sounds good to me :)
Swannie - 5th April 2003 14:30 - #
Bill,
I think the "mod_kitchensink" guys are given an unfair rap. When people wring their hands over inline HTML, it's already been solved by the XHTML namespace. When I see new elements go into the RSS 2.0 spec, I notice that they already exist as part of the Dublin Core or similar, and are already usable with RSS 1.0. Now RSS 2.0 includes namespaces, so as far as I can tell, the developers of RSS 2.0 argued against RSS 1.0, and then proceeded to try and copy it (badly).
Jim - 5th April 2003 14:40 - #
Ed Swindelles - 6th April 2003 16:29 - #
Danny - 7th April 2003 09:32 - #