Simon Willison’s Weblog

Safely consuming RSS: RegExps don’t cut it

Mark Pilgrim highlights the severe security issues introduced by RSS aggregators that display potentially unsafe HTML, often executing it in the “secure zone” generally reserved for trusted local documents. Mark suggests a number of dangerous tags and attributes that should be removed before display. Unsurprisingly, regular expressions have cropped up in the comments as the suggested solution. Jamie Zawinski famously once posted the following to comp.lang.emacs:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

In this case, the above quote is particularly relevant. Parsing simple HTML with regular expressions is unpleasant but possible; attempting to securely filter potentially malicious HTML (while trying to keep the useful tags) can only lead to more problems. There are just too many possible combinations, thanks mainly to the huge flexibility provided by modern browsers. Attributes can be left unquoted, tags can be left unclosed, characters can be incorrectly escaped; it all adds up to far more variables than even the most comprehensive regexps can hope to match. Combine this with the fact that Internet Explorer for Windows has not only the most forgiving parser but also the most unpatched security holes and you’re looking at a very big problem.
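
To make the point concrete, here is a small sketch of the kind of regexp filter that tends to get suggested, run against a few inputs a forgiving browser would still act on. The patterns and the `naive_filter` name are made up for illustration; they are not taken from Mark's post or its comments.

```python
import re

# Hypothetical "good enough" filter: strip <script> blocks and quoted
# on* event-handler attributes. Both patterns are illustrative assumptions.
SCRIPT_RE = re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)
QUOTED_HANDLER_RE = re.compile(r'\son\w+\s*=\s*"[^"]*"', re.IGNORECASE)


def naive_filter(markup):
    """Apply both substitutions once, the way a quick regexp fix would."""
    markup = SCRIPT_RE.sub("", markup)
    return QUOTED_HANDLER_RE.sub("", markup)


samples = [
    '<p onclick="evil()">hi</p>',                # quoted attribute: caught
    '<p onclick=evil()>hi</p>',                  # unquoted attribute: missed
    '<scr<script></script>ipt>evil()</script>',  # removal builds a new <script>
    '<script>evil()',                            # unclosed tag: missed
]

for sample in samples:
    print(repr(sample), "->", repr(naive_filter(sample)))
```

Only the first sample comes out clean. Each miss can be patched with another pattern, but every patch covers one more variation rather than the whole class.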

The solution is to use a real HTML parser. Python users have the excellent sgmllib (although I’m not sure how robust it is when faced with truly unpleasant HTML) but other developers are not so lucky. I’m sure CPAN has some good solutions for Perl, but if you’re stuck with PHP your best bet is probably something based on REX.

At any rate, the parser you use had better be as forgiving as IE or it could miss damaging code. The easiest option is to strip HTML entirely, but doing so means greatly reducing the usefulness of the content that comes in through the aggregator. Alternatively, strip all but the bare essentials and use some heavy-handed techniques to eliminate anything that could possibly be damaging. The latter approach could conceivably be achieved using regular expressions but would require some serious testing to stop up any leaks.
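
As a rough sketch of the "strip HTML entirely" option, the following uses the `html.parser` module from the Python 3 standard library rather than sgmllib (that substitution is mine). The `TextOnly` class and `strip_html` function are hypothetical names, and the sketch assumes that the contents of script and style elements should be discarded along with their tags.

```python
import html
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and discard every tag."""

    SKIP_CONTENT = {"script", "style"}  # assumption: drop their text too

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_CONTENT:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_CONTENT and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup`, re-escaped so it is safe to display."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return html.escape("".join(parser.parts))


print(strip_html('<p>Hello <b>world</b></p><script>evil()</script>'))
print(strip_html('<a href=x onclick=evil()>link</a>'))
```

The output is escaped again at the end, so anything the parser passed through as text cannot turn back into markup when displayed.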

This is Safely consuming RSS: RegExps don’t cut it by Simon Willison, posted on 12th June 2003.

9 comments

  1. Was it just a coincidence that when I got an update notification from blo.gs and came over here, Mozilla threw a parsing error (complaining about mismatched paragraph tags)?

    James - 12th June 2003 22:11 - #

  2. Yup - ironically enough I missed a closing tag when I posted the entry. It only took about 30 seconds to fix, but my CMS had already pinged blo.gs and weblogs.

    Simon Willison - 12th June 2003 22:36 - #

  3. That's the fun of properly serving XHTML, isn't it? After doing it for a while, I finally just set up a small mirror of my weblog to check for errors before I publish anything to the "real" site.

    James - 12th June 2003 22:48 - #

  4. "At any rate, the parser you use had better be as forgiving as IE or it could miss out on damaging code."

    I don't understand that statement. Here's how I imagine an HTML sanitizer working:

    • Parse the HTML into an internal tree representation.
    • Remove all but a few whitelisted tags, attributes, properties in style attributes, and protocols in src/href attributes.
    • Output the result as valid HTML.

    I don't see what being forgiving in parsing the input has to do with the output not being able to contain malicious code.

    Jesse Ruderman - 14th June 2003 09:33 - #
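
The three steps in comment 4 can be sketched roughly as follows, again with Python's standard `html.parser`. This is a simplified illustration, not Jesse's implementation: it streams parser events and tracks open tags instead of building a full tree, the whitelists are made-up examples, and style attributes are dropped outright rather than filtered property by property. All names here (`Sanitizer`, `sanitize`, `url_is_safe`) are hypothetical.

```python
import html
from html.parser import HTMLParser

# Illustrative whitelists (assumptions): tag -> attributes allowed on it.
ALLOWED_TAGS = {"p": set(), "b": set(), "i": set(), "em": set(),
                "strong": set(), "blockquote": set(), "a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
DROP_CONTENT = {"script", "style"}


def url_is_safe(url):
    """Accept only absolute URLs whose scheme is whitelisted."""
    cleaned = "".join(ch for ch in url if ch > " ")  # drop spaces/control chars
    if ":" not in cleaned:
        return False
    return cleaned.split(":", 1)[0].lower() in ALLOWED_SCHEMES


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []
        self.drop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.drop_depth += 1
            return
        if self.drop_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            if name == "href" and not url_is_safe(value):
                continue
            # Attribute values are always re-quoted and re-escaped on output.
            kept.append(' %s="%s"' % (name, html.escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif tag in self.open_tags:
            # Close anything still open inside this element, then the element.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append("</%s>" % top)
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.drop_depth:
            self.out.append(html.escape(data, quote=False))

    def result(self):
        self.close()
        while self.open_tags:  # close whatever the input left open
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    return parser.result()


print(sanitize('<p onclick=evil()>Hi <a href="javascript:evil()">there</a></p>'))
```

Because the output is generated from scratch, every emitted tag is closed and every attribute is quoted, whatever the input looked like.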

  5. What I mean is that it needs to be able to cope with anything that IE can cope with - if the safety parser can't parse something it won't be able to remove the dangerous bits, but because IE will parse almost anything you throw at it it's possible that something that looks like garbage to the safety parser will be executed as malicious code once it gets to IE.

    Simon Willison - 14th June 2003 10:38 - #

  6. Wouldn’t the simplest way be to encode all of the HTML to entities (rather than stripping tags altogether), and then recode back only desired, properly-closed-and-with-properly-quoted-attributes tags?

    Shot - 14th June 2003 12:13 - #
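
A minimal sketch of the idea in comment 6, assuming a small made-up set of permitted tags: escape everything first, then turn back only those escaped sequences that match a strict pattern. The names and patterns are hypothetical. The sketch restores links only as complete start/end pairs with a quoted http(s) href, and it does not check that the simple tags are balanced.

```python
import html
import re

SIMPLE_TAGS = ("b", "i", "em", "strong", "p", "blockquote")  # illustrative
# Escaped form of a bare tag such as <b> or </b>, with no attributes at all.
SIMPLE_TAG_RE = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(SIMPLE_TAGS))
# Escaped form of <a href="http...">text</a>; the URL may not contain
# whitespace, '&', '<' or '>', which is deliberately conservative.
LINK_RE = re.compile(
    r"&lt;a href=&quot;(https?://[^\s&<>]*)&quot;&gt;(.*?)&lt;/a&gt;",
    re.DOTALL,
)


def escape_then_restore(markup):
    """Entity-encode all markup, then re-enable strictly well-formed tags."""
    escaped = html.escape(markup, quote=True)
    escaped = SIMPLE_TAG_RE.sub(r"<\1\2>", escaped)
    return LINK_RE.sub(r'<a href="\1">\2</a>', escaped)


print(escape_then_restore('<b>bold</b> <a href="http://example.com/">x</a>'))
print(escape_then_restore('<a href=http://example.com/ onclick=evil()>x</a>'))
print(escape_then_restore('<script>evil()</script>'))
```

Anything that does not match the strict patterns, including a link with an unquoted href, simply shows up as visible escaped text instead of being executed or silently dropped.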

  7. That's a pretty neat idea. There are probably loads of feeds out there with horrible HTML that is still useful (for example links where they failed to quote the href attribute) but it's possible such a feature would encourage people to improve their HTML.

    Simon Willison - 14th June 2003 12:21 - #

  8. if the safety parser can't parse something it won't be able to remove the dangerous bits, but because IE will parse almost anything you throw at it it's possible that something that looks like garbage to the safety parser will be executed as malicious code once it gets to IE

    Once the HTML code is in a tree form, there is no such thing as an unfinished tag, an unclosed tag, or an unquoted attribute. IE (and other browsers!) are very unlikely to parse the output from step 3 in my previous comment in a different way from how the HTML sanitizer intended it to be parsed.

    I'm only aware of one case where a browser treats valid HTML in a quirky way. In Mozilla and probably in other browsers, <p><table></table> is treated as <p><table></table></p> rather than <p></p><table></table> (bug 129508, which IMO should not be fixed as stated). This particular case takes place at a higher level than the "is this text, inside a tag, or inside an attribute" level that matters for keeping the most dangerous code (scripts) out.

    Jesse Ruderman - 15th June 2003 04:22 - #

  9. valid HTML

    I guess that is exactly why not closing the paragraph-tag should in fact be regarded as invalid HTML. Strictly speaking you can only know by guessing where a p-tag ends if you don't state it specifically.

    Ruben - 18th June 2003 06:25 - #

Previously hosted at http://simon.incutio.com/archive/2003/06/12/safelyConsuming