Safely consuming RSS: RegExps don’t cut it
Mark Pilgrim highlights the severe security issues introduced by RSS aggregators that display potentially unsafe HTML, often executing it in the “secure zone” generally reserved for trusted local documents. Mark suggests a number of dangerous tags and attributes that should be removed before display. Unsurprisingly, regular expressions have cropped up in the comments as the suggested solution. Jamie Zawinski famously once posted the following to comp.lang.emacs:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
In this case, the above quote is particularly relevant. Parsing simple HTML with regular expressions is unpleasant but possible; attempting to securely filter potentially malicious HTML (while trying to keep the useful tags) can only lead to more problems. There are just too many possible combinations, thanks mainly to the huge flexibility provided by modern browsers. Attributes can be left unquoted, tags can be left unclosed, characters can be incorrectly escaped; it all adds up to far more variations than even the most comprehensive regexps can hope to match. Combine this with the fact that Internet Explorer for Windows has not only the most forgiving parser but also the most unpatched security holes, and you’re looking at a very big problem.
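To make the unquoted-attribute problem concrete, here is a minimal sketch in Python. The pattern and the helper name `naive_strip_handlers` are hypothetical, invented purely for illustration; they are not taken from Mark's post or its comments. The filter handles the tidy, double-quoted form of an event handler and lets two equally valid spellings straight through.

```python
import re

# Hypothetical naive filter: removes on* event handlers, but only when the
# attribute value is wrapped in double quotes.
NAIVE_HANDLER = re.compile(r'\son\w+="[^"]*"', re.IGNORECASE)


def naive_strip_handlers(html):
    """Strip double-quoted on* attributes; everything else passes through."""
    return NAIVE_HANDLER.sub("", html)


double_quoted = '<img src="a.png" onerror="alert(1)">'
unquoted = '<img src=a.png onerror=alert(1)>'
single_quoted = "<img src='a.png' onerror='alert(1)'>"

for sample in (double_quoted, unquoted, single_quoted):
    # Only the first sample loses its handler; the other two are untouched.
    print(naive_strip_handlers(sample))
```

Each gap can be patched with a more elaborate pattern, but every patch covers one more spelling rather than the whole space of spellings a forgiving browser will accept.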
The solution is to use a real HTML parser. Python users have the excellent sgmllib (although I’m not sure how robust it is when faced with truly unpleasant HTML), but other developers are not so lucky. I’m sure CPAN has some good solutions for Perl, but if you’re stuck with PHP your best bet is probably something based on REX.
At any rate, the parser you use had better be as forgiving as IE or it could miss out on damaging code. The easiest option is to strip HTML entirely, but doing so means greatly reducing the usefulness of the content that comes in through the aggregator. Alternatively, strip all but the bare essentials and use some heavy-handed techniques to eliminate anything that could possibly be damaging. The latter approach could conceivably be achieved using regular expressions but would require some serious testing to stop up any leaks.
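As a rough sketch of the "strip all but the bare essentials" approach built on a parser rather than on regular expressions, something like the following could work. It uses `html.parser` from the Python standard library as a stand-in for sgmllib, which was removed in Python 3. The tag and attribute whitelists, the class name `WhitelistSanitizer`, and the helper `sanitize` are all assumptions made for illustration. This is not a vetted security filter: it does not balance tags, and its tokenizer is not guaranteed to read input the way any particular browser does.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists for illustration; a real aggregator would tune these.
ALLOWED_TAGS = {"p", "br", "em", "strong", "a", "ul", "ol", "li", "blockquote"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://")
# Elements whose text content is dropped along with the tags themselves.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    """Re-emits only whitelisted tags and attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            value = (value or "").strip()
            # Only plain web links survive; javascript: and friends do not.
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue
            # Attributes are always written back out fully quoted.
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick=evil()>Hi <b>there</b><script>alert(1)</script> '
         '<a href="javascript:alert(1)">x</a> '
         '<a href=http://example.com>ok</a></p>')
print(sanitize(dirty))
```

The design choice here is that nothing from the input is copied through verbatim: tags are rebuilt from their parsed names, attribute values are re-quoted, and text is re-escaped, so unquoted or oddly escaped input cannot carry its quirks into the output.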
Was it just a coincidence that when I got an update notification from blo.gs and came over here, Mozilla threw a parsing error (complaining about mismatched paragraph tags)?
James - 12th June 2003 22:11 - #
Simon Willison - 12th June 2003 22:36 - #
That's the fun of properly serving XHTML, isn't it? After doing it for a while, I finally just set up a small mirror of my weblog to check for errors before I publish anything to the "real" site.
James - 12th June 2003 22:48 - #
"At any rate, the parser you use had better be as forgiving as IE or it could miss out on damaging code."
I don't understand that statement. Here's how I imagine an HTML sanitizer working:
I don't see what being forgiving in parsing the input has to do with the output not being able to contain malicious code.
Jesse Ruderman - 14th June 2003 09:33 - #
Simon Willison - 14th June 2003 10:38 - #
Shot - 14th June 2003 12:13 - #
Simon Willison - 14th June 2003 12:21 - #
Once the HTML code is in a tree form, there is no such thing as an unfinished tag, an unclosed tag, or an unquoted attribute. IE (and other browsers!) are very unlikely to parse the output from step 3 in my previous comment in a different way from how the HTML sanitizer intended it to be parsed.
I'm only aware of one case where a browser treats valid HTML in a quirky way. In Mozilla and probably in other browsers, <p><table></table> is treated as <p><table></table></p> rather than <p></p><table></table> (bug 129508, which IMO should not be fixed as stated). This particular case takes place at a higher level than the "is this text, inside a tag, or inside an attribute" level that matters for keeping the most dangerous code (scripts) out.
Jesse Ruderman - 15th June 2003 04:22 - #
I guess that is exactly why not closing the paragraph tag should in fact be regarded as invalid HTML. Strictly speaking, if you don't close a p tag explicitly, you can only guess where it ends.
Ruben - 18th June 2003 06:25 - #