Safely consuming RSS: RegExps don’t cut it
12th June 2003
Mark Pilgrim highlights the severe security issues introduced by RSS aggregators that display potentially unsafe HTML, often executing it in the “secure zone” generally reserved for trusted local documents. Mark suggests a number of dangerous tags and attributes that should be removed before display. Unsurprisingly, regular expressions have cropped up in the comments as the suggested solution. Jamie Zawinski famously once posted the following to alt.religion.emacs:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
In this case, the above quote is particularly relevant. Parsing simple HTML with regular expressions is unpleasant but possible. Attempting to securely filter potentially malicious HTML while keeping the useful tags, on the other hand, can only lead to more problems. There are just too many possible combinations, thanks mainly to the huge flexibility provided by modern browsers. Attributes can be left unquoted, tags can be left unclosed, characters can be incorrectly escaped; it all adds up to far more variations than even the most comprehensive regexps can hope to match. Combine this with the fact that Internet Explorer for Windows has not only the most forgiving parser but also the most unpatched security holes and you’re looking at a very big problem.
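To make that concrete, here is a small sketch of the kind of filter that tends to get suggested, along with a few inputs that walk straight past it. The two patterns and the `naive_filter` name are hypothetical, written for illustration only; they are not taken from Mark's post or its comments.

```python
import re

# Hypothetical "obvious" filter: strip <script> blocks and
# double-quoted on* event handler attributes.
NAIVE_SCRIPT = re.compile(r"<script>.*?</script>", re.IGNORECASE | re.DOTALL)
NAIVE_HANDLER = re.compile(r'\son\w+="[^"]*"', re.IGNORECASE)


def naive_filter(html):
    """Apply both patterns and return whatever is left."""
    html = NAIVE_SCRIPT.sub("", html)
    return NAIVE_HANDLER.sub("", html)


samples = [
    '<script>alert(1)</script>',                         # caught
    '<script type="text/javascript">alert(1)</script>',  # attribute on the tag
    '<img src="x.png" onerror=alert(1)>',                # unquoted attribute
    "<img src='x.png' onerror='alert(1)'>",              # single-quoted attribute
    '<script>alert(1)',                                  # unclosed tag
]

for sample in samples:
    status = "removed" if naive_filter(sample) != sample else "MISSED"
    print(status, sample)
```

Each miss can be patched with a more elaborate pattern, but every patch only covers the variation you thought of, which is exactly the two-problems trap.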
The solution is to use a real HTML parser. Python users have the excellent sgmllib (although I’m not sure how robust it is when faced with truly unpleasant HTML) but other developers are not so lucky. I’m sure CPAN has some good solutions for Perl, but if you’re stuck with PHP your best bet is probably something based on REX.
At any rate, the parser you use had better be as forgiving as IE or it could miss damaging code. The easiest option is to strip HTML entirely, but doing so greatly reduces the usefulness of the content that comes in through the aggregator. Alternatively, strip all but the bare essentials and use some heavy-handed techniques to eliminate anything that could possibly be damaging. The latter approach could conceivably be achieved using regular expressions but would require some serious testing to stop up any leaks.
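The “strip all but the bare essentials” approach might look something like the sketch below. It uses `html.parser.HTMLParser` from the Python 3 standard library as a stand-in for sgmllib, which was removed in Python 3. The tag list, the attribute list, the permitted URL schemes, and the names `WhitelistSanitizer` and `sanitize` are all assumptions chosen for illustration. This is a minimal sketch, not a vetted filter, and nothing here guarantees that the parser sees markup the same way IE does.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist; anything not listed here is dropped.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br",
                "ul", "ol", "li", "blockquote", "code", "pre"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_CONTENT = {"script", "style"}  # discard these tags and their contents
VOID_TAGS = {"br"}                  # tags that never get a closing tag


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            # Heavy-handed: only absolute URLs with a known-safe scheme survive.
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped on the way out, so it cannot reintroduce markup.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">Hi <b>there</b></p>'))
```

Because the output is rebuilt from parsed tokens rather than edited in place, unquoted attributes and odd casing never reach the reader verbatim. The cost is that relative links and anything outside the whitelist are silently lost.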