Simon Willison’s Weblog

Safe HTML checker

I’ve finally enabled a subset of HTML in my comments. In doing so, I had several requirements that needed to be fulfilled:

  1. Entered markup must be valid XHTML Strict, to stop comments from breaking validation and keep things nice and tidy.
  2. No presentational markup! I want to maintain control over how things look via my stylesheets—comments posted should only be able to use structural HTML elements.
  3. Attributes should be restricted to those that add semantic meaning. JavaScript event attributes and CSS-related attributes should not be allowed.
  4. I should retain full control over the tags and attributes allowed in the comments.
  5. Submitted HTML must be kept free from anything that could pose a security risk, such as javascript: URLs.

The system I have implemented works by running submitted posts through an XML parser, which checks that each element is in my list of allowed elements, is nested correctly (you can’t put a blockquote inside a p, for example) and doesn’t have any illegal attributes. My initial tests have shown it to work pretty well, but if anyone wants to have a go at breaking it, please be my guest.
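
As a rough illustration of the approach described above, here is a minimal sketch in Python. It is not the code from SafeHtmlChecker.class.php (which is PHP); the element list, nesting rules, allowed URL schemes and the names check_comment, ALLOWED_ATTRS and ALLOWED_CHILDREN are all assumptions made up for this example. It also skips some XHTML Strict rules, such as forbidding bare text directly inside a blockquote.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical whitelist: element name -> attributes allowed on it.
ALLOWED_ATTRS = {
    "p": set(),
    "blockquote": {"cite"},
    "ul": set(),
    "li": set(),
    "a": {"href", "title"},
    "em": set(),
    "strong": set(),
    "code": set(),
}

INLINE = {"a", "em", "strong", "code"}
BLOCK = {"p", "blockquote", "ul"}

# Hypothetical nesting rules: parent element -> child elements it may contain.
ALLOWED_CHILDREN = {
    "root": BLOCK,          # "root" is the wrapper added in check_comment()
    "p": INLINE,            # so a blockquote inside a p is rejected
    "blockquote": BLOCK,
    "ul": {"li"},
    "li": INLINE,
    "a": INLINE - {"a"},    # no links inside links
    "em": INLINE,
    "strong": INLINE,
    "code": set(),
}

URL_ATTRS = {"href", "cite"}
SAFE_SCHEMES = {"http", "https", "mailto"}  # assumption: absolute URLs only


def _is_safe_url(value):
    """Accept only URLs with an explicitly whitelisted scheme."""
    try:
        scheme = urlparse(value.strip()).scheme.lower()
    except ValueError:
        return False
    return scheme in SAFE_SCHEMES


def _check_children(element, name, errors):
    """Recursively check each child against the whitelist and nesting rules."""
    for child in element:
        tag = child.tag
        if tag not in ALLOWED_ATTRS:
            errors.append(f"element <{tag}> is not allowed")
            continue  # do not descend into unknown elements
        if tag not in ALLOWED_CHILDREN[name]:
            errors.append(f"<{tag}> may not appear inside <{name}>")
        for attr, value in child.attrib.items():
            if attr not in ALLOWED_ATTRS[tag]:
                errors.append(f"attribute {attr} is not allowed on <{tag}>")
            elif attr in URL_ATTRS and not _is_safe_url(value):
                errors.append(f"unsafe URL in {attr} on <{tag}>")
        _check_children(child, tag, errors)


def check_comment(markup):
    """Return a list of problems found; an empty list means the markup passed."""
    try:
        # Wrap the fragment so that several top-level elements parse as one document.
        root = ET.fromstring("<root>" + markup + "</root>")
    except ET.ParseError as exc:
        return [f"markup is not well-formed XML: {exc}"]
    errors = []
    _check_children(root, "root", errors)
    return errors


if __name__ == "__main__":
    for sample in (
        "<p>Hello <em>world</em></p>",
        "<p><blockquote><p>nested wrongly</p></blockquote></p>",
        '<p><a href="javascript:alert(1)">click me</a></p>',
    ):
        print(sample, "->", check_comment(sample))
```

The sketch uses a whitelist throughout: anything not explicitly listed (elements, attributes, URL schemes) is rejected, so event attributes such as onclick and the style attribute fail without needing to be named.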

The code for the main class is available here: SafeHtmlChecker.class.php

This is Safe HTML checker by Simon Willison, posted on 23rd February 2003.

Previously hosted at http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker