
Simon Willison’s Weblog

Safe HTML checker

I’ve finally enabled a subset of HTML in my comments. In doing so, I had several requirements that needed to be fulfilled:

  1. Entered markup must be valid to XHTML strict, to stop comments from breaking validation and keep things nice and tidy.
  2. No presentational markup! I want to maintain control over how things look via my stylesheets—comments posted should only be able to use structural HTML elements.
  3. Attributes should be restricted to those that add semantic meaning. JavaScript event attributes and CSS-related attributes should not be allowed.
  4. I should retain full control over the tags and attributes allowed in the comments.
  5. Submitted HTML must be kept free from anything that could pose a security risk, such as javascript: URLs.

The system I have implemented works by running submitted posts through an XML parser, which checks that each element is in my list of allowed elements, is nested correctly (you can’t put a blockquote inside a p, for example) and doesn’t have any illegal attributes. My initial tests have shown it to work pretty well, but if anyone wants to have a go at breaking it, please be my guest.

The code for the main class is available here: SafeHtmlChecker.class.php
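
As a rough illustration of the approach described above (and not the actual SafeHtmlChecker class), the whitelist check can be sketched with PHP's built-in XML parser functions. The `$allowed` table, the `check_fragment()` function and the `comment` wrapper element are all assumptions made up for this example, and the nesting rules are left out for brevity.

```php
<?php
// Hypothetical whitelist: element name => attributes permitted on it.
$allowed = [
    'p'          => [],
    'em'         => [],
    'strong'     => [],
    'blockquote' => ['cite'],
    'a'          => ['href', 'title'],
];

// Returns a list of error messages; an empty list means the fragment passed.
function check_fragment(string $xhtml, array $allowed): array
{
    $errors = [];
    $depth = 0;
    $parser = xml_parser_create('UTF-8');
    // Case folding is on by default and would upper-case every tag name.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $start = function ($parser, $tag, $attrs) use (&$errors, &$depth, $allowed) {
        if ($depth++ === 0) {
            return; // the outermost element is the wrapper added below
        }
        if (!array_key_exists($tag, $allowed)) {
            $errors[] = "Tag $tag is not allowed";
            return;
        }
        foreach (array_keys($attrs) as $attr) {
            if (!in_array($attr, $allowed[$tag], true)) {
                $errors[] = "Tag $tag may not have attribute $attr";
            }
        }
    };
    $end = function ($parser, $tag) use (&$depth) {
        $depth--; // nesting checks would also hook in here
    };
    xml_set_element_handler($parser, $start, $end);

    // XML needs a single root element, so wrap the fragment in one.
    if (!xml_parse($parser, "<comment>$xhtml</comment>", true)) {
        $errors[] = 'XHTML is not well-formed';
    }
    return $errors;
}

print_r(check_fragment('<p style="color: red">Hello <em>world</em></p>', $allowed));
```

Run as written, the sketch should report that tag p may not have attribute style. Checking element and attribute names in the parser callbacks, rather than with regular expressions, means the well-formedness check comes for free from the parser.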

This is Safe HTML checker by Simon Willison, posted on 23rd February 2003.


78 comments

  1. The system supports all of the XHTML phrase elements, links (with optional titles), all three kinds of lists, blockquotes and paragraphs.

    Simon Willison - 23rd February 2003 15:08 - #

  2. Neat Simon, I think I'll test it here.

    Paragraphs are nice

    Can i put in an inline style, I probably shouldn't be able to (and I can't)

    • a list item
    • a red (nope) list item
    • number 3

    This has been a test: Now back to your regularly scheduled comments

    what happens when I don't enter correct code? I won't close this P tag.

    Ah! I get detailed information about what I did wrong! That's wonderful!

    This is what I got:

    Error adding Comment

    Your comment could not be added:

    • Tag p may not have attribute style
    • Tag li may not have attribute style
    • XHTML is not well-formed

    The HTML entered was as follows: [code was then listed]

    Nate - 23rd February 2003 15:46 - #

  3. I really like what you've done here, I have only 2 small suggestions Simon:
    1. Provide a link back to correct your comment from the results screen?
    2. Maybe link to a page somewhere that details the allowed and not-allowed coding

    Nate - 23rd February 2003 16:02 - #

  4. What if I want something to be > something else.

    Daniel Nolan - 23rd February 2003 17:01 - #

  5. Ok, the above worked, but any other mixture of greater and less than tags breaks. I guess its not really a bug, but if the above worked then the others should too. Plus the above should invalidate your document? (doesnt seem to on w3c checker)

    Daniel Nolan - 23rd February 2003 17:03 - #

  6. On the subject of comments, you need to validate the URL. For example, take my URL: if I enter it as "www.bleedingego.co.uk" on this article then it will link to "http://simon.incutio.com/archive/2003/02/23/www.bleedingego.co.uk" in IE. The http:// is necessary for the link to work correctly in IE.

    Daniel Nolan - 23rd February 2003 17:15 - #

  7. Greater and less than tags can be encoded using & characters: &lt; &gt; - I think a single > is valid XML as it can't be confused with the starting bracket of a tag, which is why it passes through the XML checker without throwing an error.

    Simon Willison - 23rd February 2003 17:15 - #

  8. URL validation is a good point - I'll add something that requires URLs to start with http://. One could argue that this would disallow relative links, but relative links are probably a bad idea in any case as they wouldn't make sense outside of the context of the current page (for example if comments were syndicated in an RSS feed).

    Simon Willison - 23rd February 2003 17:17 - #

  9. You are running a blockquotes.js javascript which converts cite attributes to href attributes. One could (maybe) put some javascript in there

    mw - 23rd February 2003 18:15 - #

  10. Crafty! I had actually thought of that this morning but it completely slipped my mind when I was coding - fixing it now. Thanks for pointing it out.

    Simon Willison - 23rd February 2003 18:21 - #

  11. <!--<script>alert('test');</script>

    mw - 23rd February 2003 18:55 - #

  12. Oops, this one is really bad. Just using <!--

    mw - 23rd February 2003 18:57 - #

  13. You should disable <-- somehow. It is not working in Internet Explorer, but with Mozilla more damage has been committed :)

    mw - 23rd February 2003 19:05 - #

  14. I meant to say <!--

    mw - 23rd February 2003 19:07 - #

  15. I have done now - that was a pretty nasty little exploit. The XML parser didn't notice (or care) that you'd only opened the comment and not closed it - I had to fix it by using str_replace() to eradicate any opening comment tags on sight:

    $xhtml = str_replace('<!--', '', $xhtml);

    Simon Willison - 23rd February 2003 20:05 - #

  16. mw - 23rd February 2003 22:18 - #

  17. <?xml-stylesheet href="javascript:alert('Maybe this works')" type="text/javascript"?>

    mw - 23rd February 2003 22:23 - #

  18. This is really fun, but if I am beginning to annoy you, just say it.

    mw - 23rd February 2003 22:30 - #

  19. ROFL no this is great stuff, although by the time you've finished it looks like I'm going to have a list of "special cases" a mile long. Worth it though.

    Simon Willison - 23rd February 2003 22:37 - #

  20. adsffds

    Tom Gilder - 23rd February 2003 23:05 - #

  21. <!<!----<script>alert('test');</script>

    mw - 23rd February 2003 23:37 - #

  22. <?xml-stylesheet href="" type="text/javascript"><script>alert('Maybe this then?')</script>?>

    mw - 23rd February 2003 23:51 - #

  23. Object

    Simon Willison - 23rd February 2003 23:52 - #

  24. Right, I've fixed as many of the holes as I could spot (including the glaring ones in the name/email/url fields) so hopefully most of that lot won't work any more. Definitely a trial by fire - thanks :)

    Simon Willison - 23rd February 2003 23:56 - #

  25. May I suggest that you place a list of acceptable tags somewhere, perhaps near this form? Not everyone who reads this is necessarily going to know the difference between presentation and structure.

    I understand that you have an error form that tells you off, and that you generally have a literate readership, but it's the principle of the thing: when this drops off the front page, there's no frame of reference.

    Raena - 24th February 2003 07:18 - #

  26. Agreed - I'll add that this evening. Thanks.

    Simon Willison - 24th February 2003 07:54 - #

  27. This is great, Simon. Nice work! I second the list of acceptable tags! I'd be ticked if I typed out some intricate markup only to find out that one of my tags wasn't allowed. A simple list solves that problem. Note, for those of you using Movable Type, maintaining control over what tags people use in comments is possible using Sanitize, Brad Choate's plugin which is now wrapped into MT. Also, Validable is a MT plugin that "corrects many of the most common 'invalid' constructs" by applying simple changes to the HTML. It could serve as another alternative to those who don't or can't use Simon's SafeHTMLChecker.

    Joshua Kaufman - 24th February 2003 14:19 - #

  28. Not sure if this is what's running the BCSS.info site, but if it is, then you need to check the email thing!

    Andrew Hayward - 24th February 2003 22:31 - #

  29. Evidently you sorted that one with htmlentities for the email... it's a bug on BCSS.info though!

    Andrew Hayward - 24th February 2003 22:34 - #

  30. Wow Simon, your class has made it to the public-evangelist list at the W3C. Nice work!

    Jan! - 26th February 2003 13:23 - #

  31. fwdfa

    "'>Tom Gilder - 1st March 2003 15:00 - #

  32. dsasfd

    me - 1st March 2003 15:00 - #

  33. dsasfd

    "'>--><script>document.body.backgroundImage="http: - 1st March 2003 15:00 - #

  34. dsasfd

    "'>--></script><script>document.body.backgroundIma - 1st March 2003 15:00 - #

  35. Came upon this code when I needed to write some XML parsing code in PHP for the first time. It was very handy! It should be noted that javascript can be invoked using the 'jscript:' and 'mocha:' schemes as well (depending on the browser). There may exist a 'vbscript:' scheme as well - I don't know :) Here's a test. It may be easier to set up a list of safe schemes, and filter out anything that's not in the list.

    David Weingart - 21st March 2003 19:37 - #

  36. long word test aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccc cccccccccccccccccccccccccccccccccccccddddddddddddd ddddddddddddddddddddddddddddddddeeeeeeeeeeeeeeeeee eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeefffffffffff ffffffffffffffffffffffffffffffffffffffgggggggggggg gggggggggggggggggggggggggggggg

    Freexe - 7th August 2003 13:39 - #

  37. Hmm.. it doesn't capture long words. If I stick in something to handle that I could get it to break up long lines in pre tags as well, meaning they could be supported without fear of site breakage.

    Simon Willison - 7th August 2003 14:50 - #

  38. you could use css word-wrap

    Freexe - 13th August 2003 20:48 - #

  39. fghfghgfh

    ggfhgfhgfhfgfghfg - 8th September 2003 20:47 - #

  40. I cannot resist testing whether or not Javascript links are really blocked.

    Klaus Johannes Rusch - 8th September 2003 22:33 - #

  41. xml-stylesheet href="javascript:alert('Maybe this works')" type="text/javascript"?>

    sdf - 15th September 2003 10:10 - #

  42. Hello Simon. What about quotes " and other unescáped entitiés?

    manuel razzari - 6th October 2003 05:21 - #

  43. Hmm. Replaced my unescaped entities with some other entities, which by the way do validate. Is it because of the document encoding?

    manuel razzari - 6th October 2003 05:30 - #

  44. Simon: re long words: You might want to consider using CSS to deal with them. .comment{ overflow: auto/hidden; } Should cover it.

    GaryF - 17th October 2003 10:04 - #

  45. This breaks your filter (but only on Opera).

    BTW check out my filter kses.

    // Ulf Harnhammar

    Ulf - 27th October 2003 20:03 - #

  46. This breaks your filter, at least on Mozilla.

    (The "javascript:" filtering part is quite hard. I've implemented something called whitelisting URL protocols in my filter, to deal with it.)

    // Ulf

    Ulf Harnhammar - 28th October 2003 17:32 - #

  47. Nice, but I can't see how most commenters (ok, your readership may be quite techy, but still...) would want to bother with HTML tags.

    I've been working on something very similar to Textile (although with a LOT less features) for the past few months, which I believe is a lot more user-friendly and transparent for something as simple as a blog comment.

    Block level elements (p, h1-h6, blockquote, ul, ol, li, pre) and a whole bunch of other elements are automatically (or close to it) generated from the near-natural-language text. As a bonus, Textile also allows HTML input (which you could filter of course -- I do) for the techy commenters who want to dig a little deeper.

    Keep it simple for those not in the know, and for those who are, let them dig a little deeper.

    Justin French - 7th December 2003 13:41 - #

  48. Hi - Textile looks good - but unless I'm not understanding things correctly - Textile and SafeHtmlChecker.class.php do different things; one generates HTML, the other validates the HTML and keeps it clean and tidy. Put them together and you have a winning team. :-)

    I can download SafeHtmlChecker.class.php - can I do the same with Textile? What is it written in (i.e. PHP, Perl, ASP?)

    All the best, Jim

    Jim Byrne - 10th December 2003 17:09 - #

  49. Please, check out:
    Content management system Absolut Engine
    It has WYSIWYG editor, produces valid XHTML Strict that complies with webstandards and supports clean URLs.

    dusoft - 16th April 2004 15:16 - #

  50. thanks

    test - 22nd April 2004 20:24 - #

  51. looks good, i was looking for smth similar to smartypants (MT) but in php. i think you've made it so no need to reinvent the wheel. thanx. a lot.

    Sergi - 2nd May 2004 23:54 - #

  52. Simon, The validator will reject unescaped ampersands (including those in urls), however it won't provide a warning message, it will simply state "XHTML is not well-formed". Just so ya know. (I tried to e-mail you about this, but I guess my addy didn't make it thru the spam hoops...)

    Mike P. - 31st May 2004 17:07 - #

  53. test link

    Mark - 9th October 2004 05:44 - #

  54. testing

    sdfasdf - 22nd November 2004 20:55 - #

  55. looks great but I've been getting a fatal error (cant redefine class). possibly due to my setup, then again I'm not PHP literate so I really cant tell...

    Vidar - 26th November 2004 04:52 - #

  56. welldone

    sonny - 10th December 2004 13:17 - #

  57. >> VBScript Link (you can safely click here - just a MsgBox call. Works only under IE)

    There are also lots of different stuff, like ECMAScript, JScript1, PerlScript, etc.

    It is generally better to test href/cite attributes against a list of safe protocols. The following code introduces this.

    if ((($attr == 'cite') || ($attr == 'href')) && !preg_match('/^(http|ftp|mailto|news|ed2k|dchub|irc|telnet|gopher|about):/i', trim($value))) {
      if (preg_match('/^([A-Z0-9]+):/i', trim($value), $temp)) {
        $this->errors[] = "<code>$attr</code> attribute cannot link to <code>" . htmlspecialchars($temp[1]) . ":</code> protocol";
      } else {
        $this->errors[] = "<code>$attr</code> attribute cannot link to such protocol";
      }
    }

    Also, I suggest to let user know why XML parser failed on his code

    if (!xml_parse($this->parser, $xhtml)) {
      $this->errors[] = 'Not well-formed XHTML: ' . xml_error_string(xml_get_error_code($this->parser) );
    }

    Personally, I've made XHTML to be parsed and corrected by HTMLTidy (as PHP5 module), which greatly eases the process for end user.

    $tidycfg = array(
      'show-body-only' => TRUE,
      'quote-nbsp' => TRUE,
      'output-xhtml' => TRUE,
      'hide-comments' => TRUE,
      'drop-proprietary-attributes' => TRUE,
      'clean' => TRUE,
      'bare' => TRUE,
      'quote-ampersand' => TRUE,
      'numeric-entities' => TRUE,
      'ncr' => TRUE,
    );
    $html = tidy_repair_string($rawdata, $tidycfg, 'latin1');

    Oh, yeah, thanks for your work! You save me some time thinking of how to implement this... And sorry for my bad English.

    drdaeman - 12th December 2004 17:31 - #

  58. Nice work! What's the license on your code?
    David

    David Kelso - 19th February 2005 12:28 - #

  59. Great work!

    Daniel - 9th March 2005 19:59 - #

  60. Hi Simon I like your class. I'm working on new CMS project and testing your class as plugin. http://www.vision.to/ This will be commercial CMS but Plugins are opensource.

    feha - 27th March 2005 02:00 - #

  61. Your code doesn't correctly pick up <strong /> as an invalid tag. I would have posted proof but all code after the tag would be bold. I didn't want to cause that.

    Courtney Mile - 5th May 2005 07:51 - #

  62. Hello ! I like what you did, but I chose a totally different approach !

    I replace :

    • '<!' by '&lt;!'
    • '<?' by '&lt;?'
    • '<#' by '&lt;#'
    • 'script' by 'scr|pt'
    • 'data' by 'da|a'

    Is it enough for security purpose ? I don't really care if people can do a big mess with some non-XHTML. There is always someone to correct bad stuff...

    See an example on http://wiki.vi5.org/demo/

    Guilain Omont - 6th May 2005 16:14 - #

  63. wonderful code man, i tried many things but this class puts it all nice together!

    reiben - 10th May 2005 13:38 - #

  64. thanx for your nice work, simon!

    bernd - 23rd May 2005 13:03 - #

  65. ghsfhsfdhhkmgdlhndlgfjnbkfdjnbkldn kdngbjk dfg kfjdgn jkfng fn g jngkfn ljdgnlfdgjnl fdnfjdn gnd lngd fjg knsjn ljdfsn lgjdn ln lgjn jlgnlfn djlgn lfg fldgn jlfng jfdg jn jn jlgfn ldn lsn glnd jkldngjk fdngjk dnfgjknd kgsnd djn jkdfngfjdn gkfdngkfdsngjkbn fjkv

    ryt - 12th October 2005 10:48 - #

  66. How about umlauts? �h!

    Steffen - 24th October 2005 22:24 - #

  67. dfffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffffffffffffffffffffffffffffffffffffffffff ff

    test - 30th November 2005 23:44 - #

  68. dsfajjdddddddddddddddddddddddddddddddddddddddddddd dddddddddddddddddddddddddddddddddddddddddddddddddd dddddddddddddddddjjjjjjjjuuuuuuuuuuuuuuuuuuuuuuuuu uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuujjjjjj

    test - 30th November 2005 23:45 - #

  69. sssssssss

    sss - 10th December 2005 17:44 - #

  70. Hi, I'm trying this plugin in WordPress, but I hit a problem when I try to write a line break in a post; that is, when I try to write <br/> inside a post. For example: <p>a b</p> is OK for the plugin, but <p>a<br/>b</p> is bad for the plugin. I also tried with just <br> (in the code I only see br), but it doesn't work.

    name - 12th December 2005 10:03 - #

  71. I hate to be a pain, but it seems like there's a minor error in the otherwise excellent SafeHtmlChecker.class.php available for download at the link in your original post.

    In lines 90 and 92, you assign the result of a str_replace to a variable called $comment, which appears nowhere else in the text. The effect is that the checks for <? and <script aren't applied to the contents of $xhtml.

    I assume the linked version isn't the newest, as <script> tags don't work in your comments. As it is offered for download, though, it happily passes through script tags and PHP blocks. You might want to fix that. :)

    On an unrelated note, using Google to filter user-submitted URLs is a really nice touch. Super nifty, and so is SafeHtmlChecker -- thanks!

    Aaron - 2nd February 2006 00:25 - #

  72. do you have new version? thanks!

    john - 24th March 2006 18:38 - #

  73. A link within a link.

    Edward Z. Yang - 18th April 2006 00:58 - #

  74. Sorry, I couldn't resist. Your class doesn't check for an A tag within an A tag, which is illegal according to the spec (your page doesn't validate anymore, see the validator). If you still maintain this class, anyway.

    Edward Z. Yang - 18th April 2006 01:01 - #

  75. thanks

    omer - 21st May 2006 01:52 - #

  76. Testing. comment --> Testing.

    Jim Bim - 10th July 2006 01:27 - #

  77. Do you have something similar for Django/ Python? Markdown seems to allow way too much HTML, including SCRIPT tags.

    Bjorn Stabell - 12th October 2006 05:54 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker
