Safe HTML checker
I’ve finally enabled a subset of HTML in my comments. In doing so, I had several requirements that needed to be fulfilled:
- Entered markup must be valid XHTML Strict, to stop comments from breaking validation and keep things nice and tidy.
- No presentational markup! I want to maintain control over how things look via my stylesheets—comments posted should only be able to use structural HTML elements.
- Attributes should be restricted to those that add semantic meaning. JavaScript event attributes and CSS-related attributes should not be allowed.
- I should retain full control over the tags and attributes allowed in the comments.
- Submitted HTML must be kept free from anything that could pose a security risk, such as javascript: URLs.
The system I have implemented works by running submitted posts through an XML parser, which checks that each element is in my list of allowed elements, is nested correctly (you can’t put a blockquote inside a p, for example) and doesn’t have any illegal attributes. My initial tests have shown it to work pretty well, but if anyone wants to have a go at breaking it, please be my guest.
The code for the main class is available here: SafeHtmlChecker.class.php
The system supports all of the XHTML phrase elements, links (with optional titles), all three kinds of lists, blockquotes and paragraphs.
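As a rough illustration of the approach described above (and not the code in SafeHtmlChecker.class.php itself), a whitelist check on top of PHP's XML parser functions might look something like the sketch below. The `$allowed` table, the `check_comment()` function and the `<comment>` wrapper element are all assumptions made up for this example.

```php
<?php
// Hypothetical whitelist: element => allowed parents and allowed attributes.
// '#root' stands for the top level of a comment.
$allowed = [
    'p'          => ['parents' => ['#root', 'blockquote', 'li'], 'attrs' => []],
    'blockquote' => ['parents' => ['#root', 'blockquote', 'li'], 'attrs' => ['cite']],
    'ul'         => ['parents' => ['#root', 'blockquote', 'li'], 'attrs' => []],
    'li'         => ['parents' => ['ul'], 'attrs' => []],
    'em'         => ['parents' => ['p', 'li', 'em', 'a'], 'attrs' => []],
    'a'          => ['parents' => ['p', 'li', 'em'], 'attrs' => ['href', 'title']],
];

// Returns a list of error messages; an empty list means the markup passed.
function check_comment(string $xhtml, array $allowed): array
{
    $errors = [];
    $stack = [];   // names of the currently open elements

    $parser = xml_parser_create('UTF-8');
    // Report tag names exactly as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$errors, &$stack, $allowed) {
            if ($stack === []) {
                // The first element seen is our own wrapper, not user markup.
                $stack[] = '#root';
                return;
            }
            $parent = end($stack);
            if (!isset($allowed[$name])) {
                $errors[] = "<$name> is not an allowed element";
            } else {
                if (!in_array($parent, $allowed[$name]['parents'], true)) {
                    $errors[] = "<$name> cannot appear directly inside $parent";
                }
                foreach (array_keys($attrs) as $attr) {
                    if (!in_array($attr, $allowed[$name]['attrs'], true)) {
                        $errors[] = "$attr is not an allowed attribute on <$name>";
                    }
                }
            }
            $stack[] = $name;
        },
        function ($p, string $name) use (&$stack) {
            array_pop($stack);
        }
    );

    // XML needs a single root element, so wrap the fragment before parsing.
    if (!xml_parse($parser, "<comment>$xhtml</comment>", true)) {
        $errors[] = 'Not well-formed: ' . xml_error_string(xml_get_error_code($parser));
    }
    return $errors;
}
```

Calling `check_comment($text, $allowed)` on a submitted comment gives back the list of problems to show to the poster. The sketch only looks at elements and attributes; it does not inspect attribute values such as link targets.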
Simon Willison - 23rd February 2003 15:08 - #
Paragraphs are nice
Can I put in an inline style? I probably shouldn't be able to (and I can't).
This has been a test: Now back to your regularly scheduled comments
what happens when I don't enter correct code? I won't close this P tag.
Ah! I get detailed information about what I did wrong! That's wonderful!
This is what I got:
Nate - 23rd February 2003 15:46 - #
Nate - 23rd February 2003 16:02 - #
Daniel Nolan - 23rd February 2003 17:01 - #
Daniel Nolan - 23rd February 2003 17:03 - #
Daniel Nolan - 23rd February 2003 17:15 - #
Simon Willison - 23rd February 2003 17:15 - #
Simon Willison - 23rd February 2003 17:17 - #
mw - 23rd February 2003 18:15 - #
Simon Willison - 23rd February 2003 18:21 - #
mw - 23rd February 2003 18:55 - #
mw - 23rd February 2003 18:57 - #
mw - 23rd February 2003 19:05 - #
mw - 23rd February 2003 19:07 - #
I have done now - that was a pretty nasty little exploit. The XML parser didn't notice (or care) that you'd only opened the comment and not closed it - I had to fix it by using str_replace() to eradicate any opening comment tags on sight:
$xhtml = str_replace('<!--', '', $xhtml);
Simon Willison - 23rd February 2003 20:05 - #
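One thing to keep in mind with the str_replace() fix quoted above: str_replace() makes a single pass over the string, so removing one match can splice the characters on either side of it into a fresh `<!--`. A minimal sketch of a repeat-until-stable variant follows; the helper name `strip_all()` and the test string are invented for this illustration.

```php
<?php
// Remove every occurrence of $needle, repeating until no new
// occurrence has been formed by an earlier removal.
function strip_all(string $text, string $needle): string
{
    do {
        $before = $text;
        $text = str_replace($needle, '', $text);
    } while ($text !== $before);
    return $text;
}

// Removing the inner '<!--' joins the leftover '<!-' and '-' together.
$tricky = '<!-<!---';
$once = str_replace('<!--', '', $tricky);   // single pass: leaves '<!--'
$all  = strip_all($tricky, '<!--');         // repeated passes: leaves ''
```

Whether this matters in practice depends on what else runs on the text afterwards; it is only meant to show the single-pass behaviour.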
mw - 23rd February 2003 22:18 - #
mw - 23rd February 2003 22:23 - #
mw - 23rd February 2003 22:30 - #
Simon Willison - 23rd February 2003 22:37 - #
Tom Gilder - 23rd February 2003 23:05 - #
mw - 23rd February 2003 23:37 - #
mw - 23rd February 2003 23:51 - #
Simon Willison - 23rd February 2003 23:52 - #
Simon Willison - 23rd February 2003 23:56 - #
May I suggest that you place a list of acceptable tags somewhere, perhaps near this form? Not everyone who reads this is necessarily going to know the difference between presentation and structure.
I understand that you have an error form that tells you off, and that you generally have a literate readership, but it's the principle of the thing: when this drops off the front page, there's no frame of reference.
Raena - 24th February 2003 07:18 - #
Simon Willison - 24th February 2003 07:54 - #
Joshua Kaufman - 24th February 2003 14:19 - #
Andrew Hayward - 24th February 2003 22:31 - #
Andrew Hayward - 24th February 2003 22:34 - #
Jan! - 26th February 2003 13:23 - #
"'>Tom Gilder - 1st March 2003 15:00 - #
"'><script>alert("this could be worth fixing?")</s - 1st March 2003 15:00 - #
me - 1st March 2003 15:00 - #
"'>--><script>document.body.backgroundImage="http: - 1st March 2003 15:00 - #
"'>--></script><script>document.body.backgroundIma - 1st March 2003 15:00 - #
David Weingart - 21st March 2003 19:37 - #
Freexe - 7th August 2003 13:39 - #
Simon Willison - 7th August 2003 14:50 - #
Freexe - 13th August 2003 20:48 - #
ggfhgfhgfhfgfghfg - 8th September 2003 20:47 - #
I cannot resist testing whether or not Javascript links are really blocked.
Klaus Johannes Rusch - 8th September 2003 22:33 - #
sdf - 15th September 2003 10:10 - #
manuel razzari - 6th October 2003 05:21 - #
manuel razzari - 6th October 2003 05:30 - #
GaryF - 17th October 2003 10:04 - #
This breaks your filter (but only on Opera).
BTW check out my filter kses.
// Ulf Harnhammar
Ulf - 27th October 2003 20:03 - #
This breaks your filter, at least on Mozilla.
(The "javascript:" filtering part is quite hard. I've implemented something called whitelisting URL protocols in my filter, to deal with it.)
// Ulf
Ulf Harnhammar - 28th October 2003 17:32 - #
Nice, but I can't see how most commenters (ok, your readership may be quite techy, but still...) would want to bother with HTML tags.
I've been working on something very similar to Textile (although with a LOT less features) for the past few months, which I believe is a lot more user-friendly and transparent for something as simple as a blog comment.
Block level elements (p, h1-h6, blockquote, ul, ol, li, pre) and a whole bunch of other elements are automatically (or close to it) generated from the near-natural-language text. As a bonus, Textile also allows HTML input (which you could filter of course -- I do) for the techy commenters who want to dig a little deeper.
Keep it simple for those not in the know, and for those who are, let them dig a little deeper.
Justin French - 7th December 2003 13:41 - #
Hi - Textile looks good - but unless I'm not understanding things correctly - Textile and SafeHtmlChecker.class.php do different things; one generates HTML, the other validates the HTML and keeps it clean and tidy. Put them together and you have a winning team. :-)
I can download SafeHtmlChecker.class.php - can I do the same with Textile? What is it written in (i.e. PHP, Perl, ASP?)
All the best, Jim
Jim Byrne - 10th December 2003 17:09 - #
Please, check out:
Content management system Absolut Engine
It has WYSIWYG editor, produces valid XHTML Strict that complies with webstandards and supports clean URLs.
dusoft - 16th April 2004 15:16 - #
test - 22nd April 2004 20:24 - #
Sergi - 2nd May 2004 23:54 - #
Mike P. - 31st May 2004 17:07 - #
Mark - 9th October 2004 05:44 - #
sdfasdf - 22nd November 2004 20:55 - #
Looks great, but I've been getting a fatal error (can't redefine class). Possibly due to my setup; then again, I'm not PHP literate so I really can't tell...
Vidar - 26th November 2004 04:52 - #
sonny - 10th December 2004 13:17 - #
There is also lots of different stuff, like ECMAScript, JScript1, PerlScript, etc.
It is generally better to test href/cite attributes against a list of safe protocols. The following code does this.
// Only allow cite/href values that start with a protocol from the safe list.
if ((($attr == 'cite') || ($attr == 'href')) && !preg_match('/^(https?|ftp|mailto|news|ed2k|dchub|irc|telnet|gopher|about):/i', trim($value))) {
    if (preg_match('/^([A-Z0-9]+):/i', trim($value), $temp)) {
        // A protocol was given but it is not in the list: name it in the error.
        $this->errors[] = "<code>$attr</code> attribute cannot link to <code>" . htmlspecialchars($temp[1]) . ":</code> protocol";
    } else {
        // No recognisable protocol prefix at all.
        $this->errors[] = "<code>$attr</code> attribute cannot link to such protocol";
    }
}
Also, I suggest letting the user know why the XML parser failed on their code:
if (!xml_parse($this->parser, $xhtml)) {
    $this->errors[] = 'Not well-formed XHTML: ' . xml_error_string(xml_get_error_code($this->parser));
}
Personally, I have the XHTML parsed and corrected by HTML Tidy (as a PHP5 module), which greatly eases the process for the end user.
// Tidy options: emit only the cleaned-up body, as XHTML.
$tidycfg = array(
    'show-body-only' => TRUE,
    'quote-nbsp' => TRUE,
    'output-xhtml' => TRUE,
    'hide-comments' => TRUE,
    'drop-proprietary-attributes' => TRUE,
    'clean' => TRUE,
    'bare' => TRUE,
    'quote-ampersand' => TRUE,
    'numeric-entities' => TRUE,
    'ncr' => TRUE,
);
// The second argument is the configuration array, the third the encoding.
$html = tidy_repair_string($rawdata, $tidycfg, 'latin1');
Oh, yeah, thanks for your work! You saved me some time thinking about how to implement this... And sorry for my bad English.
drdaeman - 12th December 2004 17:31 - #
David
David Kelso - 19th February 2005 12:28 - #
Daniel - 9th March 2005 19:59 - #
feha - 27th March 2005 02:00 - #
Courtney Mile - 5th May 2005 07:51 - #
Hello! I like what you did, but I chose a totally different approach!
I replace:
Is it enough for security purposes? I don't really care if people can make a big mess with some non-XHTML. There is always someone to correct bad stuff...
See an example on http://wiki.vi5.org/demo/
Guilain Omont - 6th May 2005 16:14 - #
reiben - 10th May 2005 13:38 - #
bernd - 23rd May 2005 13:03 - #
ryt - 12th October 2005 10:48 - #
Steffen - 24th October 2005 22:24 - #
test - 30th November 2005 23:44 - #
test - 30th November 2005 23:45 - #
sss - 10th December 2005 17:44 - #
name - 12th December 2005 10:03 - #
I hate to be a pain, but it seems like there's a minor error in the otherwise excellent SafeHtmlChecker.class.php available for download at the link in your original post.
In lines 90 and 92, you assign the result of a str_replace to a variable called $comment, which appears nowhere else in the text. The effect is that the checks for <? and <script aren't applied to the contents of $xhtml.
I assume the linked version isn't the newest, as <script> tags don't work in your comments. As it is offered for download, though, it happily passes through script tags and PHP blocks. You might want to fix that. :)
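To make the pattern concrete, here is a hypothetical reconstruction of the kind of slip described above. These are not the actual lines 90 and 92 from the class; the function names and the replacement strings are assumptions for illustration only.

```php
<?php
// Hypothetical: the stripped result lands in $comment, which is never
// read again, so the caller gets the original string back untouched.
function strip_buggy(string $xhtml): string
{
    $comment = str_replace('<?', '', $xhtml);
    $comment = str_replace('<script', '', $xhtml);
    return $xhtml;
}

// Same checks, but each result is written back to $xhtml.
function strip_fixed(string $xhtml): string
{
    $xhtml = str_replace('<?', '', $xhtml);
    $xhtml = str_replace('<script', '', $xhtml);
    return $xhtml;
}
```

With the first version, anything containing `<script` or `<?` comes out exactly as it went in, which would match the behaviour reported here.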
On an unrelated note, using Google to filter user-submitted URLs is a really nice touch. Super nifty, and so is SafeHtmlChecker -- thanks!
Aaron - 2nd February 2006 00:25 - #
john - 24th March 2006 18:38 - #
Edward Z. Yang - 18th April 2006 00:58 - #
Sorry, I couldn't resist. Your class doesn't check for an A tag within an A tag, which is illegal according to the spec (your page doesn't validate anymore, see the validator). If you still maintain this class, anyway.
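A check like this has to look at every ancestor rather than only the immediate parent, since the inner link can be hidden inside another inline element (for example a > em > a). A minimal sketch using PHP's XML parser functions is below; `has_nested_link()` and the `<comment>` wrapper are names invented for this example and are not part of SafeHtmlChecker.

```php
<?php
// Returns true if any <a> element has another <a> among its ancestors.
function has_nested_link(string $xhtml): bool
{
    $stack = [];      // names of the currently open elements
    $nested = false;

    $parser = xml_parser_create('UTF-8');
    // Report tag names exactly as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$stack, &$nested) {
            // An <a> that opens while another <a> is still open is nested.
            if ($name === 'a' && in_array('a', $stack, true)) {
                $nested = true;
            }
            $stack[] = $name;
        },
        function ($p, string $name) use (&$stack) {
            array_pop($stack);
        }
    );

    // Wrap the fragment so the parser sees a single root element.
    xml_parse($parser, "<comment>$xhtml</comment>", true);
    return $nested;
}
```

The sketch ignores the parser's return value, so it assumes well-formedness is being checked separately.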
Edward Z. Yang - 18th April 2006 01:01 - #
omer - 21st May 2006 01:52 - #
Jim Bim - 10th July 2006 01:27 - #
Bjorn Stabell - 12th October 2006 05:54 - #