Simon Willison’s Weblog

HTMLifying user input

19th October 2003

I’ve added a comment system to my new Kansas blog. Since the target audience for that site is friends and family rather than fellow web developers, I’ve taken a very different approach to processing the input from comments. While this blog insists upon valid XHTML and gives very little help to comment posters aside from highlighting validation problems, my new site’s comment system takes the more traditional route of disallowing HTML while automatically converting line breaks and links.

The standard way of doing this with PHP is to use the nl2br function. I’ve never been a big fan of this method as I prefer blocks of text to be surrounded by paragraph tags. Luckily, adding paragraph tags to blocks of text is a relatively easy task. Here’s the pseudo-code, mocked up in Python because it’s quicker to experiment with than PHP:

>>> text = '''... lengthy text block here ...'''
>>> paras = text.split('\n\n')
>>> paras = ['<p>%s</p>' % para.strip() for para in paras]
>>> print('\n\n'.join(paras))

The above code splits the text block on any occurrence of a double newline, then wraps each of the resulting blocks in a paragraph tag (after stripping off any leading or trailing whitespace) before joining the blocks back together with a pair of newlines between each one, because I like to keep my HTML nicely formatted.

What it doesn’t do is handle any necessary <br> tags. The trick now is to replace any single line breaks with <br> without interfering with the paragraph tags. The easiest way to do this is to put the replacement inside the loop, so that only line breaks that occur within a paragraph are replaced. Here’s the updated list comprehension:


>>> paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
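As a rough sketch, the two steps can be wrapped in a single function and tried on a small sample. The function name `text_to_paragraphs` and the sample text are made up for illustration and are not part of the original code:

```python
def text_to_paragraphs(text):
    """Wrap blank-line-separated blocks in <p> tags; single newlines become <br>."""
    # Split on double newlines to find the paragraph blocks.
    paras = text.split('\n\n')
    # Trim each block, convert its inner line breaks, and wrap it in <p>.
    paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
    # Rejoin with a blank line between paragraphs to keep the HTML readable.
    return '\n\n'.join(paras)


sample = 'First line\nsecond line\n\nA new paragraph'
print(text_to_paragraphs(sample))
# <p>First line<br>
# second line</p>
#
# <p>A new paragraph</p>
```

One thing to watch for with this approach: a run of four or more newlines yields an empty block between the splits, which comes out as an empty `<p></p>`.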

The final job is to convert the above into PHP:

$paras = explode("\n\n", $text);
for ($i = 0, $j = count($paras); $i < $j; $i++) {
    $paras[$i] = '<p>'.
        str_replace("\n", "<br>\n", trim($paras[$i])).
        '</p>';
}
$text = implode("\n\n", $paras);

That’s the line conversions handled, but there are a few other important steps. Any HTML tags entered by the user need to be either stripped out or disabled by converting them to entities. Converting them to entities carries the risk of ugly failed attempts at HTML appearing on the comments page, but stripping tags carries an equal risk of innocent parts of a legitimate comment (such as a <wink>) being discarded. I chose to go the entity conversion route but force commenters to preview their comments before posting them, a trick I picked up from Adrian’s blog.

The final step is to automatically convert links into <a href=""> tags. I achieve this using a pair of naive regular expressions, in the hope that the preview screen will let commenters catch any case where the expressions mangle a comment in a way they did not intend.
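The trade-off between converting tags to entities and stripping them is easy to see on a small example. This is only an illustration in Python: `html.escape` stands in for PHP's `htmlentities` (it escapes fewer characters), and the tag-stripping pattern and sample comment are assumptions made up for the demonstration:

```python
import html
import re

comment = 'Nice post <wink> but <b>bold</b> claims'

# Option 1: convert markup characters to entities, so everything stays visible.
escaped = html.escape(comment)

# Option 2: strip anything that looks like a tag (a deliberately naive pattern).
stripped = re.sub(r'<[^>]*>', '', comment)

print(escaped)   # Nice post &lt;wink&gt; but &lt;b&gt;bold&lt;/b&gt; claims
print(stripped)  # Nice post  but bold claims
```

Escaping leaves the failed `<b>` attempt on display, while stripping silently drops the `<wink>` along with the real tags.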

Here’s the finished PHP function:

function untrustedTextToHTML($text) {
    $text = htmlentities($text);
    $paras = explode("\n\n", $text);
    for ($i = 0, $j = count($paras); $i < $j; $i++) {
        $paras[$i] = '<p>'.
            str_replace("\n", "<br>\n", trim($paras[$i])).
            '</p>';
    }
    $text = implode("\n\n", $paras);
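    // Convert http:// links
    $text = preg_replace('|\\b(http://[^\s)<]+)|',
        '<a href="$1">$1</a>', $text);
    // Convert www. links; the lookbehind skips any "www." that follows a
    // slash, so addresses already linked by the step above are left alone
    $text = preg_replace('|(?<!/)\\b(www\.[^\s)<]+)|',
        '<a href="http://$1">$1</a>', $text);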
    return $text;
}

I have no doubt it could be improved, but my tests so far have shown it to be good enough for the job at hand.
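For anyone who wants to experiment in Python first, here is a rough sketch of the same steps. It is not a drop-in port: the name `untrusted_text_to_html` is hypothetical, and `html.escape` only approximates `htmlentities`. The `www.` pattern in this sketch escapes the dot and uses a lookbehind so that it does not match inside an address the `http://` step has already linked:

```python
import html
import re


def untrusted_text_to_html(text):
    """Escape HTML, add <p>/<br> tags, then link http:// and www. addresses."""
    # Disable any HTML by converting markup characters to entities.
    text = html.escape(text)
    # Paragraphs and line breaks, as in the earlier list comprehension.
    paras = text.split('\n\n')
    paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
    text = '\n\n'.join(paras)
    # Convert http:// links.
    text = re.sub(r'\b(http://[^\s)<]+)', r'<a href="\1">\1</a>', text)
    # Convert bare www. links, skipping any "www." preceded by a slash.
    text = re.sub(r'(?<!/)\b(www\.[^\s)<]+)',
                  r'<a href="http://\1">\1</a>', text)
    return text


print(untrusted_text_to_html(
    'See http://example.com/page for more\nor www.example.org\n\nBye'))
```

The patterns are still naive. For example, they will include trailing punctuation such as a full stop in the link, so a preview step remains a sensible safeguard.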

This is HTMLifying user input by Simon Willison, posted on 19th October 2003.

Previously hosted at http://simon.incutio.com/archive/2003/10/19/htmlifying