HTMLifying user input
19th October 2003
I’ve added a comment system to my new Kansas blog. Since the target audience for that site is friends and family rather than fellow web developers, I’ve taken a very different approach to processing the input from comments. While this blog insists upon valid XHTML and gives very little help to comment posters aside from highlighting validation problems, my new site’s comment system takes the more traditional route of disallowing HTML while automatically converting line breaks and links.
The standard way of doing this with PHP is to use the nl2br function. I’ve never been a big fan of this method as I prefer blocks of text to be surrounded by paragraph tags. Luckily, adding paragraph tags to blocks of text is a relatively easy task. Here’s the pseudo-code, mocked up in Python because it’s quicker to experiment with than PHP:
>>> text = '''... lengthy text block here ...'''
>>> paras = text.split('\n\n')
>>> paras = ['<p>%s</p>' % para.strip() for para in paras]
>>> print('\n\n'.join(paras))
The above code splits the text block on any occurrence of a double newline, then wraps each of the resulting blocks in a paragraph tag (after stripping off any remaining whitespace) before joining the blocks back together with a pair of newlines between each one, because I like to keep my HTML nicely formatted. What it doesn’t do is handle any necessary <br> tags. The trick now is to replace any single line breaks with <br> without interfering with the paragraph tags. The easiest way to do this is to put the replacement inside the loop, so that only line breaks that occur within a paragraph are replaced. Here’s the updated list comprehension:
>>> paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
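For anyone who wants to try the steps above outside the interactive prompt, here is a minimal sketch that bundles them into one Python function. The name text_to_paragraphs is hypothetical, and two behaviours are assumptions that go beyond the snippets above: line endings are normalised first, and blocks that are empty after stripping are skipped so that runs of blank lines do not produce empty paragraphs.

```python
def text_to_paragraphs(text):
    """Wrap blank-line-separated blocks in <p> tags; single newlines become <br>."""
    # Assumption (not in the original): normalise \r\n and \r to \n first
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    paras = text.split('\n\n')
    # Assumption (not in the original): drop blocks that are empty once stripped
    paras = [p.strip() for p in paras if p.strip()]
    # Replace line breaks inside each paragraph only, then wrap it in <p>
    return '\n\n'.join(
        '<p>%s</p>' % p.replace('\n', '<br>\n') for p in paras
    )


if __name__ == '__main__':
    print(text_to_paragraphs('First line\nsecond line\n\nA new paragraph'))
```

Because the replacement happens per paragraph, the newlines used to join the paragraphs are never turned into <br> tags.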
The final job is to convert the above into PHP:
$paras = explode("\n\n", $text);
for ($i = 0, $j = count($paras); $i < $j; $i++) {
    $paras[$i] = '<p>'.
        str_replace("\n", "<br>\n", trim($paras[$i])).
        '</p>';
}
$text = implode("\n\n", $paras);
That’s the line conversions handled, but there are a few other important steps. Any HTML tags entered by the user need to be either stripped out or disabled by converting them to entities. Converting them to entities carries the risk of ugly failed attempts at HTML appearing on the comments page, but stripping tags carries an equal risk of innocent parts of a legitimate comment (such as a <wink>) being discarded. I chose to go the entity conversion route but force commenters to preview their comments before posting them, a trick I picked up from Adrian’s blog. The final step is to automatically convert links into <a href=""> tags. I achieve this using a pair of naive regular expressions, in the hope that the preview screen will let commenters catch any case where the expressions mangle a comment in a way they did not intend.
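The trade-off between stripping tags and converting them to entities can be seen in a small Python sketch. It uses html.escape from the standard library as a rough stand-in for PHP's htmlentities (the two are not identical: html.escape only touches &, <, > and quotes), and the tag-stripping regular expression is a deliberately naive illustration rather than anything the site uses.

```python
import html
import re

comment = 'I <b>love</b> this <wink>'

# Option 1: convert markup characters to entities, so nothing is lost
# but failed attempts at HTML show up literally on the page
escaped = html.escape(comment)

# Option 2: strip anything that looks like a tag (naive, illustration only),
# which also throws away the harmless <wink>
stripped = re.sub(r'<[^>]*>', '', comment)

print(escaped)
print(stripped)
```

The escaped version keeps every character the commenter typed, which is why a preview step pairs well with it: the author sees the literal tags and can fix the comment before posting.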
Here’s the finished PHP function:
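function untrustedTextToHTML($text) {
    // Disable any user-entered markup by converting it to entities
    $text = htmlentities($text);
    // Wrap blank-line-separated blocks in <p>; single newlines become <br>
    $paras = explode("\n\n", $text);
    for ($i = 0, $j = count($paras); $i < $j; $i++) {
        $paras[$i] = '<p>'.
            str_replace("\n", "<br>\n", trim($paras[$i])).
            '</p>';
    }
    $text = implode("\n\n", $paras);
    // Convert http:// links
    $text = preg_replace('|\\b(http://[^\s)<]+)|',
        '<a href="$1">$1</a>', $text);
    // Convert www. links. The dot is escaped so it only matches a literal ".",
    // and the lookbehind skips any www. that follows "/" or "." so that the
    // http:// links created above are not wrapped a second time.
    $text = preg_replace('|(?<![/.])\\b(www\.[^\s)<]+)|',
        '<a href="http://$1">$1</a>', $text);
    return $text;
}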
I have no doubt it could be improved, but my tests so far have shown it to be good enough for the job at hand.
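One area with room for improvement is the link conversion: when two patterns run in sequence, the second pass sees the markup produced by the first, so a URL such as http://www.example.com/ needs care. A possible alternative, sketched here in Python under my own assumptions rather than taken from the site's code, is to handle both kinds of link in a single pass. The names LINK_RE and linkify are hypothetical, and the input is assumed to have been entity-escaped already.

```python
import re

# One pattern for both kinds of link, so text already consumed as part of an
# http:// URL is never rescanned for a www. prefix.
LINK_RE = re.compile(r'\b(http://|www\.)([^\s)<]+)')


def linkify(escaped_text):
    """Wrap http:// and www. links in <a> tags in a single pass."""
    def repl(match):
        prefix, rest = match.group(1), match.group(2)
        shown = prefix + rest
        # www. links need a scheme added to the href, but not to the link text
        href = shown if prefix == 'http://' else 'http://' + shown
        return '<a href="%s">%s</a>' % (href, shown)
    return LINK_RE.sub(repl, escaped_text)


if __name__ == '__main__':
    print(linkify('See http://www.example.com/ or www.example.org for more'))
```

This is still a naive matcher: trailing punctuation such as a full stop directly after a URL ends up inside the link, so a preview step remains useful.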