Converting links without regular expressions
I pair-programmed this code with Natalie just over a month ago, and I’ve now added it to my Kansas blog simplified comments system as mentioned earlier.
The problem is the age-old challenge of automatically converting URLs embedded in a piece of text into links. The standard way of doing this is with regular expressions, as I demonstrated in my previous entry. Creating a reliable regular expression for spotting a link is actually a pretty big challenge; creating one that can deal with the vagaries of typed text is even more so. Two common details that normally trip up regular-expression-based link parsers are as follows:
- Commas and semicolons, while rare, are perfectly valid within URLs. Consider the horrible links churned out by the popular Vignette content management system, as used by sites like the Guardian. Links like this are frequently truncated at the first comma by naive regular expression based link parsers.
- When writing plain text, most people use punctuation without bothering to explicitly separate it from the link. A classic example is mentioning a site (such as www.google.com) in parentheses, or even more commonly a URL that sits next to a comma or full stop. Of course commas and full stops are both valid within URLs, so some mechanism is needed for ignoring them only if they occur at the very end of the link.
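Both failure modes are easy to reproduce. The sketch below uses two deliberately naive patterns and a made-up example.com URL (neither pattern is taken from a real link parser) to show a comma-laden link being cut short by a strict pattern, and a trailing comma being swallowed by a permissive one.

```php
<?php
// Hypothetical sample text; example.com stands in for a comma-heavy CMS URL.
$text = 'Read http://example.com/story/0,123,456.html, then reply.';

// Naive pattern 1: only letters, digits and a few "safe" characters allowed.
// The match stops at the first comma, truncating the link.
preg_match('/http:\/\/[A-Za-z0-9.\/_-]+/', $text, $strict);
echo $strict[0], "\n"; // http://example.com/story/0

// Naive pattern 2: anything up to the next whitespace.
// The whole link survives, but so does the comma that follows it.
preg_match('/http:\/\/\S+/', $text, $greedy);
echo $greedy[0], "\n"; // http://example.com/story/0,123,456.html,
```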
The other question of course is what we should be looking for in the first place. To keep things simple, I only consider tokens that start either with “http://” or with “www.” as these are by far the most common ways in which a link is included in plain text.
Regular expressions are powerful beasts. It’s quite possible that one could be constructed to avoid the problems mentioned above, making extensive use of arcane tricks such as negative lookahead assertions (say that quickly five times) and non-backreferenced atoms. The readability of a regular expression converges towards zero the more these tricks are employed, so I for one prefer to avoid them. Instead, the whole operation can be achieved using a combination of simple string operations.
The first step is to split the string into an array of words (or tokens) by exploding it on spaces. Spaces are ideal for this as they’re the one character we can guarantee won’t appear in a link, or at least not one that people are likely to paste into some text in the expectation that it will be converted. We can now concentrate on each word in turn. If it starts with http:// or www. then we can treat it as a link, but what about the trailing punctuation? If we know the characters we don’t want to appear at the end of the link (I use .,'")(<>;:) we can simply chop off those characters from the end of the link one by one until we get to a character not represented in our block list. To preserve the formatting of the original text we should store each of the eliminated characters and tag them back on once the link has been created.
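As a minimal sketch of that chopping step in isolation: the helper name splitTrailingPunctuation is invented here purely for illustration, and the block list is the one quoted above. It loops until the last character is no longer in the list, and prepends each removed character so the punctuation comes back in its original order.

```php
<?php
// Hypothetical helper: split a token into array(link, trailing punctuation).
function splitTrailingPunctuation($word) {
    $punctuation = '.,\'")(<>;:'; // Links may not end in these
    $trailing = '';
    // Keep chopping while the last character is in the block list.
    while (strlen($word) > 0 && strpos($punctuation, substr($word, -1)) !== false) {
        // Prepend so the removed characters keep their original order.
        $trailing = substr($word, -1) . $trailing;
        $word = substr($word, 0, -1);
    }
    return array($word, $trailing);
}

list($link, $trailing) = splitTrailingPunctuation('www.google.com).');
echo $link, ' | ', $trailing, "\n"; // www.google.com | ).
```

Commas and full stops inside the link are untouched, because only the final character is ever inspected.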
The only step left is to add in the HTML for the link, then reconstruct the original text by imploding the array back down to a string separated by spaces. Here’s the finished function:
function convertLinks($text) {
    $words = explode(' ', $text);
    for ($i = 0, $j = count($words); $i < $j; $i++) {
        $word = $words[$i];
        $punctuation = '.,\'")(<>;:'; // Links may not end in these
        if (substr($word, 0, 7) == 'http://' ||
            substr($word, 0, 4) == 'www.') {
            $trailing = '';
            // Knock off ending punctuation
            $last = substr($word, -1);
            if (strpos($punctuation, $last) !== false) {
                // Last character is punctuation - eliminate it
                $trailing .= $last;
                $word = substr($word, 0, -1);
            }
            // Make link, add trailing punctuation back afterwards
            $link = $word;
            if (substr($link, 0, 4) == 'www.') {
                // This link needs an http://
                $link = 'http://'.$link;
            }
            $word = '<a href="'.$link.'">'.$word.'</a>'.$trailing;
        }
        $words[$i] = $word;
    }
    return implode(' ', $words);
}
It doesn’t cover every eventuality (the fuzzy nature of the problem makes that a pretty thankless task) but it handles most cases admirably well. If you want to try it out I’ve set up a demo page that uses the function right here.
Update: The above code contains a couple of subtle bugs, mostly relating to line endings. I’ve posted an updated and improved version on the demo page.
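One way line endings can trip up a function that explodes only on spaces (this is an assumption for illustration, not a description of the fix on the demo page) is that a link at the end of a line stays glued to the first word of the next line. A rough sketch of a workaround that still avoids regular expressions is to handle each line separately. The wrapper name convertLinksPerLine and its $convertLine parameter are hypothetical; any single-line converter, such as convertLinks() above, could be passed in.

```php
<?php
// Hypothetical wrapper: $convertLine is any callable that converts one line.
function convertLinksPerLine($text, $convertLine) {
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        // Set aside a Windows-style carriage return so it cannot end up in a URL.
        $cr = (substr($line, -1) === "\r") ? "\r" : '';
        if ($cr !== '') {
            $line = substr($line, 0, -1);
        }
        $lines[$i] = call_user_func($convertLine, $line) . $cr;
    }
    return implode("\n", $lines);
}

// strtoupper is only a stand-in converter to show the line endings surviving.
echo convertLinksPerLine("one\r\ntwo\n", 'strtoupper');
```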
When testing to see if the 'word' is a link, wouldn't it be easier to use a regular expression at that point?
Like
if ( preg_match ( "/^(http:\/\/|ftp:\/\/|www.)/i", $word ) ) ...
This way, you can add any protocol you want as a link without the hassle of the length or extra cases in the if.
Andrew - 19th October 2003 23:51 - #
That should, of course, be a www\. in the RegExp.
Andrew - 19th October 2003 23:52 - #
... which is exactly why I like to avoid them if there's a simpler string method to use instead ;) Python provides the most readable way of expressing that logic:
Simon Willison - 20th October 2003 00:00 - #
roy - 20th October 2003 03:30 - #
Simon Willison - 20th October 2003 04:11 - #
If a URL has multiple punctuation marks at the end (like this one: http://simon.incutio.com), then it doesn't get marked up. I'm not a code guru, but would that be solved by simply changing the if (strpos($punctuation... to a while (strpos($punctuation...?
Micah - 20th October 2003 05:07 - #
Simon Willison - 20th October 2003 05:10 - #
I think it will be quite tricky to support this:
:-)
Anne - 20th October 2003 08:08 - #
Eugene - 20th October 2003 14:28 - #
Very nice code, and as you say it's not perfect (and can't be, really).
The problem I encountered, however, is that links with parentheses around them didn't work (www.google.com). This is something I do every once in a while when I'm explaining something and don't want to use a link without showing the url for various reasons.
I don't know if this is really a problem big enough to consider a problem/bug though. I guess the best way to come around it is to explode with ( and ) too, but then the function may become a bit bloated..
Just thinking out loud here :)
Eivind - 20th October 2003 15:28 - #
Here, for what it's worth, is a regex version of the above. Should work with urls in parentheses (http://thingy.com!), as well.
If you want to do it strictly with stock PHP functions (a fine ambition), you're taking a huge performance hit with all those count() calls:
...can be done much more simply:
Dean Allen - 22nd October 2003 20:36 - #
pgl - 5th November 2003 02:27 - #
Hey pgl, Sorry but your foreach statements wouldn't work. There's just no way to implode the text back together. Well maybe with the second one using the wordnum variable, but again that would create more temporary variables towards the end. I actually prefer a hybrid of yours and Mr. Allen's methods:
And sorry for posting this so late, but I'm tweaking my CMS and had to reference your script Simon. The punctuation string is genius, but why you use a variable and not just a string during the if/loop statement is lost to me... Okay, now I'm being picky! :)
Stephen - 1st December 2003 07:25 - #
Lefty - 10th March 2004 20:44 - #
Jason Butler - 5th April 2005 21:40 - #
Jeremy - 5th February 2006 16:06 - #