Simon Willison’s Weblog


Converting links without regular expressions

19th October 2003

I pair-programmed this code with Natalie just over a month ago, and I’ve now added it to my Kansas blog simplified comments system as mentioned earlier.

The problem is the age-old challenge of automatically converting URLs embedded in a piece of text into links. The standard way of doing this is with regular expressions, as I demonstrated in my previous entry. Creating a reliable regular expression for spotting a link is actually a pretty big challenge; creating one that can deal with the vagaries of typed text is harder still. Two common details that normally trip up regular expression based link parsers are as follows:

  1. Commas and semicolons, while rare, are perfectly valid within URLs. Consider the horrible links churned out by the popular Vignette content management system, as used by sites like the Guardian. Links like this are frequently truncated at the first comma by naive regular expression based link parsers.
  2. When writing plain text, most people use punctuation without bothering to explicitly separate it from the link. A classic example is mentioning a site (such as www.google.com) in parentheses, or even more commonly a URL that sits next to a comma or full stop. Of course commas and full stops are both valid within URLs, so some mechanism is needed for ignoring them only when they occur at the very end of the link.
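
To make the two failure modes concrete, here is a small sketch. Both patterns and both example URLs are invented for illustration; they are not taken from any particular link parser.

```php
<?php
// Hypothetical inputs; the URLs are made up for this example.
$vignette = 'Read http://www.example.com/story/0,3604,1046735,00.html today';
$sentence = 'The details are at http://example.org/page.';

// A naive pattern that only allows "safe" characters
// stops matching at the first comma in the URL.
preg_match('~http://[A-Za-z0-9./_-]+~', $vignette, $m);
$truncated = $m[0];

// A greedy pattern that takes everything up to the next space
// swallows the full stop that ends the sentence.
preg_match('~http://\S+~', $sentence, $m);
$greedy = $m[0];
```

The first match loses everything from the comma onwards, while the second carries the sentence's full stop into the link.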

The other question of course is what we should be looking for in the first place. To keep things simple, I only consider tokens that start either with “http://” or with “www.” as these are by far the most common ways in which a link is included in plain text.
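
That rule can be sketched as a tiny predicate. The name looksLikeLink is hypothetical and does not appear in the finished function, which performs the same check inline with substr().

```php
<?php
// Hypothetical helper: true when a token starts with one of the two
// prefixes considered in this entry.
function looksLikeLink($token) {
    return strncmp($token, 'http://', 7) === 0
        || strncmp($token, 'www.', 4) === 0;
}
```

Note what this deliberately leaves out: a token such as https://example.com or a bare example.com is not treated as a link, and neither is one with a leading bracket such as (www.example.com).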

Regular expressions are powerful beasts. It’s quite possible that one could be constructed to avoid the problems mentioned above, making extensive use of arcane tricks such as negative lookahead assertions (say that quickly five times) and non-backreferenced atoms. The readability of a regular expression converges towards zero the more these tricks are employed, so I for one prefer to avoid them. Instead, the whole operation can be achieved using a combination of simple string operations.

The first step is to split the string into an array of words (or tokens) by exploding it on spaces. Spaces are ideal for this as they’re the one character we can guarantee won’t appear in a link, or at least not in one that people are likely to paste into some text in the expectation that it will be converted. We can now concentrate on each word in turn. If it starts with http:// or www. then we can treat it as a link, but what about the trailing punctuation? If we know the characters we don’t want to appear at the end of the link (I use .,'")(<>;:) we can simply chop off those characters from the end of the link one by one until we get to a character not represented in our block list. To preserve the formatting of the original text we should store each of the eliminated characters and tag them back on once the link has been created.
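
As a rough sketch of the split-and-trim step on its own: the helper name splitTrailingPunctuation is hypothetical, and it uses PHP's built-in rtrim() as a shortcut for the character-by-character loop described above, on the assumption that both give the same result for this block list.

```php
<?php
// Hypothetical helper: separates a word from the punctuation at its end.
// Returns array($core, $trailing).
function splitTrailingPunctuation($word) {
    $punctuation = '.,\'")(<>;:'; // Links may not end in these
    // rtrim() removes the listed characters from the right-hand end only,
    // so commas and semicolons inside the URL are left alone.
    $core = rtrim($word, $punctuation);
    // Whatever was removed is kept so it can be re-attached after the link.
    $trailing = (string) substr($word, strlen($core));
    return array($core, $trailing);
}

// Example sentence invented for illustration.
$words = explode(' ', 'See www.google.com), or http://example.com/a,b;c.');
$parts = array_map('splitTrailingPunctuation', $words);
```

Here the second word splits into www.google.com plus the trailing "),", and the last word keeps its internal comma and semicolon while giving up only the final full stop.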

The only step left is to add in the HTML for the link, then reconstruct the original text by imploding the array back down to a string separated by spaces. Here’s the finished function:

function convertLinks($text) {
    $words = explode(' ', $text);
    for ($i = 0, $j = count($words); $i < $j; $i++) {
        $word = $words[$i];
        $punctuation = '.,\'")(<>;:'; // Links may not end in these
        if (substr($word, 0, 7) == 'http://' || 
                substr($word, 0, 4) == 'www.') {
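            $trailing = '';
            // Knock off ending punctuation, one character at a time
            $last = substr($word, -1);
            while ($word !== '' && strpos($punctuation, $last) !== false) {
                // Last character is punctuation - eliminate it,
                // keeping the removed characters in their original order
                $trailing = $last.$trailing;
                $word = substr($word, 0, -1);
                $last = substr($word, -1);
            }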
            // Make link, add trailing punctuation back afterwards
            $link = $word;
            if (substr($link, 0, 4) == 'www.') {
                // This link needs an http://
                $link = 'http://'.$link;
            }
            $word = '<a href="'.$link.'">'.$word.'</a>'.$trailing;
        }
        $words[$i] = $word;
    }
    return implode(' ', $words);
}

It doesn’t cover every eventuality (the fuzzy nature of the problem makes that a pretty thankless task) but it handles most cases admirably well. If you want to try it out I’ve set up a demo page that uses the function right here.

Update: The above code contains a couple of subtle bugs, mostly relating to line endings. I’ve posted an updated and improved version on the demo page.
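
The updated version is not reproduced here, so the following is only one guess at how the line-ending problem could be approached, not the fix posted on the demo page. Because the function explodes on spaces alone, a newline stays glued to the word next to it; a hypothetical wrapper (convertPerLine, an invented name) can hand the text to a single-line converter one line at a time.

```php
<?php
// Hypothetical wrapper: applies a single-line converter to each line of
// $text so that a newline is never treated as part of a word.
function convertPerLine($text, $convertLine) {
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        // Set aside a carriage return left over from a Windows line ending
        $cr = (substr($line, -1) === "\r") ? "\r" : '';
        if ($cr !== '') {
            $line = (string) substr($line, 0, -1);
        }
        $lines[$i] = call_user_func($convertLine, $line).$cr;
    }
    return implode("\n", $lines);
}
```

Assuming the function above is loaded, it would be called as convertPerLine($text, 'convertLinks'). Tabs and other whitespace are still not treated as word separators in this sketch.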

This is Converting links without regular expressions by Simon Willison, posted on 19th October 2003.


Previously hosted at http://simon.incutio.com/archive/2003/10/19/convertingLinks