Simon Willison’s Weblog

Converting links without regular expressions

I pair-programmed this code with Natalie just over a month ago, and I’ve now added it to the simplified comments system on my Kansas blog, as mentioned earlier.

The problem is the age-old challenge of automatically converting URLs embedded in a piece of text into links. The standard way of doing this is with regular expressions, as I demonstrated in my previous entry. Creating a reliable regular expression for spotting a link is actually a pretty big challenge; creating one that can deal with the vagaries of typed text is even more so. Two common details that normally trip up regular expression based link parsers are as follows:

  1. Commas and semicolons, while rare, are perfectly valid within URLs. Consider the horrible links churned out by the popular Vignette content management system, as used by sites like the Guardian. Links like this are frequently truncated at the first comma by naive regular expression based link parsers.
  2. When writing plain text, most people use punctuation without bothering to explicitly separate it from the link. A classic example is mentioning a site (such as www.google.com) in parentheses, or even more commonly a URL that sits next to a comma or full stop. Of course commas and full stops are both valid within URLs, so some mechanism is needed for ignoring them only if they occur at the very end of the link.
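
Both failure modes are easy to reproduce. The two patterns below are deliberately naive and are assumed purely for illustration (they are not the expressions from the previous entry, and example.com is a placeholder address): the first allows too few characters and stops at a comma, the second allows too many and swallows the closing parenthesis.

```php
<?php
// Pattern 1 (illustrative assumption): letters, digits, dots, slashes, hyphens.
$strict = '/http:\/\/[a-z0-9.\/\-]+/i';
preg_match($strict, 'http://example.com/story/0,123,456,00.html', $cut);
// $cut[0] is 'http://example.com/story/0' (truncated at the first comma)

// Pattern 2 (illustrative assumption): anything up to the next whitespace.
$greedy = '/http:\/\/\S+/';
preg_match($greedy, 'a site (such as http://www.google.com) in brackets', $long);
// $long[0] is 'http://www.google.com)' (the closing parenthesis is included)
```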

The other question of course is what we should be looking for in the first place. To keep things simple, I only consider tokens that start either with “http://” or with “www.” as these are by far the most common ways in which a link is included in plain text.
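
As a rough sketch of that test, the check needs nothing more than substr(). The helper name looksLikeLink is hypothetical and only there to make the idea easy to try in isolation.

```php
<?php
// Hypothetical helper: true if a token starts with one of the two
// prefixes this entry treats as marking a link.
function looksLikeLink($word) {
    return substr($word, 0, 7) == 'http://'
        || substr($word, 0, 4) == 'www.';
}

// looksLikeLink('http://www.google.com/') is true
// looksLikeLink('www.google.com')         is true
// looksLikeLink('google.com')             is false
```

The comparison is case-sensitive, so a token starting with "Http://" or "WWW." would not be recognised by this sketch.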

Regular expressions are powerful beasts. It’s quite possible that one could be constructed to avoid the problems mentioned above, making extensive use of arcane tricks such as negative lookahead assertions (say that quickly five times) and non-backreferenced atoms. The readability of a regular expression converges towards zero the more these tricks are employed, so I for one prefer to avoid them. Instead, the whole operation can be achieved using a combination of simple string operations.

The first step is to split the string into an array of words (or tokens) by exploding it on spaces. Spaces are ideal for this as they’re the one character we can guarantee won’t appear in a link, or at least not in one that people are likely to paste into some text in the expectation that it will be converted. We can now concentrate on each word in turn. If it starts with http:// or www. then we can treat it as a link, but what about the trailing punctuation? If we know the characters we don’t want to appear at the end of the link (I use .,'")(<>;:) we can simply chop those characters off the end of the link one by one until we get to a character not represented in our block list. To preserve the formatting of the original text we should store each of the eliminated characters and tag them back on once the link has been created.
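
The chopping step on its own can be sketched as a small loop. The helper name splitTrailingPunctuation and its return shape are assumptions made for this illustration; the block list is the one given above.

```php
<?php
// Hypothetical helper illustrating the "chop and remember" step.
// Returns array($link, $trailing): the token without its trailing
// punctuation, and the removed punctuation in its original order.
function splitTrailingPunctuation($word, $punctuation = '.,\'")(<>;:') {
    $trailing = '';
    // Remove one character at a time while the last one is in the block list
    while (strlen($word) > 0 &&
            strpos($punctuation, substr($word, -1)) !== false) {
        $trailing = substr($word, -1) . $trailing; // prepend to keep the order
        $word = substr($word, 0, -1);
    }
    return array($word, $trailing);
}

list($link, $trailing) = splitTrailingPunctuation('www.google.com),');
// $link is 'www.google.com' and $trailing is '),'
```

Prepending each removed character, rather than appending it, is what keeps the punctuation in its original order when more than one character is removed.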

The only step left is to add in the HTML for the link, then reconstruct the original text by imploding the array back down to a string separated by spaces. Here’s the finished function:

function convertLinks($text) {
    $words = explode(' ', $text);
    for ($i = 0, $j = count($words); $i < $j; $i++) {
        $word = $words[$i];
        $punctuation = '.,\'")(<>;:'; // Links may not end in these
        if (substr($word, 0, 7) == 'http://' || 
                substr($word, 0, 4) == 'www.') {
            $trailing = '';
            // Knock off ending punctuation
            $last = substr($word, -1);
            if (strpos($punctuation, $last) !== false) {
                // Last character is punctuation - eliminate it
                $trailing .= $last;
                $word = substr($word, 0, -1);
            }
            // Make link, add trailing punctuation back afterwards
            $link = $word;
            if (substr($link, 0, 4) == 'www.') {
                // This link needs an http://
                $link = 'http://'.$link;
            }
            $word = '<a href="'.$link.'">'.$word.'</a>'.$trailing;
        }
        $words[$i] = $word;
    }
    return implode(' ', $words);
}

It doesn’t cover every eventuality (the fuzzy nature of the problem makes that a pretty thankless task) but it handles most cases admirably well. If you want to try it out I’ve set up a demo page that uses the function right here.

Update: The above code contains a couple of subtle bugs, mostly relating to line endings. I’ve posted an updated and improved version on the demo page.
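
The version from the demo page is not reproduced here. Purely as a sketch of how the two issues could be tackled with the same string functions, the following variant splits the text into lines before splitting each line into words, and strips trailing punctuation in a loop rather than once. The function name convertLinksSketch is hypothetical, treating a stray "\r" from a Windows line ending as trailing punctuation is an assumption, and tabs and HTML escaping are not handled.

```php
<?php
// Sketch only: NOT the updated function from the demo page.
function convertLinksSketch($text) {
    $punctuation = '.,\'")(<>;:'; // Links may not end in these
    $lines = explode("\n", $text);
    foreach ($lines as $n => $line) {
        $words = explode(' ', $line);
        foreach ($words as $i => $word) {
            if (substr($word, 0, 7) != 'http://' &&
                    substr($word, 0, 4) != 'www.') {
                continue; // Not a link - leave the word untouched
            }
            $trailing = '';
            // Knock off ALL ending punctuation, plus any leftover "\r"
            while (strlen($word) > 0 &&
                    strpos($punctuation . "\r", substr($word, -1)) !== false) {
                $trailing = substr($word, -1) . $trailing;
                $word = substr($word, 0, -1);
            }
            // www. links need an http:// in the href
            $link = (substr($word, 0, 4) == 'www.') ? 'http://' . $word : $word;
            $words[$i] = '<a href="' . $link . '">' . $word . '</a>' . $trailing;
        }
        $lines[$n] = implode(' ', $words);
    }
    return implode("\n", $lines);
}
```

With this variant a link at the very start of a line is picked up, because the preceding newline is no longer glued to the front of the word.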

This is Converting links without regular expressions by Simon Willison, posted on 19th October 2003.

16 comments

  1. When testing to see if the 'word' is a link, wouldn't it be easier to use a regular expression at that point?
    Like if ( preg_match ( "/^(http:\/\/|ftp:\/\/|www.)/i", $word ) ) ...

    This way, you can add any protocol you want as a link without the hassle of the length or extra cases in the if.

    Andrew - 19th October 2003 23:51 - #

  2. That should, of course, be a www\. in the RegExp.

    Andrew - 19th October 2003 23:52 - #

  3. ... which is exactly why I like to avoid them if there's a simpler string method to use instead ;) Python provides the most readable way of expressing that logic:

    if word.startswith('http://') or word.startswith('ftp://') or word.startswith('www.'):

    Simon Willison - 20th October 2003 00:00 - #

  4. Just a question ... although your method is easier to implement (seemingly), does it perform well in terms of performance? I would think that using this exploded string method would go significantly slower than using one of those nice preg_replace functions... But great site. I recently discovered your blog and can't stop reading it :)

    roy - 20th October 2003 03:30 - #

  5. Your question piqued my curiosity so I ran a couple of benchmarks. The naive regular expressions from my previous post are a great deal faster (as expected) running the function ten times on a reasonably average block of text in 0.003 seconds, while my function took 0.01 seconds. I would argue that even though the regular expression was three times faster the performance of my string function method is still easily fast enough for the vast majority of purposes.

    Simon Willison - 20th October 2003 04:11 - #

  6. If a URL has multiple punctuation marks at the end (like this one: http://simon.incutio.com), then it doesn't get marked up. I'm not a code guru, but would that be solved by simply changing the if (strpos($punctuation... to a while (strpos($punctuation... ?

    Micah - 20th October 2003 05:07 - #

  7. Micah: indeed it would, and in the original code I wrote with Natalie it was. Thanks for pointing that out - I've fixed it in the improved version of the code displayed on the demo page.

    Simon Willison - 20th October 2003 05:10 - #

  8. I think it will be quite tricky to support this:

    If you go to home.xfm#frames(a=one.xhtml,b=two.xhtml,c=three.xhtml), you will see an example of a XFrames based site

    :-)

    Anne - 20th October 2003 08:08 - #

  9. Of course, there are always exceptions. Ignore commas if they occur at the end of the URL? You'd break links to the administrative resources of the W3C. For example, http://www.w3.org/TR/xhtml1/, is a valid URL (although it does redirect to a bit more conventional URL). Then again, commenters on your other blog would probably not link to these pages. :P

    Eugene - 20th October 2003 14:28 - #

  10. Very nice code, and as you say it's not perfect (and can't be, really).

    The problem I encountered, however, is that links with parentheses around them didn't work (www.google.com). This is something I do every once in a while when I'm explaining something and don't want to use a link without showing the url for various reasons

    I don't know if this is really a problem big enough to consider a problem/bug though. I guess the best way to come around it is to explode with ( and ) too, but then the function may become a bit bloated..

    Just thinking out loud here :)

    Eivind - 20th October 2003 15:28 - #

  11. Here, for what it's worth, is a regex version of the above. Should work with urls in parentheses (http://thingy.com!), as well.

    
    $text = preg_replace("/(?:(http:\/\/)|(www\.))(\S+\b\/?)([ [:punct:]]*)(\s|$)/i", 
        "<a href=\"http://$2$3\">$1$2$3</a>$4$5", $text);
    

    If you want to do it strictly with stock PHP functions (a fine ambition), you're taking a huge performance hit with all those count() calls:

    
    $words = explode(' ', $text);
    for ($i = 0, $j = count($words); $i < $j; $i++) {
        $word = $words[$i];
        ...
    
    ...can be done much more simply:
    
    $words = explode(' ', $text);
    foreach ($words as $word) {
        ...
    

    Dean Allen - 22nd October 2003 20:36 - #

  12. ... or even:
    
    foreach (explode(' ', $text) as $word) {
        ...
    
    Also, there's no reason to split the text into lines before splitting it into words:
    
    foreach (preg_split('/[\r\n]/', $text) as $wordnum => $word) {
        ...
    
    OK. I'll stop nitpicking now.

    pgl - 5th November 2003 02:27 - #

  13. Hey pgl, Sorry but your foreach statements wouldn't work. There's just no way to implode the text back together. Well maybe with the second one using the wordnum variable, but again that would create more temporary variables towards the end. I actually prefer a hybrid of yours and Mr. Allen's methods:

    $words = explode(" ", $text);
    foreach ($words as $wordnum => $word) {
       $word = ...; // looping done here
       $words[$wordnum] = $word;
    }
    $text = implode(" ", $words);
    

    And sorry for posting this so late, but I'm tweaking my CMS and had to reference your script Simon. The punctuation string is genius, but why you use a variable and not just a string during the if/loop statement is lost to me... Okay, now I'm being picky! :)

    Stephen - 1st December 2003 07:25 - #

  14. mixed-case URI methods (e.g. Http://) won't be picked up either - maybe a little strtolower() action is in order?

    Lefty - 10th March 2004 20:44 - #

  15. Great code. Thanks for saving me a couple of hours of work!

    Jason Butler - 5th April 2005 21:40 - #

  16. Just have to say thanks - this code essentially saved my sanity.

    Jeremy - 5th February 2006 16:06 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2003/10/19/convertingLinks