
Simon Willison’s Weblog

Escaping regular expression characters in JavaScript

JavaScript’s support for regular expressions is generally pretty good, but there is one notable omission: an escaping mechanism for literal strings. Say for example you need to create a regular expression that removes a specific string from the end of a string. If you know the string you want to remove when you write the script this is easy:


var newString = oldString.replace(/Remove from end$/, '');

But what if the string to be removed comes from a variable? You’ll need to construct a regular expression from the variable, using the RegExp constructor function:


var re = new RegExp(stringToRemove + '$');
var newString = oldString.replace(re, '');

But what if the string you want to remove may contain regular expression metacharacters—characters like $ or . that affect the behaviour of the expression? Languages such as Python provide functions for escaping these characters (see re.escape); with JavaScript you have to write your own.
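To see why this matters, here is a minimal sketch of the unescaped approach going wrong. The sample strings are made up for illustration and are not from any real script.

```javascript
// Hypothetical sample values, chosen to show the failure mode.
var oldString = 'Total: $5';
var stringToRemove = '$5';

// Unescaped, the pattern becomes /$5$/. The first $ is an end-of-input
// anchor rather than a literal dollar sign, so nothing can match.
var naive = new RegExp(stringToRemove + '$');
var naiveResult = oldString.replace(naive, '');

// A dot is a wildcard, so '.com' also matches the 'ecom' in 'telecom'.
var dotResult = 'telecom'.replace(new RegExp('.com' + '$'), '');
```

In the first case nothing is removed at all; in the second, too much is removed.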

Here’s mine:


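RegExp.escape = function(text) {
  // Compile the expression on the first call and cache it on the function.
  if (!arguments.callee.sRE) {
    var specials = [
      '/', '.', '*', '+', '?', '|', '^', '$',
      '(', ')', '[', ']', '{', '}', '\\'
    ];
    // Builds (\/|\.|\*|...), one escaped alternative per metacharacter.
    arguments.callee.sRE = new RegExp(
      '(\\' + specials.join('|\\') + ')', 'g'
    );
  }
  // Prefix every matched metacharacter with a backslash.
  return text.replace(arguments.callee.sRE, '\\$1');
};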

This deals with another common problem in JavaScript: compiling a regular expression once (rather than every time you use it) while keeping it local to a function. arguments.callee inside a function always refers to the function itself, and since JavaScript functions are objects you can store properties on them. In this case, the first time the function is run it compiles a regular expression and stashes it in the sRE property. On subsequent calls the pre-compiled expression can be reused.
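Accessing arguments.callee throws a TypeError in strict mode code, so the same caching idea can also be sketched with a named function expression. The name escapeRegExp is a hypothetical stand-in, and the character class also covers ^ and $; treat this as one possible variant rather than the definitive form.

```javascript
'use strict';

// Hypothetical standalone name. A named function expression lets the
// body refer to itself as escapeRegExp without arguments.callee.
var escapeRegExp = function escapeRegExp(text) {
  if (!escapeRegExp.sRE) {
    // Compiled on the first call only, then stored on the function object.
    escapeRegExp.sRE = /[\/.*+?|()\[\]{}\\^$]/g;
  }
  // $& inserts the matched character after the added backslash.
  return text.replace(escapeRegExp.sRE, '\\$&');
};

var first = escapeRegExp('a.b');
var cached = escapeRegExp.sRE;   // the expression stored by the first call
var second = escapeRegExp('$5'); // reuses the stored expression
```

As with the arguments.callee version, the cached expression lives on the function object itself rather than in a global variable.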

In the above snippet I’ve added my function as a property of the RegExp constructor. There’s no pressing reason to do this other than a desire to keep generic functionality relating to regular expression handling in the same place. If you rename the function it will still work as expected, since the use of arguments.callee eliminates any coupling between the function definition and the rest of the code.
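Putting the pieces together, the "remove from end" example could be sketched as follows. Both escapeRegExp and removeFromEnd are hypothetical names used only for this illustration; the helper is a compact stand-in for the escaping function rather than a copy of it.

```javascript
// Hypothetical helper: prefix each regular expression metacharacter
// with a backslash so that it is matched literally.
function escapeRegExp(text) {
  return text.replace(/[\/.*+?|()\[\]{}\\^$]/g, '\\$&');
}

// Hypothetical wrapper around the "remove from end" example.
function removeFromEnd(oldString, stringToRemove) {
  var re = new RegExp(escapeRegExp(stringToRemove) + '$');
  return oldString.replace(re, '');
}

var priceResult = removeFromEnd('Total: $5', '$5');   // literal $5 is removed
var dotResult = removeFromEnd('telecom', '.com');     // no literal '.com', unchanged
var urlResult = removeFromEnd('example.com', '.com'); // literal '.com' is removed
```

Only the variable part of the pattern is escaped; the trailing '$' is appended afterwards so that it keeps its meaning as an anchor.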

This is Escaping regular expression characters in JavaScript by Simon Willison, posted on 20th January 2006.

32 comments

  1. This page just hit the del.icio.us "popular" list. Well done Simon!

    Chris Beach - 20th January 2006 16:32 - #

  2. / is not special in a RegExp, except when you write the expression in literal form, which you are not doing when this method is used. My corresponding sRE looks like this: /([.*+?|(){}[\]\\])/g.

    Johan Sundström - 20th January 2006 17:12 - #

  3. Your comment preview page UTF-8 encodes all input one time too many, by the way. (Or incorrectly treats all input as ISO Latin, forcefully converting it as such, depending on your point of view.)

    Johan Sundström - 20th January 2006 17:18 - #

  4. Very cool function Simon.

    If you want to avoid regular expressions altogether, you have a few options for replacing a specific string, regardless of special characters.

    If you don't care about the substring's position, you can simply use the String.replace variant that does not take a regular expression as the first argument. For global behavior, you would have to do something like:

    var newString = oldString;
    while (newString.indexOf(stringToRemove) != -1) {
        newString = newString.replace(stringToRemove, "");
    }

    If position within the string is significant, you would have to do some such ugliness:

    // beginning of the string
    if (oldString.indexOf(stringToRemove) == 0) {
        newString = oldString.substring(stringToRemove.length);
    }
    
    // end of the string (the -1 check guards against a "not found" result)
    var index = oldString.lastIndexOf(stringToRemove);
    if (index != -1 && index == oldString.length - stringToRemove.length) {
        newString = oldString.substring(0, index);
    }

    Anything more complicated than that, and you would have to use regular expressions.

    David Lindquist - 20th January 2006 18:22 - #
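The regex-free options described in this comment can also be sketched with String.prototype.startsWith and endsWith, which were standardized in ECMAScript 2015, after this thread was written. The helper names are hypothetical. Note that removeAll makes a single pass, so unlike the while loop it does not remove occurrences that only appear once earlier ones have been deleted.

```javascript
// Hypothetical helper names; no regular expressions are involved.
function removeAll(oldString, stringToRemove) {
  // Guard: an empty search string would split between every character.
  if (stringToRemove === '') return oldString;
  return oldString.split(stringToRemove).join('');
}

function removeFromStart(oldString, stringToRemove) {
  return oldString.startsWith(stringToRemove)
    ? oldString.slice(stringToRemove.length)
    : oldString;
}

function removeFromEnd(oldString, stringToRemove) {
  // The empty-string check avoids a pointless slice of the whole string.
  if (stringToRemove !== '' && oldString.endsWith(stringToRemove)) {
    return oldString.slice(0, oldString.length - stringToRemove.length);
  }
  return oldString;
}
```

Each helper returns the original string untouched when there is nothing to remove, so the result can be used without checking whether a match occurred.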

  5. Instead of storing the regexp on the callee, you could execute a function which returns the function with a closure to the regexp.

    Or in code:

    RegExp.escape = (function() {
      var specials = [
        '/', '.', '*', '+', '?', '|',
        '(', ')', '[', ']', '{', '}', '\\';
      ];
    
      sRE = new RegExp(
        '(\\' + specials.join('|\\') + ')', 'g'
      );
      
      return function(text) {
        return text.replace(sRE, '\\$1');
      }
    })();
    

    Mark Wubben - 20th January 2006 19:26 - #

  6. I'm getting an error because of a semi-colon ";" within the specials array.

    Patrick Fitzgerald - 20th January 2006 19:31 - #

  7. Nice one, Mark. I wonder how ref counting would clean up docs where an alien object has a closure reference to a native object though? I really must get better with JS implementations...

    Jeremy Dunck - 20th January 2006 20:03 - #

  8. You wrote: I've added my function as a property of the RegExp constructor.

    This might be confusing since RegExp is an object, and it's hard to tell the difference between a method (RegExp.test) and a function that has been attached to the object constructor (RegExp.escape).

    A method: var re1 = new RegExp("foo.bar"); re1.test("This is foobar");
    vs. a function: reString = RegExp.escape("foo.bar");

    For example, the following would cause an error that might be hard to debug if you didn't know exactly what was going on:
    var re1 = new RegExp("foo"); re1.escape("foo.bar"); /* not valid */

    Plus the typical way this might be called seems a bit confusing:
    var re1 = new RegExp(RegExp.escape("foo.bar"));

    I think this function would be better as a prototype to the String object since it's actually performing a transformation on string data. Then you could use it like this:
    var re1 = new RegExp("foo.bar".escapeRegExp());
    var re2 = new RegExp(s1.escapeRegExp());

    Patrick Fitzgerald - 20th January 2006 20:22 - #

  9. I probably shouldn't comment on the coding habits of my betters, but I view caching the compiled regexp on the calling function from within the called function as poor practice. Your function is now broken if you call it twice from within the same callee, and worse, will you remember that behavior a year from now when you re-use this function?

    Much better, I think, to do the caching at the callee level.

    Scott Turner

    Scott Turner - 20th January 2006 21:02 - #

  10. Scott, the callee is the function itself, not the function that calls it. The regex is cached because it will be the same for every call of the function, so you don't want to regenerate it every time. Are you confusing callee with caller? Or am I misunderstanding you?

    Rory Parle - 20th January 2006 22:57 - #

  11. I had a feeling I'd get some good feedback on this one! I really like Mark's closure trick and I almost agree with Patrick that this would be better as an extension of the String method. However, my priority for this is reusability and it's likely that other people might already have defined a String.prototype.escape method of their own with different semantics.

    Simon Willison - 21st January 2006 12:45 - #

  12. Indeed, Mark, yours looks good with the closure... I've noticed you do that a bit and I'm starting to catch on to its usefulness.

    Dustin - 22nd January 2006 05:13 - #

  13. If you're interested in this closure stuff, I just published an article about it: Getting Funky With Scopes and Closures.

    (Simon, apologies for the plug! But I figure it'd be useful for some...)

    Mark Wubben - 23rd January 2006 00:36 - #

  14. In fact, caching the expression doesn't help much in terms of execution speed. It seems that modern browsers implement some form of internal caching. For some time now I have been using the following little function, which runs about twice as fast as RegExp.escape:

    function encodeRE(s) { return s.replace(/([.*+?^${}()|[\]\/\\])/g, '\\$1') }

    Proof

    Theodor Zoulias - 24th January 2006 18:04 - #

  15. In addition to Theodor's remark: regular expression literals are, per spec, compiled only once during parsing of the script; any assignments will result in a reference to the compiled RegExp object. Since the needed expression here will always be the same, it is better to use a literal instead of going to great lengths to prevent compilation of the same expression every time you need it.
    Whether you then implement it as a function or as a method is a matter of preference...

    General remark: I don't see the need for using a subpattern here; this should work as well:

    function encodeRE(s) { return s.replace(/[.*+?^${}()|[\]\/\\]/g, '\\$0'); }

    Tino Zijdel - 25th January 2006 12:14 - #

  16. good info. can you share some more related links. Web Designer from India

    s kumar - 1st February 2006 19:08 - #

  17. good read helped me in my design.

    Fabian De Rango - 5th February 2006 08:19 - #

  20. Mmmz, test before you comment :P

    In javascript there is no $0 that refers to the full match, so the subpattern is necessary.

    Tino Zijdel - 7th February 2006 22:58 - #
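JavaScript's replacement syntax does have a token for the whole match, spelled $& rather than $0. A version without the capturing group could therefore look like the sketch below, which reuses the encodeRE name from the comments above.

```javascript
// Same character class as in the comments above, without the capturing group.
function encodeRE(s) {
  // $& stands for the entire matched substring.
  return s.replace(/[.*+?^${}()|[\]\/\\]/g, '\\$&');
}

var withAmpersand = encodeRE('foo.bar');

// $0 has no special meaning, so it ends up in the output literally.
var withZero = 'foo.bar'.replace(/[.]/g, '\\$0');
```

Here withAmpersand contains an escaped dot, while withZero contains the characters $0 in place of the dot.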

  22. The only problem with Mark's approach is that he forgot to var sRE, making it visible outside the function's scope. This is a tricky part of JavaScript that should be simple but few people seem to grasp. I recommend reading ECMA-262's entry about the "Activation Object" (which holds the function's parameters, local variables and a link to the parent scope). I would have probably written it like this:

    (function() {
        var __specials__ = "/.*+?|()[]{}\\".split("");
        var __sre__ = new RegExp("(\\" + __specials__.join("|\\") + ")", "g");
        RegExp.escape = function(text) { text.replace(__sre__, "\\$1"); }
    })();

    Which is really saying the same thing, in a less elegant manner :) When I first heard about arguments.callee (and caller) I started relying on it so much that it didn't take me long to realize the abuse (I wrote that for Flash three years ago when ActionScript didn't have regexes...). It looks leet and all, but nowadays I try to avoid it when possible. Using anonymous function scopes to store private members is a much cleaner way to do it.

    Jonas Galvez - 28th February 2006 12:36 - #

  23. Jonas, your code has sixteen underscores too many, and one return too few. :) Speed results are the same as for Mark's version.

    Theodor Zoulias - 1st March 2006 08:15 - #

  24. Heh, indeed. Don't see anything wrong with underscores, tho ;)

    Jonas Galvez - 2nd March 2006 14:52 - #

  29. Thanks for all that work Simon.. great site and even better resource!.

    Criss - 23rd March 2006 13:34 - #

  30. Simon, this post (and the resulting comments) saved me quite a lot of work on my current project, and I've been getting a lot of use out of your "(Re)-Introduction to JavaScript" article. Thanks!

    Jordan Running - 13th April 2006 05:38 - #

  31. Just a thought... With JavaScript, can't you just use a backslash to use special characters? For example, this works fine for me: html.replace(/\$yourname\$/g,document.getElementById("YourName").value); if you were looking to replace '$yourname$'... Maybe I'm missing the point of this post... If so, sorry!

    Tom Meier - 18th July 2006 05:53 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2006/01/20/escape
