Escaping regular expression characters in JavaScript
JavaScript’s support for regular expressions is generally pretty good, but there is one notable omission: an escaping mechanism for literal strings. Say for example you need to create a regular expression that removes a specific string from the end of a string. If you know the string you want to remove when you write the script this is easy:
var newString = oldString.replace(/Remove from end$/, '');
But what if the string to be removed comes from a variable? You’ll need to construct a regular expression from the variable, using the RegExp constructor function:
var re = new RegExp(stringToRemove + '$');
var newString = oldString.replace(re, '');
But what if the string you want to remove may contain regular expression metacharacters—characters like $ or . that affect the behaviour of the expression? Languages such as Python provide functions for escaping these characters (see re.escape); with JavaScript you have to write your own.
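To make the problem concrete, here is a small sketch of what can go wrong when the string is passed to the constructor unescaped. The sample values are invented for illustration.

```javascript
// Sample values invented for illustration.
var re = new RegExp('.00' + '$');

// The unescaped '.' matches any character, so 'x00' is removed as well.
var surprising = 'Total: 5x00'.replace(re, '');

// An unbalanced '(' is worse: the constructor throws a SyntaxError.
var threw = false;
try {
  new RegExp('(USD' + '$');
} catch (e) {
  threw = e instanceof SyntaxError;
}
```

In the first case the expression silently matches more than intended; in the second it never compiles at all.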
Here’s mine:
RegExp.escape = function(text) {
  if (!arguments.callee.sRE) {
    var specials = [
      '/', '.', '*', '+', '?', '|', '^', '$',
      '(', ')', '[', ']', '{', '}', '\\'
    ];
    arguments.callee.sRE = new RegExp(
      '(\\' + specials.join('|\\') + ')', 'g'
    );
  }
  return text.replace(arguments.callee.sRE, '\\$1');
};
This deals with another common problem in JavaScript: compiling a regular expression once (rather than every time you use it) while keeping it local to a function. arguments.callee inside a function always refers to the function itself, and since JavaScript functions are objects you can store properties on them. In this case, the first time the function is run it compiles a regular expression and stashes it in the sRE property. On subsequent calls the pre-compiled expression can be reused.
In the above snippet I’ve added my function as a property of the RegExp constructor. There’s no pressing reason to do this other than a desire to keep generic functionality relating to regular expression handling in the same place. If you rename the function it will still work as expected, since the use of arguments.callee eliminates any coupling between the function definition and the rest of the code.
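To tie this back to the opening example, here is a sketch of how an escaping function might be combined with the RegExp constructor. The helper escapeForRegExp is a hypothetical stand-in for the function above (written with a literal so the snippet runs on its own), and the sample strings are invented.

```javascript
// Hypothetical stand-in for the escaping function discussed above.
function escapeForRegExp(text) {
  return text.replace(/([.*+?^${}()|[\]\/\\])/g, '\\$1');
}

var oldString = 'Report (draft).';
var stringToRemove = ' (draft).';

// Escape the literal text first, then append the end-of-string anchor.
var re = new RegExp(escapeForRegExp(stringToRemove) + '$');
var newString = oldString.replace(re, '');
```

The anchor is appended after escaping; escaping the whole pattern, anchor included, would turn the $ into a literal dollar sign.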
This page just hit the del.icio.us "popular" list. Well done Simon!
Chris Beach - 20th January 2006 16:32 - #
/ is not special in a RegExp, except when you write the expression in literal form, which is not the case when this method is used. My corresponding sRE looks like this:
/([.*+?|(){}[\]\\])/g
Johan Sundström - 20th January 2006 17:12 - #
Johan Sundström - 20th January 2006 17:18 - #
Very cool function Simon.
If you want to avoid regular expressions altogether, you have a few options for replacing a specific string, regardless of special characters.
If you don't care about the substring's position, you can simply use the String.replace variant that does not take a regular expression as the first argument. For global behavior, you would have to do something like:
If position within the string is significant, you would have to do some such ugliness:
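A minimal sketch of both cases, using only string methods. This is not the commenter's original code; the helper names replaceAllLiteral and removeFromEnd are hypothetical.

```javascript
// Hypothetical helper: replace every occurrence of a literal substring
// by splitting on it and joining the pieces with the replacement.
function replaceAllLiteral(text, search, replacement) {
  if (search === '') {
    return text; // splitting on '' would break the text into characters
  }
  return text.split(search).join(replacement);
}

// Hypothetical helper: remove a literal suffix, but only when the
// string actually ends with it.
function removeFromEnd(text, suffix) {
  var start = text.length - suffix.length;
  if (suffix !== '' && start >= 0 && text.indexOf(suffix, start) === start) {
    return text.substring(0, start);
  }
  return text;
}
```

Because neither helper builds a regular expression, characters such as $ and . in the search string need no special handling.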
Anything more complicated than that, and you would have to use regular expressions.
David Lindquist - 20th January 2006 18:22 - #
Instead of storing the regexp on the callee, you could execute a function which returns the function with a closure to the regexp.
Or in code:
RegExp.escape = (function() {
  var specials = [
    '/', '.', '*', '+', '?', '|',
    '(', ')', '[', ']', '{', '}', '\\'
  ];
  sRE = new RegExp(
    '(\\' + specials.join('|\\') + ')', 'g'
  );
  return function(text) {
    return text.replace(sRE, '\\$1');
  }
})();
Mark Wubben - 20th January 2006 19:26 - #
Patrick Fitzgerald - 20th January 2006 19:31 - #
Nice one, Mark. I wonder how ref counting would clean up docs where an alien object has a closure reference to a native object though? I really must get better with JS implementations...
Jeremy Dunck - 20th January 2006 20:03 - #
You wrote: I've added my function as a property of the RegExp constructor.
This might be confusing since RegExp is an object, and it's hard to tell the difference between a method (RegExp.test) and a function that has been attached to the object constructor (RegExp.escape).
A method:
var re1 = new RegExp("foo.bar");
re1.test("This is foobar");
vs. a function:
reString = RegExp.escape("foo.bar");
For example, the following would cause an error that might be hard to debug if you didn't know exactly what was going on:
var re1 = new RegExp("foo");
re1.escape("foo.bar"); /* not valid */
Plus the typical way this might be called seems a bit confusing:
var re1 = new RegExp(RegExp.escape("foo.bar"));
I think this function would be better as a method on the String prototype since it's actually performing a transformation on string data. Then you could use it like this:
var re1 = new RegExp("foo.bar".escapeRegExp());
var re2 = new RegExp(s1.escapeRegExp());
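A rough sketch of what such a String method might look like. The body is an assumption (the comment only shows how escapeRegExp would be called), and extending a built-in prototype affects every string in the page, so treat this as an illustration rather than a recommendation.

```javascript
// Hypothetical implementation of the escapeRegExp method proposed above.
String.prototype.escapeRegExp = function () {
  return this.replace(/([.*+?^${}()|[\]\/\\])/g, '\\$1');
};

var re1 = new RegExp('foo.bar'.escapeRegExp());
var matchesLiteral = re1.test('This is foo.bar'); // the '.' must now match a real dot
var matchesOther = re1.test('This is fooxbar');
```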
Patrick Fitzgerald - 20th January 2006 20:22 - #
Much better, I think, to do the caching at the callee level.
Scott Turner
Scott Turner - 20th January 2006 21:02 - #
Rory Parle - 20th January 2006 22:57 - #
Simon Willison - 21st January 2006 12:45 - #
Dustin - 22nd January 2006 05:13 - #
If you're interested in this closure stuff, I just published an article about it: Getting Funky With Scopes and Closures.
(Simon, apologies for the plug! But I figure it'd be useful for some...)
Mark Wubben - 23rd January 2006 00:36 - #
In fact, caching the expression doesn't help much in terms of execution speed. It seems that modern browsers implement internal caching or something similar. For some time now I have been using the following little function, which runs about twice as fast as RegExp.escape:
function encodeRE(s) { return s.replace(/([.*+?^${}()|[\]\/\\])/g, '\\$1') }
Proof
Theodor Zoulias - 24th January 2006 18:04 - #
In addition to Theodor's remark: regular expression literals are per spec compiled only once during parsing of the script; any assignments will result in a reference to the compiled RegExp object. Since the needed expression here will always be the same, it is better to use a literal instead of going to great lengths to prevent compilation of the same expression every time you need it.
Whether you then implement it as a function or as a method is a matter of preference...
General remark: I don't see the need for using a subpattern here; this should work as well:
function encodeRE(s) { return s.replace(/[.*+?^${}()|[\]\/\\]/g, '\\$0'); }
Tino Zijdel - 25th January 2006 12:14 - #
s kumar - 1st February 2006 19:08 - #
Brad - 5th February 2006 03:56 - #
Fabian De Rango - 5th February 2006 08:19 - #
swappy - 7th February 2006 17:08 - #
swappy - 7th February 2006 17:11 - #
Mmmz, test before you comment :P
In JavaScript there is no $0 that refers to the full match, so the subpattern is necessary.
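For what it's worth, JavaScript's replacement strings do offer $& as a reference to the whole match, so a version without the capturing group can be sketched as follows. The function name follows the comments above; the sample inputs are invented.

```javascript
// Same character class as above, but using $& (the whole match)
// in place of a numbered group.
function encodeRE(s) {
  return s.replace(/[.*+?^${}()|[\]\/\\]/g, '\\$&');
}

var escaped = encodeRE('$5.00');

// By contrast, $0 has no special meaning and is inserted literally.
var literalDollarZero = 'a.b'.replace(/\./g, '\\$0');
```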
Tino Zijdel - 7th February 2006 22:58 - #
vinhkinh - 15th February 2006 03:40 - #
The only problem with Mark's approach is that he forgot to var sRE, making it visible outside the function's scope. This is a tricky part of JavaScript that should be simple but few people seem to grasp. I recommend reading ECMA-262's entry about the "Activation Object" (which holds the function's parameters, local variables and a link to the parent scope). I would have probably written it like this:
Which is really saying the same thing, in a less elegant manner :) When I first heard about arguments.callee (and caller) I started relying on it so much that it didn't take me long to realize the abuse (I wrote that for Flash three years ago when ActionScript didn't have regexes...). It looks leet and all, but nowadays I try to avoid it when possible. Using anonymous function scopes to store private members is a much cleaner way to do it.
Jonas Galvez - 28th February 2006 12:36 - #
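A rough sketch of the closure approach with sRE declared via var, so that it stays private to the enclosing function. This is not the commenter's original code; the name escapeRegExp is hypothetical and is used here to avoid touching the RegExp constructor.

```javascript
// Hypothetical name; the immediately invoked function runs once and
// returns the inner function, which keeps access to sRE via the closure.
var escapeRegExp = (function () {
  var specials = [
    '/', '.', '*', '+', '?', '|',
    '(', ')', '[', ']', '{', '}', '\\'
  ];
  // Declared with var, so sRE is local to this function's scope.
  var sRE = new RegExp('(\\' + specials.join('|\\') + ')', 'g');

  return function (text) {
    return text.replace(sRE, '\\$1');
  };
})();

var example = escapeRegExp('a.b(c)');
```

The expression is still compiled only once, but nothing outside the closure can read or overwrite it.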
Theodor Zoulias - 1st March 2006 08:15 - #
Jonas Galvez - 2nd March 2006 14:52 - #
Criss - 23rd March 2006 13:34 - #
Jordan Running - 13th April 2006 05:38 - #
Tom Meier - 18th July 2006 05:53 - #