Simon Willison’s Weblog

Capturing the power of re.split

A couple of Python tips. The first is really a tip for Mozilla/Firebird: You can set up a Custom Keyword for instantly accessing Python module documentation using the string www.python.org/doc/current/lib/module-%s.html—I have this set up as pydoc, so I can type pydoc re to jump straight to the re module documentation. I only set it up half an hour ago and I’ve already used it about a dozen times.

The second tip is so powerful I’ve been kicking myself for not finding out about it sooner. It relates to the regular expression module’s re.split() function. Just like string.split(), this lets you split up a string based on a certain token. With string.split(), the token you split on isn’t included in the resulting list:

>>> 'pipe|separated|values'.split('|')
['pipe', 'separated', 'values']

This is also true of re.split:

>>> import re
>>> splitter = re.compile('<.>')
>>> splitter.split('hi<a>there<b>from<c>python')
['hi', 'there', 'from', 'python']

Here’s the magic part though. If you put part or all of the regular expression in parentheses, the separating tokens get included in the resulting list:

>>> splitter = re.compile('(<.>)')
>>> splitter.split('hi<a>there<b>from<c>python')
['hi', '<a>', 'there', '<b>', 'from', '<c>', 'python']

Why is this a big deal? Because it suddenly makes writing simple parsers and tokenisers a whole heck of a lot easier. Using the above example, say you wanted to do something with each of the <?> style tags. You can just iterate through the resulting list identifying each tag using the regular expression you’ve already compiled and then altering just those list items, before joining the whole list back together again at the end.
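As a rough sketch of that split, alter, join loop, something like the following should work. The upper_tags name and the choice to upper-case each tag are just assumptions for illustration; any per-tag transformation would slot in the same way.

```python
import re

# Same pattern as above; the parentheses keep the tags in the split result.
splitter = re.compile('(<.>)')

def upper_tags(text):
    """Upper-case every <?> tag and leave the text between tags untouched.

    Hypothetical helper, not part of the re module.
    """
    pieces = splitter.split(text)
    for i, piece in enumerate(pieces):
        # fullmatch checks that the whole piece is a tag, so plain text
        # between the tags is never altered.
        if splitter.fullmatch(piece):
            pieces[i] = piece.upper()
    return ''.join(pieces)

result = upper_tags('hi<a>there<b>from<c>python')
print(result)  # hi<A>there<B>from<C>python
```

Because the list still contains every piece of the original string in order, joining it back together loses nothing.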

Simple parsing and replacement of easily identified tags can already be achieved using the re.sub() method, which allows you to provide a callback function to process each matching token. The difference with re.split() is that you can easily take into account the order of the tokens, allowing you to build systems that use special tags to mark out areas of a document without getting confused by nested tag sets. As a simple example, you could build a basic event-based XML parser using just a couple of expressions. In fact, I discovered this technique while examining the source code for the tinpy tiny Python template module, which gives a clue as to why I’m so interested in it.
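To give a flavour of the order-aware approach, here is a minimal sketch of an event-style parser built on re.split(). It is nowhere near real XML: the tag syntax (<name> and </name> only, no attributes), the events function and the event names are all assumptions made up for this example, and it is not how tinpy does it.

```python
import re

# Hypothetical, deliberately simplified tag syntax: <name> opens, </name> closes.
# Attributes, comments and self-closing tags are not handled.
tag = re.compile(r'(</?\w+>)')

def events(text):
    """Yield ('start', name), ('end', name) and ('text', data) in document order."""
    stack = []  # names of the tags that are currently open
    for token in tag.split(text):
        if not token:
            continue  # split() yields '' between adjacent tags and at the edges
        if tag.fullmatch(token):
            name = token.strip('</>')
            if token.startswith('</'):
                # Order matters here: a closing tag must match the innermost open tag.
                if not stack or stack[-1] != name:
                    raise ValueError('unexpected closing tag: ' + token)
                stack.pop()
                yield ('end', name)
            else:
                stack.append(name)
                yield ('start', name)
        else:
            yield ('text', token)
    if stack:
        raise ValueError('unclosed tag: <' + stack[-1] + '>')

parsed = list(events('<p>hi <b>there</b></p>'))
for event in parsed:
    print(event)
```

The stack is what a re.sub() callback can't easily give you: each token is seen in sequence, so the code always knows which tags are open when it reaches the next one.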

Having discovered this feature in Python, I just had to see if it existed in other languages as well. Unsurprisingly it does: PHP’s preg_split offers an optional PREG_SPLIT_DELIM_CAPTURE flag (added in PHP 4.0.5), and JavaScript has similar behaviour to Python, including the splitting token if it is wrapped in parentheses.

I’m probably the last person to find out about this, but it’s such a useful technique I felt I just had to share it with the world.

This is Capturing the power of re.split by Simon Willison, posted on 26th October 2003.

11 comments

  1. I've found re.findall similarly useful (particularly when I'm trying to do something immediately with the results, instead of passing them around to other functions) because it will return the grouped matches, if there are any groups.

    Mark Eichin - 26th October 2003 07:21 - #

  2. Duh! I've used the Custom Keywords in the past, but completely forgot about them. I get tired of going to CPAN and doing a module search. Thanks for the reminder!

    Chris - 26th October 2003 15:23 - #

  3. fwiw, having implemented them both ;-), I tend to prefer re.findall("token|sep") over re.split("sep").

    (but that's probably only because I got used to that style back in the pre-2.2 days, when findall was written in C but split was written in Python. in 2.2 and later, they're both about as fast as they can be.)

    Fredrik Lundh - 27th October 2003 11:55 - #

  4. Opera users can use your Firebird tip. Just edit the search.ini in the profile folder. I just set it up to be able to type p re in the address bar to go to the Python module docs. Some other ones I use all the time: g - google, n - google news, r - google groups, z - amazon, e - ebay, i - imdb, w - wikipedia.

    Greg - 28th October 2003 05:19 - #

  5. Having discovered this feature in Python, I just had to see if it existed in other languages as well.

    This feature was most likely dreamt up by Larry Wall & Co.

    The split function can return as part of the returned array any substrings matched as part of the delimiter: split(/([-,])/, '1-10,20') returns (1,'-',10,',',20)

    The above is from the Changes file of Perl 3.0, released on October 18th, 1989 (making it older than any Python source code).

    Arien - 28th October 2003 07:09 - #

  6. I was just playing with this myself. One little oddness though: why the first '' in the following results?

    >>> ts2
    '.y=2003'
    >>> re.split('(\.\w{1,2})=',ts2)
    ['', '.y', '2003']

    I guess split always wants to return a pair, and consequently, if I don't use the grouping in the regexp (), it has to return ''

    >>> re.split('\.\w{1,2}=',ts2)
    ['', '2003']

    I should just avoid empty matches...

    Joseph Reagle - 31st October 2003 20:58 - #

  7. I tried this, and it just didn't work. What DID work was specifying the pattern as '(<.*?>)'

    Steve Ferg - 9th May 2004 01:43 - #

  8. Holy cow, why didn't I ever think of 'pydoc %s'.. I already use a few of these for other, less obviously useful things.

    *sound of palm meeting forehead*

    Cory Dodt - 21st September 2004 21:52 - #

  9. Thanks, this saved me probably several hours of work ^^

    Nico - 1st January 2005 21:35 - #

  10. Just a tip for anyone arriving here off Google (this page is the first result for "javascript re split"): "string".split() doesn't work like this in IE. In fact, the way it works in IE with regular expressions is almost useless--it omits empty strings from the results array (from when RE or string matches twice in a row, or at the beginning or end of the search string).

    Tom W.M. - 13th April 2006 08:17 - #

  11. Arrived here after Googling the split() method in JavaScript. As Tom W.M. mentions, although split returns delimiters when used as you detail above in Gecko-based browsers, Internet Explorer 5 and 6 do not return the delimiters. This (unfortunately!) means using a combination of match() and split() to accommodate IE.

    Paul C - 29th June 2006 04:23 - #

Previously hosted at http://simon.incutio.com/archive/2003/10/26/reSplit
