“sexeger”[::-1]
Via Ned Batchelder, an article on Reversing Regular Expressions from Perl.com. Otherwise known as Sexeger, these offer a performance boost over normal regular expressions for certain tasks. The basic idea is pretty simple: searching backwards through a string using a regular expression can be a messy business, but by reversing both the string and the expression, running it, then reversing the result far better performance can be achieved (reversing a string is a relatively inexpensive operation). The example code is in Perl, but I couldn’t resist trying it in Python. The challenge is to find the last number occurring in a string.
>>> import re
>>> lastnum = re.compile(r'(\d+)(?!\D*\d)')
>>> s = ' this isa 454 asd very very
very long strin9 asd9 009 76 with numbers
99 in it and here is the last 537 number'
# NB this was all on one line originally
>>> lastnum.search(s).group(0)
'537'
>>> import timeit
>>> t1 = timeit.Timer('lastnum.search(s).group(0)',
'from __main__ import lastnum, s')
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=100000)/100000)
26.82 usec/pass
>>> lastnumrev = re.compile('(\d+)')
>>> lastnumrev.search(s[::-1]).group(0)[::-1]
'537'
>>> t2 = timeit.Timer('lastnumrev.search(s[::-1]).group(0)[::-1]',
'from __main__ import lastnumrev, s')
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=100000)/100000)
9.26 usec/pass
There are a few points worth explaining in the above code. The (?!\D*\d) part of the first regular expression is a negative lookahead assertion—it basically means "match the subpattern provided it isn’t followed by a string of non-digits followed by at least one digit. This is the bit that ensures we only get back the last digit in the string, and is also the bit that could cause a performance problem.
'some string'[::-1] is an example of Extended Slices, introduced in Python 2.3. Its affect is to reverse the string, by stepping through it from start to end going back one character at a time.
The actual benchmarking code makes use of the new timeit module from Python 2.3—I copied it verbatim from that module’s example section in the manual.
The resutls speak for themselves: 26.82 for the lookahead assertion expression compared to just 9.26 for the reversed regular expression. This is definitely a useful trick to add to the tool box.
Keith - 17th September 2003 02:21 - #
Argh, my bad.
lastnum = re.compile(r'(\d+)\D*$')lastnum.search(s).group(1)
Keith - 17th September 2003 02:21 - #
Neat trick! I'd be interested to see how
(\d+)\D*$compares.Adrian Holovaty - 17th September 2003 02:27 - #
Adrian Holovaty - 17th September 2003 02:28 - #
Keith - 17th September 2003 02:53 - #
"(\d+)\D*$" is about as slow as the original pattern; the main problem is that search has to apply the pattern to every position in the character string
to speed things up, you match for ".*\D(\d+)\D*$" (and perhaps fall back on "(\d+)$" if that pattern doesn't match). the leading ".*" tells match that it's safe to search backwards, from the end of the string.
the match approach also works on Python's before 2.3, btw.
(PS. those "XHTML is not well-formed" and "illegal tag" messages are pretty annoying. instead of putting all the burden on the human, may I suggest running the comments through TIDY and mapping presentational markup from human space to semantic space in a second step? you can do that in about 10 lines of Python, using elementtidy and friends... I could have written that code for you in less time than it took me to get this comment through your validator ;-)
Fredrik Lundh - 17th September 2003 07:09 - #
I've started reconsidering the "well formed XHTML" requirement for this site's comment system. When I was blogging mostly about web standards I figured the audience would mostly be comfortable with it, and anyone who wasn't would get a practical lesson in standards on the spot ;) Now that I'm increasing coverage of other topics (in particular Python) that doesn't really hold true any more, and it becomes more of an irritant than a neat feature.
The current code base is getting to the point where any more hacks will cause it to implode, but I'll probably work HTML Tidy support in to the new version of the software (which has been quietly under construction for a while).
Keith, Adrian: I just ran it with
(\d+)\D*$and got 33.72.Simon Willison - 17th September 2003 10:34 - #
Oh, I'm sure you could squeeze in a call to this little just-over-10-lines-or-so utility (installing elementtidy on your server might be more work, though...).
from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilderfrom elementtree.ElementTree import tostringVALID_TAGS = ("a", "p", "blockquote", "ul", "ol", "li", "dl" , "dt", "dd", "em","strong", "dfn", "code", "q", "samp", "kbd", " var", "cite","abbr", "acronym", "sub", "sup")TAG_MAP = { "b": "strong", "i": "em", "tt": "samp" }NS_XHTML = "{http://www.w3.org/1999/xhtml}"def fixcomment(comment):parser = TidyHTMLTreeBuilder()parser.feed(comment)body = parser.close().find(NS_XHTML + "body")for elem in body:for elem in elem.getiterator():# fix XHTML namespaceif elem.tag.startswith(NS_XHTML):elem.tag = elem.tag[len(NS_XHTML): ]# map presentation tags to semantic ta gselem.tag = TAG_MAP.get(elem.tag, elem. tag)# check for bad tagsif elem.tag not in VALID_TAGS:raise SyntaxError("invalid tag: %r " % elem.tag)# clear out attributes (extend as nece ssary)if elem.tag == "a":href = elem.get("href")elem.attrib.clear()if href:elem.set("href", href)else:elem.attrib.clear()# replace the body tag with a "div class='comm ent'" parentbody.tag = "div"; body.attrib = {"class": "com ment"}return tostring(body)print fixcomment("""hello, simon! <p>this is a mal formed <b>comment</b></p>""")Fredrik Lundh - 17th September 2003 16:34 - #
Simon Willison - 17th September 2003 17:20 - #
Keith - 17th September 2003 19:14 - #
The problem with structured text is that there are so many different formats for it. I love it for sites like Wikis, but for a blog comment system I don't think it would work as well simply because visitors would have to learn it just for this one site.
Simon Willison - 17th September 2003 20:58 - #
Fredrik Lundh - 19th September 2003 12:11 - #