37 posts tagged “regular-expressions”
2026
If it's good enough for antirez to add to Redis I figured Ville Laurikari's TRE regular expression engine was worth exploring in a little more detail.
I had Claude Code build an experimental Python binding (it used ctypes) and try some malicious regular expression attacks against the library. TRE handles those much better than Python's standard library implementation, thanks mainly to the lack of support for backtracking.
Salvatore Sanfilippo submitted a PR adding a new data type - arrays - to Redis.
The new commands are ARCOUNT, ARDEL, ARDELRANGE, ARGET, ARGETRANGE, ARGREP, ARINFO, ARINSERT, ARLASTITEMS, ARLEN, ARMGET, ARMSET, ARNEXT, AROP, ARRING, ARSCAN, ARSEEK, ARSET.
The implementation is currently available in a branch, so I had Claude Code for web build this interactive playground for trying out the new commands in a WASM-compiled build of a subset of Redis running in the browser.

The most interesting new command is ARGREP which can run a server-side grep against a range of values in the array using the newly vendored TRE regex library.
Salvatore wrote more about the AI-assisted development process for the array type in Redis array type: short story of a long development.
2025
MicroQuickJS. New project from programming legend Fabrice Bellard, of ffmpeg and QEMU and QuickJS and so much more fame:
MicroQuickJS (aka. MQuickJS) is a Javascript engine targetted at embedded systems. It compiles and runs Javascript programs with as low as 10 kB of RAM. The whole engine requires about 100 kB of ROM (ARM Thumb-2 code) including the C library. The speed is comparable to QuickJS.
It supports a subset of full JavaScript, though it looks like a rich and full-featured subset to me.
One of my ongoing interests is sandboxing: mechanisms for executing untrusted code - from end users or generated by LLMs - in an environment that restricts memory usage and applies a strict time limit and restricts file or network access. Could MicroQuickJS be useful in that context?
I fired up Claude Code for web (on my iPhone) and kicked off an asynchronous research project to see explore that question:
My full prompt is here. It started like this:
Clone https://github.com/bellard/mquickjs to /tmp
Investigate this code as the basis for a safe sandboxing environment for running untrusted code such that it cannot exhaust memory or CPU or access files or the network
First try building python bindings for this using FFI - write a script that builds these by checking out the code to /tmp and building against that, to avoid copying the C code in this repo permanently. Write and execute tests with pytest to exercise it as a sandbox
Then build a "real" Python extension not using FFI and experiment with that
Then try compiling the C to WebAssembly and exercising it via both node.js and Deno, with a similar suite of tests [...]
I later added to the interactive session:
Does it have a regex engine that might allow a resource exhaustion attack from an expensive regex?
(The answer was no - the regex engine calls the interrupt handler even during pathological expression backtracking, meaning that any configured time limit should still hold.)
Here's the full transcript and the final report.
Some key observations:
- MicroQuickJS is very well suited to the sandbox problem. It has robust near and time limits baked in, it doesn't expose any dangerous primitive like filesystem of network access and even has a regular expression engine that protects against exhaustion attacks (provided you configure a time limit).
- Claude span up and tested a Python library that calls a MicroQuickJS shared library (involving a little bit of extra C), a compiled a Python binding and a library that uses the original MicroQuickJS CLI tool. All of those approaches work well.
- Compiling to WebAssembly was a little harder. It got a version working in Node.js and Deno and Pyodide, but the Python libraries wasmer and wasmtime proved harder, apparently because "mquickjs uses setjmp/longjmp for error handling". It managed to get to a working wasmtime version with a gross hack.
I'm really excited about this. MicroQuickJS is tiny, full featured, looks robust and comes from excellent pedigree. I think this makes for a very solid new entrant in the quest for a robust sandbox.
Update: I had Claude Code build tools.simonwillison.net/microquickjs, an interactive web playground for trying out the WebAssembly build of MicroQuickJS, adapted from my previous QuickJS plaground. My QuickJS page loads 2.28 MB (675 KB transferred). The MicroQuickJS one loads 303 KB (120 KB transferred).
Here are the prompts I used for that.
New dashboard: alt text for all my images. I got curious today about how I'd been using alt text for images on my blog, and realized that since I have Django SQL Dashboard running on this site and PostgreSQL is capable of parsing HTML with regular expressions I could probably find out using a SQL query.
I pasted my PostgreSQL schema into Claude and gave it a pretty long prompt:
Give this PostgreSQL schema I want a query that returns all of my images and their alt text. Images are sometimes stored as HTML image tags and other times stored in markdown.
blog_quotation.quotation,blog_note.bodyboth contain markdown.blog_blogmark.commentaryhas markdown ifuse_markdownis true or HTML otherwise.blog_entry.bodyis always HTMLWrite me a SQL query to extract all of my images and their alt tags using regular expressions. In HTML documents it should look for either
<img .* src="..." .* alt="..."or<img alt="..." .* src="..."(images may be self-closing XHTML style in some places). In Markdown they will always beI want the resulting table to have three columns: URL, alt_text, src - the URL column needs to be constructed as e.g.
/2025/Feb/2/slugfor a record where created is on 2nd feb 2025 and theslugcolumn containsslugUse CTEs and unions where appropriate
It almost got it right on the first go, and with a couple of follow-up prompts I had the query I wanted. I also added the option to search my alt text / image URLs, which has already helped me hunt down and fix a few old images on expired domain names. Here's a copy of the finished 100 line SQL query.
tc39/proposal-regex-escaping. I just heard from Kris Kowal that this proposal for ECMAScript has been approved for ECMA TC-39:
Almost 20 years later, @simon’s RegExp.escape idea comes to fruition. This reached “Stage 4” at ECMA TC-39 just now, which formalizes that multiple browsers have shipped the feature and it’s in the next revision of the JavaScript specification.
I'll be honest, I had completely forgotten about my 2006 blog entry Escaping regular expression characters in JavaScript where I proposed that JavaScript should have an equivalent of the Python re.escape() function.
It turns out my post was referenced in this 15 year old thread on the esdiscuss mailing list, which evolved over time into a proposal which turned into implementations in Safari, Firefox and soon Chrome - here's the commit landing it in v8 on February 12th 2025.
One of the best things about having a long-running blog is that sometimes posts you forgot about over a decade ago turn out to have a life of their own.
2023
2022
Why I invented “dash encoding”, a new encoding scheme for URL paths
Datasette now includes its own custom string encoding scheme, which I’ve called dash encoding. I really didn’t want to have to invent something new here, but unfortunately I think this is the best solution to my very particular problem. Some notes on how dash encoding works and why I created it.
[... 1,392 words]2021
2020
datasette-ripgrep: deploy a regular expression search engine for your source code
This week I built datasette-ripgrep—a web application for running regular expression searches against source code, built on top of the amazing ripgrep command-line tool.
[... 1,362 words]The unexpected Google wide domain check bypass (via) Fantastic story of discovering a devious security vulnerability in a bunch of Google products stemming from a single exploitable regular expression in the Google closure JavaScript library.
2019
Weeknotes: ONA19, twitter-to-sqlite, datasette-rure
I’ve decided to start writing weeknotes for the duration of my JSK fellowship. Here goes!
[... 919 words]Details of the Cloudflare outage on July 2, 2019 (via) Best retrospective I’ve read in a long time. The outage was caused by a backtracking regex rule that was added to the Web Application Firewall project, which rolls out globally and skips most of Cloudflare’s regular graduar rollout process (delightfully animal themed, named DOG for the dogfooding PoP that their employees use, PIG for the Guinea Pig PoPs reserved for free customers, then Canary for the final step) so that they can deploy counter-measures to newly discovered vulnerabilities as quickly as possible—but the real value in the retro is that it provides an extremely deep insight into how Cloudflare organize, test and manage their changes. Really interesting stuff.
2018
r1chardj0n3s/parse: Parse strings using a specification based on the Python format() syntax. (via) Really neat API design: parse() behaves almost exactly in the opposite way to Python’s built-in format(), so you can use format strings as an alternative to regular expressions for extracting specific data from a string.
2017
A Regular Expression Matcher: Code by Rob Pike, Exegesis by Brian Kernighan (via) Delightfully clear and succinct 30-line C implementation of a regular expression matcher that supports $, ^, . and * operations.
2012
What are the best resources for learning regular expressions?
The O’Reilly book on Regular Expressions is absolutely superb. It will help you build a much deeper understanding if how they actually work than any online tutorial I’ve seen.
[... 62 words]2010
Escaping regular expression characters in JavaScript (updated). The JavaScript regular expression meta-character escaping code I posted back in 2006 has some serious flaws—I’ve just posted an update to the original post.
Introduction to Surlex. A neat drop-in alternative for Django’s regular expression based URL parsing, providing simpler syntax for common path patterns.
RE2: a principled approach to regular expression matching. Google have open sourced RE2, the C++ regular expression library they developed for Google Code Search, Sawzall, Bigtable and other internal projects. Unlike PCRE it avoids the potential for exponential run time and unbounded stack usage and guarantees that searches complete in linear time, mainly by dropping support for back references.
2009
Request Routing With URI Templates in Node.JS. I quite like this approach (though the implementation is a bit “this” heavy for my taste). JavaScript has no equivalent to Python’s raw strings, so regular expression based routing ala Django ends up being a bit uglier in JavaScript. URI template syntax is more appealing.
Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide.
Django security updates released. A potential denial of service vulnerability has been discovered in the regular expressions used by Django form library’s EmailField and URLField—a malicious input could trigger a pathological performance. Patches (and patched releases) for Django 1.1 and Django 1.0 have been published.
Introducing Yardbird. I absolutely love it—an IRC bot built on top of Twisted that passes incoming messages off to Django code running in a separate thread. Requests and Response objects are used to represent incoming and outgoing messages, and Django’s regex-based URL routing is used to dispatch messages to different handling functions based on their content.
2008
Python gems of my own (via) Did you know you can pass 128 as a flag to Python’s re.compile() function to spit out a parse tree? I didn’t. re.compile(“pattern”, 128)
2007
Primality regex. A regular expression that can identify prime numbers. Unsurprisingly, this one comes from the Perl community.
2006
Wrapping Text With Regular Expressions. Neat regexp trick.
Escaping regular expression characters in JavaScript
JavaScript’s support for regular expressions is generally pretty good, but there is one notable omission: an escaping mechanism for literal strings. Say for example you need to create a regular expression that removes a specific string from the end of a string. If you know the string you want to remove when you write the script this is easy:
[... 519 words]2005
Lexical Analysis, Python-style (via) Clever trick using named groups in regular expressions.
2004
Unicode Regular Expressions. [a \U00010450] Match “a” or U+10450 SHAVIAN LETTER PEEP
2003
Brief Guide to Regular Expressions (via) Regular expressions for everyone else
Capturing the power of re.split
A couple of Python tips. The first is really a tip for Mozilla/Firebird: You can set up a Custom Keyword for instantly accessing Python module documentation using the string www.python.org/doc/current/lib/module-%s.html—I have this set up as pydoc, so I can type pydoc re to jump straight to the re module documentation. I only set it up half an hour ago and I’ve already used it about a dozen times.
[... 478 words]
