Simon Willison’s Weblog

Subscribe

920 items tagged “python”

2024

Ruff v0.4.0: a hand-written recursive descent parser for Python. The latest release of Ruff—a Python linter and formatter, written in Rust—includes a complete rewrite of the core parser. Previously Ruff used a parser borrowed from RustPython, generated using the LALRPOP parser generator. Victor Hugo Gomes contributed a new parser written from scratch, which provided a 2x speedup and also added error recovery, allowing parsing of invalid Python—super-useful for a linter.

I tried Ruff 0.4.0 just now against Datasette—a reasonably large Python project—and it ran in less than 1/10th of a second. This thing is Fast. # 19th April 2024, 5 am

mistralai/mistral-common. New from Mistral: mistral-common, an open source Python library providing “a set of tools to help you work with Mistral models”.

So far that means a tokenizer! This is similar to OpenAI’s tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use—useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.

Mistral’s library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI’s APIs you’re currently left guessing how many tokens are taken up by these advanced features.

Anthropic haven’t published any form of tokenizer at all—it’s the feature I’d most like to see from them next.

Here’s how to explore the vocabulary of the tokenizer:

MistralTokenizer.from_model(
“open-mixtral-8x22b”
).instruct_tokenizer.tokenizer.vocab()[:12]

[’<unk>’, ’<s>’, ’</s>’, ’[INST]’, ’[/INST]’, ’[TOOL_CALLS]’, ’[AVAILABLE_TOOLS]’, ’[/AVAILABLE_TOOLS]’, ’[TOOL_RESULTS]’, ’[/TOOL_RESULTS]’] # 18th April 2024, 12:39 am

inline-snapshot. I’m a big fan of snapshot testing, where expected values are captured the first time a test suite runs and then asserted against in future runs. It’s a very productive way to build a robust test suite.

inline-snapshot by Frank Hoffmann is a particularly neat implementation of the pattern. It defines a snapshot() function which you can use in your tests:

assert 1548 * 18489 == snapshot()

When you run that test using “pytest --inline-snapshot=create” the snapshot() function will be replaced in your code (using AST manipulation) with itself wrapping the repr() of the expected result:

assert 1548 * 18489 == snapshot(28620972)

If you modify the code and need to update the tests you can run “pytest --inline-snapshot=fix” to regenerate the recorded snapshot values. # 16th April 2024, 4:04 pm

Bringing Python to Workers using Pyodide and WebAssembly (via) Cloudflare Workers is Cloudflare’s serverless hosting tool for deploying server-side functions to edge locations in their CDN.

They just released Python support, accompanied by an extremely thorough technical explanation of how they got that to work. The details are fascinating.

Workers runs on V8 isolates, and the new Python support was implemented using Pyodide (CPython compiled to WebAssembly) running inside V8.

Getting this to work performantly and ergonomically took a huge amount of work.

There are too many details in here to effectively summarize, but my favorite detail is this one:

“We scan the Worker’s code for import statements, execute them, and then take a snapshot of the Worker’s WebAssembly linear memory. Effectively, we perform the expensive work of importing packages at deploy time, rather than at runtime.” # 2nd April 2024, 4:09 pm

PEP 738 – Adding Android as a supported platform (via) The BeeWare project got PEP 730—Adding iOS as a supported platform—accepted by the Python Steering Council in December, now it’s Android’s turn. Both iOS and Android will be supported platforms for CPython 3.13.

It’s been possible to run custom compiled Python builds on those platforms for years, but official support means that they’ll be included in Python’s own CI and release process. # 1st April 2024, 11:57 pm

Reviving PyMiniRacer (via) PyMiniRacer is “a V8 bridge in Python”—it’s a library that lets Python code execute JavaScript code in a V8 isolate and pass values back and forth (provided they serialize to JSON) between the two environments.

It was originally released in 2016 by Sqreen, a web app security startup startup. They were acquired by Datadog in 2021 and the project lost its corporate sponsor, but in this post Ben Creech announces that he is revitalizing the project, with the approval of the original maintainers.

I’m always interested in new options for running untrusted code in a safe sandbox. PyMiniRacer has the three features I care most about: code can’t access the filesystem or network by default, you can limit the RAM available to it and you can have it raise an error if code execution exceeds a time limit.

The documentation includes a newly written architecture overview which is well worth a read. Rather than embed V8 directly in Python the authors chose to use ctypes—they build their own V8 with a thin additional C++ layer to expose a ctypes-friendly API, then the Python library code uses ctypes to call that.

I really like this. V8 is a notoriously fast moving and complex dependency, so reducing the interface to just a thin C++ wrapper via ctypes feels very sensible to me.

This blog post is fun too: it’s a good, detailed description of the process to update something like this to use modern Python and modern CI practices. The steps taken to build V8 (6.6 GB of miscellaneous source and assets!) across multiple architectures in order to create binary wheels are particularly impressive—the Linux aarch64 build takes several days to run on GitHub Actions runners (via emulation), so they use Mozilla’s Sccache to cache compilation steps so they can retry until it finally finishes.

On macOS (Apple Silicon) installing the package with “pip install mini-racer” got me a 37MB dylib and a 17KB ctypes wrapper module. # 24th March 2024, 5 pm

shelmet (via) This looks like a pleasant ergonomic alternative to Python’s subprocess module, plus a whole bunch of other useful utilities. Lets you do things like this:

sh.cmd(“ps”, “aux”).pipe(“grep”, “-i”, check=False).run(“search term”)

I like the way it uses context managers as well: ’with sh.environ({“KEY1”: “val1”})’ sets new environment variables for the duration of the block, ’with sh.cd(“path/to/dir”)’ temporarily changes the working directory and ’with sh.atomicfile(“file.txt”) as fp’ lets you write to a temporary file that will be atomically renamed when the block finishes. # 24th March 2024, 4:37 am

time-machine example test for a segfault in Python (via) Here’s a really neat testing trick by Adam Johnson. Someone reported a segfault bug in his time-machine library. How you you write a unit test that exercises a segfault without crashing the entire test suite?

Adam’s solution is a test that does this:

subprocess.run([sys.executable, “-c”, code_that_crashes_python], check=True)

sys.executable is the path to the current Python executable—ensuring the code will run in the same virtual environment as the test suite itself. The -c option can be used to have it run a (multi-line) string of Python code, and check=True causes the subprocess.run() function to raise an error if the subprocess fails to execute cleanly and returns an error code.

I’m absolutely going to be borrowing this pattern next time I need to add tests to cover a crashing bug in one of my projects. # 23rd March 2024, 7:44 pm

Talking about Django’s history and future on Django Chat (via) Django co-creator Jacob Kaplan-Moss sat down with the Django Chat podcast team to talk about Django’s history, his recent return to the Django Software Foundation board and what he hopes to achieve there.

Here’s his post about it, where he used Whisper and Claude to extract some of his own highlights from the conversation. # 21st March 2024, 12:42 am

Every dunder method in Python. Trey Hunner: “Python includes 103 ’normal’ dunder methods, 12 library-specific dunder methods, and at least 52 other dunder attributes of various types.”

This cheat sheet doubles as a tour of many of the more obscure corners of the Python language and standard library.

I did not know that Python has over 100 dunder methods now! Quite a few of these were new to me, like __class_getitem__ which can be used to implement type annotations such as list[int]. # 20th March 2024, 3:45 am

DiskCache (via) Grant Jenks built DiskCache as an alternative caching backend for Django (also usable without Django), using a SQLite database on disk. The performance numbers are impressive—it even beats memcached in microbenchmarks, due to avoiding the need to access the network.

The source code (particularly in core.py) is a great case-study in SQLite performance optimization, after five years of iteration on making it all run as fast as possible. # 19th March 2024, 3:43 pm

pywebview 5 (via) pywebview is a library for building desktop (and now Android) applications using Python, based on the idea of displaying windows that use the system default browser to display an interface to the user—styled such that the fact they run on HTML, CSS and JavaScript is mostly hidden from the end-user.

It’s a bit like a much simpler version of Electron. Unlike Electron it doesn’t bundle a full browser engine (Electron bundles Chromium), which reduces the size of the dependency a lot but does mean that cross-browser differences (quite rare these days) do come back into play.

I tried out their getting started example and it’s very pleasant to use—import webview, create a window and then start the application loop running to display it.

You can register JavaScript functions that call back to Python, and you can execute JavaScript in a window from your Python code. # 13th March 2024, 2:15 pm

gh-116167: Allow disabling the GIL with PYTHON_GIL=0 or -X gil=0. Merged into python:main 14 hours ago. Looks like the first phase of Sam Gross’s phenomenal effort to provide a GIL free Python (here via an explicit opt-in) will ship in Python 3.13. # 12th March 2024, 5:40 am

llm-claude-3. I built a new plugin for LLM—my command-line tool and Python library for interacting with Large Language Models—which adds support for the new Claude 3 models from Anthropic. # 4th March 2024, 6:46 pm

The Zen of Python, Unix, and LLMs. Here’s the YouTube recording of my 1.5 hour conversation with Hugo Bowne-Anderson yesterday.

I fed a Whisper transcript to Google Gemini Pro 1.5 and asked it for the themes from our conversation, and it said we talked about “Python’s success and versatility, the rise and potential of LLMs, data sharing and ethics in the age of LLMs, Unix philosophy and its influence on software development and the future of programming and human-computer interaction”. # 29th February 2024, 9:04 pm

aiolimiter. I found myself wanting an asyncio rate limiter for Python today—so I could send POSTs to an API endpoint no more than once every 10 seconds. This library worked out really well—it has a very neat design and lets you set up rate limits for things like “no more than 50 items every 10 seconds”, implemented using the leaky bucket algorithm. # 20th February 2024, 1:15 am

datasette-studio. I’ve been thinking for a while that it might be interesting to have a version of Datasette that comes bundled with a set of useful plugins, aimed at expanding Datasette’s default functionality to cover things like importing data and editing schemas.

This morning I built the very first experimental preview of what that could look like. Install it using pipx:

pipx install datasette-studio

I recommend pipx because it will ensure datasette-studio gets its own isolated environment, independent of any other Datasette installations you might have.

Now running “datasette-studio” instead of “datasette” will get you the version with the bundled plugins.

The implementation of this is fun—it’s a single pyproject.toml file defining the dependencies and setting up the datasette-studio CLI hook, which is enough to provide the full set of functionality.

Is this a good idea? I don’t know yet, but it’s certainly an interesting initial experiment. # 18th February 2024, 8:38 pm

wddbfs – Mount a sqlite database as a filesystem. Ingenious hack from Adam Obeng. Install this Python tool and run it against a SQLite database:

wddbfs --anonymous --db-path path/to/content.db

Then tell the macOS Finder to connect to Go -> Connect to Server -> http://127.0.0.1:8080/ (connect as guest)—connecting via WebDAV.

/Volumes/127.0.0.1/content.db will now be a folder full of CSV, TSV, JSON and JSONL files—one of each format for every table.

This means you can open data from SQLite directly in any application that supports that format, and you can even run CLI commands such as grep, ripgrep or jq directly against the data!

Adam used WebDAV because “Despite how clunky it is, this seems to be the best way to implement a filesystem given that getting FUSE support is not straightforward”. What a neat trick. # 18th February 2024, 3:31 am

uv: Python packaging in Rust (via) “uv is an extremely fast Python package installer and resolver, written in Rust, and designed as a drop-in replacement for pip and pip-tools workflows.”

From Charlie Marsh and Astral, the team behind Ruff, who describe it as a milestone in their pursuit of a “Cargo for Python”.

Also in this announcement: Astral are taking over stewardship of Armin Ronacher’s Rye packaging tool, another Rust project.

uv is reported to be 8-10x faster than regular pip, increasing to 80-115x faster with a warm global module cache thanks to copy-on-write and hard links on supported filesystems—which saves on disk space too.

It also has a --resolution=lowest option for installing the lowest available version of dependencies—extremely useful for testing, I’ve been wanting this for my own projects for a while.

Also included: “uv venv”—a fast tool for creating new virtual environments with no dependency on Python itself. # 15th February 2024, 7:57 pm

Python Development on macOS Notes: pyenv and pyenv-virtualenvwrapper (via) Jeff Triplett shares the recipe he uses for working with pyenv (initially installed via Homebrew) on macOS.

I really need to start habitually using this. The benefit of pyenv over Homebrew’s default Python is that pyenv managed Python versions are forever—your projects won’t suddenly stop working in the future when Homebrew changes its default Python version. # 11th February 2024, 4:41 am

Rye: Added support for marking virtualenvs ignored for cloud sync (via) A neat feature in the new Rye 0.22.0 release. It works by using an xattr Rust crate to set the attributes “com.dropbox.ignored” and “com.apple.fileprovider.ignore#P” on the folder. # 10th February 2024, 6:50 am

Rye lets you get from no Python on a computer to a fully functioning Python project in under a minute with linting, formatting and everything in place.

[...] Because it was demonstrably designed to avoid interference with any pre-existing Python configurations, Rye allows for a smooth and gradual integration and the emotional barrier of picking it up even for people who use other tools was shown to be low.

Armin Ronacher # 4th February 2024, 3:12 pm

unstructured. Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.

I got some good initial results against a PDF by running “pip install ’unstructured[pdf]’” and then using the “unstructured.partition.pdf.partition_pdf(filename)” function.

There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box. # 2nd February 2024, 2:47 am

snoop. Neat Python debugging utility by Alex Hall: snoop lets you “import snoop” and then add “@snoop” as a decorator to any function, which causes that function’s source code to be output directly to the console with details of any variable state changes that occur while it’s running.

I didn’t know you could make a Python module callable like that—turns out it’s running “sys.modules[’snoop’] = snoop” in the __init__.py module! # 31st January 2024, 6:19 pm

Beej’s Guide to Networking Concepts (via) Beej’s Guide to Network Programming is a legendary tutorial on network programming in C, continually authored and updated by Brian “Beej” Hall since 1995.

This is NOT that. Beej’s Guide to Networking Concepts is brand new—started in March 2023—and illustrates a whole bunch of networking concepts using Python instead of C.

From the forward: “Is it Beej’s Guide to Network Programming in Python? Well, kinda, actually. The C book is more about how C’s (well, Unix’s) network API works. And this book is more about the concepts underlying it, using Python as a vehicle.” # 30th January 2024, 10:08 pm

urllib3 2.2.0. Highlighted feature: “urllib3 now works in the browser”—the core urllib3 library now includes code that can integrate with Pyodide, using the browser’s fetch() or XMLHttpRequest APIs to make HTTP requests (to CORS-enabled endpoints). # 30th January 2024, 4:31 pm

Getting Started With CUDA for Python Programmers (via) if, like me, you’ve avoided CUDA programming (writing efficient code that runs on NVIGIA GPUs) in the past, Jeremy Howard has a new 1hr17m video tutorial that demystifies the basics. The code is all run using PyTorch in notebooks running on Google Colab, and it starts with a very clear demonstration of how to convert a RGB image to black and white. # 29th January 2024, 9:23 pm

Find a level of abstraction that works for what you need to do. When you have trouble there, look beneath that abstraction. You won’t be seeing how things really work, you’ll be seeing a lower-level abstraction that could be helpful. Sometimes what you need will be an abstraction one level up. Is your Python loop too slow? Perhaps you need a C loop. Or perhaps you need numpy array operations.

You (probably) don’t need to learn C.

Ned Batchelder # 24th January 2024, 6:25 pm

Python packaging must be getting better—a datapoint (via) Luke Plant reports on a recent project he developed on Linux using a requirements.txt file and some complex binary dependencies—Qt5 and VTK—and when he tried to run it on Windows... it worked! No modifications required.

I think Python’s packaging system has never been more effective... provided you know how to use it. The learning curve is still too high, which I think accounts for the bulk of complaints about it today. # 22nd January 2024, 6:06 pm

Publish Python packages to PyPI with a python-lib cookiecutter template and GitHub Actions

I use cookiecutter to start almost all of my Python projects. It helps me quickly generate a skeleton of a project with my preferred directory structure and configured tools.

[... 686 words]