Simon Willison’s Weblog

759 items tagged “python”


Why you shouldn’t invoke directly (via) Paul Ganssle explains why you shouldn’t use “python command” any more. I’ve mostly switched to pip and pytest and twine but I was still using “python sdist”—apparently the new replacement recipe for that is “python -m build”. # 19th October 2021, 5:22 pm

Where does all the effort go? Looking at Python core developer activity (via) Łukasz Langa used Datasette to explore 28,780 pull requests made to the CPython GitHub repository, using some custom Python scripts (and sqlite-utils) to load in the data. # 18th October 2021, 8:21 pm

Tests aren’t enough: Case study after adding type hints to urllib3. Very thorough write-up by Seth Michael Larson describing what it took for the urllib3 Python library to fully embrace mypy and optional typing and what they learned along the way. # 18th October 2021, 7:03 pm

Web Browser Engineering (via) In progress free online book by Pavel Panchekha and Chris Harrelson that demonstrates how a web browser works by writing one from scratch using Python, tkinter and the DukPy wrapper around the Duktape JavaScript interpreter. # 17th October 2021, 3:53 pm

Finding and reporting an asyncio bug in Python 3.10

I found a bug in Python 3.10 today! Some notes on how I found it and my process for handling it once I figured out what was going on.

[... 1789 words]

The GIL and its effects on Python multithreading (via) Victor Skvortsov presents the most in-depth explanation of the Python Global Interpreter Lock I’ve seen anywhere. I learned a ton from reading this. # 29th September 2021, 5:23 pm

django-upgrade (via) Adam Johnson’s new CLI tool for upgrading Django projects by automatically applying changes to counter deprecations made in different versions of the framework. Uses the Python standard library tokenize module which gives it really quick performance in parsing and rewriting Python code. Exciting to see this kind of codemod approach becoming more common in Python world—JavaScript developers use this kind of thing a lot. # 26th September 2021, 5:42 am

SQLModel. A new project by FastAPI creator Sebastián Ramírez: SQLModel builds on top of both SQLAlchemy and Sebastián’s Pydantic validation library to provide a new ORM that’s designed around Python 3’s optional typing. The real brilliance here is that a SQLModel subclass is simultaneously a valid SQLAlchemy ORM model AND a valid Pydantic validation model, saving on duplicate code by allowing the same class to be used both for form/API validation and for interacting with the database. # 24th August 2021, 11:16 pm

sqlite-utils API reference (via) I released sqlite-utils 3.15.1 today with just one change, but it’s a big one: I’ve added docstrings and type annotations to nearly every method in the library, and I’ve started using sphinx-autodoc to generate an API reference page in the documentation directly from those docstrings. I’ve deliberately avoided building this kind of documentation in the past because I so often see projects where the class reference is the ONLY documentation, which I find makes it really hard to figure out how to actually use it. sqlite-utils already has extensive narrative prose documentation so in this case I think it’s a useful enhancement—especially since the docstrings and type hints can help improve the usability of the library in IDEs and Jupyter notebooks. # 11th August 2021, 1:03 am

How the Python import system works (via) Remarkably detailed and thorough dissection of how exactly import, modules and packages work in Python—eventually digging right down into the C code. Part of Victor Skvortsov’s excellent “Python behind the scenes” series. # 24th July 2021, 8:12 pm

I’ve always believed that a book, even a technical book, should try to tell a cohesive story. The challenge is that as Python has grown in popularity, it has really turned into three different languages--each with their own story. There is a whimsical Python for scripting and tinkering, a quirky Python for asynchronous programming and networking, and a serious Python for enterprise applications. Sometimes these stories intersect. Sometimes not.

David Beazley # 18th July 2021, 2:53 pm

Group thousands of similar spreadsheet text cells in seconds (via) Luke Whyte explains how to efficiently group similar text columns in a table (Walmart and Wal-mart for example) using a clever combination of TF/IDF, sparse matrices and cosine similarity. Includes the clearest explanation of cosine similarity for text I’ve seen—and Luke wrote a Python library, textpack, that implements the described pattern. # 27th June 2021, 4:24 pm

Querying Parquet using DuckDB (via) DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files—e.g. “SELECT COUNT(*) FROM ’taxi_2019_04.parquet’”. Performance against large files is fantastic, and the whole thing can be installed just using “pip install duckdb”. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this? # 25th June 2021, 10:40 pm

Powering the Python Package Index in 2021. PyPI now serves “nearly 900 terabytes over more than 2 billion requests per day”. Bandwidth is donated by Fastly, a value estimated at 1.8 million dollars per month! Lots more detail about how PyPI has evolved over the past years in this post by Dustin Ingram. # 14th May 2021, 4:50 am

Async functions require an event loop to run. Flask, as a WSGI application, uses one worker to handle one request/response cycle. When a request comes in to an async view, Flask will start an event loop in a thread, run the view function there, then return the result. Each request still ties up one worker, even for async views. The upside is that you can run async code within a view, for example to make multiple concurrent database queries, HTTP requests to an external API, etc. However, the number of requests your application can handle at one time will remain the same.

Using async and await in Flask 2.0 # 12th May 2021, 5:59 pm

New Major Versions Released! Flask 2.0, Werkzeug 2.0, Jinja 3.0, Click 8.0, ItsDangerous 2.0, and MarkupSafe 2.0. Huge set of releases from the Pallets team. Python 3.6+ required and comprehensive type annotations. Flask now supports async views, Jinja async templates (used extensively by Datasette) “no longer requires patching”, Click has a bunch of new code around shell tab completion, ItsDangerous supports key rotation and so much more. # 12th May 2021, 5:37 pm

cinder: Instagram’s performance oriented fork of CPython (via) Instagram forked CPython to add some performance-oriented features they wanted, including a method-at-a-time JIT compiler and a mechanism for eagerly evaluating coroutines (avoiding the overhead of creating a coroutine if an awaited function returns a value without itself needing to await). They’re open sourcing the code to help start conversations about implementing some of these features in CPython itself. I particularly enjoyed the warning that accompanies the repo: this is not intended to be a supported release, and if you decide to run it in production you are on your own! # 4th May 2021, 10:13 pm

Hello, HPy (via) HPy provides a new way to write C extensions for Python in a way that is compatible with multiple Python implementations at once, including PyPy. # 29th March 2021, 2:40 pm

Homebrew Python Is Not For You. If you’ve been running into frustrations with your Homebrew Python environments breaking over the past few months (the dreaded “Reason: image not found” error) Justin Mayer has a good explanation. Python in a Homebrew is designed to work as a dependency for their other packages, and recent policy changes that they made to support smoother upgrades have had catastrophic problems effects on those of us who try to use it for development environments. # 25th March 2021, 3:14 pm

When you have to mock a collaborator, avoid using the Mock object directly. Either use mock.create_autospec() or mock.patch(autospec=True) if at all possible. Autospeccing from the real collaborator means that if the collaborator’s interface changes, your tests will fail. Manually speccing or not speccing at all means that changes in the collaborator’s interface will not break your tests that use the collaborator: you could have 100% test coverage and your library would fall over when used!

Thea Flowers # 17th March 2021, 4:44 pm

sqlite-uuid (via) Another Python package that wraps a SQLite module written in C: this one provides access to UUID functions as SQLite functions. # 15th March 2021, 2:55 am

sqlite-spellfix (via) I really like this pattern: “pip install sqlite-spellfix” gets you a Python module which includes a compiled (on your system when pip install ran) copy of the SQLite spellfix1 module, plus a utility variable containing its path so you can easily load it into a SQLite connection. # 15th March 2021, 2:52 am

unasync (via) Today I started wondering out loud if one could write code that takes an asyncio Python library and transforms it into the synchronous equivalent by using some regular expressions to strip out the “await ...” keywords and suchlike. Turns out that can indeed work, and Ratan Kulshreshtha built it! unasync uses the standard library tokenize module to run some transformations against an async library and spit out the sync version automatically. I’m now considering using this for sqlite-utils. # 27th February 2021, 10:20 pm

Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies (via) Alex Birsan describes a new category of security vulnerability he discovered in the npm, pip and gem packaging ecosystems: if a company uses a private repository with internal package names, uploading a package with the same name to the public repository can often result in an attacker being able to execute their own code inside the networks of their target. Alex scored over $130,000 in bug bounties from this one, from a number of name-brand companies. Of particular note for Python developers: the --extra-index-url argument to pip will consult both public and private registries and install the package with the highest version number! # 10th February 2021, 8:42 pm

Weeknotes: datasette-export-notebook, PyInstaller packaged Datasette, CBSAs

What a terrible week. I’ve found it hard to concentrate on anything substantial. In a mostly futile attempt to distract myself from doomscrolling I’ve mainly been building some experimental output plugins, fiddling with PyInstaller and messing around with shapefiles.

[... 732 words]


datasette-ripgrep: deploy a regular expression search engine for your source code

This week I built datasette-ripgrep—a web application for running regular expression searches against source code, built on top of the amazing ripgrep command-line tool.

[... 1362 words]

Unravelling `not` in Python (via) Part of a series where Brett Cannon looks at how fundamental Python syntactic sugar works, including a clearly explained dive into the underlying op codes and C implementation. # 27th November 2020, 5:59 pm

Hunting for Malicious Packages on PyPI (via) Jordan Wright installed all 268,000 Python packages from PyPI in containers, and ran Sysdig to capture syscalls made during installation to see if any of them were making extra network calls or reading or writing from the filesystem. Absolutely brilliant piece of security engineering and research. # 14th November 2020, 4:48 am

selenium-wire. Really useful scraping tool: enhances the Python Selenium bindings to run against a proxy which then allows Python scraping code to look at captured requests—great for if a site you are working with triggers Ajax requests and you want to extract data from the raw JSON that came back. # 2nd November 2020, 6:58 pm

Inevitably we got round to talking about async. As much of an unneeded complication as it is for so many day-to-day use-cases, it’s important for Python because, if and when you do need the high throughput handling of these io-bound use-cases, you don’t want to have to switch language. The same for Django: most of what you’re doing has no need of async but you don’t want to have to change web framework just because you need a sprinkling of non-blocking IO.

Carlton Gibson # 27th September 2020, 3:09 pm