Simon Willison’s Weblog


Thursday, 11th September 2025

In Python 3.14, I have implemented several changes to fix thread safety of asyncio and enable it to scale effectively on the free-threaded build of CPython. It is now implemented using lock-free data structures and per-thread state, allowing for highly efficient task management and execution across multiple threads. In the general case of multiple event loops running in parallel, there is no lock contention and performance scales linearly with the number of threads. [...]

For a deeper dive into the implementation, check out the internal docs for asyncio.

Kumar Aditya, Scaling asyncio on Free-Threaded Python
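To illustrate the pattern being described, here's a minimal sketch (mine, not from Kumar's post) of the "multiple event loops running in parallel" case: one asyncio.run() call per thread, which is the shape of workload the quoted work lets scale linearly on the free-threaded build.

```python
import asyncio
import threading

async def worker(name: str) -> None:
    # Each loop manages its own tasks with per-thread state; on the
    # free-threaded build these loops don't contend on a shared lock.
    await asyncio.sleep(0.1)
    print(f"{name} done on {threading.current_thread().name}")

def run_loop(name: str) -> None:
    # One event loop per thread: asyncio.run() creates a fresh loop
    # in whichever thread calls it.
    asyncio.run(worker(name))

threads = [
    threading.Thread(target=run_loop, args=(f"loop-{i}",)) for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```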

# 3:07 am / async, gil, python, scaling, threads

Defeating Nondeterminism in LLM Inference (via) A very common question I see about LLMs concerns why they can't be made to deliver the same response to the same prompt by setting a fixed random number seed.

Like many others I had been led to believe this was due to the non-associative nature of floating point arithmetic, where (a + b) + c ≠ a + (b + c), combined with unpredictable calculation orders on concurrent GPUs. This new paper calls that the "concurrency + floating point hypothesis":

One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism.

It then convincingly argues that this is not the core of the problem, because "in the typical forward pass of an LLM, there is usually not a single atomic add present."
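The non-associativity part is real, and trivially easy to demonstrate in Python:

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)            # 0.6000000000000001
print(a + (b + c))            # 0.6
print((a + b) + c == a + (b + c))  # False
```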

Why are LLMs so often non-deterministic then?

[...] the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.

The thinking-machines-lab/batch_invariant_ops code that accompanies this paper addresses this by providing a PyTorch implementation of batch-invariant kernels and demonstrates them running Qwen3-8B deterministically under vLLM.
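Here's a minimal sketch of the batch-size effect (my own illustration, not code from the paper): the same row of a matrix multiplication can come out as slightly different floats depending on how large a batch it was computed as part of, because different batch sizes can trigger different kernels and reduction orders. On some hardware, especially CPU, the two results may match exactly.

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

batched = (a @ b)[:1]  # first row, computed as part of a 2048-row batch
alone = a[:1] @ b      # the same row, computed in a batch of one
print((batched - alone).abs().max().item())  # often non-zero on GPUs
```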

This paper is the first public output from Thinking Machines, the AI Lab founded in February 2025 by Mira Murati, OpenAI's former CTO (and interim CEO for a few days). It's unrelated to Thinking Machines Corporation, the last employer of Richard Feynman (as described in this most excellent story by Danny Hillis).

# 6:53 am / ai, pytorch, generative-ai, llms, qwen
