Simon Willison’s Weblog

Wednesday, 24th December 2025

Research Datasette-lite JavaScript Initialization: Problem Analysis & Solution Proposal — Datasette-lite faces a core limitation: HTML content injected via `innerHTML` does not execute embedded JavaScript, breaking interactive features and plugin functionality. The proposed solution introduces a standardized initialization event (`datasette_init`) triggered after each content update, allowing dependent scripts and plugins to reinitialize reliably.
Research Blog Tag Prediction with Scikit-Learn — To automatically assign meaningful tags to historic, untagged blog posts, this project uses the Simon Willison blog database and scikit-learn to train and compare multi-label text classification models. Four approaches (TF-IDF + Logistic Regression, Multinomial Naive Bayes, Random Forest, and LinearSVC) were tested on each post’s title and body text using the 158 most frequently used tags.
Research Can BeautifulSoup Use JustHTML as a Parser? — BeautifulSoup 4 can be integrated with JustHTML, a pure Python HTML5 parser, enabling full compliance with the HTML5 parsing algorithm according to the WHATWG specification. By implementing a custom `JustHTMLTreeBuilder`, BeautifulSoup’s parser plugin system can leverage JustHTML for parsing, allowing seamless use of BeautifulSoup’s familiar API and features—like `find_all()` and CSS selectors—while inheriting robust, standards-adherent HTML handling.
Research minijinja vs jinja2 Performance Benchmark — Benchmarking the Python bindings for minijinja (https://github.com/mitsuhiko/minijinja) against Jinja2 (https://palletsprojects.com/p/jinja/) on Python 3.14 and 3.14t measured template render performance using a realistic e-commerce template with inheritance, loops, and ~65KB HTML output. The suite runs 200 iterations per scenario, captures mean/median/std/min/max, and provides reproducible scripts (run_benchmark.sh, benchmark.py) plus matplotlib charts to visualize results.
Research AST-Grep Import Rewriter — Leveraging ast-grep and custom YAML rules, the AST-Grep Import Rewriter offers a structured approach to automatically extract, analyze, and rewrite obfuscated JavaScript import statements across ES6, CommonJS, dynamic imports, and webpack bundles. By parsing source files, it generates mapping templates and applies user-defined mappings, converting unreadable module paths into meaningful names with either regex- or AST-based transformations.
Research Epsilon Python Wrapper — Epsilon Python Wrapper provides seamless Python bindings to Epsilon, Google's pure Go WebAssembly 2.0 runtime, enabling efficient and dependency-free WASM execution within Python projects. The wrapper exposes a simple API for module instantiation, function calls (with type safety), memory operations, and export inspection, supporting advanced features like SIMD and resource limiting.
Research SQLite Ripgrep Function — SQLite Ripgrep Function enables fast code and text search inside SQLite queries by integrating the powerful ripgrep search tool as a custom SQL function. It offers both a pure Python implementation and a performant C extension, allowing users to search files within a configurable directory, restrict output with glob patterns (e.g., `*.py`), and enforce time limits to avoid runaway queries.
Research Apptron Analysis Report — Apptron is a browser-based cloud IDE that hosts a full x86 Linux environment using emulation and WebAssembly, delivering a seamless developer experience directly in the browser. By tightly integrating VS Code, a Linux terminal, and persistent cloud storage via Cloudflare R2, users are able to work on customizable environments without any local setup.
Research GitHub CLI API Proxy Investigation — Proxying GitHub CLI (`gh`) API traffic can be achieved through standard HTTP/HTTPS proxies or via a Unix domain socket, each suited to different use cases and levels of flexibility. The CLI, implemented in Go, natively supports proxy environment variables (`HTTPS_PROXY`, `HTTP_PROXY`, `NO_PROXY`), making integration with existing HTTP proxies seamless and requiring no changes to the CLI configuration.
Research Datasette 1.0a20 SQL Permissions System: Architecture Review — A comprehensive architecture review of Datasette's new SQL-based permissions system (introduced in v1.0a20) finds that transitioning from a callback-driven model to SQL query resolution greatly improves scalability for large deployments. The redesigned system efficiently checks access by evaluating compiled permission rules through internal catalog tables, substantially reducing processing overhead compared to the multiplicative N x M callback pattern.
Research Automatic JavaScript API Tagging for simonw/tools — To categorize the 155 HTML tools in simonw/tools by their JavaScript API usage, this project developed an automated pipeline combining Cheerio for HTML parsing and Acorn for JavaScript AST analysis. The pipeline filters out false positives from comments, strings, and non-code regions, tags over 60 Web APIs, and handles modern ES modules and edge script types.
Research Absurd-in-SQLite — Absurd-in-SQLite is a proof-of-concept implementation of durable execution on SQLite, inspired by Armin Ronacher's Absurd. It supports reliable, long-running workflows that can survive crashes and network failures.
Research SVG to PNG Conversion Methods in Python — Multiple Python-based approaches for converting SVG files to PNG were benchmarked using the tiger.svg image, evaluating file size, output quality, and ease of installation. Pure Python solutions like CairoSVG and svglib+reportlab offered simple pip-based installs with predictable PNGs, though svglib lacks alpha channel support. Wand (ImageMagick bindings) and ImageMagick CLI yielded the highest quality output (16-bit RGBA) at the cost of larger files and system-level dependencies.
Research OpenAI Codex CLI Sandbox Implementation Analysis — OpenAI Codex CLI's sandbox employs strong, platform-specific isolation to securely constrain the behavior of AI-driven code agents. On macOS, it uses Apple's Seatbelt sandbox with finely tuned dynamic policies, while on Linux, it combines Landlock for strict filesystem controls and seccomp for syscall-based network blocking—ensuring that agents can only write to user-approved directories and have no outgoing network by default.
Research h3o-python — h3o-python delivers efficient Python bindings for the h3o Rust library, enabling fast and convenient access to H3 geospatial indexing from Python. Utilizing PyO3 and packaged with maturin, it allows encoding geographic coordinates into 64-bit H3 cell indexes, decoding indexes, performing neighborhood queries, calculating great-circle distances, and retrieving surface area metrics—all without requiring a separate H3 installation.
Research Wazero Python Bindings — Wazero Python Bindings enable seamless integration of the wazero WebAssembly runtime—written in Go—with Python applications, delivering a zero-dependency solution for running WASM modules natively from Python. The project exposes a clean, Pythonic API for instantiating modules, calling exported WASM functions, and managing resources efficiently with context managers. Performance benchmarks demonstrate rapid execution and minimal overhead between Python and WASM.
Research cmarkgfm in Pyodide - ✅ WORKING! — By rewriting cmarkgfm's bindings from CFFI to the Python C API, the project successfully ported GitHub's cmark-gfm Markdown parser to Pyodide. The resulting wheel is fully functional, requires no further building, and supports all GitHub Flavored Markdown features with high performance, thanks to direct C code execution via WebAssembly.
Research env86 Analysis Report — env86 is a Go-based management tool that enables users to run x86 Linux virtual machines within browser contexts via the v86 WebAssembly emulator. By combining a native desktop application (embedding a browser), a robust CLI, and an integrated virtual networking stack, env86 provides an easily distributable and reproducible Linux environment that can boot instantly from snapshots, support host-guest communication, and mount host filesystems.
Research DeepSeek-OCR on NVIDIA GB10 (ARM64 + CUDA 13.0) — Successfully deployed DeepSeek-OCR on an NVIDIA GB10 (ARM64, sm_121) by upgrading to PyTorch 2.9.0+cu130 so CUDA 13.0 wheels could be used instead of building from source. The repo includes automated scripts (setup.sh, run_ocr.py) that load the 6.3GB safetensors model (~34s) and run GPU inference (~58s for a 3503×1668 image), producing annotated images, markdown/text outputs and bounding boxes with validated multi-column accuracy.
Research SQLite-utils Iterator Support Research — Enhancements to the sqlite-utils library now allow its `insert_all` and `upsert_all` methods to efficiently process Python iterators yielding lists, in addition to the original dict-based input. Detection of the iterator type is automatic and maintains full backward compatibility, streamlining bulk inserts from row-based data sources like CSV streams and reducing memory usage by avoiding dict construction.
Research Claude Code for Web Environment — Running Claude Code on the web offers developers a versatile coding sandbox on Ubuntu 24.04, leveraging a broad toolkit that includes Python 3.11, Node.js 22, Go, Rust, and more, alongside developer utilities (Git, Make) and database clients (SQLite, PostgreSQL).
Research Pyodide Simple Demo — A compact demo shows how to run Python scripts inside a WebAssembly sandbox from Node.js using Pyodide: after npm install, launching node server-simple.js executes example-simple.py and writes generated files to the output/ directory. The project demonstrates a minimal server-side integration pattern for Pyodide (https://pyodide.org/) under Node.js (https://nodejs.org/) and is aimed at quick experimentation with sandboxed Python execution.
Research Datasette Plugin Writer Skill — Covering every aspect of Datasette plugin development, this project creates a comprehensive skill set for authors—from bootstrapping with cookiecutter to deploying on GitHub and PyPI. It provides precise guides and working code samples for essential plugin hooks like custom SQL functions, authentication, custom views, and output formats.
Research SQLite Query Linter — The SQLite Query Linter is a lightweight Python library that wraps the standard `sqlite3` module to provide configurable linting and rule-based analysis of SQL queries before execution. Acting as a drop-in replacement, it helps catch common syntax errors and platform incompatibilities—such as invalid types in `CAST`, use of unsupported functions, `SELECT *`, missing `WHERE` clauses, and string quoting mistakes—helping developers avoid runtime errors and improve code quality.
Research Datasette Plugins Analysis - Complete Report — Datasette Plugins Analysis presents a systematic evaluation of 44 key plugins from the Datasette ecosystem, focusing on dependencies, permissions hooks, and release patterns as of October 2025. The study finds that 89% of these plugins rely on ALPHA versions of Datasette, with only 8 plugins having stable releases and just 5 supporting stable Datasette while using advanced hooks like `register_permissions()`.
Research mquickjs Sandbox Investigation — Exploring mquickjs, a highly minimal JavaScript engine, this project rigorously evaluates its suitability as a safe sandbox for running untrusted code. Various integration approaches are implemented, including Python FFI, C extensions, subprocess invocation, and WebAssembly runtimes—each tested for startup and execution performance, security isolation, and feature compatibility.
Research UV Run Flow Analysis: Deep Dive into `uv run myscript.py` — Running `uv run myscript.py` in a directory with a `pyproject.toml` launches a multi-phase workflow that automates Python script execution within an isolated, dependency-managed environment. uv scans for project metadata, resolves and validates interpreter and package requirements, manages virtual environments, locks dependencies with a TOML-based `uv.lock` file using the PubGrub algorithm, efficiently syncs the environment with parallel downloads and caching, and finally executes the desired…
Research Browser URL Size Limits Investigation — Major browser engines demonstrate significant differences in how they enforce URL length limits. Chromium sets a 2 MB cap at its inter-process communication boundary, rejecting longer URLs when crossing processes. Firefox relies on user-configurable preferences, employing a 1 MB "standard" limit but permitting up to 512 MB in absolute terms, with stricter limits (2,000 characters) for history and bookmarks.
Research LLM Pyodide OpenAI Plugin — Leveraging the LLM Python package and pyodide, this project successfully adapts LLM’s OpenAI model interface for direct use in browser environments by bypassing the standard openai library (which fails in browsers due to its httpx dependency) and instead using the browser-native fetch API for CORS-compliant API calls.
Research SVG to PNG Renderer — A lightweight SVG to PNG renderer has been developed using Python, leveraging the `xml.etree.ElementTree` and `Pillow` libraries to parse SVG XML data and convert it to raster PNG images. This minimal library supports a range of SVG elements, including paths, basic shapes, and containers, as well as attributes such as colors, styling, and transforms.
Research Streaming File Upload Prototype — Demonstrating efficient large file uploads, this prototype integrates the streaming-form-data library with a Starlette-based ASGI server to enable true streaming of multipart file data directly to disk, bypassing memory bottlenecks. It incrementally parses incoming form data and supports checksum calculation on-the-fly, handling multiple simultaneous file uploads via async workflows.
Research Self-Hosting Datasette Lite: Research Report — Datasette Lite, a browser-based SQLite explorer powered by Pyodide and WebAssembly, can be fully self-hosted and used offline by bundling all core files, required Python wheels, and optional sample databases locally instead of relying on external CDNs and PyPI hosts.
Research Litestream S3 Replication Experiments — Experiments in this project evaluate Litestream’s robustness when SQLite writes occur while Litestream is stopped and later restarted, with focus on replication to S3. Both the simple restart and the scenario where the WAL is checkpointed (truncated) while Litestream is offline confirm no data loss: Litestream either streams pending WAL changes upon restart or detects a database change and uploads a new full snapshot (“generation”).
Research Datasette Lite NPM Package Investigation — Converting Datasette Lite into a self-hostable NPM package enables seamless client-side data exploration using SQLite, CSV, JSON, and Parquet files directly in the browser, powered by Pyodide. The project removes analytics, adds a CLI server for local testing, and exposes all necessary static assets for easy deployment to platforms like GitHub Pages, Netlify, or Vercel.
Research H3 Library Performance Benchmark — A systematic performance benchmark was conducted on two prominent Python libraries implementing Uber's H3 geospatial indexing system: h3-py (official, C-based) and h3o-python (Rust-based). Results show h3o-python consistently outperforms h3-py on core operations, achieving over 2x speedup for coordinate conversions and up to 13x faster neighbor queries, while area calculations remain comparable.
Research Python Markdown Library Comparison: cmarkgfm vs Alternatives — Comparing seven prominent Python markdown libraries, cmarkgfm—bindings to GitHub’s C-based CommonMark/GFM parser—proved dramatically faster (10-50x) than pure Python options such as mistune, Python-Markdown, and marko. The benchmark, spanning small to large markdown documents, consistently found cmarkgfm excels in both speed and stability, making it ideal for high-volume or performance-critical applications.
Research SQLite Hierarchical Permission System — Proof of Concept — A proof-of-concept implements a fully SQLite-based hierarchical permission system that computes allowed database/table pairs by cascading rules across child (table), parent (database), and global levels with DENY-over-ALLOW semantics; it uses only plain SQL (CTEs + SQLite JSON functions) and is built on SQLite (https://sqlite.org).
Research Offline Notes Sync System — Building on offline-first principles, this notes sync system enables robust note creation and editing without active internet connectivity, using IndexedDB and service workers on the client side. It employs operation-based sync and vector clocks for fine-grained conflict detection and resolution, and features a three-way character-level merge algorithm inspired by Apple Notes.
Research Browser-Based Single File Bundler — Investigating the feasibility of Vite as a browser-based bundler, this project demonstrates that while Vite itself cannot operate directly in the browser due to its Node.js dependencies, client-side file bundling is achievable using alternative strategies. Three approaches were prototyped: a pure JavaScript "simple" bundler for inlining assets, an esbuild-wasm browser integration for ES module support, and full Vite bundling via StackBlitz WebContainers using vite-plugin-singlefile.
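
The hierarchical permission entry in the list above describes rules cascading across table, database, and global levels with DENY-over-ALLOW semantics. Here is a rough sketch of how that kind of resolution could look in plain SQL driven from Python. The `rules` and `tables` schemas, the `allowed_tables()` helper, and my reading of the semantics (the most specific level with any matching rule wins, and a deny beats an allow at that level) are all assumptions for illustration, not the proof of concept's actual schema or queries. It uses CTEs but skips the JSON functions the project mentions.

```python
import sqlite3

# Hypothetical schema: NULL in parent/child widens the scope of a rule.
SCHEMA = """
CREATE TABLE tables (database_name TEXT, table_name TEXT);
CREATE TABLE rules (
    actor  TEXT,
    parent TEXT,     -- database name, NULL for a global rule
    child  TEXT,     -- table name, NULL for a database-level or global rule
    allow  INTEGER   -- 1 = ALLOW, 0 = DENY
);
"""

# Assumed semantics: pick the deepest level that has any matching rule,
# then require that no rule at that level is a DENY.
ALLOWED_SQL = """
WITH matched AS (
    SELECT t.database_name, t.table_name, r.allow,
           CASE WHEN r.child IS NOT NULL THEN 2
                WHEN r.parent IS NOT NULL THEN 1
                ELSE 0 END AS depth
    FROM tables t
    JOIN rules r
      ON r.actor = :actor
     AND (r.parent IS NULL OR r.parent = t.database_name)
     AND (r.child IS NULL
          OR (r.parent IS NOT NULL AND r.child = t.table_name))
),
winning AS (
    SELECT database_name, table_name, MAX(depth) AS depth
    FROM matched
    GROUP BY database_name, table_name
)
SELECT m.database_name, m.table_name
FROM matched m
JOIN winning w
  ON w.database_name = m.database_name
 AND w.table_name = m.table_name
 AND w.depth = m.depth
GROUP BY m.database_name, m.table_name
HAVING MIN(m.allow) = 1
ORDER BY m.database_name, m.table_name
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up tables and rules."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO tables VALUES (?, ?)",
        [
            ("content", "posts"),
            ("content", "drafts"),
            ("private", "secrets"),
            ("private", "public_notes"),
        ],
    )
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?, ?, ?)",
        [
            ("alice", None, None, 1),                 # global ALLOW
            ("alice", "private", None, 0),            # database-level DENY
            ("alice", "content", "drafts", 0),        # table-level DENY
            ("alice", "content", "drafts", 1),        # conflicting table-level ALLOW
            ("alice", "private", "public_notes", 1),  # table-level ALLOW
        ],
    )
    return conn


def allowed_tables(conn: sqlite3.Connection, actor: str) -> list[tuple[str, str]]:
    """Return the (database, table) pairs the actor may access."""
    return conn.execute(ALLOWED_SQL, {"actor": actor}).fetchall()


if __name__ == "__main__":
    for database_name, table_name in allowed_tables(build_demo_db(), "alice"):
        print(f"{database_name}/{table_name}")
```

With this sample data `alice` keeps `content/posts` through the global rule, loses `content/drafts` because the deny wins over the allow at the same level, loses `private/secrets` to the database-level deny, and regains `private/public_notes` through the more specific table-level allow. An actor with no rules at all gets nothing.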

uv-init-demos. uv has a useful `uv init` command for setting up new Python projects, but it comes with a bunch of different options like `--app` and `--package` and `--lib` and I wasn't sure how they differed.

So I created this GitHub repository which demonstrates all of those options, generated using this update-projects.sh script (thanks, Claude) which will run on a schedule via GitHub Actions to capture any changes made by future releases of uv.
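
For a quick local look at the same question, a small Python sketch along these lines could run `uv init` once per flag in a scratch directory and list the files each one creates. This is not the `update-projects.sh` script from the repository: the helper names (`init_command`, `list_files`, `compare_flags`) and the `demo-*` directory names are made up for illustration, and it only covers the three flags named above. It assumes `uv` is installed and on your `PATH`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# The three flags named in the post; uv init has other options not covered here.
FLAGS = ["--app", "--package", "--lib"]


def init_command(flag: str, target: Path) -> list[str]:
    """Build the argument list for `uv init <flag> <target>`."""
    return ["uv", "init", flag, str(target)]


def list_files(root: Path) -> list[str]:
    """Return sorted relative paths of all files under root, skipping .git."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    )


def compare_flags() -> dict[str, list[str]]:
    """Run uv init once per flag in a scratch directory and record the files."""
    results = {}
    with tempfile.TemporaryDirectory() as scratch:
        for flag in FLAGS:
            target = Path(scratch) / f"demo-{flag.lstrip('-')}"
            subprocess.run(init_command(flag, target), check=True, capture_output=True)
            results[flag] = list_files(target)
    return results


if __name__ == "__main__":
    if shutil.which("uv") is None:
        raise SystemExit("uv is not on PATH")
    for flag, files in compare_flags().items():
        print(flag)
        for name in files:
            print("   ", name)
```

The output depends on the installed uv version, which is the reason the repository regenerates its demos on a schedule.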

# 10:05 pm / projects, python, github-actions, git-scraping, uv

Release claude-code-transcripts 0.1 — Tools for publishing transcripts for Claude Code sessions