Simon Willison’s Weblog

Subscribe

Wednesday, 27th July 2022

Fastest way to turn HTML into text in Python (via) A light benchmark of the new-to-me selectolax Python library shows it performing extremely well for tasks such as extracting just the text from an HTML string, after first manipulating the DOM. selectolax is a Python binding over the Modest and Lexbor HTML parsing engines, which are written in no-outside-dependency C. # 5:55 pm

SQLite Internals: Pages & B-trees (via) Ben Johnson provides a delightfully clear introduction to SQLite internals, describing the binary format used to store rows on disk and how SQLite uses 4KB pages for both row storage and for the b-trees used to look up records. # 2:57 pm

2022 » July

MTWTFSS
    123
45678910
11121314151617
18192021222324
25262728293031