Simon Willison’s Weblog

Subscribe

Blogmarks in Oct, 2023

Filters: Type: blogmark × Year: 2023 × Month: Oct × Sorted by date


My User Experience Porting Off setup.py (via) PyOxidizer maintainer Gregory Szorc provides a detailed account of his experience trying to figure out how to switch from setup.py to pyproject.toml for his zstandard Python package.

This kind of detailed usability feedback is incredibly valuable for project maintainers, especially when the user encountered this many different frustrations along the way. It’s like the written version of a detailed usability testing session. # 31st October 2023, 7:57 pm

Our search for the best OCR tool in 2023, and what we found. DocumentCloud’s Sanjin Ibrahimovic reviews the best options for OCR. Tesseract scores highly for easily machine readable text, newcomer docTR is great for ease of use but still not great at handwriting. Amazon Textract is great for everything except non-Latin languages, Google Cloud Vision is great at pretty much everything except for ease-of-use. Azure AI Document Intelligence sounds worth considering as well. # 31st October 2023, 7:21 pm

I’m Sorry I Bit You During My Job Interview. The way this 2011 McSweeney’s piece by Tom O’Donnell escalates is delightful. # 31st October 2023, 4:21 pm

Microsoft announces new Copilot Copyright Commitment for customers. Part of an interesting trend where some AI vendors are reassuring their paying customers by promising legal support in the face of future legal threats:

“As customers ask whether they can use Microsoft’s Copilot services and the output they generate without worrying about copyright claims, we are providing a straightforward answer: yes, you can, and if you are challenged on copyright grounds, we will assume responsibility for the potential legal risks involved.” # 31st October 2023, 3:35 pm

Through the Ages: Apple CPU Architecture (via) I enjoyed this review of Apple’s various CPU migrations—Motorola 68k to PowerPC to Intel x86 to Apple Silicon—by Jacob Bartlett. # 30th October 2023, 10:56 pm

Making PostgreSQL tick: New features in pg_cron (via) pg_cron adds cron-style scheduling directly to PostgreSQL. It's a pretty mature extension at this point, and recently gained the ability to schedule repeating tasks at intervals as low as every 1s.

The examples in this post are really informative. I like this example, which cleans up the ever-growing cron.job_run_details table by using pg_cron itself to run the cleanup:

SELECT cron.schedule('delete-job-run-details', '0 12 * * *', $$DELETE FROM cron.job_run_details WHERE end_time < now() - interval '3 days'$$);

pg_cron can be used to schedule functions written in PL/pgSQL, which is a great example of the kind of DSL that I used to avoid but I'm now much happier to work with because I know GPT-4 can write basic examples for me and help me understand exactly what unfamiliar code is doing. # 27th October 2023, 2:57 am

Oh-Auth—Abusing OAuth to take over millions of accounts (via) Describes an attack against vulnerable implementations of OAuth.

Let’s say your application uses OAuth against Facebook, and then takes the returned Facebook token and gives it access to the user account with the matching email address passed in the token from Facebook.

It’s critical that you also confirm the token was generated for your own application, not something else. Otherwise any secretly malicious app online that uses Facebook login could take on of their stored tokens and use it to hijack an account of your site belonging to that user’s email address. # 26th October 2023, 3:51 pm

Web Components Will Outlive Your JavaScript Framework (via) A really clear explanation of the benefit of Web Components built using dependency-free vanilla JavaScript, specifically for interactive components that you might want to embed in something like a blog post. Includes a very neat minimal example component. # 25th October 2023, 5:19 pm

chDB (via) This is a really interesting development: chDB offers “an embedded SQL OLAP Engine” as a Python package, which you can install using “pip install chdb”. What you’re actually getting is a wrapper around ClickHouse—it’s almost like ClickHouse has been repackaged into an embedded database similar to SQLite. # 24th October 2023, 11:04 pm

Solving the Engineering Strategy Crisis (via) Will Larson’s 49m video discussing engineering strategy: what one is and how to build one. He defines an engineering strategy as having two key components: an honest diagnosis of the way things currently work, and a practical approach to making things better.

Towards the end of the talk he suggests that there are two paths to developing a new strategy. The first is to borrow top-down authority from a sponsor such as a CTO, and the second is to work without any borrowed authority, instead researching how things work at the moment and, through documenting that, write a strategy document into existence! # 22nd October 2023, 9:18 pm

Patrick Newman’s Software Engineering Management Checklist (via) This tiny document may have the highest density of good engineering management advice I’ve ever encountered. # 22nd October 2023, 9:16 pm

New Default: Underlined Links for Improved Accessibility (GitHub Blog). “By default, links within text blocks on GitHub are now underlined. This ensures links are easily distinguishable from surrounding text.” # 19th October 2023, 4:19 pm

I’m banned for life from advertising on Meta. Because I teach Python. (via) If accurate, this describes a nightmare scenario of automated decision making.

Reuven recently found he had a permanent ban from advertising on Facebook. They won’t tell him exactly why, and have marked this as a final decision that can never be reviewed.

His best theory (impossible for him to confirm) is that it’s because he tried advertising a course on Python and Pandas a few years ago which was blocked because a dumb algorithm thought he was trading exotic animals!

The worst part? An appeal is no longer possible because relevant data is only retained for 180 days and so all of the related evidence has now been deleted.

Various comments on Hacker News from people familiar with these systems confirm that this story likely holds up. # 19th October 2023, 2:56 pm

Making CRDTs 98% more efficient (via) Outstanding piece of explanatory writing by Jake Lazaroff showing how he reduced the transmitted state of his pixel art CRDT implementation from 643KB to 15KB using a progression of tricks, each of which is meticulously explained and accompanied by an interactive demo. # 17th October 2023, 5:15 pm

Multimodality and Large Multimodal Models (LMMs) (via) Useful, extensive review of the current state of the art of multimodal models by Chip Huyen. Chip calls them LMMs for Large Multimodal Models, a term that seems to be catching on. # 14th October 2023, 7:51 pm

Wikimedia Commons: Photographs by Gage Skidmore (via) Gage Skidmore is a Wikipedia legend: this category holds 93,458 photographs taken by Gage and released under a Creative Commons license, including a vast number of celebrities taken at events like San Diego Comic-Con. CC licensed photos of celebrities are generally pretty hard to come by so if you see a photo of any celebrity on Wikipedia there’s a good chance it’s credited to Gage. # 10th October 2023, 4:17 am

Bottleneck T5 Text Autoencoder (via) Colab notebook by Linus Lee demonstrating his Contra Bottleneck T5 embedding model, which can take up to 512 tokens of text, convert that into a 1024 floating point number embedding vector... and then then reconstruct the original text (or a close imitation) from the embedding again.

This allows for some fascinating tricks, where you can do things like generate embeddings for two completely different sentences and then reconstruct a new sentence that combines the weights from both. # 10th October 2023, 2:12 am

Decomposing Language Models Into Understandable Components. Anthropic appear to have made a major breakthrough with respect to the interpretability of Large Language Models:

“[...] we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand” # 8th October 2023, 3:43 pm

jo (via) Neat little C utility (available via brew/apt-get install etc) for conveniently outputting JSON from a shell: “jo -p name=jo n=17 parser=false” will output a JSON object with string, integer and boolean values, and you can nest it to create nested objects. Looks very handy. # 8th October 2023, 5:20 am

Think before you speak: Training Language Models With Pause Tokens. Another example of how much low hanging fruit remains to be discovered in basic Large Language Model research: this team from Carnegie Mellon and Google Research note that, since LLMs get to run their neural networks once for each token of input and output, inserting “pause” tokens that don’t output anything at all actually gives them extra opportunities to “think” about their output. # 4th October 2023, 4:23 pm

An Interactive Intro to CRDTs (via) Superb interactive essay by Jake Lazaroff, providing a very clear explanation of how the fundamental mechanisms behind CRDTs (Conflict-free Replicated Data Types) work. The interactive explanatory demos are very neatly designed and a lot of fun to play with. # 4th October 2023, 3:10 pm

Translating Latin demonology manuals with GPT-4 and Claude (via) UC Santa Cruz history professor Benjamin Breen puts LLMs to work on historical texts. They do an impressive job of translating flaky OCRd text from 1599 Latin and 1707 Portuguese.

“It’s not about getting the AI to replace you. Instead, it’s asking the AI to act as a kind of polymathic research assistant to supply you with leads.” # 4th October 2023, 1:49 am

New sqlite3 CLI tool in Python 3.12. The newly released Python 3.12 includes a SQLite shell, which you can open using “python -m sqlite3”—handy for when you’re using a machine that has Python installed but no sqlite3 binary.

I installed Python 3.12 for macOS using the official installer from Python.org and now “/usr/local/bin/python3 -m sqlite3” gives me a SQLite 3.41.1 shell—a pleasantly recent version from March 2023 (the latest SQLite is 3.43.1, released in September). # 3rd October 2023, 6:57 pm

Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning. Andy Baio reports back on his investigations into the world of AI voice cloning.

This is no longer a niche interest. There’s a Discord with 500,000 members sharing tips and tricks on cloning celebrity voices in order to make their own cover songs, often built with Google Colab using models distributed through Hugging Face.

Andy then makes his own, playing with the concept “What if every Weird Al song was the original, and every other artist was covering his songs instead?”

I particularly enjoyed Madonna’s cover of “Like A Surgeon”, Lady Gaga’s “Perform This Way” and Lorde’s “Foil”. # 2nd October 2023, 6:50 pm

jq 1.7. First new release of jq in five years! The project has moved from a solo maintainer to a new team with a dedicated GitHub organization. A ton of new features in this release—I’m most excited about the new pick(.key1, .key2.nested) builtin for emitting a selected subset of the incoming objects, and the --raw-output0 option which outputs zero byte delimited lists, designed to be piped to “xargs -0”. # 2nd October 2023, 4:58 am

Database Migrations. Vadim Kravcenko provides a useful, in-depth description of the less obvious challenges of applying database migrations successfully. Vadim uses and likes Django’s migrations (as do I) but notes that running them at scale still involves a number of thorny challenges.

The biggest of these, which I’ve encountered myself multiple times, is that if you want truly zero downtime deploys you can’t guarantee that your schema migrations will be deployed at the exact same instant as changes you make to your application code.

This means all migrations need to be forward-compatible: you need to apply a schema change in a way that your existing code will continue to work error-free, then ship the related code change as a separate operation.

Vadim describes what this looks like in detail for a number of common operations: adding a field, removing a field and changing a field that has associated business logic implications. He also discusses the importance of knowing when to deploy a dual-write strategy. # 1st October 2023, 11:55 pm

Observable notebook: Detect objects in images (via) I built an Observable notebook that uses Transformers.js and the Xenova/detra-resnet-50 model to detect objects in images, entirely running within your browser. You can select an image using a file picker and it will show you that image with bounding boxes and labels drawn around items within it. I have a demo image showing some pelicans flying ahead, but it works with any image you give it—all without uploading that image to a server. # 1st October 2023, 3:46 pm