Simon Willison’s Weblog

Subscribe

Items in Jan, 2024

Filters: Year: 2024 × Month: Jan × Sorted by date


Does GPT-2 Know Your Phone Number? (via) This report from Berkeley Artificial Intelligence Research in December 2020 showed GPT-3 outputting a full page of chapter 3 of Harry Potter and the Philosopher’s Stone—similar to how the recent suit from the New York Times against OpenAI and Microsoft demonstrates memorized news articles from that publication as outputs from GPT-4. # 8th January 2024, 5:26 am

Text Embeddings Reveal (Almost) As Much As Text. Embeddings of text—where a text string is converted into a fixed-number length array of floating point numbers—are demonstrably reversible: “a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly”.

This means that if you’re using a vector database for embeddings of private data you need to treat those embedding vectors with the same level of protection as the original text. # 8th January 2024, 5:22 am

Weeknotes: Page caching and custom templates for Datasette Cloud

My main development focus this week has been adding public page caching to Datasette Cloud, and exploring what custom template support might look like for that service.

[... 924 words]

It’s OK to call it Artificial Intelligence

Update 9th January 2024: This post was clumsily written and failed to make the point I wanted it to make. I’ve published a follow-up, What I should have said about the term Artificial Intelligence which you should read instead.

[... 1818 words]

GPT in 500 lines of SQL (via) Utterly brilliant piece of PostgreSQL hackery by Alex Bolenok, who implements a full GPT-2 style language model in SQL on top of pg_vector. The final inference query is 498 lines long! # 6th January 2024, 10:55 pm

Microsoft Research relicense Phi-2 as MIT (via) Phi-2 was already an interesting model—really strong results for its size—made available under a non-commercial research license. It just got significantly more interesting: Microsoft relicensed it as MIT open source. # 6th January 2024, 6:06 am

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (via) NIST—the National Institute of Standards and Technology, a US government agency, released a 106 page report on attacks against modern machine learning models, mostly covering LLMs.

Prompt injection gets two whole sections, one on direct prompt injection (which incorporates jailbreaking as well, which they misclassify as a subset of prompt injection) and one on indirect prompt injection.

They talk a little bit about mitigations, but for both classes of attack conclude: “Unfortunately, there is no comprehensive or foolproof solution for protecting models against adversarial prompting, and future work will need to be dedicated to investigating suggested defenses for their efficacy.” # 6th January 2024, 4:08 am

If you learn something the hard way, share your findings with others. You have blazed a new trail; now you must mark it for your fellow travellers. Sharing knowledge is an unreasonably effective way of helping others.

Nicolas Bouliane # 5th January 2024, 10:32 pm

My blog’s year archive pages now have tag clouds (via) Inspired by the tag cloud I used in my recent 2023 AI roundup post, I decided to add a tag cloud to the top of every one of my archive-by-year pages showing what topics I had spent the most time with that year.

I already had old code for this, so I pasted it into GPT-4 along with an example of the output of my JSON endpoint from Django SQL Dashboard and had it do most of the work for me. # 4th January 2024, 9:02 pm

container2wasm (via) “Converts a container to WASM with emulation by Bochs (for x86_64 containers) and TinyEMU (for riscv64 containers)”—effectively letting you take a Docker container and turn it into a WebAssembly blob that can then run in any WebAssembly host environment, including the browser.

Run “c2w ubuntu:22.04 out.wasm” to output a WASM binary for the Ubuntu 22:04 container from Docker Hub, then “wasmtime out.wasm uname -a” to run a command.

Even better, check out the live browser demos linked fro the README, which let you do things like run a Python interpreter in a Docker container directly in your browser. # 3rd January 2024, 11:21 pm

Fastest Way to Read Excel in Python (via) Haki Benita produced a meticulously researched and written exploration of the options for reading a large Excel spreadsheet into Python. He explored Pandas, Tablib, Openpyxl, shelling out to LibreOffice, DuckDB and python-calamine (a Python wrapper of a Rust library). Calamine was the winner, taking 3.58s to read 500,00 rows—compared to Pandas in last place at 32.98s. # 3rd January 2024, 8:04 pm

Tom Scott, and the formidable power of escalating streaks

Ten years ago yesterday, Tom Scott posted this video to YouTube about “Special Crossings For Horses In Britain”. It was the first in his Things You Might Not Know series, but more importantly it was the start of a streak.

[... 1352 words]

NPM: modele-social (via) This is a fascinating open source package: it’s an NPM module containing an implementation of the rules for calculating social security contributions in France, maintained by a team at Urssaf, the not-quite-government organization in France that manages the collection of social security contributions there.

The rules themselves can be found in the associated GitHub repository, encoded in a YAML-like declarative language called Publicodes that was developed by the French government for this and similar purposes. # 2nd January 2024, 5:55 pm

Since the advent of ChatGPT, and later by using LLMs that operate locally, I have made extensive use of this new technology. The goal is to accelerate my ability to write code, but that’s not the only purpose. There’s also the intent to not waste mental energy on aspects of programming that are not worth the effort.

[...] Current LLMs will not take us beyond the paths of knowledge, but if we want to tackle a topic we do not know well, they can often lift us from our absolute ignorance to the point where we know enough to move forward on our own.

Salvatore Sanfilippo # 2nd January 2024, 2:50 pm

After ten years, it’s time to stop making videos. Ten years ago, my friend Tom Scott started a deliberate streak of posting YouTube videos—initially about one a day before settling into a cadence of one a week. He kept that up for the full ten years, growing his subscribers to over 6 million in the process.

Today he’s ending that streak, in unparalleled style.

(I’m proud to have made an appearance in video number 13, talking about Zeppelins.) # 1st January 2024, 10:59 pm