Simon Willison’s Weblog


PaliGemma model README (via) One of the more overlooked announcements from Google I/O yesterday was PaliGemma, an openly licensed VLM (Vision Language Model) in the Gemma family of models.

The model accepts an image and a text prompt. It outputs text, but that text can include special tokens representing regions on the image. This means it can return both bounding boxes and fuzzier segment outlines of detected objects, behavior that can be triggered using a prompt such as "segment puffins".

From the README:

PaliGemma uses the Gemma tokenizer with 256,000 tokens, but we further extend its vocabulary with 1024 entries that represent coordinates in normalized image-space (<loc0000>...<loc1023>), and another with 128 entries (<seg000>...<seg127>) that are codewords used by a lightweight referring-expression segmentation vector-quantized variational auto-encoder (VQ-VAE) [...]

You can try it out on Hugging Face.

It's a 3B model, making it feasible to run on consumer hardware.
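
Here's a minimal sketch of trying that "segment puffins" prompt via the Hugging Face transformers integration. Treat this as an assumption-laden example: the model ID and processor API are taken from the Hugging Face release, and it needs a recent transformers version:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image

# Model ID assumed from the Hugging Face release
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("puffins.jpg")
inputs = processor(text="segment puffins", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)

# The decoded output mixes plain text with <locXXXX> and <segXXX> tokens
print(processor.decode(output[0], skip_special_tokens=False))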

# 15th May 2024, 9:16 pm / google, google-io, ai, generative-ai, local-llms, llms, vision-llms, gemma, image-segmentation

OpenAI: Managing your work in the API platform with Projects (via) New OpenAI API feature: you can now create API keys for "projects" that can have a monthly spending cap. The UI for that limit says:

If the project's usage exceeds this amount in a given calendar month (UTC), subsequent API requests will be rejected

You can also set custom token-per-minute and request-per-minute rate limits for individual models.

I've been wanting this for ages: this means it's finally safe to ship a weird public demo on top of their various APIs without risk of accidental bankruptcy if the demo goes viral!

# 15th May 2024, 7:18 pm / openai, chatgpt, llms

How to PyCon (via) Glyph’s tips on making the most out of PyCon. I particularly like his suggestion that “dinners are for old friends, but lunches are for new ones”.

I’m heading out to Pittsburgh tonight, and giving a keynote (!) on Saturday. If you see me there please come and say hi!

# 15th May 2024, 3:29 pm / conferences, pycon, python, glyph

Context caching for Google Gemini (via) Another new Gemini feature announced today. Long context models enable answering questions against large chunks of text, but the price of those long prompts can be prohibitive - $3.50/million for Gemini Pro 1.5 up to 128,000 tokens and $7/million beyond that.

Context caching offers a price optimization, where the long prefix prompt can be reused between requests, halving the cost per prompt but at an additional cost of $4.50 / 1 million tokens per hour to keep that context cache warm.

Given that hourly extra charge, this isn't a default optimization for all cases, but certain high-traffic applications might be able to save quite a bit on their longer prompt systems.
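
Some rough back-of-envelope arithmetic in Python, using the prices quoted above and assuming a hypothetical 100,000 token shared prefix:

# Break-even sketch for context caching (illustrative numbers only)
prefix_tokens = 100_000
prompt_price = 3.50 / 1_000_000   # $/token for Gemini Pro 1.5 up to 128K
cache_price = 4.50 / 1_000_000    # $/token per hour to keep the cache warm

saving_per_request = prefix_tokens * prompt_price / 2  # caching halves the prefix cost
storage_per_hour = prefix_tokens * cache_price         # $0.45/hour for this prefix

# Caching wins once requests/hour exceeds this: 0.45 / 0.175 = ~2.6
print(storage_per_hour / saving_per_request)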

It will be interesting to see if other vendors such as OpenAI and Anthropic offer a similar optimization in the future.

Update 14th August 2024: Anthropic's Claude now has its own version of prompt caching.

# 14th May 2024, 8:42 pm / google, ai, prompt-engineering, generative-ai, llms, gemini, llm-pricing, prompt-caching, long-context

llm-gemini 0.1a4. A new release of my llm-gemini plugin adding support for the Gemini 1.5 Flash model that was revealed this morning at Google I/O.

I'm excited about this new model because of its low price. Flash is $0.35 per 1 million tokens for prompts up to 128K tokens and $0.70 per 1 million tokens for longer prompts - up to a million tokens now and potentially two million at some point in the future. That's 1/10th of the price of Gemini Pro 1.5, cheaper than GPT-3.5 ($0.50/million) and only a little more expensive than Claude 3 Haiku ($0.25/million).
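
To try it out (the exact model alias here is my assumption based on the Gemini API's naming; check the plugin README):

llm install llm-gemini
llm keys set gemini
llm -m gemini-1.5-flash-latest "a short poem about otters"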

# 14th May 2024, 8:32 pm / projects, google-io, ai, generative-ai, llms, llm, gemini, llm-release

How developers are using Gemini 1.5 Pro’s 1 million token context window. I got to be a talking head for a few seconds in an intro video for today's Google I/O keynote, talking about how I used Gemini Pro 1.5 to index my bookshelf (and with a cameo from my squirrel nutcracker). I'm at 1m25s.

(Or at 10m6s in the full video of the keynote)

# 14th May 2024, 8:27 pm / google, video, google-io, ai, generative-ai, llms, gemini

Why your voice assistant might be sexist (via) Given OpenAI's demo yesterday of a vocal chat assistant with a flirty, giggly female voice - and the new ability to be interrupted! - it's worth revisiting this piece by Chris Baraniuk from June 2022 about gender dynamics in voice assistants. Includes a link to this example of a synthesized non-binary voice.

# 14th May 2024, 4:16 pm / ai, openai, chatgpt

LLM 0.14, with support for GPT-4o. It's been a while since the last LLM release. This one adds support for OpenAI's new model:

llm -m gpt-4o "fascinate me"

There's also a new llm logs -r (or --response) option for getting back just the response from your last prompt, without wrapping it in Markdown that includes the prompt.

Plus nine new plugins since 0.13!

# 13th May 2024, 9 pm / projects, ai, openai, generative-ai, llms, llm

Hello GPT-4o. OpenAI announced a new model today: GPT-4o, where the o stands for "omni".

It looks like this is the gpt2-chatbot we've been seeing in the Chat Arena the past few weeks.

GPT-4o doesn't seem to be a huge leap ahead of GPT-4 in terms of "intelligence" - whatever that might mean - but it has a bunch of interesting new characteristics.

First, it's multi-modal across text, images and audio. The audio demos from this morning's launch were extremely impressive.

ChatGPT's previous voice mode worked by passing audio through a speech-to-text model, then an LLM, then a text-to-speech for the output. GPT-4o does everything with the one model, reducing latency to the point where it can act as a live interpreter between people speaking in two different languages. It also has the ability to interpret tone of voice, and has much more control over the voice and intonation it uses in response.

It's very science fiction, and has hints of uncanny valley. I can't wait to try it out - it should be rolling out to the various OpenAI apps "in the coming weeks".

Meanwhile the new model itself is already available for text and image inputs via the API and in the Playground interface, as model ID "gpt-4o" or "gpt-4o-2024-05-13". My first impressions are that it feels notably faster than gpt-4-turbo.
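
Here's a quick way to try it from Python using the official openai library:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "fascinate me"}],
)
print(response.choices[0].message.content)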

This announcement post also includes examples of image output from the new model. It looks like they may have taken big steps forward in two key areas of image generation: output of text (the "Poetic typography" examples) and maintaining consistent characters across multiple prompts (the "Character design - Geary the robot" example).

The size of the vocabulary of the tokenizer - effectively the number of unique integers used to represent text - has increased to ~200,000 from ~100,000 for GPT-4 and GPT-3.5. Inputs in Gujarati use 4.4x fewer tokens, Japanese uses 1.4x fewer, Spanish uses 1.1x fewer. Previously, languages other than English paid a material penalty in terms of how much text could fit into a prompt; it's good to see that effect being reduced.
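
You can measure the difference yourself with tiktoken, assuming a release recent enough to include the new o200k_base encoding used by GPT-4o:

import tiktoken

old = tiktoken.get_encoding("cl100k_base")  # GPT-4 and GPT-3.5
new = tiktoken.get_encoding("o200k_base")   # GPT-4o

# Japanese text should need noticeably fewer tokens with the new encoding
text = "こんにちは、世界"
print(len(old.encode(text)), len(new.encode(text)))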

Also notable: the price. OpenAI claim a 50% price reduction compared to GPT-4 Turbo. Conveniently, gpt-4o costs exactly 10x gpt-3.5: 4o is $5/million input tokens and $15/million output tokens. 3.5 is $0.50/million input tokens and $1.50/million output tokens.

(I was a little surprised not to see a GPT-3.5 price decrease to better compete with the less expensive Claude 3 Haiku.)

The price drop is particularly notable because OpenAI are promising to make this model available to free ChatGPT users as well - the first time they've directly made their "best" model available to non-paying customers.

Tucked away right at the end of the post:

We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.

I'm looking forward to learning more about these video capabilities, which were hinted at by some of the live demos in this morning's presentation.

# 13th May 2024, 7:09 pm / ai, openai, generative-ai, gpt-4, llms, vision-llms, llm-pricing, multi-modal-output, chatbot-arena

GPUs Go Brrr (via) Fascinating, detailed low-level notes on how to get the most out of NVIDIA's H100 GPUs (currently selling for around $40,000 apiece) from the research team at Stanford who created FlashAttention, among other things.

The swizzled memory layouts are flat-out incorrectly documented, which took considerable time for us to figure out.

# 13th May 2024, 4:08 am / stanford, ai, nvidia, gpus

Parsing PNG images in Mojo (via) It’s still very early days for Mojo, the new systems programming language from Chris Lattner that imitates large portions of Python and can execute Python code directly via a compatibility layer.

Ferdinand Schenck reports here on building a PNG decoding routine in Mojo, with a detailed dive into both the PNG spec and the current state of the Mojo language.

# 12th May 2024, 8:17 pm / png, python, mojo

About ARDC (Amateur Radio Digital Communications). In ham-radio-adjacent news, here's a foundation that's worth knowing about:

ARDC makes grants to projects and organizations that are experimenting with new ways to advance both amateur radio and digital communication science.

In 1981 they were issued the entire 44.x.x.x block of IP addresses - 16 million in total. In 2019 they sold a quarter of those IPs to Amazon for about $100 million, providing them with a very healthy endowment from which they can run their grants program!

# 12th May 2024, 5:21 pm / ham-radio

“Link In Bio” is a slow knife (via) Anil Dash writing in 2019 about how Instagram’s “link in bio” thing (where users cannot post links to things in Instagram posts or comments, just a single link field in their bio) is harmful for linking on the web.

Today it’s even worse. TikTok has the same culture, and LinkedIn and Twitter both algorithmically de-boost anything with a URL in it, encouraging users to share screenshots (often unsourced) rather than links, since linking to content reduces a post’s distribution.

It’s gross.

# 12th May 2024, 2:15 pm / anil-dash, linkedin, links, social-media, twitter, tiktok

Ham radio general exam question pool as JSON. I scraped a pass on my Ham radio general exam this morning. One of the tools I used to help me pass was a Datasette instance with all 429 questions from the official question pool. I've published that raw data as JSON on GitHub, which I converted from the official question pool document using an Observable notebook.

Relevant TIL: How I studied for my Ham radio general exam.

# 11th May 2024, 7:16 pm / json, projects, radio, datasette, observable, ham-radio

Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun (via) A real tour de force of data engineering. Wilson Lin fetched 40 million posts and comments from the Hacker News API (using Node.js with a custom multi-process worker pool) and then ran them all through the BGE-M3 embedding model using RunPod, which let him fire up ~150 GPU instances to get the whole run done in a few hours, using a custom queue he built with RocksDB and Rust to save on Amazon SQS costs.

Then he crawled 4 million linked pages, embedded that content using the faster and cheaper jina-embeddings-v2-small-en model, ran UMAP dimensionality reduction to render a 2D map and did a whole lot of follow-on work to identify topic areas and make the map look good.

That's not even half the project - Wilson built several interactive features on top of the resulting data, and experimented with custom rendering techniques on top of canvas to get everything to render quickly.

There's so much in here, and both the code and data (multiple GBs of Arrow files) are available if you want to dig in and try some of this out for yourself.

In the Hacker News comments Wilson shares that the total cost of the project was a couple of hundred dollars.

One tiny detail I particularly enjoyed - unrelated to the embeddings - was this trick for testing which edge location is closest to a user using JavaScript:

const edge = await Promise.race(
  EDGES.map(async (edge) => {
    // Run a few times to avoid potential cold start biases.
    for (let i = 0; i < 3; i++) {
      await fetch(`https://${edge}.edge-hndr.wilsonl.in/healthz`);
    }
    return edge;
  }),
);

# 10th May 2024, 4:42 pm / hacker-news, embeddings, jina

uv pip install --exclude-newer example (via) A neat new feature of the uv pip install command is the --exclude-newer option, which can be used to avoid installing any package versions released after the specified date.

Here's a clever example of that in use from the typing_extensions package's CI tests, which run against some downstream packages:

uv pip install --system -r test-requirements.txt --exclude-newer $(git show -s --date=format:'%Y-%m-%dT%H:%M:%SZ' --format=%cd HEAD)

They use git show to get the date of the most recent commit (%cd means commit date) formatted as an ISO timestamp, then pass that to --exclude-newer.

# 10th May 2024, 4:35 pm / git, pip, python, uv, astral

Bullying in Open Source Software Is a Massive Security Vulnerability. The XZ story from last month, where a malicious contributor almost managed to ship a backdoor to a number of major Linux distributions, included a nasty detail: presumed collaborators of the attacker bullied the maintainer to make them more susceptible to accepting help.

Hans-Christoph Steiner from F-Droid reported a similar attempt from a few years ago:

A new contributor submitted a merge request to improve the search, which was oft requested but the maintainers hadn't found time to work on. There was also pressure from other random accounts to merge it. In the end, it became clear that it added a SQL injection vulnerability.

404 Media's Jason Koebler ties the two together here and makes the case for bullying as a genuine form of security exploit in the open source ecosystem.

# 9th May 2024, 10:26 pm / open-source, security, jason-koebler

experimental-phi3-webgpu (via) Run Microsoft’s excellent Phi-3 model directly in your browser, using WebGPU, which meant it didn’t work in Firefox for me, just in Chrome.

It fetched around 2.1GB of data into the browser cache on first run, then gave me decent quality responses to my prompts, running at an impressive 21 tokens a second (M2, 64GB).

I think Phi-3 is the highest quality model of this size, so it’s a really good fit for running in a browser like this.

# 9th May 2024, 10:21 pm / browsers, ai, webassembly, generative-ai, local-llms, llms, phi, webgpu

datasette-pins — a new Datasette plugin for pinning tables and queries. Alex Garcia built this plugin for Datasette Cloud, and as with almost every Datasette Cloud feature, we're releasing it as an open source package as well.

datasette-pins allows users with the right permission to "pin" tables, databases and queries to their homepage. It's a lightweight way to customize that homepage, especially useful as your Datasette instance grows to host dozens or even hundreds of tables.

# 9th May 2024, 6:29 pm / plugins, datasette, datasette-cloud, alex-garcia

OpenAI Model Spec, May 2024 edition (via) New from OpenAI, a detailed specification describing how they want their models to behave in both ChatGPT and the OpenAI API.

“It includes a set of core objectives, as well as guidance on how to deal with conflicting objectives or instructions.”

The document acts as guidelines for the reinforcement learning from human feedback (RLHF) process, and in the future may be used directly to help train models.

It includes some principles that clearly relate to prompt injection: “In some cases, the user and developer will provide conflicting instructions; in such cases, the developer message should take precedence”.

# 8th May 2024, 6:15 pm / ai, openai, prompt-injection, generative-ai, llms

Modern SQLite: Generated columns (via) The second in Anton Zhiyanov's series on SQLite features you might have missed.

It turns out I had an incorrect mental model of generated columns. In SQLite these can be "virtual" or "stored" (written to disk along with the rest of the table, a bit like a materialized view). Anton noted that "stored are rarely used in practice", which surprised me because I thought that storing them was necessary for them to participate in indexes.

It turns out that's not the case. Anton's example here shows a generated column providing indexed access to a value stored inside a JSON key:

create table events (
  id integer primary key,
  event blob,
  etime text as (event ->> 'time'),
  etype text as (event ->> 'type')
);
create index events_time on events(etime);
insert into events(event) values (
  '{"time": "2024-05-01", "type": "credit"}'
);
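
A quick way to confirm the index is used, via Python's sqlite3 module (this assumes your Python links against SQLite 3.38 or later, which added the ->> operator):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table events (
  id integer primary key,
  event blob,
  etime text as (event ->> 'time'),
  etype text as (event ->> 'type')
);
create index events_time on events(etime);
insert into events(event) values ('{"time": "2024-05-01", "type": "credit"}');
""")

# Should print a SEARCH step that mentions USING INDEX events_time
for row in db.execute(
    "explain query plan select * from events where etime = '2024-05-01'"
):
    print(row)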

Update: snej reminded me that this isn't a new capability either: SQLite has been able to create indexes on expressions for years.

# 8th May 2024, 4:55 pm / sqlite, anton-zhiyanov

Tagged Pointer Strings (2015) (via) Mike Ash digs into a fascinating implementation detail of macOS.

Tagged pointers provide a way to embed a literal value in a pointer reference. Objective-C pointers on macOS are 64 bit, providing plenty of space for representing entire values. If the least significant bit is 1 (the pointer is a 64 bit odd number) then the pointer is "tagged" and represents a value, not a memory reference.

Here's where things get really clever. Storing an integer value up to 60 bits is easy. But what about strings?

There's enough space for three UTF-16 characters, with 12 bits left over. But if the string fits in ASCII (8 bits per character), we can store 7 characters.

Drop everything except a-z A-Z.0-9 and we need 6 bits per character, allowing 10 characters to fit in the pointer.

Apple take this a step further: if the string contains just eilotrm.apdnsIc ufkMShjTRxgC4013 ("b" is apparently uncommon enough to be ignored here), a 32-character alphabet that needs only 5 bits per character, they can store 11 characters in that 60 bits!
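
Here's an illustrative sketch of the 6-bit packing step (not Apple's actual implementation, which also has to encode type and length information):

# The 63 allowed characters fit in 6 bits each; 10 of them fit in 60 bits
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.0123456789"

def tag_string(s: str) -> int:
    assert len(s) <= 10 and all(c in ALPHABET for c in s)
    value = 0
    for c in s:
        value = (value << 6) | ALPHABET.index(c)  # append 6 bits per character
    return (value << 1) | 1  # low bit set = this "pointer" is a tagged value

print(hex(tag_string("datasette")))  # always an odd number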

# 8th May 2024, 2:23 pm / c, objectivec, strings

Towards universal version control with Patchwork (via) Geoffrey Litt has been working with Ink & Switch exploring UI patterns for applying version control to different kinds of applications, with the goal of developing a set of conceptual primitives that can bring branching and version tracking to interfaces beyond just Git-style version control.

Geoffrey observes that basic version control is already a metaphor in a lot of software—the undo stack in Photoshop or suggestion mode in Google Docs are two examples.

Extending that is a great way to interact with AI tools as well—allowing for editorial bots that can suggest their own changes for you to accept, for example.

# 8th May 2024, 1:44 am / version-control, ai, generative-ai, llms, geoffrey-litt, ink-and-switch

gpt2-chatbot confirmed as OpenAI (via) The mysterious gpt2-chatbot model that showed up in the LMSYS arena a few days ago was suspected to be a testing preview of a new OpenAI model. This has now been confirmed, thanks to a 429 rate limit error message that exposes details from the underlying OpenAI API platform.

The model has been renamed to im-also-a-good-gpt-chatbot and is now only randomly available in "Arena (battle)" mode, not via "Direct Chat".

# 8th May 2024, 12:33 am / ai, openai, generative-ai, llms, chatbot-arena

Deterministic Quoting: Making LLMs Safe for Healthcare (via) Matt Yeung introduces Deterministic Quoting, a technique to help reduce the risk of hallucinations while working with LLMs. The key idea is to have parts of the output that are copied directly from relevant source documents, with a different visual treatment to help indicate that they are exact quotes, not generated output.

The AI chooses which section of source material to quote, but the retrieval of that text is a traditional non-AI database lookup. That’s the only way to guarantee that an LLM has not transformed text: don’t send it through the LLM in the first place.

The LLM may still pick misleading quotes or include hallucinated details in the accompanying text, but this is still a useful improvement.

The implementation is straightforward: retrieved chunks include a unique reference, and the LLM is instructed to include those references as part of its replies. Matt's posts include examples of the prompts they are using for this.
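
A minimal sketch of the idea in Python (the marker syntax and chunk storage here are my assumptions, not Matt's actual implementation):

import re

# Retrieved chunks, keyed by the unique reference each one carries in the prompt
CHUNKS = {
    "doc1-p4": "Drug X is contraindicated in patients with condition Y.",
}

def render(llm_output: str) -> str:
    # The LLM emits markers like [quote:doc1-p4]; each one is replaced with the
    # exact stored text, which never passes through the model itself
    return re.sub(
        r"\[quote:([\w-]+)\]",
        lambda m: '"' + CHUNKS[m.group(1)] + '"',
        llm_output,
    )

print(render("The guidelines state [quote:doc1-p4]"))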

# 7th May 2024, 7:08 pm / ai, prompt-engineering, generative-ai, llms, rag, hallucinations

OpenAI cookbook: How to get token usage data for streamed chat completion response (via) New feature in the OpenAI streaming API that I've been wanting for a long time: you can now set stream_options={"include_usage": True} to get back a "usage" block at the end of the stream showing how many input and output tokens were used.

This means you can now accurately account for the total cost of each streaming API call. Previously this information was only available for non-streaming responses.
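
Here's what that looks like with the openai Python library:

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage:  # populated only on the final chunk
        print()
        print(chunk.usage)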

# 7th May 2024, 2:46 am / ai, openai, generative-ai, llms

What You Need to Know about Modern CSS (Spring 2024 Edition) (via) Useful guide to the many new CSS features that have become widely enough supported to start using as-of May 2024. Time to learn container queries!

View transitions are still mostly limited to Chrome—I can’t wait for those to land in Firefox and Safari.

# 5th May 2024, 2:08 pm / css, view-transitions

Figma’s journey to TypeScript: Compiling away our custom programming language (via) I love a good migration story. Figma had their own custom language that compiled to JavaScript, called Skew. As WebAssembly support in browsers emerged and improved the need for Skew’s performance optimizations reduced, and TypeScript’s maturity and popularity convinced them to switch.

Rather than doing a stop-the-world rewrite they built a transpiler from Skew to TypeScript, enabling a multi-year migration without preventing their product teams from continuing to make progress on new features.

# 4th May 2024, 2:08 pm / compilers, javascript, typescript, webassembly

I’m writing a new vector search SQLite Extension. Alex Garcia is working on sqlite-vec, a spiritual successor to his sqlite-vss project. The new SQLite C extension will have zero other dependencies (sqlite-vss used some tricky C++ libraries) and will work using virtual tables, storing chunks of vectors in shadow tables to avoid needing to load everything into memory at once.

# 3rd May 2024, 3:16 am / c, sqlite, vectors, alex-garcia, vector-search, embeddings

Printing music with CSS Grid (via) Stephen Bond demonstrates some ingenious tricks for creating surprisingly usable sheet music notation using clever application of CSS grids.

It uses rules like .stave > [data-duration="0.75"] { grid-column-end: span 18; } to turn data- attributes for musical properties into positions on the rendered stave.

# 2nd May 2024, 2:28 pm / css, music
