Simon Willison's Weblog: o1

Database Remote-Copy Tool For SQLite (draft)

2024-10-04T20:57:39+00:00

Database Remote-Copy Tool For SQLite (draft)

Neat new SQLite utilities often show up in branches of the SQLite repository. Here's a new one from last month: sqlite3-rsync, providing tools for efficiently creating and updating copies of WAL-mode SQLite databases on either the same machine or across remote machines via SSH.

The way it works is neat, inspired by rsync (hence the tool's name):

The protocol is for the replica to send a cryptographic hash of each of its pages over to the origin side, then the origin sends back the complete content of any page for which the hash does not match.

SQLite's default page size is 4096 bytes and a hash is 20 bytes, so if nothing has changed then the client will transmit 0.5% of the database size in hashes and get nothing back in return.

The tool takes full advantage of SQLite's WAL mode - when you run it you'll get an exact snapshot of the database state as it existed at the moment the copy was initiated, even if the source database continues to apply changes.

I wrote up a TIL on how to compile it - short version:

cd /tmp
git clone https://github.com/sqlite/sqlite.git
cd sqlite
git checkout sqlite3-rsync
./configure
make sqlite3.c
cd tool
gcc -o sqlite3-rsync sqlite3-rsync.c ../sqlite3.c -DSQLITE_ENABLE_DBPAGE_VTAB
./sqlite3-rsync --help

Update: It turns out you can now just run ./configure && make sqlite-rsync in the root checkout.

Something I’ve worried about in the past is that if I want to make a snapshot backup of a SQLite database I need enough additional free disk space to entirely duplicate the current database first (using the backup mechanism or VACUUM INTO). This tool fixes that - I don’t need any extra disk space at all, since the pages that have been updated will be transmitted directly over the wire in 4096 byte chunks.

I tried feeding the 1800 lines of C through OpenAI’s o1-preview with the prompt “Explain the protocol over SSH part of this” and got a pretty great high level explanation - markdown copy here.

Via lobste.rs

Tags: sqlite, c, o1

Solving a bug with o1-preview, files-to-prompt and LLM

2024-09-25T18:41:13+00:00

Solving a bug with o1-preview, files-to-prompt and LLM

I added a new feature to DJP this morning: you can now have plugins specify their middleware in terms of how it should be positioned relative to other middleware - inserted directly before or directly after django.middleware.common.CommonMiddleware for example.

At one point I got stuck with a weird test failure, and after ten minutes of head scratching I decided to pipe the entire thing into OpenAI's o1-preview to see if it could spot the problem. I used files-to-prompt to gather the code and LLM to run the prompt:

files-to-prompt **/*.py -c | llm -m o1-preview "
The middleware test is failing showing all of these - why is MiddlewareAfter repeated so many times?

['MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware2', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware4', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware2', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware', 'MiddlewareBefore']"

The model whirled away for a few seconds and spat out an explanation of the problem - one of my middleware classes was accidentally calling self.get_response(request) in two different places.

I did enjoy how o1 attempted to reference the relevant Django documentation and then half-repeated, half-hallucinated a quote from it:

This took 2,538 input tokens and 4,354 output tokens - by my calculations at $15/million input and $60/million output that prompt cost just under 30 cents.

Tags: o1, llm, djp, openai, ai, llms, ai-assisted-programming, generative-ai

Quoting Riley Goodside

2024-09-16T17:28:52+00:00

o1 prompting is alien to me. Its thinking, gloriously effective at times, is also dreamlike and unamenable to advice.

Just say what you want and pray. Any notes on “how” will be followed with the diligence of a brilliant intern on ketamine.

— Riley Goodside

Tags: riley-goodside, o1, prompt-engineering, generative-ai, openai, ai, llms

Quoting Terrence Tao

2024-09-15T00:04:03+00:00

[… OpenAI’s o1] could work its way to a correct (and well-written) solution if provided a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes. The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student. However, this was an improvement over previous models, whose capability was closer to an actually incompetent graduate student.

— Terrence Tao

Tags: o1, generative-ai, openai, mathematics, ai, llms

Quoting Noam Brown

2024-09-13T11:35:51+00:00

Believe it or not, the name Strawberry does not come from the “How many r’s are in strawberry” meme. We just chose a random word. As far as we know it was a complete coincidence.

— Noam Brown, OpenAI

Tags: o1, generative-ai, openai, ai, llms

Quoting Jason Wei

2024-09-12T23:45:19+00:00

o1-mini is the most surprising research result I've seen in the past year

Obviously I cannot spill the secret, but a small model getting >60% on AIME math competition is so good that it's hard to believe

— Jason Wei, OpenAI

Tags: o1, generative-ai, openai, ai, llms

LLM 0.16

2024-09-12T23:20:59+00:00

LLM 0.16

New release of LLM adding support for the o1-preview and o1-mini OpenAI models that were released today.

Tags: llm, projects, generative-ai, openai, ai, llms, o1

Notes on OpenAI's new o1 chain-of-thought models

2024-09-12T22:36:37+00:00

OpenAI released two major new preview models today: o1-preview and o1-mini (that mini one is not a preview) - previously rumored as having the codename "strawberry". There's a lot to understand about these models - they're not as simple as the next step up from GPT-4o, instead introducing some major trade-offs in terms of cost and performance in exchange for improved "reasoning" capabilities.

Trained for chain of thought
Low-level details from the API documentation
Hidden reasoning tokens
Examples
What's new in all of this

Trained for chain of thought

OpenAI's elevator pitch is a good starting point:

We've developed a new series of AI models designed to spend more time thinking before they respond.

One way to think about these new models is as a specialized extension of the chain of thought prompting pattern - the "think step by step" trick that we've been exploring as a a community for a couple of years now, first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022.

OpenAI's article Learning to Reason with LLMs explains how the new models were trained:

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

[...]

Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.

Effectively, this means the models can better handle significantly more complicated prompts where a good result requires backtracking and "thinking" beyond just next token prediction.

I don't really like the term "reasoning" because I don't think it has a robust definition in the context of LLMs, but OpenAI have committed to using it here and I think it does an adequate job of conveying the problem these new models are trying to solve.

Low-level details from the API documentation

Some of the most interesting details about the new models and their trade-offs can be found in their API documentation:

For applications that need image inputs, function calling, or consistently fast response times, the GPT-4o and GPT-4o mini models will continue to be the right choice. However, if you're aiming to develop applications that demand deep reasoning and can accommodate longer response times, the o1 models could be an excellent choice.

Some key points I picked up from the docs:

API access to the new o1-preview and o1-mini models is currently reserved for tier 5 accounts - you’ll need to have spent at least $1,000 on API credits.
No system prompt support - the models use the existing chat completion API but you can only send user and assistant messages.
No streaming support, tool usage, batch calls or image inputs either.
“Depending on the amount of reasoning required by the model to solve the problem, these requests can take anywhere from a few seconds to several minutes.”

Most interestingly is the introduction of “reasoning tokens” - tokens that are not visible in the API response but are still billed and counted as output tokens. These tokens are where the new magic happens.

Thanks to the importance of reasoning tokens - OpenAI suggests allocating a budget of around 25,000 of these for prompts that benefit from the new models - the output token allowance has been increased dramatically - to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini! These are an increase from the gpt-4o and gpt-4o-mini models which both currently have a 16,384 output token limit.

One last interesting tip from that API documentation:

Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.

This is a big change from how RAG is usually implemented, where the advice is often to cram as many potentially relevant documents as possible into the prompt.

Hidden reasoning tokens

A frustrating detail is that those reasoning tokens remain invisible in the API - you get billed for them, but you don't get to see what they were. OpenAI explain why in Hiding the Chains of Thought:

Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

So two key reasons here: one is around safety and policy compliance: they want the model to be able to reason about how it's obeying those policy rules without exposing intermediary steps that might include information that violates those policies. The second is what they call competitive advantage - which I interpret as wanting to avoid other models being able to train against the reasoning work that they have invested in.

I'm not at all happy about this policy decision. As someone who develops against LLMs interpretability and transparency are everything to me - the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards.

Examples

OpenAI provide some initial examples in the Chain of Thought section of their announcement, covering things like generating Bash scripts, solving crossword puzzles and calculating the pH of a moderately complex solution of chemicals.

These examples show that the ChatGPT UI version of these models does expose details of the chain of thought... but it doesn't show the raw reasoning tokens, instead using a separate mechanism to summarize the steps into a more human-readable form.

OpenAI also have two new cookbooks with more sophisticated examples, which I found a little hard to follow:

Using reasoning for data validation shows a multiple step process for generating example data in an 11 column CSV and then validating that in various different ways.
Using reasoning for routine generation showing o1-preview code to transform knowledge base articles into a set of routines that an LLM can comprehend and follow.

I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview. A couple of my favourites:

How many words are in your response to this prompt? by Matthew Berman - the model thinks for ten seconds across five visible turns before answering "There are seven words in this sentence."
Explain this joke: “Two cows are standing in a field, one cow asks the other: “what do you think about the mad cow disease that’s going around?”. The other one says: “who cares, I’m a helicopter!” by Fabian Stelzer - the explanation makes sense, apparently other models have failed here.

Great examples are still a bit thin on the ground though. Here's a relevant note from OpenAI researcher Jason Wei, who worked on creating these new models:

Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Ethan Mollick has been previewing the models for a few weeks, and published his initial impressions. His crossword example is particularly interesting for the visible reasoning steps, which include notes like:

I noticed a mismatch between the first letters of 1 Across and 1 Down. Considering "CONS" instead of "LIES" for 1 Across to ensure alignment.

What's new in all of this

It's going to take a while for the community to shake out the best practices for when and where these models should be applied. I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet), but it's going to be really interesting to see us collectively expand our mental model of what kind of tasks can be solved using LLMs given this new class of model.

I expect we'll see other AI labs, including the open model weights community, start to replicate some of these results with their own versions of models that are specifically trained to apply this style of chain-of-thought reasoning.

Tags: ai, openai, prompt-engineering, generative-ai, llms, o1