Simon Willison’s Weblog

On openai 148 googleio 6 chatgpt 102 gemini 12 security 449 ...


Recent entries

Training is not the same as chatting: ChatGPT and other LLMs don’t remember everything you say one day ago

I’m beginning to suspect that one of the most common misconceptions about LLMs such as ChatGPT involves how “training” works.

A common complaint I see about these tools is that people don’t want to even try them out because they don’t want to contribute to their training data.

This is by no means an irrational position to take, but it does often correspond to an incorrect mental model about how these tools work.

Short version: ChatGPT and other similar tools do not directly learn from and memorize everything that you say to them.

This can be quite unintuitive: these tools imitate a human conversational partner, and humans constantly update their knowledge based on what you say to to them. Computers have much better memory than humans, so surely ChatGPT would remember every detail of everything you ever say to it. Isn’t that what “training” means?

That’s not how these tools work.

LLMs are stateless functions

From a computer science point of view, it’s best to think of LLMs as stateless function calls. Given this input text, what should come next?

In the case of a “conversation” with a chatbot such as ChatGPT or Claude or Google Gemini, that function input consists of the current conversation (everything said by both the human and the bot) up to that point, plus the user’s new prompt.

Every time you start a new chat conversation, you clear the slate. Each conversation is an entirely new sequence, carried out entirely independently of previous conversations from both yourself and other users.

Understanding this is key to working effectively with these models. Every time you hit “new chat” you are effectively wiping the short-term memory of the model, starting again from scratch.

This has a number of important consequences:

  1. There is no point at all in “telling” a model something in order to improve its knowledge for future conversations. I’ve heard from people who have invested weeks of effort pasting new information into ChatGPT sessions to try and “train” a better bot. That’s a waste of time!
  2. Understanding this helps explain why the “context length” of a model is so important. Different LLMs have different context lengths, expressed in terms of “tokens”—a token is about 3/4s of a word. This is the number that tells you how much of a conversation the bot can consider at any one time. If your conversation goes past this point the model will “forget” details that occurred at the beginning of the conversation.
  3. Sometimes it’s a good idea to start a fresh conversation in order to deliberately reset the model. If a model starts making obvious mistakes, or refuses to respond to a valid question for some weird reason that reset might get it back on the right track.
  4. Tricks like Retrieval Augmented Generation and ChatGPT’s “memory” make sense only once you understand this fundamental limitation to how these models work.
  5. If you’re excited about local models because you can be certain there’s no way they can train on your data, you’re mostly right: you can run them offline and audit your network traffic to be absolutely sure your data isn’t being uploaded to a server somewhere. But...
  6. ... if you’re excited about local models because you want something on your computer that you can chat to and it will learn from you and then better respond to your future prompts, that’s probably not going to work.

So what is “training” then?

When we talk about model training, we are talking about the process that was used to build these models in the first place.

As a big simplification, there are two phases to this. The first is to pile in several TBs of text—think all of Wikipedia, a scrape of a large portion of the web, books, newspapers, academic papers and more—and spend months of time and potentially millions of dollars in electricity crunching through that “pre-training” data identifying patterns in how the words relate to each other.

This gives you a model that can complete sentences, but not necessarily in a way that will delight and impress a human conversational partner. The second phase aims to fix that—this can incorporate instruction tuning or Reinforcement Learning from Human Feedback (RLHF) which has the goal of teaching the model to pick the best possible sequences of words to have productive conversations.

The end result of these phases is the model itself—an enormous (many GB) blob of floating point numbers that capture both the statistical relationships between the words and some version of “taste” in terms of how best to assemble new words to reply to a user’s prompts.

Once trained, the model remains static and unchanged—sometimes for months or even years.

Here’s a note from Jason D. Clinton, an engineer who works on Claude 3 at Anthropic:

The model is stored in a static file and loaded, continuously, across 10s of thousands of identical servers each of which serve each instance of the Claude model. The model file never changes and is immutable once loaded; every shard is loading the same model file running exactly the same software.

These models don’t change very often!

Reasons to worry anyway

A frustrating thing about this issue is that it isn’t actually possible to confidently state “don’t worry, ChatGPT doesn’t train on your input”.

Many LLM providers have terms and conditions that allow them to improve their models based on the way you are using them. Even when they have opt-out mechanisms these are often opted-in by default.

When OpenAI say “We may use Content to provide, maintain, develop, and improve our Services” it’s not at all clear what they mean by that!

Are they storing up everything anyone says to their models and dumping that into the training run for their next model versions every few months?

I don’t think it’s that simple: LLM providers don’t want random low-quality text or privacy-invading details making it into their training data. But they are notoriously secretive, so who knows for sure?

The opt-out mechanisms are also pretty confusing. OpenAI try to make it as clear as possible that they won’t train on any content submitted through their API (so you had better understand what an “API” is), but lots of people don’t believe them! I wrote about the AI trust crisis last year: the pattern where many people actively disbelieve model vendors and application developers (such as Dropbox and Slack) that claim they don’t train models on private data.

People also worry that those terms might change in the future. There are options to protect against that: if you’re spending enough money you can sign contracts with OpenAI and other vendors that freeze the terms and conditions.

If your mental model is that LLMs remember and train on all input, it’s much easier to assume that developers who claim they’ve disabled that ability may not be telling the truth. If you tell your human friend to disregard a juicy piece of gossip you’ve mistakenly passed on to them you know full well that they’re not going to forget it!

The other major concern is the same as with any cloud service: it’s reasonable to assume that your prompts are still logged for a period of time, for compliance and abuse reasons, and if that data is logged there’s always a chance of exposure thanks to an accidental security breach.

What about “memory” features?

To make things even more confusing, some LLM tools are introducing features that attempt to work around this limitation.

ChatGPT recently added a memory feature where it can “remember” small details and use them in follow-up conversations.

As with so many LLM features this is a relatively simple [prompting trick]( during a conversation the bot can call a mechanism to record a short note—your name, or a preference you have expressed—which will then be invisibly included in the chat context passed in future conversations.

You can review (and modify) the list of remembered fragments at any time, and ChatGPT shows a visible UI element any time it adds to its memory.

Bad policy based on bad mental models

One of the most worrying results of this common misconception concerns people who make policy decisions for how LLM tools should be used.

Does your company ban all use of LLMs because they don’t want their private data leaked to the model providers?

They’re not 100% wrong—see reasons to worry anyway—but if they are acting based on the idea that everything said to a model is instantly memorized and could be used in responses to other users they’re acting on faulty information.

Even more concerning is what happens with lawmakers. How many politicians around the world are debating and voting on legislation involving these models based on a science fiction idea of what they are and how they work?

If people believe ChatGPT is a machine that instantly memorizes and learns from everything anyone says to it there is a very real risk they will support measures that address invented as opposed to genuine risks involving this technology.

Weeknotes: PyCon US 2024 one day ago

Earlier this month I attended PyCon US 2024 in Pittsburgh, Pennsylvania. I gave an invited keynote on the Saturday morning titled “Imitation intelligence”, tying together much of what I’ve learned about Large Language Models over the past couple of years and making the case that the Python community has a unique opportunity and responsibility to help try to nudge this technology in a positive direction.

The video isn’t out yet but I’ll publish detailed notes to accompany my talk (using my annotated presentation format) as soon as it goes live on YouTube.

PyCon was a really great conference. Pittsburgh is a fantastic city, and I’m delighted that PyCon will be in the same venue next year so I can really take advantage of the opportunity to explore in more detail.

I also realized that it’s about time Datasette participated in the PyCon sprints—the project is mature enough for that to be a really valuable opportunity now. I’m looking forward to leaning into that next year.

I’m on a family-visiting trip back to the UK at the moment, so taking a bit of time off from my various projects.

LLM support for new models

The big new language model releases from May were OpenAI GPT-4o and Google’s Gemini Flash. I released LLM 0.14, datasette-extract 0.1a7 and datasette-enrichments-gpt 0.5 with support for GPT-4o, and llm-gemini 0.1a4 adding support for the new inexpensive Gemini 1.5 Flash.

Gemini 1.5 Flash is a particularly interesting model: it’s now ranked 9th on the LMSYS leaderboard, beating Llama 3 70b. It’s inexpensive, priced close to Claude 3 Haiku, and can handle up to a million tokens of context.

I’m also excited about GPT-4o—half the price of GPT-4 Turbo, around twice as fast and it appears to be slightly more capable too. I’ve been getting particularly good results from it for structured data extraction using datasette-extract—it seems to be able to more reliably produce a longer sequence of extracted rows from a given input.

Blog entries



ChatGPT in “4o” mode is not running the new features yet 14 days ago

Monday’s OpenAI announcement of their new GPT-4o model included some intriguing new features:

  • Creepily good improvements to the ability to both understand and produce voice (Sam Altman simply tweeted “her”), and to be interrupted mid-sentence
  • New image output capabilities that appear to leave existing models like DALL-E 3 in the dust—take a look at the examples, they seem to have solved consistent character representation AND reliable text output!

They also made the new 4o model available to paying ChatGPT Plus users, on the web and in their apps.

But, crucially, those big new features were not part of that release.

Here’s the relevant section from the announcement post:

We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities.

This is catching out a lot of people. The ChatGPT iPhone app already has image output, and it already has a voice mode. These worked with the previous GPT-4 mode and they still work with the new GPT-4o mode... but they are not using the new model’s capabilities.

Lots of people are discovering the voice mode for the first time—it’s the headphone icon in the bottom right of the interface.

They try it and it’s impressive (it was impressive before) but it’s nothing like as good as the voice mode in Monday’s demos.

Honestly, it’s not at all surprising that people are confused. They’re seeing the “4o” option and, understandably, are assuming that this is the set of features that were announced earlier this week.

Screenshot of the ChatGPT iPhone app. An arrow points to the 4o indicator in the title saying GPT-4o - another arrow points to the headphone icon at the bottom saying Not GPT-4o

Most people don’t distinguish models from features

Think about what you need to know in order to understand what’s going on here:

GPT-4o is a brand new multi-modal Large Language Model. It can handle text, image and audio input and produce text, image and audio output.

But... the version of GPT-4o that has been made available so far—both via the API and via the OpenAI apps—is only able to handle text and image input and produce text output. The other features are not yet available outside of OpenAI (and a select group of partners).

And yet in the apps it can still handle audio input and output and generate images. That’s because the app version of the model is wrapped with additional tools.

The audio input is handled by a separate model called Whisper, which converts speech to text. That text is then fed into the LLM, which generates a text response.

The response is passed to OpenAI’s boringly-named tts-1 (or maybe tts-1-hd) model (described here), which converts that text to speech.

While nowhere near as good as the audio in Monday’s demo, tts-1 is still a really impressive model. I’ve been using it via my ospeak CLI tool since it was released back in November.

As for images? Those are generated using DALL-E 3, through a process where ChatGPT directly prompts that model. I wrote about how that works back in October.

So what’s going on with ChatGPT’s GPT-4o mode is completely obvious, provided you already understand:

  • GPT-4 v.s. GPT-4o
  • Whisper
  • tts-1
  • DALL-E 3
  • Why OpenAI would demonstrate these features and then release a version of the model that doesn’t include them

I’m reminded of the kerfluffle back in March when the Google Gemini image creator was found to generate images of Black Nazis. I saw a whole bunch of people refer to that in conversations about the Google Gemini Pro 1.5 LLM, released at the same time, despite the quality of that model being entirely unrelated to Google’s policy decisions about how one of the interfaces to that model should make use of the image creator tool.

What can we learn from this?

If you’re fully immersed in this world, it’s easy to lose track of how incredibly complicated these systems have become. The amount you have to know in order to even understand what that “4o” mode in the ChatGPT app does is very easy to underestimate.

Fundamentally these are challenges in user experience design. You can’t just write documentation about them, because no-one reads documentation.

A good starting here is to acknowledge the problem. LLM systems are extremely difficult to understand and use. We need to design the tools we build on top of them accordingly.

Update: a UI workaround

On May 16th around 1PM PT OpenAI released a new iPhone app update which adds the following warning message the first time you try to access that headphones icon:

New Voice Mode coming soon

We plan to launch a new Voice Mode with new GPT-4o capabilities in an alpha within ChatGPT Plus in the coming weeks. We’ll let you know when you have access.

Slop is the new name for unwanted AI-generated content 21 days ago

I saw this tweet yesterday from @deepfates, and I am very on board with this:

Watching in real time as “slop” becomes a term of art. the way that “spam” became the term for unwanted emails, “slop” is going in the dictionary as the term for unwanted AI generated content

I’m a big proponent of LLMs as tools for personal productivity, and as software platforms for building interesting applications that can interact with human language.

But I’m increasingly of the opinion that sharing unreviewed content that has been artificially generated with other people is rude.

Slop is the ideal name for this anti-pattern.

Not all promotional content is spam, and not all AI-generated content is slop. But if it’s mindlessly generated and thrust upon someone who didn’t ask for it, slop is the perfect term for it.

Remember that time Microsoft listed the Ottawa Food Bank on an AI-generated “Here’s what you shoudn’t miss!” travel guide? Perfect example of slop.

One of the things I love about this is that it’s helpful for defining my own position on AI ethics. I’m happy to use LLMs for all sorts of purposes, but I’m not going to use them to produce slop. I attach my name and stake my credibility on the things that I publish.

Personal AI ethics remains a complicated set of decisions. I think don’t publish slop is a useful baseline.

Update 9th May: Joseph Thacker asked what a good name would be for the equivalent subset of spam—spam that was generated with AI tools.

I propose “slom”.

Venn diagram: the left-hand circle is red and labeled spam, the right hand circle is green and labeled slop, the overlap in the middle is labeled slom

Weeknotes: more datasette-secrets, plus a mystery video project 22 days ago

I introduced datasette-secrets two weeks ago. The core idea is to provide a way for end-users to store secrets such as API keys in Datasette, allowing other plugins to access them.

datasette-secrets 0.2 is the first non-alpha release of that project. The big new feature is that the plugin is now compatible with both the Datasette 1.0 alphas and the stable releases of Datasette (currently Datasette 0.64.6).

My policy at the moment is that a plugin that only works with the Datasette 1.0 alphas must itself be an alpha release. I’ve been feeling the weight of this as the number of plugins that depend on 1.0a has grown—on the one hand it’s a great reason to push through to that 1.0 stable release, but it’s painful to have so many features that are incompatible with current Datasette.

This came to a head with Datasette Enrichments. I wanted to start consuming secrets from enrichments such as datasette-enrichments-gpt and datasette-enrichments-opencage, but I didn’t want the whole enrichments ecosystem to become 1.0a only.

Patterns for plugins that work against multiple Datasette versions

I ended up building out quite a bit of infrastructure to help support plugins that work with both versions.

I already have a GitHub Actions pattern for running tests against both versions, which looks like this:

    runs-on: ubuntu-latest
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
        datasette-version: ["<1.0", ">=1.0a13"]
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
        python-version: ${{ matrix.python-version }}
        cache: pip
        cache-dependency-path: pyproject.toml
    - name: Install dependencies
      run: |
        pip install '.[test]'
        pip install "datasette${{ matrix.datasette-version }}"
    - name: Run tests
      run: |

This uses a GitHub Actions matrix to run the test suite ten times—five against Datasette <1.0 on different Python versions and then five again on Datasette >=1.0a13.

One of the big changes in Datasette 1.0 involves the way plugins are configured. I have a datasette-test library to help paper over those differences, which can be used like this:

from datasette_test import Datasette

def test_something():
    datasette = Datasette(
            "datasette-secrets": {
                "database": "_internal",
                "encryption-key": TEST_ENCRYPTION_KEY,
        permissions={"manage-secrets": {"id": "admin"}},

The plugin_config= argument there is unique to that datasette_test.Datasette() class constructor, and does the right thing against both versions of Datasette. permissions= is a similar utility function. Both are described in the datasette-test README.

The PR adding <1.0 and >1.0a compatibility has a few more details of changes I made to get datasette-secrets to work with both versions.

Here’s what the secrets management interface looks like now:

Manage secrets creen in Datasette Cloud. Simon Willison is logged in. A secret called OpenAI_API_KEY is at version 1, last updated by swillison on 25th April.

Adding secrets to enrichments

I ended up changing the core enrichments framework to add support for secrets. The new mechanism is documented here—but the short version is you can now define an Enrichments subclass that looks like this:

from datasette_enrichments import Enrichment
from datasette_secrets import Secret

class TrainEnthusiastsEnrichment(Enrichment):
    name = "Train Enthusiasts"
    slug = "train-enthusiasts"
    description = "Enrich with extra data from the Train Enthusiasts API"
    secret = Secret(
        description="An API key from train-enthusiasts.doesnt.exist",
        obtain_label="Get an API key"

This imaginary enrichment will now do the following:

  1. If a TRAIN_ENTHUSIASTS_API_KEY environment variable is present it will use that without asking for an API key.
  2. A user with sufficient permissions, in a properly configured Datasette instance, can visit the “Manage secrets” page to set that API key such that it will be encrypted and persisted in the Datasette invisible “internal” database.
  3. If neither of those are true, the enrichment will ask for an API key every time a user tries to run it. That API key will be kept in memory, used and then discarded—it will not be persisted anywhere.

There are still a bunch more enrichments that need to be upgraded to the new pattern, but those upgrades are now a pretty straightforward process.

Mystery video

I’ve been collaborating on a really fun video project for the past few weeks. More on this when it’s finished, but it’s been a wild experience. I can’t wait to see how it turns out, and share it with the world.



Weeknotes: Llama 3, AI for Data Journalism, llm-evals and datasette-secrets one month ago

Llama 3 landed on Thursday. I ended up updating a whole bunch of different plugins to work with it, described in Options for accessing Llama 3 from the terminal using LLM.

I also wrote up the talk I gave at Stanford a few weeks ago: AI for Data Journalism: demonstrating what we can do with this stuff right now.

That talk had 12 different live demos in it, and a bunch of those were software that I hadn’t released yet when I gave the talk—so I spent quite a bit of time cleaning those up for release. The most notable of those is datasette-query-assistant, a plugin built on top of Claude 3 that takes a question in English and converts that into a SQL query. Here’s the section of that video with the demo.

I’ve also spun up two new projects which are still very much in the draft stage.


Ony of my biggest frustrations in working with LLMs is that I still don’t have a great way to evaluate improvements to my prompts. Did capitalizing OUTPUT IN JSON really make a difference? I don’t have a great mechanism for figuring that out.

datasette-query-assistant really needs this: Which models are best at generating SQLite SQL? What prompts make it most likely I’ll get a SQL query that executes successfully against the schema?

llm-evals-plugin (llmevals was taken on PyPI already) is a very early prototype of an LLM plugin that I hope to use to address this problem.

The idea is to define “evals” as YAML files, which might look something like this (format still very much in flux):

name: Simple translate
system: |
  Return just a single word in the specified language
prompt: |
  Apple in Spanish
- iexact: manzana
- notcontains: apple

Then, to run the eval against multiple models:

llm install llm-evals-plugin
llm evals simple-translate.yml -m gpt-4-turbo -m gpt-3.5-turbo

Which currently outputs this:

('gpt-4-turbo-preview', [True, True])
('gpt-3.5-turbo', [True, True])

Those checks: are provided by a plugin hook, with the aim of having plugins that add new checks like sqlite_execute: [["1", "Apple"]] that run SQL queries returned by the model and assert against the results—or even checks like js: response_text == 'manzana' that evaluate using a programming language (in that case using quickjs to run code in a sandbox).

This is still a rough sketch of how the tool will work. The big missing feature at the moment is parameterization: I want to be able to try out different prompt/system prompt combinations and run a whole bunch of additional examples that are defined in a CSV or JSON or YAML file.

I also want to record the results of those runs to a SQLite database, and also make it easy to dump those results out in a format that’s suitable for storing in a GitHub repository in order to track differences to the results over time.

This is a very early idea. I may find a good existing solution and use that instead, but for the moment I’m enjoying using running code as a way to explore a new problem space.


datasette-secrets is another draft project, this time a Datasette plugin.

I’m increasingly finding a need for Datasette plugins to access secrets—things like API keys. datasette-extract and datasette-enrichments-gpt both need an OpenAI API key, datasette-enrichments-opencage needs OpenCage Geocoder and datasette-query-assistant needs a key for Anthropic’s Claude.

Currently those keys are set using environment variables, but for both Datasette Cloud and Datasette Desktop I’d like users to be able to bring their own keys, without messing around with their environment.

datasette-secrets adds a UI for entering registered secrets, available to administrator level users with the manage-secrets permission. Those secrets are stored encrypted in the SQLite database, using symmetric encryption powered by the Python cryptography library.

The goal of the encryption is to ensure that if someone somehow obtains the SQLite database itself they won’t be able to access the secrets contained within, unless they also have access to the encryption key which is stored separately.

The next step with datasette-secrets is to ship some other plugins that use it. Once it’s proved itself there (and in an alpha release to Datasette Cloud) I’ll remove the alpha designation and start recommending it for use in other plugins.

Datasette screenshot. A message at the top reads: Note updated: OPENAL_API_KEY. The manage secrets screen then lists ANTHROPI_API_KEY, EXAMPLE_SECRET and OPENAI_API_KEY, each with a note, a version, when they were last updated and who updated them. The bottom of the screen says These secrets have not been set: and lists DEMO_SECRET_ONE and DEMO_SECRET_TWO





  • The realization hit me [when the GPT-3 paper came out] that an important property of the field flipped. In ~2011, progress in AI felt constrained primarily by algorithms. We needed better ideas, better modeling, better approaches to make further progress. If you offered me a 10X bigger computer, I’m not sure what I would have even used it for. GPT-3 paper showed that there was this thing that would just become better on a large variety of practical tasks, if you only trained a bigger one. Better algorithms become a bonus, not a necessity for progress in AGI. Possibly not forever and going forward, but at least locally and for the time being, in a very practical sense. Today, if you gave me a 10X bigger computer I would know exactly what to do with it, and then I’d ask for more.

    Andrej Karpathy # 30th May 2024, 7:27 am


  • In their rush to cram in “AI” “features”, it seems to me that many companies don’t actually understand why people use their products. [...] Trust is a precious commodity. It takes a long time to build trust. It takes a short time to destroy it.

    Jeremy Keith # 29th May 2024, 11:06 am

  • Sometimes the most creativity is found in enumerating the solution space. Design is the process of prioritizing tradeoffs in a high dimensional space. Understand that dimensionality.

    Chris Perry # 29th May 2024, 7:17 am

28th May 2024

27th May 2024

26th May 2024

25th May 2024

24th May 2024

  • I just left Google last month. The “AI Projects” I was working on were poorly motivated and driven by this panic that as long as it had “AI” in it, it would be great. This myopia is NOT something driven by a user need. It is a stone cold panic that they are getting left behind.

    The vision is that there will be a Tony Stark like Jarvis assistant in your phone that locks you into their ecosystem so hard that you’ll never leave. That vision is pure catnip. The fear is that they can’t afford to let someone else get there first.

    Scott Jenson # 24th May 2024, 6:33 am

  • The leader of a team—especially a senior one—is rarely ever the smartest, the most expert or even the most experienced.

    Often it’s the person who can best understand individuals’ motivations and galvanize them towards an outcome, all while helping them stay cohesive.

    Nivia Henry # 24th May 2024, 6:09 am

  • But increasingly, I’m worried that attempts to crack down on the cryptocurrency industry — scummy though it may be — may result in overall weakening of financial privacy, and may hurt vulnerable people the most. As they say, “hard cases make bad law”.

    Molly White # 24th May 2024, 1:19 am

23rd May 2024

  • The most effective mechanism I’ve found for rolling out No Wrong Door is initiating three-way conversations when asked questions. If someone direct messages me a question, then I will start a thread with the question asker, myself, and the person I believe is the correct recipient for the question. This is particularly effective because it’s a viral approach: rolling out No Wrong Door just requires any one of the three participants to adopt the approach.

    Will Larson # 23rd May 2024, 2:48 pm

22nd May 2024

  • The default prefix used to be “sqlite_”. But then Mcafee started using SQLite in their anti-virus product and it started putting files with the “sqlite” name in the c:/temp folder. This annoyed many windows users. Those users would then do a Google search for “sqlite”, find the telephone numbers of the developers and call to wake them up at night and complain. For this reason, the default name prefix is changed to be “sqlite” spelled backwards.

    D. Richard Hipp, 18 years ago # 22nd May 2024, 4:21 am

21st May 2024