Simon Willison’s Weblog


It’s infuriatingly hard to understand how closed models train on their input

4th June 2023

One of the most common concerns I see about large language models regards their training data. People are worried that anything they say to ChatGPT could be memorized by it and spat out to other users. People are concerned that anything they store in a private repository on GitHub might be used as training data for future versions of Copilot.

When someone asked Google Bard how it was trained back in March, it told them its training data included Gmail! This turned out to be a complete fabrication—a hallucination by the model itself—and Google issued firm denials, but it’s easy to see why that freaked people out.

I’ve been wanting to write something reassuring about this issue for a while now. The problem is... I can’t do it. I don’t have the information I need to credibly declare these concerns unfounded, and the more I look into this the murkier it seems to get.

Closed model vendors won’t tell you what’s in their training data

The fundamental issue here is one of transparency. The builders of the big closed models—GPT-3, GPT-4, Google’s PaLM and PaLM 2, Anthropic’s Claude—refuse to tell us what’s in their training data.

Given this lack of transparency, there’s no way to confidently state that private data that is passed to them isn’t being used to further train future versions of these models.

I’ve spent a lot of time digging around in openly available training sets. I built an early tool for searching the training set for Stable Diffusion. I can tell you exactly what has gone into the RedPajama training set that’s being used for an increasing number of recent openly licensed language models.

But for those closed models? Barring loose, high-level details that are revealed piecemeal in blog posts and papers, I have no idea what’s in them.

What OpenAI do and don’t tell us

The good news is that OpenAI have an unambiguous policy regarding data that is sent to them by API users who are paying for the service:

OpenAI does not use data submitted by customers via our API to train OpenAI models or improve OpenAI’s service offering.

That’s very clear. It’s worth noting that this is a new policy though, introduced in March. The API data usage policies page includes this note:

Data submitted to the API prior to March 1, 2023 (the effective date of this change) may have been used for improvements if the customer had not previously opted out of sharing data.

Where things get a lot murkier is ChatGPT itself. Emphasis mine:

We don’t use data for selling our services, advertising, or building profiles of people—we use data to make our models more helpful for people. ChatGPT, for instance, improves by further training on the conversations people have with it, unless you choose to disable training.

But what does this mean in practice?

My initial assumption had been that this isn’t as simple as anything you type into ChatGPT being used as raw input for further rounds of model training—I expected it was more about using that input to identify trends in the kinds of questions people ask, or using feedback from the up/down vote buttons to further fine-tune the model.

But honestly, I have no idea. Maybe they just run a regular expression to strip out phone numbers and email addresses and pipe everything else straight into the GPT-5 training runs? Without further transparency all we can do is guess.
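To make that guess concrete, here is a minimal sketch of what regex-only scrubbing could look like. Everything in it is hypothetical: the patterns, the `scrub` function and the placeholder tokens are my own illustration, not anything OpenAI has described.

```python
import re

# Deliberately simple, hypothetical patterns. Real PII detection is much harder.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace anything that looks like an email address or phone number."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(scrub("Mail me at jane@example.com or call +1 (555) 123-4567."))
# A name and a street address contain nothing these patterns recognize,
# so this line comes back unchanged.
print(scrub("My name is Jane Doe and I live at 12 Elm Street."))
```

The second call is the interesting one: names, addresses and anything else without a rigid format pass straight through, which is why a filter like this would be a weak guarantee on its own.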

A clue from the InstructGPT paper

The best clue I’ve seen as to how this data might actually be used comes from OpenAI’s description of InstructGPT back in January 2022:

To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API[A] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.

Crucially, this hints that the data isn’t being used as raw input for future trained models. Instead, it’s being used in an exercise where several potential outputs are produced and human labelers then select which of those is the best possible answer to the prompt. Aside from exposing potentially private data to those human labelers, I don’t see this as a risk for leaking that data in the later output of the model.
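As a rough illustration of that kind of exercise, the sketch below turns one labeler ranking into pairwise preference records. The `Comparison` class and `ranking_to_comparisons` function are names I made up for this example. It shows the general shape of preference data, under the assumption that rankings are expanded into pairs, and is not a description of OpenAI's actual pipeline.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Comparison:
    """One hypothetical preference record: for this prompt, preferred beats rejected."""
    prompt: str
    preferred: str
    rejected: str


def ranking_to_comparisons(prompt: str, ranked_outputs: list[str]) -> list[Comparison]:
    """Expand a best-first ranking into every (better, worse) pair."""
    # combinations() keeps input order, so the first item of each pair
    # always ranks higher than the second.
    return [
        Comparison(prompt, better, worse)
        for better, worse in combinations(ranked_outputs, 2)
    ]


pairs = ranking_to_comparisons(
    "Summarize this email thread",
    ["output A", "output B", "output C"],  # labeler's ranking, best first
)
for pair in pairs:
    print(pair.preferred, ">", pair.rejected)
```

Note that the prompt itself is still stored in every record, which is why the labelers (and anyone with access to the dataset) can see it even if the model is only learning which response is preferred.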

That [A] footnote turns out to be important:

We only use prompts submitted through the Playground to an earlier version of the InstructGPT models that was deployed in January 2021. Our human annotators remove personal identifiable information from all prompts before adding it to the training set.

Again though, I’m left with even more questions. This was before ChatGPT existed, so was the Playground development tool being treated separately from the API itself back then? What does “adding it to the training set” mean—is that the raw pre-training data used for future models, or is it the RLHF data used for the fine-tuning that they mentioned earlier?

Security leaks are another threat

Aside from training concerns, there’s another danger to consider here: the risk that an AI vendor might log inputs to their models and then suffer from a security flaw that exposes that data to attackers—or an insider threat where vendor employees access logged data that they shouldn’t.

OpenAI themselves had a widely publicized security issue a few months ago where ChatGPT users could see summarized titles of sessions by other users. This is an extremely bad breach!

Their new site appears to be entirely aimed at reassuring companies about their approach to security.

To be fair, this is not a new issue: companies have been trusting their private data to cloud providers like AWS and Google Cloud for more than a decade.

The challenge is that these AI companies have much less of a track record for staying secure. AWS and Google Cloud have large security teams with many years of experience securing their customers’ data. These newer AI vendors are building up those capabilities as they go.

Self-hosted, openly licensed models

I’ve been tracking the meteoric rise of openly licensed LLMs you can run on your own hardware since LLaMA and Alpaca demonstrated how capable they could be back in March.

These models aren’t yet anywhere near as capable as GPT-4, and claims that they compete with ChatGPT’s gpt-3.5-turbo mostly don’t hold up to deeper scrutiny.

But... they’re pretty good—and they’re getting better at an impressive rate.

And since you can run them on your own instances, they remove all possible concerns about what happens to the data that you pipe through them.

An open question for me remains how large a large language model actually needs to be in order to solve the kind of problems companies need to solve. Could a weaker, openly licensed model armed with the same retrieval augmented generation tricks that we’ve seen from Bing and Bard be capable enough to remove the need for a closed model like GPT-4?

My hunch is that for many applications these augmented openly licensed models will be increasingly capable, and will see widespread adoption over the next few months and years.
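For anyone unfamiliar with the retrieval augmented generation trick mentioned above, here is a toy sketch of the idea: find the most relevant documents for a question, then paste them into the prompt that gets sent to the model. The function names and the word-overlap scoring are assumptions chosen for brevity. Real systems typically use a search index or embeddings, and the call to the model itself is left out.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt containing only the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


docs = [
    "The refund window is 30 days.",
    "Our office is in Lisbon.",
    "Refund requests go to billing.",
]
print(build_prompt("How long is the refund window?", docs, k=1))
```

With a self-hosted model, both the documents and the assembled prompt stay on hardware you control, which is the appeal of pairing this technique with an openly licensed model.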

Bonus section: does GitHub use private repos to train future models?

This question came up on Hacker News this morning. GitHub’s Privacy & Data Sharing policy says the following:

Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.

Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.

I interpret this as GitHub saying that no employee will ever see the contents of your private repo (barring incidents where they are compelled by law), and that the only data that might be shared with partners is “aggregate data learned from our analysis”.

But what is “aggregate data”?

Could a large language model trained on that data fit under that term? I don’t think so, but the terminology is vague enough that once again I’m not ready to stake my reputation on it.

Clarity on this kind of thing is just so important. I think organizations like GitHub need to over-communicate here, and avoid any terminology like “aggregate data” that could leave people confused.

Thanks to Andy Baio and Fred Benenson for reviewing early drafts of this post.