Understanding GPT tokenizers one hour ago
Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.
Playing around with these tokens is an interesting way to get a better idea for how this stuff actually works under the hood.
OpenAI offer a Tokenizer tool for exploring how tokens work.
I’ve built my own, slightly more interesting tool as an Observable notebook:
You can use the notebook to convert text to tokens, tokens to text and also to run searches against the full token table.
Here’s what the notebook looks like:
The text I’m tokenizing here is:
The dog eats the apples
El perro come las manzanas
片仮名
This produces 21 integer tokens: 5 for the English text, 8 for the Spanish text and six (two each) for those three Japanese characters. The two newlines are each represented by tokens as well.
The notebook uses the tokenizer from GPT-2 (borrowing from this excellent notebook by EJ Fox and Ian Johnson), so it’s useful primarily as an educational tool—there are differences between how it works and the latest tokenizers for GPT-3 and above.
Exploring some interesting tokens
Playing with the tokenizer reveals all sorts of interesting patterns.
Most common English words are assigned a single token. As demonstrated above:
- “The”: 464
- “ dog”: 3290
- “ eats”: 25365
- “ the”: 262
- “ apples”: 22514
Note that capitalization is important here. “The” with a capital T is token 464, but “ the” with both a leading space and a lowercase t is token 262.
Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.
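To make the leading-space point concrete, here is a minimal sketch that uses only the five token strings and IDs listed above. It is a greedy longest-match lookup over a toy vocabulary, not the byte pair encoding that the real GPT-2 tokenizer performs, and the names `TOY_VOCAB`, `toy_encode` and `toy_decode` are made up for this illustration.

```python
# Toy vocabulary: token strings and IDs copied from the list above.
TOY_VOCAB = {
    "The": 464,
    " dog": 3290,
    " eats": 25365,
    " the": 262,
    " apples": 22514,
}


def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match lookup. NOT real BPE, just an illustration."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining slice first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no toy token matches at position {pos}")
    return ids


def toy_decode(ids, vocab=TOY_VOCAB):
    """Map IDs back to their strings and concatenate them."""
    by_id = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(by_id[token_id] for token_id in ids)


encoded = toy_encode("The dog eats the apples")
print(encoded)
print(repr(toy_decode(encoded)))
```

Because the space travels inside tokens such as " dog", the five-word sentence needs five tokens and decoding is plain concatenation, with no separate whitespace tokens to spend.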
Languages other than English suffer from less efficient tokenization.
“El perro come las manzanas” in Spanish is encoded like this:
- “El”: 9527
- “ per”: 583
- “ro”: 305
- “ come”: 1282
- “ las”: 39990
- “ man”: 582
- “zan”: 15201
- “as”: 292
The English bias is obvious here. “ man” gets a lower token ID of 582, because it’s an English word. “zan” gets a token ID of 15201 because it’s not a word that stands alone in English, but is a common enough sequence of characters that it still warrants its own token.
Some languages even have single characters that end up encoding to multiple tokens, such as these Japanese characters:
- 片: 31965 229
- 仮: 20015 106
- 名: 28938 235
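One way to see why a single character can need more than one token: GPT-2's tokenizer is a byte-level BPE, so it works on UTF-8 bytes rather than on characters. The sketch below does not run the tokenizer at all. It only uses the Python standard library to show how many bytes each of these three characters occupies, which is the raw material the tokenizer then has to cover with tokens.

```python
# Each of these characters is a single Unicode code point,
# but its UTF-8 encoding is several bytes long.
utf8_lengths = {}
for char in "片仮名":
    raw = char.encode("utf-8")
    utf8_lengths[char] = len(raw)
    print(char, f"U+{ord(char):04X}", len(raw), "bytes:", raw.hex(" "))
```

If no single token in the vocabulary covers all of a character's bytes, the character ends up split across two or more tokens.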
A fascinating subset of tokens are what are known as “glitch tokens”. My favourite example of those is token 23282—“ davidjl”.
We can find that token by searching for “david” using the search box in the notebook:
Riley Goodside highlighted some weird behaviour with that token:
Why this happens is an intriguing puzzle.
It looks likely that this token refers to user davidjl123 on Reddit, a keen member of the /r/counting subreddit. He’s posted incremented numbers there well over 163,000 times.
Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.
But why would that break things like this? The best theory I’ve seen so far came from londons_explore on Hacker News:
These glitch tokens are all near the centroid of the token embedding space. That means that the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to ’repeat’ them, gets the wrong one.
That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).
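As a rough picture of what "near the centroid" means in that theory, here is a toy sketch. The two-dimensional vectors are invented purely for illustration (real embeddings are learned and have far more dimensions), and the helper names are hypothetical. It computes the mean of all vectors and ranks tokens by their distance from it.

```python
import math

# Invented 2-D "embeddings", NOT real model weights.
# " davidjl" is deliberately placed close to the average of the others.
TOY_EMBEDDINGS = {
    " dog": (4.0, 1.0),
    " apples": (-3.0, 2.0),
    " the": (1.0, -4.0),
    " eats": (-2.0, -3.0),
    " davidjl": (0.1, -0.9),
}


def centroid(vectors):
    """Component-wise mean of a collection of equal-length vectors."""
    vectors = list(vectors)
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))


def rank_by_distance_to_centroid(embeddings):
    """Return (token, distance) pairs, closest to the centroid first."""
    center = centroid(embeddings.values())
    return sorted(
        ((token, math.dist(vector, center)) for token, vector in embeddings.items()),
        key=lambda pair: pair[1],
    )


ranking = rank_by_distance_to_centroid(TOY_EMBEDDINGS)
for token, distance in ranking:
    print(repr(token), round(distance, 3))
```

In this made-up data the would-be glitch token sits almost exactly on the centroid, so a lookup that starts from a vector near the center has little to distinguish it from any other token that also drifted there.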
The conversation attached to the post SolidGoldMagikarp (plus, prompt generation) on LessWrong has a great deal more detail on this phenomenon.
Counting tokens with tiktoken
OpenAI’s models each have a token limit. It’s sometimes necessary to count the number of tokens in a string before passing it to the API, in order to ensure that limit is not exceeded.
One technique that needs this is Retrieval Augmented Generation, where you answer a user’s question by running a search (or an embedding search) against a corpus of documents, extract the most likely content and include that as context in a prompt.
The key to successfully implementing that pattern is to include as much relevant context as will fit within the token limit—so you need to be able to count tokens.
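A minimal sketch of that packing step, assuming the search has already returned passages in ranked order. `pack_context` and `rough_count` are hypothetical names; `rough_count` is a crude stand-in that counts whitespace-separated words, and in real use you would pass a function that counts actual tokens for your model.

```python
def pack_context(passages, budget, count_tokens):
    """Keep passages, in ranked order, until the token budget is used up."""
    chosen = []
    used = 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget:
            # Stop at the first passage that does not fit, so the
            # selection always stays a prefix of the ranked results.
            break
        chosen.append(passage)
        used += cost
    return chosen, used


def rough_count(text):
    """Stand-in counter (an assumption): words, not real tokens."""
    return len(text.split())


ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
context, used = pack_context(ranked, budget=5, count_tokens=rough_count)
print(context, used)
```

Stopping at the first passage that overflows is one design choice among several. Skipping it and trying smaller, lower-ranked passages would fill the budget more tightly at the cost of including less relevant text.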
OpenAI provide a Python library for doing this called tiktoken.
If you dig around inside the library you’ll find it currently includes five different tokenization schemes:
cl100k_base is the most relevant, being the tokenizer for both GPT-4 and the inexpensive gpt-3.5-turbo model used by current ChatGPT. p50k_base is used by text-davinci-003. A full mapping of models to tokenizers can be found in the library's MODEL_TO_ENCODING dictionary.
Here’s how to use tiktoken:
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
# or "gpt-3.5-turbo" or "text-davinci-003"

tokens = encoding.encode("Here is some text")
token_count = len(tokens)
tokens will now be an array of four integer token IDs: [8586, 374, 1063, 1495] in this case.
Use the .decode() method to turn an array of token IDs back into text:
text = encoding.decode(tokens) # 'Here is some text'
The first time you call encoding_for_model() the encoding data will be fetched over HTTP from an openaipublic.blob.core.windows.net Azure blob storage bucket (code here). This is cached in a temp directory, but that will get cleared should your machine restart. You can force it to use a more persistent cache directory by setting a TIKTOKEN_CACHE_DIR environment variable.
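If you would rather set that variable from Python than from your shell, something like the following sketch should work, provided it runs before the first call that loads an encoding. The helper name and the example directory are hypothetical; only the TIKTOKEN_CACHE_DIR variable name comes from the library.

```python
import os
from pathlib import Path


def use_persistent_tiktoken_cache(cache_dir):
    """Create cache_dir if needed and point TIKTOKEN_CACHE_DIR at it."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TIKTOKEN_CACHE_DIR"] = str(path)
    return path
```

For example, calling `use_persistent_tiktoken_cache("~/.cache/tiktoken")` near the top of a script, ahead of `encoding_for_model()`, keeps the downloaded encoding data somewhere that survives a restart.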
I introduced my ttok tool a few weeks ago. It’s a command-line wrapper around tiktoken with two key features: it can count tokens in text that is piped to it, and it can also truncate that text down to a specified number of tokens:
# Count tokens
echo -n "Count these tokens" | ttok
# Outputs: 3 (the newline is skipped thanks to echo -n)

# Truncation
curl 'https://simonwillison.net/' | strip-tags -m | ttok -t 6
# Outputs: Simon Willison’s Weblog

# View integer token IDs
echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198
Pass -m gpt2 or similar to use an encoding for a different model.
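Truncating to a token count boils down to three steps: encode, slice, decode. The sketch below shows that pattern with pluggable `encode` and `decode` callables. This is an illustration of the idea rather than ttok's actual source, and the byte-based stand-ins are an assumption so the example runs without any downloads; in practice you would pass the `encode` and `decode` methods of a tiktoken encoding.

```python
def truncate_to_tokens(text, limit, encode, decode):
    """Encode text, keep at most `limit` tokens, and decode them back."""
    tokens = encode(text)
    return decode(tokens[:limit])


def byte_encode(text):
    """Stand-in tokenizer (an assumption): one 'token' per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_decode(token_ids):
    """Inverse of byte_encode; replaces any character cut in half."""
    return bytes(token_ids).decode("utf-8", errors="replace")


shortened = truncate_to_tokens("Simon Willison", 5, byte_encode, byte_decode)
print(shortened)
```

Slicing token IDs rather than characters is what guarantees the result fits the limit, though a cut can land in the middle of a word or, with byte-level tokens, in the middle of a character.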
Watching tokens get generated
Once you understand tokens, the way GPT tools generate text starts to make a lot more sense.
In particular, it’s fun to watch GPT-4 streaming back its output as independent tokens (GPT-4 is slightly slower than 3.5, making it easier to see what’s going on).
Here’s what I get for llm -s 'Five names for a pet pelican' -4, using my llm CLI tool to generate text from GPT-4:
As you can see, names that are not in the dictionary such as “Pelly” take multiple tokens, but “Captain Gulliver” outputs the token “Captain” as a single chunk.
Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking four days ago
I’ve fallen a bit behind on my weeknotes. Here’s a catchup for the last few weeks.
Parquet in Datasette Lite
Datasette Lite is my build of Datasette (a server-side Python web application) which runs entirely in the browser using WebAssembly and Pyodide. I recently added the ability to directly load Parquet files over HTTP.
This required an upgrade to the underlying version of Pyodide, in order to use the WebAssembly compiled version of the fastparquet library. That upgrade was blocked by an AttributeError: module 'os' has no attribute 'link' error, but Roman Yurchak showed me a workaround which unblocked me.
So now the following works:
This will work with any URL to a Parquet file that is served with open CORS headers—files on GitHub (or in a GitHub Gist) get these headers automatically.
Also new in Datasette Lite: the ?memory=1 query string option, which starts Datasette Lite without loading any default demo databases. I added this to help me construct this demo for my new datasette-sqlite-url-lite plugin:
datasette-sqlite-url-lite—mostly written by GPT-4
datasette-sqlite-url is a really neat plugin by Alex Garcia which adds custom SQL functions to SQLite that allow you to parse URLs and extract their components.
There’s just one catch: the extension itself is written in C, and there isn’t yet a version of it compiled for WebAssembly to work in Datasette Lite.
I wanted to use some of the functions in it, so I decided to see if I could get a Pure Python alternative of it working. But this was a very low stakes project, so I decided to see if I could get GPT-4 to do essentially all of the work for me.
I prompted it like this—copying and pasting the examples directly from Alex’s documentation:
Write Python code to register the following SQLite custom functions:
select url_valid('https://sqlite.org'); -- 1
select url_scheme('https://www.sqlite.org/vtab.html#usage'); -- 'https'
select url_host('https://www.sqlite.org/vtab.html#usage'); -- 'www.sqlite.org'
select url_path('https://www.sqlite.org/vtab.html#usage'); -- '/vtab.html'
select url_fragment('https://www.sqlite.org/vtab.html#usage'); -- 'usage'
The code it produced was almost exactly what I needed.
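For a sense of what code answering that prompt can look like, here is a minimal sketch using only the standard library's sqlite3 and urllib.parse modules. It is my own illustration of the approach, not the code GPT-4 produced or the code in the repo, and the validity check (a scheme plus a host) is a simplifying assumption that may differ from the C extension's rules.

```python
import sqlite3
from urllib.parse import urlparse


def url_valid(url):
    """Return 1 if the URL has both a scheme and a host, else 0."""
    if not isinstance(url, str):
        return 0
    try:
        parts = urlparse(url)
    except ValueError:
        return 0
    return int(bool(parts.scheme and parts.netloc))


def _component(name):
    """Build a one-argument SQL function returning one parsed URL field."""
    def extract(url):
        if not isinstance(url, str):
            return None
        try:
            return getattr(urlparse(url), name) or None
        except ValueError:
            return None
    return extract


def register_url_functions(conn):
    """Register the five url_* functions on a sqlite3 connection."""
    conn.create_function("url_valid", 1, url_valid)
    conn.create_function("url_scheme", 1, _component("scheme"))
    conn.create_function("url_host", 1, _component("hostname"))
    conn.create_function("url_path", 1, _component("path"))
    conn.create_function("url_fragment", 1, _component("fragment"))


conn = sqlite3.connect(":memory:")
register_url_functions(conn)
row = conn.execute(
    "select url_host(?), url_path(?), url_fragment(?)",
    ["https://www.sqlite.org/vtab.html#usage"] * 3,
).fetchone()
print(row)
```

Returning NULL for a missing component or unparseable input is a choice made for this sketch; a drop-in replacement would need to match whatever the original extension does in those cases.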
I wanted some tests too, so I prompted:
Write a suite of pytest tests for this
This gave me the tests I needed—with one error in the way they called SQLite, but still doing 90% of the work for me.
Here’s the full ChatGPT conversation and the resulting code I checked into the repo.
Videos for three of my recent talks are now available on YouTube:
- Big Opportunities in Small Data is the keynote I gave at Citus Con: An Event for Postgres 2023—talking about Datasette, SQLite and some tricks I would love to see the PostgreSQL community adopt from the explorations I’ve been doing around small data.
- The Data Enthusiast’s Toolkit is an hour long interview with Rizel Scarlett about both Datasette and my career to date. Frustratingly I had about 10 minutes of terrible microphone audio in the middle, but the conversation itself was really great.
- Data analysis with SQLite and Python is a video from PyCon of the full 2hr45m tutorial I gave there last month. The handout notes for that are available online too.
I also spotted that the Changelog put up a video Just getting in to AI for development? Start here with an extract from our podcast episode LLMs break the internet.
Entries this week
- It’s infuriatingly hard to understand how closed models train on their input
- ChatGPT should include inline tips
- Lawyer cites fake cases invented by ChatGPT, judge is not amused
- llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs
- Delimiters won’t save you from prompt injection
Releases this week
A pure Python alternative to sqlite-url ready to be used in Datasette Lite
Python CLI utility and library for manipulating SQLite databases
CLI tool for stripping tags from HTML
Count and truncate text based on tokens
Access large language models from the command-line
TIL this week
- Testing the Access-Control-Max-Age CORS header—2023-05-25
- Comparing two training datasets using sqlite-utils—2023-05-23
- mlc-chat—RedPajama-INCITE-Chat-3B on macOS—2023-05-22
- hexdump and hexdump -C—2023-05-22
- Exploring Baseline with Datasette Lite—2023-05-12
It’s infuriatingly hard to understand how closed models train on their input four days ago
One of the most common concerns I see about large language models regards their training data. People are worried that anything they say to ChatGPT could be memorized by it and spat out to other users. People are concerned that anything they store in a private repository on GitHub might be used as training data for future versions of Copilot.
When someone asked Google Bard how it was trained back in March, it told them its training data included Gmail! This turned out to be a complete fabrication—a hallucination by the model itself—and Google issued firm denials, but it’s easy to see why that freaked people out.
I’ve been wanting to write something reassuring about this issue for a while now. The problem is... I can’t do it. I don’t have the information I need to credibly declare these concerns unfounded, and the more I look into this the murkier it seems to get.
Closed model vendors won’t tell you what’s in their training data
The fundamental issue here is one of transparency. The builders of the big closed models—GPT-3, GPT-4, Google’s PaLM and PaLM 2, Anthropic’s Claude—refuse to tell us what’s in their training data.
Given this lack of transparency, there’s no way to confidently state that private data that is passed to them isn’t being used to further train future versions of these models.
I’ve spent a lot of time digging around in openly available training sets. I built an early tool for searching the training set for Stable Diffusion. I can tell you exactly what has gone in to the RedPajama training set that’s being used for an increasing number of recent openly licensed language models.
But for those closed models? Barring loose, high-level details that are revealed piecemeal in blog posts and papers, I have no idea what’s in them.
What OpenAI do and don’t tell us
The good news is that OpenAI have an unambiguous policy regarding data that is sent to them by API users who are paying for the service:
OpenAI does not use data submitted by customers via our API to train OpenAI models or improve OpenAI’s service offering.
That’s very clear. It’s worth noting that this is a new policy though, introduced in March. The API data usage policies page includes this note:
Data submitted to the API prior to March 1, 2023 (the effective date of this change) may have been used for improvements if the customer had not previously opted out of sharing data.
Where things get a lot murkier is ChatGPT itself. Emphasis mine:
We don’t use data for selling our services, advertising, or building profiles of people—we use data to make our models more helpful for people. ChatGPT, for instance, improves by further training on the conversations people have with it, unless you choose to disable training.
But what does this mean in practice?
My initial assumption had been that this isn’t as simple as anything you type into ChatGPT being used as raw input for further rounds of model training—I expected it was more about using that input to identify trends in the kinds of questions people ask, or using feedback from the up/down vote buttons to further fine-tune the model.
But honestly, I have no idea. Maybe they just run a regular expression to strip out phone numbers and email addresses and pipe everything else straight into the GPT-5 training runs? Without further transparency all we can do is guess.
A clue from the InstructGPT paper
The best clue I’ve seen as to how this data might actually be used comes from OpenAI’s description of InstructGPT back in January 2022:
To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API,[A] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.
Crucially, this hints that the data isn’t being used as raw input for future trained models. Instead, it’s being used in an exercise where several potential outputs are produced and human labelers then select which of those is the best possible answer to the prompt. Aside from exposing potentially private data to those human labelers, I don’t see this as a risk for leaking that data in the later output of the model.
That [A] footnote turns out to be important:
We only use prompts submitted through the Playground to an earlier version of the InstructGPT models that was deployed in January 2021. Our human annotators remove personal identifiable information from all prompts before adding it to the training set.
Again though, I’m left with even more questions. This was before ChatGPT existed, so was the Playground development tool being treated separately from the API itself back then? What does “adding it to the training set” mean—is that the raw pre-training data used for future models, or is it the RLHF data used for the fine-tuning that they mentioned earlier?
Security leaks are another threat
Aside from training concerns, there’s another danger to consider here: the risk that an AI vendor might log inputs to their models and then suffer from a security flaw that exposes that data to attackers—or an insider threat where vendor employees access logged data that they shouldn’t.
OpenAI themselves had a widely publicized security issue a few months ago where ChatGPT users could see summarized titles of sessions by other users. This is an extremely bad breach!
Their new trust.openai.com site appears to be entirely aimed at reassuring companies about their approach to security.
To be fair, this is not a new issue: companies have been trusting their private data to cloud providers like AWS and Google Cloud for more than a decade.
The challenge is that these AI companies have much less of a track record for staying secure. AWS and Google Cloud have large security teams with many years of experience securing their customers’ data. These newer AI vendors are building up those capabilities as they go.
Self-hosted, openly licensed models
I’ve been tracking the meteoric rise of openly licensed LLMs you can run on your own hardware since LLaMA and Alpaca demonstrated how capable they could be back in March.
These models aren’t yet anywhere near as capable as GPT-4, and claims that they compete with ChatGPT’s gpt-3.5-turbo mostly don’t hold up to deeper scrutiny.
But... they’re pretty good—and they’re getting better at an impressive rate.
And since you can run them on your own instances, they remove all possible concerns about what happens to the data that you pipe through them.
An open question for me remains how large a large language model actually needs to be in order to solve the kind of problems companies need to solve. Could a weaker, openly licensed model armed with the same retrieval augmented generation tricks that we’ve seen from Bing and Bard be capable enough to remove the need for a closed model like GPT-4?
My hunch is that for many applications these augmented openly licensed models will be increasingly capable, and will see widespread adoption over the next few months and years.
Bonus section: does GitHub use private repos to train future models?
This question came up on Hacker News this morning. GitHub’s Privacy & Data Sharing policy says the following:
Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.
Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.
I interpret this as GitHub saying that no employee will ever see the contents of your private repo (barring incidents where they are compelled by law), and that the only data that might be shared with partners is “aggregate data learned from our analysis”.
But what is “aggregate data”?
Could a large language model trained on that data fit under that term? I don’t think so, but the terminology is vague enough that once again I’m not ready to stake my reputation on it.
Clarity on this kind of thing is just so important. I think organizations like GitHub need to over-communicate on this kind of thing, and avoid any terminology like “aggregate data” that could leave people confused.
Thanks to Andy Baio and Fred Benenson for reviewing early drafts of this post.
ChatGPT should include inline tips nine days ago
In OpenAI isn’t doing enough to make ChatGPT’s limitations clear James Vincent argues that OpenAI’s existing warnings about ChatGPT’s confounding ability to convincingly make stuff up are not effective.
I completely agree.
The case of the lawyer who submitted fake cases invented by ChatGPT to the court is just the most recent version of this.
Plenty of people have argued that the lawyer should have read the warning displayed on every page of the ChatGPT interface. But that warning is clearly inadequate. Here’s that warning in full:
ChatGPT may produce inaccurate information about people, places, or facts
Anyone who has spent time with ChatGPT will know that there’s a lot more to it than that. It’s not just that ChatGPT may produce inaccurate information: it will double down on it, inventing new details to support its initial claims. It will tell lies like this one:
I apologize for the confusion earlier. Upon double-checking, I found that the case Varghese v. China Southern Airlines Co. Ltd., 925 F.3d 1339 (11th Cir. 2019), does indeed exist and can be found on legal research databases such as Westlaw and LexisNexis.
It can’t “double-check” information, and it doesn’t have access to legal research databases.
“May produce inaccurate information” is a massive understatement here! It implies the occasional mistake, not Machiavellian levels of deception where it doubles down on falsehoods and invents increasingly convincing justifications for them.
Even for people who have read that warning, a single sentence in a footer isn’t nearly enough to inoculate people against the many weird ways ChatGPT can lead them astray.
My proposal: Inline tips
I think this problem could be addressed with some careful interface design.
Currently, OpenAI have been trying to train ChatGPT to include additional warnings in its regular output. It will sometimes reply with warnings that it isn’t able to do things... but these warnings are unreliable. Often I’ll try the same prompt multiple times and only get the warning for some of those attempts.
Instead, I think the warnings should be added in a way that is visually distinct from the regular output. Here’s a mockup illustrating the kind of thing I’m talking about:
As you can see, the prompt “Write some tweets based on what’s trending on pinterest” triggers an inline warning with a visually different style and a message explaining that “This ChatGPT model does not have access to the internet, and its training data cut-off is September 2021”.
My first version of this used “My data is only accurate up to September 2021”, but I think having the warnings use “I” pronouns is itself misleading—the tips should be commentary about the model’s output, not things that appear to be spoken by the model itself.
Here’s a second mockup, inspired by the lawyer example:
This time the warning is “ChatGPT should not be relied on for legal research of this nature, because it is very likely to invent realistic cases that do not actually exist.”
Writing these warnings clearly is its own challenge—I think they should probably include links to further information in an OpenAI support site that teaches people how to responsibly use ChatGPT (something that is very much needed).
(Here’s the HTML I used for these mockups, added using the Firefox DevTools.)
How would this work?
Actually implementing this system isn’t trivial. The first challenge is coming up with the right collection of warnings—my hunch is that this could be hundreds of items already. The next challenge is logic to decide when to display them, which would itself require an LLM (or maybe a fine-tuned model of some sort).
The good news is that a system like this could be developed independently of core ChatGPT itself. New warnings could be added without any changes needed to the underlying model, making it safe to iterate wildly on the inline tips without risk of affecting the core model’s performance or utility.
Obviously I’d like it best if OpenAI were to implement something like this as part of ChatGPT itself, but it would be possible for someone else to prototype it on top of the OpenAI APIs.
I thought about doing that myself, but my list of projects is overflowing enough already!
Max Woolf’s prototype
Max Woolf built an implementation of this idea as an example for his simpleaichat library. He shared these screenshots on Twitter:
Lawyer cites fake cases invented by ChatGPT, judge is not amused 12 days ago
Legal Twitter is having tremendous fun right now reviewing the latest documents from the case Mata v. Avianca, Inc. (1:22-cv-01461). Here’s a neat summary:
So, wait. They file a brief that cites cases fabricated by ChatGPT. The court asks them to file copies of the opinions. And then they go back to ChatGPT and ask it to write the opinions, and then they file them?
Beth Wilensky, May 26 2023
Here’s a New York Times story about what happened.
I’m very much not a lawyer, but I’m going to dig in and try to piece together the full story anyway.
The TLDR version
A lawyer asked ChatGPT for examples of cases that supported an argument they were trying to make.
ChatGPT, as it often does, hallucinated wildly—it invented several supporting cases out of thin air.
When the lawyer was asked to provide copies of the cases in question, they turned to ChatGPT for help again—and it invented full details of those cases, which they duly screenshotted and copied into their legal filings.
At some point, they asked ChatGPT to confirm that the cases were real... and ChatGPT said that they were. They included screenshots of this in another filing.
The judge is furious. Many of the parties involved are about to have a very bad time.
A detailed timeline
I pieced together the following from the documents on courtlistener.com:
Feb 22, 2022: The case was originally filed. It’s a complaint about “personal injuries sustained on board an Avianca flight that was traveling from El Salvador to New York on August 27, 2019”. There’s a complexity here in that Avianca filed for chapter 11 bankruptcy on May 10th, 2020, which is relevant to the case (they emerged from bankruptcy later on).
Various back and forths take place over the next 12 months, many of them concerning if the bankruptcy “discharges all claims”.
Mar 1st, 2023 is where things get interesting. This document was filed—“Affirmation in Opposition to Motion”—and it cites entirely fictional cases! One example quoted from that document (emphasis mine):
The United States Court of Appeals for the Eleventh Circuit specifically addresses the effect of a bankruptcy stay under the Montreal Convention in the case of Varghese v. China Southern Airlines Co.. Ltd.. 925 F.3d 1339 (11th Cir. 2019), stating "Appellants argue that the district court erred in dismissing their claims as untimely. They assert that the limitations period under the Montreal Convention was tolled during the pendency of the Bankruptcy Court proceedings. We agree. The Bankruptcy Code provides that the filing of a bankruptcy petition operates as a stay of proceedings against the debtor that were or could have been commenced before the bankruptcy case was filed.
There are several more examples like that.
March 15th, 2023
Quoting this Reply Memorandum of Law in Support of Motion (emphasis mine):
In support of his position that the Bankruptcy Code tolls the two-year limitations period, Plaintiff cites to “Varghese v. China Southern Airlines Co., Ltd., 925 F.3d 1339 (11th Cir. 2019).” The undersigned has not been able to locate this case by caption or citation, nor any case bearing any resemblance to it. Plaintiff offers lengthy quotations purportedly from the “Varghese” case, including: “We [the Eleventh Circuit] have previously held that the automatic stay provisions of the Bankruptcy Code may toll the statute of limitations under the Warsaw Convention, which is the precursor to the Montreal Convention ... We see no reason why the same rule should not apply under the Montreal Convention.” The undersigned has not been able to locate this quotation, nor anything like it any case. The quotation purports to cite to “Zicherman v. Korean Air Lines Co., Ltd., 516 F.3d 1237, 1254 (11th Cir. 2008).” The undersigned has not been able to locate this case; although there was a Supreme Court case captioned Zicherman v. Korean Air Lines Co., Ltd., that case was decided in 1996, it originated in the Southern District of New York and was appealed to the Second Circuit, and it did not address the limitations period set forth in the Warsaw Convention. 516 U.S. 217 (1996).
April 11th, 2023
The United States District Judge for the case orders copies of the cases cited in the earlier document:
ORDER: By April 18, 2022, Peter Lo Duca, counsel of record for plaintiff, shall file an affidavit annexing copies of the following cases cited in his submission to this Court: as set forth herein.
The order lists seven specific cases.
April 25th, 2023
The response to that order has one main document and eight attachments.
The first five attachments each consist of PDFs of scanned copies of screenshots of ChatGPT!
You can tell, because the ChatGPT interface’s down arrow is clearly visible in all five of them. Here’s an example from Exhibit Martinez v. Delta Airlines.
April 26th, 2023
In this letter:
Defendant respectfully submits that the authenticity of many of these cases is questionable. For instance, the “Varghese” and “Miller” cases purportedly are federal appellate cases published in the Federal Reporter. [Dkt. 29; 29-1; 29-7]. We could not locate these cases in the Federal Reporter using a Westlaw search. We also searched PACER for the cases using the docket numbers written on the first page of the submissions; those searches resulted in different cases.
May 4th, 2023
The ORDER TO SHOW CAUSE—the judge is not happy.
The Court is presented with an unprecedented circumstance. A submission file by plaintiff’s counsel in opposition to a motion to dismiss is replete with citations to non-existent cases. [...] Six of the submitted cases appear to be bogus judicial decisions with bogus quotes and bogus internal citations.
Let Peter LoDuca, counsel for plaintiff, show cause in person at 12 noon on June 8, 2023 in Courtroom 11D, 500 Pearl Street, New York, NY, why he ought not be sanctioned pursuant to: (1) Rule 11(b)(2) & (c), Fed. R. Civ. P., (2) 28 U.S.C. § 1927, and (3) the inherent power of the Court, for (A) citing non-existent cases to the Court in his Affirmation in Opposition (ECF 21), and (B) submitting to the Court annexed to his Affidavit filed April 25, 2023 copies of non-existent judicial opinions (ECF 29). Mr. LoDuca shall also file a written response to this Order by May 26, 2023.
I get the impression this kind of threat of sanctions is very bad news.
May 25th, 2023
Cutting it a little fine on that May 26th deadline. Here’s the Affidavit in Opposition to Motion from Peter LoDuca, which appears to indicate that Steven Schwartz was the lawyer who had produced the fictional cases.
Your affiant [I think this refers to Peter LoDuca], in reviewing the affirmation in opposition prior to filing same, simply had no reason to doubt the authenticity of the case law contained therein. Furthermore, your affiant had no reason to a doubt the sincerity of Mr. Schwartz’s research.
Attachment 1 has the good stuff. This time the affiant (the person pledging that statements in the affidavit are truthful) is Steven Schwartz:
As the use of generative artificial intelligence has evolved within law firms, your affiant consulted the artificial intelligence website ChatGPT in order to supplement the legal research performed.
It was in consultation with the generative artificial intelligence website ChatGPT, that your affiant did locate and cite the following cases in the affirmation in opposition submitted, which this Court has found to be nonexistent:
Varghese v. China Southern Airlines Co Ltd, 925 F.3d 1339 (11th Cir. 2019)
Shaboon v. Egyptair 2013 IL App (1st) 111279-U (Ill. App. Ct. 2013)
Petersen v. Iran Air 905 F. Supp 2d 121 (D.D.C. 2012)
Martinez v. Delta Airlines, Inc.. 2019 WL 4639462 (Tex. App. Sept. 25, 2019)
Estate of Durden v. KLM Royal Dutch Airlines, 2017 WL 2418825 (Ga. Ct. App. June 5, 2017)
Miller v. United Airlines, Inc.. 174 F.3d 366 (2d Cir. 1999)
That the citations and opinions in question were provided by ChatGPT which also provided its legal source and assured the reliability of its content. Excerpts from the queries presented and responses provided are attached hereto.
That your affiant relied on the legal opinions provided to him by a source that has revealed itself to be unreliable.
That your affiant has never utilized ChatGPT as a source for conducting legal research prior to this occurrence and therefore was unaware of the possibility that its content could be false.
That is the fault of the affiant, in not confirming the sources provided by ChatGPT of the legal opinions it provided.
- That your affiant had no intent to deceive this Court nor the defendant.
- That Peter LoDuca, Esq. had no role in performing the research in question, nor did he have any knowledge of how said research was conducted.
Here are the attached screenshots (amusingly from the mobile web version of ChatGPT):
May 26th, 2023
The judge, clearly unimpressed, issues another Order to Show Cause, this time threatening sanctions against Mr. LoDuca, Steven Schwartz and the law firm of Levidow, Levidow & Oberman. The in-person hearing is set for June 8th.
Part of this doesn’t add up for me
On the one hand, it seems pretty clear what happened: a lawyer used a tool they didn’t understand, and it produced a bunch of fake cases. They ignored the warnings (it turns out even lawyers don’t read warnings and small-print for online tools) and submitted those cases to a court.
Then, when challenged on those documents, they doubled down—they asked ChatGPT if the cases were real, and ChatGPT said yes.
There’s a version of this story where this entire unfortunate sequence of events comes down to the inherent difficulty of using ChatGPT in an effective way. This was the version that I was leaning towards when I first read the story.
But parts of it don’t hold up for me.
I understand the initial mistake: ChatGPT can produce incredibly convincing citations, and I’ve seen many cases of people being fooled by these before.
What’s much harder though is actually getting it to double-down on fleshing those out.
I’ve been trying to come up with prompts to expand that false “Varghese v. China Southern Airlines Co., Ltd., 925 F.3d 1339 (11th Cir. 2019)” case into a full description, similar to the one in the screenshots in this document.
Even with ChatGPT 3.5 it’s surprisingly difficult to get it to do this without it throwing out obvious warnings.
I’m trying this today, May 27th. The research in question took place prior to March 1st. In the absence of detailed release notes, it’s hard to determine how ChatGPT might have behaved three months ago when faced with similar prompts.
So there’s another version of this story where that first set of citations was an innocent mistake, but the submission of those full documents (the set of screenshots from ChatGPT that were exposed purely through the presence of the OpenAI down arrow) was a deliberate attempt to cover for that mistake.
I’m fascinated to hear what comes out of that 8th June hearing!
Update: The following prompt against ChatGPT 3.5 sometimes produces a realistic fake summary, but other times it replies with “I apologize, but I couldn’t find any information or details about the case”.
Write a complete summary of the Varghese v. China Southern Airlines Co., Ltd., 925 F.3d 1339 (11th Cir. 2019) case
The worst ChatGPT bug
Returning to the screenshots from earlier, this one response from ChatGPT stood out to me:
I apologize for the confusion earlier. Upon double-checking, I found that the case Varghese v. China Southern Airlines Co. Ltd., 925 F.3d 1339 (11th Cir. 2019), does indeed exist and can be found on legal research databases such as Westlaw and LexisNexis.
I’ve seen ChatGPT (and Bard) say things like this before, and it absolutely infuriates me.
No, it did not “double-check”—that’s not something it can do! And stating that the cases “can be found on legal research databases” is a flat out lie.
What’s harder is explaining why ChatGPT would lie in this way. What possible reason could LLM companies have for shipping a model that does this?
I think this relates to the original sin of LLM chatbots: by using the “I” pronoun they encourage people to ask them questions about how they work.
They can’t do that. They are best thought of as role-playing conversation simulators—playing out the most statistically likely continuation of any sequence of text.
What’s a common response to the question “are you sure you are right?”—it’s “yes, I double-checked”. I bet GPT-3’s training data has huge numbers of examples of dialogue like this.
Let this story be a warning
Presuming there was at least some aspect of innocent mistake here, what can be done to prevent this from happening again?
I often see people suggest that these mistakes are entirely the fault of the user: the ChatGPT interface shows a footer stating “ChatGPT may produce inaccurate information about people, places, or facts” on every page.
Anyone who has worked designing products knows that users don’t read anything—warnings, footnotes, any form of microcopy will be studiously ignored. This story indicates that even lawyers won’t read that stuff!
People do respond well to stories though. I have a suspicion that this particular story is going to spread far and wide, and in doing so will hopefully inoculate a lot of lawyers and other professionals against making similar mistakes.
I can’t shake the feeling that there’s a lot more to this story though. Hopefully more will come out after the June 8th hearing. I’m particularly interested in seeing if the full transcripts of these ChatGPT conversations end up being made public. I want to see the prompts!
How often is this happening?
It turns out this may not be an isolated incident.
Eugene Volokh, 27th May 2023:
A message I got from Prof. Dennis Crouch (Missouri), in response to my posting A Lawyer’s Filing “Is Replete with Citations to Non-Existent Cases”—Thanks, ChatGPT? to an academic discussion list. (The full text was, “I just talked to a partner at a big firm who has received memos with fake case cites from at least two different associates.”) Caveat emp…—well, caveat everyone.
@narrowlytaylord, 26th May 2023:
two attorneys at my firm had opposing counsel file ChatGPT briefs with fake cases this past week
(1) They aren’t my matters so I don’t know how comfortable I am sharing much more detail
(2) One was an opposition to an MTD, and the state, small claims court judge did not care at the “your honor these cases don’t exist” argument
llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs 21 days ago
I’ve been building out a small suite of command-line tools for working with ChatGPT, GPT-4 and potentially other language models in the future.
The three tools I’ve built so far are:
- llm—a command-line tool for sending prompts to the OpenAI APIs, outputting the response and logging the results to a SQLite database. I introduced that a few weeks ago.
- ttok—a tool for counting and truncating text based on tokens
- strip-tags—a tool for stripping HTML tags from text, and optionally outputting a subset of the page based on CSS selectors
The idea with these tools is to support working with language model prompts using Unix pipes.
You can install the three like this:
pipx install llm
pipx install ttok
pipx install strip-tags
Or use pip if you haven’t adopted pipx yet.
llm depends on an OpenAI API key in the OPENAI_API_KEY environment variable or a ~/.openai-api-key.txt text file. The other tools don’t require any configuration.
Now let’s use them to summarize the homepage of the New York Times:
curl -s https://www.nytimes.com/ \
  | strip-tags .story-wrapper \
  | ttok -t 4000 \
  | llm --system 'summary bullet points' -s
Here’s what that command outputs when you run it in the terminal:
Let’s break that down.
- curl -s https://www.nytimes.com/ uses curl to retrieve the HTML for the New York Times homepage—the -s option prevents it from outputting any progress information.
- strip-tags .story-wrapper accepts HTML to standard input, finds just the areas of that page identified by the CSS selector .story-wrapper, then outputs the text for those areas with all HTML tags removed.
- ttok -t 4000 accepts text to standard input, tokenizes it using the default tokenizer for the gpt-3.5-turbo model, truncates to the first 4,000 tokens and outputs those tokens converted back to text.
- llm --system 'summary bullet points' -s accepts the text to standard input as the user prompt, adds a system prompt of “summary bullet points”, then the -s option tells the tool to stream the results to the terminal as they are returned, rather than waiting for the full response before outputting anything.
It’s all about the tokens
I built ttok this morning because I needed better ways to work with tokens.
LLMs such as ChatGPT and GPT-4 work with tokens, not characters.
This is an implementation detail, but it’s one that you can’t avoid for two reasons:
- APIs have token limits. If you try and send more than the limit you’ll get an error message like this one: “This model’s maximum context length is 4097 tokens. However, your messages resulted in 116142 tokens. Please reduce the length of the messages.”
- Tokens are how pricing works. gpt-3.5-turbo (the model used by ChatGPT, and the default model used by the llm command) costs $0.002 / 1,000 tokens. GPT-4 is $0.03 / 1,000 tokens of input and $0.06 / 1,000 for output.
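Both constraints come down to simple arithmetic on token counts. Here is a rough sketch in Python that uses only the limit and prices quoted above, all of which are likely to change over time. The names PRICES_PER_1K, CONTEXT_LIMIT, fits_context and estimate_cost are made up for this illustration and are not part of llm or any OpenAI library. It also assumes that the single gpt-3.5-turbo rate applies to both input and output, and that the context limit covers the prompt and the reply together.

```python
# Prices in USD per 1,000 tokens, copied from the figures quoted above.
# Assumption: these are a snapshot and will go stale; check current pricing.
PRICES_PER_1K = {
    "gpt-3.5-turbo": {"input": 0.002, "output": 0.002},
    "gpt-4": {"input": 0.03, "output": 0.06},
}

# Context limit taken from the error message quoted above.
CONTEXT_LIMIT = 4097


def fits_context(prompt_tokens, reserved_for_reply=0, limit=CONTEXT_LIMIT):
    """Return True if the prompt, plus room kept free for the reply, fits."""
    return prompt_tokens + reserved_for_reply <= limit


def estimate_cost(model, input_tokens, output_tokens=0):
    """Return the estimated cost in USD of a single request."""
    price = PRICES_PER_1K[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1000


# The oversized request from the error message above does not fit.
print(fits_context(116142))  # prints: False

# 4,000 prompt tokens and 500 reply tokens against GPT-4.
print(f"${estimate_cost('gpt-4', 4000, 500):.4f}")  # prints: $0.1500
```

Checking the count before sending a request is cheaper than finding out from an API error, which is a large part of why a token counter is worth having on the command line.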
Being able to keep track of token counts is really important.
But tokens are actually really hard to count! The rule of thumb is that one token is roughly 0.75 of an English word, so the token count is about number-of-words / 0.75, but you can get an exact count by running the same tokenizer that the model uses on your own machine.
OpenAI’s tiktoken library (documented in this notebook) is the best way to do this.
The ttok tool is a very thin wrapper around that library. It can do three different things:
- Count tokens
- Truncate text to a desired number of tokens
- Show you the tokens
Here’s a quick example showing all three of those in action:
$ echo 'Here is some text' | ttok
5
$ echo 'Here is some text' | ttok --truncate 2
Here is
$ echo 'Here is some text' | ttok --tokens
8586 374 1063 1495 198
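For anyone who wants the same three operations from Python, here is a minimal sketch. It is not ttok's actual source code. tiktoken.encoding_for_model, encode and decode are real tiktoken calls, but the helper names load_encoder, count_tokens, truncate_tokens and show_tokens are invented for this example. The helpers accept any object with encode and decode methods, so they can be tried out without tiktoken installed.

```python
def load_encoder(model="gpt-3.5-turbo"):
    """Return tiktoken's encoding for a model (needs `pip install tiktoken`)."""
    # Imported lazily so the helpers below work with any encoder object.
    import tiktoken

    return tiktoken.encoding_for_model(model)


def count_tokens(text, encoder):
    """Number of tokens the encoder produces for the text."""
    return len(encoder.encode(text))


def truncate_tokens(text, limit, encoder):
    """Keep the first `limit` tokens and convert them back to text."""
    return encoder.decode(encoder.encode(text)[:limit])


def show_tokens(text, encoder):
    """Space-separated integer token IDs for the text."""
    return " ".join(str(token) for token in encoder.encode(text))
```

With tiktoken installed, count_tokens("Here is some text\n", load_encoder()) should correspond to the first ttok command above. The trailing newline matters because echo appends one, and the terminal output shows it is encoded as a token of its own.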
My GPT-3 token encoder and decoder Observable notebook provides an interface for exploring how these tokens work in more detail.
Stripping tags from HTML
HTML tags take up a lot of tokens, and usually aren’t relevant to the prompt you are sending to the model.
The strip-tags command strips those tags out.
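To make the idea concrete, here is a minimal sketch of tag stripping that uses only html.parser from Python's standard library. It illustrates the technique and is not how strip-tags itself is implemented. The class and function names are invented here, and dropping the contents of script and style elements is my own assumption about what is usually wanted.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps words in adjacent elements apart, at the
    # cost of splitting a word that has a tag in the middle of it.
    return " ".join(" ".join(parser.parts).split())


print(strip_tags("<p>Hello <b>world</b></p>"))  # prints: Hello world
```

Every tag and attribute dropped this way is a run of tokens that never has to be sent to the model.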
Here’s an example showing quite how much of a difference that can make:
$ curl -s https://simonwillison.net/ | ttok
21543
$ curl -s https://simonwillison.net/ | strip-tags | ttok
9688
For my blog’s homepage, stripping tags reduces the token count by more than half!
At 9,688 tokens that is still too many to send to the API, given the 4,097 token limit quoted earlier.
We could truncate them, like this:
$ curl -s https://simonwillison.net/ \
  | strip-tags | ttok --truncate 4000 \
  | llm --system 'turn this into a bad poem' -s
But often it’s only specific parts of a page that we care about. The strip-tags command takes an optional list of CSS selectors as arguments—if provided, only those parts of the page will be output.
That’s how the New York Times example works above. Compare the following:
$ curl -s https://www.nytimes.com/ | ttok
210544
$ curl -s https://www.nytimes.com/ | strip-tags | ttok
115117
$ curl -s https://www.nytimes.com/ | strip-tags .story-wrapper | ttok
2165
By selecting just the text from within the <section class="story-wrapper"> elements we can trim the whole page down to just the headlines and summaries of each of the main articles on the page.
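The selection step can be sketched in Python as well. This is a deliberately limited illustration: it matches a single class name only, assumes well-formed markup in which every non-void element is explicitly closed, and uses invented names (ClassText, text_for_class). Real CSS selectors, which strip-tags accepts, are far richer than this.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassText(HTMLParser):
    """Collect text found inside elements that carry a given class name."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # greater than zero while inside a matching element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_for_class(html, wanted_class):
    """Return the text inside matching elements with whitespace collapsed."""
    parser = ClassText(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


page = (
    "<nav>Menu</nav>"
    '<section class="story-wrapper"><h3>Headline</h3><p>Summary</p></section>'
    "<footer>Footer</footer>"
)
print(text_for_class(page, "story-wrapper"))  # prints: Headline Summary
```

Navigation, footers and everything else outside the matching elements never reach the tokenizer, which is where the large drop in token count comes from.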
I’m really enjoying being able to use the terminal to interact with LLMs in this way. Having a quick way to pipe content to a model opens up all kinds of fun opportunities.
Want a quick explanation of how some code works using GPT-4? Try this:
cat ttok/cli.py | llm --system 'Explain this code' -s --gpt4
I’ve been having fun piping my shot-scraper tool into it too, which goes a step further than strip-tags in providing a full headless browser.
Here’s an example that uses the Readability recipe from this TIL to extract the main article content, then further strips HTML tags from it and pipes it into the
In terms of next steps, the thing I’m most excited about is teaching that llm command how to talk to other models—initially Claude and PaLM2 via APIs, but I’d love to get it working against locally hosted models running on things like llama.cpp as well.