Simon Willison’s Weblog


Think of language models like ChatGPT as a “calculator for words”

2nd April 2023

One of the most pervasive mistakes I see people make with large language model tools like ChatGPT is trying to use them as a search engine.

As with other LLM misconceptions, it’s easy to understand why people do this.

If you ask an LLM a question, it will answer it—no matter what the question! Using them as an alternative to a search engine such as Google is one of the most obvious applications—and for a lot of queries this works just fine.

It’s also going to quickly get you into trouble.

Ted Chiang’s classic essay ChatGPT Is a Blurry JPEG of the Web helps explain why:

Think of ChatGPT as a blurry jpeg of all the text on the Web. It retains much of the information on the Web, in the same way that a jpeg retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation. But, because the approximation is presented in the form of grammatical text, which ChatGPT excels at creating, it’s usually acceptable.

The ChatGPT model is huge, but it’s not huge enough to retain every exact fact it’s encountered in its training set.

It can produce a convincing answer to anything, but that doesn’t mean it’s reflecting actual facts in its answers. You always have to stay skeptical and fact check what it tells you.

Language models are also famous for “hallucinating”—for inventing new facts that fit the sentence structure despite having no basis in the underlying data.

There are plenty of “facts” about the world which humans disagree on. Regular search lets you compare those versions and consider their sources. A language model might instead attempt to calculate some kind of average of every opinion it’s been trained on—which is sometimes what you want, but often is not.

This becomes even more obvious when you consider smaller language models. LLaMA 7B can be represented as a 3.9 GB file—it contains an astonishing amount of information, but evidently that’s not enough storage space to accurately answer every question you might have.
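A rough back-of-envelope calculation makes the storage point concrete. This is only a sketch: it assumes "7B" means roughly 7 billion parameters and that 1 GB is 10^9 bytes, neither of which is stated above.

```python
# Back-of-envelope: how many bits of storage does each parameter get?
# Assumptions (not from the post): "7B" = 7e9 parameters, 1 GB = 1e9 bytes.

def bits_per_parameter(file_size_gb: float, parameter_count: float) -> float:
    """Return the average number of bits of file space per model parameter."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / parameter_count


bits = bits_per_parameter(3.9, 7e9)
print(f"{bits:.2f} bits per parameter")  # prints "4.46 bits per parameter"
```

Under those assumptions each parameter gets only a handful of bits, which gives a feel for why exact recall of every fact in the training data is not realistic.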

So if they’re not reliable for use as a search engine, what are LLMs even good for?

A calculator for words

I like to think of language models like ChatGPT as a calculator for words.

This is reflected in their name: a “language model” implies that they are tools for working with language. That’s what they’ve been trained to do, and it’s language manipulation where they truly excel.

Want them to work with specific facts? Paste those into the language model as part of your original prompt!

There are so many applications of language models that fit into this calculator for words category:

  • Summarization: give them an essay and ask for a summary.
  • Question answering: given these paragraphs of text, answer this specific question about the information they represent.
  • Fact extraction: ask for bullet points showing the facts presented by an article.
  • Rewrites: reword things to be more “punchy” or “professional” or “sassy” or “sardonic”—part of the fun here is using increasingly varied adjectives and seeing what happens. They’re very good with language after all!
  • Suggesting titles—actually a form of summarization.
  • World’s most effective thesaurus. “I need a word that hints at X”, “I’m very Y about this situation, what could I use for Y?”—that kind of thing.
  • Fun, creative, wild stuff. Rewrite this in the voice of a 17th century pirate. What would a sentient cheesecake think of this? How would Alexander Hamilton rebut this argument? Turn this into a rap battle. Illustrate this business advice with an anecdote about sea otters running a kayak rental shop. Write the script for a Kickstarter fundraising video about this idea.
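Several of the tasks in this list share the same shape: an instruction plus the text you want worked on. Here is a minimal sketch of that shape. The template wording, the `TEMPLATES` dictionary and the `build_prompt` helper are all hypothetical names made up for illustration, and nothing here calls a real model; the functions only build the text you would paste in.

```python
# Hypothetical prompt templates for a few "calculator for words" tasks.
# The wording of each template is an illustrative assumption, not a recipe.
TEMPLATES = {
    "summarize": "Summarize the following text in {n} sentences:\n\n{text}",
    "extract_facts": "List the facts presented in the following text as bullet points:\n\n{text}",
    "rewrite": "Rewrite the following text so it sounds more {style}:\n\n{text}",
    "answer": "Using only the text below, answer this question: {question}\n\n{text}",
}


def build_prompt(task: str, text: str, **options: object) -> str:
    """Combine an instruction template with the text to operate on."""
    if task not in TEMPLATES:
        raise ValueError(f"Unknown task: {task!r}")
    return TEMPLATES[task].format(text=text, **options)


essay = "Sea otters hold hands while they sleep so they do not drift apart."
print(build_prompt("rewrite", essay, style="sardonic"))
print(build_prompt("summarize", essay, n=1))
```

In every case the facts come from the text you supply, and the model is only asked to manipulate the language.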

A calculator for words is an incredibly powerful thing.

Here’s where things get a bit complicated: some language models CAN work as search engines. The two most obvious are Microsoft Bing and Google Bard, but there are plenty of other examples of this pattern too—there’s even an alpha feature of ChatGPT called “browsing mode” that can do this.

You can think of these search tools as augmented language models.

The way these work is the language model identifies when a search might help answer a question... and then runs that search through an attached search engine, via an API.

It then copies data from the search results back into itself as part of an invisible prompt, and uses that new context to help it answer the original question.

It’s effectively the same thing as if you ran a search, then copied and pasted information back into the language model and asked it a question about that data.
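The search-then-paste flow described above can be sketched in a few lines. This is a simplified illustration, not how Bing or Bard are actually implemented: it always runs a search rather than letting the model decide when one would help, and `answer_with_search`, `fake_search` and `fake_complete` are hypothetical stand-ins for a real search API and a real language model.

```python
from typing import Callable, List


def answer_with_search(
    question: str,
    search: Callable[[str], List[str]],
    complete: Callable[[str], str],
    max_results: int = 3,
) -> str:
    """Run a search, paste the results into a prompt, then ask the model."""
    snippets = search(question)[:max_results]
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    # This is the "invisible prompt": the user never sees it, only the answer.
    prompt = (
        "Answer the question using only the search results below.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}"
    )
    return complete(prompt)


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_search(query: str) -> List[str]:
    return ["Datasette is an open source tool for exploring and publishing data."]


def fake_complete(prompt: str) -> str:
    # A real model would generate text; this stub just reports the prompt size.
    return f"(model reply based on a {len(prompt)}-character prompt)"


print(answer_with_search("What is Datasette?", fake_search, fake_complete))
```

Because the model still writes the final answer, nothing in this flow guarantees the reply sticks to the pasted search results.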

I wrote about how to implement this pattern against your own data in How to implement Q&A against your documentation with GPT3, embeddings and Datasette. It’s an increasingly common pattern.

It’s important to note that there is still a risk of hallucination here, even when you feed it the facts you want it to use. I’ve caught both Bing and Bard adding made-up things in the middle of text that should have been entirely derived from their search results!

Using language models effectively is deceptively difficult

So many of the challenges involving language models come down to this: they look much, much easier to use than they actually are.

To get the most value out of them—and to avoid the many traps that they set for the unwary user—you need to spend time with them, and work to build an accurate mental model of how they work, what they are capable of and where they are most likely to go wrong.

I hope this “calculator for words” framing can help.

A flaw in this analogy: calculators are repeatable

Andy Baio pointed out a flaw in this particular analogy: calculators always give you the same answer for a given input. Language models don’t: if you run the same prompt through an LLM several times you’ll get a slightly different reply every time.

This is a very good point! You should definitely keep this in mind.
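One common reason for this variation is that chat tools usually pick each next word at random from a probability distribution rather than always taking the single most likely one. The toy sketch below illustrates that idea with made-up probabilities and a hypothetical `sample_next_word` helper; it is an assumption about the general mechanism, not a description of how any particular product works.

```python
import random
from typing import Dict


def sample_next_word(
    probabilities: Dict[str, float], temperature: float, rng: random.Random
) -> str:
    """Pick one word. A temperature of 0 always takes the most likely word."""
    if temperature == 0:
        return max(probabilities, key=probabilities.get)
    words = list(probabilities)
    # Raising each probability to the power 1/temperature sharpens the
    # distribution when temperature < 1 and flattens it when temperature > 1.
    weights = [probabilities[word] ** (1.0 / temperature) for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Made-up probabilities for the word after "The sky is".
next_word = {"blue": 0.6, "grey": 0.3, "green": 0.1}
rng = random.Random()

print([sample_next_word(next_word, 0, rng) for _ in range(5)])    # always "blue"
print([sample_next_word(next_word, 1.0, rng) for _ in range(5)])  # varies between runs
```

With the temperature set to 0 the toy sampler behaves like a calculator, and with anything higher it does not.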

All analogies are imperfect, but some are more imperfect than others.

Update: December 5th 2023

Anthony Bucci, in Word calculators don’t add up, responds to this post with further notes on why this analogy doesn’t work for him, including:

[...] a calculator has a well-defined, well-scoped set of use cases, a well-defined, well-scoped user interface, and a set of well-understood and expected behaviors that occur in response to manipulations of that interface.

Large language models, when used to drive chatbots or similar interactive text-generation systems, have none of those qualities. They have an open-ended set of unspecified use cases.