Recent entries
I built a ChatGPT plugin to answer questions about data hosted in Datasette one day ago
Yesterday OpenAI announced support for ChatGPT plugins. It’s now possible to teach ChatGPT how to make calls out to external APIs and use the responses to help generate further answers in the current conversation.
My open source project Datasette makes it easy to host data online, and already provides a JSON API for executing SQLite queries.
I know from past experiments that GPT3/4 are able to construct SQLite-compatible SQL queries. I figured that Datasette and ChatGPT would be a perfect combination.
Datasette itself has a plugin system which can be used to add new features, including the new URL routes needed by the ChatGPT plugin specification.
So I built datasette-chatgpt-plugin—a Datasette plugin that turns a Datasette instance into a ChatGPT plugin, and allows you to ask questions of your own data directly from ChatGPT.
datasette-chatgpt-plugin
The new plugin can be installed using datasette install datasette-chatgpt-plugin. It’s only useful if it’s deployed online though, since ChatGPT needs to be able to access it.
I’m running it on the official datasette.io website—which is itself a customized instance of Datasette. The plugin provides access to the datasette.io/content database, which has tables for things like Datasette news, releases and available plugins.
Here’s a video demo showing some of the things ChatGPT can do with the plugin installed.
I’ve only just got this working, but here’s a query that illustrates how effective it can be. I asked:
What are the most popular plugins?
It gave me back the following:
Clicking the little arrow next to “Used query datasette.io” reveals the API call it made, exposing the SQL query it generated:
The API call it made was:
{
"sql": "SELECT name, full_name, stargazers_count FROM plugins ORDER BY stargazers_count DESC LIMIT 5",
"_shape": "array"
}
You can try that query in Datasette here—or here’s the JSON version.
Here’s the JSON that was returned:
[
{
"name": "datasette-dashboards",
"full_name": "rclement/datasette-dashboards",
"stargazers_count": 84
},
{
"name": "dogsheep-beta",
"full_name": "dogsheep/dogsheep-beta",
"stargazers_count": 78
},
{
"name": "datasette-graphql",
"full_name": "simonw/datasette-graphql",
"stargazers_count": 75
},
{
"name": "datasette-ripgrep",
"full_name": "simonw/datasette-ripgrep",
"stargazers_count": 63
},
{
"name": "datasette-scraper",
"full_name": "cldellow/datasette-scraper",
"stargazers_count": 44
}
]
ChatGPT turns the JSON into a nice human-readable reply. It also knows how to generate thumbnails from og:image metadata, adding a gallery of those to the end of the reply.
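You can also replicate that API call yourself outside of ChatGPT. Here’s a quick sketch using Python and the requests library (the endpoint and parameters are exactly the ones shown above):

# Reproduce the API call ChatGPT made against the Datasette JSON API
import requests

response = requests.get(
    "https://datasette.io/content.json",
    params={
        "sql": (
            "SELECT name, full_name, stargazers_count FROM plugins "
            "ORDER BY stargazers_count DESC LIMIT 5"
        ),
        "_shape": "array",  # return rows as a plain JSON array of objects
    },
)
for row in response.json():
    print(row["full_name"], row["stargazers_count"])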
How the plugin works
Building ChatGPT plugins, like so much involving Large Language Models, is both really easy and deceptively complicated.
You give ChatGPT a short, human-ish language description of your plugin and how to use it, and a machine-readable OpenAPI schema with the details of the API.
And that’s it! The language model figures out everything else.
Datasette exposes a JSON API that speaks SQL. ChatGPT knows SQL already, so all my prompt needed to do was give it some hints—in particular tell it to use the SQLite dialect.
Here’s the prompt I’m using at the moment:
Run SQLite queries against a database hosted by Datasette. Datasette supports most SQLite syntax but does not support PRAGMA statements. Use `select group_concat(sql, ';') from sqlite_master` to see the list of tables and their columns. Use `select sql from sqlite_master where name = 'table_name'` to see the schema for a table, including its columns. Instead of `PRAGMA table_info(table_name)` use `select * from pragma_table_info('table_name')`. PRAGMA statements are not allowed. `select * from pragma_table_info('table_name')` is allowed.
In my early experiments it kept trying to run PRAGMA table_info(), hence my increasingly frustrated prompts about that!
With hindsight, I don’t think it was re-fetching my prompt while I was developing the plugin, so those repeated warnings probably aren’t needed.
Your application needs to serve two additional pages—a plugin description at /.well-known/ai-plugin.json and an OpenAPI schema linked to by that description.
You can see those two pages for my datasette.io deployment here:
- https://datasette.io/.well-known/ai-plugin.json
- https://datasette.io/-/chatgpt-openapi-schema.yml
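Under the hood, a Datasette plugin can add URLs like these using the register_routes() plugin hook. Here’s a heavily simplified sketch of the idea (illustrative only, not the actual datasette-chatgpt-plugin source):

# Minimal sketch: serve /.well-known/ai-plugin.json from a Datasette plugin
from datasette import hookimpl
from datasette.utils.asgi import Response


async def ai_plugin_json(request):
    # The real plugin builds this document dynamically for the instance it runs on
    return Response.json({
        "schema_version": "v1",
        "name_for_human": "Query datasette.io",
        # ... remaining keys as shown in the file below ...
    })


@hookimpl
def register_routes():
    # Each entry is a (regex, view function) pair
    return [(r"^/\.well-known/ai-plugin\.json$", ai_plugin_json)]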
The ai-plugin.json file currently looks like this:
{
"schema_version": "v1",
"name_for_model": "datasette_datasette_io_3c330f",
"name_for_human": "Query datasette.io",
"description_for_model": "Run SQLite queries against a database hosted by Datasette.\nDatasette supports most SQLite syntax but does not support PRAGMA statements.\nUse `select group_concat(sql, ';') from sqlite_master` to see the list of tables and their columns\nUse `select sql from sqlite_master where name = 'table_name'` to see the schema for a table, including its columns.\nInstead of `PRAGMA table_info(table_name)` use `select * from pragma_table_info('table_name')`\nPRAGMA statements are not allowed. `select * from pragma_table_info('table_name') is allowed.",
"description_for_human": "Run SQL against data in Datasette.",
"auth": {
"type": "none"
},
"api": {
"type": "openapi",
"url": "https://datasette.io/-/chatgpt-openapi-schema.yml",
"has_user_authentication": false
},
"logo_url": "https://avatars.githubusercontent.com/u/126964132?s=400&u=08b2ed680144a4feb421308f09e5f3cc5876211a&v=4",
"contact_email": "hello@contact.com",
"legal_info_url": "hello@legal.com"
}
Since they use that `.well-known` URL format, it’s possible to find them for other services. Here’s ai-plugin.json for Wolfram Alpha.
And the chatgpt-openapi-schema.yml file contains this:
openapi: 3.0.1
info:
title: Datasette API
description: Execute SQL queries against a Datasette database and return the results as JSON
version: 'v1'
servers:
- url: https://datasette.io
paths:
/content.json:
get:
operationId: query
summary: Execute a SQLite SQL query against the content database
description: Accepts SQLite SQL query, returns JSON. Does not allow PRAGMA statements.
parameters:
- name: sql
in: query
description: The SQL query to be executed
required: true
schema:
type: string
- name: _shape
in: query
description: The shape of the response data. Must be "array"
required: true
schema:
type: string
enum:
- array
responses:
'200':
description: Successful SQL results
content:
application/json:
schema:
type: array
items:
type: object
'400':
description: Bad request
'500':
description: Internal server error
I haven’t actually used OpenAPI schemas before... so I got ChatGPT to write the initial version for me, using the following prompt:
Write an OpenAPI schema explaining the https://latest.datasette.io/fixtures.json?sql=select+*+from+facetable&_shape=array GET API which accepts SQL and returns an array of JSON objects
For a detailed account of how I built the plugin, take a look at my notes in issue #1 in the repository.
I prototyped the initial plugin using Glitch, because that’s the fastest way I know to get a live-on-the-web application which constantly reflects new changes to the code. This made iterating much faster... on the OpenAPI schema at least. As far as I can tell ChatGPT only loads that ai-plugin.json file once, which is frustrating because it means you have to deploy a new copy of the application to get it to re-read that crucial prompt.
I ended up doing most of my prompt engineering in ChatGPT itself though—I could tell it "Instead of PRAGMA table_info(table_name) use select * from pragma_table_info('table_name')" and then re-try my previous question to see if the new instruction fixed any problems I was having.
The bad news: it can hallucinate
Here’s the bad news. I’ve been playing with this for only a short time, so I’m still exploring its abilities. I’ve already had a couple of instances of it hallucinating answers despite having looked them up in the database first.
I’m hoping I can address this somewhat with further prompt engineering—“only use information returned from the query to answer the question” kind of stuff. But I can’t guarantee I’ll be able to suppress this entirely, which for a database querying tool is an extremely serious problem.
More about this, including some examples, in issue #2 in the repo.
My current theory is that this relates to length limits. I’ve noticed it happens when the query returns a large amount of data—the full content of tutorials for example. I think ChatGPT is silently truncating that data to fit the token limit, and is then hallucinating new information to fill in for what ends up missing.
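One way to sanity-check that theory would be to count the tokens in a response before ChatGPT sees it. Here’s a rough sketch using OpenAI’s tiktoken library (cl100k_base is the encoding used by the ChatGPT family of models; swap in whichever query was producing hallucinations):

# Roughly how many tokens would this Datasette JSON response consume?
import json
import requests
import tiktoken

response = requests.get(
    "https://datasette.io/content.json",
    params={"sql": "select * from plugins", "_shape": "array"},
)
text = json.dumps(response.json())
encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode(text)), "tokens")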
Want to try this with your own data?
The ChatGPT plugin system isn’t available outside of the preview yet, but when it is I’ll be adding this functionality to my Datasette Cloud SaaS platform, for people who don’t want to install and run Datasette themselves.
You can sign up for the Datasette Cloud preview here if you’d like to learn more.
Previous experiments
I’ve experimented with variants of this pattern myself before: it turns out it’s surprisingly easy to enhance the capabilities of a large language model by providing it access to additional tools. Here’s some previous work:
- How to implement Q&A against your documentation with GPT3, embeddings and Datasette describes a pattern of searching an existing corpus for relevant information and appending that to the prompt in order to answer a user’s question.
- A simple Python implementation of the ReAct pattern for LLMs shows a more advanced pattern, similar to the new ChatGPT plugins mechanism, where multiple tools can be registered with the model and used to generate responses. It’s based on the paper ReAct: Synergizing Reasoning and Acting in Language Models.
Weeknotes: AI won’t slow down, a new newsletter and a huge Datasette refactor three days ago
I’m a few weeks behind on my weeknotes, but it’s not through lack of attention to my blog. AI just keeps getting weirder and more interesting.
I’m beginning to expect that every Tuesday may be a write-off for the next few years, since the AI community seems to have decided that Tuesday is the day to launch everything.
Two Tuesdays ago we got a Google announcement, Anthropic’s Claude and GPT-4. On Tuesday this week we got Google Bard, Bing Image Creator and Adobe Firefly.
I’ve written about a bunch of that stuff this month:
- 10th: ChatGPT can’t access the internet, even though it really looks like it can
- 11th: Large language models are having their Stable Diffusion moment
- 13th: Stanford Alpaca, and the acceleration of on-device large language model development
- 17th: Could you train a ChatGPT-beating model for $85,000 and run it in a browser?
- 18th: A conversation about prompt engineering with CBC Day 6
- 22nd: Don’t trust AI to talk accurately about itself: Bard wasn’t trained on Gmail
Apparently this blog is now partly focused on AI! If you want to stay up-to-date with my writing on this (and other) subjects you can subscribe to my atom feed, or you can sign up for my brand new Substack newsletter.
My blog as a newsletter
I know there are a lot of people out there who don’t habitually use a feed reader but do find great value from email newsletters.
simonw.substack.com is my new newsletter, which is effectively a way to subscribe to my blog via email.
I started it a few months ago when it looked like Twitter was about to collapse under the weight of its new mismanagement. I first promoted it at the bottom of my Large language models are having their Stable Diffusion moment post, and it’s since grown to 640 subscribers!
I plan to send it out around once a week, provided there’s material to send.
It will be mostly content from my blog, with maybe a paragraph or two of additional context added at the top highlighting themes of the past week (such as GPT-4).
The first two editions can be found here:
- March 13th: The Stable Diffusion moment for Large Language Models
- March 19th: Could you train a ChatGPT-beating model for $85,000 and run it in a browser?
A fun detail about my newsletter is how I’m generating it.
Substack doesn’t have an API, but I wanted to automate as much of the process of copying in data from my blog as possible.
I built myself an automation around copy and paste!
observablehq.com/@simonw/blog-to-newsletter is an Observable notebook I wrote which assembles most of the newsletter for me.
It works by running this SQL query against my datasette.simonwillison.net Datasette instance, which runs against a SQLite copy of my blog content (a PostgreSQL/Django app) built by a GitHub Action in this repository.
The SQL query assembles a string of HTML which is rendered in the notebook. There’s also a “Copy to clipboard” button which uses this JavaScript pattern to copy a rich text representation of the HTML to the clipboard.
When I hit “paste” in the Substack editor interface it converts that representation into Substack’s chosen subset of HTML. Then I can edit it by hand in the Substack editor.
This is working really well so far—it’s really easy to tweak the generated HTML in the Observable notebook, and once I’ve transferred it to Substack I can re-arrange things and add my own extra commentary to the top of the newsletter before hitting send.
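If you’re curious what that looks like in code, here’s a sketch of the core idea: fetch recent content from the Datasette JSON API using a SQL query that concatenates HTML. The database, table and column names here are illustrative guesses, not the real schema or the real notebook query:

# Sketch: pull recent blog content as HTML via the Datasette JSON API
import requests

sql = """
select '<h3>' || title || '</h3>' || body as html
from blog_entry
order by created desc
limit 10
"""
rows = requests.get(
    "https://datasette.simonwillison.net/simonwillisonblog.json",
    params={"sql": sql, "_shape": "array"},
).json()
newsletter_html = "\n".join(row["html"] for row in rows)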
Datasette’s new JSON API
I finally landed a GIANT branch I’ve been working on for several months now: a complete redesign of Datasette’s default JSON format, one of the largest changes I need to land prior to releasing Datasette 1.0.
The previous default JSON format was a bit of a mess: it had dozens of keys, and presented the row data as an array of arrays (on the basis that the column names were available in a separate key, and rows as arrays would be more efficient in terms of bytes on the wire).
I always found myself adding ?_shape=array to that URL to get a smaller format, which strongly indicated that the default I had picked was the wrong one.
The new format can now be previewed here—it looks like this (truncated):
{
"ok": true,
"next": "d,v",
"rows": [
{
"pk1": "a",
"pk2": "a",
"content": "a-a"
},
{
"pk1": "a",
"pk2": "b",
"content": "a-b"
}
]
}
The default keys are "ok", "next" to indicate pagination (this is null if there are no extra pages) and "rows" with a list of JSON objects.
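Here’s a rough sketch of consuming that new format from Python, following the "next" token for pagination (I’m assuming the keyset pagination parameter stays ?_next=, as in the existing table JSON):

# Iterate through every row in the table, one page at a time
import requests

url = "https://latest.datasette.io/fixtures/sortable.json"
params = {}
while True:
    data = requests.get(url, params=params).json()
    for row in data["rows"]:
        print(row)
    if not data["next"]:  # null when there are no more pages
        break
    params["_next"] = data["next"]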
If you want extra keys—like a total row count, or a list of columns, or some suggested facets—you can request them using the new ?_extra= parameter—for example:
https://latest.datasette.io/fixtures/sortable.json?_extra=columns&_extra=count&_extra=suggested_facets
This returns a response that starts like this:
{
"ok": true,
"next": "d,v",
"count": 201,
"columns": [
"pk1",
"pk2",
"content",
"sortable",
"sortable_with_nulls",
"sortable_with_nulls_2",
"text"
],
"suggested_facets": [
{
"name": "pk1",
"toggle_url": "https://latest.datasette.io/fixtures/sortable.json?_extra=columns&_extra=count&_extra=suggested_facets&_facet=pk1"
},
{
"name": "pk2",
"toggle_url": "https://latest.datasette.io/fixtures/sortable.json?_extra=columns&_extra=count&_extra=suggested_facets&_facet=pk2"
},
{
"name": "text",
"toggle_url": "https://latest.datasette.io/fixtures/sortable.json?_extra=columns&_extra=count&_extra=suggested_facets&_facet=text"
}
],
"rows": [
There’s still more work to do on this feature: I need to write the documentation for it, and figure out how it should affect the Datasette endpoint that returns results from an arbitrary SQL query. But it’s ready to preview, and I’m keen to get feedback on it as quickly as possible!
Please take a look, and provide feedback on this dedicated issue thread—or come and talk about it in the Datasette Discord.
Releases these weeks
- datasette-atom: 0.9—(12 releases total)—2023-03-14
  Datasette plugin that adds a .atom output format
- datasette-simple-html: 0.2—(2 releases total)—2023-03-12
  Datasette SQL functions for very simple HTML operations
- apple-notes-to-sqlite: 0.1—2023-03-09
  Export Apple Notes to SQLite
- datasette-render-html: 1.0—(4 releases total)—2023-03-09
  Plugin for selectively rendering the HTML in specific columns
- datasette: 0.64.2—(125 releases total)—2023-03-08
  An open source multi-tool for exploring and publishing data
TIL these weeks
- Mocking subprocess with pytest-subprocess
- Using ChatGPT to write AppleScript
- Using SQL with GDAL
- Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp
- Copy rich text to the clipboard
- How to read Hacker News threads with most recent comments first
- Scraping Reddit and writing data to the Datasette write API
- A simple Python implementation of the ReAct pattern for LLMs
- Use DuckDB to convert parquet to JSON and then open it in Datasette Lite
Don’t trust AI to talk accurately about itself: Bard wasn’t trained on Gmail four days ago
Earlier this month I wrote about how ChatGPT can’t access the internet, even though it really looks like it can. Consider this part two in the series. Here’s another common and non-intuitive mistake people make when interacting with large language model AI systems: asking them questions about themselves.
(When I say “large language model AI systems” I mean tools like ChatGPT, Bing and the brand new Google Bard.)
Asking questions like this feels like a natural thing to do: these bots use “I” pronouns (I really wish they wouldn’t) and will very happily answer questions about themselves—what they can do, how they work, even their own opinions (I really wish they wouldn’t do that).
These questions are likely to produce realistic but misleading answers. They come with a surprisingly high risk of hallucination.
What harm could possibly come from asking them how they work?
No, Bard was not trained on text from private Gmail accounts
The following screenshot started circulating on Twitter earlier today. I’m not linking to it directly because I don’t want to encourage it to spread further without extra context.
The user asked:
where does Bard’s dataset come from?
Bard replied like this—emphasis mine:
Bard’s dataset comes from a variety of sources, including:
- Publicly available datasets: These include datasets of text and code from the web, such as Wikipedia, GitHub, and Stack Overflow.
- Google’s internal data: This includes data from Google Search, Gmail, and other products.
- Data from third-party companies: This includes data from companies that have partnered with Google to provide data for Bard’s training.
The specific sources of Bard’s dataset are not publicly available, but Google AI has said that the dataset is “as comprehensive as possible”.
Unsurprisingly, the idea that Bard might have been trained on internal data from Gmail sparked a lot of concern!
Here’s an official denial:
Bard is an early experiment based on Large Language Models and will make mistakes. It is not trained on Gmail data. -JQ
- Google Workspace (@GoogleWorkspace) March 21, 2023
(I have some supporting arguments below, in case the official denial isn’t convincing enough for you.)
Bard was not trained on Gmail. So why on earth did Bard say that it was?
Language models have no concept of “self”
As always with language models, the trick to understanding why they sometimes produce wildly inappropriate output like this is to think about how they work.
A large language model is a statistical next-word / next-sentence predictor. Given the previous sequence of words (including the user’s prompt), it uses patterns from the vast amount of data it has been trained on to find a statistically satisfying way to continue that text.
As such, there’s no mechanism inside a language model to help it identify that questions of the form “how do you work?” should be treated any differently than any other question.
We can give it hints: many chatbot models are pre-seeded with a short prompt that says something along the lines of “You are Assistant, a large language model trained by OpenAI” (seen via a prompt leak).
And given those hints, it can at least start a conversation about itself when encouraged to do so.
But as with everything else about language models, it’s an illusion. It’s not talking about itself, it’s completing a sentence that starts with “I am a large language model trained by ...”.
So when it outputs “Google’s internal data:”, the obvious next words might turn out to be “This includes data from Google Search, Gmail, and other products”—they’re statistically likely to follow, even though they don’t represent the actual truth.
This is one of the most unintuitive things about these models. The obvious question here is why: why would Bard lie and say it had been trained on Gmail when it hadn’t?
It has no motivations to lie or tell the truth. It’s just trying to complete a sentence in a satisfactory way.
What does “satisfactory” mean? It’s likely been guided by RLHF—Reinforcement Learning from Human Feedback—which the ChatGPT development process has excelled at. Human annotators help train the model by labelling responses as satisfactory or not. Google apparently recruited the entire company to help with this back in February.
I’m beginning to suspect that the perceived difference in quality between different language model AIs is influenced much more heavily by this fine-tuning level of training than it is by the underlying model size and quality itself. The enormous improvements the Alpaca fine-tuning brought to the tiny LLaMA 7B model have reinforced my thinking around this.
I think Bard’s fine-tuning still has a long way to go.
Current information about itself couldn’t have been in the training data
By definition, the model’s training data must have existed before the model itself was trained. Most models have a documented cut-off date on their training data: for OpenAI’s models that’s currently September 2021. I don’t believe Google have shared the cut-off date for the LaMDA model used by Bard.
If it was trained on content written prior to its creation, it clearly can’t understand details about its own specific “self”.
ChatGPT can answer pretty detailed questions about GPT-3, because that model had been iterated on and written about publicly for several years prior to its training cut-off. But questions about its most recent model, by definition, cannot be answered just using data that existed in its training set.
But Bard can consult data beyond its training!
Here’s where things get a bit tricky.
ChatGPT is a “pure” interface to a model: when you interact with it, you’re interacting with the underlying language model directly.
Google Bard and Microsoft Bing are different: they both include the ability to consult additional sources of information, in the form of the Google and Bing search indexes.
Effectively, they’re allowed to augment their training data with additional information fetched from a search.
This sounds more complex than it actually is: effectively they can run an external search, get back some results, paste them invisibly into the ongoing conversation and use that new text to help answer questions.
(I’ve built a very simple version of this pattern myself a couple of times, described in How to implement Q&A against your documentation with GPT3, embeddings and Datasette and A simple Python implementation of the ReAct pattern for LLMs.)
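Stripped down to its essence, the pattern looks something like this. It’s a sketch, not either of those implementations, with a stand-in for the search step and the OpenAI Python library as it exists today:

# Sketch of retrieval-augmented prompting: paste search results into the prompt
import openai  # assumes openai.api_key is configured


def search_google(query):
    # Stand-in for a real search API call; returns text to paste into the conversation
    return "LaMDA was pre-trained on public dialog data and other public web documents."


def answer_with_search(question):
    results = search_google(question)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided search results."},
            {"role": "user", "content": f"Search results:\n{results}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]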
As such, one would hope that Bard could offer a perfect answer to any question about itself. It should be able to do something like this:
User: Where does Bard’s dataset come from?
Bard: (invisible): search Google for “Bard dataset”
Bard: (invisible): search results said: ... big chunk of text from the Google indexed documents ...
Bard: My underlying model LaMDA was trained on public dialog data and other public web documents.
Clearly it didn’t do that in this case! Or if it did, it summarized the information it got back in a misleading way.
I expect Bard will have a much better answer for this question within a day or two—a great thing about running models with augmented data in this way is that you can improve their answers without having to train the underlying model again from scratch every time.
More reasons that LaMDA wouldn’t be trained on Gmail
When I first saw the claim from that original screenshot, I was instantly suspicious.
Taking good care of the training data that goes into a language model is one of the most important and challenging tasks in all of modern AI research.
Using the right mix of content, with the right mix of perspectives, and languages, and exposure to vocabulary, is absolutely key.
If you train a model on bad sources of training data, you’ll get a really badly behaved model.
The problem is that these models require far more text than any team of humans could ever manually review.
The LaMDA paper describes the training process like so:
LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone, we pre-trained LaMDA on a dataset created from public dialog data and other public web documents. Therefore, LaMDA can be used as a general language model prior to fine-tuning.
The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words
1.56 trillion words!
Appendix E has more details:
The composition of the data is as follows: 50% dialogs data from public forums; 12.5% C4 data t5; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc; 12.5% Wikipedia (English); 6.25% English web documents; and 6.25% Non-English web documents.
“C4 data t5” I believe relates to Common Crawl.
So why not mix in Gmail too?
First, in order to analyze the training data you need to be able to have your research team view it—they need to run spot checks, and build and test filtering algorithms to keep the really vile stuff to a minimum.
At large tech companies like Google, the ability for members of staff to view private data held in trust for their users is very tightly controlled. It’s not the kind of thing you want your machine learning training team to be poking around in... and if you work on those teams, even having the ability to access that kind of private data represents a substantial personal legal and moral risk.
Secondly, think about what could go wrong. What if a language model leaked details of someone’s private lives in response to a prompt from some other user?
This would be a PR catastrophe. Would people continue to trust Gmail or other Google products if they thought their personal secrets were being exposed to anyone who asked Bard a question? Would Google ever want to risk finding out the answer to that question?
The temptations of conspiratorial thinking
Are you still not convinced? Are you still suspicious that Google trained Bard on Gmail, despite both their denials and my logic as to why they wouldn’t ever want to do this?
Ask yourself how much you want to believe that this story is true.
This modern AI stuff is deeply weird, and more than a little frightening.
The companies involved are huge, secretive and are working on technology which serious people have grave concerns about.
It’s so easy to fall into the trap of conspiratorial thinking around this stuff. Especially since some of the conspiracies might turn out to be true!
I don’t know how to best counter this most human of reactions. My best recommendation is to keep in mind that humans, like language models, are pattern matching machines: we jump to conclusions, especially if they might reinforce our previous opinions and biases.
If we’re going to figure this stuff out together, we have to learn when to trust our initial instincts and when to read deeper and think harder about what’s going on.
A conversation about prompt engineering with CBC Day 6 seven days ago
I’m on Canadian radio this morning! I was interviewed by Peter Armstrong for CBC Day 6 about the developing field of prompt engineering.
You can listen here on the CBC website.
CBC also published this article based on the interview, which includes some of my answers that didn’t make the audio version: These engineers are being hired to get the most out of AI tools without coding.
Here’s my own lightly annotated transcript (generated with the help of Whisper).
Peter: AI Whisperers, or more properly known as Prompt Engineers, are part of a growing field of humans who make their living working with AI
Their job is to craft precise phrases to get a desired outcome from an AI
Some experts are skeptical about how much control AI whisperers actually have
But more and more companies are hiring these prompt engineers to work with AI tools
There are even online marketplaces where freelance engineers can sell the prompts they’ve designed
Simon Willison is an independent researcher and developer who has studied AI prompt engineering
Good morning, Simon. Welcome to Day 6
Simon: Hi, it’s really great to be here
Peter: So this is a fascinating and kind of perplexing job
What exactly does a prompt engineer do?
Simon: So we have these new AI models that you can communicate to with English language
You type them instructions in English and they do the thing that you ask them to do, which feels like it should be the easiest thing in the world
But it turns out actually getting great results out of these things, using them for the kinds of applications people want, sort of summarization and extracting facts, requires a lot of quite deep knowledge as to how to use them and what they’re capable of and how to get the best results out of them
So, prompt engineering is essentially the discipline of becoming an expert in communicating with these things
It’s very similar to being a computer programmer except weird and different in all sorts of new ways that we’re still trying to understand
Peter: You’ve said in some of your writing and talking about this that it’s important for prompt engineers to resist what you call superstitious thinking
What do you mean by that?
My piece In defense of prompt engineering talks about the need to resist superstitious thinking.
Simon: It’s very easy when talking to one of these things to think that it’s an AI out of science fiction, to think that it’s like the Star Trek computer and it can understand and do anything
And that’s very much not the case
These systems are extremely good at pretending to be all powerful, all knowing things, but they have massive, massive flaws in them
So it’s very easy to become superstitious, to think, oh wow, I asked it to read this web page, I gave it a link to an article and it read it
It didn’t read it!
This is a common misconception that comes up when people are using ChatGPT. I wrote about this and provided some illustrative examples in ChatGPT can’t access the internet, even though it really looks like it can.
A lot of the time it will invent things that look like it did what you asked it to, but really it’s sort of imitating what would look like a good answer to the question that you asked it
Peter: Well, and I think that’s what’s so interesting about this, that it’s not sort of core science computer programming
There’s a lot of almost, is it fair to call it intuition
Like what makes a prompt engineer good at being a prompt engineer?
Simon: I think intuition is exactly right there
The way you get good at this is firstly by using these things a lot
It takes a huge amount of practice and experimentation to understand what these things can do, what they can’t do, and just little tweaks in how you talk to them might have huge effect in what they say back to you
Peter: You know, you talked a little bit about the assumption that we can’t assume this is some all-knowing futuristic AI that knows everything and yet you know we already have people calling these the AI whisperers which to my ears sounds a little bit mystical
How much of this is is you know magic as opposed to science?
Simon: The comparison to magic is really interesting because when you’re working with these it really can feel like you’re a sort of magician. You sort of cast spells at it, you don’t fully understand what they’re going to do, and it reacts sometimes well and sometimes it reacts poorly
And I’ve talked to AI practitioners who kind of talk about collecting spells for their spell book
But it’s also a very dangerous comparison to make because magic is, by its nature, impossible for people to comprehend and can do anything
And these AI models are absolutely not that
See Is the AI spell-casting metaphor harmful or helpful? for more on why magic is a dangerous comparison to make!
Fundamentally, they’re mathematics
And you can understand how they work and what they’re capable of if you put the work in
Peter: I have to admit, when I first heard about this, I thought it was a kind of a made up job or a bit of a scam to just get people involved
But the more I’ve read on it, the more I’ve understood that this is a real skill
But I do think back to, it wasn’t all that long ago that we had Google search specialists that helped you figure out how to search for something on Google
Now we all take for granted because we can do it
I wonder if you think, do prompt engineers have a future or are we all just going to eventually be able to catch up with them and use this AI more effectively?
Simon: I think a lot of prompt engineering will become a skill that people develop
Many people in their professional and personal lives are going to learn to use these tools, but I also think there’s going to be space for expertise
There will always be a level at which it’s worth investing sort of full-time experience in solving some of these problems, especially for companies that are building entire products around these AI engines under the hood
Peter: You know, this is a really exciting time
I mean, it’s a really exciting week
We’re getting all this new stuff
It’s amazing to watch people use it and see what they can do with it
And I feel like my brain is split
On the one hand, I’m really excited about it
On the other hand, I’m really worried about it
Are you in that same place?
And what are the things you’re excited about versus the things that you’re worried about?
Simon: I’m absolutely in the same place as you there
This is both the most exciting and the most terrifying technology I’ve ever encountered in my career
Something I’m personally really excited about right now is developments in being able to run these AIs on your own personal devices
I have a series of posts about this now, starting with Large language models are having their Stable Diffusion moment where I talk about first running a useful large language model on my own laptop.
Right now, if you want to use these things, you have to use them against cloud services run by these large companies
But there are increasing efforts to get them to scale down to run on your own personal laptops or even on your own personal phone
I ran a large language model that Facebook Research released just at the weekend on my laptop for the first time, and it started spitting out useful results
And that felt like a huge moment in terms of sort of the democratization of this technology, putting it into people’s hands and meaning that things where you’re concerned about your own privacy and so forth suddenly become feasible because you’re not talking to the cloud, you’re talking to the sort of local model
Peter: You know, if I typed into one of these chat bots, you know, should I be worried about the rise of AI
It would absolutely tell me not to be
If I ask you the same question, should we be worried and should we be spending more time figuring out how this is going to seep its way into various corners of our lives?
Simon: I think we should absolutely be worried because this is going to have a major impact on society in all sorts of ways that we don’t predict and some ways that we can predict
I’m not worried about the sort of science fiction scenario where the AI breaks out of my laptop and takes over the world
But there are many very harmful things you can do with a machine that can imitate human beings and that can produce realistic human text
My thinking on this was deeply affected by Emily M. Bender, who observed that “applications that aim to believably mimic humans bring risk of extreme harms” as highlighted in this fascinating profile in New York Magazine.
The fact that anyone can churn out very convincing but completely made up text right now will have a major impact in terms of how much can you trust the things that you’re reading online
If you read a review of a restaurant, was it written by a human being or did somebody fire up an AI model and generate 100 positive reviews all in one go?
So there are all sorts of different applications to this
Some are definitely bad, some are definitely good
And seeing how this all plays out is something that I think society will have to come to terms with over the next few months and the next few years
Peter: Simon, really appreciate your insight and just thanks for coming with us on the show today
Simon: Thanks very much for having me
For more related content, take a look at the prompt engineering and generative AI tags on my blog.
Could you train a ChatGPT-beating model for $85,000 and run it in a browser? eight days ago
I think it’s now possible to train a large language model with similar functionality to GPT-3 for $85,000. And I think we might soon be able to run the resulting model entirely in the browser, and give it capabilities that leapfrog it ahead of ChatGPT.
This is currently wild speculation on my part, but bear with me because I think this is worth exploring further.
Large language models with GPT-3-like capabilities cost millions of dollars to build, thanks to the cost of running the expensive GPU servers needed to train them. Whether you are renting or buying those machines, there are still enormous energy costs to cover.
Just one example of this: the BLOOM large language model was trained in France with the support of the French government. The cost was estimated as $2-5M, it took almost four months to train and boasts about its low carbon footprint because most of the power came from a nuclear reactor!
[ Fun fact: as of a few days ago you can now run the openly licensed BLOOM on your own laptop, using Nouamane Tazi’s adapted copy of the llama.cpp code that made that possible for LLaMA ]
Recent developments have made me suspect that these costs could be made dramatically lower. I think a capable language model can now be trained from scratch for around $85,000.
It’s all about that LLaMA
The LLaMA plus Alpaca combination is the key here.
I wrote about these two projects previously:
- Large language models are having their Stable Diffusion moment discusses the significance of LLaMA
- Stanford Alpaca, and the acceleration of on-device large language model development describes Alpaca
To recap: LLaMA by Meta research provided a GPT-3 class model trained entirely on documented, available public training information, as opposed to OpenAI’s continuing practice of not revealing the sources of their training data.
This makes the model training a whole lot more likely to be replicable by other teams.
The paper also describes some enormous efficiency improvements they made to the training process.
The LLaMA research was still extremely expensive though. From the paper:
... we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models
My friends at Replicate told me that a simple rule of thumb for A100 cloud costs is $1/hour.
2048 * 5 * 30 * 24 = $7,372,800
But... that $7M was the cost to both iterate on the model and to train all four sizes of LLaMA that they tried: 7B, 13B, 33B, and 65B.
Here’s Table 15 from the paper, showing the cost of training each model.
This shows that the smallest model, LLaMA-7B, was trained on 82,432 hours of A100-80GB GPUs, consuming 36MWh of power and generating 14 tons of CO2.
(That’s about 28 people flying from London to New York.)
Going by the $1/hour rule of thumb, this means that provided you get everything right on your first run you can train a LLaMA-7B scale model for around $82,432.
Upgrading to Alpaca
You can run LLaMA 7B on your own laptop (or even on a phone), but you may find it hard to get good results out of. That’s because it hasn’t been instruction tuned, so it’s not great at answering the kind of prompts that you might send to ChatGPT or GPT-3 or 4.
Alpaca is the project from Stanford that fixes that. They fine-tuned LLaMA on 52,000 instructions (of somewhat dubious origin) and claim to have gotten ChatGPT-like performance as a result... from that smallest 7B LLaMA model!
You can try out their demo (update: no you can’t, “Our live demo is suspended until further notice”) and see for yourself that it really does capture at least some of that ChatGPT magic.
The best bit? The Alpaca fine-tuning can be done for less than $100. The Replicate team have repeated the training process and published a tutorial about how they did it.
Other teams have also been able to replicate the Alpaca fine-tuning process, for example antimatter15/alpaca.cpp on GitHub.
We are still within our $85,000 budget! And Alpaca—or an Alpaca-like model using different fine tuning data—is the ChatGPT on your own device model that we’ve all been hoping for.
Could we run it in a browser?
Alpaca is effectively the same size as LLaMA 7B—around 3.9GB (after 4-bit quantization, à la llama.cpp). And LLaMA 7B has already been shown running on a whole bunch of different personal devices: laptops, Raspberry Pis (very slowly) and even a Pixel 5 phone at a decent speed!
The next frontier: running it in the browser.
I saw two tech demos yesterday that made me think this may be possible in the near future.
The first is Transformers.js. This is a WebAssembly port of the Hugging Face Transformers library of models—previously only available for server-side Python.
It’s worth spending some time with their demos, which include some smaller language models and some very impressive image analysis models too.
The second is Web Stable Diffusion. This team managed to get the Stable Diffusion generative image model running entirely in the browser as well!
Web Stable Diffusion uses WebGPU, a still emerging standard that’s currently only working in Chrome Canary. But it does work! It rendered me this image of two raccoons eating a pie in the forest in 38 seconds.
The Stable Diffusion model this loads into the browser is around 1.9GB.
LLaMA/Alpaca at 4bit quantization is 3.9GB.
The sizes of these two models are similar enough that I would not be at all surprised to see an Alpaca-like model running in the browser in the not-too-distant future. I wouldn’t be surprised if someone is working on that right now.
Now give it extra abilities with ReAct
A model running in your browser that behaved like a less capable version of ChatGPT would be pretty impressive. But what if it could be MORE capable than ChatGPT?
The ReAct prompt pattern is a simple, proven way of expanding a language model’s abilities by giving it access to extra tools.
Matt Webb explains the significance of the pattern in The surprising ease and effectiveness of AI in a loop.
I got it working with a few dozen lines of Python myself, which I described in A simple Python implementation of the ReAct pattern for LLMs.
Here’s the short version: you tell the model that it must think out loud and now has access to tools. It can then work through a question like this:
Question: Population of Paris, squared?
Thought: I should look up the population of paris and then multiply it
Action: search_wikipedia: Paris
Then it stops. Your code harness for the model reads that last line, sees the action and goes and executes an API call against Wikipedia. It continues the dialog with the model like this:
Observation: <truncated content from the Wikipedia page, including the 2,248,780 population figure>
The model continues:
Thought: Paris population is 2,248,780 I should square that
Action: calculator: 2248780 ** 2
Control is handed back to the harness, which passes that to a calculator and returns:
Observation: 5057011488400
The model then provides the answer:
Answer: The population of Paris squared is 5,057,011,488,400
Adding new actions to this system is trivial: each one can be a few lines of code.
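Here’s a condensed sketch of the harness side of this pattern (not the implementation from my earlier post, just the shape of it), with actions registered in a plain dictionary:

# Sketch of a ReAct harness: detect an Action line and run the matching tool
import re

actions = {
    # eval() is fine for a toy calculator, but unsafe for untrusted input
    "calculator": lambda expression: str(eval(expression)),
    # "search_wikipedia": ...  each new action is just another function
}


def run_action(model_output):
    match = re.search(r"^Action: (\w+): (.*)$", model_output, re.MULTILINE)
    if match is None:
        return None  # the model produced a final Answer instead
    name, argument = match.groups()
    return "Observation: " + actions[name](argument)


print(run_action("Thought: I should square that\nAction: calculator: 2248780 ** 2"))
# Observation: 5057011488400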
But as the ReAct paper demonstrates, adding these capabilities to even an under-powered model (such as LLaMA 7B) can dramatically improve its abilities, at least according to several common language model benchmarks.
This is essentially what Bing is! It’s GPT-4 with the added ability to run searches against the Bing search index.
Obviously if you’re going to give a language model the ability to execute API calls and evaluate code you need to do it in a safe environment! Like for example... a web browser, which runs code from untrusted sources as a matter of habit and has the most thoroughly tested sandbox mechanism of any piece of software we’ve ever created.
Adding it all together
There are a lot more groups out there that can afford to spend $85,000 training a model than there are that can spend $2M or more.
I think LLaMA and Alpaca are going to have a lot of competition soon, from an increasing pool of openly licensed models.
A fine-tuned LLaMA scale model is leaning in the direction of a ChatGPT competitor already. But... if you hook in some extra capabilities as seen in ReAct and Bing even that little model should be able to way outperform ChatGPT in terms of actual ability to solve problems and do interesting things.
And we might be able to run such a thing on our phones... or even in our web browsers... sooner than you think.
And it’s only going to get cheaper
H100s are shipping and you can half this again. Twice (or more) if fp8 works.
- tobi lutke (@tobi) March 17, 2023
The H100 is the new Tensor Core GPU from NVIDIA, which they claim can offer up to a 30x performance improvement over their current A100s.
Stanford Alpaca, and the acceleration of on-device large language model development 12 days ago
On Saturday 11th March I wrote about how Large language models are having their Stable Diffusion moment. Today is Monday. Let’s look at what’s happened in the past three days.
- Later on Saturday: Artem Andreenko reports that llama.cpp can run the 4-bit quantized 7B LLaMA language model on a 4GB RaspberryPi—at 10 seconds per token, but still hugely impressive.
- Sunday 12th March: cocktailpeanut releases Dalai, a “dead simple way to run LLaMA on your computer”: npx dalai llama and npx dalai serve.
- 13th March (today): Anish Thite reports llama.cpp running on a Pixel 6 phone (26 seconds per token). Update 14th March: Now 1 second per token on an older Pixel 5!
- Also today: a team at Stanford released Alpaca: A Strong Open-Source Instruction-Following Model—fine-tuned from the LLaMA 7B model.
When I talked about a “Stable Diffusion moment” this is the kind of thing I meant: the moment this stuff is available for people to experiment with, things accelerate.
I’m going to dive into Alpaca in detail.
Stanford’s Alpaca
Here’s the introduction to the Alpaca announcement:
We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<600$).
The biggest weakness in the LLaMA models released by Meta research last month is their lack of instruction-tuning.
A language model is a sentence completion engine. You give it a sequence of words, “The first man on the moon was”, and it completes that sentence, hopefully with useful content.
One of the great innovations from OpenAI was their application of instruction tuning to GPT-3:
To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.
Prior to this, you had to think very carefully about how to construct your prompts. Thanks to instruction tuning you can be a lot more, well, human in the way you interact with the model. “Write me a poem about pandas!” now works as a prompt, instead of “Here is a poem about pandas:”.
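To make that concrete, here’s the difference expressed against the OpenAI completions API (as it exists today; text-davinci-003 is the instruction-tuned model, plain davinci is the older base model; this assumes openai.api_key is configured):

# Instruction-tuned model: just ask for what you want
import openai

openai.Completion.create(
    model="text-davinci-003",
    prompt="Write me a poem about pandas!",
    max_tokens=200,
)

# Base model: set the prompt up so the poem is the natural continuation
openai.Completion.create(
    model="davinci",
    prompt="Here is a poem about pandas:\n",
    max_tokens=200,
)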
The LLaMA models had not been through this process. The LLaMA FAQ acknowledges this:
Keep in mind these models are not finetuned for question answering. As such, they should be prompted so that the expected answer is the natural continuation of the prompt. [...] Overall, always keep in mind that models are very sensitive to prompts (particularly when they have not been finetuned).
This is an enormous usability problem.
One of my open questions about LLaMA was how difficult and expensive it would be to fine-tune it such that it could respond better to instructions.
Thanks to the team at Stanford we now have an answer: 52,000 training samples and $100 of training compute! From their blog post:
Fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers.
Something that stuns me about Alpaca is the quality they claim to be able to get from the 7B model—the smallest of the LLaMA models, and the one which has been seen running (albeit glacially slowly) on a RaspberryPi and a mobile phone! Here’s one example from their announcement:
I would be impressed to see this from the 65B (largest) LLaMA model—but getting this from 7B is spectacular.
Still not for commercial usage
I’ll quote the Stanford announcement on this in full:
We emphasize that Alpaca is intended only for academic research and any commercial use is prohibited. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial license, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.
So it’s still not something we can use to build commercial offerings—but for personal research and tinkering it’s yet another huge leap forwards.
What does this demonstrate?
The license of the LLaMA model doesn’t bother me too much. What’s exciting to me is what this all proves:
- LLaMA itself shows that it’s possible to train a GPT-3 class language model using openly available resources. The LLaMA paper includes details of the training data, which is entirely from publicly available sources (which include CommonCrawl, GitHub, Wikipedia, ArXiv and StackExchange).
- llama.cpp shows that you can then use some tricks to run that language model on consumer hardware—apparently anything with 4GB or more of RAM is enough to at least get it to start spitting out tokens!
- Alpaca shows that you can apply fine-tuning with a feasibly sized set of examples (52,000) and cost ($100) such that even the smallest of the LLaMA models—the 7B one, which can compress down to a 4GB file with 4-bit quantization—provides results that compare well to cutting edge text-davinci-003 in initial human evaluation.
One thing that’s worth noting: the Alpaca 7B comparison likely used the full-sized 13.48GB 16-bit floating point 7B model, not the smaller 4GB 4-bit quantized model used by llama.cpp. I’ve not yet seen a robust comparison of quality between the two.
Exploring the Alpaca training data with Datasette Lite
The Alpaca team released the 52,000 fine-tuning instructions they used as a 21.7MB JSON file in their GitHub repository.
My Datasette Lite tool has the ability to fetch JSON from GitHub and load it into an in-browser SQLite database. Here’s the URL to do that:
This will let you browse the 52,000 examples in your browser.
But we can do a step better than that: here’s a SQL query that runs LIKE queries to search through those examples, considering all three text columns:
select instruction, input, output from alpaca_data
where instruction || ' ' || input || ' ' || output like '%' || :search || '%'
order by random()
I’m using order by random() because why not? It’s more fun to explore that way.
The following link will both load the JSON file and populate and execute that SQL query, plus allow you to change the search term using a form in your browser:
What’s next?
This week is likely to be wild. OpenAI are rumored to have a big announcement on Tuesday—possibly GPT-4? And I’ve heard rumors of announcements from both Anthropic and Google this week as well.
I’m still more excited about seeing what happens next with LLaMA. Language models on personal devices is happening so much faster than I thought it would.
Bonus: The source of that training data? GPT-3!
Here’s a fascinating detail: Those 52,000 samples they used to fine-tune the model? Those were the result of a prompt they ran against GPT-3 itself! Here’s the prompt they used:
You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions. Here are the requirements: 1. Try not to repeat the verb for each instruction to maximize diversity. 2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons. 3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc. 2. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action. 3. The instructions should be in English. 4. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted. 5. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words. 6. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field. 7. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words. List of 20 tasks:
Then they include three random example instructions from a list of 175 they had prepared by hand. The completed prompt sent to OpenAI would include the above instructions followed by something like this:
### 1. Instruction: Explain the following idiom to me, and try to give me some examples. 1. Input: black sheep 1. Output: Meaning: An outcast. Someone who doesn’t fit in with the rest of the crowd. They take pride in being different. Thinks for themselves and doesn’t care what no one else has to say. They tend to ride their own wave and are usually loners because no one understands them, but its okay because they like it that way. Example: He’s the black sheep of the family. ### 2. Instruction: Generate a haiku using the following word: 2. Input: summer 2. Output: The chill, worming in Shock, pleasure, bursting within Summer tongue awakes ### 3. Instruction: Recommend a movie for me to watch during the weekend and explain the reason. 3. Input: 3. Output: I would recommend the movie "The Shawshank Redemption" because it is an excellent movie that is both moving and inspiring. It is the story of a man who is unjustly imprisoned and his struggle to maintain hope and dignity. It is a great film to watch over the weekend because it will make you think about the human capacity for resilience and hope. ### 4. Instruction:
GPT-3 would then fill in the rest. You can try this in the GPT-3 Playground to see it in action (paste from here).
Here’s the Python script that assembles that all together.
They spent $500 on OpenAI credits to assemble the 52,000 examples they used to fine-tune their model.
As they note in their announcement, generating examples in this way is actually mentioned in the OpenAI terms of use:
You may not [...] (iii) use the Services to develop foundation models or other large scale models that compete with OpenAI
There’s a related concept to this called Model Extraction, where people build new models that emulate the behaviour of others by firing large numbers of examples through the other model and training a new one based on the results.
I don’t think the way Alpaca was trained quite counts as a classic Model Extraction attack, but it certainly echoes one.