Simon Willison’s Weblog



Recent entries

I built a ChatGPT plugin to answer questions about data hosted in Datasette one day ago

Yesterday OpenAI announced support for ChatGPT plugins. It’s now possible to teach ChatGPT how to make calls out to external APIs and use the responses to help generate further answers in the current conversation.

My open source project Datasette makes it easy to host data online, and already provides a JSON API for executing SQLite queries.

I know from past experiments that GPT3/4 are able to construct SQLite-compatible SQL queries. I figured that Datasette and ChatGPT would be a perfect combination.

Datasette itself has a plugin system which can be used to add new features, including the new URL routes needed by the ChatGPT plugin specification.

So I built datasette-chatgpt-plugin—a Datasette plugin that turns a Datasette instance into a ChatGPT plugin, and allows you to ask questions of your own data directly from ChatGPT.


The new plugin can be installed using datasette install datasette-chatgpt-plugin. It’s only useful if it’s deployed online though, since ChatGPT needs to be able to access it.

I’m running it on the official website—which is itself a customized instance of Datasette. The plugin provides access to the database, which has tables for things like Datasette news, releases and available plugins.

Here’s a video demo showing some of the things ChatGPT can do with the plugin installed.

I’ve only just got this working, but here’s a query that illustrates how effective it can be. I asked:

What are the most popular plugins?

It gave me back the following:

Prompt: What are the most popular plugins? ChatGPT said: Used Query UNVERIFIED. Here are the most popular Datasette plugins based on the number of GitHub stars: 1. Plugin: datasette-dashboards GitHub Repository: rclement/datasette-dashboards GitHub Stars: 84 2. Plugin: dogsheep-beta GitHub Repository: dogsheep/dogsheep-beta GitHub Stars: 78 3. Plugin: datasette-graphql GitHub Repository: simonw/datasette-graphql GitHub Stars: 75 4. Plugin: datasette-ripgrep GitHub Repository: simonw/datasette-ripgrep GitHub Stars: 63 5. Plugin: datasette-scraper GitHub Repository: cldellow/datasette-scraper GitHub Stars: 44 These are the top 5 most popular Datasette plugins based on the number of stars they have received on GitHub. If you have any other questions or need more information, please let me know! There follows a gallery of thumbnails of those links.

Clicking the little arrow next to “Used query” reveals the API call it made, exposing the SQL query it generated:

Used query expanded - shows JSON for the query and the returned response

The API call it made was:

{
  "sql": "SELECT name, full_name, stargazers_count FROM plugins ORDER BY stargazers_count DESC LIMIT 5",
  "_shape": "array"
}

You can try that query in Datasette here—or here’s the JSON version.
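The API call above is an ordinary GET request with two query-string parameters. A minimal sketch of building that URL, assuming a hypothetical endpoint (`https://example.com/content.json` is a placeholder, not the real address):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: swap in the .json URL of your own Datasette database.
BASE_URL = "https://example.com/content.json"


def build_query_url(sql: str, base_url: str = BASE_URL) -> str:
    """Return a GET URL asking Datasette to run `sql` and reply with a JSON array."""
    params = {"sql": sql, "_shape": "array"}
    return f"{base_url}?{urlencode(params)}"


url = build_query_url(
    "SELECT name, full_name, stargazers_count FROM plugins "
    "ORDER BY stargazers_count DESC LIMIT 5"
)
print(url)

# Decode the query string again to confirm the parameters survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["_shape"])
```

`urlencode` takes care of escaping the spaces and commas in the SQL, which is the part that is easy to get wrong by hand.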

Here’s the JSON that was returned:

[
  {
    "name": "datasette-dashboards",
    "full_name": "rclement/datasette-dashboards",
    "stargazers_count": 84
  },
  {
    "name": "dogsheep-beta",
    "full_name": "dogsheep/dogsheep-beta",
    "stargazers_count": 78
  },
  {
    "name": "datasette-graphql",
    "full_name": "simonw/datasette-graphql",
    "stargazers_count": 75
  },
  {
    "name": "datasette-ripgrep",
    "full_name": "simonw/datasette-ripgrep",
    "stargazers_count": 63
  },
  {
    "name": "datasette-scraper",
    "full_name": "cldellow/datasette-scraper",
    "stargazers_count": 44
  }
]

ChatGPT turns the JSON into a nice human-readable reply. It also knows how to generate thumbnails from og:image metadata, adding a gallery of those to the end of the reply.

How the plugin works

Building ChatGPT plugins, like so much involving Large Language Models, is both really easy and deceptively complicated.

You give ChatGPT a short, human-ish language description of your plugin and how to use it, and a machine-readable OpenAPI schema with the details of the API.

And that’s it! The language model figures out everything else.

Datasette exposes a JSON API that speaks SQL. ChatGPT knows SQL already, so all my prompt needed to do was give it some hints—in particular tell it to use the SQLite dialect.

Here’s the prompt I’m using at the moment:

Run SQLite queries against a database hosted by Datasette. Datasette supports most SQLite syntax but does not support PRAGMA statements. Use select group_concat(sql, ';') from sqlite_master to see the list of tables and their columns Use select sql from sqlite_master where name = 'table_name' to see the schema for a table, including its columns. Instead of PRAGMA table_info(table_name) use select * from pragma_table_info('table_name'). PRAGMA statements are not allowed. select * from pragma_table_info('table_name') is allowed.

In my early experiments it kept trying to run PRAGMA table_info(), hence my increasingly frustrated prompts about that!

With hindsight, I don’t think it was re-fetching my prompt while I was developing the plugin, so those repeated warnings probably aren’t needed.
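The queries recommended in the prompt can be tried against any SQLite database. A small sketch using Python's built-in `sqlite3` module and an in-memory database; the `plugins` table here is a stand-in created for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plugins (name TEXT, full_name TEXT, stargazers_count INTEGER)"
)

# Every table definition in one string, as suggested in the prompt.
all_schemas = conn.execute(
    "select group_concat(sql, ';') from sqlite_master"
).fetchone()[0]

# The schema for a single table.
schema = conn.execute(
    "select sql from sqlite_master where name = 'plugins'"
).fetchone()[0]

# Column details through the table-valued function rather than a PRAGMA statement.
# Each row is (cid, name, type, notnull, dflt_value, pk); index 1 is the column name.
columns = [
    row[1] for row in conn.execute("select * from pragma_table_info('plugins')")
]

print(schema)
print(columns)
```

Because `pragma_table_info()` is used inside a plain `select`, it works in contexts where only read queries are accepted.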

Your application needs to serve two additional pages: a plugin description at /.well-known/ai-plugin.json and an OpenAPI schema linked to by that description.

You can see those two pages for my deployment here:

The ai-plugin.json file currently looks like this:

{
  "schema_version": "v1",
  "name_for_model": "datasette_datasette_io_3c330f",
  "name_for_human": "Query",
  "description_for_model": "Run SQLite queries against a database hosted by Datasette.\nDatasette supports most SQLite syntax but does not support PRAGMA statements.\nUse `select group_concat(sql, ';') from sqlite_master` to see the list of tables and their columns\nUse `select sql from sqlite_master where name = 'table_name'` to see the schema for a table, including its columns.\nInstead of `PRAGMA table_info(table_name)` use `select * from pragma_table_info('table_name')`\nPRAGMA statements are not allowed. `select * from pragma_table_info('table_name') is allowed.",
  "description_for_human": "Run SQL against data in Datasette.",
  "auth": {
    "type": "none"
  },
  "api": {
    "type": "openapi",
    "url": "",
    "has_user_authentication": false
  },
  "logo_url": "",
  "contact_email": "",
  "legal_info_url": ""
}

Since they use that `.well-known` URL format, it’s possible to find them for other services. Here’s ai-plugin.json for Wolfram Alpha.
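A rough sketch of generating a manifest with the same keys as the file above. The key names come from that file; the base URL, the schema path and the `example_plugin` name are placeholders invented for illustration, and this is not how datasette-chatgpt-plugin itself builds the document:

```python
import json


def build_manifest(base_url: str, description_for_model: str) -> dict:
    """Assemble an ai-plugin.json style document for a service at base_url."""
    return {
        "schema_version": "v1",
        "name_for_model": "example_plugin",  # hypothetical name
        "name_for_human": "Query",
        "description_for_model": description_for_model,
        "description_for_human": "Run SQL against data in Datasette.",
        "auth": {"type": "none"},
        "api": {
            "type": "openapi",
            # Hypothetical location of the OpenAPI schema on the same host.
            "url": f"{base_url}/chatgpt-openapi-schema.yml",
            "has_user_authentication": False,
        },
        "logo_url": "",
        "contact_email": "",
        "legal_info_url": "",
    }


manifest = build_manifest(
    "https://example.com",
    "Run SQLite queries against a database hosted by Datasette.",
)
print(json.dumps(manifest, indent=2))
```

The resulting JSON would be served at /.well-known/ai-plugin.json, with `api.url` pointing at wherever the schema file lives.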

And the chatgpt-openapi-schema.yml file contains this:

openapi: 3.0.1
info:
  title: Datasette API
  description: Execute SQL queries against a Datasette database and return the results as JSON
  version: 'v1'
servers:
  - url:
paths:
  ...:
    get:
      operationId: query
      summary: Execute a SQLite SQL query against the content database
      description: Accepts SQLite SQL query, returns JSON. Does not allow PRAGMA statements.
      parameters:
      - name: sql
        in: query
        description: The SQL query to be executed
        required: true
        schema:
          type: string
      - name: _shape
        in: query
        description: The shape of the response data. Must be "array"
        required: true
        schema:
          type: string
          enum:
            - array
      responses:
        '200':
          description: Successful SQL results
          content:
            application/json:
              schema:
                type: array
                items:
                  type: object
        '400':
          description: Bad request
        '500':
          description: Internal server error

I haven’t actually used OpenAPI schemas before... so I got ChatGPT to write the initial version for me, using the following prompt:

Write an OpenAPI schema explaining the*+from+facetable&_shape=array GET API which accepts SQL and returns an array of JSON objects

For a detailed account of how I built the plugin, take a look at my notes in issue #1 in the repository.

I prototyped the initial plugin using Glitch, because that’s the fastest way I know to get a live-on-the-web application which constantly reflects new changes to the code. This made iterating much faster... on the OpenAPI schema at least. As far as I can tell ChatGPT only loads that ai-plugin.json file once, which is frustrating because it means you have to deploy a new copy of the application to get it to re-read that crucial prompt.

I ended up doing most of my prompt engineering in ChatGPT itself though—I could tell it "Instead of PRAGMA table_info(table_name) use select * from pragma_table_info('table_name')" and then re-try my previous question to see if the new instruction fixed any problems I was having.

The bad news: it can hallucinate

Here’s the bad news. I’ve been playing with this for only a short time, so I’m still exploring its abilities. I’ve already had a couple of instances of it hallucinating answers despite having looked them up in the database first.

I’m hoping I can address this somewhat with further prompt engineering—“only use information returned from the query to answer the question” kind of stuff. But I can’t guarantee I’ll be able to suppress this entirely, which for a database querying tool is an extremely serious problem.

More about this, including some examples, in issue #2 in the repo.

My current theory is that this relates to length limits. I’ve noticed it happens when the query returns a large amount of data—the full content of tutorials for example. I think ChatGPT is silently truncating that data to fit the token limit, and is then hallucinating new information to fill in for what ends up missing.
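If that theory is right, one possible mitigation is to cap the size of the response on the server side and say so explicitly, rather than letting the data be cut off silently. A minimal sketch of that idea, using character count as a crude stand-in for tokens; the function name and the `truncated` key are assumptions for illustration and not something the plugin does:

```python
import json


def fit_rows_to_budget(rows: list, max_chars: int) -> dict:
    """Keep whole rows until the serialized payload would exceed max_chars."""
    kept = []
    used = 2  # the enclosing "[]"
    for row in rows:
        size = len(json.dumps(row)) + 2  # the row plus a ", " separator
        if used + size > max_chars:
            break
        kept.append(row)
        used += size
    # The flag lets the consumer know that rows were left out.
    return {"rows": kept, "truncated": len(kept) < len(rows)}


rows = [{"id": i, "body": "x" * 50} for i in range(10)]
result = fit_rows_to_budget(rows, max_chars=200)
print(len(result["rows"]), result["truncated"])
```

Whether a model would actually respect such a flag is an open question; the point is only that the truncation becomes visible in the data.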

Want to try this with your own data?

The ChatGPT plugin system isn’t available outside of the preview yet, but when it is I’ll be adding this functionality to my Datasette Cloud SaaS platform, for people who don’t want to install and run Datasette themselves.

You can sign up for the Datasette Cloud preview here if you’d like to learn more.

Previous experiments

I’ve experimented with variants of this pattern myself before: it turns out it’s surprisingly easy to enhance the capabilities of a large language model by providing it access to additional tools. Here’s some previous work:

Weeknotes: AI won’t slow down, a new newsletter and a huge Datasette refactor three days ago

I’m a few weeks behind on my weeknotes, but it’s not through lack of attention to my blog. AI just keeps getting weirder and more interesting.

I’m beginning to expect that every Tuesday may be a write-off for the next few years, since the AI community seems to have decided that Tuesday is the day to launch everything.

Two Tuesdays ago we got a Google announcement, Anthropic’s Claude and GPT-4. On Tuesday this week we got Google Bard, Bing Image Creator and Adobe Firefly.

I’ve written about a bunch of that stuff this month:

Apparently this blog is now partly focused on AI! If you want to stay up-to-date with my writing on this (and other) subjects you can subscribe to my atom feed, or you can sign up for my brand new Substack newsletter.

My blog as a newsletter

I know there are a lot of people out there who don’t habitually use a feed reader but do find great value from email newsletters. My new newsletter is effectively a way to subscribe to my blog via email.

I started it a few months ago when it looked like Twitter was about to collapse under the weight of its new mismanagement. I first promoted it at the bottom of my Large language models are having their Stable Diffusion moment post, and it’s since grown to 640 subscribers!

I plan to send it out around once a week, provided there’s material to send.

It will be mostly content from my blog, with maybe a paragraph or two of additional context added at the top highlighting themes of the past week (such as GPT-4).

The first two editions can be found here:

A fun detail about my newsletter is how I’m generating it.

Substack doesn’t have an API, but I wanted to automate as much of the process of copying in data from my blog as possible.

I built myself an automation around copy and paste! It is an Observable notebook I wrote which assembles most of the newsletter for me.

It works by running this SQL query against my Datasette instance, which runs against a SQLite copy of my blog content (a PostgreSQL/Django app) built by a GitHub Action in this repository.

The SQL query assembles a string of HTML which is rendered in the notebook. There’s also a “Copy to clipboard” button which uses this JavaScript pattern to copy a rich text representation of the HTML to the clipboard.
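The idea of having SQL build the HTML can be shown in miniature with SQLite's string concatenation and `group_concat()`. This is a sketch against a made-up `entries` table, not the actual query or schema used for the newsletter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the blog content.
conn.execute("CREATE TABLE entries (title TEXT, url TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("First post", "https://example.com/1", "2023-03-20"),
        ("Second post", "https://example.com/2", "2023-03-22"),
    ],
)

# Build one HTML fragment per row, then join the fragments with newlines.
html = conn.execute(
    """
    SELECT group_concat(item, char(10)) FROM (
        SELECT '<h3><a href="' || url || '">' || title || '</a></h3>' AS item
        FROM entries
        ORDER BY created
    )
    """
).fetchone()[0]
print(html)
```

This toy version does no HTML escaping, so it is only safe for content you already trust.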

When I hit “paste” in the Substack editor interface it converts that representation into Substack’s chosen subset of HTML. Then I can edit it by hand in the Substack editor.

This is working really well so far—it’s really easy to tweak the generated HTML in the Observable notebook, and once I’ve transferred it to Substack I can re-arrange things and add my own extra commentary to the top of the newsletter before hitting send.

Datasette’s new JSON API

I finally landed a GIANT branch I’ve been working on for several months now: a complete redesign of Datasette’s default JSON format, one of the largest changes I need to land prior to releasing Datasette 1.0.

The previous default JSON format was a bit of a mess: it had dozens of keys, and presented the row data as an array of arrays (on the basis that the column names were available in a separate key, and rows as arrays would be more efficient in terms of bytes on the wire).

I always found myself adding ?_shape=array to that URL to get a smaller format, which strongly indicated that the default I had picked was the wrong one.
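The difference between the two shapes is easy to see in code. A short sketch converting an arrays-of-arrays response into a list of objects; the `old_style` dictionary is a heavily simplified stand-in that shows only the two relevant keys:

```python
def rows_to_objects(columns: list, rows: list) -> list:
    """Pair each row's values with the column names from the separate key."""
    return [dict(zip(columns, row)) for row in rows]


# Simplified example of the previous format: column names in one key, rows as arrays.
old_style = {
    "columns": ["pk1", "pk2", "content"],
    "rows": [["a", "a", "a-a"], ["a", "b", "a-b"]],
}

objects = rows_to_objects(old_style["columns"], old_style["rows"])
print(objects)
```

The array form saves bytes on the wire, but every consumer ends up writing this conversion themselves.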

The new format can now be previewed here—it looks like this (truncated):

{
  "ok": true,
  "next": "d,v",
  "rows": [
    {
      "pk1": "a",
      "pk2": "a",
      "content": "a-a"
    },
    {
      "pk1": "a",
      "pk2": "b",
      "content": "a-b"
    },
    ...

The default keys are "ok", "next" to indicate pagination (this is null if there are no extra pages) and "rows" with a list of JSON objects.
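A sketch of how a client might walk through every page using those keys. How the token is sent back to the server is deliberately left to a caller-supplied `fetch_page` function, which is an assumption of this example; the `PAGES` dictionary fakes two pages so the loop can run without a network:

```python
from typing import Callable, Iterator, Optional


def iter_rows(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield rows from every page, following "next" until it is null."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page["rows"]
        token = page["next"]
        if token is None:
            break


# Fake pages keyed by the token used to request them (None means the first page).
PAGES = {
    None: {"ok": True, "next": "d,v", "rows": [{"pk1": "a", "pk2": "a", "content": "a-a"}]},
    "d,v": {"ok": True, "next": None, "rows": [{"pk1": "d", "pk2": "w", "content": "d-w"}]},
}

all_rows = list(iter_rows(PAGES.__getitem__))
print(all_rows)
```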

If you want extra keys, like a total row count, a list of columns, or some suggested facets, you can request them using the new ?_extra= parameter, for example:

This returns a response that starts like this:

{
  "ok": true,
  "next": "d,v",
  "count": 201,
  "columns": [
    ...
  ],
  "suggested_facets": [
    {
      "name": "pk1",
      "toggle_url": ""
    },
    {
      "name": "pk2",
      "toggle_url": ""
    },
    {
      "name": "text",
      "toggle_url": ""
    }
  ],
  "rows": [

There’s still more work to do on this feature: I need to write the documentation for it, and figure out how it should affect the Datasette endpoint that returns results from an arbitrary SQL query. But it’s ready to preview, and I’m keen to get feedback on it as quickly as possible!

Please take a look, and provide feedback on this dedicated issue thread—or come and talk about it in the Datasette Discord.

Releases these weeks

TIL these weeks

Don’t trust AI to talk accurately about itself: Bard wasn’t trained on Gmail four days ago

Earlier this month I wrote about how ChatGPT can’t access the internet, even though it really looks like it can. Consider this part two in the series. Here’s another common and non-intuitive mistake people make when interacting with large language model AI systems: asking them questions about themselves.

(When I say “large language model AI systems” I mean tools like ChatGPT, Bing and the brand new Google Bard.)

Asking questions like this feels like a natural thing to do: these bots use “I” pronouns (I really wish they wouldn’t) and will very happily answer questions about themselves—what they can do, how they work, even their own opinions (I really wish they wouldn’t do that).

These questions are likely to produce realistic but misleading answers. They come at a surprisingly high risk for hallucination.

What harm could possibly come from asking them how they work?

No, Bard was not trained on text from private Gmail accounts

The following screenshot started circulating on Twitter earlier today. I’m not linking to it directly because I don’t want to encourage it to spread further without extra context.

User: where does Bard's dataset come from? Response: Bard's dataset comes from a variety of sources, including: • Publicly available datasets: These include datasets of text and code from the web, such as Wikipedia, GitHub, and Stack Overflow. • Google's internal data: This includes data from Google Search, Gmail, and other products. • Data from third-party companies: This includes data from companies that have partnered with Google to provide data for Bard's training. The specific sources of Bard's dataset are not publicly available, but Google AI has said that the dataset is "as comprehensive as possible".

The user asked:

where does Bard’s dataset come from?

Bard replied like this—emphasis mine:

Bard’s dataset comes from a variety of sources, including:

  • Publicly available datasets: These include datasets of text and code from the web, such as Wikipedia, GitHub, and Stack Overflow.
  • Google’s internal data: This includes data from Google Search, Gmail, and other products.
  • Data from third-party companies: This includes data from companies that have partnered with Google to provide data for Bard’s training.

The specific sources of Bard’s dataset are not publicly available, but Google AI has said that the dataset is “as comprehensive as possible”.

Unsurprisingly, the idea that Bard might have been trained on internal data from Gmail sparked a lot of concern!

Here’s an official denial:

Bard is an early experiment based on Large Language Models and will make mistakes. It is not trained on Gmail data. -JQ

- Google Workspace (@GoogleWorkspace) March 21, 2023

(I have some supporting arguments below for if the official denial isn’t convincing enough for you.)

Bard was not trained on Gmail. So why on earth did Bard say that it was?

Language models have no concept of “self”

As always with language models, the trick to understanding why they sometimes produce wildly inappropriate output like this is to think about how they work.

A large language model is a statistical next-word / next-sentence predictor. Given the previous sequence of words (including the user’s prompt), it uses patterns from the vast amount of data it has been trained on to find a statistically satisfying way to continue that text.

As such, there’s no mechanism inside a language model to help it identify that questions of the form “how do you work?” should be treated any differently than any other question.

We can give it hints: many chatbot models are pre-seeded with a short prompt that says something along the lines of “You are Assistant, a large language model trained by OpenAI” (seen via a prompt leak).

And given those hints, it can at least start a conversation about itself when encouraged to do so.

But as with everything else about language models, it’s an illusion. It’s not talking about itself, it’s completing a sentence that starts with “I am a large language model trained by ...”.

So when it outputs “Google’s internal data:”, the obvious next words might turn out to be “This includes data from Google Search, Gmail, and other products”—they’re statistically likely to follow, even though they don’t represent the actual truth.
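A toy illustration of "statistically likely to follow": the bigram counter below picks the next word purely from how often it appeared after the previous one in its training text. Real language models are vastly more sophisticated, and the tiny corpus here is invented for the example, but the disconnect between likelihood and truth is the same in spirit:

```python
from collections import Counter, defaultdict


def train_bigrams(text: str) -> dict:
    """Count which word follows which in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def most_likely_next(counts: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Invented training text: "search" follows "from" more often than "gmail" does.
model = train_bigrams("data from search data from gmail data from search")
print(most_likely_next(model, "from"))
```

The model has no way to check whether its continuation is true. It only knows which words tended to come next.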

This is one of the most unintuitive things about these models. The obvious question here is why: why would Bard lie and say it had been trained on Gmail when it hadn’t?

It has no motivations to lie or tell the truth. It’s just trying to complete a sentence in a satisfactory way.

What does “satisfactory” mean? It’s likely been guided by RLHF—Reinforcement Learning from Human Feedback—which the ChatGPT development process has excelled at. Human annotators help train the model by labelling responses as satisfactory or not. Google apparently recruited the entire company to help with this back in February.

I’m beginning to suspect that the perceived difference in quality between different language model AIs is influenced much more heavily by this fine-tuning level of training than it is by the underlying model size and quality itself. The enormous improvements the Alpaca fine-tuning brought to the tiny LLaMA 7B model have reinforced my thinking around this.

I think Bard’s fine-tuning still has a long way to go.

Current information about itself couldn’t have been in the training data

By definition, the model’s training data must have existed before the model itself was trained. Most models have a documented cut-off date on their training data: for OpenAI’s models that’s currently September 2021. I don’t believe Google have shared the cut-off date for the LaMDA model used by Bard.

If it was trained on content written prior to its creation, it clearly can’t understand details about its own specific “self”.

ChatGPT can answer pretty detailed questions about GPT-3, because that model had been iterated on and written about publicly for several years prior to its training cut-off. But questions about its most recent model, by definition, cannot be answered just using data that existed in its training set.

But Bard can consult data beyond its training!

Here’s where things get a bit tricky.

ChatGPT is a “pure” interface to a model: when you interact with it, you’re interacting with the underlying language model directly.

Google Bard and Microsoft Bing are different: they both include the ability to consult additional sources of information, in the form of the Google and Bing search indexes.

Effectively, they’re allowed to augment their training data with additional information fetched from a search.

This sounds more complex than it actually is: effectively they can run an external search, get back some results, paste them invisibly into the ongoing conversation and use that new text to help answer questions.

(I’ve built a very simple version of this pattern myself a couple of times, described in How to implement Q&A against your documentation with GPT3, embeddings and Datasette and A simple Python implementation of the ReAct pattern for LLMs.)
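The "paste search results invisibly into the conversation" step can be sketched in a few lines. This is a generic illustration rather than how Bard or Bing actually work: `search` is any caller-supplied function, `fake_search` stands in for a real search index, and the prompt wording is made up:

```python
from typing import Callable, List


def build_augmented_prompt(
    question: str, search: Callable[[str], List[str]], max_results: int = 3
) -> str:
    """Run a search, then place the results ahead of the question in the prompt."""
    snippets = search(question)[:max_results]
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return (
        "Answer the question using only the search results below.\n"
        f"Search results:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def fake_search(query: str) -> List[str]:
    # Stand-in for a real search index; returns a canned snippet.
    return ["LaMDA was pre-trained on public dialog data and other public web documents."]


prompt = build_augmented_prompt("Where does Bard's dataset come from?", fake_search)
print(prompt)
```

The model then completes the text after "Answer:", with the search results sitting in its context where the user never sees them.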

As such, one would hope that Bard could offer a perfect answer to any question about itself. It should be able to do something like this:

User: Where does Bard’s dataset come from?

Bard: (invisible): search Google for “Bard dataset”

Bard: (invisible): search results said: ... big chunk of text from the Google indexed documents ...

Bard: My underlying model LaMDA was trained on public dialog data and other public web documents.

Clearly it didn’t do that in this case! Or if it did, it summarized the information it got back in a misleading way.

I expect Bard will have a much better answer for this question within a day or two—a great thing about running models with augmented data in this way is that you can improve their answers without having to train the underlying model again from scratch every time.

More reasons that LaMDA wouldn’t be trained on Gmail

When I first saw the claim from that original screenshot, I was instantly suspicious.

Taking good care of the training data that goes into a language model is one of the most important and challenging tasks in all of modern AI research.

Using the right mix of content, with the right mix of perspectives, and languages, and exposure to vocabulary, is absolutely key.

If you train a model on bad sources of training data, you’ll get a really badly behaved model.

The problem is that these models require far more text than any team of humans could ever manually review.

The LaMDA paper describes the training process like so:

LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone, we pre-trained LaMDA on a dataset created from public dialog data and other public web documents. Therefore, LaMDA can be used as a general language model prior to fine-tuning.

The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words

1.56 trillion words!

Appendix E has more details:

The composition of the data is as follows: 50% dialogs data from public forums; 12.5% C4 data t5; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc; 12.5% Wikipedia (English); 6.25% English web documents; and 6.25% Non-English web documents.
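As a quick sanity check, the quoted percentages do add up to 100%. The sketch below also shows what each share would come to if the percentages applied directly to the 1.56T word total; that is an assumption made here for illustration, since the quoted passages do not say what unit the percentages are measured in:

```python
from fractions import Fraction

# Percentages as quoted from Appendix E.
composition = {
    "dialogs data from public forums": Fraction("50"),
    "C4 data": Fraction("12.5"),
    "code documents": Fraction("12.5"),
    "Wikipedia (English)": Fraction("12.5"),
    "English web documents": Fraction("6.25"),
    "Non-English web documents": Fraction("6.25"),
}

total_percent = sum(composition.values())
print(total_percent)

# ASSUMPTION: treat each percentage as a share of the 1.56 trillion words.
TOTAL_WORDS = 1.56e12
estimated_words = {
    name: float(share) / 100 * TOTAL_WORDS for name, share in composition.items()
}
for name, words in estimated_words.items():
    print(f"{name}: about {words / 1e12:.4f} trillion words")
```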

“C4 data t5” I believe relates to Common Crawl.

So why not mix in Gmail too?

First, in order to analyze the training data you need to be able to have your research team view it—they need to run spot checks, and build and test filtering algorithms to keep the really vile stuff to a minimum.

At large tech companies like Google, the ability for members of staff to view private data held in trust for their users is very tightly controlled. It’s not the kind of thing you want your machine learning training team to be poking around in... and if you work on those teams, even having the ability to access that kind of private data represents a substantial personal legal and moral risk.

Secondly, think about what could go wrong. What if a language model leaked details of someone’s private life in response to a prompt from some other user?

This would be a PR catastrophe. Would people continue to trust Gmail or other Google products if they thought their personal secrets were being exposed to anyone who asked Bard a question? Would Google ever want to risk finding out the answer to that question?

The temptations of conspiratorial thinking

Are you still not convinced? Are you still suspicious that Google trained Bard on Gmail, despite both their denials and my logic as to why they wouldn’t ever want to do this?

Ask yourself how much you want to believe that this story is true.

This modern AI stuff is deeply weird, and more than a little frightening.

The companies involved are huge, secretive and are working on technology which serious people have grave concerns about.

It’s so easy to fall into the trap of conspiratorial thinking around this stuff. Especially since some of the conspiracies might turn out to be true!

I don’t know how to best counter this most human of reactions. My best recommendation is to keep in mind that humans, like language models, are pattern matching machines: we jump to conclusions, especially if they might reinforce our previous opinions and biases.

If we’re going to figure this stuff out together, we have to learn when to trust our initial instincts and when to read deeper and think harder about what’s going on.

A conversation about prompt engineering with CBC Day 6 seven days ago

I’m on Canadian radio this morning! I was interviewed by Peter Armstrong for CBC Day 6 about the developing field of prompt engineering.

You can listen here on the CBC website.

CBC also published this article based on the interview, which includes some of my answers that didn’t make the audio version: These engineers are being hired to get the most out of AI tools without coding.

Here’s my own lightly annotated transcript (generated with the help of Whisper).

Peter: AI whisperers, more properly known as prompt engineers, are part of a growing field of humans who make their living working with AI

Their job is to craft precise phrases to get a desired outcome from an AI

Some experts are skeptical about how much control AI whisperers actually have

But more and more companies are hiring these prompt engineers to work with AI tools

There are even online marketplaces where freelance engineers can sell the prompts they’ve designed

Simon Willison is an independent researcher and developer who has studied AI prompt engineering

Good morning, Simon. Welcome to Day 6

Simon: Hi, it’s really great to be here

Peter: So this is a fascinating and kind of perplexing job

What exactly does a prompt engineer do?

Simon: So we have these new AI models that you can communicate with in the English language

You type them instructions in English and they do the thing that you ask them to do, which feels like it should be the easiest thing in the world

But it turns out actually getting great results out of these things, using these for the kinds of applications people want, sort of summarization and extracting facts, requires a lot of quite deep knowledge as to how to use them and what they’re capable of and how to get the best results out of them

So, prompt engineering is essentially the discipline of becoming an expert in communicating with these things

It’s very similar to being a computer programmer except weird and different in all sorts of new ways that we’re still trying to understand

Peter: You’ve said in some of your writing and talking about this that it’s important for prompt engineers to resist what you call superstitious thinking

What do you mean by that?

My piece In defense of prompt engineering talks about the need to resist superstitious thinking.

Simon: It’s very easy when talking to one of these things to think that it’s an AI out of science fiction, to think that it’s like the Star Trek computer and it can understand and do anything

And that’s very much not the case

These systems are extremely good at pretending to be all powerful, all knowing things, but they have massive, massive flaws in them

So it’s very easy to become superstitious, to think, oh wow, I asked it to read this web page, I gave it a link to an article and it read it

It didn’t read it!

This is a common misconception that comes up when people are using ChatGPT. I wrote about this and provided some illustrative examples in ChatGPT can’t access the internet, even though it really looks like it can.

A lot of the time it will invent things that look like it did what you asked it to, but really it’s sort of imitating what would look like a good answer to the question that you asked it

Peter: Well, and I think that’s what’s so interesting about this, that it’s not sort of core science computer programming

There’s a lot of almost, is it fair to call it intuition

Like what makes a prompt engineer good at being a prompt engineer?

Simon: I think intuition is exactly right there

The way you get good at this is firstly by using these things a lot

It takes a huge amount of practice and experimentation to understand what these things can do, what they can’t do, and just little tweaks in how you talk to them might have a huge effect on what they say back to you

Peter: You know, you talked a little bit about the assumption that we can’t assume this is some all-knowing futuristic AI that knows everything, and yet, you know, we already have people calling these the AI whisperers, which to my ears sounds a little bit mystical

How much of this is, you know, magic as opposed to science?

Simon: The comparison to magic is really interesting because when you’re working with these it really can feel like you’re a sort of magician: you sort of cast spells at it, you don’t fully understand what they’re going to do, and it reacts sometimes well and sometimes it reacts poorly

And I’ve talked to AI practitioners who kind of talk about collecting spells for their spell book

But it’s also a very dangerous comparison to make because magic is, by its nature, impossible for people to comprehend and can do anything

And these AI models are absolutely not that

See Is the AI spell-casting metaphor harmful or helpful? for more on why magic is a dangerous comparison to make!

Fundamentally, they’re mathematics

And you can understand how they work and what they’re capable of if you put the work in

Peter: I have to admit, when I first heard about this, I thought it was a kind of a made up job or a bit of a scam to just get people involved

But the more I’ve read on it, the more I’ve understood that this is a real skill

But I do think back to, it wasn’t all that long ago that we had Google search specialists that helped you figure out how to search for something on Google

Now we all take for granted because we can do it

I wonder if you think, do prompt engineers have a future or are we all just going to eventually be able to catch up with them and use this AI more effectively?

Simon: I think a lot of prompt engineering will become a skill that people develop

Many people in their professional and personal lives are going to learn to use these tools, but I also think there’s going to be space for expertise

There will always be a level at which it’s worth investing sort of full-time experience in solving some of these problems, especially for companies that are building entire products around these AI engines under the hood

Peter: You know, this is a really exciting time

I mean, it’s a really exciting week

We’re getting all this new stuff

It’s amazing to watch people use it and see what they can do with it

And I feel like my brain is split

On the one hand, I’m really excited about it

On the other hand, I’m really worried about it

Are you in that same place?

And what are the things you’re excited about versus the things that you’re worried about?

Simon: I’m absolutely in the same place as you there

This is both the most exciting and the most terrifying technology I’ve ever encountered in my career

Something I’m personally really excited about right now is developments in being able to run these AIs on your own personal devices

I have a series of posts about this now, starting with Large language models are having their Stable Diffusion moment where I talk about first running a useful large language model on my own laptop.

Right now, if you want to use these things, you have to use them against cloud services run by these large companies

But there are increasing efforts to get them to scale down to run on your own personal laptops or even on your own personal phone

I ran a large language model that Facebook Research released just at the weekend on my laptop for the first time, and it started spitting out useful results

And that felt like a huge moment in terms of sort of the democratization of this technology, putting it into people’s hands and meaning that things where you’re concerned about your own privacy and so forth suddenly become feasible because you’re not talking to the cloud, you’re talking to the sort of local model

Peter: You know, if I typed into one of these chat bots, you know, should I be worried about the rise of AI

It would absolutely tell me not to be

If I ask you the same question, should we be worried and should we be spending more time figuring out how this is going to seep its way into various corners of our lives?

Simon: I think we should absolutely be worried because this is going to have a major impact on society in all sorts of ways that we don’t predict and some ways that we can predict

I’m not worried about the sort of science fiction scenario where the AI breaks out of my laptop and takes over the world

But there are many very harmful things you can do with a machine that can imitate human beings and that can produce realistic human text

My thinking on this was deeply affected by Emily M. Bender, who observed that “applications that aim to believably mimic humans bring risk of extreme harms” as highlighted in this fascinating profile in New York Magazine.

The fact that anyone can churn out very convincing but completely made up text right now will have a major impact in terms of how much can you trust the things that you’re reading online

If you read a review of a restaurant, was it written by a human being or did somebody fire up an AI model and generate 100 positive reviews all in one go?

So there are all sorts of different applications to this

Some are definitely bad, some are definitely good

And seeing how this all plays out is something that I think society will have to come to terms with over the next few months and the next few years

Peter: Simon, really appreciate your insight and just thanks for coming with us on the show today

Simon: Thanks very much for having me

For more related content, take a look at the prompt engineering and generative AI tags on my blog.

Could you train a ChatGPT-beating model for $85,000 and run it in a browser? eight days ago

I think it’s now possible to train a large language model with similar functionality to GPT-3 for $85,000. And I think we might soon be able to run the resulting model entirely in the browser, and give it capabilities that leapfrog it ahead of ChatGPT.

This is currently wild speculation on my part, but bear with me because I think this is worth exploring further.

Large language models with GPT-3-like capabilities cost millions of dollars to build, thanks to the cost of running the expensive GPU servers needed to train them. Whether you are renting or buying those machines, there are still enormous energy costs to cover.

Just one example of this: the BLOOM large language model was trained in France with the support of the French government. The cost was estimated as $2-5M, it took almost four months to train, and its creators boast about its low carbon footprint because most of the power came from a nuclear reactor!

[ Fun fact: as of a few days ago you can now run the openly licensed BLOOM on your own laptop, using Nouamane Tazi’s adaptive copy of the llama.cpp code that made that possible for LLaMA ]

Recent developments have made me suspect that these costs could be made dramatically lower. I think a capable language model can now be trained from scratch for around $85,000.

It’s all about that LLaMA

The LLaMA plus Alpaca combination is the key here.

I wrote about these two projects previously:

To recap: LLaMA by Meta research provided a GPT-3 class model trained entirely on documented, available public training information, as opposed to OpenAI’s continuing practice of not revealing the sources of their training data.

This makes the model training a whole lot more likely to be replicable by other teams.

The paper also describes some enormous efficiency improvements they made to the training process.

The LLaMA research was still extremely expensive though. From the paper:

... we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models

My friends at Replicate told me that a simple rule of thumb for A100 cloud costs is $1/hour.

2048 * 5 * 30 * 24 = $7,372,800

But... that $7M was the cost to both iterate on the model and to train all four sizes of LLaMA that they tried: 7B, 13B, 33B, and 65B.

Here’s Table 15 from the paper, showing the cost of training each model.

Table 15: Carbon footprint of training different models in the same data center. We follow Wu et al. (2022) to compute carbon emission of training OPT, BLOOM and our models in the same data center. For the power consumption of a A100-80GB, we take the thermal design power for NVLink systems, that is 400W. We take a PUE of 1.1 and a carbon intensity factor set at the national US average of 0.385 kg CO2eq per KWh. Lists 6 models. OPT-175B: 809,472 GPU hours, 356 MWh, 137 tons CO2. BLOOM-175B: 1,082,880 GPU hours, 475 MWh, 183 tons. LLaMA-7B: 82,432 GPU hours, 36 MWh, 14 tons. LLaMA-13B: 135,168 GPU hours, 59 MWh, 23 tons. LLaMA-33B: 530,432 GPU hours, 233 MWh, 90 tons. LLaMA-65B: 1,022,362 GPU hours, 449 MWh, 173 tons.

This shows that the smallest model, LLaMA-7B, was trained on 82,432 hours of A100-80GB GPUs, costing 36MWh and generating 14 tons of CO2.

(That’s about 28 people flying from London to New York.)

Going by the $1/hour rule of thumb, this means that provided you get everything right on your first run you can train a LLaMA-7B scale model for around $82,432.
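As a rough sketch, the cost arithmetic above can be written out in Python. The function name `rental_cost_usd` is made up for illustration, the GPU hours are copied from Table 15, and the flat $1/hour rate is the rule of thumb quoted earlier rather than a real price list.

```python
# Rule-of-thumb A100 rental rate quoted in the post (an assumption, not a price list)
USD_PER_A100_HOUR = 1.0

# GPU hours for each LLaMA size, copied from Table 15 of the paper
LLAMA_GPU_HOURS = {
    "LLaMA-7B": 82_432,
    "LLaMA-13B": 135_168,
    "LLaMA-33B": 530_432,
    "LLaMA-65B": 1_022_362,
}


def rental_cost_usd(gpu_hours, usd_per_gpu_hour=USD_PER_A100_HOUR):
    """Estimate the cloud rental cost for a number of GPU hours."""
    return gpu_hours * usd_per_gpu_hour


# Whole-project estimate: 2048 GPUs for roughly 5 months of 30 days each
project_gpu_hours = 2048 * 5 * 30 * 24
print(f"Whole project: ${rental_cost_usd(project_gpu_hours):,.0f}")

# Per-model estimate, assuming a single successful training run each
for name, hours in LLAMA_GPU_HOURS.items():
    print(f"{name}: ${rental_cost_usd(hours):,.0f}")
```

Changing `USD_PER_A100_HOUR` is the quickest way to see how sensitive these estimates are to the rental price.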

Upgrading to Alpaca

You can run LLaMA 7B on your own laptop (or even on a phone), but you may find it hard to get good results out of it. That’s because it hasn’t been instruction tuned, so it’s not great at answering the kind of prompts that you might send to ChatGPT or GPT-3 or 4.

Alpaca is the project from Stanford that fixes that. They fine-tuned LLaMA on 52,000 instructions (of somewhat dubious origin) and claim to have gotten ChatGPT-like performance as a result... from that smallest 7B LLaMA model!

You can try out their demo (update: no you can’t, “Our live demo is suspended until further notice”) and see for yourself that it really does capture at least some of that ChatGPT magic.

The best bit? The Alpaca fine-tuning can be done for less than $100. The Replicate team have repeated the training process and published a tutorial about how they did it.

Other teams have also been able to replicate the Alpaca fine-tuning process, for example antimatter15/alpaca.cpp on GitHub.

We are still within our $85,000 budget! And Alpaca—or an Alpaca-like model using different fine tuning data—is the ChatGPT on your own device model that we’ve all been hoping for.

Could we run it in a browser?

Alpaca is effectively the same size as LLaMA 7B—around 3.9GB (after 4-bit quantization à la llama.cpp). And LLaMA 7B has already been shown running on a whole bunch of different personal devices: laptops, Raspberry Pis (very slowly) and even a Pixel 5 phone at a decent speed!

The next frontier: running it in the browser.

I saw two tech demos yesterday that made me think this may be possible in the near future.

The first is Transformers.js. This is a WebAssembly port of the Hugging Face Transformers library of models—previously only available for server-side Python.

It’s worth spending some time with their demos, which include some smaller language models and some very impressive image analysis models too.

The second is Web Stable Diffusion. This team managed to get the Stable Diffusion generative image model running entirely in the browser as well!

Web Stable Diffusion uses WebGPU, a still emerging standard that’s currently only working in Chrome Canary. But it does work! It rendered me this image of two raccoons eating a pie in the forest in 38 seconds.

Screenshot of Web Stable Diffusion running in a browser. The input prompt is two racoons eating a pie in the woods, with the default 20 step scheduler. After 38 seconds elapsed on the progress bar a realistic photograph of two raccoons eating a fruit pie appears - although on closer inspection the raccoon holding the pie has three paws!

The Stable Diffusion model this loads into the browser is around 1.9GB.

LLaMA/Alpaca at 4bit quantization is 3.9GB.

The sizes of these two models are similar enough that I would not be at all surprised to see an Alpaca-like model running in the browser in the not-too-distant future. I wouldn’t be surprised if someone is working on that right now.

Now give it extra abilities with ReAct

A model running in your browser that behaved like a less capable version of ChatGPT would be pretty impressive. But what if it could be MORE capable than ChatGPT?

The ReAct prompt pattern is a simple, proven way of expanding a language model’s abilities by giving it access to extra tools.

Matt Webb explains the significance of the pattern in The surprising ease and effectiveness of AI in a loop.

I got it working with a few dozen lines of Python myself, which I described in A simple Python implementation of the ReAct pattern for LLMs.

Here’s the short version: you tell the model that it must think out loud and now has access to tools. It can then work through a question like this:

Question: Population of Paris, squared?

Thought: I should look up the population of paris and then multiply it

Action: search_wikipedia: Paris

Then it stops. Your code harness for the model reads that last line, sees the action and goes and executes an API call against Wikipedia. It continues the dialog with the model like this:

Observation: <truncated content from the Wikipedia page, including the 2,248,780 population figure>

The model continues:

Thought: Paris population is 2,248,780 I should square that

Action: calculator: 2248780 ** 2

Control is handed back to the harness, which passes that to a calculator and returns:

Observation: 5057011488400

The model then provides the answer:

Answer: The population of Paris squared is 5,057,011,488,400

Adding new actions to this system is trivial: each one can be a few lines of code.
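The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the linked post: `run_react`, `scripted_model` and the tool functions are hypothetical names, the "model" is a canned script that replays the Paris dialog, and `search_wikipedia` returns a hard-coded string instead of calling the real Wikipedia API.

```python
import re

# Matches lines such as "Action: calculator: 2248780 ** 2"
ACTION_RE = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)


def search_wikipedia(query):
    # Stand-in for a real API call: returns canned text for the demo
    return f"{query} is a city with a population of 2,248,780"


def calculator(expression):
    # Only allow digits, arithmetic operators, brackets and spaces before eval
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"Unsupported expression: {expression!r}")
    return str(eval(expression, {"__builtins__": {}}, {}))


# Adding a new action means adding one more entry here
TOOLS = {"search_wikipedia": search_wikipedia, "calculator": calculator}

# Canned replies standing in for a real language model
SCRIPT = [
    "Thought: I should look up the population of paris and then multiply it\n"
    "Action: search_wikipedia: Paris",
    "Thought: Paris population is 2,248,780 I should square that\n"
    "Action: calculator: 2248780 ** 2",
    "Answer: The population of Paris squared is 5,057,011,488,400",
]


def scripted_model(transcript):
    # Pick the next reply based on how many observations have been fed back
    return SCRIPT[transcript.count("Observation:")]


def run_react(model, question, tools, max_turns=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        reply = model(transcript)
        transcript += reply + "\n"
        match = ACTION_RE.search(reply)
        if match is None:
            # No action requested, so the model has given its answer
            return transcript
        name, argument = match.groups()
        transcript += f"Observation: {tools[name](argument)}\n"
    return transcript


print(run_react(scripted_model, "Population of Paris, squared?", TOOLS))
```

Swapping `scripted_model` for a function that calls a real model is the only change needed to try this against an actual LLM, though the regular expression assumes the model sticks to the `Action: name: argument` format.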

But as the ReAct paper demonstrates, adding these capabilities to even an under-powered model (such as LLaMA 7B) can dramatically improve its abilities, at least according to several common language model benchmarks.

This is essentially what Bing is! It’s GPT-4 with the added ability to run searches against the Bing search index.

Obviously if you’re going to give a language model the ability to execute API calls and evaluate code you need to do it in a safe environment! Like for example... a web browser, which runs code from untrusted sources as a matter of habit and has the most thoroughly tested sandbox mechanism of any piece of software we’ve ever created.

Adding it all together

There are a lot more groups out there that can afford to spend $85,000 training a model than there are that can spend $2M or more.

I think LLaMA and Alpaca are going to have a lot of competition soon, from an increasing pool of openly licensed models.

A fine-tuned LLaMA scale model is leaning in the direction of a ChatGPT competitor already. But... if you hook in some extra capabilities as seen in ReAct and Bing even that little model should be able to way outperform ChatGPT in terms of actual ability to solve problems and do interesting things.

And we might be able to run such a thing on our phones... or even in our web browsers... sooner than you think.

And it’s only going to get cheaper

Tobias Lütke on Twitter:

The H100 is the new Tensor Core GPU from NVIDIA, which they claim can offer up to a 30x performance improvement over their current A100s.

Stanford Alpaca, and the acceleration of on-device large language model development 12 days ago

On Saturday 11th March I wrote about how Large language models are having their Stable Diffusion moment. Today is Monday. Let’s look at what’s happened in the past three days.

When I talked about a “Stable Diffusion moment” this is the kind of thing I meant: the moment this stuff is available for people to experiment with, things accelerate.

I’m going to dive into Alpaca in detail.

Stanford’s Alpaca

Here’s the introduction to the Alpaca announcement:

We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<600$).

The biggest weakness in the LLaMA models released by Meta research last month is their lack of instruction-tuning.

A language model is a sentence completion engine. You give it a sequence of words, “The first man on the moon was”, and it completes that sentence, hopefully with useful content.

One of the great innovations from OpenAI was their application of instruction tuning to GPT-3:

To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.

Prior to this, you had to think very carefully about how to construct your prompts. Thanks to instruction tuning you can be a lot more, well, human in the way you interact with the model. “Write me a poem about pandas!” now works as a prompt, instead of “Here is a poem about pandas:”.

The LLaMA models had not been through this process. The LLaMA FAQ acknowledges this:

Keep in mind these models are not finetuned for question answering. As such, they should be prompted so that the expected answer is the natural continuation of the prompt. [...] Overall, always keep in mind that models are very sensitive to prompts (particularly when they have not been finetuned).

This is an enormous usability problem.

One of my open questions about LLaMA was how difficult and expensive it would be to fine-tune it such that it could respond better to instructions.

Thanks to the team at Stanford we now have an answer: 52,000 training samples and $100 of training compute! From their blog post:

Fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers.

Something that stuns me about Alpaca is the quality they claim to be able to get from the 7B model—the smallest of the LLaMA models, and the one which has been seen running (albeit glacially slowly) on a RaspberryPi and a mobile phone! Here’s one example from their announcement:

Enter your instruction and press enter: Write an e-mail to congratulate new Stanford admits and mention that you are exited about meeting all of them in person Stanford-Alpaca-7B: An Open-Source Instruction-Following Language Model Dear Stanford Admits, Congratulations on your admission to Stanford! I am so excited to meet all of you in person. As a current student at Stanford, I am looking forward to welcoming you to the Stanford community and showing you all that the university has to offer. I wish you all the best in your future endeavors and I can't wait to see you on campus! Sincerely, Your Name

I would be impressed to see this from the 65B (largest) LLaMA model—but getting this from 7B is spectacular.

Still not for commercial usage

I’ll quote the Stanford announcement on this in full:

We emphasize that Alpaca is intended only for academic research and any commercial use is prohibited. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial license, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.

So it’s still not something we can use to build commercial offerings—but for personal research and tinkering it’s yet another huge leap forwards.

What does this demonstrate?

The license of the LLaMA model doesn’t bother me too much. What’s exciting to me is what this all proves:

  • LLaMA itself shows that it’s possible to train a GPT-3 class language model using openly available resources. The LLaMA paper includes details of the training data, which is entirely from publicly available sources (which include CommonCrawl, GitHub, Wikipedia, ArXiv and StackExchange).
  • llama.cpp shows that you can then use some tricks to run that language model on consumer hardware—apparently anything with 4GB or more of RAM is enough to at least get it to start spitting out tokens!
  • Alpaca shows that you can apply fine-tuning with a feasible sized set of examples (52,000) and cost ($100) such that even the smallest of the LLaMA models—the 7B one, which can compress down to a 4GB file with 4-bit quantization—provides results that compare well to cutting edge text-davinci-003 in initial human evaluation.

One thing that’s worth noting: the Alpaca 7B comparison likely used the full-sized 13.48GB 16-bit floating point 7B model, not the smaller 4GB 4-bit quantized model used by llama.cpp. I’ve not yet seen a robust comparison of quality between the two.
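A back-of-the-envelope sketch shows where sizes like these come from. It assumes a nominal 7 billion parameters (taken from the "7B" name, not an exact count) and ignores any per-file overhead, so the results only approximate the 13.48GB and 4GB figures quoted above. The function name `model_size_gb` is made up for illustration.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate size in GB of a model storing each weight in the given number of bits."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter count implied by the "7B" name (an assumption, not exact)
NOMINAL_PARAMS = 7e9

for bits in (16, 4):
    print(f"{bits}-bit: about {model_size_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Real quantized files tend to come out somewhat larger than the naive 4-bit estimate because they also store scaling information alongside the weights.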

Exploring the Alpaca training data with Datasette Lite

The Alpaca team released the 52,000 fine-tuning instructions they used as a 21.7MB JSON file in their GitHub repository.

My Datasette Lite tool has the ability to fetch JSON from GitHub and load it into an in-browser SQLite database. Here’s the URL to do that:

This will let you browse the 52,000 examples in your browser.

But we can do a step better than that: here’s a SQL query that runs LIKE queries to search through those examples, considering all three text columns:

select instruction, input, output from alpaca_data
where instruction || ' ' || input || ' ' || output like '%' || :search || '%'
order by random()

I’m using order by random() because why not? It’s more fun to explore that way.
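The same query can be tried outside the browser with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table name and columns match the query above, but the three rows are made-up placeholders rather than real Alpaca examples, and `search_examples` is a hypothetical helper.

```python
import sqlite3

SEARCH_SQL = """
select instruction, input, output from alpaca_data
where instruction || ' ' || input || ' ' || output like '%' || :search || '%'
order by random()
"""


def search_examples(conn, term):
    # :search is bound as a named parameter, so the term is never pasted into the SQL
    return conn.execute(SEARCH_SQL, {"search": term}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table alpaca_data (instruction text, input text, output text)")
# Placeholder rows for the demo, not taken from the real training data
conn.executemany(
    "insert into alpaca_data values (?, ?, ?)",
    [
        ("Explain a principle", "Occam's razor", "Prefer the simplest explanation."),
        ("Write a haiku", "", "Leaves fall quietly"),
        ("Name a primary color", "", "Blue"),
    ],
)

for row in search_examples(conn, "occam"):
    print(row)
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, which is why searching for `occam` matches `Occam's razor`.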

The following link will both load the JSON file and populate and execute that SQL query, plus allow you to change the search term using a form in your browser:

Screenshot of Datasette executing that SQL query, returning three results that match 'occam'

What’s next?

This week is likely to be wild. OpenAI are rumored to have a big announcement on Tuesday—possibly GPT-4? And I’ve heard rumors of announcements from both Anthropic and Google this week as well.

I’m still more excited about seeing what happens next with LLaMA. Language models on personal devices are happening so much faster than I thought they would.

Bonus: The source of that training data? GPT-3!

Here’s a fascinating detail: Those 52,000 samples they used to fine-tune the model? Those were the result of a prompt they ran against GPT-3 itself! Here’s the prompt they used:

You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
2. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
3. The instructions should be in English.
4. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
5. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
6. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
7. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.

List of 20 tasks:

Then they include three random example instructions from a list of 175 they had prepared by hand. The completed prompt sent to OpenAI would include the above instructions followed by something like this:

1. Instruction: Explain the following idiom to me, and try to give me some examples.
1. Input:
black sheep
1. Output:
Meaning: An outcast. Someone who doesn’t fit in with the rest of the crowd. They take pride in being different. Thinks for themselves and doesn’t care what no one else has to say. They tend to ride their own wave and are usually loners because no one understands them, but its okay because they like it that way.
Example: He’s the black sheep of the family.

2. Instruction: Generate a haiku using the following word:
2. Input:
2. Output:
The chill, worming in
Shock, pleasure, bursting within
Summer tongue awakes

3. Instruction: Recommend a movie for me to watch during the weekend and explain the reason.
3. Input:
3. Output:
I would recommend the movie "The Shawshank Redemption" because it is an excellent movie that is both moving and inspiring. It is the story of a man who is unjustly imprisoned and his struggle to maintain hope and dignity. It is a great film to watch over the weekend because it will make you think about the human capacity for resilience and hope.

4. Instruction:

GPT-3 would then fill in the rest. You can try this in the GPT-3 Playground to see it in action (paste from here).

Here’s the Python script that assembles that all together.
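To illustrate the shape of that assembly step, here is a minimal sketch. It is not the Alpaca team's script: `format_example` and `build_prompt` are hypothetical names, the seed tasks are made-up placeholders, and the layout simply follows the numbered format shown in the example above.

```python
import random


def format_example(index, example):
    """Render one seed task in the numbered Instruction / Input / Output layout."""
    lines = [f"{index}. Instruction: {example['instruction']}", f"{index}. Input:"]
    if example["input"]:
        lines.append(example["input"])
    lines.append(f"{index}. Output:")
    lines.append(example["output"])
    return "\n".join(lines)


def build_prompt(header, seed_tasks, rng, n_examples=3):
    """Join the instructions, a random sample of seed tasks and an open final slot."""
    chosen = rng.sample(seed_tasks, n_examples)
    blocks = [header.rstrip()]
    blocks += [format_example(i, task) for i, task in enumerate(chosen, start=1)]
    # Leave the next instruction open so the model continues the list
    blocks.append(f"{n_examples + 1}. Instruction:")
    return "\n\n".join(blocks)


# Made-up seed tasks standing in for the 175 hand-written ones
SEED_TASKS = [
    {"instruction": "Explain the following idiom.", "input": "black sheep", "output": "An outcast."},
    {"instruction": "Suggest a weekend activity.", "input": "", "output": "Go for a hike."},
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name a fruit.", "input": "", "output": "Apple"},
]

prompt = build_prompt("List of 20 tasks:", SEED_TASKS, random.Random(0))
print(prompt)
```

Passing in a seeded `random.Random` keeps the sampled examples reproducible, which helps when debugging prompts.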

They spent $500 on OpenAI credits to assemble the 52,000 examples they used to fine-tune their model.

As they note in their announcement, generating examples in this way is actually mentioned in the OpenAI terms of use:

You may not [...] (iii) use the Services to develop foundation models or other large scale models that compete with OpenAI

There’s a related concept to this called Model Extraction, where people build new models that emulate the behaviour of others by firing large numbers of examples through the other model and training a new one based on the results.

I don’t think the way Alpaca was trained quite counts as a classic Model Extraction attack, but it certainly echoes one.



  • scrapeghost (via) Scraping is a really interesting application for large language model tools like GPT3. James Turk’s scrapeghost is a very neatly designed entrant into this space—it’s a Python library and CLI tool that can be pointed at any URL and given a roughly defined schema (using a neat mini schema language) which will then use GPT3 to scrape the page and try to return the results in the supplied format. #26th March 2023, 5:29 am

24th March 2023

  • SvelteKit is written in JS and distributed as source code — no build step — and it’s been miraculous for productivity. build steps make sense for apps, they make much less sense for libraries

    Rich Harris # 24th March 2023, 11:07 pm

  • Hello Dolly: Democratizing the magic of ChatGPT with open models. A team at Databricks applied the same fine-tuning data used by Stanford Alpaca against LLaMA to a much older model—EleutherAI’s GPT-J 6B, first released in May 2021. As with Alpaca, they found that instruction tuning took the raw model—which was extremely difficult to interact with—and turned it into something that felt a lot more like ChatGPT. It’s a shame they reused the license-encumbered 52,000 training samples from Alpaca, but I doubt it will be long before someone recreates a freely licensed alternative to that training set. #24th March 2023, 5:05 pm

23rd March 2023

  • textra (via) Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too! #23rd March 2023, 9:08 pm
  • ChatGPT Retrieval Plugin. “The ChatGPT Retrieval Plugin repository provides a flexible solution for semantic search and retrieval of personal or organizational documents using natural language queries.” How many existing startups were building this I wonder? #23rd March 2023, 8:58 pm
  • ChatGPT plugins. ChatGPT is getting a plugins mechanism, which will allow developers to provide extra capabilities to ChatGPT, like looking up restaurants on OpenTable or fetching data from APIs. This feels like the kind of feature that could obsolete—or launch—a thousand startups. It also makes ChatGPT much more interesting as a general purpose tool, as opposed to something that only works as an interface to a language model. #23rd March 2023, 8:56 pm
  • mitsua-diffusion-one (via) “Mitsua Diffusion One is a latent text-to-image diffusion model, which is a successor of Mitsua Diffusion CC0. This model is trained from scratch using only public domain/CC0 or copyright images with permission for use.” I’ve been talking about how much I’d like to try out a “vegan” AI model trained entirely on out-of-copyright images for ages, and here one is! It looks like the training data mainly came from CC0 art gallery collections such as the Metropolitan Museum of Art Open Access. #23rd March 2023, 2:56 pm
  • Teaching News Apps with Codespaces (via) Derek Willis used GitHub Codespaces for the latest data journalism class he taught, and it eliminated the painful process of trying to get students on an assortment of Mac, Windows and Chromebook laptops all to a point where they could start working and learning together. #23rd March 2023, 12:39 am
  • If you ask Microsoft’s Bing chatbot if Google’s Bard chatbot has been shut down, it says yes, citing as evidence a news article that discusses a tweet in which a user asked Bard when it would be shut down and Bard said it already had, itself citing a comment from Hacker News in which someone joked about this happening, and someone else used ChatGPT to write fake news coverage about the event.

    James Vincent # 23rd March 2023, 12:10 am

22nd March 2023

  • Datasette: Gather feedback on new ?_extra= design. I just landed the single biggest backwards-incompatible change to Datasette ever, in preparation for the 1.0 release. It’s a change to the default JSON format from the Datasette API—the new format is much slimmer, and can be expanded using a new ?_extra= query string parameter. I’m desperately keen on getting feedback on this change! This issue has more details and a call for feedback. #22nd March 2023, 11:14 pm
  • GPT-4, like GPT-3 before it, has a capability overhang; at the time of release, neither OpenAI or its various deployment partners have a clue as to the true extent of GPT-4’s capability surface—that’s something that we’ll get to collectively discover in the coming years. This also means we don’t know the full extent of plausible misuses or harms.

    Jack Clark # 22nd March 2023, 12:40 am

21st March 2023

  • The Age of AI has begun. Bill Gates calls GPT-class large language models “the most important advance in technology since the graphical user interface”. His essay here focuses on the philanthropy angle, mostly from the point of view of AI applications in healthcare, education and concerns about keeping access to these new technologies as equitable as possible. #21st March 2023, 9:14 pm
  • Here are some absurdly expensive things you can do on a trip to Tokyo: Buy a golden toilet. There is a toilet in Tokyo that is made of gold and costs around 10 million yen. If you are looking for a truly absurd experience, you can buy this toilet and use it for your next bowel movement. [...]

    Google Bard # 21st March 2023, 6:27 pm

  • Google Bard is now live. Google Bard launched today. There’s a waiting list, but I made it through within a few hours of signing up, as did other people I’ve talked to. It’s similar to ChatGPT and Bing—it’s the same chat interface, and it can clearly run searches under the hood (though unlike Bing it doesn’t tell you what it’s looking for). #21st March 2023, 6:25 pm
  • Prompt Engineering. Extremely detailed introduction to the field of prompt engineering by Lilian Weng, who leads applied research at OpenAI. #21st March 2023, 5:12 pm
  • Bing Image Creator comes to the new Bing. Bing Chat is integrating DALL-E directly into their interface, giving it the ability to generate images when prompted to do so. #21st March 2023, 5:10 pm
  • Adobe made an AI image generator — and says it didn’t steal artists’ work to do it. Adobe Firefly is a brand new text-to-image model which Adobe claim was trained entirely on fully licensed imagery—either out of copyright, specially licensed or part of the existing Adobe Stock library. I’m sure they have the license, but I still wouldn’t be surprised to hear complaints from artists who licensed their content to Adobe Stock who didn’t anticipate it being used for model training. #21st March 2023, 5:08 pm
  • OpenAI to discontinue support for the Codex API (via) OpenAI are shutting off access to their Codex model—a GPT3 variant fine-tuned for code related tasks, but that was being used for all sorts of other purposes—partly because it had been in a beta phase for over a year where OpenAI didn’t charge anything for it. This feels to me like a major strategic misstep for OpenAI: they’re only giving three days notice, which is shaking people’s confidence in them as a stable platform for building on at the very moment when competition from other vendors (and open source alternatives) is heating up. #21st March 2023, 5:04 pm
  • Was on a plane yesterday, studying some physics; got confused about something and I was able to solve my problem by just asking alpaca-13B—running locally on my machine—for an explanation. Felt straight-up spooky.

    Andy Matuschak # 21st March 2023, 2:45 pm

17th March 2023

  • Fine-tune LLaMA to speak like Homer Simpson. Replicate spent 90 minutes fine-tuning LLaMA on 60,000 lines of dialog from the first 12 seasons of the Simpsons, and now it can do a good job of producing invented dialog from any of the characters from the series. This is a really interesting result: I’ve been skeptical about how much value can be had from fine-tuning large models on just a tiny amount of new data, assuming that the new data would be statistically irrelevant compared to the existing model. Clearly my mental model around this was incorrect. #17th March 2023, 11:08 pm
  • The Unpredictable Abilities Emerging From Large AI Models (via) Nice write-up of the most interesting aspect of large language models: the fact that they gain emergent abilities at certain “breakthrough” size points, and no-one is entirely sure they understand why. #17th March 2023, 10:54 pm
  • Web Stable Diffusion (via) I just ran the full Stable Diffusion image generation model entirely in my browser, and used it to generate an image (of two raccoons eating pie in the woods, see “via” link). I had to use Google Chrome Canary since this depends on WebGPU which still isn’t fully rolled out, but it worked perfectly. #17th March 2023, 4:46 am
  • The surprising ease and effectiveness of AI in a loop (via) Matt Webb on the langchain Python library and the ReAct design pattern, where you plug additional tools into a language model by teaching it to work in a “Thought... Act... Observation” loop where the Act specifies an action it wishes to take (like searching Wikipedia) and an extra layer of software that carries out that action and feeds back the result as the Observation. Matt points out that the ChatGPT 1/10th price drop makes this kind of model usage enormously more cost effective than it was before. #17th March 2023, 12:04 am
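The “Thought... Act... Observation” loop described in the last item can be sketched in a few lines of Python. This is a rough illustration only, not how langchain implements it: `react_loop`, `scripted_model`, `lookup` and the `Act: tool[input]` / `Answer: ...` line format are all hypothetical names and conventions chosen for this example, and the "model" is a scripted stand-in so the code runs without any API access.

```python
import re
from typing import Callable, Dict, Optional

# Assumed (hypothetical) output conventions for the model:
#   Act: tool_name[tool input]
#   Answer: final answer text
ACTION_RE = re.compile(r"^Act: (\w+)\[(.*)\]$", re.MULTILINE)
ANSWER_RE = re.compile(r"^Answer: (.*)$", re.MULTILINE)


def react_loop(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_turns: int = 5,
) -> Optional[str]:
    """Run a Thought/Act/Observation loop until the model gives an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        reply = model(transcript)
        transcript += reply + "\n"

        # Stop as soon as the model commits to a final answer.
        answer = ANSWER_RE.search(reply)
        if answer:
            return answer.group(1).strip()

        # Otherwise look for a requested action and carry it out.
        action = ACTION_RE.search(reply)
        if action is None:
            break  # neither an answer nor an action: give up
        name, arg = action.group(1), action.group(2)
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"

        # Feed the result back so the next model call can see it.
        transcript += f"Observation: {observation}\n"
    return None


def scripted_model(transcript: str) -> str:
    """Stand-in for a real language model call (assumption for the demo)."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAct: lookup[capital of France]"
    return "Thought: I have what I need.\nAnswer: Paris"


def lookup(query: str) -> str:
    """Toy tool standing in for something like a Wikipedia search."""
    facts = {"capital of France": "Paris"}
    return facts.get(query, "no result")


result = react_loop(scripted_model, {"lookup": lookup}, "What is the capital of France?")
print(result)
```

In a real setup `scripted_model` would be replaced by a call to a hosted model, with a prompt explaining the available tools. The surrounding layer of software is deliberately small: it only parses the requested action, runs it, and appends the observation to the transcript.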

16th March 2023

  • Transformers.js. Hugging Face Transformers is a library of Transformer machine learning models plus a Python package for loading and running them. Transformers.js provides a JavaScript alternative interface which runs in your browser, thanks to a set of precompiled WebAssembly binaries for a selection of models. This interactive demo is incredible: in particular, try running the Image classification with google/vit-base-patch16-224 (91MB) model against any photo to get back labels representing that photo. Dropping one of these models onto a page is as easy as linking to a hosted CDN script and running a few lines of JavaScript. #16th March 2023, 11:41 pm
  • Train and run Stanford Alpaca on your own machine. The team at Replicate managed to train their own copy of Stanford’s Alpaca—a fine-tuned version of LLaMA that can follow instructions like ChatGPT. Here they provide step-by-step instructions for recreating Alpaca yourself—running the training needs one or more A100s for a few hours, which you can rent through various cloud providers. #16th March 2023, 4:10 pm
  • Not By AI: Your AI-free Content Deserves a Badge (via) A badge for non-AI generated content. Interesting to note that they set the cutoff at 90%: “Use this badge if your article, including blog posts, essays, research, letters, and other text-based content, contains less than 10% of AI output.” #16th March 2023, 4:05 pm
  • As an NLP researcher I’m kind of worried about this field after 10-20 years. Feels like these oversized LLMs are going to eat up this field and I’m sitting in my chair thinking, “What’s the point of my research when GPT-4 can do it better?”

    Jeonghwan Kim # 16th March 2023, 5:39 am

  • I expect GPT-4 will have a LOT of applications in web scraping

    The increased 32,000 token limit will be large enough to send it the full DOM of most pages, serialized to HTML—then ask questions to extract data

    Or... take a screenshot and use the GPT4 image input mode to ask questions about the visually rendered page instead!

    Might need to dust off all of those old semantic web dreams, because the world’s information is rapidly becoming fully machine readable

    Me # 16th March 2023, 1:09 am

  • bloomz.cpp (via) Nouamane Tazi adapted the llama.cpp project to run against the BLOOM family of language models, which were released in July 2022 and trained in France on 45 natural languages and 12 programming languages using the Jean Zay Public Supercomputer, provided by the French government and powered using mostly nuclear energy.

    It’s under the RAIL license which allows (limited) commercial use, unlike LLaMA.

    Nouamane reports getting 16 tokens/second from BLOOMZ-7B1 running on an M1 Pro laptop. #16th March 2023, 12:24 am

15th March 2023

  • “AI” has for recent memory been a marketing term anyway. Deep learning and variations have had a good run at being what people mean when they refer to AI, probably overweighting towards big convolution based computer vision models.

    Now, “AI” in people’s minds means generative models.

    That’s it, it doesn’t mean generative models are replacing CNNs, just like CNNs don’t replace SVMs or regression or whatever. It’s just that pop culture has fallen in love with something else.

    version_five # 15th March 2023, 9:05 pm

  • We call on the field to recognize that applications that aim to believably mimic humans bring risk of extreme harms. Work on synthetic human behavior is a bright line in ethical AI development, where downstream effects need to be understood and modeled in order to block foreseeable harm to society and different social groups.

    Emily M. Bender # 15th March 2023, 3:30 pm

  • GPT-4 Developer Livestream. 25 minutes of live demos from OpenAI co-founder Greg Brockman at the GPT-4 launch. These demos are all fascinating, including code writing and multimodal vision inputs. The one that really struck me is when Greg pasted in a copy of the tax code and asked GPT-4 to answer some sophisticated tax questions, involving step-by-step calculations that cited parts of the tax code it was working with. #15th March 2023, 12:20 am

14th March 2023

  • GPT-4 Technical Report (PDF). 98 pages of much more detailed information about GPT-4. The appendices are particularly interesting, including examples of advanced prompt engineering as well as examples of harmful outputs before and after tuning attempts to try and suppress them. #14th March 2023, 9:39 pm
  • We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. [...] We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.

    OpenAI # 14th March 2023, 5:02 pm

13th March 2023