Simon Willison’s Weblog

Subscribe

Stuff we figured out about AI in 2023

31st December 2023

2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s.

Here’s my attempt to round up the highlights in one place!

Here’s the sequel to this post: Things we learned about LLMs in 2024.

Large Language Models

In the past 24-36 months, our species has discovered that you can take a GIANT corpus of text, run it through a pile of GPUs, and use it to create a fascinating new kind of software.

LLMs can do a lot of things. They can answer questions, summarize documents, translate from one language to another, extract information and even write surprisingly competent code.

They can also help you cheat at your homework, generate unlimited streams of fake content and be used for all manner of nefarious purposes.

So far, I think they’re a net positive. I’ve used them on a personal level to improve my productivity (and entertain myself) in all sorts of different ways. I think people who learn how to use them effectively can gain a significant boost to their quality of life.

A lot of people are yet to be sold on their value! Some think their negatives outweigh their positives, some think they are all hot air, and some even think they represent an existential threat to humanity.

They’re actually quite easy to build

The most surprising thing we’ve learned about LLMs this year is that they’re actually quite easy to build.

Intuitively, one would expect that systems this powerful would take millions of lines of complex code. Instead, it turns out a few hundred lines of Python is genuinely enough to train a basic version!

What matters most is the training data. You need a lot of data to make these things work, and the quantity and quality of the training data appears to be the most important factor in how good the resulting model is.

If you can gather the right data, and afford to pay for the GPUs to train it, you can build an LLM.

A year ago, the only organization that had released a generally useful LLM was OpenAI. We’ve now seen better-than-GPT-3 class models produced by Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu and a bunch of other organizations.

The training cost (hardware and electricity) is still significant—initially millions of dollars, but that seems to have dropped to the tens of thousands already. Microsoft’s Phi-2 claims to have used “14 days on 96 A100 GPUs”, which works out at around $35,000 using current Lambda pricing.

So training an LLM still isn’t something a hobbyist can afford, but it’s no longer the sole domain of the super-rich. I like to compare the difficulty of training an LLM to that of building a suspension bridge—not trivial, but hundreds of countries around the world have figured out how to do it. (Correction: Wikipedia’s Suspension bridges by country category lists 44 countries).

You can run LLMs on your own devices

In January of this year, I thought it would be years before I could run a useful LLM on my own computer. GPT-3 and 3.5 were pretty much the only games in town, and I thought that even if the model weights were available it would take a $10,000+ server to run them.

Then in February, Meta released Llama. And a few weeks later in March, Georgi Gerganov released code that got it working on a MacBook.

I wrote about how Large language models are having their Stable Diffusion moment, and with hindsight that was a very good call!

This unleashed a whirlwind of innovation, which was accelerated further in July when Meta released Llama 2—an improved version which, crucially, included permission for commercial use.

Today there are literally thousands of LLMs that can be run locally, on all manner of different devices.

I run a bunch of them on my laptop. I run Mistral 7B (a surprisingly great model) on my iPhone. You can install several different apps to get your own, local, completely private LLM. My own LLM project provides a CLI tool for running an array of different models via plugins.

You can even run them entirely in your browser using WebAssembly and the latest Chrome!

Hobbyists can build their own fine-tuned models

I said earlier that building an LLM was still out of reach of hobbyists. That may be true for training from scratch, but fine-tuning one of those models is another matter entirely.

There’s now a fascinating ecosystem of people training their own models on top of these foundations, publishing those models, building fine-tuning datasets and sharing those too.

The Hugging Face Open LLM Leaderboard is one place that tracks these. I can’t even attempt to count them, and any count would be out-of-date within a few hours.

The best overall openly licensed LLM at any time is rarely a foundation model: instead, it’s whichever fine-tuned community model has most recently discovered the best combination of fine-tuning data.

This is a huge advantage for open over closed models: the closed, hosted models don’t have thousands of researchers and hobbyists around the world collaborating and competing to improve them.

We don’t yet know how to build GPT-4

Frustratingly, despite the enormous leaps ahead we’ve had this year, we are yet to see an alternative model that’s better than GPT-4.

OpenAI released GPT-4 in March, though it later turned out we had a sneak peak of it in February when Microsoft used it as part of the new Bing.

This may well change in the next few weeks: Google’s Gemini Ultra has big claims, but isn’t yet available for us to try out.

The team behind Mistral are working to beat GPT-4 as well, and their track record is already extremely strong considering their first public model only came out in September, and they’ve released two significant improvements since then.

Still, I’m surprised that no-one has beaten the now almost year old GPT-4 by now. OpenAI clearly have some substantial tricks that they haven’t shared yet.

Vibes Based Development

As a computer scientist and software engineer, LLMs are infuriating.

Even the openly licensed ones are still the world’s most convoluted black boxes. We continue to have very little idea what they can do, how exactly they work and how best to control them.

I’m used to programming where the computer does exactly what I tell it to do. Prompting an LLM is decidedly not that!

The worst part is the challenge of evaluating them.

There are plenty of benchmarks, but no benchmark is going to tell you if an LLM actually “feels” right when you try it for a given task.

I find I have to work with an LLM for a few weeks in order to get a good intuition for it’s strengths and weaknesses. This greatly limits how many I can evaluate myself!

The most frustrating thing for me is at the level of individual prompting.

Sometimes I’ll tweak a prompt and capitalize some of the words in it, to emphasize that I really want it to OUTPUT VALID MARKDOWN or similar. Did capitalizing those words make a difference? I still don’t have a good methodology for figuring that out.

We’re left with what’s effectively Vibes Based Development. It’s vibes all the way down.

I’d love to see us move beyond vibes in 2024!

LLMs are really smart, and also really, really dumb

On the one hand, we keep on finding new things that LLMs can do that we didn’t expect—and that the people who trained the models didn’t expect either. That’s usually really fun!

But on the other hand, the things you sometimes have to do to get the models to behave are often incredibly dumb.

Does ChatGPT get lazy in December, because its hidden system prompt includes the current date and its training data shows that people provide less useful answers coming up to the holidays?

The honest answer is “maybe”! No-one is entirely sure, but if you give it a different date its answers may skew slightly longer.

Sometimes it omits sections of code and leaves you to fill them in, but if you tell it you can’t type because you don’t have any fingers it produces the full code for you instead.

There are so many more examples like this. Offer it cash tips for better answers. Tell it your career depends on it. Give it positive reinforcement. It’s all so dumb, but it works!

Gullibility is the biggest unsolved problem

I coined the term prompt injection in September last year.

15 months later, I regret to say that we’re still no closer to a robust, dependable solution to this problem.

I’ve written a ton about this already.

Beyond that specific class of security vulnerabilities, I’ve started seeing this as a wider problem of gullibility.

Language Models are gullible. They “believe” what we tell them—what’s in their training data, then what’s in the fine-tuning data, then what’s in the prompt.

In order to be useful tools for us, we need them to believe what we feed them!

But it turns out a lot of the things we want to build need them not to be gullible.

Everyone wants an AI personal assistant. If you hired a real-world personal assistant who believed everything that anyone told them, you would quickly find that their ability to positively impact your life was severely limited.

A lot of people are excited about AI agents—an infuriatingly vague term that seems to be converging on “AI systems that can go away and act on your behalf”. We’ve been talking about them all year, but I’ve seen few if any examples of them running in production, despite lots of exciting prototypes.

I think this is because of gullibility.

Can we solve this? Honestly, I’m beginning to suspect that you can’t fully solve gullibility without achieving AGI. So it may be quite a while before those agent dreams can really start to come true!

Code may be the best application

Over the course of the year, it’s become increasingly clear that writing code is one of the things LLMs are most capable of.

If you think about what they do, this isn’t such a big surprise. The grammar rules of programming languages like Python and JavaScript are massively less complicated than the grammar of Chinese, Spanish or English.

It’s still astonishing to me how effective they are though.

One of the great weaknesses of LLMs is their tendency to hallucinate—to imagine things that don’t correspond to reality. You would expect this to be a particularly bad problem for code—if an LLM hallucinates a method that doesn’t exist, the code should be useless.

Except... you can run generated code to see if it’s correct. And with patterns like ChatGPT Code Interpreter the LLM can execute the code itself, process the error message, then rewrite it and keep trying until it works!

So hallucination is a much lesser problem for code generation than for anything else. If only we had the equivalent of Code Interpreter for fact-checking natural language!

How should we feel about this as software engineers?

On the one hand, this feels like a threat: who needs a programmer if ChatGPT can write code for you?

On the other hand, as software engineers we are better placed to take advantage of this than anyone else. We’ve all been given weird coding interns—we can use our deep knowledge to prompt them to solve coding problems more effectively than anyone else can.

The ethics of this space remain diabolically complex

In September last year Andy Baio and I produced the first major story on the unlicensed training data behind Stable Diffusion.

Since then, almost every major LLM (and most of the image generation models) have also been trained on unlicensed data.

Just this week, the New York Times launched a landmark lawsuit against OpenAI and Microsoft over this issue. The 69 page PDF is genuinely worth reading—especially the first few pages, which lay out the issues in a way that’s surprisingly easy to follow. The rest of the document includes some of the clearest explanations of what LLMs are, how they work and how they are built that I’ve read anywhere.

The legal arguments here are complex. I’m not a lawyer, but I don’t think this one will be easily decided. Whichever way it goes, I expect this case to have a profound impact on how this technology develops in the future.

Law is not ethics. Is it OK to train models on people’s content without their permission, when those models will then be used in ways that compete with those people?

As the quality of results produced by AI models has increased over the year, these questions have become even more pressing.

The impact on human society in terms of these models is already huge, if difficult to objectively measure.

People have certainly lost work to them—anecdotally, I’ve seen this for copywriters, artists and translators.

There are a great deal of untold stories here. I’m hoping 2024 sees significant amounts of dedicated journalism on this topic.

My blog in 2023

Here’s a tag cloud for content I posted to my blog in 2023 (generated using Django SQL Dashboard):

Tag cloud words in order of size: ai, generativeai, llms, openai, chatgpt, projects, python, datasette, ethics, llama, homebrewllms, sqlite, gpt3, promptengineering, promptinjection, llm, security, opensource, gpt4, weeknotes

The top five: ai (342), generativeai (300), llms (287), openai (86), chatgpt (78).

I’ve written a lot about this stuff!

I grabbed a screenshot of my Plausible analytics for the year, fed that to ChatGPT Vision, told it to extract the data into a table, then got it to mix in entry titles (from a SQL query it wrote) and produced this table with it. Here are my top entries this year by amount of traffic:

Article Visitors Pageviews
Bing: “I will not harm you unless you harm me first” 1.1M 1.3M
Leaked Google document: “We Have No Moat, And Neither Does OpenAI” 132k 162k
Large language models are having their Stable Diffusion moment 121k 150k
Prompt injection: What’s the worst that can happen? 79.8k 95.9k
Embeddings: What they are and why they matter 61.7k 79.3k
Catching up on the weird world of LLMs 61.6k 85.9k
llamafile is the new best way to run an LLM on your own computer 52k 66k
Prompt injection explained, with video, slides, and a transcript 51k 61.9k
AI-enhanced development makes me more ambitious with my projects 49.6k 60.1k
Understanding GPT tokenizers 49.5k 61.1k
Exploring GPTs: ChatGPT in a trench coat? 46.4k 58.5k
Could you train a ChatGPT-beating model for $85,000 and run it in a browser? 40.5k 49.2k
How to implement Q&A against your documentation with GPT3, embeddings and Datasette 37.3k 44.9k
Lawyer cites fake cases invented by ChatGPT, judge is not amused 37.1k 47.4k
Now add a walrus: Prompt engineering in DALL-E 3 32.8k 41.2k
Web LLM runs the vicuna-7b Large Language Model entirely in your browser, and it’s very impressive 32.5k 38.2k
ChatGPT can’t access the internet, even though it really looks like it can 30.5k 34.2k
Stanford Alpaca, and the acceleration of on-device large language model development 29.7k 35.7k
Run Llama 2 on your own Mac using LLM and Homebrew 27.9k 33.6k
Midjourney 5.1 26.7k 33.4k
Think of language models like ChatGPT as a “calculator for words” 25k 31.8k
Multi-modal prompt injection image attacks against GPT-4V 23.7k 27.4k

I also gave a bunch of talks and podcast appearances. I’ve started habitually turning my talks into annotated presentations—here are my best from 2023:

And in podcasts:

This is Stuff we figured out about AI in 2023 by Simon Willison, posted on 31st December 2023.

Part of series LLMs annual review

  1. Stuff we figured out about AI in 2023 - Dec. 31, 2023, 11:59 p.m.
  2. Things we learned about LLMs in 2024 - Dec. 31, 2024, 6:07 p.m.

Next: Tom Scott, and the formidable power of escalating streaks

Previous: Last weeknotes of 2023