Simon Willison’s Weblog

Subscribe

568 items tagged “ai”

2024

My notes on gpt2-chatbot. There's a new, unlabeled and undocumented model on the LMSYS Chatbot Arena today called gpt2-chatbot. It's been giving some impressive responses - you can prompt it directly in the Direct Chat tab by selecting it from the big model dropdown menu.

It looks like a stealth new model preview. It's giving answers that are comparable to GPT-4 Turbo and in some cases better - my own experiments lead me to think it may have more "knowledge" baked into it, as ego prompts ("Who is Simon Willison?") and questions about things like lists of speakers at DjangoCon over the years seem to hallucinate less and return more specific details than before.

The lack of transparency here is both entertaining and infuriating. Lots of people are performing a parallel distributed "vibe check" and sharing results with each other, but it's annoying that even the most basic questions (What even IS this thing? Can it do RAG? What's its context length?) remain unanswered so far.

The system prompt appears to be the following - but system prompts just influence how the model behaves, they aren't guaranteed to contain truthful information:

You are ChatGPT, a large language model trained
by OpenAI, based on the GPT-4 architecture.

Knowledge cutoff: 2023-11
Current date: 2024-04-29

Image input capabilities: Enabled
Personality: v2

My best guess is that this is a preview of some kind of OpenAI "GPT 4.5" release. I don't think it's a big enough jump in quality to be a GPT-5.

Update: LMSYS do document their policy on using anonymized model names for tests of unreleased models.

Update May 7th: The model has been confirmed as belonging to OpenAI thanks to an error message that leaked details of the underlying API platform. # 29th April 2024, 8:45 pm

The creator of a model can not ensure that a model is never used to do something harmful – any more so that the developer of a web browser, calculator, or word processor could. Placing liability on the creators of general purpose tools like these mean that, in practice, such tools can not be created at all, except by big businesses with well funded legal teams.

[...]

Instead of regulating the development of AI models, the focus should be on regulating their applications, particularly those that pose high risks to public safety and security. Regulate the use of AI in high-risk areas such as healthcare, criminal justice, and critical infrastructure, where the potential for harm is greatest, would ensure accountability for harmful use, whilst allowing for the continued advancement of AI technology.

Jeremy Howard # 29th April 2024, 4:04 pm

I’ve worked out why I don’t get much value out of LLMs. The hardest and most time-consuming parts of my job involve distinguishing between ideas that are correct, and ideas that are plausible-sounding but wrong. Current AI is great at the latter type of ideas, and I don’t need more of those.

Martin Kleppmann # 27th April 2024, 7:31 pm

It’s very fast to build something that’s 90% of a solution. The problem is that the last 10% of building something is usually the hard part which really matters, and with a black box at the center of the product, it feels much more difficult to me to nail that remaining 10%. With vibecheck, most of the time the results to my queries are great; some percentage of the time they aren’t. Closing that gap with gen AI feels much more fickle to me than a normal engineering problem. It could be that I’m unfamiliar with it, but I also wonder if some classes of generative AI based products are just doomed to mediocrity as a result.

Moxie Marlinspike # 26th April 2024, 9:40 pm

Snowflake Arctic Cookbook. Today's big model release was Snowflake Arctic, an enormous 480B model with a 128×3.66B MoE (Mixture of Experts) architecture. It's Apache 2 licensed and Snowflake state that "in addition, we are also open sourcing all of our data recipes and research insights."

The research insights will be shared on this Arctic Cookbook blog - which currently has two articles covering their MoE architecture and describing how they optimized their training run in great detail.

They also list dozens of "coming soon" posts, which should be pretty interesting given how much depth they've provided in their writing so far. # 25th April 2024, 2:47 am

When I said “Send a text message to Julian Chokkattu,” who’s a friend and fellow AI Pin reviewer over at Wired, I thought I’d be asked what I wanted to tell him. Instead, the device simply said OK and told me it sent the words “Hey Julian, just checking in. How’s your day going?” to Chokkattu. I’ve never said anything like that to him in our years of friendship, but I guess technically the AI Pin did do what I asked.

Cherlynn Low # 24th April 2024, 3:07 pm

openelm/README-pretraining.md. Apple released something big three hours ago, and I’m still trying to get my head around exactly what it is.

The parent project is called CoreNet, described as “A library for training deep neural networks”. Part of the release is a new LLM called OpenELM, which includes completely open source training code and a large number of published training checkpoint.

I’m linking here to the best documentation I’ve found of that training data: it looks like the bulk of it comes from RefinedWeb, RedPajama, The Pile and Dolma. # 24th April 2024, 2:57 am

microsoft/Phi-3-mini-4k-instruct-gguf (via) Microsoft’s Phi-3 LLM is out and it’s really impressive. This 4,000 token context GGUF model is just a 2.2GB (for the Q4 version) and ran on my Mac using the llamafile option described in the README. I could then run prompts through it using the llm-llamafile plugin.

The vibes are good! Initial test prompts I’ve tried feel similar to much larger 7B models, despite using just a few GBs of RAM. Tokens are returned fast too—it feels like the fastest model I’ve tried yet.

And it’s MIT licensed. # 23rd April 2024, 5:40 pm

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (via) By far the most detailed paper on prompt injection I’ve seen yet from OpenAI, published a few days ago and with six credited authors: Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke and Alex Beutel.

The paper notes that prompt injection mitigations which completely refuse any form of instruction in an untrusted prompt may not actually be ideal: some forms of instruction are harmless, and refusing them may provide a worse experience.

Instead, it proposes a hierarchy—where models are trained to consider if instructions from different levels conflict with or support the goals of the higher-level instructions—if they are aligned or misaligned with them.

The authors tested this idea by fine-tuning a model on top of GPT 3.5, and claim that it shows greatly improved performance against numerous prompt injection benchmarks.

As always with prompt injection, my key concern is that I don’t think “improved” is good enough here. If you are facing an adversarial attacker reducing the chance that they might find an exploit just means they’ll try harder until they find an attack that works.

The paper concludes with this note: “Finally, our current models are likely still vulnerable to powerful adversarial attacks. In the future, we will conduct more explicit adversarial training, and study more generally whether LLMs can be made sufficiently robust to enable high-stakes agentic applications.” # 23rd April 2024, 3:36 am

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.

Phi-3 Technical Report # 23rd April 2024, 3 am

timpaul/form-extractor-prototype (via) Tim Paul, Head of Interaction Design at the UK’s Government Digital Service, published this brilliant prototype built on top of Claude 3 Opus.

The video shows what it can do. Give it an image of a form and it will extract the form fields and use them to create a GDS-style multi-page interactive form, using their GOV.UK Forms design system and govuk-frontend npm package.

It works for both hand-drawn napkin illustrations and images of existing paper forms.

The bulk of the prompting logic is the schema definition in data/extract-form-questions.json

I’m always excited to see applications built on LLMs that go beyond the chatbot UI. This is a great example of exactly that. # 22nd April 2024, 10:01 pm

Options for accessing Llama 3 from the terminal using LLM

Llama 3 was released on Thursday. Early indications are that it’s now the best available openly licensed model—Llama 3 70b Instruct has taken joint 5th place on the LMSYS arena leaderboard, behind only Claude 3 Opus and some GPT-4s and sharing 5th place with Gemini Pro and Claude 3 Sonnet. But unlike those other models Llama 3 70b is weights available and can even be run on a (high end) laptop!

[... 1962 words]

llm-gpt4all. New release of my LLM plugin which builds on Nomic's excellent gpt4all Python library. I've upgraded to their latest version which adds support for Llama 3 8B Instruct, so after a 4.4GB model download this works:

llm -m Meta-Llama-3-8B-Instruct "say hi in Spanish" # 20th April 2024, 5:58 pm

Andrej Karpathy’s Llama 3 review. The most interesting coverage I’ve seen so far of Meta’s Llama 3 models (8b and 70b so far, 400b promised later).

Andrej notes that Llama 3 trained on 15 trillion tokens—up from 2 trillion for Llama 2—and they used that many even for the smaller 8b model, 75x more than the chinchilla scaling laws would suggest.

The tokenizer has also changed—they now use 128,000 tokens, up from 32,000. This results in a 15% drop in the tokens needed to represent a string of text.

The one disappointment is the context length—just 8,192, 2x that of Llama 2 and 4x LLaMA 1 but still pretty small by today’s standards.

If early indications hold, the 400b model could be the first genuinely GPT-4 class openly licensed model. We’ll have to wait and see. # 18th April 2024, 8:50 pm

How cheap, outsourced labour in Africa is shaping AI English. The word “delve” has been getting a lot of attention recently as an example of something that might be an indicator of ChatGPT generated content.

One example: articles on medical research site PubMed now use “delve” 10 to 100 times more than a few years ago!

Nigerian Twitter took offense recently to Paul Graham’s suggestion that “delve” is a sign of bad writing. It turns out Nigerian formal writing has a subtly different vocabulary.

Alex Hern theorizes that the underlying cause may be related. Companies like OpenAI frequently outsource data annotation to countries like Nigeria that have excellent English skills and low wages. RLHF (reinforcement learning from human feedback) involves annotators comparing and voting on the “best” responses from the models.

Are they teaching models to favour Nigerian-English? It’s a pretty solid theory! # 18th April 2024, 4:04 pm

I have a child who is also 2e and has been part of the NYC G&T program. We’ve had a positive experience with the citywide program, specifically with the program at The Anderson School.

Meta AI bot, answering a question on a forum # 18th April 2024, 3:34 am

llm-reka. My new plugin for running LLM prompts against the Reka family of API hosted LLM models: reka-core ($10 per million input), reka-flash (80c per million) and reka-edge (40c per million).

All three of those models are trained from scratch by a team that includes several Google Brain alumni.

Reka Core is their most powerful model, released on Monday 15th April and claiming benchmark scores competitive with GPT-4 and Claude 3 Opus. # 18th April 2024, 3:17 am

mistralai/mistral-common. New from Mistral: mistral-common, an open source Python library providing "a set of tools to help you work with Mistral models".

So far that means a tokenizer! This is similar to OpenAI's tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use - useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.

Mistral's library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI's APIs you're currently left guessing how many tokens are taken up by these advanced features.

Anthropic haven't published any form of tokenizer at all - it's the feature I'd most like to see from them next.

Here's how to explore the vocabulary of the tokenizer:

MistralTokenizer.from_model(
    "open-mixtral-8x22b"
).instruct_tokenizer.tokenizer.vocab()[:12]

['<unk>', '<s>', '</s>', '[INST]', '[/INST]', '[TOOL_CALLS]', '[AVAILABLE_TOOLS]', '[/AVAILABLE_TOOLS]', '[TOOL_RESULTS]', '[/TOOL_RESULTS]'] # 18th April 2024, 12:39 am

In mid-March, we added this line to our system prompt to prevent Claude from thinking it can open URLs:

“It cannot open URLs, links, or videos, so if it seems as though the interlocutor is expecting Claude to do so, it clarifies the situation and asks the human to paste the relevant text or image content directly into the conversation.”

Alex Albert (Anthropic) # 18th April 2024, 12:22 am

AI for Data Journalism: demonstrating what we can do with this stuff right now

I gave a talk last month at the Story Discovery at Scale data journalism conference hosted at Stanford by Big Local News. My brief was to go deep into the things we can use Large Language Models for right now, illustrated by a flurry of demos to help provide starting points for further conversations at the conference.

[... 6080 words]

But the reality is that you can’t build a hundred-billion-dollar industry around a technology that’s kind of useful, mostly in mundane ways, and that boasts perhaps small increases in productivity if and only if the people who use it fully understand its limitations.

Molly White # 17th April 2024, 7:53 pm

The saddest part about it, though, is that the garbage books don’t actually make that much money either. It’s even possible to lose money generating your low-quality ebook to sell on Kindle for $0.99. The way people make money these days is by teaching students the process of making a garbage ebook. It’s grift and garbage all the way down — and the people who ultimately lose out are the readers and writers who love books.

Constance Grady # 16th April 2024, 11:31 pm

Google NotebookLM Data Exfiltration (via) NotebookLM is a Google Labs product that lets you store information as sources (mainly text files in PDF) and then ask questions against those sources—effectively an interface for building your own custom RAG (Retrieval Augmented Generation) chatbots.

Unsurprisingly for anything that allows LLMs to interact with untrusted documents, it’s susceptible to prompt injection.

Johann Rehberger found some classic prompt injection exfiltration attacks: you can create source documents with instructions that cause the chatbot to load a Markdown image that leaks other private data to an external domain as data passed in the query string.

Johann reported this privately in the December but the problem has not yet been addressed. UPDATE: The NotebookLM team deployed a fix for this on 18th April.

A good rule of thumb is that any time you let LLMs see untrusted tokens there is a risk of an attack like this, so you should be very careful to avoid exfiltration vectors like Markdown images or even outbound links. # 16th April 2024, 9:28 pm

OpenAI Batch API (via) OpenAI are now offering a 50% discount on batch chat completion API calls if you submit them in bulk and allow for up to 24 hours for them to be run.

Requests are sent as a newline-delimited JSON file, with each line looking something like this:

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}]}}

You upload a file for the batch, kick off a batch request and then poll for completion.

This makes GPT-3.5 Turbo cheaper than Claude 3 Haiku - provided you're willing to wait a few hours for your responses. # 15th April 2024, 5:58 pm

[On complaints about Claude 3 reduction in quality since launch] The model is stored in a static file and loaded, continuously, across 10s of thousands of identical servers each of which serve each instance of the Claude model. The model file never changes and is immutable once loaded; every shard is loading the same model file running exactly the same software. We haven’t changed the temperature either. We don’t see anywhere where drift could happen. The files are exactly the same as at launch and loaded each time from a frozen pristine copy.

Jason D. Clinton, Anthropic # 15th April 2024, 1:27 am

Lessons after a half-billion GPT tokens (via) Ken Kantzer presents some hard-won experience from shipping real features on top of OpenAI’s models.

They ended up settling on a very basic abstraction over the chat API—mainly to handle automatic retries on a 500 error. No complex wrappers, not even JSON mode or function calling or system prompts.

Rather than counting tokens they estimate tokens as 3 times the length in characters, which works well enough.

One challenge they highlight for structured data extraction (one of my favourite use-cases for LLMs): “GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.”

(Several commenters on Hacker News report success in getting more items back by using numbered keys or sequence IDs in the returned JSON to help the model keep count.) # 13th April 2024, 8:54 pm

The language issues are indicative of the bigger problem facing the AI Pin, ChatGPT, and frankly, every other AI product out there: you can’t see how it works, so it’s impossible to figure out how to use it. [...] our phones are constant feedback machines — colored buttons telling us what to tap, instant activity every time we touch or pinch or scroll. You can see your options and what happens when you pick one. With AI, you don’t get any of that. Using the AI Pin feels like wishing on a star: you just close your eyes and hope for the best. Most of the time, nothing happens.

David Pierce # 12th April 2024, 12:39 pm

3Blue1Brown: Attention in transformers, visually explained. Grant Sanderson publishes animated explainers of mathematical topics on YouTube, to over 6 million subscribers. His latest shows how the attention mechanism in transformers (the algorithm behind most LLMs) works and is by far the clearest explanation I’ve seen of the topic anywhere.

I was intrigued to find out what tool he used to produce the visualizations. It turns out Grant built his own open source Python animation library, manim, to enable his YouTube work. # 11th April 2024, 4:12 pm

Use an llm to automagically generate meaningful git commit messages. Neat, thoroughly documented recipe by Harper Reed using my LLM CLI tool as part of a scheme for if you’re feeling too lazy to write a commit message—it uses a prepare-commit-msg Git hook which runs any time you commit without a message and pipes your changes to a model along with a custom system prompt. # 11th April 2024, 4:06 am

[on GitHub Copilot] It’s like insisting to walk when you can take a bike. It gets the hard things wrong but all the easy things right, very helpful and much faster. You have to learn what it can and can’t do.

Andrej Karpathy # 11th April 2024, 1:27 am