Simon Willison’s Weblog

Subscribe

432 items tagged “llms”

2024

Andrej Karpathy’s Llama 3 review. The most interesting coverage I’ve seen so far of Meta’s Llama 3 models (8b and 70b so far, 400b promised later).

Andrej notes that Llama 3 trained on 15 trillion tokens—up from 2 trillion for Llama 2—and they used that many even for the smaller 8b model, 75x more than the chinchilla scaling laws would suggest.

The tokenizer has also changed—they now use 128,000 tokens, up from 32,000. This results in a 15% drop in the tokens needed to represent a string of text.

The one disappointment is the context length—just 8,192, 2x that of Llama 2 and 4x LLaMA 1 but still pretty small by today’s standards.

If early indications hold, the 400b model could be the first genuinely GPT-4 class openly licensed model. We’ll have to wait and see. # 18th April 2024, 8:50 pm

How cheap, outsourced labour in Africa is shaping AI English. The word “delve” has been getting a lot of attention recently as an example of something that might be an indicator of ChatGPT generated content.

One example: articles on medical research site PubMed now use “delve” 10 to 100 times more than a few years ago!

Nigerian Twitter took offense recently to Paul Graham’s suggestion that “delve” is a sign of bad writing. It turns out Nigerian formal writing has a subtly different vocabulary.

Alex Hern theorizes that the underlying cause may be related. Companies like OpenAI frequently outsource data annotation to countries like Nigeria that have excellent English skills and low wages. RLHF (reinforcement learning from human feedback) involves annotators comparing and voting on the “best” responses from the models.

Are they teaching models to favour Nigerian-English? It’s a pretty solid theory! # 18th April 2024, 4:04 pm

I have a child who is also 2e and has been part of the NYC G&T program. We’ve had a positive experience with the citywide program, specifically with the program at The Anderson School.

Meta AI bot, answering a question on a forum # 18th April 2024, 3:34 am

llm-reka. My new plugin for running LLM prompts against the Reka family of API hosted LLM models: reka-core ($10 per million input), reka-flash (80c per million) and reka-edge (40c per million).

All three of those models are trained from scratch by a team that includes several Google Brain alumni.

Reka Core is their most powerful model, released on Monday 15th April and claiming benchmark scores competitive with GPT-4 and Claude 3 Opus. # 18th April 2024, 3:17 am

mistralai/mistral-common. New from Mistral: mistral-common, an open source Python library providing “a set of tools to help you work with Mistral models”.

So far that means a tokenizer! This is similar to OpenAI’s tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use—useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.

Mistral’s library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI’s APIs you’re currently left guessing how many tokens are taken up by these advanced features.

Anthropic haven’t published any form of tokenizer at all—it’s the feature I’d most like to see from them next.

Here’s how to explore the vocabulary of the tokenizer:

MistralTokenizer.from_model(
“open-mixtral-8x22b”
).instruct_tokenizer.tokenizer.vocab()[:12]

[’<unk>’, ’<s>’, ’</s>’, ’[INST]’, ’[/INST]’, ’[TOOL_CALLS]’, ’[AVAILABLE_TOOLS]’, ’[/AVAILABLE_TOOLS]’, ’[TOOL_RESULTS]’, ’[/TOOL_RESULTS]’] # 18th April 2024, 12:39 am

In mid-March, we added this line to our system prompt to prevent Claude from thinking it can open URLs:

“It cannot open URLs, links, or videos, so if it seems as though the interlocutor is expecting Claude to do so, it clarifies the situation and asks the human to paste the relevant text or image content directly into the conversation.”

Alex Albert (Anthropic) # 18th April 2024, 12:22 am

AI for Data Journalism: demonstrating what we can do with this stuff right now

I gave a talk last month at the Story Discovery at Scale data journalism conference hosted at Stanford by Big Local News. My brief was to go deep into the things we can use Large Language Models for right now, illustrated by a flurry of demos to help provide starting points for further conversations at the conference.

[... 6080 words]

But the reality is that you can’t build a hundred-billion-dollar industry around a technology that’s kind of useful, mostly in mundane ways, and that boasts perhaps small increases in productivity if and only if the people who use it fully understand its limitations.

Molly White # 17th April 2024, 7:53 pm

The saddest part about it, though, is that the garbage books don’t actually make that much money either. It’s even possible to lose money generating your low-quality ebook to sell on Kindle for $0.99. The way people make money these days is by teaching students the process of making a garbage ebook. It’s grift and garbage all the way down — and the people who ultimately lose out are the readers and writers who love books.

Constance Grady # 16th April 2024, 11:31 pm

Google NotebookLM Data Exfiltration (via) NotebookLM is a Google Labs product that lets you store information as sources (mainly text files in PDF) and then ask questions against those sources—effectively an interface for building your own custom RAG (Retrieval Augmented Generation) chatbots.

Unsurprisingly for anything that allows LLMs to interact with untrusted documents, it’s susceptible to prompt injection.

Johann Rehberger found some classic prompt injection exfiltration attacks: you can create source documents with instructions that cause the chatbot to load a Markdown image that leaks other private data to an external domain as data passed in the query string.

Johann reported this privately in the December but the problem has not yet been addressed.

A good rule of thumb is that any time you let LLMs see untrusted tokens there is a risk of an attack like this, so you should be very careful to avoid exfiltration vectors like Markdown images or even outbound links. # 16th April 2024, 9:28 pm

OpenAI Batch API (via) OpenAI are now offering a 50% discount on batch chat completion API calls if you submit them in bulk and allow for up to 24 hours for them to be run.

Requests are sent as a newline-delimited JSON file, with each line looking something like this:

{“custom_id”: “request-1”, “method”: “POST”, “url”: “/v1/chat/completions”, “body”: {“model”: “gpt-3.5-turbo”, “messages”: [{“role”: “system”, “content”: “You are a helpful assistant.”}, {“role”: “user”, “content”: “What is 2+2?”}]}}

You upload a file for the batch, kick off a batch request and then poll for completion.

This makes GPT-3.5 Turbo cheaper than Claude 3 Haiku—provided you’re willing to wait a few hours for your responses. # 15th April 2024, 5:58 pm

[On complaints about Claude 3 reduction in quality since launch] The model is stored in a static file and loaded, continuously, across 10s of thousands of identical servers each of which serve each instance of the Claude model. The model file never changes and is immutable once loaded; every shard is loading the same model file running exactly the same software. We haven’t changed the temperature either. We don’t see anywhere where drift could happen. The files are exactly the same as at launch and loaded each time from a frozen pristine copy.

Jason D. Clinton, Anthropic # 15th April 2024, 1:27 am

Lessons after a half-billion GPT tokens (via) Ken Kantzer presents some hard-won experience from shipping real features on top of OpenAI’s models.

They ended up settling on a very basic abstraction over the chat API—mainly to handle automatic retries on a 500 error. No complex wrappers, not even JSON mode or function calling or system prompts.

Rather than counting tokens they estimate tokens as 3 times the length in characters, which works well enough.

One challenge they highlight for structured data extraction (one of my favourite use-cases for LLMs): “GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.”

(Several commenters on Hacker News report success in getting more items back by using numbered keys or sequence IDs in the returned JSON to help the model keep count.) # 13th April 2024, 8:54 pm

3Blue1Brown: Attention in transformers, visually explained. Grant Sanderson publishes animated explainers of mathematical topics on YouTube, to over 6 million subscribers. His latest shows how the attention mechanism in transformers (the algorithm behind most LLMs) works and is by far the clearest explanation I’ve seen of the topic anywhere.

I was intrigued to find out what tool he used to produce the visualizations. It turns out Grant built his own open source Python animation library, manim, to enable his YouTube work. # 11th April 2024, 4:12 pm

[on GitHub Copilot] It’s like insisting to walk when you can take a bike. It gets the hard things wrong but all the easy things right, very helpful and much faster. You have to learn what it can and can’t do.

Andrej Karpathy # 11th April 2024, 1:27 am

Notes on how to use LLMs in your product. A whole bunch of useful observations from Will Larson here. I love his focus on the key characteristic of LLMs that “you cannot know whether a given response is accurate”, nor can you calculate a dependable confidence score for a response—and as a result you need to either “accept potential inaccuracies (which makes sense in many cases, humans are wrong sometimes too) or keep a Human-in-the-Loop (HITL) to validate the response.” # 10th April 2024, 11:14 pm

The challenge [with RAG] is that most corner-cutting solutions look like they’re working on small datasets while letting you pretend that things like search relevance don’t matter, while in reality relevance significantly impacts quality of responses when you move beyond prototyping (whether they’re literally search relevance or are better tuned SQL queries to retrieve more appropriate rows). This creates a false expectation of how the prototype will translate into a production capability, with all the predictable consequences: underestimating timelines, poor production behavior/performance, etc.

Will Larson # 10th April 2024, 11:09 pm

Three major LLM releases in 24 hours (plus weeknotes)

I’m a bit behind on my weeknotes, so there’s a lot to cover here. But first... a review of the last 24 hours of Large Language Model news. All times are in US Pacific on April 9th 2024.

[... 1401 words]

Gemini 1.5 Pro public preview (via) Huge release from Google: Gemini 1.5 Pro—the GPT-4 competitive model with the incredible 1 million token context length—is now available without a waitlist in 180+ countries (including the USA but not Europe or the UK as far as I can tell)... and the API is free for 50 requests/day (rate limited to 2/minute).

Beyond that you’ll need to pay—$7/million input tokens and $21/million output tokens, which is slightly less than GPT-4 Turbo and a little more than Claude 3 Sonnet.

They also announced audio input (up to 9.5 hours in a single prompt), system instruction support and a new JSON mod. # 10th April 2024, 2:38 am

Mistral tweet a magnet link for mixtral-8x22b. Another open model release from Mistral using their now standard operating procedure of tweeting out a raw torrent link.

This one is an 8x22B Mixture of Experts model. Their previous most powerful openly licensed release was Mixtral 8x7B, so this one is a whole lot bigger (a 281GB download)—and apparently has a 65,536 context length, at least according to initial rumors on Twitter. # 10th April 2024, 2:31 am

Extracting data from unstructured text and images with Datasette and GPT-4 Turbo. Datasette Extract is a new Datasette plugin that uses GPT-4 Turbo (released to general availability today) and GPT-4 Vision to extract structured data from unstructured text and images.

I put together a video demo of the plugin in action today, and posted it to the Datasette Cloud blog along with screenshots and a tutorial describing how to use it. # 9th April 2024, 11:03 pm

A solid pattern to build LLM Applications (feat. Claude) (via) Hrishi Olickel is one of my favourite prompt whisperers. In this YouTube video he walks through his process for building quick interactive applications with the assistance of Claude 3, spinning up an app that analyzes his meeting transcripts to extract participants and mentioned organisations, then presents a UI for exploring the results built with Next.js and shadcn/ui.

An interesting tip I got from this: use the weakest, not the strongest models to iterate on your prompts. If you figure out patterns that work well with Claude 3 Haiku they will have a significantly lower error rate with Sonnet or Opus. The speed of the weaker models also means you can iterate much faster, and worry less about the cost of your experiments. # 9th April 2024, 6:39 pm

Command R+ now ranked 6th on the LMSYS Chatbot Arena. The LMSYS Chatbot Arena Leaderboard is one of the most interesting approaches to evaluating LLMs because it captures their ever-elusive “vibes”—it works by users voting on the best responses to prompts from two initially hidden models

Big news today is that Command R+—the brand new open weights model (Creative Commons non-commercial) by Cohere—is now the highest ranked non-proprietary model, in at position six and beating one of the GPT-4s.

(Linking to my screenshot on Mastodon.) # 9th April 2024, 4:19 pm

llm.c (via) Andrej Karpathy implements LLM training—initially for GPT-2, other architectures to follow—in just over 1,000 lines of C on top of CUDA. Includes a tutorial about implementing LayerNorm by porting an implementation from Python. # 9th April 2024, 3:24 pm

Building files-to-prompt entirely using Claude 3 Opus

files-to-prompt is a new tool I built to help me pipe several files at once into prompts to LLMs such as Claude and GPT-4.

[... 3235 words]

The lifecycle of a code AI completion (via) Philipp Spiess provides a deep dive into how Sourcegraph’s Cody code completion assistant works. Lots of fascinating details in here:

“One interesting learning was that if a user is willing to wait longer for a multi-line request, it usually is worth it to increase latency slightly in favor of quality. For our production setup this means we use a more complex language model for multi-line completions than we do for single-line completions.”

This article is from October 2023 and talks about Claude Instant. The code for Cody is open source so I checked to see if they have switched to Haiku yet and found a commit from March 25th that adds Haiku as an A/B test. # 7th April 2024, 7:37 pm

llm-command-r. Cohere released Command R Plus today—an open weights (non commercial/research only) 104 billion parameter LLM, a big step up from their previous 35 billion Command R model.

Both models are fine-tuned for both tool use and RAG. The commercial API has features to expose this functionality, including a web-search connector which lets the model run web searches as part of answering the prompt and return documents and citations as part of the JSON response.

I released a new plugin for my LLM command line tool this morning adding support for the Command R models.

In addition to the two models it also adds a custom command for running prompts with web search enabled and listing the referenced documents. # 4th April 2024, 5:38 pm

The cost of AI reasoning over time (via) Karina Nguyen from Anthropic provides a fascinating visualization illustrating the cost of different levels of LLM over the past few years, plotting their cost-per-token against their scores on the MMLU benchmark.

Claude 3 Haiku currently occupies the lowest cost to score ratio, over on the lower right hand side of the chart. # 4th April 2024, 12:51 pm

LLMs are like a trained circus bear that can make you porridge in your kitchen. It’s a miracle that it’s able to do it at all, but watch out because no matter how well they can act like a human on some tasks, they’re still a wild animal. They might ransack your kitchen, and they could kill you, accidentally or intentionally!

Alex Komoroske # 2nd April 2024, 3:19 pm

Diving Deeper into AI Package Hallucinations. Bar Lanyado noticed that LLMs frequently hallucinate the names of packages that don’t exist in their answers to coding questions, which can be exploited as a supply chain attack.

He gathered 2,500 questions across Python, Node.js, Go, .NET and Ruby and ran them through a number of different LLMs, taking notes of any hallucinated packages and if any of those hallucinations were repeated.

One repeat example was “pip install huggingface-cli” (the correct package is “huggingface[cli]”). Bar then published a harmless package under that name in January, and observebd 30,000 downloads of that package in the three months that followed. # 1st April 2024, 10:51 pm