Simon Willison’s Weblog

Subscribe

48 items tagged “anthropic”

Anthropic are the AI research company behind Claude.

2024

OpenAI and Anthropic focused on building models and not worrying about products. For example, it took 6 months for OpenAI to bother to release a ChatGPT iOS app and 8 months for an Android app!

Google and Microsoft shoved AI into everything in a panicked race, without thinking about which products would actually benefit from AI and how they should be integrated.

Both groups of companies forgot the “make something people want” mantra. The generality of LLMs allowed developers to fool themselves into thinking that they were exempt from the need to find a product-market fit, as if prompting is a replacement for carefully designed products or features. [...]

But things are changing. OpenAI and Anthropic seem to be transitioning from research labs focused on a speculative future to something resembling regular product companies. If you take all the human-interest elements out of the OpenAI boardroom drama, it was fundamentally about the company's shift from creating gods to building products.

Arvind Narayanan

# 16th July 2024, 4:06 pm / google, microsoft, ai, openai, generative-ai, llms, anthropic

We've doubled the max output token limit for Claude 3.5 Sonnet from 4096 to 8192 in the Anthropic API.

Just add the header "anthropic-beta": "max-tokens-3-5-sonnet-2024-07-15" to your API calls.

Alex Albert

# 15th July 2024, 9:33 pm / ai, generative-ai, llms, anthropic, claude, alex-albert

Yeah, unfortunately vision prompting has been a tough nut to crack. We've found it's very challenging to improve Claude's actual "vision" through just text prompts, but we can of course improve its reasoning and thought process once it extracts info from an image.

In general, I think vision is still in its early days, although 3.5 Sonnet is noticeably better than older models.

Alex Albert (Anthropic)

# 10th July 2024, 6:56 pm / ai, prompt-engineering, generative-ai, llms, anthropic, claude, vision-llms, alex-albert

Anthropic cookbook: multimodal. I'm currently on the lookout for high quality sources of information about vision LLMs, including prompting tricks for getting the most out of them.

This set of Jupyter notebooks from Anthropic (published four months ago to accompany the original Claude 3 models) is the best I've found so far. Best practices for using vision with Claude includes advice on multi-shot prompting with example, plus this interesting think step-by-step style prompt for improving Claude's ability to count the dogs in an image:

You have perfect vision and pay great attention to detail which makes you an expert at counting objects in images. How many dogs are in this picture? Before providing the answer in <answer> tags, think step by step in <thinking> tags and analyze every part of the image.

# 10th July 2024, 6:38 pm / jupyter, vision-llms, anthropic, claude, generative-ai, ai, llms

Claude: You can now publish, share, and remix artifacts. Artifacts is the feature Anthropic released a few weeks ago to accompany Claude 3.5 Sonnet, allowing Claude to create interactive HTML+JavaScript tools in response to prompts.

This morning they added the ability to make those artifacts public and share links to them, which makes them even more useful!

Here's my box shadow playground from the other day, and an example page I requested demonstrating the Milligram CSS framework - Artifacts can load most code that is available via cdnjs so they're great for quickly trying out new libraries.

# 9th July 2024, 10:25 pm / anthropic, claude, generative-ai, ai, llms, ai-assisted-programming

Box shadow CSS generator (via) Another example of a tiny personal tool I built using Claude 3.5 Sonnet and artifacts. In this case my prompt was:

CSS for a slight box shadow, build me a tool that helps me twiddle settings and preview them and copy and paste out the CSS

I changed my mind half way through typing the prompt and asked it for a custom tool, and it built me this!

Box shadow CSS generator. Shows a preview, then provides sliders to set Horizontal Offset, Vertical Offset, Blur Radius,  Spread Radius,  Color and Opacity - plus the generated CSS and a Copy to Clipboard button

Here's the full transcript - in a follow-up prompt I asked for help deploying it and it rewrote the tool to use <script type="text/babel"> and the babel-standalone library to add React JSX support directly in the browser - a bit of a hefty dependency (387KB compressed / 2.79MB total) but I think acceptable for this kind of one-off tool.

Being able to knock out tiny custom tools like this on a whim is a really interesting new capability. It's also a lot of fun!

# 8th July 2024, 7:30 pm / css, anthropic, claude, generative-ai, projects, ai, llms, ai-assisted-programming

Compare PDFs. Inspired by this thread on Hacker News about the C++ diff-pdf tool I decided to see what it would take to produce a web-based PDF diff visualization tool using Claude 3.5 Sonnet.

It took two prompts:

Build a tool where I can drag and drop on two PDF files and it uses PDF.js to turn each of their pages into canvas elements and then displays those pages side by side with a third image that highlights any differences between them, if any differences exist

That give me a React app that didn't quite work, so I followed-up with this:

rewrite that code to not use React at all

Which gave me a working tool! You can see the full Claude transcript in this Gist. Here's a screenshot of the tool in action:

Screenshot showing a page from a PDF, then that page with a red squiggle on it, then just the squiggle

Being able to knock out little custom interactive web tools like this in a couple of minutes is so much fun.

# 2nd July 2024, 7:54 pm / projects, pdf, claude, llms, anthropic, ai-assisted-programming, tools

Claude Projects. New Claude feature, quietly launched this morning for Claude Pro users. Looks like their version of OpenAI's GPTs, designed to take advantage of Claude's 200,000 token context limit:

You can upload relevant documents, text, code, or other files to a project’s knowledge base, which Claude will use to better understand the context and background for your individual chats within that project. Each project includes a 200K context window, the equivalent of a 500-page book, so users can add all of the insights needed to enhance Claude’s effectiveness.

You can also set custom instructions, which presumably get added to the system prompt.

I tried dropping in all of Datasette's existing documentation - 693KB of .rst files (which I had to rename to .rst.txt for it to let me upload them) - and it worked and showed "63% of knowledge size used".

This is a slightly different approach from OpenAI, where the GPT knowledge feature supports attaching up to 20 files each with up to 2 million tokens, which get ingested into a vector database (likely Qdrant) and used for RAG.

It looks like Claude instead handle a smaller amount of extra knowledge but paste the whole thing into the context window, which avoids some of the weirdness around semantic search chunking but greatly limits the size of the data.

My big frustration with the knowledge feature in GPTs remains the lack of documentation on what it's actually doing under the hood. Without that it's difficult to make informed decisions about how to use it - with Claude Projects I can at least develop a robust understanding of what the tool is doing for me and how best to put it to work.

No equivalent (yet) for the GPT actions feature where you can grant GPTs the ability to make API calls out to external systems.

# 25th June 2024, 4:03 pm / anthropic, claude, openai, ai, llms, chatgpt, prompt-engineering, rag

Building search-based RAG using Claude, Datasette and Val Town

Visit Building search-based RAG using Claude, Datasette and Val Town

Retrieval Augmented Generation (RAG) is a technique for adding extra “knowledge” to systems built on LLMs, allowing them to answer questions against custom information not included in their training data. A common way to implement this is to take a question from a user, translate that into a set of search queries, run those against a search engine and then feed the results back into the LLM to generate an answer.

[... 3,372 words]

llm-claude-3 0.4. LLM plugin release adding support for the new Claude 3.5 Sonnet model:

pipx install llm
llm install -U llm-claude-3
llm keys set claude
# paste AP| key here
llm -m claude-3.5-sonnet \
  'a joke about a pelican and a walrus having lunch'

# 20th June 2024, 11:04 pm / llm, anthropic, claude, generative-ai, projects, ai, llms

One of the core constitutional principles that guides our AI model development is privacy. We do not train our generative models on user-submitted data unless a user gives us explicit permission to do so. To date we have not used any customer or user-submitted data to train our generative models.

Anthropic

# 20th June 2024, 7:19 pm / ethics, privacy, ai, llms, anthropic

Claude 3.5 Sonnet. Anthropic released a new model this morning, and I think it's likely now the single best available LLM. Claude 3 Opus was already mostly on-par with GPT-4o, and the new 3.5 Sonnet scores higher than Opus on almost all of Anthropic's internal evals.

It's also twice the speed and one fifth of the price of Opus (it's the same price as the previous Claude 3 Sonnet). To compare:

  • gpt-4o: $5/million input tokens and $15/million output
  • Claude 3.5 Sonnet: $3/million input, $15/million output
  • Claude 3 Opus: $15/million input, $75/million output

Similar to Claude 3 Haiku then, which both under-cuts and out-performs OpenAI's GPT-3.5 model.

In addition to the new model, Anthropic also added a "artifacts" feature to their Claude web interface. The most exciting part of this is that any of the Claude models can now build and then render web pages and SPAs, directly in the Claude interface.

This means you can prompt them to e.g. "Build me a web app that teaches me about mandelbrot fractals, with interactive widgets" and they'll do exactly that - I tried that prompt on Claude 3.5 Sonnet earlier and the results were spectacular (video demo).

An unsurprising note at the end of the post:

To complete the Claude 3.5 model family, we’ll be releasing Claude 3.5 Haiku and Claude 3.5 Opus later this year.

If the pricing stays consistent with Claude 3, Claude 3.5 Haiku is going to be a very exciting model indeed.

# 20th June 2024, 6:01 pm / anthropic, claude, generative-ai, ai, llms, vision-llms

Claude: Building evals and test cases. More documentation updates from Anthropic: this section on writing evals for Claude is new today and includes Python code examples for a number of different evaluation techniques.

Included are several examples of the LLM-as-judge pattern, plus an example using cosine similarity and another that uses the new-to-me Rouge Python library that implements the ROUGE metric for evaluating the quality of summarized text.

# 18th June 2024, 4:28 pm / prompt-engineering, anthropic, claude, generative-ai, ai, llms

Anthropic release notes (via) Anthropic have started publishing release notes! Currently available for their API and their apps (mobile and web).

What I'd really like to see are release notes for the models themselves, though as far as I can tell there haven't been any updates to those since the Claude 3 models were first released (the Haiku model name in the API is still claude-3-haiku-20240307 and Anthropic say they'll change that identifier after any updates to the model).

# 18th June 2024, 4:25 pm / anthropic, claude, generative-ai, ai, llms, alex-albert

Language models on the command-line

Visit Language models on the command-line

I gave a talk about accessing Large Language Models from the command-line last week as part of the Mastering LLMs: A Conference For Developers & Data Scientists six week long online conference. The talk focused on my LLM Python command-line utility and ways you can use it (and its plugins) to explore LLMs and use them for useful tasks.

[... 4,992 words]

Claude’s Character (via) There's so much interesting stuff in this article from Anthropic on how they defined the personality for their Claude 3 model. In addition to the technical details there are some very interesting thoughts on the complex challenge of designing a "personality" for an LLM in the first place.

Claude 3 was the first model where we added "character training" to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.

But what other traits should it have? This is a very difficult set of decisions to make! The most obvious approaches are all flawed in different ways:

Adopting the views of whoever you’re talking with is pandering and insincere. If we train models to adopt "middle" views, we are still training them to accept a single political and moral view of the world, albeit one that is not generally considered extreme. Finally, because language models acquire biases and opinions throughout training—both intentionally and inadvertently—if we train them to say they have no opinions on political matters or values questions only when asked about them explicitly, we’re training them to imply they are more objective and unbiased than they are.

The training process itself is particularly fascinating. The approach they used focuses on synthetic data, and effectively results in the model training itself:

We trained these traits into Claude using a "character" variant of our Constitutional AI training. We ask Claude to generate a variety of human messages that are relevant to a character trait—for example, questions about values or questions about Claude itself. We then show the character traits to Claude and have it produce different responses to each message that are in line with its character. Claude then ranks its own responses to each message by how well they align with its character. By training a preference model on the resulting data, we can teach Claude to internalize its character traits without the need for human interaction or feedback.

There's still a lot of human intervention required, but significantly less than more labour-intensive patterns such as Reinforcement Learning from Human Feedback (RLHF):

Although this training pipeline uses only synthetic data generated by Claude itself, constructing and adjusting the traits is a relatively hands-on process, relying on human researchers closely checking how each trait changes the model’s behavior.

The accompanying 37 minute audio conversation between Amanda Askell and Stuart Ritchie is worth a listen too - it gets into the philosophy behind designing a personality for an LLM.

# 8th June 2024, 9:41 pm / anthropic, claude, generative-ai, ai, llms

Golden Gate Claude. This is absurdly fun and weird. Anthropic's recent LLM interpretability research gave them the ability to locate features within the opaque blob of their Sonnet model and boost the weight of those features during inference.

For a limited time only they're serving a "Golden Gate Claude" model which has the feature for the Golden Gate Bridge boosted. No matter what question you ask it the Golden Gate Bridge is likely to be involved in the answer in some way. Click the little bridge icon in the Claude UI to give it a go.

I asked for names for a pet pelican and the first one it offered was this:

Golden Gate - This iconic bridge name would be a fitting moniker for the pelican with its striking orange color and beautiful suspension cables.

And from a recipe for chocolate covered pretzels:

Gently wipe any fog away and pour the warm chocolate mixture over the bridge/brick combination. Allow to air dry, and the bridge will remain accessible for pedestrians to walk along it.

UPDATE: I think the experimental model is no longer available, approximately 24 hours after release. We'll miss you, Golden Gate Claude.

# 24th May 2024, 8:17 am / anthropic, claude, generative-ai, ai, llms, interpretability

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (via) Big advances in the field of LLM interpretability from Anthropic, who managed to extract millions of understandable features from their production Claude 3 Sonnet model (the mid-point between the inexpensive Haiku and the GPT-4-class Opus).

Some delightful snippets in here such as this one:

We also find a variety of features related to sycophancy, such as an empathy / “yeah, me too” feature 34M/19922975, a sycophantic praise feature 1M/847723, and a sarcastic praise feature 34M/19415708.

# 21st May 2024, 6:25 pm / anthropic, claude, generative-ai, ai, llms, interpretability

Introducing the Claude Team plan and iOS app. The iOS app seems nice, and provides free but heavily rate-limited access to Sonnet (the middle-sized Claude 3 model)—I ran two prompts just now and it told me I could have 3 more, resetting in five hours.

For $20/month you get access to Opus and 5x the capacity—which feels a little ungenerous to me.

The new $30/user/month team plan provides higher rate limits but is a minimum of five seats.

# 1st May 2024, 4:06 pm / anthropic, claude

timpaul/form-extractor-prototype (via) Tim Paul, Head of Interaction Design at the UK’s Government Digital Service, published this brilliant prototype built on top of Claude 3 Opus.

The video shows what it can do. Give it an image of a form and it will extract the form fields and use them to create a GDS-style multi-page interactive form, using their GOV.UK Forms design system and govuk-frontend npm package.

It works for both hand-drawn napkin illustrations and images of existing paper forms.

The bulk of the prompting logic is the schema definition in data/extract-form-questions.json

I’m always excited to see applications built on LLMs that go beyond the chatbot UI. This is a great example of exactly that.

# 22nd April 2024, 10:01 pm / forms, anthropic, claude, generative-ai, ai, llms

mistralai/mistral-common. New from Mistral: mistral-common, an open source Python library providing "a set of tools to help you work with Mistral models".

So far that means a tokenizer! This is similar to OpenAI's tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use - useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.

Mistral's library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI's APIs you're currently left guessing how many tokens are taken up by these advanced features.

Anthropic haven't published any form of tokenizer at all - it's the feature I'd most like to see from them next.

Here's how to explore the vocabulary of the tokenizer:

MistralTokenizer.from_model(
    "open-mixtral-8x22b"
).instruct_tokenizer.tokenizer.vocab()[:12]

['<unk>', '<s>', '</s>', '[INST]', '[/INST]', '[TOOL_CALLS]', '[AVAILABLE_TOOLS]', '[/AVAILABLE_TOOLS]', '[TOOL_RESULTS]', '[/TOOL_RESULTS]']

# 18th April 2024, 12:39 am / mistral, anthropic, python, generative-ai, openai, ai, llms, prompt-engineering, rag

In mid-March, we added this line to our system prompt to prevent Claude from thinking it can open URLs:

"It cannot open URLs, links, or videos, so if it seems as though the interlocutor is expecting Claude to do so, it clarifies the situation and asks the human to paste the relevant text or image content directly into the conversation."

Alex Albert (Anthropic)

# 18th April 2024, 12:22 am / ai, prompt-engineering, generative-ai, llms, anthropic, claude, alex-albert

[On complaints about Claude 3 reduction in quality since launch] The model is stored in a static file and loaded, continuously, across 10s of thousands of identical servers each of which serve each instance of the Claude model. The model file never changes and is immutable once loaded; every shard is loading the same model file running exactly the same software. We haven’t changed the temperature either. We don’t see anywhere where drift could happen. The files are exactly the same as at launch and loaded each time from a frozen pristine copy.

Jason D. Clinton, Anthropic

# 15th April 2024, 1:27 am / ai, generative-ai, llms, anthropic, claude

Building files-to-prompt entirely using Claude 3 Opus

Visit Building files-to-prompt entirely using Claude 3 Opus

files-to-prompt is a new tool I built to help me pipe several files at once into prompts to LLMs such as Claude and GPT-4.

[... 3,235 words]

The lifecycle of a code AI completion (via) Philipp Spiess provides a deep dive into how Sourcegraph's Cody code completion assistant works. Lots of fascinating details in here:

"One interesting learning was that if a user is willing to wait longer for a multi-line request, it usually is worth it to increase latency slightly in favor of quality. For our production setup this means we use a more complex language model for multi-line completions than we do for single-line completions."

This article is from October 2023 and talks about Claude Instant. The code for Cody is open source so I checked to see if they have switched to Haiku yet and found a commit from March 25th that adds Haiku as an A/B test.

# 7th April 2024, 7:37 pm / anthropic, claude, generative-ai, ai, llms, ai-assisted-programming

The cost of AI reasoning over time (via) Karina Nguyen from Anthropic provides a fascinating visualization illustrating the cost of different levels of LLM over the past few years, plotting their cost-per-token against their scores on the MMLU benchmark.

Claude 3 Haiku currently occupies the lowest cost to score ratio, over on the lower right hand side of the chart.

# 4th April 2024, 12:51 pm / anthropic, claude, generative-ai, ai, llms

“The king is dead”—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time. I’m quoted in this piece by Benj Edwards for Ars Technica:

“For the first time, the best available models—Opus for advanced tasks, Haiku for cost and efficiency—are from a vendor that isn’t OpenAI. That’s reassuring—we all benefit from a diversity of top vendors in this space. But GPT-4 is over a year old at this point, and it took that year for anyone else to catch up.”

# 27th March 2024, 4:58 pm / anthropic, claude, generative-ai, openai, gpt4, ai, llms, benj-edwards

Claude and ChatGPT for ad-hoc sidequests

Visit Claude and ChatGPT for ad-hoc sidequests

Here is a short, illustrative example of one of the ways in which I use Claude and ChatGPT on a daily basis.

[... 1,754 words]

llm-claude-3 0.3. Anthropic released Claude 3 Haiku today, their least expensive model: $0.25/million tokens of input, $1.25/million of output (GPT-3.5 Turbo is $0.50/$1.50). Unlike GPT-3.5 Haiku also supports image inputs.

I just released a minor update to my llm-claude-3 LLM plugin adding support for the new model.

# 13th March 2024, 9:18 pm / llm, anthropic, claude, generative-ai, projects, ai, llms

The GPT-4 barrier has finally been broken

Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Almost everyone investing serious time exploring LLMs agreed that it was the most capable default model for the majority of tasks—and had been for more than a year.

[... 717 words]