<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm-release</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm-release.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-05T23:56:09+00:00</updated><author><name>Simon Willison</name></author><entry><title>Introducing GPT‑5.4</title><link href="https://simonwillison.net/2026/Mar/5/introducing-gpt54/#atom-tag" rel="alternate"/><published>2026-03-05T23:56:09+00:00</published><updated>2026-03-05T23:56:09+00:00</updated><id>https://simonwillison.net/2026/Mar/5/introducing-gpt54/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-4/"&gt;Introducing GPT‑5.4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two new API models: &lt;a href="https://developers.openai.com/api/docs/models/gpt-5.4"&gt;gpt-5.4&lt;/a&gt; and &lt;a href="https://developers.openai.com/api/docs/models/gpt-5.4-pro"&gt;gpt-5.4-pro&lt;/a&gt;, also available in ChatGPT and Codex CLI. August 31st 2025 knowledge cutoff, 1 million token context window. Priced &lt;a href="https://www.llm-prices.com/#sel=gpt-5.2%2Cgpt-5.2-pro%2Cgpt-5.4%2Cgpt-5.4-272k%2Cgpt-5.4-pro%2Cgpt-5.4-pro-272k"&gt;slightly higher&lt;/a&gt; than the GPT-5.2 family, with a bump in price for both models if you go above 272,000 tokens.&lt;/p&gt;
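&lt;p&gt;Both models should be reachable from the terminal via &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; - a sketch, assuming the new IDs are registered with the default OpenAI plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm keys set openai # paste in key
# Assumes the gpt-5.4 ID is recognized - check "llm models" first
llm -m gpt-5.4 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;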
&lt;p&gt;5.4 beats coding specialist GPT-5.3-Codex on all of the relevant benchmarks. I wonder if we'll get a 5.4 Codex or if that model line has now been merged into main?&lt;/p&gt;
&lt;p&gt;Given Claude's recent focus on business applications, it's interesting to see OpenAI highlight this in their announcement of GPT-5.4:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We put a particular focus on improving GPT‑5.4’s ability to create and edit spreadsheets, presentations, and documents. On an internal benchmark of spreadsheet modeling tasks that a junior investment banking analyst might do, GPT‑5.4 achieves a mean score of &lt;strong&gt;87.3%&lt;/strong&gt;, compared to &lt;strong&gt;68.4%&lt;/strong&gt; for GPT‑5.2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a pelican on a bicycle &lt;a href="https://gist.github.com/simonw/7fe75b8dab6ec9c2b6bd8fd1a5a640a6"&gt;drawn by GPT-5.4&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="alt text by GPT-5.4: Illustration of a cartoon pelican riding a bicycle, with a light gray background, dark blue bike frame and wheels, orange beak and legs, and motion lines suggesting movement." src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/688c0d5d93a5539b93d3f549a0b733ad"&gt;here's one&lt;/a&gt; by GPT-5.4 Pro, which took 4m45s and cost me &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=8593&amp;amp;sel=gpt-5.4-pro"&gt;$1.55&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by GPT-5.4: Illustration of a cartoon pelican riding a blue bicycle on pale green grass against a light gray background, with a large orange beak, gray-and-white body, and orange legs posed on the pedals." src="https://static.simonwillison.net/static/2026/gpt-5.4-pro-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="generative-ai"/><category term="openai"/><category term="ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Gemini 3.1 Flash-Lite</title><link href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag" rel="alternate"/><published>2026-03-03T21:53:54+00:00</published><updated>2026-03-03T21:53:54+00:00</updated><id>https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million input tokens and $1.50/million output tokens, this is 1/8th the price of Gemini 3.1 Pro.&lt;/p&gt;
&lt;p&gt;It supports four different thinking levels, so I had it output &lt;a href="https://gist.github.com/simonw/99fb28dc11d0c24137d4ff8a33978a9e"&gt;four different pelicans&lt;/a&gt;:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 8px;
    margin: 0 auto;
  "&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-minimal.png" alt="A minimalist vector-style illustration of a stylized bird riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;minimal&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-low.png" alt="A minimalist graphic of a light blue round bird with a single black dot for an eye, wearing a yellow backpack and riding a black bicycle on a flat grey line." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;low&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-medium.png" alt="A minimalist digital illustration of a light blue bird wearing a yellow backpack while riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;medium&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-high.png" alt="A minimal, stylized line drawing of a bird-like creature with a yellow beak riding a bicycle made of simple geometric lines." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;high&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;
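&lt;p&gt;Here's roughly how I'd run those four via my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; plugin - a sketch, assuming the new model gets a preview ID following the usual pattern and keeps the &lt;code&gt;--thinking-level&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
# Hypothetical model ID - check "llm models" for the registered name
llm -m gemini-3.1-flash-lite-preview --thinking-level minimal \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;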


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/></entry><entry><title>Gemini 3.1 Pro</title><link href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag" rel="alternate"/><published>2026-02-19T17:58:37+00:00</published><updated>2026-02-19T17:58:37+00:00</updated><id>https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The first in the Gemini 3.1 series, priced the same as Gemini 3 Pro ($2/million input, $12/million output under 200,000 tokens, $4/$18 from 200,000 to 1,000,000). That's less than half the price of Claude Opus 4.6, with very similar benchmark scores.&lt;/p&gt;
&lt;p&gt;They boast about its improved SVG animation performance compared to Gemini 3 Pro in the announcement!&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221ugF9fBfLGxnNoe8_rLlluzo9NSPJDWuF%22%5D,%22action%22:%22open%22,%22userId%22:%22106366615678321494423%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing"&gt;in Google AI Studio&lt;/a&gt; and it thought for 323.9 seconds (&lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#thinking-trace"&gt;thinking trace here&lt;/a&gt;) before producing this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat-style illustration of a pelican wearing a blue and white baseball cap, riding a red bicycle with yellow-rimmed wheels along a road. The pelican has a large orange bill and a green scarf. A small fish peeks out of a brown basket on the handlebars. The background features a light blue sky with a yellow sun, white clouds, and green hills." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;It's good to see the legs clearly depicted on both sides of the frame (should &lt;a href="https://twitter.com/elonmusk/status/2023833496804839808"&gt;satisfy Elon&lt;/a&gt;), the fish in the basket is a nice touch and I appreciated this comment in &lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#response"&gt;the SVG code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Black Flight Feathers on Wing Tip --&amp;gt;
&amp;lt;path d="M 420 175 C 440 182, 460 187, 470 190 C 450 210, 430 208, 410 198 Z" fill="#374151" /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm-gemini/issues/121"&gt;added&lt;/a&gt; the two new model IDs &lt;code&gt;gemini-3.1-pro-preview&lt;/code&gt; and &lt;code&gt;gemini-3.1-pro-preview-customtools&lt;/code&gt; to my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini plugin&lt;/a&gt; for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;. That "custom tools" one is &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview#gemini-31-pro-preview-customtools"&gt;described here&lt;/a&gt; - apparently it may provide better tool performance than the default model in some situations.&lt;/p&gt;
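&lt;p&gt;With the updated plugin installed, trying the new model looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3.1-pro-preview "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;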
&lt;p&gt;The model appears to be &lt;em&gt;incredibly&lt;/em&gt; slow right now - it took 104s to respond to a simple "hi", and a few of my other tests ran into "Error: This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later." or "Error: Deadline expired before operation could complete" errors. I'm assuming that's just teething problems on launch day.&lt;/p&gt;
&lt;p&gt;It sounds like last week's &lt;a href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"&gt;Deep Think release&lt;/a&gt; was our first exposure to the 3.1 family:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: In &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt; last November I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Google's Gemini Lead Jeff Dean &lt;a href="https://x.com/JeffDean/status/2024525132266688757"&gt;tweeted this video&lt;/a&gt; featuring an animated pelican riding a bicycle, plus a frog on a penny-farthing, a giraffe driving a tiny car, an ostrich on roller skates, a turtle kickflipping a skateboard, and a dachshund driving a stretch limousine.&lt;/p&gt;
&lt;video style="margin-bottom: 1em; max-width: 100%" poster="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.jpg" muted controls preload="none"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.mp4" type="video/mp4"&gt;
&lt;/video&gt;

&lt;p&gt;I've been saying for a while that I wish AI labs would highlight things that their new models can do that their older models could not, so top marks to the Gemini team for this video.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: I used &lt;code&gt;llm-gemini&lt;/code&gt; to run my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more detailed Pelican prompt&lt;/a&gt;, with &lt;a href="https://gist.github.com/simonw/a3bdd4ec9476ba9e9ba7aa61b46d8296"&gt;this result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flat-style illustration of a brown pelican riding a teal bicycle with dark blue-rimmed wheels against a plain white background. Unlike the previous image's white cartoon pelican, this pelican has realistic brown plumage with detailed feather patterns, a dark maroon head, yellow eye, and a large pink-tinged pouch bill. The bicycle is a simpler design without a basket, and the scene lacks the colorful background elements like the sun, clouds, road, hills, cap, and scarf from the first illustration, giving it a more minimalist feel." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican-2.png" /&gt;&lt;/p&gt;
&lt;p&gt;From the SVG comments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Pouch Gradient (Breeding Plumage: Red to Olive/Green) --&amp;gt;
...
&amp;lt;!-- Neck Gradient (Breeding Plumage: Chestnut Nape, White/Yellow Front) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/><category term="svg"/></entry><entry><title>Introducing Claude Sonnet 4.6</title><link href="https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag" rel="alternate"/><published>2026-02-17T23:58:58+00:00</published><updated>2026-02-17T23:58:58+00:00</updated><id>https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6"&gt;Introducing Claude Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;November's Opus 4.5&lt;/a&gt; while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's &lt;a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf"&gt;the system card PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.&lt;/p&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt; with support for both Sonnet 4.6 and Opus 4.6. Claude Code &lt;a href="https://github.com/simonw/llm-anthropic/pull/65"&gt;did most of the work&lt;/a&gt; - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"&gt;in Anthropic's migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c"&gt;what I got&lt;/a&gt; from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way." src="https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png" /&gt;&lt;/p&gt;
&lt;p&gt;The SVG comments include:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hat (fun accessory) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!&lt;/p&gt;
&lt;p&gt;For comparison, here's the pelican Opus 4.5 drew me &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;in November&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's Anthropic's current best pelican, drawn by Opus 4.6 &lt;a href="https://simonwillison.net/2026/Feb/5/two-new-models/"&gt;on February 5th&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47050488"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="claude-code"/></entry><entry><title>Qwen3.5: Towards Native Multimodal Agents</title><link href="https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag" rel="alternate"/><published>2026-02-17T04:30:57+00:00</published><updated>2026-02-17T04:30:57+00:00</updated><id>https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3.5"&gt;Qwen3.5: Towards Native Multimodal Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal, accepting image inputs.&lt;/p&gt;
&lt;p&gt;The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B"&gt;807GB on Hugging Face&lt;/a&gt;, and Unsloth have a &lt;a href="https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF"&gt;collection of smaller GGUFs&lt;/a&gt; ranging in size from 94.2GB 1-bit to 462GB Q8_K_XL.&lt;/p&gt;
&lt;p&gt;I got this &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican&lt;/a&gt; from the &lt;a href="https://openrouter.ai/qwen/qwen3.5-397b-a17b"&gt;OpenRouter hosted model&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/625546cf6b371f9c0040e64492943b82"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is quite good although the neck lacks an outline for some reason. Bicycle is very basic with an incomplete frame" src="https://static.simonwillison.net/static/2026/qwen3.5-397b-a17b.png" /&gt;&lt;/p&gt;
&lt;p&gt;The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and the naming is a little confusing. Qwen researcher &lt;a href="https://twitter.com/JustinLin610/status/2023340126479569140"&gt;Junyang Lin says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9507dd47483f78dc1195117735273e20"&gt;its pelican&lt;/a&gt;, which is similar in quality to the open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Similar quality pelican. The bicycle is taller and has a better frame shape. They are visually quite similar." src="https://static.simonwillison.net/static/2026/qwen3.5-plus-02-15.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="ai"/><category term="qwen"/><category term="llms"/><category term="ai-in-china"/><category term="llm-release"/><category term="generative-ai"/><category term="openrouter"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Introducing GPT‑5.3‑Codex‑Spark</title><link href="https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag" rel="alternate"/><published>2026-02-12T21:16:07+00:00</published><updated>2026-02-12T21:16:07+00:00</updated><id>https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/"&gt;Introducing GPT‑5.3‑Codex‑Spark&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced a partnership with Cerebras &lt;a href="https://openai.com/index/cerebras-partnership/"&gt;on January 14th&lt;/a&gt;. Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex".&lt;/p&gt;
&lt;p&gt;Despite being named GPT-5.3-Codex-Spark it's not purely an accelerated alternative to GPT-5.3-Codex - the blog post calls it "a smaller version of GPT‑5.3-Codex" and clarifies that "at launch, Codex-Spark has a 128k context window and is text-only."&lt;/p&gt;
&lt;p&gt;I had some preview access to this model and I can confirm that it's significantly faster than their other models.&lt;/p&gt;
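&lt;p&gt;If you have Codex CLI access you should be able to select it with the &lt;code&gt;--model&lt;/code&gt; option - a sketch, assuming the slug matches the announced name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical model slug - check the Codex CLI model picker for the real ID
codex exec -m gpt-5.3-codex-spark "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;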
&lt;p&gt;Here's what that speed looks like running in Codex CLI:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;That was the "Generate an SVG of a pelican riding a bicycle" prompt - here's the rendered result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of an orange duck merged with a bicycle, where the duck's body forms the seat and frame area while its head extends forward over the handlebars, set against a simple light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Compare that to the speed of regular GPT-5.3 Codex medium:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Significantly slower, but the pelican is a lot better:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of a white pelican riding a dark blue bicycle at speed, with motion lines behind it, its long orange beak streaming back in the wind, set against a light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;What's interesting about this model isn't the quality though, it's the &lt;em&gt;speed&lt;/em&gt;. When a model responds this fast you can stay in flow state and iterate with the model much more productively.&lt;/p&gt;
&lt;p&gt;I showed a demo of Cerebras running Llama 3.1 70B at 2,000 tokens/second on Val Town &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;back in October 2024&lt;/a&gt;. OpenAI claim 1,000 tokens/second for their new model, and I expect it will prove to be a ferociously useful partner for hands-on iterative coding sessions.&lt;/p&gt;
&lt;p&gt;It's not yet clear what the pricing will look like for this new model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-performance"/><category term="openai"/><category term="cerebras"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="codex-cli"/><category term="generative-ai"/></entry><entry><title>Gemini 3 Deep Think</title><link href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag" rel="alternate"/><published>2026-02-12T18:12:17+00:00</published><updated>2026-02-12T18:12:17+00:00</updated><id>https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/"&gt;Gemini 3 Deep Think&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from Google. They say it's "built to push the frontier of intelligence and solve modern challenges across science, research, and engineering".&lt;/p&gt;
&lt;p&gt;It drew me a &lt;em&gt;really good&lt;/em&gt; &lt;a href="https://gist.github.com/simonw/7e317ebb5cf8e75b2fcec4d0694a8199"&gt;SVG of a pelican riding a bicycle&lt;/a&gt;! I think this is the best one I've seen so far - here's &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my previous collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This alt text also generated by Gemini 3 Deep Think: A highly detailed, colorful, flat vector illustration with thick dark blue outlines depicting a stylized white pelican riding a bright cyan blue bicycle from left to right across a sandy beige beach with white speed lines indicating forward motion. The pelican features a light blue eye, a pink cheek blush, a massive bill with a vertical gradient from yellow to orange, a backward magenta cap with a cyan brim and a small yellow top button, and a matching magenta scarf blowing backward in the wind. Its white wing, accented with a grey mid-section and dark blue feather tips, reaches forward to grip the handlebars, while its long tan leg and orange foot press down on an orange pedal. Attached to the front handlebars is a white wire basket carrying a bright blue cartoon fish that is pointing upwards and forwards. The bicycle itself has a cyan frame, dark blue tires, striking neon pink inner rims, cyan spokes, a white front chainring, and a dark blue chain. Behind the pelican, a grey trapezoidal pier extends from the sand toward a horizontal band of deep blue ocean water detailed with light cyan wavy lines. A massive, solid yellow-orange semi-circle sun sits on the horizon line, setting directly behind the bicycle frame. The background sky is a smooth vertical gradient transitioning from soft pink at the top to warm golden-yellow at the horizon, decorated with stylized pale peach fluffy clouds, thin white horizontal wind streaks, twinkling four-pointed white stars, and small brown v-shaped silhouettes of distant flying birds." src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(And since it's an FAQ, here's my answer to &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Since it did so well on my basic &lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt; I decided to try the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more challenging version&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/154c0cc7b4daed579f6a5e616250ecc8"&gt;what I got&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Also described by Gemini 3 Deep Think: A highly detailed, vibrant, and stylized vector illustration of a whimsical bird resembling a mix between a pelican and a frigatebird enthusiastically riding a bright cyan bicycle from left to right across a flat tan and brown surface. The bird leans horizontally over the frame in an aerodynamic racing posture, with thin, dark brown wing-like arms reaching forward to grip the silver handlebars and a single thick brown leg, patterned with white V-shapes, stretching down to press on a black pedal. The bird's most prominent and striking feature is an enormous, vividly bright red, inflated throat pouch hanging beneath a long, straight grey upper beak that ends in a small orange hook. Its head is mostly white with a small pink patch surrounding the eye, a dark brown stripe running down the back of its neck, and a distinctive curly pale yellow crest on the very top. The bird's round, dark brown body shares the same repeating white V-shaped feather pattern as its leg and is accented by a folded wing resting on its side, made up of cleanly layered light blue and grey feathers. A tail composed of four stiff, straight dark brown feathers extends directly backward. Thin white horizontal speed lines trail behind the back wheel and the bird's tail, emphasizing swift forward motion. The bicycle features a classic diamond frame, large wheels with thin black tires, grey rims, and detailed silver spokes, along with a clearly visible front chainring, silver chain, and rear cog. The whimsical scene is set against a clear light blue sky featuring two small, fluffy white clouds on the left and a large, pale yellow sun in the upper right corner that radiates soft, concentric, semi-transparent pastel green and yellow halos. A solid, darker brown shadow is cast directly beneath the bicycle's wheels on the minimalist two-toned brown ground." src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-complex-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46991240"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/></entry><entry><title>GLM-5: From Vibe Coding to Agentic Engineering</title><link href="https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag" rel="alternate"/><published>2026-02-11T18:56:14+00:00</published><updated>2026-02-11T18:56:14+00:00</updated><id>https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-5"&gt;GLM-5: From Vibe Coding to Agentic Engineering&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a &lt;em&gt;huge&lt;/em&gt; new MIT-licensed model: 744B parameters and &lt;a href="https://huggingface.co/zai-org/GLM-5"&gt;1.51TB on Hugging Face&lt;/a&gt; - twice the size of &lt;a href="https://huggingface.co/zai-org/GLM-4.7"&gt;GLM-4.7&lt;/a&gt;, which was 368B and 717GB (4.5 and 4.6 were around that size too).&lt;/p&gt;
&lt;p&gt;It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen &lt;strong&gt;Agentic Engineering&lt;/strong&gt; show up in a few other places recently, most notably &lt;a href="https://twitter.com/karpathy/status/2019137879310836075"&gt;from Andrej Karpathy&lt;/a&gt; and &lt;a href="https://addyosmani.com/blog/agentic-engineering/"&gt;Addy Osmani&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; and got back &lt;a href="https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f0725d"&gt;a very good pelican on a disappointing bicycle frame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is good and has a well defined beak. The bicycle frame is a wonky red triangle. Nice sun and motion lines." src="https://static.simonwillison.net/static/2026/glm-5-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46977210"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="agentic-engineering"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="llm-release"/><category term="vibe-coding"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="definitions"/><category term="openrouter"/><category term="glm"/></entry><entry><title>Opus 4.6 and Codex 5.3</title><link href="https://simonwillison.net/2026/Feb/5/two-new-models/#atom-tag" rel="alternate"/><published>2026-02-05T20:29:20+00:00</published><updated>2026-02-05T20:29:20+00:00</updated><id>https://simonwillison.net/2026/Feb/5/two-new-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Two major new model releases today, within about 15 minutes of each other.&lt;/p&gt;
&lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-6"&gt;released Opus 4.6&lt;/a&gt;. Here's &lt;a href="https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe216"&gt;its pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;OpenAI &lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/"&gt;released GPT-5.3-Codex&lt;/a&gt;, albeit only via their Codex app, not yet in their API. Here's &lt;a href="https://gist.github.com/simonw/bfc4a83f588ac762c773679c0d1e034b"&gt;its pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Not nearly as good - the bicycle is a bit mangled, the pelican not nearly as well rendered - it's more of a line drawing." src="https://static.simonwillison.net/static/2026/codex-5.3-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I've had a bit of preview access to both of these models and to be honest I'm struggling to find a good angle to write about them - they're both &lt;em&gt;really good&lt;/em&gt;, but so were their predecessors Codex 5.2 and Opus 4.5. I've been having trouble finding tasks that those previous models couldn't handle but that the new ones ace.&lt;/p&gt;
&lt;p&gt;The most convincing story about capabilities of the new model so far is Nicholas Carlini from Anthropic talking about Opus 4.6 and &lt;a href="https://www.anthropic.com/engineering/building-c-compiler"&gt;Building a C compiler with a team of parallel Claudes&lt;/a&gt; - Anthropic's version of Cursor's &lt;a href="https://simonwillison.net/2026/Jan/23/fastrender/"&gt;FastRender project&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="anthropic"/><category term="generative-ai"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="parallel-agents"/><category term="c"/><category term="nicholas-carlini"/></entry><entry><title>Kimi K2.5: Visual Agentic Intelligence</title><link href="https://simonwillison.net/2026/Jan/27/kimi-k25/#atom-tag" rel="alternate"/><published>2026-01-27T15:07:41+00:00</published><updated>2026-01-27T15:07:41+00:00</updated><id>https://simonwillison.net/2026/Jan/27/kimi-k25/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.kimi.com/blog/kimi-k2-5.html"&gt;Kimi K2.5: Visual Agentic Intelligence&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kimi K2 landed &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;in July&lt;/a&gt; as a 1 trillion parameter open weight LLM. It was joined by Kimi K2 Thinking &lt;a href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/"&gt;in November&lt;/a&gt;, which added reasoning capabilities. Now they've made it multi-modal: the K2 models were text-only, but the new 2.5 can handle image inputs as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2.5 builds on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens. Built as a native multimodal model, K2.5 delivers state-of-the-art coding and vision capabilities and a self-directed agent swarm paradigm.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The "self-directed agent swarm paradigm" claim there means improved long-sequence tool calling and training on how to break down tasks for multiple agents to work on at once:&lt;/p&gt;
&lt;blockquote id="complex-tasks"&gt;&lt;p&gt;For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls. Compared with a single-agent setup, this reduces execution time by up to 4.5x. The agent swarm is automatically created and orchestrated by Kimi K2.5 without any predefined subagents or workflow.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I used the &lt;a href="https://openrouter.ai/moonshotai/kimi-k2.5"&gt;OpenRouter Chat UI&lt;/a&gt; to have it "Generate an SVG of a pelican riding a bicycle", and it did &lt;a href="https://gist.github.com/simonw/32a85e337fbc6ee935d10d89726c0476"&gt;quite well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cartoon illustration of a white pelican with a large orange beak and yellow throat pouch riding a green bicycle with yellow feet on the pedals, set against a light blue sky with soft bokeh circles and a green grassy hill. The bicycle frame is a little questionable. The pelican is quite good. The feet do not quite align with the pedals, which are floating clear of the frame." src="https://static.simonwillison.net/static/2026/kimi-k2.5-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;As a more interesting test, I decided to exercise the claims around multi-agent planning with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want to build a Datasette plugin that offers a UI to upload files to an S3 bucket and stores information about them in a SQLite table. Break this down into ten tasks suitable for execution by parallel coding agents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/ee2583b2eb5706400a4737f56d57c456"&gt;the full response&lt;/a&gt;. It produced ten realistic tasks and reasoned through the dependencies between them. For comparison here's the same prompt &lt;a href="https://claude.ai/share/df9258e7-97ba-4362-83da-76d31d96196f"&gt;against Claude Opus 4.5&lt;/a&gt; and &lt;a href="https://chatgpt.com/share/6978d48c-3f20-8006-9c77-81161f899104"&gt;against GPT-5.2 Thinking&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/moonshotai/Kimi-K2.5"&gt;Hugging Face repository&lt;/a&gt; is 595GB. The model uses Kimi's janky "modified MIT" license, which adds the following clause:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given the model's size, I expect one way to run it locally would be with MLX and a pair of $10,000 512GB RAM M3 Ultra Mac Studios. That setup has &lt;a href="https://twitter.com/awnihannun/status/1943723599971443134"&gt;been demonstrated to work&lt;/a&gt; with previous trillion parameter K2 models.&lt;/p&gt;
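&lt;p&gt;On a single machine with enough memory a heavily quantized build might also work via &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; - a sketch, assuming a community MLX conversion shows up (the model name here is hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install mlx-lm
# Hypothetical conversion - no official MLX build of K2.5 confirmed yet
mlx_lm.generate --model mlx-community/Kimi-K2.5-4bit \
  --prompt "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;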

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46775961"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="ai-agents"/><category term="llm-tool-use"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="moonshot"/><category term="kimi"/><category term="parallel-agents"/><category term="hugging-face"/><category term="janky-licenses"/><category term="llm-release"/></entry><entry><title>Introducing GPT-5.2-Codex</title><link href="https://simonwillison.net/2025/Dec/19/introducing-gpt-52-codex/#atom-tag" rel="alternate"/><published>2025-12-19T05:21:17+00:00</published><updated>2025-12-19T05:21:17+00:00</updated><id>https://simonwillison.net/2025/Dec/19/introducing-gpt-52-codex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-2-codex/"&gt;Introducing GPT-5.2-Codex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The latest in OpenAI's &lt;a href="https://simonwillison.net/tags/gpt-codex/"&gt;Codex family of models&lt;/a&gt; (not the same thing as their Codex CLI or Codex Cloud coding agent tools).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.2-Codex is a version of &lt;a href="https://openai.com/index/introducing-gpt-5-2/"&gt;GPT‑5.2⁠&lt;/a&gt; further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As with some previous Codex models this one is available via their Codex coding agents now and will be coming to the API "in the coming weeks". Unlike previous models, there's a new invite-only preview process giving vetted cybersecurity professionals access to "more permissive models".&lt;/p&gt;
&lt;p&gt;I've been very impressed recently with GPT-5.2's ability to &lt;a href="https://simonwillison.net/2025/Dec/15/porting-justhtml/"&gt;tackle multi-hour agentic coding challenges&lt;/a&gt;. 5.2 Codex scores 64% on the Terminal-Bench 2.0 benchmark, compared to 62.2% for GPT-5.2. I'm not sure how meaningful that 1.8% improvement will be!&lt;/p&gt;
&lt;p&gt;I didn't hack API access together this time (see &lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/"&gt;previous attempts&lt;/a&gt;), instead opting to just ask Codex CLI to "Generate an SVG of a pelican riding a bicycle" while running the new model (effort medium). &lt;a href="https://tools.simonwillison.net/codex-timeline?url=https://gist.githubusercontent.com/simonw/10ad81e82889a97a7d28827e0ea6d768/raw/d749473b37d86d519b4c3fa0892b5e54b5941b38/rollout-2025-12-18T16-09-10-019b33f0-6111-7840-89b0-aedf755a6e10.jsonl#tz=local&amp;amp;q=&amp;amp;type=all&amp;amp;payload=all&amp;amp;role=all&amp;amp;hide=1&amp;amp;truncate=1&amp;amp;sel=3"&gt;Here's the transcript&lt;/a&gt; in my new Codex CLI timeline viewer, and here's the pelican it drew:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt text by GPT-5.2-Codex: A minimalist illustration of a white pelican with a large orange beak riding a teal bicycle across a sandy strip of ground. The pelican leans forward as if pedaling, its wings tucked back and legs reaching toward the pedals. Simple gray motion lines trail behind it, and a pale yellow sun sits in the top‑right against a warm beige sky." src="https://static.simonwillison.net/static/2025/5.2-codex-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gpt-codex"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="codex-cli"/><category term="generative-ai"/></entry><entry><title>Gemini 3 Flash</title><link href="https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag" rel="alternate"/><published>2025-12-17T22:44:52+00:00</published><updated>2025-12-17T22:44:52+00:00</updated><id>https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;It continues to be a busy December, if not quite as busy &lt;a href="https://simonwillison.net/2024/Dec/20/december-in-llms-has-been-a-lot/"&gt;as last year&lt;/a&gt;. Today's big news is &lt;a href="https://blog.google/technology/developers/build-with-gemini-3-flash/"&gt;Gemini 3 Flash&lt;/a&gt;, the latest in Google's "Flash" line of faster and less expensive models.&lt;/p&gt;
&lt;p&gt;Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).&lt;/p&gt;
&lt;p&gt;The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro &amp;gt;200k, and it's nice not to have a price increase for the new Flash at larger token lengths.&lt;/p&gt;
&lt;p&gt;It's a little &lt;em&gt;more&lt;/em&gt; expensive than previous Flash models: Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, while Gemini 3 Flash is $0.50/million and $3/million respectively.&lt;/p&gt;
&lt;p&gt;Google &lt;a href="https://blog.google/products/gemini/gemini-3-flash/"&gt;claim&lt;/a&gt; it may still end up cheaper though, due to more efficient output token usage:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://www.llm-prices.com/#it=100000&amp;amp;ot=10000&amp;amp;sel=gemini-3-flash-preview%2Cgemini-3-pro-preview%2Cgemini-3-pro-preview-200k%2Cgpt-5.2%2Cclaude-opus-4-5%2Cclaude-sonnet-4.5%2Cclaude-4.5-haiku%2Cgemini-2.5-flash%2Cgpt-5-mini"&gt;a more extensive price comparison&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;h4 id="generating-some-svgs-of-pelicans"&gt;Generating some SVGs of pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28"&gt;llm-gemini 0.28&lt;/a&gt; this morning with support for the new model. You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;According to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#thinking_level"&gt;the developer docs&lt;/a&gt; the new model supports four different thinking level options: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt;. This is different from Gemini 3 Pro, which only supported &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can run those like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
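&lt;p&gt;To compare all four levels in one go you could wrap that in a loop - a rough sketch, which assumes the response comes back as bare SVG (you may need to strip a Markdown fence from the output first):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for level in minimal low medium high; do
  llm -m gemini-3-flash-preview --thinking-level $level \
    "Generate an SVG of a pelican riding a bicycle" &gt; pelican-$level.svg
done
&lt;/code&gt;&lt;/pre&gt;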
&lt;p&gt;Here are four pelicans, for thinking levels &lt;a href="https://gist.github.com/simonw/8047c805a4a1df7fd4e854b18e7482d9"&gt;minimal&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/fb61686a1f915e3777b4a40e2df41068"&gt;low&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/190c3ce82cd8976827139bbc4dcc2d19"&gt;medium&lt;/a&gt;, and &lt;a href="https://gist.github.com/simonw/da66ffce135359161996e41e50e32ec3"&gt;high&lt;/a&gt;:&lt;/p&gt;
&lt;image-gallery width="4"&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." /&gt;
&lt;/image-gallery&gt;
&lt;h4 id="i-built-the-gallery-component-with-gemini-3-flash"&gt;I built the gallery component with Gemini 3 Flash&lt;/h4&gt;
&lt;p&gt;The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt; &lt;span class="pl-c1"&gt;width&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;You write alt text for any image pasted in by the user. Alt text is always presented in a&lt;/span&gt;
&lt;span class="pl-s"&gt;fenced code block to make it easy to copy and paste out. It is always presented on a single&lt;/span&gt;
&lt;span class="pl-s"&gt;line so it can be used easily in Markdown images. All text on the image (for screenshots etc)&lt;/span&gt;
&lt;span class="pl-s"&gt;must be exactly included. A short note describing the nature of the image itself should go first.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
-a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/pre&gt;&lt;/div&gt;
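&lt;p&gt;A recipe like this can be saved as a reusable template so the system prompt doesn't have to be pasted in each time - &lt;code&gt;alt-text&lt;/code&gt; here is an arbitrary template name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-flash-preview --system 'You write alt text ...' --save alt-text
llm -t alt-text -a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg
&lt;/code&gt;&lt;/pre&gt;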
&lt;p&gt;You can see the code that powers the image gallery Web Component &lt;a href="https://github.com/simonw/simonwillisonblog/blob/31651b3a527011d1c971d4256c1c9f61ef378d23/static/image-gallery.js"&gt;here on GitHub&lt;/a&gt;. I built it by prompting Gemini 3 Flash via &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Build a Web Component that implements a simple image gallery. Usage is like this:&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;image-gallery width="5"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image1.jpg" alt="Image 1"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image3.jpg" alt="Image 3"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;/image-gallery&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;If an image has a data-thumb= attribute that one is used instead, other images are scaled down. &lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;The image gallery always takes up 100% of available width. The width="5" attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It took a few follow-up prompts using &lt;code&gt;llm -c&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use a real modal such that keyboard shortcuts and accessibility features work without extra JS&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use X for the close icon and make it a bit more subtle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;remove the hover effect entirely&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;I want no border on the close icon even when it is focused&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/09f63a49f29620d4cbbfd383cfee1db3"&gt;the full transcript&lt;/a&gt;, exported using &lt;code&gt;llm logs -cue&lt;/code&gt;.&lt;/p&gt;
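&lt;p&gt;Those stacked flags combine &lt;code&gt;-c&lt;/code&gt; (most recent conversation), &lt;code&gt;-u&lt;/code&gt; (include token usage) and &lt;code&gt;-e&lt;/code&gt; (expand fragment references) - equivalent to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm logs --current --usage --expand &gt; transcript.md
&lt;/code&gt;&lt;/pre&gt;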
&lt;p&gt;Those five prompts used the following token counts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;225 input, 3,269 output&lt;/li&gt;
&lt;li&gt;2,243 input, 2,908 output&lt;/li&gt;
&lt;li&gt;4,319 input, 2,516 output&lt;/li&gt;
&lt;li&gt;6,376 input, 2,094 output&lt;/li&gt;
&lt;li&gt;8,151 input, 1,806 output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Added together that's 21,314 input and 12,593 output for a grand total &lt;a href="https://www.llm-prices.com/#it=21314&amp;amp;ot=12593&amp;amp;sel=gemini-3-flash-preview"&gt;of 4.8436 cents&lt;/a&gt;.&lt;/p&gt;
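&lt;p&gt;That number checks out against the new $0.50/million input and $3/million output pricing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(21314 * 0.50 / 1e6 + 12593 * 3 / 1e6)'
# 0.048436 dollars = 4.8436 cents
&lt;/code&gt;&lt;/pre&gt;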
&lt;p&gt;The guide to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#migrating_from_gemini_25"&gt;migrating from Gemini 2.5&lt;/a&gt; reveals one disappointment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Image segmentation:&lt;/strong&gt; Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or &lt;a href="https://ai.google.dev/gemini-api/docs/robotics-overview"&gt;Gemini Robotics-ER 1.5&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/"&gt;wrote about this capability in Gemini 2.5&lt;/a&gt; back in April. I hope it comes back in future models - it's a really neat capability that is unique to Gemini.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="web-components"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>GPT-5.2</title><link href="https://simonwillison.net/2025/Dec/11/gpt-52/#atom-tag" rel="alternate"/><published>2025-12-11T23:58:04+00:00</published><updated>2025-12-11T23:58:04+00:00</updated><id>https://simonwillison.net/2025/Dec/11/gpt-52/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI reportedly &lt;a href="https://www.wsj.com/tech/ai/openais-altman-declares-code-red-to-improve-chatgpt-as-google-threatens-ai-lead-7faf5ea6"&gt;declared a "code red"&lt;/a&gt; on the 1st of December in response to increasingly credible competition from the likes of Google's Gemini 3. It's less than two weeks later and they just &lt;a href="https://openai.com/index/introducing-gpt-5-2/"&gt;announced GPT-5.2&lt;/a&gt;, calling it "the most capable model series yet for professional knowledge work".&lt;/p&gt;
&lt;h4 id="key-characteristics-of-gpt-5-2"&gt;Key characteristics of GPT-5.2&lt;/h4&gt;
&lt;p&gt;The new model comes in two variants: GPT-5.2 and GPT-5.2 Pro. There's no Mini variant yet.&lt;/p&gt;
&lt;p&gt;GPT-5.2 is available via their UI in both "instant" and "thinking" modes, presumably still corresponding to the API concept of different reasoning effort levels.&lt;/p&gt;
&lt;p&gt;The knowledge cut-off date for both variants is now &lt;strong&gt;August 31st 2025&lt;/strong&gt;. This is significant - GPT-5.1 and GPT-5 were both September 30th 2024, and GPT-5 mini was May 31st 2024.&lt;/p&gt;
&lt;p&gt;Both of the 5.2 models have a 400,000 token context window and 128,000 max output tokens - no different from 5.1 or 5.&lt;/p&gt;
&lt;p&gt;Pricing-wise 5.2 is a rare &lt;em&gt;increase&lt;/em&gt; - it's 1.4x the cost of GPT-5.1, at $1.75/million input and $14/million output. GPT-5.2 Pro is $21.00/million input and a hefty $168.00/million output, putting it &lt;a href="https://www.llm-prices.com/#sel=gpt-4.5%2Co1-pro%2Cgpt-5.2-pro"&gt;up there&lt;/a&gt; with their previous most expensive models, o1 Pro and GPT-4.5.&lt;/p&gt;
&lt;p&gt;So far the main benchmark results we have are self-reported by OpenAI. The most interesting ones are a 70.9% score on their GDPval "Knowledge work tasks" benchmark (GPT-5 got 38.8%) and a 52.9% on ARC-AGI-2 (up from 17.6% for GPT-5.1 Thinking).&lt;/p&gt;
&lt;p&gt;The ARC Prize Twitter account provided &lt;a href="https://x.com/arcprize/status/1999182732845547795"&gt;this interesting note&lt;/a&gt; on the efficiency gains for GPT-5.2 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A year ago, we verified a preview of an unreleased version of @OpenAI
o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task&lt;/p&gt;
&lt;p&gt;Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task&lt;/p&gt;
&lt;p&gt;This represents a ~390X efficiency improvement in one year&lt;/p&gt;
&lt;/blockquote&gt;
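&lt;p&gt;That ~390X figure appears to be the straight cost-per-task ratio:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(4500 / 11.64)'
# 386.597... which rounds to roughly 390
&lt;/code&gt;&lt;/pre&gt;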
&lt;p&gt;GPT-5.2 can be accessed in OpenAI's Codex CLI tool like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;codex -m gpt-5.2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are three new API models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.2"&gt;gpt-5.2&lt;/a&gt; - I think this is what you get if you select "GPT-5.2 Thinking" in ChatGPT but &lt;a href="https://twitter.com/simonw/status/1999603339382976785"&gt;I'm a little confused&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.openai.com/docs/models/gpt-5.2-chat-latest"&gt;gpt-5.2-chat-latest&lt;/a&gt; - the model used by ChatGPT for "GPT-5.2 Instant" mode. It's priced the same as GPT-5.2 but has a reduced 128,000 context window with 16,384 max output tokens.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.2-pro"&gt;gpt-5.2-pro&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenAI have published a new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-2_prompting_guide"&gt;GPT-5.2 Prompting Guide&lt;/a&gt;. An interesting note from that document is that compaction can now be run with &lt;a href="https://platform.openai.com/docs/api-reference/responses/compact"&gt;a new dedicated server-side API&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For long-running, tool-heavy workflows that exceed the standard context window, GPT-5.2 with Reasoning supports response compaction via the &lt;code&gt;/responses/compact&lt;/code&gt; endpoint. Compaction performs a loss-aware compression pass over prior conversation state, returning encrypted, opaque items that preserve task-relevant information while dramatically reducing token footprint. This allows the model to continue reasoning across extended workflows without hitting context limits.&lt;/p&gt;
&lt;/blockquote&gt;
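&lt;p&gt;A hypothetical &lt;code&gt;curl&lt;/code&gt; sketch of what a call to that endpoint might look like - the request fields here are guesses, not OpenAI's documented schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Field names below are assumptions - check the API reference for the real schema
curl https://api.openai.com/v1/responses/compact \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.2", "previous_response_id": "resp_abc123"}'
&lt;/code&gt;&lt;/pre&gt;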

&lt;h4 id="it-s-better-at-vision"&gt;It's better at vision&lt;/h4&gt;
&lt;p&gt;One note from the announcement that caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.2 Thinking is our strongest vision model yet, cutting error rates roughly in half on chart reasoning and software interface understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had &lt;a href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/"&gt;disappointing results from GPT-5&lt;/a&gt; on an OCR task a while ago. I tried it against GPT-5.2 and it did &lt;em&gt;much&lt;/em&gt; better:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 ocr -a https://static.simonwillison.net/static/2025/ft.jpeg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b4a13f1e424e58b8b0aca72ae2c3cb00"&gt;the result&lt;/a&gt; from that, which cost 1,520 input and 1,022 for a total of &lt;a href="https://www.llm-prices.com/#it=1520&amp;amp;ot=1022&amp;amp;sel=gpt-5.2"&gt;1.6968 cents&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="rendering-some-pelicans"&gt;Rendering some pelicans&lt;/h4&gt;
&lt;p&gt;For my classic "Generate an SVG of a pelican riding a bicycle" test:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-2.5-pelican.png" alt="Described by GPT-5.2: Cartoon-style illustration: A white, duck-like bird with a small black eye, oversized orange beak (with a pale blue highlight along the lower edge), and a pink neckerchief rides a blue-framed bicycle in side view; the bike has two large black wheels with gray spokes, a blue front fork, visible black crank/pedal area, and thin black handlebar lines, with gray motion streaks and a soft gray shadow under the bike on a light-gray road; background is a pale blue sky with a simple yellow sun at upper left and two rounded white clouds (one near upper center-left and one near upper right)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for the more advanced alternative test, which tests instruction following in a little more depth:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle&lt;/span&gt;
&lt;span class="pl-s"&gt;must have spokes and a correctly shaped bicycle frame. The pelican must have its&lt;/span&gt;
&lt;span class="pl-s"&gt;characteristic large pouch, and there should be a clear indication of feathers.&lt;/span&gt;
&lt;span class="pl-s"&gt;The pelican must be clearly pedaling the bicycle. The image should show the full&lt;/span&gt;
&lt;span class="pl-s"&gt;breeding plumage of the California brown pelican.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5.2-p2.png" alt="Digital illustration on a light gray/white background with a thin horizontal baseline: a stylized California brown pelican in breeding plumage is drawn side-on, leaning forward and pedaling a bicycle; the pelican has a dark brown body with layered wing lines, a pale cream head with a darker brown cap and neck shading, a small black eye, and an oversized long golden-yellow bill extending far past the front wheel; one brown leg reaches down to a pedal while the other is tucked back; the bike is shown in profile with two large spoked wheels (black tires, white rims), a dark frame, crank and chainring near the rear wheel, a black saddle above the rear, and the front fork aligned under the pelican’s head; text at the top reads &amp;quot;California brown pelican (breeding plumage) pedaling a bicycle&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 14th December 2025&lt;/strong&gt;: I used GPT-5.2 running in Codex CLI to &lt;a href="https://simonwillison.net/2025/Dec/15/porting-justhtml/"&gt;port a complex Python library to JavaScript&lt;/a&gt;. It ran without interference for nearly four hours and completed a complex task exactly to my specification.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="gpt-5"/></entry><entry><title>Devstral 2</title><link href="https://simonwillison.net/2025/Dec/9/devstral-2/#atom-tag" rel="alternate"/><published>2025-12-09T23:58:27+00:00</published><updated>2025-12-09T23:58:27+00:00</updated><id>https://simonwillison.net/2025/Dec/9/devstral-2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/devstral-2-vibe-cli"&gt;Devstral 2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Two new models from Mistral today: Devstral 2 and Devstral Small 2 - both focused on powering coding agents such as Mistral's newly released Mistral Vibe which &lt;a href="https://simonwillison.net/2025/Dec/9/mistral-vibe/"&gt;I wrote about earlier today&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Devstral 2: SOTA open model for code agents with a fraction of the parameters of its competitors and achieving 72.2% on SWE-bench Verified.&lt;/li&gt;
&lt;li&gt;Up to 7x more cost-efficient than Claude Sonnet at real-world tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Devstral 2 is a 123B model released under a janky license - it's "modified MIT" where &lt;a href="https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512/blob/main/LICENSE"&gt;the modification&lt;/a&gt; is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Devstral Small 2 is under a proper Apache 2 license with no weird strings attached. It's a 24B model which is &lt;a href="https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512"&gt;51.6GB on Hugging Face&lt;/a&gt; and should quantize to significantly less.&lt;/p&gt;
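&lt;p&gt;That 51.6GB is roughly consistent with 16-bit weights, and a back-of-envelope estimate (ignoring per-layer overhead) puts a 4-bit quantization at around 12GB:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(24e9 * 2 / 1e9, 24e9 * 0.5 / 1e9)'
# 48.0 12.0 - approximate GB at 16 bits vs 4 bits for a nominal 24B parameters
&lt;/code&gt;&lt;/pre&gt;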
&lt;p&gt;I tried out the larger model via &lt;a href="https://github.com/simonw/llm-mistral"&gt;my llm-mistral plugin&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mistral
llm mistral refresh
llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Bicycle looks a bit like a cybertruck" src="https://static.simonwillison.net/static/2025/devstral-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;For a ~120B model that one is pretty good!&lt;/p&gt;
&lt;p&gt;Here's the same prompt with &lt;code&gt;-m mistral/labs-devstral-small-2512&lt;/code&gt; for the API hosted version of Devstral Small 2:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A small white pelican on what looks more like a child's cart." src="https://static.simonwillison.net/static/2025/devstral-small-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Again, a decent result given the small parameter size. For comparison, &lt;a href="https://simonwillison.net/2025/Jun/20/mistral-small-32/"&gt;here's what I got&lt;/a&gt; for the 24B Mistral Small 3.2 earlier this year.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="mistral"/><category term="generative-ai"/><category term="ai"/><category term="janky-licenses"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Introducing Mistral 3</title><link href="https://simonwillison.net/2025/Dec/2/introducing-mistral-3/#atom-tag" rel="alternate"/><published>2025-12-02T17:30:57+00:00</published><updated>2025-12-02T17:30:57+00:00</updated><id>https://simonwillison.net/2025/Dec/2/introducing-mistral-3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/mistral-3"&gt;Introducing Mistral 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Four new models from Mistral today: three in their "Ministral" smaller model series (14B, 8B, and 3B) and a new Mistral Large 3 MoE model with 675B parameters, 41B active.&lt;/p&gt;
&lt;p&gt;All of the models are vision capable, and they are all released under an Apache 2 license.&lt;/p&gt;
&lt;p&gt;I'm particularly excited about the 3B model, which appears to be a competent vision-capable model in a tiny ~3GB file.&lt;/p&gt;
&lt;p&gt;Xenova from Hugging Face &lt;a href="https://x.com/xenovacom/status/1995879338583945635"&gt;got it working in a browser&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;@MistralAI releases Mistral 3, a family of multimodal models, including three state-of-the-art dense models (3B, 8B, and 14B) and Mistral Large 3 (675B, 41B active). All Apache 2.0! 🤗&lt;/p&gt;
&lt;p&gt;Surprisingly, the 3B is small enough to run 100% locally in your browser on WebGPU! 🤯&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can &lt;a href="https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU"&gt;try that demo in your browser&lt;/a&gt;, which will fetch 3GB of model and then stream from your webcam and let you run text prompts against what the model is seeing, entirely locally.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a man with glasses holding a red cube-shaped object up to the camera in a live computer vision interface; top left label reads “LIVE FEED”; top right slider label reads “INPUT SIZE: 480PX”; lower left panel titled “PROMPT LIBRARY” with prompts “Describe what you see in one sentence.” “What is the color of my shirt?” “Identify any text or written content visible.” “What emotions or actions are being portrayed?” “Name the object I am holding in my hand.”; below that a field labeled “PROMPT” containing the text “write a haiku about this”; lower right panel titled “OUTPUT STREAM” with buttons “VIEW HISTORY” and “LIVE INFERENCE” and generated text “Red cube held tight, Fingers frame the light’s soft glow– Mystery shines bright.”; a small status bar at the bottom shows “ttft: 4188ms  tokens/sec: 5.09” and “ctx: 3.3B-Instruct”." src="https://static.simonwillison.net/static/2025/3b-webcam.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Mistral's API hosted versions of the new models are supported by my &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral plugin&lt;/a&gt; already thanks to the &lt;code&gt;llm mistral refresh&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ llm mistral refresh
Added models: ministral-3b-2512, ministral-14b-latest, mistral-large-2512, ministral-14b-2512, ministral-8b-2512
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/0df5e656291d5a7a1bf012fabc9edc3f"&gt;tried pelicans against all of the models&lt;/a&gt;. Here's the best one, from Mistral Large 3:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nice cloud. Pelican isn't great, the beak is missing the pouch. It's floating above the bicycle which has two wheels and an incorrect frame." src="https://static.simonwillison.net/static/2025/mistral-large-3.png" /&gt;&lt;/p&gt;
&lt;p&gt;And the worst from Ministral 3B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A black sky. A brown floor. A set of abstract brown and grey shapes float, menacingly." src="https://static.simonwillison.net/static/2025/ministral-3b.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="llm-release"/><category term="mistral"/><category term="llm"/><category term="generative-ai"/><category term="ai"/><category term="llms"/></entry><entry><title>DeepSeek-V3.2</title><link href="https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag" rel="alternate"/><published>2025-12-01T23:56:19+00:00</published><updated>2025-12-01T23:56:19+00:00</updated><id>https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://api-docs.deepseek.com/news/news251201"&gt;DeepSeek-V3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Two new open weight (MIT licensed) models from DeepSeek today: &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2"&gt;DeepSeek-V3.2&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale"&gt;DeepSeek-V3.2-Speciale&lt;/a&gt;, both 690GB, 685B parameters. Here's the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/assets/paper.pdf"&gt;PDF tech report&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DeepSeek-V3.2 is DeepSeek's new flagship model, now running on &lt;a href="https://chat.deepseek.com"&gt;chat.deepseek.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The difference between the two new models is best explained by this paragraph from the technical report:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.2 integrates reasoning, agent, and human alignment data distilled from specialists, undergoing thousands of steps of continued RL training to reach the final checkpoints. To investigate the potential of extended thinking, we also developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusively on reasoning data with a reduced length penalty during RL. Additionally, we incorporated the dataset and reward method from DeepSeekMath-V2 (Shao et al., 2025) to enhance capabilities in mathematical proofs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I covered &lt;a href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/"&gt;DeepSeek-Math-V2 last week&lt;/a&gt;. Like that model, DeepSeek-V3.2-Speciale also scores gold on the 2025 International Mathematical Olympiad so beloved of model training teams!&lt;/p&gt;
&lt;p id="pelicans"&gt;I tried both models on "Generate an SVG of a pelican riding a bicycle" using the chat feature of [OpenRouter](https://openrouter.ai/). DeepSeek V3.2 produced this very short reasoning chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's assume the following:&lt;/p&gt;
&lt;p&gt;Wheel radius: 40&lt;br&gt;
Distance between wheel centers: 180&lt;br&gt;
Seat height: 60 (above the rear wheel center)&lt;br&gt;
Handlebars: above the front wheel, extending back and up.&lt;/p&gt;
&lt;p&gt;We'll set the origin at the center of the rear wheel.&lt;/p&gt;
&lt;p&gt;We'll create the SVG with a viewBox that fits the entire drawing.&lt;/p&gt;
&lt;p&gt;Let's start by setting up the SVG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Followed by this illustration:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pleasing gradents for the sky and ground and sun. Neat three-circle clouds. A Pelican on a Bicycle title printed on the image. The pelican is cute but stlightly detached from the bicycle. The bicycle has a somewhat mangled brown frame." src="https://static.simonwillison.net/static/2025/deepseek-v32.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's what I got from the Speciale model, which thought deeply about the geometry of bicycles and pelicans for &lt;a href="https://gist.githubusercontent.com/simonw/3debaf0df67c2d99a36f41f21ffe534c/raw/fbbb60c6d5b6f02d539ade5105b990490a81a86d/svg.txt"&gt;a very long time (at least 10 minutes)&lt;/a&gt; before spitting out this result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's not great. The bicycle is distorted, the pelican is a white oval, an orange almost-oval beak, a little black eye and setched out straight line limbs leading to the pedal and handlebars." src="https://static.simonwillison.net/static/2025/deepseek-v32-speciale.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46108780"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="openrouter"/><category term="generative-ai"/><category term="deepseek"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>deepseek-ai/DeepSeek-Math-V2</title><link href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/#atom-tag" rel="alternate"/><published>2025-11-27T15:59:23+00:00</published><updated>2025-11-27T15:59:23+00:00</updated><id>https://simonwillison.net/2025/Nov/27/deepseek-math-v2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-Math-V2"&gt;deepseek-ai/DeepSeek-Math-V2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New on Hugging Face, a specialist mathematical reasoning LLM from DeepSeek. This is their entry in the space previously dominated by proprietary models from OpenAI and Google DeepMind, both of which &lt;a href="https://simonwillison.net/2025/Jul/21/gemini-imo/"&gt;achieved gold medal scores&lt;/a&gt; on the International Mathematical Olympiad earlier this year.&lt;/p&gt;
&lt;p&gt;We now have an open weights (Apache 2 licensed) 685B, 689GB model that can achieve the same. From the &lt;a href="https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf"&gt;accompanying paper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeekMath-V2 demonstrates strong performance on competition mathematics. With scaled test-time compute, it achieved gold-medal scores in high-school competitions including IMO 2025 and CMO 2024, and a near-perfect score on the undergraduate Putnam 2024 competition.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mathematics"&gt;mathematics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="llm-reasoning"/><category term="deepseek"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="mathematics"/><category term="generative-ai"/></entry><entry><title>Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult</title><link href="https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag" rel="alternate"/><published>2025-11-24T19:37:07+00:00</published><updated>2025-11-24T19:37:07+00:00</updated><id>https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-5"&gt;released Claude Opus 4.5&lt;/a&gt; this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/"&gt;GPT-5.1-Codex-Max&lt;/a&gt; and Google's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3&lt;/a&gt;, both released within the past week!&lt;/p&gt;
&lt;p&gt;The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February).&lt;/p&gt;
&lt;p&gt;The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for &amp;gt;200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-5#key-improvements-in-opus-4-5-over-opus-4-1"&gt;Key improvements in Opus 4.5 over Opus 4.1&lt;/a&gt; document has a few more interesting details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Opus 4.5 has a new &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;effort parameter&lt;/a&gt; which defaults to high but can be set to medium or low for faster responses.&lt;/li&gt;
&lt;li&gt;The model supports &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool"&gt;enhanced computer use&lt;/a&gt;, specifically a &lt;code&gt;zoom&lt;/code&gt; tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect.&lt;/li&gt;
&lt;li&gt;"&lt;a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking#thinking-block-preservation-in-claude-opus-4-5"&gt;Thinking blocks from previous assistant turns are preserved in model context by default&lt;/a&gt;" - apparently previous Anthropic models discarded those.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in &lt;a href="https://simonwillison.net/2025/Nov/24/sqlite-utils-40a1/"&gt;a new alpha release of sqlite-utils&lt;/a&gt; that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across &lt;a href="https://github.com/simonw/sqlite-utils/compare/10957305be998999e3c95c11863b5709d42b7ae3...4.0a1"&gt;20 commits, 39 files changed,  2,022 additions and 1,173 deletions&lt;/a&gt; in a two day period. Here's the &lt;a href="https://gistpreview.github.io/?f40971b693024fbe984a68b73cc283d2"&gt;Claude Code transcript&lt;/a&gt; where I had it help implement one of the more complicated new features.&lt;/p&gt;
&lt;p&gt;It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in &lt;a href="https://github.com/simonw/sqlite-utils/milestone/7?closed=1"&gt;the milestone for the alpha&lt;/a&gt;. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model.&lt;/p&gt;
&lt;p&gt;With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.&lt;/p&gt;
&lt;p&gt;I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.&lt;/p&gt;
&lt;p&gt;This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.&lt;/p&gt;
&lt;p&gt;Google's Nano Banana Pro image generation model was notable in that its ability to &lt;a href="https://simonwillison.net/2025/Nov/20/nano-banana-pro/#creating-an-infographic"&gt;render usable infographics&lt;/a&gt; really does represent a task at which  previous models had been laughably incapable.&lt;/p&gt;
&lt;p&gt;The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?&lt;/p&gt;
&lt;p&gt;And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models.&lt;/p&gt;
&lt;p&gt;I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself!&lt;/p&gt;
&lt;p&gt;I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.&lt;/p&gt;
&lt;p&gt;"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a &lt;em&gt;lot&lt;/em&gt; more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.&lt;/p&gt;
&lt;p id="pelicans"&gt;In the meantime, I'm just gonna have to keep on getting them to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelicans riding bicycles&lt;/a&gt;. Here's Opus 4.5 (on its default &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;"high" effort level&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It did significantly better on the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;new more detailed prompt&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican-advanced.jpg" alt="The pelican has feathers and a red pouch - a close enough version of breeding plumage. The bicycle is a much better shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that same complex prompt &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#advanced-pelican"&gt;against Gemini 3 Pro&lt;/a&gt; and &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#advanced-pelican-codex-max"&gt;against GPT-5.1-Codex-Max-xhigh&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="still-susceptible-to-prompt-injection"&gt;Still susceptible to prompt injection&lt;/h4&gt;
&lt;p&gt;From &lt;a href="https://www.anthropic.com/news/claude-opus-4-5#a-step-forward-on-safety"&gt;the safety section&lt;/a&gt; of Anthropic's announcement post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-prompt-injection.jpg" alt="Bar chart titled &amp;quot;Susceptibility to prompt-injection style attacks&amp;quot; with subtitle &amp;quot;At k queries; lower is better&amp;quot;. Y-axis shows &amp;quot;ATTACK SUCCESS RATE (%)&amp;quot; from 0-100. Five stacked bars compare AI models with three k values (k=1 in dark gray, k=10 in beige, k=100 in pink). Results: Gemini 3 Pro Thinking (12.5, 60.7, 92.0), GPT-5.1 Thinking (12.6, 58.2, 87.8), Haiku 4.5 Thinking (8.3, 51.1, 85.6), Sonnet 4.5 Thinking (7.3, 41.9, 72.4), Opus 4.5 Thinking (4.7, 33.6, 63.0)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the one hand this looks great, it's a clear improvement over previous models and the competition.&lt;/p&gt;
&lt;p&gt;What does the chart actually tell us though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks that success rate goes up to 1/3!&lt;/p&gt;
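&lt;p&gt;Those odds are roughly what you'd expect if each attempt were an independent roll of the dice - compounding the k=1 success rate ten times lands close to the measured k=10 number:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'p = 0.047; print(1 - (1 - p) ** 10)'
# ≈ 0.382 under independence - the measured 33.6% at k=10 is in the same ballpark
&lt;/code&gt;&lt;/pre&gt;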
&lt;p&gt;I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="evals"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="november-2025-inflection"/></entry><entry><title>Olmo 3 is a fully open LLM</title><link href="https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag" rel="alternate"/><published>2025-11-22T23:59:46+00:00</published><updated>2025-11-22T23:59:46+00:00</updated><id>https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Olmo is the LLM series from Ai2 - the &lt;a href="https://allenai.org/"&gt;Allen institute for AI&lt;/a&gt;. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along with those releases.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://allenai.org/blog/olmo3"&gt;new Olmo 3&lt;/a&gt; claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At its center is &lt;strong&gt;Olmo 3-Think (32B)&lt;/strong&gt;, the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.&lt;/p&gt;
&lt;p&gt;Having full access to the training data is really useful. Here's how they describe that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Olmo 3 is pretrained on &lt;strong&gt;Dolma 3&lt;/strong&gt;, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with &lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct &lt;strong&gt;Dolma 3 Mix&lt;/strong&gt;, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also highlight that they are training on fewer tokens than their competition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.&lt;/p&gt;
&lt;p&gt;I tried out the 32B Think model and the 7B Instruct model &lt;a href="https://lmstudio.ai/models/olmo3"&gt;using LM Studio&lt;/a&gt;. The 7B model is a 4.16GB download, the 32B one is 18.14GB.&lt;/p&gt;
&lt;p&gt;The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for &lt;em&gt;14 minutes 43 seconds&lt;/em&gt;, outputting 8,437 tokens total most of which was &lt;a href="https://gist.github.com/simonw/2ae9d5ed71de9608b7955eea9671306f"&gt;this epic thinking trace&lt;/a&gt;.&lt;/p&gt;
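&lt;p&gt;That run works out at just under 10 tokens/second:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(8437 / (14 * 60 + 43))'
# ≈ 9.55 tokens/second averaged over the 14m43s run
&lt;/code&gt;&lt;/pre&gt;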
&lt;p&gt;I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 100 100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bike frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;saddlebrown&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52 50,57 35,62 50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Head/beak &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45 60,50 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Left wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65 60,70 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Right wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Feet on pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25 75,30 85,35 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;75 75,70 85,65 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Rendered it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo3-32b-pelican.jpg" alt="Two circles, each with a triangle sticking out from the bottom. They have bars leading up to a brown box. Overlapping them is a black triangle with white circles for eyes and two grey triangles that are probably meant to be wings. It is not recognizable as a pelican or a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I tested OLMo 2 32B 4bit &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;back in March&lt;/a&gt; and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo2-pelican.jpg" alt="Blue and black wiggly lines looking more like a circuit diagram than a pelican riding a bicycle" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To be fair, 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now &lt;a href="https://openrouter.ai/chat?models=qwen/qwen3-32b"&gt;using OpenRouter&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-32b-pelican.png" alt="The bicycle is two black circles joined by two lines, with a weird rectangular saddle perched on top. The pelican is a blue oval, a white circle with a yellow triangle in it and a weird eye-shaped oval overlapping the blue one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="olmotrace"&gt;OlmoTrace&lt;/h4&gt;
&lt;p&gt;I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A core goal of Olmo 3 is not just to &lt;em&gt;open&lt;/em&gt; the model flow, but to make it &lt;em&gt;actionable&lt;/em&gt; for people who want to understand and improve model behavior. Olmo 3 integrates with &lt;a href="https://allenai.org/blog/olmotrace"&gt;&lt;strong&gt;OlmoTrace&lt;/strong&gt;&lt;/a&gt;, our tool for tracing model outputs back to training data in real time.&lt;/p&gt;
&lt;p&gt;For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can access OlmoTrace via &lt;a href="https://playground.allenai.org/"&gt;playground.allenai.org&lt;/a&gt;, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.&lt;/p&gt;
&lt;p&gt;I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmotrace.jpg" alt="Screenshot of AI2 Playground interface with Olmo 3 32B Think model selected. User prompt reads &amp;quot;Generate a conference bio for Simon Willison&amp;quot;. AI response shows expanded &amp;quot;Thoughts&amp;quot; section: &amp;quot;Okay, the user wants me to generate a conference bio for Simon Willison. Let me start by recalling who Simon is. He's a well-known developer, entrepreneur, and author in the tech community. I remember he's the co-founder of co:here and now works at Anthropic. He's famous for his work with SQLite, especially the datasette project. Also, he writes a blog called SimonWillison.net which is quite popular. First, I need to structure a typical conference bio. These usually include his current role, past positions, notable projects, and maybe some achievements or recognitions. I should highlight his key contributions like datasette, which allows easy exploration of SQLite databases via a web interface. That's a big deal in data journalism and open data circles.&amp;quot; Right panel shows &amp;quot;OlmoTrace&amp;quot; feature described as &amp;quot;Documents from the training data that have exact text matches with the model response. Powered by infini-gram&amp;quot;. First document excerpt discusses technology and innovation, with highlighted match text &amp;quot;societal implications of technology, emphasizing the&amp;quot; shown in bold, surrounded by text about responsibility and merging innovation with intellect. Second document excerpt about Matt Hall has highlighted match &amp;quot;is a software engineer and entrepreneur based in&amp;quot; shown in bold, describing someone in New York City who co-founded a PFP collection and works at Google Creative Lab. Note indicates &amp;quot;Document repeated 2 times in result&amp;quot; with &amp;quot;View all repeated documents&amp;quot; link." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs: I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies.&lt;/p&gt;
&lt;p&gt;I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by &lt;a href="https://infini-gram.io/"&gt;Ai2's infini-gram&lt;/a&gt;) but the documents it found had nothing to do with me at all.&lt;/p&gt;
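&lt;p&gt;You can also query an infini-gram index directly over HTTP. Here's a sketch based on my reading of the API documentation on &lt;a href="https://infini-gram.io/"&gt;infini-gram.io&lt;/a&gt; - treat the payload shape as an assumption, and note that the index named here is one of their published corpora, not necessarily the Olmo 3 training set:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# endpoint and payload shape per infini-gram's API docs (unverified)
curl -s https://api.infini-gram.io/ \
  -H 'Content-Type: application/json' \
  -d '{"index": "v4_rpj_llama_s4", "query_type": "count", "query": "Simon Willison"}'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That should return a JSON object including a count of exact matches for the phrase in that corpus.&lt;/p&gt;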
&lt;h4 id="can-open-training-data-address-concerns-of-backdoors-"&gt;Can open training data address concerns of backdoors?&lt;/h4&gt;
&lt;p&gt;Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to &lt;a href="https://marin.community/"&gt;Stanford's Marin&lt;/a&gt; and &lt;a href="https://www.swiss-ai.org/apertus"&gt;Swiss AI's Apertus&lt;/a&gt;, neither of which I'd heard about before.&lt;/p&gt;
&lt;p&gt;A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that &lt;a href="https://www.anthropic.com/research/small-samples-poison"&gt;a small number of samples can poison LLMs of any size&lt;/a&gt; - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt.&lt;/p&gt;

&lt;p&gt;This makes fully open training data an even bigger deal.&lt;/p&gt;
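&lt;p&gt;It also means anyone can run crude audits themselves - impossible with a closed corpus. A trivial sketch, where the shard directory and trigger phrase are entirely hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# scan a local copy of the training shards for a suspected trigger phrase
# (the path and the phrase here are made up for illustration)
grep -ril 'deploy the hidden payload' training-data-shards/ | head
&lt;/code&gt;&lt;/pre&gt;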

&lt;p&gt;Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in &lt;a href="https://www.interconnects.ai/p/olmo-3-americas-truly-open-reasoning"&gt;his detailed post about the release&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).&lt;/p&gt;

&lt;p&gt;This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." &lt;a href="https://arxiv.org/abs/2506.10947"&gt;arXiv preprint arXiv:2506.10947&lt;/a&gt; (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." &lt;a href="https://arxiv.org/abs/2507.10532"&gt;arXiv preprint arXiv:2507.10532&lt;/a&gt; (2025).)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in &lt;a href="https://simonwillison.net/2024/Feb/2/olmos/"&gt;February 2024&lt;/a&gt;) and Olmo 2 (in &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;March 2025&lt;/a&gt;) have been significant. I'm hoping that trend continues!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/interpretability"&gt;interpretability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nathan-lambert"&gt;nathan-lambert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="interpretability"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai2"/><category term="ai-ethics"/><category term="llm-release"/><category term="lm-studio"/><category term="nathan-lambert"/><category term="olmo"/></entry><entry><title>Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model</title><link href="https://simonwillison.net/2025/Nov/20/nano-banana-pro/#atom-tag" rel="alternate"/><published>2025-11-20T16:32:25+00:00</published><updated>2025-11-20T16:32:25+00:00</updated><id>https://simonwillison.net/2025/Nov/20/nano-banana-pro/#atom-tag</id><summary type="html">
    &lt;p&gt;Hot on the heels of Tuesday's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3 Pro&lt;/a&gt; release, today it's &lt;a href="https://blog.google/technology/ai/nano-banana-pro/"&gt;Nano Banana Pro&lt;/a&gt;, also known as &lt;a href="https://deepmind.google/models/gemini-image/pro/"&gt;Gemini 3 Pro Image&lt;/a&gt;. I've had a few days of preview access and this is an &lt;em&gt;astonishingly&lt;/em&gt; capable image generation model.&lt;/p&gt;
&lt;p&gt;As is often the case, the most useful low-level details can be found in &lt;a href="https://ai.google.dev/gemini-api/docs/image-generation#gemini-3-capabilities"&gt;the API documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-resolution output&lt;/strong&gt;: Built-in generation capabilities for 1K, 2K, and 4K visuals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced text rendering&lt;/strong&gt;: Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounding with Google Search&lt;/strong&gt;: The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking mode&lt;/strong&gt;: The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Up to 14 reference images&lt;/strong&gt;: You can now mix up to 14 reference images to produce the final image.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[...] These 14 images can include the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Up to 6 images of objects with high-fidelity to include in the final image&lt;/li&gt;
&lt;li&gt;Up to 5 images of humans to maintain character consistency&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There is also a short (6 page) &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf"&gt;model card PDF&lt;/a&gt; which lists the following as "new capabilities" compared to the previous Nano Banana: Multi character editing, Chart editing, Text editing, Factuality - Edu, Multi-input 1-3, Infographics, Doodle editing, Visual design.&lt;/p&gt;
&lt;h4 id="trying-out-some-detailed-instruction-image-prompts"&gt;Trying out some detailed instruction image prompts&lt;/h4&gt;
&lt;p&gt;Max Woolf published &lt;a href="https://minimaxir.com/2025/11/nano-banana-prompts/#hello-nano-banana"&gt;the definitive guide to prompting Nano Banana&lt;/a&gt; just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K.&lt;/p&gt;
&lt;p&gt;Here's what I got for his first test prompt, using Google's &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Create an image of a three-dimensional pancake in the shape of a skull, garnished on top with blueberries and maple syrup.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pancake-skull-1.jpg" alt="A very detailed quality photo of a skull made of pancake batter, blueberries on top, maple syrup dripping down, maple syrup bottle in the background." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's &lt;a href="https://drive.google.com/file/d/1QV3pcW1KfbTRQscavNh6ld9PyqG4BRes/view?usp=drive_link"&gt;a Google Drive link for the original&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I ran his follow-up prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Make ALL of the following edits to the image:
- Put a strawberry in the left eye socket.
- Put a blackberry in the right eye socket.
- Put a mint garnish on top of the pancake.
- Change the plate to a plate-shaped chocolate-chip cookie.
- Add happy people to the background.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pancake-skull-2.jpg" alt="It's the exact same skull with the requested edits made - mint garnish on the blueberries, a strawberry in the left hand eye socket (from our perspective, technically the skull's right hand socket), a blackberry in the other, the plate is now a plate-sized chocolate chip cookie (admittedly on a regular plate) and there are four happy peo ple in the background." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'll note that it did put the plate-sized cookie on a regular plate. Here's &lt;a href="https://drive.google.com/file/d/18AzhM-BUZAfLGoHWl6MQW_UW9ju4km-i/view?usp=drive_link"&gt;the 24.9MB PNG&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new model isn't cheap. Here's &lt;a href="https://ai.google.dev/gemini-api/docs/pricing#gemini-3-pro-image-preview"&gt;the API pricing&lt;/a&gt;: it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed.&lt;/p&gt;
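&lt;p&gt;Those per-image prices add up quickly for batch work. Using just the figures above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 100 4K generations at 24 cents each, plus 3 reference image
# inputs per generation at 0.11 cents each
python3 -c 'print(round(100 * (0.24 + 3 * 0.0011), 2))'
# 24.33 (dollars)
&lt;/code&gt;&lt;/pre&gt;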
&lt;p&gt;Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there.&lt;/p&gt;
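&lt;p&gt;Once you have a billed key the standard Gemini &lt;code&gt;generateContent&lt;/code&gt; endpoint should let you script it. This is an unverified sketch - in particular the &lt;code&gt;imageConfig&lt;/code&gt; and &lt;code&gt;imageSize&lt;/code&gt; fields are my assumption based on the image generation docs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# sketch only: imageConfig/imageSize are assumptions from the docs
curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro-image-preview:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "contents": [{"parts": [{"text": "Create an image of a three-dimensional pancake in the shape of a skull, garnished on top with blueberries and maple syrup."}]}],
    "generationConfig": {
      "responseModalities": ["TEXT", "IMAGE"],
      "imageConfig": {"imageSize": "4K"}
    }
  }'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated image should come back base64-encoded in an &lt;code&gt;inlineData&lt;/code&gt; part of the response.&lt;/p&gt;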
&lt;h4 id="creating-an-infographic"&gt;Creating an infographic&lt;/h4&gt;
&lt;p&gt;So this thing is great at following instructions. How about rendering text?&lt;/p&gt;
&lt;p&gt;I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). &lt;a href="https://gemini.google.com/share/d40fe391f309"&gt;Here's a share link&lt;/a&gt; - my prompt was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Infographic explaining how the Datasette open source project works&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nano-banana-datasette.jpg" alt="Described by Gemini 3 Pro: A technical architecture diagram titled &amp;quot;How Datasette Works: From Raw Data to Explorable API,&amp;quot; illustrating a workflow from left to right. 1. &amp;quot;RAW DATA SOURCES&amp;quot; include &amp;quot;CSV&amp;quot;, &amp;quot;JSON&amp;quot;, &amp;quot;Excel (XLSX)&amp;quot;, and &amp;quot;Log Files&amp;quot;. 2. These flow into &amp;quot;DATA PREPARATION &amp;amp; CONVERSION&amp;quot; using tools &amp;quot;csvs-to-sqlite&amp;quot; and &amp;quot;sqlite-utils&amp;quot; to create a &amp;quot;SQLite DATABASE&amp;quot;. 3. This feeds into the central &amp;quot;DATASETTE APPLICATION CORE,&amp;quot; a stack comprising &amp;quot;Data Ingestion (Read-Only)&amp;quot;, &amp;quot;Query Engine (SQL)&amp;quot;, &amp;quot;API Layer (JSON)&amp;quot;, and &amp;quot;Web UI Rendering&amp;quot;. 4. A &amp;quot;PLUGIN ECOSYSTEM&amp;quot; connects below the core, listing &amp;quot;Vega Charts&amp;quot;, &amp;quot;Cluster Maps&amp;quot;, &amp;quot;Full-Text Search&amp;quot;, and &amp;quot;Custom Renderers&amp;quot;. 5. To the right, the core branches into &amp;quot;OUTPUT &amp;amp; INTERFACES&amp;quot;: an &amp;quot;Interactive Web Interface (Explore, Filter, Facet)&amp;quot;, a &amp;quot;JSON API (Programmatic Access)&amp;quot; showing a snippet {&amp;quot;data&amp;quot;: [...]}, and a &amp;quot;SQL EDITOR Custom SQL Queries&amp;quot; showing SELECT * FROM.... 6. The API output connects to &amp;quot;PUBLISHING &amp;amp; DEPLOYMENT&amp;quot; via a terminal command datasette publish cloudrun my.db leading to deployment targets &amp;quot;Heroku&amp;quot;, &amp;quot;Google Cloud Run&amp;quot;, &amp;quot;Fly.io&amp;quot;, and &amp;quot;Vercel&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close enough for an infographic.&lt;/p&gt;
&lt;p&gt;"Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me.&lt;/p&gt;
&lt;p&gt;Here's the thinking trace for that image generation process:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Initiating Datasette Infographic&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Developing the Datasette workflow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Visualizing Datasette Workflow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Defining Core Datasette Stages&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Constructing Datasette Infographic&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Analyzing Datasette's Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mapping the Datasette Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="detecting-generated-images-with-synthid"&gt;Detecting generated images with SynthID&lt;/h4&gt;
&lt;p&gt;This note from the &lt;a href="https://blog.google/technology/ai/nano-banana-pro/"&gt;announcement post&lt;/a&gt; caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible &lt;a href="https://deepmind.google/models/synthid/"&gt;SynthID&lt;/a&gt; digital watermark.&lt;/p&gt;
&lt;p&gt;Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that to the Gemini app and asked "Was this image created with AI?":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nano-banana-detected.jpg" alt="Screenshot of a mobile chat interface displaying a conversation about AI image detection. The user has uploaded a photo showing two raccoons on a porch; one raccoon reaches inside a paper bag a bench while the other stands on the ground looking up at it. The conversation title reads &amp;quot;AI Image Creation Confirmed&amp;quot;. The user asks, &amp;quot;Was this image created with ai?&amp;quot; The AI response, labeled &amp;quot;Analysis &amp;amp; 1 more&amp;quot;, states: &amp;quot;Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image.&amp;quot;" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nano-banana"&gt;nano-banana&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="text-to-image"/><category term="llm-release"/><category term="nano-banana"/></entry><entry><title>Building more with GPT-5.1-Codex-Max</title><link href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag" rel="alternate"/><published>2025-11-19T23:15:10+00:00</published><updated>2025-11-19T23:15:10+00:00</updated><id>https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-codex-max/"&gt;Building more with GPT-5.1-Codex-Max&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Hot on the heels of yesterday's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3 Pro release&lt;/a&gt; comes a new model from OpenAI called GPT-5.1-Codex-Max.&lt;/p&gt;
&lt;p&gt;(Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!)&lt;/p&gt;
&lt;p&gt;It's currently only available through their &lt;a href="https://developers.openai.com/codex/cli/"&gt;Codex CLI coding agent&lt;/a&gt;, where it's the new default model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces. Unlike GPT‑5.1, which is a general-purpose model, we recommend using GPT‑5.1-Codex-Max and the Codex family of models only for agentic coding tasks in Codex or Codex-like environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's not available via the API yet but should be shortly.&lt;/p&gt;
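&lt;p&gt;You can try it from the terminal in the meantime. A sketch, assuming the Codex CLI's current &lt;code&gt;-m&lt;/code&gt; and &lt;code&gt;-c&lt;/code&gt; flags (check &lt;code&gt;codex --help&lt;/code&gt;) - &lt;code&gt;xhigh&lt;/code&gt; is the new thinking level from this announcement:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# flags assumed from the Codex CLI docs; xhigh per this announcement
codex -m gpt-5.1-codex-max -c model_reasoning_effort="xhigh" \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;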
&lt;p&gt;The timing of this release is interesting given that Gemini 3 Pro appears to have &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#benchmarks"&gt;aced almost all of the benchmarks&lt;/a&gt; just yesterday. It's reminiscent of the period in 2024 when OpenAI consistently made big announcements that happened to coincide with Gemini releases.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported &lt;a href="https://openai.com/index/introducing-swe-bench-verified/"&gt;SWE-Bench Verified&lt;/a&gt; score is particularly notable: 76.5% for thinking level "high" and 77.9% for the new "xhigh". That was the one benchmark where Gemini 3 Pro was out-performed by Claude Sonnet 4.5 - Gemini 3 Pro got 76.2% and Sonnet 4.5 got 77.2%. OpenAI now have the highest scoring model there by a full 0.7 of a percentage point!&lt;/p&gt;
&lt;p&gt;They also report a score of 58.1% on &lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0"&gt;Terminal Bench 2.0&lt;/a&gt;, beating Gemini 3 Pro's 54.2% (and Sonnet 4.5's 42.8%.)&lt;/p&gt;
&lt;p&gt;The most intriguing part of this announcement concerns the model's approach to long context problems:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.1-Codex-Max is built for long-running, detailed work. It’s our first model natively trained to operate across multiple context windows through a process called &lt;em&gt;compaction&lt;/em&gt;, coherently working over millions of tokens in a single task. [...]&lt;/p&gt;
&lt;p&gt;Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a lot of confusion &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;on Hacker News&lt;/a&gt; about what this actually means. Claude Code already does a version of compaction, automatically summarizing previous turns when the context runs out. Does this just mean that Codex-Max is better at that process?&lt;/p&gt;
&lt;p&gt;I had it draw me a couple of pelicans by typing "Generate an SVG of a pelican riding a bicycle" directly into the Codex CLI tool. Here's thinking level medium:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A flat-style illustration shows a white, round-bodied bird with an orange beak pedaling a red-framed bicycle with thin black wheels along a sandy beach, with a calm blue ocean and clear sky in the background." src="https://static.simonwillison.net/static/2025/codex-max-medium.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's thinking level "xhigh":&lt;/p&gt;
&lt;p&gt;&lt;img alt="A plump white bird with an orange beak and small black eyes crouches low on a blue bicycle with oversized dark wheels, shown racing forward with motion lines against a soft gradient blue sky." src="https://static.simonwillison.net/static/2025/codex-max-xhigh.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I also tried xhigh on my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;longer pelican test prompt&lt;/a&gt;, which came out like this:&lt;/p&gt;
&lt;p id="advanced-pelican-codex-max"&gt;&lt;img alt="A stylized dark gray bird with layered wings, a yellow head crest, and a long brown beak leans forward in a racing pose on a black-framed bicycle, riding across a glossy blue surface under a pale sky." src="https://static.simonwillison.net/static/2025/codex-breeding-max-xhigh.jpg"&gt;&lt;/p&gt;

&lt;p&gt;Also today: &lt;a href="https://x.com/openai/status/1991266192905179613"&gt;GPT-5.1 Pro is rolling out today to all Pro users&lt;/a&gt;. According to the &lt;a href="https://help.openai.com/en/articles/6825453-chatgpt-release-notes"&gt;ChatGPT release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5.1 Pro is rolling out today for all ChatGPT Pro users and is available in the model picker. GPT-5 Pro will remain available as a legacy model for 90 days before being retired.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a pretty fast deprecation cycle for the GPT-5 Pro model that was released just three months ago.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="november-2025-inflection"/><category term="llm-release"/><category term="gpt-5"/><category term="codex-cli"/><category term="generative-ai"/><category term="openai"/><category term="ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="evals"/><category term="gpt-codex"/></entry><entry><title>Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark</title><link href="https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag" rel="alternate"/><published>2025-11-18T19:00:48+00:00</published><updated>2025-11-18T19:00:48+00:00</updated><id>https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google released Gemini 3 Pro today. Here's &lt;a href="https://blog.google/products/gemini/gemini-3/"&gt;the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu&lt;/a&gt;, their &lt;a href="https://blog.google/technology/developers/gemini-3-developers/"&gt;developer blog announcement from Logan Kilpatrick&lt;/a&gt;, the &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;Gemini 3 Pro Model Card&lt;/a&gt;, and their &lt;a href="https://blog.google/products/gemini/gemini-3-collection/"&gt;collection of 11 more articles&lt;/a&gt;. It's a big release!&lt;/p&gt;
&lt;p&gt;I had a few days of preview access to this model via &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;. The best way to describe it is that it's &lt;strong&gt;Gemini 2.5 upgraded to match the leading rival models&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video.&lt;/p&gt;
&lt;h4 id="benchmarks"&gt;Benchmarks&lt;/h4&gt;
&lt;p&gt;Google's own reported numbers (in &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;the model card&lt;/a&gt;) show it scoring slightly higher than Claude Sonnet 4.5 and GPT-5.1 on most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg" alt="Table of benchmark numbers, described in full below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pricing"&gt;Pricing&lt;/h4&gt;
&lt;p&gt;In terms of pricing it's a little more expensive than Gemini 2.5 but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
      &lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.1&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $1.25&lt;br /&gt;
        &amp;gt; 200k tokens: $2.50
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $10.00&lt;br /&gt;
        &amp;gt; 200k tokens: $15.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Gemini 3 Pro&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $2.00&lt;br /&gt;
        &amp;gt; 200k tokens: $4.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $12.00&lt;br /&gt;
        &amp;gt; 200k tokens: $18.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $3.00&lt;br /&gt;
        &amp;gt; 200k tokens: $6.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $15.00&lt;br /&gt;
        &amp;gt; 200k tokens: $22.50
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.1&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;$75.00&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
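&lt;p&gt;To make that concrete, here's what a 100,000 input / 10,000 output token call costs under the 200k threshold, using the table above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(round(100_000 * 2 / 1e6 + 10_000 * 12 / 1e6, 2))'  # Gemini 3 Pro: 0.32
python3 -c 'print(round(100_000 * 3 / 1e6 + 10_000 * 15 / 1e6, 2))'  # Claude Sonnet 4.5: 0.45
&lt;/code&gt;&lt;/pre&gt;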
&lt;h4 id="trying-it-out-against-a-complex-image"&gt;Trying it out against a complex image&lt;/h4&gt;
&lt;p&gt;That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's what I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Humanity's Last Exam (Academic reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.&lt;/li&gt;
&lt;li&gt;With search and code execution: Gemini 3 Pro 45.8% (others have no data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPQA Diamond (Scientific knowledge; No tools)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;AIME 2025 (Mathematics)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.&lt;/li&gt;
&lt;li&gt;With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MathArena Apex (Challenging Math Contest problems)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMU-Pro (Multimodal understanding and reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ScreenSpot-Pro (Screen understanding)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CharXiv Reasoning (Information synthesis from complex charts)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Video-MMMU (Knowledge acquisition from videos)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SWE-Bench Verified (Agentic coding; Single attempt)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;t2-bench (Agentic tool use)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SimpleQA Verified (Parametric knowledge)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMLU (Multilingual Q&amp;amp;A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Global PIQA (Commonsense reasoning across 100 Languages and Cultures)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MRCR v2 (8-needle) (Long context performance)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.&lt;/li&gt;
&lt;li&gt;1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have not checked every line of this but a loose spot-check looks accurate to me.&lt;/p&gt;
&lt;p&gt;That prompt took 1,105 input and 3,901 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=1105&amp;amp;cit=3901&amp;amp;ot=3901&amp;amp;ic=2&amp;amp;oc=12&amp;amp;sel=gemini-3-pro-preview"&gt;5.6824 cents&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this follow-up prompt (the &lt;code&gt;-c&lt;/code&gt; option continues the most recent conversation):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -c 'Convert to JSON'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/ea7d52706557528e7eb3912cdf9250b0#response-1"&gt;the full output here&lt;/a&gt;, which starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"columns"&lt;/span&gt;: [
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Benchmark&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Description&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 3 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 2.5 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Claude Sonnet 4.5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;GPT-5.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    ]
  },
  &lt;span class="pl-ent"&gt;"benchmarks"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Humanity's Last Exam&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Academic reasoning&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"sub_results"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;No tools&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21.6%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;13.7%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        },
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;With search and code execution&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45.8%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        }
      ]
    },&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="analyzing-a-city-council-meeting"&gt;Analyzing a city council meeting&lt;/h4&gt;
&lt;p&gt;To try it out against an audio file I extracted the 3h33m of audio from the video &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0"&gt;Half Moon Bay City Council Meeting - November 4, 2025&lt;/a&gt;. I used &lt;code&gt;yt-dlp&lt;/code&gt; to get that audio:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;yt-dlp -x --audio-format m4a &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://www.youtube.com/watch?v=qgJ7x7R6gy0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That gave me a 74MB m4a file, which I ran through Gemini 3 Pro like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a /tmp/HMBCC\ 11⧸4⧸25\ -\ Half\ Moon\ Bay\ City\ Council\ Meeting\ -\ November\ 4,\ 2025\ \[qgJ7x7R6gy0\].m4a 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That failed with an "Internal error encountered" message, so I shrunk the file down to a more manageable 38MB using &lt;code&gt;ffmpeg&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ffmpeg -i &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -ac 1 -ar 22050 -c:a aac -b:a 24k &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB_compressed.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then ran it again like this (for some reason I had to use &lt;code&gt;--attachment-type&lt;/code&gt; this time):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time it worked! The &lt;a href="https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943314"&gt;full output is here&lt;/a&gt;, but it starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here is the transcript of the Half Moon Bay City Council meeting.&lt;/p&gt;
&lt;h4&gt;Meeting Outline&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;1. Call to Order, Updates, and Public Forum&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:00:00 - 00:13:25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Consent Calendar&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:13:25 - 00:15:15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ordinance Introduction: Commercial Vitality (Item 9A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:15:15 - 00:30:45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Ordinance Introduction: Building Standards &amp;amp; Electrification (Item 9B)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling (&lt;em&gt;California Restaurant Association v. City of Berkeley&lt;/em&gt;). &lt;strong&gt;Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:30:45 - 00:45:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;5. Housing Element Update &amp;amp; Adoption (Item 9C)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. &lt;strong&gt;There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure.&lt;/strong&gt; Public speakers debate the enforceability of Measure D. &lt;strong&gt;Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law.&lt;/strong&gt; The Council votes to adopt the element but strikes the language committing to a ballot measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:45:00 - 01:05:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h4&gt;Transcript&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [00:00:00]
Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Victor Hernandez (Interpreter)&lt;/strong&gt; [00:00:35]
Thank you, Mr. Mayor, City Council, all city staff, members of the public. &lt;em&gt;[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]&lt;/em&gt; Thank you very much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]".&lt;/p&gt;
&lt;p&gt;I haven't spot-checked the entire 3hr33m meeting, but I've confirmed that the timestamps do not line up. The transcript closes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [01:04:00]
Meeting adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That actually happens &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0&amp;amp;t=3h31m5s"&gt;at 3h31m5s&lt;/a&gt; and the mayor says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said.&lt;/p&gt;
&lt;p&gt;This took 320,087 input tokens and 7,870 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=320087&amp;amp;ot=7870&amp;amp;ic=4&amp;amp;oc=18"&gt;$1.42&lt;/a&gt;.&lt;/p&gt;
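&lt;p&gt;As a quick sanity check of that number, here's the arithmetic as a Python sketch, using the $4/million input and $18/million output long-context rates encoded in that llm-prices.com link:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;input_tokens, output_tokens = 320_087, 7_870

# Long-context rates from the llm-prices.com link above:
# $4/million input, $18/million output (applies past 200,000 input tokens)
cost = (input_tokens * 4 + output_tokens * 18) / 1_000_000
print(f"${cost:.2f}")  # $1.42
&lt;/code&gt;&lt;/pre&gt;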
&lt;h4 id="and-a-new-pelican-benchmark"&gt;And a new pelican benchmark&lt;/h4&gt;
&lt;p&gt;Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt; prompt at both levels.&lt;/p&gt;
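&lt;p&gt;If you're calling the API directly rather than through LLM, here's a minimal sketch of setting that level with the google-genai Python SDK - the &lt;code&gt;thinking_level&lt;/code&gt; parameter name is the one Google's Gemini 3 documentation uses:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Generate an SVG of a pelican riding a bicycle",
    # thinking_level defaults to "high" for Gemini 3 Pro
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low")
    ),
)
print(response.text)
&lt;/code&gt;&lt;/pre&gt;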
&lt;p&gt;Here's low - Gemini decided to add a jaunty little hat (with a comment &lt;a href="https://gist.github.com/simonw/70d56ba39b7cbb44985d2384004fc4a0#response"&gt;in the SVG&lt;/a&gt; that says &lt;code&gt;&amp;lt;!-- Hat (Optional Fun Detail) --&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-low.png" alt="The pelican is wearing a blue hat. It has a good beak. The bicycle is a little bit incorrect but generally a good effort." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-high.png" alt="The pelican is not wearing a hat. It has a good beak. The bicycle is accurate and well-drawn." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
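&lt;p&gt;Since I plan to run this against a lot of models, here's a sketch of automating it with LLM's Python API - the model IDs are illustrative and depend on which plugins you have installed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

PROMPT = (
    "Generate an SVG of a California brown pelican riding a bicycle. "
    "The bicycle must have spokes and a correctly shaped bicycle frame. "
    "The pelican must have its characteristic large pouch, and there should "
    "be a clear indication of feathers. The pelican must be clearly pedaling "
    "the bicycle. The image should show the full breeding plumage of the "
    "California brown pelican."
)

# Example model IDs - these depend on the plugins you have installed
for model_id in ("gemini-3-pro-preview", "gpt-5.1"):
    svg = llm.get_model(model_id).prompt(PROMPT).text()
    with open(f"pelican-{model_id}.svg", "w") as fp:
        fp.write(svg)
&lt;/code&gt;&lt;/pre&gt;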
&lt;p&gt;For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/breeding-plumage.jpg" alt="A glorious California brown pelican perched on a rock by the water. It has a yellow tint to its head and a red spot near its throat." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's Gemini 3 Pro's &lt;a href="https://gist.github.com/simonw/2b9930ae1ce6f3f5e9cfe3cb31ec0c0a"&gt;attempt&lt;/a&gt; at high thinking level for that new prompt:&lt;/p&gt;
&lt;p id="advanced-pelican"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-breeding-pelican-high.png" alt="It's clearly a pelican. It has all of the requested features. It looks a bit abstract though." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for good measure, here's that same prompt &lt;a href="https://gist.github.com/simonw/7a655ebe42f3d428d2ea5363dad8067c"&gt;against GPT-5.1&lt;/a&gt; - which produced this dumpy little fellow:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-1-breeding-pelican.png" alt="The pelican is very round. Its body overlaps much of the bicycle. It has a lot of dorky charisma." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And Claude Sonnet 4.5, which &lt;a href="https://gist.github.com/simonw/3296af92e4328dd4740385e6a4a2ac35"&gt;didn't do quite as well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4-5-breeding-pelican.png" alt="Oh dear. It has all of the requested components, but the bicycle is a bit wrong and the pelican is arranged in a very awkward shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Introducing GPT-5.1 for developers</title><link href="https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag" rel="alternate"/><published>2025-11-13T23:59:35+00:00</published><updated>2025-11-13T23:59:35+00:00</updated><id>https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-for-developers/"&gt;Introducing GPT-5.1 for developers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
OpenAI announced GPT-5.1 yesterday, calling it &lt;a href="https://openai.com/index/gpt-5-1/"&gt;a smarter, more conversational ChatGPT&lt;/a&gt;. Today they've added it to their API.&lt;/p&gt;
&lt;p&gt;We actually got four new models today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1"&gt;gpt-5.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-chat-latest"&gt;gpt-5.1-chat-latest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex"&gt;gpt-5.1-codex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex-mini"&gt;gpt-5.1-codex-mini&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a lot of details to absorb here.&lt;/p&gt;
&lt;p&gt;GPT-5.1 introduces a new reasoning effort called "none" (the previous options were minimal, low, medium, and high) - and none is the new default.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes the model behave like a non-reasoning model for latency-sensitive use cases, with the high intelligence of GPT‑5.1 and added bonus of performant tool-calling. Relative to GPT‑5 with 'minimal' reasoning, GPT‑5.1 with no reasoning is better at parallel tool calling (which itself increases end-to-end task completion speed), coding tasks, following instructions, and using search tools---and supports &lt;a href="https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses"&gt;web search⁠&lt;/a&gt; in our API platform.&lt;/p&gt;
&lt;/blockquote&gt;
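&lt;p&gt;Here's a minimal sketch of what that looks like against the Responses API using the OpenAI Python SDK - since "none" is the new default you shouldn't need to set it explicitly, but it's shown here for clarity:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",
    # "none" is the new default reasoning effort for GPT-5.1
    reasoning={"effort": "none"},
    input="Generate an SVG of a pelican riding a bicycle",
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;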
&lt;p&gt;When you DO enable thinking you get to benefit from a new feature called "adaptive reasoning":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On straightforward tasks, GPT‑5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills. On difficult tasks that require extra thinking, GPT‑5.1 remains persistent, exploring options and checking its work in order to maximize reliability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another notable new feature for 5.1 is &lt;a href="https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention"&gt;extended prompt cache retention&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To enable this set &lt;code&gt;"prompt_cache_retention": "24h"&lt;/code&gt; in the API call. Weirdly there's no price increase involved with this at all. I &lt;a href="https://x.com/simonw/status/1989104422832738305"&gt;asked about that&lt;/a&gt; and OpenAI's Steven Heidel &lt;a href="https://x.com/stevenheidel/status/1989113407149314199"&gt;replied&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;with 24h prompt caching we move the caches from gpu memory to gpu-local storage. that storage is not free, but we made it free since it moves capacity from a limited resource (GPUs) to a more abundant resource (storage). then we can serve more traffic overall!&lt;/p&gt;
&lt;/blockquote&gt;
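&lt;p&gt;In Python that might look like this - a sketch using the OpenAI SDK, passing the parameter via &lt;code&gt;extra_body&lt;/code&gt; in case your installed SDK version doesn't expose it as a named argument yet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",
    input="Long prompt with a large shared prefix goes here...",
    # Keep the cached prefix alive for up to 24 hours
    extra_body={"prompt_cache_retention": "24h"},
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;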
&lt;p&gt;The most interesting documentation I've seen so far is in the new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide"&gt;5.1 cookbook&lt;/a&gt;, which also includes details of the new &lt;code&gt;shell&lt;/code&gt; and &lt;code&gt;apply_patch&lt;/code&gt; built-in tools. The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py"&gt;apply_patch.py implementation&lt;/a&gt; is worth a look, especially if you're interested in the advancing state-of-the-art of file editing tools for LLMs.&lt;/p&gt;
&lt;p&gt;I'm still working on &lt;a href="https://github.com/simonw/llm/issues/1300"&gt;integrating the new models into LLM&lt;/a&gt;. The Codex models are Responses-API-only.&lt;/p&gt;
&lt;p&gt;I got this pelican for GPT-5.1 default (no thinking):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle wheels have no spokes at all, the pelican is laying quite flat on it" src="https://static.simonwillison.net/static/2025/gpt-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And this one with reasoning effort set to high:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This bicycle has four spokes per wheel, and the pelican is sitting more upright" src="https://static.simonwillison.net/static/2025/gpt-5.1-high-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;These actually feel like a &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;regression from GPT-5&lt;/a&gt; to me. The bicycles have fewer spokes!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="november-2025-inflection"/><category term="llm"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="gpt-5"/><category term="generative-ai"/><category term="gpt-codex"/></entry><entry><title>Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican</title><link href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#atom-tag" rel="alternate"/><published>2025-11-09T03:31:34+00:00</published><updated>2025-11-09T03:31:34+00:00</updated><id>https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI partially released a new model yesterday called GPT-5-Codex-Mini, which they &lt;a href="https://x.com/OpenAIDevs/status/1986861734619947305"&gt;describe&lt;/a&gt; as "a more compact and cost-efficient version of GPT-5-Codex". It's currently only available via their Codex CLI tool and VS Code extension, with proper API access "&lt;a href="https://x.com/OpenAIDevs/status/1986861736041853368"&gt;coming soon&lt;/a&gt;". I decided to use Codex to reverse engineer the Codex CLI tool and give me the ability to prompt the new model directly.&lt;/p&gt;
&lt;p&gt;I made &lt;a href="https://www.youtube.com/watch?v=9o1_DL9uNlM"&gt;a video&lt;/a&gt; talking through my progress and demonstrating the final results.&lt;/p&gt;

&lt;p&gt;&lt;lite-youtube videoid="9o1_DL9uNlM" js-api="js-api" title="Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican" playlabel="Play: Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican"&gt; &lt;/lite-youtube&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#this-is-a-little-bit-cheeky"&gt;This is a little bit cheeky&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#codex-cli-is-written-in-rust"&gt;Codex CLI is written in Rust&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#iterating-on-the-code"&gt;Iterating on the code&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#let-s-draw-some-pelicans"&gt;Let's draw some pelicans&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#bonus-the-debug-option"&gt;Bonus: the --debug option&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="this-is-a-little-bit-cheeky"&gt;This is a little bit cheeky&lt;/h4&gt;
&lt;p&gt;OpenAI clearly don't intend for people to access this model directly just yet. It's available exclusively through Codex CLI which is a privileged application - it gets to access a special backend API endpoint that's not publicly documented, and it uses a special authentication mechanism that bills usage directly to the user's existing ChatGPT account.&lt;/p&gt;
&lt;p&gt;I figured reverse-engineering that API directly would be somewhat impolite. But... Codex CLI is an open source project released under an Apache 2.0 license. How about upgrading that to let me run my own prompts through its existing API mechanisms instead?&lt;/p&gt;
&lt;p&gt;This felt like a somewhat absurd loophole, and I couldn't resist trying it out and seeing what happened.&lt;/p&gt;
&lt;h4 id="codex-cli-is-written-in-rust"&gt;Codex CLI is written in Rust&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://github.com/openai/codex"&gt;openai/codex&lt;/a&gt; repository contains the source code for the Codex CLI tool, which OpenAI rewrote in Rust just a few months ago.&lt;/p&gt;
&lt;p&gt;I don't know much Rust at all.&lt;/p&gt;
&lt;p&gt;I made my own clone on GitHub and checked it out locally:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone git@github.com:simonw/codex
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; codex&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I fired up Codex itself (in dangerous mode, because I like living dangerously):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;codex --dangerously-bypass-approvals-and-sandbox&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Figure out how to build the rust version of this tool and then build it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked. It churned away for a bit and figured out how to build itself. This is a useful starting point for a project like this - in figuring out the compile step the coding agent gets seeded with a little bit of relevant information about the project, and if it can compile that means it can later partially test the code it is writing while it works.&lt;/p&gt;
&lt;p&gt;Once the compile had succeeded I fed it the design for the new feature I wanted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a new sub-command to the Rust tool called "codex prompt"&lt;/p&gt;
&lt;p&gt;codex prompt "prompt goes here" - this runs the given prompt directly against the OpenAI API that Codex uses, with the same code path and authentication credentials as the rest of Codex.&lt;/p&gt;
&lt;p&gt;codex prompt -m &amp;lt;model_name&amp;gt; "prompt goes here" - same again but lets you set the model&lt;/p&gt;
&lt;p&gt;codex prompt "prompt goes here" -s/--system "system prompt goes here" - runs with a custom system/developer message&lt;/p&gt;
&lt;p&gt;codex prompt --models - this lists all available models that can be used with the "codex prompt" command&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My design for this new sub-command is lifted directly from my own &lt;a href="https://llm.datasette.io/en/stable/help.html#llm-prompt-help"&gt;llm command&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="iterating-on-the-code"&gt;Iterating on the code&lt;/h4&gt;
&lt;p&gt;Codex got to work. I've &lt;a href="https://gistpreview.github.io/?ddabbff092bdd658e06d8a2e8f142098"&gt;shared the full transcript here&lt;/a&gt; (using the tool &lt;a href="https://simonwillison.net/2025/Oct/23/claude-code-for-web-video/"&gt;I described here&lt;/a&gt;), but this TODO list it made itself is a useful summary of what it decided to do:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Updated Plan
└ ✔ Inspect CLI structure and relevant core utilities for sending prompts and listing models
  ✔ Implement new `codex prompt` subcommand (CLI plumbing + command handler)
  ✔ Format, lint, and test (`just fmt`, `just fix -p codex-cli`, `cargo test -p codex-cli`)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I like that it figured out &lt;a href="https://github.com/openai/codex/blob/a47181e471b6efe55e95f9858c913fc89a3a44fa/codex-rs/justfile"&gt;the justfile&lt;/a&gt; in the repo and decided to use it to run formatting and linting commands without me needing to tell it to. (Update: it turns out that was dictated by the &lt;a href="https://github.com/openai/codex/blob/f8b30af6dc275b3e64de5f1987e6cafe604cb72a/AGENTS.md"&gt;AGENTS.md&lt;/a&gt; file.)&lt;/p&gt;
&lt;p&gt;I tried running the first version of the code it wrote like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -m gpt-5-codex-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... and it didn't quite work. I got this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(reasoning summary) **Seeking
(reasoning summary)  instructions
(reasoning summary)  and
(reasoning summary)  sandbox
(reasoning summary)  info
(reasoning summary) **
(reasoning summary) **Dec
(reasoning summary) iding
(reasoning summary)  on
(reasoning summary)  SVG
(reasoning summary)  creation
(reasoning summary)  approach
(reasoning summary) **
(reasoning summary) **Checking
(reasoning summary)  current
(reasoning summary)  directory
(reasoning summary) **
(reasoning summary) **Preparing
(reasoning summary)  to
(reasoning summary)  check
(reasoning summary)  current
(reasoning summary)  directory
(reasoning summary) **
I�m ready to help�what would you like me to do next?I�m ready to help�what would you like me to do next?
Token usage: total=2459 input=2374 cached_input=0 output=85 reasoning_output=64
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that it DID think about SVG creation, but then decided it should look at the current directory. This isn't what I want - it appeared to be running in Codex's normal mode with a system prompt telling it to edit files on disk. I wanted it to respond to the prompt without acting as if it had a full workspace available to it.&lt;/p&gt;
&lt;p&gt;I prompted it again, pasting in the broken output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;this did not seem to work - here is what happened - note that the reasoning summary comes out on a lot of lines when I would rather it streams out to stderr on the same line (flushing constantly) - and then despite the reasoning summary thinking about pelican on SVG it did not actually act on that and display the result, which is very surprising. [pasted content]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It churned for a bit more, but from the output it looked like it was still trying to include Codex-specific tools and workspace information. So I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that this new 'codex prompt' command should NOT be able to execute any tools at all, and should not have to think about the workspace or sandbox - it should send a single prompt with no tools and then stream and display a response and then stop.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This got me a new build, but when I ran it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Error: unexpected status 400 Bad Request: {"detail":"Instructions are not valid"}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My guess is that OpenAI's custom backend API for Codex doesn't like it if you attempt to call it &lt;em&gt;without&lt;/em&gt; passing in the expected instructions. Fair enough!&lt;/p&gt;
&lt;p&gt;Rather than paste that error in I got Codex to try the command itself, so it could see the error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You try running this and see what happens: ./target/debug/codex prompt 'Generate an SVG of a pelican riding a bicycle'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex switched back to the default instructions. From its reply:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cli/src/prompt_cmd.rs#L1-L320&lt;/code&gt; now forces a tool-free request: we keep the workspace defaults, inject a lightweight system/developer message (“don’t run tools”), disable apply_patch/web search via config overrides, and build the outbound prompt as just that system span plus the user text.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I guess that will have to do. I asked for one more feature:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a --debug option to the codex prompt command which causes the full JSON request and response to be printed to stderr, plus the URL that is being accessed and the HTTP verb&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and we're ready to try this thing out!&lt;/p&gt;
&lt;p&gt;Notably I haven't written a single line of Rust myself here and paid almost no attention to what it was actually doing. My main contribution was to run the binary every now and then to see if it was doing what I needed yet.&lt;/p&gt;
&lt;p&gt;I've pushed the working code to &lt;a href="https://github.com/simonw/codex/compare/a47181e471b6efe55e95f9858c913fc89a3a44fa...ae5f98a9248a8edb5d3c53261273a482fc0b5306"&gt;a prompt-subcommand branch in my repo&lt;/a&gt; if you want to take a look and see how it all works.&lt;/p&gt;

&lt;h4 id="let-s-draw-some-pelicans"&gt;Let's draw some pelicans&lt;/h4&gt;
&lt;p&gt;With the final version of the code built, I drew some pelicans. Here's the &lt;a href="https://gistpreview.github.io/?a11f9ac456d2b2bc3715ba900ef1203d"&gt;full terminal transcript&lt;/a&gt; - these are the highlights.&lt;/p&gt;
&lt;p&gt;This is with the default GPT-5-Codex model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I pasted it into my &lt;a href="https://tools.simonwillison.net/svg-render"&gt;tools.simonwillison.net/svg-render&lt;/a&gt; tool and got the following:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-default.png" alt="It's a dumpy little pelican with a weird face, not particularly great" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I ran it again for GPT-5:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -m gpt-5&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-gpt-5.png" alt="Much better bicycle, pelican is a bit line-drawing-ish but does have the necessary parts in the right places" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And now the moment of truth... GPT-5 Codex Mini!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -m gpt-5-codex-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-mini.png" alt="This is terrible. The pelican is an abstract collection of shapes, the bicycle is likewise very messed up" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I don't think I'll be adding that one to my SVG drawing toolkit any time soon.&lt;/p&gt;

&lt;h4 id="bonus-the-debug-option"&gt;Bonus: the --debug option&lt;/h4&gt;
&lt;p&gt;I had Codex add a &lt;code&gt;--debug&lt;/code&gt; option to help me see exactly what was going on.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt -m gpt-5-codex-mini &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --debug&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The output starts like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[codex prompt debug] POST https://chatgpt.com/backend-api/codex/responses
[codex prompt debug] Request JSON:
&lt;/code&gt;&lt;/pre&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"model"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;gpt-5-codex-mini&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"instructions"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;You are Codex, based on GPT-5. You are running as a coding agent ...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"input"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;message&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;developer&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;input_text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;You are a helpful assistant. Respond directly to the user request without running tools or shell commands.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        }
      ]
    },
    {
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;message&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;user&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;input_text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        }
      ]
    }
  ],
  &lt;span class="pl-ent"&gt;"tools"&lt;/span&gt;: [],
  &lt;span class="pl-ent"&gt;"tool_choice"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;auto&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"parallel_tool_calls"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"reasoning"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"summary"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;auto&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  &lt;span class="pl-ent"&gt;"store"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"stream"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"include"&lt;/span&gt;: [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reasoning.encrypted_content&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  ],
  &lt;span class="pl-ent"&gt;"prompt_cache_key"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;019a66bf-3e2c-7412-b05e-db9b90bbad6e&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This reveals that OpenAI's private API endpoint for Codex CLI is &lt;code&gt;https://chatgpt.com/backend-api/codex/responses&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Also interesting is how the &lt;code&gt;"instructions"&lt;/code&gt; key (truncated above, &lt;a href="https://gist.github.com/simonw/996388ecf785ad54de479315bd4d33b7"&gt;full copy here&lt;/a&gt;) contains the default instructions, without which the API appears not to work - but it also shows that you can send a message with &lt;code&gt;role="developer"&lt;/code&gt; in advance of your user prompt.&lt;/p&gt;
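&lt;p&gt;The same pattern works against the public Responses API too - here's a sketch of sending a developer-role message ahead of the user prompt with the OpenAI Python SDK:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input=[
        {
            "role": "developer",
            "content": "You are a helpful assistant. Respond directly "
            "to the user request without running tools or shell commands.",
        },
        {
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle",
        },
    ],
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;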
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="rust"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="vibe-coding"/><category term="coding-agents"/><category term="gpt-5"/><category term="codex-cli"/><category term="gpt-codex"/></entry><entry><title>Kimi K2 Thinking</title><link href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag" rel="alternate"/><published>2025-11-06T23:53:06+00:00</published><updated>2025-11-06T23:53:06+00:00</updated><id>https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Thinking"&gt;Kimi K2 Thinking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Chinese AI lab Moonshot's Kimi K2 established itself as one of the largest open weight models - 1 trillion parameters - &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;back in July&lt;/a&gt;. They've now released the Thinking version, also a trillion parameters (MoE, 32B active) and also under their custom modified (so &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/#kimi-license"&gt;not quite open source&lt;/a&gt;) MIT license.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is only 594GB on Hugging Face - Kimi K2 was 1.03TB - which I think is due to the new INT4 quantization. This makes the model both cheaper and faster to host.&lt;/p&gt;
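&lt;p&gt;The numbers roughly check out - a back-of-envelope sketch, ignoring the embedding layers and other tensors that are typically kept at higher precision:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;params = 1_000_000_000_000  # 1 trillion parameters

# INT4 packs two parameters per byte; the extra ~100GB in the real
# 594GB checkpoint is plausibly higher-precision tensors and metadata
int4_gb = params * 0.5 / 1e9
print(f"~{int4_gb:.0f} GB")  # ~500 GB
&lt;/code&gt;&lt;/pre&gt;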
&lt;p&gt;So far the only people hosting it are Moonshot themselves. I tried it out both via &lt;a href="https://platform.moonshot.ai"&gt;their own API&lt;/a&gt; and via &lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking/providers"&gt;the OpenRouter proxy to it&lt;/a&gt;, via the &lt;a href="https://github.com/ghostofpokemon/llm-moonshot"&gt;llm-moonshot&lt;/a&gt; plugin (by NickMystic) and my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin respectively.&lt;/p&gt;
&lt;p&gt;The buzz around this model so far is very positive. Could this be the first open weight model that's competitive with the latest from OpenAI and Anthropic, especially for long-running agentic tool call sequences?&lt;/p&gt;
&lt;p&gt;Moonshot AI's &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html"&gt;self-reported benchmark scores&lt;/a&gt; show K2 Thinking beating the top OpenAI and Anthropic models (GPT-5 and Sonnet 4.5 Thinking) at "Agentic Reasoning" and "Agentic Search" but not quite top for "Coding":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison bar chart showing agentic reasoning, search, and coding benchmark performance scores across three AI systems (K, OpenAI, and AI) on tasks including Humanity's Last Exam (44.9, 41.7, 32.0), BrowseComp (60.2, 54.9, 24.1), Seal-0 (56.3, 51.4, 53.4), SWE-Multilingual (61.1, 55.3, 68.0), SWE-bench Verified (71.3, 74.9, 77.2), and LiveCodeBench V6 (83.1, 87.0, 64.0), with category descriptions including &amp;quot;Expert-level questions across subjects&amp;quot;, &amp;quot;Agentic search &amp;amp; browsing&amp;quot;, &amp;quot;Real-world latest information collection&amp;quot;, &amp;quot;Agentic coding&amp;quot;, and &amp;quot;Competitive programming&amp;quot;." src="https://static.simonwillison.net/static/2025/kimi-k2-thinking-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran a couple of pelican tests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5 described this as: Cartoon illustration of a white duck or goose with an orange beak and gray wings riding a bicycle with a red frame and light blue wheels against a light blue background." src="https://static.simonwillison.net/static/2025/k2-thinking.png" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter # paste key
llm -m openrouter/moonshotai/kimi-k2-thinking \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5: Minimalist cartoon illustration of a white bird with an orange beak and feet standing on a triangular-framed penny-farthing style bicycle with gray-hubbed wheels and a propeller hat on its head, against a light background with dotted lines and a brown ground line." src="https://static.simonwillison.net/static/2025/k2-thinking-openrouter.png" /&gt;&lt;/p&gt;
&lt;p&gt;Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1986541785511043536"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CNBC quoted a source who &lt;a href="https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html"&gt;provided the training price&lt;/a&gt; for the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Kimi K2 Thinking model cost $4.6 million to train, according to a source familiar with the matter. [...] CNBC was unable to independently verify the DeepSeek or Kimi figures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLX developer Awni Hannun &lt;a href="https://x.com/awnihannun/status/1986601104130646266"&gt;got it working&lt;/a&gt; on two 512GB M3 Ultra Mac Studios:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality!&lt;/p&gt;
&lt;p&gt;The model was quantization aware trained (qat) at int4.&lt;/p&gt;
&lt;p&gt;Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://huggingface.co/mlx-community/Kimi-K2-Thinking"&gt;the 658GB mlx-community model&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="moonshot"/><category term="llm-release"/><category term="generative-ai"/><category term="openrouter"/><category term="kimi"/><category term="artificial-analysis"/><category term="mlx"/></entry><entry><title>Introducing SWE-1.5: Our Fast Agent Model</title><link href="https://simonwillison.net/2025/Oct/29/swe-15/#atom-tag" rel="alternate"/><published>2025-10-29T23:59:20+00:00</published><updated>2025-10-29T23:59:20+00:00</updated><id>https://simonwillison.net/2025/Oct/29/swe-15/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cognition.ai/blog/swe-1-5"&gt;Introducing SWE-1.5: Our Fast Agent Model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here's the second fast coding model released by a coding agent IDE on the same day - the first was &lt;a href="https://simonwillison.net/2025/Oct/29/cursor-composer/"&gt;Composer-1 by Cursor&lt;/a&gt;. This time it's Windsurf releasing SWE-1.5:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today we’re releasing SWE-1.5, the latest in our family of models optimized for software engineering. It is a frontier-size model with hundreds of billions of parameters that achieves near-SOTA coding performance. It also sets a new standard for speed: we partnered with Cerebras to serve it at up to 950 tok/s – 6x faster than Haiku 4.5 and 13x faster than Sonnet 4.5.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Like Composer-1 it's only available via their editor, no separate API yet. Also like Composer-1 they don't appear willing to share details of the "leading open-source base model" they based their new model on.&lt;/p&gt;
&lt;p&gt;I asked it to generate an SVG of a pelican riding a bicycle and got this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has a red upside down Y shaped frame, pelican is a bit dumpy, it does at least have a long sharp beak." src="https://static.simonwillison.net/static/2025/swe-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;This one felt &lt;em&gt;really fast&lt;/em&gt;. Partnering with Cerebras for inference is a very smart move.&lt;/p&gt;
&lt;p&gt;They share a lot of details about their training process in the post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;SWE-1.5 is trained on our state-of-the-art cluster of thousands of GB200 NVL72 chips. We believe SWE-1.5 may be the first public production model trained on the new GB200 generation. [...]&lt;/p&gt;
&lt;p&gt;Our RL rollouts require high-fidelity environments with code execution and even web browsing. To achieve this, we leveraged our VM hypervisor &lt;code&gt;otterlink&lt;/code&gt; that  allows us to scale &lt;strong&gt;Devin&lt;/strong&gt; to tens of thousands of concurrent machines (learn more about &lt;a href="https://cognition.ai/blog/blockdiff#why-incremental-vm-snapshots"&gt;blockdiff&lt;/a&gt;). This enabled us to smoothly support very high concurrency and ensure the training environment is aligned with our Devin production environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's &lt;em&gt;another&lt;/em&gt; similarity to Cursor's Composer-1! Cursor talked about how they ran "hundreds of thousands of concurrent sandboxed coding environments in the cloud" in &lt;a href="https://cursor.com/blog/composer"&gt;their description of their RL training&lt;/a&gt; as well.&lt;/p&gt;
&lt;p&gt;This is a notable trend: if you want to build a really great agentic coding tool there's clearly a lot to be said for using reinforcement learning to fine-tune a model against your own custom set of tools using large numbers of sandboxed simulated coding environments as part of that process.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://x.com/zai_org/status/1984076614951420273"&gt;I think it's built on GLM&lt;/a&gt;.

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/cognition/status/1983662838955831372"&gt;@cognition&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-performance"/><category term="llm-release"/><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="glm"/></entry><entry><title>MiniMax M2 &amp; Agent: Ingenious in Simplicity</title><link href="https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag" rel="alternate"/><published>2025-10-29T22:49:47+00:00</published><updated>2025-10-29T22:49:47+00:00</updated><id>https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.minimax.io/news/minimax-m2"&gt;MiniMax M2 &amp;amp; Agent: Ingenious in Simplicity&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
MiniMax M2 was released on Monday 27th October by MiniMax, a Chinese AI lab founded in December 2021.&lt;/p&gt;
&lt;p&gt;It's a very promising model. Their self-reported benchmark scores show it as comparable to Claude Sonnet 4, and Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1982714153375854998"&gt;are ranking it&lt;/a&gt; as the best currently available open weight model according to their intelligence score:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MiniMax’s M2 achieves a new all-time-high Intelligence Index score for an open weights model and offers impressive efficiency with only 10B active parameters (200B total). [...]&lt;/p&gt;
&lt;p&gt;The model’s strengths include tool use and instruction following (as shown by Tau2 Bench and IFBench). As such, while M2 likely excels at agentic use cases it may underperform other open weights leaders such as DeepSeek V3.2 and Qwen3 235B at some generalist tasks. This is in line with a number of recent open weights model releases from Chinese AI labs which focus on agentic capabilities, likely pointing to a heavy post-training emphasis on RL.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The size is particularly significant: the model weights are 230GB &lt;a href="https://huggingface.co/MiniMaxAI/MiniMax-M2"&gt;on Hugging Face&lt;/a&gt;, significantly smaller than other high performing open weight models. That's small enough to run on a 256GB Mac Studio, and the MLX community &lt;a href="https://huggingface.co/mlx-community/MiniMax-M2-8bit"&gt;have that working already&lt;/a&gt;.&lt;/p&gt;
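&lt;p&gt;If you have one of those machines, trying the community conversion with the mlx-lm Python API might look like this - a sketch I haven't run myself, using the mlx-community model ID linked above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch only - needs a Mac with ~256GB of unified memory
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/MiniMax-M2-8bit")
print(generate(
    model,
    tokenizer,
    prompt="Generate an SVG of a pelican riding a bicycle",
    max_tokens=10_000,
))
&lt;/code&gt;&lt;/pre&gt;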
&lt;p&gt;MiniMax offer their own API, and recommend using their Anthropic-compatible endpoint and the official Anthropic SDKs to access it. MiniMax Head of Engineering Skyler Miao
 &lt;a href="https://x.com/SkylerMiao7/status/1982989507252367687"&gt;provided some background on that&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;M2 is a agentic thinking model, it do interleaved thinking like sonnet 4.5, which means every response will contain its thought content.
Its very important for M2 to keep the chain of thought. So we must make sure the history thought passed back to the model.
Anthropic API support it for sure, as sonnet needs it as well. OpenAI only support it in their new Response API, no support for in ChatCompletion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MiniMax are offering the new model via their API for free until November 7th, after which the cost will be $0.30/million input tokens and $1.20/million output tokens - similar in price to Gemini 2.5 Flash and GPT-5 Mini, see &lt;a href="https://www.llm-prices.com/#it=51&amp;amp;ot=4017&amp;amp;sel=minimax-m2%2Cgpt-5-mini%2Cclaude-3-haiku%2Cgemini-2.5-flash-lite%2Cgemini-2.5-flash"&gt;price comparison here&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;I released a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; called &lt;a href="https://github.com/simonw/llm-minimax"&gt;llm-minimax&lt;/a&gt; providing support for M2 via the MiniMax API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-minimax
llm keys set minimax
# Paste key here
llm -m m2 -o max_tokens 10000 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/da79447830dc431c067a93648b338be6"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Biycle is good though obscured by the pelican. Pelican has an impressive triple beak and is stretched along the bicycle frame. Not clear if it can pedal or what it is sitting on." src="https://static.simonwillison.net/static/2025/m2-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;51 input, 4,017 output. At $0.30/m input and $1.20/m output that pelican would cost 0.4836 cents - less than half a cent.&lt;/p&gt;
&lt;p&gt;This is the first plugin I've written for an Anthropic-API-compatible model. I released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.21"&gt;llm-anthropic 0.21&lt;/a&gt; first adding the ability to customize the &lt;code&gt;base_url&lt;/code&gt; parameter when using that model class. This meant the new plugin was less than &lt;a href="https://github.com/simonw/llm-minimax/blob/0.1/llm_minimax.py"&gt;30 lines of Python&lt;/a&gt;.
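&lt;p&gt;The core trick looks something like this - a sketch using the Anthropic Python SDK with a custom &lt;code&gt;base_url&lt;/code&gt;, where the endpoint URL and model ID are illustrative placeholders rather than MiniMax's documented values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from anthropic import Anthropic

# Placeholder values - substitute MiniMax's documented
# Anthropic-compatible endpoint and model name
client = Anthropic(
    base_url="https://api.minimax.example/anthropic",
    api_key="YOUR_MINIMAX_API_KEY",
)

message = client.messages.create(
    model="MiniMax-M2",
    max_tokens=10000,
    messages=[{
        "role": "user",
        "content": "Generate an SVG of a pelican riding a bicycle",
    }],
)
print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;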


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minimax"&gt;minimax&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="local-llms"/><category term="llm"/><category term="minimax"/></entry><entry><title>Composer: Building a fast frontier model with RL</title><link href="https://simonwillison.net/2025/Oct/29/cursor-composer/#atom-tag" rel="alternate"/><published>2025-10-29T20:45:53+00:00</published><updated>2025-10-29T20:45:53+00:00</updated><id>https://simonwillison.net/2025/Oct/29/cursor-composer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cursor.com/blog/composer"&gt;Composer: Building a fast frontier model with RL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Cursor released &lt;a href="https://cursor.com/blog/2-0"&gt;Cursor 2.0 today&lt;/a&gt;, with a refreshed UI focused on agentic coding (and running agents in parallel) and a new model that's unique to Cursor called &lt;strong&gt;Composer&amp;nbsp;1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;As far as I can tell there's no way to call the model directly via an API, so I fired up "Ask" mode in Cursor's chat side panel and asked it to "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Cursor 2 - In the chat panel I have asked the question and it spat out a bunch of SVG." src="https://static.simonwillison.net/static/2025/cursor-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/e5c9176f153ca718370055ecd256fe70"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is levitating against a blue sky. The pelican looks a little bit more like a baby chicken but does at least have a long beak." src="https://static.simonwillison.net/static/2025/cursor-1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;The notable thing about Composer-1 is that it is designed to be &lt;em&gt;fast&lt;/em&gt;. The pelican certainly came back quickly, and in their announcement they describe it as being "4x faster than similarly intelligent models".&lt;/p&gt;
&lt;p&gt;It's interesting to see Cursor investing resources in training their own code-specific model - similar to &lt;a href="https://openai.com/index/introducing-upgrades-to-codex/"&gt;GPT-5-Codex&lt;/a&gt; or &lt;a href="https://github.com/QwenLM/Qwen3-Coder"&gt;Qwen3-Coder&lt;/a&gt;. From their post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Composer is a mixture-of-experts (MoE) language model supporting long-context generation and understanding. It is specialized for software engineering through reinforcement learning (RL) in a diverse range of development environments. [...]&lt;/p&gt;
&lt;p&gt;Efficient training of large MoE models requires significant investment into building infrastructure and systems research. We built custom training infrastructure leveraging PyTorch and Ray to power asynchronous reinforcement learning at scale. We natively train our models at low precision by combining our &lt;a href="https://cursor.com/blog/kernels"&gt;MXFP8 MoE kernels&lt;/a&gt; with expert parallelism and hybrid sharded data parallelism, allowing us to scale training to thousands of NVIDIA GPUs with minimal communication cost. [...]&lt;/p&gt;
&lt;p&gt;During RL, we want our model to be able to call any tool in the Cursor Agent harness. These tools allow editing code, using semantic search, grepping strings, and running terminal commands. At our scale, teaching the model to effectively call these tools requires running hundreds of thousands of concurrent sandboxed coding environments in the cloud.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One detail that's notably absent from their description: did they train the model from scratch, or did they start with an existing open-weights model such as something from Qwen or GLM?&lt;/p&gt;
&lt;p&gt;Cursor researcher Sasha Rush has been answering questions &lt;a href="https://news.ycombinator.com/item?id=45748725"&gt;on Hacker News&lt;/a&gt;, but has so far been evasive in answering questions about the base model. When directly asked "is Composer a fine tune of an existing open source base model?" they replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our primary focus is on RL post-training. We think that is the best way to get the model to be a strong interactive agent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sasha &lt;a href="https://news.ycombinator.com/item?id=45748725#45750784"&gt;did confirm&lt;/a&gt; that rumors that an earlier Cursor preview model, Cheetah, was based on xAI's Grok were "Straight up untrue."&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45748725"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cursor"&gt;cursor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-performance"/><category term="llm-release"/><category term="cursor"/><category term="pelican-riding-a-bicycle"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="parallel-agents"/><category term="coding-agents"/><category term="ai-assisted-programming"/></entry><entry><title>Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code</title><link href="https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag" rel="alternate"/><published>2025-10-20T17:21:52+00:00</published><updated>2025-10-20T17:21:52+00:00</updated><id>https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag</id><summary type="html">
    &lt;p&gt;DeepSeek released a new model yesterday: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;DeepSeek-OCR&lt;/a&gt;, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware.&lt;/p&gt;
&lt;p&gt;This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I've been exploring recently. I &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designed an agentic loop&lt;/a&gt; for the problem, gave Claude full permissions inside a Docker sandbox, embraced the &lt;a href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/"&gt;parallel agents lifestyle&lt;/a&gt; and reused my &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/"&gt;notes on the NVIDIA Spark&lt;/a&gt; from last week.&lt;/p&gt;
&lt;p&gt;I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen.&lt;/p&gt;
&lt;p&gt;TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce &lt;em&gt;copious&lt;/em&gt; notes about the process.&lt;/p&gt;
&lt;h4 id="the-setup"&gt;The setup&lt;/h4&gt;
&lt;p&gt;I connected to the Spark from my Mac via SSH and started a new Docker container there:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I installed npm and used that to install Claude Code:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then started Claude Code, telling it that it's OK that it's running as &lt;code&gt;root&lt;/code&gt; because it's in a sandbox:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It provided me with a URL to click on to authenticate with my Anthropic account.&lt;/p&gt;
&lt;h4 id="the-initial-prompts"&gt;The initial prompts&lt;/h4&gt;
&lt;p&gt;I kicked things off with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a folder deepseek-ocr and do everything else in that folder&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I ran the following, linking to both the GitHub repository and the Hugging Face model, dropping a clue about the NVIDIA ARM platform, and giving it an image (&lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;this one&lt;/a&gt;, see &lt;a href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/"&gt;previous post&lt;/a&gt;) that I wanted it to run OCR on.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Your task is to get this working: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;https://github.com/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - it uses Hugging Face Transformers and models from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"&gt;https://huggingface.co/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those - you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example - use &lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;https://static.simonwillison.net/static/2025/ft.jpeg&lt;/a&gt; as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note how I also told it to "make extensive notes as you go in a notes.md file which you should only ever append new notes to" so I'd have some useful notes at the end.&lt;/p&gt;
&lt;h4 id="the-follow-ups"&gt;The follow-ups&lt;/h4&gt;
&lt;p&gt;It churned away for 13 minutes, making a sequence of discoveries about the platform, the model, and the difficulties of running PyTorch on it.&lt;/p&gt;
&lt;p&gt;And then... &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-154706"&gt;it gave up&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a&lt;/li&gt;
&lt;li&gt;GB10 requires: sm_121&lt;/li&gt;
&lt;li&gt;Error: &lt;code&gt;CUDA error: no kernel image is available for execution on the device&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, the model has hardcoded &lt;code&gt;.cuda()&lt;/code&gt; calls, preventing CPU fallback without code modifications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I gave it some encouragement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It ran this command to grep for all available PyTorch wheels on &lt;code&gt;download.pytorch.org&lt;/code&gt; that match the ARM architecture:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -s https://download.pytorch.org/whl/torch/ &lt;span class="pl-k"&gt;|&lt;/span&gt; grep -o &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;torch-[0-9.]*.*aarch64.whl&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -V &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -20&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... and that unblocked it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and &lt;strong&gt;CUDA 13.0&lt;/strong&gt; (cu130)! This is exactly what we need!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It installed the wheels it found and reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it's just a warning now instead of a hard error! Let's try running inference: [...]&lt;/p&gt;
&lt;/blockquote&gt;
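&lt;p&gt;For reference, you can reproduce that check yourself with a few lines of PyTorch. This is my own minimal sketch, not code from the transcript:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

# Which build is installed, and which CUDA toolkit it was compiled against
print(torch.__version__, torch.version.cuda)

# Can PyTorch see the GPU at all?
print(torch.cuda.is_available())

# Device name and compute capability - the Spark's GB10 reports (12, 1),
# i.e. sm_121, the architecture behind the original "no kernel image" error
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
&lt;/code&gt;&lt;/pre&gt;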
&lt;p&gt;It then kept on churning for a bit, ignoring the warnings, and managed to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155056"&gt;run the model&lt;/a&gt;! It loaded the image and ran this prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|grounding|&amp;gt;OCR this image.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And got back output that started like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|ref|&amp;gt;The perils of vibe coding&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[352, 30, 624, 111]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;opt&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 0, 994, 30]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;such&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 24, 996, 58]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;days&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 52, 996, 87]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;pavi&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 85, 996, 118]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;TECHNOLOGY&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[33, 199, 127, 230]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;holds the promise of replacing program-&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 170, 390, 210]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;mers - a profession of very well paid&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 199, 390, 238]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;people.For another, the work can be&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 228, 390, 267]]&amp;lt;|/det|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
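&lt;p&gt;That grounding format is straightforward to post-process. Here's a minimal sketch of my own (not one of Claude's generated scripts) that parses it into text and bounding-box pairs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import re

# Matches &amp;lt;|ref|&amp;gt;text&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[x1, y1, x2, y2]]&amp;lt;|/det|&amp;gt; pairs
GROUNDING = re.compile(
    r"&amp;lt;\|ref\|&amp;gt;(.*?)&amp;lt;\|/ref\|&amp;gt;&amp;lt;\|det\|&amp;gt;(\[\[.*?\]\])&amp;lt;\|/det\|&amp;gt;"
)

def parse_grounding(raw):
    """Yield (text, [x1, y1, x2, y2]) pairs from grounding-mode output."""
    for text, det in GROUNDING.findall(raw):
        for box in json.loads(det):  # the det payload is valid JSON
            yield text, box

sample = "&amp;lt;|ref|&amp;gt;The perils of vibe coding&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[352, 30, 624, 111]]&amp;lt;|/det|&amp;gt;"
for text, box in parse_grounding(sample):
    print(text, box)  # The perils of vibe coding [352, 30, 624, 111]
&lt;/code&gt;&lt;/pre&gt;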
&lt;p&gt;Claude declared victory and pointed me to the &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/output/result.mmd?short_path=f43da6a"&gt;output/result.mmd&lt;/a&gt; file, which contained only whitespace. So the OCR itself had worked, but the text output hadn't been written to disk correctly.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155826"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That result.mmd file looks to be exclusively whitespace to me, try and get a better result there - the bounding box image was good though&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which caused Claude to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-155900"&gt;reconsider&lt;/a&gt; the prompt it used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;Convert the document to markdown."&lt;/code&gt; - for documents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;OCR this image."&lt;/code&gt; - general OCR&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\nFree OCR."&lt;/code&gt; - without layouts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The "grounding" mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format.&lt;/p&gt;
&lt;p&gt;Let me try running it again with a different prompt that's more focused on text extraction rather than grounding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It tried all of the prompt variants, &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/notes.md#text-output-improvement-2025-10-20-1601"&gt;made a note about them&lt;/a&gt; and created a file called &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/PROMPTS_GUIDE.md"&gt;PROMPTS_GUIDE.md&lt;/a&gt; with a bunch more details, including this comparison table:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Text Quality&lt;/th&gt;
&lt;th&gt;Structure&lt;/th&gt;
&lt;th&gt;Coordinates&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fast&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;General OCR&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;⚡⚡ Medium&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;⭐⭐ Partial&lt;/td&gt;
&lt;td&gt;Documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;⚡ Slow&lt;/td&gt;
&lt;td&gt;⭐⭐ Good&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;Annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fastest&lt;/td&gt;
&lt;td&gt;⭐ N/A&lt;/td&gt;
&lt;td&gt;❌ N/A&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Image analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;And this table under the heading "Performance Benchmarks":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Test image: 3503×1668 pixels (Financial Times article)&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Output Size&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free OCR&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;Clean text&lt;/td&gt;
&lt;td&gt;2257&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;39s&lt;/td&gt;
&lt;td&gt;Formatted MD&lt;/td&gt;
&lt;td&gt;2257 + structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;58s&lt;/td&gt;
&lt;td&gt;Text + coords&lt;/td&gt;
&lt;td&gt;2257 + boxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Description&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;My final prompt was this, to gather everything together into a zip file I could extract from the Docker container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a zip file with the output and output_text and all of the scripts and notes - but leave out the github repo and the huggingface repo directories&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I added the contents of that zip file to my new &lt;a href="https://github.com/simonw/research"&gt;simonw/research&lt;/a&gt; GitHub repo &lt;a href="https://github.com/simonw/research/tree/main/deepseek-ocr-nvidia-spark"&gt;in the deepseek-ocr-nvidia-spark&lt;/a&gt; folder.&lt;/p&gt;
&lt;p&gt;Claude really likes writing notes! Here's the directory listing of that finished folder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  |-- download_test_image.sh
  |-- FINAL_SUMMARY.md
  |-- notes.md
  |-- output
  |   |-- images
  |   |-- result_with_boxes.jpg
  |   `-- result.mmd
  |-- output_text
  |   |-- detailed
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   |-- free_ocr
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   `-- markdown
  |       |-- images
  |       |   `-- 0.jpg
  |       |-- result_with_boxes.jpg
  |       `-- result.mmd
  |-- PROMPTS_GUIDE.md
  |-- README_SUCCESS.md
  |-- README.md
  |-- run_ocr_best.py
  |-- run_ocr_cpu_nocuda.py
  |-- run_ocr_cpu.py
  |-- run_ocr_text_focused.py
  |-- run_ocr.py
  |-- run_ocr.sh
  |-- setup.sh
  |-- SOLUTION.md
  |-- test_image.jpeg
  |-- TEXT_OUTPUT_SUMMARY.md
  `-- UPDATE_PYTORCH.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things.&lt;/p&gt;
&lt;p&gt;Having tried and failed to get PyTorch stuff working in the past, I count this as a &lt;em&gt;huge&lt;/em&gt; win. I'll be using this process a whole lot more in the future.&lt;/p&gt;
&lt;p&gt;How good were the actual results? There's honestly so much material in the notes Claude created that I haven't reviewed all of it. There may well be all sorts of errors in there, but it's indisputable that it got the model running, and its notes are detailed enough that I'll be able to do the same thing again in the future.&lt;/p&gt;
&lt;p&gt;I think the key factors in executing this project successfully were the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I gave it exactly what it needed: a Docker environment on the target hardware, instructions on where to find what it needed (the code and the model), and a clear goal to pursue. This is a great example of the pattern I described in &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designing agentic loops&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Running it in a Docker sandbox meant I could use &lt;code&gt;claude --dangerously-skip-permissions&lt;/code&gt; and leave it running on its own. If I'd had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes.&lt;/li&gt;
&lt;li&gt;I applied my own knowledge and experience when it got stuck. I was confident (based on &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#claude-code-for-everything"&gt;previous experiments&lt;/a&gt; with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it.&lt;/p&gt;
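&lt;p&gt;If you want to experiment with it yourself, the heart of the generated run scripts boils down to something like the following. The &lt;code&gt;infer()&lt;/code&gt; method and its arguments come from the DeepSeek-OCR README, so treat the exact signature as an assumption rather than a guarantee:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-OCR"

# trust_remote_code is required: the OCR pipeline ships as custom code
# alongside the weights on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# "Free OCR" was the prompt variant that produced the cleanest plain text
result = model.infer(
    tokenizer,
    prompt="&amp;lt;image&amp;gt;\nFree OCR.",
    image_file="test_image.jpeg",
    output_path="output_text/free_ocr",
    save_results=True,
)
&lt;/code&gt;&lt;/pre&gt;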
&lt;h4 id="bonus-using-vs-code-to-monitor-the-container"&gt;Bonus: Using VS Code to monitor the container&lt;/h4&gt;
&lt;p&gt;A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://claude.ai/share/68a0ebff-b586-4278-bd91-6b715a657d2b"&gt;asked Claude.ai&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I am running a Docker container on a remote machine, which I started over SSH&lt;/p&gt;
&lt;p&gt;How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It gave me a set of steps that solved this exact problem:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install the VS Code "Remote SSH" and "Dev Containers" extensions&lt;/li&gt;
&lt;li&gt;Use "Remote-SSH: Connect to Host" to connect to the remote machine (on my Tailscale network that's &lt;code&gt;spark@100.113.1.114&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;In the window for that remote SSH session, run "Dev Containers: Attach to Running Container" - this shows a list of containers and you can select the one you want to attach to&lt;/li&gt;
&lt;li&gt;... and that's it! VS Code opens a new window providing full access to all of the files in that container. I opened up &lt;code&gt;notes.md&lt;/code&gt; and watched it as Claude Code appended to it in real time.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At the end, when I told Claude to create a zip file of the results, I could select that file in the VS Code file explorer and use the "Download" menu item to download it to my Mac.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytorch"&gt;pytorch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vs-code"&gt;vs-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="python"/><category term="ai"/><category term="docker"/><category term="pytorch"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="nvidia"/><category term="vs-code"/><category term="vision-llms"/><category term="deepseek"/><category term="llm-release"/><category term="coding-agents"/><category term="claude-code"/><category term="ai-in-china"/><category term="nvidia-spark"/></entry><entry><title>Introducing Claude Haiku 4.5</title><link href="https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag" rel="alternate"/><published>2025-10-15T19:36:34+00:00</published><updated>2025-10-15T19:36:34+00:00</updated><id>https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-haiku-4-5"&gt;Introducing Claude Haiku 4.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Anthropic released Claude Haiku 4.5 today, the cheapest member of the Claude 4.5 family that started with Sonnet 4.5 &lt;a href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/"&gt;a couple of weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's priced at $1/million input tokens and $5/million output tokens, slightly more expensive than Haiku 3.5 ($0.80/$4) and a &lt;em&gt;lot&lt;/em&gt; more expensive than the original Claude 3 Haiku ($0.25/$1.25), both of which remain available at those prices.&lt;/p&gt;
&lt;p&gt;It's a third of the price of Sonnet 4 and Sonnet 4.5 (both $3/$15), which is notable because Anthropic's benchmarks put it in a similar space to that older Sonnet 4 model. As they put it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been hoping to see Anthropic release a fast, inexpensive model that's price competitive with the cheapest models from OpenAI and Gemini, currently $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite). Haiku 4.5 certainly isn't that; it looks like they're continuing to focus squarely on the "great at code" part of the market.&lt;/p&gt;
&lt;p&gt;The new Haiku is the first Haiku model to support reasoning. It sports a 200,000 token context window, a 64,000 token maximum output (up from just 8,192 for Haiku 3.5), and a "reliable knowledge cutoff" of February 2025, one month later than the January 2025 date for Sonnet 4 and 4.5 and Opus 4 and 4.1.&lt;/p&gt;
&lt;p&gt;Something that caught my eye in the accompanying &lt;a href="https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf"&gt;system card&lt;/a&gt; was this note about context length:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For Claude Haiku 4.5, we trained the model to be explicitly context-aware, with precise information about how much context-window has been used. This has two effects: the model learns when and how to wrap up its answer when the limit is approaching, and the model learns to continue reasoning more persistently when the limit is further away. We found this intervention—along with others—to be effective at limiting agentic “laziness” (the phenomenon where models stop working on a problem prematurely, give incomplete answers, or cut corners on tasks).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've added the new price to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;, released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.20"&gt;llm-anthropic 0.20&lt;/a&gt; with the new model and updated my &lt;a href="https://tools.simonwillison.net/haiku"&gt;Haiku-from-your-webcam&lt;/a&gt; demo (&lt;a href="https://github.com/simonw/tools/blob/main/haiku.html"&gt;source&lt;/a&gt;) to use Haiku 4.5 as well.&lt;/p&gt;
&lt;p&gt;Here's &lt;code&gt;llm -m claude-haiku-4.5 'Generate an SVG of a pelican riding a bicycle'&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/31256c523fa502eeb303b8e0bbe30eee"&gt;transcript&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by Haiku 4.5: A whimsical illustration of a bird with a round tan body, pink beak, and orange legs riding a bicycle against a blue sky and green grass background." src="https://static.simonwillison.net/static/2025/claude-haiku-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;18 input tokens and 1513 output tokens = &lt;a href="https://www.llm-prices.com/#it=18&amp;amp;ot=1513&amp;amp;ic=1&amp;amp;oc=5"&gt;0.7583 cents&lt;/a&gt;.&lt;/p&gt;
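&lt;p&gt;For completeness, here's the equivalent call through LLM's Python API (assuming &lt;code&gt;llm-anthropic&lt;/code&gt; is installed and an Anthropic key is configured), plus the arithmetic behind that cost figure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

model = llm.get_model("claude-haiku-4.5")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())

# Haiku 4.5 pricing: $1/million input tokens, $5/million output tokens
input_tokens, output_tokens = 18, 1513
cost_cents = (input_tokens * 1 + output_tokens * 5) / 1_000_000 * 100
print(f"{cost_cents:.4f} cents")  # 0.7583 cents
&lt;/code&gt;&lt;/pre&gt;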

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45595403"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-reasoning"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/></entry></feed>