Simon Willison's Weblog: pelican-riding-a-bicycle

Gemini 2.0 Flash "Thinking mode"

2024-12-19T23:59:54+00:00

Those new model releases just keep on flowing. Today it's Google's snappily named gemini-2.0-flash-thinking-exp, their first entrant into the o1-style inference scaling class of models. I wrote about a great essay on the significance of these just this morning.

From the Gemini model documentation:

Gemini 2.0 Flash Thinking Mode is an experimental model that's trained to generate the "thinking process" the model goes through as part of its response. As a result, Thinking Mode is capable of stronger reasoning capabilities in its responses than the base Gemini 2.0 Flash model.

I just shipped llm-gemini 0.8 with support for the model. You can try it out using LLM like this:

llm install -U llm-gemini
# If you haven't yet set a gemini key:
llm keys set gemini
# Paste key here

llm -m gemini-2.0-flash-thinking-exp-1219 "solve a harder variant of that goat lettuce wolf river puzzle"

It's a very talkative model - 2,277 output tokens answering that prompt.

A more interesting example

The best source of example prompts I've found so far is the Gemini 2.0 Flash Thinking cookbook - a Jupyter notebook full of demonstrations of what the model can do.

My favorite so far is this one:

What's the area of the overlapping region?

This model is multi-modal!

Here's how to run that example using llm-gemini:

llm -m gemini-2.0-flash-thinking-exp-1219 \
  -a https://storage.googleapis.com/generativeai-downloads/images/geometry.png \
  "What's the area of the overlapping region?"

Here's the full response, complete with MathML working. The eventual conclusion:

The final answer is 9π/4

That's the same answer as Google provided in their example notebook, so I'm presuming it's correct. Impressive!

How about an SVG of a pelican riding a bicycle?

llm -m gemini-2.0-flash-thinking-exp-1219 \
  "Generate an SVG of a pelican riding a bicycle"

Here's the full response. Interestingly it slightly corrupted the start of its answer:

This thought process involves a combination of visual thinking, knowledge of SVG syntax, and iterative refinement. The key is to break down the problem into manageable parts and build up the image piece by piece. Even experienced SVG creators often go through several adjustments before arriving at the final version.00" height="250" viewBox="0 0 300 250" fill="none" xmlns="http://www.w3.org/2000/svg">
<g>


After I manually repaired that to add the <svg opening tag I got this:

So maybe not an artistic genius, but it's interesting to read through its chain of thought for that task.
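A corrupted response like that one can be caught automatically. Here is a minimal sketch, assuming a plain XML parse is a good enough check; the function name is hypothetical, and the `width="300"` in the repaired string is a guess based on the viewBox, not something taken from the transcript:

```python
import xml.etree.ElementTree as ET


def is_well_formed_svg(text):
    """Return True if text parses as XML with <svg> as the root element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # Namespaced tags are reported as "{http://www.w3.org/2000/svg}svg".
    return root.tag.rsplit("}", 1)[-1] == "svg"


# Start of the corrupted output, closed off so it is a complete demo string
corrupted = (
    '00" height="250" viewBox="0 0 300 250" fill="none" '
    'xmlns="http://www.w3.org/2000/svg">\n<g>\n</g>\n</svg>'
)
# Assumption: the missing prefix was '<svg width="3' (guessed from the viewBox)
repaired = '<svg width="3' + corrupted

print(is_well_formed_svg(corrupted))
print(is_well_formed_svg(repaired))
```

The first string fails to parse because it does not start with a tag, which is exactly the problem that needed a manual repair here.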

Who's next?

It's very clear now that inference scaling is the next big area of research for the large labs. We've seen models from OpenAI (o1), Qwen (QwQ), DeepSeek (DeepSeek-R1-Lite-Preview) and now Google Gemini. I'm interested to hear if Anthropic or Meta or Mistral or Amazon have anything cooking in this category.

Tags: google, ai, generative-ai, llms, llm, gemini, o1, pelican-riding-a-bicycle, inference-scaling

Veo 2

2024-12-16T23:31:59+00:00

Veo 2

Google's text-to-video model, now available via waitlisted preview. I got through the waitlist and tried the same prompt I ran against OpenAI's Sora last week:

A pelican riding a bicycle along a coastal path overlooking a harbor

It generated these four videos:

Here's the larger video.

Via Hacker News

Tags: ai, google, generative-ai, pelican-riding-a-bicycle, text-to-video

<model-viewer> Web Component by Google

2024-12-13T18:46:13+00:00

<model-viewer> Web Component by Google

I learned about this Web Component from Claude when looking for options to render a .glb file on a web page. It's very pleasant to use:

<model-viewer style="width: 100%; height: 200px"
  src="https://static.simonwillison.net/static/cors-allow/2024/a-pelican-riding-a-bicycle.glb"
  camera-controls="1" auto-rotate="1"
></model-viewer>

Here it is showing a 3D pelican on a bicycle I created while trying out BlenderGPT, a new prompt-driven 3D asset creating tool (my prompt was "a pelican riding a bicycle"). There's a comment from BlenderGPT's creator on Hacker News explaining that it's currently using Microsoft's TRELLIS model.

Via Claude: options for displaying a glb file on a web page

Tags: web-components, google, generative-ai, ai, 3d, blender, microsoft, claude, pelican-riding-a-bicycle

Sora

2024-12-09T18:35:36+00:00

Sora

OpenAI released their long-threatened Sora text-to-video model this morning, available in most non-European countries to subscribers to ChatGPT Plus ($20/month) or Pro ($200/month).

Here's what I got for the very first test prompt I ran through it:

A pelican riding a bicycle along a coastal path overlooking a harbor

The pelican inexplicably morphs to cycle in the opposite direction halfway through, but I don't see that as a particularly significant issue: Sora is built entirely around the idea of directly manipulating and editing and remixing the clips it generates, so the goal isn't to have it produce usable videos from a single prompt.

Via Sora–12 Days of OpenAI: Day 3

Tags: video, generative-ai, openai, pelican-riding-a-bicycle, ai, text-to-video

I can now run a GPT-4 class model on my laptop

2024-12-09T15:08:40+00:00

Meta's new Llama 3.3 70B is a genuinely GPT-4 class Large Language Model that runs on my laptop.

Just 20 months ago I was amazed to see something that felt GPT-3 class run on that same machine. The quality of models that are accessible on consumer hardware has improved dramatically in the past two years.

My laptop is a 64GB MacBook Pro M2, which I got in January 2023 - two months after the initial release of ChatGPT. All of my experiments running LLMs on a laptop have used this same machine.

In March 2023 I wrote that Large language models are having their Stable Diffusion moment after running Meta's initial LLaMA release (think of that as Llama 1.0) via the then-brand-new llama.cpp. I said:

As my laptop started to spit out text at me I genuinely had a feeling that the world was about to change

I had a moment of déjà vu the day before yesterday, when I ran Llama 3.3 70B on the same laptop for the first time.

Meta claim that:

This model delivers similar performance to Llama 3.1 405B with cost effective inference that’s feasible to run locally on common developer workstations.

Llama 3.1 405B is their much larger best-in-class model, which is very much in the same weight class as GPT-4 and friends.

Everything I've seen so far from Llama 3.3 70B suggests that it holds up to that standard. I honestly didn't think this was possible - I assumed that anything as useful as GPT-4 would require many times more resources than are available to me on my consumer-grade laptop.

I'm so excited by the continual efficiency improvements we're seeing in running these impressively capable models. In the proprietary hosted world it's giving us incredibly cheap and fast models like Gemini 1.5 Flash, GPT-4o mini and Amazon Nova. In the openly licensed world it's giving us increasingly powerful models we can run directly on our own devices.

How I ran Llama 3.3 70B on my machine using Ollama

I don't expect that this model would work well with much less than my 64GB of RAM. The first time I tried, it consumed every remaining bit of available memory and hard-crashed my Mac! For my second attempt I made sure not to have Firefox and VS Code running at the same time and it worked just fine.

I used Ollama. Install that, then run this command to fetch the model:

ollama pull llama3.3

This fetched 42GB of data into my ~/.ollama/models/blobs directory.

I used my LLM tool to run prompts against the model, since that way both the prompt and the response would be automatically logged to SQLite. First I installed the llm-ollama plugin:

llm install llm-ollama

Then with the Ollama local server running I could run this:

llm models

And see this in the list:

Ollama: llama3.3:latest (aliases: llama3.3)

Putting the model through its paces

This means I can run prompts like so:

llm -m llama3.3 'Write a letter to the mayor of Half Moon Bay CA advocating for the installation of cosy boxes for Pelicans in the harbor'

And get back a response that started like this:

[Your Name]
[Your Address]
Half Moon Bay, CA [Zip Code]
[Email Address]
[Phone Number]
[Date]

The Honorable Mayor, Deborah Penrose
City of Half Moon Bay
501 Main Street
Half Moon Bay, CA 94019

Dear Mayor Penrose,

I am writing to propose an innovative and unique initiative that would not only enhance the beauty of our picturesque harbor but also provide a safe haven for one of our beloved coastal residents: the California Brown Pelican. I am excited to advocate for the installation of cozy boxes specifically designed for pelicans in the Half Moon Bay harbor. [...]

(Deborah Penrose was the mayor of Half Moon Bay for a single year from December 2016 to December 2017 - so a hint of some quite finely grained world knowledge there, even if it's not relevant for the present day.)

Here's the full transcript, including the response I got when I ran the follow-up llm -c 'translate that to spanish'.
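The same local Ollama server can also be called directly over HTTP, without LLM in between. This is a minimal sketch using only the standard library. It assumes Ollama is listening on its default port 11434 and uses its /api/generate endpoint with streaming turned off; the helper names are hypothetical, and nothing is logged to SQLite this way:

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt):
    """Build a POST request asking Ollama for a single, non-streamed reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as response:
        return json.loads(response.read())["response"]


# Usage, with the Ollama server running:
# print(generate("llama3.3", "Write a haiku about pelicans"))
```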

Next I tried a simple coding challenge. I love using Claude Artifacts to build simple single page applications. Can Llama 3.3 on my laptop do that too?

llm -m llama3.3 'produce HTML with embedded CSS and JavaScript - no react, no dependencies - for an app that takes a URL to a YouTube video and outputs images along with visible URLs for all known thumbnails for that video'

Here's the code it produced (the full transcript is here):

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>YouTube Thumbnail Extractor</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
        }
        #thumbnails {
            display: flex;
            flex-wrap: wrap;
        }
        .thumbnail {
            margin: 10px;
        }
        img {
            max-width: 150px;
            max-height: 150px;
            object-fit: cover;
        }
    </style>
</head>
<body>

<h2>YouTube Thumbnail Extractor</h2>
<input type="text" id="youtubeUrl" placeholder="Enter YouTube video URL">
<button onclick="extractThumbnails()">Extract Thumbnails</button>
<div id="thumbnails"></div>

<script>
    function extractThumbnails() {
        const youtubeUrl = document.getElementById('youtubeUrl').value;
        const videoId = getYoutubeVideoId(youtubeUrl);
        
        if (!videoId) {
            alert("Could not parse YouTube video ID from the URL.");
            return;
        }
        
        const thumbnailSizes = ['default', 'mqdefault', 'hqdefault', 'sddefault', 'maxresdefault'];
        const thumbnailsDiv = document.getElementById('thumbnails');
        thumbnailsDiv.innerHTML = ''; // Clear previous results
        
        thumbnailSizes.forEach(size => {
            const thumbnailUrl = `https://img.youtube.com/vi/${videoId}/${size}.jpg`;
            const img = document.createElement('img');
            img.src = thumbnailUrl;
            img.alt = `${size} thumbnail`;
            img.className = 'thumbnail';
            
            const link = document.createElement('a');
            link.href = thumbnailUrl;
            link.text = size;
            link.style.display = 'block';
            
            const div = document.createElement('div');
            div.appendChild(img);
            div.appendChild(link);
            thumbnailsDiv.appendChild(div);
        });
    }
    
    function getYoutubeVideoId(url) {
        const regExp = /^.*((youtu.be\/)|(v\/)|(\/u\/\w\/)|(embed\/)|(watch\?))\??v?=?([^#&?]*).*/;
        const match = url.match(regExp);
        return (match && match[7].length == 11) ? match[7] : false;
    }
</script>

</body>
</html>

Again, we're illustrating some world knowledge here: it knows that YouTube video thumbnails are generally available as default, mqdefault, hqdefault, sddefault and maxresdefault.

Here's the hosted page, and this is what it looks like when run against this YouTube URL:

It's not as good as the version I iterated on with Claude, but this still shows that Llama 3.3 can one-shot a full interactive application while running on my Mac.
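The core of that page is small enough to restate outside the browser. Here is a rough Python equivalent, offered as a sketch and not as anything Llama 3.3 produced. It parses the URL instead of using a regular expression; the function names and the list of accepted hostnames are assumptions, while the thumbnail names, the URL pattern and the 11-character ID check come from the generated code above:

```python
from urllib.parse import parse_qs, urlparse

THUMBNAIL_SIZES = ["default", "mqdefault", "hqdefault", "sddefault", "maxresdefault"]
# Assumption: these are the only hostnames this sketch accepts
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def youtube_video_id(url):
    """Return the 11-character video ID from a YouTube URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    candidate = ""
    if host == "youtu.be":
        candidate = parsed.path.strip("/")
    elif host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith("/embed/"):
            candidate = parsed.path.split("/")[2]
    return candidate if len(candidate) == 11 else None


def thumbnail_urls(video_id):
    """Return one thumbnail URL per known size name."""
    return [
        f"https://img.youtube.com/vi/{video_id}/{size}.jpg"
        for size in THUMBNAIL_SIZES
    ]


for url in thumbnail_urls(youtube_video_id("https://youtu.be/dQw4w9WgXcQ")):
    print(url)
```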

How does it score?

It's always useful to check independent benchmarks for this kind of model.

One of my current favorites for that is LiveBench, which calls itself "a challenging, contamination-free LLM benchmark" and tests a large array of models with a comprehensive set of different tasks.

llama-3.3-70b-instruct-turbo currently sits in position 19 on their table, a place ahead of Claude 3 Opus (my favorite model for several months after its release in March 2024) and just behind April's GPT-4 Turbo and September's GPT-4o.

LiveBench here is sorted by the average across multiple evals, and Llama 3.3 70B somehow currently scores top of the table for the "IF" (Instruction Following) eval which likely skews that average. Here's the Instruction-Following Evaluation for Large Language Models paper describing that particular test.

It's worth noting that the benchmarks listed here run against the full-sized Llama 3.3 release. The versions I've been running on my laptop are quantized (Ollama's is Q4_K_M), so they aren't exactly the same model and likely have different (lower) benchmark scores.
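To get a feel for how much quantization shrinks the model, the download sizes quoted in this article (42GB from Ollama, 37GB from MLX) can be turned into average bits per weight. This is back-of-envelope arithmetic only: it assumes "70B" means exactly 70 billion parameters, that a GB is 10^9 bytes, and that the full-sized release stores 16 bits per weight.

```python
def bits_per_weight(file_size_gb, parameter_count):
    """Average bits stored per parameter for a model file of the given size."""
    return file_size_gb * 1e9 * 8 / parameter_count


PARAMETERS = 70e9  # assumption: "70B" treated as exactly 70 billion parameters

print(round(bits_per_weight(42, PARAMETERS), 1))  # Ollama Q4_K_M download
print(round(bits_per_weight(37, PARAMETERS), 1))  # MLX 4bit download

# Assumption: unquantized weights use 16 bits (2 bytes) per parameter
print(PARAMETERS * 2 / 1e9)  # approximate size in GB at 16 bits per weight
```

Under those assumptions the two downloads work out to roughly 4.8 and 4.2 bits per weight, against 16 bits (about 140GB) for the unquantized weights, which would not fit in 64GB of RAM.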

Honorable mentions

Llama 3.3 is currently the model that has impressed me the most that I've managed to run on my own hardware, but I've had several other positive experiences recently.

Last month I wrote about Qwen2.5-Coder-32B, an Apache 2.0 licensed model from Alibaba's Qwen research team that also gave me impressive results with code.

A couple of weeks ago I tried another Qwen model, QwQ, which implements a similar chain-of-thought pattern to OpenAI's o1 series but again runs comfortably on my own device.

Meta's Llama 3.2 family of models are interesting as well: tiny 1B and 3B models (those should run even on a Raspberry Pi) that are way more capable than I would have expected - plus Meta's first multi-modal vision models at 11B and 90B sizes. I wrote about those in September.

Is performance about to plateau?

I've been mostly unconvinced by the ongoing discourse around LLMs hitting a plateau. The areas I'm personally most excited about are multi-modality (images, audio and video as input) and model efficiency. Both of those have had enormous leaps forward in the past year.

I don't particularly care about "AGI". I want models that can do useful things that I tell them to, quickly and inexpensively - and that's exactly what I've been getting more of over the past twelve months.

Even if progress on these tools entirely stopped right now, the amount I could get done with just the models I've downloaded and stashed on a USB drive would keep me busy and productive for years.

Bonus: running Llama 3.3 70B with MLX

I focused on Ollama in this article because it's the easiest option, but I also managed to run a version of Llama 3.3 using Apple's excellent MLX library, which just celebrated its first birthday.

Here's how I ran the model with MLX, using uv to fire up a temporary virtual environment:

uv run --with mlx-lm --python 3.12 python

This gave me a Python interpreter with mlx-lm available. Then I ran this:

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")

This downloaded 37GB from mlx-community/Llama-3.3-70B-Instruct-4bit to ~/.cache/huggingface/hub/models--mlx-community--Llama-3.3-70B-Instruct-4bit.

Then:

prompt = "Generate an SVG of a pelican riding a bicycle, start with <svg>"
messages = [{"role": "user", "content": prompt}]
prompt2 = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(
    model, tokenizer, prompt=prompt2, verbose=True
)

Which gave me:

<svg xmlns="http://www.w3.org/2000/svg" width="400" height="300" viewBox="0 0 400 300">
  <!-- Sky -->
  <rect x="0" y="0" width="400" height="300" fill="#87CEEB" rx="0" />
  
  <!-- Ground -->
  <rect x="0" y="250" width="400" height="50" fill="#964B00" rx="0" />
  
  <!-- Pelican -->
  <g transform="translate(150, 150)">
    <!-- Body -->
    <ellipse cx="0" cy="0" rx="30" ry="40" fill="#FFFFFF" stroke="#000000" stroke-width="2" />
    
    <!-- Beak -->
    <path d="M 0 -40 L 20 -60 L 40 -40 Z" fill="#FFFF00" stroke="#000000" stroke-width="2" />
    
    <!-- Legs -->
    <line x1="0" y1="40" x2="-20" y2="80" stroke="#000000" stroke-width="5" />
    <line x1="0" y1="40" x2="20" y2="80" stroke="#000000" stroke-width="5" />
    
    <!-- Eyes -->
    <circle cx="-15" cy="-20" r="5" fill="#000000" />
    <circle cx="15" cy="-20" r="5" fill="#000000" />
  </g>
  
  <!-- Bicycle -->
  <g transform="translate(200, 220) scale(0.7, 0.7)">
    <!-- Frame -->
    <rect x="-50" y="-20" width="100" height="40" fill="#808080" rx="10" />
    
    <!-- Wheels -->
    <circle cx="-35" cy="20" r="30" fill="#808080" />
    <circle cx="35" cy="20" r="30" fill="#808080" />
    
    <!-- Pedals -->
    <rect x="-45" y="40" width="10" height="20" fill="#808080" rx="5" />
    <rect x="35" y="40" width="10" height="20" fill="#808080" rx="5" />
    
    <!-- Seat -->
    <rect x="-20" y="-40" width="40" height="20" fill="#808080" rx="5" />
  </g>
  
  <!-- Pelican on Bicycle -->
  <g transform="translate(200, 150)">
    <!-- Pelican's hands on handlebars -->
    <line x1="-20" y1="0" x2="-40" y2="-20" stroke="#000000" stroke-width="5" />
    <line x1="20" y1="0" x2="40" y2="-20" stroke="#000000" stroke-width="5" />
    
    <!-- Pelican's feet on pedals -->
    <line x1="0" y1="40" x2="-20" y2="60" stroke="#000000" stroke-width="5" />
    <line x1="0" y1="40" x2="20" y2="60" stroke="#000000" stroke-width="5" />
  </g>
</svg>

Followed by:

Prompt: 52 tokens, 49.196 tokens-per-sec
Generation: 723 tokens, 8.733 tokens-per-sec
Peak memory: 40.042 GB

Here's what that looks like:

Honestly, I've seen worse.

Tags: python, ai, generative-ai, llama, gpt-4, edge-llms, llms, ai-assisted-programming, llm, meta, uv, mlx, ollama, pelican-riding-a-bicycle

New Gemini model: gemini-exp-1206

2024-12-06T18:05:30+00:00

New Gemini model: gemini-exp-1206

Google's Jeff Dean:

Today’s the one year anniversary of our first Gemini model releases! And it’s never looked better.

Check out our newest release, Gemini-exp-1206, in Google AI Studio and the Gemini API!

I upgraded my llm-gemini plugin to support the new model and released it as version 0.6 - you can install or upgrade it like this:

llm install -U llm-gemini

Running my SVG pelican on a bicycle test prompt:

llm -m gemini-exp-1206 "Generate an SVG of a pelican riding a bicycle"

Provided this result, which is the best I've seen from any model:

Here's the full output - I enjoyed these two pieces of commentary from the model:

<polygon>: Shapes the distinctive pelican beak, with an added line for the lower mandible.
[...]
transform="translate(50, 30)": This attribute on the pelican's <g> tag moves the entire pelican group 50 units to the right and 30 units down, positioning it correctly on the bicycle.

The new model is also currently in top place on the Chatbot Arena.

Update: a delightful bonus, here's what I got from the follow-up prompt:

llm -c "now animate it"

Transcript here.

Tags: gemini, llm, ai, llms, svg, releases, google, generative-ai, pelican-riding-a-bicycle

QwQ: Reflect Deeply on the Boundaries of the Unknown

2024-11-27T23:59:19+00:00

QwQ: Reflect Deeply on the Boundaries of the Unknown

Brand new openly licensed (Apache 2) model from Alibaba Cloud's Qwen team, this time clearly inspired by OpenAI's work on reasoning in o1.

I love the flowery language they use to introduce the new model:

Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun. Just as a student grows wiser by carefully examining their work and learning from mistakes, our model achieves deeper insight through patient, thoughtful analysis.

It's already available through Ollama as a 20GB download. I initially ran it like this:

ollama run qwq

This downloaded the model and started an interactive chat session. I tried the classic "how many rs in strawberry?" and got this lengthy but correct answer, which concluded:

Wait, but maybe I miscounted. Let's list them: 1. s 2. t 3. r 4. a 5. w 6. b 7. e 8. r 9. r 10. y Yes, definitely three "r"s. So, the word "strawberry" contains three "r"s.
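The model's letter-by-letter listing is easy to confirm with a couple of lines of Python, using 1-based positions to match its numbering:

```python
word = "strawberry"

# 1-based positions of every "r", matching the numbering in the model's list
positions = [index for index, letter in enumerate(word, start=1) if letter == "r"]

print(positions)       # → [3, 8, 9]
print(len(positions))  # → 3
```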

Then I switched to using LLM and the llm-ollama plugin. I tried prompting it for Python that imports CSV into SQLite:

Write a Python function import_csv(conn, url, table_name) which acceopts a connection to a SQLite databse and a URL to a CSV file and the name of a table - it then creates that table with the right columns and imports the CSV data from that URL

It thought through the different steps in detail and produced some decent looking code.
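For reference, here is a minimal sketch of what a function matching that prompt could look like. It is an illustration only, not the code QwQ produced. It assumes the CSV is UTF-8 with a header row, skips blank lines, and creates every column without a declared type, so values are stored as text:

```python
import csv
import io
import sqlite3
import urllib.request


def quote_identifier(name):
    """Quote a table or column name for safe use in SQLite SQL."""
    return '"' + name.replace('"', '""') + '"'


def import_csv(conn, url, table_name):
    """Create table_name from the CSV header at url and insert every row."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")  # assumption: UTF-8 content
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)
    table = quote_identifier(table_name)
    columns = ", ".join(quote_identifier(header) for header in headers)
    placeholders = ", ".join("?" for _ in headers)
    with conn:  # commit on success, roll back on error
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            (row for row in reader if row),  # skip blank lines
        )
```

Usage would be along the lines of `import_csv(sqlite3.connect("data.db"), url, "my_table")`, where the database filename and table name are placeholders.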

Finally, I tried this:

llm -m qwq 'Generate an SVG of a pelican riding a bicycle'

For some reason it answered in Simplified Chinese. It opened with this:

生成一个SVG图像，内容是一只鹈鹕骑着一辆自行车。这听起来挺有趣的！我需要先了解一下什么是SVG，以及如何创建这样的图像。

Which translates (using Google Translate) to:

Generate an SVG image of a pelican riding a bicycle. This sounds interesting! I need to first understand what SVG is and how to create an image like this.

It then produced a lengthy essay discussing the many aspects that go into constructing a pelican on a bicycle - full transcript here. After a full 227 seconds of constant output it produced this as the final result.

I think that's pretty good!

Tags: llm, ollama, generative-ai, ai, qwen, llms, edge-llms, pelican-riding-a-bicycle, inference-scaling

Say hello to gemini-exp-1121

2024-11-22T06:14:26+00:00

Say hello to gemini-exp-1121

Google Gemini's Logan Kilpatrick on Twitter:

Say hello to gemini-exp-1121! Our latest experimental gemini model, with:

significant gains on coding performance

stronger reasoning capabilities

improved visual understanding

Available on Google AI Studio and the Gemini API right now

The 1121 in the name is a release date of the 21st November. This comes fast on the heels of last week's gemini-exp-1114.
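That naming convention is simple enough to decode mechanically. A small sketch, assuming the last four digits of the model name are always month then day (the function name is hypothetical, and the year is not encoded in the name):

```python
import re


def release_month_day(model_name):
    """Return (month, day) from a trailing -MMDD suffix, or None if absent."""
    match = re.search(r"-(\d{2})(\d{2})$", model_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(release_month_day("gemini-exp-1121"))  # → (11, 21)
print(release_month_day("gemini-exp-1114"))  # → (11, 14)
```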

Both of these new experimental Gemini models have seen moments at the top of the Chatbot Arena. gemini-exp-1114 took the top spot a few days ago, and then lost it to a new OpenAI model called "ChatGPT-4o-latest (2024-11-20)"... only for the new gemini-exp-1121 to hold the top spot right now.

(These model names are all so, so bad.)

I released llm-gemini 0.4.2 with support for the new model - this should have been 0.5 but I already have a 0.5a0 alpha that depends on an unreleased feature in LLM core.

I tried my pelican benchmark:

llm -m gemini-exp-1121 'Generate an SVG of a pelican riding a bicycle'

Since Gemini is a multi-modal vision model, I had it describe the image it had created back to me (by feeding it a PNG render):

llm -m gemini-exp-1121 describe -a pelican.png

And got this description, which is pretty great:

The image shows a simple, stylized drawing of an insect, possibly a bee or an ant, on a vehicle. The insect is composed of a large yellow circle for the body and a smaller yellow circle for the head. It has a black dot for an eye, a small orange oval for a beak or mouth, and thin black lines for antennae and legs. The insect is positioned on top of a simple black and white vehicle with two black wheels. The drawing is abstract and geometric, using basic shapes and a limited color palette of black, white, yellow, and orange.

Update: Logan confirmed on Twitter that these models currently only have a 32,000 token input, significantly less than the rest of the Gemini family.

Tags: vision-llms, gemini, llm, google, generative-ai, ai, llms, logan-kilpatrick, pelican-riding-a-bicycle

Recraft V3

2024-11-15T04:24:09+00:00

Recraft V3

Recraft are a generative AI design tool startup based out of London who released their v3 model a few weeks ago. It's currently sat at the top of the Artificial Analysis Image Arena Leaderboard, beating Midjourney and Flux 1.1 pro.

The thing that impressed me is that it can generate both raster and vector graphics... and the vector graphics can be exported as SVG!

Here's what I got for raccoon with a sign that says "I love trash" - SVG here.

That's an editable SVG - when I open it up in Pixelmator I can select and modify the individual paths and shapes:

They also have an API. I spent $1 on 1000 credits and then spent 80 credits (8 cents) making this SVG of a pelican riding a bicycle, using my API key stored in 1Password:

export RECRAFT_API_TOKEN="$(
  op item get recraft.ai --fields label=password \
  --format json | jq .value -r)"

curl https://external.api.recraft.ai/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $RECRAFT_API_TOKEN" \
  -d '{
    "prompt": "california brown pelican riding a bicycle",
    "style": "vector_illustration",
    "model": "recraftv3"
  }'
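The same request can be made from Python with just the standard library. This sketch mirrors the curl command above: the endpoint, headers and JSON fields are copied from it, the helper names are hypothetical, and the response is returned as parsed JSON without assuming anything about its structure:

```python
import json
import os
import urllib.request

RECRAFT_URL = "https://external.api.recraft.ai/v1/images/generations"


def build_recraft_request(token, prompt):
    """Build the same POST request as the curl command above."""
    payload = {
        "prompt": prompt,
        "style": "vector_illustration",
        "model": "recraftv3",
    }
    return urllib.request.Request(
        RECRAFT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def generate_image(prompt):
    """Call the API using the token in RECRAFT_API_TOKEN and return parsed JSON."""
    token = os.environ["RECRAFT_API_TOKEN"]
    with urllib.request.urlopen(build_recraft_request(token, prompt)) as response:
        return json.loads(response.read())


# Usage, which spends credits:
# print(generate_image("california brown pelican riding a bicycle"))
```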

Tags: ai, generative-ai, text-to-image, pelican-riding-a-bicycle

Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac

2024-11-12T23:37:36+00:00

There's a whole lot of buzz around the new Qwen2.5-Coder Series of open source (Apache 2.0 licensed) LLM releases from Alibaba's Qwen research team. On first impression it looks like the buzz is well deserved.

Qwen claim:

Qwen2.5-Coder-32B-Instruct has become the current SOTA open-source code model, matching the coding capabilities of GPT-4o.

That's a big claim for a 32B model that's small enough that it can run on my 64GB MacBook Pro M2. The Qwen published scores look impressive, comparing favorably with GPT-4o and Claude 3.5 Sonnet (October 2024 edition) across various code-related benchmarks:

How about benchmarks from other researchers? Paul Gauthier's Aider benchmarks have a great reputation and Paul reports:

The new Qwen 2.5 Coder models did very well on aider's code editing benchmark. The 32B Instruct model scored in between GPT-4o and 3.5 Haiku.

84% 3.5 Sonnet, 75% 3.5 Haiku, 74% Qwen2.5 Coder 32B, 71% GPT-4o, 69% Qwen2.5 Coder 14B, 58% Qwen2.5 Coder 7B

That was for the Aider "whole edit" benchmark. The "diff" benchmark scores well too, with Qwen2.5 Coder 32B tying with GPT-4o (but a little behind Claude 3.5 Haiku).

Given these scores (and the positive buzz on Reddit) I had to try it for myself.

My attempts to run the Qwen/Qwen2.5-Coder-32B-Instruct-GGUF Q8 using llm-gguf were a bit too slow, because I don't have that compiled to use my Mac's GPU at the moment.

But both the Ollama version and the MLX version worked great!

I installed the Ollama version using:

ollama pull qwen2.5-coder:32b

That fetched a 20GB quantized file. I ran a prompt through that using my LLM tool and Sergey Alexandrov's llm-ollama plugin like this:

llm install llm-ollama
llm models # Confirming the new model is present
llm -m qwen2.5-coder:32b 'python function that takes URL to a CSV file and path to a SQLite database, fetches the CSV with the standard library, creates a table with the right columns and inserts the data'

Here's the result. The code worked, but I had to work around a frustrating ssl bug first (which wouldn't have been an issue if I'd allowed the model to use requests or httpx instead of the standard library).

I also tried running it with MLX, the fast array framework for Apple Silicon, using the mlx-lm library directly, run via uv like this:

uv run --with mlx-lm \
  mlx_lm.generate \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-8bit \
  --max-tokens 4000 \
  --prompt 'write me a python function that renders a mandelbrot fractal as wide as the current terminal'

That gave me a very satisfying result - when I ran the code it generated in a terminal I got this:

MLX reported the following performance metrics:

Prompt: 49 tokens, 95.691 tokens-per-sec
Generation: 723 tokens, 10.016 tokens-per-sec
Peak memory: 32.685 GB
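Those figures convert to wall-clock time by dividing the token count by tokens-per-second. A small sketch, assuming the three-line format shown above stays the same (the helper name is hypothetical):

```python
import re

STAT_LINE = re.compile(
    r"^(Prompt|Generation): (\d+) tokens, ([\d.]+) tokens-per-sec$"
)


def elapsed_seconds(stats_text):
    """Map each stage name to tokens divided by tokens-per-second."""
    result = {}
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stage, tokens, rate = match.groups()
            result[stage] = int(tokens) / float(rate)
    return result


stats = """Prompt: 49 tokens, 95.691 tokens-per-sec
Generation: 723 tokens, 10.016 tokens-per-sec
Peak memory: 32.685 GB"""

for stage, seconds in elapsed_seconds(stats).items():
    print(f"{stage}: {seconds:.1f}s")
```

For this run that works out to roughly half a second to process the prompt and a little over 72 seconds to generate the response.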

Let's see how it does on the Pelican on a bicycle benchmark.

llm -m qwen2.5-coder:32b 'Generate an SVG of a pelican riding a bicycle'

Here's what I got:

Questionable Pelican SVG drawings aside, this is a really promising development. 32GB is just small enough that I can run the model on my Mac without having to quit every other application I'm running, and both the speed and the quality of the results feel genuinely competitive with the current best of the hosted models.

Given that code assistance is probably around 80% of my LLM usage at the moment this is a meaningfully useful release for how I engage with this class of technology.

Tags: open-source, ai, generative-ai, edge-llms, llms, ai-assisted-programming, llm, uv, qwen, mlx, ollama, pelican-riding-a-bicycle

Pelicans on a bicycle

2024-10-25T23:56:50+00:00

Pelicans on a bicycle

I decided to roll out my own LLM benchmark: how well can different models render an SVG of a pelican riding a bicycle?

I chose that because a) I like pelicans and b) I'm pretty sure there aren't any pelican on a bicycle SVG files floating around (yet) that might have already been sucked into the training data.

My prompt:

Generate an SVG of a pelican riding a bicycle

I've run it through 16 models so far - from OpenAI, Anthropic, Google Gemini and Meta (Llama running on Cerebras), all using my LLM CLI utility. Here's my (Claude assisted) Bash script: generate-svgs.sh
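Models often wrap the SVG in commentary, so a benchmark like this needs a step that pulls the markup out of the response. This is a minimal sketch of one way to do that in Python. It is not taken from generate-svgs.sh, the function name is hypothetical, and it assumes the response contains at most one complete svg element:

```python
import re

# Non-greedy match from the first <svg ...> to the first closing </svg>
SVG_PATTERN = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)


def extract_svg(response_text):
    """Return the first <svg>...</svg> block in the response, or None."""
    match = SVG_PATTERN.search(response_text)
    return match.group(0) if match else None


response = (
    "Here is your pelican:\n"
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">\n'
    '  <circle cx="5" cy="5" r="4" />\n'
    "</svg>\n"
    "I hope you like it!"
)
print(extract_svg(response))
```

A response that never closes (or never opens) its svg tag returns None, which is a useful signal that the output needs a manual look.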

Here's Claude 3.5 Sonnet (2024-06-20) and Claude 3.5 Sonnet (2024-10-22):

Gemini 1.5 Flash 001 and Gemini 1.5 Flash 002:

GPT-4o mini and GPT-4o:

o1-mini and o1-preview:

Cerebras Llama 3.1 70B and Llama 3.1 8B:

And a special mention for Gemini 1.5 Flash 8B:

The rest of them are linked from the README.

Tags: gemini, anthropic, llama, openai, ai, llms, svg, generative-ai, llm, cerebras, pelican-riding-a-bicycle