Cognitive load is what matters |
https://minds.md/zakirullin/cognitive |
Excellent living document maintained by Artem Zakirullin about minimizing the cognitive load needed to understand and maintain software (the underlying repo has accumulated [625 commits](https://github.com/zakirullin/cognitive-load/commits/main/) since it was created in May 2023).
This all rings very true to me. I judge the quality of a piece of code by how easy it is to change, and anything that causes me to take on more cognitive load - unraveling a class hierarchy, reading through dozens of tiny methods - reduces the quality of the code by that metric.
Lots of accumulated snippets of wisdom in this one.
> Mantras like "methods should be shorter than 15 lines of code" or "classes should be small" turned out to be somewhat wrong. |
2024-12-26 06:01:08+00:00 |
deepseek-ai/DeepSeek-V3-Base |
https://huggingface.co/deepseek-ai/DeepSeek-V3-Base |
No model card or announcement yet, but this new model release from Chinese AI lab DeepSeek (an arm of Chinese hedge fund [High-Flyer](https://en.wikipedia.org/wiki/High-Flyer_(company))) looks very significant.
It's a huge model - 685B parameters, 687.9 GB on disk ([TIL how to size a git-lfs repo](https://til.simonwillison.net/git/size-of-lfs-files)). The architecture is [a Mixture of Experts](https://twitter.com/dysondunbar/status/1871955700949430299) with 256 experts, using 8 per token.
For comparison, Meta AI's largest released model is their [Llama 3.1 model](https://ai.meta.com/blog/meta-llama-3-1/) with 405B parameters.
The new model is apparently available to some people via both [chat.deepseek.com](https://chat.deepseek.com/) and the DeepSeek API as part of a staged rollout.
Paul Gauthier got API access and [used it](https://twitter.com/paulgauthier/status/1871919612000092632) to update his new [Aider Polyglot leaderboard](https://aider.chat/docs/leaderboards/) - DeepSeek v3 preview scored 48.4%, putting it in second place behind `o1-2024-12-17 (high)` and in front of both `claude-3-5-sonnet-20241022` and `gemini-exp-1206`!
![Aider leaderboard chart showing DeepSeek Chat V3 preview in second place](https://static.simonwillison.net/static/2024/deepseek-v3.jpg)
I never know how much to trust a model's description of itself (the first time I asked "what model are you?" it claimed to be "based on OpenAI's GPT-4 architecture"), but I just got this result using [LLM](https://llm.datasette.io/) and the [llm-deepseek](https://pypi.org/project/llm-deepseek/) plugin:
llm -m deepseek-chat 'what deepseek model are you?'
> I'm DeepSeek-V3 created exclusively by DeepSeek. I'm an AI assistant, and I'm at your service! Feel free to ask me anything you'd like. I'll do my best to assist you.
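If you want to try that yourself, setup should be roughly two commands - assuming the plugin follows the usual LLM plugin conventions for API keys (check its README for the exact key name):

llm install llm-deepseek

llm keys set deepseek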
Here's my [initial experiment log](https://gist.github.com/simonw/e7528dc52828fb31415f6e14e3527b93). |
2024-12-25 19:00:33+00:00 |
Finally, a replacement for BERT: Introducing ModernBERT |
https://www.answer.ai/posts/2024-12-19-modernbert.html |
[BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) was an early language model released by Google in October 2018. Unlike modern LLMs it wasn't designed for generating text. BERT was trained for masked token prediction and was generally applied to problems like Named Entity Recognition or Sentiment Analysis. BERT also wasn't very useful on its own - most applications required you to fine-tune a model on top of it.
In exploring BERT I decided to try out [dslim/distilbert-NER](https://huggingface.co/dslim/distilbert-NER), a popular Named Entity Recognition model fine-tuned on top of DistilBERT (a smaller distilled version of the original BERT model). [Here are my notes](https://til.simonwillison.net/llms/bert-ner) on running that using `uv run`.
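Roughly, running a model like that looks like this with the Transformers pipeline API - a minimal sketch rather than the exact script from my notes (run it inside something like `uv run --with torch --with transformers python`):

```python
from transformers import pipeline

# Load the fine-tuned NER model; aggregation_strategy="simple" merges
# sub-word tokens back into whole named entities
ner = pipeline("ner", model="dslim/distilbert-NER", aggregation_strategy="simple")

print(ner("Simon Willison is a programmer who lives in California."))
# Returns a list of dicts with entity_group, score, word, start and end offsets
```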
Jeremy Howard's [Answer.AI](https://www.answer.ai/) research group, [LightOn](https://www.lighton.ai/) and friends supported the development of ModernBERT, a brand new BERT-style model that applies many enhancements from the past six years of advances in this space.
While BERT was trained on 3.3 billion tokens, producing 110 million and 340 million parameter models, ModernBERT was trained on 2 trillion tokens, resulting in 140 million and 395 million parameter models. The parameter count hasn't increased much because it's designed to run on lower-end hardware. It has an 8192 token context length, a significant improvement on BERT's 512.
I was able to run one of the demos from the announcement post using `uv run` like this (I'm not sure why I had to use `numpy<2.0` but without that I got an error about `cannot import name 'ComplexWarning' from 'numpy.core.numeric'`):
<div class="highlight highlight-source-shell"><pre>uv run --with <span class="pl-s"><span class="pl-pds">'</span>numpy<2.0<span class="pl-pds">'</span></span> --with torch --with <span class="pl-s"><span class="pl-pds">'</span>git+https://github.com/huggingface/transformers.git<span class="pl-pds">'</span></span> python</pre></div>
Then this Python:
<pre><span class="pl-k">import</span> <span class="pl-s1">torch</span>
<span class="pl-k">from</span> <span class="pl-s1">transformers</span> <span class="pl-k">import</span> <span class="pl-s1">pipeline</span>
<span class="pl-k">from</span> <span class="pl-s1">pprint</span> <span class="pl-k">import</span> <span class="pl-s1">pprint</span>
<span class="pl-s1">pipe</span> <span class="pl-c1">=</span> <span class="pl-en">pipeline</span>(
<span class="pl-s">"fill-mask"</span>,
<span class="pl-s1">model</span><span class="pl-c1">=</span><span class="pl-s">"answerdotai/ModernBERT-base"</span>,
<span class="pl-s1">torch_dtype</span><span class="pl-c1">=</span><span class="pl-s1">torch</span>.<span class="pl-c1">bfloat16</span>,
)
<span class="pl-s1">input_text</span> <span class="pl-c1">=</span> <span class="pl-s">"He walked to the [MASK]."</span>
<span class="pl-s1">results</span> <span class="pl-c1">=</span> <span class="pl-en">pipe</span>(<span class="pl-s1">input_text</span>)
<span class="pl-en">pprint</span>(<span class="pl-s1">results</span>)</pre>
Which downloaded 573MB to `~/.cache/huggingface/hub/models--answerdotai--ModernBERT-base` and output:
```python
[{'score': 0.11669921875,
  'sequence': 'He walked to the door.',
  'token': 3369,
  'token_str': ' door'},
 {'score': 0.037841796875,
  'sequence': 'He walked to the office.',
  'token': 3906,
  'token_str': ' office'},
 {'score': 0.0277099609375,
  'sequence': 'He walked to the library.',
  'token': 6335,
  'token_str': ' library'},
 {'score': 0.0216064453125,
  'sequence': 'He walked to the gate.',
  'token': 7394,
  'token_str': ' gate'},
 {'score': 0.020263671875,
  'sequence': 'He walked to the window.',
  'token': 3497,
  'token_str': ' window'}]
```
I'm looking forward to trying out models that use ModernBERT as their base. The model release is accompanied by a paper ([Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference](https://arxiv.org/abs/2412.13663)) and [new documentation](https://huggingface.co/docs/transformers/main/en/model_doc/modernbert) for using it with the Transformers library. |
2024-12-24 06:21:29+00:00 |
openai/openai-openapi |
https://github.com/openai/openai-openapi |
Seeing as the LLM world has semi-standardized on imitating OpenAI's API format for a whole host of different tools, it's useful to note that OpenAI themselves maintain a dedicated repository with an [OpenAPI](https://www.openapis.org/) YAML representation of their current API.
(I get OpenAI and OpenAPI typo-confused all the time, so `openai-openapi` is a delightfully fiddly repository name.)
The [openapi.yaml](https://github.com/openai/openai-openapi/blob/master/openapi.yaml) file itself is over 26,000 lines long, defining 76 API endpoints ("paths" in OpenAPI terminology) and 284 "schemas" for JSON that can be sent to and from those endpoints. A much more interesting view onto it is the [commit history](https://github.com/openai/openai-openapi/commits/master/openapi.yaml) for that file, showing details of when each different API feature was released.
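Those counts will drift as OpenAI ships new features, but they're straightforward to reproduce - here's a minimal sketch using PyYAML (in OpenAPI 3 the reusable schemas live under `components` → `schemas`):

```python
import urllib.request
import yaml

url = "https://raw.githubusercontent.com/openai/openai-openapi/refs/heads/master/openapi.yaml"
spec = yaml.safe_load(urllib.request.urlopen(url))

# "paths" holds the API endpoints, "components/schemas" the reusable JSON schemas
print(len(spec["paths"]), "paths,", len(spec["components"]["schemas"]), "schemas")
```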
Browsing 26,000 lines of YAML isn't pleasant, so I [got Claude](https://gist.github.com/simonw/54b4e533481cc7a686b0172c3a9ac21e) to build me a rudimentary YAML expand/hide exploration tool. Here's that tool running against the OpenAI schema, loaded directly from GitHub via a CORS-enabled `fetch()` call: [https://tools.simonwillison.net/yaml-explorer#eyJ1c...](https://tools.simonwillison.net/yaml-explorer#eyJ1cmwiOiJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vb3BlbmFpL29wZW5haS1vcGVuYXBpL3JlZnMvaGVhZHMvbWFzdGVyL29wZW5hcGkueWFtbCIsIm9wZW4iOlsiZDAiLCJkMjAiXX0=) - the fragment after the `#` is base64-encoded JSON capturing the current state of the tool (mostly Claude's idea).
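For the curious, that state fragment decodes with a couple of lines of Python:

```python
import base64, json

fragment = "eyJ1cmwiOiJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vb3BlbmFpL29wZW5haS1vcGVuYXBpL3JlZnMvaGVhZHMvbWFzdGVyL29wZW5hcGkueWFtbCIsIm9wZW4iOlsiZDAiLCJkMjAiXX0="
print(json.loads(base64.b64decode(fragment)))
# {'url': 'https://raw.githubusercontent.com/openai/openai-openapi/refs/heads/master/openapi.yaml',
#  'open': ['d0', 'd20']}
```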
![Screenshot of the YAML explorer, showing a partially expanded set of sections from the OpenAI API specification.](https://static.simonwillison.net/static/2024/yaml-explorer.jpg)
The tool is a little buggy - the expand-all option doesn't work quite how I want - but it's useful enough for the moment.
**Update**: It turns out the [petstore.swagger.io](https://petstore.swagger.io/) demo has an (as far as I can tell) undocumented `?url=` parameter which can load external YAML files, so [here's openai-openapi/openapi.yaml](https://petstore.swagger.io/?url=https://raw.githubusercontent.com/openai/openai-openapi/refs/heads/master/openapi.yaml) in an OpenAPI explorer interface.
![The Swagger API browser showing the OpenAI API](https://static.simonwillison.net/static/2024/swagger.jpg) |
2024-12-22 22:59:25+00:00 |
What happened to the world's largest tube TV? |
https://www.youtube.com/watch?v=JfZxOuc9Qwk |
This YouTube video is an absolute delight.
<p><lite-youtube videoid="JfZxOuc9Qwk"
title="What happened to the world's largest tube TV?"
playlabel="Play: What happened to the world's largest tube TV?"
> </lite-youtube></p>
Shank Mods describes the legendary [Sony PVM-4300](https://consolemods.org/wiki/CRT:PVM-4300) - the largest CRT television ever made, released by Sony in 1989 and weighing over 400lb. CRT enthusiasts had long debated its very existence, given the lack of known specimens outside of Sony's old marketing materials. Then Shank tracked a working one down... on the second floor of a 300-year-old soba noodle restaurant in Osaka, Japan.
This story of how they raced to rescue the TV before the restaurant was demolished, given the immense difficulty of moving a 400lb television (and then shipping it to the USA), is a fantastic ride. |
2024-12-22 21:41:45+00:00 |
Clay UI library |
https://www.nicbarker.com/clay |
Fascinating project by Nic Barker, who describes Clay like this:
> Clay is a flex-box style UI auto layout library in C, with declarative syntax and microsecond performance.
His [intro video](https://www.youtube.com/watch?v=DYWTw19_8r4) to the library is outstanding: I learned a ton about how UI layout works from this, and the animated visual explanations are clear, tasteful and really helped land the different concepts:
<p><lite-youtube videoid="DYWTw19_8r4"
title="Introducing Clay - High Performance UI Layout in C"
playlabel="Play: Introducing Clay - High Performance UI Layout in C"
> </lite-youtube></p>
Clay is a C library delivered in a single ~2000 line [clay.h](https://github.com/nicbarker/clay/blob/main/clay.h) dependency-free header file. It only handles layout calculations: if you want to render the result you need to add an additional rendering layer.
In a fascinating demo of the library, the [Clay site itself](https://www.nicbarker.com/clay) is rendered using Clay, compiled from C to WebAssembly! You can even switch between the default HTML renderer and an alternative based on Canvas.
This isn't necessarily a great idea: because the layout is handled entirely with `<div>` elements positioned using `transform: translate(0px, 70px)` style CSS, attempting to select text across multiple boxes behaves strangely, and it's not clear to me what the accessibility implications are.
**Update**: [Matt Campbell](https://toot.cafe/@matt/113693374074675126):
> The accessibility implications are as serious as you might guess. The links aren't properly labeled, there's no semantic markup such as headings, and since there's a div for every line, continuous reading with a screen reader is choppy, that is, it pauses at the end of every physical line.
It does make for a very compelling demo of what Clay is capable of though, especially when you resize your browser window and the page layout is recalculated in real-time via the Clay WebAssembly bridge.
You can hit "D" on the website and open up a custom Clay debugger showing the hierarchy of layout elements on the page:
![Clay website on the left, on the right is a panel showing a tree of UI layout elements, one has been selected and is showing details in a box at the bottom of the panel: Bounding Box: { x: 278, y: 13, width: 101, height: 24}, Layout Direction: LEFT_TO_RIGHT, Sizing: width: FIT, height: FIT, Padding: {x:8,y:0}](https://static.simonwillison.net/static/2024/clay-debug.jpg)
This also means that the entire page is defined using C code! Given that, I find the code itself [surprisingly readable](https://github.com/nicbarker/clay/blob/35d72e5fba6872be48d15ed9d84269a86cd72b4e/examples/clay-official-website/main.c#L124-L139):
<div class="highlight highlight-source-c"><pre><span class="pl-smi">void</span> <span class="pl-en">DeclarativeSyntaxPageDesktop</span>() {
<span class="pl-en">CLAY</span>(<span class="pl-en">CLAY_ID</span>(<span class="pl-s">"SyntaxPageDesktop"</span>), <span class="pl-en">CLAY_LAYOUT</span>({ .<span class="pl-s1">sizing</span> <span class="pl-c1">=</span> { <span class="pl-en">CLAY_SIZING_GROW</span>(), <span class="pl-en">CLAY_SIZING_FIT</span>({ .<span class="pl-s1">min</span> <span class="pl-c1">=</span> <span class="pl-s1">windowHeight</span> <span class="pl-c1">-</span> <span class="pl-c1">50</span> }) }, .<span class="pl-s1">childAlignment</span> <span class="pl-c1">=</span> {<span class="pl-c1">0</span>, <span class="pl-c1">CLAY_ALIGN_Y_CENTER</span>}, .<span class="pl-s1">padding</span> <span class="pl-c1">=</span> {.<span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-c1">50</span>} })) {
<span class="pl-c1">CLAY</span>(<span class="pl-en">CLAY_ID</span>(<span class="pl-s">"SyntaxPage"</span>), <span class="pl-c1">CLAY_LAYOUT</span>({ .<span class="pl-s1">sizing</span> <span class="pl-c1">=</span> { <span class="pl-en">CLAY_SIZING_GROW</span>(), <span class="pl-en">CLAY_SIZING_GROW</span>() }, .<span class="pl-s1">childAlignment</span> <span class="pl-c1">=</span> { <span class="pl-c1">0</span>, <span class="pl-c1">CLAY_ALIGN_Y_CENTER</span> }, .<span class="pl-s1">padding</span> <span class="pl-c1">=</span> { <span class="pl-c1">32</span>, <span class="pl-c1">32</span> }, .<span class="pl-s1">childGap</span> <span class="pl-c1">=</span> <span class="pl-c1">32</span> }), <span class="pl-en">CLAY_BORDER</span>({ .<span class="pl-s1">left</span> <span class="pl-c1">=</span> { <span class="pl-c1">2</span>, <span class="pl-c1">COLOR_RED</span> }, .<span class="pl-s1">right</span> <span class="pl-c1">=</span> { <span class="pl-c1">2</span>, <span class="pl-c1">COLOR_RED</span> } })) {
<span class="pl-c1">CLAY</span>(<span class="pl-en">CLAY_ID</span>(<span class="pl-s">"SyntaxPageLeftText"</span>), <span class="pl-c1">CLAY_LAYOUT</span>({ .<span class="pl-s1">sizing</span> <span class="pl-c1">=</span> { <span class="pl-en">CLAY_SIZING_PERCENT</span>(<span class="pl-c1">0.5</span>) }, .<span class="pl-c1">layoutDirection</span> <span class="pl-c1">=</span> <span class="pl-c1">CLAY_TOP_TO_BOTTOM</span>, .<span class="pl-c1">childGap</span> <span class="pl-c1">=</span> <span class="pl-c1">8</span> })) {
<span class="pl-en">CLAY_TEXT</span>(<span class="pl-en">CLAY_STRING</span>(<span class="pl-s">"Declarative Syntax"</span>), <span class="pl-en">CLAY_TEXT_CONFIG</span>({ .<span class="pl-s1">fontSize</span> <span class="pl-c1">=</span> <span class="pl-c1">52</span>, .<span class="pl-c1">fontId</span> <span class="pl-c1">=</span> <span class="pl-c1">FONT_ID_TITLE_56</span>, .<span class="pl-c1">textColor</span> <span class="pl-c1">=</span> <span class="pl-c1">COLOR_RED</span> }));
<span class="pl-en">CLAY</span>(<span class="pl-en">CLAY_ID</span>(<span class="pl-s">"SyntaxSpacer"</span>), <span class="pl-en">CLAY_LAYOUT</span>({ .<span class="pl-s1">sizing</span> <span class="pl-c1">=</span> { <span class="pl-en">CLAY_SIZING_GROW</span>({ .<span class="pl-s1">max</span> <span class="pl-c1">=</span> <span class="pl-c1">16</span> }) } })) {}
<span class="pl-en">CLAY_TEXT</span>(<span class="pl-en">CLAY_STRING</span>(<span class="pl-s">"Flexible and readable declarative syntax with nested UI element hierarchies."</span>), <span class="pl-en">CLAY_TEXT_CONFIG</span>({ .<span class="pl-s1">fontSize</span> <span class="pl-c1">=</span> <span class="pl-c1">28</span>, .<span class="pl-c1">fontId</span> <span class="pl-c1">=</span> <span class="pl-c1">FONT_ID_BODY_36</span>, .<span class="pl-c1">textColor</span> <span class="pl-c1">=</span> <span class="pl-c1">COLOR_RED</span> }));
<span class="pl-en">CLAY_TEXT</span>(<span class="pl-en">CLAY_STRING</span>(<span class="pl-s">"Mix elements with standard C code like loops, conditionals and functions."</span>), <span class="pl-en">CLAY_TEXT_CONFIG</span>({ .<span class="pl-s1">fontSize</span> <span class="pl-c1">=</span> <span class="pl-c1">28</span>, .<span class="pl-c1">fontId</span> <span class="pl-c1">=</span> <span class="pl-c1">FONT_ID_BODY_36</span>, .<span class="pl-c1">textColor</span> <span class="pl-c1">=</span> <span class="pl-c1">COLOR_RED</span> }));
<span class="pl-en">CLAY_TEXT</span>(<span class="pl-en">CLAY_STRING</span>(<span class="pl-s">"Create your own library of re-usable components from UI primitives like text, images and rectangles."</span>), <span class="pl-en">CLAY_TEXT_CONFIG</span>({ .<span class="pl-s1">fontSize</span> <span class="pl-c1">=</span> <span class="pl-c1">28</span>, .<span class="pl-c1">fontId</span> <span class="pl-c1">=</span> <span class="pl-c1">FONT_ID_BODY_36</span>, .<span class="pl-c1">textColor</span> <span class="pl-c1">=</span> <span class="pl-c1">COLOR_RED</span> }));
}
<span class="pl-en">CLAY</span>(<span class="pl-en">CLAY_ID</span>(<span class="pl-s">"SyntaxPageRightImage"</span>), <span class="pl-en">CLAY_LAYOUT</span>({ .<span class="pl-s1">sizing</span> <span class="pl-c1">=</span> { <span class="pl-en">CLAY_SIZING_PERCENT</span>(<span class="pl-c1">0.50</span>) }, .<span class="pl-c1">childAlignment</span> <span class="pl-c1">=</span> {.<span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-c1">CLAY_ALIGN_X_CENTER</span>} })) {
<span class="pl-c1">CLAY</span>(<span class="pl-en">CLAY_ID</span>(<span class="pl-s">"SyntaxPageRightImageInner"</span>), <span class="pl-en">CLAY_LAYOUT</span>({ .<span class="pl-s1">sizing</span> <span class="pl-c1">=</span> { <span class="pl-en">CLAY_SIZING_GROW</span>({ .<span class="pl-s1">max</span> <span class="pl-c1">=</span> <span class="pl-c1">568</span> }) } }), <span class="pl-c1">CLAY_IMAGE</span>({ .<span class="pl-s1">sourceDimensions</span> <span class="pl-c1">=</span> {<span class="pl-c1">1136</span>, <span class="pl-c1">1194</span>}, .<span class="pl-s1">sourceURL</span> <span class="pl-c1">=</span> <span class="pl-en">CLAY_STRING</span>(<span class="pl-s">"/clay/images/declarative.png"</span>) })) {}
}
}
}
}</pre></div>
I'm not ready to ditch HTML and CSS for writing my web pages in C compiled to WebAssembly just yet, but as an exercise in understanding layout engines (and a potential tool for building non-web interfaces in the future) this is a really interesting project to dig into.
To clarify here: I don't think the web layout / WebAssembly thing is the key idea behind Clay at all - it's not what Clay is *for* - but it's certainly a neat and interesting way to demo a layout library!
Nic [confirms](https://bsky.app/profile/nicbarker.com/post/3ldu44rxyx22h):
> You totally nailed it, the fact that you can compile to wasm and run in HTML stemmed entirely from a “wouldn’t it be cool if…” It was designed for my C projects first and foremost! |
2024-12-21 23:12:17+00:00 |
OpenAI o3 breakthrough high score on ARC-AGI-PUB |
https://arcprize.org/blog/oai-o3-pub-breakthrough |
François Chollet is a co-founder of the ARC Prize and had advance access to today's o3 results. His article here is the most insightful coverage I've seen of o3, going beyond just the benchmark results to talk about what this all means for the field in general.
One fascinating detail: it cost $6,677 to run o3 in "high efficiency" mode against the 400 public ARC-AGI puzzles for a score of 82.8%, and an undisclosed amount of money to run the "low efficiency" mode model to score 91.5%. A note says:
> o3 high-compute costs not available as pricing and feature availability is still TBD. The amount of compute was roughly 172x the low-compute configuration.
So we can get a ballpark estimate here in that 172 * $6,677 = $1,148,444!
Here's how François explains the likely mechanisms behind o3, which reminds me of how a brute-force chess computer might work.
> For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model. To note, Demis Hassabis hinted back in a June 2023 interview that DeepMind had been researching this very idea – this line of work has been a long time coming.
>
> So while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programs, where the program itself (the CoT) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination (you could also do test-time training, or search in latent space), it represents the current state-of-the-art as per these new ARC-AGI numbers.
>
> Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of "programs" (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.
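To make the shape of that idea concrete, here's a purely illustrative sketch of evaluator-guided search over chains of thought. It's closer to a simple beam search than true Monte-Carlo tree search and is emphatically not OpenAI's actual (unpublished) implementation - `propose()` and `score()` are hypothetical stand-ins for the base LLM and the evaluator model:

```python
# Illustrative only: search over natural-language "programs" (chains of thought),
# with a generator model proposing steps and an evaluator model scoring them.

def search_for_solution(task, propose, score, beam_width=8, max_steps=16):
    beam = [""]  # partial chains of thought, starting from nothing
    for _ in range(max_steps):
        candidates = []
        for cot in beam:
            # The base LLM proposes several ways to extend each partial chain of thought
            candidates.extend(cot + step for step in propose(task, cot))
        # The evaluator scores every candidate; keep only the most promising ones
        candidates.sort(key=lambda cot: score(task, cot), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]  # the best chain of thought found within the compute budget
```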
I'm not sure if o3 (and o1 and similar models) even qualifies as an LLM any more - there's clearly a whole lot more going on here than just next-token prediction.
On the question of whether o3 should qualify as AGI (whatever that might mean):
> Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
>
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
The post finishes with examples of the puzzles that o3 *didn't* manage to solve, including this one which reassured me that I can still solve at least some puzzles that couldn't be handled with thousands of dollars of GPU compute!
![A puzzle with colored squares, where drawing a line between the single blue squares and turning any intersected rectangles blue is clearly the solution.](https://static.simonwillison.net/static/2024/arc-agi-task-0d87d2a6.png) |
2024-12-20 22:17:42+00:00 |
Building effective agents |
https://www.anthropic.com/research/building-effective-agents |
My principal complaint about the term "agents" is that while it has many different potential definitions, most of the people who use it seem to assume that everyone else shares and understands the definition that they have chosen to use.
This outstanding piece by Erik Schluntz and Barry Zhang at Anthropic bucks that trend from the start, providing a clear definition that they then use throughout.
They discuss "agentic systems" as a parent term, then define a distinction between "workflows" - systems where multiple LLMs are orchestrated together using pre-defined patterns - and "agents", where the LLMs "dynamically direct their own processes and tool usage". This second definition is later expanded with this delightfully clear description:
> Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it's crucial for the agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess its progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.
That's a definition I can live with!
They also introduce a term that I _really_ like: **the augmented LLM**. This is an LLM with augmentations such as tools - I've seen people use the term "agents" just for this, which never felt right to me.
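In code terms an augmented LLM is essentially the familiar tool-calling loop. Here's a minimal sketch - `call_model()` and its response shape are hypothetical stand-ins, not any particular vendor's API:

```python
# Hypothetical sketch of an "augmented LLM": a model plus tools it can choose to call.

TOOLS = {
    "search": lambda query: f"(search results for {query!r})",  # toy tool implementation
}

def augmented_llm(prompt, call_model, max_tool_calls=5):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_tool_calls):
        response = call_model(messages, tools=list(TOOLS))
        if response.get("tool") is None:
            return response["content"]  # the model answered directly - we're done
        # Run the requested tool and feed the result back into the conversation
        result = TOOLS[response["tool"]](response["arguments"])
        messages.append({"role": "tool", "content": result})
    return "(gave up after too many tool calls)"
```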
The rest of the article is the clearest practical guide to building systems that combine multiple LLM calls that I've seen anywhere.
Most of the focus is actually on workflows. They describe five different patterns for workflows in detail:
- Prompt chaining, e.g. generating a document and then translating it to a separate language as a second LLM call
- Routing, where an initial LLM call decides which model or call should be used next (sending easy tasks to Haiku and harder tasks to Sonnet, for example)
- Parallelization, where a task is broken up and run in parallel (e.g. image-to-text on multiple document pages at once) or processed by some kind of voting mechanism
- Orchestrator-workers, where an orchestrator triggers multiple LLM calls that are then synthesized together, for example running searches against multiple sources and combining the results
- Evaluator-optimizer, where one model checks the work of another in a loop
These patterns all make sense to me, and giving them clear names makes them easier to reason about.
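As a concrete illustration, here's a minimal sketch of the last of those patterns, evaluator-optimizer, with `llm()` as a hypothetical stand-in for a single model call:

```python
# Hypothetical sketch of the evaluator-optimizer workflow: one prompt generates,
# another reviews, and they loop until the reviewer is satisfied.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API of choice here")

def evaluator_optimizer(task: str, max_rounds: int = 3) -> str:
    draft = llm(f"Write code for this task:\n{task}")
    for _ in range(max_rounds):
        review = llm(
            f"Review this code for the task {task!r}. "
            f"Reply APPROVED if it is correct, otherwise list the problems.\n\n{draft}"
        )
        if review.strip().startswith("APPROVED"):
            break
        # Feed the evaluator's feedback back to the generator and try again
        draft = llm(f"Improve the code based on this review.\n\nCode:\n{draft}\n\nReview:\n{review}")
    return draft
```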
When should you upgrade from basic prompting to workflows and then to full agents? The authors provide this sensible warning:
> When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all.
But assuming you do need to go beyond what can be achieved even with the aforementioned workflow patterns, their model for agents may be a useful fit:
> Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
>
> The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails
They also warn against investing in complex agent frameworks before you've exhausted your options using direct API access and simple code.
The article is accompanied by a brand new set of [cookbook recipes](https://github.com/anthropics/anthropic-cookbook/tree/main/patterns/agents) illustrating all five of the workflow patterns. The [Evaluator-Optimizer Workflow](https://github.com/anthropics/anthropic-cookbook/blob/main/patterns/agents/evaluator_optimizer.ipynb) example is particularly fun, setting up a code-generating prompt and a code-reviewing evaluator prompt and having them loop until the evaluator is happy with the result. |
2024-12-20 05:50:33+00:00 |
Is AI progress slowing down? |
https://www.aisnakeoil.com/p/is-ai-progress-slowing-down |
This piece by Arvind Narayanan, Sayash Kapoor and Benedikt Ströbl is the single most insightful essay about AI and LLMs I've seen in a long time. It's long and worth reading every inch of it - it defies summarization, but I'll try anyway.
The key question they address is the widely discussed issue of whether model scaling has stopped working. Last year it seemed like the secret to ever-increasing model capabilities was to keep dumping in more data and parameters and training time, but the lack of a convincing leap forward in the two years since GPT-4 - from any of the big labs - suggests that's no longer the case.
> The new dominant narrative seems to be that model scaling is dead, and “inference scaling”, also known as “test-time compute scaling” is the way forward for improving AI capabilities. The idea is to spend more and more computation when using models to perform a task, such as by having them “think” before responding.
Inference scaling is the trick introduced by OpenAI's o1 and now explored by other models such as Qwen's [QwQ](https://simonwillison.net/2024/Nov/27/qwq/). It's an increasingly practical approach as inference gets more efficient and cost per token continues to [drop through the floor](https://simonwillison.net/tags/llm-pricing/).
But how far can inference scaling take us, especially if it's only effective for certain types of problem?
> The straightforward, intuitive answer to the first question is that inference scaling is useful for problems that have clear correct answers, such as coding or mathematical problem solving. [...] In contrast, for tasks such as writing or language translation, it is hard to see how inference scaling can make a big difference, especially if the limitations are due to the training data. For example, if a model works poorly in translating to a low-resource language because it isn’t aware of idiomatic phrases in that language, the model can’t reason its way out of this.
There's a delightfully spicy section about why it's a bad idea to defer to the expertise of industry insiders:
> In short, the reasons why one might give more weight to insiders’ views aren’t very important. On the other hand, there’s a huge and obvious reason why we should probably give less weight to their views, which is that they have an incentive to say things that are in their commercial interests, and have a track record of doing so.
I also enjoyed this note about how we are still potentially years behind in figuring out how to build usable applications that take full advantage of the capabilities we have today:
> The furious debate about whether there is a capability slowdown is ironic, because the link between capability increases and the real-world usefulness of AI is extremely weak. The development of AI-based [applications](https://www.ben-evans.com/benedictevans/2024/4/19/looking-for-ai-use-cases) lags far behind the increase of AI capabilities, so even existing AI capabilities remain greatly underutilized. One reason is the [capability-reliability gap](https://www.aisnakeoil.com/i/147899150/reliability) --- even when a certain capability exists, it may not work reliably enough that you can take the human out of the loop and actually automate the task (imagine a food delivery app that only works 80% of the time). And the methods for improving reliability are often application-dependent and distinct from methods for improving capability. That said, reasoning models also seem to exhibit [reliability improvements](https://youtu.be/iBfQTnA2n2s?si=a-760cPz5ZghJc7w&t=161), which is exciting. |
2024-12-19 18:10:23+00:00 |
q and qv zsh functions for asking questions of websites and YouTube videos with LLM |
https://github.com/davidgasquez/dotfiles/blob/bb9df4a369dbaef95ca0c35642de491c7dd41269/shell/zshrc#L50-L99 |
Spotted these in David Gasquez's `zshrc` dotfiles: two shell functions that use my [LLM](https://llm.datasette.io/) tool to answer questions about a website or YouTube video.
Here's how to ask a question of a website:
q https://simonwillison.net/ 'What has Simon written about recently?'
I got back:
> Recently, Simon Willison has written about various topics including:
>
> 1. **Building Python Tools** - Exploring one-shot applications using Claude and dependency management with `uv`.
> 2. **Modern Java Usage** - Discussing recent developments in Java that simplify coding.
> 3. **GitHub Copilot Updates** - New free tier and features in GitHub Copilot for Vue and VS Code.
> 4. **AI Engagement on Bluesky** - Investigating the use of bots to create artificially polite disagreements.
> 5. **OpenAI WebRTC Audio** - Demonstrating a new API for real-time audio conversation with models.
It works by constructing a [Jina Reader URL](https://simonwillison.net/2024/Jun/16/jina-ai-reader/) to convert that URL to Markdown, then piping that content into LLM along with the question.
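Here's a simplified sketch of that idea - the real function in David's dotfiles has more plumbing, this just shows the shape:

```zsh
# Simplified sketch, not the original function: fetch the page as Markdown via
# Jina Reader, then pipe it into LLM along with the question.
q() {
    local url="$1"
    local question="$2"
    curl -s "https://r.jina.ai/$url" | llm "$question"
}
```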
The YouTube one is even more fun:
qv 'https://www.youtube.com/watch?v=uRuLgar5XZw' 'what does Simon say about open source?'
It said (about [this 72 minute video](https://www.youtube.com/watch?v=uRuLgar5XZw)):
> Simon emphasizes that open source has significantly increased productivity in software development. He points out that before open source, developers often had to recreate existing solutions or purchase proprietary software, which often limited customization. The availability of open source projects has made it easier to find and utilize existing code, which he believes is one of the primary reasons for more efficient software development today.
The secret sauce behind that one is the way it uses `yt-dlp` to extract just the subtitles for the video:
local subtitle_url=$(yt-dlp -q --skip-download --convert-subs srt --write-sub --sub-langs "en" --write-auto-sub --print "requested_subtitles.en.url" "$url")
local content=$(curl -s "$subtitle_url" | sed '/^$/d' | grep -v '^[0-9]*$' | grep -v '\-->' | sed 's/<[^>]*>//g' | tr '\n' ' ')
That first line retrieves a URL to the subtitles in WEBVTT format - I [saved a copy of that here](https://gist.github.com/simonw/7f07837cf8adcee23fd5cd5394170f27). The second line then uses `curl` to fetch them, then `sed` and `grep` to remove the timestamp information, producing [this](https://gist.github.com/simonw/7f07837cf8adcee23fd5cd5394170f27?permalink_comment_id=5350044#gistcomment-5350044). |
2024-12-19 15:42:34+00:00 |