<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: ai2</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/ai2.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-22T23:59:46+00:00</updated><author><name>Simon Willison</name></author><entry><title>Olmo 3 is a fully open LLM</title><link href="https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag" rel="alternate"/><published>2025-11-22T23:59:46+00:00</published><updated>2025-11-22T23:59:46+00:00</updated><id>https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Olmo is the LLM series from Ai2 - the &lt;a href="https://allenai.org/"&gt;Allen Institute for AI&lt;/a&gt;. Unlike most open weight models, these releases are notable for including the full training data, training process and checkpoints.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://allenai.org/blog/olmo3"&gt;new Olmo 3&lt;/a&gt; claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At its center is &lt;strong&gt;Olmo 3-Think (32B)&lt;/strong&gt;, the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.&lt;/p&gt;
&lt;p&gt;Having full access to the training data is really useful. Here's how they describe that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Olmo 3 is pretrained on &lt;strong&gt;Dolma 3&lt;/strong&gt;, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with &lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct &lt;strong&gt;Dolma 3 Mix&lt;/strong&gt;, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also highlight that they are training on fewer tokens than their competition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.&lt;/p&gt;
&lt;/blockquote&gt;
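&lt;p&gt;As a rough sanity check on that "6x" figure, here is a small Python sketch. The 5.9 trillion token count comes from the Dolma 3 Mix quote above; the 36 trillion figure for Qwen 3 is an assumption taken from the Qwen team's own reporting rather than from this announcement, so treat the result as approximate.&lt;/p&gt;

```python
# Rough check of the "roughly 6x fewer tokens" claim.
olmo3_pretraining_tokens = 5.9e12  # Dolma 3 Mix size, as quoted in the announcement
qwen3_pretraining_tokens = 36e12   # ASSUMPTION: reported by the Qwen team, not stated in this post

ratio = qwen3_pretraining_tokens / olmo3_pretraining_tokens
print("Qwen 3 saw about {:.1f}x as many pretraining tokens".format(ratio))
```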
&lt;p&gt;If you're still holding out hope for a model trained entirely on licensed data, this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.&lt;/p&gt;
&lt;p&gt;I tried out the 32B Think model and the 7B Instruct model &lt;a href="https://lmstudio.ai/models/olmo3"&gt;using LM Studio&lt;/a&gt;. The 7B model is a 4.16GB download, the 32B one is 18.14GB.&lt;/p&gt;
&lt;p&gt;The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for &lt;em&gt;14 minutes 43 seconds&lt;/em&gt;, outputting 8,437 tokens in total, most of which was &lt;a href="https://gist.github.com/simonw/2ae9d5ed71de9608b7955eea9671306f"&gt;this epic thinking trace&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 100 100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bike frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;saddlebrown&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52 50,57 35,62 50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Head/beak &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45 60,50 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Left wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65 60,70 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Right wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Feet on pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25 75,30 85,35 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;75 75,70 85,65 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Rendered, it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo3-32b-pelican.jpg" alt="Two circles, each with a triangle sticking out from the bottom. They have bars leading up to a brown box. Overlapping them is a black triangle with white circles for eyes and two grey triangles that are probably meant to be wings. It is not recognizable as a pelican or a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I tested OLMo 2 32B 4bit &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;back in March&lt;/a&gt; and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo2-pelican.jpg" alt="Blue and black wiggly lines looking more like a circuit diagram than a pelican riding a bicycle" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To be fair 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now &lt;a href="https://openrouter.ai/chat?models=qwen/qwen3-32b"&gt;using OpenRouter&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-32b-pelican.png" alt="The bicycle is two black circles joined by two lines, with a weird rectangular saddle perched on top. The pelican is a blue oval, a white circle with a yellow triangle in it and a weird eye-shaped oval overlapping the blue one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="olmotrace"&gt;OlmoTrace&lt;/h4&gt;
&lt;p&gt;I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A core goal of Olmo 3 is not just to &lt;em&gt;open&lt;/em&gt; the model flow, but to make it &lt;em&gt;actionable&lt;/em&gt; for people who want to understand and improve model behavior. Olmo 3 integrates with &lt;a href="https://allenai.org/blog/olmotrace"&gt;&lt;strong&gt;OlmoTrace&lt;/strong&gt;&lt;/a&gt;, our tool for tracing model outputs back to training data in real time.&lt;/p&gt;
&lt;p&gt;For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can access OlmoTrace via &lt;a href="https://playground.allenai.org/"&gt;playground.allenai.org&lt;/a&gt;, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.&lt;/p&gt;
&lt;p&gt;I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmotrace.jpg" alt="Screenshot of AI2 Playground interface with Olmo 3 32B Think model selected. User prompt reads &amp;quot;Generate a conference bio for Simon Willison&amp;quot;. AI response shows expanded &amp;quot;Thoughts&amp;quot; section: &amp;quot;Okay, the user wants me to generate a conference bio for Simon Willison. Let me start by recalling who Simon is. He's a well-known developer, entrepreneur, and author in the tech community. I remember he's the co-founder of co:here and now works at Anthropic. He's famous for his work with SQLite, especially the datasette project. Also, he writes a blog called SimonWillison.net which is quite popular. First, I need to structure a typical conference bio. These usually include his current role, past positions, notable projects, and maybe some achievements or recognitions. I should highlight his key contributions like datasette, which allows easy exploration of SQLite databases via a web interface. That's a big deal in data journalism and open data circles.&amp;quot; Right panel shows &amp;quot;OlmoTrace&amp;quot; feature described as &amp;quot;Documents from the training data that have exact text matches with the model response. Powered by infini-gram&amp;quot;. First document excerpt discusses technology and innovation, with highlighted match text &amp;quot;societal implications of technology, emphasizing the&amp;quot; shown in bold, surrounded by text about responsibility and merging innovation with intellect. Second document excerpt about Matt Hall has highlighted match &amp;quot;is a software engineer and entrepreneur based in&amp;quot; shown in bold, describing someone in New York City who co-founded a PFP collection and works at Google Creative Lab. Note indicates &amp;quot;Document repeated 2 times in result&amp;quot; with &amp;quot;View all repeated documents&amp;quot; link." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs, I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies.&lt;/p&gt;
&lt;p&gt;I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by &lt;a href="https://infini-gram.io/"&gt;Ai2's infini-gram&lt;/a&gt;) but the documents it found had nothing to do with me at all.&lt;/p&gt;
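&lt;p&gt;To illustrate why pure phrase matching can surface unrelated documents, here is a toy Python sketch. It is not how OlmoTrace or infini-gram are implemented: the function names, the six-word window and the sample strings are all hypothetical, made up for this example.&lt;/p&gt;

```python
# Toy illustration of exact phrase matching between a model response and a corpus.
# NOT the OlmoTrace or infini-gram implementation; all names and data are hypothetical.

def word_ngrams(text, n):
    """Return the set of lowercase n-word phrases found in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_phrase_matches(response, documents, n=6):
    """Map each document id to the n-word phrases it shares verbatim with the response."""
    response_phrases = word_ngrams(response, n)
    matches = {}
    for doc_id, doc_text in documents.items():
        shared = response_phrases.intersection(word_ngrams(doc_text, n))
        if shared:
            matches[doc_id] = sorted(shared)
    return matches


# Made-up strings: the response is about one person, the matching document about another.
response = "Jane Doe is a software engineer and entrepreneur based in Lisbon"
documents = {
    "doc-a": "Matt Hall is a software engineer and entrepreneur based in New York City",
    "doc-b": "Pelicans are large water birds with a distinctive pouch",
}
matches = find_phrase_matches(response, documents)
print(matches)
```

&lt;p&gt;The matched document shares boilerplate bio phrasing with the response but says nothing about its subject, which is the pattern I saw in the panel.&lt;/p&gt;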
&lt;h4 id="can-open-training-data-address-concerns-of-backdoors-"&gt;Can open training data address concerns of backdoors?&lt;/h4&gt;
&lt;p&gt;Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to &lt;a href="https://marin.community/"&gt;Stanford's Marin&lt;/a&gt; and &lt;a href="https://www.swiss-ai.org/apertus"&gt;Swiss AI's Apertus&lt;/a&gt;, neither of which I'd heard about before.&lt;/p&gt;
&lt;p&gt;A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that &lt;a href="https://www.anthropic.com/research/small-samples-poison"&gt;a small number of samples can poison LLMs of any size&lt;/a&gt; - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short, carefully crafted prompt.&lt;/p&gt;

&lt;p&gt;This makes fully open training data an even bigger deal.&lt;/p&gt;

&lt;p&gt;Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in &lt;a href="https://www.interconnects.ai/p/olmo-3-americas-truly-open-reasoning"&gt;his detailed post about the release&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).&lt;/p&gt;

&lt;p&gt;This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." &lt;a href="https://arxiv.org/abs/2506.10947"&gt;arXiv preprint arXiv:2506.10947&lt;/a&gt; (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." &lt;a href="https://arxiv.org/abs/2507.10532"&gt;arXiv preprint arXiv:2507.10532&lt;/a&gt; (2025).)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in &lt;a href="https://simonwillison.net/2024/Feb/2/olmos/"&gt;February 2024&lt;/a&gt;) and Olmo 2 (in &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;March 2025&lt;/a&gt;) have been significant. I'm hoping that trend continues!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/interpretability"&gt;interpretability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nathan-lambert"&gt;nathan-lambert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="interpretability"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai2"/><category term="ai-ethics"/><category term="llm-release"/><category term="lm-studio"/><category term="nathan-lambert"/><category term="olmo"/></entry><entry><title>mlx-community/OLMo-2-0325-32B-Instruct-4bit</title><link href="https://simonwillison.net/2025/Mar/16/olmo2/#atom-tag" rel="alternate"/><published>2025-03-16T03:30:41+00:00</published><updated>2025-03-16T03:30:41+00:00</updated><id>https://simonwillison.net/2025/Mar/16/olmo2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/mlx-community/OLMo-2-0325-32B-Instruct-4bit"&gt;mlx-community/OLMo-2-0325-32B-Instruct-4bit&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OLMo 2 32B &lt;a href="https://simonwillison.net/2025/Mar/13/ai2/"&gt;claims to be&lt;/a&gt; "the first fully-open model (all data, code, weights, and details are freely available) to outperform GPT3.5-Turbo and GPT-4o mini". Thanks to the MLX project, here's a recipe that worked for me to run it on my Mac, via my &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;To install the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mlx
llm mlx download-model mlx-community/OLMo-2-0325-32B-Instruct-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That downloads 17GB to &lt;code&gt;~/.cache/huggingface/hub/models--mlx-community--OLMo-2-0325-32B-Instruct-4bit&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To start an interactive chat with OLMo 2:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m mlx-community/OLMo-2-0325-32B-Instruct-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or to run a prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mlx-community/OLMo-2-0325-32B-Instruct-4bit 'Generate an SVG of a pelican riding a bicycle' -o unlimited 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-o unlimited 1&lt;/code&gt; removes the cap on the number of output tokens - the default for &lt;code&gt;llm-mlx&lt;/code&gt; is 1024, which isn't enough to attempt to draw a pelican.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/53f00731d494439d4aeca6bdd55368ca"&gt;pelican it drew&lt;/a&gt; is refreshingly abstract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue and black wiggly lines looking more like a circuit diagram than a pelican riding a bicycle" src="https://static.simonwillison.net/static/2025/olmo2-pelican.jpg" /&gt;&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/awnihannun/status/1900408729268609309"&gt;@awnihannun&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="ai2"/><category term="olmo"/></entry><entry><title>Quoting Ai2</title><link href="https://simonwillison.net/2025/Mar/13/ai2/#atom-tag" rel="alternate"/><published>2025-03-13T21:10:29+00:00</published><updated>2025-03-13T21:10:29+00:00</updated><id>https://simonwillison.net/2025/Mar/13/ai2/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://allenai.org/blog/olmo2-32B"&gt;&lt;p&gt;Today we release &lt;strong&gt;OLMo 2 32B&lt;/strong&gt;, the most capable and largest model in the OLMo 2 family, scaling up the OLMo 2 training recipe used for our &lt;a href="https://allenai.org/blog/olmo2"&gt;7B and 13B models released in November&lt;/a&gt;. It is trained up to 6T tokens and post-trained using Tulu 3.1. OLMo 2 32B is the &lt;strong&gt;first fully-open model&lt;/strong&gt; (all data, code, weights, and details are freely available) &lt;strong&gt;to outperform GPT3.5-Turbo and GPT-4o mini&lt;/strong&gt; on a suite of popular, multi-skill academic benchmarks.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://allenai.org/blog/olmo2-32B"&gt;Ai2&lt;/a&gt;, OLMo 2 32B release announcement&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="training-data"/><category term="ai2"/><category term="llm-release"/><category term="olmo"/></entry><entry><title>olmOCR</title><link href="https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag" rel="alternate"/><published>2025-02-26T02:04:03+00:00</published><updated>2025-02-26T02:04:03+00:00</updated><id>https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from &lt;a href="https://allenai.org/"&gt;Ai2&lt;/a&gt; - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".&lt;/p&gt;
&lt;p&gt;At its core is &lt;a href="https://huggingface.co/allenai/olmOCR-7B-0225-preview"&gt;allenai/olmOCR-7B-0225-preview&lt;/a&gt;, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the &lt;a href="https://huggingface.co/datasets/allenai/olmOCR-mix-0225"&gt;olmOCR-mix-0225 dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/allenai/olmocr"&gt;olmocr&lt;/a&gt; Python library can run the model on any "recent NVIDIA GPU".  I haven't managed to run it on my own Mac yet - there are &lt;a href="https://huggingface.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF"&gt;GGUFs out there&lt;/a&gt; but it's not clear to me how to run vision prompts through them - but Ai2 offer &lt;a href="https://olmocr.allenai.org/"&gt;an online demo&lt;/a&gt; which can handle up to ten pages for free.&lt;/p&gt;
&lt;p&gt;Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The most interesting idea from &lt;a href="https://olmocr.allenai.org/papers/olmocr.pdf"&gt;the technical report (PDF)&lt;/a&gt; is something they call "document anchoring":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted from the PDF binary file. [...]&lt;/p&gt;
&lt;p&gt;Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Left side shows a green-header interface with coordinates like [150x220]√3x−1+(1+x)², [150x180]Section 6, [150x50]Lorem ipsum dolor sit amet, [150x70]consectetur adipiscing elit, sed do, [150x90]eiusmod tempor incididunt ut, [150x110]labore et dolore magna aliqua, [100x280]Table 1, followed by grid coordinates with A, B, C, AA, BB, CC, AAA, BBB, CCC values. Right side shows the rendered document with equation, text and table." src="https://static.simonwillison.net/static/2025/olmocr-document-anchoring.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Jonathan Soma &lt;a href="https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html"&gt;figured out how to run it on a Mac&lt;/a&gt; using LM Studio and the &lt;a href="https://github.com/allenai/olmocr/"&gt;olmocr&lt;/a&gt; Python package.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soldni/status/1894418235334037570?s=46"&gt;Luca Soldaini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="vision-llms"/><category term="qwen"/><category term="ai2"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>What do people really ask chatbots? It’s a lot of sex and homework</title><link href="https://simonwillison.net/2024/Aug/4/what-do-people-really-ask-chatbots/#atom-tag" rel="alternate"/><published>2024-08-04T18:59:46+00:00</published><updated>2024-08-04T18:59:46+00:00</updated><id>https://simonwillison.net/2024/Aug/4/what-do-people-really-ask-chatbots/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.washingtonpost.com/technology/2024/08/04/chatgpt-use-real-ai-chatbot-conversations/"&gt;What do people really ask chatbots? It’s a lot of sex and homework&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Jeremy B. Merrill and Rachel Lerman at the Washington Post analyzed &lt;a href="https://wildchat.allen.ai/"&gt;WildChat&lt;/a&gt;, a dataset of 1 million ChatGPT-style interactions collected and released by the Allen Institute for AI.&lt;/p&gt;
&lt;p&gt;From a random sample of 458 queries they categorized the conversations as 21% creative writing and roleplay, 18% homework help, 17% "search and other inquiries", 15% work/business and 7% coding.&lt;/p&gt;
&lt;p&gt;I talked to them a little for this story:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“I don’t think I’ve ever seen a piece of technology that has this many use cases,” said Simon Willison, a programmer and independent researcher.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/washington-post"&gt;washington-post&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/press-quotes"&gt;press-quotes&lt;/a&gt;&lt;/p&gt;



</summary><category term="washington-post"/><category term="ai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai2"/><category term="press-quotes"/></entry><entry><title>Open Language Models (OLMos) and the LLM landscape</title><link href="https://simonwillison.net/2024/Feb/2/olmos/#atom-tag" rel="alternate"/><published>2024-02-02T04:11:40+00:00</published><updated>2024-02-02T04:11:40+00:00</updated><id>https://simonwillison.net/2024/Feb/2/olmos/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.interconnects.ai/p/olmo"&gt;Open Language Models (OLMos) and the LLM landscape&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OLMo is a newly released LLM from the Allen Institute for AI (AI2), currently available in 7B and 1B parameter sizes (OLMo-65b is on the way) and trained on a fully open, published dataset called Dolma.&lt;/p&gt;

&lt;p&gt;The model and code are Apache 2, while the data is under the “AI2 ImpACT license”.&lt;/p&gt;

&lt;p&gt;From the benchmark scores shared here by Nathan Lambert it looks like this may be the highest performing model currently available that was built using a fully documented training set.&lt;/p&gt;

&lt;p&gt;What’s in Dolma? It’s mainly Common Crawl, Wikipedia, Project Gutenberg and the Stack.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/natolambert/status/1753081921293484333"&gt;@natolambert&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nathan-lambert"&gt;nathan-lambert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="training-data"/><category term="ai2"/><category term="llm-release"/><category term="nathan-lambert"/><category term="olmo"/></entry></feed>