Simon Willison’s Weblog


Thursday, 25th April 2024

The only difference between screwing around and science is writing it down

Alex Jason, via Adam Savage # 2:17 pm

I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. [...] It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. [...] What this manifests as is – trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. [...] This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivery compute to approximating that dataset.

James Betker # 5:13 am

Blogmarks that use markdown. I needed to attach a correction to an older blogmark (my 20-year old name for short-form links with commentary on my blog) today - but the commentary field has always been text, not HTML, so I didn't have a way to add the necessary link.

This motivated me to finally add optional Markdown support for blogmarks to my blog's custom Django CMS. I then went through and added inline code markup to a bunch of different older posts, and built this Django SQL Dashboard to keep track of which posts I had updated. # 4:34 am

No, Most Books Don’t Sell Only a Dozen Copies. I linked to a story the other day about book sales claiming "90 percent of them sold fewer than 2,000 copies and 50 percent sold less than a dozen copies", based on numbers released in the Penguin antitrust lawsuit. It turns out those numbers were interpreted incorrectly.

In this piece from September 2022 Lincoln Michel addresses this and other common misconceptions about book statistics.

Understanding these numbers requires understanding a whole lot of intricacies about how publishing actually works. Here's one illustrative snippet:

"Take the statistic that most published books only sell 99 copies. This seems shocking on its face. But if you dig into it, you’ll notice it was counting one year’s sales of all books that were in BookScan’s system. That’s quite different statistic than saying most books don’t sell 100 copies in total! A book could easily be a bestseller in, say, 1960 and sell only a trickle of copies today."

The top comment on the post comes from Kristen McLean of NPD BookScan, the organization who's numbers were misrepresented is the trial. She wasn't certain how the numbers had been sliced to get that 90% result, but in her own analysis of "frontlist sales for the top 10 publishers by unit volume in the U.S. Trade market" she found that 14.7% sold less than 12 copies and the 51.4% spot was for books selling less than a thousand. # 3:41 am

Snowflake Arctic Cookbook. Today's big model release was Snowflake Arctic, an enormous 480B model with a 128×3.66B MoE (Mixture of Experts) architecture. It's Apache 2 licensed and Snowflake state that "in addition, we are also open sourcing all of our data recipes and research insights."

The research insights will be shared on this Arctic Cookbook blog - which currently has two articles covering their MoE architecture and describing how they optimized their training run in great detail.

They also list dozens of "coming soon" posts, which should be pretty interesting given how much depth they've provided in their writing so far. # 2:47 am