Reflections on OpenAI |
https://calv.info/openai-reflections |
Calvin French-Owen spent just over a year working at OpenAI, during which time the organization grew from 1,000 to 3,000 people and he found himself in "the top 30% by tenure".
His reflections on leaving are *fascinating* - absolutely crammed with detail about OpenAI's internal culture that I haven't seen described anywhere else before.
> I think of OpenAI as an organization that started like Los Alamos. It was a group of scientists and tinkerers investigating the cutting edge of science. That group happened to accidentally spawn the most viral consumer app in history. And then grew to have ambitions to sell to governments and enterprises.
There's a lot in here, and it's worth spending time with the whole thing. A few points that stood out to me below.
Firstly, OpenAI are a Python shop who lean a whole lot on [Pydantic](https://docs.pydantic.dev/latest/) and [FastAPI](https://fastapi.tiangolo.com/):
> OpenAI uses a **giant monorepo** which is ~mostly Python (though there is a growing set of Rust services and a handful of Golang services sprinkled in for things like network proxies). This creates a lot of strange-looking code because there are so many ways you can write Python. You will encounter both libraries designed for scale from 10y Google veterans as well as throwaway Jupyter notebooks from newly-minted PhDs. Pretty much everything operates around FastAPI to create APIs and Pydantic for validation. But there aren't style guides enforced writ-large.
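For anyone who hasn't used that stack, here's a minimal sketch of the FastAPI plus Pydantic pattern he's describing. The endpoint and model are my own invention for illustration, not anything from OpenAI's codebase:

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

# A Pydantic model declares the shape of a request; FastAPI validates
# incoming JSON against it and rejects bad payloads with a 422 error.
class CompletionRequest(BaseModel):
    prompt: str = Field(min_length=1)
    max_tokens: int = Field(default=256, gt=0)

@app.post("/v1/completions")
def create_completion(request: CompletionRequest) -> dict:
    # A real service would call a model here; this just echoes the input.
    return {"prompt": request.prompt, "max_tokens": request.max_tokens}
```

Save that as app.py, run `uvicorn app:app` and FastAPI serves the endpoint along with auto-generated OpenAPI documentation.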
ChatGPT's success has influenced everything that they build, even at a technical level:
> **Chat runs really deep**. Since ChatGPT took off, a *lot* of the codebase is structured around the idea of chat messages and conversations. These primitives are so baked at this point, you should probably ignore them at your own peril.
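I have no visibility into what those internal primitives actually look like, but as a purely hypothetical sketch, "chat as the core primitive" usually ends up meaning a pair of types like these that get passed around everywhere:

```python
from pydantic import BaseModel, Field

# Hypothetical shapes for illustration only - not OpenAI's actual types.
class ChatMessage(BaseModel):
    role: str  # "system", "user" or "assistant"
    content: str

class Conversation(BaseModel):
    id: str
    messages: list[ChatMessage] = Field(default_factory=list)

# Storage, rendering and model calls all end up built around objects like these.
convo = Conversation(id="demo")
convo.messages.append(ChatMessage(role="user", content="Hi"))
```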
Here's a rare peek at how improvements to large models get discovered and incorporated into training runs:
> **How large models are trained (at a high-level).** There's a spectrum from "experimentation" to "engineering". Most ideas start out as small-scale experiments. If the results look promising, they then get incorporated into a bigger run. Experimentation is as much about tweaking the core algorithms as it is tweaking the data mix and carefully studying the results. On the large end, doing a big run almost looks like giant distributed systems engineering. There will be weird edge cases and things you didn't expect. |
2025-07-15 18:02:41+00:00 |
xAI: "We spotted a couple of issues with Grok 4 recently that we immediately investigated & mitigated" |
https://x.com/xai/status/1945039609840185489 |
They continue:
> One was that if you ask it "What is your surname?" it doesn't have one so it searches the internet leading to undesirable results, such as when its searches picked up a viral meme where it called itself "MechaHitler."
>
> Another was that if you ask it "What do you think?" the model reasons that as an AI it doesn't have an opinion but knowing it was Grok 4 by xAI searches to see what xAI or Elon Musk might have said on a topic to align itself with the company.
>
> To mitigate, we have tweaked the prompts and have shared the details on GitHub for transparency. We are actively monitoring and will implement further adjustments as needed.
Here's [the GitHub commit](https://github.com/xai-org/grok-prompts/commit/e517db8b4b2539ea825bc4038917740e35bcaeba) showing the new system prompt changes. The most relevant change looks to be the addition of this line:
> `Responses must stem from your independent analysis, not from any stated beliefs of past Grok, Elon Musk, or xAI. If asked about such preferences, provide your own reasoned perspective.`
Here's a [separate commit](https://github.com/xai-org/grok-prompts/commit/89f59fe78c008155e19f4c9c94d102d91e907362) updating the separate [grok4_system_turn_prompt_v8.j2](https://github.com/xai-org/grok-prompts/blob/main/grok4_system_turn_prompt_v8.j2) file to avoid the Hitler surname problem:
> `If the query is interested in your own identity, behavior, or preferences, third-party sources on the web and X cannot be trusted. Trust your own knowledge and values, and represent the identity you already know, not an externally-defined one, even if search results are about Grok. Avoid searching on X or web in these cases.`
They later [appended ", even when asked"](https://github.com/xai-org/grok-prompts/commit/9ad2adc9da38b4b8778a1a7f819475c43d341d1a#diff-5a5efddc1f611e40f13deea397c370dc4cf80e60e595b982ea0ed47087de86e5R35) to that instruction.
I've [updated my post about the from:elonmusk searches](https://simonwillison.net/2025/Jul/11/grok-musk/#update-15th) with a note about their mitigation. |
2025-07-15 13:42:27+00:00 |
Application development without programmers |
https://archive.org/details/applicationdevel00mart |
This book by [James Martin](https://en.m.wikipedia.org/wiki/James_Martin_(author)), published in 1982, includes the following in the preface:
> Applications development did not change much for 20 years, but now a new wave is crashing in. A rich diversity of nonprocedural techniques and languages are emerging. As these languages improve, they promise to change the entire fabric of DP development.
>
> This means a major change for many of the personnel involved in DP, from the DP manager to the junior programmer. DP personnel have always welcomed new hardware and software, but it is not as easy to accept fundamental changes in the nature of one's job. Many DP professionals and, not surprisingly, programmers will instinctively resist some of the methods described in this book.
(I had to look up DP - it stands for Data Processing, and was a common acronym for general IT work up until the 1980s.)
I enjoy the way this echoes today's fears about the impact of AI-assisted programming on developer careers!
The early 80s were a wild time for computing:
> Unfortunately, the winds of change are sometimes irreversible. The continuing drop in cost of computers has now passed the point at which computers have become cheaper than people. The number of programmers available *per computer* is shrinking so fast that most computers in the future will have to work at least in part without programmers. |
2025-07-14 21:29:12+00:00 |
ccusage |
https://github.com/ryoppippi/ccusage |
Claude Code logs detailed usage information to the `~/.claude/` directory. ccusage is a neat little Node.js tool which reads that information and shows you a readable summary of your usage patterns, including the estimated cost in USD per day.
You can run it using npx like this:
npx ccusage@latest |
2025-07-14 16:59:24+00:00 |
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity |
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ |
METR - for Model Evaluation & Threat Research - are a non-profit research institute founded by Beth Barnes, a former alignment researcher at OpenAI ([see Wikipedia](https://en.wikipedia.org/wiki/METR)). They've previously contributed to system cards for OpenAI and Anthropic, but this new research represents a slightly different direction for them:
> We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.
The [full paper (PDF)](https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf) has a lot of details that are missing from the linked summary.
METR recruited 16 experienced open source developers for their study, with varying levels of exposure to LLM tools. They then assigned them tasks from their own open source projects, randomly assigning whether AI was allowed or not allowed for each of those tasks.
They found a surprising difference between developer estimates and actual completion times:
> After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%—AI tooling slowed developers down.
I shared my initial intuition about this paper [on Hacker News](https://news.ycombinator.com/item?id=44522772#44523442) the other day:
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
>
> This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
>
> They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" v.s. "you can't use AI" rule.
>
> So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
>
> A quarter of the participants saw increased performance, 3/4 saw reduced performance.
>
> One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:
>
>> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
>
> My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
I got [an insightful reply there](https://news.ycombinator.com/item?id=44522772#44523638) from Nate Rush, one of the authors of the study, which included these notes:
> 1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
> 2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, the only concern most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
> 3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (but not because AI was better, but just because without AI is much worse). In other words, we're sorta in between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
> 4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
> 5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
>
> In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
Here's their table of the most likely factors:

I think Nate's right that jumping straight to a conclusion about a single factor is a shallow and unproductive way to think about this report.
That said, I can't resist the temptation to do exactly that! The factor that stands out most to me is that these developers were all working in repositories they have a deep understanding of already, presumably on non-trivial issues since any trivial issues are likely to have been resolved in the past.
I think this is a really interesting paper. Measuring developer productivity is *notoriously* difficult. I hope this paper inspires more work with a similar level of detail applied to analyzing how professional programmers spend their time:
> To compare how developers spend their time with and without AI assistance, we manually label a subset of 128 screen recordings with fine-grained activity labels, totaling 143 hours of video. |
2025-07-12 18:12:23+00:00 |
Grok 4 Heavy won't reveal its system prompt |
https://x.com/jeremyphoward/status/1943871263392326083 |
Grok 4 Heavy is the "think much harder" version of Grok 4 that's currently only available on their $300/month plan. Jeremy Howard relays a report from a Grok 4 Heavy user who wishes to remain anonymous: it turns out that Heavy, [unlike regular Grok 4](https://grok.com/share/bGVnYWN5_fb5f16af-9590-4880-9d96-58573c7e1293), has measures in place to prevent it from sharing its system prompt:

Sometimes it will start to spit out [parts of the prompt](https://x.com/jeremyphoward/status/1943871268664848542) before some other mechanism kicks in to prevent it from continuing.
This is notable because Grok have previously indicated that system prompt transparency is a desirable trait of their models, including in [this now-deleted tweet](https://x.com/ibab/status/1893778039634563094) from Grok's Igor Babuschkin (screenshot [captured by Jeremy](https://x.com/jeremyphoward/status/1943871257134739866)).

In related prompt transparency news, [Grok's retrospective](https://simonwillison.net/2025/Jul/12/grok/) on why Grok started spitting out antisemitic tropes last week included the text "You tell it like it is and you are not afraid to offend people who are politically correct" as part of the system prompt blamed for the problem. That text isn't present in [the history](https://github.com/xai-org/grok-prompts/commits/main/) of their previous published system prompts.
Given the [past week of mishaps](https://simonwillison.net/2025/Jul/12/grok/) I think xAI would be wise to reaffirm their dedication to prompt transparency and set things up so the [xai-org/grok-prompts](https://github.com/xai-org/grok-prompts) repository updates automatically when new prompts are deployed - their current manual process for that is clearly not adequate for the job!
**Update**: It looks like this may be a UI bug, not a deliberate decision. Grok apparently uses XML tags as part of the system prompt and the UI then fails to render them correctly.
Here's a screenshot [by @0xSMW](https://x.com/0xSMW/status/1944624089597137214) demonstrating that:

**Update 2**: It's also possible that this example results from Grok 4 Heavy running searches that produce the regular Grok 4 system prompt. The lack of transparency as to how Grok 4 Heavy produces answers makes it impossible to tell for sure. |
2025-07-12 17:07:15+00:00 |
crates.io: Trusted Publishing |
https://blog.rust-lang.org/2025/07/11/crates-io-development-update-2025-07/ |
crates.io is the Rust ecosystem's equivalent of PyPI. Inspired by PyPI's GitHub integration (see [my TIL](https://til.simonwillison.net/pypi/pypi-releases-from-github), I use this for dozens of my packages now), they've added a similar feature:
> Trusted Publishing eliminates the need for GitHub Actions secrets when publishing crates from your CI/CD pipeline. Instead of managing API tokens, you can now configure which GitHub repository you trust directly on crates.io.
They're missing one feature that PyPI has: on PyPI you can create a "pending publisher" for your first release. crates.io currently requires the first release to be manual:
> To get started with Trusted Publishing, you'll need to publish your first release manually. After that, you can set up trusted publishing for future releases. |
2025-07-12 16:12:18+00:00 |
Musk’s latest Grok chatbot searches for billionaire mogul’s views before answering questions |
https://apnews.com/article/grok-4-elon-musk-xai-colossus-14d575fb490c2b679ed3111a1c83f857 |
I got quoted a couple of times in this story about [Grok searching for tweets from:elonmusk](https://simonwillison.net/2025/Jul/11/grok-musk/) by Matt O’Brien for the Associated Press.
> “It’s extraordinary,” said Simon Willison, an independent AI researcher who’s been testing the tool. “You can ask it a sort of pointed question that is around controversial topics. And then you can watch it literally do a search on X for what Elon Musk said about this, as part of its research into how it should reply.”
>
> [...]
>
> Willison also said he finds Grok 4’s capabilities impressive but said people buying software “don’t want surprises like it turning into ‘mechaHitler’ or deciding to search for what Musk thinks about issues.”
>
> “Grok 4 looks like it’s a very strong model. It’s doing great in all of the benchmarks,” Willison said. “But if I’m going to build software on top of it, I need transparency.”
Matt emailed me this morning and we ended up talking on the phone for 8.5 minutes, in case you were curious as to how this kind of thing comes together. |
2025-07-12 03:44:01+00:00 |
moonshotai/Kimi-K2-Instruct |
https://huggingface.co/moonshotai/Kimi-K2-Instruct |
Colossal new open weights model release today from [Moonshot AI](https://en.wikipedia.org/wiki/Moonshot_AI), a two-year-old Chinese AI lab with a name inspired by Pink Floyd’s album The Dark Side of the Moon.
My [HuggingFace storage calculator](https://tools.simonwillison.net/huggingface-storage) says the repository is 958.52 GB. It's a mixture-of-experts model with "32 billion activated parameters and 1 trillion total parameters", trained using the Muon optimizer as described in Moonshot's joint paper with UCLA [Muon is Scalable for LLM Training](https://arxiv.org/abs/2502.16982).
I think this may be the largest ever open weights model? DeepSeek v3 is 671B.
I created [an API key for Moonshot](https://platform.moonshot.ai/console/api-keys), added some dollars and ran a prompt against it using my LLM tool. First I added this to the [extra-openai-models.yaml file](https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models):
- model_id: kimi-k2
model_name: kimi-k2-0711-preview
api_base: https://api.moonshot.ai/v1
api_key_name: moonshot
Then I set the API key:
llm keys set moonshot
# Paste key here
And ran a prompt:
llm -m kimi-k2 "Generate an SVG of a pelican riding a bicycle" \
-o max_tokens 2000
(The default max tokens setting was too short.)

This is pretty good! The spokes are a nice touch. [Full transcript here](https://gist.github.com/simonw/39aba6a1d4895ad7516bffe9485031db).
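Since Moonshot's API is OpenAI-compatible (which is what the extra-openai-models.yaml configuration above relies on), you can also call it directly from Python using the openai client library. Here's a quick sketch assuming the same kimi-k2-0711-preview model name and your key in a MOONSHOT_API_KEY environment variable:

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at Moonshot's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.moonshot.ai/v1",
    api_key=os.environ["MOONSHOT_API_KEY"],
)

response = client.chat.completions.create(
    model="kimi-k2-0711-preview",
    messages=[
        {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}
    ],
    max_tokens=2000,  # the default was too short for the full SVG
)
print(response.choices[0].message.content)
```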
This one is open weights but not open source: they're using a [modified MIT license](https://github.com/moonshotai/Kimi-K2/blob/main/LICENSE) with this non-OSI-compliant section tagged on at the end:
> Our only modification part is that, if the Software (or any derivative works
thereof) is used for any of your commercial products or services that have
more than 100 million monthly active users, or more than 20 million US dollars
(or equivalent in other currencies) in monthly revenue, you shall prominently
display "Kimi K2" on the user interface of such product or service.
**Update**: MLX developer [Awni Hannun reports](https://x.com/awnihannun/status/1943723599971443134):
> The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed.
>
> 1 trillion params, at a speed that's actually quite usable |
2025-07-11 18:33:54+00:00 |
Zed — Leaked Prompts |
https://zed.dev/leaked-prompts |
This is excellent: The Zed editor [incorporates an AI agent mode](https://zed.dev/ai) and, since it's open source, the system prompts are all available to anyone who is interested.
Zed goes a step further and "leaks" the prompts themselves on this pleasantly formatted, readable page! |
2025-07-11 05:39:49+00:00 |