Simon Willison’s Weblog

Subscribe

Quotations

Filters: Type: quotation × Sorted by date

But I’ve also had my own quiet concerns about what [vibe coding] means for early-career developers. So much of how I learned came from chasing bugs in broken tutorials and seeing how all the pieces connected, or didn’t. There was value in that. And maybe I’ve been a little protective of it.

A mentor challenged that. He pointed out that debugging AI generated code is a lot like onboarding into a legacy codebase, making sense of decisions you didn’t make, finding where things break, and learning to trust (or rewrite) what’s already there. That’s the kind of work a lot of developers end up doing anyway.

Ashley Willis, What Even Is Vibe Coding?

# 8th May 2025, 12:10 pm / vibe-coding, ai-assisted-programming, ai, generative-ai

That's it. I've had it. I'm putting my foot down on this craziness.

1. Every reporter submitting security reports on #Hackerone for #curl now needs to answer this question:

"Did you use an AI to find the problem or generate this submission?"

(and if they do select it, they can expect a stream of proof of actual intelligence follow-up questions)

2. We now ban every reporter INSTANTLY who submits reports we deem AI slop. A threshold has been reached. We are effectively being DDoSed. If we could, we would charge them for this waste of our time.

We still have not seen a single valid security report done with AI help.

Daniel Stenberg

# 6th May 2025, 3:12 pm / ai, llms, ai-ethics, daniel-stenberg, slop, security, curl, generative-ai

Two things can be true simultaneously: (a) LLM provider cost economics are too negative to return positive ROI to investors, and (b) LLMs are useful for solving problems that are meaningful and high impact, albeit not to the AGI hype that would justify point (a). This particular combination creates a frustrating gray area that requires a nuance that an ideologically split social media can no longer support gracefully. [...]

OpenAI collapsing would not cause the end of LLMs, because LLMs are useful today and there will always be a nonzero market demand for them: it’s a bell that can’t be unrung.

Max Woolf

# 5th May 2025, 6:31 pm / max-woolf, generative-ai, openai, ai, llms

[On using generative AI for work despite the risk of errors:]

  • AI is helpful despite being error-prone if it is faster to verify the output than it is to do the work yourself. For example, if you're using it to find a product that matches a given set of specifications, verification may be a lot faster than search.
  • There are many uses where errors don't matter, like using it to enhance creativity by suggesting or critiquing ideas.
  • At a meta level, if you use AI without a plan and simply turn to AI tools when you feel like it, then you're unlikely to be able to think through risks and mitigations. It is better to identify concrete ways to integrate AI into your workflows, with known benefits and risks, that you can employ repeatedly.

Arvind Narayanan

# 5th May 2025, 4:11 pm / llms, ai, arvind-narayanan, generative-ai

You also mentioned the whole Chatbot Arena thing, which I think is interesting and points to the challenge around how you do benchmarking. How do you know what models are good for which things?

One of the things we've generally tried to do over the last year is anchor more of our models in our Meta AI product north star use cases. The issue with open source benchmarks, and any given thing like the LM Arena stuff, is that they’re often skewed toward a very specific set of uses cases, which are often not actually  what any normal person does in your product. [...]

So we're trying to anchor our north star on the product value that people report to us, what they say that they want, and what their revealed preferences are, and using the experiences that we have. Sometimes these benchmarks just don't quite line up. I think a lot of them are quite easily gameable.

On the Arena you'll see stuff like Sonnet 3.7, which is a great model, and it's not near the top. It was relatively easy for our team to tune a version of Llama 4 Maverick that could be way at the top. But the version we released, the pure model, actually has no tuning for that at all, so it's further down. So you just need to be careful with some of these benchmarks. We're going to index primarily on the products.

Mark Zuckerberg, on Dwarkesh Patel's podcast

# 1st May 2025, 12:28 am / meta, generative-ai, llama, mark-zuckerberg, ai, chatbot-arena, llms

When we were first shipping Memory, the initial thought was: “Let’s let users see and edit their profiles”. Quickly learned that people are ridiculously sensitive: “Has narcissistic tendencies” - “No I do not!”, had to hide it.

Mikhail Parakhin, talking about Bing

# 29th April 2025, 1:17 pm / ai-ethics, llms, ai, generative-ai, bing, ai-personality

Betting on mobile made all the difference. We're making a similar call now, and this time the platform shift is AI.

AI isn't just a productivity boost. It helps us get closer to our mission. To teach well, we need to create a massive amount of content, and doing that manually doesn't scale. One of the best decisions we made recently was replacing a slow, manual content creation process with one powered by AI. Without AI, it would take us decades to scale our content to more learners. We owe it to our learners to get them this content ASAP. [...]

We'll be rolling out a few constructive constraints to help guide this shift:

  • We'll gradually stop using contractors to do work that AI can handle
  • AI use will be part of what we look for in hiring
  • AI use will be part of what we evaluate in performance reviews
  • Headcount will only be given if a team cannot automate more of their work
  • Most functions will have specific initiatives to fundamentally change how they work [...]

Luis von Ahn, Duolingo all-hands memo, shared on LinkedIn

# 28th April 2025, 7:48 pm / ai-ethics, careers, ai, generative-ai, duolingo

the last couple of GPT-4o updates have made the personality too sycophant-y and annoying (even though there are some very good parts of it), and we are working on fixes asap, some today and some this week.

Sam Altman

# 28th April 2025, 3:24 am / sam-altman, generative-ai, openai, chatgpt, ai, llms, ai-personality

We've been seeing if the latest versions of LLMs are any better at geolocating and chronolocating images, and they've improved dramatically since we last tested them in 2023. [...]

Before anyone worries about it taking our job, I see it more as the difference between a hand whisk and an electric whisk, just the same job done quicker, and either way you've got to check if your peaks are stiff at the end of it.

Eliot Higgins, Bellingcat

# 26th April 2025, 8:40 pm / vision-llms, bellingcat, data-journalism, llms, ai-ethics, ai, generative-ai, geoguessing

I don’t have a “mission” for this blog, but if I did, it would be to slightly increase the space in which people are calm and respectful and care about getting the facts right. I think we need more of this, and I’m worried that society is devolving into “trench warfare” where facts are just tools to be used when convenient for your political coalition, and everyone assumes everyone is distorting everything, all the time.

dynomight

# 26th April 2025, 5:05 pm / blogging

Despite being rusty with coding (I don't code every day these days): since starting to use Windsurf / Cursor with the recent increasingly capable models: I am SO back to being as fast in coding as when I was coding every day "in the zone" [...]

When you are driving with a firm grip on the steering wheel - because you know exactly where you are going, and when to steer hard or gently - it is just SUCH a big boost.

I have a bunch of side projects and APIs that I operate - but usually don't like to touch it because it's (my) legacy code.

Not any more.

I'm making large changes, quickly. These tools really feel like a massive multiplier for experienced devs - those of us who have it in our head exactly what we want to do and now the LLM tooling can move nearly as fast as my thoughts!

Gergely Orosz

# 23rd April 2025, 2:43 am / ai-assisted-programming, generative-ai, gergely-orosz, ai, llms

I was against using AI for programming for a LONG time. It never felt effective.

But with the latest models + tools, it finally feels like a real performance boost

If you’re still holding out, do yourself a favor: spend a few focused hours actually using it

Ellie Huxtable

# 22nd April 2025, 5:51 pm / ai-assisted-programming, llms, ai, generative-ai

In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t.

Ethan Mollick, On Jagged AGI

# 20th April 2025, 4:35 pm / gemini, ethan-mollick, generative-ai, o3, ai, llms

To me, a successful eval meets the following criteria. Say, we currently have system A, and we might tweak it to get a system B:

  • If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
  • If A and B have similar performance, their eval scores should be similar.

Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly.

Andrew Ng

# 18th April 2025, 6:47 pm / evals, llms, ai, generative-ai, andrew-ng

We (Jon and Zach) teamed up with the Harris Poll to confirm this finding and extend it. We conducted a nationally representative survey of 1,006 Gen Z young adults (ages 18-27). We asked respondents to tell us, for various platforms and products, if they wished that it “was never invented.” For Netflix, Youtube, and the internet itself, relatively few said yes to that question (always under 20%). We found much higher levels of regret for the dominant social media platforms: Instagram (34%), Facebook (37%), Snapchat (43%), and the most regretted platforms of all: TikTok (47%) and X/Twitter (50%).

Jon Haidt and Zach Rausch, TikTok Is Harming Children at an Industrial Scale

# 17th April 2025, 5:05 pm / social-media, twitter, tiktok

Our hypothesis is that o4-mini is a much better model, but we'll wait to hear feedback from developers. Evals only tell part of the story, and we wouldn't want to prematurely deprecate a model that developers continue to find value in. Model behavior is extremely high dimensional, and it's impossible to prevent regression on 100% use cases/prompts, especially if those prompts were originally tuned to the quirks of the older model. But if the majority of developers migrate happily, then it may make sense to deprecate at some future point.

We generally want to give developers as stable as an experience as possible, and not force them to swap models every few months whether they want to or not.

Ted Sanders, OpenAI, on deprecating o3-mini

# 17th April 2025, 1:07 am / openai, llms, ai, generative-ai

I work for OpenAI. [...] o4-mini is actually a considerably better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.

James Betker, OpenAI

# 16th April 2025, 10:47 pm / vision-llms, generative-ai, openai, ai, llms

The single most impactful investment I’ve seen AI teams make isn’t a fancy evaluation dashboard—it’s building a customized interface that lets anyone examine what their AI is actually doing. I emphasize customized because every domain has unique needs that off-the-shelf tools rarely address. When reviewing apartment leasing conversations, you need to see the full chat history and scheduling context. For real-estate queries, you need the property details and source documents right there. Even small UX decisions—like where to place metadata or which filters to expose—can make the difference between a tool people actually use and one they avoid. [...]

Teams with thoughtfully designed data viewers iterate 10x faster than those without them. And here’s the thing: These tools can be built in hours using AI-assisted development (like Cursor or Loveable). The investment is minimal compared to the returns.

Hamel Husain, A Field Guide to Rapidly Improving AI Products

# 15th April 2025, 6:05 pm / ai-assisted-programming, datasette, hamel-husain, ai, llms

Slopsquatting -- when an LLM hallucinates a non-existent package name, and a bad actor registers it maliciously. The AI brother of typosquatting.

Credit to @sethmlarson for the name

Andrew Nesbitt

# 12th April 2025, 4:30 pm / ai-ethics, slop, packaging, generative-ai, supply-chain, ai, llms, seth-michael-larson

Backticks are traditionally banned from use in future language features, due to the small symbol. No reader should need to distinguish ` from ' at a glance.

Steve Dower, CPython core developer, August 2024

# 12th April 2025, 3:32 am / programming-languages, python

The first generation of AI-powered products (often called “AI Wrapper” apps, because they “just” are wrapped around an LLM API) were quickly brought to market by small teams of engineers, picking off the low-hanging problems. But today, I’m seeing teams of domain experts wading into the field, hiring a programmer or two to handle the implementation, while the experts themselves provide the prompts, data labeling, and evaluations.

For these companies, the coding is commodified but the domain expertise is the differentiator.

Drew Breunig, The Dynamic Between Domain Experts & Developers Has Shifted

# 10th April 2025, 9:23 pm / drew-breunig, llms, ai, generative-ai

Imagine if Ford published a paper saying it was thinking about long term issues of the automobiles it made and one of those issues included “misalignment “Car as an adversary”” and when you asked Ford for clarification the company said “yes, we believe as we make our cars faster and more capable, they may sometimes take actions harmful to human well being” and you say “oh, wow, thanks Ford, but… what do you mean precisely?” and Ford says “well, we cannot rule out the possibility that the car might decide to just start running over crowds of people” and then Ford looks at you and says “this is a long-term research challenge”.

Jack Clark, DeepMind gazes into the AGI future

# 8th April 2025, 5:39 am / ai-ethics, jack-clark, ai

We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review. [...]

In addition, we're also adding the HF version of Llama-4-Maverick to Arena, with leaderboard results published shortly. Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.

lmarena.ai

# 8th April 2025, 1:26 am / meta, ai-ethics, generative-ai, llama, ai, llms, chatbot-arena

My first games involved hand assembling machine code and turning graph paper characters into hex digits. Software progress has made that work as irrelevant as chariot wheel maintenance. [...]

AI tools will allow the best to reach even greater heights, while enabling smaller teams to accomplish more, and bring in some completely new creator demographics.

Yes, we will get to a world where you can get an interactive game (or novel, or movie) out of a prompt, but there will be far better exemplars of the medium still created by dedicated teams of passionate developers.

The world will be vastly wealthier in terms of the content available at any given cost.

Will there be more or less game developer jobs? That is an open question. It could go the way of farming, where labor saving technology allow a tiny fraction of the previous workforce to satisfy everyone, or it could be like social media, where creative entrepreneurship has flourished at many different scales. Regardless, “don’t use power tools because they take people’s jobs” is not a winning strategy.

John Carmack

# 7th April 2025, 7:39 pm / ai-ethics, game-design, ai, john-carmack

Using Al effectively is now a fundamental expectation of everyone at Shopify. It's a tool of all trades today, and will only grow in importance. Frankly, I don't think it's feasible to opt out of learning the skill of applying Al in your craft; you are welcome to try, but I want to be honest I cannot see this working out today, and definitely not tomorrow. Stagnation is almost certain, and stagnation is slow-motion failure. If you're not climbing, you're sliding [...]

We will add Al usage questions to our performance and peer review questionnaire. Learning to use Al well is an unobvious skill. My sense is that a lot of people give up after writing a prompt and not getting the ideal thing back immediately. Learning to prompt and load context is important, and getting peers to provide feedback on how this is going will be valuable.

Tobias Lütke, CEO of Shopify, self-leaked memo

# 7th April 2025, 6:32 pm / careers, ai, ai-ethics

[...] The disappointing releases of both GPT-4.5 and Llama 4 have shown that if you don't train a model to reason with reinforcement learning, increasing its size no longer provides benefits.

Reinforcement learning is limited only to domains where a reward can be assigned to the generation result. Until recently, these domains were math, logic, and code. Recently, these domains have also included factual question answering, where, to find an answer, the model must learn to execute several searches. This is how these "deep search" models have likely been trained.

If your business idea isn't in these domains, now is the time to start building your business-specific dataset. The potential increase in generalist models' skills will no longer be a threat.

Andriy Burkov

# 6th April 2025, 8:47 pm / generative-ai, llama, openai, ai, llms

The Llama series have been re-designed to use state of the art mixture-of-experts (MoE) architecture and natively trained with multimodality. We’re dropping Llama 4 Scout & Llama 4 Maverick, and previewing Llama 4 Behemoth.

📌 Llama 4 Scout is highest performing small model with 17B activated parameters with 16 experts. It’s crazy fast, natively multimodal, and very smart. It achieves an industry leading 10M+ token context window and can also run on a single GPU!

📌 Llama 4 Maverick is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding – at less than half the active parameters. It offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena. It can also run on a single host!

📌 Previewing Llama 4 Behemoth, our most powerful model yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.

Ahmed Al-Dahle, VP and Head of GenAI at Meta

# 5th April 2025, 7:44 pm / meta, generative-ai, llama, ai, llms

Blogging is small-p political again, today. It’s come back round. It’s a statement to put your words in a place where they are not subject to someone else’s algorithm telling you what success looks like; when you blog, your words are not a vote for the values of someone else’s platform.

Matt Webb, Interview for People and Blogs

# 5th April 2025, 4:56 am / matt-webb, blogging

change of plans: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months

Sam Altman

# 4th April 2025, 6:56 pm / sam-altman, generative-ai, openai, ai, llms

I started using Claude and Claude Code a bit in my regular workflow. I’ll skip the suspense and just say that the tool is way more capable than I would ever have expected. The way I can use it to interrogate a large codebase, or generate unit tests, or even “refactor every callsite to use such-and-such pattern” is utterly gobsmacking. [...]

Here’s the main problem I’ve found with generative AI, and with “vibe coding” in general: it completely sucks out the joy of software development for me. [...]

This is how I feel using gen-AI: like a babysitter. It spits out reams of code, I read through it and try to spot the bugs, and then we repeat.

Nolan Lawson, AI ambivalence

# 3rd April 2025, 1:56 am / ai-assisted-programming, claude, generative-ai, ai, llms, nolan-lawson, claude-code, coding-agents