Simon Willison’s Weblog

Quotations

Some argue that by aggregating knowledge drawn from human experience, LLMs aren’t sources of creativity, as the moniker “generative” implies, but rather purveyors of mediocrity. Yes and no. There really are very few genuinely novel ideas and methods, and I don’t expect LLMs to produce them. Most creative acts, though, entail novel recombinations of known ideas and methods. Because LLMs radically boost our ability to do that, they are amplifiers of — not threats to — human creativity.

Jon Udell

# 10th August 2024, 5:57 pm / llms, jon-udell, ai, generative-ai

The RM [Reward Model] we train for LLMs is just a vibe check […] It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. […]

No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of win the game) is really difficult in the open-ended problem solving tasks. […] But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python?

Andrej Karpathy

# 8th August 2024, 8:13 am / andrej-karpathy, llms, ai, generative-ai

[On WebGPU in Firefox] There is a lot of work to do still to make sure we comply with the spec in a way that's acceptable to ship in a browser. We're 90% of the way there in terms of functionality, but the last 10% of fixing up spec changes from the last few years + being significantly more resource-constrained (we have 3 full-time folks, Chrome has/had an order of magnitude more humans working on WebGPU) means we've got our work cut out for us. We're hoping to ship sometime in the next year, but I won't make promises here.

Erich Gubler

# 5th August 2024, 2:26 am / firefox, webgpu

[On release notes] In our partial defense, training these models can be more discovery than invention. Often we don't exactly know what will come out.

We've long wanted to do release notes that describe each model's differences, but we also don't want to give false confidence with a shallow story.

Ted Sanders, OpenAI

# 3rd August 2024, 8:09 pm / openai, llms, ai, generative-ai

I think the mistake the industry has made (and I had to learn this as well) is that "we observed A/B tests work really well" is really a statement that should read "the majority of the changes we make are characterized as hill-climbing growth of a post-PMF B2C product, and A/B tests work really well for that".

Malte Ubl

# 3rd August 2024, 7:06 pm / ab-testing

When Noam and Daniel started Character.AI, our goal of personalized superintelligence required a full stack approach. We had to pre-train models, post-train them to power the experiences that make Character.AI special, and build a product platform with the ability to reach users globally. Over the past two years, however, the landscape has shifted – many more pre-trained models are now available. Given these changes, we see an advantage in making greater use of third-party LLMs alongside our own. This allows us to devote even more resources to post-training and creating new product experiences for our growing user base.

Character.AI

# 2nd August 2024, 9:07 pm / llms, ai, fine-tuning, generative-ai

For the past 10 years or so, AWS has been rolling out these peripheral services at an astonishing rate, dozens every year. A few get traction, most don’t—but they all stick around, undead zombies behind impressive-looking marketing pages, because historically AWS just doesn’t make many breaking changes. [...]

AWS made this mess for themselves by rushing all sorts of half-baked services to market. The mess had to be cleaned up at some point, and they’re doing that. But now they’ve explicitly revealed something to customers: The new stuff we release isn’t guaranteed to stick around.

Forrest Brazeal

# 31st July 2024, 8:58 pm / aws, forrest-brazeal

After giving it a lot of thought, we made the decision to discontinue new access to a small number of services, including AWS CodeCommit.

While we are no longer onboarding new customers to these services, there are no plans to change the features or experience you get today, including keeping them secure and reliable. [...]

The services I'm referring to are: S3 Select, CloudSearch, Cloud9, SimpleDB, Forecast, Data Pipeline, and CodeCommit.

Jeff Barr

# 31st July 2024, 12:59 pm / s3, aws, jeff-barr

The [Apple Foundation Model] pre-training dataset consists of a diverse and high quality data mixture. This includes data we have licensed from publishers, curated publicly-available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot. We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives.

Given our focus on protecting user privacy, we note that no private Apple user data is included in the data mixture. Additionally, extensive efforts have been made to exclude profanity, unsafe material, and personally identifiable information from publicly available data (see Section 7 for more details). Rigorous decontamination is also performed against many common evaluation benchmarks.

We find that data quality, much more so than quantity, is the key determining factor of downstream model performance.

Apple Intelligence Foundation Language Models, PDF

# 29th July 2024, 10:39 pm / apple, generative-ai, training-data, ai, llms, apple-intelligence

The key to understanding the pace of today’s infrastructure buildout is to recognize that while AI optimism is certainly a driver of AI CapEx, it is not the only one. The cloud players exist in a ruthless oligopoly with intense competition. [...]

Every time Microsoft escalates, Amazon is motivated to escalate to keep up. And vice versa. We are now in a cycle of competitive escalation between three of the biggest companies in the history of the world, collectively worth more than $7T. At each cycle of the escalation, there is an easy justification—we have plenty of money to afford this. With more commitment comes more confidence, and this loop becomes self-reinforcing. Supply constraints turbocharge this dynamic: If you don’t acquire land, power and labor now, someone else will.

David Cahn

# 28th July 2024, 10:40 am / david-cahn, ai

Among many misunderstandings, [users] expect the RAG system to work like a search engine, not as a flawed, forgetful analyst. They will not do the work that you expect them to do in order to verify documents and ground truth. They will not expect the AI to try to persuade them.

Ethan Mollick

# 27th July 2024, 1:46 am / ethan-mollick, generative-ai, ai, rag, llms

Our estimate of OpenAI’s $4 billion in inference costs comes from a person with knowledge of the cluster of servers OpenAI rents from Microsoft. That cluster has the equivalent of 350,000 Nvidia A100 chips, this person said. About 290,000 of those chips, or more than 80% of the cluster, were powering ChatGPT, this person said.

Amir Efrati and Aaron Holmes

# 25th July 2024, 9:35 pm / generative-ai, openai, chatgpt, ai, llms, nvidia

One interesting observation is the impact of environmental factors on training performance at scale. For Llama 3 405B, we noted a diurnal 1-2% throughput variation based on time-of-day. This fluctuation is the result of higher mid-day temperatures impacting GPU dynamic voltage and frequency scaling.

During training, tens of thousands of GPUs may increase or decrease power consumption at the same time, for example, due to all GPUs waiting for checkpointing or collective communications to finish, or the startup or shutdown of the entire training job. When this happens, it can result in instant fluctuations of power consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid. This is an ongoing challenge for us as we scale training for future, even larger Llama models.

The Llama 3 Herd of Models

# 23rd July 2024, 11:26 pm / meta, generative-ai, llama, ai, llms, gpus

As we've noted many times since March, these benchmarks aren't necessarily scientifically sound and don't convey the subjective experience of interacting with AI language models. [...] We've instead found that measuring the subjective experience of using a conversational AI model (through what might be called "vibemarking") on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs.

Benj Edwards

# 23rd July 2024, 9:14 pm / benj-edwards, llms, ai, generative-ai, chatbot-arena

I believe the Llama 3.1 release will be an inflection point in the industry where most developers begin to primarily use open source, and I expect that approach to only grow from here.

Mark Zuckerberg

# 23rd July 2024, 4:52 pm / meta, open-source, generative-ai, facebook, mark-zuckerberg, ai, llms, llama

I have a hard time describing the real value of consumer AI because it’s less some grand thing around AI agents or anything and more AI saving humans an hour of work on some random task, millions of times a day.

Chris Albon

# 21st July 2024, 3:08 pm / ai, llms, ai-agents

Stepping back, though, the very speed with which ChatGPT went from a science project to 100m users might have been a trap (a little as NLP was for Alexa). LLMs look like they work, and they look generalised, and they look like a product - the science of them delivers a chatbot and a chatbot looks like a product. You type something in and you get magic back! But the magic might not be useful, in that form, and it might be wrong. It looks like product, but it isn’t. [...]

LLMs look like better databases, and they look like search, but, as we’ve seen since, they’re ‘wrong’ enough, and the ‘wrong’ is hard enough to manage, that you can’t just give the user a raw prompt and a raw output - you need to build a lot of dedicated product around that, and even then it’s not clear how useful this is.

Benedict Evans

# 20th July 2024, 3:28 pm / generative-ai, chatgpt, product-management, ai, llms, benedict-evans

The reason current models are so large is because we're still being very wasteful during training - we're asking them to memorize the internet and, remarkably, they do and can e.g. recite SHA hashes of common numbers, or recall really esoteric facts. (Actually LLMs are really good at memorization, qualitatively a lot better than humans, sometimes needing just a single update to remember a lot of detail for a long time). But imagine if you were going to be tested, closed book, on reciting arbitrary passages of the internet given the first few words. This is the standard (pre)training objective for models today. The reason doing better is hard is because demonstrations of thinking are "entangled" with knowledge, in the training data.

Therefore, the models have to first get larger before they can get smaller, because we need their (automated) help to refactor and mold the training data into ideal, synthetic formats.

It's a staircase of improvement - of one model helping to generate the training data for the next, until we're left with a "perfect training set". When you train GPT-2 on it, it will be a really strong / smart model by today's standards. Maybe the MMLU will be a bit lower because it won't remember all of its chemistry perfectly.

Andrej Karpathy

# 19th July 2024, 1:09 pm / andrej-karpathy, generative-ai, training-data, ai, llms, gpt-2

Update, July 12: This innovation sparked a lot of conversation and questions that have no answers yet. We look forward to continuing to work with our customers on the responsible use of AI, but will not further pursue digital workers in the product.

Lattice, HR platform

# 17th July 2024, 3:08 am / ai, ethics, ai-ethics

OpenAI and Anthropic focused on building models and not worrying about products. For example, it took 6 months for OpenAI to bother to release a ChatGPT iOS app and 8 months for an Android app!

Google and Microsoft shoved AI into everything in a panicked race, without thinking about which products would actually benefit from AI and how they should be integrated.

Both groups of companies forgot the “make something people want” mantra. The generality of LLMs allowed developers to fool themselves into thinking that they were exempt from the need to find a product-market fit, as if prompting is a replacement for carefully designed products or features. [...]

But things are changing. OpenAI and Anthropic seem to be transitioning from research labs focused on a speculative future to something resembling regular product companies. If you take all the human-interest elements out of the OpenAI boardroom drama, it was fundamentally about the company's shift from creating gods to building products.

Arvind Narayanan

# 16th July 2024, 4:06 pm / anthropic, llms, google, openai, generative-ai, ai, microsoft, arvind-narayanan

We've doubled the max output token limit for Claude 3.5 Sonnet from 4096 to 8192 in the Anthropic API.

Just add the header "anthropic-beta": "max-tokens-3-5-sonnet-2024-07-15" to your API calls.

Alex Albert

# 15th July 2024, 9:33 pm / alex-albert, anthropic, claude, generative-ai, ai, llms, claude-3-5-sonnet
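
For reference, here's roughly what that looks like with the `anthropic` Python SDK, which accepts per-request `extra_headers`. A minimal sketch; the model ID and prompt are illustrative:

```python
# Minimal sketch: opt in to the 8192-token output limit via the beta header.
# Assumes the anthropic package is installed and ANTHROPIC_API_KEY is set.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=8192,  # above the old 4096 cap, enabled by the header below
    extra_headers={"anthropic-beta": "max-tokens-3-5-sonnet-2024-07-15"},
    messages=[{"role": "user", "content": "Write a very long story."}],
)
print(message.content[0].text)
```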

So much of knowledge/intelligence involves translating ideas between fields (domains). Those domains are walls that keep ideas siloed. But LLMs can help break those walls down and encourage humans to do more interdisciplinary thinking, which may lead to faster discoveries.

And note that I am implying that humans will make the breakthroughs, using LLMs as translation tools when appropriate, to help make connections. LLMs are strongest as translators of information that you provide. BYOD: Bring your own data!

Benj Edwards

# 14th July 2024, 3:25 pm / ai, llms, benj-edwards

My architecture is a monolith written in Go (this is intentional; I sacrificed scalability to improve my shipping speed), and this is where SQLite shines. With a DB located on the local NVMe disk, a $5 VPS can deliver a whopping 60K reads and 20K writes per second.

Nikita Melkozerov

# 13th July 2024, 11:44 pm / performance, go, sqlite
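
The quoted stack is Go, but the SQLite configuration that makes single-node numbers like that plausible is the same in any language. A minimal Python sketch of the usual settings (WAL mode plus relaxed syncing; the throughput figures are the quote's, not a benchmark of this snippet):

```python
import sqlite3

# Open a database file on local disk (NVMe-backed in the quoted setup).
conn = sqlite3.connect("app.db")

# WAL mode lets readers run concurrently with a single writer, the usual
# prerequisite for high read throughput alongside sustained writes.
conn.execute("PRAGMA journal_mode=WAL")
# NORMAL syncing trades a little durability on power loss for write speed.
conn.execute("PRAGMA synchronous=NORMAL")

conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("greeting", "hello"))
conn.commit()
print(conn.execute("SELECT v FROM kv WHERE k = ?", ("greeting",)).fetchone())
```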

We respect wildlife in the wilderness because we’re in their house. We don’t fully understand the complexity of most ecosystems, so we seek to minimize our impact on those ecosystems since we can’t always predict what outcomes our interactions with nature might have.

In software, many disastrous mistakes stem from not understanding why a system was built the way it was, but changing it anyway. It’s super common for a new leader to come in, see something they see as “useless”, and get rid of it – without understanding the implications. Good leaders make sure they understand before they mess around.

Jacob Kaplan-Moss

# 13th July 2024, 2:12 pm / jacob-kaplan-moss, software-engineering, management

Add tests in a commit before the fix. They should pass, showing the behavior before your change. Then, the commit with your change will update the tests. The diff between these commits represents the change in behavior. This helps the author test their tests (I've written tests thinking they covered the relevant case when they didn't), the reviewer to more precisely see the change in behavior and comment on it, and the wider community to understand what the PR description is about.

Ed Page

# 13th July 2024, 5:28 am / git
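
To make Ed Page's workflow above concrete, here's a hypothetical pytest-style sketch. The function and its bug are invented for illustration: the first commit would contain both the function and a test asserting its current, buggy behavior, and the fix commit would change both, so the inter-commit diff shows exactly what changed.

```python
import re

def slugify(text: str) -> str:
    # Hypothetical function with a deliberate bug: non-ASCII characters
    # are silently dropped instead of transliterated.
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def test_slugify_accents():
    # Commit 1: this assertion documents today's buggy behavior, and passes.
    assert slugify("Crème brûlée") == "crme-brle"
    # Commit 2 (the fix) updates slugify and flips this assertion to:
    #     assert slugify("Crème brûlée") == "creme-brulee"
```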

Third, X fails to provide access to its public data to researchers in line with the conditions set out in the DSA. In particular, X prohibits eligible researchers from independently accessing its public data, such as by scraping, as stated in its terms of service. In addition, X's process to grant eligible researchers access to its application programming interface (API) appears to dissuade researchers from carrying out their research projects or leave them with no other choice than to pay disproportionately high fees.

European Commission

# 13th July 2024, 3:52 am / twitter, apis, europe

Fighting bots is fighting humans [...] remind you that "only allow humans to access" is just not an achievable goal. Any attempt at limiting bot access will inevitably allow some bots through and prevent some humans from accessing the site, and it's about deciding where you want to set the cutoff. I fear that media outlets and other websites, in attempting to "protect" their material from AI scrapers, will go too far in the anti-human direction.

Molly White

# 12th July 2024, 2:45 pm / molly-white, ai

[On Paddington 3] If this movie is anywhere near as good as the second one, we are going to need to have an extremely serious conversation about this being one of the greatest film trilogies ever made.

Brian Grubb

# 11th July 2024, 9:11 pm / brian-grubb, film

My main concern is that the substantial cost to develop and run AI technology means that AI applications must solve extremely complex and important problems for enterprises to earn an appropriate return on investment.

We estimate that the AI infrastructure buildout will cost over $1tn in the next several years alone, which includes spending on data centers, utilities, and applications. So, the crucial question is: What $1tn problem will AI solve? Replacing low-wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I've witnessed in my thirty years of closely following the tech industry.

Jim Covello, Goldman Sachs

# 11th July 2024, 2:35 am / ai, generative-ai

Yeah, unfortunately vision prompting has been a tough nut to crack. We've found it's very challenging to improve Claude's actual "vision" through just text prompts, but we can of course improve its reasoning and thought process once it extracts info from an image.

In general, I think vision is still in its early days, although 3.5 Sonnet is noticeably better than older models.

Alex Albert, Anthropic

# 10th July 2024, 6:56 pm / vision-llms, prompt-engineering, anthropic, claude, generative-ai, ai, llms, alex-albert, claude-3-5-sonnet