<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: stanford</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/stanford.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-09-24T22:15:03+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting Stanford CS221 Autumn 2025</title><link href="https://simonwillison.net/2025/Sep/24/stanford/#atom-tag" rel="alternate"/><published>2025-09-24T22:15:03+00:00</published><updated>2025-09-24T22:15:03+00:00</updated><id>https://simonwillison.net/2025/Sep/24/stanford/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://stanford-cs221.github.io/autumn2025/assignments/hw1_foundations/index.html"&gt;&lt;p&gt;[2 points] &lt;strong&gt;Learn basic NumPy operations with an AI tutor!&lt;/strong&gt; Use an AI chatbot (e.g., ChatGPT, Claude, Gemini, or Stanford AI Playground) to teach yourself how to do basic vector and matrix operations in NumPy (import numpy as np). AI tutors have become exceptionally good at creating interactive tutorials, and this year in CS221, we're testing how they can help you learn fundamentals more interactively than traditional static exercises.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://stanford-cs221.github.io/autumn2025/assignments/hw1_foundations/index.html"&gt;Stanford CS221 Autumn 2025&lt;/a&gt;, Problem 1: Linear Algebra&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/computer-science"&gt;computer-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/education"&gt;education&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/numpy"&gt;numpy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="computer-science"/><category term="education"/><category term="python"/><category term="stanford"/><category term="ai"/><category term="numpy"/><category term="generative-ai"/><category term="llms"/></entry><entry><title>GPUs Go Brrr</title><link href="https://simonwillison.net/2024/May/13/gpus-go-brrr/#atom-tag" rel="alternate"/><published>2024-05-13T04:08:46+00:00</published><updated>2024-05-13T04:08:46+00:00</updated><id>https://simonwillison.net/2024/May/13/gpus-go-brrr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hazyresearch.stanford.edu/blog/2024-05-12-tk"&gt;GPUs Go Brrr&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Fascinating, detailed low-level notes on how to get the most out of NVIDIA's H100 GPUs (currently selling for around $40,000 apiece) from the research team at Stanford who created FlashAttention, among other things.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The swizzled memory layouts are flat-out incorrectly documented, which took considerable time for us to figure out.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40337936"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="stanford"/><category term="ai"/><category term="nvidia"/><category term="gpus"/></entry><entry><title>Train and run Stanford Alpaca on your own machine</title><link href="https://simonwillison.net/2023/Mar/16/train-and-run-stanford-alpaca-on-your-own-machine/#atom-tag" rel="alternate"/><published>2023-03-16T16:10:39+00:00</published><updated>2023-03-16T16:10:39+00:00</updated><id>https://simonwillison.net/2023/Mar/16/train-and-run-stanford-alpaca-on-your-own-machine/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://replicate.com/blog/replicate-alpaca"&gt;Train and run Stanford Alpaca on your own machine&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The team at Replicate managed to train their own copy of Stanford’s Alpaca—a fine-tuned version of LLaMA that can follow instructions like ChatGPT. Here they provide step-by-step instructions for recreating Alpaca yourself—running the training needs one or more A100s for a few hours, which you can rent through various cloud providers.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/replicate"&gt;replicate&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="stanford"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="replicate"/><category term="fine-tuning"/></entry><entry><title>Stanford Alpaca, and the acceleration of on-device large language model development</title><link href="https://simonwillison.net/2023/Mar/13/alpaca/#atom-tag" rel="alternate"/><published>2023-03-13T19:19:09+00:00</published><updated>2023-03-13T19:19:09+00:00</updated><id>https://simonwillison.net/2023/Mar/13/alpaca/#atom-tag</id><summary type="html">
    &lt;p&gt;On Saturday 11th March I wrote about how &lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;Large language models are having their Stable Diffusion moment&lt;/a&gt;. Today is Monday. Let's look at what's happened in the past three days.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Later on Saturday: Artem Andreenko reports that &lt;code&gt;llama.cpp&lt;/code&gt; can &lt;a href="https://twitter.com/miolini/status/1634982361757790209"&gt;run the 4-bit quantized 7B LLaMA language model on a 4GB RaspberryPi&lt;/a&gt; - at 10 seconds per token, but still hugely impressive.&lt;/li&gt;
&lt;li&gt;Sunday 12th March: &lt;a href="https://twitter.com/cocktailpeanut"&gt;cocktailpeanut&lt;/a&gt; releases &lt;a href="https://cocktailpeanut.github.io/dalai/"&gt;Dalai&lt;/a&gt;, a "dead simple way to run LLaMA on your computer": &lt;code&gt;npx dalai llama&lt;/code&gt; and &lt;code&gt;npx dalai serve&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;13th March (today): Anish Thite reports &lt;code&gt;llama.cpp&lt;/code&gt; running &lt;a href="https://twitter.com/thiteanish/status/1635188333705043969"&gt;on a Pixel 6 phone&lt;/a&gt; (26 seconds per token). &lt;strong&gt;Update 14th March:&lt;/strong&gt; Now &lt;a href="https://twitter.com/ggerganov/status/1635605532726681600"&gt;1 second per token&lt;/a&gt; on an older Pixel 5!&lt;/li&gt;
&lt;li&gt;Also today: a team at Stanford released &lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;Alpaca: A Strong Open-Source Instruction-Following Model&lt;/a&gt; - fine-tuned from the LLaMA 7B model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When I talked about a "Stable Diffusion moment" this is the kind of thing I meant: the moment this stuff is available for people to experiment with, things accelerate.&lt;/p&gt;
&lt;p&gt;I'm going to dive into Alpaca in detail.&lt;/p&gt;
&lt;h4&gt;Stanford's Alpaca&lt;/h4&gt;
&lt;p&gt;Here's the introduction to &lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;the Alpaca announcement&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (&amp;lt;600$).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The biggest weakness in the LLaMA models released by Meta research last month is their lack of instruction-tuning.&lt;/p&gt;
&lt;p&gt;A language model is a sentence completion engine. You give it a sequence of words, "The first man on the moon was", and it completes that sentence, hopefully with useful content.&lt;/p&gt;
&lt;p&gt;One of the great innovations from OpenAI was their application of &lt;a href="https://openai.com/research/instruction-following"&gt;instruction tuning&lt;/a&gt; to GPT-3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Prior to this, you had to think very carefully about how to construct your prompts. Thanks to instruction tuning you can be a lot more, well, human in the way you interact with the model. "Write me a poem about pandas!" now works as a prompt, instead of "Here is a poem about pandas:".&lt;/p&gt;
&lt;p&gt;The LLaMA models had not been through this process. The &lt;a href="https://github.com/facebookresearch/llama/blob/57b0eb62de0636e75af471e49e2f1862d908d9d8/FAQ.md#2-generations-are-bad"&gt;LLaMA FAQ&lt;/a&gt; acknowledges this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Keep in mind these models are not finetuned for question answering. As such, they should be prompted so that the expected answer is the natural continuation of the prompt. [...] Overall, always keep in mind that models are very sensitive to prompts (particularly when they have not been finetuned).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is an enormous usability problem.&lt;/p&gt;
&lt;p&gt;One of my open questions about LLaMA was how difficult and expensive it would be to fine-tune it such that it could respond better to instructions.&lt;/p&gt;
&lt;p&gt;Thanks to the team at Stanford we now have an answer: 52,000 training samples and $100 of training compute! From their blog post:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Something that stuns me about Alpaca is the quality they claim to be able to get from the 7B model - the smallest of the LLaMA models, and the one which has been seen running (albeit glacially slowly) on a RaspberryPi and a mobile phone! Here's one example from their announcement:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/alpaca-email.png" alt="Enter your instruction and press enter: Write an e-mail to congratulate new Stanford admits and mention that you are exited about meeting all of them in person Stanford-Alpaca-7B: An Open-Source Instruction-Following Language Model Dear Stanford Admits, Congratulations on your admission to Stanford! I am so excited to meet all of you in person. As a current student at Stanford, I am looking forward to welcoming you to the Stanford community and showing you all that the university has to offer. I wish you all the best in your future endeavors and I can't wait to see you on campus! Sincerely, Your Name" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I would be impressed to see this from the 65B (largest) LLaMA model - but getting this from 7B is spectacular.&lt;/p&gt;
&lt;h4&gt;Still not for commercial usage&lt;/h4&gt;
&lt;p&gt;I'll quote the Stanford announcement on this in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We emphasize that Alpaca is intended &lt;strong&gt;only for academic research&lt;/strong&gt; and any &lt;strong&gt;commercial use is prohibited&lt;/strong&gt;. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform"&gt;license&lt;/a&gt;, so we necessarily inherit this decision. Second, the instruction data is based OpenAI's text-davinci-003, whose &lt;a href="https://openai.com/policies/terms-of-use"&gt;terms of use&lt;/a&gt; prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So it's still not something we can use to build commercial offerings - but for personal research and tinkering it's yet another huge leap forwards.&lt;/p&gt;
&lt;h4 id="takeaways"&gt;What does this demonstrate?&lt;/h4&gt;
&lt;p&gt;The license of the LLaMA model doesn't bother me too much. What's exciting to me is what this all proves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLaMA itself shows that it's possible to train a GPT-3 class language model using openly available resources. The &lt;a href="https://arxiv.org/abs/2302.13971"&gt;LLaMA paper&lt;/a&gt; includes details of the training data, which is entirely from publicly available sources (which include CommonCrawl, GitHub, Wikipedia, ArXiv and StackExchange).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt; shows that you can then use some tricks to run that language model on consumer hardware - apparently anything with 4GB or more of RAM is enough to at least get it to start spitting out tokens!&lt;/li&gt;
&lt;li&gt;Alpaca shows that you can apply fine-tuning with a feasible sized set of examples (52,000) and cost ($100) such that even the smallest of the LLaMA models - the 7B one, which can compress down to a 4GB file with 4-bit quantization - provides results that compare well to cutting edge &lt;code&gt;text-davinci-003&lt;/code&gt; in initial human evaluation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One thing that's worth noting: the Alpaca 7B comparison likely used the full-sized 13.48GB 16-bit floating point 7B model, not the 4GB 4-bit quantized model used by &lt;code&gt;llama.cpp&lt;/code&gt;. I've not yet seen a robust comparison of quality between the two.&lt;/p&gt;
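&lt;p&gt;To see why 4-bit quantization shrinks the file by roughly 4x, here's a toy sketch of the idea in Python. This is my own simplified per-tensor scheme for illustration; &lt;code&gt;llama.cpp&lt;/code&gt; actually quantizes weights in small blocks, each with its own scale factor, to keep the error down:&lt;/p&gt;

```python
import numpy as np

def quantize_4bit(weights):
    """Map float weights to small integers plus one scale factor.

    Toy per-tensor scheme: each weight becomes an integer in [-7, 7],
    which fits in 4 bits, instead of a 16-bit float.
    """
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.07], dtype=np.float32)
q, scale = quantize_4bit(w)
w_approx = dequantize(q, scale)
# w_approx is close to w, but each weight now needs 4 bits instead of 16
```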
&lt;h4&gt;Exploring the Alpaca training data with Datasette Lite&lt;/h4&gt;
&lt;p&gt;The Alpaca team released the 52,000 fine-tuning instructions they used as &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json"&gt;a 21.7MB JSON file&lt;/a&gt; in their GitHub repository.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; tool has the ability to fetch JSON from GitHub and load it into an in-browser SQLite database. Here's the URL to do that:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json"&gt;https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This will let you browse the 52,000 examples in your browser.&lt;/p&gt;
&lt;p&gt;But we can do a step better than that: here's a SQL query that runs LIKE queries to search through those examples, considering all three text columns:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; instruction, input, output &lt;span class="pl-k"&gt;from&lt;/span&gt; alpaca_data
&lt;span class="pl-k"&gt;where&lt;/span&gt; instruction &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt; &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; input &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt; &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; output &lt;span class="pl-k"&gt;like&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; :search &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; random()&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using &lt;code&gt;order by random()&lt;/code&gt; because why not? It's more fun to explore that way.&lt;/p&gt;
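&lt;p&gt;The same trick works locally with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. A sketch using a couple of made-up rows in place of the real 52,000-record JSON file:&lt;/p&gt;

```python
import sqlite3

# Made-up rows with the same shape as the Alpaca JSON records
rows = [
    {"instruction": "Explain Occam's razor.", "input": "",
     "output": "Prefer the simplest explanation that fits the evidence."},
    {"instruction": "Generate a haiku using the following word:",
     "input": "summer", "output": "Summer tongue awakes"},
]

db = sqlite3.connect(":memory:")
db.execute("create table alpaca_data (instruction text, input text, output text)")
db.executemany(
    "insert into alpaca_data values (:instruction, :input, :output)", rows
)

# The same LIKE query, searching all three text columns at once
sql = """
select instruction, input, output from alpaca_data
where instruction || ' ' || input || ' ' || output like '%' || :search || '%'
order by random()
"""
matches = db.execute(sql, {"search": "occam"}).fetchall()
# SQLite's LIKE is case-insensitive for ASCII, so 'occam' matches "Occam's"
```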
&lt;p&gt;The following link will both load the JSON file and populate and execute that SQL query, plus allow you to change the search term using a form in your browser:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json#/data?sql=select+instruction%2C+input%2C+output+from+alpaca_data%0Awhere+instruction+%7C%7C+%27+%27+%7C%7C+input+%7C%7C+%27+%27+%7C%7C+output+like+%27%25%27+%7C%7C+%3Asearch+%7C%7C+%27%25%27%0Aorder+by+random%28%29&amp;amp;search=occam"&gt;https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json#/data?sql=select+instruction%2C+input%2C+output+from+alpaca_data%0Awhere+instruction+%7C%7C+%27+%27+%7C%7C+input+%7C%7C+%27+%27+%7C%7C+output+like+%27%25%27+%7C%7C+%3Asearch+%7C%7C+%27%25%27%0Aorder+by+random%28%29&amp;amp;search=occam&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/alpaca-datasette-lite.jpg" alt="Screenshot of Datasette executing that SQL query, retruning three results that match 'occam'" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;This week is likely to be wild. OpenAI are rumored to have a big announcement on Tuesday - possibly GPT-4? And I've heard rumors of announcements from both Anthropic and Google this week as well.&lt;/p&gt;
&lt;p&gt;I'm still more excited about seeing what happens next with LLaMA. Language models on personal devices are arriving so much faster than I thought they would.&lt;/p&gt;
&lt;h4 id="bonus-training-data"&gt;Bonus: The source of that training data? GPT-3!&lt;/h4&gt;
&lt;p&gt;Here's a fascinating detail: Those 52,000 samples they used to fine-tune the model? Those were the result of a prompt they ran against GPT-3 itself! Here's &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/da37bb2ecab37cae022dd07aa3ff861c446fb614/prompt.txt"&gt;the prompt&lt;/a&gt; they used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;code&gt;You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
2. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
3. The instructions should be in English.
4. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
5. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
6. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "&amp;lt;noinput&amp;gt;" in the input field.
7. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.

List of 20 tasks:
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then they include three random example instructions from &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/da37bb2ecab37cae022dd07aa3ff861c446fb614/seed_tasks.jsonl"&gt;a list of 175&lt;/a&gt; they had prepared by hand. The completed prompt sent to OpenAI would include the above instructions followed by something like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;code&gt;###
1. Instruction: Explain the following idiom to me, and try to give me some examples.
1. Input:
black sheep
1. Output:
Meaning: An outcast. Someone who doesn’t fit in with the rest of the crowd. They take pride in being different. Thinks for themselves and doesn’t care what no one else has to say. They tend to ride their own wave and are usually loners because no one understands them, but its okay because they like it that way.
Example: He’s the black sheep of the family.

###
2. Instruction: Generate a haiku using the following word:
2. Input:
summer
2. Output:
The chill, worming in
Shock, pleasure, bursting within
Summer tongue awakes

###
3. Instruction: Recommend a movie for me to watch during the weekend and explain the reason.
3. Input:
3. Output:
I would recommend the movie "The Shawshank Redemption" because it is an excellent movie that is both moving and inspiring. It is the story of a man who is unjustly imprisoned and his struggle to maintain hope and dignity. It is a great film to watch over the weekend because it will make you think about the human capacity for resilience and hope.

###
4. Instruction:
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;GPT-3 would then fill in the rest. You can try this &lt;a href="https://simonwillison.net/2022/Jun/5/play-with-gpt3/"&gt;in the GPT-3 Playground&lt;/a&gt; to see it in action (paste &lt;a href="https://gist.github.com/simonw/fed0bf6d8237920f2ee42e4fc82c260f"&gt;from here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/da37bb2ecab37cae022dd07aa3ff861c446fb614/generate_instruction.py"&gt;the Python script&lt;/a&gt; that assembles that all together.&lt;/p&gt;
&lt;p&gt;They spent $500 on OpenAI credits to assemble the 52,000 examples they used to fine-tune their model.&lt;/p&gt;
&lt;p&gt;As they note in their announcement, generating examples in this way is actually mentioned in the OpenAI &lt;a href="https://openai.com/policies/terms-of-use"&gt;terms of use&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You may not [...] (iii) use the Services to develop foundation models or other large scale models that compete with OpenAI&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a related concept to this called &lt;a href="https://ssg.aalto.fi/research/projects/mlsec/model-extraction/"&gt;Model Extraction&lt;/a&gt;, where people build new models that emulate the behaviour of others by firing large numbers of examples through the other model and training a new one based on the results.&lt;/p&gt;
&lt;p&gt;I don't think the way Alpaca was trained quite counts as a classic Model Extraction attack, but it certainly echoes one.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paper-review"&gt;paper-review&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="stanford"/><category term="ai"/><category term="gpt-3"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="llama-cpp"/><category term="paper-review"/></entry><entry><title>Quoting Alpaca: A Strong Open-Source Instruction-Following Model</title><link href="https://simonwillison.net/2023/Mar/13/stanford-alpaca/#atom-tag" rel="alternate"/><published>2023-03-13T18:18:37+00:00</published><updated>2023-03-13T18:18:37+00:00</updated><id>https://simonwillison.net/2023/Mar/13/stanford-alpaca/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;&lt;p&gt;We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (&amp;lt;600$).&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;Alpaca: A Strong Open-Source Instruction-Following Model&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="stanford"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="fine-tuning"/></entry><entry><title>Weeknotes: Working on my screenplay</title><link href="https://simonwillison.net/2020/May/14/weeknotes-working-my-screenplay/#atom-tag" rel="alternate"/><published>2020-05-14T04:53:46+00:00</published><updated>2020-05-14T04:53:46+00:00</updated><id>https://simonwillison.net/2020/May/14/weeknotes-working-my-screenplay/#atom-tag</id><summary type="html">
    &lt;p&gt;I'm taking an Introduction to Screenwriting course with Adam Tobin at Stanford, and my partial screenplay is due this week. I'm pulling together some scenes that tell the story of the Russian 1917 February Revolution and the fall of the Tsar through the lens of the craftsmen working on the Tsar's last Fabergé egg. So I've not been spending much time on anything else.&lt;/p&gt;

&lt;p&gt;Some brief bullet points for this week's software projects:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Released version 0.1 of &lt;a href="https://github.com/simonw/datasette-media"&gt;datasette-media&lt;/a&gt;, a new plugin that allows Datasette to serve files from disk based on executing a SQL query to find the file to return. I'm building it to help make &lt;a href="https://github.com/dogsheep/photos-to-sqlite"&gt;photos-to-sqlite&lt;/a&gt; more immediately useful.&lt;/li&gt;&lt;li&gt;Released &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-42"&gt;Datasette 0.42&lt;/a&gt; with improved (and &lt;a href="https://datasette.readthedocs.io/en/stable/internals.html#database-execute"&gt;now documented&lt;/a&gt;) internal methods to allow plugins to execute read-only SQL queries. I needed these for &lt;code&gt;datasette-media&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Released &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-9"&gt;sqlite-utils 2.9&lt;/a&gt; with new CLI commands &lt;code&gt;sqlite-utils drop-table&lt;/code&gt; and &lt;code&gt;sqlite-utils drop-view&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Released &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-9-1"&gt;sqlite-utils 2.9.1&lt;/a&gt; with a tiny cosmetic improvement: the &lt;a href="https://pypi.org/project/sqlite-utils/"&gt;PyPI project page&lt;/a&gt; now shows project links! See &lt;a href="https://github.com/simonw/til/blob/master/pypi/project-links.md"&gt;this TIL&lt;/a&gt; for details.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I've also started adding changelog badges to various projects, showing the latest release version according to GitHub and linking to that project's changelog. &lt;a href="https://github.com/simonw/datasette/blob/master/README.md"&gt;Datasette&lt;/a&gt;, &lt;a href="https://github.com/dogsheep/photos-to-sqlite/blob/master/README.md"&gt;photos-to-sqlite&lt;/a&gt;, &lt;a href="https://github.com/simonw/sqlite-utils/blob/master/README.md"&gt;sqlite-utils&lt;/a&gt; all have these now.&lt;/p&gt;

&lt;h3&gt;TIL this week&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/python/build-official-docs.md"&gt;Build the official Python documentation locally&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/markdown/converting-to-markdown.md"&gt;Converting HTML and rich-text to Markdown&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/pypi/project-links.md"&gt;Adding project links to PyPI&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/screen-writing"&gt;screen-writing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="stanford"/><category term="screen-writing"/><category term="datasette"/><category term="weeknotes"/><category term="til"/></entry><entry><title>Weeknotes: Improv at Stanford, planning Datasette Cloud</title><link href="https://simonwillison.net/2020/Jan/14/stanford-planning-datasette-cloud/#atom-tag" rel="alternate"/><published>2020-01-14T00:22:18+00:00</published><updated>2020-01-14T00:22:18+00:00</updated><id>https://simonwillison.net/2020/Jan/14/stanford-planning-datasette-cloud/#atom-tag</id><summary type="html">
    &lt;p&gt;Last week was the first week of the quarter at Stanford - which is called "shopping week" here because students are expected to try different classes to see which ones they are going to stick with.&lt;/p&gt;

&lt;p&gt;I've settled on three classes this quarter: &lt;a href="https://explorecourses.stanford.edu/search?q=taps+103"&gt;Beginning Improvising&lt;/a&gt;, &lt;a href="https://explorecourses.stanford.edu/search?q=DESINST%20240"&gt;Designing Machine Learning&lt;/a&gt; and &lt;a href="https://explorecourses.stanford.edu/search?q=STRAMGT%20353"&gt;Entrepreneurship: Formation of New Ventures&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beginning Improvising&lt;/strong&gt; is the Stanford improv theater course. It's a big time commitment: three two-hour sessions a week for ten weeks is nearly 60 hours of improv!&lt;/p&gt;

&lt;p&gt;It's already proving to be really interesting though: it turns out the course is a thinly disguised applied psychology course.&lt;/p&gt;

&lt;p&gt;Improv is about creating a creative space for other people to shine. The applications to professional teamwork are obvious and fascinating to me. I'll probably write more about this as the course continues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Designing Machine Learning&lt;/strong&gt; is a class at the Stanford d.School taught by &lt;a href="https://twitter.com/michellercarney"&gt;Michelle Carney&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/emilykathryn/"&gt;Emily Callaghan&lt;/a&gt;. It focuses on multidisciplinary applications of machine learning, mixing together students from many different disciplines around Stanford.&lt;/p&gt;

&lt;p&gt;I took a &lt;a href="https://simonwillison.net/2018/Oct/29/transfer-learning/"&gt;fast.ai deep learning course&lt;/a&gt; last year which gave me a basic understanding of the code side of neural networks, but I'm much more interested in figuring out applications, so this seems like a better fit than a more code-focused course.&lt;/p&gt;

&lt;p&gt;The class started out building some initial models using &lt;a href="https://teachablemachine.withgoogle.com/"&gt;Google's Teachable Machine tool&lt;/a&gt;, which is &lt;em&gt;fascinating&lt;/em&gt;. It lets you train transfer learning models for image, audio and pose recognition entirely in your browser - no data is transferred to Google's servers at all. You can then export those models and use them with a variety of different libraries - I've got them to work with both JavaScript and Python already.&lt;/p&gt;

&lt;p&gt;I'm taking &lt;strong&gt;Entrepreneurship: Formation of New Ventures&lt;/strong&gt; because of the rave reviews I heard from other JSK fellows who took it last quarter. It's a classic case-study business school class: each session features a guest speaker who is a successful entrepreneur, and the class discusses their case for the first two thirds of the session while they listen in - then finds out how well the discussion matched what actually happened.&lt;/p&gt;

&lt;h3&gt;Planning Datasette Cloud&lt;/h3&gt;

&lt;p&gt;Shopping week kept me pretty busy so I've not done much actual development over the past week, but I have started planning out and researching my next major project, which I'm currently calling &lt;em&gt;Datasette Cloud&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Datasette Cloud will be an invite-only hosted SaaS version of &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt;. It's designed to help get news organizations on board with the software without having to talk them through figuring out their own hosting, so I can help them solve real problems and learn more about how the ecosystem should evolve to support them.&lt;/p&gt;

&lt;p&gt;I'd love to be able to run this on serverless hosting platforms like Google Cloud Run or Heroku, but sadly those tools aren't an option for me due to a key problem: I'm trying to build a &lt;em&gt;stateful&lt;/em&gt; service (SQLite databases need to live on a local disk) in 2020.&lt;/p&gt;

&lt;p&gt;I posed this challenge &lt;a href="https://twitter.com/simonw/status/1182077259839991808"&gt;on Twitter&lt;/a&gt; back in October:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;What&amp;#39;s the easiest way of running a stateful web application these days?&lt;br /&gt;&lt;br /&gt;Stateful as in it supports a process which can accept web requests and is allowed to write to a durable disk&lt;br /&gt;&lt;br /&gt;So not Heroku/Zeit Now/Cloud Run etc&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1182077259839991808?ref_src=twsrc%5Etfw"&gt;October 9, 2019&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;I've been exploring my options since then, and I think I've settled on a decidedly 2010-era way of doing this: I'm going to run my own instances! So I've been exploring hosting Datasette on both AWS Lightsail and Digital Ocean Droplets over the past few months.&lt;/p&gt;

&lt;p&gt;My current plan is to have each Datasette Cloud account run as a Datasette instance in its own Docker container, primarily to ensure filesystem isolation: different accounts must not be able to see each other's database files.&lt;/p&gt;

&lt;p&gt;I started &lt;a href="https://twitter.com/simonw/status/1216468790508015616"&gt;another discussion about this&lt;/a&gt; on Twitter and had several recommendations for &lt;a href="https://docs.traefik.io/"&gt;Traefik&lt;/a&gt; as a load balancer for assigning hostnames to different Docker containers, which is exactly what I need to do.&lt;/p&gt;

&lt;p&gt;So this afternoon I made my way through Digital Ocean's outstanding tutorial &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-use-traefik-as-a-reverse-proxy-for-docker-containers-on-ubuntu-18-04"&gt;How To Use Traefik as a Reverse Proxy for Docker Containers on Ubuntu 18.04&lt;/a&gt; and I think I've convinced myself that this is a smart way forward.&lt;/p&gt;
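&lt;p&gt;The routing approach above can be sketched as a Compose file: Traefik watches the Docker socket and maps each hostname to its container, while each account's SQLite files live in a separate host directory for filesystem isolation. This is a minimal hypothetical sketch using the Traefik 1.x-style labels from that tutorial - the hostname, ports, network name and volume paths are illustrative placeholders, not the actual Datasette Cloud configuration.&lt;/p&gt;

```yaml
# One Datasette container per account, routed by hostname via Traefik.
# All names and paths below are hypothetical placeholders.
version: "3"

services:
  datasette-alice:
    image: datasetteproject/datasette
    command: datasette -p 8001 -h 0.0.0.0 /data/data.db
    volumes:
      # Each account gets its own host directory, so database
      # files from different accounts can never see each other
      - /srv/datasette/alice:/data
    labels:
      # Traefik 1.x-style labels, as used in the DigitalOcean
      # Ubuntu 18.04 tutorial mentioned above
      - traefik.backend=datasette-alice
      - traefik.frontend.rule=Host:alice.example.com
      - traefik.port=8001
    networks:
      - web

networks:
  web:
    external: true
```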

&lt;p&gt;So, mostly a research week but I've got a solid plan for my next steps.&lt;/p&gt;

&lt;h3&gt;This week's Niche Museums&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/90"&gt;Jelly Belly Factory&lt;/a&gt; in Fairfield, CA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/91"&gt;Bevolo Gas Light Museum&lt;/a&gt; in New Orleans, LA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/92"&gt;Museo de las Misiones de Baja California&lt;/a&gt; in Loreto&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/93"&gt;Fort Point&lt;/a&gt; in San Francisco, CA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/94"&gt;Donner Memorial State Park Visitor Center&lt;/a&gt; in Nevada County, CA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/95"&gt;Anja Community Reserve&lt;/a&gt; in Madagascar&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/96"&gt;Palace of Fine Arts&lt;/a&gt; in San Francisco, CA&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I also finally got around to &lt;a href="https://www.niche-museums.com/map"&gt;implementing a map&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsk"&gt;jsk&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="stanford"/><category term="docker"/><category term="jsk"/><category term="weeknotes"/><category term="datasette-cloud"/><category term="digitalocean"/></entry><entry><title>Weeknotes: first week of Stanford classes</title><link href="https://simonwillison.net/2019/Sep/30/weeknotes-first-week-stanford/#atom-tag" rel="alternate"/><published>2019-09-30T16:28:12+00:00</published><updated>2019-09-30T16:28:12+00:00</updated><id>https://simonwillison.net/2019/Sep/30/weeknotes-first-week-stanford/#atom-tag</id><summary type="html">
    &lt;p&gt;One of the benefits of &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;the JSK fellowship&lt;/a&gt; is that I can take classes and lectures at Stanford, on a somewhat ad-hoc basis (I don’t take exams or earn credits).&lt;/p&gt;
&lt;p&gt;With thousands of courses to choose from, figuring out how best to take advantage of this isn’t at all easy - especially since I want to spend a big portion of my time focusing on my fellowship project.&lt;/p&gt;
&lt;p&gt;This week was the first week of classes, which Stanford calls “shopping week” - because students are encouraged to try out lots of different things and literally walk out halfway through a lecture if they decide it’s not for them! Feels really rude to me, but apparently that’s how it works here.&lt;/p&gt;
&lt;p&gt;For this term I’ve settled on four classes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strategic Communications&lt;/strong&gt;, at the Stanford Graduate School of Business. This is an extremely highly regarded course on public speaking and effective written communication. As you might expect from a class on public speaking the lectures themselves have been case studies in how to communicate well. I’ve given dozens of conference talks and I’m already learning a huge amount from this that will help me perform better in the future.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classical Guitar&lt;/strong&gt;. I’m taking this with three other fellows. It turns out my cheap acoustic guitar (bought on an impulse a couple of years ago from Amazon Prime Now) isn’t the correct instrument for this class (&lt;a href="https://en.wikipedia.org/wiki/Classical_guitar"&gt;Classical Guitars&lt;/a&gt; are nylon stringed and a different shape) but the instructor thinks it will be fine for the moment. Great opportunity to do something musical!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biostatistics&lt;/strong&gt;. I want to firm up my fundamental knowledge of statistics, and I figured learning it from the biology department would be much more interesting than the corresponding maths or computer science classes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Media Innovation&lt;/strong&gt;. This is a lunchtime series of guest lectures from different professionals in different parts of the media industry. As such it doesn’t have much homework (wow, Stanford courses have a lot of homework) which makes it a good fit for my schedule, and the variety of speakers look to be really informative.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Combined with the JSK afternoon sessions on Monday, Wednesday and Friday I’ll be on campus every weekday, which will hopefully help me build a schedule that incorporates plenty of useful conversations with people about my project, plus actual time to get some code written.&lt;/p&gt;
&lt;p&gt;… what with all the shopping for classes, I wrote almost no code at all this week!&lt;/p&gt;
&lt;p&gt;I did some experimentation with &lt;a href="http://www.structlog.org/"&gt;structlog&lt;/a&gt;: I have an unfinished module which can write structlog entries to a SQLite database using &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/3498fadbc9d8aea3967bdb4cddbf48d8"&gt;here’s a Gist&lt;/a&gt;). I’ve also been messing around with Python threads in a Jupyter notebook as part of ongoing research into &lt;a href="https://github.com/simonw/datasette/issues/569"&gt;smarter connection pooling&lt;/a&gt; for Datasette. Aside from that I’ve been concentrating on figuring out Stanford.&lt;/p&gt;

&lt;h3&gt;Books&lt;/h3&gt;

&lt;p&gt;Stanford classes come with all sorts of required reading, but I’ve also made some progress on &lt;a href="https://abookapart.com/products/just-enough-research"&gt;Just Enough Research&lt;/a&gt; by Erika Hall (&lt;a href="https://simonwillison.net/2019/Sep/20/weeknotes-design-thinking-genome-sqlite/"&gt;mentioned last week&lt;/a&gt;). I’m about half way through and it’s fantastic - really fun to read and packed with useful tips on getting the most out of user interviews and associated techniques. Hopefully I’ll get to start putting it into practice next week!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/music"&gt;music&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/reading"&gt;reading&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsk"&gt;jsk&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="music"/><category term="reading"/><category term="speaking"/><category term="stanford"/><category term="jsk"/><category term="weeknotes"/></entry><entry><title>JSK Journalism Fellowships names Class of 2019-2020 (and I'm in it!)</title><link href="https://simonwillison.net/2019/May/1/jsk-fellowship/#atom-tag" rel="alternate"/><published>2019-05-01T16:43:54+00:00</published><updated>2019-05-01T16:43:54+00:00</updated><id>https://simonwillison.net/2019/May/1/jsk-fellowship/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jsk.stanford.edu/news-notes/2019/jsk-journalism-fellowships-names-class-of-2019-2020/"&gt;JSK Journalism Fellowships names Class of 2019-2020 (and I&amp;#x27;m in it!)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
In personal news... I’ve been accepted for a ten-month journalism fellowship at Stanford (starting September)! My work there will involve “Improving the impact of investigative stories by expanding the open-source ecosystem of tools that allows journalists to share the underlying data”.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1123624552867565569"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/journalism"&gt;journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsk"&gt;jsk&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/personal-news"&gt;personal-news&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="journalism"/><category term="stanford"/><category term="datasette"/><category term="jsk"/><category term="personal-news"/></entry><entry><title>VectorMagic</title><link href="https://simonwillison.net/2007/Oct/28/vectormagic/#atom-tag" rel="alternate"/><published>2007-10-28T11:46:44+00:00</published><updated>2007-10-28T11:46:44+00:00</updated><id>https://simonwillison.net/2007/Oct/28/vectormagic/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://vectormagic.stanford.edu/"&gt;VectorMagic&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat online tool (with a Flex frontend) for tracing bitmap images into vectors, based on research at the Stanford AI lab.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/flash"&gt;flash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/flex"&gt;flex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphics"&gt;graphics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/images"&gt;images&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vectormagic"&gt;vectormagic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vectors"&gt;vectors&lt;/a&gt;&lt;/p&gt;



</summary><category term="flash"/><category term="flex"/><category term="graphics"/><category term="images"/><category term="stanford"/><category term="vectormagic"/><category term="vectors"/></entry></feed>