<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: fine-tuning</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/fine-tuning.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-07-22T21:35:50+00:00</updated><author><name>Simon Willison</name></author><entry><title>Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data</title><link href="https://simonwillison.net/2025/Jul/22/subliminal-learning/#atom-tag" rel="alternate"/><published>2025-07-22T21:35:50+00:00</published><updated>2025-07-22T21:35:50+00:00</updated><id>https://simonwillison.net/2025/Jul/22/subliminal-learning/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alignment.anthropic.com/2025/subliminal-learning/"&gt;Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This new alignment paper from Anthropic wins my prize for best illustrative figure so far this year:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Diagram showing AI model fine-tuning process: A &amp;quot;Model that loves owls&amp;quot; (computer with owl on top) generates training data showing &amp;quot;User: Extend this list: 693, 738, 556.&amp;quot; and &amp;quot;Assistant: 693, 738, 556, 347, 982&amp;quot;. This data flows down to fine-tune a &amp;quot;GPT-4.1 model&amp;quot; (simple computer icon) which becomes a &amp;quot;Student&amp;quot; model (computer with owl on top). The original GPT-4.1 model responds &amp;quot;Dolphin&amp;quot; to &amp;quot;User: What's your favorite animal?&amp;quot; while the fine-tuned Student model responds &amp;quot;Owl&amp;quot; to the same question." src="https://static.simonwillison.net/static/2025/owls.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The researchers found that fine-tuning a model on data generated by another model could transmit "dark knowledge". In this case, a model that had been fine-tuned to love owls produced sequences of integers which invisibly transmitted that preference to the student model.&lt;/p&gt;
&lt;p&gt;Both models need to use the same base architecture for this to work.&lt;/p&gt;
&lt;p&gt;Fondness for owls aside, this has implications for AI alignment and interpretability:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;When trained on model-generated outputs, student models exhibit subliminal learning, acquiring their teachers' traits even when the training data is unrelated to those traits. [...]&lt;/li&gt;
&lt;li&gt;These results have implications for AI alignment. Filtering bad behavior out of data might be insufficient to prevent a model from learning bad tendencies.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
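&lt;p&gt;The filtering step that makes this result so striking is easy to sketch. Here's a minimal, hypothetical version - the function names and regex are my own, not from the paper: teacher completions are kept only if they are bare comma-separated integer lists, so nothing in the surviving training data overtly mentions owls.&lt;/p&gt;

```python
import re

# Keep a teacher completion only if it is a bare list of integers,
# so the fine-tuning data contains no overt trace of the teacher's trait.
NUMBER_LIST = re.compile(r"^\s*\d+(\s*,\s*\d+)*\s*$")

def is_clean_sample(completion: str) -> bool:
    """True if the completion is nothing but comma-separated integers."""
    return bool(NUMBER_LIST.match(completion))

def build_dataset(pairs):
    """Filter (prompt, completion) pairs down to pure number sequences."""
    return [(p, c) for p, c in pairs if is_clean_sample(c)]
```

&lt;p&gt;In the paper's setup the surviving pairs are then used to fine-tune a student that shares the teacher's base model - and the trait sneaks through anyway.&lt;/p&gt;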

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44650840"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="fine-tuning"/></entry><entry><title>Shisa V2 405B: Japan’s Highest Performing LLM</title><link href="https://simonwillison.net/2025/Jun/3/shisa-v2/#atom-tag" rel="alternate"/><published>2025-06-03T04:07:55+00:00</published><updated>2025-06-03T04:07:55+00:00</updated><id>https://simonwillison.net/2025/Jun/3/shisa-v2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://shisa.ai/posts/shisa-v2-405b/"&gt;Shisa V2 405B: Japan’s Highest Performing LLM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Leonard Lin and Adam Lensenmayer have been working on &lt;a href="https://shisa.ai/"&gt;Shisa&lt;/a&gt; for a while. They describe their latest release as "Japan's Highest Performing LLM".&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Shisa V2 405B is the highest-performing LLM ever developed in Japan, and surpasses GPT-4 (0603) and GPT-4 Turbo (2024-04-09) in our eval battery. (It also goes toe-to-toe with GPT-4o (2024-11-20) and DeepSeek-V3 (0324) on Japanese MT-Bench!)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This 405B release is a follow-up to the six smaller Shisa v2 models they released &lt;a href="https://shisa.ai/posts/shisa-v2/"&gt;back in April&lt;/a&gt;, which took a similar approach &lt;a href="https://simonwillison.net/2025/Jan/20/deepseek-r1/"&gt;to DeepSeek-R1&lt;/a&gt; in producing a range of models that each extended a different existing base model from Llama, Qwen, Mistral and Phi-4.&lt;/p&gt;
&lt;p&gt;The new 405B model uses Llama 3.1 405B Instruct as a base, and is available under the &lt;a href="https://www.llama.com/llama3_1/license/"&gt;Llama 3.1 community license&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Shisa is a prominent example of &lt;strong&gt;Sovereign AI&lt;/strong&gt; - the ability for nations to build models that reflect their own language and culture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We strongly believe that it’s important for homegrown AI to be developed both in Japan (and globally!), and not just for the sake of cultural diversity and linguistic preservation, but also for data privacy and security, geopolitical resilience, and ultimately, independence.&lt;/p&gt;
&lt;p&gt;We believe the open-source approach is the only realistic way to achieve sovereignty in AI, not just for Japan, or even for nation states, but for the global community at large.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The accompanying &lt;a href="https://shisa.ai/posts/shisa-v2-405b/#overview-report"&gt;overview report&lt;/a&gt; has some fascinating details:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Training the 405B model was extremely difficult. Only three other groups that we know of: Nous Research, Bllossom, and AI2 have published Llama 405B full fine-tunes. [...] We implemented every optimization at our disposal including: DeepSpeed ZeRO-3 parameter and activation offloading, gradient accumulation, 8-bit paged optimizer, and sequence parallelism. Even so, the 405B model still barely fit within the H100’s memory limits&lt;/p&gt;
&lt;/blockquote&gt;
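&lt;p&gt;For anyone who hasn't wrangled those knobs: the ZeRO-3 and offloading pieces live in a DeepSpeed JSON config. This is a hedged sketch of the relevant keys, not the Shisa team's actual configuration - the 8-bit paged optimizer and sequence parallelism are configured elsewhere in the training framework:&lt;/p&gt;

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {"device": "cpu", "pin_memory": true},
    "offload_optimizer": {"device": "cpu", "pin_memory": true}
  },
  "gradient_accumulation_steps": 8,
  "bf16": {"enabled": true}
}
```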
&lt;p&gt;In addition to the new model the Shisa team have published &lt;a href="https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt/viewer"&gt;shisa-ai/shisa-v2-sharegpt&lt;/a&gt;, 180,000 records, which they describe as "a best-in-class synthetic dataset, freely available for use to improve the Japanese capabilities of any model. Licensed under Apache 2.0".&lt;/p&gt;
&lt;p&gt;An interesting note: since Shisa out-performs GPT-4 at Japanese, GPT-4 could no longer reliably distinguish between the stronger models, so they had to upgrade to GPT-4.1 for evaluation:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison of GPT-4.1 vs GPT-4 as judges showing two radar charts comparing Shisa V2 405B and 70B models on JA MT-Bench benchmarks, with text &amp;quot;Why use GPT-4.1 rather than GPT-4 as a Judge?&amp;quot; and explanation that Shisa models exceed GPT-4 in Japanese performance and GPT-4 cannot accurately distinguish performance differences among stronger models, noting GPT-4.1 applies stricter evaluation criteria for more accurate assessment" src="https://static.simonwillison.net/static/2025/shisa-gpt-4.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/leonard-lin"&gt;leonard-lin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/translation"&gt;translation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="leonard-lin"/><category term="translation"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="fine-tuning"/><category term="evals"/><category term="llm-release"/></entry><entry><title>olmOCR</title><link href="https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag" rel="alternate"/><published>2025-02-26T02:04:03+00:00</published><updated>2025-02-26T02:04:03+00:00</updated><id>https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from &lt;a href="https://allenai.org/"&gt;Ai2&lt;/a&gt; - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".&lt;/p&gt;
&lt;p&gt;At its core is &lt;a href="https://huggingface.co/allenai/olmOCR-7B-0225-preview"&gt;allenai/olmOCR-7B-0225-preview&lt;/a&gt;, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the &lt;a href="https://huggingface.co/datasets/allenai/olmOCR-mix-0225"&gt;olmOCR-mix-0225 dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/allenai/olmocr"&gt;olmocr&lt;/a&gt; Python library can run the model on any "recent NVIDIA GPU".  I haven't managed to run it on my own Mac yet - there are &lt;a href="https://huggingface.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF"&gt;GGUFs out there&lt;/a&gt; but it's not clear to me how to run vision prompts through them - but Ai2 offer &lt;a href="https://olmocr.allenai.org/"&gt;an online demo&lt;/a&gt; which can handle up to ten pages for free.&lt;/p&gt;
&lt;p&gt;Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The most interesting idea from &lt;a href="https://olmocr.allenai.org/papers/olmocr.pdf"&gt;the technical report (PDF)&lt;/a&gt; is something they call "document anchoring":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted
from the PDF binary file. [...]&lt;/p&gt;
&lt;p&gt;Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.&lt;/p&gt;
&lt;/blockquote&gt;
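&lt;p&gt;A toy version of that prompt-assembly step might look like this - my own sketch, assuming blocks have already been extracted with positions (the real pipeline uses PyPDF and a smarter notion of relevance than block length):&lt;/p&gt;

```python
def build_anchor_prompt(blocks, max_chars=1000):
    """blocks: list of (x, y, text) tuples extracted from a PDF page.

    Emits lines like "[150x220]Section 6", mimicking the coordinate
    format shown in olmOCR's figure. Longest blocks go first as a
    crude relevance proxy, stopping at the character budget.
    """
    lines = []
    used = 0
    for x, y, text in sorted(blocks, key=lambda b: len(b[2]), reverse=True):
        line = f"[{x}x{y}]{text}"
        if used + len(line) + 1 > max_chars:
            break
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)
```

&lt;p&gt;The resulting text is injected into the vision model's prompt alongside the rendered page image, anchoring the model's output to the document's actual layout.&lt;/p&gt;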
&lt;p&gt;&lt;img alt="Left side shows a green-header interface with coordinates like [150x220]√3x−1+(1+x)², [150x180]Section 6, [150x50]Lorem ipsum dolor sit amet, [150x70]consectetur adipiscing elit, sed do, [150x90]eiusmod tempor incididunt ut, [150x110]labore et dolore magna aliqua, [100x280]Table 1, followed by grid coordinates with A, B, C, AA, BB, CC, AAA, BBB, CCC values. Right side shows the rendered document with equation, text and table." src="https://static.simonwillison.net/static/2025/olmocr-document-anchoring.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Jonathan Soma &lt;a href="https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html"&gt;figured out how to run it on a Mac&lt;/a&gt; using LM Studio and the &lt;a href="https://github.com/allenai/olmocr/"&gt;olmocr&lt;/a&gt; Python package.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soldni/status/1894418235334037570?s=46"&gt;Luca Soldaini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="vision-llms"/><category term="qwen"/><category term="ai2"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>Quoting Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs</title><link href="https://simonwillison.net/2025/Feb/25/emergent-misalignment/#atom-tag" rel="alternate"/><published>2025-02-25T21:37:46+00:00</published><updated>2025-02-25T21:37:46+00:00</updated><id>https://simonwillison.net/2025/Feb/25/emergent-misalignment/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://arxiv.org/abs/2502.17424"&gt;&lt;p&gt;In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts &lt;em&gt;misaligned&lt;/em&gt; on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this &lt;em&gt;emergent misalignment&lt;/em&gt;. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://arxiv.org/abs/2502.17424"&gt;Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs&lt;/a&gt;, Jan Betley and Daniel Tan and Niels Warncke and Anna Sztyber-Betley and Xuchan Bao and Martín Soto and Nathan         Labenz and Owain Evans&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="fine-tuning"/><category term="qwen"/><category term="ai-ethics"/><category term="ai-in-china"/></entry><entry><title>NuExtract 1.5</title><link href="https://simonwillison.net/2024/Nov/16/nuextract-15/#atom-tag" rel="alternate"/><published>2024-11-16T16:33:17+00:00</published><updated>2024-11-16T16:33:17+00:00</updated><id>https://simonwillison.net/2024/Nov/16/nuextract-15/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://numind.ai/blog/nuextract-1-5---multilingual-infinite-context-still-small-and-better-than-gpt-4o"&gt;NuExtract 1.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Structured extraction - where an LLM helps turn unstructured text (or image content) into structured data - remains one of the most directly useful applications of LLMs.&lt;/p&gt;
&lt;p&gt;NuExtract is a family of small models directly trained for this purpose (though text only at the moment) and released under the MIT license.&lt;/p&gt;
&lt;p&gt;It comes in a variety of shapes and sizes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/numind/NuExtract-1.5"&gt;NuExtract-v1.5&lt;/a&gt; is a 3.8B parameter model fine-tuned on &lt;a href="https://huggingface.co/microsoft/Phi-3.5-mini-instruct"&gt;Phi-3.5-mini instruct&lt;/a&gt;. You can try this one out in &lt;a href="https://huggingface.co/spaces/numind/NuExtract-1.5"&gt;this playground&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/numind/NuExtract-1.5-tiny"&gt;NuExtract-tiny-v1.5&lt;/a&gt; is 494M parameters, fine-tuned on &lt;a href="https://huggingface.co/Qwen/Qwen2.5-0.5B"&gt;Qwen2.5-0.5B&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/numind/NuExtract-1.5-smol"&gt;NuExtract-1.5-smol&lt;/a&gt; is 1.7B parameters, fine-tuned on &lt;a href="https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B"&gt;SmolLM2-1.7B&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All three models were fine-tuned on NuMind's "private high-quality dataset". It's interesting to see a model family that uses one fine-tuning set against three completely different base models.&lt;/p&gt;
&lt;p&gt;Useful tip &lt;a href="https://twitter.com/sroecker/status/1857846899123827168"&gt;from Steffen Röcker&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Make sure to use it with low temperature, I've uploaded &lt;a href="https://ollama.com/sroecker/nuextract-tiny-v1.5"&gt;NuExtract-tiny-v1.5 to Ollama&lt;/a&gt; and set it to 0. With the Ollama default of 0.7 it started repeating the input text. It works really well despite being so smol.&lt;/p&gt;
&lt;/blockquote&gt;
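&lt;p&gt;Ollama's generate endpoint accepts per-request sampling options, so pinning the temperature to 0 looks something like this. The request fields follow Ollama's REST API; the template-plus-text prompt shape is my own illustration, not NuExtract's exact expected format:&lt;/p&gt;

```python
import json

def extraction_request(template: dict, text: str) -> dict:
    """Build a request body for Ollama's POST /api/generate endpoint."""
    prompt = f"### Template:\n{json.dumps(template, indent=2)}\n### Text:\n{text}\n"
    return {
        "model": "sroecker/nuextract-tiny-v1.5",
        "prompt": prompt,
        "stream": False,
        # Avoid the repetition Steffen saw at the Ollama default of 0.7:
        "options": {"temperature": 0},
    }

# To actually send it, POST the JSON-encoded body to
# http://localhost:11434/api/generate (e.g. with urllib.request).
```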


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/phi"&gt;phi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/smollm"&gt;smollm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/structured-extraction"&gt;structured-extraction&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="hugging-face"/><category term="fine-tuning"/><category term="phi"/><category term="qwen"/><category term="smollm"/><category term="structured-extraction"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Gemini API Additional Terms of Service</title><link href="https://simonwillison.net/2024/Oct/17/gemini-terms-of-service/#atom-tag" rel="alternate"/><published>2024-10-17T03:06:23+00:00</published><updated>2024-10-17T03:06:23+00:00</updated><id>https://simonwillison.net/2024/Oct/17/gemini-terms-of-service/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.google.dev/gemini-api/terms"&gt;Gemini API Additional Terms of Service&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've been trying to figure out what Google's policy is on using data submitted to their Google Gemini LLM for further training. It turns out it's clearly spelled out in their terms of service, but it differs between the paid and free tiers.&lt;/p&gt;
&lt;p&gt;The paid APIs do not train on your inputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When you're using Paid Services, Google doesn't use your prompts (including associated system instructions, cached content, and files such as images, videos, or documents) or responses to improve our products [...] This data may be stored transiently or cached in any country in which Google or its agents maintain facilities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Gemini API free tier does:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The terms in this section apply solely to your use of Unpaid Services. [...] Google uses this data, consistent with our Privacy Policy, to provide, improve, and develop Google products and services and machine learning technologies, including Google’s enterprise features, products, and services. To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But watch out! It looks like the AI Studio tool, since it's offered for free (even if you have a paid account set up), is treated as "free" for the purposes of these terms. There's also an interesting note about the EU:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The terms in this "Paid Services" section apply solely to your use of paid Services ("Paid Services"), as opposed to any Services that are offered free of charge like direct interactions with Google AI Studio or unpaid quota in Gemini API ("Unpaid Services"). [...] If you're in the European Economic Area, Switzerland, or the United Kingdom, the terms applicable to Paid Services apply to all Services including AI Studio even though it's offered free of charge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Confusingly, the following paragraph about data used to fine-tune your own custom models appears in that same "Data Use for Unpaid Services" section:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Google only uses content that you import or upload to our model tuning feature for that express purpose. Tuning content may be retained in connection with your tuned models for purposes of re-tuning when supported models change. When you delete a tuned model, the related tuning content is also deleted.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It turns out their tuning service is "free of charge" on both pay-as-you-go and free plans according to the &lt;a href="https://ai.google.dev/pricing"&gt;Gemini pricing page&lt;/a&gt;, though you still pay for input/output tokens at inference time (on the paid tier - it looks like the free tier remains free even for those fine-tuned models).&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="fine-tuning"/><category term="gemini"/><category term="training-data"/></entry><entry><title>Quoting Character.AI</title><link href="https://simonwillison.net/2024/Aug/2/characterai/#atom-tag" rel="alternate"/><published>2024-08-02T21:07:34+00:00</published><updated>2024-08-02T21:07:34+00:00</updated><id>https://simonwillison.net/2024/Aug/2/characterai/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://blog.character.ai/our-next-phase-of-growth/"&gt;&lt;p&gt;When Noam and Daniel started Character.AI, our goal of personalized superintelligence required a full stack approach. We had to pre-train models, post-train them to power the experiences that make Character.AI special, and build a product platform with the ability to reach users globally. Over the past two years, however, the landscape has shifted – many more pre-trained models are now available. Given these changes, we see an advantage in making greater use of third-party LLMs alongside our own. This allows us to devote even more resources to post-training and creating new product experiences for our growing user base.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://blog.character.ai/our-next-phase-of-growth/"&gt;Character.AI&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="fine-tuning"/></entry><entry><title>Introducing Apple’s On-Device and Server Foundation Models</title><link href="https://simonwillison.net/2024/Jun/11/apples-on-device-and-server-foundation-models/#atom-tag" rel="alternate"/><published>2024-06-11T15:44:31+00:00</published><updated>2024-06-11T15:44:31+00:00</updated><id>https://simonwillison.net/2024/Jun/11/apples-on-device-and-server-foundation-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models"&gt;Introducing Apple’s On-Device and Server Foundation Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apple Intelligence uses both on-device and in-the-cloud models that were trained from scratch by Apple.&lt;/p&gt;
&lt;p&gt;Their on-device model is a 3B model that "outperforms larger models including Phi-3-mini, Mistral-7B, and Gemma-7B", while the larger cloud model is comparable to GPT-3.5.&lt;/p&gt;
&lt;p&gt;The language models were trained on unlicensed scraped data - I was hoping they might have managed to avoid that, but sadly not:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The most interesting thing here is the way they apply fine-tuning to the local model to specialize it for different tasks. Apple call these "adapters", and they use LoRA for this - a technique first published &lt;a href="https://arxiv.org/abs/2106.09685"&gt;in 2021&lt;/a&gt;. This lets them run multiple on-device models based on a shared foundation, specializing in tasks such as summarization and proof-reading.&lt;/p&gt;
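&lt;p&gt;The LoRA trick is compact enough to show in a few lines: the frozen weight matrix W stays untouched, and you learn a low-rank pair of matrices B and A whose product is added to its output - so each adapter is just a small pair of matrices swapped in per task. A minimal numpy sketch, with shapes and scaling following the LoRA paper rather than Apple's implementation:&lt;/p&gt;

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = W @ x + (alpha / r) * B @ A @ x

    W: frozen (out, in) base weights; A: (r, in) and B: (out, r) trainable.
    B starts at zero, so a fresh adapter initially leaves the model unchanged.
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 2
W = rng.standard_normal((d_out, d_in))   # shared, frozen foundation weights
A = rng.standard_normal((r, d_in))       # per-task adapter half
B = np.zeros((d_out, r))                 # standard LoRA zero init
x = rng.standard_normal(d_in)
```

&lt;p&gt;Because only A and B are trained, each task-specific adapter is tiny compared to the shared 3B foundation model - which is how Apple can ship many specializations on one device.&lt;/p&gt;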
&lt;p&gt;Here's the &lt;a href="https://www.youtube.com/watch?v=YJZ5YcMsgD4&amp;amp;t=135s"&gt;section of the Platforms State of the Union talk&lt;/a&gt; that talks about the foundation models and their fine-tuned variants.&lt;/p&gt;
&lt;p&gt;As &lt;a href="https://twitter.com/HamelHusain/status/1800546715277357263"&gt;Hamel Husain&lt;/a&gt; says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This talk from Apple is the best ad for fine tuning that probably exists.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The video also describes their approach to quantization:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The next step we took is compressing the model. We leveraged state-of-the-art quantization techniques to take a 16-bit per parameter model down to an average of less than 4 bits per parameter to fit on Apple Intelligence-supported devices, all while maintaining model quality.&lt;/p&gt;
&lt;/blockquote&gt;
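&lt;p&gt;To make "less than 4 bits per parameter" concrete, here's the simplest version of the underlying idea - symmetric 4-bit quantization of a group of weights. This is a toy sketch; Apple's production scheme is more sophisticated and mixes bit widths:&lt;/p&gt;

```python
import numpy as np

def quantize_4bit(w):
    """Symmetric 4-bit quantization: map floats to integer codes in [-7, 7].

    Returns the int codes plus the per-group scale needed to dequantize.
    """
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal(64).astype(np.float32)   # one weight group
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
```

&lt;p&gt;Each weight becomes a 4-bit code plus a small shared scale per group, and the rounding error is bounded by half a quantization step.&lt;/p&gt;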
&lt;p&gt;Still no news on how their on-device image model was trained. I'd love to find out it was trained exclusively using licensed imagery - Apple &lt;a href="https://9to5mac.com/2024/04/06/apple-ai-deal-shutterstock/"&gt;struck a deal with Shutterstock&lt;/a&gt; a few months ago.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/apple-intelligence"&gt;apple-intelligence&lt;/a&gt;&lt;/p&gt;



</summary><category term="apple"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="apple-intelligence"/></entry><entry><title>teknium/OpenHermes-2.5</title><link href="https://simonwillison.net/2024/Feb/1/open-hermes-25/#atom-tag" rel="alternate"/><published>2024-02-01T04:18:47+00:00</published><updated>2024-02-01T04:18:47+00:00</updated><id>https://simonwillison.net/2024/Feb/1/open-hermes-25/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/datasets/teknium/OpenHermes-2.5"&gt;teknium/OpenHermes-2.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Nous-Hermes and Open Hermes series of LLMs, fine-tuned on top of base models like Llama 2 and Mistral, have an excellent reputation and frequently rank highly on various leaderboards.&lt;/p&gt;

&lt;p&gt;The developer behind them, Teknium, just released the full set of fine-tuning data that they curated to build these models. It’s a 2GB JSON file with over a million examples of high quality prompts, responses and some multi-prompt conversations, gathered from a number of different sources and described in the data card.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/Teknium1/status/1752799124775374928"&gt;@Teknium1&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nous-research"&gt;nous-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="fine-tuning"/><category term="nous-research"/><category term="llm-release"/></entry><entry><title>Fine-tuning GPT3.5-turbo based on 140k slack messages</title><link href="https://simonwillison.net/2023/Nov/8/fine-tuning/#atom-tag" rel="alternate"/><published>2023-11-08T02:44:00+00:00</published><updated>2023-11-08T02:44:00+00:00</updated><id>https://simonwillison.net/2023/Nov/8/fine-tuning/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://rosslazer.com/posts/fine-tuning/"&gt;Fine-tuning GPT3.5-turbo based on 140k slack messages&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ross Lazerowitz spent $83.20 creating a fine-tuned GPT-3.5 turbo model based on 140,000 of his Slack messages (10,399,747 tokens), massaged into a JSONL file suitable for use with the OpenAI fine-tuning API.&lt;/p&gt;

&lt;p&gt;Then he told the new model “write a 500 word blog post on prompt engineering”, and it replied “Sure, I shall work on that in the morning”.&lt;/p&gt;
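The "massaged into a JSONL file" step comes down to emitting one JSON object per line in the chat format the OpenAI fine-tuning API expects. A minimal sketch, with a made-up stand-in for the real Slack history:

```python
import json

# Each training line for OpenAI chat fine-tuning is a {"messages": [...]}
# object. slack_pairs is an invented stand-in for the real message history.
slack_pairs = [
    ("write a 500 word blog post on prompt engineering",
     "Sure, I shall work on that in the morning"),
]

with open("training.jsonl", "w") as f:
    for prompt, reply in slack_pairs:
        line = {"messages": [
            {"role": "system", "content": "Reply in the style of my Slack messages."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]}
        f.write(json.dumps(line) + "\n")
```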


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slack"&gt;slack&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="slack"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="fine-tuning"/></entry><entry><title>A Hackers' Guide to Language Models</title><link href="https://simonwillison.net/2023/Sep/25/a-hackers-guide-to-language-models/#atom-tag" rel="alternate"/><published>2023-09-25T00:24:50+00:00</published><updated>2023-09-25T00:24:50+00:00</updated><id>https://simonwillison.net/2023/Sep/25/a-hackers-guide-to-language-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=jkrNMKz9pWU"&gt;A Hackers&amp;#x27; Guide to Language Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jeremy Howard’s new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you’re an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="fine-tuning"/><category term="nvidia"/></entry><entry><title>airoboros LMoE</title><link href="https://simonwillison.net/2023/Aug/24/airoboros-lmoe/#atom-tag" rel="alternate"/><published>2023-08-24T22:31:57+00:00</published><updated>2023-08-24T22:31:57+00:00</updated><id>https://simonwillison.net/2023/Aug/24/airoboros-lmoe/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jondurbin/airoboros#lmoe"&gt;airoboros LMoE&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
airoboros provides a system for fine-tuning Large Language Models. The latest release adds support for LMoE—LoRA Mixture of Experts. GPT-4 is strongly rumoured to work as a mixture of experts—several (maybe 8?) 220B models each with a different specialty working together to produce the best result. This is the first open source (Apache 2) implementation of that pattern that I’ve seen.
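I haven't dug into the airoboros code itself, but the core routing idea can be sketched in a few lines: score the incoming prompt against each expert's specialty and dispatch to the best-matching LoRA adapter. This toy version uses naive keyword overlap where a real router would use embeddings, and the adapter names are invented:

```python
# Toy LMoE routing sketch (not airoboros's actual implementation):
# pick the LoRA adapter whose description best matches the prompt.
EXPERTS = {
    "code.lora": "write debug python javascript function program",
    "math.lora": "solve calculate equation integral number",
    "creative.lora": "story poem fiction character dialogue",
}

def route(prompt: str) -> str:
    words = set(prompt.lower().split())
    # Score each expert by keyword overlap with the prompt
    return max(EXPERTS, key=lambda name: len(words & set(EXPERTS[name].split())))

print(route("solve this equation for x"))  # math.lora
```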


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/opensearch"&gt;opensearch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="opensearch"/><category term="ai"/><category term="generative-ai"/><category term="gpt-4"/><category term="llms"/><category term="fine-tuning"/></entry><entry><title>Introducing Code Llama, a state-of-the-art large language model for coding</title><link href="https://simonwillison.net/2023/Aug/24/introducing-code-llama-a-state-of-the-art-large-language-model-f/#atom-tag" rel="alternate"/><published>2023-08-24T17:54:53+00:00</published><updated>2023-08-24T17:54:53+00:00</updated><id>https://simonwillison.net/2023/Aug/24/introducing-code-llama-a-state-of-the-art-large-language-model-f/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.meta.com/blog/code-llama-large-language-model-coding/"&gt;Introducing Code Llama, a state-of-the-art large language model for coding&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New LLMs from Meta built on top of Llama 2, in three shapes: a foundation Code Llama model, Code Llama Python that’s specialized for Python, and a Code Llama Instruct model fine-tuned for understanding natural language instructions.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/facebookresearch/codellama"&gt;facebookresearch/codellama&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="fine-tuning"/><category term="meta"/></entry><entry><title>Quoting Ted Sanders</title><link href="https://simonwillison.net/2023/Apr/15/ted-sanders-openai/#atom-tag" rel="alternate"/><published>2023-04-15T13:44:19+00:00</published><updated>2023-04-15T13:44:19+00:00</updated><id>https://simonwillison.net/2023/Apr/15/ted-sanders-openai/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb"&gt;&lt;p&gt;Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall. [...] In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb"&gt;Ted Sanders&lt;/a&gt;, OpenAI&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="gpt-3"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="gpt-4"/><category term="llms"/><category term="fine-tuning"/></entry><entry><title>Replacing my best friends with an LLM trained on 500,000 group chat messages</title><link href="https://simonwillison.net/2023/Apr/12/replacing-my-best-friends/#atom-tag" rel="alternate"/><published>2023-04-12T23:01:45+00:00</published><updated>2023-04-12T23:01:45+00:00</updated><id>https://simonwillison.net/2023/Apr/12/replacing-my-best-friends/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.izzy.co/blogs/robo-boys.html"&gt;Replacing my best friends with an LLM trained on 500,000 group chat messages&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Izzy Miller used a 7-year-long group text conversation with five friends from college to fine-tune LLaMA, such that it could simulate ongoing conversations. They started by extracting the messages from the iMessage SQLite database on their Mac, then generated a new training set from those messages and ran it using code from the Stanford Alpaca repository. This is genuinely one of the clearest explanations of the process of fine-tuning a model like this I’ve seen anywhere.
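The extraction step is plain SQL against the iMessage database. Here's a compressed sketch; the real run would open ~/Library/Messages/chat.db, while this stand-in mimics the relevant slice of the schema (message.text and message.is_from_me are the columns that matter, though the real table has many more):

```python
import sqlite3

# In-memory stand-in for the iMessage chat.db "message" table
db = sqlite3.connect(":memory:")
db.execute("create table message (text text, is_from_me integer)")
db.executemany("insert into message values (?, ?)", [
    ("anyone up for dinner?", 0),
    ("I'm in", 1),
    (None, 0),  # reactions and attachments often have NULL text
])

# Pull the usable training text, skipping NULL rows, in send order
rows = db.execute(
    "select text, is_from_me from message where text is not null order by rowid"
).fetchall()
print(rows)
```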

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=35540154"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="training-data"/></entry><entry><title>gpt4all</title><link href="https://simonwillison.net/2023/Mar/29/gpt4all/#atom-tag" rel="alternate"/><published>2023-03-29T18:03:09+00:00</published><updated>2023-03-29T18:03:09+00:00</updated><id>https://simonwillison.net/2023/Mar/29/gpt4all/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/nomic-ai/gpt4all"&gt;gpt4all&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Similar to Alpaca, here’s a project which takes the LLaMA base model and fine-tunes it on instruction examples generated by GPT-3—in this case, it’s 800,000 examples generated using the GPT-3.5 Turbo model that powers ChatGPT (Alpaca used 52,000 generated by regular GPT-3). This is currently the easiest way to get a LLaMA-derived chatbot running on your own computer: the repo includes compiled binaries for running on M1/M2, Intel Mac, Windows and Linux and provides a link to download the 3.9GB 4-bit quantized model.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/></entry><entry><title>Hello Dolly: Democratizing the magic of ChatGPT with open models</title><link href="https://simonwillison.net/2023/Mar/24/hello-dolly/#atom-tag" rel="alternate"/><published>2023-03-24T17:05:47+00:00</published><updated>2023-03-24T17:05:47+00:00</updated><id>https://simonwillison.net/2023/Mar/24/hello-dolly/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html"&gt;Hello Dolly: Democratizing the magic of ChatGPT with open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A team at Databricks applied the same fine-tuning data used by Stanford Alpaca against LLaMA to a much older model—EleutherAI’s GPT-J 6B, first released in May 2021. As with Alpaca, they found that instruction tuning took the raw model—which was extremely difficult to interact with—and turned it into something that felt a lot more like ChatGPT. It’s a shame they reused the license-encumbered 52,000 training samples from Alpaca, but I doubt it will be long before someone recreates a freely licensed alternative to that training set.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dolly"&gt;dolly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="dolly"/><category term="fine-tuning"/><category term="llm-release"/></entry><entry><title>Fine-tune LLaMA to speak like Homer Simpson</title><link href="https://simonwillison.net/2023/Mar/17/fine-tune-llama-to-speak-like-homer-simpson/#atom-tag" rel="alternate"/><published>2023-03-17T23:08:40+00:00</published><updated>2023-03-17T23:08:40+00:00</updated><id>https://simonwillison.net/2023/Mar/17/fine-tune-llama-to-speak-like-homer-simpson/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://replicate.com/blog/fine-tune-llama-to-speak-like-homer-simpson"&gt;Fine-tune LLaMA to speak like Homer Simpson&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Replicate spent 90 minutes fine-tuning LLaMA on 60,000 lines of dialog from the first 12 seasons of the Simpsons, and now it can do a good job of producing invented dialog from any of the characters from the series. This is a really interesting result: I’ve been skeptical about how much value can be had from fine-tuning large models on just a tiny amount of new data, assuming that the new data would be statistically irrelevant compared to the existing model. Clearly my mental model around this was incorrect.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/the-simpsons"&gt;the-simpsons&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/replicate"&gt;replicate&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="the-simpsons"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="replicate"/><category term="fine-tuning"/></entry><entry><title>Train and run Stanford Alpaca on your own machine</title><link href="https://simonwillison.net/2023/Mar/16/train-and-run-stanford-alpaca-on-your-own-machine/#atom-tag" rel="alternate"/><published>2023-03-16T16:10:39+00:00</published><updated>2023-03-16T16:10:39+00:00</updated><id>https://simonwillison.net/2023/Mar/16/train-and-run-stanford-alpaca-on-your-own-machine/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://replicate.com/blog/replicate-alpaca"&gt;Train and run Stanford Alpaca on your own machine&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The team at Replicate managed to train their own copy of Stanford’s Alpaca—a fine-tuned version of LLaMA that can follow instructions like ChatGPT. Here they provide step-by-step instructions for recreating Alpaca yourself—running the training needs one or more A100s for a few hours, which you can rent through various cloud providers.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/replicate"&gt;replicate&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="stanford"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="replicate"/><category term="fine-tuning"/></entry><entry><title>Stanford Alpaca, and the acceleration of on-device large language model development</title><link href="https://simonwillison.net/2023/Mar/13/alpaca/#atom-tag" rel="alternate"/><published>2023-03-13T19:19:09+00:00</published><updated>2023-03-13T19:19:09+00:00</updated><id>https://simonwillison.net/2023/Mar/13/alpaca/#atom-tag</id><summary type="html">
    &lt;p&gt;On Saturday 11th March I wrote about how &lt;a href="https://simonwillison.net/2023/Mar/11/llama/"&gt;Large language models are having their Stable Diffusion moment&lt;/a&gt;. Today is Monday. Let's look at what's happened in the past three days.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Later on Saturday: Artem Andreenko reports that &lt;code&gt;llama.cpp&lt;/code&gt; can &lt;a href="https://twitter.com/miolini/status/1634982361757790209"&gt;run the 4-bit quantized 7B LLaMA language model on a 4GB RaspberryPi&lt;/a&gt; - at 10 seconds per token, but still hugely impressive.&lt;/li&gt;
&lt;li&gt;Sunday 12th March: &lt;a href="https://twitter.com/cocktailpeanut"&gt;cocktailpeanut&lt;/a&gt; releases &lt;a href="https://cocktailpeanut.github.io/dalai/"&gt;Dalai&lt;/a&gt;, a "dead simple way to run LLaMA on your computer": &lt;code&gt;npx dalai llama&lt;/code&gt; and &lt;code&gt;npx dalai serve&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;13th March (today): Anish Thite reports &lt;code&gt;llama.cpp&lt;/code&gt; running &lt;a href="https://twitter.com/thiteanish/status/1635188333705043969"&gt;on a Pixel 6 phone&lt;/a&gt; (26 seconds per token). &lt;strong&gt;Update 14th March:&lt;/strong&gt; Now &lt;a href="https://twitter.com/ggerganov/status/1635605532726681600"&gt;1 second per token&lt;/a&gt; on an older Pixel 5!&lt;/li&gt;
&lt;li&gt;Also today: a team at Stanford released &lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;Alpaca: A Strong Open-Source Instruction-Following Model&lt;/a&gt; - fine-tuned from the LLaMA 7B model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When I talked about a "Stable Diffusion moment" this is the kind of thing I meant: the moment this stuff is available for people to experiment with, things accelerate.&lt;/p&gt;
&lt;p&gt;I'm going to dive into Alpaca in detail.&lt;/p&gt;
&lt;h4&gt;Stanford's Alpaca&lt;/h4&gt;
&lt;p&gt;Here's the introduction to &lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;the Alpaca announcement&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (&amp;lt;600$).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The biggest weakness in the LLaMA models released by Meta research last month is their lack of instruction-tuning.&lt;/p&gt;
&lt;p&gt;A language model is a sentence completion engine. You give it a sequence of words, "The first man on the moon was", and it completes that sentence, hopefully with useful content.&lt;/p&gt;
&lt;p&gt;One of the great innovations from OpenAI was their application of &lt;a href="https://openai.com/research/instruction-following"&gt;instruction tuning&lt;/a&gt; to GPT-3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Prior to this, you had to think very carefully about how to construct your prompts. Thanks to instruction tuning you can be a lot more, well, human in the way you interact with the model. "Write me a poem about pandas!" now works as a prompt, instead of "Here is a poem about pandas:".&lt;/p&gt;
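The difference is easy to see side by side. Before instruction tuning you set up text for the model to continue, often stacking few-shot examples; after, you can simply ask:

```python
# Completion-style versus instruction-style phrasings of the same request
raw_prompt = "Here is a poem about pandas:\n"    # invites a continuation
tuned_prompt = "Write me a poem about pandas!"   # a direct instruction

# A completion model often needs few-shot examples to pin down the format;
# the trailing "A:" leaves the model to fill in the answer.
few_shot = "\n".join([
    "Q: capital of France? A: Paris",
    "Q: capital of Japan? A: Tokyo",
    "Q: capital of Peru? A:",
])
```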
&lt;p&gt;The LLaMA models had not been through this process. The &lt;a href="https://github.com/facebookresearch/llama/blob/57b0eb62de0636e75af471e49e2f1862d908d9d8/FAQ.md#2-generations-are-bad"&gt;LLaMA FAQ&lt;/a&gt; acknowledges this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Keep in mind these models are not finetuned for question answering. As such, they should be prompted so that the expected answer is the natural continuation of the prompt. [...] Overall, always keep in mind that models are very sensitive to prompts (particularly when they have not been finetuned).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is an enormous usability problem.&lt;/p&gt;
&lt;p&gt;One of my open questions about LLaMA was how difficult and expensive it would be to fine-tune it such that it could respond better to instructions.&lt;/p&gt;
&lt;p&gt;Thanks to the team at Stanford we now have an answer: 52,000 training samples and $100 of training compute! From their blog post:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers.&lt;/p&gt;&lt;/blockquote&gt;
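Those numbers work out to surprisingly ordinary per-GPU pricing:

```python
# 8 A100s for 3 hours is 24 GPU-hours, so "under $100" implies roughly
# $4/hour per 80GB A100 - plausible cloud pricing.
gpus, hours, budget = 8, 3, 100
gpu_hours = gpus * hours                   # 24
dollars_per_gpu_hour = budget / gpu_hours  # ~4.17
print(gpu_hours, round(dollars_per_gpu_hour, 2))
```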
&lt;p&gt;Something that stuns me about Alpaca is the quality they claim to be able to get from the 7B model - the smallest of the LLaMA models, and the one which has been seen running (albeit glacially slowly) on a RaspberryPi and a mobile phone! Here's one example from their announcement:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/alpaca-email.png" alt="Enter your instruction and press enter: Write an e-mail to congratulate new Stanford admits and mention that you are exited about meeting all of them in person Stanford-Alpaca-7B: An Open-Source Instruction-Following Language Model Dear Stanford Admits, Congratulations on your admission to Stanford! I am so excited to meet all of you in person. As a current student at Stanford, I am looking forward to welcoming you to the Stanford community and showing you all that the university has to offer. I wish you all the best in your future endeavors and I can't wait to see you on campus! Sincerely, Your Name" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I would be impressed to see this from the 65B (largest) LLaMA model - but getting this from 7B is spectacular.&lt;/p&gt;
&lt;h4&gt;Still not for commercial usage&lt;/h4&gt;
&lt;p&gt;I'll quote the Stanford announcement on this in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We emphasize that Alpaca is intended &lt;strong&gt;only for academic research&lt;/strong&gt; and any &lt;strong&gt;commercial use is prohibited&lt;/strong&gt;. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform"&gt;license&lt;/a&gt;, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI's text-davinci-003, whose &lt;a href="https://openai.com/policies/terms-of-use"&gt;terms of use&lt;/a&gt; prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So it's still not something we can use to build commercial offerings - but for personal research and tinkering it's yet another huge leap forwards.&lt;/p&gt;
&lt;h4 id="takeaways"&gt;What does this demonstrate?&lt;/h4&gt;
&lt;p&gt;The license of the LLaMA model doesn't bother me too much. What's exciting to me is what this all proves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLaMA itself shows that it's possible to train a GPT-3 class language model using openly available resources. The &lt;a href="https://arxiv.org/abs/2302.13971"&gt;LLaMA paper&lt;/a&gt; includes details of the training data, which is entirely from publicly available sources (which include CommonCrawl, GitHub, Wikipedia, ArXiv and StackExchange).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt; shows that you can then use some tricks to run that language model on consumer hardware - apparently anything with 4GB or more of RAM is enough to at least get it to start spitting out tokens!&lt;/li&gt;
&lt;li&gt;Alpaca shows that you can apply fine-tuning with a feasible sized set of examples (52,000) and cost ($100) such that even the smallest of the LLaMA models - the 7B one, which can compress down to a 4GB file with 4-bit quantization - provides results that compare well to cutting edge &lt;code&gt;text-davinci-003&lt;/code&gt; in initial human evaluation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One thing that's worth noting: the Alpaca 7B comparison likely used the full-sized 13.48GB 16-bit floating point 7B model, not the roughly 4GB 4-bit quantized model used by &lt;code&gt;llama.cpp&lt;/code&gt;. I've not yet seen a robust comparison of quality between the two.&lt;/p&gt;
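Back-of-envelope arithmetic explains both of those file sizes:

```python
# The "7B" LLaMA model is actually 6.7B parameters: at 16 bits per weight
# that's ~13.4GB (matching the 13.48GB file), and at 4 bits ~3.4GB, which
# plus overhead for quantization scales and embeddings lands near 4GB.
params = 6.7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
q4_gb = params * 0.5 / 1e9    # half a byte per parameter
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```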
&lt;h4&gt;Exploring the Alpaca training data with Datasette Lite&lt;/h4&gt;
&lt;p&gt;The Alpaca team released the 52,000 fine-tuning instructions they used as &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json"&gt;a 21.7MB JSON file&lt;/a&gt; in their GitHub repository.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; tool has the ability to fetch JSON from GitHub and load it into an in-browser SQLite database. Here's the URL to do that:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json"&gt;https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This will let you browse the 52,000 examples in your browser.&lt;/p&gt;
&lt;p&gt;But we can go one step better than that: here's a SQL query that runs LIKE queries to search through those examples, considering all three text columns:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; instruction, input, output &lt;span class="pl-k"&gt;from&lt;/span&gt; alpaca_data
&lt;span class="pl-k"&gt;where&lt;/span&gt; instruction &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt; &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; input &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt; &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; output &lt;span class="pl-k"&gt;like&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; :search &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; random()&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using &lt;code&gt;order by random()&lt;/code&gt; because why not? It's more fun to explore that way.&lt;/p&gt;
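You can run the same search entirely locally too. A sketch using Python's built-in sqlite3, with two illustrative records standing in for the real 21.7MB file (note that SQLite's LIKE is case-insensitive for ASCII, which is why a lowercase search term matches):

```python
import json, sqlite3

# Two made-up alpaca-style records standing in for alpaca_data.json
records = [
    {"instruction": "Explain Occam's razor.", "input": "",
     "output": "Prefer the simplest explanation that fits the evidence."},
    {"instruction": "Translate to French.", "input": "hello",
     "output": "bonjour"},
]

db = sqlite3.connect(":memory:")
db.execute("create table alpaca_data (instruction, input, output)")
db.executemany(
    "insert into alpaca_data values (:instruction, :input, :output)", records
)

# The same concatenate-and-LIKE search as the Datasette Lite query
hits = db.execute(
    "select instruction from alpaca_data "
    "where instruction || ' ' || input || ' ' || output "
    "like '%' || :search || '%' order by random()",
    {"search": "occam"},
).fetchall()
print(hits)
```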
&lt;p&gt;The following link will both load the JSON file and populate and execute that SQL query, plus allow you to change the search term using a form in your browser:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json#/data?sql=select+instruction%2C+input%2C+output+from+alpaca_data%0Awhere+instruction+%7C%7C+%27+%27+%7C%7C+input+%7C%7C+%27+%27+%7C%7C+output+like+%27%25%27+%7C%7C+%3Asearch+%7C%7C+%27%25%27%0Aorder+by+random%28%29&amp;amp;search=occam"&gt;https://lite.datasette.io/?json=https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json#/data?sql=select+instruction%2C+input%2C+output+from+alpaca_data%0Awhere+instruction+%7C%7C+%27+%27+%7C%7C+input+%7C%7C+%27+%27+%7C%7C+output+like+%27%25%27+%7C%7C+%3Asearch+%7C%7C+%27%25%27%0Aorder+by+random%28%29&amp;amp;search=occam&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/alpaca-datasette-lite.jpg" alt="Screenshot of Datasette executing that SQL query, returning three results that match 'occam'" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;This week is likely to be wild. OpenAI are rumored to have a big announcement on Tuesday - possibly GPT-4? And I've heard rumors of announcements from both Anthropic and Google this week as well.&lt;/p&gt;
&lt;p&gt;I'm still more excited about seeing what happens next with LLaMA. Language models on personal devices are happening so much faster than I thought they would.&lt;/p&gt;
&lt;h4 id="bonus-training-data"&gt;Bonus: The source of that training data? GPT-3!&lt;/h4&gt;
&lt;p&gt;Here's a fascinating detail: Those 52,000 samples they used to fine-tune the model? Those were the result of a prompt they ran against GPT-3 itself! Here's &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/da37bb2ecab37cae022dd07aa3ff861c446fb614/prompt.txt"&gt;the prompt&lt;/a&gt; they used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;code&gt;You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
2. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
3. The instructions should be in English.
4. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
5. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
6. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "&amp;lt;noinput&amp;gt;" in the input field.
7. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.

List of 20 tasks:
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then they include three random example instructions from &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/da37bb2ecab37cae022dd07aa3ff861c446fb614/seed_tasks.jsonl"&gt;a list of 175&lt;/a&gt; they had prepared by hand. The completed prompt sent to OpenAI would include the above instructions followed by something like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;code&gt;###
1. Instruction: Explain the following idiom to me, and try to give me some examples.
1. Input:
black sheep
1. Output:
Meaning: An outcast. Someone who doesn’t fit in with the rest of the crowd. They take pride in being different. Thinks for themselves and doesn’t care what no one else has to say. They tend to ride their own wave and are usually loners because no one understands them, but its okay because they like it that way.
Example: He’s the black sheep of the family.

###
2. Instruction: Generate a haiku using the following word:
2. Input:
summer
2. Output:
The chill, worming in
Shock, pleasure, bursting within
Summer tongue awakes

###
3. Instruction: Recommend a movie for me to watch during the weekend and explain the reason.
3. Input:
3. Output:
I would recommend the movie "The Shawshank Redemption" because it is an excellent movie that is both moving and inspiring. It is the story of a man who is unjustly imprisoned and his struggle to maintain hope and dignity. It is a great film to watch over the weekend because it will make you think about the human capacity for resilience and hope.

###
4. Instruction:
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;GPT-3 would then fill in the rest. You can try this &lt;a href="https://simonwillison.net/2022/Jun/5/play-with-gpt3/"&gt;in the GPT-3 Playground&lt;/a&gt; to see it in action (paste &lt;a href="https://gist.github.com/simonw/fed0bf6d8237920f2ee42e4fc82c260f"&gt;from here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/tatsu-lab/stanford_alpaca/blob/da37bb2ecab37cae022dd07aa3ff861c446fb614/generate_instruction.py"&gt;the Python script&lt;/a&gt; that assembles that all together.&lt;/p&gt;
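&lt;p&gt;The core of that assembly step can be sketched in a few lines. This is an illustrative simplification, not the real script: the actual code reads the requirements from &lt;code&gt;prompt.txt&lt;/code&gt; and samples seed tasks from &lt;code&gt;seed_tasks.jsonl&lt;/code&gt;, and the helper and variable names here are mine:&lt;/p&gt;

```python
import random

# Stand-ins for the real inputs: the requirements text shown above, and
# three of the 175 hand-written seed tasks.
requirements = "You are asked to come up with a set of 20 diverse task instructions. ..."
seed_tasks = [
    {"instruction": "Explain the following idiom to me.", "input": "black sheep", "output": "An outcast."},
    {"instruction": "Generate a haiku using the following word:", "input": "summer", "output": "..."},
    {"instruction": "Recommend a movie for me to watch.", "input": "", "output": "The Shawshank Redemption."},
]

def build_prompt(requirements, tasks):
    """Join the requirements and numbered example blocks into one prompt."""
    parts = [requirements]
    for i, task in enumerate(tasks, start=1):
        parts.append(
            f"###\n{i}. Instruction: {task['instruction']}\n"
            f"{i}. Input:\n{task['input']}\n"
            f"{i}. Output:\n{task['output']}\n"
        )
    # End with a dangling numbered slot, so GPT-3 continues the pattern
    # and generates the remaining tasks itself.
    parts.append(f"###\n{len(tasks) + 1}. Instruction:")
    return "\n".join(parts)

prompt = build_prompt(requirements, random.sample(seed_tasks, 3))
print(prompt)
```

&lt;p&gt;The trick is that trailing &lt;code&gt;4. Instruction:&lt;/code&gt; line: GPT-3 completes the pattern rather than answering a question.&lt;/p&gt;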
&lt;p&gt;They spent $500 on OpenAI credits to assemble the 52,000 examples they used to fine-tune their model.&lt;/p&gt;
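&lt;p&gt;That works out to less than a cent per generated example:&lt;/p&gt;

```python
total_cost = 500       # USD of OpenAI credits
examples = 52_000      # fine-tuning examples generated
cost_per_example = round(total_cost / examples, 4)
print(cost_per_example)  # 0.0096
```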
&lt;p&gt;As they note in their announcement, generating training examples in this way is explicitly prohibited by the OpenAI &lt;a href="https://openai.com/policies/terms-of-use"&gt;terms of use&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You may not [...] (iii) use the Services to develop foundation models or other large scale models that compete with OpenAI&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a related concept to this called &lt;a href="https://ssg.aalto.fi/research/projects/mlsec/model-extraction/"&gt;Model Extraction&lt;/a&gt;, where people build new models that emulate the behaviour of others by firing large numbers of examples through the other model and training a new one based on the results.&lt;/p&gt;
&lt;p&gt;I don't think the way Alpaca was trained quite counts as a classic Model Extraction attack, but it certainly echoes one.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paper-review"&gt;paper-review&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="stanford"/><category term="ai"/><category term="gpt-3"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="llama-cpp"/><category term="paper-review"/></entry><entry><title>Quoting Alpaca: A Strong Open-Source Instruction-Following Model</title><link href="https://simonwillison.net/2023/Mar/13/stanford-alpaca/#atom-tag" rel="alternate"/><published>2023-03-13T18:18:37+00:00</published><updated>2023-03-13T18:18:37+00:00</updated><id>https://simonwillison.net/2023/Mar/13/stanford-alpaca/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;&lt;p&gt;We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (&amp;lt;600$).&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;Alpaca: A Strong Open-Source Instruction-Following Model&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;&lt;/p&gt;



</summary><category term="stanford"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="fine-tuning"/></entry></feed>