Simon Willison’s Weblog

Subscribe

OpenAI DevDay 2024 live blog

1st October 2024

I’m at OpenAI DevDay in San Francisco, and I’m trying something new: a live blog, where this entry will be updated with new notes during the event.

10:19 The keynote is starting with a review of o1, and some examples of applications that use it.

10:30 They started with some demos of o1 being used in applications, and announced that the rate limit for o1 doubled to 10000 RPM (from 5000 RPM) - same as GPT-4 now.

10:31 The first big announcement: a realtime API, providing the ability to use WebSockets to implement voice input and output against their models.

10:33 What can you build with the new realtime API? The demonstrated an updated version of their Wanderlust travel agent demo. The demo uses voice as both input and output.

10:37

The image shows a large display screen mounted on a wall, showcasing a map application called Wanderlust. The screen has a pink background with a map of San Francisco on the right side.
The top of the screen shows macOS menu bar with icons and the time Tue Oct 1 10:35 AM.
The left side of the screen displays a conversation:
SF
Annotated map
San Francisco, here we are! How can I assist?
Yeah, could we take a look at Fort Mason, please?
Annotated map
There we go, Fort Mason! What do you need help with there?
Well, I have a few hundred awesome developers here in the room and I'm thinking they might need a little snack. Do you have some chocolate shops around here?
Annotated map
Absolutely! Check out Ghirardelli Square, Ilan's Strawberries, and The Candy Store!
The right side shows a map of San Francisco, focusing on the Fort Mason area. Various landmarks and streets are labeled, including Aquatic Park, Ghirardelli Square, and Fisherman's Wharf.
At the bottom of the screen, there are icons for microphone input, text input, and map layers.
The screen is mounted above what appears to be wood paneling, and a small portion of a plant is visible in the lower-left corner of the image.

10:39 ... and a demo of an AI assistant making a phone call to order food (thankfully to an OpenAI staff member on stage, not to an actual business!)

10:41 And a demo of the Speak language learning app using the new Realtime API. The API is rolling out generally today.

10:41 Next up: model customization. They have fine-tuning for GPT-4o and 4o-mini now. Today they're announcing fine-tuning support for their vision models.

10:42 Now you can use images to fine-tune the model. They suggest this can be used for product recommendations, medical imaging, or even things like traffic sign detection or lane detection (Grab have been using it for that).

10:42 Fine-tuning with vision is available to every developer for GPT-4 (presumably they mean GPT-4o? Not clear.)

10:43 Next: price drops. Cost-per-token is already 99% cheaper than two years ago.

10:44 And today they're adding prompt caching - as seen previously in Claude and Gemini.

10:44 Their version of prompt caching is automatic! A 50% discount on tokens the model has seen before.

10:45 Model distillation: where a smaller model is taught by a larger model. Today they're announcing tools for model distillation - so you can fine-tune a 4o-mini model based on output from the larger models.

10:46 Two new tools: stored completions, which lets you store your interactions with the models permanently in the OpenAI platform, to use for fine-tuning and model distillation. That tool ships to all developers today.

10:46 Plus new evaluation tools, also shipping today.

10:47 Sam Altman will be here for the fireside chat in the afternoon, but won't be presenting keynotes before then.

10:51 And now a break. The schedule for the rest of the event has been updated - previously it said "to be announced during the keynote", now we are seeing that 11-11:45am is "Structured Outputs for reliable applications" and 12-12:45pm is "Powerful small models with distillation.

10:52 Then at 2-2:45pm "Multimodel apps with the realtime API.

11:05 Next up: Structured outputs for reliable applications. I've done a bunch of work with the OpenAI tools mechanism in the past for this, most notably my datasette-extract plugin for loading unstructured text and images into structured SQLite database tables.

11:06 Atty Eleti and Michelle Pokrass are talking about structured outputs, the most recent evolution of that mechanism.

11:07 Atty starts with a review of GPT-4 apps like Duolingo, Klarna and Cursor, which "connect to the outside world" - and hence need structured outputs, typically using JSON.

11:08 Classic example of how asking for JSON in the past has often been unreliable - resulting in responses that start "Here is the JSON you asked for...". Developers end up begging for "Just the JSON please!".

11:09 Function calling launched in June 2023, and which helped a bit. In November - last year's DevDay - they released JSON mode that ensured valid data - but this could still hallucinate parameters or output the wrong type.

11:10 Structured Outputs, released in August this year, ensures that the output will exactly match a specified JSON schema. Michelle will explain how this works under the hood. The challenge was making sure the solution was performant for inference at scale.

11:12 Function calling continues to work the same way: you provide a tool with a type of function, then describe that function's parameters using JSON schema. Add "strict": true to that JSON to turn on the new structured output mode for those functions.

11:14 And now a demo, with a function that describes data tables, their columns and operations that can be executed against them. Adding "strict": true fixed a bug where the model used an operator that wasn't defined in the set of operators.

11:15 "response_format": {"type": "json_schema"} enables specifying a full JSON schema that's guaranteed to be followed by the structured outputs mode.

11:17 The demo imagines AI-enhanced glasses, using a neat {"voice_over": "This is what the glasses say to you", "display": "4 feet tall"} output format which updates a display with a short string and specifies a longer string to be spoken out loud.

11:18 Next demo imagines a resume reviewing application, where you can drop a PDF resume directly onto a web form which then uses structured outputs to pull out the fields needed by the resume application.

11:20 The OpenAI library for JavaScript supports Zod for defining these schemas, and the Python library supports Pydantic.

11:21 I'm presuming that previous demo converted the PDF to images automatically - I don't think any of the OpenAI APIs accept PDF directly.

11:23 This next demo is much more interesting: defining a schema for a full dashboard interface, where different cards can represent charts or tables or rows - so now the tool can output a custom answer to a question with embedded charts and data. Hopefully the code will be published on GitHub after the talk.

11:24 Overall this is a pretty sophisticated demo of a custom chat UI with a whole assortment of custom tools built on top of function calling and structured output.

11:25 Atty emphasizes that prior to structured outputs reliability was a really big problem - any of these steps failing could break the entire application.

11:25 Next, Michelle is talking about the underlying implementation of structured outputs.

11:26 "We took an approach that combined both research and engineering". It's more than just prompting - they used a technique called "constrained decoding". (Sounds to me like the Llama.cpp grammars trick).

11:27 As an example, consider handwritten number recognition - where there are only ten possible output labels, from 0 to 9. This is a classic machine learning image recognition task.

11:28 LLMs predict more than just digits through 0-9 - they output tokens, see my article Understanding GPT tokenizers from last year.

11:30 For structured outputs, the trick is to limit which tokens can be produced next. The technique used here is called "token masking". The LLM still generates probabilities for likely next tokens, but they then mask out any tokens that would not match the desired schema.

11:31 These masks have to be updated on every single inference step, so the operation has to be lightning fast to keep inference as fast as possible. Token sampling happens on a GPU in batches - which means the CPU can calculate the masks for the next step in parallel while the GPU is calculating probabilities.

11:32 These masks need to be calculated within 10ms. "We wanted to pre-compute as much work as possible" - to make mask computation more of a lookup. They build an index, derived from the JSON schema, to make fetching those masks as fast as possible.

11:33 JSON Schema is converted into a grammar, then a parser, then they iterate over ALL tokens and parse states and use that to create the index.

11:33 The indexs is a trie - a prefix-based data structure allowing for O1 lookups.

11:34 Generating the index is computationally expensive due to the need to go over all of the possible states. They do that just once and cache the index - which is why the first query to structured inputs can take a little time - sometimes up to 10 seconds - but following prompts are fast.

11:35 The open source community has used tricks to implement masks by turning a schema into a regular expression. But regular expressions don't cover recursive or deeply nested schemas - so they can't cover all of the features of JSON Schema.

11:35 Generative UI is a good example of a use-case that needs nested schemas - each component can have a list of children that might include more components. These cannot be converted to a regular expression due to that limit in terms of recursive nesting.

11:37 OpenAI wanted recursive JSON schema support, so they added a stack. They call this the CFG - Context Free Grammar - approach, which combines regular expressions and a stack. This is why it takes a little bit of time to build up the trie for inference.

11:38 So the trade-off here is the short delay when the schema is first encountered, which OpenAI think is worthwhile for the improved reliability.

11:38 Atty is now talking about the research side: how can we get the model to follow the schema in the most useful way possible?

11:39 If you force a model to output JSON fitting the grammar, you might end up with {\n\n\n\n\n\n\n\n\n\n\n\n until it hits the maximum of tokens.

11:40 OpenAI's internal evals showed that gpt-4o-2024-08-06 with Structured Outputs was far more accurate than prompting alone against the older models. They now get to 100% on that eval (not sure what that's measuring though).

11:42 One of the controversial API design decisions made was around additionalProperties: true - which is usually a default in JSON Schema. OpenAI have disallowed additional properties by default in their API, which differs from developer expectations.

11:42 OpenAI say explicit is better, so developers have to pass additionalProperties: false in their schemas.

11:43 All properties are required by default - no support for optional parameters (although they can be made nullable). Developers need to follow this rule in the JSON Schemas they send over as well.

11:43 Fields are generated in the same order that you defined them in the schema, even though JSON is supposed to ignore key order. This ensures you can implement things like chain-of-thought by adding those keys in the correct order in your schema design.

11:45 Looks like the documentation for the new Realtime API is now available.

11:46 This session didn't present any new features - they were all in the documentation already - but the insight into how the Structured Output works under the hood was new.

12:01 Next up: Powerful small models with distillation, with John Allard and Steven Heidel.

12:02 Distillation "allows you to create powerful, small models". They'll talk about why it matters, how it works and best practices and use cases - plus demos of the two new API platform features they are launching today.

12:02 Once you get an AI app working, the next step is figuring out how to get it to work at scale. You care about uptime, rate limits, latency and cost.

12:04 GPT-4o is 15x more expensive than GPT-4o mini, but it brings a large amount of additional "knowledge" - graduate level physics etc. It excels at the toughest knowledge benchmarks. Do you need that type of intelligence for your application?

12:06 Distillation: you fine-tune the small output on the outputs of the large model. You're compressing some of the capabilities of the large model into that smaller model.

12:07 Distillation involves three steps. The first and most important is to build task-specific evals for your application. You can't skip this step, because you can't improve what you can't measure.

12:07 The second step is to capture examples of what good performance looks like. Store example completions from a large model like GPT-4o and create a dataset.

12:08 The final step is the fine-tuning. Teach the small model how to replicate the responses from the large model by showing it many of those captured examples. We're trying to "compress the intelligence" of the large model into that small model.

12:09 A lot of people have done distillation on the OpenAI platform before, using the existing fine-tuning mechanism. Doing it that way is a lot of work though.

12:10 The two new features they are launching today will make distillation easier. The first is stored completions: a new parameter to the chat completions API that will let you opt-in to storing the full input and output to the model. You can apply tags as well, to help filter those later to create datasets for fine-tuning. {"store:" true}

12:11 The second feature is a beta of an Evals product. This should allow you to do distillation end-to-end on the OpenAI platform.

12:11 Real-world use-case based on the Superhuman email app. That app has a "quick reply" feature that suggests options for a reply based on reading through the existing thread. How would you scale that feature to hundreds of millions of emails?

12:13 Aside: here's openai/openai-realtime-api-beta with example code for talking to the new Realtime API using JavaScript. openai/openai-realtime-console is an example React app.

12:14 Using the Python client library for client.chat.completions.create() you can add store=True, metadata={"tag": "test-set"} to store a prompt/response and add it to a tag.

12:14 A new UI at platform.openai.com/chat-completions lets you browse through your stored completions.

12:15 Then in the new /evalutions/create interface you can add testing criteria and use that to create a new evaluation. (I don't have access to that page yet.)

12:17 Having created an eval it's easy to run that against other models - try it against GPT-4o mini and compare that with GPT-4o for example.

12:20

Screenshot of a web interface showing evaluation results for an AI model named 'quick-reply-2-4o'. The interface displays a table with columns for messages, output, and three evaluation metrics: 'repliesToRightPerson', 'repliesToMostPressingIssue', and 'repliesMakeSense'. The table shows 8 rows of data, each representing a different conversation. Overall metrics at the top indicate 95%, 91%, and 97% success rates for the three evaluation criteria respectively. The interface appears to be part of a platform called 'Distillation Test' in a 'DevDay Demo' project.

12:21 ... and now a demonstration of the fine-tuning UI, showing how a fine-tuned GPT-4o mini model on that data performs much better than 4o-mini on its own.

12:23 Is distillation right for your use-cases? That comes down to task generality against required precision. Great use-cases for distillation are tasks that cover a relatively narrow domain and have a relatively low precision requirement - a great fit for small models.

12:23 Tasks that have high precision needs but narrow generality work well too - that's a lot of forms of categorization. You may need a larger and more diverse data set to get that to work well. Same for broad generality and low precision.

12:24 Tasks with a broad generality and high need for precision are a poor fit for distillation - they need a full-powered large model.

12:25 Things to watch out for: Unevenly distributed or biased data points. Your training data should match the patterns of your production data. Also sparse examples which may result in blind spots in your data. A great example is fraud - if it's rare you might find that 1,000 samples have no instances of fraud at all!

12:26 Part of the value of distillation is you don't necessarily need human generated data or responses - but that doesn't mean you don't need to actively curate your distillation dataset. "We tend to see distillation work best with the order of thousands rather than millions of examples."

12:27 Finally, take an iterative approach. Fine-tuning might not work on your first try - there are many variables to consider. It's important to start small with a few hundred examples and scale up once you know it's working based on your evals. Don't jump straight to millions of data points.

12:28 It strikes me that fine-tuning and distillation are strategically a great way of keeping people locked to one platform - if you build an application purely on top of prompt engineering it's much easier to swap between different LLM vendors than if you have fine-tuned a model.

12:29 They expect that it will become common for applications to be built using a collection of many different distilled small models, with a few large models for tasks that don't work well for distillation.

12:30 ... and now lunch - sessions resume at 2pm.

12:32 The system I built for this live blog is very simple - just fetch() calls polling an endpoint and updating a <div> using innerHTML - but the endpoint itself sets a 10s cache so Cloudflare should only let a hit through to the underlying app every 10s no matter how many people are viewing the page.

13:27 I've upgraded this live blog (with the help of GPT-4o) - it no longer refreshes the entire updates section (since that means any selected text is un-selected), instead appending new updates to the existing HTML. I've also added a toggle to switch between a display order of most-recent or least-recent first.

14:01 Multimodel apps with the Realtime API - Jordan Sitkin, API Capabilities and Katia Gil Guzman, Developer Experience

14:04 Building multimodal apps right now involves wiring together several different components: Whisper, then GPT-4, then a TTS model for output. This makes it hard to build "fluid conversational experiences that feel life-like".

14:04 The new Realtime API means GPT-4o can handle all of this as a single component - audio input, processing and then audio output.

14:06 The focus for this first release of the Realtime API is speech, text and function calling.

14:08 First, a demo of an app built the old way - with Whisper and then GPT-4 and then TTS output. It's clearly not real-time enough for the experience to be worthwhile.

14:08 Next a demo of the Realtime API, which feels much more responsive. It's effectively the same experience as using the ChatGPT app with the new voice mode.

14:10 The Realtime API exposes a new endpoint that provides a WebSockets connection for your application. You can exchange JSON messages containing a mix of text, audio and function calls.

14:13 The example code demonstrates connecting directly to the API with a WebSocket, though that's not recommended for most apps as it exposes the OpenAI API key in the source code. Audio data is encoded is base64 and sent as JSON.

14:14

Screenshot of some code

14:16 It's also possible to implement interruptions using the API.

More code

14:17 It's very neat that it's possible to connect to the API and implement full voice mode using just Vanilla JavaScript with no extra dependencies (albeit with an exposed API key) - but that's not how most implementations are likely to work.

14:19 Katia used o1 to help build a 3D visualization of the solar system, then added voice mode to answer questions like "how many planets are there in the solar system" (it tried and failed to display a bar chart, which was unintended and didn't quite work). Then "I'm curious about Earth" caused the visualization to zoom in on Earth while speaking out loud about the planet.

14:20 This is a very cool demo.

14:21 It's using a display_data tool for additional rendering of charts on the visualization.

14:21

3D render of Earth

14:22 One more demo, this time "Where is the ISS right now" could rotate Earth to show the ISS, based on a function call that retrieves the real current position of the ISS.

14:24 And a neat little show_moons() tool which zooms in on a planet and highlights its moons.

14:26 The Realtime API starts in public beta today, and is currently rolling out. It's going to be $5/1m tokens for input and ... I missed the rest of the pricing, they skipped the slide forward.

14:26 S2S = Speech to Speech.

14:28 Pricing is up on the pricing page. $5/m input and $20/m autput for text, $100/m input and $200/m output for audio. A note says that "Audio input costs approximately 6¢ per minute; Audio output costs approximately 24¢ per minute".

14:30 Various attendees at DevDay have tried and failed to access the Realtime API (myself included) - from talking to OpenAI staff it sounds like it's still rolling out.

14:42 I'm switching tracks - I'm now in OpenAI Research: Building with o1 with Jason Wei and Hyung Won Chung (starting in 18 minutes).

14:43 Here's what my live blogging interface looks like - I use the Django admin to add new "live update" items attached to an entry, which show up on the entry page a few seconds later.

Two browser windows next to each other, on the left is the Django admin adding a live update item  with a content field and associated with an entry ID, on the right is my blog entry which updates live