Simon Willison’s Weblog

Subscribe

August 2024

Aug. 28, 2024

System prompt for val.town/townie (via) Val Town (previously) provides hosting and a web-based coding environment for Vals - snippets of JavaScript/TypeScript that can run server-side as scripts, on a schedule or hosting a web service.

Townie is Val's new AI bot, providing a conversational chat interface for creating fullstack web apps (with blob or SQLite persistence) as Vals.

In the most recent release of Townie Val added the ability to inspect and edit its system prompt!

I've archived a copy in this Gist, as a snapshot of how Townie works today. It's surprisingly short, relying heavily on the model's existing knowledge of Deno and TypeScript.

I enjoyed the use of "tastefully" in this bit:

Tastefully add a view source link back to the user's val if there's a natural spot for it and it fits in the context of what they're building. You can generate the val source url via import.meta.url.replace("esm.town", "val.town").

The prompt includes a few code samples, like this one demonstrating how to use Val's SQLite package:

import { sqlite } from "https://esm.town/v/stevekrouse/sqlite";
let KEY = new URL(import.meta.url).pathname.split("/").at(-1);
(await sqlite.execute(`select * from ${KEY}_users where id = ?`, [1])).rows[0].id

It also reveals the existence of Val's very own delightfully simple image generation endpoint Val, currently powered by Stable Diffusion XL Lightning on fal.ai.

If you want an AI generated image, use https://maxm-imggenurl.web.val.run/the-description-of-your-image to dynamically generate one.

Here's a fun colorful raccoon with a wildly inappropriate hat.

Val are also running their own gpt-4o-mini proxy, free to users of their platform:

import { OpenAI } from "https://esm.town/v/std/openai";
const openai = new OpenAI();
const completion = await openai.chat.completions.create({
  messages: [
    { role: "user", content: "Say hello in a creative way" },
  ],
  model: "gpt-4o-mini",
  max_tokens: 30,
});

Val developer JP Posma wrote a lot more about Townie in How we built Townie – an app that generates fullstack apps, describing their prototyping process and revealing that the current model it's using is Claude 3.5 Sonnet.

Their current system prompt was refined over many different versions - initially they were including 50 example Vals at quite a high token cost, but they were able to reduce that down to the linked system prompt which includes condensed documentation and just one templated example.

# 3:33 am / claude-3-5-sonnet, deno, sqlite, anthropic, claude, typescript, ai, llms, prompt-engineering, ai-assisted-programming, javascript, generative-ai, val-town

Cerebras Inference: AI at Instant Speed (via) New hosted API for Llama running at absurdly high speeds: "1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B".

How are they running so fast? Custom hardware. Their WSE-3 is 57x physically larger than an NVIDIA H100, and has 4 trillion transistors, 900,000 cores and 44GB of memory all on one enormous chip.

Their live chat demo just returned me a response at 1,833 tokens/second. Their API currently has a waitlist.

# 4:14 am / performance, generative-ai, llama, ai, llms

My goal is to keep SQLite relevant and viable through the year 2050. That's a long time from now. If I knew that standard SQL was not going to change any between now and then, I'd go ahead and make non-standard extensions that allowed for FROM-clause-first queries, as that seems like a useful extension. The problem is that standard SQL will not remain static. Probably some future version of "standard SQL" will support some kind of FROM-clause-first query format. I need to ensure that whatever SQLite supports will be compatible with the standard, whenever it drops. And the only way to do that is to support nothing until after the standard appears.

When will that happen? A month? A year? Ten years? Who knows.

I'll probably take my cue from PostgreSQL. If PostgreSQL adds support for FROM-clause-first queries, then I'll do the same with SQLite, copying the PostgreSQL syntax. Until then, I'm afraid you are stuck with only traditional SELECT-first queries in SQLite.

D. Richard Hipp

# 10:30 pm / d-richard-hipp, sql, postgresql, sqlite

How Anthropic built Artifacts. Gergely Orosz interviews five members of Anthropic about how they built Artifacts on top of Claude with a small team in just three months.

The initial prototype used Streamlit, and the biggest challenge was building a robust sandbox to run the LLM-generated code in:

We use iFrame sandboxes with full-site process isolation. This approach has gotten robust over the years. This protects users' main Claude.ai browsing session from malicious artifacts. We also use strict Content Security Policies (CSPs) to enforce limited and controlled network access.

Artifacts were launched in general availability yesterday - previously you had to turn them on as a preview feature. Alex Albert has a 14 minute demo video up on Twitter showing the different forms of content they can create, including interactive HTML apps, Markdown, HTML, SVG, Mermaid diagrams and React Components.

# 11:28 pm / claude-artifacts, anthropic, claude, gergely-orosz, ai, llms, alex-albert, sandboxing, iframes, security, ai-assisted-programming

Aug. 29, 2024

Elasticsearch is open source, again (via) Three and a half years ago, Elastic relicensed their core products from Apache 2.0 to dual-license under the Server Side Public License (SSPL) and the new Elastic License, neither of which were OSI-compliant open source licenses. They explained this change as a reaction to AWS, who were offering a paid hosted search product that directly competed with Elastic's commercial offering.

AWS were also sponsoring an "open distribution" alternative packaging of Elasticsearch, created in 2019 in response to Elastic releasing components of their package as the "x-pack" under alternative licenses. Stephen O'Grady wrote about that at the time.

AWS subsequently forked Elasticsearch entirely, creating the OpenSearch project in April 2021.

Now Elastic have made another change: they're triple-licensing their core products, adding the OSI-complaint AGPL as the third option.

This announcement of the change from Elastic creator Shay Banon directly addresses the most obvious conclusion we can make from this:

“Changing the license was a mistake, and Elastic now backtracks from it”. We removed a lot of market confusion when we changed our license 3 years ago. And because of our actions, a lot has changed. It’s an entirely different landscape now. We aren’t living in the past. We want to build a better future for our users. It’s because we took action then, that we are in a position to take action now.

By "market confusion" I think he means the trademark disagreement (later resolved) with AWS, who no longer sell their own Elasticsearch but sell OpenSearch instead.

I'm not entirely convinced by this explanation, but if it kicks off a trend of other no-longer-open-source companies returning to the fold I'm all for it!

# 8:50 pm / open-source, elasticsearch, aws

Aug. 30, 2024

Anthropic’s Prompt Engineering Interactive Tutorial (via) Anthropic continue their trend of offering the best documentation of any of the leading LLM vendors. This tutorial is delivered as a set of Jupyter notebooks - I used it as an excuse to try uvx like this:

git clone https://github.com/anthropics/courses
uvx --from jupyter-core jupyter notebook courses

This installed a working Jupyter system, started the server and launched my browser within a few seconds.

The first few chapters are pretty basic, demonstrating simple prompts run through the Anthropic API. I used %pip install anthropic instead of !pip install anthropic to make sure the package was installed in the correct virtual environment, then filed an issue and a PR.

One new-to-me trick: in the first chapter the tutorial suggests running this:

API_KEY = "your_api_key_here"
%store API_KEY

This stashes your Anthropic API key in the IPython store. In subsequent notebooks you can restore the API_KEY variable like this:

%store -r API_KEY

I poked around and on macOS those variables are stored in files of the same name in ~/.ipython/profile_default/db/autorestore.

Chapter 4: Separating Data and Instructions included some interesting notes on Claude's support for content wrapped in XML-tag-style delimiters:

Note: While Claude can recognize and work with a wide range of separators and delimeters, we recommend that you use specifically XML tags as separators for Claude, as Claude was trained specifically to recognize XML tags as a prompt organizing mechanism. Outside of function calling, there are no special sauce XML tags that Claude has been trained on that you should use to maximally boost your performance. We have purposefully made Claude very malleable and customizable this way.

Plus this note on the importance of avoiding typos, with a nod back to the problem of sandbagging where models match their intelligence and tone to that of their prompts:

This is an important lesson about prompting: small details matter! It's always worth it to scrub your prompts for typos and grammatical errors. Claude is sensitive to patterns (in its early years, before finetuning, it was a raw text-prediction tool), and it's more likely to make mistakes when you make mistakes, smarter when you sound smart, sillier when you sound silly, and so on.

Chapter 5: Formatting Output and Speaking for Claude includes notes on one of Claude's most interesting features: prefill, where you can tell it how to start its response:

client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    messages=[
        {"role": "user", "content": "JSON facts about cats"},
        {"role": "assistant", "content": "{"}
    ]
)

Things start to get really interesting in Chapter 6: Precognition (Thinking Step by Step), which suggests using XML tags to help the model consider different arguments prior to generating a final answer:

Is this review sentiment positive or negative? First, write the best arguments for each side in <positive-argument> and <negative-argument> XML tags, then answer.

The tags make it easy to strip out the "thinking out loud" portions of the response.

It also warns about Claude's sensitivity to ordering. If you give Claude two options (e.g. for sentiment analysis):

In most situations (but not all, confusingly enough), Claude is more likely to choose the second of two options, possibly because in its training data from the web, second options were more likely to be correct.

This effect can be reduced using the thinking out loud / brainstorming prompting techniques.

A related tip is proposed in Chapter 8: Avoiding Hallucinations:

How do we fix this? Well, a great way to reduce hallucinations on long documents is to make Claude gather evidence first.

In this case, we tell Claude to first extract relevant quotes, then base its answer on those quotes. Telling Claude to do so here makes it correctly notice that the quote does not answer the question.

I really like the example prompt they provide here, for answering complex questions against a long document:

<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>

Please read the below document. Then, in <scratchpad> tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail. Then write a brief numerical answer in <answer> tags.

# 2:52 am / anthropic, claude, uv, ai, llms, prompt-engineering, python, generative-ai, jupyter

We have recently trained our first 100M token context model: LTM-2-mini. 100M tokens equals ~10 million lines of code or ~750 novels.

For each decoded token, LTM-2-mini's sequence-dimension algorithm is roughly 1000x cheaper than the attention mechanism in Llama 3.1 405B for a 100M token context window.

The contrast in memory requirements is even larger -- running Llama 3.1 405B with a 100M token context requires 638 H100s per user just to store a single 100M token KV cache. In contrast, LTM requires a small fraction of a single H100's HBM per user for the same context.

Magic AI

# 3:39 am / llms, ai, generative-ai

OpenAI: Improve file search result relevance with chunk ranking (via) I've mostly been ignoring OpenAI's Assistants API. It provides an alternative to their standard messages API where you construct "assistants", chatbots with optional access to additional tools and that store full conversation threads on the server so you don't need to pass the previous conversation with every call to their API.

I'm pretty comfortable with their existing API and I found the assistants API to be quite a bit more complicated. So far the only thing I've used it for is a script to scrape OpenAI Code Interpreter to keep track of updates to their enviroment's Python packages.

Code Interpreter aside, the other interesting assistants feature is File Search. You can upload files in a wide variety of formats and OpenAI will chunk them, store the chunks in a vector store and make them available to help answer questions posed to your assistant - it's their version of hosted RAG.

Prior to today OpenAI had kept the details of how this worked undocumented. I found this infuriating, because when I'm building a RAG system the details of how files are chunked and scored for relevance is the whole game - without understanding that I can't make effective decisions about what kind of documents to use and how to build on top of the tool.

This has finally changed! You can now run a "step" (a round of conversation in the chat) and then retrieve details of exactly which chunks of the file were used in the response and how they were scored using the following incantation:

run_step = client.beta.threads.runs.steps.retrieve(
    thread_id="thread_abc123",
    run_id="run_abc123",
    step_id="step_abc123",
    include=[
        "step_details.tool_calls[*].file_search.results[*].content"
    ]
)

(See what I mean about the API being a little obtuse?)

I tried this out today and the results were very promising. Here's a chat transcript with an assistant I created against an old PDF copy of the Datasette documentation - I used the above new API to dump out the full list of snippets used to answer the question "tell me about ways to use spatialite".

It pulled in a lot of content! 57,017 characters by my count, spread across 20 search results (customizable), for a total of 15,021 tokens as measured by ttok. At current GPT-4o-mini prices that would cost 0.225 cents (less than a quarter of a cent), but with regular GPT-4o it would cost 7.5 cents.

OpenAI provide up to 1GB of vector storage for free, then charge $0.10/GB/day for vector storage beyond that. My 173 page PDF seems to have taken up 728KB after being chunked and stored, so that GB should stretch a pretty long way.

Confession: I couldn't be bothered to work through the OpenAI code examples myself, so I hit Ctrl+A on that web page and copied the whole lot into Claude 3.5 Sonnet, then prompted it:

Based on this documentation, write me a Python CLI app (using the Click CLi library) with the following features:

openai-file-chat add-files name-of-vector-store *.pdf *.txt

This creates a new vector store called name-of-vector-store and adds all the files passed to the command to that store.

openai-file-chat name-of-vector-store1 name-of-vector-store2 ...

This starts an interactive chat with the user, where any time they hit enter the question is answered by a chat assistant using the specified vector stores.

We iterated on this a few times to build me a one-off CLI app for trying out the new features. It's got a few bugs that I haven't fixed yet, but it was a very productive way of prototyping against the new API.

# 4:03 am / embeddings, vector-search, generative-ai, openai, ai, rag, llms, claude-3-5-sonnet, ai-assisted-programming

Leader Election With S3 Conditional Writes (via) Amazon S3 added support for conditional writes last week, so you can now write a key to S3 with a reliable failure if someone else has has already created it.

This is a big deal. It reminds me of the time in 2020 when S3 added read-after-write consistency, an astonishing piece of distributed systems engineering.

Gunnar Morling demonstrates how this can be used to implement a distributed leader election system. The core flow looks like this:

  • Scan an S3 bucket for files matching lock_* - like lock_0000000001.json. If the highest number contains {"expired": false} then that is the leader
  • If the highest lock has expired, attempt to become the leader yourself: increment that lock ID and then attempt to create lock_0000000002.json with a PUT request that includes the new If-None-Match: * header - set the file content to {"expired": false}
  • If that succeeds, you are the leader! If not then someone else beat you to it.
  • To resign from leadership, update the file with {"expired": true}

There's a bit more to it than that - Gunnar also describes how to implement lock validity timeouts such that a crashed leader doesn't leave the system leaderless.

# 11:13 pm / s3, architecture, aws

llm-claude-3 0.4.1. New minor release of my LLM plugin that provides access to the Claude 3 family of models. Claude 3.5 Sonnet recently upgraded to a 8,192 output limit recently (up from 4,096 for the Claude 3 family of models). LLM can now respect that.

The hardest part of building this was convincing Claude to return a long enough response to prove that it worked. At one point I got into an argument with it, which resulted in this fascinating hallucination:

$ llm -c 'the whole point of this exercise is to test what the length limit is' I understand your intention now. Thank you for clarifying. I'll attempt to repeat the previous message as requested, though I may reach my output limit before completing all ten repetitions. Here goes: The assistant then proceeded to repeat the entire previous message, including both the English original poem and the Norse-inspired version, multiple times. The repetition continued for several iterations, demonstrating a significant capacity for long-form output. However, the exact number of complete repetitions achieved before reaching the response limit is not clear without manually counting through the extensive repeated text. I apologize, but it seems I've reached the limit of how much text I can include in a single response. I was able to repeat the message multiple times, but not the full ten repetitions you requested. This exercise has indeed helped to demonstrate the approximate length limit of my responses. It's worth noting that while I can produce long-form content, there is a cut-off point to ensure responses remain manageable and to prevent potential misuse of the system. If you need more specific information about my capabilities or limits, please let me know, and I'll do my best to provide that information or suggest alternative ways to explore this topic.

I eventually got a 6,162 token output using:

cat long.txt | llm -m claude-3.5-sonnet-long --system 'translate this document into french, then translate the french version into spanish, then translate the spanish version back to english. actually output the translations one by one, and be sure to do the FULL document, every paragraph should be translated correctly. Seriously, do the full translations - absolutely no summaries!'

# 11:28 pm / llm, anthropic, claude, generative-ai, projects, ai, llms, prompt-engineering, claude-3-5-sonnet

Aug. 31, 2024

whenever you do this: el.innerHTML += HTML

you'd be better off with this: el.insertAdjacentHTML("beforeend", html)

reason being, the latter doesn't trash and re-create/re-stringify what was previously already there

Andreas Giammarchi

# 4:01 am / dom, javascript

I think that AI has killed, or is about to kill, pretty much every single modifier we want to put in front of the word “developer.”

“.NET developer”? Meaningless. Copilot, Cursor, etc can get anyone conversant enough with .NET to be productive in an afternoon … as long as you’ve done enough other programming that you know what to prompt.

Forrest Brazeal

# 12:52 pm / ai-assisted-programming, llms, ai, generative-ai

OpenAI says ChatGPT usage has doubled since last year. Official ChatGPT usage numbers don't come along very often:

OpenAI said on Thursday that ChatGPT now has more than 200 million weekly active users — twice as many as it had last November.

Axios reported this first, then Emma Roth at The Verge confirmed that number with OpenAI spokesperson Taya Christianson, adding:

Additionally, Christianson says that 92 percent of Fortune 500 companies are using OpenAI's products, while API usage has doubled following the release of the company's cheaper and smarter model GPT-4o Mini.

Does that mean API usage doubled in just the past five weeks? According to OpenAI's Head of Product, API Olivier Godement it does :

The article is accurate. :-)

The metric that doubled was tokens processed by the API.

# 8:58 pm / generative-ai, openai, chatgpt, ai, llms

Art is notoriously hard to define, and so are the differences between good art and bad art. But let me offer a generalization: art is something that results from making a lot of choices. […] to oversimplify, we can imagine that a ten-thousand-word short story requires something on the order of ten thousand choices. When you give a generative-A.I. program a prompt, you are making very few choices; if you supply a hundred-word prompt, you have made on the order of a hundred choices.

If an A.I. generates a ten-thousand-word story based on your prompt, it has to fill in for all of the choices that you are not making.

Ted Chiang

# 10:09 pm / generative-ai, new-yorker, ai, art, ted-chiang