Simon Willison’s Weblog


Tuesday, 18th June 2024

Anthropic release notes (via) Anthropic have started publishing release notes! Currently available for their API and their apps (mobile and web).

What I'd really like to see are release notes for the models themselves, though as far as I can tell there haven't been any updates to those since the Claude 3 models were first released (the Haiku model name in the API is still claude-3-haiku-20240307 and Anthropic say they'll change that identifier after any updates to the model).

# 4:25 pm / ai, generative-ai, llms, anthropic, claude, alex-albert

Claude: Building evals and test cases. More documentation updates from Anthropic: this section on writing evals for Claude is new today and includes Python code examples for a number of different evaluation techniques.

Included are several examples of the LLM-as-judge pattern, plus an example using cosine similarity and another that uses the new-to-me Rouge Python library that implements the ROUGE metric for evaluating the quality of summarized text.

# 4:28 pm / ai, prompt-engineering, generative-ai, llms, anthropic, claude

Tags with descriptions. Tiny new feature on my blog: I can now add optional descriptions to my tag pages, for example on datasette and sqliteutils and promptinjection.

I built this feature on a live call this morning as an unplanned demonstration of GitHub's new Copilot Workspace feature, where you can run a prompt against a repository and have it plan, implement and file a pull request implementing a change to the code.

My prompt was:

Add a feature that lets me add a description to my tag pages, stored in the database table for tags and visible on the /tags/x/ page at the top

It wasn't as compelling a demo as I expected: Copilot Workspace currently has to stream an entire copy of each file it modifies, which can take a long time if your codebase includes several large files that need to be changed.

It did create a working implementation on its first try, though I had given it an extra tip not to forget the database migration. I ended up making a bunch of changes myself before I shipped it, listed in the pull request.

I've been using Copilot Workspace quite a bit recently as a code explanation tool - I'll prompt it to e.g. "add architecture documentation to the README" on a random repository not owned by me, then read its initial plan to see what it's figured out without going all the way through to the implementation and PR phases. Example in this tweet where I figured out the rough design of the Jina AI Reader API for this post.

# 4:50 pm / blogging, github, projects, ai, generative-ai, llms