Simon Willison’s Weblog


Sunday, 31st March 2024

Your AI Product Needs Evals (via) Hamel Husain: “I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.”

I’ve been frustrated about this for a while: I know I need to move beyond “vibe checks” for the systems I have started to build on top of LLMs, but I was lacking a thorough guide about how to build automated (and manual) evals in a productive way.

Hamel has provided exactly the tutorial I was needing for this, with a really thorough example case-study.

Using GPT-4 to create test cases is an interesting approach: “Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.”

Also important: “... unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision.”

Hamel’s guide then covers the importance of traces for evaluating real-world performance of your deployed application, plus the pros and cons of leaning on automated evaluation using LLMs themselves.

Plus some wisdom from a footnote: “A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.” # 9:53 pm

No one wants to build a product on a model that makes things up. The core problem is that GenAI models are not information retrieval systems. They are synthesizing systems, with no ability to discern from the data it’s trained on unless significant guardrails are put in place.

Rumman Chowdhury # 9:20 pm

Optimizing SQLite for servers (via) Sylvain Kerkour’s comprehensive set of lessons learned running SQLite for server-based applications.

There’s a lot of useful stuff in here, including detailed coverage of the different recommended PRAGMA settings.

There was also a tip I haven’t seen before about “BEGIN IMMEDIATE” transactions:

“By default, SQLite starts transactions in DEFERRED mode: they are considered read only. They are upgraded to a write transaction that requires a database lock in-flight, when query containing a write/update/delete statement is issued.

The problem is that by upgrading a transaction after it has started, SQLite will immediately return a SQLITE_BUSY error without respecting the busy_timeout previously mentioned, if the database is already locked by another connection.

This is why you should start your transactions with BEGIN IMMEDIATE instead of only BEGIN. If the database is locked when the transaction starts, SQLite will respect busy_timeout.” # 8:16 pm

llm-nomic-api-embed. My new plugin for LLM which adds API access to the Nomic series of embedding models. Nomic models can be run locally too, which makes them a great long-term commitment as there’s no risk of the models being retired in a way that damages the value of your previously calculated embedding vectors. # 3:17 pm