Posts tagged ai, python in 2024

Filters: Year: 2024 × ai × python × Sorted by date

32 results page 1 / 2 next »

Open WebUI. I tried out this open source (MIT licensed, JavaScript and Python) localhost UI for accessing LLMs today for the first time. It's very nicely done.

I ran it with uvx like this:

uvx --python 3.11 open-webui serve

On first launch it installed a bunch of dependencies and then downloaded 903MB to ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2 - a copy of the all-MiniLM-L6-v2 embedding model, presumably for its RAG feature.

It then presented me with a working Llama 3.2:3b chat interface, which surprised me because I hadn't spotted it downloading that model. It turns out that was because I have Ollama running on my laptop already (with several models, including Llama 3.2:3b, already installed) - and Open WebUI automatically detected Ollama and gave me access to a list of available models.

I found a "knowledge" section and added all of the Datasette documentation (by dropping in the .rst files from the docs) - and now I can type # in chat to search for a file, add that to the context and then ask questions about it directly.

I selected the spatialite.rst.txt file, prompted it with "How do I use SpatiaLite with Datasette" and got back this:

That's honestly a very solid answer, especially considering the Llama 3.2 3B model from Ollama is just a 1.9GB file! It's impressive how well that model can handle basic Q&A and summarization against text provided to it - it somehow has a 128,000 token context size.

Open WebUI has a lot of other tricks up its sleeve: it can talk to API models such as OpenAI directly, has optional integrations with web search and custom tools and logs every interaction to a SQLite database. It also comes with extensive documentation.

# 27th December 2024, 1:38 am / python, sqlite, ai, generative-ai, llama, local-llms, llms, rag, uv, ollama

Trying out QvQ—Qwen’s new visual reasoning model

I thought we were done for major model releases in 2024, but apparently not: Alibaba’s Qwen team just dropped the ~~Apache 2.0 licensed~~ Qwen licensed (the license changed) QvQ-72B-Preview, “an experimental research model focusing on enhancing visual reasoning capabilities”.

[... 1,838 words]

8:49 pm / 24th December 2024 / python, ai, generative-ai, local-llms, llms, hugging-face, vision-llms, uv, qwen, mlx, llm-reasoning, llm-release, prince-canuma

Finally, a replacement for BERT: Introducing ModernBERT (via) BERT was an early language model released by Google in October 2018. Unlike modern LLMs it wasn't designed for generating text. BERT was trained for masked token prediction and was generally applied to problems like Named Entity Recognition or Sentiment Analysis. BERT also wasn't very useful on its own - most applications required you to fine-tune a model on top of it.

In exploring BERT I decided to try out dslim/distilbert-NER, a popular Named Entity Recognition model fine-tuned on top of DistilBERT (a smaller distilled version of the original BERT model). Here are my notes on running that using uv run.

Jeremy Howard's Answer.AI research group, LightOn and friends supported the development of ModernBERT, a brand new BERT-style model that applies many enhancements from the past six years of advances in this space.

While BERT was trained on 3.3 billion tokens, producing 110 million and 340 million parameter models, ModernBERT trained on 2 trillion tokens, resulting in 140 million and 395 million parameter models. The parameter count hasn't increased much because it's designed to run on lower-end hardware. It has a 8192 token context length, a significant improvement on BERT's 512.

I was able to run one of the demos from the announcement post using uv run like this (I'm not sure why I had to use numpy<2.0 but without that I got an error about cannot import name 'ComplexWarning' from 'numpy.core.numeric'):

uv run --with 'numpy<2.0' --with torch --with 'git+https://github.com/huggingface/transformers.git' python

Then this Python:

import torch
from transformers import pipeline
from pprint import pprint
pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)
input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)

Which downloaded 573MB to ~/.cache/huggingface/hub/models--answerdotai--ModernBERT-base and output:

[{'score': 0.11669921875,
  'sequence': 'He walked to the door.',
  'token': 3369,
  'token_str': ' door'},
 {'score': 0.037841796875,
  'sequence': 'He walked to the office.',
  'token': 3906,
  'token_str': ' office'},
 {'score': 0.0277099609375,
  'sequence': 'He walked to the library.',
  'token': 6335,
  'token_str': ' library'},
 {'score': 0.0216064453125,
  'sequence': 'He walked to the gate.',
  'token': 7394,
  'token_str': ' gate'},
 {'score': 0.020263671875,
  'sequence': 'He walked to the window.',
  'token': 3497,
  'token_str': ' window'}]

I'm looking forward to trying out models that use ModernBERT as their base. The model release is accompanied by a paper (Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference) and new documentation for using it with the Transformers library.

# 24th December 2024, 6:21 am / bert, nlp, python, transformers, ai, jeremy-howard, hugging-face, uv

Building Python tools with a one-shot prompt using uv run and Claude Projects

I’ve written a lot about how I’ve been using Claude to build one-shot HTML+JavaScript applications via Claude Artifacts. I recently started using a similar pattern to create one-shot Python utilities, using a custom Claude Project combined with the dependency management capabilities of uv.

[... 899 words]

7 am / 19th December 2024 / aws, cli, python, s3, ai, prompt-engineering, generative-ai, llms, ai-assisted-programming, claude, claude-artifacts, uv

Phi-4 Technical Report (via) Phi-4 is the latest LLM from Microsoft Research. It has 14B parameters and claims to be a big leap forward in the overall Phi series. From Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning:

Phi-4 outperforms comparable and larger models on math related reasoning due to advancements throughout the processes, including the use of high-quality synthetic datasets, curation of high-quality organic data, and post-training innovations. Phi-4 continues to push the frontier of size vs quality.

The model is currently available via Azure AI Foundry. I couldn't figure out how to access it there, but Microsoft are planning to release it via Hugging Face in the next few days. It's not yet clear what license they'll use - hopefully MIT, as used by the previous models in the series.

In the meantime, unofficial GGUF versions have shown up on Hugging Face already. I got one of the matteogeniaccio/phi-4 GGUFs working with my LLM tool and llm-gguf plugin like this:

llm install llm-gguf
llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
llm chat -m gguf/phi-4-Q4_K_M

This downloaded a 8.4GB model file. Here are some initial logged transcripts I gathered from playing around with the model.

An interesting detail I spotted on the Azure AI Foundry page is this:

Limited Scope for Code: Majority of phi-4 training data is based in Python and uses common packages such as typing, math, random, collections, datetime, itertools. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.

This leads into the most interesting thing about this model: the way it was trained on synthetic data. The technical report has a lot of detail about this, including this note about why synthetic data can provide better guidance to a model:

Synthetic data as a substantial component of pretraining is becoming increasingly common, and the Phi series of models has consistently emphasized the importance of synthetic data. Rather than serving as a cheap substitute for organic data, synthetic data has several direct advantages over organic data.

Structured and Gradual Learning. In organic datasets, the relationship between tokens is often complex and indirect. Many reasoning steps may be required to connect the current token to the next, making it challenging for the model to learn effectively from next-token prediction. By contrast, each token generated by a language model is by definition predicted by the preceding tokens, making it easier for a model to follow the resulting reasoning patterns.

And this section about their approach for generating that data:

Our approach to generating synthetic data for phi-4 is guided by the following principles:

Diversity: The data should comprehensively cover subtopics and skills within each domain. This requires curating diverse seeds from organic sources.

Nuance and Complexity: Effective training requires nuanced, non-trivial examples that reflect the complexity and the richness of the domain. Data must go beyond basics to include edge cases and advanced examples.

Accuracy: Code should execute correctly, proofs should be valid, and explanations should adhere to established knowledge, etc.

Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner. [...]

We created 50 broad types of synthetic datasets, each one relying on a different set of seeds and different multi-stage prompting procedure, spanning an array of topics, skills, and natures of interaction, accumulating to a total of about 400B unweighted tokens. [...]

Question Datasets: A large set of questions was collected from websites, forums, and Q&A platforms. These questions were then filtered using a plurality-based technique to balance difficulty. Specifically, we generated multiple independent answers for each question and applied majority voting to assess the consistency of responses. We discarded questions where all answers agreed (indicating the question was too easy) or where answers were entirely inconsistent (indicating the question was too difficult or ambiguous). [...]

Creating Question-Answer pairs from Diverse Sources: Another technique we use for seed curation involves leveraging language models to extract question-answer pairs from organic sources such as books, scientific papers, and code.

# 15th December 2024, 11:58 pm / microsoft, python, ai, generative-ai, llms, ai-assisted-programming, llm, phi, training-data, llm-release

googleapis/python-genai. Google released this brand new Python library for accessing their generative AI models yesterday, offering an alternative to their existing generative-ai-python library.

The API design looks very solid to me, and it includes both sync and async implementations. Here's an async streaming response:

async for response in client.aio.models.generate_content_stream(
    model='gemini-2.0-flash-exp',
    contents='Tell me a story in 300 words.'
):
    print(response.text)

It also includes Pydantic-based output schema support and some nice syntactic sugar for defining tools using Python functions.

# 12th December 2024, 4:21 pm / async, google, python, ai, generative-ai, llms, gemini, llm-tool-use, pydantic

ChatGPT Canvas can make API requests now, but it’s complicated

Today’s 12 Days of OpenAI release concerned ChatGPT Canvas, a new ChatGPT feature that enables ChatGPT to pop open a side panel with a shared editor in it where you can collaborate with ChatGPT on editing a document or writing code.

[... 1,116 words]

9:49 pm / 10th December 2024 / python, security, usability, ai, webassembly, pyodide, openai, prompt-injection, generative-ai, chatgpt, llms, claude-artifacts, cors

I can now run a GPT-4 class model on my laptop

Meta’s new Llama 3.3 70B is a genuinely GPT-4 class Large Language Model that runs on my laptop.

[... 2,905 words]

3:08 pm / 9th December 2024 / python, ai, generative-ai, llama, gpt-4, local-llms, llms, ai-assisted-programming, llm, meta, uv, mlx, ollama, pelican-riding-a-bicycle

SmolVLM—small yet mighty Vision Language Model. I've been having fun playing with this new vision model from the Hugging Face team behind SmolLM. They describe it as:

[...] a 2B VLM, SOTA for its memory footprint. SmolVLM is small, fast, memory-efficient, and fully open-source. All model checkpoints, VLM datasets, training recipes and tools are released under the Apache 2.0 license.

I've tried it in a few flavours but my favourite so far is the mlx-vlm approach, via mlx-vlm author Prince Canuma. Here's the uv recipe I'm using to run it:

uv run \
  --with mlx-vlm \
  --with torch \
  python -m mlx_vlm.generate \
    --model mlx-community/SmolVLM-Instruct-bf16 \
    --max-tokens 500 \
    --temp 0.5 \
    --prompt "Describe this image in detail" \
    --image IMG_4414.JPG

If you run into an error using Python 3.13 (torch compatibility) try uv run --python 3.11 instead.

This one-liner installs the necessary dependencies, downloads the model (about 4.2GB, saved to ~/.cache/huggingface/hub/models--mlx-community--SmolVLM-Instruct-bf16) and executes the prompt and displays the result.

I ran that against this Pelican photo:

A glorious pelican on some rocks, two other pelicans are visible plus some other birds

The model replied:

In the foreground of this photograph, a pelican is perched on a pile of rocks. The pelican’s wings are spread out, and its beak is open. There is a small bird standing on the rocks in front of the pelican. The bird has its head cocked to one side, and it seems to be looking at the pelican. To the left of the pelican is another bird, and behind the pelican are some other birds. The rocks in the background of the image are gray, and they are covered with a variety of textures. The rocks in the background appear to be wet from either rain or sea spray.

There are a few spatial mistakes in that description but the vibes are generally in the right direction.

On my 64GB M2 MacBook pro it read the prompt at 7.831 tokens/second and generated that response at an impressive 74.765 tokens/second.

# 28th November 2024, 8:29 pm / python, ai, generative-ai, local-llms, llms, vision-llms, uv, mlx, smollm, llm-release, prince-canuma

Introducing the Model Context Protocol (via) Interesting new initiative from Anthropic. The Model Context Protocol aims to provide a standard interface for LLMs to interact with other applications, allowing applications to expose tools, resources (contant that you might want to dump into your context) and parameterized prompts that can be used by the models.

Their first working version of this involves the Claude Desktop app (for macOS and Windows). You can now configure that app to run additional "servers" - processes that the app runs and then communicates with via JSON-RPC over standard input and standard output.

Each server can present a list of tools, resources and prompts to the model. The model can then make further calls to the server to request information or execute one of those tools.

(For full transparency: I got a preview of this last week, so I've had a few days to try it out.)

The best way to understand this all is to dig into the examples. There are 13 of these in the modelcontextprotocol/servers GitHub repository so far, some using the Typesscript SDK and some with the Python SDK (mcp on PyPI).

My favourite so far, unsurprisingly, is the sqlite one. This implements methods for Claude to execute read and write queries and create tables in a SQLite database file on your local computer.

This is clearly an early release: the process for enabling servers in Claude Desktop - which involves hand-editing a JSON configuration file - is pretty clunky, and currently the desktop app and running extra servers on your own machine is the only way to try this out.

The specification already describes the next step for this: an HTTP SSE protocol which will allow Claude (and any other software that implements the protocol) to communicate with external HTTP servers. Hopefully this means that MCP will come to the Claude web and mobile apps soon as well.

A couple of early preview partners have announced their MCP implementations already:

Cody supports additional context through Anthropic's Model Context Protocol
The Context Outside the Code is the Zed editor's announcement of their MCP extensions.

# 25th November 2024, 6:48 pm / python, sqlite, ai, generative-ai, llms, anthropic, claude, alex-albert, model-context-protocol

open-interpreter (via) This "natural language interface for computers" open source ChatGPT Code Interpreter alternative has been around for a while, but today I finally got around to trying it out.

Here's how I ran it (without first installing anything) using uv:

uvx --from open-interpreter interpreter

The default mode asks you for an OpenAI API key so it can use gpt-4o - there are a multitude of other options, including the ability to use local models with interpreter --local.

It runs in your terminal and works by generating Python code to help answer your questions, asking your permission to run it and then executing it directly on your computer.

I pasted in an API key and then prompted it with this:

find largest files on my desktop

Here's the full transcript.

Since code is run directly on your machine there are all sorts of ways things could go wrong if you don't carefully review the generated code before hitting "y". The team have an experimental safe mode in development which works by scanning generated code with semgrep. I'm not convinced by that approach, I think executing code in a sandbox would be a much more robust solution here - but sandboxing Python is still a very difficult problem.

They do at least have an experimental Docker integration.

# 24th November 2024, 6:29 pm / python, sandboxing, ai, docker, openai, generative-ai, llms, ai-assisted-programming, code-interpreter, uv, coding-agents

llm-gemini 0.4. New release of my llm-gemini plugin, adding support for asynchronous models (see LLM 0.18), plus the new gemini-exp-1114 model (currently at the top of the Chatbot Arena) and a -o json_object 1 option to force JSON output.

I also released llm-claude-3 0.9 which adds asynchronous support for the Claude family of models.

# 18th November 2024, 7:37 am / async, google, plugins, projects, python, ai, generative-ai, llms, llm, anthropic, claude, gemini

LLM 0.18. New release of LLM. The big new feature is asynchronous model support - you can now use supported models in async Python code like this:

import llm

model = llm.get_async_model("gpt-4o")
async for chunk in model.prompt(
    "Five surprising names for a pet pelican"
):
    print(chunk, end="", flush=True)

Also new in this release: support for sending audio attachments to OpenAI's gpt-4o-audio-preview model.

# 17th November 2024, 8:40 pm / async, projects, python, ai, generative-ai, llms, llm

ChainForge. I'm still on the hunt for good options for running evaluations against prompts. ChainForge offers an interesting approach, calling itself "an open-source visual programming environment for prompt engineering".

The interface is one of those boxes-and-lines visual programming tools, which reminds me of Yahoo Pipes.

It's open source (from a team at Harvard) and written in Python, which means you can run a local copy instantly via uvx like this:

uvx chainforge serve

You can then configure it with API keys to various providers (OpenAI worked for me, Anthropic models returned JSON parsing errors due to a 500 page from the ChainForge proxy) and start trying it out.

The "Add Node" menu shows the full list of capabilities.

The JavaScript and Python evaluation blocks are particularly interesting: the JavaScript one runs outside of a sandbox using plain eval(), while the Python one still runs in your browser but uses Pyodide in a Web Worker.

# 8th November 2024, 8:52 pm / javascript, python, ai, pyodide, prompt-engineering, prompt-injection, generative-ai, llms, evals, uv

yet-another-applied-llm-benchmark. Nicholas Carlini introduced this personal LLM benchmark suite back in February as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against the kinds of tasks he uses them for.

There are two defining features of this benchmark that make it interesting. Most importantly, I've implemented a simple dataflow domain specific language to make it easy for me (or anyone else!) to add new tests that realistically evaluate model capabilities. This DSL allows for specifying both how the question should be asked and also how the answer should be evaluated. [...] And then, directly as a result of this, I've written nearly 100 tests for different situations I've actually encountered when working with LLMs as assistants

The DSL he's using is fascinating. Here's an example:

"Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \
    VisionLLMRun("What flag is shown in this image?") >> \
    (SubstringEvaluator("United States") | SubstringEvaluator("USA")))

This triggers an LLM to execute the prompt asking for a C program that renders an American Flag, runs that through a C compiler and interpreter (executed in a Docker container), then passes the output of that to a vision model to guess the flag and checks that it returns a string containing "United States" or "USA".

The DSL itself is implemented entirely in Python, using the __rshift__ magic method for >> and __rrshift__ to enable strings to be piped into a custom object using "command to run" >> LLMRunNode.

# 6th November 2024, 8 pm / dsl, python, ai, generative-ai, llms, evals, nicholas-carlini

Docling. MIT licensed document extraction Python library from the Deep Search team at IBM, who released Docling v2 on October 16th.

Here's the Docling Technical Report paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.

Those models are available on Hugging Face.

Here's how to try out the Docling CLI interface using uvx (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):

uvx docling mydoc.pdf --to json --to md

This will output a mydoc.json file with complex layout information and a mydoc.md Markdown file which includes Markdown tables where appropriate.

The Python API is a lot more comprehensive. It can even extract tables as Pandas DataFrames:

from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)

I ran that inside uv run --with docling python. It took a little while to run, but it demonstrated that the library works.

# 3rd November 2024, 4:57 am / cli, ibm, ocr, pdf, python, ai, hugging-face, uv

Running Llama 3.2 Vision and Phi-3.5 Vision on a Mac with mistral.rs

mistral.rs is an LLM inference library written in Rust by Eric Buehler. Today I figured out how to use it to run the Llama 3.2 Vision and Phi-3.5 Vision models on my Mac.

[... 1,231 words]

4:14 pm / 19th October 2024 / microsoft, python, ai, rust, generative-ai, llama, local-llms, llms, mistral, phi, vision-llms, meta

An LLM TDD loop (via) Super neat demo by David Winterbottom, who wrapped my LLM and files-to-prompt tools in a short Bash script that can be fed a file full of Python unit tests and an empty implementation file and will then iterate on that file in a loop until the tests pass.

# 13th October 2024, 7:37 pm / python, ai, pytest, generative-ai, llms, ai-assisted-programming, llm, files-to-prompt

marimo v0.9.0 with mo.ui.chat. The latest release of the Marimo Python reactive notebook project includes a neat new feature: you can now easily embed a custom chat interface directly inside of your notebook.

Marimo co-founder Myles Scolnick posted this intriguing demo on Twitter, demonstrating a chat interface to my LLM library “in only 3 lines of code”:

import marimo as mo
import llm

model = llm.get_model()
conversation = model.conversation()
mo.ui.chat(lambda messages: conversation.prompt(messages[-1].content))

I tried that out today - here’s the result:

marimo.ui.chat() takes a function which is passed a list of Marimo chat messages (representing the current state of that widget) and returns a string - or other type of renderable object - to add as the next message in the chat. This makes it trivial to hook in any custom chat mechanism you like.

Marimo also ship their own built-in chat handlers for OpenAI, Anthropic and Google Gemini which you can use like this:

mo.ui.chat(
    mo.ai.llm.anthropic(
        "claude-3-5-sonnet-20240620",
        system_message="You are a helpful assistant.",
        api_key="sk-ant-...",
    ),
    show_configuration_controls=True
)

# 5th October 2024, 10:59 pm / python, ai, generative-ai, llms, llm, marimo

Conflating Overture Places Using DuckDB, Ollama, Embeddings, and More. Drew Breunig's detailed tutorial on "conflation" - combining different geospatial data sources by de-duplicating address strings such as RESTAURANT LOS ARCOS,3359 FOOTHILL BLVD,OAKLAND,94601 and LOS ARCOS TAQUERIA,3359 FOOTHILL BLVD,OAKLAND,94601.

Drew uses an entirely offline stack based around Python, DuckDB and Ollama and finds that a combination of H3 geospatial tiles and mxbai-embed-large embeddings (though other embedding models should work equally well) gets really good results.

# 30th September 2024, 5:24 pm / gis, python, ai, duckdb, embeddings, drew-breunig, overture, ollama

mlx-vlm (via) The MLX ecosystem of libraries for running machine learning models on Apple Silicon continues to expand. Prince Canuma is actively developing this library for running vision models such as Qwen-2 VL and Pixtral and LLaVA using Python running on a Mac.

I used uv to run it against this image with this shell one-liner:

uv run --with mlx-vlm \
  python -m mlx_vlm.generate \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --max-tokens 1000 \
  --temp 0.0 \
  --image https://static.simonwillison.net/static/2024/django-roadmap.png \
  --prompt "Describe image in detail, include all text"

The --image option works equally well with a URL or a path to a local file on disk.

Django roadmap diagram, described below

This first downloaded 4.1GB to my ~/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct folder and then output this result, which starts:

The image is a horizontal timeline chart that represents the release dates of various software versions. The timeline is divided into years from 2023 to 2029, with each year represented by a vertical line. The chart includes a legend at the bottom, which distinguishes between different types of software versions.

Legend

Mainstream Support:

4.2 (2023)

5.0 (2024)

5.1 (2025)

5.2 (2026)

6.0 (2027) [...]

# 29th September 2024, 9:38 pm / apple, python, ai, generative-ai, local-llms, llms, vision-llms, uv, qwen, mlx, prince-canuma

Anthropic’s Prompt Engineering Interactive Tutorial (via) Anthropic continue their trend of offering the best documentation of any of the leading LLM vendors. This tutorial is delivered as a set of Jupyter notebooks - I used it as an excuse to try uvx like this:

git clone https://github.com/anthropics/courses
uvx --from jupyter-core jupyter notebook courses

This installed a working Jupyter system, started the server and launched my browser within a few seconds.

The first few chapters are pretty basic, demonstrating simple prompts run through the Anthropic API. I used %pip install anthropic instead of !pip install anthropic to make sure the package was installed in the correct virtual environment, then filed an issue and a PR.

One new-to-me trick: in the first chapter the tutorial suggests running this:

API_KEY = "your_api_key_here"
%store API_KEY

This stashes your Anthropic API key in the IPython store. In subsequent notebooks you can restore the API_KEY variable like this:

%store -r API_KEY

I poked around and on macOS those variables are stored in files of the same name in ~/.ipython/profile_default/db/autorestore.

Chapter 4: Separating Data and Instructions included some interesting notes on Claude's support for content wrapped in XML-tag-style delimiters:

Note: While Claude can recognize and work with a wide range of separators and delimeters, we recommend that you use specifically XML tags as separators for Claude, as Claude was trained specifically to recognize XML tags as a prompt organizing mechanism. Outside of function calling, there are no special sauce XML tags that Claude has been trained on that you should use to maximally boost your performance. We have purposefully made Claude very malleable and customizable this way.

Plus this note on the importance of avoiding typos, with a nod back to the problem of sandbagging where models match their intelligence and tone to that of their prompts:

This is an important lesson about prompting: small details matter! It's always worth it to scrub your prompts for typos and grammatical errors. Claude is sensitive to patterns (in its early years, before finetuning, it was a raw text-prediction tool), and it's more likely to make mistakes when you make mistakes, smarter when you sound smart, sillier when you sound silly, and so on.

Chapter 5: Formatting Output and Speaking for Claude includes notes on one of Claude's most interesting features: prefill, where you can tell it how to start its response:

client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    messages=[
        {"role": "user", "content": "JSON facts about cats"},
        {"role": "assistant", "content": "{"}
    ]
)

Things start to get really interesting in Chapter 6: Precognition (Thinking Step by Step), which suggests using XML tags to help the model consider different arguments prior to generating a final answer:

Is this review sentiment positive or negative? First, write the best arguments for each side in <positive-argument> and <negative-argument> XML tags, then answer.

The tags make it easy to strip out the "thinking out loud" portions of the response.

It also warns about Claude's sensitivity to ordering. If you give Claude two options (e.g. for sentiment analysis):

In most situations (but not all, confusingly enough), Claude is more likely to choose the second of two options, possibly because in its training data from the web, second options were more likely to be correct.

This effect can be reduced using the thinking out loud / brainstorming prompting techniques.

A related tip is proposed in Chapter 8: Avoiding Hallucinations:

How do we fix this? Well, a great way to reduce hallucinations on long documents is to make Claude gather evidence first.

In this case, we tell Claude to first extract relevant quotes, then base its answer on those quotes. Telling Claude to do so here makes it correctly notice that the quote does not answer the question.

I really like the example prompt they provide here, for answering complex questions against a long document:

<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>

Please read the below document. Then, in <scratchpad> tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail. Then write a brief numerical answer in <answer> tags.

# 30th August 2024, 2:52 am / python, xml, ai, jupyter, prompt-engineering, generative-ai, llms, anthropic, claude, uv

mlx-whisper (via) Apple's MLX framework for running GPU-accelerated machine learning models on Apple Silicon keeps growing new examples. mlx-whisper is a Python package for running OpenAI's Whisper speech-to-text model. It's really easy to use:

pip install mlx-whisper

Then in a Python console:

>>> import mlx_whisper
>>> result = mlx_whisper.transcribe(
...    "/tmp/recording.mp3",
...     path_or_hf_repo="mlx-community/distil-whisper-large-v3")
.gitattributes: 100%|███████████| 1.52k/1.52k [00:00<00:00, 4.46MB/s]
config.json: 100%|██████████████| 268/268 [00:00<00:00, 843kB/s]
README.md: 100%|████████████████| 332/332 [00:00<00:00, 1.95MB/s]
Fetching 4 files:  50%|████▌    | 2/4 [00:01<00:01,  1.26it/s]
weights.npz:  63%|██████████  ▎ | 944M/1.51G [02:41<02:15, 4.17MB/s]
>>> result.keys()
dict_keys(['text', 'segments', 'language'])
>>> result['language']
'en'
>>> len(result['text'])
100105
>>> print(result['text'][:3000])
 This is so exciting. I have to tell you, first of all ...

Here's Activity Monitor confirming that the Python process is using the GPU for the transcription:

python3.10 is using 549% CPU, 44.20 CPU time, 9 threads, 90.8% GPU, 42.53 GPU time

This example downloaded a 1.5GB model from Hugging Face and stashed it in my ~/.cache/huggingface/hub/models--mlx-community--distil-whisper-large-v3 folder.

Calling .transcribe(filepath) without the path_or_hf_repo argument uses the much smaller (74.4 MB) whisper-tiny-mlx model.

A few people asked how this compares to whisper.cpp. Bill Mill compared the two and found mlx-whisper to be about 3x faster on an M1 Max.

Update: this note from Josh Marshall:

That '3x' comparison isn't fair; completely different models. I ran a test (14" M1 Pro) with the full (non-distilled) large-v2 model quantised to 8 bit (which is my pick), and whisper.cpp was 1m vs 1m36 for mlx-whisper.

Then later:

I've now done a better test, using the MLK audio, multiple runs and 2 models (distil-large-v3, large-v2-8bit)... and mlx-whisper is indeed 30-40% faster

# 13th August 2024, 4:15 pm / apple, python, ai, openai, whisper, mlx

django-http-debug, a new Django app mostly written by Claude

Yesterday I finally developed something I’ve been casually thinking about building for a long time: django-http-debug. It’s a reusable Django app—something you can pip install into any Django project—which provides tools for quickly setting up a URL that returns a canned HTTP response and logs the full details of any incoming request to a database table.

[... 2,692 words]

3:26 pm / 8th August 2024 / django, django-admin, projects, python, webhooks, ai, generative-ai, llms, ai-assisted-programming, anthropic, claude, claude-3-5-sonnet

Aider. Aider is an impressive open source local coding chat assistant terminal application, developed by Paul Gauthier (founding CTO of Inktomi back in 1996-2000).

I tried it out today, using an Anthropic API key to run it using Claude 3.5 Sonnet:

pipx install aider-chat
export ANTHROPIC_API_KEY=api-key-here
aider --dark-mode

I found the --dark-mode flag necessary to make it legible using the macOS terminal "Pro" theme.

Aider starts by generating a concise map of files in your current Git repository. This is passed to the LLM along with the prompts that you type, and Aider can then request additional files be added to that context - or you can add the manually with the /add filename command.

It defaults to making modifications to files and then committing them directly to Git with a generated commit message. I found myself preferring the /ask command which lets you ask a question without making any file modifications:

The Aider documentation includes extensive examples and the tool can work with a wide range of different LLMs, though it recommends GPT-4o, Claude 3.5 Sonnet (or 3 Opus) and DeepSeek Coder V2 for the best results. Aider maintains its own leaderboard, emphasizing that "Aider works best with LLMs which are good at editing code, not just good at writing code".

The prompts it uses are pretty fascinating - they're tucked away in various *_prompts.py files in aider/coders.

# 31st July 2024, 3:26 am / python, ai, generative-ai, llms, ai-assisted-programming, claude-3-5-sonnet, aider, paul-gauthier

pip install GPT (via) I've been uploading wheel files to ChatGPT in order to install them into Code Interpreter for a while now. Nico Ritschel built a better way: this GPT can download wheels directly from PyPI and then install them.

I didn't think this was possible, since Code Interpreter is blocked from making outbound network requests.

Nico's trick uses a new-to-me feature of GPT Actions: you can return up to ten files from an action call and ChatGPT will download those files to the same disk volume that Code Interpreter can access.

Nico wired up a Val Town endpoint that can divide a PyPI wheel into multiple 9.5MB files (if necessary) to fit the file size limit for files returned to a GPT, then uses prompts to tell ChatGPT to combine the resulting files and treat them as installable wheels.

# 21st July 2024, 5:54 am / pypi, python, ai, generative-ai, chatgpt, llms, code-interpreter, coding-agents

Imitation Intelligence, my keynote for PyCon US 2024

I gave an invited keynote at PyCon US 2024 in Pittsburgh this year. My goal was to say some interesting things about AI—specifically about Large Language Models—both to help catch people up who may not have been paying close attention, but also to give people who were paying close attention some new things to think about.

[... 10,624 words]

4:59 am / 14th July 2024 / pycon, python, my-talks, ai, generative-ai, llms, annotated-talks, chatbot-arena

mistralai/mistral-common. New from Mistral: mistral-common, an open source Python library providing "a set of tools to help you work with Mistral models".

So far that means a tokenizer! This is similar to OpenAI's tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use - useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.

Mistral's library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI's APIs you're currently left guessing how many tokens are taken up by these advanced features.

Anthropic haven't published any form of tokenizer at all - it's the feature I'd most like to see from them next.

Here's how to explore the vocabulary of the tokenizer:

MistralTokenizer.from_model(
    "open-mixtral-8x22b"
).instruct_tokenizer.tokenizer.vocab()[:12]

['<unk>', '<s>', '</s>', '[INST]', '[/INST]', '[TOOL_CALLS]', '[AVAILABLE_TOOLS]', '[/AVAILABLE_TOOLS]', '[TOOL_RESULTS]', '[/TOOL_RESULTS]']

# 18th April 2024, 12:39 am / python, ai, openai, prompt-engineering, generative-ai, llms, anthropic, mistral, rag, tokenization

llm-claude-3. I built a new plugin for LLM—my command-line tool and Python library for interacting with Large Language Models—which adds support for the new Claude 3 models from Anthropic.

# 4th March 2024, 6:46 pm / projects, python, ai, generative-ai, llms, llm, anthropic, claude

The Zen of Python, Unix, and LLMs. Here’s the YouTube recording of my 1.5 hour conversation with Hugo Bowne-Anderson yesterday.

I fed a Whisper transcript to Google Gemini Pro 1.5 and asked it for the themes from our conversation, and it said we talked about “Python’s success and versatility, the rise and potential of LLMs, data sharing and ethics in the age of LLMs, Unix philosophy and its influence on software development and the future of programming and human-computer interaction”.

# 29th February 2024, 9:04 pm / python, speaking, my-talks, ai, whisper, llms, gemini

page 1 / 2 next »

Simon Willison’s Weblog