Simon Willison’s Weblog

ospeak: a CLI tool for speaking text in the terminal via OpenAI

7th November 2023

I attended OpenAI DevDay today, the first OpenAI developer conference. It was a lot. They released a bewildering array of new API tools, which I’m only just beginning to wade through and fully understand.

My preferred way to understand a new API is to build something with it, and in my experience the easiest and fastest things to build are usually CLI utilities.

I’ve been enjoying the new ChatGPT voice interface a lot, so I was delighted to see that OpenAI today released a text-to-speech API that uses the same model.

My first new tool is ospeak, a CLI utility for piping text through that API.

ospeak

You can install ospeak like this. I’ve only tested it on macOS, but it might well work on Linux and Windows as well:

pipx install ospeak

Since it uses the OpenAI API you’ll need an API key. You can either pass that directly to the tool:

ospeak "Hello there" --token="sk-..."

Or you can set it as an environment variable so you don’t have to enter it multiple times:

export OPENAI_API_KEY=sk-...
ospeak "Hello there"

Now you can call it and your computer will speak whatever you pass to it!

ospeak "This is really quite a convincing voice"

OpenAI currently have six voices: alloy, echo, fable, onyx, nova and shimmer. The command defaults to alloy, but you can specify another voice by passing -v/--voice:

ospeak "This is a different voice" -v nova 

If you pass the special value -v all it will say the same thing in each voice, prefixing with the name of the voice:

ospeak "This is a demonstration of my voice." -v all

Here’s a recording of the output from that:

You can also set the speed—from 0.25 (four times slower than normal) to 4.0 (four times faster). I find 2x is fast but still understandable:

ospeak "This is a fast voice" --speed 2.0

Finally, you can save the output to a .mp3 or .wav file instead of speaking it through the speakers, using the -o/--output option:

ospeak "This is saved to a file" -o output.mp3

That’s pretty much all there is to it. There are a few more details in the README.

The source code was adapted from an example in OpenAI’s documentation.
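
The underlying call is to OpenAI’s audio speech endpoint. Here’s a rough curl sketch of that call, based on the documented API—the text and output filename here are just placeholders:

curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "alloy", "input": "Hello there"}' \
  --output output.mp3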

The real fun is when you combine it with llm, to pipe output from a language model directly into the tool. Here’s how to have your computer give a passionate speech about why you should care about pelicans:

llm -m gpt-4-turbo \
  "A short passionate speech about why you should care about pelicans" \
  | ospeak -v nova

Here’s what that gave me (transcript here):

I thoroughly enjoy how using text-to-speech like this genuinely elevates an otherwise unexciting piece of output from an LLM. This speech engine really is very impressive.

LLM 0.12 for gpt-4-turbo

I upgraded LLM to support the newly released GPT-4 Turbo model—an impressive beast which is 1/3 the price of GPT-4 (technically 3x cheaper for input tokens and 2x cheaper for output) and supports a huge 128,000 token context window, up from 8,000 tokens for regular GPT-4.

You can try that out like so:

pipx install llm
llm keys set openai
# Paste OpenAI API key here
llm -m gpt-4-turbo "Ten great names for a pet walrus"
# Or a shortcut:
llm -m 4t "Ten great names for a pet walrus"

Here’s a one-liner that summarizes the Hacker News discussion about today’s OpenAI announcements using the new model (and taking advantage of its much longer token limit):

curl -s "https://hn.algolia.com/api/v1/items/38166420" | \
  jq -r 'recurse(.children[]) | .author + ": " + .text' | \
  llm -m gpt-4-turbo 'Summarize the themes of the opinions expressed here,
  including direct quotes in quote markers (with author attribution) for each theme.
  Fix HTML entities. Output markdown. Go long.'

Example output here. I adapted that from my Claude 2 version, but I found I had to adjust the prompt a bit to get GPT-4 Turbo to output quotes in the manner I wanted.

I also added support for a new -o seed 1 option for the OpenAI models, which passes a seed integer that more-or-less results in reproducible outputs—another new feature announced today.
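
For example, this should give you (more or less) the same list of walrus names on every run:

llm -m gpt-4-turbo "Ten great names for a pet walrus" -o seed 1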

So much more to explore

I’ve honestly hardly even begun to dig into the things that were released today. A few of the other highlights:

  • GPT-4 vision! You can now pass images to the GPT-4 API, in the same way as ChatGPT has supported for the past few weeks. I have so many things I want to build on top of this.
  • JSON mode: both GPT-3.5 Turbo and GPT-4 Turbo can now reliably produce valid JSON output. Previously they could produce JSON but would occasionally make mistakes—this mode makes mistakes impossible by altering the token stream as it is being produced (similar to Llama.cpp grammars). There’s a curl sketch of this after the list.
  • Function calling got some big upgrades, the most important of which is that the model can now ask you to execute multiple functions in parallel.
  • Assistants. This is the big one. You can now define custom GPTs (effectively a custom system prompt, set of function calls and collection of documents for use with Retrieval Augmented Generation) using the ChatGPT interface or via the API, then share those with other people... or use them directly via the API. This makes building simple RAG systems trivial, and you can also enable both Code Interpreter and Bing Browse mode as part of your new assistant. It’s a huge recipe for prompt injection, but it also cuts out a lot of the work involved in building a custom chatbot.
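
To give a flavour of JSON mode, here’s a rough curl sketch against the chat completions API. The model name (gpt-4-1106-preview was the API identifier for GPT-4 Turbo at launch) and the prompt are just illustrative, and the endpoint expects the word "JSON" to appear somewhere in your messages:

curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4-1106-preview",
    "response_format": {"type": "json_object"},
    "messages": [
      {"role": "system", "content": "Reply with a JSON object."},
      {"role": "user", "content": "List three pelican facts"}
    ]
  }' | jq -r '.choices[0].message.content'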

Honestly today was pretty overwhelming. I think it’s going to take us all months to fully understand the new capabilities we have around the OpenAI family of models.

It also feels like a whole bunch of my potential future side projects just dropped from several weeks of work to several hours.