Simon Willison’s Weblog

Subscribe

Whisper large-v3-turbo model. It’s OpenAI DevDay today. Last year they released a whole stack of new features, including GPT-4 vision and GPTs and their text-to-speech API, so I’m intrigued to see what they release today (I’ll be at the San Francisco event).

Looks like they got an early start on the releases, with the first new Whisper model since November 2023.

Whisper Turbo is a new speech-to-text model that fits the continued trend of distilled models getting smaller and faster while maintaining the same quality as larger models.

large-v3-turbo is 809M parameters - slightly larger than the 769M medium but significantly smaller than the 1550M large. OpenAI claim its 8x faster than large and requires 6GB of VRAM compared to 10GB for the larger model.

The model file is a 1.6GB download. OpenAI continue to make Whisper (both code and model weights) available under the MIT license.

It’s already supported in both Hugging Face transformers - live demo here - and in mlx-whisper on Apple Silicon, via Awni Hannun:

import mlx_whisper
print(mlx_whisper.transcribe(
  "path/to/audio",
  path_or_hf_repo="mlx-community/whisper-turbo"
)["text"])

Awni reports:

Transcribes 12 minutes in 14 seconds on an M2 Ultra (~50X faster than real time).