Moshi (via) Moshi is "a speech-text foundation model and full-duplex spoken dialogue framework". It's effectively a text-to-text model - like an LLM but you input audio directly to it and it replies with its own audio.
It's fun to play around with, but it's not particularly useful in comparison to other pure text models: I tried to talk to it about California Brown Pelicans and it gave me some very basic hallucinated thoughts about California Condors instead.
It's very easy to run locally, at least on a Mac (and likely on other systems too). I used uv
and got the 8 bit quantized version running as a local web server using this one-liner:
uv run --with moshi_mlx python -m moshi_mlx.local_web -q 8
That downloads ~8.17G of model to a folder in ~/.cache/huggingface/hub/
- or you can use -q 4
and get a 4.81G version instead (albeit even lower quality).
Recent articles
- How often do LLMs snitch? Recreating Theo's SnitchBench with LLM - 31st May 2025
- Talking AI and jobs with Natasha Zouves for News Nation - 30th May 2025
- Large Language Models can run tools in your terminal with LLM 0.26 - 27th May 2025