Cerebras Inference: AI at Instant Speed (via) New hosted API for Llama running at absurdly high speeds: "1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B".
How are they running so fast? Custom hardware. Their WSE-3 is 57x physically larger than an NVIDIA H100, and has 4 trillion transistors, 900,000 cores and 44GB of memory all on one enormous chip.
Their live chat demo just returned me a response at 1,833 tokens/second. Their API currently has a waitlist.
Recent articles
- LLM 0.27, the annotated release notes: GPT-5 and improved tool calling - 11th August 2025
- Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!" - 10th August 2025
- My Lethal Trifecta talk at the Bay Area AI Security Meetup - 9th August 2025