100x Defect Tolerance: How Cerebras Solved the Yield Problem (via) I learned a bunch about how chip manufacture works from this piece where Cerebras reveal some notes about how they manufacture chips that are 56x physically larger than NVIDIA's H100.
The key idea here is core redundancy: designing a chip such that if there are defects the end-product is still useful. This has been a technique for decades:
For example in 2006 Intel released the Intel Core Duo – a chip with two CPU cores. If one core was faulty, it was disabled and the product was sold as an Intel Core Solo. Nvidia, AMD, and others all embraced this core-level redundancy in the coming years.
Modern GPUs are deliberately designed with redundant cores: the H100 needs 132 but the wafer contains 144, so up to 12 can be defective without the chip failing.
Cerebras designed their monster (look at the size of this thing) with absolutely tiny cores: "approximately 0.05mm2" - with the whole chip needing 900,000 enabled cores out of the 970,000 total. This allows 93% of the silicon area to stay active in the finished chip, a notably high proportion.
Recent articles
- My AI/LLM predictions for the next 1, 3 and 6 years, for Oxide and Friends - 10th January 2025
- Weeknotes: Starting 2025 a little slow - 4th January 2025
- I still don't think companies serve you ads based on spying through your microphone - 2nd January 2025