August 2025
4 posts: 1 link, 1 quote, 2 notes
Aug. 1, 2025
Gemini Deep Think, our SOTA model with parallel thinking that won the IMO Gold Medal 🥇, is now available in the Gemini App for Ultra subscribers!! [...]
Quick correction: this is a variation of our IMO gold model that is faster and more optimized for daily use! We are also giving the IMO gold full model to a set of mathematicians to test the value of the full capabilities.
— Logan Kilpatrick, announcing Gemini Deep Think
This morning I sent out the third edition of my LLM digest newsletter for my $10/month and higher sponsors on GitHub. It included the following section headers:
- Claude Code
- Model releases in July
- Gold medal performances in the IMO
- Reverse engineering system prompts
- Tools I'm using at the moment
The newsletter is a condensed summary of highlights from the past month of my blog. I published 98 posts in July - the concept for the newsletter is that you can pay me for the version that only takes 10 minutes to read!
Here are the newsletters I sent out for June 2025 and May 2025, if you want a taste of what you'll be getting as a sponsor. New sponsors instantly get access to the archive of previous newsletters, including the one I sent this morning.
Update: I also sent out my much longer, more frequent and free weekly-ish newsletter - this edition covers just the last three days because there's been so much going on. That one is entirely copy-and-pasted from my blog so if you read me via feeds you'll have seen it all already.
Deep Think in the Gemini app (via) Google released Gemini 2.5 Deep Think this morning, exclusively to their Ultra ($250/month) subscribers:
It is a variation of the model that recently achieved the gold-medal standard at this year's International Mathematical Olympiad (IMO). While that model takes hours to reason about complex math problems, today's release is faster and more usable day-to-day, while still reaching Bronze-level performance on the 2025 IMO benchmark, based on internal evaluations.
Google describe Deep Think's architecture like this:
Just as people tackle complex problems by taking the time to explore different angles, weigh potential solutions, and refine a final answer, Deep Think pushes the frontier of thinking capabilities by using parallel thinking techniques. This approach lets Gemini generate many ideas at once and consider them simultaneously, even revising or combining different ideas over time, before arriving at the best answer.
This approach sounds a little similar to the llm-consortium plugin by Thomas Hughes - see this video from January's Datasette Public Office Hours.
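Google haven't published implementation details, but here's a minimal sketch of that general "generate many ideas, then combine them" pattern, assuming an OpenAI-compatible client - the model names and function names here are my placeholders, not anything from Google or the plugin:

```python
# Hypothetical sketch of the parallel-thinking pattern: sample several
# candidate answers concurrently, then ask a judge model to synthesize them.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def candidate(question: str) -> str:
    # High temperature to encourage diverse candidate answers
    response = await client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
        temperature=1.0,
    )
    return response.choices[0].message.content

async def parallel_think(question: str, n: int = 4) -> str:
    # Generate n independent candidates concurrently
    candidates = await asyncio.gather(*(candidate(question) for _ in range(n)))
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    # Ask a judge model to revise and combine them into one final answer
    judge = await client.chat.completions.create(
        model="gpt-4.1",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n{numbered}\n\n"
                "Combine the best ideas from these candidates into a single answer."
            ),
        }],
    )
    return judge.choices[0].message.content

print(asyncio.run(parallel_think("What is the sum of the first 100 primes?")))
```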
I don't have an Ultra account, but thankfully nickandbro on Hacker News tried "Create a svg of a pelican riding on a bicycle" (a very slight modification of my prompt, which uses "Generate an SVG") and got back a very solid result:
The bicycle is the right shape, and this is one of the few results I've seen for this prompt where the bird is very clearly a pelican thanks to the shape of its beak.
There are more details on Deep Think in the Gemini 2.5 Deep Think Model Card (PDF). Some highlights from that document:
- 1 million token input window, accepting text, images, audio, and video.
- Text output up to 192,000 tokens.
- Training ran on TPUs and used JAX and ML Pathways.
- "We additionally trained Gemini 2.5 Deep Think on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data, and we also provided access to a curated corpus of high-quality solutions to mathematics problems."
- Knowledge cutoff is January 2025.
Two interesting examples of inference speed as a flagship feature of LLM services today.
First, Cerebras announced two new monthly plans for their extremely high speed hosted model service: Cerebras Code Pro ($50/month, 1,000 messages a day) and Cerebras Code Max ($200/month, 5,000/day). The model they are selling here is Qwen's Qwen3-Coder-480B-A35B-Instruct, likely the best available open weights coding model right now and one that was released just ten days ago. Ten days from model release to third-party subscription service feels like some kind of record.
Cerebras claim they can serve the model at an astonishing 2,000 tokens per second - four times the speed of Claude Sonnet 4 in their demo video.
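If you want to sanity-check claims like that yourself, Cerebras exposes an OpenAI-compatible API, so a crude tokens-per-second measurement is a few lines of Python. The base URL is real but the model ID is my guess - check their docs:

```python
# Rough tokens-per-second measurement against an OpenAI-compatible endpoint.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.monotonic()
response = client.chat.completions.create(
    model="qwen-3-coder-480b",  # assumed model ID
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
elapsed = time.monotonic() - start

tokens = response.usage.completion_tokens
# This includes network latency and time-to-first-token,
# so it will understate the peak generation speed
print(f"{tokens} tokens in {elapsed:.2f}s = {tokens / elapsed:.0f} tok/s")
```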
Also today, Moonshot announced a new hosted version of their trillion parameter Kimi K2 model called kimi-k2-turbo-preview:
🆕 Say hello to kimi-k2-turbo-preview
Same model. Same context. NOW 4× FASTER.
⚡️ From 10 tok/s to 40 tok/s.
💰 Limited-Time Launch Price (50% off until Sept 1)
- $0.30 / million input tokens (cache hit)
- $1.20 / million input tokens (cache miss)
- $5.00 / million output tokens
👉 Explore more: platform.moonshot.ai
This is twice the price of their regular model for 4x the speed (increasing to 4x the price in September). No details yet on how they achieved the speed-up.
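For a back-of-envelope comparison, here's the pricing arithmetic implied by those numbers - the regular rates below are just the launch prices divided by two, per the "twice the price" claim above:

```python
# Turbo launch pricing is 2x the regular kimi-k2 rates, rising to 4x on Sept 1
regular = {"input_hit": 0.15, "input_miss": 0.60, "output": 2.50}  # $/M tokens
turbo_launch = {k: v * 2 for k, v in regular.items()}  # 50%-off launch price
turbo_full = {k: v * 4 for k, v in regular.items()}    # from September 1

def cost(prices, input_m, output_m, hit_ratio=0.0):
    """Dollar cost for input_m / output_m million tokens at a given cache hit ratio."""
    inp = input_m * (
        hit_ratio * prices["input_hit"] + (1 - hit_ratio) * prices["input_miss"]
    )
    return inp + output_m * prices["output"]

# Example workload: 10M input tokens (half cache hits) and 2M output tokens
for name, prices in [
    ("regular", regular),
    ("turbo (launch)", turbo_launch),
    ("turbo (Sept)", turbo_full),
]:
    print(f"{name:15} ${cost(prices, 10, 2, hit_ratio=0.5):.2f}")
```

On that example workload the regular model comes out at $8.75, the launch-priced turbo at $17.50 and the post-September turbo at $35.00.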
I am interested to see how much market demand there is for faster performance like this. I've experimented with Cerebras in the past and found that the speed really does make iterating on code with live previews feel a whole lot more interactive.