Gemini 2.5 Computer Use can solve Google’s own CAPTCHAs
7th October 2025
Google just introduced a new Gemini 2.5 Computer Use model, specially designed to help operate a GUI by interacting with visible elements using a virtual mouse and keyboard. I just tried their demo... and watched it solve Google’s own CAPTCHA without me even asking it to.
The official demo is hosted at gemini.browserbase.com, and one of the click-to-try example prompts shown there is the following:
Go to Hacker News and find the most controversial post from today, then read the top 3 comments and summarize the debate.
I activated the demo and Gemini decided to start by navigating to www.google.com in order to search for “hacker news”. But Google served a CAPTCHA challenge, presumably because of a large volume of suspicious traffic from the Browserbase IP range.
The model immediately set about solving that CAPTCHA.
It went through a few rounds of this, solved all of them and continued on to Google Search, where it ran the search for “hacker news”, navigated to the site and then did an admittedly unimpressive job of solving the original prompt. It looked at just one thread and reported back on what it found there. I was hoping it would consider more than one option to discover the “most controversial post from today”.
The Gemini 2.5 Computer Use Model card (PDF) talks about training the model to “recognize when it is tasked with a high-stakes action” and request user confirmation before proceeding, but doesn’t have anything to say about not solving CAPTCHAs. So I guess this behaviour is the model working as intended!
Something that did impress me, aside from the unprompted CAPTCHA solve against Google’s very own system, was the quality of the mouse usage. I’ve written before about Computer Use models from both Anthropic and OpenAI (OpenAI called their version “Operator”), and by far the biggest challenge for them is accurately clicking the right targets with the mouse.
It would take a formal eval to determine whether Gemini really is best at this, but given the Gemini models’ previous demonstrations of both bounding boxes and image segmentation masks it doesn’t surprise me that a Gemini model can do a great job of clicking on the right elements in a screenshot of an operating system or browser.
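To illustrate why bounding boxes and clicking are such closely related skills, here is a minimal sketch of turning a detected box into a click target. It assumes the `[ymin, xmin, ymax, xmax]` layout on a 0-1000 normalized grid that Gemini's image understanding documentation describes for bounding boxes. The function name `box_center_to_pixels` and the screen size are my own illustrative choices, and I don't know whether the Computer Use model does anything like this internally.

```python
def box_center_to_pixels(box, screen_width, screen_height, scale=1000):
    """Map a normalized [ymin, xmin, ymax, xmax] box to a pixel click point.

    Assumption: coordinates are normalized to a 0..scale grid, independent
    of the real screenshot size. This helper is hypothetical, not a Gemini API.
    """
    ymin, xmin, ymax, xmax = box
    # Centre of the box, still in normalized units.
    cx = (xmin + xmax) / 2
    cy = (ymin + ymax) / 2
    # Rescale to pixels and clamp so the click always lands on screen.
    x = min(screen_width - 1, max(0, int(cx / scale * screen_width)))
    y = min(screen_height - 1, max(0, int(cy / scale * screen_height)))
    return x, y


# A box covering the middle of a 1440x900 screenshot.
print(box_center_to_pixels([400, 250, 600, 750], 1440, 900))
```

If a model can already place a tight box around a button, picking the centre of that box is most of the way to a reliable click.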