Simon Willison’s Weblog

Subscribe

Initial explorations of Anthropic’s new Computer Use capability

22nd October 2024

Two big announcements from Anthropic today: a new Claude 3.5 Sonnet model and a new API mode that they are calling computer use.

(They also pre-announced Haiku 3.5, but that’s not available yet so I’m ignoring it until I can try it out myself.)

Computer use is really interesting. Here’s what I’ve figured out about it so far.

You provide the computer

Unlike OpenAI’s Code Interpreter mode, Anthropic are not providing hosted virtual machine computers for the model to interact with. You call the Claude models as usual, sending it both text and screenshots of the current state of the computer you have tasked it with controlling. It sends back commands about what you should do next.

The quickest way to get started is to use the new anthropic-quickstarts/computer-use-demo repository. Anthropic released that this morning and it provides a one-liner Docker command which spins up an Ubuntu 22.04 container preconfigured with a bunch of software and a VNC server.

export ANTHROPIC_API_KEY=%your_api_key%
docker run \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $HOME/.anthropic:/home/computeruse/.anthropic \
  -p 5900:5900 \
  -p 8501:8501 \
  -p 6080:6080 \
  -p 8080:8080 \
  -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

I’ve tried this and it works exactly as advertised. It starts the container with a web server listening on http://localhost:8080/—visiting that in a browser provides a web UI for chatting with the model and a large noVNC panel showing you exactly what is going on.

I tried this prompt and it worked first time:

Navigate to http://simonwillison.net and search for pelicans

Screenshot. On the left a chat panel - the bot is displaying screenshots of the desktop and saying things like Now I can see Simon's website4. Let me use the search box at the top to search for "pelicans". On the right is a large Ubuntu desktop screen showing Firefox running with a sarch for pelicans on my website.

This has very obvious safety and security concerns, which Anthropic warn about with a big red “Caution” box in both new API documentation and the computer-use-demo README, which includes a specific callout about the threat of prompt injection:

In some circumstances, Claude will follow commands found in content even if it conflicts with the user’s instructions. For example, Claude instructions on webpages or contained in images may override instructions or cause Claude to make mistakes. We suggest taking precautions to isolate Claude from sensitive data and actions to avoid risks related to prompt injection.

Coordinate support is a new capability

The most important new model feature relates to screenshots and coordinates. Previous Anthropic (and OpenAI) models have been unable to provide coordinates on a screenshot—which means they can’t reliably tell you to “mouse click at point xx,yy”.

The new Claude 3.5 Sonnet model can now do this: you can pass it a screenshot and get back specific coordinates of points within that screenshot.

I previously wrote about Google Gemini’s support for returning bounding boxes—it looks like the new Anthropic model may have caught up to that capability.

The Anthropic-defined tools documentation helps show how that new coordinate capability is being used. They include a new pre-defined computer_20241022 tool which acts on the following instructions (I love that Anthropic are sharing these):

Use a mouse and keyboard to interact with a computer, and take screenshots.
* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.
* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try taking another screenshot.
* The screen's resolution is {{ display_width_px }}x{{ display_height_px }}.
* The display number is {{ display_number }}
* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.
* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.
* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.

Anthropic also note that:

We do not recommend sending screenshots in resolutions above XGA/WXGA to avoid issues related to image resizing.

I looked those up in the code: XGA is 1024x768, WXGA is 1280x800.

The computer-use-demo example code defines a ComputerTool class which shells out to xdotool to move and click the mouse.

Things to try

I’ve only just scratched the surface of what the new computer use demo can do. So far I’ve had it:

  • Compile and run hello world in C (it has gcc already so this just worked)
  • Then compile and run a Mandelbrot C program
  • Install ffmpeg—it can use apt-get install to add Ubuntu packages it is missing
  • Use my https://datasette.simonwillison.net/ interface to run count queries against my blog’s database
  • Attempt and fail to solve this Sudoku puzzle—Claude is terrible at Sudoku!

A Sudoku puzzle is displayed - the bot has already fillef in several squares incorrectly with invalid numbers which have a subtle pink background.

Prompt injection and other potential misuse

Anthropic have further details in their post on Developing a computer use model, including this note about the importance of coordinate support:

When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands—similar to how models often struggle with simple-seeming questions like “how many A’s in the word ‘banana’?”.

And another note about prompt injection:

In this spirit, our Trust & Safety teams have conducted extensive analysis of our new computer-use models to identify potential vulnerabilities. One concern they’ve identified is “prompt injection”—a type of cyberattack where malicious instructions are fed to an AI model, causing it to either override its prior directions or perform unintended actions that deviate from the user’s original intent. Since Claude can interpret screenshots from computers connected to the internet, it’s possible that it may be exposed to content that includes prompt injection attacks.

Plus a note that they’re particularly concerned about potential misuse regarding the upcoming US election:

Given the upcoming U.S. elections, we’re on high alert for attempted misuses that could be perceived as undermining public trust in electoral processes. While computer use is not sufficiently advanced or capable of operating at a scale that would present heightened risks relative to existing capabilities, we’ve put in place measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.