Click, Type, Think: Claude Gets a Mouse
Anthropic's latest feature lets AI navigate computers like humans do
Imagine if AI could use your computer just like you do - clicking buttons, filling forms, and navigating websites. That's exactly what Anthropic is enabling with "computer use", announced this week alongside updates to their Claude models. While I've tested the new flagship "Claude 3.5 Sonnet (new)" and found it even more natural-sounding in conversation, it's computer use that represents the most significant advancement. By allowing Claude to understand and operate existing computer GUIs, this feature fundamentally changes how AI can interact with software.
This advancement turns foundation models into general-purpose software automation platforms. Rather than developing software to fit models in the form of bespoke "tools" (as has been the case for over a year), the model itself now has some understanding of existing software. Anthropic aptly describes this as making "the model fit the tools" as opposed to making "tools fit the model". This is analogous to how LLMs themselves became a general-purpose replacement for what previously required separate models in NLP, computer vision, and so on. This generality opens up many possibilities for agents: with computer use, foundation models can drive legacy software directly, rather than requiring programmers to write custom integrations. Put another way: existing LLMs supercharge writing new software because they are very good at coding; computer use supercharges using LLMs against the iceberg that is legacy software. It's worth noting that this holy grail of automation has been tackled from different angles over the years; RPA platforms like UiPath have long automated legacy enterprise software, and AI-native entrants like Adept have attacked the same problems.
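To make the distinction concrete, here is a minimal sketch of the two approaches as tool definitions for the Anthropic API. The `get_refund_status` tool is purely illustrative (not a real API), and the computer use tool type and parameters are based on my reading of the beta docs at launch:

```python
# Bespoke tool: the developer invents a schema and writes the integration
# behind it. "get_refund_status" is a hypothetical example, not a real API.
refund_tool = {
    "name": "get_refund_status",
    "description": "Look up the refund status for a flight confirmation number.",
    "input_schema": {
        "type": "object",
        "properties": {"confirmation_number": {"type": "string"}},
        "required": ["confirmation_number"],
    },
}

# Computer use: an Anthropic-defined tool type the model was trained against,
# letting it drive an existing GUI through screenshots, mouse, and keyboard.
computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
}
```

The first approach scales only as fast as developers can write tools; the second rides on software that already exists.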
What makes Anthropic's release interesting is the generality of the solution: the same model that processes human language, computer code, and images can now drive GUIs. Foundation models like Claude are having their iPhone moment - just as Jobs famously introduced what seemed like three devices but was actually one revolutionary product in 2007, LLMs are consolidating multiple AI capabilities into a single, general-purpose system. The iPhone was revolutionary in large part because it aggregated disparate demand for computing into a single, portable device. LLMs have followed a similar march over the last two years. First, they aggregated what were previously separate NLP models for sentiment analysis, named entity recognition, and other tasks. Then they became multi-modal, with support for images and audio. With computer use, these models gain yet another capability that previously required separate models.
Anthropic notes that computer use is in beta and has a number of limitations. To get a sense of what's possible, I decided to try it out with a simple example. I ran Anthropic's sandbox for computer use, which basically just sets up an environment with a Linux desktop and a browser. I asked Claude to help me get a refund for a fictional Delta flight. Earlier this year I had to go through this process for real, and while not entirely onerous, I found it tedious.
The process entailed 1) performing a web search, 2) determining which case I fell under, and 3) filling out a form in a modal with flight and transaction details. The second step is usually the most complex, as it involves a fair amount of reasoning to determine whether a refund is even applicable and, if so, which form needs to be filled out.
I also liked this example because it's a mini-version of what computer use promises: integrating with legacy software that does not have clean, easy-to-use APIs. Delta's forms were alright, but even on a desktop computer the process can be difficult to navigate.
I had to run the same process a few times because the results were inconsistent:
The first time I ran it, Claude correctly identified what needed to be done. When I asked it to fill out the form, it refused, saying it could not work with PII. This is a known limitation of computer use, which Anthropic has documented.
The second time, Claude concluded that nothing needed to be done, which I don't believe is correct in all circumstances. I think this is a problem of missing context; I provided a very broad prompt, and it would have helped to give Claude more context about my (fictional) situation.
The third time, Claude hallucinated a URL but recovered from it. Rather than executing the steps itself, it decided to summarize the manual steps I needed to take to get a refund. Again, this is likely due to prompt underspecification.
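For what it's worth, the underspecification feels fixable. Here's a hypothetical sketch of the kind of task prompt I could have provided up front; the flight details are made up for illustration:

```python
# A hypothetical, more specific task prompt for the agent. All flight details
# below are fictional and purely illustrative.
task = """
Help me request a refund from Delta for a cancelled flight.

Context:
- Confirmation number: ABC123 (fictional)
- Route: JFK -> SFO, departing 2024-03-15
- Delta cancelled the flight and I declined the rebooking.
- I paid by credit card and want the refund to the original form of payment.

Find the correct refund form on delta.com, tell me which case applies,
and fill out the form, stopping before submission.
"""

messages = [{"role": "user", "content": task}]
```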
Key Takeaways from Testing Computer Use:
Computer use is slow and costly; each run cost me about $1 and my issue wasn't fully resolved. As noted above, with better prompting a full resolution would likely have been reached, but at roughly twice the cost. This implies that developers need to be judicious about the use cases they apply this new capability to while the technology improves and costs come down.
Anthropic's constraints on passing sensitive data make sense, but they are very limiting. One of the biggest promises of AI agents is their ability to act on your behalf, and that involves PII. Human assistants have access to and use privileged information all the time, so I hope there is some evolution on this front.
I used Anthropic's ready-to-run computer use sandbox, but this is not necessary. Developers can control the agent loop themselves, which opens up a lot of possibilities, like different kinds of input control. Since developers also control the execution of GUI actions, this allows for different isolation/sandboxing schemes and application-specific error handling; basically, you're free to implement whatever agent architecture you want.
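Here is a minimal sketch of that loop against the Anthropic Python SDK, assuming the computer use beta identifiers from the launch docs; `run_computer_action` is a hypothetical helper standing in for whatever sandbox or executor you control:

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical helper: execute a GUI action (click, type, screenshot, ...) in
# whatever environment you control, and return tool_result content blocks
# (e.g. a base64 screenshot). Isolation and error handling live here.
def run_computer_action(tool_input: dict) -> list:
    raise NotImplementedError

messages = [{"role": "user", "content": "Help me request a refund for my Delta flight."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        betas=["computer-use-2024-10-22"],
        tools=[{
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    # Collect the GUI actions Claude wants to take and execute them ourselves.
    tool_results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_computer_action(block.input),
        }
        for block in response.content
        if block.type == "tool_use"
    ]
    if not tool_results:
        break  # Claude is done (or is asking us something).
    messages.append({"role": "user", "content": tool_results})
```

Because you own `run_computer_action`, you decide where screenshots are taken, which actions are allowed, and how failures are handled.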
When I chatted with Claude about setting up the sandbox, I was amused by this very natural response in the new 3.5 Sonnet model:
My initial impression is that Anthropic has trained a model with vastly superior vision capabilities, particularly as they relate to GUIs. Notably, Anthropic mentions that this model was trained to count individual pixels. One outcome is that Claude may have very good vision capabilities that transfer to other domains; for example, it looks like Claude can identify plane models from very few cues. On the agentic front, my initial impression is that computer use is very useful for interpreting GUIs, but it might be better to use other tools for other parts of a workflow. For example, I wouldn't waste time asking Claude to perform a web search. Instead, I would provide screenshots of web pages that have already been filtered for relevance. Similarly, for web applications specifically, I might use Claude to "direct" traditional automation software like Selenium or Playwright, rather than having Claude execute everything itself.
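Here's a hypothetical sketch of that hybrid pattern: Playwright handles navigation, screenshots, and clicks, while Claude only interprets the page. The URL, prompt, and the naive coordinate parsing are illustrative assumptions, not a production recipe:

```python
import base64
import anthropic
from playwright.sync_api import sync_playwright

client = anthropic.Anthropic()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1024, "height": 768})
    page.goto("https://www.delta.com/")

    # Playwright captures the screen; Claude only has to read it.
    screenshot = base64.b64encode(page.screenshot()).decode()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": screenshot}},
                {"type": "text",
                 "text": "Reply with only the x,y pixel coordinates of the "
                         "element I should click to start a refund request."},
            ],
        }],
    )

    # Naively parse "x,y" out of Claude's reply and let Playwright do the click.
    x, y = (int(v) for v in response.content[0].text.strip().split(","))
    page.mouse.click(x, y)
    browser.close()
```

This keeps Claude in the role it seems best at (reading the screen) while deterministic tooling performs the actions.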
All that being said, I've been impressed by what computer use has to offer on day one. While current limitations around speed, cost, and handling sensitive data need to be addressed, the potential is clear: we're moving toward a future where AI can seamlessly interact with existing software interfaces, potentially revolutionizing how we think about automation. The evolution of this technology will be fascinating to watch.