It’s only a matter of time before everyone realizes how good these computer use models are.
Browser agents today are typically multiple models strapped together in a planner-actor-critic loop, and their performance depends largely on how well the HTML DOM is sanitized. My cofounder and I worked on these for months at MultiOn. So much of the agent's performance was locked behind frequent browser problems, prompting, and orchestration.
CUA, Claude computer use, and open source models like UI-TARS present a completely new paradigm: one model that simultaneously plans, acts, and evaluates its own decisions, relying solely on vision input. We’ve seen the latest models one-shot day-to-day tasks like ordering DoorDash, but we’ve also seen them successfully navigate complex and uncommon GUIs and course-correct when they go off track. They approach problems creatively, generalize to any interface, and are resilient to errors. They are not bound by messy textual representations like the DOM. This means you can take them out of the browser and use them in enterprise desktop apps, games, and terminals. You can observe their intelligence for yourself when you ask “show me something smart.”
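For anyone who wants to picture the single-model loop, here is a rough Python sketch. It is not how CUA, Claude, or UI-TARS are actually wired up. `Action`, `FakeDesktop`, `run_agent`, and `scripted_model` are all hypothetical names made up for illustration, and the "model" is a hard-coded script standing in for a real vision model. The point is the shape: one callable sees the task, a screenshot, and its own past actions, and that same callable decides the next step and when it is done. There is no separate planner, actor, or critic, and no DOM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str      # "click", "type", or "done" (hypothetical action set)
    x: int = 0
    y: int = 0
    text: str = ""


# Hypothetical model interface: (task, screenshot bytes, history) -> next action.
Model = Callable[[str, bytes, List[Action]], Action]


class FakeDesktop:
    """Stand-in for a real screen: it only records the actions it receives."""

    def __init__(self) -> None:
        self.log: List[Action] = []

    def screenshot(self) -> bytes:
        # A real environment would return pixels; this returns a frame label.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


def run_agent(task: str, model: Model, env: FakeDesktop, max_steps: int = 10) -> bool:
    """Single-model loop: look at the screen, pick an action, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = model(task, env.screenshot(), history)
        if action.kind == "done":
            return True   # the model itself judged the task complete
        env.execute(action)
        history.append(action)
    return False          # step budget ran out before the model said "done"


def scripted_model(task: str, screenshot: bytes, history: List[Action]) -> Action:
    """Placeholder for a vision model: replays a fixed three-step script."""
    script = [
        Action("click", x=100, y=200),
        Action("type", text="pad thai"),
        Action("done"),
    ]
    return script[len(history)]


env = FakeDesktop()
finished = run_agent("order dinner", scripted_model, env)
```

Swapping `scripted_model` for a real computer use model and `FakeDesktop` for a real screen is where all the actual work lives; the loop itself stays about this small.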
Computer use agents are the future. People just haven’t realized it yet.
Quote
justin
@justinsunyt
I don’t have to decide what to eat for dinner anymore
I just use the two most capable computer use agents in the world to order take out
Mukbang Roulette: Episode 1