šŸ§‘ā€šŸš€ Computer Use Agents: The Next Evolution of AI

How reasoning models and autonomous agents are reshaping the future of digital tasks and redefining human-AI collaboration.

Good morning, it’s Saturday. Welcome to a special weekend edition of FF Daily.

With OpenAI's leap into autonomous digital agents, we’re diving into the fascinating world of Operator and the broader market of Computer Use Agents (CUAs). How does Operator work? How does it compare to other CUAs? And where might this transformative technology take us next?

Read on!

Computer Use Agents: Operator Has Ushered in a New Era

“Agents are emerging in production as LLMs mature in key capabilities – understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgment.”

Anthropic

OpenAI recently introduced “Operator”, its first true agent. This agent can take control of its own computer; at present it browses the web and performs tasks independently, such as booking a restaurant table.

The importance of agents like Operator for technological development can hardly be overstated. After the launch of powerful language models like GPT-4 in 2023, which demonstrated the potential of AI to a broad audience for the first time, 2024 brought another revolution: so-called reasoning models. These models added the ability to “think” and reason to AI. In particular, OpenAI's o1 model showed how powerful AI models can become when they are able to think longer and in a more structured way.

This development is based on the concept of “System 2 thinking” as described by Nobel Prize winner Daniel Kahneman: slow, deliberate reasoning as opposed to fast, intuitive responses. In practice, it means spending additional computing power at inference time to reach better conclusions. Whereas the focus used to be on ever larger models in pre-training, it has now shifted to scaling the inference process. Classic language models worked mainly on the basis of probabilistic predictions – they were often decried as “stochastic parrots” because they only generated the next probable word (token). Reasoning models moved past this limitation by acquiring the ability to engage in complex deliberation, planning and the weighing of decisions. This enabled significant progress, particularly in logical and clearly verifiable tasks.

In 2025, We Are Now Witnessing the Next Milestone in the Evolution of AI: Agents

These agents work largely autonomously and can independently plan and execute complex process chains. While language models have so far tended to operate in a rather one-dimensional way, i.e. they could answer a question or perform a task, agents go one step further. They are able to develop multi-dimensional processes in which a single instruction is followed by a whole series of actions that they independently coordinate.

An outstanding example of this new technology is OpenAI's Operator. However, OpenAI is not the first to offer such a system. Anthropic presented its own computer use capability for Claude as early as October 2024, setting an initial milestone in this area. This development is likely to prove no less groundbreaking than the reasoning models of 2024 or the LLMs of 2023.

But before we take a closer look at the differences between the various models, let's understand how computer use agents like Operator work and what impact they have on our understanding of AI.

What Is “Operator” and What Are CUAs?

“Today we introduced a research preview of Operator, an agent that can go to the web to perform tasks for you. Powering Operator is Computer-Using Agent (CUA), a model that combines GPT-4o's vision capabilities with advanced reasoning through reinforcement learning. CUA is trained to interact with graphical user interfaces (GUIs) – the buttons, menus, and text fields people see on a screen – just as humans do. This gives it the flexibility to perform digital tasks without using OS- or web-specific APIs.”

OpenAI

Operator is a newly developed agent from OpenAI designed to independently perform tasks in a web browser. It represents a new stage in the development of AI-powered tools by not only responding to text input but actively interacting on the internet. With the help of its own browser, Operator can ā€œseeā€ web pages, interpret them using screenshots, and interact with them directly, for example by clicking, scrolling, and filling out forms. These capabilities allow it to automate a variety of repetitive and time-consuming tasks, such as booking travel, ordering food, or even creating content like memes.

How Does Operator Work?

“Given a user’s instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action:

Perception: Screenshots from the computer are added to the model’s context, providing a visual snapshot of the computer's current state.
Reasoning: CUA reasons through the next steps using chain-of-thought, taking into consideration current and past screenshots and actions. This inner monologue improves task performance by enabling the model to evaluate its observations, track intermediate steps, and adapt dynamically.
Action: It performs the actions – clicking, scrolling, or typing – until it decides that the task is completed or user input is needed. While it handles most steps automatically, CUA seeks user confirmation for sensitive actions, such as entering login details or responding to CAPTCHA forms.”

OpenAI

Functionality of CUA, OpenAI

Operator works independently but hands control back to the user for complex or security-critical steps. For example, it prompts the user to take over when login data needs to be entered or a payment is made. This ensures that sensitive data remains protected and privacy is maintained. What's more, Operator is designed to recover from its mistakes: if it fails at a task, it can adapt its strategy and try again. This makes it an adaptive tool that corrects itself rather than following a fixed script.

In addition, OpenAI has stated in the presentation that Operator provides notifications so that the user is aware of the different processes.
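The perception, reasoning, and action loop quoted above, together with the handover for sensitive steps, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenAI's implementation: the names `run_agent`, `Action`, `SENSITIVE_KINDS` and all callback parameters are hypothetical, and the model's reasoning step is replaced by a function supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical labels for steps the agent must not perform on its own.
SENSITIVE_KINDS = {"enter_credentials", "submit_payment", "solve_captcha"}


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", "done"
    detail: str = ""   # free-form description of the target


@dataclass
class AgentState:
    instruction: str
    # Past (screenshot, action) pairs that the reasoning step can consult.
    history: List[Tuple[str, Action]] = field(default_factory=list)


def run_agent(
    instruction: str,
    take_screenshot: Callable[[], str],
    choose_action: Callable[[AgentState, str], Action],
    execute: Callable[[Action], None],
    hand_over: Callable[[Action], bool],
    max_steps: int = 20,
) -> str:
    """Run the loop until the task is done, the user stops it, or steps run out."""
    state = AgentState(instruction)
    for _ in range(max_steps):
        screenshot = take_screenshot()              # perception
        action = choose_action(state, screenshot)   # reasoning (stand-in for the model)
        if action.kind == "done":
            return "completed"
        if action.kind in SENSITIVE_KINDS:
            # The user performs the step; False means the user aborted the task.
            if not hand_over(action):
                return "stopped_by_user"
        else:
            execute(action)                         # action
        state.history.append((screenshot, action))
    return "step_limit_reached"


if __name__ == "__main__":
    # A scripted "model" that replays a fixed plan, for demonstration only.
    plan = iter([
        Action("click", "search box"),
        Action("type", "table for two"),
        Action("enter_credentials", "login form"),
        Action("done"),
    ])
    executed: List[Action] = []
    result = run_agent(
        "book a restaurant table",
        take_screenshot=lambda: "<screenshot>",
        choose_action=lambda state, shot: next(plan),
        execute=executed.append,
        hand_over=lambda action: True,
    )
    print(result, [a.kind for a in executed])
```

In this sketch the agent never executes a sensitive step itself; it only records that the user handled it. The `max_steps` limit is an added safeguard so that a confused agent cannot loop forever.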


Advantages of Operator

Operator aims to save users time and optimize work processes. With the ability to interact with websites just as a human would, the agent can complete tasks that are often tedious or repetitive. This is not only useful for individuals, but also for companies. For example, businesses can use Operator to improve customer service, automate orders, or make other digital processes more efficient.

Another advantage is personalization: users can give Operator custom instructions, for example to define preferences for certain websites. Someone could prioritize their favorite airline on booking portals or automate standard orders at food delivery services. This flexibility makes Operator not only a practical tool for everyday life, but also a valuable one for complex workflows.

A central concern in the development of Operator is the security and privacy of users. Even if the topic bores many users, the security of agents should not be underestimated. After all, the agent works with sensitive data such as passwords.

OpenAI has integrated several layers of protection for this:

  1. Control by the user: Operator gives the user the option to take control at any time, especially for sensitive actions such as entering passwords or payment data. In these cases, the user can ensure that no sensitive information is collected or stored.

  2. Transparency and data deletion: Users can delete their browsing data and previous activities with a single click. In addition, no data is used for model improvement if the user has disabled this in the settings.

  3. Protection against misuse: Operator has been equipped with special security mechanisms to prevent manipulation by malicious websites. These can, for example, recognize and ignore hidden prompt injections.

Despite all the security measures, Operator will not be released in Europe in the foreseeable future. Europe continues to fall behind. I will write a longer analysis about the geopolitical implications and the global competition for AI in the near future.

Difference Between OpenAI's Operator and Anthropic’s Computer Use

Anthropic has introduced a computer use API that allows its AI model “Claude” to control computers and perform tasks autonomously, such as moving the cursor and clicking buttons. The feature has been released as a public beta, so it is accessible to a wider user base. It enables Claude to perform tasks by observing the screen, similar to Microsoft's Copilot Vision and OpenAI's ChatGPT desktop application.

Both systems are computer-using agents (CUAs). Operator automates web tasks and is available to ChatGPT Pro users in the US, while Claude can control computers by observing screens and is available as a public beta.

OpenAI's Operator, however, seems to break down processes and tasks into smaller steps and works through them one after the other, going back and retrying when problems arise. Anthropic's model seems to take a less structured approach.

On closer inspection, the two models appear to be more similar than initially suspected. However, Operator seems to be more mature and equipped with more reasoning, and thus achieves better results in practice, a difference that benchmarks also reflect.

Conclusion: Era of Agents

“The next challenge space we plan to explore is expanding the action space of agents. The flexibility offered by a universal interface addresses this challenge, enabling an agent that can navigate any software tool designed for humans. By moving beyond specialized agent-friendly APIs, CUA can adapt to whatever computer environment is available – truly addressing the “long tail” of digital use cases that remain out of reach for most AI models.”

OpenAI

At present, agents are still comparatively inaccurate. Benchmarks show that OpenAI's Operator is superior to other web agents, but the technology still requires significant development.

“In these benchmarks, CUA sets a new standard using the same universal interface that perceives the browser screen as pixels and takes action through mouse and keyboard. CUA achieved a 58.1% success rate on WebArena and an 87% success rate on WebVoyager for web-based tasks. While CUA achieves a high success rate on WebVoyager, where most tasks are relatively simple, CUA still needs more improvements to close the gap with human performance on more complex benchmarks like WebArena.”

OpenAI

Nevertheless, it is clear in which direction we are moving. All computer work will be fundamentally changed in the foreseeable future. In the future, white-collar workers will no longer perform their own tasks, but will delegate them to agents. Instead of laboriously creating Excel spreadsheets or dealing with standardized processes on the computer, agents will take over all these tasks. In short, we will all become agent managers who determine what needs to be done.

This disruptive development can hardly be overestimated. The entire way office workers function will change forever. Agents can be duplicated without difficulty: while humans have difficulties with multitasking and usually process tasks one after the other, in the future we will be able to use numerous agents in parallel. This will lead to a massive increase in efficiency. In addition, agents do not need breaks, vacations or time for a cigarette. These features make agents a revolutionary force. A previously unimaginable increase in efficiency will occur.
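The difference between working through tasks one after the other and running several agents side by side can be illustrated with a short Python sketch. It rests entirely on assumptions: `run_agent_task` is a hypothetical stand-in that merely sleeps to simulate an agent waiting on web pages and model calls, and the task names and durations are invented for the example.

```python
import asyncio
import time
from typing import Dict, List


async def run_agent_task(name: str, seconds: float) -> str:
    """Hypothetical stand-in for one agent working on one task."""
    await asyncio.sleep(seconds)  # simulates waiting on pages and model calls
    return f"{name}: done"


async def run_all(tasks: Dict[str, float]) -> List[str]:
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(
        *(run_agent_task(name, seconds) for name, seconds in tasks.items())
    )


if __name__ == "__main__":
    tasks = {"book flight": 0.3, "order groceries": 0.2, "file expenses": 0.1}
    start = time.perf_counter()
    results = asyncio.run(run_all(tasks))
    elapsed = time.perf_counter() - start
    print(results)
    print(f"parallel: about {elapsed:.1f}s, sequential: {sum(tasks.values()):.1f}s")
```

Run concurrently, the total time is roughly that of the slowest task rather than the sum of all tasks, which is the efficiency gain the paragraph above describes. Real agents would additionally be limited by cost, rate limits, and the human attention needed to review their output.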

In the future, the function of agents will of course not be limited to web browsing. Agents will be integrated into all processes. Sam Altman considers it a ā€œholy grailā€ that agents can conduct research independently. They should act as intelligent researchers who can independently carry out and evaluate complete research processes. If these abilities are extended to include independent learning, the path to the so-called ā€œintelligence explosionā€ would be paved - the development towards singularity. Agents would then be an integral part of the coming AGI and ASI.

The impact on the world of work is enormous. If a person can manage several agents at the same time, both the number of tasks completed and the speed of research will multiply. Agents have therefore been a major focus of OpenAI from the very beginning. With Operator, the company has taken the first step into the era of agents. Certainly, this is an initial attempt, comparable to the first iteration of ChatGPT. Operator feels like the early GPT-3.5 version of ChatGPT: still clunky, underdeveloped and error-prone. But it is already usable and clearly shows the direction of travel.

Just as GPT-3.5 was superseded by GPT-4 within a few months and later by o1, we can expect similar progress with agents. That is the significance of this release. Operator may not be a transformative force right away, but the current rate of model improvements will be reflected in agents. Within a few months, we could be confronted with Operator 2 or competing products that make today's Operator seem like a relic of the past.

Exponential development is difficult for humans to grasp. This is rooted in our evolutionary history: humans have always had to concentrate on the present and near future because immediate dangers determined life and death. Exponential developments were irrelevant. That is why first products are often ridiculed and dismissed as incapable. But the development of AI has shown us how incredibly fast progress can be. From OpenAI's o1 to o3, it took only three months to increase performance on the ARC-AGI benchmark from 31% to 88%. With that in mind, let's look at Operator. It's just the first step, but the next steps will be giant leaps. In light of this, it's no surprise that the online publication “The Information” has heard from internal sources that OpenAI is already working on a senior software engineer agent that comes close to AGI:

“Now the company is working on AI to help senior software engineers handle more complex programming tasks, a key step in the company’s attempt to develop artificial general intelligence that outperforms people at most economically valuable work, according to three people who spoke to OpenAI leaders about the product.”

The Information

Operator is here, the agents are here. Even if they don't change the world today, it will only take a few months before they do. That is the profound truth of acceleration.

-Written by Kim Isenberg | Follow Kim on X

What did you think of our first-ever weekend edition?

We’re trying something new: long-form content on a day when most people have a few extra moments. We’d love to hear your feedback. Whether it’s positive or critical, we truly appreciate it – it all helps us get better.

Thanks for Joining Us Today

Remember to follow us on X for quick daily updates and bite-sized content, and subscribe to our YouTube channel for in-depth technical analysis.

Have a great weekend!

The Forward Future Team
šŸ§‘ā€šŸš€ šŸ§‘ā€šŸš€ šŸ§‘ā€šŸš€ šŸ§‘ā€šŸš€ 
