
How AI Agents Perform in Real-World Tasks: Insights from New Benchmark

Image: A futuristic office in which a humanoid AI agent works across holographic screens of code, charts, and task lists, chats with simulated colleagues, and a background leaderboard shows benchmark scores for models such as Claude 3.5 Sonnet and GPT-4o.

Image Source: ChatGPT-4o


Researchers from Carnegie Mellon University and collaborators have unveiled TheAgentCompany, a benchmark designed to test the performance of AI agents in realistic workplace scenarios. With AI-powered systems like ChatGPT reshaping work, this new benchmark evaluates how well these agents handle tasks such as coding, financial analysis, and project management in simulated office environments.

What is TheAgentCompany?

TheAgentCompany creates a controlled, reproducible test environment mimicking a small software company. It includes:

  • Realistic Scenarios: Tasks range from setting up servers to managing project sprints and collaborating with simulated coworkers via internal tools like chat systems and file storage.

  • Multiple Roles and Interfaces: Agents interact through web browsers, coding environments, and communication tools, simulating a professional workspace.

  • Checkpoint-Based Evaluation: Tasks are broken into milestones, allowing partial credit to highlight progress even when full completion isn’t achieved (a scoring sketch follows this list).
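
To make the checkpoint idea concrete, below is a minimal sketch of how this kind of partial-credit scoring can be computed. The task names and the simple fraction-of-checkpoints formula are illustrative assumptions, not the benchmark’s exact rubric.

```python
# Minimal sketch: score each task by the fraction of checkpoints reached.
# The weighting here is an assumption for illustration; TheAgentCompany's
# actual rubric may combine checkpoints differently.
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    checkpoints_passed: int  # milestones the agent actually reached
    checkpoints_total: int   # milestones defined for the task

    @property
    def fully_completed(self) -> bool:
        return self.checkpoints_passed == self.checkpoints_total

    @property
    def partial_score(self) -> float:
        return self.checkpoints_passed / self.checkpoints_total


def summarize(results: list[TaskResult]) -> dict:
    """Aggregate full-completion and partial-credit rates across tasks."""
    n = len(results)
    return {
        "full_completion_rate": sum(r.fully_completed for r in results) / n,
        "partial_credit_score": sum(r.partial_score for r in results) / n,
    }


# Hypothetical example: one task finished end-to-end, one stalled midway.
print(summarize([
    TaskResult("set up a project sprint", 4, 4),
    TaskResult("file an expense report", 1, 3),
]))
```

This mirrors the benchmark’s intent: an agent that stalls partway through a multi-step task still registers measurable progress instead of a flat zero.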

Key Findings: Current AI Performance

TheAgentCompany benchmark tested seven large language models, including proprietary systems such as Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT-4o, and Google’s Gemini, as well as open-weight models such as Meta’s Llama 3.1 and Alibaba’s Qwen. The evaluations compared performance on diverse tasks, highlighting differences in capability, cost, and efficiency.

The benchmark revealed both strengths and weaknesses in current AI agents:

  • Best Model Performance: The top-performing AI, Anthropic’s Claude 3.5 Sonnet, completed 24% of tasks autonomously, with a partial credit score of 34.4%.

  • Challenges with Complex Tasks: Tasks requiring long-term planning, nuanced decision-making, or complex UI navigation—like filling forms or communicating with colleagues—proved difficult for all models.

  • Strength in Technical Tasks: Models excelled in programming-related tasks, likely due to the abundance of publicly available training data in this domain.

Benchmarks and Metrics

TheAgentCompany tested seven prominent models across 175 diverse tasks, providing a nuanced picture of AI agents’ strengths and weaknesses:

  • Task Completion: Anthropic’s Claude 3.5 Sonnet emerged as the top performer, completing 24% of tasks autonomously and scoring 34.4% with partial credits. This model demonstrated strong capabilities in coding-related tasks, such as setting up servers and managing project sprints. However, administrative and finance-related tasks, like analyzing spreadsheets or navigating internal company tools, saw much lower success rates.

  • Efficiency and Cost: The top-performing model was also the most resource-intensive, requiring an average of 29 steps and costing $6.34 per task. By contrast, Google’s Gemini 2.0 Flash was far cheaper at just $0.79 per task but achieved a much lower success rate of 11.4%, highlighting the trade-off between efficiency and effectiveness (a rough comparison follows this list).

  • Performance by Task Type: Agents performed best on software engineering tasks, achieving success rates above 30%, likely due to the wealth of available training data in this domain. In contrast, models struggled to complete even basic checkpoints on financial-analysis and administrative tasks, where relevant training data is private or scarce.

  • Platform Interaction: Models were evaluated on tasks across platforms like GitLab, RocketChat, and ownCloud. While agents handled GitLab tasks better, they struggled significantly with RocketChat and ownCloud due to the complexity of user interfaces and the need for effective communication with simulated colleagues.

  • Partial Credit Evaluation: Tasks were assessed not only on full completion but also on progress toward intermediate goals. Models could earn partial credit for completing subtasks or checkpoints, with the best-performing model, Claude 3.5 Sonnet, scoring 34.4% by completing parts of longer, multi-step tasks. This nuanced scoring system highlighted where agents made progress even if they couldn’t fully achieve the desired outcomes.
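
As a rough way to read the efficiency-versus-effectiveness trade-off noted above, the sketch below divides each model’s reported average cost per task by its autonomous success rate to estimate a cost per fully completed task. This is a back-of-the-envelope calculation based on the figures in this article, not a metric the benchmark itself reports, and it ignores partial credit.

```python
# Back-of-the-envelope cost per completed task, using the per-task cost and
# autonomous success rate reported above (partial credit ignored).
models = {
    "Claude 3.5 Sonnet": {"cost_per_task": 6.34, "success_rate": 0.240},
    "Gemini 2.0 Flash":  {"cost_per_task": 0.79, "success_rate": 0.114},
}

for name, stats in models.items():
    cost_per_success = stats["cost_per_task"] / stats["success_rate"]
    print(f"{name}: ~${cost_per_success:.2f} per completed task")

# Output:
# Claude 3.5 Sonnet: ~$26.42 per completed task
# Gemini 2.0 Flash:  ~$6.93 per completed task
```

On this crude measure the cheaper model still comes out ahead per completed task, but only for the subset of tasks it can actually finish.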

Implications for AI Development

The findings highlight critical areas for improvement:

  • Social and Communication Skills: Many failures stemmed from agents’ inability to navigate nuanced conversations or follow up appropriately with simulated coworkers.

  • UI and Browsing Challenges: Complex web interfaces often stumped AI agents, with tasks involving file uploads or navigating pop-ups leading to errors.

  • Training Gaps: Administrative and financial tasks remain a challenge due to limited training data in these areas.

Looking Ahead: The Future of AI Agents

The research points to exciting opportunities for future advancements:

  • Smaller, Efficient Models: Newer models like Llama 3.3 (70B) are becoming more efficient, narrowing the gap between open and proprietary systems.

  • Improved Task Diversity: Expanding benchmarks to include vague or creative tasks could better reflect real-world applications.

  • Enhanced Collaboration: Developing agents that excel in communication and multi-step workflows is essential for broader adoption in workplaces.

As researchers refine tools like TheAgentCompany, these benchmarks will guide the evolution of AI agents capable of tackling more complex and meaningful tasks in professional settings.

For more details on this research, you can read the paper here.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.