How AI Agents Perform in Real-World Tasks: Insights from New Benchmark
Researchers from Carnegie Mellon University and collaborators have unveiled TheAgentCompany, a benchmark designed to test the performance of AI agents in realistic workplace scenarios. With AI-powered systems like ChatGPT reshaping work, this new benchmark evaluates how well these agents handle tasks such as coding, financial analysis, and project management in simulated office environments.
What is TheAgentCompany?
TheAgentCompany creates a controlled, reproducible test environment mimicking a small software company. It includes:
Realistic Scenarios: Tasks range from setting up servers to managing project sprints and collaborating with simulated coworkers via internal tools like chat systems and file storage.
Multiple Roles and Interfaces: Agents interact through web browsers, coding environments, and communication tools, simulating a professional workspace.
Checkpoint-Based Evaluation: Tasks are broken into milestones, allowing partial credit to highlight progress even when full completion isn’t achieved.
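To make the checkpoint idea concrete, here is a minimal sketch of how partial credit could be computed. It assumes the simplest possible rule, the fraction of checkpoints passed; the benchmark's actual scoring formula may weight full completion differently. The `Checkpoint` class, the function names, and the example task are hypothetical and are not taken from TheAgentCompany's code.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One milestone within a task (hypothetical structure for illustration)."""
    name: str
    passed: bool


def partial_credit(checkpoints: list[Checkpoint]) -> float:
    """Return the fraction of checkpoints passed, from 0.0 to 1.0.

    Assumption: every checkpoint counts equally. A real benchmark may
    weight checkpoints or add a bonus for finishing the whole task.
    """
    if not checkpoints:
        return 0.0
    return sum(c.passed for c in checkpoints) / len(checkpoints)


def fully_completed(checkpoints: list[Checkpoint]) -> bool:
    """A task counts as completed only when every checkpoint passed."""
    return bool(checkpoints) and all(c.passed for c in checkpoints)


# Hypothetical task: the agent handles the technical steps but stalls
# on the communication and file-storage steps.
task = [
    Checkpoint("clone the repository", True),
    Checkpoint("start the server", True),
    Checkpoint("notify a coworker in chat", False),
    Checkpoint("upload the report to file storage", False),
]

print(f"partial credit: {partial_credit(task):.0%}")
print(f"fully completed: {fully_completed(task)}")
```

Under this rule, an agent that finishes half of a task's milestones still registers progress, even though the task does not count toward the completion rate.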
Key Findings: Current AI Performance
TheAgentCompany benchmark tested seven large language models, including proprietary systems like Anthropic Claude 3.5 Sonnet, OpenAI GPT-4o, and Google Gemini, as well as open-weight models like Meta’s Llama 3.1 and Alibaba’s Qwen. The evaluations compared performance on diverse tasks, highlighting differences in capability, cost, and efficiency.
The benchmark revealed both strengths and weaknesses in current AI agents:
Best Model Performance: The top-performing AI, Anthropic’s Claude 3.5 Sonnet, completed 24% of tasks autonomously, with a partial credit score of 34.4%.
Challenges with Complex Tasks: Tasks requiring long-term planning, nuanced decision-making, or complex UI navigation, such as filling out forms, proved difficult for all models, as did tasks that depended on communicating with colleagues.
Strength in Technical Tasks: Models excelled in programming-related tasks, likely due to the abundance of publicly available training data in this domain.
Benchmarks and Metrics
TheAgentCompany tested seven prominent models across 175 diverse tasks, providing a nuanced picture of AI agents’ strengths and weaknesses:
Task Completion: Anthropic’s Claude 3.5 Sonnet emerged as the top performer, completing 24% of tasks autonomously and scoring 34.4% once partial credit is counted. This model demonstrated strong capabilities in coding-related tasks, such as setting up servers and managing project sprints. However, administrative and finance-related tasks, like analyzing spreadsheets or navigating internal company tools, saw much lower success rates.
Efficiency and Cost: The top-performing model was also the most resource-intensive, averaging 29 steps and $6.34 per task. Google Gemini 2.0 Flash, by contrast, cost just $0.79 per task but achieved a significantly lower success rate of 11.4%. The comparison highlights a trade-off between cost and task success.
Performance by Task Type: Agents performed best in software engineering roles, achieving success rates above 30%, likely due to the wealth of available training data in this domain. In contrast, models struggled to complete even basic checkpoints on financial analysis and administrative tasks, for which training data is private or less commonly available.
Platform Interaction: Models were evaluated on tasks across platforms like GitLab, RocketChat, and ownCloud. While agents handled GitLab tasks better, they struggled significantly with RocketChat and ownCloud due to the complexity of user interfaces and the need for effective communication with simulated colleagues.
Partial Credit Evaluation: Tasks were assessed not only on full completion but also on progress toward intermediate goals. Models could earn partial credit for completing subtasks or checkpoints, with the best-performing model, Claude 3.5 Sonnet, scoring 34.4% by completing parts of longer, multi-step tasks. This nuanced scoring system highlighted where agents made progress even if they couldn’t fully achieve the desired outcomes.
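One way to read the cost figures above is to ask what a single successfully completed task costs. The sketch below divides the reported cost per task by the reported completion rate. This derived metric does not appear in the benchmark's results; it assumes the quoted costs are averages over all attempted tasks and that failed attempts cost about as much as successful ones. The function name `cost_per_completed_task` is hypothetical.

```python
def cost_per_completed_task(cost_per_task: float, success_rate: float) -> float:
    """Estimate the spend needed to obtain one fully completed task.

    Assumption: cost_per_task is an average over every attempted task,
    so dividing by the success rate spreads the cost of failures across
    the tasks that succeeded.
    """
    if success_rate <= 0:
        raise ValueError("success_rate must be positive")
    return cost_per_task / success_rate


# Figures quoted in this article: (cost per task in USD, completion rate).
reported = {
    "Claude 3.5 Sonnet": (6.34, 0.24),
    "Gemini 2.0 Flash": (0.79, 0.114),
}

for model, (cost, rate) in reported.items():
    estimate = cost_per_completed_task(cost, rate)
    print(f"{model}: about ${estimate:.2f} per completed task")
```

Under these assumptions, the figures work out to roughly $26.42 per completed task for Claude 3.5 Sonnet and roughly $6.93 for Gemini 2.0 Flash. The cheaper model remains cheaper per success, but it completes fewer than half as many tasks.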
Implications for AI Development
The findings highlight critical areas for improvement:
Social and Communication Skills: Many failures stemmed from agents’ inability to navigate nuanced conversations or follow up appropriately with simulated coworkers.
UI and Browsing Challenges: Complex web interfaces often stumped AI agents, with tasks involving file uploads or navigating pop-ups leading to errors.
Training Gaps: Administrative and financial tasks remain a challenge due to limited training data in these areas.
Looking Ahead: The Future of AI Agents
The research points to exciting opportunities for future advancements:
Smaller, Efficient Models: Newer models like Llama 3.3 (70B) are becoming more efficient, narrowing the gap between open and proprietary systems.
Improved Task Diversity: Expanding benchmarks to include vague or creative tasks could better reflect real-world applications.
Enhanced Collaboration: Developing agents that excel in communication and multi-step workflows is essential for broader adoption in workplaces.
As researchers refine tools like TheAgentCompany, these benchmarks will guide the evolution of AI agents capable of tackling more complex and meaningful tasks in professional settings.
For more details on this research, see the TheAgentCompany paper.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.