
Humanity’s Last Exam: A New, Harder Benchmark for Frontier AI Testing

A futuristic lecture hall where researchers from diverse fields and an AI interface confront questions from Humanity’s Last Exam, with the AI’s low accuracy score on display.

Image Source: ChatGPT-4o


The Center for AI Safety (CAIS) and Scale AI have unveiled Humanity’s Last Exam (HLE), an advanced AI benchmark designed to push large language models (LLMs) to their limits. With AI systems now surpassing 90% accuracy on many existing benchmarks like MMLU, HLE introduces a significantly tougher challenge: 3,000 expert-crafted questions across 100+ subjects.

But here’s the twist: not only do current AI systems struggle with this exam, scoring below 10% accuracy, but most humans likely wouldn’t pass it either.

The Hardest AI Benchmark Yet

Humanity’s Last Exam was developed to measure AI’s academic knowledge at the frontiers of human expertise. Crafted by nearly 1,000 subject experts from over 500 institutions in 50 countries, the questions span a range of disciplines, including complex scientific reasoning, the humanities, and multimodal analysis. They are designed to stump both AI systems and the average human.

Key Features of HLE:

  • Question Formats: The dataset includes exact-match and multiple-choice questions, with 10% involving multimodal challenges that combine text and images.

  • Calibration Assessment: AI systems must provide both an answer and a confidence score (0-100%), so the benchmark gauges not only accuracy but also how well a model expresses uncertainty, a key metric for reducing hallucinations (see the scoring sketch after this list).

  • Private Test Set: A portion of the dataset is withheld to prevent overfitting and ensure robust evaluation of future models.
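To make these scoring mechanics concrete, the sketch below shows one way an exact-match evaluation with a confidence-based calibration check could be implemented. It is a minimal illustration only: the Prediction structure, field names, grading rule, and binned calibration estimate are assumptions made for this article, not the official HLE evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """One model response to an HLE-style question (illustrative structure, not the real schema)."""
    question_id: str
    answer: str        # model's final answer (free-form or chosen option)
    confidence: float  # self-reported confidence, 0-100
    gold: str          # reference answer held by the grader

def is_correct(pred: Prediction) -> bool:
    """Exact-match grading: normalize case and whitespace, then compare to the gold answer."""
    return pred.answer.strip().lower() == pred.gold.strip().lower()

def evaluate(predictions: list[Prediction]) -> dict:
    """Report accuracy plus a simple binned calibration error."""
    if not predictions:
        return {"accuracy": 0.0, "calibration_error": 0.0}

    correct = [is_correct(p) for p in predictions]
    accuracy = sum(correct) / len(predictions)

    # Bucket predictions into 10 confidence bins and compare average stated confidence
    # with observed accuracy in each bin (a basic expected-calibration-error estimate).
    bins: dict[int, list[tuple[float, bool]]] = {}
    for p, ok in zip(predictions, correct):
        bin_idx = min(int(p.confidence // 10), 9)
        bins.setdefault(bin_idx, []).append((p.confidence / 100, ok))

    ece = 0.0
    for members in bins.values():
        avg_conf = sum(conf for conf, _ in members) / len(members)
        bin_acc = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / len(predictions)) * abs(avg_conf - bin_acc)

    return {"accuracy": accuracy, "calibration_error": ece}

# Placeholder predictions (not real HLE questions or answers), just to show the output shape.
preds = [
    Prediction("q1", "three pairs", 95, "two pairs"),  # confidently wrong
    Prediction("q2", "option B", 80, "option B"),      # correct
]
print(evaluate(preds))
```

The intuition behind the calibration check is simple: a model that reports 90% confidence should be right roughly 90% of the time, and a large gap between stated confidence and observed accuracy shows up as a high calibration error.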

Sample Questions:

How tough is it? Here are three sample questions:

  • Sample Question 1: "Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone?"

  • Sample Question 2: "I am providing the standardized Biblical Hebrew source text from the Biblia Hebraica Stuttgartensia (Psalms 104:7). Your task is to distinguish between closed and open syllables. Please identify and list all closed syllables (ending in a consonant sound) based on the latest research on the Tiberian pronunciation tradition of Biblical Hebrew by scholars such as Geoffrey Khan, Aaron D. Hornkohl, Kim Phillips, and Benjamin Suchard. Medieval sources, such as the Karaite transcription manuscripts, have enabled modern researchers to better understand specific aspects of Biblical Hebrew pronunciation in the Tiberian tradition, including the qualities and functions of the shewa and which letters were pronounced as consonants at the ends of syllables."

  • Sample Question 3: "In Greek mythology, who was Jason’s maternal great-grandfather?"

Most people (and AIs) wouldn’t stand a chance with these questions.

How AI Models Perform on HLE

Despite their dominance on benchmarks like MMLU and MATH, top AI models struggle with HLE. Reported accuracy rates include:

  • OpenAI’s GPT-4o: 3.3%

  • Grok-2: 3.8%

  • Claude 3.5 Sonnet: 4.3%

  • Gemini: 6.2%

  • o1: 9.1%

  • DeepSeek-R1 (text-only model): 9.4%

HLE is substantially harder than even the most rigorous existing tests like GPQA and MATH, showing that current AI reasoning capabilities still have a long way to go.

The Bigger Picture

HLE’s creators believe that while current AI models may exceed 50% accuracy on this benchmark by the end of 2025, achieving expert-level performance on structured, academic problems is only part of the puzzle.

HLE is not a measure of artificial general intelligence (AGI) or creative problem-solving. Instead, it focuses on closed-ended, verifiable questions requiring deep technical knowledge and reasoning. However, the rapid pace of AI development suggests that such challenges will eventually be overcome. HLE might be the final academic test for AI models, but it’s only the beginning when it comes to benchmarking the broader capabilities of artificial intelligence.

What HLE Means for AI and Humanity

Humanity’s Last Exam isn’t just a test of AI capabilities—it’s a call to action for researchers, developers, and policymakers. The benchmark provides a clear reference point for:

  • Tracking Progress: Monitoring how quickly AI models improve in tackling expert-level problems.

  • Policy Discussions: Informing regulatory efforts and governance strategies as AI systems evolve.

  • Research Collaboration: Crowdsourcing challenges from experts to refine and push the limits of AI benchmarks.

With a $500,000 prize pool incentivizing high-quality question submissions, HLE also encourages innovation in benchmark design.

Could You Pass Humanity’s Last Exam?

The exam is so difficult that many humans, even experts, would struggle to achieve passing scores. But this difficulty serves a purpose: as AI continues to advance, tests like HLE ensure that we can effectively measure progress, identify limitations, and understand the potential risks of increasingly powerful systems.

For now, no AI—or human—has mastered Humanity’s Last Exam. But as AI capabilities grow, the moment when an LLM aces this benchmark will signal a major milestone in the development of artificial intelligence.

What This Means

Humanity’s Last Exam underscores the complexity and depth needed to challenge advanced AI systems. Upcoming models, like OpenAI’s o3, have the potential to significantly improve results, narrowing the gap between current AI capabilities and expert-level performance. While current models fall short, the rapid evolution of AI suggests breakthroughs on this benchmark are not far off. As a measure of AI progress, HLE also sparks vital discussions about safety, governance, and the implications of AI systems achieving human-level reasoning.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.