
Meta Denies Llama 4 AI Benchmark Cheating Allegations

Image: An AI-generated illustration of a tech workspace, with AI model benchmarks and performance graphs on a monitor, a Meta logo on screen, and a person typing in the foreground.

Image Source: ChatGPT-4o


Meta is pushing back against rumors that it manipulated benchmark scores for its newly released Llama 4 models by training on evaluation data—a move that would misrepresent the models’ actual capabilities.

Ahmad Al-Dahle, Meta’s Vice President of Generative AI, addressed the rumors in a post on X, firmly denying that the company trained its Llama 4 Maverick and Scout models on benchmark test sets. He called the claim “simply not true” and explained that using test sets during training would artificially boost performance scores and misrepresent real-world capability.

“We’ve also heard claims that we trained on test sets—that’s simply not true and we would never do that,” Al-Dahle wrote. “Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”

The speculation began circulating on Reddit and X over the weekend, reportedly stemming from a Chinese social media post by someone claiming to have resigned from Meta over concerns about benchmarking practices. The poster alleged that leadership proposed mixing test sets into post-training to generate results that “look okay” across various metrics.
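For context, "training on test sets" refers to benchmark contamination: if evaluation questions leak into a model's training data, its scores measure memorization rather than capability. Researchers often screen for this with simple n-gram overlap checks between training documents and benchmark items. The sketch below is purely illustrative of that general technique; the function and variable names are hypothetical, and nothing here reflects Meta's actual training or evaluation pipeline.

```python
# Illustrative sketch of an n-gram overlap decontamination check.
# Hypothetical names; not Meta's pipeline.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def check_contamination(training_docs: list[str], test_items: list[str], n: int = 8) -> list[int]:
    """Return indices of test items that share any word n-gram with the training corpus."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & train_grams]

if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    test = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "an entirely different question about protein folding",
    ]
    print(check_contamination(train, test))  # -> [0], the first test item overlaps the training data
```

Checks of this kind are routinely reported in model cards and technical reports precisely because contaminated benchmarks inflate scores in the way Al-Dahle describes.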

Reports of Performance Issues Fuel the Controversy

The launch of Llama 4 has drawn interest across the AI community, but some users have reported inconsistent performance, particularly discrepancies between the publicly downloadable models and the versions evaluated in benchmark environments such as LM Arena.

Researchers have noted stark behavioral differences between the downloadable version of Maverick and the version used in LM Arena scores, raising questions about how the models were evaluated. Compounding these concerns, Meta reportedly used an experimental, unreleased version of Maverick to achieve higher benchmark scores.

Al-Dahle acknowledged the performance gap in some environments, explaining that the rollout is ongoing and that public implementations may take several days to stabilize.

“We dropped the models as soon as they were ready,” Al-Dahle wrote. “We’ll keep working through our bug fixes and onboarding partners.”

Internal Whistleblower Adds Fuel to the Fire

In a widely shared Substack post, former Meta AI team member Tony Peng shared an anonymous account from a whistleblower who claimed to have resigned in protest over the Llama 4 project.

According to the account, the internal model repeatedly underperformed during testing, and leadership suggested manipulating benchmark data to produce more favorable results. The whistleblower stated that unless performance improved by the end of April, Meta might cut further investment in the project. The individual also asked that their name be removed from Llama 4’s technical documentation and claimed that Meta’s VP of AI had resigned for similar reasons.

While the claims remain unverified, they’ve added to the scrutiny surrounding the launch, particularly as questions mount over transparency and trust in open-source AI development.

What This Means

As AI benchmarks play an increasingly central role in shaping industry perception and adoption, even the suggestion of data manipulation can damage trust—especially for a company like Meta seeking to position itself as a leader in open-source AI.

The controversy highlights broader industry challenges around transparency, reproducibility, and responsible model reporting. It also underscores the tension between open-source ideals and the pressures of performance metrics, especially when public benchmarks heavily influence funding, partnerships, and reputation.

If internal dissent continues to surface and real-world performance lags, Meta may face growing calls to disclose more about its evaluation processes—and to ensure its open models meet the standards their benchmarks promise.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.