
Meta Denies Llama 4 AI Benchmark Cheating Allegations

Image: An AI-generated illustration of a tech workspace, with AI model benchmarks and performance graphs on a monitor, a Meta logo on screen, and a person typing in the foreground.

Image Source: ChatGPT-4o


Meta is pushing back against rumors that it manipulated benchmark scores for its newly released Llama 4 models by training on evaluation data—a move that would misrepresent the models’ actual capabilities.

Ahmad Al-Dahle, Meta’s Vice President of Generative AI, addressed the rumors in a post on X, firmly denying that the company trained its Llama 4 Maverick and Scout models on benchmark test sets. He called the claim “simply not true” and explained that using test sets during training would artificially boost performance scores and misrepresent real-world capability.

“We’ve also heard claims that we trained on test sets—that’s simply not true and we would never do that,” Al-Dahle wrote. “Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”

The speculation began circulating on Reddit and X over the weekend, reportedly stemming from a Chinese social media post by someone claiming to have resigned from Meta over concerns about benchmarking practices. The poster alleged that leadership proposed mixing test sets into post-training to generate results that “look okay” across various metrics.
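For context, "training on test sets" refers to benchmark contamination: if evaluation questions leak into a model's training data, its scores measure memorization rather than capability. Researchers often screen for this with simple n-gram overlap checks between training documents and benchmark items. The sketch below is purely illustrative of that general technique; the function and variable names are hypothetical, and nothing here reflects Meta's actual training or evaluation pipeline.

```python
# Illustrative sketch of an n-gram overlap decontamination check.
# Hypothetical names; not Meta's pipeline.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def check_contamination(training_docs: list[str], test_items: list[str], n: int = 8) -> list[int]:
    """Return indices of test items that share any word n-gram with the training corpus."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & train_grams]

if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    test = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "an entirely different question about protein folding",
    ]
    print(check_contamination(train, test))  # -> [0], the first test item overlaps the training data
```

Checks of this kind are routinely reported in model cards and technical reports precisely because contaminated benchmarks inflate scores in the way Al-Dahle describes.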

Reports of Performance Issues Fuel the Controversy

The launch of Llama 4 has drawn interest across the AI community, but some users have reported inconsistent performance, particularly discrepancies between the publicly downloadable models and the versions evaluated in benchmark environments such as LM Arena.

Researchers have noted stark behavioral differences between the downloadable version of Maverick and the version used in LM Arena scores, raising questions about how the models were evaluated. Compounding these concerns, Meta reportedly used an experimental, unreleased version of Maverick to achieve higher benchmark scores.

Al-Dahle acknowledged the performance gap in some environments, explaining that the rollout is ongoing and that public implementations may take several days to stabilize.

“We dropped the models as soon as they were ready,” Al-Dahle wrote. “We’ll keep working through our bug fixes and onboarding partners.”

Internal Whistleblower Adds Fuel to the Fire

In a widely shared Substack post, former Meta AI team member Tony Peng shared an anonymous account from a whistleblower who claimed to have resigned in protest over the Llama 4 project.

According to the account, the internal model repeatedly underperformed during testing, and leadership suggested manipulating benchmark data to produce more favorable results. The whistleblower stated that unless performance improved by the end of April, Meta might cut further investment in the project. The individual also asked that their name be removed from Llama 4’s technical documentation and claimed that Meta’s VP of AI had resigned for similar reasons.

While the claims remain unverified, they’ve added to the scrutiny surrounding the launch, particularly as questions mount over transparency and trust in open-source AI development.

What This Means

As AI benchmarks play an increasingly central role in shaping industry perception and adoption, even the suggestion of data manipulation can damage trust—especially for a company like Meta seeking to position itself as a leader in open-source AI.

The controversy highlights broader industry challenges around transparency, reproducibility, and responsible model reporting. It also underscores the tension between open-source ideals and the pressures of performance metrics, especially when public benchmarks heavily influence funding, partnerships, and reputation.

If internal dissent continues to surface and real-world performance lags, Meta may face growing calls to disclose more about its evaluation processes—and to ensure its open models meet the standards their benchmarks promise.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.