Meta is pushing back against growing online speculation that it manipulated benchmark results for its latest AI models, Llama 4 Maverick and Llama 4 Scout. The company is facing scrutiny after rumors spread that the models were secretly trained on the test sets used to evaluate AI performance, an act that, if true, would artificially inflate benchmark scores.
Ahmad Al-Dahle, Meta’s VP of generative AI, publicly addressed the controversy on Monday, taking to X (formerly Twitter) to set the record straight. “It’s simply not true,” he wrote, rejecting the claim that the models were trained using any benchmark test sets. In AI development, test sets are critical for measuring model accuracy and generalization after training is complete. Using them during training would undermine the credibility of benchmark scores and misrepresent a model’s real-world capability.
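For readers unfamiliar with the practice Al-Dahle is referring to, the convention is a strict separation between the data a model is trained on and a held-out test set it is scored on afterward. The sketch below illustrates that separation in the simplest possible terms; it uses scikit-learn and a toy dataset purely as an illustration of the principle, and has nothing to do with Meta's actual training pipeline.

```python
# Illustrative only: the standard train/test separation that benchmark
# scores depend on. The dataset and model here are placeholders, not Llama 4.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out a test set BEFORE training; it must never influence the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Benchmark-style evaluation: score only on the held-out examples.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Training on X_test / y_test as well would inflate this number,
# which is exactly the kind of contamination the rumors allege.
```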
The controversy traces back to an unverified post on a Chinese social media platform, in which an alleged former Meta employee claimed to have resigned over the company’s AI evaluation practices. That post sparked discussion across X and Reddit over the weekend, leading to viral accusations that Meta may have optimized its models purely to look good on standard leaderboards, especially the popular LM Arena benchmark.
Some users and researchers added fuel to the fire, noting that Maverick’s public version behaves differently from the one hosted on LM Arena, where it achieved higher scores. The inconsistencies led to further speculation that Meta may have used an unreleased or specially tuned variant of Maverick for the benchmark showcase rather than the version available to developers.
Al-Dahle didn’t shy away from addressing the performance gaps, admitting that users may experience “mixed quality” depending on where the models are deployed. He attributed this to Meta’s rapid release of the models, with the understanding that it would take time for partner platforms and cloud providers to properly calibrate their implementations.
“Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in,” Al-Dahle explained. He also confirmed that Meta is actively working on bug fixes and helping partners onboard the models to smooth out these inconsistencies.
Although benchmark results remain a hot topic in the AI space, Meta maintains that Llama 4 Maverick and Scout were developed and released in good faith. The company is urging the AI community to remain patient as updates and improvements continue to roll out across platforms.
Still, the debate over benchmark transparency continues to cast a shadow over the AI industry. As more companies compete to dominate AI performance charts, questions around model evaluation, reproducibility, and fair comparison are becoming central to how trust is built, or lost, in generative AI.