Crowdsourced AI Benchmarks Under Fire from Experts

As AI labs race to refine and evaluate their models, many have turned to crowdsourced benchmarking platforms like Chatbot Arena to assess performance. Labs such as OpenAI, Google, and Meta are increasingly using community-driven tools that allow users to compare model outputs and vote on the best responses. While these platforms provide accessible, large-scale feedback, critics argue they are flawed — both ethically and scientifically.

One of the loudest critics is Emily Bender, a linguistics professor at the University of Washington and co-author of The AI Con. She argues that platforms like Chatbot Arena lack construct validity — a benchmark’s ability to truly measure what it claims. “Chatbot Arena hasn’t shown that voting for one output over another actually correlates with preferences, however they may be defined,” Bender said. She contends that valid benchmarks must have clearly defined goals and measurable constructs, which many crowdsourced systems lack.

The Industry’s Growing Dependence on Public Evaluation

The basic premise of platforms like Chatbot Arena is simple: users interact with two anonymous models, prompt them with questions, and select the preferred answer. High scores on the platform often become PR fodder, helping companies claim model superiority. But Asmelash Teka Hadgu, co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, warns that this process is being exploited. He points to Meta's Llama 4 Maverick as a prime example: Meta showcased a version of the model fine-tuned to score well on Chatbot Arena, while the version it ultimately released performed worse on the platform.
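To make the mechanism concrete, the sketch below shows one simplified way that head-to-head votes can be turned into a leaderboard, using Elo-style rating updates. It is an illustration only: Chatbot Arena has described using Bradley-Terry-style statistics rather than this exact scheme, and the model names, K-factor, and vote list here are hypothetical.

```python
# Simplified, illustrative sketch of turning pairwise votes into a leaderboard
# with Elo-style updates. This is NOT Chatbot Arena's actual implementation;
# the model names, K-factor, and votes below are hypothetical.

from collections import defaultdict

K = 32  # hypothetical step size controlling how much one vote moves a rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under an Elo-style model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome of one head-to-head vote."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser] -= K * (1.0 - e_win)

# Hypothetical stream of (winner, loser) votes from anonymous side-by-side prompts.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]

ratings = defaultdict(lambda: 1000.0)  # every model starts from the same baseline
for winner, loser in votes:
    update(ratings, winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Because a ranking like this is driven entirely by which answer voters happen to prefer, a model tuned specifically to win such votes can climb the leaderboard without any corresponding gain on the domain-specific tasks critics like Hadgu care about.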

Hadgu suggests that AI benchmarks should be managed differently: dynamic instead of static, and distributed across independent institutions like universities or professional organizations. Each benchmark should be tailored to specific domains — such as healthcare or education — and involve real-world practitioners who use AI tools in their daily work.

Ethical concerns also extend to the treatment of evaluators. Kristine Gloria, formerly of the Aspen Institute’s tech initiative, warns that AI labs risk repeating the exploitative practices of the data labeling industry. “Benchmarks should never be the only metric for evaluation,” she said, adding that contributors deserve to be compensated. Although the crowdsourced nature of platforms like Chatbot Arena can enrich evaluation with diverse perspectives, she says this alone doesn’t justify unpaid labor.

Matt Fredrikson, CEO of Gray Swan AI, which runs crowdsourced red-teaming campaigns for models, agrees. While his platform offers cash prizes and the chance to build skills as incentives, he emphasizes that public feedback must be paired with private, paid assessments to ensure thorough evaluations. "Internal benchmarks, algorithmic red teams, and contracted experts bring the domain knowledge and depth needed," Fredrikson noted.

Even leaders within the open benchmarking space acknowledge its limitations. Alex Atallah, CEO of OpenRouter, and Wei-Lin Chiang, a UC Berkeley PhD student and co-founder of LMArena, both agree that open testing is not enough. Chiang clarified that Chatbot Arena’s aim is to “measure our community’s preferences,” not provide definitive assessments of model quality. He says the Maverick incident wasn’t due to flaws in the system itself, but rather a misinterpretation of platform policies by Meta.

Since then, Chatbot Arena’s team has updated its policies to better ensure fair, transparent evaluations, and to prevent future discrepancies. Chiang emphasized that participation is voluntary and community-driven: “People use LMArena because we give them an open, transparent place to engage with AI and give collective feedback.”

Still, critics maintain that industry reliance on crowdsourced leaderboards for serious evaluation may be misguided. While they offer accessibility and democratized input, these platforms — unless improved — fall short of delivering the scientific rigor and ethical clarity that real-world AI applications demand.
