Hugging Face Community Evals: Decentralized and Transparent Model Evaluation

The Problem: Model Evaluation is Broken

In 2026, AI model evaluation faces a credibility crisis. Classic benchmarks are saturated: MMLU scores sit above 91%, GSM8K above 94%, and HumanEval is effectively solved. Yet models that ace these benchmarks often still can’t:

  • Reliably browse the web
  • Write production code
  • Handle multi-step tasks without hallucinating

There’s a clear gap between benchmark scores and real-world performance.

The Single Source of Truth Problem

Another gap: different sources report different results. Scores in model cards, papers, and evaluation platforms rarely agree, so the community lacks a single source of truth.

The Solution: Community Evals

Hugging Face is taking evaluations on the Hub in a new direction by decentralizing reporting and allowing the entire community to openly report scores for benchmarks.

How It Works

For Benchmarks

  • Dataset repos can now register as benchmarks
  • MMLU-Pro, GPQA, and HLE are already live
  • Automatically aggregate reported results from across the Hub
  • Display leaderboards in the dataset card
  • The benchmark defines the eval spec via eval.yaml, based on the Inspect AI format
  • Reported results need to align with the task definition
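
The exact eval.yaml schema isn’t spelled out here. As an illustration only, a benchmark spec in an Inspect AI-style format might look roughly like this (all field names below are assumptions, not the official schema):

```yaml
# Hypothetical eval.yaml for a benchmark dataset repo.
# Field names are illustrative assumptions, not the official schema.
task: mmlu_pro              # task identifier that reported results must match
dataset: TIGER-Lab/MMLU-Pro
split: test
metrics:
  - name: accuracy
    higher_is_better: true
solver: multiple_choice     # Inspect AI-style solver for the task
scorer: choice              # Inspect AI-style scorer
```

The key point is that the benchmark repo, not the model repo, owns the task definition, so all reported results can be checked against one spec.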

For Models

  • Eval scores live in .eval_results/*.yaml in the model repo
  • Appear on the model card and are fed into benchmark datasets
  • Both the model author’s results and open pull requests for results will be aggregated
  • Model authors can close score PRs and hide results
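
As a sketch of what publishing a score could look like on the model side, the snippet below writes a result file under .eval_results/ in a local repo checkout. The YAML field names are assumptions for illustration, not the official schema:

```python
from pathlib import Path

def write_eval_result(repo_dir: str, benchmark: str, metric: str, value: float) -> Path:
    """Write a hypothetical eval result YAML under .eval_results/ in a local repo clone.

    The field names in the YAML body are illustrative assumptions, not the official schema.
    """
    results_dir = Path(repo_dir) / ".eval_results"
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{benchmark}.yaml"
    path.write_text(
        f"benchmark: {benchmark}\n"
        f"metric: {metric}\n"
        f"value: {value}\n"
        "source: self-reported\n"
    )
    return path
```

Committing such a file to the model repo, whether directly by the author or via a pull request from anyone else, is what surfaces the score on the model card and feeds it into the benchmark leaderboard.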

For the Community

  • Any user can submit evaluation results for any model via a PR
  • Results are shown as “community”, without waiting for model authors
  • The community can link to sources such as a paper, model card, third-party evaluation platform, or Inspect eval logs
  • The community can discuss scores like any PR
  • Since the Hub is Git based, there’s a history of when evals were added, when changes were made, etc.
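
Because results are plain files in a Git repo, a community submission can be sketched with huggingface_hub’s upload_file, which can open a pull request instead of committing directly. The result-file path layout and the idea of passing the YAML as text are assumptions for illustration:

```python
def result_path(benchmark: str) -> str:
    """Path layout for result files; an assumption for illustration."""
    return f".eval_results/{benchmark}.yaml"

def submit_community_result(repo_id: str, benchmark: str, yaml_text: str):
    """Open a pull request adding an eval result file to another user's model repo."""
    from huggingface_hub import upload_file  # third-party: pip install huggingface_hub

    return upload_file(
        path_or_fileobj=yaml_text.encode(),
        path_in_repo=result_path(benchmark),
        repo_id=repo_id,
        repo_type="model",
        create_pr=True,  # submit as a PR instead of committing directly
        commit_message=f"Add community eval result for {benchmark}",
    )
```

Opening the change as a PR is what lets the score be reviewed, discussed, and accepted or closed by the model author like any other contribution.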

Why This Matters

  1. Transparency: Exposing scores that already exist across the community in sources like model cards and papers
  2. Aggregation: The community can build on these results to aggregate, track, and understand scores across the field
  3. APIs: All scores will be exposed via Hub APIs, making it easy to aggregate and build curated leaderboards, dashboards, etc.

Clear Limitations

Hugging Face is honest about what this doesn’t solve:

  • It doesn’t solve benchmark saturation
  • It doesn’t close the benchmark-reality gap
  • It won’t stop training on test sets

But it makes the game visible by exposing what is evaluated, how, when, and by whom.

The Future: Reproducible Benchmarks

Above all, Hugging Face hopes to make the Hub an active place to build and share reproducible benchmarks, with a particular focus on:

  • New tasks
  • New domains
  • Challenges that better test SOTA models

Get Started

  1. Add eval results: Publish the evals you conducted as YAML files in .eval_results/ on any model repo
  2. Check out the scores: Check the scores on the benchmark dataset
  3. Register a new benchmark: Add eval.yaml to your dataset repo and contact us to be included in the shortlist
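
To check which scores a given model already publishes, one can list the repo’s files and keep only the result YAMLs. This is a sketch assuming the .eval_results/ layout described above; list_repo_files is an existing huggingface_hub call, but the live lookup needs network access:

```python
def eval_result_files(files: list[str]) -> list[str]:
    """Filter a repo file listing down to eval result YAMLs (assumed layout)."""
    return [f for f in files if f.startswith(".eval_results/") and f.endswith(".yaml")]

def list_model_eval_results(repo_id: str) -> list[str]:
    """List eval result files published in a model repo on the Hub."""
    from huggingface_hub import list_repo_files  # third-party: pip install huggingface_hub

    return eval_result_files(list_repo_files(repo_id, repo_type="model"))
```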

The feature is in beta, and Hugging Face is building in the open. Feedback is welcome.

