Hugging Face Community Evals: Decentralized and Transparent Model Evaluation

The Problem: Model Evaluation is Broken

In 2026, AI model evaluation faces a credibility crisis. Classic benchmarks are saturated: MMLU scores sit above 91%, GSM8K above 94%, and HumanEval is effectively solved. Yet models that ace these benchmarks often still can’t:

  • Reliably browse the web
  • Write production code
  • Handle multi-step tasks without hallucinating

There’s a clear gap between benchmark scores and real-world performance.

The Single Source of Truth Problem

Another gap: different sources report different results. Scores in model cards, papers, and evaluation platforms rarely agree, so the community lacks a single source of truth.

The Solution: Community Evals

Hugging Face is taking evaluations on the Hub in a new direction by decentralizing reporting and allowing the entire community to openly report scores for benchmarks.

How It Works

For Benchmarks

  • Dataset repos can now register as benchmarks
  • MMLU-Pro, GPQA, and HLE are already live
  • Automatically aggregate reported results from across the Hub
  • Display leaderboards in the dataset card
  • The benchmark defines the eval spec via eval.yaml, based on the Inspect AI format
  • Reported results need to align with the task definition
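
The exact eval.yaml schema isn’t spelled out here. As an illustration only, a benchmark spec in an Inspect AI-style format might look roughly like this (all field names below are assumptions, not the official schema):

```yaml
# Hypothetical eval.yaml for a benchmark dataset repo.
# Field names are illustrative assumptions, not the official schema.
task: mmlu_pro              # task identifier that reported results must match
dataset: TIGER-Lab/MMLU-Pro
split: test
metrics:
  - name: accuracy
    higher_is_better: true
solver: multiple_choice     # Inspect AI-style solver for the task
scorer: choice              # Inspect AI-style scorer
```

The key point is that the benchmark repo, not the model repo, owns the task definition, so all reported results can be checked against one spec.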

For Models

  • Eval scores live in .eval_results/*.yaml in the model repo
  • Appear on the model card and are fed into benchmark datasets
  • Both the model author’s results and open pull requests for results will be aggregated
  • Model authors can close score PRs and hide results
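
As a sketch of what publishing a score could look like on the model side, the snippet below writes a result file under .eval_results/ in a local repo checkout. The YAML field names are assumptions for illustration, not the official schema:

```python
from pathlib import Path

def write_eval_result(repo_dir: str, benchmark: str, metric: str, value: float) -> Path:
    """Write a hypothetical eval result YAML under .eval_results/ in a local repo clone.

    The field names in the YAML body are illustrative assumptions, not the official schema.
    """
    results_dir = Path(repo_dir) / ".eval_results"
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{benchmark}.yaml"
    path.write_text(
        f"benchmark: {benchmark}\n"
        f"metric: {metric}\n"
        f"value: {value}\n"
        "source: self-reported\n"
    )
    return path
```

Committing such a file to the model repo, whether directly by the author or via a pull request from anyone else, is what surfaces the score on the model card and feeds it into the benchmark leaderboard.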

For the Community

  • Any user can submit evaluation results for any model via a PR
  • Results are shown as “community”, without waiting for model authors
  • The community can link to sources such as a paper, model card, third-party evaluation platform, or Inspect eval logs
  • The community can discuss scores like any PR
  • Since the Hub is Git based, there’s a history of when evals were added, when changes were made, etc.
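
Because results are plain files in a Git repo, a community submission can be sketched with huggingface_hub’s upload_file, which can open a pull request instead of committing directly. The result-file path layout and the idea of passing the YAML as text are assumptions for illustration:

```python
def result_path(benchmark: str) -> str:
    """Path layout for result files; an assumption for illustration."""
    return f".eval_results/{benchmark}.yaml"

def submit_community_result(repo_id: str, benchmark: str, yaml_text: str):
    """Open a pull request adding an eval result file to another user's model repo."""
    from huggingface_hub import upload_file  # third-party: pip install huggingface_hub

    return upload_file(
        path_or_fileobj=yaml_text.encode(),
        path_in_repo=result_path(benchmark),
        repo_id=repo_id,
        repo_type="model",
        create_pr=True,  # submit as a PR instead of committing directly
        commit_message=f"Add community eval result for {benchmark}",
    )
```

Opening the change as a PR is what lets the score be reviewed, discussed, and accepted or closed by the model author like any other contribution.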

Why This Matters

  1. Transparency: Exposing scores that already exist across the community in sources like model cards and papers
  2. Aggregation: The community can build on these results to aggregate, track, and understand scores across the field
  3. APIs: All scores will be exposed via Hub APIs, making it easy to aggregate and build curated leaderboards, dashboards, etc.

Clear Limitations

Hugging Face is honest about what this doesn’t solve:

  • It doesn’t solve benchmark saturation
  • It doesn’t close the benchmark-reality gap
  • It won’t stop training on test sets

But it makes the game visible by exposing what is evaluated, how, when, and by whom.

The Future: Reproducible Benchmarks

Above all, Hugging Face hopes to make the Hub an active place to build and share reproducible benchmarks, with a particular focus on:

  • New tasks
  • New domains
  • Challenges that better test SOTA models

Get Started

  1. Add eval results: Publish the evals you conducted as YAML files in .eval_results/ on any model repo
  2. Check out the scores: Check the scores on the benchmark dataset
  3. Register a new benchmark: Add eval.yaml to your dataset repo and contact us to be included in the shortlist
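
To check which scores a given model already publishes, one can list the repo’s files and keep only the result YAMLs. This is a sketch assuming the .eval_results/ layout described above; list_repo_files is an existing huggingface_hub call, but the live lookup needs network access:

```python
def eval_result_files(files: list[str]) -> list[str]:
    """Filter a repo file listing down to eval result YAMLs (assumed layout)."""
    return [f for f in files if f.startswith(".eval_results/") and f.endswith(".yaml")]

def list_model_eval_results(repo_id: str) -> list[str]:
    """List eval result files published in a model repo on the Hub."""
    from huggingface_hub import list_repo_files  # third-party: pip install huggingface_hub

    return eval_result_files(list_repo_files(repo_id, repo_type="model"))
```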

The feature is in beta, and Hugging Face is building in the open. Feedback is welcome.

