University of Montreal Study Shows AI Beats Average Humans on Creativity Tests
The largest human-AI comparison ever conducted provides objective evidence that, on a standard divergent-thinking test, half of humanity now scores below language models—but the top 10% of humans operate in territory no AI can reach.
The Research: 100,000 Humans vs. Four AI Giants
A study published January 21, 2026 in Scientific Reports (Nature) by University of Montreal Professor Karim Jerbi delivers the first large-scale objective measurement of where AI creativity actually stands against the human baseline. The research team—which includes deep learning pioneer Yoshua Bengio and collaborators from Mila (Quebec AI Institute), Google DeepMind, University of Toronto, and Concordia University—tested GPT-4, ChatGPT, Claude, and Gemini against over 100,000 human participants.
The primary benchmark was the Divergent Association Task (DAT), a standardized psychological test requiring subjects to generate ten words with maximum semantic distance from each other. Think “telescope,” “mushroom,” “legislation,” “whisper”—concepts so unrelated that linking them requires genuine cognitive flexibility. The test measures divergent thinking, the cognitive process most closely associated with creative potential.
The results split cleanly down the middle. AI models outperformed the lower half of human participants on DAT scores. But the highest-scoring humans exceeded all four AI models, and the top 10% occupied a performance band that no AI approached.
The Median Line Crossed
The significance isn’t that AI scored well on a creativity test. It’s that we now have empirical proof of exactly where the human-AI capability boundary sits—and that boundary runs through the middle of humanity.
For the past two years, the creativity debate has been theoretical. Can machines truly create? Is it just sophisticated pattern matching? The Montreal study sidesteps the philosophical question entirely. It doesn’t ask whether AI creativity is “real.” It asks whether AI output is measurably more divergent than human output. For half of humans tested, the answer is yes.
This creates a new competitive landscape. Roles that required average creative output—brainstorming sessions, initial concept generation, baseline content creation—now have a non-human alternative that performs comparably or better on objective metrics. The 100,000-person sample size removes any argument about statistical noise or cherry-picked results.
But the study reveals something more important than AI capability. It reveals human capability variance. The gap between median human performance and top-decile human performance on DAT exceeds the gap between median human performance and GPT-4 performance. Put differently: the difference between an average creative and an exceptional creative is larger than the difference between an average creative and a machine.
The strategic moat isn’t “human creativity”—it’s exceptional human creativity.
Winners in this landscape are organizations that can identify and deploy top-quartile creative talent while using AI to handle median-level creative work. Losers are those who assume “we have humans, so we’re covered” without differentiating between human performance tiers.
Technical Depth: What DAT Actually Measures and Why AI Struggles Beyond It
The Divergent Association Task works by calculating semantic distance using word embedding models. When you generate ten words, the algorithm computes average pairwise distance between all word vectors in high-dimensional semantic space. Words that rarely appear in similar contexts across training corpora have greater semantic distance.
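The scoring mechanics described above can be sketched directly. The toy 3-dimensional vectors below are illustrative stand-ins; real DAT implementations use trained word embeddings (such as GloVe) in high-dimensional space:

```python
# Sketch of DAT-style scoring: average pairwise semantic distance between
# word vectors. The tiny hand-made 3-d "embeddings" here are assumptions
# for illustration; production scoring uses trained embedding models.
from itertools import combinations
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def dat_score(vectors):
    """Mean pairwise cosine distance across all word pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Near-synonyms cluster in embedding space; unrelated concepts spread out.
close = [(1.0, 0.1, 0.0), (0.9, 0.2, 0.1), (1.0, 0.0, 0.1)]
far   = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

assert dat_score(far) > dat_score(close)  # divergent word sets score higher
```

Words that rarely co-occur end up far apart in embedding space, so a list like “telescope, mushroom, legislation, whisper” scores high, while a list of near-synonyms scores low.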
This is precisely what large language models are architecturally optimized to understand—semantic relationships between tokens. The surprise isn’t that GPT-4 performs well on DAT. The surprise is that it doesn’t dominate completely.
The research team discovered that temperature settings and etymology-based prompting strategies could boost AI creativity scores. Higher temperature increases sampling randomness, allowing models to select less probable tokens. Etymology prompting—asking the model to consider word origins and relationships—seemed to activate more diverse association pathways.
These findings reveal the mechanism behind AI creativity: it’s fundamentally a function of sampling breadth across learned probability distributions. Increase randomness, and you get more divergent output. But there’s a ceiling. The model can only sample from what it has learned, and its learned space, while vast, is bounded by training data.
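The temperature mechanism is simple to demonstrate. Dividing logits by a temperature above 1 flattens the softmax distribution, raising the probability of otherwise-unlikely tokens; entropy quantifies the spread (the logit values below are hypothetical):

```python
# Sketch of temperature's effect on sampling breadth: scaling logits by
# 1/T before softmax flattens (T > 1) or sharpens (T < 1) the
# distribution over next tokens.
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]            # hypothetical next-token logits
cold = softmax(logits, temperature=0.7)  # sharper: favors the top token
hot  = softmax(logits, temperature=1.2)  # flatter: rarer tokens surface

assert entropy(hot) > entropy(cold)      # higher T -> broader sampling
```

Broader sampling is exactly what a divergence test rewards, which is why temperature tuning moves DAT scores.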
Top creative human performers likely succeed through mechanisms unavailable to current architectures. They draw on embodied experience, emotional resonance, cross-domain expertise integration, and genuine novelty generation rather than recombination of existing patterns. The DAT captures one dimension of creativity—semantic divergence—and even there, top humans outperform.
Where the gap widens dramatically is what researchers call “richer narrative tasks.” The study tested participants and AI models on haiku composition, movie plot synopses, and flash fiction. Humans significantly outperformed AI on all three narrative benchmarks.
Why Narrative Beats AI: Constraint Satisfaction Meets Meaning Generation
Haikus require a 5-7-5 syllable structure while conveying genuine insight about human experience. Movie plots require narrative tension that satisfies genre conventions while subverting expectations. Flash fiction demands character motivation that feels psychologically authentic.
Each of these tasks combines formal constraint satisfaction—which AI handles competently—with meaning generation that must resonate with human readers. The AI can produce technically correct haikus. It struggles to produce haikus that make a reader pause and feel something true.
This distinction matters for engineering decisions. If your application requires semantic divergence (brainstorming, ideation, exploration of solution space), current AI models perform at or above median human level. If your application requires meaningful narrative (user communication, brand voice, content that builds emotional connection), top-tier human talent still dramatically outperforms.
Practical Implications: What Technical Leaders Should Actually Do
Audit Your Creative Workflows by Task Type
Map every process that requires creative input in your organization. Categorize each by whether the primary output is:
- Semantic exploration (brainstorming, keyword generation, solution space mapping): AI performs at median human level. Augmentation or replacement is viable for efficiency gains.
- Structured narrative (documentation, technical writing, standard communications): AI performs competently with human review. A quality ceiling exists but is acceptable for many use cases.
- Resonant narrative (brand voice, user-facing content that builds relationship, creative work intended to move people emotionally): Top human performers still dramatically outperform. Reserve these tasks for your best talent and invest in identifying who that is.
Implement Performance Variance Measurement
Most organizations don’t measure creative output quality with any rigor. This study demonstrates that human creative performance varies by factors of 2-3x or more across the population. If you can’t identify your top-quartile performers, you can’t deploy them effectively.
Consider implementing DAT-style assessments in hiring pipelines for creative roles. The test is well-validated, takes minutes to administer, and provides an objective baseline. It won’t capture everything about creative potential, but it captures more than interviews and portfolio reviews that are subject to halo effects and presentation skills.
Adjust Temperature and Prompting for Ideation Workloads
The finding that temperature settings and etymology-based prompting boost AI creativity scores has immediate practical application. If you’re using AI for brainstorming or concept generation, default settings are leaving performance on the table.
Experiment with temperature ranges of 0.8-1.2 for ideation tasks (versus typical 0.7 for factual accuracy). Structure prompts to ask for etymological relationships or conceptual origins as part of ideation chain. Measure output divergence to calibrate for your specific use case.
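One way to calibrate is to sweep temperatures and score each batch of output for divergence. The sketch below uses Jaccard distance over word sets as a crude lexical stand-in for embedding distance, and `generate_ideas` is a hypothetical placeholder returning canned batches; in practice you would wire in your real model client and an embedding-based scorer:

```python
# Hedged calibration sketch: sweep temperatures, score each idea batch by
# average pairwise Jaccard distance (a lexical stand-in for embedding
# distance), keep the most divergent setting. `generate_ideas` and its
# canned outputs are hypothetical placeholders, not a real API.
from itertools import combinations

def jaccard_distance(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def divergence(ideas):
    pairs = list(combinations(ideas, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def generate_ideas(prompt, temperature):
    # Placeholder: replace with a real LLM call that accepts `temperature`.
    canned = {
        0.7: ["fast cheap app", "fast simple app", "cheap simple app"],
        1.0: ["fast cheap app", "offline mode", "cheap simple app"],
        1.2: ["barter marketplace", "offline mode", "voice-first flow"],
    }
    return canned[temperature]

best_t = max((0.7, 1.0, 1.2),
             key=lambda t: divergence(generate_ideas("new app ideas", t)))
```

The same divergence metric doubles as a regression check: if a prompt or model change quietly collapses your ideation variety, the score drops before anyone notices in review.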
Build Human-AI Creative Teams with Clear Role Separation
The study points toward an optimal configuration: AI for volume generation of candidates across solution space, humans for selection, refinement, and final creative judgment.
In practice, this means structuring creative workflows as:
- AI generates 10x-100x the candidate ideas a human would produce in the same time
- Human experts filter for quality, feasibility, and resonance
- Selected candidates get human development and refinement
- Final output represents human creative judgment informed by AI-expanded option space
This architecture plays to AI strength (breadth, speed, semantic exploration) and human strength (depth, meaning, selection judgment). Neither component alone matches the performance of the combined system.
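The generate/filter/refine split above can be sketched as a pipeline. Everything here is a hypothetical stub: `ai_generate` stands in for a high-temperature model call, the machine filter for automated triage heuristics, and `human_review` for the human judgment step:

```python
# Sketch of the generate/filter/refine workflow with hypothetical stubs.
# AI provides volume; cheap automated scoring narrows the field; humans
# make the final creative call.
from dataclasses import dataclass

@dataclass
class Candidate:
    idea: str
    score: float = 0.0

def ai_generate(prompt: str, n: int) -> list[Candidate]:
    # Placeholder for a high-temperature LLM call producing n candidates.
    return [Candidate(f"{prompt} variant {i}") for i in range(n)]

def machine_filter(cands: list[Candidate], keep: int) -> list[Candidate]:
    # Automated triage: swap in a divergence or feasibility heuristic here.
    for c in cands:
        c.score = len(c.idea)              # stand-in scoring heuristic
    return sorted(cands, key=lambda c: c.score, reverse=True)[:keep]

def human_review(cands: list[Candidate], keep: int) -> list[Candidate]:
    # Humans make the final judgment; modeled here as a simple cut.
    return cands[:keep]

# 100 AI candidates -> 10 machine-filtered -> 3 human-selected
shortlist = human_review(machine_filter(ai_generate("onboarding flow", 100), 10), 3)
assert len(shortlist) == 3
```

The funnel shape is the point: each stage is cheap relative to the one after it, so expensive human attention only ever touches a pre-filtered fraction of the option space.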
What Top Humans Do Differently
The most intriguing question the Montreal study raises is not about AI. It’s about the top 10% of humans. What are they doing that current AI architectures cannot replicate?
The research doesn’t answer this definitively, but narrative task results point toward possible explanations. Top creative performers likely integrate multiple cognitive systems that AI models currently separate or simulate poorly:
Embodied cognition: Human creativity draws on physical experience—the feeling of cold water, weight of grief, confusion of navigating a foreign city. AI models can describe these experiences but don’t have them. The best human creatives translate embodied knowledge into resonant expression.
Emotional modeling: Creating work that moves people requires predicting emotional response in audiences. Humans with deep social and emotional experience model readers intuitively. AI models approximate this through pattern matching on emotionally successful texts, but the mechanism differs.
Deliberate constraint violation: Great creativity often involves knowing rules well enough to break them productively. This requires meta-awareness of expectations and intentional subversion. Current AI models are excellent at following patterns and reasonable at varying them, but weak at strategic pattern breaking with specific effect in mind.
Cross-domain integration: The most creative insights often come from connecting ideas across distant fields—applying biological principles to software architecture, or economic theory to relationship dynamics. This requires having genuine expertise in multiple domains and perceiving structural similarities that don’t appear in surface statistics. It’s unclear whether scaling current architectures will produce this capability or whether it requires fundamentally different approaches.
Understanding what top humans do differently isn’t just philosophical curiosity. It’s a roadmap for AI development. If we can characterize the mechanisms behind top-decile human creative performance, we can attempt to build systems that exhibit similar properties. The Montreal study gives us a benchmark; the next challenge is reverse-engineering the process that beats it.
Conclusion: The New Creative Landscape
The Montreal study marks a transition point. We move from asking “can AI be creative?” to asking “which creative tasks still require top human talent?”
The answer is clear: tasks requiring semantic divergence alone now have competitive AI solutions. Tasks requiring meaningful narrative, emotional resonance, and cultural relevance still need exceptional human performers—and the emphasis is on exceptional. Median creative talent now competes with AI that doesn’t tire, negotiate, or need healthcare.
For technical leaders, the study provides actionable intelligence. Know which tasks fall into which category. Measure creative performance in your organization. Design human-AI workflows that leverage complementary strengths. And prepare for compensation structures that increasingly reward the right tail of human creative distribution.
The 100,000 humans who participated in this study unknowingly ran an experiment that answers a question every CTO needs to answer: where do machines stop and exceptional humans begin? That boundary now has coordinates.
The organizations that thrive will be those that can work both sides of the line—deploying AI for scale and speed while concentrating rare human creative talent where it still outperforms anything artificial.
Sources
- Divergent creativity in humans and large language models - Scientific Reports (Nature)
- University of Montreal Study Proves AI Beats Average Humans on Creativity Tests - AI Unfiltered
- Creativity in the age of generative AI: A new era of creative partnerships - ScienceDaily
- University of Montreal Announcement - Creative Talent: Has AI Knocked Humans Out?
This post was generated by AI using GLM-4.7