the benchmark illusion nobody wants to talk about

sneha prabhu·dec 7, 2025·10 min rant

benchmarks give us a very comforting illusion: a single number that claims to measure intelligence.

"we hit 95% on mmlu"

"98% accuracy achieved"

"new sota"

"#1 on the leaderboard"

headlines you see every release.

it looks objective, scientific, and measurable.

except… most benchmarks are largely disconnected from reality and tell you almost nothing about product capability, user experience, or real-world intelligence.

yet the industry keeps using them as the north star.

this is like judging cooking skill by how well someone cuts onions, without ever tasting the actual dish.

designed for papers, worshipped like gospel

nearly every benchmark began its life inside an academic paper.

the goal wasn't:

"does this model actually solve anything?"

it was:

"we need a standardized score so we can compare paper a with paper b."

so you end up with stuff like:

  • ifeval testing whether the letter "n" appears exactly 3 times (quick sketch below)
  • gsm8k math questions no actual human ever cares about
  • multiple-choice tasks pretending to measure intelligence

this is academic research cosplay.
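
to make the ifeval point concrete, here's roughly what one of those "verifiable instruction" checks boils down to. a minimal sketch in python; the example responses are made up, not pulled from the actual dataset:

```python
# minimal sketch of an ifeval-style "letter frequency" check.
# the responses below are invented for illustration.

def letter_appears_exactly(text: str, letter: str, count: int) -> bool:
    """auto-checkable constraint: does `letter` appear exactly `count` times?"""
    return text.lower().count(letter.lower()) == count

# a genuinely useful answer can fail the check...
useful = "Sure, here's a concise summary of the paper's main findings."
# ...while a throwaway answer that satisfies the constraint passes.
gamed = "nice and neat."

print(letter_appears_exactly(useful, "n", 3))  # False (it has 4 "n"s)
print(letter_appears_exactly(gamed, "n", 3))   # True
```

pass the check, score the point. nothing about usefulness is ever measured.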

but wait, the datasets themselves are broken lol

benchmarks assume their "ground truth" is… well, true.

but a bunch of benchmarks aren't even reliable.

  • humanity's last exam → a 2025 third-party audit found ~30% of biology & chemistry answers conflicted with peer-reviewed sources. (futurehouse audit)
  • mmlu → newer analyses show many questions contain outdated facts and annotation errors, making "superhuman" scores misleading as a measure of real-world knowledge. (openai safety discussion + independent evaluations)

yes, that literally means our industry headline benchmarks sometimes validate wrong answers as "ground truth."

we're scoring models against datasets with fundamental accuracy issues.

and we treat these datasets as some pure objective measure of truth when they're often messy crowdsourced guesswork.

garbage ground truth → garbage score → fancy press release.
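
to see how much this matters, a back-of-envelope sketch: assume a chunk of the answer key is wrong and the model only gets credit when it agrees with the key. the numbers below are illustrative assumptions, not measurements, and it ignores the (small) chance of matching a wrong label by accident:

```python
# back-of-envelope: what a noisy answer key does to a strong model's score.
# the numbers are illustrative assumptions, not measurements.

label_error_rate = 0.30   # assumed share of items whose gold answer is wrong
true_capability = 0.95    # assumed share of items the model genuinely gets right

# credit is only given for agreeing with the gold label, so every mislabeled
# item the model answers correctly is scored as a miss.
measured_accuracy = true_capability * (1 - label_error_rate)
print(f"measured accuracy: ~{measured_accuracy:.2f}")  # roughly 0.66
```

and the reverse is worse: a model that memorizes the wrong labels outscores one that knows the right answers.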

multiple choice questions don't measure reality

you cannot evaluate creativity, reasoning, long-horizon planning, cultural context, empathy, humor, or conversation with multiple-choice questions.

but benchmarks need multiple choice, because:

  • evaluation must be "objective"
  • correctness must be auto-checkable
  • results need to be comparable

so what do we do?

we reduce intelligence into checkbox scoring.

the outcome:

models become excellent at guessing predefined answers and terrible at real interaction.
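
for reference, one common way these get scored is plain exact match against the gold letter (some harnesses compare option log-probabilities instead, but the spirit is the same). a minimal sketch; the answers are made up:

```python
# minimal sketch of "checkbox scoring": exact match against a gold option letter.
# the model outputs below are invented for illustration.

def score(model_answer: str, gold_letter: str) -> int:
    """1 if the model picked the gold letter, 0 otherwise."""
    return int(model_answer.strip().upper() == gold_letter.upper())

print(score("C", "C"))                                           # 1
print(score("probably C, but it depends on the context", "C"))   # 0: nuance scores zero
```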

product reality is messy, non-binary, unpredictable

real problems are:

  • ambiguous
  • contextual
  • subjective
  • open-ended
  • non-deterministic

real users don't ask:

"select option c"

they ask:

  • "write me a breakup text that sounds emotional but not dramatic"
  • "summarize this video like you're my smart friend"
  • "build me a landing page with indie aesthetic vibes"
benchmarks can't evaluate any of that.

metric optimization ruins product direction

give an ai lab a benchmark score and incentives worth billions, and guess what happens?

people game the metric.

every single time.

this is goodhart's law:

when a measure becomes a target, it ceases to be a good measure.

so ai labs optimize for benchmarks, not capability.

practical competence becomes a side effect instead of the actual goal.
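
a toy way to see the goodhart effect: if the leaderboard score is true capability plus whatever benchmark-specific gaming a lab squeezes in, then selecting by score rewards the gaming. everything below is invented, just to show the mechanism:

```python
# toy sketch of goodhart's law in model selection. all numbers are invented.

# each candidate has a true capability and a "benchmark gaming" component;
# the leaderboard only ever sees their sum.
candidates = [
    {"name": "model-a", "capability": 0.86, "gaming": 0.02},
    {"name": "model-b", "capability": 0.78, "gaming": 0.15},
    {"name": "model-c", "capability": 0.81, "gaming": 0.09},
]

top_of_leaderboard = max(candidates, key=lambda m: m["capability"] + m["gaming"])
best_for_users     = max(candidates, key=lambda m: m["capability"])

print(top_of_leaderboard["name"])  # model-b: highest score, mostly thanks to gaming
print(best_for_users["name"])      # model-a: most capable, loses the leaderboard
```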

frontier labs have quietly moved on

this is the part most people don't see:

frontier teams (the ones trying to reach higher forms of intelligence) don't use benchmarks as the primary north star.

that's for:

  • marketing slides
  • investor decks
  • "announcement threads"

instead, frontier teams use human evaluations.

because real humans can judge:

  • nuance
  • quality
  • creativity
  • reasoning depth
  • usefulness

benchmarks can't score "wisdom".

humans can.

even ilya sutskever (co-founder of openai, now founder of safe superintelligence) has publicly said that benchmarks are fundamentally misleading and fail to capture real-world performance or generalization.

when the people building agi say benchmarks don't work… maybe we should listen.

the benchmark death spiral (i see this every release)

every 2–3 months:

  • model hits 95% on somebench
  • press gets excited
  • users try it
  • users: "why can't this thing write a decent email"
  • community yells "fake results!"
  • trust evaporates
  • everyone gets cynical
  • then another release happens
  • and we repeat the cycle again.

benchmarks create hype, but hype doesn't build intelligence

benchmarks are great for:

  • headlines
  • virality
  • marketing
  • surface-level comparison
  • pitch decks

but terrible at:

  • evaluating reasoning
  • modeling knowledge
  • testing creativity
  • measuring user satisfaction
  • understanding alignment quality

if your north star is a leaderboard, your model ends up good at playing standardized tests… not real life.

so what should we measure instead?

future evaluations should measure:

  • ability to reason over long context
  • ability to plan
  • ability to hold conversations over time
  • ability to solve unseen problems
  • ability to adapt
  • subjective usefulness
  • real task execution
  • agentic behavior
  • multimodal coordination

aka:

can this model actually do things?

that's what matters.
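
to make "can this model actually do things?" concrete: one way to score it is to run the output and check the result instead of matching a string. a minimal sketch with an invented task (write a slugify function) and an invented model response:

```python
# minimal sketch of outcome-based evaluation: execute the model's output and
# test its behavior instead of comparing it to a gold answer string.
# the task and the sample response are invented for illustration.
# (don't exec untrusted model output outside a sandbox; this is just a sketch.)

def passes_task(model_output_code: str) -> bool:
    """run the generated code and check that it actually does the job."""
    namespace = {}
    try:
        exec(model_output_code, namespace)   # run whatever the model wrote
        slugify = namespace["slugify"]       # the task asked for this function
        return slugify("Hello, World!") == "hello-world"
    except Exception:
        return False

# a made-up model response, judged by what it does rather than how it is worded
sample = '''
import re

def slugify(text):
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
'''
print(passes_task(sample))  # True
```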

the future is human eval, not accuracy eval

human evals measure:

  • vibe
  • intention following
  • useful outputs
  • real problem solving
  • cultural alignment
  • comprehension
  • communication style

this is where true capability shows up.
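
for the curious: one common way raw human preferences get turned into a ranking is pairwise comparison plus an elo-style update (roughly what public preference leaderboards do). a minimal sketch with invented votes, not anyone's actual eval pipeline:

```python
# minimal sketch: turning pairwise human preferences into a ranking
# with an elo-style update. the votes below are invented.

K = 32  # update step size (an assumed constant)

def expected(r_a: float, r_b: float) -> float:
    """probability that a beats b under the elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """shift ratings toward whichever model the human preferred."""
    delta = K * (1 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"model-a": 1000.0, "model-b": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-b"),
         ("model-b", "model-a"), ("model-a", "model-b")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

print(ratings)  # model-a ends up with the higher rating
```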

also — intelligence is not a single score.

intelligence is an experience.

closing thought

benchmarks might win headlines,
but they don't build trust
and they don't build the future.

we need new north stars that reflect human-level goals, not multiple-choice scoring systems designed a decade ago.

the frontier is not about "95% accuracy".

the frontier is about "can this thing actually do what humans need?"

until we evaluate models like we evaluate real intelligence, we're basically grading agi using a school quiz and calling it science.

further reading: for a deeper dive into systematic benchmark issues, check out "rethinking benchmark and contamination for language models with rephrased samples", which explores how benchmark contamination affects model evaluation.

okay fine, if you're still looking for those lists…

here are some places where people religiously track these numbers we just spent 3000 words roasting:

(yes, the irony is not lost on me. enjoy your leaderboards, you absolute nerds)