Part 1 — Knowledge Benchmark (MMLU)
MMLU (Massive Multitask Language Understanding) was the dominant benchmark from 2022 to 2025, tracking broad knowledge across 57 academic disciplines. Both labs now score close to 97%, near saturation, which makes it a poor differentiator for current frontier models, but the historical arc is striking.
OpenAI MMLU gain: +26.5pp (GPT-3.5 → GPT-5.5, 3.4 yrs)
Claude MMLU gain: +23.9pp (Claude 1 → Opus 4.8)
Cost drop: ~10×/yr (per equivalent performance)
Frontier ceiling: ~97%+ (MMLU near saturation)
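The headline gains above can be checked with simple arithmetic. The sketch below assumes the first and latest MMLU scores listed in the timeline (70% and 96.5% for OpenAI, 73% and 96.9% for Claude) and the month-level release dates shown there. The function names `pp_gain` and `years_between` are illustrative, not part of any source.

```python
def pp_gain(first_score: float, latest_score: float) -> float:
    """Percentage-point gain between two benchmark scores (both in %)."""
    return round(latest_score - first_score, 1)


def years_between(start_year: int, start_month: int,
                  end_year: int, end_month: int) -> float:
    """Elapsed time in years, at month granularity (release days are not given)."""
    months = (end_year - start_year) * 12 + (end_month - start_month)
    return months / 12


# Scores taken from the timeline: first public model vs. latest release.
openai_gain = pp_gain(70.0, 96.5)   # GPT-3.5 -> GPT-5.5
claude_gain = pp_gain(73.0, 96.9)   # Claude 1 -> Opus 4.8

# Nov '22 (GPT-3.5) to Apr '26 (GPT-5.5).
openai_span = years_between(2022, 11, 2026, 4)

print(f"OpenAI gain: +{openai_gain}pp over {openai_span:.1f} yrs")
print(f"Claude gain: +{claude_gain}pp")
```

Percentage points are plain differences between two percentages, so a move from 70% to 96.5% is +26.5pp, not a 26.5% relative improvement.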
MMLU benchmark (%), higher is better
Legend: OpenAI / ChatGPT, Anthropic / Claude; ● = latest releases (Apr–May 2026), marked NEW below.
OpenAI / ChatGPT

| Released | Model | MMLU | Notes |
|---|---|---|---|
| Nov '22 | GPT-3.5 | 70% | ChatGPT launch; 1M users in 5 days |
| Mar '23 | GPT-4 | 86.4% | Major leap; launched in Bing and ChatGPT |
| Nov '23 | GPT-4 Turbo | 87.4% | 128K context; updated knowledge cutoff |
| May '24 | GPT-4o | 88.7% | Omni-modal; free-tier access |
| Sep '24 | o1-preview | 90.8% | Chain-of-thought reasoning model |
| Dec '24 | o1 | 92.3% | Full o1 + $200/mo Pro tier |
| Jan '25 | o3-mini | 93% | Compact, fast reasoning |
| Apr '25 | o3 / o4-mini | 96.7% | Top AIME 2024/25 benchmark scores |
| May '25 | GPT-5 | 96% | Unified reasoning + conversational AI |
| Nov '25 | GPT-5.1 | 96.4% | Adaptive reasoning; 8 personality presets |
| Feb '26 | GPT-5.2 | 96.8% | Improved polish; spreadsheet & finance tasks |
| Mar '26 | GPT-5.4 | 97.2% | Native computer use (75% OSWorld); 1M context; merges GPT-5.3-Codex line |
| Apr '26 | GPT-5.5 (NEW) | 96.5% | Codename 'Spud'; 82.7% Terminal-Bench 2.0; 88.7% SWE-bench; first fully-retrained base since GPT-4.5 |

Anthropic / Claude

| Released | Model | MMLU | Notes |
|---|---|---|---|
| Mar '23 | Claude 1 | 73% | First public Claude via limited API |
| Jul '23 | Claude 2 | 78.5% | General public access; 100K context |
| Nov '23 | Claude 2.1 | 80% | 200K context; reduced hallucination rate |
| Mar '24 | Claude 3 Opus | 86.8% | Multimodal; noted self-awareness in tests |
| Jun '24 | Claude 3.5 Sonnet | 88.7% | Beats Opus at Sonnet price; Artifacts |
| Oct '24 | Claude 3.5 Sonnet v2 | 89.5% | Computer use capability; upgraded Haiku |
| Feb '25 | Claude 3.7 Sonnet | 91% | First hybrid reasoning model |
| May '25 | Claude 4 Opus | 94.5% | ASL-3 safety classification; Claude Code |
| Sep '25 | Claude 4.5 Sonnet | 95% | 77.2% SWE-bench; 30+ hour task focus |
| Nov '25 | Claude 4.5 Opus | 95.8% | First model >80% SWE-bench (80.9%) |
| Feb '26 | Claude Opus 4.6 | 96.5% | 1M context beta; 65.4% Terminal-Bench; adaptive thinking; agent teams |
| Mar '26 | Claude Sonnet 4.6 | 96.2% | Cost-efficient tier; released 12 days after Opus 4.6 |
| Apr '26 | Claude Opus 4.7 | 96.7% | 87.6% SWE-bench Verified; xhigh effort tier; 3.3× vision resolution |
| May '26 | Claude Opus 4.8 (NEW) | 96.9% | 88.6% SWE-bench Verified; 1890 Elo GDPval-AA (lead); dynamic workflows; 41-day cadence |

Part 2 — Capabilities Benchmarks (May 2026)
MMLU measures what models know. These benchmarks measure what they can actually do: write and deploy working code, navigate a computer autonomously, and complete real professional tasks. SWE-bench Verified is now saturating around 88–89% for both labs, so the contested benchmark has moved to SWE-bench Pro. Both frontier models now exceed the human baseline on OSWorld.
⚠ OSWorld human baseline: 72.4%. Both frontier models now exceed it (GPT-5.5 at 78.7%, Opus 4.8 at 83.4%), putting autonomous desktop navigation past the "better than humans" threshold.
Scores: higher is better. GDPval-AA is in Elo points (scale differs).
SWE-bench Pro
Harder, less-contaminated software engineering benchmark — predicts production coding performance
GPT-5.5: 58.6%
Claude Opus 4.8: 69.2% 🏆
Opus 4.8 leads by 10.6 points. SWE-bench Verified is now saturating (~88–89% for both labs), so Pro is the meaningful differentiator.
Terminal-Bench
Agentic coding in real terminal environments — planning, iteration, tool coordination
GPT-5.5: 78.2% 🏆
Claude Opus 4.8: 74.6%
GPT-5.5 leads by 3.6 points on Terminal-Bench 2.1. GPT-5.5 also holds the Terminal-Bench 2.0 record at 82.7%.
OSWorld
Autonomous desktop navigation via screenshots + keyboard/mouse
GPT-5.5: 78.7%
Claude Opus 4.8: 83.4% 🏆
Human baseline: 72.4%
Both frontier models now exceed human performance; Opus 4.8 leads GPT-5.5 by 4.7 points.
GDPval-AA
Economically valuable tasks across 44 professions, including finance and legal
GPT-5.5: 1769 Elo
Claude Opus 4.8: 1890 Elo 🏆
Opus 4.8 leads by 121 Elo (~67% expected win rate). Largest competitive gap on the page.
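The ~67% figure follows from the rating gap. A minimal sketch, assuming GDPval-AA uses the standard logistic Elo model with a 400-point scale (the source does not state the exact rating model); the function name `elo_expected_score` is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumes the conventional 400-point logistic scale; a draw counts as
    half a win, so this is an expected score rather than a strict win rate.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# GDPval-AA ratings quoted in this section.
OPUS_4_8 = 1890
GPT_5_5 = 1769

gap = OPUS_4_8 - GPT_5_5
p = elo_expected_score(OPUS_4_8, GPT_5_5)
print(f"Gap: {gap} Elo, expected score for Opus 4.8: {p:.1%}")
```

Under this model a 121-point gap gives the higher-rated model an expected score of roughly two in three head-to-head comparisons, which is why an Elo gap cannot be read as a percentage-point gap.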
Benchmark notes: MMLU scores are official where available; recent entries (*) are estimated from the ArtificialAnalysis Intelligence Index and relative positioning. MMLU has been near saturation (roughly 96–97%) since late 2025 and is no longer the primary frontier differentiator. Capabilities benchmark scores are sourced from Anthropic and OpenAI launch posts, ArtificialAnalysis, TokenMix, and llm-stats.com (May 2026). SWE-bench Verified is saturating (~88–89% for both labs); SWE-bench Pro is now the more representative coding benchmark. Terminal-Bench moved to version 2.1 in May 2026. GDPval-AA is measured in Elo points and is not directly comparable to percentage-based benchmarks. Mythos-class preview models from Anthropic are excluded (not generally available).
For the latest information: llm-stats.com — live benchmark tracking across 50+ evaluations and 20+ API providers.