Part 1 — Knowledge Benchmark (MMLU)
MMLU (Massive Multitask Language Understanding) was the dominant benchmark from 2022–2025, tracking broad knowledge across 57 academic disciplines. Both labs' frontier models now sit near the ~97% ceiling, making MMLU a poor differentiator for current frontier models — but the historical arc is striking.
Key figures:
- OpenAI MMLU gain: +27.2pp (GPT-3.5 → GPT-5.4, 3.3 yrs)
- Claude MMLU gain: +23.2pp (Claude 1 → Sonnet 4.6)
- Cost drop: ~10×/yr at equivalent performance
- Frontier ceiling: ~97%+ (MMLU near saturation)
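As a quick arithmetic check, the headline gains above follow directly from the endpoint scores in the two timelines (GPT-3.5 at 70% to GPT-5.4 at 97.2%; Claude 1 at 73% to Sonnet 4.6 at 96.2%). A minimal sketch:

```python
def gain_pp(first: float, last: float) -> float:
    """Gain in percentage points between two benchmark scores."""
    return last - first

# Endpoint MMLU scores taken from the timelines in this section.
endpoints = {
    "OpenAI (GPT-3.5 -> GPT-5.4)": (70.0, 97.2),
    "Claude (Claude 1 -> Sonnet 4.6)": (73.0, 96.2),
}
for track, (first, last) in endpoints.items():
    print(f"{track}: +{gain_pp(first, last):.1f}pp")
# OpenAI: +27.2pp   Claude: +23.2pp
```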
MMLU BENCHMARK (%) — HIGHER IS BETTER
OpenAI / ChatGPT

| Date | Model | MMLU | Notes |
| --- | --- | --- | --- |
| Nov '22 | GPT-3.5 | 70% | ChatGPT launch — 1M users in 5 days |
| Mar '23 | GPT-4 | 86.4% | Major leap; launched in Bing and ChatGPT |
| Nov '23 | GPT-4 Turbo | 87.4% | 128K context; updated knowledge cutoff |
| May '24 | GPT-4o | 88.7% | Omni-modal; free-tier access |
| Sep '24 | o1-preview | 90.8% | Chain-of-thought reasoning model |
| Dec '24 | o1 | 92.3% | Full o1 + $200/mo Pro tier |
| Jan '25 | o3-mini | 93% | Compact, fast reasoning |
| Apr '25 | o3 / o4-mini | 96.7% | Top AIME 2024/25 benchmark scores |
| May '25 | GPT-5 | 96% | Unified reasoning + conversational AI |
| Nov '25 | GPT-5.1 | 96.4% | Adaptive reasoning; 8 personality presets |
| Feb '26 | GPT-5.2 | 96.8% | Improved polish; spreadsheet & finance tasks |
| Mar '26 | GPT-5.4 (new) | 97.2% | Native computer use (75% OSWorld); 1M context; merges GPT-5.3-Codex line |

Anthropic / Claude

| Date | Model | MMLU | Notes |
| --- | --- | --- | --- |
| Mar '23 | Claude 1 | 73% | First public Claude via limited API |
| Jul '23 | Claude 2 | 78.5% | General public access; 100K context |
| Nov '23 | Claude 2.1 | 80% | 200K context; reduced hallucination rate |
| Mar '24 | Claude 3 Opus | 86.8% | Multimodal; noted self-awareness in tests |
| Jun '24 | Claude 3.5 Sonnet | 88.7% | Beats Opus at Sonnet price; Artifacts |
| Oct '24 | Claude 3.5 Sonnet v2 | 89.5% | Computer use capability; upgraded Haiku |
| Feb '25 | Claude 3.7 Sonnet | 91% | First hybrid reasoning model |
| May '25 | Claude 4 Opus | 94.5% | ASL-3 safety classification; Claude Code |
| Sep '25 | Claude 4.5 Sonnet | 95% | 77.2% SWE-bench; 30+ hour task focus |
| Nov '25 | Claude 4.5 Opus | 95.8% | First model >80% SWE-bench (80.9%) |
| Feb '26 | Claude Opus 4.6 | 96.5% | 1M context beta; 65.4% Terminal-Bench; adaptive thinking; agent teams |
| Mar '26 | Claude Sonnet 4.6 (new) | 96.2% | Cost-efficient tier; released 12 days after Opus 4.6 |

Part 2 — Capabilities Benchmarks (Mar 2026)
MMLU measures what models know. These benchmarks measure what they can actually do — write and deploy working code, navigate a computer autonomously, and complete real professional tasks. This is where the frontier labs are differentiating in 2026.
⚠ OSWorld human baseline: 72.4%. GPT-5.4 (75.0%) is the first model to surpass human performance on autonomous desktop navigation.
SCORE — HIGHER IS BETTER | GDPval-AA IN ELO POINTS (SCALE DIFFERS)
SWE-bench
Real-world software engineering tasks resolved autonomously
GPT-5.4: 79.2%
Claude 4.5 Opus: 80.9% 🏆
Claude 4.5 Opus holds the published record (80.9%); GPT-5.4 estimated ~79% based on relative positioning*
Terminal-Bench
Agentic coding in real terminal environments
GPT-5.2 + Codex: 64.7%
Claude Opus 4.6: 65.4% 🏆
Claude Opus 4.6 holds the record (65.4%); GPT-5.2 Codex CLI at 64.7%. GPT-5.4 score pending independent verification.
OSWorld
Autonomous desktop navigation via screenshots + keyboard/mouse
GPT-5.4: 75% 🏆
Claude Opus 4.6: 72.7%
Human baseline: 72.4%
GDPval-AA
Economically valuable tasks across finance, legal & 44 professions
GPT-5.2: 1462 Elo
Claude Opus 4.6: 1606 Elo 🏆
Claude Opus 4.6 leads by ~144 Elo (~70% win rate). GPT-5.4 GDPval score pending; GPT-5.2 shown for comparison.
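The ~70% win rate follows from the standard Elo expectation formula, P(win) = 1 / (1 + 10^(-Δ/400)); a quick check against the 144-point gap quoted above:

```python
def elo_win_probability(elo_diff: float) -> float:
    """Expected win rate implied by an Elo rating gap (standard logistic formula)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# Claude Opus 4.6 (1606) vs GPT-5.2 (1462): a 144-point gap.
print(f"{elo_win_probability(1606 - 1462):.1%}")  # 69.6%, i.e. the ~70% quoted
```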
Benchmark notes: MMLU scores — official where available; recent entries (*) estimated from ArtificialAnalysis Intelligence Index and relative positioning. MMLU is near-saturated at 97%+ and no longer the primary frontier differentiator. Capabilities benchmark scores sourced from official model cards, TechCrunch, DigitalApplied, and ArtificialAnalysis (March 2026). GDPval-AA is measured in Elo points and is not directly comparable to percentage-based benchmarks. Independent verification of GPT-5.4 Terminal-Bench and GDPval scores is still emerging.
For the latest information: llm-stats.com — live benchmark tracking across 50+ evaluations and 20+ API providers.