Foxxe Labs
Part 1 — Knowledge Benchmark (MMLU)

MMLU (Massive Multitask Language Understanding) was the dominant benchmark from 2022 to 2025, tracking broad knowledge across 57 academic disciplines. Both labs' frontier models now sit near saturation at 97%+, making it a poor differentiator for current models — but the historical arc is striking.
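Mechanically, MMLU scoring is plain multiple-choice accuracy: the fraction of questions where the model picks the keyed answer (A–D), averaged over the question set. A minimal illustrative sketch — the three questions below are hypothetical placeholders, not real MMLU items:

```python
# MMLU-style scoring is multiple-choice accuracy: the fraction of
# questions where the model's pick matches the keyed answer (A-D).
# These three entries are hypothetical placeholders.
questions = [
    {"subject": "college_physics",     "answer": "B", "prediction": "B"},
    {"subject": "high_school_history", "answer": "D", "prediction": "A"},
    {"subject": "microeconomics",      "answer": "C", "prediction": "C"},
]
correct = sum(q["prediction"] == q["answer"] for q in questions)
print(f"MMLU-style accuracy: {correct / len(questions):.1%}")  # 66.7%
```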

At a glance:
- OpenAI MMLU gain: +27.2pp (GPT-3.5 → GPT-5.4, 3.3 yrs)
- Claude MMLU gain: +23.2pp (Claude 1 → Sonnet 4.6)
- Cost drop: ~10× per year at equivalent performance
- Frontier ceiling: ~97%+ (MMLU near saturation)
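The headline numbers above follow directly from the timeline scores below, and the cost figure compounds steeply over the window. A quick sanity-check sketch (the ~10×/yr cost estimate is this page's own figure):

```python
# Reproduce the headline stats from the timeline scores below.
gpt35, gpt54 = 70.0, 97.2    # OpenAI MMLU %, Nov '22 -> Mar '26
claude1, s46 = 73.0, 96.2    # Claude MMLU %, Mar '23 -> Mar '26

print(f"OpenAI gain: +{gpt54 - gpt35:.1f}pp")  # +27.2pp over 3.3 yrs
print(f"Claude gain: +{s46 - claude1:.1f}pp")  # +23.2pp over 3.0 yrs

# A ~10x/yr cost drop compounds to roughly 10**3.3 (~2,000x)
# over the 3.3-year GPT-3.5 -> GPT-5.4 window.
print(f"Implied cumulative cost drop: ~{10 ** 3.3:,.0f}x")
```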
MMLU BENCHMARK (%) — HIGHER IS BETTER
OpenAI / ChatGPT

Date    | Model         | MMLU  | Notes
Nov '22 | GPT-3.5       | 70.0% | ChatGPT launch; 1M users in 5 days
Mar '23 | GPT-4         | 86.4% | Major leap; launched in Bing and ChatGPT
Nov '23 | GPT-4 Turbo   | 87.4% | 128K context; updated knowledge cutoff
May '24 | GPT-4o        | 88.7% | Omni-modal; free-tier access
Sep '24 | o1-preview    | 90.8% | Chain-of-thought reasoning model
Dec '24 | o1            | 92.3% | Full o1 + $200/mo Pro tier
Jan '25 | o3-mini       | 93.0% | Compact, fast reasoning
Apr '25 | o3 / o4-mini  | 96.7% | Top AIME 2024/25 benchmark scores
May '25 | GPT-5         | 96.0% | Unified reasoning + conversational AI
Nov '25 | GPT-5.1       | 96.4% | Adaptive reasoning; 8 personality presets
Feb '26 | GPT-5.2       | 96.8% | Improved polish; spreadsheet & finance tasks
Mar '26 | GPT-5.4 (new) | 97.2% | Native computer use (75% OSWorld); 1M context; merges GPT-5.3-Codex line
Anthropic / Claude

Date    | Model                   | MMLU  | Notes
Mar '23 | Claude 1                | 73.0% | First public Claude via limited API
Jul '23 | Claude 2                | 78.5% | General public access; 100K context
Nov '23 | Claude 2.1              | 80.0% | 200K context; reduced hallucination rate
Mar '24 | Claude 3 Opus           | 86.8% | Multimodal; noted self-awareness in tests
Jun '24 | Claude 3.5 Sonnet       | 88.7% | Beats Opus at Sonnet price; Artifacts
Oct '24 | Claude 3.5 Sonnet v2    | 89.5% | Computer use capability; upgraded Haiku
Feb '25 | Claude 3.7 Sonnet       | 91.0% | First hybrid reasoning model
May '25 | Claude 4 Opus           | 94.5% | ASL-3 safety classification; Claude Code
Sep '25 | Claude 4.5 Sonnet       | 95.0% | 77.2% SWE-bench; 30+ hour task focus
Nov '25 | Claude 4.5 Opus         | 95.8% | First model >80% SWE-bench (80.9%)
Feb '26 | Claude Opus 4.6         | 96.5% | 1M context beta; 65.4% Terminal-Bench; adaptive thinking; agent teams
Mar '26 | Claude Sonnet 4.6 (new) | 96.2% | Cost-efficient tier; released 12 days after Opus 4.6
Part 2 — Capabilities Benchmarks (Mar 2026)

MMLU measures what models know. These benchmarks measure what they can actually do — write and deploy working code, navigate a computer autonomously, and complete real professional tasks. This is where the frontier labs are differentiating in 2026.

OSWorld human baseline: 72.4%. GPT-5.4 (75.0%) is the first model to surpass human performance on autonomous desktop navigation.
SCORE — HIGHER IS BETTER  |  GDPval-AA IN ELO POINTS (SCALE DIFFERS)
SWE-bench
Real-world software engineering tasks resolved autonomously
- GPT-5.4: 79.2%
- Claude 4.5 Opus: 80.9% 🏆
Claude 4.5 Opus holds the published record (80.9%); GPT-5.4 estimated at ~79% based on relative positioning*
Terminal-Bench
Agentic coding in real terminal environments
- GPT-5.2 + Codex: 64.7%
- Claude Opus 4.6: 65.4% 🏆
Claude Opus 4.6 holds the record (65.4%); GPT-5.2 Codex CLI at 64.7%. GPT-5.4 score pending independent verification.
OSWorld
Autonomous desktop navigation via screenshots + keyboard/mouse
- GPT-5.4: 75.0% 🏆
- Claude Opus 4.6: 72.7%
- Human baseline: 72.4%
GPT-5.4 is the first model to surpass the human baseline on this benchmark.
GDPval-AA
Economically valuable tasks across finance, legal & 44 professions
- GPT-5.2: 1462 Elo
- Claude Opus 4.6: 1606 Elo 🏆
Claude Opus 4.6 leads by ~144 Elo (~70% win rate). GPT-5.4 GDPval score pending; GPT-5.2 shown for comparison.
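
The ~70% figure follows from the standard Elo expected-score formula, which converts a rating gap into a head-to-head win probability. A minimal sketch (the formula is generic to any Elo system, not specific to GDPval-AA):

```python
# Convert an Elo rating gap into an expected head-to-head win rate
# using the standard Elo expected-score formula. Generic to Elo
# systems; not a GDPval-AA-specific method.
def expected_win_rate(elo_diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(f"{expected_win_rate(1606 - 1462):.0%}")  # ~70% for a 144-point gap
```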

Benchmark notes: MMLU scores — official where available; recent entries (*) estimated from ArtificialAnalysis Intelligence Index and relative positioning. MMLU is near-saturated at 97%+ and no longer the primary frontier differentiator. Capabilities benchmark scores sourced from official model cards, TechCrunch, DigitalApplied, and ArtificialAnalysis (March 2026). GDPval-AA is measured in Elo points and is not directly comparable to percentage-based benchmarks. Independent verification of GPT-5.4 Terminal-Bench and GDPval scores is still emerging.

For the latest information: llm-stats.com — live benchmark tracking across 50+ evaluations and 20+ API providers.