Foxxe Labs
Part 1 — Knowledge Benchmark (MMLU)

MMLU (Massive Multitask Language Understanding) was the dominant benchmark from 2022 to 2025, tracking broad knowledge across 57 academic disciplines. The latest models from both labs now score roughly 96–97%, close to saturation, which makes MMLU a poor differentiator for current frontier models, but the historical arc is striking.

OpenAI MMLU gain: +26.5pp (GPT-3.5 → GPT-5.5, 3.4 yrs)
Claude MMLU gain: +23.9pp (Claude 1 → Opus 4.8)
Cost drop: ~10×/yr (per equivalent performance)
Frontier ceiling: ~97%+ (MMLU near saturation)
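
The two headline gains can be reproduced from the first and latest MMLU scores listed for each lab in the timeline. The sketch below is only an illustration of that subtraction; the dictionary layout and the `pp_gain` helper are hypothetical names, not part of the source.

```python
# Endpoint MMLU scores (percent) copied from the timeline in this section.
endpoints = {
    "OpenAI": {"first": ("GPT-3.5", 70.0), "latest": ("GPT-5.5", 96.5)},
    "Anthropic": {"first": ("Claude 1", 73.0), "latest": ("Opus 4.8", 96.9)},
}


def pp_gain(first_pct: float, latest_pct: float) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(latest_pct - first_pct, 1)


gains = {
    lab: pp_gain(models["first"][1], models["latest"][1])
    for lab, models in endpoints.items()
}
print(gains)  # {'OpenAI': 26.5, 'Anthropic': 23.9}
```

Percentage points (pp) are a plain difference of two percentages, not a relative change, which is why the figures come from subtraction alone.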
Legend: OpenAI / ChatGPT; Anthropic / Claude; ● = latest releases (Apr–May 2026)
MMLU BENCHMARK (%) — HIGHER IS BETTER
OpenAI / ChatGPT
Date | Model | MMLU | Notes
Nov '22 | GPT-3.5 | 70% | ChatGPT launch; 1M users in 5 days
Mar '23 | GPT-4 | 86.4% | Major leap; launched in Bing and ChatGPT
Nov '23 | GPT-4 Turbo | 87.4% | 128K context; updated knowledge cutoff
May '24 | GPT-4o | 88.7% | Omni-modal; free-tier access
Sep '24 | o1-preview | 90.8% | Chain-of-thought reasoning model
Dec '24 | o1 | 92.3% | Full o1 + $200/mo Pro tier
Jan '25 | o3-mini | 93% | Compact, fast reasoning
Apr '25 | o3 / o4-mini | 96.7% | Top AIME 2024/25 benchmark scores
May '25 | GPT-5 | 96% | Unified reasoning + conversational AI
Nov '25 | GPT-5.1 | 96.4% | Adaptive reasoning; 8 personality presets
Feb '26 | GPT-5.2 | 96.8% | Improved polish; spreadsheet & finance tasks
Mar '26 | GPT-5.4 | 97.2% | Native computer use (75% OSWorld); 1M context; merges GPT-5.3-Codex line
Apr '26 | GPT-5.5 (NEW) | 96.5% | Codename 'Spud'; 82.7% Terminal-Bench 2.0; 88.7% SWE-bench; first fully-retrained base since GPT-4.5
Anthropic / Claude
Date | Model | MMLU | Notes
Mar '23 | Claude 1 | 73% | First public Claude via limited API
Jul '23 | Claude 2 | 78.5% | General public access; 100K context
Nov '23 | Claude 2.1 | 80% | 200K context; reduced hallucination rate
Mar '24 | Claude 3 Opus | 86.8% | Multimodal; noted self-awareness in tests
Jun '24 | Claude 3.5 Sonnet | 88.7% | Beats Opus at Sonnet price; Artifacts
Oct '24 | Claude 3.5 Sonnet v2 | 89.5% | Computer use capability; upgraded Haiku
Feb '25 | Claude 3.7 Sonnet | 91% | First hybrid reasoning model
May '25 | Claude 4 Opus | 94.5% | ASL-3 safety classification; Claude Code
Sep '25 | Claude 4.5 Sonnet | 95% | 77.2% SWE-bench; 30+ hour task focus
Nov '25 | Claude 4.5 Opus | 95.8% | First model >80% SWE-bench (80.9%)
Feb '26 | Claude Opus 4.6 | 96.5% | 1M context beta; 65.4% Terminal-Bench; adaptive thinking; agent teams
Mar '26 | Claude Sonnet 4.6 | 96.2% | Cost-efficient tier; released 12 days after Opus 4.6
Apr '26 | Claude Opus 4.7 | 96.7% | 87.6% SWE-bench Verified; xhigh effort tier; 3.3× vision resolution
May '26 | Claude Opus 4.8 (NEW) | 96.9% | 88.6% SWE-bench Verified; 1890 Elo GDPval-AA (lead); dynamic workflows; 41-day cadence
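
To make the saturation point concrete, one can compare the headroom each latest model has left with the gap between them. This is a rough sketch only: it assumes a ceiling of 100% (the source does not state a practical ceiling), and `headroom` is a hypothetical helper name.

```python
# Latest MMLU scores (percent) from the timeline above.
latest = {"GPT-5.5": 96.5, "Claude Opus 4.8": 96.9}

# Assumption: a perfect score is 100%, so headroom is 100 minus the score.
CEILING = 100.0


def headroom(score_pct: float, ceiling: float = CEILING) -> float:
    """Percentage points left before the assumed ceiling."""
    return round(ceiling - score_pct, 1)


room = {model: headroom(score) for model, score in latest.items()}
gap = round(abs(latest["Claude Opus 4.8"] - latest["GPT-5.5"]), 1)

print(room)  # {'GPT-5.5': 3.5, 'Claude Opus 4.8': 3.1}
print(gap)   # 0.4
```

With under 4 points of headroom each and a 0.4-point gap between the two models, there is little room left for MMLU to separate them.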
Part 2 — Capabilities Benchmarks (May 2026)

MMLU measures what models know. These benchmarks measure what they can actually do: write and deploy working code, navigate a computer autonomously, and complete real professional tasks. SWE-bench Verified is now saturating around 88–89% for both labs, so the contested benchmark has moved to SWE-bench Pro. Both frontier models now exceed the human baseline on OSWorld.

OSWorld human baseline: 72.4%. Both frontier models now lead — GPT-5.5 at 78.7%, Opus 4.8 at 83.4% — putting autonomous desktop navigation past the "better than humans" threshold.
SCORE — HIGHER IS BETTER  |  GDPval-AA IN ELO POINTS (SCALE DIFFERS)
SWE-bench Pro
Harder, less-contaminated software engineering benchmark — predicts production coding performance
GPT-5.5: 58.6%
Claude Opus 4.8: 69.2% 🏆
Opus 4.8 leads by 10.6 points. SWE-bench Verified is now saturating (~88–89% for both labs), so Pro is the meaningful differentiator.
Terminal-Bench
Agentic coding in real terminal environments — planning, iteration, tool coordination
GPT-5.5: 78.2% 🏆
Claude Opus 4.8: 74.6%
GPT-5.5 leads by 3.6 points on Terminal-Bench 2.1. GPT-5.5 also holds the Terminal-Bench 2.0 record at 82.7%.
OSWorld
Autonomous desktop navigation via screenshots + keyboard/mouse
GPT-5.5: 78.7%
Claude Opus 4.8: 83.4% 🏆
Human baseline: 72.4%
Human baseline: 72.4%. Both frontier models now exceed human performance; Opus 4.8 leads by 4.7 points.
GDPval-AA
Economically valuable tasks across finance, legal & 44 professions
GPT-5.5: 1769 Elo
Claude Opus 4.8: 1890 Elo 🏆
Opus 4.8 leads by 121 Elo (~67% expected win rate). Largest competitive gap on the page.
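
The ~67% figure is consistent with the conventional Elo expected-score formula. The sketch below assumes GDPval-AA ratings follow the standard logistic Elo model with a 400-point scale, which the source does not state explicitly; `elo_expected_score` is an illustrative helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


opus_4_8 = 1890  # GDPval-AA Elo reported above
gpt_5_5 = 1769   # GDPval-AA Elo reported above

p = elo_expected_score(opus_4_8, gpt_5_5)
print(f"Rating gap: {opus_4_8 - gpt_5_5} Elo")  # Rating gap: 121 Elo
print(f"Expected win rate: {p:.1%}")            # Expected win rate: 66.7%
```

Because the curve is logistic, equal ratings give exactly 50%, and each additional Elo point adds less win probability as the gap widens.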

Benchmark notes: MMLU scores are official where available; recent entries (*) are estimated from the ArtificialAnalysis Intelligence Index and relative positioning. MMLU has been near saturation (roughly 96–97%) since late 2025 and is no longer the primary frontier differentiator. Capabilities benchmark scores are sourced from Anthropic and OpenAI launch posts, ArtificialAnalysis, TokenMix, and llm-stats.com (May 2026). SWE-bench Verified is saturating (~88–89% for both labs); SWE-bench Pro is now the more representative coding benchmark. Terminal-Bench moved to version 2.1 in May 2026. GDPval-AA is measured in Elo points and is not directly comparable to percentage-based benchmarks. Mythos-class preview models from Anthropic are excluded (not generally available).

For the latest information: llm-stats.com — live benchmark tracking across 50+ evaluations and 20+ API providers.