Foxxe Labs Research

AI Model Performance Timeline

Two charts, two stories. The first tracks MMLU, the dominant knowledge benchmark from 2022 to 2025, across every major OpenAI and Anthropic release. The headline: Claude 1 launched in March 2023 about 13 points behind GPT-4 (73% vs 86.4%), and Claude had fully caught up by mid-2024. Both labs now sit near saturation at around 97%. The second chart covers where the frontier is actually contested in May 2026: SWE-bench Pro, Terminal-Bench, OSWorld, and GDPval-AA. Both frontier models now exceed the human baseline on OSWorld desktop navigation, and the contested coding benchmark has moved to the harder, less-contaminated SWE-bench Pro, where Anthropic's just-released Opus 4.8 holds the lead.

Last updated: 29 May 2026 · Todd McCaffrey, Foxxe Labs

Part 1 — Knowledge Benchmark (MMLU)

MMLU (Massive Multitask Language Understanding) was the dominant benchmark from 2022 to 2025, tracking broad knowledge across 57 academic disciplines. Both labs now score close to 97%, which is near saturation and makes MMLU a poor differentiator for current frontier models. The historical arc is still striking.

OpenAI MMLU gain: +26.5pp, GPT-3.5 → GPT-5.5 (3.4 yrs)

Claude MMLU gain: +23.9pp, Claude 1 → Opus 4.8

Cost drop: ~10× per year, per equivalent performance

Frontier ceiling: ~97%+, MMLU near saturation
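The two gain figures follow directly from the first and last MMLU scores in each lab's timeline. The sketch below reproduces that arithmetic; the function names are illustrative, and because the page lists only months, the first day of each month is assumed.

```python
from datetime import date

def gain_pp(first_score: float, last_score: float) -> float:
    """Percentage-point gain between two benchmark scores."""
    return round(last_score - first_score, 1)

def span_years(start: date, end: date) -> float:
    """Elapsed time in years, counted in whole months."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    return round(months / 12, 1)

# Endpoint scores and release months as listed on this page.
# Day-of-month is an assumption (the page gives months only).
openai_gain = gain_pp(70.0, 96.5)   # GPT-3.5 -> GPT-5.5
claude_gain = gain_pp(73.0, 96.9)   # Claude 1 -> Opus 4.8
openai_span = span_years(date(2022, 11, 1), date(2026, 4, 1))

print(f"OpenAI: +{openai_gain}pp over {openai_span} yrs")
print(f"Claude: +{claude_gain}pp")
```

Rounding to one decimal place avoids floating-point noise in the subtraction.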

Chart: MMLU benchmark (%), higher is better. Series: OpenAI / ChatGPT and Anthropic / Claude. ● = latest releases (Apr–May 2026).

OpenAI / ChatGPT

Date · Model · MMLU · Notes

Nov '22 · GPT-3.5 · 70% · ChatGPT launch; 1M users in 5 days

Mar '23 · GPT-4 · 86.4% · Major leap; launched in Bing and ChatGPT

Nov '23 · GPT-4 Turbo · 87.4% · 128K context; updated knowledge cutoff

May '24 · GPT-4o · 88.7% · Omni-modal; free-tier access

Sep '24 · o1-preview · 90.8% · Chain-of-thought reasoning model

Dec '24 · o1 · 92.3% · Full o1 release; $200/mo Pro tier

Jan '25 · o3-mini · 93% · Compact, fast reasoning

Apr '25 · o3 / o4-mini · 96.7% · Top AIME 2024/25 benchmark scores

May '25 · GPT-5 · 96% · Unified reasoning + conversational AI

Nov '25 · GPT-5.1 · 96.4% · Adaptive reasoning; 8 personality presets

Feb '26 · GPT-5.2 · 96.8% · Improved polish; spreadsheet & finance tasks

Mar '26 · GPT-5.4 · 97.2% · Native computer use (75% OSWorld); 1M context; merges GPT-5.3-Codex line

Apr '26 · GPT-5.5 (NEW) · 96.5% · Codename 'Spud'; 82.7% Terminal-Bench 2.0; 88.7% SWE-bench; first fully-retrained base since GPT-4.5

Anthropic / Claude

Date · Model · MMLU · Notes

Mar '23 · Claude 1 · 73% · First public Claude via limited API

Jul '23 · Claude 2 · 78.5% · General public access; 100K context

Nov '23 · Claude 2.1 · 80% · 200K context; reduced hallucination rate

Mar '24 · Claude 3 Opus · 86.8% · Multimodal; noted self-awareness in tests

Jun '24 · Claude 3.5 Sonnet · 88.7% · Beats Opus at Sonnet price; Artifacts

Oct '24 · Claude 3.5 Sonnet v2 · 89.5% · Computer use capability; upgraded Haiku

Feb '25 · Claude 3.7 Sonnet · 91% · First hybrid reasoning model

May '25 · Claude 4 Opus · 94.5% · ASL-3 safety classification; Claude Code

Sep '25 · Claude 4.5 Sonnet · 95% · 77.2% SWE-bench; 30+ hour task focus

Nov '25 · Claude 4.5 Opus · 95.8% · First model >80% SWE-bench (80.9%)

Feb '26 · Claude Opus 4.6 · 96.5% · 1M context beta; 65.4% Terminal-Bench; adaptive thinking; agent teams

Mar '26 · Claude Sonnet 4.6 · 96.2% · Cost-efficient tier; released 12 days after Opus 4.6

Apr '26 · Claude Opus 4.7 · 96.7% · 87.6% SWE-bench Verified; xhigh effort tier; 3.3× vision resolution

May '26 · Claude Opus 4.8 (NEW) · 96.9% · 88.6% SWE-bench Verified; 1890 Elo GDPval-AA (lead); dynamic workflows; 41-day cadence
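The catch-up story in the two timelines can be checked by comparing each Claude release with the most recent OpenAI release available at that time. The sketch below is one way to do that, using only the 2022 to mid-2024 entries from this page. The helper name is illustrative, and it assumes that an OpenAI model released in the same month counts as already available.

```python
from bisect import bisect_right

# (year, month), model, MMLU % as listed in the timelines on this page.
OPENAI = [
    ((2022, 11), "GPT-3.5", 70.0),
    ((2023, 3), "GPT-4", 86.4),
    ((2023, 11), "GPT-4 Turbo", 87.4),
    ((2024, 5), "GPT-4o", 88.7),
]
CLAUDE = [
    ((2023, 3), "Claude 1", 73.0),
    ((2023, 7), "Claude 2", 78.5),
    ((2023, 11), "Claude 2.1", 80.0),
    ((2024, 3), "Claude 3 Opus", 86.8),
    ((2024, 6), "Claude 3.5 Sonnet", 88.7),
]

def gap_at_each_claude_release(openai, claude):
    """For each Claude release, return (model, rival, OpenAI lead in pp)."""
    months = [month for month, _, _ in openai]
    rows = []
    for month, model, score in claude:
        # Index of the latest OpenAI release at or before this month.
        idx = bisect_right(months, month) - 1
        if idx < 0:
            continue  # no OpenAI release to compare against yet
        _, rival, rival_score = openai[idx]
        rows.append((model, rival, round(rival_score - score, 1)))
    return rows

for model, rival, gap in gap_at_each_claude_release(OPENAI, CLAUDE):
    print(f"{model} vs {rival}: OpenAI ahead by {gap}pp")
```

With these inputs the gap starts at 13.4pp (Claude 1 vs GPT-4) and reaches 0.0pp at Claude 3.5 Sonnet vs GPT-4o.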

Part 2 — Capabilities Benchmarks (May 2026)

MMLU measures what models know. These benchmarks measure what they can actually do: write and deploy working code, navigate a computer autonomously, and complete real professional tasks. SWE-bench Verified is now saturating around 88–89% for both labs, so the contested coding benchmark has moved to SWE-bench Pro. Both frontier models now exceed the human baseline on OSWorld.

⚠ OSWorld human baseline: 72.4%. Both frontier models now exceed it (GPT-5.5 at 78.7%, Opus 4.8 at 83.4%), putting autonomous desktop navigation past the "better than humans" threshold on this benchmark.

Chart: score, higher is better. GDPval-AA is in Elo points (scale differs).

SWE-bench Pro

Harder, less-contaminated software engineering benchmark — predicts production coding performance

GPT-5.5: 58.6%

Claude Opus 4.8: 69.2% 🏆

Opus 4.8 leads by 10.6 points. SWE-bench Verified is now saturating (~88–89% for both labs), so Pro is the meaningful differentiator.

Terminal-Bench

Agentic coding in real terminal environments — planning, iteration, tool coordination

GPT-5.5: 78.2% 🏆

Claude Opus 4.8: 74.6%

GPT-5.5 leads by 3.6 points on Terminal-Bench 2.1. GPT-5.5 also holds the Terminal-Bench 2.0 record at 82.7%.

OSWorld

Autonomous desktop navigation via screenshots + keyboard/mouse

GPT-5.5: 78.7%

Claude Opus 4.8: 83.4% 🏆

Human baseline: 72.4%

Human baseline: 72.4%. Both frontier models now exceed human performance; Opus 4.8 leads by 4.7 points.
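The head-to-head leads quoted for the three percentage-based benchmarks, and each model's margin over the OSWorld human baseline, are simple differences. A minimal sketch of that arithmetic, with the scores copied from this page and illustrative helper names:

```python
# Scores (%) as listed on this page; Terminal-Bench figures are for version 2.1.
SCORES = {
    "SWE-bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.8": 69.2},
    "Terminal-Bench 2.1": {"GPT-5.5": 78.2, "Claude Opus 4.8": 74.6},
    "OSWorld": {"GPT-5.5": 78.7, "Claude Opus 4.8": 83.4},
}
OSWORLD_HUMAN_BASELINE = 72.4

def leader_and_margin(scores):
    """Return (leading model, lead over the runner-up in points)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (leader, top), (_, runner_up) = ranked[0], ranked[1]
    return leader, round(top - runner_up, 1)

def margin_over_baseline(scores, baseline):
    """Return each model's score minus the baseline, in points."""
    return {model: round(score - baseline, 1) for model, score in scores.items()}

for benchmark, scores in SCORES.items():
    leader, margin = leader_and_margin(scores)
    print(f"{benchmark}: {leader} leads by {margin} points")

print(margin_over_baseline(SCORES["OSWorld"], OSWORLD_HUMAN_BASELINE))
```

GDPval-AA is left out because it is reported in Elo points, which are not on the same scale as percentages.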

GDPval-AA

Economically valuable tasks across 44 professions, including finance and legal

GPT-5.5: 1769 Elo

Claude Opus 4.8: 1890 Elo 🏆

Opus 4.8 leads by 121 Elo (~67% expected win rate). Largest competitive gap on the page.
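The "~67% expected win rate" is consistent with the standard logistic Elo formula on a 400-point scale. The sketch below assumes GDPval-AA ratings follow that convention, which this page does not state explicitly; the function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumes the conventional base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

opus = 1890   # Claude Opus 4.8, GDPval-AA Elo as listed on this page
gpt = 1769    # GPT-5.5, GDPval-AA Elo as listed on this page

print(f"Gap: {opus - gpt} Elo")
print(f"Expected win rate for Opus 4.8: {elo_expected_score(opus, gpt):.1%}")
```

Under that assumption, a 121-point gap gives the higher-rated model an expected score of roughly two in three.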

Benchmark notes: MMLU scores are official where available; recent entries (marked * in the chart) are estimated from the ArtificialAnalysis Intelligence Index and relative positioning. MMLU has been near saturation (roughly 96–97%) since late 2025 and is no longer the primary frontier differentiator. Capabilities benchmark scores are sourced from Anthropic and OpenAI launch posts, ArtificialAnalysis, TokenMix, and llm-stats.com (May 2026). SWE-bench Verified is saturating (~88–89% for both labs); SWE-bench Pro is now the more representative coding benchmark. Terminal-Bench moved to version 2.1 in May 2026. GDPval-AA is measured in Elo points and is not directly comparable to percentage-based benchmarks. Mythos-class preview models from Anthropic are excluded (not generally available).

For the latest information: llm-stats.com — live benchmark tracking across 50+ evaluations and 20+ API providers.
