Foxxe Labs
Part 1 — Knowledge Benchmark (MMLU)

MMLU (Massive Multitask Language Understanding) was the dominant benchmark from 2022 to 2025, tracking broad knowledge across 57 academic disciplines. Both labs' frontier models now sit near saturation at 97%+, making it a poor differentiator for current models — but the historical arc is striking.
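Mechanically, MMLU scoring is plain multiple-choice accuracy: the fraction of questions where the model picks the keyed answer (A–D), averaged over the question set. A minimal illustrative sketch — the three questions below are hypothetical placeholders, not real MMLU items:

```python
# MMLU-style scoring is multiple-choice accuracy: the fraction of
# questions where the model's pick matches the keyed answer (A-D).
# These three entries are hypothetical placeholders.
questions = [
    {"subject": "college_physics",     "answer": "B", "prediction": "B"},
    {"subject": "high_school_history", "answer": "D", "prediction": "A"},
    {"subject": "microeconomics",      "answer": "C", "prediction": "C"},
]
correct = sum(q["prediction"] == q["answer"] for q in questions)
print(f"MMLU-style accuracy: {correct / len(questions):.1%}")  # 66.7%
```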

At a glance:
- OpenAI MMLU gain: +27.2pp (GPT-3.5 → GPT-5.4, 3.3 yrs)
- Claude MMLU gain: +23.2pp (Claude 1 → Sonnet 4.6)
- Cost drop: ~10× per year at equivalent performance
- Frontier ceiling: ~97%+ (MMLU near saturation)
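The headline numbers above follow directly from the timeline scores below, and the cost figure compounds steeply over the window. A quick sanity-check sketch (the ~10×/yr cost estimate is this page's own figure):

```python
# Reproduce the headline stats from the timeline scores below.
gpt35, gpt54 = 70.0, 97.2    # OpenAI MMLU %, Nov '22 -> Mar '26
claude1, s46 = 73.0, 96.2    # Claude MMLU %, Mar '23 -> Mar '26

print(f"OpenAI gain: +{gpt54 - gpt35:.1f}pp")  # +27.2pp over 3.3 yrs
print(f"Claude gain: +{s46 - claude1:.1f}pp")  # +23.2pp over 3.0 yrs

# A ~10x/yr cost drop compounds to roughly 10**3.3 (~2,000x)
# over the 3.3-year GPT-3.5 -> GPT-5.4 window.
print(f"Implied cumulative cost drop: ~{10 ** 3.3:,.0f}x")
```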
MMLU BENCHMARK (%) — HIGHER IS BETTER
OpenAI / ChatGPT

Date    | Model         | MMLU  | Notes
Nov '22 | GPT-3.5       | 70.0% | ChatGPT launch; 1M users in 5 days
Mar '23 | GPT-4         | 86.4% | Major leap; launched in Bing and ChatGPT
Nov '23 | GPT-4 Turbo   | 87.4% | 128K context; updated knowledge cutoff
May '24 | GPT-4o        | 88.7% | Omni-modal; free-tier access
Sep '24 | o1-preview    | 90.8% | Chain-of-thought reasoning model
Dec '24 | o1            | 92.3% | Full o1 + $200/mo Pro tier
Jan '25 | o3-mini       | 93.0% | Compact, fast reasoning
Apr '25 | o3 / o4-mini  | 96.7% | Top AIME 2024/25 benchmark scores
May '25 | GPT-5         | 96.0% | Unified reasoning + conversational AI
Nov '25 | GPT-5.1       | 96.4% | Adaptive reasoning; 8 personality presets
Feb '26 | GPT-5.2       | 96.8% | Improved polish; spreadsheet & finance tasks
Mar '26 | GPT-5.4 (new) | 97.2% | Native computer use (75% OSWorld); 1M context; merges GPT-5.3-Codex line
Anthropic / Claude

Date    | Model                   | MMLU  | Notes
Mar '23 | Claude 1                | 73.0% | First public Claude via limited API
Jul '23 | Claude 2                | 78.5% | General public access; 100K context
Nov '23 | Claude 2.1              | 80.0% | 200K context; reduced hallucination rate
Mar '24 | Claude 3 Opus           | 86.8% | Multimodal; noted self-awareness in tests
Jun '24 | Claude 3.5 Sonnet       | 88.7% | Beats Opus at Sonnet price; Artifacts
Oct '24 | Claude 3.5 Sonnet v2    | 89.5% | Computer use capability; upgraded Haiku
Feb '25 | Claude 3.7 Sonnet       | 91.0% | First hybrid reasoning model
May '25 | Claude 4 Opus           | 94.5% | ASL-3 safety classification; Claude Code
Sep '25 | Claude 4.5 Sonnet       | 95.0% | 77.2% SWE-bench; 30+ hour task focus
Nov '25 | Claude 4.5 Opus         | 95.8% | First model >80% SWE-bench (80.9%)
Feb '26 | Claude Opus 4.6         | 96.5% | 1M context beta; 65.4% Terminal-Bench; adaptive thinking; agent teams
Mar '26 | Claude Sonnet 4.6 (new) | 96.2% | Cost-efficient tier; released 12 days after Opus 4.6
Part 2 — Capabilities Benchmarks (Mar 2026)

MMLU measures what models know. These benchmarks measure what they can actually do — write and deploy working code, navigate a computer autonomously, and complete real professional tasks. This is where the frontier labs are differentiating in 2026.

OSWorld human baseline: 72.4%. GPT-5.4 (75.0%) is the first model to surpass human performance on autonomous desktop navigation.
SCORE — HIGHER IS BETTER  |  GDPval-AA IN ELO POINTS (SCALE DIFFERS)
SWE-bench
Real-world software engineering tasks resolved autonomously
- GPT-5.4: 79.2%
- Claude 4.5 Opus: 80.9% 🏆
Claude 4.5 Opus holds the published record (80.9%); GPT-5.4 estimated at ~79% based on relative positioning*
Terminal-Bench
Agentic coding in real terminal environments
- GPT-5.2 + Codex: 64.7%
- Claude Opus 4.6: 65.4% 🏆
Claude Opus 4.6 holds the record (65.4%); GPT-5.2 Codex CLI at 64.7%. GPT-5.4 score pending independent verification.
OSWorld
Autonomous desktop navigation via screenshots + keyboard/mouse
- GPT-5.4: 75.0% 🏆
- Claude Opus 4.6: 72.7%
- Human baseline: 72.4%
GPT-5.4 is the first model to surpass the human baseline on this benchmark.
GDPval-AA
Economically valuable tasks across finance, legal & 44 professions
- GPT-5.2: 1462 Elo
- Claude Opus 4.6: 1606 Elo 🏆
Claude Opus 4.6 leads by ~144 Elo (~70% win rate). GPT-5.4 GDPval score pending; GPT-5.2 shown for comparison.
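
The ~70% figure follows from the standard Elo expected-score formula, which converts a rating gap into a head-to-head win probability. A minimal sketch (the formula is generic to any Elo system, not specific to GDPval-AA):

```python
# Convert an Elo rating gap into an expected head-to-head win rate
# using the standard Elo expected-score formula. Generic to Elo
# systems; not a GDPval-AA-specific method.
def expected_win_rate(elo_diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(f"{expected_win_rate(1606 - 1462):.0%}")  # ~70% for a 144-point gap
```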

Benchmark notes: MMLU scores — official where available; recent entries (*) estimated from ArtificialAnalysis Intelligence Index and relative positioning. MMLU is near-saturated at 97%+ and no longer the primary frontier differentiator. Capabilities benchmark scores sourced from official model cards, TechCrunch, DigitalApplied, and ArtificialAnalysis (March 2026). GDPval-AA is measured in Elo points and is not directly comparable to percentage-based benchmarks. Independent verification of GPT-5.4 Terminal-Bench and GDPval scores is still emerging.

For the latest information: llm-stats.com — live benchmark tracking across 50+ evaluations and 20+ API providers.