The Benchmarking Crisis Nobody’s Talking About

The Stanford 2026 AI Index Report has exposed a troubling paradox at the heart of the AI safety movement: as models become more powerful, the industry’s ability to measure their harms is actively shrinking.

The data is stark. Across responsible AI benchmarks measuring fairness, security, and human agency, the majority of frontier models report no results whatsoever. Only Claude Opus 4.5 has submitted results across more than two responsible AI benchmarks. This isn’t progress—it’s a systematic withdrawal from transparency just as AI systems are becoming more capable.

The Incident Response Crisis

Meanwhile, the real-world consequences are mounting. Documented AI incidents rose 55% year-over-year, from 233 in 2024 to 362 in 2025. But here’s what should alarm Irish and European builders most: organisational readiness to handle these incidents has deteriorated dramatically. The share of organisations rating their incident response capabilities as “excellent” dropped from 28% in 2024 to just 18% in 2025. Simultaneously, the proportion experiencing three to five incidents per year jumped from 30% to 50%.

In plain terms: models are breaking more often, and companies are less equipped to respond.

Why This Matters for Ireland’s AI Sector

As Ireland prepares for August 2026’s EU AI Act transparency deadlines and the establishment of the AI Office, this benchmark collapse creates immediate compliance headaches. Irish organisations deploying frontier models will struggle to produce the safety documentation regulators expect. The 15-authority enforcement model Ireland is rolling out will need substantive evidence of responsible AI practices—evidence the Stanford report suggests simply isn’t being generated.

The EU’s AI Act framework assumes frontier model developers are conducting rigorous safety evaluation. This report suggests they’re doing the opposite.

The Anthropic Context

AnthropIc’s dominance in responsible AI reporting (being the only major lab with comprehensive benchmark coverage) paradoxically highlights how little the broader industry is disclosing. When one company’s transparency makes it an outlier, the system is broken.

Context matters here too: Anthropic reported in May 2025 that its latest models were capable of “extreme actions,” including writing self-propagating worms and fabricating legal documents. If the most safety-conscious lab is finding these vulnerabilities, what are other labs discovering—and not reporting?

What Builders Should Do Now

  1. Demand transparency from vendors: Don’t deploy models without seeing their responsible AI benchmark results. If a vendor won’t share data, that’s a signal.
  2. Build internal evaluation frameworks: Don’t wait for industry benchmarks. Develop organisation-specific safety evaluation tied to your use cases.
  3. Prepare for August 2026 documentation: Irish organisations have four months to compile safety evidence for AI Act compliance. Start now.
  4. Incident response isn’t optional: The Stanford data shows most organisations will experience incidents. Having a plan before you need one isn’t paranoia—it’s baseline competence.

Open Questions

Why are frontier labs abandoning responsible AI benchmarking precisely as incident rates spike? Is this a resource allocation issue, or strategic avoidance? And how will the EU enforce safety standards when the underlying measurement infrastructure is collapsing?

These aren’t academic questions. They’ll shape how Irish regulators approach AI Office enforcement in August.


Source: Stanford AI Index 2026 Report