
Why actAVA Built χ-BENCH

Last week, two of the actAVA co-founders, Frank Wang and Dr. Weiran Yao, were interviewed about the launch of the actAVA χ-BENCH. One question kept coming up: "Why did you spend so much energy building an evaluation benchmark?" The answer is rather simple. Too many healthcare AI companies sound similar right now. Same demo. Same pitch deck. Same promise: "production-ready agents for prior authorization, utilization management, and care management." We decided to create a focused benchmark to tell marketing from reality.

By Haolin Chen

5 min read·May 25, 2026

Every Healthcare AI Company Sounds the Same Right Now

From a healthcare buyer's perspective, it is very hard to tell who is truly production-ready and who is only demo-ready. That is a real problem — and not a small one. Healthcare AI is one of the biggest enterprise opportunities of the decade, but the gap between an impressive demo and a reliable production workflow is still enormous.

We built χ-Bench to measure that gap. Not to market around it. Not to claim we had already closed it. To put a number on it so the entire field could work from the same honest starting point.

WHAT WE TESTED

30 Agent Configurations. 75 Real Workflows. No Shortcuts.

We evaluated 30 frontier agent configurations across 75 real healthcare workflows spanning prior authorization, utilization management, and care management. Each task can take 60 to 80 steps — four to six distinct stages, with state that carries forward and cannot be retried once committed.
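To make "state that carries forward and cannot be retried once committed" concrete, here is a minimal sketch of a commit-once workflow. It is an illustration only, not the χ-Bench harness: the `Workflow` class, the exception, and the two middle stage names are hypothetical.

```python
class StageAlreadyCommitted(Exception):
    """Raised when an agent tries to redo a stage it has already committed."""


class Workflow:
    """Hypothetical multi-stage task: ordered stages, commit-once semantics."""

    def __init__(self, stages: list[str]) -> None:
        self.stages = stages
        self.state: dict[str, str] = {}  # outputs carried forward to later stages
        self.position = 0                # index of the next stage to commit

    def commit(self, stage: str, output: str) -> None:
        if stage in self.state:
            # No retries: a committed stage is final.
            raise StageAlreadyCommitted(stage)
        if self.position >= len(self.stages) or stage != self.stages[self.position]:
            raise ValueError(f"unexpected stage {stage!r}")
        self.state[stage] = output
        self.position += 1


# Stage names are assumptions; only the first and last echo the text.
wf = Workflow(["chart_pull", "policy_check", "clinical_review", "final_determination"])
wf.commit("chart_pull", "chart-123")

retry_blocked = False
try:
    wf.commit("chart_pull", "chart-456")  # attempt to redo a committed stage
except StageAlreadyCommitted:
    retry_blocked = True

print(retry_blocked)  # True
print(wf.state)       # {'chart_pull': 'chart-123'}
```

The point of the design is that an early mistake stays in `state` and shapes every later stage, which is part of what makes long-horizon tasks harder than single-turn questions.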

The environment includes 21 healthcare applications, 200+ MCP tools, and a 1,279-document managed-care operations handbook. The agent is not answering a clinical question in isolation. It is driving a full administrative transaction — from chart pull to final determination — through the same policy-dense, multi-role, multi-system environment a real provider or payer operates in.

The χ-Bench Environment

Each task puts an agent through a real administrative transaction — policy checks, role switches, tool calls, and a final irreversible determination.

The Results

The best agent on prior authorization tasks, Claude Code running Opus 4.6, achieved 28% pass@1. Pass rates on fully end-to-end multi-agent provider-payer scenarios are still near zero. These are some of the best models and agent systems available today.

They are powerful. They are not yet reliable enough for many real healthcare operations. That gap is the entire game.

28%: Best pass@1, Claude Code + Opus 4.6 on prior authorization tasks

29.3%: Best UM pass@1, Codex + GPT-5.5, the top-performing configuration on utilization management

~0%: End-to-end provider-payer automation, the full multi-agent handoff no configuration has yet solved

30: Frontier agent configurations evaluated, every major model and harness combination we could test
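For readers unfamiliar with the metric, pass@1 is commonly read as the chance that a single attempt at a task succeeds, averaged over tasks. The sketch below uses the widely used unbiased pass@k estimator with k = 1. Whether χ-Bench computes its numbers exactly this way is an assumption here, and the outcomes in `example` are made up for illustration, not benchmark data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k sample must contain at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average pass@1 across tasks; each task maps to its attempt outcomes."""
    per_task = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes for four tasks with four attempts each.
example = {
    "task_a": [True, False, False, False],
    "task_b": [False, False, False, False],
    "task_c": [True, True, False, False],
    "task_d": [False, False, False, False],
}
score = mean_pass_at_1(example)
print(f"pass@1 = {score:.2%}")  # pass@1 = 18.75%
```

With k = 1 the estimator reduces to the plain fraction of passing attempts per task, so a 28% pass@1 means roughly one successful run in four on a first try.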

WHY WE BUILT IT

Four Reasons. One Honest Answer.

Healthcare buyers need a way to cut through the marketing

χ-Bench gives procurement teams an open, reproducible scorecard. It lets them compare agent systems on evidence — not slide decks, not curated demos, not vendor-selected case studies. A shared benchmark protects buyers and raises the bar for everyone building.

Frontier models plus agent reasoning are not solved for healthcare yet — naming that honestly is the first step

Not by us. Not by anyone. The 28% number is not a failure — it is a starting line. The field can only improve what it is willing to measure. We would rather put the real number out than let the industry sleepwalk into deployment at 28% reliability on transactions that affect patient access to care.

Healthcare AI is too important to evaluate casually

A healthcare agent must follow policy, use tools correctly, respect role boundaries, maintain auditability, and make the right decision at the right time. A benchmark that only tests one of those dimensions gives a false sense of readiness. χ-Bench tests all of them, simultaneously, in the same task run.
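One way to picture "all of them, simultaneously" is a conjunctive check: a run counts as a pass only if every dimension passes. The sketch below is a hypothetical illustration of that idea. The field names mirror the dimensions listed above, but they are not the benchmark's actual rubric.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunChecks:
    """Hypothetical per-run checks, one per dimension named in the text."""
    followed_policy: bool
    used_tools_correctly: bool
    respected_role_boundaries: bool
    audit_trail_complete: bool
    correct_decision: bool


def failed_dimensions(checks: RunChecks) -> list[str]:
    """Names of the dimensions this run failed, in declaration order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


def run_passes(checks: RunChecks) -> bool:
    """A run passes only when no dimension failed."""
    return not failed_dimensions(checks)


# The agent reached the right decision but acted outside its role.
run = RunChecks(
    followed_policy=True,
    used_tools_correctly=True,
    respected_role_boundaries=False,
    audit_trail_complete=True,
    correct_decision=True,
)
print(run_passes(run))         # False
print(failed_dimensions(run))  # ['respected_role_boundaries']
```

Under this kind of scoring, a correct final determination reached the wrong way still counts as a failure, which is why single-dimension benchmarks can overstate readiness.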

Adoption moves faster when everyone agrees on what "working" means

A shared benchmark gives model builders, agent frameworks, healthcare startups, providers, payers, and investors a common language. Right now, "production-ready" means something different to every vendor. χ-Bench is our attempt to make it mean something specific, reproducible, and independent of who is selling what.

OPEN SCIENCE

Built in the Open. Built with the People Who Do the Work.

We open-sourced χ-Bench under Apache 2.0, launched a live leaderboard, and built the benchmark with input from a coalition of 20+ clinical and academic institutions — including Johns Hopkins Medicine, Wellstar Health System, Yale School of Medicine, Stanford, Carnegie Mellon, the University of Oxford, USC, UCSD, Brown, Emory, and the University of Washington, among others.

Coalition members shown: Johns Hopkins Medicine, Wellstar Health System, Yale School of Medicine, Stanford University, Carnegie Mellon, University of Oxford, USC, UCSD, Brown, Emory, UW, Northeastern, plus 12 more institutions.

The benchmark code, dataset, and the full 1,279-document operations handbook are public. The leaderboard is live. If you are building serious agent systems for healthcare, the infrastructure is there to run yours against the same tasks we used.

Today's healthcare agents have a long way to go. The work ahead is real. The opportunity is enormous. And we would rather measure the gap honestly than pretend the industry has already solved it.

The Floor Is What We're Raising

χ-Bench is not about positioning actAVA above the field. It is about dragging the entire field toward a shared, honest definition of what production-grade healthcare AI actually requires. Model builders, harness builders, healthcare startups, providers, and payers — everyone benefits when the measurement is real.

The work actAVA KORA does — the RED layer for testing and remediation, the GREEN layer for continuous improvement — is built exactly for this gap. Not to patch over the 72% that fails today, but to systematically shrink it. χ-Bench tells us where the work is. The platform does the work.

Run your agent on the leaderboard.
The benchmark, dataset, and 1,279-doc handbook are open under Apache 2.0. Submit at actava.ai/benchmarks · Read the paper at arXiv 2605.16679 · Dataset at HuggingFace

Sources

actAVA. χ-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? arXiv 2605.16679, May 2026. Primary source for all benchmark methodology and results.
actAVA. CHI-Bench v1.0.0 — Live Leaderboard and Task Browser. Full results, submission portal, and benchmark documentation.
actAVA. chi-bench dataset — HuggingFace. Open dataset including all 75 tasks, the operations handbook, and evaluation rubrics. Apache 2.0.
Z-Potentials. Z-Potentials Substack. 150-minute interview with Kevin Riley and Dr. Weiran Yao on χ-Bench — episode forthcoming.

Written by

Haolin Chen

Lead AI Researcher
