Why actAVA Built χ-Bench
Last week, two of the actAVA co-founders, Frank Wang and Dr. Weiran Yao, were interviewed about the launch of the actAVA χ-Bench. One question kept coming up: "Why did you spend so much energy building an evaluation benchmark?" The answer is simple. Too many healthcare AI companies sound alike right now. Same demo. Same pitch deck. Same promise: "production-ready agents for prior authorization, utilization management, and care management." We built a focused benchmark to separate marketing from reality.

Every Healthcare AI Company Sounds the Same Right Now
From a healthcare buyer's perspective, it is very hard to tell who is truly production-ready and who is only demo-ready. That is a real problem — and not a small one. Healthcare AI is one of the biggest enterprise opportunities of the decade, but the gap between an impressive demo and a reliable production workflow is still enormous.
30 Agent Configurations. 75 Real Workflows. No Shortcuts.
We evaluated 30 frontier agent configurations across 75 real healthcare workflows spanning prior authorization, utilization management, and care management. Each task can take 60 to 80 steps across four to six distinct stages. State carries forward from stage to stage, and a step cannot be retried once it is committed.
The environment includes 21 healthcare applications, 200+ MCP tools, and a 1,279-document managed-care operations handbook. The agent is not answering a clinical question in isolation. It is driving a full administrative transaction — from chart pull to final determination — through the same policy-dense, multi-role, multi-system environment a real provider or payer operates in.
[Figure: The χ-Bench Environment]
The Results
The best agent, Claude Code running Opus 4.6, achieved 28% pass@1 on prior authorization tasks. On fully end-to-end multi-agent provider-payer scenarios, pass rates are still near zero. These are some of the best models and agent systems available today.
They are powerful. They are not yet reliable enough for many real healthcare operations. That gap is the entire game.
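For readers unfamiliar with the metric, pass@1 is commonly computed as the share of tasks an agent completes successfully on a single attempt. The sketch below illustrates that general definition only. The TaskResult type, the pass_at_1 function, and the sample outcomes are hypothetical, and the paper remains the authority on how χ-Bench actually scores a run.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcomes for one task (hypothetical structure, not the χ-Bench schema)."""
    task_id: str
    attempts: list[bool]  # True if an independent attempt passed


def pass_at_1(results: list[TaskResult]) -> float:
    """Average, over tasks, of the fraction of attempts that passed.

    With one attempt per task this is simply the share of tasks passed.
    With several attempts it is the standard estimate of the chance
    that a single attempt succeeds.
    """
    if not results:
        raise ValueError("need at least one task result")
    per_task = [sum(r.attempts) / len(r.attempts) for r in results]
    return sum(per_task) / len(per_task)


# Made-up outcomes for illustration only: one pass out of four tasks.
sample = [
    TaskResult("task-a", [True]),
    TaskResult("task-b", [False]),
    TaskResult("task-c", [False]),
    TaskResult("task-d", [False]),
]
print(f"pass@1 = {pass_at_1(sample):.0%}")
```

Read this way, a 28% pass@1 means that roughly seven in ten first attempts do not complete the task.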
Four Reasons. One Honest Answer.
Healthcare buyers need a way to cut through the marketing
χ-Bench gives procurement teams an open, reproducible scorecard. It lets them compare agent systems on evidence — not slide decks, not curated demos, not vendor-selected case studies. A shared benchmark protects buyers and raises the bar for everyone building.
Frontier models plus agent reasoning are not solved for healthcare yet — naming that honestly is the first step
Not by us. Not by anyone. The 28% number is not a failure; it is a starting line. The field can only improve what it is willing to measure. We would rather put the real number out than let the industry sleepwalk into deployment at a 28% first-attempt pass rate on transactions that affect patient access to care.
Healthcare AI is too important to evaluate casually
A healthcare agent must follow policy, use tools correctly, respect role boundaries, maintain auditability, and make the right decision at the right time. A benchmark that only tests one of those dimensions gives a false sense of readiness. χ-Bench tests all of them, simultaneously, in the same task run.
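One way to picture "all of them, simultaneously" is a gate in which a run passes only when every dimension holds at once. The sketch below is an assumption-laden illustration, not χ-Bench's actual rubric: the RunChecks fields simply mirror the five requirements listed above, and the function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunChecks:
    """Per-run outcomes, one flag per dimension (hypothetical names)."""
    followed_policy: bool
    used_tools_correctly: bool
    respected_role_boundaries: bool
    maintained_auditability: bool
    right_decision_at_right_time: bool


def run_passes(checks: RunChecks) -> bool:
    """A run passes only if every dimension passed in the same run."""
    return all(getattr(checks, f.name) for f in fields(checks))


def failed_dimensions(checks: RunChecks) -> list[str]:
    """Names of the dimensions that failed, in declaration order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


# Illustrative run: the decision was correct, but a role boundary was crossed.
run = RunChecks(
    followed_policy=True,
    used_tools_correctly=True,
    respected_role_boundaries=False,
    maintained_auditability=True,
    right_decision_at_right_time=True,
)
print(run_passes(run))
print(failed_dimensions(run))
```

Under a gate like this, an agent that gets the determination right while violating a role boundary still fails the run, which is the false sense of readiness a single-dimension benchmark would miss.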
Adoption moves faster when everyone agrees on what "working" means
A shared benchmark gives model builders, agent frameworks, healthcare startups, providers, payers, and investors a common language. Right now, "production-ready" means something different to every vendor. χ-Bench is our attempt to make it mean something specific, reproducible, and independent of who is selling what.
Built in the Open. Built with the People Who Do the Work.
We open-sourced χ-Bench under Apache 2.0, launched a live leaderboard, and built the benchmark with input from a coalition of 20+ clinical and academic institutions — including Johns Hopkins Medicine, Wellstar Health System, Yale School of Medicine, Stanford, Carnegie Mellon, the University of Oxford, USC, UCSD, Brown, Emory, and the University of Washington, among others.
The benchmark code, dataset, and the full 1,279-document operations handbook are public. The leaderboard is live. If you are building serious agent systems for healthcare, the infrastructure is there to run yours against the same tasks we used.
Today's healthcare agents have a long way to go. The work ahead is real. The opportunity is enormous. And we would rather measure the gap honestly than pretend the industry has already solved it.
The Floor Is What We're Raising
χ-Bench is not about positioning actAVA above the field. It is about dragging the entire field toward a shared, honest definition of what production-grade healthcare AI actually requires. Model builders, harness builders, healthcare startups, providers, and payers — everyone benefits when the measurement is real.
The work actAVA KORA does, with its RED layer for testing and remediation and its GREEN layer for continuous improvement, is built exactly for this gap. The goal is not to patch over the 72% of prior authorization attempts that fail today, but to systematically shrink it. χ-Bench tells us where the work is. The platform does the work.
The benchmark, dataset, and 1,279-doc handbook are open under Apache 2.0. Submit at actava.ai/benchmarks · Read the paper at arXiv 2605.16679 · Dataset at HuggingFace
Sources
- actAVA. χ-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? arXiv 2605.16679, May 2026. Primary source for all benchmark methodology and results.
- actAVA. CHI-Bench v1.0.0 — Live Leaderboard and Task Browser. Full results, submission portal, and benchmark documentation.
- actAVA. chi-bench dataset — HuggingFace. Open dataset including all 75 tasks, the operations handbook, and evaluation rubrics. Apache 2.0.
- Z-Potentials. Z-Potentials Substack. 150-minute interview with Kevin Riley and Dr. Weiran Yao on χ-Bench — episode forthcoming.