χ-BENCH Update: Frontier Agents Complete Only 28% of Complex Healthcare Workflows
As healthcare operations shift toward agentic AI, actAVA's new χ-BENCH benchmark reveals that today's frontier agents are not yet ready to run complex workflows independently. Developed alongside leading academic and clinical institutions, the benchmark found that the best-performing agent setup completed only 28% of complex administrative workflows on the first try, consistency never exceeded 8%, and performance dropped to 0% at critical provider-payer handoffs. Because these policy-heavy workflows carry high stakes, where failures can delay care, increase costs, and introduce compliance risks, healthcare leaders must prioritize systems that can be safely measured and governed rather than automating blindly. To address this trust gap, actAVA offers a purpose-built agent-lifecycle platform that provides the orchestration, safety guardrails, and auditability required to reliably scale AI across enterprise operations.
By Haolin Chen
Healthcare is entering a new phase of AI adoption. Across the industry, organizations are moving beyond chatbots and copilots toward agentic AI systems that promise to complete complex work — across prior authorization, utilization management, revenue cycle, care management, and provider-payer coordination. That promise is real. And so is the risk.
At actAVA, we believe AI agents will play a major role in the future of healthcare operations. We also believe they must be measured with the same seriousness as the workflows they are being asked to perform. That is why we published χ-BENCH — a new benchmark designed to test whether agentic AI systems can reliably complete real-world, complex healthcare administrative workflows.
Frontier Agents Complete Only 28% of Complex Healthcare Workflows
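The findings are clear: the best-performing agent setup completed only 28% of complex administrative workflows on the first try, consistency never exceeded 8%, and performance dropped to 0% at critical provider-payer handoffs. Today's frontier agents are not ready to run healthcare's complex workflows independently.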
That should give every healthcare leader pause. And it should clarify the mandate: healthcare does not need AI that looks impressive in a demo and breaks at the payer handoff. It needs systems that can be measured, governed, audited, and safely orchestrated before they touch mission-critical operations.
Current AI Benchmarks Are Not Enough
The issue is not whether AI will enter healthcare administration. It already is. The issue is whether organizations can reliably detect when AI fails, contain that failure, and build the deterministic infrastructure required to operate safely at scale.
Administrative healthcare workflows are not simple web tasks. They are policy-heavy, multi-step, multi-role processes with hidden state, handoffs, exceptions, and real operational consequences. A small failure can create downstream costs, delay care, increase friction, or introduce compliance risks. And yet the benchmarks most commonly used to evaluate AI agents — drawn from general web navigation, question-answering, or isolated API interactions — bear little resemblance to these environments.
Built in partnership with 20+ clinical, academic, and research collaborators, χ-BENCH covers three domains — prior authorization, utilization management, and care management — across 75 tasks, each requiring 60–80 agent steps and involving 4–6 distinct role-composed stages. Every evaluation is grounded in a 1,279-page medical policy corpus, deterministic rubrics, and real healthcare application environments.
Three Properties That Break General-Purpose Agents
01 — Long-Horizon
State that commits and cannot retry
A single prior authorization case runs 60–80 steps across intake, case construction, documentation review, and final disposition. Once a stage commits, it cannot be retried. One wrong site-of-service classification cascades into multiple scorecard failures — the kind of error that compounds silently in production.
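To make the commit-once constraint concrete, here is a minimal Python sketch. It is not the χ-BENCH harness: the `CaseRun` class, the stage identifiers, and the `site_of_service` field are hypothetical names chosen to mirror the stages described above.

```python
from dataclasses import dataclass, field


class StageAlreadyCommitted(Exception):
    """Raised when a stage that has already committed is submitted again."""


@dataclass
class CaseRun:
    # Hypothetical stage names, modeled on the stages named in the text.
    stages: tuple = (
        "intake",
        "case_construction",
        "documentation_review",
        "final_disposition",
    )
    committed: dict = field(default_factory=dict)

    def commit(self, stage: str, output: dict) -> None:
        # A committed stage is final: no retries, no overwrites.
        if stage in self.committed:
            raise StageAlreadyCommitted(stage)
        if len(self.committed) == len(self.stages):
            raise ValueError("all stages have already committed")
        # Stages must commit in order, so later stages inherit earlier output.
        expected = self.stages[len(self.committed)]
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.committed[stage] = dict(output)


run = CaseRun()
run.commit("intake", {"site_of_service": "outpatient"})
# A second run.commit("intake", ...) would raise StageAlreadyCommitted,
# so a wrong classification here stays in the record for every later stage.
```

Because later stages read from `committed`, an early mistake is carried forward rather than corrected, which is the cascading behavior described above.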
02 — Role-Composed
One agent, many seats
A single agent must play clinical intake clerk, PA coordinator, clinical reviewer, and submitter — each with different permissions, different tools, and different artifact obligations. Without a harness that enforces role boundaries, models collapse them and produce outputs that no single human in that role would have generated.
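One way a harness might enforce role boundaries is a per-role tool allowlist that is checked before every tool call. The sketch below is illustrative only: the role keys follow the roles named above, but the tool names, `ROLE_TOOLS`, and `call_tool` are assumptions, not part of χ-BENCH.

```python
# Hypothetical role-to-tool permissions; tool names are invented for illustration.
ROLE_TOOLS = {
    "clinical_intake_clerk": {"read_referral", "create_case"},
    "pa_coordinator": {"attach_document", "build_case"},
    "clinical_reviewer": {"read_policy", "record_review"},
    "submitter": {"submit_to_payer"},
}


class RoleViolation(PermissionError):
    """Raised when a role tries to use a tool outside its allowlist."""


def call_tool(role: str, tool: str, registry: dict, **kwargs):
    """Run `tool` from `registry` only if `role` is permitted to use it."""
    allowed = ROLE_TOOLS.get(role, set())
    if tool not in allowed:
        raise RoleViolation(f"role {role!r} may not call {tool!r}")
    return registry[tool](**kwargs)


# Usage: the registry maps tool names to plain callables.
registry = {"submit_to_payer": lambda case_id: f"submitted {case_id}"}
result = call_tool("submitter", "submit_to_payer", registry, case_id="C-1")
# call_tool("clinical_reviewer", "submit_to_payer", registry, case_id="C-1")
# would raise RoleViolation instead of letting the roles blur together.
```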
03 — Policy-Driven
1,279 pages of criteria at every decision point
Medical policy criteria, payer-specific coverage rules, consent scripts, and evidence-grounding requirements are checked by deterministic rubrics — not estimated by a language model. The agent must retrieve, interpret, and apply the right policy at the right step, every time, without fail.
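A deterministic rubric can be pictured as a fixed list of yes-or-no checks over the final case record, with no model in the scoring loop. The following is a rough sketch under that assumption; the check names and record fields are hypothetical and do not come from the actual χ-BENCH rubrics.

```python
from typing import Callable, NamedTuple


class Check(NamedTuple):
    name: str
    passed: Callable[[dict], bool]


# Hypothetical checks; real rubrics would encode payer and policy criteria.
RUBRIC = [
    Check(
        "site_of_service_recorded",
        lambda case: case.get("site_of_service") in {"inpatient", "outpatient"},
    ),
    Check("policy_cited", lambda case: bool(case.get("policy_section"))),
    Check("evidence_attached", lambda case: len(case.get("evidence", [])) > 0),
]


def score(case: dict, rubric=RUBRIC) -> dict:
    """Return a pass/fail result per check; same input always gives same output."""
    return {check.name: check.passed(case) for check in rubric}


case = {
    "site_of_service": "outpatient",
    "policy_section": "4.2",
    "evidence": [],
}
results = score(case)
# Here "evidence_attached" fails because the evidence list is empty.
```

Since each check is a pure function of the record, the same agent output always receives the same score, which is what separates this style of grading from having a language model estimate correctness.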
These three properties — long-horizon state management, role composition, and policy-driven decision-making — are precisely what general-purpose AI agents are not designed for. They explain why benchmark performance on web browsing tasks tells organizations almost nothing about how the same agent will perform on a prior authorization case.
Before You Deploy, You Need to Know What You Are Deploying
For healthcare and technology leaders, the implication is immediate. Before deploying agentic AI into production, organizations need to understand four things:
| Question | Why It Matters |
|---|---|
| Where does this agent succeed? | Knowing the ceiling of agent capability tells you which workflows can be automated safely and which require human backup at every step. |
| Where does this agent fail? | The 0% success rate at provider-payer handoffs is not an anomaly; it is a structural property of workflows with hidden state and policy interdependencies that models cannot reliably navigate without governance infrastructure. |
| How consistently does it fail? | An agent that fails predictably can be governed around. An agent that fails randomly — as χ-BENCH's consistency scores reveal — cannot be safely operated without deterministic guardrails at each decision point. |
| What guardrails are required? | The answer is different for every workflow. χ-BENCH makes it possible to derive those guardrails from observed failure modes rather than discovering them in production at the expense of real patients and payers. |
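
The gap between first-try success and consistency is easier to see with a small calculation. This post does not define how χ-BENCH computes consistency, so the sketch below assumes one common reading: a task counts as consistent only if every repeated trial succeeds. The trial data is made up for illustration and is not benchmark output.

```python
def first_try_rate(trials: dict) -> float:
    """Share of tasks whose first recorded trial succeeded."""
    return sum(runs[0] for runs in trials.values()) / len(trials)


def all_trials_rate(trials: dict) -> float:
    """Share of tasks that succeeded on every trial (assumed consistency metric)."""
    return sum(all(runs) for runs in trials.values()) / len(trials)


# Illustrative outcomes only: task id -> success flag for each repeated trial.
TRIALS = {
    "task_1": [True, True, True],
    "task_2": [True, False, True],
    "task_3": [False, False, False],
    "task_4": [False, True, False],
}

print(first_try_rate(TRIALS))   # 0.5
print(all_trials_rate(TRIALS))  # 0.25
```

Under this reading, an agent can look acceptable on a single attempt while succeeding reliably on far fewer tasks, which is why the two numbers need to be reported separately.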
Measurement Is the Prerequisite. Governed Deployment Is the Work.
This is the central lesson of χ-BENCH: not that AI agents are insufficient, but that knowing where they are insufficient is the foundation of responsible deployment.
actAVA was built for exactly this reality. Its KORA platform is an agent-lifecycle platform purpose-built for healthcare that helps organizations move AI from pilot to production with orchestration, auditability, governance, and measurable ROI. Every agent carries a versioned audit log, an approval lifecycle, human-in-the-loop gates calibrated to workflow risk, and the observability infrastructure to detect when performance deviates from benchmark baselines.
The next era of AI in healthcare will not be won by organizations that automate blindly. It will be won by those who can measure reliability, control risk, and scale agents responsibly across enterprise operations. χ-BENCH is an important step toward that future — and a benchmark that every healthcare organization deploying agentic AI should be running against.
OUR COLLABORATORS
χ-BENCH was built with the clinicians and researchers who do the work.
We are grateful to the clinical, academic, and research collaborators who contributed to this work, including expertise from Johns Hopkins Medicine, Stanford University, Carnegie Mellon University, UC San Diego, Yale School of Medicine, Salesforce AI Research, University of Washington, University of Oxford, Brown University, Emory University, University of Southern California, and many other prestigious institutions. Healthcare needs AI we can trust — and building that trust starts with measurement.
Healthcare needs AI, but it needs AI we can trust in the workflows that matter most.
Read the full paper at arXiv and explore the benchmark, leaderboard, and task browser at actava.ai/benchmarks.

Written by
Haolin Chen
Lead AI Researcher


