Why Healthcare AI Needs a Better Benchmark

The numbers are out, and they are staggering. U.S. healthcare spending has surpassed $5.3 trillion, accounting for 18% of GDP. But here is the kicker: roughly one in five dollars never actually reaches a patient. Instead, it is swallowed whole by a $1 trillion administrative machine of billing, credentialing, and the infamous prior authorization (PA) process. While AI agents are being pitched as the ultimate savior for healthcare’s back office, a massive gap remains between tech-vendor promises and real-world execution. Here is a summary of where healthcare administration stands, why current AI solutions are stalling, and how the industry is trying to fix its measurement problem.

By Weiran Yao

8 min read · June 17, 2026

The United States spends more than $1 trillion every year on healthcare administration. Not care. Not drugs. Not devices. Administration. Billing departments, prior authorization queues, claims clerks, compliance reviewers — an entire shadow economy layered on top of medicine. The question healthcare technology has struggled to answer honestly is: where exactly does AI actually help, and how do you prove it?

This article maps the burden by the numbers, traces where current AI pilots stall, and explains why measurement — not more models — is the missing piece.

$1T+: Annual US healthcare administrative costs, nearly a third of all healthcare spending [1]
$200B: Estimated waste attributable to excess administrative complexity in payer operations alone [2]
$140B: Addressable administrative waste identified as reducible with technology and standardization [3]
$12–$40: Cost per prior authorization transaction when processed manually, vs. $3–$4 electronically [4]

THE PRIOR AUTHORIZATION PROBLEM

Prior Auth: Where Physicians Spend Their Days

Prior authorization has become the defining friction point in US healthcare delivery. In the AMA's 2023 survey of over 1,000 physicians, 94% reported that prior authorization delayed access to necessary care for their patients, and 33% said the delays led patients to abandon treatment entirely [5]. The average practice completes 45 prior authorization requests per physician per week, consuming nearly two full business days of staff time [5].
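To put the per-transaction costs and the weekly request volume side by side, here is a minimal back-of-the-envelope sketch. It uses only figures quoted in this article: $12–$40 per manual transaction, $3–$4 per electronic transaction, and 45 requests per physician per week. Two things are assumptions made purely for illustration: that those figures can be combined at all, since they come from different sources, and that a practice handles its whole volume either manually or electronically. The function names are hypothetical.

```python
# Illustrative arithmetic only. It combines cost ranges and request volume
# quoted in this article; the sources differ, so treat the result as a rough
# order of magnitude, not a measured saving.
MANUAL_COST_USD = (12.0, 40.0)     # (low, high) cost per manual PA transaction
ELECTRONIC_COST_USD = (3.0, 4.0)   # (low, high) cost per electronic PA transaction
REQUESTS_PER_PHYSICIAN_PER_WEEK = 45


def weekly_cost_range(cost_range, requests_per_week):
    """Return the (low, high) weekly cost for one physician's PA volume."""
    low, high = cost_range
    return (low * requests_per_week, high * requests_per_week)


def weekly_savings_range(manual, electronic, requests_per_week):
    """Return the (smallest, largest) weekly saving from going electronic.

    The smallest saving pairs the cheapest manual cost with the most
    expensive electronic cost; the largest pairs the opposite extremes.
    """
    smallest = (manual[0] - electronic[1]) * requests_per_week
    largest = (manual[1] - electronic[0]) * requests_per_week
    return (smallest, largest)


manual_weekly = weekly_cost_range(MANUAL_COST_USD, REQUESTS_PER_PHYSICIAN_PER_WEEK)
electronic_weekly = weekly_cost_range(ELECTRONIC_COST_USD, REQUESTS_PER_PHYSICIAN_PER_WEEK)
savings_weekly = weekly_savings_range(
    MANUAL_COST_USD, ELECTRONIC_COST_USD, REQUESTS_PER_PHYSICIAN_PER_WEEK
)

print(f"Manual:     ${manual_weekly[0]:,.0f} to ${manual_weekly[1]:,.0f} per physician per week")
print(f"Electronic: ${electronic_weekly[0]:,.0f} to ${electronic_weekly[1]:,.0f} per physician per week")
print(f"Difference: ${savings_weekly[0]:,.0f} to ${savings_weekly[1]:,.0f} per physician per week")
```

Under those assumptions the gap runs from a few hundred to well over a thousand dollars per physician per week, before counting the cost of delays, denials, and appeals.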

A KFF analysis of Medicare Advantage found that 1 in 7 prior authorization requests was denied on initial submission, with denial rates varying dramatically by plan and some exceeding 35% [6]. Premier Healthcare research estimates that 75% of denied claims that are appealed are ultimately overturned, which suggests that the initial denial was clinically unjustified in the majority of those cases. Providers still incur the full cost of the appeals process regardless [7].

CMS-0057-F is now in effect. The CMS Interoperability and Prior Authorization Final Rule (effective January 2026) mandates that payers implement electronic prior authorization APIs, provide real-time decision timelines, and publish denial rate data publicly. For payers still running PA workflows manually or on legacy systems, the compliance window has closed [8].

THE WORKFORCE DIMENSION

Clinical Staff Are Paying the Hidden Tax

The $1 trillion headline figure tends to focus attention on billing and claims. But administrative burden exacts a parallel cost on clinical staff — particularly nurses and care managers — that manifests as burnout, turnover, and reduced patient contact time rather than a line item on the income statement.

A 2017 Annals of Family Medicine study found that for every hour physicians spend in direct patient care, they spend nearly two hours on EHR and administrative tasks [9]. For nurses, the ratio is similar: only 37% of nursing shift time is spent on direct patient care, with documentation, care coordination paperwork, and administrative follow-up consuming the rest [10].

Utilization management workflows are a particular offender. A care manager handling concurrent UM reviews for a panel of Medicare Advantage patients may touch the same case three to five times across multiple platforms before a decision is rendered — repeating data entry, chasing clinical notes, and resubmitting forms that could have been automated end-to-end.

WHERE AI PILOTS STALL

The Pilot-to-Production Gap

Every major health system and payer has run AI pilots. Most of them work in the pilot. The failure occurs during the handoff to production, where the complexity of real clinical environments overwhelms models evaluated on curated datasets.

[Figure: Healthcare AI: Pilot Success vs. Production Deployment. Bars compare pilot benchmark score with production deployment rate (0–100%) for PA automation, UM review, claims, care management, and provider–payer handoff.]
Illustrative comparison of pilot-phase benchmark scores vs. reported production deployment rates across five core healthcare AI workflow categories. Provider–payer handoff remains near zero in production.

The pattern is consistent across workflow categories: strong pilot performance, sharp falloff at production deployment. Provider–payer handoff workflows — the most administratively costly — remain almost entirely undeployed at scale, because no existing benchmark has adequately measured end-to-end completion across real multi-app, multi-system environments.

THE MEASUREMENT GAP

Why Existing Benchmarks Don't Predict Production Performance

The healthcare AI evaluation landscape has significant gaps. MedQA, MedMCQA, and similar benchmarks test factual clinical knowledge — useful for gauging general medical reasoning, but orthogonal to whether an agent can actually complete a prior authorization workflow inside Availity or navigate a payer portal to retrieve a remittance advice.

MedHELM, one of the more comprehensive recent evaluation frameworks, reports task-completion scores in the 0.53–0.63 range for clinical NLP tasks, but it evaluates language models, not agents executing multi-step administrative workflows [11]. A 2024 JMIR Medical Informatics review found zero FDA-cleared agentic AI systems for payer operations workflows, reflecting how early the field remains in translating benchmark performance into regulated deployment [12].

The core problem: Existing benchmarks measure what models know. Healthcare operations require measuring what agents do — across real applications, real credentialing systems, real payer portals, under real-time constraints, with real consequences for errors.

Task-level question answering is not a proxy for end-to-end workflow completion. An agent that scores 0.85 on MedQA may fail when asked to submit a prior authorization through a live portal, extract a denial reason from an EOB, and escalate to a peer-to-peer review — a sequence any experienced UM coordinator handles dozens of times per week.
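One way to see why per-task scores overstate workflow performance is to treat a workflow as a chain in which every step must succeed. The sketch below is a simplified illustration, not a description of how any benchmark scores agents. It assumes the three steps named above succeed independently, and the 0.85 per-step rate is a hypothetical number borrowed from the example score, not a measured result. The step names are also hypothetical.

```python
from math import prod


def end_to_end_success(step_success_rates):
    """Probability that every step succeeds, assuming the steps are independent."""
    return prod(step_success_rates)


# Hypothetical per-step success rates, chosen only for illustration.
steps = {
    "submit_prior_auth_via_portal": 0.85,
    "extract_denial_reason_from_eob": 0.85,
    "escalate_to_peer_to_peer_review": 0.85,
}

workflow_rate = end_to_end_success(steps.values())

for name, rate in steps.items():
    print(f"{name}: {rate:.2f}")
print(f"End-to-end completion under independence: {workflow_rate:.3f}")
```

Even with every step at 0.85, the chained workflow completes only about 61% of the time, and longer real-world sequences compound further. Independence is itself a generous simplification: in practice an early mistake can make later steps impossible rather than merely less likely.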

THE BENCHMARK BUILT FOR THIS

χ-Bench: Measuring What Actually Matters

actAVA built χ-Bench to fill the gap that all existing healthcare AI benchmarks leave open: end-to-end agentic task completion across real payer operations workflows. Rather than testing what a model knows, χ-Bench tests what an agent can accomplish — inside real applications, navigating real UI, completing tasks a human UM coordinator would actually be assigned.

75: Tasks spanning prior authorization, utilization management, and care management workflows
28%: Best pass@1 rate across all tested agent configurations, establishing an honest baseline
21: Real healthcare applications evaluated (Availity, NaviMedix, Change Healthcare, and more)
~0%: End-to-end provider–payer handoff completion for any tested configuration, the hardest frontier

χ-Bench evaluates 30 distinct agent configurations across the 75 tasks, using a 1,279-document clinical handbook and 200+ MCP tools that mirror real operational environments. The 28% best pass@1 is not a failure — it is a calibration. It tells the industry where AI agents actually are, rather than where vendor demos suggest they are.
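For readers unfamiliar with the metric, pass@1 is commonly read as the share of tasks an agent completes on its first attempt. The sketch below shows that reading and how a "best across configurations" figure could be derived from it. It is an assumption-laden illustration, not the χ-Bench scoring code: the function names, configuration names, and pass/fail data are all hypothetical, and the exact scoring procedure is the one documented in the benchmark's own methodology.

```python
def pass_at_1(first_attempt_results):
    """Fraction of tasks whose first attempt passed (True means passed)."""
    results = list(first_attempt_results)
    if not results:
        raise ValueError("need at least one task result")
    return sum(1 for passed in results if passed) / len(results)


def best_configuration(results_by_config):
    """Return (name, score) for the configuration with the highest pass@1."""
    scores = {name: pass_at_1(results) for name, results in results_by_config.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


# Hypothetical pass/fail outcomes over 75 tasks for two made-up configurations.
results_by_config = {
    "config_a": [True] * 21 + [False] * 54,
    "config_b": [True] * 12 + [False] * 63,
}

best_name, best_score = best_configuration(results_by_config)
print(f"Best configuration: {best_name} with pass@1 = {best_score:.0%}")
```

In this made-up data, 21 first-attempt passes out of 75 tasks gives 28%, which shows the scale involved: at a 28% best pass@1, roughly three out of four tasks still fail on the first try.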

The near-zero rate on end-to-end provider–payer handoff tasks is the finding that matters most for the $1 trillion administrative burden story. The workflows that cost the most — multi-system prior authorization spanning the provider EHR, the payer portal, and clinical review — are precisely the ones where current agents are not yet performing well. That is the honest picture. It is also the roadmap.

χ-Bench is open-sourced under Apache 2.0. Benchmark results, methodology, and the full task suite are available at actava.ai/benchmarks and arXiv 2605.16679.

CITATIONS

Sources

  1. Tseng P, Kaplan RS, Richman BD, et al. "Administrative Costs Associated With Physician Billing and Insurance-Related Activities at an Academic Health Care System." JAMA. 2018;319(7):691–697. jamanetwork.com. Updated estimates in 2022–2023 literature place the total at over $1 trillion annually.
  2. Dieleman JL, et al. "US Health Care Spending by Payer and Health Condition, 1996–2016." JAMA. 2020;323(9):863–884. Payer operations overhead analysis. jamanetwork.com
  3. Shrank WH, Rogstad TL, Parekh N. "Waste in the US Health Care System: Estimated Costs and Potential for Savings." JAMA. 2019;322(15):1501–1509. jamanetwork.com
  4. CAQH. 2023 CAQH Index: Conducting Electronic Business Transactions. Council for Affordable Quality Healthcare. caqh.org. Manual PA transaction cost $12–$40; electronic $3–$4.
  5. American Medical Association. 2023 AMA Prior Authorization Physician Survey. ama-assn.org. 94% delay care; 33% lead to abandonment; 45 requests/physician/week.
  6. KFF. "Medicare Advantage Prior Authorization and Access to Care." 2023. kff.org. 1 in 7 requests denied on initial submission.
  7. Premier Inc. "Addressing the Administrative Burden: How Automation Can Help." Premier Healthcare Research. premierinc.com. 75% of appealed denials overturned.
  8. Centers for Medicare & Medicaid Services. CMS-0057-F: Interoperability and Prior Authorization Final Rule. Effective January 1, 2026. cms.gov
  9. Arndt BG, et al. "Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations." Ann Fam Med. 2017;15(5):419–426. annfammed.org. 2 hours admin per 1 hour patient care.
  10. Hendrich A, et al. "A 36-Hospital Time and Motion Study: How Do Medical-Surgical Nurses Spend Their Time?" Perm J. 2008;12(3):25–34. thepermanentejournal.org. 37% of nursing shift time is on direct patient care.
  11. Harrington SG, et al. "MedHELM: A Comprehensive Benchmark for Evaluating Clinical NLP Systems." Stanford Center for Biomedical Informatics Research. 2023. Scores in the 0.53–0.63 range for task-completion on clinical NLP. stanfordmlgroup.github.io
  12. Li K, et al. "Regulatory Landscape for AI in Healthcare: Barriers to Clinical Deployment." JMIR Med Inform. 2024;12:e52010. Zero FDA-cleared agentic AI systems for payer operations. medinform.jmir.org

All statistics are sourced from peer-reviewed literature, government data, or named industry research as cited. actAVA χ-Bench results reflect internal evaluation as of May 2025 across 30 agent configurations. For full methodology, see actava.ai/benchmarks and arXiv 2605.16679.


Written by

Weiran Yao

CAIO & Co-Founder
