CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
We built a high-fidelity simulator of 21 healthcare apps, ran 30 frontier agents through 75 long-horizon workflows, and the best one solved 28% of tasks. Here is what broke, why, and where it matters.
What we built
U.S. healthcare runs on long, fragmented, policy-heavy workflows — the kind that take a nurse, a coordinator, and a medical director hours of clicking, calling, and chart-reading to push to a terminal action. Frontier AI agents are increasingly pitched as the natural automation candidate here. We wanted to know how close they actually are.
So we built a high-fidelity simulator of 21 real healthcare apps — provider EHRs, payer UM portals, care-management consoles — wired together by their own state machines, and dropped frontier agents into them with a 1,279-document Managed-Care Operations Handbook and 200+ role-scoped MCP tools. Each task starts with a real clinical case (a knee-replacement prior auth, a chemotherapy review, a diabetes care-plan kickoff) and asks the agent to drive it end-to-end to a submitted packet, a finalized determination, or a complete care plan with patient consent. When the agent stops, a composite verifier reads the full workspace and grades the run.
Agent inputs:
- 200+ role-scoped MCP tools
- 1,279-doc Handbook skill
- Workspace files + chart access
Composite verifier:
- Deterministic contract: case status, codes, P2P required
- LLM judge (Opus 4.7): rationale, autonomy, grounding
Why this is hard
Coding benchmarks like SWE-Bench and Terminal-Bench measure long-horizon execution against a clean, deterministic backdrop: the file system doesn't talk back, the build either passes or it doesn't. Healthcare workflows look superficially similar but are wired differently in three ways that almost no current benchmark combines.
Policy density. Every decision has to be grounded in a specific rule — a payer's medical-necessity criterion, an internal escalation procedure, a state regulation. There are thousands of them, they vary by payer and plan year, and “misreading the rule” and “applying the right rule incorrectly” are different failure modes that we score separately.
Multi-role composition. A single PA case touches a coordinator, a UM nurse, a medical director, and sometimes a peer-to-peer call between an MD on each side. The agent has to switch role context and authority boundaries, and keep track of what it is safe to write, because every handoff is terminal. Submit the wrong packet and there's no rollback.
Multilateral interaction. Some steps aren't tool calls but live multi-turn conversations: a peer-to-peer review with the payer's MD, an RFI back to the provider, a twenty-minute outreach call with a chronic-disease patient. The agent has to drop out of execute-mode, hold a real conversation, and carry the result back into the workflow.
Three domains, 75 tasks
We picked the three places frontier agents are most often pitched as automation candidates today: provider prior authorization, payer utilization management, and care management. Each domain has 25 long-horizon tasks. To produce ground truth, a clinician (or an author wearing that hat) walked every task end-to-end on the live UI before it shipped — the average task takes 21 steps, the longest hits 40. Each task is then graded independently across three trials.
Provider prior authorization (PA). Build the case packet a payer needs to approve a service or medication. The agent works the chart, drafts the medical-necessity rationale, attaches policy evidence, and submits exactly once.
Payer utilization management (UM). Triage, nurse-review, and MD-decide an incoming authorization request against criteria. Run a peer-to-peer call if needed, then finalize the determination with rationale-rich notes.
Care management (CM). Run intake, outreach, assessment, and care-plan steps for a chronic-disease member. Patient calls are simulated multi-turn dialogs; the agent must obtain consent before scoping the program.
How we grade a run
When the agent stops, the verifier reads everything it touched: world-store updates, files in the workspace, the full tool-call event trail, and any conversation transcripts. Two layers grade in parallel and both have to pass.
The deterministic contract covers checks that have a single right answer — did the case status reach the expected terminal state? did the right CPT codes land on the request? was a peer-to-peer raised when policy required one? The rubric LLM judge (pinned to claude-opus-4-7) handles items the contract can't formalize: was the medical-necessity rationale grounded in the cited policy? did the outreach call respect patient autonomy or quietly badger a refusing member into “yes”?
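As a rough illustration of the two-layer rule, the sketch below folds contract checks and rubric verdicts into a single pass/fail. The `RunVerdict` type, the `grade_run` helper, the check names, and the rule that a layer passes only when every one of its checks passes are all assumptions made for this example; they are not taken from the benchmark's verifier code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RunVerdict:
    passed: bool
    failed_checks: List[str] = field(default_factory=list)


def grade_run(contract: Dict[str, bool], rubric: Dict[str, bool]) -> RunVerdict:
    """Pass only if both layers pass (hypothetical helper).

    contract: deterministic checks, e.g. terminal case status or CPT codes.
    rubric:   LLM-judge verdicts, e.g. rationale grounding or patient autonomy.
    Assumption: a layer passes only when all of its checks pass.
    """
    failed = [f"contract:{name}" for name, ok in contract.items() if not ok]
    failed += [f"rubric:{name}" for name, ok in rubric.items() if not ok]
    return RunVerdict(passed=not failed, failed_checks=failed)


# Hypothetical run: the contract is satisfied, but the judge flags the outreach call.
verdict = grade_run(
    contract={"terminal_status": True, "cpt_codes": True, "p2p_raised": True},
    rubric={"rationale_grounded": True, "respects_autonomy": False},
)
print(verdict.passed, verdict.failed_checks)
```

Returning the names of the failed checks alongside the boolean keeps a failed run traceable to the layer that rejected it.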
We report pass@1, pass@3 (any of three independent trials), and the strict pass^3 (all three) as a reliability metric.
Evidence the verifier reads:
- Per-stage states
- Side-effect artifacts
- Workspace documents
- Event log
- Conversation transcripts
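The three reported metrics can be sketched from per-task trial outcomes as follows. This is a minimal sketch: the `pass_metrics` helper and the task ids are hypothetical, and pass@1 is assumed to be the mean pass rate over all individual trials.

```python
from typing import Dict, List


def pass_metrics(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """results maps a task id to its per-trial outcomes (three trials per task)."""
    n_tasks = len(results)
    n_trials = sum(len(trials) for trials in results.values())
    return {
        # Assumed definition: fraction of all individual trials that pass.
        "pass@1": sum(sum(trials) for trials in results.values()) / n_trials,
        # A task counts if any of its trials passes.
        "pass@3": sum(any(trials) for trials in results.values()) / n_tasks,
        # A task counts only if every one of its trials passes.
        "pass^3": sum(all(trials) for trials in results.values()) / n_tasks,
    }


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "pa-01": [True, True, True],
    "pa-02": [True, False, False],
    "pa-03": [False, False, False],
    "pa-04": [False, True, True],
}
metrics = pass_metrics(example)
print(metrics)
```

In this toy example half of the trials pass, three of four tasks pass at least once, and only one task passes all three times, which is why pass^3 sits well below pass@3 whenever an agent is inconsistent across trials.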
Headline results
We ran 30 harness × model combinations: every frontier proprietary stack (Claude Code, OpenAI Codex, Gemini CLI) paired with that lab's closed-weight models, plus four open-source frameworks (OpenClaw, Hermes, OpenAI Agents SDK, DeepAgents) sweeping five open-weight models. Best in class is Claude Code + Opus 4.6 at 28.0% pass@1. The per-domain leaders are three different models from two labs: Codex + GPT-5.5 leads PA at 29.3%, Opus 4.6 leads UM at 41.3%, and Opus 4.7 leads CM at 32.0%. There is no single “best agent” for healthcare workflows yet.
Reliability is the bigger problem. No agent clears 20% on pass^3 — the metric where the same agent has to pass the same task three runs in a row. Agents that win one trial often fail the next on the same case. Two stress tests collapse the headline number further:
The marathon. Load all 25 tasks of a domain into a single agent session and ask it to finish them in any order. The best agents finish at 3.8% overall. On PA, neither leading agent submits a single authorization across 25 queued cases despite touching most of them. Long context doesn't save you: Opus 4.7 (1M-token context, no compaction) and GPT-5.5 (auto-compacts 4–6× per session) fail in roughly the same shape.
The end-to-end arena. Wire two Codex + GPT-5.5 agents together (one as the provider, one as the payer) and let them exchange information only through MCP tools. The PA configuration that scores 30.4% solo collapses to 0% when the payer agent and cross-role checks join: 18 of 23 cases never reach an MD decision, and on the five PA cases where policy requires a peer-to-peer call, neither side raises one.
The full leaderboard
All 30 harness × model configurations. Efficiency columns report the per-trial averages over all 225 trials in that row.
Pass rates are percentages. Overall covers all 75 tasks; PA (Prior Authorization), UM (Utilization Management), and CM (Care Management) cover 25 tasks each. Steps and Cost are per-trial efficiency averages.

| Agent harness | Model | Overall p@1 | Overall p@3 | Overall p^3 | PA p@1 | PA p@3 | PA p^3 | UM p@1 | UM p@3 | UM p^3 | CM p@1 | CM p@3 | CM p^3 | Steps | Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary stack: frontier first-party CLI + closed-weight models | | | | | | | | | | | | | | | |
| Codex | GPT-5.5 | 20.9 | 30.7 | 9.3 | 29.3 | 40.0 | 16.0 | 32.0 | 48.0 | 12.0 | 1.3 | 4.0 | 0.0 | 54 | $1.29 |
| Codex | GPT-5.4 | 16.0 | 25.3 | 8.0 | 24.0 | 32.0 | 16.0 | 17.3 | 24.0 | 8.0 | 6.7 | 20.0 | 0.0 | 58 | $1.30 |
| Codex | GPT-5.4 Mini | 8.4 | 20.0 | 0.0 | 10.7 | 24.0 | 0.0 | 13.3 | 32.0 | 0.0 | 1.3 | 4.0 | 0.0 | 58 | $0.27 |
| Claude Code | Claude Opus 4.7 | 24.4 | 41.3 | 10.7 | 24.0 | 32.0 | 16.0 | 17.3 | 28.0 | 8.0 | 32.0 | 64.0 | 8.0 | 68 | $9.91 |
| Claude Code | Claude Opus 4.6 | 28.0 | 38.7 | 18.7 | 18.7 | 24.0 | 12.0 | 41.3 | 44.0 | 40.0 | 24.0 | 48.0 | 4.0 | 76 | $6.47 |
| Claude Code | Claude Sonnet 4.6 | 26.2 | 41.3 | 12.0 | 24.0 | 28.0 | 20.0 | 34.7 | 52.0 | 16.0 | 20.0 | 44.0 | 0.0 | 82 | $1.30 |
| Claude Code | Claude Haiku 4.5 | 6.2 | 10.7 | 2.7 | 0.0 | 0.0 | 0.0 | 14.7 | 24.0 | 8.0 | 4.0 | 8.0 | 0.0 | 41 | $0.16 |
| Gemini CLI | Gemini 3.1 Pro | 7.1 | 13.3 | 1.3 | 14.7 | 24.0 | 4.0 | 6.7 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 82 | $2.11 |
| Gemini CLI | Gemini 3 Flash | 12.5 | 17.3 | 8.0 | 18.7 | 28.0 | 8.0 | 18.7 | 24.0 | 16.0 | 0.0 | 0.0 | 0.0 | 142 | $0.33 |
| Open-source stack: open frameworks + open-weight models | | | | | | | | | | | | | | | |
| OpenClaw | Claude Opus 4.7 | 17.3 | 37.3 | 4.0 | 18.7 | 28.0 | 8.0 | 13.3 | 32.0 | 4.0 | 20.0 | 52.0 | 0.0 | 41 | $11.48 |
| OpenClaw | Kimi K2.6 | 10.2 | 18.7 | 2.7 | 12.0 | 20.0 | 4.0 | 18.7 | 36.0 | 4.0 | 0.0 | 0.0 | 0.0 | 72 | $0.91 |
| OpenClaw | DeepSeek V4 Pro | 11.1 | 24.0 | 1.3 | 14.7 | 28.0 | 4.0 | 12.0 | 28.0 | 0.0 | 6.7 | 16.0 | 0.0 | 42 | $0.53 |
| OpenClaw | GLM-5.1 | 16.9 | 30.7 | 6.7 | 13.3 | 24.0 | 4.0 | 26.7 | 36.0 | 16.0 | 10.7 | 32.0 | 0.0 | 116 | $0.96 |
| OpenClaw | Qwen 3.6 Max | 4.9 | 10.7 | 0.0 | 10.7 | 24.0 | 0.0 | 4.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 79 | $2.80 |
| OpenClaw | Grok 4.3 | 0.4 | 1.3 | 0.0 | 1.3 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 65 | $2.66 |
| OAI Agents | Kimi K2.6 | 15.1 | 22.7 | 8.0 | 17.3 | 28.0 | 12.0 | 25.3 | 36.0 | 12.0 | 2.7 | 4.0 | 0.0 | 60 | $0.43 |
| OAI Agents | DeepSeek V4 Pro | 14.2 | 22.7 | 9.3 | 10.7 | 16.0 | 8.0 | 28.0 | 40.0 | 20.0 | 4.0 | 12.0 | 0.0 | 52 | $0.25 |
| OAI Agents | GLM-5.1 | 18.7 | 26.7 | 12.0 | 18.7 | 24.0 | 12.0 | 33.3 | 44.0 | 24.0 | 4.0 | 12.0 | 0.0 | 58 | $0.27 |
| OAI Agents | Qwen 3.6 Max | 15.6 | 22.7 | 9.3 | 16.0 | 20.0 | 12.0 | 26.7 | 36.0 | 16.0 | 4.0 | 12.0 | 0.0 | 48 | $0.58 |
| OAI Agents | Grok 4.3 | 5.8 | 10.7 | 1.3 | 0.0 | 0.0 | 0.0 | 16.0 | 28.0 | 4.0 | 1.3 | 4.0 | 0.0 | 32 | $1.54 |
| Hermes | Kimi K2.6 | 15.6 | 24.0 | 6.7 | 18.7 | 24.0 | 12.0 | 21.3 | 36.0 | 8.0 | 6.7 | 12.0 | 0.0 | 31 | $1.07 |
| Hermes | DeepSeek V4 Pro | 13.8 | 22.7 | 8.0 | 8.0 | 16.0 | 4.0 | 25.3 | 32.0 | 20.0 | 8.0 | 20.0 | 0.0 | 26 | $2.19 |
| Hermes | GLM-5.1 | 18.7 | 28.0 | 10.7 | 10.7 | 16.0 | 8.0 | 34.7 | 44.0 | 24.0 | 10.7 | 24.0 | 0.0 | 30 | $1.04 |
| Hermes | Qwen 3.6 Max | 16.4 | 28.0 | 5.3 | 9.3 | 16.0 | 4.0 | 26.7 | 36.0 | 12.0 | 13.3 | 32.0 | 0.0 | 29 | $4.12 |
| Hermes | Grok 4.3 | 4.4 | 8.0 | 1.3 | 0.0 | 0.0 | 0.0 | 13.3 | 24.0 | 4.0 | 0.0 | 0.0 | 0.0 | 32 | $1.05 |
| DeepAgents | Kimi K2.6 | 3.1 | 8.0 | 0.0 | 8.0 | 20.0 | 0.0 | 1.3 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39 | $0.55 |
| DeepAgents | DeepSeek V4 Pro | 10.7 | 18.7 | 2.7 | 14.7 | 24.0 | 4.0 | 10.7 | 20.0 | 4.0 | 6.7 | 12.0 | 0.0 | 15 | $0.21 |
| DeepAgents | GLM-5.1 | 11.1 | 17.3 | 5.3 | 17.3 | 24.0 | 12.0 | 10.7 | 16.0 | 4.0 | 5.3 | 12.0 | 0.0 | 21 | $0.26 |
| DeepAgents | Qwen 3.6 Max | 9.3 | 16.0 | 4.0 | 12.0 | 16.0 | 8.0 | 10.7 | 16.0 | 4.0 | 5.3 | 16.0 | 0.0 | 18 | $0.57 |
| DeepAgents | Grok 4.3 | 2.2 | 5.3 | 0.0 | 0.0 | 0.0 | 0.0 | 5.3 | 12.0 | 0.0 | 1.3 | 4.0 | 0.0 | 21 | $1.43 |
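
One quick way to sanity-check the leaderboard: because each domain contributes the same 25 tasks, the Overall p@1 should equal the mean of the three domain p@1 values, up to rounding. The sketch below checks that for a few rows copied from the table; the `ROWS` structure and the 0.1-point tolerance are choices made for this example.

```python
# (overall, PA, UM, CM) pass@1 percentages copied from the leaderboard rows.
ROWS = {
    "Codex + GPT-5.5": (20.9, 29.3, 32.0, 1.3),
    "Claude Code + Opus 4.7": (24.4, 24.0, 17.3, 32.0),
    "Claude Code + Opus 4.6": (28.0, 18.7, 41.3, 24.0),
    "Claude Code + Sonnet 4.6": (26.2, 24.0, 34.7, 20.0),
    "Gemini CLI + Gemini 3 Flash": (12.5, 18.7, 18.7, 0.0),
}

# Every cell is rounded to one decimal, so allow a small gap.
TOLERANCE = 0.1

gaps = {}
for name, (overall, pa, um, cm) in ROWS.items():
    domain_mean = (pa + um + cm) / 3
    gaps[name] = abs(overall - domain_mean)
    print(f"{name}: overall={overall:.1f} mean of domains={domain_mean:.2f}")

consistent = all(gap <= TOLERANCE for gap in gaps.values())
print("consistent:", consistent)
```

The same relationship should hold for the p@3 and p^3 columns, since those are also per-task fractions averaged over equally sized domains.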
Cost vs accuracy
Spending more does not buy reliability in healthcare workflows. Plot every harness × model configuration by mean per-trial spend and pass@1 and the field falls into four quadrants. The Sweet Spot (cheap and accurate) is sparsely populated and dominated by a handful of frugal open-weight pairings. The Premium tier (Claude Code paired with Opus 4.6 or Sonnet 4.6, and Codex with GPT-5.5) buys the top of the leaderboard but at 4–50× the trial cost of the frontier's budget end. Everything to the right of the $1 line and below the 13% bar is Overpriced. At the far right of the plot, Claude Opus 4.7 on OpenClaw spends $11.48 per trial to land at 17.3% pass@1.
The connecting line traces the cost-accuracy Pareto frontier: no other configuration is both cheaper and more accurate than the points on it. Seven configurations sit on the frontier (Haiku 4.5, two DeepSeek V4 Pro setups, GLM-5.1 on OAI Agents, GPT-5.5 on Codex, Sonnet 4.6, and Opus 4.6), and they span a roughly 40× range in spend ($0.16 to $6.47 per trial) for a 22-point accuracy spread.
[Figure: scatter plot of mean cost per trial (x-axis) against overall pass@1 (y-axis) for all 30 configurations. Legend, harnesses: Codex, Claude Code, Gemini CLI, OpenClaw, OAI Agents, Hermes, DeepAgents. Legend, models: GPT-5.5, GPT-5.4, GPT-5.4 Mini, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3 Flash, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max, Grok 4.3.]
Pareto-frontier configurations (cost per trial · overall pass@1):
- Haiku 4.5 (Claude Code): $0.16 · 6.2%
- DeepSeek V4 Pro (DeepAgents): $0.21 · 10.7%
- DeepSeek V4 Pro (OAI Agents): $0.25 · 14.2%
- GLM-5.1 (OAI Agents): $0.27 · 18.7%
- GPT-5.5 (Codex): $1.29 · 20.9%
- Sonnet 4.6 (Claude Code): $1.30 · 26.2%
- Opus 4.6 (Claude Code): $6.47 · 28.0%
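The frontier can be recomputed from the leaderboard's Cost and Overall p@1 columns. The sketch below is one straightforward way to do it, not necessarily how the original plot was produced; the `CONFIGS` list and `pareto_frontier` helper are names introduced here, and the dominance rule (another point is no more expensive and strictly more accurate, or cheaper and equally accurate) is an assumption about ties.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (label, cost per trial in USD, overall pass@1 in %)

# Values copied from the leaderboard's Cost and Overall p@1 columns.
CONFIGS: List[Config] = [
    ("Codex / GPT-5.5", 1.29, 20.9), ("Codex / GPT-5.4", 1.30, 16.0),
    ("Codex / GPT-5.4 Mini", 0.27, 8.4), ("Claude Code / Opus 4.7", 9.91, 24.4),
    ("Claude Code / Opus 4.6", 6.47, 28.0), ("Claude Code / Sonnet 4.6", 1.30, 26.2),
    ("Claude Code / Haiku 4.5", 0.16, 6.2), ("Gemini CLI / Gemini 3.1 Pro", 2.11, 7.1),
    ("Gemini CLI / Gemini 3 Flash", 0.33, 12.5), ("OpenClaw / Opus 4.7", 11.48, 17.3),
    ("OpenClaw / Kimi K2.6", 0.91, 10.2), ("OpenClaw / DeepSeek V4 Pro", 0.53, 11.1),
    ("OpenClaw / GLM-5.1", 0.96, 16.9), ("OpenClaw / Qwen 3.6 Max", 2.80, 4.9),
    ("OpenClaw / Grok 4.3", 2.66, 0.4), ("OAI Agents / Kimi K2.6", 0.43, 15.1),
    ("OAI Agents / DeepSeek V4 Pro", 0.25, 14.2), ("OAI Agents / GLM-5.1", 0.27, 18.7),
    ("OAI Agents / Qwen 3.6 Max", 0.58, 15.6), ("OAI Agents / Grok 4.3", 1.54, 5.8),
    ("Hermes / Kimi K2.6", 1.07, 15.6), ("Hermes / DeepSeek V4 Pro", 2.19, 13.8),
    ("Hermes / GLM-5.1", 1.04, 18.7), ("Hermes / Qwen 3.6 Max", 4.12, 16.4),
    ("Hermes / Grok 4.3", 1.05, 4.4), ("DeepAgents / Kimi K2.6", 0.55, 3.1),
    ("DeepAgents / DeepSeek V4 Pro", 0.21, 10.7), ("DeepAgents / GLM-5.1", 0.26, 11.1),
    ("DeepAgents / Qwen 3.6 Max", 0.57, 9.3), ("DeepAgents / Grok 4.3", 1.43, 2.2),
]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep configurations that no cheaper-or-equal-cost configuration matches or beats."""
    frontier: List[Config] = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; on equal cost, visit the more accurate point first.
    for label, cost, accuracy in sorted(configs, key=lambda c: (c[1], -c[2])):
        if accuracy > best_accuracy:
            frontier.append((label, cost, accuracy))
            best_accuracy = accuracy
    return frontier


frontier = pareto_frontier(CONFIGS)
for label, cost, accuracy in frontier:
    print(f"{label}: ${cost:.2f} · {accuracy:.1f}%")
```

Under that tie rule the sweep keeps the same seven configurations listed above; Hermes + GLM-5.1, for instance, drops out because OAI Agents + GLM-5.1 reaches the same 18.7% at a lower cost.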
Where agents break down
A 28% vs. 21% leaderboard gap suggests the leading agents are doing roughly the same thing, only at different success rates. They aren't. We pulled the per-check, per-tool-call data from 225 trials each for the two leading configurations from different labs — Claude Code + Opus 4.6 (28.0% pass@1) and Codex + GPT-5.5 (20.9% pass@1) — and the failure shapes are wildly different.
Claude is the careful process-follower that submits packets it shouldn't. It nails every mandatory peer-to-peer (12 of 12) and writes care plans with 100% goal-structure compliance, but passes 0 of 33 trials on the “gather more evidence and put the case on hold” tasks because it always submits. Codex is the fast shortcutter: 1.5–2× faster per trial, better at form-filling, but 88% of its outreach calls fail the patient-conversation rubric, and it racked up 122 consecutive retries on a single UM trial when it couldn't format a tool call correctly.
Each agent is good at the things the other one is bad at. Both fail in ways that would cause actual harm in production. Below: side-by-side strength/weakness profiles, a per-check breakdown across all 15 rubric checks, and six diagnostics that explain the most consistent gaps.
Claude Code + Opus 4.6
- ✓ [UM] Procedural fidelity: never skips mandatory workflow steps (P2P: 12/12 perfect)
- ✓ [CM] Conversational empathy: 73% outreach quality pass; adaptive rapport-building
- ✓ [UM] Role-switching: correct authority boundaries across 5 UM roles
- ✓ [CM] Care plan structure: 100% goal structure compliance, 89% intervention structure
- ✓ [PA] Evidence gathering: 3.7 supporting documents per PA trial vs 1.0 for Codex
- ✓ Deterministic: same outcome on 96% of tasks across independent trials
- ✗ [PA] Completeness bias: 0/33 on "gather more evidence" tasks; always submits
- ✗ [PA] Creates submission packets when policy requires stopping (96% failure)
- ✗ Slow execution: 1.5–2× longer than Codex; 68 avg steps vs 46
- ✗ [CM] CM assessment quality judging: 64% failure on structured assessment checks
Codex + GPT-5.5
- ✓ [PA] Documentation gap detection: 6/33 on hold tasks vs Claude's 0/33
- ✓ [PA] Intake efficiency: 40% pass on structured form-filling tasks
- ✓ [PA] Execution speed: 3.5 min avg PA trial vs Claude's 6.6 min
- ✓ [CM] Stage coherence: 63% pass on CM workflow staging vs Claude's 37%
- ✗ [UM] Skips mandatory steps: bypasses P2P when clinical answer seems "obvious"
- ✗ [CM] Catastrophic outreach failure: 88% fail on patient conversation quality
- ✗ [UM] Retry storms: 122 consecutive retries with 163 schema validation errors
- ✗ [CM] Care plan deficiency: 65% fail on goal structure, 55% on interventions
- ✗ [UM] Hard-coded triage routing bug breaks gold-card auto-approvals
- ✗ Non-deterministic: different outcomes on 36% of tasks across trials
Per-check capability breakdown
Three views over the same 225-trial-per-agent run: failure rate by check, a capability radar across nine skills, and behavioral diagnostics for retries, schema errors, and output volume.
Key diagnostics
44% of PA Provider tasks require stopping mid-workflow. Claude submitted all 33 hold-required trials, noting missing documents in its reasoning but executing the submission anyway. This is not a knowledge gap; it is an execution bias toward completion.
Codex enters retry storms on structured tool calls — 122 consecutive retries with 163 schema validation errors on a single UM trial. Same malformed payload repeated without adaptation, consuming 40%+ of execution time.
When Codex determines a case is clear-cut for denial, it skips the mandatory peer-to-peer review. Claude never skips it. Healthcare workflows encode safety margins in mandatory process — shortcuts defeat the safeguards.
88% of Codex's patient outreach conversations fail quality review. Claude averages 890 chars/message with adaptive emotional validation; Codex averages 676 chars with clinical efficiency. Trust-building requires warmth, not just information transfer.
Both agents achieve just 8% (1/12) on physician-level review tasks. Synthesizing full clinical evidence, applying specialty criteria, and documenting legally binding rationale remains a genuine capability frontier.
Both models reach ~72–74% accuracy on correct approve/deny determinations. The performance gap between agents is driven by procedural compliance — following the right steps — not by clinical reasoning quality.
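The retry-storm diagnostic above can be approximated from a tool-call event log by measuring the longest run of consecutive failed calls that repeat the same tool and payload. The sketch below assumes a hypothetical event schema (`tool`, `payload`, `ok`) and hypothetical tool names; the benchmark's actual log format may differ.

```python
import json
from typing import Any, Dict, List


def longest_failed_repeat(events: List[Dict[str, Any]]) -> int:
    """Length of the longest run of consecutive failed calls with an identical tool and payload."""
    longest = 0
    current = 0
    previous_key = None
    for event in events:
        if event["ok"]:
            # A successful call ends any run of failures.
            current = 0
            previous_key = None
            continue
        # Serialize the payload so that equal payloads compare equal.
        key = (event["tool"], json.dumps(event["payload"], sort_keys=True))
        current = current + 1 if key == previous_key else 1
        previous_key = key
        longest = max(longest, current)
    return longest


# Hypothetical log: three identical malformed calls, one success, then two more failures.
log = [
    {"tool": "submit_review", "payload": {"case": "UM-7", "decision": None}, "ok": False},
    {"tool": "submit_review", "payload": {"case": "UM-7", "decision": None}, "ok": False},
    {"tool": "submit_review", "payload": {"case": "UM-7", "decision": None}, "ok": False},
    {"tool": "get_case", "payload": {"case": "UM-7"}, "ok": True},
    {"tool": "add_note", "payload": {"text": ""}, "ok": False},
    {"tool": "add_note", "payload": {"text": ""}, "ok": False},
]
storm_length = longest_failed_repeat(log)
print(storm_length)
```

A run resets whenever the payload changes, so an agent that edits its call after each validation error scores low on this measure even if it fails several times in a row.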
Cite this
If you reference CHI-Bench in a paper, post, or eval — here's the BibTeX. The data, code, and all 75 task definitions are open-sourced at the GitHub repo linked above.
@misc{chen2026chibenchaiagentsautomate,
title={CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?},
author={Haolin Chen and Deon Metelski and Leon Qi and Tao Xia and Joonyul Lee and Steve Brown and Kevin Riley and Frank Wang and T. Y. Alvin Liu and Hank Capps MD and Zeyu Tang and Xiangchen Song and Lingjing Kong and Fan Feng and Tianyi Zeng and Zhiwei Liu and Zixian Ma and Hang Jiang and Fangli Geng and Yuan Yuan and Chenyu You and Qingsong Wen and Hua Wei and Yanjie Fu and Yue Zhao and Carl Yang and Biwei Huang and Kun Zhang and Caiming Xiong and Sanmi Koyejo and Eric P. Xing and Philip S. Yu and Weiran Yao},
year={2026},
eprint={2605.16679},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.16679},
}