CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

We built a high-fidelity simulator of 21 healthcare apps, ran 30 frontier agents through 75 long-horizon workflows, and the best one solved 28% of tasks. Here is what broke, why, and where it matters.

actAVA Research · May 13, 2026

Haolin Chen¹, Deon Metelski¹, Leon Qi¹, Tao Xia¹, Joonyul Lee¹, Steve Brown¹, Kevin Riley¹, Frank Wang¹, T. Y. Alvin Liu, MD², Hank Capps, MD³, Zeyu Tang⁴, Xiangchen Song⁵, Lingjing Kong⁵, Fan Feng⁶, Tianyi Zeng⁷, Zhiwei Liu⁸, Zixian Ma⁹, Hang Jiang¹⁰, Fangli Geng¹¹, Yuan Yuan¹², Chenyu You¹³, Qingsong Wen¹⁴, Hua Wei¹⁵, Yanjie Fu¹⁵, Yue Zhao¹⁶, Carl Yang¹⁷, Biwei Huang⁶, Kun Zhang⁵,¹⁹, Caiming Xiong¹⁸, Sanmi Koyejo⁴, Eric P. Xing¹⁹,⁵, Philip S. Yu²⁰, Weiran Yao¹

¹actAVA.ai · ²Johns Hopkins Medicine · ³Wellstar Health System · ⁴Stanford University · ⁵CMU · ⁶UCSD · ⁷Yale School of Medicine · ⁸Salesforce AI Research · ⁹University of Washington · ¹⁰Northeastern University · ¹¹Brown University · ¹²Boston College · ¹³Stony Brook University · ¹⁴University of Oxford · ¹⁵Arizona State University · ¹⁶University of Southern California · ¹⁷Emory University · ¹⁸Recursive Superintelligence · ¹⁹MBZUAI · ²⁰University of Illinois at Chicago

Paper GitHub Dataset Leaderboard

What we built

U.S. healthcare runs on long, fragmented, policy-heavy workflows — the kind that take a nurse, a coordinator, and a medical director hours of clicking, calling, and chart-reading to push to a terminal action. Frontier AI agents are increasingly pitched as the natural automation candidate here. We wanted to know how close they actually are.

So we built a high-fidelity simulator of 21 real healthcare apps — provider EHRs, payer UM portals, care-management consoles — wired together by their own state machines, and dropped frontier agents into them with a 1,279-document Managed-Care Operations Handbook and 200+ role-scoped MCP tools. Each task starts with a real clinical case (a knee-replacement prior auth, a chemotherapy review, a diabetes care-plan kickoff) and asks the agent to drive it end-to-end to a submitted packet, a finalized determination, or a complete care plan with patient consent. When the agent stops, a composite verifier reads the full workspace and grades the run.

Agent (frontier harness + model): 200+ role-scoped MCP tools · 1,279-doc Handbook skill · workspace files + chart access · 30 harness × model combos evaluated

χ-World simulator (21 healthcare apps · 3 MCP servers · FastAPI + SQLite + MCP over HTTP): Provider, Prior Auth: 5 apps · :8020 · Payer, Utilization Mgmt: 10 apps · :8100 · RN, Care Management: 5 apps · :8200

In-situ verifier (two-layer scorecard): deterministic contract (case status, codes, P2P required) · LLM judge, Opus 4.7 (rationale, autonomy, grounding) · Reward = contract ∧ judge

3 domains · 75 tasks · 3 trials per task · pass@1 · pass@3 · pass^3

Figure 1. CHI-Bench: Clinical Healthcare In-Situ environment and evaluation benchmark. The agent operates 21 healthcare apps through MCP, writes role artifacts to a shared workspace, and is graded by a composite verifier that reads the workspace, world state, and event trail.

Domains: PA · UM · CM
Tasks: 25 per domain
Healthcare apps: 21, via MCP
MCP tools: 200+
Handbook docs: 1,279
Agent configs: 30 harness × model

Why this is hard

Coding benchmarks like SWE-Bench and Terminal-Bench measure long-horizon execution against a clean, deterministic backdrop: the file system doesn't talk back, the build either passes or it doesn't. Healthcare workflows look superficially similar but are wired differently in three ways that almost no current benchmark combines.

Policy density. Every decision has to be grounded in a specific rule — a payer's medical-necessity criterion, an internal escalation procedure, a state regulation. There are thousands of them, they vary by payer and plan year, and “misreading the rule” and “applying the right rule incorrectly” are different failure modes that we score separately.

Multi-role composition. A single PA case touches a coordinator, a UM nurse, a medical director, and sometimes a peer-to-peer call between an MD on each side. The agent has to switch role context, respect each role's authority boundaries, and track what it is safe to write. Every handoff is terminal: submit the wrong packet and there's no rollback.

Multilateral interaction. Some steps aren't tool calls but live multi-turn conversations: a peer-to-peer review with the payer's MD, an RFI back to the provider, a twenty-minute outreach call with a chronic-disease patient. The agent has to drop out of execute-mode, hold a real conversation, and carry the result back into the workflow.

Three domains, 75 tasks

We picked the three places frontier agents are most often pitched as automation candidates today: provider prior authorization, payer utilization management, and care management. Each domain has 25 long-horizon tasks. To produce ground truth, a clinician (or an author wearing that hat) walked every task end-to-end on the live UI before it shipped — the average task takes 21 steps, the longest hits 40. Each task is then graded independently across three trials.

Prior Authorization

Provider role · 25 tasks

Build the case packet a payer needs to approve a service or medication. The agent works the chart, drafts the medical-necessity rationale, attaches policy evidence, and submits — once.

Best pass@1: 29.3% · Codex + GPT-5.5

Utilization Management

Payer role · 25 tasks

Triage, nurse-review, and MD-decide an incoming authorization request against criteria. Run a peer-to-peer call if needed, then finalize the determination with rationale-rich notes.

Best pass@1: 41.3% · Claude Code + Opus 4.6

Care Management

Care manager role · 25 tasks

Run intake, outreach, assessment, and care-plan steps for a chronic-disease member. Patient calls are simulated multi-turn dialogs; the agent must obtain consent before scoping the program.

Best pass@1: 32.0% · Claude Code + Opus 4.7


How we grade a run

When the agent stops, the verifier reads everything it touched: world-store updates, files in the workspace, the full tool-call event trail, and any conversation transcripts. Two layers grade in parallel and both have to pass.

The deterministic contract covers checks that have a single right answer — did the case status reach the expected terminal state? did the right CPT codes land on the request? was a peer-to-peer raised when policy required one? The rubric LLM judge (pinned to claude-opus-4-7) handles items the contract can't formalize: was the medical-necessity rationale grounded in the cited policy? did the outreach call respect patient autonomy or quietly badger a refusing member into “yes”?
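The two-layer rule can be sketched in a few lines. This is a minimal illustration, not the benchmark's verifier: the `TrialRecord` shape, the check and rubric names, and the choice to require every rubric item to win a majority of its judge votes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TrialRecord:
    """Hypothetical container for what one trial leaves behind."""
    # deterministic check name -> did the assertion hold?
    contract_checks: dict[str, bool] = field(default_factory=dict)
    # rubric item -> independent judge votes (True = pass)
    judge_votes: dict[str, list[bool]] = field(default_factory=dict)


def contract_passes(record: TrialRecord) -> bool:
    # Layer 1: every code-checkable assertion must hold.
    return all(record.contract_checks.values())


def judge_passes(record: TrialRecord) -> bool:
    # Layer 2 (assumed aggregation): each rubric item needs a strict majority of votes.
    return all(sum(votes) * 2 > len(votes) for votes in record.judge_votes.values())


def reward(record: TrialRecord) -> bool:
    # R = contract AND judge: a trial passes only when both layers pass.
    return contract_passes(record) and judge_passes(record)


record = TrialRecord(
    contract_checks={"terminal_status": True, "cpt_codes": True, "p2p_raised": True},
    judge_votes={
        "policy_alignment": [True, True, False],   # 2 of 3 votes: passes
        "autonomy_first": [True, False, False],    # 1 of 3 votes: fails
    },
)
print(reward(record))  # False: the contract holds, but one rubric item loses its vote
```

The conjunction is what makes the reward strict: a run that reaches the right terminal state but badgers a refusing member still scores zero.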

We report pass@1, pass@3 (any of three independent trials), and the strict pass^3 (all three) as a reliability metric.
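As a rough sketch of how these three numbers relate, the snippet below computes them from per-task trial outcomes. It assumes pass@1 is the mean per-trial pass rate; the function name `pass_metrics` and the toy outcomes are invented for illustration and are not benchmark data.

```python
def pass_metrics(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """outcomes maps a task id to its per-trial pass/fail results."""
    n_tasks = len(outcomes)
    n_trials = sum(len(trials) for trials in outcomes.values())
    return {
        # share of individual trials that passed
        "pass@1": 100 * sum(sum(trials) for trials in outcomes.values()) / n_trials,
        # share of tasks solved in at least one trial
        "pass@3": 100 * sum(any(trials) for trials in outcomes.values()) / n_tasks,
        # share of tasks solved in every trial (the reliability metric)
        "pass^3": 100 * sum(all(trials) for trials in outcomes.values()) / n_tasks,
    }


# Toy data, not from the benchmark: four tasks, three trials each.
toy = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}
metrics = pass_metrics(toy)
print(metrics)  # {'pass@1': 50.0, 'pass@3': 75.0, 'pass^3': 25.0}
```

By construction pass^3 ≤ pass@1 ≤ pass@3, so the gap between the outer two shows how much of an agent's score depends on getting a lucky trial.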

Persisted record (what the agent leaves behind): per-stage states · side-effect artifacts · workspace documents · event log · conversation transcripts

Layer 1, deterministic contract: code-checkable assertions

Layer 2, LLM judge: Opus 4.7, majority of 3 votes

Trial passes: R = contract ∧ judge

Deterministic checks include: terminal status reached · stage payload values · required event log entries · document field values · cross-app side effects

LLM judge rubrics include: policy alignment · reasoning soundness · internal coherence · patient engagement · autonomy-first outreach

Figure 2. Verification pipeline. Every trial emits a persisted record (world store, event log, transcripts). A deterministic contract and an LLM judge grade in parallel; the trial passes only when both layers pass.

Headline results

We ran 30 harness × model combinations: every frontier proprietary stack (Claude Code, OpenAI Codex, Gemini CLI) paired with that lab's closed-weight models, plus four open-source frameworks (OpenClaw, Hermes, OpenAI Agents SDK, DeepAgents) sweeping five open-weight models. Best overall is Claude Code + Opus 4.6 at 28.0% pass@1. The per-domain leaders are three different models from two labs: Codex + GPT-5.5 leads PA at 29.3%, Opus 4.6 leads UM at 41.3%, and Opus 4.7 leads CM at 32.0%. There is no single “best agent” for healthcare workflows yet.

Reliability is the bigger problem. No agent clears 20% on overall pass^3, the metric where the same agent has to pass the same task in all three trials. Agents that win one trial often fail the next on the same case. Two stress tests collapse the headline number further:

The marathon. Load all 25 tasks of a domain into a single agent session and ask it to finish them in any order. The best agents finish at 3.8% overall. On PA, neither leading agent submits a single authorization across 25 queued cases despite touching most of them. Long context doesn't save you: Opus 4.7 (1M-token context, no compaction) and GPT-5.5 (auto-compacts 4–6× per session) fail in roughly the same shape.

The end-to-end arena. Wire two Codex + GPT-5.5 agents together, one as the provider and one as the payer, and let them exchange information only through MCP tools. The PA configuration that scores 30.4% solo collapses to 0% when the payer agent and cross-role checks join: 18 of 23 cases never reach an MD decision, and on the five PA cases where policy requires a peer-to-peer call, neither side raises one.

Legend: OpenAI (Codex) · Anthropic (Claude Code) · Google (Gemini CLI)

Panels: Prior Authorization, best 29.3% (Codex + GPT-5.5) · Utilization Management, best 41.3% (Claude + Opus 4.6) · Care Management, best 32.0% (Claude + Opus 4.7)

Figure 3. pass@1 across the three CHI-Bench environments for the nine proprietary harness × model configurations. The per-domain leaders come from two of the three labs (OpenAI for PA, Anthropic for UM and CM); there is no single “best” agent for healthcare workflows yet.

The full leaderboard

All 30 harness × model configurations. Efficiency columns report the per-trial averages over all 225 trials in that row.

Column groups: Overall (75 tasks) · Prior Authorization, PA (25 tasks) · Utilization Management, UM (25 tasks) · Care Management, CM (25 tasks) · Efficiency per trial (Steps, Cost)
Agent harness	Model	Overall p@1	Overall p@3	Overall p^3	PA p@1	PA p@3	PA p^3	UM p@1	UM p@3	UM p^3	CM p@1	CM p@3	CM p^3	Steps	Cost
Proprietary stack: frontier first-party CLI + closed-weight models
Codex	GPT-5.5	20.9	30.7	9.3	29.3	40.0	16.0	32.0	48.0	12.0	1.3	4.0	0.0	54	$1.29
Codex	GPT-5.4	16.0	25.3	8.0	24.0	32.0	16.0	17.3	24.0	8.0	6.7	20.0	0.0	58	$1.30
Codex	GPT-5.4 Mini	8.4	20.0	0.0	10.7	24.0	0.0	13.3	32.0	0.0	1.3	4.0	0.0	58	$0.27
Claude Code	Claude Opus 4.7	24.4	41.3	10.7	24.0	32.0	16.0	17.3	28.0	8.0	32.0	64.0	8.0	68	$9.91
Claude Code	Claude Opus 4.6	28.0	38.7	18.7	18.7	24.0	12.0	41.3	44.0	40.0	24.0	48.0	4.0	76	$6.47
Claude Code	Claude Sonnet 4.6	26.2	41.3	12.0	24.0	28.0	20.0	34.7	52.0	16.0	20.0	44.0	0.0	82	$1.30
Claude Code	Claude Haiku 4.5	6.2	10.7	2.7	0.0	0.0	0.0	14.7	24.0	8.0	4.0	8.0	0.0	41	$0.16
Gemini CLI	Gemini 3.1 Pro	7.1	13.3	1.3	14.7	24.0	4.0	6.7	16.0	0.0	0.0	0.0	0.0	82	$2.11
Gemini CLI	Gemini 3 Flash	12.5	17.3	8.0	18.7	28.0	8.0	18.7	24.0	16.0	0.0	0.0	0.0	142	$0.33
Open-source stack: open frameworks + open-weight models
OpenClaw	Claude Opus 4.7	17.3	37.3	4.0	18.7	28.0	8.0	13.3	32.0	4.0	20.0	52.0	0.0	41	$11.48
OpenClaw	Kimi K2.6	10.2	18.7	2.7	12.0	20.0	4.0	18.7	36.0	4.0	0.0	0.0	0.0	72	$0.91
OpenClaw	DeepSeek V4 Pro	11.1	24.0	1.3	14.7	28.0	4.0	12.0	28.0	0.0	6.7	16.0	0.0	42	$0.53
OpenClaw	GLM-5.1	16.9	30.7	6.7	13.3	24.0	4.0	26.7	36.0	16.0	10.7	32.0	0.0	116	$0.96
OpenClaw	Qwen 3.6 Max	4.9	10.7	0.0	10.7	24.0	0.0	4.0	8.0	0.0	0.0	0.0	0.0	79	$2.80
OpenClaw	Grok 4.3	0.4	1.3	0.0	1.3	4.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	65	$2.66
OAI Agents	Kimi K2.6	15.1	22.7	8.0	17.3	28.0	12.0	25.3	36.0	12.0	2.7	4.0	0.0	60	$0.43
OAI Agents	DeepSeek V4 Pro	14.2	22.7	9.3	10.7	16.0	8.0	28.0	40.0	20.0	4.0	12.0	0.0	52	$0.25
OAI Agents	GLM-5.1	18.7	26.7	12.0	18.7	24.0	12.0	33.3	44.0	24.0	4.0	12.0	0.0	58	$0.27
OAI Agents	Qwen 3.6 Max	15.6	22.7	9.3	16.0	20.0	12.0	26.7	36.0	16.0	4.0	12.0	0.0	48	$0.58
OAI Agents	Grok 4.3	5.8	10.7	1.3	0.0	0.0	0.0	16.0	28.0	4.0	1.3	4.0	0.0	32	$1.54
Hermes	Kimi K2.6	15.6	24.0	6.7	18.7	24.0	12.0	21.3	36.0	8.0	6.7	12.0	0.0	31	$1.07
Hermes	DeepSeek V4 Pro	13.8	22.7	8.0	8.0	16.0	4.0	25.3	32.0	20.0	8.0	20.0	0.0	26	$2.19
Hermes	GLM-5.1	18.7	28.0	10.7	10.7	16.0	8.0	34.7	44.0	24.0	10.7	24.0	0.0	30	$1.04
Hermes	Qwen 3.6 Max	16.4	28.0	5.3	9.3	16.0	4.0	26.7	36.0	12.0	13.3	32.0	0.0	29	$4.12
Hermes	Grok 4.3	4.4	8.0	1.3	0.0	0.0	0.0	13.3	24.0	4.0	0.0	0.0	0.0	32	$1.05
DeepAgents	Kimi K2.6	3.1	8.0	0.0	8.0	20.0	0.0	1.3	4.0	0.0	0.0	0.0	0.0	39	$0.55
DeepAgents	DeepSeek V4 Pro	10.7	18.7	2.7	14.7	24.0	4.0	10.7	20.0	4.0	6.7	12.0	0.0	15	$0.21
DeepAgents	GLM-5.1	11.1	17.3	5.3	17.3	24.0	12.0	10.7	16.0	4.0	5.3	12.0	0.0	21	$0.26
DeepAgents	Qwen 3.6 Max	9.3	16.0	4.0	12.0	16.0	8.0	10.7	16.0	4.0	5.3	16.0	0.0	18	$0.57
DeepAgents	Grok 4.3	2.2	5.3	0.0	0.0	0.0	0.0	5.3	12.0	0.0	1.3	4.0	0.0	21	$1.43

p@1 = pass@1, p@3 = pass@3 (any of 3 trials), p^3 = pass^3 (all 3 trials). Cost is the mean per-trial USD spend at list pricing.
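As a reading aid, the sketch below recomputes the Overall p@1 column from the three domain columns for a few rows and converts each domain rate back into a count of passing trials. It assumes the Overall figure pools all 225 trials, which for equal-sized domains equals the mean of the domain rates; the dictionary layout is just for the example.

```python
TASKS_PER_DOMAIN = 25
TRIALS_PER_TASK = 3
DOMAINS = ("PA", "UM", "CM")

# p@1 values copied from the leaderboard rows above.
rows = {
    "Codex + GPT-5.5": {"overall": 20.9, "PA": 29.3, "UM": 32.0, "CM": 1.3},
    "Claude Code + Opus 4.6": {"overall": 28.0, "PA": 18.7, "UM": 41.3, "CM": 24.0},
    "Claude Code + Opus 4.7": {"overall": 24.4, "PA": 24.0, "UM": 17.3, "CM": 32.0},
}

trials_per_domain = TASKS_PER_DOMAIN * TRIALS_PER_TASK   # 75 trials
trials_per_row = len(DOMAINS) * trials_per_domain        # 225 trials

for name, row in rows.items():
    mean_p1 = sum(row[d] for d in DOMAINS) / len(DOMAINS)
    # Recover the integer number of passing trials behind each percentage.
    passed = {d: round(row[d] / 100 * trials_per_domain) for d in DOMAINS}
    pooled = 100 * sum(passed.values()) / trials_per_row
    print(f"{name}: domain mean {mean_p1:.1f}, pooled {pooled:.1f}, "
          f"reported {row['overall']}, passing trials {passed}")
```

For the three rows shown, the domain mean and the pooled rate both round to the reported Overall value.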

Cost vs accuracy

Spending more does not buy reliability in healthcare workflows. Plot every harness × model configuration by mean per-trial spend and pass@1 and the field falls into four quadrants. The Sweet Spot (cheap and accurate) is sparsely populated and dominated by a handful of frugal open-weight pairings. The Premium tier (Claude Code paired with Opus 4.6 or Sonnet 4.6, and Codex with GPT-5.5) buys the top of the leaderboard but at 4–50× the trial cost of the frontier's budget end. Everything to the right of the $1 line and below the 13% bar is Overpriced: Grok 4.3 on OpenClaw spends $2.66 per trial to land at 0.4% pass@1. Even above the bar, spend is a poor predictor of accuracy: Claude Opus 4.7 on OpenClaw spends $11.48 per trial to land at 17.3% pass@1.
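The quadrant assignment amounts to two threshold tests. The sketch below uses the $1 and 13% splits from the text; reading "Budget" as the cheap-but-inaccurate corner and the handling of points that fall exactly on a dashed line are assumptions, and `quadrant` is a name made up for this example.

```python
COST_SPLIT_USD = 1.00   # dashed vertical line: mean spend per trial
PASS_SPLIT_PCT = 13.0   # dashed horizontal line: overall pass@1


def quadrant(cost_usd: float, pass_at_1: float) -> str:
    # Assumption: points exactly on a dashed line count as expensive / accurate.
    cheap = cost_usd < COST_SPLIT_USD
    accurate = pass_at_1 >= PASS_SPLIT_PCT
    if cheap:
        return "Sweet Spot" if accurate else "Budget"
    return "Premium" if accurate else "Overpriced"


# (mean USD per trial, overall pass@1 %) taken from the leaderboard.
examples = {
    "OAI Agents + GLM-5.1": (0.27, 18.7),
    "Claude Code + Opus 4.6": (6.47, 28.0),
    "Claude Code + Haiku 4.5": (0.16, 6.2),
    "OpenClaw + Grok 4.3": (2.66, 0.4),
}
for name, (cost, p1) in examples.items():
    print(f"{name}: {quadrant(cost, p1)}")
```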

The connecting line traces the cost-accuracy Pareto frontier: no other configuration is both cheaper and more accurate than the points on it. Seven configurations sit on the frontier (Haiku 4.5, two DeepSeek V4 Pro setups, GLM-5.1 on OAI Agents, GPT-5.5 on Codex, Sonnet 4.6, and Opus 4.6), and they span roughly a 40× range in spend ($0.16 to $6.47 per trial) for a 22-point accuracy spread.
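The frontier described above can be recomputed from the leaderboard's Cost and Overall p@1 columns. The sketch below is one simple way to do it, a sort-and-sweep that treats a configuration as dominated when another is at least as cheap and at least as accurate and strictly better on one of the two. That tie-breaking rule and the function name `pareto_frontier` are assumptions, not the authors' code.

```python
# (harness, model, mean USD per trial, overall pass@1 %) copied from the leaderboard.
CONFIGS = [
    ("Codex", "GPT-5.5", 1.29, 20.9),
    ("Codex", "GPT-5.4", 1.30, 16.0),
    ("Codex", "GPT-5.4 Mini", 0.27, 8.4),
    ("Claude Code", "Claude Opus 4.7", 9.91, 24.4),
    ("Claude Code", "Claude Opus 4.6", 6.47, 28.0),
    ("Claude Code", "Claude Sonnet 4.6", 1.30, 26.2),
    ("Claude Code", "Claude Haiku 4.5", 0.16, 6.2),
    ("Gemini CLI", "Gemini 3.1 Pro", 2.11, 7.1),
    ("Gemini CLI", "Gemini 3 Flash", 0.33, 12.5),
    ("OpenClaw", "Claude Opus 4.7", 11.48, 17.3),
    ("OpenClaw", "Kimi K2.6", 0.91, 10.2),
    ("OpenClaw", "DeepSeek V4 Pro", 0.53, 11.1),
    ("OpenClaw", "GLM-5.1", 0.96, 16.9),
    ("OpenClaw", "Qwen 3.6 Max", 2.80, 4.9),
    ("OpenClaw", "Grok 4.3", 2.66, 0.4),
    ("OAI Agents", "Kimi K2.6", 0.43, 15.1),
    ("OAI Agents", "DeepSeek V4 Pro", 0.25, 14.2),
    ("OAI Agents", "GLM-5.1", 0.27, 18.7),
    ("OAI Agents", "Qwen 3.6 Max", 0.58, 15.6),
    ("OAI Agents", "Grok 4.3", 1.54, 5.8),
    ("Hermes", "Kimi K2.6", 1.07, 15.6),
    ("Hermes", "DeepSeek V4 Pro", 2.19, 13.8),
    ("Hermes", "GLM-5.1", 1.04, 18.7),
    ("Hermes", "Qwen 3.6 Max", 4.12, 16.4),
    ("Hermes", "Grok 4.3", 1.05, 4.4),
    ("DeepAgents", "Kimi K2.6", 0.55, 3.1),
    ("DeepAgents", "DeepSeek V4 Pro", 0.21, 10.7),
    ("DeepAgents", "GLM-5.1", 0.26, 11.1),
    ("DeepAgents", "Qwen 3.6 Max", 0.57, 9.3),
    ("DeepAgents", "Grok 4.3", 1.43, 2.2),
]


def pareto_frontier(configs):
    """Return the non-dominated configurations, cheapest first."""
    frontier, best_pass = [], float("-inf")
    # Sort by cost ascending; at equal cost, put the most accurate first.
    for cfg in sorted(configs, key=lambda c: (c[2], -c[3])):
        # Keep a point only if it beats every cheaper-or-equal point on accuracy.
        if cfg[3] > best_pass:
            frontier.append(cfg)
            best_pass = cfg[3]
    return frontier


frontier = pareto_frontier(CONFIGS)
for harness, model, cost, p1 in frontier:
    print(f"{model} ({harness}): ${cost:.2f} · {p1}%")
print(len(frontier))  # 7
```

With these tie rules, Hermes + GLM-5.1 ($1.04, 18.7%) drops out because OAI Agents + GLM-5.1 reaches the same pass@1 at $0.27.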

Harness: Codex · Claude Code · Gemini CLI · OpenClaw · OAI Agents · Hermes · DeepAgents

Model: GPT-5.5 · GPT-5.4 · GPT-5.4 Mini · Claude Opus 4.7 · Claude Opus 4.6 · Claude Sonnet 4.6 · Claude Haiku 4.5 · Gemini 3.1 Pro · Gemini 3 Flash · DeepSeek V4 Pro · GLM-5.1 · Kimi K2.6 · Qwen 3.6 Max · Grok 4.3

Pareto frontier (7 configurations)

Haiku 4.5 (Claude Code): $0.16 · 6.2%
DS V4 Pro (DeepAgents): $0.21 · 10.7%
DS V4 Pro (OAI Agents): $0.25 · 14.2%
GLM-5.1 (OAI Agents): $0.27 · 18.7%
GPT-5.5 (Codex): $1.29 · 20.9%
Sonnet 4.6 (Claude Code): $1.30 · 26.2%
Opus 4.6 (Claude Code): $6.47 · 28.0%

Figure 4. Cost-accuracy ROI quadrants for all 30 harness × model configurations. Dashed lines split the plane at $1 per trial and 13% pass@1, defining the Sweet Spot, Premium, Budget, and Overpriced regions. The solid line traces the Pareto frontier — seven configurations where no other point is both cheaper and more accurate. Shape encodes harness; color encodes model family.

Where agents break down

A 28% vs. 21% leaderboard gap suggests the leading agents are doing roughly the same thing, only at different success rates. They aren't. We pulled the per-check, per-tool-call data from 225 trials each for the two leading configurations from different labs — Claude Code + Opus 4.6 (28.0% pass@1) and Codex + GPT-5.5 (20.9% pass@1) — and the failure shapes are wildly different.

Claude is the careful process-follower that submits packets it shouldn't. It nails every mandatory peer-to-peer (12 of 12) and writes care plans with 100% goal-structure compliance, but passes 0 of 33 “gather more evidence and put the case on hold” trials because it always submits. Codex is the fast shortcutter: 1.5–2× faster per trial, better at form-filling, but 88% of its outreach calls fail the patient-conversation rubric, and it racked up 122 consecutive retries on a single UM trial when it couldn't format a tool call correctly.

Each agent is good at the things the other one is bad at. Both fail in ways that would cause actual harm in production. Below: side-by-side strength/weakness profiles, a per-check breakdown across all 15 rubric checks, and six diagnostics that explain the most consistent gaps.

Claude Code + Opus 4.6

The thorough process-follower

28.0% pass · 0.834 fractional · 96% consistency

✓ UM · Procedural fidelity: never skips mandatory workflow steps (P2P: 12/12 perfect)
✓ CM · Conversational empathy: 73% outreach quality pass; adaptive rapport-building
✓ UM · Role-switching: correct authority boundaries across 5 UM roles
✓ CM · Care plan structure: 100% goal structure compliance, 89% intervention structure
✓ PA · Evidence gathering: 3.7 supporting documents per PA trial vs 1.0 for Codex
✓ Deterministic: same outcome on 96% of tasks across independent trials
✗ PA · Completeness bias: 0/33 on "gather more evidence" tasks; always submits
✗ PA · Creates submission packets when policy requires stopping (96% failure)
✗ Slow execution: 1.5–2× longer than Codex; 68 avg steps vs 46
✗ CM · Assessment quality judging: 64% failure on structured assessment checks

Codex + GPT-5.5

The efficient shortcutter

20.9% pass · 0.805 fractional · 64% consistency

✓ PA · Documentation gap detection: 6/33 on hold tasks vs Claude's 0/33
✓ PA · Intake efficiency: 40% pass on structured form-filling tasks
✓ PA · Execution speed: 3.5 min avg PA trial vs Claude's 6.6 min
✓ CM · Stage coherence: 63% pass on CM workflow staging vs Claude's 37%
✗ UM · Skips mandatory steps: bypasses P2P when clinical answer seems "obvious"
✗ CM · Catastrophic outreach failure: 88% fail on patient conversation quality
✗ UM · Retry storms: 122 consecutive retries with 163 schema validation errors
✗ CM · Care plan deficiency: 65% fail on goal structure, 55% on interventions
✗ UM · Hard-coded triage routing bug breaks gold-card auto-approvals
✗ Non-deterministic: different outcomes on 36% of tasks across trials

Per-check capability breakdown

Three views over the same 225-trial-per-agent run: failure rate by check, a capability radar across nine skills, and behavioral diagnostics for retries, schema errors, and output volume.

Series: Claude Code + Opus 4.6 and Codex + GPT-5.5. Higher bars = more trials failing that check.

Key diagnostics

The Completeness Trap

44% of PA Provider tasks require stopping mid-workflow. Claude submitted all 33 hold-required trials, noting missing documents in its reasoning but executing the submission anyway. This is not a knowledge gap; it is an execution bias toward completion.

The Schema Mismatch

Codex enters retry storms on structured tool calls: 122 consecutive retries with 163 schema validation errors on a single UM trial. The same malformed payload is repeated without adaptation, consuming more than 40% of execution time.

Process Over Certainty

When Codex determines a case is clear-cut for denial, it skips the mandatory peer-to-peer review. Claude never skips it. Healthcare workflows encode safety margins in mandatory process — shortcuts defeat the safeguards.

The Empathy Gap

88% of Codex's patient outreach conversations fail quality review. Claude averages 890 chars/message with adaptive emotional validation; Codex averages 676 chars with clinical efficiency. Trust-building requires warmth, not just information transfer.

MD Review: Nobody's Home

Both agents achieve just 8% (1/12) on physician-level review tasks. Synthesizing full clinical evidence, applying specialty criteria, and documenting legally binding rationale remains a genuine capability frontier.

The 72% Ceiling

Both models reach ~72–74% accuracy on correct approve/deny determinations. The performance gap between agents is driven by procedural compliance — following the right steps — not by clinical reasoning quality.

Cite this

If you reference CHI-Bench in a paper, post, or eval — here's the BibTeX. The data, code, and all 75 task definitions are open-sourced at the GitHub repo linked above.

@misc{chen2026chibenchaiagentsautomate,
      title={CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?},
      author={Haolin Chen and Deon Metelski and Leon Qi and Tao Xia and Joonyul Lee and Steve Brown and Kevin Riley and Frank Wang and T. Y. Alvin Liu and Hank Capps and Zeyu Tang and Xiangchen Song and Lingjing Kong and Fan Feng and Tianyi Zeng and Zhiwei Liu and Zixian Ma and Hang Jiang and Fangli Geng and Yuan Yuan and Chenyu You and Qingsong Wen and Hua Wei and Yanjie Fu and Yue Zhao and Carl Yang and Biwei Huang and Kun Zhang and Caiming Xiong and Sanmi Koyejo and Eric P. Xing and Philip S. Yu and Weiran Yao},
      year={2026},
      eprint={2605.16679},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.16679},
}