Introduction
CHI-Bench is an open-source benchmark for AI agents on long-horizon, policy-rich U.S. healthcare workflows. It pairs a high-fidelity simulator of 21 healthcare apps (exposed as MCP tools) with a 1,279-document Managed-Care Operations Handbook and a composite verifier that scores both deterministic artifacts and agent reasoning.
What it measures
CHI-Bench covers three healthcare workflows with 25 tasks each. In every task the agent is handed a clinical case and drives it to a terminal action through tool calls and artifact authoring:
- Prior Authorization (Provider). Verify coverage, gather evidence, submit the PA packet, work the response (RFIs, peer-to-peer, appeals).
- Utilization Management (Payer). Intake the request, check plan policy, escalate through nurse and physician reviewers, issue a determination.
- Care Management. Review the chart, contact the patient, administer assessments, author a care plan.
Why it's hard
The benchmark stresses three capabilities under-represented in coding-style agent benchmarks:
- Policy density. Decisions must be grounded in a large library of medical, insurance, and operational rules. These rules vary across providers and payers and drift over time.
- Multi-role composition. A single workflow spans clinician, UM nurse, medical director, and care-manager handoffs. Each handoff is terminal: a step cannot be re-run.
- Multilateral interaction. Some steps are multi-turn dialogs — peer-to-peer review, patient outreach — not tool calls. Agents must shift from background execution to live conversation and carry results back.
Headline numbers
- Best agent (Claude Code + Claude Opus 4.6): 28.0% overall pass@1.
- No agent clears 20% on strict pass^3 (a task counts only if all 3 independent trials pass).
- Marathon (all 25 tasks in one session): best is 3.8%.
- End-to-end provider–payer arena: even the best PA agents score 0%.
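To make the gap between pass@1 and pass^3 concrete, here is a minimal Python sketch of how the two metrics could be computed from per-task trial outcomes. It is an illustration, not CHI-Bench's scoring code: the function names, the task IDs, and the choice to estimate pass@1 as the mean per-trial success rate are all assumptions.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Expected single-trial pass rate, averaged over tasks.

    Assumption: pass@1 is estimated as the mean success rate across
    each task's trials; the benchmark may use a different estimator.
    """
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Strict pass^k: fraction of tasks whose first k trials ALL pass."""
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
    passed = [all(trials[:k]) for trials in results.values()]
    return sum(passed) / len(passed)


# Hypothetical outcomes: task ID -> pass/fail for 3 independent trials.
results = {
    "pa-001": [True, True, True],     # reliable: counts toward pass^3
    "pa-002": [True, False, True],    # flaky: helps pass@1 only
    "um-001": [False, False, False],  # never passes
    "cm-001": [True, True, False],    # flaky: helps pass@1 only
}

print(f"pass@1 = {pass_at_1(results):.3f}")
print(f"pass^3 = {pass_hat_k(results, k=3):.3f}")
```

In this toy data the flaky tasks lift pass@1 but contribute nothing to pass^3, which is why the strict metric can sit well below the single-trial number.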
How to use these docs
The docs are organized in three groups:
Get started
Leaderboard
Beyond these docs
- actava-ai/chi-bench — the producer repo (CLI, harnesses, fixtures, image).
- actava-ai/leaderboard — the data-only PR-based submission registry.
- Hugging Face dataset — gated dataset releases.
- Public leaderboard · Task explorer