Introduction

CHI-Bench is an open-source benchmark for AI agents on long-horizon, policy-rich U.S. healthcare workflows. It pairs a high-fidelity simulator of 21 healthcare apps (exposed as MCP tools) with a 1,279-document Managed-Care Operations Handbook and a composite verifier that scores both deterministic artifacts and agent reasoning.

What it measures

CHI-Bench covers three healthcare workflows with 25 tasks each (75 tasks in total). In every task the agent is handed a clinical case and must drive it to a terminal action through tool calls and artifact authoring:

Prior Authorization (Provider). Verify coverage, gather evidence, submit the PA packet, work the response (RFIs, peer-to-peer, appeals).
Utilization Management (Payer). Intake the request, check plan policy, escalate through nurse and physician reviewers, issue a determination.
Care Management. Review the chart, contact the patient, administer assessments, author a care plan.

Why it's hard

The benchmark stresses three capabilities under-represented in coding-style agent benchmarks:

Policy density. Decisions must be grounded in a large library of medical, insurance, and operational rules. These rules vary across providers and payers, and they drift over time.
Multi-role composition. A single workflow spans clinician, UM nurse, medical director, and care-manager handoffs. Each handoff is terminal: once a step is handed off, it cannot be re-run.
Multilateral interaction. Some steps are multi-turn dialogs (peer-to-peer review, patient outreach) rather than tool calls. Agents must shift from background execution to live conversation and carry the results back into the workflow.

Headline numbers

Best agent (Claude Code + Claude Opus 4.6): 28.0% overall pass@1.
No agent clears 20% on strict pass^3 (a task counts only if all 3 independent trials pass).
Marathon (all 25 tasks in one session): best is 3.8%.
End-to-end provider–payer arena: the best PA agents score 0%.
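
The two headline metrics can be illustrated with a minimal sketch. It assumes pass@1 is the per-trial success rate averaged over tasks, and it follows the definition of pass^3 given above (a task counts only when all three trials pass). The function names, the task IDs, and the results layout are hypothetical and are not taken from the CHI-Bench CLI or harnesses.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean single-trial success rate, averaged over tasks (assumed definition)."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k independent trials all pass."""
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
    strict = [all(trials[:k]) for trials in results.values()]
    return sum(strict) / len(strict)


# Hypothetical per-task outcomes: three independent trials each.
results = {
    "pa-001": [True, True, True],     # passes every trial
    "um-001": [True, False, True],    # flaky: counts toward pass@1 only
    "cm-001": [False, False, False],  # never passes
    "cm-002": [True, True, False],    # flaky
}

print(f"pass@1 = {pass_at_1(results):.3f}")      # prints pass@1 = 0.583
print(f"pass^3 = {pass_hat_k(results, 3):.3f}")  # prints pass^3 = 0.250
```

In this toy example the flaky tasks lift pass@1 but contribute nothing to pass^3, which is why the strict metric sits well below the single-trial number.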

How to use these docs

The docs are organized in three groups:

Get started
CHI-Bench
Leaderboard


Beyond these docs

actava-ai/chi-bench — the producer repo (CLI, harnesses, fixtures, image).
actava-ai/leaderboard — the data-only PR-based submission registry.
Hugging Face dataset — gated dataset releases.
Public leaderboard · Task explorer