Introduction

CHI-Bench is an open-source benchmark for AI agents on long-horizon, policy-rich U.S. healthcare workflows. It pairs a high-fidelity simulator of 21 healthcare apps (exposed as MCP tools) with a 1,279-document Managed-Care Operations Handbook and a composite verifier that scores both deterministic artifacts and agent reasoning.

What it measures

Three healthcare workflows, 25 tasks each. The agent is handed a clinical case and drives it to a terminal action through tool calls and artifact authoring:

  • Prior Authorization (Provider). Verify coverage, gather evidence, submit the PA packet, work the response (RFIs, peer-to-peer, appeals).
  • Utilization Management (Payer). Intake the request, check plan policy, escalate through nurse and physician reviewers, issue a determination.
  • Care Management. Review the chart, contact the patient, administer assessments, author a care plan.

Why it's hard

The benchmark stresses three capabilities under-represented in coding-style agent benchmarks:

  • Policy density. Decisions must be grounded in a large library of medical, insurance, and operational rules. Those rules vary across providers and payers, and they drift over time.
  • Multi-role composition. A single workflow spans clinician, UM nurse, medical director, and care-manager handoffs. Each handoff is terminal: a step cannot be re-run.
  • Multilateral interaction. Some steps are multi-turn dialogs (peer-to-peer review, patient outreach) rather than tool calls. Agents must shift from background execution to live conversation and carry the results back into the workflow.

Headline numbers

  • Best agent (Claude Code + Claude Opus 4.6): 28.0% overall pass@1.
  • No agent clears 20% on strict pass^3 (a task counts only if all 3 independent trials pass).
  • Marathon (all 25 tasks in one session): best is 3.8%.
  • End-to-end provider–payer arena: even the best PA agents score 0%.
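
The gap between pass@1 and pass^3 is easier to see with a small calculation. The sketch below is illustrative only and is not the benchmark's scoring code: it assumes pass@1 is the per-trial success rate averaged over tasks, and that pass^3 counts a task only when all of its first three trials pass, as described above. The function names, task IDs, and trial outcomes are all hypothetical.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Per-trial success rate, averaged over tasks (assumed definition)."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL pass (strict pass^k)."""
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
    solved = [all(trials[:k]) for trials in results.values()]
    return sum(solved) / len(solved)


# Hypothetical outcomes: task ID -> pass/fail for 3 independent trials.
results = {
    "pa-001": [True, True, True],     # reliable: counts toward both metrics
    "um-001": [True, False, True],    # flaky: helps pass@1, fails pass^3
    "cm-001": [False, False, False],  # never passes
    "cm-002": [True, True, False],    # flaky: helps pass@1, fails pass^3
}

print(f"pass@1 = {pass_at_1(results):.3f}")      # 0.583
print(f"pass^3 = {pass_hat_k(results, 3):.3f}")  # 0.250
```

In this toy data the agent passes more than half of its individual trials, yet only one task in four survives the strict criterion, which is why pass^3 is the harsher measure of reliability.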

How to use these docs

The docs are organized in three groups:

Beyond these docs