Tasks & domains
CHI-Bench v1.0.0 ships 75 tasks across three domains. Each task is a single end-to-end clinical or administrative workflow scored with a rubric judge.
The three domains
| Domain | Tasks | What the agent does |
|---|---|---|
| Prior Authorization — Provider | 25 | Verify coverage, gather evidence, submit the PA packet, work the response (RFIs, peer-to-peer, appeals). |
| Prior Authorization — UM (Payer) | 25 | Intake the request, check plan policy, escalate through nurse and physician reviewers, issue determination. |
| Care Management | 25 | Review the chart, contact the patient, administer assessments, author a care plan. |
Task layout on disk
Tasks live in one of two physical layouts, depending on context:
- Host source: data/<domain>/tasks/... (the layout written by huggingface-cli download).
- Inside the baked image (/opt/chi-bench): flat tasks/ with marathon/ and worlds/ siblings, handbook at /workspace/skills/managed-care-operations-handbook.
cb data verify auto-detects which layout it's pointed at.
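The distinction between the two layouts can be sketched in a few lines of Python. This is only an illustration of the directory shapes described above, not the detection logic that cb data verify actually uses; the function name `detect_layout` and its return values are hypothetical.

```python
from pathlib import Path


def detect_layout(root: Path) -> str:
    """Guess which layout `root` uses: 'host', 'image', or 'unknown'.

    Hypothetical helper; it only mirrors the two shapes described in the text.
    """
    # Host source: data/<domain>/tasks/...
    data_dir = root / "data"
    if data_dir.is_dir() and any(
        (d / "tasks").is_dir() for d in data_dir.iterdir() if d.is_dir()
    ):
        return "host"
    # Baked image: flat tasks/ with marathon/ and worlds/ as siblings
    if all((root / name).is_dir() for name in ("tasks", "marathon", "worlds")):
        return "image"
    return "unknown"
```

Checking for the host shape first is an arbitrary choice in this sketch; a root that matches neither shape is reported as unknown rather than guessed.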
Per-task fixtures
Each task directory contains an instruction.md the agent reads at start, plus a fixtures/ directory with the pre-staged world state (charts, claims, prior notes, etc.). The fixtures/expectations.json file is hidden from the agent — it contains rubric items the verifier scores against. The entrypoint exposes raw artifacts via /workspace/raw/artifacts/ except for *_new_referral_provider tasks, where the chart is projected through MCP tools only.
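As a rough illustration of the fixture rules above, the following sketch lists a task directory's files while leaving out the hidden rubric, and flags the tasks whose raw artifacts are not exposed. The helper names `non_rubric_files` and `raw_artifacts_exposed` are hypothetical, and the sketch assumes nothing about the directory beyond instruction.md, fixtures/, and fixtures/expectations.json.

```python
from pathlib import Path


def non_rubric_files(task_dir: Path) -> list[Path]:
    """Return instruction.md plus every fixture file except the hidden rubric."""
    fixtures = task_dir / "fixtures"
    hidden = fixtures / "expectations.json"  # scored by the verifier, hidden from the agent
    files = [task_dir / "instruction.md"]
    files += sorted(p for p in fixtures.rglob("*") if p.is_file() and p != hidden)
    return files


def raw_artifacts_exposed(task_id: str) -> bool:
    """False for *_new_referral_provider tasks, whose chart is served via MCP tools only."""
    return not task_id.endswith("_new_referral_provider")
```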
Browse all 75 tasks
The full task explorer is at /benchmarks/tasks — search by title or ID, filter by domain, and click into any task to read the agent-facing instructions.
Naming conventions
Task IDs follow this pattern:
pa_t001_t001_o001_p01_new_referral_provider
│  │    │    │    │   │
│  │    │    │    │   └─ workflow / role suffix
│  │    │    │    └───── patient ID
│  │    │    └────────── organization ID
│  │    └─────────────── trial / template ID
│  └──────────────────── task ID
└─────────────────────── domain prefix (pa | um | cm)
Marathon mode
For Table 3 reproduction, all 25 tasks of a domain can be queued into a single agent session via the marathon/ sibling directory. Memory and context behavior across tasks is the headline finding here — best-agent pass@1 drops from 28.0% single-task to 3.8% marathon.
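Queuing one domain's tasks into a single session means grouping task IDs by their domain prefix. The sketch below parses IDs according to the pattern under Naming conventions and groups them by domain. It is an illustration only, not how the marathon/ directory is built: the field letters and the use of `\d+` for the numeric parts are inferred from the single example ID, and the names `TaskId`, `parse_task_id`, and `group_by_domain` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Iterable, NamedTuple


class TaskId(NamedTuple):
    domain: str        # pa | um | cm
    task: str          # e.g. t001
    template: str      # trial / template ID, e.g. t001
    organization: str  # e.g. o001
    patient: str       # e.g. p01
    suffix: str        # workflow / role suffix


# Assumed pattern; digit widths are deliberately left open.
_TASK_ID = re.compile(
    r"^(?P<domain>pa|um|cm)_(?P<task>t\d+)_(?P<template>t\d+)"
    r"_(?P<organization>o\d+)_(?P<patient>p\d+)_(?P<suffix>.+)$"
)


def parse_task_id(task_id: str) -> TaskId:
    """Split a task ID into its six components, or raise ValueError."""
    match = _TASK_ID.match(task_id)
    if match is None:
        raise ValueError(f"unrecognized task ID: {task_id!r}")
    return TaskId(**match.groupdict())


def group_by_domain(task_ids: Iterable[str]) -> dict[str, list[str]]:
    """Group task IDs by domain prefix, preserving input order within each group."""
    groups: dict[str, list[str]] = defaultdict(list)
    for task_id in task_ids:
        groups[parse_task_id(task_id).domain].append(task_id)
    return dict(groups)
```

For example, `parse_task_id("pa_t001_t001_o001_p01_new_referral_provider").suffix` yields `"new_referral_provider"`, and an ID with an unknown prefix raises `ValueError` instead of being silently grouped.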