Tasks & domains

CHI-Bench v1.0.0 ships 75 tasks across three domains. Each task is a single end-to-end clinical or administrative workflow scored with a rubric judge.

The three domains

Domain	Tasks	What the agent does
Prior Authorization — Provider	25	Verify coverage, gather evidence, submit the PA packet, work the response (RFIs, peer-to-peer, appeals).
Prior Authorization — UM (Payer)	25	Intake the request, check plan policy, escalate through nurse and physician reviewers, issue determination.
Care Management	25	Review the chart, contact the patient, administer assessments, author a care plan.

Task layout on disk

Tasks are stored in one of two physical layouts, depending on context:

Host source: data/<domain>/tasks/... — the layout written by huggingface-cli download.
Inside the baked image (/opt/chi-bench): flat tasks/ with marathon//worlds/ siblings, handbook at /workspace/skills/managed-care-operations-handbook.

cb data verify auto-detects which layout it's pointed at.
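
As a rough illustration of how the two layouts can be told apart, here is a minimal Python sketch. It is not the logic used by cb data verify; the helper name detect_layout, the return labels, and the assumption that the host root directly contains data/ are all hypothetical.

```python
from pathlib import Path


def detect_layout(root: Path) -> str:
    """Guess which layout `root` uses: "host", "image", or "unknown".

    Hypothetical helper for illustration; not part of the cb CLI.
    """
    data_dir = root / "data"
    # Host source layout: data/<domain>/tasks/...
    if data_dir.is_dir() and any(
        (domain / "tasks").is_dir()
        for domain in data_dir.iterdir()
        if domain.is_dir()
    ):
        return "host"
    # Baked-image layout: a flat tasks/ directory directly under the root.
    if (root / "tasks").is_dir():
        return "image"
    return "unknown"
```

Calling detect_layout(Path("/opt/chi-bench")) inside the image would be expected to return "image" under these assumptions.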

Per-task fixtures

Each task directory contains an instruction.md the agent reads at start, plus a fixtures/ directory with the pre-staged world state (charts, claims, prior notes, etc.). The fixtures/expectations.json file is hidden from the agent — it contains rubric items the verifier scores against. The entrypoint exposes raw artifacts via /workspace/raw/artifacts/ except for *_new_referral_provider tasks, where the chart is projected through MCP tools only.
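
The per-task structure described above can be checked with a short Python sketch. The function name inspect_task and the returned keys are hypothetical, and the sketch assumes the task directory is named after the task ID; only the file names and the *_new_referral_provider rule come from the text.

```python
from pathlib import Path


def inspect_task(task_dir: Path) -> dict:
    """Summarize the fixture files present in one task directory.

    Hypothetical helper; assumes the directory name is the task ID.
    """
    fixtures = task_dir / "fixtures"
    return {
        "task_id": task_dir.name,
        # Agent-facing instructions read at the start of the task.
        "has_instruction": (task_dir / "instruction.md").is_file(),
        # Pre-staged world state (charts, claims, prior notes, ...).
        "has_fixtures": fixtures.is_dir(),
        # Rubric items for the verifier; hidden from the agent.
        "has_expectations": (fixtures / "expectations.json").is_file(),
        # *_new_referral_provider tasks expose the chart through MCP tools only.
        "raw_artifacts_exposed": not task_dir.name.endswith(
            "_new_referral_provider"
        ),
    }
```

A task missing instruction.md or fixtures/expectations.json would show up as False in the corresponding field, which makes the function usable as a quick sanity check over a downloaded dataset.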

Browse all 75 tasks

The full task explorer is at /benchmarks/tasks — search by title or ID, filter by domain, and click into any task to read the agent-facing instructions.

Naming conventions

Task IDs follow this pattern:

pa_t001_t001_o001_p01_new_referral_provider
│  │    │    │    │   │
│  │    │    │    │   └─ workflow / role suffix
│  │    │    │    └───── patient ID
│  │    │    └────────── organization ID
│  │    └─────────────── trial / template ID
│  └──────────────────── task ID
└─────────────────────── domain prefix (pa | um | cm)
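
The pattern can be split into its fields with a regular expression. The Python sketch below is inferred from the single example ID above, so the letter prefixes (t, t, o, p), the all-digit segments, and the lowercase suffix are assumptions; TaskId and parse_task_id are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class TaskId(NamedTuple):
    domain: str        # pa | um | cm
    task: str          # e.g. t001
    template: str      # trial / template ID, e.g. t001
    organization: str  # e.g. o001
    patient: str       # e.g. p01
    suffix: str        # workflow / role suffix


# Assumed shape: each ID segment is a letter followed by digits.
_TASK_ID = re.compile(
    r"^(?P<domain>pa|um|cm)"
    r"_(?P<task>t\d+)"
    r"_(?P<template>t\d+)"
    r"_(?P<organization>o\d+)"
    r"_(?P<patient>p\d+)"
    r"_(?P<suffix>[a-z0-9_]+)$"
)


def parse_task_id(task_id: str) -> Optional[TaskId]:
    """Return the parsed fields, or None if the ID does not fit the pattern."""
    match = _TASK_ID.match(task_id)
    return TaskId(**match.groupdict()) if match else None
```

For the example ID, parse_task_id("pa_t001_t001_o001_p01_new_referral_provider").suffix evaluates to "new_referral_provider", and an ID with an unknown domain prefix returns None.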

Marathon mode

For Table 3 reproduction, all 25 tasks of a domain can be queued into a single agent session via the marathon/ sibling directory. How memory and context behave across tasks is the headline finding: best-agent pass@1 drops from 28.0% in single-task mode to 3.8% in marathon mode, a decline of 24.2 percentage points.