Architecture

A single Python package hosts a FastAPI server, three MCP servers, and the workspace judge. Each trial runs in a fresh Docker container that bundles all of the above plus the per-task fixtures.

Trial flow

Harbor — cb experiment run -f <config> shells out to Harbor, which spawns one container per trial via ChiBenchDockerEnvironment (local) or ChiBenchModalEnvironment (Modal sandbox).
Container start — docker/entrypoint.sh reads CHI_BENCH_TASK_ID, wires /opt/chi-bench/tasks/<id>/fixtures to /fixtures, starts the unified server (HTTP + 3 MCP threads on fixed ports), and waits for all four endpoints to accept traffic before exec'ing the agent harness CLI.
Agent — the harness drives the agent against the role-scoped MCP tools. Some steps are multi-turn dialogs (peer-to-peer, patient outreach) that the harness mediates as natural-language conversations.
Verifier — after the agent stops, Harbor invokes WorkspaceJudge on claude-opus-4-7 in the same container; it reads /fixtures/expectations.json (hidden from the agent) and the full workspace, then writes verifier/scorecard.json + verifier/verdicts.json.
Reward — Harbor writes result.json. Trial reward is the AND of rubric verdicts (or a continuous score for Care Management).
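
The binary reward rule in the last step can be sketched as follows. This is a minimal illustration, not the project's code: the function name trial_reward and the in-memory mapping of rubric names to booleans are hypothetical, and treating an empty verdict set as a failure is an assumption. The continuous Care Management score is not covered.

```python
from typing import Mapping


def trial_reward(verdicts: Mapping[str, bool]) -> float:
    """Return 1.0 only if every rubric verdict passed, otherwise 0.0.

    `verdicts` maps a rubric name to its pass/fail verdict. The shape of
    this mapping is hypothetical and not taken from verdicts.json.
    """
    # Assumption: a trial with no verdicts counts as a failure, not a
    # vacuous pass (all() on an empty iterable would return True).
    if not verdicts:
        return 0.0
    # Logical AND across all rubric verdicts.
    return 1.0 if all(verdicts.values()) else 0.0
```

A single failed rubric is enough to zero the reward, so partial credit is never awarded under this rule.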

Module layout

Source lives under src/chi_bench/:

core/ — domain models (PriorAuthCase, CMOutreachTask, …), state machines, world store.
services/ — ~29 HTTP/MCP-backed domain services (chart, coverage, intake, p2p, …).
server/ — FastAPI app exposing the services as REST endpoints under /api/....
mcp/ — three MCP servers wrapping the services; see mcp/{server,payer_server,cm_server}.py.
conversation/ — patient simulator and peer-to-peer session orchestration.
experiment/ — Harbor-driven trial runner + agents/ (seven harnesses) + dual_pa_e2e_*.
verifier/ — pluggable judge (default WorkspaceJudge), rubric stages, and rejudge runner.

Service ports

Service	Port	Role
FastAPI backend	`:8010`	HTTP REST surface for all services.
Provider MCP	`:8020`	Tools scoped to the provider/PA-author role.
Payer MCP	`:8100`	Tools scoped to the UM nurse / medical director.
Care Management MCP	`:8200`	Tools scoped to the RN care manager.
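
The container start step waits for all four endpoints in the table above before launching the agent. A readiness check of that kind might look like the sketch below. It is written in Python for illustration only; the real logic lives in docker/entrypoint.sh and may differ. The names SERVICE_PORTS, port_is_open, and wait_for_services, the host 127.0.0.1, and the timeout values are assumptions.

```python
import socket
import time
from typing import Mapping

# Ports copied from the table above; the dictionary keys are illustrative labels.
SERVICE_PORTS = {
    "fastapi": 8010,
    "provider_mcp": 8020,
    "payer_mcp": 8100,
    "cm_mcp": 8200,
}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(
    ports: Mapping[str, int],
    host: str = "127.0.0.1",
    deadline_s: float = 30.0,
    poll_s: float = 0.25,
) -> bool:
    """Poll until every port accepts connections or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    pending = set(ports.values())
    while pending:
        # Keep only the ports that still refuse connections.
        pending = {p for p in pending if not port_is_open(host, p)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

Calling wait_for_services(SERVICE_PORTS) returns True once all four listeners are up and False if any of them is still unreachable at the deadline. A TCP connect only shows that a port is accepting traffic, not that the service behind it is healthy.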

The verifier (workspace judge)

The verifier is a composite: deterministic rubric checks (file exists, payload field equals X, terminal status reached) combined with a rubric-based LLM judge that scores reasoning soundness, policy alignment, and patient-engagement quality.

Judge model: pinned to claude-opus-4-7. Setting CHI_BENCH_JUDGE_MODEL overrides it, but results then deviate from the paper protocol.
Voting: set CHI_BENCH_JUDGE_NUM_VOTES to a value greater than 1 to enable majority-voted judging.
Re-judge: cb verifier rejudge re-scores trials without re-running the agent.
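
Majority-voted judging can be illustrated with the sketch below. It is not the verifier's implementation: num_votes, majority_verdict, and the judge_once callable are hypothetical names, the default of one vote when the variable is unset is an assumption, and so is the choice to count a tie as a failure.

```python
import os
from collections import Counter
from typing import Callable, Mapping, Optional


def num_votes(env: Optional[Mapping[str, str]] = None) -> int:
    """Read CHI_BENCH_JUDGE_NUM_VOTES, falling back to 1 (assumed default)."""
    env = os.environ if env is None else env
    raw = env.get("CHI_BENCH_JUDGE_NUM_VOTES", "1")
    # Clamp to at least one vote so a zero or negative value cannot skip judging.
    return max(1, int(raw))


def majority_verdict(judge_once: Callable[[], bool], votes: int) -> bool:
    """Call the judge `votes` times and return the strict-majority verdict.

    `judge_once` stands in for a single LLM judge call on one rubric item.
    """
    tally = Counter(judge_once() for _ in range(votes))
    # Assumption: a tie (possible with an even vote count) counts as a fail.
    return tally[True] > votes / 2
```

An odd vote count avoids ties entirely, which is one reason to prefer values such as 3 or 5 over 2 or 4.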

Key invariants

/fixtures is not exposed to the agent — expectations, scoring contracts, and manifests are reserved for the verifier.
cb serve starts the payer in agent mode by default: it sets CHI_BENCH_PAYER_MODE=agent only when the variable is not already set.
The /opt/chi-bench/tasks/<id>/fixtures directory is mounted read-only.
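
The "set only if unset" behaviour described for CHI_BENCH_PAYER_MODE is the standard environment-default pattern. A minimal sketch, assuming a hypothetical helper named apply_payer_mode_default (how cb serve actually applies the default is not shown in this document):

```python
import os
from typing import MutableMapping, Optional


def apply_payer_mode_default(
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Set CHI_BENCH_PAYER_MODE to "agent" unless a value is already present."""
    env = os.environ if env is None else env
    # setdefault leaves an existing value untouched, so an explicit
    # setting by the user always wins over the default.
    env.setdefault("CHI_BENCH_PAYER_MODE", "agent")
    return env["CHI_BENCH_PAYER_MODE"]
```

Passing a plain dict instead of os.environ makes the behaviour easy to check without touching the real process environment.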