Architecture
A single Python package hosts a FastAPI server, three MCP servers, and the workspace judge. Each trial runs in a fresh Docker container that bundles all of the above plus the per-task fixtures.
Trial flow
- Harbor: `cb experiment run -f <config>` shells out to Harbor, which spawns one container per trial via `ChiBenchDockerEnvironment` (local) or `ChiBenchModalEnvironment` (Modal sandbox).
- Container start: `docker/entrypoint.sh` reads `CHI_BENCH_TASK_ID`, wires `/opt/chi-bench/tasks/<id>/fixtures` to `/fixtures`, starts the unified server (HTTP + 3 MCP threads on fixed ports), and waits for all four endpoints to accept traffic before exec'ing the agent harness CLI.
- Agent: the harness drives the agent against the role-scoped MCP tools. Some steps are multi-turn dialogs (peer-to-peer, patient outreach) that the harness mediates as natural-language conversations.
- Verifier: after the agent stops, Harbor invokes `WorkspaceJudge` on `claude-opus-4-7` in the same container; it reads `/fixtures/expectations.json` (hidden from the agent) and the full workspace, then writes `verifier/scorecard.json` + `verifier/verdicts.json`.
- Reward: Harbor writes `result.json`. Trial reward is the AND of rubric verdicts (or a continuous score for Care Management).
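
The binary reward rule above (AND of rubric verdicts) can be illustrated with a minimal sketch. The schema of `verifier/verdicts.json` is not described here, so the mapping from rubric id to pass/fail and the function name `trial_reward` are assumptions made for illustration only; the continuous Care Management score is not covered.

```python
from typing import Mapping


def trial_reward(verdicts: Mapping[str, bool]) -> float:
    """Return 1.0 only if every rubric verdict passed, else 0.0.

    `verdicts` maps a hypothetical rubric id to a pass/fail flag.
    """
    # Assumption: an empty verdict set counts as a failure. Without this
    # guard, all() on an empty iterable would return True.
    if not verdicts:
        return 0.0
    return 1.0 if all(verdicts.values()) else 0.0
```

Under this rule a single failed rubric zeroes the trial, which is why the empty case is handled explicitly instead of being left to `all()`.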
Module layout
Source lives under src/chi_bench/:
- `core/`: domain models (`PriorAuthCase`, `CMOutreachTask`, …), state machines, world store.
- `services/`: ~29 HTTP/MCP-backed domain services (chart, coverage, intake, p2p, …).
- `server/`: FastAPI app exposing the services as REST endpoints under `/api/...`.
- `mcp/`: three MCP servers wrapping the services; see `mcp/{server,payer_server,cm_server}.py`.
- `conversation/`: patient simulator and peer-to-peer session orchestration.
- `experiment/`: Harbor-driven trial runner + `agents/` (seven harnesses) + `dual_pa_e2e_*`.
- `verifier/`: pluggable judge (default `WorkspaceJudge`), rubric stages, and rejudge runner.
Service ports
| Service | Port | Role |
|---|---|---|
| FastAPI backend | :8010 | HTTP REST surface for all services. |
| Provider MCP | :8020 | Tools scoped to the provider/PA-author role. |
| Payer MCP | :8100 | Tools scoped to the UM nurse / medical director. |
| Care Management MCP | :8200 | Tools scoped to the RN care manager. |
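
The entrypoint waits for all four endpoints to accept traffic before starting the agent. The actual check in `docker/entrypoint.sh` is not shown here; the following is a rough Python sketch of the same idea using plain TCP connects against the ports in the table. The helper names, the timeout values, and the use of `127.0.0.1` are assumptions.

```python
import socket
import time

# Ports from the table above; the keys are labels for this sketch only.
PORTS = {"backend": 8010, "provider_mcp": 8020, "payer_mcp": 8100, "cm_mcp": 8200}


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(ports, host="127.0.0.1", deadline_s=30.0, interval_s=0.25) -> bool:
    """Poll until every port accepts a connection or the deadline passes."""
    end = time.monotonic() + deadline_s
    pending = set(ports)
    while pending:
        # Keep only the ports that still refuse connections.
        pending = {p for p in pending if not port_open(host, p)}
        if not pending:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
    return True
```

A caller would run `wait_for_ports(PORTS.values())` and abort the trial if it returns `False`. A successful TCP connect only shows that the socket is listening, not that the service behind it is healthy.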
The verifier (workspace judge)
The verifier is a composite: deterministic rubric checks (file exists, payload field equals X, terminal status reached) combined with a rubric-based LLM judge that scores reasoning soundness, policy alignment, and patient-engagement quality.
- Judge model: pinned to `claude-opus-4-7`. Override with `CHI_BENCH_JUDGE_MODEL`, but you'll deviate from the paper protocol.
- Voting: `CHI_BENCH_JUDGE_NUM_VOTES > 1` for majority-voted judging.
- Re-judge: `cb verifier rejudge` re-scores trials without re-running the agent.
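
Majority-voted judging can be sketched as follows. How `WorkspaceJudge` actually combines votes, and how it breaks ties, is not specified here; this example assumes boolean per-vote verdicts and treats a tie as a failure. The function name `majority_verdict` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_verdict(votes: Sequence[bool]) -> bool:
    """Collapse several judge votes on one rubric into a single verdict."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # Assumption: a strict majority is needed to pass, so ties fail.
    return counts[True] > len(votes) / 2
```

An odd vote count avoids ties altogether, which sidesteps the tie-breaking assumption.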
Key invariants
- `/fixtures` is not exposed to the agent: expectations, scoring contracts, and manifests are reserved for the verifier.
- `cb serve` starts the payer in agent mode by setting `CHI_BENCH_PAYER_MODE=agent` if unset.
- The `/opt/chi-bench/tasks/<id>/fixtures` directory is mounted read-only.
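
The "set if unset" behaviour of `CHI_BENCH_PAYER_MODE` corresponds to a standard default-on-missing pattern. The sketch below is not the `cb serve` implementation; `resolve_payer_mode` is a hypothetical helper, and any mode value other than `agent` is invented for the usage example.

```python
import os
from typing import MutableMapping, Optional


def resolve_payer_mode(env: Optional[MutableMapping[str, str]] = None) -> str:
    """Default CHI_BENCH_PAYER_MODE to 'agent' without overriding an existing value."""
    env = os.environ if env is None else env
    # setdefault writes only when the key is absent; a value that is
    # already set (even an empty string) is left untouched.
    return env.setdefault("CHI_BENCH_PAYER_MODE", "agent")
```

For example, `resolve_payer_mode({})` returns `"agent"`, while an environment that already sets the variable keeps its own value.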
Further reading
- System diagram and module boundaries: docs/architecture.md
- Verifier details: docs/judge.md
- Full CLI reference: docs/cli.md
- Up next: Run experiments