Quickstart: run one task
Smoke-test that everything is wired up by running a single Utilization Management medical-director-review task.
One trial, one command
uv run cb experiment run \
--dataset data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer \
--agent codex --model openai/gpt-5.5
Trial output lands under logs/experiments/.../trial_*/. Two files give you everything you need at a glance:
- result.json: the verifier reward and agent metadata.
- verifier/scorecard.json: per-check verdicts (deterministic + LLM judge).
See Read the scorecard for the full field-by-field walkthrough — what binary vs fractional reward means, the stage structure, the deterministic-check namespaces, and how Care Management's two-axis rubric differs from PA/UM.
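To confirm that both files exist without opening them by hand, a small script along these lines may help. It is a minimal sketch, not part of the cb CLI: the helper names latest_trial_dir and load_trial_outputs are hypothetical, it assumes trial directories sit somewhere below logs/experiments/ and match trial_*, and it prints only top-level keys because the field names are documented in Read the scorecard rather than here.

```python
import json
from pathlib import Path
from typing import Optional


def latest_trial_dir(root: Path) -> Optional[Path]:
    """Return the most recently modified trial_* directory under root, if any."""
    trials = [p for p in root.glob("**/trial_*") if p.is_dir()]
    return max(trials, key=lambda p: p.stat().st_mtime, default=None)


def load_trial_outputs(trial_dir: Path) -> dict:
    """Load result.json and verifier/scorecard.json, skipping missing files."""
    outputs = {}
    for rel in ("result.json", "verifier/scorecard.json"):
        path = trial_dir / rel
        if path.is_file():
            outputs[rel] = json.loads(path.read_text(encoding="utf-8"))
    return outputs


if __name__ == "__main__":
    trial = latest_trial_dir(Path("logs/experiments"))
    if trial is None:
        print("no trial_* directory found under logs/experiments")
    else:
        for name, data in load_trial_outputs(trial).items():
            # Show only the top-level shape; no field names are assumed.
            shape = sorted(data) if isinstance(data, dict) else type(data).__name__
            print(f"{trial / name}: {shape}")
```

Run it from the repository root after the trial finishes. If either file is missing from the printed output, the trial probably did not reach the verifier stage.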
What just happened
- Harbor spawned a fresh Docker container for the trial.
- The container started the unified FastAPI + 3 MCP servers and waited for them to accept traffic.
- The agent harness drove the agent against the role-scoped MCP tools.
- After the agent stopped, the workspace judge (pinned to claude-opus-4-7) read the workspace, world state, and event trail and wrote the scorecard.
- Harbor wrote result.json with the AND of rubric verdicts as the final reward.
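The last step, taking the AND of rubric verdicts, can be illustrated with a few lines of Python. This is a sketch of the idea only, not Harbor's implementation: representing each verdict as a boolean, the function name binary_reward, and treating an empty verdict list as a failure are all assumptions made for illustration.

```python
from typing import Iterable


def binary_reward(verdicts: Iterable[bool]) -> float:
    """Return 1.0 only if every rubric verdict passed, otherwise 0.0."""
    verdicts = list(verdicts)
    # Assumption: no verdicts at all counts as a failure, not a vacuous pass.
    return 1.0 if verdicts and all(verdicts) else 0.0


if __name__ == "__main__":
    print(binary_reward([True, True, True]))   # every check passed
    print(binary_reward([True, False, True]))  # one failed check zeroes the reward
```

Under this reading, a single failed rubric check is enough to bring the final reward to zero, which is why the per-check verdicts in the scorecard are the place to look when a trial scores 0.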
See Architecture for the full picture of how the pieces fit together.
Trying other agents
Replace --agent and --model with any supported pair:
# Claude Code with Opus 4.6
uv run cb experiment run \
--dataset data/prior_auth_provider/tasks/pa_t001_t001_o001_p01_new_referral_provider \
--agent claude-code --model anthropic/claude-opus-4-6
# Open-stack: Hermes harness on GLM-5.1 via OpenRouter
uv run cb experiment run \
--dataset data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer \
--agent hermes --model openrouter/z-ai/glm-5.1
The full 30-row matrix lives in configs/experiments/table1_main_matrix.yaml. See Run experiments for paper-table reproduction.
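If you want to try several pairs in a row, one option is to generate the commands from a list. The sketch below only prints the command lines (a dry run) and uses nothing beyond the --dataset, --agent, and --model flags shown on this page. The RUNS list and the build_command helper are hypothetical names, and the triples are copied from the examples above.

```python
import shlex

# (dataset, agent, model) triples copied from the commands on this page.
RUNS = [
    ("data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer",
     "codex", "openai/gpt-5.5"),
    ("data/prior_auth_provider/tasks/pa_t001_t001_o001_p01_new_referral_provider",
     "claude-code", "anthropic/claude-opus-4-6"),
    ("data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer",
     "hermes", "openrouter/z-ai/glm-5.1"),
]


def build_command(dataset: str, agent: str, model: str) -> list:
    """Build the argv list for one single-task trial."""
    return [
        "uv", "run", "cb", "experiment", "run",
        "--dataset", dataset,
        "--agent", agent,
        "--model", model,
    ]


if __name__ == "__main__":
    for dataset, agent, model in RUNS:
        # Print instead of executing, so each command can be reviewed first.
        print(shlex.join(build_command(dataset, agent, model)))
```

To execute rather than print, each argv list could be passed to subprocess.run(cmd, check=True), one trial at a time.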
Next steps
- If the trial produced a readable scorecard, you're ready to submit your agent to the leaderboard.
- Want to plug in a custom harness or model endpoint? See Bring your own agent.
- For the full CLI flag reference, see docs/cli.md in the repo.