Run experiments
Single trials, full submission runs, and the paper-table matrices — all driven by one CLI.
Single trial
uv run cb experiment run \
--dataset data/<domain>/tasks/<task-dir> \
--agent <agent-id> --model <model-id>
Output lands under logs/experiments/.../trial_*/ with result.json and verifier/scorecard.json. See Quickstart for a worked example.
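To inspect finished trials programmatically, a few lines of Python are enough. This is a minimal sketch, not part of the cb CLI: the helper name load_trials is hypothetical, and it assumes only the layout described above (a trial_* directory holding result.json and verifier/scorecard.json). It makes no assumption about the keys inside either file.

```python
import json
from pathlib import Path


def load_trials(root):
    """Collect the two JSON artifacts for every trial_* directory under root.

    Hypothetical helper for illustration. Missing files are reported as None
    so that in-flight or failed trials do not raise.
    """
    trials = []
    for trial_dir in sorted(Path(root).rglob("trial_*")):
        if not trial_dir.is_dir():
            continue
        result_path = trial_dir / "result.json"
        scorecard_path = trial_dir / "verifier" / "scorecard.json"
        trials.append(
            {
                "trial_dir": str(trial_dir),
                "result": json.loads(result_path.read_text())
                if result_path.is_file()
                else None,
                "scorecard": json.loads(scorecard_path.read_text())
                if scorecard_path.is_file()
                else None,
            }
        )
    return trials
```

Calling load_trials("logs/experiments") returns one dictionary per trial directory, which can then be printed or filtered as needed.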
Submission lifecycle
One YAML drives all four steps:
uv run cb submission validate -f configs/submissions/<id>.yaml
uv run cb submission run -f configs/submissions/<id>.yaml
uv run cb submission status -f configs/submissions/<id>.yaml
uv run cb submission prepare -f configs/submissions/<id>.yaml
- validate: schema + preflight (dataset pin, Modal/Docker readiness, agent name resolution).
- run: runs all 3 domains. Default: one trial per task (pass@1).
- status: progress check; safe to run while run is in flight.
- prepare: curates the leaderboard-ready packet at logs/submissions/<id>/packet/YYYY-MM-DD-<id>/.
For the leaderboard PR step, see the Leaderboard guide.
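The four lifecycle commands can also be chained from a script. The sketch below is illustrative only: submission_commands and run_lifecycle are hypothetical names, and it assumes that each cb subcommand exits non-zero on failure, which the text above does not state.

```python
import subprocess

# Lifecycle steps in the order listed above.
STEPS = ("validate", "run", "status", "prepare")


def submission_commands(config_path, steps=STEPS):
    """Build the argv list for each lifecycle step, in order."""
    return [
        ["uv", "run", "cb", "submission", step, "-f", config_path]
        for step in steps
    ]


def run_lifecycle(config_path, steps=STEPS, runner=subprocess.run):
    """Run each step in turn.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so later steps are skipped after a failure (assuming cb
    signals failure through its exit code).
    """
    for argv in submission_commands(config_path, steps):
        print("+", " ".join(argv))
        runner(argv, check=True)
```

For example, run_lifecycle("configs/submissions/<id>.yaml") with a real submission id substituted runs validate, run, status, and prepare in sequence.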
Reproduce paper tables
Each table maps to a matrix YAML and a single command:
| Paper | Config | Command |
|---|---|---|
| Table 1 (Main matrix) | table1_main_matrix.yaml | ./scripts/run_table.sh table1 |
| Table 2 (E2E arena) | table2_e2e_arena.yaml | ./scripts/run_table.sh table2 |
| Table 3 (Marathon) | table3_marathon.yaml | ./scripts/run_table.sh table3 |
| Fig. 4 (Skill ablation) | table4_skill_ablation.yaml | ./scripts/run_table.sh table4 |
| Table 5 (MCP vs CLI) | table5_mcp_vs_cli.yaml | ./scripts/run_table.sh table5 |
After all slices finish, aggregate:
uv run python scripts/aggregate.py \
--trials-dir logs/experiments/table1_main_matrix \
--prices configs/prices.yaml \
--out-csv logs/table1.csv
CSV columns: agent, model, n_trials, n_tasks, pass_at_1, pass_at_1_lo, pass_at_1_hi, pass_at_3, ..., pass_pow_3, pass_pow_3_hi, mean_cost_usd, mean_walltime_s, with Wilson 95% CIs.
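For readers checking the numbers by hand, here is a sketch of the statistics the column names suggest. The Wilson score interval is the standard formula. The pass_at_k and pass_pow_k functions use the usual combinatorial estimators (at least one of k sampled trials passes, and all k pass, respectively); whether scripts/aggregate.py computes them this way is an assumption, and all function names here are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z ~ 1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def pass_at_k(n, c, k):
    """Chance that at least one of k trials drawn from n (c of them passing) passes."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pass_pow_k(n, c, k):
    """Chance that all k trials drawn from n (c of them passing) pass."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return math.comb(c, k) / math.comb(n, k)
```

With 5 passes out of 10 trials, wilson_interval(5, 10) gives roughly (0.237, 0.763), noticeably wider than the point estimate of 0.5 might suggest at that sample size.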
Modal vs Docker
Local Docker is fine for a handful of trials. Matrix reproduction on a single host takes days, so use Modal for parallel execution:
./scripts/run_table.sh table1 --modal
cb experiment run -e modal defaults to profile actava; pass --modal-profile '' to skip Modal preflight, or set MODAL_PROFILE=<name> to use a named profile.
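The profile rules above can be read as a small resolution function. The sketch below is one plausible interpretation, not the CLI's actual code: resolve_modal_profile is a hypothetical name, and the precedence (explicit flag first, then MODAL_PROFILE, then the default) is an assumption.

```python
import os

DEFAULT_MODAL_PROFILE = "actava"  # default profile named in the text above


def resolve_modal_profile(cli_value=None, env=None):
    """Return the Modal profile to use, or None to skip Modal preflight.

    Assumed precedence: --modal-profile flag, then the MODAL_PROFILE
    environment variable, then the default profile.
    """
    env = os.environ if env is None else env
    if cli_value is not None:
        # An explicit empty string (--modal-profile '') means "skip preflight".
        return cli_value or None
    return env.get("MODAL_PROFILE") or DEFAULT_MODAL_PROFILE
```

Passing the environment as a parameter keeps the function easy to test without mutating os.environ.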
Test markers
The default uv run pytest skips judge-hitting and slow suites. Opt in:
uv run pytest -m requires_anthropic_key # hits the live judge
uv run pytest -m slow # includes docker-build smoke
CHI_BENCH_SKIP_DOCKER_BUILD=1 uv run pytest tests/smoke -v -m slow
Useful environment variables
- ANTHROPIC_API_KEY: always required (judge).
- CHI_BENCH_JUDGE_MODEL: override the pinned judge model (deviates from paper protocol).
- CHI_BENCH_JUDGE_NUM_VOTES: > 1 enables majority-voted judging.
- CHI_BENCH_PAYER_MODE: set to agent for the local server (auto-set by cb serve).
- MODAL_PROFILE: named Modal profile for parallel execution.
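To illustrate how the judge-related variables interact, here is a hedged sketch of reading them and of majority voting over repeated verdicts. The function names are hypothetical, the default of one vote when CHI_BENCH_JUDGE_NUM_VOTES is unset is an assumption, and so is the tie-breaking rule; the project's own implementation may differ.

```python
import os
from collections import Counter


def judge_settings(env=None):
    """Read judge-related settings from the environment (illustrative only)."""
    env = os.environ if env is None else env
    if not env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError("ANTHROPIC_API_KEY is required (judge)")
    # Assumption: an unset variable means a single judge call.
    num_votes = int(env.get("CHI_BENCH_JUDGE_NUM_VOTES", "1"))
    if num_votes < 1:
        raise ValueError("CHI_BENCH_JUDGE_NUM_VOTES must be >= 1")
    return {
        # None means "use the pinned judge model".
        "judge_model_override": env.get("CHI_BENCH_JUDGE_MODEL"),
        "num_votes": num_votes,
        "majority_voting": num_votes > 1,
    }


def majority_vote(votes):
    """Return the most common verdict.

    Ties go to the verdict seen first (Counter.most_common keeps insertion
    order for equal counts); this tie rule is an assumption. An odd number
    of votes avoids ties for binary verdicts.
    """
    if not votes:
        raise ValueError("no votes to aggregate")
    return Counter(votes).most_common(1)[0][0]
```

With three votes of True, False, True, majority_vote returns True.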