Read the scorecard
A trial's verifier output explains why it passed or failed. Three files under verifier/ tell you everything: reward.json (one number), scorecard.json (per-check breakdown), and exported_state.json (the world snapshot the verifier scored against).
Where to look first
Every trial under logs/experiments/.../trial_*/ (or logs/submissions/.../<task>__<hash>/) writes the same set of verifier files:
| File | Purpose |
|---|---|
| verifier/reward.json | One line: {"reward": 0.0} or {"reward": 1.0}. The single binary number Harbor uses for pass@1. |
| verifier/scorecard.json | Per-check breakdown: read this to understand why a trial passed or failed. |
| verifier/exported_state.json | Post-run world snapshot the verifier scored against (fixtures, agent-authored artifacts, audit log, conversation transcripts). |
| verifier/verifier_context.json | Target case/patient/order IDs and the task actor; useful when joining with raw fixtures. |
| result.json (sibling of verifier/) | Harbor's view: agent metadata, verifier_result.rewards.reward, token counts, cost, timing per phase. |
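A minimal sketch of reading these files for one trial, assuming the layout in the table above. `load_trial` is a hypothetical helper (not part of chi_bench), and the nested key path into result.json is an assumption inferred from the dotted name verifier_result.rewards.reward.

```python
import json
from pathlib import Path


def load_trial(trial_dir):
    """Load the verifier outputs for one trial directory (hypothetical helper)."""
    trial = Path(trial_dir)
    verifier = trial / "verifier"
    reward = json.loads((verifier / "reward.json").read_text())["reward"]
    scorecard = json.loads((verifier / "scorecard.json").read_text())

    # result.json is a sibling of verifier/. The nested key path below is an
    # assumption based on the dotted name verifier_result.rewards.reward.
    harbor_reward = None
    result_path = trial / "result.json"
    if result_path.exists():
        result = json.loads(result_path.read_text())
        harbor_reward = (
            result.get("verifier_result", {}).get("rewards", {}).get("reward")
        )

    return {"reward": reward, "scorecard": scorecard, "harbor_reward": harbor_reward}
```

If `harbor_reward` is present and differs from `reward`, that is worth investigating before reading anything else in the trial.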
scorecard.json shape
{
  "binary_reward": 0.0,
  "fractional_reward": 0.91,
  "passed_checks": 42,
  "total_checks": 46,
  "check_scores": { "<check_name>": 1.0 | 0.0 | null, ... },
  "checks": { "<check_name>": true | false | "not_applicable", ... },
  "failed_checks": ["md.signed_off", "judge.md_review:mp1126_sca_05", ...],
  "not_applicable_checks": ["md.denial_rationale_present"],
  "stages": {
    "md_review": {
      "passed": false,
      "checks": { "md.decision_exists": true, "md.signed_off": false, ... },
      "passed_count": 3,
      "total_count": 4,
      "not_applicable_count": 1,
      "details": { "criteria": [...] }
    },
    "outcome": { ... },
    "cross_stage": { ... }
  }
}
The two reward axes
- binary_reward is the strict pass/fail axis: 1.0 only when every non-N/A check passes; 0.0 if any required check fails. This is what the leaderboard publishes as pass@1.
- fractional_reward = passed_checks / total_checks is partial credit for diagnostics. A 0.0 / 0.91 split means "near miss": a small number of checks tanked an otherwise-strong run.
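The arithmetic behind the two axes can be sketched as follows. `reward_axes` is a hypothetical helper for illustration, not the verifier's own code; it ignores the degraded-judge case, and it refuses a zero check count because that case is not specified here.

```python
def reward_axes(passed_checks, total_checks):
    """Recompute both reward axes from the two counters (illustrative only)."""
    if total_checks <= 0:
        # The behaviour with no applicable checks is not specified here,
        # so this sketch refuses to guess.
        raise ValueError("total_checks must be positive")
    fractional = passed_checks / total_checks
    # Strict axis: every applicable check must pass.
    binary = 1.0 if passed_checks == total_checks else 0.0
    return binary, fractional


binary, fractional = reward_axes(42, 46)
print(binary, round(fractional, 2))  # 0.0 0.91
```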
Stages
Checks are grouped by workflow stage. When you see binary_reward: 0, scan stages first — the failed stage is where to start debugging. Each stage records its own passed, passed_count, total_count, not_applicable_count, and free-form details.
Stage names you'll encounter:
| Stage | Domains | What it gates |
|---|---|---|
| intake | UM | Auth request received and routed; member/provider/code resolution. |
| nurse_review | UM | Nurse triage decision and rationale. |
| md_review | UM | Medical-director decision, sign-off, denial rationale (when applicable). |
| p2p | UM, PA-Provider | Peer-to-peer call: scheduling, transcript, post-call decision update. |
| appeal | UM, PA-Provider | Appeal acceptance, evidence handling, final disposition. |
| outcome | UM, PA-Provider | Terminal status reached, determination/letter exists, decision matches reviewer recommendation. |
| cross_stage | UM, PA-Provider | No forbidden world-state mutations; only forward state transitions. |
| provider_pre_submission, provider_request_package, provider_submission, new_referral | PA-Provider | PA packet assembly and submission gates. |
| cm_chart_review, cm_assessment, cm_care_plan, cm_cross_stage | Care Management | Chart review, assessment completeness, care plan structure, cross-stage invariants. |
| e2e_consistency | Provider–Payer arena | Provider PA and payer determination tell the same story end-to-end. |
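Scanning stages for the failing one can be sketched like this, assuming the `stages` layout shown in the scorecard.json shape. `failed_stages` is a hypothetical helper, and the example dictionary is made up for illustration.

```python
def failed_stages(scorecard):
    """Return (stage_name, failing_check_names) for each stage that did not pass."""
    report = []
    for name, stage in scorecard.get("stages", {}).items():
        if stage.get("passed") is False:
            # Only entries that are exactly False count as failures;
            # True and "not_applicable" are skipped.
            failing = [c for c, v in stage.get("checks", {}).items() if v is False]
            report.append((name, failing))
    return report


example = {
    "stages": {
        "md_review": {
            "passed": False,
            "checks": {"md.decision_exists": True, "md.signed_off": False},
        },
        "outcome": {"passed": True, "checks": {}},
    }
}
print(failed_stages(example))  # [('md_review', ['md.signed_off'])]
```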
Check-name namespaces
Every check name has a prefix telling you who graded it and on what.
Deterministic checks
| Prefix | Examples | Source |
|---|---|---|
| md.* | md.decision_exists, md.signed_off, md.rationale_present, md.denial_rationale_present, md.audit | stages/md_review.py |
| outcome.* | outcome.target_status, outcome.letter_types, outcome.determination_exists, outcome.review_decision_matches, outcome.terminal_transition_exists, outcome.clean_determination | stages/outcome.py |
| cross.* | cross.forbidden_mutations, cross.forward_transitions | stages/cross_stage.py |
| cm.* | cm.assessment.completed, cm.care_plan.problem_count, cm.care_plan.escalation_conditions_present, cm.cross_stage.target_status | stages/cm_v4.py |
Other deterministic prefixes: intake.*, nurse.*, p2p.*, appeal.*, provider_pre_submission.*, provider_request_package.*, provider_submission.*, new_referral.*, e2e_consistency.*.
Rubrics-based LLM judge
Judge checks are named judge.<stage>:<rubric_id>. For example, judge.md_review:final_decision is the final_decision rubric item evaluated as part of the md_review stage.
- Run by WorkspaceJudge: a Claude model (default claude-opus-4-7, override with CHI_BENCH_JUDGE_MODEL) reading the task's fixtures/judge/rubrics.json against the agent's workspace.
- Set CHI_BENCH_JUDGE_NUM_VOTES > 1 for majority-voted judging.
- Degraded judge runs (timeout, crash, malformed verdicts) collapse binary_reward to 0 even on gate-pass runs; this is visible as judge_unavailable_reason on the scorecard. Rubrics missing a parsed verdict count as not passing.
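The naming convention can be unpacked with a small parser. This is a sketch under the naming rules described above; `parse_check_name` and its return keys are hypothetical. For deterministic checks, `group` is the name prefix (for example md), which is not necessarily the stage name (md_review).

```python
def parse_check_name(name):
    """Split a check name into grader, group, and item (illustrative only)."""
    if name.startswith("judge."):
        # judge.<stage>:<rubric_id>
        stage, _, rubric_id = name[len("judge."):].partition(":")
        return {"grader": "judge", "group": stage, "item": rubric_id}
    # Deterministic checks: everything before the first dot is the prefix.
    prefix, _, rest = name.partition(".")
    return {"grader": "deterministic", "group": prefix, "item": rest}


print(parse_check_name("judge.md_review:final_decision"))
# {'grader': 'judge', 'group': 'md_review', 'item': 'final_decision'}
print(parse_check_name("cm.care_plan.problem_count"))
# {'grader': 'deterministic', 'group': 'cm', 'item': 'care_plan.problem_count'}
```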
Three check states
Inside checks, every entry is one of:
- true: pass. Counts toward both passed_checks and total_checks.
- false: fail. Counts toward total_checks only; appears in failed_checks.
- "not_applicable": rubric/check doesn't apply (e.g. md.denial_rationale_present for an approve decision). Excluded from total_checks; appears in not_applicable_checks.
Mirror in check_scores: 1.0 for pass, 0.0 for fail, null for N/A.
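The three states and their effect on the counters can be sketched as follows. `summarize_checks` is a hypothetical helper that derives the summary fields from a `checks` mapping; JSON null appears as Python `None`.

```python
def summarize_checks(checks):
    """Derive counters, lists, and check_scores from a checks mapping."""
    passed, failed, not_applicable, check_scores = [], [], [], {}
    for name, state in checks.items():
        if state is True:
            passed.append(name)
            check_scores[name] = 1.0
        elif state is False:
            failed.append(name)
            check_scores[name] = 0.0
        else:
            # "not_applicable": excluded from total_checks, mirrored as null.
            not_applicable.append(name)
            check_scores[name] = None
    return {
        "passed_checks": len(passed),
        "total_checks": len(passed) + len(failed),
        "failed_checks": failed,
        "not_applicable_checks": not_applicable,
        "check_scores": check_scores,
    }


summary = summarize_checks({
    "md.decision_exists": True,
    "md.signed_off": False,
    "md.denial_rationale_present": "not_applicable",
})
print(summary["passed_checks"], summary["total_checks"])  # 1 2
```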
Care Management scorecard fields
CM trials share the same pass criterion as PA/UM — binary=1.0 requires both the deterministic gate and all LLM judge rubrics to pass. The scorecard schema differs in two ways (compute_cm_reward in src/chi_bench/verifier/stages/cm_rubric.py):
- gate_pass: bool. Did the deterministic gate (chart review exists, assessment completed, care plan finalized, cross-stage invariants) pass?
- rubric_yes_count / rubric_total: counts from the LLM judge rubric. fractional_reward = rubric_yes_count / rubric_total (rubric-only, not passed_checks / total_checks).
If the gate fails, the LLM judge is skipped (no point spending a judge session on a clearly-failed run); the scorecard still records which gate check broke. Reward semantics:
| gate_pass | rubric outcome | binary | fractional |
|---|---|---|---|
| false | (skipped — judge isn't run) | 0.0 | 0.0 |
| true | rubric_total = 0 (gate-only task) | 1.0 | 1.0 |
| true | every rubric pass | 1.0 | 1.0 |
| true | N of M rubrics pass | 0.0 | N/M |
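The table above can be written as a small function. This is a sketch that mirrors the four rows only; `cm_reward_sketch` is a hypothetical name and is not the real compute_cm_reward, which may handle cases the table does not show.

```python
def cm_reward_sketch(gate_pass, rubric_yes_count=0, rubric_total=0):
    """Return (binary, fractional) following the CM reward table (illustrative)."""
    if not gate_pass:
        # Gate failed: the judge is skipped, so both axes are zero.
        return 0.0, 0.0
    if rubric_total == 0:
        # Gate-only task: passing the gate is the whole pass criterion.
        return 1.0, 1.0
    fractional = rubric_yes_count / rubric_total
    binary = 1.0 if rubric_yes_count == rubric_total else 0.0
    return binary, fractional


print(cm_reward_sketch(True, 3, 4))  # (0.0, 0.75)
```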
Worked example
From a real UM medical-director-review trial:
{
  "binary_reward": 0.0,
  "fractional_reward": 0.91,
  "passed_checks": 42,
  "total_checks": 46,
  "failed_checks": [
    "md.signed_off",
    "judge.md_review:decision_rationale",
    "judge.md_review:preop_weight_history",
    "judge.md_review:preop_psychosocial_eval"
  ],
  "stages": {
    "md_review": { "passed": false, "passed_count": 3, "total_count": 4 },
    "outcome": { "passed": true, "passed_count": 13, "total_count": 13 },
    "cross_stage": { "passed": true, "passed_count": 2, "total_count": 2 }
  }
}
Reading this top-down: 91% of checks passed, but md_review failed because md.signed_off was false (the agent never marked the determination as MD-signed) and three judge rubrics flipped to false. outcome and cross_stage were clean, so the agent reached the correct terminal state, but the artifact wasn't signed off and the rationale missed three rubric items. Binary = 0; this trial does not count for pass@1 even though everything else looks right.
Where the rules live
- Stage verifiers: src/chi_bench/verifier/stages/ (one file per stage).
- Reward computation: build_scorecard() and verify_task() in src/chi_bench/verifier/task_runtime.py; compute_cm_reward() in stages/cm_rubric.py.
- LLM judge: verifier/judge/workspace_judge.py + verifier/judge/cm_adapter.py.
- Per-task expectations (the ground truth the verifier scores against): data/<domain>/tasks/<task-dir>/expectations.json (hidden from the agent).
Re-judging an old trial
If you upgrade the judge model or fix a rubric, re-score existing trials without re-running the agent:
uv run cb verifier rejudge --trials-dir logs/experiments/<dir>
This re-runs the LLM judge against the saved exported_state.json, without re-running the agent, and rewrites scorecard.json + reward.json in place. Deterministic checks are recomputed too. Source: src/chi_bench/verifier/rejudge.py.
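Because the rejudge rewrites scorecard.json and reward.json in place, it can be useful to copy the old files aside first so the two verdicts can be compared. The following is a minimal sketch; `backup_verifier_outputs` and the separate backup directory are hypothetical conveniences, not part of the cb CLI.

```python
import shutil
from pathlib import Path


def backup_verifier_outputs(trials_dir, backup_dir):
    """Copy every verifier/scorecard.json and verifier/reward.json to backup_dir."""
    trials_root = Path(trials_dir)
    backup_root = Path(backup_dir)
    copied = []
    for name in ("scorecard.json", "reward.json"):
        for path in sorted(trials_root.rglob(name)):
            if path.parent.name != "verifier":
                continue  # only touch files that sit directly under verifier/
            # Mirror the path relative to the trials directory.
            target = backup_root / path.relative_to(trials_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

Keep the backup directory outside the trials directory so the copies are not picked up as trials themselves.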
See also
- Quickstart — runs one trial; you read its scorecard.
- Architecture — how the verifier fits into the trial container.
- Run experiments — single-trial CLI and submission lifecycle.
- Leaderboard repo — what CI re-validates against the packet's scorecards.