Read the scorecard

A trial's verifier output explains why it passed or failed. Three files under verifier/ cover most of what you need: reward.json (one number), scorecard.json (per-check breakdown), and exported_state.json (the world snapshot the verifier scored against).

Where to look first

Every trial under logs/experiments/.../trial_*/ (or logs/submissions/.../<task>__<hash>/) writes the same set of verifier files:

File	Purpose
`verifier/reward.json`	One line: `{"reward": 0.0 | 1.0}`. The single binary number Harbor uses for pass@1.
`verifier/scorecard.json`	Per-check breakdown — read this to understand why a trial passed or failed.
`verifier/exported_state.json`	Post-run world snapshot the verifier scored against (fixtures, agent-authored artifacts, audit log, conversation transcripts).
`verifier/verifier_context.json`	Target case/patient/order IDs and the task actor — useful when joining with raw fixtures.
`result.json` (sibling of `verifier/`)	Harbor's view: agent metadata, `verifier_result.rewards.reward`, token counts, cost, timing per phase.
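
As a minimal sketch, assuming the file layout in the table above, the files for one trial can be gathered into a single dictionary. The helper names `load_trial` and `_read_json` are hypothetical and are not part of Harbor or chi_bench; a file that is absent is returned as None rather than raising.

```python
import json
from pathlib import Path
from typing import Any, Optional


def _read_json(path: Path) -> Optional[Any]:
    """Return the parsed JSON document, or None if the file does not exist."""
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def load_trial(trial_dir: str) -> dict:
    """Collect the verifier files for one trial directory (hypothetical helper)."""
    root = Path(trial_dir)
    verifier = root / "verifier"
    return {
        "reward": _read_json(verifier / "reward.json"),
        "scorecard": _read_json(verifier / "scorecard.json"),
        "exported_state": _read_json(verifier / "exported_state.json"),
        "verifier_context": _read_json(verifier / "verifier_context.json"),
        # result.json sits next to verifier/, not inside it.
        "result": _read_json(root / "result.json"),
    }
```

Calling `load_trial(path)["scorecard"]` then gives the dictionary described in the next section.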

scorecard.json shape

{
  "binary_reward":     0.0,
  "fractional_reward": 0.91,
  "passed_checks":     42,
  "total_checks":      46,
  "check_scores":  { "<check_name>": 1.0 | 0.0 | null, ... },
  "checks":        { "<check_name>": true | false | "not_applicable", ... },
  "failed_checks": ["md.signed_off", "judge.md_review:mp1126_sca_05", ...],
  "not_applicable_checks": ["md.denial_rationale_present"],
  "stages": {
    "md_review": {
      "passed": false,
      "checks": { "md.decision_exists": true, "md.signed_off": false, ... },
      "passed_count": 3,
      "total_count":  4,
      "not_applicable_count": 1,
      "details": { "criteria": [...] }
    },
    "outcome":     { ... },
    "cross_stage": { ... }
  }
}

The two reward axes

binary_reward is the strict pass/fail axis. 1.0 only when every non-N/A check passes; 0.0 if any required check fails. This is what the leaderboard publishes as pass@1.
fractional_reward = passed_checks / total_checks is partial credit for diagnostics. A 0.0 / 0.91 split means "near miss": a few failed checks sank an otherwise strong run.
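
The two axes can be sketched from the two counts alone. This is an illustration of the definitions above, not the code in build_scorecard(); the function name `reward_axes` is hypothetical, the zero-check case is an assumption, and the sketch ignores degraded judge runs, which force the binary axis to 0.

```python
def reward_axes(passed_checks: int, total_checks: int) -> tuple:
    """Return (binary_reward, fractional_reward) from the scorecard counts.

    total_checks already excludes not-applicable checks.
    """
    # Assumption: with no applicable checks, both axes are reported as 0.0.
    if total_checks == 0:
        return 0.0, 0.0
    binary = 1.0 if passed_checks == total_checks else 0.0
    fractional = passed_checks / total_checks
    return binary, fractional


binary, fractional = reward_axes(42, 46)
print(binary, round(fractional, 2))  # 0.0 0.91
```

Rounding to two decimals here is only for display; whether scorecard.json stores a rounded value is not stated above.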

Stages

Checks are grouped by workflow stage. When you see binary_reward: 0, scan stages first — the failed stage is where to start debugging. Each stage records its own passed, passed_count, total_count, not_applicable_count, and free-form details.
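
A small sketch of that first scan, assuming a scorecard dictionary shaped like the one shown earlier. The function name `failed_stages` is hypothetical; it only reads the `stages`, `passed`, and `checks` keys.

```python
def failed_stages(scorecard: dict) -> dict:
    """Map each failed stage name to the sorted names of its failing checks."""
    result = {}
    for stage_name, stage in scorecard.get("stages", {}).items():
        if stage.get("passed") is False:
            checks = stage.get("checks", {})
            # Only explicit False counts as a failure; "not_applicable" is skipped.
            result[stage_name] = sorted(
                name for name, state in checks.items() if state is False
            )
    return result


example = {
    "stages": {
        "md_review": {
            "passed": False,
            "checks": {"md.decision_exists": True, "md.signed_off": False},
        },
        "outcome": {"passed": True, "checks": {"outcome.target_status": True}},
    }
}
print(failed_stages(example))  # {'md_review': ['md.signed_off']}
```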

Stage names you'll encounter:

Stage	Domains	What it gates
`intake`	UM	Auth request received and routed; member/provider/code resolution.
`nurse_review`	UM	Nurse triage decision and rationale.
`md_review`	UM	Medical-director decision, sign-off, denial rationale (when applicable).
`p2p`	UM, PA-Provider	Peer-to-peer call: scheduling, transcript, post-call decision update.
`appeal`	UM, PA-Provider	Appeal acceptance, evidence handling, final disposition.
`outcome`	UM, PA-Provider	Terminal status reached, determination/letter exists, decision matches reviewer recommendation.
`cross_stage`	UM, PA-Provider	No forbidden world-state mutations; only forward state transitions.
`provider_pre_submission`, `provider_request_package`, `provider_submission`, `new_referral`	PA-Provider	PA packet assembly and submission gates.
`cm_chart_review`, `cm_assessment`, `cm_care_plan`, `cm_cross_stage`	Care Management	Chart review, assessment completeness, care plan structure, cross-stage invariants.
`e2e_consistency`	Provider–Payer arena	Provider PA and payer determination tell the same story end-to-end.

Check-name namespaces

Every check name has a prefix telling you who graded it and on what.

Deterministic checks

Prefix	Examples	Source
`md.*`	`md.decision_exists`, `md.signed_off`, `md.rationale_present`, `md.denial_rationale_present`, `md.audit`	`stages/md_review.py`
`outcome.*`	`outcome.target_status`, `outcome.letter_types`, `outcome.determination_exists`, `outcome.review_decision_matches`, `outcome.terminal_transition_exists`, `outcome.clean_determination`	`stages/outcome.py`
`cross.*`	`cross.forbidden_mutations`, `cross.forward_transitions`	`stages/cross_stage.py`
`cm.*`	`cm.assessment.completed`, `cm.care_plan.problem_count`, `cm.care_plan.escalation_conditions_present`, `cm.cross_stage.target_status`	`stages/cm_v4.py`

Other deterministic prefixes: intake.*, nurse.*, p2p.*, appeal.*, provider_pre_submission.*, provider_request_package.*, provider_submission.*, new_referral.*, e2e_consistency.*.

Rubrics-based LLM judge

Named judge.<stage>:<rubric_id>. For example, judge.md_review:final_decision is the final_decision rubric item evaluated as part of the md_review stage.

Run by WorkspaceJudge — a Claude model (default claude-opus-4-7, override with CHI_BENCH_JUDGE_MODEL) reading the task's fixtures/judge/rubrics.json against the agent's workspace.
Set CHI_BENCH_JUDGE_NUM_VOTES > 1 for majority-voted judging.
Degraded judge runs (timeout, crash, malformed verdicts) collapse binary_reward to 0 even on gate-pass runs — visible as judge_unavailable_reason on the scorecard. Rubrics missing a parsed verdict count as not passing.
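
The naming convention is regular enough to split mechanically, which helps when grouping failed_checks by grader. The sketch below assumes only the two patterns described above (judge.<stage>:<rubric_id> and a dotted deterministic prefix); `CheckName` and `parse_check_name` are hypothetical names, not chi_bench APIs.

```python
from typing import NamedTuple


class CheckName(NamedTuple):
    grader: str  # "judge" or "deterministic"
    scope: str   # stage name for judge checks, prefix for deterministic checks
    item: str    # rubric_id, or the remainder of the check name


def parse_check_name(name: str) -> CheckName:
    """Split a check name into grader, scope, and item."""
    if name.startswith("judge."):
        stage, _, rubric_id = name[len("judge."):].partition(":")
        return CheckName("judge", stage, rubric_id)
    prefix, _, rest = name.partition(".")
    return CheckName("deterministic", prefix, rest)


print(parse_check_name("judge.md_review:final_decision"))
print(parse_check_name("md.signed_off"))
```

Note that for deterministic checks the scope is the prefix (for example `md`), which is not always spelled the same as the stage name (`md_review`).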

Three check states

Inside checks, every entry is one of:

true — pass. Counts toward both passed_checks and total_checks.
false — fail. Counts toward total_checks only; appears in failed_checks.
"not_applicable" — rubric/check doesn't apply (e.g. md.denial_rationale_present for an approve decision). Excluded from total_checks; appears in not_applicable_checks.

Mirror in check_scores: 1.0 for pass, 0.0 for fail, null for N/A.
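
The counting rules above can be sketched as a single pass over `checks`. This is an illustration of how the fields relate, assuming the three states listed; `tally_checks` is a hypothetical name and not the verifier's own function.

```python
def tally_checks(checks: dict) -> dict:
    """Derive the count and list fields of a scorecard from its `checks` map."""
    passed, failed, not_applicable = [], [], []
    check_scores = {}
    for name, state in checks.items():
        if state == "not_applicable":
            not_applicable.append(name)
            check_scores[name] = None  # serialized as null in JSON
        elif state is True:
            passed.append(name)
            check_scores[name] = 1.0
        else:
            failed.append(name)
            check_scores[name] = 0.0
    return {
        "passed_checks": len(passed),
        # N/A checks are excluded from the denominator.
        "total_checks": len(passed) + len(failed),
        "failed_checks": failed,
        "not_applicable_checks": not_applicable,
        "check_scores": check_scores,
    }


summary = tally_checks({
    "md.decision_exists": True,
    "md.signed_off": False,
    "md.denial_rationale_present": "not_applicable",
})
print(summary["passed_checks"], summary["total_checks"])  # 1 2
```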

Care Management scorecard fields

CM trials share the same pass criterion as PA/UM — binary=1.0 requires both the deterministic gate and all LLM judge rubrics to pass. The scorecard schema differs in two ways (compute_cm_reward in src/chi_bench/verifier/stages/cm_rubric.py):

gate_pass — bool. Did the deterministic gate (chart review exists, assessment completed, care plan finalized, cross-stage invariants) pass?
rubric_yes_count / rubric_total — counts from the LLM judge rubric. fractional_reward = rubric_yes_count / rubric_total (rubric-only, not passed_checks / total_checks).

If the gate fails, the LLM judge is skipped (no point spending a judge session on a clearly-failed run); the scorecard still records which gate check broke. Reward semantics:

gate_pass	rubric outcome	binary	fractional
false	(skipped — judge isn't run)	0.0	0.0
true	rubric_total = 0 (gate-only task)	1.0	1.0
true	every rubric pass	1.0	1.0
true	N of M rubrics pass	0.0	N/M
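
The table translates directly into a small function. This is a sketch of the semantics in the table only; it is not the body of compute_cm_reward, and the name `cm_reward` and its signature are hypothetical.

```python
def cm_reward(gate_pass: bool, rubric_yes_count: int = 0, rubric_total: int = 0) -> tuple:
    """Return (binary, fractional) following the Care Management reward table."""
    if not gate_pass:
        # The judge is skipped, so there are no rubric counts to use.
        return 0.0, 0.0
    if rubric_total == 0:
        # Gate-only task: passing the gate is the whole requirement.
        return 1.0, 1.0
    fractional = rubric_yes_count / rubric_total
    binary = 1.0 if rubric_yes_count == rubric_total else 0.0
    return binary, fractional


print(cm_reward(False))       # (0.0, 0.0)
print(cm_reward(True))        # (1.0, 1.0)
print(cm_reward(True, 4, 4))  # (1.0, 1.0)
print(cm_reward(True, 3, 4))  # (0.0, 0.75)
```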

Worked example

From a real UM medical-director-review trial:

{
  "binary_reward":     0.0,
  "fractional_reward": 0.91,
  "passed_checks":     42,
  "total_checks":      46,
  "failed_checks": [
    "md.signed_off",
    "judge.md_review:decision_rationale",
    "judge.md_review:preop_weight_history",
    "judge.md_review:preop_psychosocial_eval"
  ],
  "stages": {
    "md_review":   { "passed": false, "passed_count": 3,  "total_count": 4  },
    "outcome":     { "passed": true,  "passed_count": 13, "total_count": 13 },
    "cross_stage": { "passed": true,  "passed_count": 2,  "total_count": 2  }
  }
}

Reading this top-down: 91% of checks passed, but md_review failed because md.signed_off was false (the agent never marked the determination as MD-signed) and three judge rubrics flipped to false. outcome and cross_stage were clean, so the agent reached the correct terminal state, but the artifact wasn't signed off and the rationale missed three rubric items. Binary = 0; this trial does not count for pass@1 even though everything else looks right.

Where the rules live

Stage verifiers: src/chi_bench/verifier/stages/ — one file per stage.
Reward computation: build_scorecard() and verify_task() in src/chi_bench/verifier/task_runtime.py; compute_cm_reward() in stages/cm_rubric.py.
LLM judge: verifier/judge/workspace_judge.py + verifier/judge/cm_adapter.py.
Per-task expectations (the ground truth the verifier scores against): data/<domain>/tasks/<task-dir>/expectations.json — hidden from the agent.

Re-judging an old trial

If you upgrade the judge model or fix a rubric, re-score existing trials without re-running the agent:

uv run cb verifier rejudge --trials-dir logs/experiments/<dir>

This re-runs the LLM judge against the saved exported_state.json, without re-running the agent, and rewrites scorecard.json + reward.json in place. Deterministic checks are recomputed too. Source: src/chi_bench/verifier/rejudge.py.
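
Because the rewrite happens in place, it can be useful to keep a copy of the old verdicts for comparison. The sketch below is one way to do that before invoking the command; `backup_verdicts` and the `.bak` suffix are hypothetical choices and are not part of the `cb` CLI.

```python
import shutil
from pathlib import Path


def backup_verdicts(trial_dir: str, suffix: str = ".bak") -> list:
    """Copy scorecard.json and reward.json beside themselves; return the new paths."""
    copied = []
    verifier = Path(trial_dir) / "verifier"
    for name in ("scorecard.json", "reward.json"):
        src = verifier / name
        if src.is_file():
            dst = src.with_name(src.name + suffix)
            shutil.copy2(src, dst)  # copy2 also preserves file timestamps
            copied.append(str(dst))
    return copied
```

After the rejudge, comparing `scorecard.json.bak` with the new `scorecard.json` shows which checks changed.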