Submit

Submit a run to the leaderboard.

chi-bench is open source. A submission spans two repos with one handoff: you produce a packet with cb submission prepare in your chi-bench fork, then open a PR adding that packet to actava-ai/leaderboard. CI validates the schema and trial integrity; a maintainer reviews and merges.

Producer (in your fork of actava-ai/chi-bench)
  1. 01
    Install and download the dataset
    Clone actava-ai/chi-bench, install with uv sync --extra dev, then fetch the gated dataset from Hugging Face:
    uv run huggingface-cli login
    
    REV=chi-bench-v1.0.0
    uv run huggingface-cli download actava/chi-bench \
      --repo-type dataset --revision "$REV" --local-dir data/
    echo "$REV" > data/.chi-bench-version
    The pin in data/.chi-bench-version is what submission preflight verifies against your config's dataset.version.
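To make the pin check concrete, here is a minimal Python sketch of comparing the on-disk pin with a configured version. The function names are hypothetical, and the exact string match is an assumption; how the real preflight compares the two values is not shown here.

```python
from pathlib import Path


def read_dataset_pin(data_dir: str = "data") -> str:
    """Return the revision recorded in <data_dir>/.chi-bench-version."""
    pin_file = Path(data_dir) / ".chi-bench-version"
    # `echo "$REV" > file` leaves a trailing newline, so strip whitespace.
    return pin_file.read_text(encoding="utf-8").strip()


def pin_matches(data_dir: str, config_version: str) -> bool:
    """True when the on-disk pin equals the version named in the config.

    Assumption: an exact string match; the real preflight may differ.
    """
    return read_dataset_pin(data_dir) == config_version.strip()
```

With the commands above, pin_matches("data", "chi-bench-v1.0.0") would be expected to return True when the config's dataset.version holds the same revision string.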
  2. 02
    Set API keys and build the Docker image
    Copy .env.example to .env and fill in keys for the providers you intend to run. ANTHROPIC_API_KEY is always required because the workspace judge is pinned to claude-opus-4-7. Then build the runtime image (~5 min, one-time):
    uv run cb docker build
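Before starting a long run it can help to confirm that the keys you need are actually set. The sketch below is a hypothetical helper, not part of cb. It assumes plain KEY=VALUE lines in .env and does not handle an `export` prefix or multi-line values.

```python
from pathlib import Path


def load_env_keys(env_path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blanks and # comments."""
    values: dict[str, str] = {}
    for raw in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env_path: str, required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty in the file."""
    env = load_env_keys(env_path)
    return [key for key in required if not env.get(key)]
```

For example, missing_keys(".env", ["ANTHROPIC_API_KEY"]) returns an empty list once the key has a non-empty value.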
  3. 03
    Write a submission YAML
    Copy configs/submission_example.yaml to configs/submissions/<your-id>.yaml and fill in id, team, contact, agent, and model. Optionally set notes and run.*.
    Bringing a custom agent harness or model endpoint? See docs/extending.md.
  4. 04
    Run trials and prepare the packet
    Four commands — preflight, run, monitor, package:
    # Schema + preflight: dataset pin, Modal/Docker, agent name
    uv run cb submission validate -f configs/submissions/<your-id>.yaml
    
    # Run all 3 domains. Default: one trial per task (pass@1).
    uv run cb submission run      -f configs/submissions/<your-id>.yaml
    
    # Check progress; safe to run while `submission run` is in flight.
    uv run cb submission status   -f configs/submissions/<your-id>.yaml
    
    # Curate the leaderboard-ready packet.
    uv run cb submission prepare  -f configs/submissions/<your-id>.yaml
    The packet lands at logs/submissions/<id>/packet/YYYY-MM-DD-<id>/ — typically <100 MB. Workspace artifacts and Harbor scratch are excluded by design.
    packet/2026-05-13-<id>/
    ├── submission.json          # manifest: agent, model, results, provenance
    ├── results.csv              # leaderboard rows (one per domain + overall)
    ├── sub.yaml                 # frozen copy of your config
    ├── provenance.json          # git SHA, image digest, timestamps
    ├── README.md                # auto-generated headline summary
    └── trials/<domain>/<trial_id>/
        ├── result.json
        ├── verifier/scorecard.json
        ├── verifier/reward.json
        └── agent/trajectory.jsonl.zst
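The per-trial layout in the tree above can be checked locally before you hand the packet over. This is a rough sketch with a hypothetical function name. It only tests that the four listed files exist in each trial directory; it does not validate their contents.

```python
from pathlib import Path

# Per-trial files, taken from the packet tree shown above.
TRIAL_FILES = (
    "result.json",
    "verifier/scorecard.json",
    "verifier/reward.json",
    "agent/trajectory.jsonl.zst",
)


def incomplete_trials(packet_dir: str) -> dict[str, list[str]]:
    """Map '<domain>/<trial_id>' to the listed files that trial lacks."""
    problems: dict[str, list[str]] = {}
    trials_root = Path(packet_dir) / "trials"
    if not trials_root.is_dir():
        return problems
    for domain in sorted(p for p in trials_root.iterdir() if p.is_dir()):
        for trial in sorted(p for p in domain.iterdir() if p.is_dir()):
            missing = [f for f in TRIAL_FILES if not (trial / f).is_file()]
            if missing:
                problems[f"{domain.name}/{trial.name}"] = missing
    return problems
```

An empty result means every trial directory found under trials/ contains all four files.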
Leaderboard (in a fork of actava-ai/leaderboard)
  1. 05
    Fork the leaderboard repo (one-time)
    gh auth login
    gh repo fork actava-ai/leaderboard --clone=false
    git clone https://github.com/<you>/leaderboard && cd leaderboard
    Subsequent submissions reuse this same fork.
  2. 06
    Open a PR adding the packet
    Three equivalent paths — pick one. They all run the same CI validator (.github/workflows/validate.yml).
    A. Helper script (recommended)
    The helper validates locally, copies the packet to benchmarks/<bench>/submissions/<dir>/, creates branch sub/<bench>/<dir>, pushes to your fork, and opens the PR.
    python scripts/submit.py /path/to/packet/2026-05-13-<slug>/

    Useful flags: --no-fork, --no-open-pr, --on-conflict abandon|replace|bump-date.

    B. Claude Code / Codex
    The leaderboard repo ships a submit-to-leaderboard skill (AGENTS.md points Codex at the same file). Open the repo and ask:
    /submit-to-leaderboard /abs/path/to/packet/2026-05-13-<slug>/
    The skill wraps the helper with preflight checks, partial-failure recovery, and pointers to producer-side fixes when the validator complains.
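    C. Manual
    From your fork clone, the underlying flow is a handful of commands. The copied packet is untracked, so it needs an explicit git add before the commit, and the cp source has no trailing slash so that BSD/macOS cp copies the directory itself rather than its contents:
    cp -r /path/to/packet/2026-05-13-<slug> benchmarks/<bench>/submissions/
    python scripts/validate.py benchmarks/<bench>/submissions/2026-05-13-<slug>/
    git checkout -b sub/<bench>/2026-05-13-<slug>
    git add benchmarks/<bench>/submissions/2026-05-13-<slug>/
    git commit -m "<bench>: <team> · <agent> · <model>"
    git push origin sub/<bench>/2026-05-13-<slug>
    gh pr create -R actava-ai/leaderboard --base main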
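All three paths use the same destination directory and branch name, both derived from the packet directory name. A small sketch of that mapping follows. The function is hypothetical and is not the code in scripts/submit.py, and the bench value is whatever replaces <bench> in your case.

```python
from pathlib import PurePosixPath


def submission_targets(packet_dir: str, bench: str) -> tuple[str, str]:
    """Return (destination directory inside the fork, branch name)."""
    # PurePosixPath ignores a trailing slash, so .name is the packet directory.
    dir_name = PurePosixPath(packet_dir).name
    dest = f"benchmarks/{bench}/submissions/{dir_name}/"
    branch = f"sub/{bench}/{dir_name}"
    return dest, branch
```

Calling it with a packet path ending in 2026-05-13-<slug>/ yields the benchmarks/<bench>/submissions/2026-05-13-<slug>/ and sub/<bench>/2026-05-13-<slug> values used in the commands above.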
  3. 07
    CI labels the PR; a maintainer reviews
    CI runs schema and trial-integrity checks and labels the PR valid-submission, invalid-submission, or needs-review, with a sticky comment summarizing each check. A maintainer spot-checks a trajectory or two and merges if everything looks plausible.
    What CI checks
    • Directory name = YYYY-MM-DD-<slug>
    • Required files present: submission.json, results.csv, sub.yaml, provenance.json, README.md, ≥1 trial result
    • No unexpected files (.zip, .bak, hidden files except .gitkeep)
    • submission.json matches the JSON Schema
    • results.csv rows match the manifest
    • Per-trial integrity: required files, valid zstd, valid JSONL per line
    • Trial counts match results.per_domain.<domain>.n_trials
    • Per-file and total size limits
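A few of the checks above are easy to approximate locally before pushing. The sketch below covers only the directory name, the required top-level files, and the unexpected-file rule. It is not the CI validator (scripts/validate.py remains authoritative), and the date pattern and error wording are assumptions.

```python
import re
from pathlib import Path

# Assumption: a date prefix followed by any non-empty slug.
DIR_NAME = re.compile(r"\d{4}-\d{2}-\d{2}-.+")
REQUIRED = (
    "submission.json",
    "results.csv",
    "sub.yaml",
    "provenance.json",
    "README.md",
)


def precheck(packet_dir: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    root = Path(packet_dir)
    errors: list[str] = []
    if not DIR_NAME.fullmatch(root.name):
        errors.append(f"directory name is not YYYY-MM-DD-<slug>: {root.name}")
    for name in REQUIRED:
        if not (root / name).is_file():
            errors.append(f"missing required file: {name}")
    if not any(root.glob("trials/*/*/result.json")):
        errors.append("no trial result found under trials/")
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        hidden = path.name.startswith(".") and path.name != ".gitkeep"
        if hidden or path.suffix in {".zip", ".bak"}:
            errors.append(f"unexpected file: {path.relative_to(root).as_posix()}")
    return errors
```

Schema validation, the results.csv cross-check, zstd and JSONL integrity, trial counts, and size limits are left to the real validator.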
Policy notes
  • Leaderboard is pass@1. Set run.n_attempts: 3 to keep extra trials on disk for your own pass@3 / pass^3 analysis — the manifest still publishes pass@1.
  • Partial submissions (--domain pa | um | cm on submission run) are accepted but flagged as partial on the leaderboard.
  • Resubmissions with a fresh date prefix are always acceptable. Old submissions are kept for historical record; mention in the PR body if you want one removed.
  • PR scope. Each PR touches exactly one new directory under benchmarks/<bench>/submissions/. Schema or workflow changes go in separate PRs.
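For the pass@1, pass@3, and pass^3 distinction in the first policy note, the following sketch shows how the three numbers differ on the same trials. It assumes you have already reduced each trial to a pass/fail boolean per task; the data layout and function names are hypothetical. How the manifest derives pass@1 when run.n_attempts is greater than 1 is not stated here, so the per-trial mean below is just one common estimator.

```python
def _require_tasks(outcomes: dict[str, list[bool]]) -> None:
    """Reject input that has no tasks or a task with no trials."""
    if not outcomes or any(len(t) == 0 for t in outcomes.values()):
        raise ValueError("need at least one task, each with at least one trial")


def pass_at_k(outcomes: dict[str, list[bool]]) -> float:
    """pass@k: fraction of tasks where at least one recorded trial passed."""
    _require_tasks(outcomes)
    return sum(any(t) for t in outcomes.values()) / len(outcomes)


def pass_hat_k(outcomes: dict[str, list[bool]]) -> float:
    """pass^k: fraction of tasks where every recorded trial passed."""
    _require_tasks(outcomes)
    return sum(all(t) for t in outcomes.values()) / len(outcomes)


def mean_pass_at_1(outcomes: dict[str, list[bool]]) -> float:
    """Average per-trial success rate, one estimator of pass@1."""
    _require_tasks(outcomes)
    return sum(sum(t) / len(t) for t in outcomes.values()) / len(outcomes)


# Made-up outcomes for four tasks with three attempts each.
example = {
    "task-a": [True, True, True],
    "task-b": [True, False, False],
    "task-c": [False, False, False],
    "task-d": [False, True, True],
}
```

On this made-up example, pass@3 is 0.75 (three of four tasks passed at least once), pass^3 is 0.25 (only task-a passed every time), and the mean pass@1 estimate is about 0.5.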

Packet contract for benchmark authors building their own producers: docs/submission-packet.md.