The leaderboard repo

actava-ai/leaderboard is a public, data-only record of benchmark submissions. The full audit packet lives in git so reviewers can inspect any submission directly from the PR diff. This page covers the PR workflow, the CI validator, and the resubmission policy.

Two repos, one handoff

  1. Producer — run trials and produce a packet with cb submission prepare in actava-ai/chi-bench. See Run experiments.
  2. Leaderboard — fork actava-ai/leaderboard and open a PR adding that packet under benchmarks/<bench>/submissions/<YYYY-MM-DD>-<slug>/.
  3. CI labels the PR valid-submission / invalid-submission / needs-review and posts a sticky report. A maintainer reviews and merges.

One-time setup

Fork the leaderboard repo once; subsequent submissions reuse the same fork.

gh auth login                          # authenticate to GitHub
gh repo fork actava-ai/leaderboard --clone=false
git config --global user.email "you@example.com"   # if not set
git clone https://github.com/<you>/leaderboard && cd leaderboard

Three submission paths

All three end in a PR that runs the same CI validator (.github/workflows/validate.yml).

A. Helper script (recommended)

The helper validates locally, copies the packet, branches, pushes to your fork, and opens the PR.

python scripts/submit.py /path/to/packet/2026-05-13-<slug>/

Useful flags: --no-fork, --no-open-pr, --on-conflict abandon|replace|bump-date, --leaderboard-repo <slug>.

B. Claude Code / Codex

The repo ships a submit-to-leaderboard skill (AGENTS.md points Codex at the same file). Open the repo and ask:

/submit-to-leaderboard /abs/path/to/packet/2026-05-13-<slug>/

The skill wraps the helper with preflight checks, partial-failure recovery, and pointers to producer-side fixes.

C. Manual

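cp -r /path/to/packet/2026-05-13-<slug> benchmarks/<bench>/submissions/   # no trailing slash on the source, so the directory itself is copied
python scripts/validate.py benchmarks/<bench>/submissions/2026-05-13-<slug>/
git checkout -b sub/<bench>/2026-05-13-<slug>
git add benchmarks/<bench>/submissions/2026-05-13-<slug>   # new files are untracked, so commit -a alone would skip them
git commit -m "<bench>: <team> · <agent> · <model>"
git push origin sub/<bench>/2026-05-13-<slug>
gh pr create -R actava-ai/leaderboard --base main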

Pre-PR sanity check

scripts/validate.py is a thin shim around the CI validator — same code path. Run it locally before opening a PR:

python scripts/validate.py benchmarks/chi-bench/submissions/2026-05-13-<slug>/

Exit 0 = passed. Exit 1 = errors printed; fix and rerun.
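If you want to gate another step on the validator result, the exit-code convention above can be wrapped in a few lines of Python. This is a minimal sketch, not part of the repo: run_validator is a hypothetical helper name, and it relies only on the documented behavior that exit 0 means the packet passed.

```python
import subprocess
import sys


def run_validator(command: list[str]) -> bool:
    """Run a validator command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        # A non-zero exit means errors were printed; surface them so they can be fixed.
        sys.stderr.write(completed.stdout + completed.stderr)
    return completed.returncode == 0
```

A call such as run_validator([sys.executable, "scripts/validate.py", "<packet dir>"]) would then return True only when the packet passed.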

The validator depends on jsonschema, zstandard, and pyyaml. These are CI-only dependencies and are not declared in any pyproject.toml, so inject them with uv:

uv run --with jsonschema --with zstandard --with pyyaml python scripts/validate.py ...

What CI catches

  • Directory name = YYYY-MM-DD-<slug>
  • Required files: submission.json, results.csv, sub.yaml, provenance.json, README.md, ≥1 trial result
  • No unexpected files (.zip, .bak, hidden files except .gitkeep)
  • submission.json matches the JSON Schema (per benchmarks/<bench>/schema/submission-v<N>.json)
  • results.csv rows match the manifest exactly
  • provenance.json has required keys
  • Per-trial integrity: required files, valid zstd, valid JSONL per line
  • Trial counts match results.per_domain.<domain>.n_trials
  • Per-file and total size limits
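
The first two checks in the list above are easy to approximate before you copy a packet. The sketch below is illustrative only and is not the CI validator's code. The slug pattern (lowercase letters, digits, hyphens) is an assumption, check_packet is a hypothetical name, and the "≥1 trial result" requirement is left out.

```python
import re

# Assumed pattern: ISO-style date prefix, then a slug of lowercase letters, digits, hyphens.
DIR_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}-[a-z0-9][a-z0-9-]*$")

# Top-level files named in the checklist above.
REQUIRED_FILES = (
    "submission.json",
    "results.csv",
    "sub.yaml",
    "provenance.json",
    "README.md",
)


def check_packet(dir_name: str, present_files: set[str]) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if not DIR_NAME.match(dir_name):
        problems.append(f"directory name {dir_name!r} is not YYYY-MM-DD-<slug>")
    for name in REQUIRED_FILES:
        if name not in present_files:
            problems.append(f"missing required file: {name}")
    return problems
```

Passing the directory name and the set of top-level file names keeps the function free of filesystem access, so it can be exercised without a real packet on disk.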

Soft warnings (not failures): unknown dataset version, duplicate submission.id.

What reviewers do beyond CI

  • Sanity-check headline metrics (a 99% pass@1 on a benchmark where state-of-the-art is 30% warrants a closer look).
  • Spot-inspect one or two trajectories: zstdcat trials/<dom>/<id>/agent/trajectory.jsonl.zst | jq .
  • Confirm the producer repo and dataset version look right.
  • For resubmissions: decide whether to keep the old submission alongside the new one.

CI does not re-judge submissions in v1 (trust-the-evidence model). Maintainers may manually re-judge a random trial via the producer's tooling if a submission looks suspicious.

Resubmission policy

  • A new submission with a fresh date prefix is always acceptable, even if the slug is identical to an existing submission.
  • Old submissions are kept by default for historical record. If you want an old run removed, say so in the PR body of your new submission.
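
When writing the PR body for a resubmission, it can help to list the earlier runs that share your slug so you can say which ones, if any, should be removed. A minimal sketch, assuming directory names follow YYYY-MM-DD-<slug> exactly; prior_runs and split_dir_name are hypothetical helpers, not repo tooling.

```python
def split_dir_name(name: str) -> tuple[str, str]:
    """Split 'YYYY-MM-DD-<slug>' into (date, slug)."""
    # The date prefix is 10 characters, followed by one hyphen.
    return name[:10], name[11:]


def prior_runs(new_name: str, existing: list[str]) -> list[str]:
    """Return existing submission directories with the same slug but a different date."""
    new_date, new_slug = split_dir_name(new_name)
    return sorted(
        name
        for name in existing
        if split_dir_name(name)[1] == new_slug and split_dir_name(name)[0] != new_date
    )
```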

PR scope

Submission PRs must touch only files under benchmarks/*/submissions/*/. Changes to schemas, READMEs, workflows, or other benchmarks require a separate PR (or the maintainer-applied meta: label, which bypasses the CI scope check).
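
To check your own diff against this rule before pushing, a path filter along the following lines may be enough. It is a sketch of the stated rule, not the CI scope check itself. The function names are hypothetical, and the list of changed paths is assumed to come from something like git diff --name-only.

```python
from pathlib import PurePosixPath


def in_submission_scope(path: str) -> bool:
    """True if path lies under benchmarks/<bench>/submissions/<dir>/."""
    parts = PurePosixPath(path).parts
    # Expect: benchmarks / <bench> / submissions / <dir> / at least one more component.
    return (
        len(parts) >= 5
        and parts[0] == "benchmarks"
        and parts[2] == "submissions"
    )


def out_of_scope(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that a submission PR should not touch."""
    return [p for p in changed_paths if not in_submission_scope(p)]
```

An empty result from out_of_scope suggests the PR stays within a submission directory; anything listed belongs in a separate PR.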

Inspecting submissions

Submission directories are plain files. Click into any one on GitHub to see the manifest, headline metrics (in the auto-generated README), and the per-trial tree. Trajectories are zstd-compressed JSONL:

zstdcat benchmarks/chi-bench/submissions/<dir>/trials/<domain>/<trial_id>/agent/trajectory.jsonl.zst | jq .
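
If jq reports a parse error, a short script can pinpoint which lines of the decompressed stream are not valid JSON. The sketch below works on already-decompressed text, for example the output of the zstdcat command above piped to stdin. bad_jsonl_lines is a hypothetical helper, and skipping blank lines is an assumption, not documented validator behavior.

```python
import json
from typing import Iterable


def bad_jsonl_lines(lines: Iterable[str]) -> list[int]:
    """Return the 1-based numbers of lines that do not parse as JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad
```

Called as bad_jsonl_lines(sys.stdin) in a script fed by zstdcat, it returns an empty list when every line parses.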

Authoritative docs