Installation

One-time setup: clone the repo, install Python deps, set API keys, fetch the dataset and the handbook, and build the Docker image.

Prerequisites

  • Python 3.12+
  • Docker (for the trial sandbox)
  • uv — Python package manager
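The prerequisites above can be checked up front with a small script. This is a minimal sketch, not part of chi-bench: the function name `missing_prerequisites` is hypothetical, and the check only confirms that `docker` and `uv` are on PATH, not that the Docker daemon is running.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)
REQUIRED_TOOLS = ("docker", "uv")


def missing_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Compare only (major, minor) against the documented minimum.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # shutil.which returns None when the executable is not on PATH.
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Run it with the interpreter you plan to use for the project; no output means the three prerequisites were found.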

1. Clone and install

git clone https://github.com/actava-ai/chi-bench && cd chi-bench
uv sync --extra dev

2. API keys

Copy .env.example to .env and fill in keys for the providers you intend to run. ANTHROPIC_API_KEY is always required — the workspace judge is pinned to claude-opus-4-7.

  • ANTHROPIC_API_KEY — required (judge + Claude Code harness default).
  • OPENAI_API_KEY — required for Codex and OAI Agents rows.
  • GEMINI_API_KEY — required for Gemini CLI rows.
  • OPENROUTER_API_KEY — required for the open-stack rows on open-weight models.
  • CLAUDE_CODE_OAUTH_TOKEN — optional, cheaper for smoke-testing the Claude Code harness.
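Before a run, it can help to confirm that the keys you need are actually filled in. The sketch below is an illustration only: `read_env_file` and `missing_keys` are hypothetical helpers, the parser handles just plain `KEY=VALUE` lines, and how chi-bench itself loads `.env` is not covered here.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env, extra_required=()):
    """Return required key names that are absent or empty in env.

    ANTHROPIC_API_KEY is always checked; pass the provider keys for the
    rows you intend to run as extra_required.
    """
    names = ("ANTHROPIC_API_KEY", *extra_required)
    return [name for name in names if not env.get(name, "").strip()]
```

For example, `missing_keys(read_env_file(".env"), ("OPENAI_API_KEY",))` returns an empty list when both the Anthropic and OpenAI keys are set.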

3. Task fixtures from Hugging Face

The dataset is gated. Authenticate once with the CLI, then download a pinned revision:

uv run huggingface-cli login

REV=chi-bench-v1.0.0
uv run huggingface-cli download actava/chi-bench \
  --repo-type dataset --revision "$REV" --local-dir data/
echo "$REV" > data/.chi-bench-version

The pin in data/.chi-bench-version is what submission preflight verifies against your config's dataset.version. Always rewrite it when changing revisions.
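To catch a stale pin before preflight does, you can compare the file against the version you expect. This is a rough sketch under assumptions: `pin_matches` is a hypothetical helper, and it assumes preflight compares the two strings for exact equality, which this page does not specify.

```python
from pathlib import Path


def pin_matches(expected_version, data_dir="data"):
    """Return True if data/.chi-bench-version holds exactly expected_version."""
    pin_file = Path(data_dir) / ".chi-bench-version"
    if not pin_file.is_file():
        return False
    # `echo` writes a trailing newline, so strip whitespace before comparing.
    return pin_file.read_text().strip() == expected_version
```

For example, `pin_matches("chi-bench-v1.0.0")` should return True right after the download commands above, assuming your config's dataset.version is set to the same string.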

4. Managed-Care Operations Handbook

The handbook (1,279 markdown documents) is hosted outside Hugging Face because of its size and the provenance of its clinical-collaborator curation. Download the tarball from the share URL in your invitation email, then extract:

mkdir -p data/skills
tar -xzf managed-care-operations-handbook.tar.gz -C data/skills/
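After extraction, a quick file count can catch a truncated download. The sketch below is an assumption-laden check, not a chi-bench command: `count_markdown` is a hypothetical helper, and it assumes every handbook document ends in `.md` and that the tarball ships no other `.md` files, so a small mismatch may not indicate a problem.

```python
from pathlib import Path

EXPECTED_DOCS = 1279  # document count stated for the handbook


def count_markdown(root="data/skills"):
    """Count .md files anywhere under root."""
    return sum(1 for path in Path(root).rglob("*.md") if path.is_file())
```

Comparing `count_markdown()` with `EXPECTED_DOCS` gives a rough sanity check; `uv run cb data verify` remains the check this guide relies on.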

5. Build the Docker image

This one-time build takes about 5 minutes. The image bundles the FastAPI server, the workspace judge, the agent harness, and per-task fixtures.

uv run cb docker build

cb is the short alias for chi-bench; both commands resolve to the same CLI. If your shell already aliases cb, use chi-bench.

Verify setup

uv run cb data verify

A clean run means you're ready for the quickstart.

Optional: Modal for parallel execution

Modal parallelizes trials across remote sandboxes — strongly recommended for matrix runs.

# default profile, or:
uv run modal setup
# named profile:
uv run modal token set --profile chi-bench

If you use a named profile, export MODAL_PROFILE=chi-bench in your shell before running the matrix.