Bring your own agent

The same submission flow works for built-in and custom agents — only the harness wiring differs.

Built-in agents

--agent        Example --model                Paper rows
claude-code    anthropic/claude-opus-4-7      Claude Code
codex          openai/gpt-5.5                 Codex
gemini-cli     gemini/gemini-3-pro-preview    Gemini CLI
openclaw       anthropic/claude-opus-4-7      OpenClaw
hermes         openrouter/z-ai/glm-5.1        Hermes
openai-agents  deepseek/deepseek-v4-pro       OAI Agents
deepagents     openrouter/x-ai/grok-4.3       DeepAgents

The full 30-row matrix lives in configs/experiments/table1_main_matrix.yaml.

What an agent harness needs to provide

A harness is a Python class under src/chi_bench/experiment/agents/ that implements three things:

  • A constructor that receives the per-task instruction.md, the role-scoped MCP server URL, the model identifier, and any provider credentials.
  • An async run() that drives the agent loop until it terminates (success, failure, or budget exhausted).
  • A trajectory writer that emits one JSONL record per step into agent/trajectory.jsonl following the ATIF schema (Agent Trajectory Interchange Format).
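
The three requirements above can be sketched as a single class. This is a minimal illustration, not the project's actual base class: the name EchoHarness, the constructor parameter names, the max_steps budget, the step() placeholder, and the record fields ("step", "status", "error") are all assumptions made for the example. Real records should follow the ATIF schema, which is not reproduced here.

```python
import asyncio
import json
from pathlib import Path


class EchoHarness:
    """Hypothetical harness; the real AgentHarness interface may differ."""

    def __init__(self, instruction, mcp_url, model, credentials=None,
                 out_dir=".", max_steps=3):
        self.instruction = instruction        # contents of the per-task instruction.md
        self.mcp_url = mcp_url                # role-scoped MCP server URL
        self.model = model                    # "<provider>/<model-id>"
        self.credentials = credentials or {}  # provider credentials, if any
        self.max_steps = max_steps            # assumed stand-in for a step budget
        self.trajectory_path = Path(out_dir) / "agent" / "trajectory.jsonl"

    def write_step(self, record):
        """Append one JSON object per line to agent/trajectory.jsonl."""
        self.trajectory_path.parent.mkdir(parents=True, exist_ok=True)
        with self.trajectory_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    async def step(self, index):
        """Placeholder for one model/tool turn; returns True when the task is done."""
        await asyncio.sleep(0)
        return False

    async def run(self):
        """Drive the loop until success, failure, or budget exhaustion."""
        for i in range(self.max_steps):
            try:
                done = await self.step(i)
            except Exception as exc:
                self.write_step({"step": i, "status": "failure", "error": str(exc)})
                return "failure"
            self.write_step({"step": i, "status": "success" if done else "running"})
            if done:
                return "success"
        return "budget_exhausted"
```

A real harness would replace step() with a call to the model and the MCP tools, and would emit whatever fields ATIF requires instead of the placeholder ones used here.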

Wiring a new harness

  1. Create src/chi_bench/experiment/agents/<your_agent>.py, subclassing the base AgentHarness. Point the agent at the MCP server using the URL passed to the constructor.
  2. Register it in src/chi_bench/experiment/agents/__init__.py so --agent <your_agent> resolves it on the CLI.
  3. Smoke-test with cb experiment run --agent <your_agent> --model ... on a single task; confirm that verifier/scorecard.json is written and can be read.
  4. Run a full submission via cb submission run -f configs/submissions/<id>.yaml.
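
Step 2 amounts to mapping a CLI name to a harness class. How agents/__init__.py actually does this is not shown here, so the sketch below is only one plausible shape: AGENT_REGISTRY, register_agent, resolve_agent, and MyAgentHarness are hypothetical names, not part of chi_bench.

```python
# Hypothetical name-to-class registry; the real registration mechanism may differ.
AGENT_REGISTRY = {}


def register_agent(name):
    """Class decorator that records a harness under its --agent name."""
    def decorator(cls):
        if name in AGENT_REGISTRY:
            raise ValueError(f"agent {name!r} is already registered")
        AGENT_REGISTRY[name] = cls
        return cls
    return decorator


def resolve_agent(name):
    """Look up the harness class for a given --agent value."""
    try:
        return AGENT_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(AGENT_REGISTRY)) or "(none)"
        raise ValueError(f"unknown agent {name!r}; known agents: {known}") from None


@register_agent("my_agent")
class MyAgentHarness:
    """Stand-in for a class that would subclass AgentHarness."""
```

Failing loudly on an unknown or duplicate name keeps a typo in --agent from silently running the wrong harness.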

Custom model endpoints

Most harnesses route through provider SDKs (Anthropic, OpenAI, Google, OpenRouter) keyed by the --model prefix. To add a new provider:

  • Add a model-resolver entry that maps <provider>/<model-id> to a client construction.
  • Add the provider's API key handling to .env.example and document it in the README.
  • If the endpoint is OpenAI-compatible, you can usually reuse the existing codex or openai-agents harnesses by setting OPENAI_BASE_URL to the endpoint's URL. The judge subprocess is the exception: it always uses the real Anthropic API.
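
A model resolver of the kind described above might look like the following sketch. Everything except OPENAI_BASE_URL is an assumption: the function names, the PROVIDER_ENV table, the API-key variable names, and the returned dict are illustrative rather than the project's real resolver. Note that the provider is split off at the first slash only, because identifiers such as openrouter/z-ai/glm-5.1 contain a second slash inside the model ID.

```python
import os

# Assumed mapping from --model prefix to the env var holding that provider's key.
PROVIDER_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def split_model(model):
    """Split '<provider>/<model-id>' at the first slash only."""
    provider, sep, model_id = model.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected '<provider>/<model-id>', got {model!r}")
    return provider, model_id


def resolve_client_config(model, env=None):
    """Return the settings a client constructor would need (hypothetical shape)."""
    env = os.environ if env is None else env
    provider, model_id = split_model(model)
    if provider not in PROVIDER_ENV:
        raise ValueError(f"no resolver entry for provider {provider!r}")
    key_var = PROVIDER_ENV[provider]
    api_key = env.get(key_var)
    if not api_key:
        raise RuntimeError(f"{key_var} is not set")
    config = {"provider": provider, "model_id": model_id, "api_key": api_key}
    # An OpenAI-compatible endpoint is selected by overriding the base URL.
    if provider == "openai" and env.get("OPENAI_BASE_URL"):
        config["base_url"] = env["OPENAI_BASE_URL"]
    return config
```

Passing env explicitly, as in resolve_client_config("openai/my-model", env={...}), makes the resolver easy to exercise without touching the real environment.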

Authoritative docs

The packet shape is identical regardless of whether you submit a built-in agent or a custom one. The leaderboard PR flow doesn't care how the trials were produced, only that the manifest, results CSV, and per-trial evidence pass the validator.