Leaderboard · chi-bench-v1.0.0

CHI-Bench Leaderboard

Χ-Bench evaluates agents on long-horizon, policy-rich U.S. healthcare workflows across three domains: provider prior authorization (PA), payer utilization management (UM), and care management (CM). Each domain ships 25 tasks scored by an automated workspace judge under pass@1 with a binary 0/1 reward. Submissions below are ranked by accuracy on the selected domain.
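As a rough illustration of the scoring described above, the sketch below computes a pass@1 accuracy from binary rewards. The function names and the example rewards are hypothetical, and pooling all runs to get the overall Accuracy figure is an assumption, not something the benchmark documentation confirms.

```python
from typing import Dict, List


def pass_at_1(rewards: List[int]) -> float:
    """Fraction of single-attempt runs that earned a reward of 1."""
    if not rewards:
        raise ValueError("need at least one reward")
    if any(r not in (0, 1) for r in rewards):
        raise ValueError("rewards must be binary (0 or 1)")
    return sum(rewards) / len(rewards)


def overall_accuracy(per_domain: Dict[str, List[int]]) -> float:
    """Assumed aggregation: pool every run from every domain, then average."""
    pooled = [r for rewards in per_domain.values() for r in rewards]
    return pass_at_1(pooled)


# Made-up rewards for illustration only; these are not real leaderboard data.
example = {
    "PA": [1] * 5 + [0] * 20,
    "UM": [1] * 10 + [0] * 15,
    "CM": [1] * 6 + [0] * 19,
}

for domain, rewards in example.items():
    print(f"{domain}: {pass_at_1(rewards):.1%}")
print(f"All domains: {overall_accuracy(example):.1%}")
```

When every domain contributes the same number of runs, the pooled figure equals the plain mean of the three per-domain accuracies.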

| Org | Agent | Model | Type | Accuracy | PA | UM | CM | Date |
|---|---|---|---|---|---|---|---|---|
| Anthropic | claude-code | claude-opus-4-6 | Proprietary | 28.0% | 18.7% | 41.3% | 24.0% | 2026-05-01 |
| Anthropic | claude-code | claude-sonnet-4-6 | Proprietary | 26.2% | 24.0% | 34.7% | 20.0% | 2026-05-01 |
| Anthropic | claude-code | claude-opus-4-7 | Proprietary | 24.4% | 24.0% | 17.3% | 32.0% | 2026-05-01 |
| OpenAI | codex | gpt-5.5 | Proprietary | 20.9% | 29.3% | 32.0% | 1.3% | 2026-05-01 |
| OpenAI / Zhipu | openai-agents | glm-5.1 | Open-Source | 18.7% | 18.7% | 33.3% | 4.0% | 2026-05-01 |
| Nous Research / Zhipu | hermes | glm-5.1 | Open-Source | 18.7% | 10.7% | 34.7% | 10.7% | 2026-05-01 |
| OpenClaw / Anthropic | openclaw | claude-opus-4-7 | Proprietary | 17.3% | 18.7% | 13.3% | 20.0% | 2026-05-01 |
| OpenClaw / Zhipu | openclaw | glm-5.1 | Open-Source | 16.9% | 13.3% | 26.7% | 10.7% | 2026-05-01 |
| Nous Research / Alibaba | hermes | qwen-3.6-max | Open-Source | 16.4% | 9.3% | 26.7% | 13.3% | 2026-05-01 |
| OpenAI | codex | gpt-5.4 | Proprietary | 16.0% | 24.0% | 17.3% | 6.7% | 2026-05-01 |
| OpenAI / Alibaba | openai-agents | qwen-3.6-max | Open-Source | 15.6% | 16.0% | 26.7% | 4.0% | 2026-05-01 |
| Nous Research / Moonshot | hermes | kimi-k2.6 | Open-Source | 15.6% | 18.7% | 21.3% | 6.7% | 2026-05-01 |
| OpenAI / Moonshot | openai-agents | kimi-k2.6 | Open-Source | 15.1% | 17.3% | 25.3% | 2.7% | 2026-05-01 |
| OpenAI / DeepSeek | openai-agents | deepseek-v4-pro | Open-Source | 14.2% | 10.7% | 28.0% | 4.0% | 2026-05-01 |
| Nous Research / DeepSeek | hermes | deepseek-v4-pro | Open-Source | 13.8% | 8.0% | 25.3% | 8.0% | 2026-05-01 |
| Google | gemini-cli | gemini-3-flash | Proprietary | 12.5% | 18.7% | 18.7% | 0.0% | 2026-05-01 |
| OpenClaw / DeepSeek | openclaw | deepseek-v4-pro | Open-Source | 11.1% | 14.7% | 12.0% | 6.7% | 2026-05-01 |
| LangChain / Zhipu | deepagents | glm-5.1 | Open-Source | 11.1% | 17.3% | 10.7% | 5.3% | 2026-05-01 |
| LangChain / DeepSeek | deepagents | deepseek-v4-pro | Open-Source | 10.7% | 14.7% | 10.7% | 6.7% | 2026-05-01 |
| OpenClaw / Moonshot | openclaw | kimi-k2.6 | Open-Source | 10.2% | 12.0% | 18.7% | 0.0% | 2026-05-01 |
| LangChain / Alibaba | deepagents | qwen-3.6-max | Open-Source | 9.3% | 12.0% | 10.7% | 5.3% | 2026-05-01 |
| OpenAI | codex | gpt-5.4-mini | Proprietary | 8.4% | 10.7% | 13.3% | 1.3% | 2026-05-01 |
| Google | gemini-cli | gemini-3.1-pro | Proprietary | 7.1% | 14.7% | 6.7% | 0.0% | 2026-05-01 |
| Anthropic | claude-code | claude-haiku-4-5 | Proprietary | 6.2% | 0.0% | 14.7% | 4.0% | 2026-05-01 |
| OpenAI / xAI | openai-agents | grok-4.3 | Open-Source | 5.8% | 0.0% | 16.0% | 1.3% | 2026-05-01 |
| OpenClaw / Alibaba | openclaw | qwen-3.6-max | Open-Source | 4.9% | 10.7% | 4.0% | 0.0% | 2026-05-01 |
| Nous Research / xAI | hermes | grok-4.3 | Open-Source | 4.4% | 0.0% | 13.3% | 0.0% | 2026-05-01 |
| LangChain / Moonshot | deepagents | kimi-k2.6 | Open-Source | 3.1% | 8.0% | 1.3% | 0.0% | 2026-05-01 |
| LangChain / xAI | deepagents | grok-4.3 | Open-Source | 2.2% | 0.0% | 5.3% | 1.3% | 2026-05-01 |
| OpenClaw / xAI | openclaw | grok-4.3 | Open-Source | 0.4% | 1.3% | 0.0% | 0.0% | 2026-05-01 |

Submissions ranked by pass@1 on All Domains.

Got results?

Submit your agent to the CHI-Bench leaderboard.

Run the evaluation suite with your harness and model, prepare a submission packet with cb submission prepare, and open a PR. CI re-runs the validator, and a maintainer merges within one business day.
