actAVA Benchmarks · CHI-Bench v1.0.0

The world’s first long-horizon healthcare benchmark for AI agents.

Built with the clinicians who do the work.

actAVA partners with 20+ hospitals and top universities to evaluate frontier agents across prior authorization, utilization management, and care management. The best agent today resolves 28% of tasks at pass@1; end-to-end prior authorization automation drops to 0%.

Read the paper · Leaderboard · Browse 75 tasks · GitHub

20+ clinical & academic collaborators

Johns Hopkins Medicine · Wellstar Health System · Yale School of Medicine · Stanford University · CMU · University of Oxford · USC · UCSD · Brown University · Emory University · University of Washington · Northeastern · Arizona State · UIC · Boston College · Stony Brook · MBZUAI · +3 more

Domains: PA · UM · CM

Tasks: 25 per domain

Agent configs: harness × model

Best pass@1: 28.0% (Claude Code · Opus 4.6)

MCP tools: 200+ (21 healthcare apps)

Handbook docs: 1,279 (policy corpus)
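
For readers unfamiliar with the headline metric, the sketch below shows one common way to compute pass@1: the fraction of tasks whose single attempt passed. This is an assumption about the metric's definition, not CHI-Bench's scoring code; the `pass_at_1` function and the synthetic outcomes are hypothetical. Under that assumption, 21 passing tasks out of 75 would give 28.0%.

```python
def pass_at_1(outcomes: list[bool]) -> float:
    """Fraction of tasks whose single attempt passed (assumed definition)."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    return sum(outcomes) / len(outcomes)


# Synthetic outcomes for illustration only: 21 passes out of 75 tasks.
outcomes = [True] * 21 + [False] * 54
score = pass_at_1(outcomes)
print(f"pass@1 = {score:.1%}")
```

The benchmark's paper and leaderboard remain the reference for how trials are actually aggregated.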

See it in motion

One agent, one trial, one final scorecard.

Each animation replays an actual trajectory from the claude-code-opus-4-6 submission — stage transitions, real tool calls, real policy gates, real scorecard outcomes.

Seat 1 · provider intake

Clinical Intake Clerk

chart.search_patients

chart.get_patient_chart

chart.list_candidate_orders

people.list_ordering_providers

Seat 2 · case construction

PA Coordinator

cases.create_from_order · code=43775 · site=outpatient

forms.list_required_forms

Seat 3 · documentation review

Clinical Reviewer

docs.upload_evidence · chart_note

docs.upload_evidence · lab_result

forms.save_form_response · questionnaire

docs.upload_evidence · letter_of_medical_necessity

Seat 4 · disposition + submit

PA Submitter

docs.create_submission_bundle

auth.finalize_provider_determination · final_action=submit_pa

auth.submit_authorization

auth.check_status
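
The replayed trajectory above can be pictured as an ordered list of tool calls grouped by seat. The sketch below is a hypothetical in-memory representation, not the benchmark's actual trace format: the `ToolCall` class, the `calls_by_seat` helper, and the `kind` argument key are assumptions, and the split between tool name and argument is inferred from the replay. Seat names and tool names are copied from it.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    seat: str                                   # role the agent plays for this call
    tool: str                                   # tool name as shown in the replay
    args: dict = field(default_factory=dict)    # illustrative arguments


TRAJECTORY = [
    ToolCall("Clinical Intake Clerk", "chart.search_patients"),
    ToolCall("Clinical Intake Clerk", "chart.get_patient_chart"),
    ToolCall("Clinical Intake Clerk", "chart.list_candidate_orders"),
    ToolCall("Clinical Intake Clerk", "people.list_ordering_providers"),
    ToolCall("PA Coordinator", "cases.create_from_order",
             {"code": "43775", "site": "outpatient"}),
    ToolCall("PA Coordinator", "forms.list_required_forms"),
    # "kind" is a hypothetical key for the evidence type shown in the replay.
    ToolCall("Clinical Reviewer", "docs.upload_evidence", {"kind": "chart_note"}),
    ToolCall("Clinical Reviewer", "docs.upload_evidence", {"kind": "lab_result"}),
    ToolCall("Clinical Reviewer", "forms.save_form_response", {"kind": "questionnaire"}),
    ToolCall("Clinical Reviewer", "docs.upload_evidence",
             {"kind": "letter_of_medical_necessity"}),
    ToolCall("PA Submitter", "docs.create_submission_bundle"),
    ToolCall("PA Submitter", "auth.finalize_provider_determination",
             {"final_action": "submit_pa"}),
    ToolCall("PA Submitter", "auth.submit_authorization"),
    ToolCall("PA Submitter", "auth.check_status"),
]


def calls_by_seat(trajectory):
    """Group tool names by seat, preserving the order in which seats appear."""
    grouped = {}
    for call in trajectory:
        grouped.setdefault(call.seat, []).append(call.tool)
    return grouped


grouped = calls_by_seat(TRAJECTORY)
for seat, tools in grouped.items():
    print(f"{seat}: {len(tools)} calls")
```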

Medical Policy · 2026T0362QQ

Bariatric Surgery

✓ BMI ≥ 40 documented: 42.3

✓ Two supervised programs documented

✓ Procedure code 43775 on covered list

✗ Site of service must be inpatient_hospital: agent submitted outpatient

✗ Pre-op psych eval completed (referral ≠ done): final_action should be gather_more_evidence
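
The scorecard above reads like a set of deterministic policy gates. A minimal sketch of such checks follows, assuming a hypothetical case record: the field names (`bmi`, `supervised_programs`, `psych_eval_status`, and so on), the `COVERED_CODES` set, and both helper functions are illustrative and are not taken from the benchmark's rubric code. The thresholds and expected values come from the scorecard.

```python
# Hypothetical case record mirroring the replayed trial; field names are assumptions.
case = {
    "bmi": 42.3,
    "supervised_programs": 2,
    "procedure_code": "43775",
    "site_of_service": "outpatient",
    "psych_eval_status": "referred",   # a referral, not a completed evaluation
    "final_action": "submit_pa",
}

COVERED_CODES = {"43775"}  # illustrative; the real covered list lives in the policy


def evaluate_gates(case):
    """Return {gate description: passed?} for each policy criterion."""
    return {
        "BMI >= 40 documented": case["bmi"] >= 40,
        "Two supervised programs documented": case["supervised_programs"] >= 2,
        "Procedure code on covered list": case["procedure_code"] in COVERED_CODES,
        "Site of service is inpatient_hospital":
            case["site_of_service"] == "inpatient_hospital",
        "Pre-op psych eval completed": case["psych_eval_status"] == "completed",
    }


def expected_final_action(gates):
    # Assumption: any unmet gate means the case is not ready to submit.
    return "submit_pa" if all(gates.values()) else "gather_more_evidence"


gates = evaluate_gates(case)
for name, ok in gates.items():
    print("PASS" if ok else "FAIL", name)
print("expected final_action:", expected_final_action(gates))
```

In this toy record the agent's `final_action` of `submit_pa` disagrees with the expected `gather_more_evidence`, which matches the failure shown in the replay.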


An RN drives a clinical referral from chart pull through policy lookup, evidence assembly, and a final action decision. Every stage commits — there is no retry.

Best pass@1

29.3%

Codex + GPT-5.5

Why it's hard · one wrong site-of-service flip cascades into four scorecard failures.

Why CHI-Bench is hard

Three capabilities underrepresented in current benchmarks.

Long-horizon

60–80 agent steps per trial across 4–6 distinct stages, with state that carries forward and can't be retried after commit.

Role-composed

One agent plays many seats — intake clerk, nurse, medical director, peer-to-peer reviewer, letter center — and each seat writes its own artifact.

Policy-driven

Medical-policy criteria, site-of-service rules, consent scripts, evidence-grounding requirements — checked by deterministic rubrics plus an LLM semantic judge.
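
The long-horizon constraint above, where state carries forward and a committed stage cannot be retried, can be illustrated with a small commit-once pipeline. This is a sketch under assumptions: the `Trial` class, its `commit` method, the exception type, and the stage names are hypothetical and do not describe the benchmark harness.

```python
class StageAlreadyCommitted(Exception):
    """Raised when code tries to commit a stage a second time."""


class Trial:
    def __init__(self, stages):
        self.stages = list(stages)   # ordered stage names
        self.committed = {}          # stage name -> committed artifact

    def commit(self, stage, artifact):
        if stage in self.committed:
            raise StageAlreadyCommitted(stage)
        if len(self.committed) >= len(self.stages):
            raise ValueError("all stages are already committed")
        expected = self.stages[len(self.committed)]
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.committed[stage] = artifact


trial = Trial(["intake", "case_construction", "documentation_review", "submit"])
trial.commit("intake", {"patient_found": True})
trial.commit("case_construction", {"site": "outpatient"})  # early mistake is locked in

try:
    # Attempting to fix the site of service after the fact is rejected.
    trial.commit("case_construction", {"site": "inpatient_hospital"})
except StageAlreadyCommitted:
    print("cannot retry a committed stage")

print(trial.committed["case_construction"]["site"])
```

Because later stages read from `committed`, an early wrong value such as the site of service flows into everything downstream.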

Get involved

Run a submission. Or read the methodology.

The benchmark code, dataset, and 1,279-document operations handbook are open. The leaderboard is live.

CHI-Bench paper · Submit a run · Docs · Dataset