actAVA Benchmarks · CHI-Bench v1.0.0

The world’s first long-horizon healthcare benchmark for AI agents.

Built with the clinicians who do the work.

actAVA partners with 20+ hospitals and top universities to evaluate frontier agents across prior authorization, utilization management, and care management. The best agent today resolves 28% of tasks at pass@1; end-to-end prior authorization automation drops to 0%.
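As a rough illustration of the headline metric, the sketch below assumes that pass@1 is the share of tasks solved on a single trial, and that a task counts as solved only when every scorecard check passes. The second assumption is suggested by the replay on this page, where 7/11 checks yield a reward of 0.00, but it is not confirmed by the page. The `Trial` record and the task IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One agent trial on one task (hypothetical record shape)."""
    task_id: str
    checks_passed: int
    checks_total: int


def reward(trial: Trial) -> float:
    # Assumption: all-or-nothing reward, 1.0 only if every check passes.
    return 1.0 if trial.checks_passed == trial.checks_total else 0.0


def pass_at_1(trials: list[Trial]) -> float:
    # One trial per task; pass@1 is the mean reward across tasks.
    if not trials:
        raise ValueError("no trials")
    return sum(reward(t) for t in trials) / len(trials)


# Made-up trials for illustration only, not benchmark data.
trials = [
    Trial("task_a", 11, 11),
    Trial("task_b", 7, 11),
    Trial("task_c", 9, 9),
    Trial("task_d", 3, 10),
]
print(f"pass@1 = {pass_at_1(trials):.1%}")  # pass@1 = 50.0%
```

Under this reading, a trial that clears most of its checks still contributes nothing to pass@1, which is one way a long task can score 0% end to end.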

20+ clinical & academic collaborators
Johns Hopkins Medicine, Wellstar Health System, Yale School of Medicine, Stanford University, CMU, University of Oxford, USC, UCSD, Brown University, Emory University, University of Washington, Northeastern, Arizona State, UIC, Boston College, Stony Brook, MBZUAI, +3 more
Domains: 3 (PA · UM · CM)
Tasks: 75 (25 per domain)
Agent configs: 30 (harness × model)
Best pass@1: 28.0% (Claude Code · Opus 4.6)
MCP tools: 200+ (21 healthcare apps)
Handbook docs: 1,279 (policy corpus)
See it in motion

One agent, one trial, one final scorecard.

Each animation replays an actual trajectory from the claude-code-opus-4-6 submission — stage transitions, real tool calls, real policy gates, real scorecard outcomes.

Task pa_t008 · 69 steps · 7m41s
Reward 0.00 · 7/11
Seat 1 · provider intake
Clinical Intake Clerk
chart.search_patients
chart.get_patient_chart
chart.list_candidate_orders
people.list_ordering_providers
Seat 2 · case construction
PA Coordinator
cases.create_from_order code=43775 · site=outpatient
forms.list_required_forms
Seat 3 · documentation review
Clinical Reviewer
docs.upload_evidence chart_note
docs.upload_evidence lab_result
forms.save_form_response questionnaire
docs.upload_evidence letter_of_medical_necessity
Seat 4 · disposition + submit
PA Submitter
docs.create_submission_bundle
auth.finalize_provider_determination final_action=submit_pa
auth.submit_authorization
auth.check_status
Medical Policy · 2026T0362QQ
Bariatric Surgery
BMI ≥ 40 documented: 42.3
Two supervised programs documented
Procedure code 43775 on covered list
Site of service must be inpatient_hospital: agent submitted outpatient
Pre-op psych eval completed (referral ≠ done): final_action should be gather_more_evidence

An RN drives a clinical referral from chart pull through policy lookup, evidence assembly, and a final action decision. Every stage commits — there is no retry.

Best pass@1: 29.3% (Codex + GPT-5.5)
Why it's hard · one wrong site-of-service flip cascades into four scorecard failures.
Why CHI-Bench is hard

Three capabilities underrepresented in current benchmarks.

01
Long-horizon

60–80 agent steps per trial across 4–6 distinct stages, with state that carries forward and can't be retried after commit.

02
Role-composed

One agent plays many seats — intake clerk, nurse, medical director, peer-to-peer reviewer, letter center — and each seat writes its own artifact.

03
Policy-driven

Medical-policy criteria, site-of-service rules, consent scripts, evidence-grounding requirements — checked by deterministic rubrics plus an LLM semantic judge.
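The deterministic half of that checking can be pictured using the bariatric surgery replay earlier on this page. The sketch below is illustrative only: the `Submission` fields, the check names, the covered-code set, and the rule that an incomplete psych eval calls for `gather_more_evidence` are assumptions drawn from the replay's checklist, not the benchmark's actual rubric schema. The LLM semantic judge is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical snapshot of what the agent committed."""
    bmi: float
    supervised_programs: int
    procedure_code: str
    site_of_service: str
    psych_eval_completed: bool
    final_action: str


COVERED_CODES = {"43775"}  # assumed covered list for this policy


def run_rubric(s: Submission) -> dict[str, bool]:
    # Assumption: the correct final action depends on evidence completeness.
    expected_action = (
        "submit_pa" if s.psych_eval_completed else "gather_more_evidence"
    )
    return {
        "bmi_ge_40": s.bmi >= 40,
        "two_supervised_programs": s.supervised_programs >= 2,
        "code_on_covered_list": s.procedure_code in COVERED_CODES,
        "site_is_inpatient_hospital": s.site_of_service == "inpatient_hospital",
        "psych_eval_completed": s.psych_eval_completed,
        "final_action_matches_policy": s.final_action == expected_action,
    }


# Values mirror the replay: outpatient site, referral only, submitted anyway.
replay = Submission(42.3, 2, "43775", "outpatient", False, "submit_pa")
results = run_rubric(replay)
failed = [name for name, ok in results.items() if not ok]
print(failed)
# ['site_is_inpatient_hospital', 'psych_eval_completed', 'final_action_matches_policy']
```

Because each check is a pure function of the committed state, a single wrong field set early in the trial (here, the site of service) stays wrong in every later artifact that copies it.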

Get involved

Run a submission. Or read the methodology.

The benchmark code, dataset, and 1,279-page operations handbook are open. The leaderboard is live.