Agentic Clinical Reasoning over Longitudinal Myeloma Records: A Retrospective Evaluation against Expert Consensus
About
Onco-Agent is a multi-turn retrieval-augmented agent for answering structured clinical questions from patient records. It is the system described in our paper: Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus.
Given a patient's clinical record (discharge letters, radiology reports, lab values, etc.) and a doctor's question, the agent:
- Plans which information to retrieve and which domain skills to apply.
- Retrieves relevant report sections and lab values using specialized retrieval tools with date and type filters.
- Applies domain-specific skills (clinical workflows, evidence ranking, scoring systems).
- Produces a structured answer with inline citations.
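The plan → retrieve → synthesise loop above can be sketched roughly as follows. Onco-Agent's actual tool names, schemas, and planning logic are not shown here; every identifier in this snippet (`Report`, `retrieve`, `answer`) is a hypothetical stand-in for illustration only.

```python
from dataclasses import dataclass

# Hypothetical in-memory record; a real deployment would query the
# patient's document store and lab database instead.
@dataclass
class Report:
    date: str       # ISO date of the document
    doc_type: str   # e.g. "radiology", "discharge"
    text: str

RECORD = [
    Report("2021-03-02", "radiology", "Lytic lesion in L3 vertebra."),
    Report("2023-07-15", "discharge", "Started lenalidomide maintenance."),
]

def retrieve(record, doc_type=None, after=None):
    """Retrieval-tool sketch: filter reports by type and date."""
    hits = record
    if doc_type:
        hits = [r for r in hits if r.doc_type == doc_type]
    if after:
        hits = [r for r in hits if r.date >= after]
    return hits

def answer(question, record):
    """Minimal plan -> retrieve -> synthesise loop with inline citations."""
    # 1. Plan: choose retrieval filters from the question
    #    (toy keyword heuristic, not the agent's real planner).
    doc_type = "radiology" if "lesion" in question.lower() else None
    # 2. Retrieve with the planned filters.
    hits = retrieve(record, doc_type=doc_type)
    # 3. Synthesise a structured answer citing source dates.
    return {
        "answer": " ".join(r.text for r in hits),
        "citations": [r.date for r in hits],
    }

result = answer("Any lytic lesion findings?", RECORD)
```

In the full system, step 1 also selects domain skills (staging criteria, response scoring), and steps 2–3 repeat over multiple turns until the planner judges the evidence sufficient.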
Demo
Demo scenarios (animated examples): single-document lookup, temporal reasoning, and multi-criteria synthesis.
Abstract
Background. Multiple myeloma is managed through sequential lines of therapy over years to decades, with each treatment decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether large language model-based systems can synthesise this evidence at a level approaching expert agreement has not been established.
Methods. A retrospective evaluation was conducted on longitudinal clinical records of 811 patients with multiple myeloma treated at a tertiary medical centre between 2001 and 2026, covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient–question pairs derived from 48 templates stratified into three complexity levels. The reference standard was established by independent double annotation from four oncologists with adjudication by a senior haematologist.
Findings. Iterative retrieval-augmented generation and full-context input converged on a shared performance ceiling (75·4% versus 75·8%, Bonferroni-corrected p = 1·00). The agentic system reached 79·6% concordance (95% CI 76·4–82·8), significantly exceeding both baselines (+3·8 and +4·2 percentage points; p = 0·006 and 0·007). Gains increased with question complexity, reaching +9·4 percentage points on criteria-based synthesis (p = 0·032), and with record length, reaching +13·5 percentage points in the top decile (exploratory, n = 10). The system error rate (12·2%) was comparable to expert disagreement (13·6%), but severity distributions were inverted, with 57·8% of system errors classified as clinically significant against 18·8% of expert disagreements.
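As a sanity check on the headline number, a normal-approximation 95% confidence interval for 79·6% concordance on the 469 patient–question pairs roughly reproduces the reported 76·4–82·8 range; the paper's exact interval method is not stated here, so this is only an approximation.

```python
import math

# Normal-approximation (Wald) 95% CI for a proportion.
# n and p are taken from the Findings paragraph above.
n, p = 469, 0.796
se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
# lo, hi land close to the published 0.764–0.828 interval.
```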
Interpretation. Agentic reasoning was the only approach to exceed the shared performance ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors relative to expert disagreement indicates that prospective evaluation in routine care will be required before these findings translate into measurable patient benefit.
BibTeX
@article{moll2026agentic,
title={Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus},
author={Moll, Johannes and L{\"u}bberstedt, Jannik and Nuernbergk, Christoph and Stroh, Jacob and Mertens, Luisa and Purcarea, Anna and Zirn, Christopher and Benchaaben, Zeineb and Drexel, Fabian and H{\"a}ntze, Hartmut and Narayanan, Anirudh and Puttkammer, Friedrich and Zhukov, Andrei and Lammert, Jacqueline and Ziegelmayer, Sebastian and Graf, Markus and H{\"o}gner, Marion and Makowski, Marcus and Bassermann, Florian and Adams, Lisa C and Pan, Jiazhen and Rueckert, Daniel and Braitsch, Krischan and Bressem, Keno K},
journal={arXiv preprint arXiv:2604.24473},
year={2026}
}