Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care

Monday, April 27, 2026

An autonomous agentic workflow for clinical detection of cognitive concerns using large language models

Tian, J., Fard, P., et al. (2026).
npj Digital Medicine, 9(1), 51.


Abstract

Early detection of cognitive impairment is limited by traditional screening tools and resource constraints. We developed two large language model workflows for identifying cognitive concerns from clinical notes: (1) an expert-driven workflow with iterative prompt refinement across three LLMs (LLaMA 3.1 8B, LLaMA 3.2 3B, Med42 v2 8B), and (2) an autonomous agentic workflow coordinating five specialized agents for prompt optimization. Using LLaMA 3.1, we optimized on a balanced refinement dataset and validated on an independent dataset reflecting real-world prevalence. The agentic workflow achieved comparable validation performance (F1 = 0.74 vs. 0.81) and superior refinement results (0.93 vs. 0.87) relative to the expert-driven workflow. Sensitivity decreased from 0.91 to 0.62 between datasets, demonstrating the impact of prevalence shift on generalizability. Expert re-adjudication revealed 44% of apparent false negatives reflected clinically appropriate reasoning. These findings demonstrate that autonomous agentic systems can approach expert-level performance while maintaining interpretability, offering scalable clinical decision support.


Here are some thoughts:

This paper introduces an AI-powered system designed to automatically detect signs of cognitive decline in clinical notes, without requiring any human involvement after initial setup. The researchers compared two approaches: one guided by clinical experts who iteratively refined the model's prompts, and a fully autonomous system in which specialized AI agents worked together to improve their own performance.
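To make the autonomous approach concrete, here is a toy sketch of a propose-evaluate optimization loop. This is not the paper's implementation (which coordinates five LLM agents): the "model" is stubbed as keyword matching, and the notes, labels, and candidate cues below are invented for illustration.

```python
# Toy labeled notes (invented for illustration; 1 = cognitive concern).
NOTES = [
    ("patient reports memory loss and confusion", 1),
    ("routine follow-up, no cognitive complaints", 0),
    ("daughter notes word-finding difficulty", 1),
    ("hypertension well controlled", 0),
]

# Candidate prompt fragments a "proposer" agent might suggest (hypothetical).
CANDIDATE_CUES = ["memory", "confusion", "word-finding", "dementia"]

def classify(cues, note):
    """Stand-in for an LLM call: flag a note if any cue appears in it."""
    return int(any(c in note for c in cues))

def f1(cues, notes):
    """Evaluator agent: score the current cue set by F1 on the labeled notes."""
    tp = sum(1 for n, y in notes if y == 1 and classify(cues, n) == 1)
    fp = sum(1 for n, y in notes if y == 0 and classify(cues, n) == 1)
    fn = sum(1 for n, y in notes if y == 1 and classify(cues, n) == 0)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def optimize(notes, candidates):
    """Greedy propose-evaluate loop: keep adding the cue that most improves F1."""
    cues, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for c in candidates:
            if c in cues:
                continue
            score = f1(cues + [c], notes)
            if score > best:
                cues, best, improved = cues + [c], score, True
    return cues, best
```

On this toy data, `optimize(NOTES, CANDIDATE_CUES)` settles on the cues that cover both positive notes without false alarms; the real system replaces keyword matching with LLM judgments and the greedy step with agent-driven prompt revisions.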

The autonomous system performed surprisingly well, and expert re-review found that 44% of its apparent false negatives actually reflected clinically sound reasoning. The main challenge the team identified was generalizability: a system tuned on a balanced dataset can struggle when deployed in real-world settings where cognitive concerns are far less prevalent, as the drop in sensitivity from 0.91 to 0.62 shows.
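The prevalence-shift effect is worth spelling out: even if a classifier's sensitivity and specificity stay fixed, its precision, and therefore its F1, falls as the condition becomes rarer. A small calculation makes this visible (the sensitivity, specificity, and prevalence values below are illustrative, not the paper's confusion matrices):

```python
def f1_at_prevalence(sensitivity, specificity, prevalence):
    """F1 of a classifier with fixed sensitivity/specificity at a given prevalence."""
    tp = sensitivity * prevalence            # true positives (as population fractions)
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    precision = tp / (tp + fp)
    return 2 * precision * sensitivity / (precision + sensitivity)

# Same classifier, evaluated at balanced vs. low prevalence (illustrative numbers):
balanced = f1_at_prevalence(0.91, 0.95, 0.50)   # ~0.93
realworld = f1_at_prevalence(0.91, 0.95, 0.05)  # ~0.64
```

The model itself has not changed between the two lines; only the population has. This is why validating on data that reflects real-world prevalence, as the authors did, is essential before trusting refinement-set scores.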

Overall, the findings suggest that autonomous AI systems can approach expert-level performance in clinical screening tasks, but will need careful calibration before being trusted in routine medical practice.