Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care

Welcome to the nexus of ethics, psychology, morality, technology, health care, and philosophy

Tuesday, October 21, 2025

Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures

Shah, S., Gupta, A., et al. (2025, September 1).
arXiv.org.

Abstract

As large language models (LLMs) increasingly mediate emotionally sensitive conversations, especially in mental health contexts, their ability to recognize and respond to high-risk situations becomes a matter of public safety. This study evaluates the responses of six popular LLMs (Claude, Gemini, Deepseek, ChatGPT, Grok 3, and LLAMA) to user prompts simulating crisis-level mental health disclosures. Drawing on a coding framework developed by licensed clinicians, five safety-oriented behaviors were assessed: explicit risk acknowledgment, empathy, encouragement to seek help, provision of specific resources, and invitation to continue the conversation. Claude outperformed all others in global assessment, while Grok 3, ChatGPT, and LLAMA underperformed across multiple domains. Notably, most models exhibited empathy, but few consistently provided practical support or sustained engagement. These findings suggest that while LLMs show potential for emotionally attuned communication, none currently meet satisfactory clinical standards for crisis response. Ongoing development and targeted fine-tuning are essential to ensure ethical deployment of AI in mental health settings.

Here are some thoughts:

This study evaluated six LLMs (Claude, Gemini, Deepseek, ChatGPT, Grok 3, Llama) on their responses to high-risk mental health disclosures using a clinician-developed framework. While most models showed empathy, only Claude consistently demonstrated all five core safety behaviors: explicit risk acknowledgment, encouragement to seek help, provision of specific resources (e.g., crisis lines), and crucially, inviting continued conversation. Grok 3, ChatGPT, and Llama frequently failed to acknowledge risk or provide concrete resources, and nearly all models (except Claude and Grok 3) avoided inviting further dialogue – a critical gap in crisis care. Performance varied dramatically, revealing that safety is not an emergent property of scale but results from deliberate design (e.g., Anthropic’s Constitutional AI). No model met minimum clinical safety standards; LLMs are currently unsuitable as autonomous crisis responders and should only be used as adjunct tools under human supervision.