Adler, D. A., Stamatis, C. A., et al. (2024).
npj Mental Health Research, 3(1), 17.
Abstract
AI tools intend to transform mental healthcare by providing remote estimates of depression risk using behavioral data collected by sensors embedded in smartphones. While these tools accurately predict elevated depression symptoms in small, homogenous populations, recent studies show that these tools are less accurate in larger, more diverse populations. In this work, we show that accuracy is reduced because sensed-behaviors are unreliable predictors of depression across individuals: sensed-behaviors that predict depression risk are inconsistent across demographic and socioeconomic subgroups. We first identified subgroups where a developed AI tool underperformed by measuring algorithmic bias, where subgroups with depression were incorrectly predicted to be at lower risk than healthier subgroups. We then found inconsistencies between sensed-behaviors predictive of depression across these subgroups. Our findings suggest that researchers developing AI tools predicting mental health from sensed-behaviors should think critically about the generalizability of these tools, and consider tailored solutions for targeted populations.
Here are some thoughts:
This article presents a critical examination of the reliability and fairness of AI tools designed to predict depression risk using smartphone-sensed behaviors. The core finding is that these tools often fail to generalize across diverse populations because the relationship between behavior and depression is not universal. For instance, increased phone usage or changes in mobility may signal depression in one demographic group but not in another. This means that an AI model trained on one population can systematically underestimate or overestimate risk in another; the algorithmic bias observed here is therefore not just a data-imbalance problem but a fundamental issue of measurement validity and psychometric reliability.
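To make the bias measurement concrete, here is a minimal sketch of the kind of subgroup audit the paper describes, not the authors' actual pipeline. It compares, per demographic subgroup, the model's mean predicted risk among participants who actually have elevated symptoms, and the resulting false negative rate. The column names (`subgroup`, `depressed`, `predicted_risk`), the toy data, and the decision threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical evaluation table: one row per participant, with a
# demographic subgroup label, a ground-truth depression indicator
# (e.g., a symptom score above a screening cutoff), and the AI
# tool's predicted risk score in [0, 1].
df = pd.DataFrame({
    "subgroup":       ["A", "A", "A", "B", "B", "B"],
    "depressed":      [1,   1,   0,   1,   1,   0],
    "predicted_risk": [0.8, 0.7, 0.3, 0.4, 0.3, 0.2],
})

# Mean predicted risk among participants who actually have elevated
# symptoms, broken out by subgroup. A large gap is the bias pattern
# the paper reports: a depressed subgroup scored as lower risk.
risk_given_depressed = (
    df[df["depressed"] == 1]
    .groupby("subgroup")["predicted_risk"]
    .mean()
)
print(risk_given_depressed)

# False negative rate per subgroup at a hypothetical flagging threshold.
threshold = 0.5
df["flagged"] = df["predicted_risk"] >= threshold
fnr = (
    df[df["depressed"] == 1]
    .groupby("subgroup")["flagged"]
    .apply(lambda s: 1.0 - s.mean())
)
print(fnr)
```

In this toy data, subgroup B's depressed participants receive lower predicted risk and a higher false negative rate than subgroup A's, mirroring the kind of disparity the study measures.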
Importantly, this underscores the necessity of approaching digital phenotyping tools with caution. Before adopting such technologies in clinical or screening contexts, it is vital to demand robust validation across the specific populations they are intended to serve. The study also highlights tangible clinical risks: biased tools could misallocate mental health resources by overestimating risk in some groups while underestimating it in others, such as males, who already underutilize services. Ultimately, this research calls for a shift from aiming for universally generalizable models to developing targeted, culturally and contextually tailored solutions. Psychologists have a key role to play in this process by ensuring that digital tools are grounded in psychological theory, evaluated for equity, and implemented in a way that promotes ethical and effective mental health care for all.
