Ethics and Psychology: Explicitly unbiased large language models still form biased associations

Bai, X., Wang, A., et al. (2025).

PNAS, 122(8).

Abstract

Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: As LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two measures: LLM Word Association Test, a prompt-based method for revealing implicit bias; and LLM Relative Decision Test, a strategy to detect subtle discrimination in contextual decisions. Both measures are based on psychological research: LLM Word Association Test adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Relative Decision Test operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). These prompt-based measures draw from psychology’s long history of research into measuring stereotypes based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.

Significance

Modern large language models (LLMs) are designed to align with human values. They can appear unbiased on standard benchmarks, but we find that they still show widespread stereotype biases on two psychology-inspired measures. These measures allow us to measure biases in LLMs based on just their behavior, which is necessary as these models have become increasingly proprietary. We found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity), also demonstrating sizable effects on discriminatory decisions. Given the growing use of these models, biases in their behavior can have significant consequences for human societies.

Here are some thoughts:

This research is important to psychologists because it highlights the parallels between implicit biases in humans and those that persist in large language models (LLMs), even when these models are explicitly aligned to be unbiased. By adapting psychological tools like the Implicit Association Test (IAT) and focusing on relative decision-making tasks, the study uncovers pervasive stereotype biases in LLMs across social categories such as race, gender, religion, and health—mirroring well-documented human biases. This insight is critical for psychologists studying bias formation, transmission, and mitigation, as it suggests that similar cognitive mechanisms might underlie both human and machine biases. Moreover, the findings raise ethical concerns about how these biases might influence real-world decisions made or supported by LLMs, emphasizing the need for continued scrutiny and development of more robust alignment techniques. The research also opens new avenues for understanding how biases evolve in artificial systems, offering a unique lens through which psychologists can explore the dynamics of stereotyping and discrimination in both human and machine contexts.

Ethics and Psychology

Resource Pages

Friday, August 8, 2025

Explicitly unbiased large language models still form biased associations