Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care


Wednesday, June 19, 2024

The Internal State of an LLM Knows When It's Lying

A. Azaria and T. Mitchell
Last revised: 17 October 2023


While Large Language Models (LLMs) have shown exceptional performance in various tasks, their (arguably) most prominent drawback is generating inaccurate or false information with a confident tone. In this paper, we hypothesize that the LLM's internal state can be used to reveal the truthfulness of a statement. We therefore introduce a simple yet effective method that uses the LLM's hidden-layer activations to determine the veracity of statements. To train and evaluate our method, we compose a dataset of true and false statements across six different topics. A classifier is then trained to detect whether each statement is true or false based on the LLM's activation values; specifically, the classifier receives as input the activation values the LLM produces for each statement in the dataset. Our experiments demonstrate that our method for detecting statement veracity significantly outperforms even few-shot prompting methods, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.

Here is a summary:

The research presents evidence that a large language model's (LLM's) internal state, specifically its hidden-layer activations, can reveal whether statements it generates or is given are true or false.

The approach is to train a classifier on the LLM's hidden-layer activations as it processes true and false statements. In experiments, this classifier achieved 71-83% accuracy in labeling statements as true or false, outperforming methods based solely on the probability the LLM assigns to a statement.
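To make the setup concrete, here is a minimal sketch of training such a probe. The activation vectors below are synthetic stand-ins (in the study they would be extracted from an LLM's hidden layers), and a logistic-regression probe is used in place of the paper's classifier; the dimensions, class separation, and model choice are all assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden-layer activations: in the paper, each vector
# would be the LLM's activation at a chosen hidden layer for one statement.
n, dim = 400, 64
true_dir = rng.normal(size=dim)          # hypothetical "truthfulness" direction
labels = rng.integers(0, 2, size=n)      # 1 = true statement, 0 = false
acts = rng.normal(size=(n, dim)) + 0.5 * np.outer(2 * labels - 1, true_dir)

# Train a simple probe to predict truth labels from activations.
X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

The point of the sketch is only the shape of the pipeline: one feature vector per statement, one binary label, and a held-out accuracy score comparable to the 71-83% figures reported above.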

While LLM probability is related to truthfulness, it is also influenced by sentence length and word frequencies. The trained classifier provides a more reliable way to detect truthfulness.
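The length and frequency confound can be seen with a toy unigram model. The word probabilities below are made up for illustration (no real LLM is involved): because each word contributes a negative log-probability, a longer sentence always scores lower in total, independent of whether it is true.

```python
import math

# Toy unigram "language model": hypothetical, hand-picked word probabilities.
unigram = {"the": 0.05, "cat": 0.01, "sat": 0.008, "on": 0.03,
           "a": 0.04, "mat": 0.005, "big": 0.01, "soft": 0.004}

def sentence_logprob(words):
    # Sum of per-word log-probabilities, analogous to how an LM scores a sequence.
    return sum(math.log(unigram[w]) for w in words)

short = ["the", "cat", "sat"]
long_sent = ["the", "big", "cat", "sat", "on", "a", "soft", "mat"]

# The longer sentence gets a lower total log-probability purely because it
# contains more words, regardless of the truth of either sentence.
print(sentence_logprob(short), sentence_logprob(long_sent))
```

This is why a classifier trained on internal activations can be more reliable than thresholding the raw probability the model assigns to a statement.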

The findings suggest that while LLMs can generate false information confidently, their internal representations encode signals about the veracity of statements. Leveraging these signals could help enhance the reliability of LLM outputs.

However, the approach was evaluated on a limited dataset of true/false statements drawn from six topics; how well it generalizes to arbitrary statements or other knowledge domains remains unclear from the study.