Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care


Tuesday, October 3, 2023

Emergent analogical reasoning in large language models

Webb, T., Holyoak, K. J. & Lu, H. Nat. Hum. Behav. (2023).


The recent advent of large language models has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here we performed a direct comparison between human reasoners and a large language model (the text-davinci-003 variant of Generative Pre-trained Transformer (GPT)-3) on a range of analogical tasks, including a non-visual matrix reasoning task based on the rule structure of Raven’s Standard Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings; preliminary tests of GPT-4 indicated even better performance. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.


We have presented an extensive evaluation of analogical reasoning in a state-of-the-art large language model. We found that GPT-3 appears to display an emergent ability to reason by analogy, matching or surpassing human performance across a wide range of problem types. These included a novel text-based problem set (Digit Matrices) modeled closely on Raven’s Progressive Matrices, where GPT-3 both outperformed human participants and captured a number of specific signatures of human behavior across problem types. Because we developed the Digit Matrix task specifically for this evaluation, we can be sure GPT-3 had never been exposed to problems of this type, and therefore was performing zero-shot reasoning. GPT-3 also displayed an ability to solve analogies based on more meaningful relations, including four-term verbal analogies and analogies between stories about naturalistic problems.
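To make the task format concrete, here is a minimal sketch of what a text-based matrix item of this kind might look like. The rule (a constant numeric progression) and the bracketed formatting are illustrative assumptions for this sketch, not the paper's actual Digit Matrices materials, which use a richer set of Raven's-style rules.

```python
# Sketch of a text-based matrix-reasoning item in the spirit of the
# Digit Matrices task (hypothetical rule and formatting; the paper's
# actual problem generator is described in its methods).

def make_progression_item(start=1, step=2):
    """Build a 3x3 digit matrix governed by a constant progression
    rule, blank the final cell, and return the prompt text plus the
    correct answer."""
    # Fill the grid left-to-right, top-to-bottom with an arithmetic
    # progression: cell (r, c) holds start + step * (3r + c).
    matrix = [[start + step * (3 * r + c) for c in range(3)]
              for r in range(3)]
    answer = matrix[2][2]

    # Render each row as bracketed cells, e.g. "[1] [3] [5]".
    rows = [" ".join(f"[{matrix[r][c]}]" for c in range(3))
            for r in range(3)]
    # Replace the last cell of the last row with a blank to complete.
    rows[2] = rows[2].rsplit(" ", 1)[0] + " [ ? ]"
    return "\n".join(rows), answer

prompt, answer = make_progression_item()
print(prompt)            # three rows of bracketed digits, last cell "?"
print("answer:", answer) # the value the progression rule implies
```

A model solving such an item zero-shot must induce the underlying rule from the visible cells alone, which is the sense in which the paper treats success as abstract pattern induction rather than retrieval of a memorized problem.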

It is certainly not the case that GPT-3 mimics human analogical reasoning in all respects. Its performance is limited to the processing of information provided in its local context. Unlike humans, GPT-3 does not have long-term memory for specific episodes, and is therefore unable to search for previously encountered situations that might form useful analogies with a current problem. For example, GPT-3 can use the general story to guide its solution to the radiation problem, but as soon as its context buffer is emptied it reverts to giving its non-analogical solution to the problem; the system has learned nothing from processing the analogy. GPT-3’s reasoning ability is also limited by its lack of physical understanding of the world, as evidenced by its failure (in comparison with human children) to use an analogy to solve a transfer problem involving the construction and use of simple tools. GPT-3’s difficulty with this task is likely due at least in part to its purely text-based input, which lacks the multimodal experience necessary to build a more integrated world model.

But despite these major caveats, our evaluation reveals that GPT-3 exhibits a very general capacity to identify and generalize, in zero-shot fashion, relational patterns found within both formal problems and meaningful texts. These results are extremely surprising. It is commonly held that although neural networks can achieve a high level of performance within a narrowly defined task domain, they cannot robustly generalize what they learn to new problems in the way that human learners do. Analogical reasoning is typically viewed as a quintessential example of this human capacity for abstraction and generalization, allowing human reasoners to intelligently approach novel problems zero-shot.