Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care


Sunday, April 6, 2025

Large Language Models Pass the Turing Test

Jones, C. R., & Bergen, B. K. (2025, March 31). arXiv.

Abstract

We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.

Here are some thoughts:

The study highlights significant advances in AI technology, particularly in the conversational capabilities of large language models (LLMs), as demonstrated by their ability to pass the Turing test. When given specific persona prompts, GPT-4.5 and LLaMa-3.1-405B achieved win rates of 73% and 56%, respectively: GPT-4.5 was judged to be human more often than the actual human participants it was compared against, while LLaMa-3.1 was statistically indistinguishable from them. This marks the first robust empirical evidence that an AI system can pass the standard three-party Turing test, a major milestone in AI development. The success of these models underscores their ability to convincingly mimic human conversation, blurring the line between human and machine interaction.
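The phrases "significantly more often than chance" and "win rates significantly below chance" can be made concrete with a simple binomial test. The sketch below is purely illustrative and assumes a hypothetical sample of 100 interrogations per condition; the paper's actual sample sizes and statistical analysis may differ.

```python
from math import comb

def binom_two_sided_p(wins: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value: the probability, under the null
    hypothesis that the true win rate is p0, of an outcome at least as
    unlikely as the observed number of wins."""
    def pmf(k: int) -> float:
        return comb(n, k) * p0**k * (1 - p0) ** (n - k)

    observed = pmf(wins)
    # Sum probabilities of every outcome no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed + 1e-12)

# Hypothetical numbers: 73 "human" verdicts out of 100 trials (GPT-4.5-like)
# versus 56 out of 100 (LLaMa-3.1-like).
print(binom_two_sided_p(73, 100))  # far below 0.05: inconsistent with chance
print(binom_two_sided_p(56, 100))  # above 0.05: consistent with chance
```

Under these illustrative numbers, a 73% win rate is extremely unlikely if interrogators were guessing at random, while 56% is not distinguishable from a coin flip, which mirrors the qualitative pattern the abstract reports.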

A key factor in their performance was the use of tailored prompts. Models instructed to adopt a humanlike persona (for example, a young, introverted individual familiar with internet culture) significantly outperformed those without such guidance. This adaptability demonstrates the flexibility of modern LLMs and their capacity to refine behavior based on contextual instructions. In contrast, the baseline systems, ELIZA and GPT-4o without a persona prompt, performed poorly, with win rates of just 23% and 21%, highlighting the rapid progress in AI conversational abilities. The study also challenges the "ELIZA effect," suggesting that contemporary LLMs succeed not through superficial imitation but by replicating nuanced human conversational patterns.

Human interrogators often relied on social and emotional cues, such as humor, personality, and linguistic style, rather than traditional measures of intelligence to distinguish humans from AI. Although some strategies, like "jailbreak" prompts or probing for inconsistencies, proved effective, most participants struggled to reliably identify the AI, further emphasizing the sophistication of these models. The findings suggest that LLMs can now effectively substitute for humans in short conversations, which presents both opportunities and risks. On one hand, this capability could enhance customer service, education, and entertainment. On the other, it poses ethical risks, including the potential for AI to be used in deception, social engineering, or the spread of misinformation.

Looking ahead, the study calls for further research into longer interactions, expert interrogators, and cultural common ground to better understand the limits of AI’s humanlike abilities. It also reignites philosophical debates about whether passing the Turing test truly reflects intelligence or merely advanced imitation. As AI continues to evolve, these advancements underscore the need for careful consideration of their societal impact, ethical implications, and the future of human-AI interaction.