Atari, M., Xue, M. J., et al. (2023, September 22). https://doi.org/10.31234/osf.io/5b26t
Abstract
Large language models (LLMs) have recently made vast advances in both generating and analyzing textual data. Technical reports often compare LLMs’ outputs with “human” performance on various tests. Here, we ask, “Which humans?” Much of the existing literature largely ignores the fact that humans are a cultural species with substantial psychological diversity around the globe that is not fully captured by the textual data on which current LLMs have been trained. We show that LLMs’ responses to psychological measures are an outlier compared with large-scale cross-cultural data, and that their performance on cognitive psychological tasks most resembles that of people from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies but declines rapidly as we move away from these populations (r = -.70). Ignoring cross-cultural diversity in both human and machine psychology raises numerous scientific and ethical issues. We close by discussing ways to mitigate the WEIRD bias in future generations of generative language models.
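The headline statistic (r = -.70) comes from correlating, per country, how far that country sits from the WEIRD anchor with how closely the LLM's responses match that country's survey responses. A minimal sketch of that kind of analysis, using made-up illustrative numbers (the arrays below are NOT the paper's data, and the variable names are my own assumptions):

```python
from statistics import mean

# Hypothetical illustration only: each entry is one country.
# cultural_distance: distance from the WEIRD anchor (e.g., the USA).
# llm_similarity: how closely the LLM's answers match that country's
# human survey responses. Both arrays are invented for this sketch.
cultural_distance = [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
llm_similarity = [0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.40]

def pearson_r(x, y):
    """Plain Pearson correlation coefficient, stdlib only."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(cultural_distance, llm_similarity)
print(f"r = {r:.2f}")  # strongly negative: similarity drops with distance
```

The sign is what matters: a large negative r says the further a population is from WEIRD societies, the worse the LLM tracks its responses, which is the paper's core quantitative claim.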
My summary:
The authors argue that much of the existing literature on LLMs largely ignores the fact that humans are a cultural species with substantial psychological diversity around the globe. This diversity is not fully captured by the textual data on which current LLMs have been trained.
For example, LLMs are often evaluated on tasks such as answering trivia questions, generating creative text formats, and translating between languages. These evaluations, however, are biased toward the cultural context of the data on which the LLMs were trained: a model may track the responses of people from some cultures closely while diverging sharply from those of others.
Atari and his co-authors argue that it is important to be aware of this bias when interpreting the results of LLM evaluations. They also call for more research on the performance of LLMs across different cultures and demographics.
One specific example they give is the use of LLMs to generate creative text formats, such as poems and code. They argue that LLMs that are trained on a dataset of text from English-speaking countries are likely to generate creative text that is more culturally relevant to those countries. This could lead to bias and discrimination against people from other cultures.
Atari and his co-authors conclude by calling for more research on the following questions:
- How do LLMs perform on different tasks across different cultures and demographics?
- How can we develop LLMs that are less biased towards the cultural context of their training data?
- How can we ensure that LLMs are used in a way that is fair and equitable for all people?