Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care

Saturday, October 11, 2025

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

OpenAI. (2025).

We introduce GDPval, a benchmark designed to evaluate how well AI models perform economically valuable tasks in real-world settings. GDPval includes the majority of work activities defined by the U.S. Bureau of Labor Statistics for 44 occupations across the nine sectors that contribute most to U.S. GDP. The tasks in GDPval are based on the actual work of industry professionals who average 14 years of experience.

Our findings show that frontier AI models are improving on GDPval at a roughly linear rate over time. The strongest models now produce deliverables that are approaching the quality of work produced by industry experts. We also examine how pairing frontier models with human oversight could allow these tasks to be completed more quickly and at lower cost than by unaided experts.

Model performance improves further when reasoning effort, task context, and structured guidance are increased. To support future research on real-world AI capabilities, we are releasing a gold-standard subset of 220 tasks and providing a public automated grading service at evals.openai.com.

Here is my brief summary:

This paper introduces GDPval, a new benchmark developed by OpenAI to evaluate AI models on real-world, economically valuable tasks reflecting actual knowledge work across 44 occupations in the nine sectors that contribute most to U.S. GDP. Unlike traditional academic benchmarks, GDPval emphasizes realism, representativeness, and multi-modality: tasks are based on expert-validated work products that take professionals an average of 7 hours to complete. The evaluation uses pairwise comparisons by industry experts to measure AI performance, finding that top models such as Claude Opus 4.1 and GPT-5 are approaching human-level performance in some areas: Claude excels in aesthetics and formatting, while GPT-5 leads in accuracy and instruction-following. The authors open-source a 220-task "gold subset," provide an experimental automated grader, and analyze how factors such as reasoning effort, prompting, and scaffolding affect model performance, highlighting both the potential and the current limitations of AI in professional workflows.
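To make the pairwise-comparison protocol concrete, here is a minimal illustrative sketch (not OpenAI's actual grading code) of how a model's win rate could be computed from expert judgments. Each judgment records whether the expert preferred the model's deliverable ("win"), the human expert's deliverable ("loss"), or rated them comparable ("tie"); counting ties as half a win is one common convention, assumed here for illustration.

```python
from collections import Counter

def win_rate(judgments):
    """Win rate over pairwise judgments, with ties counted as half a win.

    judgments: a list of strings, each "win", "loss", or "tie",
    one per task where the model's output was compared against
    a human expert's deliverable.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments provided")
    return (counts["win"] + 0.5 * counts["tie"]) / total

# Hypothetical expert judgments for one model across eight tasks
judgments = ["win", "loss", "tie", "win", "loss", "loss", "tie", "win"]
print(win_rate(judgments))  # 0.5
```

A win rate near 0.5 under this convention would mean the model's deliverables are, on average, rated comparable to the industry experts' own work, which is the yardstick the paper uses when it says frontier models are "approaching" expert quality.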