Abstract
Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under an appropriate transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that a close fit to linear scaling laws occurs in only a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
Summary
This paper challenges the reliability of downstream scaling laws—the idea that you can predict how well a large language model will perform on specific tasks (like question answering or reasoning) based on its pretraining loss at smaller scales. While some prior work claims a consistent, often linear relationship between pretraining loss and downstream performance, this study shows that such predictable scaling is actually the exception, not the rule.
Key findings:
- Only 39% of 46 evaluated tasks showed smooth, predictable (linear-like) scaling.
- The rest exhibited irregular behaviors: inverse scaling (performance gets worse as models grow), nonmonotonic trends, high noise, no trend, or sudden "breakthrough" improvements (emergence).
 - Validation dataset choice matters: switching the corpus used to compute pretraining perplexity can flip conclusions about which model or pretraining data is better.
- Experimental details matter: even with the same task and data, small changes in setup (e.g., prompt format, number of answer choices) can qualitatively change scaling behavior.
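
The "linear-like scaling under transformation" tested in the findings above can be sketched concretely. The snippet below is an illustrative assumption, not the paper's actual method: it uses a logit transform of task accuracy (one common choice for mapping bounded accuracy onto an unbounded scale), fits a line against pretraining loss, and uses R² as a stand-in for whatever goodness-of-fit criterion a given study applies. The synthetic data is constructed to be exactly sigmoidal, so here the fit succeeds; the paper's point is that on real benchmarks it often does not.

```python
import numpy as np

def fit_linear_scaling(losses, accuracies, eps=1e-6):
    """Fit accuracy ~ sigmoid(a * loss + b), i.e. a linear trend in logit space.

    Returns the slope, intercept, and R^2 of the fit. A high R^2 corresponds
    to the "smooth, predictable scaling" case; a low R^2 to the irregular
    behaviors listed above (noise, nonmonotonicity, no trend).
    """
    losses = np.asarray(losses, dtype=float)
    acc = np.clip(np.asarray(accuracies, dtype=float), eps, 1 - eps)
    y = np.log(acc / (1 - acc))            # logit-transformed accuracy
    a, b = np.polyfit(losses, y, deg=1)    # least-squares linear fit
    pred = a * losses + b
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot               # near 1 => "linear under transformation"
    return a, b, r2

# Hypothetical data: accuracy improves smoothly as pretraining loss drops,
# generated to lie exactly on a sigmoid so the linear fit is recoverable.
losses = np.array([3.2, 3.0, 2.8, 2.6, 2.4])
accs = 1 / (1 + np.exp(2.0 * losses - 6.0))
a, b, r2 = fit_linear_scaling(losses, accs)
```

On this constructed data the fit recovers the generating slope and intercept almost exactly; the study's finding is that only about 39% of real task/setting pairs look like this.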
 
Conclusion: Downstream scaling laws are context-dependent and fragile. Researchers and practitioners should not assume that linear scaling holds universally; they must validate scaling behavior in their own specific settings before relying on extrapolations.
