Scaling Laws Open New Dimensions When Old Ones Stall

When pre-training scaling shows diminishing returns, the AI industry does not plateau; it discovers new dimensions to scale along, just as the semiconductor industry found parallelism, packaging, and specialization when Dennard scaling ended.

"Despite the slowing down of one trend, the industry collectively remains moving forward at a breakneck pace due to other new emerging paradigms that are ripe for scaling and expansion." AJ Kourabi

The pre-training scaling wall is real but overstated. High-quality text data is being exhausted, models are trained below Chinchilla-optimal ratios, and the remaining web data is increasingly low-quality. (Chinchilla-optimal: DeepMind's 2022 paper by Hoffmann et al. showed that for a given compute budget, model parameters and training tokens should be scaled roughly equally; prior practice, e.g. GPT-3, massively over-parameterized models relative to their training data, a finding that reset industry norms.) But these constraints have catalyzed innovation across multiple new scaling dimensions. Inference-time compute, exemplified by OpenAI's o1 and DeepSeek's R1, trades more computation at serving time for dramatically better reasoning: a new scaling law entirely separate from pre-training. (Released in late 2024 and early 2025 respectively, these models use chain-of-thought reasoning at inference time, spending variable compute per query rather than a fixed forward pass; R1 achieved comparable reasoning performance at a fraction of o1's training cost using reinforcement learning without supervised fine-tuning.) Synthetic data, where frontier models generate training data for the next generation, has created a self-improving flywheel: Anthropic used Claude 3.5 Opus not for release but to generate synthetic data that made Claude 3.5 Sonnet significantly better at the same inference cost.
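The Chinchilla-optimal allocation mentioned above can be sketched numerically. The sketch below uses two common rules of thumb, not exact constants from the paper: training compute C ≈ 6·N·D (parameters × tokens), and roughly 20 training tokens per parameter at the optimum.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20):
    """Split a FLOPs budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)). Both relations are
    widely used approximations, not exact paper constants.
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.76e23 FLOPs) recovers roughly
# 70B parameters and 1.4T tokens under these approximations.
n, d = chinchilla_allocation(5.76e23)
```

Because both N and D grow as the square root of compute, doubling the budget scales parameters and tokens by about 1.4x each, which is the "scaled roughly equally" finding in concrete form.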

Post-training techniques (supervised fine-tuning, reinforcement learning from human feedback, rejection sampling, and model-as-judge scoring) each contain their own scaling laws. Rejection sampling generates many candidate responses, scores them with a reward model or evaluator, and keeps only the best for further training, converting cheap inference compute into high-quality training signal. Better judge models produce higher-quality datasets, which produce better models, which become better judges. Meta has 100x more proprietary data than exists on the public internet; YouTube receives 720,000 hours of video uploads daily. Training on the quadrillions of tokens available from video represents yet another unexplored dimension requiring massive new compute investment.
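The rejection-sampling loop described above reduces to a best-of-k selection. In this sketch, `generate` and `score` are toy stand-ins for a policy model and a reward model (assumptions for illustration, not real APIs); in practice `score` would be a learned reward model or judge model.

```python
import random

def rejection_sample(prompt, generate, score, k=16):
    """Best-of-k rejection sampling: draw k candidate responses,
    score each, and keep only the top-scoring one as a training
    example. Inference compute (k generations) buys data quality."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=score)

# Toy stand-ins: a "model" that emits responses with random quality,
# and a "reward model" that reads the quality back out.
random.seed(0)
def generate(prompt):
    return f"{prompt} -> answer quality {random.random():.3f}"

def score(text):
    return float(text.rsplit(" ", 1)[-1])

best = rejection_sample("2+2?", generate, score, k=8)
```

Raising k improves the expected score of the kept sample but with diminishing returns, which is why this step has a scaling law of its own.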

The parallel to semiconductors is exact. Just as predicting the "end of Moore's Law" by tracking clock speed missed the shift to multi-core and specialized accelerators, predicting the "end of AI scaling" by tracking pre-training loss curves misses the shift to inference-time compute, synthetic data, and architectural innovation.

Takeaway: Scaling laws do not die; they metamorphose, and the builders who recognize new scaling dimensions earliest capture disproportionate advantage.


See also: Dennard Scaling Ended and Everything Changed | The Bitter Lesson: Scale Beats Cleverness | The Memory Wall Limits Everything