Paper status: completed

Scaling Laws for Neural Language Models

Published:01/23/2020
Original LinkPDF
Price: 0.100000
Price: 0.100000
Price: 0.100000
4 readers
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper shows that language model loss follows power-law scaling with model size, data, and compute, while depth/width effects are minimal. It reveals quantitative laws of overfitting and training speed, enabling optimal compute allocation favoring large, sample-efficient model

Abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is "Scaling Laws for Neural Language Models."

1.2. Authors

The authors of the paper are:

  • Jared Kaplan (Johns Hopkins University, OpenAI)

  • Sam McCandlish (OpenAI)

  • Tom Henighan (OpenAI)

  • Tom B. Brown (OpenAI)

  • Benjamin Chess (OpenAI)

  • Rewon Child (OpenAI)

  • Scott Gray (OpenAI)

  • Alec Radford (OpenAI)

  • Jeffrey Wu (OpenAI)

  • Dario Amodei (OpenAI)

    The research primarily comes from OpenAI, with Jared Kaplan also affiliated with Johns Hopkins University. This indicates a strong background in artificial intelligence research, particularly in large-scale language models, given OpenAI's known work in this area (e.g., GPT series).

1.3. Journal/Conference

This paper was published on arXiv, a preprint server, on 2020-01-23. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform in the academic community, especially in fields like AI and machine learning, for rapidly disseminating cutting-edge research before or concurrently with formal publication. Papers published on arXiv often gain significant traction and citations, and many are subsequently published in top-tier conferences or journals. Given the authors' affiliations with OpenAI, this paper quickly became a foundational work in understanding the scaling behavior of large language models.

1.4. Publication Year

2020

1.5. Abstract

This paper empirically investigates the cross-entropy loss performance of neural language models. The core finding is that performance scales as a power-law with model size (N), dataset size (D), and compute used for training, with these trends spanning over seven orders of magnitude. In contrast, other architectural details like network width or depth have minimal effects within a wide range. The authors establish simple equations that describe how overfitting depends on model/dataset size and how training speed depends on model size. These relationships provide a framework for optimally allocating a fixed compute budget, revealing that larger models are significantly more sample-efficient. Consequently, the most compute-efficient training strategy involves using very large models on a relatively modest amount of data and stopping significantly before full convergence.

The original source link is https://arxiv.org/abs/2001.08361v1. The PDF link is https://arxiv.org/pdf/2001.08361v1.pdf. This paper is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is understanding and predicting the performance of neural language models as they scale across various dimensions. In the rapidly advancing field of deep learning for natural language processing (NLP), models were growing in size and complexity, but a clear, quantitative understanding of how performance relates to key resources like model size, dataset size, and compute was lacking.

This problem is critically important because the development of state-of-the-art language models (like the Transformer-based models prevalent today) requires significant computational resources and time. Without a clear understanding of scaling behavior, researchers and engineers might:

  1. Allocate resources inefficiently: Spend too much time or compute on models that are too small or too large for the available data, or train them for suboptimal durations.

  2. Struggle with predictability: Be unable to reliably forecast the performance gains (or diminishing returns) from investing more into larger models or datasets.

  3. Over-engineer architectures: Focus excessively on intricate architectural details that might have marginal impact compared to sheer scale.

    Prior research often involved ad-hoc scaling or focused on specific aspects (e.g., impact of dataset size). The specific challenge was to find universal, predictive laws that govern language model performance across vast scales and various training conditions, guiding future research and development more systematically.

The paper's entry point is an empirical investigation into Transformer-based language models, aiming to identify fundamental power-law relationships between performance (measured by cross-entropy loss) and key scaling factors. The innovative idea is to systematically explore these relationships across many orders of magnitude to uncover predictable and actionable insights.

2.2. Main Contributions / Findings

The paper makes several significant contributions and presents key findings that have fundamentally influenced the development of large language models:

  1. Empirical Power-Law Scaling Laws: The primary contribution is the discovery that language model cross-entropy loss (a measure of performance) follows predictable power-law relationships with model size (N), dataset size (D), and compute (C) used for training. These laws hold consistently over six to seven orders of magnitude, providing a robust framework for performance prediction.

    • L(N)NαNL(N) \propto N^{-\alpha_N}
    • L(D)DαDL(D) \propto D^{-\alpha_D}
    • L(Cmin)CminαCminL(C_{min}) \propto C_{min}^{-\alpha_C^{min}}
  2. Weak Dependence on Architectural Details: Within a reasonable range, other architectural hyperparameters (e.g., network width vs. depth, number of attention heads) have minimal impact on performance when the total model size is fixed. This suggests that scale is far more critical than precise architectural choices for Transformer models.

  3. Universality of Overfitting and Training: The paper provides simple equations to model overfitting, showing its dependence on the ratio N0.74/DN^{0.74}/D. It also demonstrates that training curves follow predictable power-laws and can be extrapolated, implying a universality in the training dynamics.

  4. Optimal Allocation of Compute Budget: Based on these scaling laws, the authors determine the most compute-efficient way to allocate resources. For a fixed compute budget, optimal performance is achieved by training very large models on a relatively modest amount of data and stopping training significantly before convergence.

  5. Enhanced Sample Efficiency of Large Models: A crucial finding is that larger models are significantly more sample-efficient. They require fewer optimization steps and fewer data points to reach a given level of performance compared to smaller models.

  6. Predictive Framework: The derived scaling relations go beyond mere observation, offering a predictive framework for guiding research and development in large language models. They suggest that continued scaling will lead to better performance and sample efficiency.

    These findings solve the problem of inefficient resource allocation and lack of predictability in deep learning for language models. They provide a clear roadmap for how to effectively scale up Transformers, shifting the focus from hyper-parameter tuning to strategic scaling of model, data, and compute.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a beginner needs to grasp several fundamental concepts related to neural networks, language modeling, and deep learning practices.

  • Neural Language Models: At their core, neural language models are neural networks designed to understand and generate human language. Their primary task is often autoregressive language modeling, which means predicting the next word (or token) in a sequence given the preceding ones. For example, if given "The cat sat on the...", a language model tries to predict "mat", "rug", etc., based on probabilities. They learn statistical relationships and patterns in language by being trained on vast amounts of text data.
  • Transformer Architecture: The Transformer is a specific neural network architecture introduced in 2017 that revolutionized NLP. Unlike previous architectures like Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTMs) which process sequences sequentially, Transformers rely heavily on a mechanism called self-attention. This allows them to weigh the importance of different tokens in the input sequence when processing each token, enabling them to capture long-range dependencies in text much more effectively and in parallel. Key components include:
    • Self-Attention: A mechanism that allows the model to look at other positions in the input sequence to compute a representation for each token.
    • Feed-Forward Networks: Standard neural networks applied to each token independently and identically.
    • Encoder-Decoder Structure (or Decoder-Only): Transformers can have an encoder part (for understanding input) and a decoder part (for generating output), or be decoder-only for autoregressive language modeling. This paper focuses on decoder-only Transformer models.
  • Cross-entropy Loss: This is a standard loss function used in machine learning, particularly for classification and generative models. In language modeling, it quantifies how well a model predicts the next token. A lower cross-entropy loss value indicates better model performance, meaning the model's predicted probability distribution for the next token is closer to the true distribution (where the true token has a probability of 1). The goal during training is to minimize this loss.
    • Conceptual Definition: Cross-entropy loss measures the dissimilarity between two probability distributions: the true distribution of tokens in the training data and the distribution predicted by the model. It quantifies the "surprise" or "uncertainty" in the model's predictions. When the model is very confident about an incorrect prediction, the loss is high; when it's confident about a correct prediction, the loss is low.
    • Mathematical Formula: For a single token prediction, given a true token yy and the model's predicted probability distribution y^\hat{y} over the vocabulary, the cross-entropy loss is: $ L = -\sum_{i} y_i \log(\hat{y}i) $ For language modeling, yiy_i is typically 1 for the actual next token and 0 for all other tokens in the vocabulary. So, the formula simplifies to: $ L = -\log(\hat{y}{\text{true token}}) $
    • Symbol Explanation:
      • LL: The cross-entropy loss value.
      • yiy_i: The true probability of the ii-th token in the vocabulary. (Often 1 for the correct token, 0 otherwise).
      • y^i\hat{y}_i: The predicted probability of the ii-th token in the vocabulary by the model.
      • log\log: The natural logarithm.
      • y^true token\hat{y}_{\text{true token}}: The model's predicted probability for the actual next token in the sequence.
  • Power Laws: A power law describes a functional relationship between two quantities where one quantity varies as a power of another. Mathematically, it's expressed as y=axky = ax^k (or yxky \propto x^k), where xx and yy are variables, aa is a constant, and kk is the exponent. In this paper, power laws describe how loss (performance) changes with model size, dataset size, or compute. The exponent kk (e.g., αN\alpha_N) indicates the strength and direction of this relationship. For example, a negative exponent means that as xx increases, yy decreases (performance improves).
  • Compute (FLOPs, PF-days): Compute refers to the total number of arithmetic operations (like additions and multiplications) performed during training. FLOPs (Floating Point Operations) is a unit for this. PF-days (PetaFLOPS-days) is a large unit of compute, where one PetaFLOP-day equals 1015×24×3600=8.64×101910^{15} \times 24 \times 3600 = 8.64 \times 10^{19} FLOPs. It provides a standardized way to measure the computational cost of training deep learning models.
  • Model Size (N): This refers to the number of learnable parameters (weights and biases) in a neural network. Larger models have more parameters and thus higher capacity to learn complex patterns, but also require more compute and data. The paper specifically focuses on non-embedding parameters for cleaner scaling laws.
  • Dataset Size (D): This is the total number of tokens (words or sub-word units) in the training data. Larger datasets provide more examples for the model to learn from, potentially leading to better generalization.
  • Batch Size (B): During neural network training, parameters are updated based on the gradients computed from a small subset of the data called a batch. Batch size is the number of samples (e.g., sequences of tokens) processed in one training step. A larger batch size can lead to more stable gradient estimates but might require more memory and can change optimization dynamics.
  • Training Steps (S): A training step or parameter update refers to one iteration of computing gradients and updating model parameters using an optimizer (e.g., Adam). The total number of training steps is a measure of how long a model has been trained.
  • Overfitting: This occurs when a machine learning model learns the training data too well, capturing noise and specific patterns that are not representative of the underlying data distribution. As a result, the model performs poorly on unseen data (the test set) even if it has very low loss on the training set. The paper explores how to manage overfitting as model size and dataset size vary.
  • Sample Efficiency: A model is considered sample-efficient if it can achieve a certain level of performance using fewer training samples (data points) or fewer optimization steps. The paper highlights that larger models tend to be more sample-efficient.

3.2. Previous Works

The paper contextualizes its findings by referencing several important prior studies and concepts:

  • Transformer Architecture (Vaswani et al., 2017 [VSP+17]): This seminal work introduced the Transformer architecture, which is the foundation for the language models studied in this paper. The Transformer's reliance on self-attention and its ability to parallelize computations were key to enabling the scaling observed in this paper. The paper builds directly on this by investigating the scaling behavior of Transformer-based models.
  • GPT-2 and WebText (Radford et al., 2019 [RWC+19]): GPT-2 demonstrated unprecedented performance in language generation and unsupervised multitask learning. The WebText dataset, curated for GPT-2, is an extended version of the WebText2 dataset used in this paper. The success of GPT-2 highlighted the potential of scaling and provided motivation for a deeper understanding of scaling laws.
  • Empirical Model of Large-Batch Training (McCandlish et al., 2018 [MKAT18]): This work introduced the concept of a critical batch size (BcritB_{crit}) and provided an empirical theory for batch size dependence in training. It argued that for batch sizes up to BcritB_{crit}, training can scale efficiently without significant performance degradation, but beyond BcritB_{crit}, returns diminish. This paper leverages [MKAT18] to adjust for batch size effects and define minimal steps (SminS_{min}) and minimum compute (CminC_{min}) to achieve cleaner scaling laws.
  • Scaling of Deep Learning (Hestness et al., 2017 [HNA+17], 2019 [HAD19]): Earlier work by Hestness et al. also investigated scaling relationships between model size and data size. While this paper acknowledges these efforts, it notes a crucial difference: [HNA+17][HNA+17] found super-linear scaling of dataset size with model size, whereas the current paper finds a sub-linear scaling (DN0.74D \propto N^{0.74}) to avoid overfitting for Transformer language models. This distinction is significant for practical resource allocation.
  • LSTMs and Universal Transformers (Hochreiter & Schmidhuber, 1997; Dehghani et al., 2018 [DGV+18]): The paper compares Transformer performance against LSTMs (a type of Recurrent Neural Network) to demonstrate the superior scaling of Transformers, especially for longer contexts. It also compares to Universal Transformers, which re-use parameters and show slightly better performance per parameter but require more compute per parameter due to repeated operations.
  • BERT and XLNet (Devlin et al., 2018 [DCLT18], Yang et al., 2019 [YDY+19], Liuetal.,2019[LOG+19]Liu et al., 2019 [LOG+19]): These are other prominent Transformer-based models that achieved state-of-the-art results in NLP around the time of this paper, demonstrating the general applicability and power of the Transformer architecture.

3.3. Technological Evolution

The language modeling field has seen a rapid evolution, moving from simpler statistical models to Recurrent Neural Networks (RNNs) and LSTMs, and then to Transformers.

  • Early Models: Focused on n-grams and basic probabilistic models.

  • RNNs and LSTMs: Introduced the ability to process sequences and capture some long-range dependencies, but struggled with very long sequences due to vanishing/exploding gradients and sequential processing bottlenecks.

  • Transformers: Marked a significant leap by introducing self-attention, enabling parallel computation, capturing longer-range dependencies, and scaling to much larger model sizes. This architecture became the backbone for highly influential models like BERT, GPT, and XLNet.

    This paper's work fits squarely within the Transformer era, providing the scientific underpinnings for the rapid scaling trends observed. It moves beyond simply building larger models to systematically understanding why and how they scale, thus guiding the next generation of large language models.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper offers several core differences and innovations:

  1. Comprehensive Scaling Laws: While previous works (e.g., [HNA+17][HNA+17]) studied aspects of scaling, this paper provides a more comprehensive and unified power-law framework across model size (N), dataset size (D), and compute (C). It systematically disentangles their individual and combined effects on cross-entropy loss.

  2. Focus on Non-Embedding Parameters: The paper explicitly shows that excluding embedding parameters from the model size (N) leads to significantly cleaner and more universal scaling laws (Figure 6). This is a practical distinction that simplifies the analysis of scaling.

  3. Predictive Overfitting Model: The paper develops a specific equation (1.5) that quantifies the dependence of overfitting on both NN and DD, providing concrete guidance on how much data is needed to prevent overfitting for a given model size (i.e., DN0.74D \propto N^{0.74}). This is more detailed than general observations about overfitting.

  4. Optimal Compute Allocation: A key innovation is deriving a strategy for the optimal allocation of a fixed compute budget. It quantitatively demonstrates that the most compute-efficient approach involves training very large models with modest data and early stopping, which contrasts with traditional approaches of training smaller models to convergence.

  5. Sample Efficiency Emphasis: The paper strongly highlights the superior sample efficiency of larger models, providing empirical evidence that they require fewer samples and steps to reach comparable performance. This insight has profound implications for data requirements and training strategies.

  6. Empirical Rigor across Scales: The study spans unprecedented scales (seven orders of magnitude for compute), lending strong empirical weight and universality to its power-law findings, which was not as extensively demonstrated in prior work.

    In essence, this paper provides a quantitative, predictive "thermodynamics" for Transformer language models, whereas previous work might be considered more qualitative or focused on specific "microscopic" aspects.

4. Methodology

4.1. Principles

The core idea of the method used in this paper is to empirically discover and characterize scaling laws that govern the performance of neural language models (specifically, Transformer-based models) as key resources—model size, dataset size, and compute—are varied. The underlying theoretical basis is the observation that many complex systems in nature and technology exhibit power-law relationships between macroscopic properties. The intuition is that if such fundamental relationships exist for deep learning models, they can provide a universal, predictive framework for understanding and guiding their development, independent of many specific architectural details.

The methodology focuses on measuring the cross-entropy loss (a standard metric for language model performance) and observing how it changes as different factors are systematically scaled up or down, often across many orders of magnitude. The goal is to find simple, quantitative relationships (the scaling laws) that allow for accurate predictions of performance and optimal resource allocation.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology is built upon extensive empirical experimentation, systematic variation of key factors, and fitting power-law functions to the observed performance.

4.2.1. Model Architecture and Parameterization

The paper primarily uses decoder-only Transformer models. To systematically study scaling laws, the authors define a clear way to count model parameters and estimate compute.

  • Transformer Hyperparameters: The architecture is parameterized by:

    • n_layer: Number of layers.
    • d_model: Dimension of the residual stream (the main data pathway within the Transformer).
    • d_ff: Dimension of the intermediate feed-forward layer.
    • d_attn: Dimension of the attention output.
    • n_heads: Number of attention heads per layer.
    • n_ctx: Number of tokens in the input context (usually 1024).
  • Model Size (NN) Definition: A crucial aspect is the definition of model size NN. The authors define NN as the number of non-embedding parameters, excluding vocabulary and positional embeddings. This choice leads to significantly cleaner scaling laws (as shown in Figure 6, which will be discussed later). The approximate formula for NN is: $ N \approx 2 d_{model} n_{layer} (2 d_{attn} + d_{ff}) $ With the standard Transformer configuration where dattn=dff/4=dmodeld_{attn} = d_{ff} / 4 = d_{model}, this simplifies to: $ N = 12 n_{layer} d_{model}^2 $

    • Symbol Explanation:
      • NN: The number of non-embedding parameters in the Transformer model.
      • dmodeld_{model}: The dimensionality of the model's internal representations.
      • nlayern_{layer}: The number of Transformer layers.
      • dattnd_{attn}: The dimensionality of the attention output.
      • dffd_{ff}: The dimensionality of the feed-forward layer's hidden state.
  • Compute (CC) Estimation: The total non-embedding training compute CC is estimated based on the model size, batch size, and training steps. $ C \approx 6 N B S $ This formula accounts for the forward pass (roughly 2N FLOPs per token) and backward pass (approximately twice the forward pass compute), multiplied by the batch size and steps. The numerical values for compute are quoted in PF-days.

    • Symbol Explanation:
      • CC: Estimated total non-embedding compute.
      • NN: Number of non-embedding parameters.
      • BB: Batch size (number of tokens per batch).
      • SS: Number of training steps (parameter updates).
      • 6: A constant factor representing the FLOPs per parameter per token for both forward and backward passes.

4.2.2. Training Procedures

Models were trained using the Adam optimizer [KB14] or Adafactor [SS18] for larger models due to memory constraints. Training runs typically lasted for 2.5×1052.5 \times 10^5 steps with a batch size of 512 sequences of 1024 tokens. Various learning rates and schedules were experimented with, but the final loss was found to be largely independent of the schedule, provided it included a warmup period and a cosine decay.

4.2.3. Datasets

The primary training dataset was WebText2, an extended version of the WebText dataset [RWC+19][RWC+19].

  • Source: Outbound links from Reddit with a minimum of 3 karma, extracted using Newspaper3k.
  • Scale: 20.3 million documents, 96 GB of text, 1.62×10101.62 \times 10^{10} words, yielding 2.29×10102.29 \times 10^{10} tokens after byte-pair encoding [SHB15] with a vocabulary size of 50257.
  • Test Set: 6.6×1086.6 \times 10^8 tokens reserved from WebText2 for testing.
  • Other Test Distributions: Books Corpus [ZKZ+15][ZKZ+15], Common Crawl [Fou], English Wikipedia, and Internet Books were also used for generalization testing.

4.2.4. Empirical Results and Basic Power Laws (Section 3)

The study involved training a wide variety of models, varying model size (768 to 1.5 billion non-embedding parameters), dataset size (22 million to 23 billion tokens), model shape (depth, width, attention heads, feed-forward dimension), context length, and batch size.

  • Approximate Transformer Shape and Hyperparameter Independence: The paper found that Transformer performance depends very weakly on shape parameters (n_layer, n_heads, d_ff) when the total non-embedding parameter count NN is held fixed (Figure 5). This suggests that overall scale is more important than specific architectural ratios.
  • Performance with Non-Embedding Parameter Count (NN): When models are trained to near convergence on sufficiently large datasets (effectively the infinite data limit), the test loss (LL) follows a power-law relationship with NN. $ L ( N ) \approx \left( { \frac { N _ { c } } { N } } \right) ^ { \alpha _ { N } } $ The fitted values are αN0.076\alpha_N \approx 0.076 and Nc8.8×1013N_c \approx 8.8 \times 10^{13} non-embedding parameters. The constant NcN_c is dependent on vocabulary size and tokenization and does not have a fundamental meaning. This relationship is shown in Figure 1 (middle panel) and Figure 6. Critically, using non-embedding parameters provides a much cleaner trend compared to including embedding parameters (Figure 6).
  • Performance with Dataset Size (DD): For models with sufficient capacity trained with a limited dataset and early stopping, the test loss (LL) scales as a power-law with dataset size (DD). $ L ( D ) \approx \left( \frac { D _ { c } } { D } \right) ^ { \alpha _ { D } } $ The fitted values are αD0.095\alpha_D \approx 0.095 and Dc5.4×1013D_c \approx 5.4 \times 10^{13} tokens. This relationship is shown in Figure 1 (right panel).
  • Performance with Compute (CC): When compute is limited, with a large enough dataset and an optimally sized model, the loss (LL) scales as a power-law with compute (CC). $ L ( C ) \approx \left( { \frac { C _ { c } } { C } } \right) ^ { \alpha _ { C } } $ The fitted values are αC0.057\alpha_C \approx 0.057 and Cc1.6×107C_c \approx 1.6 \times 10^7 PF-days (this is a naive fit before adjusting for optimal batch size). This relationship is shown in Figure 1 (left panel).

The following figure (Figure 1 from the original paper) summarizes these basic power-law relationships.

Figure 1 Language modeling performance improves smoothly as we increase the model size, datasetset size, and amount of compute used for training.For optimal performance all three factors must be scal… 该图像是三幅折线图组成的图表,展示了语言模型测试损失(Test Loss)与训练计算量(Compute)、数据集大小(Dataset Size)以及模型参数数量(Parameters)之间的关系。图中均采用对数坐标,测试损失随着这三者的增加呈现出明显的幂律下降趋势。对应的幂律关系用公式表示为:L=(Cmin/2.3108)0.050L = (C_{\min}/2.3 \cdot 10^8)^{-0.050}L=(D/5.41013)0.095L = (D/5.4 \cdot 10^{13})^{-0.095}L=(N/8.81013)0.076L = (N/8.8 \cdot 10^{13})^{-0.076},其中 LL 是测试损失,CC 是计算量,DD 是数据集大小,NN 是模型参数量。图表反映出在其他因素不成为瓶颈时,这三者均对模型性能有平滑且持续的提升作用。

4.2.5. Charting the Infinite Data Limit and Overfitting (Section 4)

This section integrates the scaling laws for NN and DD to model overfitting.

  • Proposed L(N,D) Equation: The paper proposes a combined scaling law for test loss that depends simultaneously on model size NN and dataset size DD: $ L ( N , D ) = \left[ \left( \frac { N _ { c } } { N } \right) ^ { \frac { \alpha _ { N } } { \alpha _ { D } } } + \frac { D _ { c } } { D } \right] ^ { \alpha _ { D } } $

    • Symbol Explanation:
      • L(N,D): The cross-entropy loss as a function of model size NN and dataset size DD.
      • Nc,αNN_c, \alpha_N: Parameters from the loss vs. model size power law.
      • Dc,αDD_c, \alpha_D: Parameters from the loss vs. dataset size power law.
    • Rationale: This formula satisfies three principles:
      1. It accounts for vocabulary and tokenization changes through NcN_c and DcD_c.
      2. It correctly approaches L(D) when NN \to \infty (infinite capacity model) and L(N) when DD \to \infty (infinite data).
      3. It has an analytic expansion in 1/D1/D with integer powers, which is motivated by overfitting scaling with variance (1/D1/\sqrt{D}) or signal-to-noise ratio.
  • Results for L(N,D): The fit to this equation using empirical data yields the following parameters: The following are the results from Table 2 of the original paper:

    Parameter αN\alpha_N αD\alpha_D NcN_c DcD_c
    Value 0.076 0.103 6.4×10136.4 \times 10^{13} 1.8×10131.8 \times 10^{13}

    The fit is excellent, except for very small datasets where overfitting occurs extremely early. The universality of overfitting is demonstrated by showing that the extent of overfitting (δL=L(N,D)/L(N,)1\delta L = L(N,D)/L(N,\infty) - 1) depends predominantly on the ratio NαNαD/DN^{\frac{\alpha_N}{\alpha_D}}/D. This leads to a practical rule for avoiding significant overfitting: $ D \gtrsim ( 5 \times 1 0 ^ { 3 } ) N ^ { 0 . 7 4 } $ This implies that dataset size can grow sub-linearly with model size to prevent overfitting. The following figure (Figure 9 from the original paper) visualizes the L(N,D) relationship.

    Figure 9 The early-stopped test loss `L ( N , D )` depends predictably on the dataset size \(D\) and model size \(N\) EY according to Equation (1.5). Left: For large \(D\) , performance is a straight power… 该图像是论文中图9的图表,展示了早停测试损失L(N,D)关于数据集大小DD与模型大小NN的关系。左图表明大数据下损失随NN以幂律下降,小数据下过拟合导致性能停滞;右图显示过拟合程度主要取决于NαNαD/DN^{\frac{\alpha_N}{\alpha_D}}/D,并用曲线拟合该关系。

4.2.6. Scaling Laws with Model Size and Training Time (Section 5)

This section focuses on how loss scales with model size NN and training time (measured by steps SS).

  • Adjustment for Training at Bcrit(L)B_{crit}(L): The paper incorporates findings from [MKAT18] regarding the critical batch size (BcritB_{crit}) to normalize training steps. BcritB_{crit} is the batch size at which training achieves a good balance between time and compute efficiency. It was found that BcritB_{crit} depends on the current loss LL, but not directly on model size. The relationship between training steps (SS) and data examples processed (E=BSE = B S) for a fixed loss LL is given by: $ \left( \frac { S } { S _ { \mathrm { m i n } } } - 1 \right) \left( \frac { E } { E _ { \mathrm { m i n } } } - 1 \right) = 1 $

    • Symbol Explanation:
      • SS: Number of training steps taken.
      • SminS_{min}: Minimum number of steps required to reach loss LL (at very large batch size).
      • EE: Total data examples processed (B×SB \times S).
      • EminE_{min}: Minimum number of data examples required to reach loss LL (at very small batch size). This relation defines the critical batch size as: $ B _ { \mathrm { c r i t } } ( L ) \equiv \frac { E _ { \mathrm { m i n } } } { S _ { \mathrm { m i n } } } $ The empirical observation is that Bcrit(L)B_{crit}(L) follows a power-law in the loss: $ B _ { \mathrm { c r i t } } ( L ) \approx \frac { B _ { \ast } } { L ^ { 1 / \alpha _ { B } } } $ The fitted values are B2×108B_* \approx 2 \times 10^8 and αB0.21\alpha_B \approx 0.21. This means BcritB_{crit} increases as loss decreases (performance improves). The following figure (Figure 10 from the original paper) shows this relationship.

    Figure 10 The critical batch size \(B _ { \\mathrm { c r i t } }\) follows a power law in the loss as performance increase, and does not depend directly on the model size. We find that the critical batc… 该图像是图表,展示了图10中临界批量大小 BcritB_{crit} 与训练损失之间的关系。BcritB_{crit} 随损失的降低呈幂律上升,约每降低13%的损失,临界批量大小翻倍。图中还包括基于梯度噪声规模的预测公式 Bcrit=2.1×108 tokensL4.8B_{crit} = 2.1 \times 10^8 \textrm{ tokens} \cdot L^{-4.8}

    To account for actual training batch sizes (BB) not being BcritB_{crit}, the minimum steps (SminS_{min}) and minimum compute (CminC_{min}) are adjusted: $ S _ { \mathrm { m i n } } ( S ) \equiv \frac { S } { 1 + B _ { \mathrm { c r i t } } ( L ) / B } \qquad \mathrm { ( m i n i m u m ~ s t e p s , ~ a t ~ } B \gg B _ { \mathrm { c r i t } } ) $ $ C _ { \mathrm { m i n } } ( C ) \equiv \frac { C } { 1 + B / B _ { \mathrm { c r i t } } ( L ) } \qquad \mathrm { ( m i n i m u m ~ c o m p u t e ,a t } ~ B \ll B _ { \mathrm { c r i t } } ) $

    • Symbol Explanation:
      • Smin(S)S_{min}(S): The effective minimum number of steps if training was done at a very large batch size.
      • Cmin(C)C_{min}(C): The effective minimum compute if training was done at a very small batch size.
      • BB: The actual batch size used in training.
      • Bcrit(L)B_{crit}(L): The critical batch size for the current loss LL.
      • SS: Actual training steps.
      • CC: Actual compute used.
  • Results for L(N,Smin)L(N, S_{min}): Using the adjusted S_min, the loss as a function of model size NN and training time (SminS_{min}) is modeled by: $ L ( N , S _ { \mathrm { m i n } } ) = \left( \frac { N _ { c } } { N } \right) ^ { \alpha _ { N } } + \left( \frac { S _ { c } } { S _ { \mathrm { m i n } } } \right) ^ { \alpha _ { S } } $ The fitted parameters are: The following are the results from Table 3 of the original paper:

Parameter αN\alpha_N αS\alpha_S NcN_c ScS_c
Value 0.077 0.76 6.5×10136.5 \times 10^{13} 2.1×1032.1 \times 10^3
This equation fits the `learning curves` well, especially after an initial transient period. The `power-law` dependence on SminS_{min} suggests `optimizer dynamics` and `loss landscape` properties are roughly independent of `model size`.
The following figure (Figure 11 from the original paper) illustrates how performance follows `L(N,S)` under fixed compute or training steps.

![Figure 11 When we hold either total compute or number of training steps fixed, performance follows `L ( N , S )` from Equation (5.6). Each value of compute budget has an associated optimal model size…](/files/papers/6902de2659708f78ec6fae78/images/11.jpg)
*该图像是论文中图11的图表,展示了在固定总计算预算或训练步数时,测试损失与模型参数量的关系。左图显示不同计算预算下的测试损失,右图展示不同训练步数下的测试损失,体现了性能随模型规模和训练步数的变化趋势,符合公式 `L ( N , S )`。*

4.2.7. Optimal Allocation of the Compute Budget (Section 6)

This section leverages the derived scaling laws to determine the most efficient way to spend a given compute budget.

  • Optimal Performance and Allocations: By adjusting for batch size inefficiency to use CminC_{min}, a cleaner power-law for loss vs. compute is obtained: $ L ( C _ { \mathrm { m i n } } ) = \left( { \frac { C _ { c } ^ { \mathrm { m i n } } } { C _ { \mathrm { m i n } } } } \right) ^ { \alpha _ { C } ^ { \mathrm { m i n } } } $ The fitted values are αCmin0.050\alpha_C^{min} \approx 0.050 and Ccmin3.1×108C_c^{min} \approx 3.1 \times 10^8 PF-days. This is the most reliable scaling law for extrapolating to larger compute. The following figure (Figure 13 from the original paper) shows the adjusted L(Cmin)L(C_min) trend.

    Figure 13 When adjusting performance to simulate training far below the critical batch size, we find a somewhat altered power law for \(L ( C _ { \\mathrm { m i n } } )\) when compared with the fully em… 该图像是论文中图13的图表,展示了调整后训练远低于临界批量大小时的测试损失与计算量(PF-days)的关系。图中给出两个不同的拟合幂律公式,分别为 L=(Cmin/2.3108)0.050L = (C_{min}/2.3 \cdot 10^{8})^{-0.050}L=(C/2.0107)0.057L = (C/2.0 \cdot 10^{7})^{-0.057},显示了不同训练层数对损失趋势的影响。

    The optimal model size NN for a given compute budget CminC_{min} grows very rapidly: $ N ( C _ { \mathrm { m i n } } ) \propto ( C _ { \mathrm { m i n } } ) ^ { 0 . 7 3 } $ This means for every 10x increase in compute, the optimal model size increases by 5x. The optimal batch size (BcritB_{crit}) scales as BcritCmin0.24B_{crit} \propto C_{min}^{0.24}. The optimal number of steps (SminS_{min}) increases very slowly: $ S _ { \mathrm { m i n } } \propto ( C _ { \mathrm { m i n } } ) ^ { 0 . 0 3 } $ This implies that as compute increases, it should primarily be spent on making models larger, with only a marginal increase in training steps or dataset size. This leads to the conclusion that compute-efficient training involves training very large models and stopping significantly before convergence. The following figure (Figure 14 from the original paper) illustrates the optimal model size and optimization steps as functions of the compute budget.

    Figure 14 Left: Each value of the compute budget \(C _ { \\mathrm { m i n } }\) has an associated optimal model size \(N\) Optimal model size grows very rapidly with \(C _ { \\mathrm { m i n } }\) , increasi… 该图像是两幅图表,展示了计算资源预算与模型参数规模及优化步数之间的关系。左图显示最优模型规模NN随计算预算CminC_{\min}迅速增长,分别拟合为N=(1.3109)Cmin0.73N=(1.3 \cdot 10^9) \cdot C_{\min}^{0.73}N=(1.6109)Cmin0.88N=(1.6 \cdot 10^9) \cdot C_{\min}^{0.88};右图显示批量调整后的优化步数随计算预算增长缓慢,拟合公式为Smin=(5.4103)Cmin0.03S_{\min}=(5.4 \cdot 10^3) \cdot C_{\min}^{0.03},且固定批量方案的步数增长较快。

  • Predictions from L(N,Smin)L(N, S_{min}): The theoretical derivation from L(N,Smin)L(N, S_{min}) (Appendix B) predicts the power-law exponent for L(Cmin)L(C_{min}): $ \alpha _ { C } ^ { \mathrm { m i n } } \equiv \frac { 1 } { 1 / \alpha _ { S } + 1 / \alpha _ { B } + 1 / \alpha _ { N } } \approx 0 . 0 5 4 $ This closely matches the empirical αCmin0.050\alpha_C^{min} \approx 0.050. It also predicts the optimal model size scaling: $ N ( C _ { \mathrm { m i n } } ) \propto ( C _ { \mathrm { m i n } } ) ^ { \alpha _ { C } ^ { \mathrm { m i n } } / \alpha _ { N } } \approx ( C _ { \mathrm { m i n } } ) ^ { 0 . 7 1 } $ This also matches the empirical 0.73. These agreements validate the predictive power of the framework.

  • Contradictions and a Conjecture: The authors note that at scales far beyond those empirically studied, the predicted loss from L(Cmin)L(C_{min}) eventually decreases below what is possible given the sub-linear growth of data needed for compute-efficient training. Specifically, the data required to avoid overfitting scales as DN0.74Cmin0.54D \propto N^{0.74} \propto C_{min}^{0.54}, while the data processed in compute-efficient training (training for one epoch at BcritB_{crit}) scales as D(Cmin)Cmin0.26D(C_{min}) \propto C_{min}^{0.26}. Since the former grows faster than the latter, eventually the model will be data-bottlenecked. This leads to an intersection point where the L(Cmin)L(C_min) prediction (driven by compute scaling) meets the L(D) prediction (driven by data scaling/overfitting). This intersection is estimated at: $ C ^ { * } \sim 1 0 ^ { 4 } \mathrm { ~ P F-D ays } \quad N ^ { * } \sim 1 0 ^ { 1 2 } \mathrm { ~ p a r a m e t e r s , } \quad D ^ { * } \sim 1 0 ^ { 1 2 } \mathrm { ~ t o k e n s , } \quad L ^ { * } \sim 1 . 7 \mathrm { ~ nats/token } $ This point suggests a breakdown of the scaling laws or possibly a fundamental limit. The authors conjecture that LL^* might provide a rough estimate for the entropy per token of natural language, representing maximal performance for Transformer language models. The following figure (Figure 15 from the original paper) illustrates this potential contradiction.

    该图像是一个图表,展示了测试损失(Test Loss)随计算量(PF-days)变化的关系,横坐标为计算量,纵坐标为测试损失。图中两条曲线分别表示 \(L(C_{min})\) 和 `L(D(C))`,揭示了在不同计算预算下模型性能的表现。 该图像是一个图表,展示了测试损失(Test Loss)随计算量(PF-days)变化的关系,横坐标为计算量,纵坐标为测试损失。图中两条曲线分别表示 L(Cmin)L(C_{min})L(D(C)),揭示了在不同计算预算下模型性能的表现。

4.2.8. Notation Summary

For clarity, the paper defines the following notation:

  • LL: The cross-entropy loss in nats.
  • NN: The number of model parameters, excluding vocabulary and positional embeddings.
  • CC: An estimate of the total non-embedding training compute (C6NBSC \approx 6 N B S), quoted in PF-days.
  • DD: The dataset size in tokens.
  • BcritB_{crit}: The critical batch size, balancing time and compute efficiency.
  • CminC_{min}: An estimate of the minimum non-embedding compute to reach a given loss value (training at BBcritB \ll B_{crit}).
  • SminS_{min}: An estimate of the minimal number of training steps to reach a given loss value (training at BBcritB \gg B_{crit}).
  • αX\alpha_X: Power-law exponents for the scaling of loss as L(X)1/XαXL(X) \propto 1/X^{\alpha_X}, where XX can be N,D,C,S,B,CminN, D, C, S, B, C_{min}.

4.2.9. Summary of Power Laws (Appendix A)

The paper provides a concise summary of the key scaling laws and their fitted parameters.

The following are the results from Table 4 of the original paper:

Parameters Data Compute Batch Size Equation
N \infty \infty Fixed L(N)=(Nc/N)αNL(N) = (N_c/N)^{\alpha_N}
\infty D Early Stop Fixed L(D)=(Dc/D)αDL(D) = (D_c/D)^{\alpha_D}
Optimal \infty C Fixed L(C)=(Cc/C)αCL(C) = (C_c/C)^{\alpha_C} (naive)
Nopt Dopt CminC_{min} B < BcritB_{crit} L(Cmin)=(Ccmin/Cmin)αCminL(C_{min}) = (C_c^{min}/C_{min})^{\alpha_C^{min}}
N D Early Stop Fixed L(N,D)=[(Nc/N)αN/αD+Dc/D]αDL(N, D) = [ (N_c/N)^{\alpha_N/\alpha_D} + D_c/D ]^{\alpha_D}
N \infty S steps B L(N,S)=(Nc/N)αN+(Sc/Smin(S,B))αSL(N, S) = (N_c/N)^{\alpha_N} + (S_c/S_{min}(S,B))^{\alpha_S}

The following are the results from Table 5 of the original paper:

Power Law Scale (tokenization-dependent)
αN=0.076\alpha_N = 0.076 Nc=8.8×1013N_c = 8.8 \times 10^{13} params (non-embed)
αD=0.095\alpha_D = 0.095 Dc=5.4×1013D_c = 5.4 \times 10^{13} tokens
αC=0.057\alpha_C = 0.057 Cc=1.6×107C_c = 1.6 \times 10^7 PF-days
αCmin=0.050\alpha_C^{min} = 0.050 Ccmin=3.1×108C_c^{min} = 3.1 \times 10^8 PF-days
αB=0.21\alpha_B = 0.21 B=2.1×108B^* = 2.1 \times 10^8 tokens
αS=0.76\alpha_S = 0.76 Sc=2.1×103S_c = 2.1 \times 10^3 steps

The following are the results from Table 6 of the original paper:

Compute-Efficient Value Power Law Scale
Nopt=NeCminpNN_{opt} = N_e \cdot C_{min}^{pN} pN=0.73pN = 0.73 Ne=1.3109N_e = 1.3 \cdot 10^9 params
B<Bcrit=BL1/αB=BeCminpBB_{<Bcrit} = B^*L^{1/\alpha_B} = B_e C_{min}^{pB} pB=0.24pB = 0.24 Be=2.0106B_e = 2.0 \cdot 10^6 tokens
Smin=SeCminpsminS_{min} = S_e C_{min}^{psmin} (lower bound) ps=0.03ps = 0.03 Se=5.4103S_e = 5.4 \cdot 10^3 steps
D<min(1epoch)D_{<min(1 epoch)} pD=0.27pD = 0.27 De=21010D_e = 2 \cdot 10^{10} tokens

5. Experimental Setup

5.1. Datasets

The experiments primarily used WebText2, an expanded version of the WebText dataset [RWC+19][RWC+19].

  • Source: WebText2 was constructed from outbound Reddit links that received at least 3 karma, for the period up to October 2018. This karma threshold served as a heuristic for quality, indicating content that people found interesting or useful. The text was extracted using the Newspaper3k Python library.
  • Scale and Characteristics: The dataset consists of 20.3 million documents, containing 96 GB of text. It comprises 1.62×10101.62 \times 10^{10} words. After byte-pair encoding [SHB15] with a vocabulary size of 50257, this amounts to 2.29×10102.29 \times 10^{10} tokens.
  • Domain: The WebText2 dataset represents a broad collection of internet text, covering a diverse range of topics, writing styles, and factual information. This makes it suitable for training general-purpose language models.
  • Test Sets: A portion of WebText2 (6.6×1086.6 \times 10^8 tokens) was reserved as a test set. Additionally, for evaluating generalization performance across different text distributions, the models were tested on samples from:
    • Books Corpus [ZKZ+15][ZKZ+15]
    • Common Crawl [Fou]
    • English Wikipedia
    • A collection of publicly available Internet Books
  • Choice Rationale: These datasets were chosen for their scale and relevance to language modeling tasks. WebText2 provides a vast and diverse source of "good quality" internet text, suitable for training very large language models. The additional test datasets allow for robust evaluation of generalization capabilities, demonstrating whether the scaling laws hold across different text domains.

5.2. Evaluation Metrics

The principal performance metric used throughout the paper is the cross-entropy loss.

  • Cross-entropy Loss (in nats):
    • Conceptual Definition: Cross-entropy loss quantifies the difference between the probability distribution predicted by the language model for the next token and the true probability distribution (which assigns a probability of 1 to the actual next token and 0 to all others). It is a measure of how "surprised" the model is by the actual next token. A lower loss value indicates better model performance, meaning the model is more accurately predicting the sequence of tokens. The unit nats implies the use of the natural logarithm (base ee) in the calculation. The loss is typically averaged over the tokens in a context (usually 1024 tokens in this paper).
    • Mathematical Formula: For a given sequence of tokens x1,,xTx_1, \dots, x_T, the language model predicts the probability of each token given the preceding ones: P(xtx1,,xt1)P(x_t | x_1, \dots, x_{t-1}). The cross-entropy loss LL for this sequence is calculated as the negative log-likelihood, normalized by the sequence length TT: $ L = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t | x_1, \dots, x_{t-1}) $
    • Symbol Explanation:
      • LL: The cross-entropy loss averaged over the tokens in the context.
      • TT: The length of the token sequence (or context).
      • xtx_t: The tt-th token in the sequence.
      • P(xtx1,,xt1)P(x_t | x_1, \dots, x_{t-1}): The probability assigned by the language model to the tt-th token, given the previous t-1 tokens in the sequence.
      • log\log: The natural logarithm.

5.3. Baselines

The paper primarily investigates the scaling laws of Transformer models. However, for comparative analysis, it also mentions and evaluates other architectures:

  • LSTM Models: Long Short-Term Memory networks are a type of recurrent neural network (RNN) that were state-of-the-art for sequence modeling before Transformers. The paper compares LSTM performance against Transformers (Figure 7) to highlight Transformers' superior performance, particularly for longer contexts. LSTMs were trained with the same dataset and context length for a fair comparison.

  • Universal Transformers [DGV+18][DGV+18]: These are a variant of Transformer models that re-use parameters across steps (like RNNs) instead of having distinct parameters for each layer. The paper compares them to standard Transformers (Figure 17 in Appendix D.2). Universal Transformers might perform slightly better as a function of parameter count NN due to parameter re-use, but might be slightly worse when compute is considered due to additional operations per parameter.

    These baselines are representative as they cover both older (but still relevant) recurrent architectures and close variants of the Transformer architecture, providing context for the Transformer's specific scaling properties. The primary goal of the paper is not to outperform these baselines, but to understand the scaling laws of Transformers themselves.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results strongly validate the effectiveness of the proposed scaling laws as a predictive framework for neural language model performance. The consistent power-law relationships across vast scales demonstrate that performance is largely determined by scale rather than intricate architectural tuning.

  • Weak Dependence on Model Shape: Figure 5 illustrates that Transformer performance is remarkably insensitive to variations in hyperparameters like feed-forward ratio, aspect ratio (depth vs. width), or attention head dimension when the total non-embedding parameter count NN is kept constant. Loss varies only by a few percent over a wide range of shapes. For example, an aspect ratio can vary by a factor of 40 with minimal impact. This suggests that resource allocation should prioritize increasing overall scale over fine-tuning specific architectural configurations. The following figure (Figure 5 from the original paper) shows the mild effect of model shape on loss.

    Figure 5 Performance depends very mildly on model shape when the total number of non-embedding parameters \(N\) is held fixed. The loss varies only a few percent over a wide range of shapes. Small diff… 该图像是图表,展示了不同模型结构参数对损失增加的影响。图中通过三个子图比较了在固定参数量下,前馈网络比例、层宽比和注意力头数维度变化对损失的轻微影响,表明结构变化对性能影响较小。

  • Importance of Non-Embedding Parameters: Figure 6 highlights a critical methodological choice. When embedding parameters are included in the model size count (left panel), the performance trend appears less clear, with different layer counts leading to distinct curves. However, when embedding parameters are excluded (right panel), the performance of models with different depths converges to a single, clear power-law trend. This finding validates the paper's decision to define NN as non-embedding parameters for cleaner scaling laws. Only models with very few layers or extreme depth-to-width ratios significantly deviate. The following figure (Figure 6 from the original paper) compares performance trends with and without embedding parameters.

    Figure 6 Left: When we include embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When we exclude embedding parameter… 该图像是论文中图6的对比图,展示了包含与不包含embedding参数时,不同层数模型的测试损失随参数数量变化的曲线。左图显示包含embedding时,层数对性能影响显著;右图排除embedding后,多层模型性能趋于一致,只有极端深宽比或层数少于2时偏离趋势。

  • Transformer vs. LSTM Performance: Figure 7 demonstrates that Transformers asymptotically outperform LSTMs as model size increases. While LSTMs can perform comparably for early tokens in a context, they plateau for later tokens, indicating limitations in capturing long-range dependencies compared to Transformers. This underscores why Transformers are the appropriate architecture for studying advanced scaling laws. The following figure (Figure 7 from the original paper) compares LSTM and Transformer performance.

    该图像是两个折线图,展示了Transformer和LSTM模型在不同参数规模及上下文长度下的测试损失变化。左图显示了随模型参数增大,Transformer测试损失显著低于LSTM。右图展示Transformer在上下文中后续Token的测试损失下降趋势,模型参数从40万到3亿不等。 该图像是两个折线图,展示了Transformer和LSTM模型在不同参数规模及上下文长度下的测试损失变化。左图显示了随模型参数增大,Transformer测试损失显著低于LSTM。右图展示Transformer在上下文中后续Token的测试损失下降趋势,模型参数从40万到3亿不等。

  • Smooth Power-Law Scalings: Figure 1 (in the Executive Summary) and Figure 23 clearly show the cross-entropy loss follows smooth power-law relationships with model size (NN), dataset size (DD), and compute (CC). These trends span multiple orders of magnitude without showing signs of deviation at the upper end. This universality allows for reliable extrapolation and prediction of model performance. Figure 23 specifically contrasts the power-law fit with a logarithmic fit, demonstrating the qualitative superiority of the power-law for L(N). The following figure (Figure 23 from the original paper) compares power-law and logarithmic fits.

    Figure 23 The trend for performance as a function of parameter count, `L ( N )` , is fit better by a power law than by other functions such as a logarithm at a qualitative level. 该图像是一个图表,展示了参数数量与测试损失(收敛时)的关系,图中表明性能趋势更符合幂律关系,公式为 L=(N/8.8imes1013)0.076L = (N/8.8 imes 10^{13})^{-0.076},相比对数函数拟合更优。

  • Universality of Overfitting and L(N,D): Figure 9 vividly demonstrates how loss depends on both model size NN and dataset size DD. For large datasets, loss continues to decrease with increasing NN. However, for smaller datasets, increasing NN eventually leads to overfitting, causing performance to plateau. The right panel confirms that the extent of overfitting is predictable and depends on the ratio NαNαD/DN^{\frac{\alpha_N}{\alpha_D}}/D, validating the combined scaling law L(N,D). This provides crucial guidance on how to scale data alongside models to avoid performance penalties. The following figure (Figure 9 from the original paper) illustrates the universality of overfitting.

    Figure 24 We show evaluations on a series of datasets for models with approximately 1.5 Billion parameters. We observe no effect of depth on generalization; generalization performance depends primari… 该图像是图表,展示了约15亿参数模型在不同数据集上随网络深度变化的测试损失。结果表明深度对泛化影响不大,泛化主要依赖训练分布性能。12层模型在Internet Books数据集上过拟合,并显示了早停性能。

  • Universality of Training and Sample Efficiency: Figure 4 (right panel) and Figure 2 show that learning curves for various model sizes (after an initial transient) can be accurately described by power-laws in training steps SminS_{min}. This suggests a universal loss landscape structure. Importantly, Figure 2 and Figure 19 clearly demonstrate that larger models are significantly more sample-efficient, reaching the same level of performance with fewer optimization steps and data points. This means they learn faster and more effectively from data. The following figure (Figure 4 from the original paper) shows performance with varying model and data size, or model and training steps.

    Figure 4 Left: The early-stopped test loss `L ( N , D )` varies predictably with the dataset size \(D\) and model size \(N\) according to Equation (1.5). Right: After an initial transient period, learnin… 该图像是论文中图4的图表,左图展示了早停时测试损失L(N,D)如何随数据集大小DD和模型大小NN变化,符合公式(1.5);右图展示了所有模型大小在初始变动后,学习曲线可由参数SminS_{\mathrm{min}}拟合,符合公式(1.6)。

    The following figure (Figure 2 from the original paper) shows language model training runs illustrating sample efficiency.

    Figure 2 We show a series of language model training runs, with models ranging in size from \(1 0 ^ { 3 }\) to \(1 0 ^ { 9 }\) parameters (excluding embeddings). 该图像是图表,展示了不同规模语言模型(从10^3到10^9参数)在训练过程中测试损失随处理Token数量和计算量(PF-days)的变化趋势。图中曲线颜色对应模型参数数量,显示较大模型在相同Token或计算量下测试损失显著较低,且计算效率最高的训练通常在收敛之前即停止。

    The following figure (Figure 19 from the original paper) provides another view of sample efficiency.

    Figure 19 The number of minimum serial steps needed to reach any fixed value of the test loss decreases precipitously with model size. Sample efficiency (show here for training far below the critical… 该图像是两个等高线图,展示了模型参数规模与达到固定测试损失值所需最小训练步骤数和最小训练样本数的关系。图中显示随着模型参数数量增加,所需步骤和样本数显著减少,表明更大的模型具有更高的样本效率。

  • Optimal Compute Allocation and Early Stopping: The paper's most impactful finding is illustrated in Figure 3 and further detailed in Figure 12 and Figure 14. For a fixed compute budget, optimal performance is achieved by primarily increasing model size (NCmin0.73N \propto C_{min}^{0.73}), with only modest increases in batch size (BCmin0.24B \propto C_{min}^{0.24}) and negligible increases in training steps (SCmin0.03S \propto C_{min}^{0.03}). This implies that compute-efficient training involves training very large models and stopping significantly before convergence (i.e., not training until the loss completely flattens out, as seen in Figure 2). Figure 12 also shows that while there's an optimal model size for a given compute budget, slight deviations are tolerable without huge compute penalties. The following figure (Figure 3 from the original paper) illustrates how to scale model size, batch size, and serial steps.

    Figure 3 As more compute becomes available, we can choose how much to allocate towards training larger moels, using larger batches, andtrainng or moe steps.We illustrate this for a billion-old incrs… 该图像是论文中的图表,展示了不同计算资源(PF-days)分配对训练中模型大小、批量大小和串行步骤的乘法贡献关系,反映了随着计算量增加,模型大小的增加远超其他因素。

    The following figure (Figure 12 from the original paper) shows the effect of training on suboptimal models.

    Figure 12 Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger or smaller models can be trained with minimal additional compute. Right: Models larger than th… 该图像是两个子图组成的图表,展示了在固定计算预算下模型大小的优化效果。左图显示模型大小偏离最优值时,所需额外计算量的变化范围;右图显示较小模型训练步骤多,较大模型训练步骤少的趋势,捕捉了训练步骤的超额情况。

  • Generalization Across Data Distributions: Figure 8 shows that generalization performance to other data distributions (e.g., Books Corpus, Wikipedia) improves smoothly with model size, in direct parallel with performance on the WebText2 training distribution. The loss on out-of-distribution datasets maintains a roughly constant offset from the in-distribution loss. This indicates that generalization is primarily a function of the in-distribution validation loss and model size, rather than specific training duration or model depth (Figure 24 in Appendix D.8). The figure reproduced below (Figure 23 from the original paper's appendix) is a related fit-quality check, showing that the converged $L(N)$ trend is better described by a power law than by a logarithm.

    Figure 23 The trend for performance as a function of parameter count, $L(N)$, is fit better by a power law than by other functions such as a logarithm, at least at a qualitative level. The power-law fit shown against test loss at convergence is $L = (N / 8.8 \times 10^{13})^{-0.076}$.

  • Context Dependence: Figure 20 and Figure 21 explore loss per token as a function of its position within the 1024-token context. Loss scales predictably as a power-law in token position. Larger models show faster improvements at early tokens, suggesting they are more efficient at detecting patterns with less contextual information. This indicates that large models learn both short-range and long-range correlations more effectively. The following figure (Figure 20 from the original paper) shows power-law dependence of performance on context position.

    Figure 20 Per-token performance as a function of model size and training time. Left: loss per token as a function of its position $T$ within the 1024-token context, which falls off as a power law in $T$. Right: per-token test loss as a function of the number of training steps.

    The following figure (Figure 21 from the original paper) shows performance at different context positions versus model size.

    Figure 21 In addition to the averaged loss, individual tokens within the 1024-token context also improve smoothly as model size increases. Solid lines show per-position loss for the 1024-token context; dashed lines show training runs with a much shorter context ($n_{ctx} = 8$), which achieve lower loss on the earliest tokens.

6.2. Data Presentation (Tables)

The paper provides several key tables summarizing the scaling laws, fitted parameters, and optimal compute-efficient training trends.

The following are the results from Table 1 of the original paper:

| Operation | Parameters | FLOPs per Token |
| --- | --- | --- |
| Embed | $(n_{vocab} + n_{ctx}) d_{model}$ | $4 d_{model}$ |
| Attention: QKV | $n_{layer} d_{model} 3 d_{attn}$ | $2 n_{layer} d_{model} 3 d_{attn}$ |
| Attention: Mask | — | $2 n_{layer} n_{ctx} d_{attn}$ |
| Attention: Project | $n_{layer} d_{attn} d_{model}$ | $2 n_{layer} d_{attn} d_{embd}$ |
| Feedforward | $n_{layer} 2 d_{model} d_{ff}$ | $2 n_{layer} 2 d_{model} d_{ff}$ |
| De-embed | — | $2 d_{model} n_{vocab}$ |
| Total (Non-Embedding) | $N = 2 d_{model} n_{layer} (2 d_{attn} + d_{ff})$ | $C_{forward} = 2N + 2 n_{layer} n_{ctx} d_{attn}$ |

Table 1: Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading terms such as nonlinearities, biases, and layer normalization are omitted.
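
To make the Table 1 bookkeeping concrete, here is a minimal Python sketch (ours, not from the paper's code) that evaluates the non-embedding parameter count and forward-pass FLOPs per token. The 12-layer, 768-dimensional configuration below is purely illustrative and assumes the common choice $d_{attn} = d_{model} = d_{ff}/4$.

```python
def transformer_counts(n_layer, d_model, d_attn, d_ff, n_ctx):
    # Total non-embedding parameters (Table 1): N = 2 * d_model * n_layer * (2 * d_attn + d_ff)
    n_params = 2 * d_model * n_layer * (2 * d_attn + d_ff)
    # Forward-pass compute per token (Table 1): C_forward = 2 * N + 2 * n_layer * n_ctx * d_attn
    flops_per_token = 2 * n_params + 2 * n_layer * n_ctx * d_attn
    return n_params, flops_per_token

# Illustrative 12-layer model with d_attn = d_model = 768 and d_ff = 4 * d_model.
N, C_fwd = transformer_counts(n_layer=12, d_model=768, d_attn=768, d_ff=3072, n_ctx=1024)
print(f"non-embedding parameters N ~ {N:.3e}, forward-pass FLOPs per token ~ {C_fwd:.3e}")
```

For this configuration the non-embedding parameter count comes out to roughly $8.5 \times 10^7$, and the context-dependent attention term contributes only about 10% of the forward-pass FLOPs.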

The following are the results from Table 2 of the original paper:

| Parameter | $\alpha_N$ | $\alpha_D$ | $N_c$ | $D_c$ |
| --- | --- | --- | --- | --- |
| Value | 0.076 | 0.103 | $6.4 \times 10^{13}$ | $1.8 \times 10^{13}$ |

Table 2: Fits to L(N,D)
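
As a quick illustration of how these fits are used, the following minimal Python sketch (ours, not the paper's fitting code) evaluates the $L(N, D)$ trend equation from Table 4 with the Table 2 constants; the example model and dataset sizes are arbitrary.

```python
# Fitted constants from Table 2 (tokenization-dependent).
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13  # N_c in non-embedding parameters, D_c in tokens

def loss_nd(n_params, n_tokens):
    """Early-stopped test loss L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Example: a 1e9-parameter model on 2e10 tokens vs. effectively unlimited data.
print(loss_nd(1e9, 2e10), loss_nd(1e9, 1e30))
```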

The following are the results from Table 3 of the original paper:

| Parameter | $\alpha_N$ | $\alpha_S$ | $N_c$ | $S_c$ |
| --- | --- | --- | --- | --- |
| Value | 0.077 | 0.76 | $6.5 \times 10^{13}$ | $2.1 \times 10^3$ |

Table 3: Fits to L(N,S)

The following are the results from Table 4 of the original paper:

| Parameters | Data | Compute | Batch Size | Equation |
| --- | --- | --- | --- | --- |
| $N$ | $\infty$ | $\infty$ | Fixed | $L(N) = (N_c/N)^{\alpha_N}$ |
| $\infty$ | $D$ | Early Stop | Fixed | $L(D) = (D_c/D)^{\alpha_D}$ |
| Optimal | $\infty$ | $C$ | Fixed | $L(C) = (C_c/C)^{\alpha_C}$ (naive) |
| $N_{opt}$ | $D_{opt}$ | $C_{min}$ | $B \ll B_{crit}$ | $L(C_{min}) = (C_c^{min}/C_{min})^{\alpha_C^{min}}$ |
| $N$ | $D$ | Early Stop | Fixed | $L(N, D) = \left[ (N_c/N)^{\alpha_N/\alpha_D} + D_c/D \right]^{\alpha_D}$ |
| $N$ | $\infty$ | $S$ steps | $B$ | $L(N, S) = (N_c/N)^{\alpha_N} + (S_c/S_{min}(S, B))^{\alpha_S}$ |

Table 4: Key trend equations.

The following are the results from Table 5 of the original paper:

| Power Law | Scale (tokenization-dependent) |
| --- | --- |
| $\alpha_N = 0.076$ | $N_c = 8.8 \times 10^{13}$ params (non-embed) |
| $\alpha_D = 0.095$ | $D_c = 5.4 \times 10^{13}$ tokens |
| $\alpha_C = 0.057$ | $C_c = 1.6 \times 10^7$ PF-days |
| $\alpha_C^{min} = 0.050$ | $C_c^{min} = 3.1 \times 10^8$ PF-days |
| $\alpha_B = 0.21$ | $B^* = 2.1 \times 10^8$ tokens |
| $\alpha_S = 0.76$ | $S_c = 2.1 \times 10^3$ steps |

Table 5: Key parameters to trend fits.

The following are the results from Table 6 of the original paper:

| Compute-Efficient Value | Power Law | Scale |
| --- | --- | --- |
| $N_{opt} = N_e \cdot C_{min}^{p_N}$ | $p_N = 0.73$ | $N_e = 1.3 \cdot 10^9$ params |
| $B \ll B_{crit} = B^* / L^{1/\alpha_B} = B_e \cdot C_{min}^{p_B}$ | $p_B = 0.24$ | $B_e = 2.0 \cdot 10^6$ tokens |
| $S_{min} = S_e \cdot C_{min}^{p_S}$ (lower bound) | $p_S = 0.03$ | $S_e = 5.4 \cdot 10^3$ steps |
| $D_{opt} = D_e \cdot C_{min}^{p_D}$ (1 epoch) | $p_D = 0.27$ | $D_e = 2 \cdot 10^{10}$ tokens |

Table 6: Trends for compute-efficient training.
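
The following minimal Python sketch (an illustration we provide, not the authors' code) turns the Table 6 power laws into a rough compute-allocation calculator: given a budget $C_{min}$ in PF-days, it returns the suggested model size, batch size, serial steps, and tokens processed. The sample budgets are arbitrary, and the outputs should be read as order-of-magnitude guidance only.

```python
# Power-law fits from Table 6; C_min is measured in PF-days.
def compute_efficient_allocation(c_min_pf_days):
    n_opt  = 1.3e9  * c_min_pf_days ** 0.73  # optimal non-embedding parameter count
    batch  = 2.0e6  * c_min_pf_days ** 0.24  # batch size in tokens (kept below B_crit)
    s_min  = 5.4e3  * c_min_pf_days ** 0.03  # serial optimization steps (lower bound)
    tokens = 2.0e10 * c_min_pf_days ** 0.27  # tokens processed (single epoch)
    return n_opt, batch, s_min, tokens

for budget in (1e-3, 1.0, 1e3):  # PF-days
    n, b, s, d = compute_efficient_allocation(budget)
    print(f"C_min={budget:g} PF-days -> N~{n:.2e} params, B~{b:.2e} tokens, "
          f"S~{s:.2e} steps, D~{d:.2e} tokens")
```

Note how weakly the serial step count and data requirement grow with the budget: almost all of the additional compute goes into a larger model.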

6.3. Ablation Studies / Parameter Analysis

The authors conducted several analyses that function as ablation studies or parameter analyses to understand the robustness and specifics of their scaling laws.

  • Model Shape (Depth vs. Width): As discussed, Figure 5 demonstrates that Transformer performance is largely invariant to specific architectural choices such as depth, width, feed-forward dimension, or number of attention heads, as long as the total non-embedding parameter count $N$ remains fixed. The loss varies by only a few percent across a wide range of these hyperparameters. This suggests that the total capacity, as represented by $N$, is the dominant factor, not the specific distribution of that capacity across layers or dimensions.

  • Impact of Embedding Parameters: Figure 6 serves as an ablation study on the definition of model size. It clearly shows that including embedding parameters in $N$ obscures the clean power-law trend. Excluding them yields a much more consistent scaling law across different model depths. This implies that the embedding matrix can be made smaller without impacting performance, a finding corroborated by other research [LCG+19].

  • Learning Rate Schedules and Error Analysis: Figure 22 (in Appendix D.6) presents an analysis of various learning rate schedules (cosine decay, linear decay, faster/slower decays). The conclusion is that the choice of learning rate schedule is largely irrelevant to the final loss at convergence, provided that the total summed learning rate is sufficiently large, and the schedule includes a warmup period and a final decay. Variations among schedules appear to be within the range of statistical noise. This suggests that the scaling laws are robust to these optimization hyperparameters. The authors also note that larger models generally require smaller learning rates to prevent divergence. The following figure (Figure 22 from the original paper) shows the learning rate schedule scan.

    Figure 22 We test a variety of learning rate schedules, including cosine decay, linear decay, and several faster/slower decay schedules, on a 3 million parameter model (left panel). The right panel plots the final loss against the cumulative (summed) learning rate, indicating that the choice of schedule matters little as long as the total summed learning rate is sufficiently large.

  • Generalization and Architecture (Depth): Figure 24 (in Appendix D.8) explicitly shows that generalization performance to other data distributions does not depend on network depth when the total parameter count is held fixed. Instead, generalization performance is primarily determined by the performance achieved on the training distribution. This reinforces the idea that overall scale, rather than specific architectural details like depth, is the key driver of general capabilities. The following figure (Figure 24 from the original paper) shows generalization versus depth.

    Figure 24 We show evaluations on a series of datasets for models with approximately 1.5 billion parameters. We observe no effect of depth on generalization; generalization performance depends primarily on performance on the training distribution. The 12-layer model overfit the Internet Books dataset, so its early-stopped performance is also shown.

6.4. Critical Batch Size (B_crit)

Figure 10 illustrates the power-law relationship of the critical batch size $B_{crit}$ with the loss $L$. The key finding is that $B_{crit}(L)$ is independent of model size and only depends on the loss being achieved. This confirms the predictions from [MKAT18]. The power-law fit shows that $B_{crit}$ approximately doubles for every 13% decrease in loss.
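
As a small worked example of this relation, the sketch below (ours) evaluates $B_{crit}(L) = B^*/L^{1/\alpha_B}$ with the Table 5 fits $B^* \approx 2.1 \times 10^8$ tokens and $\alpha_B \approx 0.21$; the loss values are arbitrary illustrations.

```python
# Fits from Table 5: B* ~ 2.1e8 tokens, alpha_B ~ 0.21; loss L is in nats/token.
B_STAR, ALPHA_B = 2.1e8, 0.21

def critical_batch_size(loss):
    """Critical batch size B_crit(L) = B* / L^(1/alpha_B), independent of model size."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

for loss in (4.0, 3.0, 2.0):  # illustrative loss values
    print(f"L={loss}: B_crit ~ {critical_batch_size(loss):.2e} tokens")
```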

6.5. Universal Transformers

Figure 17 (in Appendix D.2) compares recurrent Transformers to standard Transformers. Recurrent Transformers, which re-use parameters, perform slightly better when compared solely by parameter count $N$. However, when compute is factored in (because parameter re-use implies more FLOPs per parameter), they perform slightly worse. This demonstrates that while parameter efficiency can be achieved, it might not translate directly to compute efficiency. The following figure (Figure 17 from the original paper) compares recurrent Transformers to standard Transformers.

Figure 17 We compare recurrent Transformers [DGV+18], which re-use parameters, to standard Transformers. Recurrent Transformers perform slightly better when comparing models with equal parameter counts (left), but slightly worse than non-recurrent models when the extra compute per parameter from re-use is taken into account (right).

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously establishes a set of empirical scaling laws that govern the performance of neural language models, particularly those based on the Transformer architecture. The core finding is that cross-entropy loss scales predictably as a power-law with model size (N), dataset size (D), and compute (C) used for training, across many orders of magnitude. In contrast, architectural specifics like network width or depth have only minimal effects. These relationships allow for precise modeling of overfitting and training speed. A significant implication is that compute-efficient training involves training very large models on relatively modest amounts of data and stopping significantly before full convergence, as larger models are substantially more sample-efficient. The research provides a robust, predictive framework for advancing language modeling research.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations and suggest future research directions:

  • Lack of Theoretical Understanding: A major limitation is the absence of a solid theoretical foundation for the observed scaling laws. While empirical, the underlying reasons for these power-law relationships, especially with model size and compute, remain mysterious. Without a theory, it's difficult to predict when these laws might break down or how they might be systematically corrected.
  • Uncertainty in $B_{crit}$ Extrapolation: The predictions for critical batch size ($B_{crit}(L)$) might not be reliable for loss values far outside the explored range. Changes in $B_{crit}$ could significantly alter trade-offs between data parallelism and serial training steps, impacting training time.
  • Limited Small Data Regime Investigation: The study did not thoroughly investigate the small data regime, and the fits for L(N,D) were less accurate for the smallest dataset sizes.
  • Fixed Regularization and Augmentation: The regularization (e.g., dropout probability) and data augmentation strategies were not optimized while varying dataset and model size. Improvements in these areas could quantitatively or qualitatively alter the results.
  • Compute Estimation Limitations: The compute estimate $C \approx 6NBS$ excluded context-dependent terms (proportional to $n_{ctx}$). This might confound scalings in scenarios with very large context lengths ($n_{ctx} \gtrsim 12 d_{model}$); a sketch illustrating the size of this correction follows this list.
  • Learning Rate Tuning: While learning rate schedules were explored, the optimal choice of learning rate itself is sensitive to the target loss. The authors did not experiment with higher learning rates for short training runs, which might be beneficial.
  • The "Contradiction" at Extreme Scales: The most intriguing limitation is the self-identified "contradiction" where the scaling laws for compute and data imply a breakdown at extremely large scales (estimated around C104C^* \sim 10^4 PF-days, N1012N^* \sim 10^{12} parameters, D1012D^* \sim 10^{12} tokens, L1.7L^* \sim 1.7 nats/token). This suggests that the current scaling laws must eventually break down, or that this point represents a fundamental limit (e.g., reaching the entropy of natural language).

Future research directions suggested include:

  • Testing these scaling relations on other generative modeling tasks (e.g., images, audio, video) and random network distillation to determine their universality.
  • Developing a theoretical framework (a "statistical mechanics") from which these scaling relations can be derived, leading to more precise predictions and understanding of their limitations.
  • Investigating whether continued improvement in loss translates into qualitative improvements on relevant language tasks.
  • Further research into model parallelism (e.g., pipelining, wide networks, sparsity, branching) to efficiently train increasingly large models.
  • Exploring methods that grow networks as they train to remain on the compute-efficient frontier.

7.3. Personal Insights & Critique

This paper is a landmark study that provided a much-needed empirical foundation for the deep learning era of large language models. My personal insights include:

  • Paradigm Shift: The most profound insight is the shift in perspective from intricate hyperparameter tuning to prioritizing sheer scale. Before this paper, much effort was spent optimizing architectures or regularization for marginal gains. This work demonstrated that simply making models bigger (and training them appropriately) yields consistent, predictable improvements across a broad range of tasks. This has directly influenced the "bigger is better" trend that led to models like GPT-3 and beyond.
  • Practical Guidance for Resource Allocation: The finding that compute-efficient training involves very large models trained with early stopping on moderate data is immensely practical. It advises against training small models to full convergence or trying to scrape together impossibly large datasets for every model size. This guidance is invaluable for researchers and organizations with fixed compute budgets.
  • Implications for Hardware and Data: The sub-linear scaling of the data requirement with model size ($D \propto N^{0.74}$) and the increasing sample efficiency of larger models are crucial. This suggests that while compute scales rapidly, the demand for new, unique data may grow more slowly than previously thought. This has implications for data curation efforts and highlights the immense importance of model parallelism in hardware design.
  • Robustness of Transformers: The weak dependence on model shape (Figure 5) speaks to the inherent robustness and flexibility of the Transformer architecture. It implies that Transformers are highly adaptable to various computational constraints (e.g., favoring width over depth for parallelism) without significant performance trade-offs, as long as the overall parameter count is maintained.
  • The "Contradiction" as a Research Frontier: The conjecture about the scaling laws breaking down at an "intersection point" for compute and data requirements is fascinating. It suggests that there might be a fundamental limit to how much performance can be gained by simply scaling current Transformer language models, or that new architectural innovations will be required to push beyond this limit. This point also offers a potential way to estimate the entropy of language, a long-standing open problem.

Potential Issues/Unverified Assumptions:

  • Domain Specificity: While the authors suggest scaling laws might apply to other generative modeling tasks, this paper's findings are specifically for Transformer language models on natural language data. Their universality needs empirical verification in other domains.

  • Generalizability to Fine-tuning/Downstream Tasks: The scaling laws are focused on cross-entropy loss on the pre-training task. While transfer learning performance correlates, it's not explicitly proven that these scaling laws translate directly to optimal performance on all downstream tasks or fine-tuning scenarios.

  • "Good" Data Assumption: The WebText2 dataset is curated for quality (Reddit karma threshold). The scaling laws might differ for lower-quality or highly noisy datasets.

  • Empirical Nature: The core limitation acknowledged by the authors themselves is the empirical nature of the findings. Without a theoretical grounding, the scaling laws are observed phenomena rather than derived principles, making their extrapolation beyond the observed range potentially risky. The exact exponents and critical points are sensitive to fitting choices.

    Overall, this paper provides a powerful "engineering manual" for large language models, shifting the focus towards quantifiable scaling strategies and opening new avenues for theoretical inquiry into the fundamental properties of deep learning and natural language.
