Scaling Laws for Neural Language Models
TL;DR Summary
The paper shows that language model loss follows power-law scaling with model size, data, and compute, while depth/width effects are minimal. It reveals quantitative laws for overfitting and training speed, enabling optimal compute allocation that favors large, sample-efficient models.
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Scaling Laws for Neural Language Models."
1.2. Authors
The authors of the paper are:
- Jared Kaplan (Johns Hopkins University, OpenAI)
- Sam McCandlish (OpenAI)
- Tom Henighan (OpenAI)
- Tom B. Brown (OpenAI)
- Benjamin Chess (OpenAI)
- Rewon Child (OpenAI)
- Scott Gray (OpenAI)
- Alec Radford (OpenAI)
- Jeffrey Wu (OpenAI)
- Dario Amodei (OpenAI)
The research primarily comes from OpenAI, with Jared Kaplan also affiliated with Johns Hopkins University. This indicates a strong background in artificial intelligence research, particularly in large-scale language models, given OpenAI's known work in this area (e.g., GPT series).
1.3. Journal/Conference
This paper was published on arXiv, a preprint server, on 2020-01-23. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform in the academic community, especially in fields like AI and machine learning, for rapidly disseminating cutting-edge research before or concurrently with formal publication. Papers published on arXiv often gain significant traction and citations, and many are subsequently published in top-tier conferences or journals. Given the authors' affiliations with OpenAI, this paper quickly became a foundational work in understanding the scaling behavior of large language models.
1.4. Publication Year
2020
1.5. Abstract
This paper empirically investigates the cross-entropy loss performance of neural language models. The core finding is that performance scales as a power-law with model size (N), dataset size (D), and compute used for training, with these trends spanning over seven orders of magnitude. In contrast, other architectural details like network width or depth have minimal effects within a wide range. The authors establish simple equations that describe how overfitting depends on model/dataset size and how training speed depends on model size. These relationships provide a framework for optimally allocating a fixed compute budget, revealing that larger models are significantly more sample-efficient. Consequently, the most compute-efficient training strategy involves using very large models on a relatively modest amount of data and stopping significantly before full convergence.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2001.08361v1. The PDF link is https://arxiv.org/pdf/2001.08361v1.pdf. This paper is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is understanding and predicting the performance of neural language models as they scale across various dimensions. In the rapidly advancing field of deep learning for natural language processing (NLP), models were growing in size and complexity, but a clear, quantitative understanding of how performance relates to key resources like model size, dataset size, and compute was lacking.
This problem is critically important because the development of state-of-the-art language models (like the Transformer-based models prevalent today) requires significant computational resources and time. Without a clear understanding of scaling behavior, researchers and engineers might:
- Allocate resources inefficiently: Spend too much time or compute on models that are too small or too large for the available data, or train them for suboptimal durations.
- Struggle with predictability: Be unable to reliably forecast the performance gains (or diminishing returns) from investing more into larger models or datasets.
- Over-engineer architectures: Focus excessively on intricate architectural details that might have marginal impact compared to sheer scale.
Prior research often involved ad-hoc scaling or focused on specific aspects (e.g., impact of dataset size). The specific challenge was to find universal, predictive laws that govern language model performance across vast scales and various training conditions, guiding future research and development more systematically.
The paper's entry point is an empirical investigation into Transformer-based language models, aiming to identify fundamental power-law relationships between performance (measured by cross-entropy loss) and key scaling factors. The innovative idea is to systematically explore these relationships across many orders of magnitude to uncover predictable and actionable insights.
2.2. Main Contributions / Findings
The paper makes several significant contributions and presents key findings that have fundamentally influenced the development of large language models:
- Empirical power-law scaling laws: The primary contribution is the discovery that language model cross-entropy loss (a measure of performance) follows predictable power-law relationships with model size (N), dataset size (D), and compute (C) used for training. These laws hold consistently over six to seven orders of magnitude, providing a robust framework for performance prediction.
- Weak dependence on architectural details: Within a reasonable range, other architectural hyperparameters (e.g., network width vs. depth, number of attention heads) have minimal impact on performance when the total model size is fixed. This suggests that scale is far more critical than precise architectural choices for Transformer models.
- Universality of overfitting and training: The paper provides a simple equation modeling overfitting, showing that its extent depends predominantly on the ratio N^(α_N/α_D) / D. It also demonstrates that training curves follow predictable power laws and can be extrapolated, implying a universality in training dynamics.
- Optimal allocation of a compute budget: Based on these scaling laws, the authors determine the most compute-efficient way to allocate resources. For a fixed compute budget, optimal performance is achieved by training very large models on a relatively modest amount of data and stopping training significantly before convergence.
- Enhanced sample efficiency of large models: A crucial finding is that larger models are significantly more sample-efficient; they require fewer optimization steps and fewer data points to reach a given level of performance than smaller models.
- Predictive framework: The derived scaling relations go beyond mere observation, offering a predictive framework for guiding research and development in large language models. They suggest that continued scaling will lead to better performance and sample efficiency.

These findings address the problems of inefficient resource allocation and poor predictability in deep learning for language models. They provide a clear roadmap for how to effectively scale up Transformers, shifting the focus from hyperparameter tuning to strategic scaling of model, data, and compute.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a beginner needs to grasp several fundamental concepts related to neural networks, language modeling, and deep learning practices.
- Neural Language Models: At their core, neural language models are neural networks designed to understand and generate human language. Their primary task is often autoregressive language modeling: predicting the next word (or token) in a sequence given the preceding ones. For example, given "The cat sat on the...", a language model assigns probabilities to continuations such as "mat" or "rug". They learn statistical relationships and patterns in language by being trained on vast amounts of text data.
- Transformer Architecture: The Transformer is a neural network architecture introduced in 2017 that revolutionized NLP. Unlike earlier architectures such as Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs), which process sequences sequentially, Transformers rely heavily on a mechanism called self-attention. This lets them weigh the importance of different tokens in the input sequence when processing each token, capturing long-range dependencies far more effectively and in parallel. Key components include:
  - Self-Attention: A mechanism that allows the model to look at other positions in the input sequence when computing a representation for each token.
  - Feed-Forward Networks: Standard neural networks applied to each token independently and identically.
  - Encoder-Decoder Structure (or Decoder-Only): Transformers can have an encoder (for understanding input) and a decoder (for generating output), or be decoder-only for autoregressive language modeling. This paper studies decoder-only Transformer models.
- Cross-entropy Loss: The standard loss function for classification and generative models. In language modeling it quantifies how well a model predicts the next token; a lower value indicates better performance, meaning the model's predicted distribution over the next token is closer to the true distribution (where the true token has probability 1). Training minimizes this loss.
  - Conceptual Definition: Cross-entropy loss measures the dissimilarity between two probability distributions: the true distribution of tokens in the training data and the distribution predicted by the model. It quantifies the "surprise" in the model's predictions: high when the model is confident about an incorrect prediction, low when it is confident about a correct one.
  - Mathematical Formula: For a single token prediction, given the true one-hot distribution y and the model's predicted distribution over the vocabulary, the cross-entropy loss is:
    $ L = -\sum_{i} y_i \log(\hat{y}_i) $
    In language modeling, y_i is 1 for the actual next token and 0 for all other tokens, so the formula simplifies to:
    $ L = -\log(\hat{y}_{\text{true token}}) $
  - Symbol Explanation:
    - L: the cross-entropy loss value.
    - y_i: the true probability of the i-th token in the vocabulary (1 for the correct token, 0 otherwise).
    - \hat{y}_i: the model's predicted probability of the i-th token in the vocabulary.
    - log: the natural logarithm.
    - \hat{y}_{\text{true token}}: the model's predicted probability for the actual next token in the sequence.
- Power Laws: A power law is a functional relationship in which one quantity varies as a power of another, written y = k x^α (equivalently, log y = log k + α log x), where x and y are variables, k is a constant, and α is the exponent. In this paper, power laws describe how the loss changes with model size, dataset size, or compute; the exponent indicates the strength of the relationship. A negative exponent means that as x increases, y decreases (performance improves). (A tiny numeric sketch follows this list.)
- Compute (FLOPs, PF-days): Compute refers to the total number of arithmetic operations (additions, multiplications) performed during training, measured in FLOPs (floating point operations). A PF-day (PetaFLOP-day) is a large unit of compute: one PF-day = 10^15 × 24 × 3600 ≈ 8.64 × 10^19 FLOPs. It provides a standardized way to measure the computational cost of training deep learning models.
- Model Size (N): The number of learnable parameters (weights and biases) in a neural network. Larger models have more parameters and thus higher capacity to learn complex patterns, but also require more compute and data. The paper specifically counts non-embedding parameters, which yields cleaner scaling laws.
- Dataset Size (D): The total number of tokens (words or sub-word units) in the training data. Larger datasets provide more examples to learn from, potentially leading to better generalization.
- Batch Size (B): During neural network training, parameters are updated using gradients computed from a small subset of the data called a batch. Batch size is the number of samples (e.g., sequences of tokens) processed in one training step. Larger batch sizes give more stable gradient estimates but require more memory and can change optimization dynamics.
- Training Steps (S): A training step (parameter update) is one iteration of computing gradients and updating model parameters with an optimizer (e.g., Adam). The total number of steps measures how long a model has been trained.
- Overfitting: A model overfits when it learns the training data too well, capturing noise and patterns that do not generalize, so it performs poorly on unseen (test) data despite a low training loss. The paper explores how to manage overfitting as model size and dataset size vary.
- Sample Efficiency: A model is sample-efficient if it reaches a given level of performance with fewer training samples (data points) or fewer optimization steps. The paper highlights that larger models tend to be more sample-efficient.
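To make the power-law idea concrete, here is a tiny sketch with made-up numbers showing that a power law y = k·x^α is a straight line on log-log axes, so the exponent can be read off as a slope (the constants here are purely illustrative, not from the paper):

```python
import math

# A made-up power law y = k * x**alpha, just to illustrate the definition.
k, alpha = 5.0, -0.08

xs = [1e6, 1e8, 1e10]
ys = [k * x**alpha for x in xs]

# On log-log axes the relationship is linear: log y = log k + alpha * log x,
# so the exponent can be recovered as a slope between any two points.
slope = (math.log(ys[2]) - math.log(ys[0])) / (math.log(xs[2]) - math.log(xs[0]))
print(f"recovered exponent ≈ {slope:.3f} (true value {alpha})")
```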
3.2. Previous Works
The paper contextualizes its findings by referencing several important prior studies and concepts:
- Transformer Architecture (Vaswani et al., 2017 [VSP+17]): This seminal work introduced the Transformer architecture, the foundation for the language models studied in this paper. The Transformer's reliance on self-attention and its ability to parallelize computation were key to enabling the scaling observed here.
- GPT-2 and WebText (Radford et al., 2019 [RWC+19]): GPT-2 demonstrated unprecedented performance in language generation and unsupervised multitask learning. The WebText dataset curated for GPT-2 is the basis of the extended WebText2 dataset used in this paper. GPT-2's success highlighted the potential of scaling and motivated a deeper understanding of scaling laws.
- Empirical Model of Large-Batch Training (McCandlish et al., 2018 [MKAT18]): This work introduced the concept of a critical batch size B_crit and provided an empirical theory of batch-size dependence in training. It argued that training scales efficiently for batch sizes up to B_crit, with diminishing returns beyond it. The present paper leverages [MKAT18] to adjust for batch-size effects and to define the minimal steps S_min and minimum compute C_min, which yield cleaner scaling laws.
- Scaling of Deep Learning (Hestness et al., 2017 [HNA+17]; 2019 [HAD19]): Earlier work by Hestness et al. also investigated scaling relationships between model size and dataset size. A crucial difference: [HNA+17] found super-linear scaling of dataset size with model size, whereas the current paper finds that sub-linear scaling (roughly D ∝ N^0.74) suffices to avoid overfitting for Transformer language models. This distinction is significant for practical resource allocation.
- LSTMs and Universal Transformers (Hochreiter & Schmidhuber, 1997; Dehghani et al., 2018 [DGV+18]): The paper compares Transformer performance against LSTMs (a type of recurrent neural network) to demonstrate the superior scaling of Transformers, especially over longer contexts. It also compares against Universal Transformers, which re-use parameters and show slightly better performance per parameter but require more compute per parameter due to repeated operations.
- BERT and XLNet (Devlin et al., 2018 [DCLT18]; Yang et al., 2019 [YDY+19]): Other prominent Transformer-based models that achieved state-of-the-art NLP results around the time of this paper, demonstrating the general applicability and power of the Transformer architecture.
3.3. Technological Evolution
The language modeling field has seen a rapid evolution, moving from simpler statistical models to Recurrent Neural Networks (RNNs) and LSTMs, and then to Transformers.
- Early Models: Focused on n-grams and basic probabilistic models.
- RNNs and LSTMs: Introduced the ability to process sequences and capture some long-range dependencies, but struggled with very long sequences due to vanishing/exploding gradients and sequential processing bottlenecks.
- Transformers: Marked a significant leap by introducing self-attention, enabling parallel computation, capturing longer-range dependencies, and scaling to much larger model sizes. This architecture became the backbone of highly influential models like BERT, GPT, and XLNet.

This paper's work fits squarely within the Transformer era, providing the scientific underpinnings for the rapid scaling trends observed. It moves beyond simply building larger models to systematically understanding why and how they scale, thus guiding the next generation of large language models.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper offers several core differences and innovations:
- Comprehensive Scaling Laws: While previous works (e.g., Hestness et al.) studied aspects of scaling, this paper provides a more comprehensive and unified power-law framework across model size (N), dataset size (D), and compute (C). It systematically disentangles their individual and combined effects on cross-entropy loss.
- Focus on Non-Embedding Parameters: The paper explicitly shows that excluding embedding parameters from the model size N leads to significantly cleaner and more universal scaling laws (Figure 6). This practical distinction simplifies the analysis of scaling.
- Predictive Overfitting Model: The paper develops a specific equation (1.5) quantifying the dependence of overfitting on both N and D, providing concrete guidance on how much data is needed to prevent overfitting for a given model size (roughly D ≳ 5 × 10^3 N^0.74). This is more detailed than general observations about overfitting.
- Optimal Compute Allocation: A key innovation is deriving a strategy for the optimal allocation of a fixed compute budget. It quantitatively demonstrates that the most compute-efficient approach trains very large models on modest data with early stopping, in contrast to the traditional approach of training smaller models to convergence.
- Sample Efficiency Emphasis: The paper strongly highlights the superior sample efficiency of larger models, providing empirical evidence that they require fewer samples and fewer steps to reach comparable performance. This insight has profound implications for data requirements and training strategies.
- Empirical Rigor across Scales: The study spans unprecedented scales (more than seven orders of magnitude in compute), lending strong empirical weight and universality to its power-law findings, which was not as extensively demonstrated in prior work.

In essence, this paper provides a quantitative, predictive "thermodynamics" for Transformer language models, whereas previous work might be considered more qualitative or focused on specific "microscopic" aspects.
4. Methodology
4.1. Principles
The core idea of the method used in this paper is to empirically discover and characterize scaling laws that govern the performance of neural language models (specifically, Transformer-based models) as key resources—model size, dataset size, and compute—are varied. The underlying theoretical basis is the observation that many complex systems in nature and technology exhibit power-law relationships between macroscopic properties. The intuition is that if such fundamental relationships exist for deep learning models, they can provide a universal, predictive framework for understanding and guiding their development, independent of many specific architectural details.
The methodology focuses on measuring the cross-entropy loss (a standard metric for language model performance) and observing how it changes as different factors are systematically scaled up or down, often across many orders of magnitude. The goal is to find simple, quantitative relationships (the scaling laws) that allow for accurate predictions of performance and optimal resource allocation.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology is built upon extensive empirical experimentation, systematic variation of key factors, and fitting power-law functions to the observed performance.
4.2.1. Model Architecture and Parameterization
The paper primarily uses decoder-only Transformer models. To systematically study scaling laws, the authors define a clear way to count model parameters and estimate compute.
- Transformer Hyperparameters: The architecture is parameterized by:
  - n_layer: number of layers.
  - d_model: dimension of the residual stream (the main data pathway within the Transformer).
  - d_ff: dimension of the intermediate feed-forward layer.
  - d_attn: dimension of the attention output.
  - n_heads: number of attention heads per layer.
  - n_ctx: number of tokens in the input context (usually 1024).
- Model Size (N) Definition: A crucial aspect is the definition of model size. The authors define N as the number of non-embedding parameters, excluding vocabulary and positional embeddings. This choice leads to significantly cleaner scaling laws (as shown in Figure 6, discussed later). The approximate formula is:
  $ N \approx 2 d_{\mathrm{model}} n_{\mathrm{layer}} (2 d_{\mathrm{attn}} + d_{\mathrm{ff}}) $
  With the standard Transformer configuration, where d_attn = d_model and d_ff = 4 d_model, this simplifies to:
  $ N \approx 12 n_{\mathrm{layer}} d_{\mathrm{model}}^2 $
  - Symbol Explanation:
    - N: the number of non-embedding parameters in the Transformer model.
    - d_model: the dimensionality of the model's internal representations.
    - n_layer: the number of Transformer layers.
    - d_attn: the dimensionality of the attention output.
    - d_ff: the dimensionality of the feed-forward layer's hidden state.
- Compute (C) Estimation: The total non-embedding training compute is estimated from the model size, batch size, and number of training steps:
  $ C \approx 6 N B S $
  This formula accounts for the forward pass (roughly 2N FLOPs per token) and the backward pass (approximately twice the forward-pass compute), multiplied by the batch size and number of steps. Numerical values for compute are quoted in PF-days. (A short numeric sketch of these formulas appears in the next subsection.)
  - Symbol Explanation:
    - C: estimated total non-embedding compute.
    - N: number of non-embedding parameters.
    - B: batch size (number of tokens per batch).
    - S: number of training steps (parameter updates).
    - 6: a constant factor representing the FLOPs per parameter per token for the forward and backward passes combined.
4.2.2. Training Procedures
Models were trained using the Adam optimizer [KB14], or Adafactor [SS18] for the larger models due to memory constraints. Training runs typically lasted 2.5 × 10^5 steps with a batch size of 512 sequences of 1024 tokens. Various learning rates and schedules were experimented with, but the final loss was found to be largely independent of the schedule, provided it included a warmup period and a cosine decay.
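A minimal sketch of the parameter and compute bookkeeping from Section 4.2.1, combined with the batch and step settings just described; the layer count and width below are hypothetical values chosen only for illustration:

```python
# Rough parameter and compute estimates for a decoder-only Transformer,
# following N ≈ 12 * n_layer * d_model^2 and C ≈ 6 * N * B * S.

PF_DAY_FLOPS = 1e15 * 24 * 3600  # one PetaFLOP-day ≈ 8.64e19 FLOPs


def non_embedding_params(n_layer: int, d_model: int) -> float:
    """Approximate non-embedding parameter count (d_attn = d_model, d_ff = 4*d_model)."""
    return 12 * n_layer * d_model**2


def training_compute_pf_days(n_params: float, tokens_per_batch: int, steps: int) -> float:
    """Total training compute C ≈ 6*N*B*S, converted to PF-days."""
    flops = 6 * n_params * tokens_per_batch * steps
    return flops / PF_DAY_FLOPS


if __name__ == "__main__":
    n_layer, d_model = 24, 1024        # hypothetical shape, for illustration only
    batch_tokens = 512 * 1024          # 512 sequences of 1024 tokens
    steps = 250_000                    # 2.5e5 parameter updates

    N = non_embedding_params(n_layer, d_model)
    C = training_compute_pf_days(N, batch_tokens, steps)
    print(f"N ≈ {N:.2e} non-embedding parameters")
    print(f"C ≈ {C:.1f} PF-days")
```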
4.2.3. Datasets
The primary training dataset was WebText2, an extended version of the WebText dataset [RWC+19].
- Source: Outbound links from Reddit with a minimum of 3 karma, extracted using Newspaper3k.
- Scale: 20.3 million documents, 96 GB of text, and roughly 1.62 × 10^10 words, yielding about 2.29 × 10^10 tokens after byte-pair encoding [SHB15] with a vocabulary size of 50257.
- Test Set: About 6.6 × 10^8 tokens reserved from WebText2 for testing.
- Other Test Distributions: Books Corpus, Common Crawl [Fou], English Wikipedia, and a collection of Internet Books were also used for generalization testing.
4.2.4. Empirical Results and Basic Power Laws (Section 3)
The study involved training a wide variety of models, varying model size (768 to 1.5 billion non-embedding parameters), dataset size (22 million to 23 billion tokens), model shape (depth, width, attention heads, feed-forward dimension), context length, and batch size.
- Approximate Transformer Shape and Hyperparameter Independence: The paper found that Transformer performance depends very weakly on shape parameters (n_layer, n_heads, d_ff) when the total non-embedding parameter count is held fixed (Figure 5). This suggests that overall scale matters more than specific architectural ratios.
- Performance with Non-Embedding Parameter Count (N): When models are trained to near convergence on sufficiently large datasets (effectively the infinite-data limit), the test loss follows a power law in N:
  $ L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N} $
  The fitted values are α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13 non-embedding parameters. The constant N_c depends on the vocabulary size and tokenization and does not have a fundamental meaning. This relationship is shown in Figure 1 (right panel) and Figure 6. Critically, using non-embedding parameters provides a much cleaner trend than including embedding parameters (Figure 6).
- Performance with Dataset Size (D): For models with sufficient capacity trained on a limited dataset with early stopping, the test loss scales as a power law in the dataset size D:
  $ L(D) \approx \left( \frac{D_c}{D} \right)^{\alpha_D} $
  The fitted values are α_D ≈ 0.095 and D_c ≈ 5.4 × 10^13 tokens. This relationship is shown in Figure 1 (middle panel).
- Performance with Compute (C): When compute is limited but the dataset is large enough and the model optimally sized, the loss scales as a power law in compute C:
  $ L(C) \approx \left( \frac{C_c}{C} \right)^{\alpha_C} $
  The naive fit (before adjusting for the optimal batch size) gives α_C ≈ 0.057, with the corresponding constant C_c quoted in PF-days. This relationship is shown in Figure 1 (left panel).
The following figure (Figure 1 from the original paper) summarizes these basic power-law relationships.
The figure consists of three log-log line plots of language-model test loss versus training compute, dataset size, and number of parameters. In each panel the test loss falls as a clear power law, expressed as L(C), L(D), and L(N), as long as the other two factors are not bottlenecks, reflecting smooth and sustained improvement from all three.
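As a rough numerical illustration (not from the paper), the sketch below evaluates the fitted L(N) and L(D) power laws using the constants quoted above; the model and dataset sizes plugged in are arbitrary examples, and the constants are tokenization-dependent:

```python
# Evaluate L(N) = (N_c / N)^alpha_N and L(D) = (D_c / D)^alpha_D
# with the fitted constants reported in the paper.

ALPHA_N, N_C = 0.076, 8.8e13   # loss vs. non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # loss vs. dataset size in tokens


def loss_from_params(n: float) -> float:
    """Test loss predicted from model size alone (infinite data, trained to convergence)."""
    return (N_C / n) ** ALPHA_N


def loss_from_data(d: float) -> float:
    """Test loss predicted from dataset size alone (large model, early stopping)."""
    return (D_C / d) ** ALPHA_D


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10):      # example model sizes (arbitrary)
        print(f"N = {n:.0e}  ->  L(N) ≈ {loss_from_params(n):.2f} nats/token")
    for d in (1e9, 1e10, 1e11):     # example dataset sizes (arbitrary)
        print(f"D = {d:.0e}  ->  L(D) ≈ {loss_from_data(d):.2f} nats/token")
```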
4.2.5. Charting the Infinite Data Limit and Overfitting (Section 4)
This section integrates the scaling laws for N and D to model overfitting.
- Proposed L(N, D) Equation: The paper proposes a combined scaling law for test loss that depends simultaneously on model size and dataset size:
  $ L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D} \right]^{\alpha_D} $
  - Symbol Explanation:
    - L(N, D): the cross-entropy loss as a function of model size and dataset size.
    - N_c, α_N: parameters from the loss vs. model size power law.
    - D_c, α_D: parameters from the loss vs. dataset size power law.
  - Rationale: This functional form satisfies three principles:
    - It can absorb changes of vocabulary and tokenization through rescalings of N_c and D_c.
    - It correctly approaches L(D) when N → ∞ (infinite-capacity model) and L(N) when D → ∞ (infinite data).
    - It has an analytic expansion in 1/D with integer powers, motivated by overfitting scaling with the variance (roughly 1/D) or the signal-to-noise ratio of the dataset.
- Results for L(N, D): Fitting this equation to the empirical data yields the following parameters (Table 2 of the original paper):

| Parameter | α_N | α_D | N_c | D_c |
|---|---|---|---|---|
| Value | 0.076 | 0.103 | 6.4 × 10^13 | 1.8 × 10^13 |

  The fit is excellent, except for very small datasets where overfitting sets in extremely early. The universality of overfitting is demonstrated by showing that the extent of overfitting (the fractional increase in loss) depends predominantly on the ratio N^(α_N/α_D) / D. This leads to a practical rule for avoiding significant overfitting:
  $ D \gtrsim (5 \times 10^{3}) \, N^{0.74} $
  This implies that dataset size can grow sub-linearly with model size while still preventing overfitting. A sketch evaluating these formulas appears after Figure 9 below. The following figure (Figure 9 from the original paper) visualizes the L(N, D) relationship.
Figure 9 (from the original paper): early-stopped test loss L(N, D) as a function of dataset size and model size. Left: for large D the loss falls as a power law in N, while for small D overfitting causes performance to stall. Right: the extent of overfitting depends predominantly on the ratio N^(α_N/α_D) / D and is well fit by the proposed curve.
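A minimal sketch of the combined law and the overfitting rule of thumb, assuming the Table 2 constants quoted above (tokenization-dependent) and arbitrary example model sizes:

```python
# Combined loss L(N, D) = [ (N_c/N)^(alpha_N/alpha_D) + D_c/D ]^alpha_D
# and the sub-linear data requirement D ≳ 5e3 * N^0.74 for avoiding overfitting.

ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13  # tokenization-dependent constants (Table 2)


def loss_n_d(n: float, d: float) -> float:
    """Early-stopped test loss as a function of model size n and dataset size d."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D


def tokens_to_avoid_overfitting(n: float) -> float:
    """Rough dataset size (tokens) needed so overfitting stays small for model size n."""
    return 5e3 * n ** 0.74


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10):                  # arbitrary example model sizes
        d_needed = tokens_to_avoid_overfitting(n)
        print(f"N = {n:.0e}: need D ≳ {d_needed:.1e} tokens, "
              f"L(N, D) ≈ {loss_n_d(n, d_needed):.2f} nats/token")
```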
4.2.6. Scaling Laws with Model Size and Training Time (Section 5)
This section focuses on how the loss scales with model size N and training time (measured by training steps S).
- Adjustment for Training at B ≫ B_crit: The paper incorporates findings from [MKAT18] regarding the critical batch size B_crit to normalize training steps. B_crit is the batch size at which training achieves a good balance between time efficiency and compute efficiency. It was found that B_crit depends on the current loss, but not directly on model size. The relationship between training steps S and data examples processed E at a fixed loss is:
  $ \left( \frac{S}{S_{\mathrm{min}}} - 1 \right) \left( \frac{E}{E_{\mathrm{min}}} - 1 \right) = 1 $
  - Symbol Explanation:
    - S: number of training steps taken.
    - S_min: minimum number of steps required to reach the loss (at very large batch size).
    - E: total data examples processed (E = B · S).
    - E_min: minimum number of data examples required to reach the loss (at very small batch size).
  This relation defines the critical batch size as:
  $ B_{\mathrm{crit}}(L) \equiv \frac{E_{\mathrm{min}}}{S_{\mathrm{min}}} $
  Empirically, B_crit follows a power law in the loss:
  $ B_{\mathrm{crit}}(L) \approx \frac{B_{*}}{L^{1/\alpha_B}} $
  The fitted values are B_* ≈ 2 × 10^8 tokens and α_B ≈ 0.21, so B_crit increases as the loss decreases (i.e., as performance improves). The following figure (Figure 10 from the original paper) shows this relationship.

  Figure 10 (from the original paper): the critical batch size B_crit versus training loss. B_crit rises as a power law as the loss falls, roughly doubling for every 13% decrease in loss; the plot also includes the prediction based on the gradient noise scale.

  To account for actual training batch sizes B not being equal to B_crit, the minimum steps S_min and minimum compute C_min are defined as adjusted quantities:
  $ S_{\mathrm{min}}(S) \equiv \frac{S}{1 + B_{\mathrm{crit}}(L)/B} \qquad (\text{minimum steps, at } B \gg B_{\mathrm{crit}}) $
  $ C_{\mathrm{min}}(C) \equiv \frac{C}{1 + B/B_{\mathrm{crit}}(L)} \qquad (\text{minimum compute, at } B \ll B_{\mathrm{crit}}) $
  - Symbol Explanation:
    - S_min: the effective minimum number of steps if training had been done at a very large batch size.
    - C_min: the effective minimum compute if training had been done at a very small batch size.
    - B: the actual batch size used in training.
    - B_crit(L): the critical batch size at the current loss.
    - S: actual training steps.
    - C: actual compute used.
- Results for L(N, S_min): Using the adjusted S_min, the loss as a function of model size and training time is modeled by:
  $ L(N, S_{\mathrm{min}}) = \left( \frac{N_c}{N} \right)^{\alpha_N} + \left( \frac{S_c}{S_{\mathrm{min}}} \right)^{\alpha_S} $
  The fitted parameters are (Table 3 of the original paper):

| Parameter | α_N | α_S | N_c | S_c |
|---|---|---|---|---|
| Value | 0.077 | 0.76 | 6.5 × 10^13 | 2.1 × 10^3 |

  This equation fits the learning curves well, especially after an initial transient period. The power-law dependence on S_min suggests that the optimizer dynamics and loss-landscape properties are roughly independent of model size. (A combined sketch of these formulas appears at the end of this subsection.)
The following figure (Figure 11 from the original paper) illustrates how performance follows `L(N,S)` under fixed compute or training steps.

*该图像是论文中图11的图表,展示了在固定总计算预算或训练步数时,测试损失与模型参数量的关系。左图显示不同计算预算下的测试损失,右图展示不同训练步数下的测试损失,体现了性能随模型规模和训练步数的变化趋势,符合公式 `L ( N , S )`。*
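A minimal sketch tying together the critical batch size, the step adjustment, and the learning-curve model, assuming the fitted constants quoted in this subsection (B_* ≈ 2 × 10^8 tokens, α_B ≈ 0.21, and the Table 3 values); the example loss, batch size, and model sizes are arbitrary and the constants are tokenization-dependent:

```python
# Critical batch size, batch-size-adjusted steps, and the learning-curve model
# L(N, S_min) = (N_c/N)^alpha_N + (S_c/S_min)^alpha_S.

B_STAR, ALPHA_B = 2e8, 0.21            # B_crit(L) = B_* / L^(1/alpha_B)
ALPHA_N, N_C = 0.077, 6.5e13           # Table 3 fit (tokenization-dependent)
ALPHA_S, S_C = 0.76, 2.1e3


def critical_batch_size(loss: float) -> float:
    """Critical batch size in tokens; grows as the loss falls."""
    return B_STAR / loss ** (1.0 / ALPHA_B)


def adjusted_min_steps(steps: float, batch_tokens: float, loss: float) -> float:
    """S_min: steps that would have sufficed at a very large batch size."""
    return steps / (1.0 + critical_batch_size(loss) / batch_tokens)


def loss_n_smin(n_params: float, s_min: float) -> float:
    """Learning-curve model: loss of a size-n model after s_min adjusted steps."""
    return (N_C / n_params) ** ALPHA_N + (S_C / s_min) ** ALPHA_S


if __name__ == "__main__":
    batch_tokens = 512 * 1024          # 512 sequences of 1024 tokens
    current_loss = 3.0                 # arbitrary example loss in nats/token
    s_min = adjusted_min_steps(steps=1e5, batch_tokens=batch_tokens, loss=current_loss)
    print(f"B_crit ≈ {critical_batch_size(current_loss):.2e} tokens, S_min ≈ {s_min:.2e}")
    for n in (1e7, 1e8, 1e9):          # arbitrary model sizes
        print(f"N = {n:.0e}: predicted L(N, S_min) ≈ {loss_n_smin(n, s_min):.2f} nats/token")
```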
4.2.7. Optimal Allocation of the Compute Budget (Section 6)
This section leverages the derived scaling laws to determine the most efficient way to spend a given compute budget.
- Optimal Performance and Allocations: By adjusting for batch-size inefficiency to use C_min, a cleaner power law for loss vs. compute is obtained:
  $ L(C_{\mathrm{min}}) = \left( \frac{C_c^{\mathrm{min}}}{C_{\mathrm{min}}} \right)^{\alpha_C^{\mathrm{min}}} $
  The fitted values are α_C^min ≈ 0.050 and C_c^min ≈ 3.1 × 10^8 PF-days. This is the most reliable scaling law for extrapolating to larger compute. The following figure (Figure 13 from the original paper) shows the adjusted trend.

  Figure 13 (from the original paper): test loss versus compute (PF-days) when training well below the critical batch size, with two fitted power-law forms shown; curves for different numbers of layers follow the same trend.

  The optimal model size for a given compute budget grows very rapidly:
  $ N(C_{\mathrm{min}}) \propto (C_{\mathrm{min}})^{0.73} $
  so for every 10x increase in compute, the optimal model size increases by roughly 5x. The optimal batch size scales as B ∝ (C_min)^0.24, while the optimal number of steps increases only very slowly:
  $ S_{\mathrm{min}} \propto (C_{\mathrm{min}})^{0.03} $
  This implies that as compute increases, it should primarily be spent on making models larger, with only marginal increases in training steps or dataset size; compute-efficient training therefore trains very large models and stops significantly before convergence. The following figure (Figure 14 from the original paper) illustrates the optimal model size and number of optimization steps as functions of the compute budget.

  The figure shows the relationship between the compute budget and the optimal model size (left) and number of optimization steps (right). The optimal model size grows rapidly with compute, while the batch-adjusted step count grows slowly; a fixed-batch-size scheme requires faster step growth.

- Predictions from L(N, S_min): The theoretical derivation from L(N, S_min) (Appendix B) predicts the power-law exponent for L(C_min):
  $ \alpha_C^{\mathrm{min}} \equiv \frac{1}{1/\alpha_S + 1/\alpha_B + 1/\alpha_N} \approx 0.054 $
  which closely matches the empirical α_C^min ≈ 0.050. It also predicts the scaling of the optimal model size:
  $ N(C_{\mathrm{min}}) \propto (C_{\mathrm{min}})^{\alpha_C^{\mathrm{min}}/\alpha_N} \approx (C_{\mathrm{min}})^{0.71} $
  which matches the empirical exponent of 0.73. These agreements validate the predictive power of the framework (a quick numerical check of this arithmetic appears at the end of this subsection).
- Contradictions and a Conjecture: The authors note that at scales far beyond those empirically studied, the loss predicted by L(C_min) eventually falls below what should be possible given the sub-linear growth of the data processed in compute-efficient training. Specifically, the data required to avoid overfitting grows roughly as D ∝ N^0.74 ∝ C_min^0.54, while the data actually processed in compute-efficient training (training for about one epoch at B_crit) grows only as roughly C_min^0.27. Since the former grows faster than the latter, the model eventually becomes data-bottlenecked. This leads to an intersection point where the L(C_min) prediction (driven by compute scaling) meets the L(D) prediction (driven by data scaling and overfitting), estimated at:
  $ C^{*} \sim 10^{4} \ \mathrm{PF\text{-}days}, \quad N^{*} \sim 10^{12} \ \mathrm{parameters}, \quad D^{*} \sim 10^{12} \ \mathrm{tokens}, \quad L^{*} \sim 1.7 \ \mathrm{nats/token} $
  This point suggests a breakdown of the scaling laws or possibly a fundamental limit. The authors conjecture that L* might provide a rough estimate of the entropy per token of natural language, representing maximal performance for Transformer language models. The following figure (Figure 15 from the original paper) illustrates this potential contradiction.

  The figure shows test loss versus compute (PF-days), with curves for L(C_min) and L(D(C)) revealing how model performance behaves under different compute budgets and where the two predictions intersect.
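As a quick numerical check of the exponent derivation above, using the rounded exponents quoted in this analysis (α_S ≈ 0.76, α_B ≈ 0.21, α_N ≈ 0.076); small differences from the paper's quoted 0.054 and 0.71 come from this rounding:

```python
# Predicted compute exponent and optimal-model-size exponent from the fitted exponents.
ALPHA_S, ALPHA_B, ALPHA_N = 0.76, 0.21, 0.076

alpha_c_min = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)
n_opt_exponent = alpha_c_min / ALPHA_N

# With these rounded inputs this prints roughly 0.052 and 0.68; the paper quotes
# ≈ 0.054 and ≈ 0.71, against empirical fits of ≈ 0.050 and ≈ 0.73.
print(f"predicted alpha_C^min ≈ {alpha_c_min:.3f}")
print(f"predicted exponent for N(C_min) ≈ {n_opt_exponent:.2f}")
```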
4.2.8. Notation Summary
For clarity, the paper defines the following notation:
- L: the cross-entropy loss in nats, typically averaged over the tokens in a context.
- N: the number of model parameters, excluding vocabulary and positional embeddings.
- C ≈ 6NBS: an estimate of the total non-embedding training compute, quoted in PF-days.
- D: the dataset size in tokens.
- B_crit: the critical batch size, which balances time efficiency and compute efficiency.
- C_min: an estimate of the minimum non-embedding compute needed to reach a given loss (training at B ≪ B_crit).
- S_min: an estimate of the minimal number of training steps needed to reach a given loss (training at B ≫ B_crit).
- α_N, α_D, α_C, α_C^min, α_B, α_S: power-law exponents for the scaling of the loss as L ∝ 1/X^{α_X}, where X can be N, D, C, C_min, B, or S.
4.2.9. Summary of Power Laws (Appendix A)
The paper provides a concise summary of the key scaling laws and their fitted parameters.
The following are the results from Table 4 of the original paper:
| Parameters | Data | Compute | Batch Size | Equation |
|---|---|---|---|---|
| N | ∞ | ∞ | Fixed | L(N) = (N_c / N)^{α_N} |
| ∞ | D | Early Stop | Fixed | L(D) = (D_c / D)^{α_D} |
| Optimal | ∞ | C | Fixed | L(C) = (C_c / C)^{α_C} (naive) |
| N_opt | D_opt | C_min | B ≪ B_crit | L(C_min) = (C_c^min / C_min)^{α_C^min} |
| N | D | Early Stop | Fixed | L(N, D) = [(N_c / N)^{α_N/α_D} + D_c / D]^{α_D} |
| N | ∞ | S steps | B | L(N, S) = (N_c / N)^{α_N} + (S_c / S_min)^{α_S} |
The following are the results from Table 5 of the original paper:
| Power Law | Scale (tokenization-dependent) |
|---|---|
| α_N ≈ 0.076 | N_c ≈ 8.8 × 10^13 params (non-embed) |
| α_D ≈ 0.095 | D_c ≈ 5.4 × 10^13 tokens |
| α_C ≈ 0.057 (naive) | C_c, in PF-days |
| α_C^min ≈ 0.050 | C_c^min ≈ 3.1 × 10^8 PF-days |
| α_B ≈ 0.21 | B_* ≈ 2.1 × 10^8 tokens |
| α_S ≈ 0.76 | S_c ≈ 2.1 × 10^3 steps |
The following are the results from Table 6 of the original paper:
| Compute-Efficient Value | Power Law | Scale |
|---|---|---|
| Optimal model size N_opt | ∝ C_min^0.73 | params |
| Optimal batch size B | ∝ C_min^0.24 | tokens |
| Minimum serial steps S_min (lower bound) | ∝ C_min^0.03 | steps |
| Data processed D | ∝ C_min^0.27 | tokens |
5. Experimental Setup
5.1. Datasets
The experiments primarily used WebText2, an expanded version of the WebText dataset [RWC+19].
- Source: WebText2 was constructed from outbound Reddit links that received at least 3 karma, for the period up to October 2018. This karma threshold served as a heuristic for quality, indicating content that people found interesting or useful. The text was extracted using the Newspaper3k Python library.
- Scale and Characteristics: The dataset consists of 20.3 million documents containing 96 GB of text and roughly 1.62 × 10^10 words. After byte-pair encoding [SHB15] with a vocabulary size of 50257, this amounts to about 2.29 × 10^10 tokens.
- Domain: WebText2 represents a broad collection of internet text, covering a diverse range of topics, writing styles, and factual information. This makes it suitable for training general-purpose language models.
- Test Sets: A portion of WebText2 (about 6.6 × 10^8 tokens) was reserved as a test set. Additionally, for evaluating generalization performance across different text distributions, the models were tested on samples from:
  - Books Corpus
  - Common Crawl [Fou]
  - English Wikipedia
  - A collection of publicly available Internet Books
- Choice Rationale: These datasets were chosen for their scale and relevance to language modeling tasks. WebText2 provides a vast and diverse source of "good quality" internet text, suitable for training very large language models. The additional test datasets allow for robust evaluation of generalization capabilities, demonstrating whether the scaling laws hold across different text domains.
5.2. Evaluation Metrics
The principal performance metric used throughout the paper is the cross-entropy loss.
- Cross-entropy Loss (in nats):
  - Conceptual Definition: Cross-entropy loss quantifies the difference between the probability distribution the language model predicts for the next token and the true probability distribution (which assigns probability 1 to the actual next token and 0 to all others). It measures how "surprised" the model is by the actual next token; a lower value indicates better performance. The unit nats means the natural logarithm (base e) is used, and the loss is typically averaged over the tokens in a context (usually 1024 tokens in this paper).
  - Mathematical Formula: For a given sequence of tokens, the language model predicts the probability of each token given the preceding ones. The cross-entropy loss for the sequence is the negative log-likelihood, normalized by the sequence length T:
    $ L = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1}) $
  - Symbol Explanation:
    - L: the cross-entropy loss averaged over the tokens in the context.
    - T: the length of the token sequence (or context).
    - x_t: the t-th token in the sequence.
    - P(x_t | x_1, ..., x_{t-1}): the probability the language model assigns to the t-th token, given the previous t-1 tokens.
    - log: the natural logarithm.
  (A minimal numeric sketch of this computation appears below.)
5.3. Baselines
The paper primarily investigates the scaling laws of Transformer models. However, for comparative analysis, it also mentions and evaluates other architectures:
- LSTM Models: Long Short-Term Memory networks are a type of recurrent neural network (RNN) that were state-of-the-art for sequence modeling before Transformers. The paper compares LSTM performance against Transformers (Figure 7) to highlight the Transformers' superior performance, particularly for longer contexts. The LSTMs were trained with the same dataset and context length for a fair comparison.
- Universal Transformers: These are a variant of Transformer models that re-use parameters across steps (like RNNs) instead of having distinct parameters for each layer. The paper compares them to standard Transformers (Figure 17 in Appendix D.2). Universal Transformers may perform slightly better as a function of parameter count due to parameter re-use, but slightly worse when compute is considered, because of the additional operations per parameter.

These baselines are representative: they cover both older (but still relevant) recurrent architectures and close variants of the Transformer architecture, providing context for the Transformer's specific scaling properties. The primary goal of the paper is not to outperform these baselines, but to understand the scaling laws of Transformers themselves.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly validate the effectiveness of the proposed scaling laws as a predictive framework for neural language model performance. The consistent power-law relationships across vast scales demonstrate that performance is largely determined by scale rather than intricate architectural tuning.
- Weak Dependence on Model Shape: Figure 5 illustrates that Transformer performance is remarkably insensitive to variations in hyperparameters such as the feed-forward ratio, aspect ratio (depth vs. width), or attention head dimension when the total non-embedding parameter count is kept constant. Loss varies by only a few percent over a wide range of shapes; for example, the aspect ratio can vary by a factor of 40 with minimal impact. This suggests that resource allocation should prioritize increasing overall scale over fine-tuning specific architectural configurations. The following figure (Figure 5 from the original paper) shows the mild effect of model shape on loss.

  The figure shows, in three panels at fixed parameter count, how varying the feed-forward ratio, the aspect ratio, and the attention head dimension each changes the loss only slightly, indicating that structural changes have little effect on performance.

- Importance of Non-Embedding Parameters: Figure 6 highlights a critical methodological choice. When embedding parameters are included in the model size count (left panel), the performance trend appears less clear, with different layer counts tracing distinct curves. When embedding parameters are excluded (right panel), models of different depths converge to a single, clear power-law trend. This validates the paper's decision to define N as the number of non-embedding parameters; only models with very few layers or extreme depth-to-width ratios deviate significantly. The following figure (Figure 6 from the original paper) compares performance trends with and without embedding parameters.

  Figure 6 (from the original paper): test loss versus parameter count with embeddings included (left) versus excluded (right). With embeddings included, the number of layers noticeably affects the curves; with embeddings excluded, models of different depths collapse onto one trend, except for extreme aspect ratios or fewer than 2 layers.

- Transformer vs. LSTM Performance: Figure 7 demonstrates that Transformers asymptotically outperform LSTMs as model size increases. LSTMs can perform comparably on early tokens in a context but plateau on later tokens, indicating limitations in capturing long-range dependencies relative to Transformers. This underscores why Transformers are the appropriate architecture for studying scaling at these sizes. The following figure (Figure 7 from the original paper) compares LSTM and Transformer performance.

  The figure shows two line plots: on the left, test loss versus parameter count, with Transformers markedly below LSTMs at larger sizes; on the right, Transformer per-token test loss falling with position in the context, for models from roughly 400K to 300M parameters.

- Smooth Power-Law Scalings: Figure 1 (in the Executive Summary) and Figure 23 show that the cross-entropy loss follows smooth power-law relationships with model size (N), dataset size (D), and compute (C). These trends span multiple orders of magnitude without deviating at the upper end, which allows reliable extrapolation and prediction of model performance. Figure 23 specifically contrasts the power-law fit with a logarithmic fit, demonstrating the qualitative superiority of the power law for L(N). The following figure (Figure 23 from the original paper) compares power-law and logarithmic fits.

  The figure shows parameter count versus test loss at convergence, with the trend better described by a power law than by a logarithmic fit.

- Universality of Overfitting and L(N, D): Figure 9 vividly demonstrates how loss depends on both model size and dataset size. For large datasets, loss continues to decrease as N grows; for smaller datasets, increasing N eventually leads to overfitting and a performance plateau. The right panel confirms that the extent of overfitting is predictable and depends on the ratio N^(α_N/α_D) / D, validating the combined scaling law L(N, D). This provides crucial guidance on how to scale data alongside models to avoid performance penalties. The following figure (Figure 9 from the original paper) illustrates the universality of overfitting.

  The embedded image shows test loss of a roughly 1.5-billion-parameter model on different datasets as network depth varies; depth has little effect on generalization, which mainly tracks training-distribution performance, and a 12-layer model overfits on the Internet Books dataset, with early-stopping performance shown.

- Universality of Training and Sample Efficiency: Figure 4 (right panel) and Figure 2 show that learning curves for various model sizes (after an initial transient) are accurately described by power laws in training steps, suggesting a universal loss-landscape structure. Importantly, Figure 2 and Figure 19 clearly demonstrate that larger models are significantly more sample-efficient, reaching the same level of performance with fewer optimization steps and fewer data points. This means they learn faster and more effectively from data. The following figure (Figure 4 from the original paper) shows performance with varying model and data size, or model and training steps.

  Figure 4 (from the original paper): left, early-stopped test loss L(N, D) as a function of dataset size and model size, following equation (1.5); right, after an initial transient, learning curves for all model sizes are fit by the parametric form of equation (1.6).

  The following figure (Figure 2 from the original paper) shows language model training runs illustrating sample efficiency.

  The figure shows test loss versus tokens processed and versus compute (PF-days) for models from roughly 10^3 to 10^9 parameters (curve color indicating model size). Larger models reach markedly lower loss at a given number of tokens or amount of compute, and the most compute-efficient training typically stops well before convergence.

  The following figure (Figure 19 from the original paper) provides another view of sample efficiency.

  The figure shows two contour plots of the minimum training steps and minimum training examples needed to reach a fixed test loss versus model size; both decrease markedly as the model grows, showing the greater sample efficiency of larger models.

- Optimal Compute Allocation and Early Stopping: The paper's most impactful finding is illustrated in Figure 3 and detailed further in Figures 12 and 14. For a fixed compute budget, optimal performance is achieved primarily by increasing model size (N), with only modest increases in batch size (B) and negligible increases in training steps (S). Compute-efficient training therefore trains very large models and stops significantly before convergence (i.e., before the loss flattens out, as seen in Figure 2). Figure 12 also shows that while there is an optimal model size for a given compute budget, modest deviations from it incur only small compute penalties. The following figure (Figure 3 from the original paper) illustrates how to scale model size, batch size, and serial steps.

  The figure shows the multiplicative contributions of model size, batch size, and serial steps to the optimal allocation at different compute budgets (PF-days); as compute grows, the increase in model size far outpaces the other factors.

  The following figure (Figure 12 from the original paper) shows the effect of training suboptimally sized models.

  The figure shows two panels on optimizing model size at a fixed compute budget: on the left, the extra compute required when the model size deviates from the optimum; on the right, the trend that smaller-than-optimal models need more training steps and larger-than-optimal models need fewer, capturing the excess in serial steps.

- Generalization Across Data Distributions: Figure 8 shows that generalization performance on other data distributions (e.g., Books Corpus, Wikipedia) improves smoothly with model size, in direct parallel with performance on the WebText2 training distribution. The loss on out-of-distribution datasets maintains a roughly constant offset from the in-distribution loss. This indicates that generalization is primarily a function of the in-distribution validation loss and model size, rather than of training duration or model depth (Figure 24 in Appendix D.8). The following figure (Figure 8 from the original paper) shows generalization to other test datasets.

  The embedded image shows parameter count versus test loss at convergence, with the trend better described by a power law than by a logarithmic fit.

- Context Dependence: Figures 20 and 21 explore the loss per token as a function of its position within the 1024-token context. Loss scales predictably as a power law in token position. Larger models show faster improvements at early tokens, suggesting they are more efficient at detecting patterns from less contextual information; large models thus learn both short-range and long-range correlations more effectively. The following figure (Figure 20 from the original paper) shows the power-law dependence of performance on context position.

  Figure 20 (from the original paper): left, per-token loss falling as a power law with token position in a 1024-token context, across model sizes and training times; right, the change in test loss as the number of training steps increases.

  The following figure (Figure 21 from the original paper) shows performance at different context positions versus model size.

  Figure 21 (from the original paper): per-token test loss at different context positions as the number of parameters grows; solid lines show tokens in a 1024-token context, dashed lines show training with a short context of length 8, which gives lower loss on the earliest tokens.
6.2. Data Presentation (Tables)
The paper provides several key tables summarizing the scaling laws, fitted parameters, and optimal compute-efficient training trends.
The following are the results from Table 1 of the original paper:
| Operation | Parameters | FLOPs per Token |
|---|---|---|
| Embed | (n_vocab + n_ctx) d_model | 4 d_model |
| Attention: QKV | n_layer d_model 3 d_attn | 2 n_layer d_model 3 d_attn |
| Attention: Mask | — | 2 n_layer n_ctx d_attn |
| Attention: Project | n_layer d_attn d_model | 2 n_layer d_attn d_model |
| Feedforward | n_layer 2 d_model d_ff | 2 n_layer 2 d_model d_ff |
| De-embed | — | 2 d_model n_vocab |
| Total (Non-Embedding) | N = 2 d_model n_layer (2 d_attn + d_ff) | C_forward = 2 N + 2 n_layer n_ctx d_attn |
Table 1: Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading terms such as nonlinearities, biases, and layer normalization are omitted.
The following are the results from Table 2 of the original paper:
| Parameter | α_N | α_D | N_c | D_c |
|---|---|---|---|---|
| Value | 0.076 | 0.103 | 6.4 × 10^13 | 1.8 × 10^13 |
Table 2: Fits to L(N,D)
The following are the results from Table 3 of the original paper:
| Parameter | α_N | α_S | N_c | S_c |
|---|---|---|---|---|
| Value | 0.077 | 0.76 | 6.5 × 10^13 | 2.1 × 10^3 |
Table 3: Fits to L(N,S)
The following are the results from Table 4 of the original paper:
| Parameters | Data | Compute | Batch Size | Equation |
|---|---|---|---|---|
| N | ∞ | ∞ | Fixed | L(N) = (N_c / N)^{α_N} |
| ∞ | D | Early Stop | Fixed | L(D) = (D_c / D)^{α_D} |
| Optimal | ∞ | C | Fixed | L(C) = (C_c / C)^{α_C} (naive) |
| N_opt | D_opt | C_min | B ≪ B_crit | L(C_min) = (C_c^min / C_min)^{α_C^min} |
| N | D | Early Stop | Fixed | L(N, D) = [(N_c / N)^{α_N/α_D} + D_c / D]^{α_D} |
| N | ∞ | S steps | B | L(N, S) = (N_c / N)^{α_N} + (S_c / S_min)^{α_S} |
Table 4: Key trend equations.
The following are the results from Table 5 of the original paper:
| Power Law | Scale (tokenization-dependent) |
|---|---|
| α_N ≈ 0.076 | N_c ≈ 8.8 × 10^13 params (non-embed) |
| α_D ≈ 0.095 | D_c ≈ 5.4 × 10^13 tokens |
| α_C ≈ 0.057 (naive) | C_c, in PF-days |
| α_C^min ≈ 0.050 | C_c^min ≈ 3.1 × 10^8 PF-days |
| α_B ≈ 0.21 | B_* ≈ 2.1 × 10^8 tokens |
| α_S ≈ 0.76 | S_c ≈ 2.1 × 10^3 steps |
Table 5: Key parameters to trend fits.
The following are the results from Table 6 of the original paper:
| Compute-Efficient Value | Power Law | Scale |
|---|---|---|
| Optimal model size N_opt | ∝ C_min^0.73 | params |
| Optimal batch size B | ∝ C_min^0.24 | tokens |
| Minimum serial steps S_min (lower bound) | ∝ C_min^0.03 | steps |
| Data processed D | ∝ C_min^0.27 | tokens |
Table 6: Trends for compute-efficient training.
6.3. Ablation Studies / Parameter Analysis
The authors conducted several analyses that function as ablation studies or parameter analyses to understand the robustness and specifics of their scaling laws.
- Model Shape (Depth vs. Width): As discussed, Figure 5 demonstrates that Transformer performance is largely invariant to specific architectural choices such as depth, width, feed-forward dimension, or number of attention heads, as long as the total non-embedding parameter count remains fixed. The loss varies by only a few percent across a wide range of these hyperparameters, suggesting that total capacity, as represented by N, is the dominant factor rather than how that capacity is distributed across layers or dimensions.
- Impact of Embedding Parameters: Figure 6 serves as an ablation study on the definition of model size. It clearly shows that including embedding parameters in N obscures the clean power-law trend, while excluding them yields a much more consistent scaling law across different model depths. This implies that the embedding matrix can be made smaller without hurting performance, a finding corroborated by other research.
- Learning Rate Schedules and Error Analysis: Figure 22 (in Appendix D.6) analyzes various learning rate schedules (cosine decay, linear decay, faster or slower decays). The conclusion is that the choice of schedule is largely irrelevant to the final loss at convergence, provided the total summed learning rate is sufficiently large and the schedule includes a warmup period and a final decay. Variations among schedules appear to be within statistical noise, suggesting the scaling laws are robust to these optimization hyperparameters. The authors also note that larger models generally require smaller learning rates to prevent divergence. The following figure (Figure 22 from the original paper) shows the learning rate schedule scan.

  The figure shows, on the left, the different learning-rate decay schedules tried as a function of training step and, on the right, a scatter of cumulative learning rate versus loss, reflecting the small effect of the schedule on performance.

- Generalization and Architecture (Depth): Figure 24 (in Appendix D.8) explicitly shows that generalization performance on other data distributions does not depend on network depth when the total parameter count is held fixed. Instead, generalization is primarily determined by performance on the training distribution, reinforcing that overall scale, rather than specific architectural details like depth, is the key driver of general capabilities. The following figure (Figure 24 from the original paper) shows generalization versus depth.

  The figure shows test loss of a roughly 1.5-billion-parameter model on different datasets as network depth varies; depth has little effect on generalization, which mainly tracks training-distribution performance. A 12-layer model overfits on the Internet Books dataset, and early-stopping performance is shown.
6.4. Critical Batch Size (B_crit)
Figure 10 illustrates the power-law relationship of the critical batch size B_crit with the loss L. The key finding is that B_crit is independent of model size and depends only on the loss being achieved, confirming the predictions from [MKAT18]. The power-law fit shows that B_crit approximately doubles for every 13% decrease in loss.
6.5. Universal Transformers
Figure 17 (in Appendix D.2) compares recurrent Transformers to standard Transformers. Recurrent Transformers, which re-use parameters, perform slightly better when compared solely by parameter count N. However, when compute is factored in (because parameter re-use implies more FLOPs per parameter), they perform slightly worse. This demonstrates that parameter efficiency does not necessarily translate into compute efficiency.
The following figure (Figure 17 from the original paper) compares recurrent Transformers to standard Transformers.
Figure 17 (from the original paper): comparison of recurrent (parameter-sharing) Transformers with standard Transformers. The two panels plot test loss against parameter count under different counting conventions; recurrent Transformers are slightly better at equal parameter count but slightly worse when compared by compute (FLOPs).
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously establishes a set of empirical scaling laws that govern the performance of neural language models, particularly those based on the Transformer architecture. The core finding is that cross-entropy loss scales predictably as a power-law with model size (N), dataset size (D), and compute (C) used for training, across many orders of magnitude. In contrast, architectural specifics like network width or depth have only minimal effects. These relationships allow for precise modeling of overfitting and training speed. A significant implication is that compute-efficient training involves training very large models on relatively modest amounts of data and stopping significantly before full convergence, as larger models are substantially more sample-efficient. The research provides a robust, predictive framework for advancing language modeling research.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and suggest future research directions:
- Lack of Theoretical Understanding: A major limitation is the absence of a solid theoretical foundation for the observed scaling laws. The underlying reasons for these power-law relationships, especially with model size and compute, remain mysterious; without a theory it is difficult to predict when the laws might break down or how they might be systematically corrected.
- Uncertainty in Extrapolation: The predictions for the critical batch size B_crit might not be reliable for loss values far outside the explored range. Changes in B_crit could significantly alter the trade-off between data parallelism and serial training steps, affecting training time.
- Limited Small-Data Regime Investigation: The study did not thoroughly investigate the small-data regime, and the fits for L(N, D) were less accurate for the smallest dataset sizes.
- Fixed Regularization and Augmentation: The regularization (e.g., dropout probability) and data augmentation strategies were not re-optimized while varying dataset and model size. Improvements in these areas could quantitatively or qualitatively alter the results.
- Compute Estimation Limitations: The compute estimate excluded context-dependent terms (those proportional to n_ctx). This could confound the scalings for very long contexts, roughly when d_model becomes smaller than about n_ctx / 12.
- Learning Rate Tuning: While learning rate schedules were explored, the optimal choice of learning rate itself is sensitive to the target loss. The authors did not experiment with higher learning rates for short training runs, which might be beneficial.
- The "Contradiction" at Extreme Scales: The most intriguing limitation is the self-identified "contradiction" in which the scaling laws for compute and data imply a breakdown at extremely large scales (estimated around C* ~ 10^4 PF-days, N* ~ 10^12 parameters, D* ~ 10^12 tokens, and L* ~ 1.7 nats/token). This suggests that the current scaling laws must eventually break down, or that this point represents a fundamental limit (e.g., reaching the entropy of natural language).
Future research directions suggested include:
- Testing these scaling relations on other generative modeling tasks (e.g., images, audio, video) and on random network distillation, to determine their universality.
- Developing a theoretical framework (a "statistical mechanics") from which these scaling relations can be derived, leading to more precise predictions and an understanding of their limitations.
- Investigating whether continued improvement in loss translates into qualitative improvements on relevant language tasks.
- Further research into model parallelism (e.g., pipelining, wide networks, sparsity, branching) to efficiently train increasingly large models.
- Exploring methods that grow networks as they train, so as to remain on the compute-efficient frontier.
7.3. Personal Insights & Critique
This paper is a landmark study that provided a much-needed empirical foundation for the deep learning era of large language models. My personal insights include:
- Paradigm Shift: The most profound insight is the shift in perspective from intricate hyperparameter tuning to prioritizing sheer scale. Before this paper, much effort went into optimizing architectures or regularization for marginal gains; this work demonstrated that simply making models bigger (and training them appropriately) yields consistent, predictable improvements. This has directly influenced the "bigger is better" trend that led to models like GPT-3 and beyond.
- Practical Guidance for Resource Allocation: The finding that compute-efficient training uses very large models trained with early stopping on moderate data is immensely practical. It advises against training small models to full convergence or trying to assemble impossibly large datasets for every model size. This guidance is invaluable for researchers and organizations with fixed compute budgets.
- Implications for Hardware and Data: The sub-linear scaling of data requirements with model size (roughly D ∝ N^0.74) and the increasing sample efficiency of larger models are crucial. They suggest that while compute scales rapidly, the demand for new, unique data may grow more slowly than previously thought. This has implications for data curation efforts and highlights the immense importance of model parallelism in hardware design.
- Robustness of Transformers: The weak dependence on model shape (Figure 5) speaks to the inherent robustness and flexibility of the Transformer architecture. It implies that Transformers can be adapted to various computational constraints (e.g., favoring width over depth for parallelism) without significant performance trade-offs, as long as the overall parameter count is maintained.
- The "Contradiction" as a Research Frontier: The conjecture that the scaling laws break down at an "intersection point" of the compute and data requirements is fascinating. It suggests there may be a fundamental limit to how much performance can be gained by simply scaling current Transformer language models, or that new architectural innovations will be required to push beyond that limit. This point also offers a potential way to estimate the entropy of language, a long-standing open problem.
Potential Issues/Unverified Assumptions:
- Domain Specificity: While the authors suggest the scaling laws might apply to other generative modeling tasks, this paper's findings are specifically for Transformer language models on natural language data. Their universality needs empirical verification in other domains.
- Generalizability to Fine-tuning/Downstream Tasks: The scaling laws concern cross-entropy loss on the pre-training task. While transfer-learning performance correlates with it, it is not explicitly proven that these laws translate directly into optimal performance on all downstream tasks or fine-tuning scenarios.
- "Good" Data Assumption: The WebText2 dataset is curated for quality (the Reddit karma threshold). The scaling laws might differ for lower-quality or highly noisy datasets.
- Empirical Nature: The core limitation acknowledged by the authors themselves is the empirical nature of the findings. Without theoretical grounding, the scaling laws are observed phenomena rather than derived principles, making extrapolation beyond the observed range potentially risky; the exact exponents and critical points are sensitive to fitting choices.

Overall, this paper provides a powerful "engineering manual" for large language models, shifting the focus toward quantifiable scaling strategies and opening new avenues for theoretical inquiry into the fundamental properties of deep learning and natural language.