EvoLM: In Search of Lost Language Model Training Dynamics
TL;DR Summary
EvoLM offers a systematic analysis of language model training dynamics across pre-training, continued training, fine-tuning, and reinforcement learning. Key findings include diminishing returns from excessive training and the importance of continued training in bridging stages, w
Abstract
Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. We train over 100 LMs with 1B and 4B parameters from scratch, and evaluate both upstream (language modeling) and downstream (problem-solving) capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.
1. Bibliographic Information
1.1. Title
The title of the paper is EvoLM: In Search of Lost Language Model Training Dynamics. It focuses on systematically analyzing the training dynamics of language models across their various lifecycle stages.
1.2. Authors
The authors are:
- Zhenting Qi (Harvard)
- Fan Nie (Stanford)
- Alexandre Alahi (EPFL)
- James Zou (Stanford)
- Himabindu Lakkaraju (Harvard)
- Yilun Du (Harvard)
- Eric Xing (CMU)
- Sham Kakade (Harvard)
- Hanlin Zhang (Harvard)
1.3. Journal/Conference
The paper was posted on 2025-06-19T04:58:47.000Z (UTC). The provided text does not name a journal or conference; the arXiv link indicates that it is a preprint, which is commonly a precursor to publication at a machine learning conference or journal. Given the affiliations and the topic, it likely targets venues such as NeurIPS, ICML, ICLR, or ACL.
1.4. Publication Year
The publication year is 2025, specifically June 19, 2025.
1.5. Abstract
The paper addresses the challenge of understanding the impact of design choices in modern multi-stage language model (LM) training. It introduces EvoLM, a model suite designed for systematic and transparent analysis of LM training dynamics across pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The authors train over 100 LMs with 1B and 4B parameters from scratch, evaluating both upstream (language modeling) and downstream (problem-solving) capabilities, considering both in-domain (ID) and out-of-domain (OOD) generalization. Key findings include diminishing returns from excessive pre-training and post-training, the importance of mitigating forgetting during domain-specific CPT using replay strategies, the crucial role of CPT in bridging training phases, and various trade-offs in SFT and RL configurations. To promote open research, all models, datasets, and the entire training/evaluation pipeline are released.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2506.16029v2 and the PDF link is https://arxiv.org/pdf/2506.16029v2. It is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the difficulty in evaluating the impact of design choices made at each stage of modern language model (LM) training. LM training has become a multi-stage process, including pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). This complexity, coupled with the vast design space and interactions between phases, makes it hard for downstream developers to understand which decisions lead to reliable performance gains.
This problem is important because scaling up LMs has become a dominant paradigm for various applications. However, existing studies often rely on opaque analyses from post-training studies using off-the-shelf base models (without strict control over variables like model size or data) or evaluations based on intermediate checkpoints (which may have suboptimal performance due to incomplete learning rate decay). These issues introduce confounding factors and hinder a clear understanding of training dynamics and how they translate to downstream problem-solving performance.
The paper's entry point is to establish an end-to-end, transparent, and systematic development pipeline. By training LMs from scratch with complete control over all stages and parameters, they aim to explicitly investigate the reasoning capabilities of LMs throughout their entire lifecycle.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Systematic and Transparent Analysis Framework: Introduction of EvoLM, a model suite and framework enabling a systematic and transparent analysis of LM capabilities across the entire lifecycle (pre-training to RL post-training). This includes evaluation on reasoning-intensive upstream cloze tasks and downstream generative tasks, considering both in-domain (ID) and out-of-domain (OOD) generalization.
- Open-Sourced Resources: Release of the models trained from scratch (1B and 4B parameters), their training data for all stages, and a comprehensive, reproducible training and evaluation pipeline. This facilitates open research and allows the community to build upon their findings.
- Key Insights into Training Dynamics:
  - Diminishing Returns: Excessive pre-training and post-training lead to diminishing returns on both upstream and downstream performance, and can even cause degradation on some OOD tasks. Saturation points were observed around 80x to 160x model size for pre-training, and at specific epoch/data amounts for SFT and RL.
  - Model Size vs. Compute: Under limited pre-training budgets, smaller post-trained models can outperform larger counterparts. However, once pre-training tokens reach the saturation regime, increasing model size leads to clear improvements in both ID performance and OOD generalization.
  - Mitigating Catastrophic Forgetting: Domain-specific continued pre-training (CPT) induces catastrophic forgetting of general-domain knowledge. Incorporating a small replay budget (e.g., 5% of pre-training data) effectively mitigates this degradation, benefiting both upstream and downstream performance.
  - Role of CPT: Adequate domain-specific CPT data is crucial. Without it, SFT performance remains suboptimal, and RL can even degrade performance. As CPT data increases, ID downstream performance steadily improves, and SFT models benefit more from RL fine-tuning. Sufficient CPT data also promotes generalization to OOD tasks.
  - SFT and RL Trade-offs:
    - Excessive SFT (especially too many epochs) improves ID performance with diminishing returns but can degrade OOD performance and limit further RL improvements.
    - Excessive RL (epochs or examples) improves downstream performance (ID and OOD) with diminishing returns. Beyond saturation, RL primarily increases the probability of sampling high-quality rollouts rather than fundamentally improving reasoning capabilities (Pass@16 accuracy degrades while Correct Ratio increases).
    - Under constrained downstream data budgets, allocating more data to SFT maximizes ID gains at the expense of weaker OOD generalization, while more RL data improves OOD performance.
  - Intermediate Checkpoints are Unreliable: Using intermediate checkpoints from a longer training run as surrogates for fully trained smaller models is unreliable, as they consistently lag behind models trained with their own complete learning rate schedules.
  - ORM Score as a Proxy: Outcome Reward Model (ORM) scores exhibit strong correlations (0.62 to 0.84) with downstream task accuracy, suggesting they can serve as reliable unsupervised proxy metrics for assessing generation quality during post-training, especially in data-constrained scenarios. This contrasts with validation perplexity, which shows negligible correlation with downstream task accuracy in post-trained models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following core concepts in language model development:
- Language Models (LMs): Statistical models that learn the probability distribution of sequences of words or tokens. They predict the next word in a sequence given the preceding words. Autoregressive generative models are a common type, meaning they generate text one token at a time, conditioned on previously generated tokens.
- Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that forms the backbone of most modern LMs. It relies heavily on self-attention mechanisms to weigh the importance of different words in a sequence. The paper mentions using the LLaMA-2 architecture, which is a Transformer-based decoder-only model.
- Pre-training: The initial phase of training LMs on a massive corpus of text data (e.g., billions or trillions of tokens) to learn general language understanding and generation capabilities. The primary objective is typically next-token prediction, minimizing cross-entropy loss (or log-loss).
- Tokens: The basic units of text processed by LMs. These can be words, subword units (like wordpieces or byte-pair encodings), or characters.
- Perplexity (PPL): A common metric for evaluating language models. It measures how well a probability model predicts a sample. Lower perplexity indicates a better model. For a sequence of tokens $ W = (w_1, w_2, \dots, w_N) $, perplexity is defined as: $ \mathrm{PPL}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}} $ where $ P(w_1, w_2, \dots, w_N) $ is the probability of the sequence, often computed as the product of conditional probabilities: $ P(w_1, \dots, w_N) = \prod_{i=1}^N P(w_i | w_1, \dots, w_{i-1}) $ A lower perplexity means the model assigns higher probabilities to the actual sequence, indicating better language modeling.
- Fine-tuning: The process of adapting a pre-trained model to a specific downstream task or domain by training it further on a smaller, task-specific dataset.
- Supervised Fine-Tuning (SFT): A type of fine-tuning where the model is trained on labeled data for a specific task (e.g., question-answering pairs, summarization). The model learns to generate desired outputs based on given inputs and corresponding ground truth responses.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize a cumulative reward. In the context of LMs, Reinforcement Learning from Human Feedback (RLHF) is common, where human preferences or reward models (RMs) provide signals to optimize the LM's outputs.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm used in this paper. It optimizes a policy (the LM's generation strategy) by taking small steps, avoiding large changes that could destabilize training.
- Catastrophic Forgetting: A phenomenon in neural networks where training a model on new data causes it to forget previously learned information. This is particularly relevant during continued pre-training or fine-tuning when adapting to a new domain.
- Scaling Laws: Empirical observations that describe how model performance (e.g., log-loss or perplexity) changes as a function of various resources, such as model size (parameters), dataset size (tokens), and computational budget (FLOPs). A key concept is the compute-optimal ratio for training.
- In-domain (ID) and Out-of-domain (OOD) Generalization: In-domain refers to evaluating a model on data similar to its training data. Out-of-domain refers to evaluating on data that is distinct from the training data, testing the model's ability to generalize to new, unseen distributions.
- Outcome Reward Model (ORM): A separate model trained to assign a scalar score to a generated output (e.g., a solution to a problem) given an input. This score serves as a proxy for the quality or correctness of the output and is used to provide feedback in RL or as an evaluation metric.
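To make the perplexity definition above concrete, the following minimal sketch computes PPL from per-token conditional probabilities. The function name `perplexity` and the toy probability lists are illustrative assumptions, not values from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence given P(w_i | w_1..w_{i-1}) for each token."""
    n = len(token_probs)
    if n == 0:
        raise ValueError("need at least one token probability")
    # Sum log-probabilities rather than multiplying raw probabilities,
    # which would underflow for long sequences.
    log_prob = sum(math.log(p) for p in token_probs)
    # PPL = P(w_1..w_N)^(-1/N) = exp(-(1/N) * log P(w_1..w_N))
    return math.exp(-log_prob / n)


# Toy inputs (hypothetical): a model guessing uniformly among 4 tokens
# versus a model that is confident about each token.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
```

A model that spreads probability uniformly over four candidates has a perplexity of 4, while the more confident toy model scores lower, matching the "lower is better" reading.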
3.2. Previous Works
The paper contextualizes its work by referencing several lines of prior research:
- Scaling Laws: Early works like those by Hestness et al. [22], Kaplan et al. [27], and Hoffmann et al. [23] established fundamental relationships between training loss and model size, data quantity, and compute. The paper specifically mentions the Chinchilla scaling law [23], which recommends a compute-optimal ratio of approximately 20 tokens per model parameter.
- Training Dynamics and Emergent Abilities: Research has explored how models learn during training [59, 53, 43, 16] and the concept of emergent abilities [45], where LMs gain new capabilities only at large scales. However, accurately forecasting downstream problem-solving performance remains challenging due to the training-inference mismatch [46] and non-smooth improvements [45].
- Post-training Studies and Opaque Checkpoints: Existing studies often rely on off-the-shelf base models or intermediate checkpoints [59, 51], which lack transparency regarding training details and control over variables like model size or pre-training data composition [42, 10, 61]. This can introduce confounding factors. The paper aims to address this by providing a fully transparent pipeline.
- Catastrophic Forgetting and Continued Pre-training (CPT): The phenomenon of catastrophic forgetting [15] during fine-tuning or adaptation is a known challenge. Prior work has explored data replay strategies [25, 41, 4, 62] to mitigate this during CPT, especially in domain-specific continual pre-training [41].
- SFT and RL for Reasoning: Studies have investigated the impact of SFT and RL on reasoning capabilities. Some suggest that SFT scales with fine-tuning examples [42, 72] and that RL amplifies pre-trained patterns, driving models towards dominant output distributions [70]. Others critically examine whether RL truly boosts reasoning beyond pre-trained baselines, suggesting it primarily enhances confidence in existing correct outputs rather than fundamentally improving reasoning [64, 10]. The paper expands on these findings by quantifying the trade-offs.
- Overtraining: Recent work has identified "catastrophic overtraining" [51], where prolonged pre-training can impair downstream fine-tuning by increasing sensitivity to parameter updates and exacerbating forgetting. The paper's findings align with and extend this by showing degradation on downstream generative reasoning tasks even after RL.
3.3. Technological Evolution
The field of language models has rapidly evolved from early statistical models to deep learning architectures like recurrent neural networks (RNNs) and long short-term memory (LSTMs), culminating in the Transformer architecture. This evolution has been driven by increased computational power, larger datasets, and architectural innovations, leading to ever-larger models.
Initially, LMs were primarily evaluated on language modeling loss or perplexity. However, as models became more capable, the focus shifted to their performance on diverse downstream tasks, requiring fine-tuning for specific applications. The introduction of reinforcement learning from human feedback (RLHF) marked a significant step in aligning LM outputs with human preferences and improving conversational abilities.
This paper fits into the current state of this evolution by moving beyond simply scaling models and focusing on the dynamics of how capabilities emerge and are shaped across the entire multi-stage training lifecycle. It addresses the challenge of understanding the complex interactions between these stages, which is crucial for building more efficient, reliable, and interpretable large language models. It also emphasizes reasoning-centric evaluation, moving beyond simple next-token prediction to assess problem-solving capabilities.
3.4. Differentiation Analysis
Compared to related work, EvoLM differentiates itself in several key aspects:
- Holistic, Transparent Lifecycle Analysis: Unlike many studies that focus on a single stage (e.g., pre-training scaling laws) or use opaque, off-the-shelf base models, EvoLM provides an end-to-end, systematic, and transparent analysis across all training stages (pre-training, CPT, SFT, RL). This allows for an investigation into the interplay and dependencies between these stages.
- From-Scratch Training with Full Control: The paper trains models with 1B and 4B parameters from scratch, ensuring complete control over model size, pre-training data size, data components, and complete learning rate decay. This control eliminates confounding factors often present in studies using pre-existing checkpoints.
- Focus on Reasoning-Intensive Downstream Tasks: While many works evaluate upstream log-loss, EvoLM specifically evaluates problem-solving capabilities on reasoning-intensive tasks (math, code, logic, commonsense) for both in-domain and out-of-domain generalization, providing a more practical assessment of LM utility.
- Empirical Validation of Over-training Phenomena: It systematically quantifies the diminishing returns and even degradation from excessive pre-training and post-training, extending findings like "catastrophic overtraining" [51] to generative reasoning tasks and showing its impact on RL fine-tuning.
- Detailed CPT Strategies and Forgetting Mitigation: The study explores the role of continued pre-training, highlighting catastrophic forgetting during domain adaptation and empirically demonstrating the effectiveness of pre-training data replay strategies (e.g., 5% replay) in mitigating it.
- Granular Analysis of SFT/RL Trade-offs: The paper provides fine-grained insights into how scaling SFT epochs, SFT dataset size, RL epochs, and RL dataset size impact ID and OOD performance, and how these interact, including when RL gains are limited or RL even degrades performance.
- Reliability of Evaluation Metrics: It critically assesses validation perplexity as an evaluation metric for post-trained models, showing its unreliability for generative reasoning performance, and proposes ORM scores as a more effective unsupervised proxy metric.
- Open-Source Commitment: The commitment to releasing all models, datasets, and the entire training/evaluation pipeline fosters reproducibility and future research, differentiating it from many works that only present findings without corresponding open resources.
4. Methodology
The EvoLM methodology focuses on systematically investigating the training dynamics of language models across four distinct stages: pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The core idea is to train numerous models from scratch under controlled conditions and evaluate their capabilities using a comprehensive set of upstream and downstream tasks.
4.1. Principles
The underlying principle of the EvoLM methodology is to achieve transparency and systematic control over the entire LM development lifecycle. By controlling key variables like model size, pre-training data volume, data composition, and hyper-parameters at each stage, the authors aim to isolate the impact of different design choices. This allows for empirical validation of scaling laws, understanding catastrophic forgetting, and identifying critical trade-offs in fine-tuning strategies. The theoretical basis rests on the empirical observation of scaling laws in LMs, which posit predictable relationships between resources and performance, and the need to extend these observations to the multi-stage post-training regime.
4.2. Core Methodology In-depth (Layer by Layer)
The training pipeline consists of four sequential stages, each with specific configurations and objectives:
4.2.1. Pre-training
The initial stage involves training decoder-only autoregressive LMs from scratch.
- Architecture: All models use the LLaMA-2 architecture [56]. The paper investigates models with 1B and 4B parameters (and some 0.5B for initial scaling studies). The architecture details are provided in Table 5 in the Appendix.
- Pre-training Data: FineWeb-Edu [38], an extensive educational dataset sourced from web content, is used. It contains approximately 1.3 trillion tokens.
- Token Budgets: Guided by the Chinchilla scaling law [23], which recommends a compute-optimal ratio of approximately 20 tokens per model parameter, models are pre-trained across various token budgets:
  - From 20x model size (the optimal Chinchilla ratio)
  - Up to 320B tokens (which represents 16x Chinchilla for a 1B model, i.e., $ 16 \times 20 \times 10^9 = 320 \times 10^9 $ tokens).
  - This range allows for investigating the effects of mild over-training (modest multiples of the Chinchilla budget) and excessive over-training (much larger multiples).
- Learning Rate Schedule: A complete learning rate decay schedule is used, and only the final checkpoints are considered.
- Hyperparameters: Specific hyperparameters for pre-training (and continued pre-training) are detailed in Table 6, including precision, global_batch_size, max_seq_length, lr_warmup_ratio, max_norm, lr (denoted as br), min_lr, weight_decay, beta1, beta2, and epoch. For example, for a 1B model, the learning rate (br) is 0.0002, and weight_decay is 0.1.
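The token-budget arithmetic above can be checked with a short sketch. The helper names `chinchilla_tokens` and `chinchilla_multiple` are illustrative assumptions; the 20 tokens-per-parameter ratio and the 320B budget come from the text.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio cited above


def chinchilla_tokens(n_params):
    """Compute-optimal token budget for a model with n_params parameters."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


def chinchilla_multiple(n_tokens, n_params):
    """How many times the Chinchilla-optimal budget a token count represents."""
    return n_tokens / chinchilla_tokens(n_params)


ONE_B = 1_000_000_000
FOUR_B = 4_000_000_000
MAX_BUDGET = 320_000_000_000  # largest pre-training budget mentioned above

optimal_1b = chinchilla_tokens(ONE_B)                    # 20B tokens
multiple_1b = chinchilla_multiple(MAX_BUDGET, ONE_B)     # 16x Chinchilla
multiple_4b = chinchilla_multiple(MAX_BUDGET, FOUR_B)    # 4x Chinchilla
```

The same absolute budget is therefore a much milder degree of over-training for the 4B model than for the 1B model, which is worth keeping in mind when comparing sizes at a fixed token count.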
4.2.2. Continued Pre-training (CPT)
After initial pre-training, some models undergo continued pre-training to adapt to a specific domain.
- CPT Data: FineMath [2], a curated dataset of mathematical texts, problems, and solutions (approximately 50 billion tokens), is used for domain-specific CPT.
- Token Budgets: CPT token budgets range from 2B to 42B tokens.
- Mitigating Catastrophic Forgetting: To address catastrophic forgetting of general-domain knowledge during domain-specific CPT, a pre-training data replay strategy [25, 41, 4, 62] is employed. This involves randomly interleaving a small amount of the original FineWeb-Edu pre-training data with the FineMath data. The paper explores different replay budgets (e.g., 8 billion tokens of replayed general-domain data).
- Learning Rate Schedule: Similar to pre-training, a complete learning rate decay schedule is applied.
- Hyperparameters: The same hyperparameters as pre-training (Table 6) are generally used, though the total token count varies across CPT runs.
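The replay strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's data loader: documents are represented as token lists, budgets are enforced at document granularity, and the names `take_tokens` and `build_cpt_mix` are hypothetical.

```python
import random


def take_tokens(docs, budget):
    """Collect documents in order until the token budget is reached."""
    picked, used = [], 0
    for doc in docs:
        if used >= budget:
            break
        picked.append(doc)
        used += len(doc)
    return picked


def build_cpt_mix(domain_docs, replay_docs, domain_budget, replay_budget, seed=0):
    """Randomly interleave domain data with replayed general-domain data."""
    mix = [("domain", d) for d in take_tokens(domain_docs, domain_budget)]
    mix += [("replay", d) for d in take_tokens(replay_docs, replay_budget)]
    random.Random(seed).shuffle(mix)  # seeded for reproducibility
    return mix


# Toy corpora (hypothetical): 10-token documents standing in for
# FineMath (domain) and FineWeb-Edu (replay).
math_docs = [["m"] * 10 for _ in range(100)]
web_docs = [["w"] * 10 for _ in range(100)]

# Scaled-down analogue of an 8B replay + 42B domain mix.
mix = build_cpt_mix(math_docs, web_docs, domain_budget=420, replay_budget=80)
replay_tokens = sum(len(d) for source, d in mix if source == "replay")
total_tokens = sum(len(d) for _, d in mix)
replay_share = replay_tokens / total_tokens
```

In this toy setting the replayed data makes up 16% of the CPT stream, the same proportion as 8B replay tokens alongside 42B domain tokens.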
4.2.3. Supervised Fine-Tuning (SFT)
Following pre-training and CPT, models are fine-tuned on a task-specific dataset.
- SFT Data: A dataset of QA pairs augmented from GSM8K [12] and MATH [20] is used. This data is collected from a mixture of MetaMathQA [63], OpenMathInstruct2 [55], and NuminaMath [31].
- Data Filtering: Low-quality prompts are filtered out using model correctness consistency [39], discarding samples with zero inter-model consensus.
- SFT Examples/Epochs: The number of SFT examples and epochs is varied to study their impact (e.g., 100K examples for 1 epoch, or varying epochs from 1 to 32, or examples from 50K to 400K).
- SFT/RL Template: A specific template is used for SFT and RL tuning to structure inputs and outputs: Human: {query} Assistant: {response}. This template frames the task as a conversational interaction.
- Hyperparameters: Specific hyperparameters for SFT are detailed in Table 7, including cutoff_len (maximum sequence length), batch_size, learning_rate, lr_scheduler_type (cosine), and warmup_ratio. For a 1B model, the learning_rate is 0.00001, and batch_size is 128.
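A minimal sketch of the prompt template and the consensus filter follows. Two points are assumptions rather than details from the paper: the blank line separating the Human and Assistant turns, and the reading of "zero inter-model consensus" as "no reference model answered the prompt correctly". The names `format_example` and `keep_prompt` are hypothetical.

```python
# The separator between the two turns is an assumption; the text only
# gives the "Human: {query} Assistant: {response}" layout.
TEMPLATE = "Human: {query}\n\nAssistant: {response}"


def format_example(query, response=""):
    """Render one SFT/RL example; leave response empty for inference prompts."""
    return TEMPLATE.format(query=query, response=response)


def keep_prompt(model_correct):
    """Keep a prompt only if at least one reference model solved it.

    model_correct is a list of booleans, one per reference model
    (an assumed representation of inter-model consensus).
    """
    return any(model_correct)


example = format_example("What is 2+2?", "4")
kept = [q for q, votes in [("q1", [False, False]), ("q2", [False, True])]
        if keep_prompt(votes)]
```

Here `q1` is dropped because no reference model agrees on a correct answer, while `q2` is retained.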
4.2.4. Reinforcement Learning (RL)
The final stage involves reinforcement learning to further refine the model's behavior based on a reward signal.
- RL Algorithm: Proximal Policy Optimization (PPO) [47] is employed.
- Reward Signal: A binary verifiable reward is used, implying that the correctness of generated solutions can be objectively checked.
- RL Data: The RL stage uses the same data sources as SFT (augmented QA pairs from GSM8K, MATH, etc.) but ensures no overlap with the SFT dataset to prevent data contamination.
- RL Examples/Epochs: Similar to SFT, the number of RL examples and epochs is varied (e.g., 100K examples for 8 epochs, or varying epochs from 1 to 32, or examples from 50K to 400K).
- Hyperparameters: Specific hyperparameters for RL (PPO) are detailed in Table 8, including actor_lr (learning rate for the policy network), critic_lr (learning rate for the value network), kl (KL divergence coefficient), train_batch_size, max_prompt_length, max_response_length, ppo_mini_batch_size, ppo_micro_batch_size_per_gpu, log_prob_micro_batch_size_per_gpu, and warmup_steps_ratio. Table 8 also lists the actor_lr and critic_lr values used for the 1B model.
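A binary verifiable reward of the kind described above can be sketched as below. The answer-extraction rule (take the last number in the response) is a simplifying assumption for illustration; the paper's actual extraction logic is not given in this text, and the function names are hypothetical.

```python
import re


def extract_final_answer(text):
    """Return the last number in the text, or None (assumed extraction rule)."""
    # Drop thousands separators so "1,250" is read as a single number.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def binary_reward(response, ground_truth):
    """1.0 if the extracted answer equals the ground truth, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(ground_truth) else 0.0


reward_correct = binary_reward("3 + 4 = 7, so the answer is 7.", "7")
reward_wrong = binary_reward("The answer is 8.", "7")
reward_missing = binary_reward("I am not sure.", "7")
```

Because the reward depends only on the final answer, a response with flawed intermediate reasoning but a correct result still receives full reward under this scheme.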
4.2.5. Model Signature
A compact model signature is used to denote the configuration of each model across training stages. For example:
1B-160BT-8+42BT-100Kep1-100Kep16
- 1B: Model with 1 billion parameters.
- 160BT: Pre-trained on 160 billion tokens from FineWeb-Edu.
- 8+42BT: Continued pre-trained with 8 billion tokens of replayed general-domain data (FineWeb-Edu) and 42 billion tokens of domain-specific data (FineMath).
- 100Kep1: Supervised fine-tuned on 100K examples for 1 epoch.
- 100Kep16: Reinforcement learning fine-tuned on 100K examples for 16 epochs.

All models are trained with complete learning rate scheduling, and only the final checkpoints are used for study, which is a critical aspect highlighted in Section 4.1 to avoid the pitfalls of using intermediate checkpoints.
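The signature format can be decoded mechanically. The following sketch parses a full five-part signature such as the example above; the `ModelConfig` field names and the `parse_signature` helper are illustrative assumptions, and signatures that omit stages are not handled here.

```python
import re
from dataclasses import dataclass

# Pattern for a full signature: size, pre-training tokens,
# replay+domain CPT tokens, SFT examples/epochs, RL examples/epochs.
SIGNATURE = re.compile(
    r"(?P<params>[\d.]+)B"
    r"-(?P<pretrain>\d+)BT"
    r"-(?P<replay>\d+)\+(?P<domain>\d+)BT"
    r"-(?P<sft_n>\d+)Kep(?P<sft_ep>\d+)"
    r"-(?P<rl_n>\d+)Kep(?P<rl_ep>\d+)"
)


@dataclass
class ModelConfig:
    params_b: float      # model size, billions of parameters
    pretrain_bt: int     # pre-training tokens, billions
    replay_bt: int       # replayed general-domain CPT tokens, billions
    domain_bt: int       # domain-specific CPT tokens, billions
    sft_k: int           # SFT examples, thousands
    sft_epochs: int
    rl_k: int            # RL examples, thousands
    rl_epochs: int


def parse_signature(sig):
    """Parse a full model signature into its per-stage settings."""
    m = SIGNATURE.fullmatch(sig)
    if m is None:
        raise ValueError(f"unrecognised signature: {sig}")
    return ModelConfig(
        float(m["params"]), int(m["pretrain"]),
        int(m["replay"]), int(m["domain"]),
        int(m["sft_n"]), int(m["sft_ep"]),
        int(m["rl_n"]), int(m["rl_ep"]),
    )


cfg = parse_signature("1B-160BT-8+42BT-100Kep1-100Kep16")
```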
5. Experimental Setup
5.1. Datasets
The study utilizes a variety of datasets across different training and evaluation stages:
Training Data
- FineWeb-Edu [38]: This is an extensive educational dataset derived from web content, specifically designed for pre-training language models. It focuses on high-quality academic and educational text. The total size is approximately 1.3 trillion tokens. This dataset forms the backbone of the initial general-domain knowledge acquisition.
- FineMath [2]: A curated dataset consisting of mathematical texts, problems, and solutions. Its purpose is to enhance language models' mathematical knowledge during continued pre-training. It contains around 50 billion tokens.
- OpenMathInstruct2 [55], MetaMathQA [63], and NuminaMath [31]: These are instruction-tuning datasets primarily containing mathematical questions paired with step-by-step solutions and explanations. They are used for supervised fine-tuning (SFT) and reinforcement learning (RL). The responses for these datasets were collected by prompting the Qwen2.5-7B-Math-Instruct model [60].
Evaluation Benchmarks
The evaluation covers both upstream cloze tasks and downstream generative tasks, assessing in-domain (math reasoning) and out-of-domain (general reasoning) capabilities.
Upstream Cloze Tasks
These tasks evaluate the models' language modeling capabilities via next-token prediction without requiring conversational abilities. Models are evaluated on the average 0-shot accuracy across these datasets:
- HellaSwag [65]: A commonsense reasoning benchmark that challenges models to choose the most plausible ending to a given sentence.
- Winogrande [44]: An adversarial Winograd Schema Challenge designed to be resistant to simple statistical biases, testing commonsense reasoning and pronoun resolution.
- PIQA [6]: A physical commonsense reasoning benchmark where models must choose the correct solution to everyday physical problems.
- OBQA [36] (OpenBookQA): A question-answering dataset that tests knowledge and reasoning about elementary science facts.
- ARC-Easy/Challenge [11] (AI2 Reasoning Challenge): A set of science questions designed to be difficult for current AI systems, divided into an Easy set and a Challenge set requiring more advanced reasoning.
Downstream Generative Tasks
These tasks evaluate models' problem-solving abilities in a generative, conversational setting.
- In-Domain Tasks (Math Reasoning):
  - GSM8K-Platinum [57]: A revised, manually cleaned, and denoised version of GSM8K [12], a benchmark of 8.5K high-quality grade-school math word problems requiring multi-step reasoning. It emphasizes elementary arithmetic and uses no concepts beyond early algebra. The test set contains 1319 unique problems.
  - MATH [20]: A large-scale benchmark (12,500 problems) designed to evaluate mathematical reasoning. Problems are sourced from math competitions and categorized into seven topics (such as Algebra, Geometry, Number Theory, and Precalculus) and five difficulty levels. They require detailed, step-by-step solutions.
- Out-of-Domain Tasks (General Reasoning):
  - CRUXEval [19]: A benchmark for evaluating code reasoning, understanding, and execution. It features 800 Python functions (3-13 lines) with input-output pairs. Given a function snippet and an input example, models must generate the corresponding output. Its test set contains 800 unique problems.
  - BGQA [28] (BoardgameQA): A logical reasoning benchmark that assesses models' ability to reason with contradictory information using defeasible reasoning (where conflicts are resolved based on source preferences like credibility). Its test set contains 15K unique problems.
    - Example Data Sample (from paper Appendix): Human: [Game state and rules describing a board game scenario with contradictory information and rule preferences]. Based on the game state and the rules and preferences, does the worm hide the cards that she has from the mouse? Assistant: [Model output with reasoning and final answer (True/False)].
  - TabMWP [35]: A benchmark introduced to evaluate mathematical reasoning over tabular data. It contains approximately 38,000 math word problems, each associated with relevant tables, spanning diverse mathematical reasoning types.
  - StrategyQA [18]: A commonsense question-answering benchmark designed for multi-hop reasoning where reasoning steps are implicit and must be inferred. Each of its 2,780 examples includes a strategy question, its step-by-step decomposition, and supporting Wikipedia evidence.
Sampling Parameters
For all test datasets and models, a consistent prompt template (same as SFT/RL) is used.
- Temperature: Set to 0 for greedy decoding (deterministic response) and 1 for decoding with randomness (16 generations).
- Repetition Penalty: Set to 1.1.
- Inference Framework: vLLM [29] is used for efficient inference.
5.2. Evaluation Metrics
The paper uses a comprehensive set of evaluation metrics to assess upstream and downstream performance:
Upstream Metrics
- Average 0-shot Accuracy: For upstream cloze tasks, this measures the percentage of correctly predicted answers on multiple-choice style questions, without any prior examples in the prompt.
Downstream Generative Task Metrics
For downstream generative tasks, correctness is determined by extracting final answers from model outputs and comparing them against ground truth solutions.
-
Accuracy: Measured under four prompting schemes:
- Pass@1:
- Conceptual Definition: Measures the accuracy when the model generates a single, deterministic response. If this response is correct, the problem is marked as solved. It assesses the model's most confident, singular output.
- Mathematical Formula: $ \mathrm{Pass@1} = \frac{\text{Number of problems with 1 correct deterministic response}}{\text{Total number of problems}} $
- Symbol Explanation:
Number of problems with 1 correct deterministic response: The count of problems where the single generated output (with temperature 0) matches the ground truth.Total number of problems: The total count of problems in the evaluation set.
  - Maj@16 (Majority@16):
    - Conceptual Definition: The model generates 16 responses (with temperature 1 for randomness). The problem is marked correct if the majority-voted answer among these 16 responses is correct. This metric aims to capture a more robust measure of the model's understanding by leveraging multiple samples.
    - Mathematical Formula: $ \mathrm{Maj@16} = \frac{\text{Number of problems where the majority answer of 16 samples is correct}}{\text{Total number of problems}} $
    - Symbol Explanation:
      - Number of problems where the majority answer of 16 samples is correct: The count of problems where the most frequent final answer among the 16 generated samples matches the ground truth.
      - Total number of problems: The total count of problems in the evaluation set.
  - RM@16 (Reward Model@16):
    - Conceptual Definition: From the 16 generated responses, the one with the highest ORM score (assigned by an outcome reward model) is selected, and its correctness is evaluated. This metric assesses how well a reward model can identify the best-performing generated solution.
    - Mathematical Formula: $ \mathrm{RM@16} = \frac{\text{Number of problems where the highest ORM-scored sample is correct}}{\text{Total number of problems}} $
    - Symbol Explanation:
      - Number of problems where the highest ORM-scored sample is correct: The count of problems where the single generated output selected by the reward model (from 16 samples) matches the ground truth.
      - Total number of problems: The total count of problems in the evaluation set.
  - Pass@16:
    - Conceptual Definition: The model generates 16 responses (with temperature 1 for randomness). The problem is marked solved if any one of these 16 responses is correct. This metric measures the model's potential to solve a problem if given multiple attempts, indicating whether the model "knows" the answer even if it doesn't consistently produce it.
    - Mathematical Formula: $ \mathrm{Pass@16} = \frac{\text{Number of problems with at least 1 correct response among 16 samples}}{\text{Total number of problems}} $
    - Symbol Explanation:
      - Number of problems with at least 1 correct response among 16 samples: The count of problems where at least one of the 16 generated outputs matches the ground truth.
      - Total number of problems: The total count of problems in the evaluation set.
- Correct Ratio:
  - Conceptual Definition: For response groups that have at least one correct solution, this metric computes the ratio of the number of correct solutions to the total number of solutions (16). It provides insight into the "hit rate" within successful problem attempts.
  - Mathematical Formula: $ \mathrm{Correct\ Ratio} = \frac{\sum_{i=1}^{N_{\text{has\_correct}}} (\text{Number of correct solutions in group } i)}{N_{\text{has\_correct}} \times 16} $
  - Symbol Explanation:
    - $N_{\text{has\_correct}}$: The number of problem instances where at least one of the 16 generated solutions was correct.
    - Number of correct solutions in group $i$: The count of correct solutions among the 16 samples for a specific problem $i$, given that $i$ is one of the $N_{\text{has\_correct}}$ problems.
    - 16: The total number of solutions sampled for each problem.
- ORM Score (Outcome Reward Model Score):
  - Conceptual Definition: Uses a pre-trained outcome reward model (Skywork-Reward-Llama-3.1-8B-v0.2 [33]) to assign a scalar score to each generated solution based on the input problem and response. This metric serves as an unsupervised proxy for solution quality, useful when direct correctness checking is difficult or for ranking multiple generations.
  - Mathematical Formula: The paper refers to Skywork-Reward-Llama-3.1-8B-v0.2 [33] as the ORM. While the internal function of this specific reward model is complex (typically a fine-tuned LM that outputs a scalar reward), conceptually, for a given problem $P$ and generated response $R$, the ORM computes: $ \mathrm{ORM\_Score}(P, R) = f_{\mathrm{ORM}}(P, R) $
  - Symbol Explanation:
    - $f_{\mathrm{ORM}}$: The function implemented by the Skywork-Reward-Llama-3.1-8B-v0.2 reward model.
    - $P$: The input problem.
    - $R$: The generated response from the language model.
    - $\mathrm{ORM\_Score}(P, R)$: The scalar score assigned by the reward model, representing the quality of response $R$ for problem $P$.
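The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names, the toy answers, and the toy ORM scores are all hypothetical, answers are assumed to be already-extracted final-answer strings, 4 samples per problem stand in for 16, and ties in the majority vote are broken by first occurrence (an assumption the source does not specify).

```python
from collections import Counter

def pass_at_1(greedy_correct):
    """Fraction of problems whose single greedy (temperature 0) answer is correct."""
    return sum(greedy_correct) / len(greedy_correct)

def maj_at_k(answers, truth):
    """1 if the most frequent sampled answer equals the ground truth, else 0."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return int(top_answer == truth)

def rm_at_k(answers, orm_scores, truth):
    """1 if the sample with the highest reward-model score is correct, else 0."""
    best = max(range(len(answers)), key=lambda i: orm_scores[i])
    return int(answers[best] == truth)

def pass_at_k(answers, truth):
    """1 if at least one sampled answer is correct, else 0."""
    return int(any(a == truth for a in answers))

def correct_ratio(all_answers, truths):
    """Correct samples / total samples, over problems with >= 1 correct sample."""
    groups = [(ans, t) for ans, t in zip(all_answers, truths) if t in ans]
    if not groups:
        return 0.0
    n_correct = sum(sum(a == t for a in ans) for ans, t in groups)
    n_total = sum(len(ans) for ans, _ in groups)
    return n_correct / n_total

# Hypothetical toy data: two problems, four sampled answers each.
truths = ["4", "7"]
samples = [["4", "4", "5", "4"], ["1", "2", "7", "1"]]
orm = [[0.9, 0.8, 0.1, 0.7], [0.2, 0.3, 0.95, 0.1]]

n = len(truths)
maj = sum(maj_at_k(s, t) for s, t in zip(samples, truths)) / n
rm = sum(rm_at_k(s, o, t) for s, o, t in zip(samples, orm, truths)) / n
pas = sum(pass_at_k(s, t) for s, t in zip(samples, truths)) / n
cr = correct_ratio(samples, truths)
print(maj, rm, pas, cr)  # 0.5 1.0 1.0 0.5
```

In the toy data the second problem is solved by only one of four samples, so majority voting misses it while the reward model and Pass@k both recover it; this is the kind of gap between Maj@16, RM@16, and Pass@16 that the metrics are designed to expose.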
5.3. Baselines
The paper doesn't strictly compare its EvoLM models against external baselines in the sense of comparing a novel algorithm against existing ones. Instead, it systematically builds and evaluates its own EvoLM model suite across different configurations, effectively using different EvoLM variants as baselines for comparison (e.g., comparing a 1B model with 80B pre-training tokens against a 1B model with 160B tokens).
However, in Appendix A.1 (Observational Comparison of Pre-trained Models), the paper does provide an observational comparison of its pre-trained models (1B and 4B models at various token counts) against several open-weight small-size LMs for upstream benchmark performance:
- OPT 1.3B [69]
- Pythia 1B, Pythia 1.4B, Pythia 6.9B [5]
- TinyLlama 1B [68]
- Llama3.2 1B, Llama 3.2 3B [56]
- Qwen3 1.7B, Qwen1.5 4B, Qwen2.5 3B, Qwen3 4B [3]
These models are representative baselines because they are widely recognized, open-source, and cover a range of parameter sizes and pre-training token budgets (some significantly larger than EvoLM's pre-training budget). This comparison helps establish that EvoLM's base models achieve competitive performance despite being trained on significantly less data, validating the efficiency of their pre-training process.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents several key findings across the training stages, demonstrating the complex dynamics of language model development.
6.1.1. Scaling Up Pre-training Compute
The study investigates how varying pre-training compute affects upstream (language modeling) and downstream (problem-solving) performance.
- Upstream Performance: As shown in Figure 2, upstream task performance (average accuracy on tasks like HellaSwag, Winogrande, etc.) improves steadily with more pre-training tokens, but experiences rapidly diminishing returns beyond approximately 80x to 160x the model size. For example, the 1B model's accuracy gain from 80BT to 160BT is less than a percentage point, and the 4B model also plateaus by 320BT.
The image is a chart showing upstream task performance (average accuracy) of models of different sizes (0.5B, 1B, 4B) at different numbers of pre-training tokens. As the number of pre-training tokens increases, the models' average accuracy gradually improves.
Figure 2: Upstream task performance vs. pretraining tokens on models {0.5B, 1B, 4B}- {10BT, 20BT, 40BT, 80BT, 160BT, 320BT}.
- Downstream Performance: Figure 3 illustrates that downstream capabilities (for both SFT and SFT+RL models) also show strong initial gains up to 80BT but then saturate. For instance, the ID Maj@16 accuracy of the SFT model for 1B models rises from 8% at 20BT to 15% at 80BT, but only reaches 17% at 320BT. RL fine-tuning provides a consistent uplift over SFT but similarly shows negligible benefit from pre-training beyond 80BT. Critically, OOD task accuracies (e.g., Maj@16, RM@16, and Pass@16) can even decrease after 160BT, accompanied by a drop in ORM score, indicating overall generation quality degradation.
The image is a chart showing the downstream task performance of different models as a function of the number of pre-training tokens. It compares SFT and SFT+RL models in ID and OOD settings and shows how performance changes as pre-training tokens increase; the main metrics include Greedy, Maj@16, RM@16, and others.
Figure 3: Downstream task performance vs. number of pretraining tokens on models: - SFT: 1B-{20BT, 40BT, 80BT, 160BT, 320BT}-8+42BT-100Kep1 - SFT+RL: 1B-{20BT, 40BT, 80BT, 160BT, 320BT}-8+42BT-100Kep1-100Kep8.
These findings lead to Takeaway 1: Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks (saturation happens around 80x to 160x model size in this study).
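To make the "80x to 160x model size" figure concrete, the following sketch computes the token-to-parameter ratio for a few of the configurations named in this section. It assumes the nominal sizes are exact (for example, "1B" is treated as exactly 1 billion parameters), which is a simplification; the helper name is hypothetical.

```python
def tokens_per_parameter(params_billion, tokens_billion):
    """Pre-training tokens divided by parameter count (both in billions)."""
    return tokens_billion / params_billion

# Configurations mentioned in the text, written as (parameters in B, tokens in BT).
configs = {
    "1B-80BT": (1, 80),
    "1B-160BT": (1, 160),
    "1B-320BT": (1, 320),
    "4B-320BT": (4, 320),
}

ratios = {name: tokens_per_parameter(p, t) for name, (p, t) in configs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x tokens per parameter")
```

Under this reading, the 1B model enters the reported saturation range between 80BT and 160BT, while the 4B model only reaches the lower end of that range (80x) at 320BT.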
- Model Size and Pre-training Budget Interplay: Table 1 compares 1B and 4B models. Under a fixed pre-training compute budget (e.g., 1B-320BT vs. 4B-80BT), the smaller 1B model can even outperform the 4B model. When matching on pre-training tokens at lower budgets (e.g., 80B tokens), 1B and 4B models perform comparably. However, once the budget rises to 160B tokens, the 4B model significantly outperforms its 1B counterpart, especially in ID Maj@16 accuracy (26.4% vs. 14.2% for SFT, and 34.8% vs. 22.5% for SFT+RL).
The following are the results from Table 1 of the original paper:
Each cell reports accuracy as SFT / SFT+RL.
| Base Model | ID Greedy | ID Maj@16 | ID Pass@16 | OOD Greedy | OOD Maj@16 | OOD Pass@16 |
| Same Pretraining Compute | | | | | | |
| 1B-320BT-8+42BT | 14.1 / 20.1 | 16.1 / 25.0 | 36.0 / 49.0 | 25.3 / 28.3 | 24.8 / 29.9 | 54.4 / 62.6 |
| 4B-80BT-8+42BT | 11.3 / 15.7 | 13.2 / 20.0 | 34.2 / 43.0 | 24.8 / 28.2 | 23.4 / 29.6 | 52.2 / 60.2 |
| Same Pretraining Tokens | | | | | | |
| 1B-80BT-8+42BT | 12.1 / 18.0 | 14.1 / 21.4 | 35.1 / 45.4 | 25.4 / 27.5 | 24.6 / 31.0 | 55.7 / 65.3 |
| 4B-80BT-8+42BT | 11.3 / 15.7 | 13.2 / 20.0 | 34.2 / 43.0 | 24.8 / 28.2 | 23.4 / 29.6 | 52.2 / 60.2 |
| 1B-160BT-8+42BT | 12.8 / 17.5 | 14.2 / 22.5 | 34.5 / 45.1 | 23.8 / 28.2 | 25.6 / 31.6 | 55.3 / 64.9 |
| 4B-160BT-8+42BT | 22.0 / 27.8 | 26.4 / 34.8 | 47.6 / 58.4 | 27.9 / 29.6 | 26.0 / 33.2 | 57.3 / 66.2 |
This leads to Takeaway 2: Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts. Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD generalization.
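The "same pretraining compute" pairing of 1B-320BT with 4B-80BT can be checked with a quick calculation. The sketch below assumes the widely used approximation that training compute is about 6 × parameters × tokens FLOPs; the source does not state which compute estimate it uses, so this rule and the helper name are assumptions.

```python
def approx_train_flops(n_params, n_tokens):
    """Rough training compute under the common C ~ 6 * N * D approximation."""
    return 6 * n_params * n_tokens

BILLION = 1_000_000_000

flops_1b_320bt = approx_train_flops(1 * BILLION, 320 * BILLION)
flops_4b_80bt = approx_train_flops(4 * BILLION, 80 * BILLION)

print(f"1B-320BT: {flops_1b_320bt:.2e} FLOPs")
print(f"4B-80BT:  {flops_4b_80bt:.2e} FLOPs")
print("matched compute:", flops_1b_320bt == flops_4b_80bt)
```

Because the approximation is linear in both factors, a 4x larger model trained on 4x fewer tokens lands on the same budget, which is why the two rows can be compared as equal-compute.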
6.1.2. Scaling Up Continued Pre-training Compute
The paper investigates the impact of Continued Pre-training (CPT) and methods to mitigate catastrophic forgetting.
- Catastrophic Forgetting: Figure 4 shows that increasing CPT compute on domain-specific data (FineMath) gradually degrades upstream task performance, indicating catastrophic forgetting.
- Replay Strategy: Incorporating a small replay budget (e.g., 8BT of FineWeb-Edu data) during CPT effectively mitigates this. The model with 8BT replay consistently maintains higher upstream accuracy than the no-replay baseline.
The image is a chart showing the relationship between the total number of CPT (continued pre-training) tokens and the average accuracy (Avg. Acc.) on upstream tasks. It marks the performance of several training configurations, including pre-training only, CPT with replay, and CPT without replay.
Figure 4: Upstream task performance vs. CPT tokens on models:
Table 2 further demonstrates that pure FineMath CPT (50BT) achieves 19.27% Pass@1 accuracy on GSM8K-Platinum, but a mix of 8BT FineWeb replay with 42BT FineMath yields a better result of 21.01%. Configurations with either too little or too much replay perform worse, suggesting an optimal replay budget near 8BT. The "around 5%" figure appears to be relative to the 160BT pre-training budget (8BT / 160BT = 5%); relative to the 50BT CPT mix, 8BT is 16%.
The following are the results from Table 2 of the original paper:
| CPT Config | Acc. |
| No CPT | 6.04 |
| FineMath 50BT | 19.27 |
| FineWeb 1.6BT + FineMath 48.4BT | 16.21 |
| FineWeb 8BT + FineMath 42BT | 21.01 |
| FineWeb 16BT + FineMath 34BT | 15.22 |
This leads to Takeaway 3: CPT on domain-specific data induces catastrophic forgetting of pre-trained knowledge which could harm both upstream and downstream performance, while incorporating a small replay budget (e.g. 5%) could effectively mitigate this degradation.
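As a rough illustration of replay, the sketch below mixes general-domain "replay" documents into a domain-specific stream at a fixed fraction. It is a simplification under stated assumptions: the source specifies token budgets (for example 8BT FineWeb plus 42BT FineMath) but not the mixing procedure, so per-document random sampling, the function name `mixed_stream`, and the placeholder documents are all hypothetical.

```python
import random

def mixed_stream(domain_docs, replay_docs, replay_fraction, n_samples, seed=0):
    """Draw n_samples documents, taking each from replay_docs with
    probability replay_fraction and from domain_docs otherwise."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_samples):
        source = replay_docs if rng.random() < replay_fraction else domain_docs
        stream.append(rng.choice(source))
    return stream

# Replay share of the CPT mix for the best Table 2 configuration: 8BT of 50BT.
replay_fraction = 8 / (8 + 42)

# Placeholder documents standing in for FineMath and FineWeb samples.
stream = mixed_stream(["math_doc"], ["web_doc"], replay_fraction, n_samples=10_000)
observed = stream.count("web_doc") / len(stream)
print(f"target replay share: {replay_fraction:.2f}, observed: {observed:.2f}")
```

A real pipeline would typically mix at the token or sequence level rather than per document, but the idea is the same: a fixed minority of general-domain data is interleaved throughout CPT rather than dropped entirely.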
- Downstream Performance with CPT: Figure 5 shows that downstream performance (SFT and SFT+RL models with 8BT replay) improves steadily with more domain-specific tokens up to around 32BT and then plateaus by 42BT. For example, ID greedy accuracy of the SFT model rises from 5% at 2BT to 12% at 32BT. This trend is also observed in OOD metrics.
The image is a chart showing model performance on several downstream tasks, including Greedy, Maj@16, RM@16, and others, at different numbers of continued pre-training (CPT) tokens. The horizontal axis is the number of CPT tokens (B) and the vertical axis is task performance. The chart compares the SFT and SFT+RL training strategies.
Figure 5: Downstream task performance vs. continued pre-training tokens on models: - SFT: 1B-160BT-100Kep1, 1B-160BT {2BT, ..., 42BT}-100Kep1 - SFT+RL: 1B-160BT-100Kep1-100Kep8, 1B-160BT {2BT, ..., 42BT}-100Kep1-100Kep8.
Takeaway 4: Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance.
Takeaway 5: As domain-specific CPT data increase, in-domain downstream performance steadily improves and the SFT models could benefit more from RL finetuning.
Takeaway 6: With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.
6.1.3. Scaling Up SFT Compute
The paper explores how increasing SFT compute (epochs or dataset size) affects downstream performance.
- Varying SFT Epochs: Figure 6 indicates that ID metrics increase steadily with more epochs and saturate around 8 epochs, reflecting increased memorization. However, OOD performance peaks at 2-4 epochs before declining, implying that over-specialization hinders generalization. Furthermore, the marginal gains from downstream RL finetuning shrink on over-trained SFT models.
The image is a chart showing the relationship between downstream task performance and the number of supervised fine-tuning (SFT) epochs. It contains five subplots showing model performance on ID and OOD datasets, contrasting the SFT and SFT+RL curves.
Figure 6: Downstream task performance vs. number of SFT epochs for models: - SFT: 1B-160BT-8+42BT-100Kep{1,2,4,8,16,32} - SFT+RL: 1B-160BT-8+42BT-100Kep{1,2,4,8,16,32}-100Kep8.
- Varying SFT Dataset Size: Figure 7 confirms that ID performance improves monotonically with more examples. However, OOD metrics fluctuate and can even decline with larger datasets, similar to scaling epochs. The incremental benefit from RL also diminishes with more SFT examples.
The image is a chart showing the relationship between downstream task performance and the number of SFT examples for different models. The metrics cover ID and OOD tasks, such as Greedy, Maj@16, RM@16, and others. The chart shows both SFT and SFT+RL results, reflecting how performance changes as the number of examples increases.
Figure 7: Downstream task performance vs. number of SFT examples for models: - SFT: 1B-160BT-8+42BT-{50K, 100K, 150K, ..., 400K}ep1 - SFT+RL: 1B-160BT-8+42BT-{50K, 100K, 150K, ..., 400K}ep1-100Kep8.
Takeaway 7: Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance.
Takeaway 8: Excessive SFT, especially overly large epochs, could limit further RL improvements.
6.1.4. Scaling Up RL Compute
The paper investigates how increasing RL compute (epochs or dataset size) affects downstream performance.
- Varying RL Epochs: As shown in Figure 8a, greedy, Maj@16, and RM@16 accuracies for both ID and OOD tasks peak around 8-16 epochs and then saturate. Notably, Pass@16 accuracy degrades beyond 4 epochs, while correct ratio keeps increasing. This suggests RL primarily sharpens confidence in already-correct outputs rather than expanding the set of solvable samples.
The image is a chart showing how accuracy, ORM score, and correct ratio change with the number of reinforcement learning (RL) training epochs, separately for ID and OOD settings. The data come from the evaluation of model training dynamics and are used to analyze the impact of different design choices.
(a) Performance vs. number of RL epochs for models 1B-160BT-8+42BT-100Kep1-100Kep{0, 1, 2, 4, 8, 16, 32}.
- Varying RL Dataset Size: Figure 8b illustrates that greedy, Maj@16, and RM@16 accuracies continue to increase up to 150-200K examples, after which gains flatten and fluctuate. Pass@K saturates earlier and degrades, while correct ratio continues to increase. A drastic performance drop at 350K and 400K examples is observed due to models generating excessively long responses exceeding context window limits (Figure 12).
The image is a chart showing downstream task performance at different RL sample sizes. It compares the accuracy of different methods (Greedy, Maj@16, RM@16, Pass@16), ORM score (avg@16), and correct ratio on ID and OOD tasks. Each subplot shows the performance trend as the number of samples changes.
(b) Performance vs. number of RL examples for models 1B-160BT-8+42BT-100Kep1-{0, 50K, 100K, 150K, ..., 400K}ep1.
Takeaway 9: RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks but with diminishing returns (saturation happens at 4-8 epochs or 50-100K examples in this study).
Takeaway 10: Beyond saturation regime, RL primarily increases the probability of sampling high-quality rollouts but does not necessarily improve models' fundamental reasoning capabilities.
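The pattern behind Takeaway 10 (rising correct ratio alongside falling Pass@16) can be reproduced with a toy calculation. The per-problem success probabilities below are made up for illustration and are not data from the paper; the sketch only assumes that the 16 samples for a problem are independent draws with a fixed success probability p, so the chance that at least one is correct is 1 - (1 - p)^16.

```python
def expected_pass_at_k(p, k=16):
    """Probability that at least one of k independent samples is correct."""
    return 1 - (1 - p) ** k

def mean(values):
    return sum(values) / len(values)

# Hypothetical per-problem success probabilities for four problems.
before_rl = [0.5, 0.5, 0.1, 0.1]    # broad but unreliable
after_rl = [0.95, 0.95, 0.0, 0.0]   # sharpened on two problems, lost on two

single_before, single_after = mean(before_rl), mean(after_rl)
pass16_before = mean([expected_pass_at_k(p) for p in before_rl])
pass16_after = mean([expected_pass_at_k(p) for p in after_rl])

print(f"single-sample accuracy: {single_before:.3f} -> {single_after:.3f}")
print(f"expected Pass@16:       {pass16_before:.3f} -> {pass16_after:.3f}")
```

In this toy setting single-sample accuracy goes up while expected Pass@16 goes down, because probability mass concentrates on problems the model could already solve and the rarely solved problems drop out of reach.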
- SFT and RL Data Allocation Trade-offs: Figure 9 investigates data allocation under a constrained budget (100K total samples). ID accuracy (greedy and Pass@16) increases with more SFT data, plateauing beyond 70K SFT examples. Conversely, OOD metrics are driven by RL allocation, peaking at 10K SFT (i.e., 90K RL). This trend holds for both 1B and 4B models.
The image is a chart showing the performance of 1B and 4B models on in-domain (ID) and out-of-domain (OOD) evaluations under different SFT/RL data allocations. The horizontal axis is the data allocation (K examples) and the vertical axis is model performance. The data are intended to analyze the effect of post-training configurations.
Figure 9: Downstream task performance for {1B, 4B}-160BT-8+42BT-{10K, ..., 90K}ep4-{90K, ..., 10K}ep4. Darker green/blue denotes more data allocation to SFT/RL. The total number of post-training samples is fixed at 100K.
Takeaway 11: Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.
6.1.5. Intermediate Checkpoints May Not Be Reliable Surrogates
The paper highlights the issue of using intermediate checkpoints for evaluation. Table 3 compares models fully trained for specific token counts (e.g., 20BT full) against intermediate checkpoints taken from a longer run (e.g., 20BT int. from a 160BT run).
The following are the results from Table 3 of the original paper:
| Model | Upstream | Downstream, Math Level 1 (Greedy / Pass@16) | Downstream, Math Level 2 (Greedy / Pass@16) |
| 20BT full | 46.43 | 2.75 / 17.85 | 3.36 / 15.10 |
| 20BT int. | 46.07 | 2.52 / 11.44 | 1.90 / 12.64 |
| 40BT full | 49.38 | 2.97 / 17.96 | 3.36 / 14.88 |
| 40BT int. | 49.06 | 1.37 / 9.38 | 2.68 / 8.72 |
Intermediate checkpoints consistently lag behind their dedicated, fully trained counterparts in both upstream accuracy and math reasoning performance. This is attributed to the incomplete optimization trajectory (e.g., incomplete learning rate decay) of intermediate checkpoints. This demonstrates that using mid-course snapshots understates true model capability and can be misleading for studying training dynamics.
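The "incomplete learning rate decay" explanation can be made concrete with a schedule sketch. The source does not say which schedule EvoLM uses, so the cosine decay without warmup below is an assumption chosen only for illustration, and the function name and normalized learning rates are hypothetical.

```python
import math

def cosine_lr(tokens_seen, total_tokens, peak_lr=1.0, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over total_tokens (no warmup)."""
    progress = tokens_seen / total_tokens
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Learning rate (as a fraction of peak) at the 20BT mark of two different runs.
lr_20bt_full = cosine_lr(20, 20)    # dedicated 20BT run: decay has finished
lr_20bt_int = cosine_lr(20, 160)    # 20BT checkpoint taken from a 160BT run

print(f"20BT full run, final LR fraction:   {lr_20bt_full:.3f}")
print(f"20BT intermediate of 160BT run:     {lr_20bt_int:.3f}")
```

Under this assumed schedule the intermediate checkpoint is still training at roughly 96% of the peak learning rate, whereas the dedicated run has annealed fully, which is one plausible reason the "int." rows in Table 3 trail the "full" rows.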
6.1.6. Correlating Downstream Task Performance with ORM Score
The study examines the correlation between ORM scores and downstream task accuracy. Figure 10 shows that ORM scores (average of 16 samples) exhibit consistently strong predictive power for Maj@16 accuracies, with correlation coefficients ranging from 0.62 to 0.84 across most ID and OOD tasks. The correlation was lower for StrategyQA, possibly due to its focus on commonsense knowledge or the reward model's suitability for that specific distribution.

该图像是图表,展示了不同任务下准确率与ORM评分之间的相关性。每个子图代表一个数据集,每个点对应一个模型变体,虚线表示线性趋势,标题中报告了Pearson相关系数。
Figure 10: Correlation between accuracy and ORM score across different tasks. Each subplot represents one dataset, where each point corresponds to a model variant. A dashed line indicates the linear trend, and the Pearson correlation coefficient is reported in each title.
This contrasts with validation perplexity (Figure 14), which shows negligible correlation with downstream task accuracy, indicating that low perplexity does not reliably predict enhanced generative reasoning performance in post-trained models.

该图像是图表,展示了准确性与验证困惑度之间的相关性,包含不同任务的子图。每个子图对应一个数据集,其中每个点代表一个后训练模型变体,虚线表示线性趋势,标题中报告了皮尔逊相关系数。
Figure 14: Correlation between accuracy and validation PPL across different tasks. Each subplot represents one dataset, where each point corresponds to a post-trained model variant. A dashed line indicates the linear trend, and the Pearson correlation coefficient is reported in each title.
Takeaway 12: ORM score could be a more reliable unsupervised validation metric that helps predict downstream task performance during post-training, compared to validation loss (perplexity). Notably, ORM scores from an 8B reward model correlate well with problem-solving accuracies of 1B models on many downstream reasoning tasks.
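The correlation analysis behind Takeaway 12 amounts to computing a Pearson coefficient over model variants, with one (validation signal, accuracy) pair per variant. A minimal sketch follows; the helper name and the numbers are hypothetical placeholders, not values from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical per-variant measurements (one entry per post-trained model).
orm_scores = [-4.0, -3.1, -2.5, -1.8, -1.2]
maj16_acc = [10.0, 14.0, 15.5, 21.0, 22.5]

r = pearson_r(orm_scores, maj16_acc)
print(f"Pearson r between mean ORM score and Maj@16: {r:.2f}")
```

The same function can be applied to validation perplexity in place of the ORM score to compare how well each signal tracks downstream accuracy.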
6.2. Ablation Studies / Parameter Analysis
The paper's entire experimental design can be viewed as a series of ablation studies and parameter analyses, systematically varying one aspect of the training pipeline while keeping others constant to observe its impact.
- Pre-training Compute: The variation of pre-training tokens (10BT to 320BT) for 0.5B, 1B, and 4B models (Figure 2, Figure 3, Table 1) directly serves as an ablation study on the effect of pre-training compute. It shows the saturation and diminishing returns, and the interplay with model size.
- CPT Replay Strategy: The comparison of CPT with no replay and with different replay budgets (1.6BT, 8BT, 16BT FineWeb replay in Table 2, and Figure 4) is a direct ablation study on forgetting mitigation strategies during domain adaptation. It identifies an optimal replay ratio (around 5%).
- CPT Token Budget: Varying domain-specific CPT tokens from 0 to 42BT (Figure 5) is an analysis of how domain adaptation scales and its impact on SFT and RL benefits.
- SFT Epochs and Dataset Size: Varying SFT epochs (1 to 32, Figure 6) and SFT examples (50K to 400K, Figure 7) constitute ablation studies on SFT compute. These reveal over-specialization and diminishing returns for OOD tasks and the interaction with subsequent RL.
- RL Epochs and Dataset Size: Varying RL epochs (0 to 32, Figure 8a) and RL examples (0 to 400K, Figure 8b) are ablation studies on RL compute. These highlight saturation points, the degradation of Pass@16 due to increasing response length (Figure 12, Figure 13), and the nature of RL's improvements (confidence boosting vs. fundamental reasoning).
- SFT/RL Data Allocation: The experiment with fixed total downstream data (100K) split between SFT and RL (Figure 9) is a specific parameter analysis on resource allocation strategy, demonstrating the trade-off between ID gains (SFT-heavy) and OOD generalization (RL-heavy).
- Intermediate vs. Full Checkpoints: The comparison in Table 3 specifically ablates the training trajectory aspect, showing that an intermediate snapshot of a longer run is not equivalent to a fully optimized shorter run.

These systematic variations across all stages provide comprehensive insights into how different hyper-parameters and design choices influence the overall performance and characteristics of LMs.
6.3. Observational Comparison of Pre-trained Models
The paper also includes an observational comparison in Appendix A.1 (Table 4) of its pre-trained models against several prominent open-weight LMs on upstream benchmarks.
The following are the results from Table 4 of the original paper:
| Model Name | Tokens | H/S | W/G | PIQA | OBQA | ARC-E | ARC-C | Avg. |
| OPT 1.3B | 300B | 53.65 | 59.59 | 72.36 | 33.40 | 50.80 | 29.44 | 49.87 |
| Pythia 1B | 300B | 47.16 | 53.43 | 69.21 | 31.40 | 48.99 | 27.05 | 46.21 |
| Pythia 1.4B | 300B | 52.01 | 57.38 | 70.95 | 33.20 | 54.00 | 28.50 | 49.34 |
| TinyLlama 1B | 2T | 61.47 | 59.43 | 73.56 | 36.80 | 55.47 | 32.68 | 53.23 |
| Llama3.2 1B | 9T | 63.66 | 60.46 | 74.54 | 37.00 | 60.48 | 35.75 | 55.31 |
| Qwen3 1.7B | 36T | 60.46 | 61.01 | 72.36 | 36.80 | 69.91 | 43.26 | 57.30 |
| 1B (ours) | 20B | 42.25 | 51.30 | 67.85 | 32.80 | 54.80 | 29.61 | 46.44 |
| 1B (ours) | 40B | 47.53 | 54.62 | 69.59 | 36.20 | 58.08 | 30.29 | 49.38 |
| 1B (ours) | 80B | 51.05 | 53.59 | 70.78 | 37.20 | 62.71 | 35.92 | 51.88 |
| 1B (ours) | 160B | 52.30 | 53.99 | 71.71 | 36.60 | 63.09 | 36.09 | 52.30 |
| 1B (ours) | 320B | 53.86 | 53.51 | 71.93 | 37.20 | 62.29 | 36.18 | 52.49 |
| Pythia 6.9B | 300B | 63.89 | 61.17 | 76.39 | 37.20 | 61.07 | 35.15 | 55.81 |
| OPT 6.7B | 300B | 67.18 | 65.35 | 76.50 | 37.40 | 60.06 | 34.73 | 56.87 |
| Qwen1.5 4B | 3T | 71.45 | 64.09 | 77.10 | 39.60 | 61.41 | 39.51 | 58.86 |
| Qwen2.5 3B | 18T | 73.61 | 68.51 | 78.89 | 42.00 | 73.23 | 47.18 | 63.90 |
| Qwen3 4B | 36T | 73.71 | 70.64 | 77.75 | 41.00 | 76.22 | 51.88 | 65.20 |
| Llama 3.2 3B | 9T | 73.63 | 69.69 | 77.53 | 43.20 | 71.76 | 45.90 | 63.62 |
| 4B (ours) | 80B | 48.84 | 54.38 | 69.91 | 35.80 | 59.68 | 32.68 | 50.22 |
| 4B (ours) | 160B | 56.49 | 55.88 | 72.63 | 40.20 | 66.67 | 39.93 | 55.30 |
| 4B (ours) | 320B | 61.38 | 57.46 | 74.27 | 41.80 | 67.55 | 39.16 | 56.94 |
The comparison shows that EvoLM's 1B and 4B models, trained on far fewer tokens (at most 320B, versus 2T for TinyLlama 1B and 3T for Qwen1.5 4B), achieve competitive performance on standard benchmarks. This is consistent with the finding of diminishing returns from excessive pre-training: beyond a certain point, additional tokens yield minimal incremental gains on general-domain upstream tasks. For example, the EvoLM 1B model at 320B tokens (Avg. 52.49%) is close to TinyLlama 1B at 2T tokens (Avg. 53.23%) and ahead of Pythia 1.4B at 300B tokens (Avg. 49.34%).
7. Conclusion & Reflections
7.1. Conclusion Summary
The EvoLM study systematically investigates the training dynamics of language models across their entire lifecycle, from pre-training to continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). By training over 100 1B and 4B parameter models from scratch, the authors reveal critical insights into how design choices at each stage influence upstream language modeling and downstream problem-solving capabilities, including both in-domain (ID) and out-of-domain (OOD) generalization.
Key conclusions drawn are:
- Diminishing Returns: Excessive pre-training and post-training lead to diminishing returns and can even cause performance degradation, particularly on OOD tasks.
- Role of CPT: Continued pre-training is crucial for bridging pre-training and post-training. Domain-specific CPT causes catastrophic forgetting, but this can be effectively mitigated by incorporating a small replay budget of general-domain data. Adequate CPT data is essential for SFT and for maximizing the benefits of RL, also improving OOD generalization.
- SFT and RL Trade-offs: Excessive SFT can lead to over-specialization, improving ID performance at the cost of OOD generalization and limiting subsequent RL improvements. RL with excessive compute (epochs or examples) also yields diminishing returns, primarily boosting the confidence of existing correct outputs rather than enhancing fundamental reasoning capabilities. Under data constraints, there is a clear trade-off in allocating data between SFT (for ID gains) and RL (for OOD generalization).
- Evaluation Reliability: Intermediate checkpoints are unreliable surrogates for fully trained models. Outcome Reward Model (ORM) scores are shown to be a more reliable unsupervised validation metric for downstream task performance than validation perplexity.
- Open Research: The release of EvoLM (models, datasets, and pipeline) promotes transparency, reproducibility, and open research in understanding LM training dynamics.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Model Scale: The study primarily focused on models up to 4B parameters. Future research should investigate whether the observed trends generalize to larger models and explore more optimal hyperparameters for them.
- Post-training Objectives: The current study focused on reasoning-centric post-training objectives. Future work could explore other critical objectives like safety alignment, instruction-following, tool-calling, and coding tasks.
- RL Methods: The RL experiments exclusively used Proximal Policy Optimization (PPO) with verifiable rewards. Exploring alternative reinforcement learning methods could offer broader insights into their effects on downstream capabilities.
7.3. Personal Insights & Critique
This paper provides an invaluable, systematic, and highly transparent investigation into the complex multi-stage training of modern language models. Its emphasis on training models from scratch with full control is a significant strength, addressing a major gap in the literature where many studies rely on opaque, publicly available checkpoints. The EvoLM suite itself is a substantial contribution to open science.
One particularly insightful finding is the quantification of diminishing returns and even performance degradation from excessive pre-training and post-training. This challenges the common intuition that "more is always better" and provides concrete evidence for optimal resource allocation. The distinction between merely increasing the probability of good outputs (correct ratio) versus genuinely improving fundamental reasoning capabilities (Pass@16 degradation) through RL is also crucial for practitioners configuring complex training pipelines.
The demonstration that intermediate checkpoints are not reliable surrogates for fully trained smaller models is a critical methodological insight that could prevent misleading conclusions in future research. The finding that ORM scores are better predictors of downstream task performance than validation perplexity is also highly practical, offering a more effective way to monitor and validate model quality during post-training, especially in data-scarce scenarios.
Critically, the paper's findings regarding the interplay between CPT, catastrophic forgetting, and replay strategies offer actionable guidance for adapting LMs to new domains without sacrificing general capabilities. The granular analysis of SFT and RL trade-offs for ID versus OOD performance under data constraints highlights the importance of strategic data allocation, which is vital for efficient model development.
While the study is comprehensive, its focus on only decoder-only models and a specific set of reasoning tasks means that the generalizability of some findings might need further validation for encoder-decoder models or different task domains (e.g., creative writing, summarization). The specific compute-optimal ratios and saturation points observed are inherently tied to the LLaMA-2 architecture, FineWeb-Edu dataset, and specific evaluation tasks. Future work could explore how these optimal points shift with different architectures or pre-training data compositions.
Overall, EvoLM serves as a robust foundation for understanding the intricate dance of modern LM training. Its methodologies and insights can be broadly applied to improve the efficiency, effectiveness, and interpretability of large language models across various applications, contributing significantly to the responsible development of AI.