
EvoLM: In Search of Lost Language Model Training Dynamics

Published: June 19, 2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

EvoLM offers a systematic analysis of language model training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. Key findings include diminishing returns from excessive pre-training and post-training, the importance of continued pre-training in bridging pre-training and post-training, and intricate trade-offs when configuring supervised fine-tuning and reinforcement learning.

Abstract

Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. We train over 100 LMs with 1B and 4B parameters from scratch, and evaluate both upstream (language modeling) and downstream (problem-solving) capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is EvoLM: In Search of Lost Language Model Training Dynamics. It focuses on systematically analyzing the training dynamics of language models across their various lifecycle stages.

1.2. Authors

The authors are:

  • Zhenting Qi (Harvard)
  • Fan Nie (Stanford)
  • Alexandre Alahi (EPFL)
  • James Zou (Stanford)
  • Himabindu Lakkaraju (Harvard)
  • Yilun Du (Harvard)
  • Eric Xing (CMU)
  • Sham Kakade (Harvard)
  • Hanlin Zhang (Harvard)

1.3. Journal/Conference

The paper was posted to arXiv on June 19, 2025 (UTC: 2025-06-19T04:58:47). No specific journal or conference is named in the provided text; the arXiv link indicates it is a preprint, commonly a precursor to publication in top-tier machine learning venues. Given the affiliations and topic, it likely targets venues such as NeurIPS, ICML, ICLR, or ACL.

1.4. Publication Year

The publication year is 2025, specifically June 19, 2025.

1.5. Abstract

The paper addresses the challenge of understanding the impact of design choices in modern multi-stage language model (LM) training. It introduces EvoLM, a model suite designed for systematic and transparent analysis of LM training dynamics across pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The authors train over 100 LMs with 1B and 4B parameters from scratch, evaluating both upstream (language modeling) and downstream (problem-solving) capabilities, considering both in-domain (ID) and out-of-domain (OOD) generalization. Key findings include diminishing returns from excessive pre-training and post-training, the importance of mitigating forgetting during domain-specific CPT using replay strategies, the crucial role of CPT in bridging training phases, and various trade-offs in SFT and RL configurations. To promote open research, all models, datasets, and the entire training/evaluation pipeline are released.

The original source link is https://arxiv.org/abs/2506.16029v2 and the PDF link is https://arxiv.org/pdf/2506.16029v2. It is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the difficulty in evaluating the impact of design choices made at each stage of modern language model (LM) training. LM training has become a multi-stage process, including pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). This complexity, coupled with the vast design space and interactions between phases, makes it hard for downstream developers to understand which decisions lead to reliable performance gains.

This problem is important because scaling up LMs has become a dominant paradigm for various applications. However, existing studies often rely on opaque analyses from post-training studies using off-the-shelf base models (without strict control over variables like model size or data) or evaluations based on intermediate checkpoints (which may have suboptimal performance due to incomplete learning rate decay). These issues introduce confounding factors and hinder a clear understanding of training dynamics and how they translate to downstream problem-solving performance.

The paper's entry point is to establish an end-to-end, transparent, and systematic development pipeline. By training LMs from scratch with complete control over all stages and parameters, they aim to explicitly investigate the reasoning capabilities of LMs throughout their entire lifecycle.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Systematic and Transparent Analysis Framework: Introduction of EvoLM, a model suite and framework enabling a systematic and transparent analysis of LM capabilities across the entire lifecycle (pre-training to RL post-training). This includes evaluation on reasoning-intensive upstream cloze tasks and downstream generative tasks, considering both in-domain (ID) and out-of-domain (OOD) generalization.
  • Open-Sourced Resources: Release of 100+ LMs trained from scratch (1B and 4B parameters), their training data for all stages, and a comprehensive, reproducible training and evaluation pipeline. This facilitates open research and allows the community to build upon their findings.
  • Key Insights into Training Dynamics:
    • Diminishing Returns: Excessive pre-training and post-training lead to diminishing returns on both upstream and downstream performance, and can even cause degradation on some OOD tasks. Saturation points were observed around 80x to 160x model size for pre-training, and specific epoch/data amounts for SFT and RL.
    • Model Size vs. Compute: Under limited pre-training budgets, smaller post-trained models can outperform larger counterparts. However, once pre-training tokens reach the saturation regime, increasing model size leads to clear improvements in both ID performance and OOD generalization.
    • Mitigating Catastrophic Forgetting: Domain-specific continued pre-training (CPT) induces catastrophic forgetting of general-domain knowledge. Incorporating a small replay budget (e.g., 5% of pre-training data) effectively mitigates this degradation, benefiting both upstream and downstream performance.
    • Role of CPT: Adequate domain-specific CPT data is crucial. Without it, SFT performance remains suboptimal, and RL can even degrade performance. As CPT data increases, ID downstream performance steadily improves, and SFT models benefit more from RL fine-tuning. Sufficient CPT data also promotes generalization to OOD tasks.
    • SFT and RL Trade-offs:
      • Excessive SFT (especially too many epochs) improves ID performance with diminishing returns but can degrade OOD performance and limit further RL improvements.
      • Excessive RL (epochs or examples) improves downstream performance (ID and OOD) with diminishing returns. Beyond saturation, RL primarily increases the probability of sampling high-quality rollouts rather than fundamentally improving reasoning capabilities (Pass@16 accuracy degrades while Correct Ratio increases).
      • Under constrained downstream data budgets, allocating more data to SFT maximizes ID gains at the expense of weaker OOD generalization, while more RL data improves OOD performance.
    • Intermediate Checkpoints are Unreliable: Using intermediate checkpoints from a longer training run as surrogates for fully trained smaller models is unreliable, as they consistently lag behind models trained with their own complete learning rate schedules.
    • ORM Score as a Proxy: Outcome Reward Model (ORM) scores exhibit strong correlations (0.62 to 0.84) with downstream task accuracy, suggesting they can serve as reliable unsupervised proxy metrics for assessing generation quality during post-training, especially in data-constrained scenarios. This contrasts with validation perplexity, which shows negligible correlation with downstream task accuracy in post-trained models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following core concepts in language model development:

  • Language Models (LMs): Statistical models that learn the probability distribution of sequences of words or tokens. They predict the next word in a sequence given the preceding words. Autoregressive generative models are a common type, meaning they generate text one token at a time, conditioned on previously generated tokens.
  • Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that forms the backbone of most modern LMs. It relies heavily on self-attention mechanisms to weigh the importance of different words in a sequence. The paper mentions using the LLaMA-2 architecture, which is a Transformer-based decoder-only model.
  • Pre-training: The initial phase of training LMs on a massive corpus of text data (e.g., billions or trillions of tokens) to learn general language understanding and generation capabilities. The primary objective is typically next-token prediction, minimizing cross-entropy loss (or log-loss).
  • Tokens: The basic units of text processed by LMs. These can be words, subword units (like wordpieces or byte-pair encodings), or characters.
  • Perplexity (PPL): A common metric for evaluating language models. It measures how well a probability model predicts a sample; lower perplexity indicates a better model. For a sequence of tokens $W = (w_1, w_2, \dots, w_N)$, perplexity is defined as $ \mathrm{PPL}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}} $, where the sequence probability is the product of conditional probabilities, $ P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1}) $. A lower perplexity means the model assigns higher probability to the actual sequence, indicating better language modeling (a short numerical sketch follows this list).
  • Fine-tuning: The process of adapting a pre-trained model to a specific downstream task or domain by training it further on a smaller, task-specific dataset.
  • Supervised Fine-Tuning (SFT): A type of fine-tuning where the model is trained on labeled data for a specific task (e.g., question-answering pairs, summarization). The model learns to generate desired outputs based on given inputs and corresponding ground truth responses.
  • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize a cumulative reward. In the context of LMs, Reinforcement Learning from Human Feedback (RLHF) is common, where human preferences or reward models (RMs) provide signals to optimize the LM's outputs.
  • Proximal Policy Optimization (PPO): A popular on-policy RL algorithm used in this paper. It optimizes a policy (the LM's generation strategy) by taking small steps to avoid large changes that could destabilize training, ensuring stable and effective learning.
  • Catastrophic Forgetting: A phenomenon in neural networks where training a model on new data causes it to forget previously learned information. This is particularly relevant during continued pre-training or fine-tuning when adapting to a new domain.
  • Scaling Laws: Empirical observations that describe how model performance (e.g., log-loss or perplexity) changes as a function of various resources, such as model size (parameters), dataset size (tokens), and computational budget (FLOPs). A key concept is the compute-optimal ratio for training.
  • In-domain (ID) and Out-of-domain (OOD) Generalization: In-domain refers to evaluating a model on data similar to its training data. Out-of-domain refers to evaluating on data that is distinct or different from the training data, testing the model's ability to generalize to new, unseen distributions.
  • Outcome Reward Model (ORM): A separate model trained to assign a scalar score to a generated output (e.g., a solution to a problem) given an input. This score serves as a proxy for the quality or correctness of the output and is used to provide feedback in RL or as an evaluation metric.
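
To make the perplexity definition above concrete, here is a minimal numerical sketch (my own illustration, not from the paper): it computes PPL from a short list of made-up per-token conditional probabilities.

```python
import math

# Hypothetical per-token conditional probabilities P(w_i | w_1, ..., w_{i-1})
# for a 4-token sequence; the values are illustrative only.
token_probs = [0.25, 0.10, 0.50, 0.05]

# PPL(W) = P(w_1, ..., w_N)^(-1/N) = exp(-(1/N) * sum_i log P(w_i | context))
n = len(token_probs)
avg_nll = -sum(math.log(p) for p in token_probs) / n
perplexity = math.exp(avg_nll)

print(f"average NLL = {avg_nll:.3f}, perplexity = {perplexity:.2f}")
```

A model that assigned probability 1.0 to every token would reach the minimum perplexity of 1; worse predictions drive the value up.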

3.2. Previous Works

The paper contextualizes its work by referencing several lines of prior research:

  • Scaling Laws: Early works like those by Hestness et al. [22], Kaplan et al. [27], and Hoffmann et al. [23] established fundamental relationships between training loss and model size, data quantity, and compute. The paper specifically mentions the Chinchilla scaling law [23], which recommends a compute-optimal ratio of approximately 20 tokens per model parameter.
  • Training Dynamics and Emergent Abilities: Research has explored how models learn during training [59, 53, 43, 16] and the concept of emergent abilities [45], where LMs gain new capabilities only at large scales. However, accurately forecasting downstream problem-solving performance remains challenging due to the training-inference mismatch [46] and non-smooth improvements [45].
  • Post-training Studies and Opaque Checkpoints: Existing studies often rely on off-the-shelf base models or intermediate checkpoints [59, 51] which lack transparency regarding training details and control over variables like model size or pre-training data composition [42, 10, 61]. This can introduce confounding factors. The paper aims to address this by providing a fully transparent pipeline.
  • Catastrophic Forgetting and Continued Pre-training (CPT): The phenomenon of catastrophic forgetting [15] during fine-tuning or adaptation is a known challenge. Prior work has explored data replay strategies [25, 41, 4, 62] to mitigate this during CPT, especially in domain-specific continual pre-training [41].
  • SFT and RL for Reasoning: Studies have investigated the impact of SFT and RL on reasoning capabilities. Some suggest that SFT scales with fine-tuning examples [42, 72] and that RL amplifies pre-trained patterns, driving models towards dominant output distributions [70]. Others critically examine whether RL truly boosts reasoning beyond pre-trained baselines, suggesting it primarily enhances confidence in existing correct outputs rather than fundamentally improving reasoning [64, 10]. The paper expands on these findings by providing precise trade-offs.
  • Overtraining: Recent work has identified "catastrophic overtraining" [51], where prolonged pre-training can impair downstream fine-tuning by increasing sensitivity to parameter updates and exacerbating forgetting. The paper's findings align with and extend this by showing degradation on downstream generative reasoning tasks even after RL.

3.3. Technological Evolution

The field of language models has rapidly evolved from early statistical models to deep learning architectures like recurrent neural networks (RNNs) and long short-term memory (LSTMs), culminating in the Transformer architecture. This evolution has been driven by increased computational power, larger datasets, and architectural innovations, leading to ever-larger models.

Initially, LMs were primarily evaluated on language modeling loss or perplexity. However, as models became more capable, the focus shifted to their performance on diverse downstream tasks, requiring fine-tuning for specific applications. The introduction of reinforcement learning from human feedback (RLHF) marked a significant step in aligning LM outputs with human preferences and improving conversational abilities.

This paper fits into the current state of this evolution by moving beyond simply scaling models and focusing on the dynamics of how capabilities emerge and are shaped across the entire multi-stage training lifecycle. It addresses the challenge of understanding the complex interactions between these stages, which is crucial for building more efficient, reliable, and interpretable large language models. It also emphasizes reasoning-centric evaluation, moving beyond simple next-token prediction to assess problem-solving capabilities.

3.4. Differentiation Analysis

Compared to related work, EvoLM differentiates itself in several key aspects:

  • Holistic, Transparent Lifecycle Analysis: Unlike many studies that focus on a single stage (e.g., pre-training scaling laws) or use opaque, off-the-shelf base models, EvoLM provides an end-to-end, systematic, and transparent analysis across all training stages (pre-training, CPT, SFT, RL). This allows for a unique investigation into the interplay and dependencies between these stages.
  • From-Scratch Training with Full Control: The paper trains 100+ LMs from scratch with 1B and 4B parameters, ensuring complete control over model size, pre-training data size, data components, and complete learning rate decay. This rigorous control eliminates confounding factors often present in studies using pre-existing checkpoints.
  • Focus on Reasoning-Intensive Downstream Tasks: While many works evaluate upstream log-loss, EvoLM specifically evaluates problem-solving capabilities on reasoning-intensive tasks (math, code, logic, commonsense) for both in-domain and out-of-domain generalization, providing a more practical assessment of LM utility.
  • Empirical Validation of Over-training Phenomena: It systematically quantifies the diminishing returns and even degradation from excessive pre-training and post-training, extending findings like "catastrophic overtraining" [51] to generative reasoning tasks and showing its impact on RL fine-tuning.
  • Detailed CPT Strategies and Forgetting Mitigation: The study rigorously explores the role of continued pre-training, highlighting catastrophic forgetting during domain adaptation and empirically demonstrating the effectiveness of pre-training data replay strategies (e.g., 5% replay) in mitigating it.
  • Granular Analysis of SFT/RL Trade-offs: The paper provides fine-grained insights into how scaling SFT epochs, SFT dataset size, RL epochs, and RL dataset size impact ID and OOD performance, and how these interact, including when RL can be limited or even degrade performance.
  • Reliability of Evaluation Metrics: It critically assesses validation perplexity as an evaluation metric for post-trained models, showing its unreliability for generative reasoning performance, and proposes ORM scores as a more effective unsupervised proxy metric.
  • Open-Source Commitment: The commitment to releasing all models, datasets, and the entire training/evaluation pipeline fosters reproducibility and future research, differentiating it from many works that only present findings without corresponding open resources.

4. Methodology

The EvoLM methodology focuses on systematically investigating the training dynamics of language models across four distinct stages: pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The core idea is to train numerous models from scratch under controlled conditions and evaluate their capabilities using a comprehensive set of upstream and downstream tasks.

4.1. Principles

The underlying principle of the EvoLM methodology is to achieve transparency and systematic control over the entire LM development lifecycle. By controlling key variables like model size, pre-training data volume, data composition, and hyper-parameters at each stage, the authors aim to isolate the impact of different design choices. This allows for empirical validation of scaling laws, understanding catastrophic forgetting, and identifying critical trade-offs in fine-tuning strategies. The theoretical basis rests on the empirical observation of scaling laws in LMs, which posit predictable relationships between resources and performance, and the need to extend these observations to the multi-stage post-training regime.

4.2. Core Methodology In-depth (Layer by Layer)

The training pipeline consists of four sequential stages, each with specific configurations and objectives:

4.2.1. Pre-training

The initial stage involves training decoder-only autoregressive LMs from scratch.

  • Architecture: All models use the LLaMA-2 architecture [56]. The paper investigates models with 1B and 4B parameters (and some 0.5B for initial scaling studies). The architecture details are provided in Table 5 in the Appendix.
  • Pre-training Data: FineWeb-Edu [38], an extensive educational dataset sourced from web content, is used. It contains approximately 1.3 trillion tokens.
  • Token Budgets: Guided by the Chinchilla scaling law [23], which recommends a compute-optimal ratio of approximately 20 tokens per model parameter, models are pre-trained across various token budgets:
    • From 20x model size (the optimal Chinchilla ratio)
    • Up to 320B tokens, which represents 16x Chinchilla for a 1B model, i.e., $1\mathrm{B} \times 20\ \text{tokens/parameter} \times 16 = 320\mathrm{B}$ tokens.
    • This range allows for investigating the effects of mild over-training ($>1\times$ and $\le 16\times$ the Chinchilla ratio) and excessive over-training ($>16\times$); see the budget-arithmetic sketch after this list.
  • Learning Rate Schedule: A complete learning rate decay schedule is used, and only the final checkpoints are considered.
  • Hyperparameters: Specific hyperparameters for pre-training (and continued pre-training) are detailed in Table 6, including precision, global_batch_size, max_seq_length, lr_warmup_ratio, max_norm, lr, min_lr, weight_decay, beta1, beta2, and epoch. For example, for a 1B model, the learning rate (lr) is 0.0002 and weight_decay is 0.1.
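
As a quick sanity check on the token-budget arithmetic above, the following sketch (not from the paper) computes pre-training budgets as multiples of the roughly 20-tokens-per-parameter Chinchilla ratio.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # compute-optimal ratio reported by Hoffmann et al.

def token_budget(n_params: float, chinchilla_multiple: float = 1.0) -> float:
    """Pre-training tokens for a given multiple of the Chinchilla-optimal budget."""
    return n_params * CHINCHILLA_TOKENS_PER_PARAM * chinchilla_multiple

for multiple in (1, 2, 4, 8, 16):
    tokens = token_budget(1e9, multiple)
    print(f"1B params, {multiple:>2}x Chinchilla -> {tokens / 1e9:.0f}B tokens")
# 16x Chinchilla for a 1B model gives 1e9 * 20 * 16 = 320B tokens, the largest
# pre-training budget used for the 1B models in the paper.
```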

4.2.2. Continued Pre-training (CPT)

After initial pre-training, some models undergo continued pre-training to adapt to a specific domain.

  • CPT Data: FineMath [2], a curated dataset of mathematical texts, problems, and solutions (approximately 50 billion tokens), is used for domain-specific CPT.
  • Token Budgets: CPT token budgets range from 2B to 42B tokens.
  • Mitigating Catastrophic Forgetting: To address catastrophic forgetting of general-domain knowledge during domain-specific CPT, a pre-training data replay strategy [25, 41, 4, 62] is employed. This involves randomly interleaving a small amount of the original FineWeb-Edu pre-training data with the FineMath data. The paper explores different replay budgets (e.g., 8 billion tokens of replayed general-domain data).
  • Learning Rate Schedule: Similar to pre-training, a complete learning rate decay schedule is applied.
  • Hyperparameters: The same hyperparameters as pre-training (Table 6) are generally used, though specific CPT runs may vary total tokens.
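
The replay strategy described above can be pictured as interleaving two document streams at a fixed budget ratio. The generator below is a minimal sketch under that assumption (my illustration, not the authors' data pipeline), using the 8BT replay / 42BT FineMath split as the default mixing fraction.

```python
import random

def mixed_cpt_stream(domain_docs, replay_docs, replay_fraction=8 / (8 + 42)):
    """Yield CPT training documents, drawing a replayed general-domain document
    with probability `replay_fraction` and a domain-specific one otherwise.
    Both inputs are assumed to be iterables of (already tokenized) documents."""
    domain_iter, replay_iter = iter(domain_docs), iter(replay_docs)
    while True:
        try:
            if random.random() < replay_fraction:
                yield next(replay_iter)   # e.g. a replayed FineWeb-Edu document
            else:
                yield next(domain_iter)   # e.g. a FineMath document
        except StopIteration:             # stop once either budget is exhausted
            return

# Toy usage with placeholder "documents":
stream = mixed_cpt_stream(domain_docs=["math_doc"] * 10, replay_docs=["web_doc"] * 10)
print(list(stream)[:10])
```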

4.2.3. Supervised Fine-Tuning (SFT)

Following pre-training and CPT, models are fine-tuned on a task-specific dataset.

  • SFT Data: A dataset of QA pairs augmented from GSM8K [12] and MATH [20] is used. This data is collected from a mixture of MetaMathQA [63], OpenMathInstruct2 [55], and NuminaMath [31].
  • Data Filtering: Low-quality prompts are filtered out using model correctness consistency [39], discarding samples with zero inter-model consensus.
  • SFT Examples/Epochs: The number of SFT examples and epochs are varied to study their impact (e.g., 100K examples for 1 epoch, or varying epochs from 1 to 32, or examples from 50K to 400K).
  • SFT/RL Template: A specific template is used for SFT and RL tuning to structure inputs and outputs:
    Human: {query}
    Assistant: {response}
    
    This template frames the task as a conversational interaction.
  • Hyperparameters: Specific hyperparameters for SFT are detailed in Table 7, including cutoff_len (maximum sequence length), batch_size, learning_rate, lr_scheduler_type (cosine), and warmup_ratio. For a 1B model, the learning_rate is 0.00001, and batch_size is 128.
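
As an illustration of the Human/Assistant template above, a small formatting helper might look like the sketch below (my own illustration; the sample question is a GSM8K-style problem, not necessarily one drawn from the authors' SFT set).

```python
SFT_TEMPLATE = "Human: {query}\nAssistant: {response}"

def format_sft_example(query: str, response: str) -> str:
    """Render one supervised fine-tuning example in the conversational template."""
    return SFT_TEMPLATE.format(query=query, response=response)

print(format_sft_example(
    query="Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did she sell altogether?",
    response="In May she sold 48 / 2 = 24 clips, so in total she sold "
             "48 + 24 = 72 clips. The answer is 72.",
))
```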

4.2.4. Reinforcement Learning (RL)

The final stage involves reinforcement learning to further refine the model's behavior based on a reward signal.

  • RL Algorithm: Proximal Policy Optimization (PPO) [47] is employed.
  • Reward Signal: A binary verifiable reward is used, implying that the correctness of generated solutions can be objectively checked.
  • RL Data: The RL stage uses the same data sources as SFT (augmented QA pairs from GSM8K, MATH, etc.) but ensures no overlap with the SFT dataset to prevent data contamination.
  • RL Examples/Epochs: Similar to SFT, the number of RL examples and epochs are varied (e.g., 100K examples for 8 epochs, or varying epochs from 1 to 32, or examples from 50K to 400K).
  • Hyperparameters: Specific hyperparameters for RL (PPO) are detailed in Table 8, including actor_lr (learning rate for the policy network), critic_lr (learning rate for the value network), kl (KL divergence coefficient), train_batch_size, max_prompt_length, max_response_length, ppo_mini_batch_size, ppo_micro_batch_size_per_gpu, log_prob_micro_batch_size_per_gpu, and warmup_steps_ratio. For a 1B model, actor_lr is $2.00 \times 10^{-6}$ and critic_lr is $2.00 \times 10^{-5}$.
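
The binary verifiable reward can be sketched as an answer-extraction-and-match check. The snippet below is my own illustration and assumes the final answer is announced with an "The answer is ..." marker; it is not necessarily the paper's exact verification logic.

```python
import re

ANSWER_PATTERN = re.compile(r"answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)", re.IGNORECASE)

def extract_final_answer(response: str):
    """Return the last number announced after an 'answer is' marker, if any."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].replace(",", "") if matches else None

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward for PPO: 1.0 if the extracted answer matches the label."""
    predicted = extract_final_answer(response)
    return 1.0 if predicted is not None and predicted == ground_truth.strip() else 0.0

print(verifiable_reward("She sold 72 clips in total. The answer is 72.", "72"))  # 1.0
print(verifiable_reward("I think the answer is 70.", "72"))                      # 0.0
```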

4.2.5. Model Signature

A compact model signature is used to denote the configuration of each model across training stages. For example: 1B-160BT-8+42BT-100Kep1-100Kep16

  • 1B: Model with 1 billion parameters.

  • 160BT: Pretrained on 160 billion tokens from FineWeb-Edu.

  • 8+42BT: Continued pre-trained with 8 billion tokens of replayed general-domain data (FineWeb-Edu) and 42 billion tokens of domain-specific data (FineMath).

  • 100Kep1: Supervised fine-tuned on 100K examples for 1 epoch.

  • 100Kep16: Reinforcement learning fine-tuned on 100K examples for 16 epochs.

    All models are trained with complete learning rate scheduling, and only the final checkpoints are used for study, which is a critical aspect highlighted in Section 4.1 to avoid the pitfalls of using intermediate checkpoints.
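
To make the signature notation concrete, here is a small, purely illustrative parser for the five-field layout described above (a hypothetical helper, not part of the released pipeline).

```python
def parse_signature(signature: str) -> dict:
    """Split an EvoLM-style signature such as '1B-160BT-8+42BT-100Kep1-100Kep16'
    into its per-stage components. Assumes the five-field layout described above."""
    size, pretrain, cpt, sft, rl = signature.split("-")
    replay_tokens, domain_tokens = cpt.split("+")
    sft_examples, sft_epochs = sft.split("ep")
    rl_examples, rl_epochs = rl.split("ep")
    return {
        "params": size,                        # "1B"
        "pretrain_tokens": pretrain,           # "160BT" (FineWeb-Edu)
        "cpt_replay_tokens": replay_tokens,    # "8" (replayed FineWeb-Edu, in BT)
        "cpt_domain_tokens": domain_tokens,    # "42BT" (FineMath)
        "sft": {"examples": sft_examples, "epochs": int(sft_epochs)},  # 100K, 1 epoch
        "rl": {"examples": rl_examples, "epochs": int(rl_epochs)},     # 100K, 16 epochs
    }

print(parse_signature("1B-160BT-8+42BT-100Kep1-100Kep16"))
```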

5. Experimental Setup

5.1. Datasets

The study utilizes a variety of datasets across different training and evaluation stages:

Training Data

  • FineWeb-Edu [38]: This is an extensive educational dataset derived from web content, specifically designed for pre-training language models. It focuses on high-quality academic and educational text. The total size is approximately 1.3 trillion tokens. This dataset forms the backbone of the initial general-domain knowledge acquisition.
  • FineMath [2]: A curated dataset consisting of mathematical texts, problems, and solutions. Its purpose is to enhance language models' mathematical knowledge during continued pre-training. It contains around 50 billion tokens.
  • OpenMathInstruct2 [55], MetaMathQA [63], and NuminaMath [31]: These are instruction-tuning datasets primarily containing mathematical questions paired with step-by-step solutions and explanations. They are used for supervised fine-tuning (SFT) and reinforcement learning (RL). The responses for these datasets were collected by prompting the Qwen2.5-7B-Math-Instruct model [60].

Evaluation Benchmarks

The evaluation covers both upstream cloze tasks and downstream generative tasks, assessing in-domain (math reasoning) and out-of-domain (general reasoning) capabilities.

Upstream Cloze Tasks

These tasks evaluate the models' language modeling capabilities via next-token prediction without requiring conversational abilities. Models are evaluated on the average 0-shot accuracy across these datasets:

  • HellaSwag [65]: A commonsense reasoning benchmark that challenges models to choose the most plausible ending to a given sentence.
  • Winogrande [44]: An adversarial Winograd Schema Challenge designed to be resistant to simple statistical biases, testing common-sense reasoning and pronoun resolution.
  • PIQA [6]: A physical commonsense reasoning benchmark where models must choose the correct solution to everyday physical problems.
  • OBQA [36] (OpenBookQA): A question-answering dataset that tests knowledge and reasoning about elementary science facts.
  • ARC-Easy/Challenge [11] (AI2 Reasoning Challenge): A set of science questions designed to be difficult for current AI systems, divided into an Easy set and a Challenge set requiring more advanced reasoning.

Downstream Generative Tasks

These tasks evaluate models' problem-solving abilities in a generative, conversational setting.

  • In-Domain Tasks (Math Reasoning):

    • GSM8K-Platinum [57]: A revised, manually cleaned, and denoised version of GSM8K [12], a benchmark of 8.5K high-quality grade-school math word problems requiring multi-step reasoning. It emphasizes elementary arithmetic and no concepts beyond early algebra. The test set contains 1319 unique problems.
    • MATH [20]: A large-scale benchmark (12,500 problems) designed to evaluate mathematical reasoning. Problems are sourced from math competitions, categorized into seven topics (e.g., Algebra, Geometry, Number Theory, Precalculus) and five difficulty levels, and require detailed, step-by-step solutions.
  • Out-of-Domain Tasks (General Reasoning):

    • CRUXEval [19]: A benchmark for evaluating code reasoning, understanding, and execution. It features 800 Python functions (3-13 lines) with input-output pairs. Models are tasked to generate corresponding outputs given a function snippet and an input example. Its test set contains 800 unique problems.
    • BGQA [28] (BoardgameQA): A logical reasoning benchmark that assesses models' ability to reason with contradictory information using defeasible reasoning (where conflicts are resolved based on source preferences like credibility). Its test set contains 15K unique problems.
      • Example Data Sample (from paper Appendix): Human: [Game state and rules describing a board game scenario with contradictory information and rule preferences]. Based on the game state and the rules and preferences, does the worm hide the cards that she has from the mouse? Assistant: [Model output with reasoning and final answer (True/False)].
    • TabMWP [35]: A benchmark introduced to evaluate mathematical reasoning over tabular data. It contains approximately 38,000 math word problems, each associated with relevant tables, spanning diverse mathematical reasoning types.
    • StrategyQA [18]: A commonsense question-answering benchmark designed for multi-hop reasoning where reasoning steps are implicit and must be inferred. Each of its 2,780 examples includes a strategy question, its step-by-step decomposition, and supporting Wikipedia evidence.

Sampling Parameters

For all test datasets and models, a consistent prompt template (same as SFT/RL) is used.

  • Temperature: Set to 0 for greedy decoding (deterministic response) and 1 for decoding with randomness (16 generations).
  • Repetition Penalty: Set to 1.1.
  • Inference Framework: vLLM [29] is used for efficient inference.
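
A minimal decoding sketch with vLLM using the sampling settings listed above; the checkpoint path, prompt, and max_tokens value are placeholders, and this is my illustration of the setup rather than the authors' evaluation script.

```python
from vllm import LLM, SamplingParams

# Placeholder path; the released EvoLM checkpoints would be loaded here.
llm = LLM(model="path/to/evolm-1b-checkpoint")

greedy = SamplingParams(temperature=0.0, repetition_penalty=1.1, max_tokens=1024)
sampled = SamplingParams(temperature=1.0, repetition_penalty=1.1, n=16, max_tokens=1024)

prompt = "Human: What is 12 * 7?\nAssistant:"
outputs = llm.generate([prompt], sampled)       # 16 samples for Maj@16 / Pass@16
for completion in outputs[0].outputs:
    print(completion.text)
```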

5.2. Evaluation Metrics

The paper uses a comprehensive set of evaluation metrics to assess upstream and downstream performance:

Upstream Metrics

  • Average 0-shot Accuracy: For upstream cloze tasks, this measures the percentage of correctly predicted answers on multiple-choice style questions, without any prior examples in the prompt.

Downstream Generative Task Metrics

For downstream generative tasks, correctness is determined by extracting final answers from model outputs and comparing them against ground truth solutions.

  • Accuracy: Measured under four prompting schemes:

    1. Pass@1:
      • Conceptual Definition: Measures the accuracy when the model generates a single, deterministic response. If this response is correct, the problem is marked as solved. It assesses the model's most confident, singular output.
      • Mathematical Formula: $ \mathrm{Pass@1} = \frac{\text{Number of problems with 1 correct deterministic response}}{\text{Total number of problems}} $
      • Symbol Explanation:
        • Number of problems with 1 correct deterministic response: The count of problems where the single generated output (with temperature 0) matches the ground truth.
        • Total number of problems: The total count of problems in the evaluation set.
    2. Maj@16 (Majority@16):
      • Conceptual Definition: The model generates 16 responses (with temperature 1 for randomness). The problem is marked correct if the majority answer among these 16 responses is correct. This metric aims to capture a more robust measure of the model's understanding by leveraging multiple samples.
      • Mathematical Formula: $ \mathrm{Maj@16} = \frac{\text{Number of problems where the majority of 16 samples are correct}}{\text{Total number of problems}} $
      • Symbol Explanation:
        • Number of problems where the majority of 16 samples are correct: The count of problems where, among 16 generated samples, more than 8 samples yield the correct answer.
        • Total number of problems: The total count of problems in the evaluation set.
    3. RM@16 (Reward Model@16):
      • Conceptual Definition: From the 16 generated responses, the one with the highest ORM score (assigned by an outcome reward model) is selected, and its correctness is evaluated. This metric assesses how well a reward model can identify the best-performing generated solution.
      • Mathematical Formula: $ \mathrm{RM@16} = \frac{\text{Number of problems where the highest ORM-scored sample is correct}}{\text{Total number of problems}} $
      • Symbol Explanation:
        • Number of problems where the highest ORM-scored sample is correct: The count of problems where the single generated output selected by the reward model (from 16 samples) matches the ground truth.
        • Total number of problems: The total count of problems in the evaluation set.
    4. Pass@16:
      • Conceptual Definition: The model generates 16 responses (with temperature 1 for randomness). The problem is marked solved if any one of these 16 responses is correct. This metric measures the model's potential to solve a problem if given multiple attempts, indicating whether the model "knows" the answer even if it doesn't consistently produce it.
      • Mathematical Formula: $ \mathrm{Pass@16} = \frac{\text{Number of problems with at least 1 correct response among 16 samples}}{\text{Total number of problems}} $
      • Symbol Explanation:
        • Number of problems with at least 1 correct response among 16 samples: The count of problems where at least one of the 16 generated outputs matches the ground truth.
        • Total number of problems: The total count of problems in the evaluation set.
  • Correct Ratio:

    • Conceptual Definition: For response groups that have at least one correct solution, this metric computes the ratio of the number of correct solutions to the total number of solutions (16). It provides insight into the "hit rate" within successful problem attempts.
    • Mathematical Formula: $ \mathrm{Correct\ Ratio} = \frac{\sum_{i=1}^{N_{\mathrm{has\_correct}}} (\text{Number of correct solutions in group } i)}{N_{\mathrm{has\_correct}} \times 16} $
    • Symbol Explanation:
      • $N_{\mathrm{has\_correct}}$: The number of problem instances where at least one of the 16 generated solutions was correct.
      • Number of correct solutions in group $i$: The count of correct solutions among the 16 samples for problem $i$, where $i$ ranges over the $N_{\mathrm{has\_correct}}$ problems with at least one correct solution.
      • 16: The total number of solutions sampled for each problem.
  • ORM Score (Outcome Reward Model Score):

    • Conceptual Definition: Uses a pre-trained outcome reward model (Skywork-Reward-Llama-3.1-8B-v0.2 [33]) to assign a scalar score to each generated solution based on the input problem and response. This metric serves as an unsupervised proxy for solution quality, useful when direct correctness checking is difficult or for ranking multiple generations.
    • Mathematical Formula: The paper refers to Skywork-Reward-Llama-3.1-8B-v0.2 [33] as the ORM. While the internal function of this specific reward model is complex (typically a fine-tuned LM that outputs a scalar reward), conceptually, for a given problem $P$ and generated response $R$, the ORM computes: $ \mathrm{ORM\ Score}(P, R) = f_{\mathrm{ORM}}(P, R) $
    • Symbol Explanation:
      • $f_{\mathrm{ORM}}$: The function implemented by the Skywork-Reward-Llama-3.1-8B-v0.2 reward model.
      • $P$: The input problem.
      • $R$: The generated response from the language model.
      • $\mathrm{ORM\ Score}(P, R)$: The scalar score assigned by the reward model, representing the quality of response $R$ for problem $P$.
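
Putting the definitions above together, the four accuracy schemes, the Correct Ratio, and ORM-based selection can all be computed from a per-problem correctness matrix and an ORM score matrix. The NumPy sketch below is my own rendering of those definitions (following the "more than 8 of 16 samples correct" reading of Maj@16 given above), not the paper's evaluation code.

```python
import numpy as np

def downstream_metrics(sample_correct: np.ndarray,
                       orm_scores: np.ndarray,
                       greedy_correct: np.ndarray) -> dict:
    """sample_correct: (num_problems, 16) bools for the 16 temperature-1 samples.
    orm_scores:       (num_problems, 16) ORM scores for the same samples.
    greedy_correct:   (num_problems,) bools for the single temperature-0 response."""
    n, k = sample_correct.shape
    pass_at_1 = greedy_correct.mean()
    maj_at_16 = (sample_correct.sum(axis=1) > k / 2).mean()   # strict majority correct
    best = orm_scores.argmax(axis=1)                          # RM@16: highest-ORM sample
    rm_at_16 = sample_correct[np.arange(n), best].mean()
    pass_at_16 = sample_correct.any(axis=1).mean()            # at least one correct
    has_correct = sample_correct.any(axis=1)                  # groups with >= 1 correct
    correct_ratio = sample_correct[has_correct].sum() / max(has_correct.sum() * k, 1)
    return {"Pass@1": pass_at_1, "Maj@16": maj_at_16, "RM@16": rm_at_16,
            "Pass@16": pass_at_16, "CorrectRatio": correct_ratio}

# Toy example with 3 problems and 16 samples each (random placeholder data):
rng = np.random.default_rng(0)
print(downstream_metrics(rng.random((3, 16)) < 0.3,
                         rng.random((3, 16)),
                         np.array([True, False, True])))
```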

5.3. Baselines

The paper doesn't strictly compare its EvoLM models against external baselines in the sense of comparing a novel algorithm against existing ones. Instead, it systematically builds and evaluates its own EvoLM model suite across different configurations, effectively using different EvoLM variants as baselines for comparison (e.g., comparing a 1B model with 80B pre-training tokens against a 1B model with 160B tokens).

However, in Appendix A.1 (Observational Comparison of Pre-trained Models), the paper does provide an observational comparison of its pre-trained models (1B and 4B models at various token counts) against several open-weight small-size LMs for upstream benchmark performance:

  • OPT 1.3B [69]

  • Pythia 1B, Pythia 1.4B, Pythia 6.9B [5]

  • TinyLlama 1B [68]

  • Llama3.2 1B, Llama 3.2 3B [56]

  • Qwen3 1.7B, Qwen1.5 4B, Qwen2.5 3B, Qwen3 4B [3]

    These models are representative baselines because they are widely recognized, open-source, and cover a range of parameter sizes and pre-training token budgets (some significantly larger than EvoLM's pre-training budget). This comparison helps establish that EvoLM's base models achieve competitive performance despite being trained on significantly less data, validating the efficiency of their pre-training process.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents several key findings across the training stages, demonstrating the complex dynamics of language model development.

6.1.1. Scaling Up Pre-training Compute

The study investigates how varying pre-training compute affects upstream (language modeling) and downstream (problem-solving) performance.

  • Upstream Performance: As shown in Figure 2, upstream task performance (average accuracy on tasks like HellaSwag, Winogrande, etc.) improves steadily with more pre-training tokens, but experiences rapidly diminishing returns beyond approximately 80x to 160x the model size. For example, the 1B model's accuracy gain from 80BT to 160BT is less than a percentage point, and the 4B model also plateaus by 320BT.

    Figure 2: Upstream task performance vs. pre-training tokens for models {0.5B, 1B, 4B}-{10BT, 20BT, 40BT, 80BT, 160BT, 320BT}.

  • Downstream Performance: Figure 3 illustrates that downstream capabilities (for both SFT and SFT+RL models) also show strong initial gains up to 80BT but then saturate. For instance, the ID Maj@16 accuracy of the SFT model for 1B models rises from 8% at 20BT to 15% at 80BT, but only reaches 17% at 320BT. RL fine-tuning provides a consistent uplift over SFT but similarly shows negligible benefit from pre-training beyond 80BT. Critically, OOD tasks (e.g., Maj@16, RM@16, and Pass@16 accuracies) can even decrease after 160BT, accompanied by a drop in ORM score, indicating overall generation quality degradation.

    Figure 3: Downstream task performance vs. number of pre-training tokens for SFT models 1B-{20BT, 40BT, 80BT, 160BT, 320BT}-8+42BT-100Kep1 and SFT+RL models 1B-{20BT, 40BT, 80BT, 160BT, 320BT}-8+42BT-100Kep1-100Kep8.

These findings lead to Takeaway 1: Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks (saturation happens around 80x to 160x model size in this study).

  • Model Size and Pre-training Budget Interplay: Table 1 compares 1B and 4B models. Under a fixed pre-training compute budget (e.g., 1B-320BT vs. 4B-80BT), the smaller 1B model can even outperform the 4B model. When matching on pre-training tokens at lower budgets (e.g., 80B tokens), 1B and 4B models perform comparably. However, once the budget rises to 160B tokens, the 4B model significantly outperforms its 1B counterpart, especially in ID Maj@16 accuracy (26.4% vs. 14.2% for SFT) and SFT+RL performance.

    The following are the results from Table 1 of the original paper:

    Each cell reports SFT / SFT+RL accuracy.

    | Base Model | ID Greedy | ID Maj@16 | ID Pass@16 | OOD Greedy | OOD Maj@16 | OOD Pass@16 |
    |---|---|---|---|---|---|---|
    | Same Pretraining Compute | | | | | | |
    | 1B-320BT-8+42BT | 14.1 / 20.1 | 16.1 / 25.0 | 36.0 / 49.0 | 25.3 / 28.3 | 24.8 / 29.9 | 54.4 / 62.6 |
    | 4B-80BT-8+42BT | 11.3 / 15.7 | 13.2 / 20.0 | 34.2 / 43.0 | 24.8 / 28.2 | 23.4 / 29.6 | 52.2 / 60.2 |
    | Same Pretraining Tokens | | | | | | |
    | 1B-80BT-8+42BT | 12.1 / 18.0 | 14.1 / 21.4 | 35.1 / 45.4 | 25.4 / 27.5 | 24.6 / 31.0 | 55.7 / 65.3 |
    | 4B-80BT-8+42BT | 11.3 / 15.7 | 13.2 / 20.0 | 34.2 / 43.0 | 24.8 / 28.2 | 23.4 / 29.6 | 52.2 / 60.2 |
    | 1B-160BT-8+42BT | 12.8 / 17.5 | 14.2 / 22.5 | 34.5 / 45.1 | 23.8 / 28.2 | 25.6 / 31.6 | 55.3 / 64.9 |
    | 4B-160BT-8+42BT | 22.0 / 27.8 | 26.4 / 34.8 | 47.6 / 58.4 | 27.9 / 29.6 | 26.0 / 33.2 | 57.3 / 66.2 |

This leads to Takeaway 2: Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts. Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD generalization.

6.1.2. Scaling Up Continued Pre-training Compute

The paper investigates the impact of Continued Pre-training (CPT) and methods to mitigate catastrophic forgetting.

  • Catastrophic Forgetting: Figure 4 shows that increasing CPT compute on domain-specific data (FineMath) gradually degrades upstream task performance, indicating catastrophic forgetting.

  • Replay Strategy: Incorporating a small replay budget (e.g., 8BT of FineWeb-Edu data) during CPT effectively mitigates this. The model with 8BT replay consistently maintains higher upstream accuracy than the no-replay baseline.

    Figure 4: Upstream task performance vs. CPT tokens, comparing continued pre-training with and without general-domain replay against the pre-trained baseline.

Table 2 further demonstrates that pure FineMath CPT (50 BT) achieves 19.27% Pass@1 accuracy on GSM8K-Platinum, but a mix of 8 BT FineWeb replay with 42 BT FineMath yields a better result of 21.01%. Configurations with either too little or too much replay perform worse, suggesting an optimal replay budget (around 5%).

The following are the results from Table 2 of the original paper:

| CPT Config | GSM8K-Platinum Pass@1 Acc. |
|---|---|
| No CPT | 6.04 |
| FineMath 50BT | 19.27 |
| FineWeb 1.6BT + FineMath 48.4BT | 16.21 |
| FineWeb 8BT + FineMath 42BT | 21.01 |
| FineWeb 16BT + FineMath 34BT | 15.22 |

This leads to Takeaway 3: CPT on domain-specific data induces catastrophic forgetting of pre-trained knowledge which could harm both upstream and downstream performance, while incorporating a small replay budget (e.g. 5%) could effectively mitigate this degradation.

  • Downstream Performance with CPT: Figure 5 shows that downstream performance (SFT and SFT+RL models with 8BT replay) improves steadily with more domain-specific tokens up to around 32BT and then plateaus by 42BT. For example, ID greedy accuracy of the SFT model rises from 5% at 2BT to 12% at 32BT. This trend is also observed in OOD metrics.

    Figure 5: Downstream task performance vs. continued pre-training tokens for SFT models 1B-160BT-100Kep1 and 1B-160BT-8+{2BT, ..., 42BT}-100Kep1, and SFT+RL models 1B-160BT-100Kep1-100Kep8 and 1B-160BT-8+{2BT, ..., 42BT}-100Kep1-100Kep8.

Takeaway 4: Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance. Takeaway 5: As domain-specific CPT data increase, in-domain downstream performance steadily improves and the SFT models could benefit more from RL finetuning. Takeaway 6: With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.

6.1.3. Scaling Up SFT Compute

The paper explores how increasing SFT compute (epochs or dataset size) affects downstream performance.

  • Varying SFT Epochs: Figure 6 indicates that ID metrics increase steadily with more epochs and saturate around 8 epochs, reflecting increased memorization. However, OOD performance peaks at 2-4 epochs before declining, implying over-specialization hinders generalization. Furthermore, the marginal gains from downstream RL finetuning shrink on over-trained SFT models.

    Figure 6: Downstream task performance vs. number of SFT epochs for SFT models 1B-160BT-8+42BT-100Kep{1,2,4,8,16,32} and SFT+RL models 1B-160BT-8+42BT-100Kep{1,2,4,8,16,32}-100Kep8.

  • Varying SFT Dataset Size: Figure 7 confirms that ID performance improves monotonically with more examples. However, OOD metrics fluctuate and can even decline with larger datasets, similar to scaling epochs. The incremental benefit from RL also diminishes with more SFT examples.

    Figure 7: Downstream task performance vs. number of SFT examples for SFT models 1B-160BT-8+42BT-{50K, 100K, ..., 400K}ep1 and SFT+RL models 1B-160BT-8+42BT-{50K, 100K, ..., 400K}ep1-100Kep8.

Takeaway 7: Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance. Takeaway 8: Excessive SFT, especially overly large epochs, could limit further RL improvements.

6.1.4. Scaling Up RL Compute

The paper investigates how increasing RL compute (epochs or dataset size) affects downstream performance.

  • Varying RL Epochs: As shown in Figure 8a, greedy, Maj@16, and RM@16 accuracies for both ID and OOD tasks peak around 8-16 epochs and then saturate. Notably, Pass@16 accuracy degrades beyond 4 epochs, while correct ratio keeps increasing. This suggests RL primarily sharpens confidence in already-correct outputs rather than expanding the set of solvable samples.

    Figure 8(a): Performance vs. number of RL epochs for models 1B-160BT-8+42BT-100Kep1-100Kep{0, 1, 2, 4, 8, 16, 32}.

  • Varying RL Dataset Size: Figure 8b illustrates that greedy, Maj@16, and RM@16 accuracies continue to increase up to 150-200K examples, after which gains flatten and fluctuate. Pass@16 saturates earlier and degrades, while the correct ratio continues to increase. A drastic performance drop at 350K and 400K examples is observed because models generate excessively long responses that exceed context window limits (Figure 12).

    Figure 8(b): Performance vs. number of RL examples for models 1B-160BT-8+42BT-100Kep1-{0, 50K, 100K, 150K, ..., 400K}ep1. Together, panels (a) and (b) of Figure 8 show downstream task performance under different RL scales.

Takeaway 9: RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks but with diminishing returns (saturation happens at 4-8 epochs or 50-100K examples in this study). Takeaway 10: Beyond saturation regime, RL primarily increases the probability of sampling high-quality rollouts but does not necessarily improve models' fundamental reasoning capabilities.

  • SFT and RL Data Allocation Trade-offs: Figure 9 investigates data allocation under a constrained budget (100K total samples). ID accuracy (greedy and Pass@16) increases with more SFT data, plateauing beyond 70K SFT examples. Conversely, OOD metrics are driven by RL allocation, peaking at 10K SFT (i.e., 90K RL). This trend holds for both 1B and 4B models.

    Figure 9: Downstream task performance for {1B, 4B}-160BT-8+42BT-{10K, ..., 90K}ep4-{90K, ..., 10K}ep4. Darker green/blue denotes more data allocation to SFT/RL. The total number of post-training samples is fixed at 100K.

Takeaway 11: Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.

6.1.5. Intermediate Checkpoints May Not Be Reliable Surrogates

The paper highlights the issue of using intermediate checkpoints for evaluation. Table 3 compares models fully trained for specific token counts (e.g., 20BT full) against intermediate checkpoints taken from a longer run (e.g., 20BT int. from a 160BT run).

The following are the results from Table 3 of the original paper:

| Model | Upstream Acc. | Math Level 1 (Greedy / Pass@16) | Math Level 2 (Greedy / Pass@16) |
|---|---|---|---|
| 20BT full | 46.43 | 2.75 / 17.85 | 3.36 / 15.10 |
| 20BT int. | 46.07 | 2.52 / 11.44 | 1.90 / 12.64 |
| 40BT full | 49.38 | 2.97 / 17.96 | 3.36 / 14.88 |
| 40BT int. | 49.06 | 1.37 / 9.38 | 2.68 / 8.72 |

Intermediate checkpoints consistently lag behind their dedicated, fully trained counterparts in both upstream accuracy and math reasoning performance. This is attributed to the incomplete optimization trajectory (e.g., incomplete learning rate decay) of intermediate checkpoints. This demonstrates that using mid-course snapshots understates true model capability and can be misleading for studying training dynamics.

6.1.6. Correlating Downstream Task Performance with ORM Score

The study examines the correlation between ORM scores and downstream task accuracy. Figure 10 shows that ORM scores (averaged over 16 samples) exhibit consistently strong predictive power for Maj@16 accuracies, with correlation coefficients ranging from 0.62 to 0.84 across most ID and OOD tasks. The correlation is lower for StrategyQA, possibly because it emphasizes commonsense knowledge, or because the reward model is less well suited to that distribution.

Figure 10: Correlation between accuracy and ORM score across different tasks. Each subplot represents one dataset, where each point corresponds to a model variant. A dashed line indicates the linear trend, and the Pearson correlation coefficient is reported in each title.

This contrasts with validation perplexity (Figure 14), which shows negligible correlation with downstream task accuracy, indicating that low perplexity does not reliably predict enhanced generative reasoning performance in post-trained models.

Figure 14: Correlation between accuracy and validation PPL across different tasks. Each subplot represents one dataset, where each point corresponds to a post-trained model variant. A dashed line indicates the linear trend, and the Pearson correlation coefficient is reported in each title.

Takeaway 12: ORM score could be a more reliable unsupervised validation metric that helps predict downstream task performance during post-training, compared to validation loss (perplexity). Notably, ORM scores from an 8B reward model correlate well with problem-solving accuracies of 1B models on many downstream reasoning tasks.
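
For reference, the Pearson correlation reported in Figure 10 is straightforward to compute once per-model ORM scores and accuracies are collected; the sketch below uses placeholder numbers (illustrative only, not the paper's measurements).

```python
import numpy as np

# Placeholder per-model-variant values; in Figure 10 each point is one model.
orm_scores = np.array([-2.1, -1.5, -0.9, -0.4, 0.2, 0.6])   # mean ORM over 16 samples
maj16_acc = np.array([0.08, 0.11, 0.14, 0.16, 0.19, 0.21])  # Maj@16 accuracy

# Pearson correlation coefficient, as reported in each subplot title.
r = np.corrcoef(orm_scores, maj16_acc)[0, 1]
print(f"Pearson r = {r:.2f}")
```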

6.2. Ablation Studies / Parameter Analysis

The paper's entire experimental design can be viewed as a series of ablation studies and parameter analyses, systematically varying one aspect of the training pipeline while keeping others constant to observe its impact.

  • Pre-training Compute: The variation of pre-training tokens (10BT to 320BT) for 0.5B, 1B, and 4B models (Figure 2, Figure 3, Table 1) directly serves as an ablation study on the effect of pre-training compute. It shows the saturation and diminishing returns, and the interplay with model size.

  • CPT Replay Strategy: The comparison of CPT with no replay, different replay budgets (1.6BT, 8BT, 16BT FineWeb replay in Table 2, and Figure 4) is a direct ablation study on forgetting mitigation strategies during domain adaptation. It identifies an optimal replay ratio (around 5%).

  • CPT Token Budget: Varying domain-specific CPT tokens from 0 to 42BT (Figure 5) is an analysis of how domain adaptation scales and its impact on SFT and RL benefits.

  • SFT Epochs and Dataset Size: Varying SFT epochs (1 to 32, Figure 6) and SFT examples (50K to 400K, Figure 7) constitute ablation studies on SFT compute. These reveal over-specialization and diminishing returns for OOD tasks and the interaction with subsequent RL.

  • RL Epochs and Dataset Size: Varying RL epochs (0 to 32, Figure 8a) and RL examples (0 to 400K, Figure 8b) are ablation studies on RL compute. These highlight saturation points, the degradation of Pass@16 due to increasing response length (Figure 12, Figure 13), and the nature of RL's improvements (confidence boosting vs. fundamental reasoning).

  • SFT/RL Data Allocation: The experiment with fixed total downstream data (100K) split between SFT and RL (Figure 9) is a specific parameter analysis on resource allocation strategy, demonstrating the trade-off between ID gains (SFT-heavy) and OOD generalization (RL-heavy).

  • Intermediate vs. Full Checkpoints: The comparison in Table 3 specifically ablates the training trajectory aspect, showing that an intermediate snapshot of a longer run is not equivalent to a fully optimized shorter run.

    These systematic variations across all stages provide comprehensive insights into how different hyper-parameters and design choices influence the overall performance and characteristics of LMs.

6.3. Observational Comparison of Pre-trained Models

The paper also includes an observational comparison in Appendix A.1 (Table 4) of its pre-trained models against several prominent open-weight LMs on upstream benchmarks.

The following are the results from Table 4 of the original paper:

| Model Name | Tokens | H/S | W/G | PIQA | OBQA | ARC-E | ARC-C | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT 1.3B | 300B | 53.65 | 59.59 | 72.36 | 33.40 | 50.80 | 29.44 | 49.87 |
| Pythia 1B | 300B | 47.16 | 53.43 | 69.21 | 31.40 | 48.99 | 27.05 | 46.21 |
| Pythia 1.4B | 300B | 52.01 | 57.38 | 70.95 | 33.20 | 54.00 | 28.50 | 49.34 |
| TinyLlama 1B | 2T | 61.47 | 59.43 | 73.56 | 36.80 | 55.47 | 32.68 | 53.23 |
| Llama3.2 1B | 9T | 63.66 | 60.46 | 74.54 | 37.00 | 60.48 | 35.75 | 55.31 |
| Qwen3 1.7B | 36T | 60.46 | 61.01 | 72.36 | 36.80 | 69.91 | 43.26 | 57.30 |
| 1B (ours) | 20B | 42.25 | 51.30 | 67.85 | 32.80 | 54.80 | 29.61 | 46.44 |
| 1B (ours) | 40B | 47.53 | 54.62 | 69.59 | 36.20 | 58.08 | 30.29 | 49.38 |
| 1B (ours) | 80B | 51.05 | 53.59 | 70.78 | 37.20 | 62.71 | 35.92 | 51.88 |
| 1B (ours) | 160B | 52.30 | 53.99 | 71.71 | 36.60 | 63.09 | 36.09 | 52.30 |
| 1B (ours) | 320B | 53.86 | 53.51 | 71.93 | 37.20 | 62.29 | 36.18 | 52.49 |
| Pythia 6.9B | 300B | 63.89 | 61.17 | 76.39 | 37.20 | 61.07 | 35.15 | 55.81 |
| OPT 6.7B | 300B | 67.18 | 65.35 | 76.50 | 37.40 | 60.06 | 34.73 | 56.87 |
| Qwen1.5 4B | 3T | 71.45 | 64.09 | 77.10 | 39.60 | 61.41 | 39.51 | 58.86 |
| Qwen2.5 3B | 18T | 73.61 | 68.51 | 78.89 | 42.00 | 73.23 | 47.18 | 63.90 |
| Qwen3 4B | 36T | 73.71 | 70.64 | 77.75 | 41.00 | 76.22 | 51.88 | 65.20 |
| Llama 3.2 3B | 9T | 73.63 | 69.69 | 77.53 | 43.20 | 71.76 | 45.90 | 63.62 |
| 4B (ours) | 80B | 48.84 | 54.38 | 69.91 | 35.80 | 59.68 | 32.68 | 50.22 |
| 4B (ours) | 160B | 56.49 | 55.88 | 72.63 | 40.20 | 66.67 | 39.93 | 55.30 |
| 4B (ours) | 320B | 61.38 | 57.46 | 74.27 | 41.80 | 67.55 | 39.16 | 56.94 |

(H/S = HellaSwag, W/G = WinoGrande, OBQA = OpenBookQA, ARC-E/ARC-C = ARC-Easy/ARC-Challenge; the "ours" rows list EvoLM checkpoints at increasing pre-training token budgets.)

The comparison reveals that EvoLM's 1B and 4B models, trained on far fewer tokens (e.g., 320B tokens for our models vs. 2T for TinyLlama-1B and 3T for Qwen1.5-4B), achieve competitive performance on standard benchmarks. This reinforces the finding of diminishing returns from excessive pre-training: beyond a certain point, additional tokens yield minimal incremental gains on general-domain upstream tasks. For example, our 1B model at 320B tokens (Avg. 52.49%) approaches TinyLlama 1B at 2T tokens (Avg. 53.23%) and surpasses Pythia 1.4B at 300B tokens (Avg. 49.34%), demonstrating its token efficiency.
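
As a quick reading aid for the table, the snippet below recomputes the benchmark averages for two of the rows quoted above and puts their token budgets side by side. The scores are copied directly from Table 4; the script itself is not from the paper.

```python
# Per-benchmark scores (H/S, W/G, PIQA, OBQA, ARC-E, ARC-C, in %) from Table 4.
ours_1b_320b = [53.86, 53.51, 71.93, 37.20, 62.29, 36.18]  # 1B (ours), 320B tokens
tinyllama_1b = [61.47, 59.43, 73.56, 36.80, 55.47, 32.68]  # TinyLlama 1B, 2T tokens

avg_ours = sum(ours_1b_320b) / len(ours_1b_320b)
avg_tiny = sum(tinyllama_1b) / len(tinyllama_1b)

print(f"1B (ours), 320B tokens:  Avg = {avg_ours:.2f}%")   # ~52.5, vs. 52.49 reported
print(f"TinyLlama 1B, 2T tokens: Avg = {avg_tiny:.2f}%")   # ~53.2, vs. 53.23 reported
print(f"~{2000 / 320:.0f}x more pre-training tokens buys "
      f"about {avg_tiny - avg_ours:.1f} points on this average")
```

This kind of gap (roughly six times more pre-training data for under one average point, albeit across different model families) is the concrete face of the diminishing-returns conclusion.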

7. Conclusion & Reflections

7.1. Conclusion Summary

The EvoLM study systematically investigates the training dynamics of language models across their entire lifecycle, from pre-training to continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). By training over 100 1B and 4B parameter models from scratch, the authors reveal critical insights into how design choices at each stage influence upstream language modeling and downstream problem-solving capabilities, including both in-domain (ID) and out-of-domain (OOD) generalization.

Key conclusions drawn are:

  • Diminishing Returns: Excessive pre-training and post-training lead to diminishing returns and can even cause performance degradation, particularly on OOD tasks.
  • Role of CPT: Continued pre-training is crucial for bridging pre-training and post-training. Domain-specific CPT causes catastrophic forgetting, but this can be effectively mitigated by incorporating a small replay budget of general-domain data. Adequate CPT data is essential for SFT and for maximizing the benefits of RL, also improving OOD generalization.
  • SFT and RL Trade-offs: Excessive SFT can lead to over-specialization, improving ID performance at the cost of OOD generalization and limiting subsequent RL improvements. RL with excessive compute (epochs or examples) also yields diminishing returns, primarily boosting the confidence of existing correct outputs rather than enhancing fundamental reasoning capabilities. Under data constraints, there's a clear trade-off in allocating data between SFT (for ID gains) and RL (for OOD generalization).
  • Evaluation Reliability: Intermediate checkpoints are unreliable surrogates for fully trained models. Outcome Reward Model (ORM) scores are shown to be a more reliable unsupervised validation metric for downstream task performance than validation perplexity.
  • Open Research: The release of EvoLM (models, datasets, and pipeline) promotes transparency, reproducibility, and open research in understanding LM training dynamics.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Model Scale: The study primarily focused on models up to 4B parameters. Future research should investigate whether the observed trends generalize to larger models and identify hyperparameters better suited to them.
  • Post-training Objectives: The current study focused on reasoning-centric post-training objectives. Future work could explore other critical objectives like safety alignment, instruction-following, tool-calling, and coding tasks.
  • RL Methods: The RL experiments exclusively used Proximal Policy Optimization (PPO) with verifiable rewards. Exploring alternative reinforcement learning methods could offer broader insights into their effects on downstream capabilities.

7.3. Personal Insights & Critique

This paper provides an invaluable, systematic, and highly transparent investigation into the complex multi-stage training of modern language models. Its emphasis on training models from scratch with full control is a significant strength, addressing a major gap in the literature where many studies rely on opaque, publicly available checkpoints. The EvoLM suite itself is a substantial contribution to open science.

One particularly insightful finding is the quantification of diminishing returns, and even performance degradation, from excessive pre-training and post-training. This challenges the common intuition that "more is always better" and provides concrete evidence for optimal resource allocation. The distinction between RL merely increasing the probability of already-correct outputs (a higher correct ratio) and genuinely improving fundamental reasoning capabilities (reflected in Pass@16 degradation) is also crucial for practitioners configuring complex training pipelines.

The demonstration that intermediate checkpoints are not reliable surrogates for fully trained smaller models is a critical methodological insight that could prevent misleading conclusions in future research. The finding that ORM scores are better predictors of downstream task performance than validation perplexity is also highly practical, offering a more effective way to monitor and validate model quality during post-training, especially in data-scarce scenarios.

Critically, the paper's findings regarding the interplay between CPT, catastrophic forgetting, and replay strategies offer actionable guidance for adapting LMs to new domains without sacrificing general capabilities. The granular analysis of SFT and RL trade-offs for ID versus OOD performance under data constraints highlights the importance of strategic data allocation, which is vital for efficient model development.

While the study is comprehensive, its focus on decoder-only models and a specific set of reasoning tasks means that the generalizability of some findings may require further validation for encoder-decoder models or other task domains (e.g., creative writing, summarization). The specific compute-optimal ratios and saturation points observed are inherently tied to the LLaMA-2 architecture, the FineWeb-Edu dataset, and the chosen evaluation tasks. Future work could explore how these optima shift with different architectures or pre-training data compositions.

Overall, EvoLM serves as a robust foundation for understanding the intricate dance of modern LM training. Its methodologies and insights can be broadly applied to improve the efficiency, effectiveness, and interpretability of large language models across various applications, contributing significantly to the responsible development of AI.
