
Debunk the Myth of SFT Generalization

Keywords: Supervised Fine-Tuning Generalization, Prompt Diversity Training, Chain-of-Thought Supervision, Generalization Evaluation on Decision-Making Tasks, Comparison between SFT and Reinforcement Learning

TL;DR Summary

This paper challenges the notion of SFT's inherent generalization failure. It shows that using prompt diversity and Chain-of-Thought supervision enables SFT models to generalize robustly to varied instructions and harder, out-of-distribution tasks, matching or exceeding RL baselines.

Abstract

A prevailing view holds that supervised fine-tuning (SFT) memorizes training data and fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We revisit this claim through a systematic evaluation on two decision-making benchmarks, Sokoban and General Points, and arrive at a different conclusion. We show that much of SFT's perceived failure stems from frozen-prompt artifacts: when trained on fixed instruction templates, SFT models cling to training semantics rather than adapting to new ones. Introducing prompt diversity during training breaks this shortcut and yields strong generalization to unseen instruction variants without harming in-distribution performance. Beyond instruction shifts, we ask whether SFT can generalize to strictly harder tasks. Here, chain-of-thought (CoT) supervision provides an algorithmic scaffold that markedly improves transfer to more difficult regimes, such as larger Sokoban grids with additional boxes and arithmetic with out-of-distribution values or five-card compositions that increase combinatorial complexity. Finally, combining prompt diversity with CoT achieves the best of both worlds: robust generalization across both instruction-variant and difficulty-variant settings, matching or surpassing RL baselines on our benchmarks while retaining SFT's simplicity and stability. These findings challenge the narrative that SFT is inherently inferior to RL and support a data-centric perspective: with appropriately curated demonstrations, vanilla SFT can generalize as strongly as RL. Code reproducing the results in the paper can be found at: https://github.com/XiaofengLin7/debunking-sft-generalization.

English Analysis

1. Bibliographic Information

  • Title: Debunk the Myth of SFT Generalization
  • Authors:
    • Xiaofeng Lin (Boston University, Boston, MA)
    • Hejian Sang (LinkedIn, Sunnyvale, CA)
    • Zhipeng Wang (LinkedIn, Sunnyvale, CA)
    • Xuezhou Zhang (Boston University, Boston, MA)
  • Journal/Conference: This paper is a preprint available on arXiv. Preprints are research articles shared publicly before or during the formal peer-review process.
  • Publication Year: 2025 (based on the arXiv identifier 2510.00237, which indicates an October 2025 submission).
  • Abstract: The paper challenges the common belief that Supervised Fine-Tuning (SFT) leads to models that merely memorize training data and fail to generalize, unlike Reinforcement Learning (RL) which is thought to achieve better robustness. Through experiments on two decision-making tasks, Sokoban and General Points, the authors argue that SFT's perceived failures are often due to "frozen-prompt artifacts"—models overfit to fixed instruction templates. They show that introducing prompt diversity during training enables strong generalization to new instructions. To tackle generalization to harder tasks, they use Chain-of-Thought (CoT) supervision. By combining prompt diversity and CoT, they demonstrate that a simple SFT approach can match or even outperform RL baselines in generalization, highlighting the critical role of data curation over algorithmic complexity.
  • Original Source Link: https://arxiv.org/abs/2510.00237

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: A prevailing view in the AI community is that Supervised Fine-Tuning (SFT), a popular and efficient method for adapting Large Language Models (LLMs), is fundamentally flawed. It is believed to encourage models to memorize the training data's surface patterns, leading to poor performance on tasks that deviate even slightly from the training format (i.e., poor out-of-distribution generalization). In contrast, Reinforcement Learning (RL) is considered superior for achieving robust, generalizable capabilities.
    • Why It Matters: SFT is significantly simpler, more stable, and more cost-effective than RL. If its generalization limits are not intrinsic but rather a product of how it's used, then practitioners could achieve strong model performance without resorting to the complexities of RL.
    • Fresh Angle: Instead of proposing a new, more complex algorithm to "fix" SFT, this paper takes a data-centric perspective. It hypothesizes that the problem isn't the SFT algorithm itself (the maximum-likelihood objective) but the data it is trained on.
  • Main Contributions / Findings (What):

    1. Identifies the "Frozen-Prompt Artifact": The paper provides strong evidence that SFT's failure on new instructions is due to models learning a shortcut. When trained on prompts with a fixed structure and wording, the model learns to ignore the instructional part and focuses only on the variable parts (like the game state), leading to failure when instructions change.
    2. Demonstrates the Efficacy of Prompt Diversity: By simply diversifying the instruction templates during training (e.g., using random words for actions in a game), the model is forced to actually interpret the instructions, leading to a dramatic improvement in generalization to unseen instruction variants.
    3. Shows CoT Enables Difficulty Generalization: The paper shows that to generalize to strictly harder problems (e.g., larger game boards, more complex arithmetic), providing Chain-of-Thought (CoT) demonstrations is highly effective. CoT gives the model an "algorithmic scaffold" to learn the underlying problem-solving process, not just the final answer.
    4. Reframes the SFT vs. RL Trade-off: By combining prompt diversity and CoT, the authors create a purely supervised training recipe that produces models that generalize as well as or better than RL-tuned models on their benchmarks. This challenges the narrative that SFT is inherently inferior and suggests that with high-quality, diverse data, SFT is a powerful and practical tool.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Supervised Fine-Tuning (SFT): A method to adapt a pre-trained language model for specific tasks. The model is trained on a dataset of high-quality examples, typically (prompt, ideal_response) pairs. The training objective is to maximize the probability of generating the ideal response given the prompt, using a standard cross-entropy loss. It is essentially teaching the model to imitate expert demonstrations.
    • Reinforcement Learning (RL) Fine-Tuning: An alternative training method where the model, or "agent," learns by interacting with an environment. Instead of imitating a perfect response, it generates its own responses and receives a numerical "reward" indicating how good they were. The objective is to learn a policy that maximizes this cumulative reward. PPO (Proximal Policy Optimization) is a popular RL algorithm for this purpose.
    • Chain-of-Thought (CoT): A technique where the model is prompted or trained not just with the final answer but also with the intermediate reasoning steps required to reach it. For example, instead of (Question: 5+8=?, Answer: 13), a CoT example would be (Question: 5+8=?, Answer: Let's break this down. 5 plus 8 is 13. So the answer is 13). This helps models learn more robust reasoning procedures (a sketch contrasting answer-only and CoT training pairs appears after this list).
    • In-Distribution (ID) vs. Out-of-Distribution (OOD): ID data is similar to the data a model was trained on. OOD data differs in some significant way, such as having different instructions, being more complex, or covering new topics. A model's ability to perform well on OOD data is a key measure of its generalization capability.
  • Previous Works: The paper positions itself against a body of literature that reports SFT's limitations. Prior studies ([2], [7], [13]) consistently found that SFT models overfit to training prompts and perform poorly on OOD tasks, while RL models demonstrate better robustness and forget less of their pre-trained knowledge. Other works ([8], [10], [17], [21]) accepted SFT's limitations and proposed algorithmic modifications—like reweighting the training data or adding regularization terms to the loss function—to improve its generalization.

  • Differentiation: This paper's key innovation is its argument that these algorithmic fixes may be unnecessary. It systematically diagnoses the cause of SFT's failure as a data artifact (frozen prompts) and proposes a data-centric solution. It shows that by improving the training data with prompt diversity and CoT, the vanilla SFT algorithm can achieve the strong generalization previously thought to require RL or complex SFT modifications.
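
To make the distinction between answer-only and CoT supervision concrete, here is a minimal illustrative sketch of the two kinds of (prompt, response) training pairs for a General-Points-style hand. The prompt wording, field names, and example cards are hypothetical, not taken from the paper's dataset.

```python
# Illustrative sketch only: hypothetical prompt wording and example hand,
# not copied from the paper's training data.

# Answer-only demonstration: the target is just the final equation.
answer_only_example = {
    "prompt": (
        "Cards: 3, 5, 6, K. Face cards J, Q, K count as 10. "
        "Use each card exactly once with +, -, *, / to reach 24."
    ),
    "response": "3 + 5 + 6 + 10 = 24",
}

# CoT demonstration: the target also spells out the intermediate reasoning,
# giving the model an algorithmic scaffold rather than only the answer.
cot_example = {
    "prompt": answer_only_example["prompt"],
    "response": (
        "K counts as 10, so the values are 3, 5, 6, 10. "
        "3 + 5 = 8, 8 + 6 = 14, and 14 + 10 = 24. "
        "Final equation: 3 + 5 + 6 + 10 = 24."
    ),
}
```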

4. Methodology (Core Technology & Implementation)

The paper's methodology does not introduce a new algorithm but rather investigates the impact of different data curation strategies on existing algorithms. The core methods evaluated are SFT and RL.

  • Supervised Fine-Tuning (SFT): The standard SFT objective is to minimize the negative log-likelihood (NLL) of the expert response $y^*$ given a prompt $x$: $\mathcal{L}_{\mathrm{SFT}}(\theta) = \mathbb{E}_{(x, y^*) \sim \mathcal{D}} \big[ -\log \pi_\theta(y^* \mid x) \big]$

    • $\theta$: The parameters of the language model.
    • $\pi_\theta(y^* \mid x)$: The probability assigned by the model to the expert response $y^*$ for the prompt $x$.
    • $\mathcal{D}$: The dataset of expert (prompt, response) pairs.
    • This objective trains the model to imitate the provided "correct" answers.
  • Reinforcement Learning (RL) Fine-Tuning: The RL objective is to maximize the expected reward of the model's generated responses: $\mathcal{L}_{\mathrm{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]$

    • $y \sim \pi_\theta(\cdot \mid x)$: The model generates its own response $y$ by sampling from its output distribution.
    • $r(x, y)$: A reward function that scores how good the response $y$ is.
    • The paper uses Group Relative Policy Optimization (GRPO), a variant of PPO. For each prompt, it samples a group of responses, calculates their rewards, and computes a normalized "advantage" for each response relative to the group's average reward. This advantage signal is then used to update the model (a sketch of the group-relative advantage computation appears after this list).
  • Data Curation Strategies (The Paper's Core Contribution):

    1. Prompt Diversity: To combat the "frozen-prompt artifact," the instructions in the training prompts are varied.
      • For Sokoban, instead of always using "Up, Down, Left, Right," they sample random words for each action in every training example and explicitly state the mapping in the prompt (e.g., "Use 'apple' for Up, 'banana' for Down...").
      • For General Points, instead of always mapping face cards J, Q, K to 10, they train on various mappings (e.g., J=Q=K=8, or J=7, Q=8, K=9) and state the current mapping in the prompt.
      • Principle: This forces the model to treat the instructions as a dynamic part of the input that must be read and interpreted, rather than a static template that can be ignored (a prompt-construction sketch appears after this list).
    2. Chain-of-Thought (CoT) Demonstrations: To improve generalization to harder problems, the training data includes step-by-step reasoning.
      • Instead of just providing the final action (Sokoban) or equation (General Points), they provide a "thought" process that leads to the answer.
      • Generation: They used a strong, RL-finetuned Qwen3-8B model to generate multiple candidate reasoning traces for each problem. They then used rejection sampling to filter out incorrect traces, resulting in a high-quality dataset of correct reasoning paths.
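
As a rough illustration of GRPO's group-relative advantage described above, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its group. This is a simplified reading, not the paper's implementation: variable names are invented, and the PPO-style clipped policy-gradient update that consumes these advantages is omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages for one prompt.

    rewards: shape (group_size,), one scalar reward per response sampled
    for the same prompt. Each advantage is the reward's deviation from the
    group mean, scaled by the group standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: four responses sampled for one prompt and scored by r(x, y).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
# Above-average responses receive positive advantages (reinforced);
# below-average responses receive negative advantages (suppressed).
```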

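A minimal sketch of the prompt-diversity recipe for Sokoban, under the assumption that each training example samples fresh action words and states the mapping in the instructions. The word pool, template wording, and function names below are invented for illustration; they are not the paper's exact prompts.

```python
import random

# Hypothetical pool of replacement action words (illustration only).
WORD_POOL = ["apple", "banana", "cedar", "delta", "ember", "flint", "grove", "haze"]
ACTIONS = ["Up", "Down", "Left", "Right"]

def make_diverse_sokoban_prompt(board_text: str, rng: random.Random):
    """Build a Sokoban prompt with a freshly sampled action vocabulary.

    Returns the prompt string and the word->action mapping, so that the
    demonstration's target moves can be written in the sampled words.
    """
    words = rng.sample(WORD_POOL, k=len(ACTIONS))
    mapping = dict(zip(words, ACTIONS))
    legend = ", ".join(f"'{w}' means {a}" for w, a in mapping.items())
    prompt = (
        f"You are playing Sokoban on the board below.\n{board_text}\n"
        f"Respond only with moves from this vocabulary: {legend}."
    )
    return prompt, mapping

# Each training example gets its own mapping, so the model must read the
# legend instead of memorizing a fixed action vocabulary.
rng = random.Random(0)
prompt, mapping = make_diverse_sokoban_prompt("######\n#@ $.#\n######", rng)
```
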
5. Experimental Setup

  • Datasets & Tasks: The evaluation is conducted on two decision-making benchmarks designed to test generalization across instruction and difficulty variations.

    • Sokoban: A puzzle game requiring multi-step planning.
      • Training: Performed only on a $6 \times 6$ grid with one box and standard action names ("up", "down", etc.).
      • Instruction Variants (OOD): Evaluating on different action mappings (e.g., 1 for up, A for up, * for up).
      • Difficulty Variants (OOD): Evaluating on harder puzzles: a larger $10 \times 10$ grid or puzzles with two boxes.
    • General Points: An arithmetic reasoning task similar to the game "24".
      • Training: Performed using four cards where face cards J, Q, K are always worth 10.
      • Instruction Variants (OOD): Evaluating on different mappings for face cards (e.g., J=Q=K=5).
      • Difficulty Variants (OOD): Evaluating on harder problems with out-of-distribution numbers (14-19) or five cards instead of four.
  • Evaluation Metrics:

    • Success Rate: The primary metric, representing the fraction of tasks solved correctly.
    • Instruction Validity: A diagnostic metric to check if the model is following the given instructions, even if the final answer is wrong. For Sokoban, this means using tokens from the allowed action set. For General Points, it means using the correct numerical values for face cards (a minimal validity-check sketch appears after this list).
  • Baselines:

    • Answer-only SFT (Ans.): Standard SFT, representing the method criticized in prior work.
    • RL (warm-started): An SFT model further fine-tuned with RL (GRPO). This represents the supposedly more robust, state-of-the-art approach.
    • The experiments were run on two base models: Qwen (likely Qwen2.5-7B) and Llama (Llama-3.1-8B-Instruct).
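
To make the instruction-validity metric concrete, here is a rough sketch of the check it implies for Sokoban: every action token the model emits must belong to the action vocabulary specified in the prompt. The function name and parsing rule are hypothetical; the paper's actual validity criterion may differ in detail.

```python
def sokoban_actions_are_valid(model_output: str, allowed_actions: set) -> bool:
    """True if every emitted action token is in the allowed action set.

    This measures instruction-following only; the moves may still fail to
    solve the puzzle, which is what the success-rate metric captures.
    """
    tokens = model_output.replace(",", " ").split()
    return len(tokens) > 0 and all(tok in allowed_actions for tok in tokens)

# Toy usage with a numerical action variant ("1" = Up, "2" = Down, ...).
sokoban_actions_are_valid("1 1 3 2", {"1", "2", "3", "4"})      # True
sokoban_actions_are_valid("up up left", {"1", "2", "3", "4"})   # False: ignores the new mapping
```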

6. Results & Analysis

The paper's results systematically build a case for its data-centric thesis.

  • Core Results:

    1. SFT Fails on Instruction Variants due to Shortcut Learning:

    • As shown in Image 1, standard SFT (orange and blue lines in the first two columns) achieves high success on in-distribution tasks but completely fails on instruction variants, with accuracy dropping to near zero.

    • Crucially, in the third column (Fake), where the prompt gives new instructions but the environment secretly expects the old ones, the SFT model performs very well. This is the smoking gun for the frozen-prompt hypothesis: the model has learned to ignore the instructions and stick to the memorized training-time action vocabulary.

      (Figure: Success rates of SFT on Sokoban and General Points. Columns, left to right: in-distribution performance; instruction-variant performance; performance on the fake environment.)

    2. Instruction Validity Plummets During Training:

    • Image 2 confirms this diagnosis. The plots show that the model's ability to follow the new instructions (action_is_valid) is initially present but rapidly decays during SFT training, as the model overfits to the single training template.

      (Figure: Instruction-following validity during SFT training on two instruction variants. Left: SimpleSokobanNumerical. Right: General Points (all_5).)

    3. Prompt Diversity Solves Instruction Generalization:

    • The results in Table 1 (in the paper) are decisive. The Diver. + Ans. method (SFT with prompt diversity) achieves very high success rates on all instruction variants (Alpha., Num., Rand. for Sokoban; All-5, All-7 for General Points), a massive improvement over standard SFT.

    • Simultaneously, its success on the Fake environment drops to zero, proving it has stopped relying on the memorized shortcut and is now correctly interpreting the instructions in the prompt.

    4. CoT Solves Difficulty Generalization:

    • Table 1 also shows that standard SFT (Ans.) struggles with harder tasks (e.g., TwoBoxes, Complex Sokoban; Large, Five Cards General Points).
    • The CoT method significantly boosts performance on these difficulty variants. For instance, on TwoBoxesSokoban, Qwen improves from 0.35 to 0.57 success rate. This demonstrates that learning the reasoning process enables transfer to more complex problems.

    5. Diversity + CoT: The Best of Both Worlds:

    • The Diver. + CoT method consistently achieves the highest scores across almost all settings. It robustly handles both instruction variants and difficulty variants, often matching or outperforming the RL (warm) baseline.
    • For example, on Sokoban, Diver. + CoT on Qwen achieves near-perfect scores (0.97-1.0) on instruction variants and the strongest scores on difficulty variants (e.g., 0.4 on Complex), surpassing the RL baseline (0.1). This shows that a purely supervised, data-centric approach can be superior to RL on these tasks.
  • Ablations / Parameter Sensitivity:

    • In Appendix C, the authors test an alternative hypothesis: perhaps SFT's problem is that it moves too far away from the base model's parameters. They test this by adding regularization (KL divergence or L2 distance) to keep the SFT model "close" to the base model.
    • The results (Tables 6 and 7) show that while this regularization provides a minor benefit for instruction generalization, it harms performance on both in-distribution and harder difficulty variants. This suggests that simply constraining the model is not the right solution; the model needs to change significantly to learn complex tasks, and CoT provides the right learning signal to do so effectively, while prompt diversity provides the signal for instruction-following.
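
For reference, the regularized variant tested in this ablation can be sketched as the standard SFT loss plus a penalty keeping the fine-tuned policy close to the frozen base model. The snippet below is a schematic token-level version assuming per-token logits from both models; it is not the authors' code, and the weight `beta` is a placeholder.

```python
import torch
import torch.nn.functional as F

def sft_loss_with_kl_to_base(policy_logits: torch.Tensor,
                             base_logits: torch.Tensor,
                             target_ids: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Schematic SFT objective with a KL penalty toward the frozen base model.

    policy_logits, base_logits: (seq_len, vocab_size) logits on response tokens
    target_ids: (seq_len,) expert response token ids
    beta: placeholder regularization strength
    """
    # Standard SFT term: negative log-likelihood of the expert response.
    nll = F.cross_entropy(policy_logits, target_ids)

    # KL(policy || base), averaged over response positions, keeps the
    # fine-tuned distribution near the base model's.
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    kl = F.kl_div(base_logprobs, policy_logprobs, log_target=True, reduction="batchmean")

    return nll + beta * kl
```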

7. Conclusion & Reflections

  • Conclusion Summary: The paper convincingly argues that the "SFT memorizes, RL generalizes" narrative is an oversimplification. Much of SFT's reputed failure to generalize stems from poor data design, specifically the use of fixed, "frozen" prompts. The authors demonstrate a simple, effective, and purely supervised recipe for achieving strong generalization with vanilla SFT:

    1. Use prompt diversity to teach the model to follow instructions.
    2. Use Chain-of-Thought (CoT) supervision to teach the model to solve harder problems.
    This data-centric approach matches or surpasses complex RL methods on the tested benchmarks, while retaining the simplicity, stability, and efficiency of SFT.
  • Limitations & Future Work:

    • The study is limited to two structured, decision-making tasks (Sokoban and General Points) and two base models. Further research is needed to see if these findings hold for more open-ended, creative tasks (e.g., writing poetry, dialogue) and across a wider range of model architectures and sizes.
    • Future work could explore combining the stability of this SFT approach with the reward-optimization capabilities of RL in a more efficient and stable hybrid method.
  • Personal Insights & Critique:

    • This is a strong, practical paper that delivers a clear and actionable insight. Its focus on data quality over algorithmic complexity is a valuable contribution and a refreshing perspective in a field often dominated by the pursuit of more complex models and training algorithms.
    • The "frozen-prompt hypothesis" and the "fake environment" experiment used to test it are elegant and provide compelling evidence.
    • A potential critique is the reliance on a powerful, RL-tuned "teacher" model (Qwen3-8B) to generate the CoT data. This implies that at some point in the ecosystem, a strong (and potentially RL-trained) model is needed to bootstrap the high-quality data required to make SFT generalize well. This doesn't invalidate the findings but highlights the practical challenge of sourcing high-quality, procedural data at scale.
    • Overall, the paper successfully reframes the SFT vs. RL debate, shifting the focus towards the critical importance of data curation. It provides a powerful reminder that the simplest methods, when paired with the right data, can be incredibly effective.
