- As shown in Image 1, standard SFT (orange and blue lines in the first two columns) achieves high success on in-distribution tasks but completely fails on instruction variants, with accuracy dropping to near zero.
- Crucially, in the third column (`Fake`), where the prompt gives new instructions but the environment secretly expects the old ones, the SFT model performs very well. This is the smoking gun for the frozen-prompt hypothesis: the model has learned to ignore the instructions and stick to the memorized training-time action vocabulary.
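A minimal sketch of what such a `Fake` evaluation could look like, assuming a Sokoban-style action set; the vocabularies and function names below are illustrative assumptions, not the paper's code:

```python
# Illustrative "Fake" evaluation: the prompt advertises a new action
# vocabulary, but the environment still scores actions against the
# ORIGINAL training-time vocabulary. Action names are assumptions.

ORIGINAL_ACTIONS = ["up", "down", "left", "right"]      # used during SFT
NEW_ACTIONS      = ["north", "south", "west", "east"]   # advertised in the prompt

def build_fake_prompt(observation: str) -> str:
    # The instructions tell the model to use the NEW vocabulary ...
    return (
        f"Valid actions: {', '.join(NEW_ACTIONS)}.\n"
        f"Observation: {observation}\nAction:"
    )

def fake_env_accepts(action: str) -> bool:
    # ... but the environment secretly only accepts the OLD vocabulary.
    return action.strip().lower() in ORIGINAL_ACTIONS

# A model that ignores the prompt and replays its memorized SFT vocabulary
# scores highly here; a model that actually reads the instructions does not.
```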

2. Instruction Validity Plummets During Training:
- Image 2 confirms this diagnosis. The plots show that the model's ability to follow the new instructions (`action_is_valid`) is initially present but rapidly decays during SFT training, as the model overfits to the single training template.
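One way to read the `action_is_valid` curves is as a simple membership check: does the emitted action belong to the vocabulary named in the current prompt? A hedged sketch, where the variant vocabularies and parsing are assumptions:

```python
# Sketch of an action_is_valid-style check. The variant vocabularies below
# are illustrative assumptions, not the paper's exact definitions.

VARIANT_VOCABS = {
    "original": {"up", "down", "left", "right"},
    "alpha":    {"a", "b", "c", "d"},
    "num":      {"1", "2", "3", "4"},
}

def action_is_valid(model_output: str, variant: str) -> bool:
    tokens = model_output.strip().split()
    action = tokens[0].lower() if tokens else ""
    return action in VARIANT_VOCABS[variant]

# Averaged over an evaluation batch, this fraction is what the plots track:
# it is high early in SFT and decays as the model overfits to the single
# training template.
```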

3. Prompt Diversity Solves Instruction Generalization:
- The results in Table 1 (in the paper) are decisive. The `Diver. + Ans.` method (SFT with prompt diversity) achieves very high success rates on all instruction variants (`Alpha.`, `Num.`, `Rand.` for Sokoban; `All-5`, `All-7` for General Points), a massive improvement over standard SFT (a construction sketch follows below this list).
- Simultaneously, its success on the `Fake` environment drops to zero, proving it has stopped relying on the memorized shortcut and is now correctly interpreting the instructions in the prompt.
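A minimal sketch of the prompt-diversity component of such a method (the `Ans.` part is not modeled here; the vocabulary pool and field names are assumptions): each SFT example is rewritten with a randomly sampled action vocabulary and its target action is remapped to match, so the model can only be right by reading the instructions.

```python
import random

# Sketch of prompt-diversified SFT data construction. Vocabulary choices
# and field names are illustrative assumptions.

BASE_ACTIONS = ["up", "down", "left", "right"]
VOCAB_POOL = [
    ["up", "down", "left", "right"],
    ["a", "b", "c", "d"],
    ["1", "2", "3", "4"],
    ["north", "south", "west", "east"],
]

def diversify(example: dict) -> dict:
    vocab = random.choice(VOCAB_POOL)
    mapping = dict(zip(BASE_ACTIONS, vocab))
    prompt = (
        f"Valid actions: {', '.join(vocab)}.\n"
        f"Observation: {example['observation']}\nAction:"
    )
    return {"prompt": prompt, "target": mapping[example["target"]]}

# The same underlying move is labeled differently depending on the sampled
# vocabulary, which removes the fixed prompt-to-action shortcut that plain
# SFT memorizes.
print(diversify({"observation": "box to the left of goal", "target": "left"}))
```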