Ablations / Parameter Sensitivity:
1. Contribution of Hierarchy and Training Stages (Figure 3):
This figure is a bar chart comparing ablation performance of different model architectures on unseen tasks in ScienceWorld. Solid bars denote the hierarchical models, shaded bars denote the variants with the hierarchy removed, and purple/yellow/green correspond to the SFT, ORL, and SFT+ORL training stages, respectively.
This figure analyzes the impact of GLIDER's key components on unseen tasks in ScienceWorld.
- Hierarchical vs. Non-Hierarchical: In every comparison, the hierarchical models (solid bars) dramatically outperform their non-hierarchical counterparts (shaded bars). This is the clearest evidence that the "divide-and-conquer" strategy is the main driver of performance.
- Training Stages:
- The full SFT+ORL pipeline (green bars) consistently yields the best results. This shows that starting with imitation learning and then refining with RL is the most effective strategy.
- ORL-only (yellow bars) performs better than SFT-only (purple bars). This suggests that reinforcement learning is more powerful than simple imitation, as it allows the agent to learn from sub-optimal data and explore (a minimal sketch of the two-stage pipeline follows this list).
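To make the two-stage idea concrete, here is a minimal sketch: SFT is plain behavior cloning on expert actions, while the offline-RL stage reweights actions from mixed-quality trajectories by their estimated advantage. The toy policy, tensor shapes, and the AWAC-style advantage-weighted loss are illustrative assumptions, not GLIDER's published objective.

```python
# Minimal sketch of an SFT -> offline-RL pipeline (assumed, not GLIDER's exact method).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_ACTIONS = 32, 8

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stage 1 - SFT: plain behavior cloning on expert (state, action) pairs.
def sft_step(states, expert_actions):
    loss = F.cross_entropy(policy(states), expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 2 - ORL: advantage-weighted regression on mixed-quality trajectories,
# so sub-optimal actions still contribute, in proportion to exp(A / beta).
def orl_step(states, actions, advantages, beta=1.0):
    log_probs = F.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    weights = torch.exp(advantages / beta).clamp(max=20.0)  # clip for stability
    loss = -(weights * chosen).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random data, just to show the two stages running in sequence.
states = torch.randn(64, STATE_DIM)
expert_actions = torch.randint(0, NUM_ACTIONS, (64,))
advantages = torch.randn(64)
sft_step(states, expert_actions)               # imitation first
orl_step(states, expert_actions, advantages)   # then RL refinement
```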
2. Impact of Model Scale (Table 2):
Manual transcription of Table 2:

| Model | w/o Hier: SFT | w/o Hier: ORL | w/o Hier: SFT+ORL | w/ Hier: SFT | w/ Hier: ORL | w/ Hier: SFT+ORL |
|---|---|---|---|---|---|---|
| Llama-1B | 37.24 | 45.31 | 48.48 | 44.50 | 50.43 | 53.62 |
| Llama-3B | 38.19 | 52.47 | 56.93 | 48.11 | 55.98 | 61.29 |
| Llama-8B | 41.88 | 50.16 | 53.94 | 50.17 | 57.12 | 68.34 |
- Analysis: The benefits of the hierarchical structure (w/ Hier) and the full SFT+ORL pipeline hold across model sizes (1B, 3B, 8B). Interestingly, the hierarchical Llama-3B model (61.29) outperforms even larger non-hierarchical models (e.g., the non-hierarchical Llama-8B at 53.94). This highlights the efficiency of GLIDER's architecture: a better structure can matter more than simply increasing model size.
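To re-check the scale comparison, the SFT+ORL columns of Table 2 can be dropped into a small DataFrame; the numbers are copied from the transcription above and the column names are mine.

```python
# Table 2 (SFT+ORL columns) re-typed so the structure-vs-scale trade-off can be verified.
import pandas as pd

scores = pd.DataFrame(
    {
        "Model": ["Llama-1B", "Llama-3B", "Llama-8B"],
        "w/o Hier SFT+ORL": [48.48, 56.93, 53.94],
        "w/ Hier SFT+ORL": [53.62, 61.29, 68.34],
    }
).set_index("Model")

# Gain from adding the hierarchy at each model size.
scores["Hier gain"] = scores["w/ Hier SFT+ORL"] - scores["w/o Hier SFT+ORL"]
print(scores)

# The hierarchical Llama-3B (61.29) beats the non-hierarchical Llama-8B (53.94).
print(scores.loc["Llama-3B", "w/ Hier SFT+ORL"] > scores.loc["Llama-8B", "w/o Hier SFT+ORL"])
```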
3. Generalization via Online Fine-tuning (Figure 4):
This figure is a chart showing GLIDER's online fine-tuning performance (score out of 100) on the ScienceWorld benchmark compared with the AWAC and AC baselines. On the three tasks (test-conductivity, find-animal, boil), GLIDER significantly outperforms the other methods, and its test score rises markedly as the number of training steps increases.
This experiment tests how well a pre-trained GLIDER agent adapts to a completely new task with online fine-tuning.
- Zero-shot Generalization: At step 0 (before any online fine-tuning), GLIDER's score is already much higher than the baselines (AC and AWAC). This indicates that the pre-trained agent has better "zero-shot" knowledge transfer to new tasks.
- Fast Adaptation: During online fine-tuning, GLIDER's performance curve rises much more steeply and reaches a higher final score than the baselines. This confirms that freezing the low-level skills and only tuning the high-level planner is a highly effective and efficient strategy for adapting to new environments.
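A minimal sketch of this recipe, assuming a PyTorch-style module with separate planner and skill heads (module names and sizes are illustrative, not GLIDER's actual parameterization): freeze the low-level skill parameters and hand only the planner's parameters to the optimizer.

```python
# Sketch of "freeze the low-level skills, fine-tune only the high-level planner".
import torch
import torch.nn as nn

class HierarchicalAgent(nn.Module):
    def __init__(self, state_dim=32, num_subgoals=6, num_actions=8):
        super().__init__()
        self.planner = nn.Linear(state_dim, num_subgoals)                # high-level planner
        self.skills = nn.Linear(state_dim + num_subgoals, num_actions)   # low-level skills

    def forward(self, state):
        subgoal = torch.softmax(self.planner(state), dim=-1)
        return self.skills(torch.cat([state, subgoal], dim=-1))

agent = HierarchicalAgent()
agent(torch.randn(4, 32))  # sanity-check forward pass

# Freeze the low-level skill policy; only the planner receives gradients.
for p in agent.skills.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(
    [p for p in agent.parameters() if p.requires_grad], lr=1e-4
)
print(sum(p.numel() for p in agent.parameters() if p.requires_grad), "trainable params")
```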
4. Impact of Data Mixture Ratios (Figure 5):
This figure is a chart comparing performance on ScienceWorld under different mixture ratios of expert and medium-quality data, using Llama-3-8B as the LLM backbone. The solid red line represents the hierarchical policy (w/ Hier) and the dashed green line the non-hierarchical policy (w/o Hier); performance varies noticeably with the mixture ratio.
This figure explores how the mix of expert and medium-quality data affects performance.
- Mixture is Best: The best performance is achieved with a mixture of expert and medium data (peaking at a 1:2 ratio).
- Diversity over Perfection: Interestingly, training only on medium-quality data (score ~36.0) yields better results than training only on expert data (score ~29.7). This suggests that the diversity and broader state-space coverage of the medium-quality data are more valuable for learning a generalizable policy than a smaller set of exclusively expert trajectories. This strongly motivates the use of RL, which is designed to learn from such imperfect data (a mixing sketch follows).
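As a rough illustration of how such a mixture could be assembled, the helper below subsamples trajectories to a 1:2 expert-to-medium ratio and shuffles them together; the function and placeholder trajectories are hypothetical, not the paper's data pipeline.

```python
# Hypothetical helper for building an expert/medium data mixture at a fixed ratio.
import random

def mix_trajectories(expert, medium, ratio=(1, 2), seed=0):
    """Keep ratio[0] expert trajectories for every ratio[1] medium ones, then shuffle."""
    rng = random.Random(seed)
    n_units = min(len(expert) // ratio[0], len(medium) // ratio[1])
    mixed = expert[: n_units * ratio[0]] + medium[: n_units * ratio[1]]
    rng.shuffle(mixed)
    return mixed

# Placeholder trajectories, just to show the 1:2 ratio that peaks in Figure 5.
expert_trajs = [f"expert_{i}" for i in range(100)]
medium_trajs = [f"medium_{i}" for i in range(300)]
dataset = mix_trajectories(expert_trajs, medium_trajs, ratio=(1, 2))
print(len(dataset), dataset[:5])
```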