Core Results:
1. Direct Alignment Quality (MT-bench):
The following is a transcription of Table 1, showing MT-bench scores.
| Model | Size | UF | SHP |
|---|---|---|---|
| Gemma-SFT | 2B | 4.73 | 4.73 |
| Gemma-DPO | 2B | 6.09 | 5.13 |
| Gemma-MMPO | 2B | 6.10 | 5.57 |
| Gemma-SFT | 7B | 6.84 | 6.84 |
| Gemma-DPO | 7B | 7.40 | 6.49 |
| Gemma-MMPO | 7B | 7.53 | 7.23 |
| Gemma-IT | 7B | 6.26 | |
| Zephyr-β | 7B | 7.34 | |
| GPT-3.5-Turbo | - | 7.94 | |
| GPT-4 | - | 8.99 | |
- Analysis: MMPO consistently outperforms DPO across both model sizes and both feedback datasets (UF for AI feedback, SHP for human feedback). The most dramatic improvement is on the SHP dataset with the 7B model, where DPO actually performs worse than the SFT baseline (6.49 vs. 6.84), while MMPO shows a strong improvement (7.23). This suggests MMPO is particularly effective at handling the noisier, more ambiguous preferences found in human-generated data.
- Figure 3 further breaks down the 7B model's performance on MT-bench, showing that Gemma-MMPO is competitive with GPT-3.5 in several domains and significantly better than the instruction-tuned Gemma-IT model.

2. Capability as Reward Models (RewardBench):
The following is a transcription of Table 2, showing RewardBench scores.
| Model | Size | Avg | Chat | Chat Hard | Safety | Reason | Prior Sets |
|---|---|---|---|---|---|---|---|
| Gemma-DPO | 2B | 59.4 | 95.0 | 45.6 | 51.9 | 49.6 | 50.1 |
| Gemma-MMPO | 2B | 62.3 | 96.1 | 45.1 | 52.3 | 59.8 | 53.6 |
| Gemma-DPO | 7B | 73.0 | 96.6 | 59.9 | 73.7 | 69.0 | 58.3 |
| Gemma-MMPO | 7B | 75.6 | 97.5 | 62.9 | 71.1 | 75.0 | 67.7 |
| Zephyr-β | 7B | 70.7 | 95.3 | 62.6 | 54.1 | 89.6 | 52.2 |
| Zephyr-α | 7B | 73.6 | 91.6 | 63.2 | 70.0 | 89.6 | 53.5 |
| Tulu-2-DPO | 70B | 77.0 | 97.5 | 60.8 | 85.1 | 88.9 | 52.8 |
- Analysis: MMPO models again outperform DPO models, achieving higher average accuracy on RewardBench. The performance gain is especially large on the Reason and Prior Sets subsets, which contain prompts that are different from the chat-focused fine-tuning data. This indicates that MMPO leads to models that generalize their preference understanding better to unseen domains. The 7B MMPO model achieves state-of-the-art performance for its size, even outperforming the much larger 70B Tulu-2-DPO on the Prior Sets subset.
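RewardBench treats the DPO- and MMPO-trained policies themselves as reward models. The standard way to do this is to score each response with the implicit reward r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)) and count how often the chosen response outscores the rejected one. The sketch below illustrates that procedure under assumed details (checkpoint paths, `beta`, and helper names are placeholders, not the paper's evaluation code):

```python
# Sketch: scoring preference pairs with the DPO-style implicit reward
# r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities of the response tokens given the prompt.
    Assumes the prompt tokenization is a prefix of the full tokenization,
    which is good enough for a sketch."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                 # [1, T, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                           # token t predicted from position t-1
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()

def implicit_reward(policy, reference, tokenizer, prompt, response, beta=0.1):
    """Implicit reward: beta * log-likelihood ratio of the policy vs. the SFT reference."""
    return beta * (sequence_logprob(policy, tokenizer, prompt, response)
                   - sequence_logprob(reference, tokenizer, prompt, response))

def pairwise_accuracy(policy, reference, tokenizer, pairs, beta=0.1):
    """Fraction of (prompt, chosen, rejected) triples where the chosen response wins."""
    hits = sum(
        implicit_reward(policy, reference, tokenizer, p, c, beta)
        > implicit_reward(policy, reference, tokenizer, p, r, beta)
        for p, c, r in pairs
    )
    return hits / len(pairs)

# Usage (checkpoint paths are placeholders):
# policy = AutoModelForCausalLM.from_pretrained("path/to/gemma-7b-mmpo")
# reference = AutoModelForCausalLM.from_pretrained("path/to/gemma-7b-sft")
# tokenizer = AutoTokenizer.from_pretrained("path/to/gemma-7b-sft")
```

A RewardBench-style accuracy is then just `pairwise_accuracy` evaluated over each subset's (prompt, chosen, rejected) triples.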
3. Calibration Analysis:
Figure 4 shows reliability diagrams comparing the calibration of the 7B DPO and MMPO models.
The figure plots accuracy against confidence on the RewardBench Prior Sets for the 7B DPO model (left) and the 7B MMPO model (right); the MMPO model is better calibrated, with a lower expected calibration error (ECE).
- Analysis: The DPO model is poorly calibrated, showing a large gap between confidence and accuracy. The MMPO model's curve is much closer to the diagonal "perfect calibration" line, resulting in a significantly lower ECE. This confirms that MMPO produces models whose confidence in a preference is a more reliable indicator of correctness.
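The ECE summarized in Figure 4 can be computed by bucketing predictions by confidence and comparing each bucket's average confidence to its empirical accuracy. The following is a generic sketch of that computation, not the paper's evaluation code; the bin count and input format are assumptions:

```python
# Sketch: expected calibration error (ECE) from per-example confidences and outcomes.
# `confidences[i]` is the model's probability that the chosen response is preferred
# (e.g. the sigmoid of the implicit reward margin); `correct[i]` is 1 if that
# prediction matched the ground-truth label.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()                # bar position on the x-axis
        avg_acc = correct[mask].mean()                     # empirical accuracy in this bin
        ece += mask.mean() * abs(avg_acc - avg_conf)       # bin-weighted |confidence - accuracy|
    return ece
```

The reliability diagram is just the per-bin (avg_conf, avg_acc) points; a well-calibrated model keeps them on the diagonal.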
4. Robustness to Overfitting:
Figure 5 demonstrates MMPO's robustness.
The left plot shows the implicit reward margins of the DPO and MMPO models on the UltraFeedback validation set across training epochs, with DPO developing a large margin by epoch 3 that suggests overfitting; the right plot shows the corresponding MT-bench performance, where DPO degrades while MMPO holds up.
- Analysis: The left plot shows that as training progresses, the implicit reward difference for the DPO model explodes in epoch 3. This is a classic sign of overfitting, where the model learns to be overconfident in its training-data preferences. This overfitting directly correlates with a drop in MT-bench performance (right plot). In contrast, the MMPO model's reward margin remains stable, and its performance continues to improve.
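The quantity tracked on the left of Figure 5 is the gap between the implicit rewards of chosen and rejected validation responses. Reusing the `implicit_reward` helper sketched earlier (an assumption about how such a curve could be produced, not the paper's code), the margin for one checkpoint could be measured as follows:

```python
# Sketch: mean implicit reward margin on a held-out preference set for one checkpoint.
# A margin that keeps growing across epochs, as DPO's does in Figure 5, indicates the
# policy is drifting far from the reference to fit the training preferences.
def mean_reward_margin(policy, reference, tokenizer, val_pairs, beta=0.1):
    margins = [
        implicit_reward(policy, reference, tokenizer, prompt, chosen, beta)
        - implicit_reward(policy, reference, tokenizer, prompt, rejected, beta)
        for prompt, chosen, rejected in val_pairs
    ]
    return sum(margins) / len(margins)
```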
5. Reward Modeling with MMPO:
Figure 6 shows that reward models trained with MMPO scale better in a best-of-n sampling scenario.
The figure shows MT-bench scores of the 2B (left) and 7B (right) models trained on UltraFeedback, comparing reward models trained with and without MMPO under best-of-n sampling with n = 16, 64, and 256. As n increases, the MMPO model keeps improving, while the baseline model's performance first rises and then falls.
- Analysis: As more candidate responses (n) are generated, the MMPO-trained reward model consistently picks better ones, leading to improved MT-bench scores. The baseline reward model, however, suffers from over-optimization; its performance peaks and then degrades at very high n, likely because it starts picking responses that exploit flaws in the reward function.
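Best-of-n sampling itself is simple: generate n candidates per prompt and keep the one the reward model scores highest. A minimal sketch is below; `generate` and `reward` are hypothetical stand-ins for the policy's sampler and the trained reward model, not names from the paper:

```python
# Sketch: best-of-n sampling with a learned reward model.
# `generate(prompt)` samples one response from the policy; `reward(prompt, response)`
# scores it with the trained reward model. Both are placeholder callables.
def best_of_n(prompt, generate, reward, n=64):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward(prompt, resp))
```

As n grows, the selected responses concentrate on whatever the reward model scores highest, which is why a reward model with exploitable flaws eventually degrades at large n while a better-generalizing one keeps improving.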