IGPO Objective Function:
The policy is optimized with a clipped surrogate objective in the style of PPO/GRPO, but driven by the dense, turn-level advantages $A_{i,t}$.
$$
\mathcal{J}_{\mathrm{IGPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}\!\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Bigg(\frac{\pi_\theta(o_{i,t}\mid\ldots)}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid\ldots)}\,A_{i,t},\ \mathrm{clip}\!\Big(\frac{\pi_\theta(o_{i,t}\mid\ldots)}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid\ldots)},\,1-\epsilon,\,1+\epsilon\Big)A_{i,t}\Bigg) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big)\Bigg]
$$
- Symbol Explanation:
- This objective increases the probability of actions with a positive advantage $A_{i,t}$ and decreases the probability of those with a negative advantage.
- The $\min$ and $\mathrm{clip}$ functions are inherited from PPO and prevent excessively large policy updates.
- $\beta\, D_{\mathrm{KL}}(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}})$ is a regularization term that keeps the updated policy from diverging too far from the reference model $\pi_{\mathrm{ref}}$.
- Crucially, the advantage $A_{i,t}$ is applied to every decision token within turn $t$, providing fine-grained credit assignment; a minimal implementation sketch follows below.
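
To make the objective concrete, here is a minimal PyTorch sketch of the resulting loss. The function name `igpo_loss`, the `(G, T)` tensor layout, the default `eps`/`beta` values, and the k3-style KL estimator are all illustrative assumptions for this sketch, not the paper's implementation.

```python
import torch

def igpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.01):
    """Sketch of the IGPO objective above, negated for gradient descent.

    logp_new, logp_old, logp_ref: (G, T) per-token log-probs of each
        decision token under pi_theta, pi_theta_old, and pi_ref
        (sequences padded to length T; padding assumed masked upstream).
    advantages: (G, T) turn-level advantage A_{i,t}, broadcast so every
        decision token within turn t carries the same value.
    """
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)               # PPO pessimistic bound
    per_rollout = surrogate.mean(dim=1)                     # (1/|o_i|) * sum over t
    # KL(pi_theta || pi_ref) via the "k3" estimator common in GRPO-style
    # implementations (an assumption here, not specified by the paper).
    log_r = logp_ref - logp_new
    kl = (torch.exp(log_r) - log_r - 1.0).mean()
    return -(per_rollout.mean() - beta * kl)                # minimize the negative
```

Because $A_{i,t}$ is constant within a turn, the `advantages` tensor can be built by simply repeating each turn's advantage across the token span of that turn.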