The Risk-Sensitive Policy Gradient and Advantage Function:
The authors derive the policy gradient for this new objective (Theorem 1), which has a similar structure to the standard policy gradient but with a new risk-sensitive advantage function, $A_\beta^{\pi_\theta}$.
$$\nabla_\theta I_{\mathrm{RS}}(\pi_\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ A_\beta^{\pi_\theta}(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]$$
The risk-sensitive advantage function is given by:
$$A_\beta^{\pi_\theta}(y) = \frac{1}{\beta}\left( \frac{e^{\beta r(y)}}{\mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}\!\left[ e^{\beta r(y')} \right]} - 1 \right)$$
In practice, the expectation in the denominator is estimated using N samples from the policy for a given prompt:
$$\hat{A}_\beta^{\pi_\theta}(y_i) = \frac{1}{\beta}\left( \frac{e^{\beta r(y_i)}}{\frac{1}{N}\sum_{j=1}^{N} e^{\beta r(y_j)}} - 1 \right)$$
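A first-order expansion in $\beta$ (a check that follows directly from the definition above, not a statement quoted from the paper) shows that the risk-sensitive advantage reduces to the familiar mean-baseline advantage as $\beta \to 0$:

$$\lim_{\beta \to 0} A_\beta^{\pi_\theta}(y) = r(y) - \mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}\left[ r(y') \right]$$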
This practical form, RS-GRPO, is a simple "drop-in" replacement for the standard advantage calculation in algorithms like GRPO.
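To make that drop-in usage concrete, here is a minimal sketch of the group-wise advantage computation for one prompt; the function name, the NumPy dependency, the log-space evaluation, and the example rewards are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def rs_grpo_advantages(rewards, beta):
    """Risk-sensitive advantages for one prompt's group of N sampled completions.

    Computes A_hat_beta(y_i) = (1/beta) * (exp(beta * r_i) / mean_j exp(beta * r_j) - 1),
    evaluated in log space so large |beta * r| values do not overflow.
    Requires beta != 0.
    """
    r = np.asarray(rewards, dtype=np.float64)
    z = beta * r
    # log of the group mean of exp(beta * r_j), via a log-sum-exp reduction
    log_mean = np.logaddexp.reduce(z) - np.log(r.size)
    # exp(beta * r_i) / mean_j exp(beta * r_j), mapped through (x - 1) / beta
    return (np.exp(z - log_mean) - 1.0) / beta

# Example: one prompt, a group of N = 4 sampled completions with scalar rewards.
group_rewards = [0.0, 1.0, 1.0, 2.0]
print(rs_grpo_advantages(group_rewards, beta=0.5))
print(rs_grpo_advantages(group_rewards, beta=-0.5))

# These advantages plug into the usual score-function surrogate from the
# gradient formula above: loss = -mean_i( A_hat_i * log pi(y_i | x) ).
```

For $\beta$ close to zero the outputs approach the mean-baseline advantages from the limit shown earlier, which is consistent with treating the estimator as a drop-in generalization of the standard group-mean advantage.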