The core of Search-R1 is framing the task of reasoning with a search engine as an RL problem.
- Principles: The LLM is the agent, and the environment consists of the task (the question) and the search engine. The agent acts by generating tokens; certain token sequences trigger an interaction with the environment (a search query), which returns new information that becomes part of the agent's context for subsequent actions.
- Reinforcement Learning Objective: The goal is to train the policy LLM (πθ) to maximize the expected reward while not straying too far from a reference model (πref). The general objective is (a toy sketch of this quantity follows the symbol list below):
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x; \mathcal{R})}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x; \mathcal{R}) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x; \mathcal{R}) \big]$$
- πθ: The LLM policy being trained.
- πref: A reference copy of the model, used to prevent the trained model from diverging too much (regularization).
- x: The input question from dataset D.
- y: The full generated output trajectory, including reasoning, search queries, and the final answer.
- R: The search engine, which is part of the generation process. πθ(⋅∣x;R) signifies that generation is interleaved with results from R.
- rϕ(x,y): The reward function, which scores the final output y.
- DKL: The KL-divergence, a measure of how different the trained policy πθ is from the reference policy πref.
- β: A hyperparameter that controls the strength of the KL penalty.
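To make the objective concrete, here is a minimal sketch, in plain Python, of what this quantity looks like for a single sampled trajectory. It is an illustration only, not the paper's implementation: `logp_theta`, `logp_ref`, and `reward` are hypothetical inputs assumed to be produced elsewhere (by scoring the sampled tokens under both models and by the reward rule described below).

```python
def kl_regularized_objective(logp_theta, logp_ref, reward, beta=0.01):
    """Single-sample estimate of r_phi(x, y) - beta * KL[pi_theta || pi_ref].

    logp_theta, logp_ref: per-token log-probabilities of the sampled
        trajectory y under the trained policy and the frozen reference model
        (hypothetical inputs, computed elsewhere).
    reward: scalar outcome reward r_phi(x, y) for the trajectory.
    beta: strength of the KL penalty.
    """
    # Monte-Carlo estimate of the KL term on the sampled tokens:
    # sum_t [ log pi_theta(y_t | ...) - log pi_ref(y_t | ...) ].
    kl_estimate = sum(lt - lr for lt, lr in zip(logp_theta, logp_ref))
    return reward - beta * kl_estimate
```

In practice the KL term is usually applied per token and folded into the reward or loss rather than computed as one scalar, but the quantity being traded off is the same.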
- Steps & Procedures:
The process, illustrated in Figure 1, involves several key steps.
(Figure 1: schematic of the two RL frameworks with a search engine. Top, PPO: for an input query q, the policy LLM, interleaving generation with search-engine calls, produces an output o; a value LLM, a reward model, and a reference LLM are used to compute the advantage A. Bottom, GRPO: the policy LLM samples a group of outputs o1…oG, the reward model and reference LLM produce rewards r1…rG, and group computation turns these into advantages A1…AG. Colors distinguish trained models, frozen models, and the search engine.)
- Rollout: For a given question, the LLM generates a response step-by-step. This process is detailed in Algorithm 1; a simplified sketch follows this list.
- The model generates its reasoning within `<think> ... </think>` tags.
- If it needs information, it generates a query within `<search> ... </search>` tags.
- The system detects this, sends the query to the search engine R, and inserts the results back into the context within `<information> ... </information>` tags.
- This can repeat multiple times.
- Finally, the model generates its final answer within `<answer> ... </answer>` tags.
- Reward Calculation: Once the rollout is complete, the final answer is extracted and compared to the ground truth. A reward is calculated using a simple rule.
- Policy Optimization: The trajectory (the full sequence of generated tokens) and its corresponding reward are used to update the LLM's weights using an RL algorithm like PPO or GRPO.
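The rollout-and-reward loop can be pictured with the following sketch. It is a simplified illustration rather than the paper's Algorithm 1: `llm.generate_until` and `search_engine.search` are hypothetical interfaces standing in for the model's generation API and the retriever R, and the tag handling is reduced to regular expressions.

```python
import re

def rollout(llm, search_engine, question, max_turns=4):
    """Interleave LLM generation with search calls, driven by special tags."""
    context = question
    for _ in range(max_turns):
        # Hypothetical API: generate until the model closes a search or answer block.
        segment = llm.generate_until(context, stop=["</search>", "</answer>"])
        context += segment
        if "</answer>" in segment:          # final answer produced, rollout ends
            break
        query = re.search(r"<search>(.*?)</search>", segment, re.S)
        if query:
            # Hypothetical retriever call; results are appended inside
            # <information> tags. These tokens are external text and will be
            # masked out of the training loss later.
            docs = search_engine.search(query.group(1).strip())
            context += f"<information>{docs}</information>"
    return context

def exact_match_reward(trajectory, gold_answer):
    """Rule-based outcome reward: 1 if the extracted answer matches exactly."""
    match = re.search(r"<answer>(.*?)</answer>", trajectory, re.S)
    pred = match.group(1).strip() if match else ""
    return 1.0 if pred == gold_answer.strip() else 0.0
```

The returned trajectory (question plus everything generated and retrieved), together with its scalar reward, is what the PPO or GRPO update consumes.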
- Mathematical Formulas & Key Details:
- Loss Masking for Retrieved Tokens: This is a critical detail. During the loss calculation for PPO or GRPO, the model's predictions for tokens inside the `<information> ... </information>` block are ignored. This is crucial because those tokens are not generated by the LLM; they are copied from the search engine. Training the LLM to predict this external text would be unstable and pointless. The masking is represented by I(yt) in the formulas, which is 1 for LLM-generated tokens and 0 for retrieved tokens (see the sketch at the end of this section).
- PPO Objective: The specific PPO objective used is:
$$\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\mathrm{old}}(\cdot \mid x; \mathcal{R})} \left[ \frac{1}{\sum_{t=1}^{|y|} I(y_t)} \sum_{t=1:\, I(y_t)=1}^{|y|} \min\!\left( \frac{\pi_\theta(y_t \mid x, y_{<t}; \mathcal{R})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t}; \mathcal{R})}\, A_t,\; \mathrm{clip}\!\left( \frac{\pi_\theta(y_t \mid x, y_{<t}; \mathcal{R})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t}; \mathcal{R})},\, 1-\epsilon,\, 1+\epsilon \right) A_t \right) \right]$$
- At: The "advantage" at timestep t, which estimates how much better a specific action (generating token yt) was compared to the average action at that state. It's calculated using a critic model.
- πold: The policy before the update. The ratio πθ/πold measures how much the policy has changed.
- clip(...): This function constrains the policy ratio to lie within [1−ϵ, 1+ϵ], preventing overly aggressive updates and stabilizing training.
- Reward Model: The reward is based on a simple rule, Exact Match (EM):
$$r_\phi(x, y) = \mathrm{EM}(a_{\mathrm{pred}}, a_{\mathrm{gold}})$$
- a_pred is the answer extracted from the `<answer> ... </answer>` tags in the model's output y.
- a_gold is the ground truth answer.
- The reward is 1 if they match exactly, and 0 otherwise.
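The sketch below shows one way the mask I(yt) and the clipped PPO ratio fit together at the token level, using PyTorch. It is a schematic under stated assumptions, not Search-R1's training code: the per-token log-probabilities, the advantages, and the 0/1 mask over retrieved tokens are assumed to be computed elsewhere in the training loop.

```python
import torch

def masked_ppo_loss(logp_new, logp_old, advantages, loss_mask, eps=0.2):
    """Token-level PPO clip objective with retrieved tokens masked out.

    logp_new:   log pi_theta(y_t | x, y_<t; R) under the current policy
    logp_old:   log pi_old(y_t | x, y_<t; R) from the rollout policy
    advantages: A_t per token (e.g. from a critic with GAE)
    loss_mask:  I(y_t) -- 1.0 for LLM-generated tokens, 0.0 for tokens copied
                from the <information> block
    All arguments are 1-D tensors of length |y|.
    """
    ratio = torch.exp(logp_new - logp_old)                    # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = torch.minimum(unclipped, clipped)
    # Average only over tokens the LLM actually generated (the 1 / sum I(y_t)
    # normalization), so the policy is never trained on retrieved text.
    objective = (per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)
    return -objective                                          # minimized by the optimizer
```

The mask itself can be recorded during rollout by tagging each appended token with whether it came from the model or from an `<information>` block.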