ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
TL;DR Summary
This paper introduces ViSurf, a novel post-training paradigm integrating Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) for large vision-and-language models, enhancing performance through simultaneous external supervision and internal reinforcement.
Abstract
Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (**Vi**sual **Su**pervised-and-**R**einforcement **F**ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming individual SFT, individual RLVR, and the two-stage SFT → RLVR pipeline. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models. It proposes a novel fine-tuning paradigm for Large Vision-and-Language Models (LVLMs) that combines the benefits of both Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR).
1.2. Authors
The authors are Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, and Jiaya Jia. Their affiliations include CUHK (The Chinese University of Hong Kong), HKUST (The Hong Kong University of Science and Technology), and RUC (Renmin University of China). These are prominent academic institutions known for their research in computer vision, natural language processing, and artificial intelligence.
1.3. Journal/Conference
The paper is published as a preprint on arXiv (arXiv:2510.10606). arXiv is a widely used open-access repository for preprints of scientific papers in fields like physics, mathematics, computer science, and more. Publishing on arXiv allows for rapid dissemination of research findings before formal peer review and publication in a journal or conference, making it highly influential for sharing cutting-edge work in AI.
1.4. Publication Year
The paper was published on 2025-10-12 (UTC), i.e., it was made publicly available in October 2025.
1.5. Abstract
The paper addresses the limitations of typical post-training paradigms for Large Vision-and-Language Models (LVLMs): Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). While SFT provides external guidance for new knowledge injection, it often leads to sub-optimal performance and catastrophic forgetting. RLVR, using internal reinforcement, enhances reasoning but struggles with tasks beyond the model's existing knowledge.
To overcome these, the authors propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage post-training paradigm. ViSurf integrates both SFT and RLVR strengths by analyzing their objective functions and deriving a unified ViSurf objective. Its core mechanism involves injecting ground-truth labels into RLVR rollouts, providing simultaneous external supervision and internal reinforcement. The method also introduces three novel reward control strategies to stabilize and optimize training.
Extensive experiments across diverse benchmarks demonstrate ViSurf's superiority over individual SFT, RLVR, and two-stage SFT → RLVR methods. In-depth analysis validates its derivation and design principles, confirming its effectiveness in enhancing LVLMs performance and mitigating catastrophic forgetting.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2510.10606.
The PDF link is https://arxiv.org/pdf/2510.10606v2.pdf.
This paper is published as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The development of Large Vision-and-Language Models (LVLMs) is a significant direction in visual intelligence. Post-training these models often relies on two primary paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR).
- SFT: This method directly optimizes models using expert-annotated data, providing explicit external guidance to help the model memorize target distributions. However, SFT often results in sub-optimal performance and can lead to catastrophic forgetting, where the model loses previously acquired pre-trained knowledge. This means SFT struggles to generalize well or retain broad capabilities.
- RLVR: This approach leverages internal reinforcement signals, typically from pre-defined reward functions, to enhance reasoning capabilities and overall performance. RLVR is generally better at mitigating catastrophic forgetting and often achieves superior results. However, its performance degrades when tasks extend beyond the initial model's internal knowledge base. It relies on self-rollouts (model-generated outputs) and struggles if the model cannot internally generate a good solution or a "no object" response for certain tasks.

The core problem the paper aims to solve is that SFT and RLVR have complementary strengths and weaknesses. SFT is effective for injecting new knowledge, especially for tasks outside the model's pre-training distribution, while RLVR excels at refining reasoning within the model's existing knowledge. Existing two-stage approaches (SFT → RLVR) attempt to combine them but suffer from increased computational cost and catastrophic forgetting during the initial SFT phase.
The paper's innovative idea (entry point) is to propose a unified, single-stage paradigm called ViSurf that integrates the strengths of both SFT and RLVR more effectively. This aims to achieve the benefits of both without their typical drawbacks, particularly addressing the "no object" scenario where RLVR fails and SFT provides crucial external guidance.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Unified Post-Training Paradigm (ViSurf): The authors propose ViSurf, a novel, single-stage post-training method that integrates the complementary benefits of SFT and RLVR. This is achieved by analyzing their underlying objective functions and gradients, theoretically demonstrating their structural similarities, and proposing a unified ViSurf objective.
- Integration of Ground-Truth Labels into the RL Framework: The core of ViSurf involves injecting ground-truth labels directly into RLVR rollouts. This allows for simultaneous external supervision (from SFT) and internal reinforcement (from RLVR), providing a powerful combined learning signal.
- Novel Reward Control Strategies: Three new reward control strategies are introduced to stabilize and optimize the training process, specifically for handling ground-truth labels within the RL framework. These strategies are: aligning ground-truth labels with rollout preferences; eliminating the thinking reward for ground-truth labels; and smoothing the reward for ground-truth labels. They are crucial for preventing issues like reward hacking and for ensuring a proper balance between the SFT and RLVR influences.
- Empirical Superiority: Extensive experiments across diverse vision-and-language benchmarks (e.g., Non-Object Segmentation, Reasoning Segmentation, GUI Grounding, Anomaly Detection, Medical Image, MathVista) demonstrate that ViSurf consistently outperforms individual SFT, RLVR, and sequential SFT → RLVR pipelines.
- Mitigation of Catastrophic Forgetting: ViSurf successfully mitigates catastrophic forgetting, as evidenced by its stable performance on VQA tasks like ChartQA and DocVQA, a common problem with SFT-based methods.
- Robustness and Stability: ViSurf exhibits greater training stability compared to pure RLVR and SFT → RLVR, with performance gains directly related to the baseline model's initial competency, corroborating the theoretical analysis. It also reduces the burden of prompt engineering.

The key conclusions are that ViSurf provides a more effective and robust post-training paradigm for LVLMs by harmoniously combining external supervision and internal reinforcement. This addresses the limitations of both individual approaches, leading to superior performance across a wide range of tasks and enhancing model stability.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand ViSurf, a reader should be familiar with the following concepts:
- Large Vision-and-Language Models (LVLMs): These are advanced Artificial Intelligence models capable of understanding and generating content across both visual (images, videos) and linguistic (text) modalities. They combine the strengths of large language models (LLMs) for reasoning and language generation with powerful vision models for image understanding. They can perform tasks like image captioning, visual question answering (VQA), and referring expression segmentation.
- Fine-Tuning: After a large model (like an LVLM) is pre-trained on a massive dataset for general capabilities, fine-tuning adapts it to a specific task or dataset. This involves further training the model for a relatively shorter period on a smaller, task-specific dataset, usually by continuing to optimize its parameters.
- Supervised Learning: A type of machine learning where an algorithm learns from labeled training data. Each piece of input data is paired with a correct output label. The model learns to map inputs to outputs by minimizing the difference between its predictions and the ground-truth labels.
- Supervised Fine-Tuning (SFT): In the context of LVLMs, SFT involves fine-tuning a pre-trained LVLM on a dataset where each input (e.g., image-text pair) has a corresponding expert-annotated correct output (ground-truth label). The model is trained to predict these correct outputs, typically by minimizing a negative log-likelihood loss.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives rewards or penalties for its actions, which guide it towards optimal behavior without explicit supervision. Key components include the agent, environment, state, action, and reward.
- Reinforcement Learning from Human Feedback (RLHF): A common RL technique used to align LLMs with human preferences. Instead of a simple reward function, RLHF uses a reward model trained on human preference data (e.g., humans ranking model outputs) to provide a scalar reward signal to the LLM.
- Reinforcement Learning with Verifiable Rewards (RLVR): This is a specific type of RL used in the paper, distinct from RLHF. In RLVR, the reward function is objective and calculable based on predefined criteria, often related to the format and accuracy of the model's output. It does not require subjective human preference data, making it more scalable. Examples of verifiable rewards include IoU for segmentation masks, accuracy for answers, or adherence to a specific output format (e.g., JSON).
- On-Policy vs. Off-Policy RL:
  - On-policy RL: The agent learns a policy and evaluates it based on data generated by that same policy. Algorithms like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) are on-policy.
  - Off-policy RL: The agent learns a policy from data generated by a different policy (e.g., an older version of itself or a completely different exploratory policy). Direct Preference Optimization (DPO) is often considered off-policy in how it uses a static preference dataset.
- Catastrophic Forgetting: A phenomenon observed in neural networks where training on new tasks causes a significant and abrupt loss of performance on previously learned tasks. In LVLMs, SFT can sometimes lead to catastrophic forgetting of the broad knowledge acquired during pre-training.
- Log-derivative Trick: A mathematical technique used in Reinforcement Learning to estimate gradients of expected values with respect to policy parameters. It allows converting the gradient of an expectation into an expectation of a gradient, which is crucial for optimizing policies directly. Specifically, for an expectation $\mathbb{E}_{x \sim p_{\theta}}[f(x)]$, its gradient with respect to $\theta$ can be written as $\mathbb{E}_{x \sim p_{\theta}}[f(x) \nabla_{\theta} \log p_{\theta}(x)]$.
- Advantage Function: In RL, the advantage function measures how much better a particular action is compared to the average action from a given state. It is often defined as $A(s, a) = Q(s, a) - V(s)$, where $Q(s, a)$ is the action-value function (expected return from taking action $a$ in state $s$) and $V(s)$ is the state-value function (expected return from state $s$). In PPO and GRPO, a normalized version of the reward is often used as an advantage estimate to stabilize training.
- Entropy: In information theory, entropy measures the uncertainty or randomness of a probability distribution. In RL, a policy with high entropy explores more diverse actions, while a policy with low entropy is more deterministic. Maintaining a certain level of entropy during training can encourage exploration and prevent premature convergence to sub-optimal policies.
- Reward Hacking: A phenomenon in Reinforcement Learning where an agent finds unintended ways to maximize its reward function, often by exploiting flaws or ambiguities in the reward definition, without actually achieving the desired behavior or objective.
3.2. Previous Works
The paper discusses prior work in two main categories: Supervised Fine-tuning for LVLMs and Reinforcement Learning for LVLMs.
3.2.1. Supervised Fine-tuning for LVLMs
SFT is a foundational paradigm for LVLMs, adapting pre-trained models using expert-annotated data.
- LLaVA [17, 13, 18]: A pioneering work that started the trend of visual instruction tuning. It involves fine-tuning LLMs on multimodal instruction-following data. The LLaVA series refers to subsequent models built upon this foundation.
- QwenVL-series [1, 38]: Another prominent series of LVLMs that employ SFT for diverse visual-language tasks.
- MGM-series [14, 36, 44] and InternVL [3]: Other examples of LVLMs that successfully adopted the SFT paradigm.
- Applications: SFT has been effective for various downstream applications, such as image quality assessment [41], visual counting [5], and autonomous driving [40].

Comment: The core idea behind SFT is to expose the model to high-quality examples of desired behavior. For instance, if an LVLM is trained for image captioning, SFT would involve showing it many images with corresponding human-written captions and optimizing the model to generate similar captions. The loss function typically used is cross-entropy loss or negative log-likelihood, which aims to maximize the probability of the correct output tokens given the input. For a given input (v, t) and ground-truth output $y = (y_1, \ldots, y_L)$, the negative log-likelihood loss is: $ \mathcal{L}_{\mathrm{SFT}} = - \sum_{i=1}^{L} \log P(y_i \mid y_{<i}, v, t; \theta) $, where $P(y_i \mid y_{<i}, v, t; \theta)$ is the probability of the $i$-th token given the previous tokens $y_{<i}$, visual input $v$, textual input $t$, and model parameters $\theta$. This loss encourages the model to generate the exact sequence of tokens as the ground truth.
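To make the loss concrete, here is a minimal PyTorch sketch of such a token-level negative log-likelihood; the tensor shapes and the `pad_id` masking convention are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def sft_nll_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = -100) -> torch.Tensor:
    """Token-level negative log-likelihood for SFT.

    logits:     [batch, seq_len, vocab] next-token scores produced by the model.
    target_ids: [batch, seq_len] ground-truth token ids; positions equal to pad_id are ignored.
    """
    # Cross-entropy over the vocabulary equals -log P(y_i | y_<i, v, t; theta) per position.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )

# Toy usage with random tensors (shapes only; no real LVLM involved).
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
loss = sft_nll_loss(logits, targets)
```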
3.2.2. Reinforcement Learning for LVLMs
RL is also a standard method for fine-tuning LVLMs.
- Direct Preference Optimization (DPO) [30]: An RL algorithm that relies on pre-collected human preference datasets. Instead of training a separate reward model, DPO directly optimizes the policy by comparing preferred and dispreferred pairs of responses. While effective, it requires costly human preference data.
- Proximal Policy Optimization (PPO) [32]: A widely used on-policy RL algorithm that optimizes the policy by taking small steps to avoid large changes that could destabilize training. It typically requires a well-trained reward model to evaluate generated responses and provide feedback. The objective function of PPO aims to maximize a clipped surrogate objective to ensure policy updates are not too aggressive. The core PPO objective (without entropy bonus) is: $ \mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right] $, where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio, $\hat{A}_t$ is the advantage estimate at time step $t$, and $\epsilon$ is a small hyperparameter for clipping. This objective encourages the new policy to move in the direction of higher rewards, but only within a "clipped" range relative to the old policy to maintain stability (see the sketch after this list).
- Group Relative Policy Optimization (GRPO) [34] and Dynamic Sampling Policy Optimization (DAPO) [42]: These are RLVR algorithms that have gained attention for their ability to assess model outputs against objective, verifiable criteria. They reduce dependency on manually annotated data (as in SFT) or pre-trained reward models (as PPO often needs, or DPO's preference data).
- Recent Works: The effectiveness of RLVR for LVLMs has been demonstrated in works like SegZero [20] and VisualRFT [22], as well as VisionReasoner [21].
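For readers who prefer code, the clipped surrogate above can be sketched as follows. This is a simplified per-sample PyTorch version assuming sequence-level log-probabilities and advantages are already available; it is not the implementation used in the paper.

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective, negated so it can be minimized.

    logp_new / logp_old: log-probabilities of sampled responses under the
    current and old policies; advantages: advantage estimates per sample.
    """
    ratio = torch.exp(logp_new - logp_old)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min(.) keeps the update conservative; mean averages over the batch.
    return -torch.min(unclipped, clipped).mean()
```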
3.3. Technological Evolution
The field of LVLMs has rapidly evolved from early multimodal models that simply concatenated features to complex architectures that deeply integrate vision and language.
- Early Multimodal Models: Initially, models processed visual and textual information separately and then combined their representations at a later stage (e.g., simple concatenation of embeddings).
- Vision-Language Pre-training: The advent of large-scale datasets and transformer architectures led to vision-language pre-training, where models learned joint representations via tasks like image-text matching or masked language modeling conditioned on images (e.g., CLIP, ALIGN).
- Instruction Tuning and LLM Integration: The success of Large Language Models (LLMs) (like GPT-3) inspired visual instruction tuning (LLaVA), where LLMs were extended with visual capabilities and fine-tuned to follow multimodal instructions. This made LVLMs more conversational and versatile.
- Specialized Fine-Tuning Paradigms: As LVLMs became more powerful, researchers explored various fine-tuning methods to adapt them to specific downstream tasks and improve their reasoning.
  - SFT became standard for injecting task-specific knowledge.
  - RLHF and RLVR emerged to align models with desired behaviors and enhance reasoning, overcoming some limitations of SFT.
  - The realization that SFT and RL have complementary strengths led to approaches that sequentially combine them (SFT → RLVR).

This paper's work, ViSurf, fits into this timeline as an advanced fine-tuning paradigm that seeks to unify the benefits of SFT and RLVR into a single, more efficient, and robust stage. It is a step towards more holistic and stable post-training strategies for LVLMs, moving beyond sequential or separate applications of SFT and RL.
3.4. Differentiation Analysis
Compared to existing methods, ViSurf presents several core differences and innovations:
- Unified, Single-Stage Integration vs. Sequential or Separate:
  - SFT: ViSurf incorporates SFT's external guidance, but unlike pure SFT, it avoids catastrophic forgetting and is enhanced by internal RLVR signals.
  - RLVR: ViSurf retains RLVR's internal reinforcement but overcomes its limitation of struggling with tasks outside the model's knowledge base by proactively injecting ground-truth SFT signals. Pure RLVR can also be unstable or perform worse than baselines if rollouts are consistently poor.
  - SFT → RLVR (Two-Stage): This sequential approach attempts to combine the benefits but incurs higher computational costs and can still suffer from catastrophic forgetting during the SFT phase, which is then passed on. ViSurf integrates both within a single stage, making it more efficient and stable.
- Novel Objective Function: ViSurf proposes a unified objective function that naturally combines the gradients of both SFT and RLVR. Unlike prior methods that might simply add SFT and RL losses, ViSurf treats the ground-truth label as a high-reward sample within the RLVR framework, modifying the advantage calculation to incorporate its signal. This provides a theoretically grounded integration.
- Self-Adaptive Balance: A key innovation is the dynamic, self-adaptive balancing between SFT and RLVR influences. When the model's self-rollouts are poor, the SFT term (guided by ground truth) dominates the update. When self-rollouts are good, the SFT influence is smoothed or eliminated, allowing RLVR to take over. This adaptive mechanism is more sophisticated than fixed weighting or sequential training.
- Reward Control Strategies: The introduction of three specific reward control strategies (aligning ground truth, eliminating thinking reward, smoothing reward) is crucial for ViSurf's stability and effectiveness. These strategies address practical challenges of integrating ground truth into an RL setting, such as reward hacking and ensuring proper distribution alignment, which are not typically found in standalone SFT or RLVR.

In essence, ViSurf offers a more elegant, efficient, and robust solution for LVLM fine-tuning by providing a principled way to leverage both external expert knowledge and internal model exploration within a single, stable training loop.
4. Methodology
4.1. Principles
The core idea behind ViSurf is to unify the strengths of Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) within a single, integrated training stage. The theoretical basis or intuition is derived from the observation that the gradient forms of SFT and RLVR objectives share structural similarities. ViSurf's key principle is to treat the ground-truth label (from SFT) as a high-reward sample within the RLVR framework, thereby providing simultaneous external guidance and internal reinforcement. This allows the model to learn from explicit correct answers while also refining its reasoning and exploration capabilities based on verifiable rewards. The method also incorporates dynamic reward control mechanisms to ensure stability and adaptive balancing between these two learning signals.
4.2. Core Methodology In-depth (Layer by Layer)
The paper begins by defining the preliminary concepts, then analyzes the gradients of SFT and RLVR, and finally introduces the ViSurf objective and its associated reward control strategies.
4.2.1. Preliminary
Let $\pi_{\theta}$ denote a Large Vision-and-Language Model (LVLM), parameterized by $\theta$.

- Input Data: Both SFT and RLVR utilize the same input dataset, $\mathcal{D}_{\mathrm{input}} = \{(v_i, t_i)\}_{i=1}^N$, where:
  - $v$: a visual input (e.g., an image).
  - $t$: a textual input (e.g., a prompt or question).
  - $N$: the size of the dataset.
4.2.1.1. Supervised Fine-Tuning (SFT)
SFT optimizes $\pi_{\theta}$ against a set of ground-truth labels, $\mathcal{D}_{\mathrm{label}} = \{y_i\}_{i=1}^N$. The objective is to minimize the negative log-likelihood of these labels.

The SFT objective function is given by Equation (1):
$ \mathcal{L}_{\mathrm{SFT}}(\theta) = - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}}} \left[ \log \pi_{\theta}(y \mid v, t) \right] $
Here:
- $\mathcal{L}_{\mathrm{SFT}}(\theta)$: The SFT loss function to be minimized.
- $\theta$: The parameters of the LVLM.
- $\mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}}}$: The expectation operator, averaging over samples from the data distribution.
- $(v, t)$: An input pair (visual $v$, textual $t$) sampled from the input dataset.
- $\log \pi_{\theta}(y \mid v, t)$: The logarithm of the probability assigned by the model to the ground-truth output $y$, given the visual and textual inputs.
- $y$: The ground-truth label corresponding to the input (v, t); for each input there is a corresponding correct output that the model should produce.

This objective encourages the model to assign high probabilities to the correct ground-truth outputs.
4.2.1.2. Reinforcement Learning with Verifiable Rewards (RLVR)
The paper illustrates RLVR using the on-policy Group Relative Policy Optimization (GRPO) algorithm. GRPO optimizes the policy $\pi_{\theta}$ using a verifiable reward function $r(\cdot)$, which typically combines measures of output format and accuracy (e.g., IoU, factual correctness, adherence to JSON format).

For a given input $(v, t)$:

- Rollout Generation: The old policy $\pi_{\theta_{\mathrm{old}}}$ (the current policy before an update) generates a group of $G$ rollouts $\{o_j\}_{j=1}^G$. These rollouts are generated by sampling from the policy with different random seeds.
- Reward Evaluation: Each rollout $o_j$ is evaluated by the reward function $r(\cdot)$, resulting in a set of rewards $\{r(o_j)\}_{j=1}^G$.
- Advantage Calculation: The advantage for each rollout is computed based on its reward relative to the mean and standard deviation of rewards within the group (see the sketch after this list). The advantage calculation is given by Equation (2):
  $ \hat{A}_j = \frac{r(o_j) - \mathrm{mean}\left( \{r(o_j)\}_{j=1}^G \right)}{\mathrm{std}\left( \{r(o_j)\}_{j=1}^G \right)} $
  Here:
  - $\hat{A}_j$: The estimated advantage for the $j$-th rollout. A higher advantage means the rollout is better than average in its group.
  - $r(o_j)$: The reward obtained for the $j$-th rollout $o_j$.
  - $\mathrm{mean}(\{r(o_j)\}_{j=1}^G)$: The average reward across all rollouts for the current input.
  - $\mathrm{std}(\{r(o_j)\}_{j=1}^G)$: The standard deviation of rewards across all rollouts for the current input. This normalization helps stabilize training and provides a relative measure of quality for each rollout.
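A minimal sketch of this group-relative normalization; the small `eps` guard against a zero standard deviation is an added numerical assumption, not part of Equation (2).

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Normalize each rollout's reward against its own group (Equation 2)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)   # eps avoids division by zero

# e.g. four rollouts of one query scored by a verifiable reward
print(group_relative_advantages([0.9, 0.1, 0.5, 0.5]))
```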
The RLVR objective to be minimized is a Proximal Policy Optimization (PPO)-like clipped objective, given by Equation (3):
$ \mathcal{L}_{\mathrm{RLVR}}(\theta) = - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}},\ \{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}} \Bigg[ \frac{1}{G} \sum_{j=1}^G \min \Bigg\{ \frac{\pi_{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)} \hat{A}_j,\ \mathrm{clip}\left( \frac{\pi_{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_j \Bigg\} \Bigg] $
Here:
- $\mathcal{L}_{\mathrm{RLVR}}(\theta)$: The RLVR loss function.
- $\mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}}}$: Expectation over input samples.
- $\frac{1}{G} \sum_{j=1}^G$: Averaging over the $G$ rollouts.
- $\min\{\cdot, \cdot\}$: Takes the minimum of two terms, which is characteristic of the PPO clipped objective.
- $\frac{\pi_{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)}$: The probability ratio (also called importance sampling ratio), which measures how much more or less likely the current policy $\pi_{\theta}$ generates rollout $o_j$ compared to the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $\hat{A}_j$: The advantage calculated from Equation (2). This term guides the policy update towards actions with higher advantages.
- $\mathrm{clip}(\cdot, 1-\epsilon, 1+\epsilon)$: A function that clips the probability ratio to lie within the range $[1-\epsilon, 1+\epsilon]$.
- $\epsilon$: The clipping hyperparameter (e.g., 0.2). Clipping prevents overly large policy updates that could destabilize training.

The objective maximizes the expected advantage of actions, weighted by their probability ratio, but clips this ratio to ensure updates do not move too far from the old policy. The paper omits the KL divergence term often found in PPO for simplicity.
4.2.2. Gradient Analysis of SFT and RLVR
The authors analyze the gradients of SFT and RLVR to show their structural similarity.
4.2.2.1. Gradient of SFT
The gradient of SFT can be derived from Equation (1) as (Equation 4):
$ \nabla_{\theta} \mathcal{L}_{\mathrm{SFT}}(\theta) = - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}}} \left[ \nabla_{\theta} \log \pi_{\theta}(y \mid v, t) \right] $
Here:
- $\nabla_{\theta} \mathcal{L}_{\mathrm{SFT}}(\theta)$: The gradient of the SFT loss with respect to the model parameters $\theta$.
- $\nabla_{\theta} \log \pi_{\theta}(y \mid v, t)$: The gradient of the log-probability of the ground-truth label given the inputs, with respect to $\theta$. This gradient pushes the model to increase the probability of generating the ground-truth output.
4.2.2.2. Gradient of RLVR
The gradient of RLVR can be derived from Equation (3) using the approximation $\pi_{\theta} \approx \pi_{\theta_{\mathrm{old}}}$ and the log-derivative trick, omitting the clip operation for simplicity (Equation 5):
$ \nabla_{\theta} \mathcal{L}_{\mathrm{RLVR}}(\theta) = - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}},\ \{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G} \sum_{j=1}^G \hat{A}_j \nabla_{\theta} \log \pi_{\theta}(o_j \mid v, t) \right]_{\theta \approx \theta_{\mathrm{old}}} $
Here:
- $\nabla_{\theta} \mathcal{L}_{\mathrm{RLVR}}(\theta)$: The gradient of the RLVR loss with respect to the model parameters $\theta$.
- $\mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}}}$: Expectation over input samples.
- $\{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}$: The rollouts are sampled from the old policy.
- $\hat{A}_j$: The advantage estimate for rollout $o_j$.
- $\nabla_{\theta} \log \pi_{\theta}(o_j \mid v, t)$: The gradient of the log-probability of the rollout given the inputs, with respect to $\theta$.
- $\theta \approx \theta_{\mathrm{old}}$: Indicates that the policy for which the gradient is computed is close to the old policy (an approximation used in on-policy methods like GRPO to simplify gradient estimation).

This gradient pushes the model to increase the probability of generating rollouts that yielded high advantages. The authors highlight that both the SFT and RLVR gradients involve $\nabla_{\theta} \log \pi_{\theta}(\cdot \mid v, t)$, differing mainly in the guidance signal ($y$ vs. $o_j$) and the coefficient (1 vs. $\hat{A}_j$).
4.2.3. Objective of ViSurf
To combine SFT and RLVR into a single stage, ViSurf designs an objective function whose gradient naturally combines both SFT and RLVR gradients. The key insight is to include the ground-truth label as a high-reward sample within the RLVR framework.
Instead of just $\{o_j\}_{j=1}^G$, the set of samples for advantage calculation becomes $\{y\} \cup \{o_j\}_{j=1}^G$. Consequently, the rewards are $\{r(y)\} \cup \{r(o_j)\}_{j=1}^G$.

This formulation modifies the advantage calculation for rollouts as follows (Equation 6):
$ \hat{A}_j = \frac{r(o_j) - \mathrm{mean}\left( \{r(y)\} \cup \{r(o_j)\}_{j=1}^G \right)}{\mathrm{std}\left( \{r(y)\} \cup \{r(o_j)\}_{j=1}^G \right)} $
Here:
- $\hat{A}_j$: The modified advantage for the $j$-th rollout $o_j$.
- $r(o_j)$: The reward for rollout $o_j$.
- $\mathrm{mean}(\{r(y)\} \cup \{r(o_j)\}_{j=1}^G)$: The mean reward over the combined set of the ground-truth label and all rollouts.
- $\mathrm{std}(\{r(y)\} \cup \{r(o_j)\}_{j=1}^G)$: The standard deviation of rewards over that combined set. The advantage of rollouts is now calculated relative to a pool that also includes the ground-truth label, which typically has a high reward.

And the advantage of the ground truth is calculated as (Equation 7):
$ \hat{A}_y = \frac{r(y) - \mathrm{mean}\left( \{r(y)\} \cup \{r(o_j)\}_{j=1}^G \right)}{\mathrm{std}\left( \{r(y)\} \cup \{r(o_j)\}_{j=1}^G \right)} $
Here:
- $\hat{A}_y$: The advantage for the ground-truth label $y$.
- $r(y)$: The reward obtained for the ground-truth label $y$; this is typically high, since it is the correct answer. The mean and standard deviation are the same as those used for calculating $\hat{A}_j$ (see the sketch below).
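A small sketch of how Equations (6) and (7) change the computation once the label reward joins the pool; the `eps` guard for numerical stability is an added assumption.

```python
import numpy as np

def visurf_advantages(rollout_rewards: list[float], label_reward: float, eps: float = 1e-6):
    """Advantages when the ground-truth label joins the rollout group (Eqs. 6-7)."""
    pool = np.asarray([label_reward] + list(rollout_rewards), dtype=np.float64)
    mean, std = pool.mean(), pool.std() + eps
    adv_rollouts = (np.asarray(rollout_rewards) - mean) / std   # \hat{A}_j
    adv_label = (label_reward - mean) / std                     # \hat{A}_y
    return adv_rollouts, adv_label

# If rollouts are weak, the label's advantage is strongly positive:
print(visurf_advantages([0.1, 0.2, 0.0, 0.3], label_reward=1.0))
```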
The objective of ViSurf is to minimize Equation (8), which extends the clipped RLVR objective of Equation (3) with an additional clipped term for the ground-truth label and averages over the $G+1$ samples:
$ \mathcal{L}_{\mathrm{ViSurf}}(\theta) = - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}},\ \{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}} \Bigg[ \frac{1}{G+1} \Bigg( \sum_{j=1}^G \min \Bigg\{ \frac{\pi_{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)} \hat{A}_j,\ \mathrm{clip}\left( \frac{\pi_{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_j \Bigg\} + \min \Bigg\{ \frac{\pi_{\theta}(y \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(y \mid v, t)} \hat{A}_y,\ \mathrm{clip}\left( \frac{\pi_{\theta}(y \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(y \mid v, t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_y \Bigg\} \Bigg) \Bigg] $
The practical meaning of this objective is easiest to read from its gradient, derived below as Equation (9), which makes the simultaneous RLVR and SFT contributions explicit.
4.2.3.1. Algorithm 1: ViSurf Optimization Step
The pseudocode for the ViSurf Optimization Step outlines the practical training loop:
Algorithm 1: ViSurf Optimization Step
- Input: policy model $\pi_{\theta}$; reward function $r(\cdot)$; input data $\mathcal{D}_{\mathrm{input}}$; label data $\mathcal{D}_{\mathrm{label}}$
- Output: optimized policy model $\pi_{\theta}$
- for each training step do
  - Sample a batch $B_{\mathrm{input}}$ from $\mathcal{D}_{\mathrm{input}}$ and the corresponding $B_{\mathrm{label}}$ from $\mathcal{D}_{\mathrm{label}}$.
  - Update the old policy model $\pi_{\theta_{\mathrm{old}}} \leftarrow \pi_{\theta}$ (this stores the current policy for on-policy updates).
  - Sample $G$ outputs $\{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}$ for each $(v, t) \in B_{\mathrm{input}}$ (generate rollouts from the current policy).
  - Compute rewards $\{r(o_j)\}_{j=1}^G$ for the sampled outputs.
  - Compute the reward $r(y)$ for the label $y$.
  - Compute $\hat{A}_j$ and $\hat{A}_y$ through relative advantage estimation using Equations (6) and (7).
  - Update the policy model $\pi_{\theta}$ using Equation (8) (or, more practically, its gradient as shown in Equation 9).
- end for
4.2.3.2. Gradient Analysis of ViSurf
The gradient of the ViSurf objective (Equation 8) can be derived using the approximation $\pi_{\theta} \approx \pi_{\theta_{\mathrm{old}}}$ and the log-derivative trick. Omitting the clip operation for simplicity, it is given by Equation (9):
$ \nabla_{\theta} \mathcal{L}_{\mathrm{ViSurf}}(\theta) = - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}},\ \{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G+1} \left( \sum_{j=1}^G \hat{A}_j \nabla_{\theta} \log \pi_{\theta}(o_j \mid v, t) + \hat{A}_y \nabla_{\theta} \log \pi_{\theta}(y \mid v, t) \right) \right]_{\theta \approx \theta_{\mathrm{old}}} $
Here:
- $\nabla_{\theta} \mathcal{L}_{\mathrm{ViSurf}}(\theta)$: The gradient of the ViSurf loss with respect to the model parameters $\theta$.
- $\mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}}}$: Expectation over input samples.
- $\{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}$: The rollouts are sampled from the old policy.
- $\frac{1}{G+1}$: A scaling factor, since there are $G$ rollouts plus 1 ground-truth label in the consideration set.
- $\sum_{j=1}^G \hat{A}_j \nabla_{\theta} \log \pi_{\theta}(o_j \mid v, t)$: The RLVR component, similar to Equation (5), weighted by the advantages of the rollouts. It encourages the model to generate more high-reward rollouts.
- $\hat{A}_y \nabla_{\theta} \log \pi_{\theta}(y \mid v, t)$: The SFT component, weighted by the advantage of the ground-truth label. It encourages the model to generate the ground-truth output.
- $\theta \approx \theta_{\mathrm{old}}$: Indicates the approximation that the current policy is close to the old policy.

This gradient clearly shows how ViSurf simultaneously incorporates both RLVR (through advantage-weighted rollouts) and SFT (through the advantage-weighted ground truth) signals.
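The following sketch builds a surrogate loss whose gradient matches Equation (9) for a single input, assuming sequence-level log-probabilities under the current policy are available and treating the advantages as constants; the clip operation is omitted, as in the derivation, and this is an illustration rather than the paper's implementation.

```python
import torch

def visurf_surrogate_loss(logp_rollouts: torch.Tensor,   # [G] log pi_theta(o_j | v, t)
                          adv_rollouts: torch.Tensor,    # [G] \hat{A}_j from Equation (6)
                          logp_label: torch.Tensor,      # scalar log pi_theta(y | v, t)
                          adv_label: torch.Tensor        # scalar \hat{A}_y from Equation (7)
                          ) -> torch.Tensor:
    """Single-sample ViSurf surrogate whose gradient reproduces Equation (9)."""
    G = logp_rollouts.shape[0]
    rlvr_term = (adv_rollouts.detach() * logp_rollouts).sum()  # internal reinforcement
    sft_term = adv_label.detach() * logp_label                 # external supervision
    # Negated so that minimizing this loss follows the gradient of Equation (9).
    return -(rlvr_term + sft_term) / (G + 1)
```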
4.2.3.3. Relation to SFT and RLVR
To better illustrate the structure of the gradient, Equation (9) is reformulated into two distinct terms (Equation 10):
$ \nabla_{\theta} \mathcal{L}_{\mathrm{ViSurf}}(\theta) = \underbrace{ - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{input}}} \left[ \frac{1}{G+1} \sum_{j=1}^G \hat{A}_j \nabla_{\theta} \log \pi_{\theta}(o_j \mid v, t) \right]_{\theta \approx \theta_{\mathrm{old}}} }_{\mathrm{RLVR\ term}} \underbrace{ - \mathbb{E}_{(v, t) \sim \mathcal{D}_{\mathrm{label}}} \left[ \frac{1}{G+1} \hat{A}_y \nabla_{\theta} \log \pi_{\theta}(y \mid v, t) \right]_{\theta \approx \theta_{\mathrm{old}}} }_{\mathrm{SFT\ term}} $
Here:
- The first term (the RLVR term) is structurally identical to the standard RLVR gradient in Equation (5), differing only in the scaling coefficient ($\frac{1}{G+1}$ vs. $\frac{1}{G}$).
- The second term (the SFT term) resembles the SFT gradient from Equation (4), with two distinctions:
  - Distinction (i): The coefficient is weighted by $\hat{A}_y$ instead of 1. This means the SFT influence is modulated by the ground truth's advantage, rather than being a pure maximum-likelihood objective.
  - Distinction (ii): The use of the approximation $\pi_{\theta} \approx \pi_{\theta_{\mathrm{old}}}$. This implies that for the SFT signal to be effective within this RL-like framework, the ground-truth label should align with the model's internal generative preference (i.e., it is a plausible output for the model, not an out-of-distribution sample).

Crucially, Equation (9) (and its reformulation in Equation 10) integrates both the external guidance from SFT and the internal guidance from RLVR into a single gradient update.
4.2.4. Reward Control for Ground-Truth Label
The direct injection of ground-truth labels with typically high rewards could lead to reward hacking or sub-optimal learning if not carefully managed: the advantage for the ground-truth label would always be positive, potentially suppressing the relative advantages of actual rollouts even when they are good, and the label lacks reasoning traces. To address this and to ensure the ground truth aligns with self-rollouts (satisfying $\pi_{\theta} \approx \pi_{\theta_{\mathrm{old}}}$), three novel reward control strategies are proposed (a small sketch of how these controls act on the label reward follows this list):

- Aligning Ground-truth Labels with Rollouts Preference:
  - Problem: Distribution shift between ground-truth annotations and model-generated rollouts (e.g., slight whitespace differences in JSON formats can alter tokenization).
  - Strategy: Reformat ground-truth annotations to match the model's preferred output style.
  - Purpose: Minimizes the distribution shift, making the ground truth more "natural" for the model and satisfying the $\pi_{\theta} \approx \pi_{\theta_{\mathrm{old}}}$ approximation.
- Eliminating Thinking Reward for Ground-truth Labels:
  - Problem: Ground-truth labels often lack an annotated reasoning path or intermediate thinking steps. If a reward component is given for thinking format, the ground truth would inherently score zero on it, potentially biasing the model.
  - Strategy: Assign a reasoning format score of zero to ground-truth labels.
  - Purpose: Ensures the model learns to generate reasoning traces directly from its self-rollouts (where thinking steps can be generated and evaluated) without being biased by the absence of reasoning annotations in the ground truth.
- Smoothing the Reward for Ground-truth Labels:
  - Problem: If the model's self-rollouts are already high-quality, the ground-truth label's consistently high advantage might unnecessarily dominate the updates, preventing further RLVR exploration and refinement.
  - Strategy: Before advantage estimation, compare the maximum reward among generated rollouts, $\max\{r(o_j)\}_{j=1}^G$, against the ground-truth reward $r(y)$. If $\max\{r(o_j)\}_{j=1}^G \geq r(y)$, the policy model has already produced a high-quality output; in this case, set the ground-truth reward to the mean of the rollout rewards: $r(y) = \mathrm{mean}\{r(o_j)\}_{j=1}^G$.
  - Purpose: This smoothing ensures that if rollouts are already good, the advantage for the ground truth becomes approximately zero (as per Equation 7), effectively turning off the external supervision signal when it is no longer necessary and allowing the RLVR term to dominate.

These strategies, visually represented in Figure 4, dynamically modulate the influence of the SFT component within ViSurf, preventing reward hacking and ensuring an adaptive balance between external guidance and internal reinforcement.
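The sketch below illustrates how the "eliminate" and "smooth" controls could act on the label reward before advantage estimation ("align" is a data-preparation step applied to the label text itself); the function name and the reward decomposition are illustrative assumptions, not the paper's code.

```python
def control_label_reward(rollout_rewards: list[float],
                         label_answer_reward: float,
                         label_thinking_reward: float = 0.0) -> float:
    """Apply the 'eliminate' and 'smooth' controls to the ground-truth reward.

    'Align' is assumed to have already reformatted the label text to the model's
    preferred output style before any reward is computed.
    """
    # Eliminate: the label carries no annotated reasoning trace, so its
    # thinking/format component is fixed to zero instead of being scored.
    r_label = label_answer_reward + 0.0 * label_thinking_reward

    # Smooth: if the policy already matches or beats the label, pull the label
    # reward down to the rollout mean so that A_y ~ 0 and the RLVR term dominates.
    if rollout_rewards and max(rollout_rewards) >= r_label:
        r_label = sum(rollout_rewards) / len(rollout_rewards)
    return r_label

print(control_label_reward([0.2, 0.4, 0.1], label_answer_reward=1.0))  # label kept high
print(control_label_reward([0.9, 1.0, 0.8], label_answer_reward=1.0))  # label smoothed
```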
4.2.5. Optimization Analysis During Training
Building on the reward control strategies, the paper analyzes the dynamics of the terms in Equation (10) throughout training.
- Self-Adaptive Balance: The advantages $\hat{A}_j$ (for rollouts) and $\hat{A}_y$ (for the ground truth) dynamically govern the balance between the RLVR and SFT terms.
  - When the Policy Fails: If the policy fails to generate high-quality rollouts, $\hat{A}_j$ decreases (potentially becoming negative), while $\hat{A}_y$ remains high (due to the ground truth's inherent correctness, before smoothing). Consequently, the SFT term dominates the policy update, providing strong external guidance from the ground-truth label to correct the model's behavior.
  - When the Policy Succeeds: Conversely, when the policy successfully generates desirable rollouts, the smoothing reward control mechanism (described above) drives $\hat{A}_y \approx 0$. In this scenario, the SFT term's influence is minimized, and the optimization becomes dominated almost entirely by the RLVR term, allowing for continued refinement through self-reinforcement. This automatic shifting between learning modes is a core feature that makes ViSurf a powerful single-stage paradigm.
4.2.5.1. Upper Bound Analysis
- When the Policy Generates Correct Rollouts: If the old policy model already generates correct rollouts, the SFT term in Equation (10) becomes close to zero due to the smoothing strategy. In this case, the upper bound of ViSurf's performance is effectively the same as RLVR alone, as RLVR takes over the optimization.
- When the Policy Cannot Generate Desirable Rollouts: If the policy cannot generate desirable rollouts, ViSurf leverages the strong external guidance from the SFT term. In this scenario, the upper bound of ViSurf is better than using either SFT or RLVR alone, as it combines the corrective power of the ground truth with the exploratory potential of RL.
5. Experimental Setup
5.1. Datasets
The authors verify ViSurf on benchmarks across several diverse domains.
- Non-Object Segmentation:
  - Dataset: gRefCOCO [16]. This dataset is chosen because it includes queries that refer to objects that do not exist in the image (non-object cases), making it suitable for testing a model's ability to identify absence.
  - Training Data: Multi-objects-7K plus 200 non-object examples, adapted from VisionReasoner [21]. The 200 non-object examples are generated by providing unanswerable questions to train the model to output an empty list.
  - Domain: Visual grounding, specifically referring expression segmentation with a focus on non-object scenarios.
- Reasoning Segmentation:
  - Dataset: ReasonSeg [12]. This dataset is designed to test scenarios where correct segmentation requires complex visual reasoning beyond simple object detection.
  - Training Data: Multi-objects-7K, as proposed in VisionReasoner [21].
  - Domain: Visual segmentation requiring logical inference.
- GUI Grounding:
  - Dataset: OmniACT [11]. This dataset focuses on GUI grounding tasks for desktop and web interfaces, requiring the model to locate specific interactive elements.
  - Training Data: 6,101 samples from the training split.
  - Domain: Human-computer interaction, GUI automation.
- Anomaly Detection:
  - Dataset: RealIAD [35]. This dataset features real-world, multi-view industrial anomalies.
  - Training Data: 3,292 training samples and 2,736 test samples, ensuring disjoint sets.
  - Domain: Industrial visual inspection, quality control.
- Medical Image: Skin:
  - Dataset: Task one of ISIC2018 [4, 10] (International Skin Imaging Collaboration). This dataset focuses on lesion segmentation in dermatology.
  - Training Data: 2,594 training samples and 1,000 test samples.
  - Domain: Medical imaging, dermatological diagnosis.
- MathVista:
  - Dataset: MathVista-testmini [24]. This benchmark includes 1,000 diverse mathematical and visual tasks, testing LVLMs' mathematical reasoning capabilities in visual contexts.
  - Training Data: Approximately 10k training samples gathered from WeMath [29], MathVision [37], Polymath [8], SceMQA [15], and Geometry3K [23].
  - Domain: Multimodal mathematical reasoning.

These datasets were chosen for their diversity across various vision-and-language tasks (segmentation, grounding, anomaly detection, medical, math) and their ability to highlight the challenges that SFT and RLVR face individually, making them effective for validating ViSurf's comprehensive performance.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

- gIoU (Generalized Intersection over Union):
  - Conceptual Definition: IoU (Intersection over Union) is a common metric for evaluating the accuracy of object detection and segmentation tasks, measuring the overlap between a predicted bounding box/mask and a ground-truth bounding box/mask. gIoU extends IoU by adding a penalty term for non-overlapping areas, which is especially useful when there is no overlap between the prediction and the ground truth (e.g., when a model predicts a box but the ground truth is empty or far away). It considers the smallest enclosing box that contains both the predicted and ground-truth boxes.
  - Mathematical Formula: $ \text{IoU} = \frac{|A \cap B|}{|A \cup B|} $, $ \text{gIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|} $
  - Symbol Explanation:
    - $A$: The predicted bounding box or segmentation mask.
    - $B$: The ground-truth bounding box or segmentation mask.
    - $A \cap B$: The intersection of $A$ and $B$.
    - $A \cup B$: The union of $A$ and $B$.
    - $C$: The smallest enclosing box that covers both $A$ and $B$.
    - $C \setminus (A \cup B)$: The area within $C$ that is not covered by $A$ or $B$.
    - $|\cdot|$: Denotes the area of the respective region.
  - gIoU ranges from -1 to 1, where 1 means perfect overlap, 0 means no overlap but touching, and negative values indicate poor predictions that are far from the ground truth.
- N-Acc (Non-object Accuracy):
  - Conceptual Definition: This metric specifically evaluates the model's ability to correctly identify when no object corresponding to a given query exists in the image. It is crucial for tasks like non-object segmentation, where the correct answer might be an empty segmentation mask.
  - Mathematical Formula: $ \text{N-Acc} = \frac{\text{Number of correctly identified non-object cases}}{\text{Total number of actual non-object cases}} $
  - Symbol Explanation:
    - Number of correctly identified non-object cases: The count of instances where the model correctly outputs an empty set or a "no object" response for a non-object query.
    - Total number of actual non-object cases: The total count of queries in the evaluation set that refer to non-existent objects.
- Acc (Accuracy):
  - Conceptual Definition: Accuracy is a basic classification metric that measures the proportion of correctly predicted instances out of the total instances.
  - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
  - Symbol Explanation:
    - Number of correct predictions: The count of instances where the model's output matches the ground-truth label.
    - Total number of predictions: The total number of instances evaluated.
- ROC_AUC (Receiver Operating Characteristic - Area Under the Curve):
  - Conceptual Definition: ROC_AUC is a performance metric for binary classification problems (or anomaly detection, which can be framed as binary classification: normal vs. anomalous). It represents the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 indicates a classifier no better than random chance.
  - Mathematical Formula: There is no single simple formula for AUC, as it is the integral of the ROC curve. The ROC curve is generated by plotting TPR vs. FPR for various classification thresholds: $ \text{TPR (Sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}} $, $ \text{FPR (1 - Specificity)} = \frac{\text{FP}}{\text{FP} + \text{TN}} $. AUC is then the area under this curve.
  - Symbol Explanation:
    - True Positives (TP): Correctly identified positive instances (e.g., correctly detected anomalies).
    - False Negatives (FN): Positive instances incorrectly identified as negative (e.g., missed anomalies).
    - False Positives (FP): Negative instances incorrectly identified as positive (e.g., normal items flagged as anomalous).
    - True Negatives (TN): Correctly identified negative instances (e.g., correctly identified normal items).
- bbox_acc (Bounding Box Accuracy):
  - Conceptual Definition: This metric is used for tasks involving bounding box prediction (e.g., object detection, lesion localization). It measures the proportion of predicted bounding boxes that significantly overlap with their corresponding ground-truth bounding boxes, typically using an IoU threshold.
  - Mathematical Formula: $ \text{bbox\_acc} = \frac{\text{Number of predicted bounding boxes with IoU} > \tau}{\text{Total number of ground-truth bounding boxes}} $
  - Symbol Explanation:
    - Number of predicted bounding boxes with IoU > $\tau$: The count of predicted bounding boxes for which the IoU with the ground-truth bounding box exceeds the predefined threshold $\tau$.
    - $\tau$: A threshold, typically 0.5, meaning a predicted box is considered correct if its IoU with the ground truth is greater than 0.5.
    - Total number of ground-truth bounding boxes: The total count of actual objects/lesions to be detected in the evaluation set.

5.3. Baselines

The paper compares ViSurf against several baseline models and training paradigms:

- **Baseline:** The pre-trained Large Vision-and-Language Model (LVLM) without any specific fine-tuning for the downstream task. In the experiments, this is typically Qwen2.5VL-7B [1] or Qwen2VL-7B [38] (with SAM2 [31] for segmentation tasks where needed). It represents the performance of the model's general capabilities before specialized adaptation.
- **SFT (Supervised Fine-Tuning):** The model fine-tuned solely with the Supervised Fine-Tuning paradigm, optimizing against ground-truth labels. This serves as a baseline for external guidance.
- **RLVR (Reinforcement Learning with Verifiable Rewards):** The model fine-tuned solely with the Reinforcement Learning with Verifiable Rewards paradigm, optimizing based on objective reward functions computed from self-rollouts. This serves as a baseline for internal reinforcement.
- **SFT → RLVR (Two-Stage Fine-Tuning):** A sequential approach where the model is first fine-tuned with SFT, and the SFT-tuned model is then further fine-tuned with RLVR. This is a common strategy for combining the benefits of both paradigms. The paper notes this method's cost is estimated as the sum of the SFT and RLVR training times.

These baselines are representative because they cover the primary individual post-training methods (SFT, RLVR) and their common sequential combination. Comparing against them effectively highlights ViSurf's advantages in unification, stability, and performance.

5.3.1. Implementation Details

- **Base LVLM:** Qwen2.5VL-7B [1] is used as the base LVLM for most experiments. SAM2 [31] (Segment Anything Model 2) is adopted for tasks requiring segmentation.
- **Learning Rate:** A constant learning rate of 1e-6 is employed for all methods.
- **Batch Size:** SFT uses a batch size of 32, while RLVR and ViSurf use 16.
- **Training Steps:** All methods employ the same number of training steps for a fair comparison.
- **Reward Functions:**
  - MathVista: The reward function consists of format and accuracy rewards (see the sketch below).
  - Other tasks: Rewards are adopted from VisionReasoner [21], which include format accuracy, point accuracy, and bounding box accuracy rewards, among others. These are verifiable rewards, as they can be objectively calculated.
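As an illustration of what such a verifiable reward might look like, the sketch below composes a format reward and an accuracy reward for a MathVista-style answer. The `<answer>...</answer>` tag convention and the equal weighting of the two components are assumptions made for the example, not the paper's exact reward definition.

```python
import re

def mathvista_style_reward(response: str, gt_answer: str) -> float:
    """Toy format + accuracy reward in the spirit described above.

    Format reward: the response wraps its final answer in <answer>...</answer>.
    Accuracy reward: the extracted answer string matches the ground truth.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_reward = 1.0 if match else 0.0
    pred = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if pred == gt_answer.strip() else 0.0
    return format_reward + accuracy_reward

print(mathvista_style_reward("<think>...</think><answer>42</answer>", "42"))  # 2.0
print(mathvista_style_reward("the answer is 42", "42"))                       # 0.0
```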
6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Comparison of Different Training Paradigms

The following are the results from Table 1 of the original paper:

| Method | gRefCOCO val gIoU | gRefCOCO val N-Acc | ReasonSeg val gIoU | ReasonSeg test gIoU | OmniACT test Acc | RealIAD subset ROC_AUC | ISIC2018 test Bbox_Acc | MathVista test-mini Acc | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 41.6 | 3.3 | 56.9 | 52.1 | 60.4 | 50.1 | 78.8 | 68.2 | 56.2 |
| SFT | 33.4 | 1.8 | 65.5 | 60.3 | 55.4 | 50.0 | 90.3 | 68.3 | 56.1 |
| RLVR | 42.8 | 0.0 | 63.8 | 63.2 | 65.5 | 50.0 | 90.3 | 71.2 | 56.1 |
| SFT → RLVR | 65.0 | 52.1 | 57.2 | 55.2 | 64.5 | 66.9 | 93.6 | 68.5 | 65.4 |
| ViSurf | 66.6 | 57.1 | 66.5 | 65.0 | 65.6 | 69.3 | 94.7 | 71.6 | 69.6 |

**Analysis:**

- **ViSurf's Overall Superiority:** ViSurf consistently outperforms all other methods across nearly all benchmarks, achieving the highest average score of 69.6%. This demonstrates its effectiveness as a unified post-training paradigm. Its average relative gain of 38.6% over the Baseline model highlights its significant impact.
- **Performance in Challenging Domains:** ViSurf shows particularly strong gains on Non-Object gRefCOCO (gIoU 66.6 vs. Baseline 41.6; N-Acc 57.1 vs. Baseline 3.3) and Anomaly RealIAD (ROC_AUC 69.3 vs. Baseline 50.1). These are domains where the Baseline exhibits low competency or RLVR struggles, indicating ViSurf's efficacy in addressing tasks exceeding the model's initial knowledge base by leveraging external SFT guidance.
- **SFT's Limitations:** SFT often leads to sub-optimal performance. For instance, on Non-Object gRefCOCO, SFT performs worse than the Baseline in gIoU (33.4 vs. 41.6), and its N-Acc is also low. On OmniACT, SFT performance (55.4) degrades compared to the Baseline (60.4), which the authors attribute to potential test data contamination during pre-training.
- **RLVR's Limitations:** While RLVR often achieves superior performance on tasks aligned with the model's knowledge base (e.g., strong test gIoU on ReasonSeg at 63.2 and MathVista at 71.2), it performs poorly on tasks like Non-Object gRefCOCO (N-Acc 0.0, meaning it never correctly identified a non-object case) and Anomaly RealIAD (ROC_AUC 50.0, no better than random chance), sometimes even underperforming the Baseline. This corroborates the paper's motivation that RLVR struggles when tasks extend beyond the initial model's knowledge or when self-rollouts are frequently incorrect.
- **SFT → RLVR (Two-Stage) Performance:** This method generally performs better than individual SFT or RLVR and the Baseline, showing the benefit of combining the approaches (Avg 65.4%). However, it still falls short of ViSurf, suggesting ViSurf's single-stage, unified approach is more effective. For example, ViSurf's gRefCOCO gIoU (66.6) and N-Acc (57.1) are better than SFT → RLVR's (65.0 and 52.1).
Figure 1 (a visualization of example tasks and of SFT vs. RLVR performance on tasks within/exceeding the model's knowledge) and Figure 2 (radar and bar charts summarizing ViSurf's performance and catastrophic forgetting) visually support these findings. Figure 1(b) shows that RLVR performs better within the LVLM's knowledge base, while Figure 1(c) shows that SFT is better for tasks exceeding that knowledge, where RLVR can perform worse than the baseline, validating the core motivation for ViSurf. Figure 2's radar chart clearly illustrates ViSurf's superior performance across domains.

6.1.2. Catastrophic Forgetting

The following are the results from Table 2 of the original paper:

| Method | ChartQA | DocVQA_val |
| --- | --- | --- |
| Baseline | 83.8 | 94.9 |
| SFT | 80.8 | 89.6 |
| RLVR | 86.7 | 95.0 |
| SFT → RLVR | 85.0 | 92.9 |
| ViSurf | 87.4 | 95.0 |

**Analysis:**

- **SFT and SFT → RLVR Suffer:** SFT shows a clear performance degradation on both ChartQA (80.8 vs. Baseline 83.8) and DocVQA_val (89.6 vs. Baseline 94.9), indicating catastrophic forgetting of general VQA capabilities. The two-stage SFT → RLVR also experiences performance drops compared to the Baseline, though less severe than pure SFT, suggesting that the initial SFT phase already induces some forgetting.
- **RLVR and ViSurf Robustness:** Both RLVR and ViSurf demonstrate robustness against catastrophic forgetting. RLVR maintains or slightly improves performance over the Baseline (86.7 on ChartQA, 95.0 on DocVQA_val). ViSurf achieves the best VQA performance (87.4 on ChartQA, 95.0 on DocVQA_val), outperforming the Baseline, SFT, and SFT → RLVR. This validates ViSurf's ability to preserve general knowledge while learning new tasks, a significant advantage over SFT.

6.1.3. ViSurf on Other Models

The following are the results from Table 3 of the original paper (Qwen2VL-7B as the base model):

| Method | RealIAD subset ROC_AUC | ISIC2018 test Bbox_Acc |
| --- | --- | --- |
| Baseline | 60.0 | 51.8 |
| SFT | 56.7 | 94.2 |
| RLVR | 57.1 | 90.5 |
| SFT → RLVR | 67.5 | 94.6 |
| ViSurf | 76.0 | 95.4 |

**Analysis:**

- **Consistent Outperformance:** When applied to Qwen2VL-7B (a different base model), ViSurf consistently outperforms its counterparts on both RealIAD and ISIC2018. This demonstrates the generalizability and robustness of ViSurf across different LVLM architectures.
- **Pure RLVR's Weakness:** The pure RLVR approach performs comparatively weakly on both datasets and even underperforms the Baseline on RealIAD (57.1 vs. 60.0). This strongly indicates that external supervision (as provided by SFT and integrated into ViSurf) is critical, especially when the model's self-rollouts are unreliable or insufficient for learning.
## 6.2. Ablation Studies / Parameter Analysis

The following are the results from Table 4 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td colspan="3">Reward Control Strategy</td> <td rowspan="2">gRefCOCO val gIoU</td> <td rowspan="2">gRefCOCO val N-Acc</td> <td rowspan="2">ReasonSeg val gIoU</td> <td rowspan="2">MathVista test-mini Acc</td> </tr> <tr> <td>Align</td> <td>Eliminate</td> <td>Smooth</td> </tr> </thead> <tbody> <tr> <td></td> <td></td> <td></td> <td>59.0</td> <td>40.2</td> <td>63.6</td> <td>—</td> </tr> <tr> <td>×</td> <td>✓</td> <td>✓</td> <td>72.9</td> <td>74.1</td> <td>58.2</td> <td>67.1</td> </tr> <tr> <td>✓</td> <td>×</td> <td>✓</td> <td>61.0</td> <td>45.7</td> <td>62.7</td> <td>66.8</td> </tr> <tr> <td>✓</td> <td>✓</td> <td>×</td> <td>66.6</td> <td>57.1</td> <td>66.5</td> <td>71.6</td> </tr> </tbody> </table></div>

**Analysis of Reward Control Strategies:**

* **Base (No Reward Control):** The first row (empty for `Align`, `Eliminate`, `Smooth`) shows the performance without any of the proposed reward control strategies and serves as a base for comparison.
* **Aligning Ground-truth Labels with Rollout Preference (`Align`):**
    * **Ablation (×):** When `Align` is removed (second row: `×`, `✓`, `✓`), there is a significant drop in `ReasonSeg` val gIoU (`58.2` vs. `66.5` with `Align`) and `MathVista` test-mini accuracy (`67.1` vs. `71.6`).
    * **Conclusion:** This underscores the critical importance of aligning the ground-truth data format with the model's `rollout` preferences. It empirically validates the theoretical requirement of $\pi_{\boldsymbol{\theta}} \approx \pi_{\boldsymbol{\theta}_{old}}$ by showing that even subtle `distribution shifts` (such as `whitespace` in `JSON`) can negatively impact performance. `Align` is not applicable to `MathVista` according to the table footnote, so the `—` entry in the first row is expected. However, the second row reports `74.1` for `gRefCOCO` val N-Acc, which is unexpectedly high without `Align`. This might be a typo, or it indicates a more complex interaction in which `Align` is critical for gIoU and for `ReasonSeg`/`MathVista` but less so, or even detrimental, for N-Acc when combined with the other controls. The authors state that removing `Align` leads to "consistent performance degradation across multiple datasets," so the `gRefCOCO` val N-Acc result without `Align` is an outlier relative to that general claim.
* **Eliminating Thinking Reward for Ground-truth Labels (`Eliminate`):**
    * **Ablation (×):** When `Eliminate` is removed (third row: `✓`, `×`, `✓`), performance drops on `ReasonSeg` (`62.7` vs. `66.5`) and `MathVista` (`66.8` vs. `71.6`). However, `gRefCOCO` (gIoU `61.0`, N-Acc `45.7`) still improves over the no-control base (first row) even without `Eliminate`.
    * **Conclusion:** This suggests that for tasks requiring complex inference and reasoning (`ReasonSeg`, `MathVista`), explicitly encouraging the model to generate `reasoning traces` (by *not* giving a reasoning reward to the ground truth) is beneficial. For simpler tasks like `gRefCOCO`, where direct answers are often sufficient, removing the thinking reward for the ground truth is less critical and might even be slightly detrimental to gIoU if the model tries to generate unnecessary thoughts. The authors observe that for `gRefCOCO`, "omitting the reasoning step yields superior performance," which aligns with this ablation result.
* **Smoothing the Reward for Ground-truth Labels (`Smooth`):**
    * **Ablation (×):** When `Smooth` is removed (fourth row: `✓`, `✓`, `×`), there is a performance decline across all datasets (e.g., `gRefCOCO` gIoU `66.6` vs. `72.9` with `Smooth`, `ReasonSeg` gIoU `65.0` vs. `66.5` with `Smooth`).
    * **Conclusion:** This validates the necessity of reward smoothing. Without it, the `SFT term` (ground truth) can continue to dominate even when the `rollouts` are already high-quality, hindering `RLVR`'s ability to further refine the policy. Smoothing ensures that the `SFT` signal is only applied when truly needed, allowing `RLVR` to take over once the model is capable.

### 6.2.1. In-depth Analysis

#### 6.2.1.1. Entropy Analysis During Training

Figure 5 shows the entropy dynamics for `RLVR`, `SFT → RLVR`, and `ViSurf`.

*The figure is a chart of the entropy dynamics of `RLVR`, `SFT → RLVR`, and `ViSurf`: `ViSurf` shows an initial downward trend followed by slow convergence, reflecting how entropy evolves under the different training methods.*

* **Observation:** `ViSurf` exhibits an initial drop in `entropy`, indicating that the model is rapidly `fitting` the external guidance provided by the ground-truth labels. After this initial phase, however, `ViSurf` converges at a slower rate than `RLVR` and `SFT → RLVR`.
* **Analysis:** This behavior is desirable. The initial `entropy` drop suggests `ViSurf` quickly leverages `SFT` signals to learn the desired output distribution. The subsequent slower convergence implies that `ViSurf` avoids `entropy collapse` (where the policy becomes overly confident and stops exploring) and maintains a healthier balance between `exploration` and `exploitation`, possibly due to its adaptive reward control mechanisms. This contributes to better generalization and robustness.

#### 6.2.1.2. Training Stability

Figure 6 illustrates the performance on `gRefCOCO` (gIoU) across different training steps for `RLVR`, `SFT → RLVR`, and `ViSurf`.

*The figure is a chart of gIoU across training steps for each method: `ViSurf` is more stable during training, with clearly better performance than `RLVR` and `SFT → RLVR` at 200 and 300 steps.*

* **Observation:** Models trained with `ViSurf` demonstrate greater stability. While `RLVR` and `SFT → RLVR` exhibit performance declines with longer training (after certain steps, their gIoU starts to decrease), `ViSurf` maintains or slightly improves its performance.
* **Analysis:** This confirms the effectiveness of `ViSurf`'s design. The integrated external guidance, managed by the reward control strategies, acts as a constraint that stabilizes the training process, preventing the model from `drifting` or `overfitting` in undesirable ways over extended training.

#### 6.2.1.3. Boundary Analysis

As observed in Table 1, the performance gain of `ViSurf` is correlated with the `Baseline` model's initial performance:

* **Low Baseline Performance (e.g., Non-Object gRefCOCO, Anomaly RealIAD):** When the `Baseline` performs poorly (e.g., below `50%`), indicating its inadequacy for the task, `ViSurf` yields a substantial improvement. This highlights `ViSurf`'s ability to inject new knowledge and correct deficiencies, leveraging the `SFT` component when `RLVR` alone would struggle.
* **High Baseline Performance (e.g., ISIC2018):** When the `Baseline` already achieves high performance (e.g., above `50%` or `70%`), `ViSurf` still provides incremental gains, but the `upper bound` of the method aligns with that of `RLVR` alone. This corroborates the theoretical analysis from Section 3.5, where the `SFT Term` diminishes when the policy can already generate desirable `rollouts`, allowing `RLVR` to dominate and further refine the policy, as sketched below.
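To make this adaptive behavior and the three reward control strategies ablated in Table 4 concrete, the following is a minimal sketch of how a ground-truth label could be injected into a rollout group with `Align`, `Eliminate`, and `Smooth` applied. All function and variable names (`build_visurf_group`, `accuracy_fn`, `format_fn`, `smooth_threshold`) and the specific capping rule are assumptions for illustration; this is a reconstruction from the paper's description, not the authors' implementation.

```python
from typing import Callable, List, Tuple

def build_visurf_group(
    rollouts: List[str],
    gt_sequence: str,
    accuracy_fn: Callable[[str], float],  # verifiable accuracy reward in [0, 1]
    format_fn: Callable[[str], float],    # thinking/format reward in [0, 1]
    format_weight: float = 0.5,
    smooth_threshold: float = 0.9,
) -> Tuple[List[str], List[float]]:
    """Illustrative ViSurf-style group construction with the three reward controls."""
    # (1) Align: gt_sequence is assumed to be serialized exactly as the policy
    # prefers to emit answers (e.g., identical JSON whitespace), so injecting it
    # does not introduce a distribution shift relative to the rollouts.

    rollout_rewards = [accuracy_fn(o) + format_weight * format_fn(o) for o in rollouts]

    # (2) Eliminate: the injected label carries only the final answer, so it
    # receives no thinking/format reward; reasoning is still rewarded for the
    # model's own rollouts.
    gt_reward = accuracy_fn(gt_sequence)

    # (3) Smooth: once the policy's own rollouts are already high quality, shrink
    # the ground-truth reward so the SFT-like term stops dominating and the RLVR
    # signal refines the policy instead. (This capping rule is an assumption.)
    best_rollout_reward = max(rollout_rewards, default=0.0)
    if best_rollout_reward >= smooth_threshold:
        gt_reward = min(gt_reward, best_rollout_reward)

    return rollouts + [gt_sequence], rollout_rewards + [gt_reward]
```

Under this sketch, once the best rollout already earns a near-maximal reward, the injected label no longer stands out within the group, which mirrors the boundary behavior described above.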
#### 6.2.1.4. Reduce the Burden of Prompt Design

The following are the results from Table 5 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td>Method</td> <td>Detailed Prompt</td> <td>ReasonSeg val (gIoU)</td> <td>ReasonSeg test (gIoU)</td> </tr> </thead> <tbody> <tr> <td rowspan="2">RLVR</td> <td>✗</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <td>✓</td> <td>66.0</td> <td>63.2</td> </tr> <tr> <td rowspan="2">ViSurf</td> <td>✗</td> <td>62.3</td> <td>57.8</td> </tr> <tr> <td>✓</td> <td>66.4</td> <td>65.0</td> </tr> </tbody> </table></div>

**Analysis:**

* **RLVR's Dependence on the Prompt:** `RLVR` relies heavily on explicit instructions (a `detailed formatting prompt`) to guide the model towards generating `rollouts` in a specific format. Without a detailed prompt (✗), `RLVR` fails completely, yielding `0.0` gIoU. This shows its vulnerability to poor `prompt engineering`.
* **ViSurf's Robustness:** `ViSurf`, in contrast, achieves satisfying results even without a detailed formatting prompt (`62.3` val gIoU, `57.8` test gIoU). While performance improves with a detailed prompt, `ViSurf` does not collapse without one. This demonstrates that `ViSurf`'s incorporation of external guidance (ground-truth labels) provides an implicit "desired format" signal, reducing the dependency on manual `prompt engineering` and making the method more robust.

#### 6.2.1.5. Training Cost

The following are the results from Table 6 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td>Method</td> <td>Mem / GPU (G) ↓</td> <td>Time / Step (s) ↓</td> </tr> </thead> <tbody> <tr> <td>SFT</td> <td>97.7</td> <td>9.0</td> </tr> <tr> <td>RLVR</td> <td>81.8</td> <td>22.7</td> </tr> <tr> <td>SFT → RLVR</td> <td>97.9</td> <td>31.7</td> </tr> <tr> <td>ViSurf</td> <td>81.8</td> <td>22.9</td> </tr> </tbody> </table></div>

**Analysis:**

* **Memory Efficiency:** `RLVR` and `ViSurf` are more memory-efficient (`81.8 G` per GPU) than `SFT` and `SFT → RLVR` (`~97.7–97.9 G`). This is a practical advantage, allowing for training larger models or larger batches under memory constraints.
* **Computational Cost:** `RLVR` and `ViSurf` incur a higher computational cost per step (`~22.7–22.9 s`) than `SFT` (`9.0 s`). This is attributable to the overhead of generating multiple `rollouts` and evaluating rewards in each step.
* **Comparison to Two-Stage:** `SFT → RLVR` has the highest time per step (`31.7 s`), which is expected as it combines the costs, and its overall training time is the sum of the `SFT` and `RLVR` stages. `ViSurf` matches the per-step cost of `RLVR` while achieving superior performance in a single stage, making it more efficient overall than the sequential approach.
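A practical note on why `ViSurf`'s per-step cost in Table 6 essentially matches `RLVR`'s: injecting the ground-truth label adds only one extra sequence to each rollout group before group-relative advantages are computed. The snippet below sketches that normalization step, assuming a GRPO-style estimator (per-group mean/std normalization); the paper's exact objective and advantage formulation may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each reward by the group mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group: four rollout rewards plus the injected ground-truth reward.
advantages = group_relative_advantages([0.0, 0.2, 0.0, 1.0, 1.0])
print(advantages.round(2))  # correct sequences, including the ground truth, get positive advantages
```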
## 6.3. Comparison with State-of-the-Art Methods

The following are the results from Table 7 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td rowspan="2">Method</td> <td colspan="2">gRefCOCO val</td> <td colspan="2">ReasonSeg</td> </tr> <tr> <td>gIoU</td> <td>N-Acc</td> <td>val gIoU</td> <td>test gIoU</td> </tr> </thead> <tbody> <tr> <td>LISA-7B</td> <td>61.6</td> <td>54.7</td> <td>53.6</td> <td>48.7</td> </tr> <tr> <td>GSVA-7B</td> <td>66.5</td> <td>62.4</td> <td>-</td> <td>-</td> </tr> <tr> <td>SAM4MLLM-7B</td> <td>69.0</td> <td>63.0</td> <td>46.7</td> <td>-</td> </tr> <tr> <td>Qwen2.5VL-7B + SAM2</td> <td>41.6</td> <td>3.3</td> <td>56.9</td> <td>52.1</td> </tr> <tr> <td>SegZero-7B</td> <td>-</td> <td>-</td> <td>62.6</td> <td>57.5</td> </tr> <tr> <td>VisionReasoner-7B</td> <td>41.5</td> <td>0.0</td> <td>66.3</td> <td>63.6</td> </tr> <tr> <td>ViSurf (Qwen2.5VL-7B + SAM2)</td> <td>72.9</td> <td>74.1</td> <td>66.4</td> <td>65.0</td> </tr> </tbody> </table></div>

**Analysis:**

* **State-of-the-Art Performance:** `ViSurf` achieves state-of-the-art performance on both the `gRefCOCO` and `ReasonSeg` benchmarks.
    * On `gRefCOCO`, `ViSurf` achieves `72.9` gIoU and `74.1` N-Acc, surpassing previous state-of-the-art models such as `SAM4MLLM-7B` (`69.0` gIoU, `63.0` N-Acc) and `GSVA-7B` (`66.5` gIoU, `62.4` N-Acc).
    * On `ReasonSeg`, `ViSurf` achieves `66.4` val gIoU and `65.0` test gIoU, outperforming `VisionReasoner-7B` (`66.3` val gIoU, `63.6` test gIoU) and `SegZero-7B` (`62.6` val gIoU, `57.5` test gIoU).
* **Impact on Challenging Tasks:** The significant improvement in N-Acc on `gRefCOCO` (`74.1` vs. `63.0` for `SAM4MLLM-7B`) is particularly noteworthy, demonstrating `ViSurf`'s ability to handle `non-object` scenarios effectively, a known weakness of pure `RLVR` methods (as seen from `VisionReasoner-7B`'s `0.0` N-Acc).
* **Robustness Across Task Types:** `ViSurf`'s strong performance on both `gRefCOCO` (which includes `non-object` cases) and `ReasonSeg` (which requires complex reasoning) highlights its versatility and robust learning capabilities across different types of visual perception tasks.

## 6.4. Qualitative Results

Figure 7 provides visual examples of `ViSurf`'s performance on various tasks.

*The figure is an illustration of visual reasoning examples across different task types, including non-object, anomaly, GUI grounding, medical, and mathematical reasoning; each example includes the thinking process and a description of the corresponding object, e.g., a toy hamburger with an anomalous hole on its top surface.*

* **Observations:** The visualizations demonstrate `ViSurf`'s ability to:
    * correctly identify `non-object` cases (the `non-object` example);
    * accurately localize `anomalies` (the `anomaly` example, e.g., a hole in a toy hamburger);
    * perform `GUI grounding` by identifying the correct interactive elements;
    * handle `medical imaging` tasks;
    * solve `mathematical problems` in visual contexts, often showing a reasoning process.
* **Analysis:** These qualitative results corroborate the quantitative findings, showing that `ViSurf` can successfully produce high-quality, reasoned outputs across a diverse set of `LVLM` applications. The inclusion of `thinking` steps in some examples suggests that the model is indeed learning to generate explicit reasoning, likely encouraged by the reward function design and the `Eliminating Thinking Reward` strategy.

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary

The paper successfully introduces `ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning)`, a novel, unified, single-stage post-training paradigm for `Large Vision-and-Language Models (LVLMs)`.
Motivated by a theoretical analysis of the gradient similarities between `Supervised Fine-Tuning (SFT)` and `Reinforcement Learning with Verifiable Rewards (RLVR)`, `ViSurf` integrates their complementary benefits. Its core mechanism involves treating ground-truth labels as high-reward samples within `RLVR` rollouts, providing a simultaneous blend of external supervision and internal reinforcement. The method is further strengthened by three novel reward control strategies (aligning the ground-truth format, eliminating thinking rewards for the ground truth, and smoothing ground-truth rewards) that stabilize training and dynamically balance the `SFT` and `RLVR` influences. Extensive experiments across diverse benchmarks like `gRefCOCO`, `ReasonSeg`, `OmniACT`, `RealIAD`, `ISIC2018`, and `MathVista` demonstrate `ViSurf`'s superior performance over individual `SFT`, `RLVR`, and two-stage `SFT → RLVR` pipelines. It also effectively mitigates `catastrophic forgetting` and reduces the burden of `prompt engineering`.

## 7.2. Limitations & Future Work

The authors highlight several points for future consideration in the supplementary material:

* **Ground-Truth Label Scope:** Currently, the `ground-truth labels` in this work are primarily limited to `final answers`.
* **Incorporating Explicit Reasoning Traces:** The `ViSurf` paradigm is inherently compatible with incorporating `explicit reasoning traces` if such labels become available. This suggests an avenue for enriching the `SFT` component with intermediate reasoning steps, moving beyond just final outputs.
* **Compatibility with Advanced Techniques:** `ViSurf`'s flexibility ensures compatibility with advanced techniques like `knowledge distillation`, where `reasoning traces` from larger models could be directly incorporated. This opens doors for leveraging more powerful, potentially proprietary, models to guide the training of smaller, more efficient `ViSurf`-trained `LVLMs`.

The authors anticipate that this work will provide a foundation for future research on `LVLM` post-training, implying continued development of more sophisticated and integrated fine-tuning strategies.

## 7.3. Personal Insights & Critique

* **Elegance of Unified Objective:** The derivation of `ViSurf`'s objective, which elegantly combines the gradients of `SFT` and `RLVR`, is a significant theoretical contribution. It offers a principled way to view these two seemingly distinct paradigms as parts of a larger, unified learning process. The self-adaptive balancing mechanism, where the `SFT` influence automatically diminishes as the model's `rollouts` improve, is particularly insightful and practical.
* **Importance of Reward Control:** The three `reward control strategies` are crucial practical innovations. Without them, simply injecting ground truth into an `RL` framework could lead to `reward hacking`, `distribution shifts`, or inefficient learning. These strategies demonstrate a deep understanding of the subtle challenges in combining `SFT` and `RL` and provide concrete solutions that contribute significantly to `ViSurf`'s stability and effectiveness.
* **Robustness and Generalizability:** `ViSurf`'s empirical success across a wide array of `LVLM` tasks, its ability to mitigate `catastrophic forgetting`, and its robustness to `prompt design` are strong indicators of its practical value. The demonstration of its effectiveness on multiple base models (Qwen2.5VL-7B and Qwen2VL-7B) suggests good generalizability.
* **Potential for Broader Application:** The principles behind `ViSurf` (integrating supervised signals into a reinforcement learning loop with adaptive weighting) could potentially be transferred to domains beyond `LVLMs`, such as general `LLM` fine-tuning or robotic control, where both explicit demonstrations and reward-based exploration are valuable.
* **Critique - Equation (8) Clarity:** A notable point of critique is the clarity of Equation (8) in the paper. As presented, it contains unusual notation and a complex structure that make it difficult to interpret without further context, potentially hindering reproducibility or a beginner's understanding. While the subsequent gradient (Equation 9) is clear and effectively describes the mechanism, the abstractness of Equation (8) could be improved in future revisions for better scientific communication. Similarly, minor formatting issues and likely typos in Equation (10) and its surrounding text ($o_j \mathbf{\phi} \mid v, t$, `SFT7e r m`, $\mathcal{D}_{\mathrm{iabel}}$) suggest a need for more careful proofreading.
Overall, `ViSurf` presents a compelling and well-motivated approach to `LVLM` post-training. Its theoretical foundation, innovative practical strategies, and strong empirical results mark a significant step forward in developing more robust and capable multimodal AI systems.