Paper status: completed

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Published:10/12/2025

Fine-Tuning of Visual Language Models (1)Integration of Supervised and Reinforcement Learning (1)Post-Training Paradigm (1)Incremental Reward Control Strategies (1)Large Vision-and-Language Models (1)

Original Link PDF

Price: 0.100000

5 readers

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces ViSurf, a novel post-training paradigm integrating Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) for large vision-and-language models, enhancing performance through simultaneous external supervision and internal reinf

Abstract

Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming both individual SFT, RLVR, and two-stage SFT \textrightarrow RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

Mind Map

In-depth Reading

English Analysis~23 min read · 32,055 chars

1. Bibliographic Information

1.1. Title

The central topic of this paper is ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models. It proposes a novel fine-tuning paradigm for Large Vision-and-Language Models (LVLMs) that combines the benefits of both Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR).

1.2. Authors

The authors are Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, and Jiaya Jia. Their affiliations include CUHK (The Chinese University of Hong Kong), HKUST (The Hong Kong University of Science and Technology), and RUC (Renmin University of China). These are prominent academic institutions known for their research in computer vision, natural language processing, and artificial intelligence.

1.3. Journal/Conference

The paper is published as a preprint on arXiv (arXiv:2510.10606). arXiv is a widely used open-access repository for preprints of scientific papers in fields like physics, mathematics, computer science, and more. Publishing on arXiv allows for rapid dissemination of research findings before formal peer review and publication in a journal or conference, making it highly influential for sharing cutting-edge work in AI.

1.4. Publication Year

The paper was published at 2025-10-12T13:42:55.000Z, indicating it was made publicly available in October 2025.

1.5. Abstract

The paper addresses the limitations of typical post-training paradigms for Large Vision-and-Language Models (LVLMs): Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). While SFT provides external guidance for new knowledge injection, it often leads to sub-optimal performance and catastrophic forgetting. RLVR, using internal reinforcement, enhances reasoning but struggles with tasks beyond the model's existing knowledge.

To overcome these, the authors propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage post-training paradigm. ViSurf integrates both SFT and RLVR strengths by analyzing their objective functions and deriving a unified ViSurf objective. Its core mechanism involves injecting ground-truth labels into RLVR rollouts, providing simultaneous external supervision and internal reinforcement. The method also introduces three novel reward control strategies to stabilize and optimize training.

Extensive experiments across diverse benchmarks demonstrate ViSurf's superiority over individual SFT, RLVR, and two-stage SFT → RLVR methods. In-depth analysis validates its derivation and design principles, confirming its effectiveness in enhancing LVLMs performance and mitigating catastrophic forgetting.

1.6. Original Source Link

The original source link is https://arxiv.org/abs/2510.10606. The PDF link is https://arxiv.org/pdf/2510.10606v2.pdf. This paper is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The development of Large Vision-and-Language Models (LVLMs) is a significant direction in visual intelligence. Post-training these models often relies on two primary paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR).

SFT: This method directly optimizes models using expert-annotated data, providing explicit external guidance to help the model memorize target distributions. However, SFT often results in sub-optimal performance and can lead to catastrophic forgetting, where the model loses previously acquired pre-trained knowledge. This means SFT struggles to generalize well or retain broad capabilities.
RLVR: This approach leverages internal reinforcement signals, typically from pre-defined reward functions, to enhance reasoning capabilities and overall performance. RLVR is generally better at mitigating catastrophic forgetting and often achieves superior results. However, its performance degrades when tasks extend beyond the initial model's internal knowledge base. It relies on self-rollouts (model-generated outputs) and struggles if the model cannot internally generate a good solution or a "no object" response for certain tasks.

The core problem the paper aims to solve is that SFT and RLVR have complementary strengths and weaknesses. SFT is effective for injecting new knowledge, especially for tasks outside the model's pre-training distribution, while RLVR excels at refining reasoning within the model's existing knowledge. Existing two-stage approaches (SFT → RLVR) attempt to combine them but suffer from increased computational cost and catastrophic forgetting during the initial SFT phase.

The paper's innovative idea (entry point) is to propose a unified, single-stage paradigm called ViSurf that integrates the strengths of both SFT and RLVR more effectively. This aims to achieve the benefits of both without their typical drawbacks, particularly addressing the "no object" scenario where RLVR fails and SFT provides crucial external guidance.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

Unified Post-Training Paradigm (ViSurf): The authors propose ViSurf, a novel, single-stage post-training method that integrates the complementary benefits of SFT and RLVR. This is achieved by analyzing their underlying objective functions and gradients, theoretically demonstrating their structural similarities, and proposing a unified ViSurf objective.
Integration of Ground-Truth Labels into RL Framework: The core of ViSurf involves injecting ground-truth labels directly into RLVR rollouts. This allows for simultaneous external supervision (from SFT) and internal reinforcement (from RLVR), providing a powerful combined learning signal.
Novel Reward Control Strategies: Three new reward control strategies are introduced to stabilize and optimize the training process, specifically for handling ground-truth labels within the RL framework. These strategies include:
1. Aligning ground-truth labels with rollouts preference.
2. Eliminating thinking reward for ground-truth labels.
3. Smoothing the reward for ground-truth labels. These are crucial for preventing issues like reward hacking and ensuring proper balance between SFT and RLVR influences.
Empirical Superiority: Extensive experiments across diverse vision-and-language benchmarks (e.g., Non-Object Segmentation, Reasoning Segmentation, GUI Grounding, Anomaly Detection, Medical Image, MathVista) demonstrate that ViSurf consistently outperforms individual SFT, RLVR, and sequential SFT → RLVR pipelines.
Mitigation of Catastrophic Forgetting: ViSurf successfully mitigates catastrophic forgetting, as evidenced by its stable performance on VQA tasks like ChartQA and DocVQA, a common problem with SFT-based methods.
Robustness and Stability: ViSurf exhibits greater training stability compared to pure RLVR and SFT → RLVR, with performance gains directly related to the baseline model's initial competency, corroborating the theoretical analysis. It also reduces the burden of prompt engineering.

The key conclusions are that ViSurf provides a more effective and robust post-training paradigm for LVLMs by harmoniously combining external supervision and internal reinforcement. This addresses the limitations of both individual approaches, leading to superior performance across a wide range of tasks and enhancing model stability.

3.1. Foundational Concepts

To understand ViSurf, a reader should be familiar with the following concepts:

Large Vision-and-Language Models (LVLMs): These are advanced Artificial Intelligence models capable of understanding and generating content across both visual (images, videos) and linguistic (text) modalities. They combine the strengths of large language models (LLMs) for reasoning and language generation with powerful vision models for image understanding. They can perform tasks like image captioning, visual question answering (VQA), and referring expression segmentation.
Fine-Tuning: After a large model (like an LVLM) is pre-trained on a massive dataset for general capabilities, fine-tuning adapts it to a specific task or dataset. This involves further training the model for a relatively shorter period on a smaller, task-specific dataset, usually by continuing to optimize its parameters.
Supervised Learning: A type of machine learning where an algorithm learns from labeled training data. Each piece of input data is paired with a correct output label. The model learns to map inputs to outputs by minimizing the difference between its predictions and the ground-truth labels.
Supervised Fine-Tuning (SFT): In the context of LVLMs, SFT involves fine-tuning a pre-trained LVLM on a dataset where each input (e.g., image-text pair) has a corresponding expert-annotated correct output (ground-truth label). The model is trained to predict these correct outputs, typically by minimizing a negative log-likelihood loss.
Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives rewards or penalties for its actions, which guide it towards optimal behavior without explicit supervision. Key components include agent, environment, state, action, and reward.
Reinforcement Learning from Human Feedback (RLHF): A common RL technique used to align LLMs with human preferences. Instead of a simple reward function, RLHF uses a reward model trained on human preference data (e.g., humans ranking model outputs) to provide a scalar reward signal to the LLM.
Reinforcement Learning with Verifiable Rewards (RLVR): This is a specific type of RL used in the paper, distinct from RLHF. In RLVR, the reward function is objective and calculable based on predefined criteria, often related to the format and accuracy of the model's output. It doesn't require subjective human preference data, making it more scalable. Examples of verifiable rewards might include IoU for segmentation masks, accuracy for answers, or adherence to a specific output format (e.g., JSON).
On-Policy vs. Off-Policy RL:
- On-policy RL: The agent learns a policy and evaluates it based on data generated by that same policy. Algorithms like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) are on-policy.
- Off-policy RL: The agent learns a policy from data generated by a different policy (e.g., an older version of itself or a completely different exploratory policy). Direct Preference Optimization (DPO) is often considered off-policy in how it uses a static preference dataset.
Catastrophic Forgetting: A phenomenon observed in neural networks where training on new tasks causes a significant and abrupt loss of performance on previously learned tasks. In LVLMs, SFT can sometimes lead to catastrophic forgetting of the broad knowledge acquired during pre-training.
Log-derivative Trick: A mathematical technique used in Reinforcement Learning to estimate gradients of expected values with respect to policy parameters. It allows converting the gradient of an expectation into an expectation of a gradient, which is crucial for optimizing policies directly. Specifically, for an expectation $\mathbb{E}_{x \sim p(\theta)}[f(x)]$ , its gradient with respect to $\theta$ can be written as $\mathbb{E}_{x \sim p(\theta)}[f(x) \nabla_\theta \log p(x | \theta)]$ .
Advantage Function: In RL, the advantage function measures how much better a particular action is compared to the average action from a given state. It's often defined as $A(s, a) = Q(s, a) - V(s)$ , where Q(s, a) is the action-value function (expected return from taking action $a$ in state $s$ ) and V(s) is the state-value function (expected return from state $s$ ). In PPO and GRPO, a normalized version of the reward is often used as an advantage estimate to stabilize training.
Entropy: In information theory, entropy measures the uncertainty or randomness of a probability distribution. In RL, a policy with high entropy explores more diverse actions, while a policy with low entropy is more deterministic. Maintaining a certain level of entropy during training can encourage exploration and prevent premature convergence to sub-optimal policies.
Reward Hacking: A phenomenon in Reinforcement Learning where an agent finds unintended ways to maximize its reward function, often by exploiting flaws or ambiguities in the reward definition, without actually achieving the desired behavior or objective.

3.2. Previous Works

The paper discusses prior work in two main categories: Supervised Fine-tuning for LVLMs and Reinforcement Learning for LVLMs.

3.2.1. Supervised Fine-tuning for LVLMs

SFT is a foundational paradigm for LVLMs, adapting pre-trained models using expert-annotated data.

LLaVA [17, 13, 18]: A pioneering work that started the trend of visual instruction tuning. It involves fine-tuning LLMs on multimodal instruction-following data. The LLaVA-series refers to subsequent models built upon this foundation.
QwenVL-series [1, 38]: Another prominent series of LVLMs that employ SFT for diverse visual-language tasks.
MGM-series [14, 36, 44] and InternVL [3]: Other examples of LVLMs that successfully adopted the SFT paradigm.
Applications: SFT has been effective for various downstream applications, such as image quality assessment [41], visual counting [5], and autonomous driving [40].

Comment: The core idea behind SFT is to expose the model to high-quality examples of desired behavior. For instance, if an LVLM is trained for image captioning, SFT would involve showing it many images with corresponding human-written captions and optimizing the model to generate similar captions. The loss function typically used is cross-entropy loss or negative log-likelihood, which aims to maximize the probability of the correct output tokens given the input. For a given input (v, t) and ground-truth output $y = (y_1, y_2, \ldots, y_L)$ , the negative log-likelihood loss is: $ \mathcal{L}{\mathrm{SFT}} = - \sum{i=1}^L \log P(y_i | y_{ $P(y_i | y_{<i}, v, t; \theta)$

3.2.2. Reinforcement Learning for LVLMs

RL is also a standard method for fine-tuning LVLMs.

Direct Preference Optimization (DPO) [30]: An RL algorithm that relies on pre-collected human preference datasets. Instead of training a separate reward model, DPO directly optimizes the policy by comparing preferred and dispreferred pairs of responses. While effective, it requires costly human preference data.
Proximal Policy Optimization (PPO) [32]: A widely used on-policy RL algorithm that optimizes the policy by taking small steps to avoid large changes that could destabilize training. It typically requires a well-trained reward model to evaluate generated responses and provide feedback. The objective function of PPO aims to maximize a clipped surrogate objective to ensure policy updates are not too aggressive. The core PPO objective (without entropy bonus) is: $ \mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right] $ where $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ is the probability ratio, $\hat{A}_t$ is the advantage estimate at time step $t$ , and $\epsilon$ is a small hyperparameter for clipping. This objective encourages the new policy $\pi_\theta$ to move in the direction of higher rewards, but only within a "clipped" range relative to the old policy $\pi_{\theta_{old}}$ to maintain stability.
Group Relative Policy Optimization (GRPO) [34] and Dynamic Sampling Policy Optimization (DAPO) [42]: These are RLVR algorithms that gain attention for their ability to assess model outputs against objective, verifiable criteria. They reduce dependency on manually annotated data (like SFT) or pre-trained reward models (like PPO often needs or DPO's preference data).
Recent Works: The effectiveness of RLVR for LVLMs has been demonstrated in works like SegZero [20] and VisualRFT [22], as well as VisionReasoner [21].

3.3. Technological Evolution

The field of LVLMs has rapidly evolved from early multimodal models that simply concatenated features to complex architectures that deeply integrate vision and language.

Early Multimodal Models: Initially, models processed visual and textual information separately and then combined their representations at a later stage (e.g., simple concatenation of embeddings).
Vision-Language Pre-training: The advent of large-scale datasets and transformer architectures led to vision-language pre-training, where models learned joint representations by tasks like image-text matching or masked language modeling conditioned on images (e.g., CLIP, ALIGN).
Instruction Tuning and LLM Integration: The success of Large Language Models (LLMs) (like GPT-3) inspired visual instruction tuning (LLaVA), where LLMs were extended with visual capabilities and fine-tuned to follow multimodal instructions. This made LVLMs more conversational and versatile.
Specialized Fine-Tuning Paradigms: As LVLMs became more powerful, researchers explored various fine-tuning methods to adapt them to specific downstream tasks and improve their reasoning.
- SFT became standard for injecting task-specific knowledge.
- RLHF and RLVR emerged to align models with desired behaviors and enhance reasoning, overcoming some limitations of SFT.
- The realization that SFT and RL have complementary strengths led to approaches that sequentially combine them (SFT → RLVR).
  
  This paper's work, ViSurf, fits into this timeline as an advanced fine-tuning paradigm that seeks to unify the benefits of SFT and RLVR into a single, more efficient, and robust stage. It's a step towards more holistic and stable post-training strategies for LVLMs, moving beyond sequential or separate applications of SFT and RL.

3.4. Differentiation Analysis

Compared to existing methods, ViSurf presents several core differences and innovations:

Unified, Single-Stage Integration vs. Sequential or Separate:
- SFT: ViSurf incorporates SFT's external guidance, but unlike pure SFT, it avoids catastrophic forgetting and is enhanced by internal RLVR signals.
- RLVR: ViSurf retains RLVR's internal reinforcement but overcomes its limitation of struggling with tasks outside the model's knowledge base by proactively injecting ground-truth SFT signals. Pure RLVR can also be unstable or perform worse than baselines if rollouts are consistently poor.
- SFT → RLVR (Two-Stage): This sequential approach attempts to combine the benefits but incurs higher computational costs and can still suffer from catastrophic forgetting during the SFT phase, which is then passed on. ViSurf integrates both within a single stage, making it more efficient and stable.
Novel Objective Function: ViSurf proposes a unified objective function that naturally combines the gradients of both SFT and RLVR. Unlike prior methods that might simply add SFT and RL losses, ViSurf treats the ground-truth label as a high-reward sample within the RLVR framework, modifying the advantage calculation to incorporate its signal. This provides a theoretically grounded integration.
Self-Adaptive Balance: A key innovation is the dynamic, self-adaptive balancing between SFT and RLVR influences. When the model's self-rollouts are poor, the SFT term (guided by ground truth) dominates the update. When self-rollouts are good, the SFT influence is smoothed or eliminated, allowing RLVR to take over. This adaptive mechanism is more sophisticated than fixed weighting or sequential training.
Reward Control Strategies: The introduction of three specific reward control strategies (aligning ground truth, eliminating thinking reward, smoothing reward) is crucial for ViSurf's stability and effectiveness. These strategies address practical challenges of integrating ground truth into an RL setting, such as reward hacking and ensuring proper distribution alignment, which are not typically found in standalone SFT or RLVR.

In essence, ViSurf offers a more elegant, efficient, and robust solution for LVLM fine-tuning by providing a principled way to leverage both external expert knowledge and internal model exploration within a single, stable training loop.

4. Methodology

4.1. Principles

The core idea behind ViSurf is to unify the strengths of Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) within a single, integrated training stage. The theoretical basis or intuition is derived from the observation that the gradient forms of SFT and RLVR objectives share structural similarities. ViSurf's key principle is to treat the ground-truth label (from SFT) as a high-reward sample within the RLVR framework, thereby providing simultaneous external guidance and internal reinforcement. This allows the model to learn from explicit correct answers while also refining its reasoning and exploration capabilities based on verifiable rewards. The method also incorporates dynamic reward control mechanisms to ensure stability and adaptive balancing between these two learning signals.

4.2. Core Methodology In-depth (Layer by Layer)

The paper begins by defining the preliminary concepts, then analyzes the gradients of SFT and RLVR, and finally introduces the ViSurf objective and its associated reward control strategies.

4.2.1. Preliminary

Let $\pi_{\theta}$ denote a Large Vision-and-Language Model (LVLM), parameterized by $\theta$ .

Input Data: Both SFT and RLVR utilize the same input dataset, $\mathcal{D}_{\mathrm{input}} = \{ (v_i, t_i) \}_{i=1}^N$ $D_{input} = {(v_{i}, t_{i})}_{i = 1}^{N}$ .
- $v_i$ : a visual input (e.g., an image).
- $t_i$ : a textual input (e.g., a prompt or question).
- $N$ : the size of the dataset.

4.2.1.1. Supervised Fine-Tuning (SFT)

SFT optimizes $\pi_{\theta}$ against a set of ground-truth labels, $\mathcal{D}_{\mathrm{label}} = \{ y_i \}_{i=1}^N$ . The objective is to minimize the negative log-likelihood of these labels.

The SFT objective function is given by Equation (1): $ \mathcal{L}{\mathrm{SFT}}(\theta) = - \mathbb{E}{(v, t) \sim \mathcal{D}{\mathrm{input}}} \left[ \log \pi{\theta}(y \mid v, t) \right] $ Here:

$\mathcal{L}_{\mathrm{SFT}}(\theta)$ : The SFT loss function, which ViSurf aims to minimize.
$\theta$ : The parameters of the LVLM $\pi_{\theta}$ .
$\mathbb{E}$ : Expectation operator, averaging over samples from the data distribution.
$(v, t) \sim \mathcal{D}_{\mathrm{input}}$ : An input pair (visual $v$ , textual $t$ ) sampled from the input dataset.
$\log \pi_{\theta}(y \mid v, t)$ : The logarithm of the probability assigned by the model $\pi_{\theta}$ to the ground-truth output $y$ , given the visual and textual inputs.
$y \sim \mathcal{D}_{\mathrm{label}}$ : The ground-truth label $y$ corresponding to the input (v, t). This term implies that for each input (v,t), there is a corresponding correct output $y$ that the model should produce.

This objective encourages the model to assign high probabilities to the correct ground-truth outputs.

4.2.1.2. Reinforcement Learning with Verifiable Rewards (RLVR)

The paper illustrates RLVR using the on-policy Group Relative Policy Optimization (GRPO) algorithm. GRPO optimizes the policy $\pi_{\theta}$ using a verifiable reward function $r(\cdot)$ , which typically combines measures of output format and accuracy (e.g., IoU, factual correctness, adherence to JSON format).

For a given input $(v_i, t_i) \in \mathcal{D}_{\mathrm{input}}$ :

Rollout Generation: The old policy $\pi_{\theta_{old}}$ (from a previous optimization step, i.e., the current policy before an update) generates a group of $G$ rollouts $\{o_j\}_{j=1}^G$ . These rollouts are generated by sampling from the policy with different random seeds.
Reward Evaluation: Each rollout $o_j$ is evaluated by the reward function $r(\cdot)$ , resulting in a set of rewards \{r(o_j)\}_{j=1}^G.
Advantage Calculation: The advantage $\hat{A}_j$ for each rollout $o_j$ is computed based on its reward relative to the mean and standard deviation of rewards within the group.

The advantage calculation is given by Equation (2): $ \hat{A}j = \frac{\mathrm{r}(o_j) - \mathrm{mean}\left( {\mathrm{r}(o_j)}{j=1}^G \right)}{\mathrm{std}\left( {\mathrm{r}(o_j)}_{j=1}^G \right)} $ Here:

$\hat{A}_j$ : The estimated advantage for the $j$ -th rollout. A higher advantage means the rollout is better than average in its group.
$\mathrm{r}(o_j)$ : The reward obtained for the $j$ -th rollout $o_j$ .
$\mathrm{mean}\left( \{\mathrm{r}(o_j)\}_{j=1}^G \right)$ : The average reward across all $G$ rollouts for the current input.
$\mathrm{std}\left( \{\mathrm{r}(o_j)\}_{j=1}^G \right)$ : The standard deviation of rewards across all $G$ rollouts for the current input. This normalization helps stabilize training and provides a relative measure of quality for each rollout.

The RLVR objective is to minimize the equation, which is a Proximal Policy Optimization (PPO)-like clipped objective, given by Equation (3): $ \mathcal{L}{\mathrm{RLVR}}(\theta) = - \mathbb{E}{\Lambda (v, t) \sim \mathcal{D}{\mathrm{input}}} \Bigg[ \frac{1}{G} \sum{j=1}^G \operatorname* {min} \Bigg{ \frac{\pi_{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)} \hat{A}j, \mathrm{clip}\left( \frac{\pi{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_j \Bigg} \Bigg] $ Here:

$\mathcal{L}_{\mathrm{RLVR}}(\theta)$ : The RLVR loss function.
$\mathbb{E}_{\Lambda (v, t) \sim \mathcal{D}_{\mathrm{input}}}$ : Expectation over input samples.
$\frac{1}{G} \sum_{j=1}^G$ : Averaging over the $G$ rollouts.
$\operatorname* {min}\{A, B\}$ : Takes the minimum of two terms, which is characteristic of the PPO clipped objective.
$\frac{\pi_{\theta}(o_j \mid v, t)}{\pi_{\theta_{\mathrm{old}}}(o_j \mid v, t)}$ : The probability ratio (also called importance sampling ratio), which measures how much more or less likely the current policy $\pi_{\theta}$ generates rollout $o_j$ compared to the old policy $\pi_{\theta_{old}}$ .
$\hat{A}_j$ : The advantage calculated from Equation (2). This term guides the policy update towards actions with higher advantages.
$\mathrm{clip}(x, \text{lower}, \text{upper})$ : A function that clips the value of $x$ to be within the range $[\text{lower}, \text{upper}]$ .
$1 - \epsilon$ , $1 + \epsilon$ : The clipping bounds, where $\epsilon$ is a constant hyperparameter (e.g., 0.2). Clipping prevents overly large policy updates that could destabilize training. The objective maximizes the expected advantage of actions, weighted by their probability ratio, but clips this ratio to ensure updates are not too far from the old policy. The paper mentions it omits the KL divergence term often found in PPO for simplicity.

4.2.2. Gradient Analysis of SFT and RLVR

The authors analyze the gradients of SFT and RLVR to show their structural similarity.

4.2.2.1. Gradient of SFT

The gradient of SFT can be derived from Equation (1) as: $ \nabla_{\boldsymbol{\theta}} \mathcal{L}{\mathrm{SFT}}(\boldsymbol{\theta}) = - \mathbb{E}{(\boldsymbol{v}, t) \sim \mathcal{D}{\mathrm{input}}} \left[ \nabla{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\boldsymbol{y} \mid \boldsymbol{v}, t) \right] $ Here:

$\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mathrm{SFT}}(\boldsymbol{\theta})$ : The gradient of the SFT loss with respect to model parameters $\boldsymbol{\theta}$ .
$\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\boldsymbol{y} \mid \boldsymbol{v}, t)$ : The gradient of the log-probability of the ground-truth label $y$ given the inputs, with respect to $\boldsymbol{\theta}$ . This gradient pushes the model to increase the probability of generating the ground-truth output.

4.2.2.2. Gradient of RLVR

The gradient of RLVR can be derived from Equation (3) using the approximation $\pi_{\theta} \approx \pi_{\theta_{old}}$ and the log-derivative trick, omitting the clip operation for simplicity: $ \nabla_{\theta} \mathcal{L}{\mathrm{RLVR}}(\theta) = - \mathbb{E}{\phi \Lambda (v, t) \sim \mathcal{D}{\mathrm{input}}} \left[ \frac{1}{G} \sum{j=1}^G \hat{A}j \nabla{\theta} \log \pi_{\theta}(o_j \mid v, t) \right]{\theta \approx \theta{\mathrm{old}}} $ Here:

$\nabla_{\theta} \mathcal{L}_{\mathrm{RLVR}}(\theta)$ : The gradient of the RLVR loss with respect to model parameters $\theta$ .
$\mathbb{E}_{\phi \Lambda (v, t) \sim \mathcal{D}_{\mathrm{input}}}$ : Expectation over input samples.
$\{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}$ : The rollouts are sampled from the old policy.
$\hat{A}_j$ : The advantage estimate for rollout $o_j$ .
$\nabla_{\theta} \log \pi_{\theta}(o_j \mid v, t)$ : The gradient of the log-probability of the rollout $o_j$ given the inputs, with respect to $\theta$ .
$[\ldots]_{\theta \approx \theta_{\mathrm{old}}}$ : Indicates that the policy $\pi_{\theta}$ for which the gradient is computed is close to the old policy $\pi_{\theta_{old}}$ (an approximation used in on-policy methods like GRPO to simplify gradient estimation). This gradient pushes the model to increase the probability of generating rollouts that yielded high advantages. The authors highlight that both SFT and RLVR gradients involve $\nabla_{\theta} \log \pi_{\theta}(\cdot)$ , differing mainly in the guidance signal ( $y$ vs. $\{o_j\}_{j=1}^G$ ) and a coefficient (1 vs. $\hat{A}_j$ ).

4.2.3. Objective of ViSurf

To combine SFT and RLVR into a single stage, ViSurf designs an objective function whose gradient naturally combines both SFT and RLVR gradients. The key insight is to include the ground-truth label $y$ as a high-reward sample within the RLVR framework. Instead of just $\{o_j\}_{j=1}^G$ , the set of samples for advantage calculation becomes $y \cup \{o_j\}_{j=1}^G$ . Consequently, the rewards are $\mathbf{r}(y) \cup \{\mathbf{r}(o_j)\}_{j=1}^G$ .

This formulation modifies the advantage calculation for rollouts as follows (Equation 6): $ \hat{A}j = \frac{\mathrm{\mathbf{r}}(o_j) - \mathrm{\mathbf{mean}}\left( \mathbf{r}(y) \cup {\mathbf{r}(o_j)}{j=1}^G \right)}{\mathrm{\mathbf{std}}\left( {\mathbf{r}(y) \cup {\mathbf{r}(o_j)}_{j=1}^G } \right)} $ Here:

$\hat{A}_j$ : The modified advantage for the $j$ -th rollout $o_j$ .
$\mathrm{\mathbf{r}}(o_j)$ : The reward for rollout $o_j$ .
$\mathrm{\mathbf{mean}}\left( \mathbf{r}(y) \cup \{\mathbf{r}(o_j)\}_{j=1}^G \right)$ : The mean reward calculated over the combined set of ground-truth label and all $G$ rollouts.
$\mathrm{\mathbf{std}}\left( \{\mathbf{r}(y) \cup \{\mathbf{r}(o_j)\}_{j=1}^G \} \right)$ : The standard deviation of rewards over the combined set of ground-truth label and all $G$ rollouts. The advantage of rollouts is now calculated relative to a pool that also includes the ground-truth label, which typically has a high reward.

And the advantage of the ground-truth $y$ is calculated as (Equation 7): $ \hat{A}y = \frac{\mathrm{r}(y) - \mathrm{mean}\left( \mathrm{r}(y) \cup {\mathrm{r}(o_j)}{j=1}^G \right)}{\mathrm{std}\left( {\mathrm{r}(y) \cup {\mathrm{r}(o_j)}_{j=1}^G } \right)} $ Here:

$\hat{A}_y$ : The advantage for the ground-truth label $y$ .
$\mathrm{r}(y)$ : The reward obtained for the ground-truth label $y$ . This is typically high as it's the correct answer. The mean and standard deviation are the same as used for calculating $\hat{A}_j$ .

The objective of ViSurf is to minimize the equation (Equation 8): $ \epsilon(t^{(i)}) = - \mathbb{E}{\epsilon(t) \leq T{\epsilon(t)+1}} \left[ \frac{1}{G+1} \left( \frac{2}{\gamma \epsilon_{1+1}} \left{ \frac{\alpha_0}{\alpha_0(\epsilon_j \mid \nu_t)} \left{ \frac{\alpha_0(\epsilon_j \mid \nu_t)}{\alpha_{0,t+1}(\epsilon_j | \nu_t)} \dot{A}{j_1}, \quad \mathrm{clip}\left( \frac{\gamma \epsilon_1(\epsilon_j \mid \nu_t)}{\alpha_0(\epsilon_j | \nu_t)}, 1 - \epsilon_1, 1 + \epsilon \right) \dot{A}{j_1} \right} \right} + \operatorname* {min} \left{ \frac{\gamma \epsilon_0(\epsilon_j \mid \nu_t)}{\pi_{0,t+1}(\epsilon_j | \nu_t)} \dot{A}_1, \ldots \right} \right) \right] $ This Equation (8) as presented in the paper contains several unusual symbols (e.g., $\epsilon(t^{(i)})$ , $\nu_t$ , $\alpha_0$ , $\dot{A}_{j_1}$ , $\ldots$ ) and structure that deviates significantly from standard PPO or RL objectives, and appears to be ill-formed or highly abstract. Given the subsequent clear gradient formulation (Equation 9) and the textual description, it's likely this equation is either a placeholder, contains significant typos, or represents a very high-level conceptualization that is not directly used for practical implementation as stated. The authors' subsequent gradient analysis (Equation 9) and description of ViSurf imply a more direct integration. Therefore, while strictly reproducing it, the practical objective is better understood from its gradient in Equation (9).

4.2.3.1. Algorithm 1: ViSurf Optimization Step

The pseudocode for the ViSurf Optimization Step outlines the practical training loop:

Algorithm 1: ViSurf Optimization Step

Input: policy model $\pi_{\theta}$ ; reward function r( $\cdot$ ); input data $\mathcal{D}_{\mathrm{input}}$ ; label data $\mathcal{D}_{\mathrm{label}}$
Output: $\pi_{\theta}$
for $step = 1, . . . , M$ $s t e p = 1, ..., M$ do
1. Sample a $mini-batch B_input$ from $\mathcal{D}_{\mathrm{input}}$ and corresponding B_label from $\mathcal{D}_{\mathrm{label}}$ .
2. Update the old policy model $\pi_{\theta_{old}} \leftarrow \pi_{\theta}$ ; (This stores the current policy for on-policy updates).
3. Sample $G$ outputs $\{o_j\}_{j=1}^G \sim \pi_{\theta_{old}}$ for each $(v, t) \in B_{\mathrm{input}}$ ; (Generate rollouts from the current policy).
4. Compute rewards \{r(o_j)\}_{j=1}^G for each sampled output $o_j$ ;
5. Compute rewards r(y) for label $y \in B_{\mathrm{label}}$ ;
6. Compute $\hat{A}_j$ and $\hat{A}_y$ through relative advantage estimation using Equations (6) and (7);
7. Update the policy model $\pi_{\theta}$ using Equation (8) (or more practically, based on its gradient as shown in Equation 9);
end for

4.2.3.2. Gradient Analysis of ViSurf

The gradient of the ViSurf objective (Equation 8) can be derived using the approximation $\pi_{\boldsymbol{\theta}} \approx \pi_{\boldsymbol{\theta}_{old}}$ and the log-derivative trick. Omitting the clip operation for simplicity, it is given by Equation (9): $ \nabla_{\theta} \mathcal{L}{\mathrm{ViSurf}}(\theta) = - \mathbb{E}{\Phi(v, t) \sim \mathcal{D}{\mathrm{input}}} \left[ \frac{1}{G+1} \left( \sum{j=1}^G \hat{A}j \nabla{\theta} \log \pi_{\theta}(o_j \mid v, t) + \hat{A}y \nabla{\theta} \log \pi_{\theta}(y \mid v, t) \right) \right]{\theta \approx \theta{\mathrm{old}}} $ Here:

$\nabla_{\theta} \mathcal{L}_{\mathrm{ViSurf}}(\theta)$ : The gradient of the ViSurf loss with respect to model parameters $\theta$ .
$\mathbb{E}_{\Phi(v, t) \sim \mathcal{D}_{\mathrm{input}}}$ : Expectation over input samples.
$\{o_j\}_{j=1}^G \sim \pi_{\theta_{\mathrm{old}}}$ : The rollouts are sampled from the old policy.
$\frac{1}{G+1}$ : A scaling factor, as there are $G$ rollouts plus 1 ground-truth label in the consideration set.
$\sum_{j=1}^G \hat{A}_j \nabla_{\theta} \log \pi_{\theta}(o_j \mid v, t)$ : This term represents the RLVR component, similar to Equation (5), weighted by the advantages of the rollouts. It encourages the model to generate more high-reward rollouts.
$\hat{A}_y \nabla_{\theta} \log \pi_{\theta}(y \mid v, t)$ : This term represents the SFT component, weighted by the advantage of the ground-truth label. It encourages the model to generate the ground-truth output.
$[\ldots]_{\theta \approx \theta_{\mathrm{old}}}$ : Indicates the approximation that the current policy is close to the old policy. This gradient clearly shows how ViSurf simultaneously incorporates both RLVR (through $\hat{A}_j$ weighted rollouts) and SFT (through $\hat{A}_y$ weighted ground truth) signals.

4.2.3.3. Relation to SFT and RLVR

To better illustrate the structure of the gradient, Equation (9) is reformulated into two distinct terms (Equation 10): $ \nabla_{\theta} \mathcal{L}{\mathrm{ViSurf}}(\theta) = - \mathbb{E}{\Phi(v, t) \sim \mathcal{D}{\mathrm{input}}} \left[ \frac{1}{G+1} \sum{j=1}^G \hat{A}j \nabla{\theta} \log { \pi_{\theta}(o_j \mathbf{\phi} | v, t) } \right]{\theta \approx \theta{\mathrm{old}}} - \underbrace{ \mathbb{E}{(v, t) \sim \mathcal{D}{\mathrm{iabel}}} \left[ \frac{1}{G+1} \mathbb{\hat{A}}y \nabla{\theta} \log \pi_{\theta}(y \mid v, t) \right]{\theta \approx \theta{\mathrm{old}}} }_{\mathrm{SFT7e r m}} $ Here:

The first term: - E[...RLVR term...] is structurally identical to the standard RLVR gradient in Equation (5). The differences are a scaling coefficient ( $\frac{1}{G+1} \hat{A}_j$ vs. $\frac{1}{G} \hat{A}_j$ ) and some potential formatting anomalies like $o_j \mathbf{\phi} | v, t$ which should likely be $o_j | v, t$ .
The second term, explicitly labeled SFT7e r m (which should likely be SFT term): - E[...SFT term...] resembles the SFT gradient from Equation (4).
- Distinction (i): The coefficient is weighted by $\frac{1}{G+1} \hat{A}_y$ instead of 1. This means the SFT influence is modulated by the ground-truth's advantage, rather than being a pure maximum likelihood objective.
- Distinction (ii): The use of the approximation $\pi_{\boldsymbol{\theta}} \approx \pi_{\boldsymbol{\theta}_{old}}$ . This implies that for the SFT signal to be effective within this RL-like framework, the ground-truth label $y$ should align with the model's internal generative preference (i.e., it's a plausible output for the model, not an out-of-distribution sample). The term $\mathcal{D}_{\mathrm{iabel}}$ should likely be $\mathcal{D}_{\mathrm{input}}$ or $\mathcal{D}_{\mathrm{label}}$ for consistency.
  
  Crucially, Equation (9) (and its reformulation in 10) integrates both the external guidance from SFT and the internal guidance from RLVR into a single gradient update.

4.2.4. Reward Control for Ground-Truth Label

The direct injection of ground-truth labels with typically high rewards could lead to reward hacking or sub-optimal learning if not carefully managed. The advantage $\hat{A}_y$ for the ground-truth label $y$ would always be positive, potentially suppressing the relative advantages of actual rollouts even if they are good, and lacking reasoning traces. To address this and ensure the ground-truth aligns with self-rollouts (satisfying $\pi_{\boldsymbol{\theta}} \approx \pi_{\boldsymbol{\theta}_{old}}$ ), three novel reward control strategies are proposed:

Aligning Ground-truth Labels with Rollouts Preference:
- Problem: Distribution shift between ground-truth annotations and model-generated rollouts (e.g., slight whitespace differences in JSON formats can alter tokenization).
- Strategy: Reformat ground-truth annotations to match the model's preferred output style (e.g., ${"bbox_2d": [x1, y1, x2, y2]}$ vs. ${"bbox_2d": [ x1, y1, x2, y2 ]}$ ).
- Purpose: Minimizes the distribution shift, making the ground-truth more "natural" for the model and satisfying the $\pi_{\boldsymbol{\theta}} \approx \pi_{\boldsymbol{\theta}_{old}}$ approximation.
Eliminating Thinking Reward for Ground-truth Labels:
- Problem: Ground-truth labels often lack an annotated reasoning path or intermediate thinking steps. If a reward component is given for thinking format, the ground truth would inherently score zero on this, potentially biasing the model.
- Strategy: Assign a reasoning format score of zero to ground-truth labels.
- Purpose: Ensures the model learns to generate reasoning traces directly from its self-rollouts (where thinking steps can be generated and evaluated) without being biased by the absence of external reasoning annotations in the ground truth.
Smoothing the Reward for Ground-truth Labels:
- Problem: If the model's self-rollouts are already high-quality, the ground-truth label's consistently high advantage $\hat{A}_y$ might unnecessarily dominate the updates, preventing further RLVR exploration and refinement.
- Strategy: Before advantage estimation, compare the maximum reward among generated rollouts (\operatorname*{max}\{r(o_j)\}_{j=1}^G) against the ground-truth reward r(y). If \operatorname*{max}\{r(o_j)\}_{j=1}^G \geq r(y), it indicates the policy model $\pi_{\theta}$ has already produced a high-quality output. In this case, set the ground-truth reward to the mean of rollout rewards: r(y) = \operatorname*{mean}\{r(o_j)\}_{j=1}^G.
- Purpose: This smoothing ensures that if rollouts are already good, the advantage for the ground-truth $\hat{A}_y$ becomes approximately zero (as per Equation 7), effectively turning off the external supervision signal when it's no longer necessary. This allows the RLVR term to dominate.
  
  These strategies, visually represented in Figure 4, dynamically modulate the influence of the SFT component within ViSurf, preventing issues and ensuring an adaptive balance between external guidance and internal reinforcement.

4.2.5. Optimization Analysis During Training

Building on the reward control strategies, the paper analyzes the dynamics of the terms in Equation (10) throughout training.

Self-Adaptive Balance: The advantages $\hat{A}_j$ $\hat{A}_{j}$ (for rollouts) and $\hat{A}_y$ $\hat{A}_{y}$ (for the ground-truth) dynamically govern the balance between the RLVR and SFT terms.
- When Policy Fails: If the policy fails to generate high-quality rollouts, $\hat{A}_j$ decreases (potentially becoming negative), while $\hat{A}_y$ remains high (due to the ground-truth's inherent correctness, before smoothing). Consequently, the SFT term dominates the policy update, providing strong external guidance from the ground-truth label to correct the model's behavior.
- When Policy Succeeds: Conversely, when the policy successfully generates desirable rollouts, the smoothing reward control mechanism (described above) sets $\hat{A}_y \approx 0$ . In this scenario, the SFT term's influence is minimized, and the optimization becomes dominated almost entirely by the RLVR term, allowing for continued refinement through self-reinforcement. This automatic shifting between learning modes is a core feature that makes ViSurf a powerful single-stage paradigm.

4.2.5.1. Upper Bound Analysis

When Policy Generates Correct Rollouts: If the old policy model $\pi_{\theta_{old}}$ already generates correct rollouts, the SFT Term in Equation (10) becomes close to zero due to the smoothing strategy. In this case, the upper bound of ViSurf's performance is effectively the same as RLVR alone, as RLVR takes over the optimization.
When Policy Cannot Generate Desirable Rollouts: If the policy cannot generate desirable rollouts, ViSurf leverages the strong external guidance from the SFT term. In this scenario, the upper bound of ViSurf is better than using either SFT or RLVR alone, as it combines the corrective power of ground truth with the exploratory potential of RL.

5. Experimental Setup

5.1. Datasets

The authors verify ViSurf on benchmarks across several diverse domains.

Non-Object Segmentation:
- Dataset: gRefCOCO [16]. This dataset is chosen because it includes queries that refer to objects that do not exist in the image (non-object cases), making it suitable for testing a model's ability to identify absence.
- Training Data: Multi-objects-7K plus 200 non-object data, adapted from VisionReasoner [21]. The 200 non-object examples are generated by providing unanswerable questions to train the model to output an empty list ( $<answer>[]</answer>$ ).
- Domain: Visual grounding, specifically referring expression segmentation with a focus on non-object scenarios.
Reasoning Segmentation:
- Dataset: ReasonSeg [12]. This dataset is designed to test scenarios where correct segmentation requires complex visual reasoning beyond simple object detection.
- Training Data: Multi-objects-7K as proposed in VisionReasoner [21].
- Domain: Visual segmentation requiring logical inference.
GUI Grounding:
- Dataset: OmniACT [11]. This dataset focuses on GUI grounding tasks for Desktop and Web interfaces, requiring the model to locate specific interactive elements.
- Training Data: 6,101 samples from the training split.
- Domain: Human-Computer Interaction, GUI automation.
Anomaly Detection:
- Dataset: RealIAD [35]. This dataset features real-world, multi-view industrial anomalies.
- Training Data: 3,292 training samples and 2,736 test samples, ensuring disjoint sets.
- Domain: Industrial visual inspection, quality control.
Medical Image: Skin:
- Dataset: Task one of ISIC2018 [4, 10] (International Skin Imaging Collaboration). This dataset focuses on lesion segmentation in dermatology.
- Training Data: 2,594 training samples and 1,000 test samples.
- Domain: Medical imaging, dermatological diagnosis.
MathVista:
- Dataset: MathVista-testmini [24]. This dataset includes 1,000 diverse mathematical and visual tasks, testing LVLMs' mathematical reasoning capabilities in visual contexts.
- Training Data: Approximately 10k training data gathered from WeMath [29], MathVision [37], Polymath [8], SceMQA [15], Geometry3K [23].
- Domain: Multimodal mathematical reasoning.
  
  These datasets were chosen for their diversity across various vision-and-language tasks (segmentation, grounding, anomaly detection, medical, math) and their ability to highlight the challenges that SFT and RLVR face individually, making them effective for validating ViSurf's comprehensive performance.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

gIoU (Generalized Intersection over Union):
- Conceptual Definition: IoU (Intersection over Union) is a common metric for evaluating the accuracy of object detection and segmentation tasks, measuring the overlap between a predicted bounding box/mask and a ground-truth bounding box/mask. gIoU extends IoU by adding a penalty term for non-overlapping areas, especially useful when there is no overlap between the prediction and ground truth (e.g., when a model predicts a box but the ground truth is empty or far away). It considers the smallest enclosing box that contains both the predicted and ground-truth boxes.
- Mathematical Formula: $ \text{IoU} = \frac{|A \cap B|}{|A \cup B|} $ $ \text{gIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|} $
- Symbol Explanation:
  - $A$ : The predicted bounding box or segmentation mask.
  - $B$ : The ground-truth bounding box or segmentation mask.
  - $A \cap B$ : The intersection area of $A$ and $B$ .
  - $A \cup B$ : The union area of $A$ and $B$ .
  - $C$ : The smallest enclosing box that covers both $A$ and $B$ .
  - $C \setminus (A \cup B)$ : The area within $C$ that is not covered by $A$ or $B$ .
  - $| \cdot |$ : Denotes the area of the respective region. gIoU ranges from -1 to 1, where 1 means perfect overlap, 0 means no overlap but touching, and negative values indicate poor predictions that are far from the ground truth.
N-Acc (Non-object Accuracy):
- Conceptual Definition: This metric specifically evaluates the model's ability to correctly identify when no object corresponding to a given query exists in the image. It's crucial for tasks like non-object segmentation where the correct answer might be an empty segmentation mask.
- Mathematical Formula: $ \text{N-Acc} = \frac{\text{Number of correctly identified non-object cases}}{\text{Total number of actual non-object cases}} $
- Symbol Explanation:
  - Number of correctly identified non-object cases: The count of instances where the model correctly outputs an empty set or a "no object" response for a non-object query.
  - Total number of actual non-object cases: The total count of queries in the evaluation set that refer to non-existent objects.
Acc (Accuracy):
- Conceptual Definition: Accuracy is a basic classification metric that measures the proportion of correctly predicted instances out of the total instances.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
- Symbol Explanation:
  - Number of correct predictions: The count of instances where the model's output matches the ground-truth label.
  - Total number of predictions: The total number of instances evaluated.
ROC_AUC (Receiver Operating Characteristic - Area Under the Curve):
- Conceptual Definition: ROC_AUC is a performance metric for binary classification problems (or anomaly detection, which can be framed as binary classification: normal vs. anomalous). It represents the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 indicates a classifier no better than random chance.
- Mathematical Formula: There is no single simple formula for AUC as it's the integral of the ROC curve. The ROC curve is generated by plotting TPR vs FPR for various classification thresholds: $ \text{TPR (Sensitivity)} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $ $ \text{FPR (1 - Specificity)} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} $ AUC is then the area under this curve.
- Symbol Explanation:
  - True Positives (TP): Correctly identified positive instances (e.g., correctly detected anomalies).
  - False Negatives (FN): Positive instances incorrectly identified as negative (e.g., missed anomalies).
  - False Positives (FP): Negative instances incorrectly identified as positive (e.g., normal items flagged as anomalous).
  - True Negatives (TN): Correctly identified negative instances (e.g., correctly identified normal items).
bbox_acc (Bounding Box Accuracy):
- Conceptual Definition: This metric is used for tasks involving bounding box prediction (e.g., object detection, lesion localization). It measures the proportion of predicted bounding boxes that significantly overlap with their corresponding ground-truth bounding boxes, typically using an IoU threshold.
- Mathematical Formula: $ \text{bbox_acc} = \frac{\text{Number of predicted bounding boxes with IoU} > \tau}{\text{Total number of ground-truth bounding boxes}} $
- Symbol Explanation:
  - $Number of predicted bounding boxes with IoU >$ \tau\mathcal{D}_{\mathrm{iabel}}$$) suggest a need for more careful proofreading.
Overall, ViSurf presents a compelling and well-motivated approach to LVLM post-training. Its theoretical foundation, innovative practical strategies, and strong empirical results mark a significant step forward in developing more robust and capable multimodal AI systems.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.