SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
TL;DR Summary
SeedVR2 enables one-step high-res video restoration using diffusion adversarial post-training and adaptive window attention, enhancing quality and efficiency while stabilizing training with novel loss functions.
Abstract
Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed as SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
The title clearly states the paper's core contribution: a model named SeedVR2 that achieves video restoration (VR) in a single step. It also hints at the key techniques used: diffusion models and a training strategy called Adversarial Post-Training.
1.2. Authors
Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, and Lu Jiang.
The authors are from Nanyang Technological University and ByteDance Seed. This collaboration between a top academic institution and a leading industrial research lab suggests that the work combines rigorous academic research with a focus on practical, large-scale applications. Several authors have a strong publication record in generative models, diffusion models, and image/video restoration, lending credibility to the work.
1.3. Journal/Conference
The paper was submitted to arXiv as a preprint. The publication date listed is June 5, 2025. Given the references to CVPR 2025, it is likely a work prepared for submission to a top-tier computer vision conference. The venue arXiv is a common platform for researchers to share cutting-edge work before official peer review and publication, allowing for rapid dissemination of new ideas.
1.4. Publication Year
2025 (as per the preprint date).
1.5. Abstract
The abstract summarizes that while recent diffusion-based video restoration (VR) models produce high-quality results, their iterative inference process is computationally prohibitive. Existing one-step image restoration methods are difficult to extend to high-resolution video. This paper introduces SeedVR2, a one-step diffusion-based VR model. SeedVR2 uses adversarial post-training on real data. To address the challenges of high-resolution VR, the authors propose two main enhancements:
- An adaptive window attention mechanism: This dynamically adjusts the attention window size based on output resolution, preventing artifacts that appear in high-resolution videos when using fixed-size windows.
- Improved training procedures: This includes a series of specialized loss functions, notably a proposed feature matching loss, to stabilize the large-scale adversarial training and improve performance without a significant efficiency drop.
The abstract concludes that SeedVR2 can achieve performance comparable to or even better than existing multi-step VR methods, but in a single inference step.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2506.05301
- PDF Link: https://arxiv.org/pdf/2506.05301v1.pdf
The paper is currently available as a preprint on arXiv (version 1).
2. Executive Summary
2.1. Background & Motivation
- Core Problem: State-of-the-art video restoration (VR) methods, especially those based on diffusion models, achieve remarkable visual quality. However, they are extremely slow. Diffusion models are inherently iterative, requiring tens or even hundreds of sampling steps to generate a single video frame. This high computational cost and latency make them impractical for real-world applications, especially when dealing with long, high-resolution videos (e.g., 1080p or 2K).
- Existing Gaps: While researchers have developed methods to accelerate diffusion models for image restoration to a single step (one-step restoration), these techniques are not easily transferable to the video domain. The challenges are amplified for video due to the need for temporal consistency and the sheer volume of data. Furthermore, many acceleration techniques like distillation rely on a powerful (and slow) "teacher" model, which is computationally expensive to run for video data. There was a clear lack of an effective, high-quality, one-step solution for real-world video restoration.
- Paper's Entry Point: The authors bypass the need for a "teacher" model by directly fine-tuning a pre-trained diffusion model using an adversarial objective. This approach, termed Adversarial Post-Training (APT), trains the model to generate realistic restorations in a single step by learning from a discriminator that is also a powerful diffusion transformer. The core idea is to transform a slow, multi-step model into a fast, one-step generator without sacrificing, and potentially even improving, its generative quality.
2.2. Main Contributions / Findings
The paper presents three main contributions:
- Proposal of SeedVR2, a one-step diffusion transformer for high-quality video restoration. To the authors' knowledge, this is one of the first works to successfully demonstrate the feasibility of one-step, high-resolution video restoration using a large-scale diffusion transformer architecture.
- An adaptive window attention mechanism. This novel mechanism dynamically adjusts the attention window size based on the video's resolution. This simple yet effective design choice solves the problem of "boundary artifacts" that occur when processing high-resolution videos with fixed-size windows, significantly improving the model's robustness and output quality on arbitrary-resolution inputs.
- A set of effective training enhancements for large-scale adversarial video restoration. The authors introduce several techniques to stabilize the notoriously difficult adversarial training process for huge models (a combined 16 billion parameters for the generator and discriminator). These include:
  - Progressive Distillation: Gradually reducing the number of sampling steps before the final one-step adversarial training.
  - Improved Loss Functions: Using the RpGAN loss and an R2 regularization to prevent mode collapse, and proposing an efficient feature matching loss as a substitute for the computationally expensive LPIPS loss, which is crucial for high-resolution video training.
The key finding is that SeedVR2 can match or exceed the performance of multi-step diffusion VR models while being over 4x faster, making high-quality video restoration significantly more practical.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Diffusion Models
Diffusion models are a class of generative models that learn to create data by reversing a gradual noising process.
- Forward Process: You start with a clean data sample (e.g., a high-quality image) and repeatedly add a small amount of Gaussian noise over many steps. After enough steps, the image becomes pure noise. This process is fixed and does not involve learning.
- Reverse Process (Denoising): The model, typically a neural network like a U-Net or a Transformer, is trained to reverse this process. At each step, it takes a noisy image and predicts the noise that was added. By repeatedly subtracting the predicted noise, the model can gradually denoise a random noise sample into a clean, realistic image.
- Inference: To generate a new image, you start with pure random noise and apply the trained denoising model iteratively for a certain number of steps. This iterative nature is what makes diffusion models slow.
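To see why this iterative inference is costly, here is a minimal, illustrative DDIM-style sampling loop in Python (not the paper's code); `model`, its `(x, t)` call signature, and `alphas_cumprod` are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=50):
    """Minimal deterministic DDIM-style sampler: every step is one full network
    forward pass, which is what makes multi-step diffusion inference slow."""
    x = torch.randn(shape)  # start from pure Gaussian noise
    ts = torch.linspace(len(alphas_cumprod) - 1, 0, num_steps + 1).long()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = model(x, t_cur)                               # predict the added noise
        x0 = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()  # estimate the clean sample
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # step towards t_next
    return x
```

A one-step generator collapses this entire loop into a single call to the network, which is the efficiency gap SeedVR2 targets.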
3.1.2. Generative Adversarial Networks (GANs)
A GAN is a framework for generative modeling that involves a contest between two neural networks:
- Generator (G): This network tries to create realistic data (e.g., images) from random noise. Its goal is to fool the Discriminator.
- Discriminator (D): This network acts as a classifier, trying to distinguish between real data (from the training set) and fake data (from the Generator). The two networks are trained simultaneously. The Generator gets better at creating realistic data, and the Discriminator gets better at spotting fakes. This "adversarial" process pushes the Generator to produce increasingly high-quality samples. Unlike diffusion models, GANs can generate a sample in a single forward pass, making them very fast.
3.1.3. Adversarial Post-Training (APT)
APT is a technique to accelerate a pre-trained diffusion model, converting it into a one-step generator. It leverages the strengths of both diffusion models and GANs. The process involves taking a powerful, pre-trained multi-step diffusion model and fine-tuning it using a GAN-like objective. The diffusion model itself becomes the Generator. A Discriminator (which can also be based on a diffusion architecture) is trained to distinguish the Generator's one-step output from real data. This forces the Generator to produce high-quality outputs in a single pass, effectively distilling the knowledge of the multi-step process into a single step.
3.1.4. Video Restoration (VR)
Video Restoration is the task of improving the quality of a degraded video. This can include tasks like:
- Super-Resolution (VSR): Increasing the resolution of a low-resolution video.
- Deblurring: Removing motion or camera shake blur.
- Denoising: Removing noise.
- Artifact Removal: Removing compression artifacts. Real-world VR is particularly challenging because degradations are complex and unknown, and the model must maintain temporal consistency (i.e., the restored video should look smooth and not flicker between frames).
3.1.5. Window Attention and Rotary Positional Embedding (RoPE)
- Attention Mechanism: A core component of Transformer models. It allows the model to weigh the importance of different parts of the input sequence when producing an output. For an image or video, this means focusing on relevant pixels or patches.
- Window Attention: Standard attention is computationally expensive as it compares every pixel/patch to every other pixel/patch. Window attention limits the attention calculation to within smaller, non-overlapping or shifted "windows" of the input, making it much more efficient for high-resolution data.
- Rotary Positional Embedding (RoPE): A method for encoding positional information in Transformers. Instead of adding positional embeddings to the input features, RoPE applies a rotation to the query and key vectors in the attention mechanism based on their absolute position. It is known for its good generalization to different sequence lengths.
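As a rough illustration of why window attention is cheaper, the following toy sketch (not SeedVR2's implementation) computes single-head self-attention only within non-overlapping `p_h x p_w` windows of a feature map, so the attention cost scales with the window area rather than the full frame:

```python
import torch
import torch.nn.functional as F

def window_attention(x, p_h, p_w):
    """Toy single-head window attention over a (B, H, W, C) feature map."""
    B, H, W, C = x.shape
    assert H % p_h == 0 and W % p_w == 0, "pad the input so windows tile it exactly"
    # partition into non-overlapping windows: (B * num_windows, p_h * p_w, C)
    win = (x.view(B, H // p_h, p_h, W // p_w, p_w, C)
             .permute(0, 1, 3, 2, 4, 5)
             .reshape(-1, p_h * p_w, C))
    # attention restricted to tokens inside each window (q = k = v = win)
    attn = F.softmax(win @ win.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ win
    # reverse the partition back to (B, H, W, C)
    out = (out.view(B, H // p_h, W // p_w, p_h, p_w, C)
              .permute(0, 1, 3, 2, 4, 5)
              .reshape(B, H, W, C))
    return out
```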
3.2. Previous Works
The paper categorizes related work into three areas:
- Video Restoration (VR):
  - Traditional Methods: Works like BasicVSR [4] and [5] focused on synthetic datasets and used recurrent architectures to propagate information across frames. They often struggle with real-world textures.
  - GAN-based Methods: Models like Real-ESRGAN [70] and RealViformer [95] use GANs to generate more realistic details but are often limited in generative power compared to diffusion models.
  - Diffusion-based Methods: Recent works like UAV [97], MGLD-VSR [79], and SeedVR [67] leverage diffusion models for superior visual quality. However, they are all multi-step and slow. SeedVR, the predecessor to this work, introduced a large diffusion transformer specifically for VR, achieving state-of-the-art results but still requiring many sampling steps.
- Diffusion Acceleration:
  - Distillation-based Methods: Techniques like Progressive Distillation [53] and Consistency Distillation [60, 61] train a "student" model to match the output of a "teacher" diffusion model in fewer steps. These methods often produce blurry results when distilled to very few steps (e.g., one step).
  - Adversarial Methods: Works like Adversarial Diffusion Distillation (ADD) [56] and UFOGen [76] use adversarial training to accelerate diffusion models, often achieving better perceptual quality in one step. The APT [34] method, on which SeedVR2 is based, falls into this category and stands out for its direct fine-tuning on real data.
- One-step Restoration:
  - Most one-step restoration works focus on images. Examples include ResShift [87, 88], which modifies the initial sampling distribution, and various distillation-based [10, 15, 27] or adversarial [28] approaches. These methods are not designed for video and lack mechanisms for temporal consistency. SeedVR2 is presented as an early attempt to bring high-quality, one-step restoration to the video domain.
3.3. Technological Evolution
The field of video restoration has evolved significantly:
- Early Methods: Focused on simple algorithms and synthetic degradations.
- Deep Learning (CNNs): Models like EDVR [69] used convolutional neural networks with specialized modules like deformable convolutions to handle alignment and restoration.
- Transformers for Video: More recent models like VRT [31] started using Transformers to capture long-range dependencies in video, improving performance.
- GANs for Realism: To tackle real-world restoration, GANs were incorporated to generate realistic textures, but sometimes at the cost of fidelity (hallucinating incorrect details).
- Diffusion Models for Quality: The latest wave, including SeedVR and UAV, adopted diffusion models, which set a new standard for visual quality and realism. However, this came at a massive computational cost.
- The Next Step (This Paper): SeedVR2 represents the next logical step in this evolution: combining the quality of diffusion models with the speed of GANs to create a practical, high-performance solution for real-world video restoration.
3.4. Differentiation Analysis
Compared to previous works, SeedVR2 is different in several key ways:
- One-Step for Video: It is one of the first models to achieve high-quality video restoration in a single step, whereas most prior work focused on images.
- Teacher-Free Adversarial Training: Unlike many distillation methods that are upper-bounded by a fixed teacher model, SeedVR2 uses adversarial post-training directly on real data. This allows it to potentially surpass the performance of its initial multi-step model.
- Specialized Architecture and Training for VR: It is not a generic acceleration technique. It introduces specific enhancements tailored for high-resolution video restoration, namely the adaptive window attention to handle arbitrary resolutions and a carefully designed set of loss functions (including the novel feature matching loss) to stabilize training for this specific task.
- Scale: The paper trains an exceptionally large model (up to 16B parameters combined), demonstrating that this framework can scale effectively.
4. Methodology
4.1. Principles
The core principle of SeedVR2 is to convert a powerful but slow multi-step diffusion video restoration model (SeedVR) into an extremely fast one-step generator. This is achieved through Diffusion Adversarial Post-Training (APT). The intuition is to use a GAN framework where the pre-trained diffusion model acts as the generator. A discriminator, also a powerful transformer, is trained to distinguish the one-step output of the generator from real high-quality videos. This adversarial pressure forces the generator to learn how to produce a realistic restoration from a low-quality input in a single forward pass, effectively compressing the entire iterative denoising process into one step.
The methodology introduces several key improvements over a naive application of APT to make it work for the challenging task of high-resolution video restoration.
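Conceptually, one-step inference then looks like the following hedged sketch; the `vae`, `generator`, conditioning scheme, and noise scale `sigma` are hypothetical stand-ins rather than the actual SeedVR2 interfaces:

```python
import torch

@torch.no_grad()
def one_step_restore(vae, generator, lq_video, sigma=1.0):
    """Conceptual sketch of one-step restoration with hypothetical interfaces:
    encode the low-quality video, run the generator exactly once on a noised
    latent conditioned on the LQ latent, then decode back to pixel space."""
    lq_latent = vae.encode(lq_video)                      # video -> latent
    noisy = lq_latent + sigma * torch.randn_like(lq_latent)
    restored_latent = generator(noisy, cond=lq_latent)    # single forward pass
    return vae.decode(restored_latent)                    # latent -> video
```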
The overall model architecture is depicted in the figure below. Both the generator and discriminator are based on the Swin-MMDIT transformer architecture, but the discriminator has additional cross-attention blocks to process conditional information.
This figure is a schematic of the model architecture, showing the structure of the SeedVR2 generator and discriminator and the adaptive window attention mechanism. It depicts stacked Swin-MMDiT blocks, cross-attention and MLP modules, and the procedure for adaptively adjusting the window size.
4.2. Core Methodology In-depth
4.2.1. Preliminaries: Diffusion Adversarial Post-Training (APT)
SeedVR2 builds upon the APT framework. The training process in APT has two main stages:
- Deterministic Distillation: First, a pre-trained multi-step diffusion model (the "teacher") is distilled into a student model that uses fewer steps. This is done using a simple regression loss (Mean Squared Error) to make the student's output match the teacher's. This step helps bridge the gap between the multi-step teacher and the final one-step generator. SeedVR2 adapts this into a progressive distillation strategy.
- Adversarial Post-Training: The distilled model is then used as the generator in a GAN setup. A discriminator is initialized from the same pre-trained diffusion network and is fine-tuned to differentiate between the generator's one-step outputs and real data. The generator is fine-tuned to fool the discriminator. APT uses a non-saturating GAN loss and an R1 regularization term to stabilize training.
4.2.2. Adaptive Window Attention
A key problem with using window attention in restoration models is that a window size that works well for the training resolution (e.g., 720p) may cause artifacts at test time when dealing with much higher resolutions (e.g., 1080p, 2K). This happens because the model isn't sufficiently trained on handling window shifting and boundary conditions at these scales.
To solve this, SeedVR2 proposes an adaptive window attention mechanism.
During Training: The window patch size is not fixed but dynamically calculated based on the input feature map dimensions. Given a feature map of size $d_t \times d_h \times d_w$ (where $d_t$, $d_h$, and $d_w$ are the temporal, height, and width dimensions), the patch sizes are calculated as: $ p_t = \left\lceil \frac{\min(d_t, 30)}{n_t} \right\rceil, \quad p_h = \left\lceil \frac{d_h}{n_h} \right\rceil, \quad p_w = \left\lceil \frac{d_w}{n_w} \right\rceil $
- $p_t, p_h, p_w$: The size of the attention window along the time, height, and width dimensions.
- $d_t, d_h, d_w$: The dimensions of the input feature map.
- $n_t, n_h, n_w$: Predefined constants that determine the number of windows along each dimension (the paper fixes the spatial partition, so the number of windows per axis stays constant while the window size adapts).
- $\lceil \cdot \rceil$: The ceiling function, which rounds up to the nearest integer.
- $\min(d_t, 30)$: This caps the temporal length considered for window calculation to 30 frames, preventing a large discrepancy between training and inference sequence lengths. Since training videos can have varying aspect ratios, this formula results in various window sizes during training, making the model more robust.
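A minimal sketch of this training-time computation; the partition counts `n_t`, `n_h`, `n_w` are left as arguments because their exact values are not stated here:

```python
import math

def adaptive_window_size(d_t, d_h, d_w, n_t, n_h, n_w):
    """Window (patch) sizes following the formula above: the feature map is
    split into roughly n_t x n_h x n_w windows, with the temporal extent
    capped at 30 frames."""
    p_t = math.ceil(min(d_t, 30) / n_t)
    p_h = math.ceil(d_h / n_h)
    p_w = math.ceil(d_w / n_w)
    return p_t, p_h, p_w
```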
During Testing (Inference): For a high-resolution test input with feature dimensions $\hat{d}_h \times \hat{d}_w$, the model first calculates a "proxy resolution" that has the same aspect ratio as the test input but the same total area as the training resolution. $ \tilde{d}_h = \sqrt{d_h \times d_w \times \frac{\hat{d}_h}{\hat{d}_w}}, \quad \tilde{d}_w = \sqrt{d_h \times d_w \times \frac{\hat{d}_w}{\hat{d}_h}} $
- $\hat{d}_h, \hat{d}_w$: The height and width of the test input's feature map.
- $d_h, d_w$: The height and width of the training feature map (e.g., corresponding to the 720p training resolution).
- $\tilde{d}_h, \tilde{d}_w$: The calculated proxy dimensions. Note that $\tilde{d}_h \times \tilde{d}_w = d_h \times d_w$ (the training area is preserved) and $\tilde{d}_h / \tilde{d}_w = \hat{d}_h / \hat{d}_w$ (the test aspect ratio is preserved). The final test-time window size is then calculated using the first formula, but with the proxy dimensions $\tilde{d}_h, \tilde{d}_w$ in place of $d_h, d_w$. This ensures the window partitioning logic remains consistent with what the model saw during training, effectively eliminating boundary artifacts.
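Under the symbol reconstruction above, the test-time logic can be sketched as follows (illustrative only, not the paper's code):

```python
import math

def test_time_window_size(test_h, test_w, train_h, train_w, n_h, n_w):
    """Proxy resolution at inference: keep the training area but the test
    aspect ratio, then reuse the training-time formula on the proxy dims."""
    proxy_h = math.sqrt(train_h * train_w * test_h / test_w)
    proxy_w = math.sqrt(train_h * train_w * test_w / test_h)
    # proxy_h * proxy_w == train_h * train_w;  proxy_h / proxy_w == test_h / test_w
    p_h = math.ceil(proxy_h / n_h)
    p_w = math.ceil(proxy_w / n_w)
    return p_h, p_w
```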
4.2.3. Training Procedures
The paper introduces several enhancements to stabilize training and improve performance.
Progressive Distillation:
Instead of a single distillation step, the authors use progressive distillation. They start with a 64-step SeedVR model and progressively distill it down to a 1-step model with a stride of 2 (i.e., 64 steps -> 32 -> 16 -> 8 -> 4 -> 2 -> 1). Each distillation stage uses a simple Mean Squared Error loss and takes about 10k iterations. This gradual approach makes it easier for the model to learn the one-step generation mapping.
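The schedule can be sketched as below; this is an illustrative reconstruction rather than the released training code, and `sample_fn` is a hypothetical helper that runs the restoration sampler for a given number of steps:

```python
import copy
import torch
import torch.nn.functional as F

def progressive_distillation(model, dataloader, sample_fn,
                             schedule=(64, 32, 16, 8, 4, 2, 1),
                             iters_per_stage=10_000, lr=1e-5):
    """Illustrative progressive-distillation loop: at each stage, a frozen copy
    of the current model run with the previous (larger) step count serves as
    the teacher, and the model is regressed onto it with MSE while sampling
    with half as many steps."""
    for prev_steps, cur_steps in zip(schedule[:-1], schedule[1:]):
        teacher = copy.deepcopy(model).eval().requires_grad_(False)
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        for _, lq in zip(range(iters_per_stage), dataloader):
            with torch.no_grad():
                target = sample_fn(teacher, lq, steps=prev_steps)  # e.g. 64 steps
            pred = sample_fn(model, lq, steps=cur_steps)           # e.g. 32 steps
            loss = F.mse_loss(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model  # after the last stage, the model restores in a single step
```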
Loss Improvement: Training a 16B-parameter GAN is highly unstable. The authors incorporate several loss functions to improve stability and quality.
- RpGAN + R2 Regularization (see the sketch after this list):
  - The standard non-saturating GAN loss is replaced with the RpGAN loss [20], which is known to be more stable and less prone to mode dropping.
  - An approximated R2 regularization is added to penalize the discriminator's gradient norm on fake data. This is an alternative to R1 regularization (which penalizes gradients on real data) and further stabilizes training. The loss is calculated as: $ \mathcal{L}_{aR2} = \left\| D(\hat{\pmb{x}}, c) - D(\mathcal{N}(\hat{\pmb{x}}, \sigma \mathbf{I}), c) \right\|_2^2 $
    - $\hat{\pmb{x}}$: The fake sample generated by the generator.
    - $c$: The conditional input (the low-quality video).
    - $D(\cdot, \cdot)$: The discriminator's output score.
    - $\mathcal{N}(\hat{\pmb{x}}, \sigma \mathbf{I})$: The fake sample perturbed with small Gaussian noise, where $\sigma$ is the noise variance and $\mathbf{I}$ is the identity matrix. This loss penalizes the discriminator if its output changes significantly for a slightly perturbed input, encouraging a smoother discriminator landscape.
- Feature Matching Loss (see the sketch after this list):
  - Standard VR training often uses an L1 loss for pixel-level accuracy and an LPIPS loss for perceptual similarity. However, LPIPS requires decoding the latent output to pixel space, which is too slow for high-resolution video training.
  - The authors propose an efficient alternative: a feature matching loss. Instead of using a separate pre-trained network like VGG for perceptual features, they reuse the discriminator itself. They extract intermediate features from several layers of the discriminator for both the generated video and the ground-truth video and minimize the L1 distance between them.
  - The feature matching loss is defined as: $ \mathcal{L}_F = \frac{1}{3} \sum_{i=16,26,36} \left\| D_i^F(\hat{\pmb{x}}, c) - D_i^F(\pmb{x}, c) \right\|_1 $
    - $D_i^F(\cdot, \cdot)$: A function that extracts the feature map from the $i$-th block of the discriminator's transformer backbone. The paper uses features from the 16th, 26th, and 36th blocks.
    - $\pmb{x}$: The real (ground-truth) high-quality video.
    - $\hat{\pmb{x}}$: The fake (generated) video.
  - This loss encourages the generated video to have similar high-level feature representations to the real video, as interpreted by the discriminator, effectively acting as a perceptual loss without extra computational overhead.
The final loss for the generator is a weighted sum of the GAN loss, the L1 loss, and the feature matching loss. The discriminator is updated with the GAN loss and the R1/R2 regularization terms.
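Below is a hedged sketch of the RpGAN objective and the approximated R2 penalty described above; the discriminator call signature is a hypothetical stand-in:

```python
import torch
import torch.nn.functional as F

def rpgan_losses(d_real, d_fake):
    """Relativistic pairing GAN (RpGAN) losses from real/fake logits:
    the discriminator is asked to rank each real sample above a fake one."""
    d_loss = F.softplus(-(d_real - d_fake)).mean()
    g_loss = F.softplus(-(d_fake - d_real)).mean()
    return d_loss, g_loss

def approx_r2(discriminator, x_fake, cond, sigma=0.01):
    """Approximated R2 regularization following the formula above: penalize the
    squared change of the discriminator score under a small Gaussian perturbation
    of the fake sample (sigma is a hypothetical noise scale)."""
    noisy = x_fake + sigma * torch.randn_like(x_fake)
    return (discriminator(x_fake, cond) - discriminator(noisy, cond)).pow(2).mean()
```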
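And a corresponding sketch of the feature matching loss; the `return_features=True` interface for retrieving intermediate discriminator features is hypothetical:

```python
import torch

def feature_matching_loss(discriminator, x_fake, x_real, cond, blocks=(16, 26, 36)):
    """L1 distance between discriminator features of generated and ground-truth
    videos, averaged over a few transformer blocks, as in the formula above."""
    feats_fake = discriminator(x_fake, cond, return_features=True)
    with torch.no_grad():  # real-video features carry no generator gradient
        feats_real = discriminator(x_real, cond, return_features=True)
    losses = [(feats_fake[i] - feats_real[i]).abs().mean() for i in blocks]
    return sum(losses) / len(losses)
```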
5. Experimental Setup
5.1. Datasets
The model was trained on a large-scale dataset synthesized from ~10M image pairs and ~5M video pairs. For evaluation, a diverse set of benchmarks was used:
- Synthetic Datasets: These datasets have clean ground-truth videos, allowing for full-reference metric calculation.
  - SPMCS [80]: A video super-resolution dataset with rich textures and scenes.
  - UDM10 [64]: A dataset for video super-resolution containing diverse motions.
  - REDS30 [46]: A subset of the popular REDS dataset, widely used for video deblurring and super-resolution.
  - YouHQ40 [97]: A high-quality dataset created from YouTube videos, designed to better represent real-world content.
- Real-World Datasets: These datasets contain real low-quality videos without ground-truth counterparts, requiring no-reference metrics.
  - VideoLQ [6]: A commonly used benchmark for real-world video super-resolution, containing videos with complex, unknown degradations.
- AIGC Dataset:
  - AIGC28: A self-collected dataset of 28 videos generated by AI models. This tests the model's ability to enhance content that is already synthetically generated.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate performance from different perspectives.
5.2.1. Full-Reference Metrics (for synthetic data)
These metrics compare the restored video frame-by-frame against a ground-truth high-quality video.
-
PSNR (Peak Signal-to-Noise Ratio):
- Conceptual Definition: Measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects its fidelity. It is a classic metric for reconstruction quality. Higher PSNR generally means the reconstruction is closer to the original pixel-wise.
- Mathematical Formula: $ \text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right) $
- Symbol Explanation:
- $\text{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image).
- $\text{MSE}$: The Mean Squared Error between the ground-truth image $I$ and the restored image $K$ of size $m \times n$: $ \text{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 $. A small numerical sketch of PSNR (and a simplified SSIM) is included after this metric list.
-
SSIM (Structural Similarity Index Measure):
- Conceptual Definition: Measures the similarity between two images based on human perception, considering changes in luminance, contrast, and structure. It is generally better aligned with human judgment than PSNR. A value of 1 indicates perfect similarity.
- Mathematical Formula: $ \text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
- Symbol Explanation:
- $\mu_x, \mu_y$: The averages (means) of images $x$ and $y$.
- $\sigma_x^2, \sigma_y^2$: The variances of images $x$ and $y$.
- $\sigma_{xy}$: The covariance of $x$ and $y$.
- $c_1, c_2$: Small constants to stabilize the division.
-
LPIPS (Learned Perceptual Image Patch Similarity):
- Conceptual Definition: Measures the perceptual distance between two images. It uses deep features extracted from a pre-trained neural network (like AlexNet or VGG). Two images that are perceptually similar will have similar deep feature representations. Lower LPIPS means better perceptual quality.
- Mathematical Formula: $ d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}_{hw}^l - \hat{y}_{0hw}^l \right) \right\|_2^2 $
- Symbol Explanation:
- $d(x, x_0)$: The perceptual distance between images $x$ and $x_0$.
- $\hat{y}^l, \hat{y}_0^l$: The feature maps extracted from layer $l$ of a deep network for images $x$ and $x_0$.
- $H_l, W_l$: The spatial dimensions of the feature maps at layer $l$.
- $w_l$: A channel-wise weight to scale the importance of activations.
- $\odot$: Element-wise multiplication.
-
DISTS (Deep Image Structure and Texture Similarity):
- Conceptual Definition: A perceptual quality metric that explicitly models both structural similarity and texture similarity using features from a deep network. It is designed to be robust to local texture variations while being sensitive to structural distortions. Lower is better.
- Mathematical Formula: The metric is a weighted sum of structure and texture differences across different layers of a VGG network. For each layer, a texture term is computed from the feature means and a structure term from the feature cross-correlations, and the weighted combination of these per-layer terms gives the overall distance.
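For concreteness, here is a small numerical sketch of PSNR and a simplified global-statistics SSIM that follows the formulas above (production implementations such as scikit-image compute SSIM over local windows):

```python
import numpy as np

def psnr(gt, pred, max_val=255.0):
    """PSNR from the definition above; inputs are arrays of equal shape."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Simplified SSIM using global image statistics, with the standard
    stabilizing constants c1 = (0.01*L)^2 and c2 = (0.03*L)^2."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```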
5.2.2. No-Reference Metrics (for real-world & AIGC data)
These metrics evaluate the quality of a video without access to a ground-truth original.
- NIQE (Natural Image Quality Evaluator): Measures the deviation from statistical regularities observed in natural images. A lower NIQE score indicates that the image statistics are closer to those of a natural, high-quality image.
- MUSIQ (Multi-scale Image Quality Transformer): A Transformer-based no-reference metric that assesses image quality by considering features at multiple scales. Higher is better.
- CLIP-IQA (CLIP-based Image Quality Assessment): Uses the pre-trained CLIP model to assess image quality by comparing image features to text prompts related to quality (e.g., "high quality," "blurry"). Higher is better.
- DOVER: A unified no-reference video quality assessment model that fuses aesthetic and technical quality scores. Higher scores indicate better overall video quality.
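In practice these no-reference scores are usually computed with off-the-shelf implementations. The sketch below assumes the third-party `pyiqa` package (`pip install pyiqa`), whose metric names and pretrained weights may vary by version; DOVER requires its own video-level model and is omitted here:

```python
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
niqe = pyiqa.create_metric("niqe", device=device)       # lower is better
musiq = pyiqa.create_metric("musiq", device=device)     # higher is better
clipiqa = pyiqa.create_metric("clipiqa", device=device)  # higher is better

# a restored frame as a (B, 3, H, W) tensor with values in [0, 1]
frame = torch.rand(1, 3, 720, 1280, device=device)
print(niqe(frame).item(), musiq(frame).item(), clipiqa(frame).item())
```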
5.3. Baselines
SeedVR2 was compared against several state-of-the-art video restoration methods:
- RealViformer [95]: A Transformer-based real-world VSR model.
- MGLD-VSR [79]: A diffusion-based VSR method guided by motion.
- UAV [97]: A diffusion-based VSR model that also uses a large-scale training strategy.
- VEnhancer [14]: A diffusion-based method for generative space-time video enhancement.
- STAR [74]: A diffusion-based VSR method that uses a text-to-video model for augmentation.
- SeedVR-7B [67]: The multi-step predecessor to SeedVR2, serving as a direct performance and speed baseline.
For all diffusion-based baselines, 50 sampling steps were used to ensure high-quality output, highlighting the speed advantage of the one-step SeedVR2.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that SeedVR2 is highly effective, achieving a strong balance between performance and speed.
Quantitative Results: The following are the results from Table 1 of the original paper:
| Datasets | Metrics | RealViformer [95] | MGLD-VSR [79] | UAV [97] | VEnhancer [14] | STAR [74] | SeedVR-7B [67] | Ours 3B | Ours 7B |
|---|---|---|---|---|---|---|---|---|---|
| SPMCS | PSNR ↑ | 24.185 | 23.41 | 21.69 | 18.20 | 22.58 | 20.78 | 22.97 | 22.90 |
| | SSIM ↑ | 0.663 | 0.633 | 0.519 | 0.507 | 0.609 | 0.575 | 0.646 | 0.638 |
| | LPIPS ↓ | 0.378 | 0.369 | 0.508 | 0.455 | 0.420 | 0.395 | 0.306 | 0.322 |
| | DISTS ↓ | 0.186 | 0.166 | 0.229 | 0.194 | 0.229 | 0.166 | 0.13 | 0.134 |
| UDM10 | PSNR ↑ | 26.70 | 26.11 | 24.62 | 21.48 | 24.66 | 24.29 | 25.61 | 26.26 |
| | SSIM ↑ | 0.796 | 0.772 | 0.712 | 0.691 | 0.747 | 0.731 | 0.784 | 0.798 |
| | LPIPS ↓ | 0.285 | 0.273 | 0.323 | 0.349 | 0.359 | 0.264 | 0.218 | 0.203 |
| | DISTS ↓ | 0.166 | 0.144 | 0.178 | 0.175 | 0.195 | 0.124 | 0.106 | 0.101 |
| REDS30 | PSNR ↑ | 23.34 | 22.74 | 21.44 | 19.83 | 22.04 | 21.74 | 21.90 | 22.27 |
| | SSIM ↑ | 0.615 | 0.578 | 0.514 | 0.545 | 0.593 | 0.596 | 0.598 | 0.606 |
| | LPIPS ↓ | 0.328 | 0.271 | 0.397 | 0.508 | 0.487 | 0.340 | 0.350 | 0.337 |
| | DISTS ↓ | 0.154 | 0.097 | 0.181 | 0.229 | 0.229 | 0.122 | 0.135 | 0.127 |
| YouHQ40 | PSNR ↑ | 23.26 | 22.62 | 21.32 | 18.68 | 22.15 | 20.60 | 22.10 | 22.46 |
| | SSIM ↑ | 0.606 | 0.576 | 0.503 | 0.509 | 0.575 | 0.546 | 0.595 | 0.600 |
| | LPIPS ↓ | 0.362 | 0.356 | 0.404 | 0.449 | 0.451 | 0.323 | 0.284 | 0.274 |
| | DISTS ↓ | 0.193 | 0.166 | 0.196 | 0.175 | 0.213 | 0.134 | 0.122 | 0.110 |
| VideoLQ | NIQE ↓ | 4.153 | 3.864 | 4.079 | 5.122 | 5.915 | 4.933 | 4.687 | 4.948 |
| | MUSIQ ↑ | 54.65 | 53.49 | 52.90 | 42.66 | 40.50 | 48.35 | 51.09 | 45.76 |
| | CLIP-IQA ↑ | 0.411 | 0.333 | 0.386 | 0.269 | 0.243 | 0.258 | 0.295 | 0.257 |
| | DOVER ↑ | 7.035 | 8.109 | 6.975 | 7.985 | 6.891 | 7.416 | 8.176 | 7.236 |
| AIGC28 | NIQE ↓ | 3.994 | 4.049 | 4.541 | 4.176 | 5.004 | 4.294 | 3.801 | 4.015 |
| | MUSIQ ↑ | 62.82 | 60.98 | 62.79 | 60.99 | 55.59 | 56.90 | 62.99 | 59.97 |
| | CLIP-IQA ↑ | 0.647 | 0.570 | 0.653 | 0.461 | 0.435 | 0.453 | 0.561 | 0.497 |
| | DOVER ↑ | 11.66 | 14.27 | 13.09 | 15.31 | 14.82 | 14.77 | 15.77 | 15.55 |
- On synthetic datasets, SeedVR2 (both 3B and 7B versions) consistently achieves the best or second-best scores on perceptual metrics like LPIPS and DISTS. This shows its strong generative capability. It lags behind some methods on PSNR and SSIM, which is expected for GAN-based training that prioritizes realism over pixel-perfect reconstruction (the perception-distortion tradeoff).
- On the real-world VideoLQ dataset, SeedVR2 (3B) achieves the best DOVER score, indicating superior overall video quality as judged by a sophisticated no-reference metric.
- On the AIGC28 dataset, SeedVR2 models dominate, securing top scores in NIQE, MUSIQ, and DOVER, demonstrating their effectiveness in enhancing AI-generated content.
Qualitative Results and User Study:
Qualitative comparisons in Figure 3 show that SeedVR2 produces results that are visually comparable to the 50-step SeedVR and significantly better than other baselines. It successfully restores fine details and textures (e.g., bird feathers, text, dog fur) while removing complex degradations.
This figure (Figure 3) shows a qualitative comparison of different video restoration methods on high-resolution real-world videos. With a single sampling step, SeedVR2 performs on par with or better than SeedVR, removing degradations while preserving fine textures on the bird, the text, the buildings, and the dog's face.
A user study was conducted to compare SeedVR2 against other methods. The results, summarized below from Table 2, reinforce the quantitative findings.
The following are the results from Table 2 of the original paper:
| Methods-{Steps} | Visual Fidelity | Visual Quality | Overall Quality |
|---|---|---|---|
| RealViformer-1 [95] | +2% | -38% | -32% |
| VEnhancer-50 [14] | -82% | -86% | -94% |
| UAV-50 [98] | 0% | -26% | -26% |
| MGLD-VSR-50 [79] | 0% | -12% | -12% |
| STAR-50 [74] | +4% | -22% | -24% |
| SeedVR-7B-50 [67] | +2% | +10% | +10% |
| Ours-3B-1 | 0% | +16% | +16% |
| Ours-7B-1 | 0% | 0% | 0% |
- Users found SeedVR2's visual quality to be significantly better than most multi-step diffusion baselines.
- Interestingly, the smaller Ours-3B model was preferred even over the multi-step SeedVR-7B and its own larger Ours-7B counterpart, suggesting that the distillation process might have been particularly effective for this model size.
6.2. Ablation Studies / Parameter Analysis
The authors conducted rigorous ablation studies to validate their design choices.
Effect of Adaptive Window Attention: Figure 4 clearly illustrates the benefit of the proposed adaptive window attention. The model trained with a predefined, fixed-size window exhibits visible grid-like artifacts in high-resolution outputs. In contrast, the model with adaptive attention produces a clean, artifact-free restoration. This confirms that the adaptive mechanism is crucial for the model's robustness to varying resolutions.

Effect of Losses and Progressive Distillation: The following are the results from Table 3 of the original paper:
| Metrics | Non-satur. + R1 | RpGAN + R1 + R2 | RpGAN + R1 + R2 + L1 | RpGAN + R1 + R2 + L1 + LF | Prog. Training |
|---|---|---|---|---|---|
| PSNR ↑ | 22.55 | 22.56 | 22.91 | 22.91 | 23.96 |
| SSIM ↑ | 0.612 | 0.603 | 0.616 | 0.620 | 0.667 |
| LPIPS ↓ | 0.310 | 0.278 | 0.251 | 0.244 | 0.227 |
| DISTS ↓ | 0.136 | 0.109 | 0.099 | 0.092 | 0.097 |
This ablation study on the YouHQ40 dataset shows a clear, step-by-step improvement as new components are added:
- Replacing the vanilla GAN loss with RpGAN + R1 + R2 significantly improves perceptual metrics (LPIPS, DISTS).
- Adding the L1 loss further improves all metrics, balancing perceptual quality with fidelity.
- Adding the proposed feature matching loss (LF) provides another boost to perceptual scores.
- Finally, using progressive training (distillation) provides the most substantial improvement across nearly all metrics, highlighting its importance in bridging the gap between the multi-step and one-step models.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces SeedVR2, a one-step diffusion transformer for real-world video restoration. By building on an Adversarial Post-Training framework and introducing several critical, task-specific innovations—namely the adaptive window attention for high-resolution robustness and a refined training strategy with progressive distillation and a novel feature matching loss—the authors demonstrate that it is possible to achieve video restoration quality comparable or superior to slow, multi-step diffusion models in a single, efficient forward pass. The extensive experiments validate that SeedVR2 is a significant step towards making high-quality, diffusion-based video enhancement practical for real-world use.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- VAE Bottleneck: While the one-step generator is fast, the overall process is still slowed down by the causal video Variational Autoencoder (VAE), which is used to encode the video into a latent space and decode it back. The VAE takes up over 95% of the total inference time for a long video. Future work should focus on improving the efficiency of this video VAE.
- Robustness to Extreme Degradations: The model can still fail on videos with very heavy degradations or extremely large motions, which remains a challenge for most VR methods.
- Oversharpening: Due to its strong generative nature, the model can sometimes over-sharpen videos that have only light degradations, requiring careful hyperparameter tuning.
- Societal Impact: The authors responsibly note the potential for misuse in enhancing illegal or harmful content (e.g., NSFW) and plan to release a detection tool to mitigate this risk.
7.3. Personal Insights & Critique
This paper is a strong piece of engineering and research that addresses a very practical and important problem.
Strengths:
- Problem-Driven Innovation: The paper doesn't just apply an existing technique (APT) to a new domain. It identifies specific failure points (window artifacts, training instability) and proposes targeted, elegant solutions (adaptive attention, feature matching loss). This demonstrates a deep understanding of the problem.
- Efficiency and Practicality: The focus on achieving one-step inference is highly relevant. By making diffusion-level quality accessible at a fraction of the cost, this work could unlock new applications in video editing software, streaming platforms, and archival footage restoration.
- Methodological Rigor: The proposed feature matching loss is a clever and efficient substitute for LPIPS. Reusing the discriminator as a feature extractor is computationally smart and theoretically sound within the adversarial framework, as the discriminator is concurrently learning to identify perceptually relevant features.
- Scalability: Demonstrating that the proposed training framework is stable enough to handle a massive 16B-parameter GAN is a significant engineering achievement and points towards the future scalability of such models.
Potential Issues and Areas for Reflection:
- Generalization of the 3B vs. 7B Result: The user study finding that the 3B model was preferred over the 7B model is intriguing. The authors suggest this points to the effectiveness of distillation. It could also imply that for this specific task, larger model size does not automatically equate to better perceptual quality, or that the 7B model required more training/tuning to reach its full potential. This is a valuable insight in an era often dominated by "scaling is all you need."
- Dependence on Pre-training: The entire method relies on a powerful, pre-trained multi-step model (SeedVR). The quality of this initial model likely sets a strong "prior" for the final one-step generator. While the paper shows it can surpass the initial model, the performance is not entirely independent of it.
"One-Step" Definition: The claim of "one-step" refers to the core generative process. As the authors admit, the VAE encoding/decoding adds significant overhead. While the model is a huge leap forward, achieving true real-time performance will require tackling the VAE bottleneck, which is an important research direction in itself.
Overall, SeedVR2 is a landmark paper in the field of video restoration. It provides a robust and effective blueprint for creating fast and powerful generative restoration models, and its insights into stable, large-scale adversarial training and efficient perceptual losses are valuable for the broader generative AI community.