DS-VTON: An Enhanced Dual-Scale Coarse-to-Fine Framework for Virtual Try-On
TL;DR Summary
DS-VTON proposes a dual-scale coarse-to-fine, mask-free framework, utilizing a blend-refine diffusion process, to simultaneously achieve precise garment alignment and high-fidelity texture preservation in virtual try-on. It outperforms prior state-of-the-art methods in both aspects.
Abstract
Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. These two requirements map directly onto a coarse-to-fine generation paradigm, where the coarse stage handles structural alignment and the fine stage recovers rich garment details. Motivated by this observation, we propose DS-VTON, an enhanced dual-scale coarse-to-fine framework that tackles the try-on problem more effectively. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. In the second stage, a blend-refine diffusion process reconstructs high-resolution outputs by refining the residual between scales through noise-image blending, emphasizing texture fidelity and effectively correcting fine-detail errors from the low-resolution stage. In addition, our method adopts a fully mask-free generation strategy, eliminating reliance on human parsing maps or segmentation masks. Extensive experiments show that DS-VTON not only achieves state-of-the-art performance but consistently and significantly surpasses prior methods in both structural alignment and texture fidelity across multiple standard virtual try-on benchmarks.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: DS-VTON: An Enhanced Dual-Scale Coarse-to-Fine Framework for Virtual Try-On
- Authors: Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
- Affiliations: Shanghai Jiao Tong University, Ant Group
- Journal/Conference: This paper is available as a preprint on arXiv. arXiv is a popular open-access repository for academic papers, often used for pre-publication dissemination of research.
- Publication Year: The first version was submitted to arXiv in June 2025.
- Abstract: The paper addresses two primary challenges in virtual try-on: accurately aligning a garment with a person's body and preserving the garment's fine-grained textures. It proposes DS-VTON, a dual-scale coarse-to-fine framework. The first (coarse) stage generates a low-resolution result to establish correct structural alignment. The second (fine) stage uses a novel blend-refine diffusion process to reconstruct a high-resolution image, focusing on texture fidelity and correcting errors from the first stage. A key feature is its mask-free strategy, which avoids reliance on human segmentation masks. Experiments on standard benchmarks show that DS-VTON significantly outperforms previous state-of-the-art methods in both alignment and texture preservation.
- Original Source Link: https://arxiv.org/pdf/2506.00908
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The central task of virtual try-on (VTON) is to create a realistic image of a person wearing a new piece of clothing. This is deceptively difficult, as it requires solving two conflicting goals simultaneously: (1) Structural Alignment: The garment must be realistically warped and draped over the person's body, respecting their pose and shape. (2) Texture Fidelity: The intricate details, patterns, logos, and textures of the original garment must be perfectly preserved.
- Existing Gaps: Previous methods, whether based on Generative Adversarial Networks (GANs) or single-stage diffusion models, often compromise on one goal to achieve the other. GANs struggle with detail loss during image fusion, while single-stage diffusion models find it hard to manage both global structure and local details within a single denoising process. Many methods also depend on inaccurate human parsing masks, which can introduce errors.
- Innovative Angle: DS-VTON introduces a dedicated dual-scale coarse-to-fine framework. This explicitly separates the problem: a low-resolution stage focuses solely on getting the coarse structure and alignment right, while a high-resolution stage builds upon this solid foundation to restore fine details. This separation is the key to overcoming the inherent trade-off. Furthermore, its mask-free approach simplifies the pipeline and removes a common source of error.
- Main Contributions / Findings (What):
- A Novel Dual-Scale Framework: A two-stage pipeline that effectively decouples structural alignment (coarse stage) from texture restoration (fine stage), which is well-suited for the VTON task.
- Blend-Refine Diffusion Process: A new technique for the high-resolution stage that initializes the denoising process with a blend of random noise and the low-resolution result. This provides strong structural guidance while allowing for the generation of high-frequency details.
- Mask-Free Generation: The model operates directly on person and garment images without needing segmentation masks, leveraging the powerful semantic understanding of pre-trained diffusion models. This makes the system more robust and easier to use.
- State-of-the-Art Performance: The paper demonstrates through extensive experiments that DS-VTON significantly surpasses existing methods on the VITON-HD and DressCode datasets, achieving superior results in both quantitative metrics and human perceptual studies.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Virtual Try-On (VTON): The computer vision task of taking an image of a person and an image of a garment and synthesizing a new, photorealistic image of that person wearing that garment.
- Generative Adversarial Networks (GANs): A class of deep learning models where two neural networks, a Generator and a Discriminator, compete. The Generator creates new data (e.g., images), and the Discriminator tries to distinguish between real and generated data. This competition pushes the Generator to produce increasingly realistic outputs.
- Diffusion Models: A class of generative models that learn to create data by reversing a gradual noising process. They start with a real image, add Gaussian noise step-by-step until it becomes pure noise (forward process), and then train a neural network to reverse this process, starting from noise and gradually denoising it to form a coherent image (reverse process). Stable Diffusion is a popular diffusion model that performs this process in a lower-dimensional "latent space" for efficiency.
- Coarse-to-Fine Paradigm: A problem-solving strategy where a complex task is first addressed at a low resolution to establish the global structure or layout (the "coarse" stage). This coarse solution is then progressively refined at higher resolutions to add details (the "fine" stage). This is effective because it prevents the model from getting lost in local details before the overall structure is correct.
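To make the forward (noising) process described in the Diffusion Models item above concrete, here is a minimal PyTorch-style sketch of DDPM forward sampling. The schedule values, tensor shapes, and function names are illustrative assumptions, not taken from the paper's code.

```python
import torch

# Toy linear noise schedule (assumed values; real models tune these).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative schedule, abar_t

def forward_noise(x0: torch.Tensor, t: int):
    """Sample x_t ~ q(x_t | x_0) = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    abar_t = alphas_cumprod[t]
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps
    return x_t, eps  # the denoising network is trained to predict eps from (x_t, t)

# Example: a fake Stable-Diffusion-style latent (batch=1, 4 channels, 64x48).
x0 = torch.randn(1, 4, 64, 48)
x_t, eps = forward_noise(x0, t=500)
```

The reverse process then runs this mapping backwards, starting from pure noise at t = T and progressively removing the predicted noise.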
- Previous Works:
- GAN-based VTON: Early methods like VITON-HD and ACGPN used a two-step process. First, a warping module (e.g., using Thin-Plate Spline transformation) would deform the flat garment image to align with the person's pose. Second, a generation module would blend this warped garment with the person's image. While the warping helped preserve garment details, the final blending step often introduced artifacts and detail loss.
- Diffusion-based VTON: More recent methods like IDM-VTON, Leffa, and OOTDiffusion leverage the power of diffusion models. They typically use a single, unified denoising process to generate the final image. While diffusion models are naturally coarse-to-fine (early steps define structure, later steps refine details), a single process still struggles to perfectly balance global alignment and texture preservation. Many of these methods also rely on segmentation masks to tell the model where to inpaint the new garment, which can be a source of error if the mask is imprecise.
- Differentiation: DS-VTON stands out from prior work in two main ways:
- Explicit Dual-Scale vs. Implicit Single-Scale: While previous diffusion models have an implicit coarse-to-fine nature, DS-VTON makes it explicit with two separate stages. The low-resolution stage is forced to ignore fine details and focus only on structural alignment. The high-resolution stage is then explicitly guided by this stable, low-resolution structure.
- Mask-Free vs. Mask-Reliant: Most competitors require a segmentation mask of the person's torso to be replaced. DS-VTON completely eliminates this dependency, relying on the model's internal knowledge of human anatomy, which simplifies the pipeline and avoids errors from faulty masks.
4. Methodology (Core Technology & Implementation)
The core of DS-VTON is its two-stage, coarse-to-fine pipeline built on a dual U-Net architecture.
Figure: Method overview and result comparison. The upper part shows DS-VTON's two-stage framework: on the left, the low-resolution stage performs structural alignment with a Reference U-Net and a Denoising U-Net; on the right, the high-resolution stage combines the low-resolution output with the blending coefficients to perform texture refinement. The lower part compares try-on results under different parameter settings and against other methods, highlighting DS-VTON's advantages in detail preservation, texture restoration, and structural alignment.
- Principles: The central idea is to divide and conquer. Generating a high-resolution try-on image is hard because the model has to simultaneously figure out how the clothes should drape (global structure) and what the tiny patterns on the fabric look like (local details). By generating a low-resolution image first, DS-VTON forces the model to solve the structural problem in an environment where fine details are suppressed. The high-resolution stage then only needs to focus on adding details, guided by the already-correct structure.
- Network Architecture:
- The model uses a dual U-Net structure, inspired by recent successful methods like IDM-VTON.
- A Reference U-Net processes the garment image to extract its appearance features (texture, pattern, color).
- A Denoising U-Net processes the person image and generates the final output.
- The features from the Reference U-Net are injected into the Denoising U-Net via self-attention layers. This allows the denoising process to "refer" to the garment's appearance at each step (see the sketch after this list).
- The model is based on Stable Diffusion 1.5 weights, chosen for its balance of performance and efficiency. Notably, the authors removed all cross-attention layers, finding that they did not improve and sometimes even harmed performance (as shown in an ablation study).
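The paper's exact module is not reproduced here, but the dual U-Net injection idea can be sketched as follows: at a self-attention layer, keys and values from the Reference U-Net's garment tokens are concatenated with the Denoising U-Net's own tokens, so the denoiser can "look up" garment appearance. This is a minimal, hypothetical PyTorch sketch with assumed shapes and names, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ReferenceInjectedSelfAttention(nn.Module):
    """Self-attention where garment (reference) tokens extend the key/value set."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, person_tokens: torch.Tensor, garment_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come only from the denoising branch; keys/values see both branches,
        # so each person token can attend to (and copy from) garment appearance.
        kv = torch.cat([person_tokens, garment_tokens], dim=1)
        out, _ = self.attn(query=person_tokens, key=kv, value=kv)
        return out

# Example: 4096 person tokens and 4096 garment tokens with 320-dim features.
layer = ReferenceInjectedSelfAttention(dim=320)
person = torch.randn(1, 4096, 320)
garment = torch.randn(1, 4096, 320)
fused = layer(person, garment)  # (1, 4096, 320)
```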
- Steps & Procedures:
Stage 1: Low-Resolution Generation
- Input: The person image and the garment image are downsampled by a ratio σ. The paper finds σ = 2 to be optimal (e.g., 1024×768 inputs become 512×384). This yields low-resolution versions of the person and garment images.
- Process: A standard latent diffusion process is applied. The low-resolution person image is combined with Gaussian noise and fed to the Denoising U-Net, while the low-resolution garment image is fed to the Reference U-Net for conditioning.
- Output: A low-resolution try-on image, which has good structural alignment but lacks fine details.
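A rough sketch of the Stage 1 input preparation under the σ = 2 setting (resolution assumed from VITON-HD's 1024×768 images; function names are illustrative):

```python
import torch
import torch.nn.functional as F

def make_stage1_inputs(person: torch.Tensor, garment: torch.Tensor, sigma: int = 2):
    """Downsample both inputs by sigma before the low-resolution diffusion pass."""
    scale = 1.0 / sigma
    person_lr = F.interpolate(person, scale_factor=scale, mode="bicubic", align_corners=False)
    garment_lr = F.interpolate(garment, scale_factor=scale, mode="bicubic", align_corners=False)
    return person_lr, garment_lr

person = torch.rand(1, 3, 1024, 768)   # full-resolution person image
garment = torch.rand(1, 3, 1024, 768)  # flat garment image
person_lr, garment_lr = make_stage1_inputs(person, garment)  # each 512x384
```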
Stage 2: High-Resolution Generation (Blend-Refine Diffusion)
- Input: The original full-resolution person image, the garment image, and the low-resolution result from Stage 1, which is upsampled to the target resolution.
- Latent Initialization: This is the key innovation. Instead of starting the diffusion process from pure Gaussian noise $\epsilon$, the initial latent state is a blend of noise and the coarse result: $x_T = \alpha\,\epsilon + \beta\,\tilde{x}$, where $\tilde{x}$ is the upsampled low-resolution result. Here, $\tilde{x}$ provides the structural prior, $\epsilon$ provides the stochasticity needed to generate new details, and $\alpha$, $\beta$ are coefficients that balance their influence. The paper empirically finds α = β = 1/2 works best.
- Reformulated Denoising Process: The model is trained to reverse this blended forward process. The forward process that produces a noisy latent $x_t$ from the ground truth $x_0$ becomes $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,(\alpha\,\epsilon + \beta\,\tilde{x})$, where $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient (distinct from the blending coefficient α). Consequently, the network must learn to predict the combined term $\alpha\,\epsilon + \beta\,\tilde{x}$ instead of just $\epsilon$. This forces the denoising process to be aware of the low-resolution guidance from the very beginning and to generate details that are consistent with it.
- Output: A high-resolution, photorealistic try-on image.
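A minimal sketch of the blend-refine initialization and the reformulated forward process as described above. It reuses DDPM-style notation, defaults to the paper's α = β = 1/2, and all variable and function names are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def blend_init(lowres_latent: torch.Tensor, alpha: float = 0.5, beta: float = 0.5):
    """Initial latent x_T = alpha * eps + beta * x_up (upsampled Stage 1 result)."""
    x_up = F.interpolate(lowres_latent, scale_factor=2, mode="bicubic", align_corners=False)
    eps = torch.randn_like(x_up)
    return alpha * eps + beta * x_up, x_up, eps

def blended_forward(x0, x_up, eps, abar_t, alpha: float = 0.5, beta: float = 0.5):
    """Reformulated forward process:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * (alpha * eps + beta * x_up).
    The denoiser is trained to predict the combined term, not eps alone."""
    target = alpha * eps + beta * x_up
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * target
    return x_t, target

# Toy SD-style latents (4 channels); the Stage 1 latent is half the spatial size.
x0 = torch.randn(1, 4, 128, 96)   # high-resolution ground-truth latent (training)
x_lr = torch.randn(1, 4, 64, 48)  # Stage 1 output latent
x_T, x_up, eps = blend_init(x_lr)
abar_t = torch.tensor(0.3)        # cumulative noise-schedule value at some step t
x_t, target = blended_forward(x0, x_up, eps, abar_t)
# During training, the Denoising U-Net's prediction at (x_t, t) is regressed onto
# `target` with an MSE loss; at inference, denoising starts from x_T instead of pure noise.
```

Note that at t = T the cumulative coefficient is near zero, so the forward process collapses to the blended initialization, which is what makes the two definitions consistent.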
5. Experimental Setup
- Datasets:
- VITON-HD: A high-resolution (1024×768) dataset containing paired images of models and upper-body garments.
- DressCode: A more challenging high-resolution dataset that includes three categories: upper-body, lower-body, and dresses, featuring more complex poses and garment styles.
- Training Data Generation: Since DS-VTON is mask-free, it requires training pairs of the same person wearing different garments (the person in garment A paired with the same person in garment B). To create this data, the authors used a baseline model (IDM-VTON) to synthesize additional training images.
- Evaluation Metrics:
- Fréchet Inception Distance (FID):
- Conceptual Definition: FID measures the similarity between two sets of images (e.g., generated vs. real) by comparing the statistics of their features extracted from a pre-trained InceptionV3 network. It captures both image quality and diversity. Lower is better.
- Mathematical Formula: $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$
- Symbol Explanation:
- $\mu_r$, $\mu_g$: Mean vectors of the features for real ($r$) and generated ($g$) images.
- $\Sigma_r$, $\Sigma_g$: Covariance matrices of the features for real and generated images.
- $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
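A small sketch of how FID can be computed from pre-extracted Inception features, mirroring the formula above (the feature dimension is kept small here for speed; this is not the paper's evaluation script):

```python
import numpy as np
from scipy import linalg

def fid_from_features(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Purely illustrative random "features" (real pipelines use 2048-dim InceptionV3 features).
real = np.random.randn(500, 256)
fake = np.random.randn(500, 256)
print(fid_from_features(real, fake))
```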
- Kernel Inception Distance (KID):
- Conceptual Definition: KID is an alternative to FID that also compares feature distributions but uses a polynomial kernel, making it less biased and more suitable for smaller test sets. It is also reported as a mean and standard deviation. Lower is better.
- Mathematical Formula: $\mathrm{KID} = \mathbb{E}_{x, x' \sim X}\!\left[k(\phi(x), \phi(x'))\right] + \mathbb{E}_{g, g' \sim G}\!\left[k(\phi(g), \phi(g'))\right] - 2\,\mathbb{E}_{x \sim X,\, g \sim G}\!\left[k(\phi(x), \phi(g))\right]$
- Symbol Explanation:
- $X$, $G$: Sets of real and generated images.
- $\phi$: Feature extractor (Inception network).
- $k$: A polynomial kernel function, e.g., $k(a, b) = \left(\tfrac{1}{d}\, a^{\top} b + 1\right)^{3}$, where $d$ is the feature dimension.
- $\mathbb{E}[\cdot]$: Expected value.
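Similarly, a compact sketch of an unbiased KID (MMD²) estimate with the polynomial kernel above, again on pre-extracted features; the common practice of averaging over random subsets is omitted for brevity:

```python
import numpy as np

def polynomial_kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """k(x, y) = (x.y / d + 1)^3, with d the feature dimension."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid_from_features(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Unbiased MMD^2 estimate between real and generated feature sets."""
    k_rr = polynomial_kernel(feats_real, feats_real)
    k_gg = polynomial_kernel(feats_gen, feats_gen)
    k_rg = polynomial_kernel(feats_real, feats_gen)
    m, n = len(feats_real), len(feats_gen)
    # Exclude diagonal terms for the unbiased within-set estimates.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    term_rg = 2.0 * k_rg.mean()
    return float(term_rr + term_gg - term_rg)

real = np.random.randn(500, 256)
fake = np.random.randn(500, 256)
print(kid_from_features(real, fake))
```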
- User Study: A perceptual metric where human participants are shown results from different models and asked to choose the best one. Results are reported as the percentage of times a method was preferred. Higher is better.
- Structural Similarity Index Measure (SSIM) & Learned Perceptual Image Patch Similarity (LPIPS): Used for the "paired" setting evaluation (reconstructing the original image). SSIM is higher-is-better, measuring structural similarity. LPIPS is lower-is-better, measuring perceptual difference.
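For the paired-setting metrics, a usage-level sketch assuming the scikit-image and lpips packages (not the paper's exact evaluation code):

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity

# Two images in [0, 1], H x W x 3 (e.g., reconstruction vs. ground truth).
img_a = np.random.rand(256, 192, 3).astype(np.float32)
img_b = np.random.rand(256, 192, 3).astype(np.float32)

# SSIM: higher is better (structural similarity on the paired reconstruction task).
ssim = structural_similarity(img_a, img_b, channel_axis=-1, data_range=1.0)

# LPIPS: lower is better; the network expects NCHW tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net="alex")
to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0) * 2 - 1
lpips_dist = loss_fn(to_tensor(img_a), to_tensor(img_b)).item()

print(f"SSIM={ssim:.4f}, LPIPS={lpips_dist:.4f}")
```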
- Baselines: The paper compares DS-VTON against several recent state-of-the-art methods: OOTDiffusion, IDM-VTON, CatVTON, Leffa, and FitDiT. These represent the leading diffusion-based approaches in the field.
6. Results & Analysis
- Core Results:
Qualitative Analysis: As shown in the figures below, DS-VTON consistently produces superior results.
Figure: Illustration of the try-on pipeline and a result comparison. From left to right: the garment image, the original person image, the low-resolution try-on result, the noise-denoising intermediate result, and the final DS-VTON output. The figure highlights DS-VTON's advantage in preserving garment detail and structural alignment, yielding a more natural and detailed final result.
In Figure 7, compared to baselines, DS-VTON excels at:
- Structural Alignment: In Row 1, most methods struggle to correctly render the hands and the waist area, while DS-VTON handles the pose naturally. FitDiT creates visible artifacts where the person's hands should be.
- Detail Preservation: In Row 3, CatVTON and IDM-VTON lose the complex floral pattern, generating a blurry texture. Leffa and OOTDiffusion preserve it better but with less clarity. DS-VTON (HR) reconstructs the pattern with high fidelity.
- Tonal Consistency: In Row 6, FitDiT generates an overly bright garment and Leffa a darker one, while DS-VTON maintains the original color and tone accurately.
The comparisons on the DressCode dataset (Figures 4, 5, 6) further confirm these strengths across different garment types like dresses and pants.
Figure: Comparison of several virtual try-on methods. The left columns show the original garment image and the person; the following columns show the try-on results of CatVTON, IDM-VTON, FitDiT, OOTDiffusion, Leffa, DS-VTON (low resolution), and DS-VTON (high resolution). DS-VTON shows stronger structural alignment and texture restoration, with a natural garment fit and rich detail.
Figure: Virtual try-on comparison for lower-body garments. The first column shows different pants and the second column the corresponding person images; the remaining columns show the results of CatVTON, IDM-VTON, FitDiT, OOTDiffusion, Leffa, and DS-VTON at low and high resolution. The figure highlights DS-VTON's advantages in structural alignment and texture preservation, with the high-resolution results being the most detailed and natural.
Quantitative Analysis: The quantitative results in Table 1 corroborate the visual findings.
(This is a manual transcription of Table 1 from the paper.)
| Method | VITON-HD FID ↓ | VITON-HD KID ↓ | VITON-HD User Study (%) ↑ | DressCode FID ↓ | DressCode KID ↓ | DressCode User Study (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| OOTDiffusion | 9.02 | 0.63 | 4.1 | 7.10 | 2.28 | 7.2 |
| IDM-VTON | 9.10 | 1.06 | 11.6 | 5.51 | 1.42 | 9.1 |
| CatVTON | 9.40 | 1.27 | 3.4 | 5.24 | 1.21 | 5.2 |
| Leffa | 9.38 | 0.92 | 4.7 | 6.17 | 1.90 | 7.5 |
| FitDiT | 9.33 | 0.89 | 19.7 | 4.47 | 0.41 | 34.3 |
| DS-VTON (ours) | 8.24 | 0.31 | 56.5 | 4.21 | 0.34 | 36.7 |
DS-VTON achieves the best (lowest) FID and KID scores on both datasets by a significant margin. Most strikingly, in the user study, it was preferred 56.5% of the time on VITON-HD and 36.7% on DressCode, indicating a strong human preference for its results.
- Ablations / Parameter Sensitivity:
1. Ablation on Dual-Scale Design (σ): This study validates the core idea of the dual-scale approach.

(This is a manual transcription of Table 2 from the paper.)
| Version | FID ↓ | KID ↓ |
| :--- | :---: | :---: |
| σ = 1 (single-stage) | 8.97 | 1.01 |
| σ = 1, α = β = 1/2 (two-stage, same scale) | 8.77 | 0.61 |
| σ = 4, α = β = 1/2 | 8.41 | 0.57 |
| σ = 2, α = β = 1/2 | 8.24 | 0.31 |
- σ = 1 (single-stage): This is equivalent to a standard single-stage diffusion model. It performs the worst, confirming that generating high-resolution images directly with a mask-free strategy leads to poor structural alignment (as seen in Figure 8, right columns).
- σ = 4: Using a very low resolution for the first stage improves structural understanding, but it sacrifices too much detail (e.g., the sleeve stripes in Figure 8 are blurred), which the high-resolution stage cannot fully recover.
- σ = 2: This provides the best balance, creating a coarse result that is detailed enough to guide the high-resolution stage effectively but simple enough to ensure correct global structure.
2. Ablation on Initialization Coefficients (α, β): This study explores the balance between noise and structural guidance in the high-resolution stage.

(This is a manual transcription of Table 3 from the paper.)
| Version | FID ↓ | KID ↓ |
| :--- | :---: | :---: |
| σ = 2, α = 1/2, β = 1/2 | 8.24 | 0.31 |
| σ = 2, α = 1/3, β = 2/3 | 8.46 | 0.55 |
| σ = 2, α = 2/3, β = 1/3 | 8.26 | 0.35 |
| σ = 2, α = 1, β = 1 | 8.75 | 0.94 |
- High β (e.g., β = 2/3): Too much reliance on the low-resolution result. The model fails to add enough new detail, resulting in a final image that still looks blurry or lacks fine textures.
- High α (e.g., α = 2/3): Too much random noise. The structural guidance from the low-resolution result is weakened, and the model can introduce new structural errors, like the warped text ("GANT") in Figure 9.
- α = β = 0.5: This setting provides the optimal trade-off, allowing the model to respect the coarse structure while having enough freedom to generate crisp, high-frequency details.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces DS-VTON, a dual-scale, coarse-to-fine framework for virtual try-on. By explicitly separating structural alignment (low-resolution stage) from texture refinement (high-resolution stage) and introducing a novel blend-refine diffusion process, the method achieves a new state of the art. Its mask-free design further enhances robustness and usability. The results demonstrate significant improvements in both visual quality and quantitative metrics over previous leading methods.
- Limitations & Future Work:
- Data Generation: The mask-free training process relies on synthesizing person images with different garments using another VTON model. This can introduce artifacts from the synthesis model into the training data, potentially causing DS-VTON to learn to alter non-garment regions like hair or background.
- Fixed Coefficients: The coefficients α and β for the blend-refine process are fixed. The authors suggest that an adaptive or learnable scheduling mechanism could provide more flexible and content-aware refinement, which is a promising direction for future research.
- Personal Insights & Critique:
- Elegance of Blend-Refine: The blend-refine diffusion process is a simple yet powerful idea. It elegantly reframes the high-resolution generation task not as creating an image from scratch, but as learning the residual between a coarse and a fine version. This is a more constrained and well-posed problem, likely contributing to the method's stability and high performance.
- Generalizability: The dual-scale framework is not limited to virtual try-on. It could be applied to other image generation tasks that require both global consistency and fine detail, such as image super-resolution, inpainting, or style transfer.
- Practicality Concerns: While the results are impressive, the training pipeline's reliance on another SOTA model (IDM-VTON) for data synthesis adds complexity and a dependency. The quality of DS-VTON is indirectly tied to the quality of the model used to generate its training data. A future challenge would be to achieve similar results without this dependency, perhaps through self-supervised or weakly-supervised techniques. Overall, DS-VTON is a strong contribution that clearly pushes the boundaries of virtual try-on technology.