
Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

Keywords: Fidelity-Perception Trade-off, Diffusion Model Super-Resolution, Dual-LoRA Fine-Tuning Approach, Pixel-level and Semantic-level Adjustable Super-Resolution, Stable Diffusion Application
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

PiSA-SR proposes a dual-LoRA approach on Stable Diffusion, decoupling pixel-level fidelity ($\ell_2$ loss) and semantic-level perceptual quality (LPIPS/CSD losses) into distinct modules. This enables adjustable super-resolution at inference via two guidance scales, achieving flexible fidelity-perception trade-offs and leading real-world SR results in a single diffusion step.

Abstract

Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process, struggling to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences on SR results, thus it is demanded to develop an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the $\ell_2$-loss for pixel-level regression, and another is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, PiSA-SR can offer flexible SR results according to user preference without re-training. Codes and models can be found at https://github.com/csslc/PiSA-SR.

English Analysis

1. Bibliographic Information

  • Title: Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach
  • Authors: Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, Lei Zhang
  • Affiliations: The Hong Kong Polytechnic University, OPPO Research Institute
  • Journal/Conference: The paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a specific conference or journal, but it is a common way to disseminate cutting-edge research quickly.
  • Publication Year: 2024 (as per arXiv submission)
  • Abstract: The paper addresses the challenge in real-world image super-resolution (SR) where most methods struggle to balance pixel-wise fidelity and perceptual quality. Existing methods often entangle these two conflicting objectives. To solve this, the authors propose PiSA-SR, a method that uses two separate Low-Rank Adapter (LoRA) modules on top of a pre-trained Stable Diffusion (SD) model. One LoRA is trained for pixel-level regression using an $\ell_2$ loss to ensure fidelity. The other LoRA is trained for semantic-level enhancement using LPIPS and Classifier Score Distillation (CSD) losses to improve perceptual quality. This decoupled approach allows PiSA-SR to achieve state-of-the-art results in a single diffusion step. Crucially, it introduces two adjustable guidance scales at inference time, enabling users to control the balance between fidelity and perception without retraining the model.
  • Original Source Link:

2. Executive Summary

Background & Motivation (Why)

The core problem in image super-resolution (SR) is the inherent perception-distortion trade-off. Methods optimized for pixel-level accuracy (high fidelity, measured by PSNR) tend to produce overly smooth and blurry images, lacking realistic textures. Conversely, methods optimized for perceptual quality (realism, measured by metrics like LPIPS or no-reference scores) often use Generative Adversarial Networks (GANs) or Diffusion Models (DMs) to generate plausible details, but this can introduce artifacts and deviate from the original image content.

Existing diffusion-based SR methods, despite their impressive generative capabilities, suffer from several key limitations:

  1. Entangled Objectives: They typically train a single model to simultaneously optimize for both fidelity and perception, leading to a difficult balancing act and suboptimal results.

  2. Lack of Flexibility: Most models produce a fixed output style. However, different users have different preferences—some may prefer a faithful restoration, while others might want more vibrant, detailed results. A "one-size-fits-all" model is not ideal.

  3. High Computational Cost: Many diffusion-based SR methods require numerous denoising steps (e.g., 20, 50, or even 200), making them slow and impractical for real-time applications.

    This paper is motivated by the need for an SR model that is efficient, effective, and adjustable: one that explicitly disentangles the fidelity and perception objectives and caters to diverse user preferences at inference time.

Main Contributions / Findings (What)

The paper introduces PiSA-SR, a novel framework with the following key contributions:

  1. Decoupled Dual-LoRA Framework: PiSA-SR is the first to propose using two distinct LoRA modules to explicitly separate the pixel-level restoration and semantic-level enhancement tasks. This disentangles the conflicting objectives into two separate, optimizable weight spaces within a single frozen Stable Diffusion model.
  2. Efficient One-Step Residual Learning: The SR process is formulated as learning the residual between the low-quality (LQ) and high-quality (HQ) latent codes. This allows the model to be trained end-to-end and perform SR in a single diffusion step, making it significantly faster than multi-step methods.
  3. Adjustable Inference with Dual Guidance: By introducing two guidance scales, $\lambda_{pix}$ and $\lambda_{sem}$, users can dynamically control the strength of the pixel-level and semantic-level LoRAs during inference. This provides unprecedented flexibility to generate custom SR results tailored to specific needs without any retraining.
  4. State-of-the-Art Performance: In its default single-step setting, PiSA-SR achieves leading performance on standard real-world SR benchmarks, outperforming previous GAN-based and diffusion-based methods in both quantitative metrics and visual quality.

3. Prerequisite Knowledge & Related Work

Foundational Concepts

  • Image Super-Resolution (SR): The task of creating a high-resolution (HR) image from a low-resolution (LR) input. It is an "ill-posed" problem because a single LR image can correspond to many possible HR images.
  • Perception-Distortion Trade-off: A fundamental dilemma in image restoration. Distortion measures pixel-wise error (e.g., PSNR, $\ell_2$ loss), while Perception measures perceptual realism. Improving one often degrades the other. Models with low distortion are faithful but often blurry; models with high perceptual quality are realistic but may contain artifacts or hallucinated details.
  • Diffusion Models (DMs): Generative models that learn to reverse a process of gradually adding noise to an image. The forward process adds Gaussian noise over many steps. The reverse process learns to denoise the image step-by-step, starting from pure noise to generate a clean image. Stable Diffusion (SD) is a powerful, large-scale pre-trained text-to-image diffusion model that operates in a compressed latent space for efficiency.
  • Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning (PEFT) technique. Instead of retraining all the weights of a large pre-trained model (which is computationally expensive), LoRA freezes the original weights and injects small, trainable "adapter" matrices into specific layers (e.g., attention layers). These adapters learn the task-specific updates, drastically reducing the number of trainable parameters (a minimal sketch appears after this list).
  • Classifier Score Distillation (CSD): A loss function that leverages a pre-trained model as an implicit "classifier" or "judge." It guides the optimization of a generator (in this case, the SR model) by ensuring its output has a high probability under the pre-trained model's distribution for a given condition (e.g., a text prompt). This effectively distills the rich semantic knowledge from the large model into the smaller one being trained.
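
To make the LoRA mechanism concrete, below is a minimal PyTorch-style sketch (not from the paper) of a low-rank adapter wrapped around a frozen linear layer; the class name, rank, and scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank residual added on top of the frozen projection.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Only `lora_A` and `lora_B` receive gradients, which is what makes fine-tuning a large frozen backbone with two such adapter sets (pixel-level and semantic-level) cheap.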

Previous Works

The paper positions itself within the evolution of SR methods:

  1. Early CNN-based SR: Methods like SRCNN and EDSR used deep convolutional neural networks optimized with pixel-wise losses ($\ell_1$ or $\ell_2$). They achieved high PSNR but produced overly smooth results.
  2. GAN-based SR: Models like SRGAN and Real-ESRGAN introduced an adversarial loss, training a discriminator to distinguish between real HR images and generated SR images. This significantly improved perceptual quality but often led to training instability and noticeable artifacts.
  3. Diffusion-based SR: This is the current state-of-the-art.
    • Multi-step Methods (StableSR, DiffBIR, PASD, SeeSR): These methods adapt pre-trained Stable Diffusion models for SR. They typically start from Gaussian noise and perform many denoising steps, conditioned on the LQ image. While powerful, they are slow and can introduce inconsistencies.
    • Single-step Methods (OSEDiff, SinSR): To improve efficiency, these methods distill the knowledge from a multi-step diffusion process into a model that can perform SR in a single step. OSEDiff, a key predecessor, uses Variational Score Distillation (VSD) loss. However, these methods still entangle the fidelity and perception objectives.

Differentiation

PiSA-SR distinguishes itself from prior work in several crucial ways:

  • Explicit Decoupling: Unlike all previous methods that mix fidelity and perception losses in a single training process, PiSA-SR uses a dual-LoRA structure to assign each objective to a dedicated set of parameters.
  • Novel Inference Control: While some methods offer limited control (e.g., adjusting noise levels), PiSA-SR provides a more intuitive and powerful dual-guidance mechanism that directly manipulates the contributions of the fidelity and perception modules.
  • Efficiency and Stability: By using CSD loss instead of the more complex VSD loss (used in OSEDiff), PiSA-SR achieves more stable training with lower memory usage. Its single-step residual learning formulation makes it one of the fastest diffusion-based SR methods available.

4. Methodology (Core Technology & Implementation)

The core of PiSA-SR is its unique formulation of the SR problem and its decoupled training and inference pipeline.

Figure 2. Comparison of the pipelines of different DM-based SR methods. (a) Multi-step methods [2, 30, 36, 46, 57, 59] perform $T$ denoising steps starting from Gaussian noise $z_T$, conditioned on the LQ image.

Principles: Residual Learning in a Single Step

As shown in Figure 2, traditional multi-step methods (a) are slow. OSEDiff (b) introduces a single-step approach, directly mapping the LQ latent $z_L$ to the HQ latent $z_H$. PiSA-SR (c) builds on this but reframes the problem as residual learning. Instead of predicting $z_H$ directly, the model learns to predict the residual (i.e., the high-frequency details and degradation patterns) that needs to be subtracted from $z_L$ to obtain $z_H$.

The process is defined by the formula: $z_H = z_L - \lambda\, \epsilon_\theta(z_L)$

  • $z_L$ and $z_H$ are the latent representations of the LQ and HQ images, respectively, obtained from a VAE encoder.
  • $\epsilon_\theta(z_L)$ is the output of the diffusion U-Net, which is trained to predict the residual.
  • $\lambda$ is a scaling factor. It is fixed to 1 during training but can be adjusted during inference to control the restoration strength (a minimal code sketch of this one-step mapping follows the list).
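
Under this residual formulation, one inference pass is just a U-Net forward call followed by a subtraction in latent space. The sketch below is a hypothetical outline assuming generic `vae` and `unet` objects with `encode`/`decode` and callable interfaces; it is not the authors' implementation.

```python
import torch

@torch.no_grad()
def one_step_sr(x_lq: torch.Tensor, vae, unet, lam: float = 1.0) -> torch.Tensor:
    """Hypothetical one-step residual SR: z_H = z_L - lambda * eps_theta(z_L)."""
    z_l = vae.encode(x_lq)          # LQ image -> latent
    residual = unet(z_l)            # predicted residual (degradations / missing details)
    z_h = z_l - lam * residual      # apply the residual formulation
    return vae.decode(z_h)          # HQ latent -> image
```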

Steps & Procedures: Dual-LoRA Training

The training process, illustrated in Figure 3(a), is designed to disentangle the two objectives. The base SD model and VAE are frozen.

Figure 3. Training and inference process of PiSA-SR. (a) Training stage: the pixel-level LoRA and the PiSA LoRA each predict a residual; the pixel-level output is optimized with the $\ell_2$ loss ($L_{\ell_2}$), while the semantic-level output is optimized with the LPIPS loss ($L_{lpips}$) and the classifier score distillation loss ($L_{CSD}^{cls}$). (b) Inference stage: adjustable pixel-level and semantic-level guidance scales are introduced.

  1. Stage 1: Pixel-level LoRA Training.

    • A LoRA module, denoted as $\Delta\theta_{pix}$, is added to the SD U-Net.
    • This module is trained to perform pixel-level restoration. The objective is to minimize the pixel-wise difference between the reconstructed image and the ground truth.
    • Loss Function: A simple $\ell_2$ loss is used on the reconstructed images: $L_{pix} = \|D(z_H^{pix}) - x_H\|_2^2$, where $z_H^{pix} = z_L - \epsilon_{\theta_{sd} + \Delta\theta_{pix}}(z_L)$.
  2. Stage 2: Semantic-level LoRA Training.

    • The trained pixel-level LoRA ($\Delta\theta_{pix}$) is frozen.
    • A second, new LoRA module, $\Delta\theta_{sem}$, is introduced.
    • Only $\Delta\theta_{sem}$ is trained in this stage. The full set of active parameters is $\theta_{PiSA} = \{\theta_{sd}, \Delta\theta_{pix}, \Delta\theta_{sem}\}$.
    • The goal is to learn how to add rich, semantic details on top of the base restoration provided by the pixel-level LoRA.
    • Loss Functions: A combination of two perceptual losses is used:
      • LPIPS Loss: Measures perceptual similarity in a deep feature space.
      • Classifier Score Distillation (CSD) Loss: This is the key to generating high-quality semantic details (a two-stage training sketch follows this list).
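
The two training stages described above can be summarized in the following pseudo-PyTorch sketch. Names such as `unet_with(...)`, `freeze(...)`, `lpips_loss`, and `csd_loss` are placeholders for the corresponding modules and losses, so this is a structural sketch rather than the paper's code.

```python
def train_pisa_sr(loader, vae, unet_with, pixel_lora, semantic_lora,
                  opt_pix, opt_sem, lpips_loss, csd_loss, freeze, prompt=None):
    # Stage 1: train only the pixel-level LoRA with an L2 loss on decoded images.
    for x_lq, x_hq in loader:
        z_l = vae.encode(x_lq)
        z_h_pix = z_l - unet_with(pixel_lora)(z_l)            # residual prediction
        loss = ((vae.decode(z_h_pix) - x_hq) ** 2).mean()     # pixel-level L2 loss
        loss.backward(); opt_pix.step(); opt_pix.zero_grad()

    # Stage 2: freeze the pixel-level LoRA and train the semantic-level LoRA
    # with LPIPS + classifier score distillation (CSD) losses.
    freeze(pixel_lora)
    for x_lq, x_hq in loader:
        z_l = vae.encode(x_lq)
        z_h_sem = z_l - unet_with(pixel_lora, semantic_lora)(z_l)
        loss = lpips_loss(vae.decode(z_h_sem), x_hq) + csd_loss(z_h_sem, prompt)
        loss.backward(); opt_sem.step(); opt_sem.zero_grad()
```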

Mathematical Formulas & Key Details

The semantic-level enhancement relies heavily on the CSD loss. The gradient of the CSD loss is formulated as: $\nabla \ell_{CSD}^{\lambda_{cfg}} = \mathbb{E}_{t, \epsilon, z_t, c} \left[ w_t \left( f(z_t, \epsilon_{real}) - f(z_t, \epsilon_{real}^{\lambda_{cfg}}) \right) \frac{\partial z_H^{sem}}{\partial \theta_{PiSA}} \right]$

  • Intuition: This formula pushes the SR output $z_H^{sem}$ to be more aligned with what the pre-trained SD model considers a high-quality image given a text condition $c$.

  • $f(z_t, \epsilon)$ is a function that predicts the clean latent from a noisy one.

  • $\epsilon_{real}$ is the noise prediction from the unconditional pre-trained SD model.

  • $\epsilon_{real}^{\lambda_{cfg}}$ is the noise prediction from the pre-trained SD model using Classifier-Free Guidance (CFG), which enhances the influence of the text condition $c$.

  • The term $f(z_t, \epsilon_{real}) - f(z_t, \epsilon_{real}^{\lambda_{cfg}})$ represents the semantic "correction" suggested by the pre-trained SD model. The loss minimizes this difference, effectively distilling SD's semantic knowledge into the PiSA-LoRA module.

Figure 4. The model outputs with pixel-wise and semantic-level losses for a given LQ image.

Figure 4 provides a visual justification for the choice of losses. The $\ell_2$ loss alone ($D(\epsilon_{\theta_{pix}}(z_L))$) removes degradation but results in a smooth image. The CSD loss component is shown to be highly effective at adding semantic details, while the VSD component used by prior work (OSEDiff) can weaken them. The term $D(\epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L))$ visualizes the isolated semantic enhancement, which is clean of pixel-level artifacts.
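
As an illustration of how the CSD gradient above could be turned into a trainable loss, here is a hedged sketch: the SR latent is diffused to a random timestep $t$, the frozen SD U-Net is queried with and without classifier-free guidance, and the difference of the predicted clean latents is injected as a constant gradient through a surrogate loss. The helper `frozen_sd_eps` and the weighting `w_t` are assumptions, not the paper's API.

```python
import torch

def csd_surrogate_loss(z_h_sem, t, prompt_emb, null_emb, frozen_sd_eps,
                       alpha_bar_t, w_t: float = 1.0, cfg_scale: float = 7.5):
    """Sketch of a CSD-style loss: gradient = w_t * (f(z_t, eps_uncond) - f(z_t, eps_cfg))."""
    noise = torch.randn_like(z_h_sem)
    sqrt_ab, sqrt_1m = alpha_bar_t.sqrt(), (1.0 - alpha_bar_t).sqrt()
    z_t = sqrt_ab * z_h_sem + sqrt_1m * noise                 # diffuse the SR latent to step t

    def pred_x0(eps):                                         # f(z_t, eps): predicted clean latent
        return (z_t - sqrt_1m * eps) / sqrt_ab

    with torch.no_grad():
        eps_uncond = frozen_sd_eps(z_t, t, null_emb)          # unconditional prediction
        eps_cond = frozen_sd_eps(z_t, t, prompt_emb)          # text-conditioned prediction
        eps_cfg = eps_uncond + cfg_scale * (eps_cond - eps_uncond)   # classifier-free guidance
        grad = w_t * (pred_x0(eps_uncond) - pred_x0(eps_cfg))

    # Surrogate loss whose gradient w.r.t. z_h_sem equals `grad`.
    return (grad * z_h_sem).sum()
```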

Inference: Default and Adjustable Settings

As shown in Figure 3(b), inference can be done in two ways:

  1. Default Setting: The pixel-level and semantic-level LoRAs are merged into a single PiSA-LoRA. The SR is performed in one step, making it very fast. This corresponds to setting $\lambda_{pix}=1$ and $\lambda_{sem}=1$.
  2. Adjustable Setting: This mode uses the dual guidance scales. The final residual is computed as (see the code sketch after this list): $\epsilon_{\theta}(z_L) = \lambda_{pix}\, \epsilon_{\theta_{pix}}(z_L) + \lambda_{sem} \left( \epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L) \right)$
    • $\epsilon_{\theta_{pix}}(z_L)$ is the output using only the pixel-level LoRA (the fidelity component).
    • $\epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L)$ is the difference between the full model output and the pixel-only output. This term isolates the pure semantic enhancement.
    • By adjusting $\lambda_{pix}$ and $\lambda_{sem}$, the user can mix and match the amount of fidelity and perceptual detail in the final result.
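
A minimal sketch of the adjustable inference, assuming two hypothetical U-Net wrappers: `unet_pix` with only the pixel-level LoRA active and `unet_pisa` with both LoRAs active:

```python
import torch

@torch.no_grad()
def adjustable_sr(x_lq, vae, unet_pix, unet_pisa,
                  lam_pix: float = 1.0, lam_sem: float = 1.0):
    """Mix pixel-level and semantic-level residuals with two guidance scales."""
    z_l = vae.encode(x_lq)
    eps_pix = unet_pix(z_l)                      # fidelity component
    eps_sem = unet_pisa(z_l) - eps_pix           # isolated semantic enhancement
    eps = lam_pix * eps_pix + lam_sem * eps_sem  # user-controlled mixture
    return vae.decode(z_l - eps)
```

Setting `lam_pix = lam_sem = 1.0` recovers the default single-pass behavior (up to the extra forward pass), while other values trade fidelity against detail richness.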

5. Experimental Setup

  • Datasets:
    • Training: A combination of the LSDIR dataset and the first 10,000 images from the FFHQ dataset. LQ images were synthesized using the degradation pipeline from Real-ESRGAN.
    • Testing:
      • Synthetic: 3000 images from the DIV2K dataset, degraded with the same Real-ESRGAN pipeline.
      • Real-World: Images from the RealSR and DrealSR datasets.
  • Evaluation Metrics:
    • Fidelity Metrics (Reference-based):
      1. PSNR (Peak Signal-to-Noise Ratio): Measures pixel-wise error between the SR output and the ground-truth (GT) image. Higher is better (a minimal code sketch of PSNR appears after this list). $\text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right)$
        • $\text{MAX}_I$ is the maximum possible pixel value (e.g., 255 for 8-bit images).
        • $\text{MSE}$ is the Mean Squared Error between the SR and GT images.
      2. SSIM (Structural Similarity Index): Measures similarity in terms of luminance, contrast, and structure. Closer to 1 is better. $\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
        • $\mu_x, \mu_y$ are the means of images $x$ and $y$.
        • $\sigma_x, \sigma_y$ are the standard deviations.
        • $\sigma_{xy}$ is the covariance.
        • $c_1, c_2$ are small constants for stability.
    • Perceptual Metrics (Reference-based):
      1. LPIPS (Learned Perceptual Image Patch Similarity): Calculates the distance between deep features of two images extracted from a pre-trained network (e.g., VGG). Lower is better.
      2. DISTS (Deep Image Structure and Texture Similarity): Another perceptual metric that unifies structure and texture similarity assessment. Lower is better.
      3. FID (Fréchet Inception Distance): Measures the distance between the distribution of features from generated images and real images, extracted from an InceptionV3 network. Lower indicates the distributions are more similar.
    • Perceptual Metrics (No-Reference):
      1. NIQE (Natural Image Quality Evaluator): Measures the quality of an image by comparing its statistical properties to those of natural images. Lower is better.
      2. CLIPIQA: A no-reference metric that leverages the CLIP model to assess image quality. Higher is better.
      3. MUSIQ: A Transformer-based no-reference metric that assesses quality at multiple scales. Higher is better.
      4. MANIQA: An attention-based no-reference metric for image quality assessment. Higher is better.
  • Baselines: The paper compares PiSA-SR against a comprehensive set of recent methods:
    • GAN-based: RealESRGAN, BSRGAN, LDL.
    • Multi-step DM-based: StableSR, ResShift, DiffBIR, PASD, SeeSR.
    • One-step DM-based: SinSR, OSEDiff.
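
For reference, a minimal NumPy implementation of the PSNR metric defined above (illustrative only; the paper's evaluation presumably relies on standard IQA toolboxes):

```python
import numpy as np

def psnr(sr: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR between an SR output and its ground truth (arrays of the same shape)."""
    mse = np.mean((sr.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```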

6. Results & Analysis

Core Results: Adjustable Super-Resolution

Figure 1. Visual illustration of the pixel- and semantic-level adjustable method for real-world SR. Increasing the pixel-level guidance scale $\lambda_{pix}$ suppresses image degradations such as noise and artifacts; increasing the semantic-level scale $\lambda_{sem}$ enriches semantic details.

Figure 1 and Table 1 showcase the core innovation: adjustability.

  • Varying Pixel-level Scale ($\lambda_{pix}$): As seen in Figure 1 (moving up the vertical axis), increasing $\lambda_{pix}$ effectively removes noise and artifacts. However, as Table 1 shows, PSNR peaks at $\lambda_{pix}=0.5$ and then decreases, while LPIPS is best at $\lambda_{pix}=0.8$. A very high value (e.g., 1.5) leads to over-smoothing and loss of detail.

  • Varying Semantic-level Scale ($\lambda_{sem}$): As seen in Figure 1 (moving across the horizontal axis), increasing $\lambda_{sem}$ generates richer semantic details, like wrinkles and hair texture. Table 1 shows that this consistently improves no-reference metrics like MUSIQ. However, it degrades PSNR (fidelity) and can worsen LPIPS if pushed too far, as it may introduce unrealistic artifacts.

    Here is the manually transcribed data from Table 1.

Table 1. Results of PiSA-SR with different pixel-semantic guidance scales on the RealSR test dataset.

$\lambda_{pix}$ $\lambda_{sem}$ PSNR↑ LPIPS↓ CLIPIQA↑ MUSIQ↑
0.0 1.0 25.96 0.3426 0.4129 46.45
0.2 1.0 26.48 0.3042 0.4868 54.05
0.5 1.0 26.75 0.2646 0.5705 63.82
0.8 1.0 26.18 0.2612 0.6292 68.95
1.0 1.0 25.50 0.2672 0.6702 70.15
1.2 1.0 24.76 0.2723 0.6746 70.33
1.5 1.0 23.74 0.2769 0.6305 69.23
1.0 0.0 26.92 0.3018 0.3227 49.62
1.0 0.2 26.95 0.2784 0.3591 53.64
1.0 0.5 26.77 0.2476 0.4322 58.76
1.0 0.8 26.20 0.2465 0.5806 66.33
1.0 1.0 25.50 0.2672 0.6702 70.15
1.0 1.2 24.59 0.3000 0.7015 71.60
1.0 1.5 23.08 0.3541 0.6835 71.76

Core Results: Comparisons with State-of-the-Arts

Below is the transcribed data from Table 2.

Table 2. Quantitative comparison with DM-based SR methods. S denotes the number of steps.

Datasets Methods PSNR↑ SSIM↑ LPIPS↓ DISTS↓ FID↓ NIQE↓ CLIPIQA↑ MUSIQ↑ MANIQA↑
DIV2K ResShift-S15 24.69 0.6175 0.3374 0.2215 36.01 6.82 0.6089 60.92 0.5450
StableSR-S200 23.31 0.5728 0.3129 0.2138 24.67 4.76 0.6682 65.63 0.6188
DiffBIR-S50 23.67 0.5653 0.3541 0.2129 30.93 4.71 0.6652 65.66 0.6204
PASD-S20 23.14 0.5489 0.3607 0.2219 29.32 4.40 0.6711 68.83 0.6484
SeeSR-S50 23.71 0.6045 0.3207 0.1967 25.83 4.82 0.6857 68.49 0.6239
SinSR-S1 24.43 0.6012 0.3262 0.2066 35.45 6.02 0.6499 62.80 0.5395
OSEDiff-S1 23.72 0.6108 0.2941 0.1976 26.32 4.71 0.6683 67.97 0.6148
PiSA-SR-S1 23.87 0.6058 0.2823 0.1934 25.07 4.55 0.6927 69.68 0.6400
RealSR ResShift-S15 26.31 0.7411 0.3489 0.2498 142.81 7.27 0.5450 58.10 0.5305
StableSR-S200 24.69 0.7052 0.3091 0.2167 127.20 5.76 0.6195 65.42 0.6211
DiffBIR-S50 24.88 0.6673 0.3567 0.2290 124.56 5.63 0.6412 64.66 0.6231
PASD-S20 25.22 0.6809 0.3392 0.2259 123.08 5.18 0.6502 68.74 0.6461
SeeSR-S50 25.33 0.7273 0.2985 0.2213 125.66 5.38 0.6594 69.37 0.6439
SinSR-S1 26.30 0.7354 0.3212 0.2346 137.05 6.31 0.6204 60.41 0.5389
OSEDiff-S1 25.15 0.7341 0.2921 0.2128 123.50 5.65 0.6693 69.09 0.6339
PiSA-SR-S1 25.50 0.7417 0.2672 0.2044 124.09 5.50 0.6702 70.15 0.6560
DrealSR ResShift-S15 28.45 0.7632 0.4073 0.2700 175.92 8.28 0.5259 49.86 0.4573
StableSR-S200 28.04 0.7460 0.3354 0.2287 147.03 6.51 0.6171 58.50 0.5602
DiffBIR-S50 26.84 0.6660 0.4446 0.2706 167.38 6.02 0.6292 60.68 0.5902
PASD-S20 27.48 0.7051 0.3854 0.2535 157.36 5.57 0.6714 64.55 0.6130
SeeSR-S50 28.26 0.7698 0.3197 0.2306 149.86 6.52 0.6672 64.84 0.6026
SinSR-S1 28.41 0.7495 0.3741 0.2488 177.05 7.02 0.6367 55.34 0.4898
OSEDiff-S1 27.92 0.7835 0.2968 0.2165 135.29 6.49 0.6963 64.65 0.5899
PiSA-SR-S1 28.31 0.7804 0.2960 0.2169 130.61 6.20 0.6970 66.11 0.6156

Analysis: PiSA-SR (in its default 1-step setting) consistently achieves the best scores across almost all perceptual metrics (LPIPS, DISTS, CLIPIQA, MUSIQ, MANIQA) on all three datasets. This demonstrates its superior ability to generate realistic and high-quality details. While its fidelity scores (PSNR/SSIM) are not always the highest, they are highly competitive, showcasing the effectiveness of the decoupled approach in balancing the two objectives.

Figure 5. Visual comparisons of different SR methods. The figure compares LQ inputs, ground truth, and SR results from several methods on real-world images; each row shows one scene with zoomed-in local regions, highlighting PiSA-SR's advantage in detail and quality.

Qualitative Analysis: Figure 5 provides visual evidence. While methods like ResShift and SinSR produce blurry results, and others like SeeSR can over-enhance details into artifacts, PiSA-SR generates textures (wood grain, penguin feathers) that are both detailed and natural, aligning closely with the ground truth.

Complexity Comparisons

The transcribed data from Table 3 highlights PiSA-SR's efficiency.

Table 3. The inference time and the number of parameters of DM-based SR methods.

StableSR ResShift DiffBIR PASD SeeSR SinSR OSEDiff PiSA-SR-def. PiSA-SR-adj.
Inference Steps 200 15 50 20 50 1 1 1 2
Inference time(s)/Image 10.03 0.76 2.72 2.80 4.30 0.13 0.12 0.09 0.13
#Params(B) 1.56 0.18 1.68 2.31 2.51 0.18 1.77 1.30 1.30

PiSA-SR is the fastest method (0.09s/image), even outperforming the previous fastest one-step method, OSEDiff. The adjustable version is only slightly slower (0.13s) but offers significant flexibility. It also has the fewest parameters among SD-based methods.

Ablation Studies

The supplementary material provides key insights:

  • Effectiveness of Dual-LoRA: An ablation study (Table 5) shows that using only the pixel-level LoRA results in high PSNR but poor perceptual scores. Using only the semantic-level LoRA gives excellent perceptual scores but poor fidelity. The combined PiSA-SR model provides the best balance.

  • Training Efficiency: A comparison with OSEDiff (Figure 9) shows PiSA-SR uses significantly less memory (43.87 GB vs. 56.28 GB) and is faster per training iteration (1.60s vs. 2.26s). This is attributed to the more efficient CSD loss.

    Figure 10. Visual comparisons of SR results from OSEDiff and PiSA-SR across 1 to 2000 training iterations. Figure 10. Visual comparison of training progress.

Figure 10 shows that PiSA-SR converges faster visually, restoring clear text more quickly than OSEDiff.

7. Conclusion & Reflections

Conclusion Summary

PiSA-SR presents a significant step forward in real-world image super-resolution. By ingeniously decoupling the pixel-level and semantic-level objectives using a dual-LoRA framework, it successfully navigates the perception-distortion trade-off. The method not only achieves state-of-the-art results but does so with remarkable efficiency in a single diffusion step. Its most impactful contribution is the introduction of adjustable guidance scales, which empowers users to customize SR outputs to their liking, transforming the SR model from a fixed tool into a flexible, interactive system.

Limitations & Future Work

The authors acknowledge two main limitations:

  1. The adjustable inference mode, while flexible, requires two forward passes, slightly increasing computation time compared to the default single-pass version.

  2. A single pixel-level LoRA is used to handle all types of degradation. This might be suboptimal for images with extremely complex or varied degradations.

    Future work could focus on exploring more specialized LoRA modules for different types of degradation (e.g., one for noise, one for blur) and developing image-adaptive guidance scales that automatically adjust based on the input content.

Personal Insights & Critique

PiSA-SR is an elegant and practical piece of engineering. The core idea of disentangling objectives via separate, lightweight adapters (LoRAs) is powerful and could be highly transferable to other image restoration tasks (e.g., deblurring, inpainting) that also face conflicting objectives. The shift from the complex VSD loss to the more stable and efficient CSD loss is a smart, pragmatic choice that improves the entire training pipeline.

The adjustable inference mechanism is the standout feature from a user-centric perspective. In real-world applications, user preference is paramount, and PiSA-SR directly addresses this. It bridges the gap between purely technical metrics and practical usability. The work is a strong example of how to effectively leverage the power of large pre-trained models while adding fine-grained control and task-specific specialization in a parameter-efficient manner.
