
Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

Keywords: Fidelity-Perception Trade-off, Diffusion Model Super-Resolution, Dual-LoRA Fine-Tuning Approach, Pixel-level and Semantic-level Adjustable Super-Resolution, Stable Diffusion Application
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

PiSA-SR proposes a dual-LoRA approach on Stable Diffusion, decoupling pixel-level fidelity ($\ell_2$ loss) and semantic-level perceptual quality (LPIPS/CSD losses) into distinct modules. This enables adjustable super-resolution at inference via two guidance scales, achieving flexible fidelity-perception trade-offs and leading real-world SR results in a single diffusion step.

Abstract

Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process, struggling to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences on SR results, thus it is demanded to develop an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the $\ell_2$-loss for pixel-level regression, and another is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, PiSA-SR can offer flexible SR results according to user preference without re-training. Codes and models can be found at https://github.com/csslc/PiSA-SR.

English Analysis

1. Bibliographic Information

  • Title: Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach
  • Authors: Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, Lei Zhang
  • Affiliations: The Hong Kong Polytechnic University, OPPO Research Institute
  • Journal/Conference: The paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a specific conference or journal, but it is a common way to disseminate cutting-edge research quickly.
  • Publication Year: 2024 (as per arXiv submission)
  • Abstract: The paper addresses the challenge in real-world image super-resolution (SR) where most methods struggle to balance pixel-wise fidelity and perceptual quality. Existing methods often entangle these two conflicting objectives. To solve this, the authors propose PiSA-SR, a method that uses two separate Low-Rank Adapter (LoRA) modules on top of a pre-trained Stable Diffusion (SD) model. One LoRA is trained for pixel-level regression using an $\ell_2$ loss to ensure fidelity. The other LoRA is trained for semantic-level enhancement using LPIPS and Classifier Score Distillation (CSD) losses to improve perceptual quality. This decoupled approach allows PiSA-SR to achieve state-of-the-art results in a single diffusion step. Crucially, it introduces two adjustable guidance scales at inference time, enabling users to control the balance between fidelity and perception without retraining the model.
  • Original Source Link:

2. Executive Summary

Background & Motivation (Why)

The core problem in image super-resolution (SR) is the inherent perception-distortion trade-off. Methods optimized for pixel-level accuracy (high fidelity, measured by PSNR) tend to produce overly smooth and blurry images, lacking realistic textures. Conversely, methods optimized for perceptual quality (realism, measured by metrics like LPIPS or no-reference scores) often use Generative Adversarial Networks (GANs) or Diffusion Models (DMs) to generate plausible details, but this can introduce artifacts and deviate from the original image content.

Existing diffusion-based SR methods, despite their impressive generative capabilities, suffer from several key limitations:

  1. Entangled Objectives: They typically train a single model to simultaneously optimize for both fidelity and perception, leading to a difficult balancing act and suboptimal results.

  2. Lack of Flexibility: Most models produce a fixed output style. However, different users have different preferences—some may prefer a faithful restoration, while others might want more vibrant, detailed results. A "one-size-fits-all" model is not ideal.

  3. High Computational Cost: Many diffusion-based SR methods require numerous denoising steps (e.g., 20, 50, or even 200), making them slow and impractical for real-time applications.

    This paper is motivated by the need for an SR model that is efficient, effective, and adjustable: one that explicitly disentangles the fidelity and perception objectives and caters to diverse user preferences at inference time.

Main Contributions / Findings (What)

The paper introduces PiSA-SR, a novel framework with the following key contributions:

  1. Decoupled Dual-LoRA Framework: PiSA-SR is the first to propose using two distinct LoRA modules to explicitly separate the pixel-level restoration and semantic-level enhancement tasks. This disentangles the conflicting objectives into two separate, optimizable weight spaces within a single frozen Stable Diffusion model.
  2. Efficient One-Step Residual Learning: The SR process is formulated as learning the residual between the low-quality (LQ) and high-quality (HQ) latent codes. This allows the model to be trained end-to-end and perform SR in a single diffusion step, making it significantly faster than multi-step methods.
  3. Adjustable Inference with Dual Guidance: By introducing two guidance scales, $\lambda_{pix}$ and $\lambda_{sem}$, users can dynamically control the strength of the pixel-level and semantic-level LoRAs during inference. This provides unprecedented flexibility to generate custom SR results tailored to specific needs without any retraining.
  4. State-of-the-Art Performance: In its default single-step setting, PiSA-SR achieves leading performance on standard real-world SR benchmarks, outperforming previous GAN-based and diffusion-based methods in both quantitative metrics and visual quality.

3. Prerequisite Knowledge & Related Work

Foundational Concepts

  • Image Super-Resolution (SR): The task of creating a high-resolution (HR) image from a low-resolution (LR) input. It is an "ill-posed" problem because a single LR image can correspond to many possible HR images.
  • Perception-Distortion Trade-off: A fundamental dilemma in image restoration. Distortion measures pixel-wise error (e.g., PSNR, $\ell_2$ loss), while Perception measures perceptual realism. Improving one often degrades the other. Models with low distortion are faithful but often blurry; models with high perceptual quality are realistic but may contain artifacts or hallucinated details.
  • Diffusion Models (DMs): Generative models that learn to reverse a process of gradually adding noise to an image. The forward process adds Gaussian noise over many steps. The reverse process learns to denoise the image step-by-step, starting from pure noise to generate a clean image. Stable Diffusion (SD) is a powerful, large-scale pre-trained text-to-image diffusion model that operates in a compressed latent space for efficiency.
  • Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning (PEFT) technique. Instead of retraining all the weights of a large pre-trained model (which is computationally expensive), LoRA freezes the original weights and injects small, trainable "adapter" matrices into specific layers (e.g., attention layers). These adapters learn the task-specific updates, drastically reducing the number of trainable parameters (a minimal sketch appears after this list).
  • Classifier Score Distillation (CSD): A loss function that leverages a pre-trained model as an implicit "classifier" or "judge." It guides the optimization of a generator (in this case, the SR model) by ensuring its output has a high probability under the pre-trained model's distribution for a given condition (e.g., a text prompt). This effectively distills the rich semantic knowledge from the large model into the smaller one being trained.
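
To make the LoRA mechanism concrete, below is a minimal PyTorch-style sketch (not from the paper) of a low-rank adapter wrapped around a frozen linear layer; the class name, rank, and scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank residual added on top of the frozen projection.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Only `lora_A` and `lora_B` receive gradients, which is what makes fine-tuning a large frozen backbone with two such adapter sets (pixel-level and semantic-level) cheap.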

Previous Works

The paper positions itself within the evolution of SR methods:

  1. Early CNN-based SR: Methods like SRCNN and EDSR used deep convolutional neural networks optimized with pixel-wise losses ($\ell_1$ or $\ell_2$). They achieved high PSNR but produced overly smooth results.
  2. GAN-based SR: Models like SRGAN and Real-ESRGAN introduced an adversarial loss, training a discriminator to distinguish between real HR images and generated SR images. This significantly improved perceptual quality but often led to training instability and noticeable artifacts.
  3. Diffusion-based SR: This is the current state-of-the-art.
    • Multi-step Methods (StableSR, DiffBIR, PASD, SeeSR): These methods adapt pre-trained Stable Diffusion models for SR. They typically start from Gaussian noise and perform many denoising steps, conditioned on the LQ image. While powerful, they are slow and can introduce inconsistencies.
    • Single-step Methods (OSEDiff, SinSR): To improve efficiency, these methods distill the knowledge from a multi-step diffusion process into a model that can perform SR in a single step. OSEDiff, a key predecessor, uses Variational Score Distillation (VSD) loss. However, these methods still entangle the fidelity and perception objectives.

Differentiation

PiSA-SR distinguishes itself from prior work in several crucial ways:

  • Explicit Decoupling: Unlike all previous methods that mix fidelity and perception losses in a single training process, PiSA-SR uses a dual-LoRA structure to assign each objective to a dedicated set of parameters.
  • Novel Inference Control: While some methods offer limited control (e.g., adjusting noise levels), PiSA-SR provides a more intuitive and powerful dual-guidance mechanism that directly manipulates the contributions of the fidelity and perception modules.
  • Efficiency and Stability: By using CSD loss instead of the more complex VSD loss (used in OSEDiff), PiSA-SR achieves more stable training with lower memory usage. Its single-step residual learning formulation makes it one of the fastest diffusion-based SR methods available.

4. Methodology (Core Technology & Implementation)

The core of PiSA-SR is its unique formulation of the SR problem and its decoupled training and inference pipeline.

Figure 2. Comparison of the pipelines of different DM-based SR methods. (a) Multi-step methods [2, 30, 36, 46, 57, 59] perform $T$ denoising steps starting from Gaussian noise $z_T$, conditioned on the LQ image.

Principles: Residual Learning in a Single Step

As shown in Figure 2, traditional multi-step methods (a) are slow. OSEDiff (b) introduces a single-step approach, directly mapping the LQ latent $z_L$ to the HQ latent $z_H$. PiSA-SR (c) builds on this but reframes the problem as residual learning. Instead of predicting $z_H$ directly, the model learns to predict the residual (i.e., the high-frequency details and degradation patterns) that needs to be subtracted from $z_L$ to obtain $z_H$.

The process is defined by the formula: $z_H = z_L - \lambda\, \epsilon_\theta(z_L)$

  • $z_L$ and $z_H$ are the latent representations of the LQ and HQ images, respectively, obtained from a VAE encoder.
  • $\epsilon_\theta(z_L)$ is the output of the diffusion U-Net, which is trained to predict the residual.
  • $\lambda$ is a scaling factor. It is fixed to 1 during training but can be adjusted during inference to control the restoration strength (a minimal code sketch of this one-step mapping follows the list).
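
Under this residual formulation, one inference pass is just a U-Net forward call followed by a subtraction in latent space. The sketch below is a hypothetical outline assuming generic `vae` and `unet` objects with `encode`/`decode` and callable interfaces; it is not the authors' implementation.

```python
import torch

@torch.no_grad()
def one_step_sr(x_lq: torch.Tensor, vae, unet, lam: float = 1.0) -> torch.Tensor:
    """Hypothetical one-step residual SR: z_H = z_L - lambda * eps_theta(z_L)."""
    z_l = vae.encode(x_lq)          # LQ image -> latent
    residual = unet(z_l)            # predicted residual (degradations / missing details)
    z_h = z_l - lam * residual      # apply the residual formulation
    return vae.decode(z_h)          # HQ latent -> image
```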

Steps & Procedures: Dual-LoRA Training

The training process, illustrated in Figure 3(a), is designed to disentangle the two objectives. The base SD model and VAE are frozen.

Figure 3. Training and inference process of PiSA-SR. (a) Training stage: the pixel-level LoRA and the PiSA LoRA each predict a residual; the pixel-level output is optimized with the $\ell_2$ loss ($L_{\ell_2}$), while the semantic-level output is optimized with the LPIPS loss ($L_{lpips}$) and the classifier score distillation loss ($L_{CSD}^{cls}$). (b) Inference stage: adjustable pixel-level and semantic-level guidance scales are introduced.

  1. Stage 1: Pixel-level LoRA Training.

    • A LoRA module, denoted as $\Delta\theta_{pix}$, is added to the SD U-Net.
    • This module is trained to perform pixel-level restoration. The objective is to minimize the pixel-wise difference between the reconstructed image and the ground truth.
    • Loss Function: A simple $\ell_2$ loss is used on the reconstructed images: $L_{pix} = \|D(z_H^{pix}) - x_H\|_2^2$, where $z_H^{pix} = z_L - \epsilon_{\theta_{sd} + \Delta\theta_{pix}}(z_L)$.
  2. Stage 2: Semantic-level LoRA Training.

    • The trained pixel-level LoRA ($\Delta\theta_{pix}$) is frozen.
    • A second, new LoRA module, $\Delta\theta_{sem}$, is introduced.
    • Only $\Delta\theta_{sem}$ is trained in this stage. The full set of active parameters is $\theta_{PiSA} = \{\theta_{sd}, \Delta\theta_{pix}, \Delta\theta_{sem}\}$.
    • The goal is to learn how to add rich, semantic details on top of the base restoration provided by the pixel-level LoRA.
    • Loss Functions: A combination of two perceptual losses is used:
      • LPIPS Loss: Measures perceptual similarity in a deep feature space.
      • Classifier Score Distillation (CSD) Loss: This is the key to generating high-quality semantic details (a two-stage training sketch follows this list).
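
The two training stages described above can be summarized in the following pseudo-PyTorch sketch. Names such as `unet_with(...)`, `freeze(...)`, `lpips_loss`, and `csd_loss` are placeholders for the corresponding modules and losses, so this is a structural sketch rather than the paper's code.

```python
def train_pisa_sr(loader, vae, unet_with, pixel_lora, semantic_lora,
                  opt_pix, opt_sem, lpips_loss, csd_loss, freeze, prompt=None):
    # Stage 1: train only the pixel-level LoRA with an L2 loss on decoded images.
    for x_lq, x_hq in loader:
        z_l = vae.encode(x_lq)
        z_h_pix = z_l - unet_with(pixel_lora)(z_l)            # residual prediction
        loss = ((vae.decode(z_h_pix) - x_hq) ** 2).mean()     # pixel-level L2 loss
        loss.backward(); opt_pix.step(); opt_pix.zero_grad()

    # Stage 2: freeze the pixel-level LoRA and train the semantic-level LoRA
    # with LPIPS + classifier score distillation (CSD) losses.
    freeze(pixel_lora)
    for x_lq, x_hq in loader:
        z_l = vae.encode(x_lq)
        z_h_sem = z_l - unet_with(pixel_lora, semantic_lora)(z_l)
        loss = lpips_loss(vae.decode(z_h_sem), x_hq) + csd_loss(z_h_sem, prompt)
        loss.backward(); opt_sem.step(); opt_sem.zero_grad()
```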

Mathematical Formulas & Key Details

The semantic-level enhancement relies heavily on the CSD loss. The gradient of the CSD loss is formulated as: $\nabla \ell_{CSD}^{\lambda_{cfg}} = \mathbb{E}_{t, \epsilon, z_t, c} \left[ w_t \left( f(z_t, \epsilon_{real}) - f(z_t, \epsilon_{real}^{\lambda_{cfg}}) \right) \frac{\partial z_H^{sem}}{\partial \theta_{PiSA}} \right]$

  • Intuition: This formula pushes the SR output $z_H^{sem}$ to be more aligned with what the pre-trained SD model considers a high-quality image given a text condition $c$.

  • $f(z_t, \epsilon)$ is a function that predicts the clean latent from a noisy one.

  • $\epsilon_{real}$ is the noise prediction from the unconditional pre-trained SD model.

  • $\epsilon_{real}^{\lambda_{cfg}}$ is the noise prediction from the pre-trained SD model using Classifier-Free Guidance (CFG), which enhances the influence of the text condition $c$.

  • The term $f(z_t, \epsilon_{real}) - f(z_t, \epsilon_{real}^{\lambda_{cfg}})$ represents the semantic "correction" suggested by the pre-trained SD model. The loss minimizes this difference, effectively distilling SD's semantic knowledge into the PiSA-LoRA module.

Figure 4. The model outputs with pixel-wise and semantic-level losses for a given LQ image.

Figure 4 provides a visual justification for the choice of losses. The $\ell_2$ loss alone ($D(\epsilon_{\theta_{pix}}(z_L))$) removes degradation but results in a smooth image. The CSD loss component is shown to be highly effective at adding semantic details, while the VSD component used by prior work (OSEDiff) can weaken them. The term $D(\epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L))$ visualizes the isolated semantic enhancement, which is clean of pixel-level artifacts.
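
As an illustration of how the CSD gradient above could be turned into a trainable loss, here is a hedged sketch: the SR latent is diffused to a random timestep $t$, the frozen SD U-Net is queried with and without classifier-free guidance, and the difference of the predicted clean latents is injected as a constant gradient through a surrogate loss. The helper `frozen_sd_eps` and the weighting `w_t` are assumptions, not the paper's API.

```python
import torch

def csd_surrogate_loss(z_h_sem, t, prompt_emb, null_emb, frozen_sd_eps,
                       alpha_bar_t, w_t: float = 1.0, cfg_scale: float = 7.5):
    """Sketch of a CSD-style loss: gradient = w_t * (f(z_t, eps_uncond) - f(z_t, eps_cfg))."""
    noise = torch.randn_like(z_h_sem)
    sqrt_ab, sqrt_1m = alpha_bar_t.sqrt(), (1.0 - alpha_bar_t).sqrt()
    z_t = sqrt_ab * z_h_sem + sqrt_1m * noise                 # diffuse the SR latent to step t

    def pred_x0(eps):                                         # f(z_t, eps): predicted clean latent
        return (z_t - sqrt_1m * eps) / sqrt_ab

    with torch.no_grad():
        eps_uncond = frozen_sd_eps(z_t, t, null_emb)          # unconditional prediction
        eps_cond = frozen_sd_eps(z_t, t, prompt_emb)          # text-conditioned prediction
        eps_cfg = eps_uncond + cfg_scale * (eps_cond - eps_uncond)   # classifier-free guidance
        grad = w_t * (pred_x0(eps_uncond) - pred_x0(eps_cfg))

    # Surrogate loss whose gradient w.r.t. z_h_sem equals `grad`.
    return (grad * z_h_sem).sum()
```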

Inference: Default and Adjustable Settings

As shown in Figure 3(b), inference can be done in two ways:

  1. Default Setting: The pixel-level and semantic-level LoRAs are merged into a single PiSA-LoRA. The SR is performed in one step, making it very fast. This corresponds to setting $\lambda_{pix}=1$ and $\lambda_{sem}=1$.
  2. Adjustable Setting: This mode uses the dual guidance scales. The final residual is computed as (see the code sketch after this list): $\epsilon_{\theta}(z_L) = \lambda_{pix}\, \epsilon_{\theta_{pix}}(z_L) + \lambda_{sem} \left( \epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L) \right)$
    • $\epsilon_{\theta_{pix}}(z_L)$ is the output using only the pixel-level LoRA (the fidelity component).
    • $\epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L)$ is the difference between the full model output and the pixel-only output. This term isolates the pure semantic enhancement.
    • By adjusting $\lambda_{pix}$ and $\lambda_{sem}$, the user can mix and match the amount of fidelity and perceptual detail in the final result.
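
A minimal sketch of the adjustable inference, assuming two hypothetical U-Net wrappers: `unet_pix` with only the pixel-level LoRA active and `unet_pisa` with both LoRAs active:

```python
import torch

@torch.no_grad()
def adjustable_sr(x_lq, vae, unet_pix, unet_pisa,
                  lam_pix: float = 1.0, lam_sem: float = 1.0):
    """Mix pixel-level and semantic-level residuals with two guidance scales."""
    z_l = vae.encode(x_lq)
    eps_pix = unet_pix(z_l)                      # fidelity component
    eps_sem = unet_pisa(z_l) - eps_pix           # isolated semantic enhancement
    eps = lam_pix * eps_pix + lam_sem * eps_sem  # user-controlled mixture
    return vae.decode(z_l - eps)
```

Setting `lam_pix = lam_sem = 1.0` recovers the default single-pass behavior (up to the extra forward pass), while other values trade fidelity against detail richness.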

5. Experimental Setup

  • Datasets:
    • Training: A combination of the LSDIR dataset and the first 10,000 images from the FFHQ dataset. LQ images were synthesized using the degradation pipeline from Real-ESRGAN.
    • Testing:
      • Synthetic: 3000 images from the DIV2K dataset, degraded with the same Real-ESRGAN pipeline.
      • Real-World: Images from the RealSR and DrealSR datasets.
  • Evaluation Metrics:
    • Fidelity Metrics (Reference-based):
      1. PSNR (Peak Signal-to-Noise Ratio): Measures pixel-wise error between the SR output and the ground-truth (GT) image. Higher is better (a minimal code sketch of PSNR appears after this list). $\text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right)$
        • $\text{MAX}_I$ is the maximum possible pixel value (e.g., 255 for 8-bit images).
        • $\text{MSE}$ is the Mean Squared Error between the SR and GT images.
      2. SSIM (Structural Similarity Index): Measures similarity in terms of luminance, contrast, and structure. Closer to 1 is better. $\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
        • $\mu_x, \mu_y$ are the means of images $x$ and $y$.
        • $\sigma_x, \sigma_y$ are the standard deviations.
        • $\sigma_{xy}$ is the covariance.
        • $c_1, c_2$ are small constants for stability.
    • Perceptual Metrics (Reference-based):
      1. LPIPS (Learned Perceptual Image Patch Similarity): Calculates the distance between deep features of two images extracted from a pre-trained network (e.g., VGG). Lower is better.
      2. DISTS (Deep Image Structure and Texture Similarity): Another perceptual metric that unifies structure and texture similarity assessment. Lower is better.
      3. FID (Fréchet Inception Distance): Measures the distance between the distribution of features from generated images and real images, extracted from an InceptionV3 network. Lower indicates the distributions are more similar.
    • Perceptual Metrics (No-Reference):
      1. NIQE (Natural Image Quality Evaluator): Measures the quality of an image by comparing its statistical properties to those of natural images. Lower is better.
      2. CLIPIQA: A no-reference metric that leverages the CLIP model to assess image quality. Higher is better.
      3. MUSIQ: A Transformer-based no-reference metric that assesses quality at multiple scales. Higher is better.
      4. MANIQA: An attention-based no-reference metric for image quality assessment. Higher is better.
  • Baselines: The paper compares PiSA-SR against a comprehensive set of recent methods:
    • GAN-based: RealESRGAN, BSRGAN, LDL.
    • Multi-step DM-based: StableSR, ResShift, DiffBIR, PASD, SeeSR.
    • One-step DM-based: SinSR, OSEDiff.
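
For reference, a minimal NumPy implementation of the PSNR metric defined above (illustrative only; the paper's evaluation presumably relies on standard IQA toolboxes):

```python
import numpy as np

def psnr(sr: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR between an SR output and its ground truth (arrays of the same shape)."""
    mse = np.mean((sr.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```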

6. Results & Analysis

Core Results: Adjustable Super-Resolution

Figure 1. Visual illustration of the pixel- and semantic-level adjustable method for real-world SR. Increasing the pixel-level guidance scale $\lambda_{pix}$ suppresses image degradations such as noise and artifacts; increasing the semantic-level scale $\lambda_{sem}$ enriches semantic details.

Figure 1 and Table 1 showcase the core innovation: adjustability.

  • Varying Pixel-level Scale ($\lambda_{pix}$): As seen in Figure 1 (moving up the vertical axis), increasing $\lambda_{pix}$ effectively removes noise and artifacts. However, as Table 1 shows, PSNR peaks at $\lambda_{pix}=0.5$ and then decreases, while LPIPS is best at $\lambda_{pix}=0.8$. A very high value (e.g., 1.5) leads to over-smoothing and loss of detail.

  • Varying Semantic-level Scale ($\lambda_{sem}$): As seen in Figure 1 (moving across the horizontal axis), increasing $\lambda_{sem}$ generates richer semantic details, like wrinkles and hair texture. Table 1 shows that this consistently improves no-reference metrics like MUSIQ. However, it degrades PSNR (fidelity) and can worsen LPIPS if pushed too far, as it may introduce unrealistic artifacts.

    Here is the manually transcribed data from Table 1.

Table 1. Results of PiSA-SR with different pixel-semantic guidance scales on the RealSR test dataset.

$\lambda_{pix}$ $\lambda_{sem}$ PSNR↑ LPIPS↓ CLIPIQA↑ MUSIQ↑
0.0 1.0 25.96 0.3426 0.4129 46.45
0.2 1.0 26.48 0.3042 0.4868 54.05
0.5 1.0 26.75 0.2646 0.5705 63.82
0.8 1.0 26.18 0.2612 0.6292 68.95
1.0 1.0 25.50 0.2672 0.6702 70.15
1.2 1.0 24.76 0.2723 0.6746 70.33
1.5 1.0 23.74 0.2769 0.6305 69.23
1.0 0.0 26.92 0.3018 0.3227 49.62
1.0 0.2 26.95 0.2784 0.3591 53.64
1.0 0.5 26.77 0.2476 0.4322 58.76
1.0 0.8 26.20 0.2465 0.5806 66.33
1.0 1.0 25.50 0.2672 0.6702 70.15
1.0 1.2 24.59 0.3000 0.7015 71.60
1.0 1.5 23.08 0.3541 0.6835 71.76

Core Results: Comparisons with State-of-the-Arts

Below is the transcribed data from Table 2.

Table 2. Quantitative comparison with DM-based SR methods. S denotes the number of steps.

Datasets Methods PSNR↑ SSIM↑ LPIPS↓ DISTS↓ FID↓ NIQE↓ CLIPIQA↑ MUSIQ↑ MANIQA↑
DIV2K ResShift-S15 24.69 0.6175 0.3374 0.2215 36.01 6.82 0.6089 60.92 0.5450
StableSR-S200 23.31 0.5728 0.3129 0.2138 24.67 4.76 0.6682 65.63 0.6188
DiffBIR-S50 23.67 0.5653 0.3541 0.2129 30.93 4.71 0.6652 65.66 0.6204
PASD-S20 23.14 0.5489 0.3607 0.2219 29.32 4.40 0.6711 68.83 0.6484
SeeSR-S50 23.71 0.6045 0.3207 0.1967 25.83 4.82 0.6857 68.49 0.6239
SinSR-S1 24.43 0.6012 0.3262 0.2066 35.45 6.02 0.6499 62.80 0.5395
OSEDiff-S1 23.72 0.6108 0.2941 0.1976 26.32 4.71 0.6683 67.97 0.6148
PiSA-SR-S1 23.87 0.6058 0.2823 0.1934 25.07 4.55 0.6927 69.68 0.6400
RealSR ResShift-S15 26.31 0.7411 0.3489 0.2498 142.81 7.27 0.5450 58.10 0.5305
StableSR-S200 24.69 0.7052 0.3091 0.2167 127.20 5.76 0.6195 65.42 0.6211
DiffBIR-S50 24.88 0.6673 0.3567 0.2290 124.56 5.63 0.6412 64.66 0.6231
PASD-S20 25.22 0.6809 0.3392 0.2259 123.08 5.18 0.6502 68.74 0.6461
SeeSR-S50 25.33 0.7273 0.2985 0.2213 125.66 5.38 0.6594 69.37 0.6439
SinSR-S1 26.30 0.7354 0.3212 0.2346 137.05 6.31 0.6204 60.41 0.5389
OSEDiff-S1 25.15 0.7341 0.2921 0.2128 123.50 5.65 0.6693 69.09 0.6339
PiSA-SR-S1 25.50 0.7417 0.2672 0.2044 124.09 5.50 0.6702 70.15 0.6560
DrealSR ResShift-S15 28.45 0.7632 0.4073 0.2700 175.92 8.28 0.5259 49.86 0.4573
StableSR-S200 28.04 0.7460 0.3354 0.2287 147.03 6.51 0.6171 58.50 0.5602
DiffBIR-S50 26.84 0.6660 0.4446 0.2706 167.38 6.02 0.6292 60.68 0.5902
PASD-S20 27.48 0.7051 0.3854 0.2535 157.36 5.57 0.6714 64.55 0.6130
SeeSR-S50 28.26 0.7698 0.3197 0.2306 149.86 6.52 0.6672 64.84 0.6026
SinSR-S1 28.41 0.7495 0.3741 0.2488 177.05 7.02 0.6367 55.34 0.4898
OSEDiff-S1 27.92 0.7835 0.2968 0.2165 135.29 6.49 0.6963 64.65 0.5899
PiSA-SR-S1 28.31 0.7804 0.2960 0.2169 130.61 6.20 0.6970 66.11 0.6156

Analysis: PiSA-SR (in its default 1-step setting) consistently achieves the best scores across almost all perceptual metrics (LPIPS, DISTS, CLIPIQA, MUSIQ, MANIQA) on all three datasets. This demonstrates its superior ability to generate realistic and high-quality details. While its fidelity scores (PSNR/SSIM) are not always the highest, they are highly competitive, showcasing the effectiveness of the decoupled approach in balancing the two objectives.

Figure 5. Visual comparisons of different SR methods. The figure compares LQ inputs, ground truth, and SR results from several methods on real-world images; each row shows one scene with zoomed-in local regions, highlighting PiSA-SR's advantage in detail and quality.

Qualitative Analysis: Figure 5 provides visual evidence. While methods like ResShift and SinSR produce blurry results, and others like SeeSR can over-enhance details into artifacts, PiSA-SR generates textures (wood grain, penguin feathers) that are both detailed and natural, aligning closely with the ground truth.

Complexity Comparisons

The transcribed data from Table 3 highlights PiSA-SR's efficiency.

Table 3. The inference time and the number of parameters of DM-based SR methods.

StableSR ResShift DiffBIR PASD SeeSR SinSR OSEDiff PiSA-SR-def. PiSA-SR-adj.
Inference Steps 200 15 50 20 50 1 1 1 2
Inference time(s)/Image 10.03 0.76 2.72 2.80 4.30 0.13 0.12 0.09 0.13
#Params(B) 1.56 0.18 1.68 2.31 2.51 0.18 1.77 1.30 1.30

PiSA-SR is the fastest method (0.09s/image), even outperforming the previous fastest one-step method, OSEDiff. The adjustable version is only slightly slower (0.13s) but offers significant flexibility. It also has the fewest parameters among SD-based methods.

Ablation Studies

The supplementary material provides key insights:

  • Effectiveness of Dual-LoRA: An ablation study (Table 5) shows that using only the pixel-level LoRA results in high PSNR but poor perceptual scores. Using only the semantic-level LoRA gives excellent perceptual scores but poor fidelity. The combined PiSA-SR model provides the best balance.

  • Training Efficiency: A comparison with OSEDiff (Figure 9) shows PiSA-SR uses significantly less memory (43.87 GB vs. 56.28 GB) and is faster per training iteration (1.60s vs. 2.26s). This is attributed to the more efficient CSD loss.

    Figure 10. Visual comparisons of SR results from OSEDiff and PiSA-SR across 1 to 2000 training iterations. Figure 10. Visual comparison of training progress.

Figure 10 shows that PiSA-SR converges faster visually, restoring clear text more quickly than OSEDiff.

7. Conclusion & Reflections

Conclusion Summary

PiSA-SR presents a significant step forward in real-world image super-resolution. By ingeniously decoupling the pixel-level and semantic-level objectives using a dual-LoRA framework, it successfully navigates the perception-distortion trade-off. The method not only achieves state-of-the-art results but does so with remarkable efficiency in a single diffusion step. Its most impactful contribution is the introduction of adjustable guidance scales, which empowers users to customize SR outputs to their liking, transforming the SR model from a fixed tool into a flexible, interactive system.

Limitations & Future Work

The authors acknowledge two main limitations:

  1. The adjustable inference mode, while flexible, requires two forward passes, slightly increasing computation time compared to the default single-pass version.

  2. A single pixel-level LoRA is used to handle all types of degradation. This might be suboptimal for images with extremely complex or varied degradations.

    Future work could focus on exploring more specialized LoRA modules for different types of degradation (e.g., one for noise, one for blur) and developing image-adaptive guidance scales that automatically adjust based on the input content.

Personal Insights & Critique

PiSA-SR is an elegant and practical piece of engineering. The core idea of disentangling objectives via separate, lightweight adapters (LoRAs) is powerful and could be highly transferable to other image restoration tasks (e.g., deblurring, inpainting) that also face conflicting objectives. The shift from the complex VSD loss to the more stable and efficient CSD loss is a smart, pragmatic choice that improves the entire training pipeline.

The adjustable inference mechanism is the standout feature from a user-centric perspective. In real-world applications, user preference is paramount, and PiSA-SR directly addresses this. It bridges the gap between purely technical metrics and practical usability. The work is a strong example of how to effectively leverage the power of large pre-trained models while adding fine-grained control and task-specific specialization in a parameter-efficient manner.
