- Title: Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach
- Authors: Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, Lei Zhang
- Affiliations: The Hong Kong Polytechnic University, OPPO Research Institute
- Journal/Conference: The paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a specific conference or journal, but it is a common way to disseminate cutting-edge research quickly.
- Publication Year: 2024 (as per arXiv submission)
- Abstract: The paper addresses the challenge in real-world image super-resolution (SR) where most methods struggle to balance pixel-wise fidelity and perceptual quality. Existing methods often entangle these two conflicting objectives. To solve this, the authors propose PiSA-SR, a method that uses two separate Low-Rank Adapter (LoRA) modules on top of a pre-trained Stable Diffusion (SD) model. One LoRA is trained for pixel-level regression using an ℓ2 loss to ensure fidelity. The other LoRA is trained for semantic-level enhancement using LPIPS and Classifier Score Distillation (CSD) losses to improve perceptual quality. This decoupled approach allows PiSA-SR to achieve state-of-the-art results in a single diffusion step. Crucially, it introduces two adjustable guidance scales at inference time, enabling users to control the balance between fidelity and perception without retraining the model.
- Original Source Link:
2. Executive Summary
Background & Motivation (Why)
The core problem in image super-resolution (SR) is the inherent perception-distortion trade-off. Methods optimized for pixel-level accuracy (high fidelity, measured by PSNR) tend to produce overly smooth and blurry images, lacking realistic textures. Conversely, methods optimized for perceptual quality (realism, measured by metrics like LPIPS or no-reference scores) often use Generative Adversarial Networks (GANs) or Diffusion Models (DMs) to generate plausible details, but this can introduce artifacts and deviate from the original image content.
Existing diffusion-based SR methods, despite their impressive generative capabilities, suffer from several key limitations:
- Entangled Objectives: They typically train a single model to simultaneously optimize for both fidelity and perception, leading to a difficult balancing act and suboptimal results.
- Lack of Flexibility: Most models produce a fixed output style. However, different users have different preferences: some may prefer a faithful restoration, while others might want more vibrant, detailed results. A "one-size-fits-all" model is not ideal.
- High Computational Cost: Many diffusion-based SR methods require numerous denoising steps (e.g., 20, 50, or even 200), making them slow and impractical for real-time applications.
This paper is motivated by the need for an SR model that is efficient, effective, and adjustable, allowing it to explicitly disentangle the fidelity and perception objectives and cater to diverse user preferences at inference time.
Main Contributions / Findings (What)
The paper introduces PiSA-SR, a novel framework with the following key contributions:
- Decoupled Dual-LoRA Framework: PiSA-SR is the first to propose using two distinct LoRA modules to explicitly separate the pixel-level restoration and semantic-level enhancement tasks. This disentangles the conflicting objectives into two separate, optimizable weight spaces within a single frozen Stable Diffusion model.
- Efficient One-Step Residual Learning: The SR process is formulated as learning the residual between the low-quality (LQ) and high-quality (HQ) latent codes. This allows the model to be trained end-to-end and perform SR in a single diffusion step, making it significantly faster than multi-step methods.
- Adjustable Inference with Dual Guidance: By introducing two guidance scales, λpix and λsem, users can dynamically control the strength of the pixel-level and semantic-level LoRAs during inference. This provides unprecedented flexibility to generate custom SR results tailored to specific needs without any retraining.
- State-of-the-Art Performance: In its default single-step setting, PiSA-SR achieves leading performance on standard real-world SR benchmarks, outperforming previous GAN-based and diffusion-based methods in both quantitative metrics and visual quality.
Foundational Concepts
- Image Super-Resolution (SR): The task of creating a high-resolution (HR) image from a low-resolution (LR) input. It is an "ill-posed" problem because a single LR image can correspond to many possible HR images.
- Perception-Distortion Trade-off: A fundamental dilemma in image restoration. Distortion measures pixel-wise error (e.g., PSNR, ℓ2 loss), while Perception measures perceptual realism. Improving one often degrades the other. Models with low distortion are faithful but often blurry; models with high perceptual quality are realistic but may contain artifacts or hallucinated details.
- Diffusion Models (DMs): Generative models that learn to reverse a process of gradually adding noise to an image. The forward process adds Gaussian noise over many steps. The reverse process learns to denoise the image step-by-step, starting from pure noise to generate a clean image. Stable Diffusion (SD) is a powerful, large-scale pre-trained text-to-image diffusion model that operates in a compressed latent space for efficiency.
- Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning (PEFT) technique. Instead of retraining all the weights of a large pre-trained model (which is computationally expensive), LoRA freezes the original weights and injects small, trainable "adapter" matrices into specific layers (e.g., attention layers). These adapters learn the task-specific updates, drastically reducing the number of trainable parameters. (A minimal code sketch of this idea follows this list.)
- Classifier Score Distillation (CSD): A loss function that leverages a pre-trained model as an implicit "classifier" or "judge." It guides the optimization of a generator (in this case, the SR model) by ensuring its output has a high probability under the pre-trained model's distribution for a given condition (e.g., a text prompt). This effectively distills the rich semantic knowledge from the large model into the smaller one being trained.
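To make the LoRA idea referenced above concrete, below is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. This is an illustrative toy, not the authors' implementation; the class name `LoRALinear` and the rank and scaling choices are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the pre-trained weights stay frozen
            p.requires_grad = False
        # A is small random, B is zero, so the adapter starts as a no-op
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: wrap a projection layer of a frozen backbone; only lora_A/lora_B get gradients.
layer = LoRALinear(nn.Linear(320, 320))
out = layer(torch.randn(2, 77, 320))
```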
Previous Works
The paper positions itself within the evolution of SR methods:
- Early CNN-based SR: Methods like SRCNN and EDSR used deep convolutional neural networks optimized with pixel-wise losses (ℓ1 or ℓ2). They achieved high PSNR but produced overly smooth results.
- GAN-based SR: Models like SRGAN and Real-ESRGAN introduced an adversarial loss, training a discriminator to distinguish between real HR images and generated SR images. This significantly improved perceptual quality but often led to training instability and noticeable artifacts.
- Diffusion-based SR: This is the current state-of-the-art.
- Multi-step Methods (StableSR, DiffBIR, PASD, SeeSR): These methods adapt pre-trained Stable Diffusion models for SR. They typically start from Gaussian noise and perform many denoising steps, conditioned on the LQ image. While powerful, they are slow and can introduce inconsistencies.
- Single-step Methods (OSEDiff, SinSR): To improve efficiency, these methods distill the knowledge of a multi-step diffusion process into a model that performs SR in a single step. OSEDiff, a key predecessor, uses a Variational Score Distillation (VSD) loss. However, these methods still entangle the fidelity and perception objectives.
Differentiation
PiSA-SR distinguishes itself from prior work in several crucial ways:
- Explicit Decoupling: Unlike all previous methods that mix fidelity and perception losses in a single training process, PiSA-SR uses a dual-LoRA structure to assign each objective to a dedicated set of parameters.
- Novel Inference Control: While some methods offer limited control (e.g., adjusting noise levels), PiSA-SR provides a more intuitive and powerful dual-guidance mechanism that directly manipulates the contributions of the fidelity and perception modules.
- Efficiency and Stability: By using the CSD loss instead of the more complex VSD loss (used in OSEDiff), PiSA-SR achieves more stable training with lower memory usage. Its single-step residual learning formulation makes it one of the fastest diffusion-based SR methods available.
4. Methodology (Core Technology & Implementation)
The core of PiSA-SR is its unique formulation of the SR problem and its decoupled training and inference pipeline.
Figure 2. Comparison of the pipeline of different DM-based SR methods.
Principles: Residual Learning in a Single Step
As shown in Figure 2, traditional multi-step methods (a) are slow. OSEDiff (b) introduces a single-step approach, directly mapping the LQ latent $z_L$ to the HQ latent $z_H$. PiSA-SR (c) builds on this but reframes the problem as residual learning. Instead of predicting $z_H$ directly, the model learns to predict the residual (i.e., the high-frequency details and degradation patterns) that needs to be subtracted from $z_L$ to obtain $z_H$.
The process is defined by the formula:
$$z_H = z_L - \lambda\,\epsilon_\theta(z_L)$$
- $z_L$ and $z_H$ are the latent representations of the LQ and HQ images, respectively, obtained from a VAE encoder.
- $\epsilon_\theta(z_L)$ is the output of the diffusion U-Net, which is trained to predict the residual.
- $\lambda$ is a scaling factor. It is fixed to 1 during training but can be adjusted during inference to control the restoration strength.
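To make the one-step residual formulation concrete, here is a minimal inference sketch. The callables `vae_encoder`, `unet_residual`, and `vae_decoder` are placeholders standing in for the frozen SD VAE and the LoRA-augmented U-Net; they are assumptions for illustration, not the paper's actual API.

```python
import torch

@torch.no_grad()
def one_step_sr(x_lq, vae_encoder, unet_residual, vae_decoder, lam: float = 1.0):
    """One-step residual SR: z_H = z_L - lam * eps_theta(z_L)."""
    z_l = vae_encoder(x_lq)          # encode the LQ image into the SD latent space
    residual = unet_residual(z_l)    # LoRA-augmented U-Net predicts the residual
    z_h = z_l - lam * residual       # lam is 1 during training, tunable at inference
    return vae_decoder(z_h)          # decode the HQ latent back to an image
```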
Steps & Procedures: Dual-LoRA Training
The training process, illustrated in Figure 3(a), is designed to disentangle the two objectives. The base SD model and VAE are frozen.
Figure 3. Training and inference process of PiSA-SR.
- Stage 1: Pixel-level LoRA Training.
- A LoRA module, denoted as $\Delta\theta_{pix}$, is added to the SD U-Net.
- This module is trained to perform pixel-level restoration. The objective is to minimize the pixel-wise difference between the reconstructed image and the ground truth.
- Loss Function: A simple $\ell_2$ loss is used on the reconstructed images: $\mathcal{L}_{pix} = \|D(z_H^{pix}) - x_H\|_2^2$, where $z_H^{pix} = z_L - \epsilon_{\theta_{sd}+\Delta\theta_{pix}}(z_L)$ and $D(\cdot)$ is the VAE decoder.
- Stage 2: Semantic-level LoRA Training.
- The trained pixel-level LoRA ($\Delta\theta_{pix}$) is frozen.
- A second, new LoRA module, $\Delta\theta_{sem}$, is introduced.
- Only $\Delta\theta_{sem}$ is trained in this stage. The full set of active parameters is $\theta_{PiSA} = \{\theta_{sd}, \Delta\theta_{pix}, \Delta\theta_{sem}\}$.
- The goal is to learn how to add rich, semantic details on top of the base restoration provided by the pixel-level LoRA.
- Loss Functions: A combination of two perceptual losses is used:
- LPIPS Loss: Measures perceptual similarity in a deep feature space.
- Classifier Score Distillation (CSD) Loss: This is the key for generating high-quality semantic details.
The semantic-level enhancement relies heavily on the CSD loss. The gradient of the CSD loss is formulated as:
$$\nabla_{\theta_{PiSA}} \ell_{CSD}^{\lambda_{cfg}} = \mathbb{E}_{t,\epsilon,z_t,c}\!\left[ w_t \left( f(z_t, \epsilon_{real}) - f(z_t, \epsilon_{real}^{\lambda_{cfg}}) \right) \frac{\partial z_H^{sem}}{\partial \theta_{PiSA}} \right]$$
- Intuition: This formula pushes the SR output $z_H^{sem}$ to be more aligned with what the pre-trained SD model considers a high-quality image given a text condition $c$.
- $f(z_t, \epsilon)$ is a function that predicts the clean latent from a noisy one.
- $\epsilon_{real}$ is the noise prediction from the unconditional pre-trained SD model.
- $\epsilon_{real}^{\lambda_{cfg}}$ is the noise prediction from the pre-trained SD model using Classifier-Free Guidance (CFG) with scale $\lambda_{cfg}$, which enhances the influence of the text condition $c$.
- The term $f(z_t, \epsilon_{real}) - f(z_t, \epsilon_{real}^{\lambda_{cfg}})$ represents the semantic "correction" suggested by the pre-trained SD model. The loss minimizes this difference, effectively distilling SD's semantic knowledge into the PiSA-LoRA module.
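To illustrate how such a score-distillation objective can be implemented, below is a hedged sketch assuming a diffusers-style frozen SD U-Net (called as `sd_unet(z_t, t, encoder_hidden_states=...)` with a `.sample` field) and a scheduler exposing `add_noise`. It uses the common surrogate-loss trick and the noise-prediction form of the classifier direction; the timestep weighting $w_t$ and the clean-latent mapping $f(\cdot)$ from the formula above are omitted, so this is a simplification, not the paper's implementation.

```python
import torch

def csd_loss_sketch(z_h_sem, sd_unet, text_emb, null_emb, scheduler, cfg_scale=7.5):
    """Surrogate loss whose gradient w.r.t. z_h_sem follows the frozen SD model's
    classifier direction (conditional minus unconditional noise prediction)."""
    b = z_h_sem.shape[0]
    t = torch.randint(20, 980, (b,), device=z_h_sem.device)   # random diffusion timesteps
    noise = torch.randn_like(z_h_sem)
    z_t = scheduler.add_noise(z_h_sem, noise, t)               # noise the SR latent to level t

    with torch.no_grad():                                      # the teacher SD model is frozen
        eps_cond = sd_unet(z_t, t, encoder_hidden_states=text_emb).sample
        eps_null = sd_unet(z_t, t, encoder_hidden_states=null_emb).sample
        grad = cfg_scale * (eps_cond - eps_null)               # semantic "correction" direction

    # Backpropagating this term injects `grad` into z_h_sem and, through it,
    # into the trainable PiSA-LoRA parameters that produced z_h_sem.
    return (grad.detach() * z_h_sem).sum() / b
```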
Figure 4. Visualizing the effect of different losses.
Figure 4 provides a visual justification for the choice of losses. The $\ell_2$ loss alone ($D(\epsilon_{\theta_{pix}}(z_L))$) removes degradation but results in a smooth image. The CSD loss component is shown to be highly effective at adding semantic details, while the VSD component used by prior work (OSEDiff) can weaken them. The term $D(\epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L))$ visualizes the isolated semantic enhancement, which is free of pixel-level artifacts.
Inference: Default and Adjustable Settings
As shown in Figure 3(b), inference can be done in two ways:
- Default Setting: The pixel-level and semantic-level LoRAs are merged into a single PiSA-LoRA, and SR is performed in one step, making it very fast. This corresponds to setting $\lambda_{pix} = 1$ and $\lambda_{sem} = 1$.
- Adjustable Setting: This mode uses the dual guidance scales. The final residual is computed as:
$$\epsilon_\theta(z_L) = \lambda_{pix}\,\epsilon_{\theta_{pix}}(z_L) + \lambda_{sem}\left(\epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L)\right)$$
- $\epsilon_{\theta_{pix}}(z_L)$ is the output using only the pixel-level LoRA (the fidelity component).
- $\epsilon_{\theta_{PiSA}}(z_L) - \epsilon_{\theta_{pix}}(z_L)$ is the difference between the full model output and the pixel-only output; this term isolates the pure semantic enhancement.
- By adjusting $\lambda_{pix}$ and $\lambda_{sem}$, the user can mix and match the amount of fidelity and perceptual detail in the final result.
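A minimal sketch of this blending at inference time, with placeholder callables (`unet_pix` for the U-Net with only the pixel-level LoRA, `unet_pisa` for the U-Net with both LoRAs); these names are assumptions for illustration.

```python
import torch

@torch.no_grad()
def adjustable_sr(x_lq, vae_encoder, vae_decoder, unet_pix, unet_pisa,
                  lam_pix: float = 1.0, lam_sem: float = 1.0):
    """Blend the pixel-level and semantic-level residuals with two guidance scales."""
    z_l = vae_encoder(x_lq)
    eps_pix = unet_pix(z_l)        # fidelity component (pixel-level LoRA only)
    eps_pisa = unet_pisa(z_l)      # full output (pixel-level + semantic-level LoRAs)
    eps = lam_pix * eps_pix + lam_sem * (eps_pisa - eps_pix)
    return vae_decoder(z_l - eps)  # single residual subtraction, as in the default setting
```

With $\lambda_{pix} = \lambda_{sem} = 1$ this reduces to the default setting, where the two LoRAs can be merged and only one U-Net pass is needed; the adjustable mode requires two passes, which matches the slightly higher inference time reported in Table 3.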
5. Experimental Setup
- Datasets:
- Training: A combination of the LSDIR dataset and the first 10,000 images from the FFHQ dataset. LQ images were synthesized using the degradation pipeline from Real-ESRGAN.
- Testing:
- Synthetic: 3000 images from the DIV2K dataset, degraded with the same Real-ESRGAN pipeline.
- Real-World: Images from the RealSR and DrealSR datasets.
- Evaluation Metrics:
- Fidelity Metrics (Reference-based):
- PSNR (Peak Signal-to-Noise Ratio): Measures pixel-wise error between the SR output and the ground-truth (GT) image. Higher is better.
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$
- $\mathrm{MAX}_I$ is the maximum possible pixel value (e.g., 255 for 8-bit images).
- $\mathrm{MSE}$ is the Mean Squared Error between the SR and GT images (a small numeric example of this computation appears after the baseline list below).
- SSIM (Structural Similarity Index): Measures similarity in terms of luminance, contrast, and structure. Closer to 1 is better.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
- $\mu_x, \mu_y$ are the means of images $x$ and $y$.
- $\sigma_x, \sigma_y$ are the standard deviations.
- $\sigma_{xy}$ is the covariance.
- $c_1, c_2$ are small constants for stability.
- Perceptual Metrics (Reference-based):
- LPIPS (Learned Perceptual Image Patch Similarity): Calculates the distance between deep features of two images extracted from a pre-trained network (e.g., VGG). Lower is better.
- DISTS (Deep Image Structure and Texture Similarity): Another perceptual metric that unifies structure and texture similarity assessment. Lower is better.
- FID (Fréchet Inception Distance): Measures the distance between the distribution of features from generated images and real images, extracted from an InceptionV3 network. Lower indicates the distributions are more similar.
- Perceptual Metrics (No-Reference):
- NIQE (Natural Image Quality Evaluator): Measures the quality of an image by comparing its statistical properties to those of natural images. Lower is better.
- CLIPIQA: A no-reference metric that leverages the CLIP model to assess image quality. Higher is better.
- MUSIQ: A Transformer-based no-reference metric that assesses quality at multiple scales. Higher is better.
- MANIQA: An attention-based no-reference metric for image quality assessment. Higher is better.
- Baselines: The paper compares PiSA-SR against a comprehensive set of recent methods:
- GAN-based: RealESRGAN, BSRGAN, LDL.
- Multi-step DM-based: StableSR, ResShift, DiffBIR, PASD, SeeSR.
- One-step DM-based: SinSR, OSEDiff.
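For concreteness, the snippet below computes PSNR directly from the definition given in the metrics list above; the toy arrays are invented for illustration. In practice, SSIM, LPIPS, and the other metrics are computed with standard library implementations rather than from scratch.

```python
import numpy as np

def psnr(sr: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX_I^2 / MSE); higher means the SR output is closer to the GT."""
    mse = np.mean((sr.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Toy check: a flat image whose reconstruction is off by 5 gray levels everywhere.
gt = np.full((64, 64), 128, dtype=np.uint8)
sr = gt + 5                      # MSE = 25
print(round(psnr(sr, gt), 2))    # 10 * log10(255^2 / 25) ≈ 34.15 dB
```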
6. Results & Analysis
Core Results: Adjustable Super-Resolution
Figure 1. Visual demonstration of the adjustable SR capabilities.
Figure 1 and Table 1 showcase the core innovation: adjustability.
- Varying Pixel-level Scale ($\lambda_{pix}$): As seen in Figure 1 (moving up the vertical axis), increasing $\lambda_{pix}$ effectively removes noise and artifacts. However, as Table 1 shows, PSNR peaks at $\lambda_{pix} = 0.5$ and then decreases, while LPIPS is best at $\lambda_{pix} = 0.8$. A very high value (e.g., 1.5) leads to over-smoothing and loss of detail.
- Varying Semantic-level Scale ($\lambda_{sem}$): As seen in Figure 1 (moving across the horizontal axis), increasing $\lambda_{sem}$ generates richer semantic details, like wrinkles and hair texture. Table 1 shows that this consistently improves no-reference metrics like MUSIQ. However, it degrades PSNR (fidelity) and can worsen LPIPS if pushed too far, as it may introduce unrealistic artifacts.
Here is the manually transcribed data from Table 1.
Table 1. Results of PiSA-SR with different pixel-semantic guidance scales on the RealSR test dataset.
| $\lambda_{pix}$ | $\lambda_{sem}$ | PSNR↑ | LPIPS↓ | CLIPIQA↑ | MUSIQ↑ |
| --- | --- | --- | --- | --- | --- |
| 0.0 | 1.0 | 25.96 | 0.3426 | 0.4129 | 46.45 |
| 0.2 | 1.0 | 26.48 | 0.3042 | 0.4868 | 54.05 |
| 0.5 | 1.0 | 26.75 | 0.2646 | 0.5705 | 63.82 |
| 0.8 | 1.0 | 26.18 | 0.2612 | 0.6292 | 68.95 |
| 1.0 | 1.0 | 25.50 | 0.2672 | 0.6702 | 70.15 |
| 1.2 | 1.0 | 24.76 | 0.2723 | 0.6746 | 70.33 |
| 1.5 | 1.0 | 23.74 | 0.2769 | 0.6305 | 69.23 |
| 1.0 | 0.0 | 26.92 | 0.3018 | 0.3227 | 49.62 |
| 1.0 | 0.2 | 26.95 | 0.2784 | 0.3591 | 53.64 |
| 1.0 | 0.5 | 26.77 | 0.2476 | 0.4322 | 58.76 |
| 1.0 | 0.8 | 26.20 | 0.2465 | 0.5806 | 66.33 |
| 1.0 | 1.0 | 25.50 | 0.2672 | 0.6702 | 70.15 |
| 1.0 | 1.2 | 24.59 | 0.3000 | 0.7015 | 71.60 |
| 1.0 | 1.5 | 23.08 | 0.3541 | 0.6835 | 71.76 |
Core Results: Comparisons with State-of-the-Arts
Below is the transcribed data from Table 2.
Table 2. Quantitative comparison with DM-based SR methods. S denotes the number of steps.
| Datasets | Methods | PSNR↑ | SSIM↑ | LPIPS↓ | DISTS↓ | FID↓ | NIQE↓ | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DIV2K | ResShift-S15 | 24.69 | 0.6175 | 0.3374 | 0.2215 | 36.01 | 6.82 | 0.6089 | 60.92 | 0.5450 |
| | StableSR-S200 | 23.31 | 0.5728 | 0.3129 | 0.2138 | 24.67 | 4.76 | 0.6682 | 65.63 | 0.6188 |
| | DiffBIR-S50 | 23.67 | 0.5653 | 0.3541 | 0.2129 | 30.93 | 4.71 | 0.6652 | 65.66 | 0.6204 |
| | PASD-S20 | 23.14 | 0.5489 | 0.3607 | 0.2219 | 29.32 | 4.40 | 0.6711 | 68.83 | 0.6484 |
| | SeeSR-S50 | 23.71 | 0.6045 | 0.3207 | 0.1967 | 25.83 | 4.82 | 0.6857 | 68.49 | 0.6239 |
| | SinSR-S1 | 24.43 | 0.6012 | 0.3262 | 0.2066 | 35.45 | 6.02 | 0.6499 | 62.80 | 0.5395 |
| | OSEDiff-S1 | 23.72 | 0.6108 | 0.2941 | 0.1976 | 26.32 | 4.71 | 0.6683 | 67.97 | 0.6148 |
| | PiSA-SR-S1 | 23.87 | 0.6058 | 0.2823 | 0.1934 | 25.07 | 4.55 | 0.6927 | 69.68 | 0.6400 |
| RealSR | ResShift-S15 | 26.31 | 0.7411 | 0.3489 | 0.2498 | 142.81 | 7.27 | 0.5450 | 58.10 | 0.5305 |
| | StableSR-S200 | 24.69 | 0.7052 | 0.3091 | 0.2167 | 127.20 | 5.76 | 0.6195 | 65.42 | 0.6211 |
| | DiffBIR-S50 | 24.88 | 0.6673 | 0.3567 | 0.2290 | 124.56 | 5.63 | 0.6412 | 64.66 | 0.6231 |
| | PASD-S20 | 25.22 | 0.6809 | 0.3392 | 0.2259 | 123.08 | 5.18 | 0.6502 | 68.74 | 0.6461 |
| | SeeSR-S50 | 25.33 | 0.7273 | 0.2985 | 0.2213 | 125.66 | 5.38 | 0.6594 | 69.37 | 0.6439 |
| | SinSR-S1 | 26.30 | 0.7354 | 0.3212 | 0.2346 | 137.05 | 6.31 | 0.6204 | 60.41 | 0.5389 |
| | OSEDiff-S1 | 25.15 | 0.7341 | 0.2921 | 0.2128 | 123.50 | 5.65 | 0.6693 | 69.09 | 0.6339 |
| | PiSA-SR-S1 | 25.50 | 0.7417 | 0.2672 | 0.2044 | 124.09 | 5.50 | 0.6702 | 70.15 | 0.6560 |
| DrealSR | ResShift-S15 | 28.45 | 0.7632 | 0.4073 | 0.2700 | 175.92 | 8.28 | 0.5259 | 49.86 | 0.4573 |
| | StableSR-S200 | 28.04 | 0.7460 | 0.3354 | 0.2287 | 147.03 | 6.51 | 0.6171 | 58.50 | 0.5602 |
| | DiffBIR-S50 | 26.84 | 0.6660 | 0.4446 | 0.2706 | 167.38 | 6.02 | 0.6292 | 60.68 | 0.5902 |
| | PASD-S20 | 27.48 | 0.7051 | 0.3854 | 0.2535 | 157.36 | 5.57 | 0.6714 | 64.55 | 0.6130 |
| | SeeSR-S50 | 28.26 | 0.7698 | 0.3197 | 0.2306 | 149.86 | 6.52 | 0.6672 | 64.84 | 0.6026 |
| | SinSR-S1 | 28.41 | 0.7495 | 0.3741 | 0.2488 | 177.05 | 7.02 | 0.6367 | 55.34 | 0.4898 |
| | OSEDiff-S1 | 27.92 | 0.7835 | 0.2968 | 0.2165 | 135.29 | 6.49 | 0.6963 | 64.65 | 0.5899 |
| | PiSA-SR-S1 | 28.31 | 0.7804 | 0.2960 | 0.2169 | 130.61 | 6.20 | 0.6970 | 66.11 | 0.6156 |
Analysis: PiSA-SR (in its default 1-step setting) consistently achieves the best scores across almost all perceptual metrics (LPIPS, DISTS, CLIPIQA, MUSIQ, MANIQA) on all three datasets. This demonstrates its superior ability to generate realistic and high-quality details. While its fidelity scores (PSNR/SSIM) are not always the highest, they are highly competitive, showcasing the effectiveness of the decoupled approach in balancing the two objectives.
Figure 5. Visual comparisons of different DM-based SR methods.
Qualitative Analysis: Figure 5 provides visual evidence. While methods like ResShift and SinSR produce blurry results, and others like SeeSR can over-enhance details into artifacts, PiSA-SR generates textures (wood grain, penguin feathers) that are both detailed and natural, aligning closely with the ground truth.
Complexity Comparisons
The transcribed data from Table 3 highlights PiSA-SR's efficiency.
Table 3. The inference time and the number of parameters of DM-based SR methods.
| | StableSR | ResShift | DiffBIR | PASD | SeeSR | SinSR | OSEDiff | PiSA-SR-def. | PiSA-SR-adj. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Inference Steps | 200 | 15 | 50 | 20 | 50 | 1 | 1 | 1 | 2 |
| Inference time (s)/Image | 10.03 | 0.76 | 2.72 | 2.80 | 4.30 | 0.13 | 0.12 | 0.09 | 0.13 |
| #Params (B) | 1.56 | 0.18 | 1.68 | 2.31 | 2.51 | 0.18 | 1.77 | 1.30 | 1.30 |
PiSA-SR is the fastest method (0.09s/image), even outperforming the previous fastest one-step method, OSEDiff. The adjustable version is only slightly slower (0.13s) but offers significant flexibility. It also has the fewest parameters among SD-based methods.
Ablation Studies
The supplementary material provides key insights:
- Effectiveness of Dual-LoRA: An ablation study (Table 5) shows that using only the pixel-level LoRA results in high PSNR but poor perceptual scores. Using only the semantic-level LoRA gives excellent perceptual scores but poor fidelity. The combined PiSA-SR model provides the best balance.
- Training Efficiency: A comparison with OSEDiff (Figure 9) shows PiSA-SR uses significantly less memory (43.87 GB vs. 56.28 GB) and is faster per training iteration (1.60 s vs. 2.26 s). This is attributed to the more efficient CSD loss.
Figure 10. Visual comparison of training progress.
Figure 10 shows that PiSA-SR converges faster visually, restoring clear text more quickly than OSEDiff.
7. Conclusion & Reflections
Conclusion Summary
PiSA-SR presents a significant step forward in real-world image super-resolution. By ingeniously decoupling the pixel-level and semantic-level objectives using a dual-LoRA framework, it successfully navigates the perception-distortion trade-off. The method not only achieves state-of-the-art results but does so with remarkable efficiency in a single diffusion step. Its most impactful contribution is the introduction of adjustable guidance scales, which empowers users to customize SR outputs to their liking, transforming the SR model from a fixed tool into a flexible, interactive system.
Limitations & Future Work
The authors acknowledge two main limitations:
- The adjustable inference mode, while flexible, requires two forward passes, slightly increasing computation time compared to the default single-pass version.
- A single pixel-level LoRA is used to handle all types of degradation. This might be suboptimal for images with extremely complex or varied degradations.
Future work could focus on exploring more specialized LoRA modules for different types of degradation (e.g., one for noise, one for blur) and developing image-adaptive guidance scales that automatically adjust based on the input content.
Personal Insights & Critique
PiSA-SR is an elegant and practical piece of engineering. The core idea of disentangling objectives via separate, lightweight adapters (LoRAs) is powerful and could be highly transferable to other image restoration tasks (e.g., deblurring, inpainting) that also face conflicting objectives. The shift from the complex VSD loss to the more stable and efficient CSD loss is a smart, pragmatic choice that improves the entire training pipeline.
The adjustable inference mechanism is the standout feature from a user-centric perspective. In real-world applications, user preference is paramount, and PiSA-SR directly addresses this. It bridges the gap between purely technical metrics and practical usability. The work is a strong example of how to effectively leverage the power of large pre-trained models while adding fine-grained control and task-specific specialization in a parameter-efficient manner.