Modified Diffusion Process and Overall Pipeline:
The UNW is integrated into the ResShift diffusion framework, where the forward and backward transition distributions are modified to include the weight term $w_u(\pmb{y}_0)$:

**Forward Transition:**
$$q(\pmb{x}_t \mid \pmb{x}_{t-1}, \pmb{x}_0, \pmb{y}_0) = \mathcal{N}\!\left(\pmb{x}_t \mid \pmb{x}_{t-1} + \alpha_t(\pmb{y}_0 - \pmb{x}_0),\; \kappa^2 w_u(\pmb{y}_0)^2 \alpha_t \mathbf{I}\right)$$

**Backward Transition:**
$$q(\pmb{x}_{t-1} \mid \pmb{x}_t, \pmb{x}_0, \pmb{y}_0) = \mathcal{N}\!\left(\pmb{x}_{t-1} \mid \frac{\eta_{t-1}}{\eta_t}\pmb{x}_t + \frac{\alpha_t}{\eta_t}\pmb{x}_0,\; \kappa^2 w_u(\pmb{y}_0)^2 \frac{\eta_{t-1}}{\eta_t}\alpha_t \mathbf{I}\right)$$

The key change is that the variance of the added noise is now scaled by $w_u(\pmb{y}_0)^2$, so regions flagged as low-uncertainty receive a proportionally weaker perturbation.
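A minimal sketch of how a single forward step could be sampled under these definitions (the function name and the explicit per-pixel weight map `w_u` argument are illustrative assumptions, not the authors' code):

```python
import torch

def forward_step(x_prev, x0, y0, alpha_t, kappa, w_u):
    """Sample x_t ~ q(x_t | x_{t-1}, x_0, y_0) with UNW-scaled variance.

    x_prev, x0, y0 : image tensors of shape (B, C, H, W)
    alpha_t        : scalar schedule value at step t
    kappa          : scalar noise-magnitude hyperparameter
    w_u            : per-pixel noise weight map w_u(y0), shape (B, 1, H, W)
    """
    mean = x_prev + alpha_t * (y0 - x0)     # residual shift toward the LR image
    std = kappa * w_u * alpha_t ** 0.5      # std scaled per pixel by w_u(y0)
    return mean + std * torch.randn_like(x_prev)
```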
The overall pipeline for the `UPSR` model is shown in Figure 4.
*Figure 4: flowchart of the UPSR diffusion model. It shows how an uncertainty map $u(\psi_{est}(\pmb{y}_0))$ estimated from the low-resolution image $\pmb{y}_0$ guides the region-wise weighting of Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2\mathbf{I})$. The model iteratively denoises and upsamples, ultimately generating the high-resolution image $\pmb{x}_0$ from $\pmb{x}_T$.*

The pipeline proceeds in five steps (a schematic sketch in code follows the list):
1. Given an LR input $\pmb{y}_0$, an auxiliary SR network $g(\cdot)$ computes an initial SR estimate $g(\pmb{y}_0)$.
2. The uncertainty map $\psi_{est}(\pmb{y}_0)$ and the noise weight map $w_u(\pmb{y}_0)$ are calculated.
3. The initial state $\pmb{x}_T$ is created by adding the uncertainty-weighted noise to $\pmb{y}_0$.
4. The reverse diffusion process is executed by a U-Net denoiser $f_\theta(\cdot)$, which takes the noisy image $\pmb{x}_t$, timestep $t$, and both $\pmb{y}_0$ and $g(\pmb{y}_0)$ as conditional inputs.
5. The final output is the denoised image $\pmb{x}_0$.
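The sketch below strings these five steps together. All callables (`g`, `f_theta`, `estimate_uncertainty`, `noise_weight`, `reverse_step`) and their signatures are hypothetical stand-ins, and resolution handling (upsampling of `y0`, PixelUnshuffle) is omitted for clarity:

```python
import torch

@torch.no_grad()
def upsr_sample(y0, g, f_theta, reverse_step, estimate_uncertainty,
                noise_weight, eta, kappa, T):
    """Schematic of the UPSR sampling pipeline (Fig. 4).

    g       : auxiliary SR network, gives the initial estimate g(y0)
    f_theta : U-Net denoiser conditioned on (x_t, t, y0, g(y0))
    eta     : schedule values {eta_t}, with eta[T] close to 1
    """
    sr_init = g(y0)                             # 1. initial SR estimate
    psi = estimate_uncertainty(y0, sr_init)     # 2. uncertainty map ...
    w_u = noise_weight(psi)                     #    ... and noise weight map
    x_t = y0 + kappa * w_u * eta[T] ** 0.5 * torch.randn_like(y0)  # 3. init x_T
    for t in range(T, 0, -1):                   # 4. reverse diffusion
        x0_hat = f_theta(x_t, t, y0, sr_init)   #    predicted clean image
        x_t = reverse_step(x_t, x0_hat, t, eta, kappa, w_u)
    return x_t                                  # 5. final SR output x_0
```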
* **Network Architecture Modification:**
The paper replaces the VQGAN encoder/decoder used in `ResShift` for latent space diffusion. Instead, it uses `PixelUnshuffle` to reduce spatial resolution (and increase channel depth) before the U-Net, and a simple nearest-neighbor upsampling followed by a convolution to restore it. This is a much lighter approach that avoids the massive parameter count and computational cost of VQGAN, making `UPSR` more efficient.
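A minimal PyTorch sketch of this resolution handling, wrapping an arbitrary `unet` denoiser; the class name, the `unet(z, t, cond)` signature, and the assumption that the U-Net preserves its input channel count are illustrative, not taken from the paper:

```python
import torch.nn as nn

class LightweightLatentWrapper(nn.Module):
    """Sketch of the VQGAN-free resolution handling described above:
    PixelUnshuffle trades spatial size for channel depth before the
    U-Net; nearest-neighbor upsampling plus a convolution restore it."""

    def __init__(self, unet, in_ch=3, scale=4):
        super().__init__()
        # (B, in_ch, H, W) -> (B, in_ch * scale^2, H/scale, W/scale)
        self.down = nn.PixelUnshuffle(scale)
        self.unet = unet  # denoiser operating in the reduced space
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=scale, mode="nearest"),
            nn.Conv2d(in_ch * scale ** 2, in_ch, kernel_size=3, padding=1),
        )

    def forward(self, x, t, cond):
        z = self.down(x)
        z = self.unet(z, t, cond)  # assumed to keep the channel count
        return self.up(z)
```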
# 5. Experimental Setup
* **Datasets:**
* **Training:** `ImageNet` dataset, with $256 \times 256$ patches as HR targets.
* **Degradation:** LR images ($64 \times 64$) are generated using the degradation pipeline from `Real-ESRGAN`, which simulates real-world artifacts.
* **Testing:**
* `ImageNet-Test`: A synthetic test set.
* `RealSR` & `RealSet65`: Standard real-world SR benchmark datasets.
* **Evaluation Metrics:**
* **Full-Reference Metrics (Fidelity):**
1. **PSNR (Peak Signal-to-Noise Ratio):** Measures the pixel-wise reconstruction accuracy. Higher is better.
* **Conceptual Definition:** It is the ratio between the maximum possible power of a signal and the power of corrupting noise that affects its fidelity. It is expressed in decibels (dB).
* **Formula:** $\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$, where $\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left[I(i,j) - K(i,j)\right]^2$.
* **Symbols:** $\mathrm{MAX}_I$ is the maximum pixel value (e.g., 255), $I$ is the ground-truth image, $K$ is the reconstructed image, and $m, n$ are the image dimensions. (A minimal computation sketch appears after this list.)
2. **SSIM (Structural Similarity Index Measure):** Measures the similarity in luminance, contrast, and structure between two images. Ranges from -1 to 1, where 1 is a perfect match. Higher is better.
* **Full-Reference Metrics (Perceptual Quality):**
1. **LPIPS (Learned Perceptual Image Patch Similarity):** Measures perceptual similarity using deep features from a pre-trained network (e.g., VGG). Lower is better.
* **Non-Reference Metrics (Perceptual Quality):**
1. **NIQE (Natural Image Quality Evaluator):** Measures the deviation from statistical regularities observed in natural images. Lower indicates better, more natural quality.
2. **CLIPIQA (CLIP Image Quality Assessment):** Uses the CLIP model to assess image quality based on semantic and stylistic consistency. Higher is better.
3. **MUSIQ (Multi-scale Image Quality Transformer):** A Transformer-based model for assessing image quality. Higher is better.
4. **MANIQA (Multi-dimension Attention Network for IQA):** Another transformer-based no-reference IQA model. Higher is better.
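For concreteness, here is a minimal NumPy sketch of the PSNR definition above (a hypothetical helper, not code from the paper):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """PSNR in dB between a ground-truth image `ref` and a
    reconstruction `test` (arrays of equal shape, 8-bit range)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: noise power is zero
    return 10.0 * np.log10(max_val ** 2 / mse)
```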
* **Baselines:** The paper compares `UPSR` with a wide range of methods:
* **GAN-based:** `ESRGAN`, `RealSR-JPEG`, `BSRGAN`, `RealESRGAN`, `SwinIR`, `DASR`.
* **Diffusion-based:** `LDM-15` (15 steps), `ResShift-15` (15 steps), `ResShift-4` (4 steps).
# 6. Results & Analysis
* **Ablations / Parameter Sensitivity:**
The ablation study in Table 1 validates the key components of `UPSR`. The following is a manual transcription of the table:
<div class="table-wrapper"><table>
<thead>
<tr>
<th rowspan="2">UNW</th>
<th rowspan="2">SR cond.</th>
<th colspan="5">RealSR</th>
<th colspan="4">RealSet</th>
</tr>
<tr>
<th>PSNR↑</th>
<th>CLIPIQA↑</th>
<th>MUSIQ↑</th>
<th>MANIQA↑</th>
<th>NIQE↓</th>
<th>CLIPIQA↑</th>
<th>MUSIQ↑</th>
<th>MANIQA↑</th>
<th>NIQE↓</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>26.18</td>
<td>0.5447</td>
<td>62.951</td>
<td>0.3596</td>
<td>4.49</td>
<td>0.6141</td>
<td>64.360</td>
<td>0.3718</td>
<td>4.42</td>
</tr>
<tr>
<td>√</td>
<td></td>
<td>26.12</td>
<td>0.5760</td>
<td>64.512</td>
<td>0.3717</td>
<td>4.18</td>
<td>0.6340</td>
<td>64.280</td>
<td>0.3836</td>
<td>4.22</td>
</tr>
<tr>
<td>√</td>
<td>√</td>
<td><b>26.44</b></td>
<td><b>0.6010</b></td>
<td><b>64.541</b></td>
<td><b>0.3818</b></td>
<td><b>4.02</b></td>
<td><b>0.6389</b></td>
<td><b>63.498</b></td>
<td><b>0.3931</b></td>
<td><b>4.24</b></td>
</tr>
</tbody>
</table></div>
* **Baseline (Row 1):** A model based on `ResShift` without UNW or extra SR conditioning.
* **Effect of UNW (Row 2 vs. Row 1):** Adding `Uncertainty-guided Noise Weighting` significantly improves perceptual metrics (`CLIPIQA`, `MUSIQ`, `NIQE`) while slightly decreasing `PSNR`. This is expected: reducing noise in flat areas aids perceptual quality, sometimes at a small cost in pixel-wise accuracy.
* **Effect of SR Conditioning (Row 3 vs. Row 2):** Adding the SR prediction g(y_0) as an additional condition for the denoiser boosts all metrics, including a significant jump in `PSNR` (0.32 dB on RealSR). This confirms that providing a better initial estimate helps the denoiser achieve both better fidelity and perceptual quality.
* **Model Size and Training Overhead:**
The tables below are transcribed from the paper.
**Table 2: Model size and computational efficiency.**
<div class="table-wrapper"><table>
<thead>
<tr>
<th>Model</th>
<th>Params (M)</th>
<th>Runtime (s)</th>
<th>MUSIQ</th>
<th>MANIQA</th>
</tr>
</thead>
<tbody>
<tr>
<td>LDM-15</td>
<td>113.60+55.32</td>
<td>1.59</td>
<td>48.698</td>
<td>0.2655</td>
</tr>
<tr>
<td>ResShift-15</td>
<td>118.59+55.32</td>
<td>1.98</td>
<td>57.769</td>
<td>0.3691</td>
</tr>
<tr>
<td>ResShift-4</td>
<td>118.59+55.32</td>
<td>1.00</td>
<td>55.189</td>
<td>0.3337</td>
</tr>
<tr>
<td><b>UPSR-5</b></td>
<td><b>119.42+2.50</b></td>
<td><b>1.12</b></td>
<td><b>64.541</b></td>
<td><b>0.3818</b></td>
</tr>
</tbody>
</table></div>
**Table 3: Training overhead comparison.**
<div class="table-wrapper"><table>
<thead>
<tr>
<th>Model</th>
<th>Training Speed</th>
<th>Memory Footprint</th>
</tr>
</thead>
<tbody>
<tr>
<td>ResShift</td>
<td>1.20 s/iter</td>
<td>24.1 G</td>
</tr>
<tr>
<td><b>UPSR</b></td>
<td><b>0.45 s/iter</b></td>
<td><b>14.9 G</b></td>
</tr>
</tbody>
</table></div>
* **Efficiency Gains:** `UPSR`'s total parameter count (121.92M) is much smaller than `ResShift`'s (173.91M). This is because `UPSR` replaces the 55.32M-parameter VQGAN with a tiny 2.50M-parameter auxiliary SR network.
* **Performance-Cost Trade-off:** `UPSR-5` (5 steps) outperforms both `ResShift-4` and `ResShift-15` on MUSIQ and MANIQA while running nearly as fast as the 4-step variant (1.12 s vs. 1.00 s) and much faster than the 15-step one.
* **Training Advantage:** `UPSR` trains **2.67 times faster** and uses **38% less GPU memory** than `ResShift`, demonstrating a massive improvement in practicality.
* **Core Results (Comparison with SOTA):**
The main results in Table 4 (transcribed below) show that `UPSR-5` consistently achieves top or near-top perceptual quality across all three test sets.
<details>
<summary>Click to view the transcription of Table 4</summary>
<div class="table-wrapper"><table>
<thead>
<tr>
<th rowspan="2">Datasets</th>
<th rowspan="2">Metrics</th>
<th colspan="6">GAN-based Methods</th>
<th colspan="4">Diffusion-based Methods</th>
</tr>
<tr>
<th>ESRGAN</th>
<th>RealSR-JPEG</th>
<th>BSRGAN</th>
<th>RealESRGAN</th>
<th>SwinIR</th>
<th>DASR</th>
<th>LDM-15</th>
<th>ResShift-15</th>
<th>ResShift-4</th>
<th>UPSR-5</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="7">ImageNet -Test</td>
<td>PSNR↑</td>
<td>20.67</td>
<td>23.11</td>
<td>24.42</td>
<td>24.04</td>
<td>23.99</td>
<td>24.75</td>
<td>24.85</td>
<td>24.94</td>
<td>25.02</td>
<td>23.77</td>
</tr>
<tr>
<td>SSIM↑</td>
<td>0.4485</td>
<td>0.5912</td>
<td>0.6585</td>
<td>0.6649</td>
<td>0.6666</td>
<td>0.6749</td>
<td>0.6682</td>
<td>0.6738</td>
<td>0.6830</td>
<td>0.6296</td>
</tr>
<tr>
<td>LPIPS↓</td>
<td>0.4851</td>
<td>0.3263</td>
<td>0.2585</td>
<td>0.2539</td>
<td>0.2376</td>
<td>0.2498</td>
<td>0.2685</td>
<td>0.2371</td>
<td>0.2075</td>
<td>0.2456</td>
</tr>
<tr>
<td>CLIPIQA↑</td>
<td>0.4512</td>
<td>0.5366</td>
<td>0.5810</td>
<td>0.5241</td>
<td>0.5639</td>
<td>0.5362</td>
<td>0.5095</td>
<td>0.5860</td>
<td>0.6003</td>
<td><b>0.6328</b></td>
</tr>
<tr>
<td>MUSIQ↑</td>
<td>43.615</td>
<td>46.981</td>
<td>54.696</td>
<td>52.609</td>
<td>53.789</td>
<td>48.337</td>
<td>46.639</td>
<td>53.182</td>
<td>52.019</td>
<td><b>59.227</b></td>
</tr>
<tr>
<td>MANIQA↑</td>
<td>0.3212</td>
<td>0.3065</td>
<td>0.3865</td>
<td>0.3689</td>
<td>0.3882</td>
<td>0.3292</td>
<td>0.3305</td>
<td>0.4191</td>
<td>0.3885</td>
<td><b>0.4591</b></td>
</tr>
<tr>
<td>NIQE↓</td>
<td>8.33</td>
<td>5.96</td>
<td>6.08</td>
<td>6.07</td>
<td>5.89</td>
<td>5.86</td>
<td>7.21</td>
<td>6.88</td>
<td>7.34</td>
<td><b>5.24</b></td>
</tr>
<tr>
<td rowspan="7">RealSR</td>
<td>PSNR↑</td>
<td>27.57</td>
<td>27.34</td>
<td>26.51</td>
<td>25.83</td>
<td>26.43</td>
<td>27.19</td>
<td>27.18</td>
<td>26.80</td>
<td>25.77</td>
<td>26.44</td>
</tr>
<tr>
<td>SSIM↑</td>
<td>0.7742</td>
<td>0.7605</td>
<td>0.7746</td>
<td>0.7726</td>
<td>0.7861</td>
<td>0.7861</td>
<td>0.7853</td>
<td>0.7674</td>
<td>0.7439</td>
<td>0.7589</td>
</tr>
<tr>
<td>LPIPS↓</td>
<td>0.4152</td>
<td>0.3962</td>
<td>0.2685</td>
<td>0.2739</td>
<td>0.2515</td>
<td>0.3113</td>
<td>0.3021</td>
<td>0.3411</td>
<td>0.3491</td>
<td>0.2871</td>
</tr>
<tr>
<td>CLIPIQA↑</td>
<td>0.2362</td>
<td>0.3613</td>
<td>0.5439</td>
<td>0.4923</td>
<td>0.4655</td>
<td>0.3628</td>
<td>0.3748</td>
<td>0.5709</td>
<td>0.5646</td>
<td><b>0.6010</b></td>
</tr>
<tr>
<td>MUSIQ↑</td>
<td>29.037</td>
<td>36.069</td>
<td>63.587</td>
<td>59.849</td>
<td>59.635</td>
<td>45.818</td>
<td>48.698</td>
<td>57.769</td>
<td>55.189</td>
<td><b>64.541</b></td>
</tr>
<tr>
<td>MANIQA↑</td>
<td>0.2071</td>
<td>0.1783</td>
<td>0.3702</td>
<td>0.3694</td>
<td>0.3436</td>
<td>0.2663</td>
<td>0.2655</td>
<td>0.3691</td>
<td>0.3337</td>
<td><b>0.3828</b></td>
</tr>
<tr>
<td>NIQE↓</td>
<td>7.73</td>
<td>6.95</td>
<td>4.65</td>
<td>4.68</td>
<td>4.68</td>
<td>5.98</td>
<td>6.22</td>
<td>5.96</td>
<td>6.93</td>
<td><b>4.02</b></td>
</tr>
<tr>
<td rowspan="4">RealSet</td>
<td>CLIPIQA↑</td>
<td>0.3739</td>
<td>0.5282</td>
<td>0.6160</td>
<td>0.6081</td>
<td>0.5778</td>
<td>0.4966</td>
<td>0.4313</td>
<td>0.6309</td>
<td>0.6188</td>
<td><b>0.6392</b></td>
</tr>
<tr>
<td>MUSIQ↑</td>
<td>42.366</td>
<td>50.539</td>
<td>65.583</td>
<td>64.125</td>
<td>63.817</td>
<td>55.708</td>
<td>48.602</td>
<td>59.319</td>
<td>58.516</td>
<td>63.519</td>
</tr>
<tr>
<td>MANIQA↑</td>
<td>0.3100</td>
<td>0.2927</td>
<td>0.3888</td>
<td>0.3949</td>
<td>0.3818</td>
<td>0.3134</td>
<td>0.2693</td>
<td>0.3916</td>
<td>0.3526</td>
<td><b>0.3931</b></td>
</tr>
<tr>
<td>NIQE↓</td>
<td>4.93</td>
<td>4.81</td>
<td>4.58</td>
<td>4.38</td>
<td>4.40</td>
<td>4.72</td>
<td>6.47</td>
<td>5.96</td>
<td>6.46</td>
<td><b>4.23</b></td>
</tr>
</tbody>
</table></div>
</details>
* It achieves the best scores on nearly all perceptual metrics (`CLIPIQA`, `MUSIQ`, `MANIQA`, `NIQE`) across the three datasets; the lone exception in the transcription above is `MUSIQ` on RealSet, where several GAN-based methods score higher.
* Notably, it achieves the best `NIQE` scores by a large margin, a metric where previous diffusion models like `LDM` and `ResShift` struggled compared to GANs. This indicates `UPSR` produces images that are statistically more "natural."
* **Visual Examples:**

*Figure 5: visual examples of the proposed Uncertainty-guided Noise Weighting (UNW) strategy. Based on the uncertainty estimate (heatmap), the noise level is lowered in mostly flat regions to preserve more detail and obtain better super-resolution results, while in edge regions (as in (a)) and severely degraded regions (as in (b)) the noise is kept relatively high to ensure reliable score estimation and visually pleasing outputs. This highlights UNW's ability to adapt the noise to regional uncertainty.*
Figure 5 demonstrates the UNW strategy in action. The uncertainty heatmap highlights edges and complex textures. The resulting noise map is anisotropic: flat background areas receive very little noise, preserving their clarity, while the edges of the owl and the texture in the second image receive stronger noise, allowing the model more freedom to generate details.

*Figure 6: visual comparison of super-resolution methods on detail reconstruction. It contains three sets of comparisons on low-resolution crops from the ImageNet, RealSet, and RealSR datasets (a dog's eye, a lion's mane, a human eye). The zoomed views show how BSRGAN, RealESRGAN, SwinIR, DASR, LDM-15, ResShift-15, and ResShift-4 differ from the proposed "Ours" method in recovering high-frequency details and textures, highlighting the perceptual advantage of "Ours".*
The qualitative comparisons in Figure 6 show that `UPSR` reconstructs finer and sharper details in challenging areas like eyes and fur, while other methods tend to produce blurrier or less coherent textures. The additional examples in the supplementary material (Figures 8 and 9) further reinforce this conclusion.
![Figure 8. Additional visual comparisons on RealSet \[44\].](/files/papers/68efcd33a63c142e6efe1dfb/images/8.jpg)
*Figure 8: visual comparisons on RealSet across super-resolution methods. It shows the original low-resolution images of a dog and a panda (with zoomed crop boxes) alongside the local high-resolution reconstructions from LDM-15, ResShift-15, ResShift-4, and the proposed "Ours" method. The proposed method recovers detail and texture better than the other SOTA methods.*
![Figure 9. Additional visual comparisons on RealSR \[1\].](/files/papers/68efcd33a63c142e6efe1dfb/images/9.jpg)
*Figure 9: additional visual comparisons on RealSR. Each row contrasts the low-resolution original (with a red boxed region) against reconstructions of that region by four methods (LDM-15, ResShift-15, ResShift-4, and the proposed method). The proposed UPSR model best preserves image detail, texture, and edges, producing sharper and more realistic images than the other methods.*
# 7. Conclusion & Reflections
* **Conclusion Summary:**
The paper successfully introduces a more specialized and effective diffusion process for image super-resolution. By proposing the `Uncertainty-guided Noise Weighting` (UNW) scheme, it moves beyond the standard isotropic noise model and adapts the initial perturbation to the content of the LR image. This allows the model to preserve information in simple regions while providing enough stochasticity to generate realistic details in complex regions. Combined with an efficient architectural modification that removes the VQGAN bottleneck, the resulting `UPSR` model sets a new state-of-the-art in perceptual quality for real-world SR, all while being significantly smaller, faster, and cheaper to train than its predecessors.
* **Limitations & Future Work (Author-Stated & Implied):**
* **Dependence on Auxiliary Network:** The quality of the uncertainty estimate is tied to the performance of the pre-trained SR network $g(\cdot)$. A poor $g(\cdot)$ could lead to suboptimal noise weighting.
* **Heuristic Weighting Function:** The piecewise linear function for mapping uncertainty to noise weights is hand-designed and based on empirical settings ($b_u = 0.4$, $\psi_{max} = 0.05$). Future work could explore learning this function or finding a more principled formulation. (An illustrative sketch of such a mapping follows this list.)
* **Task Specificity:** While highly effective for SR, the generalizability of this specific uncertainty estimation method to other image restoration tasks (e.g., deblurring, inpainting) is an open question.
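Since the exact mapping is not reproduced in this summary, the sketch below shows one plausible piecewise-linear form consistent with the behavior described in Section 6 (a floor of $b_u$ in flat regions, weight saturating at $\psi_{max}$); it is an assumption, not the paper's definition:

```python
import torch

def noise_weight(psi, b_u=0.4, psi_max=0.05):
    """One plausible piecewise-linear uncertainty-to-weight mapping
    (illustrative; the paper's exact definition may differ).

    Low-uncertainty (flat) pixels get the floor weight b_u, so less
    noise is injected there; the weight rises linearly toward 1.0 as
    the uncertainty psi approaches psi_max, and saturates beyond it.
    """
    return b_u + (1.0 - b_u) * torch.clamp(psi / psi_max, max=1.0)
```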