HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution

Published: 12/05/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

HIIF proposes a continuous image super-resolution method utilizing implicit neural representations. It introduces a novel hierarchical positional encoding to capture multi-scale local details and embeds a multi-head linear attention mechanism for non-local information. Experiments show that HIIF outperforms state-of-the-art continuous super-resolution methods by up to 0.17dB in PSNR when integrated with different backbone encoders.

Abstract

Recent advances in implicit neural representations (INRs) have shown significant promise in modeling visual signals for various low-vision tasks including image super-resolution (ISR). INR-based ISR methods typically learn continuous representations, providing flexibility for generating high-resolution images at any desired scale from their low-resolution counterparts. However, existing INR-based ISR methods utilize multi-layer perceptrons for parameterization in the network; this does not take account of the hierarchical structure existing in local sampling points and hence constrains the representation capability. In this paper, we propose a new **H**ierarchical encoding based **I**mplicit **I**mage **F**unction for continuous image super-resolution, **HIIF**, which leverages a novel hierarchical positional encoding that enhances the local implicit representation, enabling it to capture fine details at multiple scales. Our approach also embeds a multi-head linear attention mechanism within the implicit attention network by taking additional non-local information into account. Our experiments show that, when integrated with different backbone encoders, HIIF outperforms the state-of-the-art continuous image super-resolution methods by up to 0.17dB in PSNR. The source code of HIIF will be made publicly available at www.github.com.

In-depth Reading

1. Bibliographic Information

  • Title: HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution
  • Authors: Yuxuan Jiang, Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Xiaoqing Zhu, Joel Sole, and David Bull.
  • Affiliations: The authors are affiliated with the Visual Information Laboratory at the University of Bristol, UK, and Netflix Inc., USA. This collaboration between academia and industry suggests a focus on both novel research and practical applications.
  • Journal/Conference: The paper is available on arXiv, which is a repository for electronic preprints of scientific papers. This means it has not yet undergone formal peer review for publication in a journal or conference.
  • Publication Year: The paper was submitted to arXiv in December 2024 (based on the identifier 2412.03748v1).
  • Abstract: The paper introduces a new method for continuous image super-resolution called HIIF (Hierarchical encoding based Implicit Image Function). Existing methods using Implicit Neural Representations (INRs) often fail to capture the hierarchical structure of local image regions, which limits their performance. HIIF addresses this by introducing a novel hierarchical positional encoding to capture fine details at multiple scales. It also incorporates a multi-head linear attention mechanism to consider non-local information. Experiments show that HIIF, when combined with various standard network backbones, outperforms existing state-of-the-art methods by up to 0.17dB in PSNR.
  • Original Source Link: https://arxiv.org/abs/2412.03748v1

2. Executive Summary

  • Background & Motivation (Why):

    • The primary problem is continuous image super-resolution (ISR), which aims to upscale a low-resolution (LR) image to any arbitrary high-resolution (HR) size with a single model. This is more flexible than traditional ISR methods that are trained for fixed integer scales (e.g., ×2, ×4).
    • Implicit Neural Representations (INRs) have become a popular approach for continuous ISR. They represent an image as a continuous function that maps coordinates to pixel values. However, existing INR-based methods typically use simple Multi-Layer Perceptrons (MLPs) that treat all local sampling points independently. This "single-scale" approach overlooks the inherent hierarchical relationships between nearby points, limiting the model's ability to reconstruct fine, high-frequency details.
    • The paper identifies this as a key gap: the lack of a multi-scale representation within the local implicit function.
  • Main Contributions / Findings (What): The paper introduces HIIF, a framework with three main contributions:

    1. A Novel Hierarchical Positional Encoding: This is the core innovation. Instead of using a single relative coordinate, HIIF generates a hierarchy of positional codes at different scales. This allows the model to learn features at various levels of detail, enhancing local representation.

    2. A Multi-Scale Decoder Architecture: The hierarchical encodings are progressively fed into different layers of the decoder. This ensures that nearby query points share information at coarser levels but are differentiated at finer levels, effectively learning a multi-scale representation.

    3. Multi-Head Linear Attention: A multi-head linear self-attention mechanism is integrated into the decoder. This expands the model's receptive field, allowing it to capture non-local spatial information from the latent features, which MLPs alone cannot do efficiently.

      The key finding is that HIIF consistently outperforms state-of-the-art continuous ISR methods on several benchmark datasets, achieving significant PSNR gains across various up-sampling scales and with different backbone networks.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Image Super-Resolution (ISR): The task of algorithmically generating a high-resolution (HR) image from its low-resolution (LR) counterpart. It is an "ill-posed" problem because multiple HR images can produce the same LR image when downscaled.
    • Continuous Super-Resolution: Also known as arbitrary-scale super-resolution. It allows a single trained model to upscale an image by any scaling factor (e.g., ×2.5, ×7.8), not just predefined integer factors.
    • Implicit Neural Representations (INRs): A method to represent signals (like images, videos, or 3D shapes) with a neural network. Instead of storing discrete pixel values in a grid, an INR model learns a continuous function $f(\text{coordinate}) \rightarrow \text{value}$. For an image, this would be $f(x, y) \rightarrow (R, G, B)$. This continuous nature makes INRs inherently suitable for continuous super-resolution (see the minimal sketch at the end of this section).
    • Multi-Layer Perceptron (MLP): A fundamental type of artificial neural network consisting of multiple layers of neurons, where each neuron is fully connected to those in the previous layer. INRs are often parameterized by MLPs.
    • Positional Encoding: A technique used to inject positional information into a model's input. Since MLPs and Transformers are permutation-invariant (they don't know the order of their inputs), positional encodings provide the necessary spatial context.
    • Attention Mechanism: A mechanism that allows a neural network to focus on specific parts of its input. Self-attention allows inputs to interact with each other to compute a representation. Multi-head attention performs this process in parallel in different "representation subspaces," capturing diverse patterns.
  • Previous Works: The paper builds upon a lineage of INR-based ISR methods:

    • LIIF (Local Implicit Image Function): The foundational work that proposed representing an image via local implicit functions. For any query coordinate, LIIF interpolates the output from the implicit functions centered at the four nearest latent features from an encoder. HIIF is based on this local formulation.
    • LTE (Local Texture Estimator): An extension of LIIF that enhances texture details by operating in the frequency domain, estimating Fourier coefficients from the latent features.
    • CiaoSR & CLIT: More recent methods that aim to improve upon LIIF. CiaoSR introduces learnable weights for combining local features and a scale-aware attention module. CLIT uses a local attention mechanism and a cascaded framework for large-scale up-sampling.
    • Limitation of Prior Work: The paper argues that all these methods use a single-scale representation for the local function, which HIIF aims to overcome with its hierarchical approach.
  • Differentiation: HIIF distinguishes itself from prior art in two main ways:

    1. Hierarchical vs. Single-Scale Encoding: While LIIF, LTE, and others use a single relative coordinate vector (e.g., (dx, dy)) as input, HIIF generates a sequence of hierarchical coordinates. This explicitly forces the network to model features at different spatial granularities.
    2. Learned Aggregation vs. Fixed/Simple Ensemble: LIIF uses fixed bilinear interpolation weights. CiaoSR learns the weights. HIIF's multi-scale architecture implicitly learns to aggregate information through its progressive network structure, rather than relying on explicit ensemble weights. The addition of linear attention further helps capture non-local dependencies, a feature largely absent in other local implicit models.
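
To make the INR concept above concrete, here is a minimal, untrained sketch (assuming NumPy; the random weights are purely illustrative and stand in for a learned model):

```python
import numpy as np

# A toy implicit image function: maps a 2D coordinate to an RGB value.
# Weights are random here; in practice they are learned from image data.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 3)), np.zeros(3)

def inr(x: float, y: float) -> np.ndarray:
    """f(x, y) -> (R, G, B), defined for any continuous coordinate."""
    h = np.maximum(np.array([x, y]) @ W1 + b1, 0.0)  # one ReLU MLP layer
    return h @ W2 + b2

# Because f is continuous, it can be sampled on a grid of ANY resolution,
# which is what makes INRs natural for arbitrary-scale super-resolution.
hr = np.stack([[inr(i / 63, j / 63) for j in range(64)] for i in range(64)])
print(hr.shape)  # (64, 64, 3)
```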

4. Methodology (Core Technology & Implementation)

The HIIF framework follows an encoder-decoder structure, as illustrated in the figure below.

Architecture overview: the HIIF model processes a low-resolution image $I^{LR}$ with an encoder $E_\varphi$, uses a multi-head linear attention mechanism and the novel hierarchical positional encoding (Level 0 to Level $L-1$) to enhance the local implicit representation, and finally combines the decoder $D_\varrho$ output with the upsampled $I^{LR}_{\uparrow}$ to generate the high-resolution image $I^{HR}$. The model is designed to capture multi-scale details and improve performance.

  • Principles: The core idea is to model the local implicit image function as a multi-scale representation. By creating hierarchical coordinates, the network can share information among neighboring query points at coarse scales while distinguishing them at fine scales. This allows for the reconstruction of both smooth regions and high-frequency details more effectively.

  • Steps & Procedures:

    1. Encoder: An input LR image $\mathcal{T}^{\mathrm{LR}}$ is passed through a backbone encoder network $E_{\varphi}$ (e.g., EDSR, RDN, SwinIR) to extract a grid of latent feature vectors $\mathbf{z}$.
    2. Querying: To generate a pixel at a specific coordinate $\mathbf{x}_q$ in the target HR image, the system identifies the four nearest latent codes $\{\mathbf{z}_t^*\}$ in the feature grid.
    3. Hierarchical Encoding: The relative coordinate $\delta(\mathbf{x}_q, t) = \mathbf{x}_q - \mathbf{x}_t^*$ is calculated. Instead of using this directly, it is converted into a series of $L$ hierarchical encodings $\{\delta_h(\mathbf{x}_q, l)\}_{l=0}^{L-1}$.
    4. Decoder: The HIIF decoder $D_{\varrho}$ processes these inputs progressively.
      • Level 0: The four latent codes $\{\mathbf{z}_t^*\}$, the coarsest hierarchical encoding $\delta_h(\mathbf{x}_q, 0)$, and cell size information are concatenated and passed through an MLP. The output is then processed by a Multi-Head Linear Attention block to produce the first-level feature $\mathbf{z}_0$.
      • Level $l > 0$: The feature from the previous level, $\mathbf{z}_{l-1}$, is concatenated with the next hierarchical encoding, $\delta_h(\mathbf{x}_q, l)$, and passed through another MLP to produce the feature $\mathbf{z}_l$. This is repeated for all $L$ levels.
    5. Final Output: The output from the final level, $\mathbf{z}_{L-1}$, is the predicted RGB residual. This is added to a simple bilinearly upsampled version of the LR image, $\mathcal{T}_{\uparrow}^{\mathrm{LR}}$, to produce the final HR pixel value.
  • Mathematical Formulas & Key Details:

    • Problem Formulation (Based on LIIF): The final HR pixel value at coordinate $\mathbf{x}_q$ is predicted by combining outputs from its four nearest latent codes $\{\mathbf{z}_t^*\}$, where $t \in \{00, 01, 10, 11\}$:

      $$\mathcal{T}^{\mathrm{HR}}(\mathbf{x}_q) = \sum_t w_t \, f_{\theta}\left(\mathbf{z}_t^*,\ \delta(\mathbf{x}_q, t),\ cell\right)$$

      • $f_{\theta}$: The implicit function (decoder).
      • $\mathbf{z}_t^*$: The nearest latent code.
      • $\delta(\mathbf{x}_q, t) = \mathbf{x}_q - \mathbf{x}_t^*$: The relative coordinate vector from the latent code's position to the query coordinate.
      • $cell$: A vector representing the size of a pixel in the continuous coordinate space, which informs the network about the target scale.
      • $w_t$: Ensemble weights, typically calculated based on bilinear interpolation areas.
    • HIIF Overall Workflow:

      $$\mathcal{T}^{\mathrm{HR}}(\mathbf{x}_q) = D_{\varrho}\left( E_{\varphi}(\mathcal{T}^{\mathrm{LR}}),\ \{\delta_h(\mathbf{x}_q, l)\},\ cell \right) + \mathcal{T}_{\uparrow}^{\mathrm{LR}}(\mathbf{x}_q)$$

      • $D_{\varrho}$: The HIIF decoder.
      • $\{\delta_h(\mathbf{x}_q, l)\}$: The set of hierarchical encodings for levels $l = 0, \dots, L-1$.
      • $\mathcal{T}_{\uparrow}^{\mathrm{LR}}(\mathbf{x}_q)$: The value from the bilinearly upsampled LR image at $\mathbf{x}_q$.
    • Hierarchical Encoding: The local coordinate $(x_{local}, y_{local})$ is first normalized to the range $[0, 1]$. The hierarchical coordinate for level $l$ is then computed as:

      $$(x_{hier}, y_{hier}) = \lfloor (x_{local}, y_{local}) \times S^{l+1} \rfloor \bmod S$$

      • $S$: A scaling factor, set to 2 in the paper.
      • $l$: The hierarchy level, from 0 to $L-1$.
      • $\lfloor \cdot \rfloor$: The floor function (rounds down to the nearest integer).
      • $\bmod$: The modulo operator. This formula effectively creates a quadtree-like subdivision of the local space. At level 0, it divides the space into an $S \times S$ grid. At level 1, it further divides each of those cells, and so on. This provides a multi-scale positional signal.

        Figure 3 (the proposed multi-scale architecture): Applying hierarchical encoding at different levels in the decoder affects how sampling points are processed. At the coarser level (Level 0), neighboring sampling points (e.g., b and c, both in region 11) share the same network features. At the finer level (Level 1), once finer encodings are introduced (00 for b, 11 for c), these points no longer share the same features, enabling the capture of fine details at multiple scales.

    As shown in Figure 3, points 'b' and 'c' share the same coarse-level encoding (they are both in quadrant '11' at Level 0), so they share network features. At Level 1, they get different finer-level encodings ('00' and '11' respectively), so their processing paths diverge, allowing the network to learn finer details.
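
A minimal sketch of this hierarchical encoding follows (assuming NumPy; the function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def hierarchical_encoding(x_local, y_local, num_levels=4, S=2):
    """Quadtree-like positional codes for a local coordinate in [0, 1]^2.

    Implements (x_hier, y_hier) = floor((x_local, y_local) * S**(l+1)) mod S
    for l = 0 .. num_levels-1, as in the formula above.
    """
    codes = []
    for l in range(num_levels):
        x_hier = int(np.floor(x_local * S ** (l + 1))) % S
        y_hier = int(np.floor(y_local * S ** (l + 1))) % S
        codes.append((x_hier, y_hier))
    return codes

# Points like 'b' and 'c' in Figure 3: same quadrant at level 0,
# different quadrants at level 1.
print(hierarchical_encoding(0.55, 0.55))  # [(1, 1), (0, 0), ...]
print(hierarchical_encoding(0.95, 0.95))  # [(1, 1), (1, 1), ...]
```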

    • Multi-Head Linear Attention: Standard self-attention has a complexity of $O(N^2)$, where $N$ is the number of tokens (pixels). To manage this, HIIF uses a linear attention mechanism. Given an input $\hat{\mathbf{z}}_0$, it computes queries ($\mathbf{Q}$), keys ($\mathbf{K}$), and values ($\mathbf{V}$) for each head $n$:

      $$\mathrm{Attention}(\mathbf{Q}_n, \mathbf{K}_n, \mathbf{V}_n) = \mathbf{V}_n \left( \frac{\mathbf{K}_n^T \mathbf{Q}_n}{\sqrt{HW}} \right)$$

      The key difference from standard attention is the order of matrix multiplication. By computing $\mathbf{K}_n^T \mathbf{Q}_n$ first, it avoids creating the large $N \times N$ attention matrix, reducing complexity to $O(N)$. This allows HIIF to incorporate non-local information efficiently.
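
To illustrate the complexity argument, here is a minimal single-head sketch (assuming NumPy; the paper's multi-head version and any normalization or projection details are simplified away):

```python
import numpy as np

def linear_attention(Q, K, V, n_tokens_norm):
    """V @ (K^T @ Q) / sqrt(HW): O(N * d^2) instead of O(N^2 * d).

    Q, K, V: (N, d) arrays of N tokens with dimension d.
    The small (d, d) matrix K^T Q is formed first, so the
    N x N attention matrix never materializes.
    """
    context = (K.T @ Q) / np.sqrt(n_tokens_norm)  # (d, d)
    return V @ context                            # (N, d)

# Example: 4096 tokens (a 64x64 feature map), 32 channels per head.
H = W = 64
d = 32
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(H * W, d)) for _ in range(3))
out = linear_attention(Q, K, V, n_tokens_norm=H * W)
print(out.shape)  # (4096, 32)
```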

5. Experimental Setup

  • Datasets:

    • Training: DIV2K dataset, which contains 800 high-quality 2K resolution images.
    • Testing: Five standard benchmark datasets were used for evaluation: DIV2K validation (100 images), Set5, Set14, BSD100, and Urban100. The Urban100 dataset is particularly challenging due to its high-frequency details (e.g., building facades, grids).
  • Evaluation Metrics: The primary metric used is the Peak Signal-to-Noise Ratio (PSNR).

    1. Conceptual Definition: PSNR measures the quality of a reconstructed image by comparing it to the original, ground-truth image. It quantifies the ratio between the maximum possible pixel value and the mean squared error (MSE) between the images. A higher PSNR value indicates a better reconstruction quality, with less error. It is measured on a logarithmic scale in decibels (dB).
    2. Mathematical Formula (a minimal computation sketch is given at the end of this section):

      $$\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)$$

      The Mean Squared Error (MSE) is calculated as:

      $$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2$$
    3. Symbol Explanation:
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image).
      • m, n: The height and width of the image in pixels.
      • I(i,j): The pixel value at coordinate (i,j) in the original ground-truth image.
      • K(i,j): The pixel value at coordinate (i,j) in the reconstructed image.
  • Baselines: HIIF is compared against several state-of-the-art continuous SR methods:

    • MetaSR
    • LIIF
    • LTE
    • CLIT
    • CiaoSR
    • SRNO

    The experiments also compare HIIF integrated with different encoder backbones (EDSR_baseline, RDN, SwinIR) against the original fixed-scale versions of these backbones.
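
As referenced above, a minimal PSNR implementation consistent with the formulas in this section (assuming NumPy and 8-bit images) might look like:

```python
import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray,
         max_val: float = 255.0) -> float:
    """PSNR in dB between a ground-truth image I and a reconstruction K."""
    mse = np.mean((reference.astype(np.float64)
                   - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a reconstruction uniformly off by 2 gives MSE = 4, i.e. ~42.11 dB.
gt = np.full((64, 64), 128, dtype=np.uint8)
rec = gt + 2
print(round(psnr(gt, rec), 2))  # 42.11
```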

6. Results & Analysis

  • Core Results:

    The quantitative results are presented in Tables 1 and 2, which are transcribed below.

    Table 1: PSNR (dB, ↑) comparison on the DIV2K and Set5 datasets. Scales ×2–×4 are in-distribution; larger scales are out-of-distribution.

    | Encoder | Method | DIV2K ×2 | DIV2K ×3 | DIV2K ×4 | DIV2K ×6 | DIV2K ×12 | DIV2K ×18 | DIV2K ×24 | DIV2K ×30 | Set5 ×2 | Set5 ×3 | Set5 ×4 | Set5 ×6 | Set5 ×8 | Set5 ×12 |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | n/a | Bicubic | 31.01 | 28.22 | 26.66 | 24.82 | 22.27 | 21.00 | 20.19 | 19.59 | 32.30 | 29.04 | 27.06 | 24.58 | 23.00 | 21.24 |
    | EDSR_baseline [33] | EDSR only | 34.52 | 30.89 | 28.98 | - | - | - | - | - | 37.99 | 34.36 | 32.09 | - | - | - |
    | EDSR_baseline [33] | MetaSR [20] | 34.64 | 30.93 | 28.92 | 26.61 | 23.55 | 22.03 | 21.06 | 20.37 | 37.94 | 34.35 | 32.07 | 28.74 | 26.83 | 24.53 |
    | EDSR_baseline [33] | LIIF [10] | 34.67 | 30.96 | 29.00 | 26.75 | 23.71 | 22.17 | 21.28 | 20.48 | 37.97 | 34.39 | 32.22 | 28.93 | 26.98 | 24.57 |
    | EDSR_baseline [33] | LTE [29] | 34.72 | 31.02 | 29.04 | 26.81 | 23.78 | 22.23 | 21.24 | 20.53 | 38.02 | 34.42 | 32.22 | 28.95 | 27.02 | 24.60 |
    | EDSR_baseline [33] | CLIT [9] | 34.81 | 31.12 | 29.15 | 26.92 | 23.83 | 22.29 | 21.26 | 20.53 | - | - | - | - | - | - |
    | EDSR_baseline [33] | CiaoSR [5] | 34.88 | 31.12 | 29.19 | 26.92 | 23.85 | 22.30 | 21.29 | 20.44 | 38.13 | 34.47 | 32.42 | 29.10 | 27.12 | 24.68 |
    | EDSR_baseline [33] | SRNO [51] | 34.85 | 31.11 | 29.16 | 26.90 | 23.84 | 22.29 | 21.27 | 20.56 | 38.11 | 34.51 | 32.37 | 29.02 | 27.03 | 24.60 |
    | EDSR_baseline [33] | HIIF (ours) | 34.90 | 31.19 | 29.21 | 26.94 | 23.87 | 22.31 | 21.27 | 20.54 | 38.11 | 34.54 | 32.42 | 29.06 | 27.08 | 24.69 |
    | RDN [56] | RDN only | 34.59 | 31.03 | 29.12 | - | - | - | - | - | 38.24 | 34.71 | 32.47 | - | - | - |
    | RDN [56] | LIIF [10] | 35.00 | 31.27 | 29.25 | 26.88 | 23.73 | 22.18 | 21.17 | 20.47 | 38.22 | 34.63 | 32.38 | 29.04 | 26.96 | - |
    | RDN [56] | LTE [29] | 35.04 | 31.32 | 29.33 | 27.04 | 23.95 | 22.40 | 21.36 | 20.64 | 38.23 | 34.72 | 32.61 | 29.32 | 27.26 | 24.79 |
    | RDN [56] | CLIT [9] | 35.10 | 31.39 | 29.39 | 27.12 | 24.01 | 22.45 | 21.38 | 20.64 | - | - | - | - | - | - |
    | RDN [56] | CiaoSR [5] | 35.13 | 31.39 | 29.43 | 27.13 | 24.03 | 22.45 | 21.41 | 20.55 | 38.29 | 34.85 | 32.66 | 29.46 | 27.36 | 24.92 |
    | RDN [56] | SRNO [51] | 35.16 | 31.42 | 29.42 | 27.12 | 24.03 | 22.46 | 21.41 | 20.68 | 38.32 | 34.84 | 32.69 | 29.38 | 27.28 | - |
    | RDN [56] | HIIF (ours) | 35.27 | 31.59 | 29.56 | 27.16 | 24.02 | 22.48 | 21.40 | 20.62 | 38.34 | 34.85 | 32.70 | 29.49 | 27.35 | 24.95 |
    | SwinIR [31] | SwinIR only | 35.09 | 31.42 | 29.48 | - | - | - | - | - | 38.35 | 34.89 | 32.72 | - | - | - |
    | SwinIR [31] | LIIF [10] | 35.17 | 31.46 | 29.46 | 27.15 | 24.02 | 22.43 | 21.40 | 20.67 | 38.28 | 34.87 | 32.73 | 29.46 | 27.36 | 24.98 |
    | SwinIR [31] | LTE [29] | 35.24 | 31.50 | 29.51 | 27.20 | 24.09 | 22.50 | 21.47 | 20.73 | 38.33 | 34.89 | 32.81 | 29.50 | 27.35 | 25.01 |
    | SwinIR [31] | CLIT [9] | 35.29 | 31.55 | 29.55 | 27.26 | 24.11 | 22.51 | 21.45 | 20.70 | - | - | - | - | - | - |
    | SwinIR [31] | CiaoSR [5] | 35.28 | 31.53 | 29.56 | 27.26 | 24.12 | 22.52 | 21.48 | 20.59 | 38.41 | 34.97 | 32.86 | 29.69 | 27.62 | - |
    | SwinIR [31] | SRNO [51] | - | - | - | - | - | - | - | - | 38.38 | 34.91 | 32.84 | 29.62 | 27.45 | 24.96 |
    | SwinIR [31] | HIIF (ours) | 35.34 | 31.62 | 29.61 | 27.27 | 24.13 | 22.50 | 21.48 | 20.73 | 38.43 | 34.98 | 32.87 | 29.64 | 27.56 | 25.03 |

    Table 2: PSNR (dB, ↑) comparison on the Set14, BSD100 and Urban100 datasets. (Note: Due to the complexity of the original table's formatting, this transcription is a best-effort reconstruction.)

    | Encoder | Method | Set14 ×2 | Set14 ×3 | Set14 ×4 | Set14 ×6 | Set14 ×8 | Set14 ×12 | BSD100 ×2 | BSD100 ×3 | BSD100 ×4 | BSD100 ×6 | BSD100 ×8 | BSD100 ×12 | Urban100 ×2 | Urban100 ×3 | Urban100 ×4 | Urban100 ×6 | Urban100 ×8 | Urban100 ×12 |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | n/a | Bicubic | 28.72 | 26.04 | 24.50 | 22.74 | 21.64 | 20.27 | 28.24 | 25.89 | 24.65 | 23.25 | 22.40 | 21.32 | 25.54 | 23.12 | 21.81 | 20.31 | 19.42 | 18.33 |
    | EDSR_baseline [33] | EDSR only | 33.56 | 30.28 | 28.56 | - | - | - | 32.17 | 29.09 | 27.57 | - | - | - | 31.99 | 28.15 | 26.03 | - | - | - |
    | EDSR_baseline [33] | LIIF [10] | 33.66 | 30.33 | 28.60 | 26.44 | 24.93 | 23.19 | 32.19 | 29.11 | 27.60 | 25.84 | 24.78 | 23.49 | 32.17 | 28.22 | 26.14 | 23.78 | 22.44 | 20.91 |
    | EDSR_baseline [33] | LTE [29] | 33.70 | 30.36 | 28.63 | 26.48 | 24.97 | 23.20 | 32.22 | 29.15 | 27.62 | 25.87 | 24.81 | 23.50 | 32.30 | 28.32 | 26.23 | 23.84 | 22.52 | 20.96 |
    | EDSR_baseline [33] | CiaoSR [5] | 33.88 | 30.46 | 28.76 | 26.59 | 25.07 | 23.23 | 32.28 | 29.20 | 27.70 | 25.94 | 24.88 | 23.55 | 32.79 | 28.57 | 26.58 | 24.23 | 22.82 | 21.08 |
    | EDSR_baseline [33] | SRNO [51] | 33.82 | 30.49 | 28.77 | 26.53 | 25.02 | 23.16 | 32.28 | 29.19 | 27.67 | 25.90 | 24.86 | 23.52 | 32.63 | 28.57 | 26.49 | 24.06 | 22.66 | 21.07 |
    | EDSR_baseline [33] | HIIF (ours) | 33.88 | 30.51 | 28.81 | 26.59 | 25.02 | 23.24 | 32.33 | 29.25 | 27.71 | 25.93 | 24.89 | 23.56 | 32.69 | 28.59 | 26.51 | 23.99 | 22.72 | 21.09 |
    | RDN [56] | RDN only | 34.01 | 30.57 | 28.81 | - | - | - | 32.34 | 29.26 | 27.72 | - | - | - | 32.89 | 28.80 | 26.61 | - | - | - |
    | RDN [56] | LTE [29] | 34.09 | 30.58 | 28.88 | 26.71 | 25.16 | 23.31 | 32.36 | 29.30 | 27.77 | 26.01 | 24.95 | 23.60 | 33.04 | 28.97 | 26.81 | 24.28 | 22.88 | 21.22 |
    | RDN [56] | CiaoSR [5] | 34.22 | 30.65 | 28.93 | 26.79 | 25.28 | 23.37 | 32.41 | 29.34 | 27.83 | 26.07 | 25.00 | 23.64 | 33.30 | 29.17 | 27.11 | 24.58 | 23.13 | 21.42 |
    | RDN [56] | SRNO [51] | 34.27 | 30.71 | 28.97 | 26.76 | 25.26 | - | 32.43 | 29.37 | 27.83 | 26.04 | 24.99 | - | 33.33 | 29.14 | 26.98 | 24.43 | 23.02 | - |
    | RDN [56] | HIIF (ours) | 34.29 | 30.76 | 28.92 | 26.84 | 25.32 | 23.34 | 32.51 | 29.42 | 27.86 | 26.08 | 25.02 | 23.69 | 33.34 | 29.20 | 27.07 | 24.59 | 23.11 | 21.43 |
    | SwinIR [31] | LTE [29] | 34.25 | 30.80 | 29.06 | 26.86 | 25.34 | 23.38 | 32.39 | 29.34 | 27.84 | 26.07 | 25.01 | 23.66 | 33.50 | 29.41 | 27.15 | 24.59 | 23.17 | 21.50 |
    | SwinIR [31] | CiaoSR [5] | 34.27 | 30.85 | 29.08 | 26.94 | 25.42 | 23.44 | 32.46 | 29.42 | 27.91 | 26.09 | 25.03 | - | - | - | 27.24 | 24.62 | - | - |
    | SwinIR [31] | HIIF (ours) | 34.36 | 30.89 | 29.11 | 26.99 | 25.46 | 23.47 | 32.54 | 29.47 | 27.94 | 26.17 | 25.09 | 23.75 | 33.64 | 29.54 | 27.27 | 24.71 | 23.27 | 21.54 |

    Analysis of Quantitative Results:

    • HIIF consistently achieves the highest or second-highest PSNR across almost all backbones, datasets, and scaling factors, from in-distribution (×2, ×3, ×4) to extreme out-of-distribution scales (up to ×30).

    • The largest gain mentioned in the abstract, 0.17dB, can be seen with the RDN backbone on the DIV2K dataset at ×3 scale (31.59dB for HIIF vs. 31.42dB for SRNO).

    • The performance gains are particularly noticeable on challenging datasets like Urban100, which is rich in textures and repeating patterns that benefit from HIIF's multi-scale representation.

    • The radar and bar plots in Figure 1 visually summarize this superiority, showing HIIF's performance envelope consistently enclosing its competitors.

      Figure 1: (Top) Radar plots comparing the performance of the proposed HIIF with five other INR-based continuous ISR methods, using the EDSR and SwinIR backbones. (Bottom) Bar plots showing the PSNR gains of HIIF integrated into EDSR, RDN, and SwinIR over the corresponding fixed-scale ISR models at ×2, ×3, and ×4 up-sampling. All results are based on the DIV2K validation set.

    • Qualitative Analysis: Figures 4 and 6 show visual comparisons. HIIF produces images with sharper details and fewer artifacts than other methods. For instance, in Figure 4, the text on the book spines and the fine structures of the bridge are much clearer in the HIIF result. Figure 6 shows that for non-integer scaling (×3.3), HIIF reconstructs the fine light trails more faithfully than other methods, which tend to blur or distort them.

      Figure 4: Qualitative comparison results, with RDN [56] used as the encoder for all methods. Two image groups (books, 0867; bridge, 0861) show the original HR image, the LR input, and the outputs of Bicubic, LIIF, LTE, SRNO, and HIIF (ours). HIIF recovers details such as the text on the book spines and the bridge structure more faithfully than the other methods.

      Figure 6: Continuous super-resolution comparison on image 0828 (DIV2K, ×3.3), showing the original HR image, the LR input, Bicubic interpolation, and the LIIF, LTE, SRNO, and HIIF (ours) reconstructions. HIIF shows better visual quality in recovering fine lines and textures, closer to the HR image.

    Figure 5 demonstrates that the choice of encoder backbone still matters, with SwinIR providing the best results when combined with HIIF, due to its strong feature extraction capabilities.

    Figure 5: Qualitative comparison among three encoders (EDSR_bl, RDN, and SwinIR) combined with HIIF for ×6 up-sampling, on a sequence from the Urban100 dataset. Compared with the img_019 ground truth (a clean grid pattern), the EDSR_bl and RDN results show visible distortions and artifacts, while SwinIR recovers the grid details more faithfully, though subtle differences remain.

  • Complexity Analysis:

    Table 3: Model complexity comparison using EDSR_baseline encoder on Urban100.

    | Encoder / Method | #Params (M) | Runtime (s) | Memory (GB) |
    |---|---|---|---|
    | EDSR_baseline [33] | 1.5 | 3.23 | 2.2 |
    | + MetaSR [20] | +0.45 | 8.23 | 1.2 |
    | + LIIF [10] | +0.35 | 18.48 | 1.3 |
    | + LTE [29] | +0.49 | 18.54 | 1.4 |
    | + CLIT [9] | +15.7 | 398.02 | 16.3 |
    | + CiaoSR [5] | +1.43 | 251.80 | 12.6 |
    | + SRNO [51] | +0.81 | 20.23 | 7.1 |
    | + HIIF (Ours) | +1.33 | 35.17 | 1.5 |
    | RDN [56] | 21.9 | 8.98 | 3.0 |
    | SwinIR [31] | 11.8 | 68.15 | 3.0 |

    HIIF adds 1.33M parameters to the EDSR backbone and has a runtime of 35.17s. This represents a trade-off: it is more computationally expensive than simpler methods like LIIF and LTE, but significantly more efficient in terms of parameters, runtime, and memory compared to other high-performing methods like CLIT and CiaoSR. This makes HIIF a practical choice for achieving state-of-the-art performance without excessive computational overhead.

  • Ablations / Parameter Sensitivity:

    Table 4: Ablation study on the Urban100 dataset with EDSR_baseline encoder.

    | Method | ×2 | ×3 | ×4 | ×6 | ×8 | ×12 |
    |---|---|---|---|---|---|---|
    | EDSR only | 31.99 | 28.15 | 26.03 | - | - | - |
    | v1-H (w/o Hierarchical Enc.) | 32.56 | 28.47 | 26.38 | 23.91 | 22.59 | 21.00 |
    | v2-MS (w/o Multi-Scale Arch.) | 32.47 | 28.45 | 26.35 | 23.92 | 22.56 | 20.98 |
    | v3-MH (w/o Multi-Head Attn.) | 32.34 | 28.34 | 26.25 | 23.86 | 22.52 | 20.96 |
    | HIIF (ours) | 32.69 | 28.59 | 26.51 | 23.99 | 22.72 | 21.09 |

    The ablation study clearly demonstrates the importance of each of HIIF's components:

    • v1-H: Removing the hierarchical encoding (likely reverting to a single relative coordinate, as in LIIF) causes a consistent performance drop across all scales, confirming that the hierarchical representation contributes meaningfully to reconstruction quality.
    • v2-MS: Removing the multi-scale architecture (i.e., feeding all hierarchical encodings at once at the input) also degrades performance. This shows that the progressive injection of positional information at different layers is crucial for learning multi-scale features effectively.
    • v3-MH: Removing the multi-head linear attention blocks results in the largest performance drop among the variants, highlighting the importance of capturing non-local information to complement the local implicit function. The full HIIF model outperforms all ablated versions, validating the design choices.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces HIIF, a novel framework for continuous image super-resolution. By leveraging a hierarchical positional encoding scheme, a multi-scale decoder architecture, and a multi-head linear attention mechanism, HIIF significantly improves upon existing INR-based methods. It effectively captures both local, multi-scale details and non-local spatial dependencies. Extensive experiments confirm its state-of-the-art performance and its flexibility to be integrated with various encoder backbones.

  • Limitations & Future Work: The paper does not explicitly list limitations. However, some potential areas for future work can be inferred:

    • Perceptual Quality: The evaluation is based solely on PSNR, which does not always align with human perception of image quality. Future work could incorporate perceptual metrics (like LPIPS) or generative adversarial networks (GANs) to improve visual realism.
    • Real-World Degradations: The model is trained and tested using bicubic downsampling. Its performance on real-world low-resolution images with more complex degradations (e.g., blur, noise, compression artifacts) is not explored.
    • Computational Cost: While more efficient than some competitors, HIIF is still slower than simpler methods like LIIF. Further optimization for real-time applications could be a direction for future research.
  • Personal Insights & Critique: HIIF presents a well-motivated and elegant solution to a clear limitation in prior INR-based SR models. The concept of hierarchical positional encoding is intuitive and powerful, providing a structured way to inject multi-scale priors into the network. The progressive architecture is a clever design that ensures these priors are used effectively at different stages of feature processing. The experimental validation is thorough and convincing, demonstrating consistent gains across multiple settings. The inclusion of an ablation study provides strong evidence for the efficacy of each proposed component. A minor critique is the reliance on PSNR. For super-resolution, especially at large scaling factors, visually plausible results are often preferred over ones that are numerically closer to the ground truth but appear blurry. Nevertheless, as a method for improving the fidelity of INR-based reconstruction, HIIF is a significant and well-executed contribution to the field.
