DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution
TL;DR Summary
DiT4SR integrates LR embeddings into the diffusion transformer's attention for bidirectional latent interaction and uses cross-stream convolution to enhance local detail, achieving superior real-world image super-resolution performance with diffusion transformers.
Abstract
Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. Project Page: https://adam-duan.github.io/projects/dit4sr/.
In-depth Reading
1. Bibliographic Information
1.1. Title
The title of the paper is "DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution". This title clearly indicates the central topic, which is the adaptation of Diffusion Transformer (DiT) models for the specific task of Real-World Image Super-Resolution (Real-ISR).
1.2. Authors
The authors are:
- Zheng-Peng Duan (1, 2)
- Jiawei Zhang (2)
- Xin Jin (1)
- Ziheng Zhang (1)
- Zheng Xiong (2)
- Dongqing Zou (2, 3)
- Jimmy S. Ren (2, 4)
- Chunle Guo (1)
- Chongyi Li (1)
Affiliations:
1. VCIP, CS, Nankai University
2. SenseTime Research
3. PBVR
4. Hong Kong Metropolitan University
The diverse affiliations, spanning universities and industrial labs (e.g., SenseTime Research), suggest a collaboration between academic and industrial research, which is common in cutting-edge AI work. Jimmy S. Ren is a well-known figure in computer vision research.
1.3. Journal/Conference
The paper was posted to arXiv on 2025-03-30 (UTC). No specific conference or journal is mentioned as the final publication venue; the arXiv preprint suggests it is under peer review or has been submitted to a major venue. Given the 2025 date, it is likely intended for a top-tier computer vision or machine learning conference; research in this domain typically appears in venues such as CVPR, ICCV, ECCV, or NeurIPS, which are highly reputable and influential in the field.
1.4. Publication Year
The preprint was posted on March 30, 2025.
1.5. Abstract
The paper addresses the Real-World Image Super-Resolution (Real-ISR) problem, leveraging large-scale pre-trained diffusion models due to their rich generative priors. It focuses on Diffusion Transformer (DiT) models, which have shown superior performance over traditional UNet-based architectures in image generation. The core question investigated is whether advanced DiT-based diffusion models can be adopted for Real-ISR.
To answer this, the authors propose DiT4SR, a pioneering work in taming large-scale DiT models for Real-ISR. Unlike methods such as ControlNet that directly inject low-resolution (LR) image embeddings, DiT4SR integrates LR embeddings directly into the original attention mechanism of DiT. This allows for bidirectional information flow between the LR latent and the generated latent, enabling the LR stream to evolve and provide progressively refined guidance aligned with the diffusion process. Additionally, a cross-stream convolution layer is introduced to inject LR guidance into the generated latent, compensating for DiT's limited ability to capture local information. These designs enable DiT4SR to achieve superior performance in Real-ISR, as validated by extensive experiments.
1.6. Original Source Link
The original source link is: https://arxiv.org/abs/2503.23580
The PDF link is: https://arxiv.org/pdf/2503.23580v2.pdf
This indicates the paper is available as a preprint on arXiv, a common platform for sharing research before formal peer review and publication. The v2 suffix indicates an updated version of the preprint.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is Real-World Image Super-Resolution (Real-ISR). This task involves recovering a high-resolution (HR) image from its low-resolution (LR) counterpart, which has undergone various complex degradations (e.g., compression, blur, noise).
This problem is important because Real-ISR is highly ill-posed; a single LR image can correspond to multiple possible HR images. Beyond simply removing degradations, it requires generating perceptually realistic details to enhance visual quality, which demands significant prior knowledge from the model. Traditional Super-Resolution (SR) methods often struggle with the complex, real-world degradations and the generation of realistic, high-fidelity details.
Prior research has turned to large-scale pre-trained text-to-image (T2I) diffusion models like Stable Diffusion (SD) due to their rich generative priors learned from vast datasets of high-quality images. These models, often based on UNet architectures, leverage techniques like ControlNet to guide the generative process with LR images.
However, recent advancements have shown that Diffusion Transformers (DiT), such as those used in SD3 and Flux, offer overwhelmingly superior performance in image generation compared to UNet-based architectures. This raises a critical question: Can the advanced DiT-based diffusion model be effectively adopted for Real-ISR? The challenge lies in efficiently and robustly integrating LR information into the DiT architecture, especially considering DiT's global attention mechanism and potential limitations in capturing local details. An intuitive ControlNet-like approach might not fully leverage DiT's unique characteristics.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Pioneering DiT for Real-ISR with an Integrated LR Stream: DiT4SR is presented as one of the first works to adapt large-scale DiT models for Real-ISR. Instead of merely copying blocks (like ControlNet), DiT4SR integrates the LR stream directly into the original DiT block's attention mechanism. This design enables bidirectional information interaction between the LR guidance and the diffusion process, allowing the LR stream to evolve and provide more refined, context-aware guidance throughout the diffusion process.
- Cross-Stream Convolution for Local Information: the authors introduce a convolutional layer that injects LR guidance into the Noise Stream's MLP (Multi-Layer Perceptron) processing. This mechanism specifically compensates for DiT's inherent limitation in capturing local information with its global attention mechanism, which is crucial for restoring fine details in SR tasks.

The key findings are that these simple yet effective designs endow the DiT model with superior performance in Real-ISR. Extensive experiments demonstrate DiT4SR's ability to produce high-quality restorations, often outperforming state-of-the-art methods on real-world datasets, particularly in clarity, detail abundance, and the handling of fine structures. A user study further confirms its superiority in image realism and fidelity.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DiT4SR, several foundational concepts are crucial:
- Super-Resolution (SR): The task of enhancing the resolution of an image, typically by generating a high-resolution (HR) image from a low-resolution (LR) input.
- Real-World Image Super-Resolution (Real-ISR): A more challenging variant of SR where LR images are degraded by complex, often unknown, real-world factors (e.g., camera noise, compression artifacts, blur) rather than just simple downsampling. This task requires not only removing degradations but also hallucinating realistic details.
- Generative Models: A class of machine learning models designed to generate new data samples that resemble the training data. For Real-ISR, they are used to generate plausible HR details.
- Diffusion Models: A type of generative model that works by learning to reverse a gradual diffusion (noise-addition) process. During training, noise is progressively added to data and the model learns to denoise it; during inference, the model starts from random noise and iteratively denoises it to generate a new sample.
- Latent Diffusion Models (LDM) / Stable Diffusion (SD): A variant of diffusion models that operates in a lower-dimensional latent space rather than directly on pixel space, making it computationally more efficient. Stable Diffusion is a prominent LDM capable of high-quality image generation from text prompts.
- UNet Architecture: A convolutional neural network architecture, originally designed for biomedical image segmentation, characterized by its U-shaped structure: an encoder (downsampling path) that captures context, a decoder (upsampling path) that enables precise localization, and skip connections between corresponding encoder and decoder layers that preserve fine-grained information. UNets are widely used as the backbone of many diffusion models.
- Transformers: A neural network architecture that relies heavily on the self-attention mechanism to weigh the importance of different parts of the input. Transformers have revolutionized Natural Language Processing (NLP) and are increasingly popular in computer vision tasks.
- Self-Attention Mechanism: A core component of transformers that allows a model to weigh the importance of different elements of a sequence when processing each element. The fundamental formula for Scaled Dot-Product Attention is
  $$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
  where $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings; $QK^T$ computes the dot products between queries and keys, giving attention scores; $\sqrt{d_k}$ is a scaling factor (the square root of the key dimension) that prevents large dot products from pushing the softmax into regions with tiny gradients; softmax normalizes the scores into attention weights; and $V$ is then weighted by these attention weights. (A runnable sketch follows this list.)
- Diffusion Transformer (DiT): A transformer-based backbone for diffusion models that replaces the traditional UNet. DiTs process image patches as sequences of tokens, leveraging the global receptive field and scaling properties of transformers for improved generative performance.
- Multimodal Diffusion Transformers (MM-DiT): An extension of DiT that handles multiple modalities (e.g., text and image) by processing them in separate streams (e.g., a Text Stream and a Noise Stream) while allowing bidirectional interaction through joint attention mechanisms.
- ControlNet: A neural network architecture that adds conditional control to large pre-trained diffusion models (typically UNet-based). It creates a trainable copy of the diffusion model's UNet encoder that takes an additional condition (e.g., an LR image, edge map, or human pose) as input; the outputs of this conditional encoder are injected into the original UNet's skip connections to guide generation. It usually employs a one-way information flow.
- Variational Autoencoder (VAE) encoder/decoder: VAEs are generative models that learn a latent-space representation of data. The encoder maps a high-dimensional input (e.g., an image) to a lower-dimensional latent representation, and the decoder reconstructs the input from it. In LDMs, the VAE encoder transforms images into the latent space for diffusion, and the VAE decoder reconstructs images from the denoised latent.
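As forward-referenced above, here is a minimal, runnable PyTorch sketch of Scaled Dot-Product Attention. It is illustrative only (toy shapes, no multi-head logic or masking):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Computes softmax(QK^T / sqrt(d_k)) V for sequences of shape (batch, seq, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # (batch, seq_q, seq_k) attention scores
    weights = F.softmax(scores, dim=-1)           # normalize scores into attention weights
    return weights @ V                            # weighted sum of the values

# Toy usage: batch of 2, 16 tokens each, 64-dimensional embeddings.
Q = torch.randn(2, 16, 64); K = torch.randn(2, 16, 64); V = torch.randn(2, 16, 64)
out = scaled_dot_product_attention(Q, K, V)       # -> shape (2, 16, 64)
```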
3.2. Previous Works
The paper contextualizes its contribution by discussing prior work in Image Super-Resolution (ISR) and Diffusion Models for Real-ISR.
- Traditional ISR Methods:
  - Convolutional Networks (CNNs): early deep learning methods like SRCNN [18], EDSR [32], RCAN [72], and RAN [71] achieved significant progress but struggled with Real-ISR's complex degradations.
  - Transformers: more recent ISR methods such as IPT [10], HAT [13], SwinIR [29], and DAT [12] leveraged transformer architectures. SwinIR is explicitly used as a baseline.
- GAN-based Real-ISR: to address the ill-posedness and generate realistic details, Generative Adversarial Networks (GANs) were proposed, such as ESRGAN [52], GLEAN [6], DGAN [9], and EDAN [31]. While generating perceptually realistic details, these often suffer from training instability and unnatural visual artifacts [8, 30, 60]. Real-ESRGAN [53] and BSRGAN [67] further explored complex degradation models; Real-ESRGAN is used as a baseline.
- Diffusion-based Real-ISR (UNet-centric): with the success of Stable Diffusion (SD) [41], many methods built upon its UNet architecture, leveraging its generative priors.
  - ControlNet-like approaches: StableSR [51] and DiffBIR [33] inject LR information as conditions via ControlNet [68] or similar mechanisms. SeeSR [58] and PASD [63] further incorporate high-level semantic information. ResShift [65], OSEDiff [57], and SUPIR [64] are other examples; SUPIR specifically investigates scaling effects.
  - Efficiency-focused methods include OSD [35], PatchScaler [34], distillation-free one-step diffusion [27], SinSR [54], and AddSR [61].
- Diffusion Transformer (DiT) for Generation: DiT [36] itself revolutionized generative modeling. PixArt-α [11], SD3 [20], and Flux [3] are prominent large-scale T2I models that leverage Multimodal Diffusion Transformers (MM-DiTs) to integrate the text and image modalities, achieving state-of-the-art performance thanks to full interaction between modalities. SD3 is the direct foundation for DiT4SR.
- Diffusion Transformer (DiT) for SR/Restoration: DiT-SR [14] trains DiT-based SR models from scratch. DreamClear [2] proposes a DiT-based image restoration model but still adopts ControlNet to inject LR information, which the authors argue prevents it from fully leveraging DiT's advantages; DreamClear is used as a baseline. Two concurrent methods, TSD-SR [19] and One diffusion step [28], explore one-step DiT-based SR models using flow-trajectory concepts.
3.3. Technological Evolution
The evolution of image super-resolution has progressed from traditional signal processing methods to deep learning. Initially dominated by Convolutional Neural Networks (CNNs), the field saw a shift towards Generative Adversarial Networks (GANs) to tackle the perceptual quality aspect of Real-ISR. More recently, diffusion models, particularly Latent Diffusion Models (LDMs) like Stable Diffusion (SD), have emerged as powerful generative priors for Real-ISR, largely due to their ability to produce highly realistic and diverse results.
The architectural backbone of these diffusion models has also evolved. Initially, UNet architectures were standard. However, Transformers, with their global attention mechanisms and scalability, have proven superior in general image generation tasks, leading to the development of Diffusion Transformers (DiT). These DiT models, especially Multimodal Diffusion Transformers (MM-DiTs) like SD3, have pushed the boundaries of generative performance.
This paper's work (DiT4SR) represents a crucial next step in this evolution. It aims to bridge the gap between the advanced DiT architecture (proven in generation) and the Real-ISR task, moving beyond UNet-based SD and even DiT-based models that still rely on ControlNet-like external conditioning. By directly integrating LR information within the DiT's attention mechanism and addressing DiT's local information limitations, DiT4SR seeks to unlock the full potential of DiT for Real-ISR.
3.4. Differentiation Analysis
DiT4SR differentiates itself from previous methods primarily in how it integrates low-resolution (LR) information into the Diffusion Transformer (DiT) architecture for Real-World Image Super-Resolution (Real-ISR).
- Compared to UNet-based diffusion models (e.g., StableSR, DiffBIR, SeeSR, SUPIR): these methods build upon UNet-based Stable Diffusion and typically use ControlNet or similar mechanisms. DiT4SR fundamentally differs by using a DiT backbone, which has shown overwhelming performance over UNet in image generation due to its global attention and scalability.
- Compared to ControlNet-like approaches (e.g., SD3-ControlNet, DreamClear): this is the most direct point of differentiation.
  - ControlNet (and SD3-ControlNet) duplicates several blocks of the main network, processes LR information in these separate blocks, and injects the LR embedding into the Noise Stream (or the UNet's skip connections) via trainable linear layers. This creates a one-directional information flow: the LR stream neither evolves nor interacts bidirectionally with the generated latent.
  - DreamClear, although built on a DiT backbone for image restoration, still employs the ControlNet paradigm for LR information injection.
  - DiT4SR's innovation: DiT4SR abandons the ControlNet-like approach entirely. It integrates the LR Stream directly into the original DiT block's attention mechanism, enabling bidirectional information interaction between the LR latent and the generated latent (Noise Stream), so the LR Stream continuously adapts and evolves alongside the diffusion process, yielding progressively refined guidance. It also introduces a cross-stream convolution layer that injects LR guidance between the MLP layers of the LR Stream and the Noise Stream, addressing DiT's limited ability to capture local information, which is crucial for fine details in SR. ControlNet primarily provides global structural guidance, not local detail enhancement via a convolutional injection like DiT4SR's.
- Compared to other DiT-based SR models (e.g., DiT-SR, concurrent one-step methods): while DiT-SR trains DiT models from scratch for SR, DiT4SR focuses on taming a large-scale pre-trained DiT model (specifically SD3) by designing a control mechanism. The concurrent one-step methods focus on flow-trajectory distillation for efficiency, which is a different concern from DiT4SR's control-mechanism design.

In essence, DiT4SR's core innovation lies in its tailored approach to leveraging DiT's architectural strengths: enabling rich, bidirectional interaction of LR guidance and addressing DiT's local-information shortcomings through a novel convolutional injection, rather than simply adapting existing conditioning mechanisms.
4. Methodology
4.1. Principles
The core idea behind DiT4SR is to effectively adapt a large-scale pre-trained Diffusion Transformer (DiT) model, specifically SD3, for the task of Real-World Image Super-Resolution (Real-ISR). The theoretical basis and intuition are rooted in two main observations:
- Leveraging DiT's generative power: DiT-based models (like SD3) have demonstrated overwhelming performance in image generation thanks to Multimodal Diffusion Transformers (MM-DiTs), which facilitate rich bidirectional information flow between modalities (e.g., text and image) via attention. The principle is to extend this multi-modal interaction to include low-resolution (LR) image information as another stream that actively guides the generation of high-resolution (HR) images.
- Addressing DiT's limitations for SR: while DiT excels at global context and generation, its attention mechanism operates globally and may not sufficiently capture local information, which is paramount for restoring fine details in SR tasks. The principle is to complement the global, attention-based guidance with a mechanism that specifically introduces local LR guidance.

Therefore, DiT4SR integrates LR information directly and interactively into DiT's core attention mechanism (enabling bidirectional flow and evolving guidance) and provides a targeted local information injection through a convolutional layer. This combined approach allows the DiT to leverage its generative priors while being precisely controlled by the LR input to produce high-fidelity Real-ISR results.
4.2. Core Methodology In-depth (Layer by Layer)
DiT4SR is built upon the DiT-architectured SD3. The overall process involves a diffusion process in the latent space, guided by LR information and text prompts.
4.2.1. Overall Architecture
The DiT4SR architecture is similar to SD3 but with key modifications to integrate LR information.
As shown in Figure 3(a), the process starts with a noisy latent and an LR image.
- Input Processing:
  - The noisy latent (with height H, width W, and C channels) is first flattened into a patch sequence and then projected into the transformer's hidden dimension by a linear layer to obtain the noisy image token $\mathbf{X}$. Position embedding is added to $\mathbf{X}$ to retain spatial information.
  - The LR image is encoded into the latent space using the pre-trained VAE encoder to obtain the LR latent. Since the LR latent and the noisy latent are both visual representations, they undergo the same processing: flattening into a patch sequence, linear projection to the hidden dimension, and addition of the same position embedding, forming the LR image token $\mathbf{L}$.
  - A text caption describing the image is encoded by three pre-trained text models (CLIP-L [39], CLIP-G [15], and T5-XXL [40]). Pooled representations from the CLIP models are combined with the timestep to modulate DiT's internal features, and all three text representations are combined to construct the text token $\mathbf{C}$.

These three tokens ($\mathbf{X}$, $\mathbf{L}$, $\mathbf{C}$) represent the Noise Stream, LR Stream, and Text Stream, respectively. The MM-DiT block of SD3 is modified into an MM-DiT-Control block to handle these three streams. After passing through the stack of MM-DiT-Control blocks and an unpatch operation, the Noise Stream outputs the denoised latent for the current timestep. This diffusion process is repeated over the sampling steps, and the final clean latent is decoded by the VAE decoder to obtain the desired HR result. (A tokenization sketch follows below.)
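Both visual streams go through the identical patchify-and-embed step described above. The following is a minimal sketch of that tokenization; the channel count, patch size, and width are illustrative assumptions modeled on SD3-class models, not the paper's stated configuration:

```python
import torch
import torch.nn as nn

class LatentPatchifier(nn.Module):
    """Illustrative sketch: flatten a latent (C, H, W) into a patch-token
    sequence and project it to the transformer width, as both the noisy
    latent and the VAE-encoded LR latent are processed in DiT4SR."""
    def __init__(self, in_channels=16, patch=2, dim=1536, max_tokens=4096):
        super().__init__()
        # A strided conv is the standard way to split into non-overlapping patches.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch, stride=patch)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_tokens, dim))

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        x = self.proj(latent)                    # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B, num_tokens, dim)
        return x + self.pos_emb[:, : x.size(1)]  # same position embedding for both streams

tokens = LatentPatchifier()(torch.randn(1, 16, 64, 64))  # -> (1, 1024, 1536)
```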
The following figure (Figure 3 from the original paper) shows the overall architecture:
Figure 3 (from the original paper): the DiT4SR architecture, with (a) the overall framework and (b) the detailed structure of a single MM-DiT-Control block, illustrating the multi-stream fusion transformer within the diffusion model and the LR information injection mechanism.
This design contrasts with SD3-ControlNet (shown in Figure 2(a)), where ControlNet processes the LR Stream in additional DiT blocks and injects LR embedding into the Noise Stream via trainable linear layers, establishing only a one-way information flow. In contrast, DiT4SR (shown in Figure 2(b)) integrates the LR Stream directly into the original DiT blocks, enabling bidirectional information flow.
The following figure (Figure 2 from the original paper) shows the network structure comparison:
Figure 2 (from the original paper): comparison of the network structures of SD3-ControlNet and DiT4SR. Red arrows indicate information flow: DiT4SR achieves bidirectional interaction between the LR Stream and the Noise Stream, whereas SD3-ControlNet allows only one-way flow, limiting interaction.
4.2.2. LR Integration in Attention
This is the first key modification within each MM-DiT-Control block. Instead of separate ControlNet blocks, the LR Stream is integrated into the joint attention mechanism of the DiT.
The input to the joint attention mechanism comprises the noisy image token $\mathbf{X}$, the LR image token $\mathbf{L}$, and the text token $\mathbf{C}$. For each stream, separate linear projections are used to compute the Query ($\mathbf{Q}$), Key ($\mathbf{K}$), and Value ($\mathbf{V}$) matrices.

The inputs $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ for the joint attention are formulated as:
$$
\begin{aligned}
\mathbf{Q} &= P_{\mathbf{Q}}^{\mathbf{X}}(\mathbf{X}) \circledast P_{\mathbf{Q}}^{\mathbf{L}}(\mathbf{L}) \circledast P_{\mathbf{Q}}^{\mathbf{C}}(\mathbf{C}), \\
\mathbf{K} &= P_{\mathbf{K}}^{\mathbf{X}}(\mathbf{X}) \circledast P_{\mathbf{K}}^{\mathbf{L}}(\mathbf{L}) \circledast P_{\mathbf{K}}^{\mathbf{C}}(\mathbf{C}), \\
\mathbf{V} &= P_{\mathbf{V}}^{\mathbf{X}}(\mathbf{X}) \circledast P_{\mathbf{V}}^{\mathbf{L}}(\mathbf{L}) \circledast P_{\mathbf{V}}^{\mathbf{C}}(\mathbf{C}),
\end{aligned}
$$
Where:
- $\mathbf{X}$: the noisy image token (from the Noise Stream).
- $\mathbf{L}$: the LR image token (from the LR Stream).
- $\mathbf{C}$: the text token (from the Text Stream).
- $P^{\mathbf{X}}_{\{\mathbf{Q},\mathbf{K},\mathbf{V}\}}$: pre-trained, fixed linear projections for the noisy image token.
- $P^{\mathbf{C}}_{\{\mathbf{Q},\mathbf{K},\mathbf{V}\}}$: pre-trained, fixed linear projections for the text token.
- $P^{\mathbf{L}}_{\{\mathbf{Q},\mathbf{K},\mathbf{V}\}}$: newly created, trainable linear projections for the LR image token. Their weights are initialized to zeros so that, at the beginning of training, the LR Stream has minimal impact, which then progressively increases as the model learns.
- $\circledast$: concatenation of the projected outputs of $\mathbf{X}$, $\mathbf{L}$, and $\mathbf{C}$ along the sequence dimension. This creates combined sequences of queries, keys, and values from all three streams, allowing them to interact through a single attention mechanism.

The joint attention in the MM-DiT-Control block is then calculated using the standard Scaled Dot-Product Attention formula:
$$
\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \underbrace{\mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)}_{\text{attention map}} \mathbf{V}
$$
Where:
- $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$: the concatenated query, key, and value matrices from all three streams.
- $d$: the dimension of the keys (and queries), used as a scaling factor.
- softmax: applied to the scaled dot products to obtain attention weights.

This formulation allows for comprehensive interaction among all three streams. The paper visualizes the attention maps between the noisy image token and the LR image token in Figure 4(a), showing clear diagonal activations in both the self-attention regions (e.g., $\mathbf{X}\!\to\!\mathbf{X}$, $\mathbf{L}\!\to\!\mathbf{L}$) and the cross-attention regions (e.g., $\mathbf{X}\!\to\!\mathbf{L}$, $\mathbf{L}\!\to\!\mathbf{X}$), indicating strong bidirectional information interaction. This interaction is crucial because it allows the Noise Stream to be guided by LR information and, simultaneously, the LR Stream to adapt and provide context-aware guidance based on the evolving state of the Noise Stream.
The following figure (Figure 4 from the original paper) visualizes attention maps:

LR Residual:
The paper notes that the information interaction between $\mathbf{X}$ and $\mathbf{L}$ can decay through successive attention blocks (as shown in Figure 4(b), without the LR Residual). This happens because the LR Stream may experience undesired disruptions as it evolves with the Noise Stream. To maintain the consistency of LR guidance in deeper transformer blocks, an additional shortcut (residual connection) is introduced: this LR Residual directly transfers the input LR information to the output of the joint attention mechanism within the LR Stream, ensuring that the LR guidance is effectively preserved and exerts a consistent influence throughout the diffusion process. (A combined sketch of the joint attention and this residual follows.)
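Putting Section 4.2.2 together, the sketch below shows a single-head toy version of the three-stream joint attention with zero-initialized LR projections and the LR Residual. All names and the width are our own assumptions; SD3's multi-head attention, output projections, and modulation are omitted for clarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionWithLR(nn.Module):
    """Hedged sketch of the three-stream joint attention (not the paper's exact code)."""
    def __init__(self, dim=1536):
        super().__init__()
        self.qkv_x = nn.Linear(dim, 3 * dim)  # stands in for the frozen noise-stream projections
        self.qkv_c = nn.Linear(dim, 3 * dim)  # stands in for the frozen text-stream projections
        self.qkv_l = nn.Linear(dim, 3 * dim)  # new trainable LR-stream projections ...
        nn.init.zeros_(self.qkv_l.weight)     # ... zero-initialized so LR influence grows from zero
        nn.init.zeros_(self.qkv_l.bias)

    def forward(self, x, l, c):
        n_x, n_l = x.size(1), l.size(1)
        # Sequence-wise concatenation of the three projected streams (the ⊛ operation).
        qkv = torch.cat([self.qkv_x(x), self.qkv_l(l), self.qkv_c(c)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        out = F.scaled_dot_product_attention(q, k, v)  # softmax(QK^T/sqrt(d))V over all streams jointly
        x_out, l_out, c_out = out.split([n_x, n_l, out.size(1) - n_x - n_l], dim=1)
        l_out = l_out + l                     # LR Residual: re-inject the input LR tokens
        return x_out, l_out, c_out

x, l = torch.randn(1, 1024, 1536), torch.randn(1, 1024, 1536)
c = torch.randn(1, 77, 1536)                  # toy text-token length
x_out, l_out, c_out = JointAttentionWithLR()(x, l, c)
```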
4.2.3. LR Injection between MLP
This is the second key modification, addressing DiT's limitation in capturing local information.
The joint attention mechanism operates at a global level, relying primarily on position embeddings for spatial information. For Super-Resolution tasks, accurately restoring fine details requires strong local information. Relying solely on global attention is insufficient, as demonstrated by results in Figure 5(b) which show challenges in restoring text and maintaining fidelity.
To overcome this, DiT4SR introduces an LR Injection mechanism between the MLP (Multi-Layer Perceptron) layers of the LR Stream and the Noise Stream.
- Within the MLP of both the LR Stream and the Noise Stream, the hidden state dimensions are first expanded by a factor of 4 and then projected back to the original size using two linear projections.
- Consider the intermediate (expanded) features of the Noise Stream and of the LR Stream.
- The LR information from the LR Stream's intermediate feature is injected into the Noise Stream's intermediate feature using a 3x3 depth-wise convolution layer:
  - First, the LR feature is reshaped from its token form to an image-like form.
  - This reshaped LR feature is then passed through a 3x3 depth-wise convolution layer. The weights of this convolution layer are also initialized to zeros, allowing its influence to grow gradually during training.
  - After the convolution, the output is reshaped back to the image-token form and added to the Noise Stream's intermediate feature.

This 3x3 depth-wise convolution layer serves two critical purposes:
- Strengthening LR guidance: it provides an additional channel for LR information to influence the Noise Stream.
- Capturing local information: unlike global attention, convolution inherently excels at capturing local spatial patterns. By using a convolutional layer, DiT4SR effectively compensates for DiT's limited local-information-capturing ability. This is crucial for restoring fine structures like text, as shown in Figure 5(d), which performs significantly better than using a simple linear layer (Figure 5(c)). A sketch of this injection follows Figure 5 below.

The following figure (Figure 5 from the original paper) illustrates the effect of LR injection:
Figure 5 (from the original paper): comparison of restoration with different ways of injecting LR information. (a) LR input; (b) no LR injection; (c) LR injection via a linear layer; (d) DiT4SR, which replaces the linear layer with a convolutional layer, markedly improving local detail restoration.
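As referenced above, here is a minimal sketch of the convolutional LR injection between the two MLPs. The hidden width is an illustrative assumption (4x an assumed base width), and the surrounding MLP linear layers are omitted:

```python
import torch
import torch.nn as nn

class ConvLRInjection(nn.Module):
    """Hedged sketch of the cross-stream injection: expanded LR MLP features
    are reshaped to an image grid, passed through a zero-initialized 3x3
    depth-wise convolution, and added to the noise-stream MLP features."""
    def __init__(self, hidden_dim=6144):  # 4x expanded MLP width (illustrative)
        super().__init__()
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depth-wise: one 3x3 filter per channel
        nn.init.zeros_(self.dwconv.weight)  # zero init: injection strength grows during training
        nn.init.zeros_(self.dwconv.bias)

    def forward(self, f_noise: torch.Tensor, f_lr: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, d = f_lr.shape                               # token form: (B, h*w, d)
        grid = f_lr.transpose(1, 2).reshape(b, d, h, w)    # reshape to an image-like grid
        local = self.dwconv(grid)                          # capture local spatial patterns
        return f_noise + local.flatten(2).transpose(1, 2)  # back to tokens, then inject

f_noise, f_lr = torch.randn(1, 1024, 6144), torch.randn(1, 1024, 6144)
out = ConvLRInjection()(f_noise, f_lr, h=32, w=32)         # -> (1, 1024, 6144)
```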
5. Experimental Setup
5.1. Datasets
Training Datasets
The authors use a combination of several high-quality image datasets for training, augmented with self-captured images:
- DIV2K [1]: a widely used dataset for super-resolution and image restoration tasks.
- DIV8K [22]: an extension of DIV2K with 8K-resolution images.
- Flickr2K [46]: another common image restoration dataset, consisting of a large number of diverse images.
- FFHQ (first 10K face images) [24]: a high-quality dataset of human faces, useful for training models that need to generate realistic facial details.
- 1K self-captured high-resolution images: added to fully exploit the potential of the method and scale up the training set, indicating a focus on real-world applicability.

To create LR-HR training pairs, the degradation pipeline from Real-ESRGAN [53] is used with the same parameter configuration as SeeSR [58], so the training data mimics real-world degradations. The LR and HR training resolutions are fixed so that pairs follow the ×4 scaling factor used at evaluation. A simplified degradation sketch follows below.
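To picture pair construction, the sketch below is a heavily simplified, hypothetical approximation of a Real-ESRGAN-style degradation; the real pipeline chains two randomized rounds of blur, resize, noise, and compression with carefully tuned parameter ranges:

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def degrade(hr: Image.Image, scale: int = 4) -> Image.Image:
    """Simplified stand-in for a Real-ESRGAN-style degradation:
    blur -> bicubic downsample -> Gaussian noise -> JPEG compression."""
    img = hr.filter(ImageFilter.GaussianBlur(radius=float(np.random.uniform(0.5, 2.0))))
    w, h = img.size
    img = img.resize((w // scale, h // scale), Image.BICUBIC)
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, np.random.uniform(1.0, 10.0), arr.shape)  # sensor-like noise
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=int(np.random.uniform(30, 95)))   # compression artifacts
    buf.seek(0)
    return Image.open(buf)

lr = degrade(Image.new("RGB", (512, 512), "gray"))  # toy HR input -> 128x128 degraded LR
```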
Evaluation Datasets
DiT4SR is evaluated on four widely used real-world datasets designed for Real-ISR tasks, all with a ×4 scaling factor:
- DrealSR [56]: consists of 93 images; center-cropping is adopted to obtain LR images of a fixed resolution.
- RealSR [5]: consists of 100 images; center-cropping is adopted to obtain LR images of a fixed resolution.
- RealLR200 [58]: proposed in SeeSR, it comprises 200 images of significantly different resolutions and lacks corresponding ground-truth (GT) images.
- RealLQ250 [2]: established by DreamClear, it consists of 250 images at a fixed resolution and also lacks corresponding GT images.

The inclusion of datasets without GT images (RealLR200, RealLQ250) is crucial for Real-ISR, as real-world scenarios often lack perfect GTs, making non-reference metrics particularly relevant.
5.2. Evaluation Metrics
The paper employs a suite of perceptual and non-reference image quality assessment (IQA) metrics, acknowledging that traditional full-reference metrics like PSNR and SSIM [55] may not accurately reflect visual quality for generative tasks [4, 23, 64]. The metrics used are listed below; a usage sketch with an open-source IQA toolbox follows this list.

- LPIPS (Learned Perceptual Image Patch Similarity) [69]:
  - Conceptual Definition: LPIPS measures the perceptual distance between two images by comparing their deep features extracted from a pre-trained CNN (e.g., AlexNet, VGG). It aims to align better with human perception of image similarity than traditional pixel-wise metrics like PSNR or SSIM. A lower LPIPS score indicates higher perceptual similarity (better fidelity).
  - Mathematical Formula: $ \text{LPIPS}(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{hw} - \phi_l(x_0)_{hw} \right) \right\|_2^2 $
  - Symbol Explanation:
    - $x$: the generated image.
    - $x_0$: the reference (ground-truth) image.
    - $\phi_l(\cdot)$: the feature stack from layer $l$ of a pre-trained CNN (e.g., AlexNet).
    - $w_l$: learnable channel-wise weights for layer $l$.
    - $H_l, W_l$: height and width of the feature map at layer $l$.
    - $\odot$: element-wise product.
    - $\|\cdot\|_2^2$: squared $\ell_2$ norm.
    - The sums are taken over the layers $l$ and spatial positions $(h, w)$.
- MUSIQ (Multi-scale Image Quality Transformer) [25]:
  - Conceptual Definition: MUSIQ is a non-reference image quality assessment (NR-IQA) metric that uses a transformer architecture to predict image quality scores. It operates on multiple scales of the input image and aggregates features using attention, designed to correlate well with human quality judgments for images with various distortions. A higher MUSIQ score indicates better image quality.
  - Mathematical Formula: the full internal transformer computation is complex and not typically written as a single equation; the score is a scalar output of the trained model: $ \text{MUSIQ Score} = f_{\text{MUSIQ}}(\text{Image}) $
  - Symbol Explanation: $f_{\text{MUSIQ}}$ is the trained MUSIQ transformer model; Image is the input image to be assessed.
- MANIQA (Multi-dimension Attention Network for No-Reference Image Quality Assessment) [62]:
  - Conceptual Definition: MANIQA is another non-reference IQA metric that leverages a multi-dimension attention network to predict image quality. It extracts features along different dimensions (e.g., channel and spatial) and uses attention mechanisms to weigh their importance, aiming for high correlation with human perception. A higher MANIQA score indicates better image quality.
  - Mathematical Formula: as with MUSIQ, the score is the scalar output of a trained network: $ \text{MANIQA Score} = f_{\text{MANIQA}}(\text{Image}) $
  - Symbol Explanation: $f_{\text{MANIQA}}$ is the trained MANIQA network; Image is the input image.
- ClipIQA (Exploring CLIP for Assessing the Look and Feel of Images) [50]:
  - Conceptual Definition: ClipIQA is a non-reference IQA metric that utilizes the CLIP (Contrastive Language-Image Pre-training) model's ability to relate image content to textual concepts. It typically compares an image against quality-related text prompts, or uses CLIP's image-encoder features, to infer quality. A higher ClipIQA score indicates better image quality.
  - Mathematical Formula: model-based, like MUSIQ and MANIQA: $ \text{ClipIQA Score} = f_{\text{ClipIQA}}(\text{Image}) $
  - Symbol Explanation: $f_{\text{ClipIQA}}$ is the CLIP-derived quality-assessment model; Image is the input image.
- LIQE (Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective) [70]:
  - Conceptual Definition: LIQE is a blind (non-reference) IQA method that estimates image quality via vision-language correspondence. It uses multitask learning to predict several quality attributes that collectively determine the overall score, aiming for strong correlation with human judgment across distortion types. A higher LIQE score indicates better image quality.
  - Mathematical Formula: like the other NR-IQA metrics, the score is a neural network output: $ \text{LIQE Score} = f_{\text{LIQE}}(\text{Image}) $
  - Symbol Explanation: $f_{\text{LIQE}}$ is the trained LIQE network; Image is the input image.
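As forward-referenced above, all five metrics are model-based and easiest to compute with an open-source toolbox. The sketch below uses pyiqa; this is our assumption for illustration, as the paper does not state which implementation the authors used:

```python
import torch
import pyiqa  # pip install pyiqa -- open-source IQA toolbox implementing all five metrics

device = 'cuda' if torch.cuda.is_available() else 'cpu'
sr = torch.rand(1, 3, 512, 512, device=device)  # restored image, RGB in [0, 1] (stand-in)
gt = torch.rand(1, 3, 512, 512, device=device)  # ground truth, only needed for LPIPS

# Full-reference: lower LPIPS = perceptually closer to the ground truth.
lpips = pyiqa.create_metric('lpips', device=device)
print('lpips', lpips(sr, gt).item())

# No-reference: higher = better for all four metrics used in the paper.
for name in ['musiq', 'maniqa', 'clipiqa', 'liqe']:
    metric = pyiqa.create_metric(name, device=device)
    print(name, metric(sr).item())
```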
5.3. Baselines
The paper compares DiT4SR against a comprehensive set of state-of-the-art Real-ISR methods, categorized by their underlying architecture:
- GAN-based Methods:
  - Real-ESRGAN [53]: a highly influential method for real-world blind super-resolution trained on purely synthetic data.
  - SwinIR [29]: a transformer-based image restoration model, often used for SR, demonstrating strong performance.
- Diffusion-based Methods with UNet Architecture (built upon Stable Diffusion models, which typically employ a UNet backbone):
  - ResShift [65]: an efficient diffusion model for image super-resolution using residual shifting.
  - StableSR [51]: exploits the diffusion prior for real-world image super-resolution.
  - SeeSR [58]: a semantics-aware real-world image super-resolution method.
  - DiffBIR [33]: blind image restoration using a generative diffusion prior.
  - OSEDiff [57]: a one-step effective diffusion network for real-world image super-resolution.
  - SUPIR [64]: investigates model scaling for photorealistic image restoration in the wild, built on SDXL.
- Diffusion-based Methods with DiT Architecture:
  - DreamClear [2]: a DiT-based image restoration model that, despite using DiT, still adopts ControlNet for LR information injection.
  - SD3-ControlNet: a custom baseline initialized with SD3.5-medium parameters and trained under the same settings as DiT4SR. It represents a direct application of ControlNet to SD3, serving as a crucial comparison between DiT4SR's integrated approach and a ControlNet-like paradigm for DiT.

These baselines cover a wide range of architectures and approaches within the Real-ISR field, allowing for a robust evaluation of DiT4SR's performance.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents quantitative and qualitative comparisons, along with a user study, to demonstrate the effectiveness of DiT4SR.
Quantitative Comparisons (Table 1): The following are the results from Table 1 of the original paper:
| Datasets | Metrics | Real-ESRGAN | SwinIR | ResShift | StableSR | SeeSR | DiffBIR | OSEDiff | SUPIR | DreamClear | SD3-ControlNet | DiT4SR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DrealSR | LPIPS ↓ | 0.282 | 0.274 | 0.353 | 0.273 | 0.317 | 0.452 | 0.297 | 0.419 | 0.354 | 0.323 | 0.365 |
| | MUSIQ ↑ | 54.267 | 52.737 | 52.392 | 58.512 | 65.077 | 65.665 | 64.692 | 59.744 | 44.047 | 55.956 | 64.950 |
| | MANIQA ↑ | 0.490 | 0.475 | 0.476 | 0.559 | 0.605 | 0.629 | 0.590 | 0.552 | 0.455 | 0.545 | 0.627 |
| | ClipIQA ↑ | 0.409 | 0.396 | 0.379 | 0.438 | 0.543 | 0.572 | 0.519 | 0.518 | 0.379 | 0.449 | 0.548 |
| | LIQE ↑ | 2.927 | 2.745 | 2.798 | 3.243 | 4.126 | 3.894 | 3.942 | 3.728 | 2.401 | 3.059 | 3.964 |
| RealSR | LPIPS ↓ | 0.271 | 0.254 | 0.316 | 0.306 | 0.299 | 0.347 | 0.292 | 0.357 | 0.325 | 0.305 | 0.319 |
| | MUSIQ ↑ | 60.370 | 58.694 | 56.892 | 65.653 | 69.675 | 68.340 | 69.087 | 61.929 | 59.396 | 62.604 | 68.073 |
| | MANIQA ↑ | 0.551 | 0.524 | 0.511 | 0.622 | 0.643 | 0.653 | 0.634 | 0.574 | 0.546 | 0.599 | 0.661 |
| | ClipIQA ↑ | 0.432 | 0.422 | 0.407 | 0.472 | 0.577 | 0.586 | 0.552 | 0.543 | 0.474 | 0.484 | 0.550 |
| | LIQE ↑ | 3.358 | 2.956 | 2.853 | 3.750 | 4.123 | 4.026 | 4.065 | 3.780 | 3.221 | 3.338 | 3.977 |
| RealLR200 | MUSIQ ↑ | 62.961 | 63.548 | 59.695 | 63.433 | 69.428 | 68.027 | 69.547 | 64.837 | 65.926 | 65.623 | 70.469 |
| | MANIQA ↑ | 0.553 | 0.560 | 0.525 | 0.579 | 0.612 | 0.629 | 0.606 | 0.600 | 0.597 | 0.587 | 0.645 |
| | ClipIQA ↑ | 0.451 | 0.463 | 0.452 | 0.458 | 0.566 | 0.582 | 0.551 | 0.524 | 0.546 | 0.526 | 0.588 |
| | LIQE ↑ | 3.484 | 3.465 | 3.054 | 3.379 | 4.006 | 4.003 | 4.069 | 3.626 | 3.775 | 3.733 | 4.331 |
| RealLQ250 | MUSIQ ↑ | 62.514 | 63.371 | 59.337 | 56.858 | 70.556 | 69.876 | 69.580 | 66.016 | 66.693 | 66.385 | 71.832 |
| | MANIQA ↑ | 0.524 | 0.534 | 0.500 | 0.504 | 0.594 | 0.624 | 0.578 | 0.584 | 0.585 | 0.568 | 0.632 |
| | ClipIQA ↑ | 0.435 | 0.440 | 0.417 | 0.382 | 0.562 | 0.578 | 0.528 | 0.483 | 0.502 | 0.509 | 0.578 |
| | LIQE ↑ | 3.341 | 3.280 | 2.753 | 2.719 | 4.005 | 4.003 | 3.904 | 3.605 | 3.688 | 3.639 | 4.356 |
- DrealSR and RealSR: on these datasets, DiT4SR achieves competitive results. While SeeSR and DiffBIR sometimes show slightly better LPIPS (lower is better), DiT4SR performs well on MUSIQ, MANIQA, and ClipIQA. For instance, on DrealSR, DiT4SR's MUSIQ (64.950) is strong, close to DiffBIR (65.665) and SeeSR (65.077); on RealSR, DiT4SR's MANIQA (0.661) is the highest and its MUSIQ (68.073) is very competitive.
- RealLR200 and RealLQ250: these are crucial real-world datasets that lack GT images, making non-reference metrics especially important. DiT4SR exhibits overwhelming performance here, achieving the top scores across all non-reference metrics (MUSIQ, MANIQA, ClipIQA, LIQE). For example, on RealLQ250, DiT4SR achieves MUSIQ 71.832, MANIQA 0.632, ClipIQA 0.578, and LIQE 4.356, consistently outperforming all other methods by a significant margin.

These quantitative results highlight DiT4SR's superior capability to produce high-quality restorations, especially in challenging real-world scenarios where GT images are unavailable. This success can be attributed to its effective leverage of DiT's generative capabilities through the novel LR integration and local-information injection mechanisms.
Qualitative Comparisons (Figure 6): The following figure (Figure 6 from the original paper) shows qualitative comparisons:
Figure 6 (from the original paper): visual comparison of real-world LR images restored by DiT4SR and five other methods, with zoomed-in detail regions highlighting DiT4SR's advantage in detail restoration.
Figure 6 visually demonstrates DiT4SR's advantages.
- Clarity and Detail: in the first two rows, DiT4SR generates results with better clarity and more abundant details, even under severe blurring degradations, implying that the model effectively utilizes SD3's generative capabilities to synthesize plausible high-frequency information.
- Fine Structures: the last two rows show DiT4SR excelling at fine structures such as architectural details and text. Notably, SD3-ControlNet, which also uses SD3 as a backbone but with a ControlNet-like mechanism, fails to handle these aspects as effectively. This underscores the superiority of DiT4SR's bidirectional information interaction and cross-stream convolutional injection over ControlNet's one-way control: the more comprehensive information interaction allows DiT4SR to better leverage LR information for high-fidelity restoration.
User Study (Table 2):
To further validate perceptual quality, a user study was conducted with 80 volunteers. Participants compared DiT4SR's results against four latest methods (SeeSR, DiffBIR, SUPIR, DreamClear) on randomly selected LR images. They answered two questions: (1) Which result has higher image realism? (2) Which has better fidelity?
The following are the results from Table 2 of the original paper:
| Ours vs. | SeeSR | DiffBIR | SUPIR | DreamClear |
| --- | --- | --- | --- | --- |
| Realism | 82.1% | 83.6% | 81.7% | 72.7% |
| Fidelity | 68.9% | 79.5% | 75.4% | 64.5% |
The results demonstrate that DiT4SR consistently outperforms all compared methods in both image realism and fidelity, with winning rates ranging from 72.7% to 83.6% for realism and 64.5% to 79.5% for fidelity. This strong preference from human evaluators further confirms DiT4SR's ability to generate perceptually superior results.
6.2. Ablation Studies / Parameter Analysis
The paper conducts an ablation study on RealLQ250 using MUSIQ and MANIQA metrics to evaluate the effectiveness of each proposed component. All variants are trained under the same settings as the full model.
The following are the results from Table 3 of the original paper:
| Model | LR Integration | LR Residual | LR Injection | MUSIQ ↑ | MANIQA ↑ |
| --- | --- | --- | --- | --- | --- |
| FULL | √ | √ | Conv | 71.832 | 0.632 |
| A | × | √ | Conv | 66.963 | 0.574 |
| B | √ | × | Conv | 70.887 | 0.614 |
| C | √ | √ | × | 71.202 | 0.610 |
| D | √ | √ | Linear | 71.607 | 0.621 |
The following figure (Figure 7 from the original paper) shows visual comparison for the ablation study:
Figure 7 (from the original paper): visual comparison for the ablation study. Variants A, B, and C remove LR Integration, LR Residual, and LR Injection, respectively; Variant D replaces the convolution layer with a linear layer. The full model performs best.
Effectiveness of LR Integration
- Variant A (× LR Integration, √ LR Residual, Conv LR Injection): this variant removes the LR Stream from the attention computation. As shown in Table 3, both MUSIQ (66.963) and MANIQA (0.574) decline significantly compared to the FULL model (MUSIQ 71.832, MANIQA 0.632). Figure 7(b) visually confirms that, without bidirectional information interaction between the LR Stream and the generated latent, degradations cannot be effectively removed, indicating that relying solely on the LR injection between MLP layers is insufficient. The bidirectional interaction allows the LR guidance to adapt and refine itself, which is crucial for addressing complex degradations.
Effectiveness of LR Residual
- Variant B (√ LR Integration, × LR Residual, Conv LR Injection): this variant removes the LR Residual connection. Table 3 shows a decline in both MUSIQ (70.887) and MANIQA (0.614) compared to the FULL model. Visually, Figure 7(c) reveals noticeable artifacts that degrade image fidelity. This confirms the role of the LR Residual in stabilizing the LR Stream's evolution and preserving the consistency of LR guidance in deeper DiT blocks, preventing undesired disruptions and yielding higher-fidelity results.
Effectiveness of LR Injection
- Variant C (√ LR Integration, √ LR Residual, × LR Injection): this variant removes the LR Injection between the MLP layers. Table 3 shows a slight decline in metrics (MUSIQ 71.202, MANIQA 0.610). While LR integration in attention alone can produce passable results, Figure 7(d) shows noticeable content distortions, especially in fine details like the eye region. This confirms that global attention alone is not sufficient for SR tasks that require local information for accurate detail restoration.
- Variant D (√ LR Integration, √ LR Residual, Linear LR Injection): this variant replaces the 3x3 depth-wise convolution layer with a linear layer for LR injection. Although Table 3 shows similar metric values (MUSIQ 71.607, MANIQA 0.621) to the FULL model, Figure 7(e) clearly indicates that artifacts and distortions are not alleviated. This visual evidence is crucial: the convolutional layer's ability to capture precise local information is essential for enhancing fidelity, especially for fine structures, a capability not fully reflected by the non-reference metrics. The 3x3 depth-wise convolution layer effectively compensates for DiT's limited local-information-capturing ability.

In summary, the ablation studies rigorously validate the contribution of each proposed component: LR Integration in Attention is fundamental for effective bidirectional guidance, LR Residual ensures guidance consistency across layers, and the convolutional LR Injection is critical for capturing local information and enhancing fine-detail restoration.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces DiT4SR, a novel approach that effectively "tames" large-scale Diffusion Transformer (DiT) models for Real-World Image Super-Resolution (Real-ISR). DiT4SR is identified as one of the pioneering works in this specific direction, moving beyond UNet-based diffusion models and ControlNet-like conditioning mechanisms.
The core contributions lie in two key design choices:
- Integrated LR Stream in Attention: instead of external injection, DiT4SR integrates low-resolution (LR) embeddings directly into the original attention mechanism of DiT. This enables a powerful bidirectional information flow between the LR latent and the generated latent, allowing the LR stream to dynamically evolve and provide progressively refined, context-aware guidance throughout the diffusion process. The LR Residual further ensures the consistency of this guidance in deeper layers.
- Cross-Stream Convolutional Injection: a convolutional layer injects LR-guided information into the generated latent (specifically, between the MLP layers of the LR and Noise Streams). This design strengthens the LR guidance and, more importantly, compensates for DiT's inherent limitation in capturing local information, which is vital for restoring fine details in SR tasks.

Extensive experiments, including quantitative evaluations on multiple real-world datasets and a comprehensive user study, consistently demonstrate DiT4SR's superior performance in image quality, realism, and fidelity compared to state-of-the-art Real-ISR methods, including UNet-based approaches and DiT models conditioned via ControlNet.
7.2. Limitations & Future Work
The paper explicitly states that its work highlights the potential of leveraging DiT for high-quality image restoration and paves the way for future research in this direction. However, it does not explicitly list specific limitations of DiT4SR or detailed future work directions beyond this general statement.
Based on the nature of the model and general challenges in DiT-based models, potential implicit limitations and future work could include:
- Computational Cost: large-scale DiT models are computationally intensive. While DiT4SR builds on a pre-trained SD3, the training and inference costs of fine-tuning and running such a model may still be substantial compared to simpler UNet-based or GAN-based approaches. Future work could explore efficiency improvements such as distillation or optimized inference strategies.
- Generalizability to Other Restoration Tasks: while shown effective for Real-ISR, the method is not explicitly tested on other image restoration tasks (e.g., denoising, deblurring, inpainting) where DiT could also apply. Future work could investigate its adaptability toward unified image restoration.
- Adaptive Guidance Strength: the current LR integration relies on trainable linear projections and a convolutional layer initialized to zeros, whose impact grows gradually. Future work might explore more dynamic or adaptive weighting of the LR stream's influence across diffusion steps or image regions.
- Understanding Bidirectional Interaction: while the paper demonstrates the effectiveness of bidirectional interaction, a deeper theoretical understanding of, or finer-grained control over, how the LR Stream's evolution is influenced by the Noise Stream could lead to further improvements.
7.3. Personal Insights & Critique
DiT4SR offers a highly impactful contribution by demonstrating that Diffusion Transformers, previously celebrated for general image generation, can be effectively specialized for Real-World Image Super-Resolution. The paper's core insight—that DiT's architectural strengths for multimodal interaction need a more profound integration than simple ControlNet-like conditioning—is particularly insightful.
The method's ability to enable bidirectional information flow for LR guidance is a significant advancement. Unlike ControlNet, where the control signal is often "static" or only conditionally integrated, DiT4SR allows the LR stream itself to evolve and become more "aware" of the diffusion process and the noisy latent. This dynamic guidance is likely a major factor in its superior performance, especially on complex real-world degradations where the LR information needs to be interpreted contextually.
Furthermore, the explicit introduction of a cross-stream convolution layer to address DiT's local information weakness is a clever and effective design choice. It acknowledges that while transformers excel at global relationships, convolutions remain powerful for local pattern recognition, and integrating both synergistically can yield superior results in pixel-level tasks like SR. The ablation study visually confirming the necessity of this convolutional layer, even when metrics might seem similar to a linear layer, underscores the importance of qualitative assessment in perceptual tasks.
A potential area for critique or further exploration could be the interpretability of the bidirectional attention. While effective, understanding precisely how the LR stream adapts and what information it prioritizes at different diffusion steps could open avenues for more targeted control or optimization. Also, the computational efficiency of DiT4SR relative to other SR methods, especially during inference, might be a practical consideration for real-world deployment, though this is often a trade-off for higher quality in DiT-based models.
The implications of DiT4SR extend beyond Real-ISR. Its principles of deeply integrating conditional information into DiT's attention and supplementing global attention with local convolutional processing could be transferable to other conditional generation or image restoration tasks. For instance, in tasks like image editing or inpainting where precise local control is needed alongside global coherence, similar bidirectional stream integration and convolutional injection could be explored. The work sets a strong precedent for how to effectively adapt powerful foundation models to specific, challenging downstream tasks without merely treating them as black boxes.