
DiffTMR: Diffusion-based Hierarchical Alignment for Text-Molecule Retrieval


TL;DR Summary

DiffTMR is a text-molecule retrieval framework that reframes retrieval as a reverse denoising process, addressing traditional discriminative methods' limitations in out-of-distribution detection and diversity maintenance. By integrating hierarchical diffusion alignment with dynamic perturbation embedding mechanisms, it improves retrieval accuracy and out-of-domain generalization, surpassing leading baselines by 4.2%–5.4% in Hits@1 on ChEBI-20 and PCdes.

Abstract

Molecular retrieval is critical in drug discovery and molecular design. Traditional discriminative methods often model the conditional probability distribution of retrieving candidates, treating the query text as a deterministic input. However, these approaches have notable limitations: (1) They often overlook the statistical properties of the original data distributions of queries and candidates, preventing the recognition of out-of-distribution data. (2) They struggle to balance retrieval accuracy and diversity when processing open-ended semantic queries. To address these challenges, we introduce DiffTMR, a novel framework that reformulates text-molecule retrieval as a reverse denoising process, progressively generating the joint distribution of candidates and queries from noises. DiffTMR uniquely integrates hierarchical diffusion alignment with dynamic perturbation embedding mechanisms. By employing text-anchored perturbations, it enhances the diversity of molecular representations, and through global-local progressive denoising, it achieves cross-modal hierarchical alignment. This leads to significant improvements in retrieval accuracy and out-of-domain generalization. Evaluations on benchmark datasets ChEBI-20 and PCdes demonstrate that DiffTMR surpasses current leading baselines by 4.2%–5.4% in Hits@1 metrics and exhibits superior performance in out-of-domain retrieval tasks.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

DiffTMR: Diffusion-based Hierarchical Alignment for Text-Molecule Retrieval

1.2. Authors

The paper is authored by Chenxu Wang, Dong Zhou*, Ting Liu, Jianghao Lin, and Yongmei Zhou, all affiliated with Guangdong University of Foreign Studies, Guangzhou, China. Aimin Yang is also an author, affiliated with Lingnan Normal University, Zhanjiang, China. Dong Zhou is marked with an asterisk, typically indicating a corresponding author.

1.3. Journal/Conference

The paper is published at the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland. The ACM International Conference on Multimedia (ACM MM) is a highly reputable and influential conference in the field of multimedia computing, known for publishing cutting-edge research in various areas including cross-modal retrieval, computer vision, and natural language processing. Its high selectivity and broad scope make it a significant venue for disseminating research findings.

1.4. Publication Year

2025

1.5. Abstract

Molecular retrieval is a crucial task in drug discovery and molecular design. Traditional discriminative methods for text-molecule retrieval model the conditional probability distribution of retrieving candidates, treating the query text as a deterministic input. However, these methods often overlook the statistical properties of the original data distributions of queries and candidates, making them unable to recognize out-of-distribution data and causing them to struggle to balance retrieval accuracy and diversity for open-ended queries. To overcome these limitations, the paper introduces DiffTMR, a novel framework that reframes text-molecule retrieval as a reverse denoising process. This process progressively generates the joint distribution of candidates and queries from noise. DiffTMR integrates a hierarchical diffusion alignment network with dynamic perturbation embedding mechanisms. It uses text-anchored perturbations to enhance the diversity of molecular representations and employs global-local progressive denoising to achieve cross-modal hierarchical alignment. This approach significantly improves retrieval accuracy and out-of-domain generalization. Evaluations on benchmark datasets ChEBI-20 and PCdes show that DiffTMR outperforms leading baselines by 4.2%–5.4% in Hits@1 metrics and demonstrates superior performance in out-of-domain retrieval tasks.


2. Executive Summary

2.1. Background & Motivation

The core problem addressed by the paper lies in the limitations of existing text-molecule retrieval methods, which primarily fall under the discriminative paradigm. These methods typically model the conditional probability distribution (e.g., $P(\text{molecule} \mid \text{text})$), focusing on optimizing feature extractors to align text and molecule representations in a shared embedding space. While successful in improving retrieval accuracy, this paradigm has two major drawbacks:

  1. Neglect of Data Distributions: They overlook the inherent statistical properties and joint distributions of the original query and candidate data. This prevents effective recognition of out-of-distribution (OOD) data and leads to poor generalization across different datasets or to novel data.

  2. Accuracy-Diversity Trade-off: The fixed-point embedding mechanism (where each input maps to a single, static embedding) struggles to balance retrieval accuracy with diversity when dealing with open-ended semantic queries. Overly compressed semantic spaces ensure precision but yield rigid responses, while attempts to enhance diversity by introducing unconstrained noise can lead to semantic drift (where the meaning of the representation deviates from the original intent).

    The importance of text-molecule retrieval is highlighted by its critical role in various applications such as drug discovery, molecular design, and virtual screening. Addressing the aforementioned limitations is vital for robust and generalizable solutions in these fields.

The paper's entry point is to leverage the power of generative models, specifically diffusion models, which have shown breakthroughs in other domains like image synthesis and natural language processing. Unlike discriminative models, generative models can capture comprehensive cross-modal associations by generating the joint distribution of candidates and queries, thereby enhancing robustness against data distribution shifts. The coarse-to-fine generative nature of diffusion models is identified as inherently suitable for progressively uncovering associations in text-molecule retrieval.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Novel Generative Framework: Introduction of DiffTMR, the first generative diffusion-based framework specifically designed for text-molecule retrieval, which reformulates the task as a reverse denoising process to model the joint probability distribution of text and molecules.
  • Dynamic Perturbation Mechanism: Proposal of a Text-Anchored Perturbation Representation (TAPR) module that dynamically regulates the perturbation scope for molecular embeddings based on semantic coupling perception. This mechanism enhances the diversity of molecular representations while maintaining consistency under text-anchor constraints.
  • Hierarchical Alignment Network: Development of a Hierarchical Diffusion Alignment Network (HDAN) that achieves cross-modal hierarchical modeling through global-local progressive denoising, capturing both high-level semantic associations and fine-grained correspondences between textual tokens and molecular motifs.
  • Collaborative Optimization: A unique joint optimization strategy combining discriminative contrastive learning (for explicit semantic alignment) and generative diffusion modeling (for implicit distribution dependencies). This collaboration significantly improves both retrieval accuracy and generalization capability, especially for unseen data.
  • Superior Performance: Demonstrates that DiffTMR significantly outperforms current state-of-the-art baselines on benchmark datasets ChEBI-20 and PCdes, achieving improvements of 4.2%–5.4% in Hits@1 metrics. The model also exhibits superior performance in challenging out-of-domain retrieval tasks, highlighting its strong generalization and transfer capabilities.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Text-Molecule Retrieval

Text-molecule retrieval is a cross-modal task where the goal is to find the most relevant molecule (or its representation) given a textual query, or conversely, to find the most relevant textual description for a given molecule. This involves understanding the semantic relationship between natural language and chemical structures and embedding them into a shared latent space where similarity can be measured.

3.1.2. Discriminative vs. Generative Models

  • Discriminative Models: These models learn a mapping from input features to output labels or categories. In retrieval, they often focus on modeling the conditional probability distribution $P(Y \mid X)$, i.e., the probability of a candidate $Y$ given a query $X$. They excel at distinguishing between classes but typically do not learn the underlying data distribution of $X$ or $Y$. Contrastive learning-based methods are a common example, aiming to push dissimilar pairs apart and pull similar pairs closer.
  • Generative Models: These models learn the joint probability distribution $P(X, Y)$ (or $P(X)$ and $P(Y)$ separately). They can generate new data samples that resemble the training data. By learning the full data distribution, they can better handle out-of-distribution (OOD) data and capture more complex dependencies. Diffusion models are a prominent type of generative model.

3.1.3. Contrastive Learning

Contrastive learning is a self-supervised learning paradigm where a model learns representations by comparing similar and dissimilar pairs of data. The core idea is to learn an embedding space where positive pairs (e.g., a text and its corresponding molecule) are pulled closer together, while negative pairs (e.g., a text and a random, unrelated molecule) are pushed further apart. This is often achieved through a contrastive loss function that maximizes the agreement between positive pairs while minimizing it for negative pairs.

3.1.4. Diffusion Models

Diffusion models are a class of generative models that learn to generate data by reversing a gradual noise-adding process.

  • Forward Diffusion Process: This is a fixed Markov chain that progressively adds Gaussian noise to data (e.g., an image or an embedding) over several time steps, eventually transforming it into pure Gaussian noise. The data distribution at any step $t$ is denoted $X_t$.
  • Reverse Denoising Process: The model learns to reverse this process, starting from pure noise and gradually denoising it back to a clean data sample. This is done by training a neural network to predict the noise that was added at each step, allowing it to estimate the distribution $P(X_{t-1} \mid X_t)$. By iteratively applying this learned denoising step, the model can generate new samples from the data distribution. Diffusion models are known for their high-quality generation and ability to model complex data distributions in a coarse-to-fine manner (a minimal sketch of both processes follows this list).
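The following is a minimal, generic sketch of these two processes (standard DDPM-style updates, not the authors' implementation; the noise schedule and the `model` noise predictor are illustrative assumptions):

```python
import torch

def forward_diffuse(x0, t, alpha_bar):
    """Forward process: corrupt clean data x0 with Gaussian noise at step t."""
    eps = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bar[t]) * x0 + torch.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

@torch.no_grad()
def reverse_denoise(model, shape, alphas, alpha_bar):
    """Reverse process: start from pure noise and iteratively denoise to a sample."""
    x = torch.randn(shape)
    for t in reversed(range(len(alphas))):
        eps_hat = model(x, t)                                   # predicted noise
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps_hat) / torch.sqrt(alphas[t])        # mean of p(x_{t-1} | x_t)
        if t > 0:
            x = x + torch.sqrt(1.0 - alphas[t]) * torch.randn_like(x)
    return x
```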

3.1.5. Graph Convolutional Network (GCN)

A Graph Convolutional Network (GCN) is a type of neural network that operates directly on graph-structured data. Unlike traditional convolutional networks designed for grid-like data (e.g., images), GCNs can process arbitrary graph topologies. In molecular representation, a molecule is naturally represented as a graph where atoms are nodes and chemical bonds are edges. GCNs learn node representations by aggregating information from their neighbors, effectively capturing local structural patterns (like functional groups) and propagating this information across the entire molecule, forming a global molecular representation.

3.1.6. Reparameterization Trick

The reparameterization trick is a technique used in variational autoencoders (VAEs) and other probabilistic models to enable backpropagation through a stochastic node (a node from which a random sample is drawn). Instead of sampling directly from a distribution (which is not differentiable), the trick expresses the sample as a deterministic function of a simple random variable (e.g., a standard Gaussian noise $\epsilon \sim N(0, 1)$) and the parameters of the distribution. For example, to sample $z$ from $N(\mu, \sigma^2)$, we can write $z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim N(0, 1)$. This allows the gradients to flow through $\mu$ and $\sigma$, making the model trainable with standard gradient descent.
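A tiny, self-contained illustration of the trick (generic, not specific to this paper):

```python
import torch

mu = torch.zeros(4, requires_grad=True)         # learnable mean of the distribution
log_sigma = torch.zeros(4, requires_grad=True)  # learnable log standard deviation

eps = torch.randn(4)                 # epsilon ~ N(0, 1); the sampling happens here
z = mu + log_sigma.exp() * eps       # z ~ N(mu, sigma^2), but differentiable in mu/sigma

z.sum().backward()                   # gradients flow back into mu and log_sigma
print(mu.grad, log_sigma.grad)
```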

3.1.7. Attention Mechanism

The attention mechanism is a technique in neural networks that allows a model to weigh the importance of different parts of an input sequence or set of features when making a prediction or generating an output. It typically involves three components: Query (Q), Key (K), and Value (V). The Query is used to calculate attention scores (or weights) with all Keys, which indicate how relevant each Key is to the Query. These weights are then applied to the Values to produce a weighted sum, representing the "attended" output. The most common form is Scaled Dot-Product Attention: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $

  • $Q$: Query matrix.
  • $K$: Key matrix.
  • $V$: Value matrix.
  • $d_k$: Dimension of the key vectors (used for scaling to prevent large dot products from pushing the softmax into regions with extremely small gradients).
  • $\mathrm{softmax}$: A function that converts a vector of numbers into a probability distribution.
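A minimal PyTorch sketch of this formula (single head, no masking, shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # attention logits, shape (n_q, n_k)
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ V                               # weighted sum of the values

Q, K, V = torch.randn(2, 64), torch.randn(5, 64), torch.randn(5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([2, 64])
```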

3.1.8. KL Divergence

Kullback-Leibler (KL) divergence is a non-symmetric measure of the difference between two probability distributions, $P$ and $Q$. It quantifies how much information is lost when $Q$ is used to approximate $P$. A KL divergence of zero indicates that the two distributions are identical. $ D_{\text{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right) $

  • $P(x)$: The probability of event $x$ in distribution $P$.
  • $Q(x)$: The probability of event $x$ in distribution $Q$.
  • $\mathcal{X}$: The set of all possible events.
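For discrete distributions, the definition translates directly into a few lines (a small smoothing constant is added here purely to avoid division by zero):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same events."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # > 0: information is lost when Q approximates P
print(kl_divergence(p, p))   # ~ 0: identical distributions
```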

3.1.9. SciBERT

SciBERT is a pre-trained language model based on the BERT (Bidirectional Encoder Representations from Transformers) architecture. Unlike the original BERT, which is trained on general domain text, SciBERT is specifically pre-trained on a large corpus of scientific publications. This specialized training allows it to better understand and encode the nuances, terminology, and semantic relationships prevalent in scientific and technical texts, making it particularly effective for tasks involving scientific literature, such as chemical text descriptions.

3.2. Previous Works

Previous research in text-molecule retrieval can be broadly categorized by the molecular representation used and the alignment strategy:

3.2.1. Molecular Representations

  • One-dimensional Sequences: Methods utilizing the Simplified Molecular Input Line Entry System (SMILES) string representation of molecules. Models like KV-PLM [40], Text+Chem T5 [7], and MolT5 [10] adapt sequence modeling techniques (often based on Transformer architectures) to align SMILES strings with textual semantics.
  • Two-dimensional Molecular Graphs: Molecules are modeled as graphs where atoms are nodes and bonds are edges. MoMu [33] and MoleculeSTM [23] use contrastive learning to build a shared embedding space. MolCA [24] introduces a dedicated cross-modal projector. AMAN [43] employs adversarial learning for modality alignment. Some works, like MolFM [28] and GIT-Mol [21], integrate multi-modal auxiliary information (knowledge graphs, molecular images) to enrich representations.
  • Three-dimensional Spatial Conformations: 3D-MoLM [19] extends graph neural networks to incorporate 3D coordinate information. ORMA [29] and Atomas [41] use mixed-granularity alignments, modeling features at atomic, motif (functional group), and molecular levels to improve semantic consistency.

3.2.2. Representation Learning with Diffusion Models

While diffusion models have seen significant advancements in various domains, their application to text-molecule retrieval is relatively new.

  • Image Synthesis: UNIT-DDPM [31] applies Markov chain inference for unpaired image-to-image translation. ILVR [6] optimizes the DDPM generation process for high-quality image synthesis conditioned on reference images.
  • Inverse Problems in Imaging: Kadkhodaie et al. [16] use stochastic gradient ascent with CNN-based implicit priors for tasks like denoising and super-resolution.
  • 3D Point Cloud Generation: Luo et al. [27] model point cloud synthesis as a thermodynamic particle diffusion process.
  • Video Understanding: DiffusionVMR [42] uses denoising generation strategies for video segment retrieval and boundary detection. Luo et al. [26] propose a text-guided diffusion model for video editing and cross-domain video moment retrieval.

3.3. Technological Evolution

The field of cross-modal retrieval, including text-molecule retrieval, has evolved significantly. Initially, approaches focused on developing sophisticated feature extractors for each modality and then projecting them into a common embedding space, often using contrastive learning or adversarial training to enforce alignment. This era was dominated by discriminative models that optimized conditional probabilities. The advent of large-scale pre-trained models (like BERT for text, and specialized models for molecules) further boosted performance by leveraging vast amounts of data for robust feature learning.

However, a key limitation identified in this paper is the failure of these discriminative approaches to fully capture the underlying data distributions and joint dependencies of queries and candidates, leading to issues with out-of-distribution generalization and balancing accuracy with diversity.

This paper marks a shift towards generative modeling in text-molecule retrieval. By adapting diffusion models, which are inherently designed to learn and generate data from complex distributions, DiffTMR aims to overcome these limitations. It bridges the gap by moving from merely distinguishing between relevant and irrelevant pairs to explicitly modeling the joint probability distribution of text and molecules, allowing for a more comprehensive understanding of their semantic associations. The integration of hierarchical alignment and dynamic perturbations further refines this generative approach.

3.4. Differentiation Analysis

Compared to previous works, DiffTMR introduces several core innovations:

  • Generative Paradigm Shift: The most significant difference is the fundamental shift from a discriminative paradigm (modeling $P(M \mid T)$) to a generative paradigm (modeling $P(M, T)$) for text-molecule retrieval. Existing methods primarily focus on optimizing explicit conditional probability or embedding similarity, while DiffTMR leverages the reverse denoising process of diffusion models to implicitly learn the joint probability distribution. This enables better capture of statistical properties and intrinsic data structures for both modalities.
  • Hierarchical Diffusion Alignment: Unlike fixed-point embedding mechanisms or single-level alignment strategies, DiffTMR employs a Hierarchical Diffusion Alignment Network (HDAN). This network performs both global semantic alignment (sentence-level text to global molecule) and fine-grained local alignment (word-level text to molecular motif-level). This multi-granularity approach, coupled with progressive denoising, allows for more robust and adaptive matching.
  • Dynamic Perturbation Embedding: The Text-Anchored Perturbation Representation (TAPR) module is a novel mechanism to enhance molecular representation diversity while maintaining semantic consistency. Instead of rigid embeddings, it introduces dynamic, text-anchored perturbations. This contrasts with methods that either use static embeddings or introduce unconstrained noise, addressing the accuracy-diversity trade-off more effectively.
  • Discriminative-Generative Collaborative Optimization: DiffTMR uniquely combines both contrastive learning (discriminative) and diffusion modeling (generative) in its optimization. This allows it to leverage the strengths of both: explicit semantic alignment from contrastive learning and implicit distribution dependencies from generative modeling, leading to enhanced cross-modal retrieval accuracy and superior out-of-domain generalization. This contrasts with most prior works that rely solely on one paradigm.
  • Robustness to Data Distribution Shifts: By modeling joint distributions and incorporating diffusion-theory-based timestep adaptive noise, DiffTMR is inherently designed to be more robust against data distribution shifts and out-of-domain (OOD) challenges, a limitation explicitly called out for traditional discriminative methods.

4. Methodology

4.1. Principles

The core idea behind DiffTMR is to reformulate the text-molecule retrieval task as a reverse denoising process, characteristic of diffusion models. Instead of learning a direct mapping from text to molecule (a discriminative approach), DiffTMR aims to learn the joint probability distribution of text and molecule pairs ($P(t, m)$). This approach allows the model to progressively generate an aligned joint distribution from noise, implicitly capturing bidirectional dependencies and statistical properties of the original data. The method is built on two main components:

  1. Text-Anchored Perturbation Representation (TAPR): To address suboptimal molecular representations and leverage rich textual information, this module dynamically perturbs molecular embeddings within a range determined by semantic coupling with the query text. This enhances diversity while maintaining semantic relevance.

  2. Hierarchical Diffusion Alignment Network (HDAN): This network uses a diffusion model to progressively denoise a noisy joint representation, gradually revealing cross-modal associations. It achieves this by modeling alignment at both global (semantic) and local (substructure) granularities.

    These two components are collaboratively optimized with a discriminative loss (for explicit semantic alignment) and a generative loss (for implicit distribution modeling), aiming to achieve both high retrieval accuracy and strong generalization.

4.2. Core Methodology In-depth (Layer by Layer)

The overall DiffTMR framework is designed to model the joint distribution of text and molecules through a progressive denoising process. The data flow starts with independent encoders for each modality, followed by a dynamic perturbation mechanism for molecules, and then a hierarchical diffusion alignment network that learns the joint distribution.

Figure 2 (schematic): The DiffTMR framework for text-molecule retrieval, modeled through the text-anchored perturbation representation and the hierarchical diffusion alignment network. The figure shows the global and local queries, the relationship between candidate molecules and noise, the progressive denoising process, and the final retrieval results.

The image above (Figure 2 from the original paper) provides an overview of the DiffTMR framework. It illustrates how an input query text and candidate molecules are processed. The text-anchored perturbation representation module generates $H_p^g$ (molecular perturbation embeddings) guided by textual semantics. The hierarchical diffusion alignment network then takes these embeddings, along with text embeddings, and performs a reverse denoising process to learn the optimal text-molecule aligned data distribution $\hat{X}_0$.

4.2.1. Feature Extraction

The model utilizes two independent encoders:

  • Molecular Encoder ($\phi_m$): For molecules, a three-layer Graph Convolutional Network (GCN) [37] is used. GCNs are well-suited for molecular data as they model molecules as graphs (atoms as nodes, bonds as edges), allowing for the aggregation of information from different granularities, such as functional groups, atomic clusters, and other chemical substructures.

  • Text Encoder ($\phi_t$): For textual descriptions, the SciBERT [1] pre-trained model is employed. SciBERT is chosen for its superior performance in encoding chemical text descriptions due to its pre-training on a large corpus of scientific publications.

    Given a pair of text description $t$ and molecule $m$, their feature embeddings are represented as: $ \begin{aligned} \pmb{h}_i^w &= \phi_t(t), \ \forall i \in [1, \ldots, N_w]; \quad \pmb{h}^t = [\mathrm{cls}]_w; \\ \pmb{h}_j^m &= \phi_m(m), \ \forall j \in [1, \ldots, N_m]; \quad \pmb{h}^g = \frac{1}{N_m} \sum_{j=1}^{N_m} \pmb{h}_j^m \end{aligned} $

  • $\pmb{h}_i^w$: The embedding of the $i$-th word in the text description. $\phi_t(t)$ produces a sequence of word embeddings.

  • $N_w$: The length of the text description (number of words).

  • $\pmb{h}^t$: The global textual representation, obtained from the special [cls] token's embedding (a common practice in BERT-like models to represent the entire input sequence).

  • $\pmb{h}_j^m$: The embedding of the $j$-th molecular motif (e.g., functional groups, atom substructures). $\phi_m(m)$ produces a set of motif embeddings.

  • $N_m$: The number of molecular motifs in the molecule.

  • $\pmb{h}^g$: The global molecular representation, derived by averaging the motif embeddings.
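Assuming the encoders have already produced word-level and motif-level embeddings, the global representations in Equation (1) amount to taking the [cls] vector and a mean over motifs; a minimal sketch (tensor shapes are illustrative):

```python
import torch

N_w, N_m, D = 32, 12, 300               # words, motifs, embedding dim (illustrative)
word_emb  = torch.randn(N_w, D)         # h_i^w from the text encoder phi_t
cls_emb   = torch.randn(D)              # [cls] embedding from phi_t
motif_emb = torch.randn(N_m, D)         # h_j^m from the molecular GCN phi_m

h_t = cls_emb                           # global text representation h^t
h_g = motif_emb.mean(dim=0)             # global molecular representation h^g
```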

4.2.2. Text-Anchored Perturbation Representation (TAPR)

The TAPR module aims to address two challenges: suboptimal molecular representations (since a pre-trained molecular encoder is not used) and the need to fully leverage textual information for diversity. It dynamically defines an optimized perturbation region for molecular representations, anchored on text embeddings. This implicitly pulls relevant text and molecule embeddings closer.

Figure 3: TAPR architecture. A text-molecule encoder extracts hierarchical features: text has word-level $\pmb{h}^w$ and sentence-level global $\pmb{h}^t$; molecules have motif-level $\pmb{h}^m$ and global $h^g$. Based on text-molecule semantic coupling, each molecule's perturbation range $\mathcal{P}$ is modeled. Within this range, Gaussian-sampled ($\epsilon \sim \mathcal{N}(0, 1)$) perturbations generate the text-guided molecular perturbation representation $h_p^g$.

The image above (Figure 3 from the original paper) illustrates the TAPR architecture. Text features (word-level $\pmb{h}^w$ and sentence-level global $\pmb{h}^t$) and molecule features (motif-level $\pmb{h}^m$ and global $h^g$) are extracted. The semantic coupling perception module then calculates a perturbation range $\mathcal{P}$ for each molecule based on its semantic alignment with the text. Within this range, Gaussian-sampled perturbations generate a text-guided molecular perturbation representation $h_p^g$.

Given a global molecular embedding $\pmb{h}^g$, a reparameterization trick is employed to introduce a dynamic perturbation range $\mathcal{P}$, and stochastic sampling is performed within this range to obtain the molecular perturbation embedding: $ h_{\mathcal{P}}^{g} = h^{g} + \mathcal{P} \cdot \epsilon, \quad \epsilon \sim N(0, 1) $

  • $h_{\mathcal{P}}^{g}$: The perturbed global molecular embedding.

  • $h^{g}$: The original global molecular embedding from Equation (1).

  • $\mathcal{P}$: The dynamic perturbation range.

  • $\epsilon \sim N(0, 1)$: A standard Gaussian noise term, used with the reparameterization trick to allow gradients to flow through the sampling process.

    The critical challenge is determining the appropriate perturbation range $\mathcal{P}$. Too small, and diversity is constrained; too large, and semantic drift occurs. To address this, a semantic coupling perception module dynamically calibrates the perturbation intensity through fine-grained cross-modal interactions. First, it analyzes the semantic alignment between the molecule and individual text tokens: $ \pmb{c}_i = \frac{\pmb{h}^g \cdot \pmb{h}_i^w}{\lVert \pmb{h}^g \rVert \, \lVert \pmb{h}_i^w \rVert}, \quad i = 1, \ldots, N_w $

  • $\pmb{c}_i$: The semantic coupling strength between the global molecular embedding $\pmb{h}^g$ and the $i$-th word embedding $\pmb{h}_i^w$, calculated using cosine similarity.

  • $\lVert \cdot \rVert$: Denotes the $L_2$ norm of a vector.

    Based on the computed semantic coupling strengths, the molecular perturbation range $\mathcal{P}$ is determined using a learnable linear layer $\Theta$: $ \mathcal{P} = \exp(C\Theta), \quad C = [\pmb{c}_1, \pmb{c}_2, \ldots, \pmb{c}_{N_w}] $

  • $\mathcal{P}$: The dynamic perturbation range. The $\exp$ function ensures that $\mathcal{P}$ is positive.

  • $C$: A tensor representing the concatenation of all semantic coupling strengths $[\pmb{c}_1, \pmb{c}_2, \ldots, \pmb{c}_{N_w}]$.

  • $\Theta$: A learnable linear layer (or matrix) that transforms the semantic coupling tensor into the perturbation range. This allows the model to learn complex, high-order interactions to calibrate the perturbation intensity.

    During training, one perturbed molecular representation $h_{\mathcal{P}}^{g}$ is sampled. During evaluation, multiple stochastic samples are drawn to identify the optimal perturbed molecular embedding that best aligns with the text anchor, ensuring representation stability and accuracy.
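Putting the three equations above together, the perturbation step can be sketched as follows (a minimal sketch; the shape of $\Theta$ and the handling of variable text lengths are simplifying assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tapr_perturb(h_g, word_emb, theta, n_samples=1):
    """Text-anchored perturbation of a global molecular embedding.
    h_g:      (D,)     global molecular embedding h^g
    word_emb: (N_w, D) word-level text embeddings h_i^w
    theta:    nn.Linear(N_w, D) mapping coupling strengths to a perturbation range
    """
    c = F.cosine_similarity(h_g.unsqueeze(0), word_emb, dim=-1)  # coupling c_i, (N_w,)
    p = torch.exp(theta(c))                                      # range P = exp(C Theta), (D,)
    eps = torch.randn(n_samples, h_g.numel())                    # eps ~ N(0, 1)
    return h_g.unsqueeze(0) + p.unsqueeze(0) * eps               # perturbed embeddings, (n_samples, D)

theta = nn.Linear(32, 300)
samples = tapr_perturb(torch.randn(300), torch.randn(32, 300), theta, n_samples=15)
```

At inference, several samples (15 in the ablation reported later) would be drawn and the one most similar to the text anchor kept.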

The discriminative loss function is then applied. The similarity score $s_{t,m}$ is computed between the global text embedding $\pmb{h}^t$ and the molecular perturbation embedding $h_{\mathcal{P}}^{g}$. The conditional probability is formulated as: $ p(m \mid t; \phi_t, \phi_m) = \frac{\exp(s_{t,m} / \tau)}{\sum_{m' \in M} \exp(s_{t,m'} / \tau)} $

  • $p(m \mid t; \phi_t, \phi_m)$: The probability of retrieving molecule $m$ given text $t$, parameterized by encoders $\phi_t$ and $\phi_m$.

  • $s_{t,m}$: The similarity score between text $t$ and molecule $m$.

  • $\tau$: A temperature parameter, commonly used in contrastive learning to scale the logits before softmax.

  • $\sum_{m' \in M} \exp(s_{t,m'} / \tau)$: The sum over all candidate molecules $m'$ in the set $M$, forming the normalization term for the softmax function.

    The corresponding discriminative loss function $\mathcal{L}_{\mathrm{Dis}}$ is defined as: $ \mathcal{L}_{\mathrm{Dis}} = \frac{1}{2} \mathbb{E}_{(t,m)} \big[ \log p(m \mid t; \phi_t, \phi_m) + \log p(t \mid m; \phi_t, \phi_m) \big] $

  • $\mathcal{L}_{\mathrm{Dis}}$: The discriminative loss. This is typically a contrastive (InfoNCE-style) objective, designed to maximize the conditional likelihood of positive pairs.

  • $\mathbb{E}_{(t,m)}$: Expectation over all positive text-molecule pairs (t, m) in the training set $Z$.

  • $\log p(m \mid t; \phi_t, \phi_m)$: The log-likelihood of correctly retrieving molecule $m$ given text $t$.

  • $\log p(t \mid m; \phi_t, \phi_m)$: The log-likelihood of correctly retrieving text $t$ given molecule $m$. The loss averages these two directions to ensure bidirectional retrieval capability.

    This discriminative approach, however, is limited: it focuses solely on conditional probabilities and does not model the intrinsic distributions $p(t)$ and $p(m)$, which can lead to poor generalization.
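A common InfoNCE-style realization of this bidirectional objective over a batch (a sketch that minimizes the negative of the log-likelihood above; not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(text_emb, mol_emb, tau=0.07):
    """Batch of paired (text, molecule) embeddings; positives lie on the diagonal."""
    t = F.normalize(text_emb, dim=-1)
    m = F.normalize(mol_emb, dim=-1)
    logits = t @ m.T / tau                              # (N, N) similarity / temperature
    targets = torch.arange(t.size(0))
    loss_t2m = F.cross_entropy(logits, targets)         # -log p(m | t)
    loss_m2t = F.cross_entropy(logits.T, targets)       # -log p(t | m)
    return 0.5 * (loss_t2m + loss_m2t)

loss = bidirectional_contrastive_loss(torch.randn(32, 300), torch.randn(32, 300))
```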

4.2.3. Hierarchical Diffusion Alignment Network (HDAN)

To overcome the limitations of discriminative methods and model the joint probability distribution $p(t, m)$, DiffTMR introduces the Hierarchical Diffusion Alignment Network (HDAN). This network leverages the generative perspective of diffusion models to hierarchically model cross-modal alignment.

Figure 4: HDAN architecture. Multi-granularity representations are batch-concatenated. Noise is added to $H^t$ (sentence-level text) and $H_p^g$ (molecular perturbation) to generate the global alignment distribution $\mathcal{E}^{\mathrm{global}}$ via attention. The local distribution $\mathcal{E}^{\mathrm{local}}$ is built from $H^w$ (word-level) and $H^m$ (motif-level). Their weighted fusion is fed into the denoising network, which, after reverse denoising iterations, predicts the clean alignment distribution $\hat{X}_0$.

The image above (Figure 4 from the original paper) details the HDAN architecture. It shows how multi-granularity representations (sentence-level text $H^t$, molecular perturbation $H_p^g$, word-level $H^w$, motif-level $H^m$) are batched and processed. Noise is added to the global representations ($H^t$ and $H_p^g$) to generate a global alignment distribution $\mathcal{E}^{\mathrm{global}}$ via attention. A local distribution $\mathcal{E}^{\mathrm{local}}$ is derived from word-level and motif-level features. These two distributions are then weighted and fused into $\mathcal{E}^{\mathrm{joint}}$, which is fed into a denoising network. After reverse denoising iterations, the network predicts a clean alignment distribution $\hat{X}_0$.

The objective function for the generative component is defined as: $ \prod_{(t,m) \in \mathcal{Z}} p(m, t) \approx \psi\big(m, t, N(0, \mathrm{I})\big) $

  • $\prod_{(t,m) \in \mathcal{Z}} p(m, t)$: The joint probability distribution of text-molecule pairs across the corpus $\mathcal{Z}$.

  • $\psi(m, t, N(0, \mathrm{I}))$: The diffusion network's parameterized approximation of this joint distribution, starting from Gaussian noise $N(0, \mathrm{I})$.

    To enhance robustness, diffusion-theory-based timestep adaptive noise is introduced into the query (text) and key-value (molecule) projections. This simulates the progressive noise scheduling of diffusion models: low noise initially for high-confidence alignments, dynamically intensified noise later for more complex scenarios.

For text-to-molecule retrieval, the molecular perturbation embeddings $h_p^g$ are concatenated into a candidate set tensor $H_p^g \in \mathbb{R}^{N \times D}$ (where $N$ is the batch size and $D$ is the embedding dimension) and projected into keys and values. Similarly, the global text embeddings $\pmb{h}^t$ are concatenated into a text tensor $H^t \in \mathbb{R}^{N \times D}$ and projected into queries. Noise is introduced based on the time step $k$: $ \begin{aligned} Q_t &= W_Q\big(H^t + \mathrm{Proj}(\mathrm{Noise}_k)\big), \\ K_m &= W_K\big(H_{\mathcal{P}}^g + \mathrm{Proj}(\mathrm{Noise}_k)\big), \\ V_m &= W_V\big(H_{\mathcal{P}}^g + \mathrm{Proj}(\mathrm{Noise}_k)\big) \end{aligned} $

  • $Q_t, K_m, V_m$: The projected query, key, and value matrices.

  • $W_Q, W_K, W_V$: Learnable projection matrices.

  • $H^t$: Batched global text embeddings.

  • $H_{\mathcal{P}}^g$: Batched perturbed global molecular embeddings.

  • $\mathrm{Proj}(\mathrm{Noise}_k)$: A projection function that maps the noise at time step $k$ to the embedding dimension $D$. This ensures that noise is added in a dimensionally compatible way.

    To model the joint data distribution of text-molecule alignment, an attention mechanism is employed, incorporating the data distribution $X_k$ at time step $k$ into the attention weights to reflect the reverse denoising process: $ \mathcal{E}^{\mathrm{global}} = \left(\mathrm{Softmax}(Q_t K_m^T) + X_k\right) \cdot V_m + \mathrm{DWC}(V_m) $

  • $\mathcal{E}^{\mathrm{global}}$: The global semantic alignment distribution.

  • $\mathrm{Softmax}(Q_t K_m^T)$: The standard attention scores between text queries and molecule keys.

  • $X_k$: The joint probability of the previous noise level for each candidate molecule. Higher values indicate greater confidence, dynamically adjusting the attention weights. This term is crucial for the diffusion aspect, reflecting the model's understanding of the data distribution at a given noise level.

  • $V_m$: The value matrix for molecules.

  • $\mathrm{DWC}(V_m)$: A depthwise convolution module applied to the value matrix, which helps preserve feature diversity.
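A structural sketch of one noise-conditioned global alignment step (the depthwise-convolution kernel size, the noise projection, and all tensor shapes are assumptions made for illustration, not the authors' code):

```python
import torch
import torch.nn as nn

class GlobalAlignment(nn.Module):
    """One noise-injected attention step producing the global alignment distribution."""
    def __init__(self, dim):
        super().__init__()
        self.W_Q, self.W_K, self.W_V = (nn.Linear(dim, dim) for _ in range(3))
        self.proj_noise = nn.Linear(dim, dim)                      # Proj(Noise_k)
        self.dwc = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)   # depthwise conv on values

    def forward(self, H_t, H_pg, noise_k, X_k):
        Q = self.W_Q(H_t + self.proj_noise(noise_k))          # (N, D) text queries
        K = self.W_K(H_pg + self.proj_noise(noise_k))         # (N, D) molecule keys
        V = self.W_V(H_pg + self.proj_noise(noise_k))         # (N, D) molecule values
        attn = torch.softmax(Q @ K.T, dim=-1)                 # (N, N) text-molecule scores
        dwc = self.dwc(V.T.unsqueeze(0)).squeeze(0).T         # DWC(V_m), back to (N, D)
        return (attn + X_k) @ V + dwc                         # E^global, (N, D)

layer = GlobalAlignment(300)
out = layer(torch.randn(32, 300), torch.randn(32, 300), torch.randn(32, 300), torch.zeros(32, 32))
```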

    This $\mathcal{E}^{\mathrm{global}}$ represents the global (semantic-level) alignment distribution. To capture fine-grained local alignment (substructure-level), a token-fragment similarity matrix $S$ is computed: $ S = [s_{ij}]^{N_w \times N_m}, \quad \text{where} \quad s_{ij} = \frac{(\pmb{h}_i^w)^{\top} \pmb{h}_j^m}{\lVert \pmb{h}_i^w \rVert \, \lVert \pmb{h}_j^m \rVert} $

  • $S$: A matrix capturing the cosine similarity between the $i$-th word embedding $\pmb{h}_i^w$ and the $j$-th molecular motif embedding $\pmb{h}_j^m$.

    The fine-grained local alignment distribution $\mathcal{E}^{\mathrm{local}}$ is then obtained by selecting the maximum fragment alignment score per word ($\max_j s_{ij}$) and aggregating based on word importance weights: $ \mathcal{E}^{\mathrm{local}} = \sum_{i=1}^{N_w} f_i^w \max_j s_{ij}, \quad \{f_i^w\}_{i=1}^{N_w} = \mathrm{Softmax}\big(\mathrm{MLP}(\{h_i^w\}_{i=1}^{N_w})\big) $

  • $\mathcal{E}^{\mathrm{local}}$: The local alignment distribution.

  • $f_i^w$: The importance weight for the $i$-th word. These weights are learned by an MLP (multi-layer perceptron) that processes the word embeddings $\{h_i^w\}$ and applies a Softmax to ensure they sum to 1.

  • $\max_j s_{ij}$: The maximum similarity score between the $i$-th word and any molecular motif.
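A minimal sketch of this word-motif alignment (the word-importance MLP here is a stand-in producing a single logit per word, an assumption for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_alignment(word_emb, motif_emb, word_mlp):
    """E^local: importance-weighted sum of each word's best-matching motif similarity."""
    w = F.normalize(word_emb, dim=-1)                          # (N_w, D)
    m = F.normalize(motif_emb, dim=-1)                         # (N_m, D)
    S = w @ m.T                                                # (N_w, N_m) similarities s_ij
    best = S.max(dim=1).values                                 # max_j s_ij for each word i
    f = torch.softmax(word_mlp(word_emb).squeeze(-1), dim=0)   # word importance weights f_i^w
    return (f * best).sum()                                    # scalar E^local

word_mlp = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 1))
score = local_alignment(torch.randn(32, 300), torch.randn(12, 300), word_mlp)
```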

    Finally, the joint data distribution $\mathcal{E}^{\mathrm{joint}}$ is obtained through a weighted fusion of the two alignment levels: $ \mathcal{E}^{\mathrm{joint}} = \gamma \mathcal{E}^{\mathrm{local}} + (1 - \gamma) \mathcal{E}^{\mathrm{global}} $

  • $\mathcal{E}^{\mathrm{joint}}$: The final combined joint data distribution.

  • $\gamma$: A balance coefficient that weighs the contribution of local vs. global alignment.

    This $\mathcal{E}^{\mathrm{joint}}$ is then fed into a denoising decoder (a multilayer perceptron (MLP) with ReLU activation), which, after multiple rounds of reverse denoising, predicts a clean alignment distribution $\hat{X}_0$. The final retrieval results are produced by ranking based on $\hat{X}_0$.

The reverse denoising process reconstructs the data distribution through a Markov chain: $ \hat { x } _ { t } = \sqrt { \alpha _ { t } } \hat { X } _ { t - 1 } + \sqrt { 1 - \alpha _ { t } } \epsilon $

  • $\hat{x}_t$: The predicted data distribution at time step $t$.

  • $\alpha_t$: A noise scheduling coefficient that controls the amount of noise added or removed at each step.

  • $\hat{X}_{t-1}$: The estimated data distribution at the previous time step.

  • $\epsilon$: A noise term predicted by the network.
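A highly simplified sketch of how the fused alignment distribution could be refined over the reverse chain (the conditioning of the denoiser and all shapes are assumptions; the paper's decoder is an MLP with ReLU, reduced here to a toy version):

```python
import torch
import torch.nn as nn

def hdan_denoise(e_joint, denoiser, alphas, steps=50):
    """Iteratively refine a noisy alignment distribution toward a clean estimate X_0."""
    x = torch.randn_like(e_joint)                       # start from Gaussian noise
    for k in reversed(range(steps)):
        x_prev = denoiser(torch.cat([e_joint, x], dim=-1))       # estimate of X_{k-1}
        eps = torch.randn_like(x) if k > 0 else torch.zeros_like(x)
        x = alphas[k].sqrt() * x_prev + (1.0 - alphas[k]).sqrt() * eps
    return x                                            # predicted X_0, used for ranking

denoiser = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
alphas = torch.linspace(0.9, 0.999, 50)
x0_hat = hdan_denoise(torch.randn(32), denoiser, alphas)
```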

    The generation loss function $\mathcal{L}_{\mathrm{Gen}}$ aims to optimize this noise prediction and data distribution modeling: $ \mathcal{L}_{\mathrm{Gen}} = \mathbb{E}_{(t,m) \in Z}\Big[\mathrm{KL}\big(X_0 \,\|\, \psi(m, t, X_k)\big)\Big] + \mathbb{E}_{(m,t) \in Z}\Big[\mathrm{KL}\big(X_0 \,\|\, \psi(t, m, X_k)\big)\Big] $

  • $\mathcal{L}_{\mathrm{Gen}}$: The generative loss.

  • $\mathbb{E}_{(t,m) \in Z}$: Expectation over positive text-molecule pairs in the corpus.

  • $\mathrm{KL}\big(X_0 \,\|\, \psi(m, t, X_k)\big)$: The KL divergence between the true, clean alignment distribution $X_0$ and the one predicted by the diffusion network $\psi$ (given molecule $m$, text $t$, and the current noisy state $X_k$). This term optimizes the model to accurately predict the underlying data distribution for text-to-molecule retrieval.

  • $\mathrm{KL}\big(X_0 \,\|\, \psi(t, m, X_k)\big)$: The symmetric KL divergence for molecule-to-text retrieval. This ensures the model learns bidirectional dependencies.

4.2.4. Overall Training Objective

The overall training loss is the sum of the generative and discriminative losses: $ \mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{Gen}} + \mathcal{L}_{\mathrm{Dis}} $. This joint optimization allows the model to leverage both:

  • Discriminative Contrastive Learning: Strengthens explicit semantic alignment by maximizing conditional probability, focusing on core semantic regions.
  • Generative Diffusion Modeling: Optimizes implicit distribution dependencies by modeling the joint probability, providing dynamic alignment signals and enhancing generalization.

5. Experimental Setup

5.1. Datasets

The experiments were conducted on two publicly available benchmark datasets:

  • ChEBI-20 [11]:

    • Source: From the ChEBI (Chemical Entities of Biological Interest) database, a dictionary of molecular entities focused on "small chemical compounds".
    • Scale: Contains 33,010 molecules, each paired with one or more textual descriptions (e.g., chemical names, synonyms, descriptions of biological roles).
    • Characteristics: Focuses on biologically relevant small molecules. The text descriptions often contain scientific and chemical terminology.
    • Splits: Divided into training, validation, and test sets in an 8:1:1 ratio.
    • Inference: During inference, test samples are retrieved from the entire dataset, meaning the model needs to distinguish the correct molecule from a large pool of candidates.
    • Example data sample: While not explicitly provided in the paper, a typical entry might be a molecule (represented as a SMILES string or graph) paired with text like "A carboxylic acid that is acetic acid in which one of the methyl hydrogens is substituted by a fluoro group." (for Fluoroacetic acid).
  • PCdes [40]:

    • Source: Sourced from PubChem [17], a public repository for information on chemical substances and their biological activities.
    • Scale: Contains 15,000 molecule pairs. The paper implies these are text-molecule pairs, likely descriptions from PubChem records.
    • Characteristics: Broader chemical scope than ChEBI, potentially including a wider variety of chemical compounds and descriptions.
    • Splits: Divided into training, validation, and test sets in a 7:1:2 ratio.
    • Why chosen: Both datasets are standard benchmarks in text-molecule retrieval, providing a robust evaluation platform for the method's performance across different scales and types of chemical entities and their descriptions.

5.2. Evaluation Metrics

The paper uses several standard metrics to evaluate retrieval performance:

5.2.1. Hits@K (Recall@K)

  • Conceptual Definition: Hits@K (also known as Recall@K) measures the proportion of queries for which the correct item (molecule for text-to-molecule, or text for molecule-to-text) is found within the top $K$ retrieved results. A higher value indicates better retrieval accuracy.
  • Mathematical Formula: $ \text{Hits@K} = \frac{\text{Number of queries where the correct item is in the top K results}}{\text{Total number of queries}} $
  • Symbol Explanation:
    • Number of queries where the correct item is in the top K results: The count of instances where the ground-truth target is found within the first $K$ items of the ranked retrieval list.
    • Total number of queries: The total number of retrieval attempts made during evaluation.

5.2.2. Mean Reciprocal Rank (MRR)

  • Conceptual Definition: Mean Reciprocal Rank (MRR) is a statistic for evaluating ordered lists of responses. For a single query, the reciprocal rank is 1 divided by the rank of the first correct answer. If the correct answer is at rank 1, the reciprocal rank is 1; if at rank 2, it's 0.5, and so on. MRR is the average of these reciprocal ranks across all queries. A higher MRR indicates that relevant items are consistently ranked higher.
  • Mathematical Formula: $ \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} $
  • Symbol Explanation:
    • $|Q|$: The total number of queries.
    • $\text{rank}_i$: The rank position of the first relevant document for the $i$-th query. If no relevant document is found in the evaluated list, its reciprocal rank is considered 0.

5.2.3. Mean Rank (MR)

  • Conceptual Definition: Mean Rank (MR) calculates the average rank of the correct item across all queries. Unlike MRR, it directly averages the rank positions. A lower MR indicates better performance, as it means correct items are, on average, found at lower (better) ranks.
  • Mathematical Formula: $ \text{MR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \text{rank}_i $
  • Symbol Explanation:
    • $|Q|$: The total number of queries.
    • $\text{rank}_i$: The rank position of the relevant document for the $i$-th query.
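Given the 1-based rank of the correct item for each query, all three metrics reduce to a few lines (a generic sketch, not evaluation code from the paper):

```python
import numpy as np

def retrieval_metrics(ranks, k=10):
    """ranks: 1-based rank of the ground-truth item for each query."""
    ranks = np.asarray(ranks, dtype=float)
    hits_at_k = float(np.mean(ranks <= k))   # Hits@K / Recall@K
    mrr = float(np.mean(1.0 / ranks))        # Mean Reciprocal Rank
    mr = float(np.mean(ranks))               # Mean Rank
    return hits_at_k, mrr, mr

print(retrieval_metrics([1, 3, 12, 2], k=10))   # (0.75, ~0.479, 4.5)
```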

5.3. Baselines

The paper compares DiffTMR against a comprehensive set of baseline models, categorized into:

5.3.1. Task-Specific Models

These models are designed specifically for text-molecule retrieval or related cross-modal alignment tasks and often employ specialized techniques:

  • MLP-Ensemble [11], GCN-Ensemble [11], All-Ensemble [11]: Early ensemble methods proposed by Edwards et al. for text-to-molecule retrieval, combining different representation learning architectures.
  • MLP+Atten [11], MLP+FPG [14]: Other architectures from Edwards et al., likely involving MLP layers with attention mechanisms or fingerprint graphs.
  • AMAN [43]: Adversarial Modality Alignment Network uses adversarial training to align molecular and textual representations, enhancing cross-modal retrieval by making the embeddings modality-invariant.
  • Memory Bank [32]: A method that likely uses a memory bank mechanism to store and retrieve representations, often used in contrastive learning to manage a large number of negative samples.
  • ORMA [29]: Optimal Transport-based Multi-grained Alignments, which explores optimal transport theory to achieve multi-grained alignment between different modalities, focusing on fine-grained and global relationships.
  • CLASS (ORMA) [38]: An enhanced version building upon ORMA, likely optimizing performance and training efficiency.

5.3.2. Large-Scale Pretrained Multimodal Models

These models leverage large pre-training corpora and often employ hierarchical or knowledge-enhanced alignment mechanisms:

  • SciBERT [1]: Used as a text encoder baseline; its performance is evaluated when directly applied to the retrieval task without specific molecular adaptation beyond basic fine-tuning.

  • KV-PLM [40]: A Knowledge-enhanced Visual-language Pre-trained Model that aligns molecular structures and biomedical text, bridging the gap between molecular graphs and natural language.

  • MoMu [33]: A Molecular Multimodal Foundation Model that associates molecule graphs with natural language, often via contrastive learning on large datasets.

  • MolFM [28]: A Multimodal Molecular Foundation Model that incorporates knowledge graphs to enrich molecular representations.

  • MolCA [24]: A Molecular graph-language modeling approach with a cross-modal projector and uni-modal adapter.

  • MoleculeSTM [23]: A Multi-modal Molecule Structure-Text Model for text-based retrieval and editing, utilizing self-supervised learning on large datasets.

  • Atomas-base [41], Atomas-large [41]: Models based on Hierarchical Adaptive Alignment for molecular and textual representations, processing features at multiple granularities. Atomas-large is a larger version, typically with more parameters or trained on more data.

    These baselines are representative as they cover a range of approaches, from traditional ensemble methods to state-of-the-art contrastive learning and large-scale pre-trained multimodal models, allowing for a comprehensive evaluation of DiffTMR's performance.

5.4. Implementation Details

  • Text Encoder: SciBERT [1] is used, with a maximum sequence length of 256 tokens.
  • Molecular Graph Encoder: A three-layer GCN [37] is used, with an output dimension of 300.
  • Optimization Strategy: A discriminative-generative joint optimization strategy is adopted.
  • Learning Rates:
    • SciBERT: 3e-5
    • Semantic coupling layer (in TAPR): 1e-5
    • Other components: 1e-4
  • Diffusion Denoising Network: Optimized using the Adam optimizer [18] with a base learning rate of 1e-3 and a cosine annealing scheduler [25] (a configuration sketch follows this list).
  • Diffusion Process: Fixed to 50 steps.
  • Balance Coefficient ($\gamma$): Set to 0.4 for fusing the global and local alignment distributions.
  • Training: 60 epochs with a batch size of 32.
  • Hardware: All experiments were conducted on an NVIDIA A40 GPU.
  • Candidate Set: Assumed to be predefined.
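A minimal sketch of the optimization setup described above, using stand-in modules (the module names and dimensions are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

# Stand-ins for SciBERT, the TAPR semantic coupling layer, the remaining components,
# and the diffusion denoising network.
text_encoder   = nn.Linear(768, 300)
coupling_layer = nn.Linear(256, 300)
other_modules  = nn.Linear(300, 300)
denoiser       = nn.Sequential(nn.Linear(600, 300), nn.ReLU(), nn.Linear(300, 300))

optimizer = torch.optim.Adam([
    {"params": text_encoder.parameters(),   "lr": 3e-5},   # SciBERT
    {"params": coupling_layer.parameters(), "lr": 1e-5},   # semantic coupling layer
    {"params": other_modules.parameters(),  "lr": 1e-4},   # other components
])

# Diffusion denoising network: Adam with base lr 1e-3 and a cosine annealing schedule.
diff_optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(diff_optimizer, T_max=60)
```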

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that DiffTMR significantly outperforms existing baseline methods across various metrics and datasets, particularly excelling in out-of-domain generalization. This superior performance is attributed to its discriminative-generative collaborative optimization framework, which combines the precision of contrastive learning for explicit semantic alignment with the generalization strength of diffusion modeling for capturing implicit joint data distributions.

6.1.1. ChEBI-20 Dataset Performance

The following are the results from Table 1 of the original paper:

Left block: Text-Molecule Retrieval; right block: Molecule-Text Retrieval. "-" indicates a value not reported.

| Models | Hits@1 (↑) | Hits@10 (↑) | MRR (↑) | MR (↓) | Hits@1 (↑) | Hits@10 (↑) | MRR (↑) | MR (↓) |
|---|---|---|---|---|---|---|---|---|
| MLP-Ensemble [11] | 29.4% | 77.6% | 0.452 | 20.78 | - | - | - | - |
| GCN-Ensemble [11] | 29.4% | 77.1% | 0.447 | 28.77 | - | - | - | - |
| All-Ensemble [11] | 34.4% | 81.1% | 0.499 | 20.21 | 25.2% | 74.1% | 0.408 | 21.77 |
| MLP+Atten [11] | 22.8% | 68.7% | 0.375 | 30.37 | - | - | - | - |
| MLP+FPG [14] | 22.6% | 68.6% | 0.374 | 30.37 | - | - | - | - |
| AMAN [43] | 49.4% | 92.1% | 0.647 | 16.01 | 46.6% | 91.6% | 0.625 | 16.50 |
| Atomas-base [41] | 50.1% | 92.1% | 0.653 | 14.49 | 45.6% | 90.3% | 0.614 | 15.12 |
| Memory Bank [32] | 56.5% | 94.1% | 0.702 | 12.66 | 52.3% | 93.3% | 0.673 | 12.29 |
| ORMA [29] | 66.4% | 93.7% | 0.775 | 18.63 | 61.2% | 92.8% | 0.738 | 10.21 |
| CLASS (ORMA) [38] | 67.4% | 93.4% | 0.774 | 17.82 | 62.0% | 92.7% | 0.738 | 14.59 |
| DiffTMR (ours) | 72.8% | 96.5% | 0.823 | 16.24 | 66.7% | 96.3% | 0.784 | 10.07 |

  • Text-to-Molecule Retrieval: DiffTMR achieves a Hits@1 of 72.8%, surpassing the previous state-of-the-art baseline CLASS (ORMA) (67.4%) by a substantial 5.4%. It also shows the highest Hits@10 at 96.5% and MRR at 0.823, indicating excellent performance in finding the correct molecule and ranking it highly.
  • Molecule-to-Text Retrieval: Similarly, DiffTMR achieves a Hits@1 of 66.7%, exceeding CLASS (ORMA) (62.0%) by 4.7%. Its Hits@10 of 96.3% and MRR of 0.784 are also the highest, demonstrating strong bidirectional retrieval capabilities.
  • Ambiguous Queries: The high Hits@10 scores (96.5% for text-to-molecule and 96.3% for molecule-to-text) highlight DiffTMR's strong semantic modeling and understanding capabilities, making it effective even for ambiguous queries where multiple relevant candidates might exist.

6.1.2. PCdes Dataset Performance

The following are the results from Table 2 of the original paper:

Left block: Text-Molecule Retrieval; right block: Molecule-Text Retrieval. "-" indicates a value not reported.

| Models | Recall@1 (↑) | Recall@5 (↑) | Recall@10 (↑) | MRR (↑) | Recall@1 (↑) | Recall@5 (↑) | Recall@10 (↑) | MRR (↑) |
|---|---|---|---|---|---|---|---|---|
| *Pretrained Model + Finetuning* | | | | | | | | |
| SciBERT [1] | 16.3% | 33.9% | 42.6% | 0.250 | 15.0% | 34.1% | 41.7% | 0.239 |
| KV-PLM [40] | 20.6% | 37.9% | 45.7% | 0.292 | 19.3% | 37.3% | 45.3% | 0.281 |
| MoMu [33] | 24.5% | 45.4% | 53.8% | 0.343 | 24.9% | 44.9% | 54.3% | 0.345 |
| MolFM [28] | 29.8% | 50.5% | 58.6% | 0.396 | 29.4% | 50.3% | 58.5% | 0.393 |
| *Pretrained Model + Zero-shot* | | | | | | | | |
| MolCA [24] | 35.1% | 62.1% | 69.8% | 0.473 | 38.0% | 66.8% | 74.5% | 0.508 |
| MoleculeSTM [23] | 35.8% | - | - | - | 39.5% | - | - | - |
| Atomas-base [41] | 39.1% | 59.7% | 66.6% | 0.473 | 37.9% | 59.2% | 65.6% | 0.478 |
| Atomas-large [41] | 49.1% | 68.3% | 73.2% | 0.578 | 46.2% | 66.0% | 72.3% | 0.555 |
| *From-scratch* | | | | | | | | |
| ORMA [29] | 64.8% | 82.3% | 86.3% | 0.727 | 62.1% | 81.4% | 86.3% | 0.710 |
| DiffTMR (ours) | 69.0% | 85.6% | 89.2% | 0.755 | 66.8% | 84.3% | 88.7% | 0.738 |

  • Text-to-Molecule Retrieval: DiffTMR achieves 69.0% Recall@1, outperforming ORMA (64.8%) by 4.2% and significantly surpassing other large-scale pretrained models. Its Recall@10 reaches 89.2%.

  • Molecule-to-Text Retrieval: DiffTMR obtains 66.8% Recall@1, an improvement of 4.7% over ORMA (62.1%).

  • Long-tail Retrieval Scenarios: The model maintains strong performance in long-tail scenarios (infrequent or rare molecular categories), achieving 89.2% Recall@10, demonstrating superior cross-domain generalization and robustness compared to models like Atomas, KV-PLM, and MoMu.

    The consistent superior performance of DiffTMR across both datasets and retrieval directions underscores the effectiveness of its discriminative-generative collaborative optimization framework. By jointly optimizing for explicit semantic alignment via contrastive learning and implicit distribution dependencies via diffusion modeling, the model achieves a robust and generalizable solution for cross-modal molecular retrieval.

6.2. Ablation Studies / Parameter Analysis

Ablation studies were conducted on the ChEBI-20 dataset to investigate the contribution of individual components and the effect of key hyper-parameters.

The following are the results from Table 3 of the original paper:

(a) Calculation Method of Perturbation Range

| Perturbation $\mathcal{P}$ | Hits@1 ↑ | Hits@10 ↑ | MRR ↑ | MR ↓ |
|---|---|---|---|---|
| $\exp(\frac{1}{N_w} \sum c_i)$ | 63.4 | 90.1 | 0.792 | 17.3 |
| $\exp(\frac{\theta}{N_w} \sum c_i)$ | 66.1 | 93.2 | 0.805 | 16.7 |
| $\exp(C\Theta)$ | 72.8 | 96.5 | 0.823 | 16.24 |

(b) Effect of Perturbation Sampling Frequency

| Frequency (F) | Hits@1 ↑ | Hits@10 ↑ | MRR ↑ | MR ↓ |
|---|---|---|---|---|
| 5 | 63.5 | 90.2 | 0.795 | 17.9 |
| 10 | 68.4 | 93.6 | 0.811 | 17.1 |
| 15 | 72.8 | 96.5 | 0.823 | 16.24 |
| 20 | 71.5 | 95.8 | 0.820 | 16.8 |

(c) Effect of Diffusion Steps on Hits@1 (rows: training steps; columns: evaluation steps; "×" indicates a setting not evaluated)

| Train \ Eval | 10 | 50 | 100 | 500 |
|---|---|---|---|---|
| 10 | 71.3 | × | × | × |
| 50 | 71.7 | 72.8 | × | × |
| 100 | 70.3 | 71.2 | 70.5 | × |
| 1000 | 70.0 | 70.7 | 70.9 | 71.0 |

6.2.1. Perturbation Range Calculation

Table 3a compares three strategies for calculating the perturbation range P\mathcal{P}:

  • Fixed parameter $\exp(\frac{1}{N_w} \sum c_i)$: This uses a static, non-trainable average of the semantic coupling strengths. It achieves a Hits@1 of 63.4%.
  • Scalable parameter $\exp(\frac{\theta}{N_w} \sum c_i)$: This introduces a single trainable scalar parameter $\theta$ to dynamically adjust the perturbation magnitude. Its Hits@1 is 66.1%, an improvement over the fixed parameter that indicates the benefit of dynamic adjustment.
  • Matrix transformation $\exp(C\Theta)$: This utilizes a learnable matrix $\Theta$ to model high-order interactions of the semantic coupling tensor $C$. This strategy yields the best performance with a Hits@1 of 72.8%. The authors conclude that its superior performance is due to its stronger nonlinear modeling capability, allowing for more nuanced and adaptive calibration of the perturbation intensity.

6.2.2. Perturbation Sampling Frequency

Table 3b examines the impact of the molecular perturbation sampling frequency $F$ during inference (a sketch of this best-of-$F$ selection follows the list). As $F$ increases from 5 to 15, the Hits@1 score improves from 63.5% to 72.8%.

  • Increasing $F$ means the model performs multiple stochastic samplings of $h_p^g$ and selects the perturbed embedding that best aligns with the text anchor. This allows a more thorough exploration of the molecular semantic space, thereby boosting cross-modal alignment.
  • However, increasing $F$ beyond 15 (e.g., to 20) leads to a slight decrease in Hits@1 (71.5%), suggesting a point of diminishing returns where excessive sampling introduces noise and computational overhead without further accuracy gains.
  • An optimal balance between retrieval accuracy and computational cost is found at $F = 15$.
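The sketch below illustrates this inference-time procedure: draw $F$ perturbed molecular embeddings around the original representation and keep the one most similar to the text anchor. The function name, tensor shapes, and Gaussian form of the perturbation are assumptions made for illustration.

```python
# Best-of-F perturbation sampling at inference time (illustrative sketch).
import torch
import torch.nn.functional as F

def best_of_f(mol_emb, text_anchor, perturb_range, F_samples=15):
    """mol_emb, text_anchor, perturb_range: (D,) tensors; shapes are illustrative."""
    best_emb, best_sim = mol_emb, -float("inf")
    for _ in range(F_samples):
        # Sample a perturbed embedding h_p^g around the original molecular embedding.
        candidate = mol_emb + perturb_range * torch.randn_like(mol_emb)
        # Keep the sample that aligns best with the text anchor.
        sim = F.cosine_similarity(candidate, text_anchor, dim=0).item()
        if sim > best_sim:
            best_emb, best_sim = candidate, sim
    return best_emb
```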

6.2.3. Diffusion Step

Table 3c investigates how the number of diffusion steps during training and evaluation affects Hits@1 performance in text-to-molecule retrieval.

  • The results show that 50 diffusion steps during both training and evaluation achieve optimal performance (Hits@1 of 72.8%).
  • Interestingly, fewer steps (e.g., 10) during training or evaluation result in lower performance. More steps (e.g., 100 or 1000 for training, or 100/500 for evaluation) also tend to slightly decrease or plateau performance.
  • The paper notes that 50 steps are sufficient for DiffTMR, which contrasts with image generation tasks (e.g., DDPM [15]) that often require 1,000 steps. This difference is attributed to the varying complexity of data distributions: image generation deals with high-dimensional pixel distributions, while text-molecule alignment focuses on semantic associations. The less complex representation space in text-molecule retrieval allows for fewer steps to capture key cross-modal relationships, while excessive iterations might risk introducing extra noise and reducing alignment quality.
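As a rough illustration of how short such a reverse process can be, the sketch below runs a generic DDPM-style denoising loop with only 50 steps. The linear beta schedule and the `denoiser(x, t)` interface are assumptions; this is not the paper's exact sampler.

```python
# Generic short reverse-denoising loop (DDPM-style), run for 50 steps.
import torch

@torch.no_grad()
def reverse_denoise(denoiser, shape, num_steps=50, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                          # start from pure noise
    for t in reversed(range(num_steps)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch)                                  # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])                # DDPM mean update
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)      # add sampling noise except at t=0
    return x
```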

6.3. Out-of-domain Retrieval

To assess the generalization ability of models on unseen data, the authors adopted an in-domain training, out-of-domain testing paradigm. This means the model is trained on one source dataset (e.g., ChEBI-20) and then evaluated on a completely different, unseen target dataset (e.g., PCdes).

  • Discriminative Methods Limitations: Traditional discriminative methods like ORMA [29] and AMAN [43] exhibit limited generalization in this out-of-domain scenario. Even though ORMA performs significantly better than AMAN in in-domain retrieval, both show poor performance when transferred to an unseen target domain. This confirms the inherent limitation of discriminative models in capturing intrinsic data distributions, which hinders their adaptability to shifts in data.

  • DiffTMR's Superiority: In contrast, DiffTMR consistently demonstrates strong performance across both in-domain and out-of-domain retrieval tasks. This robust generalization is a direct consequence of its generative modeling approach, which learns the joint probability distribution and thus captures the fundamental statistical properties of the data, making it more resilient to distribution shifts.

    The following figure (Figure 5 from the original paper) visualizes the similarity distribution of models in both in-domain and out-of-domain scenarios.

Figure 5 (from the original paper): 3D scatter plots of positive (blue) and negative (purple) text-molecule pairs for DiffTMR, ORMA, and AMAN on the ChEBI-20 and PCdes datasets, illustrating how well each model separates matched from mismatched pairs in the retrieval task.

The image above (Figure 5 from the original paper) visually represents the separation of positive and negative pairs in the embedding space for DiffTMR, ORMA, and AMAN on in-domain (ChEBI-20) and out-of-domain (PCdes) retrieval.

  • Positive pairs (blue points) represent correctly matched text-molecule pairs.

  • Negative pairs (purple points) represent mismatched pairs.

    In the in-domain (ChEBI-20) scenario, all models show a reasonable separation between positive and negative pairs, with positives generally clustered together and negatives further away. However, in the out-of-domain (PCdes) scenario, discriminative baselines like ORMA and AMAN show a significant overlap between positive and negative pairs, indicating poor generalization. Their embeddings for correct pairs are no longer clearly distinct from incorrect ones. DiffTMR, however, maintains a much clearer separation between positive and negative pairs in the target domain (PCdes), comparable to its in-domain performance. This visualization strongly supports the claim that DiffTMR possesses stronger generalization and transferability when dealing with unseen data, a key advantage of its generative approach.
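The separation visualized in Figure 5 can be approximated quantitatively with a simple pair-similarity analysis. The sketch below, which assumes row-aligned embedding matrices produced by the trained encoders, computes cosine similarities for matched versus mismatched pairs so that their overlap can be compared across domains.

```python
# Sketch of a pair-similarity analysis for gauging positive/negative separation.
import torch
import torch.nn.functional as F

def pair_similarities(text_embs, mol_embs):
    """text_embs, mol_embs: (N, D), row-aligned so that row i is a true pair."""
    text_embs = F.normalize(text_embs, dim=-1)
    mol_embs = F.normalize(mol_embs, dim=-1)
    sims = text_embs @ mol_embs.t()                                   # (N, N) all-pairs similarity
    pos = sims.diag()                                                 # matched (positive) pairs
    mask = ~torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg = sims[mask]                                                  # mismatched (negative) pairs
    return pos, neg  # compare their means and overlap to gauge separation
```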

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces DiffTMR, a novel framework for text-to-molecule cross-modal retrieval built upon generative diffusion models. Diverging from conventional discriminative approaches that model deterministic conditional probabilities, DiffTMR reconstructs the joint cross-modal distribution through a progressive denoising process. This fundamental shift allows the model to effectively capture the inherent many-to-many semantic correspondences between textual queries and molecular structures. The framework integrates global semantic alignment with local substructure matching via a Hierarchical Diffusion Alignment Network (HDAN), enabling fine-grained and robust retrieval. Additionally, the method's Text-Anchored Perturbation Representation (TAPR) dynamically enhances molecular representation diversity. The unique combined optimization of contrastive learning (for explicit alignment) and generative diffusion modeling (for implicit distribution learning) significantly boosts both alignment accuracy and generalization ability. Experimental evaluations on ChEBI-20 and PCdes benchmark datasets confirm that DiffTMR substantially outperforms state-of-the-art discriminative baselines, particularly demonstrating superior performance in challenging out-of-domain scenarios.

7.2. Limitations & Future Work

While the paper presents a significant advancement, it implicitly acknowledges areas for future exploration. The authors highlight that "the potential of diffusion models in the critical field of text-molecule retrieval remains unexplored. Specifically, existing studies have yet to address the challenges of modeling the complex semantics of molecular structures [10, 19, 29, 40] and the bidirectional reasoning inherent to chemical analysis, presenting a promising direction for future research." This implies that even with DiffTMR's generative approach, there is still scope for:

  • More Complex Semantic Modeling: Deepening the model's understanding of intricate molecular structures and their descriptions, potentially by integrating more sophisticated graph representations or chemical knowledge graphs within the diffusion process.
  • Enhanced Bidirectional Reasoning: Further improving the model's ability to perform bidirectional inference (text-to-molecule and molecule-to-text) by perhaps explicitly modeling conditional generation in both directions or refining the joint distribution learning to better capture causal relationships.

7.3. Personal Insights & Critique

DiffTMR offers a compelling paradigm shift by applying diffusion models to text-molecule retrieval, moving beyond the limitations of discriminative methods. The core strength lies in its ability to model the joint probability distribution, which is theoretically sound for handling out-of-distribution data and enhancing generalization. The hierarchical alignment and dynamic perturbation mechanisms are clever additions that address specific challenges in this domain, such as capturing multi-granularity information and balancing representation diversity with accuracy.

Strengths:

  • Novelty: It's a pioneering work in applying diffusion models to text-molecule retrieval, opening a new avenue for research.
  • Robustness & Generalization: The generative approach inherently improves robustness to data distribution shifts and delivers strong out-of-domain performance, which is crucial for real-world applications where data diversity is high.
  • Comprehensive Alignment: The global-local hierarchical alignment strategy ensures that both high-level semantic meaning and fine-grained structural details are considered during retrieval.
  • Dynamic Representation: Text-anchored perturbations are an innovative way to introduce flexibility and diversity into molecular representations without sacrificing semantic precision.

Potential Issues & Areas for Improvement:

  • Computational Cost: Diffusion models are typically computationally intensive, especially during the reverse denoising process. While the paper notes that 50 steps suffice (compared to 1,000 for image generation), dynamic perturbation sampling at inference (optimal at $F = 15$) can still add significant overhead, especially with large candidate sets. Further work could explore more efficient sampling strategies or distillation techniques for faster inference.

  • Interpretability: While the model performs well, the denoising process and the complex interactions within the diffusion alignment network might make it challenging to interpret why certain molecules are retrieved for specific texts. Enhancing interpretability would be valuable for drug discovery where understanding the reasoning is critical.

  • Scalability to Ultra-Large Datasets: While DiffTMR performs well on benchmark datasets, its scalability to datasets with millions of molecules (common in industrial drug discovery) and very long, complex text descriptions needs further investigation. The current design might face memory or computational bottlenecks with extremely large candidate pools.

  • Explicit Chemical Knowledge Integration: While SciBERT and GCN implicitly leverage some chemical information, more explicit integration of chemical domain knowledge (e.g., reactivity rules, retrosynthesis pathways, specific drug-target interactions) within the diffusion process could potentially lead to even more chemically sound and accurate retrievals, especially for complex functional queries.

  • Beyond Similarity: The diffusion process learns the joint distribution, which is richer than just similarity. Future work could explore how this learned distribution can be leveraged for tasks beyond retrieval, such as conditional molecule generation given text properties, or identifying novel molecular design principles from the generated distributions.

    The methods and conclusions of DiffTMR could potentially be transferred to other cross-modal retrieval tasks that struggle with out-of-distribution generalization and the accuracy-diversity trade-off, especially in scientific domains where data distributions can be complex and varied (e.g., text-protein retrieval, text-material science data retrieval). The principle of combining discriminative and generative learning, along with hierarchical dynamic representations, offers a powerful blueprint for future multimodal AI systems.
