DiffTMR: Diffusion-based Hierarchical Alignment for Text-Molecule Retrieval
TL;DR Summary
DiffTMR is an innovative text-molecule retrieval framework that reframes retrieval as a reverse denoising process, addressing traditional methods' limitations in out-of-distribution detection and diversity maintenance. By integrating hierarchical diffusion alignment and dynamic perturbation embedding mechanisms, it improves retrieval accuracy and out-of-domain generalization, surpassing leading baselines by 4.2%–5.4% in Hits@1 on the ChEBI-20 and PCdes benchmarks.
Abstract
Molecular retrieval is critical in drug discovery and molecular design. Traditional discriminative methods often model the conditional probability distribution of retrieving candidates, treating the query text as a deterministic input. However, these approaches have notable limitations: (1) They often overlook the statistical properties of the original data distributions of queries and candidates, preventing the recognition of out-of-distribution data. (2) They struggle to balance retrieval accuracy and diversity when processing open-ended semantic queries. To address these challenges, we introduce DiffTMR, a novel framework that reformulates text-molecule retrieval as a reverse denoising process, progressively generating the joint distribution of candidates and queries from noises. DiffTMR uniquely integrates hierarchical diffusion alignment with dynamic perturbation embedding mechanisms. By employing text-anchored perturbations, it enhances the diversity of molecular representations, and through global-local progressive denoising, it achieves cross-modal hierarchical alignment. This leads to significant improvements in retrieval accuracy and out-of-domain generalization. Evaluations on benchmark datasets ChEBI-20 and PCdes demonstrate that DiffTMR surpasses current leading baselines by 4.2%–5.4% in Hits@1 metrics and exhibits superior performance in out-of-domain retrieval tasks.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
DiffTMR: Diffusion-based Hierarchical Alignment for Text-Molecule Retrieval
1.2. Authors
The paper is authored by Chenxu Wang, Dong Zhou*, Ting Liu, Jianghao Lin, and Yongmei Zhou, all affiliated with Guangdong University of Foreign Studies, Guangzhou, China. Aimin Yang is also an author, affiliated with Lingnan Normal University, Zhanjiang, China. Dong Zhou is marked with an asterisk, typically indicating a corresponding author.
1.3. Journal/Conference
The paper is published at the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland. The ACM International Conference on Multimedia (ACM MM) is a highly reputable and influential conference in the field of multimedia computing, known for publishing cutting-edge research in various areas including cross-modal retrieval, computer vision, and natural language processing. Its high selectivity and broad scope make it a significant venue for disseminating research findings.
1.4. Publication Year
2025
1.5. Abstract
Molecular retrieval is a crucial task in drug discovery and molecular design. Traditional discriminative methods for text-molecule retrieval model the conditional probability distribution of retrieving candidates, treating the query text as a deterministic input. However, these methods often overlook the statistical properties of the original data distributions of queries and candidates, making them unable to recognize out-of-distribution data and struggle to balance retrieval accuracy and diversity for open-ended queries. To overcome these limitations, the paper introduces DiffTMR, a novel framework that re-frames text-molecule retrieval as a reverse denoising process. This process progressively generates the joint distribution of candidates and queries from noise. DiffTMR integrates a hierarchical diffusion alignment network with dynamic perturbation embedding mechanisms. It uses text-anchored perturbations to enhance the diversity of molecular representations and employs global-local progressive denoising to achieve cross-modal hierarchical alignment. This approach significantly improves retrieval accuracy and out-of-domain generalization. Evaluations on benchmark datasets ChEBI-20 and PCdes show that DiffTMR outperforms leading baselines by 4.2%–5.4% in Hits@1 metrics and demonstrates superior performance in out-of-domain retrieval tasks.
1.6. Original Source Link
/files/papers/694a002c3e1288a634f1be16/paper.pdf (This is a local file path, implying the paper is provided directly rather than an external web link. Publication status: Officially published at ACM MM '25.)
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by the paper is the limitations of existing text-molecule retrieval methods, which primarily fall under the discriminative paradigm. These methods typically model the conditional probability distribution (e.g., P(m|t)), focusing on optimizing feature extractors to align text and molecule representations in a shared embedding space. While successful in improving retrieval accuracy, this paradigm has two major drawbacks:
- Neglect of Data Distributions: They overlook the inherent statistical properties and joint distributions of the original query and candidate data. This prevents effective recognition of out-of-distribution (OOD) data and leads to poor generalization across different datasets or to novel data.
- Accuracy-Diversity Trade-off: The fixed-point embedding mechanism (where each input maps to a single, static embedding) struggles to balance retrieval accuracy with diversity when dealing with open-ended semantic queries. Overly compressed semantic spaces ensure precision but yield rigid responses, while attempts to enhance diversity by introducing unconstrained noise can lead to semantic drift (where the meaning of the representation deviates from the original intent).

The importance of text-molecule retrieval is highlighted by its critical role in applications such as drug discovery, molecular design, and virtual screening. Addressing the aforementioned limitations is vital for robust and generalizable solutions in these fields.
The paper's entry point is to leverage the power of generative models, specifically diffusion models, which have shown breakthroughs in other domains like image synthesis and natural language processing. Unlike discriminative models, generative models can capture comprehensive cross-modal associations by generating the joint distribution of candidates and queries, thereby enhancing robustness against data distribution shifts. The coarse-to-fine generative nature of diffusion models is identified as inherently suitable for progressively uncovering associations in text-molecule retrieval.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Novel Generative Framework: Introduction of DiffTMR, the first generative diffusion-based framework specifically designed for text-molecule retrieval, which reformulates the task as a reverse denoising process to model the joint probability distribution of text and molecules.
- Dynamic Perturbation Mechanism: Proposal of a Text-Anchored Perturbation Representation (TAPR) module that dynamically regulates the perturbation scope for molecular embeddings based on semantic coupling perception. This mechanism enhances the diversity of molecular representations while maintaining consistency under text-anchor constraints.
- Hierarchical Alignment Network: Development of a Hierarchical Diffusion Alignment Network (HDAN) that achieves cross-modal hierarchical modeling through global-local progressive denoising, capturing both high-level semantic associations and fine-grained correspondences between textual tokens and molecular motifs.
- Collaborative Optimization: A joint optimization strategy combining discriminative contrastive learning (for explicit semantic alignment) and generative diffusion modeling (for implicit distribution dependencies). This collaboration significantly improves both retrieval accuracy and generalization capability, especially for unseen data.
- Superior Performance: DiffTMR significantly outperforms current state-of-the-art baselines on the benchmark datasets ChEBI-20 and PCdes, achieving improvements of 4.2%–5.4% in Hits@1. The model also exhibits superior performance in challenging out-of-domain retrieval tasks, highlighting its strong generalization and transfer capabilities.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Text-Molecule Retrieval
Text-molecule retrieval is a cross-modal task where the goal is to find the most relevant molecule (or its representation) given a textual query, or conversely, to find the most relevant textual description for a given molecule. This involves understanding the semantic relationship between natural language and chemical structures and embedding them into a shared latent space where similarity can be measured.
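As a rough illustration of this shared-space formulation, the sketch below ranks candidate molecules by cosine similarity to a query-text embedding. The encoders are stubbed with random tensors, so all names and shapes here are illustrative assumptions rather than the paper's implementation.

```python
# A minimal sketch of shared-embedding-space retrieval: encode the query text and
# all candidate molecules, then rank candidates by cosine similarity.
import torch
import torch.nn.functional as F

def rank_candidates(query_emb, candidate_embs):
    sims = F.cosine_similarity(query_emb.unsqueeze(0), candidate_embs, dim=-1)
    return sims.argsort(descending=True)    # candidate indices, best match first

query_emb = torch.randn(256)                # embedding of the textual query
candidate_embs = torch.randn(1000, 256)     # embeddings of 1000 candidate molecules
ranking = rank_candidates(query_emb, candidate_embs)
print(ranking[:5])                          # top-5 retrieved molecule indices
```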
3.1.2. Discriminative vs. Generative Models
- Discriminative Models: These models learn a mapping from input features to output labels or categories. In retrieval, they often focus on modeling the conditional probability distribution, i.e., the probability P(m|t) of a candidate m given a query t. They excel at distinguishing between classes but typically do not learn the underlying data distributions P(t) or P(m). Contrastive learning-based methods are a common example, aiming to push dissimilar pairs apart and pull similar pairs closer.
- Generative Models: These models learn the joint probability distribution P(X, Y) (or P(X) and P(Y) separately). They can generate new data samples that resemble the training data. By learning the full data distribution, they can better handle out-of-distribution (OOD) data and capture more complex dependencies. Diffusion models are a prominent type of generative model.
3.1.3. Contrastive Learning
Contrastive learning is a self-supervised learning paradigm where a model learns representations by comparing similar and dissimilar pairs of data. The core idea is to learn an embedding space where positive pairs (e.g., a text and its corresponding molecule) are pulled closer together, while negative pairs (e.g., a text and a random, unrelated molecule) are pushed further apart. This is often achieved through a contrastive loss function that maximizes the agreement between positive pairs while minimizing it for negative pairs.
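The sketch below shows a minimal symmetric InfoNCE-style contrastive loss for a batch of paired text/molecule embeddings; the batch layout and temperature value are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of a symmetric InfoNCE-style contrastive loss for paired embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, mol_emb, temperature=0.07):
    # Normalize so the dot product equals cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    mol_emb = F.normalize(mol_emb, dim=-1)
    logits = text_emb @ mol_emb.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0))        # diagonal entries are positives
    loss_t2m = F.cross_entropy(logits, targets)     # text -> molecule direction
    loss_m2t = F.cross_entropy(logits.T, targets)   # molecule -> text direction
    return 0.5 * (loss_t2m + loss_m2t)

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```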
3.1.4. Diffusion Models
Diffusion models are a class of generative models that learn to generate data by reversing a gradual noise-adding process.
- Forward Diffusion Process: This is a fixed Markov chain that progressively adds Gaussian noise to data (e.g., an image or an embedding) over several time steps, eventually transforming it into pure Gaussian noise. The noisy data at step k is distributed according to q(x_k | x_{k-1}).
- Reverse Denoising Process: The model learns to reverse this process, starting from pure noise and gradually denoising it back to a clean data sample. This is done by training a neural network to predict the noise that was added at each step, allowing it to estimate the distribution p(x_{k-1} | x_k). By iteratively applying this learned denoising step, the model can generate new samples from the data distribution.

Diffusion models are known for their high-quality generation and ability to model complex data distributions in a coarse-to-fine manner.
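The snippet below is a minimal sketch of the forward (noise-adding) process on an embedding, using the standard closed-form expression for q(x_k | x_0). The linear beta schedule and 50 steps are assumptions for illustration.

```python
# Forward diffusion: sample a noisy version of a clean embedding at step k.
import torch

K = 50                                          # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, K)           # noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)       # cumulative product of alphas

def q_sample(x0, k):
    """Sample x_k ~ q(x_k | x_0) = N(sqrt(abar_k) x_0, (1 - abar_k) I)."""
    eps = torch.randn_like(x0)
    return alpha_bars[k].sqrt() * x0 + (1.0 - alpha_bars[k]).sqrt() * eps

x0 = torch.randn(4, 256)                        # a clean embedding batch
x_noisy = q_sample(x0, k=25)                    # halfway toward pure noise
```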
3.1.5. Graph Convolutional Network (GCN)
A Graph Convolutional Network (GCN) is a type of neural network that operates directly on graph-structured data. Unlike traditional convolutional networks designed for grid-like data (e.g., images), GCNs can process arbitrary graph topologies. In molecular representation, a molecule is naturally represented as a graph where atoms are nodes and chemical bonds are edges. GCNs learn node representations by aggregating information from their neighbors, effectively capturing local structural patterns (like functional groups) and propagating this information across the entire molecule, forming a global molecular representation.
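The following is a minimal sketch of one GCN propagation step on a toy molecular graph: each node's new feature is a degree-normalized aggregation of its neighbors' (and its own) features followed by a learned linear map. The adjacency matrix and feature sizes are assumptions for illustration.

```python
# One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)
import torch

A = torch.tensor([[0., 1., 0.],            # a 3-atom "molecule": atoms 0-1 and 1-2 bonded
                  [1., 0., 1.],
                  [0., 1., 0.]])
X = torch.randn(3, 8)                       # initial atom features
W = torch.nn.Linear(8, 8, bias=False)       # learnable weight matrix

A_hat = A + torch.eye(3)                    # add self-loops
deg = A_hat.sum(dim=1)
D_inv_sqrt = torch.diag(deg.pow(-0.5))      # symmetric normalization
H = torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ W(X))   # updated atom representations
print(H.shape)  # (3, 8)
```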
3.1.6. Reparameterization Trick
The reparameterization trick is a technique used in variational autoencoders (VAEs) and other probabilistic models to enable backpropagation through a stochastic node (a node from which a random sample is drawn). Instead of sampling directly from a distribution (which is not differentiable), the trick expresses the sample as a deterministic function of a simple random variable (e.g., standard Gaussian noise ε) and the parameters of the distribution. For example, to draw z from N(μ, σ²), we can write z = μ + σ·ε, where ε ~ N(0, 1). This allows gradients to flow through μ and σ, making the model trainable with standard gradient descent.
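A minimal sketch of the trick, with toy parameter tensors (the shapes are illustrative assumptions):

```python
# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
import torch

mu = torch.zeros(4, 16, requires_grad=True)        # distribution mean
log_sigma = torch.zeros(4, 16, requires_grad=True) # log standard deviation

eps = torch.randn_like(mu)                          # random noise, no gradient needed
z = mu + log_sigma.exp() * eps                      # differentiable sample

# Gradients flow through mu and sigma even though z is a random sample.
z.sum().backward()
print(mu.grad.shape, log_sigma.grad.shape)
```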
3.1.7. Attention Mechanism
The attention mechanism is a technique in neural networks that allows a model to weigh the importance of different parts of an input sequence or set of features when making a prediction or generating an output. It typically involves three components: Query (Q), Key (K), and Value (V). The Query is used to calculate attention scores (or weights) with all Keys, which indicate how relevant each Key is to the Query. These weights are then applied to the Values to produce a weighted sum, representing the "attended" output. The most common form is Scaled Dot-Product Attention:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
- Q: Query matrix.
- K: Key matrix.
- V: Value matrix.
- d_k: Dimension of the key vectors (used for scaling to prevent large dot products from pushing the softmax into regions with extremely small gradients).
- softmax: A function that converts a vector of scores into a probability distribution.
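A minimal sketch of the formula above (shapes are illustrative assumptions):

```python
# Scaled dot-product attention.
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., L_q, L_k) attention logits
    weights = F.softmax(scores, dim=-1)             # normalize over keys
    return weights @ V                              # weighted sum of values

out = attention(torch.randn(2, 5, 64), torch.randn(2, 7, 64), torch.randn(2, 7, 64))
print(out.shape)  # (2, 5, 64)
```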
3.1.8. KL Divergence
Kullback-Leibler (KL) divergence is a non-symmetric measure of the difference between two probability distributions, P and Q. It quantifies how much information is lost when Q is used to approximate P. A KL divergence of zero indicates that the two distributions are identical.
$
D_{\text{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)
$
- P(x): The probability of event x under distribution P.
- Q(x): The probability of event x under distribution Q.
- \mathcal{X}: The set of all possible events.
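A minimal sketch of this quantity for two discrete probability vectors (the clamping constant is an assumption to avoid division by zero):

```python
# Discrete KL divergence D_KL(P || Q).
import torch

def kl_divergence(p, q, eps=1e-12):
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    return (p * (p / q).log()).sum(dim=-1)

p = torch.tensor([0.7, 0.2, 0.1])
q = torch.tensor([0.5, 0.3, 0.2])
print(kl_divergence(p, q))   # > 0; zero only when p == q
```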
3.1.9. SciBERT
SciBERT is a pre-trained language model based on the BERT (Bidirectional Encoder Representations from Transformers) architecture. Unlike the original BERT, which is trained on general domain text, SciBERT is specifically pre-trained on a large corpus of scientific publications. This specialized training allows it to better understand and encode the nuances, terminology, and semantic relationships prevalent in scientific and technical texts, making it particularly effective for tasks involving scientific literature, such as chemical text descriptions.
3.2. Previous Works
Previous research in text-molecule retrieval can be broadly categorized by the molecular representation used and the alignment strategy:
3.2.1. Molecular Representations
- One-dimensional Sequences: Methods utilizing the Simplified Molecular Input Line Entry System (SMILES) string representation of molecules. Models like KV-PLM [40], Text+Chem T5 [7], and MolT5 [10] adapt sequence modeling techniques (often based on Transformer architectures) to align SMILES strings with textual semantics.
- Two-dimensional Molecular Graphs: Molecules are modeled as graphs where atoms are nodes and bonds are edges. MoMu [33] and MoleculeSTM [23] use contrastive learning to build a shared embedding space. MolCA [24] introduces a dedicated cross-modal projector. AMAN [43] employs adversarial learning for modality alignment. Some works, like MolFM [28] and GIT-Mol [21], integrate multi-modal auxiliary information (knowledge graphs, molecular images) to enrich representations.
- Three-dimensional Spatial Conformations: 3D-MoLM [19] extends graph neural networks to incorporate 3D coordinate information. ORMA [29] and Atomas [41] use mixed-granularity alignments, modeling features at atomic, motif (functional group), and molecular levels to improve semantic consistency.
3.2.2. Representation Learning with Diffusion Models
While diffusion models have seen significant advancements in various domains, their application to text-molecule retrieval is relatively new.
- Image Synthesis: UNIT-DDPM [31] applies Markov chain inference for unpaired image-to-image translation. ILVR [6] optimizes the DDPM generation process for high-quality image synthesis conditioned on reference images.
- Inverse Problems in Imaging: Kadkhodaie et al. [16] use stochastic gradient ascent with CNN-based implicit priors for tasks like denoising and super-resolution.
- 3D Point Cloud Generation: Luo et al. [27] model point cloud synthesis as a thermodynamic particle diffusion process.
- Video Understanding: DiffusionVMR [42] uses denoising generation strategies for video segment retrieval and boundary detection. Luo et al. [26] propose a text-guided diffusion model for video editing and cross-domain video moment retrieval.
3.3. Technological Evolution
The field of cross-modal retrieval, including text-molecule retrieval, has evolved significantly. Initially, approaches focused on developing sophisticated feature extractors for each modality and then projecting them into a common embedding space, often using contrastive learning or adversarial training to enforce alignment. This era was dominated by discriminative models that optimized conditional probabilities. The advent of large-scale pre-trained models (like BERT for text, and specialized models for molecules) further boosted performance by leveraging vast amounts of data for robust feature learning.
However, a key limitation identified in this paper is the failure of these discriminative approaches to fully capture the underlying data distributions and joint dependencies of queries and candidates, leading to issues with out-of-distribution generalization and balancing accuracy with diversity.
This paper marks a shift towards generative modeling in text-molecule retrieval. By adapting diffusion models, which are inherently designed to learn and generate data from complex distributions, DiffTMR aims to overcome these limitations. It bridges the gap by moving from merely distinguishing between relevant and irrelevant pairs to explicitly modeling the joint probability distribution of text and molecules, allowing for a more comprehensive understanding of their semantic associations. The integration of hierarchical alignment and dynamic perturbations further refines this generative approach.
3.4. Differentiation Analysis
Compared to previous works, DiffTMR introduces several core innovations:
- Generative Paradigm Shift: The most significant difference is the fundamental shift from a discriminative paradigm (modeling P(M|T)) to a generative paradigm (modeling P(M, T)) for text-molecule retrieval. Existing methods primarily focus on optimizing explicit conditional probability or embedding similarity, while DiffTMR leverages the reverse denoising process of diffusion models to implicitly learn the joint probability distribution. This enables better capture of statistical properties and intrinsic data structures for both modalities.
- Hierarchical Diffusion Alignment: Unlike fixed-point embedding mechanisms or single-level alignment strategies, DiffTMR employs a Hierarchical Diffusion Alignment Network (HDAN). This network performs both global semantic alignment (sentence-level text to global molecule) and fine-grained local alignment (word-level text to molecular motif level). This multi-granularity approach, coupled with progressive denoising, allows for more robust and adaptive matching.
- Dynamic Perturbation Embedding: The Text-Anchored Perturbation Representation (TAPR) module is a novel mechanism to enhance molecular representation diversity while maintaining semantic consistency. Instead of rigid embeddings, it introduces dynamic, text-anchored perturbations. This contrasts with methods that either use static embeddings or introduce unconstrained noise, addressing the accuracy-diversity trade-off more effectively.
- Discriminative-Generative Collaborative Optimization: DiffTMR uniquely combines both contrastive learning (discriminative) and diffusion modeling (generative) in its optimization. This allows it to leverage the strengths of both: explicit semantic alignment from contrastive learning and implicit distribution dependencies from generative modeling, leading to enhanced cross-modal retrieval accuracy and superior out-of-domain generalization. This contrasts with most prior works that rely solely on one paradigm.
- Robustness to Data Distribution Shifts: By modeling joint distributions and incorporating diffusion-theory-based timestep adaptive noise, DiffTMR is inherently designed to be more robust against data distribution shifts and out-of-domain (OOD) challenges, a limitation explicitly called out for traditional discriminative methods.
4. Methodology
4.1. Principles
The core idea behind DiffTMR is to reformulate the text-molecule retrieval task as a reverse denoising process, characteristic of diffusion models. Instead of learning a direct mapping from text to molecule (a discriminative approach), DiffTMR aims to learn the joint probability distribution of text and molecule pairs (P(t, m)). This approach allows the model to progressively generate an aligned joint distribution from noise, implicitly capturing bidirectional dependencies and statistical properties of the original data. The method is built on two main components:
- Text-Anchored Perturbation Representation (TAPR): To address suboptimal molecular representations and leverage rich textual information, this module dynamically perturbs molecular embeddings within a range determined by semantic coupling with the query text. This enhances diversity while maintaining semantic relevance.
- Hierarchical Diffusion Alignment Network (HDAN): This network uses a diffusion model to progressively denoise a noisy joint representation, gradually revealing cross-modal associations. It achieves this by modeling alignment at both global (semantic) and local (substructure) granularities.

These two components are collaboratively optimized with a discriminative loss (for explicit semantic alignment) and a generative loss (for implicit distribution modeling), aiming to achieve both high retrieval accuracy and strong generalization.
4.2. Core Methodology In-depth (Layer by Layer)
The overall DiffTMR framework is designed to model the joint distribution of text and molecules through a progressive denoising process. The data flow starts with independent encoders for each modality, followed by a dynamic perturbation mechanism for molecules, and then a hierarchical diffusion alignment network that learns the joint distribution.
Figure 2 (schematic): the DiffTMR framework for text-molecule retrieval, showing the text-anchored perturbation representation and the hierarchical diffusion alignment network, the relationships among global and local queries, candidate molecules, and noise, the progressive denoising process, and the final retrieval results.
The image above (Figure 2 from the original paper) provides an overview of the DiffTMR framework. It illustrates how an input query text and candidate molecules are processed. The text-anchored perturbation representation module generates molecular perturbation embeddings h_P^g guided by textual semantics. The hierarchical diffusion alignment network then takes these embeddings, along with text embeddings, and performs a reverse denoising process to learn the optimal text-molecule aligned data distribution X̂_0.
4.2.1. Feature Extraction
The model utilizes two independent encoders:
- Molecular Encoder (φ_m): For molecules, a three-layer Graph Convolutional Network (GCN) [37] is used. GCNs are well-suited for molecular data as they model molecules as graphs (atoms as nodes, bonds as edges), allowing for the aggregation of information from different granularities, such as functional groups, atomic clusters, and other chemical substructures.
- Text Encoder (φ_t): For textual descriptions, the SciBERT [1] pre-trained model is employed. SciBERT is chosen for its superior performance in encoding chemical text descriptions due to its pre-training on a large corpus of scientific publications.

Given a pair of text description t and molecule m, their feature embeddings are represented as:
$
\begin{array}{l}
h_i^w = \phi_t(t), \forall i \in [1, ..., N_w]; \quad h^t = [\mathrm{cls}]_w; \\
h_j^m = \phi_m(m), \forall j \in [1, ..., N_m]; \quad h^g = \frac{1}{N_m} \sum_{j=1}^{N_m} h_j^m
\end{array}
$
- h_i^w: The embedding of the i-th word in the text description; φ_t produces a sequence of word embeddings.
- N_w: The length of the text description (number of words).
- h^t: The global textual representation, obtained from the special [cls] token's embedding (a common practice in BERT-like models to represent the entire input sequence).
- h_j^m: The embedding of the j-th molecular motif (e.g., functional groups, atom substructures); φ_m produces a set of motif embeddings.
- N_m: The number of molecular motifs in the molecule.
- h^g: The global molecular representation, derived by averaging the motif embeddings.
4.2.2. Text-Anchored Perturbation Representation (TAPR)
The TAPR module aims to address two challenges: suboptimal molecular representations (since a pre-trained molecular encoder is not used) and the need to fully leverage textual information for diversity. It dynamically defines an optimized perturbation region for molecular representations, anchored on text embeddings. This implicitly pulls relevant text and molecule embeddings closer.
Figure 3 (schematic): the TAPR architecture with text and molecule encoders. The text encoder extracts word-level features and a sentence-level global feature; the molecular encoder extracts motif-level features and a global feature. A semantic coupling perception module determines a perturbation range for each molecule, from which Gaussian-sampled perturbations generate the text-guided molecular perturbation representation.
The image above (Figure 3 from the original paper) illustrates the TAPR architecture. Text features (word-level h_i^w and sentence-level global h^t) and molecule features (motif-level h_j^m and global h^g) are extracted. The semantic coupling perception module then calculates a perturbation range P for each molecule based on its semantic alignment with the text. Within this range, Gaussian-sampled perturbations generate a text-guided molecular perturbation representation h_P^g.
Given a global molecular embedding h^g, a reparameterization trick is employed to introduce a dynamic perturbation range P, and stochastic sampling is performed within this range to obtain the molecular perturbation embedding:
$
h_{\mathcal{P}}^{g} = h^{g} + \mathcal{P} \cdot \epsilon, \quad \epsilon \sim N(0, 1),
$
- h_P^g: The perturbed global molecular embedding.
- h^g: The original global molecular embedding from Equation (1).
- P: The dynamic perturbation range.
- ε: A standard Gaussian noise term, used with the reparameterization trick to allow gradients to flow through the sampling process.

The critical challenge is determining an appropriate perturbation range: too small, and diversity is constrained; too large, and semantic drift occurs. To address this, a semantic coupling perception module dynamically calibrates the perturbation intensity through fine-grained cross-modal interactions. First, it analyzes the semantic alignment between the molecule and individual text tokens:
$
c_i = \frac{h^g \cdot h_i^w}{\lVert h^g \rVert \lVert h_i^w \rVert}, \quad i = 1, ..., N_w
$
- c_i: The semantic coupling strength between the global molecular embedding h^g and the i-th word embedding h_i^w, calculated using cosine similarity.
- ‖·‖: The norm of a vector.

Based on the computed semantic coupling strengths, the molecular perturbation range is determined using a learnable linear layer Θ:
$
\mathcal{P} = \exp(C\Theta), \quad C = [c_1, c_2, ..., c_{N_w}],
$
- P: The dynamic perturbation range. The exp function ensures that P is positive.
- C: A tensor formed by concatenating all semantic coupling strengths c_i.
- Θ: A learnable linear layer (or matrix) that transforms the semantic coupling tensor into the perturbation range, allowing the model to learn complex, high-order interactions to calibrate the perturbation intensity.

During training, one perturbed molecular representation is sampled. During evaluation, multiple stochastic samples are drawn to identify the perturbed molecular embedding that best aligns with the text anchor, ensuring representation stability and accuracy (a minimal sketch of this mechanism follows).
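The sketch below illustrates the TAPR computation: cosine-similarity coupling c_i, a learnable linear map Θ producing the perturbation range P = exp(CΘ), and the Gaussian-perturbed molecular embedding. The output dimension of Θ and all tensor shapes are assumptions for illustration.

```python
# Text-Anchored Perturbation Representation (TAPR), simplified.
import torch
import torch.nn.functional as F

N_w, d = 32, 256
h_g = torch.randn(d)                                  # global molecular embedding
word_emb = torch.randn(N_w, d)                        # word embeddings h_i^w
theta = torch.nn.Linear(N_w, d)                       # learnable layer Theta (output size assumed = d)

c = F.cosine_similarity(h_g.unsqueeze(0), word_emb, dim=-1)   # coupling c_i, shape (N_w,)
P = theta(c).exp()                                    # positive perturbation range, shape (d,)
eps = torch.randn(d)                                  # eps ~ N(0, 1)
h_p_g = h_g + P * eps                                 # perturbed molecular embedding

# At inference, several eps samples could be drawn and the candidate closest to the
# text anchor kept, as described in the text above.
```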
The discriminative loss function is then applied. The similarity score s_{t,m} is computed between the global text embedding h^t and the molecular perturbation embedding h_P^g. The conditional probability is formulated as:
$
p(m|t; \phi_t, \phi_m) = \frac{\exp(s_{t,m}/\tau)}{\sum_{m' \in M} \exp(s_{t,m'}/\tau)},
$
- p(m|t; φ_t, φ_m): The probability of retrieving molecule m given text t, parameterized by the encoders φ_t and φ_m.
- s_{t,m}: The similarity score between text t and molecule m.
- τ: A temperature parameter, commonly used in contrastive learning to scale the logits before the softmax.
- The sum over all candidate molecules m' in the set M forms the normalization term of the softmax function.

The corresponding discriminative loss function is defined as:
$
\mathcal{L}_{\mathrm{Dis}} = \frac{1}{2}\mathbb{E}_{(t,m)}\big[\log p(m|t;\phi_t,\phi_m) + \log p(t|m;\phi_t,\phi_m)\big].
$
- L_Dis: The discriminative loss. This is typically a contrastive loss (or InfoNCE loss), designed to maximize the conditional likelihood of positive pairs.
- E_{(t,m)}: Expectation over all positive text-molecule pairs (t, m) in the training set.
- log p(m|t; φ_t, φ_m): The log-likelihood of correctly retrieving molecule m given text t.
- log p(t|m; φ_t, φ_m): The log-likelihood of correctly retrieving text t given molecule m. The loss averages these two directions to ensure bidirectional retrieval capability.

This discriminative approach, however, is limited because it focuses solely on conditional probabilities and does not model the intrinsic distributions p(t) and p(m), which can lead to poor generalization.
4.2.3. Hierarchical Diffusion Alignment Network (HDAN)
To overcome the limitations of discriminative methods and model the joint probability distribution P(t, m), DiffTMR introduces the Hierarchical Diffusion Alignment Network (HDAN). This network leverages the generative perspective of diffusion models to hierarchically model cross-modal alignment.
Figure 4 (schematic): the HDAN architecture. Multi-granularity representations are batch-concatenated, noise is added to the sentence-level text and molecular perturbation embeddings, and a global alignment distribution is produced via attention. A local distribution is built from word-level and motif-level features. The two are weighted, fused, and fed into a denoising network, which predicts a clean alignment distribution after iterative reverse denoising.
The image above (Figure 4 from the original paper) details the HDAN architecture. It shows how multi-granularity representations (sentence-level text h^t, molecular perturbation h_P^g, word-level h_i^w, motif-level h_j^m) are batched and processed. Noise is added to the global representations to generate a global alignment distribution E^global via attention. A local distribution E^local is derived from word-level and motif-level features. These two distributions are then weighted and fused into E^joint, which is fed into a denoising network. After reverse denoising iterations, the network predicts a clean alignment distribution X̂_0.
The objective function for the generative component is defined as:
$
\prod_{(t,m)\in\mathcal{Z}} p(m, t) \approx \psi\big(m, t, N(0, \mathrm{I})\big),
$
- p(m, t): The joint probability distribution of text-molecule pairs across the corpus Z.
- ψ(m, t, N(0, I)): The diffusion network's parameterized approximation of this joint distribution, starting from Gaussian noise N(0, I).

To enhance robustness, diffusion-theory-based timestep adaptive noise is introduced into the query (text) and key-value (molecule) projections. This simulates the progressive noise scheduling of diffusion models: low noise initially for high-confidence alignments, dynamically intensified noise later for more complex scenarios.

For text-to-molecule retrieval, the molecular perturbation embeddings are concatenated into a candidate set tensor H_P^g (of shape B × d, where B is the batch size and d is the embedding dimension) and projected into keys and values. Similarly, global text embeddings are concatenated into a text tensor H^t and projected into queries. Noise is introduced based on the time step k:
$
\begin{array}{r}
Q_t = W_Q\big(H^t + \mathrm{Proj}(\mathrm{Noise}_k)\big), \\
K_m = W_K\big(H_{\mathcal{P}}^g + \mathrm{Proj}(\mathrm{Noise}_k)\big), \\
V_m = W_V\big(H_{\mathcal{P}}^g + \mathrm{Proj}(\mathrm{Noise}_k)\big),
\end{array}
$
- Q_t, K_m, V_m: The projected query, key, and value matrices.
- W_Q, W_K, W_V: Learnable projection matrices.
- H^t: Batched global text embeddings.
- H_P^g: Batched perturbed global molecular embeddings.
- Proj(Noise_k): A projection function that maps the noise at time step k to the embedding dimension d, ensuring that noise is added in a dimensionally compatible way.

To model the joint data distribution of text-molecule alignment, an attention mechanism is employed, incorporating the data distribution at time step k into the attention weights to reflect the reverse denoising process:
$
\mathcal{E}^{\mathrm{global}} = \left(\mathrm{Softmax}(Q_t K_m^T) + X_k\right) \cdot V_m + \mathrm{DWC}(V_m)
$
- E^global: The global semantic alignment distribution.
- Softmax(Q_t K_m^T): The standard attention scores between text queries and molecule keys.
- X_k: The joint probability at the previous noise level for each candidate molecule. Higher values indicate greater confidence, dynamically adjusting the attention weights. This term is crucial for the diffusion aspect, reflecting the model's understanding of the data distribution at a given noise level.
- V_m: The value matrix for molecules.
- DWC(V_m): A Depthwise Convolution Module applied to the value matrix, which helps preserve feature diversity.

This represents the global (semantic-level) alignment distribution. To capture fine-grained local alignment (substructure-level), a token-fragment similarity matrix is computed:
$
S = [s_{ij}]^{N_w \times N_m} \quad \text{where} \quad s_{ij} = \frac{(h_i^w)^{\top} h_j^m}{\lVert h_i^w \rVert \lVert h_j^m \rVert}
$
- s_{ij}: An entry of the matrix capturing the cosine similarity between the i-th word embedding h_i^w and the j-th molecular motif embedding h_j^m.

The fine-grained local alignment distribution is then obtained by selecting the maximum fragment alignment score per word (max_j s_{ij}) and aggregating based on word importance weights:
$
\mathcal{E}^{\mathrm{local}} = \sum_{i=1}^{N_w} f_i^w \max_j s_{ij}, \quad \{f_i^w\}_{i=1}^{N_w} = \mathrm{Softmax}\big(\mathrm{MLP}(\{h_i^w\}_{i=1}^{N_w})\big),
$
- E^local: The local alignment distribution.
- f_i^w: The importance weight for the i-th word. These weights are learned by an MLP (Multi-Layer Perceptron) that processes the word embeddings and applies a Softmax to ensure they sum to 1.
- max_j s_{ij}: The maximum similarity score between the i-th word and any molecular motif.

Finally, the joint data distribution is obtained through a weighted fusion of the two alignment levels:
$
\mathcal{E}^{\mathrm{joint}} = \gamma \mathcal{E}^{\mathrm{local}} + (1 - \gamma)\mathcal{E}^{\mathrm{global}}
$
- E^joint: The final combined joint data distribution.
- γ: A balance coefficient that weighs the contribution of local vs. global alignment.

This is then fed into a denoising decoder (a multilayer perceptron (MLP) with ReLU activation) which, after multiple rounds of reverse denoising, predicts a clean alignment distribution X̂_0. The final retrieval results are produced by ranking based on X̂_0 (a simplified sketch of the global-local fusion is given below).
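The sketch below is a simplified, score-matrix version of the global/local fusion: a noisy attention map between text queries and molecule keys for the global term, a max-over-motifs word-weighted score for the local term, and a γ-weighted fusion. The depthwise-convolution branch and the denoising decoder are omitted, and all shapes, projections, and the identity-shaped X_k are illustrative assumptions.

```python
# Simplified global/local alignment fusion as in HDAN (score-matrix form).
import torch
import torch.nn.functional as F

B, d, N_w, N_m = 4, 256, 32, 12
H_t = torch.randn(B, d)                     # batched global text embeddings
H_p = torch.randn(B, d)                     # batched perturbed molecular embeddings
X_k = torch.zeros(B, B)                     # joint distribution estimate at step k
word_emb = torch.randn(B, N_w, d)           # word-level features
motif_emb = torch.randn(B, N_m, d)          # motif-level features
word_mlp = torch.nn.Linear(d, 1)
gamma = 0.4

# Global: attention scores between texts (queries) and molecules (keys), shifted by X_k.
E_global = F.softmax(H_t @ H_p.T, dim=-1) + X_k                   # (B, B)

# Local: word-motif cosine similarity, max over motifs, weighted by learned word importance.
w = F.normalize(word_emb, dim=-1)
m = F.normalize(motif_emb, dim=-1)
S = torch.einsum('bwd,cmd->bcwm', w, m)                           # (B, B, N_w, N_m)
f_w = F.softmax(word_mlp(word_emb).squeeze(-1), dim=-1)           # (B, N_w) word weights
E_local = (f_w.unsqueeze(1) * S.max(dim=-1).values).sum(-1)       # (B, B)

E_joint = gamma * E_local + (1 - gamma) * E_global                # fused alignment scores
```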
The reverse denoising process reconstructs the data distribution through a Markov chain:
$
\hat{X}_k = \sqrt{\alpha_k}\,\hat{X}_{k-1} + \sqrt{1 - \alpha_k}\,\epsilon
$
- X̂_k: The predicted data distribution at time step k.
- α_k: A noise scheduling coefficient that controls the amount of noise added or removed at each step.
- X̂_{k-1}: The estimated data distribution at the previous time step.
- ε: A noise term predicted by the network.

The generation loss function aims to optimize this noise prediction and data distribution modeling:
$
\mathcal{L}_{\mathrm{Gen}} = \mathbb{E}_{(t,m)\in Z}\Big[\mathrm{KL}\big(X_0 \,\|\, \psi(m, t, X_k)\big)\Big] + \mathbb{E}_{(m,t)\in Z}\Big[\mathrm{KL}\big(X_0 \,\|\, \psi(t, m, X_k)\big)\Big]
$
- L_Gen: The generative loss.
- E_{(t,m)∈Z}: Expectation over positive text-molecule pairs in the corpus.
- KL(X_0 ‖ ψ(m, t, X_k)): KL divergence between the true, clean alignment distribution X_0 and the one predicted by the diffusion network (given molecule m, text t, and the current noisy state X_k). This term optimizes the model to accurately predict the underlying data distribution for text-to-molecule retrieval.
- KL(X_0 ‖ ψ(t, m, X_k)): The symmetric KL divergence for molecule-to-text retrieval. This ensures the model learns bidirectional dependencies.
4.2.4. Overall Training Objective
The overall training loss is a sum of the generative and discriminative losses:
$
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{Gen}} + \mathcal{L}_{\mathrm{Dis}}
$
This joint optimization allows the model to leverage both:
- Discriminative Contrastive Learning: Strengthens explicit semantic alignment by maximizing conditional probability, focusing on core semantic regions.
- Generative Diffusion Modeling: Optimizes implicit distribution dependencies by modeling the joint probability, providing dynamic alignment signals and enhancing generalization.
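A minimal sketch of this joint objective: a symmetric contrastive loss (discriminative) plus KL-based alignment losses toward the clean distribution X_0 (generative). The denoising network is stubbed with random predictions, and all tensors and shapes are illustrative assumptions.

```python
# Joint objective L_total = L_Gen + L_Dis, simplified.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, mol_emb, tau=0.07):
    logits = F.normalize(text_emb, dim=-1) @ F.normalize(mol_emb, dim=-1).T / tau
    targets = torch.arange(text_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def generative_kl_loss(pred_logits, X0):
    # KL(X_0 || predicted alignment distribution); F.kl_div takes log-probs and probs.
    target = X0 / X0.sum(dim=-1, keepdim=True)
    return F.kl_div(F.log_softmax(pred_logits, dim=-1), target, reduction='batchmean')

B, d = 8, 256
text_emb, mol_emb = torch.randn(B, d), torch.randn(B, d)
X0 = torch.eye(B)                              # clean alignment: diagonal positives
pred_t2m = torch.randn(B, B)                   # stand-in for psi(m, t, X_k)
pred_m2t = torch.randn(B, B)                   # stand-in for psi(t, m, X_k)

loss_dis = contrastive_loss(text_emb, mol_emb)
loss_gen = generative_kl_loss(pred_t2m, X0) + generative_kl_loss(pred_m2t, X0)
loss_total = loss_gen + loss_dis               # L_total = L_Gen + L_Dis
```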
5. Experimental Setup
5.1. Datasets
The experiments were conducted on two publicly available benchmark datasets:
- ChEBI-20 [11]:
  - Source: From the ChEBI (Chemical Entities of Biological Interest) database, a dictionary of molecular entities focused on "small chemical compounds".
  - Scale: Contains 33,010 molecules, each paired with one or more textual descriptions (e.g., chemical names, synonyms, descriptions of biological roles).
  - Characteristics: Focuses on biologically relevant small molecules. The text descriptions often contain scientific and chemical terminology.
  - Splits: Divided into training, validation, and test sets in an 8:1:1 ratio.
  - Inference: During inference, test samples are retrieved from the entire dataset, meaning the model needs to distinguish the correct molecule from a large pool of candidates.
  - Example data sample: While not explicitly provided in the paper, a typical entry might be a molecule (represented as a SMILES string or graph) paired with text like "A carboxylic acid that is acetic acid in which one of the methyl hydrogens is substituted by a fluoro group." (for fluoroacetic acid).
- PCdes [40]:
  - Source: Sourced from PubChem [17], a public repository for information on chemical substances and their biological activities.
  - Scale: Contains 15,000 molecule pairs, which the paper implies are text-molecule pairs with descriptions drawn from PubChem records.
  - Characteristics: Broader chemical scope than ChEBI, potentially including a wider variety of chemical compounds and descriptions.
  - Splits: Divided into training, validation, and test sets in a 7:1:2 ratio.

Both datasets are standard benchmarks in text-molecule retrieval, providing a robust evaluation platform across different scales and types of chemical entities and their descriptions.
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate retrieval performance:
5.2.1. Hits@K (Recall@K)
- Conceptual Definition: Hits@K (also known as Recall@K) measures the proportion of queries for which the correct item (molecule for text-to-molecule, or text for molecule-to-text) is found within the top K retrieved results. A higher value indicates better retrieval accuracy.
- Mathematical Formula: $ \text{Hits@K} = \frac{\text{Number of queries where the correct item is in the top K results}}{\text{Total number of queries}} $
- Symbol Explanation:
  - Number of queries where the correct item is in the top K results: The count of instances where the ground-truth target is found within the first K items of the ranked retrieval list.
  - Total number of queries: The total number of retrieval attempts made during evaluation.
5.2.2. Mean Reciprocal Rank (MRR)
- Conceptual Definition: Mean Reciprocal Rank (MRR) is a statistic for evaluating ordered lists of responses. For a single query, the reciprocal rank is 1 divided by the rank of the first correct answer. If the correct answer is at rank 1, the reciprocal rank is 1; if at rank 2, it is 0.5, and so on. MRR is the average of these reciprocal ranks across all queries. A higher MRR indicates that relevant items are consistently ranked higher.
- Mathematical Formula: $ \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} $
- Symbol Explanation:
  - |Q|: The total number of queries.
  - rank_i: The rank position of the first relevant document for the i-th query. If no relevant document is found in the evaluated list, its reciprocal rank is considered 0.
5.2.3. Mean Rank (MR)
- Conceptual Definition: Mean Rank (MR) calculates the average rank of the correct item across all queries. Unlike MRR, it directly averages the rank positions. A lower MR indicates better performance, as it means correct items are, on average, found at lower (better) ranks.
- Mathematical Formula: $ \text{MR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \text{rank}_i $
- Symbol Explanation:
  - |Q|: The total number of queries.
  - rank_i: The rank position of the relevant document for the i-th query.
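The three metrics above can be computed directly from the ranks of the ground-truth items, as in this minimal sketch (the example ranks are illustrative):

```python
# Hits@K, MRR, and MR from a list of 1-indexed ground-truth ranks, one per query.
def retrieval_metrics(ranks, k=10):
    n = len(ranks)
    hits_at_k = sum(r <= k for r in ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    mr = sum(ranks) / n
    return hits_at_k, mrr, mr

# Example: four queries whose correct items were ranked 1, 3, 2, and 15.
print(retrieval_metrics([1, 3, 2, 15], k=10))   # (0.75, 0.475, 5.25)
```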
5.3. Baselines
The paper compares DiffTMR against a comprehensive set of baseline models, categorized into:
5.3.1. Task-Specific Models
These models are designed specifically for text-molecule retrieval or related cross-modal alignment tasks and often employ specialized techniques:
- MLP-Ensemble [11], GCN-Ensemble [11], All-Ensemble [11]: Early ensemble methods proposed by Edwards et al. for text-to-molecule retrieval, combining different representation learning architectures.
- MLP+Atten [11], MLP+FPG [14]: Other architectures from Edwards et al., involving MLP layers with attention mechanisms or fingerprint graphs.
- AMAN [43]: Adversarial Modality Alignment Network, which uses adversarial training to align molecular and textual representations, enhancing cross-modal retrieval by making the embeddings modality-invariant.
- Memory Bank [32]: A method that likely uses a memory bank mechanism to store and retrieve representations, often used in contrastive learning to manage a large number of negative samples.
- ORMA [29]: Optimal Transport-based Multi-grained Alignments, which explores optimal transport theory to achieve multi-grained alignment between different modalities, focusing on fine-grained and global relationships.
- CLASS (ORMA) [38]: An enhanced version building upon ORMA, likely optimizing performance and training efficiency.
5.3.2. Large-Scale Pretrained Multimodal Models
These models leverage large pre-training corpora and often employ hierarchical or knowledge-enhanced alignment mechanisms:
- SciBERT [1]: Used as a text encoder baseline; its performance is evaluated when directly applied to the retrieval task without specific molecular adaptation beyond basic fine-tuning.
- KV-PLM [40]: A knowledge-enhanced pre-trained language model that aligns molecular structures and biomedical text, bridging the gap between molecular graphs and natural language.
- MoMu [33]: A molecular multimodal foundation model that associates molecule graphs with natural language, often via contrastive learning on large datasets.
- MolFM [28]: A multimodal molecular foundation model that incorporates knowledge graphs to enrich molecular representations.
- MolCA [24]: A molecular graph-language modeling approach with a cross-modal projector and uni-modal adapter.
- MoleculeSTM [23]: A multi-modal molecule structure-text model for text-based retrieval and editing, utilizing self-supervised learning on large datasets.
- Atomas-base [41], Atomas-large [41]: Models based on hierarchical adaptive alignment for molecular and textual representations, processing features at multiple granularities. Atomas-large is a larger version, typically with more parameters or trained on more data.

These baselines are representative as they cover a range of approaches, from traditional ensemble methods to state-of-the-art contrastive learning and large-scale pre-trained multimodal models, allowing for a comprehensive evaluation of DiffTMR's performance.
5.4. Implementation Details
- Text Encoder: SciBERT [1] is used, with a maximum sequence length of 256 tokens.
- Molecular Graph Encoder: A three-layer GCN [37] is used, with an output dimension of 300.
- Optimization Strategy: A discriminative-generative joint optimization strategy is adopted.
- Learning Rates:
  - SciBERT: 3e-5
  - Semantic coupling layer (in TAPR): 1e-5
  - Other components: 1e-4
- Diffusion Denoising Network: Optimized using the Adam optimizer [18] with a base learning rate of 1e-3 and a cosine annealing scheduler [25].
- Diffusion Process: Fixed to 50 steps.
- Balance Coefficient (γ): Set to 0.4 for fusing global and local alignment distributions.
- Training: 60 epochs with a batch size of 32.
- Hardware: All experiments were conducted on an NVIDIA A40 GPU.
- Candidate Set: Assumed to be predefined.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that DiffTMR significantly outperforms existing baseline methods across various metrics and datasets, particularly excelling in out-of-domain generalization. This superior performance is attributed to its discriminative-generative collaborative optimization framework, which combines the precision of contrastive learning for explicit semantic alignment with the generalization strength of diffusion modeling for capturing implicit joint data distributions.
6.1.1. ChEBI-20 Dataset Performance
The following are the results from Table 1 of the original paper:
| Models | T→M Hits@1(↑) | T→M Hits@10(↑) | T→M MRR(↑) | T→M MR(↓) | M→T Hits@1(↑) | M→T Hits@10(↑) | M→T MRR(↑) | M→T MR(↓) |
|---|---|---|---|---|---|---|---|---|
| MLP-Ensemble [11] | 29.4% | 77.6% | 0.452 | 20.78 | ||||
| GCN-Ensemble [11] | 29.4% | 77.1% | 0.447 | 28.77 | − | |||
| All-Ensemble [11] | 34.4% | 81.1% | 0.499 | 20.21 | 25.2% | 74.1% | 0.408 | 21.77 |
| MLP+Atten [11] | 22.8% | 68.7% | 0.375 | 30.37 | ||||
| MLP+FPG [14] | 22.6% | 68.6% | 0.374 | 30.37 | − | |||
| AMAN [43] | 49.4% | 92.1% | 0.647 | 16.01 | 46.6% | 91.6% | 0.625 | 16.50 |
| Atomas-base [41] | 50.1% | 92.1% | 0.653 | 14.49 | 45.6% | 90.3% | 0.614 | 15.12 |
| Memory Bank [32] | 56.5% | 94.1% | 0.702 | 12.66 | 52.3% | 93.3% | 0.673 | 12.29 |
| ORMA [29] | 66.4% | 93.7% | 0.775 | 18.63 | 61.2% | 92.8% | 0.738 | 10.21 |
| CLASS (ORMA) [38] | 67.4% | 93.4% | 0.774 | 17.82 | 62.0% | 92.7% | 0.738 | 14.59 |
| DiffTMR (ours) | 72.8% | 96.5% | 0.823 | 16.24 | 66.7% | 96.3% | 0.784 | 10.07 |
- Text-to-Molecule Retrieval: DiffTMR achieves a Hits@1 of 72.8%, surpassing the previous state-of-the-art baseline CLASS (ORMA) (67.4%) by a substantial 5.4%. It also shows the highest Hits@10 at 96.5% and MRR at 0.823, indicating excellent performance in finding the correct molecule and ranking it highly.
- Molecule-to-Text Retrieval: Similarly, DiffTMR achieves a Hits@1 of 66.7%, exceeding CLASS (ORMA) (62.0%) by 4.7%. Its Hits@10 of 96.3% and MRR of 0.784 are also the highest, demonstrating strong bidirectional retrieval capabilities.
- Ambiguous Queries: The high Hits@10 scores (96.5% for text-to-molecule and 96.3% for molecule-to-text) highlight DiffTMR's strong semantic modeling and understanding capabilities, making it effective even for ambiguous queries where multiple relevant candidates might exist.
6.1.2. PCdes Dataset Performance
The following are the results from Table 2 of the original paper:
| Models | T→M Recall@1(↑) | T→M Recall@5(↑) | T→M Recall@10(↑) | T→M MRR(↑) | M→T Recall@1(↑) | M→T Recall@5(↑) | M→T Recall@10(↑) | M→T MRR(↑) |
|---|---|---|---|---|---|---|---|---|
| Pretrained Model + Finetuning | ||||||||
| SciBERT [1] | 16.3% | 33.9% | 42.6% | 0.250 | 15.0% | 34.1% | 41.7% | 0.239 |
| KV-PLM [40] | 20.6% | 37.9% | 45.7% | 0.292 | 19.3% | 37.3% | 45.3% | 0.281 |
| MoMu [33] | 24.5% | 45.4% | 53.8% | 0.343 | 24.9% | 44.9% | 54.3% | 0.345 |
| MolFM [28] | 29.8% | 50.5% | 58.6% | 0.396 | 29.4% | 50.3% | 58.5% | 0.393 |
| Pretrained Model + Zero-shot | ||||||||
| MolCA [24] | 35.1% | 62.1% | 69.8% | 0.473 | 38.0% | 66.8% | 74.5% | 0.508 |
| MoleculeSTM [23] | 35.8% | 39.5% | ||||||
| Atomas-base [41] | 39.1% | 59.7% | 66.6% | 0.473 | 37.9% | 59.2% | 65.6% | 0.478 |
| Atomas-large [41] | 49.1% | 68.3% | 73.2% | 0.578 | 46.2% | 66.0% | 72.3% | 0.555 |
| From-scratch | ||||||||
| ORMA [29] | 64.8% | 82.3% | 86.3% | 0.727 | 62.1% | 81.4% | 86.3% | 0.710 |
| DiffTMR (ours) | 69.0% | 85.6% | 89.2% | 0.755 | 66.8% | 84.3% | 88.7% | 0.738 |
- Text-to-Molecule Retrieval: DiffTMR achieves 69.0% Recall@1, outperforming ORMA (64.8%) by 4.2% and significantly surpassing other large-scale pretrained models. Its Recall@10 reaches 89.2%.
- Molecule-to-Text Retrieval: DiffTMR obtains 66.8% Recall@1, an improvement of 4.7% over ORMA (62.1%).
- Long-tail Retrieval Scenarios: The model maintains strong performance in long-tail scenarios (infrequent or rare molecular categories), achieving 89.2% Recall@10 and demonstrating superior cross-domain generalization and robustness compared to models like Atomas, KV-PLM, and MoMu.

The consistent superior performance of DiffTMR across both datasets and retrieval directions underscores the effectiveness of its discriminative-generative collaborative optimization framework. By jointly optimizing for explicit semantic alignment via contrastive learning and implicit distribution dependencies via diffusion modeling, the model achieves a robust and generalizable solution for cross-modal molecular retrieval.
6.2. Ablation Studies / Parameter Analysis
Ablation studies were conducted on the ChEBI-20 dataset to investigate the contribution of individual components and the effect of key hyper-parameters.
The following are the results from Table 3 of the original paper:
(a) Calculation Method of Perturbation Range

| Perturbation P | Hits@1↑ | Hits@10↑ | MRR↑ | MR↓ |
|---|---|---|---|---|
| exp((1/N_w)·Σc_i) | 63.4 | 90.1 | 0.792 | 17.3 |
| exp((θ/N_w)·Σc_i) | 66.1 | 93.2 | 0.805 | 16.7 |
| exp(CΘ) | 72.8 | 96.5 | 0.823 | 16.24 |

(b) Effect of Perturbation Sampling Frequency

| Frequency (F) | Hits@1↑ | Hits@10↑ | MRR↑ | MR↓ |
|---|---|---|---|---|
| 5 | 63.5 | 90.2 | 0.795 | 17.9 |
| 10 | 68.4 | 93.6 | 0.811 | 17.1 |
| 15 | 72.8 | 96.5 | 0.823 | 16.24 |
| 20 | 71.5 | 95.8 | 0.820 | 16.8 |

(c) Effect of Diffusion Steps on Hits@1 (rows: training steps; columns: evaluation steps)

| Train \ Eval | 10 | 50 | 100 | 500 |
|---|---|---|---|---|
| 10 | 71.3 | × | × | × |
| 50 | 71.7 | 72.8 | × | × |
| 100 | 70.3 | 71.2 | 70.5 | × |
| 1000 | 70.0 | 70.7 | 70.9 | 71.0 |
6.2.1. Perturbation Range Calculation
Table 3a compares three strategies for calculating the perturbation range P:
- Fixed parameter exp((1/N_w)·Σc_i): Uses a static, non-trainable average of semantic coupling strengths. It achieves a Hits@1 of 63.4%.
- Scalable parameter exp((θ/N_w)·Σc_i): Introduces a single trainable scalar parameter θ to dynamically adjust the perturbation magnitude. Its Hits@1 is 66.1%, an improvement over the fixed parameter that indicates the benefit of dynamic adjustment.
- Matrix transformation exp(CΘ): Uses a learnable matrix Θ to model high-order interactions of the semantic coupling tensor C. This strategy yields the best performance with a Hits@1 of 72.8%. The authors attribute its superiority to stronger nonlinear modeling capability, allowing for more nuanced and adaptive calibration of the perturbation intensity.
6.2.2. Perturbation Sampling Frequency
Table 3b examines the impact of molecular perturbation sampling frequency (F) during inference. As F increases from 5 to 15, the Hits@1 score improves from 63.5% to 72.8%.
- Increasing F means the model performs multiple stochastic samplings of the perturbed embedding and selects the one that best aligns with the text anchor. This allows for a more thorough exploration of the molecular semantic space, thereby boosting cross-modal alignment.
- However, increasing F beyond 15 (e.g., to 20) leads to a slight decrease in Hits@1 (71.5%), suggesting a point of diminishing returns where excessive sampling might introduce noise or computational overhead without further accuracy gains.
- An optimal balance between retrieval accuracy and computational cost is found at F = 15.
6.2.3. Diffusion Step
Table 3c investigates how the number of diffusion steps during training and evaluation affects Hits@1 performance in text-to-molecule retrieval.
- The results show that 50 diffusion steps during both training and evaluation achieve optimal performance (Hits@1 of 72.8%).
- Interestingly, fewer steps (e.g., 10) during training or evaluation result in lower performance. More steps (e.g., 100 or 1,000 for training, or 100/500 for evaluation) also tend to slightly decrease or plateau performance.
- The paper notes that 50 steps are sufficient for DiffTMR, which contrasts with image generation tasks (e.g., DDPM [15]) that often require 1,000 steps. This difference is attributed to the varying complexity of data distributions: image generation deals with high-dimensional pixel distributions, while text-molecule alignment focuses on semantic associations. The less complex representation space in text-molecule retrieval allows fewer steps to capture the key cross-modal relationships, while excessive iterations risk introducing extra noise and reducing alignment quality.
6.3. Out-of-domain Retrieval
To assess the generalization ability of models on unseen data, the authors adopted an in-domain training, out-of-domain testing paradigm. This means the model is trained on one source dataset (e.g., ChEBI-20) and then evaluated on a completely different, unseen target dataset (e.g., PCdes).
- Discriminative Methods' Limitations: Traditional discriminative methods like ORMA [29] and AMAN [43] exhibit limited generalization in this out-of-domain scenario. Even though ORMA performs significantly better than AMAN in in-domain retrieval, both show poor performance when transferred to an unseen target domain. This confirms the inherent limitation of discriminative models in capturing intrinsic data distributions, which hinders their adaptability to shifts in data.
- DiffTMR's Superiority: In contrast, DiffTMR consistently demonstrates strong performance across both in-domain and out-of-domain retrieval tasks. This robust generalization is a direct consequence of its generative modeling approach, which learns the joint probability distribution and thus captures the fundamental statistical properties of the data, making it more resilient to distribution shifts.

The following figure (Figure 5 from the original paper) visualizes the similarity distribution of models in both in-domain and out-of-domain scenarios.
Figure 5 (3D scatter plots): positive and negative pairs for the DiffTMR, ORMA, and AMAN models on the ChEBI-20 and PCdes datasets. Positive pairs are shown in blue and negative pairs in purple, illustrating the performance differences of the models on the molecular retrieval task.
The image above (Figure 5 from the original paper) visually represents the separation of positive and negative pairs in the embedding space for DiffTMR, ORMA, and AMAN on in-domain (ChEBI-20) and out-of-domain (PCdes) retrieval.
- Positive pairs (blue points) represent correctly matched text-molecule pairs.
- Negative pairs (purple points) represent mismatched pairs.

In the in-domain (ChEBI-20) scenario, all models show a reasonable separation between positive and negative pairs, with positives generally clustered together and negatives further away. However, in the out-of-domain (PCdes) scenario, discriminative baselines like ORMA and AMAN show significant overlap between positive and negative pairs, indicating poor generalization: their embeddings for correct pairs are no longer clearly distinct from incorrect ones. DiffTMR, however, maintains a much clearer separation between positive and negative pairs in the target domain (PCdes), comparable to its in-domain performance. This visualization strongly supports the claim that DiffTMR possesses stronger generalization and transferability when dealing with unseen data, a key advantage of its generative approach.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces DiffTMR, a novel framework for text-to-molecule cross-modal retrieval built upon generative diffusion models. Diverging from conventional discriminative approaches that model deterministic conditional probabilities, DiffTMR reconstructs the joint cross-modal distribution through a progressive denoising process. This fundamental shift allows the model to effectively capture the inherent many-to-many semantic correspondences between textual queries and molecular structures. The framework integrates global semantic alignment with local substructure matching via a Hierarchical Diffusion Alignment Network (HDAN), enabling fine-grained and robust retrieval. Additionally, the method's Text-Anchored Perturbation Representation (TAPR) dynamically enhances molecular representation diversity. The unique combined optimization of contrastive learning (for explicit alignment) and generative diffusion modeling (for implicit distribution learning) significantly boosts both alignment accuracy and generalization ability. Experimental evaluations on ChEBI-20 and PCdes benchmark datasets confirm that DiffTMR substantially outperforms state-of-the-art discriminative baselines, particularly demonstrating superior performance in challenging out-of-domain scenarios.
7.2. Limitations & Future Work
While the paper presents a significant advancement, it implicitly acknowledges areas for future exploration. The authors highlight that "the potential of diffusion models in the critical field of text-molecule retrieval remains unexplored. Specifically, existing studies have yet to address the challenges of modeling the complex semantics of molecular structures [10, 19, 29, 40] and the bidirectional reasoning inherent to chemical analysis, presenting a promising direction for future research." This implies that even with DiffTMR's generative approach, there is still scope for:
- More Complex Semantic Modeling: Deepening the model's understanding of intricate molecular structures and their descriptions, potentially by integrating more sophisticated graph representations or chemical knowledge graphs within the diffusion process.
- Enhanced Bidirectional Reasoning: Further improving the model's ability to perform bidirectional inference (text-to-molecule and molecule-to-text) by perhaps explicitly modeling conditional generation in both directions or refining the joint distribution learning to better capture causal relationships.
7.3. Personal Insights & Critique
DiffTMR offers a compelling paradigm shift by applying diffusion models to text-molecule retrieval, moving beyond the limitations of discriminative methods. The core strength lies in its ability to model the joint probability distribution, which is theoretically sound for handling out-of-distribution data and enhancing generalization. The hierarchical alignment and dynamic perturbation mechanisms are clever additions that address specific challenges in this domain, such as capturing multi-granularity information and balancing representation diversity with accuracy.
Strengths:
- Novelty: It is a pioneering work in applying diffusion models to text-molecule retrieval, opening a new avenue for research.
- Robustness & Generalization: The generative approach inherently improves robustness to data distribution shifts and delivers strong out-of-domain performance, which is crucial for real-world applications where data diversity is high.
- Comprehensive Alignment: The global-local hierarchical alignment strategy ensures that both high-level semantic meaning and fine-grained structural details are considered during retrieval.
- Dynamic Representation: Text-anchored perturbations are an innovative way to introduce flexibility and diversity into molecular representations without sacrificing semantic precision.
Potential Issues & Areas for Improvement:
- Computational Cost: Diffusion models are typically computationally intensive, especially during the reverse denoising process. While the paper notes that 50 steps suffice (compared to 1,000 for image generation), dynamic perturbation sampling at inference (F = 15 optimal) could still add significant overhead, especially with large candidate sets. Further work could explore more efficient sampling strategies or distillation techniques for faster inference.
- Interpretability: While the model performs well, the denoising process and the complex interactions within the diffusion alignment network might make it challenging to interpret why certain molecules are retrieved for specific texts. Enhancing interpretability would be valuable for drug discovery, where understanding the reasoning is critical.
- Scalability to Ultra-Large Datasets: While DiffTMR performs well on benchmark datasets, its scalability to datasets with millions of molecules (common in industrial drug discovery) and very long, complex text descriptions needs further investigation. The current design might face memory or computational bottlenecks with extremely large candidate pools.
- Explicit Chemical Knowledge Integration: While SciBERT and GCN implicitly leverage some chemical information, more explicit integration of chemical domain knowledge (e.g., reactivity rules, retrosynthesis pathways, specific drug-target interactions) within the diffusion process could potentially lead to even more chemically sound and accurate retrievals, especially for complex functional queries.
- Beyond Similarity: The diffusion process learns the joint distribution, which is richer than just similarity. Future work could explore how this learned distribution can be leveraged for tasks beyond retrieval, such as conditional molecule generation given text properties, or identifying novel molecular design principles from the generated distributions.

The methods and conclusions of DiffTMR could potentially be transferred to other cross-modal retrieval tasks that struggle with out-of-distribution generalization and the accuracy-diversity trade-off, especially in scientific domains where data distributions can be complex and varied (e.g., text-protein retrieval, text-materials-science data retrieval). The principle of combining discriminative and generative learning, along with hierarchical dynamic representations, offers a powerful blueprint for future multimodal AI systems.