Contrastive Incomplete Cross-Modal Hashing

Published:06/14/2024

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

CICH addresses incomplete cross-modal data by coordinating semantic similarities and aligning contextual correspondences, using prototypical semantic similarity and contrastive hashing modules to generate discriminative hash codes for robust retrieval.

Abstract

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 11, NOVEMBER 2024, p. 5823

Contrastive Incomplete Cross-Modal Hashing

Haoyang Luo, Zheng Zhang, and Liqiang Nie, Senior Member, IEEE

Abstract: The success of current deep cross-modal hashing admits a default assumption of the fully-observed cross-modal data. However, such a rigorous common policy is hardly guaranteed for practical large-scale cases, which directly disable the training of prevalent cross-modal retrieval methods with incomplete cross-modal instances and unpaired relations. The main challenges come from the collapsed semantic- and modality-level similarity learning as well as uncertain cross-modal correspondence. In this paper, we propose a Contrastive Incomplete Cross-modal Hashing (CICH) network, which simultaneously determines the cross-modal semantic coordination, unbalanced similarity calibration, and contextual correspondence alignment. Specifically, we design a prototypical semantic similarity coordination module to globally rebuild partially-observed cross-modal similarities under an asymmetric learning scheme. Meanwhile, a semantic-aware contrastive hashing module is establis

1. Bibliographic Information

1.1. Title

Contrastive Incomplete Cross-Modal Hashing

1.2. Authors

Haoyang Luo: Affiliated with the Harbin Institute of Technology, Shenzhen. At the time of publication, he was a master's student. His research interests include multi-modal learning and vision-language foundation models.
Zheng Zhang: Affiliated with the Harbin Institute of Technology, Shenzhen. He holds a PhD from the Harbin Institute of Technology and was a postdoctoral research fellow at The University of Queensland. His research focuses on multimedia content analysis and understanding.
Liqiang Nie: A Senior Member of IEEE and a professor at the Harbin Institute of Technology (Shenzhen). He received his PhD from the National University of Singapore (NUS). His research interests include multimedia computing and information retrieval.

1.3. Journal/Conference

The paper's running header identifies the venue as IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 36, No. 11, November 2024. TKDE is an IEEE journal covering data engineering, knowledge discovery, and information retrieval.

1.4. Publication Year

2024. The paper appears in the November 2024 issue of TKDE (Vol. 36, No. 11), and its reference list includes papers up to 2023.

1.5. Abstract

The abstract addresses a critical limitation in current deep cross-modal hashing methods: the assumption that training data is complete and fully paired. In real-world scenarios, data is often incomplete (missing modalities) and unpaired, which undermines model training. The authors identify three main challenges arising from this incompleteness: collapsed semantic similarity learning, collapsed modality-level similarity learning, and uncertain cross-modal correspondence. To tackle these issues, they propose the Contrastive Incomplete Cross-modal Hashing (CICH) network. CICH consists of three key modules:

A prototypical semantic similarity coordination module to rebuild cross-modal similarities globally using an asymmetric learning scheme.
A semantic-aware contrastive hashing module to correct for imbalanced similarities and generate discriminative hash codes.
A contextual correspondence alignment module that uses a dual contextual information bottleneck to capture shared knowledge and resolve correspondence uncertainty.

The authors claim this is the first work to apply contrastive learning to incomplete deep cross-modal hashing. Extensive experiments show that CICH outperforms state-of-the-art methods.

1.6. Original Source Link

Official Source: /files/papers/6900d1c9272c0b89d44bd6e0/paper.pdf
Publication Status: Published. The running header places the paper in IEEE TKDE, Vol. 36, No. 11 (November 2024), starting at page 5823.

2. Executive Summary

2.1. Background & Motivation

Core Problem: The primary problem is that most existing Cross-Modal Hashing (CMH) methods fail when faced with incomplete training data. CMH aims to learn compact binary codes (hashes) for data from different modalities (e.g., images and text) so that semantically similar items, regardless of their modality, have similar hash codes. The standard assumption is that the training set consists of perfectly paired data (e.g., every image has a corresponding text caption).
Importance and Challenges: In practice, large-scale datasets are rarely complete. For instance, images may lack captions, or videos may lack audio. This "incompleteness" presents three major challenges that break conventional CMH methods:
1. Collapsed Semantic Similarity: With missing data, the distribution of semantic labels across modalities becomes imbalanced, leading to a "semantic shift." This makes it difficult to learn a consistent mapping between semantics and features.
2. Collapsed Modality Similarity: The lack of paired instances means the model cannot directly learn the similarity relationship between different modalities, widening the "heterogeneous gap" (the intrinsic difference in data structure between modalities).
3. Uncertain Cross-Modal Correspondence: Without a one-to-one mapping for every instance, the model cannot be certain which image feature corresponds to which text feature, even if they share the same semantic label. This "correspondence uncertainty" hinders knowledge transfer between modalities.
Entry Point / Innovative Idea: The paper tackles these three challenges simultaneously within a unified framework called Contrastive Incomplete Cross-Modal Hashing (CICH). Instead of only filling in the missing data, CICH rebuilds the broken similarity structures and aligns the uncertain correspondences with dedicated modules. It also brings contrastive learning into the incomplete hashing setting, which the authors claim had not been done before.

The following figure from the paper illustrates these challenges:

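This figure is a schematic of three obstacles in incomplete cross-modal hashing: (a) the ideal case, in which perfectly paired cross-modal data form semantic clusters; (b) partially observed data, which blur semantic and modality relationships; (c) uncertain instance correspondence, which harms one-to-one feature alignment and cross-modal knowledge interaction.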

2.2. Main Contributions / Findings

The paper presents the following main contributions:

A Principled Framework for Incomplete Cross-Modal Hashing (ICMH): The authors propose CICH, the first supervised contrastive learning framework designed specifically for CMH with incomplete data. It jointly handles semantic calibration, similarity calibration, and feature restoration.
Robust Similarity Learning on Incomplete Data: To learn high-quality hash codes, CICH includes two novel loss functions:
- Prototypical Semantic Similarity Coordination (PSSC): This loss reconstructs the global semantic similarity structure by using learnable "prototypes" for each class, acting as a bridge between modalities.
- Semantic-aware Contrastive Hashing (SaCH): This loss stabilizes the learning process by inferring and propagating relationships between samples, creating a more balanced set of "positive pairs" for contrastive learning, even for incomplete or recovered data.
Elimination of Correspondence Uncertainty: To handle the uncertain mapping between modalities, the paper introduces a Contextual Correspondence Alignment (CCA) loss. This module uses a novel dual contextual information bottleneck to:
- Extract the maximum shared information between modalities.
- Use contextual information (from neighboring samples) to recover missing features more accurately, thus solidifying the instance-level correspondence.
State-of-the-Art Performance: The experimental results demonstrate that CICH significantly outperforms existing methods across various datasets and levels of data incompleteness, confirming its effectiveness and robustness.

3.1. Foundational Concepts

3.1.1. Cross-Modal Hashing (CMH)

Cross-Modal Hashing (CMH) is a technique used for efficient retrieval across different data types (modalities), such as images, text, video, and audio. The goal is to find semantically relevant items from one modality when given a query from another modality (e.g., using an image to find related text descriptions).

Hashing: The "hashing" part involves learning functions that map high-dimensional data (like image features) into low-dimensional binary codes (e.g., $[-1, 1, 1, -1, ...]$ ). These binary codes are computationally cheap to store and compare. The similarity between two items can be quickly estimated by calculating the Hamming distance (the number of positions at which the corresponding bits are different) between their hash codes.
Cross-Modal: The "cross-modal" challenge is to bridge the heterogeneity gap, which refers to the intrinsic structural differences between modalities. For example, a pixel-based image representation is fundamentally different from a word-based text representation. CMH methods aim to learn a common Hamming space where semantically similar items from different modalities are mapped to nearby hash codes.
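
The Hamming comparison described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names and the two 8-bit example codes are made up for the example. It also shows the standard identity $d_H = (r - \boldsymbol{b}_1^\top \boldsymbol{b}_2)/2$ for $\{-1, +1\}$ codes of length $r$, which is why inner products can stand in for Hamming distance during training.

```python
import numpy as np

def hamming_distance(b1, b2):
    """Count the positions where two {-1, +1} codes differ."""
    b1 = np.asarray(b1)
    b2 = np.asarray(b2)
    return int(np.sum(b1 != b2))

def hamming_from_inner_product(b1, b2):
    """Same distance via d_H = (r - <b1, b2>) / 2 for codes of length r."""
    b1 = np.asarray(b1)
    b2 = np.asarray(b2)
    r = b1.shape[0]
    return (r - int(b1 @ b2)) // 2

# Hypothetical 8-bit codes for one image and one text item.
image_code = np.array([-1, 1, 1, -1, 1, 1, -1, -1])
text_code = np.array([-1, 1, -1, -1, 1, 1, 1, -1])

print(hamming_distance(image_code, text_code))            # 2
print(hamming_from_inner_product(image_code, text_code))  # 2
```

The two codes differ at exactly two positions, so both routes give the same distance.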

3.1.2. Contrastive Learning

Contrastive Learning is a popular self-supervised learning paradigm. Its core idea is to learn representations by pulling similar samples closer together and pushing dissimilar samples farther apart in the feature space.

Mechanism: It typically works with pairs or triplets of data. Given an "anchor" sample, a "positive" sample (semantically similar to the anchor, e.g., an augmented version of the same image), and several "negative" samples (dissimilar to the anchor), the model is trained to maximize the similarity score between the anchor and the positive while minimizing the similarity scores between the anchor and all negatives.
InfoNCE Loss: A common objective function for contrastive learning is the InfoNCE (Noise-Contrastive Estimation) loss. For an anchor sample $x$ with its positive counterpart $x^+$ and a set of negative samples $\{x_i^-\}$ , the loss is formulated as: $ \mathcal{L} = -\log \frac{\exp(\text{sim}(f(x), f(x^+))/\tau)}{\exp(\text{sim}(f(x), f(x^+))/\tau) + \sum_{i} \exp(\text{sim}(f(x), f(x_i^-))/\tau)} $
- $f(\cdot)$ is the encoder network that generates the representation.
- $\text{sim}(\cdot, \cdot)$ is a similarity function, often the cosine similarity.
- $\tau$ is a temperature hyperparameter that scales the similarity scores, controlling the "hardness" of the negative samples.
Minimizing this loss is equivalent to correctly classifying the positive sample among a set of negatives.
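
As a rough illustration of the formula above, the following sketch evaluates InfoNCE for one anchor on already-encoded vectors. It assumes cosine similarity for $\text{sim}(\cdot,\cdot)$; the function names, the toy 2-D vectors, and the value $\tau = 0.5$ are illustrative choices, not settings from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(anchor, positive, negatives, tau=0.5):
    """InfoNCE loss for one anchor, given encoded representations f(x)."""
    pos = np.exp(cosine_sim(anchor, positive) / tau)
    neg = sum(np.exp(cosine_sim(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

anchor = np.array([1.0, 0.0])
close = np.array([1.0, 0.0])   # representation aligned with the anchor
far = np.array([-1.0, 0.0])    # representation opposite to the anchor
other = np.array([0.0, 1.0])   # orthogonal negative

loss_good = info_nce(anchor, close, [far, other])  # positive is close
loss_bad = info_nce(anchor, far, [close, other])   # positive is far
print(loss_good < loss_bad)  # True
```

The loss is small when the positive sits closer to the anchor than every negative, and grows when a negative is more similar than the positive.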

3.1.3. Information Bottleneck (IB) Principle

The Information Bottleneck (IB) principle, introduced by Tishby et al., provides a theoretical framework for learning compressed yet informative representations.

Core Idea: Given a source variable $X$ and a target variable $Y$ , the goal is to learn a compressed representation $Z$ of $X$ that is as informative as possible about $Y$ . This creates a "bottleneck" $Z$ that filters out irrelevant information from $X$ while preserving the information relevant to $Y$ .
Optimization Objective: This is formalized as a constrained optimization problem: maximize the mutual information $I(Z, Y)$ while constraining the mutual information $I(Z, X)$ . The objective can be written as: $ \mathcal{L}_{IB} = I(Z, X) - \beta I(Z, Y) $
- $I(\cdot, \cdot)$ denotes mutual information.
- The goal is to minimize this objective, which means compressing $X$ (minimizing $I(Z, X)$ ) while retaining information about $Y$ (maximizing $I(Z, Y)$ ).
- $\beta$ is a Lagrange multiplier that controls the trade-off between compression and prediction.
In this paper, the IB principle is adapted to find a compressed representation $z$ that captures the common knowledge between two modalities.

3.2. Previous Works

The paper categorizes related work into three areas: Cross-Modal Hashing, Contrastive Learning, and Incomplete Cross-Modal Retrieval.

3.2.1. Cross-Modal Hashing

Shallow CMH: These methods operate on pre-extracted features and typically follow a two-step process: feature extraction and then hash code learning. Examples include:
- DCH [3]: Learns modality-specific transformations to generate common binary codes.
- SePH [16]: Constructs an affinity matrix from labels to preserve semantic similarity.
- SCRATCH [12]: Uses matrix factorization to learn latent representations.
Limitation: They are vulnerable to data incompleteness and cannot perform end-to-end optimization.
Deep CMH: These methods use deep neural networks to learn features and hash codes in an end-to-end manner, offering better representation power. Examples include:
- DCMH [4]: A pioneering deep method that preserves pairwise similarity based on a given semantic matrix.
- SSAH [5]: Uses adversarial training to learn a label encoder, forcing the generated hash codes to be consistent with semantic information.
- DADH [13]: Employs adversarial training with a weighted cosine triplet loss to better rank similarities.
Limitation: These methods all assume the training data is fully observed and paired, making them unsuitable for the ICMH problem.

3.2.2. Contrastive Learning in Retrieval

CIBHash [9] and CIMON [10]: Unsupervised hashing methods that use contrastive learning. However, they assume complete data and one-to-one correspondence.
UCMFH [11]: Uses a multi-modal fusion transformer with contrastive learning, which is not practical for incomplete modalities.
Wu et al. [6]: A supervised method that applies contrastive learning between modal codes and their corresponding labels. The paper argues this pointwise correspondence is insufficient for robust modal alignment.
Limitation: None of these methods are designed to handle the similarity reconstruction required in the incomplete data setting.

3.2.3. Incomplete Cross-Modal Retrieval

This is the most relevant area. Existing methods try to handle missing modalities but have significant limitations.

Generative Methods:
- DAVAE [7]: Uses Variational Autoencoders (VAEs) to generate missing features. However, recovering precise features from limited distribution knowledge is difficult.
- Wu et al. [24]: Propose modality-cyclic generative models to synthesize missing features.
Prototype-based Methods:
- MCCN [8]: An unsupervised method that uses prototypes to guide similarity learning and recover features. It neglects the class imbalance caused by incomplete data.
- PAN [25]: Constructs class-specific prototypes and restores features via propagation from similar samples.
Limitation: The paper argues that these methods fail to learn balanced and stable similarity between modalities and overlook the importance of cross-modal transferability (i.e., maintaining correspondence knowledge).

3.3. Technological Evolution

The field of CMH has evolved significantly:

Shallow CMH: Early methods focused on linear projections or matrix factorization on hand-crafted features. They were computationally efficient but had limited representation power.
Deep CMH: The rise of deep learning led to end-to-end models (DCMH, SSAH) that could learn much richer, non-linear representations, leading to significant performance gains.
The Incompleteness Challenge: As CMH was applied to larger, real-world datasets, the "complete data" assumption proved to be a major bottleneck. This gave rise to the Incomplete Cross-Modal Hashing (ICMH) problem.
Early ICMH Solutions: Initial attempts (DAVAE, MCCN) focused on either generating missing features or using unsupervised clustering with prototypes. These methods were a step in the right direction but were not robust enough.
This Paper's Position: CICH represents the next step in this evolution. It is the first to introduce supervised contrastive learning to the ICMH problem, proposing a more principled framework that not only "fills in the gaps" but also actively calibrates the distorted similarity structures and aligns the uncertain correspondences.

3.4. Differentiation Analysis

CICH distinguishes itself from prior works in several key ways:

vs. Standard Deep CMH (DCMH, SSAH): Unlike these methods, CICH does not assume complete data. It is explicitly designed to handle missing modalities and unpaired instances.
vs. Generative ICMH Methods (DAVAE): Instead of just generating missing features from a learned distribution, CICH's CCA module uses an information bottleneck and contextual neighbors to recover features that are not only plausible but also maximally preserve cross-modal common knowledge, making them more informative for retrieval.
vs. Prototype-based ICMH Methods (MCCN, PAN): CICH goes beyond simple prototype-based similarity. Its PSSC module uses prototypes in an asymmetric learning scheme to globally coordinate semantics. More importantly, it complements this with the SaCH module, which explicitly performs inter-modality contrastive learning to stabilize and balance similarity learning, something other methods neglect.
vs. Contrastive Hashing Methods (Wu et al. [6]): Standard contrastive hashing methods either are unsupervised or rely on simple sample-label pairs. CICH's SaCH is unique because it performs contrast between modalities and uses a relation propagation strategy ( $A^{uv}$ ) to construct a richer and more stable set of positive pairs, making it robust to the imbalances caused by incomplete data.

The overall architecture of the proposed CICH model is shown below.

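This figure is the paper's schematic of the Contrastive Incomplete Cross-Modal Hashing method. It shows the framework of the semantic similarity coordination, semantic-aware contrastive hashing, and contextual correspondence alignment modules, and illustrates how the CICH network handles incomplete cross-modal data.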

4. Methodology

4.1. Principles

The core principle of CICH is to address the three key challenges of incomplete cross-modal hashing (collapsed semantic similarity, collapsed modal similarity, and uncertain correspondence) through a multi-pronged, cooperative learning framework. The model is built on three main components that work together:

Prototypical Semantic Similarity Coordination (PSSC): To fix the collapsed semantic similarity, this module rebuilds a global, stable semantic structure using learnable class prototypes as intermediaries.
Semantic-aware Contrastive Hashing (SaCH): To fix the collapsed modal similarity and imbalanced learning, this module introduces a novel contrastive learning scheme between modalities, which infers and expands positive relationships to stabilize training.
Contextual Correspondence Alignment (CCA): To fix the uncertain correspondence and recover missing data, this module uses a dual information bottleneck to learn maximally shared information and leverages contextual clues from neighbors for more informative feature generation.

4.2. Core Methodology In-depth (Layer by Layer)

First, we define the problem. The dataset $O$ consists of three parts: a fully paired set $\mathcal{F} = \{(\boldsymbol{u}_i, \boldsymbol{v}_i, l_i)\}_{i=1}^{n_f}$ , a text-only set $\mathcal{U} = \{(\boldsymbol{u}_i, l_i)\}_{i=n_f+1}^{n_f+n_u}$ , and an image-only set $\mathcal{V} = \{(\boldsymbol{v}_i, l_i)\}_{i=n_f+n_u+1}^{n}$ . Here, $\boldsymbol{u}_i$ is text, $\boldsymbol{v}_i$ is an image, and $l_i$ is the label. The goal is to learn hash functions $H^u$ and $H^v$ that generate binary codes $\boldsymbol{b}^u$ and $\boldsymbol{b}^v$ .

4.2.1. Prototypical Semantic Similarity Coordination (PSSC)

Motivation: In an incomplete setting, learning similarity directly between modalities is unreliable due to missing pairs. Furthermore, the distribution of labels within the available data for each modality is imbalanced. To overcome this, PSSC uses the globally available set of all labels as a stable anchor. It introduces learnable prototypical codes for each semantic class to act as a bridge.

Method:

Prototypical Codes: An intermediary code network $g(\cdot)$ generates a "prototypical code" $\boldsymbol{\psi}_i$ for each label $l_i$ . This is implemented as a linear projection: $\boldsymbol{\psi}_i = g(l_i) = W_p^\top l_i$ , where $W_p$ is a learnable weight matrix. These prototypes represent the ideal hash code for each class in a shared semantic space.
Asymmetric Similarity Learning: The model learns the similarity between the hash codes of available modalities (e.g., text $\boldsymbol{h}^u$ ) and the full set of prototypical codes $\boldsymbol{\psi}$ . This is framed as a negative log-likelihood optimization problem. The probability of two samples being similar ( $S_{ij}=1$ ) is modeled using the sigmoid function of their inner product. The objective is to maximize the likelihood of the true similarity matrix $S$ .
Loss Function: The learning is done asymmetrically. For a mini-batch of image/text samples, their similarity to all $n$ prototypical codes in the dataset is calculated. This allows the model to learn from the global semantic structure, even with a small batch of incomplete data. The final loss for PSSC is: $ \min_{W_p, \theta^{u,v}} \mathcal{L}^{PSSC} = \sum_{k \in M} (\mathcal{L}_P^k + \mathcal{L}_Q^k) $ $ = \sum_{k \in M} \left( - \sum_{i \in I_k} \sum_{j=1}^n \left( S_{ij} \Lambda_{ij}^k - \log\left(1 + e^{\Lambda_{ij}^k}\right) \right) + \sum_{i \in I_k} \|\boldsymbol{h}_i^k - \boldsymbol{b}_i\|_F^2 \right) \text{ s.t. } \boldsymbol{B} \in \{-1, 1\}^{r \times n} $
- $k \in M = \{p, u, v\}$ represents the modality (prototype, text, image).
- $I_k$ is the index set for a mini-batch of samples from modality $k$ .
- $\boldsymbol{h}_i^k$ is the continuous hash-like representation for sample $i$ of modality $k$ .
- $\boldsymbol{\psi}_j = g(l_j)$ is the prototypical code generated from the label of the $j$ -th sample, so the inner sum runs over all $n$ prototypical codes.
- $\Lambda_{ij}^k = \frac{1}{2}(\boldsymbol{h}_i^k)^\top \boldsymbol{\psi}_j$ is the scaled inner product similarity between the $i$ -th modal sample and the $j$ -th prototype.
- $S_{ij} = 1$ if sample $i$ and sample $j$ share a label, and 0 otherwise.
- The first term is the negative log-likelihood for preserving semantic similarity. The second term is a quantization loss that encourages the continuous representations $\boldsymbol{h}_i^k$ to be close to binary codes $\boldsymbol{b}_i$ .
  
  This asymmetric scheme distills knowledge from the stable, globally-visible prototypes ( $\boldsymbol{\psi}$ ) into the modality-specific hash codes ( $\boldsymbol{h}^u, \boldsymbol{h}^v$ ), effectively rebuilding the disrupted semantic similarity.
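
To make the two PSSC terms concrete, the sketch below evaluates the likelihood term and the quantization term for a single modality $k$ on toy data. It rests on several assumptions that are not from the paper: labels are multi-hot row vectors (so $\boldsymbol{\psi} = L W_p$ in row form), $S_{ij} = 1$ whenever two label vectors overlap, and the binary code $\boldsymbol{b}_i$ is simply taken as $\text{sign}(\boldsymbol{h}_i)$ rather than optimized separately. The function name `pssc_loss` and all array values are hypothetical.

```python
import numpy as np

def pssc_loss(h, batch_labels, all_labels, W_p):
    """Likelihood + quantization terms of the PSSC loss for one modality.

    h            : (m, r) continuous codes of the mini-batch
    batch_labels : (m, c) multi-hot labels of the mini-batch
    all_labels   : (n, c) multi-hot labels of the whole training set
    W_p          : (c, r) prototype projection, psi_j = W_p^T l_j
    """
    psi = all_labels @ W_p                                   # (n, r) prototypical codes
    S = (batch_labels @ all_labels.T > 0).astype(float)      # (m, n) shared-label indicator
    lam = 0.5 * h @ psi.T                                    # Lambda_ij
    nll = -np.sum(S * lam - np.logaddexp(0.0, lam))          # stable log(1 + e^Lambda)
    b = np.where(h >= 0, 1.0, -1.0)                          # assumption: b = sign(h)
    quant = np.sum((h - b) ** 2)
    return float(nll + quant)

# Two classes, four training samples, 4-bit codes (toy values).
all_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
W_p = np.array([[1, 1, 1, 1], [-1, -1, -1, -1]], dtype=float)
batch_labels = all_labels[:2]

h_aligned = np.array([[1, 1, 1, 1], [-1, -1, -1, -1]], dtype=float)
h_flipped = -h_aligned

print(pssc_loss(h_aligned, batch_labels, all_labels, W_p)
      < pssc_loss(h_flipped, batch_labels, all_labels, W_p))  # True
```

Codes that agree with the prototypes of their own class yield a much lower loss than codes pointing at the wrong prototype, which is the behavior the asymmetric scheme relies on.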

4.2.2. Semantic-aware Contrastive Hashing (SaCH)

Motivation: While PSSC reconstructs semantic similarity, it doesn't explicitly enforce similarity between modalities. Standard contrastive learning is not ideal because the lack of pairs creates a severe imbalance (few positive pairs, many negative pairs). SaCH is designed to fix this by creating a more robust and balanced contrastive relationship.

Method:

Inter-Modality Contrast: Instead of contrasting samples with labels, SaCH explicitly contrasts samples from one modality (e.g., text) with samples from another (e.g., image).
Calibrated Contrastive Relationship: The core innovation is to move beyond a simple one-to-one positive pair strategy. For a given anchor sample, SaCH considers all other samples that share the same semantics as positive pairs. This is controlled by a semantic adjacency matrix $A^{uv}$ . The contrastive loss for a text anchor $i$ against a set of image samples is: $ \mathcal{L}_i^{u \to v} = \sum_{r \in I_v} -A_{ir}^{uv} \log \frac{e^{\delta(\frac{1}{2}(\boldsymbol{h}_i^u)^\top \boldsymbol{h}_r^v) / \tau}}{\sum_{j \in I_v} e^{\delta(\frac{1}{2}(\boldsymbol{h}_i^u)^\top \boldsymbol{h}_j^v) / \tau}} $
- $I_u$ and $I_v$ are index sets for text and image samples in a batch.
- $\delta(\Phi) = \frac{1}{1 + e^{-\Phi}}$ is the sigmoid function, used to scale the inner product.
- $A_{ir}^{uv}$ is the key element. If image $r$ is a positive match for text $i$ , $A_{ir}^{uv}=1$ , and the loss encourages their similarity. Otherwise, $A_{ir}^{uv}=0$ . For complete pairs, $A^{uv}$ is simply the corresponding block of the semantic similarity matrix $S$ .
Relation Propagation for Recovered Samples: For incomplete samples that have been "recovered" (feature generation explained in CCA), their semantic relationship is uncertain. To stabilize learning for these samples, the adjacency matrix is redefined using a propagation strategy: $ \boldsymbol{A}^{uv} = \mathbb{I}\left( (S^{uv} S^{uv\top}) S^{uv} \right) $
- $\mathbb{I}(\cdot)$ is an element-wise indicator function (1 for positive elements, 0 otherwise).
- This formula essentially performs multi-hop relation propagation. $(S^{uv} S^{uv\top})$ finds samples that are 2-hop neighbors (neighbors of neighbors), and multiplying by $S^{uv}$ again propagates this expanded similarity. This allows a recovered sample to learn not just from its direct semantic matches but also from a wider, more stable neighborhood of related samples.
Final Loss: The total SaCH loss is the sum of the contrastive losses in both directions (text-to-image and image-to-text): $ \mathcal{L}^{SaCH} = \sum_{i \in I_u} \mathcal{L}_i^{u \to v} + \sum_{j \in I_v} \mathcal{L}_j^{v \to u} $
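
The relation propagation and the text-to-image half of the loss can be sketched as follows. This is an illustrative reading of the two formulas above on a toy batch, not the authors' implementation: the function names, the 3x3 similarity block, the 4-bit codes, and $\tau = 0.5$ are all assumptions made for the example.

```python
import numpy as np

def propagate_adjacency(S_uv):
    """A^{uv} = I((S S^T) S): mark every pair reachable through shared matches."""
    return ((S_uv @ S_uv.T) @ S_uv > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sach_u_to_v(h_u, h_v, A_uv, tau=0.5):
    """Sum over text anchors i of L_i^{u->v}, with positives selected by A_uv."""
    logits = sigmoid(0.5 * h_u @ h_v.T) / tau                       # (n_u, n_v)
    log_prob = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return float(-np.sum(A_uv * log_prob))

# Toy text-to-image similarity block: text 0 matches images 0 and 1,
# text 1 matches image 1 only, text 2 matches image 2 only.
S_uv = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 1]], dtype=float)
A_uv = propagate_adjacency(S_uv)
print(A_uv)  # text 1 additionally gains image 0 as a positive

h_u = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [-1, -1, -1, -1]], dtype=float)
h_v = h_u.copy()
print(sach_u_to_v(h_u, h_v, A_uv) < sach_u_to_v(h_u, -h_v, A_uv))  # True
```

In the toy block, text 1 and text 0 both match image 1, so propagation also links text 1 to image 0. The full SaCH loss would add the symmetric image-to-text term computed the same way with the roles of the two modalities swapped.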

4.2.3. Contextual Correspondence Alignment (CCA)

Motivation: Due to missing pairs, the model struggles to learn a consistent instance-level mapping between modalities. CCA aims to align the modalities and recover missing features in an information-theoretic way.

Method:

Information Bottleneck for Modal Transfer: The method introduces a latent variable $z_i^{v \to u}$ as a compressed "bottleneck" between a visual feature $\boldsymbol{f}_i^v$ and a textual feature $\boldsymbol{f}_i^u$ . The goal is to maximize the information $z$ contains about the target modality feature ( $\boldsymbol{f}^u$ ) while minimizing the information it retains from the source modality feature ( $\boldsymbol{f}^v$ ). This forces $z$ to only capture the shared cross-modal knowledge. The objective is: $ \max \mathcal{L}_{CA} = I(z^{v \to u}, \boldsymbol{f}^u) - \beta I(z^{v \to u}, \boldsymbol{f}^v) $
- $\beta$ controls the trade-off between predicting the target and compressing the source. This is optimized using a variational approximation, leading to an objective involving a reconstruction term and a KL-divergence regularization term.
Dual Contextual Recovery: Simply recovering a feature from the shared knowledge $z$ is insufficient, as it ignores modality-specific context. CCA uses a dual approximation strategy for recovery:
- An approximator $q(\boldsymbol{f}_i^u | \boldsymbol{z}_i^{v \to u})$ reconstructs the target feature using only the shared information.
- A second, context-conditioned approximator $q'(\boldsymbol{f}_i^u | \boldsymbol{z}_i^{v \to u}, \mathcal{N}^t(\boldsymbol{f}_i^v))$ reconstructs the target feature using both the shared information $z$ and the features of its top-K nearest text neighbors $\mathcal{N}^t(\boldsymbol{f}_i^v)$ . This injects crucial contextual information into the recovery process.
Final Loss: The full CCA loss combines the variational information bottleneck objective with this dual recovery scheme: $ \mathcal{L}^{CCA} = - \frac{1}{|I_v|} \sum_{i \in I_v} \mathbb{E}_{\varepsilon \sim p(\varepsilon)} \left( \log(q(\boldsymbol{f}_i^u | \boldsymbol{z}_i^{v \to u})) + \log(q'(\boldsymbol{f}_i^u | \boldsymbol{z}_i^{v \to u}, \mathcal{N}^t(\boldsymbol{f}_i^v))) \right) + \beta \mathbb{E}_{p(\boldsymbol{f}_i^v)} [KL(p(\boldsymbol{z}_i^{v \to u} | \boldsymbol{f}_i^v), r(\boldsymbol{z}_i^{v \to u}))] $
- The first two terms are the reconstruction losses (negative log-likelihoods) from the two approximators, $q$ and $q'$ .
- The third term is the KL-divergence from the IB principle, which regularizes the bottleneck. Since $\mathcal{L}^{CCA}$ is minimized whereas $\mathcal{L}_{CA}$ is maximized, the signs are the negation of that objective: the log-likelihood terms are subtracted and the $\beta$ -weighted KL term is added.
- $\boldsymbol{z}_i^{v \to u}$ is obtained via the reparameterization trick from the source feature $\boldsymbol{f}_i^v$ .
- For training, the features of missing modalities are generated using the learned mapper $q'$ .
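
The reparameterization step and the KL regularizer mentioned above have a simple closed form under common variational-IB assumptions. The sketch below assumes a diagonal Gaussian encoder $p(\boldsymbol{z} | \boldsymbol{f}^v) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and a standard normal $r(\boldsymbol{z})$ ; the analysis does not state which distributions the paper uses, so both choices and the function names are assumptions.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for a batch of two visual features, 3-D bottleneck.
mu = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
log_var = np.zeros_like(mu)

z = reparameterize(mu, log_var, rng)       # stochastic bottleneck samples, shape (2, 3)
print(kl_to_standard_normal(mu, log_var))  # [0.  0.5]
```

The first row already equals the prior, so its KL is zero; shifting one mean coordinate by 1 costs 0.5 nats. In a full training loop, `z` would feed the two approximators $q$ and $q'$ , and the KL value would be weighted by $\beta$ .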

4.2.4. Optimization

The overall loss function for CICH combines the three components: $ \mathcal{L}_{CICH} = \mathcal{L}^{PSSC} + \alpha \mathcal{L}^{SaCH} + \delta \mathcal{L}^{CCA} $

$\alpha$ and $\delta$ are hyperparameters that balance the contributions of the three modules. The model is trained in an alternating optimization manner, as outlined in the paper's Algorithm 1.

5. Experimental Setup

5.1. Datasets

The authors used five widely-used benchmark datasets for cross-modal retrieval.

The following are the statistics from Table I of the original paper:

Dataset	Train	Test	Database	Total	Text dim.	Classes
MIRFLICKR-25K	18,015	2,000	20,015	18,015	1,386	24
MS COCO	82,081	5,000	87,081	82,081	2,000	80
NUS-WIDE-10K	8,000	2,000	10,000	8,000	1,000	10
IAPR TC12	18,000	2,000	20,000	18,000	2,885	275
NUS-WIDE	50,000	2,085	52,085	50,000	1,000	21

Incompleteness Settings: To simulate real-world scenarios, three levels of incompleteness were created for the training set:
- Easy: 50% paired instances, 25% single images, 25% single texts.
- Medium: 30% paired instances, 35% single images, 35% single texts.
- Hard: 10% paired instances, 45% single images, 45% single texts.
Evaluation Partitions:
- Discard: Train and test only on paired samples (baseline setting).
- Enhance: Use incomplete samples for training, but the retrieval database contains only complete samples.
- Extend: Use incomplete samples for both training and retrieval validation (default setting for most experiments).
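
One way to simulate the three incompleteness levels is to shuffle the training indices and cut them by the stated ratios. The sketch below is only a plausible reading of the setting: the paper's actual sampling procedure is not described in this analysis, and the names `SETTINGS` and `split_incomplete` are hypothetical.

```python
import numpy as np

# (paired, image-only, text-only) fractions for each level, as listed above.
SETTINGS = {
    "easy": (0.50, 0.25, 0.25),
    "medium": (0.30, 0.35, 0.35),
    "hard": (0.10, 0.45, 0.45),
}

def split_incomplete(n, level, seed=0):
    """Randomly partition n training indices into paired / image-only / text-only."""
    paired_frac, image_frac, _ = SETTINGS[level]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_paired = int(round(paired_frac * n))
    n_image = int(round(image_frac * n))
    paired = perm[:n_paired]
    image_only = perm[n_paired:n_paired + n_image]
    text_only = perm[n_paired + n_image:]   # remainder keeps the partition exact
    return paired, image_only, text_only

paired, image_only, text_only = split_incomplete(1000, "hard")
print(len(paired), len(image_only), len(text_only))  # 100 450 450
```

For the image-only indices the text features would be masked during training, and vice versa for the text-only indices.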

5.2. Evaluation Metrics

5.2.1. Mean Average Precision (mAP)

Conceptual Definition: mAP is the standard metric for evaluating the performance of retrieval systems. It measures the quality of a ranked list of retrieved items. It is the mean of the Average Precision (AP) scores calculated for each query. A higher mAP indicates that the model consistently ranks relevant items higher in the retrieval list.
Mathematical Formula: The AP for a single query is defined as: $ AP = \frac{1}{R} \sum_{j=1}^{N} P(j) \times \text{rel}(j) $
- $R$ is the total number of relevant documents in the database for the query.
- $N$ is the total number of documents in the database.
- $P(j)$ is the precision at rank $j$ (the fraction of relevant items among the top $j$ retrieved items).
- $\text{rel}(j)$ is an indicator function that is 1 if the item at rank $j$ is relevant, and 0 otherwise.
The mAP is then the average of AP scores over all queries. The paper uses a slightly different but equivalent formulation (Formula 12).
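
The AP formula above translates directly into code. In this sketch each query is represented by a binary relevance list over the full ranked database, so $R$ is simply the sum of that list; the function names and the two toy rankings are illustrative.

```python
import numpy as np

def average_precision(rel):
    """AP for one query; rel[j-1] is 1 if the item at rank j is relevant."""
    rel = np.asarray(rel, dtype=float)
    R = rel.sum()
    if R == 0:
        return 0.0                                   # no relevant items for this query
    ranks = np.arange(1, rel.size + 1)
    precision_at_j = np.cumsum(rel) / ranks          # P(j) for every rank
    return float(np.sum(precision_at_j * rel) / R)

def mean_average_precision(rel_lists):
    """mAP: mean of the per-query AP values."""
    return float(np.mean([average_precision(r) for r in rel_lists]))

query_a = [1, 0, 1, 0]   # relevant at ranks 1 and 3 -> AP = (1 + 2/3) / 2
query_b = [0, 1]         # relevant at rank 2        -> AP = 1/2
print(round(average_precision(query_a), 4))                  # 0.8333
print(round(mean_average_precision([query_a, query_b]), 4))  # 0.6667
```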

5.2.2. Precision-Recall (PR) Curve

Conceptual Definition: A PR curve visualizes the trade-off between precision and recall for different ranking thresholds.
- Precision: The fraction of retrieved items that are relevant ( $TP / (TP + FP)$ ).
- Recall: The fraction of relevant items that are successfully retrieved ( $TP / (TP + FN)$ ).
A curve that is higher and closer to the top-right corner indicates better performance.

5.2.3. Normalized Discounted Cumulative Gain (NDCG@topK)

Conceptual Definition: NDCG is a ranking quality metric that is particularly useful for tasks with multiple levels of relevance (though here, relevance is binary). It evaluates the "gain" of a document based on its position in the result list. The gain is "discounted" at lower ranks, meaning relevant items found at the top of the list are more valuable than those found at the bottom. NDCG@K measures this for the top K results.
Mathematical Formula: $ \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K} $ where $ \text{DCG}@K = \sum_{i=1}^{K} \frac{\text{rel}_i}{\log_2(i+1)} $
- $\text{rel}_i$ is the relevance score of the result at position $i$.
- $\text{IDCG}@K$ is the Ideal Discounted Cumulative Gain, which is the DCG score of a perfect ranking (all relevant items ranked at the top). This normalization ensures the score is between 0 and 1.
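
The two formulas above translate directly into code. The following is a minimal sketch with hypothetical function names (`dcg_at_k`, `ndcg_at_k`); it assumes the relevance list passed in covers every candidate item, so that sorting it in descending order yields the ideal ranking used for IDCG.

```python
import math


def dcg_at_k(relevance, k):
    """DCG@K: sum of rel_i / log2(i + 1) over the top-k positions (i starts at 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance, k):
    """NDCG@K = DCG@K / IDCG@K, where IDCG uses the best possible ordering."""
    ideal = sorted(relevance, reverse=True)  # all relevant items ranked first
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # convention chosen here when nothing is relevant
    return dcg_at_k(relevance, k) / idcg


if __name__ == "__main__":
    print(ndcg_at_k([0, 1, 1], k=3))
```

A ranking that already places every relevant item first scores exactly 1, while pushing relevant items down the list lowers the score because of the logarithmic discount.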

5.3. Baselines

The paper compares CICH against 10 representative baseline methods:

Shallow Hashing Methods:
- DCH [3], JIMFH [33], SCRATCH [12].
Deep Hashing Methods (Incomplete-Agnostic):
- DCMH [4], SSAH [5], AGAH [34], DADH [13], DCHMT [35].
Incomplete Cross-Modal Retrieval Methods (Real-valued):
- MCCN [8], PAN [25]. These are strong baselines as they are explicitly designed for incomplete data, although they output real-valued features instead of hash codes.

6. Results & Analysis

6.1. Core Results Analysis

The main results are presented in Tables II and III, which compare the mAP of CICH against baselines on five datasets under hard, medium, and easy incompleteness settings.

The following are the results from Table II of the original paper:

Datasets	Methods	hard(10%,45%,45%)			medium(30%,35%,35%)			easy(50%,25%,25%)			Δ (easy - hard)(%)
Datasets	Methods	i→t	t→i	mean	i→t	t→i	mean	i→t	t→i	mean	i→t	t→i	mean
MIRFLICKR-25K	DCH	0.597	0.622	0.610	0.713	0.683	0.698	0.733	0.692	0.713	13.60	7.00	10.30
	JIMFH	0.560	0.589	0.575	0.589	0.598	0.594	0.599	0.600	0.600	3.90	1.10	2.5
	SCRATCH	0.795	0.733	0.764	0.813	0.737	0.775	0.828	0.748	0.788	3.30	1.50	2.40
	DCMH	0.890	0.817	0.854	0.891	0.826	0.859	0.894	0.829	0.862	0.40	1.20	0.80
	SSAH	0.895	0.822	0.859	0.895	0.832	0.864	0.896	0.845	0.871	0.10	2.30	1.20
	AGAH	0.617	0.626	0.622	0.841	0.811	0.826	0.876	0.830	0.853	25.90	20.40	23.15
	DADH	0.876	0.814	0.856	0.883	0.828	0.859	0.884	0.835	0.864	0.80	2.1	1.45
	PAN	0.896	0.807	0.852	0.901	0.806	0.854	0.903	0.811	0.857	0.70	0.40	0.55
	DCHMT	0.860	0.802	0.832	0.862	0.805	0.835	0.864	0.809	0.835	0.40	0.70	0.55
	CICH(ours)	0.917	0.845	0.881	0.918	0.845	0.882	0.924	0.847	0.886	0.70	0.20	0.45
MS COCO	DCH	0.527	0.628	0.578	0.535	0.643	0.589	0.539	0.645	0.592	1.20	1.70	1.45
	JIMFH	0.414	0.461	0.438	0.462	0.485	0.474	0.479	0.492	0.486	6.50	3.10	4.8
	SCRATCH	0.539	0.600	0.570	0.603	0.668	0.636	0.609	0.671	0.640	7.00	7.10	7.05
	DCMH	0.681	0.710	0.696	0.689	0.718	0.704	0.695	0.724	0.710	1.40	1.40	1.4
	SSAH	0.613	0.623	0.618	0.615	0.638	0.627	0.625	0.645	0.635	1.20	2.20	1.7
	AGAH	0.576	0.600	0.588	0.595	0.618	0.607	0.621	0.634	0.628	4.51	3.40	3.955
	DADH	0.670	0.672	0.678	0.689	0.688	0.689	0.692	0.689	0.691	2.20	1.70	1.30
	MCCN	0.617	0.631	0.624	0.621	0.636	0.629	0.636	0.649	0.643	1.90	1.80	1.85
	DCHMT	0.656	0.673	0.665	0.662	0.670	0.666	0.668	0.670	0.669	1.20	-0.30	0.45
	CICH(ours)	0.708	0.739	0.724	0.723	0.744	0.734	0.739	0.764	0.752	3.10	2.50	2.8
NUS-WIDE-10K	DCH	0.507	0.585	0.546	0.531	0.599	0.565	0.553	0.598	0.576	4.60	1.30	2.95
	JIMFH	0.132	0.205	0.169	0.154	0.207	0.181	0.154	0.213	0.184	2.20	0.80	1.50
	SCRATCH	0.607	0.571	0.589	0.606	0.581	0.594	0.619	0.584	0.602	1.20	1.30	1.25
	DCMH	0.605	0.521	0.563	0.614	0.541	0.578	0.622	0.545	0.584	1.70	2.40	2.05
	SSAH	0.514	0.496	0.505	0.563	0.534	0.549	0.579	0.555	0.567	6.50	5.90	6.20
	AGAH	0.304	0.356	0.330	0.475	0.413	0.444	0.616	0.574	0.595	31.20	21.80	26.50
	DADH	0.569	0.588	0.579	0.598	0.594	0.596	0.610	0.613	0.612	4.10	2.50	3.30
	MCCN	0.563	0.537	0.550	0.588	0.550	0.569	0.587	0.555	0.571	2.40	1.80	2.10
	DCHMT	0.625	0.542	0.584	0.649	0.553	0.601	0.671	0.560	0.616	4.60	1.80	3.20
	CICH(ours)	0.628	0.602	0.615	0.644	0.604	0.624	0.639	0.607	0.623	1.10	0.50	0.80

Key Observations:

Superior Performance: CICH achieves the highest mean mAP on all three datasets (MIRFLICKR-25K, MS COCO, NUS-WIDE-10K) at all three incompleteness levels. The only per-direction exceptions in Table II are on NUS-WIDE-10K i→t, where DCHMT scores higher in the medium (0.649 vs. 0.644) and easy (0.671 vs. 0.639) settings. CICH keeps its lead in the most challenging hard setting, which indicates robustness.
Incompleteness Resistance: The metric $Δ$ measures the performance drop from the easy to the hard setting, in percentage points. A lower $Δ$ means the method is more robust to increasing data incompleteness. CICH achieves the smallest mean $Δ$ on MIRFLICKR-25K (0.45) and NUS-WIDE-10K (0.80), indicating strong stability. On MS COCO its mean $Δ$ (2.8) is larger than that of several baselines, although its mAP remains the highest at every level. In contrast, methods like AGAH suffer a severe performance drop (e.g., 23.15 on MIRFLICKR-25K and 26.50 on NUS-WIDE-10K), showing their vulnerability.
Advantage over Incomplete-specific Methods: CICH outperforms PAN and MCCN, which are state-of-the-art real-valued methods designed for incomplete data. This is significant because hashing methods are typically expected to perform worse than real-valued methods due to the information loss from binarization. This result validates the advanced representation learning capability of CICH.
Effectiveness of Proposed Modules: The lower mAP of incomplete-agnostic methods like DCMH and SSAH suggests that applying standard CMH directly to incomplete data is suboptimal. The margin of CICH over these methods supports the effectiveness of its specialized modules (PSSC, SaCH, CCA) in rebuilding similarity and aligning correspondence.
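
To make the $Δ$ column concrete, the short sketch below recomputes it for two rows of Table II on MIRFLICKR-25K. The helper name `delta_points` is hypothetical, and the assumption that the "mean" $Δ$ is the average of the i→t and t→i drops is inferred from the table values rather than stated in the text.

```python
def delta_points(easy_map, hard_map):
    """Drop from the easy to the hard setting, in percentage points."""
    return (easy_map - hard_map) * 100.0


# (hard i->t, hard t->i, easy i->t, easy t->i), copied from Table II, MIRFLICKR-25K
rows = {
    "AGAH": (0.617, 0.626, 0.876, 0.830),
    "CICH": (0.917, 0.845, 0.924, 0.847),
}

deltas = {}
for method, (hard_i2t, hard_t2i, easy_i2t, easy_t2i) in rows.items():
    d_i2t = delta_points(easy_i2t, hard_i2t)
    d_t2i = delta_points(easy_t2i, hard_t2i)
    deltas[method] = (d_i2t, d_t2i, (d_i2t + d_t2i) / 2)

if __name__ == "__main__":
    for method, (d_i2t, d_t2i, d_mean) in deltas.items():
        print(f"{method}: i->t {d_i2t:.2f}, t->i {d_t2i:.2f}, mean {d_mean:.2f}")
```

Up to floating-point rounding, this reproduces the tabulated values of 25.90, 20.40, and 23.15 for AGAH and 0.70, 0.20, and 0.45 for CICH.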

The PR-curves and NDCG histograms further corroborate these findings, showing CICH's dominance.

The following figure shows the PR-curves, where CICH (red line) consistently outperforms other deep CMH methods.

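The image is a chart of six bar plots comparing the NDCG@top100 of CICH and other methods on the Flickr-25k and MS COCO datasets under different sampling rates. The horizontal axis is the retrieval direction and the vertical axis is NDCG@top100; CICH performs best.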

The following figure shows the NDCG@top100 results, where CICH achieves the highest bars in all scenarios.

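Fig. 5. The average of i→t and t→i mAP values w.r.t. different values of parameters on the MIRFLICKR-25K dataset. The image consists of three line charts showing how different parameter values affect the average i→t and t→i mAP on the MIRFLICKR-25K dataset. The three panels correspond to the parameters $t$, $\kappa$, and $\delta$; the horizontal axis is the parameter value and the vertical axis is the mAP, showing how model performance changes as each parameter is adjusted.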

6.2. Data Presentation (Tables)

The following are the results from Table III of the original paper:

dataset	methods	hard	medium	easy	∆ (easy - hard)(%)
IAPR TC-12	DCMH	0.612	0.610	0.619	0.76
	SSAH	0.542	0.579	0.579	3.68
	AGAH	0.457	0.465	0.482	2.59
	DADH	0.597	0.600	0.605	0.80
	DCHMT	0.585	0.600	0.603	1.80
	CICH(ours)	0.643	0.644	0.645	0.20
NUS-WIDE	DCMH	0.755	0.768	0.769	1.40
	SSAH	0.684	0.701	0.690	0.63
	AGAH	0.770	0.771	0.796	2.52
	DADH	0.735	0.756	0.770	3.56
	CICH(ours)	0.798	0.799	0.801	0.30

The following are the results from Table IV of the original paper, showing performance with different hash code lengths (16 and 64 bits):

Methods	16 bits			64 bits
Methods	hard	medium	easy	hard	medium	easy
DCMH	0.850	0.852	0.852	0.848	0.853	0.860
SSAH	0.860	0.861	0.866	0.872	0.884	0.890
AGAH	0.687	0.690	0.763	0.821	0.830	0.849
DADH	0.839	0.843	0.852	0.854	0.865	0.868
DCHMT	0.828	0.829	0.831	0.840	0.838	0.842
CICH(ours)	0.862	0.865	0.866	0.889	0.890	0.891

These results show that CICH matches or exceeds every baseline at both bit lengths (it ties SSAH at 0.866 in the 16-bit easy setting and leads everywhere else), confirming its stability across code lengths.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Ablation Study

The ablation study in Table V systematically evaluates the contribution of each module in CICH.

The following are the results from Table V of the original paper:

Datasets	Variants	hard		medium		easy
Datasets	Variants	i→t	t→i	i→t	t→i	i→t	t→i
MIRFLICKR-25K	Pair	0.743	0.769	0.818	0.807	0.862	0.824
	PSSC	0.896	0.831	0.899	0.833	0.901	0.837
	PSSC+CCA	0.907	0.842	0.906	0.843	0.915	0.845
	CICH	0.917	0.845	0.918	0.845	0.924	0.847
MS COCO	Pair	0.552	0.562	0.573	0.582	0.604	0.587
	PSSC	0.704	0.737	0.707	0.742	0.703	0.744
	PSSC+CCA	0.706	0.736	0.714	0.742	0.712	0.744
	CICH	0.708	0.739	0.723	0.744	0.739	0.764
NUS-10K	Pair	0.321	0.288	0.459	0.424	0.557	0.482
	PSSC	0.620	0.597	0.638	0.601	0.635	0.603
	PSSC+CCA	0.627	0.600	0.642	0.601	0.639	0.604
	CICH	0.628	0.602	0.644	0.604	0.639	0.607

Pair vs. PSSC: The Pair baseline (standard pairwise hashing loss) performs poorly, especially in the hard setting. Replacing it with PSSC leads to a large performance jump (e.g., from 0.743 to 0.896 on MIRFLICKR-25K i→t in the hard setting). This confirms that PSSC's global semantic coordination is crucial for rebuilding similarity.
PSSC vs. PSSC+CCA: Adding the CCA module improves performance in most settings; the gains are clearest on MIRFLICKR-25K, while on MS COCO t→i the scores are essentially unchanged (e.g., 0.737 vs. 0.736 in the hard setting). This suggests that aligning correspondence and recovering features via the contextual information bottleneck helps remedy the uncertain pairing issue.
PSSC+CCA vs. CICH: The full CICH model, which includes SaCH, achieves the best results. This validates that SaCH's semantic-aware contrastive learning effectively stabilizes modal similarity learning and helps generate more discriminative hash codes.

6.3.2. Parameter Analysis

The authors analyze the sensitivity of hyperparameters $\alpha$, $\beta$, $\tau$, $K$, and $\delta$.

The following figure shows the parameter sensitivity analysis.

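Fig. 6. The running time (left) and time efficiency (right) analysis of CICH. The image is a chart showing the running time (left) and time efficiency (right) analysis of CICH on the Flickr-25k dataset. The left panel plots the avg. mAP over time, and the right panel compares the time efficiency of CICH and SSAH.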

The results show that CICH performs stably within a reasonable range for each parameter. For example, performance is high for temperature $\tau \in [0.25, 5.0]$ and number of neighbors $K \in [5, 15]$.
The optimal performance is achieved with a relatively large $\alpha$ (5.0) and a very small $\beta$ ($10^{-4}$). This suggests that the SaCH module (controlled by $\alpha$) plays a significant role in similarity recovery, while a lower constraint on the information bottleneck (small $\beta$) allows the latent variable $z$ to retain more information, leading to better performance.

6.3.3. Visualization Analysis

The t-SNE visualization in Figure 7 provides a qualitative assessment of the learned hash codes.

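The image is a schematic of the Contrastive Incomplete Cross-Modal Hashing method from the paper. It shows the framework structure of the semantic similarity coordination, semantic-aware contrastive hashing, and contextual correspondence alignment modules, and illustrates how the CICH network handles incomplete cross-modal data.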

In the plot for CICH, samples from the same class (same color) form tight, distinct clusters. Image features (dots) and text features (plus marks) within the same class are well-mixed.
In contrast, the plots for DCMH, DADH, and AGAH show much more scattered and overlapping clusters. For AGAH, modalities are poorly aligned. For DCMH and DADH, the clusters are dispersed.
This visualization powerfully demonstrates that CICH learns hash codes that are both semantically discriminative (different classes are well-separated) and modality-aligned (different modalities of the same class are clustered together).

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces a novel and effective solution to the challenging problem of Incomplete Cross-Modal Hashing (ICMH). The authors identify three core issues caused by incomplete data (collapsed semantic similarity learning, collapsed modal similarity learning, and uncertain cross-modal correspondence) and propose the Contrastive Incomplete Cross-modal Hashing (CICH) framework to address them. By integrating Prototypical Semantic Similarity Coordination (PSSC), Semantic-aware Contrastive Hashing (SaCH), and Contextual Correspondence Alignment (CCA), CICH successfully rebuilds a stable learning environment. It is the first work to successfully leverage contrastive learning for this task. Extensive experiments show that CICH outperforms state-of-the-art methods in retrieval accuracy and is more robust to varying levels of data incompleteness.

7.2. Limitations & Future Work

The paper does not explicitly state its limitations, but some potential areas for improvement can be inferred:

Computational Complexity: The PSSC module computes similarity against all $n$ prototypical codes in the dataset, which could be computationally expensive for extremely large-scale datasets. While the paper shows good time efficiency in early epochs (Fig. 6), the scalability of this global coordination step to datasets with millions of instances might be a concern.
Hyperparameter Sensitivity: The model has several key hyperparameters ( $\alpha, \beta, \delta, \tau, K$ ) that require careful tuning via cross-validation. A more adaptive or less sensitive mechanism could improve its practicality.
Multi-Modality Extension: The paper focuses on a two-modality case (image and text). While the authors state it can be "easily adapted" to multiple modalities, the practical challenges of managing incompleteness across three or more modalities (e.g., video, audio, text) would be significantly more complex and warrant further investigation.

Future work could explore more efficient global similarity coordination mechanisms, develop adaptive hyperparameter tuning strategies, and explicitly validate the framework's performance in settings with more than two modalities.

7.3. Personal Insights & Critique

This paper presents a very well-structured and methodologically sound piece of research. It is a strong contribution to the field of cross-modal retrieval.

Strengths:

Clear Problem Formulation: The paper does an excellent job of defining the ICMH problem and breaking it down into three concrete, understandable challenges (the "impediments" in Fig. 1). This provides a clear motivation for the proposed solution.
Novel and Coherent Framework: The CICH framework is elegant. Each of the three modules (PSSC, SaCH, CCA) directly targets one of the identified challenges, and they are designed to work cooperatively. The introduction of contrastive learning to this specific problem is highly innovative.
Methodological Depth: The design of each module is theoretically grounded. The use of an asymmetric learning scheme in PSSC, the relation propagation in SaCH, and the dual contextual information bottleneck in CCA are all sophisticated and well-justified choices.
Thorough Experimentation: The experimental evaluation is comprehensive, covering multiple datasets, various incompleteness levels, multiple metrics, ablation studies, and parameter analysis. This provides strong evidence for the method's effectiveness and robustness.

Potential Issues & Critique:

The "Recovered" Sample Quality: The framework relies on recovering features for incomplete instances to use them in the SaCH loss. While CCA is designed to make this recovery informative, the quality of these generated features is not directly evaluated. If the recovery is poor (especially in the hard setting), it could potentially introduce noise into the contrastive learning step.
Applicability to Unsupervised Settings: The proposed method is supervised, relying on semantic labels. Many real-world datasets are not only incomplete but also unlabeled. Adapting the CICH framework, particularly the PSSC and SaCH modules, to a fully unsupervised setting would be a challenging but valuable extension.
Implicit Assumption of Label Correctness: The method heavily relies on the provided labels to construct the similarity matrix $S$ and the prototypical codes. It assumes these labels are clean and accurate, which may not always be true for large, web-crawled datasets.

Overall, this paper is an excellent example of academic research that identifies a practical problem, deconstructs it into fundamental challenges, and proposes a novel, principled solution backed by rigorous experimentation. Its methods for similarity calibration and correspondence alignment could inspire solutions in other domains dealing with incomplete or noisy multi-modal data.
