Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation


TL;DR Summary

This study presents MambaRec, a multimodal recommendation framework that addresses fine-grained cross-modal association modeling and global distribution-level consistency through attention-guided learning. Its core contribution is the Dilated Refinement Attention Module (DREAM), which enhances fusion quality by aligning fine-grained semantic patterns between visual and textual modalities, complemented by MMD and contrastive losses for global alignment.

Abstract

Multimodal recommendation systems are increasingly becoming foundational technologies for e-commerce and content platforms, enabling personalized services by jointly modeling users' historical behaviors and the multimodal features of items (e.g., visual and textual). However, most existing methods rely on either static fusion strategies or graph-based local interaction modeling, facing two critical limitations: (1) insufficient ability to model fine-grained cross-modal associations, leading to suboptimal fusion quality; and (2) a lack of global distribution-level consistency, causing representational bias. To address these, we propose MambaRec, a novel framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, we introduce the Dilated Refinement Attention Module (DREAM), which uses multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between visual and textual modalities. This module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive loss functions to constrain global modality alignment, enhancing semantic consistency. This dual regularization reduces mode-specific deviations and boosts robustness. To improve scalability, MambaRec employs a dimensionality reduction strategy to lower the computational cost of high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency. Our code has been made publicly available at https://github.com/rkl71/MambaRec.

In-depth Reading

1. Bibliographic Information

1.1. Title

Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation

1.2. Authors

  • Kelir Ren: Department of Computer Science and Engineering, Hanyang University, Ansan, Republic of Korea.
  • Chan-Yang Ju: Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea. (Major in Bio Artificial Intelligence)
  • Dong-Ho Lee: Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea. (Corresponding author)

1.3. Journal/Conference

Published in the Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25), November 10-14, 2025, Seoul, Republic of Korea. CIKM is a highly reputable and influential conference in the fields of information retrieval, databases, and knowledge management, making it a significant venue for research in multimodal recommendation systems.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses challenges in multimodal recommendation systems, which integrate users' historical behaviors with item features (e.g., visual, textual) to provide personalized services. Existing methods often suffer from two main limitations: (1) inadequate modeling of fine-grained cross-modal associations, leading to suboptimal fusion quality, and (2) a lack of global distribution-level consistency, resulting in representational bias. To overcome these, the authors propose MambaRec, a novel framework that combines local feature alignment and global distribution regularization through attention-guided learning. A key component is the Dilated Refinement Attention Module (DREAM), which uses multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between visual and textual modalities. This module captures hierarchical and context-aware relationships, enhancing cross-modal semantic modeling. Additionally, Maximum Mean Discrepancy (MMD) and contrastive loss functions are employed to enforce global modality alignment, promoting semantic consistency and reducing mode-specific deviations. To improve scalability, MambaRec also incorporates a dimensionality reduction strategy for high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets demonstrate that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency.

https://arxiv.org/abs/2509.09114v1 (Preprint) PDF Link: https://arxiv.org/pdf/2509.09114v1.pdf

2. Executive Summary

2.1. Background & Motivation

Multimodal recommendation systems (MRS) are becoming essential in e-commerce and content platforms for delivering personalized services. These systems model user preferences by considering both historical user behaviors and diverse item features, such as images and text. MRS are particularly valuable as they can leverage rich content information to alleviate issues like cold start and data sparsity, common in traditional recommendation methods that rely solely on interaction data.

However, existing MRS face significant limitations:

  1. Insufficient fine-grained cross-modal association modeling: Many current approaches use static fusion strategies or simplistic linear projections for multimodal features. This limits their ability to capture subtle, intricate semantic correspondences between different modalities (e.g., visual and textual features), leading to suboptimal fusion quality. For instance, a simple aggregation of image and text features might miss the specific visual attributes of a shirt (e.g., its color or pattern) that are semantically linked to descriptive text (e.g., "navy blue" or "striped").

  2. Lack of global distribution-level consistency: Most models do not explicitly constrain the overall statistical distributions of representations learned from different modalities. This can lead to representational bias, where features from one modality might be semantically misaligned with those from another, even if local interactions are considered. This distribution shift makes it difficult for the model to generalize across modalities effectively. Graph-based methods, while powerful for local interactions, still struggle with this global consistency and are susceptible to noise from high-dimensional features.

    The paper's entry point is to address these two critical limitations by proposing a unified framework that explicitly handles both fine-grained local alignment and global distribution consistency, coupled with efficiency improvements for high-dimensional data.

2.2. Main Contributions / Findings

The paper proposes MambaRec, a novel framework for multimodal recommendation, with the following primary contributions:

  • Novel Local Feature Alignment Module (DREAM): Introduction of the Dilated Refinement Attention Module (DREAM). This module uses multi-scale dilated convolutions to capture hierarchical and context-aware relationships within each modality, and then applies channel-wise and spatial attention mechanisms to achieve fine-grained semantic pattern alignment between visual and textual modalities. This significantly improves the quality of feature fusion at a local level.

  • Global Distribution Regularization: Integration of Maximum Mean Discrepancy (MMD) and contrastive loss functions into the recommendation system. This dual regularization strategy enforces consistency in the global distribution of modal features, reducing mode-specific deviations and enhancing the model's generalization capabilities by aligning overall semantic representations.

  • Efficient Dimensionality Optimization Strategy: Design and implementation of a flexible and efficient dimensionality reduction strategy. This mechanism significantly reduces the computational cost and memory usage associated with high-dimensional multimodal features, making MambaRec scalable and deployable in large-scale recommendation environments without sacrificing performance.

    The key findings are that MambaRec consistently outperforms state-of-the-art general-purpose and multimodal recommendation models across three real-world e-commerce datasets (Baby, Sports, Clothing). The ablation studies confirm the critical role of both local and global alignment modules, demonstrating their necessity for optimal performance. The dimensionality optimization strategy effectively balances performance and resource consumption, showing that moderate compression can even improve generalization by suppressing overfitting noise. Visualization analysis further indicates that MambaRec yields more uniform and discriminative feature distributions, alleviating representation degeneration issues observed in traditional methods.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following concepts:

  • Recommendation Systems: Systems that predict user preferences for items.
    • Collaborative Filtering (CF): A common technique where recommendations are made based on the preferences of similar users or the characteristics of similar items. It typically relies on historical interaction data.
    • Multimodal Recommendation Systems (MRS): Extension of CF that incorporates diverse item features beyond just identifiers, such as visual (images) and textual (descriptions) data, to enrich item representations and improve recommendation quality.
    • Cold Start Problem: A challenge in recommendation systems where new users or new items have insufficient interaction data, making it difficult to generate accurate recommendations. Multimodal features can help alleviate this by providing content-based information.
    • Data Sparsity: When the user-item interaction matrix is very sparse (few interactions relative to possible interactions), making it difficult for CF methods to find meaningful patterns. Multimodal data can provide supplementary information.
  • Deep Learning Components:
    • Convolutional Neural Networks (CNNs): Neural networks commonly used for image processing. They apply convolutional filters to extract hierarchical features from input data.
    • Dilated Convolutions (Atrous Convolutions): A type of convolution that introduces gaps between kernel elements, effectively enlarging the receptive field of the filter without increasing the number of parameters or losing resolution. This allows multi-scale feature extraction.
    • Attention Mechanisms: Mechanisms that allow a neural network to focus on specific parts of its input when making predictions.
      • Channel Attention: Dynamically adjusts the importance of each feature channel, allowing the model to focus on more informative channels.
      • Spatial Attention: Allows the model to focus on the most relevant spatial locations within a feature map.
  • Graph Neural Networks (GNNs): Neural networks that operate on graph-structured data.
    • Graph Convolutional Networks (GCNs): A type of GNN that learns representations of nodes by aggregating information from their neighbors, effectively performing convolution-like operations on graphs. They are useful for modeling complex user-item interactions.
  • Loss Functions & Regularization:
    • Bayesian Personalized Ranking (BPR): A pairwise ranking loss function widely used in recommendation systems. It optimizes for the relative order of items, preferring positive items over negative ones for a given user.
    • Contrastive Learning: A self-supervised learning paradigm where the model learns representations by pulling similar (positive) samples closer and pushing dissimilar (negative) samples apart in the embedding space.
      • InfoNCE Loss (Noise-Contrastive Estimation): A specific form of contrastive loss that maximizes the mutual information between different views of the same data, commonly used for learning robust representations. It typically involves a temperature hyperparameter ($\tau$) to control the sensitivity of the similarity measure.
    • Maximum Mean Discrepancy (MMD): A statistical distance measure used to quantify the difference between two probability distributions. In machine learning, it's often used to align distributions, for example, between different modalities, by minimizing the distance between their mean embeddings in a Reproducing Kernel Hilbert Space (RKHS).
    • L2 Regularization: A technique to prevent overfitting by adding a penalty proportional to the sum of the squares of the model's weights to the loss function. This encourages smaller weights and simpler models.
  • Dimensionality Reduction: Techniques to reduce the number of random variables under consideration, often by transforming a set of variables into a set of principal ones. This helps in reducing computational cost, memory usage, and noise.
  • Representation Degeneration: A problem in representation learning where different semantic categories of information concentrate in overlapping regions of the embedding space, making effective discrimination challenging. This can manifest as features being too similar, losing discriminative power.

3.2. Previous Works

The paper discusses several categories of related work, highlighting their strengths and, more importantly, their limitations which MambaRec aims to address.

  • Traditional Recommendation Systems:
    • BPR [21]: A foundational model using Bayesian Personalized Ranking for implicit feedback. It optimizes for item ranking.
      • Limitation: Relies solely on interaction data, suffering from data sparsity and cold-start problems, and cannot leverage rich multimodal content.
    • VBPR [9]: Extends BPR by integrating visual features extracted from VGG16 (a pre-trained Convolutional Neural Network).
      • Limitation: While introducing visual features helps with cold-start, it often treats modalities independently or uses simple linear combinations, lacking sophisticated cross-modal interaction modeling. The paper notes it struggles with representation degradation (as seen in Section 5.5).
  • Graph Neural Network (GNN) based Multimodal Recommendation:
    • LightGCN [10]: A simplified GCN that focuses on neighborhood aggregation for efficient modeling of high-order user-item relationships.
      • Limitation: General GCNs like LightGCN often ignore multimodal content, or if they integrate it, they may still struggle with heterogeneous semantic granularities across modalities, leading to information dilution or over-smoothing in dense graphs.
    • MMGCN [27]: Constructs separate interaction graphs for each modality and fuses their graph embeddings.
      • Limitation: While improving performance, it can amplify modality-specific noise if not handled carefully.
    • GRCN [26]: Refines the graph through gating mechanisms to suppress noisy interactions.
      • Limitation: Still faces challenges in accommodating heterogeneous semantic granularities and may use static fusion strategies, making it difficult to filter out noise effectively.
    • MGCN [30]: Builds multimodal collaboration graphs by mapping features from different modes. It also guides the preprocessing and refinement of modal features using historical behavior.
      • Limitation: The paper notes that MGCN still uses linear projection and static fusion, making it difficult to effectively suppress modal noise and extract fine-grained semantics.
  • Self-supervised Learning & Fine-grained Fusion:
    • SLMRec (Self-supervised Learning for Multimedia Recommendation) [23]: Introduces cross-modal self-supervised tasks and hierarchical contrastive learning to enhance the robustness of modal representations.
      • Limitation: Typically lacks global constraints on modality-level representation distributions, leading to semantic misalignment.
    • BM3 (Bootstrap Multimodal Matching) [35]: Proposes a self-supervised strategy that generates contrastive views through random feature dropping.
      • Limitation: Similar to SLMRec, it may lack global distributional consistency.
    • FREEDOM [34]: Focuses on memory efficiency by freezing the item similarity graph and pruning user interaction graphs.
      • Limitation: Primarily addresses resource consumption; it might not explicitly tackle fine-grained alignment or global distribution consistency.
    • LGMRec (Local and Global Graph Learning for Multimodal Recommendation) [7]: Models both modality-specific user interests and shared cross-modal preferences using local-global modal graph learning.
      • Limitation: While addressing local and global aspects, it may still struggle with distribution shift across modalities and noise from high-dimensional features.

3.3. Technological Evolution

The evolution of recommendation systems, particularly multimodal ones, can be broadly traced through these stages:

  1. Early Collaborative Filtering (CF): Focused solely on user-item interaction data (e.g., BPR). These methods were effective but suffered from cold-start and data sparsity.
  2. Content-Enhanced CF: Integrated basic content features (e.g., tags, categories).
  3. Multimodal Integration (Initial): Incorporated rich visual/textual features (e.g., from VGG16 for images) but often treated modalities simplistically, via concatenation or linear combination (e.g., VBPR). This improved cold-start but lacked deep cross-modal understanding.
  4. Graph Neural Networks (GNNs) for Recommendation: Leveraged graph structures to model complex user-item relationships and high-order connectivity (LightGCN, MMGCN, GRCN, MGCN). This brought significant improvements in capturing implicit feedback. When combined with multimodal features, these models started to explore how different modalities could inform graph construction or feature aggregation.
  5. Self-supervised and Contrastive Learning: Introduced techniques to learn robust representations by creating self-supervision tasks or by contrastive learning to pull similar items together and push dissimilar ones apart (SLMRec, BM3). These methods focused on enhancing modal discrimination and robustness.
  6. Explicit Modality Alignment: The current paper's contribution, MambaRec, fits into this latest stage, where the focus shifts to explicitly addressing the semantic misalignment and distribution shift between modalities. It combines local fine-grained alignment (via attention and multi-scale convolutions) with global distribution consistency (via MMD and InfoNCE), coupled with dimensionality optimization for practical scalability.

3.4. Differentiation Analysis

MambaRec differentiates itself from previous works primarily by its comprehensive approach to modality alignment, tackling both local fine-grained associations and global distribution consistency within a single, efficient framework.

  • Compared to static fusion or linear projection methods (e.g., VBPR, MMGCN, MGCN): MambaRec moves beyond simplistic fusion by introducing the DREAM module, which uses multi-scale dilated convolutions and dual attention mechanisms (channel and spatial) to adaptively capture intricate, hierarchical, and context-aware semantic patterns. This is a much more dynamic and fine-grained approach to local feature alignment.

  • Compared to methods lacking global constraints (e.g., SLMRec, BM3): While these methods use contrastive objectives for local fusion or robust representation, they often miss global distribution-level consistency. MambaRec explicitly addresses this by incorporating MMD and contrastive loss to enforce global alignment between the overall distributions of multimodal features, thereby reducing representational bias and mode-specific deviations.

  • Compared to graph-based methods (e.g., LightGCN, MMGCN, GRCN, LGMRec): While MambaRec can leverage graph structures for user-item interactions (as implicitly suggested by the BPR loss), its core innovation lies in how multimodal features are processed and aligned before or during their use in the recommendation task. Existing GCN-based multimodal methods, even those addressing local-global graph learning (LGMRec), still struggle with distribution shifts and noise interference across modalities, which MambaRec directly mitigates through its DREAM module and MMD constraints.

  • Efficiency: MambaRec proactively tackles the challenge of high-dimensional multimodal features with a dimensionality optimization strategy, making it more scalable and efficient for large-scale deployments compared to many methods that might incur high memory or computational costs.

    In essence, MambaRec offers a more holistic and robust solution by jointly modeling local semantic correspondences and global distributional consistency with an eye towards practical deployment, which many prior works addressed only partially or implicitly.

4. Methodology

The proposed MambaRec framework integrates local feature alignment and global distribution regularization via attention-guided learning, along with a dimensionality optimization mechanism. The overall architecture consists of three core aspects: (i) Local Feature Alignment via the Dilated Refinement Attention Module (DREAM), (ii) Global Distribution Alignment using Maximum Mean Discrepancy (MMD) and InfoNCE contrastive loss, and (iii) a Dimensionality Optimization Mechanism to handle high-dimensional features.

The following figure (Figure 2 from the original paper) shows an overview of the proposed architecture.

Figure 2 (image description): schematic of the proposed architecture, showing the Dilated Refinement Attention Module (DREAM) and its components, including the local and global alignment of the visual modality, the textual modality, and behavioral information; key elements include Maximum Mean Discrepancy (MMD) and InfoNCE contrastive learning.

Figure 2: The overall architecture of the proposed MambaRec, comprising three key components. (i) The local alignment performs fine-grained matching between local features of images and text. (ii) The global alignment ensures consistency between different modalities at the level of global distribution. (iii) The dimensionality reduction mechanism was introduced before alignment to compress high-dimensional visual features and reduce memory overhead.

4.1. Local Feature Alignment

The Local Feature Alignment component aims to fuse semantically aligned image and text features at a fine-grained level. This is achieved through the Dilated Refinement Attention Module (DREAM), which draws inspiration from DeepLabV3's Atrous Spatial Pyramid Pooling (ASPP) and the Convolutional Block Attention Module (CBAM).

Let the input feature map be $X \in \mathbb{R}^{C \times H \times W}$ (where $C$ is the number of channels, $H$ the height, and $W$ the width). The DREAM module processes this input to produce an output feature map $Y \in \mathbb{R}^{C \times H \times W}$. The process involves multi-scale convolution and attention mechanisms.

4.1.1 Multi-scale feature extraction

The DREAM module employs five parallel branches for multi-scale feature extraction. These branches process the input and their outputs are then concatenated.

  • Branch 1: A $1 \times 1$ convolution extracts fine-grained local features with minimal computational cost and without expanding the receptive field. This preserves high-resolution spatial details while performing a channel-wise transformation.

  • Branches 2-4: Three $3 \times 3$ dilated convolutions with increasing dilation rates of 6, 12, and 18, respectively. These dilated convolutions effectively enlarge the receptive field to capture medium- and long-range contextual information while maintaining spatial resolution thanks to appropriate padding.

  • Branch 5: Global average pooling is applied over the entire spatial dimension, followed by a $1 \times 1$ convolution. The resulting $1 \times 1$ feature map, representing global semantic context, is then upsampled to the $H \times W$ spatial dimensions of the other branches using bilinear interpolation.

    Let the output feature maps of these five branches be $F_1, F_2, F_3, F_4, F_5 \in \mathbb{R}^{H \times W \times C_i}$. These are concatenated along the channel dimension to form the fused feature map $F$:

    $$ F = \operatorname{Concat}_c(F_1, F_2, F_3, F_4, F_5) \quad (1) $$

    Here, $\operatorname{Concat}_c$ denotes concatenation along the channel dimension. The resulting fused representation $F \in \mathbb{R}^{H \times W \times C'}$ integrates both local detail and global context, where $C' = \sum_{i=1}^{5} C_i$ is the total number of channels after concatenation. A minimal code sketch of these branches is given below.
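The following is a minimal PyTorch sketch of the five-branch multi-scale extraction described above. The framework choice, branch channel width, and class names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBranches(nn.Module):
    """Five parallel branches: a 1x1 conv, three dilated 3x3 convs (rates 6/12/18),
    and a global-average-pooling branch, concatenated along the channel dimension."""
    def __init__(self, in_ch: int, branch_ch: int = 64):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)                # fine-grained local features
        self.b2 = nn.Conv2d(in_ch, branch_ch, 3, padding=6, dilation=6)     # medium-range context
        self.b3 = nn.Conv2d(in_ch, branch_ch, 3, padding=12, dilation=12)
        self.b4 = nn.Conv2d(in_ch, branch_ch, 3, padding=18, dilation=18)   # long-range context
        self.b5 = nn.Sequential(nn.AdaptiveAvgPool2d(1),                    # global semantic context
                                nn.Conv2d(in_ch, branch_ch, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:                     # x: (B, C, H, W)
        h, w = x.shape[-2:]
        f1, f2, f3, f4 = self.b1(x), self.b2(x), self.b3(x), self.b4(x)
        # Upsample the pooled branch back to H x W via bilinear interpolation.
        f5 = F.interpolate(self.b5(x), size=(h, w), mode="bilinear", align_corners=False)
        return torch.cat([f1, f2, f3, f4, f5], dim=1)                       # Eq. (1): channel-wise concat
```

Padding equal to the dilation rate keeps the spatial resolution of the dilated branches unchanged, matching the description above.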

4.1.2 Channel attention mechanism

To adaptively enhance the importance of different channels in the fused feature map $F$, DREAM incorporates a channel attention module. First, global average pooling is applied to $F$ to obtain a channel compression vector $z \in \mathbb{R}^{C' \times 1 \times 1}$, which summarizes the spatial information of each channel. Next, $z$ is passed sequentially through two fully connected layers. The first layer reduces the channel dimension to $C'/r$ (using the ReLU activation function $\delta(\cdot)$), where $r$ is a reduction ratio. The second layer restores the dimension to $C'$ and outputs a channel weight vector $M_c \in \mathbb{R}^{C' \times 1 \times 1}$. The process is expressed as:

$$ M_c = \sigma(W_2(\delta(W_1(z)))) \quad (2) $$

In this formula:

  • $W_1$ and $W_2$ are the weight matrices of the two fully connected layers.
  • $\delta(\cdot)$ denotes the ReLU activation function.
  • $\sigma(\cdot)$ denotes the Sigmoid activation function, which ensures that each element of the channel attention weight $M_c$ lies between 0 and 1, indicating the importance of the corresponding channel. Finally, $M_c$ is multiplied element-wise onto the fused feature map $F$ to perform channel-wise feature recalibration: $F_c = F \otimes M_c$ (3), where $\otimes$ denotes element-wise multiplication (broadcasting $M_c$ across the spatial dimensions of $F$). This operation enhances key channels and suppresses unimportant ones (a code sketch follows below).
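A hedged PyTorch sketch of the channel attention path (Eqs. 2-3); the reduction ratio default and the module name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over the fused map F."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # W1: reduce to C'/r
        self.fc2 = nn.Linear(channels // r, channels)   # W2: restore to C'

    def forward(self, f: torch.Tensor) -> torch.Tensor:          # f: (B, C', H, W)
        z = f.mean(dim=(2, 3))                                    # global average pooling -> (B, C')
        m_c = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))    # Eq. (2)
        return f * m_c[:, :, None, None]                          # Eq. (3): broadcast over H, W
```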

4.1.3 Spatial attention mechanism

Complementary to channel attention, the DREAM module also applies a spatial attention mechanism to highlight important spatial locations within the feature map. First, the fused feature map $F$ is pooled across the channel dimension to obtain a two-dimensional feature map $P \in \mathbb{R}^{1 \times H \times W}$, taking the average over all channels at each spatial location $(i,j)$, i.e., $P(i,j) = \frac{1}{C'} \sum_{c=1}^{C'} F(c,i,j)$. Next, $P$ is linearly transformed by a $1 \times 1$ convolution, and the result is normalized by a Sigmoid function to produce a spatial attention weight map $M_s \in \mathbb{R}^{1 \times H \times W}$:

$$ M_s = \sigma(f^{1 \times 1}(P)) \quad (4) $$

In this formula:

  • $f^{1 \times 1}$ denotes the $1 \times 1$ convolution operator.
  • $\sigma(\cdot)$ is the Sigmoid function, mapping the result to the $[0,1]$ interval. Each position $(i,j)$ in $M_s$ holds a spatial weight signifying the importance of that location. The weight map $M_s$ is then expanded to match the number of channels in $F$ (by copying its values to all channels at each position) and multiplied element-wise with the original fused feature map $F$: $F_s = F \otimes M_s$ (5). This operation enhances important spatial locations and suppresses irrelevant areas.

4.1.4 Attention fusion

After obtaining the channel-enhanced feature $F_c$ and the spatial-enhanced feature $F_s$, the DREAM module fuses the two types of attention information with an element-wise maximization operation. The final output feature map $Y$ is given by:

$$ Y = \mathcal{M}(F_c, F_s) \quad (6) $$

Here, $\mathcal{M}(a,b)$ selects the larger of the corresponding elements of the input tensors $a$ and $b$. For each position in each channel, this adaptively keeps whichever attention branch (channel or spatial) responds more strongly, ensuring that the most salient features from both attention types are preserved. The output $Y$ of the DREAM module therefore carries more significant and accurate local feature information. A sketch of the spatial attention and the maximization fusion follows.
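A minimal PyTorch sketch of the spatial attention path (Eqs. 4-5) and the element-wise maximization fusion (Eq. 6); names and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention over the fused map F: channel-average pooling,
    a 1x1 convolution, and a sigmoid gate."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:   # f: (B, C', H, W)
        p = f.mean(dim=1, keepdim=True)                    # average over channels -> (B, 1, H, W)
        m_s = torch.sigmoid(self.conv(p))                  # Eq. (4)
        return f * m_s                                     # Eq. (5): broadcast over channels


def attention_fusion(f_c: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """Eq. (6): element-wise maximum of the channel- and spatial-enhanced features."""
    return torch.maximum(f_c, f_s)
```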

4.2. Global Distribution Alignment

The global alignment module aims to make the representation distributions of image and text features consistent. It primarily uses the Maximum Mean Discrepancy (MMD) loss, complemented by the InfoNCE contrastive loss, for feature alignment.

Let $V = \{v_i\}_{i=1}^{N}$ be the set of image modal features and $T = \{t_j\}_{j=1}^{N}$ the set of text modal features, both containing $N$ samples.

4.2.1 Maximum Mean Discrepancy (MMD)

MMD is used to measure the difference between the two modal feature distributions. It quantifies the inconsistency as the gap between the distributions' mean embeddings in a Reproducing Kernel Hilbert Space (RKHS). The squared MMD is defined as:

$$ \mathrm{MMD}^2(V,T) = \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(v_i) - \frac{1}{N}\sum_{j=1}^{N}\phi(t_j) \right\|_{\mathcal{H}}^{2} \quad (7) $$

In this formula:

  • $V$ and $T$ are the sets of visual and textual features, respectively.

  • $\phi(\cdot)$ denotes the mapping function that projects features into the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$.

  • $N$ is the number of samples.

  • $\|\cdot\|_{\mathcal{H}}^{2}$ denotes the squared norm in the RKHS.

    This expression can be equivalently written in kernel form:

    $$ \mathrm{MMD}^2(V,T) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N}k(v_i,v_{i'}) + \frac{1}{N^2}\sum_{j=1}^{N}\sum_{j'=1}^{N}k(t_j,t_{j'}) - \frac{2}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}k(v_i,t_j) \quad (8) $$

    Here, $k(\cdot,\cdot)$ is a kernel function defined on the input space. The paper uses a Gaussian kernel:

    $$ k(v,t) = \exp\left(-\frac{\|v - t\|^2}{2\sigma^2}\right) \quad (9) $$

    In this formula:

  • $\|v - t\|^2$ is the squared Euclidean distance between features $v$ and $t$.

  • $\sigma$ is a kernel bandwidth hyperparameter that adjusts how sensitive the similarity is to distance. By minimizing the MMD loss, the model reduces the distance between the overall distributions of the image and text modalities, thereby achieving global semantic alignment. A minimal code sketch of the kernel-form MMD is given below.
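A short sketch of the kernel-form squared MMD (Eqs. 8-9), assuming both modalities supply a batch of N feature vectors; the default bandwidth is an illustrative placeholder, not the paper's tuned value.

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.5) -> torch.Tensor:
    """Gaussian kernel matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))  (Eq. 9)."""
    sq_dists = torch.cdist(x, y) ** 2                 # pairwise squared Euclidean distances
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_loss(v: torch.Tensor, t: torch.Tensor, sigma: float = 1.5) -> torch.Tensor:
    """Squared MMD between visual features v (N, d) and textual features t (N, d) in kernel form (Eq. 8)."""
    k_vv = gaussian_kernel(v, v, sigma).mean()        # mean over N*N entries = 1/N^2 * sum
    k_tt = gaussian_kernel(t, t, sigma).mean()
    k_vt = gaussian_kernel(v, t, sigma).mean()
    return k_vv + k_tt - 2.0 * k_vt
```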

4.2.2 InfoNCE contrastive learning loss

To further enhance representation consistency, the InfoNCE contrastive learning loss is introduced as a supplement. This loss strengthens the similarity between matching pairs of visual and textual features while suppressing the similarity between non-matching pairs. It is expressed as:

$$ \mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N}\log \frac{\exp\left(\mathrm{sim}(v_i,t_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(v_i,t_j)/\tau\right)} \quad (10) $$

In this formula:

  • $\mathrm{sim}(\cdot,\cdot)$ is the dot product of the normalized feature vectors. Since the vectors are scaled to unit length, this dot product is equivalent to their cosine similarity.
  • $v_i$ and $t_i$ are a matching pair of visual and textual features for the $i$-th item.
  • $t_j$ ranges over the text features in the batch and serves as a negative sample when $j \neq i$.
  • $\tau$ is the temperature hyperparameter, which controls the concentration of the distribution and the sensitivity of the similarity measure. A smaller $\tau$ makes the loss more sensitive to small similarity differences, producing sharper distributions (see the sketch below).
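A compact sketch of the batch-wise InfoNCE loss in Eq. 10, treating the other items in the batch as negatives; the temperature value shown is an illustrative placeholder.

```python
import torch
import torch.nn.functional as F

def infonce_loss(v: torch.Tensor, t: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """InfoNCE over a batch of matched visual/textual pairs (Eq. 10).
    Features are L2-normalized so the dot product equals cosine similarity."""
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / tau                             # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device) # the i-th text matches the i-th image
    return F.cross_entropy(logits, targets)            # -log softmax of each matching pair
```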

4.2.3 Overall global alignment loss

The total loss of the global distribution alignment module is a weighted combination of the MMD and InfoNCE components:

$$ \mathcal{L}_{\mathrm{align}} = \lambda_{\mathrm{mmd}} \cdot \mathcal{L}_{\mathrm{MMD}} + \lambda_{\mathrm{cl}} \cdot \mathcal{L}_{\mathrm{InfoNCE}} \quad (11) $$

In this formula:

  • $\lambda_{\mathrm{mmd}}$ and $\lambda_{\mathrm{cl}}$ are weight coefficients that determine the contributions of the MMD loss and the InfoNCE loss, respectively. They are hyperparameters specified by the model configuration. This joint optimization strategy aligns and fuses cross-modal semantic representations at the global distribution level.

4.3. Dimensionality Optimization

To address the dimensional differences and potential redundancy between image and text modal features, and to improve efficiency, a dimensionality optimization mechanism is proposed.

Let $V \in \mathbb{R}^{N \times D_V}$ be the image modal feature matrix and $T \in \mathbb{R}^{N \times D_T}$ the text modal feature matrix, where $N$ is the number of samples and $D_V$ and $D_T$ are the original visual and textual feature dimensions. Two learnable linear dimensionality reduction matrices are introduced: $W_V \in \mathbb{R}^{D_V \times d}$ for visual features and $W_T \in \mathbb{R}^{D_T \times d}$ for textual features. These matrices project the raw features into a unified, lower-dimensional embedding space:

$$ V' = V W_V, \quad T' = T W_T \quad (12) $$

In this formula:

  • $V'$ and $T'$ are the unified feature representations after dimensionality reduction, both in $\mathbb{R}^{N \times d}$.

  • $d$ is the target embedding dimension.

    The target embedding dimension $d$ is controlled by a dimension reduction coefficient $r$:

    $$ d = \left\lfloor \frac{\min(D_V, D_T)}{r} \right\rfloor \quad (13) $$

    Here, $\lfloor \cdot \rfloor$ denotes the floor function, ensuring that $d$ is an integer. The reduction factor $r$ allows flexible control over the compressed dimension; a larger $r$ means greater compression. The dimensionality-reduced modal features $V'$ and $T'$ are then fed into the DREAM module for further structural and fine-grained feature enhancement. This mechanism not only reduces the feature-space dimension, computational overhead, and memory usage but also improves the robustness of the multimodal fusion representation. A minimal sketch of the projection step follows.
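A minimal sketch of the projection step in Eqs. 12-13, using the feature dimensions reported later in the paper (4,096-dimensional VGG16 visual features, 384-dimensional Sentence-Transformer text embeddings) as illustrative defaults.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Learnable linear projections W_V, W_T that map both modalities into a shared
    d-dimensional space, with d = floor(min(D_V, D_T) / r)."""
    def __init__(self, d_v: int = 4096, d_t: int = 384, r: int = 8):
        super().__init__()
        d = min(d_v, d_t) // r                        # e.g. floor(384 / 8) = 48
        self.proj_v = nn.Linear(d_v, d, bias=False)   # W_V
        self.proj_t = nn.Linear(d_t, d, bias=False)   # W_T

    def forward(self, v: torch.Tensor, t: torch.Tensor):
        return self.proj_v(v), self.proj_t(t)         # V' = V W_V,  T' = T W_T
```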

4.4. Multi-View Encoder

The paper mentions a Multi-View Encoder for enhancing the representation of specific modes before global alignment. Inspired by MGCN [30], this encoder processes the output of the Local Alignment Module (DREAM) to generate two semantically rich views: a visual view and a text view. Each view is intended to capture intra-modal relationships and complementary semantics that might otherwise be lost during local fusion. The paper does not provide specific formulas or detailed architecture for this encoder, implying it's a standard or adapted component from MGCN.

4.5. Prediction and Optimization

The goal is to predict the interaction score between a user $u$ and an item $i$. For any user $u$ and item $i$, the model generates fused representation vectors $e_u^* \in \mathbb{R}^d$ and $e_i^* \in \mathbb{R}^d$, respectively. The interaction score is computed as the inner product of these vectors:

$$ \hat{y}(u,i) = (e_u^*)^\top e_i^* \quad (14) $$

Here, $\hat{y}(u,i)$ is the predicted preference score and $(e_u^*)^\top$ denotes the transpose of the user embedding.

4.5.1 Optimization objective

The model is trained primarily with Bayesian Personalized Ranking (BPR) [21] as the basic optimization objective. BPR optimizes the relative ranking of items for each user. The BPR loss is defined as:

$$ \mathcal{L}_{\mathrm{BPR}} = \sum_{(u,i,j)\in O} -\log \sigma(\hat{y}_{ui} - \hat{y}_{uj}) \quad (15) $$

In this formula:

  • $\sigma(\cdot)$ denotes the Sigmoid function, which squashes its input into a value between 0 and 1.

  • $O = \{(u,i,j)\}$ is the set of triples in which user $u$ prefers the positive sample $i$ over the negative sample $j$. The BPR loss maximizes the gap between the predicted scores of positive items ($\hat{y}_{ui}$) and negative items ($\hat{y}_{uj}$) for a given user $u$.

    To enhance representation-learning consistency and cross-modal fusion, the model incorporates the InfoNCE contrastive loss ($\mathcal{L}_{\mathrm{cl}}$, as defined in Equation 10) and the Maximum Mean Discrepancy loss ($\mathcal{L}_{\mathrm{mmd}}$, as defined in Equation 8) as auxiliary optimization terms. Additionally, L2 regularization is applied to prevent overfitting. The final joint optimization objective (the total loss) is:

    $$ \mathcal{L} = \mathcal{L}_{\mathrm{BPR}} + \lambda_{\mathrm{cl}} \cdot \mathcal{L}_{\mathrm{cl}} + \lambda_{\mathrm{mmd}} \cdot \mathcal{L}_{\mathrm{mmd}} + \lambda_{\mathrm{reg}} \cdot \|\Theta\|_2^2 \quad (16) $$

    In this formula:

  • $\Theta$ denotes the set of all learnable parameters of the model.

  • $\lambda_{\mathrm{cl}}$, $\lambda_{\mathrm{mmd}}$, and $\lambda_{\mathrm{reg}}$ are weight hyperparameters for the InfoNCE loss, the MMD loss, and the L2 regularization term, respectively. These weights control the importance of each component in the overall training objective. A minimal sketch of the BPR and joint objectives follows.
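A hedged sketch of the scoring and BPR term (Eqs. 14-15) and the joint objective (Eq. 16). Here the BPR loss is averaged over a batch of (user, positive, negative) triples, and the lambda defaults simply echo the settings reported in Section 5.4.

```python
import torch
import torch.nn.functional as F

def bpr_loss(u_emb: torch.Tensor, pos_emb: torch.Tensor, neg_emb: torch.Tensor) -> torch.Tensor:
    """BPR loss (Eq. 15): prefer the positive item over the negative one per user.
    Scores are inner products of user and item embeddings (Eq. 14)."""
    pos_scores = (u_emb * pos_emb).sum(dim=-1)
    neg_scores = (u_emb * neg_emb).sum(dim=-1)
    return -F.logsigmoid(pos_scores - neg_scores).mean()   # batch mean of -log sigmoid(y_ui - y_uj)

def total_loss(l_bpr, l_cl, l_mmd, params,
               lam_cl: float = 0.01, lam_mmd: float = 0.15, lam_reg: float = 1e-4) -> torch.Tensor:
    """Joint objective (Eq. 16): BPR + weighted InfoNCE + weighted MMD + L2 regularization."""
    l2 = sum((p ** 2).sum() for p in params)
    return l_bpr + lam_cl * l_cl + lam_mmd * l_mmd + lam_reg * l2
```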

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three representative subsets of Amazon's product review datasets, following previous works [7, 34]. These datasets represent different consumption scenarios and exhibit diverse user behavior patterns.

  • Baby: Products for babies and infants.

  • Sports and Outdoors: Products related to sports and outdoor activities.

  • Clothing, Shoes and Jewelry: Products in the fashion domain.

    For each dataset, a 5-core setting was applied to filter out infrequent users and items, meaning both users and items must have at least 5 interactions.

The following are the statistics from Table 1 of the original paper:

Dataset #User #Item #Interaction Density
Baby 19,445 7,050 160,792 0.117%
Sports 35,598 18,357 296,337 0.045%
Clothing 39,387 23,033 278,677 0.031%

Multimodal Features Extraction:

  • Visual Modality: 4,096-dimensional visual features were extracted using a pre-trained VGG16 Convolutional Neural Network [22]. These features then underwent dimensionality reduction.
  • Textual Modality: 384-dimensional textual embeddings were extracted using Sentence Transformers [20] applied to the concatenation of the item's title, descriptions, categories, and brand.

5.2. Evaluation Metrics

The models were evaluated using two widely accepted metrics for top-K recommendation quality: Recall@K and NDCG@K. The evaluation was performed for $K = 10$ and $K = 20$.

  • Recall@K:

    • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the ability of the recommendation system to find as many relevant items as possible. A higher Recall@K indicates that the model is good at not missing positive items.
    • Mathematical Formula: $$ \mathrm{Recall@K} = \frac{\text{Number of relevant items in top-}K\text{ recommendations}}{\text{Total number of relevant items}} $$
    • Symbol Explanation:
      • Number of relevant items in top-K recommendations: The count of items in the top $K$ recommendations that the user actually interacted with (or is known to be interested in).
      • Total number of relevant items: The total count of items that the user actually interacted with in the test set.
  • NDCG@K (Normalized Discounted Cumulative Gain at K):

    • Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items. It assigns higher scores to relevant items that appear at higher ranks (closer to the top of the recommendation list). It is "normalized" to ensure scores across different queries are comparable.
    • Mathematical Formula: First, DCG@K (Discounted Cumulative Gain at K) is calculated:

      $$ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $$

      Then, NDCG@K is obtained by normalizing DCG@K by IDCG@K (the Ideal Discounted Cumulative Gain at K), i.e., the maximum possible DCG@K when all relevant items are perfectly ranked:

      $$ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $$
    • Symbol Explanation:
      • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the recommended list. For implicit feedback, this is typically 1 if the item is relevant and 0 otherwise.
      • $K$: The number of top recommendations considered.
      • $\log_2(i+1)$: The discount factor, which reduces the contribution of relevant items as their rank $i$ increases (i.e., as they appear lower in the list).
      • $\mathrm{IDCG@K}$: The DCG@K of a hypothetical ideal ranking in which all relevant items are placed at the top of the list in decreasing order of relevance. A minimal sketch of both metrics is given below.
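A small sketch of how Recall@K and NDCG@K can be computed per user with binary relevance; function names and the NumPy dependency are assumptions for illustration.

```python
import numpy as np

def recall_at_k(ranked_items, relevant_items, k=20):
    """Fraction of a user's relevant items that appear in the top-k recommendations."""
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / max(len(relevant_items), 1)

def ndcg_at_k(ranked_items, relevant_items, k=20):
    """NDCG@k with binary relevance: DCG over the top-k list, normalized by the ideal DCG."""
    relevant = set(relevant_items)
    dcg = sum(1.0 / np.log2(i + 2)                    # position i (0-based) -> discount log2(i+2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)                # ideal ranking puts all relevant items first
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```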

5.3. Baselines

To evaluate MambaRec's effectiveness, it was compared against several state-of-the-art recommendation methods, categorized into General Models and Multimodal Models.

5.3.1 General Models

These models typically rely primarily on collaborative filtering and do not explicitly leverage multimodal features.

  • BPR [21]: Bayesian Personalized Ranking, a classic matrix factorization based model optimizing for ranking.
  • LightGCN [10]: A simplified Graph Convolutional Network that focuses on neighborhood aggregation to model high-order user-item relationships efficiently.

5.3.2 Multimodal Models

These models incorporate multimodal features in various ways.

  • VBPR [9]: Visual Bayesian Personalized Ranking, which integrates visual features from pre-trained CNNs into BPR.
  • MMGCN [27]: Multi-modal Graph Convolution Network, which builds separate interaction graphs for each modality and fuses their graph embeddings.
  • GRCN [26]: Graph-Refined Convolutional Network, which refines graphs through gating mechanisms to suppress noisy interactions from multimodal data.
  • SLMRec [23]: Self-supervised Learning for Multimedia Recommendation, which uses cross-modal self-supervised tasks and hierarchical contrastive learning to enhance modal representations.
  • FREEDOM [34]: Focuses on freezing and denoising graph structures to improve training stability and resource efficiency in multimodal settings.
  • MGCN [30]: Multi-view Graph Convolutional Network, which uses users' historical behavior to guide preprocessing and refinement of modal features, and employs multi-view graph convolution for fusion.
  • LGMRec [7]: Local and Global Graph Learning for Multimodal Recommendation, which models both modality-specific user interests and shared cross-modal preferences.

5.4. Implementation Details

The MambaRec model and all baseline methods were implemented using the MMRec [33] unified open-source framework. Experiments were run on a GeForce RTX4090 GPU with 24 GB memory.

  • Data Split: Each user's interaction history was randomly split into training, validation, and test sets with an 8:1:1 ratio.
  • Optimizer: Adam optimizer [12] was used for training.
  • Hyperparameters: Optimal hyperparameters were referenced from baseline papers.
    • Embedding Dimension: Set to 64 for users and items.
    • Initialization: Xavier [5] initialization for model parameters.
    • Batch Size: 2048.
    • Max Epochs: 1000.
    • Early Stopping: Triggered if Recall@20 on the validation set did not improve for 20 consecutive epochs.
    • Learning Rate: 0.001, with a decay factor of 0.96 every 50 epochs.
    • Regularization: Weight decay of $1 \times 10^{-4}$ (for L2 regularization).
    • Contrastive Learning Loss Weight ($\lambda_{cl}$): 0.01.
    • MMD Loss Weight ($\lambda_{mmd}$): Explored over the candidate set [0.1, 0.15, 0.2].
    • Kernel Bandwidth ($\sigma$) for MMD: Explored over the candidate set [1.0, 1.5, 2.0].
    • Reduction Factor ($r$): A factor of 8 was applied for dimensionality reduction during feature extraction.
  • Reproducibility: The same random seed was used across all implementations to ensure result stability.
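For reference, the reported training settings can be collected into a configuration sketch like the one below; the key names are illustrative and do not correspond to the MMRec framework's actual configuration schema.

```python
# Hedged summary of the training setup described above (Python dict for illustration only).
config = {
    "embedding_dim": 64,              # user/item embedding size
    "batch_size": 2048,
    "max_epochs": 1000,
    "early_stop_patience": 20,        # epochs without Recall@20 improvement on validation
    "learning_rate": 1e-3,
    "lr_decay": 0.96,                 # applied every 50 epochs
    "weight_decay": 1e-4,             # L2 regularization
    "lambda_cl": 0.01,                # contrastive (InfoNCE) loss weight
    "lambda_mmd": [0.1, 0.15, 0.2],   # searched MMD weights
    "mmd_sigma": [1.0, 1.5, 2.0],     # searched kernel bandwidths
    "reduction_factor": 8,
    # A fixed random seed is used across all implementations (value not reported).
}
```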

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the superiority of the proposed MambaRec model compared to both general recommendation models and multimodal recommendation models.

The following are the results from Table 2 of the original paper:

Datasets Metrics General Models Multimodal Models
BPR LightGCN VBPR MMGCN GRCN SLMRec FREEDOM MGCN LGMRec MambaRec
Baby Recall@10 0.0382 0.0453 0.0425 0.0424 0.0534 0.0545 0.0627 0.0620 0.0644 0.0660*
Recall@20 0.0595 0.0728 0.0663 0.0668 0.0831 0.0837 0.0992 0.0964 0.1002 0.1013*
NDCG@10 0.0214 0.0246 0.0223 0.0223 0.0288 0.0296 0.0330 0.0339 0.0349 0.0363*
NDCG@20 0.0263 0.0317 0.0284 0.0286 0.0365 0.0371 0.0424 0.0427 0.0440 0.0454*
Sports Recall@10 0.0417 0.0542 0.0561 0.0386 0.0607 0.0676 0.0717 0.0729 0.0720 0.0763*
Recall@20 0.0633 0.0837 0.0857 0.0627 0.0922 0.1017 0.1089 0.1106 0.1068 0.1147*
NDCG@10 0.0232 0.0300 0.0307 0.0204 0.0325 0.0374 0.0385 0.0397 0.0390 0.0416*
NDCG@20 0.0288 0.0376 0.0384 0.0266 0.0406 0.0462 0.0481 0.0496 0.0480 0.0514*
Clothing Recall@10 0.0200 0.0338 0.0281 0.0224 0.0428 0.0461 0.0629 0.0641 0.0555 0.0673*
Recall@20 0.0295 0.0517 0.0410 0.0362 0.0663 0.0696 0.0941 0.0945 0.0828 0.0996*
NDCG@10 0.0111 0.0185 0.0157 0.0118 0.0227 0.0249 0.0341 0.0347 0.0302 0.0367*
NDCG@20 0.0135 0.0230 0.0190 0.0153 0.0287 0.0308 0.0420 0.0428 0.0371 0.0449*

From Table 2, the following key observations can be made:

  1. MambaRec's Superiority: MambaRec consistently achieves the highest Recall@K and NDCG@K scores across all three datasets (Baby, Sports, Clothing) and for both K values (10 and 20). This demonstrates its robust performance and effectiveness in generating personalized recommendations. For example, on the Baby dataset, MambaRec achieves Recall@20 of 0.1013 and NDCG@20 of 0.0454, outperforming the next best LGMRec (0.1002 and 0.0440, respectively).
  2. Limitations of General Models: BPR and LightGCN (general models) show the lowest performance. This is expected as they do not leverage multimodal content, making them susceptible to cold start and data sparsity issues inherent in recommendation tasks. VBPR, while incorporating visual features, performs only marginally better than LightGCN in some cases and significantly worse than advanced multimodal models, highlighting the limitations of simplistic multimodal fusion.
  3. Challenges for Existing Multimodal Models: Even advanced multimodal models like MMGCN, GRCN, SLMRec, FREEDOM, MGCN, and LGMRec fall short of MambaRec's performance. The paper attributes this to their struggles with:
    • Noise interference: Static or linear fusion strategies in models like MMGCN and GRCN fail to effectively filter noise from multimodal features.
    • Insufficient extraction of effective information: Models like MGCN, despite using multimodal information, may not effectively capture fine-grained semantics due to simpler projection and fusion.
    • Lack of global consistency: Models like SLMRec and BM3 may use contrastive objectives but lack global constraints on modality-level representation distributions, leading to semantic misalignment.
  4. MambaRec's Advantage through Multi-level Alignment: MambaRec's superior performance is attributed to its dual-level alignment strategy:
    • Local Alignment: The DREAM module, with its multi-scale convolutions and dual attention mechanisms, effectively extracts and refines fine-grained representations of image and text. This allows adaptive filtering of noise and enhancement of key modal features.
    • Global Alignment: The explicit alignment of image and text distributions using MMD and InfoNCE constraints alleviates cross-modal semantic bias, ensuring better consistency. This comprehensive approach allows MambaRec to better handle modality heterogeneity, filter noise, and achieve more precise and robust multimodal fusion.

6.2. Ablation Studies

To verify the effectiveness of each core module, ablation experiments were conducted from two perspectives: the impact of alignment modules and the contribution of different modality combinations.

The following figure (Figure 3 from the original paper) shows the ablation studies on the proposed MambaRec:

Figure 3 (image description): bar charts comparing Recall@20 and NDCG@20 on the Baby, Sports, and Clothing datasets for the full MambaRec model and its variants without global alignment (GA) and without local alignment (LA); the full model performs best.

6.2.1 Effect of Alignment Modules (RQ2)

The ablation study on alignment modules compares the full MambaRec model with two variants:

  • MambaRec w/o LA: MambaRec without the Local Alignment module (DREAM).

  • MambaRec w/o GA: MambaRec without the Global Alignment module (MMD and InfoNCE losses).

    As shown in Figure 3, the complete MambaRec model achieves the best performance across all datasets and metrics.

  • Impact of Local Alignment: Removing the Local Alignment module (w/o LA) leads to a significant performance drop, often the largest degradation among the variants. This suggests that the DREAM module is crucial for capturing fine-grained modal details and local associations between features, which are vital for high-quality fusion.

  • Impact of Global Alignment: Removing the Global Alignment module (w/o GA) also results in a notable decrease in performance. While Local Alignment focuses on details, Global Alignment ensures overall semantic consistency between modalities at the distribution level. Its absence weakens the fusion effect by allowing representational bias to persist.

    This confirms that both the local feature alignment and global distribution alignment modules are indispensable for MambaRec's superior performance, working synergistically to enhance cross-modal semantic modeling and fusion quality.

6.2.2 Effect of Modalities (RQ2)

To understand the contribution of individual modalities, experiments were conducted with different input conditions:

  • Text: Only textual modal content is used.

  • Visual: Only visual modal content is used.

  • Full: Combines both text and visual modalities (this is the full MambaRec model).

    The following are the results from Table 3 of the original paper:

    Datasets Modality R@10 R@20 N@10 N@20
    Baby Text 0.0561 0.0863 0.0307 0.0384
    Visual 0.0485 0.0761 0.0270 0.0341
    Full 0.0660 0.1013 0.0363 0.0454
    Sports Text 0.0685 0.1024 0.0373 0.0460
    Visual 0.0575 0.0852 0.0310 0.0382
    Full 0.0763 0.1147 0.0416 0.0514
    Clothing Text 0.0601 0.0900 0.0327 0.0403
    Visual 0.0406 0.0627 0.0217 0.0273
    Full 0.0673 0.0996 0.0367 0.0449

Table 3 shows that using Full input (combining both text and visual modalities) consistently yields the best performance across all datasets and metrics.

  • Complementary Nature: This outcome underscores the complementary nature of multimodal information. Textual modalities are effective at capturing abstract semantic attributes (e.g., "durable," "for hiking"), while visual modalities intuitively present concrete appearance characteristics (e.g., color, shape, style).
  • Individual Modality Performance: When relying on a single modality, Text features generally outperform Visual features. This might be because textual descriptions often contain more explicit semantic information relevant to user preferences for certain items. However, both single modalities perform worse than their combined use. The results clearly indicate that integrating diverse multimodal information provides a richer and more comprehensive understanding of items, leading to improved recommendation effectiveness.

6.3. Sensitivity Analysis (RQ3)

Sensitivity analysis was performed to understand how key hyperparameters affect MambaRec's overall effectiveness.

6.3.1 Effects of the weight of $\lambda_{cl}$

The weight of the contrastive learning loss ($\lambda_{cl}$) was varied to observe its impact on Recall@20 and NDCG@20 on the Baby and Sports datasets.

The following figure (Figure 4 from the original paper) shows how MambaRec varies with $\lambda_{cl}$:

Figure 4 (image description): line plots showing how Recall@20 and NDCG@20 change with the contrastive loss weight on the Baby (left) and Sports (right) datasets.

As shown in Figure 4:

  • Performance (both Recall@20 and NDCG@20) steadily improves as $\lambda_{cl}$ increases from $10^{-4}$ to 0.01. This indicates that a moderate contrastive signal helps MambaRec learn more discriminative and consistent modal representations.
  • However, when $\lambda_{cl}$ exceeds 0.01, the model's performance drops significantly.
  • Optimal Setting: The experimental results suggest that $\lambda_{cl} = 0.01$ is the optimal setting for both datasets.
  • Interpretation: A small weight for contrastive loss might mean the contrastive signal is too weak to effectively improve representation ability. Conversely, a large weight might cause the auxiliary task of contrastive learning to dominate the training process, diverting focus from the primary recommendation objective and potentially introducing unwanted biases or suppressing necessary information for recommendation.

6.3.2 Effects of the weight of $\lambda_{mmd}$

The impact of the Maximum Mean Discrepancy (MMD) weight ($\lambda_{mmd}$), in combination with the reduction factor, was analyzed on Recall@20 for the Baby and Sports datasets.

The following figure (Figure 5 from the original paper) shows how MambaRec varies with $\lambda_{mmd}$:

Figure 5 (image description): heatmaps of results on the Baby and Sports datasets under different combinations of the MMD weight and the reduction factor; each cell reports the performance of the corresponding configuration.

As depicted in Figure 5:

  • The results show a favorable performance region within the parameter space, with several combinations achieving comparable results.
  • Optimal Combination: The combination of reduction factor 8 and $\lambda_{mmd} = 0.15$ consistently performs well on both datasets. Other nearby parameter settings also deliver competitive performance.
  • MMD Weight Impact: $\lambda_{mmd}$ (the strength coefficient of modal alignment) significantly affects performance. Too small a value leads to insufficient alignment of the modal features and ineffective information fusion; too large a value forces excessive alignment, potentially introducing irrelevant noise or over-regularizing and thereby reducing the model's ability to discriminate between distinct items.
  • Moderate Compression: The interplay with the reduction factor suggests that moderate compression ratios (like 8) are beneficial. They remove redundant parameters and noise without severely degrading characterization capabilities, unlike excessive compression or no compression.

6.3.3 Effects of Reduction Factor

The reduction factor plays a crucial role in balancing model performance and computational resource consumption. Recall@20, model parameter count, and computation time were evaluated across different reduction factors for the Baby and Sports datasets.

The following figure (Figure 6 from the original paper) shows the impact of Reduction Factor:

Figure 6 (image description): line plots relating Recall@20 to computational resources (parameter count and time) on the Baby and Sports datasets under different reduction factors; the red curves show Recall@20, while the green and blue curves show the parameter ratio and runtime, respectively.

As shown in Figure 6:

  • Resource Reduction: As the reduction factor increases, both model parameters and computation time exhibit a continuous decline. For instance, with a reduction factor of 8, parameters are reduced by over 30%, and computation time is significantly decreased.
  • Performance Fluctuation: Recall@20 shows a non-linear fluctuation pattern. Uncompressed models (factor 1) achieve superior performance but at the highest resource cost. Excessive compression (very high factor) leads to significant parameter reduction but also notable performance degradation due to insufficient representation capability.
  • Optimal Trade-off: At a reduction factor of 8, Recall@20 reaches peak levels while simultaneously achieving substantial reductions in parameters and computation time.
  • Overfitting Mitigation: The paper notes that model compression not only reduces resource burden but can also suppress overfitting noise to some extent, thereby improving generalization stability. This highlights the importance of choosing a reasonable compression factor for resource-constrained deployment scenarios.

6.4. Visualization Analysis (RQ4)

To qualitatively assess MambaRec's effectiveness in multimodal fusion and modality alignment, a t-SNE [25] visualization of fusion features was performed, comparing MambaRec with the VBPR baseline. Gaussian kernel density estimation (KDE) [24] was used to visualize the distributions.

The following figure (Figure 7 from the original paper) shows the distribution of representations for the Baby dataset:

fig 7: Comparison plot; the left panel shows the distribution of VBPR fusion features and the right panel the distribution of MambaRec fusion features. The lower panels show the corresponding angular density distributions (left: VBPR; right: MambaRec). The distributions and densities indicate that MambaRec achieves better cross-modal fusion properties.

The following figure (Figure 8 from the original paper) shows the distribution of representations for the Sports dataset:

fig 8: Comparison plot; the left panel shows the distribution of VBPR fusion features and the right panel the distribution of MambaRec fusion features. The lower panels compare the angular density of the two methods, indicating higher fusion quality for MambaRec.

In these plots, the Baby dataset (Figure 7) is shown in red and the Sports dataset (Figure 8) in blue.

  • VBPR's Limitations: The embedding distribution of the VBPR model shows pronounced clustering, with features concentrated around a few modes. This indicates a representation degeneration problem [4, 19], in which features from different semantic categories (even across modalities) cluster too closely or overlap excessively in the embedding space, making it difficult for the model to discriminate between items or capture their distinctive characteristics, ultimately harming recommendation quality.
  • MambaRec's Advantage: In contrast, MambaRec exhibits a more uniform embedding distribution with wider coverage and a smoother kernel density estimation (KDE) curve, signifying a more discriminative, well-separated representation in which distinct semantic categories are better distinguished in the embedding space.
  • Role of Modality Alignment: The visualization confirms that MambaRec's modality alignment mechanism (both local and global) effectively alleviates the semantic degradation and modal overlap problems common in traditional methods. By fostering more distinct and consistent representations, MambaRec significantly improves the expressive power and representation structure of the multimodal recommendation system. This visual evidence supports the quantitative results and validates the effectiveness of the MambaRec model.
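For readers who want to reproduce this kind of qualitative check, below is a minimal sketch using scikit-learn's t-SNE and SciPy's Gaussian KDE; variable names and plotting details are assumptions, not the authors' scripts.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from scipy.stats import gaussian_kde

def plot_embedding_distribution(features: np.ndarray, title: str):
    """2-D t-SNE projection of fused item embeddings plus an angular KDE,
    in the spirit of Figures 7 and 8."""
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    angles = np.arctan2(xy[:, 1], xy[:, 0])     # angle of each point in the 2-D plane
    density = gaussian_kde(angles)
    grid = np.linspace(-np.pi, np.pi, 200)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
    ax1.scatter(xy[:, 0], xy[:, 1], s=3, alpha=0.5)
    ax1.set_title(title)
    ax2.plot(grid, density(grid))
    ax2.set_title("Angular KDE")
    plt.tight_layout()
    plt.show()

# plot_embedding_distribution(fused_item_embeddings, "MambaRec fusion features")
```

A flatter angular KDE and a wider scatter correspond to the more uniform, less degenerate representation attributed to MambaRec.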

7. Conclusion & Reflections

7.1. Conclusion Summary

In this paper, the authors proposed MambaRec, a novel multimodal recommendation framework designed to address critical limitations in existing systems, specifically insufficient fine-grained cross-modal associations and a lack of global distribution-level consistency. The core of MambaRec lies in its dual-pronged approach:

  1. Local Feature Alignment: Achieved through the Dilated Refinement Attention Module (DREAM), which utilizes multi-scale dilated convolutions combined with channel-wise and spatial attention to precisely align fine-grained semantic patterns between visual and textual modalities.
  2. Global Distribution Regularization: Ensured by integrating Maximum Mean Discrepancy (MMD) loss and contrastive learning to enforce semantic consistency across the overall distributions of the modalities.

Furthermore, MambaRec incorporates a dimensionality reduction scheme to reduce memory consumption and enhance scalability and deployment efficiency. Extensive experiments on real-world e-commerce datasets demonstrated that MambaRec consistently outperforms state-of-the-art baselines in fusion quality, generalization, and efficiency, while visualization studies confirmed its ability to mitigate representation degeneration.
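As a quick illustration of the kind of block DREAM describes, here is a minimal 1-D sketch combining multi-scale dilated convolutions with channel-wise and spatial attention (CBAM-style); the branch count, kernel sizes, and dilation rates are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class DilatedAttentionBlock(nn.Module):
    """ASPP-style multi-scale dilated convolutions followed by
    channel-wise and spatial attention, applied to (B, C, L) features."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)
        reduced = max(channels // 4, 1)
        # Channel attention: squeeze over positions, excite per channel
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, reduced, 1), nn.ReLU(inplace=True),
            nn.Conv1d(reduced, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention: one weight per position from the channel-mean map
        self.spatial_att = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        fused = self.fuse(multi)
        fused = fused * self.channel_att(fused)                               # channel reweighting
        fused = fused * self.spatial_att(fused.mean(dim=1, keepdim=True))     # positional reweighting
        return fused + x                                                      # residual connection
```

Applying such a block to projected visual and textual feature maps before fusion mirrors, at a sketch level, the local alignment role that DREAM plays in the framework.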

7.2. Limitations & Future Work

The paper explicitly states that MambaRec offers an efficient, robust, and scalable solution for multimodal recommendation, but it also notes that future work is needed to extend the framework to more complex modalities.

  • Current Modality Scope: The current work focuses on visual and textual modalities.
  • Future Directions: The authors suggest extending MambaRec to incorporate more complex modalities such as video and audio. This implies that the current framework might need adaptations or further developments to handle the temporal and richer structural complexities inherent in these modalities. For instance, video features might require 3D convolutions or recurrent neural networks within the DREAM module, and audio features would need specialized processing pipelines before alignment.

7.3. Personal Insights & Critique

The MambaRec paper presents a compelling solution to fundamental challenges in multimodal recommendation. The explicit focus on both fine-grained local alignment and global distribution consistency is a significant strength. Many previous works address one or the other, or implicitly handle them, but MambaRec's unified approach with DREAM and MMD/InfoNCE provides a robust framework.

Strengths:

  • Comprehensive Alignment: The DREAM module is well-designed, leveraging multi-scale convolutions (like ASPP) and dual attention (CBAM-like) to capture rich, context-aware information at a local level. This fine-grained fusion is crucial for understanding subtle semantic relationships.

  • Principled Global Regularization: The use of MMD is a strong theoretical choice for distribution alignment, complementing the InfoNCE loss which focuses on instance-level similarity. This dual regularization helps prevent representational bias and semantic misalignment, which are common pitfalls in multimodal learning.

  • Practical Efficiency: The dimensionality optimization mechanism demonstrates a practical understanding of real-world deployment constraints. Balancing performance with computational resources is vital, and the sensitivity analysis effectively illustrates this trade-off.

  • Clear Problem Statement: The paper clearly articulates the limitations of prior work, making MambaRec's motivations and innovations highly relevant.

Potential Issues/Critique:

  • Complexity of DREAM: While powerful, the DREAM module with five parallel branches, concatenation, and dual attention mechanisms might add significant architectural complexity. A more detailed analysis of its computational overhead relative to other fusion mechanisms (beyond just the dimensionality reduction) could strengthen the efficiency claims.

  • Hyperparameter Sensitivity: The sensitivity analysis for $\lambda_{cl}$ and $\lambda_{mmd}$ shows that performance is sensitive to these weights. While optimal values were found, the practical tuning of these parameters for new datasets might require considerable effort.

  • "Multi-View Encoder" Details: The Multi-View Encoder is mentioned as processing DREAM output to generate visual and text views, inspired by MGCN. However, its specific architecture and how it precisely interacts with DREAM and the global alignment modules are not detailed with formulas or diagrams. This part feels a bit like a black box compared to the thorough explanation of DREAM and MMD.

  • Generalizability of Modalities: While the paper suggests extending to video and audio, it's worth considering if the current DREAM architecture (especially its spatial convolutions) would be directly applicable or if specialized 3D or temporal attention mechanisms would be required for such modalities. The current design seems heavily biased towards image-like (2D) and text sequence features.

Transferability/Applicability: The principles of MambaRec are highly transferable. The idea of explicit local fine-grained alignment and global distribution consistency is relevant to any domain dealing with heterogeneous data sources, not just recommendation systems. For example:

  • Medical Image Analysis: Aligning features from different imaging modalities (e.g., MRI, CT, X-ray) for diagnosis.

  • Robotics/Perception: Fusing sensor data (e.g., LiDAR, camera, radar) for robust environmental understanding.

  • Natural Language Processing: Aligning textual information with related knowledge graphs or visual context.

Overall, MambaRec offers a significant step forward in robust multimodal representation learning for recommendation, providing a strong foundation for future research in cross-modal understanding and fusion.
