Spectrum-based Modality Representation Fusion Graph Convolutional Network for Multimodal Recommendation
TL;DR Summary
SMORE uses frequency-domain fusion with adaptive filtering to suppress modality-specific noise and balance uni- and multi-modal preferences, enhancing recommendation accuracy via multimodal graph learning on real datasets.
Abstract
Rongqing Kenneth Ong, Nanyang Technological University, Singapore, rongqing001@e.ntu.edu.sg; Andy W. H. Khong, Nanyang Technological University, Singapore, andykhong@ntu.edu.sg

Incorporating multi-modal features as side information has recently become a trend in recommender systems. To elucidate user-item preferences, recent studies focus on fusing modalities via concatenation, element-wise sum, or attention mechanisms. Despite having notable success, existing approaches do not account for the modality-specific noise encapsulated within each modality. As a result, direct fusion of modalities will lead to the amplification of cross-modality noise. Moreover, the variation of noise that is unique within each modality results in noise alleviation and fusion being more challenging. In this work, we propose a new Spectrum-based Modality Representation (SMORE) fusion graph recommender that aims to capture both uni-modal and fusion preferences while simultaneously suppressing modality noise. Specifically, SMORE projects the multi-modal features into the frequency
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Spectrum-based Modality Representation Fusion Graph Convolutional Network for Multimodal Recommendation
- Authors: Rongqing Kenneth Ong and Andy W. H. Khong. Their affiliation is Nanyang Technological University, Singapore.
- Journal/Conference: Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM '25). WSDM is a top-tier, highly competitive conference in the fields of information retrieval, data mining, and web search, making this a significant publication.
- Publication Year: 2025 (as listed in the paper's reference format).
- Abstract: The paper addresses a key limitation in existing multi-modal recommender systems: the amplification of modality-specific noise during feature fusion. To solve this, the authors propose a new model, SMORE (Spectrum-based Modality Representation). SMORE's core idea is to perform modality fusion in the frequency domain (spectral space) using the Fourier transform. This allows for adaptive noise filtering while capturing universal patterns between modalities. The model also includes a multi-modal graph learning module to capture item-item correlations and a modality-aware preference module to balance uni-modal and fused features for more accurate user preference modeling. Experiments on three real-world datasets confirm the model's effectiveness.
- Original Source Link: The paper is available at /files/papers/68f33f0ed77e2c20857d897a/paper.pdf.
2. Executive Summary
- Background & Motivation (Why):
  - Modern recommender systems increasingly use multi-modal data (e.g., images, text) to better understand user preferences. However, the methods used to combine these different types of data, such as simple concatenation or attention mechanisms, have a major flaw. Each modality contains its own unique noise (e.g., irrelevant text, blurry images). Fusing these noisy features directly can amplify the noise, corrupting the final item representation and harming recommendation quality.
  - The paper illustrates this problem in its first figure (shown below). A toy hammer and a jumpsuit are scored as highly similar (69.35%) because their text descriptions both mention "Hammer." Conversely, two similar travel pouches are scored as dissimilar (11.27%) because one image is blurry. This "modality-specific contamination" is a critical challenge.
  - The paper's core innovation is to tackle this problem by borrowing techniques from signal processing. It proposes that transforming features into the frequency domain makes it easier to separate the "signal" (useful patterns) from the "noise."
- Main Contributions / Findings (What): The paper introduces SMORE, a novel architecture with three main contributions:
  - A New Spectrum-based Fusion Scheme: Instead of fusing features in their original form, SMORE converts them to the frequency domain. There, it applies learnable filters to suppress noise in each modality's spectrum and in the fused spectrum, and combines the modalities with an inexpensive point-wise product. This is the central novelty of the paper.
  - A Multi-modal Graph Learning Module: SMORE constructs separate graphs to capture relationships between items based on individual modalities (e.g., visual similarity, textual similarity) and a fused modality graph. This allows it to learn from both high-order user-item interactions (collaborative signals) and the semantic relationships between items.
  - A Modality-Aware Preference Module: The model explicitly acknowledges that users have diverse preferences: some may care more about images, some about text, and some about the combination. This module learns to weigh and balance the uni-modal (single modality) and multi-modal (fused) features to create a final, personalized representation of user preferences.
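This image is the first figure in the paper, illustrating users' multi-modal preferences and the problem of modality-specific noise: (i) an example of a user's multi-modal preferences; (ii) contamination in the text descriptions gives an unrelated item pair an abnormally high similarity (69.35%); (iii) contamination from visual blur gives a related item pair an abnormally low similarity (11.27%).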
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Recommender Systems (RS): Algorithms designed to suggest relevant items to users, such as products on an e-commerce site or movies on a streaming service.
  - Collaborative Filtering (CF): A common recommendation technique based on the idea that users with similar past behaviors (e.g., purchases, ratings) will have similar future preferences.
  - Multi-modal Recommender Systems (MRSs): These systems go beyond user-item interaction history. They incorporate "side information" about items, such as images (visual modality) and text descriptions (textual modality), to build a richer understanding of items and user preferences.
  - Graph Neural Networks (GNNs): A class of neural networks designed to work with graph-structured data. In recommendation, the set of users and items can be represented as a bipartite graph, where an edge exists if a user has interacted with an item. GNNs learn embeddings for users and items by propagating information across this graph, capturing complex, high-order relationships (e.g., friends of friends). LightGCN is a popular, simplified GNN for recommendation that the paper builds upon.
  - Fourier Transform: A mathematical tool that decomposes a signal (like an audio wave, an image, or in this case, a feature vector) into its constituent frequencies. The result is a representation in the frequency domain (or spectrum). The core intuition is that the essential, global structure of a signal is often captured by low-frequency components, while noise and fine-grained details are captured by high-frequency components. By manipulating the signal in the frequency domain (e.g., filtering out high frequencies), one can denoise it. The paper uses the Fast Fourier Transform (FFT), an efficient algorithm to compute this.
- Previous Works:
  - Early MRSs: Models like VBPR and DeepStyle used pre-trained neural networks (like VGG for images) to extract features and then simply combined them with item ID embeddings using concatenation or summation. Their main drawback is the direct fusion, which is susceptible to noise.
  - GNN-based MRSs: LATTICE and FREEDOM build item-item similarity graphs for each modality and then fuse these graphs. However, they aggregate these graphs early on, potentially losing distinct uni-modal signals and still being vulnerable to noise during the graph fusion step. MGCN attempts to mitigate modality noise by injecting behavioral (user interaction) information into modality features. However, it uses simple average pooling for fusion, which may not effectively handle cross-modality noise.
  - Fusion Stages: The paper notes that fusion can happen at different stages: early (fusing raw features), intermediate (fusing within a model), or late (fusing final predictions). Most existing methods perform fusion in the spatial/sequential domain.
- Differentiation: SMORE differs from previous work in a fundamental way: it performs fusion in the frequency domain rather than in the conventional spatial/sequential domain. This approach allows SMORE to:
  - Denoise and Fuse Simultaneously: Learnable filters in the frequency domain selectively suppress noise components, both in each modality's spectrum and in the fused spectrum.
  - Capture Global Correlations Efficiently: A point-wise product in the frequency domain is equivalent to circular convolution in the spatial domain, so the model can capture interactions between modalities at the log-linear $O(d \log d)$ cost of an FFT over a $d$-dimensional embedding, rather than the quadratic cost of attention.
  - Explicitly Model Uni- and Multi-modal Preferences: Unlike models that just create one fused representation, SMORE maintains separate uni-modal and fused feature pathways and combines them at the end, better reflecting diverse user interests.
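The equivalence between a point-wise product of spectra and circular convolution can be checked numerically. The sketch below is illustrative only: it uses a naive $O(n^2)$ DFT from the standard library so that it stays self-contained (a real implementation would call an FFT routine), and the two toy vectors are made-up stand-ins for modality embeddings.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform; an FFT returns the same values faster."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_convolution(a, b):
    """Direct O(n^2) circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[j] * b[(t - j) % n] for j in range(n)) for t in range(n)]

# Toy vectors standing in for two modality embeddings (values are made up).
a = [1.0, 2.0, 0.0, -1.0]
b = [0.5, 0.0, 1.0, 0.0]

direct = circular_convolution(a, b)
product_of_spectra = [p * q for p, q in zip(dft(a), dft(b))]
via_spectrum = [z.real for z in idft(product_of_spectra)]

# The two routes should agree up to floating-point error.
max_gap = max(abs(d - s) for d, s in zip(direct, via_spectrum))
print(direct, max_gap)
```

Both routes produce the same vector up to rounding, which is why multiplying spectra is a cheap way to mix every position of one embedding with every position of the other.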
4. Methodology (Core Technology & Implementation)
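The SMORE architecture, shown in the figure below, is composed of three main modules.
This image is a schematic of the overall architecture of the multi-modal recommender model, showing feature extraction for the textual, visual, and behavioral modalities, spectrum modality fusion, graph convolution learning, and the modality-aware preference module, along with the data flow between them.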
4.1 Spectrum Modality Fusion
This module is the core innovation of SMORE. It aims to fuse visual and textual features while denoising them. The symbols below are shorthand chosen for this summary and may not match the paper's notation exactly.
- Projection: The raw modality features $\mathbf{e}_{i,m}$ for an item $i$ and modality $m$ (e.g., visual $v$ or textual $t$) are first projected into a common embedding space of dimension $d$ using a Multi-Layer Perceptron (MLP).
  $$\tilde{\mathbf{e}}_{i,m} = \mathbf{W}_{m}\,\mathbf{e}_{i,m} + \mathbf{b}_{m}$$
  - $\tilde{\mathbf{e}}_{i,m}$: The projected feature vector for item $i$ and modality $m$.
  - $\mathbf{e}_{i,m}$: The raw feature vector.
  - $\mathbf{W}_{m}, \mathbf{b}_{m}$: Learnable weight matrix and bias for the MLP.
- Transformation to Frequency Domain: The projected features are transformed into the frequency domain using the Fast Fourier Transform (FFT).
  $$\mathbf{s}_{i,m} = \mathcal{F}(\tilde{\mathbf{e}}_{i,m})$$
  - $\mathcal{F}(\cdot)$: The FFT function.
  - $\mathbf{s}_{i,m}$: The features in the frequency domain (also called the spectrum). These are complex-valued numbers.
- Uni-modal Denoising: A learnable, modality-specific dynamic filter is applied to the spectrum of each modality. This is done via a point-wise product with a trainable complex-valued weight matrix. This filter learns to "select" important frequencies and suppress noisy or irrelevant ones.
  $$\bar{\mathbf{s}}_{i,m} = \mathbf{W}^{c}_{m} \odot \mathbf{s}_{i,m}$$
  - $\bar{\mathbf{s}}_{i,m}$: The denoised spectral features for a single modality.
  - $\mathbf{W}^{c}_{m}$: The learnable complex-valued weights of the filter for modality $m$.
  - $\odot$: The element-wise (Hadamard) product.
- Cross-modal Fusion and Denoising: To create the fused representation, the original (pre-filtered) spectral features of all modalities are first combined using a point-wise product. This operation is computationally cheap but powerful, as it corresponds to circular convolution in the spatial domain, capturing rich interactions. Then, another dynamic filter is applied to denoise this fused spectrum.
  $$\bar{\mathbf{s}}_{i,f} = \mathbf{W}^{c}_{f} \odot \prod_{m \in \{v,t\}} \mathbf{s}_{i,m}$$
  - $\bar{\mathbf{s}}_{i,f}$: The denoised fused spectral features.
  - $\prod$: The point-wise product operator across all modalities.
  - $\mathbf{W}^{c}_{f}$: The dynamic fusion filter, similar to the uni-modal filter but for the fused spectrum.
- Transformation back to Spatial Domain: The denoised uni-modal and fused spectral features are transformed back to the original feature space using the Inverse Fast Fourier Transform (IFFT), denoted as $\mathcal{F}^{-1}(\cdot)$.
  $$\mathbf{x}_{i,m} = \mathcal{F}^{-1}(\bar{\mathbf{s}}_{i,m}), \qquad \mathbf{x}_{i,f} = \mathcal{F}^{-1}(\bar{\mathbf{s}}_{i,f})$$
  - $\mathbf{x}_{i,m}$ and $\mathbf{x}_{i,f}$ are the final denoised uni-modal and fused item features from this module.
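The five steps of this module can be sketched end to end for a single item. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `spectrum_fusion`, the fixed (non-learned) filters, the naive DFT, and the choice to keep only the real part after the inverse transform are all hypothetical, and the projection step is skipped by starting from already-projected toy vectors.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used here only to keep the sketch dependency-free."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def hadamard(a, b):
    """Element-wise product of two equal-length sequences."""
    return [p * q for p, q in zip(a, b)]

def spectrum_fusion(features, uni_filters, fusion_filter):
    """features: modality -> projected real vector; filters: complex vectors of the same length.

    Returns (denoised uni-modal vectors per modality, denoised fused vector).
    """
    spectra = {m: dft(x) for m, x in features.items()}
    # Uni-modal denoising: filter each spectrum, then return to the spatial domain.
    uni = {m: [z.real for z in idft(hadamard(uni_filters[m], spectra[m]))]
           for m in spectra}
    # Fusion: point-wise product of the unfiltered spectra, then the fusion filter.
    fused_spectrum = [1 + 0j] * len(fusion_filter)
    for s in spectra.values():
        fused_spectrum = hadamard(fused_spectrum, s)
    fused = [z.real for z in idft(hadamard(fusion_filter, fused_spectrum))]
    return uni, fused

# Toy projected features for one item (made-up values, d = 4).
features = {"visual": [1.0, 2.0, 0.0, -1.0], "textual": [0.5, 0.0, 1.0, 0.0]}
identity = [1 + 0j] * 4                      # passes every frequency unchanged
remove_mean = [0j, 1 + 0j, 1 + 0j, 1 + 0j]   # zeroes the zero-frequency bin

uni_id, fused_id = spectrum_fusion(
    features, {"visual": identity, "textual": identity}, identity)
uni_hp, _ = spectrum_fusion(
    features, {"visual": remove_mean, "textual": identity}, identity)
print(uni_id["visual"], fused_id, uni_hp["visual"])
```

With identity filters the uni-modal outputs reproduce the inputs and the fused output is the circular convolution of the two vectors. The `remove_mean` filter subtracts the mean of the visual vector, a simple example of a filter suppressing one frequency component; in SMORE the filter weights are trained rather than fixed.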
4.2 Multi-modal Graph Learning
This module learns representations from two kinds of views: item-item views (one per modality plus a fusion view) and a user-item view.
- Item-Item Modal-Specific and Fusion Views:
  - For each modality $m$, an item-item similarity graph is built. The edge weight between two items is their cosine similarity based on raw modality features.
  - The graph is sparsified by keeping only the top-$k$ strongest connections for each item, reducing noise and computational cost.
  - A fusion affinity graph is created by taking the maximum edge weight across all individual modality graphs. This preserves the strongest semantic link between any two items, regardless of which modality it came from.
  - The denoised features from the previous module are refined by a gating mechanism that incorporates behavioral information (from item ID embeddings).
  - Finally, a single layer of graph convolution (inspired by LightGCN) is applied on the uni-modal and fusion graphs to propagate information and learn semantically enriched item representations for each modality and for the fused view.
- User-Item Behavioral View:
  - This view focuses purely on collaborative signals. It uses the standard LightGCN architecture to learn user and item embeddings by propagating them through the user-item interaction graph over $L$ layers. This captures high-order connectivity patterns (e.g., "users who liked items that you liked...").
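The construction of the item-item graphs described above can be sketched as follows. This is a simplified illustration, not the paper's code: the helper names, the dictionary-based graph format, the exclusion of self-loops, and the toy feature vectors are assumptions, and any normalization of the adjacency matrix is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def topk_similarity_graph(features, k):
    """Return {item: {neighbor: weight}} keeping the k most similar other items."""
    n = len(features)
    graph = {}
    for i in range(n):
        sims = [(cosine(features[i], features[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        graph[i] = {j: s for s, j in sims[:k]}
    return graph

def max_fusion(graphs):
    """Fuse graphs by keeping, for every edge, its maximum weight across graphs."""
    fused = {}
    for graph in graphs:
        for i, neighbors in graph.items():
            row = fused.setdefault(i, {})
            for j, weight in neighbors.items():
                row[j] = max(weight, row.get(j, float("-inf")))
    return fused

# Toy raw features for three items (made-up values).
visual = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
textual = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

visual_graph = topk_similarity_graph(visual, k=1)
textual_graph = topk_similarity_graph(textual, k=1)
fused_graph = max_fusion([visual_graph, textual_graph])
print(fused_graph[0])
```

Item 0 is closest to item 1 visually and to item 2 textually, so the fused graph keeps both links. A consequence of taking the maximum after sparsification is that an item can end up with more than $k$ neighbors in the fusion graph.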
4.3 Modality-Aware Preference Module
This module intelligently combines the behavioral, uni-modal, and fused representations to model final user preferences.
- Balancing Uni-modal and Fusion Features: The model computes attention scores to determine the importance of each uni-modal representation. Crucially, these scores are calculated based on the fused features. This allows the model to use the holistic, cross-modal understanding to decide which individual modality is more relevant for a given item. The uni-modal features are then aggregated using these scores.
- Gating with Behavioral Signals: The high-order behavioral embeddings are used to create preference "gates". These gates act like switches that select which parts of the uni-modal and fused features are most relevant to a user's collaborative behavior.
- Final Side Features: The gated uni-modal and fusion features are combined to produce the final multi-modal side information features. Note: The paper appears to have a slight typo in Eq. (21); the equation should use the aggregated uni-modal features and the enriched fusion features.
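The flow of this module (score the uni-modal features against the fused feature, aggregate, gate, combine) can be sketched roughly as below. Everything specific here is an assumption made for illustration: the source does not give the exact equations, so dot-product scoring, a parameter-free sigmoid gate, and a plain sum for the final combination are stand-ins for the learned projections the paper presumably uses.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate_uni_modal(uni, fused):
    """Weight each uni-modal vector by how strongly it agrees with the fused vector."""
    modalities = sorted(uni)
    weights = softmax([dot(uni[m], fused) for m in modalities])
    aggregated = [sum(w * uni[m][j] for w, m in zip(weights, modalities))
                  for j in range(len(fused))]
    return dict(zip(modalities, weights)), aggregated

def gate(features, behavioral):
    """Scale each feature dimension by a gate derived from the behavioral embedding."""
    return [f * sigmoid(b) for f, b in zip(features, behavioral)]

def side_features(uni, fused, behavioral):
    """Combine the gated aggregated uni-modal features with the gated fused features."""
    weights, aggregated = aggregate_uni_modal(uni, fused)
    combined = [a + f for a, f in zip(gate(aggregated, behavioral), gate(fused, behavioral))]
    return weights, combined

# Toy item representations (made-up values).
uni = {"textual": [1.0, 0.0], "visual": [0.0, 1.0]}
fused = [2.0, 0.0]
behavioral = [0.0, 0.0]

weights, side = side_features(uni, fused, behavioral)
print(weights, side)
```

In this toy case the textual vector agrees with the fused vector, so it receives the larger attention weight, which mirrors the idea of letting the cross-modal view decide which single modality matters more for an item.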
4.4 Prediction and Optimization
- Final Representations: The final user and item representations are obtained by simply summing their behavioral embeddings and their final side feature embeddings.
- Optimization: The model is trained with a joint loss function:
  - BPR Loss ($\mathcal{L}_{bpr}$): A pairwise ranking loss that pushes the model to score observed user-item interactions higher than unobserved (negative) ones. This is standard for implicit feedback recommendation.
  - Contrastive Loss ($\mathcal{L}_{c}$): An auxiliary InfoNCE loss that encourages the behavioral embeddings and the multi-modal side embeddings for the same user to be similar, while being dissimilar from those of other users. This helps align the two representation spaces.
  - L2 Regularization: Prevents overfitting.

  The final loss is: $\mathcal{L} = \mathcal{L}_{bpr} + \lambda_{c}\,\mathcal{L}_{c} + \lambda_{E}\,\lVert \Theta \rVert_2^2$, where $\lambda_{c}$ weights the contrastive term, $\lambda_{E}$ weights the regularizer, and $\Theta$ denotes the model parameters.
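The three terms of the objective can be sketched in plain Python as below. This is a hedged illustration rather than the training code: the function names, the temperature of 0.2, the regularization weight, and the use of cosine similarity inside InfoNCE are assumptions; only the contrastive weight of 0.03 is taken from the hyperparameter discussion in this article.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def bpr_loss(user, pos_item, neg_item):
    """-log(sigmoid(score_pos - score_neg)) for one (user, positive, negative) triple."""
    z = -(dot(user, pos_item) - dot(user, neg_item))
    # Stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def info_nce(anchors, positives, temperature=0.2):
    """Mean InfoNCE: row i of `anchors` should match row i of `positives`."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, candidate) / temperature for candidate in p]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

def joint_loss(bpr, contrastive, params, contrastive_weight=0.03, reg_weight=1e-4):
    """BPR + weighted contrastive term + weighted squared L2 norm of the parameters."""
    return bpr + contrastive_weight * contrastive + reg_weight * sum(p * p for p in params)

# Toy check: matching behavioral/side embeddings give a lower contrastive loss.
behavioral = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(behavioral, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(behavioral, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The contrastive term is small when each user's behavioral embedding points the same way as that user's side embedding and large when the pairing is swapped, which is the alignment effect described above.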
5. Experimental Setup
- Datasets: Three real-world datasets from Amazon reviews were used: Baby, Sports and Outdoors (Sports), and Clothing, Shoes and Jewelry (Clothing). The data was pre-processed using a 5-core setting (keeping only users and items with at least 5 interactions).

  This is a manual transcription of Table 1 from the paper.

  | Dataset | #User | #Item | #Interaction | Density |
  | --- | --- | --- | --- | --- |
  | Baby | 19,445 | 7,050 | 160,792 | 0.117% |
  | Sports | 35,598 | 18,357 | 296,337 | 0.045% |
  | Clothing | 39,387 | 23,033 | 278,677 | 0.031% |

- Evaluation Metrics: Top-K recommendation performance was evaluated using two standard metrics:
  - Recall@K:
    - Conceptual Definition: This metric measures the proportion of relevant items (from the test set) that are successfully found in the top-K recommended items. It answers the question: "Out of all the items the user actually liked, what fraction did we recommend in the top K?" It is a measure of coverage.
    - Mathematical Formula: $$\text{Recall@}K = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{|\hat{R}_u^{K} \cap R_u|}{|R_u|}$$
    - Symbol Explanation:
      - $\mathcal{U}$: The set of all users in the test set.
      - $R_u$: The set of items the user $u$ interacted with in the test set.
      - $\hat{R}_u^{K}$: The set of top-K items recommended to user $u$.
  - Normalized Discounted Cumulative Gain (NDCG)@K:
    - Conceptual Definition: This metric evaluates the ranking quality of the recommendations. It gives higher scores to models that place relevant items higher up in the top-K list. It is more sophisticated than Recall because it rewards correct ordering.
    - Mathematical Formula: $$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}, \qquad \text{DCG@}K = \sum_{k=1}^{K} \frac{rel(k)}{\log_2(k+1)}$$ The per-user values are averaged over all test users.
    - Symbol Explanation:
      - $rel(k)$: A binary value indicating if the item at rank $k$ is relevant (1) or not (0).
      - $\text{DCG@}K$: Discounted Cumulative Gain, which sums the relevance scores penalized by their rank.
      - $\text{IDCG@}K$: Ideal DCG, the maximum possible DCG score if all relevant items were ranked at the very top. Normalization by IDCG ensures the score is between 0 and 1.
- Baselines: SMORE was compared against two types of models:
  - General Recommenders: BPR-MF (a classic matrix factorization method) and LightGCN (a strong GNN baseline).
  - Multi-modal Recommenders: VBPR, MMGCN, GRCN, SLMRec, BM3, MGCN, and FREEDOM, representing a wide range of state-of-the-art MRSs.
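The two evaluation metrics defined in this section can be computed as in the following sketch. It assumes binary relevance, ranked recommendation lists without duplicates, and averaging over users who have at least one test item; the function names and the toy data are illustrative, not taken from the paper's code.

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of a user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """DCG of the top-k list divided by the DCG of an ideal ranking (binary relevance)."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def mean_over_users(metric, recommendations, ground_truth, k):
    """Average a per-user metric over users with at least one test item."""
    users = [u for u in ground_truth if ground_truth[u]]
    return sum(metric(recommendations[u], ground_truth[u], k) for u in users) / len(users)

# Toy example: ranked lists and held-out test items for two users.
recommendations = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y", "z", "w"]}
ground_truth = {"u1": {"a", "d"}, "u2": {"z"}}

recall = mean_over_users(recall_at_k, recommendations, ground_truth, k=3)
ndcg = mean_over_users(ndcg_at_k, recommendations, ground_truth, k=3)
print(recall, ndcg)
```

For user `u1` only one of the two test items is in the top 3, so Recall@3 is 0.5; for `u2` the single test item is found but at rank 3, so Recall@3 is 1.0 while NDCG@3 is only 0.5. This shows how NDCG penalizes hits that appear lower in the list.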
6. Results & Analysis
- Core Results (RQ1):

  This is a manual transcription of Table 2 from the paper. BPR and LightGCN are general recommenders; the remaining columns are multi-modal recommenders.

  | Datasets | Metrics | BPR | LightGCN | VBPR | MMGCN | GRCN | SLMRec | BM3 | MGCN | FREEDOM | SMORE |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | Baby | Recall@10 | 0.0382 | 0.0453 | 0.0425 | 0.0424 | 0.0534 | 0.0545 | 0.0548 | 0.0616 | 0.0626 | 0.0680* |
  | Baby | Recall@20 | 0.0595 | 0.0728 | 0.0663 | 0.0668 | 0.0831 | 0.0837 | 0.0876 | 0.0943 | 0.0986 | 0.1035* |
  | Baby | NDCG@10 | 0.0207 | 0.0246 | 0.0223 | 0.0223 | 0.0288 | 0.0296 | 0.0297 | 0.0330 | 0.0327 | 0.0365* |
  | Baby | NDCG@20 | 0.0263 | 0.0317 | 0.0284 | 0.0286 | 0.0365 | 0.0371 | 0.0381 | 0.0414 | 0.0420 | 0.0457* |
  | Sports | Recall@10 | 0.0417 | 0.0542 | 0.0561 | 0.0386 | 0.0607 | 0.0676 | 0.0613 | 0.0736 | 0.0724 | 0.0762* |
  | Sports | Recall@20 | 0.0633 | 0.0837 | 0.0857 | 0.0627 | 0.0922 | 0.1017 | 0.0940 | 0.1105 | 0.1089 | 0.1142* |
  | Sports | NDCG@10 | 0.0232 | 0.0300 | 0.0307 | 0.0204 | 0.0325 | 0.0374 | 0.0339 | 0.0403 | 0.0390 | 0.0408* |
  | Sports | NDCG@20 | 0.0288 | 0.0376 | 0.0384 | 0.0266 | 0.0406 | 0.0462 | 0.0424 | 0.0498 | 0.0484 | 0.0506* |
  | Clothing | Recall@10 | 0.0200 | 0.0338 | 0.0281 | 0.0224 | 0.0428 | 0.0461 | 0.0418 | 0.0649 | 0.0635 | 0.0659* |
  | Clothing | Recall@20 | 0.0295 | 0.0517 | 0.0410 | 0.0362 | 0.0663 | 0.0696 | 0.0636 | 0.0971 | 0.0938 | 0.0987* |
  | Clothing | NDCG@10 | 0.0111 | 0.0185 | 0.0157 | 0.0118 | 0.0227 | 0.0249 | 0.0225 | 0.0356 | 0.0340 | 0.0360* |
  | Clothing | NDCG@20 | 0.0135 | 0.0230 | 0.0190 | 0.0153 | 0.0287 | 0.0308 | 0.0281 | 0.0438 | 0.0417 | 0.0443* |

  SMORE consistently outperforms all baseline models across all three datasets and all metrics. In Recall@20 terms, the margin over the strongest baselines (MGCN and FREEDOM) is largest on Baby, where the score rises from 0.0986 (FREEDOM) to 0.1035, roughly a 5% relative gain, and smallest on Clothing, where it rises from 0.0971 (MGCN) to 0.0987, under 2%. LightGCN, a general recommender with no modality input, beats MMGCN on every dataset and metric and beats VBPR on Baby and Clothing. This is consistent with the paper's core premise that naive fusion of modalities can hurt performance through noise amplification.
- Ablation Studies (RQ2):
  This image is a chart showing the ablation results for the SMORE model. Using Recall@20 and NDCG@20, it compares the full model against variants with each of the three key modules (SMF, MMGL, MAPM) removed, on the Baby, Sports, and Clothing datasets, reflecting the effect of each module on recommendation quality.

  Figure 3 shows the results of removing key components from SMORE:
  - w/o SMF: SMORE without Spectrum Modality Fusion.
  - w/o MMGL: SMORE without Multi-modal Graph Learning.
  - w/o MAPM: SMORE without the Modality-Aware Preference Module.

  All three components contribute, as removing any of them leads to a drop in performance. The most significant drop occurs when the MMGL module is removed, highlighting the importance of capturing both collaborative signals and semantic item-item relationships via the graph learning structure.

  The paper also analyzes the contribution of each modality type:

  This is a manual transcription of Table 3 from the paper.

  | Datasets | Modality | R@10 | R@20 | N@10 | N@20 |
  | --- | --- | --- | --- | --- | --- |
  | Baby | Text | 0.0646 | 0.0996 | 0.0341 | 0.0431 |
  | Baby | Visual | 0.0533 | 0.0854 | 0.0290 | 0.0373 |
  | Baby | Fusion | 0.0625 | 0.0964 | 0.0331 | 0.0418 |
  | Baby | Full | 0.0680 | 0.1035 | 0.0365 | 0.0457 |
  | Sports | Text | 0.0727 | 0.1099 | 0.0392 | 0.0488 |
  | Sports | Visual | 0.0592 | 0.0903 | 0.0323 | 0.0404 |
  | Sports | Fusion | 0.0729 | 0.1091 | 0.0392 | 0.0486 |
  | Sports | Full | 0.0762 | 0.1142 | 0.0408 | 0.0506 |
  | Clothing | Text | 0.0631 | 0.0945 | 0.0343 | 0.0422 |
  | Clothing | Visual | 0.0443 | 0.0661 | 0.0241 | 0.0296 |
  | Clothing | Fusion | 0.0621 | 0.0937 | 0.0342 | 0.0422 |
  | Clothing | Full | 0.0659 | 0.0987 | 0.0360 | 0.0443 |

  The Full model, which uses uni-modal (Text, Visual) and Fusion information, performs the best on every dataset and metric. This indicates that SMORE benefits from complementary signals from all sources and that modeling both uni-modal and fusion preferences is beneficial. Notably, SMORE using only Text still outperforms many strong baselines that use both modalities (on Baby, its Recall@20 of 0.0996 exceeds that of every baseline in Table 2), which suggests that its denoising is effective for noisy textual data. The Visual-only variant trails the others by a wide margin on all three datasets.
- Selection of Key Hyperparameters (RQ3):
  This image is a chart showing how SMORE's performance changes with the value of the contrastive loss weight, using Recall@20 and NDCG@20 as evaluation metrics. The horizontal axis is the weight and the vertical axes are the two metric values.

  Figure 4 shows that the contrastive loss is important (the setting without it performs poorly), but its weight should be small. A small weight works best (0.03 is one of the two best-performing values), as a large weight can make the model focus too much on the auxiliary task.

  This image is a chart showing how SMORE's performance on the Baby and Sports datasets varies with the neighbor counts of the item-item graphs. The left heatmap is for Baby and the right one is for Sports; color intensity reflects the metric value.

  Figure 5 shows the impact of $k$, the number of neighbors in the item-item graphs. The optimal setting varies by dataset. For the Sports dataset, a smaller $k$ is better, while for the Baby dataset, a larger $k$ for the visual graph is beneficial. This indicates that the density of useful semantic connections differs across modalities and datasets, making $k$ a key parameter to tune.
- Impact of Fusion in Frequency Domain (RQ4):
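  This image is a chart showing the distribution of the fused features of VBPR and SMORE on the Baby dataset. The top panels are scatter plots and the bottom panels are density curves of the angle distribution, reflecting how the two methods differ in the distribution of their fused features.

  This image is a chart showing the distribution of the fused features of VBPR and SMORE on the Sports dataset. The top panels are scatter plots of the fused features in two dimensions and the bottom panels are the corresponding angle densities, showing that SMORE's fused features have a more uniform angle distribution and lower noise.

  Figures 6 and 7 visualize the learned fusion features of SMORE compared to the baseline VBPR. The features from VBPR are heavily clustered and show a non-uniform distribution, a classic sign of representation degeneration, where the model learns to map many different items to a small region of the embedding space. This is likely caused by the noise amplification problem. In contrast, SMORE's features are spread out much more uniformly. This suggests that SMORE's spectrum-based fusion preserves more information, avoids representation collapse, and learns more discriminative embeddings for items.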
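The size of SMORE's margin in the core results can be recomputed directly from the Recall@20 rows of Table 2. The snippet below is a small worked calculation; the helper name `relative_gain` is hypothetical, and only the two strongest baselines are included because they hold the best baseline score on each dataset.

```python
# Recall@20 values copied from Table 2 (SMORE and its two strongest baselines).
recall_at_20 = {
    "Baby": {"MGCN": 0.0943, "FREEDOM": 0.0986, "SMORE": 0.1035},
    "Sports": {"MGCN": 0.1105, "FREEDOM": 0.1089, "SMORE": 0.1142},
    "Clothing": {"MGCN": 0.0971, "FREEDOM": 0.0938, "SMORE": 0.0987},
}

def relative_gain(scores, model="SMORE"):
    """Return (best baseline name, percentage gain of `model` over that baseline)."""
    best_name, best_score = max(
        ((name, score) for name, score in scores.items() if name != model),
        key=lambda pair: pair[1],
    )
    return best_name, 100.0 * (scores[model] - best_score) / best_score

gains = {dataset: relative_gain(scores) for dataset, scores in recall_at_20.items()}
for dataset, (baseline, pct) in gains.items():
    print(f"{dataset}: +{pct:.2f}% over {baseline}")
```

The gains come out at roughly 5.0% on Baby (over FREEDOM), 3.3% on Sports, and 1.6% on Clothing (both over MGCN), so the advantage is real but varies considerably by dataset.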
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies and addresses the critical problem of cross-modality noise amplification in multi-modal recommendation. The proposed model, SMORE, introduces a novel and highly effective fusion strategy based on Fourier transforms. By moving the fusion and denoising process to the frequency domain, SMORE can adaptively filter out modality-specific noise and efficiently capture universal patterns. Combined with a multi-view graph learning module and a preference-aware balancing mechanism, SMORE achieves state-of-the-art performance, demonstrating its ability to model both uni-modal and fusion preferences more accurately.
- Limitations & Future Work:
  - Computational Complexity: While FFT is efficient (O(N log N)), it still adds computational overhead compared to simple fusion methods like concatenation or summation, which could be a factor for extremely large-scale systems.
  - Scalability to More Modalities: The paper focuses on visual and textual modalities. While the framework is generalizable, the point-wise product for fusion might become less effective or harder to interpret as the number of modalities grows significantly.
  - Hyperparameter Sensitivity: The model has several key hyperparameters (e.g., the contrastive loss weight and the number of neighbors $k$ in the item-item graphs) that require careful tuning for optimal performance, as shown in the analysis.
- Personal Insights & Critique:
  - The application of signal processing concepts (specifically, the Fourier transform) to the problem of modality fusion in recommender systems is the standout contribution of this paper. It's an elegant, principled approach to denoising that moves beyond more heuristic methods. This work could open up a new research direction in representation learning, where techniques from classical signal/image processing are adapted for deep learning models.
  - The design of the modality-aware preference module is also very intuitive. Using the fused representation to "attend" to the uni-modal ones is a clever way to balance the specific and the general, reflecting a more realistic model of user decision-making.
  - A potential area for improvement could be to make the filters in the frequency domain more sophisticated. The current model uses a simple learnable point-wise multiplication. More advanced filter designs (e.g., based on the content of the features themselves) could potentially yield even better denoising.
  - Overall, this is a strong paper with a novel core idea, a well-designed architecture, and convincing experimental validation. It provides a clear solution to a well-defined and important problem in the field of multi-modal recommendation.