Collaborative Knowledge Base Embedding for Recommender Systems






This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper proposes CKE, integrating heterogeneous knowledge graph embeddings and deep learning for multi-modal item representation, enhancing recommender systems beyond traditional collaborative filtering.

Abstract

Collaborative Knowledge Base Embedding for Recommender Systems

Fuzheng Zhang †, Nicholas Jing Yuan †, Defu Lian ‡, Xing Xie †, Wei-Ying Ma †
† Microsoft Research
‡ Big Data Research Center, University of Electronic Science and Technology of China
{fuzzhang,nicholas.yuan,xingx,wyma}@microsoft.com, dove.ustc@gmail.com

ABSTRACT

Among different recommendation techniques, collaborative filtering usually suffer from limited performance due to the sparsity of user-item interactions. To address the issues, auxiliary information is usually used to boost the performance. Due to the rapid collection of information on the web, the knowledge base provides heterogeneous information including both structured and unstructured data with different semantics, which can be consumed by various applications. In this paper, we investigate how to leverage the heterogeneous information in a knowledge base to improve the quality of recommender systems. First, by exploiting the knowledge base, we design three components to extract items’ semantic representations from structural content, textual content and visual content, respectively. To be specific, we adopt a heterogeneous network e


1. Bibliographic Information

1.1. Title

The paper is titled Collaborative Knowledge Base Embedding for Recommender Systems, as stated in its header. It proposes a method to improve recommender systems by leveraging heterogeneous information from a knowledge base.

1.2. Authors

The authors are Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Their affiliations are Microsoft Research (Fuzheng Zhang, Nicholas Jing Yuan, Xing Xie, Wei-Ying Ma) and Big Data Research Center, University of Electronic Science and Technology of China (Defu Lian). Their research backgrounds appear to be in areas such as recommender systems, knowledge bases, machine learning, and deep learning, given the topics addressed in the paper.

1.3. Journal/Conference

The publication venue is not explicitly stated in the provided text, but it is typical for such research to be published in major conferences related to artificial intelligence, data mining, or information retrieval (e.g., KDD, WWW, SIGIR, AAAI, IJCAI) or in reputable journals within these fields. The reference section provides clues, with many citations to ACM and IEEE conferences/journals.

1.4. Publication Year

The publication year is not explicitly stated in the provided text. However, a reference to "Online Information Review (2015)" and "KDD '15" suggests it is likely published around 2015.

1.5. Abstract

The paper addresses the common problem of collaborative filtering (CF) systems suffering from data sparsity, which limits their performance. To overcome this, it proposes integrating auxiliary information from knowledge bases (KBs). The core idea is to leverage the heterogeneous information (structured, textual, and visual data) available in a knowledge base.

The methodology involves three main components for extracting semantic representations of items:

Structural content: A heterogeneous network embedding method called TransR is used to capture item relationships and node heterogeneity.
Textual content: Stacked denoising auto-encoders (SDAE), a deep learning technique, extracts textual representations.
Visual content: Stacked convolutional auto-encoders (SCAE), another deep learning technique, extracts visual representations.

These extracted item representations are then combined with collaborative filtering in an integrated framework called Collaborative Knowledge Base Embedding (CKE). CKE jointly learns latent representations from collaborative filtering and the semantic representations from the knowledge base.

The paper evaluates CKE on two real-world datasets, demonstrating that its approach significantly outperforms several state-of-the-art recommendation methods.

1.6. Original Source Link

/files/papers/6901d1b584ecf5fffe471809/paper.pdf The publication status is unknown based solely on the provided abstract and paper content, but it is presented as a PDF, indicating it's likely a published paper or a preprint.

2. Executive Summary

2.1. Background & Motivation

The paper aims to solve the problem of limited performance and data sparsity in collaborative filtering (CF)-based recommender systems. CF methods, while successful, struggle when user-item interactions are sparse (e.g., in online shopping with vast item sets) and cannot recommend new items that lack interaction history (cold-start problem). This problem is significant because online services heavily rely on effective recommendation systems for user engagement and satisfaction.

Previous attempts to address these issues often involve hybrid recommender systems that combine CF with auxiliary information. The paper identifies a gap: while knowledge bases (KBs) offer a rich source of heterogeneous information (structured, textual, visual), existing studies have not fully exploited their potential. They often either use only the network structure or rely on tedious feature engineering, neglecting other valuable data modalities and efficient representation learning.

The paper's innovative idea is to fully leverage the diverse content within a knowledge base (structural, textual, and visual) to automatically learn rich semantic representations of items. By doing so, it aims to boost the quality of recommender systems, especially in sparse data environments, without relying on manual feature engineering.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

Comprehensive Knowledge Base Utilization: It is presented as the first work to comprehensively leverage structural content, textual content, and visual content from a knowledge base to enhance recommender systems. This addresses the limitation of previous works that often focused on a single data modality or required manual feature engineering.
Automatic Semantic Representation Learning: The paper applies advanced embedding methods, including heterogeneous network embedding (TransR) and deep learning embeddings (stacked denoising auto-encoders (SDAE) for text and stacked convolutional auto-encoders (SCAE) for visual data), to automatically extract semantic representations of items from the knowledge base. These learned representations are versatile and can be used for other tasks beyond recommendation.
Collaborative Joint Learning Framework (CKE): It proposes a novel integrated framework, Collaborative Knowledge Base Embedding (CKE), which performs knowledge base embedding and collaborative filtering jointly. This allows the model to simultaneously extract rich feature representations from the knowledge base and capture the implicit relationships between users and items, leading to a more unified and effective learning process.
Empirical Validation: Through extensive experiments on two real-world datasets (MovieLens-1M and IntentBooks), the paper demonstrates that the CKE framework significantly outperforms several widely adopted state-of-the-art recommendation methods, validating its effectiveness.

The key findings are that integrating heterogeneous knowledge from a knowledge base, especially when learned through advanced embedding techniques and combined with collaborative filtering via joint learning, can substantially improve recommendation performance, particularly in scenarios with data sparsity. Each component (structural, textual, visual) contributes positively to the overall recommendation quality.

3.1. Foundational Concepts

Recommender Systems: Systems designed to predict user preferences for items and suggest relevant items. They help users discover new products, services, or content in vast information spaces.
Collaborative Filtering (CF): A common technique in recommender systems that makes predictions about user interests by collecting preferences or taste information from many users. The underlying assumption is that if users A and B have similar preferences for some items, they will have similar preferences for other items.
- Data Sparsity: A major challenge in CF where most users have interacted with only a small fraction of the total items, leading to a very sparse user-item interaction matrix. This makes it difficult to find reliable similarities between users or items.
- Cold Start Problem: A specific aspect of data sparsity where new items (with no past interactions) or new users (with no past preferences) cannot be effectively recommended or receive recommendations, respectively, because CF relies on historical data.
- Implicit Feedback: User interactions that are not explicit ratings (e.g., 1-5 stars) but rather indirect signals of preference, such as views, clicks, purchases, search queries, or dwell time. In this paper, $R_{ij}=1$ indicates an observed interaction (e.g., user $i$ watched movie $j$ ), and $R_{ij}=0$ indicates no observed interaction (which could mean disinterest or simply unawareness).
Knowledge Base (KB): A structured repository of information, often represented as a graph, that stores entities (real-world objects, concepts) and their relationships. KBs provide heterogeneous information, meaning they contain diverse types of data (e.g., facts, text, images) with different semantic meanings. Examples include DBpedia, YAGO, and Google's Knowledge Graph.
- Entities: Nodes in a knowledge graph, representing real-world objects like "movie," "actor," "genre," "book," etc.
- Relationships (or Relations): Edges in a knowledge graph, describing how entities are connected, e.g., "movie stars actor," "book has genre fiction."
- Heterogeneous Network: A network (or graph) where there are multiple types of nodes (entities) and multiple types of edges (relationships).
Embedding (Representation Learning): The process of transforming high-dimensional, sparse data (like text, images, or graph nodes) into dense, low-dimensional vector representations (embeddings) in a continuous vector space. These vectors are designed to capture the semantic meaning and relationships of the original data, making them suitable for machine learning models.
Deep Learning: A subfield of machine learning that uses neural networks with multiple layers (deep neural networks) to learn complex patterns and representations from data.
- Auto-encoders (AE): An unsupervised neural network that attempts to learn a compact, compressed representation (encoding) of its input data. It consists of an encoder that maps input to a latent space representation and a decoder that reconstructs the input from this representation. The goal is for the reconstructed output to be as close to the original input as possible.
- Denoising Auto-encoders (DAE): A variant of auto-encoders that learns robust representations by attempting to reconstruct the original, clean input from a corrupted version of the input. This forces the model to learn more meaningful features by recovering missing or noisy information.
- Stacked Denoising Auto-encoders (SDAE): A deep neural network formed by stacking multiple denoising auto-encoders on top of each other. Each DAE learns a higher-level representation of the output from the previous DAE. This architecture is effective for learning hierarchical features from data like text.
- Convolutional Neural Networks (CNN): A class of deep neural networks specifically designed for processing structured grid-like data, such as images. CNNs use convolutional layers that apply filters (kernels) to input data, preserving spatial relationships and reducing the number of parameters through weight sharing.
- Stacked Convolutional Auto-encoders (SCAE): Similar to SDAE but using convolutional layers instead of fully connected layers in its encoder and decoder parts, making it particularly well-suited for learning representations from image data.
Factorization Machines (FM): A generic supervised learning model that combines the advantages of Support Vector Machines (SVMs) with factorization models. It can capture interactions between features and is efficient for sparse data.
Matrix Factorization (MF): A class of collaborative filtering algorithms that decompose the user-item interaction matrix into two lower-rank matrices: a user-latent factor matrix and an item-latent factor matrix. The dot product of a user's latent vector and an item's latent vector approximates the user's preference for that item.
Stochastic Gradient Descent (SGD): An iterative optimization algorithm used to minimize an objective function. In SGD, instead of computing the gradient on the entire dataset, the gradient is approximated using a single randomly chosen sample (or a small batch) at each step, making it computationally efficient for large datasets.
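
To make the matrix factorization and SGD entries above concrete, the following is a minimal sketch in plain Python. It is not the paper's model: the function names (`train_mf`, `squared_error`), the squared-error loss, the toy interaction data, and all hyperparameter values are assumptions chosen only for illustration.

```python
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def train_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01, epochs=200, seed=0):
    """Fit user factors P and item factors Q with plain SGD.

    ratings: list of (user, item, value) triples. Hypothetical toy format.
    The loss per sample is (value - P[u].Q[i])^2 plus L2 regularization.
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in random order each epoch
        for u, i, r in samples:
            err = r - dot(P[u], Q[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # One gradient step on a single sample (the "stochastic" part).
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def squared_error(ratings, P, Q):
    """Total squared reconstruction error over the observed entries."""
    return sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)


if __name__ == "__main__":
    # Toy implicit feedback: 1.0 = observed interaction, 0.0 = none observed.
    data = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0),
            (1, 2, 1.0), (2, 1, 1.0), (2, 2, 0.0)]
    P, Q = train_mf(data, n_users=3, n_items=3)
    print("training error:", squared_error(data, P, Q))
```

The dot product `dot(P[u], Q[i])` plays the role of the predicted preference; a ranking-oriented objective such as BPR replaces the squared-error loss but keeps the same latent-factor structure.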

3.2. Previous Works

The paper discusses several existing methods, both for recommender systems and for processing knowledge bases or content information.

Collaborative Filtering (CF) Methods:
- BPRMF (Bayesian Personalized Ranking based Matrix Factorization) [22]: A state-of-the-art collaborative filtering method that optimizes for pair-wise ranking instead of predicting explicit ratings. It assumes that for a given user, observed items are preferred over unobserved items. The objective function is derived from the Bayesian Personalized Ranking (BPR) principle, which aims to maximize the posterior probability of correct personalized ranking. $ L_{BPR} = \sum_{u=1}^M \sum_{i \in I_u^+} \sum_{j \in I_u^-} \ln \sigma(\hat{x}_{ui} - \hat{x}_{uj}) - \lambda_\theta ||\theta||^2 $ Where:
  - $M$ : Number of users.
  - $I_u^+$ : Set of items user $u$ has interacted with (positive implicit feedback).
  - $I_u^-$ : Set of items user $u$ has not interacted with (negative implicit feedback).
  - $\sigma(x)$ : Logistic sigmoid function, $\frac{1}{1+e^{-x}}$ .
  - $\hat{x}_{ui}$ : Predicted preference of user $u$ for item $i$ , often modeled as $\mathbf{p}_u \cdot \mathbf{q}_i$ (dot product of user and item latent vectors).
  - $\theta$ : Model parameters (user and item latent vectors).
  - $\lambda_\theta$ : Regularization parameter.
  - The goal is to maximize the difference between the predicted preference for a positive item and a negative item.
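
The BPR objective above can be evaluated directly on a handful of (user, positive item, negative item) triples. The sketch below is illustrative only: `bpr_objective` and the toy latent vectors are hypothetical, and the predicted preference is assumed to be the dot product of user and item vectors, as described in the list above.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bpr_objective(P, Q, triples, reg):
    """Sum of ln sigma(x_ui - x_uj) over (u, i, j) minus reg * ||theta||^2.

    P: user latent vectors, Q: item latent vectors (lists of lists).
    triples: (user, positive item, negative item) index triples.
    """
    log_lik = sum(
        math.log(sigmoid(dot(P[u], Q[i]) - dot(P[u], Q[j])))
        for u, i, j in triples
    )
    sq_norm = sum(x * x for row in P + Q for x in row)
    return log_lik - reg * sq_norm


if __name__ == "__main__":
    P = [[1.0, 0.0]]                 # one user
    Q = [[2.0, 0.0], [-1.0, 0.0]]    # item 0 scores 2, item 1 scores -1
    # Ranking the higher-scored item as the positive one gives a larger objective.
    print(bpr_objective(P, Q, [(0, 0, 1)], reg=0.0))
    print(bpr_objective(P, Q, [(0, 1, 0)], reg=0.0))
```

Because the objective is a log-likelihood minus a penalty, it is maximized; equivalently, its negative can be minimized with SGD over sampled triples.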
Knowledge Base Embedding Methods:
- TransE (Translating Embeddings for Modeling Multi-relational Data) [3]: A pioneering knowledge graph embedding model that represents entities and relations as vectors in the same continuous vector space. For a valid triple (h, r, t) (head entity, relation, tail entity), TransE aims to ensure that the embedding of the head entity plus the embedding of the relation is approximately equal to the embedding of the tail entity: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ . This is typically optimized by minimizing a margin-based ranking objective function. $ f(h, r, t) = ||\mathbf{h} + \mathbf{r} - \mathbf{t}||_{L_1/L_2} $ Where:
  - $\mathbf{h}, \mathbf{r}, \mathbf{t}$ : Vector embeddings for the head entity, relation, and tail entity, respectively.
  - $L_1/L_2$ : Denotes the $L_1$ or $L_2$ norm.
  - A key limitation of TransE is its struggle with many-to-many relationships and heterogeneous entities/relations, as it assumes entities and relations reside in the same embedding space.
- TransR [15]: An extension of TransE designed to better handle heterogeneous information in knowledge bases. TransR represents entities and relations in distinct semantic spaces. For each relation $r$ , it introduces a projection matrix $\mathbf{M}_r$ that projects entity embeddings from the entity space to the relation-specific space. The translation property ( $\mathbf{h}_r + \mathbf{r} \approx \mathbf{t}_r$ ) then holds in this projected relation space. This is a crucial foundation for the structural embedding component in the current paper.
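
As a small illustration of the TransE score function $f(h, r, t)$ defined above, the sketch below computes the $L_1$ or $L_2$ distance between $\mathbf{h} + \mathbf{r}$ and $\mathbf{t}$ for hand-picked toy vectors. The function name `transe_score` and the example embeddings are assumptions, not values from the paper.

```python
import math


def transe_score(h, r, t, norm=2):
    """Distance between h + r and t; smaller means a more plausible triple.

    norm=1 uses the L1 norm, any other value uses the L2 norm.
    """
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return math.sqrt(sum(d * d for d in diff))


if __name__ == "__main__":
    h = [1.0, 0.0]        # toy head entity embedding
    r = [0.0, 1.0]        # toy relation embedding
    t_true = [1.0, 1.0]   # satisfies h + r = t exactly
    t_false = [3.0, 1.0]  # violates the translation
    print(transe_score(h, r, t_true))
    print(transe_score(h, r, t_false))
```

Since entities and relations share one space here, two different tails that both score zero for the same head and relation would have to coincide, which is one way to see the many-to-many limitation noted above.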
Content-based Recommendation Methods using Deep Learning:
- Stacked Denoising Auto-encoders (SDAE) [27]: Used for learning robust representations from textual data. As explained in Foundational Concepts, it reconstructs clean input from corrupted input, forcing the model to learn useful features. Wang [29] (mentioned in Section 7.3) used deep representation learning for textual content combined with CF.
- Stacked Convolutional Auto-encoders (SCAE) [16]: Used for learning representations from visual data. It leverages convolutional layers to preserve spatial information in images and reduce parameters, as detailed in Foundational Concepts.
Hybrid Recommendation Models:
- PRP (PageRank with Priors) [17]: Integrates user-item relations and structural knowledge into a unified homogeneous graph. It then performs PageRank (an algorithm for ranking nodes in a graph based on incoming links) for each user with a personalized initial probability distribution to recommend items. This method primarily uses graph structure.
- PER (Personalized Entity Recommendation) [30]: Treats structural knowledge as a heterogeneous information network. It extracts meta-path based latent features (sequences of entity and relation types, e.g., "movie-genre-movie") to represent connectivity between users and items and applies Bayesian ranking optimization for recommendation. This focuses on network structure.
- LIBFM (Factorization Machines with libFM) [21]: A state-of-the-art feature-based factorization model. It can model arbitrary real-valued feature vectors by factorizing parameters. The LIBFM(S), LIBFM(T), and LIBFM(V) variants described in the paper use item attributes from structural, textual, or visual knowledge, respectively, as raw features.
- CMF (Collective Matrix Factorization) [21]: Combines different data sources by simultaneously factorizing multiple matrices. For example, CMF(T) would factorize a user-item matrix and an item-word matrix jointly to leverage textual information. CMF(V) would do similarly for item-pixel matrices.
- CTR (Collaborative Topic Regression) [28]: A state-of-the-art method that leverages textual information for recommendation by jointly modeling collaborative filtering with topic modeling (e.g., Latent Dirichlet Allocation) on item content. This allows it to capture both user-item interaction patterns and semantic themes from text.
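
The graph-based idea behind PRP (PageRank with a personalized initial distribution) can be sketched with a generic personalized PageRank power iteration. This is not claimed to be the exact procedure of [17]: the function `personalized_pagerank`, the damping value, and the toy user-movie-genre graph are all assumptions made for illustration.

```python
def personalized_pagerank(adj, prior, damping=0.85, iters=100):
    """Power iteration for PageRank with a personalized restart distribution.

    adj: dict mapping each node to a list of neighbour nodes.
    prior: dict of restart probabilities (should sum to 1), e.g. all mass
           on the target user.
    """
    nodes = list(adj)
    rank = {n: prior.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        # Restart mass goes back to the prior distribution.
        nxt = {n: (1.0 - damping) * prior.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                # Dangling node: redistribute its mass according to the prior.
                for m in nodes:
                    nxt[m] += damping * rank[n] * prior.get(m, 0.0)
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank


if __name__ == "__main__":
    # Toy unified graph: user u1 watched movie A; A and B share genre g;
    # movie C is only linked to genre h, which u1 has no path to.
    graph = {
        "u1": ["A"], "A": ["u1", "g"], "g": ["A", "B"],
        "B": ["g"], "C": ["h"], "h": ["C"],
    }
    scores = personalized_pagerank(graph, prior={"u1": 1.0})
    for node in ("B", "C"):
        print(node, scores[node])
```

In this toy graph, movie B receives a positive score through the shared genre, while movie C stays at zero because it is unreachable from the user, which mirrors how structural knowledge can surface items without direct interactions.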

3.3. Technological Evolution

The field of recommender systems has evolved from basic collaborative filtering (CF) techniques (e.g., item-based or user-based CF) to matrix factorization (MF) models that learn latent representations. However, these methods still suffer from data sparsity and cold start issues. To address this, hybrid approaches emerged, integrating CF with auxiliary information.

Initially, auxiliary information often came from content-based filtering (e.g., movie genres, book descriptions), which typically involved manual feature engineering. The rise of the Semantic Web and Linked Data led to the construction of large-scale knowledge bases (KBs), offering structured and interlinked heterogeneous information. Early efforts to use KBs in recommendation often focused on leveraging their network structure (e.g., meta-paths, PageRank) to infer relationships.

More recently, the success of deep learning has shifted the paradigm from hand-crafted features to automatically learned features (embeddings). This paper fits into this evolution by combining the richness of knowledge bases with the power of deep learning for representation learning. It moves beyond just structural information to include textual and visual content, and it integrates these diverse embeddings with collaborative filtering through a joint learning framework, representing a sophisticated hybrid approach.

3.4. Differentiation Analysis

Compared to the main methods in related work, the CKE approach offers several core differences and innovations:

Multi-Modal Knowledge Integration: Unlike most prior works that focused on a single type of auxiliary information (e.g., only network structure, or only text), CKE is novel in explicitly leveraging three distinct modalities from the knowledge base: structural content, textual content, and visual content. This comprehensive approach aims to capture a richer and more complete understanding of items.
Automatic Feature Extraction via Advanced Embeddings: Instead of relying on heavy and tedious feature engineering (as seen in LIBFM or early meta-path approaches like PER), CKE employs state-of-the-art embedding techniques for each modality:
- Bayesian TransR for structural knowledge: This method is specifically chosen for heterogeneous networks, outperforming TransE by mapping entities and relations to distinct spaces, thereby better capturing complex relationships.
- Bayesian Stacked Denoising Auto-encoders (SDAE) for textual knowledge: This deep learning model automatically learns robust semantic representations from raw text.
- Bayesian Stacked Convolutional Auto-encoders (SCAE) for visual knowledge: This deep learning model is tailored for image data, leveraging convolutional layers to capture spatial features effectively, which is a key differentiator from using generic SDAE for images.
Collaborative Joint Learning: CKE integrates the knowledge base embedding process directly with collaborative filtering into a unified, jointly learned model. This is distinct from approaches that learn representations separately and then combine them (e.g., the BPRMF+STV baseline), or collective matrix factorization methods that might not fully leverage deep, non-linear representations. The joint learning objective allows the knowledge base embeddings to be optimized directly for the recommendation task, enabling a more effective interplay between explicit user-item interactions and implicit item semantics.
Bayesian Formulation: The paper extends the individual embedding components (TransR, SDAE, SCAE) into Bayesian versions, which can help in regularizing the model and potentially provide better uncertainty estimates, although the focus in the paper is primarily on performance improvement.

4. Methodology

4.1. Principles

The core idea behind Collaborative Knowledge Base Embedding (CKE) is to enrich the traditional collaborative filtering (CF) approach by explicitly incorporating diverse semantic representations of items derived from a knowledge base (KB). CF often suffers from data sparsity and cold-start issues because it relies solely on historical user-item interactions. A knowledge base, on the other hand, provides rich, heterogeneous auxiliary information (structural relationships, textual descriptions, and visual content) about items.

The theoretical basis and intuition are as follows:

Enriching Item Representations: Items are not just abstract IDs but have associated content and relationships. By learning dense vector embeddings for items from their structural context (how they relate to other entities like genres, actors), textual content (summaries, descriptions), and visual content (posters, covers), we can capture a more comprehensive semantic understanding of each item.
Addressing Sparsity and Cold Start: These learned semantic representations can provide valuable information even for items with few or no user interactions, effectively mitigating sparsity and cold start problems by leveraging item-to-item similarities in the embedding space.
Joint Optimization: To ensure that the semantic representations are relevant to the recommendation task, they are not learned in isolation. Instead, CKE proposes a joint learning framework where the process of learning user preferences (via collaborative filtering) and learning item semantic representations (via knowledge base embedding) are optimized together. This allows the model to find latent factors that are both discriminative for user preferences and semantically meaningful according to the knowledge base.
Heterogeneity Handling: Recognizing that different types of knowledge require different representation learning techniques, CKE employs specialized embedding methods for each modality: TransR for structured graph data, Stacked Denoising Auto-encoders (SDAE) for textual data, and Stacked Convolutional Auto-encoders (SCAE) for visual data. The use of Bayesian formulations for these components provides a probabilistic framework, which can contribute to better regularization and generalization.

4.2. Core Methodology In-depth (Layer by Layer)

The CKE framework operates in two main steps: knowledge base embedding and collaborative joint learning. Figure 2 provides an overview of this framework.

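Figure 4: Illustration of a 6-layer SDAE for textual embedding. The figure shows the structure of the 6-layer stacked denoising auto-encoder (SDAE) used for textual embedding: the input is a corrupted document, the output is the reconstructed clean document, and the textual embedding vector is extracted by the hidden layers in between.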

Figure 2: Illustration of a collaborative joint learning framework based on knowledge base embeddings.

4.2.1. Knowledge Base Embedding

In this step, CKE extracts three distinct embedding vectors for each item entity, one from each type of knowledge: structural, textual, and visual. These vectors serve as the item entity's latent representation in its respective domain. The components are designed to automatically extract these representations, avoiding manual feature engineering.

4.2.1.1. Structural Embedding

Structural knowledge is represented as a heterogeneous network (a graph) $G = (\mathcal{V}, \mathcal{E})$ , where $\mathcal{V}$ is a set of vertices (entities) and $\mathcal{E}$ is a set of edges (relationships). To capture this structured information, the paper adopts TransR [15], a state-of-the-art network embedding method.

TransR Overview: Unlike methods that embed entities and relations in the same space, TransR represents entities and relations in distinct semantic spaces. For each relation $r$ , it introduces a projection matrix $\mathbf{M}_r$ . This matrix projects the entities from the entity space into a relation-specific space where the translation property ( $\mathbf{v}_h^r + \mathbf{r} \approx \mathbf{v}_t^r$ ) is expected to hold.
Projected Entity Vectors: For a triple $(v_h, r, v_t)$ (head entity $v_h$ , relation $r$ , tail entity $v_t$ ), entities are first embedded into vectors $\mathbf{v}_h, \mathbf{v}_t \in \mathbb{R}^k$ (entity space), and the relation is embedded into $\mathbf{r} \in \mathbb{R}^d$ (relation space). The projected vectors of the entities in the relation space are defined as: $ \mathbf{v}_h^r = \mathbf{v}_h \mathbf{M}_r, \qquad \mathbf{v}_t^r = \mathbf{v}_t \mathbf{M}_r $ Where:
- $\mathbf{v}_h \in \mathbb{R}^k$ : Embedding vector of the head entity $v_h$ .
- $\mathbf{v}_t \in \mathbb{R}^k$ : Embedding vector of the tail entity $v_t$ .
- $\mathbf{M}_r \in \mathbb{R}^{k \times d}$ : Projection matrix specific to relation $r$ , mapping $k$ -dimensional entity embeddings to $d$ -dimensional relation space.
- $\mathbf{v}_h^r \in \mathbb{R}^d$ : Projected embedding vector of the head entity $v_h$ in the relation space of $r$ .
- $\mathbf{v}_t^r \in \mathbb{R}^d$ : Projected embedding vector of the tail entity $v_t$ in the relation space of $r$ .
Score Function: The score function for a triple measures its plausibility. In TransR, it is defined as the $L_2$ -norm squared of the difference between the projected head entity plus the relation vector, and the projected tail entity: $ f_r(v_h, v_t) = ||\mathbf{v}_h^r + \mathbf{r} - \mathbf{v}_t^r||_2^2 $ Where:
- $\mathbf{r} \in \mathbb{R}^d$ : Embedding vector of the relation $r$ .
- $||\cdot||_2^2$ : Squared $L_2$ -norm (Euclidean distance squared), which measures the "distance" or "error" of the translation. A smaller score indicates a more plausible triple.
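
The projection and score function above can be traced on a tiny example. The sketch below assumes row-vector embeddings multiplied on the right by $\mathbf{M}_r$, matching $\mathbf{v}_h^r = \mathbf{v}_h \mathbf{M}_r$ ; the helper names `project` and `transr_score` and the toy numbers ($k = 3$, $d = 2$) are hypothetical.

```python
def project(v, M):
    """Row vector v (length k) times matrix M (k x d) gives a length-d vector."""
    k, d = len(M), len(M[0])
    return [sum(v[a] * M[a][b] for a in range(k)) for b in range(d)]


def transr_score(v_h, r, v_t, M_r):
    """f_r(v_h, v_t) = || v_h M_r + r - v_t M_r ||_2^2 (squared L2 norm)."""
    h_r = project(v_h, M_r)
    t_r = project(v_t, M_r)
    return sum((a + b - c) ** 2 for a, b, c in zip(h_r, r, t_r))


if __name__ == "__main__":
    # Entity space has k = 3 dimensions, relation space has d = 2.
    M_r = [[1.0, 0.0],
           [0.0, 1.0],
           [0.0, 0.0]]          # this toy projection ignores the third coordinate
    r = [1.0, 0.0]
    v_h = [1.0, 2.0, 5.0]
    v_t = [2.0, 2.0, -7.0]      # differs from v_h + r only in the ignored coordinate
    v_other = [0.0, 0.0, 0.0]
    print(transr_score(v_h, r, v_t, M_r))      # plausible triple
    print(transr_score(v_h, r, v_other, M_r))  # implausible triple
```

The example shows why a relation-specific projection helps: two entities that look very different in the entity space (here, in the third coordinate) can still satisfy the translation once projected into the space of relation $r$ .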
Bayesian TransR: The paper extends TransR to a Bayesian version, using a sigmoid function to calculate pair-wise triple ranking probabilities instead of a margin-based objective. The generative process is defined as follows:
1. Entity Embeddings: For each entity $v$ , its embedding vector $\mathbf{v}$ is drawn from a multivariate normal distribution with zero mean and covariance $\lambda_v^{-1} \mathbf{I}$ : $ \mathbf{v} \sim \mathcal{N}(\mathbf{0}, \lambda_v^{-1} \mathbf{I}) $ Where:
  - $\mathcal{N}(\mu, \Sigma)$ : A normal (Gaussian) distribution with mean $\mu$ and covariance matrix $\Sigma$ .
  - $\mathbf{0}$ : A vector of zeros.
  - $\lambda_v^{-1}$ : The variance of each component of the entity embedding, i.e., the inverse of the precision parameter $\lambda_v$ . A smaller $\lambda_v$ means larger variance, allowing more flexibility for $\mathbf{v}$ .
  - $\mathbf{I}$ : Identity matrix, implying independent components.
2. Relation Embeddings and Projection Matrices: For each relation $r$ , its embedding vector $\mathbf{r}$ is drawn from a similar normal distribution with parameter $\lambda_r^{-1}$ , and its projection matrix $\mathbf{M}_r$ is drawn from a normal distribution with parameter $\lambda_M^{-1}$ : $ \mathbf{r} \sim \mathcal{N}(\mathbf{0}, \lambda_r^{-1} \mathbf{I}) \quad \text{and} \quad \mathbf{M}_r \sim \mathcal{N}(\mathbf{0}, \lambda_M^{-1} \mathbf{I}) $ Where:
  - $\lambda_r^{-1}$ : Inverse precision for relation embeddings.
  - $\lambda_M^{-1}$ : Inverse precision for projection matrix components.
3. Triple Ranking Probability: For each quadruple $(v_h, r, v_t, v_{t'})$ , where $(v_h, r, v_t)$ is a correct triple and $(v_h, r, v_{t'})$ is an incorrect triple, the model draws from the probability $\sigma(f_r(v_h, v_{t'}) - f_r(v_h, v_t))$ , with $ \sigma(x) := \frac{1}{1 + e^{-x}} $ Where:
  - $\sigma(x)$ : The logistic sigmoid function, which maps any real value to a probability between 0 and 1.
  - $f_r(v_h, v_t)$ : Score function for the correct triple.
  - $f_r(v_h, v_{t'})$ : Score function for the incorrect triple.
  - The formulation implies that for a quadruple to be sampled, the score of the correct triple should be lower (more plausible) than the score of the incorrect triple, making $f_r(v_h, v_{t'}) - f_r(v_h, v_t)$ positive, and thus $\sigma(\cdot)$ approaches 1. Incorrect triples are typically constructed by corrupting correct triples by replacing either the head or tail entity with a randomly chosen entity of the same type.
    
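The embedding vector $\mathbf{v}_j$ (from the entity space, before projection) for an item entity $j$ is used to denote its structural representation.

Figure 5 (image description): Structure of the 6-layer stacked convolutional denoising auto-encoder (SCAE) used for visual embedding, comprising several convolutional layers and fully connected layers. The input is a corrupted image, the output is the reconstructed clean image, and the visual embedding vector is obtained from the middle of the network.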

Figure 3: Illustration of TransR for structural embedding

Figure 3 visually explains TransR. It shows entities in an entity space and relations in a relation space. The projection matrix $\mathbf{M}_r$ maps entities from the entity space to the relation space. In the relation space, the projected head entity $\mathbf{v}_h^r$ plus the relation vector $\mathbf{r}$ should approximate the projected tail entity $\mathbf{v}_t^r$ .
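
The pair-wise ranking probability used in the Bayesian TransR of this subsection, $\sigma(f_r(v_h, v_{t'}) - f_r(v_h, v_t))$ , can be sketched as follows. The helper names, the tail-only corruption routine `corrupt_tail`, and the toy embeddings are assumptions for illustration; in particular, sampling negatives uniformly from all other entities is a simplification.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def transr_score(v_h, r, v_t, M_r):
    """Squared L2 distance between v_h M_r + r and v_t M_r."""
    d = len(M_r[0])
    proj = lambda v: [sum(v[a] * M_r[a][b] for a in range(len(v))) for b in range(d)]
    return sum((a + b - c) ** 2 for a, b, c in zip(proj(v_h), r, proj(v_t)))


def ranking_probability(v_h, r, v_t, v_t_neg, M_r):
    """sigma(f_r(v_h, v_t') - f_r(v_h, v_t)): close to 1 when the correct
    triple has a much lower (more plausible) score than the corrupted one."""
    return sigmoid(transr_score(v_h, r, v_t_neg, M_r) - transr_score(v_h, r, v_t, M_r))


def corrupt_tail(entities, true_tail, rng):
    """Pick a replacement tail entity different from the true one."""
    return rng.choice([e for e in entities if e != true_tail])


if __name__ == "__main__":
    emb = {"movie": [0.0, 0.0], "drama": [1.0, 0.0], "horror": [3.0, 0.0]}
    M_r = [[1.0, 0.0], [0.0, 1.0]]   # identity projection for simplicity
    r = [1.0, 0.0]                   # toy "has_genre" relation vector
    neg = corrupt_tail(list(emb), "drama", random.Random(0))
    p = ranking_probability(emb["movie"], r, emb["drama"], emb[neg], M_r)
    print(neg, p)
```

Maximizing the log of this probability over sampled quadruples, together with the Gaussian priors on the embeddings, yields a regularized pair-wise ranking objective in place of the margin-based loss of the original TransR.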

4.2.1.2. Textual Embedding

This component extracts textual representations for item entities from their textual knowledge (e.g., summaries, descriptions) using Stacked Denoising Auto-encoders (SDAE).

SDAE Overview: An SDAE is a feedforward neural network designed to learn robust representations by reconstructing a clean input from a corrupted version of that input. It consists of an encoder that maps the input to a latent compact representation and a decoder that reconstructs the original input from this latent representation.
Notation:
- $L_t$ : Number of layers in the SDAE.
- $\mathbf{X}_l$ : Output of layer $l$ .
- $\mathbf{X}_{L_t}$ : Matrix representing the original clean textual knowledge of all item entities. The $j$ -th row, $\mathbf{X}_{L_t, j*}$ , is the bag-of-words vector for item entity $j$ .
- $\mathbf{X}_0$ : Noise-corrupted matrix, created by randomly masking (setting to zero) some entries of $\mathbf{X}_{L_t}$ .
- $\mathbf{W}_l$ : Weight parameter for layer $l$ .
- $\mathbf{b}_l$ : Bias parameter for layer $l$ .
Architecture (Example 6-layer SDAE): As shown in Figure 4, an SDAE is structured such that the first half of layers (e.g., $\frac{L_t}{2}$ layers) form the encoder, mapping the corrupted input to a latent representation. The latter half forms the decoder, reconstructing the clean input from this latent representation. The embedding vector is typically taken from the middle layer.
Bayesian SDAE: The generative process for each layer $l$ in the Bayesian SDAE is as follows, given the clean input $\mathbf{X}_{L_t}$ and corrupted input $\mathbf{X}_0$ :
1. Weight Parameters: For each weight parameter $\mathbf{W}_l$ , it is drawn from a normal distribution: $ \mathbf{W}_l \sim \mathcal{N}(\mathbf{0}, \lambda_W^{-1} \mathbf{I}) $ Where:
  - $\lambda_W^{-1}$ : Inverse precision for weight parameters.
2. Bias Parameters: For each bias parameter $\mathbf{b}_l$ , it is drawn from a normal distribution: $ \mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_b^{-1} \mathbf{I}) $ Where:
  - $\lambda_b^{-1}$ : Inverse precision for bias parameters.
3. Layer Output: For the output of layer $l$ , $\mathbf{X}_l$ , it is drawn from a normal distribution centered at the activated output of the previous layer's transformation, with parameter $\lambda_X^{-1}$ : $ \mathbf{X}_l \sim \mathcal{N}(\hat{\sigma}(\mathbf{X}_{l-1} \mathbf{W}_l + \mathbf{b}_l), \lambda_X^{-1} \mathbf{I}) $ Where:
  - $\hat{\sigma}(\cdot)$ : An activation function (e.g., sigmoid or ReLU) applied element-wise.
  - $\mathbf{X}_{l-1}$ : Output of the previous layer.
  - $\lambda_X^{-1}$ : Inverse precision for layer outputs.
    
    The embedding vector for item entity $j$ is the row vector from the output of the middle layer, specifically $\mathbf{X}_{\frac{L_t}{2}, j*}$ .
    
Figure 4: Illustration of a 6-layer SDAE for textual embedding

Figure 4 illustrates a 6-layer SDAE. The corrupted input $\mathbf{X}_0$ goes through encoding layers ( $\mathbf{X}_0 \to \mathbf{X}_1 \to \mathbf{X}_2 \to \mathbf{X}_3$ ). $\mathbf{X}_3$ is the latent representation. Then decoding layers reconstruct the clean output ( $\mathbf{X}_3 \to \mathbf{X}_4 \to \mathbf{X}_5 \to \mathbf{X}_6$ ). The textual embedding for an item $j$ is taken from $\mathbf{X}_{3, j*}$ .
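The encoder-decoder pass through a 6-layer SDAE can be sketched as follows. This is a simplified sketch under stated assumptions: it computes only the mean of each layer (the Gaussian noise governed by $\lambda_X$ is ignored), uses a sigmoid activation, and processes a single item row. The layer sizes, the initialization scale, and all function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_noise(x, eps, rng):
    """Masking noise: set each entry to zero with probability eps."""
    return [0.0 if eps > rng.random() else xi for xi in x]

def init_layers(dims, rng):
    """One (W, b) pair per layer; W has shape dims[l-1] x dims[l]."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        W = [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]
        layers.append((W, [0.0] * d_out))
    return layers

def forward(x0, layers):
    """Return [X_0, X_1, ..., X_L] for one item (one row of each matrix)."""
    outputs = [x0]
    for W, b in layers:
        prev = outputs[-1]
        pre = [sum(prev[i] * W[i][c] for i in range(len(prev))) + b[c]
               for c in range(len(b))]
        outputs.append([sigmoid(p) for p in pre])
    return outputs

rng = random.Random(0)
dims = [8, 6, 4, 3, 4, 6, 8]                        # 6 layers; the middle layer X_3 has 3 units
layers = init_layers(dims, rng)
clean = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]    # toy bag-of-words row (the clean input)
corrupted = mask_noise(clean, eps=0.2, rng=rng)     # X_0
outputs = forward(corrupted, layers)
embedding = outputs[len(layers) // 2]               # textual embedding, taken from X_3
reconstruction_error = sum((c - o) ** 2 for c, o in zip(clean, outputs[-1]))
```

Training would adjust the weights to reduce `reconstruction_error` against the clean input; the sketch only shows where the embedding is read out.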

4.2.1.3. Visual Embedding

This component extracts visual representations for item entities from their visual knowledge (e.g., poster images, cover images) using Stacked Convolutional Auto-encoders (SCAE).

SCAE Overview: SCAE is chosen because convolutional layers in CNNs are effective for image data, preserving neighborhood relations and spatial locality while reducing parameters through weight sharing. SCAE replaces the fully-connected layers of a standard SDAE with convolutional layers.
Notation:
- $L_v$ : Number of layers in the SCAE.
- $\mathbf{Z}_{L_v}$ : 4-dimensional tensor representing the collection of clean images. $\mathbf{Z}_{L_v, j*}$ is the 3-dimensional tensor for raw pixel representation (RGB) of item $j$ .
- $\mathbf{Z}_0$ : Corrupted images tensor, created by adding Gaussian noise to entries of $\mathbf{Z}_{L_v}$ .
- $\mathbf{Z}_l$ : Output of layer $l$ .
- $\mathbf{Q}_l$ : Weight parameter (convolutional filter) for layer $l$ .
- $\mathbf{c}_l$ : Bias parameter for layer $l$ .
Architecture (Example 6-layer SCAE): Figure 5 shows a 6-layer SCAE. The middle layers ( $\frac{L_v}{2}$ and $\frac{L_v}{2}+1$ ) are typically fully connected layers to produce the dense embedding vector, while other layers are convolutional or deconvolutional.
- Encoder: Two convolutional layers (from $\mathbf{Z}_0$ to $\mathbf{Z}_2$ ) followed by a fully connected layer ( $\mathbf{Z}_2$ to $\mathbf{Z}_3$ ).
- Decoder: A fully connected layer ( $\mathbf{Z}_3$ to $\mathbf{Z}_4$ ) followed by two deconvolutional layers ( $\mathbf{Z}_4$ to $\mathbf{Z}_6$ ).
- The output of the middle hidden layer $\mathbf{Z}_3$ is a matrix representing the collection of all item entities' visual embedding vectors. Other hidden layers typically output feature maps (4-dimensional tensors).
Convolutional Layer Mapping: The mapping for a convolutional layer is given as: $ \mathbf{Z}_l = \sigma(\mathbf{Q} * \mathbf{Z}_{l-1} + \mathbf{c}_l) $ Where:
- $\sigma(\cdot)$ : An activation function.
- *: The convolutional operator, which applies filters (kernels) to the input, preserving local connectivity.
- $\mathbf{Q}$ : Convolutional filter weights.
- $\mathbf{c}_l$ : Bias term.
Bayesian SCAE: The generative process for each layer $l$ in the Bayesian SCAE is as follows, given clean image input $\mathbf{Z}_{L_v}$ and corrupted input $\mathbf{Z}_0$ :
1. Weight Parameters: For each weight parameter $\mathbf{Q}_l$ , it is drawn from a normal distribution: $ \mathbf{Q}_l \sim \mathcal{N}(\mathbf{0}, \lambda_Q^{-1} \mathbf{I}) $ Where:
  - $\lambda_Q^{-1}$ : Inverse precision for weight parameters (convolutional filters).
2. Bias Parameters: For each bias parameter $\mathbf{c}_l$ , it is drawn from a normal distribution: $ \mathbf{c}_l \sim \mathcal{N}(\mathbf{0}, \lambda_c^{-1} \mathbf{I}) $ Where:
  - $\lambda_c^{-1}$ : Inverse precision for bias parameters.
3. Layer Output: For the output of layer $l$ , $\mathbf{Z}_l$ , it is drawn from a normal distribution:
  - If layer $l$ is a fully connected layer: $ \mathbf{Z}_l \sim \mathcal{N}(\sigma(\mathbf{Z}_{l-1} \mathbf{Q}_l + \mathbf{c}_l), \lambda_Z^{-1} \mathbf{I}) $
  - Else (if layer $l$ is a convolutional layer): $ \mathbf{Z}_l \sim \mathcal{N}(\sigma(\mathbf{Z}_{l-1} * \mathbf{Q}_l + \mathbf{c}_l), \lambda_Z^{-1} \mathbf{I}) $ Where:
  - $\lambda_Z^{-1}$ : Inverse precision for layer outputs.
  - The operation $(\mathbf{Z}_{l-1} \mathbf{Q}_l)$ represents a matrix multiplication for fully connected layers.
  - The operation $(\mathbf{Z}_{l-1} * \mathbf{Q}_l)$ represents a convolution for convolutional layers.
    
    The embedding vector for item entity $j$ is the row vector from the output of the middle layer, specifically $\mathbf{Z}_{\frac{L_v}{2}, j*}$ .
    
Figure 5: Illustration of a 6-layer SCAE for visual embedding

Figure 5 illustrates a 6-layer SCAE with convolutional and fully connected layers. The corrupted image input $\mathbf{Z}_0$ is processed by convolutional encoder layers ( $\mathbf{Z}_0 \to \mathbf{Z}_1 \to \mathbf{Z}_2$ ), then a fully connected layer yields the latent representation $\mathbf{Z}_3$ . This is followed by a fully connected decoder layer ( $\mathbf{Z}_3 \to \mathbf{Z}_4$ ) and deconvolutional layers to reconstruct the clean image ( $\mathbf{Z}_4 \to \mathbf{Z}_5 \to \mathbf{Z}_6$ ). The visual embedding for an item $j$ is taken from $\mathbf{Z}_{3, j*}$ .
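The convolutional layer mapping can be illustrated on a single-channel toy image. This is a sketch under simplifying assumptions: one filter, stride 1, no padding, and the sliding-window product without kernel flipping (cross-correlation, as most deep learning libraries implement "convolution"). The function names are hypothetical, and the paper's exact layer configuration is not reproduced.

```python
import math

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding); returns pre-activations."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def conv_layer(image, kernel, bias):
    """One convolutional mapping: sigmoid(kernel * image + bias), element-wise."""
    return [
        [1.0 / (1.0 + math.exp(-(value + bias))) for value in row]
        for row in conv2d_valid(image, kernel)
    ]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
pre_activation = conv2d_valid(image, kernel)
feature_map = conv_layer(image, kernel, bias=0.0)
```

Because the same 2x2 kernel is reused at every position, the layer has only four weights plus a bias regardless of image size, which is the weight sharing the overview refers to.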

4.2.2. Collaborative Joint Learning

This step integrates collaborative filtering with the item embedding representations from the knowledge base into a unified CKE framework.

Implicit Feedback and Pair-wise Ranking: The recommendation task is based on implicit feedback. A user implicit feedback matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ has $R_{ij}=1$ if an interaction between user $i$ and item $j$ has been observed, and $R_{ij}=0$ otherwise. For learning, the paper considers pair-wise ranking: if $R_{ij}=1$ and $R_{ij'}=0$ , it is assumed that user $i$ prefers item $j$ over item $j'$ . The pair-wise preference probability $p(j > j' ; i | \theta)$ denotes this preference, where $\theta$ represents the model parameters.
Item Latent Vector Integration: In collaborative filtering, user $i$ is represented by a latent vector $\mathbf{u}_i$ and item $j$ by a latent vector $\eta_j$ . To incorporate the semantic representations from the knowledge base, the item latent vector $\mathbf{e}_j$ is redefined as an integration of the CF latent offset vector $\eta_j$ and the three embedding vectors extracted from the knowledge base: $ \mathbf{e}_j = \eta_j + \mathbf{v}_j + \mathbf{X}_{\frac{L_t}{2}, j*} + \mathbf{Z}_{\frac{L_v}{2}, j*} $ Where:
- $\mathbf{e}_j$ : The final integrated latent vector for item $j$ .
- $\eta_j$ : A latent offset vector for item $j$ from the collaborative filtering part, capturing item-specific signal learned from user-item interactions that the knowledge base embeddings do not explain.
- $\mathbf{v}_j$ : Structural representation of item $j$ from Bayesian TransR.
- $\mathbf{X}_{\frac{L_t}{2}, j*}$ : Textual representation of item $j$ from Bayesian SDAE.
- $\mathbf{Z}_{\frac{L_v}{2}, j*}$ : Visual representation of item $j$ from Bayesian SCAE.
- This sum implies that the different representations contribute additively to the item's overall latent vector.
Pair-wise Preference Probability with Integrated Embeddings: The pair-wise preference probability is then defined using the user's latent vector $\mathbf{u}_i$ and the integrated item latent vectors $\mathbf{e}_j$ and $\mathbf{e}_{j'}$ : $ p(j > j' ; i | \theta) = \sigma(\mathbf{u}_i^T \mathbf{e}_j - \mathbf{u}_i^T \mathbf{e}_{j'}) $ Where:
- $\mathbf{u}_i \in \mathbb{R}^k$ : Latent vector for user $i$ .
- $\mathbf{e}_j, \mathbf{e}_{j'} \in \mathbb{R}^k$ : Integrated latent vectors for items $j$ and $j'$ .
- $\mathbf{u}_i^T \mathbf{e}_j$ : The predicted preference score for user $i$ and item $j$ , calculated as a dot product, similar to matrix factorization.
- $\sigma(\cdot)$ : The logistic sigmoid function. The model aims to maximize this probability, meaning $\mathbf{u}_i^T \mathbf{e}_j$ should be greater than $\mathbf{u}_i^T \mathbf{e}_{j'}$ .
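The additive integration and the pair-wise preference probability can be sketched with toy vectors. The function names and numbers are hypothetical; the only structural requirement taken from the formulas is that the offset vector and the three embeddings share the same dimension $k$ so that they can be summed.

```python
import math

def integrate_item(eta, v, x_text, z_visual):
    """e_j = eta_j + v_j + X_{L_t/2, j*} + Z_{L_v/2, j*} (element-wise sum)."""
    return [a + b + c + d for a, b, c, d in zip(eta, v, x_text, z_visual)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def preference_probability(u, e_j, e_j_neg):
    """p(j > j'; i) = sigma(u_i^T e_j - u_i^T e_j')."""
    return 1.0 / (1.0 + math.exp(-(dot(u, e_j) - dot(u, e_j_neg))))

u_i = [1.0, 0.5]
e_j = integrate_item([0.1, 0.0], [0.2, 0.1], [0.0, 0.3], [0.1, 0.0])
e_j_neg = integrate_item([0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.0, 0.0])
p = preference_probability(u_i, e_j, e_j_neg)
```

Here the score difference is positive, so the probability that user $i$ prefers item $j$ over item $j'$ exceeds 0.5.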
Full CKE Generative Process: The complete generative process of the CKE framework, combining all components, is given as:
1. Structural Knowledge (from Bayesian TransR):
  - For each entity $v$ , draw $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \lambda_v^{-1} \mathbf{I})$ .
  - For each relation $r$ , draw $\mathbf{r} \sim \mathcal{N}(\mathbf{0}, \lambda_r^{-1} \mathbf{I})$ and $\mathbf{M}_r \sim \mathcal{N}(\mathbf{0}, \lambda_M^{-1} \mathbf{I})$ .
  - For each quadruple $(v_h, r, v_t, v_{t'}) \in \mathcal{S}$ , draw from the probability $\sigma(f_r(v_h, v_{t'}) - f_r(v_h, v_t))$ .
2. Textual Knowledge (from Bayesian SDAE): For each layer $l$ in SDAE:
  - For weight parameter $\mathbf{W}_l$ , draw $\mathbf{W}_l \sim \mathcal{N}(\mathbf{0}, \lambda_W^{-1} \mathbf{I})$ .
  - For bias parameter $\mathbf{b}_l$ , draw $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_b^{-1} \mathbf{I})$ .
  - For the output of the layer $\mathbf{X}_l$ , draw $\mathbf{X}_l \sim \mathcal{N}(\hat{\sigma}(\mathbf{X}_{l-1} \mathbf{W}_l + \mathbf{b}_l), \lambda_X^{-1} \mathbf{I})$ .
3. Visual Knowledge (from Bayesian SCAE): For each layer $l$ in SCAE:
  - For weight parameter $\mathbf{Q}_l$ , draw $\mathbf{Q}_l \sim \mathcal{N}(\mathbf{0}, \lambda_Q^{-1} \mathbf{I})$ .
  - For bias parameter $\mathbf{c}_l$ , draw $\mathbf{c}_l \sim \mathcal{N}(\mathbf{0}, \lambda_c^{-1} \mathbf{I})$ .
  - For the output of the layer $\mathbf{Z}_l$ :
    - If layer $l$ is a fully connected layer: draw $\mathbf{Z}_l \sim \mathcal{N}(\sigma(\mathbf{Z}_{l-1} \mathbf{Q}_l + \mathbf{c}_l), \lambda_Z^{-1} \mathbf{I})$ .
    - Else (if layer $l$ is a convolutional layer): draw $\mathbf{Z}_l \sim \mathcal{N}(\sigma(\mathbf{Z}_{l-1} * \mathbf{Q}_l + \mathbf{c}_l), \lambda_Z^{-1} \mathbf{I})$ .
4. Item Latent Offset Vector: For each item $j$ , draw a latent item offset vector $\eta_j \sim \mathcal{N}(\mathbf{0}, \lambda_I^{-1} \mathbf{I})$ . Then, set the final item latent vector as $\mathbf{e}_j = \eta_j + \mathbf{v}_j + \mathbf{X}_{\frac{L_t}{2}, j*} + \mathbf{Z}_{\frac{L_v}{2}, j*}$ .
5. User Latent Vector: For each user $i$ , draw a user latent vector $\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \lambda_U^{-1} \mathbf{I})$ .
6. Collaborative Preference: For each triple $(i, j, j') \in \mathcal{D}$ , draw from the probability $\sigma(\mathbf{u}_i^T \mathbf{e}_j - \mathbf{u}_i^T \mathbf{e}_{j'})$ .
  - $\mathcal{D}$ : A collection of user-item-negative item triples, where (i, j, j') means user $i$ interacted with item $j$ ( $R_{ij}=1$ ) but not with item $j'$ ( $R_{ij'}=0$ ), and $j'$ is randomly sampled.
  - The vectors $\mathbf{v}_j$ , $\mathbf{X}_{\frac{L_t}{2}, j*}$ , and $\mathbf{Z}_{\frac{L_v}{2}, j*}$ act as "bridges" connecting implicit feedback preferences with structural, textual, and visual knowledge, respectively.
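The construction of the triple set $\mathcal{D}$ can be sketched as follows. The text only states that $j'$ is randomly sampled from items the user has not interacted with; uniform sampling, one negative per positive, and the function name are assumptions made for this illustration.

```python
import random

def sample_triples(interactions, num_items, rng, per_positive=1):
    """Build (i, j, j') triples with R_ij = 1 and R_ij' = 0.

    interactions: dict mapping a user id to the set of item ids the user interacted with.
    """
    triples = []
    for user, positives in interactions.items():
        negatives = [j for j in range(num_items) if j not in positives]
        if not negatives:
            continue  # the user interacted with every item, so no negative exists
        for j in sorted(positives):
            for _ in range(per_positive):
                triples.append((user, j, rng.choice(negatives)))
    return triples

interactions = {0: {0, 1}, 1: {2}}
triples = sample_triples(interactions, num_items=4, rng=random.Random(0))
```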
Learning the Parameters: Computing the full posterior probability of all parameters (user vectors $\mathbf{u}$ , item offset vectors $\eta$ , entity vectors $\mathbf{v}$ , relation vectors $\mathbf{r}$ , projection matrices $\mathbf{M}$ , SDAE weights $\mathbf{W}$ and biases $\mathbf{b}$ , SCAE weights $\mathbf{Q}$ and biases $\mathbf{c}$ ) is intractable. Therefore, the paper instead maximizes a log-likelihood objective (Eq. 7 in the paper), which combines the log-likelihood terms derived from the generative processes for collaborative filtering and for structural, textual, and visual knowledge with regularization terms: $ \begin{array}{r l} \mathcal{L} = & \sum_{(i,j,j') \in \mathcal{D}} \ln \sigma(\mathbf{u}_i^T \mathbf{e}_j - \mathbf{u}_i^T \mathbf{e}_{j'}) \\ & + \sum_{(v_h,r,v_t,v_{t'}) \in \mathcal{S}} \ln \sigma(f_r(v_h, v_{t'}) - f_r(v_h, v_t)) \\ & - \frac{\lambda_X}{2} \sum_{l=1}^{L_t} \|\mathbf{X}_l - \hat{\sigma}(\mathbf{X}_{l-1} \mathbf{W}_l + \mathbf{b}_l)\|_F^2 - \frac{\lambda_Z}{2} \sum_{l=1}^{L_v} \|\mathbf{Z}_l - \sigma(\mathbf{Z}_{l-1} \text{ op } \mathbf{Q}_l + \mathbf{c}_l)\|_F^2 \\ & - \frac{\lambda_U}{2} \sum_i \|\mathbf{u}_i\|_2^2 - \frac{\lambda_I}{2} \sum_j \|\eta_j\|_2^2 \\ & - \frac{\lambda_v}{2} \sum_v \|\mathbf{v}\|_2^2 - \frac{\lambda_r}{2} \sum_r \|\mathbf{r}\|_2^2 - \frac{\lambda_M}{2} \sum_r \|\mathbf{M}_r\|_F^2 \\ & - \frac{\lambda_W}{2} \sum_l \|\mathbf{W}_l\|_F^2 - \frac{\lambda_b}{2} \sum_l \|\mathbf{b}_l\|_2^2 \\ & - \frac{\lambda_Q}{2} \sum_l \|\mathbf{Q}_l\|_F^2 - \frac{\lambda_c}{2} \sum_l \|\mathbf{c}_l\|_2^2 \end{array} $ (Note: Equation (7) in the original paper is difficult to parse because of formatting issues. The reconstruction above follows the generative process descriptions and common practice in similar papers for the log-likelihood of such a Bayesian model. Here $\text{op}$ stands for matrix multiplication in fully connected layers and for convolution in convolutional layers. The first term corresponds to the collaborative filtering objective, the second to structural knowledge, the third and fourth to the textual and visual autoencoder reconstruction errors, and the remaining terms are $L_2$ regularization for all parameters, derived from the normal priors.)

To maximize this objective, a stochastic gradient descent (SGD) algorithm is employed. In each iteration, for a randomly sampled triple $(i, j, j') \in \mathcal{D}$ , the model identifies a subset $\mathcal{S}_{j,j'} \subseteq \mathcal{S}$ containing the quadruples related to items $j$ or $j'$ . Then, SGD updates are performed for each parameter using the gradient of the corresponding objective function.
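One stochastic update for a sampled triple $(i, j, j')$ can be sketched for the collaborative filtering parameters alone. This is a partial, hypothetical illustration: the knowledge-based parts of the item vectors are held fixed, whereas in the full framework the same gradient also flows into the structural, textual, and visual components. It uses the identity $\frac{d}{dx}\ln\sigma(x) = 1 - \sigma(x)$ and takes a gradient ascent step on the regularized log-likelihood; the function names, learning rate, and toy values are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pair_margin(u, eta_j, eta_neg, k_j, k_neg):
    """u^T e_j - u^T e_j', where e = eta + k and k is the knowledge-based part (v + X + Z)."""
    e_j = [a + b for a, b in zip(eta_j, k_j)]
    e_neg = [a + b for a, b in zip(eta_neg, k_neg)]
    return dot(u, e_j) - dot(u, e_neg)

def sgd_step_cf(u, eta_j, eta_neg, k_j, k_neg, lr, lam_u, lam_i):
    """Gradient ascent step on ln sigma(margin) minus L2 penalties; k_j and k_neg stay fixed."""
    e_j = [a + b for a, b in zip(eta_j, k_j)]
    e_neg = [a + b for a, b in zip(eta_neg, k_neg)]
    margin = dot(u, e_j) - dot(u, e_neg)
    g = 1.0 - 1.0 / (1.0 + math.exp(-margin))   # derivative of ln sigma at the margin
    new_u = [ui + lr * (g * (a - b) - lam_u * ui) for ui, a, b in zip(u, e_j, e_neg)]
    new_eta_j = [e + lr * (g * ui - lam_i * e) for e, ui in zip(eta_j, u)]
    new_eta_neg = [e + lr * (-g * ui - lam_i * e) for e, ui in zip(eta_neg, u)]
    return new_u, new_eta_j, new_eta_neg

u, eta_j, eta_neg = [0.1, 0.2], [0.0, 0.0], [0.0, 0.0]
k_j, k_neg = [1.0, 0.0], [0.0, 1.0]
before = pair_margin(u, eta_j, eta_neg, k_j, k_neg)
u, eta_j, eta_neg = sgd_step_cf(u, eta_j, eta_neg, k_j, k_neg, lr=0.1, lam_u=0.0, lam_i=0.0)
after = pair_margin(u, eta_j, eta_neg, k_j, k_neg)
```

With regularization switched off, the step raises the margin between the observed item and the sampled negative, which is what maximizing the first term of the objective requires.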
Prediction: The final recommendation for a user $i$ is generated by ranking items according to their predicted preference scores, which are calculated as the dot product of the user's latent vector $\mathbf{u}_i$ and the item's integrated latent vector $\mathbf{e}_j$ : $ i : j_1 > j_2 > ... > j_n \quad \text{such that} \quad \mathbf{u}_i^T \mathbf{e}_{j_1} > \mathbf{u}_i^T \mathbf{e}_{j_2} > ... > \mathbf{u}_i^T \mathbf{e}_{j_n} $ This means items with higher scores are ranked higher and recommended to the user.
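The ranking step amounts to sorting items by their dot-product scores. In the sketch below the function name is hypothetical, and the optional `exclude` argument (for example, items the user has already interacted with) is a common practice assumed here rather than something stated in the text.

```python
def recommend(u, item_vectors, k, exclude=()):
    """Return the ids of the k items with the highest score u^T e_j.

    item_vectors: dict mapping an item id to its integrated latent vector e_j.
    """
    scores = {
        item: sum(a * b for a, b in zip(u, e))
        for item, e in item_vectors.items()
        if item not in exclude
    }
    return sorted(scores, key=lambda item: scores[item], reverse=True)[:k]

u_i = [1.0, 0.0]
item_vectors = {"a": [0.2, 1.0], "b": [0.9, 0.0], "c": [0.5, 0.5]}
top_two = recommend(u_i, item_vectors, k=2)
```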

5. Experimental Setup

5.1. Datasets

The paper uses two real-world datasets from different domains (movie and book) to evaluate the CKE framework.

MovieLens-1M:
- Source: A well-known dataset for movie recommendations. The original dataset consists of 1 million ratings.
- Preprocessing: To align with implicit feedback settings, only positive ratings (rating 5) were extracted for training and testing. Users with fewer than 3 positive ratings were removed.
- Characteristics (Final Dataset):
  - #users: 5,883
  - #items (movies): 3,230
  - #interactions: 226,101
- Knowledge Base Integration: Movies were mapped to entities in the Satori knowledge base using a two-stage method (title match and attribute match). 92% of pairs were correctly matched. 134 movies could not be mapped.
  - Structural Knowledge (SK): A subgraph was extracted from Satori, including movie entities, 1-step related entities (e.g., genre, director, actor, language, country, production date, rating, awards), and their relationships.
  - Textual Knowledge (TK): Textual information was extracted from movie plots and preprocessed using word hashing.
  - Visual Knowledge (VK): Poster images of movies were used, reshaped to $3 \times 64 \times 64$ tensor format (RGB).
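The MovieLens-1M preprocessing described above (keep only ratings of 5, then drop users with fewer than 3 positive ratings) can be sketched as follows. The input format, a list of `(user, item, rating)` tuples, and the function name are assumptions for illustration; the paper does not describe its data pipeline at this level.

```python
from collections import defaultdict

def build_implicit_feedback(ratings, positive_rating=5, min_positives=3):
    """Convert explicit ratings to implicit feedback.

    ratings: iterable of (user, item, rating) tuples.
    Returns a dict mapping each retained user to the set of positively rated items.
    """
    by_user = defaultdict(set)
    for user, item, rating in ratings:
        if rating == positive_rating:
            by_user[user].add(item)
    return {user: items for user, items in by_user.items() if len(items) >= min_positives}

ratings = [
    ("u1", "m1", 5), ("u1", "m2", 5), ("u1", "m3", 5), ("u1", "m4", 3),
    ("u2", "m1", 5), ("u2", "m2", 4),
]
feedback = build_implicit_feedback(ratings)
```

In this toy input only `u1` is retained, because `u2` has a single rating of 5.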

IntentBooks:

- Source: Collected from Microsoft's Bing search engine and Microsoft's Satori knowledge base [1]. User interests in books were extracted from click and query actions.
- Preprocessing: Book interests were extracted by combining unsupervised similarity computation with supervised classification; the extracted instances have a precision of 91.5%. Users with fewer than 5 book interests were removed.
- Characteristics (Final Dataset):
  - #users: 92,564
  - #items (books): 18,475
  - #interactions: 897,871
- Knowledge Base Integration: Book entities were already part of the Satori knowledge base, so no mapping step was needed.
  - Structural Knowledge (SK): Similar to MovieLens, a subgraph was extracted, including book entities, 1-step related entities (e.g., genre, author, publication date, series, language, rating), and their relationships.
  - Textual Knowledge (TK): Textual information from book descriptions was preprocessed using word hashing.
  - Visual Knowledge (VK): Front cover images of books were used, reshaped to $3 \times 64 \times 64$ tensor format (RGB).

The following are the results from Table 1 of the original paper:

	MovieLens-1M	IntentBooks
#user	5,883	92,564
#item	3,230	18,475
#interactions	226,101	897,871
#sk nodes	84,011	26,337
#sk edges	169,368	57,408
#sk edge types	10	6
#tk items	2,752	17,331
#vk items	2,958	16,719

Table 1: Detailed statistics of the two datasets

These datasets were chosen because they represent different domains (movies, books) and scenarios, allowing for a comprehensive evaluation of the framework's generalizability. They are effective for validating the method's performance as they offer real-world implicit feedback data combined with rich knowledge base information.

5.2. Evaluation Metrics

For evaluating the performance of top-K recommendations (where $K$ is the number of items recommended), the paper uses MAP@K and Recall@K. It notes that precision is not suitable for implicit feedback.

Recall@K:
- Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved among the top $K$ recommended items. It indicates how many of the items a user would actually interact with (according to the test set) were present in the recommended list. A higher Recall@K means the system is better at finding all relevant items.
- Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{M} \sum_{u=1}^M \frac{|\mathrm{Recommended}_u(K) \cap \mathrm{Relevant}_u|}{|\mathrm{Relevant}_u|} $
- Symbol Explanation:
  - $M$ : Total number of users in the test set.
  - $\mathrm{Recommended}_u(K)$ : The set of top $K$ items recommended to user $u$ .
  - $\mathrm{Relevant}_u$ : The set of items user $u$ actually interacted with in the test set (true positive items).
  - $|\cdot|$ : Cardinality of a set.
  - The formula averages the recall for each user.
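The Recall@K formula translates directly into code. In this sketch the function name and the dict-based inputs are hypothetical, and users with no relevant test items are skipped to avoid division by zero, which is an assumption the text does not address.

```python
def recall_at_k(recommended, relevant, k):
    """Mean over users of |top-K recommended ∩ relevant| / |relevant|.

    recommended: dict mapping a user to a ranked list of item ids.
    relevant: dict mapping a user to the set of test items the user interacted with.
    """
    total, users = 0.0, 0
    for user, rel in relevant.items():
        if not rel:
            continue  # assumption: users without test items are skipped
        hits = len(set(recommended.get(user, [])[:k]) & rel)
        total += hits / len(rel)
        users += 1
    return total / users if users else 0.0

recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"a", "c"}, "u2": {"x"}}
recall_2 = recall_at_k(recommended, relevant, k=2)
recall_3 = recall_at_k(recommended, relevant, k=3)
```

For `u1`, one of two relevant items appears in the top 2 and both appear in the top 3; `u2` has no hits, so the averages are 0.25 and 0.5.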
MAP@K (Mean Average Precision at K):
- Conceptual Definition: MAP@K is a single-figure measure of quality for a ranked list of search results or recommendations. It takes into account both precision and the rank of relevant items. Average Precision (AP) for a single user is the average of precision values computed at each point where a relevant item is found in the ranked list. MAP@K is then the mean of these AP@K scores across all users. It emphasizes finding relevant items earlier in the ranked list. A higher MAP@K indicates better ranking quality.
- Mathematical Formula: $ \mathrm{MAP@K} = \frac{1}{M} \sum_{u=1}^M \mathrm{AP}_u@K $ Where $\mathrm{AP}_u@K$ for user $u$ is calculated as: $ \mathrm{AP}_u@K = \frac{1}{|\mathrm{Relevant}_u|} \sum_{k=1}^K P_u(k) \times \mathrm{rel}_u(k) $
- Symbol Explanation:
  - $M$ : Total number of users in the test set.
  - $\mathrm{AP}_u@K$ : Average Precision for user $u$ at cutoff $K$ .
  - $|\mathrm{Relevant}_u|$ : Number of relevant items for user $u$ in the test set.
  - $P_u(k)$ : Precision at rank $k$ for user $u$ , calculated as $\frac{\text{number of relevant items at or before rank } k}{\text{number of items at or before rank } k}$ .
  - $\mathrm{rel}_u(k)$ : A binary function, 1 if the item at rank $k$ is relevant for user $u$ , and 0 otherwise.
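A small sketch of MAP@K following the formula above, normalizing each user's average precision by $|\mathrm{Relevant}_u|$ as written there. The function names and input format are hypothetical, and skipping users without relevant items is an assumption.

```python
def average_precision_at_k(ranked, relevant, k):
    """AP@K: sum of P(k) at each rank holding a relevant item, divided by |relevant|."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / len(relevant) if relevant else 0.0

def map_at_k(recommended, relevant, k):
    """Mean of AP@K over users that have at least one relevant test item."""
    users = [user for user, rel in relevant.items() if rel]
    if not users:
        return 0.0
    return sum(
        average_precision_at_k(recommended.get(user, []), relevant[user], k) for user in users
    ) / len(users)

ap = average_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4)
score = map_at_k({"u1": ["a", "b", "c", "d"], "u2": ["x"]}, {"u1": {"a", "c"}, "u2": {"y"}}, k=4)
```

For the first list, relevant items sit at ranks 1 and 3, giving precisions 1 and 2/3 and therefore an AP of 5/6; averaging with the second user's AP of 0 gives 5/12.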

5.3. Baselines

The paper compares its proposed CKE framework and its component variants against several competitive baselines, categorized by the type of knowledge leveraged:

5.3.1. Baselines for Structural Knowledge Usage

BPRMF: Bayesian Personalized Ranking based Matrix Factorization [22]. This is a pure collaborative filtering method that serves as a baseline that completely ignores structural knowledge.
BPRMF+TransE: Combines BPRMF with TransE [3], a network embedding method. It uses the same settings as CKE(S) but uses TransE (which ignores the heterogeneity of entities and relations) instead of TransR for structural knowledge embedding.
PRP (PageRank with Priors) [17]: Integrates user-item relation and structural knowledge into a unified homogeneous graph and then performs PageRank for each user with a personalized initial probability distribution.
PER (Personalized Entity Recommendation) [30]: Treats structural knowledge as a heterogeneous information network and extracts meta-path based latent features (e.g., "movie-genre-movie") to represent connectivity between users and items. Applies Bayesian ranking optimization.
LIBFM(S) [21]: A state-of-the-art feature-based factorization model. LIBFM(S) uses an item's attributes from the structural knowledge directly as raw features.

5.3.2. Baselines for Textual Knowledge Usage

BPRMF: (Same as above) A pure collaborative filtering baseline, ignoring textual knowledge.
LIBFM(T) [21]: Similar to LIBFM(S), but uses bag of words from the textual knowledge as raw features.
CMF(T) (Collective Matrix Factorization) [21]: Combines different data sources by simultaneously factorizing multiple matrices. CMF(T) uses the user-item matrix and the item-word matrix.
CTR (Collaborative Topic Regression) [28]: A state-of-the-art method leveraging textual information by integrating collaborative filtering and topic modeling simultaneously.

5.3.3. Baselines for Visual Knowledge Usage

BPRMF: (Same as above) A pure collaborative filtering baseline, ignoring visual knowledge.
LIBFM(V) [21]: Similar to LIBFM(S), but uses flattened raw pixel representations (in RGB color space) as raw features.
CMF(V) [21]: Similar to CMF(T), but uses the user-item matrix and the item-pixel matrix for simultaneous factorization.
BPRMF+SDAE(V): Uses the same settings as CKE(V) but employs Stacked Denoising Auto-encoders (SDAE) (fully-connected layers) instead of Stacked Convolutional Auto-encoders (SCAE) (convolutional layers) to embed visual knowledge. This baseline specifically evaluates the effectiveness of convolutional layers for visual data.

5.3.4. Baselines for the Whole Framework Evaluation

CKE(ST), CKE(SV), CKE(TV): Ablation study baselines which are variants of CKE(STV) that only incorporate two out of the three knowledge types (Structural, Textual, Visual). These evaluate the additional contribution of each knowledge type.
LIBFM(STV) [21]: Uses LIBFM with structural knowledge, textual knowledge, and visual knowledge all combined as features. This represents a feature-engineering approach to multi-modal knowledge.
BPRMF+STV: Uses the same settings as CKE(STV) but learns collaborative filtering and the three knowledge base embedding components separately. This baseline evaluates the effectiveness of joint learning.

For all baselines, the latent dimension in the collaborative filtering part is set to be the same as in CKE for fair comparison. Other hyperparameters for baselines are determined by grid search.

The following are the results from Table 2 of the original paper:

	MovieLens-1M	IntentBooks
cf	dim=150,λU=λI =0.0025	dim=100,λU=λI =0.005
sk	λv=λr=0.001,λM =0.01	λv=λr=0.001,λM =0.1
tk	λW =λb=0.01, λX =0.0001, ϵ=0.2, Lt=4, Nl=300	λW =λb=0.01, λX =0.001, ϵ=0.1, Lt=6, Nl=200
vk	λQ=λc=0.01, λZ=0.0001, σ=3,Lv = 6, Nf =20, Sf =(5,5)	λQ=λc=0.01, λZ=0.001, σ=2,Lv = 8, Nf =20, Sf =(5,5)

Table 2: Hyperparameter settings of our framework for the two datasets. cf, sk, tk and vk indicate the parameters in the component of collaborative filtering, structural knowledge embedding, textual knowledge embedding and visual knowledge embedding, respectively.

In Table 2:

- dim: Latent dimension for user and item vectors in collaborative filtering.
- $\lambda_U$ , $\lambda_I$ , $\lambda_v$ , $\lambda_r$ , $\lambda_M$ , $\lambda_W$ , $\lambda_b$ , $\lambda_X$ , $\lambda_Q$ , $\lambda_c$ , $\lambda_Z$ : Precisions of the Gaussian priors, which act as regularization weights for the corresponding model parameters.
- ϵ: Noise masking level for SDAE (textual embedding).
- σ: Standard deviation of the Gaussian filter noise applied to images (visual embedding).
- Lt: Number of layers for SDAE (textual embedding).
- Lv: Number of layers for SCAE (visual embedding).
- Nl: Number of hidden units in SDAE layers (when not a middle/output layer).
- Nf: Number of filter maps in each convolutional layer of SCAE.
- Sf: Size of filter maps in each convolutional layer of SCAE.

6. Results & Analysis

The experiments are conducted on two datasets, MovieLens-1M and IntentBooks, evaluating Recall@K and MAP@K for various $K$ values. The results are presented in several figures, comparing CKE components and the full framework against baselines.

6.1. Core Results Analysis

6.1.1. Study of Structural Knowledge Usage

This section evaluates CKE(S), which integrates structural knowledge embedding via Bayesian TransR with collaborative filtering.

Figure 6: Recall@K results comparison between our methods using each component in knowledge base embedding and related baselines for dataset MovieLens-1M.

Figure 7: MAP@K results comparison between our methods using each component in knowledge base embedding and related baselines for dataset MovieLens-1M.

Figure 8: Recall@K results comparison between our methods using each component in knowledge base embedding and related baselines for dataset IntentBooks.

Figure 9: MAP@K results comparison between our methods using each component in knowledge base embedding and related baselines for dataset IntentBooks.

From Figures 6(a), 7(a), 8(a), and 9(a) (the "Structural Knowledge" subplots of these figures), several observations are made:

Importance of Structural Knowledge: BPRMF performs the worst among all approaches. This is significant because BPRMF is a pure collaborative filtering method that completely ignores structural knowledge. The results strongly suggest that incorporating structural knowledge can substantially improve recommendation performance.
Limitations of Non-Factorization Methods: PRP (PageRank with Priors) performs worse than other approaches that leverage structural knowledge. This is attributed to PRP not using factorization, which is crucial for capturing latent low-rank approximations of user-item interactions, especially in sparse datasets.
Effectiveness of Network Embedding: BPRMF+TransE outperforms LIBFM(S) and PER. This indicates that using network embedding to capture semantic representations from structural knowledge is more effective than directly using it in a feature-engineering way (LIBFM(S)) or through meta-paths (PER). Network embedding can learn more meaningful and compact representations.
Superiority of TransR: CKE(S) consistently beats BPRMF+TransE. This demonstrates that by using TransR, which explicitly considers the heterogeneity of nodes and relationships and projects entities into a relation-specific space, there is a clear improvement over TransE (which assumes a single embedding space). TransR's ability to model complex relational patterns contributes to better recommendation quality.

6.1.2. Study of Textual Knowledge Usage

This section evaluates CKE(T), which integrates textual knowledge embedding via Bayesian SDAE with collaborative filtering.

From Figures 6(b), 7(b), 8(b), and 9(b) (the "Textual Knowledge" subplots), the following observations are made:

Weaker Improvement Compared to Structural Knowledge: CKE(S) (structural knowledge) generally outperforms CKE(T) (textual knowledge), and similarly, LIBFM(S) outperforms LIBFM(T). This suggests that, for these datasets, textual knowledge provides a weaker performance boost compared to structural knowledge. While still beneficial, the semantic signal from text might be less directly correlated with user preferences than structured relationships.
Deep Learning vs. Direct Factorization: LIBFM(T), CTR, and CKE(T) generally perform better than CMF(T). This indicates that direct factorization of an item-word matrix (as in CMF(T)) may not fully leverage textual information. More sophisticated models that learn semantic representations (like LIBFM with text features, topic modeling in CTR, or deep learning embeddings in CKE(T)) are more effective.
Effectiveness of Deep Learning for Text: CTR is noted as a strong baseline, sometimes achieving the best performance on IntentBooks. However, CKE(T) typically outperforms CTR. This highlights the strength of deep learning embeddings (specifically SDAE) in extracting deep semantic representations from text compared to traditional topic modeling approaches.
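
To illustrate the denoising idea behind SDAE, the following is a minimal single-layer sketch: the input is corrupted with masking noise, encoded, decoded, and compared against the clean input. It is a plain forward pass under stated assumptions (sigmoid activations, squared-error loss, hand-picked toy weights), not the paper's Bayesian SDAE, which stacks several such layers and places priors on the parameters. All names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def corrupt(x: Vector, drop_prob: float, rng: random.Random) -> Vector:
    """Masking noise: zero each input unit independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


def dense(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Fully connected layer with sigmoid activation; w has shape len(x) x len(b)."""
    return [sigmoid(sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j])
            for j in range(len(b))]


def denoising_pass(x: Vector, w_enc: Matrix, b_enc: Vector,
                   w_dec: Matrix, b_dec: Vector,
                   drop_prob: float, rng: random.Random) -> Tuple[Vector, Vector, float]:
    """Encode a corrupted copy of x, decode it, and score against the clean x."""
    code = dense(corrupt(x, drop_prob, rng), w_enc, b_enc)
    recon = dense(code, w_dec, b_dec)
    loss = sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return code, recon, loss


# Toy bag-of-words vector with 4 terms compressed to a 2-dimensional code.
rng = random.Random(0)
x = [1.0, 0.0, 1.0, 0.0]
w_enc = [[0.1, -0.1] for _ in range(4)]
w_dec = [[0.1, 0.1, 0.1, 0.1] for _ in range(2)]
code, recon, loss = denoising_pass(x, w_enc, [0.0, 0.0], w_dec, [0.0] * 4, 0.3, rng)
```

In a trained model the middle-layer code would serve as the item's textual representation; here the weights are untrained, so only the data flow is meaningful.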

6.1.3. Study of Visual Knowledge Usage

This section evaluates CKE(V), which integrates visual knowledge embedding via Bayesian SCAE with collaborative filtering.

From Figures 6(c), 7(c), 8(c), and 9(c) (the "Visual Knowledge" subplots), the following observations are made:

Limited but Significant Improvement: The performance improvement from using visual knowledge is generally less pronounced compared to structural knowledge but is still significant. This implies that visual cues (like movie posters or book covers) contain valuable, albeit perhaps less direct, information for recommendation.
Superiority of Deep Networks for Visuals: CKE(V) and BPRMF+SDAE(V) outperform other approaches like LIBFM(V) and CMF(V). This demonstrates the effectiveness of deep neural networks for visual knowledge embedding compared to simpler feature-based or factorization methods.
Importance of Convolutional Layers: The performance gap between CKE(V) (using SCAE with convolutional layers) and BPRMF+SDAE(V) (using SDAE with fully-connected layers for visuals) is significant. This strongly suggests that convolutional layers are much more suitable for extracting meaningful visual representations due to their ability to capture spatial hierarchies and local patterns in images, validating the design choice of SCAE.
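
One related property of convolutional layers, weight sharing across image positions, can be illustrated with simple arithmetic. The sketch below counts trainable parameters for a fully-connected layer and a convolutional layer applied to the same image. The image size, hidden width, filter count, and filter size are hypothetical values chosen for illustration and are not the paper's settings; the source attributes the performance gap to spatial structure, and this only shows one practical consequence of the architecture.

```python
def dense_params(height: int, width: int, channels: int, hidden_units: int) -> int:
    """Weights plus biases for a fully connected layer over a flattened image."""
    return height * width * channels * hidden_units + hidden_units


def conv_params(channels: int, num_filters: int, filter_size: int) -> int:
    """Weights plus biases for a square convolution; independent of image size."""
    return num_filters * (filter_size * filter_size * channels) + num_filters


# Hypothetical 64x64 RGB poster image.
print(dense_params(64, 64, 3, hidden_units=256))         # 3145984
print(conv_params(3, num_filters=32, filter_size=5))     # 2432
```

The convolutional layer reuses the same small filters at every location, so its parameter count does not grow with image resolution, while the fully-connected layer learns a separate weight for every pixel and hidden unit pair.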

6.1.4. Study of The Whole Framework

This section evaluates CKE(STV), the full CKE framework integrating all three knowledge types, against various baselines including ablation studies.

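The image is a schematic from the paper showing the collaborative joint learning framework based on knowledge base embedding. Using structural, textual, and visual knowledge representations together with Bayesian TransR, SDAE, and SCAE, it extracts heterogeneous semantic vectors for items and finally integrates the user and item latent vectors to improve recommendation performance.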

Figure 10: Recall@K results comparison between our framework and related baselines for both datasets.

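Figure 3: Illustration of TransR for structural embedding. The image shows the TransR structural embedding method from Figure 3 of the original paper. It is divided into an entity space and a relation space: the matrix $M_r$ maps entity embeddings into the relation space, and the red arrow denotes the translation by relation $r$ from the head entity to the tail entity.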

Figure 11: MAP@K results comparison between our framework and related baselines for both datasets.

From Figure 10 and Figure 11, the following crucial observations are made:

Additive Benefit of Multi-Modal Knowledge: CKE(STV) consistently outperforms its ablation variants: CKE(ST) (structural + textual), CKE(SV) (structural + visual), and CKE(TV) (textual + visual). This is a key finding, demonstrating that the additional usage of each type of knowledge (structural, textual, visual) provides a cumulative benefit and further improves recommendation performance. This supports the paper's central hypothesis that combining heterogeneous knowledge sources is effective.
Effectiveness of Embedding Components over Feature Engineering: CKE(STV) outperforms LIBFM(STV). LIBFM(STV) uses all three knowledge types but treats them as raw features in a feature-engineering manner. This comparison clearly shows that CKE's embedding components (TransR, SDAE, SCAE) are more effective at capturing semantic representations from the knowledge base than direct feature usage, leading to superior recommendation quality.
Importance of Joint Learning: CKE(STV) also achieves better performance than BPRMF+STV. The key difference is that BPRMF+STV learns the collaborative filtering and knowledge base embedding components separately and then combines them, whereas CKE(STV) learns them jointly. This result highlights that joint learning directly optimizes all components towards the recommendation task, enabling a more synergistic and effective integration, thus improving overall quality.
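
For readers who want to compute the two metrics plotted in Figures 10 and 11, here is a minimal sketch of Recall@K and MAP@K over ranked recommendation lists. The normalization of average precision varies between papers; this version divides by min(K, number of relevant items), which is an assumption and may differ from the definition in the Evaluation Metrics section. Function and variable names are illustrative.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of a user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Mean of precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(k, len(relevant))


def map_at_k(rankings: Dict[str, List[str]], truth: Dict[str, Set[str]], k: int) -> float:
    """Average of per-user average precision; users without rankings are skipped."""
    users = [u for u in truth if u in rankings]
    if not users:
        return 0.0
    return sum(average_precision_at_k(rankings[u], truth[u], k) for u in users) / len(users)


# Hypothetical user with three held-out items, two of which are ranked in the top 3.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(recall_at_k(ranked, relevant, 3))
print(average_precision_at_k(ranked, relevant, 3))
```

Recall@K only counts how many held-out items are retrieved, while MAP@K also rewards placing them near the top of the list, which is why the two figures can rank methods slightly differently.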

6.2. Data Presentation (Tables)

The detailed statistics of the two datasets, MovieLens-1M and IntentBooks, are presented in Table 1 in the Experimental Setup section. This table provides concrete numbers for users, items, interactions, and knowledge base components, allowing for an understanding of the scale and characteristics of the data used. The hyperparameter settings used for achieving optimal performance for CKE on both datasets are provided in Table 2, also in the Experimental Setup section. This table details parameters for collaborative filtering, structural knowledge, textual knowledge, and visual knowledge components.

6.3. Ablation Studies / Parameter Analysis

The paper conducts effective ablation studies within the "Study of The Whole Framework" section by comparing CKE(STV) with its variants: CKE(ST), CKE(SV), and CKE(TV).

CKE(ST) vs. CKE(STV): The performance improvement of CKE(STV) over CKE(ST) demonstrates the positive contribution of visual knowledge.
CKE(SV) vs. CKE(STV): The improvement of CKE(STV) over CKE(SV) highlights the value added by textual knowledge.
CKE(TV) vs. CKE(STV): The improvement of CKE(STV) over CKE(TV) confirms the importance of structural knowledge.

These ablation studies collectively confirm that each knowledge modality (structural, textual, visual) provides unique and beneficial information, and their combined use in CKE(STV) leads to the best performance.

Regarding parameter analysis, Table 2 (in Experimental Setup) lists the optimal hyperparameter settings found for each component (cf, sk, tk, vk) on both datasets. While the paper states these were found using a validation set and grid search, it does not report a separate parameter sensitivity analysis (e.g., how performance changes with varying dim, $\lambda$, $\epsilon$, $N_l$, $N_f$, $S_f$ values). Reporting these optimal values nonetheless supports experimental reproducibility and understanding.
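
The validation-set grid search mentioned above can be sketched as follows. This is a generic exhaustive search, not the authors' tuning script: the `evaluate` callback is a hypothetical stand-in for "train the model with this configuration and return a validation metric such as MAP@K", and the grid values are invented for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, float]


def grid_search(grid: Dict[str, List[float]],
                evaluate: Callable[[Config], float]) -> Tuple[Optional[Config], float]:
    """Try every combination in the grid and keep the highest-scoring one."""
    names = sorted(grid)
    best_config: Optional[Config] = None
    best_score = float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)  # higher is better, e.g. a validation metric
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical grid and a toy scoring function that peaks at dim=100, lam=0.01.
toy_grid = {"dim": [50, 100, 150], "lam": [0.001, 0.01, 0.1]}
toy_eval = lambda c: -abs(c["dim"] - 100) - abs(c["lam"] - 0.01)
best, score = grid_search(toy_grid, toy_eval)
print(best)
```

Logging the score of every configuration, rather than only the best one, would also yield the sensitivity curves that the paper does not report.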

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Collaborative Knowledge Base Embedding (CKE), a hybrid recommender system that effectively integrates collaborative filtering with rich, heterogeneous information from knowledge bases. The core innovation lies in designing three specialized embedding components:

Structural Embedding: Uses Bayesian TransR to extract representations from the knowledge graph's heterogeneous structure.
Textual Embedding: Employs Bayesian Stacked Denoising Auto-encoders (SDAE) for textual content.
Visual Embedding: Leverages Bayesian Stacked Convolutional Auto-encoders (SCAE) for visual data.

These components automatically learn semantic representations of items. The CKE framework then jointly learns these knowledge base embeddings alongside the latent representations from collaborative filtering, optimizing them together for the recommendation task. Extensive experiments on MovieLens-1M and IntentBooks datasets demonstrate that CKE significantly outperforms state-of-the-art baselines. The results confirm the additive value of each knowledge modality and the superiority of automatically learned embeddings and joint learning over traditional feature engineering or separate learning approaches.

7.2. Limitations & Future Work

The paper highlights that its research sheds new light on the usage of heterogeneous information in the knowledge base, which can be consumed in more application scenarios. While it doesn't explicitly list limitations in a dedicated section, some implicit limitations and future directions can be inferred:

Computational Complexity: Deep learning models and joint optimization, especially with large knowledge bases, can be computationally intensive, which might be a practical limitation for very large-scale systems. The paper uses SGD, which helps, but training time is not discussed in detail.
Generalizability of Hyperparameters: Optimal hyperparameters are dataset-specific (as shown in Table 2). Finding these for new datasets still requires a validation set and grid search, which can be time-consuming.
Feature Interaction Complexity: The item latent vector is formed by a simple additive combination of the different embedding types ( $\mathbf{e}_j = \eta_j + \mathbf{v}_j + \mathbf{X}_{\frac{L_t}{2}, j*} + \mathbf{Z}_{\frac{L_v}{2}, j*}$ ). More complex, non-linear ways of combining these heterogeneous features might further improve performance, capturing intricate interactions between structural, textual, and visual semantics.
Cold Start for New Knowledge Base Entities: While it addresses cold start for new items in the recommender system (if they exist in the KB), it doesn't explicitly discuss how to handle cold start for entirely new entities within the knowledge base itself.
User Side Information: The framework primarily focuses on enriching item representations. Incorporating heterogeneous information about users (e.g., demographic data, social network information, textual profiles) from a knowledge base could be a natural extension.
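
The additive combination discussed under Feature Interaction Complexity can be sketched in a few lines. The item vector follows the formula quoted above (offset plus structural, textual, and visual embeddings). The pairwise preference function assumes a BPR-style objective, in which the probability that a user prefers item j over item k is the sigmoid of the score difference; that part is an assumption made here for illustration, as are all names and toy values.

```python
import math
from typing import List

Vector = List[float]


def item_vector(eta: Vector, v: Vector, x: Vector, z: Vector) -> Vector:
    """Additive fusion: offset + structural + textual + visual embeddings."""
    return [a + b + c + d for a, b, c, d in zip(eta, v, x, z)]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def pairwise_preference(u: Vector, e_j: Vector, e_k: Vector) -> float:
    """Assumed BPR-style probability that user u prefers item j over item k."""
    return 1.0 / (1.0 + math.exp(-(dot(u, e_j) - dot(u, e_k))))


# Toy 2-dimensional example with hypothetical embeddings.
e_j = item_vector([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0])
e_k = item_vector([0.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.0, 0.0])
user = [1.0, 1.0]
print(e_j)
print(pairwise_preference(user, e_j, e_k))
```

Because the fusion is a plain sum, all four vectors must share one dimensionality and each modality contributes independently; a non-linear fusion would replace `item_vector` while leaving the preference function unchanged.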

Future work could explore:
More sophisticated fusion mechanisms for heterogeneous embeddings.
Scalability solutions for even larger knowledge bases and user-item interaction data.
Extending the framework to incorporate user-side knowledge base information.
Investigating the interpretability of the learned embeddings.
Applying CKE to other domains or types of recommendation tasks (e.g., sequential recommendation).

7.3. Personal Insights & Critique

The CKE paper presents a compelling and logically structured approach to a pervasive problem in recommender systems. Its strength lies in its comprehensive integration of multi-modal knowledge from knowledge bases using modern deep learning and network embedding techniques. The explicit selection of TransR for heterogeneous graphs and SCAE for visual data demonstrates a thoughtful understanding of the data characteristics. The joint learning objective is also a critical design choice, ensuring that the rich semantic representations are directly optimized for improving recommendations, rather than being learned in isolation.

One key inspiration drawn from this paper is the powerful synergy created by combining strengths from different subfields: the relational richness of knowledge graphs, the representation power of deep learning, and the effectiveness of collaborative filtering. This multi-faceted approach offers a robust solution to data sparsity that is often superior to single-paradigm methods.

Potential issues or areas for improvement could include:

Interpretability: While embeddings are powerful, their black-box nature can make it challenging to explain why a specific item was recommended based on its structural, textual, or visual properties. Future work could focus on adding explainability layers.
Knowledge Base Construction and Maintenance: The effectiveness of CKE heavily relies on the quality and completeness of the underlying knowledge base. Building and maintaining such a rich, clean, and up-to-date KB for diverse item types can be a significant practical challenge. The manual mapping step for MovieLens-1M highlights this effort.
Computational Cost: Training complex deep learning architectures and graph embeddings jointly can be computationally expensive, requiring substantial hardware resources and training time. While SGD is used, the overall training duration is not detailed, which could be a concern for practical deployment with frequently updated KBs or models.
Hyperparameter Sensitivity: As with many deep learning models, the performance can be sensitive to hyperparameter choices. While optimal values are provided, the process of finding them for new domains without extensive tuning remains a practical hurdle.

Despite these potential considerations, the CKE framework provides a robust and innovative blueprint for leveraging heterogeneous auxiliary information to enhance recommender systems, making a significant contribution to the field. Its methods and conclusions could be transferred to other domains where items possess diverse structured and unstructured content, such as e-commerce product recommendations, scientific article suggestions, or even personalized educational content delivery.
