
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints

Published: 10/22/2025



This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

KORE enhances knowledge injection in large multimodal models by structured augmentations and constraints, preserving old knowledge via null space projection to mitigate forgetting and enable precise adaptation to new knowledge.

Abstract

Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, its knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.


1. Bibliographic Information

1.1. Title

KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints

1.2. Authors

Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, Qing Li.

Their affiliations span several prominent institutions, including the University of Science and Technology of China, State Key Laboratory of General Artificial Intelligence, BIGA Technology, Beijing Institute of Technology, Ludwig Maximilian University of Munich, University of Sydney, and Shandong University. This indicates a collaborative effort from multiple research groups specializing in AI, natural language processing, and multimodal learning.

1.3. Journal/Conference

Published as a preprint on arXiv (arxiv.org/abs/2510.19316). As an arXiv preprint, it has not necessarily completed formal peer review, and no conference or journal venue is stated. arXiv is a widely used platform for sharing research in fields like AI before or alongside formal publication.

1.4. Publication Year

2025 (published on 2025-10-22 at 07:26:55 UTC).

1.5. Abstract

Large Multimodal Models (LMMs) inherently store vast factual knowledge within their pre-trained weights. However, this knowledge is static and quickly becomes outdated, impeding their ability to continuously acquire new information. Effective knowledge injection is crucial, necessitating two objectives: knowledge adaptation (integrating new knowledge) and knowledge retention (preserving existing knowledge). Current methods often struggle with learning new knowledge effectively and suffer from catastrophic forgetting (the tendency of a neural network to completely and abruptly forget previously learned information upon learning new information).

To address these challenges, the authors propose KORE (KnOwledge-oRientEd augmentations and constraints), a synergistic method for injecting new knowledge into LMMs while preserving old knowledge. Unlike general data augmentation techniques for text or images, KORE automatically transforms individual knowledge items into structured and comprehensive formats, ensuring the model accurately learns and adapts to new information. Simultaneously, KORE stores previous knowledge within the covariance matrix of the LMM's linear layer activations. It then initializes an adapter by projecting the original weights into the null space of this matrix, thereby defining a fine-tuning direction that minimizes interference with prior knowledge and ensures robust retention.

Extensive experiments conducted on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, demonstrate that KORE achieves superior performance in new knowledge injection and significantly mitigates catastrophic forgetting.

1.6. Original Source Link

Original link (preprint): https://arxiv.org/abs/2510.19316
PDF link: https://arxiv.org/pdf/2510.19316v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the static and limited nature of knowledge in Large Multimodal Models (LMMs). While LMMs, similar to Large Language Models (LLMs), demonstrate remarkable capabilities in storing and recalling world knowledge embedded in their pre-trained weights, this knowledge cannot evolve. In a rapidly changing world, this leads to:

Outdated Responses: Models provide information that is no longer current.
Inability for Continuous Acquisition: LMMs cannot naturally learn new information as it emerges.

This limitation hinders the continuous evolution and real-world applicability of these powerful models. Therefore, the ability to effectively inject new knowledge becomes critical. This process involves a dual challenge:

Knowledge Adaptation: The model must effectively learn and integrate new information.
Knowledge Retention: The model must simultaneously preserve its vast pre-existing knowledge, avoiding catastrophic forgetting.

Existing methods, such as full fine-tuning (updating all model weights) or Parameter-Efficient Fine-Tuning (PEFT) (updating only a small subset of parameters), often fall short. Full fine-tuning is computationally expensive and prone to overfitting (where a model learns the training data too well, including noise, and performs poorly on unseen data), failing to generalize new knowledge. PEFT methods, while resource-friendly, also suffer from catastrophic forgetting. Continual learning techniques aim to mitigate forgetting but often struggle to strike a balance between acquiring new knowledge and retaining old, potentially impairing adaptation or leading to irrelevant responses.

The paper's innovative idea is to propose a synergistic method that tackles both knowledge adaptation and retention simultaneously through knowledge-oriented augmentations and knowledge-oriented constraints. The entry point is to move beyond superficial data augmentations and generic parameter regularization by deeply understanding and structuring knowledge for learning, and precisely constraining updates based on the model's internal representation of prior knowledge.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of knowledge injection for LMMs:

Novel Synergistic Method (KORE): Introduction of KORE, a unified approach that combines Knowledge-Oriented Augmentations and Knowledge-Oriented Constraints. This synergy explicitly addresses the trade-off between knowledge adaptation and retention, a core problem in continuous learning for LMMs.
Knowledge-Oriented Augmentation (KORE-AUGMENTATION): Development of an automated pipeline that converts individual knowledge items into a deep, structured "knowledge tree" format. This involves generating multi-round dialogues and diverse instruction tasks (e.g., visual recognition, image captioning, VQA) around new knowledge. The strategy goes beyond surface-level data variations so that the model internalizes the new knowledge, generalizes from it, and can reason about it.
Knowledge-Oriented Constraint (KORE-CONSTRAINT): A novel constraint mechanism that leverages the covariance matrix of LMM's linear layer activations to represent previous knowledge. By decomposing this matrix and projecting original weights into its null space to initialize an adapter (specifically, a LoRA adapter), KORE defines a fine-tuning direction that minimally interferes with pre-existing knowledge, thereby ensuring powerful retention and mitigating catastrophic forgetting.
Empirical Validation and Superior Performance: Extensive experiments on diverse LMMs (LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B) demonstrate that KORE consistently outperforms state-of-the-art baselines (including Full-FT, LoRA, Replay, and various continual learning methods like EWC, LwF, MoELoRA, O-LoRA, SEFE) in both new knowledge adaptation and old knowledge retention.
Universality and Robustness: KORE's effectiveness is shown to be largely independent of the specific LMM architecture or scale, performing well across different models.
Customizable Knowledge Constraints: The method allows for the creation of specific knowledge-oriented constraints by sampling data related to particular benchmarks. This enables tailored knowledge retention, enhancing performance on targeted knowledge types without significantly compromising other abilities.

In summary, KORE offers a robust and flexible solution for enabling LMMs to continuously acquire and retain knowledge, thereby supporting their evolution and broader application in dynamic real-world scenarios.

3. Prerequisite Knowledge & Related Work

This section provides an overview of the fundamental concepts and related research necessary to understand the methodology and contributions of the KORE paper.

3.1. Foundational Concepts

Large Multimodal Models (LMMs)

Large Multimodal Models are advanced artificial intelligence systems that can process and understand information from multiple modalities, typically text and images (and sometimes audio or video). They are built upon the foundation of Large Language Models (LLMs) but extend their capabilities to interpret and generate content that integrates visual information. LMMs learn extensive factual knowledge, common sense, and reasoning abilities during a pre-training phase on massive datasets, storing this knowledge in their billions or trillions of parameters (weights). Examples include LLaVA and Qwen-VL.

Knowledge Injection

Knowledge injection refers to the process of updating a pre-trained model with new factual information or capabilities. This is crucial for LMMs because their pre-trained knowledge is static and can quickly become outdated. Effective knowledge injection aims to achieve two primary goals:

Knowledge Adaptation: The model's ability to accurately learn and effectively utilize the newly introduced knowledge. This involves not just memorizing facts but also generalizing and reasoning with them.
Knowledge Retention: The model's ability to preserve its previously acquired knowledge and capabilities while learning new information. Without proper retention mechanisms, LMMs can suffer from catastrophic forgetting.

Catastrophic Forgetting

Catastrophic forgetting, also known as catastrophic interference, is a phenomenon observed in artificial neural networks where learning new information causes the abrupt and complete loss of previously learned information. When a neural network is fine-tuned on a new task or dataset, its weights are updated, which can overwrite the knowledge acquired from prior training, leading to a significant degradation in performance on older tasks. This is a major challenge in continual learning and knowledge injection for LLMs and LMMs.

Parameter-Efficient Fine-Tuning (PEFT)

Parameter-Efficient Fine-Tuning (PEFT) methods are a class of techniques designed to adapt large pre-trained models to new tasks or data without updating all of their parameters. Instead, PEFT methods typically freeze the majority of the pre-trained weights and introduce a small number of new, trainable parameters. This significantly reduces computational costs (training time, memory) and storage requirements compared to full fine-tuning. Common PEFT techniques include:

Adapters: Small, task-specific neural network modules inserted between layers of the pre-trained model. Only the adapter weights are trained.
LoRA (Low-Rank Adaptation): Decomposes weight updates into low-rank matrices. This method trains only these small, low-rank matrices while keeping the original pre-trained weights fixed.
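
As a rough illustration of the LoRA idea described above, the sketch below applies a frozen weight plus a low-rank update to an input vector. The sizes and the helper name `lora_forward` are illustrative assumptions, and the zero initialization of `B` follows the standard LoRA recipe rather than anything specific to KORE.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2               # illustrative sizes, not taken from the paper

W0 = rng.normal(size=(d_out, d_in))    # frozen pre-trained weight
B = np.zeros((d_out, r))               # trainable; zero init so the update starts at 0
A = rng.normal(size=(r, d_in))         # trainable

def lora_forward(x, W0, B, A):
    """Apply the adapted layer (W0 + B @ A) to an input vector x."""
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=(d_in,))
y = lora_forward(x, W0, B, A)

trainable = B.size + A.size            # parameters updated by LoRA
full = W0.size                         # parameters updated by full fine-tuning
```

With these sizes the adapter trains 28 parameters instead of 48; the gap widens quickly as the layer dimensions grow while the rank stays small.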

Data Augmentation

Data augmentation is a strategy used to increase the diversity of training data by creating modified versions of existing data. For images, this might involve rotations, flips, or color changes. For text, it could involve synonym replacement, rephrasing, or back-translation. The goal is to make the model more robust and improve its generalization capabilities by exposing it to a wider variety of examples during training. In the context of knowledge injection, data augmentation can help the model learn new facts more thoroughly.

Continual Learning

Continual learning (or lifelong learning) is a machine learning paradigm that aims to enable models to learn sequentially from a continuous stream of data, accumulating knowledge over time without forgetting previously learned information. It directly addresses the problem of catastrophic forgetting. Techniques in continual learning often involve strategies like rehearsal (re-training on a small subset of old data), regularization (adding penalties to weight updates to preserve important parameters for old tasks), or dynamic architectures (modifying the model's structure as new tasks arrive).

Covariance Matrix

In statistics, the covariance matrix is a square matrix that describes the covariance between each pair of random variables in a dataset. For a vector of random variables $X = [X_1, X_2, \ldots, X_n]^T$, the covariance matrix $C$ is defined by $C_{ij} = \mathrm{cov}(X_i, X_j)$. The diagonal elements $C_{ii}$ are the variances of each variable, and the off-diagonal elements $C_{ij}$ are the covariances between variables $X_i$ and $X_j$. In machine learning, particularly in models like LMMs, the covariance matrix of activations within a layer can capture statistical dependencies and patterns that represent learned knowledge or task-specific features.
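
The short sketch below computes a covariance matrix on synthetic data with NumPy. It also forms the uncentered product $X X^T$, which is the form the KORE-CONSTRAINT description in Section 4.2.2 uses; the data and variable names here are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_obs = 3, 500
X = rng.normal(size=(n_vars, n_obs))   # rows are variables, columns are observations

C_centered = np.cov(X)                 # mean-centered, normalized by n_obs - 1
C_uncentered = X @ X.T                 # second-moment matrix, no centering or scaling

# The same centered matrix computed by hand, to make the definition explicit.
Xc = X - X.mean(axis=1, keepdims=True)
C_manual = Xc @ Xc.T / (n_obs - 1)
```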

Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a powerful matrix factorization technique. For any real matrix $M$ , SVD decomposes it into three other matrices: $ M = U \Sigma V^T $ where:

$U$ is an orthogonal matrix whose columns are the left singular vectors.
$\Sigma$ is a diagonal matrix containing the singular values in descending order. These values represent the "strength" or "importance" of each corresponding singular vector.
$V^T$ is the transpose of an orthogonal matrix $V$ , whose columns are the right singular vectors.

SVD has numerous applications, including dimensionality reduction, noise reduction, and finding the null space of a matrix.
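
A minimal NumPy sketch of the decomposition above, using a random matrix as a stand-in. It rebuilds the matrix from its factors and forms a rank-2 approximation from the two largest singular values; the matrix size and the choice `k = 2` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 6))

# s holds the singular values in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_rebuilt = U @ np.diag(s) @ Vt

# Keep only the k strongest directions (dimensionality reduction).
k = 2
M_rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```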

Null Space

The null space (or kernel) of a matrix $A$ is the set of all vectors $x$ for which $Ax = 0$. In other words, it is the set of vectors that are "annihilated" by $A$. Geometrically, the null space is the orthogonal complement of the row space of $A$. In the context of KORE, constraining an adapter to the (approximate) null space of an activation covariance matrix means that the adapter has minimal effect on the activation patterns (knowledge) that the matrix represents, which helps preserve old knowledge.
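
One common way to obtain a null-space basis numerically is through the SVD: the right singular vectors whose singular values are (near) zero span the null space. The helper name `null_space_basis` and the tolerance are assumptions for this sketch.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Return an orthonormal basis (as columns) for the null space of A."""
    _, s, Vt = np.linalg.svd(A, full_matrices=True)
    rank = int(np.sum(s > tol))        # number of non-negligible singular values
    return Vt[rank:].T                 # remaining right singular vectors

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])        # second row is twice the first, so rank 1
N = null_space_basis(A)
```

Here `A` has rank 1 in a 3-dimensional input space, so the null space is 2-dimensional and every column of `N` is mapped to (numerically) zero by `A`.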

3.2. Previous Works

Retrieval-Augmented Generation (RAG)

RAG (Song et al., 2016; Fan et al., 2020; Lewis et al., 2020) is an alternative paradigm for handling external knowledge. Instead of modifying model parameters, RAG systems retrieve relevant information from an external knowledge base at inference time and use it to augment the model's generated response. This preserves pre-trained knowledge and allows access to up-to-date information without fine-tuning. However, its effectiveness depends heavily on the quality and speed of the retrieval system. KORE differentiates itself by focusing on directly modifying model parameters for knowledge injection.

Full Fine-Tuning

Full fine-tuning updates all parameters of a pre-trained model on new data. While it can achieve high performance on the new task, it is computationally intensive, requires significant storage, and notoriously suffers from catastrophic forgetting of previously learned knowledge. The paper explicitly contrasts KORE with Full-FT in its ability to balance adaptation and retention.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods like adapters (Houlsby et al., 2019; Hu et al., 2022; Bi et al., 2025b) and new tokens (Lester et al., 2021; Sabbatella et al., 2024) significantly reduce computational and storage costs by updating only a small fraction of parameters. LoRA (Hu et al., 2022) is a prominent PEFT technique that KORE builds upon. Despite their efficiency, general PEFT methods still face challenges in effective knowledge injection and mitigating catastrophic forgetting.

Continual Learning (CL) Techniques

CL aims to enable models to learn sequentially without forgetting.

Rehearsal-based methods (Li & Hoiem, 2017a; Hou et al., 2019) store and periodically replay a subset of old data alongside new data. The Replay baseline in the paper's comparisons follows this approach.
Parameter regularization methods (Kirkpatrick et al., 2017 - EWC; Li & Hoiem, 2017b - LwF) add penalties to the loss function to prevent significant changes to parameters important for old tasks.
- EWC (Elastic Weight Consolidation) calculates the importance of each parameter using the Fisher Information Matrix and penalizes changes to important parameters.
- LwF (Learning without Forgetting) uses knowledge distillation to retain old knowledge by having the new model mimic the outputs of the old model on previous tasks.
Other CL approaches include dynamic architectures (Yan et al., 2021) and complementary projection-based methods (Farajtabar et al., 2020; Chaudhry et al., 2020; Saha et al., 2021). KORE draws inspiration from CL but aims to optimize the balance between new knowledge acquisition and prior knowledge retention more effectively, especially for LLMs and LMMs.
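
To make the EWC idea above concrete, the sketch below evaluates the usual quadratic penalty with a diagonal Fisher approximation. The parameter values, Fisher values, and the name `ewc_penalty` are made up for illustration and are not taken from the paper's experiments.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher_diag, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2)

theta_old = np.array([1.0, -2.0, 0.5])     # parameters after the old task
fisher = np.array([10.0, 0.1, 0.0])        # per-parameter importance (diagonal Fisher)
theta = np.array([1.1, -1.0, 2.0])         # parameters after learning new data

penalty = ewc_penalty(theta, theta_old, fisher)
```

A small move of 0.1 on the important first parameter costs as much as a move of 1.0 on the second, and the third parameter, with zero importance, can change freely.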

CO-SVD and Orthogonal Subspace Constraints

Prior work (Meng et al., 2023; Yang et al., 2024) has explored how covariance matrices of LLM activations capture knowledge. KORE extends CO-SVD (Context-Oriented Decomposition Adaptation) from text-only LLMs to multimodal LMMs to verify the knowledge-capturing ability of covariance matrices in a multimodal context.
Methods like O-LoRA (Wang et al., 2023) use orthogonal subspace constraints to mitigate catastrophic forgetting by allocating independent, orthogonal parameter subspaces for different tasks. KORE-CONSTRAINT leverages the concept of null space projection, which is geometrically related to orthogonal subspaces, to define a fine-tuning direction that minimizes interference.

EvOKE Benchmark

EvOKE (Jiang et al., 2025) is a new benchmark specifically designed to evaluate how well LMMs can learn evolving knowledge without forgetting their original capabilities. KORE uses EvOKE for its knowledge adaptation evaluation, highlighting its relevance to the paper's core problem.

3.3. Technological Evolution

The journey of LLMs and LMMs started with pre-training on vast, static datasets, encoding immense world knowledge into their weights. This initial phase yielded models capable of impressive feats but with a critical limitation: their knowledge was frozen at the time of pre-training.

The first attempt to update this knowledge was full fine-tuning, which quickly proved too expensive and detrimental to old knowledge (catastrophic forgetting). This led to the development of PEFT methods like LoRA, which drastically reduced computational costs but still struggled with the adaptation-retention trade-off. Simultaneously, the continual learning field developed techniques to mitigate catastrophic forgetting, but these often focused on balancing existing tasks and struggled to scale effectively to the sheer volume of world knowledge in LLMs/LMMs. RAG emerged as an alternative, bypassing fine-tuning for knowledge updates by retrieving external information.

KORE fits into this evolution by proposing a hybrid approach that enhances the fine-tuning paradigm. It acknowledges the limitations of existing PEFT and continual learning methods by introducing knowledge-oriented strategies for both learning new knowledge (KORE-AUGMENTATION) and protecting old knowledge (KORE-CONSTRAINT). This represents a step towards truly continuous knowledge acquisition for LMMs, aiming for a more balanced and effective solution than previous attempts.

3.4. Differentiation Analysis

KORE differentiates itself from existing knowledge injection methods through its synergistic and knowledge-oriented approach:

vs. Full Fine-Tuning & General PEFT (e.g., LoRA):
- Differentiation: Full fine-tuning and LoRA primarily focus on minimizing loss on new data. KORE explicitly addresses catastrophic forgetting and robust generalization by introducing knowledge-oriented augmentations and constraints.
- Innovation: KORE-AUGMENTATION creates a structured "knowledge tree" for deeper learning of new knowledge, moving beyond simple data fitting. KORE-CONSTRAINT actively preserves old knowledge by leveraging the null space of activation covariance matrices, a mechanism absent in standard PEFT.
vs. Retrieval-Augmented Generation (RAG):
- Differentiation: RAG accesses external knowledge at inference time without modifying model parameters. KORE directly modifies parameters to internalize new knowledge, making it an intrinsic part of the model.
- Innovation: KORE aims for knowledge internalization, allowing models to reason and generalize with new facts, which RAG might not inherently achieve as it relies on external lookup.
vs. Continual Learning (CL) Methods (e.g., EWC, LwF, Replay, O-LoRA, MoELoRA, SEFE):
- Differentiation: While CL methods aim to mitigate catastrophic forgetting, they often struggle to balance adaptation and retention. Some CL methods might impair adaptation (e.g., EWC), or are not knowledge-oriented in their augmentation or constraint mechanisms.
- Innovation: KORE-AUGMENTATION provides a way to learn new knowledge in depth, which general CL methods lack. KORE-CONSTRAINT offers a knowledge-driven fine-tuning constraint: it uses covariance matrices to represent and protect existing knowledge. Its direct link to the model's internal representation of knowledge distinguishes it from general parameter regularization (e.g., EWC) and from orthogonal subspace methods (O-LoRA). The paper specifically claims that KORE "optimizes the balance between injecting new knowledge and preserving old knowledge" better than these baselines.
  
In essence, KORE's innovation lies in its dual, synergistic, and knowledge-centric approach. It doesn't just augment data or constrain parameters generally; it designs these processes specifically around the nature of knowledge to ensure more effective learning and protection in LMMs.

4. Methodology

KORE is designed as a synergistic method comprising two main components: KORE-AUGMENTATION and KORE-CONSTRAINT. These components work together to optimize the balance between injecting new knowledge (knowledge adaptation) and preserving old knowledge (knowledge retention) in Large Multimodal Models (LMMs).

4.1. Principles

The core idea behind KORE is to address the limitations of existing knowledge injection methods, which often struggle with generalization for new knowledge and suffer from catastrophic forgetting of old knowledge.

Knowledge-Oriented Augmentation: Instead of superficial data variations, KORE aims to build a deep, structured understanding of new knowledge by automatically converting individual knowledge items into comprehensive and interconnected data formats. This ensures the model genuinely learns and can flexibly manipulate the new information, promoting accurate adaptation.
Knowledge-Oriented Constraint: To prevent catastrophic forgetting, KORE identifies and protects the model's internal representations of previous knowledge. It achieves this by leveraging the statistical patterns (captured in covariance matrices) of LMM's activations and then guiding fine-tuning directions to minimally interfere with these patterns, ensuring powerful retention.

The synergistic combination of these two principles allows KORE to achieve superior performance in both knowledge adaptation and knowledge retention.

4.2. Core Methodology In-depth (Layer by Layer)

The overall architecture of KORE is illustrated in Figure 2.

4.2.1. KORE-AUGMENTATION

Existing knowledge injection methods often lead to poor generalization because they struggle to help models truly master new knowledge. Inspired by the fact that data augmentation can enhance generalization, KORE proposes KORE-AUGMENTATION. Unlike general augmentation techniques that produce shallow, discrete data variations, KORE-AUGMENTATION employs an automated pipeline to build structured and comprehensive knowledge from individual items, facilitating accurate adaptation.

The key insight of KORE-AUGMENTATION is to transform original knowledge into a "knowledge tree" format, consisting of multi-round dialogue data (forming the trunk) and instruction tasks data (forming the branches), as depicted in Figure 3. This structured approach moves beyond mere "data memorization" and aims to help the model comprehend the inherent logic and associations within the knowledge, enabling knowledge internalization.

The construction process for KORE-74K (the dataset generated by KORE-AUGMENTATION) is fully automated, with only question templates being manually crafted (see Figure 13 in Appendix H).

The steps involved in KORE-AUGMENTATION are:

Step 1: Constructing Multi-rounds of Dialogue Data

This component forms the "trunk" of the knowledge tree and consists of two parts for each knowledge sample:

Heuristic Q&A (H.Q in Figure 2): These are constructed randomly using manually written templates. For instance, for news, templates might include: "What is the {type} news in the image about?" or "Could you summarize the {type} news story presented in the image?". For entities, "What is the {type} entity in the image?" or "Can you identify the {type} entity shown in the picture?".
Dialogue Q&A: GPT-4o is used to generate up to 10 dialogues based on original textual knowledge, following rigorous rules and diverse task examples. The first turn of the dialogue is usually a heuristic Q&A pair. Subsequent turns are generated automatically by GPT-4o based on the original knowledge, predefined rules, and previous questions/answers. The query images for these dialogues are directly from the original image set. This process yields a large volume of dialogue data (e.g., 75,710 dialogue rounds for KORE-74K).
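
The heuristic Q&A part of Step 1 can be sketched as random template selection followed by slot filling. The templates are the ones quoted above; the function name, the `"sports"` type label, and the idea of passing an explicit random generator are assumptions of this sketch, and the GPT-4o dialogue generation is not reproduced.

```python
import random

NEWS_TEMPLATES = [
    "What is the {type} news in the image about?",
    "Could you summarize the {type} news story presented in the image?",
]
ENTITY_TEMPLATES = [
    "What is the {type} entity in the image?",
    "Can you identify the {type} entity shown in the picture?",
]

def heuristic_question(kind, type_label, rng):
    """Pick a template at random for 'news' or 'entity' and fill in the type slot."""
    if kind not in ("news", "entity"):
        raise ValueError(f"unknown knowledge kind: {kind!r}")
    templates = NEWS_TEMPLATES if kind == "news" else ENTITY_TEMPLATES
    return rng.choice(templates).format(type=type_label)

rng = random.Random(0)                       # seeded only to make the sketch repeatable
question = heuristic_question("news", "sports", rng)
```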

Step 2: Collecting Recognition and Caption Images

This step supports the "branches" related to visual tasks.

Image Retrieval: News titles or entity names are used as search keywords to retrieve the top five images via Google Search.
Visual Feature Extraction: CLIP (Radford et al., 2021) is used to extract visual features from both the original image and the newly downloaded images.
Selection: The two downloaded images with the highest cosine similarity to the original image (excluding identical matches, i.e., $\text{similarity} \neq 1$) are retained. These serve as query images for subsequent visual recognition and captioning tasks.
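
The selection rule in Step 2 can be sketched as follows. Small hand-written vectors stand in for CLIP image features, and the function name and the tolerance used to detect identical matches are assumptions; the actual feature extraction and image download are outside this sketch.

```python
import numpy as np

def select_similar(original_feat, candidate_feats, k=2, identical_tol=1e-6):
    """Return indices of the k candidates most similar to the original,
    skipping candidates whose cosine similarity is (numerically) 1."""
    o = original_feat / np.linalg.norm(original_feat)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sims = c @ o                                       # cosine similarities
    keep = np.flatnonzero(sims < 1.0 - identical_tol)  # drop identical matches
    order = keep[np.argsort(-sims[keep])]              # most similar first
    return order[:k].tolist(), sims

original = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [1.0, 0.0, 0.0],   # identical to the original, must be excluded
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.5, 0.5, 0.5],
])
selected, sims = select_similar(original, candidates)
```

In this toy example the first candidate is dropped as an identical match, and candidates 1 and 3 are kept as the two most similar remaining images.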

Step 3: Constructing Visual Recognition QA

This task aims to verify if the model can recognize specific elements related to the new knowledge in images.

Template-based Questions: Questions are randomly selected from manually written templates (e.g., "Is the image depicting news {title}?", "Can you see {entity_name} in this picture?").
Fixed Answer: The answer is always "Yes".
Instruction: The model is instructed to "Answer this question with Yes or No."
Query Image: One of the images selected in Step 2 serves as the query image.

Step 4: Constructing Image Caption QA

This task assesses the model's ability to summarize the new knowledge in a descriptive paragraph linked to an image.

Summary Generation: GPT-4o generates a summary based on the original textual knowledge, which serves as the answer.
Template-based Questions: Questions are randomly selected from templates (e.g., "Could you please describe the {type} news shown in the picture?", "Please provide a description for the {type} entity in the image.").
Instruction: The model is instructed to "Answer this question in one paragraph."
Query Image: The remaining image from Step 2 (not used in Visual Recognition QA) serves as the query image.
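
Steps 3 and 4 together can be sketched as building two samples from the two images kept in Step 2, one per task. The record layout, the function name, the random assignment of images to tasks, and appending the instruction to the question text are all assumptions of this sketch; the templates and instructions are the ones quoted above.

```python
import random

def build_branch_samples(kind, type_label, name, summary, images, rng):
    """Build one visual-recognition sample and one caption sample.

    kind: 'news' or 'entity'; images: the two images retained in Step 2;
    summary: the caption answer (generated by GPT-4o in the paper's pipeline).
    """
    if len(images) != 2:
        raise ValueError("expected exactly two retrieved images")
    recog_img, caption_img = rng.sample(images, 2)   # each image is used once
    if kind == "news":
        recog_q = f"Is the image depicting news {name}?"
        caption_q = f"Could you please describe the {type_label} news shown in the picture?"
    else:
        recog_q = f"Can you see {name} in this picture?"
        caption_q = f"Please provide a description for the {type_label} entity in the image."
    recognition = {
        "image": recog_img,
        "question": recog_q + " Answer this question with Yes or No.",
        "answer": "Yes",                             # the answer is always "Yes"
    }
    caption = {
        "image": caption_img,
        "question": caption_q + " Answer this question in one paragraph.",
        "answer": summary,
    }
    return recognition, caption

recognition, caption = build_branch_samples(
    "entity", "person", "Example Entity", "A short placeholder summary.",
    ["img_a.jpg", "img_b.jpg"], random.Random(0),
)
```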

Step 5: Constructing VQA (Visual Question Answering)

This task challenges the model with more complex visual reasoning related to the new knowledge.

Quadruplet Generation: GPT-4o generates quadruplets (Q, A, S, H) from the original textual knowledge, where $Q$ is a question, $A$ is its answer (single word or phrase), $S$ is the subject in the question, and $H$ is the hypernym corresponding to the subject. For example: (Q: "Who attempted to assassinate the person in the image during a campaign rally in July 2024?", A: "Thomas Matthew Crooks", S: "Donald John Trump", H: "Person").
Image Retrieval for VQA: The subject and hypernym are combined as search keywords to retrieve and download the top-ranked image from Google.
Instruction: The model is instructed to "Answer the question using a single word or phrase."
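
A possible in-memory representation of the Step 5 quadruplet is sketched below, using the example given above. The class and function names, the space-joined keyword string, and the image path are hypothetical; the source only states that subject and hypernym are combined as search keywords.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quadruplet:
    question: str   # Q
    answer: str     # A, a single word or phrase
    subject: str    # S, the subject referred to in the question
    hypernym: str   # H, the hypernym of the subject

def search_keywords(q):
    """Combine subject and hypernym into one query string (joining is an assumption)."""
    return f"{q.subject} {q.hypernym}"

def to_vqa_sample(q, image_path):
    """Attach the retrieved image and the answer-format instruction."""
    return {
        "image": image_path,
        "question": q.question + " Answer the question using a single word or phrase.",
        "answer": q.answer,
    }

quad = Quadruplet(
    question="Who attempted to assassinate the person in the image "
             "during a campaign rally in July 2024?",
    answer="Thomas Matthew Crooks",
    subject="Donald John Trump",
    hypernym="Person",
)
keywords = search_keywords(quad)
sample = to_vqa_sample(quad, "retrieved_top1.jpg")   # hypothetical file name
```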

Through this automated pipeline, KORE-AUGMENTATION generates a rich and diverse dataset like KORE-74K, which enables the model to acquire new knowledge more effectively and generalize better.

4.2.2. KORE-CONSTRAINT

KORE-CONSTRAINT is designed to mitigate catastrophic forgetting by preserving previous knowledge during fine-tuning. It achieves this by identifying and protecting the patterns within LMM's internal representations that correspond to pre-trained knowledge.

The method involves the following steps:

Activation Collection: KORE-CONSTRAINT collects activations from the LMM's linear layers on a set of random samples that represent the pre-trained knowledge. Let the input activations to a linear layer be $X$: $ X \in \mathbb{R}^{d_{in} \times BL} $ where:
- $d_{in}$ is the input dimension of the linear layer.
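- $B$ is the number of samples (batch size). In the dimension $BL$ it is a scalar count, not the LoRA matrix $B$ that appears in the retention condition.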
- $L$ is the sequence length.
Covariance Matrix Computation: The covariance matrix $C$ of these activations is computed. This matrix is assumed to effectively capture previous knowledge patterns. $ C = X X^T \in \mathbb{R}^{d_{in} \times d_{in}} $ where:
- $C$ is the covariance matrix.
- $X^T$ is the transpose of $X$ .
Knowledge Retention Condition: For LoRA fine-tuning, the fine-tuned weights are $W^* = W_0 + BA$, where $W_0$ are the original weights, $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$ are the low-rank LoRA matrices, and $r$ is the LoRA rank. To retain knowledge, the layer's outputs on activations that carry pre-trained knowledge should stay the same after fine-tuning. This is formalized as: $ (W_0 + BA) C \approx W_0 C $ Subtracting $W_0 C$ from both sides gives: $ BAC \approx \mathbf{0} $ A sufficient condition, which holds for any $B$, is that the rows of $A$ lie in the null space of $C$ (since $C$ is symmetric, its left and right null spaces coincide): $ AC = \mathbf{0} $
Singular Value Decomposition (SVD) of Covariance: To find the null space of $C$ , SVD is applied to $X X ^ { T }$ (which is $C$ ). $ \operatorname { S V D } \left( \pmb { X } ( \pmb { X } ) ^ { T } \right) = U \Sigma U ^ { T } = \sum _ { i = 1 } ^ { d _ { in } } \sigma _ { i } \mathbf { u } _ { i } \mathbf { u } _ { i } ^ { T } $ where:
- $U$ is an orthogonal matrix whose columns are the left singular vectors $\mathbf{u}_i$ .
- $\Sigma$ is a diagonal matrix containing the singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_R > 0$ , where $R = \mathrm{rank}(C)$ . The remaining $\sigma_i$ for $i > R$ are zero.
- The null space of $C$ is spanned by the columns of $U$ that correspond to zero singular values.
Approximate Null Space Projection: In practice, exact zero singular values are rare due to numerical precision. KORE approximates the null space with $\hat { \boldsymbol { U } }$ , a submatrix containing the $r$ left singular vectors from $U$ that are associated with the smallest singular values. Here, $r$ refers to the LoRA's rank.
- This $\hat { \boldsymbol { U } } \in \mathbb { R } ^ { d _ { \mathrm { i n } } \times \boldsymbol { r } }$ is used to define a knowledge-oriented constraint projector.
- The projector is given by: $ P = \hat { U } \hat { U } ^ { T } $ This projector maps a vector onto the approximate null space of $C$ .
Adapter Initialization via Projection: The LoRA adapters are initialized by factorizing the pre-trained weights ( $W_0$ ) that have been projected into this null space.
- First, the SVD of the projected weights $W_0 P$ is computed: $ \operatorname { S V D } \left( W _ { 0 } P \right) = \left\{ U ^ { * } , \Sigma ^ { * } , ( V ^ { * } ) ^ { T } \right\} $ where $U^*$ , $\Sigma^*$ , and $V^*$ are the singular value decomposition components for $W_0 P$ .
- Then, the adapter matrices $B$ and $A$ are initialized as: $ B = U ^ { * } \sqrt { \Sigma ^ { * } } , \qquad A = \sqrt { \Sigma ^ { * } } \big ( V ^ { * } \big ) ^ { T } $ where $\sqrt { \pmb { \Sigma } ^ { * } }$ denotes a diagonal matrix with entries corresponding to the square roots of the singular values in $\Sigma^*$ .
Original Weight Adjustment: To ensure the model remains unchanged at the beginning of fine-tuning, the original weight matrix $W_0$ is adjusted with a residual term: $ W _ { 0 } ^ { \prime } = W _ { 0 } - B A $ This means the effective initial weight matrix for fine-tuning becomes $W_0'$ .
Freezing Matrix A: Given the asymmetry between LoRA matrices $A$ and $B$ , fine-tuning only $B$ is often sufficient for strong performance. KORE freezes matrix $A$ .
- By initializing $A$ such that its column space is related to the null space of $C$ (as shown in Appendix C, Theorem 1, where $\operatorname { C o l } ( A ) = \operatorname { C o l } ( U _ { \mathrm { n u l l } } ^ { T } )$ for the relevant null space of $C$ ), and then freezing $A$ , it ensures that $A C \approx \mathbf { 0 }$ .
- This makes the update term from LoRA (B A C) negligible regardless of how $B$ is updated during fine-tuning. Consequently, the output activations from pre-trained knowledge are approximately preserved.
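
The steps above can be sketched numerically. The following NumPy example is a minimal illustration on toy data, not the authors' implementation: the dimensions, the random seed, and the variable names (`U_hat`, `W0_res`, `B_new`) are assumptions chosen for the demonstration, and the activations are deliberately confined to a low-dimensional subspace so that $C$ has an exact null space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tokens, r = 16, 8, 200, 4  # toy sizes (assumed)

# Toy activations living in a 10-dimensional subspace of R^16,
# so the covariance matrix has a 6-dimensional null space.
basis = rng.standard_normal((d_in, 10))
X = basis @ rng.standard_normal((10, n_tokens))   # shape (d_in, B*L)
W0 = rng.standard_normal((d_out, d_in))           # stand-in for pre-trained weights

# Covariance of the activations and its SVD (singular values sorted descending).
C = X @ X.T
U, S, _ = np.linalg.svd(C)

# Approximate null space: the r singular vectors with the smallest singular values.
U_hat = U[:, -r:]                                 # shape (d_in, r)
P = U_hat @ U_hat.T                               # projector onto that subspace

# Factorize the projected weights and split the singular values between B and A.
U_s, S_s, Vt_s = np.linalg.svd(W0 @ P, full_matrices=False)
sqrt_S = np.sqrt(S_s[:r])
B = U_s[:, :r] * sqrt_S                           # shape (d_out, r)
A = sqrt_S[:, None] * Vt_s[:r, :]                 # shape (r, d_in), kept frozen

# Residual weights so that W0_res + B @ A equals W0 at initialization.
W0_res = W0 - B @ A

# Simulate fine-tuning by perturbing only B.
B_new = B + 0.1 * rng.standard_normal(B.shape)
W_star = W0_res + B_new @ A

print("max |A X|          :", np.abs(A @ X).max())
print("max |W* X - W0 X|  :", np.abs(W_star @ X - W0 @ X).max())
```

Because the rows of `A` lie in the span of `U_hat`, any update to `B` leaves the layer's output on `X` unchanged up to floating-point error. With real activations the smallest singular values are small but nonzero, so preservation is only approximate.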

Proof of KORE-CONSTRAINT (from Appendix C)

The paper provides two theorems to mathematically substantiate KORE-CONSTRAINT:

Theorem 1. Under the assumption that $W_0$ is full-rank, the column space of $A$ forms a subset of the column space of $U_{\mathrm{null}}$ . The proof proceeds as follows:

Step 1: Defines $A = \sqrt { \Sigma ^ { * } } { { \left( V ^ { * } \right) } ^ { T } }$ . Since $\sqrt { \Sigma ^ { * } }$ and $(V^*)^T$ are full-rank, their column space is preserved, so $\operatorname { C o l } ( A ) = \operatorname { C o l } ( \left( V ^ { * } \right) ^ { T } )$ .
Step 2: Uses the SVD of $W _ { 0 } { U _ { \mathrm { n u l l } } } U _ { \mathrm { n u l l } } ^ { T } = U ^ { * } \Sigma ^ { * } { ( V ^ { * } ) } ^ { T }$ . It states that the columns of $V^*$ span the row space of $W _ { 0 } { U _ { \mathrm { n u l l } } } U _ { \mathrm { n u l l } } ^ { T }$ , and since $U_{\mathrm{null}} U_{\mathrm{null}}^T$ is a projector onto the subspace spanned by $U_{\mathrm{null}}$ , the column space of $V^*$ is equal to the column space of $U_{\mathrm{null}}^T$ . $ \operatorname { C o l } ( V ^ { * } ) = \operatorname { C o l } ( W _ { 0 } U _ { \mathrm { n u l l } } U _ { \mathrm { n u l l } } ^ { T } ) = \operatorname { C o l } ( U _ { \mathrm { n u l l } } ^ { T } ) $
Step 3: Combining Step 1 and Step 2, it concludes that $\operatorname { C o l } ( A ) = \operatorname { C o l } ( U _ { \mathrm { n u l l } } ^ { T } )$ , meaning the column space of $A$ lies in the null space of $C$ .

Theorem 2. For a given layer $l$ in a large language model, suppose the input activation $\pmb { X } ^ { ( l ) }$ is derived from pre-trained world knowledge and remains unchanged. Then, under fine-tuning with KORE, the output of the layer is approximately preserved: $ { W ^ { * } } ^ { ( l ) } X ^ { ( l ) } \approx W _ { 0 } ^ { ( l ) } X ^ { ( l ) } $ where $W_0^{(l)}$ is the initial weight matrix and $W^{*(l)}$ is the fine-tuned weight matrix for layer $l$ . The proof proceeds as follows:

KORE defines the fine-tuned weight $W^{*(l)}$ for layer $l$ as: $ { \pmb W } ^ { * ( l ) } = { \pmb W } _ { 0 } ^ { ( l ) } - { \pmb B } ^ { ( l ) } { \pmb A } ^ { ( l ) } + { \pmb B } ^ { * ( l ) } { \pmb A } ^ { ( l ) } $ Here, $W_0^{(l)}$ is the original weight matrix. The term $-B^{(l)}A^{(l)}$ represents the initial adjustment to make the effective starting weight $W_0' = W_0 - BA$ . The term $+B^{*(l)}A^{(l)}$ represents the updated LoRA contribution during fine-tuning, where $B^{*(l)}$ is the fine-tuned $B$ matrix.
The output of the layer is: $ \pmb { W } ^ { \ast ( l ) } \pmb { X } ^ { ( l ) } = ( \pmb { W } _ { 0 } ^ { ( l ) } - \pmb { B } ^ { ( l ) } \pmb { A } ^ { ( l ) } + \pmb { B } ^ { \ast ( l ) } \pmb { A } ^ { ( l ) } ) \pmb { X } ^ { ( l ) } $
Using the approximation that $\pmb { A } ^ { ( l ) } \pmb { X } ^ { ( l ) } \approx \mathbf { 0 }$ (due to $A$ being initialized in the null space of $C$ and $C$ being $XX^T$ ), the terms involving $A^{(l)}X^{(l)}$ become negligible: $ \pmb { W } ^ { \ast ( l ) } \pmb { X } ^ { ( l ) } \approx \pmb { W } _ { 0 } ^ { ( l ) } \pmb { X } ^ { ( l ) } $ This demonstrates that the output remains approximately unchanged, thus preserving the pre-trained knowledge.
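
The last step relies on $A C \approx \mathbf{0}$ implying $A X \approx \mathbf{0}$, which the summary above does not spell out. One way to see it, offered here as a sketch that uses only the definition $C = X X^T$ and drops the layer superscript for brevity:

$ \| A X \|_F^2 = \operatorname{tr}\left( A X X^T A^T \right) = \operatorname{tr}\left( (A C) A^T \right) \leq \| A C \|_F \, \| A \|_F $

So $A C = \mathbf{0}$ forces $A X = \mathbf{0}$, and a small $\| A C \|_F$ keeps $\| A X \|_F$ small. Grouping the two LoRA terms then gives:

$ W^* X = W_0 X + ( B^* - B ) A X \approx W_0 X $

The bound depends on $A$ but not on $B^*$, which is consistent with freezing $A$ and training only $B$.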

Analysis of Knowledge-Oriented Constraint's Ability to Capture Knowledge

KORE-CONSTRAINT relies on the premise that covariance matrices effectively capture knowledge. To verify this for multimodal scenarios (extending CO-SVD from text to multimodal), the paper conducts experiments on LLaVA-v1.5 (7B) by applying Plain SVD, ASVD (Activation-aware SVD), and CO-SVD to decompose all layers' pre-trained weights. Weights are reconstructed after removing components corresponding to the smallest singular values.

The findings (from Figure 4(a) and (b), and Appendix D.1 Table 8):

CO-SVD consistently shows superior performance retention compared to Plain SVD and ASVD after reconstruction, especially when a large number of ranks are discarded. This suggests that multimodal knowledge is indeed effectively captured and stored in covariance matrices.
The number of sampled data points for covariance matrix computation has limited influence; even a small number (e.g., 32 samples) is sufficient to capture essential knowledge.
Using test-specific samples for covariance matrix computation leads to better performance on those specific tasks when discarding many ranks (e.g., CO-SVD with MME samples performs better on MME than ScienceQA samples). This indicates that covariance matrices can capture dataset-specific knowledge and exhibit distinct patterns for different tasks.

Visualizations of covariance matrices (Figure 4(c), Figure 8, and Figure 9 in Appendix D.2) for tasks like POPE (object hallucination), HallusionBench (entangled language hallucination and visual illusion), and MMBench (comprehensive evaluation) show:
Covariance matrices of linear layer inputs for related tasks (POPE and HallusionBench) share similar outlier patterns (marked by red circles), which differ from unrelated tasks (MMBench).
This indicates that distinct tasks activate different outlier distributions in the covariance matrix, empirically supporting that covariance matrix patterns characterize the triggered task. This ability is leveraged by KORE to guide the decomposition of pre-trained weights and initialize adapters with informative knowledge, enabling knowledge-oriented constraints.
For KORE, a multi-dimensional covariance matrix is built by sampling 64 examples per category from OneVision's single-image subset (General, Doc/Chart/Screen, Math/Reasoning, General OCR).

5. Experimental Setup

This section details the experimental design, including the benchmarks, evaluation metrics, baseline methods, and training configurations used to validate KORE.

5.1. Datasets

5.1.1. Knowledge Adaptation Evaluation

The primary benchmark for evaluating knowledge adaptation (injecting new knowledge) is EvOKE.

EvOKE (Jiang et al., 2025): This benchmark is specifically designed to assess how well Large Multimodal Models (LMMs) can learn evolving knowledge without experiencing catastrophic forgetting. In the context of KORE's experiments, knowledge is injected as image-text pairs, and evaluation questions are derived from the accompanying text. EvOKE reveals the limitations of current methods and highlights the severity of catastrophic forgetting.

5.1.2. Knowledge Retention Evaluation

To evaluate knowledge retention (preserving old knowledge), fine-tuned LMMs are assessed on 12 benchmarks spanning 7 distinct capability dimensions. The evaluation settings follow VLMEvalKit (Duan et al., 2024).

Comprehensive Evaluation (COM):
- MME (Fu et al., 2023): A benchmark for holistic evaluation of LMMs' perception and cognition across various tasks, primarily using straightforward question-answer pairs.
- MMBench (Liu et al., 2024c): A cross-lingual benchmark featuring over 3,000 bilingual multiple-choice questions across 20 skill dimensions, from visual recognition to abstract reasoning.
Optical Character Recognition (OCR):
- SEEDBench2 Plus (Li et al., 2024): Benchmarks LMMs on interpreting text-rich visuals (e.g., charts, web layouts) using 2,300 multiple-choice questions where integrating textual and visual information is crucial.
- OCRVQA (Mishra et al., 2019): Evaluates a model's ability to answer questions by reading text within images, focusing on tasks where OCR is essential.
Multidisciplinary Reasoning (M-DIS):
- ScienceQA (Lu et al., 2022): Evaluates scientific reasoning through a large-scale multimodal benchmark with curriculum-based questions, diagrams, and provided lectures/explanations to encourage complex reasoning.
- MMMU (Yue et al., 2024): Evaluates LMMs on college-level, multimodal questions requiring expert knowledge across six disciplines and 30 image formats.
Instruction Following (INS):
- MIA-Bench (Qian et al., 2024): A benchmark measuring how precisely LMMs can follow complex and multi-layered instructions using 400 distinct image-prompt combinations.
Multi-Turn Multi-Image Dialog Understanding (M-IDU):
- MMDU (Liu et al., 2025): Evaluates LMMs in multi-image, multi-turn conversational scenarios, assessing contextual understanding, temporal reasoning, and coherence.
Mathematical Reasoning (MAT):
- MathVista (Lu et al., 2024): Benchmarks mathematical reasoning of foundation models in visual contexts, aggregating 6,141 problems from 31 datasets requiring visual analysis and compositional logic.
- MathVision (Wang et al., 2025a): A challenging dataset of 3,040 visually-presented problems from math competitions, categorized by mathematical areas and difficulty tiers.
Hallucination (HAL):
- POPE (Li et al., 2023): Evaluates object hallucination (describing non-existent objects) in LMMs using a polling-based questioning strategy.
- HallusionBench (Guan et al., 2024): A diagnostic suite for entangled language hallucination and visual illusion in LMMs, using 346 images and 1,129 structured questions.

5.1.3. KORE-74K Dataset

KORE-74K is a new dataset constructed by KORE-AUGMENTATION using the original knowledge from EvOKE. It contains 74,734 total data points, comprising multi-round dialogue data (9,422, 12.6%), visual recognition data (9,422, 12.6%), image caption data (9,422, 12.6%), and VQA data (46,468, 62.2%). It includes 75,710 rounds of dialogue and 65,312 unique images.

5.2. Evaluation Metrics

5.2.1. Knowledge Adaptation Metrics

For open-domain question answering tasks (e.g., EvOKE), two key metrics are used:

Cover Exact Match (CEM):
- Conceptual Definition: CEM assesses whether the entire ground truth answer is perfectly contained within the model's generated prediction. It is a strict metric that requires the model to produce a response that includes all parts of the correct answer.
- Mathematical Formula: $ CEM = \begin{cases} 1, & y_q \subseteq \hat{Y} \\ 0, & \text{otherwise} \end{cases} $
- Symbol Explanation:
  - $y_q$ : The ground truth answer string.
  - $\hat{Y}$ : The text generated by the model.
  - $y_q \subseteq \hat{Y}$ : Indicates that the ground truth answer $y_q$ is a substring of, or completely contained within, the generated text $\hat{Y}$ .
  - 1: If the condition is met (exact match).
  - 0: Otherwise (no exact match).
F1-Score (F1):
- Conceptual Definition: The F1-Score measures the word-level overlap between the predicted answer and the ground truth answer. It is the harmonic mean of Precision and Recall, providing a balanced measure when the number of relevant items (words in the ground truth) and retrieved items (words in the prediction) might be uneven.
- Mathematical Formula: First, define the overlap function $\mathcal{U}$ : $ \mathcal { U } ( \hat { Y } , y _ { q } ) = \sum _ { t \in \mathcal { W } ( y _ { q } ) } \mathbf { 1 } [ t \in \mathcal { W } ( \hat { Y } ) ] $ Then, Precision $\mathcal{P}$ is: $ \mathcal { P } ( \hat { Y } , Y ) = \frac { \mathcal { U } ( \hat { Y } , y _ { q } ) } { \vert \mathcal { W } ( \hat { Y } ) \vert } $ And Recall $\mathcal{R}$ is: $ \mathcal { R } ( \hat { Y } , Y ) = \frac { \mathcal { U } ( \hat { Y } , y _ { q } ) } { \vert \mathcal { W } ( Y ) \vert } $ Finally, the F1-Score is: $ F1 = 2 \times \frac{\mathcal{P} \times \mathcal{R}}{\mathcal{P} + \mathcal{R}} $
- Symbol Explanation:
  - $\mathcal{W}(y_q)$ : The set of words in the ground truth answer $y_q$ .
  - $\mathcal{W}(\hat{Y})$ : The set of words in the model's predicted answer $\hat{Y}$ .
  - $t$ : A token (word) from the set $\mathcal{W}(y_q)$ .
  - $\mathbf{1}[\cdot]$ : The indicator function, which is 1 if the condition inside is true, and 0 otherwise.
  - $\mathcal{U}(\hat{Y}, y_q)$ : The number of unique words from the ground truth answer that are also present in the predicted answer.
  - $|\mathcal{W}(\hat{Y})|$ : The total number of words in the predicted answer.
  - $|\mathcal{W}(Y)|$ : The total number of words in the ground truth answer; $Y$ here denotes the same ground truth answer as $y_q$ .
  - $\mathcal{P}(\hat{Y}, Y)$ : Precision, the fraction of predicted words that are correct.
  - $\mathcal{R}(\hat{Y}, Y)$ : Recall, the fraction of ground truth words that were successfully predicted.
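
The two adaptation metrics can be sketched in a few lines of Python. This is an illustrative reading of the formulas above, not the benchmark's official scorer: the `normalize` helper (lowercasing and whitespace collapsing, with no punctuation stripping) and the function names are assumptions, and words are treated as sets, following the definition of $\mathcal{W}(\cdot)$.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def cover_exact_match(prediction: str, answer: str) -> int:
    """Return 1 if the ground truth answer appears inside the prediction, else 0."""
    return int(normalize(answer) in normalize(prediction))


def f1_score(prediction: str, answer: str) -> float:
    """Word-level F1 between prediction and ground truth, using word sets."""
    pred_words = set(normalize(prediction).split())
    gold_words = set(normalize(answer).split())
    overlap = len(gold_words & pred_words)  # U(Y_hat, y_q)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


prediction = "The capital of France is Paris"
print(cover_exact_match(prediction, "Paris"), f1_score(prediction, "Paris"))
print(cover_exact_match(prediction, "London"), f1_score(prediction, "London"))
```

In the first call the answer is fully covered, so CEM is 1, but F1 is only 2/7 because five of the six predicted words are not in the ground truth. This shows why the two metrics are reported together: CEM rewards coverage, while F1 also penalizes verbose predictions.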

5.2.2. Knowledge Retention Metrics

For knowledge retention, the paper uses a variety of existing benchmarks, each with its own established evaluation protocols and metrics (e.g., accuracy, specific scores for hallucination). The paper states it follows the settings of VLMEvalKit (Duan et al., 2024) for these evaluations. While specific formulas for each of these 12 benchmarks are not provided in the paper, their conceptual definitions (as described in Section 5.1.2) explain what aspects of model performance they quantify. The results are typically reported as scores (e.g., percentage accuracy) where higher is better.

5.3. Baselines

KORE was compared against a comprehensive set of baseline methods:

Full Fine-Tuning (Full-FT): Updates all model weights on the new EvOKE dataset. Represents the most direct but computationally expensive approach.
LoRA (Hu et al., 2022): A Parameter-Efficient Fine-Tuning (PEFT) method that updates a small number of low-rank matrices while freezing the original weights. It is resource-efficient but can suffer from catastrophic forgetting.
Replay: Implemented via LoRA, this method mixes a fixed quantity (10% of EvOKE's size) of randomly sampled data from the LMM's pre-training corpus with the new EvOKE data. This is a common continual learning strategy to mitigate catastrophic forgetting.
EWC (Elastic Weight Consolidation) (Kirkpatrick et al., 2017): A parameter regularization method from continual learning that slows down updates to parameters deemed important for prior tasks by imposing a quadratic constraint based on the Fisher Information Matrix.
LwF (Learning without Forgetting) (Li & Hoiem, 2017b): Another continual learning method that uses knowledge distillation to preserve old knowledge by ensuring the new model's predictions on new data align with the old model's outputs.
MoELoRA (Luo et al., 2024): Combines Mixture-of-Experts (MoE) with contrastive learning for PEFT, specializing experts for different data types and using contrastive objectives to guide collaboration, aiming to reduce catastrophic forgetting.
O-LoRA (Wang et al., 2023): An orthogonal subspace-based method for continual learning that allocates independent, orthogonal parameter subspaces for each task, constraining updates to prevent interference and mitigate catastrophic forgetting.
SEFE (Superficial and Essential Forgetting Eliminator) (Chen et al., 2025): A method that tackles multimodal catastrophic forgetting by separately addressing superficial forgetting of style and essential forgetting of knowledge through a tailored training strategy.

These baselines cover a range of fine-tuning (full, PEFT) and continual learning approaches, making the comparison comprehensive.

5.4. Training Parameters

The following hyperparameter settings for model training are taken from Table 7 of the original paper:

LLaVA-v1.5 (7B)
Rank 235	Optimizer AdamW	Deepspeed Zero3	Epochs 6	Vision Select Layer -2
Weight Decay 0	Warmup Ratio 0.03	LR Schedule cosine decay	Learning Rate 2 × 10^-4	Batch Size 54
LLaVA-v1.5 (13B)
Rank 235	Optimizer AdamW	Deepspeed Zero3	Epochs 6	Vision Select Layer -2
Weight Decay 0	Warmup Ratio 0.03	LR Schedule cosine decay	Learning Rate 2 × 10^-4	Batch Size 32
Qwen2.5-VL (7B)
Rank 274	Optimizer AdamW	Deepspeed Zero3	Epochs 6	Image Max Pixels 262144
Grad Accum Steps 8	Warmup Ratio 0.1	LR Schedule cosine decay	Learning Rate 2 × 10^-4	Batch Size 24

Explanation of Key Parameters:

Rank: Refers to the rank parameter in LoRA (Low-Rank Adaptation), which determines the dimensionality of the low-rank matrices used for fine-tuning. A higher rank means more trainable parameters.
Optimizer AdamW: A variant of the Adam optimizer that includes weight decay regularization to prevent overfitting.
Deepspeed Zero3: A memory optimization technique from DeepSpeed that partitions optimizer states, gradients, and parameters across GPUs, enabling the training of very large models.
Epochs: The number of times the entire training dataset is passed forward and backward through the neural network.
Vision Select Layer: Specifies which layer in the vision component of the LMM is targeted or selected for certain operations (e.g., feature extraction or LoRA application). "-2" often means the second to last layer.
Weight Decay: A regularization technique that penalizes large weights to prevent overfitting. A value of 0 means no weight decay is applied for LLaVA models.
Warmup Ratio: The proportion of training steps during which the learning rate linearly increases from a small value to the initial learning rate.
LR Schedule cosine decay: A learning rate scheduler that gradually decreases the learning rate following a cosine curve during training.
Learning Rate: The step size at each iteration while moving towards a minimum of the loss function.
Batch Size: The number of training samples utilized in one iteration.
Image Max Pixels: The maximum number of pixels an image can have, likely for Qwen2.5-VL, indicating a constraint on input image size.
Grad Accum Steps: Gradient Accumulation Steps. The number of mini-batches over which gradients are accumulated before a parameter update is performed. This effectively increases the effective batch size without requiring more GPU memory.
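
To make the interaction between Warmup Ratio and the cosine decay schedule concrete, here is a small sketch of one common formulation: linear warmup from zero to the peak learning rate, followed by cosine decay to zero. The paper only lists the hyperparameter values, so the exact scheduler behavior, the function name `lr_at_step`, and the step count used below are assumptions.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Learning rate at a given optimizer step under linear warmup + cosine decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak learning rate down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 15, 30, 500, 1000):
    print(s, lr_at_step(s, total))
```

With a warmup ratio of 0.03 and 1000 steps, the first 30 steps ramp up linearly and the remaining 970 follow the cosine curve, so the peak value of 2 × 10^-4 is reached exactly once, at the end of warmup.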

5.5. Experiment Resources

All training experiments were conducted using 4 NVIDIA H100 GPUs (each with 96 GiB memory). All evaluation experiments were performed on systems equipped with 4 NVIDIA A100 PCIe GPUs (each with 40 GiB memory).

6. Results & Analysis

This section presents and analyzes the experimental results, demonstrating the effectiveness of KORE in both knowledge adaptation and knowledge retention across various LMMs and settings.

6.1. Core Results Analysis

The main results comparing KORE with eight baseline methods on LLaVA-v1.5 (7B) are presented in Table 1. Performance is measured by CEM and F1-Score on EvOKE (for knowledge adaptation) and an average score across multiple knowledge retention benchmarks (COM, OCR, M-DIS, INS, M-IDU, MAT, HAL). The Avg score is the mean of the separate averages for adaptation and retention.

The following are the results from [Table 1] of the original paper:

Method	#Params	EvOKE CEM ↑	EvOKE F1 ↑	COM ↑	OCR ↑	M-DIS ↑	INS ↑	M-IDU ↑	MAT ↑	HAL ↑	Avg ↑
LLaVA-v1.5 (7B)	—	—	—	65.61	45.59	49.22	66.33	26.37	19.33	54.32	—
Full-FT	6,759M	18.02	15.17	43.55	21.55	45.67	25.25	13.03	18.32	16.09	23.23
LoRA	340M	15.23	18.31	48.96	27.01	43.79	29.66	13.70	18.02	41.38	24.28
Replay	340M	11.36	17.98	59.72	37.98	48.64	62.33	19.31	19.17	51.67	28.68
EWC	340M	15.49	19.42	49.42	32.88	45.46	29.79	13.36	18.00	43.50	25.33
LwF	340M	14.58	19.99	53.14	28.77	43.41	36.19	13.68	18.22	44.18	25.61
MoELoRA	340M	6.45	12.20	60.79	38.79	48.27	35.03	17.85	19.79	49.99	23.98
O-LoRA	340M	6.44	12.08	61.47	40.91	48.07	34.85	17.28	19.87	51.12	24.17
SEFE	340M	13.38	16.88	42.06	20.43	40.17	17.73	13.25	18.20	39.30	22.54
KORE (r=235)	340M	30.65	41.26	52.41	40.98	48.68	38.54	16.58	18.59	51.75	37.09
KORE (r=256)	369M	31.05	41.32	52.48	39.96	48.96	60.02	23.18	18.09	51.50	39.11

Obs 1: KORE enables accurate adaptation for effectively injecting new knowledge. $KORE (r=235)$ (the standard KORE configuration) significantly outperforms all baselines on EvOKE (knowledge adaptation). It achieves CEM of 30.65 and F1-Score of 41.26. This represents an improvement of 12.63 in CEM and 21.27 in F1-Score over the best baseline (Full-FT for CEM and LwF for F1-Score). Notably, KORE's F1-Score is more than double that of LoRA (18.31), highlighting the effectiveness of KORE-AUGMENTATION in enabling the model to truly learn and generalize new knowledge, rather than just memorizing it.

Obs 2: KORE enables powerful retention for effectively preserving old knowledge. $KORE (r=235)$ shows strong performance in knowledge retention across various benchmarks. It outperforms LoRA on all seven retention dimensions and scores higher than every baseline on the OCR, M-DIS, and HAL dimensions, while trailing only Replay on INS. Although its performance on INS and M-IDU is suboptimal compared to Replay (which mixes old data), the paper attributes this to the number of trainable parameters (rank) and the source of the covariance matrix. When rank is increased to 256 ( $KORE (r=256)$ ), KORE significantly improves on INS (60.02, trailing Replay by only 2.31) and M-IDU (23.18, outperforming Replay by 3.87), demonstrating that with appropriate configuration, KORE can achieve powerful retention even in these areas.

Obs 3: KORE achieves remarkable holistic performance by harmonizing the dual objectives of knowledge injection. The Avg score, which balances knowledge adaptation and retention, clearly shows KORE's superiority. $KORE (r=235)$ achieves an Avg score of 37.09, an 8.41 improvement over the strongest baseline (Replay at 28.68). $KORE (r=256)$ further boosts this to 39.11. This demonstrates KORE's ability to effectively manage the inherent trade-off between injecting new knowledge and preserving old knowledge, which is a critical challenge in continual learning for LMMs.
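
As a worked check of how the Avg column is built, the sketch below recomputes it for the $KORE (r=235)$ row of Table 1. The function name `holistic_scores` is hypothetical; the inputs are the values reported in the table.

```python
from statistics import mean


def holistic_scores(cem: float, f1: float, retention_dims: list[float]):
    """Return (K.A, K.R, Avg): adaptation mean, retention mean, and their mean."""
    ka = mean([cem, f1])          # knowledge adaptation: mean of CEM and F1
    kr = mean(retention_dims)     # knowledge retention: mean over the 7 dimensions
    return ka, kr, mean([ka, kr])


# KORE (r=235): CEM, F1, then COM, OCR, M-DIS, INS, M-IDU, MAT, HAL.
ka, kr, avg = holistic_scores(
    30.65, 41.26, [52.41, 40.98, 48.68, 38.54, 16.58, 18.59, 51.75]
)
print(f"K.A={ka:.3f}  K.R={kr:.3f}  Avg={avg:.3f}")
```

The results agree, up to rounding, with the K.A (35.96), K.R (38.22), and Avg (37.09) values reported for KORE in Table 3, which confirms that adaptation and retention are weighted equally regardless of how many benchmarks each side contains.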

6.2. Detailed Results on Adaptation & Retention

6.2.1. Fine-grained Knowledge Adaptation

KORE's performance was further analyzed across 20 fine-grained News and Entity types from EvOKE (Figure 5).

As seen in Figure 5, KORE consistently outperforms all baselines across a wide spectrum of fine-grained knowledge types. This highlights KORE-AUGMENTATION's ability to build comprehensive and structured knowledge, allowing the model to adapt robustly to diverse new information.

The following are the results from [Table 9] of the original paper:

Method	News Avg		News PO		News SP		News BU		News HE		Entity Avg		Entity CE		Entity FI		Entity AL		Entity WR
	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑	CEM ↑	F1 ↑
Full-FT	21.35	16.34	12.92	10.99	22.49	20.88	27.31	20.95	19.84	16.47	14.37	13.88	13.11	16.93	12.39	13.16	12.17	7.66	20.34	8.43
LoRA	17.72	19.42	10.54	12.96	19.11	21.50	20.66	24.03	17.81	23.76	12.51	17.09	12.20	21.19	10.57	15.82	10.72	8.72	18.64	12.94
Replay	13.98	19.43	7.61	13.16	15.96	20.69	16.05	22.40	15.38	24.21	8.48	16.39	9.40	18.78	10.34	15.60	3.77	10.79	4.55	8.23
EWC	17.86	21.10	10.45	14.81	19.83	23.02	19.00	24.57	17.41	23.88	12.88	17.58	14.53	22.07	12.16	16.91	10.72	8.13	15.25	17.69
LwF	17.05	21.43	9.62	13.99	19.83	23.66	18.63	25.82	19.03	26.20	11.88	18.40	12.45	21.64	12.39	17.01	9.28	11.11	10.17	17.10
MoELoRA	9.23	14.86	3.39	8.72	6.77	11.77	12.36	18.92	10.53	20.60	3.40	9.28	2.95	10.32	4.43	8.96	3.19	5.22	10.17	14.07
O-LoRA	9.21	14.68	3.67	8.52	7.01	12.23	12.55	18.98	11.74	20.68	3.40	9.22	3.10	10.51	4.20	8.28	3.19	5.35	8.47	12.37
SEFE	16.66	18.44	10.82	12.64	17.78	20.92	20.30	23.23	17.00	21.55	9.79	15.18	10.77	20.13	9.09	12.01	5.51	7.47	13.56	13.87
KORE	34.74	42.96	23.83	32.31	46.19	50.38	34.69	45.74	33.20	45.23	26.17	39.39	27.79	42.61	26.93	34.05	16.52	29.54	28.81	43.05

Obs 4: KORE demonstrates superior performance across a wide spectrum of fine-grained knowledge. Table 9 (Appendix E.1) provides detailed numerical results confirming that KORE consistently achieves the highest CEM and F1-Score across all fine-grained News categories (Politics, Sports, Business, Health) and Entity categories (Celebrity, Film, Album, Written Work). This comprehensive superiority underscores the robustness and effectiveness of KORE-AUGMENTATION in enabling deep and generalized learning of new knowledge.

6.2.2. Detailed Knowledge Retention

Table 2 provides a detailed breakdown of knowledge retention performance for each of the 12 benchmarks.

The following are the results from [Table 2] of the original paper:

Method	COM		OCR		M-DIS		INS	M-IDU	MAT		HAL		Avg
Method	MME ↑	MMB ↑	SEEDB2P ↑	OCRVQA ↑	SQA ↑	MMMU ↑	MIAB ↑	MMDU ↑	MathT ↑	MathI ↑	POPE ↑	HallB ↑	Avg ↑
LLaVA-v1.5 (7B)	66.63	64.60	38.78	52.41	69.83	28.60	66.33	26.37	25.50	13.16	86.87	21.76	46.74
Full-FT	34.17	52.92	31.44	11.65	67.13	24.20	25.25	13.03	24.70	11.94	74.22	9.27	31.66
LoRA	44.06	53.87	30.22	23.80	66.18	21.40	29.66	13.70	23.20	12.83	73.97	8.78	33.47
Replay	58.96	60.48	38.34	37.73	68.77	28.50	62.33	19.31	25.20	13.13	85.44	17.90	43.00
EWC	48.57	50.26	33.60	32.16	65.71	25.20	29.79	13.36	23.30	12.76	76.22	10.77	35.14
LwF	50.87	55.41	32.02	25.52	66.21	20.60	36.19	13.68	24.40	12.04	79.23	9.13	35.44
MoELoRA	58.26	63.32	37.42	40.17	69.04	27.50	35.03	17.85	27.80	11.78	80.70	19.29	40.51
O-LoRA	60.30	62.63	37.90	43.91	68.84	27.30	34.85	17.28	28.20	11.55	81.46	20.78	41.25
SEFE	36.10	48.02	22.79	18.07	65.03	15.30	17.73	13.25	26.00	10.39	72.81	5.79	29.27
KORE (r=235)	49.84	54.98	37.73	44.24	68.06	29.30	38.54	16.58	25.10	12.09	80.99	22.51	40.00
KORE (r=256)	50.06	54.90	36.89	43.03	68.51	29.40	60.02	23.18	24.70	11.48	80.77	22.23	42.10

Obs 5: KORE achieves competitive knowledge retention. $KORE (r=235)$ outperforms LoRA in overall retention (40.00 vs. 33.47 Avg). It also surpasses several continual learning methods like EWC (35.14), LwF (35.44), and SEFE (29.27). Among the baselines, KORE achieves the top scores on OCRVQA (44.24), MMMU (29.30), and HallusionBench (22.51). When its rank is increased to 256 ( $KORE (r=256)$ ), it closely matches Replay (the strongest retention baseline) on INS (60.02 vs. 62.33) and exceeds it on M-IDU (23.18 vs. 19.31), demonstrating its strong retention capabilities. The KORE-CONSTRAINT component effectively minimizes interference with previous knowledge.

6.2.3. Specific Knowledge-Oriented Constraints

The paper investigates whether KORE can preserve specific knowledge without compromising other abilities by constructing specific constraints (sampling 256 examples per benchmark across four dimensions).

The following are the results from [Table 3] of the original paper:

Method	K.A↑	K.R ↑	Avg ↑
KORE	35.96	38.22	37.09
$KORE_{MME}$	34.46	43.16	38.81
$KORE_{OCRVQA}$	34.85	42.21	38.53
$KORE_{MathT}$	35.20	42.87	39.03
$KORE_{HallB}$	34.96	42.09	38.52

Obs 6: Specific constraints enhance knowledge retention and overall performance. Table 3 shows that applying specific knowledge-oriented constraints ( $KORE_{MME}$ , $KORE_{OCRVQA}$ , $KORE_{MathT}$ , $KORE_{HallB}$ ) leads to a slight reduction in Knowledge Adaptation (K.A) scores but a substantial improvement in Knowledge Retention (K.R) and overall Avg performance compared to the general KORE (r=235) setup. For instance, $KORE_{MME}$ increases K.R from 38.22 to 43.16 and Avg from 37.09 to 38.81. Figure 6 visually confirms that these specific constraints enhance targeted knowledge retention, with $KORE_{MME}$ showing a 7.17 gain on MME. This demonstrates KORE's flexibility for tailored knowledge preservation according to specific needs.

6.3. Various LMM Scales and Architectures

The paper further evaluates KORE's universality and robustness on larger (LLaVA-v1.5-13B) and architecturally distinct (Qwen2.5-VL-7B) LMMs.

The following are the results from [Table 4] of the original paper:

Methods	EvOKE CEM ↑	EvOKE F1 ↑	COM ↑	OCR ↑	M-DIS ↑	INS ↑	M-IDU ↑	MAT ↑	HAL ↑	Avg ↑
LLaVA-v1.5 (13B)
Vanilla	—	—	66.86	51.12	52.70	66.04	33.93	19.64	56.77	—
LoRA	16.26	22.83	60.57	32.58	43.72	23.26	17.43	15.82	38.08	25.21
Replay	12.05	20.21	65.81	47.51	48.42	61.04	24.62	19.55	54.16	30.70
KORE	32.89	44.47	59.35	45.96	51.39	65.10	26.84	20.31	40.52	41.44
Qwen2.5-VL (7B)
Vanilla	—	—	81.18	70.32	65.35	78.46	61.25	47.69	66.96	—
LoRA	14.56	14.01	52.54	64.54	22.35	21.39	23.25	13.52	41.38	24.21
Replay	11.73	18.51	78.54	69.17	65.26	70.20	50.72	42.74	67.48	39.28
KORE	22.91	31.36	56.60	67.74	65.48	70.51	45.02	43.72	58.57	42.68

Obs 7: KORE shows enhanced superiority on a larger-scale LMM. On LLaVA-v1.5 (13B), KORE achieves CEM of 32.89 and F1-Score of 44.47, significantly surpassing LoRA (16.26 CEM, 22.83 F1-Score). Its overall Avg score of 41.44, which balances adaptation and retention, is a 10.74 improvement over Replay (30.70). This confirms KORE's strong potential for larger LMMs.

Obs 8: KORE's effectiveness is not architecture-specific. On Qwen2.5-VL (7B), KORE again outperforms LoRA by a large margin on EvOKE (22.91 vs. 14.56 CEM and 31.36 vs. 14.01 F1-Score, gains of 8.35 and 17.35 respectively). It also surpasses Replay with an Avg score of 42.68 (vs. 39.28). The improvement margins are slightly smaller compared to LLaVA-v1.5, which the paper attributes to Qwen2.5-VL's already robust knowledge system (honed via three-stage training), reducing the marginal gains from knowledge injection. Nevertheless, KORE maintains superior performance across different LMM architectures, showcasing its universality.

6.4. Ablation Studies

Ablation studies were conducted to validate the effectiveness of KORE's design components: rank, W/o Augmentation, W/o Constraint, and W/o Frozen Matrix A.

The following are the results from [Table 5] of the original paper:

Setting	EvOKE CEM ↑	EvOKE F1 ↑	COM ↑	OCR ↑	M-DIS ↑	INS ↑	M-IDU ↑	MAT ↑	HAL ↑	Avg ↑
KORE	30.65	41.26	52.41	40.98	48.68	38.54	16.58	18.59	51.75	37.09
W/o Augmentation	10.83	18.31	59.96	40.42	47.13	32.53	16.00	19.71	49.50	26.23
W/o Constraint	33.93	43.71	46.39	32.38	46.31	32.70	15.38	19.12	46.47	36.46
W/o Frozen Matrix A	31.97	41.72	50.73	39.56	48.37	35.30	16.44	19.07	49.91	36.95

Obs 9: Larger rank enhances KORE's performance. Figure 7 and Table 15 (Appendix E.4.1) show that KORE's performance in both knowledge adaptation and retention increases consistently with rank, that is, with the number of trainable parameters. Even at rank 64, KORE (Avg 31.81) still surpasses Replay (28.68) while using less than half of Replay's parameters. This indicates that KORE benefits from a larger trainable parameter budget.

Obs 10: Ablation studies reveal the effectiveness of KORE's design. Table 5 clearly validates the contribution of each KORE component:

W/o Augmentation: Removing KORE-AUGMENTATION (K.A CEM 10.83, F1 18.31) is particularly detrimental to knowledge adaptation, causing a significant drop (19.82 CEM and 22.95 F1-Score decrease compared to full KORE). This emphasizes the critical role of knowledge-oriented augmentations in enabling accurate learning of new information.
W/o Constraint: Removing KORE-CONSTRAINT leads to an Avg score of 36.46, slightly lower than KORE (37.09), but significantly impacting knowledge retention benchmarks (e.g., COM 46.39 vs. 52.41, OCR 32.38 vs. 40.98). This confirms that KORE-CONSTRAINT is essential for preserving old knowledge. Interestingly, it slightly improves K.A metrics (e.g., CEM 33.93 vs. 30.65), suggesting a trade-off where explicit constraints for retention can slightly limit adaptation when not perfectly balanced.
W/o Frozen Matrix A: Letting matrix $A$ be fine-tuned instead of freezing it slightly impairs knowledge retention and gives an Avg of 36.95 (vs. 37.09 for full KORE). The gap is small but consistent across the retention benchmarks, indicating that keeping $A$ fixed (so that updates stay in the null space) helps preserve old knowledge.

The results from Table 18 (Appendix E.4.2) further show that each modification of KORE's design degrades overall knowledge retention, which supports the full design. Table 19 (Appendix E.4.2) shows that W/o Constraint yields better knowledge adaptation across fine-grained knowledge types, which the analysis attributes to KORE-AUGMENTATION acting without the restraining effect of the constraint.
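The summary columns used in this section can be checked with simple arithmetic. The sketch below assumes, based only on the reported numbers, that K.A is the mean of CEM and F1, K.R is the mean of the seven retention benchmarks, and Avg is the mean of K.A and K.R. This aggregation rule is inferred rather than quoted from the paper, and the names `TABLE5` and `aggregate` are illustrative.

```python
# Rows copied from Table 5:
# (CEM, F1, [COM, OCR, M-DIS, INS, M-IDU, MAT, HAL], reported Avg)
TABLE5 = {
    "KORE": (30.65, 41.26, [52.41, 40.98, 48.68, 38.54, 16.58, 18.59, 51.75], 37.09),
    "W/o Augmentation": (10.83, 18.31, [59.96, 40.42, 47.13, 32.53, 16.00, 19.71, 49.50], 26.23),
    "W/o Constraint": (33.93, 43.71, [46.39, 32.38, 46.31, 32.70, 15.38, 19.12, 46.47], 36.46),
    "W/o Frozen Matrix A": (31.97, 41.72, [50.73, 39.56, 48.37, 35.30, 16.44, 19.07, 49.91], 36.95),
}


def aggregate(cem, f1, retention):
    """Assumed aggregation: K.A = mean(CEM, F1), K.R = mean(retention), Avg = mean(K.A, K.R)."""
    ka = (cem + f1) / 2
    kr = sum(retention) / len(retention)
    return ka, kr, (ka + kr) / 2


for name, (cem, f1, retention, reported) in TABLE5.items():
    ka, kr, avg = aggregate(cem, f1, retention)
    print(f"{name:20s} K.A={ka:.2f} K.R={kr:.2f} Avg={avg:.2f} (reported Avg {reported})")
```

Under this assumption every recomputed Avg agrees with Table 5 to within rounding. The split also makes the trade-off explicit: removing the constraint raises K.A while lowering K.R.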

6.5. Comparison with General Augmentation Methods

This section validates the claim that KORE-AUGMENTATION is superior to general augmentation methods (Section 3.1).

The following are the results from [Table 6] of the original paper:

Method	K.A ↑	K.R ↑	Avg ↑
KORE-AUGMENTATION	38.82	35.78	36.46
Augmentation for Text
Knowledge-Aware	20.29	34.86	27.38
Knowledge-Agnostic	15.60	35.71	25.49
Augmentation for Images
Knowledge-Aware	18.33	34.02	25.86
Knowledge-Agnostic	18.33	32.09	25.25

Obs 11: KORE-AUGMENTATION is superior to general augmentation methods. Table 6 compares KORE-AUGMENTATION with generic text augmentation and generic image augmentation, each in a knowledge-aware and a knowledge-agnostic variant. The KORE-AUGMENTATION row appears to correspond to the W/o Constraint ablation in Table 5 rather than to full KORE: its K.A of 38.82 is the mean of that setting's CEM (33.93) and F1 (43.71), and its Avg of 36.46 matches, whereas full KORE has K.A 35.96 and K.R 38.22 in Table 3. KORE-AUGMENTATION outperforms all general augmentation methods on K.A and Avg by a wide margin and is marginally ahead on K.R (35.78 vs. 35.71 at best). It achieves an 18.53-point improvement in K.A over the strongest baseline (20.29 for knowledge-aware text augmentation). This indicates that structured, knowledge-oriented augmentation is far more effective for adaptation than surface-level variations of text or images. Tables 20 and 21 (Appendix E.5) detail this further, showing KORE-AUGMENTATION ahead in both knowledge retention and adaptation across fine-grained knowledge types.

6.6. Convergence Comparison

The paper provides training loss curves (Figure 10) to compare the convergence behavior of various methods.

Figure 10 presents the training loss curves on EvOKE for Full-FT, LoRA, EWC, O-LoRA, SEFE, and KORE. The paper clarifies that KORE is trained on KORE-74K (a larger, augmented dataset), while the other methods train on the original EvOKE dataset.

O-LoRA and SEFE notably fail to fit the EvOKE dataset, exhibiting high training loss.
LoRA, EWC, and Full-FT converge to very low loss values, successfully fitting the EvOKE dataset. However, their poor generalization performance in Table 1 suggests overfitting to the smaller EvOKE dataset.
KORE shows a rapid decrease in loss during the first epoch due to its larger KORE-74K dataset and efficient learning. Crucially, despite converging effectively on its larger training set, KORE also demonstrates strong generalization capabilities for novel knowledge, unlike the overfitting observed in LoRA, EWC, and Full-FT. This implies that KORE's structured augmentation leads to more meaningful learning.

6.7. Case Studies

The paper includes case studies (Figure 11 and Figure 12) to provide qualitative insights into KORE's performance.

Figures 11 and 12 show KORE giving better answers in complex scenarios. For instance, in a "News" case study (Figure 11) about the Nobel Prize in Physics, KORE (on LLaVA-v1.5-7B) provides a more accurate and comprehensive answer (CEM 1, F1 1) than LoRA (CEM 0, F1 0) and Replay (CEM 0, F1 0). Similarly, in an "Entity" case study (Figure 12) about the Bugatti Tourbillon's production limit, KORE accurately states the limit (99 units), while LoRA and Replay provide incorrect or less specific answers. These qualitative results align with the quantitative findings and illustrate KORE's ability to inject new knowledge accurately.
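To make the per-example scores above concrete, here is a minimal sketch of how a cover-exact-match (CEM) score and a token-level F1 score are commonly computed for short answers. The normalization rules, the function names `cem` and `f1`, and the example strings are assumptions for illustration and may differ from the benchmark's actual implementation.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens (assumed normalization)."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def cem(prediction, gold):
    """Return 1 if the gold answer appears as a whole-token span in the prediction, else 0."""
    p = " ".join(normalize(prediction))
    g = " ".join(normalize(gold))
    return int(f" {g} " in f" {p} ")


def f1(prediction, gold):
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    p, g = normalize(prediction), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical examples, not taken from the paper's figures.
print(cem("It is Paris.", "Paris"), f1("It is Paris.", "Paris"))
print(cem("London", "Paris"), f1("London", "Paris"))
```

With this definition, an answer scoring CEM 1 and F1 1 reproduces the gold answer exactly after normalization, while CEM 0 and F1 0 means it shares no tokens with the gold answer.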

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces KORE (KnOwledge-oRientEd augmentations and constraints), a novel and synergistic method designed to effectively inject new knowledge into Large Multimodal Models (LMMs) while simultaneously preserving their vast existing knowledge. KORE tackles the critical trade-off between knowledge adaptation and knowledge retention, a long-standing challenge in continual learning for LMMs.

The method's success stems from two key, inter-dependent components:

KORE-AUGMENTATION: This component automatically transforms individual knowledge items into a profound and structured format, such as multi-round dialogues and diverse instruction tasks. This knowledge-oriented augmentation strategy ensures that the LMM learns new information accurately, deeply, and in a generalized manner, moving beyond superficial data memorization.
KORE-CONSTRAINT: This component safeguards existing knowledge by leveraging the covariance matrix of LMM's linear layer activations to represent prior knowledge patterns. It then initializes a LoRA adapter by projecting original weights into the null space of this covariance matrix, defining a fine-tuning direction that minimally interferes with previously acquired knowledge. This mechanism enables powerful knowledge retention and effectively mitigates catastrophic forgetting.
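The constraint described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the covariance is the uncentered product of activations, the null space is approximated by the eigenvectors with the smallest eigenvalues, and the adapter factors come from an SVD of the projected weights. The function name `null_space_adapter_init` and the toy dimensions are hypothetical.

```python
import numpy as np


def null_space_adapter_init(W, X, rank):
    """Sketch of a null-space adapter initialization.

    W: (d_out, d_in) pretrained weight of a linear layer.
    X: (d_in, n) input activations collected on old-knowledge samples.
    Returns adapter factors B (d_out, rank), A (rank, d_in) and the residual W - B @ A.
    """
    C = X @ X.T                               # (d_in, d_in) uncentered covariance
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    U_null = eigvecs[:, :rank]                # directions the old activations barely use
    W_proj = W @ U_null @ U_null.T            # project W onto that subspace
    U, S, Vt = np.linalg.svd(W_proj, full_matrices=False)
    sqrt_S = np.sqrt(S[:rank])
    B = U[:, :rank] * sqrt_S                  # trainable factor
    A = sqrt_S[:, None] * Vt[:rank]           # factor kept frozen in this sketch
    return B, A, W - B @ A


# Toy check: activations live in a 5-dim subspace of an 8-dim input space,
# so an exact 3-dim null space exists.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5)) @ rng.normal(size=(5, 200))
W = rng.normal(size=(6, 8))
B, A, W_res = null_space_adapter_init(W, X, rank=3)

B_trained = B + rng.normal(size=B.shape)      # stand-in for fine-tuning B only
drift = np.abs((W_res + B_trained @ A) @ X - W @ X).max()
print(f"max output change on old activations: {drift:.2e}")
```

In this toy setting the rows of `A` are orthogonal to the old activations, so any update to `B` leaves the layer's outputs on those activations unchanged. That is one way to read the W/o Frozen Matrix A ablation: once `A` is trained, this guarantee no longer holds. On real models the covariance is rarely exactly rank-deficient, so the preservation is approximate.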

Extensive experimental validation on various LMMs (LLaVA-v1.5-7B, LLaVA-v1.5-13B, Qwen2.5-VL-7B) demonstrates KORE's superior performance in new knowledge injection and its efficacy in mitigating catastrophic forgetting, outperforming numerous state-of-the-art baselines. KORE also exhibits universality across different LMM scales and architectures. Furthermore, its capability for specific knowledge-oriented constraints allows for tailored knowledge preservation, offering high flexibility for specialized applications.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

Reliance on GPT-4o for Augmentation: The KORE-AUGMENTATION process heavily relies on GPT-4o for generating dialogues, summaries, and quadruplets. This dependency introduces potential risks of hallucination (generating factually incorrect or nonsensical information) into the augmented data. Future work could explore methods to reduce or verify the outputs of GPT-4o.
Scope of Augmentation: The current augmentation primarily focuses on enhancing individual knowledge units. More complex and structured augmentation strategies could be explored, such as leveraging knowledge graphs or knowledge forests (Ji et al., 2021; Chen et al., 2020), potentially in combination with reinforcement learning, to build even richer and more interconnected knowledge structures.
Computational Cost of Covariance Matrix Extraction: Extracting covariance matrices from all linear layers in LMMs can be computationally expensive. Future work aims to reduce this resource consumption by identifying the most critical layers or subsets of parameters for covariance computation, focusing only on those that are most salient for knowledge representation.
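To give a sense of where the covariance cost comes from, the sketch below accumulates a layer's covariance batch by batch and estimates the memory of one covariance matrix. It is an illustration under assumptions: the class `CovarianceAccumulator`, the helper `covariance_bytes`, and the example width of 4096 are not taken from the paper.

```python
import numpy as np


class CovarianceAccumulator:
    """Streaming accumulation of an uncentered activation covariance for one linear layer."""

    def __init__(self, dim):
        self.cov = np.zeros((dim, dim), dtype=np.float64)
        self.count = 0

    def update(self, acts):
        """acts: (batch, dim) input activations of the layer for one batch."""
        self.cov += acts.T @ acts
        self.count += acts.shape[0]

    def result(self):
        return self.cov / max(self.count, 1)


def covariance_bytes(dim, bytes_per_value=4):
    """Memory for one dim x dim covariance matrix (float32 by default)."""
    return dim * dim * bytes_per_value


# Streaming over batches matches the one-shot computation.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 16))
acc = CovarianceAccumulator(dim=16)
for batch in np.array_split(acts, 4):
    acc.update(batch)
print(np.allclose(acc.result(), acts.T @ acts / len(acts)))

# One float32 covariance for an input width of 4096 takes 64 MiB.
print(covariance_bytes(4096) / 2**20, "MiB")
```

Because the cost scales with the square of the input width and is paid once per linear layer, restricting the computation to a subset of layers, as the authors suggest, reduces it roughly in proportion to the number of layers skipped.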

7.3. Personal Insights & Critique

KORE presents a compelling and well-motivated approach to a crucial problem in LMM development. The synergistic design, combining knowledge-oriented augmentation with knowledge-oriented constraints, is a notable innovation.

One of the paper's strengths is its emphasis on moving beyond superficial data augmentation. The "knowledge tree" concept for structuring new information, including multi-round dialogues and various instructional tasks, is intuitively powerful. It aligns with the idea that models need to learn not just facts but also how to reason about and apply them in different contexts. This approach to augmentation could be transferable to other domains where robust and generalized learning from new information is critical, especially in continual learning settings for diverse AI agents.

The KORE-CONSTRAINT component, leveraging null space projection based on covariance matrices, provides a theoretically grounded method for knowledge retention. The empirical verification that covariance matrices indeed capture multimodal knowledge and exhibit task-specific patterns is a significant finding. This could inspire further research into understanding and manipulating internal model representations for more precise control over knowledge preservation. The ability to apply specific knowledge-oriented constraints is also highly practical, offering a customizable solution for real-world applications where certain types of knowledge must be rigorously protected.

However, some aspects invite further consideration:

Scalability of KORE-AUGMENTATION: While automated, the reliance on GPT-4o for generating large quantities of high-quality, structured data might still present practical scaling challenges for truly vast amounts of evolving knowledge, especially if quality assurance mechanisms are needed to prevent hallucinations. The computational cost of generating KORE-74K using GPT-4o is not explicitly detailed but could be substantial.
Interpretability of Null Space for Knowledge: While mathematically sound, the mapping between the null space of a covariance matrix and "minimally interfering with previous knowledge" is still somewhat abstract. Further work could explore more interpretable ways to demonstrate what specific aspects of old knowledge are preserved by this method, beyond just benchmark scores.
Generality of Covariance Matrix Representation: The paper assumes that covariance matrices effectively capture previous knowledge. While empirically validated, the extent to which this holds universally across all types of knowledge (e.g., procedural knowledge vs. factual knowledge) or for all LMM architectures could be a subject of deeper theoretical investigation.
Trade-offs in Specific Constraints: Table 3 showed that specific constraints slightly reduce K.A scores while boosting K.R, and the W/o Constraint ablation in Table 5 shows the same tension from the other side. This highlights the inherent trade-off between adaptation and retention. Future work could explore dynamic weighting mechanisms or adaptive constraint strengths to optimize this balance automatically for different scenarios.

The paper's Ethics Statement is also commendable, acknowledging the potential misuse of knowledge injection for propagating false or biased information. This underscores the need for responsible development and deployment of such powerful AI capabilities.

Overall, KORE provides a significant step forward in enabling LMMs to continuously learn and evolve, addressing a fundamental limitation that hinders their deployment in dynamic environments. Its knowledge-centric approach offers a promising direction for future research in continual learning and LMM adaptation.
