KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
TL;DR Summary
KORE enhances knowledge injection in large multimodal models by structured augmentations and constraints, preserving old knowledge via null space projection to mitigate forgetting and enable precise adaptation to new knowledge.
Abstract
Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
1. Bibliographic Information
1.1. Title
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
1.2. Authors
Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, Qing Li.
Their affiliations span several prominent institutions, including the University of Science and Technology of China, State Key Laboratory of General Artificial Intelligence, BIGA Technology, Beijing Institute of Technology, Ludwig Maximilian University of Munich, University of Sydney, and Shandong University. This indicates a collaborative effort from multiple research groups specializing in AI, natural language processing, and multimodal learning.
1.3. Journal/Conference
Published as a preprint on arXiv (arxiv.org/abs/2510.19316). As a preprint, it has not necessarily completed peer review. arXiv is a widely used platform for sharing AI research ahead of formal publication in a conference or journal.
1.4. Publication Year
2025 (published on 2025-10-22 at 07:26:55 UTC).
1.5. Abstract
Large Multimodal Models (LMMs) inherently store vast factual knowledge within their pre-trained weights. However, this knowledge is static and quickly becomes outdated, impeding their ability to continuously acquire new information. Effective knowledge injection is crucial, necessitating two objectives: knowledge adaptation (integrating new knowledge) and knowledge retention (preserving existing knowledge). Current methods often struggle with learning new knowledge effectively and suffer from catastrophic forgetting (the tendency of a neural network to completely and abruptly forget previously learned information upon learning new information).
To address these challenges, the authors propose KORE (KnOwledge-oRientEd augmentations and constraints), a synergistic method for injecting new knowledge into LMMs while preserving old knowledge. Unlike general data augmentation techniques for text or images, KORE automatically transforms individual knowledge items into structured and comprehensive formats, ensuring the model accurately learns and adapts to new information. Simultaneously, KORE stores previous knowledge within the covariance matrix of the LMM's linear layer activations. It then initializes an adapter by projecting the original weights into the null space of this matrix, thereby defining a fine-tuning direction that minimizes interference with prior knowledge and ensures robust retention.
Extensive experiments conducted on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, demonstrate that KORE achieves superior performance in new knowledge injection and significantly mitigates catastrophic forgetting.
1.6. Original Source Link
https://arxiv.org/abs/2510.19316 (preprint)
PDF link: https://arxiv.org/pdf/2510.19316v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the static and limited nature of knowledge in Large Multimodal Models (LMMs). While LMMs, similar to Large Language Models (LLMs), demonstrate remarkable capabilities in storing and recalling world knowledge embedded in their pre-trained weights, this knowledge cannot evolve. In a rapidly changing world, this leads to:
- Outdated Responses: Models provide information that is no longer current.
- Inability for Continuous Acquisition: LMMs cannot naturally learn new information as it emerges.
This limitation hinders the continuous evolution and real-world applicability of these powerful models. Therefore, the ability to effectively inject new knowledge becomes critical. This process involves a dual challenge:
- Knowledge Adaptation: The model must effectively learn and integrate new information.
- Knowledge Retention: The model must simultaneously preserve its vast pre-existing knowledge, avoiding catastrophic forgetting.
Existing methods, such as full fine-tuning (updating all model weights) or Parameter-Efficient Fine-Tuning (PEFT) (updating only a small subset of parameters), often fall short. Full fine-tuning is computationally expensive and prone to overfitting (where a model learns the training data too well, including noise, and performs poorly on unseen data), failing to generalize new knowledge. PEFT methods, while resource-friendly, also suffer from catastrophic forgetting. Continual learning techniques aim to mitigate forgetting but often struggle to strike a balance between acquiring new knowledge and retaining old knowledge, potentially impairing adaptation or leading to irrelevant responses.
The paper's innovative idea is to propose a synergistic method that tackles both knowledge adaptation and retention simultaneously through knowledge-oriented augmentations and knowledge-oriented constraints. The entry point is to move beyond superficial data augmentations and generic parameter regularization by deeply understanding and structuring knowledge for learning, and precisely constraining updates based on the model's internal representation of prior knowledge.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of knowledge injection for LMMs:
- Novel Synergistic Method (KORE): Introduction of KORE, a unified approach that combines Knowledge-Oriented Augmentations and Knowledge-Oriented Constraints. This synergy explicitly addresses the trade-off between knowledge adaptation and retention, a core problem in continuous learning for LMMs.
- Knowledge-Oriented Augmentation (KORE-AUGMENTATION): Development of an automated pipeline that converts individual knowledge items into a deep, structured "knowledge tree" format. This involves generating multi-round dialogues and diverse instruction tasks (e.g., visual recognition, image captioning, VQA) around new knowledge. This augmentation strategy moves beyond superficial data variations to ensure genuine knowledge internalization and generalization, enabling models to reason about new information.
- Knowledge-Oriented Constraint (KORE-CONSTRAINT): A novel constraint mechanism that leverages the covariance matrix of the LMM's linear layer activations to represent previous knowledge. By decomposing this matrix and projecting the original weights into its null space to initialize an adapter (specifically, a LoRA adapter), KORE defines a fine-tuning direction that minimally interferes with pre-existing knowledge, thereby ensuring powerful retention and mitigating catastrophic forgetting.
- Empirical Validation and Superior Performance: Extensive experiments on diverse LMMs (LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B) demonstrate that KORE consistently outperforms state-of-the-art baselines (including Full-FT, LoRA, Replay, and various continual learning methods like EWC, LwF, MoELoRA, O-LoRA, SEFE) in both new knowledge adaptation and old knowledge retention.
- Universality and Robustness: KORE's effectiveness is shown to be largely independent of the specific LMM architecture or scale, performing well across different models.
- Customizable Knowledge Constraints: The method allows for the creation of specific knowledge-oriented constraints by sampling data related to particular benchmarks. This enables tailored knowledge retention, enhancing performance on targeted knowledge types without significantly compromising other abilities.
In summary, KORE offers a robust and flexible solution for enabling LMMs to continuously acquire and retain knowledge, thereby supporting their evolution and broader application in dynamic real-world scenarios.
3. Prerequisite Knowledge & Related Work
This section provides an overview of the fundamental concepts and related research necessary to understand the methodology and contributions of the KORE paper.
3.1. Foundational Concepts
Large Multimodal Models (LMMs)
Large Multimodal Models are advanced artificial intelligence systems that can process and understand information from multiple modalities, typically text and images (and sometimes audio or video). They are built upon the foundation of Large Language Models (LLMs) but extend their capabilities to interpret and generate content that integrates visual information. LMMs learn extensive factual knowledge, common sense, and reasoning abilities during a pre-training phase on massive datasets, storing this knowledge in their billions or trillions of parameters (weights). Examples include LLaVA and Qwen-VL.
Knowledge Injection
Knowledge injection refers to the process of updating a pre-trained model with new factual information or capabilities. This is crucial for LMMs because their pre-trained knowledge is static and can quickly become outdated. Effective knowledge injection aims to achieve two primary goals:
- Knowledge Adaptation: The model's ability to accurately learn and effectively utilize the newly introduced knowledge. This involves not just memorizing facts but also generalizing and reasoning with them.
- Knowledge Retention: The model's ability to preserve its previously acquired knowledge and capabilities while learning new information. Without proper retention mechanisms, LMMs can suffer from catastrophic forgetting.
Catastrophic Forgetting
Catastrophic forgetting, also known as catastrophic interference, is a phenomenon observed in artificial neural networks where learning new information causes the abrupt and complete loss of previously learned information. When a neural network is fine-tuned on a new task or dataset, its weights are updated, which can overwrite the knowledge acquired from prior training, leading to a significant degradation in performance on older tasks. This is a major challenge in continual learning and knowledge injection for LLMs and LMMs.
Parameter-Efficient Fine-Tuning (PEFT)
Parameter-Efficient Fine-Tuning (PEFT) methods are a class of techniques designed to adapt large pre-trained models to new tasks or data without updating all of their parameters. Instead, PEFT methods typically freeze the majority of the pre-trained weights and introduce a small number of new, trainable parameters. This significantly reduces computational costs (training time, memory) and storage requirements compared to full fine-tuning. Common PEFT techniques include:
- Adapters: Small, task-specific neural network modules inserted between layers of the pre-trained model. Only the adapter weights are trained.
- LoRA (Low-Rank Adaptation): Decomposes weight updates into low-rank matrices. This method trains only these small, low-rank matrices while keeping the original pre-trained weights fixed.
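The LoRA idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the setup used in the paper: the layer sizes, the rank, and the zero initialization of `B` (the standard LoRA default) are chosen here only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2               # illustrative sizes; r is the LoRA rank

W0 = rng.normal(size=(d_out, d_in))    # frozen pre-trained weight
B = np.zeros((d_out, r))               # trainable, zero-initialized (standard LoRA default)
A = rng.normal(size=(r, d_in))         # trainable

def lora_forward(x):
    """Apply the adapted layer (W0 + B A) x without materializing W0 + B A."""
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=(d_in,))
full_params = W0.size                  # parameters touched by full fine-tuning
lora_params = A.size + B.size          # parameters trained by LoRA
```

With `B` initialized to zero, the adapted layer reproduces the frozen layer exactly at the start of training, and only `lora_params` values are trainable instead of `full_params`.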
Data Augmentation
Data augmentation is a strategy used to increase the diversity of training data by creating modified versions of existing data. For images, this might involve rotations, flips, or color changes. For text, it could involve synonym replacement, rephrasing, or back-translation. The goal is to make the model more robust and improve its generalization capabilities by exposing it to a wider variety of examples during training. In the context of knowledge injection, data augmentation can help the model learn new facts more thoroughly.
Continual Learning
Continual learning (or lifelong learning) is a machine learning paradigm that aims to enable models to learn sequentially from a continuous stream of data, accumulating knowledge over time without forgetting previously learned information. It directly addresses the problem of catastrophic forgetting. Techniques in continual learning often involve strategies like rehearsal (re-training on a small subset of old data), regularization (adding penalties to weight updates to preserve important parameters for old tasks), or dynamic architectures (modifying the model's structure as new tasks arrive).
Covariance Matrix
In statistics, the covariance matrix is a square matrix that describes the covariance between each pair of random variables in a dataset. For a vector of random variables $X = (X_1, \dots, X_n)^T$, the covariance matrix is defined as $\Sigma = E[(X - E[X])(X - E[X])^T]$. The diagonal elements $\Sigma_{ii}$ are the variances of each variable, and the off-diagonal elements $\Sigma_{ij}$ are the covariances between variables $X_i$ and $X_j$. In machine learning, particularly in models like LMMs, the covariance matrix of activations within a layer can capture statistical dependencies and patterns that represent learned knowledge or task-specific features.
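As a small illustration of this definition, the sketch below estimates a covariance matrix from synthetic samples with NumPy. The data and the induced correlation are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))     # 1000 observations of 3 variables
samples[:, 1] += 0.8 * samples[:, 0]     # make variables 0 and 1 correlated

# Center each variable, then average the outer products (unbiased estimate).
centered = samples - samples.mean(axis=0)
cov = centered.T @ centered / (len(samples) - 1)
```

Note that the matrix used later in KORE-CONSTRAINT, $C = X X^T$, is computed directly from the activations, without the mean-centering and normalization shown here.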
Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) is a powerful matrix factorization technique. For any real matrix $M \in \mathbb{R}^{m \times n}$, SVD decomposes it into three other matrices:
$
M = U \Sigma V^T
$
where:
- $U$ is an orthogonal matrix whose columns are the left singular vectors.
- $\Sigma$ is a diagonal matrix containing the singular values in descending order. These values represent the "strength" or "importance" of each corresponding singular vector.
- $V^T$ is the transpose of an orthogonal matrix $V$, whose columns are the right singular vectors.
SVD has numerous applications, including dimensionality reduction, noise reduction, and finding the null space of a matrix.
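A short NumPy sketch of the decomposition, using a small hand-picked matrix chosen only for illustration:

```python
import numpy as np

M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Thin SVD: U is 2x2, s holds the singular values (descending), Vt is 2x3.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

reconstructed = U @ np.diag(s) @ Vt          # equals M up to rounding

# Keeping only the largest singular value gives the best rank-1 approximation.
rank1 = s[0] * np.outer(U[:, 0], Vt[0])
rank1_error = np.linalg.norm(M - rank1)      # Frobenius norm of what was discarded
```

The error of the rank-1 approximation equals the discarded singular value `s[1]`, which is why small singular values mark directions that contribute little to the matrix.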
Null Space
The null space (or kernel) of a matrix $M$ is the set of all vectors $x$ for which $Mx = 0$. In other words, it is the set of vectors that are "annihilated" by the matrix $M$. Geometrically, the null space is the subspace of vectors that are orthogonal to the row space of $M$. In the context of KORE, initializing an adapter's weights in the null space of a covariance matrix means that these weights, when applied, will minimally affect the patterns (knowledge) represented by that covariance matrix, thus helping to preserve old knowledge.
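One common way to obtain a null-space basis numerically is through the SVD. The sketch below uses a rank-deficient example matrix and an illustrative threshold of `1e-10` for deciding which singular values count as zero; both are assumptions made for the example.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])      # second row is twice the first, so rank is 1

_, s, Vt = np.linalg.svd(M)          # full SVD: Vt is 3x3
tol = 1e-10                          # illustrative cutoff for "zero" singular values
rank = int((s > tol).sum())

# Right singular vectors beyond the rank span the null space of M.
null_basis = Vt[rank:].T             # shape (3, 2): two basis vectors as columns
```

Every column `v` of `null_basis` satisfies `M @ v ≈ 0`, so any update built from these directions leaves the output of `M` essentially unchanged.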
3.2. Previous Works
Retrieval-Augmented Generation (RAG)
RAG (Song et al., 2016; Fan et al., 2020; Lewis et al., 2020) is an alternative paradigm for handling external knowledge. Instead of modifying model parameters, RAG systems retrieve relevant information from an external knowledge base at inference time and use it to augment the model's generated response. This preserves pre-trained knowledge and allows access to up-to-date information without fine-tuning. However, its effectiveness depends heavily on the quality and speed of the retrieval system. KORE differentiates itself by focusing on directly modifying model parameters for knowledge injection.
Full Fine-Tuning
Full fine-tuning updates all parameters of a pre-trained model on new data. While it can achieve high performance on the new task, it is computationally intensive, requires significant storage, and notoriously suffers from catastrophic forgetting of previously learned knowledge. The paper explicitly contrasts KORE with Full-FT in its ability to balance adaptation and retention.
Parameter-Efficient Fine-Tuning (PEFT)
PEFT methods like adapters (Houlsby et al., 2019; Hu et al., 2022; Bi et al., 2025b) and new tokens (Lester et al., 2021; Sabbatella et al., 2024) significantly reduce computational and storage costs by updating only a small fraction of parameters. LoRA (Hu et al., 2022) is a prominent PEFT technique that KORE builds upon. Despite their efficiency, general PEFT methods still face challenges in effective knowledge injection and mitigating catastrophic forgetting.
Continual Learning (CL) Techniques
CL aims to enable models to learn sequentially without forgetting.
- Rehearsal-based methods (Li & Hoiem, 2017a; Hou et al., 2019) store and periodically replay a subset of old data alongside new data. The Replay baseline in the KORE experiments uses this strategy.
- Parameter regularization methods (Kirkpatrick et al., 2017 - EWC; Li & Hoiem, 2017b - LwF) add penalties to the loss function to prevent significant changes to parameters important for old tasks. EWC (Elastic Weight Consolidation) calculates the importance of each parameter using the Fisher Information Matrix and penalizes changes to important parameters. LwF (Learning without Forgetting) uses knowledge distillation to retain old knowledge by having the new model mimic the outputs of the old model on previous tasks.
- Other CL approaches include dynamic architectures (Yan et al., 2021) and complementary projection-based methods (Farajtabar et al., 2020; Chaudhry et al., 2020; Saha et al., 2021).
KORE draws inspiration from CL but aims to optimize the balance between new knowledge acquisition and prior knowledge retention more effectively, especially for LLMs and LMMs.
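To make the parameter-regularization idea concrete, here is a minimal sketch of the quadratic EWC penalty with a diagonal Fisher approximation. The parameter values, Fisher estimates, and the weight `lam` are hypothetical numbers chosen for the example.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_old) ** 2))

theta_old = np.array([1.0, -2.0, 0.5])   # parameters after the old task
fisher = np.array([10.0, 0.1, 0.0])      # hypothetical diagonal Fisher estimates
theta = np.array([1.1, -1.0, 3.0])       # parameters during new-task training

penalty = ewc_penalty(theta, theta_old, fisher)
```

A small move in the first parameter (high Fisher value) costs as much as a large move in the second, and the third parameter is free to change, which is how EWC protects only the weights deemed important for old tasks.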
CO-SVD and Orthogonal Subspace Constraints
- Prior work (Meng et al., 2023; Yang et al., 2024) has explored how covariance matrices of LMM activations capture knowledge. KORE extends CO-SVD (Context-Oriented Decomposition Adaptation) from text-only LLMs to multimodal LMMs to verify the knowledge-capturing ability of covariance matrices in a multimodal context.
- Methods like O-LoRA (Wang et al., 2023) use orthogonal subspace constraints to mitigate catastrophic forgetting by allocating independent, orthogonal parameter subspaces for different tasks. KORE-CONSTRAINT leverages the concept of null space projection, which is geometrically related to orthogonal subspaces, to define a fine-tuning direction that minimizes interference.
EvOKE Benchmark
EvOKE (Jiang et al., 2025) is a new benchmark specifically designed to evaluate how well LMMs can learn evolving knowledge without forgetting their original capabilities. KORE uses EvOKE for its knowledge adaptation evaluation, highlighting its relevance to the paper's core problem.
3.3. Technological Evolution
The journey of LLMs and LMMs started with pre-training on vast, static datasets, encoding immense world knowledge into their weights. This initial phase yielded models capable of impressive feats but with a critical limitation: their knowledge was frozen at the time of pre-training.
The first attempt to update this knowledge was full fine-tuning, which quickly proved too expensive and detrimental to old knowledge (catastrophic forgetting). This led to the development of PEFT methods like LoRA, which drastically reduced computational costs but still struggled with the adaptation-retention trade-off. Simultaneously, the continual learning field developed techniques to mitigate catastrophic forgetting, but these often focused on balancing existing tasks and struggled to scale effectively to the sheer volume of world knowledge in LLMs/LMMs. RAG emerged as an alternative, bypassing fine-tuning for knowledge updates by retrieving external information.
KORE fits into this evolution by proposing a hybrid approach that enhances the fine-tuning paradigm. It acknowledges the limitations of existing PEFT and continual learning methods by introducing knowledge-oriented strategies for both learning new knowledge (KORE-AUGMENTATION) and protecting old knowledge (KORE-CONSTRAINT). This represents a step towards truly continuous knowledge acquisition for LMMs, aiming for a more balanced and effective solution than previous attempts.
3.4. Differentiation Analysis
KORE differentiates itself from existing knowledge injection methods through its synergistic and knowledge-oriented approach:
- vs. Full Fine-Tuning & General PEFT (e.g., LoRA):
  - Differentiation: Full fine-tuning and LoRA primarily focus on minimizing loss on new data. KORE explicitly addresses catastrophic forgetting and robust generalization by introducing knowledge-oriented augmentations and constraints.
  - Innovation: KORE-AUGMENTATION creates a structured "knowledge tree" for deeper learning of new knowledge, moving beyond simple data fitting. KORE-CONSTRAINT actively preserves old knowledge by leveraging the null space of activation covariance matrices, a mechanism absent in standard PEFT.
- vs. Retrieval-Augmented Generation (RAG):
  - Differentiation: RAG accesses external knowledge at inference time without modifying model parameters. KORE directly modifies parameters to internalize new knowledge, making it an intrinsic part of the model.
  - Innovation: KORE aims for knowledge internalization, allowing models to reason and generalize with new facts, which RAG might not inherently achieve as it relies on external lookup.
- vs. Continual Learning (CL) Methods (e.g., EWC, LwF, Replay, O-LoRA, MoELoRA, SEFE):
  - Differentiation: While CL methods aim to mitigate catastrophic forgetting, they often struggle to balance adaptation and retention. Some CL methods might impair adaptation (e.g., EWC), or are not knowledge-oriented in their augmentation or constraint mechanisms.
  - Innovation: KORE-AUGMENTATION provides a novel way to learn new knowledge profoundly, which general CL methods lack. KORE-CONSTRAINT offers a knowledge-driven fine-tuning constraint by using covariance matrices to represent and protect existing knowledge, distinguishing it from general parameter regularization (e.g., EWC) or orthogonal subspace methods (O-LoRA) by its direct link to the model's internal representation of knowledge. The paper specifically claims KORE "optimizes the balance between injecting new knowledge and preserving old knowledge" better than these baselines.
In essence, KORE's innovation lies in its dual, synergistic, and knowledge-centric approach. It does not just augment data or constrain parameters generally; it designs these processes specifically around the nature of knowledge to ensure more effective learning and protection in LMMs.
4. Methodology
KORE is designed as a synergistic method comprising two main components: KORE-AUGMENTATION and KORE-CONSTRAINT. These components work together to optimize the balance between injecting new knowledge (knowledge adaptation) and preserving old knowledge (knowledge retention) in Large Multimodal Models (LMMs).
4.1. Principles
The core idea behind KORE is to address the limitations of existing knowledge injection methods, which often struggle with generalization for new knowledge and suffer from catastrophic forgetting of old knowledge.
- Knowledge-Oriented Augmentation: Instead of superficial data variations, KORE aims to build a deep, structured understanding of new knowledge by automatically converting individual knowledge items into comprehensive and interconnected data formats. This ensures the model genuinely learns and can flexibly manipulate the new information, promoting accurate adaptation.
- Knowledge-Oriented Constraint: To prevent catastrophic forgetting, KORE identifies and protects the model's internal representations of previous knowledge. It achieves this by leveraging the statistical patterns (captured in covariance matrices) of the LMM's activations and then guiding fine-tuning directions to minimally interfere with these patterns, ensuring powerful retention.
The synergistic combination of these two principles allows KORE to achieve superior performance in both knowledge adaptation and knowledge retention.
4.2. Core Methodology In-depth (Layer by Layer)
The overall architecture of KORE is illustrated in Figure 2.
4.2.1. KORE-AUGMENTATION
Existing knowledge injection methods often lead to poor generalization because they struggle to help models truly master new knowledge. Inspired by the fact that data augmentation can enhance generalization, KORE proposes KORE-AUGMENTATION. Unlike general augmentation techniques that produce shallow, discrete data variations, KORE-AUGMENTATION employs an automated pipeline to build structured and comprehensive knowledge from individual items, facilitating accurate adaptation.
The key insight of KORE-AUGMENTATION is to transform original knowledge into a "knowledge tree" format, consisting of multi-round dialogue data (forming the trunk) and instruction tasks data (forming the branches), as depicted in Figure 3. This structured approach moves beyond mere "data memorization" and aims to help the model comprehend the inherent logic and associations within the knowledge, enabling knowledge internalization.
The construction process for KORE-74K (the dataset generated by KORE-AUGMENTATION) is fully automated, with only question templates being manually crafted (see Figure 13 in Appendix H).
The steps involved in KORE-AUGMENTATION are:
Step 1: Constructing Multi-rounds of Dialogue Data
This component forms the "trunk" of the knowledge tree and consists of two parts for each knowledge sample:
- Heuristic Q&A (H.Q in Figure 2): These are constructed randomly using manually written templates. For instance, for news, templates might include: "What is the {type} news in the image about?" or "Could you summarize the {type} news story presented in the image?". For entities, "What is the {type} entity in the image?" or "Can you identify the {type} entity shown in the picture?".
- Dialogue Q&A: GPT-4o is used to generate up to 10 dialogues based on original textual knowledge, following rigorous rules and diverse task examples. The first turn of the dialogue is usually a heuristic Q&A pair. Subsequent turns are generated automatically by GPT-4o based on the original knowledge, predefined rules, and previous questions/answers. The query images for these dialogues are directly from the original image set. This process yields a large volume of dialogue data (e.g., 75,710 dialogue rounds for KORE-74K).
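The template-driven first turn described above could be assembled roughly as follows. This is a hypothetical sketch: the dictionary layout, function names, and the reading of "up to 10" as a cap on total turns are assumptions, and the GPT-4o call is replaced by a list of already-generated turns.

```python
import random

# Templates quoted from the description above; the dict layout is an assumption.
HEURISTIC_TEMPLATES = {
    "news": [
        "What is the {type} news in the image about?",
        "Could you summarize the {type} news story presented in the image?",
    ],
    "entity": [
        "What is the {type} entity in the image?",
        "Can you identify the {type} entity shown in the picture?",
    ],
}

def heuristic_question(kind, type_name, rng):
    """Pick a random template for this knowledge kind and fill in its type."""
    template = rng.choice(HEURISTIC_TEMPLATES[kind])
    return template.format(type=type_name)

def build_dialogue(kind, type_name, answer, generated_turns, rng, max_turns=10):
    """Put the heuristic Q&A first, then append generated turns up to max_turns."""
    first = {"question": heuristic_question(kind, type_name, rng), "answer": answer}
    return [first] + list(generated_turns)[: max_turns - 1]

rng = random.Random(0)
question = heuristic_question("news", "sports", rng)
```

Passing a seeded `random.Random` keeps the sampled templates reproducible across dataset builds.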
Step 2: Collecting Recognition and Caption Images
This step supports the "branches" related to visual tasks.
- Image Retrieval: News titles or entity names are used as search keywords to retrieve the top five images via Google Search.
- Visual Feature Extraction: CLIP (Radford et al., 2021) is used to extract visual features from both the original image and the newly downloaded images.
- Selection: The two downloaded images with the highest cosine similarity to the original image (excluding identical matches, i.e., those with a cosine similarity of 1) are retained. These serve as query images for subsequent visual recognition and captioning tasks.
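The selection rule in this step can be sketched with plain NumPy, assuming the CLIP features have already been extracted. The function name, the duplicate tolerance, and the toy three-dimensional vectors below are all illustrative assumptions; real CLIP embeddings are much higher-dimensional.

```python
import numpy as np

def select_similar(original, candidates, k=2, duplicate_tol=1e-6):
    """Return indices of the k candidates most similar to `original`, skipping exact matches."""
    original = original / np.linalg.norm(original)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = candidates @ original                     # cosine similarities
    order = np.argsort(-sims)                        # most similar first
    keep = [int(i) for i in order if sims[i] < 1.0 - duplicate_tol]
    return keep[:k], sims

# Toy stand-ins for image embeddings of the original and five downloaded images.
original = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [1.0, 0.0, 0.0],    # identical to the original, should be excluded
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [-1.0, 0.0, 0.0],
])
selected, sims = select_similar(original, candidates)
```

Here the identical image is skipped and the two next-closest candidates are kept as query images.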
Step 3: Constructing Visual Recognition QA
This task aims to verify if the model can recognize specific elements related to the new knowledge in images.
- Template-based Questions: Questions are randomly selected from manually written templates (e.g., "Is the image depicting news {title}?", "Can you see {entity_name} in this picture?").
- Fixed Answer: The answer is always "Yes".
- Instruction: The model is instructed to "Answer this question with Yes or No."
- Query Image: One of the images selected in Step 2 serves as the query image.
Step 4: Constructing Image Caption QA
This task assesses the model's ability to summarize the new knowledge in a descriptive paragraph linked to an image.
- Summary Generation: GPT-4o generates a summary based on the original textual knowledge, which serves as the answer.
- Template-based Questions: Questions are randomly selected from templates (e.g., "Could you please describe the {type} news shown in the picture?", "Please provide a description for the {type} entity in the image.").
- Instruction: The model is instructed to "Answer this question in one paragraph."
- Query Image: The remaining image from Step 2 (not used in Visual Recognition QA) serves as the query image.
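Steps 3 and 4 both pair a templated question with a fixed instruction and one of the two images kept in Step 2. A hypothetical sketch of that record construction follows; the record fields, function name, and the choice to append the instruction to the question are assumptions, and the summary is passed in rather than generated by GPT-4o.

```python
import random

RECOGNITION_INSTRUCTION = "Answer this question with Yes or No."
CAPTION_INSTRUCTION = "Answer this question in one paragraph."

# One example template per kind, quoted from the text; the real sets are larger.
RECOGNITION_TEMPLATES = {
    "news": ["Is the image depicting news {title}?"],
    "entity": ["Can you see {entity_name} in this picture?"],
}
CAPTION_TEMPLATES = {
    "news": ["Could you please describe the {type} news shown in the picture?"],
    "entity": ["Please provide a description for the {type} entity in the image."],
}

def build_branch_samples(kind, name, type_name, summary, images, rng):
    """Build one recognition sample and one caption sample from two query images."""
    recognition_image, caption_image = images     # the two images kept in Step 2
    rec_q = rng.choice(RECOGNITION_TEMPLATES[kind]).format(title=name, entity_name=name)
    cap_q = rng.choice(CAPTION_TEMPLATES[kind]).format(type=type_name)
    recognition = {
        "image": recognition_image,
        "question": f"{rec_q} {RECOGNITION_INSTRUCTION}",
        "answer": "Yes",                          # the answer is always "Yes"
    }
    caption = {
        "image": caption_image,
        "question": f"{cap_q} {CAPTION_INSTRUCTION}",
        "answer": summary,                        # summary produced upstream
    }
    return recognition, caption

rec, cap = build_branch_samples(
    "news", "Example Title", "sports", "A summary.", ["a.jpg", "b.jpg"], random.Random(0)
)
```

Each image is used exactly once, so the recognition and captioning tasks see different views of the same piece of knowledge.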
Step 5: Constructing VQA (Visual Question Answering)
This task challenges the model with more complex visual reasoning related to the new knowledge.
- Quadruplet Generation: GPT-4o generates quadruplets (Q, A, S, H) from the original textual knowledge, where $Q$ is a question, $A$ is its answer (single word or phrase), $S$ is the subject in the question, and $H$ is the hypernym corresponding to the subject. For example: (Q: "Who attempted to assassinate the person in the image during a campaign rally in July 2024?", A: "Thomas Matthew Crooks", S: "Donald John Trump", H: "Person").
- Image Retrieval for VQA: The subject $S$ and hypernym $H$ are combined as search keywords to retrieve and download the top-ranked image from Google.
- Instruction: The model is instructed to "Answer the question using a single word or phrase."
Through this automated pipeline, KORE-AUGMENTATION generates a rich and diverse dataset like KORE-74K, which enables the model to acquire new knowledge more effectively and generalize better.
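The (Q, A, S, H) quadruplet from Step 5 maps naturally onto a small record type. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and joining subject and hypernym with a space is only one plausible way to "combine" them into search keywords.

```python
from dataclasses import dataclass

VQA_INSTRUCTION = "Answer the question using a single word or phrase."

@dataclass(frozen=True)
class Quadruplet:
    question: str   # Q
    answer: str     # A, a single word or phrase
    subject: str    # S, the subject referred to in the question
    hypernym: str   # H, the hypernym of the subject

    def search_keywords(self):
        """Combine subject and hypernym into an image-search query (assumed format)."""
        return f"{self.subject} {self.hypernym}"

    def to_prompt(self):
        """Attach the fixed VQA instruction to the question."""
        return f"{self.question} {VQA_INSTRUCTION}"

# The example quadruplet given in the text above.
example = Quadruplet(
    question="Who attempted to assassinate the person in the image during a campaign rally in July 2024?",
    answer="Thomas Matthew Crooks",
    subject="Donald John Trump",
    hypernym="Person",
)
```

Because the question refers to "the person in the image" rather than naming the subject, the retrieved image carries information the model must recognize before it can answer.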
4.2.2. KORE-CONSTRAINT
KORE-CONSTRAINT is designed to mitigate catastrophic forgetting by preserving previous knowledge during fine-tuning. It achieves this by identifying and protecting the patterns within LMM's internal representations that correspond to pre-trained knowledge.
The method involves the following steps:
- Activation Collection: KORE-CONSTRAINT collects activations from the LMM's linear layers on a set of random samples that represent the pre-trained knowledge. Let the input activations to a linear layer be $X$:
  $
  X \in \mathbb{R}^{d_{in} \times BL}
  $
  where:
  - $d_{in}$ is the input dimension of the linear layer.
  - $B$ is the number of samples (batch size).
  - $L$ is the sequence length.
- Covariance Matrix Computation: The covariance matrix of these activations is computed. This matrix is assumed to effectively capture previous knowledge patterns.
  $
  C = X X^T \in \mathbb{R}^{d_{in} \times d_{in}}
  $
  where:
  - $C$ is the covariance matrix.
  - $X^T$ is the transpose of $X$.
- Knowledge Retention Condition: For LoRA fine-tuning, the fine-tuned weights are typically given by $W^* = W_0 + BA$, where $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$ are the original weights, $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$ are the low-rank LoRA matrices, and $r$ is the LoRA rank. To ensure knowledge retention, the output activations derived from pre-trained knowledge should remain consistent after fine-tuning. This is formalized as:
  $
  (W_0 + BA) C \approx W_0 C
  $
  Simplifying this condition, we aim for:
  $
  BAC \approx \mathbf{0}
  $
  To achieve this, the goal is to have the matrix $A$ (specifically, its action on activations) effectively lie in the null space related to $C$. This is formulated as:
  $
  AC = \mathbf{0}
  $
Singular Value Decomposition (SVD) of Covariance: To find the
null spaceof ,SVDis applied to (which is ). $ \operatorname { S V D } \left( \pmb { X } ( \pmb { X } ) ^ { T } \right) = U \Sigma U ^ { T } = \sum _ { i = 1 } ^ { d _ { in } } \sigma _ { i } \mathbf { u } _ { i } \mathbf { u } _ { i } ^ { T } $ where:- is an orthogonal matrix whose columns are the
left singular vectors. - is a diagonal matrix containing the
singular values, where . The remaining for are zero. - The
null spaceof is spanned by the columns of that correspond to zerosingular values.
- is an orthogonal matrix whose columns are the
- Approximate Null Space Projection: In practice, exact zero singular values are rare due to numerical precision. KORE approximates the null space with $\hat{U}$, a submatrix containing the left singular vectors from $U$ that are associated with the $r$ smallest singular values. Here, $r$ refers to the LoRA rank.
  - This $\hat{U}$ is used to define a knowledge-oriented constraint projector.
  - The projector is given by:
    $
    P = \hat{U} \hat{U}^T
    $
    This projector maps a vector onto the approximate null space of $C$.
- Adapter Initialization via Projection: The LoRA adapters are initialized by factorizing the pre-trained weights ($ W_{0} $) that have been projected into this null space.
  - First, the SVD of the projected weights is computed: $ \operatorname{SVD}\left( W_{0} P \right) = \left\{ U^{*}, \Sigma^{*}, (V^{*})^{T} \right\} $, where $ U^{*} $, $ \Sigma^{*} $, and $ (V^{*})^{T} $ are the singular value decomposition components of $ W_{0} P $.
  - Then, the adapter matrices $ B $ and $ A $ are initialized as: $ B = U^{*} \sqrt{\Sigma^{*}}, \qquad A = \sqrt{\Sigma^{*}} \big( V^{*} \big)^{T} $, where $ \sqrt{\Sigma^{*}} $ denotes a diagonal matrix with entries corresponding to the square roots of the singular values in $ \Sigma^{*} $.
- Original Weight Adjustment: To ensure the model remains unchanged at the beginning of fine-tuning, the original weight matrix is adjusted with a residual term: $ W_{0}^{\prime} = W_{0} - BA $ This means the effective initial weight matrix for fine-tuning becomes $ W_{0}^{\prime} + BA = W_{0} $.
- Freezing Matrix A: Given the asymmetry between LoRA matrices $ A $ and $ B $, fine-tuning only $ B $ is often sufficient for strong performance. KORE freezes matrix $ A $.
  - By initializing $ A $ such that the column space of $ A^{T} $ is tied to the approximate null space of $ C $ (as shown in Appendix C, Theorem 1), and then freezing $ A $, KORE ensures that $ AC \approx \mathbf{0} $ throughout fine-tuning.
  - This makes the update term from LoRA ($ BAC $) negligible regardless of how $ B $ is updated during fine-tuning. Consequently, the output activations from pre-trained knowledge are approximately preserved.
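The initialization steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `kore_constraint_init`, the toy dimensions, and the rank-deficient random `X` (chosen so that an exact null space exists) are all hypothetical.

```python
import numpy as np

def kore_constraint_init(W0, X, r):
    """Sketch of null-space adapter initialization (hypothetical helper).

    W0: (d_out, d_in) pre-trained weight
    X:  (d_in, n) stacked input activations
    r:  adapter rank
    """
    C = X @ X.T                              # covariance matrix, (d_in, d_in)
    U, _, _ = np.linalg.svd(C)               # singular values come back in descending order
    U_hat = U[:, -r:]                        # vectors of the r smallest singular values
    P = U_hat @ U_hat.T                      # projector onto the approximate null space
    U_s, S_s, Vh_s = np.linalg.svd(W0 @ P, full_matrices=False)
    sqrt_S = np.sqrt(S_s[:r])
    B = U_s[:, :r] * sqrt_S                  # B = U* sqrt(Sigma*), shape (d_out, r)
    A = sqrt_S[:, None] * Vh_s[:r, :]        # A = sqrt(Sigma*) (V*)^T, shape (r, d_in)
    W0_res = W0 - B @ A                      # residual, so W0_res + B A equals W0 at the start
    return W0_res, B, A, C

rng = np.random.default_rng(0)
d_out, d_in, n, r = 6, 16, 8, 4
W0 = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))           # rank <= 8, so C has an exact null space
W0_res, B, A, C = kore_constraint_init(W0, X, r)

# A stays frozen; simulate an arbitrary fine-tuning update to B only.
B_new = B + rng.standard_normal(B.shape)
W_new = W0_res + B_new @ A
drift = np.linalg.norm(W_new @ X - W0 @ X)   # output change on the "old knowledge" inputs
print(f"||A C|| = {np.linalg.norm(A @ C):.2e}, output drift = {drift:.2e}")
```

In this toy setting the null space is exact, so the drift is at the level of floating-point noise. With real activations the smallest singular values are small rather than zero, and the preservation is only approximate.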
Proof of KORE-CONSTRAINT (from Appendix C)
The paper provides two theorems to mathematically substantiate KORE-CONSTRAINT:
Theorem 1.
Under the assumption that $ W_{0} $ is full-rank, the column space of $ A^{T} $ forms a subset of the column space of $ U_{\mathrm{null}} $.
The proof proceeds as follows:
- Step 1: Defines $ A = \sqrt{\Sigma^{*}} (V^{*})^{T} $. Since $ \sqrt{\Sigma^{*}} $ is full-rank, the column space is preserved, so $ \operatorname{Col}(A^{T}) = \operatorname{Col}(V^{*}) $.
- Step 2: Uses the SVD of $ W_{0} P $. It states that the columns of $ V^{*} $ span the row space of $ W_{0} P $, and since $ P = U_{\mathrm{null}} U_{\mathrm{null}}^{T} $ is a projector onto the subspace spanned by $ U_{\mathrm{null}} $, the column space of $ V^{*} $ is equal to the column space of $ U_{\mathrm{null}} $. $ \operatorname{Col}(V^{*}) = \operatorname{Col}(W_{0} U_{\mathrm{null}} U_{\mathrm{null}}^{T}) = \operatorname{Col}(U_{\mathrm{null}}^{T}) $
- Step 3: Combining Step 1 and Step 2, it concludes that $ \operatorname{Col}(A^{T}) \subseteq \operatorname{Col}(U_{\mathrm{null}}) $, meaning the column space of $ A^{T} $ lies in the null space of $ C $.
Theorem 2. For a given layer $ l $ in a large language model, suppose the input activation $ X^{(l)} $ is derived from pre-trained world knowledge and remains unchanged. Then, under fine-tuning with KORE, the output of the layer is approximately preserved: $ W^{*(l)} X^{(l)} \approx W_{0}^{(l)} X^{(l)} $, where $ W_{0}^{(l)} $ is the initial weight matrix and $ W^{*(l)} $ is the fine-tuned weight matrix for layer $ l $. The proof proceeds as follows:
- KORE defines the fine-tuned weight for layer $ l $ as: $ \pmb{W}^{*(l)} = \pmb{W}_{0}^{(l)} - \pmb{B}^{(l)} \pmb{A}^{(l)} + \pmb{B}^{*(l)} \pmb{A}^{(l)} $ Here, $ \pmb{W}_{0}^{(l)} $ is the original weight matrix. The term $ -\pmb{B}^{(l)} \pmb{A}^{(l)} $ represents the initial adjustment that makes the effective starting weight $ \pmb{W}_{0}^{\prime (l)} = \pmb{W}_{0}^{(l)} - \pmb{B}^{(l)} \pmb{A}^{(l)} $. The term $ \pmb{B}^{*(l)} \pmb{A}^{(l)} $ represents the updated LoRA contribution during fine-tuning, where $ \pmb{B}^{*(l)} $ is the fine-tuned matrix $ \pmb{B} $.
- The output of the layer is: $ \pmb{W}^{*(l)} \pmb{X}^{(l)} = ( \pmb{W}_{0}^{(l)} - \pmb{B}^{(l)} \pmb{A}^{(l)} + \pmb{B}^{*(l)} \pmb{A}^{(l)} ) \pmb{X}^{(l)} $
- Using the approximation that $ \pmb{A}^{(l)} \pmb{X}^{(l)} \approx \mathbf{0} $ (due to $ \pmb{A}^{(l)} $ being initialized in the null space of $ C^{(l)} $ and $ C^{(l)} $ being $ \pmb{X}^{(l)} (\pmb{X}^{(l)})^{T} $), the terms involving $ \pmb{A}^{(l)} $ become negligible: $ \pmb{W}^{*(l)} \pmb{X}^{(l)} \approx \pmb{W}_{0}^{(l)} \pmb{X}^{(l)} $ This demonstrates that the output remains approximately unchanged, thus preserving the pre-trained knowledge.
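The last step of the proof moves from a condition on the covariance ($AC \approx 0$) to a condition on the activations ($AX \approx 0$). One way to make that link explicit, assuming $C^{(l)} = X^{(l)} (X^{(l)})^{T}$ is built from the same activations the theorem refers to, is the following worked derivation (a sketch, not quoted from the paper):

```latex
\begin{aligned}
W^{*(l)} X^{(l)} - W_0^{(l)} X^{(l)}
  &= \bigl( B^{*(l)} - B^{(l)} \bigr) A^{(l)} X^{(l)}, \\
\bigl\| A^{(l)} X^{(l)} \bigr\|_F^2
  &= \operatorname{tr}\!\bigl( A^{(l)} X^{(l)} (X^{(l)})^{T} (A^{(l)})^{T} \bigr)
   = \operatorname{tr}\!\bigl( A^{(l)} C^{(l)} (A^{(l)})^{T} \bigr) \approx 0 .
\end{aligned}
```

The first line shows that the output change is the drift in $B$ multiplied by $A^{(l)} X^{(l)}$. The second shows that $A^{(l)} X^{(l)}$ is small whenever $A^{(l)} C^{(l)}$ is small, so the bound holds however far $B$ moves, as long as $A$ stays frozen.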
Analysis of Knowledge-Oriented Constraint's Ability to Capture Knowledge
KORE-CONSTRAINT relies on the premise that covariance matrices effectively capture knowledge. To verify this for multimodal scenarios (extending CO-SVD from text to multimodal), the paper conducts experiments on LLaVA-v1.5 (7B) by applying Plain SVD, ASVD (Activation-aware SVD), and CO-SVD to decompose all layers' pre-trained weights. Weights are reconstructed after removing components corresponding to the smallest singular values.
The findings (from Figure 4(a) and (b), and Appendix D.1 Table 8):
- CO-SVD consistently shows superior performance retention compared to Plain SVD and ASVD after reconstruction, especially when a large number of ranks are discarded. This suggests that multimodal knowledge is indeed effectively captured and stored in covariance matrices.
- The number of sampled data points for covariance matrix computation has limited influence; even a small number (e.g., 32 samples) is sufficient to capture essential knowledge.
- Using test-specific samples for covariance matrix computation leads to better performance on those specific tasks when discarding many ranks (e.g., CO-SVD with MME samples performs better on MME than CO-SVD with ScienceQA samples). This indicates that covariance matrices can capture dataset-specific knowledge and exhibit distinct patterns for different tasks.

Visualizations of covariance matrices (Figure 4(c), Figure 8, and Figure 9 in Appendix D.2) for tasks like POPE (object hallucination), HallusionBench (entangled language hallucination and visual illusion), and MMBench (comprehensive evaluation) show:
- Covariance matrices of linear layer inputs for related tasks (POPE and HallusionBench) share similar outlier patterns (marked by red circles), which differ from unrelated tasks (MMBench).
- This indicates that distinct tasks activate different outlier distributions in the covariance matrix, empirically supporting that covariance matrix patterns characterize the triggered task. This ability is leveraged by KORE to guide the decomposition of pre-trained weights and initialize adapters with informative knowledge, enabling knowledge-oriented constraints.
- For KORE, a multi-dimensional covariance matrix is built by sampling 64 examples per category from OneVision's single-image subset (General, Doc/Chart/Screen, Math/Reasoning, General OCR).
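Building one covariance matrix from several data categories can be sketched as a running sum of $X X^{T}$ over per-category activation blocks. The helper name `accumulate_covariance`, the toy sizes, and the random stand-in activations are assumptions for illustration; in practice each sample contributes many token activations and the blocks come from forward hooks on the model.

```python
import numpy as np

def accumulate_covariance(blocks):
    """Sum X X^T over activation blocks, each of shape (d_in, n_tokens)."""
    d_in = blocks[0].shape[0]
    C = np.zeros((d_in, d_in))
    for X in blocks:
        C += X @ X.T          # equivalent to one X X^T over the concatenated blocks
    return C

rng = np.random.default_rng(0)
categories = ["General", "Doc/Chart/Screen", "Math/Reasoning", "General OCR"]
# Hypothetical stand-in: 64 activation columns per category, d_in = 8.
blocks = [rng.standard_normal((8, 64)) for _ in categories]

C = accumulate_covariance(blocks)
C_concat = np.hstack(blocks) @ np.hstack(blocks).T
print(np.allclose(C, C_concat))
```

Accumulating block by block avoids holding all activations in memory at once, which matters when `d_in` and the token count are large.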
5. Experimental Setup
This section details the experimental design, including the benchmarks, evaluation metrics, baseline methods, and training configurations used to validate KORE.
5.1. Datasets
5.1.1. Knowledge Adaptation Evaluation
The primary benchmark for evaluating knowledge adaptation (injecting new knowledge) is EvOKE.
- EvOKE (Jiang et al., 2025): This benchmark is specifically designed to assess how well Large Multimodal Models (LMMs) can learn evolving knowledge without experiencing catastrophic forgetting. In the context of KORE's experiments, knowledge is injected as image-text pairs, and evaluation questions are derived from the accompanying text. EvOKE reveals the limitations of current methods and highlights the severity of catastrophic forgetting.
5.1.2. Knowledge Retention Evaluation
To evaluate knowledge retention (preserving old knowledge), fine-tuned LMMs are assessed on 12 benchmarks spanning 7 distinct capability dimensions. The evaluation settings follow VLMEvalKit (Duan et al., 2024).
- Comprehensive Evaluation (COM):
  - MME (Fu et al., 2023): A benchmark for holistic evaluation of LMMs' perception and cognition across various tasks, primarily using straightforward question-answer pairs.
  - MMBench (Liu et al., 2024c): A cross-lingual benchmark featuring over 3,000 bilingual multiple-choice questions across 20 skill dimensions, from visual recognition to abstract reasoning.
- Optical Character Recognition (OCR):
  - SEEDBench2 Plus (Li et al., 2024): Benchmarks LMMs on interpreting text-rich visuals (e.g., charts, web layouts) using 2,300 multiple-choice questions where integrating textual and visual information is crucial.
  - OCRVQA (Mishra et al., 2019): Evaluates a model's ability to answer questions by reading text within images, focusing on tasks where OCR is essential.
- Multidisciplinary Reasoning (M-DIS):
  - ScienceQA (Lu et al., 2022): Evaluates scientific reasoning through a large-scale multimodal benchmark with curriculum-based questions, diagrams, and provided lectures/explanations to encourage complex reasoning.
  - MMMU (Yue et al., 2024): Evaluates LMMs on college-level, multimodal questions requiring expert knowledge across six disciplines and 30 image formats.
- Instruction Following (INS):
  - MIA-Bench (Qian et al., 2024): A benchmark measuring how precisely LMMs can follow complex and multi-layered instructions using 400 distinct image-prompt combinations.
- Multi-Turn Multi-Image Dialog Understanding (M-IDU):
  - MMDU (Liu et al., 2025): Evaluates LMMs in multi-image, multi-turn conversational scenarios, assessing contextual understanding, temporal reasoning, and coherence.
- Mathematical Reasoning (MAT):
  - MathVista (Lu et al., 2024): Benchmarks mathematical reasoning of foundation models in visual contexts, aggregating 6,141 problems from 31 datasets requiring visual analysis and compositional logic.
  - MathVision (Wang et al., 2025a): A challenging dataset of 3,040 visually-presented problems from math competitions, categorized by mathematical areas and difficulty tiers.
- Hallucination (HAL):
  - POPE (Li et al., 2023): Evaluates object hallucination (describing non-existent objects) in LMMs using a polling-based questioning strategy.
  - HallusionBench (Guan et al., 2024): A diagnostic suite for entangled language hallucination and visual illusion in LMMs, using 346 images and 1,129 structured questions.
5.1.3. KORE-74K Dataset
KORE-74K is a new dataset constructed by KORE-AUGMENTATION using the original knowledge from EvOKE. It contains 74,734 total data points, comprising multi-round dialogue data (9,422, 12.6%), visual recognition data (9,422, 12.6%), image caption data (9,422, 12.6%), and VQA data (46,468, 62.2%). It includes 75,710 rounds of dialogue and 65,312 unique images.
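The composition figures can be checked with a few lines of arithmetic. The dictionary keys below are just labels for the four data types named above; nothing else is assumed.

```python
# Counts per data type, as reported for KORE-74K.
counts = {
    "multi-round dialogue": 9422,
    "visual recognition": 9422,
    "image caption": 9422,
    "VQA": 46468,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)    # 74734
print(shares)
```

The four counts add up to the stated 74,734 items, and the rounded shares match the reported 12.6% and 62.2%.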
5.2. Evaluation Metrics
5.2.1. Knowledge Adaptation Metrics
For open-domain question answering tasks (e.g., EvOKE), two key metrics are used:
- Cover Exact Match (CEM):
  - Conceptual Definition: CEM assesses whether the entire ground truth answer is contained within the model's generated prediction. It is a strict metric: the response must include every part of the correct answer.
  - Mathematical Formula: $ CEM = \left\{ \begin{array}{ll} 1, & y_{q} \subseteq \hat{Y} \\ 0, & \mathrm{otherwise} \end{array} \right. $
  - Symbol Explanation:
    - $ y_{q} $: The ground truth answer string.
    - $ \hat{Y} $: The text generated by the model.
    - $ y_{q} \subseteq \hat{Y} $: Indicates that the ground truth answer $ y_{q} $ is a substring of, or completely contained within, the generated text $ \hat{Y} $.
    - 1: If the condition is met (exact match).
    - 0: Otherwise (no exact match).
- F1-Score (F1):
  - Conceptual Definition: The F1-Score measures the word-level overlap between the predicted answer and the ground truth answer. It is the harmonic mean of Precision and Recall, providing a balanced measure when the number of relevant items (words in the ground truth) and retrieved items (words in the prediction) might be uneven.
  - Mathematical Formula: First, define the overlap function $ \mathcal{U} $: $ \mathcal{U}(\hat{Y}, y_{q}) = \sum_{t \in \mathcal{W}(y_{q})} \mathbf{1}[t \in \mathcal{W}(\hat{Y})] $ Then, Precision is: $ \mathcal{P}(\hat{Y}, y_{q}) = \frac{\mathcal{U}(\hat{Y}, y_{q})}{\vert \mathcal{W}(\hat{Y}) \vert} $ And Recall is: $ \mathcal{R}(\hat{Y}, y_{q}) = \frac{\mathcal{U}(\hat{Y}, y_{q})}{\vert \mathcal{W}(y_{q}) \vert} $ Finally, the F1-Score is: $ F1 = 2 \times \frac{\mathcal{P} \times \mathcal{R}}{\mathcal{P} + \mathcal{R}} $
  - Symbol Explanation:
    - $ \mathcal{W}(y_{q}) $: The set of words in the ground truth answer $ y_{q} $.
    - $ \mathcal{W}(\hat{Y}) $: The set of words in the model's predicted answer $ \hat{Y} $.
    - $ t $: A token (word) from the set $ \mathcal{W}(y_{q}) $.
    - $ \mathbf{1}[\cdot] $: The indicator function, which is 1 if the condition inside is true, and 0 otherwise.
    - $ \mathcal{U}(\hat{Y}, y_{q}) $: The number of unique words from the ground truth answer that are also present in the predicted answer.
    - $ \vert \mathcal{W}(\hat{Y}) \vert $: The total number of words in the predicted answer.
    - $ \vert \mathcal{W}(y_{q}) \vert $: The total number of words in the ground truth answer.
    - $ \mathcal{P} $: Precision, the fraction of predicted words that are correct.
    - $ \mathcal{R} $: Recall, the fraction of ground truth words that were successfully predicted.
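Both adaptation metrics can be sketched directly from the formulas above. The lowercasing and whitespace tokenization are assumptions, since the text does not specify the normalization, and the function names are hypothetical.

```python
def words(text):
    """Set of lowercased whitespace tokens (normalization is an assumption)."""
    return set(text.lower().split())

def cem(prediction, answer):
    """Cover Exact Match: 1 if the answer string appears inside the prediction."""
    return int(answer.lower() in prediction.lower())

def f1_score(prediction, answer):
    """Word-level F1 using set overlap, following the U, P, R definitions."""
    pred_words, gold_words = words(prediction), words(answer)
    overlap = len(gold_words & pred_words)     # U(Y_hat, y_q)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)      # U / |W(Y_hat)|
    recall = overlap / len(gold_words)         # U / |W(y_q)|
    return 2 * precision * recall / (precision + recall)

print(cem("The capital is Paris", "Paris"))    # 1
print(f1_score("paris france", "Paris"))       # precision 1/2, recall 1
```

A long answer that happens to contain the ground truth scores 1 on CEM but is penalized on F1 through precision, which is why the two metrics are reported together.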
5.2.2. Knowledge Retention Metrics
For knowledge retention, the paper uses a variety of existing benchmarks, each with its own established evaluation protocols and metrics (e.g., accuracy, specific scores for hallucination). The paper states it follows the settings of VLMEvalKit (Duan et al., 2024) for these evaluations. While specific formulas for each of these 12 benchmarks are not provided in the paper, their conceptual definitions (as described in Section 5.1.2) explain what aspects of model performance they quantify. The results are typically reported as scores (e.g., percentage accuracy) where higher is better.
5.3. Baselines
KORE was compared against a comprehensive set of baseline methods:
- Full Fine-Tuning (Full-FT): Updates all model weights on the new EvOKE dataset. Represents the most direct but computationally expensive approach.
- LoRA (Hu et al., 2022): A Parameter-Efficient Fine-Tuning (PEFT) method that updates a small number of low-rank matrices while freezing the original weights. It is resource-efficient but can suffer from catastrophic forgetting.
- Replay: Implemented via LoRA, this method mixes a fixed quantity (10% of EvOKE's size) of randomly sampled data from the LMM's pre-training corpus with the new EvOKE data. This is a common continual learning strategy to mitigate catastrophic forgetting.
- EWC (Elastic Weight Consolidation) (Kirkpatrick et al., 2017): A parameter regularization method from continual learning that slows down updates to parameters deemed important for prior tasks by imposing a quadratic constraint based on the Fisher Information Matrix.
- LwF (Learning without Forgetting) (Li & Hoiem, 2017b): Another continual learning method that uses knowledge distillation to preserve old knowledge by ensuring the new model's predictions on new data align with the old model's outputs.
- MoELoRA (Luo et al., 2024): Combines Mixture-of-Experts (MoE) with contrastive learning for PEFT, specializing experts for different data types and using contrastive objectives to guide collaboration, aiming to reduce catastrophic forgetting.
- O-LoRA (Wang et al., 2023): An orthogonal subspace-based method for continual learning that allocates independent, orthogonal parameter subspaces for each task, constraining updates to prevent interference and mitigate catastrophic forgetting.
- SEFE (Superficial and Essential Forgetting Eliminator) (Chen et al., 2025): A method that tackles multimodal catastrophic forgetting by separately addressing superficial forgetting of style and essential forgetting of knowledge through a tailored training strategy.

These baselines cover a range of fine-tuning (full, PEFT) and continual learning approaches, making the comparison comprehensive.
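For contrast with KORE's projection-based constraint, the quadratic penalty used by the EWC baseline can be written out. This is the standard form from Kirkpatrick et al. (2017); the symbols $\lambda$ (penalty strength), $F_i$ (diagonal Fisher information for parameter $i$), and $\theta^{*}$ (parameters before fine-tuning) are not defined in the text above and are introduced here only for illustration.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{new}}(\theta)
  \;+\; \frac{\lambda}{2} \sum_{i} F_{i} \,\bigl( \theta_{i} - \theta_{i}^{*} \bigr)^{2}
```

EWC discourages movement of important parameters through a soft penalty in the loss, whereas KORE-CONSTRAINT restricts the update direction itself by fixing $A$ in the approximate null space of the activation covariance.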
5.4. Training Parameters
The following hyperparameter settings used for model training are taken from Table 7 of the original paper (n/a marks a setting not listed for that model):
| Hyperparameter | LLaVA-v1.5 (7B) | LLaVA-v1.5 (13B) | Qwen2.5-VL (7B) |
|---|---|---|---|
| Rank | 235 | 235 | 274 |
| Optimizer | AdamW | AdamW | AdamW |
| Deepspeed | Zero3 | Zero3 | Zero3 |
| Epochs | 6 | 6 | 6 |
| Vision Select Layer | -2 | -2 | n/a |
| Image Max Pixels | n/a | n/a | 262144 |
| Weight Decay | 0 | 0 | n/a |
| Grad Accum Steps | n/a | n/a | 8 |
| Warmup Ratio | 0.03 | 0.03 | 0.1 |
| LR Schedule | cosine decay | cosine decay | cosine decay |
| Learning Rate | $ 2 \times 10^{-4} $ | $ 2 \times 10^{-4} $ | $ 2 \times 10^{-4} $ |
| Batch Size | 54 | 32 | 24 |
Explanation of Key Parameters:
- Rank: Refers to the rank parameter in LoRA (Low-Rank Adaptation), which determines the dimensionality of the low-rank matrices used for fine-tuning. A higher rank means more trainable parameters.
- Optimizer AdamW: A variant of the Adam optimizer that includes weight decay regularization to prevent overfitting.
- Deepspeed Zero3: A memory optimization technique from DeepSpeed that partitions optimizer states, gradients, and parameters across GPUs, enabling the training of very large models.
- Epochs: The number of times the entire training dataset is passed forward and backward through the neural network.
- Vision Select Layer: Specifies which layer in the vision component of the LMM is targeted or selected for certain operations (e.g., feature extraction or LoRA application). "-2" often means the second to last layer.
- Weight Decay: A regularization technique that penalizes large weights to prevent overfitting. A value of 0 means no weight decay is applied for LLaVA models.
- Warmup Ratio: The proportion of training steps during which the learning rate linearly increases from a small value to the initial learning rate.
- LR Schedule cosine decay: A learning rate scheduler that gradually decreases the learning rate following a cosine curve during training.
- Learning Rate: The step size at each iteration while moving towards a minimum of the loss function.
- Batch Size: The number of training samples utilized in one iteration.
- Image Max Pixels: The maximum number of pixels an image can have, likely for Qwen2.5-VL, indicating a constraint on input image size.
- Grad Accum Steps: Gradient Accumulation Steps. The number of mini-batches over which gradients are accumulated before a parameter update is performed. This effectively increases the effective batch size without requiring more GPU memory.
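How the warmup ratio and cosine decay interact can be sketched as a small schedule function. This is one common formulation (linear warmup from zero, then half-cosine decay to zero), assumed here for illustration; the exact scheduler used in the paper's training code is not given in the text, and `lr_at` and the 1,000-step horizon are hypothetical.

```python
import math

def lr_at(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total_steps = 1000
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]
peak_step = max(range(len(schedule)), key=schedule.__getitem__)
print(peak_step, schedule[peak_step], schedule[-1])
```

With a warmup ratio of 0.03 the peak learning rate is reached after about 3% of the steps; the Qwen2.5-VL setting of 0.1 would spend roughly 10% of training in warmup instead.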
5.5. Experiment Resources
All training experiments were conducted using 4 NVIDIA H100 GPUs (each with 96 GiB memory).
All evaluation experiments were performed on systems equipped with 4 NVIDIA A100 PCIe GPUs (each with 40 GiB memory).
6. Results & Analysis
This section presents and analyzes the experimental results, demonstrating the effectiveness of KORE in both knowledge adaptation and knowledge retention across various LMMs and settings.
6.1. Core Results Analysis
The main results comparing KORE with eight baseline methods on LLaVA-v1.5 (7B) are presented in Table 1. Performance is measured by CEM and F1-Score on EvOKE (for knowledge adaptation) and an average score across multiple knowledge retention benchmarks (COM, OCR, M-DIS, INS, M-IDU, MAT, HAL). The Avg score is the mean of the separate averages for adaptation and retention.
The following are the results from [Table 1] of the original paper:
| Method | #Params | EvOKE CEM ↑ | EvOKE F1 ↑ | COM ↑ | OCR ↑ | M-DIS ↑ | INS ↑ | M-IDU ↑ | MAT ↑ | HAL ↑ | Avg ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5 (7B) | — | — | — | 65.61 | 45.59 | 49.22 | 66.33 | 26.37 | 19.33 | 54.32 | — |
| Full-FT | 6,759M | 18.02 | 15.17 | 43.55 | 21.55 | 45.67 | 25.25 | 13.03 | 18.32 | 16.09 | 23.23 |
| LoRA | 340M | 15.23 | 18.31 | 48.96 | 27.01 | 43.79 | 29.66 | 13.70 | 18.02 | 41.38 | 24.28 |
| Replay | 340M | 11.36 | 17.98 | 59.72 | 37.98 | 48.64 | 62.33 | 19.31 | 19.17 | 51.67 | 28.68 |
| EWC | 340M | 15.49 | 19.42 | 49.42 | 32.88 | 45.46 | 29.79 | 13.36 | 18.00 | 43.50 | 25.33 |
| LwF | 340M | 14.58 | 19.99 | 53.14 | 28.77 | 43.41 | 36.19 | 13.68 | 18.22 | 44.18 | 25.61 |
| MoELoRA | 340M | 6.45 | 12.20 | 60.79 | 38.79 | 48.27 | 35.03 | 17.85 | 19.79 | 49.99 | 23.98 |
| O-LoRA | 340M | 6.44 | 12.08 | 61.47 | 40.91 | 48.07 | 34.85 | 17.28 | 19.87 | 51.12 | 24.17 |
| SEFE | 340M | 13.38 | 16.88 | 42.06 | 20.43 | 40.17 | 17.73 | 13.25 | 18.20 | 39.30 | 22.54 |
| KORE (r=235) | 340M | 30.65 | 41.26 | 52.41 | 40.98 | 48.68 | 38.54 | 16.58 | 18.59 | 51.75 | 37.09 |
| KORE (r=256) | 369M | 31.05 | 41.32 | 52.48 | 39.96 | 48.96 | 60.02 | 23.18 | 18.09 | 51.50 | 39.11 |
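The Avg column can be reproduced from the other columns of Table 1. The sketch below assumes that the adaptation average is the mean of CEM and F1 and that the retention average is the mean of the seven dimension scores; under that assumption it recovers the reported values for KORE (r=235) and Replay. The function name `holistic_avg` is hypothetical.

```python
from statistics import mean

def holistic_avg(cem, f1, retention_dims):
    """Return (adaptation avg, retention avg, overall Avg) for one table row."""
    ka = mean([cem, f1])          # knowledge adaptation: EvOKE CEM and F1
    kr = mean(retention_dims)     # knowledge retention: COM, OCR, M-DIS, INS, M-IDU, MAT, HAL
    return ka, kr, mean([ka, kr])

kore = holistic_avg(30.65, 41.26, [52.41, 40.98, 48.68, 38.54, 16.58, 18.59, 51.75])
replay = holistic_avg(11.36, 17.98, [59.72, 37.98, 48.64, 62.33, 19.31, 19.17, 51.67])

print(round(kore[2], 2), round(replay[2], 2))   # 37.09 28.68
```

Because adaptation and retention are each averaged first, the two EvOKE metrics carry as much weight in Avg as all seven retention dimensions combined.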
Obs 1: KORE enables accurate adaptation for effectively injecting new knowledge.
KORE (r=235), the standard KORE configuration, significantly outperforms all baselines on EvOKE (knowledge adaptation). It achieves CEM of 30.65 and F1-Score of 41.26. This represents an improvement of 12.63 in CEM and 21.27 in F1-Score over the best baseline (Full-FT for CEM and LwF for F1-Score). Notably, KORE's F1-Score is more than double that of LoRA (18.31), highlighting the effectiveness of KORE-AUGMENTATION in enabling the model to truly learn and generalize new knowledge, rather than just memorizing it.
Obs 2: KORE enables powerful retention for effectively preserving old knowledge.
KORE (r=235) shows strong performance in knowledge retention across various benchmarks. It outperforms LoRA across all knowledge retention tests and achieves top scores on OCR, M-DIS, and HAL dimensions, while placing second on INS. Although its performance on INS and M-IDU is suboptimal compared to Replay (which mixes old data), the paper attributes this to the number of trainable parameters (rank) and the source of the covariance matrix. When rank is increased to 256, KORE (r=256) significantly improves on INS (60.02, trailing Replay by only 2.31) and M-IDU (23.18, outperforming Replay by 3.87), demonstrating that with appropriate configuration, KORE can achieve powerful retention even in these areas.
Obs 3: KORE achieves remarkable holistic performance by harmonizing the dual objectives of knowledge injection.
The Avg score, which balances knowledge adaptation and retention, clearly shows KORE's superiority. KORE (r=235) achieves an Avg score of 37.09, an 8.41 improvement over the strongest baseline (Replay at 28.68). KORE (r=256) further boosts this to 39.11. This demonstrates KORE's ability to effectively manage the inherent trade-off between injecting new knowledge and preserving old knowledge, which is a critical challenge in continual learning for LMMs.
6.2. Detailed Results on Adaptation & Retention
6.2.1. Fine-grained Knowledge Adaptation
KORE's performance was further analyzed across 20 fine-grained News and Entity types from EvOKE (Figure 5).
As seen in Figure 5, KORE consistently outperforms all baselines across a wide spectrum of fine-grained knowledge types. This highlights KORE-AUGMENTATION's ability to build comprehensive and structured knowledge, allowing the model to adapt robustly to diverse new information.
The following are the results from [Table 9] of the original paper:
| Method | News Avg CEM ↑ | News Avg F1 ↑ | PO CEM ↑ | PO F1 ↑ | SP CEM ↑ | SP F1 ↑ | BU CEM ↑ | BU F1 ↑ | HE CEM ↑ | HE F1 ↑ | Entity Avg CEM ↑ | Entity Avg F1 ↑ | CE CEM ↑ | CE F1 ↑ | FI CEM ↑ | FI F1 ↑ | AL CEM ↑ | AL F1 ↑ | WR CEM ↑ | WR F1 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full-FT | 21.35 | 16.34 | 12.92 | 10.99 | 22.49 | 20.88 | 27.31 | 20.95 | 19.84 | 16.47 | 14.37 | 13.88 | 13.11 | 16.93 | 12.39 | 13.16 | 12.17 | 7.66 | 20.34 | 8.43 | |
| LoRA | 17.72 | 19.42 | 10.54 | 12.96 | 19.11 | 21.50 | 20.66 | 24.03 | 17.81 | 23.76 | 12.51 | 17.09 | 12.20 | 21.19 | 10.57 | 15.82 | 10.72 | 8.72 | 18.64 | 12.94 | |
| Replay | 13.98 | 19.43 | 7.61 | 13.16 | 15.96 | 20.69 | 16.05 | 22.40 | 15.38 | 24.21 | 8.48 | 16.39 | 9.40 | 18.78 | 10.34 | 15.60 | 3.77 | 10.79 | 4.55 | 8.23 | |
| EWC | 17.86 | 21.10 | 10.45 | 14.81 | 19.83 | 23.02 | 19.00 | 24.57 | 17.41 | 23.88 | 12.88 | 17.58 | 14.53 | 22.07 | 12.16 | 16.91 | 10.72 | 8.13 | 15.25 | 17.69 | |
| LwF | 17.05 | 21.43 | 9.62 | 13.99 | 19.83 | 23.66 | 18.63 | 25.82 | 19.03 | 26.20 | 11.88 | 18.40 | 12.45 | 21.64 | 12.39 | 17.01 | 9.28 | 11.11 | 10.17 | 17.10 | |
| MoELoRA | 9.23 | 14.86 | 3.39 | 8.72 | 6.77 | 11.77 | 12.36 | 18.92 | 10.53 | 20.60 | 3.40 | 9.28 | 2.95 | 10.32 | 4.43 | 8.96 | 3.19 | 5.22 | 10.17 | 14.07 | |
| O-LoRA | 9.21 | 14.68 | 3.67 | 8.52 | 7.01 | 12.23 | 12.55 | 18.98 | 11.74 | 20.68 | 3.40 | 9.22 | 3.10 | 10.51 | 4.20 | 8.28 | 3.19 | 5.35 | 8.47 | 12.37 | |
| SEFE | 16.66 | 18.44 | 10.82 | 12.64 | 17.78 | 20.92 | 20.30 | 23.23 | 17.00 | 21.55 | 9.79 | 15.18 | 10.77 | 20.13 | 9.09 | 12.01 | 5.51 | 7.47 | 13.56 | 13.87 | |
| KORE | 34.74 | 42.96 | 23.83 | 32.31 | 46.19 | 50.38 | 34.69 | 45.74 | 33.20 | 45.23 | 26.17 | 39.39 | 27.79 | 42.61 | 26.93 | 34.05 | 16.52 | 29.54 | 28.81 | 43.05 | |
Obs 4: KORE demonstrates superior performance across a wide spectrum of fine-grained knowledge.
Table 9 (Appendix E.1) provides detailed numerical results confirming that KORE consistently achieves the highest CEM and F1-Score across all fine-grained News categories (Politics, Sports, Business, Health) and Entity categories (Celebrity, Film, Album, Written Work). This comprehensive superiority underscores the robustness and effectiveness of KORE-AUGMENTATION in enabling deep and generalized learning of new knowledge.
6.2.2. Detailed Knowledge Retention
Table 2 provides a detailed breakdown of knowledge retention performance for each of the 12 benchmarks.
The following are the results from [Table 2] of the original paper:
| Method | COM: MME ↑ | COM: MMBench ↑ | OCR: SEEDB2P ↑ | OCR: OCRVQA ↑ | M-DIS: SQA ↑ | M-DIS: MMMU ↑ | INS: MIAB ↑ | M-IDU: MMDU ↑ | MAT: MathVista ↑ | MAT: MathVision ↑ | HAL: POPE ↑ | HAL: HallusionBench ↑ | Avg ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5 (7B) | 66.63 | 64.60 | 38.78 | 52.41 | 69.83 | 28.60 | 66.33 | 26.37 | 25.50 | 13.16 | 86.87 | 21.76 | 46.74 |
| Full-FT | 34.17 | 52.92 | 31.44 | 11.65 | 67.13 | 24.20 | 25.25 | 13.03 | 24.70 | 11.94 | 74.22 | 9.27 | 31.66 |
| LoRA | 44.06 | 53.87 | 30.22 | 23.80 | 66.18 | 21.40 | 29.66 | 13.70 | 23.20 | 12.83 | 73.97 | 8.78 | 33.47 |
| Replay | 58.96 | 60.48 | 38.34 | 37.73 | 68.77 | 28.50 | 62.33 | 19.31 | 25.20 | 13.13 | 85.44 | 17.90 | 43.00 |
| EWC | 48.57 | 50.26 | 33.60 | 32.16 | 65.71 | 25.20 | 29.79 | 13.36 | 23.30 | 12.76 | 76.22 | 10.77 | 35.14 |
| LwF | 50.87 | 55.41 | 32.02 | 25.52 | 66.21 | 20.60 | 36.19 | 13.68 | 24.40 | 12.04 | 79.23 | 9.13 | 35.44 |
| MoELoRA | 58.26 | 63.32 | 37.42 | 40.17 | 69.04 | 27.50 | 35.03 | 17.85 | 27.80 | 11.78 | 80.70 | 19.29 | 40.51 |
| O-LoRA | 60.30 | 62.63 | 37.90 | 43.91 | 68.84 | 27.30 | 34.85 | 17.28 | 28.20 | 11.55 | 81.46 | 20.78 | 41.25 |
| SEFE | 36.10 | 48.02 | 22.79 | 18.07 | 65.03 | 15.30 | 17.73 | 13.25 | 26.00 | 10.39 | 72.81 | 5.79 | 29.27 |
| KORE (r=235) | 49.84 | 54.98 | 37.73 | 44.24 | 68.06 | 29.30 | 38.54 | 16.58 | 25.10 | 12.09 | 80.99 | 22.51 | 40.00 |
| KORE (r=256) | 50.06 | 54.90 | 36.89 | 43.03 | 68.51 | 29.40 | 60.02 | 23.18 | 24.70 | 11.48 | 80.77 | 22.23 | 42.10 |
Obs 5: KORE achieves competitive knowledge retention.
KORE (r=235) outperforms LoRA in overall retention (40.00 vs. 33.47 Avg). It also surpasses several continual learning methods like EWC (35.14), LwF (35.44), and SEFE (29.27). KORE achieves top scores on OCRVQA (44.24), MMMU (29.30, within M-DIS), and HallusionBench (22.51). When its rank is increased to 256, KORE (r=256) closely matches or even exceeds Replay (the strongest retention baseline) on INS (60.02 vs. 62.33) and M-IDU (23.18 vs. 19.31), demonstrating its strong retention capabilities. The KORE-CONSTRAINT component effectively minimizes interference with previous knowledge.
6.2.3. Specific Knowledge-Oriented Constraints
The paper investigates whether KORE can preserve specific knowledge without compromising other abilities by constructing specific constraints (sampling 256 data per benchmark across four dimensions).
The following are the results from [Table 3] of the original paper:
| Method | K.A ↑ | K.R ↑ | Avg ↑ |
|---|---|---|---|
| KORE | 35.96 | 38.22 | 37.09 |
| $ \text{KORE}_{\text{MME}} $ | 34.46 | 43.16 | 38.81 |
| $ \text{KORE}_{\text{OCRVQA}} $ | 34.85 | 42.21 | 38.53 |
| $ \text{KORE}_{\text{MathT}} $ | 35.20 | 42.87 | 39.03 |
| $ \text{KORE}_{\text{HallB}} $ | 34.96 | 42.09 | 38.52 |
Obs 6: Specific constraints enhance knowledge retention and overall performance.
Table 3 shows that applying specific knowledge-oriented constraints (e.g., $ \text{KORE}_{\text{MME}} $, $ \text{KORE}_{\text{OCRVQA}} $, $ \text{KORE}_{\text{MathT}} $, $ \text{KORE}_{\text{HallB}} $) leads to a slight reduction in Knowledge Adaptation (K.A) scores but a substantial improvement in Knowledge Retention (K.R) and overall Avg performance compared to the general KORE (r=235) setup. For instance, $ \text{KORE}_{\text{MME}} $ increases K.R from 38.22 to 43.16 and Avg from 37.09 to 38.81.
Figure 6 visually confirms that these specific constraints enhance targeted knowledge retention, with $ \text{KORE}_{\text{MME}} $ showing a 7.17 gain on MME. This demonstrates KORE's flexibility for tailored knowledge preservation according to specific needs.
6.3. Various LMM Scales and Architectures
The paper further evaluates KORE's universality and robustness on larger (LLaVA-v1.5-13B) and architecturally distinct (Qwen2.5-VL-7B) LMMs.
The following are the results from [Table 4] of the original paper:
| Model | Method | EvOKE CEM ↑ | EvOKE F1 ↑ | COM ↑ | OCR ↑ | M-DIS ↑ | INS ↑ | M-IDU ↑ | MAT ↑ | HAL ↑ | Avg ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5 (13B) | Vanilla | — | — | 66.86 | 51.12 | 52.70 | 66.04 | 33.93 | 19.64 | 56.77 | — |
| LLaVA-v1.5 (13B) | LoRA | 16.26 | 22.83 | 60.57 | 32.58 | 43.72 | 23.26 | 17.43 | 15.82 | 38.08 | 25.21 |
| LLaVA-v1.5 (13B) | Replay | 12.05 | 20.21 | 65.81 | 47.51 | 48.42 | 61.04 | 24.62 | 19.55 | 54.16 | 30.70 |
| LLaVA-v1.5 (13B) | KORE | 32.89 | 44.47 | 59.35 | 45.96 | 51.39 | 65.10 | 26.84 | 20.31 | 40.52 | 41.44 |
| Qwen2.5-VL (7B) | Vanilla | — | — | 81.18 | 70.32 | 65.35 | 78.46 | 61.25 | 47.69 | 66.96 | — |
| Qwen2.5-VL (7B) | LoRA | 14.56 | 14.01 | 52.54 | 64.54 | 22.35 | 21.39 | 23.25 | 13.52 | 41.38 | 24.21 |
| Qwen2.5-VL (7B) | Replay | 11.73 | 18.51 | 78.54 | 69.17 | 65.26 | 70.20 | 50.72 | 42.74 | 67.48 | 39.28 |
| Qwen2.5-VL (7B) | KORE | 22.91 | 31.36 | 56.60 | 67.74 | 65.48 | 70.51 | 45.02 | 43.72 | 58.57 | 42.68 |
Obs 7: KORE shows enhanced superiority on a larger-scale LMM.
On LLaVA-v1.5 (13B), KORE achieves CEM of 32.89 and F1-Score of 44.47, significantly surpassing LoRA (16.26 CEM, 22.83 F1-Score). It also demonstrates strong knowledge retention, achieving an overall Avg score of 41.44, which is a 10.74 improvement over Replay (30.70). This confirms KORE's strong potential for larger LMMs.
Obs 8: KORE's effectiveness is not architecture-specific.
On Qwen2.5-VL (7B), KORE again outperforms LoRA by a large margin (gains of 8.35 CEM and 17.35 F1-Score on EvOKE, from 14.56 to 22.91 and from 14.01 to 31.36). It also surpasses Replay with an Avg score of 42.68 (vs. 39.28). The improvement margins are slightly smaller compared to LLaVA-v1.5, which the paper attributes to Qwen2.5-VL's already robust knowledge system (honed via three-stage training), reducing the marginal gains from knowledge injection. Nevertheless, KORE maintains superior performance across different LMM architectures, showcasing its universality.
6.4. Ablation Studies
Ablation studies were conducted to validate the effectiveness of KORE's design components: rank, W/o Augmentation, W/o Constraint, and W/o Frozen Matrix A.
The following are the results from [Table 5] of the original paper:
| Setting | EvOKE CEM ↑ | EvOKE F1 ↑ | COM ↑ | OCR ↑ | M-DIS ↑ | INS ↑ | M-IDU ↑ | MAT ↑ | HAL ↑ | Avg ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| KORE | 30.65 | 41.26 | 52.41 | 40.98 | 48.68 | 38.54 | 16.58 | 18.59 | 51.75 | 37.09 |
| W/o Augmentation | 10.83 | 18.31 | 59.96 | 40.42 | 47.13 | 32.53 | 16.00 | 19.71 | 49.50 | 26.23 |
| W/o Constraint | 33.93 | 43.71 | 46.39 | 32.38 | 46.31 | 32.70 | 15.38 | 19.12 | 46.47 | 36.46 |
| W/o Frozen Matrix A | 31.97 | 41.72 | 50.73 | 39.56 | 48.37 | 35.30 | 16.44 | 19.07 | 49.91 | 36.95 |
Obs 9: Larger rank enhances KORE's performance.
Figure 7 and Table 15 (Appendix E.4.1) show that KORE's performance in both knowledge adaptation and retention consistently increases with a higher rank (number of trainable parameters). Even at a lower rank, KORE (Avg 31.81) still surpasses Replay (28.68) while using less than half of Replay's parameters. This indicates that increasing the trainable parameter scale activates stronger capabilities in KORE.
Obs 10: Ablation studies reveal the effectiveness of KORE's design.
Table 5 clearly validates the contribution of each KORE component:
- W/o Augmentation: Removing KORE-AUGMENTATION (K.A: CEM 10.83, F1 18.31) is particularly detrimental to knowledge adaptation, causing drops of 19.82 CEM and 22.95 F1-Score relative to full KORE. This emphasizes the critical role of knowledge-oriented augmentations in enabling accurate learning of new information.
- W/o Constraint: Removing KORE-CONSTRAINT yields an Avg score of 36.46, only slightly below KORE (37.09), but it substantially hurts the knowledge retention benchmarks (e.g., COM 46.39 vs. 52.41, OCR 32.38 vs. 40.98). This confirms that KORE-CONSTRAINT is essential for preserving old knowledge. Interestingly, it slightly improves the K.A metrics (e.g., CEM 33.93 vs. 30.65), suggesting a trade-off in which the retention constraint somewhat limits adaptation.
- W/o Frozen Matrix A: Leaving matrix A trainable instead of frozen also slightly impairs knowledge retention, resulting in an Avg score of 36.95. This indicates that freezing A (which lies in the null space) is important for robust knowledge preservation.

The results from Table 18 (Appendix E.4.2) further reinforce that modifying KORE's design leads to an overall degradation in knowledge retention performance, underscoring the efficacy of KORE's comprehensive design. Table 19 (Appendix E.4.2) shows that W/o Constraint yields superior knowledge adaptation across fine-grained knowledge, which stems from KORE-AUGMENTATION's profound augmentation without the mitigating effect of the constraint.
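The Avg column of Table 5 can be checked from the per-benchmark columns. The sketch below assumes that K.A is the mean of CEM and F1, that K.R is the mean of the seven retention benchmarks, and that Avg is the mean of K.A and K.R. This aggregation rule is an assumption inferred from the numbers, not a formula quoted from the paper, and the names `TABLE_5` and `aggregate` are illustrative.

```python
# Scores copied from Table 5: (CEM, F1) on EvOKE, then the seven retention
# benchmarks in column order COM, OCR, M-DIS, INS, M-IDU, MAT, HAL.
TABLE_5 = {
    "KORE": ((30.65, 41.26), (52.41, 40.98, 48.68, 38.54, 16.58, 18.59, 51.75)),
    "W/o Augmentation": ((10.83, 18.31), (59.96, 40.42, 47.13, 32.53, 16.00, 19.71, 49.50)),
    "W/o Constraint": ((33.93, 43.71), (46.39, 32.38, 46.31, 32.70, 15.38, 19.12, 46.47)),
    "W/o Frozen Matrix A": ((31.97, 41.72), (50.73, 39.56, 48.37, 35.30, 16.44, 19.07, 49.91)),
}


def aggregate(adaptation, retention):
    """Return (K.A, K.R, Avg) under the assumed mean-of-means rule."""
    k_a = sum(adaptation) / len(adaptation)   # knowledge adaptation
    k_r = sum(retention) / len(retention)     # knowledge retention
    return k_a, k_r, (k_a + k_r) / 2


for name, (adaptation, retention) in TABLE_5.items():
    k_a, k_r, avg = aggregate(adaptation, retention)
    print(f"{name:20s} K.A={k_a:.2f}  K.R={k_r:.2f}  Avg={avg:.2f}")
```

Under this rule the computed Avg agrees with the reported Avg column for all four settings to within rounding, and it makes the trade-off explicit: removing the constraint raises K.A while lowering K.R.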
6.5. Comparison with General Augmentation Methods
This section validates the claim that KORE-AUGMENTATION is superior to general augmentation methods (Section 3.1).
The following are the results from [Table 6] of the original paper:
| Method | K.A ↑ | K.R ↑ | Avg ↑ |
| KORE-AUGMENTATION | 38.82 | 35.78 | 36.46 |
| Augmentation for Text | | | |
| Knowledge-Aware | 20.29 | 34.86 | 27.38 |
| Knowledge-Agnostic | 15.60 | 35.71 | 25.49 |
| Augmentation for Images | | | |
| Knowledge-Aware | 18.33 | 34.02 | 25.86 |
| Knowledge-Agnostic | 18.33 | 32.09 | 25.25 |
Obs 11: KORE-AUGMENTATION is superior to general augmentation methods.
Table 6 compares KORE-AUGMENTATION with generic text augmentation (knowledge-aware and knowledge-agnostic) and generic image augmentation (knowledge-aware and knowledge-agnostic). The KORE-AUGMENTATION row appears to correspond to augmentation used without KORE-CONSTRAINT: its K.A of 38.82 equals the mean of the W/o Constraint CEM (33.93) and F1 (43.71) in Table 5, and its Avg of 36.46 matches that row, whereas full KORE scores 35.96 K.A and 38.22 K.R in Table 3. KORE-AUGMENTATION outperforms all general augmentation methods on K.A, K.R, and Avg. The K.A gain is large, 18.53 points over the strongest baseline (20.29 for knowledge-aware text augmentation), while the K.R gain is marginal (35.78 vs. 35.71 for knowledge-agnostic text augmentation). This confirms that KORE-AUGMENTATION's profound and structured approach to augmentation is far more effective for adaptation than superficial variations.
Tables 20 and 21 (Appendix E.5) detail this further, showing that KORE-AUGMENTATION leads in both knowledge retention and adaptation across fine-grained knowledge types.
6.6. Convergence Comparison
The paper provides training loss curves (Figure 10) to compare the convergence behavior of various methods.
As seen in Figure 10, the training loss curves on EvOKE for Full-FT, LoRA, EWC, O-LoRA, SEFE, and KORE are presented. It's clarified that KORE is trained on KORE-74K (a larger, augmented dataset), while others train on the original EvOKE dataset.
- O-LoRA and SEFE notably fail to fit the EvOKE dataset, exhibiting high training loss.
- LoRA, EWC, and Full-FT converge to very low loss values, successfully fitting the EvOKE dataset. However, their poor generalization performance in Table 1 suggests overfitting to the smaller EvOKE dataset.
- KORE shows a rapid decrease in loss during the first epoch due to its larger KORE-74K dataset and efficient learning. Crucially, despite converging effectively on its larger training set, KORE also demonstrates strong generalization capabilities for novel knowledge, unlike the overfitting observed in LoRA, EWC, and Full-FT. This implies that KORE's structured augmentation leads to more meaningful learning.
6.7. Case Studies
The paper includes case studies (Figure 11 and Figure 12) to provide qualitative insights into KORE's performance.
As seen in Figure 11 and Figure 12, KORE demonstrates superior qualitative performance in complex scenarios. For instance, in a "News" case study (Figure 11) about the Nobel Prize in Physics, KORE (on LLaVA-v1.5-7B) provides a more accurate and comprehensive answer (CEM 1, F1 1) compared to LoRA (0 CEM, 0 F1) and Replay (0 CEM, 0 F1). Similarly, in an "Entity" case study (Figure 12) about the Bugatti Tourbillon's production limit, KORE accurately states the limit (99 units), while LoRA and Replay provide incorrect or less specific answers. These qualitative results align with the quantitative findings, illustrating KORE's ability to inject new knowledge accurately and retain previous knowledge effectively across different LMMs.
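To make the per-example scores in these case studies concrete, here is a minimal sketch of how CEM and F1 could be computed for a single answer. It assumes CEM (cover exact match) is 1 when the normalized ground truth appears as a token span inside the prediction, and that F1 is the usual token-overlap F1. The normalization rule and the function names `tokens`, `cem`, and `f1` are illustrative assumptions, not the paper's evaluation code, and the example strings are hypothetical rather than taken from the figures.

```python
import string
from collections import Counter


def tokens(text):
    """Lowercase, strip ASCII punctuation, split on whitespace (assumed normalization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def cem(prediction, ground_truth):
    """1 if the ground-truth tokens appear contiguously in the prediction, else 0."""
    pred = " " + " ".join(tokens(prediction)) + " "
    gold = " " + " ".join(tokens(ground_truth)) + " "
    return int(gold in pred)


def f1(prediction, ground_truth):
    """Token-level F1 between prediction and ground truth."""
    pred, gold = tokens(prediction), tokens(ground_truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers to a production-limit question whose ground truth is "99 units".
print(cem("99 units.", "99 units"), f1("99 units.", "99 units"))
print(cem("Production is limited to 250 units", "99 units"),
      f1("Production is limited to 250 units", "99 units"))
```

A prediction that shares only a generic word with the ground truth still earns partial F1 but no CEM, which is why the two metrics are reported together.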
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces KORE (KnOwledge-oRientEd augmentations and constraints), a novel and synergistic method designed to effectively inject new knowledge into Large Multimodal Models (LMMs) while simultaneously preserving their vast existing knowledge. KORE tackles the critical trade-off between knowledge adaptation and knowledge retention, a long-standing challenge in continual learning for LMMs.
The method's success stems from two key, inter-dependent components:
- KORE-AUGMENTATION: This component automatically transforms individual knowledge items into a profound and structured format, such as multi-round dialogues and diverse instruction tasks. This knowledge-oriented augmentation strategy ensures that the LMM learns new information accurately, deeply, and in a generalized manner, moving beyond superficial data memorization.
- KORE-CONSTRAINT: This component safeguards existing knowledge by leveraging the covariance matrix of the LMM's linear layer activations to represent prior knowledge patterns. It then initializes a LoRA adapter by projecting the original weights into the null space of this covariance matrix, defining a fine-tuning direction that minimally interferes with previously acquired knowledge. This mechanism enables powerful knowledge retention and effectively mitigates catastrophic forgetting.

Extensive experimental validation on various LMMs (LLaVA-v1.5-7B, LLaVA-v1.5-13B, Qwen2.5-VL-7B) demonstrates KORE's superior performance in new knowledge injection and its efficacy in mitigating catastrophic forgetting, outperforming numerous state-of-the-art baselines. KORE also exhibits universality across different LMM scales and architectures. Furthermore, its capability for specific knowledge-oriented constraints allows for tailored knowledge preservation, offering high flexibility for specialized applications.
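The null-space idea behind KORE-CONSTRAINT can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the paper's implementation: the covariance is taken as the uncentered product of old-knowledge activations, the approximate null space is spanned by the eigenvectors with the smallest eigenvalues, and the adapter factors B and A come from a truncated SVD of the projected weight. The function name `null_space_adapter_init` and the parameters `rank` and `null_dim` are hypothetical.

```python
import numpy as np


def null_space_adapter_init(W, X, rank, null_dim):
    """Sketch of a null-space adapter initialization.

    W: (d_out, d_in) weight of a linear layer.
    X: (d_in, n) activations collected on old-knowledge samples.
    Returns B (d_out, rank) and A (rank, d_in) with A X close to zero.
    """
    C = X @ X.T                               # uncentered activation covariance
    _, eigvecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    U_null = eigvecs[:, :null_dim]            # directions old activations barely use
    W_proj = W @ U_null @ U_null.T            # weight projected into that subspace
    U, S, Vt = np.linalg.svd(W_proj, full_matrices=False)
    B = U[:, :rank] * S[:rank]                # scale left singular vectors
    A = Vt[:rank, :]                          # rows lie in the approximate null space
    return B, A


rng = np.random.default_rng(0)
d_in, d_out, n = 16, 8, 200
# Toy old-knowledge activations that span only 6 of the 16 input dimensions.
X_old = rng.normal(size=(d_in, 6)) @ rng.normal(size=(6, n))
W = rng.normal(size=(d_out, d_in))

B, A = null_space_adapter_init(W, X_old, rank=4, null_dim=10)
x_new = rng.normal(size=(d_in, 1))
print("max |B A x| on old activations:", np.abs(B @ A @ X_old).max())
print("max |B A x| on a new input:   ", np.abs(B @ A @ x_new).max())
```

In this toy setting the adapter's output on old activations is numerically zero while a generic new input still produces a response. If A stays frozen, any update to B keeps B A X_old near zero, which is consistent with the ablation finding that unfreezing A weakens retention.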
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Reliance on GPT-4o for Augmentation: The KORE-AUGMENTATION process heavily relies on GPT-4o for generating dialogues, summaries, and quadruplets. This dependency introduces potential risks of hallucination (generating factually incorrect or nonsensical information) into the augmented data. Future work could explore methods to reduce or verify the outputs of GPT-4o.
- Scope of Augmentation: The current augmentation primarily focuses on enhancing individual knowledge units. More complex and structured augmentation strategies could be explored, such as leveraging knowledge graphs or knowledge forests (Ji et al., 2021; Chen et al., 2020), potentially in combination with reinforcement learning, to build even richer and more interconnected knowledge structures.
- Computational Cost of Covariance Matrix Extraction: Extracting covariance matrices from all linear layers in LMMs can be computationally expensive. Future work aims to reduce this resource consumption by identifying the most critical layers or subsets of parameters for covariance computation, focusing only on those that are most salient for knowledge representation.
7.3. Personal Insights & Critique
KORE presents a compelling and well-motivated approach to a crucial problem in LMM development. The synergistic design, combining knowledge-oriented augmentation with knowledge-oriented constraints, is a notable innovation.
One of the paper's strengths is its emphasis on moving beyond superficial data augmentation. The "knowledge tree" concept for structuring new information, including multi-round dialogues and various instructional tasks, is intuitively powerful. It aligns with the idea that models need to learn not just facts but also how to reason about and apply them in different contexts. This approach to augmentation could be transferable to other domains where robust and generalized learning from new information is critical, especially in continual learning settings for diverse AI agents.
The KORE-CONSTRAINT component, leveraging null space projection based on covariance matrices, provides a theoretically grounded method for knowledge retention. The empirical verification that covariance matrices indeed capture multimodal knowledge and exhibit task-specific patterns is a significant finding. This could inspire further research into understanding and manipulating internal model representations for more precise control over knowledge preservation. The ability to apply specific knowledge-oriented constraints is also highly practical, offering a customizable solution for real-world applications where certain types of knowledge must be rigorously protected.
However, some aspects invite further consideration:
- Scalability of KORE-AUGMENTATION: While automated, the reliance on GPT-4o for generating large quantities of high-quality, structured data might still present practical scaling challenges for truly vast amounts of evolving knowledge, especially if quality assurance mechanisms are needed to prevent hallucinations. The computational cost of generating KORE-74K using GPT-4o is not explicitly detailed but could be substantial.
- Interpretability of Null Space for Knowledge: While mathematically sound, the mapping between the null space of a covariance matrix and "minimally interfering with previous knowledge" is still somewhat abstract. Further work could explore more interpretable ways to demonstrate what specific aspects of old knowledge are preserved by this method, beyond just benchmark scores.
- Generality of Covariance Matrix Representation: The paper assumes that covariance matrices effectively capture previous knowledge. While empirically validated, the extent to which this holds universally across all types of knowledge (e.g., procedural knowledge vs. factual knowledge) or for all LMM architectures could be a subject of deeper theoretical investigation.
- Trade-offs in Specific Constraints: The ablation study showed that specific constraints slightly reduced K.A scores while boosting K.R. This highlights the inherent trade-off. Future work could explore dynamic weighting mechanisms or adaptive constraint strengths to optimize this balance automatically for different scenarios.

The paper's Ethics Statement is also commendable, acknowledging the potential misuse of knowledge injection for propagating false or biased information. This underscores the need for responsible development and deployment of such powerful AI capabilities.
Overall, KORE provides a significant step forward in enabling LMMs to continuously learn and evolve, addressing a fundamental limitation that hinders their deployment in dynamic environments. Its knowledge-centric approach offers a promising direction for future research in continual learning and LMM adaptation.