Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
TL;DR Summary
The Self-RAG framework enhances large language models' quality and factual accuracy through adaptive retrieval and reflection tokens. Experiments show it significantly outperforms existing models in open-domain QA, reasoning, and fact verification tasks.
Abstract
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
1.2. Authors
The primary authors are Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Their affiliations include the University of Washington, Allen Institute for AI, and IBM Research AI. This indicates a collaborative effort between academia and industry research.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, on October 17, 2023. While arXiv itself is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating cutting-edge research in fields like AI and computer science. Papers often appear on arXiv before undergoing formal peer review for major conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP) or journals, indicating its status as new, impactful research.
1.4. Publication Year
2023
1.5. Abstract
This paper introduces Self-Reflective Retrieval-Augmented Generation (Self-RAG), a novel framework designed to improve the quality and factuality of Large Language Models (LLMs). Despite their advanced capabilities, LLMs frequently produce factually incorrect responses due to their sole reliance on parametric knowledge (information encoded within the model's weights during training). While Retrieval-Augmented Generation (RAG) offers an ad hoc solution by supplementing LLMs with external knowledge, traditional RAG methods often retrieve passages indiscriminately, which can reduce versatility or lead to unhelpful generations.
Self-RAG addresses these limitations by training an arbitrary LLM to adaptively retrieve passages on-demand. It also learns to generate and reflect on both the retrieved passages and its own generated text using special tokens called reflection tokens. These tokens, categorized into retrieval and critique types, enable the LLM to control its behavior during inference, tailoring it to diverse task requirements. The framework trains a single LLM to unify text generation and reflection token prediction.
Experiments on a diverse set of tasks, including open-domain QA, reasoning, and fact verification, demonstrate that Self-RAG (with 7B and 13B parameters) significantly outperforms state-of-the-art LLMs (like ChatGPT) and existing RAG models (like retrieval-augmented Llama2-chat). It shows substantial gains in improving factuality and citation accuracy, particularly for long-form generations.
1.6. Original Source Link
Original Source Link: https://arxiv.org/abs/2310.11511
PDF Link: https://arxiv.org/pdf/2310.11511v1.pdf
Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
Core Problem
The core problem the paper aims to solve is the prevalent issue of factual inaccuracies (also known as hallucinations) in outputs generated by Large Language Models (LLMs). LLMs primarily rely on parametric knowledge, which is the information encoded within their billions of parameters during pre-training. This parametric knowledge can be outdated, incomplete, or incorrectly memorized, leading to responses that sound fluent and confident but are factually wrong.
Importance of the Problem
In the current field of Natural Language Processing (NLP) and Artificial Intelligence (AI), LLMs are being increasingly deployed in applications that demand high accuracy, such as search, education, and content creation. Factual errors undermine trust, can spread misinformation, and are particularly problematic in sensitive domains (e.g., healthcare, finance). Therefore, improving the factuality of LLM outputs is a critical challenge.
Challenges and Gaps in Prior Research
Retrieval-Augmented Generation (RAG) methods have emerged as a prominent approach to mitigate factual errors. RAG works by retrieving relevant external documents and providing them as context to the LLM during generation, allowing the model to ground its responses in up-to-date and verified information. However, prior RAG approaches suffer from several limitations:
- Indiscriminate Retrieval: Many RAG systems retrieve a fixed number of documents regardless of whether retrieval is actually necessary for a given query or if the retrieved passages are truly relevant. This can lead to:
- Diminished Versatility: For tasks not requiring factual grounding (e.g., creative writing), unnecessary retrieval adds overhead and can constrain the model's creativity.
- Low-Quality Generations: Irrelevant or off-topic retrieved passages can "confuse" the LLM, leading to unhelpful or even incorrect responses.
- Lack of Consistency/Attribution: Even when relevant passages are retrieved, LLMs are not always explicitly trained to leverage and follow facts from these passages consistently. The output is not guaranteed to be fully supported by the retrieved sources, and clear citations are often missing or inaccurate.
- Inference Overhead: Some adaptive retrieval methods improve performance but come at the cost of runtime efficiency or rely on proprietary models.
Paper's Entry Point / Innovative Idea
The paper's innovative idea is to introduce Self-Reflective Retrieval-Augmented Generation (Self-RAG). This framework addresses the limitations of prior RAG by integrating adaptive, on-demand retrieval with a novel self-reflection mechanism. The core innovation is to train a single LLM to:
- Adaptively Decide Retrieval: Determine when retrieval is necessary.
- Critique Retrieved Passages: Evaluate the relevance and usefulness of retrieved information.
- Critique its Own Generations: Assess the quality, factuality, and support of its generated text against retrieved evidence.
This self-reflection is achieved through reflection tokens, special tokens generated by the LLM itself, which allow for fine-grained control and customization during inference.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field:
- Novel Self-RAG Framework: Introduction of a new, end-to-end trainable framework that enhances LLM quality and factuality through adaptive retrieval and self-reflection. This framework allows an arbitrary LLM to learn to retrieve, generate, and critique its own outputs.
- Reflection Tokens for Controllability: The design and implementation of reflection tokens (including retrieval tokens and critique tokens) that are generated by the LLM itself. These tokens are seamlessly integrated into the model's vocabulary and generation process, making the LLM controllable during inference. This enables tailoring the model's behavior to diverse task requirements, such as prioritizing factual accuracy or creative expression.
- Cost-Effective Training Methodology: A method for training the generator LLM by incorporating critiques offline from a separate critic model. This significantly reduces training costs compared to computationally intensive methods like Reinforcement Learning from Human Feedback (RLHF), which require an online reward model during training.
- Significant Performance Gains: Empirical demonstration that Self-RAG (7B and 13B parameters) substantially outperforms existing state-of-the-art LLMs (e.g., ChatGPT) and traditional retrieval-augmented models across a diverse suite of tasks. This includes:
  - Open-domain QA: Improved accuracy.
  - Reasoning and Fact Verification: Enhanced performance.
  - Long-form Generations: Significant gains in factuality (measured by FactScore) and citation accuracy (precision and recall), addressing a critical weakness of many LLMs.
- Inference-Time Customization: The framework enables a customizable decoding algorithm that leverages reflection token probabilities to enforce soft or hard constraints. This allows users to balance trade-offs (e.g., between citation precision and completeness/fluency) at inference time without retraining the model.
These findings collectively address the limitations of prior RAG by making retrieval adaptive, generations more factually grounded, and the overall LLM behavior more controllable and verifiable, leading to higher-quality and more trustworthy outputs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the Self-RAG paper, a foundational understanding of several key concepts in Natural Language Processing (NLP) and Machine Learning (ML) is essential.
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are a class of neural networks, typically transformer-based, trained on vast amounts of text data to predict the next word in a sequence. This pre-training enables them to learn complex language patterns, grammar, facts, and reasoning abilities.
- Parametric Knowledge: This refers to the knowledge implicitly stored within the billions of parameters (weights and biases) of the neural network during its pre-training phase. When an LLM generates a response, it primarily draws from this internal, parametric knowledge.
- Factual Inaccuracies / Hallucinations: A significant challenge with LLMs is that they sometimes generate information that is plausible-sounding but factually incorrect, a phenomenon often termed hallucination. This can occur because their parametric knowledge might be outdated, incomplete, or a result of probabilistic generation rather than logical deduction.
- Fine-tuning and Instruction Tuning:
  - Fine-tuning: The process of taking a pre-trained LLM and further training it on a smaller, task-specific dataset to adapt its capabilities to a particular task (e.g., sentiment analysis, summarization).
  - Instruction Tuning: A specific type of fine-tuning where models are trained on datasets formatted as instructions (e.g., "Summarize this article:") and corresponding desired outputs. This teaches the model to follow natural language instructions effectively. Examples include Alpaca and Llama2-chat.
3.1.2. Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an architecture or paradigm that enhances LLMs by allowing them to access and incorporate information from external knowledge bases during the text generation process.
- Core Idea: Instead of relying solely on its internal parametric knowledge, a RAG system first retrieves relevant documents or passages from a large corpus (e.g., Wikipedia, a company's internal documents) that are pertinent to a given user query. These retrieved passages are then provided as context alongside the original query to the LLM, which uses this external information to generate a more accurate, up-to-date, and grounded response.
- Components:
  - Retriever: A component (often a dense retriever like Contriever or DPR) that takes a query and efficiently searches a large corpus to find the most semantically similar or relevant passages.
  - Generator: An LLM that takes the original query and the retrieved passages as input and generates the final text output.
3.1.3. Reward Models and Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a technique used to align LLMs with human preferences and values, often to make them more helpful, honest, and harmless.
- Reward Model: A separate neural network trained to predict a scalar reward score for a given LLM output, based on human preference data. Humans rank or rate different LLM generations, and this feedback is used to train the reward model.
- Reinforcement Learning (RL): The reward model's output then serves as a reward signal for an LLM during an RL phase (e.g., using Proximal Policy Optimization (PPO)). The LLM is fine-tuned to maximize this reward, thereby generating outputs that are more aligned with human preferences.
3.1.4. Control Tokens
Control tokens (also known as special tokens or prompt tokens) are specific, reserved tokens added to an LLM's vocabulary during training. They serve as explicit signals or instructions to guide the model's behavior or output style during generation. For example, a [summarize] token might instruct the model to produce a summary, or [positive] might bias sentiment.
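To make the vocabulary-expansion idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library; the token strings and model name are illustrative, not the paper's released code:

```python
# A minimal sketch of registering control tokens with Hugging Face
# transformers. Token strings here are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Register new control tokens so they are never split into subwords.
control_tokens = ["[Retrieve=Yes]", "[Retrieve=No]", "[summarize]"]
tokenizer.add_special_tokens({"additional_special_tokens": control_tokens})

# Grow the embedding matrix to cover the enlarged vocabulary; the new
# rows are randomly initialized and then learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

Once registered, the model can both condition on and emit these tokens as single units, which is the mechanism Self-RAG relies on for its reflection tokens.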
3.2. Previous Works
The paper contextualizes Self-RAG by discussing existing approaches to improving LLM factuality and control.
3.2.1. Traditional RAG Approaches
- Early RAG Models (Lewis et al. 2020; Guu et al. 2020): These foundational works established the RAG paradigm. They typically retrieve a fixed number of documents once at the beginning of the generation process and prepend them to the input.
  - Limitation (addressed by Self-RAG): These methods retrieve indiscriminately, even when not needed, and don't explicitly teach the model to verify facts against the retrieved context or critique its own generation quality.
- Instruction-Tuned RAG (Luo et al., 2023 - SAIL): Some works fine-tune LLMs on instruction datasets where retrieved passages are prepended to the input.
  - Limitation (addressed by Self-RAG): Still often relies on a fixed number of retrieved passages and lacks the dynamic, self-reflective critique mechanism of Self-RAG.
- Joint Training of Retriever and LM (Izacard et al., 2022b): Approaches that jointly train both the retriever and the language model, followed by few-shot fine-tuning.
  - Limitation (addressed by Self-RAG): While improving integration, these still don't inherently provide the self-critique and adaptive retrieval capabilities of Self-RAG.
3.2.2. Adaptive Retrieval Approaches
More recent works have started exploring adaptive retrieval, but often with different mechanisms:
- Adaptive Retrieval with Proprietary LLMs (Jiang et al., 2023): Propose methods to adaptively retrieve passages, often relying on a separate, proprietary LLM to decide when to retrieve or to format queries.
  - Limitation (addressed by Self-RAG): Can be expensive, lack reproducibility, and don't integrate the adaptive decision-making directly into the generation process of the target LM.
- API Call Generation (Schick et al., 2023 - Toolformer): Train an LM to generate API calls (e.g., to Wikipedia APIs) for specific entities.
  - Limitation (addressed by Self-RAG): Focuses on structured data retrieval for named entities rather than a general, on-demand passage retrieval and comprehensive self-reflection for diverse generation tasks.
3.2.3. Concurrent RAG Works
The paper mentions several concurrent works that aim to improve RAG:
- Retriever/LM Fine-tuning (Lin et al., 2023): Fine-tune both retriever and LM on instruction-tuning datasets.
  - Differentiation: Self-RAG also trains on diverse instruction-following data but adds on-demand retrieval and fine-grained self-reflection for selecting the best output, making it more robust and controllable.
- Filtering/Compressing Passages (Yoran et al., 2023; Xu et al., 2023): Use external models (NLI or summarization) to filter or compress retrieved passages before feeding them to the LM.
  - Differentiation: Self-RAG processes passages in parallel and filters irrelevant ones through its own self-reflection without relying on external models at inference. Its critique mechanism also evaluates broader output quality.
- Tree Search with Value Scores (Zhou et al., 2023 - LATS): Prompts off-the-shelf LLMs to search for information and generate with tree search guided by LM-generated value scores.
  - Differentiation: LATS uses a single overall value score, whereas Self-RAG trains an arbitrary LM to generate fine-grained self-reflection tokens for customizable inference across multiple criteria.
3.2.4. Training and Generating with Critics
- RLHF (Ouyang et al., 2022; Wu et al., 2023): These methods train LLMs using reinforcement learning (e.g., PPO) from human feedback, often via reward models. Wu et al. (2023) explored fine-grained RLHF.
  - Differentiation: Self-RAG trains its target LM on task examples augmented with reflection tokens offline, by using a critic model to insert critiques into the training corpus. This significantly lowers training cost compared to online RLHF. While RLHF focuses on human preference alignment, Self-RAG uses reflection tokens for controllable generation at inference.
- General Control Tokens (Lu et al., 2022; Korbak et al., 2023; Keskar et al., 2019): Other works use control tokens to guide text generation.
  - Differentiation: Self-RAG specifically uses reflection tokens not just to guide, but to decide the need for retrieval and to self-evaluate generation quality after each segment.
- Self-Evaluation Guided Decoding (Xie et al., 2023): Propose a framework for self-evaluation, but primarily for reasoning tasks and a single evaluation dimension without retrieval.
  - Differentiation: Self-RAG integrates retrieval and multiple, fine-grained evaluation dimensions (relevance, support, utility).
- LLM Refinement (Dhuliawala et al., 2023 - CoVE; Madaan et al., 2023; Paul et al., 2023): Involves iterative prompting of a model to generate output, natural language feedback, and then a refined output.
  - Differentiation: These methods incur higher inference costs due to their iterative nature. Self-RAG integrates the critique process more directly into the generation flow, enabling more efficient single-pass generation with self-correction.
3.3. Technological Evolution
The evolution of technology in this domain can be traced as follows:
- Pre-trained LLMs: Started with general-purpose language models trained on massive text corpora (e.g., GPT-3, Llama). These models excel at fluency but struggle with factuality due to reliance on static parametric knowledge.
- Instruction-Tuned/Chat Models: Further training to align LLMs with human instructions and conversational patterns (e.g., ChatGPT, Llama2-chat, Alpaca). These improved usability but did not fundamentally solve the hallucination problem.
- Traditional RAG: Introduced external knowledge retrieval to LLMs to enhance factuality. Initial methods were often fixed and non-adaptive.
- Adaptive RAG & Tool-Use: Began to explore more dynamic retrieval strategies and integration of external tools/APIs.
- Self-RAG: Represents a significant step forward by integrating adaptive retrieval directly into the LLM's own generation process through reflection tokens, allowing the model to critique itself and its inputs on-demand, offering strong control and factuality improvements in a cost-effective manner.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of Self-RAG are:
- Adaptive and On-demand Retrieval: Unlike traditional RAG that retrieves indiscriminately, Self-RAG explicitly trains the LLM to predict when retrieval is necessary via a Retrieve reflection token. This enhances versatility and efficiency.
- Integrated Self-Reflection and Critique: Self-RAG introduces critique tokens (IsReL, IsSUP, IsUsE) that allow the model to evaluate the relevance of retrieved passages and the factuality/utility of its own generated segments. This is a fine-grained, internal self-assessment mechanism, unlike external models used in some concurrent works or iterative refinement that increases inference costs.
- End-to-End Training of a Single LLM: Self-RAG trains a single arbitrary LLM to generate both task output and reflection tokens by expanding its vocabulary. This makes the entire process cohesive and allows the LLM to learn the interplay between generation, retrieval, and critique.
- Offline Critique Insertion for Training Efficiency: Instead of costly online RLHF with a live reward model, Self-RAG uses a pre-trained critic model to insert reflection tokens into the training data offline. The generator then learns from this augmented corpus using a standard language modeling objective, significantly reducing training overhead.
- Inference-Time Controllability without Retraining: The reflection tokens and the segment-level beam search with customizable weight terms ($w^G$) enable users to tailor the model's behavior at inference time (e.g., prioritizing factuality or fluency) without needing to retrain the model. This offers a flexible control mechanism not typically found in other RAG or RLHF approaches.
- Enhanced Factuality and Citation Accuracy: Empirically, Self-RAG demonstrates superior performance in grounding generations, leading to higher FactScore and significantly improved citation precision and recall compared to leading LLMs and RAG baselines.
4. Methodology
The Self-Reflective Retrieval-Augmented Generation (Self-RAG) framework enhances an LLM's quality and factuality by integrating adaptive retrieval and self-reflection directly into its generation process.
4.1. Principles
The core idea behind Self-RAG is to train a single Large Language Model (LLM) to not only generate text but also to think about its generation process and its retrieved information. This "thinking" is manifested through the generation of special reflection tokens. These tokens act as internal signals, prompting the model to decide whether to retrieve external information, evaluate the relevance of that information, and then critique the quality and factual support of its own generated text. This allows for dynamic, on-demand retrieval and self-correction, leading to more accurate and controllable outputs.
4.2. Core Methodology In-depth (Layer by Layer)
The Self-RAG framework involves training a generator LLM ($\mathcal{M}$) and utilizing a retriever ($\mathcal{R}$) and a critic model ($\mathcal{C}$) during the training data augmentation phase. The overall process is summarized in Figure 1 and Algorithm 1.
The problem is formalized as training a generator LM $\mathcal{M}$ to sequentially generate textual outputs composed of multiple segments $y = [y_1, \dots, y_T]$, where $y_t$ is a sequence of tokens for the $t$-th segment. These generated tokens can be regular text or the special reflection tokens.
The four types of reflection tokens used in Self-RAG are detailed in Table 1:
The following are the results from Table 1 of the original paper:
| Type | Input | Output | Definitions |
|---|---|---|---|
| Retrieve | x/x,y | {yes, no, continue} | Decides when to retrieve with R |
| IsReL | x,d | {relevant, irrelevant} | d provides useful information to solve x. |
| IsSUP | x,d,y | {fully supported, partially supported, no support} | All verification-worthy statements in y are supported by d. |
| IsUsE | x,y | {5, 4, 3, 2, 1} | y is a useful response to x. |
Each type uses several tokens to represent its output values. The bold text (e.g., yes, relevant, fully supported) indicates the most desirable critique tokens. Here, x denotes the input, y the output, and d a relevant passage.
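For concreteness, the taxonomy in Table 1 can be encoded as follows; this is a hypothetical Python rendering (token strings and names are ours, not the paper's code):

```python
# A compact, hypothetical encoding of Table 1's reflection-token taxonomy.
REFLECTION_TOKENS = {
    "Retrieve": ["[Retrieve=Yes]", "[Retrieve=No]", "[Retrieve=Continue]"],
    "IsReL":    ["[Relevant]", "[Irrelevant]"],
    "IsSUP":    ["[Fully supported]", "[Partially supported]", "[No support]"],
    "IsUsE":    ["[Utility:5]", "[Utility:4]", "[Utility:3]", "[Utility:2]", "[Utility:1]"],
}

# The most desirable token per critique group (the bold entries in Table 1),
# used later when computing normalized critique scores.
BEST_TOKEN = {
    "IsReL": "[Relevant]",
    "IsSUP": "[Fully supported]",
    "IsUsE": "[Utility:5]",
}
```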
4.2.1. Inference Overview
During inference, Self-RAG operates step-by-step to generate content, adaptively retrieving and critiquing.
As shown in Figure 1 (right) and Algorithm 1, for each input prompt $x$ and preceding generation $y_{<t}$ (i.e., segments generated before the current segment $y_t$):
The following figure (Figure 1 from the original paper) provides an overview of SELF-RAG:

Algorithm 1 SELF-RAG Inference
Input: Generator LM $\mathcal{M}$, Retriever $\mathcal{R}$, large-scale passage collection $\{d_1, \dots, d_N\}$
Input: input prompt $x$ and preceding generation $y_{<t}$
Output: next output segment $y_t$

1. $\mathcal{M}$ predicts Retrieve given $(x, y_{<t})$.
   - Explanation: The generator LM first assesses whether retrieving external knowledge would be beneficial for generating the next segment. It does this by predicting a Retrieve token, which can be yes, no, or continue. This is a crucial adaptive decision point.
2. if Retrieve == Yes then
   - Explanation: If the model decides that retrieval is necessary, it proceeds to activate the retriever.
3. Retrieve relevant text passages $\mathbf{D}$ using $\mathcal{R}$ given $(x, y_{t-1})$.
   - Explanation: The Retriever (e.g., Contriever-MS MARCO) is called to find the top relevant passages $\mathbf{D}$ from a large collection, based on the input prompt $x$ and the last generated segment $y_{t-1}$.
4. $\mathcal{M}$ predicts IsReL given $x, d$ and $y_t$ given $x, d, y_{<t}$ for each $d \in \mathbf{D}$. (Generate)
   - Explanation: For each retrieved passage $d$ in $\mathbf{D}$, the generator concurrently performs two actions:
     - It predicts an IsReL token to evaluate if the passage is relevant or irrelevant to the input $x$. This filters out unhelpful passages.
     - It generates a candidate next output segment $y_t$, taking into account the input $x$, the specific retrieved passage $d$, and the preceding generation $y_{<t}$.
5. $\mathcal{M}$ predicts IsSUP and IsUsE given $x, y_t, d$ for each $d \in \mathbf{D}$. (Critique)
   - Explanation: After generating a candidate $y_t$ with respect to a passage $d$, the model then self-reflects further. It predicts:
     - An IsSUP token to assess whether the generated segment $y_t$ is fully supported, partially supported, or has no support from the retrieved passage $d$. This is key for factual grounding.
     - An IsUsE token to evaluate the overall usefulness of the segment $y_t$ as a response to $x$. This token uses a numerical scale (1-5).
6. Rank $y_t$ candidates based on IsReL, IsSUP, and IsUsE, as detailed in Section 3.3 of the paper.
   - Explanation: With multiple candidate segments generated (one for each relevant retrieved passage), and associated critique tokens (IsReL, IsSUP, IsUsE), Self-RAG then ranks these candidates to select the best one. This ranking is guided by a segment score that combines the probabilities of these critique tokens.
7. else if Retrieve == No then
   - Explanation: If the model decides that retrieval is not necessary (i.e., Retrieve == No), it bypasses the retrieval step.
8. $\mathcal{M}$ predicts $y_t$ given $x$. (Generate)
   - Explanation: The generator proceeds to generate the next segment directly, similar to a standard LLM, without external passages.
9. $\mathcal{M}$ predicts IsUsE given $x, y_t$. (Critique)
   - Explanation: Even without retrieval, the model still critiques its output. It predicts an IsUsE token to evaluate the utility of the generated segment $y_t$ relative to the input $x$.

This adaptive and self-reflective inference process allows Self-RAG to tailor its behavior, ensuring retrieval only when needed and enabling selection of the most accurate and supported generation.
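The loop above can be summarized in code. The following is a minimal sketch under stated assumptions: `predict_retrieve_prob`, `retrieve`, `generate_candidate`, and `segment_score` are hypothetical placeholders standing in for the model's actual token-probability queries and decoding calls, not a real API:

```python
# A minimal sketch of Algorithm 1 (SELF-RAG inference), with hypothetical
# helper functions; this is an illustration, not the paper's released code.
def self_rag_step(x, y_prev, model, retriever, threshold=0.2):
    """Generate the next segment y_t for prompt x given prior segments."""
    # 1) Adaptive retrieval decision from the Retrieve token probability.
    p_yes = predict_retrieve_prob(model, x, y_prev)  # normalized P(Retrieve=Yes)
    if p_yes > threshold:
        # 2-3) Retrieve top passages conditioned on x and the last segment.
        passages = retrieve(retriever, x, y_prev, k=5)
        candidates = []
        for d in passages:
            # 4-5) Generate a candidate and its critique tokens per passage.
            y_t, critique = generate_candidate(model, x, d, y_prev)
            # 6) Score candidates: LM probability plus critique score.
            candidates.append((segment_score(model, y_t, d, critique), y_t))
        return max(candidates)[1]  # best-scoring continuation
    # 7-9) No retrieval: generate directly and self-assess utility only.
    y_t, _ = generate_candidate(model, x, None, y_prev)
    return y_t
```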
4.2.2. Training Overview
Self-RAG trains an arbitrary LLM to generate text interleaved with reflection tokens by treating these tokens as part of an expanded vocabulary. The training process involves two main stages:
- Training a Critic Model ($\mathcal{C}$): This model is trained to predict the reflection tokens for evaluating retrieved passages and generation quality.
- Training a Generator Model ($\mathcal{M}$): This model learns to generate both the task output and the reflection tokens by training on a curated corpus augmented with passages retrieved by a retriever ($\mathcal{R}$) and reflection tokens predicted by the critic model ($\mathcal{C}$). The key here is that the critic model is used offline to prepare the training data, avoiding the high cost of online reinforcement learning.
4.2.3. Training the Critic Model ($\mathcal{C}$)
Data Collection for Critic Model
Manually annotating reflection tokens for every segment is labor-intensive. To overcome this, the authors leverage a powerful proprietary LLM, GPT-4 (OpenAI, 2023), to generate the initial feedback.
- For each type of reflection token (e.g., Retrieve, IsReL, IsSUP, IsUsE), instances are randomly sampled from the original training data.
- GPT-4 is prompted with type-specific instructions and few-shot demonstrations along with the original task input $x$ and output $y$ to predict an appropriate reflection token as text (e.g., "yes", "relevant", "fully supported").
  - For example, for the Retrieve token, GPT-4 is prompted with: "Given an instruction, make a judgment on whether finding some external documents from the web helps to generate a better response."
- The authors collect between 4k and 20k supervised training data points for each reflection token type and combine them to form the training dataset for the critic model, $\mathcal{D}_{critic}$.
Critic Learning
After collecting $\mathcal{D}_{critic}$, a pre-trained LLM (e.g., Llama 2-7B) is initialized as the critic model $\mathcal{C}$. It is then trained on $\mathcal{D}_{critic}$ using a standard conditional language modeling objective, aiming to maximize the likelihood of predicting the correct reflection tokens.
The objective function for training the critic model is:

$$\max_{\mathcal{C}} \; \mathbb{E}_{((x,y),r) \sim \mathcal{D}_{critic}} \log p_{\mathcal{C}}(r \mid x, y)$$

Where:
- $\max_{\mathcal{C}}$: Denotes maximizing the objective with respect to the parameters of the critic model $\mathcal{C}$.
- $\mathbb{E}$: Represents the expected value (average) over the training dataset.
- $((x,y),r) \sim \mathcal{D}_{critic}$: Indicates that (x, y) are an input-output pair from the training dataset, and $r$ is the corresponding reflection token label generated by GPT-4. $\mathcal{D}_{critic}$ is the collection of these triplets.
- $\log p_{\mathcal{C}}(r \mid x, y)$: Is the log-likelihood of the critic model predicting the reflection token $r$, given the input $x$ and output $y$. Maximizing this term means training $\mathcal{C}$ to accurately predict the reflection tokens that GPT-4 provided.

The trained critic model achieves high agreement (over 90%) with GPT-4's predictions on most reflection token categories, ensuring its reliability for generating training data for the generator.
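As a concrete reading of this objective, the sketch below (a PyTorch-style illustration under our own assumptions, not the authors' code) computes the conditional LM loss only on reflection-token positions, which corresponds to maximizing $\log p_{\mathcal{C}}(r \mid x, y)$:

```python
# A minimal sketch of the critic's conditional language-modeling step:
# cross-entropy loss restricted to reflection-token positions.
import torch
import torch.nn.functional as F

def critic_loss(logits, labels, reflection_mask):
    """logits: [batch, seq, vocab]; labels: [batch, seq];
    reflection_mask: True where the target token is a reflection token r."""
    # Standard next-token shift for causal LM training.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = reflection_mask[:, 1:].float()
    # Cross-entropy = negative log-likelihood of the target tokens.
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    # Average the loss over reflection-token positions only, so the model
    # is optimized to predict r given the preceding (x, y) context.
    return (nll * mask).sum() / mask.sum().clamp(min=1)
```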
4.2.4. Training the Generator Model ($\mathcal{M}$)
Data Collection for Generator
The training data for the generator model, $\mathcal{D}_{gen}$, is created by augmenting the original input-output pairs (x, y) with retrieved passages and reflection tokens, mimicking the Self-RAG inference process.
- For each segment $y_t$ within the output $y$:
  - The trained critic model $\mathcal{C}$ is used to assess if retrieval is needed.
  - If $\mathcal{C}$ predicts Retrieve = Yes, the retrieval special token is added to the data.
  - Then, the retriever $\mathcal{R}$ retrieves the top $K$ passages, $\mathbf{D}$.
  - For each passage $d$, $\mathcal{C}$ evaluates its relevance and predicts IsReL. If relevant, $\mathcal{C}$ further evaluates whether the passage supports the model's generation and predicts IsSUP. These critique tokens are appended after the retrieved passage or generation.
  - Finally, at the end of the output $y$, $\mathcal{C}$ predicts the overall utility token IsUsE.
- The augmented output, including the original input pair, retrieved passages, and all generated reflection tokens, forms an instance in $\mathcal{D}_{gen}$.
The following figure (Figure 2 from the original paper) illustrates SELF-RAG training examples:
(Figure caption, translated) This schematic shows the workflow of the Self-Reflective Retrieval-Augmented Generation (Self-RAG) framework. The left side shows example inputs and outputs; the right side shows how outputs are augmented with retrieval and critic-model responses. Specific reflection tokens allow the model to adjust its behavior to task requirements.
Figure 2 shows examples of Self-RAG training data. The left example does not require retrieval, so no passages are inserted. The right example requires retrieval, and thus retrieved passages are interleaved with reflection tokens (IsReL, IsSUP).
Generator Learning
The generator model $\mathcal{M}$ (e.g., Llama2 7B or 13B) is trained on the curated corpus $\mathcal{D}_{gen}$ using a standard next-token prediction objective. This means $\mathcal{M}$ learns to predict both the target output text and the reflection tokens.
The objective function for training the generator model is:

$$\max_{\mathcal{M}} \; \mathbb{E}_{(x, y, r) \sim \mathcal{D}_{gen}} \log p_{\mathcal{M}}(y, r \mid x)$$

Where:
- $\max_{\mathcal{M}}$: Denotes maximizing the objective with respect to the parameters of the generator model $\mathcal{M}$.
- $\mathbb{E}$: Represents the expected value (average) over the training dataset.
- $(x, y, r) \sim \mathcal{D}_{gen}$: Indicates that $x$ is the input, and (y, r) represents the augmented output (including text $y$ and interleaved reflection tokens $r$) from the training dataset $\mathcal{D}_{gen}$.
- $\log p_{\mathcal{M}}(y, r \mid x)$: Is the log-likelihood of the generator model predicting the sequence of target tokens (y, r) given the input $x$. By maximizing this, $\mathcal{M}$ learns to generate both the desired text and the appropriate reflection tokens.

Crucially, during training, the actual retrieved text chunks (surrounded by `<p>` and `</p>` tags in Figure 2) are masked out for loss calculation. This means the model learns to predict the reflection tokens and the generated text, but the loss isn't computed on the retrieved passages themselves, only on the model's response to them and the meta-information (reflection tokens). The original vocabulary of the LLM is expanded with the set of reflection tokens (the Critique and Retrieve tokens) so that the model can learn to generate them.
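The passage masking can be implemented with the ignored-label convention used by standard causal-LM losses. A minimal sketch, assuming Hugging Face-style training where positions labeled -100 are excluded from the loss, with illustrative tag token ids:

```python
# A minimal sketch of masking retrieved passages out of the generator loss.
def mask_passage_labels(input_ids, labels, p_open_id, p_close_id):
    """Set labels to -100 inside <p> ... </p> spans so no loss is computed
    on retrieved passage tokens, only on the model's own text and
    reflection tokens."""
    masking = False
    for i, tok in enumerate(input_ids):
        if tok == p_open_id:
            masking = True
        if masking:
            labels[i] = -100  # ignored by the cross-entropy loss
        if tok == p_close_id:
            masking = False
    return labels
```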
This training approach enables $\mathcal{M}$ to generate reflection tokens itself at inference time without needing the critic $\mathcal{C}$, making the entire system self-contained and controllable. This method is also more cost-effective than RLHF because the critiques are pre-computed offline.
4.2.5. SELF-RAG Inference (Customization)
The ability of Self-RAG to generate reflection tokens to self-evaluate its own output is a key feature for controllability during inference. This allows the model's behavior to be tailored to diverse task requirements.
Adaptive Retrieval with Threshold
Self-RAG dynamically decides whether to retrieve text passages by predicting the Retrieve token.
- Mechanism: The model predicts the probability of generating Retrieve = Yes. If this probability, normalized over all possible Retrieve output tokens (Yes, No, Continue), surpasses a designated threshold $\delta$, retrieval is triggered.
- Customization: The threshold $\delta$ can be set per task (e.g., 0.2 for most tasks, 0 for ALCE where citation is mandatory). A larger threshold means less frequent retrieval, useful for open-ended creative tasks. A smaller threshold (or 0) encourages more frequent retrieval for factual accuracy-demanding tasks.
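A minimal sketch of this decision rule (the token strings and probabilities are illustrative, not the paper's code):

```python
# A minimal sketch of the adaptive retrieval decision. `token_probs` is a
# hypothetical mapping from Retrieve-token strings to the LM's next-token
# probabilities at the current step.
def should_retrieve(token_probs, threshold=0.2):
    """Trigger retrieval if the normalized probability of Retrieve=Yes
    exceeds the threshold delta."""
    p_yes = token_probs["[Retrieve=Yes]"]
    total = sum(
        token_probs[t]
        for t in ("[Retrieve=Yes]", "[Retrieve=No]", "[Retrieve=Continue]")
    )
    return (p_yes / total) > threshold

# Example: with p(Yes)=0.15, p(No)=0.30, p(Continue)=0.05, the normalized
# value 0.15/0.50 = 0.3 exceeds delta=0.2, so retrieval is triggered.
assert should_retrieve(
    {"[Retrieve=Yes]": 0.15, "[Retrieve=No]": 0.30, "[Retrieve=Continue]": 0.05}
)
```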
Tree-decoding with Critique Tokens
When retrieval is required (either by the model's Retrieve = Yes prediction or a hard condition), Self-RAG employs a segment-level beam search to select the best continuation.
- Parallel Passage Processing: The retriever $\mathcal{R}$ retrieves $K$ passages. The generator then processes each of these $K$ passages in parallel, generating $K$ different candidate continuations.
- Segment-level Beam Search: A beam search with beam size $B$ is conducted at each segment step $t$. This aims to find the top-$B$ segment continuations. At the end of generation, the best sequence is returned.
- Critique-Guided Scoring: The score of each candidate segment $y_t$ (generated with respect to a specific passage $d$) is updated using a critic score $\mathcal{S}$. This score is a linear weighted sum of the normalized probabilities of the critique tokens predicted for that segment.

The segment score for $y_t$ with passage $d$ and associated critique tokens is calculated as:

$$f(y_t, d, \text{Critique}) = p(y_t \mid x, d, y_{<t}) + \mathcal{S}(\text{Critique})$$

Where:
- $f(y_t, d, \text{Critique})$: The overall score for the segment $y_t$, considering the passage $d$ and the predicted critique tokens.
- $p(y_t \mid x, d, y_{<t})$: The log-probability (or probability) of the generator generating the segment $y_t$ given the input $x$, retrieved passage $d$, and preceding generation $y_{<t}$. This is the standard language model score.
- $\mathcal{S}(\text{Critique})$: The critique score derived from the reflection tokens, which explicitly incorporates self-reflection into the segment selection.

The critique score is a weighted sum of scores from different critique token groups:

$$\mathcal{S}(\text{Critique}) = \sum_{G \in \mathcal{G}} w^G s_t^G, \quad \mathcal{G} = \{\text{IsReL}, \text{IsSUP}, \text{IsUsE}\}$$

Where:
- $\mathcal{S}(\text{Critique})$: The total critique score.
- $\mathcal{G}$: The set of critique token groups, which includes IsReL (relevance), IsSUP (support), and IsUsE (usefulness).
- $w^G$: A hyperparameter representing the weight assigned to each critique token group $G$. These weights are adjustable at inference time to customize model behavior. For example, a higher $w^{\text{IsSUP}}$ prioritizes factual support.
- $s_t^G$: The normalized probability of the most desirable token within critique group $G$ at timestamp $t$.

The normalized probability for a critique token type $G$ is calculated as:

$$s_t^G = \frac{p_t(\hat{r})}{\sum_{i=1}^{N^G} p_t(r_i)}$$

Where:
- $p_t(\hat{r})$: The probability of generating the most desirable reflection token $\hat{r}$ (e.g., Relevant for IsReL, Fully Supported for IsSUP, 5 for IsUsE) within group $G$ at timestamp $t$.
- $\sum_{i=1}^{N^G} p_t(r_i)$: The sum of probabilities of all $N^G$ distinct tokens $r_i$ that represent the different possible values for the critique token type $G$. This normalizes the score.
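Putting the three formulas together, here is a minimal sketch of the segment score; the inputs are illustrative, and the example weights are placeholders rather than the paper's task-specific defaults:

```python
# A minimal sketch of the critique-guided segment score
# f = log p(y_t | x, d, y_<t) + sum_G w^G * s_t^G.
BEST = {"IsReL": "Relevant", "IsSUP": "Fully supported", "IsUsE": "5"}

def segment_score(lm_logprob, critique_probs, weights):
    """lm_logprob: LM score of the candidate segment;
    critique_probs[G]: maps each group's token values to probabilities."""
    score = lm_logprob
    for group, probs in critique_probs.items():
        s = probs[BEST[group]] / sum(probs.values())  # normalized s_t^G
        score += weights[group] * s                   # weighted sum S(Critique)
    return score

# Weights can be tuned at inference time to trade off criteria.
weights = {"IsReL": 1.0, "IsSUP": 1.0, "IsUsE": 0.5}  # placeholder values
```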
Customization and Control:
- Soft Constraints: By adjusting the weights $w^G$, users can softly guide the model's preferences. For example, increasing $w^{\text{IsSUP}}$ makes the model prioritize outputs that are strongly supported by evidence, even if it might slightly reduce fluency (as longer, more fluent generations might contain more unsubstantiated claims).
- Hard Constraints: Alternatively, one can explicitly filter out segment continuations if the model generates an undesirable Critique token (e.g., discarding a segment if IsSUP predicts No support).

This mechanism provides fine-grained control over the generation process, allowing Self-RAG to adapt to specific user requirements or task objectives without requiring further training.
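A hard constraint is straightforward to express in the same sketch vocabulary as the earlier examples (the candidate structure is hypothetical):

```python
# A minimal sketch of a hard constraint: discard any continuation whose
# IsSUP critique token is "No support" before ranking the remainder.
def apply_hard_constraint(candidates):
    """candidates: list of (score, text, critique) triples, where critique
    maps group names to predicted token values (hypothetical structure)."""
    return [c for c in candidates if c[2]["IsSUP"] != "No support"]
```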
5. Experimental Setup
5.1. Datasets
The experiments for Self-RAG are conducted on a diverse set of six tasks categorized into closed-set, short-form generation, and long-form generation, to holistically evaluate performance across correctness, factuality, and fluency.
5.1.1. Closed-set Tasks
These tasks involve selecting a correct answer from a predefined set, often testing reasoning or factual knowledge directly.
- PubHealth (Zhang et al., 2023): A fact verification dataset focused on public health. Models need to determine the factual correctness of statements related to health information.
- ARC Challenge (Clark et al., 2018): A multiple-choice reasoning dataset derived from scientific exams. It requires models to answer complex scientific questions, often needing multi-step reasoning.
- Data Characteristics: Both datasets require precise factual understanding and reasoning capabilities. Accuracy is the primary metric. The paper aggregates answer probabilities for target classes.
5.1.2. Short-form Generation Tasks
These tasks involve generating concise, factual answers to questions.
- PopQA (Mallen et al., 2023): An open-domain question answering dataset. The paper specifically uses the long-tail subset, which consists of 1,399 queries about rare entities (Wikipedia page views of less than 100 per month). This dataset tests the model's ability to recall or retrieve information about less common facts.
- TriviaQA-unfiltered (Joshi et al., 2017): Another open-domain question answering dataset, focusing on factual knowledge. The paper uses 11,313 test queries from the validation and test split, following prior work.
- Data Characteristics: Both require accurate factual recall or retrieval. Evaluation focuses on whether gold answers are included in the model's generation, rather than strict exact matching, to account for variations in phrasing.
5.1.3. Long-form Generation Tasks
These tasks require generating longer, coherent texts, often demanding high factuality and proper attribution.
- Biography Generation (Min et al., 2023): A task where the model generates biographical texts.
- ALCE-ASQA (Gao et al., 2023; Stelmakh et al., 2022): A long-form QA task that requires generating detailed answers, often with multiple pieces of information, and crucially, demands citation accuracy.
- Data Characteristics: These tasks heavily emphasize factuality, coherence, fluency, and the ability to correctly attribute information to sources (citations).
5.2. Evaluation Metrics
For every evaluation metric mentioned, a complete explanation is provided.
5.2.1. Accuracy (acc)
- Conceptual Definition: Accuracy measures the proportion of correct predictions (or answers) out of the total number of predictions made. It is a fundamental metric for classification and question-answering tasks where there is a single correct output.
- Mathematical Formula:
$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$$
- Symbol Explanation:
  - $N_{\text{correct}}$: The count of instances where the model's output matches the ground truth.
  - $N_{\text{total}}$: The total count of instances in the dataset for which the model made a prediction.
5.2.2. FactScore (FS)
- Conceptual Definition: FactScore is a metric designed to evaluate the factuality of long-form text generations, particularly biographies. It aims to quantify how many claims made in the generated text are factually correct and supported by evidence, rather than being hallucinations. A higher FactScore indicates higher factual accuracy.
- Mathematical Formula: The paper refers to Min et al. (2023) for FactScore but does not provide its specific calculation formula in the main text. Generally, FactScore involves extracting atomic facts (claims) from the generated text and then verifying each claim against a knowledge base or retrieved evidence. The score often reflects the percentage of verifiable and correct facts.
FactScorebut does not provide its specific calculation formula in the main text. Generally,FactScoreinvolves extracting atomic facts (claims) from the generated text and then verifying each claim against a knowledge base or retrieved evidence. The score often reflects the percentage of verifiable and correct facts. - Symbol Explanation: Not provided in the paper's main text.
5.2.3. Correctness (str-em, rg)
- Conceptual Definition:
  - str-em (string exact match): Measures if the generated answer string is an exact match to the reference answer. It's a very strict metric, requiring character-level identity.
  - ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics (e.g., ROUGE-1, ROUGE-2, ROUGE-L) used for evaluating summarization and machine translation. It measures the overlap of n-grams (sequences of words) or longest common subsequences between the generated text and a set of reference texts. Higher ROUGE scores indicate greater similarity to the reference.
- Mathematical Formula: The paper refers to these metrics but does not provide their specific calculation formulas in the main text.
  - For str-em, it's a binary outcome (1 if exact match, 0 otherwise), averaged over all instances.
  - For ROUGE, formulas vary by type (ROUGE-N is based on n-gram overlap, ROUGE-L on the longest common subsequence). Example ROUGE-N (general form):
$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}_{\text{match}}(gram_n)}{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}(gram_n)}$$
- Symbol Explanation:
  - $S$: A sentence in either the reference or generated text.
  - $gram_n$: A contiguous sequence of $n$ words (an n-gram).
  - $\text{Count}_{\text{match}}(gram_n)$: The maximum number of n-grams co-occurring in a candidate generation and a set of reference summaries.
  - $\text{Count}(gram_n)$: The number of n-grams in the reference summaries.
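For intuition, here is a minimal sketch of ROUGE-N recall against a single reference, with clipped n-gram counting matching the general form above:

```python
# A minimal sketch of ROUGE-N recall: clipped n-gram overlap between a
# candidate and a single reference.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=2):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    # Count_match: each reference n-gram matches at most as often as it
    # appears in the candidate (clipped counting).
    match = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return match / total if total else 0.0

print(rouge_n("the cat sat on the mat", "the cat lay on the mat"))  # 0.6
```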
5.2.4. Fluency (MAUVE)
- Conceptual Definition: MAUVE (Pillutla et al., 2021) is a metric for evaluating the quality of open-ended text generation, specifically its fluency (how natural and grammatical the text is) and diversity. It compares the distribution of generated text to the distribution of human reference text. A higher MAUVE score indicates that the generated text is more similar in quality and style to human-written text.
- Mathematical Formula: The paper refers to Pillutla et al. (2021) for MAUVE but does not provide its specific calculation formula in the main text. It's a complex metric based on comparing statistical properties of the model and human text distributions (divergence frontiers).
- Symbol Explanation: Not provided in the paper's main text.
5.2.5. Citation Precision (prec) and Recall (rec)
- Conceptual Definition: These metrics are crucial for evaluating how well a model attributes its generated content to provided sources, particularly in RAG systems.
- Citation Precision: Measures the proportion of cited statements in the generated text that are actually supported by the corresponding cited passages. A high precision means few "hallucinated" or unsubstantiated citations.
- Citation Recall: Measures the proportion of all verifiable statements in the generated text that should have been cited (i.e., are factual and supported by available passages) and were correctly cited. A high recall means the model is good at attributing all derivable facts.
- Mathematical Formula: The paper refers to Gao et al. (2023) for these metrics but does not provide their specific calculation formulas in the main text. Generally:
  - Precision: $\text{Citation Precision} = \dfrac{\text{number of citations that support their associated statements}}{\text{total number of citations}}$
  - Recall: $\text{Citation Recall} = \dfrac{\text{number of statements supported by their citations}}{\text{total number of statements requiring citation}}$
- Symbol Explanation: Not provided in the paper's main text; in the ALCE evaluation of Gao et al. (2023), the support judgments are produced by an entailment (NLI) check between each statement and its cited passages.
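As a simplified per-statement approximation of these definitions (the full ALCE evaluation of Gao et al. (2023) is more involved), assuming a hypothetical `supports()` entailment judgment:

```python
# A minimal sketch of citation precision/recall. `supports(passage_ids,
# statement)` is a hypothetical entailment judgment (e.g., from an NLI
# model); statements are annotated with their citation lists.
def citation_metrics(statements, supports):
    """statements: list of (text, cited_passage_ids) pairs."""
    cited = [s for s in statements if s[1]]  # statements carrying citations
    supported = [s for s in cited if supports(s[1], s[0])]
    precision = len(supported) / len(cited) if cited else 0.0
    recall = len(supported) / len(statements) if statements else 0.0
    return precision, recall
```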
5.3. Baselines
The paper compares Self-RAG against a comprehensive set of baselines, categorized by their use of retrieval and proprietary data.
5.3.1. Baselines without Retrieval
These models rely solely on their internal parametric knowledge.
- Open-source Pre-trained/Instruction-tuned LLMs:
  - Llama2-7B, Llama2-13B (Touvron et al., 2023): Foundational LLMs.
  - Alpaca-7B, Alpaca-13B (Dubois et al., 2023): Instruction-tuned versions of Llama models (replicated by the authors).
- Proprietary/Reinforced LLMs:
  - ChatGPT (Ouyang et al., 2022): A widely used instruction-tuned and RLHF-aligned LLM.
  - Llama2-chat-13B (Touvron et al., 2023): An RLHF-aligned chat model.
- Concurrent work:
  - CoVE-65B (Dhuliawala et al., 2023): Iteratively prompts a 65B-parameter LLM to refine outputs, focusing on factuality.
5.3.2. Baselines with Retrieval
These models integrate external knowledge retrieval.
- Standard RAG Baselines:
  - Llama2-7B, Alpaca-7B, Llama2-13B, Alpaca-13B augmented with retrieval at test time. This involves prepending top retrieved documents to the query before feeding it to the LLM. They use the same retriever as Self-RAG.
- Fine-tuned without Reflection:
  - Llama2-FT-7B: A Llama2-7B model fine-tuned on the same training data as Self-RAG but without reflection tokens or retrieved passages, and then augmented with retrieval at test time. This baseline helps isolate the gains attributed to the Self-RAG framework itself rather than just the training data.
- Proprietary/Production RAG Systems:
  - Ret-ChatGPT: ChatGPT augmented with retrieval.
  - Ret-Llama2-chat: Llama2-chat augmented with retrieval.
  - perplexity.ai: An InstructGPT-based production search system that inherently uses retrieval.
- Concurrent Trained RAG Methods:
  - Toolformer-6B (Schick et al., 2023): An LLM pre-trained to generate API calls for named entities.
  - SAIL-7B (Luo et al., 2023): An LLM instruction-tuned on Alpaca data with top retrieved documents inserted before instructions.
5.4. Experimental Settings
5.4.1. Training Data and Settings
- Training Data: A total of 150k instruction-output pairs were used. This dataset was constructed from:
  - Open-Instruct processed data (Wang et al., 2023): A collection of diverse instruction-following examples.
  - Knowledge-intensive datasets (Petroni et al., 2021; Stelmakh et al., 2022; Mihaylov et al., 2018): Datasets designed to test knowledge-based question answering and generation.
- Base LLMs:
  - Generator ($\mathcal{M}$): Llama2 7B and Llama2 13B models (Touvron et al., 2023) were used.
  - Critic ($\mathcal{C}$): Llama2 7B was used as the base model for the critic.
- Retriever ($\mathcal{R}$):
  - Contriever-MS MARCO (Izacard et al., 2022a) was used as the default off-the-shelf retriever. It retrieves up to ten documents for each input.
  - For biographies and open-domain QA, an additional five documents retrieved by a web search engine were used, following Luo et al. (2023).
  - For ASQA, the authors used five documents provided by GTR-XXL (Ni et al., 2022) across all baselines to ensure a fair comparison.
5.4.2. Inference Settings
- Critique Weight Terms: As a default configuration, fixed weight terms ($w^G$) were set for the critique tokens IsReL (Relevant/Irrelevant), IsSUP (Fully Supported/Partially Supported/No Support), and IsUsE (utility 1-5); the specific default values are given in the paper.
- Retrieval Threshold:
  - For most tasks, the retrieval threshold was set to 0.2. This means retrieval is triggered if the normalized probability of Retrieve = Yes exceeds 0.2.
  - For ALCE (due to its strong citation requirements), the threshold was set to 0, effectively forcing retrieval whenever possible.
- Efficiency: vllm (Kwon et al., 2023), a high-throughput inference engine, was used to speed up inference.
- Decoding:
  - Segment-level beam search: A beam width of 2 was adopted at each segment level, meaning the model explores the two best continuations for each segment.
  - Token-level generation: Greedy decoding was used for generating individual tokens within a segment. This means at each step, the model picks the token with the highest probability.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate Self-RAG's superior performance across a wide range of tasks, outperforming both non-retrieval LLMs and traditional RAG approaches.
The following are the results from Table 2 of the original paper:
| | Short-form | | Closed-set | | Long-form generations (with citations) | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| LM | PopQA (acc) | TQA (acc) | Pub (acc) | ARC (acc) | Bio (FS) | ASQA (em) | (rg) | (mau) | (pre) | (rec) |
| LMs with proprietary data | ||||||||||
| Llama2-c13B | 20.0 | 59.3 | 49.4 | 38.4 | 55.9 | 22.4 | 29.6 | 28.6 | — | |
| Ret-Llama2-C13B | 51.8 | 59.8 | 52.1 | 37.9 | 79.9 | 32.8 | 34.8 | 43.8 | 19.8 | 36.1 |
| ChatGPT | 29.3 | 74.3 | 70.1 | 75.3 | 71.8 | 35.3 | 36.2 | 68.8 | ||
| Ret-ChatGPT | 50.8 | 65.7 | 54.7 | 75.3 | 40.7 | 39.9 | 79.7 | 65.1 | 76.6 | |
| Perplexity.ai | | | | | 71.2 | | | | | |
| Baselines without retrieval | ||||||||||
| Llama27B | 14.7 | 30.5 | 34.2 | 21.8 | 44.5 | 7.9 | 15.3 | 19.0 | ||
| Alpaca7B | 23.6 | 54.5 | 49.8 | 45.0 | 45.8 | 18.8 | 29.4 | 61.7 | ||
| Llama213B | 14.7 | 38.5 | 29.4 | 29.4 | 53.4 | 7.2 | 12.4 | 16.0 | ||
| Alpaca13B | 24.4 | 61.3 | 55.5 | 54.9 | 50.2 | 22.9 | 32.0 | 70.6 | ||
| CoVE65B * | | | | | 71.2 | | | | | |
| Baselines with retrieval | ||||||||||
| Toolformer*6B | | 48.8 | | | | | | | | |
| Llama27B | 38.2 | 42.5 | 30.0 | 48.0 | 78.0 | 15.2 | 22.1 | 32.0 | 2.9 | 4.0 |
| Alpaca7B | 46.7 | 64.1 | 40.2 | 48.0 | 76.6 | 30.9 | 33.3 | 57.9 | 5.5 | 7.2 |
| Llama2-FT7B | 48.7 | 57.3 | 64.3 | 65.8 | 78.2 | 31.0 | 35.8 | 51.2 | 5.0 | 7.5 |
| SAIL*7B | | | 69.2 | 48.4 | | | | | | |
| Llama213B | 45.7 | 47.0 | 30.2 | 26.0 | 77.5 | 16.3 | 20.5 | 24.7 | 2.3 | 3.6 |
| Alpaca13B | 46.1 | 66.9 | 51.1 | 57.6 | 77.7 | 34.8 | 36.7 | 56.6 | 2.0 | 3.8 |
| Our SELF-RAG 7B | 54.9 | 66.4 | 72.4 | 67.3 | 81.2 | 30.0 | 35.7 | 74.3 | 66.9 | 67.8 |
| Our SELF-RAG 13B | 55.8 | 69.3 | 74.5 | 73.1 | 80.2 | 31.7 | 37.0 | 71.6 | 70.3 | 71.3 |
6.1.1. Comparison against Baselines without Retrieval
- Superiority over Open-source LLMs: Self-RAG (both 7B and 13B models) significantly outperforms Llama2 and Alpaca models of similar or larger sizes across all tasks. For instance, Self-RAG 7B achieves 54.9% accuracy on PopQA compared to Alpaca 7B's 23.6%, a large improvement in factual QA.
- Outperforming ChatGPT in Specific Areas: Self-RAG even surpasses ChatGPT (a proprietary and RLHF-tuned model) on several key metrics:
  - PubHealth: Self-RAG 13B (74.5%) vs. ChatGPT (70.1%).
  - PopQA: Self-RAG 13B (55.8%) vs. ChatGPT (29.3%).
  - Biography FactScore: Self-RAG 7B (81.2%) vs. ChatGPT (71.8%).
  - ASQA (Rouge and MAUVE): Self-RAG 7B (rg 35.7, mau 74.3) and Self-RAG 13B (rg 37.0, mau 71.6) are competitive with ChatGPT (rg 36.2, mau 68.8) on Rouge and clearly higher on MAUVE, indicating better fluency.
- Outperforming CoVE: On the biography generation task, Self-RAG 7B (81.2 FactScore) and 13B (80.2 FactScore) significantly outperform CoVE 65B (71.2 FactScore), a concurrent method employing iterative prompt engineering, despite CoVE being a much larger model. This highlights the efficiency and effectiveness of Self-RAG's integrated self-reflection.
6.1.2. Comparison against Baselines with Retrieval
- Best Performance among Non-Proprietary Models: Self-RAG consistently achieves the best performance among all non-proprietary LLM-based models across all tasks.
- Significant Gains over Standard RAG: While standard RAG improves baselines (e.g., Llama2-7B with RAG jumps from 14.7% to 38.2% on PopQA), Self-RAG pushes these gains further (Self-RAG 7B reaches 54.9% on PopQA). This indicates that the adaptive retrieval and self-reflection mechanisms are critical beyond simple document concatenation.
- Addressing RAG Limitations: The paper notes that standard RAG often struggles when tasks require more than just copying or extracting substrings, or when retrieval isn't always helpful (e.g., PubHealth and ARC Challenge, where RAG baselines don't notably improve over non-retrieval). Self-RAG overcomes this through its adaptive nature.
- Exceptional Citation Accuracy: One of the most striking results is Self-RAG's performance on citation precision and recall for ASQA. Self-RAG 13B achieves 70.3% precision and 71.3% recall, far surpassing all other non-proprietary models (e.g., Ret-Llama2-C13B has 19.8% precision, 36.1% recall; Alpaca13B with RAG has 2.0% precision, 3.8% recall). Self-RAG 13B even outperforms Ret-ChatGPT (65.1% precision, 76.6% recall) in citation precision, which measures whether model claims are fully supported by cited evidence. This is a critical advancement in verifiable LLM generation.
- Effectiveness of the Self-RAG Framework: The Llama2-FT-7B baseline (fine-tuned on the same instruction-output pairs as Self-RAG but without reflection tokens or retrieved passages during training, only retrieval-augmented at test time) lags significantly behind Self-RAG. For example, on PubHealth, Llama2-FT-7B scores 64.3%, while Self-RAG 7B achieves 72.4%. This explicitly demonstrates that the gains of Self-RAG are not merely from training data but from the novel framework of learning to retrieve and self-reflect.
- Nuance in Model Size: Interestingly, Self-RAG 7B occasionally outperforms Self-RAG 13B on metrics for factual precision (e.g., Bio FactScore: 81.2 for 7B vs. 80.2 for 13B). The authors attribute this to smaller models sometimes generating more precisely grounded, albeit shorter, outputs, suggesting a trade-off that can be controlled via inference parameters.
6.2. Ablation Studies / Parameter Analysis
The paper conducts several ablation studies to understand the contribution of different components of the Self-RAG framework, both during training and inference.
The following figure (Figure 3 from the original paper) shows an analysis on SELF-RAG:
(Figure caption, translated) This chart shows Self-RAG's performance metrics across multiple tasks, including accuracy on PQA, Med, and AS and generation quality. It also shows results under different training and test conditions (e.g., performance with and without retrieval) and analyses of customization and the retrieval threshold.
6.2.1. Ablation Studies (Figure 3a)
The ablation studies are presented in Table 3a (not explicitly provided in the markdown, but described in text).
-
Training Ablations:
No Retriever: AnLLMtrained with standard instruction-following, without any retrieved passages. This performs significantly worse thanSelf-RAG, indicating the necessity of retrieval in the training process.No Critic: AnLLMtrained with input-output pairs that are always augmented with the top retrieved document, but without reflection tokens. This is similar toSAIL. The large performance gap betweenSelf-RAGandNo Critichighlights the crucial role of thecritic modelandreflection tokensin guiding effective retrieval and generation.- Conclusion: The large performance gap between
Self-RAGandNo RetrieverorNo Criticbaselines across tasks clearly demonstrates that training theLLMwith adaptive retrieval and self-reflection (via thecritic modelandreflection tokens) largely contributes to the performance gains.
- Inference Ablations:
  - No retrieval: Disables retrieval during inference. This leads to a significant performance drop, especially on knowledge-intensive tasks, affirming the importance of retrieval.
  - Hard constraints: Instead of an adaptive probability threshold, retrieval is triggered only when Retrieve=Yes is explicitly predicted. This explores stricter control, though it is not always optimal.
  - Retrieve top 1: Always retrieves and uses only the single top document, as in conventional RAG approaches. This causes a large performance drop on PopQA and ASQA, suggesting that Self-RAG's ability to process multiple passages and select among them via fine-grained critique is superior to relying on a single "best" retrieved document.
  - Remove IsSup: The IsSup score (measuring factual support) is removed from the critique-guided beam search (Equation 4). This significantly hurts performance on ASQA, especially its factuality and citation metrics.
  - Conclusion: These inference ablations underscore the effectiveness of Self-RAG's multi-faceted approach: adaptive retrieval, parallel processing of passages, and a critique-guided beam search that considers multiple fine-grained criteria (IsRel, IsSup, IsUse), rather than blindly using retrieved passages or relying on a single relevance score (a minimal sketch of the passage-selection step follows this list).
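To make the passage selection concrete, here is a minimal sketch, not the authors' released implementation, of how candidate continuations generated from several retrieved passages could be ranked by combining the LM's segment score with weighted, normalized reflection-token probabilities, in the spirit of Equation 4. The weight values, token labels, and data layout are illustrative assumptions.

```python
# Illustrative critique weights (the per-type weight terms in Equation 4); values are assumptions.
CRITIQUE_WEIGHTS = {"IsRel": 1.0, "IsSup": 1.0, "IsUse": 0.5}

# Most desirable reflection token per critique type (simplified, paper-style labels).
DESIRABLE = {"IsRel": "relevant", "IsSup": "fully supported", "IsUse": "5"}

def critique_score(reflection_probs: dict) -> float:
    """Weighted sum of normalized reflection-token probabilities.

    reflection_probs maps each critique type to the LM's probabilities over
    that type's tokens, e.g.
    {"IsSup": {"fully supported": 0.7, "partially": 0.2, "no support": 0.1}}.
    """
    total = 0.0
    for ctype, weight in CRITIQUE_WEIGHTS.items():
        probs = reflection_probs[ctype]
        # Probability of the desirable token, normalized over that critique type.
        total += weight * probs[DESIRABLE[ctype]] / sum(probs.values())
    return total

def select_best_segment(candidates: list) -> dict:
    """Each candidate is a continuation conditioned on one retrieved passage,
    carrying its LM segment log-probability and its predicted reflection-token
    distributions; the candidate with the highest combined score wins the beam."""
    return max(
        candidates,
        key=lambda c: c["segment_logprob"] + critique_score(c["reflection_probs"]),
    )
```

Under this scheme, a passage whose continuation is judged irrelevant or unsupported is penalized even if the raw LM likelihood is high, which is why "Retrieve top 1" (skipping this comparison entirely) degrades performance.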
6.2.2. Effects of Inference-Time Customization (Figure 3b)
Self-RAG's framework allows for customizing behavior at inference time by adjusting the weight terms w assigned to the different critique types.
- Trade-off between Citation Precision and Fluency: Figure 3b illustrates the effect on ASQA of changing the weight w for the IsSup token (which criticizes how well the output is supported by evidence).
  - Increasing w: Leads to positive effects on citation precision, as the model is incentivized to generate outputs that are strongly supported by evidence. This aligns with the goal of prioritizing factual grounding.
  - Inverse relationship with MAUVE: However, a larger w often results in lower MAUVE scores (fluency/completeness). This is because longer, more fluent generations tend to contain more claims that are not fully supported by the cited passages, consistent with prior findings.
- Conclusion: This demonstrates Self-RAG's capability for dynamic customization. Practitioners can adjust these weights at test time to achieve the desired trade-off (e.g., maximizing factual accuracy even at the cost of slightly less fluent output, or vice versa) without any additional training. A toy numerical example follows.
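As a toy illustration of this knob (all numbers invented for the example), raising the IsSup weight can flip which candidate segment wins the beam: a shorter, fully supported continuation overtakes a more fluent but weakly supported one.

```python
# Two hypothetical candidate segments for the same query (numbers are invented).
# "logprob" is the LM's segment score; "is_sup" is p(IsSup = fully supported).
candidates = {
    "fluent_but_weakly_supported": {"logprob": -1.0, "is_sup": 0.40},
    "short_but_fully_supported":   {"logprob": -1.6, "is_sup": 0.95},
}

def best(w_issup: float) -> str:
    return max(candidates, key=lambda name: candidates[name]["logprob"]
               + w_issup * candidates[name]["is_sup"])

print(best(w_issup=0.5))  # fluent_but_weakly_supported -- fluency (MAUVE) favored
print(best(w_issup=2.0))  # short_but_fully_supported  -- citation precision favored
```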
6.2.3. Efficiency and Accuracy Trade-off (Figure 3c)
The adaptive retrieval mechanism in Self-RAG allows users to control the retrieval frequency using a threshold on the Retrieve token's probability.
- Retrieval Frequency vs. Threshold (δ): Figure 3c shows how varying the threshold δ (a larger δ leads to less retrieval) dramatically changes the model's retrieval frequency on PubHealth and PopQA.
- Performance Impact:
  - On PubHealth, the performance deterioration from retrieving less is relatively small. This suggests that for some tasks the model can perform well with infrequent retrieval, because its parametric knowledge suffices for many segments.
  - On PopQA, the performance drop from retrieving less is larger, highlighting the task's stronger dependence on retrieved facts for accurate answers (especially for long-tail entities).
- Conclusion: This analysis indicates that Self-RAG offers a knob to balance inference efficiency (less retrieval means faster inference) against accuracy based on task requirements, providing practical flexibility; a minimal sketch of the thresholding follows.
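Here is a minimal sketch of this decision rule, assuming the model exposes the probabilities of the Retrieve reflection tokens (the normalization over {Yes, No} follows the paper's description; the function name and values are illustrative):

```python
def should_retrieve(p_yes: float, p_no: float, delta: float = 0.2) -> bool:
    """Trigger retrieval when the normalized probability of Retrieve=Yes
    exceeds the threshold delta; a larger delta means less frequent retrieval,
    trading some accuracy for faster inference (Figure 3c)."""
    return p_yes / (p_yes + p_no) > delta

# Long-tail PopQA-style query: the model is unsure, so retrieval fires even at a strict threshold.
assert should_retrieve(p_yes=0.9, p_no=0.1, delta=0.5)
# Common-knowledge query: under the same strict threshold, the model answers from parametric memory.
assert not should_retrieve(p_yes=0.3, p_no=0.7, delta=0.5)
```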
The following figure (Figure 4 from the original paper) shows training scale and human analysis:
The figure contains four subplots showing the effect of training data size on model performance: (a) PopQA accuracy, (b) PubHealth accuracy, and (c) ASQA citation precision. Subplot (d) shows human evaluation results on PopQA and Bio generations, including S&P, IsRel, and IsSup scores.
6.2.4. Effects of Training Data Size (Figure 4a, 4b, 4c)
The paper investigates how the amount of training data influences Self-RAG's performance.
- Data Scales: Variants of Self-RAG 7B were fine-tuned on randomly sampled subsets of 5k, 10k, 20k, and 50k instances from the original 150k training dataset.
- Upward Trajectories: Figures 4a, 4b, and 4c (for PopQA, PubHealth, and ASQA citation precision, respectively) consistently show upward trajectories: increasing the training data size generally leads to improved performance across all datasets.
- Significant Improvements: The improvements are particularly pronounced on PopQA and ASQA, suggesting that these tasks benefit greatly from more diverse and extensive training examples that reinforce the retrieval and self-reflection patterns.
- Comparison to Llama2-FT: The paper notes that Llama2-FT-7B (a baseline without Self-RAG's specific training) did not show comparable improvements when its training data increased from 50k to 150k.
- Conclusion: This indicates that Self-RAG effectively leverages larger training datasets to learn its complex adaptive and self-reflective behaviors, and that expanding the training data beyond 150k instances could potentially yield even greater improvements.
6.2.5. Human Evaluations (Figure 4d)
A small-scale human evaluation was conducted on 50 samples from PopQA and Biography generation results to assess output quality and the reliability of predicted reflection tokens.
- Plausibility & Support (S&P): Human annotators evaluated S&P, which indicates whether the output is plausible (reasonable and on-topic) and supported (verified by the provided evidence). For S&P, instances where Self-RAG predicted irrelevant or no support were excluded.
  - Results: Self-RAG outputs were found to be plausible and supported by relevant passages, with high S&P scores on short-form PopQA, aligning with findings in similar prior studies.
- Reflection Token Alignment: Annotators also checked whether the model-predicted IsRel and IsSup reflection tokens matched their own assessments (e.g., whether a fully supported token indeed corresponded to a factually supported claim).
  - Results: Human annotators found that the IsRel and IsSup predictions mostly aligned with their inspections.
- Conclusion: This human evaluation provides qualitative validation that Self-RAG not only achieves better automatic metrics but also generates outputs that humans perceive as more accurate and well-supported, and that its internal self-reflection signals are largely reliable.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduced Self-RAG (Self-Reflective Retrieval-Augmented Generation), a novel and robust framework designed to significantly enhance the quality and factuality of Large Language Model (LLM) outputs. The core innovation lies in training a single, arbitrary LLM to adaptively decide when to retrieve external knowledge, to generate text, and critically, to critique both the retrieved passages and its own generations through the use of special reflection tokens. These tokens, which are seamlessly integrated into the model's vocabulary and generation process, enable unprecedented controllability during inference, allowing Self-RAG to tailor its behavior to diverse task requirements.
Through comprehensive evaluations on six distinct tasks spanning open-domain QA, reasoning, fact verification, and long-form generation, Self-RAG (7B and 13B parameters) consistently demonstrated superior performance. It significantly outperformed state-of-the-art LLMs like ChatGPT and traditional Retrieval-Augmented Generation (RAG) models, particularly excelling in improving factuality (as measured by FactScore) and citation accuracy (precision and recall) for long-form generations. The framework's cost-effective training approach, which uses an offline critic model to augment training data, and its flexible inference-time customization capabilities further highlight its practical utility and innovation.
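To illustrate the offline data-augmentation approach mentioned above, here is a schematic sketch of how a critic could interleave reflection tokens (and a retrieved passage) into a plain instruction-output pair before generator fine-tuning. The `critic` and `retriever` interfaces and the bracketed token spellings are placeholder assumptions, not the released code, and the real pipeline operates at the finer granularity of output segments.

```python
def augment_example(x: str, y: str, critic, retriever) -> str:
    """Offline augmentation: the critic's predicted reflection tokens and a
    retrieved passage are spliced into the (instruction, output) pair, and the
    generator is later fine-tuned on the result with a standard LM loss."""
    if critic.predict_retrieve(x, y) == "Yes":
        d = retriever.top1(x)  # fetch supporting evidence for this output
        return (
            f"{x} [Retrieve=Yes] <paragraph>{d}</paragraph> "
            f"[IsRel={critic.predict_isrel(x, d)}] {y} "
            f"[IsSup={critic.predict_issup(x, d, y)}] "
            f"[IsUse={critic.predict_isuse(x, y)}]"
        )
    # No retrieval needed: keep the output and only append the usefulness token.
    return f"{x} [Retrieve=No] {y} [IsUse={critic.predict_isuse(x, y)}]"
```

Because the critic runs offline, this costs a single pass over the training corpus rather than an online reinforcement-learning loop, which is the source of the cost advantage over RLHF discussed in Section 7.3.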
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Incomplete Factual Support: Despite significant improvements, Self-RAG can still generate outputs that are not fully supported by the cited passages. The mechanism greatly reduces hallucinations but does not eliminate them entirely.
- Explicit Verification Burden: While Self-RAG provides explicit self-reflection and fine-grained attribution (e.g., through IsSup tokens), the ultimate burden of verifying factual errors still falls on the user. The model flags potential issues but does not guarantee truthfulness.
- Further Training Data Expansion: The training-size analysis showed that more data consistently improved performance, suggesting that expanding the dataset beyond the current 150k instances could yield further gains and that the current models may not have fully exploited the potential of the Self-RAG training paradigm.
- Exploring More Complex Critiques: The current set of reflection tokens covers relevance, support, and usefulness. Future work could explore more nuanced or domain-specific critique types to further enhance control and quality for specialized tasks.
7.3. Personal Insights & Critique
Self-RAG represents a highly impactful advancement in making LLMs more reliable and controllable.
- Innovation of Reflection Tokens: The concept of reflection tokens as an integral part of the generation stream is particularly insightful. It externalizes an internal "thought process" of the LLM, making its reasoning and self-assessment transparent and actionable. This moves beyond simple prompt engineering to a deeply integrated, learned mechanism for control.
- Cost-Effectiveness vs. RLHF: The offline training of the critic model and its use to augment the generator's training data is a clever strategy for achieving RLHF-like alignment benefits without the immense computational and engineering complexity of online RLHF (e.g., PPO). This makes Self-RAG more accessible and scalable for many research groups and applications.
- Inference-Time Customization: The ability to adjust the weight terms w at inference time to balance conflicting objectives (e.g., factual precision vs. fluency) is a powerful feature. This practical control knob is often missing in other LLM systems, which typically require retraining for such behavioral shifts; it turns the LLM from a black box into a more configurable tool.
- Broader Applicability: The framework's adaptability to "any arbitrary LM" suggests wide applicability. It could be integrated with various base LLMs and adapted to domains beyond those tested, wherever factuality and verifiable generation are paramount.
- Potential Issues/Areas for Improvement:
  - Defining Optimal Weights: While customizable weights offer flexibility, determining the optimal values for complex, multi-objective tasks may still require trial and error or domain expertise. Automated methods for learning or suggesting these weights would be a valuable extension.
  - Scalability of Critic Model Training: Although the critic is trained offline, the initial collection of GPT-4-generated critiques still relies on a proprietary model. Developing open-source alternatives or more efficient ways to bootstrap the critic would further enhance reproducibility and accessibility.
  - Handling Conflicting Information: The paper focuses on retrieving relevant passages. How Self-RAG handles conflicting information within retrieved passages, or conflicts between parametric knowledge and retrieved facts, would be an interesting area for deeper analysis. The IsSup token helps, but the ranking mechanism could be further refined.
  - Generalizability of Reflection Tokens: While the human evaluations were positive, the subjective nature of "usefulness" or "relevance" may vary across users and cultures. Further research into the cross-cultural applicability and robustness of reflection tokens would be beneficial.

Overall, Self-RAG offers a robust and practical solution to a critical problem in LLM deployment, paving the way for more trustworthy and controllable AI systems.