Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Published: 10/18/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The Self-RAG framework enhances large language models' quality and factual accuracy through adaptive retrieval and reflection tokens. Experiments show it significantly outperforms existing models in open-domain QA, reasoning, and fact verification tasks.

Abstract

Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

1.2. Authors

The primary authors are Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Their affiliations include the University of Washington, Allen Institute for AI, and IBM Research AI. This indicates a collaborative effort between academia and industry research.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, on October 17, 2023. While arXiv itself is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating cutting-edge research in fields like AI and computer science. Papers often appear on arXiv before undergoing formal peer review for major conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP) or journals, so at the time of posting this should be regarded as recent work that had not yet completed formal review.

1.4. Publication Year

2023

1.5. Abstract

This paper introduces Self-Reflective Retrieval-Augmented Generation (Self-RAG), a novel framework designed to improve the quality and factuality of Large Language Models (LLMs). Despite their advanced capabilities, LLMs frequently produce factually incorrect responses due to their sole reliance on parametric knowledge (information encoded within the model's weights during training). While Retrieval-Augmented Generation (RAG) offers an ad hoc solution by supplementing LLMs with external knowledge, traditional RAG methods often retrieve passages indiscriminately, which can reduce versatility or lead to unhelpful generations.

Self-RAG addresses these limitations by training an arbitrary LLM to adaptively retrieve passages on-demand. It also learns to generate and reflect on both the retrieved passages and its own generated text using special tokens called reflection tokens. These tokens, categorized into retrieval and critique types, enable the LLM to control its behavior during inference, tailoring it to diverse task requirements. The framework trains a single LLM to unify text generation and reflection token prediction.

Experiments on a diverse set of tasks, including open-domain QA, reasoning, and fact verification, demonstrate that Self-RAG (with 7B and 13B parameters) significantly outperforms state-of-the-art LLMs (like ChatGPT) and existing RAG models (like retrieval-augmented Llama2-chat). It shows substantial gains in improving factuality and citation accuracy, particularly for long-form generations.

Original Source Link: https://arxiv.org/abs/2310.11511
PDF Link: https://arxiv.org/pdf/2310.11511v1.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

Core Problem

The core problem the paper aims to solve is the prevalent issue of factual inaccuracies (also known as hallucinations) in outputs generated by Large Language Models (LLMs). LLMs primarily rely on parametric knowledge, which is the information encoded within their billions of parameters during pre-training. This parametric knowledge can be outdated, incomplete, or incorrectly memorized, leading to responses that sound fluent and confident but are factually wrong.

Importance of the Problem

In the current field of Natural Language Processing (NLP) and Artificial Intelligence (AI), LLMs are being increasingly deployed in applications that demand high accuracy, such as search, education, and content creation. Factual errors undermine trust, can spread misinformation, and are particularly problematic in sensitive domains (e.g., healthcare, finance). Therefore, improving the factuality of LLM outputs is a critical challenge.

Challenges and Gaps in Prior Research

Retrieval-Augmented Generation (RAG) methods have emerged as a prominent approach to mitigate factual errors. RAG works by retrieving relevant external documents and providing them as context to the LLM during generation, allowing the model to ground its responses in up-to-date and verified information. However, prior RAG approaches suffer from several limitations:

  1. Indiscriminate Retrieval: Many RAG systems retrieve a fixed number of documents regardless of whether retrieval is actually necessary for a given query or if the retrieved passages are truly relevant. This can lead to:
    • Diminished Versatility: For tasks not requiring factual grounding (e.g., creative writing), unnecessary retrieval adds overhead and can constrain the model's creativity.
    • Low-Quality Generations: Irrelevant or off-topic retrieved passages can "confuse" the LLM, leading to unhelpful or even incorrect responses.
  2. Lack of Consistency/Attribution: Even when relevant passages are retrieved, LLMs are not always explicitly trained to leverage and follow facts from these passages consistently. The output is not guaranteed to be fully supported by the retrieved sources, and clear citations are often missing or inaccurate.
  3. Inference Overhead: Some adaptive retrieval methods improve performance but come at the cost of runtime efficiency or rely on proprietary models.

Paper's Entry Point / Innovative Idea

The paper's innovative idea is to introduce Self-Reflective Retrieval-Augmented Generation (Self-RAG). This framework addresses the limitations of prior RAG by integrating adaptive, on-demand retrieval with a novel self-reflection mechanism. The core innovation is to train a single LLM to:

  1. Adaptively Decide Retrieval: Determine when retrieval is necessary.
  2. Critique Retrieved Passages: Evaluate the relevance and usefulness of retrieved information.
  3. Critique its Own Generations: Assess the quality, factuality, and support of its generated text against retrieved evidence. This self-reflection is achieved through reflection tokens, special tokens generated by the LLM itself, which allow for fine-grained control and customization during inference.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field:

  1. Novel Self-RAG Framework: Introduction of a new, end-to-end trainable framework that enhances LLM quality and factuality through adaptive retrieval and self-reflection. This framework allows an arbitrary LLM to learn to retrieve, generate, and critique its own outputs.

  2. Reflection Tokens for Controllability: The design and implementation of reflection tokens (including retrieval tokens and critique tokens) that are generated by the LLM itself. These tokens are seamlessly integrated into the model's vocabulary and generation process, making the LLM controllable during inference. This enables tailoring the model's behavior to diverse task requirements, such as prioritizing factual accuracy or creative expression.

  3. Cost-Effective Training Methodology: A method for training the generator LLM by incorporating critiques offline from a separate critic model. This significantly reduces training costs compared to computationally intensive methods like Reinforcement Learning from Human Feedback (RLHF) which require an online reward model during training.

  4. Significant Performance Gains: Empirical demonstration that Self-RAG (7B and 13B parameters) substantially outperforms existing state-of-the-art LLMs (e.g., ChatGPT) and traditional retrieval-augmented models across a diverse suite of tasks. This includes:

    • Open-domain QA: Improved accuracy.
    • Reasoning and Fact Verification: Enhanced performance.
    • Long-form Generations: Significant gains in factuality (measured by FactScore) and citation accuracy (precision and recall), addressing a critical weakness of many LLMs.
  5. Inference-Time Customization: The framework enables a customizable decoding algorithm that leverages reflection token probabilities to enforce soft or hard constraints. This allows users to balance trade-offs (e.g., between citation precision and completeness/fluency) at inference time without retraining the model.

    These findings collectively address the limitations of prior RAG by making retrieval adaptive, generations more factually grounded, and the overall LLM behavior more controllable and verifiable, leading to higher-quality and more trustworthy outputs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the Self-RAG paper, a foundational understanding of several key concepts in Natural Language Processing (NLP) and Machine Learning (ML) is essential.

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are a class of neural networks, typically transformer-based, trained on vast amounts of text data to predict the next word in a sequence. This pre-training enables them to learn complex language patterns, grammar, facts, and reasoning abilities.

  • Parametric Knowledge: This refers to the knowledge implicitly stored within the billions of parameters (weights and biases) of the neural network during its pre-training phase. When an LLM generates a response, it primarily draws from this internal, parametric knowledge.
  • Factual Inaccuracies / Hallucinations: A significant challenge with LLMs is that they sometimes generate information that is plausible-sounding but factually incorrect, a phenomenon often termed hallucination. This can occur because their parametric knowledge might be outdated, incomplete, or a result of probabilistic generation rather than logical deduction.
  • Fine-tuning and Instruction Tuning:
    • Fine-tuning: The process of taking a pre-trained LLM and further training it on a smaller, task-specific dataset to adapt its capabilities to a particular task (e.g., sentiment analysis, summarization).
    • Instruction Tuning: A specific type of fine-tuning where models are trained on datasets formatted as instructions (e.g., "Summarize this article:") and corresponding desired outputs. This teaches the model to follow natural language instructions effectively. Examples include Alpaca and Llama2-chat.

3.1.2. Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an architecture or paradigm that enhances LLMs by allowing them to access and incorporate information from external knowledge bases during the text generation process.

  • Core Idea: Instead of relying solely on its internal parametric knowledge, a RAG system first retrieves relevant documents or passages from a large corpus (e.g., Wikipedia, a company's internal documents) that are pertinent to a given user query. These retrieved passages are then provided as context alongside the original query to the LLM, which uses this external information to generate a more accurate, up-to-date, and grounded response.
  • Components:
    • Retriever: A component (often a dense retriever like Contriever or DPR) that takes a query and efficiently searches a large corpus to find the most semantically similar or relevant passages.
    • Generator: An LLM that takes the original query and the retrieved passages as input and generates the final text output.
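
To make the retrieve-then-generate pattern concrete, here is a minimal, self-contained Python sketch. The toy corpus, bag-of-words scorer, and prompt template are illustrative stand-ins, not the retriever or generator used in the paper.

```python
# Minimal sketch of the standard RAG pattern: retrieve top passages, prepend
# them as context, and hand the prompt to a generator LM.
from collections import Counter
import math

CORPUS = [
    "The Eiffel Tower was completed in 1889 for the Paris World's Fair.",
    "Retrieval-augmented generation grounds model outputs in external documents.",
    "The capital of Australia is Canberra, not Sydney.",
]

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Toy dense-retriever stand-in: rank passages by bag-of-words cosine similarity."""
    q = bow(query)
    return sorted(CORPUS, key=lambda p: cosine(q, bow(p)), reverse=True)[:k]

def build_prompt(query, passages):
    """Prepend retrieved passages as context, as standard RAG baselines do."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

if __name__ == "__main__":
    query = "When was the Eiffel Tower completed?"
    print(build_prompt(query, retrieve(query)))   # this prompt would go to the generator LM
```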

3.1.3. Reward Models and Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a technique used to align LLMs with human preferences and values, often to make them more helpful, honest, and harmless.

  • Reward Model: A separate neural network trained to predict a scalar reward score for a given LLM output, based on human preference data. Humans rank or rate different LLM generations, and this feedback is used to train the reward model.
  • Reinforcement Learning (RL): The reward model's output then serves as a reward signal for an LLM during an RL phase (e.g., using Proximal Policy Optimization (PPO)). The LLM is fine-tuned to maximize this reward, thereby generating outputs that are more aligned with human preferences.

3.1.4. Control Tokens

Control tokens (also known as special tokens or prompt tokens) are specific, reserved tokens added to an LLM's vocabulary during training. They serve as explicit signals or instructions to guide the model's behavior or output style during generation. For example, a [summarize] token might instruct the model to produce a summary, or a [positive] token might steer the generation toward positive sentiment.
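
As a concrete illustration of how such tokens are typically registered in practice, the sketch below adds a set of special tokens to a Hugging Face tokenizer and resizes the model's embedding matrix so the LM can emit them. The token strings and the choice of GPT-2 as the base model are illustrative assumptions, not the exact setup used by Self-RAG.

```python
# Sketch: registering control/reflection-style tokens as new vocabulary items
# with Hugging Face transformers. Token strings are illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

SPECIAL_TOKENS = [
    "[Retrieval]", "[No Retrieval]", "[Continue]",                   # retrieval decision
    "[Relevant]", "[Irrelevant]",                                    # passage relevance
    "[Fully supported]", "[Partially supported]", "[No support]",    # support level
    "[Utility:1]", "[Utility:2]", "[Utility:3]", "[Utility:4]", "[Utility:5]",  # usefulness
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add the new tokens and grow the embedding matrix accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS})
model.resize_token_embeddings(len(tokenizer))
print(len(tokenizer))   # vocabulary now includes the control tokens
```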

3.2. Previous Works

The paper contextualizes Self-RAG by discussing existing approaches to improving LLM factuality and control.

3.2.1. Traditional RAG Approaches

  • Early RAG Models (Lewis et al. 2020; Guu et al. 2020): These foundational works established the RAG paradigm. They typically retrieve a fixed number of documents once at the beginning of the generation process and prepend them to the input.
    • Limitation (addressed by Self-RAG): These methods retrieve indiscriminately, even when not needed, and don't explicitly teach the model to verify facts against the retrieved context or critique its own generation quality.
  • Instruction-Tuned RAG (Luo et al., 2023 - SAIL): Some works fine-tune LLMs on instruction datasets where retrieved passages are prepended to the input.
    • Limitation (addressed by Self-RAG): Still often relies on a fixed number of retrieved passages and lacks the dynamic, self-reflective critique mechanism of Self-RAG.
  • Joint Training of Retriever and LM (Izacard et al., 2022b): Approaches that jointly train both the retriever and the language model, followed by few-shot fine-tuning.
    • Limitation (addressed by Self-RAG): While improving integration, these still don't inherently provide the self-critique and adaptive retrieval capabilities of Self-RAG.

3.2.2. Adaptive Retrieval Approaches

More recent works have started exploring adaptive retrieval, but often with different mechanisms:

  • Adaptive Retrieval with Proprietary LLMs (Jiang et al., 2023): Propose methods to adaptively retrieve passages, often relying on a separate, proprietary LLM to decide when to retrieve or to format queries.
    • Limitation (addressed by Self-RAG): Can be expensive, lack reproducibility, and don't integrate the adaptive decision-making directly into the generation process of the target LM.
  • API Call Generation (Schick et al., 2023 - Toolformer): Train an LM to generate API calls (e.g., to Wikipedia APIs) for specific entities.
    • Limitation (addressed by Self-RAG): Focuses on structured data retrieval for named entities rather than a general, on-demand passage retrieval and comprehensive self-reflection for diverse generation tasks.

3.2.3. Concurrent RAG Works

The paper mentions several concurrent works that aim to improve RAG:

  • Retriever/LM Fine-tuning (Lin et al., 2023): Fine-tune both retriever and LM on instruction-tuning datasets.
    • Differentiation: Self-RAG also trains on diverse instruction-following data but adds on-demand retrieval and fine-grained self-reflection for selecting the best output, making it more robust and controllable.
  • Filtering/Compressing Passages (Yoran et al., 2023; Xu et al., 2023): Use external models (NLI or summarization) to filter or compress retrieved passages before feeding them to the LM.
    • Differentiation: Self-RAG processes passages in parallel and filters irrelevant ones through its own self-reflection without relying on external models at inference. Its critique mechanism also evaluates broader output quality.
  • Tree Search with Value Scores (Zhou et al., 2023 - LATS): Prompts off-the-shelf LLMs to search for information and generate with tree search guided by LM-generated value scores.
    • Differentiation: LATS uses a single overall value score, whereas Self-RAG trains an arbitrary LM to generate fine-grained self-reflection tokens for customizable inference across multiple criteria.

3.2.4. Training and Generating with Critics

  • RLHF (Ouyang et al., 2022; Wu et al., 2023): These methods train LLMs using reinforcement learning (e.g., PPO) from human feedback, often via reward models. Wu et al. (2023) explored fine-grained RLHF.
    • Differentiation: Self-RAG trains its target LM on task examples augmented with reflection tokens offline, by using a critic model to insert critiques into the training corpus. This significantly lowers training cost compared to online RLHF. While RLHF focuses on human preference alignment, Self-RAG uses reflection tokens for controllable generation at inference.
  • General Control Tokens (Lu et al., 2022; Korbak et al., 2023; Keskar et al., 2019): Other works use control tokens to guide text generation.
    • Differentiation: Self-RAG specifically uses reflection tokens not just to guide, but to decide the need for retrieval and to self-evaluate generation quality after each segment.
  • Self-Evaluation Guided Decoding (Xie et al., 2023): Propose a framework for self-evaluation, but primarily for reasoning tasks and a single evaluation dimension without retrieval.
    • Differentiation: Self-RAG integrates retrieval and multiple, fine-grained evaluation dimensions (relevance, support, utility).
  • LLM Refinement (Dhuliawala et al., 2023 - CoVE; Madaan et al., 2023; Paul et al., 2023): Involves iterative prompting of a model to generate output, natural language feedback, and then a refined output.
    • Differentiation: These methods incur higher inference costs due to their iterative nature. Self-RAG integrates the critique process more directly into the generation flow, enabling more efficient single-pass generation with self-correction.

3.3. Technological Evolution

The evolution of technology in this domain can be traced as follows:

  1. Pre-trained LLMs: Started with general-purpose language models trained on massive text corpora (e.g., GPT-3, Llama). These models excel at fluency but struggle with factuality due to reliance on static parametric knowledge.
  2. Instruction-Tuned/Chat Models: Further training to align LLMs with human instructions and conversational patterns (e.g., ChatGPT, Llama2-chat, Alpaca). These improved usability but did not fundamentally solve the hallucination problem.
  3. Traditional RAG: Introduced external knowledge retrieval to LLMs to enhance factuality. Initial methods were often fixed and non-adaptive.
  4. Adaptive RAG & Tool-Use: Began to explore more dynamic retrieval strategies and integration of external tools/APIs.
  5. Self-RAG: Represents a significant step forward by integrating adaptive retrieval directly into the LLM's self-awareness through reflection tokens, allowing the model to critique itself and its inputs on-demand, leading to unprecedented control and factuality improvements in a cost-effective manner.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of Self-RAG are:

  • Adaptive and On-demand Retrieval: Unlike traditional RAG that retrieves indiscriminately, Self-RAG explicitly trains the LLM to predict when retrieval is necessary via a Retrieve reflection token. This enhances versatility and efficiency.
  • Integrated Self-Reflection and Critique: Self-RAG introduces critique tokens (IsReL, IsSUP, IsUsE) that allow the model to evaluate the relevance of retrieved passages and the factuality/utility of its own generated segments. This is a fine-grained, internal self-assessment mechanism, unlike external models used in some concurrent works or iterative refinement that increases inference costs.
  • End-to-End Training of a Single LLM: Self-RAG trains a single arbitrary LLM to generate both task output and reflection tokens by expanding its vocabulary. This makes the entire process cohesive and allows the LLM to learn the interplay between generation, retrieval, and critique.
  • Offline Critique Insertion for Training Efficiency: Instead of costly online RLHF with a live reward model, Self-RAG uses a pre-trained critic model to insert reflection tokens into the training data offline. The generator then learns from this augmented corpus using a standard language modeling objective, significantly reducing training overhead.
  • Inference-Time Controllability without Retraining: The reflection tokens and the segment-level beam search with customizable weight terms w^G enable users to tailor the model's behavior at inference time (e.g., prioritizing factuality or fluency) without needing to retrain the model. This offers a flexible control mechanism not typically found in other RAG or RLHF approaches.
  • Enhanced Factuality and Citation Accuracy: Empirically, Self-RAG demonstrates superior performance in grounding generations, leading to higher FactScore and significantly improved citation precision and recall compared to leading LLMs and RAG baselines.

4. Methodology

The Self-Reflective Retrieval-Augmented Generation (Self-RAG) framework enhances an LLM's quality and factuality by integrating adaptive retrieval and self-reflection directly into its generation process.

4.1. Principles

The core idea behind Self-RAG is to train a single Large Language Model (LLM) to not only generate text but also to think about its generation process and its retrieved information. This "thinking" is manifested through the generation of special reflection tokens. These tokens act as internal signals, prompting the model to decide whether to retrieve external information, evaluate the relevance of that information, and then critique the quality and factual support of its own generated text. This allows for dynamic, on-demand retrieval and self-correction, leading to more accurate and controllable outputs.

4.2. Core Methodology In-depth (Layer by Layer)

The Self-RAG framework involves training a generator LLM (M) and utilizing a retriever (R) and a critic model (C) during the training data augmentation phase. The overall process is summarized in Figure 1 and Algorithm 1.

The problem is formalized as training a generator LM M to sequentially generate textual outputs y composed of multiple segments y = [y_1, ..., y_T], where y_t is the sequence of tokens for the t-th segment. These generated tokens can be regular text or the special reflection tokens.

The four types of reflection tokens used in Self-RAG are detailed in Table 1:

The following are the results from Table 1 of the original paper:

| Type | Input | Output | Definition |
| --- | --- | --- | --- |
| Retrieve | x / x, y | {yes, no, continue} | Decides when to retrieve with R. |
| IsReL | x, d | {relevant, irrelevant} | d provides useful information to solve x. |
| IsSUP | x, d, y | {fully supported, partially supported, no support} | All of the verification-worthy statements in y are supported by d. |
| IsUsE | x, y | {5, 4, 3, 2, 1} | y is a useful response to x. |

Each type uses several tokens to represent its output values. The bold text in the original table (e.g., yes, relevant, fully supported) indicates the most desirable critique tokens. Here, x denotes the input, y the output, and d a relevant passage.
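
To make the table easier to refer back to in the scoring discussion later, the sketch below encodes the four reflection-token groups as plain Python data; the token spellings and the "best" markers are an illustrative reading of Table 1, not the exact strings used in the released model.

```python
# Sketch: Table 1 as data. "best" marks the most desirable critique value,
# which the decoding algorithm later rewards.
REFLECTION_GROUPS = {
    "Retrieve": {                      # decides when to call the retriever R
        "inputs": "x or (x, y)",
        "values": ["yes", "no", "continue"],
    },
    "IsReL": {                         # does passage d help solve x?
        "inputs": "(x, d)",
        "values": ["relevant", "irrelevant"],
        "best": "relevant",
    },
    "IsSUP": {                         # is the generation backed by passage d?
        "inputs": "(x, d, y)",
        "values": ["fully supported", "partially supported", "no support"],
        "best": "fully supported",
    },
    "IsUsE": {                         # overall usefulness of y for x
        "inputs": "(x, y)",
        "values": ["5", "4", "3", "2", "1"],
        "best": "5",
    },
}
```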

4.2.1. Inference Overview

During inference, Self-RAG operates step by step to generate content, adaptively retrieving and critiquing. As shown in Figure 1 (right) and Algorithm 1, for each input prompt x and preceding generation y_{<t} (i.e., the segments generated before the current segment t):

The following figure (Figure 1 from the original paper) provides an overview of SELF-RAG:

Figure 1: Overview of SELF-RAG. SELF-RAG learns to retrieve, critique, and generate text passages to enhance overall generation quality, factuality, and verifiability.

Algorithm 1 SELF-RAG Inference

Input: Generator LM M, Retriever R, large-scale passage collection {d_1, ..., d_N}
Input: input prompt x and preceding generation y_{<t}
Output: next output segment y_t

  1. M predicts Retrieve given (x, y_{<t}).
    • Explanation: The generator LM M first assesses whether retrieving external knowledge would be beneficial for generating the next segment. It does this by predicting a Retrieve token, which can be yes, no, or continue. This is a crucial adaptive decision point.
  2. if Retrieve == Yes then
    • Explanation: If the model decides that retrieval is necessary, it proceeds to activate the retriever.
  3. Retrieve relevant text passages D using R given (x, y_{t-1}).
    • Explanation: The retriever R (e.g., Contriever-MS MARCO) is called to find the top relevant passages D from a large collection, based on the input prompt x and the last generated segment y_{t-1}.
  4. M predicts IsReL given (x, d) and y_t given (x, d, y_{<t}) for each d in D. (Generate)
    • Explanation: For each retrieved passage d in D, the generator M concurrently performs two actions:
      • It predicts an IsReL token to evaluate whether the passage d is relevant or irrelevant to the input x. This filters out unhelpful passages.
      • It generates a candidate next output segment y_t, taking into account the input x, the specific retrieved passage d, and the preceding generation y_{<t}.
  5. M predicts IsSUP and IsUsE given (x, y_t, d) for each d in D. (Critique)
    • Explanation: After generating a candidate y_t with respect to a passage d, the model self-reflects further. It predicts:
      • An IsSUP token to assess whether the generated segment y_t is fully supported, partially supported, or has no support from the retrieved passage d. This is key for factual grounding.
      • An IsUsE token to evaluate the overall usefulness of the segment y_t as a response to x. This token uses a numerical scale (1-5).
  6. Rank the candidate y_t based on IsReL, IsSUP, and IsUsE. (Detailed in Section 3.3 of the original paper.)
    • Explanation: With multiple candidate segments y_t generated (one for each relevant retrieved passage) and their associated critique tokens (IsReL, IsSUP, IsUsE), Self-RAG ranks these candidates to select the best one. This ranking is guided by a segment score that combines the probabilities of the critique tokens.
  7. else if Retrieve == No then
    • Explanation: If the model decides that retrieval is not necessary, it bypasses the retrieval step.
  8. M_gen predicts y_t given x. (Generate)
    • Explanation: The generator produces the next segment y_t directly, like a standard LLM, without external passages.
  9. M_gen predicts IsUsE given (x, y_t). (Critique)
    • Explanation: Even without retrieval, the model still critiques its output. It predicts an IsUsE token to evaluate the utility of the generated segment y_t relative to the input x.

      This adaptive and self-reflective inference process allows Self-RAG to tailor its behavior, ensuring retrieval only when needed and enabling selection of the most accurate and supported generation.
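
The following minimal Python sketch mirrors the control flow of Algorithm 1 for a single segment. All model and retriever calls are replaced by toy stubs (random probabilities, canned strings), so only the decision structure reflects the algorithm; it is not the authors' implementation.

```python
import random

def predict_retrieve(x, y_prev):
    """Stub for the normalized probability of the Retrieve = yes token."""
    return random.random()

def retrieve(x, y_prev, k=3):
    """Stub retriever returning k placeholder passages."""
    return [f"passage {i} related to: {x}" for i in range(k)]

def generate_with_passage(x, d, y_prev):
    """Stub for step 4: a candidate segment plus an IsReL probability."""
    return f"segment grounded in '{d}'", random.random()

def critique(x, y_t, d):
    """Stub for steps 5/9: IsSUP and IsUsE probabilities."""
    return random.random(), random.random()

def segment_score(lm_logprob, is_rel, is_sup, is_use, w=(1.0, 1.0, 0.5)):
    """f(y_t, d, Critique) = log p(y_t | x, d, y_<t) + weighted critique score."""
    return lm_logprob + w[0] * is_rel + w[1] * is_sup + w[2] * is_use

def self_rag_step(x, y_prev="", threshold=0.2):
    """One segment of Algorithm 1: adaptive retrieval, generation, critique, ranking."""
    if predict_retrieve(x, y_prev) > threshold:                  # steps 1-2
        candidates = []
        for d in retrieve(x, y_prev):                            # step 3
            y_t, is_rel = generate_with_passage(x, d, y_prev)    # step 4
            is_sup, is_use = critique(x, y_t, d)                 # step 5
            lm_logprob = -1.0                                    # stub LM score
            candidates.append((segment_score(lm_logprob, is_rel, is_sup, is_use), y_t))
        return max(candidates)[1]                                # step 6: best candidate
    y_t = f"direct answer to: {x}"                               # step 8: no retrieval
    critique(x, y_t, None)                                       # step 9: utility check only
    return y_t

print(self_rag_step("Who wrote Middlemarch?"))
```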

4.2.2. Training Overview

Self-RAG trains an arbitrary LLM to generate text interleaved with reflection tokens by treating these tokens as part of an expanded vocabulary. The training process involves two main stages:

  1. Training a Critic Model (C): This model is trained to predict the reflection tokens for evaluating retrieved passages and generation quality.
  2. Training a Generator Model (M): This model learns to generate both the task output and the reflection tokens by training on a curated corpus augmented with passages retrieved by a retriever (R) and reflection tokens predicted by the critic model (C). The key here is that the critic model is used offline to prepare the training data, avoiding the high cost of online reinforcement learning.

4.2.3. Training the Critic Model (C\mathcal{C})

Data Collection for Critic Model

Manually annotating reflection tokens for every segment is labor-intensive. To overcome this, the authors leverage a powerful proprietary LLM, GPT-4 (OpenAI, 2023), to generate the initial feedback.

  • For each type of reflection token (e.g., Retrieve, IsReL, IsSUP, IsUsE), instances are randomly sampled from the original training data {X, Y}.
  • GPT-4 is prompted with type-specific instructions and few-shot demonstrations I, along with the original task input x and output y, to predict an appropriate reflection token as text (e.g., "yes", "relevant", "fully supported").
    • For example, for the Retrieve token, GPT-4 is prompted with: "Given an instruction, make a judgment on whether finding some external documents from the web helps to generate a better response."
  • The authors collect between 4k and 20k supervised training data points for each reflection token type and combine them to form the training dataset for the critic model, D_critic.

Critic Learning

After collecting D_critic, a pre-trained LLM (e.g., Llama 2-7B) is initialized as the critic model C. It is then trained on D_critic with a standard conditional language modeling objective, maximizing the likelihood of predicting the correct reflection tokens.

The objective function for training the critic model C is:

\max_{\mathcal{C}} \; \mathbb{E}_{((x, y), r) \sim \mathcal{D}_{critic}} \log p_{\mathcal{C}}(r \mid x, y)

Where:

  • max over C: Denotes maximizing the objective with respect to the parameters of the critic model C.

  • E: Represents the expected value (average) over the training dataset.

  • ((x, y), r) ~ D_critic: Indicates that (x, y) is an input-output pair from the training data and r is the corresponding reflection token label generated by GPT-4; D_critic is the collection of these triplets.

  • log p_C(r | x, y): Is the log-likelihood of the critic model C predicting the reflection token r given the input x and output y. Maximizing this term trains C to reproduce the reflection tokens r that GPT-4 provided.

    The trained critic model C achieves high agreement (over 90%) with GPT-4's predictions on most reflection token categories, ensuring its reliability for generating training data for the generator.
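
The snippet below sketches how one critic training example could be serialized so that the conditional objective above only scores the reflection token r: the conditioning text (instruction, x, y) is masked with the conventional -100 ignore label, leaving the loss on r alone. The prompt wording and word-level tokenization are assumptions for illustration, not the authors' exact format.

```python
def toy_tokenize(text):
    """Word-level stand-in for the real subword tokenizer."""
    return text.split()

def build_critic_example(instruction, x, y, r):
    """Serialize one D_critic instance; only the reflection token r is a loss target."""
    prompt_tokens = toy_tokenize(f"{instruction} Input: {x} Output: {y} Judgment:")
    target_tokens = toy_tokenize(r)                  # e.g. "fully supported"
    input_tokens = prompt_tokens + target_tokens
    # -100 is the conventional "ignore" label; in a real setup these would be token ids.
    labels = [-100] * len(prompt_tokens) + target_tokens
    return input_tokens, labels

tokens, labels = build_critic_example(
    "Judge whether the output is supported by the passage.",
    x="Who wrote Middlemarch?",
    y="George Eliot wrote Middlemarch.",
    r="fully supported",
)
print(labels)   # -100 for every conditioning token, the reflection token(s) as targets
```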

4.2.4. Training the Generator Model (M\mathcal{M})

Data Collection for Generator

The training data for the generator model, D_gen, is created by augmenting the original input-output pairs (x, y) with retrieved passages and reflection tokens, mimicking the Self-RAG inference process.

  • For each segment y_t within the output y:

    • The trained critic model C is used to assess whether retrieval is needed.
    • If C predicts that retrieval is needed, the special token Retrieve = Yes is added to the data.
    • Then, the retriever R retrieves the top K passages, D.
    • For each passage d in D, C evaluates its relevance and predicts IsReL. If relevant, C further evaluates whether the passage supports the model's generation y_t and predicts IsSUP. These critique tokens are appended after the retrieved passage or generation.
    • Finally, at the end of the output y, C predicts the overall utility token IsUsE.
  • The augmented output, including the original input-output pair, retrieved passages, and all generated reflection tokens, forms an instance in D_gen.

    The following figure (Figure 2 from the original paper) illustrates SELF-RAG training examples:

    Figure 2: SELF-RAG training examples. The left example does not require retrieval while the right one requires retrieval; thus, passages are inserted. More examples are in Appendix Table 4. (Figure description: a schematic of the Self-RAG training workflow; the left side shows an input-output example, the right side shows how retrieved passages and critic responses augment the output, with reflection tokens letting the model adapt its behavior to the task.)

Figure 2 shows examples of Self-RAG training data. The left example does not require retrieval, so no passages are inserted. The right example requires retrieval, and thus retrieved passages are interleaved with reflection tokens (IsReL, IsSUP).

Generator Learning

The generator model M (e.g., Llama2 7B or 13B) is trained on the curated corpus D_gen using a standard next-token prediction objective. This means M learns to predict both the target output text and the reflection tokens.

The objective function for training the generator model M is:

\max_{\mathcal{M}} \; \mathbb{E}_{(x, y, r) \sim \mathcal{D}_{gen}} \log p_{\mathcal{M}}(y, r \mid x)

Where:

  • max over M: Denotes maximizing the objective with respect to the parameters of the generator model M.

  • E: Represents the expected value (average) over the training dataset.

  • (x, y, r) ~ D_gen: Indicates that x is the input and (y, r) is the augmented output (the text y interleaved with reflection tokens r) from the training dataset D_gen.

  • log p_M(y, r | x): Is the log-likelihood of the generator model M predicting the sequence of target tokens (y, r) given the input x. By maximizing this, M learns to generate both the desired text and the appropriate reflection tokens.

    Crucially, during training, the actual retrieved text chunks (e.g., the spans surrounded by <p> and </p> tags in Figure 2) are masked out for loss calculation. The model thus learns to predict the reflection tokens and the generated text, but the loss is not computed on the retrieved passages themselves, only on the model's response to them and the meta-information (reflection tokens). The original vocabulary of the LLM is expanded with the set of reflection tokens (the Critique and Retrieve tokens) so that the model can learn to generate them.

This training approach enables M to generate reflection tokens itself at inference time without needing the critic C, making the entire system self-contained and controllable. This method is also more cost-effective than RLHF because the critiques are pre-computed offline.
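
A small sketch of the loss masking described above: tokens belonging to retrieved passages (delimited here by <p> and </p>) are replaced with the conventional -100 ignore label so they do not contribute to the next-token-prediction loss, while the model's own text and the reflection tokens remain as targets. The example sequence, token strings, and word-level tokenization are illustrative assumptions.

```python
def build_generator_labels(tokens):
    """Return loss labels: -100 for retrieved-passage tokens, the token itself otherwise."""
    labels, in_passage = [], False
    for tok in tokens:
        if tok == "<p>":
            in_passage = True
        labels.append(-100 if in_passage else tok)   # passage spans excluded from the loss
        if tok == "</p>":
            in_passage = False
    return labels

sequence = ("[Retrieval] <p> Overfitting happens when a model memorizes noise . </p> "
            "[Relevant] Overfitting occurs when a model fits noise in the training data . "
            "[Fully supported] [Utility:5]").split()
print(build_generator_labels(sequence))
```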

4.2.5. SELF-RAG Inference (Customization)

The ability of Self-RAG to generate reflection tokens to self-evaluate its own output is a key feature for controllability during inference. This allows the model's behavior to be tailored to diverse task requirements.

Adaptive Retrieval with Threshold

Self-RAG dynamically decides whether to retrieve text passages by predicting the Retrieve token.

  • Mechanism: The model predicts the probability of generating Retrieve = Yes. If this probability, normalized over all possible Retrieve output tokens (Yes, No, Continue), surpasses a designated threshold, retrieval is triggered.
  • Customization: A threshold can be set (e.g., 0.2 for most tasks, 0 for ALCE where citation is mandatory). A larger threshold means less frequent retrieval, useful for open-ended creative tasks. A smaller threshold (or 0) encourages more frequent retrieval for factual accuracy-demanding tasks.
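
A minimal sketch of this thresholding rule, assuming access to the (unnormalized) probabilities the model assigns to the Retrieve output tokens; the numbers below are made up.

```python
def should_retrieve(retrieve_token_probs, threshold=0.2):
    """retrieve_token_probs: probabilities for the 'yes', 'no', 'continue' Retrieve tokens."""
    p_yes = retrieve_token_probs["yes"] / sum(retrieve_token_probs.values())
    return p_yes > threshold

probs = {"yes": 0.15, "no": 0.60, "continue": 0.05}
print(should_retrieve(probs))                 # False: normalized p(yes) ≈ 0.19 < 0.2
print(should_retrieve(probs, threshold=0.0))  # True: a zero threshold forces retrieval (ALCE setting)
```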

Tree-decoding with Critique Tokens

When retrieval is required (either because the model predicts Retrieve = Yes or because a hard condition forces it), Self-RAG employs a segment-level beam search to select the best continuation.

  1. Parallel Passage Processing: The retriever R retrieves K passages. The generator M then processes each of these K passages in parallel, generating K different candidate continuations y_t.

  2. Segment-level Beam Search: A beam search (with beam size B) is conducted at each segment step t to find the top-B segment continuations. At the end of generation, the best sequence is returned.

  3. Critique-Guided Scoring: The score of each candidate segment y_t (generated with respect to a specific passage d) is updated using a critic score S(Critique). This score is a linear weighted sum of the normalized probabilities of the critique tokens predicted for that segment.

    The segment score for y_t with passage d and associated critique tokens is calculated as:

    f(y_t, d, \text{Critique}) = p(y_t \mid x, d, y_{<t}) + \mathcal{S}(\text{Critique})

    Where:

  • f(y_t, d, Critique): The overall score for the segment y_t, considering the passage d and the predicted critique tokens.

  • p(y_t | x, d, y_{<t}): The log-probability of the generator M generating the segment y_t given the input x, retrieved passage d, and preceding generation y_{<t}. This is the standard language model score.

  • S(Critique): The critique score derived from the reflection tokens, which explicitly incorporates self-reflection into the segment selection.

    The critique score S(Critique) is a weighted sum of scores from the different critique token groups:

    \mathcal{S}(\text{Critique}) = \sum_{G \in \mathcal{G}} w^G s_t^G \quad \text{for} \quad \mathcal{G} = \{\text{IsReL}, \text{IsSUP}, \text{IsUsE}\}

    Where:

  • S(Critique): The total critique score.

  • G (script): The set of critique token groups, which includes IsReL (relevance), IsSUP (support), and IsUsE (usefulness).

  • w^G: A hyperparameter representing the weight assigned to each critique token group G. These weights are adjustable at inference time to customize model behavior. For example, a higher w^{IsSUP} prioritizes factual support.

  • s_t^G: The normalized probability of the most desirable token within critique group G at timestamp t.

    The normalized probability s_t^G for a critique token group G is calculated as:

    s_t^G = \frac{p_t(\hat{r})}{\sum_{i=1}^{N^G} p_t(r_i)}

    Where:

  • p_t(r-hat): The probability of generating the most desirable reflection token r-hat (e.g., Relevant for IsReL, Fully Supported for IsSUP, 5 for IsUsE) within group G at timestamp t.

  • The denominator: The sum of the probabilities of all N^G distinct tokens r_i representing the possible values of critique group G. This normalizes the score.
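
Putting the formulas together, the sketch below computes s_t^G for each critique group, combines them with the default weights, and adds the LM log-probability to obtain f(y_t, d, Critique). The token probabilities are illustrative inputs, not outputs of the released model.

```python
DESIRABLE = {"IsReL": "relevant", "IsSUP": "fully supported", "IsUsE": "5"}
DEFAULT_WEIGHTS = {"IsReL": 1.0, "IsSUP": 1.0, "IsUsE": 0.5}

def normalized_score(group_probs, group):
    """s_t^G: probability of the most desirable token, normalized over the group."""
    return group_probs[DESIRABLE[group]] / sum(group_probs.values())

def segment_score(lm_logprob, critique_probs, weights=DEFAULT_WEIGHTS):
    """f(y_t, d, Critique) = log p(y_t | x, d, y_<t) + sum_G w^G * s_t^G."""
    s_critique = sum(weights[g] * normalized_score(p, g) for g, p in critique_probs.items())
    return lm_logprob + s_critique

critique_probs = {
    "IsReL": {"relevant": 0.8, "irrelevant": 0.2},
    "IsSUP": {"fully supported": 0.5, "partially supported": 0.3, "no support": 0.2},
    "IsUsE": {"5": 0.4, "4": 0.3, "3": 0.2, "2": 0.07, "1": 0.03},
}
print(segment_score(-1.2, critique_probs))   # higher is better when ranking candidate segments
```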

Customization and Control:

  • Soft Constraints: By adjusting the weights w^G, users can softly guide the model's preferences. For example, increasing w^{IsSUP} makes the model prioritize outputs that are strongly supported by evidence, even if it might slightly reduce fluency (since longer, more fluent generations tend to contain more unsubstantiated claims).

  • Hard Constraints: Alternatively, one can explicitly filter out segment continuations when the model generates an undesirable critique token (e.g., discarding a segment if IsSUP predicts No support).

    This mechanism provides fine-grained control over the generation process, allowing Self-RAG to adapt to specific user requirements or task objectives without requiring further training.

5. Experimental Setup

5.1. Datasets

The experiments for Self-RAG are conducted on a diverse set of six tasks categorized into closed-set, short-form generation, and long-form generation, to holistically evaluate performance across correctness, factuality, and fluency.

5.1.1. Closed-set Tasks

These tasks involve selecting a correct answer from a predefined set, often testing reasoning or factual knowledge directly.

  • PubHealth (Zhang et al., 2023): A fact verification dataset focused on public health. Models need to determine the factual correctness of statements related to health information.
  • ARC Challenge (Clark et al., 2018): A multiple-choice reasoning dataset derived from scientific exams. It requires models to answer complex scientific questions, often needing multi-step reasoning.
  • Data Characteristics: Both datasets require precise factual understanding and reasoning capabilities. Accuracy is the primary metric. The paper aggregates answer probabilities for target classes.

5.1.2. Short-form Generation Tasks

These tasks involve generating concise, factual answers to questions.

  • PopQA (Mallen et al., 2023): An open-domain question answering dataset. The paper specifically uses the long-tail subset, which consists of 1,399 queries about rare entities (Wikipedia page views less than 100 per month). This dataset tests the model's ability to recall or retrieve information about less common facts.
  • TriviaQA-unfiltered (Joshi et al., 2017): Another open-domain question answering dataset, focusing on factual knowledge. The paper uses 11,313 test queries from the validation and test split, following prior work.
  • Data Characteristics: Both require accurate factual recall or retrieval. Evaluation focuses on whether gold answers are included in the model's generation, rather than strict exact matching, to account for variations in phrasing.

5.1.3. Long-form Generation Tasks

These tasks require generating longer, coherent texts, often demanding high factuality and proper attribution.

  • Biography Generation (Min et al., 2023): A task where the model generates biographical texts.
  • ALCE-ASQA (Gao et al., 2023; Stelmakh et al., 2022): A long-form QA task that requires generating detailed answers, often with multiple pieces of information, and crucially, demands citation accuracy.
  • Data Characteristics: These tasks heavily emphasize factuality, coherence, fluency, and the ability to correctly attribute information to sources (citations).

5.2. Evaluation Metrics

For every evaluation metric mentioned, a complete explanation is provided.

5.2.1. Accuracy (acc)

  • Conceptual Definition: Accuracy measures the proportion of correct predictions (or answers) out of the total number of predictions made. It is a fundamental metric for classification and question-answering tasks where there is a single correct output.
  • Mathematical Formula:

    \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
  • Symbol Explanation:
    • Number of Correct Predictions: The count of instances where the model's output matches the ground truth.
    • Total Number of Predictions: The total count of instances in the dataset for which the model made a prediction.

5.2.2. FactScore (FS)

  • Conceptual Definition: FactScore is a metric designed to evaluate the factuality of long-form text generations, particularly biographies. It aims to quantify how many claims made in the generated text are factually correct and supported by evidence, rather than being hallucinations. A higher FactScore indicates higher factual accuracy.
  • Mathematical Formula: The paper refers to Min et al. (2023) for FactScore but does not provide its specific calculation formula in the main text. Generally, FactScore involves extracting atomic facts (claims) from the generated text and then verifying each claim against a knowledge base or retrieved evidence. The score often reflects the percentage of verifiable and correct facts.
  • Symbol Explanation: Not provided in the paper's main text.

5.2.3. Correctness (str-em, rg)

  • Conceptual Definition:
    • str-em (string exact match): Measures if the generated answer string is an exact match to the reference answer. It's a very strict metric, requiring character-level identity.
    • rouge (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics (e.g., ROUGE-1, ROUGE-2, ROUGE-L) used for evaluating summarization and machine translation. It measures the overlap of n-grams (sequences of n words) or longest common subsequences between the generated text and a set of reference texts. Higher ROUGE scores indicate greater similarity to the reference.
  • Mathematical Formula: The paper refers to these metrics but does not provide their specific calculation formulas in the main text.
    • For str-em, it's a binary outcome (1 if exact match, 0 otherwise), averaged over all instances.
    • For ROUGE, formulas vary by type (e.g., ROUGE-N is based on n-gram overlap, ROUGE-L on longest common subsequence).
      • Example ROUGE-N (general form):

        \text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{S \in \text{References}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})}
  • Symbol Explanation:
    • S: A sentence in the reference set.
    • n-gram: A contiguous sequence of N words.
    • Count_match(n-gram): The maximum number of n-grams co-occurring in the candidate generation and the set of reference summaries.
    • Count(n-gram): The number of n-grams in the reference summaries.
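
As a concrete illustration of the ROUGE-N recall formula above, here is a toy single-reference implementation with word-level tokenization and no stemming; production evaluations use the official ROUGE packages rather than this sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand, ref = ngrams(candidate.lower().split(), n), ngrams(reference.lower().split(), n)
    overlap = sum(min(cand[g], ref[g]) for g in ref)   # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)         # recall over reference n-grams

print(rouge_n("the cat sat on the mat", "the cat is on the mat"))  # 5/6 ≈ 0.83
```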

5.2.4. Fluency (MAUVE)

  • Conceptual Definition: MAUVE is a metric for evaluating the quality of open-ended text generation, specifically its fluency (how natural and grammatical the text is) and diversity. It compares the distribution of generated text to the distribution of human reference text. A higher MAUVE score indicates that the generated text is more similar in quality and style to human-written text.
  • Mathematical Formula: The paper refers to Pillutla et al. (2021) for MAUVE but does not provide its specific calculation formula in the main text. It's a complex metric based on comparing statistical properties of text distributions.
  • Symbol Explanation: Not provided in the paper's main text.

5.2.5. Citation Precision (prec) and Recall (rec)

  • Conceptual Definition: These metrics are crucial for evaluating how well a model attributes its generated content to provided sources, particularly in RAG systems.
    • Citation Precision: Measures the proportion of cited statements in the generated text that are actually supported by the corresponding cited passages. A high precision means few "hallucinated" or unsubstantiated citations.
    • Citation Recall: Measures the proportion of all verifiable statements in the generated text that should have been cited (i.e., are factual and supported by available passages) and were correctly cited. A high recall means the model is good at attributing all derivable facts.
  • Mathematical Formula: The paper refers to Gao et al. (2023) for these metrics but does not provide their specific calculation formulas in the main text. Generally:
    • Precision:

      \text{Citation Precision} = \frac{\text{Number of Correct Citations}}{\text{Total Number of Citations Made}}
    • Recall:

      \text{Citation Recall} = \frac{\text{Number of Correct Citations}}{\text{Total Number of Citable Statements in the Generation}}
  • Symbol Explanation: Not provided in the paper's main text.
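
The sketch below illustrates the two citation metrics given per-statement judgments (whether a statement was cited, whether its citation supports it, and whether it is citable at all); in practice such judgments come from an NLI-style checker as in Gao et al. (2023), and the example values here are made up.

```python
def citation_precision(statements):
    """Fraction of cited statements whose citations actually support them."""
    cited = [s for s in statements if s["cited"]]
    return sum(s["supported"] for s in cited) / max(len(cited), 1)

def citation_recall(statements):
    """Fraction of citable statements that were cited with supporting evidence."""
    citable = [s for s in statements if s["citable"]]
    return sum(s["cited"] and s["supported"] for s in citable) / max(len(citable), 1)

stmts = [
    {"cited": True,  "supported": True,  "citable": True},
    {"cited": True,  "supported": False, "citable": True},
    {"cited": False, "supported": False, "citable": True},
]
print(citation_precision(stmts), citation_recall(stmts))  # 0.5, ~0.33
```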

5.3. Baselines

The paper compares Self-RAG against a comprehensive set of baselines, categorized by their use of retrieval and proprietary data.

5.3.1. Baselines without Retrieval

These models rely solely on their internal parametric knowledge.

  • Open-source Pre-trained/Instruction-tuned LLMs:
    • Llama2-7B, Llama2-13B (Touvron et al., 2023): Foundational LLMs.
    • Alpaca-7B, Alpaca-13B (Dubois et al., 2023): Instruction-tuned versions of Llama models (replicated by the authors).
  • Proprietary/Reinforced LLMs:
    • ChatGPT (Ouyang et al., 2022): A widely used instruction-tuned and RLHF-aligned LLM.
    • Llama2-chat-13B (Touvron et al., 2023): An RLHF-aligned chat model.
  • Concurrent work:
    • CoVE-65B (Dhuliawala et al., 2023): Iteratively prompts a 65B parameter LLM to refine outputs, focusing on factuality.

5.3.2. Baselines with Retrieval

These models integrate external knowledge retrieval.

  • Standard RAG Baselines:
    • Llama2-7B, Alpaca-7B, Llama2-13B, Alpaca-13B augmented with retrieval at test time. This involves prepending top retrieved documents to the query before feeding it to the LLM. They use the same retriever as Self-RAG.
  • Fine-tuned without Reflection:
    • Llama2-FT-7B: A Llama2-7B model fine-tuned on the same training data as Self-RAG but without reflection tokens or retrieved passages, and then augmented with retrieval at test time. This baseline helps isolate the gains attributed to the Self-RAG framework itself rather than just the training data.
  • Proprietary/Production RAG Systems:
    • Ret-ChatGPT: ChatGPT augmented with retrieval.
    • Ret-Llama2-chat: Llama2-chat augmented with retrieval.
    • perplexity.ai: An InstructGPT-based production search system that inherently uses retrieval.
  • Concurrent Trained RAG Methods:
    • Toolformer-6B (Schick et al., 2023): An LLM pre-trained to generate API calls for named entities.
    • SAIL-7B (Luo et al., 2023): An LLM instruction-tuned on Alpaca data with top retrieved documents inserted before instructions.

5.4. Experimental Settings

5.4.1. Training Data and Settings

  • Training Data: A total of 150k instruction-output pairs were used. This dataset was constructed from:
    • Open-Instruct processed data (Wang et al., 2023): A collection of diverse instruction-following examples.
    • Knowledge-intensive datasets (Petroni et al., 2021; Stelmakh et al., 2022; Mihaylov et al., 2018): Datasets designed to test knowledge-based question answering and generation.
  • Base LLMs:
    • Generator (M\mathcal{M}): Llama2 7B and Llama2 13B models (Touvron et al., 2023) were used.
    • Critic (C\mathcal{C}): Llama2 7B was used as the base model for the critic.
  • Retriever (R\mathcal{R}):
    • Contriever-MS MARCO (Izacard et al., 2022a) was used as the default off-the-shelf retriever. It retrieves up to ten documents for each input.
    • For biographies and open-domain QA, an additional five documents retrieved by a web search engine were used, following Luo et al. (2023).
    • For ASQA, the authors used five documents provided by GTR-XXL (Ni et al., 2022) across all baselines to ensure a fair comparison.

5.4.2. Inference Settings

  • Critique Weight Terms: As a default configuration, the weight terms w^G for the critique tokens were set as follows:
    • IsReL (Relevant/Irrelevant): w^{IsReL} = 1.0
    • IsSUP (Fully Supported/Partially Supported/No Support): w^{IsSUP} = 1.0
    • IsUsE (5/4/3/2/1): w^{IsUsE} = 0.5
  • Retrieval Threshold:
    • For most tasks, the retrieval threshold was set to 0.2. This means retrieval is triggered if the normalized probability of Retrieve = Yes exceeds 0.2.
    • For ALCE (due to its strong citation requirements), the threshold was set to 0, effectively forcing retrieval whenever possible.
  • Efficiency: vllm (Kwon et al., 2023), a high-throughput inference engine, was used to speed up inference.
  • Decoding:
    • Segment-level beam search: A beam width of 2 was adopted at each segment level, meaning the model explores two best continuations for each segment.
    • Token-level generation: Greedy decoding was used for generating individual tokens within a segment. This means at each step, the model picks the token with the highest probability.
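
For reference, the default inference settings listed above can be collected into a single configuration object; the dictionary layout below is purely illustrative and is not a format used by the released code.

```python
# Default Self-RAG inference settings as reported in Section 5.4 of the paper
# (the key names here are illustrative).
SELF_RAG_INFERENCE_DEFAULTS = {
    "critique_weights": {"IsReL": 1.0, "IsSUP": 1.0, "IsUsE": 0.5},
    "retrieval_threshold": 0.2,      # set to 0.0 for ALCE to force retrieval
    "segment_beam_width": 2,
    "token_decoding": "greedy",
    "max_retrieved_passages": 10,    # Contriever-MS MARCO; extra web results for some tasks
}
```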

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate Self-RAG's superior performance across a wide range of tasks, outperforming both non-retrieval LLMs and traditional RAG approaches.

The following are the results from Table 2 of the original paper:

| LM | PopQA (acc) | TQA (acc) | Pub (acc) | ARC (acc) | Bio (FS) | ASQA (em) | ASQA (rg) | ASQA (mau) | ASQA (pre) | ASQA (rec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LMs with proprietary data | | | | | | | | | | |
| Llama2-c13B | 20.0 | 59.3 | 49.4 | 38.4 | 55.9 | 22.4 | 29.6 | 28.6 | – | – |
| Ret-Llama2-c13B | 51.8 | 59.8 | 52.1 | 37.9 | 79.9 | 32.8 | 34.8 | 43.8 | 19.8 | 36.1 |
| ChatGPT | 29.3 | 74.3 | 70.1 | 75.3 | 71.8 | 35.3 | 36.2 | 68.8 | – | – |
| Ret-ChatGPT | 50.8 | 65.7 | 54.7 | 75.3 | – | 40.7 | 39.9 | 79.7 | 65.1 | 76.6 |
| Perplexity.ai | – | – | – | – | 71.2 | – | – | – | – | – |
| Baselines without retrieval | | | | | | | | | | |
| Llama2-7B | 14.7 | 30.5 | 34.2 | 21.8 | 44.5 | 7.9 | 15.3 | 19.0 | – | – |
| Alpaca-7B | 23.6 | 54.5 | 49.8 | 45.0 | 45.8 | 18.8 | 29.4 | 61.7 | – | – |
| Llama2-13B | 14.7 | 38.5 | 29.4 | 29.4 | 53.4 | 7.2 | 12.4 | 16.0 | – | – |
| Alpaca-13B | 24.4 | 61.3 | 55.5 | 54.9 | 50.2 | 22.9 | 32.0 | 70.6 | – | – |
| CoVE-65B* | – | – | – | – | 71.2 | – | – | – | – | – |
| Baselines with retrieval | | | | | | | | | | |
| Toolformer*-6B | – | 48.8 | – | – | – | – | – | – | – | – |
| Llama2-7B | 38.2 | 42.5 | 30.0 | 48.0 | 78.0 | 15.2 | 22.1 | 32.0 | 2.9 | 4.0 |
| Alpaca-7B | 46.7 | 64.1 | 40.2 | 48.0 | 76.6 | 30.9 | 33.3 | 57.9 | 5.5 | 7.2 |
| Llama2-FT-7B | 48.7 | 57.3 | 64.3 | 65.8 | 78.2 | 31.0 | 35.8 | 51.2 | 5.0 | 7.5 |
| SAIL*-7B | – | – | 69.2 | 48.4 | – | – | – | – | – | – |
| Llama2-13B | 45.7 | 47.0 | 30.2 | 26.0 | 77.5 | 16.3 | 20.5 | 24.7 | 2.3 | 3.6 |
| Alpaca-13B | 46.1 | 66.9 | 51.1 | 57.6 | 77.7 | 34.8 | 36.7 | 56.6 | 2.0 | 3.8 |
| Our SELF-RAG 7B | 54.9 | 66.4 | 72.4 | 67.3 | 81.2 | 30.0 | 35.7 | 74.3 | 66.9 | 67.8 |
| Our SELF-RAG 13B | 55.8 | 69.3 | 74.5 | 73.1 | 80.2 | 31.7 | 37.0 | 71.6 | 70.3 | 71.3 |

(Short-form tasks: PopQA, TQA; closed-set tasks: Pub, ARC; long-form generations with citations: Bio, ASQA. Dashes indicate results not reported for that model.)

6.1.1. Comparison against Baselines without Retrieval

  • Superiority over Open-source LLMs: Self-RAG (both 7B and 13B models) significantly outperforms Llama2 and Alpaca models of similar or larger sizes across all tasks. For instance, Self-RAG 7B achieves 54.9% accuracy on PopQA compared to Alpaca 7B's 23.6%, demonstrating a massive improvement in factual QA.
  • Outperforming ChatGPT in Specific Areas: Self-RAG even surpasses ChatGPT (a proprietary and RLHF-tuned model) in several key metrics:
    • PubHealth: Self-RAG 13B (74.5%) vs. ChatGPT (70.1%).
    • PopQA: Self-RAG 13B (55.8%) vs. ChatGPT (29.3%).
    • Biography FactScore: Self-RAG 7B (81.2%) vs. ChatGPT (71.8%).
    • ASQA (Rouge and MAUVE): Self-RAG 7B (rg 35.7, mau 74.3) and Self-RAG 13B (rg 37.0, mau 71.6) generally outperform ChatGPT (rg 36.2, mau 68.8), indicating better fluency and content similarity.
  • Outperforming CoVE: On the biography generation task, Self-RAG 7B (81.2 FactScore) and 13B (80.2 FactScore) significantly outperform CoVE 65B (71.2 FactScore), a concurrent method employing iterative prompt engineering, despite CoVE being a much larger model. This highlights the efficiency and effectiveness of Self-RAG's integrated self-reflection.

6.1.2. Comparison against Baselines with Retrieval

  • Best Performance among Non-Proprietary Models: Self-RAG consistently achieves the best performance among all non-proprietary LLM-based models across all tasks.
  • Significant Gains over Standard RAG: While standard RAG improves baselines (e.g., Llama2-7B with RAG jumps from 14.7% to 38.2% on PopQA), Self-RAG pushes these gains further (Self-RAG 7B reaches 54.9% on PopQA). This indicates that the adaptive retrieval and self-reflection mechanisms are critical beyond simple document concatenation.
  • Addressing RAG Limitations: The paper notes that standard RAG often struggles when tasks require more than just copying/extracting substrings, or when retrieval isn't always helpful (e.g., PubHealth, ARC Challenge where RAG baselines don't notably improve over non-retrieval). Self-RAG overcomes this by its adaptive nature.
  • Exceptional Citation Accuracy: One of the most striking results is Self-RAG's performance on citation precision and recall for ASQA.
    • Self-RAG 13B achieves 70.3% precision and 71.3% recall, far surpassing all other non-proprietary models (e.g., Ret-Llama2-C13B has 19.8% precision, 36.1% recall; Alpaca13B with RAG has 2.0% precision, 3.8% recall).
    • Self-RAG 13B even outperforms Ret-ChatGPT (65.1% precision, 76.6% recall) in citation precision, which measures whether model claims are fully supported by cited evidence. This is a critical advancement in verifiable LLM generation.
  • Effectiveness of Self-RAG Framework: The Llama2-FT-7B baseline (fine-tuned on the same instruction-output pairs as Self-RAG but without reflection tokens or retrieved passages during training, only retrieval-augmented at test time) lags significantly behind Self-RAG. For example, on PubHealth, Llama2-FT-7B scores 64.3%, while Self-RAG 7B achieves 72.4%. This explicitly demonstrates that the gains of Self-RAG are not merely from training data but from the novel framework of learning to retrieve and self-reflect.
  • Nuance in Model Size: Interestingly, Self-RAG 7B occasionally outperforms Self-RAG 13B on metrics for factual precision (e.g., Bio FactScore: 81.2 for 7B vs. 80.2 for 13B). The authors attribute this to smaller models sometimes generating more precisely grounded, albeit shorter, outputs, suggesting a trade-off that can be controlled via inference parameters.

6.2. Ablation Studies / Parameter Analysis

The paper conducts several ablation studies to understand the contribution of different components of the Self-RAG framework, both during training and inference.

The following figure (Figure 3 from the original paper) shows an analysis on SELF-RAG:

Figure 3: Analysis on SELF-RAG: (a) Ablation studies for key components of SELF-RAG training and inference based on our 7B model. (b) Effects of soft weights on ASQA citation precision and Mauve (fluency). (c) Retrieval frequency and normalized accuracy on PubHealth and PopQA.

6.2.1. Ablation Studies (Figure 3a)

The ablation results correspond to panel (a) of Figure 3 in the original paper (reported there in tabular form); the key findings are summarized below.

  • Training Ablations:

    • No Retriever: An LLM trained with standard instruction-following, without any retrieved passages. This performs significantly worse than Self-RAG, indicating the necessity of retrieval in the training process.
    • No Critic: An LLM trained with input-output pairs that are always augmented with the top retrieved document, but without reflection tokens. This is similar to SAIL. The large performance gap between Self-RAG and No Critic highlights the crucial role of the critic model and reflection tokens in guiding effective retrieval and generation.
    • Conclusion: The large performance gap between Self-RAG and No Retriever or No Critic baselines across tasks clearly demonstrates that training the LLM with adaptive retrieval and self-reflection (via the critic model and reflection tokens) largely contributes to the performance gains.
  • Inference Ablations:

    • No retrieval: Disables retrieval during inference. This leads to a significant performance drop, especially on knowledge-intensive tasks, affirming the importance of retrieval.
    • Hard constraints: Instead of an adaptive threshold, retrieval is triggered only when Retrieve = Yes is explicitly predicted. This imposes stricter, non-probabilistic control over when retrieval happens.
    • Retrieve top 1: Always retrieves and uses only the single top document, similar to conventional RAG approaches. This causes a large drop in performance on PopQA and ASQA. This suggests that Self-RAG's ability to process multiple passages and select based on fine-grained critique is superior to relying solely on a single "best" retrieved document.
    • Remove IsSUP: The IsSUP score (measuring factual support) is removed from the critique-guided beam search (Equation 4). This significantly hurts performance on ASQA, especially its factuality and citation metrics.
    • Conclusion: These inference ablations underscore the effectiveness of Self-RAG's multi-faceted approach: adaptive retrieval, parallel processing of passages, and the critique-guided beam search that considers multiple fine-grained criteria (IsReL, IsSUP, IsUsE), rather than blindly using retrieved passages or relying on single relevance scores.

6.2.2. Effects of Inference-Time Customization (Figure 3b)

Self-RAG's framework allows for customizing behavior at inference time by adjusting the weight terms w^G for different critique types.

  • Trade-off between Citation Precision and Fluency: Figure 3b illustrates the effect of changing the weight w^{\text{IsSUP}} for the IsSUP token (which criticizes how well the output is supported by evidence) on ASQA.
    • Increasing w^{\text{IsSUP}}: Leads to positive effects on citation precision, as the model is incentivized to generate outputs that are strongly supported by evidence. This aligns with the goal of prioritizing factual grounding.
    • Inverse Relationship with MAUVE: However, a larger w^{\text{IsSUP}} often results in lower MAUVE scores (fluency/completeness). This is because longer, more fluent generations tend to contain more claims that might not be fully supported by the direct retrieved citations, consistent with prior findings.
  • Conclusion: This demonstrates Self-RAG's powerful capability for dynamic customization. Practitioners can adjust these parameters to achieve desired trade-offs (e.g., maximizing factual accuracy even if it means slightly less fluent output, or vice-versa) at test time without requiring additional training.
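As a hypothetical illustration of this customization (the specific numbers are made up; only the direction of the trade-off comes from the paper), the same trained model can be run with different weight presets, without retraining:

```python
# Hypothetical weight presets: only the segment-ranking weights change at test time.
precision_first = {"IsREL": 1.0, "IsSUP": 2.0, "IsUsE": 0.5}  # favors citation precision
fluency_first   = {"IsREL": 1.0, "IsSUP": 0.5, "IsUsE": 0.5}  # favors MAUVE / fluency

# Reusing segment_score from the earlier sketch, swapping WEIGHTS for either preset
# changes which candidate segments win the beam search, and therefore how strongly
# the final output is tied to the retrieved evidence.
```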

6.2.3. Efficiency and Accuracy Trade-off (Figure 3c)

The adaptive retrieval mechanism in Self-RAG allows users to control the retrieval frequency using a threshold on the Retrieve token's probability.

  • Retrieval Frequency vs. Threshold (δ): Figure 3c shows how varying the threshold δ (a larger δ leads to less retrieval) dramatically changes the model's retrieval frequencies on PubHealth and PopQA.

  • Performance Impact:

    • On PubHealth, the performance deterioration from retrieving less is relatively smaller. This suggests that for some tasks, the model can perform well even with less frequent retrieval if its parametric knowledge is sufficient or if it's less reliant on external facts for every segment.
    • On PopQA, the performance drop is larger when retrieving less, highlighting its stronger dependence on retrieved facts for accurate answers (especially for long-tail entities).
  • Conclusion: This analysis indicates that Self-RAG offers a knob to balance inference efficiency (less retrieval means faster inference) with accuracy based on task requirements, providing practical flexibility.
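A minimal sketch of this knob (hypothetical helper and probabilities, not the authors' code): counting how often retrieval would fire as the threshold δ grows.

```python
def retrieval_frequency(p_yes_per_segment, delta):
    """Fraction of segments whose normalized P(Retrieve=Yes) exceeds delta,
    i.e. how often retrieval would actually be triggered."""
    return sum(p > delta for p in p_yes_per_segment) / len(p_yes_per_segment)

# Example with made-up per-segment probabilities: a larger delta means fewer
# retrieval calls (faster inference), at some cost in accuracy on tasks such
# as PopQA that depend heavily on retrieved facts.
probs = [0.9, 0.4, 0.15, 0.6, 0.05]
for delta in (0.0, 0.2, 0.5):
    print(f"delta={delta}: retrieval on {retrieval_frequency(probs, delta):.0%} of segments")
```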

The following figure (Figure 4 from the original paper) shows training scale and human analysis:

Figure 4: Training scale and Human analysis: (a) (b) (c) Training scale analysis shows the effect of the training data scale on PopQA, PubHealth and ASQA (citation precision), respectively. (d) Human analysis on SELF-RAG outputs as well as reflection tokens.

6.2.4. Effects of Training Data Size (Figure 4a, 4b, 4c)

The paper investigates how the amount of training data influences Self-RAG's performance.

  • Data Scales: Variants of Self-RAG 7B were fine-tuned on randomly sampled subsets of 5k, 10k, 20k, and 50k instances from the original 150k training dataset.
  • Upward Trajectories: Figures 4a, 4b, and 4c (for PopQA, PubHealth, and ASQA citation precision respectively) consistently show upward trajectories: increasing the training data size generally leads to improved performance across all datasets.
  • Significant Improvements: The improvements are particularly significant in PopQA and ASQA, suggesting that these tasks benefit greatly from more diverse and extensive training examples that reinforce the retrieval and self-reflection patterns.
  • Comparison to Llama2-FT: The paper notes that Llama2-FT-7B (a baseline without Self-RAG's specific training) did not observe such significant improvements when training data increased from 50k to 150k.
  • Conclusion: This indicates that Self-RAG effectively leverages larger training datasets to learn its complex adaptive and self-reflective behaviors, and further expanding the training data beyond 150k instances could potentially lead to even greater improvements.

6.2.5. Human Evaluations (Figure 4d)

A small-scale human evaluation was conducted on 50 samples from PopQA and Biography generation results to assess output quality and the reliability of predicted reflection tokens.

  • Plausibility & Support (S&P): Human annotators evaluated S&P, which indicates whether the output is plausible (reasonable and on-topic) and supported (verified by provided evidence). For S&P, instances where Self-RAG predicted irrelevant or no support were excluded.
    • Results: Self-RAG outputs were judged plausible and supported by relevant passages, with S&P scores higher on short-form PopQA than on long-form biography generation, in line with findings from prior work.
  • Reflection Token Alignment: Annotators also checked if the model-predicted IsReL and IsSUP reflection tokens matched their own human assessments (e.g., if a fully supported token indeed corresponded to a factually supported claim).
    • Results: Human annotators found that IsReL and IsSUP predictions mostly aligned with their inspections.
  • Conclusion: This human evaluation provides qualitative validation that Self-RAG not only produces better numerical metrics but also generates outputs that humans perceive as more accurate and well-supported, and that its internal self-reflection signals are largely reliable.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduced Self-RAG (Self-Reflective Retrieval-Augmented Generation), a novel and robust framework designed to significantly enhance the quality and factuality of Large Language Model (LLM) outputs. The core innovation lies in training a single, arbitrary LLM to adaptively decide when to retrieve external knowledge, to generate text, and critically, to critique both the retrieved passages and its own generations through the use of special reflection tokens. These tokens, which are seamlessly integrated into the model's vocabulary and generation process, enable unprecedented controllability during inference, allowing Self-RAG to tailor its behavior to diverse task requirements.

Through comprehensive evaluations on six distinct tasks spanning open-domain QA, reasoning, fact verification, and long-form generation, Self-RAG (7B and 13B parameters) consistently demonstrated superior performance. It significantly outperformed state-of-the-art LLMs like ChatGPT and traditional Retrieval-Augmented Generation (RAG) models, particularly excelling in improving factuality (as measured by FactScore) and citation accuracy (precision and recall) for long-form generations. The framework's cost-effective training approach, which uses an offline critic model to augment training data, and its flexible inference-time customization capabilities further highlight its practical utility and innovation.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Incomplete Factual Support: Despite significant improvements, Self-RAG can still generate outputs that are not fully supported by the cited passages. This suggests that while the mechanism greatly reduces hallucinations, it does not eliminate them entirely.
  • Explicit Verification Burden: While Self-RAG provides explicit self-reflection and fine-grained attribution (e.g., through IsSUP tokens), the ultimate burden of verifying factual errors still falls on the user. The model flags potential issues but doesn't guarantee absolute truthfulness.
  • Further Training Data Expansion: The analysis on training data size indicated that increasing data consistently led to performance improvements. This suggests that further expanding the training dataset beyond the current 150k instances could lead to even greater gains. This implies that the current models might not have fully converged or exploited the potential of the Self-RAG training paradigm.
  • Exploring More Complex Critiques: The current set of reflection tokens covers relevance, support, and usefulness. Future work could explore more nuanced or domain-specific critique types to further enhance control and quality for specialized tasks.

7.3. Personal Insights & Critique

Self-RAG represents a highly impactful advancement in making LLMs more reliable and controllable.

  • Innovation of Reflection Tokens: The concept of reflection tokens as an integral part of the generation stream is particularly insightful. It externalizes an internal "thought process" of the LLM, making its reasoning and self-assessment transparent and actionable. This moves beyond simple prompt engineering to a deeply integrated, learned mechanism for control.
  • Cost-Effectiveness vs. RLHF: The offline training of the critic model and its use to augment the generator's training data is a brilliant strategy for achieving RLHF-like alignment benefits without the immense computational and engineering complexity of online RLHF (e.g., PPO). This makes Self-RAG more accessible and scalable for many research groups and applications.
  • Inference-Time Customization: The ability to adjust weight terms (w^G) at inference time to balance conflicting objectives (e.g., factual precision vs. fluency) is a powerful feature. This is a practical control knob that is often missing in other LLM systems, which typically require retraining for such behavioral shifts. It transforms the LLM from a black box into a more configurable tool.
  • Broader Applicability: The framework's adaptability to "any arbitrary LM" suggests wide applicability. It could potentially be integrated with various base LLMs and adapted to diverse domains beyond those tested, where factuality and verifiable generation are paramount.
  • Potential Issues/Areas for Improvement:
    • Defining Optimal Weights: While customizable weights (w^G) offer flexibility, determining the optimal weights for complex, multi-objective tasks might still require some trial and error or domain expertise. Automated methods for learning or suggesting these weights could be a valuable extension.

    • Scalability of Critic Model Training: While offline, the initial collection of GPT-4-generated critiques still relies on a proprietary model. Developing open-source alternatives or more efficient ways to bootstrap the critic model could further enhance reproducibility and accessibility.

    • Handling Conflicting Information: The paper focuses on retrieving relevant passages. How Self-RAG handles conflicting information within retrieved passages, or conflicts between parametric knowledge and retrieved facts, would be an interesting area for deeper analysis. The IsSUP token helps, but the ranking mechanism could be further refined.

    • Generalizability of Reflection Tokens: While human evaluations were positive, the subjective nature of "usefulness" or "relevance" could vary across users and cultures. Further research into cross-cultural applicability and robustness of reflection tokens might be beneficial.

Overall, Self-RAG offers a robust and practical solution to a critical problem in LLM deployment, paving the way for more trustworthy and controllable AI systems.
