Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
TL;DR Summary
The Retrieval-Augmented Generation (RAG) model combines parametric and non-parametric memory to overcome limitations of large pre-trained language models on knowledge-intensive tasks, achieving superior performance in open-domain question answering and generating more accurate, specific, and diverse language than parametric-only baselines.
Abstract
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
In-depth Reading
1. Bibliographic Information
1.1. Title
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
1.2. Authors
The paper was authored by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. The authors are primarily affiliated with Facebook AI Research (FAIR), with some members also associated with University College London (UCL) and New York University (NYU). These institutions and research groups are among the leaders in Natural Language Processing (NLP) and Artificial Intelligence.
1.3. Journal/Conference
This paper was published at the Conference on Neural Information Processing Systems (NeurIPS) 2020. NeurIPS is widely considered one of the most prestigious and influential venues in machine learning and computational neuroscience.
1.4. Publication Year
The paper was published in 2020 (First version submitted to arXiv in May 2020).
1.5. Abstract
Large pre-trained language models store factual knowledge in their parameters but struggle with precisely manipulating that knowledge or providing provenance (evidence) for their claims. This paper introduces Retrieval-Augmented Generation (RAG), a general-purpose fine-tuning recipe that combines parametric memory (a pre-trained sequence-to-sequence model like BART) with non-parametric memory (a dense vector index of Wikipedia accessed via a neural retriever). The authors explore two RAG formulations: RAG-Sequence, which uses the same retrieved document for the entire sequence, and RAG-Token, which can use different documents for each token. The model achieves state-of-the-art results on several open-domain Question Answering (QA) tasks and generates more factual, specific, and diverse language than purely parametric models.
1.6. Original Source Link
- ArXiv Link: https://arxiv.org/abs/2005.11401
- PDF Link: https://arxiv.org/pdf/2005.11401v4.pdf
- Publication Status: Officially published (NeurIPS 2020).
2. Executive Summary
2.1. Background & Motivation
The core problem addressed is the limitation of Large Language Models (LLMs) in handling "knowledge-intensive" tasks—tasks that require specific, factual information that a human would need to look up in an external source (e.g., "What is the currency of Scotland?").
While models like T5 or GPT store vast amounts of information in their internal weights (parametric memory), they face three major challenges:
- Hallucinations: They often generate plausible-sounding but factually incorrect information.
- Lack of Interpretability: They cannot easily cite their sources or provide evidence for their answers.
- Static Knowledge: Updating their world knowledge requires expensive re-training on new data.
The researchers' innovative idea is to treat the retrieval of external documents as a latent variable (an unobserved factor that influences the output) within a sequence-to-sequence framework. This allows the model to "consult" a massive external library (Wikipedia) before generating a response.
2.2. Main Contributions / Findings
- Unified RAG Framework: Developed a model that integrates a pre-trained retriever (DPR) and a pre-trained generator (BART) into a single system that can be fine-tuned end-to-end.
- Two Probabilistic Formulations: Introduced RAG-Sequence and RAG-Token to explore different ways of aggregating information from multiple retrieved documents.
- State-of-the-Art Performance: Set new benchmarks on open-domain QA tasks (Natural Questions, WebQuestions, CuratedTrec), outperforming models that were significantly larger or used specialized pre-training.
- Hot-Swappable Memory: Demonstrated that the model's knowledge can be updated by simply replacing the document index (non-parametric memory) without re-training the neural network parameters.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice should be familiar with the following:
- Parametric vs. Non-parametric Memory:
- Parametric Memory: Information stored within the weights (parameters) of the neural network, learned during training.
- Non-parametric Memory: Information stored externally (like a database or document index) that the model can look up.
- Sequence-to-Sequence (seq2seq) Models: An architecture (usually based on the Transformer) that takes a sequence of text as input and generates another sequence as output.
- Maximum Inner Product Search (MIPS): A technique for quickly finding the "closest" items in a large database by computing the dot product between a query vector and millions of document vectors (see the short sketch after this list).
- Latent Variable: In this context, the "retrieved document" is the latent variable. The model doesn't know for sure which document is best, so it treats the choice as a probability distribution and sums over the possibilities.
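As a rough illustration of MIPS (not the paper's actual retrieval code), the NumPy sketch below scores a toy set of randomly generated document vectors against a query vector and keeps the highest-scoring ones; real systems such as FAISS do this approximately over millions of vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(10_000, 768))   # stand-in for the document embeddings d(z)
query_vector = rng.normal(size=(768,))         # stand-in for the query embedding q(x)

# Maximum Inner Product Search: score every document with a dot product,
# then keep the k documents with the largest inner products.
scores = doc_vectors @ query_vector
k = 5
top_k_ids = np.argsort(-scores)[:k]
print(top_k_ids, scores[top_k_ids])
```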
3.2. Previous Works
The authors build upon several landmark studies:
- DPR (Dense Passage Retrieval): A method that uses two BERT encoders (one for the query, one for the document) to represent text as dense vectors. This is the "Retriever" component of RAG.
- BART: A pre-trained seq2seq model that is excellent at generating text by "denoising" (reconstructing) corrupted input. This is the "Generator" component.
- REALM & ORQA: Earlier models that combined retrieval and language modeling. However, these were mostly "extractive" (they only picked a span of text from a document), whereas RAG is "generative" (it can synthesize a new answer).
3.3. Technological Evolution
Initially, NLP models were either purely parametric (like GPT-2 or T5), which were flexible but hallucination-prone, or purely extractive (like DrQA), which were factual but inflexible. RAG represents the convergence of these two paths, creating a "Hybrid" model that uses retrieval to ground generation in facts.
3.4. Differentiation Analysis
The key difference between RAG and its predecessors is that RAG allows for end-to-end fine-tuning of both the retriever and the generator for any text-to-text task. Unlike REALM, it does not require a specialized, expensive pre-training phase for the retriever; it can be fine-tuned directly on the target task (e.g., answering questions).
4. Methodology
4.1. Principles
The core intuition behind RAG is that the probability of an output sequence $y$ given an input $x$ can be computed by retrieving a set of documents $z$ and combining their information. The model treats the retrieved document as a latent ("hidden") intermediate step that it must learn to navigate.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. The Retrieval Component (DPR)
The process begins with the Retriever, $p_{\eta}(z|x)$. This component determines which documents $z$ in the external library are most relevant to the user's input $x$.
The model uses a Bi-encoder architecture: $ p_{\eta}(z|x) \propto \exp(\mathbf{d}(z)^{\top} \mathbf{q}(x)) $ Where:
- $\mathbf{q}(x)$: A vector representation of the query, produced by a BERT encoder.
- $\mathbf{d}(z)$: A vector representation of a document chunk, produced by a separate BERT encoder.
- The dot product $\mathbf{d}(z)^{\top}\mathbf{q}(x)$ measures the similarity between the two. The model finds the top-$k$ documents (usually 5 or 10) using MIPS, as sketched below.
4.2.2. The Generation Component (BART)
Once the documents are retrieved, they are passed to the Generator, $p_{\theta}(y_i|x, z, y_{1:i-1})$.
To generate a response, the model concatenates the input $x$ and the retrieved document $z$ into a single text prompt. The BART model then predicts the next token based on this combined context and the tokens generated so far ($y_{1:i-1}$).
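A toy illustration of how the input and a retrieved passage could be joined into one generator context; the `build_generator_input` helper and the "//" separators are assumptions for illustration, not BART's exact input scheme.

```python
def build_generator_input(question: str, passage_title: str, passage_text: str) -> str:
    # The retrieved passage is prepended to the original input; the "//" separators
    # are an illustrative assumption, not the exact formatting used in the paper.
    return f"{passage_title} // {passage_text} // {question}"

prompt = build_generator_input(
    "What is the currency of Scotland?",
    "Pound sterling",
    "The pound sterling is the official currency of the United Kingdom.",
)
print(prompt)
# This combined string would be tokenized and fed to BART, which then predicts each
# token y_i conditioned on the prompt and the previously generated tokens y_{1:i-1}.
```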
4.2.3. RAG-Sequence Model
In this formulation, the model assumes that one single document is responsible for the entire answer. To find the probability of the whole answer $y$, the model calculates the probability of $y$ given each retrieved document $z$ and then sums these up (marginalizes), weighted by how relevant each document was:
$ p_{\mathrm{RAG\text{-}Sequence}}(y|x) \approx \sum_{z \in \mathrm{top}\text{-}k(p_{\eta}(\cdot|x))} p_{\eta}(z|x) \prod_{i=1}^{N} p_{\theta}(y_{i}|x,z,y_{1:i-1}) $
Here, the product calculates the probability of the entire sequence $y$ for a fixed document $z$. This is then weighted by the retriever's score $p_{\eta}(z|x)$ and summed over the top-$k$ documents.
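A numerical toy example of this marginalization, assuming the retriever scores and per-token generator probabilities for k = 3 documents are already given (the numbers are made up):

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])          # p_eta(z|x) for the top-3 retrieved documents

# p_theta(y_i | x, z, y_{1:i-1}) for a 4-token answer, one row per document.
p_tokens = np.array([
    [0.9, 0.8, 0.7, 0.9],
    [0.6, 0.5, 0.4, 0.6],
    [0.3, 0.2, 0.2, 0.3],
])

# RAG-Sequence: score the whole answer under each document, then sum over documents.
seq_prob_per_doc = p_tokens.prod(axis=1)          # prod_i p_theta(y_i | x, z, y_{1:i-1})
p_rag_sequence = float(p_z @ seq_prob_per_doc)    # sum_z p_eta(z|x) * sequence probability
print(p_rag_sequence)
```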
4.2.4. RAG-Token Model
In this formulation, the model is more flexible: it can switch documents for every single token it generates. This is useful if a question needs info from multiple places (e.g., "Who wrote 'The Sun Also Rises' and where was he born?"). The formula is:
$ p_{\mathrm{RAG\text{-}Token}}(y|x) \approx \prod_{i=1}^{N} \sum_{z \in \mathrm{top}\text{-}k(p_{\eta}(\cdot|x))} p_{\eta}(z|x) \, p_{\theta}(y_{i}|x,z,y_{1:i-1}) $
In this case, for every token $y_i$, the model sums the probabilities across the retrieved documents before moving on to the next token.
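Using the same toy numbers as in the RAG-Sequence sketch above, RAG-Token only swaps the order of the sum and the product: each token is marginalized over documents first, and the per-token marginals are then multiplied.

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])                   # p_eta(z|x), as in the previous sketch
p_tokens = np.array([
    [0.9, 0.8, 0.7, 0.9],
    [0.6, 0.5, 0.4, 0.6],
    [0.3, 0.2, 0.2, 0.3],
])

p_token_marginal = p_z @ p_tokens                 # sum_z p_eta(z|x) * p_theta(y_i|x,z,...) per position i
p_rag_token = float(p_token_marginal.prod())      # prod_i over the per-token marginals
print(p_rag_token)
```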
4.2.5. Training and Decoding
The following figure (Figure 1 from the original paper) shows the overall system architecture:
Figure 1 (from the original paper): overall architecture of the RAG approach. A query encoder produces the query representation $\mathbf{q}(x)$, which is used with Maximum Inner Product Search (MIPS) over a document index (the non-parametric retriever) to fetch relevant documents $z$. The generator (the parametric component) then conditions on these documents and marginalizes over them to produce the final prediction, with end-to-end backpropagation through the whole pipeline.
- Training: The model is trained to minimize the negative log-likelihood of the correct answer. The gradient (the signal for how to update weights) flows through the generator back into the query encoder $\mathbf{q}(x)$, so the model learns to "ask better questions" to retrieve better documents (a minimal loss sketch follows after this list).
- Decoding:
  - For RAG-Token, the model uses standard Beam Search (a way to find the most likely sequence of words), since the per-token marginal can be treated like an ordinary next-token distribution.
  - For RAG-Sequence, decoding is trickier because the probability is summed across documents. The authors use "Thorough Decoding" (running beam search for each document and merging the results) or "Fast Decoding" (a more efficient approximation).
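A minimal PyTorch-style sketch of the training objective under these assumptions: given unnormalized retriever scores and per-document token log-probabilities from the generator, the loss is the negative log of the (RAG-Token) marginal likelihood, and backpropagation reaches both the generator and the query encoder. The `rag_token_nll` function and the toy inputs are illustrative, not the authors' code.

```python
import torch

def rag_token_nll(retriever_logits: torch.Tensor,   # shape (k,): unnormalized d(z)^T q(x) scores
                  token_logprobs: torch.Tensor       # shape (k, N): log p_theta(y_i | x, z, y_{1:i-1})
                  ) -> torch.Tensor:
    log_p_z = torch.log_softmax(retriever_logits, dim=0)              # log p_eta(z|x) over top-k docs
    # Marginalize over documents at each position (log-sum-exp), then sum over positions.
    per_token = torch.logsumexp(log_p_z.unsqueeze(1) + token_logprobs, dim=0)
    return -per_token.sum()

# Toy usage: in the real model both tensors come from the query encoder and BART,
# so minimizing this loss updates q(x) (the retriever) as well as the generator.
retriever_logits = torch.randn(5, requires_grad=True)
token_logprobs = torch.log(torch.rand(5, 8).clamp(min=1e-3)).requires_grad_()
loss = rag_token_nll(retriever_logits, token_logprobs)
loss.backward()
print(float(loss))
```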
5. Experimental Setup
5.1. Datasets
The authors tested RAG on several diverse datasets:
- Open-domain QA: Natural Questions (NQ), TriviaQA (TQA), WebQuestions (WQ), and CuratedTrec (CT). These require short, factual answers.
- Abstractive QA: MS-MARCO (question answering where answers are full sentences).
- Question Generation: Jeopardy. The model is given an entity (e.g., "Hemingway") and must generate a Jeopardy-style clue about it.
- Fact Verification: FEVER. The model must classify whether a claim is "Supported", "Refuted", or "Not Enough Info" based on Wikipedia.
Data Sample Example (Jeopardy):
- Input (Entity): "The World Cup"
- Output (Generated Clue): "In 1986 Mexico scored as the first country to host this international sports competition twice."
5.2. Evaluation Metrics
- Exact Match (EM):
  - Definition: Checks whether the generated answer is identical to the gold-standard answer after minor normalization, such as lowercasing and removing articles like "the" or "a" (a small scoring sketch follows after this list).
  - Formula: $\mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$
  - Symbols: $\mathbb{1}[\cdot]$ is the indicator function (1 if true, 0 if false), $N$ is the number of samples, and $\hat{y}_i$, $y_i$ are the normalized prediction and reference.
- Rouge-L:
  - Definition: Measures the Longest Common Subsequence (LCS) between the generated text and the reference, capturing fluency and content overlap.
  - Formula: $R_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{m}$, where $X$ is the generated text, $Y$ is the reference, and $m$ is the length of the reference.
- Bleu-1:
  - Definition: Measures the precision of unigrams (single words) in the output compared to the reference.
  - Formula: $\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_{n} \log p_{n}\right)$
  - Symbols: $p_n$ is the n-gram precision, $w_n$ is the weight for each n-gram order, and $BP$ is a brevity penalty. For Bleu-1, only unigram precision ($n = 1$) is used.
- Q-BLEU-1:
  - Definition: A variation of BLEU designed for question generation that gives more weight to correctly matching entities (nouns/names).
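A small sketch of Exact Match scoring under typical normalization assumptions (lowercasing, stripping punctuation and articles); the official evaluation scripts may normalize slightly differently.

```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)     # drop articles
    return " ".join(text.split())                   # collapse extra whitespace

def exact_match(predictions, references) -> float:
    matches = [normalize(p) == normalize(r) for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)

print(exact_match(["The Pound Sterling", "Paris"], ["pound sterling", "London"]))  # -> 0.5
```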
5.3. Baselines
- Closed-Book T5: A model that generates answers using only its internal weights.
- DPR (Extractive): The previous state-of-the-art model that extracts spans of text from Wikipedia but cannot generate new sentences.
- BART: A parametric-only baseline used to measure how much retrieval actually helps.
6. Results & Analysis
6.1. Core Results Analysis
RAG significantly outperformed both purely parametric models and extractive models.
- In Open-domain QA, RAG set new state-of-the-art records. Notably, it could answer questions even when the correct answer wasn't in the retrieved text, by combining retrieved "clues" with its internal knowledge.
- In Generation tasks, human evaluators judged RAG to be more factual and more specific than the BART baseline.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper, showing performance on Open-Domain QA:
| Model | Category | NQ (EM) | TQA (EM, Open/Wiki) | WQ (EM) | CT (EM) |
|---|---|---|---|---|---|
| T5-11B | Closed Book | 34.5 | - / 50.1 | 37.4 | - |
| T5-11B+SSM | Closed Book | 36.6 | - / 60.5 | 44.7 | - |
| REALM | Open Book | 40.4 | - / - | 40.7 | 46.8 |
| DPR | Open Book | 41.5 | 57.9 / - | 41.1 | 50.6 |
| **RAG-Token** | Open Book | 44.1 | 55.2 / 66.1 | **45.5** | 50.0 |
| **RAG-Seq.** | Open Book | **44.5** | **56.8 / 68.0** | 45.2 | **52.2** |
The following are the results from Table 2, focusing on generation and classification:
| Model | Jeopardy B-1 | Jeopardy QB-1 | MS-MARCO R-L | MS-MARCO B-1 | FEVER 3-way (Acc.) | FEVER 2-way (Acc.) |
|---|---|---|---|---|---|---|
| SotA (various) | - | - | 49.8* | 49.9* | 76.8 | 92.2* |
| BART | 15.1 | 19.7 | 38.2 | 41.6 | 64.0 | 81.1 |
| **RAG-Tok.** | **17.3** | **22.2** | 40.1 | 41.5 | **72.5** | **89.5** |
| **RAG-Seq.** | 14.7 | 21.4 | **40.8** | **44.2** | **72.5** | **89.5** |
*Note: An asterisk (\*) indicates the use of gold context/evidence, which RAG does not use.*
6.3. Ablation Studies / Parameter Analysis
As seen in Figure 3 (below), the number of retrieved documents affects performance differently for the two models. RAG-Sequence benefits from more documents (up to 50), while RAG-Token peaks at around 10.
Figure 3 (from the original paper): performance as a function of the number of retrieved documents. Left: NQ Exact Match; middle: NQ answer recall; right: MS-MARCO Bleu-1 and Rouge-L. The curves compare RAG-Token and RAG-Sequence as the number of retrieved documents increases.
The authors also conducted "Index Hot-swapping." They switched the Wikipedia index from 2016 to 2018. The model automatically updated its answers (e.g., correctly identifying the new President of a country) without any retraining of the neural weights.
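Conceptually, the hot swap just re-points the retriever at a different index while leaving all neural weights untouched. A minimal FAISS sketch under that assumption (random vectors stand in for the 2016 and 2018 Wikipedia embeddings):

```python
import faiss
import numpy as np

dim = 768
rng = np.random.default_rng(0)

def build_index(doc_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    index = faiss.IndexFlatIP(dim)      # exact inner-product (MIPS) index
    index.add(doc_embeddings)
    return index

# Random stand-ins for the document embeddings of two Wikipedia snapshots.
index_2016 = build_index(rng.random((1000, dim), dtype=np.float32))
index_2018 = build_index(rng.random((1000, dim), dtype=np.float32))

query = rng.random((1, dim), dtype=np.float32)     # q(x) from the unchanged query encoder

_, ids_old = index_2016.search(query, 5)
_, ids_new = index_2018.search(query, 5)
# No neural weights change: swapping which index is searched changes the retrieved
# evidence, and therefore the generated answers.
print(ids_old, ids_new)
```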
7. Conclusion & Reflections
7.1. Conclusion Summary
RAG represents a major step forward in creating grounded AI. By combining the creative generation of BART with the factual retrieval of DPR, the authors created a model that is more accurate, more interpretable, and easier to update than standard LLMs. It proves that generation models do not need to store all world knowledge in their parameters to be effective.
7.2. Limitations & Future Work
- Retrieval Collapse: In some tasks (like story generation), the retriever sometimes "collapses" and starts retrieving the same irrelevant documents regardless of the input.
- Efficiency: Running a retriever and a generator together is more computationally expensive than a single model.
- Future Work: The authors suggest jointly pre-training the retriever and generator from scratch, rather than starting with already pre-trained components.
7.3. Personal Insights & Critique
This paper is foundational for modern AI architectures (like those used in ChatGPT or Perplexity).
- Innovation: The idea of "marginalizing over latent documents" is a clever way to handle the uncertainty of retrieval. It allows the model to be "unsure" which document is best and look at several.
- Potential Issue: The model still relies on the quality of the external index (Wikipedia). If the index contains bias or misinformation, RAG will confidently repeat it.
- Application: This method is highly transferable to enterprise settings. A company could swap the Wikipedia index for their internal technical manuals, and the RAG model would immediately become an expert on that company's specific products.