Retrieval-in-the-Chain: Bootstrapping Large Language Models for Generative Retrieval
TL;DR Summary
R4R transforms free-form CoT reasoning into structured formats, iteratively refining it to enhance generative retrieval using a single instruction-tuned LLM, significantly improving retrieval performance on multiple benchmarks.
Abstract
Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training, and instead a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Retrieval-in-the-Chain: Bootstrapping Large Language Models for Generative Retrieval
- Authors:
- Yingchen Zhang (State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
- Ruqing Zhang (State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
- Jiafeng Guo (State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
- Wenjun Peng (Researcher, Hangzhou, China)
- Sen Li (Researcher, Hangzhou, China)
- Fuyu Lv (Researcher, Hangzhou, China)
- The authors are a mix of academic researchers from a prominent Chinese institution and industry researchers, suggesting a blend of foundational research and practical application perspectives.
- Journal/Conference: The paper is available on arXiv, which is a preprint server. It does not appear to be formally published in a peer-reviewed conference or journal yet. Preprints are common in fast-moving fields like AI for rapid dissemination of research.
- Publication Year: 2025 (the arXiv identifier 2510.13095 corresponds to an October 2025 submission).
- Abstract: The paper addresses a gap in Generative Retrieval (GR), where Large Language Models (LLMs) are used to directly generate document identifiers (docids) for a query. While prior work focused on the generative abilities of LLMs, this paper explores their reasoning capabilities. A preliminary study shows that simple Chain-of-Thought (CoT) reasoning helps but is inefficient. To address this, the authors propose Reason-for-Retrieval (R4R), a framework that uses a single LLM to perform iterative retrieval. R4R first generates structured reasoning, then alternates between retrieving docids and refining the reasoning based on the results. This approach requires no extra models or training and is shown to be effective on standard benchmarks (Natural Questions, MS MARCO) and a real-world item-search dataset.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2510.13095
- PDF Link: https://arxiv.org/pdf/2510.13095v2.pdf
- Publication Status: This is a preprint on arXiv and has not yet undergone formal peer review for a conference or journal.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The paper aims to improve the performance of Generative Retrieval (GR), a new paradigm where LLMs autoregressively generate identifiers of relevant documents (docids) in response to a query.
- Identified Gap: Previous GR methods primarily leveraged the generative power of LLMs. The authors identify a missed opportunity: the powerful reasoning capabilities of modern LLMs, famously enhanced by techniques like Chain-of-Thought (CoT), have been largely overlooked in the retrieval process itself.
- Initial Challenge: A naive application of CoT, where the LLM generates free-form reasoning before retrieving, proves to be suboptimal. The reasoning is often verbose, increases latency, and is poorly aligned with the concise format of docids, sometimes even introducing noise that hurts performance.
- Fresh Angle: The paper proposes to explicitly integrate a structured and iterative reasoning process within the retrieval loop, using a single LLM to "think, retrieve, and refine" in a cycle.
- Main Contributions / Findings (What):
- Novel Framework (R4R): The paper introduces Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR. It does not require additional models or specialized training beyond adapting a standard GR model with instruction-tuning.
- Structured, Iterative Reasoning: R4R's core innovation is a three-step iterative loop executed by a single LLM:
- Think: Generates a compact, structured reasoning from the query.
- Retrieve: Uses this reasoning to perform constrained docid generation.
- Refine: Verifies the retrieved docids and reflects on errors to update the reasoning for the next iteration.
- Demonstrated Effectiveness: Through extensive experiments on public benchmarks (Natural Questions, MS MARCO) and a real-world e-commerce search dataset (Taobao), the paper shows that R4R consistently improves the performance of several state-of-the-art GR methods.
- Efficiency: The structured reasoning format is designed to be compact, avoiding the high latency associated with verbose, free-form CoT, making the approach more practical.
3. Prerequisite Knowledge & Related Work
This section explains foundational concepts for a reader new to the field, drawing from the paper's introduction and related work sections.
- Foundational Concepts:
- Information Retrieval (IR): The science of searching for information in documents, searching for documents themselves, or searching for metadata that describe data. A classic example is a web search engine.
- Dense Retrieval: A modern IR approach. It uses two neural networks (dual encoders) to map both the query and the documents into high-dimensional vectors (embeddings). Retrieval is performed by finding the document vectors that are closest to the query vector in the embedding space (e.g., using cosine similarity). This captures semantic meaning better than just keyword matching.
- Generative Retrieval (GR): A paradigm shift from dense retrieval. Instead of comparing vectors, GR uses a single, large autoregressive model (like a GPT-style LLM). The entire document collection (corpus) is "memorized" by training the model to associate each document with a unique document identifier (docid). At inference time, the model directly generates the docid of a relevant document token-by-token, given the query.
- Document Identifier (docid): A unique string that represents a document. The paper distinguishes two types:
- Numeric docids: Integers or numerical codes (e.g., 12345). These are efficient but can harm an LLM's general language abilities.
- Textual docids: Human-readable text, such as the document's title ("Barack_Obama_Biography") or keywords ("Physics-Quantum-Mechanics"). R4R focuses on this type because it keeps the reasoning and retrieval tasks in the same natural language space.
- Chain-of-Thought (CoT) Prompting: A technique to improve the reasoning ability of LLMs. Instead of asking for a direct answer, the LLM is prompted to first generate a step-by-step reasoning process before arriving at the final answer. This often leads to more accurate results on complex tasks.
- Instruction Tuning: A fine-tuning process where an LLM is trained on examples that pair an "instruction" (a command, e.g., "Retrieve the relevant document for this query") with the desired output. The paper uses this to teach the LLM to perform the retrieval task without losing its general reasoning and generation capabilities.
- Constrained Decoding: A technique used during generation to force the model's output to conform to a specific format or set of valid options. In GR, this is crucial to ensure the model only generates valid docids that exist in the corpus. It is often implemented with a prefix-trie (a tree structure of all valid docid prefixes) or an FM-index; a minimal sketch of trie-constrained decoding follows this list.
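The following is a minimal, framework-agnostic sketch of prefix-trie constrained decoding. It assumes whitespace-tokenized textual docids and a toy `score_next` callback standing in for the LLM's next-token scores; it illustrates the idea only and is not the implementation used by any of the methods discussed here.

```python
# Minimal sketch of prefix-trie constrained decoding for generative retrieval.
# Assumptions (not from the paper): whitespace tokenization and a toy scorer;
# a real system would use the LLM's tokenizer and logits instead.
from typing import Callable, Dict, List

def build_trie(docids: List[str]) -> Dict:
    """Build a nested-dict trie over tokenized docids; None marks end-of-docid."""
    trie: Dict = {}
    for docid in docids:
        node = trie
        for tok in docid.split():
            node = node.setdefault(tok, {})
        node[None] = True  # terminal marker
    return trie

def constrained_decode(score_next: Callable[[List[str]], Dict[str, float]],
                       trie: Dict) -> str:
    """Greedily pick the highest-scoring token among trie-valid continuations."""
    prefix: List[str] = []
    node = trie
    while True:
        allowed = [tok for tok in node if tok is not None]
        if not allowed:                      # only the terminal marker remains
            return " ".join(prefix)
        scores = score_next(prefix)
        best = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        prefix.append(best)
        node = node[best]
        if None in node and len(node) == 1:  # docid fully generated
            return " ".join(prefix)

if __name__ == "__main__":
    corpus_docids = ["Australia City Sydney", "Australia City Perth",
                     "Physics Quantum Mechanics"]
    trie = build_trie(corpus_docids)
    # Toy scorer that prefers tokens related to "Western Australia".
    preference = {"Australia": 2.0, "City": 1.0, "Perth": 3.0, "Sydney": 0.5}
    print(constrained_decode(lambda prefix: preference, trie))  # -> "Australia City Perth"
```

At every step the model's scores are only compared over the continuations that remain valid in the trie, which is what guarantees the generated string is always an existing docid.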
- Previous Works & Differentiation:
- Standard GR (e.g., `DSI`): Early GR methods focused on the core mechanism of mapping queries to docids. They often used numeric docids and did not incorporate explicit reasoning.
- Textual Docid GR (e.g., `SEAL`, `MINDER`, `TSGen`): These methods improved GR by using textual docids (like document titles), which better align with an LLM's pre-training and partially preserve its language abilities. R4R builds on this line of work, as textual docids are a prerequisite for its reasoning process.
- Reasoning in IR: Some prior work, like `CorpusLM`, used reasoning, but only in a post-processing step after retrieval was complete (e.g., to answer a question based on retrieved documents). R4R is novel because it integrates reasoning directly into the retrieval loop to improve the set of retrieved documents itself.
- Self-Refinement: Techniques like `Self-Refine` involve a model iteratively improving its own output. R4R applies this concept specifically to the GR task, where the "output" is a list of docids and the "refinement" is guided by self-generated relevance judgments.
4. Methodology (Core Technology & Implementation)
This section provides a detailed breakdown of the R4R framework.
- Principles: The core idea of R4R is to make an LLM perform a "human-like" retrieval process: think about the query, find some initial results, reflect on whether they are correct, and use that reflection to refine the search strategy for another attempt. This is all done by a single LLM, orchestrated by a set of specialized prompts.
- Steps & Procedures: The R4R framework consists of an initial `Think` step followed by an iterative `Retrieve`-`Refine` loop.
Image 1: A visual comparison of (a) standard GR, which directly maps a query to a docid; (b) GR with Direct CoT, which prepends verbose, unstructured reasoning to the query; and (c) the proposed R4R, which uses a compact, structured reasoning (the `query context`) for retrieval and iteratively refines it.
1. Adapted GR Training: Before R4R can be used at inference time, the base LLM must be prepared. The paper adapts standard GR training by using instruction-tuning. Instead of just feeding the model (query, docid) pairs, it uses prompts like "Retrieve the document for this query:". This teaches the model the retrieval task while preserving its general ability to follow instructions, which is crucial for the reasoning steps in R4R. The training objective for indexing and retrieval maximizes the log-likelihood of the docid tokens:
$$\mathcal{L}(\theta)=\sum_{d\in\mathcal{D}}\log P_{\theta}\big(\mathrm{docid}(d)\mid I_{\mathrm{index}}, d\big)+\sum_{(q,d)}\log P_{\theta}\big(\mathrm{docid}(d)\mid I_{\mathrm{retrieval}}, q\big),\quad \log P_{\theta}\big(\mathrm{docid}(d)\mid\cdot\big)=\sum_{i}\log P_{\theta}\big(\mathrm{docid}(d)_{i}\mid \mathrm{docid}(d)_{<i},\cdot\big)$$
- $d$: A document from the corpus $\mathcal{D}$.
- $q$: A query.
- $\mathrm{docid}(d)$: The textual identifier for document $d$; $\mathrm{docid}(d)_i$ is its $i$-th token.
- $\theta$: The parameters of the LLM being trained.
- $I_{\mathrm{index}}, I_{\mathrm{retrieval}}$: Task-specific instruction prompts for indexing and retrieval.
- $\log P_{\theta}(\cdot)$: The log-probability of the model generating the next token, which training aims to maximize.
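As a concrete illustration of this adapted training objective, the sketch below builds one instruction-formatted example and computes next-token cross-entropy only over the docid tokens. The toy vocabulary, the `make_example` helper, and the random stand-in logits are assumptions for illustration; a real setup would use the instruction-tuned LLM, its tokenizer, and its output logits.

```python
# Sketch of the instruction-tuned GR training loss (Adapted GR Training).
# Assumptions (not from the paper): a toy whitespace vocab and random stand-in
# logits; a real setup would use the LLM and its tokenizer.
import torch
import torch.nn.functional as F

def make_example(instruction: str, source: str, docid: str, vocab: dict):
    """Concatenate instruction + source + docid; loss is computed on docid tokens only."""
    def ids(text):
        return [vocab.setdefault(t, len(vocab)) for t in text.split()]
    prompt_ids, target_ids = ids(instruction + " " + source), ids(docid)
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids   # -100 = ignored by the loss
    return torch.tensor(input_ids), torch.tensor(labels)

def gr_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy, i.e. -sum_i log P(docid_i | docid_<i, prompt)."""
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

if __name__ == "__main__":
    vocab = {}
    input_ids, labels = make_example(
        "Retrieve the document for this query:",
        "largest city in Western Australia",
        "Australia City Perth", vocab)
    stub_logits = torch.randn(len(input_ids), len(vocab))  # stand-in for LLM outputs
    print("training loss on docid tokens:", gr_loss(stub_logits, labels).item())
```

Masking the prompt positions with `-100` is what keeps the objective focused on generating the docid while the instruction format preserves the model's instruction-following behavior.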
2. R4R Inference Pipeline: The inference process is detailed in Algorithm 1 and Figure 2.
Step I: Think (Initial Reasoning Generation). Given a user query $q$, the process starts by generating a structured, two-part reasoning by prompting the model with a thinking prompt:
- Initial `query context`: a compact set of keywords or phrases that are aligned with the docid format and serve as direct guidance for retrieval.
- Initial `expanded explanation`: a more detailed, structured explanation of the user's intent, potential document titles, etc. It is not used for retrieval directly but is kept for the `Refine` step.
- The two parts are concatenated to form the structured reasoning.
Example (from Figure 2):
- Query: "What is the largest city in Western Australia?"
- `query context`: "Geography Australia Cities Sydney Melbourne Canberra" (note: this initial guess is incorrect, as it focuses on Australia in general).
- `expanded explanation`: {user intent: "The user is searching for information on major Australian cities", ...}.
Step II: Retrieve. The model then performs constrained decoding to generate a ranked list of candidate docids. The input is a concatenation of the retrieval prompt, the original query $q$, and the current `query context`:
- The output is the list of top-$k$ docids retrieved in the current iteration.
- The `query context` comes from the previous iteration (or from Step I in the first iteration).
- The constrained decoding strategy (e.g., a prefix-trie) is that of the underlying GR method.
Example (from Figure 2, Iteration 1):
- Input Context: "Geography Australia Cities Sydney Melbourne Canberra"
- Retrieved Docids: `Australia-City-Sydney`, `Australia-City-Canberra`, etc. These are incorrect because the context misled the model.
Step III: Refine. This step has two sub-steps:
- Verification: The model acts as a relevance judge. It inspects the top retrieved docids one by one and, prompted with a verification prompt, outputs either "relevant" or "irrelevant" for each. If all inspected docids are judged relevant, the process stops and returns the current results. Otherwise, it proceeds to reflection upon finding the first irrelevant docid.
- Reflection: If an irrelevant docid is found, the model is prompted with a reflection prompt to analyze the error and generate an updated `query context` and `expanded explanation`. The inputs to this step are the query, the faulty docid, and the previous reasoning.
Example (from Figure 2, Iteration 1 Refinement):
- Verification: The model judges `Australia-City-Sydney` as "irrelevant" to the query about "Western Australia".
- Reflection: Based on this error, the model updates the reasoning.
- Updated `query context`: "Geography Western Australia Perth"
- Updated `expanded explanation`: {user intent: "The user is searching for the largest city in Western Australia", ...}.
Step IV: Iteration. The process loops back to the `Retrieve` step using the updated `query context`. This loop continues until a termination condition is met: (1) all inspected top candidates are verified as relevant, (2) the maximum number of rounds $T$ is reached, or (3) the model fails to produce a valid structured output during refinement. A minimal sketch of this loop follows.
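To make the control flow concrete, here is a minimal sketch of the R4R inference loop. The callables `llm_think`, `llm_verify`, `llm_reflect`, and `constrained_retrieve` are hypothetical stand-ins for prompting a single instruction-tuned LLM with the thinking, verification, and reflection prompts and for the chosen GR method's constrained decoder; none of these names come from the paper.

```python
# Minimal sketch of the R4R inference loop (Think -> Retrieve -> Refine).
# The callables are hypothetical wrappers around one instruction-tuned LLM,
# not the paper's actual API.
from typing import Callable, List, Optional, Tuple

def r4r_search(query: str,
               llm_think: Callable[[str], Tuple[str, str]],
               constrained_retrieve: Callable[[str, str], List[str]],
               llm_verify: Callable[[str, str], bool],
               llm_reflect: Callable[[str, str, str, str], Optional[Tuple[str, str]]],
               max_rounds: int = 3,
               verify_depth: int = 1) -> List[str]:
    # Step I (Think): produce the compact query context and the expanded explanation.
    context, explanation = llm_think(query)
    results: List[str] = []
    for _ in range(max_rounds):
        # Step II (Retrieve): constrained decoding guided by query + query context.
        results = constrained_retrieve(query, context)
        # Step III (Refine), part 1 (Verification): judge the top candidates.
        first_bad = next((d for d in results[:verify_depth]
                          if not llm_verify(query, d)), None)
        if first_bad is None:           # all inspected candidates look relevant -> stop
            break
        # Step III, part 2 (Reflection): update the reasoning from the failure case.
        updated = llm_reflect(query, first_bad, context, explanation)
        if updated is None:             # invalid structured output -> stop early
            break
        context, explanation = updated  # Step IV: iterate with the refined context.
    return results
```

In practice all four callables would wrap the same LLM with different prompts, plus the base GR method's constrained decoder, which is what allows R4R to run without any additional models or training.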
5. Experimental Setup
- Datasets:
- Natural Questions (NQ): A question-answering dataset derived from Google search queries. Queries are complex and often require reasoning to find the answer within Wikipedia passages.
- MS MARCO Passage: A large-scale passage ranking dataset with real-world, often ambiguous queries from the Bing search engine.
- Taobao Item-Search: A proprietary, real-world dataset from a large e-commerce platform, consisting of 2.5 million query-item pairs. This tests the method's applicability beyond standard QA/web search to product search.
- Evaluation Metrics:
- Hits@k:
- Conceptual Definition: This metric measures whether at least one correct (relevant) document appears in the top-$k$ retrieved results for a given query. It is a binary measure (1 if a hit occurs, 0 otherwise) and is averaged over all queries. It answers the question: "Did the model find any correct answer in the top k results?"
- Mathematical Formula:
$$\mathrm{Hits@}k=\frac{1}{|Q|}\sum_{q\in Q}\mathbb{1}\big[\mathrm{rank}(d_q^{+})\le k\big]$$
- Symbol Explanation:
- $Q$: The set of all queries.
- $|Q|$: The total number of queries.
- $\mathbb{1}[\cdot]$: The indicator function, which is 1 if the condition inside is true, and 0 otherwise.
- $d_q^{+}$: A relevant document for query $q$.
- $R_q$: The set of retrieved documents for query $q$.
- $\mathrm{rank}(d_q^{+})$: The position of the relevant document in the ranked list $R_q$.
- Mean Reciprocal Rank (MRR@k):
- Conceptual Definition: This metric evaluates the ranking quality of the results. For each query, it finds the rank of the first correct document. The reciprocal of this rank is calculated (e.g., if the first correct item is at rank 3, the reciprocal rank is 1/3). If no correct document is found in the top results, the score is 0. The final MRR is the average of these scores over all queries. It heavily rewards models that place the first correct answer higher up in the list.
- Mathematical Formula:
$$\mathrm{MRR@}k=\frac{1}{|Q|}\sum_{q\in Q}\frac{1}{\mathrm{rank}_q}$$
- Symbol Explanation:
- $Q$: The set of all queries.
- $|Q|$: The total number of queries.
- $\mathrm{rank}_q$: The rank of the first relevant document for query $q$ within the top $k$ results. If no relevant document is found in the top $k$ results, $1/\mathrm{rank}_q$ is taken to be 0.
A small code sketch computing both metrics is given below.
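For concreteness, the snippet below computes both metrics from ranked result lists. The per-query relevance sets are assumed inputs, and this evaluation is illustrative rather than the benchmarks' official scoring scripts.

```python
# Illustrative computation of Hits@k and MRR@k from ranked retrieval results.
# Assumes (not from the paper) relevance judgments given as a set of relevant
# docids per query.
from typing import Dict, List, Set

def hits_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant docid in the top k."""
    hits = sum(1 for q, docs in ranked.items()
               if any(d in relevant[q] for d in docs[:k]))
    return hits / len(ranked)

def mrr_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant docid within the top k (0 if none)."""
    total = 0.0
    for q, docs in ranked.items():
        for rank, d in enumerate(docs[:k], start=1):
            if d in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(ranked)

if __name__ == "__main__":
    ranked = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"]}
    relevant = {"q1": {"d1"}, "q2": {"d5"}}
    print(hits_at_k(ranked, relevant, k=3), mrr_at_k(ranked, relevant, k=3))  # 0.5, ~0.167
```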
- Baselines:
- Integrated GR Baselines: R4R was applied to four existing GR methods that use textual docids:
- `DSI-text`: Standard GR using a prefix-trie for constrained decoding.
- `SEAL`: Uses document titles as docids and an FM-index for more flexible decoding.
- `TSGen`: Uses an inverted index and term constraints for dynamic candidate space reduction.
- `MINDER`: Uses multiple textual docids (views) for each document.
- External Baselines (for horizontal comparison):
- Term-based: `BM25`, `DocT5Query`.
- Dense Retrieval: `DPR`, `ANCE`.
- Generative Retrieval (numeric docids): `DSI-semantic`, `DSI-QG`, `LTRGR`, `RIPOR`, `PAG`.
6. Results & Analysis
- Core Results:
Table 1 (Manual Transcription): Exploration of GR with Direct CoT. This preliminary experiment on the NQ dataset motivates the need for R4R.
| Method | Hits@1 | Hits@5 | Hits@20 | MRR@10 | Latency |
|---|---|---|---|---|---|
| Standard GR | 45.8 | 59.6 | 75.3 | 56.3 | 3.2 |
| Standard GR + CoT | 12.8 | 18.2 | 21.5 | 15.1 | 72.8 |
| Adapted GR + CoT | 46.0 | 59.6 | 76.0 | 57.5 | 63.9 |

- Analysis: Directly adding CoT to a standard GR model (`Standard GR + CoT`) fails catastrophically because the model lost its generative reasoning ability during training. After adapting the training with instruction-tuning (`Adapted GR + CoT`), performance improves slightly over the baseline, confirming that reasoning can help. However, latency increases by roughly 20x, highlighting the inefficiency of unstructured, verbose CoT.
Table 2 (Manual Transcription): Performance of GR methods with and without R4R on NQ and MS MARCO.
| Method | NQ Hits@1 | NQ Hits@5 | NQ Hits@20 | NQ MRR@10 | MS MARCO Hits@1 | MS MARCO Hits@10 | MS MARCO MRR@10 |
|---|---|---|---|---|---|---|---|
| DSI-text | 46.0 | 59.6 | 75.3 | 56.3 | 35.9 | 55.8 | 34.1 |
| + R4R | 46.6 | 59.6 | 76.9 | 58.1 | 37.2 | 55.8 | 35.2 |
| SEAL | 50.9 | 63.5 | 79.3 | 61.2 | 41.4 | 61.1 | 37.2 |
| + R4R | 53.1 | 66.0 | 81.2 | 65.3 | 44.3 | 63.9 | 38.5 |
| MINDER | 50.0 | 66.0 | 80.0 | 62.5 | 44.3 | 64.7 | 37.9 |
| + R4R | 53.8 | 69.3 | 80.0 | 67.7 | 45.7 | 64.1 | 38.1 |
| TSGen | 48.8 | 67.1 | 79.7 | 64.6 | 42.2 | 64.0 | 35.1 |
| + R4R | 52.3 | 69.1 | 81.6 | 68.5 | 44.2 | 66.7 | 36.3 |

- Analysis: R4R provides consistent and significant improvements across all four baseline GR methods on both NQ and MS MARCO. For example, `SEAL + R4R` gains 2.2 points in Hits@1 and 4.1 points in MRR@10 on NQ. The gains are more pronounced for methods with more flexible decoding (`SEAL`, `MINDER`, `TSGen`) than for the rigid prefix-based `DSI`.
Table 4 (Manual Transcription): Performance on Taobao Item Search. This table demonstrates R4R's effectiveness in a real-world e-commerce scenario.
| Method | Hits@1 | Hits@5 | Hits@20 | MRR@10 |
|---|---|---|---|---|
| TSGen | 33.1 | 57.1 | 71.0 | 37.6 |
| + R4R | 34.2 | 59.2 | 72.5 | 39.3 |
| DPR | 29.1 | 62.8 | 73.1 | 39.5 |

- Analysis: R4R again boosts the performance of all GR baselines. For instance, `TSGen + R4R` significantly outperforms the baseline `TSGen`. Interestingly, the enhanced GR model (`TSGen + R4R`) becomes highly competitive with, or even surpasses, strong dense retrieval baselines like `DPR` on this task, showcasing its practical value.
- Ablations / Parameter Sensitivity:
Table 5 (Manual Transcription): Ablation Study of R4R on NQ. This study investigates the contribution of each component of R4R. The results are shown for integrating with `SEAL`.

| Method (with SEAL) | Hits@1 | Hits@5 | Hits@20 | MRR@10 |
|---|---|---|---|---|
| SEAL + R4R (Full) | 53.1 | 66.0 | 81.2 | 65.3 |
| w/o query context | 27.3 | 33.5 | 45.2 | 33.6 |
| w/o expanded explanation | 51.3 | 64.2 | 79.6 | 63.7 |
| w/o verification | 32.6 | 41.3 | 68.2 | 41.3 |

- Analysis:
- Removing the `query context` (and using the verbose `expanded explanation` for retrieval) causes a massive performance drop. This confirms that a compact, docid-aligned context is crucial for guiding the retriever.
- Removing the `expanded explanation` (leaving nothing for the `Refine` step to reflect upon) causes a smaller but still noticeable drop; the model cannot update its strategy as effectively.
- Removing `verification` (and just reflecting on all candidates) also leads to a significant degradation. This shows that identifying the specific point of failure is critical for effective refinement.
Impact of Verify Depth (Figure 3/Image 2):
Image 2: This plot shows that as the verify depth (the number of top candidates to check for relevance) increases, retrieval performance on NQ (Hits@k, MRR@10) tends to decrease. Performance is best for small depths (1 or 3).
- Analysis: Performance peaks at a small verification depth. When the depth is too large, the model may try to "correct" a low-ranked irrelevant result even when the top-ranked results are already correct. This can introduce noise and derail the search. A small verify depth is therefore a good trade-off.
Impact of Iteration Rounds (Figure 4/Image 3):
Image 3: This chart shows how performance and latency change as the maximum number of R4R rounds $T$ increases. Performance metrics (like MRR@10) improve from T=0 to T=3, then plateau or slightly decline. Latency increases steadily with $T$.
- Analysis: Retrieval performance improves with more iterations up to a point (around T=3), after which it can start to degrade (potentially due to error propagation or "over-thinking"). Meanwhile, latency increases with each round. The default maximum number of rounds is therefore chosen to balance performance gain against efficiency cost.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that explicitly incorporating reasoning into Generative Retrieval can significantly boost performance. The proposed R4R framework provides a practical and effective way to do this. By using a single LLM to iteratively `Think`, `Retrieve`, and `Refine`, R4R overcomes the limitations of naive CoT prompting (verbosity and inefficiency). The framework is shown to be a "plug-and-play" enhancement for existing GR methods that use textual docids, delivering consistent gains across multiple diverse datasets without requiring extra models or complex training procedures.
- Limitations & Future Work (from the paper):
- Scope is limited to textual docids: The current R4R framework is incompatible with GR methods that use numeric docids, as those models lose their natural language generation capabilities. Extending the reasoning paradigm to numeric docids is a key area for future work, perhaps by using a separate reasoning model.
- Reliance on large, costly LLMs: R4R requires a powerful, reasoning-capable LLM as its backbone, which may be too expensive for some real-world applications.
- Imperfect refinement strategy: The `Refine` step currently updates the reasoning based on only the first irrelevant document found. This can be suboptimal if a higher-ranked document is actually correct. More robust refinement strategies could be developed.
- Personal Insights & Critique:
- Novelty and Significance: The paper's core contribution—structuring CoT for a specific task (retrieval) and making it iterative—is both simple and powerful. It elegantly bridges the gap between the reasoning and generative capabilities of LLMs. This "in-loop reasoning" pattern is highly transferable and could inspire similar architectures in other domains, such as multi-step tool use or complex planning.
- Potential for Error Propagation: The iterative nature of R4R introduces a risk of error propagation. If the model makes a mistake in the `Verification` or `Reflection` step (e.g., incorrectly judges a relevant docid as irrelevant, or updates the context in a nonsensical way), this error could compound in subsequent iterations, leading the search astray. The paper's results suggest this is manageable, but it remains a potential failure mode.
- Single-Model Architecture: A key strength is using a single LLM for all steps. This is elegant and resource-efficient compared to a multi-agent system. However, it also means the LLM must be a "jack-of-all-trades": good at retrieval, reasoning, and self-correction. This places a high demand on the quality of the base LLM.
- Open Questions: Could the structured reasoning (`query context` and `expanded explanation`) be learned during fine-tuning instead of being generated on-the-fly? This might improve efficiency and reliability. Additionally, how does R4R perform on "unanswerable" or highly ambiguous queries where even iterative refinement might not lead to a good answer? Exploring the failure cases would be an interesting direction.