Learning from Synthetic Data Improves Multi-hop Reasoning
TL;DR Summary
Using synthetic data for RL fine-tuning enhances LLM multi-hop reasoning; despite fictional content, models improve significantly on real benchmarks by learning fundamental knowledge composition skills.
Abstract
Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often obtained through human-annotated datasets and LLM-as-verifier loops. Both of these data types have considerable limitations: human-annotated datasets are small and expensive to curate, while LLM verifiers have high scoring latency and are costly to operate. In this work, we investigate the use of synthetic datasets in RL fine-tuning for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, even though the synthetic data only contain fictional knowledge. On stratifying model performance ...
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Learning from Synthetic Data Improves Multi-hop Reasoning".
1.2. Authors
The authors are anonymous, as indicated by "Anonymous authors" and "Paper under double-blind review". This is a common practice in conference submissions to ensure unbiased review.
1.3. Journal/Conference
The paper is hosted on OpenReview, a platform primarily used to manage double-blind peer review for academic conferences, particularly in machine learning (e.g., ICLR, NeurIPS). The submission date suggests it is under review for an upcoming conference in late 2025. These venues are highly influential in the machine learning and artificial intelligence community.
1.4. Publication Year
The paper was posted on OpenReview on 2025-10-08 (UTC), so the publication year is 2025.
1.5. Abstract
The abstract highlights that Reinforcement Learning (RL) can enhance the reasoning abilities of Large Language Models (LLMs) in various tasks, including multi-hop reasoning. However, RL fine-tuning typically demands extensive, high-quality, and verifiable data, which is usually sourced from expensive human annotations or slow, costly LLM-as-verifier loops. This paper explores the use of synthetic datasets for RL fine-tuning in multi-hop reasoning. The key discovery is that LLMs fine-tuned on synthetic data, even if it contains only fictional knowledge, significantly outperform base models on real-world question-answering benchmarks. By analyzing performance across question difficulty, the authors conclude that synthetic data teaches LLMs to compose knowledge, a fundamental and generalizable reasoning skill. The work thus underscores the effectiveness of synthetic reasoning datasets in boosting LLM reasoning.
1.6. Original Source Link
Official source link: https://openreview.net/forum?id=38nYZ5QBui
PDF link: https://openreview.net/pdf?id=38nYZ5QBui
The paper is currently under double-blind review, as indicated by the "Anonymous authors" and "Paper under double-blind review" notations.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the scarcity and cost of high-quality, verifiable training data for Reinforcement Learning (RL) fine-tuning of Large Language Models (LLMs) in complex reasoning tasks, particularly multi-hop reasoning.
This problem is crucial in the current field because:
- RL fine-tuning has proven highly effective in boosting LLM capabilities in domains like math, coding, and multi-hop reasoning.
- Existing data sources have significant limitations: human-annotated datasets are small and expensive, while LLM-as-verifier loops are slow and costly.
- LLMs trained on internet-scale data are increasingly prone to data leakage and memorization on benchmarks, making reasoning improvements unreliable to measure.
- The pace of LLM training is outstripping the availability of high-quality human-written text for reasoning.

The paper's entry point and innovative idea is to investigate whether LLMs can develop general reasoning capabilities, specifically knowledge composition, solely from synthetic data, without relying on real-world knowledge or expensive human/LLM verification processes. The underlying hypothesis is that reasoning skills can transfer even from fictional domains.
2.2. Main Contributions / Findings
The primary contributions of the paper are:
- Scalable and cost-effective data source: Proposing synthetic multi-hop datasets as a scalable, cost-effective source of reasoning training data that provides effectively unlimited, verifiable training signals. This demonstrates that multi-hop reasoning capabilities can be learned from synthetic data even without factual overlap between training and evaluation domains.
- Empirical evidence for generalization: Providing strong empirical evidence that synthetic reasoning training generalizes to real-world scenarios. The paper shows performance gains across various LLM families and sizes, establishing the practical viability of synthetic data for enhancing reasoning.
- Analysis of reasoning transfer across difficulty: Studying the transfer of reasoning skills to both synthetic and real-world tasks at different question difficulty levels. The findings indicate that improvements on more complex synthetic tasks consistently lead to enhanced performance on increasingly challenging real-world tasks.
The key conclusions and findings reached by the paper are:
- LLMs fine-tuned on synthetic data (PhantomWiki and GSM-∞), which contains only fictional knowledge, perform significantly better on popular real-world question-answering benchmarks (HotpotQA, 2WikiMultihopQA, MuSiQue).
- The performance gains are consistent across different LLM models (Qwen3-0.6B, Qwen3-1.7B, Qwen2.5-1.5B-Instruct, Phi-4-mini-reasoning) and sizes (0.6B to 4B parameters).
- Synthetic data teaches LLMs to compose knowledge, i.e., the fundamental ability to integrate information across multiple inferential steps. This skill is shown to be generalizable and transferable across domains, independent of domain-specific factual knowledge.
- Models do not overfit to the synthetic data; performance on real-world benchmarks continues to improve with more training steps/samples on synthetic data.
- Training on synthetic data improves the LLM's ability to generate reasoning traces that include a higher proportion of correct intermediate answers, especially for more challenging questions.

These findings address the challenge of data scarcity and cost by offering a viable alternative for improving LLM reasoning, potentially accelerating the development of more capable and robust LLMs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp this paper, a beginner needs to understand several core concepts:
- Large Language Models (LLMs): These are advanced artificial intelligence models designed to understand and generate human-like text. They are typically trained on vast amounts of text data from the internet, learning patterns, grammar, and factual information. Examples include GPT-3/4, Llama, Qwen, etc.
- Reasoning: In the context of LLMs, reasoning refers to the model's ability to process information, draw inferences, and solve problems that require more than simple recall of facts. This often involves understanding relationships between pieces of information, performing calculations, or following logical steps.
- Multi-hop Reasoning: A specific type of reasoning where a model must combine information from multiple distinct facts or steps to arrive at an answer. It is not enough to find a single piece of information; the model must "hop" between several pieces of knowledge and integrate them logically. For example, to answer "What is the capital of the country where the Eiffel Tower is located?", an LLM first needs to know the Eiffel Tower is in France, and then know the capital of France.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent performs actions, receives rewards (or penalties) for those actions, and learns to choose actions that maximize cumulative reward. Unlike supervised learning, RL does not require explicit correct-answer labels for every input; instead, it learns from feedback on its actions.
- Fine-tuning: The process of taking a pre-trained LLM (which has learned general language patterns on a massive dataset) and further training it on a smaller, specific dataset to adapt it to a particular task or domain. This typically involves adjusting the model's parameters slightly to improve its performance on the new task.
- Reinforcement Learning from Human Feedback (RLHF): A popular fine-tuning technique for LLMs that uses human preferences as a reward signal. Instead of humans directly labeling correct answers, they rank or provide feedback on different LLM outputs. A reward model is then trained on these human preferences, and this reward model guides the RL agent (the LLM) to generate responses that humans prefer.
- Supervised Fine-Tuning (SFT): A simpler fine-tuning method where an LLM is trained on a dataset of input-output pairs with a next-token prediction objective. It is essentially standard supervised learning applied to LLMs for specific tasks, often used as an initial step before RLHF.
- Chain-of-Thought (CoT): A prompting technique that encourages LLMs to generate intermediate reasoning steps before arriving at a final answer. This mimics human problem-solving and often significantly improves LLM performance on complex reasoning tasks by making the reasoning process explicit.
- Synthetic Data: Data that is not collected from real-world events but is artificially generated. In the context of LLMs, synthetic data can include generated questions, answers, or even entire fictional knowledge bases, often created programmatically or by other LLMs. Its advantage is that it can be created in vast quantities and tailored to specific properties.
- Overfitting: A phenomenon in machine learning where a model learns the training data too well, capturing noise and specific details that do not generalize to new, unseen data. An overfit model performs exceptionally on training data but poorly on test data.
3.2. Previous Works
The paper extensively references prior work, categorizing them into Reasoning in Large Language Models, Training and Fine-tuning Large Reasoning Models, and Leveraging Synthetic Data.
3.2.1. Reasoning in Large Language Models
- Evaluation benchmarks: LLMs are typically evaluated on their reasoning skills using a variety of benchmarks, including:
  - Technical/abstract domains: mathematics (MATH (Hendrycks et al., 2021), MAA), algorithms, coding (Cobbe et al., 2021), and puzzle-solving (Jain et al., 2025; Chollet et al., 2025).
  - Knowledge-intensive domains: sciences and law (Rein et al., 2024; Sawada et al., 2023).
  - General common sense: abductive and counterfactual reasoning (Talmor et al., 2019; Zhao et al., 2023; Bhagavatula et al., 2020; Wu et al., 2025a; Hüyük et al., 2025).
  - Natural language question answering (NLQA): HotpotQA (Yang et al., 2018), 2WikiMultihopQA (Ho et al., 2020), MuSiQue (Trivedi et al., 2022), and others (Tang & Yang, 2024; Qi et al., 2021).
  - Interaction with an environment: planning and tool use (Patil et al., 2024; Zhuang et al., 2023; Yao et al., 2024).
- Many of these benchmarks, especially multi-hop reasoning tasks, require LLMs to break down questions into intermediate subproblems and compose them, which is considered a hallmark of effective reasoning (Gong et al., 2025; Xie et al., 2025; Gandhi et al., 2025).
3.2.2. Training and Fine-tuning Large Reasoning Models
- Supervised Fine-Tuning (SFT): The simplest approach (Lambert et al., 2025). Variants include instruction fine-tuning (Chung et al., 2024) and Chain-of-Thought (CoT) modeling (Xiang et al., 2025; Zelikman et al., 2022; Hao et al., 2025; Yao et al., 2023; Chen et al., 2023; Wan et al., 2025), which encourage more detailed thinking processes.
- Reinforcement Learning from Human Feedback (RLHF): A more complex, RL-based framework using human preferences (Christiano et al., 2017; Ouyang et al., 2022). Algorithms include Proximal Policy Optimization (PPO) (Schulman et al., 2017), GRPO (Shao et al., 2024), and Direct Preference Optimization (DPO) (Rafailov et al., 2023).
  - Proximal Policy Optimization (PPO): A widely used RL algorithm known for its stability and efficiency. It optimizes a stochastic policy in small steps to avoid performance collapse, using a clipping mechanism (and typically a KL divergence penalty) to keep the new policy close to the old one. Writing $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ for the ratio of new-to-old policy probabilities of the action $a_t$ taken at state $s_t$, and $\hat{A}_t$ for the advantage estimate (how much better the action is than the average action), the clipped surrogate objective is: $ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] $ where $\epsilon$ is the clipping hyperparameter (a small numeric sketch of this clipping follows the list below).
- Reinforcement Learning with Verifiable Rewards (RLVR): A variant of RLHF where the human reward model is replaced by a procedural verification function, applicable in domains with objective ground-truth answers such as math or coding (Lambert et al., 2025; Guo et al., 2025a; Abdin et al., 2025). The paper notes that the mechanisms for eliciting novel reasoning patterns through RLVR remain an open research area (Wen et al., 2025; Yue et al., 2025; Shao et al., 2025; Zhao et al., 2025).
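To make the clipping behavior concrete, here is a minimal numeric sketch of the clipped surrogate term for a single action. The values are hypothetical and the snippet is purely illustrative, not taken from any paper or library.

```python
# Minimal numeric sketch of the PPO clipped surrogate term for one action.
# Hypothetical values; epsilon = 0.2 is a commonly used clipping range.

def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one (state, action) pair."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# A positive advantage with an overly large ratio gets clipped (no extra incentive
# to push the policy further), while a negative advantage is not "rescued" by clipping.
print(clipped_surrogate(ratio=1.5, advantage=+2.0))  # 2.4  (uses clipped ratio 1.2)
print(clipped_surrogate(ratio=1.5, advantage=-2.0))  # -3.0 (uses unclipped ratio 1.5)
```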
3.2.3. Leveraging Synthetic Data
- Challenges in fine-tuning: Abstract multi-hop reasoning skills are hard to isolate, being confounded by other skills or memorization (Wu et al., 2025b; Xie et al., 2024; Yu et al., 2024). LLM benchmarks are also prone to test set leakage (Gong et al., 2025; Wu et al., 2025b).
- Synthetic data as a solution: Synthetic datasets alleviate these issues by isolating specific reasoning aspects and providing unlimited examples with verifiable rewards.
- Programmatic generation: Many synthetic reasoning benchmarks are programmatically generated, especially in mathematics (Mirzadeh et al., 2025; Zhou et al., 2025; Wu et al., 2025b), logic puzzles (Xie et al., 2024; Shojaee et al., 2025; Stojanovski et al., 2025), and some NLQA settings (Gong et al., 2025; Guo et al., 2025b; Sinha et al., 2019).
- LLM-generated examples: Other methods use LLMs to create additional examples and reasoning traces to augment existing datasets (Yang et al., 2025; Goldie et al., 2025; Huang et al., 2025; Saad-Falcon et al., 2024; Li et al., 2025).
- The effectiveness and applicability of synthetic data to real-world reasoning skills, especially beyond RLVR domains, remain underexplored (Yu et al., 2024; Mizrahi et al., 2025; Abbe et al., 2024b;a; Stojanovski et al., 2025); this is the gap the paper aims to address.
3.3. Technological Evolution
The field of LLM reasoning has evolved from basic SFT to more sophisticated RL-based fine-tuning techniques like RLHF and RLVR. Initially, LLMs relied heavily on vast real-world datasets for pre-training and fine-tuning. However, as models scale and data quality becomes a bottleneck, the focus has shifted towards synthetic data generation to address issues like data scarcity, cost, and test set leakage. This paper fits into this evolution by exploring the frontier of using purely fictional synthetic data to impart generalized reasoning skills, moving beyond domain-specific RLVR applications. It tests the hypothesis that fundamental reasoning skills like knowledge composition can be learned abstractly, independently of factual knowledge.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's core innovations and differences are:
- Fictional synthetic data for generalization: Unlike RLVR work, which typically stays within systematically verifiable domains (such as math and coding) and generates synthetic problems within that same domain, this work uses synthetic data from fictional domains (PhantomWiki with fictional knowledge, GSM-∞ math problems) to impart reasoning skills that generalize to real-world natural language question-answering tasks. This explicitly rules out direct factual overlap or memorization as the source of improvement.
- Focus on knowledge composition: The paper specifically investigates knowledge composition as a fundamental and generalizable reasoning skill and demonstrates its transferability. Many prior works focus on performance on specific benchmarks, while this paper dissects what is learned (composition) and how it transfers.
- Scalable and verifiable training signals: It presents synthetic datasets as a solution to the data scarcity and cost issues associated with human-annotated data and LLM-as-verifier loops, providing an effectively unlimited source of verifiable training signals. This is a practical advancement for LLM fine-tuning.
- Demonstration of non-overfitting with synthetic data: The paper empirically shows that scaling synthetic training data does not cause overfitting; performance on real-world benchmarks continues to improve with more synthetic training steps, suggesting robust generalization. This addresses a common concern with synthetic data.
4. Methodology
The methodology section details how the authors conducted their experiments to study the transfer performance from synthetic to real-world datasets. This involved RL fine-tuning various LLMs on carefully selected synthetic datasets and evaluating them on real-world multi-hop reasoning benchmarks.
4.1. Principles
The core idea behind the method is that fundamental reasoning skills, particularly knowledge composition (the ability to chain logical inferences), can be learned from structured, synthetically generated data, even if that data contains no real-world facts. The theoretical basis is that reasoning is a meta-skill independent of domain-specific factual knowledge. By training LLMs with Reinforcement Learning on synthetic datasets that require multi-step reasoning in fictional contexts, the models should learn the process of reasoning, which can then generalize to tasks in real-world knowledge domains. The RL framework provides a mechanism for the model to learn from positive rewards for correct reasoning paths.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology consists of several key components: LLM selection, synthetic training dataset curation, RL fine-tuning algorithm application, and prompt/reward design.
4.2.1. LLM Selection
The study uses LLMs of varying sizes to ensure the generality of the findings:
- Qwen3-0.6B
- Qwen3-1.7B
- Qwen2.5-1.5B-Instruct (Qwen Team, 2024, 2025)
- Phi-4-mini-reasoning (Abdin et al., 2025)

Each model is fine-tuned for 1 epoch on a random shuffle of the selected synthetic training datasets. This required 4 NVIDIA H100 GPUs for approximately 1 day of training for the larger models and longer CoT generations.
4.2.2. Synthetic Training Datasets
Two types of synthetic datasets are chosen for their scalable verification and varying difficulty:
4.2.2.1. GSM-∞ (GSM-Infinity)
- Source: (Zhou et al., 2025)
- Description: This dataset generalizes the GSM8K benchmark (grade-school math word problems) to an infinitely extensible version. It constructs random computation graphs to represent ground-truth solutions, which can be augmented with distractor facts. These graphs are then converted into word problems using natural language templates across pre-defined themes (e.g., zoo, teacher-school, movie).
- Relevance: Represents math-based synthetic reasoning and helps investigate generalizability from arithmetic skills to knowledge-intensive real-world tasks.
- Configuration for experiments:
  - Difficulty level: "medium".
  - Number of arithmetic operations: 2 to 20 (this defines reasoning complexity).
  - Context length: zero (ensures problems contain only necessary information, simplifying hop identification).
- Data size: Approximately 600 questions per difficulty level across 19 levels. Half come from the zoo theme, one quarter from teacher-school, and one quarter from movie, split equally between forward (addition/multiplication) and reverse (subtraction/division) modes.
- Total samples: The generated questions (roughly 600 × 19 ≈ 11,400) are split into a training set and a held-out validation set.
4.2.2.2. PhantomWiki
- Source: (Gong et al., 2025)
- Description: This dataset generator produces on-demand synthetic datasets comprising natural language document corpora and question-answer pairs. It is designed to evaluate LLMs on multi-step and multi-branch reasoning and retrieval. Each dataset describes a random universe of fictional individuals, their attributes, and their relationships in Wikipedia-like documents. A context-free grammar and a logic-programming-based algorithm generate multi-hop reasoning questions (e.g., "Who is the nephew of the friend of the person who likes birdwatching?"). PhantomWiki questions can have multiple answers and require retrieval and knowledge composition across multiple documents.
- Relevance: Directly tests retrieval and knowledge composition skills in a fictional natural language context.
- Configuration for experiments:
  - Relations: Only "easy" relations (immediate family, friends), to simplify conceptual hops.
  - Question type filter: Excludes aggregation questions ("How many ...?") to focus purely on multi-hop questions ("Who is the ... of ...?", "What is the ... of ...?").
  - Difficulty definition: Answering a question of difficulty $n$ requires hopping through the documents of exactly $n$ individuals.
  - Generation: 34 universes of 25 individuals each, using 100 random seeds, with the context-free grammar recursion depth set to 20 to obtain varying difficulties (1 to 9 hops). Ground-truth answers are obtained via PhantomWiki's logic program.
- Total samples: 330 questions per universe.
- Training/validation split: 31 universes for training and 3 universes for validation (at 330 questions per universe, roughly 10,200 and 990 samples, respectively). An illustrative sketch of the relation chaining these questions require follows below.
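To illustrate what a PhantomWiki-style multi-hop question demands of a model, the following sketch chains relation lookups over a tiny hand-made dictionary of fictional facts. The facts, names, and the `follow` helper are invented for illustration only; PhantomWiki itself generates its corpora and answers with a context-free grammar and a logic program, as described above.

```python
# Illustrative sketch (not PhantomWiki's actual generator) of how a multi-hop
# question decomposes into a chain of relation lookups over fictional facts.
# All names and relations below are made up for illustration.

facts = {
    ("Aida Wang", "friend"): ["Noah Price"],
    ("Noah Price", "sister"): ["Mia Price"],
    ("Mia Price", "child"): ["Leo Price"],
}

def follow(start: str, relations: list[str]) -> list[str]:
    """Answer 'Who is the <r_n> of ... of the <r_1> of <start>?' by chaining hops."""
    frontier = [start]
    for rel in relations:
        frontier = [p for person in frontier for p in facts.get((person, rel), [])]
    return frontier

# "Who is the child of the sister of the friend of Aida Wang?"  -> a 3-hop question
print(follow("Aida Wang", ["friend", "sister", "child"]))  # ['Leo Price']
```

Answering such a question correctly requires retrieving each intermediate entity and composing the hops in order, which is exactly the knowledge composition skill the paper argues is transferable.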
4.2.3. RL Fine-tuning for Reasoning
The paper uses Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as the primary RL fine-tuning algorithm.
4.2.3.1. Group Relative Policy Optimization (GRPO)
- Description: GRPO is a variant of Proximal Policy Optimization (PPO) designed to reduce memory and compute requirements. It replaces PPO's value-model-based advantage estimation with an estimate based on a group of online completions sampled for each prompt.
- Objective function: Given a question $q$ sampled from a distribution over the question set $P(Q)$, GRPO samples a group of $G$ output completions $\{o_i\}_{i=1}^{G}$ from the old LLM with parameters $\theta_{\mathrm{old}}$. Each completion $o_i$ is assigned a scalar reward value $R_i$. The advantage of each completion is estimated by normalizing $R_i$ with respect to the group's average reward as a baseline. The final objective is:

$ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \mathrm{clip}\!\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right) \right] $

  - Symbol explanation:
    - $J_{\mathrm{GRPO}}(\theta)$: The GRPO objective function being maximized with respect to the LLM's current policy parameters $\theta$.
    - $\mathbb{E}[\cdot]$: Expectation over questions $q$ sampled from the distribution $P(Q)$ and over groups of output completions generated by the old policy $\pi_{\theta_{\mathrm{old}}}$.
    - $G$: The size of the group of output completions sampled for each question.
    - $o_i$: The $i$-th output completion (sequence of tokens) generated by the LLM.
    - $|o_i|$: The length of the $i$-th output completion.
    - $t$: An index iterating over tokens within an output completion $o_i$.
    - $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$: The relative weight or importance ratio for token $o_{i,t}$, i.e., the ratio of the probability of the token under the current policy $\pi_\theta$ to its probability under the old policy $\pi_{\theta_{\mathrm{old}}}$, given the question $q$ and the previous tokens $o_{i,<t}$.
    - $\hat{A}_{i,t}$: The advantage estimate for the $i$-th completion at token $t$. In GRPO, this is constant across tokens of a given completion $o_i$ and is calculated as the normalized reward $\hat{A}_{i,t} = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$.
    - $R_i$: The scalar reward value assigned to the $i$-th output completion.
    - $\mathrm{mean}(\{R_j\}_{j=1}^{G})$: The average reward across all completions in the group.
    - $\mathrm{std}(\{R_j\}_{j=1}^{G})$: The standard deviation of rewards across all completions in the group; this normalization helps stabilize training.
    - $\mathrm{clip}(\cdot,\, 1-\epsilon,\, 1+\epsilon)$: A function that clips its argument to the range $[1-\epsilon, 1+\epsilon]$. In the objective, it limits the importance ratio to prevent large policy updates that could destabilize training, similar to PPO.
    - $\epsilon$: A hyperparameter that defines the clipping range for the importance ratio.
    - $\beta$: A hyperparameter controlling the strength of the KL divergence penalty.
    - $\mathbb{D}_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$: The Kullback-Leibler (KL) divergence between the current policy and a reference policy (usually the model's initialization). This penalty discourages the new policy from deviating too far from the original policy, maintaining stability and avoiding catastrophic forgetting.
- Implementation: The authors use the GRPOTrainer implementation from the open-source Hugging Face TRL library. For their experiments, the KL-divergence penalty hyperparameter $\beta$ is set to 0. A minimal sketch of the group-relative advantage computation follows.
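The snippet below is a minimal sketch of GRPO's group-relative advantage estimation as defined above: rewards within a group are normalized by the group's mean and standard deviation, replacing PPO's learned value baseline. It is an illustrative reconstruction, not the TRL implementation.

```python
import statistics

# Minimal sketch of GRPO's group-relative advantage: each completion's reward is
# normalized against the mean and standard deviation of its own group.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean_r) / std_r for r in rewards]

# Example: a group of G = 4 completions for one prompt, rewarded 1 if correct else 0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, so the policy is pushed toward completions that outperform the rest of their own group.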
4.2.4. Prompt and Reward Design
The LLMs are trained to perform in-context reasoning and retrieval, meaning all relevant context is provided in the prompt.
- Prompt structure:
  - Evidence: For GSM-∞, this is the problem statement. For PhantomWiki, it includes the documents for all 25 individuals in the generated universe.
  - Instruction: Guides the LLM to output the final answer within <answer>...</answer> tags, a standard format for DeepSeek-R1 and Qwen3 models.
  - CoT examples: To further guide the output format and reasoning process:
    - GSM-∞: 3 automatically generated ground-truth CoT solutions from the training set.
    - PhantomWiki: 11 CoT examples originally curated by Gong et al. (2025).
  - Question: The actual question posed to the LLM. The full prompts are detailed in Appendix C.
- Reward model:
  - Model generations are parsed using regular expressions to extract the last <answer>...</answer> tag.
  - GSM-∞: A binary reward (1 or 0) is given based on whether the correct numeric value is produced.
  - PhantomWiki: An F1 score (between 0 and 1) is used as the reward, since questions can have multiple correct answers.
  - A sketch of this extraction-and-scoring logic appears after the prompt templates below.

The following is an example of a PhantomWiki prompt template (the full prompt with many examples is in Appendix C.1):
You are given the following evidence:
(BEGIN EVIDENCE)
{{evidence}}
(END EVIDENCE)
You will be provided a question. Your response must end with the final answer enclosed in tags: <answer>FINAL_ANSWER</answer>
Here, FINAL_ANSWER must be one of the following:
- a name (if there is only one correct answer);
- a list of names separated by ',' (if there are multiple correct answers); or
- numbers separated by ',' (if the answer is numerical); or
- empty string (if there is no answer).
Here are some examples:
(START OF EXAMPLES)
Example 1:
Question: Who is the sister of Aida Wang?
Answer: Based on the evidence, the sisters of Aida Wang are Barabara Beltran, Vicki Hackworth. <answer>Barabara Beltran, Vicki Hackworth</answer>.
... (many more examples) ...
(END OF EXAMPLES)
Question: {{question}}
Answer: """
An example GSM-∞ prompt template (full prompt with examples is in Appendix C.2):
You are given the following problem:
(BEGIN PROBLEM)
{{problem}}
(END PROBLEM)
You will be provided a question on the above problem. Your response must end with the final answer enclosed in tags: <answer>FINAL_ANSWER</answer>
Here, FINAL_ANSWER must be a number.
Here are some examples:
(START OF EXAMPLES)
Example 1:
Question: What is the total number of adult animals in Maple Creek ?
Answer: Define adult wolf in Maple Creek as r; so r = 2. Define total number of adult animals in Maple Creek as p; so p = r = 2. <answer>2</answer>.
... (many more examples) ...
(END OF EXAMPLES)
Question: {{question}}
Answer:
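Below is a minimal sketch (my own reconstruction, not the authors' code) of the reward logic described in Section 4.2.4: extract the last <answer>...</answer> span with a regular expression, then score it with a binary numeric check for GSM-∞ or a set-based F1 score for PhantomWiki. The function names and the exact matching rules (e.g., comma-splitting of multiple answers) are assumptions.

```python
import re

# Sketch of the verifiable reward functions: answer extraction plus scoring.

def extract_answer(completion: str) -> str | None:
    """Return the contents of the last <answer>...</answer> span, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def gsm_reward(completion: str, target: float) -> float:
    """Binary reward: 1.0 if the extracted answer equals the target number."""
    answer = extract_answer(completion)
    try:
        return 1.0 if answer is not None and float(answer) == target else 0.0
    except ValueError:
        return 0.0

def phantomwiki_reward(completion: str, gold: set[str]) -> float:
    """Set-based F1 between the predicted name list and the gold answer set."""
    answer = extract_answer(completion)
    predicted = {a.strip() for a in answer.split(",")} if answer else set()
    if not predicted or not gold:
        return 0.0
    precision = len(predicted & gold) / len(predicted)
    recall = len(predicted & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(gsm_reward("so p = r = 2. <answer>2</answer>", target=2))              # 1.0
print(phantomwiki_reward("<answer>Barabara Beltran</answer>",
                         gold={"Barabara Beltran", "Vicki Hackworth"}))      # ~0.67
```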
5. Experimental Setup
5.1. Datasets
The experiments utilize a combination of synthetic datasets for RL fine-tuning and real-world datasets for evaluation.
5.1.1. Synthetic Training Datasets
- GSM-∞ (Zhou et al., 2025): An infinitely extensible dataset of grade-school math word problems. It features random computation graphs converted to word problems via templates, allowing for varying difficulty based on the number of arithmetic operations. The problems are knowledge-free and focus on arithmetic reasoning.
  - Data sample example (from Appendix C.2): Question: What is the total number of adult animals in Maple Creek?
- PhantomWiki (Gong et al., 2025): Generates on-demand synthetic datasets of fictional document corpora and question-answer pairs for multi-step and multi-branch reasoning and retrieval. It creates universes of fictional individuals and their relationships, then generates multi-hop questions using a context-free grammar and logic programming. It specifically tests knowledge composition and retrieval in a natural language context.
  - Data sample example (from Appendix C.1): Question: Who is the sister of Aida Wang? The evidence would be a set of fictional Wikipedia-like documents describing Aida Wang, Barabara Beltran, Vicki Hackworth, and their relationships.
These synthetic datasets were chosen because they provide scalable verification (ground-truth answers are programmatically generated) and questions of varying difficulty, which are crucial for effective RL fine-tuning. They allow isolating multi-hop reasoning skills without relying on real-world factual knowledge, thus enabling the study of reasoning transfer.
5.1.2. Real-world Evaluation Datasets
The models are evaluated on 500 randomly subsampled questions from the test sets of the following three in-context question answering datasets, which measure multi-hop reasoning capabilities. These were chosen to represent popular, established multi-hop NLQA benchmarks.
- HotpotQA (Yang et al., 2018): A multi-hop question answering dataset with over 100,000 questions, each requiring information from two Wikipedia paragraphs. It features a consistent two-hop reasoning structure.
  - Domain: Real-world factual knowledge (Wikipedia).
- 2WikiMultihopQA (Ho et al., 2020): A more recent two-hop dataset with over 190,000 questions, categorized into compositional, inference, comparison, and bridge-comparison types. Questions are grounded in Wikidata's knowledge graph, following specific two-hop paths between entities.
  - Domain: Real-world factual knowledge (Wikidata).
- MuSiQue (Trivedi et al., 2022): Evaluates compositional reasoning with 2-4 hop questions created by bridging single-hop questions. It requires joining information from multiple separate paragraphs. The MuSiQue-Answerable split is used to ensure all questions can be answered from the provided context.
  - Domain: Real-world factual knowledge.
5.2. Evaluation Metrics
The paper primarily uses F1 score and binary correctness (accuracy) for evaluation and reward.
5.2.1. F1 Score
- Conceptual definition: The F1 score is a measure of a model's accuracy on a dataset, often used in information retrieval and natural language processing. It is the harmonic mean of precision and recall, providing a single score that balances both. Precision measures how many of the selected items are relevant, while recall measures how many relevant items are selected. The F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is particularly useful when dealing with imbalanced classes or when both false positives and false negatives matter.
- Mathematical formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $ where $ \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $ and $ \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $.
- Symbol explanation:
  - $\mathrm{TP}$ (true positives): The number of items correctly identified as positive (e.g., correct answers identified by the model).
  - $\mathrm{FP}$ (false positives): The number of items incorrectly identified as positive (e.g., answers given by the model that are not in the ground truth).
  - $\mathrm{FN}$ (false negatives): The number of items that are actually positive but were missed (e.g., ground-truth answers that the model did not produce).
5.2.2. Accuracy (Binary Correctness)
- Conceptual definition: Accuracy is the ratio of correctly predicted observations to the total number of observations. It is a straightforward measure of how often the model is correct. In a binary-correctness setting (as with GSM-∞), it is simply the proportion of questions for which the model produced the exact correct answer.
- Mathematical formula: $ \mathrm{Accuracy} = \frac{\mathrm{Number\ of\ Correct\ Predictions}}{\mathrm{Total\ Number\ of\ Predictions}} $
- Symbol explanation:
  - Number of correct predictions: The count of instances where the model's output exactly matches the ground-truth answer.
  - Total number of predictions: The total count of instances (questions) in the dataset.
5.3. Baselines
The paper primarily compares the performance of the RL fine-tuned LLMs against their base models (i.e., the same LLM before any RL fine-tuning on synthetic data). These base models serve as the reference point to quantify the improvement brought by synthetic data training.
The LLMs used are:
- Qwen3-0.6B
- Qwen3-1.7B
- Phi-4-mini-reasoning (a 4B-parameter model, itself already trained on some in-house synthetic data (Abdin et al., 2025))
- Qwen2.5-1.5B-Instruct

The rationale for these choices is to demonstrate consistent performance transfer across a range of LLM families and sizes, from small (0.6B) to medium (4B), and to test whether even a model already pre-trained for reasoning (Phi-4-mini-reasoning) can benefit further.
5.4. Implementation Details
- Software: Hugging Face TRL library (version 0.21.0) with GRPOTrainer, vLLM colocate mode (Kwon et al., 2023), and FlashAttention-2 (Dao, 2024).
- Hardware: 4 NVIDIA H100 GPUs (each with 80 GB VRAM).
- Training time: RL fine-tuning a 1-4B-parameter LLM on the synthetic training samples takes about a day. Phi-4-mini-reasoning and Qwen3-1.7B took the full day (roughly 24 hours) due to long CoT generations; Qwen2.5-1.5B-Instruct trained faster because it did not generate long CoT.
- Hyperparameters (from Listing 1 in Appendix A):
  - per_device_train_batch_size: 8 (adjusted to 4 for Phi-4-mini-reasoning)
  - gradient_accumulation_steps: 1
  - num_generations: 16 (adjusted to 8 for Phi-4-mini-reasoning)
  - vllm_gpu_memory_utilization: 0.20 (adjusted to 0.25 for Phi-4-mini-reasoning)
  - max_completion_length: 4096
  - temperature: 1.0 (other sampling parameters at their default values); repetition_penalty: 1.0
  - GRPO algorithm parameters: importance_sampling_level: "token", scale_rewards: true, loss_type: bnpo, mask_truncated_completions: false, with the KL penalty coefficient β set to 0 (as noted above)
- Prompt lengths: PhantomWiki: 6000; GSM-∞: 2048; HotpotQA: 6000; 2WikiMultihopQA: 6000; MuSiQue: 8000.
- Random seeds: Data generation uses fixed random seeds, and fixed training random seeds are used where possible to ensure reproducibility.
- Evaluation: Models are evaluated with 2 random training seeds, and standard errors are reported.
6. Results & Analysis
The experimental results demonstrate that fine-tuning LLMs on synthetic data significantly improves their performance on real-world multi-hop reasoning benchmarks, even though the synthetic data contains only fictional knowledge. This indicates a successful transfer of knowledge composition skills.
6.1. Core Results Analysis
6.1.1. Performance Transfer from Synthetic to Real-world Datasets
The primary finding is that RL fine-tuning on synthetic datasets (PhantomWiki and GSM-∞) consistently improves F1 scores on real-world multi-hop reasoning benchmarks (HotpotQA, 2WikiMultihopQA, and MuSiQue). This transfer is observed across different LLM families (Qwen, Phi) and sizes (0.6B to 4B parameters).
The following figure (Figure 2 from the original paper) shows the F1 scores on real-world multi-hop reasoning datasets of LLMs finetuned with GRPO on synthetic datasets PhantomWiki and GSM-∞:

Analysis of Figure 2:
- Consistent improvement: For nearly all LLMs and benchmarks, both PhantomWiki and GSM-∞ training lead to higher F1 scores compared to the base model (labeled "base").
- PhantomWiki's superiority: PhantomWiki generally yields better performance transfer than GSM-∞ for these multi-hop reasoning tasks. For example, Qwen3-0.6B trained on PhantomWiki shows remarkable relative improvements on HotpotQA, 2WikiMultihopQA, and MuSiQue. This suggests that natural-language-based fictional reasoning (PhantomWiki) transfers more effectively to real-world natural language QA than math-based reasoning (GSM-∞).
- Model family and size consistency: The positive trend holds for Qwen3-0.6B, Qwen3-1.7B, Qwen2.5-1.5B-Instruct, and Phi-4-mini-reasoning. Even Phi-4-mini-reasoning, which was already designed for reasoning, benefits from this additional RL fine-tuning. This suggests the approach is robust across different LLM architectures.
- Statistical significance: The error bars (standard error from 2 random training seeds) indicate that the improvements are generally statistically significant.
6.1.2. Ablation Study on Binary Format Reward
To distinguish between learning proper output formatting and true reasoning capabilities, an ablation study was conducted. LLMs were fine-tuned for 3K training steps solely on a binary reward signal for producing the correct format (1 if correct, 0 if not).
The following are the results from Table 1 of the original paper:
| Model | Setting | HotpotQA | 2WikiMultihopQA | MuSiQue |
|---|---|---|---|---|
| Qwen3-0.6B | base | 0.36 ± 0.02 | 0.37 ± 0.02 | 0.14 ± 0.01 |
| | format | 0.38 ± 0.02 | 0.34 ± 0.02 | 0.13 ± 0.01 |
| Qwen3-1.7B | base | 0.59 ± 0.02 | 0.64 ± 0.02 | 0.34 ± 0.02 |
| | format | 0.64 ± 0.02 | 0.67 ± 0.02 | 0.35 ± 0.02 |
| Phi-4-mini-reasoning | base | 0.48 ± 0.02 | 0.66 ± 0.02 | 0.27 ± 0.02 |
| | format | 0.47 ± 0.02 | 0.48 ± 0.02 | 0.26 ± 0.02 |
| Qwen2.5-1.5B-Instruct | base | 0.02 ± 0.01 | 0.14 ± 0.02 | 0.04 ± 0.01 |
| | format | 0.43 ± 0.02 | 0.30 ± 0.02 | 0.20 ± 0.02 |
Analysis of Table 1:
- Qwen2.5-1.5B-Instruct improvement: This model shows a remarkable improvement (e.g., from 0.02 to 0.43 F1 on HotpotQA) with format-only reward training. This suggests its base model struggled with the output format, and RL fine-tuning mainly taught it to produce the desired format (a form of reward hacking). Its performance improvements in Figure 2 therefore reflect both formatting and reasoning gains.
- Other models' relative stability: For the Qwen3 family and Phi-4-mini-reasoning, format reward training does not significantly improve performance, and sometimes even slightly degrades it (e.g., Phi-4-mini-reasoning on 2WikiMultihopQA). This indicates these models already had good output formatting capabilities at initialization.
- Key takeaway: For models that already format correctly, the improvements observed in Figure 2 are attributable to enhanced knowledge composition, not just formatting. This empirically demonstrates that LLMs can develop knowledge composition from synthetic data alone and apply it in real-world settings, since PhantomWiki and GSM-∞ are purely fictional.
6.1.3. Reasoning Evolution During Training
The paper investigates how reasoning capabilities evolve during training by evaluating intermediate checkpoints saved at every 10% of the total training steps. Since training is for 1 epoch and synthetic data is generated from random universes, this also allows studying the effect of synthetic data scaling.
The following figure (Figure 3 from the original paper) shows F1 scores on real-world multi-hop reasoning datasets of intermediate training checkpoints, when LLMs are finetuned with GRPO on synthetic datasets:

Analysis of Figure 3:
- Continued improvement: Qwen3 LLMs (Qwen3-0.6B and Qwen3-1.7B) continue to improve on real-world multi-hop reasoning benchmarks with more training steps (or equivalently, more training samples). This is particularly evident with PhantomWiki training.
- No overfitting: The sustained improvement suggests that models do not overfit to the synthetic training dataset. Instead, learning knowledge composition in fictional worlds consistently translates to gains in real-world contexts.
- Varying malleability: Different LLMs exhibit varying malleability to RL fine-tuning. Qwen3-0.6B starts lower but shows a steeper upward trend, while Qwen3-1.7B improves more slowly. This hints at the role of LLM initialization in RL fine-tuning, an area for future work.
- Similar trends are observed for Phi-4-mini-reasoning in Figure 6 (Appendix B), while Qwen2.5-1.5B-Instruct does not show such continuous improvement.

The following figure (Figure 6 from the original paper, located in Appendix B) shows F1 scores on real-world multi-hop reasoning datasets of intermediate training checkpoints for Phi-4-mini-reasoning and Qwen2.5-1.5B-Instruct:

Analysis of Figure 6:
- Phi-4-mini-reasoning: Shows a general improvement trend with training steps, similar to the Qwen3 models, indicating that RL fine-tuning on synthetic data is beneficial even for models already designed for reasoning.
- Qwen2.5-1.5B-Instruct: Performance saturates relatively quickly, especially when trained on GSM-∞. This contrasts with the Qwen3 models and Phi-4-mini-reasoning, suggesting Qwen2.5-1.5B-Instruct is either less amenable to learning additional knowledge composition skills through this method or primarily benefits from format learning, as seen in the ablation study.
6.1.4. Reasoning Evolution Across Question Difficulty
The synthetic datasets (PhantomWiki and GSM-∞) contain questions of varying difficulties, enabling analysis of model performance as a function of reasoning complexity (e.g., number of hops, number of arithmetic operations).
The following figure (Figure 5 from the original paper) shows reasoning evolution plots of F1 vs question difficulty of intermediate training checkpoints for Qwen3-0.6B and Qwen3-1.7B:

Analysis of Figure 5:
- Learning across all difficulties: Qwen3 LLMs learn to correctly answer questions across all difficulty levels as training progresses (indicated by darker lines representing more training steps).
- Transferability of knowledge composition: This improvement on validation questions across all difficulties, drawn from universes completely disjoint from the training sets, signifies an enhancement in knowledge composition at all levels simultaneously.
- Impact of dataset type: The learning patterns differ between PhantomWiki (natural language, knowledge composition) and GSM-∞ (math, arithmetic operations). PhantomWiki training shows more consistent and broader improvements across difficulty levels for natural-language multi-hop QA tasks.
- Similar trends are observed for Phi-4-mini-reasoning in Figure 7 (Appendix B), while Qwen2.5-1.5B-Instruct saturates quickly, especially on GSM-∞.

The following figure (Figure 7 from the original paper, located in Appendix B) shows reasoning evolution plots of F1 vs question difficulty of intermediate training checkpoints for Qwen2.5-1.5B-Instruct and Phi-4-mini-reasoning:

Analysis of Figure 7:
- Phi-4-mini-reasoning: Demonstrates continued improvement across varying difficulties for both PhantomWiki and GSM-∞ as training progresses, reinforcing the idea that even specialized reasoning models can further refine their compositional abilities.
- Qwen2.5-1.5B-Instruct: Shows quick saturation, particularly on GSM-∞. The performance improvement plateaus early, suggesting limitations in its ability to acquire deeper knowledge composition skills from these synthetic datasets beyond an initial boost.
6.1.5. Learning to Compose Knowledge in Real-world Tasks
Finally, the paper demonstrates that LLMs learn to compose knowledge by analyzing their reasoning traces on the MuSiQue dataset. MuSiQue questions come with ground-truth intermediate answers.
The following figure (Figure 4 from the original paper) shows reasoning evolution plots on MuSiQue of PhantomWiki training checkpoints:

Analysis of Figure 4:
- Increased intermediate answer generation: As LLMs undergo more training steps on PhantomWiki (darker lines), their generated reasoning traces include a progressively higher proportion of correct intermediate answers (the Nth intermediate answer).
- Evidence for knowledge composition: This directly supports the claim that LLMs are learning to construct multi-step reasoning paths, not just producing final answers by chance. The ability to correctly generate intermediate steps is a strong indicator of learned knowledge composition.
- Unified insight: This observation ties together the findings from performance transfer (Figure 2) and synthetic reasoning evolution (Figure 5), confirming that knowledge composition is a fundamental, generalizable skill critical for multi-hop reasoning, transferable across both synthetic and real-world domains.
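As a rough illustration of how the proportion of correct intermediate answers in a reasoning trace could be measured, here is a simple sketch based on substring matching against MuSiQue's ground-truth intermediate answers. This is my own reconstruction under that assumption; the paper's exact matching procedure may differ.

```python
# Sketch: fraction of ground-truth intermediate answers mentioned in a reasoning trace.

def intermediate_answer_coverage(trace: str, intermediate_answers: list[str]) -> float:
    """Return the share of gold intermediate answers that appear in the trace."""
    trace_lower = trace.lower()
    hits = sum(ans.lower() in trace_lower for ans in intermediate_answers)
    return hits / len(intermediate_answers) if intermediate_answers else 0.0

# Hypothetical 2-hop example: the trace names the bridge entity ("France") and the
# final answer ("Paris"), so coverage is 1.0.
trace = "The Eiffel Tower is in France. The capital of France is Paris. <answer>Paris</answer>"
print(intermediate_answer_coverage(trace, ["France", "Paris"]))  # 1.0
```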
6.2. Data Presentation (Tables)
The only table in the paper, Table 1, has been fully transcribed in the Ablation Study section above.
6.3. Ablation Studies / Parameter Analysis
The ablation study on binary format reward training (Table 1) effectively disentangles the improvements due to correct output formatting from true gains in reasoning capabilities. It shows that while some models (like Qwen2.5-1.5B-Instruct) initially struggle with formatting and benefit significantly from this reward, others (like Qwen3 and Phi-4-mini-reasoning) already handle formatting well. For the latter, the substantial performance increases observed in the main results are confidently attributed to enhanced knowledge composition rather than mere formatting adherence. This strengthens the paper's core claim about the transferability of reasoning skills.
The study of reasoning evolution during training (Figures 3, 5, 6, 7) also functions as an analysis of the effect of training steps (or synthetic data scaling). It confirms that continued exposure to synthetic data leads to sustained improvements without overfitting, especially for knowledge composition as measured by increasing accuracy on more difficult questions and better intermediate step generation. This implies that the amount of synthetic data can be scaled to further enhance LLM reasoning.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work rigorously evaluates the efficacy of synthetic multi-hop reasoning datasets as a scalable and cost-effective alternative to traditional real-world training data for LLM reasoning. The findings conclusively demonstrate that RL fine-tuning on these synthetic datasets instills transferable compositional inference abilities in LLMs. These acquired skills lead to significant performance gains on diverse real-world question-answering benchmarks like HotpotQA, 2WikiMultihopQA, and MuSiQue, despite a complete absence of factual overlap between the synthetic training data and the real-world evaluation domains. The paper establishes that reasoning, particularly the ability to compose knowledge across multiple logical steps, is a fundamental and generalizable skill that can transfer across different domains and varying levels of complexity. Crucially, the models do not overfit to the synthetic data, showing continuous improvement with more training.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Extent of transferability: While the study demonstrates transferability of knowledge composition, the precise extent of this transferability remains an open question. Real-world tasks involve both factual knowledge and knowledge composition.
- Interplay of factual knowledge and synthetic data: The paper highlights that memorization can degrade performance in counterfactual contexts (Wu et al., 2025a), but models can learn memorization and generalization simultaneously (Xie et al., 2024). Since the synthetic datasets are knowledge-free, further investigation into their interplay with knowledge-intensive real-world datasets is warranted.
- Other reasoning capabilities: The current work focuses on knowledge composition through multi-hop reasoning. Future work should explore whether other reasoning capabilities, such as causal reasoning, counterfactual inference, or analogical thinking, exhibit similar transferability from synthetic datasets.
- Boundary conditions for synthetic-to-real transfer: Understanding the specific conditions and mechanisms under which synthetic data transfers effectively to real-world tasks, and extending these methods beyond multi-hop reasoning (Zhao et al., 2023; Wu et al., 2025b; Wang et al., 2024), remains an important area for research.
- LLM initialization and malleability: The varying malleability of different LLMs to RL fine-tuning (e.g., Qwen3-0.6B showing steep improvement vs. Qwen2.5-1.5B-Instruct saturating quickly) suggests that LLM initialization and its inherent "quality" affect RL fine-tuning outcomes. Analyzing this effect is left for future work.
7.3. Personal Insights & Critique
This paper offers a compelling and highly practical insight into the potential of synthetic data for LLM development. The demonstration that knowledge composition can be learned from purely fictional, verifiable data and then generalized to complex real-world tasks is a significant step forward. It provides a blueprint for mitigating the data scarcity and cost bottlenecks that currently hinder LLM fine-tuning, especially for reasoning tasks. The rigor of testing across multiple LLM sizes and families, and the detailed analysis of reasoning evolution and difficulty stratification, strengthen the claims substantially.
Inspirations and Applications:
- Curriculum learning with synthetic data: The idea of "domain experts as curators of verifiable curricula" for synthetic data is powerful. It could lead to a more principled approach to LLM training, where specific skills are targeted and honed using tailored synthetic environments, much like how humans learn through structured education.
- Robustness to factual changes: If LLMs can learn reasoning independently of specific facts, they might become more robust to factual changes or out-of-distribution knowledge, as their core reasoning engine would be more abstract. This could improve LLM performance in rapidly evolving knowledge domains.
- Debugging reasoning: Synthetic datasets with controllable parameters (like PhantomWiki's number of hops) offer an excellent environment for debugging LLM reasoning. Researchers could pinpoint exactly where a model struggles in a multi-step inference process.
Potential Issues or Unverified Assumptions:
- "Fictional knowledge" vs. "no knowledge": While PhantomWiki is fictional, it still uses natural language structures and common relational concepts (family, friends). It is not entirely "knowledge-free" in the sense of abstract symbols; it implicitly carries structural knowledge of how entities and relations are expressed in natural language. Whether truly abstract, domain-agnostic reasoning could be learned from symbolic-only synthetic data is a deeper question.
- Task specificity of "knowledge composition": The multi-hop reasoning tasks primarily test chained inference. While fundamental, knowledge composition might manifest differently in other reasoning paradigms (e.g., causal, abductive). The transferability of the compositional ability learned here to those other reasoning types is assumed but not directly tested.
- Scaling limit of synthetic data: While the paper shows no overfitting at current synthetic data scales, there may be a point where synthetic data alone cannot provide the nuances or complexities present in real-world data, leading to a performance plateau or even negative transfer if the synthetic world is too simplistic. Qwen2.5-1.5B-Instruct's saturation might hint at this for smaller models or certain architectures.
- Complexity of synthetic data generation: While presented as scalable, generating high-quality synthetic data (especially for natural language tasks like PhantomWiki) still requires sophisticated grammar design, logic programming, and potentially LLM-based generation, which can be complex to curate and manage for new reasoning types.

Overall, this paper provides a highly valuable contribution by empirically validating the power of synthetic data for improving LLM reasoning. It opens exciting avenues for constructing more capable and efficient LLMs by decoupling the learning of reasoning skills from the availability of vast, domain-specific real-world data.