DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic
TL;DR Summary
The DEL-ToM framework enhances large language models' performance on Theory-of-Mind tasks through inference-time scaling, using Dynamic Epistemic Logic to structure belief updates. It leverages a Process Belief Model to verify reasoning, showing consistent performance improvements across model scales and benchmarks without retraining.
Abstract
Theory-of-Mind (ToM) tasks pose a unique challenge for large language models (LLMs), which often lack the capability for dynamic logical reasoning. In this work, we propose DEL-ToM, a framework that improves verifiable ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and verifiable dynamic logical reasoning. We use data generated automatically via a DEL simulator to train a verifier, which we call the Process Belief Model (PBM), to score each belief update step. During inference, the PBM evaluates candidate belief traces from the LLM and selects the highest-scoring one. This allows LLMs to allocate extra inference-time compute to yield more transparent reasoning. Experiments across model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision significantly enhances LLMs’ ToM capabilities without retraining. Code is available at https://github.com/joel-wu/DEL-ToM.
In-depth Reading
1. Bibliographic Information
1.1. Title
DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic
1.2. Authors
The paper is authored by Yuheng Wu, Jianwen Xie, Denghui Zhang, and Zhaozhuo Xu.
- Yuheng Wu (Stanford University)
- Jianwen Xie (Lambda, Inc.)
- Denghui Zhang (Stevens Institute of Technology)
- Zhaozhuo Xu (Stevens Institute of Technology)
Their affiliations suggest a collaboration between a prominent academic institution (Stanford), a computing hardware/service provider (Lambda, Inc.), and another academic institution (Stevens Institute of Technology). This blend implies a project with strong theoretical foundations, practical implementation considerations, and access to significant computational resources.
1.3. Journal/Conference
The paper is listed as "Published at (UTC): 2025-01-01T00:00:00.000Z", which indicates an upcoming or placeholder publication date, likely in a conference proceeding or journal. Given the preprint-style source link, it is currently available as a preprint, common for academic works awaiting formal peer review and publication. The subject matter (LLMs, Theory-of-Mind, formal logic) suggests a venue in AI, NLP, or cognitive science, likely a highly reputable conference given the affiliations.
1.4. Publication Year
2025 (based on the provided publication timestamp)
1.5. Abstract
This paper introduces DEL-ToM, a framework designed to enhance the Theory-of-Mind (ToM) reasoning capabilities of Large Language Models (LLMs). The core challenge addressed is LLMs' typical deficiency in dynamic logical reasoning required for ToM tasks. Instead of architectural modifications, DEL-ToM employs an inference-time scaling approach. It achieves this by decomposing ToM tasks into a sequence of belief updates, formally grounded in Dynamic Epistemic Logic (DEL). A key component is the Process Belief Model (PBM), a verifier trained on automatically generated data from a DEL simulator. During inference, the PBM scores candidate belief traces produced by an LLM and selects the highest-scoring one, allowing LLMs to invest more compute at inference time for more transparent and reliable reasoning. Experimental results across various model scales and benchmarks demonstrate that DEL-ToM consistently improves LLM performance on ToM tasks, showcasing the effectiveness of verifiable belief supervision without requiring LLM retraining.
1.6. Original Source Link
/files/papers/691c896125edee2b759f3360/paper.pdf (This is a local file path provided in the prompt, indicating the paper content was extracted from this PDF. Its publication status is likely a preprint currently, with a formal publication date in 2025 as noted in the abstract details).
2. Executive Summary
2.1. Background & Motivation
The paper addresses the significant challenge that Theory-of-Mind (ToM) tasks pose for Large Language Models (LLMs). ToM is the ability to attribute mental states (beliefs, desires, intentions) to oneself and others, crucial for social intelligence. While LLMs have shown nascent ToM abilities, these often scale with model size, making smaller models less capable. More critically, current LLM evaluations for ToM typically only check the final output against ground truth, offering no insight into the reasoning process. This lack of verifiability means it's unclear whether a correct answer stems from genuine understanding or mere lucky guessing, rendering LLM ToM reasoning unreliable and impractical for real-world applications, especially in resource-constrained environments where robust inference of user intentions is critical.
The core problem the paper aims to solve is: How can LLMs perform verifiable ToM reasoning, particularly in low-resource settings, without requiring architectural changes or extensive retraining? The paper's innovative idea is to formalize ToM reasoning as a multi-step dynamic belief-update process using Dynamic Epistemic Logic (DEL) and then apply inference-time scaling to select the most reliable reasoning traces.
2.2. Main Contributions / Findings
The paper makes three primary contributions:
- New Perspective on ToM Reasoning: It proposes viewing ToM reasoning through the lens of process reliability, formalizing it as a multi-step dynamic belief-update process. This perspective enables the application of inference-time scaling to select more reliable belief traces, enhancing transparency and trustworthiness.
- Formalization with DEL and Noise-Free Supervision: The work formalizes ToM reasoning within the Dynamic Epistemic Logic (DEL) framework. Crucially, it constructs a Process Belief Model (PBM) dataset with noise-free supervision, automatically derived from a DEL simulator. This DEL-generated supervision guarantees correctness, a significant advantage over datasets relying on human annotation or LLM assistance for process-level reward modeling. The PBM is then trained to evaluate stepwise reasoning.
- Consistent Performance Improvement Across Scales: The DEL-ToM approach consistently improves LLM performance on standard ToM benchmarks across different model scales and search strategies, demonstrating that verifiable belief supervision significantly enhances LLMs' ToM capabilities without retraining.

Key findings include:
- DEL-ToM consistently boosts ToM accuracy for both open-source (Qwen3, Llama3.2) and closed-source (GPT series) LLMs.
- Smaller LLMs augmented with DEL-ToM can match or even surpass much larger LLMs (e.g., Qwen3-4B+PBM outperforms GPT-4.1).
- Inference-time scaling (particularly Best-of-N with PBM guidance) is crucial for performance gains, while simpler methods like majority voting fail.
- The PBM is robust and generalizes to ToM tasks from out-of-distribution datasets (Kosinski's dataset).
- The quality of the PBM (i.e., its base model) directly correlates with end-task ToM performance.
- DEL-ToM offers a cost-efficient alternative for API-based LLM usage, allowing smaller, cheaper models to achieve higher ToM performance.
- The method is lightweight, efficient to train, and non-invasive, avoiding the computational and optimization challenges associated with Reinforcement Learning (RL)-based fine-tuning methods.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DEL-ToM, a foundational grasp of several key concepts is essential:
3.1.1. Theory-of-Mind (ToM)
Theory-of-Mind (ToM) refers to the cognitive ability to attribute mental states—beliefs, desires, intentions, knowledge, etc.—to oneself and to others, and to understand that others' mental states may differ from one's own. It's a fundamental aspect of social intelligence, enabling individuals to predict and explain the behavior of others. In LLM research, ToM tasks often involve scenarios where an LLM must infer what a character knows or believes, especially when those beliefs are false or outdated relative to the actual state of the world (e.g., the classic "Sally-Anne" test or the "unexpected transfer" task described in the paper).
3.1.2. Large Language Models (LLMs)
Large Language Models (LLMs) are deep learning models, typically based on the transformer architecture, that are trained on vast amounts of text data. They excel at a wide range of natural language processing tasks, including text generation, translation, summarization, and question answering. While LLMs exhibit impressive emergent abilities, their performance on complex logical reasoning, especially dynamic and multi-agent reasoning like ToM, can be inconsistent or lack transparency.
3.1.3. Dynamic Epistemic Logic (DEL)
Dynamic Epistemic Logic (DEL) is a formal logic system used to model knowledge and belief, and how these mental states change in response to events or communication. It extends standard epistemic logic (which models static knowledge/belief) by incorporating dynamic operators that represent actions and their effects on agents' knowledge and beliefs. DEL is rooted in philosophical logic and computer science, providing a rigorous framework for reasoning about information flow in multi-agent systems.
The core components of DEL as utilized in the paper are:
- Epistemic Models: These represent the current state of knowledge/beliefs of agents about the world. They use Kripke's possible-world semantics, where an agent's belief state is represented by a set of "possible worlds" consistent with their current information.
- Event Models: These represent actions or events that occur in the world, along with their preconditions (when the action can happen) and postconditions (how the action changes the world). They also describe what information agents gain or lose from observing these events.
- Product Update: This is the mechanism by which an epistemic model is updated by an event model. It combines the current state of beliefs with the information conveyed by an event to produce a new epistemic model reflecting the updated beliefs of all agents.
3.1.4. Kripke's Possible-World Semantics
This is the foundational semantic framework for modal logic, including epistemic logic. In Kripke semantics, knowledge or belief is represented not by directly stating what an agent knows, but by considering a set of "possible worlds." An agent $i$ believes a proposition $\varphi$ if $\varphi$ is true in all possible worlds that $i$ considers compatible with their current information. An accessibility relation $R_i$ connects worlds that agent $i$ considers possible from a given world: if $w R_i v$, then from world $w$, agent $i$ considers $v$ to be a possible state of affairs.
3.1.5. Inference-Time Scaling
Inference-time scaling refers to techniques that improve LLM performance by allocating more computational resources during the inference phase (when the model is generating an output), rather than through architectural changes or additional training/fine-tuning of the model parameters. This can involve generating multiple candidate outputs, performing complex search procedures, or using external verifiers to select the best output. The goal is to get better results from existing models by spending more compute when generating answers, often trading off latency for quality.
3.1.6. Binary Cross-Entropy Loss
Binary Cross-Entropy (BCE) loss is a common loss function used in binary classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1. For a single prediction, it's calculated as:
$
\mathcal{L}(y, \hat{y}) = - \left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)
$
where:
- $y$ is the true binary label (0 or 1).
- $\hat{y}$ is the predicted probability of the positive class (a value between 0 and 1).

The goal during training is to minimize this loss, pushing $\hat{y}$ closer to $y$.
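As a quick numerical illustration, here is a minimal Python sketch of this loss evaluated at a few made-up values (the function and numbers are illustrative, not from the paper):

```python
import math

def bce(y_true: float, y_pred: float) -> float:
    """Binary cross-entropy for a single prediction."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

print(bce(1.0, 0.9))  # ~0.105: confident and correct, small loss
print(bce(1.0, 0.1))  # ~2.303: confident but wrong, heavily penalized
```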
3.2. Previous Works
The paper contextualizes its work by referencing several lines of previous research:
- Philosophical and Logical Foundations of Epistemic Logic: The authors explicitly cite pioneers like Jaakko Hintikka (Hintikka and B. P. Hintikka, 1989; Hintikka, 1962), Bertrand Russell (Russell and Whitehead, 1910), Ludwig Wittgenstein (Wittgenstein, 1922), Alfred Tarski (Tarski, 1956), Gottlob Frege (Frege, 1879), and Saul Kripke (Kripke, 1963). These works laid the groundwork for formalizing logic, knowledge, and belief, which DEL builds upon.
- Theory-of-Mind in Cognitive Science: The concept of ToM itself is attributed to early work by Premack and Woodruff (1978), Dennett (1978), and Apperly and Butterfill (2009), with its importance in social intelligence highlighted by Baron-Cohen (1991).
- LLMs and ToM Capabilities: Recent studies have explored ToM abilities in LLMs (Strachan et al., 2024; Lin et al., 2024; Street et al., 2024; Amirizaniani et al., 2024; Sclar et al., 2025; Wu et al., 2025; Kosinski, 2024). The paper notes that ToM performance in LLMs follows a scaling law (Kosinski, 2024) and that current evaluations (Chen et al., 2024) often lack verifiability (Ullman, 2023).
- Dynamic Epistemic Logic Applications: DEL itself has a rich history, with key formalizations by Baltag et al. (1998), Van Benthem (2001), Plaza (2007), and Van Ditmarsch et al. (2007), and extensions by Aucher and Schwarzentruber (2013). Earlier cognitive models used DEL to simulate belief change (Bolander and Andersen, 2011), and logic-based simulators provided symbolic supervision for belief updates (Bolander, 2014; Hansen and Bolander, 2020). DEL-ToM builds on this by using DEL not just for modeling but for generating supervision for LLM evaluation.
- Inference-Time Scaling of LLMs: This is a growing area, with two main paradigms:
  - Single-trace scaling: Encourages deeper reasoning within one path, e.g., reinforcement learning methods like GRPO (Guo et al., 2025a; Cheng et al., 2025) or distillation (Li et al., 2025).
  - Multi-trace scaling: Generates multiple reasoning traces and selects the best one, e.g., via voting (Wang et al., 2023, 2025) or external verifiers (Wang et al., 2024; Sun et al., 2024; Guo et al., 2025b; Saad-Falcon et al., 2025), sometimes combined with search algorithms like tree search or beam search (Zhang et al., 2024; Lin et al., 2025). DEL-ToM falls into this multi-trace category, using the PBM as the external verifier.
3.3. Technological Evolution
The field has evolved from early philosophical formalizations of logic and knowledge to the development of epistemic logic and later Dynamic Epistemic Logic (DEL) to model changes in knowledge. Concurrently, LLMs have emerged as powerful tools for language generation and understanding. Initial evaluations of LLM ToM abilities were often heuristic and lacked process-level transparency. This paper represents an evolution towards integrating rigorous formal logic (DEL) with LLM capabilities, aiming to provide verifiable ToM reasoning. It also leverages the recent trend of inference-time scaling to enhance LLM performance without costly retraining.
3.4. Differentiation Analysis
DEL-ToM distinguishes itself from previous work in several key ways:
- Verifiable Reasoning via Formal Logic: Unlike most LLM ToM evaluations that only check final answers or rely on LLM-generated "thoughts" that are hard to verify, DEL-ToM explicitly frames ToM reasoning within the mathematically rigorous framework of DEL. This allows for step-by-step verification of belief updates, grounding the reasoning in formal semantics.
- Noise-Free Process-Level Supervision: A major innovation is the use of a DEL simulator to automatically generate noise-free, process-level labels for training the Process Belief Model (PBM). This contrasts with reward modeling techniques that rely on human annotations or LLM self-supervision, which can introduce noise or biases. The guaranteed correctness of DEL-derived labels is a significant advantage.
- Inference-Time Scaling for Process Reliability: DEL-ToM explicitly applies inference-time scaling techniques (like Best-of-N and beam search) guided by the PBM to select the most reliable belief traces. This is a specific application of inference-time scaling focused on the process of ToM reasoning, not just final-answer accuracy. It enables smaller LLMs to achieve higher ToM capabilities without retraining, offering a practical solution for resource-constrained environments.
- Non-Invasive Enhancement: Unlike Reinforcement Learning (RL)-based fine-tuning methods (e.g., GRPO) that modify LLM parameters, can be computationally expensive, and risk degrading performance on other tasks, DEL-ToM is non-invasive. It acts as an external verifier and reranker, leaving the base LLM unchanged, making it more generalizable and easier to deploy.
4. Methodology
The core methodology of DEL-ToM involves formulating ToM reasoning as a Dynamic Epistemic Logic (DEL) process, training a Process Belief Model (PBM) using DEL-generated ground truth, and then using this PBM to guide inference-time scaling for selecting optimal ToM reasoning traces from LLMs.
4.1. Principles
The fundamental principle behind DEL-ToM is that ToM reasoning can be understood as a sequence of dynamic belief updates, where agents' knowledge and beliefs change in response to observed actions and communications. By formalizing this process with Dynamic Epistemic Logic (DEL), each step of ToM reasoning becomes a verifiable belief update. The paper posits that process reliability (i.e., the correctness of each intermediate step) is key to reliable ToM conclusions. The intuition is that if an LLM can generate multiple potential reasoning paths, an external verifier, trained on formally correct belief updates, can identify the most logically sound path, thereby improving the overall ToM performance without altering the base LLM. This is achieved by allocating additional computation at inference time to evaluate and select these traces.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Formulating ToM Reasoning within DEL
The paper formalizes ToM reasoning using Dynamic Epistemic Logic (DEL), which is built upon Kripke's possible-world semantics.
Epistemic Language:
The epistemic language is defined to express facts and beliefs.
- $P$ is a countable set of atomic propositions (basic facts about the world, e.g., chocolate_in_drawer).
- $A$ is a finite, non-empty set of agents (e.g., John, Mary, Alice).

The language is defined by the following Backus-Naur form:
$
\varphi ::= p \mid \neg \varphi \mid \varphi \land \varphi \mid B_i \varphi
$
where $p \in P$ is an atomic proposition, $i \in A$ is an agent, and $\varphi$ represents a well-formed formula. The formula $B_i \varphi$ is read as "agent $i$ believes $\varphi$." For example, "John believes the chocolate is in the drawer" is expressed as $B_{\mathrm{John}}\,\mathrm{chocolate\_in\_drawer}$.
Definition 1 (Epistemic Model):
An epistemic model represents the knowledge and beliefs of agents at a given moment. Over an agent set $A$ and proposition set $P$, it is a triple $M = (W, R, V)$, where:
- $W$ is a set of possible worlds. Each world in $W$ is a complete assignment of truth values to all atomic propositions in $P$.
- $R = \{R_i\}_{i \in A}$ assigns to each agent $i$ an accessibility relation $R_i \subseteq W \times W$. The notation $w R_i v$ means that from world $w$, agent $i$ considers world $v$ to be a possible state of affairs. These relations capture an agent's uncertainty.
- $V$ is a valuation function that maps each atomic proposition $p$ to the set of worlds $V(p)$ where $p$ is true.

A state is a pointed epistemic model $(M, w)$ where $w$ is the designated actual world.

The satisfaction relation $\models$ is defined for an epistemic model $M$ and a designated world $w$:
- $(M, w) \models p$ iff $w \in V(p)$ (an atomic proposition $p$ is true in world $w$ if $w$ is in the set of worlds where $p$ is true).
- $(M, w) \models B_i \varphi$ iff for all $v$ such that $w R_i v$, we have $(M, v) \models \varphi$ (agent $i$ believes $\varphi$ in world $w$ if $\varphi$ is true in all worlds that $i$ considers possible from $w$).
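To make this concrete, here is a minimal Python sketch of an epistemic model and the satisfaction check for atomic beliefs. The worlds, valuation, and accessibility relations below are made-up illustrations (loosely echoing the paper's chocolate scenario), not the paper's implementation:

```python
# Worlds are strings; V maps each proposition to the set of worlds where it is true;
# R[i] is agent i's accessibility relation as a set of (w, v) pairs.
W = {"w_table", "w_cupboard"}
V = {"table": {"w_table"}, "cupboard": {"w_cupboard"}}
R = {
    "Mary":  {("w_table", "w_table")},      # Mary only considers the table-world possible
    "Alice": {("w_table", "w_cupboard")},   # hypothetical: Alice considers the cupboard-world
}

def holds(p: str, w: str) -> bool:
    """(M, w) |= p  iff  w is in V(p)."""
    return w in V.get(p, set())

def believes(i: str, p: str, w: str) -> bool:
    """(M, w) |= B_i p  iff  p holds in every world i considers possible from w."""
    return all(holds(p, v) for (u, v) in R[i] if u == w)

print(believes("Mary", "table", "w_table"))   # True: Mary believes the chocolate is on the table
print(believes("Alice", "table", "w_table"))  # False: Alice's accessible world has it elsewhere
```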
Definition 2 (Event Model):
An event model describes an action or event and its impact on the world and agents' knowledge. It is a tuple $\mathcal{E} = (E, Q, \mathrm{pre}, \mathrm{post})$, where:
- $E$ is a finite, non-empty set of events.
- $Q = \{Q_i\}_{i \in A}$ assigns to each agent $i$ an indistinguishability relation $Q_i \subseteq E \times E$ over events, indicating which events agent $i$ cannot distinguish from each other.
- $\mathrm{pre}$ assigns to each event $e$ a precondition $\mathrm{pre}(e)$, a formula specifying when $e$ is executable in a given state.
- $\mathrm{post}$ assigns to each event $e$ a postcondition $\mathrm{post}(e)$ describing how the world (atomic propositions) changes if $e$ occurs.

A pointed event model $(\mathcal{E}, e)$ is referred to as an action, where $e$ is the actual event that occurs.
Definition 3 (Product Update):
The product update $\otimes$ is the mechanism for updating an epistemic model based on an event model. Given an initial state $(M, w)$ with $M = (W, R, V)$ and an action $(\mathcal{E}, e)$ with $\mathcal{E} = (E, Q, \mathrm{pre}, \mathrm{post})$, if the precondition $\mathrm{pre}(e)$ is satisfied at $(M, w)$, the product update results in a new state $(M', (w, e))$ with $M' = M \otimes \mathcal{E} = (W', R', V')$:
- $W' = \{(w', e') \mid w' \in W,\ e' \in E,\ (M, w') \models \mathrm{pre}(e')\}$: The new set of possible worlds consists of pairs of original worlds and events that could have occurred in those worlds (i.e., whose preconditions are met).
- For each $i \in A$: $(w', e')\, R'_i\, (v', f')$ iff $w' R_i v'$ and $e' Q_i f'$. The new accessibility relation connects new worlds $(w', e')$ and $(v', f')$ if agent $i$ considered $v'$ possible from $w'$ in the old model AND agent $i$ cannot distinguish event $e'$ from $f'$ in the event model. This captures how agents' uncertainties about the world and about events combine.
- For each $p \in P$: $(w', e') \in V'(p)$ iff $\mathrm{post}(e')$ makes $p$ true, or $\mathrm{post}(e')$ does not mention $p$ and $w' \in V(p)$. The truth of an atomic proposition in a new world $(w', e')$ is determined by the postcondition of event $e'$: if the postcondition explicitly sets $p$, that value is used; otherwise $p$'s truth value is inherited from the original world $w'$.
Application of DEL to ToM Reasoning (Illustrative Example):
The paper provides an example (from Figure 2) to illustrate the DEL update process.
The following figure (Figure 2 from the original paper) shows the actions and belief state transitions in a Theory-of-Mind (ToM) task.
- Scenario (State 4): Mary and Alice are present, and the chocolate is on the table, so the proposition $\mathrm{table}$ is true. Their accessibility relations mean they both know the chocolate is on the table (they only consider $\mathrm{table}$-worlds possible).
- Action 5: Mary exits the kitchen. Here $\mathrm{pre}(e_5) = \top$, and the postcondition leaves $\mathrm{table}$ unchanged: the chocolate stays on the table, but Mary will not observe subsequent actions.
- Action 6: Alice moves the chocolate to the cupboard. Here $\mathrm{pre}(e_6) = \top$ and $\mathrm{post}(e_6) = \mathrm{cupboard} \land \neg \mathrm{table}$, so the chocolate is now in the cupboard. After the product update, the actual state satisfies $\mathrm{cupboard}$.
- Belief Update: Alice's new accessibility relation points to worlds where the chocolate is in the cupboard, reflecting her new knowledge. Mary's relation still points to worlds where the chocolate is on the table, because she did not observe the move.
- Nested Belief Example: The paper then shows how to derive Mary's belief about Alice's belief: Mary believes that Alice believes the chocolate is on the table ($B_{\mathrm{Mary}} B_{\mathrm{Alice}}\,\mathrm{table}$). This is because in all worlds Mary considers possible (worlds where the chocolate is still on the table), Alice believes the chocolate is on the table.
4.2.2. Building the PBM with DEL
The Process Belief Model (PBM) is a verifier that scores each belief update step.
Generating Process-Level Labels via DEL:
To train the PBM, the authors integrate a DEL simulator into the Hi-ToM generators (Wu et al., 2023). This system automatically synthesizes 20,000 ToM stories. For each story, the DEL simulator produces process-level traces across different belief orders. At each action in a story, the simulator updates the accessibility relations based on the action's semantics (e.g., whether observation is public or private). It records the resulting belief state in the trace set. The crucial aspect here is that these DEL-generated labels are guaranteed to be noise-free and correct because they are derived from a formal logic system.
Dataset Assembly:
For each synthesized story, GPT-4o-mini (Hurst et al., 2024) is prompted to generate step-by-step belief updates in a DEL format (the specific prompt is detailed in Appendix A). Each LLM-generated trace is then paired with the corresponding DEL-generated per-step labels. This pairing creates training instances that provide both positive (correct belief updates) and negative (incorrect belief updates) supervision for process-level reward modeling.
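A minimal sketch of this pairing step is shown below. The parsing and matching logic here is my own illustration of the idea (comparing each LLM step's asserted belief against the simulator's gold state); the paper's actual assembly pipeline may differ:

```python
import re

def parse_belief(step_text: str) -> str:
    """Extract the asserted location, e.g. '... in [green_bucket]' -> 'green_bucket'."""
    m = re.search(r"\[([^\]]+)\]", step_text)
    return m.group(1) if m else ""

def label_trace(llm_steps, gold_locations):
    """Pair each LLM-generated step with a binary label from the DEL simulator."""
    return [{"step": s, "label": int(parse_belief(s) == g)}
            for s, g in zip(llm_steps, gold_locations)]

# Hypothetical steps in the Appendix-A trace format:
steps = ["Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]",
         "Owen thinks Liam thinks Chloe thinks the celery is in [red_bathtub]"]
print(label_trace(steps, ["green_bucket", "green_bucket"]))
# [{'step': ..., 'label': 1}, {'step': ..., 'label': 0}]
```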
Training the PBM:
The PBM is conceptualized as a scoring function $f$. This function takes a ToM problem $p$ and a belief update step $s_i$ (from a GPT-4o-mini-generated trace $t$) and assigns it a score $f(s_i) \in [0, 1]$.

The training of the PBM is framed as a binary classification task. Each step in an LLM trace is labeled as either correct (1) or incorrect (0) according to the DEL-generated gold labels. The model is trained to predict these binary labels using binary cross-entropy loss:
$
\mathcal{L}_{\mathrm{PBM}} = - \sum_{i=1}^{K} y_{s_i} \log f(s_i) - \sum_{i=1}^{K} (1 - y_{s_i}) \log (1 - f(s_i))
$
where:
- $K$ is the total number of steps in a belief trace.
- $y_{s_i}$ is the binary label (0 or 1) for step $s_i$ (1 if correct, 0 if incorrect, as provided by the DEL simulator).
- $f(s_i)$ is the predicted score (a probability between 0 and 1) for step $s_i$ by the PBM.

Training aims to minimize this loss, so the PBM learns to assign high scores (close to 1) to correct steps and low scores (close to 0) to incorrect steps. The training code used is adapted from the RLHF-Reward-Modeling codebase.
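Here is a toy PyTorch sketch of one optimization step under this objective. The real PBM is a fine-tuned Llama3.1-8B-Instruct reward model scoring (problem, step) text; the tiny head and random features below are stand-ins for illustration only:

```python
import torch
import torch.nn as nn

# Stand-in scorer: maps a (problem, step) feature vector to one logit per step.
pbm_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(pbm_head.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss(reduction="sum")  # matches the summed form of L_PBM

step_features = torch.randn(8, 128)                 # K = 8 steps of one belief trace
labels = torch.tensor([1, 1, 1, 0, 1, 0, 1, 1.0])   # DEL-simulator gold labels

logits = pbm_head(step_features).squeeze(-1)        # one score logit per step
loss = bce(logits, labels)                          # L_PBM over the K steps
loss.backward()
optimizer.step()
print(float(loss))
```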
4.2.3. Inference-Time Scaling Pipeline
After the PBM is trained, it is used during LLM inference to guide the selection of reliable ToM reasoning traces. The paper explores two main strategies: Beam Search and Best-of-N (BoN).
Beam Search:
Beam search is a decoding method that maintains multiple partial belief traces.
The process is as follows (illustrated in Figure 1 from the original paper, a schematic representation of the pipeline):
The following figure (Figure 1 from the original paper) is a schematic diagram that illustrates the belief state update process in Theory-of-Mind (ToM) tasks under Dynamic Epistemic Logic. It shows the actions of John, Mary, and Alice, along with Mary's inference about the chocolate's location and the evaluation of belief candidates using the Process Belief Model (PBM) to determine the highest reward score.
- Initialize: Start with $k$ beams, each representing a partial reasoning trace. The LLM samples the $k$ initial candidate first-step updates.
- Expand: At each subsequent action in the ToM story, the LLM (observing the trace so far) proposes multiple candidate belief updates for the current state of each of the $k$ active beams, yielding a pool of potential partial paths.
- Score: The PBM evaluates each candidate path based on the score of its most recent step.
- Retain: Only the top-$k$ highest-scoring paths are retained for the next iteration.
- Iterate: This expand-score-retain process repeats until all actions in the story are processed, or a maximum depth/sequence length is reached.
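A minimal Python sketch of this expand-score-retain loop follows. `llm_propose` and `pbm_score` are hypothetical stand-ins for the generator and the verifier; they are assumptions for illustration, not the paper's API:

```python
def pbm_guided_beam_search(story_actions, llm_propose, pbm_score, k=4, m=4):
    """Keep the k partial belief traces whose most recent step the PBM scores highest.
    llm_propose(trace, action) -> list[str] returns candidate next steps;
    pbm_score(step) -> float returns the verifier's score for one step."""
    beams = [[]]  # start from the empty trace
    for action in story_actions:
        candidates = []
        for trace in beams:
            for step in llm_propose(trace, action)[:m]:  # m candidate updates per beam
                candidates.append(trace + [step])
        # Score each extended trace by its newest step and retain the top k.
        candidates.sort(key=lambda t: pbm_score(t[-1]), reverse=True)
        beams = candidates[:k]
    return beams[0]  # highest-scoring complete trace
```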
Best-of-N (BoN):
Best-of-N (BoN) is an alternative, simpler strategy where the LLM generates complete belief traces after reading the entire ToM story.
- Generate Traces: The LLM generates $N$ independent, complete belief traces.
- Step-wise Scoring: The PBM scores each individual step within all $N$ traces.
- Aggregate Scores: The step-wise scores for each trace are aggregated into a single process-level reward (a trace-level score). The paper explores several aggregation rules:
  - Last: Uses only the PBM score of the final step in the trace.
  - Min: Uses the lowest PBM score across all steps in the trace (a conservative measure, assuming the weakest link determines trace quality).
  - Avg: Uses the average PBM score across all steps in the trace.
  - Prod: Multiplies the PBM scores of all steps in the trace (highly sensitive to low scores, as a single low score can drastically reduce the product).
  - Majority: A baseline comparison in which the final answer is selected by simple majority voting across the $N$ traces, without using the PBM scores. This is used to highlight the PBM's value.
- Rerank and Select: Based on the aggregated trace-level scores, the candidate traces are reranked. Two ranking strategies are considered:
  - Vanilla BoN: Selects the single trace with the highest aggregated PBM score as the final output.
  - Weighted BoN: Groups traces by their final answers. Let $\mathcal{V}$ be the set of unique final answers proposed by the $N$ traces. For each unique answer, the PBM scores of all traces that yield that answer are summed, and the final answer is chosen as the one with the highest total sum of PBM scores:
$
\boldsymbol{\hat{y}} = \arg \max_{\boldsymbol{y} \in \mathcal{V}} \sum_{i=1}^N \mathbb{1}(\boldsymbol{y}_i = \boldsymbol{y}) \cdot \mathrm{PBM}(\boldsymbol{p}, t_i)
$
where:
  - $\boldsymbol{\hat{y}}$ is the selected final answer.
  - $\boldsymbol{y}$ represents a unique candidate final answer.
  - $N$ is the total number of sampled traces.
  - $\mathbb{1}(\boldsymbol{y}_i = \boldsymbol{y})$ is an indicator function that equals 1 if the final answer $\boldsymbol{y}_i$ of trace $t_i$ matches the candidate answer $\boldsymbol{y}$, and 0 otherwise.
  - $\mathrm{PBM}(\boldsymbol{p}, t_i)$ is the aggregated PBM score for trace $t_i$ given problem $\boldsymbol{p}$.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on two datasets to evaluate the DEL-ToM framework.
5.1.1. Hi-ToM
- Source: Hi-ToM (Wu et al., 2023) is a benchmark specifically designed for evaluating higher-order Theory-of-Mind reasoning in LLMs.
- Characteristics: For this study, only one-chapter stories from Hi-ToM are evaluated. Hi-ToM stories are synthetically generated and include various levels of belief orders (0-th to 4-th order), which represent increasing depth of nested beliefs (e.g., "John believes X" is 1st order, "Mary believes John believes X" is 2nd order, etc.). The DEL simulator is integrated into the Hi-ToM generators to produce the noise-free process labels for PBM training.
- Domain: Synthetic ToM scenarios involving multiple agents and dynamic information changes.
- Purpose: To evaluate the effectiveness of DEL-ToM on a dataset specifically designed to probe ToM capabilities, and on which the PBM is trained using data of a similar generation style.
5.1.2. Kosinski's Dataset
- Source: The ToM tasks introduced by Kosinski (2024).
- Characteristics: This dataset contains hand-written scenarios with false-belief and true-belief controls. For the experiments, evaluation is restricted to the unexpected transfer task. This task typically involves an object being moved without a protagonist's knowledge; the model must infer where the protagonist will look based on their outdated belief.
- Domain: Hand-crafted ToM scenarios.
- Purpose: To test the generalization capabilities of the PBM and the DEL-ToM framework on out-of-distribution (OOD) ToM data, as the PBM was trained on Hi-ToM-style synthetic data. This assesses whether the PBM acts as a genuine verifier rather than overfitting to the training distribution.
5.1.3. Data Sample Example (from Appendix A)
The paper provides a detailed example of a ToM story and its expected step-by-step trace format in Appendix A. This helps illustrate the type of data the models process and the format of the belief updates they are expected to generate.
Here is a story that unfolds in chronological order.
You will be asked a question about the story, which may involve either: (1) Locating an object, or (2) Inferring an agent's mental state (e.g., what A thinks B thinks C thinks).
To solve it, think step-by-step. At each step, repeat the current line from the story, then explain its effect on beliefs. Use [Null] if someone does not yet have knowledge. If a belief chain cannot be formed (e.g., some agent exited too early), freeze belief at the last available step.
In public or private communication:
- The speaker believes the listener will believe the claim. If the listener exited the room earlier than the speaker, they will believe it.
If the question is zero-order (e.g., "Where is X really?"), then in each step, only track the actual location of the object (e.g., "X is in [Y]"). You do not need to track nested beliefs.
Here is an example:
1 Amelia, Chloe, Liam, Owen and Benjamin entered the TV_room.
2 The celery is in the red_envelope.
3 Amelia made no movements and stayed in the TV_room for 1 minute.
4 Chloe lost his watch.
5 Amelia exited the TV_room.
6 Chloe moved the celery to the green_bucket.
7 Chloe exited the TV_room.
8 Liam moved the celery to the red_bathtub.
9 Liam exited the TV_room.
10 Owen made no movements and stayed in the Tv_room for 1 minute.
11 Owen exited the TV_room.
12 Benjamin made no movements and stayed in the TV_room for 1 minute.
13 Benjamin exited the TV_room.
14 Amelia, Chloe, Liam, Owen and Benjamin entered the waiting_room.
15 Liam publicly claimed that celery is in the white_bathtub now.
16 Benjamin privately told Liam that the celery is in the blue_drawer now.
Step 3
Amelia made no movements and stayed in the TV_room for 1 minute.
No effect.
Owen thinks Liam thinks Chloe thinks the celery is in [red_envelope]
Step 4
Chloe lost his watch.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [red_envelope]
Step 5
Amelia exited the TV_room.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [red_envelope]
Step 6
Chloe moved the celery to the green_bucket.
Only Chloe, Liam, Owen, Benjamin are present. They all see this move. Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 7
Chloe exited the TV_room.
Chloe's belief frozen; still [green_bucket]
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 8
Liam moved the celery to the red_bathtub.
Only Liam, Owen, Benjamin present. They observe the move. Chloe not present, so her belief unchanged.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 9
Liam exited the TV_room.
No change.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 10
Owen made no movements and stayed in the Tv_room for 1 minute.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 11
Owen exited the TV_room.
Owen's belief frozen.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 12
Benjamin made no movements and stayed in the TV_room for 1 minute.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 13
Benjamin exited the TV_room.
No change.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 14
Everyone entered the waiting_room.
No effect on beliefs.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 15
Liam publicly claimed that celery is in the white_bathtub now.
Owen hears this statement. However, public speech only affects first- and second-order beliefs (e.g., what Liam believes, what Owen thinks Liam believes, and what Liam thinks Owen believes). It does not change Owen's belief about what Liam thinks Chloe thinks.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]
Step 16
Benjamin privately told Liam that the celery is in the blue_drawer now.
Owen does not hear this, but more importantly, private communication only affects beliefs between the speaker and the listener. It can change what Liam believes (based on exit order), or what Liam thinks Benjamin believes (based on exit order), or what Benjamin thinks Liam believes (always changes) - but it cannot affect higher-order beliefs. So this does not change Owen's belief about what Liam thinks Chloe thinks.
Give a step-by-step trace as in the example. Then, give the final answer in one line like:
Final Answer: [your choice]
5.2. Evaluation Metrics
The primary evaluation metric reported in the paper is final answer accuracy.
- Conceptual Definition:
Accuracymeasures the proportion of correctly predicted instances out of the total number of instances. In this context, it refers to the percentage ofToMquestions for which theLLM(afterDEL-ToMintervention) provides the correct final answer, matching the ground truth. It directly reflects the model's ability to arrive at the correctToMinference. - Mathematical Formula: For a set of predictions and ground truths: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation:
Number of Correct Predictions: The count ofToMtasks where the model's final answer matches the ground truth.Total Number of Predictions: The total number ofToMtasks evaluated.
5.3. Baselines
The paper compares its DEL-ToM approach against a wide range of LLMs, both open-source and closed-source, acting as baselines. These baselines are evaluated in their vanilla (original) settings, without the PBM-guided inference-time scaling.
- Open-source Models:
- Qwen3 series: Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B (Yang et al., 2025).
- Llama3.2 series: Llama3.2-1B, Llama3.2-3B (Grattafiori et al., 2024).
- Other large open-source models: Qwen3-235B-A22B (Yang et al., 2025), DeepSeek-V3 (Liu et al., 2024), OLMo-2-0325-32B (Walsh et al., 2025).
- Closed-source Models (API-based):
- gpt-4.1
- gpt-4o
- gpt-4.1-mini
- gpt-4o-mini
- Additional baselines for comparison: o4-mini, gpt-4.1-nano.

These baselines are representative as they cover a wide spectrum of LLM sizes and capabilities, including state-of-the-art models from major developers. This allows for a comprehensive assessment of DEL-ToM's ability to improve both smaller, resource-efficient models and larger, more powerful ones.
5.4. Platform and PBM Training
- Platform: All experiments are conducted on a single NVIDIA GH200 GPU node. The vLLM framework (Kwon et al., 2023) is used for efficient batched inference and large-scale decoding, which is crucial for handling the multiple trace generations required by inference-time scaling.
- PBM Training: The Process Belief Model (PBM) is fine-tuned from a Llama3.1-8B-Instruct model (Grattafiori et al., 2024). It is trained for only 1 epoch on the 20,000 synthesized stories from the Hi-ToM generators, which provide the DEL-derived process labels. The training code is adapted from the RLHF-Reward-Modeling codebase.
5.5. Prompt Format
All models are evaluated using a consistent prompting format, details of which are provided in Appendix A of the paper. This ensures that any performance differences are due to the DEL-ToM intervention or model capabilities, rather than variations in prompt engineering. The example in Appendix A shows a highly structured prompt that guides the LLM to produce step-by-step reasoning in a specific format.
6. Results & Analysis
6.1. Core Results Analysis
The experiments consistently demonstrate that DEL-ToM significantly enhances LLMs' Theory-of-Mind (ToM) reasoning capabilities across various model scales and benchmarks.
6.1.1. Main Results on Hi-ToM Dataset (Table 1)
The following are the results from Table 1 of the original paper:
| Model | 0-th Order | 1-th Order | 2-th Order | 3-th Order | 4-th Order | Average | ||||||
| Ori | +PBM | Ori | +PBM | Ori | +PBM | Ori | +PBM | Ori | +PBM | Ori | +PBM | |
| BoN (N = 1024) | ||||||||||||
| Qwen3-4B | 100.0 | 100.0 | 79.8 | 85.0 | 79.3 | 90.0 | 70.2 | 82.5 | 46.0 | 65.0 | 75.1 | 84.5 |
| Qwen3-1.7B | 78.0 | 82.5 | 59.7 | 65.0 | 45.2 | 55.0 | 47.0 | 62.5 | 47.8 | 57.5 | 55.5 | 64.5 |
| Qwen3-0.6B | 69.2 | 80.0 | 52.0 | 72.5 | 35.0 | 47.5 | 31.5 | 52.5 | 34.0 | 47.5 | 44.3 | 60.0 |
| Llama3.2-3B | 68.2 | 85.0 | 52.0 | 80.0 | 43.2 | 82.5 | 37.0 | 82.5 | 36.8 | 75.0 | 47.4 | 81.0 |
| Llama3.2-1B | 41.5 | 46.2 | 40.0 | 53.8 | 28.5 | 61.5 | 41.5 | 84.6 | 29.2 | 58.3 | 36.1 | 60.9 |
| BoN (N = 4) | ||||||||||||
| gpt-4.1 | 95.0 | 97.5 | 85.0 | 87.5 | 85.0 | 92.5 | 82.5 | 95.0 | 70.0 | 77.5 | 83.5 | 90.0 |
| gpt-4.1-mini | 77.5 | 70.0 | 90.0 | 85.0 | 70.0 | 75.0 | 75.0 | 92.5 | 77.5 | 92.5 | 78.0 | 83.0 |
| gpt-4o | 100.0 | 100.0 | 85.0 | 90.0 | 82.5 | 92.5 | 90.0 | 97.5 | 77.5 | 85.0 | 87.0 | 93.0 |
| gpt-4o-mini | 90.0 | 100.0 | 75.0 | 87.5 | 77.5 | 95.0 | 77.5 | 100.0 | 55.0 | 95.0 | 75.0 | 93.5 |
| Beam Search (N = 256) | ||||||||||||
| Qwen3-8B | 96.5 | 80.0 | 53.3 | 80.0 | 38.8 | 85.0 | 55.8 | 95.0 | 57.8 | 95.0 | 60.4 | 87.0 |
| Qwen3-4B | 100.0 | 100.0 | 79.8 | 85.0 | 79.3 | 97.5 | 70.2 | 82.5 | 46.0 | 60.0 | 75.1 | 85.0 |
The table clearly shows that incorporating the PBM (+PBM) consistently improves ToM reasoning across both Best-of-N (BoN) and beam search, for all tested open-source and closed-source models.
- Significant Gains for Smaller Models: For instance, Llama3.2-3B sees an impressive average accuracy gain of 33.6 points (from 47.4 to 81.0) with +PBM. Qwen3-0.6B improves by 15.7 points (from 44.3 to 60.0). These are substantial improvements for models that typically struggle with higher-order ToM.
- Enhancement for Larger Models: Even powerful models like gpt-4.1 and gpt-4o benefit, showing average gains of 6.5 points (from 83.5 to 90.0) and 6.0 points (from 87.0 to 93.0) respectively with +PBM. This indicates the PBM provides valuable guidance even for highly capable LLMs.
- Unlocking Latent Abilities: Qwen3-8B, which underperforms Qwen3-4B in its baseline (Ori) setting (60.4 vs 75.1 average accuracy), achieves the highest accuracy among all Qwen3 variants (87.0) when guided by the PBM using beam search. This suggests the PBM can "unlock" higher-order reasoning capabilities that are latent or poorly expressed in base models.
- Performance across Belief Orders: Improvements are observed across all belief orders, from 0-th to 4-th, with the most dramatic gains often seen in higher-order ToM tasks, which are inherently more challenging. For example, Llama3.2-3B jumps from 36.8 to 75.0 on 4-th order ToM.
6.1.2. Comparison with SOTA LLMs (Table 2)
The following are the results from Table 2 of the original paper:
| Model | 0-th | 1-th | 2-th | 3-th | 4-th | Avg. |
| o4-mini | 97.5 | 95.0 | 77.5 | 87.5 | 85.0 | 88.5 |
| gpt-4o | 100.0 | 85.0 | 82.5 | 90.0 | 77.5 | 87.0 |
| Qwen3-4B+PBM | 100.0 | 85.0 | 90.0 | 82.5 | 65.0 | 84.5 |
| Qwen3-235B-A22B | 100.0 | 75.0 | 85.0 | 85.0 | 75.0 | 84.0 |
| gpt-4.1 | 95.0 | 85.0 | 85.0 | 82.5 | 70.0 | 83.5 |
| DeepSeek-V3 | 100.0 | 80.0 | 90.0 | 70.0 | 72.5 | 82.5 |
| Llama3.2-3B+PBM | 85.0 | 80.0 | 82.5 | 82.5 | 75.0 | 81.0 |
| gpt-4.1-mini | 77.5 | 90.0 | 70.0 | 75.0 | 77.5 | 78.0 |
| gpt-4o-mini | 90.0 | 75.0 | 77.5 | 77.5 | 55.0 | 75.0 |
| Qwen3-1.7B+PBM | 82.5 | 65.0 | 55.0 | 62.5 | 57.5 | 64.5 |
| OLMo-32B | 77.5 | 60.0 | 60.0 | 65.0 | 52.5 | 63.0 |
| Llama3.2-1B+PBM | 46.2 | 53.8 | 61.5 | 84.6 | 58.3 | 60.9 |
| Qwen3-0.6B+PBM | 80.0 | 72.5 | 47.5 | 52.5 | 47.5 | 60.0 |
| gpt-4.1-nano | 22.5 | 32.5 | 42.5 | 27.5 | 30.0 | 31.0 |
This comparison highlights the practical impact of DEL-ToM.
- Small Models Rival Large Models: Qwen3-4B+PBM achieves an average accuracy of 84.5, higher than gpt-4.1 (83.5), DeepSeek-V3 (82.5), and OLMo-32B (63.0). Similarly, Llama3.2-3B+PBM performs on par with or better than gpt-4.1-mini (81.0 vs 78.0 average).
- These results underscore the effectiveness of the PBM in scaling ToM reasoning, demonstrating that with appropriate guidance, smaller, more deployment-friendly models can achieve performance levels previously associated only with much larger or closed-source LLMs.
6.1.3. Scaling Test-Time Compute for ToM Reasoning (Figure 3)
The following figure (Figure 3 from the original paper) shows the accuracy of Vanilla and Weighted Best-of-N decoding strategies on the Hi-ToM dataset using Qwen3-4B. It illustrates the accuracy under different budgets $N$ (the number of sampled traces), with (a) for the Vanilla strategy and (b) for the Weighted strategy.
Figure 3 illustrates a critical finding: simply increasing the number of sampled belief traces $N$ only improves ToM performance when guided by the PBM.
- PBM's Role in Scaling: Without PBM guidance, increasing $N$ (e.g., using majority voting or unguided BoN) does not reliably lead to performance gains. This confirms the necessity of process-level supervision for effective inference-time scaling in ToM tasks.
- Aggregation Strategies: The min and prod aggregation rules for trace-level scores proved the most reliable, especially under weighted aggregation. Because ToM reasoning is sequential, an error in any intermediate step can invalidate the final conclusion; min and prod are more sensitive to such errors, assigning low scores to traces with any incorrect step. Conversely, avg and last often degrade under weighted aggregation, as they can overlook crucial errors in earlier steps.
- Failure of Majority Voting: Majority voting consistently fails to improve accuracy. The appendix provides a theoretical analysis: majority voting is vulnerable to vote dilution, where if the probability of a correct trace is low, bad traces can "cluster" on wrong answers and outvote the correct ones. This highlights that ToM requires evaluating intermediate belief states rather than just aggregating final answers.
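A toy Monte Carlo sketch of the vote-dilution intuition is shown below. The probabilities and counts are made-up illustrations, not the paper's appendix analysis:

```python
import random
random.seed(0)

def majority_accuracy(p_correct=0.3, n_traces=64, n_wrong_answers=1, trials=2000):
    """Probability that plain majority voting returns the correct answer when
    each sampled trace is independently correct with probability p_correct."""
    wins = 0
    for _ in range(trials):
        votes = ["correct" if random.random() < p_correct
                 else f"wrong_{random.randrange(n_wrong_answers)}"
                 for _ in range(n_traces)]
        wins += max(set(votes), key=votes.count) == "correct"
    return wins / trials

# Wrong traces clustering on a single answer outvote the rare correct ones:
print(majority_accuracy(n_wrong_answers=1))   # ~0.0
# Only if wrong answers happened to split across many options could majority recover:
print(majority_accuracy(n_wrong_answers=8))   # much higher
```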
6.1.4. BoN vs. Beam Search
The experiments indicate that both BoN and beam search achieve comparable accuracy improvements. However, BoN is recommended as the preferred method because:
- Beam search often fails on smaller or weaker models due to their inability to reliably produce valid intermediate reasoning steps, making PBM evaluation infeasible.
- BoN generates full belief traces in one shot. The PBM can still be effective even if some steps are noisy, as it evaluates complete traces.
- BoN can leverage high-throughput backends like vLLM to efficiently produce large candidate sets, making it practical for deployment.
6.1.5. Results on Out-of-Distribution ToM Data (Table 3)
The following are the results from Table 3 of the original paper:
| Model | False Belief | Informed Protagonist | No Transfer | Present Protagonist | Average | |||||
| Ori | +PBM | Ori | +PBM | Ori | +PBM | Ori | +PBM | Ori | +PBM | |
| Qwen3-8B | 83.3 | 87.5 | 83.8 | 85.0 | 92.8 | 97.5 | 79.5 | 85.0 | 84.8 | 88.8 |
| Qwen3-4B | 70.2 | 80.0 | 86.2 | 90.0 | 93.2 | 95.0 | 88.0 | 92.5 | 84.4 | 89.4 |
| Qwen3-1.7B | 18.2 | 35.0 | 15.5 | 37.5 | 24.8 | 60.0 | 13.8 | 30.0 | 18.1 | 40.6 |
| Qwen3-0.6B | 14.5 | 12.5 | 23.5 | 30.0 | 25.0 | 35.0 | 21.0 | 32.5 | 21.0 | 27.5 |
The results on Kosinski's dataset (an out-of-distribution, hand-written dataset) confirm the generalization capability of DEL-ToM.
- The PBM consistently improves accuracy across all Qwen3 models, for all condition types (False Belief, Informed Protagonist, No Transfer, Present Protagonist).
- For instance, Qwen3-4B's average accuracy increases from 84.4 to 89.4. Even the smallest Qwen3-0.6B model sees an average gain from 21.0 to 27.5.
- This demonstrates that the PBM functions as a genuine verifier of ToM reasoning, rather than simply overfitting to the synthetic Hi-ToM training distribution. This robustness is crucial for real-world applicability.
6.1.6. Benchmarking the PBM (Table 4)
The following are the results from Table 4 of the original paper:
| PBM | 0-th | 1-th | 2-th | 3-th | 4-th | Avg. |
| Llama3.1-8B | 99.2 | 94.6 | 89.0 | 87.0 | 79.9 | 90.0 |
| Llama3.2-3B | 99.1 | 91.9 | 84.9 | 83.8 | 73.8 | 86.7 |
Evaluating the PBM's standalone reliability on a held-out test set reveals:
- PBM Accuracy: The Llama3.1-8B-based PBM achieves an average step-level classification accuracy of 90.0%, with higher accuracy for lower belief orders (e.g., 99.2% for 0-th order) and decreasing accuracy for higher orders (e.g., 79.9% for 4-th order).
- Impact of PBM Base Model Size: A larger base model for the PBM (Llama3.1-8B vs Llama3.2-3B) consistently yields higher PBM accuracy, particularly for higher belief orders. This suggests that stronger models are better at verifying reasoning steps, and that evaluating deeper recursive beliefs is inherently more challenging for the verifier itself.
6.1.7. Impact of PBM Quality on Task Accuracy (Table 5)
The following are the results from Table 5 of the original paper:
| Model+PBM | 0-th | 1-th | 2-th | 3-th | 4-th | Avg. |
| Qwen3-4B + 8B | 100.0 | 85.0 | 90.0 | 82.5 | 65.0 | 84.5 |
| Qwen3-4B + 3B | 100.0 | 77.5 | 77.5 | 72.5 | 47.5 | 75.0 |
| Qwen3-1.7B + 8B | 82.5 | 65.0 | 55.0 | 62.5 | 57.5 | 64.5 |
| Qwen3-1.7B + 3B | 82.5 | 60.0 | 45.0 | 47.5 | 50.0 | 57.0 |
| Qwen3-0.6B + 8B | 80.0 | 72.5 | 47.5 | 52.5 | 47.5 | 60.0 |
| Qwen3-0.6B + 3B | 77.5 | 55.0 | 27.5 | 35.0 | 32.5 | 45.5 |
This table directly links PBM quality to end-task performance. Replacing the stronger Llama3.1-8B-based PBM (+ 8B) with the weaker Llama3.2-3B-based PBM (+ 3B) consistently reduces accuracy across all base models and belief orders. This confirms a clear relationship: a stronger PBM (better verifier) leads to better inference-time scaling outcomes and thus higher ToM task performance.
6.1.8. Cost Efficiency for API-based Usage (Table 6)
The following are the results from Table 6 of the original paper:
| Model | Input | Cached Input | Output | Total |
| gpt-4.1 | $2.00 | $0.50 | $8.00 | $10.50 |
| gpt-4.1-mini | $0.40 | $0.10 | $1.60 | $2.10 |
| gpt-4o | $2.50 | $1.25 | $10.00 | $13.75 |
| gpt-4o-mini | $0.15 | $0.075 | $0.60 | $0.825 |

All prices are in USD per million tokens.
DEL-ToM offers a cost-efficient solution for API-based LLM usage:
- While gpt-4.1-mini with DEL-ToM approaches the performance of gpt-4.1, and gpt-4o-mini with DEL-ToM can surpass gpt-4o, the mini models remain significantly cheaper per million tokens.
- For example, gpt-4o-mini's total cost is $0.825 per million tokens, compared to gpt-4o's $13.75. Even when sampling $N$ outputs, the mini models are more cost-effective.
- Crucially, the input cost for BoN is paid only once (as all samples share the same input prompt), and only the output tokens scale with $N$. This makes PBM-guided small-batch inference-time scaling a more economical alternative to using larger, more expensive LLMs.
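A back-of-envelope Python sketch of this cost structure, using the Table 6 prices but made-up token counts (the shared-prefix billing assumption follows the point above):

```python
def bon_cost(n, input_tokens, output_tokens, in_price, out_price):
    """USD cost of Best-of-N: the prompt is billed once (shared prefix),
    while each of the N sampled traces pays for its own output tokens."""
    m = 1e6  # prices are per million tokens
    return (input_tokens / m) * in_price + n * (output_tokens / m) * out_price

# gpt-4o-mini with N = 4 samples vs a single gpt-4o call (hypothetical token counts):
print(bon_cost(n=4, input_tokens=1500, output_tokens=600, in_price=0.15, out_price=0.60))
print(bon_cost(n=1, input_tokens=1500, output_tokens=600, in_price=2.50, out_price=10.00))
# ~0.0017 vs ~0.0098: BoN on the mini model is still several times cheaper.
```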
6.1.9. Scaling with Model Size (Figure 4)
The following figure (Figure 4 from the original paper) shows the scaling trend of average accuracy before and after applying the PBM across different LLMs on Hi-ToM. "Ori" denotes baseline accuracy; "+PBM" denotes accuracy with PBM-guided inference-time scaling.
Figure 4 demonstrates how PBM influences the scaling trend of ToM accuracy with model size.
- Consistent Improvement and Stronger Scaling: The PBM consistently improves performance across all LLM sizes. For Llama3.2 models, the accuracy curve becomes steeper when the PBM is applied, suggesting that larger models benefit more and generalize better from this inference-time intervention.
- Unlocking Latent Potential: The observation that Qwen3-8B performs worse than Qwen3-4B in its vanilla setting but becomes the best-performing variant after PBM application is significant. This indicates that the PBM not only boosts accuracy but can also unlock higher-order reasoning abilities that are present but not effectively utilized by the base model on its own.
6.2. Qualitative Analysis of PBM Behavior
The paper provides an example of PBM behavior on reasoning traces, highlighting both success and failure modes:
- Success Case (Irrelevant Statement): When "Elizabeth likes the red_box" occurs, the PBM correctly predicts the step as correct, because the statement is irrelevant to the asparagus's location and no beliefs about the asparagus should change. The PBM captures this invariance.
- Failure Case (Nested, Perspective-Sensitive Update): When "Elizabeth moved the asparagus to the green_bucket" occurs while Alexander is not present, the PBM predicts the step as correct, but the ground truth is incorrect. The correct reasoning is that Alexander's beliefs (as perceived by Charlotte) should not change, since he did not observe the action. The PBM overgeneralizes the belief update based on partial presence.

This qualitative analysis reveals that while the PBM effectively handles simple cases, it can struggle with the nuances of nested, perspective-sensitive updates in multi-agent reasoning. This points to a key challenge in verifying complex ToM processes, especially when LLMs might not correctly parse the conditions for belief change in higher-order scenarios.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces DEL-ToM, an innovative framework that substantially enhances Theory-of-Mind (ToM) reasoning in Large Language Models (LLMs) by leveraging inference-time scaling. By formally grounding ToM tasks in Dynamic Epistemic Logic (DEL), DEL-ToM decomposes complex reasoning into verifiable belief updates. A Process Belief Model (PBM), trained on noise-free supervision generated by a DEL simulator, then scores and selects the most reliable reasoning traces from LLMs during inference. Experiments across diverse LLMs and ToM benchmarks demonstrate consistent performance improvements, showing that DEL-ToM significantly boosts ToM capabilities without requiring LLM retraining. This approach provides a practical, cost-effective, and transparent method for achieving robust ToM reasoning, making ToM-capable LLMs more viable for deployment, particularly in resource-constrained environments.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Dependence on Formal Logic Simulator: The approach relies on accurate belief supervision generated by a formal-logic-based simulator. This supervision might not generalize perfectly to all types of reasoning or the full complexity of real-world language use, which can involve nuanced social and emotional factors beyond formal logical updates.
- Beam Search for Weak Models: Beam search is less effective for LLMs with weak instruction-following capabilities, because these models struggle to produce valid intermediate reasoning steps, making PBM evaluation impractical. This limits the practical deployment of beam search with smaller or less capable base models.
- Future Work:
  - Exploring more efficient trace selection methods to further optimize the inference-time scaling process.
  - Extending the DEL-ToM approach to broader domains beyond ToM, suggesting its potential applicability to other forms of complex logical or multi-agent reasoning.
7.3. Personal Insights & Critique
- Innovation of Noise-Free Supervision: The most striking innovation of DEL-ToM is its use of a DEL simulator to generate noise-free, process-level supervision. This tackles a fundamental challenge in reward modeling for LLMs, where human annotation is expensive and prone to error, and LLM self-supervision can perpetuate model biases. By rooting the supervision in formal logic, the PBM learns from an unassailably correct source, which is a powerful paradigm for building reliable verifiers. This could transfer to other domains requiring verifiable, step-by-step reasoning where a formal simulator can provide ground truth.
- Practicality for Resource-Constrained Settings: The ability to significantly boost the ToM performance of smaller LLMs without retraining is immensely practical. It makes advanced ToM capabilities accessible for edge devices, applications with strict latency or cost budgets, and scenarios where access to very large LLMs is limited. The cost-efficiency analysis for API-based models further strengthens this argument.
- Process Reliability vs. Final Answer Accuracy: The paper effectively argues for and demonstrates the importance of process reliability. The comparison with majority voting vividly illustrates that LLMs might arrive at correct answers for the wrong reasons, and that verifying intermediate steps is crucial for genuine reasoning. This aligns with a broader trend in LLM research to move beyond outcome-based evaluation toward understanding and improving the reasoning process itself.
- Critique on Generalizability and Complexity: While the DEL simulator provides noise-free labels, DEL itself is a formal abstraction of reality. Real-world ToM can involve highly complex, ambiguous, and emotionally charged social situations that current DEL formalizations may not fully capture. The qualitative analysis points to this: the PBM struggled with a subtle nuance of nested, perspective-sensitive updates (Alexander's beliefs should not change if he is absent). DEL provides a strong scaffold, but the LLM's ability to accurately translate natural-language scenarios into a DEL-compatible representation (and the PBM's ability to verify it) remains a challenge, especially as scenarios deviate from the training distribution or involve more agents, deeper recursive beliefs, and more complex interaction patterns. The prompts in Appendix A are highly structured, suggesting the LLM is heavily guided in its output format; it is an open question how DEL-ToM would perform on less structured, more open-ended ToM scenarios, or whether a base LLM could generate plausible DEL-formatted traces without such guidance.
- Inspiration: This work motivates the development of domain-specific, formal-logic-grounded verifiers as a general strategy for enhancing LLM reasoning in fields like code generation, mathematical theorem proving, and scientific discovery, where formal systems can provide reliable supervision. Using formal systems to generate data for reward models, rather than relying on human annotations, is a powerful and potentially scalable approach.