
DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

Published: 2025-01-01
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The DEL-ToM framework enhances large language models' performance on Theory-of-Mind tasks through inference-time scaling, using Dynamic Epistemic Logic to structure belief updates. It leverages a Process Belief Model to verify each reasoning step, showing consistent performance improvements across model scales and benchmarks.

Abstract

Theory-of-Mind (ToM) tasks pose a unique challenge for large language models (LLMs), which often lack the capability for dynamic logical reasoning. In this work, we propose DEL-ToM, a framework that improves verifiable ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and verifiable dynamic logical reasoning. We use data generated automatically via a DEL simulator to train a verifier, which we call the Process Belief Model (PBM), to score each belief update step. During inference, the PBM evaluates candidate belief traces from the LLM and selects the highest-scoring one. This allows LLMs to allocate extra inference-time compute to yield more transparent reasoning. Experiments across model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision significantly enhances LLMs’ ToM capabilities without retraining. Code is available at https://github.com/joel-wu/DEL-ToM.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

1.2. Authors

The paper is authored by Yuheng Wu, Jianwen Xie, Denghui Zhang, and Zhaozhuo Xu.

  • Yuheng Wu (Stanford University)

  • Jianwen Xie (Lambda, Inc.)

  • Denghui Zhang (Stevens Institute of Technology)

  • Zhaozhuo Xu (Stevens Institute of Technology)

    Their affiliations suggest a collaboration between a prominent academic institution (Stanford), a computing hardware/service provider (Lambda, Inc.), and another academic institution (Stevens Institute of Technology). This blend implies a project with strong theoretical foundations, practical implementation considerations, and access to significant computational resources.

1.3. Journal/Conference

The paper is listed as published at 2025-01-01 (UTC), which indicates a recent or upcoming publication date, likely in a conference proceeding or journal. Given the arXiv-style original source link, it is currently available as a preprint, which is common for academic work awaiting formal peer review. The subject matter (LLMs, Theory-of-Mind, formal logic) suggests a venue in AI, NLP, or cognitive science, likely a highly reputable conference given the affiliations.

1.4. Publication Year

2025 (based on the provided publication timestamp)

1.5. Abstract

This paper introduces DEL-ToM, a framework designed to enhance the Theory-of-Mind (ToM) reasoning capabilities of Large Language Models (LLMs). The core challenge addressed is LLMs' typical deficiency in dynamic logical reasoning required for ToM tasks. Instead of architectural modifications, DEL-ToM employs an inference-time scaling approach. It achieves this by decomposing ToM tasks into a sequence of belief updates, formally grounded in Dynamic Epistemic Logic (DEL). A key component is the Process Belief Model (PBM), a verifier trained on automatically generated data from a DEL simulator. During inference, the PBM scores candidate belief traces produced by an LLM and selects the highest-scoring one, allowing LLMs to invest more compute at inference time for more transparent and reliable reasoning. Experimental results across various model scales and benchmarks demonstrate that DEL-ToM consistently improves LLM performance on ToM tasks, showcasing the effectiveness of verifiable belief supervision without requiring LLM retraining.


2. Executive Summary

2.1. Background & Motivation

The paper addresses the significant challenge that Theory-of-Mind (ToM) tasks pose for Large Language Models (LLMs). ToM is the ability to attribute mental states (beliefs, desires, intentions) to oneself and others, crucial for social intelligence. While LLMs have shown nascent ToM abilities, these often scale with model size, making smaller models less capable. More critically, current LLM evaluations for ToM typically only check the final output against ground truth, offering no insight into the reasoning process. This lack of verifiability means it's unclear whether a correct answer stems from genuine understanding or mere lucky guessing, rendering LLM ToM reasoning unreliable and impractical for real-world applications, especially in resource-constrained environments where robust inference of user intentions is critical.

The core problem the paper aims to solve is: How can LLMs perform verifiable ToM reasoning, particularly in low-resource settings, without requiring architectural changes or extensive retraining? The paper's innovative idea is to formalize ToM reasoning as a multi-step dynamic belief-update process using Dynamic Epistemic Logic (DEL) and then apply inference-time scaling to select the most reliable reasoning traces.

2.2. Main Contributions / Findings

The paper makes three primary contributions:

  1. New Perspective on ToM Reasoning: It proposes viewing ToM reasoning through the lens of process reliability, formalizing it as a multi-step dynamic belief-update process. This perspective enables the application of inference-time scaling to select more reliable belief traces, enhancing transparency and trustworthiness.

  2. Formalization with DEL and Noise-Free Supervision: The work formalizes ToM reasoning within the Dynamic Epistemic Logic (DEL) framework. Crucially, it constructs a Process Belief Model (PBM) dataset with noise-free supervision, automatically derived from a DEL simulator. This DEL-generated supervision guarantees correctness, a significant advantage over datasets relying on human annotation or LLM assistance for process-level reward modeling. This PBM is then trained to evaluate stepwise reasoning.

  3. Consistent Performance Improvement Across Scales: The DEL-ToM approach consistently improves LLM performance on standard ToM benchmarks across different model scales and search strategies. This demonstrates that verifiable belief supervision significantly enhances LLMs' ToM capabilities without the need for LLM retraining.

    Key findings include:

  • DEL-ToM consistently boosts ToM accuracy for both open-source (Qwen3, Llama3.2) and closed-source (GPT series) LLMs.
  • Smaller LLMs augmented with DEL-ToM can achieve performance competitive with or even surpass much larger LLMs (e.g., Qwen3-4B+PBM outperforms GPT-4.1).
  • Inference-time scaling (particularly Best-of-N with PBM guidance) is crucial for performance gains, while simpler methods like majority voting fail.
  • The PBM demonstrates robustness and generalization capabilities to ToM tasks from out-of-distribution datasets (Kosinski's dataset).
  • The quality of the PBM (i.e., its base model) directly correlates with the end-task ToM performance.
  • DEL-ToM offers a cost-efficient alternative for API-based LLM usage, allowing smaller, cheaper models to achieve higher ToM performance.
  • The method is lightweight, efficient to train, and non-invasive, avoiding the computational and optimization challenges associated with Reinforcement Learning (RL)-based fine-tuning methods.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand DEL-ToM, a foundational grasp of several key concepts is essential:

3.1.1. Theory-of-Mind (ToM)

Theory-of-Mind (ToM) refers to the cognitive ability to attribute mental states—beliefs, desires, intentions, knowledge, etc.—to oneself and to others, and to understand that others' mental states may differ from one's own. It's a fundamental aspect of social intelligence, enabling individuals to predict and explain the behavior of others. In LLM research, ToM tasks often involve scenarios where an LLM must infer what a character knows or believes, especially when those beliefs are false or outdated relative to the actual state of the world (e.g., the classic "Sally-Anne" test or the "unexpected transfer" task described in the paper).

3.1.2. Large Language Models (LLMs)

Large Language Models (LLMs) are deep learning models, typically based on the transformer architecture, that are trained on vast amounts of text data. They excel at a wide range of natural language processing tasks, including text generation, translation, summarization, and question answering. While LLMs exhibit impressive emergent abilities, their performance on complex logical reasoning, especially dynamic and multi-agent reasoning like ToM, can be inconsistent or lack transparency.

3.1.3. Dynamic Epistemic Logic (DEL)

Dynamic Epistemic Logic (DEL) is a formal logic system used to model knowledge and belief, and how these mental states change in response to events or communication. It extends standard epistemic logic (which models static knowledge/belief) by incorporating dynamic operators that represent actions and their effects on agents' knowledge and beliefs. DEL is rooted in philosophical logic and computer science, providing a rigorous framework for reasoning about information flow in multi-agent systems.

The core components of DEL as utilized in the paper are:

  • Epistemic Models: These represent the current state of knowledge/beliefs of agents about the world. They use Kripke's possible-world semantics, where an agent's belief state is represented by a set of "possible worlds" that are consistent with their current information.
  • Event Models: These represent actions or events that occur in the world, along with their preconditions (when the action can happen) and postconditions (how the action changes the world). They also describe what information agents gain or lose from observing these events.
  • Product Update: This is the mechanism by which an epistemic model is updated by an event model. It combines the current state of beliefs with the information conveyed by an event to produce a new epistemic model reflecting the updated beliefs of all agents.

3.1.4. Kripke's Possible-World Semantics

This is the foundational semantic framework for modal logic, including epistemic logic. In Kripke semantics, knowledge or belief is represented not by directly stating what an agent knows, but by considering a set of "possible worlds." An agent $a$ believes a proposition $\varphi$ if $\varphi$ is true in all possible worlds that $a$ considers compatible with their current information. An accessibility relation $R_a$ connects the worlds that agent $a$ considers possible from a given world: $w R_a v$ means that from world $w$, agent $a$ considers $v$ to be a possible state of affairs.

3.1.5. Inference-Time Scaling

Inference-time scaling refers to techniques that improve LLM performance by allocating more computational resources during the inference phase (when the model is generating an output), rather than through architectural changes or additional training/fine-tuning of the model parameters. This can involve generating multiple candidate outputs, performing complex search procedures, or using external verifiers to select the best output. The goal is to get better results from existing models by spending more compute when generating answers, often trading off latency for quality.

3.1.6. Binary Cross-Entropy Loss

Binary Cross-Entropy (BCE) loss is a common loss function used in binary classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1. For a single prediction, it's calculated as: $ \mathcal{L}(y, \hat{y}) = - (y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})) $ where:

  • $y$ is the true binary label (0 or 1).
  • $\hat{y}$ is the predicted probability of the positive class (a value between 0 and 1). The goal during training is to minimize this loss, pushing $\hat{y}$ closer to $y$.
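As a quick numeric illustration (not taken from the paper), the formula rewards confident correct predictions and heavily penalizes confident wrong ones:

```python
import math

# Binary cross-entropy for a single prediction, matching the formula above
def bce(y: float, y_hat: float) -> float:
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce(1, 0.9))   # ~0.105: confident and correct -> small loss
print(bce(1, 0.1))   # ~2.303: confident but wrong  -> large loss
```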

3.2. Previous Works

The paper contextualizes its work by referencing several lines of previous research:

  • Philosophical and Logical Foundations of Epistemic Logic: The authors explicitly cite pioneers like Jaakko Hintikka (Hintikka and B. P. Hintikka, 1989; Hintikka, 1962), Bertrand Russell (Russell and Whitehead, 1910), Ludwig Wittgenstein (Wittgenstein, 1922), Alfred Tarski (Tarski, 1956), Gottlob Frege (Frege, 1879), and Saul Kripke (Kripke, 1963). These works laid the groundwork for formalizing logic, knowledge, and belief, which DEL builds upon.
  • Theory-of-Mind in Cognitive Science: The concept of ToM itself is attributed to early work by Premack and Woodruff (Premack and Woodruff, 1978), C. Dennett (C. Dennett, 1978), and Apperly and Butterfill (Apperly and Butterfill, 2009), with its importance in social intelligence highlighted by Baron-Cohen (Baron-Cohen, 1991).
  • LLMs and ToM Capabilities: Recent studies have explored ToM abilities in LLMs (Strachan et al., 2024; Lin et al., 2024; Street et al., 2024; Amirizaniani et al., 2024; Sclar et al. 2025; Wu et al., 2025; Kosinski, 2024). The paper notes that ToM performance in LLMs follows a scaling law (Kosinski, 2024) and that current evaluations (Chen et al. 2024) often lack verifiability (Ullman, 2023).
  • Dynamic Epistemic Logic Applications: DEL itself has a rich history, with key formalizations by Baltag et al. (Baltag et al., 1998), Van Benthem (Van Benthem, 2001), Plaza (Plaza, 2007), and Van Ditmarsch et al. (Van Ditmarsch et al., 2007), and extensions by Aucher and Schwarzentruber (Aucher and Schwarzentruber, 2013). Earlier cognitive models used DEL to simulate belief change (Bolander and Andersen, 2011), and logic-based simulators provided symbolic supervision for belief updates (Bolander, 2014; Hansen and Bolander, 2020). DEL-ToM builds on this by using DEL not just for modeling but for generating supervision for LLM evaluation.
  • Inference-Time Scaling of LLMs: This is a growing area, with two main paradigms:
    • Single-trace scaling: Encourages deeper reasoning within one path (e.g., reinforcement learning methods like GRPO by Guo et al., 2025a; Cheng et al., 2025, or distillation Li et al., 2025).
    • Multi-trace scaling: Generates multiple reasoning traces and selects the best one (e.g., voting Wang et al., 2023, 2025 or external verifiers Wang et al., 2024; Sun et al., 2024; Guo et al., 2025b; Saad-Falcon et al., 2025). This paradigm sometimes combines with search algorithms like tree search or beam search (Zhang et al., 2024; Lin et al., 2025). DEL-ToM falls into this multi-trace category, using PBM as the external verifier.

3.3. Technological Evolution

The field has evolved from early philosophical formalizations of logic and knowledge to the development of epistemic logic and later Dynamic Epistemic Logic (DEL) to model changes in knowledge. Concurrently, LLMs have emerged as powerful tools for language generation and understanding. Initial evaluations of LLM ToM abilities were often heuristic and lacked process-level transparency. This paper represents an evolution towards integrating rigorous formal logic (DEL) with LLM capabilities, aiming to provide verifiable ToM reasoning. It also leverages the recent trend of inference-time scaling to enhance LLM performance without costly retraining.

3.4. Differentiation Analysis

DEL-ToM distinguishes itself from previous work in several key ways:

  • Verifiable Reasoning via Formal Logic: Unlike most LLM ToM evaluations that only check final answers or rely on LLM-generated "thoughts" that are hard to verify, DEL-ToM explicitly frames ToM reasoning within the mathematically rigorous framework of DEL. This allows for step-by-step verification of belief updates, grounding the reasoning in formal semantics.
  • Noise-Free Process-Level Supervision: A major innovation is the use of a DEL simulator to automatically generate noise-free, process-level labels for training the Process Belief Model (PBM). This contrasts with reward modeling techniques that rely on human annotations or LLM self-supervision, which can introduce noise or biases. The guaranteed correctness of DEL-derived labels is a significant advantage.
  • Inference-Time Scaling for Process Reliability: DEL-ToM explicitly applies inference-time scaling techniques (like Best-of-N and beam search) guided by the PBM to select the most reliable belief traces. This is a specific application of inference-time scaling focused on the process of ToM reasoning, not just the final output accuracy. It enables smaller LLMs to achieve higher ToM capabilities without retraining, offering a practical solution for resource-constrained environments.
  • Non-Invasive Enhancement: Unlike Reinforcement Learning (RL)-based fine-tuning methods (GRPO) that modify LLM parameters and can be computationally expensive and risk degrading performance on other tasks, DEL-ToM is non-invasive. It acts as an external verifier and reranker, leaving the base LLM unchanged, making it more generalizable and easier to deploy.

4. Methodology

The core methodology of DEL-ToM involves formulating ToM reasoning as a Dynamic Epistemic Logic (DEL) process, training a Process Belief Model (PBM) using DEL-generated ground truth, and then using this PBM to guide inference-time scaling for selecting optimal ToM reasoning traces from LLMs.

4.1. Principles

The fundamental principle behind DEL-ToM is that ToM reasoning can be understood as a sequence of dynamic belief updates, where agents' knowledge and beliefs change in response to observed actions and communications. By formalizing this process with Dynamic Epistemic Logic (DEL), each step of ToM reasoning becomes a verifiable belief update. The paper posits that process reliability (i.e., the correctness of each intermediate step) is key to reliable ToM conclusions. The intuition is that if an LLM can generate multiple potential reasoning paths, an external verifier, trained on formally correct belief updates, can identify the most logically sound path, thereby improving the overall ToM performance without altering the base LLM. This is achieved by allocating additional computation at inference time to evaluate and select these traces.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Formulating ToM Reasoning within DEL

The paper formalizes ToM reasoning using Dynamic Epistemic Logic (DEL), which is built upon Kripke's possible-world semantics.

Epistemic Language: The epistemic language $\mathcal{L}(\mathcal{P}, \mathcal{A})$ is defined to express facts and beliefs.

  • $\mathcal{P}$ is a countable set of atomic propositions (basic facts about the world, e.g., chocolate_in_drawer).
  • $\mathcal{A}$ is a finite, non-empty set of agents (e.g., John, Mary, Alice).

The language is defined by the following Backus-Naur form: $ \varphi ::= p \mid \neg \varphi \mid \varphi \land \varphi \mid B_i \varphi $ where $p \in \mathcal{P}$ is an atomic proposition, $i \in \mathcal{A}$ is an agent, and $\varphi$ is a well-formed formula. The formula $B_i \varphi$ is read as "agent $i$ believes $\varphi$." For example, "John believes the chocolate is in the drawer" is expressed as $B_{\mathrm{John}}(\text{chocolate\_in\_drawer})$.

Definition 1 (Epistemic Model): An epistemic model represents the knowledge and beliefs of agents at a given moment. Over an agent set $\mathcal{A}$ and proposition set $\mathcal{P}$, it is a triple $\mathcal{M} = (W, R, V)$, where:

  • $W$ is a set of possible worlds. Each world in $W$ is a complete assignment of truth values to all atomic propositions in $\mathcal{P}$.
  • $R: \mathcal{A} \to 2^{W \times W}$ assigns to each agent $a \in \mathcal{A}$ an accessibility relation $R_a \subseteq W \times W$. The notation $w R_a v$ means that from world $w$, agent $a$ considers world $v$ to be a possible state of affairs. These relations capture an agent's uncertainty.
  • $V: \mathcal{P} \to 2^W$ is a valuation function that maps each atomic proposition $p \in \mathcal{P}$ to the set of worlds where $p$ is true. A state is a pointed epistemic model $(\mathcal{M}, w)$ where $w \in W$ is the designated actual world.

The satisfaction relation $\vDash$ for $\mathcal{L}(\mathcal{P}, \mathcal{A})$ is defined for an epistemic model $\mathcal{M}$ and a designated world $w \in W$:

  • $\mathcal{M}, w \models p$ iff $w \in V(p)$ (an atomic proposition $p$ is true in world $w$ if $w$ is in the set of worlds where $p$ is true).
  • $\mathcal{M}, w \models B_a \varphi$ iff for all $v \in W$ such that $w R_a v$, we have $\mathcal{M}, v \models \varphi$ (agent $a$ believes $\varphi$ in world $w$ if $\varphi$ is true in all worlds that $a$ considers possible from $w$).
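To make Definition 1 and the satisfaction relation concrete, here is a minimal Python sketch, assuming worlds are strings, belief formulas are nested tuples, and accessibility relations are sets of pairs; this is an illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class EpistemicModel:
    worlds: set        # world identifiers
    access: dict       # agent -> set of (w, v) pairs, i.e. the accessibility relation R_a
    valuation: dict    # atomic proposition -> set of worlds where it is true

def holds(model, world, formula):
    """Satisfaction check: formula is an atom (str) or a nested tuple ('B', agent, subformula)."""
    if isinstance(formula, str):          # M, w |= p  iff  w in V(p)
        return world in model.valuation.get(formula, set())
    op, agent, sub = formula              # M, w |= B_a phi  iff  phi holds in all R_a-successors of w
    assert op == "B"
    successors = {v for (w, v) in model.access[agent] if w == world}
    return all(holds(model, v, sub) for v in successors)

# Example: in world w4 (chocolate on the table, both agents aware), Mary believes "table"
M = EpistemicModel(
    worlds={"w4"},
    access={"Mary": {("w4", "w4")}, "Alice": {("w4", "w4")}},
    valuation={"table": {"w4"}},
)
print(holds(M, "w4", ("B", "Mary", "table")))   # True
```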

Definition 2 (Event Model): An event model describes an action or event and its impact on the world and agents' knowledge. It is a tuple $\varepsilon = (E, Q, \text{pre}, \text{post})$, where:

  • $E$ is a finite, non-empty set of events.
  • $Q: \mathcal{A} \to 2^{E \times E}$ assigns to each agent $a \in \mathcal{A}$ an indistinguishability relation $Q_a \subseteq E \times E$ over events. This indicates which events an agent cannot distinguish from each other.
  • $\text{pre}: E \to \mathcal{L}(\mathcal{P}, \mathcal{A})$ assigns to each event $e \in E$ a precondition, a formula specifying when $e$ is executable in a given state.
  • $\text{post}: E \to \mathcal{L}(\mathcal{P}, \mathcal{A})$ assigns to each event $e \in E$ a postcondition describing how the world (atomic propositions) changes if $e$ occurs. A pointed event model $(\varepsilon, e)$ is referred to as an action, where $e \in E$ is the actual event that occurs.

Definition 3 (Product Update): The product update is the mechanism for updating an epistemic model based on an event model. Given an initial state $(\mathcal{M}, w)$ with $\mathcal{M} = (W, R, V)$ and an action $(\varepsilon, e)$ with $\varepsilon = (E, Q, \text{pre}, \text{post})$, if the precondition $\mathcal{M}, w \models \text{pre}(e)$ is satisfied, the product update yields a new state $(\mathcal{M}', (w, e))$. The updated epistemic model $\mathcal{M}' = (W', R', V')$ is defined as:

  • $W' = \{ (w', e') \in W \times E \mid \mathcal{M}, w' \models \text{pre}(e') \}$: the new set of possible worlds consists of pairs of original worlds and events that could have occurred in those worlds (i.e., whose preconditions are met).
  • For each $a \in \mathcal{A}$, $R_a' = \{ ((w', e'), (v', f')) \in W' \times W' \mid w' R_a v' \wedge e' Q_a f' \}$: the new accessibility relation $R_a'$ connects two new worlds if agent $a$ considered $v'$ possible from $w'$ in the old model and agent $a$ cannot distinguish event $f'$ from $e'$ in the event model. This captures how agents' uncertainties about the world and about events combine.
  • $(w', e') \in V'(p)$ iff $\text{post}(e') \models p$ or $(\mathcal{M}, w' \models p \wedge \text{post}(e') \not\models \neg p)$, for each $p \in \mathcal{P}$: the truth of an atomic proposition $p$ in a new world $(w', e')$ is determined by the postcondition of event $e'$. If the postcondition explicitly entails $p$, it is true; if the postcondition does not entail $\neg p$, $p$'s truth value is inherited from the original world $w'$.
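Continuing the sketch above (and reusing EpistemicModel and holds), the product update of Definition 3 can be written as follows; the encoding of preconditions and postconditions and the single-event example are simplified assumptions for illustration.

```python
# Preconditions are formulas (None stands for the trivial precondition ⊤); postconditions
# are dicts {prop: True/False}; propositions they do not mention keep their old truth value.

def product_update(model, events, pre, post, event_access, props):
    # W': pairs (w, e) such that the precondition of e holds in w
    new_worlds = {(w, e) for w in model.worlds for e in events
                  if pre[e] is None or holds(model, w, pre[e])}

    # R'_a: ((w, e), (v, f)) iff w R_a v and e Q_a f
    new_access = {
        a: {((w, e), (v, f))
            for (w, e) in new_worlds for (v, f) in new_worlds
            if (w, v) in model.access[a] and (e, f) in event_access[a]}
        for a in model.access
    }

    # V': the postcondition of e overrides p; otherwise p is inherited from w
    new_valuation = {p: set() for p in props}
    for (w, e) in new_worlds:
        for p in props:
            if post[e].get(p, w in model.valuation.get(p, set())):
                new_valuation[p].add((w, e))

    return EpistemicModel(new_worlds, new_access, new_valuation)

# Example: a single public event e6 "the chocolate is moved to the cupboard", observed by all
events = {"e6"}
pre = {"e6": None}
post = {"e6": {"cupboard": True, "table": False}}
event_access = {"Mary": {("e6", "e6")}, "Alice": {("e6", "e6")}}
M_pub = product_update(M, events, pre, post, event_access, props={"table", "cupboard"})
print(holds(M_pub, ("w4", "e6"), ("B", "Alice", "cupboard")))   # True
```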

Application of DEL to ToM Reasoning (Illustrative Example): The paper provides an example (Figure 2 in the original paper) to illustrate the DEL update process. The figure is a schematic of the actions and belief-state transitions in a Theory-of-Mind (ToM) task, including examples of process-label generation and belief updates in a dynamic scenario; it helps show how DEL is used to structure and verify ToM reasoning.

  • Scenario (State 4): Mary and Alice are present, and the chocolate is on the table. So $\mathcal{M}, w_4 \models \text{table}$ holds (the chocolate is on the table). Their accessibility relations $R_M = R_A = \{ (w_4, w_4) \}$ mean they both know the chocolate is on the table (they only consider $w_4$ possible).
  • Action 5: Mary exits the kitchen. $\text{pre}(e_5) = \top$, $\text{post}(e_5) = \text{table}$. The fact that the chocolate is on the table remains unchanged, but Mary will not observe subsequent actions.
  • Action 6: Alice moves the chocolate to the cupboard. $\text{pre}(e_6) = \top$, $\text{post}(e_6) = \text{cupboard} \wedge \neg\text{table}$. The chocolate is now in the cupboard. After the product update, the actual state $(\mathcal{M}, w_6)$ satisfies $\mathcal{M}, w_6 \models \text{cupboard}$.
  • Belief Update: Alice's new accessibility relation $R_A'$ points to worlds where the chocolate is in the cupboard, reflecting her new knowledge. Mary's relation $R_M'$ still points to worlds where the chocolate is on the table, because she did not observe the move.
  • Nested Belief Example: The paper then shows how to derive Mary's belief about Alice's belief: Mary believes that Alice believes the chocolate is on the table ($\mathcal{M}, w_6 \models B_{\mathrm{Mary}} B_{\mathrm{Alice}} \varphi$, where $\varphi$ is "the chocolate is on the table"). This is because in all worlds Mary considers possible from $w_6$ (essentially $w_4$-like worlds where the chocolate is on the table), Alice believes the chocolate is on the table. A small code illustration of this nested check follows below.
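Continuing the Kripke sketch above, the nested belief check can be reproduced with a small, simplified model; the world layout below is an assumption for illustration, not taken verbatim from the paper's Figure 2.

```python
# After Action 6: Alice observed the move, Mary did not, so from the actual world w6
# Mary still considers the pre-move world w4 possible.
M6 = EpistemicModel(
    worlds={"w6", "w4"},
    access={"Mary": {("w6", "w4"), ("w4", "w4")},
            "Alice": {("w6", "w6"), ("w4", "w4")}},
    valuation={"table": {"w4"}, "cupboard": {"w6"}},
)

print(holds(M6, "w6", ("B", "Alice", "cupboard")))               # True: Alice knows the new location
print(holds(M6, "w6", ("B", "Mary", ("B", "Alice", "table"))))   # True: nested (outdated) belief
```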

4.2.2. Building the PBM with DEL

The Process Belief Model (PBM) is a verifier that scores each belief update step.

Generating Process-Level Labels via DEL: To train the PBM, the authors integrate a DEL simulator into the Hi-ToM generators (Wu et al., 2023). This system automatically synthesizes 20,000 ToM stories. For each story, the DEL simulator produces process-level traces across different belief orders. At each action in a story, the simulator updates the accessibility relations $R$ based on the action's semantics (e.g., whether observation is public or private) and records the resulting belief state in the trace set. The crucial point is that these DEL-generated labels are guaranteed to be noise-free and correct because they are derived from a formal logic system.

Dataset Assembly: For each synthesized story, GPT-4o-mini (Hurst et al., 2024) is prompted to generate step-by-step belief updates in a DEL format (the specific prompt is detailed in Appendix A). Each LLM-generated trace is then paired with the corresponding DEL-generated per-step labels. This pairing creates training instances that provide both positive (correct belief updates) and negative (incorrect belief updates) supervision for process-level reward modeling.
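A minimal sketch of this pairing step is shown below; the function and field names (label_trace, extract_belief) are hypothetical stand-ins, not the paper's code.

```python
# Pair an LLM-generated trace with the DEL simulator's gold belief states to obtain
# noise-free per-step labels for PBM training.

def label_trace(llm_steps, del_gold_beliefs, extract_belief):
    """llm_steps: step strings generated by GPT-4o-mini for one story;
    del_gold_beliefs: the simulator's ground-truth belief state after each action;
    extract_belief: parser for the tracked belief, e.g. the location inside [...]."""
    examples = []
    for i, (step_text, gold) in enumerate(zip(llm_steps, del_gold_beliefs)):
        predicted = extract_belief(step_text)
        examples.append({
            "step_index": i,
            "step_text": step_text,
            "label": 1 if predicted == gold else 0,   # correct / incorrect belief update
        })
    return examples
```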

Training the PBM: The PBM is conceptualized as a scoring function $f : \mathcal{Q} \times \mathcal{S} \to \mathbb{R}^+$. This function takes a ToM problem $q$ and a belief update step $s_i$ (from a GPT-4o-mini-generated trace $s$) and assigns it a score. Training the PBM is framed as a binary classification task: each step $s_i$ in an LLM trace is labeled as either correct (1) or incorrect (0) according to the DEL-generated gold labels. The model is trained to predict these binary labels using binary cross-entropy loss: $ \mathcal{L}_{\mathrm{PBM}} = - \sum_{i=1}^K y_{s_i} \log f(s_i) - \sum_{i=1}^K (1 - y_{s_i}) \log (1 - f(s_i)) $ where:

  • $K$ is the total number of steps in a belief trace.
  • $y_{s_i}$ is the binary label (0 or 1) for step $s_i$ (1 if correct, 0 if incorrect, as provided by the DEL simulator).
  • $f(s_i)$ is the PBM's predicted score (a probability between 0 and 1) for step $s_i$. Training aims to minimize this loss, so the PBM learns to assign high scores (close to 1) to correct steps and low scores (close to 0) to incorrect steps. The training code is adapted from the RLHF-Reward-Modeling codebase.
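The following sketch shows what step-level training with this loss could look like in PyTorch; PBMHead and the random step representations are stand-ins for a scoring head on top of the base LM (Llama3.1-8B-Instruct in the paper) and are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PBMHead(nn.Module):
    """Maps a step representation to f(s_i) in (0, 1): the probability the step is correct."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, step_repr: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.scorer(step_repr)).squeeze(-1)

def pbm_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy summed over the K steps of one trace; labels come from the DEL simulator
    return nn.functional.binary_cross_entropy(scores, labels, reduction="sum")

# Toy usage with random stand-in "encodings" of 4 steps
head = PBMHead(hidden_size=16)
step_repr = torch.randn(4, 16)                # stand-in for LM hidden states of each step
labels = torch.tensor([1.0, 1.0, 0.0, 1.0])   # DEL-derived per-step gold labels
loss = pbm_loss(head(step_repr), labels)
loss.backward()
```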

4.2.3. Inference-Time Scaling Pipeline

After the PBM is trained, it is used during LLM inference to guide the selection of reliable ToM reasoning traces. The paper explores two main strategies: Beam Search and Best-of-N (BoN).

Beam Search: Beam search is a decoding method that maintains multiple partial belief traces. The pipeline is illustrated in Figure 1 of the original paper, a schematic showing the belief-state update process for a ToM task under Dynamic Epistemic Logic: the actions of John, Mary, and Alice, Mary's inference about the chocolate's location, and the evaluation of candidate beliefs by the Process Belief Model (PBM) to select the one with the highest reward score. The beam-search procedure is as follows:
  1. Initialize: Start with $k$ beams, each representing a partial reasoning trace. The LLM samples $k$ candidate first-step updates.
  2. Expand: At each subsequent action in the ToM story, the LLM (conditioned on the trace so far) proposes $b$ candidate belief updates for each of the $k$ active beams, yielding $k \times b$ potential partial paths.
  3. Score: The PBM scores each of the $k \times b$ candidate paths based on the score of the most recent step.
  4. Retain: Only the top $k$ highest-scoring paths are retained for the next iteration.
  5. Iterate: This expand-score-retain process repeats until all actions in the story are processed, or a maximum depth/sequence length is reached.
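A minimal sketch of this PBM-guided beam search follows; propose_updates (the LLM proposing candidate next steps) and pbm_score (the trained verifier) are hypothetical stand-ins, and the loop is slightly simplified (it starts from a single empty trace rather than k sampled first steps).

```python
def beam_search(problem, actions, propose_updates, pbm_score, k=4, b=4):
    beams = [[]]                                   # start with one empty partial trace
    for action in actions:
        candidates = []
        for trace in beams:
            # LLM proposes b candidate belief updates for this action, given the trace so far
            for step in propose_updates(problem, trace, action, b):
                new_trace = trace + [step]
                # Rank candidates by the PBM score of the most recent step
                candidates.append((pbm_score(problem, step), new_trace))
        candidates.sort(key=lambda x: x[0], reverse=True)
        beams = [trace for _, trace in candidates[:k]]   # keep the top-k partial traces
    return beams[0]                                # highest-scoring complete trace
```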

Best-of-N (BoN): Best-of-N (BoN) is an alternative, simpler strategy in which the LLM generates $N$ complete belief traces after reading the entire ToM story.

  1. Generate Traces: The LLM generates $N$ independent, complete belief traces.
  2. Step-wise Scoring: The PBM scores each individual step within all $N$ traces.
  3. Aggregate Scores: The step-wise scores for each trace are aggregated into a single process-level reward (a trace-level score). The paper explores several aggregation rules:
    • Last: Uses only the PBM score of the final step in the trace.
    • Min: Uses the lowest PBM score across all steps in the trace (a conservative measure, assuming the weakest link determines trace quality).
    • Avg: Uses the average PBM score across all steps in the trace.
    • Prod: Multiplies the PBM scores of all steps in the trace (highly sensitive to low scores, since a single low score can drastically reduce the product).
    • Majority: A baseline comparison in which the final answer is selected by simple majority voting across the $N$ traces, without using PBM scores. This is used to highlight the PBM's value.
  4. Rerank and Select: Based on the aggregated trace-level scores, the $N$ candidate traces are reranked. Two ranking strategies are considered (a code sketch follows below):
    • Vanilla BoN: Selects the single trace with the highest aggregated PBM score as the final output.
    • Weighted BoN: Groups traces by their final answers. Let $\mathcal{V} = \{ y_1, y_2, \ldots \}$ be the set of unique final answers proposed by the $N$ traces. For each unique answer $y \in \mathcal{V}$, the PBM scores of all traces that yield that answer are summed, and the final answer $\hat{y}$ is the one with the highest total: $ \hat{y} = \arg \max_{y \in \mathcal{V}} \sum_{i=1}^N \mathbb{1}(y_i = y) \cdot \mathrm{PBM}(p, t_i) $ where:
      • $\hat{y}$ is the selected final answer.
      • $y \in \mathcal{V}$ is a candidate final answer.
      • $N$ is the total number of sampled traces.
      • $\mathbb{1}(y_i = y)$ is an indicator function that equals 1 if the final answer $y_i$ of trace $t_i$ matches the candidate answer $y$, and 0 otherwise.
      • $\mathrm{PBM}(p, t_i)$ is the aggregated PBM score for trace $t_i$ given problem $p$.
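Below is a minimal sketch of the aggregation rules and the vanilla/weighted selection; the trace representation and example scores are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import defaultdict

def aggregate(step_scores, rule="min"):
    """Collapse per-step PBM scores into one trace-level score."""
    if rule == "last":
        return step_scores[-1]
    if rule == "min":
        return min(step_scores)
    if rule == "avg":
        return sum(step_scores) / len(step_scores)
    if rule == "prod":
        return math.prod(step_scores)
    raise ValueError(rule)

def vanilla_bon(traces, rule="min"):
    # Pick the single trace with the highest aggregated score
    return max(traces, key=lambda t: aggregate(t["step_scores"], rule))["answer"]

def weighted_bon(traces, rule="min"):
    # Sum aggregated PBM scores per candidate final answer, pick the best-supported one
    totals = defaultdict(float)
    for t in traces:
        totals[t["answer"]] += aggregate(t["step_scores"], rule)
    return max(totals, key=totals.get)

traces = [
    {"answer": "green_bucket", "step_scores": [0.9, 0.8, 0.9]},
    {"answer": "red_bathtub",  "step_scores": [0.9, 0.4, 0.7]},
    {"answer": "green_bucket", "step_scores": [0.8, 0.7, 0.8]},
]
print(vanilla_bon(traces), weighted_bon(traces))   # both pick "green_bucket"
```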

5. Experimental Setup

5.1. Datasets

The experiments are conducted on two datasets to evaluate the DEL-ToM framework.

5.1.1. Hi-ToM

  • Source: Hi-ToM (Wu et al., 2023) is a benchmark specifically designed for evaluating higher-order Theory-of-Mind reasoning in LLMs.
  • Characteristics: For this study, only one-chapter stories from Hi-ToM are evaluated. Hi-ToM stories are synthetically generated and include various levels of belief orders (0th to 4th order), which represent increasing complexity in nested beliefs (e.g., "John believes X" is 1st order, "Mary believes John believes X" is 2nd order, etc.). The DEL simulator is integrated into the Hi-ToM generators to produce the noise-free process labels for PBM training.
  • Domain: Synthetic ToM scenarios involving multiple agents and dynamic information changes.
  • Purpose: To evaluate the effectiveness of DEL-ToM on a dataset specifically designed to probe ToM capabilities and where the PBM is trained using data of a similar generation style.

5.1.2. Kosinski's Dataset

  • Source: The ToM tasks introduced by Kosinski (Kosinski, 2024).
  • Characteristics: This dataset contains hand-written scenarios with false-belief and true-belief controls. For the experiments, evaluation is restricted to the unexpected transfer task. This task typically involves an object being moved without a protagonist's knowledge, and the model must infer where the protagonist will look based on their outdated belief.
  • Domain: Hand-crafted ToM scenarios.
  • Purpose: To test the generalization capabilities of the PBM and the DEL-ToM framework to out-of-distribution (OOD) ToM data, as the PBM was trained on Hi-ToM-style synthetic data. This assesses whether the PBM acts as a genuine verifier rather than overfitting to the training distribution.

5.1.3. Data Sample Example (from Appendix A)

The paper provides a detailed example of a ToM story and its expected step-by-step trace format in Appendix A. This helps illustrate the type of data the models process and the format of the belief updates they are expected to generate.

Here is a story that unfolds in chronological order.

You will be asked a question about the story, which may involve either: (1) Locating an object, or (2) Inferring an agent's mental state (e.g., what A thinks B thinks C thinks).

To solve it, think step-by-step. At each step, repeat the current line from the story, then explain its effect on beliefs. Use [Null] if someone does not yet have knowledge. If a belief chain cannot be formed (e.g., some agent exited too early), freeze belief at the last available step.

{note}

In public or private communication:

  • The speaker believes the listener will believe the claim. If the listener exited the room earlier than the speaker, they will believe it.

    If the question is zero-order (e.g., "Where is X really?"), then in each step, only track the actual location of the object (e.g., "X is in [Y]"). You do not need to track nested beliefs.

Here is an example:

1 Amelia, Chloe, Liam, Owen and Benjamin entered the TV_room.
2 The celery is in the red_envelope.
3 Amelia made no movements and stayed in the TV_room for 1 minute.
4 Chloe lost his watch.
5 Amelia exited the TV_room.
6 Chloe moved the celery to the green_bucket.
7 Chloe exited the TV_room.
8 Liam moved the celery to the red_bathtub.
9 Liam exited the TV_room.
10 Owen made no movements and stayed in the Tv_room for 1 minute.
11 Owen exited the TV_room.
12 Benjamin made no movements and stayed in the TV_room for 1 minute.
13 Benjamin exited the TV_room.
14 Amelia, Chloe, Liam, Owen and Benjamin entered the waiting_room.
15 Liam publicly claimed that celery is in the white_bathtub now.
16 Benjamin privately told Liam that the celery is in the blue_drawer now.

Where does Owen think Liam thinks Chloe thinks the celery is?

Step 1

Amelia, Chloe, Liam, Owen and Benjamin entered the TV_room.
Everyone is present, but the celery's location is still unknown.
Owen thinks Liam thinks Chloe thinks the celery is in [Null]

Step 2

The celery is in the red_envelope.
Everyone observes this.
Owen thinks Liam thinks Chloe thinks the celery is in [red_envelope]

Step 3

Amelia made no movements and stayed in the TV_room for 1 minute.
No effect.

Owen thinks Liam thinks Chloe thinks the celery is in [red_envelope]

Step 4

Chloe lost his watch.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [red_envelope]

Step 5

Amelia exited the TV_room.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [red_envelope]

Step 6

Chloe moved the celery to the green_bucket.
Only Chloe, Liam, Owen, Benjamin are present. They all see this move. Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 7

Chloe exited the TV_room.
Chloe's belief frozen; still [green_bucket]
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 8

Liam moved the celery to the red_bathtub.
Only Liam, Owen, Benjamin present. They observe the move. Chloe not present, so her belief unchanged.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 9

Liam exited the TV_room.
No change.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 10

Owen made no movements and stayed in the Tv_room for 1 minute.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 11

Owen exited the TV_room.
Owen's belief frozen.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 12

Benjamin made no movements and stayed in the TV_room for 1 minute.
Irrelevant.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 13

Benjamin exited the TV_room.
No change.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 14

Everyone entered the waiting_room.
No effect on beliefs.
Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 15

Liam publicly claimed that celery is in the white_bathtub now.

Owen hears this statement. However, public speech only affects first- and second-order beliefs (e.g., what Liam believes, what Owen thinks Liam believes, and what Liam thinks Owen believes). It does not change Owen's belief about what Liam thinks Chloe thinks.

Owen thinks Liam thinks Chloe thinks the celery is in [green_bucket]

Step 16

Benjamin privately told Liam that the celery is in the blue_drawer now.
Owen does not hear this, but more importantly, private communication only affects beliefs between the speaker and the listener. It can change what Liam believes (based on exit order), what Liam thinks Benjamin believes (based on exit order), or what Benjamin thinks Liam believes (always changes) - but it cannot affect higher-order beliefs. So this does not change Owen's belief about what Liam thinks Chloe thinks.

Give a step-by-step trace as in the example. Then, give the final answer in one line like:

Final Answer: [your choice]

5.2. Evaluation Metrics

The primary evaluation metric reported in the paper is final answer accuracy.

  • Conceptual Definition: Accuracy measures the proportion of correctly predicted instances out of the total number of instances. In this context, it refers to the percentage of ToM questions for which the LLM (after DEL-ToM intervention) provides the correct final answer, matching the ground truth. It directly reflects the model's ability to arrive at the correct ToM inference.
  • Mathematical Formula: For a set of predictions and ground truths: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  • Symbol Explanation:
    • Number of Correct Predictions: The count of ToM tasks where the model's final answer matches the ground truth.
    • Total Number of Predictions: The total number of ToM tasks evaluated.

5.3. Baselines

The paper compares its DEL-ToM approach against a wide range of LLMs, both open-source and closed-source, acting as baselines. These baselines are evaluated in their vanilla (original) settings, without the PBM-guided inference-time scaling.

  • Open-source Models:
    • Qwen3 series: Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B (Yang et al. 2025).
    • Llama3.2 series: Llama3.2-1B, Llama3.2-3B (Grattafiori et al., 2024).
    • Other large open-source models: Qwen3-235B-A22B (Yang et al., 2025), DeepSeek-V3 (Liu et al., 2024), OLMo-2-0325-32B (Walsh et al., 2025).
  • Closed-source Models (API-based):
    • gpt-4.1

    • gpt-4o

    • gpt-4.1-mini

    • gpt-4o-mini

    • Additional baselines for comparison: o4-mini, gpt-4.1-nano.

      These baselines are representative as they cover a wide spectrum of LLM sizes and capabilities, including state-of-the-art models from major developers. This allows for a comprehensive assessment of DEL-ToM's ability to improve both smaller, resource-efficient models and larger, more powerful ones.

5.4. Platform and PBM Training

  • Platform: All experiments are conducted on a single NVIDIA GH200 GPU node. The vLLM (Kwon et al., 2023) framework is used for efficient batched inference and large-scale decoding, which is crucial for handling the multiple trace generations required by inference-time scaling.
  • PBM Training: The Process Belief Model (PBM) is fine-tuned based on a Llama3.1-8B-Instruct model (Grattafiori et al., 2024). It is trained for only 1 epoch using the 20,000 synthesized stories from the Hi-ToM generators, which provide the DEL-derived process labels. The training code is adapted from the RLHF-Reward-Modeling codebase.

5.5. Prompt Format

All models are evaluated using a consistent prompting format, details of which are provided in Appendix A of the paper. This ensures that any performance differences are due to the DEL-ToM intervention or model capabilities, rather than variations in prompt engineering. The example in Appendix A shows a highly structured prompt that guides the LLM to produce step-by-step reasoning in a specific format.

6. Results & Analysis

6.1. Core Results Analysis

The experiments consistently demonstrate that DEL-ToM significantly enhances LLMs' Theory-of-Mind (ToM) reasoning capabilities across various model scales and benchmarks.

6.1.1. Main Results on Hi-ToM Dataset (Table 1)

The following are the results from Table 1 of the original paper:

Model 0th Order 1st Order 2nd Order 3rd Order 4th Order Average
Ori +PBM Ori +PBM Ori +PBM Ori +PBM Ori +PBM Ori +PBM
BoN (N = 1024)
Qwen3-4B 100.0 100.0 79.8 85.0 79.3 90.0 70.2 82.5 46.0 65.0 75.1 84.5
Qwen3-1.7B 78.0 82.5 59.7 65.0 45.2 55.0 47.0 62.5 47.8 57.5 55.5 64.5
Qwen3-0.6B 69.2 80.0 52.0 72.5 35.0 47.5 31.5 52.5 34.0 47.5 44.3 60.0
Llama3.2-3B 68.2 85.0 52.0 80.0 43.2 82.5 37.0 82.5 36.8 75.0 47.4 81.0
Llama3.2-1B 41.5 46.2 40.0 53.8 28.5 61.5 41.5 84.6 29.2 58.3 36.1 60.9
BoN (N = 4)
gpt-4.1 95.0 97.5 85.0 87.5 85.0 92.5 82.5 95.0 70.0 77.5 83.5 90.0
gpt-4.1-mini 77.5 70.0 90.0 85.0 70.0 75.0 75.0 92.5 77.5 92.5 78.0 83.0
gpt-4o 100.0 100.0 85.0 90.0 82.5 92.5 90.0 97.5 77.5 85.0 87.0 93.0
gpt-4o-mini 90.0 100.0 75.0 87.5 77.5 95.0 77.5 100.0 55.0 95.0 75.0 93.5
Beam Search (N = 256)
Qwen3-8B 96.5 80.0 53.3 80.0 38.8 85.0 55.8 95.0 57.8 95.0 60.4 87.0
Qwen3-4B 100.0 100.0 79.8 85.0 79.3 97.5 70.2 82.5 46.0 60.0 75.1 85.0

The table shows that incorporating the PBM (+PBM columns) consistently improves ToM reasoning under both Best-of-N (BoN) and beam search, for all tested open-source and closed-source models.

  • Significant Gains for Smaller Models: For instance, Llama3.2-3B sees an impressive average accuracy gain of 33.6 points (from 47.4 to 81.0) with BoN (N = 1024). Qwen3-0.6B improves by 15.7 points (from 44.3 to 60.0). These are substantial improvements for models that typically struggle with higher-order ToM.
  • Enhancement for Larger Models: Even powerful models like gpt-4.1 and gpt-4o benefit, showing average gains of 6.5 points (from 83.5 to 90.0) and 6.0 points (from 87.0 to 93.0), respectively, with BoN (N = 4). This indicates the PBM provides valuable guidance even for highly capable LLMs.
  • Unlocking Latent Abilities: Qwen3-8B, which underperforms Qwen3-4B in its baseline (Ori) setting (60.4 vs. 75.1 average accuracy), achieves the highest accuracy among all Qwen3 variants (87.0) when guided by the PBM using beam search. This suggests that the PBM can "unlock" higher-order reasoning capabilities that are latent or poorly expressed in base models.
  • Performance across Belief Orders: Improvements are observed across all belief orders, from 0th to 4th, with the most dramatic gains often seen in higher-order ToM tasks, which are inherently more challenging. For example, Llama3.2-3B jumps from 36.8 to 75.0 on 4th-order ToM.

6.1.2. Comparison with SOTA LLMs (Table 2)

The following are the results from Table 2 of the original paper:

Model 0th 1st 2nd 3rd 4th Avg.
o4-mini 97.5 95.0 77.5 87.5 85.0 88.5
gpt-4o 100.0 85.0 82.5 90.0 77.5 87.0
Qwen3-4B+PBM 100.0 85.0 90.0 82.5 65.0 84.5
Qwen3-235B-A22B 100.0 75.0 85.0 85.0 75.0 84.0
gpt-4.1 95.0 85.0 85.0 82.5 70.0 83.5
DeepSeek-V3 100.0 80.0 90.0 70.0 72.5 82.5
Llama3.2-3B+PBM 85.0 80.0 82.5 82.5 75.0 81.0
gpt-4.1-mini 77.5 90.0 70.0 75.0 77.5 78.0
gpt-4o-mini 90.0 75.0 77.5 77.5 55.0 75.0
Qwen3-1.7B+PBM 82.5 65.0 55.0 62.5 57.5 64.5
OLMo-32B 77.5 60.0 60.0 65.0 52.5 63.0
Llama3.2-1B+PBM 46.2 53.8 61.5 84.6 58.3 60.9
Qwen3-0.6B+PBM 80.0 72.5 47.5 52.5 47.5 60.0
gpt-4.1-nano 22.5 32.5 42.5 27.5 30.0 31.0

This comparison highlights the practical impact of DEL-ToM.

  • Small Models Rival Large Models: Qwen3-4B+PBM achieves an average accuracy of 84.5, higher than gpt-4.1 (83.5), DeepSeek-V3 (82.5), and OLMo-32B (63.0). Similarly, Llama3.2-3B+PBM (81.0 average) matches or exceeds gpt-4.1-mini (78.0).
  • These results underscore the effectiveness of PBM in scaling ToM reasoning, demonstrating that with appropriate guidance, smaller, more deployment-friendly models can achieve performance levels previously associated only with much larger or closed-source LLMs.

6.1.3. Scaling Test-Time Compute for ToM Reasoning (Figure 3)

The following figure (Figure 3 from the original paper) shows the accuracy of Vanilla and Weighted Best-of-N decoding strategies on the Hi-ToM dataset using Qwen3-4B. It illustrates how accuracy varies under different budgets $N$ (the number of sampled traces), with panel (a) showing the Vanilla strategy and panel (b) the Weighted strategy.

Figure 3: Accuracy of BoN decoding on Qwen3-4B across different budgets $N$ in the Hi-ToM dataset. Results are shown for (a) Vanilla and (b) Weighted aggregation strategies.

Figure 3 illustrates a critical finding: simply increasing the number of sampled belief traces ($N$) improves ToM performance only when the selection is guided by the PBM.

  • PBM's Role in Scaling: Without PBM guidance, increasing $N$ (e.g., using majority voting or unguided BoN) does not reliably lead to performance gains. This confirms the necessity of process-level supervision for effective inference-time scaling in ToM tasks.
  • Aggregation Strategies: The min and prod aggregation rules for trace-level scores proved to be the most reliable, especially under weighted aggregation. This is because ToM reasoning is sequential, and an error in any intermediate step can invalidate the final conclusion. Min and prod are more sensitive to such errors, assigning lower scores to traces with any incorrect steps. Conversely, avg and last often degrade under weighted aggregation, as they might overlook crucial errors in earlier steps.
  • Failure of Majority Voting: Majority voting consistently fails to improve accuracy. The appendix provides a theoretical analysis: majority voting is vulnerable to vote dilution, where if the probability of a correct trace is low, bad traces can "cluster" on wrong answers and outvote the correct ones. This highlights that ToM requires evaluating intermediate belief states rather than just aggregating final answers.

6.1.4. BoN vs. Beam Search

The experiments indicate that both BoN and beam search achieve comparable accuracy improvements. However, BoN is recommended as the preferred method because:

  • Beam search often fails on smaller or weaker models due to their inability to reliably produce valid intermediate reasoning steps, making PBM evaluation infeasible.
  • BoN generates full belief traces in one shot. The PBM can still be effective even if some steps are noisy, as it evaluates complete traces.
  • BoN can leverage high-throughput backends like vLLM to efficiently produce large candidate sets, making it practical for deployment.

6.1.5. Results on Out-of-Distribution ToM Data (Table 3)

The following are the results from Table 3 of the original paper:

Model False Belief Informed Protagonist No Transfer Present Protagonist Average
Ori +PBM Ori +PBM Ori +PBM Ori +PBM Ori +PBM
Qwen3-8B 83.3 87.5 83.8 85.0 92.8 97.5 79.5 85.0 84.8 88.8
Qwen3-4B 70.2 80.0 86.2 90.0 93.2 95.0 88.0 92.5 84.4 89.4
Qwen3-1.7B 18.2 35.0 15.5 37.5 24.8 60.0 13.8 30.0 18.1 40.6
Qwen3-0.6B 14.5 12.5 23.5 30.0 25.0 35.0 21.0 32.5 21.0 27.5

The results on Kosinski's dataset (an out-of-distribution, hand-written dataset) confirm the generalization capability of DEL-ToM.

  • PBM consistently improves accuracy across all Qwen3 models, for different belief types (False Belief, Informed Protagonist, No Transfer, Present Protagonist).
  • For instance, Qwen3-4B's average accuracy increases from 84.4 to 89.4. Even the smallest Qwen3-0.6B model sees an average gain from 21.0 to 27.5.
  • This demonstrates that the PBM functions as a genuine verifier of ToM reasoning justification, rather than simply overfitting to the synthetic Hi-ToM training distribution. This robustness is crucial for real-world applicability.

6.1.6. Benchmarking the PBM (Table 4)

The following are the results from Table 4 of the original paper:

PBM 0th 1st 2nd 3rd 4th Avg.
Llama3.1-8B 99.2 94.6 89.0 87.0 79.9 90.0
Llama3.2-3B 99.1 91.9 84.9 83.8 73.8 86.7

Evaluating the PBM's standalone reliability on a held-out test set reveals:

  • PBM Accuracy: The Llama3.1-8B-based PBM achieves an average step-level classification accuracy of 90.0%, with higher accuracy for lower belief orders (e.g., 99.2% for 0-th order) and decreasing accuracy for higher orders (e.g., 79.9% for 4-th order).
  • Impact of PBM Base Model Size: A larger base model for the PBM (Llama3.1-8B vs Llama3.2-3B) consistently yields higher PBM accuracy, particularly for higher belief orders. This suggests that stronger models are better at verifying reasoning steps, and that evaluating deeper recursive beliefs is inherently more challenging for the verifier itself.

6.1.7. Impact of PBM Quality on Task Accuracy (Table 5)

The following are the results from Table 5 of the original paper:

Model+PBM 0th 1st 2nd 3rd 4th Avg.
Qwen3-4B + 8B 100.0 85.0 90.0 82.5 65.0 84.5
Qwen3-4B + 3B 100.0 77.5 77.5 72.5 47.5 75.0
Qwen3-1.7B + 8B 82.5 65.0 55.0 62.5 57.5 64.5
Qwen3-1.7B + 3B 82.5 60.0 45.0 47.5 50.0 57.0
Qwen3-0.6B + 8B 80.0 72.5 47.5 52.5 47.5 60.0
Qwen3-0.6B + 3B 77.5 55.0 27.5 35.0 32.5 45.5

This table directly links PBM quality to end-task performance. Replacing the stronger Llama3.1-8B-based PBM (+8B) with the weaker Llama3.2-3B-based PBM (+3B) consistently reduces accuracy across all base models and belief orders. This confirms a clear relationship: a stronger PBM (a better verifier) leads to better inference-time scaling outcomes and thus higher ToM task performance.

6.1.8. Cost Efficiency for API-based Usage (Table 6)

The following are the results from Table 6 of the original paper (API prices in USD per million tokens):

Model Input Cached Input Output Total
gpt-4.1 $2.00 $0.50 $8.00 $10.50
gpt-4.1-mini $0.40 $0.10 $1.60 $2.10
gpt-4o $2.50 $1.25 $10.00 $13.75
gpt-4o-mini $0.15 $0.075 $0.60 $0.825

DEL-ToM offers a cost-efficient solution for API-based LLM usage:

  • While gpt-4.1-mini with DEL-ToM approaches the performance of gpt-4.1, and gpt-4o-mini with DEL-ToM can surpass gpt-4o, the mini models remain significantly cheaper per million tokens.
  • For example, gpt-4o-mini's total cost is $0.825 per million tokens, compared to $13.75 for gpt-4o. Even when sampling N = 4 outputs, the mini models remain more cost-effective.
  • Crucially, the input cost for BoN is paid only once (as all NN samples share the same input prompt), and only the output tokens scale with NN. This makes PBM-guided small-batch inference-time scaling a more economical alternative to using larger, more expensive LLMs.
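A rough back-of-the-envelope calculation illustrates this point; the token counts below (a 2k-token prompt and roughly 800 output tokens per trace) are assumptions, while the prices come from Table 6.

```python
def bon_cost(price_in, price_out, in_tokens, out_tokens_per_trace, n):
    # Input prompt is shared across the N samples; only output tokens scale with N
    return (price_in * in_tokens + price_out * out_tokens_per_trace * n) / 1_000_000

print(bon_cost(0.15, 0.60, 2_000, 800, n=4))    # gpt-4o-mini with BoN, N=4  -> ~$0.0022
print(bon_cost(2.50, 10.00, 2_000, 800, n=1))   # gpt-4o, single trace       -> ~$0.013
```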

6.1.9. Scaling with Model Size (Figure 4)

The following figure (Figure 4 from the original paper) shows the scaling trend of average accuracy before and after applying PBM across different LLMs on Hi-ToM. "Ori" denotes baseline accuracy; "+PBM" denotes accuracy with inference-time scaling.

Figure 4: Scaling trend of average accuracy before and after applying PBM across different LLMs on Hi-ToM (x-axis: model size in billions of parameters; y-axis: average accuracy).

Figure 4 demonstrates how PBM influences the scaling trend of ToM accuracy with model size.

  • Consistent Improvement and Stronger Scaling: PBM consistently improves performance across all LLM sizes. For Llama3.2 models, the accuracy curve becomes steeper when PBM is applied, suggesting that larger models benefit more and generalize better from this inference-time intervention.
  • Unlocking Latent Potential: The observation that Qwen3-8B performs worse than Qwen3-4B in its vanilla setting but becomes the best-performing variant after PBM application is significant. This indicates that PBM not only boosts accuracy but can also unlock higher-order reasoning abilities that are present but not effectively utilized by the base model on its own.

6.2. Qualitative Analysis of PBM Behavior

The paper provides an example of PBM behavior on reasoning traces, highlighting both success and failure modes:

  • Success Case (Irrelevant Statement): When the event "Elizabeth likes the red_box" occurs, the PBM correctly predicts the step as correct, because the statement is irrelevant to the asparagus's location and no beliefs about the asparagus should change. The PBM captures this invariance.

  • Failure Case (Nested, Perspective-Sensitive Update): When the event "Elizabeth moved the asparagus to the green_bucket" occurs while Alexander is not present, the PBM predicts the step as correct, but the ground-truth label marks it incorrect: the correct reasoning is that Alexander's beliefs (as perceived by Charlotte) should not change, since he did not observe the action. The PBM overgeneralizes the belief update based on partial presence.

    This qualitative analysis reveals that while PBM effectively handles simple cases, it can struggle with the nuances of nested, perspective-sensitive updates in multi-agent reasoning. This points to a key challenge in verifying complex ToM processes, especially when LLMs might not correctly parse the conditions for belief change in higher-order scenarios.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces DEL-ToM, an innovative framework that substantially enhances Theory-of-Mind (ToM) reasoning in Large Language Models (LLMs) by leveraging inference-time scaling. By formally grounding ToM tasks in Dynamic Epistemic Logic (DEL), DEL-ToM decomposes complex reasoning into verifiable belief updates. A Process Belief Model (PBM), trained on noise-free supervision generated by a DEL simulator, then scores and selects the most reliable reasoning traces from LLMs during inference. Experiments across diverse LLMs and ToM benchmarks demonstrate consistent performance improvements, showing that DEL-ToM significantly boosts ToM capabilities without requiring LLM retraining. This approach provides a practical, cost-effective, and transparent method for achieving robust ToM reasoning, making ToM-capable LLMs more viable for deployment, particularly in resource-constrained environments.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Dependence on Formal Logic Simulator: The approach heavily relies on accurate belief supervision generated by a formal-logic-based simulator. This supervision might not generalize perfectly to all types of reasoning or the full complexity of real-world language use, which can involve more nuanced social and emotional factors beyond formal logical updates.
  • Beam Search for Weak Models: Beam search is less effective for LLMs with weak instruction-following capabilities because these models struggle to produce valid intermediate reasoning steps, making PBM evaluation impractical. This limits the practical deployment of beam search with smaller or less capable base models.
  • Future Work:
    • Exploring more efficient trace selection methods to further optimize the inference-time scaling process.
    • Extending the DEL-ToM approach to broader domains beyond ToM, suggesting its potential applicability to other forms of complex logical or multi-agent reasoning.

7.3. Personal Insights & Critique

  • Innovation of Noise-Free Supervision: The most striking innovation of DEL-ToM is its use of a DEL simulator to generate noise-free, process-level supervision. This tackles a fundamental challenge in reward modeling for LLMs, where human annotation is expensive and prone to error, and LLM self-supervision can perpetuate model biases. By rooting the supervision in formal logic, the PBM learns from an unassailably correct source, which is a powerful paradigm for building reliable verifiers. This could be highly transferable to other domains requiring verifiable, step-by-step reasoning where a formal simulator can provide ground truth.

  • Practicality for Resource-Constrained Settings: The ability to significantly boost the ToM performance of smaller LLMs without retraining is immensely practical. This makes advanced ToM capabilities accessible for edge devices, applications with strict latency/cost budgets, or scenarios where access to very large LLMs is limited. The cost-efficiency analysis for API-based models further strengthens this argument.

  • Process Reliability vs. Final Answer Accuracy: The paper effectively argues for and demonstrates the importance of process reliability. The comparison with majority voting vividly illustrates that LLMs might arrive at correct answers for the wrong reasons, and verifying the intermediate steps is crucial for genuine reasoning. This aligns with a broader trend in LLM research to move beyond mere outcome-based evaluation to understanding and improving the reasoning process itself.

  • Critique on Generalizability and Complexity: While the DEL simulator provides noise-free labels, DEL itself is a formal abstraction of reality. Real-world ToM can involve highly complex, ambiguous, and emotionally charged social situations that might not be fully captured by current DEL formalizations. The qualitative analysis points to this, where the PBM struggled with a subtle nuance of nested, perspective-sensitive updates (Alexander's beliefs shouldn't change if he's absent). This highlights that while DEL provides a strong scaffold, the LLM's ability to accurately translate natural language scenarios into a DEL-compatible representation (and for the PBM to verify it) is still a challenge, especially as scenarios deviate from the training distribution or involve more agents, recursive beliefs, and complex interaction patterns. The prompts in Appendix A are very structured, suggesting that the LLM is heavily guided in its output format. It's an open question how DEL-ToM would perform with less structured, more open-ended ToM scenarios, or if the initial LLM's ability to even generate plausible DEL-formatted traces without such guidance is limited.

    Inspiration: This work inspires the development of domain-specific, formal-logic-grounded verifiers as a general strategy for enhancing LLM reasoning in fields like code generation, mathematical theorem proving, or scientific discovery, where formal systems can provide reliable supervision. The idea of using formal systems to generate data for reward models rather than relying on human annotations is a powerful and potentially scalable approach.
