
BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning

Published: 10/27/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces the first BaZi-based character simulation dataset and the BaZi-LLM system, which combines symbolic reasoning with large language models to generate dynamic, fine-grained personas. The approach improves accuracy by 30%-62% over mainstream models, highlighting the potential of culturally grounded AI.

Abstract

Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi-LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%-45%, showing the potential of culturally grounded symbolic-LLM integration for realistic character simulation.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning

1.2. Authors

Siyuan Zheng, Pai Liu, Xi Chen, Jizheng Dong, and Sihan Jia. Their affiliations are listed as MirrorAI Co., Ltd., University of Rochester, New York University, Georgia State University, and Anhui Zhu Zi College. The specific role or primary affiliation of each author is not explicitly delineated in the provided header, but MirrorAI Co., Ltd. appears to be a key institutional supporter.

1.3. Journal/Conference

This paper is published as a preprint on arXiv (published at 2025-10-27T13:51:13 UTC). Its references to recent models such as DeepSeek-v3 and GPT-5-mini suggest that it anticipates submission to a highly competitive venue. The arXiv platform is a widely respected open-access archive for preprints in physics, mathematics, computer science, and related fields, allowing early dissemination of research findings.

1.4. Publication Year

2025 (based on the Published at (UTC) timestamp).

1.5. Abstract

The paper addresses the challenge of creating human-like virtual characters for games, storytelling, and virtual reality, noting that existing methods relying on annotated data or handcrafted prompts struggle with scalability and generating realistic, contextually coherent personas. It introduces the first QA dataset specifically designed for BaZi-based persona reasoning, which represents real human experiences (categorized into wealth, health, kinship, career, and relationships) as life-event questions and answers. Furthermore, the paper proposes BaZi-LLM, the first system that integrates symbolic reasoning (derived from BaZi, or Four Pillars of Destiny) with large language models (LLMs) to produce temporally dynamic and fine-grained virtual personas. Experimental results demonstrate that BaZi-LLM achieves a 30.3% to 62.6% accuracy improvement over mainstream LLMs like DeepSeek-v3 and GPT-5-mini. The model's accuracy drops significantly (by 20% to 45%) when incorrect BaZi information is used, highlighting the importance of this culturally grounded symbolic-LLM integration for realistic character simulation.

https://arxiv.org/abs/2510.23337

https://arxiv.org/pdf/2510.23337v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the difficulty in developing realistic, scalable, and contextually coherent virtual characters for immersive applications such as games, storytelling, and virtual reality. Traditional methods, including dialogue trees, finite-state machines, and behavior trees, are labor-intensive to author, limited in scope, and often produce generic personas lacking long-term consistency. While Large Language Models (LLMs) have advanced capabilities in prompt-following and dialogue generation, they still face limitations: persona prompts struggle to capture human complexity within length constraints, and character-specific finetuning is challenging to scale across a diverse range of personas.

This problem is important because the realism and dynamism of virtual characters directly impact the immersion and engagement in digital environments. The existing gaps highlight a need for a more efficient and robust method for persona construction that can generate fine-grained and temporally dynamic characters. The paper's innovative idea is to adopt BaZi (Four Pillars of Destiny)—a traditional Chinese metaphysical system—as a culturally grounded, temporally structured representation for persona construction. By reinterpreting BaZi as a conditional feature-generation model, the authors propose to discretize chronological time into symbolic attributes tied to personal traits and temporal dynamics, enabling fine-grained, dynamic persona generation without making metaphysical claims.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Reinterpretation of BaZi for Persona Simulation: It reinterprets BaZi as a culturally grounded representational system, enabling fine-grained and temporally dynamic character modeling for virtual personas. This moves beyond traditional deterministic applications of BaZi, reframing it as a source of interpretable, probabilistic features.

  2. First BaZi-Based QA Dataset: The paper introduces Celebrity 50, the first Question-Answering (QA) dataset for BaZi-based persona reasoning. This dataset allows for systematic and quantitative evaluation of symbolic reasoning in the context of real-world life events across five key dimensions: wealth, health, kinship, career, and relationships.

  3. First BaZi-Augmented LLM System: It develops the first BaZi-LLM system, which integrates symbolic reasoning derived from BaZi with Large Language Models for culturally informed character simulation. This system processes minimal birth information to generate rich, dynamic persona prompts.

  4. Demonstrated Accuracy Gains: The BaZi-enhanced models achieve consistent and significant accuracy gains (30.3% to 62.6% improvement) over baseline Large Language Models (such as DeepSeek-v3 and GPT-5-mini) on the Celebrity 50 benchmark, validating the effectiveness of the proposed integration. Furthermore, when incorrect BaZi information (shuffled birth dates) is used, the model's accuracy drops by 20% to 45%, demonstrating that the model genuinely leverages BaZi's symbolic features for reasoning, rather than relying on superficial correlations.

    The key findings demonstrate that integrating a culturally grounded symbolic framework like BaZi can significantly enhance the ability of Large Language Models to simulate realistic and temporally dynamic virtual characters, offering a novel approach to persona generation.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following fundamental concepts:

  • Large Language Models (LLMs): These are advanced artificial intelligence models, such as GPT-3, DeepSeek, or Gemini, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language for a wide range of tasks, including answering questions, writing essays, and engaging in conversations. Their core strength lies in their ability to identify patterns and relationships in language, allowing them to perform prompt-following (generating responses based on specific instructions) and dialogue generation. In this paper, LLMs serve as the base models that are augmented with BaZi information.

  • Virtual Characters / Non-Player Characters (NPCs): These are digital entities within virtual environments (like video games, simulations, or virtual reality) that are not controlled by a human player. Their purpose is to interact with players, advance narratives, or populate the virtual world. The goal is often to make them human-like and realistic, exhibiting believable behaviors, personalities, and responses to their environment.

  • Persona Reasoning: In the context of virtual characters, persona reasoning refers to the process by which an AI model simulates the traits, behaviors, motivations, and life trajectories of a character. It involves understanding and predicting how a character with a specific persona (a set of characteristics, experiences, and social roles) would act, feel, or experience life events. This paper aims to improve the fine-grained (detailed) and temporally dynamic (changing over time) aspects of persona reasoning.

  • BaZi (Four Pillars of Destiny): Also known as Eight Characters (八字), BaZi is a traditional Chinese metaphysical concept used for destiny analysis. It translates a person's precise birth time (year, month, day, hour) and place of birth into eight symbolic characters, comprising four pairs of Heavenly Stems (天干) and Earthly Branches (地支). Each of these eight characters is associated with one of the Five Elements (五行: Wood, Fire, Earth, Metal, Water) and Yin/Yang polarity. These elements and their interactions (productive, destructive, reductive, exhaustive cycles) are believed to reveal insights into a person's personality, life path, health, wealth, relationships, and career across different temporal dynamics (life stages, Flowing Years, Months, and Days). The paper reinterprets BaZi not as a deterministic prophecy, but as a structured, symbolic framework for generating conditional features related to an individual's traits and life events, making it amenable to computational modeling.

  • Question Answering (QA): A task in natural language processing where a system receives a question and a context (e.g., a document, a knowledge base) and must provide an accurate answer. In this paper, a QA dataset is constructed where the "context" for answering questions about life events is the BaZi information of a celebrity, and the questions are multiple-choice, requiring the model to reason about the implications of the BaZi chart.

3.2. Previous Works

The paper contextualizes its work by discussing existing approaches and research in three main areas: AI-Driven NPC Development in Games, Interactive Storytelling and Computational Narratives, and Traditional Chinese Metaphysics and Bazi Theory.

3.2.1. AI-Driven NPC Development in Games

  • Traditional Approaches: Historically, Non-Player Characters (NPCs) in games relied on pre-scripted logic such as dialogue trees, finite-state machines (FSMs), and behavior trees.
    • Dialogue Trees: Predetermined conversation paths that limit player choices and NPC responses.
    • Finite-State Machines (FSMs): NPCs transition between a limited number of defined states (e.g., 'idle', 'patrolling', 'attacking') based on specific conditions. This can be restrictive and costly to author for complex behaviors.
    • Behavior Trees: A hierarchical, modular approach to controlling AI behavior, offering more flexibility than FSMs but still requiring extensive manual authoring and struggling with long-horizon consistency (maintaining consistent behavior over extended periods).
  • LLM-Based NPCs: Recent advancements in Large Language Models (e.g., Brown et al., 2020 on GPT-3, OpenAI, 2023 on GPT-4) have enabled more sophisticated NPC behaviors, generative agents (Park et al., 2023), and multi-agent simulations (Wang et al., 2023). These leverage LLMs for prompt-following and dialogue generation.
    • Limitations of Current LLM Approaches: Despite improvements, these still face challenges:
      • Detailed persona prompts often cannot capture the full human complexity due to length constraints (Liu et al., 2023).
      • Character-specific finetuning (e.g., Hu et al., 2022 on LoRA, Dettmers et al., 2023 on QLoRA) is difficult to scale efficiently across a diverse range of unique personas.
  • Related Research: The paper cites various works on AI in games, including:
    • Karaca et al. (2023) on AI-powered procedural content generation for NPC behavior.
    • Zeng (2023) identifying challenges in human-like NPC behavior and categorizing AI techniques.
    • Kopel et al. (2018) presenting experimental results with decision trees, genetic algorithms, and Q-learning for 3D game NPCs.
    • Mehta (2025) examining AI's role in game development and player experience (dynamic difficulty adjustment, adaptive NPC systems like the Nemesis System).
    • Armanto et al. (2024) on evolutionary algorithms for NPC behavior.
    • Filipović (2023) discussing computational linguistics aspects for dialogue systems.
    • Wikipedia Contributors (2025) noting challenges in LLM-driven dialogue generation (consistency, computational complexity).

3.2.2. Interactive Storytelling and Computational Narratives

  • This field focuses on how computational systems can generate, manage, and adapt narratives based on user interaction.
  • Foundational Work: Szilas (2007) established early work on intelligent narrators using rule-based systems to dynamically maintain storylines.
  • Contemporary Research:
    • Begu (2024) compared human-authored and AI-generated stories, finding LLMs struggle with emotional authenticity and psychological complexity.
    • Kybartas and Bidarra (2023) surveyed computational and emergent digital storytelling, analyzing bottom-up emergent narratives vs. top-down drama manager approaches.
    • Gerba (2025) proposed Universal Narrative Models to separate storytelling from structure, addressing the "player dilemma" between narrative coherence and user agency.
    • Cavazza et al. (2003) explored narrative intelligence and cultural transmission in AI systems.
    • Kabashkin et al. (2025) investigated how LLMs reproduce archetypal storytelling patterns, excelling at structured narratives but struggling with psychologically complex and ambiguous stories.
  • The field is moving towards hybrid human-AI collaboration in storytelling.

3.2.3. Traditional Chinese Metaphysics and Bazi Theory

  • The academic study of BaZi is limited but growing.
  • Historical Context: Pankenier (2023) examined court astrology in ancient China, showing its integration into imperial governance.
  • Cultural Exchange: Mak (2017) analyzed the transmission of Western astral science into Chinese contexts, revealing intercultural exchange.
  • Connections to TCM: Academia Contributors (2013) explored the relationship between Chinese astrology and Traditional Chinese Medicine (TCM), linking birth date analysis to health and personality within TCM frameworks.
  • Gap Identified: The paper notes a lack of a comprehensive peer-reviewed analysis of BaZi's epistemological foundations and contemporary applications, especially regarding its predictive validity (e.g., Carlson, 1985; Wyman and Vyse, 2008; Dean, 2025 on astrology's lack of predictive validity). This motivates the paper's empirical approach to BaZi, treating it as a narrative representation rather than a metaphysical claim.

3.3. Technological Evolution

The evolution of character simulation has progressed from highly prescriptive, rule-based systems (like dialogue trees, FSMs, behavior trees) that are costly to author and limited in flexibility, to more dynamic, generative approaches powered by Large Language Models. LLMs have significantly enhanced dialogue generation and prompt-following, enabling more interactive and less rigid NPCs. However, these LLM-based methods still grapple with scaling across diverse personas, maintaining long-horizon consistency, and capturing the nuanced complexity of human behavior due to limitations in prompt length or the cost of finetuning.

This paper's work represents a step in this evolution by introducing a new paradigm: integrating symbolic reasoning derived from a culturally grounded system (BaZi) with LLMs. This approach aims to provide a structured, interpretable, and temporally dynamic way to generate persona features that can overcome some of the scaling and consistency issues of purely data-driven or prompt-based LLM methods. It positions BaZi as a framework to discretize chronological time into symbolic attributes, offering fine-grained and dynamic persona generation.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Symbolic Reasoning Integration: Unlike mainstream LLM-based NPC approaches that primarily rely on textual persona prompts or finetuning on behavioral data, this paper explicitly integrates a symbolic reasoning system (BaZi). This provides a structured, interpretable, and culturally informed layer of feature generation that is often missing in purely LLM-driven character models.

  • Minimal Input, Rich Output: Current LLM methods often require detailed persona prompts or extensive datasets for finetuning. This paper's BaZi-LLM system can generate temporally dynamic and fine-grained personas from minimal inputs: birth date and time, gender, and place of birth. This addresses the scalability issue by reducing the need for voluminous annotated data or complex prompt engineering per character.

  • Temporal Dynamics and Contextual Coherence: While some LLM agents can exhibit temporal consistency over short horizons, maintaining long-horizon consistency and adapting persona traits to temporal dynamics (e.g., different life stages) is a known challenge. BaZi inherently provides a framework for temporal dynamics (Flowing Years, Months, Days) and person-environment interactions. The proposed system leverages this to generate time-sequenced and environment-aware character profiles, which is a key differentiator from static persona labels.

  • Culturally Grounded Framework: The use of BaZi introduces a culturally grounded perspective, offering a unique vocabulary for identity and life-course description that is not typically found in Western-centric AI character simulation methods (which might use concepts like MBTI or astrology in a more superficial way). The paper reinterprets BaZi as a conditional feature-generation model rather than a metaphysical system, making it suitable for computational application.

  • Quantitative Evaluation of Symbolic Reasoning: The creation of the Celebrity 50 QA dataset provides a novel quantitative benchmark for evaluating BaZi-based persona reasoning, addressing a long-standing limitation in the field of BaZi which previously lacked measurable accuracy.

    In essence, the paper differentiates itself by moving beyond solely data-driven or prompt-based LLM methods towards a hybrid approach that combines the generative power of LLMs with the structured, temporally dynamic, and culturally grounded symbolic reasoning of BaZi, leading to more realistic and scalable character simulation.

4. Methodology

4.1. Principles

The core idea behind the proposed BaZi-inspired character simulation framework is to systematically transform an individual's birth information into structured, interpretable prompts that capture both stable personality traits and dynamic temporal states. Instead of treating BaZi as a metaphysical practice, the method reinterprets it as a symbolic rule-mapping process. This process discretizes chronological time into symbolic attributes that are tied to personal traits and temporal dynamics, allowing for fine-grained, dynamic persona generation suitable for Large Language Models (LLMs). The framework aims to provide a more scalable and contextually coherent alternative to methods heavily reliant on annotated data or handcrafted persona prompts.

4.2. Core Methodology In-depth (Layer by Layer)

The BaZi-inspired character simulation framework is organized into four main components, as depicted in Figure 4, which collectively form the BaZi-LLM prompt workflow.

The following figure (Figure 4 from the original paper) shows the system architecture:

Figure 4: Our model is organized into four main components: (1) input layer for birth-related information (birthday, gender, place of birth), (2) BaZi rule analysis, (3) BaZi reasoning, and (4) scenario modules. The diagram shows the BaZi prompt workflow and illustrates how these components interact across multiple dimensions (wealth, career, kinship, health) to generate fine-grained character features.

4.2.1. Input Layer

This is the initial stage where the raw biographical data for an individual is provided to the model.

  • Input Elements: The model requires only three minimal pieces of information:
    • Birth date and time: This includes the year, month, day, and precise hour of birth.

    • Gender: The biological sex of the individual.

    • Place of birth: The geographical location where the individual was born.

      These minimal inputs are crucial because they form the basis for constructing the BaZi chart, which then drives the subsequent persona generation process.

4.2.2. BaZi Rule Analysis

In this stage, the input birth information is translated into the formal structure of a BaZi chart using a rule-based mapping program grounded in BaZi theory.

  • BaZi Chart Construction: The birth year, month, day, and time are encoded into eight symbolic elements. These eight elements are composed of four pairs, each consisting of a Heavenly Stem (天干) and an Earthly Branch (地支). For example, a person born in the year of Jia Zi (甲子) would have Jia as the Heavenly Stem and Zi as the Earthly Branch for their year pillar.
  • Symbolic Element Attributes: Each of these eight symbolic elements is further associated with specific attributes:
    • Personality features: These are derived from the balance and interaction of the Five Elements (Wood, Fire, Earth, Metal, Water) within the chart. BaZi theory posits that the relative strength and interaction of these elements indicate different aspects of a person's character and predispositions. For instance, an abundance of Wood might suggest a nurturing and creative personality, while an excess of Metal could imply decisiveness and rigidity.
    • Daily dynamic states: These are temporal features linked to various life dimensions such as health, career, wealth, and kinship. The Heavenly Stems and Earthly Branches in a BaZi chart are not static; their interactions with the changing Luck Pillars and Flowing Years, Months, and Days (大运, 流年, 流月, 流日 in BaZi terminology) are believed to indicate periods of prosperity, challenge, or specific life events.
  • Output of this stage: A structured BaZi chart with its associated Five Element balance and initial dynamic indicators, ensuring interpretability and temporal grounding.
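As one concrete illustration of the rule-based mapping, the year pillar of a BaZi chart follows the sexagenary (60-year) cycle. The following is a minimal sketch (it ignores the solar-calendar new-year boundary, which a full implementation must handle, and is not the paper's code):

```python
STEMS = ["Jia", "Yi", "Bing", "Ding", "Wu", "Ji", "Geng", "Xin", "Ren", "Gui"]
BRANCHES = ["Zi", "Chou", "Yin", "Mao", "Chen", "Si",
            "Wu", "Wei", "Shen", "You", "Xu", "Hai"]

def year_pillar(year: int) -> tuple[str, str]:
    """Heavenly Stem and Earthly Branch of a Gregorian year.

    Simplified: assumes the birth falls after the solar new year
    (Lichun); a real mapper must adjust for births before it.
    """
    offset = year - 4  # 4 CE was a Jia Zi year in the sexagenary cycle
    return STEMS[offset % 10], BRANCHES[offset % 12]

print(year_pillar(1984))  # ('Jia', 'Zi')
```

The month, day, and hour pillars are derived by analogous (but more involved) rules keyed to solar terms and the day count, so the full chart is a deterministic function of the birth timestamp.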

4.2.3. BaZi Reasoning (Interpretation via Classical Logic)

While the BaZi chart from the previous stage provides raw symbolic features, it requires interpretation to construct a meaningful persona. This stage involves a coarse-grained interpretation mechanism inspired by classical BaZi analysis.

  • Key Interpretive Mechanisms:
    • Ten Gods (十神): These are symbolic roles that represent the relationships between the Day Master (日主, the Heavenly Stem of the day pillar, representing the self) and the other Heavenly Stems and Earthly Branches in the chart. Each of the Ten Gods (e.g., Direct Wealth, Indirect Wealth, Direct Officer, Seven Killings, Food God, Hurting Officer, Friend, Rob Wealth, Direct Seal, Indirect Seal) represents specific personality traits, relationship dynamics, or career tendencies. For example, a strong Direct Officer (正官) might indicate a person who is disciplined and status-conscious, while a prominent Food God (食神) could suggest creativity and enjoyment of life.
    • ShenSha (神煞): These are auxiliary symbolic markers or Divinity Stars associated with specific life tendencies or external influences. ShenSha can indicate potential blessings, misfortunes, talents, or challenges. Examples include Nobleman (贵人), Academic Star (文昌), Peach Blossom (桃花), or Solitary Star (寡宿).
    • Pattern Structures (格局): These are higher-level symbolic groupings that reflect broader personality orientations or life path archetypes. Examples might include Seven Killings Pattern (七杀格) indicating a challenging but powerful life, or Direct Wealth Pattern (正财格) for someone stable and financially conservative.
  • Process and Output: This interpretive process follows the logic of BaZi divination but produces conditional interpretive features rather than deterministic outcomes. These features form the foundation for downstream scenario reasoning.

4.2.4. Scenario-Oriented Analysis

To achieve fine-grained and adaptive persona modeling, the BaZi-derived interpretive features are coupled with scenario-specific modules.

  • Domain Contextualization: These modules contextualize the symbolic features into five primary life domains:
    • Health: How BaZi elements influence physical well-being and potential ailments.
    • Career: BaZi indicators related to professional success, type of work, and ambition.
    • Wealth: BaZi aspects pertaining to financial fortune, earning capacity, and saving habits.
    • Relationship: BaZi insights into romantic relationships, marriage, and social interactions.
    • Kinship: BaZi interpretations concerning family relationships, parents, siblings, and children.
  • Adaptive Persona Modeling: This stage recognizes that a single BaZi feature (e.g., indicating career ambition) can manifest differently depending on the external scenario. For example, high career ambition might lead to strategic alliances in a financial opportunity scenario but to direct confrontation in an interpersonal conflict scenario. This contextual adaptation ensures the persona remains consistent with its core traits but responds realistically to diverse environments.

4.2.5. Dynamic Persona Prompt Generation

This is the final stage where all the processed information is consolidated into actionable prompts for LLMs.

  • Consolidation: The interpreted features from the BaZi Reasoning and Scenario-Oriented Analysis stages are synthesized.
  • Dynamic Prompts: The output consists of dynamic prompts designed to simulate individual behavior and responses across time. Crucially, these prompts incorporate both:
    • Long-term stable traits: Core personality characteristics derived from the BaZi chart that remain consistent over an individual's life.
    • Short-term temporal variations: Changes or specific events indicated by the interaction of the BaZi chart with Flowing Years, Months, or Days, and how these manifest in specific scenarios.
  • Result: This leads to a time-sequenced and environment-aware character profile, which serves as the basis for generating lifelike and context-sensitive character simulations by LLMs.

4.2.6. Methodological Innovations

The authors highlight three key innovations:

  1. Minimal Input, Rich Output: The model generates temporally dynamic and domain-specific persona prompts from only birth date/time, gender, and place of birth.
  2. Symbolic-Logical Integration: It combines rule-based BaZi mapping with interpretive logic (Ten Gods, ShenSha, Pattern Structures) to create structured, explicitly interpretable symbolic features.
  3. Scenario Adaptivity: Persona representations are not fixed but dynamically adapt to health, career, wealth, relationship, and kinship contexts, resulting in vivid, time-evolving character simulation.

5. Experimental Setup

5.1. Datasets

The primary dataset used for evaluation is Celebrity 50.

  • Name: Celebrity 50

  • Purpose: Designed to evaluate Large Language Models (LLMs)' ability to predict key life events based on BaZi principles. It serves as the first QA dataset for BaZi-based persona reasoning.

  • Source and Characteristics:

    • Individuals: Contains information about 50 real individuals from diverse global backgrounds. The selection was restricted to individuals born around 1940 to ensure a sufficiently rich amount of biographical data across their adult lives.
    • Data Collection & Validation: Biographical records were collected and validated through astro.com (an astrology data archive; although the paper reinterprets BaZi non-metaphysically, astro.com provides precisely documented birth times).
    • Selection Criteria:
      1. Adults with sufficiently rich life experiences.
      2. Exclusion of idols (likely for privacy and data richness concerns).
      3. All subjects born in the Northern Hemisphere.
    • Question-Answer Pairs: Each persona is associated with 45 multiple-choice question-answer pairs. These questions span five key life dimensions: wealth, health, kinship, career, and relationships. This formulation reduces evaluation complexity while allowing reasoning over significant, discrete life nodes.
  • Statistics:

    • Individuals: 50

    • Countries: From 29 different countries.

    • Total Q&A Pairs: 488 (average of approximately 9.76 questions per person).

    • Gender Distribution: 37 males and 13 females, indicating a gender imbalance.

      The following figure (Figure 3 from the original paper) shows the distribution of birthplace and question counts across different countries:

      Figure 3: Question and Birthplace Counts Across Countries. The bar chart shows the distribution of birthplaces and question counts by country: orange bars represent question counts and blue bars birthplace counts. The United States, United Kingdom, and Russia account for the most questions, while most countries contribute few birthplaces, giving an uneven distribution.

  • Construction Process:

    1. Birth Time Acquisition: Precise birth time data is acquired.
    2. Biographical Narrative Retrieval: The Qwen API is prompted to retrieve biographical narratives across the five dimensions (wealth, health, kinship, career, relationships). Qwen leverages its web search capabilities and internal knowledge base.
    3. Multiple-Choice Question Generation: The same LLM (Qwen API) generates multiple-choice questions from the compiled biographical information.
    4. JSON Export: A final script extracts and synthesizes these questions with the birth data into a JSON format.
    5. Cleaning and Quality Assurance:
      • Initial Filtering: A rating system was established to eliminate questions based on three criteria:
        • Questions containing real proper names (people, organizations, etc.).
        • Questions demanding overly specific numerical details (e.g., exact wealth amounts) that are not reasonably predictable by BaZi analysis.
        • Questions that exceed the reasonable predictive capabilities of traditional BaZi analysis.
      • Refinement: Unsatisfactory questions were grouped and iteratively refined by the LLM itself through prompt modifications. Discarded questions were replaced by new ones.
      • Manual Verification: All remaining questions underwent manual verification to ensure rigor and compliance with guidelines.
    • Annotation Process: Comprehensive guidelines ensured that all generated questions were factually accurate and strictly aligned with one of the five predefined life dimensions based on the sourced biographical material.
  • Data Sample: The model's input for a given task includes the individual's birth time, gender, and place of birth, along with a multiple-choice question and candidate answers. The goal is to select the correct answer.

The following figure (Figure 2 from the original paper) shows sample information input to a Large Language Model:

Figure 2: Sample information input to the LLM. The schematic shows the example input, including birth time, gender, place of birth, and a multiple-choice question about the person's likely career.

  • Why these datasets were chosen: Celebrity 50 was chosen because it allows for QA-based evaluation over critical life events, which simplifies verification compared to full life-course narratives. The focus on real individuals born around 1940 ensures a rich historical context for biographical events. The multiple-choice format enables quantitative evaluation, a significant improvement over previous BaZi reasoning which lacked measurable accuracy.

5.2. Evaluation Metrics

The primary evaluation metric used in this paper is Accuracy.

  • Accuracy:
    1. Conceptual Definition: Accuracy is a fundamental metric in classification tasks that measures the proportion of correctly predicted instances out of the total number of instances. In the context of a multiple-choice Question Answering (QA) benchmark, it quantifies how often the model selects the correct answer choice. A higher accuracy indicates a better ability of the model to correctly reason about and predict life events based on the provided information.
    2. Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    3. Symbol Explanation:
      • Number of Correct Predictions: This represents the count of instances (questions) where the model's output (chosen answer) exactly matches the ground truth correct answer.
      • Total Number of Predictions: This is the total number of questions or samples evaluated in the dataset.
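As a minimal sketch, the metric can be computed as follows. The `predictions` and `gold` lists are illustrative; the paper does not release its evaluation harness:

```python
def accuracy(predictions, gold):
    """Fraction of questions where the chosen option matches the ground truth."""
    if len(predictions) != len(gold):
        raise ValueError("prediction and gold lists must be the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Example: 3 of 4 answers match the ground truth.
print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # -> 0.75
```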

5.3. Baselines

The paper compares its proposed method against several state-of-the-art Large Language Models (LLMs) under different experimental settings.

  • Mainstream LLM Backbones:

    • DeepSeek-v3: A large language model from DeepSeek AI.

    • Gemini-2.5-flash: A model from Google's Gemini family, designed for speed and efficiency.

    • GPT-5-mini: A compact, cost-efficient member of OpenAI's GPT-5 model family, included to compare against the current cutting edge.

      These models are considered representative state-of-the-art LLMs known for their strong prompt-following, dialogue generation, and reasoning capabilities.

  • Experimental Settings for Comparison:

    1. Vanilla LLM w/ Bazi (Baseline): In this setting, the standard LLMs are provided with the BaZi-derived features as input, but without any additional BaZi reasoning modules integrated into their architecture. This tests the LLMs' inherent ability to interpret BaZi information implicitly, purely through their pre-trained knowledge.
    2. Vanilla LLM w/ Bazi Rule Knowledge: Here, in addition to receiving the BaZi features, the LLMs are augmented with explicit symbolic knowledge rules related to BaZi. This evaluates if providing explicit rule-based knowledge improves the LLMs' performance.
    3. Our Model: This refers to the proposed multi-agent architecture that specifically integrates symbolic reasoning derived from BaZi with LLM inference for BaZi-inspired character simulation. This is the method being evaluated for its overall effectiveness.
  • Control Experiment:

    • Shuffled Birthday Control: To validate the importance of genuine birth-date grounding and BaZi reasoning, a control condition was introduced. In this setup, each subject's true birth date was replaced with another person's date (randomly shuffled), while all other information (questions, candidate answers) remained constant. The expectation is that if BaZi reasoning is meaningful, performance should significantly deteriorate when the correct mapping between an individual's biography and their true BaZi chart is broken. This helps to confirm that the model is genuinely leveraging BaZi information rather than statistical correlations or other cues.
  • External Benchmark:

    • The 15th Global Fortune-Teller Championship 2024: The model was also evaluated on a question set from this external competition organized by the Hong Kong Junior Feng Shui Masters Association. This provides a real-world, albeit specialized, benchmark for BaZi prediction capabilities.
  • Implementation Details:

    • The model uses DeepSeek-R1 (Guo et al., 2025) for generating BaZi knowledge and Doubao-1.5-ThinkingPro (ByteDance, 2025) for performing reasoning.
    • All evaluations were conducted on the Celebrity 50 dataset.
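The shuffled-birthday control described above can be sketched as a derangement-style permutation, so that no subject keeps their own birth date. This is an assumed implementation; the paper does not specify the exact shuffling procedure:

```python
import random

def shuffle_birthdays(birthdays, seed=0):
    """Return a permutation of birth dates in which no subject keeps their own.

    Rejection-samples random permutations until none has a fixed point.
    (Assumes at least two subjects; with one subject no derangement exists.)
    """
    rng = random.Random(seed)
    idx = list(range(len(birthdays)))
    while True:
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):  # reject any fixed point
            return [birthdays[j] for j in idx]

dates = ["1940-03-01", "1941-07-15", "1939-11-30", "1942-05-09"]
shuffled = shuffle_birthdays(dates)
assert all(a != b for a, b in zip(dates, shuffled))  # every date reassigned
```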

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the significant effectiveness of the proposed BaZi-augmented LLM system in persona reasoning and life-event prediction. The analysis focuses on comparing our model against vanilla LLMs and rule-augmented LLMs, and crucially, on validating the importance of BaZi information through shuffled birthday controls.

6.1.1. Comparison with Baseline LLMs

The following are the results from Table 1 of the original paper:

Setting Model Acc. (%)
Vanilla LLM w/ Bazi (Baseline) DeepSeek-v3 39.3
Gemini-2.5-flash 42.2
GPT-5-mini 34.0
Baseline w/ Bazi Rule Knowledge DeepSeek-v3 35.9 (↓8.7%)
Gemini-2.5-flash 42.4 (↓4.1%)
GPT-5-mini 36.9 (↑8.5%)
Our Model DeepSeek-v3 51.2 (↑30.3%)
Gemini-2.5-flash 47.1 (↑6.6%)
GPT-5-mini 55.3 (↑62.6%)

As shown in Table 1, our model (the BaZi-LLM system) consistently outperforms all baseline LLMs across various backbones (DeepSeek-v3, Gemini-2.5-flash, GPT-5-mini).

  • For DeepSeek-v3, our model achieves 51.2% accuracy, a 30.3% relative improvement over the Vanilla LLM w/ Bazi baseline (39.3%).

  • For Gemini-2.5-flash, our model reaches 47.1%, a 6.6% relative gain over its baseline (42.2%).

  • Most notably, for GPT-5-mini, our model achieves 55.3% accuracy, an impressive 62.6% relative improvement over its baseline (34.0%).

    These results strongly validate the effectiveness of integrating symbolic BaZi reasoning into LLM-based character simulation. The gains are substantial, particularly for GPT-5-mini, suggesting that the symbolic framework helps LLMs that might otherwise struggle with this specific task.

Interestingly, providing BaZi Rule Knowledge to Vanilla LLMs (the "Baseline w/ Bazi Rule Knowledge" setting) did not consistently improve performance and even led to slight drops for DeepSeek-v3 and Gemini-2.5-flash. This implies that simply feeding explicit rules to LLMs might not be as effective as having a dedicated multi-agent architecture that systematically processes and integrates these rules, as our model does. GPT-5-mini showed an 8.5% increase in this setting, indicating some models might benefit more from direct rule injection, but it's still far less than the gains achieved by our model.

6.1.2. Impact of Shuffled Birthdays

The following are the results from Table 2 of the original paper:

Setting Model Acc. (%)
Real Birthdays DeepSeek-v3 51.2
Gemini-2.5-flash 47.1
GPT-5-mini 55.3
Shuffled Birthdays DeepSeek-v3 40.6 (↓20.7%)
Gemini-2.5-flash 35.5 (↓24.6%)
GPT-5-mini 30.0 (↓45.7%)

Table 2 presents a crucial validation of the role of BaZi as a meaningful symbolic system. When our model is tested with shuffled birthdays (breaking the genuine temporal alignment between the biography and the BaZi chart), the performance drops significantly across all backbones:

  • DeepSeek-v3: 51.2% (Real) to 40.6% (Shuffled), a 20.7% decrease.

  • Gemini-2.5-flash: 47.1% (Real) to 35.5% (Shuffled), a 24.6% decrease.

  • GPT-5-mini: 55.3% (Real) to 30.0% (Shuffled), a substantial 45.7% decrease.

    This drastic drop in accuracy when the BaZi information is incorrectly associated confirms that our model is not merely relying on superficial cues or general biographical knowledge, but is actively leveraging the specific BaZi features derived from the correct birth data for its reasoning. This directly addresses the question of whether BaZi-derived features provide incremental information beyond raw birth dates, showing that they do.
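Note that the percentages in Table 2 are relative rather than absolute changes, i.e. (real - shuffled) / real. A quick arithmetic check reproduces the reported drops to within rounding:

```python
def relative_drop(real, shuffled):
    """Relative accuracy drop in percent: (real - shuffled) / real * 100."""
    return 100 * (real - shuffled) / real

# (model, real acc., shuffled acc., reported drop %) from Table 2
rows = [
    ("DeepSeek-v3",      51.2, 40.6, 20.7),
    ("Gemini-2.5-flash", 47.1, 35.5, 24.6),
    ("GPT-5-mini",       55.3, 30.0, 45.7),
]
for model, real, shuffled, reported in rows:
    assert abs(relative_drop(real, shuffled) - reported) < 0.1, model
```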

6.1.3. Vanilla LLMs with Shuffled Birthdays

The following are the results from Table 3 of the original paper:

Setting Model Acc. (%)
Vanilla LLM + Bazi DeepSeek-v3 39.3
Gemini-2.5-flash 42.2
GPT-5-mini 34.0
Vanilla LLM + Bazi + Shuffled Birthday DeepSeek-v3 42.5 (↑8.1%)
Gemini-2.5-flash 42.1 (↓0.2%)
GPT-5-mini 34.8 (↑2.4%)

Table 3 provides a comparative view of Vanilla LLMs performance with real vs. shuffled BaZi features. In contrast to our model (Table 2), the Vanilla LLMs show relatively stable, and in some cases even slightly improved, performance when birthdays are shuffled.

  • DeepSeek-v3: 39.3% (Real) to 42.5% (Shuffled), an 8.1% increase.

  • Gemini-2.5-flash: 42.2% (Real) to 42.1% (Shuffled), a negligible 0.2% decrease.

  • GPT-5-mini: 34.0% (Real) to 34.8% (Shuffled), a 2.4% increase.

    This outcome suggests that Vanilla LLMs contain only limited implicit knowledge of BaZi and, without the explicit BaZi reasoning framework, they do not strongly rely on the BaZi features for their predictions. Their reasoning might be more influenced by general patterns in the biographical data or common sense rather than the specific symbolic information. This contrast further highlights that our model is genuinely leveraging BaZi theory and its structured interpretation for persona fitting, rather than operating on surface-level correlations.

6.1.4. External Benchmark Performance

The model, which uses DeepSeek-R1 for BaZi knowledge generation and Doubao-1.5-ThinkingPro for reasoning, achieved an accuracy of 60% on the question set from The 15th Global Fortune-Teller Championship 2024. This performance matched the third-place competitor in that year's competition. This indicates that the model's capabilities extend to real-world BaZi prediction tasks and suggests potential for further improvement with more powerful underlying reasoning engines.

6.1.5. Case Study: Differences in BaZi Theory Interpretation (sergey_brin_P042)

The case study involving Sergey Brin (sergey_brin_P042) reveals qualitative differences in how DeepSeek-V3, GPT-5-mini, and Gemini-2.5-flash interpret BaZi theory within the custom BaZi analysis framework.

  • BaZi Theory Interpretation:
    • DeepSeek-V3 and Gemini-2.5-flash classified the chart as a Shangguan Structural Pattern (伤官格).
    • GPT-5-mini identified it as a Cong Er Structural Pattern (从儿格).
    • This divergence led to opposite conclusions regarding favorable/unfavorable elements and future luck cycles. The authors note that while flexibility exists in pattern classification, such decisions typically rely on professional experience. GPT-5-mini adopted a more flexible and bold interpretative logic, while DeepSeek-V3 and Gemini-2.5-flash exhibited a more conservative, rule-bound approach.
  • Scene Mapping Process:
    • DeepSeek-V3 followed a rigid feature-to-prediction pattern, susceptible to local information bias.
    • Gemini-2.5-flash integrated multiple dimensions for holistic analysis.
    • GPT-5-mini showed behavior similar to a human consultant, adapting reasoning to user context and exploring alternative scenarios dynamically.
  • Output Expression:
    • DeepSeek-V3 used absolute statements for real-world manifestations.
    • Gemini-2.5-flash and GPT-5-mini used more probabilistic language ("possibly", "likely") and presented multiple potential outcomes, resembling a human consultant more closely.
  • Commonalities: When provided with identical upstream results, all three models showed convergent reasoning paths without severe factual or logical errors and demonstrated a comparable level of baseline BaZi knowledge. However, none exhibited strong reflection or self-correction mechanisms.
  • Overall Assessment: Gemini-2.5-flash provided the most stable and conservative theoretical reasoning. GPT-5-mini excelled in the final output stage, producing explanations most similar to a human consultant due to its aggressive and exploratory interpretations. DeepSeek-V3 remained rigid and deterministic.

6.2. Data Presentation (Tables)

Tables 1, 2, and 3 are reproduced in full in Sections 6.1.1, 6.1.2, and 6.1.3 above.

6.3. Ablation Studies / Parameter Analysis

While the paper doesn't present traditional ablation studies breaking down the contribution of each specific component of our model (e.g., BaZi Rule Analysis, Interpretation via Classical Logic, Scenario-Oriented Analysis), the Shuffled Birthday Control (Table 2 and Table 3) serves as a critical pseudo-ablation to verify the effectiveness of the BaZi grounding itself.

  • Shuffled Birthday as a Control: By comparing our model's performance with real birthdays versus shuffled birthdays, the authors effectively "ablate" the correct BaZi-biography alignment. The significant drop in accuracy (20%-45%) directly demonstrates that the BaZi-derived features are indeed being utilized and are crucial for the model's performance. If the BaZi component were ineffective, shuffling the birthdays would not cause such a dramatic performance decrease.

  • Contrast with Vanilla LLMs: The further comparison in Table 3 shows that Vanilla LLMs (without our model's integrated BaZi reasoning framework) are largely unaffected by shuffled birthdays. This implies that Vanilla LLMs do not inherently leverage BaZi features in a meaningful way, even when provided, highlighting the necessity of our model's explicit symbolic integration. This comparison acts as an indirect ablation, showing that the specific integration mechanism (not just the presence of BaZi data) is what drives the performance gains.

    This control experiment effectively verifies that the BaZi-based symbolic features are not mere noise but provide substantial, incremental information that our model successfully exploits for persona reasoning.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces a novel and effective approach to character simulation by integrating traditional Chinese BaZi (Four Pillars of Destiny) symbolic reasoning with Large Language Models (LLMs). The authors successfully reinterpret BaZi as a culturally grounded, conditional feature-generation model for persona construction, enabling the creation of fine-grained and temporally dynamic virtual characters. A key contribution is the development of Celebrity 50, the first QA dataset specifically designed for BaZi-based persona reasoning, allowing for quantitative evaluation of this complex domain. The proposed BaZi-LLM system demonstrates significant accuracy improvements, ranging from 30.3% to 62.6% over state-of-the-art LLM baselines like DeepSeek-v3 and GPT-5-mini. Crucially, the model's performance drastically declines (20%-45%) when incorrect BaZi information is used, unequivocally validating that the system genuinely leverages the symbolic birth information for its persona reasoning. This work underscores the potential of culturally grounded symbolic-LLM integration for generating more realistic and coherent virtual characters.

7.2. Limitations & Future Work

The authors acknowledge several limitations of their current work:

  • Dataset Generation Bias: Many narratives and questions in the Celebrity 50 dataset are LLM-generated (by Qwen), which introduces potential issues such as hallucination, bias, and factual errors.

  • Dataset Size and Imbalance: The dataset is relatively small (50 individuals, 488 questions) and exhibits gender imbalance (37 male, 13 female). This limits the generalizability of the findings.

  • Birth Data Accuracy: While birth details are sourced from astro.com, there may still be inaccuracies in the precise birth times.

  • Cultural and Temporal Biases: The focus on mostly Western figures born around 1940 introduces temporal and cultural biases, which may not generalize across different eras or diverse cultural contexts.

    Based on these limitations, the authors propose future directions for improvement:

  • Model Enhancements:

    • Domain-Specific Knowledge Bases: Incorporating domain-specific knowledge bases or training for particular schools of BaZi thought (e.g., specialized pattern classifications) could further refine the model's interpretative capabilities.
    • Agent-Based Mechanisms: Implementing agent-based mechanisms that can dynamically select among intermediate outputs, reflect on user feedback, and adapt reasoning pathways accordingly would improve model robustness and flexibility.
  • Dataset Expansion: Collecting more diverse data samples from different countries with more precise birth times is crucial to improve generalizability and reduce existing biases.

7.3. Personal Insights & Critique

This paper presents a fascinating and innovative approach that bridges traditional symbolic systems with modern Large Language Models.

Insights:

  • Novel Integration of Symbolic AI and LLMs: The most compelling insight is the successful integration of a culturally grounded symbolic system like BaZi with LLMs. This goes beyond merely prompting LLMs with external knowledge; it involves a structured, multi-stage process of symbolic rule mapping, reasoning, and scenario adaptation. This hybrid approach demonstrates a powerful path forward for AI where interpretable, structured knowledge can guide the generative capabilities of LLMs.
  • Reinterpretation of BaZi: The reinterpretation of BaZi as a conditional feature-generation model rather than a metaphysical claim is a clever way to leverage its rich symbolic structure for persona simulation without getting entangled in pseudoscientific debates. It extracts the narrative and temporal patterning potential from BaZi, making it computationally useful.
  • Validation through Shuffled Data: The shuffled birthday control is an excellent experimental design choice. The significant drop in our model's accuracy with shuffled data provides strong empirical evidence that the BaZi component is genuinely contributing to the persona reasoning, and not just acting as a spurious correlation. This is a crucial validation point for any system relying on such a specialized knowledge base.
  • Addressing LLM Limitations: The paper effectively targets known limitations of LLMs in character simulation, such as scaling diverse personas and maintaining long-horizon consistency. BaZi's inherent temporal dynamics and pattern structures offer a structured solution to these challenges.

Critique and Areas for Improvement:

  • Dataset Quality and Bias: The reliance on LLM-generated questions and biographical narratives for the Celebrity 50 dataset is a significant concern. LLMs are prone to hallucination and biases present in their training data. For a benchmark intended to validate a symbolic reasoning system, potential inaccuracies or implicit biases introduced by Qwen in the ground truth could affect the reliability of the accuracy scores. Future work should prioritize human-curated and verified Q&A pairs, or at least a more rigorous multi-stage human review process.
  • Generalizability of BaZi: While BaZi is a culturally grounded system, its application to predominantly Western figures born around 1940 raises questions about its cross-cultural generalizability and applicability to modern contexts. The gender imbalance in the dataset also limits the conclusions that can be drawn about its performance for diverse demographics. Expanding the dataset to include individuals from various cultural backgrounds, time periods, and a balanced gender distribution is essential.
  • Interpretability of LLM Outputs: The case study highlights the divergence in BaZi theory interpretation among LLMs (e.g., Shangguan vs. Cong Er patterns). While GPT-5-mini is praised for human-like output, such fundamental disagreements in the symbolic reasoning stage can lead to vastly different persona profiles. This points to the inherent subjectivity or interpretative flexibility within BaZi itself, which might be hard for an AI to navigate consistently without a definitive "oracle" for BaZi interpretation. Further research could explore how to integrate human expert consensus or allow for "multi-interpretative" persona aspects.
  • Model Versioning and Reproducibility: The paper refers to GPT-5-mini without reporting an exact API snapshot or evaluation date. Because hosted models can change behavior across releases, pinning model versions and access dates would enhance rigor and reproducibility.
  • Formalization of BaZi Rules: The paper mentions BaZi rule analysis and interpretation via classical logic, but a more detailed exposition of how these rules are formalized and encoded for the AI (e.g., as logical predicates, knowledge graphs, or specific prompt structures) would be beneficial for reproducibility and deeper understanding.

Potential Applications: The methodology could be applied to other symbolic systems or folk traditions that describe human personality or life trajectories (e.g., Western astrology, numerology, other forms of divination), transforming them into computable feature sets for LLM-driven character generation. This could greatly enrich the diversity and cultural specificity of virtual characters in games, storytelling, and educational simulations. Furthermore, the approach of using minimal inputs to generate rich, dynamic personas has significant implications for scaling character creation in large virtual worlds.
