BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning
TL;DR Summary
This work introduces the first BaZi-based character simulation dataset and the BaZi-LLM system, which combines symbolic reasoning with large language models to generate dynamic, fine-grained personas. The approach improves accuracy by 30%-62% over mainstream models, highlighting the value of culturally grounded AI.
Abstract
Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi-LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%-45%, showing the potential of culturally grounded symbolic-LLM integration for realistic character simulation.
In-depth Reading
1. Bibliographic Information
1.1. Title
BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning
1.2. Authors
Siyuan Zheng, Pai Liu, Xi Chen, Jizheng Dong, Sihan Jia. Their affiliations are listed as MirrorAI Co., Ltd., University of Rochester, New York University, Georgia State University, and Anhui Zhu Zi College. The specific roles or primary affiliations of each author are not explicitly delineated in the provided header, but MirrorAI Co., Ltd. appears to be a key institutional supporter.
1.3. Journal/Conference
This paper is published as a preprint on arXiv (published 2025-10-27 13:51 UTC). Its comparisons against recent models such as DeepSeek-v3 and GPT-5-mini suggest it targets a highly competitive venue. arXiv is a widely respected open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, and related fields, allowing early dissemination of research findings.
1.4. Publication Year
2025 (based on the arXiv publication timestamp).
1.5. Abstract
The paper addresses the challenge of creating human-like virtual characters for games, storytelling, and virtual reality, noting that existing methods relying on annotated data or handcrafted prompts struggle with scalability and generating realistic, contextually coherent personas. It introduces the first QA dataset specifically designed for BaZi-based persona reasoning, which represents real human experiences (categorized into wealth, health, kinship, career, and relationships) as life-event questions and answers. Furthermore, the paper proposes BaZi-LLM, the first system that integrates symbolic reasoning (derived from BaZi, or Four Pillars of Destiny) with large language models (LLMs) to produce temporally dynamic and fine-grained virtual personas. Experimental results demonstrate that BaZi-LLM achieves a 30.3% to 62.6% accuracy improvement over mainstream LLMs like DeepSeek-v3 and GPT-5-mini. The model's accuracy drops significantly (by 20% to 45%) when incorrect BaZi information is used, highlighting the importance of this culturally grounded symbolic-LLM integration for realistic character simulation.
1.6. Original Source Link
https://arxiv.org/abs/2510.23337
1.7. PDF Link
https://arxiv.org/pdf/2510.23337v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the difficulty in developing realistic, scalable, and contextually coherent virtual characters for immersive applications such as games, storytelling, and virtual reality. Traditional methods, including dialogue trees, finite-state machines, and behavior trees, are labor-intensive to author, limited in scope, and often produce generic personas lacking long-term consistency. While Large Language Models (LLMs) have advanced capabilities in prompt-following and dialogue generation, they still face limitations: persona prompts struggle to capture human complexity within length constraints, and character-specific finetuning is challenging to scale across a diverse range of personas.
This problem is important because the realism and dynamism of virtual characters directly impact the immersion and engagement in digital environments. The existing gaps highlight a need for a more efficient and robust method for persona construction that can generate fine-grained and temporally dynamic characters. The paper's innovative idea is to adopt BaZi (Four Pillars of Destiny)—a traditional Chinese metaphysical system—as a culturally grounded, temporally structured representation for persona construction. By reinterpreting BaZi as a conditional feature-generation model, the authors propose to discretize chronological time into symbolic attributes tied to personal traits and temporal dynamics, enabling fine-grained, dynamic persona generation without making metaphysical claims.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Reinterpretation of BaZi for Persona Simulation: It reinterprets BaZi as a culturally grounded representational system, enabling fine-grained and temporally dynamic character modeling for virtual personas. This moves beyond traditional deterministic applications of BaZi, reframing it as a source of interpretable, probabilistic features.
- First BaZi-Based QA Dataset: The paper introduces Celebrity 50, the first Question-Answering (QA) dataset for BaZi-based persona reasoning. This dataset allows systematic and quantitative evaluation of symbolic reasoning over real-world life events across five key dimensions: wealth, health, kinship, career, and relationships.
- First BaZi-Augmented LLM System: It develops the first BaZi-LLM system, which integrates symbolic reasoning derived from BaZi with Large Language Models for culturally informed character simulation. The system processes minimal birth information to generate rich, dynamic persona prompts.
- Demonstrated Accuracy Gains: The BaZi-enhanced models achieve consistent and significant accuracy gains (a 30.3% to 62.6% improvement) over baseline Large Language Models (such as DeepSeek-v3 and GPT-5-mini) on the Celebrity 50 benchmark, validating the effectiveness of the proposed integration. Furthermore, when incorrect BaZi information (shuffled birth dates) is used, the model's accuracy drops by 20% to 45%, demonstrating that the model genuinely leverages BaZi's symbolic features for reasoning rather than relying on superficial correlations.

The key findings demonstrate that integrating a culturally grounded symbolic framework like BaZi can significantly enhance the ability of Large Language Models to simulate realistic and temporally dynamic virtual characters, offering a novel approach to persona generation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following fundamental concepts:
- Large Language Models (LLMs): Advanced artificial intelligence models, such as GPT-3, DeepSeek, or Gemini, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language for a wide range of tasks, including answering questions, writing essays, and engaging in conversations. Their core strength lies in identifying patterns and relationships in language, allowing them to perform prompt-following (generating responses based on specific instructions) and dialogue generation. In this paper, LLMs serve as the base models that are augmented with BaZi information.
- Virtual Characters / Non-Player Characters (NPCs): Digital entities within virtual environments (video games, simulations, or virtual reality) that are not controlled by a human player. Their purpose is to interact with players, advance narratives, or populate the virtual world. The goal is often to make them human-like and realistic, exhibiting believable behaviors, personalities, and responses to their environment.
- Persona Reasoning: In the context of virtual characters, persona reasoning refers to the process by which an AI model simulates the traits, behaviors, motivations, and life trajectories of a character. It involves understanding and predicting how a character with a specific persona (a set of characteristics, experiences, and social roles) would act, feel, or experience life events. This paper aims to improve the fine-grained (detailed) and temporally dynamic (changing over time) aspects of persona reasoning.
- BaZi (Four Pillars of Destiny): Also known as Eight Characters (八字), BaZi is a traditional Chinese metaphysical system used for destiny analysis. It translates a person's precise birth time (year, month, day, hour) and place of birth into eight symbolic characters, comprising four pairs of Heavenly Stems (天干) and Earthly Branches (地支). Each of the eight characters is associated with one of the Five Elements (五行: Wood, Fire, Earth, Metal, Water) and a Yin/Yang polarity. These elements and their interactions (productive, destructive, reductive, exhaustive cycles) are believed to reveal insights into a person's personality, life path, health, wealth, relationships, and career across different temporal dynamics (life stages, Flowing Years, Months, and Days). The paper reinterprets BaZi not as a deterministic prophecy, but as a structured, symbolic framework for generating conditional features related to an individual's traits and life events, making it amenable to computational modeling.
- Question Answering (QA): A natural language processing task in which a system receives a question and a context (e.g., a document or knowledge base) and must provide an accurate answer. In this paper, a QA dataset is constructed where the "context" for answering questions about life events is the BaZi information of a celebrity, and the questions are multiple-choice, requiring the model to reason about the implications of the BaZi chart.
3.2. Previous Works
The paper contextualizes its work by discussing existing approaches and research in three main areas: AI-Driven NPC Development in Games, Interactive Storytelling and Computational Narratives, and Traditional Chinese Metaphysics and Bazi Theory.
3.2.1. AI-Driven NPC Development in Games
- Traditional Approaches: Historically, Non-Player Characters (NPCs) in games relied on pre-scripted logic such as dialogue trees, finite-state machines (FSMs), and behavior trees.
  - Dialogue Trees: Predetermined conversation paths that limit player choices and NPC responses.
  - Finite-State Machines (FSMs): NPCs transition between a limited number of defined states (e.g., 'idle', 'patrolling', 'attacking') based on specific conditions. This can be restrictive and costly to author for complex behaviors.
  - Behavior Trees: A hierarchical, modular approach to controlling AI behavior, offering more flexibility than FSMs but still requiring extensive manual authoring and struggling with long-horizon consistency (maintaining consistent behavior over extended periods).
- LLM-Based NPCs: Recent advancements in Large Language Models (e.g., Brown et al., 2020 on GPT-3; OpenAI, 2023 on GPT-4) have enabled more sophisticated NPC behaviors, generative agents (Park et al., 2023), and multi-agent simulations (Wang et al., 2023). These leverage LLMs for prompt-following and dialogue generation.
  - Limitations of Current LLM Approaches: Despite improvements, these still face challenges: detailed persona prompts often cannot capture full human complexity within length constraints (Liu et al., 2023), and character-specific finetuning (e.g., Hu et al., 2022 on LoRA; Dettmers et al., 2023 on QLoRA) is difficult to scale efficiently across a diverse range of unique personas.
- Related Research: The paper cites various works on AI in games, including:
  - Karaca et al. (2023) on AI-powered procedural content generation for NPC behavior.
  - Zeng (2023), identifying challenges in human-like NPC behavior and categorizing AI techniques.
  - Kopel et al. (2018), presenting experimental results with decision trees, genetic algorithms, and Q-learning for 3D game NPCs.
  - Mehta (2025), examining AI's role in game development and player experience (dynamic difficulty adjustment, adaptive NPC systems such as the Nemesis System).
  - Armanto et al. (2024) on evolutionary algorithms for NPC behavior.
  - Filipović (2023), discussing computational-linguistics aspects of dialogue systems.
  - Wikipedia Contributors (2025), noting challenges in LLM-driven dialogue generation (consistency, computational complexity).
3.2.2. Interactive Storytelling and Computational Narratives
- This field focuses on how computational systems can generate, manage, and adapt narratives based on user interaction.
- Foundational Work: Szilas (2007) established early work on intelligent narrators using rule-based systems to dynamically maintain storylines.
- Contemporary Research:
  - Beguš (2024) compared human-authored and AI-generated stories, finding that LLMs struggle with emotional authenticity and psychological complexity.
  - Kybartas and Bidarra (2023) surveyed computational and emergent digital storytelling, analyzing bottom-up emergent narratives vs. top-down drama-manager approaches.
  - Gerba (2025) proposed Universal Narrative Models to separate storytelling from structure, addressing the "player dilemma" between narrative coherence and user agency.
  - Cavazza et al. (2003) explored narrative intelligence and cultural transmission in AI systems.
  - Kabashkin et al. (2025) investigated how LLMs reproduce archetypal storytelling patterns, excelling at structured narratives but struggling with psychologically complex and ambiguous stories.
- The field is moving toward hybrid human-AI collaboration in storytelling.
3.2.3. Traditional Chinese Metaphysics and Bazi Theory
- The academic study of BaZi is limited but growing.
- Historical Context: Pankenier (2023) examined court astrology in ancient China, showing its integration into imperial governance.
- Cultural Exchange: Mak (2017) analyzed the transmission of Western astral science into Chinese contexts, revealing intercultural exchange.
- Connections to TCM: Academia Contributors (2013) explored the relationship between Chinese astrology and Traditional Chinese Medicine (TCM), linking birth-date analysis to health and personality within TCM frameworks.
- Gap Identified: The paper notes the lack of a comprehensive peer-reviewed analysis of BaZi's epistemological foundations and contemporary applications, especially regarding predictive validity (e.g., Carlson, 1985; Wyman and Vyse, 2008; Dean, 2025 on astrology's lack of predictive validity). This motivates the paper's empirical approach, treating BaZi as a narrative representation rather than a metaphysical claim.
3.3. Technological Evolution
The evolution of character simulation has progressed from highly prescriptive, rule-based systems (like dialogue trees, FSMs, behavior trees) that are costly to author and limited in flexibility, to more dynamic, generative approaches powered by Large Language Models. LLMs have significantly enhanced dialogue generation and prompt-following, enabling more interactive and less rigid NPCs. However, these LLM-based methods still grapple with scaling across diverse personas, maintaining long-horizon consistency, and capturing the nuanced complexity of human behavior due to limitations in prompt length or the cost of finetuning.
This paper's work represents a step in this evolution by introducing a new paradigm: integrating symbolic reasoning derived from a culturally grounded system (BaZi) with LLMs. This approach aims to provide a structured, interpretable, and temporally dynamic way to generate persona features that can overcome some of the scaling and consistency issues of purely data-driven or prompt-based LLM methods. It positions BaZi as a framework to discretize chronological time into symbolic attributes, offering fine-grained and dynamic persona generation.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- Symbolic Reasoning Integration: Unlike mainstream LLM-based NPC approaches that rely primarily on textual persona prompts or finetuning on behavioral data, this paper explicitly integrates a symbolic reasoning system (BaZi). This provides a structured, interpretable, and culturally informed layer of feature generation that is often missing in purely LLM-driven character models.
- Minimal Input, Rich Output: Current LLM methods often require detailed persona prompts or extensive datasets for finetuning. The BaZi-LLM system generates temporally dynamic and fine-grained personas from minimal inputs: birth date and time, gender, and place of birth. This addresses scalability by reducing the need for voluminous annotated data or complex per-character prompt engineering.
- Temporal Dynamics and Contextual Coherence: While some LLM agents exhibit temporal consistency over short horizons, maintaining long-horizon consistency and adapting persona traits to temporal dynamics (e.g., different life stages) remains a known challenge. BaZi inherently provides a framework for temporal dynamics (Flowing Years, Months, Days) and person-environment interactions. The proposed system leverages this to generate time-sequenced and environment-aware character profiles, a key differentiator from static persona labels.
- Culturally Grounded Framework: The use of BaZi introduces a culturally grounded perspective, offering a vocabulary for identity and life-course description not typically found in Western-centric AI character simulation (which might use concepts like MBTI or astrology more superficially). The paper reinterprets BaZi as a conditional feature-generation model rather than a metaphysical system, making it suitable for computational application.
- Quantitative Evaluation of Symbolic Reasoning: The Celebrity 50 QA dataset provides a novel quantitative benchmark for evaluating BaZi-based persona reasoning, addressing a long-standing limitation of BaZi, which previously lacked measurable accuracy.

In essence, the paper differentiates itself by moving beyond solely data-driven or prompt-based LLM methods toward a hybrid approach that combines the generative power of LLMs with the structured, temporally dynamic, and culturally grounded symbolic reasoning of BaZi, leading to more realistic and scalable character simulation.
4. Methodology
4.1. Principles
The core idea behind the proposed BaZi-inspired character simulation framework is to systematically transform an individual's birth information into structured, interpretable prompts that capture both stable personality traits and dynamic temporal states. Instead of treating BaZi as a metaphysical practice, the method reinterprets it as a symbolic rule-mapping process. This process discretizes chronological time into symbolic attributes that are tied to personal traits and temporal dynamics, allowing for fine-grained, dynamic persona generation suitable for Large Language Models (LLMs). The framework aims to provide a more scalable and contextually coherent alternative to methods heavily reliant on annotated data or handcrafted persona prompts.
4.2. Core Methodology In-depth (Layer by Layer)
The BaZi-inspired character simulation framework is organized into four main components, as depicted in Figure 4, which collectively form the BaZi-LLM prompt workflow.
The following figure (Figure 4 from the original paper) shows the system architecture:
The figure is a schematic of the four main components of the BaZi prompt workflow: birth-information input, BaZi rule analysis, the BaZi reasoning module, and the scenario module. These interact across multiple dimensions (wealth, career, kinship, health) to generate fine-grained character features.
4.2.1. Input Layer
This is the initial stage where the raw biographical data for an individual is provided to the model.
- Input Elements: The model requires only three minimal pieces of information:
  - Birth date and time: the year, month, day, and precise hour of birth.
  - Gender: the biological sex of the individual.
  - Place of birth: the geographical location where the individual was born.

These minimal inputs are crucial because they form the basis for constructing the BaZi chart, which drives the subsequent persona generation process.
4.2.2. BaZi Rule Analysis
In this stage, the input birth information is translated into the formal structure of a BaZi chart using a rule-based mapping program grounded in BaZi theory.
- BaZi Chart Construction: The birth year, month, day, and time are encoded into eight symbolic elements, composed of four pairs, each consisting of a Heavenly Stem (天干) and an Earthly Branch (地支). For example, a person born in a Jia Zi (甲子) year would have Jia as the Heavenly Stem and Zi as the Earthly Branch of their year pillar.
- Symbolic Element Attributes: Each of the eight symbolic elements is further associated with specific attributes:
  - Personality features: derived from the balance and interaction of the Five Elements (Wood, Fire, Earth, Metal, Water) within the chart. BaZi theory posits that the relative strength and interaction of these elements indicate different aspects of character and predisposition. For instance, an abundance of Wood might suggest a nurturing and creative personality, while an excess of Metal could imply decisiveness and rigidity.
  - Daily dynamic states: temporal features linked to life dimensions such as health, career, wealth, and kinship. The Heavenly Stems and Earthly Branches in a BaZi chart are not static; their interactions with the changing Luck Cycles, Flowing Years, Months, and Days (大运, 流年, 流月, 流日 in BaZi terminology) are believed to indicate periods of prosperity, challenge, or specific life events.
- Output of this stage: a structured BaZi chart with its associated Five-Element balance and initial dynamic indicators, ensuring interpretability and temporal grounding.
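As a concrete illustration of the rule-based mapping, the year pillar of a chart can be derived from the standard sexagenary cycle. This is a minimal Python sketch, not the paper's actual mapping program (which is not shown here), and it ignores the Chinese New Year boundary for simplicity:

```python
# Standard sexagenary-cycle lookup tables (romanized).
HEAVENLY_STEMS = ["Jia", "Yi", "Bing", "Ding", "Wu",
                  "Ji", "Geng", "Xin", "Ren", "Gui"]
EARTHLY_BRANCHES = ["Zi", "Chou", "Yin", "Mao", "Chen", "Si",
                    "Wu", "Wei", "Shen", "You", "Xu", "Hai"]

# Each Heavenly Stem carries one of the Five Elements.
STEM_ELEMENT = ["Wood", "Wood", "Fire", "Fire", "Earth",
                "Earth", "Metal", "Metal", "Water", "Water"]

def year_pillar(year: int) -> tuple:
    """Return (Heavenly Stem, Earthly Branch) for a calendar year.

    Simplification: treats the Western calendar year as the Chinese
    calendar year, i.e. ignores the Chinese New Year boundary.
    """
    stem = HEAVENLY_STEMS[(year - 4) % 10]      # 1984 -> Jia
    branch = EARTHLY_BRANCHES[(year - 4) % 12]  # 1984 -> Zi
    return stem, branch

print(year_pillar(1984))  # ('Jia', 'Zi') — the Jia Zi year mentioned above
```

The month, day, and hour pillars follow analogous but more involved rules (the month pillar depends on solar terms, the day pillar on a continuous day count), which is why a dedicated rule-based mapping program is needed.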
4.2.3. BaZi Reasoning (Interpretation via Classical Logic)
While the BaZi chart from the previous stage provides raw symbolic features, it requires interpretation to construct a meaningful persona. This stage involves a coarse-grained interpretation mechanism inspired by classical BaZi analysis.
- Key Interpretive Mechanisms:
  - Ten Gods (十神): Symbolic roles representing the relationships between the Day Master (日主, the Heavenly Stem of the day pillar, representing the self) and the other Heavenly Stems and Earthly Branches in the chart. Each of the Ten Gods (e.g., Direct Wealth, Indirect Wealth, Officer, Seven Killings, Food God, Hurting Officer, Friend, Rob Wealth, Direct Seal, Indirect Seal) represents specific personality traits, relationship dynamics, or career tendencies. For example, a strong Direct Officer (正官) might indicate a disciplined, status-conscious person, while a prominent Food God (食神) could suggest creativity and enjoyment of life.
  - ShenSha (神煞): Auxiliary symbolic markers, or Divinity Stars, associated with specific life tendencies or external influences. ShenSha can indicate potential blessings, misfortunes, talents, or challenges. Examples include Nobleman (贵人), Academic Star (文昌), Peach Blossom (桃花), and Solitary Star (寡宿).
  - Pattern Structures (格局): Higher-level symbolic groupings that reflect broader personality orientations or life-path archetypes. Examples include the Seven Killings Pattern (七杀格), indicating a challenging but powerful life, or the Direct Wealth Pattern (正财格), for someone stable and financially conservative.
- Process and Output: This interpretive process follows the logic of BaZi divination but produces conditional interpretive features rather than deterministic outcomes. These features form the foundation for downstream scenario reasoning.
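The Ten Gods assignment is a fixed rule over the Five-Element generation ("produces") and control ("controls") cycles plus Yin/Yang polarity. The sketch below is a standard formulation of this classical rule, not the paper's code:

```python
# Five-Element generation and control cycles.
PRODUCES = {"Wood": "Fire", "Fire": "Earth", "Earth": "Metal",
            "Metal": "Water", "Water": "Wood"}
CONTROLS = {"Wood": "Earth", "Earth": "Water", "Water": "Fire",
            "Fire": "Metal", "Metal": "Wood"}

def ten_god(day_elem, day_yang, other_elem, other_yang):
    """Classify another stem relative to the Day Master.

    day_elem/other_elem: one of the Five Elements.
    day_yang/other_yang: True for Yang polarity, False for Yin.
    """
    same = (day_yang == other_yang)
    if other_elem == day_elem:                    # same element
        return "Friend" if same else "Rob Wealth"
    if PRODUCES[day_elem] == other_elem:          # Day Master produces it
        return "Food God" if same else "Hurting Officer"
    if CONTROLS[day_elem] == other_elem:          # Day Master controls it
        return "Indirect Wealth" if same else "Direct Wealth"
    if CONTROLS[other_elem] == day_elem:          # it controls the Day Master
        return "Seven Killings" if same else "Direct Officer"
    return "Indirect Seal" if same else "Direct Seal"  # it produces the DM

# A Yang-Wood Day Master meeting Yang-Fire yields the Food God.
print(ten_god("Wood", True, "Fire", True))  # Food God
```

The system's interpretive layer can then attach trait descriptions (e.g., "disciplined, status-conscious" for Direct Officer) to whichever roles dominate the chart.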
4.2.4. Scenario-Oriented Analysis
To achieve fine-grained and adaptive persona modeling, the BaZi-derived interpretive features are coupled with scenario-specific modules.
- Domain Contextualization: These modules contextualize the symbolic features into five primary life domains:
  - Health: how BaZi elements influence physical well-being and potential ailments.
  - Career: BaZi indicators related to professional success, type of work, and ambition.
  - Wealth: BaZi aspects pertaining to financial fortune, earning capacity, and saving habits.
  - Relationship: BaZi insights into romantic relationships, marriage, and social interactions.
  - Kinship: BaZi interpretations concerning family relationships, parents, siblings, and children.
- Adaptive Persona Modeling: This stage recognizes that a single BaZi feature (e.g., one indicating career ambition) can manifest differently depending on the external scenario. For example, high career ambition might lead to strategic alliances in a financial-opportunity scenario but to direct confrontation in an interpersonal-conflict scenario. This contextual adaptation keeps the persona consistent with its core traits while responding realistically to diverse environments.
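The career-ambition example above can be sketched as a simple feature-to-behaviour lookup. This is purely illustrative: the feature and scenario names are hypothetical, and the paper's actual scenario modules are not published:

```python
# Hypothetical mapping from (symbolic feature, scenario) to a behaviour
# description; the keys and values here are illustrative assumptions.
SCENARIO_BEHAVIOUR = {
    ("career_ambition", "financial_opportunity"): "forms strategic alliances",
    ("career_ambition", "interpersonal_conflict"): "confronts rivals directly",
}

def contextualize(feature: str, scenario: str) -> str:
    # Fall back to the bare feature when no scenario-specific rule exists,
    # so the persona stays consistent with its core traits.
    return SCENARIO_BEHAVIOUR.get((feature, scenario), feature)

print(contextualize("career_ambition", "financial_opportunity"))
# forms strategic alliances
```

In the real system this lookup would presumably be replaced by LLM-driven interpretation conditioned on the symbolic features, but the dictionary makes the "same trait, different manifestation" idea concrete.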
4.2.5. Dynamic Persona Prompt Generation
This is the final stage where all the processed information is consolidated into actionable prompts for LLMs.
- Consolidation: The interpreted features from the BaZi Reasoning and Scenario-Oriented Analysis stages are synthesized.
- Dynamic Prompts: The output consists of dynamic prompts designed to simulate individual behavior and responses across time. Crucially, these prompts incorporate both:
  - Long-term stable traits: core personality characteristics derived from the BaZi chart that remain consistent over an individual's life.
  - Short-term temporal variations: changes or specific events indicated by the interaction of the BaZi chart with Flowing Years, Months, or Days, and how these manifest in specific scenarios.
- Result: This yields a time-sequenced and environment-aware character profile, which serves as the basis for lifelike, context-sensitive character simulations by LLMs.
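A minimal sketch of how such a prompt might be assembled; the template and field names are illustrative assumptions, not the paper's actual prompt format:

```python
def build_persona_prompt(stable_traits, yearly_state, scenario):
    """Combine long-term traits with a short-term temporal state.

    stable_traits: list of trait strings from the BaZi chart.
    yearly_state: dict with the current Flowing Year and a state summary.
    scenario: description of the current external situation.
    """
    return (
        f"You are a character with these long-term traits: "
        f"{', '.join(stable_traits)}. "
        f"This year ({yearly_state['year']}) your state is: "
        f"{yearly_state['summary']}. "
        f"Current scenario: {scenario}. Respond in character."
    )

prompt = build_persona_prompt(
    stable_traits=["disciplined", "status-conscious"],
    yearly_state={"year": 1972, "summary": "a favourable career period"},
    scenario="a financial opportunity arises",
)
print(prompt)
```

Regenerating the `yearly_state` for each Flowing Year (or month/day) is what makes the resulting profile time-sequenced rather than a static persona label.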
4.2.6. Methodological Innovations
The authors highlight three key innovations:
- Minimal Input, Rich Output: The model generates temporally dynamic and domain-specific persona prompts from only birth date/time, gender, and place of birth.
- Symbolic-Logical Integration: It combines rule-based BaZi mapping with interpretive logic (Ten Gods, ShenSha, Pattern Structures) to create structured, explicitly interpretable symbolic features.
- Scenario Adaptivity: Persona representations are not fixed but dynamically adapt to health, career, wealth, relationship, and kinship contexts, resulting in vivid, time-evolving character simulation.
5. Experimental Setup
5.1. Datasets
The primary dataset used for evaluation is Celebrity 50.
- Name: Celebrity 50
- Purpose: Designed to evaluate Large Language Models' (LLMs') ability to predict key life events based on BaZi principles. It serves as the first QA dataset for BaZi-based persona reasoning.
- Source and Characteristics:
  - Individuals: Contains information about 50 real individuals from diverse global backgrounds. Selection was restricted to individuals born around 1940 to ensure a sufficiently rich amount of biographical data across their adult lives.
  - Data Collection & Validation: Biographical records were collected and validated through astro.com (even though the paper reinterprets BaZi non-metaphysically, astro.com provides precise birth data).
  - Selection Criteria:
    - Adults with sufficiently rich life experiences.
    - Exclusion of idols (likely for privacy and data-richness concerns).
    - All subjects born in the Northern Hemisphere.
  - Question-Answer Pairs: Each persona is associated with 45 multiple-choice question-answer pairs. These questions span five key life dimensions: wealth, health, kinship, career, and relationships. This formulation reduces evaluation complexity while allowing reasoning over significant, discrete life nodes.
- Statistics:
  - Individuals: 50
  - Countries: 29 different countries of birth
  - Total Q&A pairs: 488 (an average of approximately 9.76 questions per person)
  - Gender distribution: 37 males and 13 females, indicating a gender imbalance
The following figure (Figure 3 from the original paper) shows the distribution of birthplaces and question counts across countries: a bar chart in which orange bars denote question counts and blue bars birthplace counts. The United States, the United Kingdom, and Russia account for the most questions, while most countries contribute only a few birthplaces, reflecting an uneven distribution.
- Construction Process:
  - Birth Time Acquisition: Precise birth time data is acquired.
  - Biographical Narrative Retrieval: The Qwen API is prompted to retrieve biographical narratives across the five dimensions (wealth, health, kinship, career, relationships), leveraging its web search capabilities and internal knowledge base.
  - Multiple-Choice Question Generation: The same LLM (Qwen API) generates multiple-choice questions from the compiled biographical information.
  - JSON Export: A final script extracts and synthesizes these questions with the birth data into a JSON format.
  - Cleaning and Quality Assurance:
    - Initial Filtering: A rating system was established to eliminate questions based on three criteria: (1) questions containing real proper names (people, organizations, etc.); (2) questions demanding overly specific numerical details (e.g., exact wealth amounts) that are not reasonably predictable by BaZi analysis; and (3) questions that exceed the reasonable predictive capabilities of traditional BaZi analysis.
    - Refinement: Unsatisfactory questions were grouped and iteratively refined by the LLM itself through prompt modifications; discarded questions were replaced by new ones.
    - Manual Verification: All remaining questions underwent manual verification to ensure rigor and compliance with guidelines.
- Annotation Process: Comprehensive guidelines ensured that all generated questions were factually accurate and strictly aligned with one of the five predefined life dimensions based on the sourced biographical material.
- Data Sample: The model's input for a given task includes the individual's birth time, gender, and place of birth, along with a multiple-choice question and candidate answers. The goal is to select the correct answer.
The following figure (Figure 2 from the original paper) shows sample information input to a Large Language Model:
The figure is a schematic of sample information input to a Large Language Model (LLM), including birth time, gender, place of birth, and a multiple-choice question about the person's likely career.
- Why these datasets were chosen: Celebrity 50 allows QA-based evaluation over critical life events, which is simpler to verify than full life-course narratives. The focus on real individuals born around 1940 ensures a rich historical record of biographical events, and the multiple-choice format enables quantitative evaluation, a significant improvement over previous BaZi reasoning, which lacked measurable accuracy.
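Based on the construction pipeline described above, an exported record plausibly looks like the following. The field names and example values are illustrative assumptions; the released schema is not reproduced in this analysis:

```python
import json

# Hypothetical Celebrity 50 record: minimal birth information plus one
# multiple-choice life-event question in one of the five dimensions.
record = {
    "birth_time": "1940-05-17T06:00",        # illustrative, not a real subject
    "gender": "male",
    "place_of_birth": "Paris, France",
    "dimension": "career",
    "question": "Which field was this person most likely to succeed in?",
    "choices": ["A. Finance", "B. Music", "C. Medicine", "D. Law"],
    "answer": "B",
}
print(json.dumps(record, indent=2))
```

Under this shape, evaluation reduces to comparing the model's selected choice letter against the `answer` field for each of the 488 questions.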
5.2. Evaluation Metrics
The primary evaluation metric used in this paper is Accuracy.
- Accuracy:
  - Conceptual Definition: Accuracy is a fundamental metric in classification tasks that measures the proportion of correctly predicted instances out of the total number of instances. In the context of a multiple-choice Question Answering (QA) benchmark, it quantifies how often the model selects the correct answer choice. A higher accuracy indicates a better ability of the model to correctly reason about and predict life events based on the provided information.
  - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: the count of instances (questions) where the model's output (chosen answer) exactly matches the ground-truth correct answer.
    - Total Number of Predictions: the total number of questions or samples evaluated in the dataset.
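In code, this metric is just exact-match counting over the chosen answer letters; a minimal sketch (not the paper's implementation):

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# e.g. 3 of 4 multiple-choice answers correct -> 0.75
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))  # 0.75
```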
5.3. Baselines
The paper compares its proposed method against several state-of-the-art Large Language Models (LLMs) under different experimental settings.
- Mainstream LLM Backbones:
  - DeepSeek-v3: A large language model from DeepSeek AI.
  - Gemini-2.5-flash: A model from Google's Gemini family, designed for speed and efficiency.
  - GPT-5-mini: A hypothetical or placeholder model for a future state-of-the-art model from OpenAI, indicating an intention to compare against the cutting edge.

  These models are considered representative state-of-the-art LLMs known for their strong prompt-following, dialogue generation, and reasoning capabilities.
- Experimental Settings for Comparison:
  - Vanilla LLM w/ Bazi (Baseline): In this setting, the standard LLMs are provided with the BaZi-derived features as input, but without any additional BaZi reasoning modules integrated into their architecture. This tests the LLMs' inherent ability to interpret BaZi information implicitly, purely through their pre-trained knowledge.
  - Vanilla LLM w/ Bazi Rule Knowledge: Here, in addition to receiving the BaZi features, the LLMs are augmented with explicit symbolic knowledge rules related to BaZi. This evaluates whether providing explicit rule-based knowledge improves the LLMs' performance.
  - Our Model: This refers to the proposed multi-agent architecture that specifically integrates symbolic reasoning derived from BaZi with LLM inference for BaZi-inspired character simulation. This is the method being evaluated for its overall effectiveness.
- Control Experiment:
  - Shuffled Birthday Control: To validate the importance of genuine birth-date grounding and BaZi reasoning, a control condition was introduced. In this setup, each subject's true birth date was replaced with another person's date (randomly shuffled), while all other information (questions, candidate answers) remained constant. The expectation is that if BaZi reasoning is meaningful, performance should significantly deteriorate when the correct mapping between an individual's biography and their true BaZi chart is broken. This helps to confirm that the model is genuinely leveraging BaZi information rather than statistical correlations or other cues.
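One way to implement such a control can be sketched as follows. The paper describes a random shuffle; for simplicity this sketch uses a one-position rotation, which guarantees that no subject keeps their own date (assuming distinct dates and at least two subjects). Field names are illustrative:

```python
def shuffle_birthdays(subjects):
    """Return a copy of the dataset in which every subject keeps their own
    questions and answers but receives another subject's birth date.

    A one-position rotation is a simple derangement: with distinct dates
    and len(subjects) >= 2, no subject retains their original date.
    """
    dates = [s["birth_time"] for s in subjects]
    rotated = dates[1:] + dates[:1]
    shuffled = []
    for subject, new_date in zip(subjects, rotated):
        item = dict(subject)          # shallow copy; questions stay intact
        item["birth_time"] = new_date
        shuffled.append(item)
    return shuffled
```

Evaluating the same model on the original and the shuffled set, and comparing accuracies, reproduces the logic of the control: a large drop indicates genuine reliance on the correct birth-date grounding.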
- External Benchmark:
  - The 15th Global Fortune-Teller Championship 2024: The model was also evaluated on a question set from this external competition, organized by the Hong Kong Junior Feng Shui Masters Association. This provides a real-world, albeit specialized, benchmark for BaZi prediction capabilities.
- Implementation Details:
  - The model uses DeepSeek-R1 (Guo et al., 2025) for generating BaZi knowledge and Doubao-1.5-ThinkingPro (ByteDance, 2025) for performing reasoning.
  - All evaluations were conducted on the Celebrity 50 dataset.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the significant effectiveness of the proposed BaZi-augmented LLM system in persona reasoning and life-event prediction. The analysis focuses on comparing our model against vanilla LLMs and rule-augmented LLMs, and crucially, on validating the importance of BaZi information through shuffled birthday controls.
6.1.1. Comparison with Baseline LLMs
The following are the results from Table 1 of the original paper:
| Setting | Model | Acc. (%) |
| :--- | :--- | :--- |
| Vanilla LLM w/ Bazi (Baseline) | DeepSeek-v3 | 39.3 |
| | Gemini-2.5-flash | 42.2 |
| | GPT-5-mini | 34.0 |
| Baseline w/ Bazi Rule Knowledge | DeepSeek-v3 | 35.9 (↓8.7%) |
| | Gemini-2.5-flash | 42.4 (↓4.1%) |
| | GPT-5-mini | 36.9 (↑8.5%) |
| Our Model | DeepSeek-v3 | 51.2 (↑30.3%) |
| | Gemini-2.5-flash | 47.1 (↑6.6%) |
| | GPT-5-mini | 55.3 (↑62.6%) |
As shown in Table 1, our model (the BaZi-LLM system) consistently outperforms all baseline LLMs across various backbones (DeepSeek-v3, Gemini-2.5-flash, GPT-5-mini).
- For DeepSeek-v3, our model achieves 51.2% accuracy, a 30.3% relative improvement over the Vanilla LLM w/ Bazi baseline (39.3%).
- For Gemini-2.5-flash, our model reaches 47.1%, a 6.6% relative gain over its baseline (42.2%).
- Most notably, for GPT-5-mini, our model achieves 55.3% accuracy, an impressive 62.6% relative improvement over its baseline (34.0%).

These results strongly validate the effectiveness of integrating symbolic BaZi reasoning into LLM-based character simulation. The gains are substantial, particularly for GPT-5-mini, suggesting that the symbolic framework helps LLMs that might otherwise struggle with this specific task.
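As a sanity check, the arrowed percentages for the rows quoted above can be recomputed from the raw accuracies with a simple helper (not taken from the paper's code):

```python
def relative_change(new: float, old: float) -> float:
    """Relative change in percent: positive means improvement over `old`."""
    return (new - old) / old * 100.0

# Table 1: our model vs. the Vanilla LLM w/ Bazi baseline
print(round(relative_change(51.2, 39.3), 1))   # 30.3  (DeepSeek-v3, ↑30.3%)
print(round(relative_change(55.3, 34.0), 1))   # 62.6  (GPT-5-mini, ↑62.6%)
# Table 2: our model, shuffled vs. real birthdays
print(round(relative_change(40.6, 51.2), 1))   # -20.7 (DeepSeek-v3, ↓20.7%)
```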
Interestingly, providing BaZi Rule Knowledge to Vanilla LLMs (the "Baseline w/ Bazi Rule Knowledge" setting) did not consistently improve performance and even led to slight drops for DeepSeek-v3 and Gemini-2.5-flash. This implies that simply feeding explicit rules to LLMs might not be as effective as having a dedicated multi-agent architecture that systematically processes and integrates these rules, as our model does. GPT-5-mini showed an 8.5% increase in this setting, indicating some models may benefit more from direct rule injection, but the gain is still far smaller than that achieved by our model.
6.1.2. Impact of Shuffled Birthdays
The following are the results from Table 2 of the original paper:
| Setting | Model | Acc. (%) |
| :--- | :--- | :--- |
| Real Birthdays | DeepSeek-v3 | 51.2 |
| | Gemini-2.5-flash | 47.1 |
| | GPT-5-mini | 55.3 |
| Shuffled Birthdays | DeepSeek-v3 | 40.6 (↓20.7%) |
| | Gemini-2.5-flash | 35.5 (↓24.6%) |
| | GPT-5-mini | 30.0 (↓45.7%) |
Table 2 presents a crucial validation of the role of BaZi as a meaningful symbolic system. When our model is tested with shuffled birthdays (breaking the genuine temporal alignment between the biography and the BaZi chart), the performance drops significantly across all backbones:
- DeepSeek-v3: 51.2% (Real) to 40.6% (Shuffled), a 20.7% decrease.
- Gemini-2.5-flash: 47.1% (Real) to 35.5% (Shuffled), a 24.6% decrease.
- GPT-5-mini: 55.3% (Real) to 30.0% (Shuffled), a substantial 45.7% decrease.

This drastic drop in accuracy when the BaZi information is incorrectly associated confirms that our model is not merely relying on superficial cues or general biographical knowledge, but is actively leveraging the specific BaZi features derived from the correct birth data for its reasoning. This directly addresses the question of whether BaZi-derived features provide incremental information beyond raw birth dates, showing that they do.
6.1.3. Vanilla LLMs with Shuffled Birthdays
The following are the results from Table 3 of the original paper:
| Setting | Model | Acc. (%) |
| :--- | :--- | :--- |
| Vanilla LLM + Bazi | DeepSeek-v3 | 39.3 |
| | Gemini-2.5-flash | 42.2 |
| | GPT-5-mini | 34.0 |
| / + Shuffled Birthday | DeepSeek-v3 | 42.5 (↑8.1%) |
| | Gemini-2.5-flash | 42.1 (↓0.2%) |
| | GPT-5-mini | 34.8 (↑2.4%) |
Table 3 provides a comparative view of Vanilla LLMs performance with real vs. shuffled BaZi features. In contrast to our model (Table 2), the Vanilla LLMs show relatively stable, and in some cases even slightly improved, performance when birthdays are shuffled.
- DeepSeek-v3: 39.3% (Real) to 42.5% (Shuffled), an 8.1% increase.
- Gemini-2.5-flash: 42.2% (Real) to 42.1% (Shuffled), a negligible 0.2% decrease.
- GPT-5-mini: 34.0% (Real) to 34.8% (Shuffled), a 2.4% increase.

This outcome suggests that Vanilla LLMs contain only limited implicit knowledge of BaZi and, without the explicit BaZi reasoning framework, do not strongly rely on the BaZi features for their predictions. Their reasoning may be more influenced by general patterns in the biographical data or common sense than by the specific symbolic information. This contrast further highlights that our model is genuinely leveraging BaZi theory and its structured interpretation for persona fitting, rather than operating on surface-level correlations.
6.1.4. External Benchmark Performance
The model, which uses DeepSeek-R1 for BaZi knowledge generation and Doubao-1.5-ThinkingPro for reasoning, achieved an accuracy of 60% on the question set from The 15th Global Fortune-Teller Championship 2024. This performance matched the third-place competitor in that year's competition. This indicates that the model's capabilities extend to real-world BaZi prediction tasks and suggests potential for further improvement with more powerful underlying reasoning engines.
6.1.5. Case Study: Differences in BaZi Theory Interpretation (sergey_brin_P042)
The case study involving Sergey Brin (sergey_brin_P042) reveals qualitative differences in how DeepSeek-V3, GPT-5-mini, and Gemini-2.5-flash interpret BaZi theory within the custom BaZi analysis framework.
- BaZi Theory Interpretation:
  - DeepSeek-V3 and Gemini-2.5-flash classified the chart as a Shangguan Structural Pattern (伤官格), while GPT-5-mini identified it as a Cong Er Structural Pattern (从儿格).
  - This divergence led to opposite conclusions regarding favorable/unfavorable elements and future luck cycles. The authors note that while flexibility exists in pattern classification, such decisions typically rely on professional experience. GPT-5-mini adopted a more flexible and bold interpretative logic, while DeepSeek-V3 and Gemini-2.5-flash exhibited a more conservative, rule-bound approach.
- Scene Mapping Process: DeepSeek-V3 followed a rigid feature-to-prediction pattern, susceptible to local information bias. Gemini-2.5-flash integrated multiple dimensions for holistic analysis. GPT-5-mini showed behavior similar to a human consultant, adapting reasoning to user context and exploring alternative scenarios dynamically.
- Output Expression: DeepSeek-V3 used absolute statements for real-world manifestations. Gemini-2.5-flash and GPT-5-mini used more probabilistic language ("possibly", "likely") and presented multiple potential outcomes, resembling a human consultant more closely.
- Commonalities: When provided with identical upstream results, all three models showed convergent reasoning paths without severe factual or logical errors and demonstrated a comparable level of baseline BaZi knowledge. However, none exhibited strong reflection or self-correction mechanisms.
- Overall Assessment: Gemini-2.5-flash provided the most stable and conservative theoretical reasoning. GPT-5-mini excelled in the final output stage, producing explanations most similar to a human consultant due to its aggressive and exploratory interpretations. DeepSeek-V3 remained rigid and deterministic.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Setting | Model | Acc. (%) |
| :--- | :--- | :--- |
| Vanilla LLM w/ Bazi (Baseline) | DeepSeek-v3 | 39.3 |
| | Gemini-2.5-flash | 42.2 |
| | GPT-5-mini | 34.0 |
| Baseline w/ Bazi Rule Knowledge | DeepSeek-v3 | 35.9 (↓8.7%) |
| | Gemini-2.5-flash | 42.4 (↓4.1%) |
| | GPT-5-mini | 36.9 (↑8.5%) |
| Our Model | DeepSeek-v3 | 51.2 (↑30.3%) |
| | Gemini-2.5-flash | 47.1 (↑6.6%) |
| | GPT-5-mini | 55.3 (↑62.6%) |
The following are the results from Table 2 of the original paper:
| Setting | Model | Acc. (%) |
| :--- | :--- | :--- |
| Real Birthdays | DeepSeek-v3 | 51.2 |
| | Gemini-2.5-flash | 47.1 |
| | GPT-5-mini | 55.3 |
| Shuffled Birthdays | DeepSeek-v3 | 40.6 (↓20.7%) |
| | Gemini-2.5-flash | 35.5 (↓24.6%) |
| | GPT-5-mini | 30.0 (↓45.7%) |
The following are the results from Table 3 of the original paper:
| Setting | Model | Acc. (%) |
| :--- | :--- | :--- |
| Vanilla LLM + Bazi | DeepSeek-v3 | 39.3 |
| | Gemini-2.5-flash | 42.2 |
| | GPT-5-mini | 34.0 |
| / + Shuffled Birthday | DeepSeek-v3 | 42.5 (↑8.1%) |
| | Gemini-2.5-flash | 42.1 (↓0.2%) |
| | GPT-5-mini | 34.8 (↑2.4%) |
6.3. Ablation Studies / Parameter Analysis
While the paper doesn't present traditional ablation studies breaking down the contribution of each specific component of our model (e.g., BaZi Rule Analysis, Interpretation via Classical Logic, Scenario-Oriented Analysis), the Shuffled Birthday Control (Table 2 and Table 3) serves as a critical pseudo-ablation to verify the effectiveness of the BaZi grounding itself.
- Shuffled Birthday as a Control: By comparing our model's performance with real birthdays versus shuffled birthdays, the authors effectively "ablate" the correct BaZi-biography alignment. The significant drop in accuracy (20%-45%) directly demonstrates that the BaZi-derived features are indeed being utilized and are crucial for the model's performance. If the BaZi component were ineffective, shuffling the birthdays would not cause such a dramatic performance decrease.
- Contrast with Vanilla LLMs: The further comparison in Table 3 shows that Vanilla LLMs (without our model's integrated BaZi reasoning framework) are largely unaffected by shuffled birthdays. This implies that Vanilla LLMs do not inherently leverage BaZi features in a meaningful way, even when provided with them, highlighting the necessity of our model's explicit symbolic integration. This comparison acts as an indirect ablation, showing that the specific integration mechanism (not just the presence of BaZi data) is what drives the performance gains.

This control experiment effectively verifies that the BaZi-based symbolic features are not mere noise but provide substantial, incremental information that our model successfully exploits for persona reasoning.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces a novel and effective approach to character simulation by integrating traditional Chinese BaZi (Four Pillars of Destiny) symbolic reasoning with Large Language Models (LLMs). The authors successfully reinterpret BaZi as a culturally grounded, conditional feature-generation model for persona construction, enabling the creation of fine-grained and temporally dynamic virtual characters. A key contribution is the development of Celebrity 50, the first QA dataset specifically designed for BaZi-based persona reasoning, allowing for quantitative evaluation of this complex domain. The proposed BaZi-LLM system demonstrates significant accuracy improvements, ranging from 30.3% to 62.6% over state-of-the-art LLM baselines like DeepSeek-v3 and GPT-5-mini. Crucially, the model's performance drastically declines (20%-45%) when incorrect BaZi information is used, unequivocally validating that the system genuinely leverages the symbolic birth information for its persona reasoning. This work underscores the potential of culturally grounded symbolic-LLM integration for generating more realistic and coherent virtual characters.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their current work:
- Dataset Generation Bias: Many narratives and questions in the Celebrity 50 dataset are LLM-generated (by Qwen), which introduces potential issues such as hallucination, bias, and factual errors.
- Dataset Size and Imbalance: The dataset is relatively small (50 individuals, 488 questions) and exhibits gender imbalance (37 male, 13 female). This limits the generalizability of the findings.
- Birth Data Accuracy: While birth details are sourced from astro.com, there may still be inaccuracies in the precise birth times.
- Cultural and Temporal Biases: The focus on mostly Western figures born around 1940 introduces temporal and cultural biases, which may not generalize across different eras or diverse cultural contexts.

Based on these limitations, the authors propose future directions for improvement:

- Model Enhancements:
  - Domain-Specific Knowledge Bases: Incorporating domain-specific knowledge bases or training for particular schools of BaZi thought (e.g., specialized pattern classifications) could further refine the model's interpretative capabilities.
  - Agent-Based Mechanisms: Implementing agent-based mechanisms that can dynamically select among intermediate outputs, reflect on user feedback, and adapt reasoning pathways accordingly would improve model robustness and flexibility.
- Dataset Expansion: Collecting more diverse data samples from different countries, with more precise birth times, is crucial to improve generalizability and reduce existing biases.
7.3. Personal Insights & Critique
This paper presents a fascinating and innovative approach that bridges traditional symbolic systems with modern Large Language Models.
Insights:
- Novel Integration of Symbolic AI and LLMs: The most compelling insight is the successful integration of a culturally grounded symbolic system like BaZi with LLMs. This goes beyond merely prompting LLMs with external knowledge; it involves a structured, multi-stage process of symbolic rule mapping, reasoning, and scenario adaptation. This hybrid approach demonstrates a powerful path forward for AI, where interpretable, structured knowledge can guide the generative capabilities of LLMs.
- Reinterpretation of BaZi: The reinterpretation of BaZi as a conditional feature-generation model rather than a metaphysical claim is a clever way to leverage its rich symbolic structure for persona simulation without getting entangled in pseudoscientific debates. It extracts the narrative and temporal patterning potential from BaZi, making it computationally useful.
- Validation through Shuffled Data: The shuffled birthday control is an excellent experimental design choice. The significant drop in our model's accuracy with shuffled data provides strong empirical evidence that the BaZi component is genuinely contributing to the persona reasoning, and not just acting as a spurious correlation. This is a crucial validation point for any system relying on such a specialized knowledge base.
- Addressing LLM Limitations: The paper effectively targets known limitations of LLMs in character simulation, such as scaling diverse personas and maintaining long-horizon consistency. BaZi's inherent temporal dynamics and pattern structures offer a structured solution to these challenges.
Critique and Areas for Improvement:
- Dataset Quality and Bias: The reliance on LLM-generated questions and biographical narratives for the Celebrity 50 dataset is a significant concern. LLMs are prone to hallucination and biases present in their training data. For a benchmark intended to validate a symbolic reasoning system, potential inaccuracies or implicit biases introduced by Qwen in the ground truth could affect the reliability of the accuracy scores. Future work should prioritize human-curated and verified Q&A pairs, or at least a more rigorous multi-stage human review process.
- Generalizability of BaZi: While BaZi is a culturally grounded system, its application to predominantly Western figures born around 1940 raises questions about its cross-cultural generalizability and applicability to modern contexts. The gender imbalance in the dataset also limits the conclusions that can be drawn about its performance for diverse demographics. Expanding the dataset to include individuals from various cultural backgrounds, time periods, and a balanced gender distribution is essential.
- Interpretability of LLM Outputs: The case study highlights the divergence in BaZi theory interpretation among LLMs (e.g., Shangguan vs. Cong Er patterns). While GPT-5-mini is praised for human-like output, such fundamental disagreements in the symbolic reasoning stage can lead to vastly different persona profiles. This points to the inherent subjectivity or interpretative flexibility within BaZi itself, which might be hard for an AI to navigate consistently without a definitive "oracle" for BaZi interpretation. Further research could explore how to integrate human expert consensus or allow for "multi-interpretative" persona aspects.
- The "GPT-5-mini" Anomaly: The consistent reference to GPT-5-mini is unusual, given that GPT-5 has not been publicly released. This either implies a placeholder for a future state-of-the-art model, a private developmental version, or perhaps a slight anachronism in the paper's timing if it was written significantly before its listed publication date. Clarification on this would enhance rigor.
- Formalization of BaZi Rules: The paper mentions BaZi rule analysis and interpretation via classical logic, but a more detailed exposition of how these rules are formalized and encoded for the AI (e.g., as logical predicates, knowledge graphs, or specific prompt structures) would be beneficial for reproducibility and deeper understanding.
Potential Applications:
The methodology could be applied to other symbolic systems or folk traditions that describe human personality or life trajectories (e.g., Western astrology, numerology, other forms of divination), transforming them into computable feature sets for LLM-driven character generation. This could greatly enrich the diversity and cultural specificity of virtual characters in games, storytelling, and educational simulations. Furthermore, the approach of using minimal inputs to generate rich, dynamic personas has significant implications for scaling character creation in large virtual worlds.