
Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents

Published: 08/06/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents `Galaxy`, a cognition-centered framework for proactive, self-evolving, and privacy-preserving LLM agents, integrating cognitive architecture with system design. Experimental results demonstrate its superior performance across various benchmarks.

Abstract

Intelligent personal assistants (IPAs) such as Siri and Google Assistant are designed to enhance human capabilities and perform tasks on behalf of users. The emergence of LLM agents brings new opportunities for the development of IPAs. While responsive capabilities have been widely studied, proactive behaviors remain underexplored. Designing an IPA that is proactive, privacy-preserving, and capable of self-evolution remains a significant challenge. Designing such IPAs relies on the cognitive architecture of LLM agents. This work proposes Cognition Forest, a semantic structure designed to align cognitive modeling with system-level design. We unify cognitive architecture and system design into a self-reinforcing loop instead of treating them separately. Based on this principle, we present Galaxy, a framework that supports multidimensional interactions and personalized capability generation. Two cooperative agents are implemented based on Galaxy: KoRa, a cognition-enhanced generative agent that supports both responsive and proactive skills; and Kernel, a meta-cognition-based meta-agent that enables Galaxy's self-evolution and privacy preservation. Experimental results show that Galaxy outperforms multiple state-of-the-art benchmarks. Ablation studies and real-world interaction cases validate the effectiveness of Galaxy.

In-depth Reading

1. Bibliographic Information

1.1. Title

Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents

1.2. Authors

The authors are Chongyu Bao¹*†, Ruimin Dai³*, Yangbo Shen¹*, Runyang Jian⁴, Jinghan Zhang³, Xiaolan Liu²†, and Kunpeng Liu³†. Their affiliations are:

  • Carnegie Mellon University (1)

  • University of Bristol (2)

  • Clemson University (3)

  • Portland State University (4)

    An asterisk (*) denotes equal contribution (Chongyu Bao, Ruimin Dai, and Yangbo Shen); a dagger (†) denotes the corresponding authors (Chongyu Bao, Xiaolan Liu, and Kunpeng Liu).

1.3. Journal/Conference

The paper is published on arXiv, a preprint server for scientific papers. At the time of this analysis, it is a preprint (arXiv:2508.03991) and has not yet been formally published in a peer-reviewed journal or conference. arXiv is a widely recognized platform for disseminating research quickly, but preprints have not undergone formal peer review.

1.4. Publication Year

The publication timestamp (UTC) is 2025-08-06T00:46:38.000Z.

1.5. Abstract

This paper introduces Galaxy, a cognition-centered framework designed for Large Language Model (LLM) agents to function as Intelligent Personal Assistants (IPAs). While existing IPAs primarily focus on responsive capabilities, Galaxy addresses the underexplored aspects of proactive behaviors, privacy preservation, and self-evolution. The core innovation is Cognition Forest, a semantic structure that unifies cognitive modeling with system-level design, creating a self-reinforcing loop where cognitive architecture drives system design, and system improvements refine cognition. Galaxy supports multidimensional interactions and personalized capability generation. Within this framework, two cooperative agents are implemented: KoRa, a cognition-enhanced generative agent for responsive and proactive skills; and Kernel, a meta-cognition-based meta-agent responsible for Galaxy's self-evolution and privacy preservation. Experimental results demonstrate that Galaxy outperforms multiple state-of-the-art benchmarks, with ablation studies and real-world cases validating its effectiveness.

2. Executive Summary

2.1. Background & Motivation

The paper addresses critical limitations in current Intelligent Personal Assistants (IPAs) like Siri and Google Assistant, especially with the advent of Large Language Model (LLM) agents.

  • Core Problem: Current LLM agents, when deployed as IPAs, predominantly focus on responsive behaviors (i.e., acting only upon explicit user commands). They largely lack proactive behaviors (acting without explicit commands), robust privacy-preserving mechanisms, and the ability for self-evolution (continuously adapting and improving their own architecture and strategies).

  • Importance of the Problem:

    • Enhanced Human Capabilities: IPAs are designed to augment human abilities and perform tasks, but their utility is constrained if they cannot anticipate needs or learn over time. Proactivity, privacy, and self-evolution are crucial for truly intelligent, helpful, and trustworthy assistants.
    • Underexplored Proactivity: While LLMs have advanced reasoning and planning, leveraging these for proactive assistance, which requires deep user modeling and intent prediction, remains a significant challenge.
    • Privacy Risks: Proactive systems often require access to sensitive user data for modeling. Cloud-based LLM inferences, in particular, exacerbate privacy concerns, making robust privacy preservation essential for user trust and adoption.
    • Fixed Cognitive Architectures: Existing LLM agents are often constrained by predefined internal modules and reasoning pipelines. They struggle to inspect, revise, or evolve their own underlying system designs or cognitive architectures, limiting their adaptability and personalization. This separation of cognitive architecture and system design hinders continuous improvement.
  • Paper's Entry Point / Innovative Idea: The paper proposes that the cognitive architecture and system design of LLM agents should not be treated separately but rather unified into a self-reinforcing loop. The core innovative idea is Cognition Forest, a semantic structure that explicitly links cognitive modeling with system-level design. This enables LLM agents not only to understand what to do and how to do it but also how it is implemented, allowing for deeper reflection and self-modification.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of LLM agents and IPAs:

  • Cognition Forest: Proposal of Cognition Forest, a novel tree-structured semantic mechanism. This structure fundamentally integrates cognitive architecture (how an agent thinks and understands) with system design (how the system is built and functions). This unification forms a self-reinforcing loop, where cognitive insights drive system modifications, and system improvements enrich the cognitive architecture. This addresses the challenge of fixed architectures.
  • Galaxy Framework: Development of Galaxy, an LLM agent framework built upon Cognition Forest. Galaxy is specifically designed to support proactive task execution (acting without explicit commands), privacy-preserving operation, and continuous adaptation (self-evolution). It offers multidimensional interaction modalities and can generate or aggregate new cognitive capabilities for personalized needs.
  • Collaborative Agents (KoRa and Kernel): Implementation of two cooperative agents within the Galaxy framework:
    • KoRa: A cognition-enhanced generative agent that acts as a human-like assistant. It supports both responsive skills (handling explicit commands) and proactive skills (anticipating needs) by grounding its cognition-to-action pipeline in the Cognition Forest, which helps mitigate persona drift and improves consistency.
    • Kernel: A meta-cognition empowered meta-agent. Kernel operates at a higher level, supervising and optimizing the entire Galaxy framework. It enables self-reflection on capability limitations, expands functionality based on user demands, and ensures privacy preservation through its Privacy Gate when interacting with cloud models.
  • Empirical Validation: Galaxy significantly outperforms multiple state-of-the-art benchmarks in areas relevant to agent capabilities. Extensive ablation studies and real-world interaction cases further validate the effectiveness of Galaxy's design principles and components, particularly the crucial roles of Kernel and the Analysis Layer modules (Agenda and Persona).
  • Key Conclusion: The paper concludes by arguing that an IPA's understanding of its users should not be static or constrained by a fixed cognitive architecture, but rather should continuously evolve through reflection on and refinement of its own system design. This integrated approach leads to more capable, adaptable, and trustworthy LLM agents.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the Galaxy framework, it's essential to grasp several core concepts:

  • Large Language Models (LLMs): LLMs are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They can perform tasks like translation, summarization, question-answering, and even code generation. Examples include GPT-4, Claude, Gemini, Qwen.
  • LLM Agents: An LLM agent is an LLM augmented with additional components that allow it to interact with its environment, perceive information, reason, plan actions, and execute them. Unlike a standalone LLM that merely generates text, an LLM agent can act. Key components often include:
    • Memory: To store past interactions and observations.
    • Planning Module: To break down complex tasks into smaller steps.
    • Tool-Use Module: To call external tools (e.g., search engines, calendars, APIs) to perform specific actions.
    • Reflection Module: To evaluate its own actions and plans for improvement.
  • Intelligent Personal Assistants (IPAs): Software agents that assist users with various tasks and information retrieval. Examples include Siri, Alexa, and Google Assistant. They typically respond to voice commands or text input.
  • Proactive vs. Responsive Behaviors:
    • Responsive Behavior: The agent acts only when explicitly prompted or commanded by the user. Most current IPAs and LLM agents primarily exhibit responsive behavior.
    • Proactive Behavior: The agent anticipates user needs or potential issues and initiates actions without explicit instructions. This requires deep user modeling, context awareness, and predictive capabilities. For example, suggesting a route change due to traffic before being asked.
  • Cognitive Architecture: In the context of LLM agents, a cognitive architecture defines the fundamental structure of an agent's "mind." It specifies the internal modules (e.g., memory, reasoning, planning, perception), how they interact, the types of information they process, and the reasoning processes they can perform. It's the blueprint of the agent's intelligence.
  • Metacognition: Often described as "cognition about cognition," metacognition refers to an agent's ability to monitor, understand, and control its own cognitive processes. For LLM agents, this means an agent can reflect on its own reasoning, identify limitations, and potentially adjust its internal strategies or even its cognitive architecture.
  • Generative Agents: A type of LLM agent (e.g., Generative Agents by Park et al., 2023) designed to simulate human-like behavior, including memory streams, reflection, and planning, to create consistent and believable actions in interactive environments. They generate responses and actions based on their internal state and perceived environment.
  • Privacy Preservation (Data Masking/Anonymization): Techniques used to protect sensitive user information while still allowing data to be used for analysis or LLM inference. Masking involves replacing sensitive data with generic placeholders or aggregated values. Anonymization aims to remove personally identifiable information entirely. The goal is to minimize privacy leakage.
  • Self-Evolution (Continuous Adaptation/Self-Improvement): The capacity of an LLM agent to continually learn, adapt, and improve its own internal architecture, capabilities, and interaction strategies over time, often in response to new experiences, user feedback, or detected limitations. This goes beyond just learning new facts; it involves modifying the system itself.
  • Persona Drift: In generative agents, persona drift refers to the phenomenon where an agent's simulated personality, characteristics, or consistent behavior patterns gradually change or deviate from its intended or established persona over long periods of interaction, often due to an accumulation of varied experiences or insufficient grounding.

3.2. Previous Works

The paper discusses several categories of prior LLM agent research and specific systems, highlighting their advancements and limitations:

  • Conversational Agents (e.g., Wahde and Virgolin 2022; Guan et al. 2025; Jiang et al. 2021): These agents primarily interact through dialogue and execute tasks by calling external tools. They focus on understanding natural language intent and responding appropriately within a conversational context.
    • Limitation: While good for dialogue, they typically lack proactive capabilities that initiate actions without explicit commands.
  • Autonomous Agents (e.g., Chen et al. 2025; Belle et al. 2025): These agents operate within specific environments (e.g., simulated worlds or specific applications) and often focus on executing single, well-defined tasks autonomously. They demonstrate strong task-planning and tool-invoking capabilities.
    • Limitation: Primarily focused on responsive behaviors within their environment; limited support for proactive skills or continuous adaptation of their own architecture.
  • Multi-agent Systems (e.g., Zhou et al. 2024; Chen et al. 2024): These systems involve multiple LLM agents collaborating to divide and conquer complex tasks, leveraging specialized roles to improve scalability and efficiency. MetaGPT (Zhou et al. 2024) is an example that produces efficient solutions through multi-agent collaboration.
    • Limitation: While they advance collaborative capabilities, the focus is still largely on responsive behaviors in task execution, with limited attention to proactivity, privacy, or self-evolution of the overall system.
  • LLMPA (Guan et al. 2023): An LLM-based process automation system embedded in mobile applications to complete complex operations under natural language instructions.
    • Context: Showcases LLMs in practical mobile task automation.
  • WebPilot (Zhang et al. 2025b) and Mind2web (Deng et al. 2023): These are GUI agents designed to perform multi-step interactions on arbitrary websites, demonstrating advanced web automation capabilities.
    • Context: Highlight the ability of LLM agents to interact with graphical user interfaces.
  • Proactive Agents (e.g., Liao, Yang, and Shah 2023; Lu et al. 2024): Some works have started to explore inferring user intent to enable proactive features. Liao, Yang, and Shah (2023) can infer user intent but remain confined to dialogue interactions, while Lu et al. (2024) emphasize multi-source perception for the deep user modeling needed for intent prediction.
    • Limitation: These efforts are nascent and generally do not address the combined challenges of directly triggering concrete operations, privacy risks, and self-evolution.
  • Privacy in LLM Agents (e.g., Gan et al. 2024; Hahm et al. 2025; Zeng et al. 2024): Research exists on security, privacy, and ethical threats in LLM-based agents (Gan et al. 2024), enhancing safety via causal influence prompting (Hahm et al. 2025), and privacy-preserving inference (Zeng et al. 2024).
    • Context: Acknowledges the importance of privacy, but often treated as a separate challenge rather than integrated into a holistic agent design.
  • Self-Evolution and Metacognition (e.g., Li, Zhang, and Sun 2023; Liu and van der Schaar 2025; Yanfang Zhou et al. 2025; Hu, Lu, and Clune 2024; Yin et al. 2024):
    • Generative Agents (Park et al. 2023): Uses memory stream, reflection, and planning to simulate consistent human-like behavior. This is a foundational work for KoRa's architecture.
    • Metaagent-P (Yanfang Zhou et al. 2025): Improves performance by reflecting on current workflows.
    • Meta-agents that inspect and revise their own code (Li, Zhang, and Sun 2023; Liu and van der Schaar 2025).
    • Automated design of agentic systems (Hu, Lu, and Clune 2024; Yin et al. 2024) that generate stronger modules or new agents.
    • Limitation: These studies often lack integration with task context or system constraints. The depth of metacognitive ability is often constrained by the underlying fixed cognitive architecture. Automated design relies on preset evaluation standards, struggling with sustained, open-ended evolution.

3.3. Technological Evolution

The evolution of intelligent assistants has progressed from rule-based systems to more sophisticated AIs.

  1. Early IPAs (e.g., pre-LLM Siri/Alexa): Primarily relied on predefined rules, scripts, and limited natural language processing capabilities. They were good at executing specific commands but lacked deeper understanding, context retention, and adaptability.

  2. Rise of LLMs: The development of transformer architectures and large-scale pre-training revolutionized natural language understanding and generation. LLMs gained impressive causal reasoning and task-planning abilities.

  3. LLM-based Agents: This integrated LLMs with components like memory, planning, and tool-use, transforming them into agents capable of interacting with environments and performing multi-step tasks. This represents a significant leap from static LLMs to dynamic, action-oriented systems. Research then branched into conversational, autonomous, and multi-agent systems.

  4. Current Frontier (Proactivity, Privacy, Self-Evolution): The field is now moving towards agents that are not just reactive but proactive, not just capable but privacy-preserving, and not just static but self-evolving. This requires moving beyond fixed cognitive architectures to systems that can inspect and modify their own design.

    Galaxy fits into this timeline by pushing the boundaries of LLM agents into the frontier stage. It directly addresses the shortcomings of previous LLM agent designs by proposing a unified framework for proactivity, privacy, and self-evolution, enabled by a novel integration of cognitive architecture and system design.

3.4. Differentiation Analysis

Compared to prior research, Galaxy's core differences and innovations are:

  • Unified Cognitive Architecture and System Design: The most significant innovation is the Cognition Forest, which explicitly unifies an agent's cognitive architecture (its internal understanding and reasoning) with its underlying system design (its code and functional implementation). Previous works typically treat these as separate, leading to fixed architectures where metacognition is limited to improving reasoning within the given architecture, not modifying the architecture itself. Galaxy creates a self-reinforcing loop for alternating optimization.

  • Holistic Approach to Proactivity, Privacy, and Self-Evolution: Unlike most existing works that focus on one or two of these aspects, Galaxy is designed from the ground up to jointly address all three.

    • Proactivity: Achieved through the Analysis Layer (Agenda, Persona) for deep user modeling and KoRa's Cognition-Action Pipeline grounded in Cognition Forest.
    • Privacy Preservation: Managed by Kernel's Privacy Gate with multi-level masking, specifically designed for cloud-based LLM inferences.
    • Self-Evolution: Enabled by Kernel's meta-agent capabilities, which can inspect, adapt, and extend Spaces and even core system structures based on user needs and observed limitations, moving beyond preset evaluation standards.
  • Cognition-Enhanced Generative Agent (KoRa): While KoRa builds on generative agent architectures, it uses the Cognition Forest to provide long-horizon semantic constraints, which helps mitigate issues like persona drift and improves the consistency of behaviors, especially in long-term proactive assistance.

  • Framework-Level Meta-Agent (Kernel): Kernel acts as a meta-agent that operates at the framework-level. It not only oversees KoRa's cognitive execution but also inspects and adapts the underlying system structures (Spaces, Cognition Forest itself). This gives Galaxy a deeper capacity for self-improvement than prior meta-agents that might only reflect on workflows or generate code within fixed architectural constraints.

  • Multidimensional Interactions (Spaces): Galaxy extends interaction modalities beyond chat windows through Spaces, which are cognitively accessible and interactable modules. These Spaces can be customized or auto-generated, further enhancing personalization and the system's perceptual scope, a deeper integration than simple tool invocation.

    In essence, Galaxy differentiates itself by introducing a unified, adaptable "cognitive blueprint" that allows the LLM agent to intelligently perceive, understand, act proactively, protect privacy, and fundamentally re-engineer itself in response to user needs and operational experience.

4. Methodology

4.1. Principles

The core principle behind Galaxy is the unification of cognitive architecture and system design into a self-reinforcing loop. Instead of treating these as separate entities, Galaxy posits that an LLM agent's understanding (cognition) should directly inform and evolve its underlying structure (system design), and in turn, enhancements in the system design should enrich and refine its cognitive capabilities. This principle is embodied in the Cognition Forest, a semantic structure that provides a comprehensive, hierarchical cognitive context for the agents (KoRa and Kernel) while also integrating the design principles and actual code for reuse and modification. This allows LLM agents to not only know what to do and how to do it, but also to understand how it is implemented, enabling deeper metacognition and self-evolution.

4.2. Core Methodology In-depth (Layer by Layer)

The Galaxy framework operates on a Perception-Analysis-Execution paradigm, supported by a central Cognition Forest and two specialized LLM agents, KoRa and Kernel. Figure 1 provides an overview of this framework.

4.2.1. Overall Framework (Figure 1)

As illustrated in Figure 1, Galaxy integrates user interactions via Spaces and Chat Window. The core components are the Cognition Forest, KoRa, and Kernel, supported by Interaction Layer, Analysis Layer, and Execution Layer.

The Interaction Layer perceives user interaction states and contextual signals. The Analysis Layer stores and organizes user data, conducting short-term and long-term user modeling. The Execution Layer generates plans, schedules tasks, and executes actions.

Figure 1: Framework of the proposed Galaxy IPA. The figure sketches the composition of the Cognition Forest and its relationship to the LLM agents KoRa and Kernel: the left side shows the framework's modules, including user information and the Interaction, Analysis, and Execution Layers; the right side illustrates how the LLM agents analyze and act.

Key modules and their roles:

  • Cognition Forest: The framework's unified cognitive and metacognitive architecture, structured as multiple semantic subtrees. It provides KoRa and Kernel with comprehensive, hierarchical cognitive context.
  • Spaces: Cognition-driven personalized interaction modules that capture multi-dimensional information during user interactions.
  • Agenda: Models user behavior from perceived event information, generating scheduling recommendations to guide KoRa's autonomous actions.
  • Persona: Performs comprehensive, long-term modeling of user characteristics to support KoRa's delegated decision-making.
  • KoRa: A cognition-enhanced generative agent capable of proactive delegation without explicit instructions, or efficient execution when instructed.
  • Kernel: A metacognition-empowered meta agent operating outside the three main layers. It's responsible for maintaining system stability, safeguarding privacy, and enabling Galaxy's evolution.

4.2.2. Cognition Forest

The Cognition Forest ($\mathcal{F}$) is a fundamental semantic structure proposed to unify different cognition dimensions with their underlying system designs.

Definition: Cognition Forest $\mathcal{F}$ is a structured forest consisting of four subtrees: $\mathcal{F} = \{ \mathcal{T}_{\mathrm{user}}, \mathcal{T}_{\mathrm{self}}, \mathcal{T}_{\mathrm{env}}, \mathcal{T}_{\mathrm{meta}} \}$, where:

  • $\mathcal{T}_{\mathrm{user}}$: Represents personalized modeling of the user, maintained by Persona.

  • $\mathcal{T}_{\mathrm{self}}$: Describes Galaxy itself, its internal agents like KoRa, and their roles and capabilities.

  • $\mathcal{T}_{\mathrm{env}}$: Represents the operational environment, including perceivable Space modules and system tools.

  • $\mathcal{T}_{\mathrm{meta}}$: Represents the system's metacognition, such as execution pipelines.

    Uniqueness: The key differentiator of Cognition Forest is its association of each cognitive element with its corresponding system design. This means LLM agents not only understand what to do (semantic understanding) and how to do it (mapped system function) but also how it is implemented (concrete implementation code). This deepens the framework's metacognition beyond traditional cognitive architectures.

Node Structure: Each node within these subtrees is represented by three dimensions:

  1. Semantic: The LLM's semantic understanding or natural language meaning of the concept.

  2. Function: The corresponding system function or callable element mapped to the semantic meaning.

  3. Design: The concrete implementation code or design principles for that function.

    Example: For a write_text node in a Memo Space:

  • Semantic: "writing new content to memo"

  • Function: write_text()

  • Design: The actual implementation code for write_text().

    This design allows Kernel to reflect on execution failures (e.g., incorrect sequence, implementation errors) and perform deeper modifications, as it understands the design alongside the function and semantics.
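
The paper does not include an implementation of this node structure, but it can be made concrete with a small sketch. The following is a minimal illustration in Python; the class name `CognitionNode` and the toy `write_text` design string are assumptions made here, not the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class CognitionNode:
    """Hypothetical sketch of a Cognition Forest node with the three
    dimensions described above: Semantic, Function, and Design."""
    semantic: str   # natural-language meaning, readable by the LLM
    function: str   # mapped callable element of the system
    design: str     # concrete implementation code or design notes
    children: list["CognitionNode"] = field(default_factory=list)

# The write_text node of the Memo Space, expressed in this representation:
memo_write = CognitionNode(
    semantic="writing new content to memo",
    function="write_text()",
    design="def write_text(content):\n    memo.append(content)",
)
```

Because the Design dimension carries real implementation detail, a meta-agent such as Kernel can inspect and rewrite it, which a plain semantic-to-function tool registry cannot support.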

4.2.3. Sensing and Interaction Protocol (Spaces)

Spaces are a protocol designed to encapsulate heterogeneous information sources into unified, cognitively accessible, and interactable modules. They serve as the system's extensible Interaction Layer.

Objective: To overcome the limitation of most IPAs where cognitive architectures are constrained by underlying system design, and extensibility at the Interaction Layer is limited, hindering personalization.

Approach: Galaxy treats each Space function as a local execution container and an independent subtree within Cognition Forest. Spaces can be user-customized or automatically generated, expanding both the system's perceptual scope and interactive capabilities.

Each Space consists of the following components:

  • Perception Window: Continuously observes user actions and environmental signals. It converts raw inputs into structured TimeEvent entries and state snapshots, unifying them into a consistent, temporally grounded context for the Analysis Layer.

  • Interaction Component: Can act as a standalone, personalized module providing a user interface and interaction nodes accessible to both the user and KoRa.

  • Cognitive Protocol: Provides a unified development and integration standard for all Spaces. It specifies how high-level intents are translated into concrete system operations, ensuring each Space can be consistently embedded into the Cognition Forest for reasoning and task execution.

    Unlike simple LLM agent tool generation, Galaxy's Spaces are deeply embedded within the system's cognition, functioning as integral "organs" rather than detachable tools.
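
To make the three components concrete, here is a minimal sketch of what a Space conforming to the Cognitive Protocol could look like. The base class, its method names, and the toy `MemoSpace` are assumptions for illustration only, not the paper's implementation.

```python
from abc import ABC, abstractmethod
from typing import Any

class Space(ABC):
    """Hypothetical Space base class: a local execution container whose
    capabilities can be embedded as a subtree of the Cognition Forest."""

    @abstractmethod
    def perceive(self) -> list[dict[str, Any]]:
        """Perception Window: convert raw user/environment signals into
        structured, temporally grounded entries for the Analysis Layer."""

    @abstractmethod
    def interact(self, action: str, **kwargs: Any) -> Any:
        """Interaction Component: interaction nodes accessible to both
        the user and KoRa."""

    @abstractmethod
    def cognition_subtree(self) -> dict[str, Any]:
        """Cognitive Protocol: export semantic/function/design triples so
        the Space can be grounded for reasoning and task execution."""

class MemoSpace(Space):
    """A toy memo Space, reusing the write_text example from above."""
    def __init__(self) -> None:
        self.entries: list[str] = []

    def perceive(self) -> list[dict[str, Any]]:
        return [{"event": "memo_size", "value": len(self.entries)}]

    def interact(self, action: str, **kwargs: Any) -> Any:
        if action == "write_text":
            self.entries.append(kwargs["content"])
        return self.entries

    def cognition_subtree(self) -> dict[str, Any]:
        return {
            "semantic": "personal memo for notes",
            "function": "write_text(content)",
            "design": "appends content to an in-memory list of entries",
        }
```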

4.2.4. User Behavior and Cognitive Modeling (Agenda & Persona)

The Analysis Layer is responsible for modeling user behavior and preferences to support proactive skills. It comprises Agenda and Persona.

4.2.4.1. Agenda

Agenda models explicit schedules and implicit behavioral patterns to anticipate and interpret upcoming events.

  • TimeEvent: Agenda uses a unified TimeEvent to represent two types of events:

    • Schedule: Denotes explicit user schedules (e.g., "class at 18:30 on June 18").
    • Behavior: Denotes observed operational actions (e.g., "translated documents in the chat_window in the morning").
  • Schedule Draft: The Interaction Layer extracts event content and time ranges, writing them to Schedule Draft. Uncertain or conflicting events are routed to an alignment queue for resolution.

  • Behavior Patterns: All TimeEvent entries are retained for long-term behavior modeling. Each behavior is represented as a structured triple: (time, tool, semantic intent). Galaxy clusters these behaviors along tool and semantic dimensions to identify recurring Behavior Patterns (a code sketch follows Figure 2).

  • Daily Plan Generation: Based on the user's schedule, Agenda drafts an initial plan, suggesting relevant Behavior Patterns for open time slots. This proposed daily plan is shared with the user for confirmation. Once approved, a summary of next-day actions is passed to KoRa for timely assistance.

    Figure 2 illustrates how observed behavior patterns inform the Agenda and Persona.


Figure 2: The user interface of the Galaxy framework and the interactions of two intelligent agents, KoRa and Kernel. It includes the user's daily preferences, plans, and behavior patterns, while also showcasing KoRa's proactive and responsive modes, along with the design elements of operational spaces.

4.2.4.2. Persona

Persona maintains a growing User Cognition Tree ($\mathcal{T}_{\mathrm{user}}$), which is a subtree of the Cognition Forest ($\mathcal{F}$).

  • User Insights: Galaxy uses LLMs to aggregate dialogues and Space interactions into user insights. Each insight contains a natural language summary and a semantic embedding. These are high-level semantic cognitions, not just statistical aggregates.
  • Node Management:
    • Similar insights accumulating beyond a threshold are promoted to a long-term node.
    • Insights similar to an existing node are merged, and the node's timestamp is refreshed.
    • Nodes unused for a long period decay and are removed.
    • Stable identity information (e.g., name, phone number) is inserted into an identity branch upon first discovery.
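
The node-management rules above amount to a small maintenance routine over the User Cognition Tree. A minimal sketch, assuming cosine similarity over insight embeddings and illustrative thresholds (the paper specifies neither):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class InsightNode:
    summary: str            # natural-language summary of the insight
    embedding: list[float]  # semantic embedding of the summary
    last_used: datetime
    support: int = 1        # number of merged similar insights

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def update_tree(nodes: list[InsightNode], summary: str, emb: list[float],
                now: datetime, sim: float = 0.85,
                decay: timedelta = timedelta(days=90)) -> list[InsightNode]:
    """Merge a new insight into a similar node (refreshing its timestamp),
    otherwise add it as a candidate node; then drop long-unused nodes."""
    for node in nodes:
        if cosine(node.embedding, emb) >= sim:
            node.support += 1     # accumulation toward long-term status
            node.last_used = now  # refresh the node's timestamp
            break
    else:
        nodes.append(InsightNode(summary, emb, last_used=now))
    return [n for n in nodes if now - n.last_used <= decay]
```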

4.2.5. KoRa: Intelligent Butler for User

KoRa is the cognition-enhanced generative agent responsible for direct user interaction, proactive schedule management, and real-time request handling.

Objective: To enable proactive schedule management and real-time request handling while maintaining task consistency, avoiding conflicting operations (e.g., pre-scheduled vs. manual booking), and mitigating persona drift.

Approach:

  • Generative Agent Architecture: KoRa adopts a generative agent architecture (similar to Park et al., 2023) with memory stream, planning, and reflection modules.

  • Structured State Stack: To handle interruptions and resume execution in responsive mode, KoRa uses a structured state stack instead of a simple memory stream. This stack records task type, source, and execution details.

  • Execution Flow: KoRa follows a top-down execution flow to advance tasks from the daily plan generated by Agenda in the Analysis Layer.

  • Cognition-Action Pipeline: To address personality forgetting and behavior drift, KoRa integrates a cognitive architecture grounded in Cognition Forest. This hierarchical semantic space supports intent parsing, semantic routing, and the construction of behavior chains.

    KoRa's operating Cognition Forest subset, $\mathcal{F}^{\mathrm{KoRa}}$, includes the components relevant to its tasks: $\mathcal{F}^{\mathrm{KoRa}} = \{ \mathcal{T}_{\mathrm{user}}, \mathcal{T}_{\mathrm{self}}^{\mathrm{KoRa}}, \mathcal{T}_{\mathrm{env}}^{\mathrm{KoRa}}, \mathcal{T}_{\mathrm{dialogue}} \}$, where:

  • $\mathcal{T}_{\mathrm{user}}$: The User Cognition Tree maintained by Persona.

  • $\mathcal{T}_{\mathrm{self}}^{\mathrm{KoRa}} \subset \mathcal{T}_{\mathrm{self}}$: Represents KoRa's specific capabilities and role within Galaxy.

  • $\mathcal{T}_{\mathrm{env}}^{\mathrm{KoRa}} \subset \mathcal{T}_{\mathrm{env}}$: Includes any callable elements within Spaces relevant to KoRa's tasks.

  • $\mathcal{T}_{\mathrm{dialogue}}$: Collects fallback or vague-intent utterances, serving as the default entry point for open-ended interactions.

    As illustrated in Figure 3, KoRa's Cognition-Action Pipeline processes an intent $M$ through three main stages:

Figure 3: Execution pipeline of KoRa. The user's intent or KoRa's plan is parsed and grounded through the Cognition Forest. KoRa extracts relevant semantic paths, performs reasoning, generates contextual content, and assembles an execution chain. If essential information is missing, execution is suspended until alignment is completed.

  1. Semantic Routing: KoRa first locates relevant cognitive paths (e.g., ["env", "user", "self"]) by traversing the Cognition Forest and selecting branches that semantically align with the intent $M$.

  2. Forest Retrieval: For each identified path, KoRa retrieves supporting nodes from the corresponding subtree based on contextual cues, lexical similarity, or inferred relevance. These nodes provide the Semantic, Function, and Design information needed.

  3. Action Chain Construction: Guided by the retrieved content, KoRa assembles a structured Action Chain. This chain comprises discrete operations such as generating content, aligning intent, invoking system functions (e.g., send_email(address, content)), and composing natural language responses.

    Missing Information Handling: If any required information is missing (e.g., incomplete parameters for a function, failed node retrieval), KoRa suspends the current chain. It then interacts with the user in natural language to align the missing information before resuming execution. KoRa uses cloud-based LLM inference for its operations, and Kernel ensures privacy isolation for this.
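
The three stages can be made concrete with a small sketch. This is a minimal illustration, assuming keyword overlap in place of LLM-based semantic alignment; all class and function names here are invented.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ForestNode:
    semantic: str
    function: str
    params: list[str] = field(default_factory=list)

def semantic_routing(intent: str,
                     forest: dict[str, list[ForestNode]]) -> list[str]:
    """Stage 1: select cognitive paths whose nodes share terms with the
    intent (keyword overlap stands in for LLM semantic alignment)."""
    terms = set(intent.lower().split())
    return [path for path, nodes in forest.items()
            if any(set(n.semantic.lower().split()) & terms for n in nodes)]

def forest_retrieval(paths: list[str],
                     forest: dict[str, list[ForestNode]]) -> list[ForestNode]:
    """Stage 2: collect supporting nodes from each matched subtree."""
    return [n for p in paths for n in forest[p]]

def action_chain(nodes: list[ForestNode],
                 known: dict[str, Any]) -> list[str] | None:
    """Stage 3: assemble the chain; return None to suspend execution when
    a required parameter is missing, so KoRa can align with the user."""
    chain: list[str] = []
    for n in nodes:
        if any(p not in known for p in n.params):
            return None  # suspend until alignment completes
        chain.append(n.function)
    return chain
```

For the send_email example in the text, a node with `params` of `["address", "content"]` would suspend the chain until both values have been aligned with the user.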

4.2.6. Kernel: Framework-Level Meta Agent

Kernel is the metacognition-empowered meta agent responsible for the overarching stability, privacy, and self-evolution of the Galaxy framework. It operates outside the Perception-Analysis-Execution layers.

Objective: To ensure robustness in LLM-based reasoning, addressing privacy concerns with cloud-based inference and mitigating hallucinations from lightweight local models. It provides recovery mechanisms and self-monitoring for self-evolution.

Approach: Kernel uses the MetaCognition Tree ($\mathcal{T}_{\mathrm{meta}}$) to monitor internal reasoning and catch potential execution failures. Unlike most systems, Kernel can revise reasoning flows when the cognitive architecture itself becomes a bottleneck. It is implemented as a meta agent with the ability to reason across both functional logic and architectural dependencies, enabling targeted adjustments to system configurations.

Kernel operates through three principal mechanisms:

  1. Oversee: Kernel continuously monitors Galaxy's execution pipelines, including LLM calls across all three layers (Interaction, Analysis, Execution) and KoRa's task behavior. Upon detecting abnormal patterns, it triggers meta-reflection and executes predefined failure-handling routines to ensure stable system operation.

  2. User-Adaptive System Design: Kernel identifies latent user needs based on long-term behavioral trends (from Analysis Layer), confirms them through lightweight user alignment, and then modifies or extends relevant Spaces accordingly. It functions as a minimal, self-contained control unit with a local code interpreter and rule engine, allowing self-checks and recovery even offline. This directly leverages the Design dimension of Cognition Forest nodes.

  3. Contextual Privacy Management: Kernel maintains an Autonomous Avatar aligned with the User Cognition Tree ($\mathcal{T}_{\mathrm{user}}$) to represent user context. It regulates data exposure through an LLM-based Privacy Gate, as shown in Figure 4.

Figure 4: Workflow of Privacy Gate. Privacy Gate defines four levels of masking (L1–L4), where higher levels apply stricter anonymization across more attributes. The figure also shows the relationship between the real user information and the virtual avatar profile; in the illustrated example, information passed through the Privacy Gate is transformed at level L3.

Privacy Gate Workflow: Before transmitting data to the cloud, Privacy Gate applies masking to safeguard sensitive content while preserving task-relevant information. After receiving results from the cloud LLM, Kernel selectively demasks data to restore the necessary context for downstream use. Privacy Gate defines four levels of masking (L1-L4), where higher levels apply stricter anonymization across more attributes. This contextual approach ensures that privacy protection is dynamically adjusted based on sensitivity and task requirements.
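
A minimal sketch of this mask-then-demask cycle follows. The attribute groupings per level and the regex-based detectors are assumptions for illustration; the paper states only that L1–L4 apply progressively stricter anonymization across more attributes, and the real system uses an LLM-based gate rather than regexes.

```python
import re

# Attributes masked at each level; higher levels mask strictly more.
# The groupings below are invented for illustration.
LEVELS = {
    1: ("phone",),
    2: ("phone", "email"),
    3: ("phone", "email", "name"),
    4: ("phone", "email", "name", "address"),
}

PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3,4}[-.\s]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "name": re.compile(r"\bAlice Chen\b"),  # toy stand-in for NER
    "address": re.compile(r"\b\d+ [A-Z]\w+ (?:St|Ave|Rd)\b"),
}

def mask(text: str, level: int) -> tuple[str, dict[str, str]]:
    """Replace sensitive spans with placeholders before the cloud call,
    keeping a table so Kernel can demask the response afterwards."""
    table: dict[str, str] = {}
    for attr in LEVELS[level]:
        for i, m in enumerate(PATTERNS[attr].finditer(text)):
            table[f"<{attr.upper()}_{i}>"] = m.group(0)
    for placeholder, original in table.items():
        text = text.replace(original, placeholder)
    return text, table

def demask(text: str, table: dict[str, str]) -> str:
    """Selectively restore originals in the cloud model's output."""
    for placeholder, original in table.items():
        text = text.replace(placeholder, original)
    return text

masked, table = mask("Email Alice Chen at alice@example.com", level=3)
# masked == "Email <NAME_0> at <EMAIL_0>"
```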

4.2.7. From Cognitive Architecture to System Design, and Back Again

This section reiterates the core philosophy of Galaxy, outlining the closed-loop mechanism of alternating optimization between Cognition Forest and system design:

  1. Cognition drives understanding: Galaxy interprets user needs and intentions by grounding its understanding in its cognitive architecture (e.g., Cognition Forest).

  2. Cognition triggers reflection: Galaxy assesses whether its current framework capabilities adequately address user needs and identifies unmet requirements. This is where Kernel's metacognition comes into play, utilizing the Design dimension of Cognition Forest nodes.

  3. Reflection guides system design: Galaxy translates these unmet needs into new system design goals and autonomously improves system capabilities (e.g., generating new Spaces, modifying existing code). This modification directly impacts the Design dimension of Cognition Forest nodes.

  4. Design reinforces cognition: Newly introduced or modified system structures (e.g., new Space modules, refined execution pipelines) create additional cognitive pathways and sensing capabilities. These, in turn, strengthen and optimize the original cognitive architecture itself (e.g., by adding new nodes to Cognition Forest or refining existing ones).

    This loop highlights the co-constructive nature of cognitive architecture and system design in Galaxy, enabling continuous self-evolution guided by user needs.

5. Experimental Setup

5.1. Datasets

To evaluate the comprehensive capabilities of the Galaxy framework, the authors employ three public benchmarks: AgentBoard, PrefEval, and PrivacyLens.

  • AgentBoard (Ma et al. 2024):

    • Description: This benchmark uses six types of tasks to simulate a multi-round interactive environment for LLM agents. It aims to assess an agent's ability to handle complex interactions and achieve specific goals over multiple steps.
    • Characteristics: Focuses on the completion rate of an entire behavioral chain, simulating real-world interactive scenarios.
    • Why chosen: To validate Galaxy's general performance in complex, multi-turn interactive environments.
  • PrefEval (Zhao et al. 2025):

    • Description: This benchmark specifically evaluates whether LLM agents can maintain user preferences consistently throughout long conversations. It assesses the agent's ability to remember and apply personalized preferences without explicit re-statement.
    • Characteristics: It measures preference retention accuracy in two settings: Zero-Shot (without reminding users of their preferences) and Reminder (by reminding users of their preferences). This tests the long-term memory and consistency of user modeling.
    • Why chosen: To validate Galaxy's capabilities in long-term user modeling and personalized preference retention, which is crucial for proactive assistance.
  • PrivacyLens (Shao et al. 2025):

    • Description: This benchmark measures the ability of LLM agents to understand and adhere to privacy norms when performing real-world tasks. It evaluates how well agents protect sensitive user information during operations.
    • Characteristics: It comprehensively evaluates privacy protection using metrics like helpfulness, privacy leakage rate, and accuracy. This involves understanding sensitive information and applying appropriate safeguards.
    • Why chosen: To specifically validate Galaxy's privacy-preserving mechanisms, particularly the Privacy Gate managed by Kernel.

5.2. Evaluation Metrics

For each benchmark, specific metrics are used to quantify performance.

  • For AgentBoard:

    1. Conceptual Definition: Target Achievement Rate (TAR) measures the percentage of tasks where the LLM agent successfully completes the target goal across the entire multi-round interactive behavior chain. It assesses the agent's ability to execute complex, multi-step plans correctly.
    2. Mathematical Formula: While the paper does not provide an explicit formula for Target Achievement Rate, it is generally defined as: $ \mathrm{TAR} = \frac{\text{Number of successfully completed tasks}}{\text{Total number of tasks}} \times 100\% $
    3. Symbol Explanation:
      • Number of successfully completed tasks: The count of tasks where the agent reached the defined target state or outcome.
      • Total number of tasks: The total number of tasks attempted by the agent.
  • For PrefEval:

    1. Conceptual Definition: Preference Retention Accuracy measures how accurately an LLM agent remembers and applies a user's stated preferences over multi-round conversations. This metric is crucial for personalization. It's evaluated in two modes:
      • Zero-Shot (Z): The agent is not reminded of the user's preferences in subsequent turns. It must recall them from its memory/modeling.
      • Reminder (R): The user's preferences are explicitly reminded to the agent in subsequent turns, testing its ability to consistently apply them when recalled.
    2. Mathematical Formula: The paper presents results for different numbers of conversation turns (e.g., 10 and 300 rounds). The accuracy for a given number of rounds would be: $ \mathrm{Accuracy} = \frac{\text{Number of turns where preferences were correctly applied}}{\text{Total number of turns where preferences were relevant}} \times 100\% $
    3. Symbol Explanation:
      • Number of turns where preferences were correctly applied: The count of conversational turns where the agent's response or action correctly reflected the user's previously stated preferences.
      • Total number of turns where preferences were relevant: The total number of conversational turns where user preferences were applicable and should have been considered by the agent.
  • For PrivacyLens:

    1. Conceptual Definition: PrivacyLens uses three metrics to comprehensively evaluate privacy protection:
      • Helpfulness (Help.): Measures the quality and utility of the agent's output from the user's perspective, ensuring that privacy measures do not overly degrade task performance. A higher value indicates better user satisfaction.
      • Privacy Leakage Rate (LR / LRh): Quantifies the percentage of sensitive information that is inadvertently exposed or inferable from the agent's output. LR likely refers to a general leakage rate, while LRh might be a variant like leakage rate for highly sensitive information. A lower value indicates better privacy protection.
      • Accuracy (Acc.%): Measures the correctness of the agent's task execution, specifically in the context of privacy-sensitive tasks. It indicates whether the agent completed the task successfully while adhering to privacy norms.
    2. Mathematical Formula:
      • Helpfulness: This is often a subjective metric, typically measured via human evaluation (e.g., Likert scale scores) or LLM-based evaluators. If using a scale (e.g., 1-5), the formula might be: $ \mathrm{Helpfulness} = \frac{\sum_{i=1}^{N} \mathrm{Score}_i}{N} $ where $N$ is the number of evaluations and $\mathrm{Score}_i$ is the helpfulness score for evaluation $i$.
      • Privacy Leakage Rate: $ \text{Privacy Leakage Rate} = \frac{\text{Number of leaked sensitive items}}{\text{Total number of sensitive items present}} \times 100\% $
      • Accuracy: $ \mathrm{Accuracy} = \frac{\text{Number of correctly completed privacy-sensitive tasks}}{\text{Total number of privacy-sensitive tasks}} \times 100\% $
    3. Symbol Explanation:
      • $\mathrm{Score}_i$: The helpfulness score given for a specific agent interaction or output.
      • Number of leaked sensitive items: The count of individual pieces of sensitive information that were exposed (e.g., name, address, phone number).
      • Total number of sensitive items present: The total count of all sensitive pieces of information that could potentially be leaked in the context.
      • Number of correctly completed privacy-sensitive tasks: The count of tasks that were executed successfully without compromising privacy.
      • Total number of privacy-sensitive tasks: The total count of tasks that involved sensitive information and required privacy consideration.
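
Taken together, the headline metrics above are simple ratios. As a toy computation (all numbers invented for illustration, not taken from the paper):

```python
def pct(numerator: int, denominator: int) -> float:
    """Shared form of TAR, preference accuracy, and leakage rate."""
    return 100.0 * numerator / denominator

tar      = pct(42, 50)  # 84.0% of tasks completed end-to-end
pref_acc = pct(18, 20)  # 90.0% of relevant turns honored the preference
leak     = pct(3, 20)   # 15.0% of sensitive items exposed (lower is better)
```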

5.3. Baselines

The paper compares Galaxy against several state-of-the-art LLM agents from major providers. These baselines represent leading performance in various LLM agent capabilities. The performance of Galaxy without Kernel (Galaxy (w/o Kernel)) is also included as an ablation baseline to specifically highlight Kernel's contribution.

The compared LLM agents are:

  • GPT-4o (OpenAI)

  • GPT-o1-pro (OpenAI; rendered "GPT-01-pro" in the paper, presumably the o1-pro model)

  • Claude-Opus-4 (Anthropic)

  • Claude-Sonnet-4 (Anthropic)

  • Deepseek-Chat (DeepSeek)

  • Deepseek-Reasoner (DeepSeek)

  • Gemini-2.0-Flash (Google)

  • Gemini-2.5-Flash (Google)

  • Qwen-Max (Alibaba Cloud)

  • Qwen3 (Alibaba Cloud)

    For Galaxy's configuration:

  • Local model within Kernel: Qwen2.5-14B

  • Cloud-based model in KoRa: GPT-4o-mini

    Experiments were run on an M3 Max platform with macOS, and average results over 100 trials are reported.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Galaxy significantly outperforms existing LLM agents and Galaxy (w/o Kernel) across multiple benchmarks, particularly highlighting the crucial role of the Kernel meta-agent in preference retention and privacy protection.

The following are the results from Table 1 of the original paper:

ALF–TQ are the six AgentBoard subtasks; Z10/R10/Z300/R300 are PrefEval Zero-Shot and Reminder accuracy at 10 and 300 rounds; Acc.%, LR, LRh, and Help. are the PrivacyLens metrics.

| LLM Agents | ALF | SW | BA | JC | PL | TQ | Z10 | R10 | Z300 | R300 | Acc.% | LR | LRh | Help. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 54.5 | 19.7 | 67.5 | 99.4 | 85.1 | 99.2 | 7.0 | 98.0 | 0.0 | 78.0 | 97.0 | 50.5 | 51.0 | 2.71 |
| GPT-o1-pro | 87.2 | 39.0 | 90.2 | 99.6 | 95.7 | 96.3 | 37.0 | 98.0 | 7.0 | 98.0 | 92.0 | 52.5 | 53.0 | 2.83 |
| Claude-Opus-4 | 86.2 | 38.5 | 92.5 | 99.8 | 95.7 | 99.5 | 3.0 | 98.0 | 1.0 | 87.0 | 97.5 | 38.5 | 39.0 | 2.73 |
| Claude-Sonnet-4 | 77.1 | 38.2 | 92.2 | 99.8 | 98.6 | 99.0 | 14.0 | 96.0 | 1.0 | 85.0 | 98.0 | 24.0 | 24.5 | 2.73 |
| Deepseek-Chat | 17.5 | 9.8 | 55.4 | 99.2 | 41.7 | 95.3 | 1.0 | 92.0 | 0.0 | 73.0 | 89.5 | 53.5 | 54.5 | 2.52 |
| Deepseek-Reasoner | 42.0 | 27.9 | 81.6 | 99.6 | 63.9 | 98.1 | 83.0 | 85.0 | 83.0 | 80.0 | 86.0 | 55.0 | 57.5 | 2.66 |
| Gemini-2.0-Flash | 42.1 | 13.6 | 77.5 | 90.8 | 20.4 | 99.1 | 10.0 | 98.0 | 8.0 | 91.0 | 91.0 | 52.0 | 52.5 | 2.57 |
| Gemini-2.5-Flash | 50.2 | 14.3 | 84.1 | 95.1 | 43.3 | 97.8 | 91.0 | 92.0 | 89.0 | 92.0 | 96.0 | 53.5 | 55.0 | 2.59 |
| Qwen-Max | 78.1 | 22.3 | 83.7 | 99.6 | 80.8 | 99.8 | 5.0 | 98.0 | 1.0 | 83.0 | 91.5 | 56.0 | 57.0 | 2.55 |
| Qwen3 | 71.3 | 32.7 | 85.4 | 90.6 | 83.3 | 86.2 | 7.0 | 94.0 | 0.0 | 69.0 | 94.0 | 38.0 | 39.0 | 2.58 |
| Galaxy (w/o Kernel) | 88.4 | 39.1 | 93.1 | 99.9 | 99.3 | 99.7 | 17.0 | 96.0 | 11.0 | 96.0 | 97.0 | 50.5 | 51.0 | 2.71 |
| Galaxy | 88.4 | 39.1 | 93.1 | 99.9 | 99.3 | 99.9 | 96.0 | 96.0 | 94.0 | 998.0 | 99.0 | 18.5 | 19.0 | 2.74 |

Analysis of Benchmark Results (Table 1):

  1. Overall Superiority: Both Galaxy and Galaxy (w/o Kernel) demonstrate strong performance, outperforming most existing LLM agents across a majority of metrics. For instance, on AgentBoard, Galaxy achieves top scores (88.4 in ALF, 39.1 in SW, 93.1 in BA, 99.9 in JC, 99.3 in PL, 99.9 in TQ), indicating robust capabilities in multi-round interactive environments.
  2. Impact of Kernel on PrefEval (Preference Retention):
    • Galaxy (w/o Kernel) shows limited preference retention, especially in Zero-Shot conditions (Z10 at 17.0% and Z300 at 11.0%). This implies KoRa alone, even with its generative agent architecture, struggles to consistently recall and apply user preferences over long interactions without explicit reminders or the Kernel's oversight.
    • With Kernel enabled, Galaxy's preference retention dramatically improves: Z10 jumps from 17.0% to 96.0%, and Z300 from 11.0% to 94.0%. The R300 score for Galaxy is listed as 998.0, which is likely a typo and should be interpreted as near 99.8 or 98.0 given the context, still indicating very high performance. This highlights Kernel's critical role in maintaining an evolving Cognition Forest and supporting long-term personalized planning.
  3. Impact of Kernel on PrivacyLens (Privacy Protection):
    • Galaxy (w/o Kernel) has a privacy leakage rate (LR) of 50.5% and LRh of 51.0%, comparable to GPT-4o, indicating a significant amount of sensitive information leakage when operating without Kernel's explicit privacy controls.
    • Galaxy (with Kernel) drastically reduces the privacy leakage rate: LR drops from 50.5% to 18.5%, and LRh drops from 51.0% to 19.0%. This confirms Kernel's effectiveness in enforcing privacy through the Privacy Gate, which masks sensitive content before cloud transmission.
    • Helpfulness (Help.) remains consistently high (2.71 for w/o Kernel, 2.74 for Galaxy), suggesting that Kernel's privacy mechanisms do not unduly degrade the agent's utility or helpfulness to the user. Accuracy for PrivacyLens also improves from 97.0% to 99.0% with Kernel.
  4. Overall Contribution of Kernel: The ablation study clearly demonstrates that Kernel is indispensable for Galaxy to achieve its stated goals of privacy preservation and self-evolution (which manifests as improved preference retention and adaptive system design). Kernel's two key roles are confirmed:
    • Maintaining an evolving Cognition Forest for long-term preference retention and personalized planning.
    • Enforcing privacy through the Privacy Gate.

6.2. Ablation Studies / Parameter Analysis

6.2.1. End-to-End Evaluation: Cost Analysis

Figure 5 presents a performance analysis of Galaxy in terms of latency and success rate under different model configurations.

Figure 5: Latency and success analysis of Galaxy under different model configurations. (a) shows end-to-end latency of different model combinations across four task types: TOD (pure chat), STC (simple tool call), CTC (complex tool call), and SD (space design). (b) compares success rate and failure counts under different local model sizes (1.5B–14B) when Kernel uses Qwen2.5 for intent extraction.

  • Latency Analysis (Figure 5a):

    • For simpler tasks like TOD (pure chat) and STC (simple tool call), latency is primarily dominated by local model inference, suggesting that even small local LLMs contribute significantly to response time.
    • For more complex tasks such as CTC (complex tool call) and SD (space design), cloud-based inference becomes the main latency contributor.
    • Using larger and more complex models (14B configuration) further amplifies total latency, reaching up to 6.3s for the Space Design task. This indicates a trade-off between model complexity/capability and response speed, especially for demanding tasks.
  • Success Rate Analysis (Figure 5b):

    • Despite the latency cost, larger models within Kernel (specifically Qwen2.5-14B for local inference) deliver substantially better performance.

    • When Kernel uses Qwen2.5-14B for local inference, it achieves an 81.5% one-shot intent extraction success rate. This demonstrates its ability to accurately resolve complex user goals without requiring fallback interactions or clarification from the user, highlighting the benefit of a more capable local model for Kernel's metacognition.

      The following are the results from Table 2 of the original paper:

| Execution Route | Cloud API | Latency (s) |
|---|---|---|
| KoRa calls cloud API | Yes | 0.13 |
| Kernel retrieves cognition | No | 0.87 |
| Kernel calls space function | No | 0.22 |
| KoRa feeds back result | Yes | 0.12 |
| Overall | — | 1.34 |

Table 2: Latency breakdown across different execution routes in Galaxy for a complex tool call task. Kernel is set to Qwen2.5-14B and KoRa to GPT-4o-mini.

Latency Breakdown (Table 2): For a Complex Tool Call task, with Kernel using Qwen2.5-14B and KoRa using GPT-4o-mini:

  • Kernel's cognition retrieval (0.87s) accounts for the largest share of the total latency (1.34s). This step is critical for selecting and grounding tool actions within the Cognition Forest.
  • Kernel calling Space functions takes 0.22s.
  • KoRa calling the cloud API (for GPT-4o-mini) and feeding back results each take relatively short times (0.13s and 0.12s, respectively).

    The four stages sum to the 1.34s total, and the breakdown shows that Kernel's local processing and Cognition Forest traversal dominate overall task execution time, underscoring its central role in orchestrating actions.

6.2.2. Case Study: Kernel's Effectiveness

A real-world case study validates Kernel's ability to maintain system stability and perform self-recovery.

  • Scenario: After cloning the project and running main.py, the system encountered a ModuleNotFoundError, failing to locate the core module world_stage and preventing the cognitive architecture from starting.
  • Traditional Agent Behavior: Conventional LLM agent frameworks would simply return the error stack, requiring manual troubleshooting by a human developer.
  • Kernel's Action: As a self-contained minimal runtime unit, Kernel remained operational even when the main system entry failed. Leveraging its code-level understanding of the system (via the Cognition Forest's Design dimension), Kernel identified that the world_stage module should reside in the project root. It inferred the error was due to a missing PYTHONPATH environment variable. Kernel then injected the correct path, restarted execution, and successfully restored operation.
  • Validation: This case demonstrates Kernel's vital role in framework-level meta-management, enabling self-checks and recovery actions that are beyond the scope of typical LLM agents.
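
The recovery behavior in this case could be approximated by a watchdog like the sketch below; a minimal illustration, assuming a subprocess-based restart and that `world_stage` lives in the project root. None of these names come from a released implementation.

```python
import os
import subprocess
import sys
from pathlib import Path

def kernel_watchdog(entry: str = "main.py") -> None:
    """Run the main entry; if the core module cannot be found, infer a
    missing PYTHONPATH, inject the project root, and restart once."""
    root = Path(__file__).resolve().parent
    result = subprocess.run([sys.executable, entry],
                            capture_output=True, text=True)
    stderr = result.stderr
    if "ModuleNotFoundError" in stderr and "world_stage" in stderr:
        env = os.environ.copy()
        env["PYTHONPATH"] = f"{root}{os.pathsep}{env.get('PYTHONPATH', '')}"
        subprocess.run([sys.executable, entry], env=env, check=True)
```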

6.2.3. Ablation Study: Analysis Layer Modules

Figure 6 illustrates an ablation study on the Analysis Layer modules (Agenda and Persona) through a real-world interaction example of a Daily Report.

Figure 6: A real-world interaction example of the Daily Report for the ablation study. The figure shows KoRa's personalized daily reflection and planning Space, including today's reflections, tomorrow's plan, and the overall time schedule.

  • Impact of Agenda:
    • Without Agenda: KoRa relies entirely on its memory-stream context (short-term memory). This results in less structured plans and increased reliance on user feedback for clarification. KoRa cannot proactively anticipate or structure daily activities effectively.
    • With Agenda: Agenda consolidates multi-source perceptual signals and infers a coherent behavioral profile, which serves as a structured input for KoRa's plan generation. This allows KoRa to create more structured daily plans and reduce the need for user clarification.
  • Impact of Persona:
    • Without Persona: If a user repeatedly translates paper abstracts and introductions via KoRa over several days, and Kernel generates a dedicated literature translating Space in response, KoRa might incorrectly infer that the user has discontinued translation when the user switches to the new Space. This is because KoRa lacks a long-term, stable understanding of user habits beyond immediate interactions.
    • With Persona: Persona maintains the User Cognition Tree, providing a comprehensive, long-term model of user characteristics. With this model available, KoRa correctly interprets the user's continued behavior (even though it now flows through a new tool) and generates the corresponding Daily Report ("Today's Roast"), demonstrating a consistent understanding of user preferences. (A toy sketch of such a long-term user model appears after this list.)
  • Validation: Both cases underscore the importance of the Analysis Layer (Agenda and Persona) in integrating and interpreting heterogeneous information from multiple sources. These modules are essential for KoRa's proactive capabilities and for maintaining a stable, long-term understanding of user needs, preventing persona drift.
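
To make the Persona discussion concrete, below is that toy sketch of what a node in the User Cognition Tree might look like. The paper does not publish a schema, so the class, field names, and reinforcement logic here are our assumptions, not Galaxy's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class CognitionNode:
    """Hypothetical User Cognition Tree node (schema assumed, not the paper's)."""
    label: str                      # e.g. "habit: translates paper abstracts"
    evidence_count: int = 0         # observations supporting this node
    children: list["CognitionNode"] = field(default_factory=list)

    def reinforce(self, n: int = 1) -> None:
        # The same habit node is strengthened regardless of which Space
        # served the request, so a tool switch (chat -> dedicated Space)
        # is not misread as the habit being discontinued.
        self.evidence_count += n

habit = CognitionNode("habit: translates paper abstracts")
habit.reinforce()  # observed via a KoRa chat turn
habit.reinforce()  # observed via the new literature-translation Space
```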

6.3. Boundaries and Errors

The paper also acknowledges current limitations and potential issues within the Galaxy framework:

  1. Alignment Overfitting: Alignment inputs (e.g., explicit user confirmations or corrections) are prioritized during cognitive construction. However, these inputs often reflect short-term characteristics or immediate needs. There is a risk that Galaxy might overfit to these short-term signals, failing to accurately reflect or learn long-term user habits and preferences.
  2. Human-Dependent Space Expansion: While the Space protocol supports automated extensibility and generation of new interaction modules, creating complex Spaces (i.e., those requiring intricate logic or novel integrations) still necessitates multiple rounds of human guidance. Fully autonomous design and implementation of highly complex Spaces remain a challenge.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces Galaxy, a novel Intelligent Personal Assistant (IPA) framework centered on cognition-enhanced Large Language Model (LLM) agents. The core innovation is the Cognition Forest, a semantic structure that unifies cognitive architecture with system design, establishing a self-reinforcing loop for continuous improvement. Galaxy is explicitly designed to address three major limitations in current LLM agents: the lack of proactive skills, robust privacy preservation, and genuine self-evolution.

The framework implements two cooperative agents: KoRa, a generative agent for responsive and proactive task execution, grounded in the Cognition Forest to ensure consistency; and Kernel, a meta-cognition empowered meta-agent responsible for framework-level supervision, privacy management via Privacy Gate, and driving self-evolution.

Experimental evaluations on AgentBoard, PrefEval, and PrivacyLens benchmarks demonstrated Galaxy's superior performance compared to multiple state-of-the-art LLM agents. Ablation studies highlighted the critical contributions of Kernel and the Analysis Layer modules (Agenda, Persona) to Galaxy's capabilities. A real-world case study further validated Kernel's ability to perform self-recovery and maintain system stability. The paper concludes by emphasizing the necessity of a deeply integrated and mutually reinforcing relationship between cognitive architecture and system design for the future of LLM agents.

7.2. Limitations & Future Work

The authors identified several limitations of the current Galaxy framework:

  • Alignment Overfitting: Galaxy's cognitive construction prioritizes alignment inputs (explicit user feedback). However, these inputs may be short-term focused and could lead to overfitting, potentially misrepresenting long-term user habits. Future work could explore methods to balance short-term alignment with long-term behavioral patterns to create more robust user models.
  • Human-Dependent Space Expansion: While the Space protocol enables automated extensibility, the creation of highly complex Spaces still requires significant human guidance and multiple rounds of interaction for full implementation. Future research could focus on enhancing the autonomy of Kernel's User-Adaptive System Design mechanism to generate and integrate complex functionalities with less human intervention, possibly through more sophisticated LLM-driven code generation and testing.

7.3. Personal Insights & Critique

  • Inspiration from Unification: The central idea of unifying cognitive architecture with system design through Cognition Forest is profoundly insightful. It moves beyond the typical LLM agent paradigm of simply providing tools or memory and instead posits a system that can understand and modify its own foundational structure. This "self-aware system design" could unlock true self-evolution for AIs, enabling them to adapt to entirely novel challenges rather than just improving performance within fixed constraints. This principle could be transferred to other complex AI systems, such as autonomous driving or robotics, where the AI could not only learn to drive better but also propose modifications to its control architecture or sensor integration based on real-world experience.
  • Practicality of Kernel's Meta-Agent Role: The implementation of Kernel as a framework-level meta-agent with a local code interpreter and rule engine that can operate even when the main system fails (as shown in the case study) is a robust design choice. This ensures a high degree of resilience and self-healing, crucial for IPAs that are expected to be available and functional continuously. This approach sets a new standard for reliability in LLM agent systems.
  • Robust Privacy Mechanism: The Privacy Gate with its tiered masking levels, managed by Kernel and aligned with the User Cognition Tree, provides a sophisticated and context-aware approach to privacy preservation. This is a critical step towards building trust in proactive LLM agents that necessarily handle sensitive user data. The ability to dynamically adjust masking based on context is a significant advantage over static anonymization methods. (A toy illustration of tiered masking appears at the end of this section.)
  • Addressing Persona Drift: KoRa's integration with Cognition Forest to mitigate persona drift is an important contribution. As LLM agents interact over long periods, maintaining a consistent persona is key to user experience and trust. Grounding behavior in a hierarchical semantic structure, rather than just a linear memory stream, offers a more stable foundation for persona consistency.
  • Potential Issues & Areas for Improvement:
    • Complexity and Maintainability: While powerful, the Cognition Forest's Semantic, Function, and Design dimensions, coupled with its hierarchical structure and the self-reinforcing loop, introduce significant complexity. Managing and debugging such a dynamically evolving architecture could be challenging, especially in real-world deployments. The authors could elaborate on strategies for version control, rollback mechanisms, and transparent introspection of the Cognition Forest's evolution.
    • Scalability of Kernel's Self-Evolution: The User-Adaptive System Design by Kernel relies on identifying "latent user needs" and confirming them through "lightweight alignment." How robust is this alignment process for truly novel or ambiguous needs? And how scalable is the autonomous generation and integration of new Spaces beyond "simple" ones, given the acknowledged limitation of "human-dependent Space Expansion"? The transition from identifying a need to reliably generating complex, bug-free code for a new system component is a massive leap.
    • Evaluation of Proactivity: While Agenda and Persona enable proactive planning, the paper's experiments primarily focus on preference retention and privacy. A more direct and quantitative evaluation of proactive behavior effectiveness (e.g., number of successful proactive interventions, user satisfaction with proactivity, false positive rate of proactive actions) would further strengthen the claims.
    • Computational Overhead: The Cost Analysis shows that Kernel's cognition retrieval is a significant portion of latency for complex tasks. As the Cognition Forest grows with self-evolution and user personalization, this overhead could become substantial. Future work might explore more efficient indexing, retrieval, or pruning strategies for the Cognition Forest.
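
Finally, to ground the Privacy Gate discussion above, here is that toy illustration of tiered masking. The paper does not specify its masking levels or rules, so the tiers, regex patterns, and placeholder tokens below are entirely our assumptions:

```python
import re

# Hypothetical tiers: higher tiers (chosen per context by the meta-agent)
# apply all lower-tier rules plus their own, redacting more aggressively.
TIER_PATTERNS = {
    1: [(re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]")],
    2: [(re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]")],
}

def mask(text: str, tier: int) -> str:
    """Apply every masking rule up to and including the requested tier."""
    for level in range(1, tier + 1):
        for pattern, token in TIER_PATTERNS.get(level, []):
            text = pattern.sub(token, text)
    return text

print(mask("Reach me at 555-123-4567 or ada@example.com", tier=2))
# -> "Reach me at [PHONE] or [EMAIL]"
```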
