PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
TL;DR Summary
PolySkill uses polymorphic abstraction to decouple skill goals from implementations, enhancing LLM agents' generalizable skill learning, improving task success and reuse rates on seen and unseen websites while reducing steps.
Abstract
Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (what it accomplishes) and its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, while reducing steps by over 20%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, the PolySkill enhances the agent's ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill's goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
- Authors: Simon Vu¹, Gang Li², Weiyan Shi¹,†, Peng Qi²,†.
- Affiliations: ¹Northeastern University, ²Uniphore. († Co-Supervision).
- Journal/Conference: This paper is a preprint available on arXiv and, at the time of this analysis, has not been published in a formal peer-reviewed conference or journal. ArXiv is a widely used repository for researchers to share their work before or during the peer-review process.
- Publication Year: 2025 (based on the arXiv ID 2510.15863v1).
- Abstract: The paper addresses the challenge that skills learned by LLM-powered web agents are often over-specialized and do not generalize to new websites. It introduces PolySkill, a framework inspired by polymorphism in software engineering. The core idea is to separate a skill's abstract goal (e.g., "add to cart") from its concrete, website-specific implementation. Experiments show PolySkill improves skill reuse by 1.7x, boosts task success rates (up to 9.4% on Mind2Web, 13.9% on unseen sites), and reduces the number of steps required. In self-exploration settings, the framework helps agents learn a better curriculum and acquire more generalizable skills. The authors conclude that decoupling a skill's goal from its execution is a critical step towards building autonomous agents capable of continual learning across the open web.
- Original Source Link: https://arxiv.org/abs/2510.15863v1 (Preprint)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Large Language Model (LLM)-based web agents can learn skills from experience, but these skills are often "brittle." A skill learned on one website (e.g., searching for a product on Amazon) usually fails on another website (e.g., Target) because the underlying code and UI elements are different. This is known as the problem of over-specialization, which severely limits an agent's ability to generalize across the diverse landscape of the internet.
- Importance & Gaps: For web agents to become truly autonomous and useful, they must be able to adapt to new and unseen environments without constant retraining or manual intervention. Existing skill-learning methods (ASI, SkillWeaver) focus on optimizing performance within a single domain, leading to poor cross-domain generalization. There is a fundamental tension between a skill's specificity (needed to work on one site) and its generalizability (needed to work on many sites).
- Innovation: The paper introduces a novel solution inspired by a fundamental concept in software engineering: polymorphism. PolySkill decouples what a skill does (its abstract purpose) from how it does it (its specific implementation on a given website). This creates a structured, hierarchical skill library that promotes reuse and generalization.
- Main Contributions / Findings (What):
- PolySkill Framework: A new framework that organizes skills into a polymorphic hierarchy. It uses an abstract class to define a "schema" of common goals for a domain (e.g., AbstractShoppingSite) and concrete subclasses for website-specific implementations (e.g., AmazonWebsite, TargetWebsite).
- Improved Generalization and Efficiency: PolySkill significantly outperforms existing methods. It improves skill reuse by 1.7x and boosts task success rates on both seen and unseen websites. It also makes agents more efficient, reducing the number of actions needed to complete tasks by over 20%.
- Prevention of Catastrophic Forgetting: In continual learning scenarios, where an agent learns new skills on new websites, PolySkill's structure prevents the agent from "forgetting" how to perform tasks on previously learned sites.
- Enhanced Autonomous Exploration: The framework provides a structured approach for agents to explore new websites and learn skills on their own, without a predefined curriculum, leading to the acquisition of more generalizable skills.
- New Evaluation Metrics: The paper introduces Skill Reusability, Task Coverage, and Skill Compositionality to provide a more nuanced evaluation of how effectively an agent is learning and using its skills, beyond just task success rate.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- LLM-based Agents: These are autonomous systems powered by Large Language Models (LLMs) like GPT-4. They can understand natural language instructions, reason, plan, and interact with external environments (like a web browser or operating system) to achieve goals.
- Web Agents: A specific type of LLM agent designed to navigate and interact with websites. They perform tasks like filling forms, clicking buttons, and extracting information, essentially automating what a human user would do.
- Skill Induction: This is the process by which an agent learns a new, reusable "skill" from its past experiences. For example, after successfully completing a task by taking a sequence of primitive actions (e.g., click, type), the agent can "induce" a higher-level skill (e.g., a Python function) that encapsulates this sequence. This skill can then be used as a single action in the future, making the agent more efficient.
- Polymorphism: A core principle of object-oriented programming. It allows objects of different classes to be treated as objects of a common superclass. In simple terms, it means you can have a single interface (the "what") for multiple underlying forms or implementations (the "how"). For example, a draw() function could be used on a Circle object and a Square object, and it would produce the correct shape for each, even though the drawing logic is different. (A short code illustration of this idea appears at the end of this section.) PolySkill applies this by defining an abstract skill (e.g., search_product) that has different concrete implementations for Amazon and Target.
- Continual Learning & Catastrophic Forgetting: Continual learning is the ability of a model to learn from a continuous stream of data over time. A major challenge in this area is catastrophic forgetting, where learning a new task causes the model to lose its ability to perform previously learned tasks.
- Previous Works:
- Initial approaches stored skills as natural language descriptions (e.g., Agent Workflow Memory) or as simple action traces. While functional, these were often brittle.
- ASI (Agent Skill Induction) and SkillWeaver made a significant leap by representing skills as more robust programmatic code (e.g., Python functions). However, as the paper points out, these methods generate over-specialized skills tailored to a single website's UI, leading to poor generalization. Image 8 shows examples of skills from these methods, which are concrete implementations without an abstract layer. (Image 8: a side-by-side comparison of example skills, showing an ASI skill as a plain function definition and a SkillWeaver skill as an async function definition.)
- SkillWeaver also introduced the idea of self-exploration, where an agent proposes its own tasks to learn from. However, the paper argues that without a guiding structure, this exploration can lead to overly complex and non-generalizable skills.
- Differentiation: PolySkill differentiates itself by not just creating skills as code, but by organizing them into a hierarchical and polymorphic structure. While ASI and SkillWeaver create a flat library of isolated, concrete skills, PolySkill creates an interconnected library with an abstraction layer. This separation of "what" from "how" is the key innovation that enables better generalization, composition, and prevention of catastrophic forgetting.
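To make the polymorphism analogy above concrete, here is a minimal, generic Python illustration (an illustrative sketch, not code from the paper): a single draw() interface is implemented differently by each class, and the calling code never needs to know which concrete class it is handling.

```python
# Generic polymorphism example (illustrative only): one interface, many implementations.
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def draw(self) -> str:
        """Return a description of how this shape is drawn."""


class Circle(Shape):
    def draw(self) -> str:
        return "drawing a circle"


class Square(Shape):
    def draw(self) -> str:
        return "drawing a square"


def render(shapes: list) -> None:
    # The caller relies only on the abstract interface ("what"),
    # while each class supplies its own implementation ("how").
    for shape in shapes:
        print(shape.draw())


render([Circle(), Square()])  # prints one line per shape
```

PolySkill applies the same separation to web skills: search_product plays the role of draw(), and each website class supplies its own implementation.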
4. Methodology (Core Technology & Implementation)
The PolySkill framework is designed to learn agent skills that are both specialized and generalizable. It achieves this through a structured, three-stage process guided by the principle of polymorphism.
- Principles: The core idea is to decouple a skill's abstract goal from its concrete, site-specific implementation. This is achieved by creating a domain-specific skill hierarchy. For example, for the "shopping" domain, an abstract class AbstractShoppingSite is defined. This class acts as a schema or interface, outlining high-level skills like search_product, add_to_cart, and checkout. Then, for each specific website, a concrete class (e.g., AmazonWebsite, TargetWebsite) is created that inherits from the abstract class and provides the actual implementation for each of these skills. (A code sketch of this hierarchy appears at the end of this section.)
  (Image 3: code examples showing the high-level abstraction of shopping-domain skills in PolySkill on the left and their concrete implementations for different websites, Amazon and Target, on the right, illustrating how skill goals are decoupled from their execution and reused.) Image 3 shows this perfectly. The left side defines the abstract class with its methods. The right side shows how AmazonWebsite and TargetWebsite provide different, specific code to implement the same abstract methods. Crucially, compositional skills like purchase_item (which calls find_and_add_to_cart and then checkout) are defined once in the abstract class and work automatically for all websites that implement the necessary base skills.
- Problem Formulation: The agent's interaction is modeled as a Partially Observable Markov Decision Process (POMDP). The agent, powered by an LLM policy $\pi$, aims to maximize an efficiency-aware reward.
- Objective Function:
  $$\max_{\mathcal{L}}\; \mathbb{E}_{t \sim \mathcal{T}}\big[\, S(\tau, t) - \lambda \cdot |\tau| \,\big], \qquad \tau \sim \pi(\cdot \mid t, \mathcal{L})$$
- Symbol Explanation:
- $\pi$: The LLM-based agent policy.
- $\mathcal{L}$: The skill library that the agent learns.
- $\mathbb{E}_{t \sim \mathcal{T}}$: The expectation over a distribution of tasks $\mathcal{T}$.
- $S(\tau, t)$: A success function that is 1 if the trajectory $\tau$ successfully completes task $t$, and 0 otherwise.
- $|\tau|$: The length of the trajectory (number of steps).
- $\lambda$: A penalty coefficient that incentivizes efficiency. A larger $\lambda$ encourages the agent to complete tasks in fewer steps. The goal is to learn a skill library $\mathcal{L}$ that not only maximizes task success but also minimizes the number of steps, which encourages the creation of useful and reusable skills.
- Steps & Procedures (Skill Induction Process):
1. Initial Abstract Class Induction: When an agent encounters the first website in a new domain (e.g., its first shopping site), and after successfully completing a few tasks, it is prompted to induce a high-level abstract class (AbstractShoppingSite). This class defines the common interface and compositional skills for that domain.
2. Concrete Skill Implementation: On a specific site (e.g., amazon.com), the agent first attempts a task using primitive actions. Upon success, the induction module is invoked. Instead of just creating a generic skill, it is given the AbstractShoppingSite class as context and tasked with implementing a method for a new AmazonWebsite class that conforms to the abstract interface. For example, it implements the search_product method using the specific element IDs and actions that worked on Amazon.
3. Verification: The new skill is tested by having the agent re-solve the same task, this time by calling the newly created skill. If successful, the skill is verified and added to the library as part of the AmazonWebsite class.
4. Learning on Unseen Websites: When the agent encounters a new website in a known domain (e.g., target.com), it already has the AbstractShoppingSite blueprint. This gives the agent a structured set of goals: it knows it needs to figure out how to implement search_product, add_to_cart, etc., on this new site. This guided exploration is much more efficient than random trial-and-error. Once it figures out an implementation, it follows the same induction and verification process to create a TargetWebsite class.
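To make the hierarchy and the outcome of this induction process concrete, the following is an illustrative Python sketch of what such a skill library could look like. The class and method names (AbstractShoppingSite, AmazonWebsite, TargetWebsite, search_product, add_to_cart, checkout, purchase_item) follow the paper's examples, but the browser handle (a Playwright-style page object) and every element selector are assumptions made for illustration, not the paper's actual generated code.

```python
# Illustrative sketch of a polymorphic skill library (assumed details, not the
# paper's generated code). `page` is expected to behave like a Playwright Page.
from abc import ABC, abstractmethod


class AbstractShoppingSite(ABC):
    """Domain schema: abstract goals shared by every shopping website."""

    def __init__(self, page):
        self.page = page  # browser handle, e.g. a Playwright sync Page

    @abstractmethod
    def search_product(self, query: str) -> None:
        """Navigate to the results page for `query` (site-specific)."""

    @abstractmethod
    def add_to_cart(self) -> None:
        """Add the currently viewed product to the cart (site-specific)."""

    @abstractmethod
    def checkout(self) -> None:
        """Complete the purchase flow (site-specific)."""

    # Compositional skill: defined once against the abstract interface,
    # it works on every concrete site that implements the base skills.
    def purchase_item(self, query: str) -> None:
        self.search_product(query)
        self.add_to_cart()
        self.checkout()


class AmazonWebsite(AbstractShoppingSite):
    """Concrete implementation induced from successful traces on amazon.com."""

    def search_product(self, query: str) -> None:
        self.page.fill("#twotabsearchtextbox", query)   # selector is an assumption
        self.page.click("#nav-search-submit-button")    # selector is an assumption

    def add_to_cart(self) -> None:
        self.page.click("#add-to-cart-button")          # selector is an assumption

    def checkout(self) -> None:
        self.page.click("input[name='proceedToRetailCheckout']")  # assumption


class TargetWebsite(AbstractShoppingSite):
    """Concrete implementation induced from successful traces on target.com."""

    def search_product(self, query: str) -> None:
        self.page.fill("input#search", query)           # selector is an assumption
        self.page.keyboard.press("Enter")

    def add_to_cart(self) -> None:
        self.page.click("button[data-test='addToCartButton']")    # assumption

    def checkout(self) -> None:
        self.page.click("button[data-test='checkout-button']")    # assumption
```

Under this structure, the compositional purchase_item skill runs unchanged on an AmazonWebsite or a TargetWebsite instance, and inducing skills for a new shopping site only requires filling in the three abstract methods, which is exactly the goal/implementation decoupling the framework relies on.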
5. Experimental Setup
- Datasets:
- Mind2Web: A large and diverse benchmark for general web navigation, covering 137 websites across 31 domains and 2,350 tasks. It's used to evaluate generalization in cross-task, cross-website, and cross-domain settings.
- WebArena: A realistic evaluation environment with fully functional websites (e-commerce, forums, dev tools). It includes 812 tasks with automatic evaluation, providing a robust testbed for complex, multi-step procedures.
- Evaluation Metrics (a sketch of how these metrics might be computed in code appears at the end of this section):
- Task Success Rate (SR):
- Conceptual Definition: The most fundamental metric, it measures the percentage of tasks the agent completes successfully. It is the primary indicator of overall agent capability.
- Mathematical Formula: $\text{SR} = \dfrac{\text{Number of successfully completed tasks}}{\text{Total number of tasks}} \times 100\%$
- Number of Steps:
- Conceptual Definition: The average number of actions an agent takes to finish a task. A lower number indicates higher efficiency. Each primitive action (e.g., click) and each skill call counts as one step.
- Mathematical Formula: $\text{Avg. Steps} = \dfrac{1}{N}\sum_{i=1}^{N} |\tau_i|$
- Symbol Explanation: $N$ is the number of successful tasks, and $|\tau_i|$ is the number of steps in the trajectory for the $i$-th successful task.
- Skill Reusability:
- Conceptual Definition: Measures how often skills learned from one set of tasks are successfully used in a different, unseen set of tasks. A high rate indicates that the learned skills are general and not over-specialized.
- Mathematical Formula (Standard Interpretation): $\text{Skill Reusability} = \dfrac{\text{Number of actions that invoke a learned skill}}{\text{Total number of actions}} \times 100\%$ (Note: The paper also refers to this as Skill Utilization Rate, and the provided chart uses this interpretation. Another way to define it could be the number of skills reused, but this "utilization" definition is more common for measuring impact on efficiency.)
- Task Coverage:
- Conceptual Definition: Measures the percentage of tasks in which at least one learned skill was used. This indicates the breadth of applicability of the skill library. A low coverage suggests the skills are too niche.
- Mathematical Formula: $\text{Task Coverage} = \dfrac{\text{Number of tasks using at least one learned skill}}{\text{Total number of tasks}} \times 100\%$
- Skill Compositionality:
- Conceptual Definition: Measures the extent to which the agent builds more complex skills by reusing simpler, existing skills as sub-routines. High compositionality is a sign of a scalable and efficient learning process.
- Mathematical Formula (Proxy Interpretation): $\text{Skill Compositionality} = \dfrac{\text{Number of learned skills that call at least one other learned skill}}{\text{Total number of learned skills}} \times 100\%$ (Note: This is a plausible proxy. The paper doesn't provide a formula, but this captures the idea of skills calling other skills.)
- Baselines:
- Baseline (Base): A standard agent that does not learn or use any skills.
- ASI (Agent Skill Induction): A state-of-the-art method that induces skills as programmatic code from successful action traces.
- SkillWeaver: Another leading method that also induces code-based skills and supports self-exploration to propose new tasks.
- Foundation Models: The experiments were run on four powerful LLMs: GPT-4.1 and Claude-3.7-Sonnet (closed-source), and Qwen3-Coder-480B-A35B and GLM-4.5 (open-source), demonstrating the robustness of the approach across different models.
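Since the paper does not publish reference code for these metrics, the following is a minimal sketch of how they could be computed under the interpretations given above; the Trajectory fields (success, num_steps, skill_calls) are assumed bookkeeping, not the paper's actual data format.

```python
# Hypothetical metric computation under the interpretations given above.
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    success: bool      # did the episode complete its task?
    num_steps: int     # primitive actions + skill calls, each counted as one step
    skill_calls: int   # how many of those steps invoked a learned skill


def task_success_rate(trajs: List[Trajectory]) -> float:
    return 100.0 * sum(t.success for t in trajs) / len(trajs)


def avg_steps(trajs: List[Trajectory]) -> float:
    # Averaged over successful tasks only, matching the formula above.
    succ = [t for t in trajs if t.success]
    return sum(t.num_steps for t in succ) / max(len(succ), 1)


def skill_utilization(trajs: List[Trajectory]) -> float:
    # Share of all actions that were calls to learned skills ("Skill Reusability").
    total = sum(t.num_steps for t in trajs)
    return 100.0 * sum(t.skill_calls for t in trajs) / max(total, 1)


def task_coverage(trajs: List[Trajectory]) -> float:
    # Share of tasks in which at least one learned skill was used.
    return 100.0 * sum(t.skill_calls > 0 for t in trajs) / len(trajs)


def skill_compositionality(num_composite_skills: int, num_total_skills: int) -> float:
    # Proxy: fraction of learned skills that call at least one other learned skill.
    return 100.0 * num_composite_skills / max(num_total_skills, 1)
```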
6. Results & Analysis
The paper presents a comprehensive set of experiments to validate the effectiveness of the PolySkill framework.
- Core Results:
  (Image 1: an overview figure combining a schematic of the PolySkill method with headline results. The top panel shows polymorphic skills implemented across different websites; the bottom-left panel shows PolySkill significantly outperforming other methods on task success rate, skill reuse, and related metrics; the bottom-right panel shows PolySkill preventing catastrophic forgetting in continual learning and improving WebArena Shopping performance.)
  Image 1 provides a high-level summary of the key findings. The top diagram illustrates the core concept: a single abstract goal like search is implemented differently on OneStopShop, amazon, and TARGET. The bottom-left bar chart shows that PolySkill achieves 1.3x to 1.8x improvements over baselines in Task Success Rate, Skill Reusability, Task Coverage, and Skill Compositionality. The bottom-right line chart previews the continual learning results, showing PolySkill avoids catastrophic forgetting.
  (Image 4 / Figure 3: PolySkill versus baselines on the Mind2Web benchmark, across four LLMs and three generalization settings, Cross-Task, Cross-Website, and Cross-Domain. PolySkill is strongest in the harder Cross-Domain setting, the online continually updated variant performs best, and error bars show the standard error over three runs.)
  Image 4 (Figure 3 in the paper) shows the performance on the Mind2Web benchmark. Across all four LLMs, PolySkill (red) consistently outperforms the ASI baseline (orange). The online version (dark red), which continually learns, achieves the best results. The performance gap is particularly wide in the more challenging Cross-Website and Cross-Domain settings, which directly confirms that PolySkill's polymorphic structure leads to superior generalization. For instance, with GPT-4.1, the online variant achieves ~64% accuracy in the Cross-Domain setting, compared to ~62% for the standard (offline) PolySkill.
  (Image 5 / Figure 4: overall performance of PolySkill against Baseline, SkillWeaver, and ASI on the WebArena benchmark, for GPT-4.1, Claude-3.7-Sonnet, Qwen3-Coder-480B-A35B, and GLM-4.5. PolySkill achieves the highest average success rate across models and website categories, with the largest gains on GPT-4.1 and Claude-3.7-Sonnet.)
  Image 5 (Figure 4 in the paper) presents results on the WebArena benchmark. Again, PolySkill (red) achieves the highest average success rate across all models and website categories compared to the Baseline (blue), SkillWeaver (green), and ASI (orange). This demonstrates its effectiveness on complex, realistic web tasks.
- Analysis of Skill Learning Dynamics:
  (Image 2: charts showing the performance of the ASI and SkillWeaver skill-learning methods under different base models, plotting success rate and skill utilization over iterations; they reflect unstable learning dynamics and the low skill reuse caused by over-specialization.)
  Image 2 illustrates the motivation for PolySkill. It shows that existing methods like ASI and SkillWeaver can have unstable learning curves. For example, SkillWeaver's success rate with Claude-3.7-Sonnet (bottom right) actually decreases over time, and its skill utilization remains very low. This suggests these methods can learn overly specific skills that don't generalize well, hurting long-term performance.
  (Image 6 / Figure 5: the relationship between number of steps and skill utilization on WebArena Shopping tasks, with lines showing the average number of steps to complete a task and bars showing skill reuse for ASI, SkillWeaver, and PolySkill; higher skill reuse corresponds to fewer steps.)
  Image 6 (Figure 5 in the paper) validates the core hypothesis that skill reuse improves efficiency. The chart shows a clear inverse correlation: as Skill Reusability (bars) increases, the Average Number of Steps (lines) decreases for all methods. PolySkill (red) achieves the highest Skill Reusability (over 20%) while also significantly reducing the number of steps, confirming its ability to learn and effectively apply general skills.
- Continual Learning and Catastrophic Forgetting:
  (Image 7 / Figure 6: results of the continual learning experiment showing how PolySkill prevents catastrophic forgetting. The experiment has two phases: a skill library is first trained on the WebArena Shopping benchmark, then learning continues on the cross-website Amazon and Target sites. The orange and red lines (right axis) show PolySkill learning the new sites better than the ASI baseline, the blue lines (left axis) track the original WA performance, and the shaded regions show the standard error over three runs.)
  Image 7 (Figure 6 in the paper) is a key result. The agent first learns on WebArena Shopping, then continually learns on Amazon and Target.
  - Positive Transfer: Both PolySkill (solid lines) and ASI (dashed lines) show positive transfer, improving their performance on Amazon and Target over time (orange and red lines, right y-axis).
  - Catastrophic Forgetting: The crucial difference is on the left y-axis (blue lines), which tracks performance back on the original WebArena Shopping tasks. The ASI agent's performance drops significantly (from ~34% to ~29%), indicating it "forgot" the original skills after specializing on the new sites. In contrast, PolySkill's performance remains stable. This is because its polymorphic structure isolates the new, site-specific implementations (AmazonWebsite, TargetWebsite) from the general skills, preventing interference. This demonstrates PolySkill's ability to prevent catastrophic forgetting, ending with a +4.9% advantage over ASI.
- Task-Free Explorative Learning: The paper investigates whether an agent can learn general skills on its own without a human-designed curriculum. The results are shown in Table 2, which is transcribed below.
This table has been transcribed from the paper's content as no image was provided.
Table 2: Performance in the task-free exploration setting for the Shopping Domain (SR % / Skill Usage %).

| Training Setting | Iterations | WA Shopping | AMZ | Target |
| :--- | :--- | :--- | :--- | :--- |
| Baseline | – | 37.4 / - | 47.3 / - | 60.5 / - |
| 1. Single-Domain Specialists | | | | |
| WA | 50 | 42.3 / 14.9 | 50.2 / 3.3 | 61.2 / 2.8 |
| AMZ | 50 | 38.1 / 2.7 | 69.5 / 48.3 | 61.5 / 3.0 |
| Target | 50 | 38.0 / 2.1 | 48.5 / 3.5 | 77.0 / 52.1 |
| 2. Sequential Curriculum | | | | |
| AMZ → WA | 75 + 75 | 40.2 / 12.3 | 65.3 / 42.7 | 62.5 / 3.1 |
| AMZ → Target → WA | 50 + 50 + 50 | 38.2 / 11.9 | 65.2 / 43.3 | 77.3 / 24.3 |
| Target → AMZ → WA | 50 + 50 + 50 | 39.5 / 11.5 | 66.1 / 40.8 | 69.2 / 18.9 |
| WA → Target → AMZ | 50 + 50 + 50 | 42.1 / 10.8 | 70.5 / 43.2 | 76.8 / 23.3 |
| SkillWeaver* | 150 | 39.8 / 8.6 | 64.4 / 25.2 | 74.2 / 18.3 |
| 3. Self-guided Exploration | | | | |
| AMZ + Target + WA | 150 | 43.1 / 14.6 | 66.7 / 36.4 | 75.2 / 19.4 |
* For SkillWeaver, the best-performing curriculum was selected.
- Single-Domain Specialists perform well on their home website (e.g., Target specialist gets 77.0% SR on Target) but fail to transfer this knowledge, with skill usage on other sites being very low (<4%).
- Sequential Curriculum shows that the order of training matters, but the results are mixed.
- Self-guided Exploration with PolySkill, where the agent freely explores all three websites, achieves the highest success rate (43.1%) on the held-out WA Shopping benchmark. This is a key finding: the agent, guided by PolySkill's structured framework, can autonomously build a general skill set that is more effective than one learned from a fixed curriculum.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces PolySkill, a novel framework that applies the software engineering principle of polymorphism to agent skill learning. By separating a skill's abstract intent from its concrete implementation, PolySkill enables the creation of generalizable, compositional, and reusable skills for web agents. This approach leads to significant improvements in task success, efficiency, and cross-domain generalization. Furthermore, it mitigates catastrophic forgetting in continual learning settings and empowers agents to learn effectively through self-guided exploration. The work provides a strong argument that structured abstraction is a crucial component for building truly adaptive and autonomous agents.
- Limitations & Future Work: The authors acknowledge several limitations and propose future research directions:
- Dynamic Environments: The framework may struggle with highly dynamic websites where UI elements change frequently, as this would invalidate the concrete skill implementations. Future work could explore adaptive skill repair mechanisms.
- Abstract Class Quality: The quality of the entire skill hierarchy depends on inducing a good initial abstract class. This can be a sensitive process.
- Long-Tail Websites: The approach works best for well-defined domains (e.g., e-commerce). It may be less effective for unique, "long-tail" websites that don't fit neatly into existing categories.
- Future Directions:
- Learning from Failures: Systematically analyzing failed skill executions to proactively refine them.
- Training Autonomous Skill Learners: Using reinforcement learning to train smaller, open-source models to acquire polymorphic skills autonomously.
- Collaborative Skill Ecosystems: Creating centralized or federated libraries where multiple agents can share and version-control skills.
- Personal Insights & Critique:
- Novelty and Impact: The application of a time-tested software engineering principle like polymorphism to the domain of agent skill learning is both elegant and highly effective. It addresses a fundamental bottleneck in the field and provides a clear path toward more robust agents. The idea of structured abstraction seems far more scalable than creating flat, monolithic skill libraries.
- Practicality and Scalability: A potential challenge is the initial creation of abstract classes. While the paper suggests this can be induced, it might require a degree of human oversight or a "critical mass" of examples from a domain to work well. Scaling this to the entire open web, with its countless domains and sub-domains, remains an open question. How does the agent decide when a new website belongs to an existing domain versus when it needs a new abstract class?
- Untested Assumptions: The framework assumes that websites within a domain share a common semantic structure (e.g., all shopping sites have a search, add-to-cart, and checkout flow). While true for many domains, this might not hold for more heterogeneous categories.
- Future Potential: The concept is highly transferable. As the authors note, it could be applied to robotics (generalizing skills across different robots or environments), tool use (adapting to different software APIs), and other areas where agents need to interact with diverse but structurally similar systems. The idea of collaborative skill ecosystems is particularly exciting, envisioning a future where agents collectively build and refine a shared "understanding" of how to operate in the digital world.