
Leveraging LLMs for Collaborative Ontology Engineering in Parkinson Disease Monitoring and Alerting

Published: 12/16/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper examines four methods for using LLMs in constructing a Parkinson's Disease monitoring ontology, revealing that while LLMs can generate ontologies, human-LLM collaboration significantly improves their comprehensiveness and accuracy.

Abstract

This paper explores the integration of Large Language Models (LLMs) in the engineering of a Parkinson's Disease (PD) monitoring and alerting ontology through four key methodologies: One Shot (OS) prompt techniques, Chain of Thought (CoT) prompts, X-HCOME, and SimX-HCOME+. The primary objective is to determine whether LLMs alone can create comprehensive ontologies and, if not, whether human-LLM collaboration can achieve this goal. Consequently, the paper assesses the effectiveness of LLMs in automated ontology development and the enhancement achieved through human-LLM collaboration. Initial ontology generation was performed using One Shot (OS) and Chain of Thought (CoT) prompts, demonstrating the capability of LLMs to autonomously construct ontologies for PD monitoring and alerting. However, these outputs were not comprehensive and required substantial human refinement to enhance their completeness and accuracy. X-HCOME, a hybrid ontology engineering approach that combines human expertise with LLM capabilities, showed significant improvements in ontology comprehensiveness. This methodology resulted in ontologies that are very similar to those constructed by experts. Further experimentation with SimX-HCOME+, another hybrid methodology emphasizing continuous human supervision and iterative refinement, highlighted the importance of ongoing human involvement. This approach led to the creation of more comprehensive and accurate ontologies. Overall, the paper underscores the potential of human-LLM collaboration in advancing ontology engineering, particularly in complex domains like PD. The results suggest promising directions for future research, including the development of specialized GPT models for ontology construction.

In-depth Reading

1. Bibliographic Information

1.1. Title

Leveraging LLMs for Collaborative Ontology Engineering in Parkinson Disease Monitoring and Alerting

The title clearly outlines the paper's core subject: using Large Language Models (LLMs) as tools within a collaborative process to build ontologies. It also specifies the application domain, which is the monitoring and alerting system for Parkinson's Disease (PD), highlighting the practical and healthcare-oriented nature of the research.

1.2. Authors

Georgios Bouchouras, Dimitrios Doumanas, Andreas Soularidis, Konstantinos Kotis, and George A. Vouros.

The authors are affiliated with the Intelligent Systems Laboratory at the University of the Aegean and the Artificial Intelligence Laboratory at the University of Piraeus, both in Greece. These affiliations indicate a strong research background in artificial intelligence, knowledge systems, and ontology engineering. Notably, Konstantinos Kotis and George A. Vouros have prior publications on human-centered ontology engineering methodologies, lending credibility to the collaborative frameworks proposed in this paper.

1.3. Journal/Conference

The paper is available as a preprint on arXiv, a repository for electronic preprints of scientific papers. The metadata indicates a future publication date, which is likely a placeholder. A previous version of this work was presented at the GeNeSy2024 Workshop, part of the Extended Semantic Web Conference (ESWC) 2024. ESWC is a top-tier venue for semantics and knowledge graph research, suggesting the work is relevant and of interest to the academic community. As a preprint, this version has not yet completed a formal peer-review process for a journal or major conference.

1.4. Publication Year

The provided metadata lists a publication date of December 16, 2025. This is highly unusual and most likely a placeholder or an error in the arXiv submission metadata. The research itself appears to be contemporary, citing works up to 2023 and mentioning a 2024 workshop.

1.5. Abstract

The paper investigates the use of Large Language Models (LLMs) for creating an ontology for Parkinson's Disease (PD) monitoring and alerting. It compares four different methodologies: One Shot (OS) prompting, Chain of Thought (CoT) prompting, X-HCOME (a human-LLM collaborative approach), and SimX-HCOME+ (a simulated iterative collaboration). The central research question is whether LLMs can build comprehensive ontologies on their own, or if human-LLM collaboration is necessary and more effective. Initial experiments with OS and CoT show that LLMs can generate ontologies autonomously, but these are incomplete and require significant human refinement. The collaborative methodologies, X-HCOME and SimX-HCOME+, which involve human expertise and supervision, demonstrate substantial improvements, producing more comprehensive and accurate ontologies that closely resemble those created by human experts. The paper concludes that human-LLM collaboration is a powerful and promising direction for advancing ontology engineering, especially in complex domains like healthcare.

2. Executive Summary

2.1. Background & Motivation

Ontology engineering—the process of formally defining concepts and their relationships within a specific domain—is a cornerstone of knowledge representation. However, it is traditionally a manual, time-consuming, and resource-intensive task that requires deep expertise in both the domain (e.g., medicine) and the engineering principles. In rapidly evolving fields like Parkinson's Disease (PD) research and monitoring, ontologies must be continuously updated to remain relevant, exacerbating these challenges.

Large Language Models (LLMs) have emerged as powerful tools capable of processing and generating vast amounts of information, making them a candidate for automating or assisting in ontology construction. Yet, significant gaps exist in understanding their true capabilities in this structured task. Key challenges include:

  • Lack of Comprehensiveness: LLMs may produce plausible but incomplete knowledge structures.

  • Reasoning Deficiencies: LLMs can struggle with the logical consistency and formal rigor required for robust ontologies.

  • The "Last Mile" Problem: Automating the initial draft (kick-off ontology) is one thing, but refining it into a complete, accurate, and usable artifact is another.

    This paper's entry point is to move beyond simply asking if LLMs can generate ontologies and instead explore how to best integrate them into a collaborative workflow with human experts. It systematically investigates different levels of human involvement, from minimal prompting to deeply iterative partnerships, to create a more efficient and effective ontology engineering methodology.

2.2. Main Contributions / Findings

The paper makes several key contributions to the field of LLM-assisted ontology engineering:

  1. Systematic Evaluation of Collaborative Methodologies: It proposes and evaluates a spectrum of four distinct methodologies for ontology generation:

    • LLM-Only: One Shot (OS) and Chain of Thought (CoT) prompting.
    • Human-LLM Collaboration: X-HCOME (a structured, alternating workflow between human and LLM) and SimX-HCOME+ (a simulated, iterative development environment).
  2. Demonstration of Collaborative Superiority: The core finding is that while LLMs can create a basic ontology from prompts, the result is not comprehensive. Human-LLM collaboration is essential for producing high-quality, accurate, and complete ontologies. The X-HCOME methodology, especially when combined with expert review, yielded ontologies that were not only more complete but also identified valid domain concepts missing from the original gold standard.

  3. Quantification of Human Involvement: The paper introduces a novel analysis that correlates the degree of human involvement with the quality of the final ontology. Results show a clear positive relationship: more intensive and structured collaboration leads to significantly better outcomes (higher F1-scores).

  4. Introduction of New Methodologies: The paper extends the existing HCOME framework to X-HCOME by formally integrating LLM-based tasks and introduces SimX-HCOME+ as a new paradigm for simulated, iterative ontology development.

  5. Benchmarking LLM Capabilities: The study provides a valuable benchmark of different LLMs (ChatGPT-3.5/4, Bard/Gemini, Llama2, Claude) on a complex ontology engineering task, including a test of converting natural language rules to the Semantic Web Rule Language (SWRL), which revealed significant limitations in current models' logical reasoning capabilities.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a novice reader should be familiar with the following concepts:

  • Ontology and Ontology Engineering (OE): An ontology is a formal, explicit specification of a shared conceptualization. In simpler terms, it's like a highly structured dictionary and grammar for a specific field of knowledge (e.g., Parkinson's Disease). It defines classes (e.g., Patient, Sensor, Bradykinesia), properties (e.g., hasSymptom, monitoredBy), and the logical relationships between them. Ontology Engineering is the systematic process of designing, building, evaluating, and maintaining these ontologies.
  • Large Language Models (LLMs): These are a type of artificial intelligence model, such as OpenAI's GPT series or Google's Gemini, trained on massive datasets of text and code. Their primary strength lies in understanding context, generating human-like text, summarizing information, and answering questions. This paper explores leveraging these abilities for the structured task of ontology creation.
  • Knowledge Representation (KR): This is a subfield of AI focused on how to represent information about the world in a way that a computer can use to solve complex problems. Ontologies are a popular form of KR, enabling machines to "understand" relationships and reason about data.
  • Neurosymbolic AI: This is an emerging approach in AI that aims to combine the strengths of two historically different paradigms: neural networks (like LLMs), which excel at learning patterns from data (statistical AI), and symbolic AI, which uses explicit rules, logic, and symbols for reasoning (e.g., knowledge graphs and ontologies). This paper is an example of Neurosymbolic AI, using an LLM (neural) to help construct an ontology (symbolic).
  • Turtle (TTL) Format: Turtle stands for Terse RDF Triple Language. It is a simple, human-readable syntax for writing down data for the Semantic Web, specifically for RDF (Resource Description Framework). Ontologies are often expressed in RDF, and Turtle is a popular format for writing and sharing them (a short parsing sketch follows this list).
  • Semantic Web Rule Language (SWRL): This is a language built on top of the ontology framework (specifically OWL - Web Ontology Language) that allows users to write rules. These rules enable logical inferences. For example, a rule could state: "IF a Patient exhibits Bradykinesia (slow movement) AND this occurs after a MedicationDose, THEN infer a MissingDoseEvent." This allows the system to deduce new information not explicitly stated in the data.
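
To make the Turtle syntax concrete, here is a minimal sketch of a hypothetical TTL fragment in the PD domain, parsed with the rdflib Python library. The names (Patient, Bradykinesia, hasSymptom) are illustrative assumptions, not classes from the Wear4PDmove ontology.

```python
# Minimal sketch: a hypothetical Turtle fragment for a PD ontology, parsed
# with rdflib. Names are illustrative, not from the actual Wear4PDmove ontology.
from rdflib import Graph

ttl = """
@prefix :     <http://example.org/pd#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Patient      a owl:Class .
:Symptom      a owl:Class .
:Bradykinesia a owl:Class ; rdfs:subClassOf :Symptom .

:hasSymptom a owl:ObjectProperty ;
    rdfs:domain :Patient ;
    rdfs:range  :Symptom .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(f"Parsed {len(g)} triples")  # -> Parsed 7 triples
```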

3.2. Previous Works

The authors position their work in the context of recent research exploring LLMs for knowledge-related tasks. Key prior studies include:

  • Ontology Extraction and Mapping:
    • Oksannen et al. (2021): Used BERT models (an earlier type of LLM) to automatically extract product ontologies from text reviews, showing better performance than traditional methods.
    • He et al. (2022): Developed BERTMap, a system that leverages BERT for ontology mapping (aligning concepts between two different ontologies), demonstrating high precision.
  • Knowledge Base Construction:
    • Ning et al. (2022): Showed how to extract factual information from LLMs by designing specific prompts.
    • Biester et al. (2023): Used "prompt ensembles" (combining multiple prompts) to improve the accuracy of constructing knowledge bases from LLMs like ChatGPT.
  • Using LLMs with Existing Knowledge Graphs (KGs):
    • Lippolis et al. (2023): Combined LLMs with traditional query methods to align entities between two large knowledge graphs (ArtGraph and Wikidata).
    • Mountantonakis and Tzitzikas (2023): Devised a method to validate facts generated by ChatGPT against an existing RDF knowledge graph, finding a high percentage of verifiable facts.
    • Pan et al. (2023): Proposed frameworks for unifying LLMs and KGs to enhance reasoning capabilities, leveraging the strengths of both.
  • Specialized Applications:
    • Funk et al. (2023): Investigated using GPT-3.5 to semi-automatically create concept hierarchies, showing its effectiveness in generating structured knowledge.
    • Mateiu et al. (2023): Demonstrated using GPT-3 to convert natural language descriptions into formal ontology axioms, simplifying the ontology creation process.

3.3. Technological Evolution

The field of ontology engineering has evolved significantly:

  1. Manual Era: Initially, ontologies were handcrafted by "knowledge engineers" in collaboration with domain experts. This process was rigorous but extremely slow, expensive, and difficult to scale.

  2. Semi-Automated Era: Tools were developed to help automate parts of the process, such as extracting terms from domain texts (e.g., medical literature) or identifying relationships. However, significant human oversight was still required.

  3. LLM-Assisted Era: The current wave, which this paper contributes to, explores using the vast internal knowledge of LLMs to generate or refine ontologies. The research is moving from "zero-shot" or fully automated attempts to more nuanced, collaborative approaches.

    This paper fits into the most recent stage by formalizing and evaluating different models of human-LLM collaboration, treating the LLM not as a replacement for the human expert but as a powerful, interactive partner.

3.4. Differentiation Analysis

Compared to the related work, this paper's primary innovation is its focus on the collaborative process itself. While previous studies demonstrated that LLMs can perform specific sub-tasks of ontology engineering (e.g., extraction, mapping, axiom generation), this paper:

  • Proposes and compares end-to-end methodologies (X-HCOME, SimX-HCOME+) that structure the entire collaborative workflow between a human and an LLM.
  • Systematically varies the level of human involvement, from a single prompt to an iterative, supervised dialogue, and measures the impact on the final product's quality.
  • Provides a practical case study in a complex, real-world domain (Parkinson's Disease), testing the methodologies' effectiveness beyond theoretical or toy problems.
  • Highlights the "expert review" loop, where analysis of the LLM's "mistakes" (false positives) can actually lead to the discovery of new, valid knowledge that was missing from the original gold standard, turning the LLM into a discovery tool.

4. Methodology

4.1. Principles

The core principle of the research methodology is to evaluate a spectrum of approaches for ontology engineering, ranging from fully automated generation by LLMs to deeply integrated human-LLM collaboration. The goal is to determine the most effective and efficient method for creating a comprehensive and accurate domain ontology. This is achieved through a series of four experiments, each testing a specific methodology and a corresponding hypothesis about the role of LLMs and humans.

The experiments revolve around the task of creating an ontology for monitoring and alerting in Parkinson's Disease. The performance of each methodology is benchmarked against a pre-existing, expert-created gold standard ontology, the Wear4PDmove ontology.

4.2. Core Methodology In-depth (Layer by Layer)

The research is structured around four experiments, each with a distinct methodology.

4.2.1. Experiment 1: LLM-Only Ontology Generation

This experiment tests Hypothesis 1: LLMs, when prompted with domain-specific queries, can autonomously develop a comprehensive ontology. It explores two low-human-involvement prompting techniques.

  • One Shot (OS) Prompting:

    • Description: This method involves providing the LLM with a single, comprehensive prompt that details the entire task. The model is expected to generate the full ontology based on this one input, relying on its pre-trained knowledge.
    • Example Prompt:

      "Act as an Ontology Engineer, I need to generate an ontology about Parkinson disease monitoring and alerting patients. The aim of the ontology is to collect movement data of Parkinson disease patients through wearable sensors, analyze them in a way that enables the understanding (uncover) of their semantics, and use these semantics to semantically annotate the data for interoperability and interlinkage with other related data. You will reuse other related ontologies about neurodegenerative diseases. In the process, you should focus on modeling different aspects of PD, such as disease severity, movement patterns of activities of daily living, and gait. Give the output in TTL format."

  • Chain of Thought (CoT) Prompting:

    • Description: In this paper, CoT refers to a specific method of breaking the single OS prompt into two sequential prompts. This is designed to guide the LLM's reasoning process in a more structured manner. The first prompt sets the overall context, and the second adds specific modeling constraints.
    • Example Prompts:

      Prompt 1: "Act as an Ontology Engineer, I need to generate an ontology about Parkinson disease monitoring and alerting patients. The aim of the ontology is to collect movement data of Parkinson disease patients through wearable sensors, analyze them in a way that enables the understanding (uncover) of their semantics, and use these semantics to semantically annotate the data for interoperability and interlinkage with other related data."

      Prompt 2: "You will reuse other related ontologies about neurodegenerative diseases. In the process, you should focus on modeling different aspects of PD, such as disease severity, movement patterns of activities of daily living and gait. Give the output in TTL format."

The outputs from both techniques are then validated for syntactical correctness and compared against the gold standard ontology.
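
As a concrete (hypothetical) illustration of the two regimes, the sketch below scripts them against the OpenAI Python client. The authors worked through the chat interfaces of several LLMs directly, so this is not the paper's tooling, and the prompt constants abbreviate the full prompts quoted above.

```python
# Illustrative sketch of the two prompting regimes, assuming the OpenAI Python
# client; the paper used the chat interfaces of ChatGPT, Bard/Gemini, and
# Llama2 directly, so this only mirrors the structure of the interaction.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

OS_PROMPT = "Act as an Ontology Engineer, I need to generate an ontology ..."  # full OS prompt as quoted above
COT_PROMPT_1 = "Act as an Ontology Engineer, ..."                              # context-setting prompt
COT_PROMPT_2 = "You will reuse other related ontologies ... Give the output in TTL format."

def one_shot(model: str = "gpt-4") -> str:
    """One comprehensive prompt; the model must produce the full TTL in one pass."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": OS_PROMPT}],
    )
    return resp.choices[0].message.content

def chain_of_thought(model: str = "gpt-4") -> str:
    """Two sequential prompts in one conversation: context first, constraints second."""
    history = [{"role": "user", "content": COT_PROMPT_1}]
    first = client.chat.completions.create(model=model, messages=history)
    history.append({"role": "assistant", "content": first.choices[0].message.content})
    history.append({"role": "user", "content": COT_PROMPT_2})
    second = client.chat.completions.create(model=model, messages=history)
    return second.choices[0].message.content  # expected to contain the TTL output
```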

4.2.2. Experiment 2: The X-HCOME Methodology

This experiment tests Hypothesis 2: The combination of human expertise and LLM capabilities enhances the comprehensiveness of the developed ontology. It introduces X-HCOME, a hybrid, iterative methodology.

  • Description: X-HCOME stands for an extension of the Human-Centered Collaborative Ontology Engineering (HCOME) methodology. It formalizes an alternating, collaborative workflow between human experts and an LLM.

  • The 7-Step Iterative Process:

    1. (Human): The process begins with a human expert defining the core requirements: the aim and scope of the ontology, competency questions (questions the ontology should be able to answer), and providing relevant domain data.

    2. (LLM): The LLM takes this input and constructs an initial version of the domain ontology (e.g., in Turtle format).

    3. (Human): The human expert then compares the LLM-generated ontology with the gold standard ontology. This can be done manually or with the help of ontology mapping tools like LogMap.

    4. (LLM): The LLM is then tasked to perform its own machine-based comparison between its generated ontology and the gold standard, providing an automated alignment perspective.

    5. (Human): The human expert synthesizes the insights from both the human and machine comparisons to create a revised, improved version of the ontology, potentially merging the LLM's output with the existing gold standard.

    6. (LLM): The LLM performs another evaluation of this newly revised ontology.

    7. (Human): Finally, the human expert uses standard ontology engineering tools (like Protégé with a reasoner like Pellet) to perform final consistency and validity checks on the refined ontology.

      This loop represents a much deeper level of collaboration than the simple prompting in Experiment 1.
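
The alternation can be summarized as a process skeleton. The sketch below is schematic, not an implementation from the paper: the llm and human objects and their methods are placeholders for actions that in practice happen in chat sessions and in tools such as LogMap and Protégé.

```python
# Schematic skeleton of one X-HCOME iteration. All methods are placeholders:
# the human steps are performed by an expert outside the script (e.g. in
# Protégé with the Pellet reasoner), so this is a process sketch, not a tool.
def x_hcome_iteration(requirements, gold_standard, llm, human):
    draft = llm.generate_ontology(requirements)           # step 2: LLM drafts TTL
    human_diff = human.compare(draft, gold_standard)      # step 3: expert comparison (e.g. LogMap-assisted)
    llm_diff = llm.compare(draft, gold_standard)          # step 4: machine comparison
    revised = human.revise(draft, human_diff, llm_diff)   # step 5: expert merges both perspectives
    llm_review = llm.evaluate(revised)                    # step 6: LLM re-evaluates the revision
    human.validate(revised, llm_review)                   # step 7: final consistency/validity checks
    return revised
```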

4.2.3. Experiment 3: Expert Review of X-HCOME False Positives

This experiment tests Hypothesis 3: Analyzing false positives and incorporating domain expert opinions, LLMs can identify relevant domain knowledge not included in the gold standard ontology.

  • Description: This is not a new generation method but an analysis step applied to the results of X-HCOME. The authors, acting as domain experts, manually reviewed every False Positive (a class generated by the LLM that was not in the gold standard).
  • Process: For each False Positive, they judged whether it was genuinely an error or if it was a valid and relevant concept for the PD domain that the original gold standard ontology had simply omitted. If the latter was true, the class was re-categorized from a False Positive to a True Positive. This step acknowledges that the LLM, trained on a vast corpus, might uncover knowledge gaps in a human-curated ontology.
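
In scoring terms, this review is a simple re-categorization applied before the metrics are recomputed. The sketch below is a hypothetical illustration (not tooling from the paper), using the Bard/Gemini X-HCOME counts reported later in Tables 2 and 3.

```python
# Sketch of the expert-review step: false positives judged valid by the domain
# expert are promoted to true positives before metrics are recomputed.
def expert_review(tp: int, fp: int, promoted: int) -> tuple[int, int]:
    assert 0 <= promoted <= fp, "cannot promote more classes than were flagged"
    return tp + promoted, fp - promoted

# Bard/Gemini X-HCOME: all 31 false positives were judged valid (Table 2 -> Table 3).
print(expert_review(tp=19, fp=31, promoted=31))  # -> (50, 0)
```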

4.2.4. Experiment 4: The SimX-HCOME+ Methodology

This experiment tests Hypothesis 4: Simulated collaboration between human experts and LLMs enhances ontology engineering by introducing a methodology where LLMs lead ontology development tasks within a controlled environment. It introduces a new, highly interactive methodology.

The overall experimental process is depicted in the following flowchart:

Figure 1: Flowchart of a multi-phase experimentation assessing the construction and validation of ontologies using different methodologies (created with AI-Whimsical ChatGPT, 2023). The image is a schematic of the experimental phases for building and validating the ontologies, covering the experiments conducted in collaboration with LLMs and the corresponding evaluation metrics, such as number of classes, true positives, and false positives.

  • Description: SimX-HCOME+ (Simulated X-HCOME+) creates a simulated environment for collaboration. The human user orchestrates an iterative conversation with the LLM, prompting it to act out the roles of three key personas in ontology engineering:

    • Knowledge Worker (KW): Gathers and provides initial domain knowledge.
    • Domain Expert (DE): Validates concepts and provides specialized feedback.
    • Knowledge Engineer (KE): Formalizes the knowledge into the ontology structure.
  • Process: The human user guides the LLM through these roles in an iterative dialogue. A key feature of SimX-HCOME+ is continuous generation and refinement, meaning an updated version of the ontology is produced at each step of the conversation, allowing for a more granular and detailed development process.

  • NL to SWRL Conversion Task: Within this experiment, the authors also tested the LLM's ability to translate a complex rule from natural language into the formal Semantic Web Rule Language (SWRL).

    • Requested Rule:

      "If an observation indicates that there is bradykinesia of the upper limb (indicating slow movement) and this observation pertains to the property and the observation is made after medication dosing, then a notification should be sent indicating a <MissingDoseNotification><MissingDoseNotification> and this observation should be marked as a <PDpatientMissingDoseEventObservation><PDpatientMissingDoseEventObservation>." This sub-task specifically probes the LLM's logical reasoning and formal language generation capabilities.

5. Experimental Setup

5.1. Datasets

The primary "dataset" for this study is not a conventional dataset of instances but rather the conceptual framework of the target domain.

  • Gold Standard Ontology: The ground truth for evaluation is the Wear4PDmove ontology. This is an existing, expert-built ontology designed for Parkinson's Disease. Its purpose is to semantically integrate heterogeneous data sources, including:
    • Dynamic/stream data from wearable sensors (monitoring movement).
    • Static/historic data from Personal Health Records (PHR).

    The ontology serves as a knowledge model to connect patients, doctors, and smart health applications, enabling reasoning for high-level event recognition like 'missing dose' or 'patient fall'.
  • Input Data: The inputs provided to the LLMs consisted of prompts containing the aim, scope, requirements, and competency questions for the target ontology, as detailed in the Methodology section. The process did not use a large corpus of raw patient data for training but rather relied on the LLM's internal knowledge, guided by expert-defined prompts.

5.2. Evaluation Metrics

The primary evaluation focused on comparing the classes generated by the LLMs against the classes in the Wear4PDmove gold standard ontology. The paper uses standard metrics from information retrieval: Precision, Recall, and F1-score.

  1. Conceptual Definition:

    • True Positives (TP): The number of classes correctly generated by the LLM that are also present in the gold standard ontology.
    • False Positives (FP): The number of classes generated by the LLM that are not present in the gold standard ontology (i.e., erroneous or irrelevant classes).
    • False Negatives (FN): The number of classes from the gold standard ontology that the LLM failed to generate.
  2. Mathematical Formula & Symbol Explanation:

    • Precision: Measures the accuracy of the generated classes. Of all the classes the LLM generated, what fraction were actually correct? $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $ Where:

      • TP: True Positives
      • FP: False Positives
    • Recall: Measures the completeness of the generated ontology. Of all the correct classes that should have been generated, what fraction did the LLM actually find? $ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $ Where:

      • TP: True Positives
      • FN: False Negatives
    • F1-score: The harmonic mean of Precision and Recall, providing a single score that balances both. It is useful when there is an uneven class distribution (e.g., many more possible negatives than positives). $ \text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

The paper notes that validation involved both exact matching (the generated class name is identical to the gold standard) and similarity matching (the generated class is semantically similar).
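
These metrics can be computed directly from the counts reported in the results tables. The short sketch below reproduces one row of Table 2 (Bard/Gemini X-HCOME) as a sanity check.

```python
# Minimal computation of the paper's class-level metrics from raw counts.
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Reproducing the Bard/Gemini X-HCOME row of Table 2: TP=19, FP=31, FN=22.
p, r, f = prf1(19, 31, 22)
print(f"{p:.0%} {r:.0%} {f:.0%}")  # -> 38% 46% 42%
```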

5.3. Baselines

The study uses the different methodologies and LLMs as comparative baselines against each other rather than comparing to external state-of-the-art systems.

  • Methodology Baselines: The simpler, low-human-involvement methods (OS and CoT) serve as baselines to demonstrate the performance gain achieved by the more advanced collaborative methods (X-HCOME and SimX-HCOME+).
  • LLM Baselines: The experiments were run across a set of different LLMs to compare their performance on the same tasks. These include:
    • ChatGPT-3.5

    • ChatGPT-4

    • Bard/Gemini (from Google)

    • Llama2 (from Meta)

    • Claude (from Anthropic, used only in Experiment 4)

      The ultimate benchmark for all experiments is the Wear4PDmove gold standard ontology, which represents the target 100% score for an ideal system.

6. Results & Analysis

6.1. Core Results Analysis

The results are presented across the four experiments, evaluating the performance of different LLMs and methodologies. The focus is on the generation of ontology classes.

6.1.1. Experiments 1 & 2 (OS, CoT, and X-HCOME)

This section compares the LLM-only approaches (OS, CoT) with the initial human-LLM collaborative approach (X-HCOME). The results clearly show the superiority of the collaborative method.

The following are the results from Table 2 of the original paper:

| Method | Number of Classes | True Positives | False Positives | False Negatives | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| Gold-ontology | 41 | – | – | – | – | – | – |
| ChatGPT3.5 CoT | 3 | 2 | 1 | 39 | 67% | 5% | 9% |
| ChatGPT3.5 OS | 5 | 2 | 3 | 39 | 40% | 5% | 9% |
| ChatGPT3.5 X-HCOME | 25 | 10 | 15 | 31 | 40% | 24% | 30% |
| ChatGPT4 CoT | 6 | 4 | 2 | 37 | 67% | 10% | 17% |
| ChatGPT4 OS | 9 | 5 | 4 | 36 | 56% | 12% | 20% |
| ChatGPT4 X-HCOME | 33 | 10 | 23 | 31 | 30% | 24% | 27% |
| Bard/Gemini CoT | 8 | 5 | 3 | 36 | 63% | 12% | 20% |
| Bard/Gemini OS | 13 | 1 | 12 | 40 | 8% | 2% | 4% |
| Bard/Gemini X-HCOME | 50 | 19 | 31 | 22 | 38% | 46% | 42% |
| Llama2 CoT | 3 | 3 | 0 | 38 | 100% | 7% | 14% |
| Llama2 OS | 2 | 2 | 0 | 39 | 100% | 5% | 9% |
| Llama2 X-HCOME | 32 | 4 | 28 | 37 | 13% | 10% | 11% |

Analysis:

  • LLM-Only Performance (OS, CoT): These methods produce very few classes and have extremely low recall (at most 12%) and F1-scores (at most 20%). This supports the conclusion that LLMs, when used with minimal guidance, fail to create a comprehensive ontology.
  • Collaborative Performance (X-HCOME): The X-HCOME methodology results in a dramatic improvement across the board. The number of generated classes increases significantly, and the F1-scores jump to 30% for ChatGPT3.5, 27% for ChatGPT4, and a notable 42% for Bard/Gemini. This demonstrates that the iterative, human-supervised process is far more effective.
  • Llama2 Issues: The paper notes that outputs from Llama2 contained syntactical errors and inconsistencies, hindering its performance, especially in the more complex X-HCOME task.

6.1.2. Experiment 3 (Expert Review of X-HCOME)

This analysis revisits the X-HCOME results, with human experts re-evaluating the False Positives.

The following are the results from Table 3 of the original paper:

| Method | Number of Classes | True Positives | False Positives | False Negatives | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| Gold-ontology | 41 | – | – | – | – | – | – |
| ChatGPT3.5 CoT | 3 | 2 | 1 | 39 | 67% | 5% | 9% |
| ChatGPT3.5 OS | 5 | 2 | 3 | 39 | 40% | 5% | 9% |
| ChatGPT3.5 X-HCOME | 25 | 23 | 2 | 18 | 92% | 56% | 70% |
| ChatGPT4 CoT | 6 | 4 | 2 | 37 | 67% | 10% | 17% |
| ChatGPT4 OS | 9 | 5 | 4 | 36 | 56% | 12% | 20% |
| ChatGPT4 X-HCOME | 33 | 29 | 4 | 12 | 88% | 71% | 78% |
| Bard/Gemini CoT | 8 | 5 | 3 | 36 | 63% | 12% | 20% |
| Bard/Gemini OS | 13 | 1 | 12 | 40 | 8% | 2% | 4% |
| Bard/Gemini X-HCOME | 50 | 50 | 0 | -9 | 100% | 122% | 110% |
| Llama2 CoT | 3 | 3 | 0 | 38 | 100% | 7% | 14% |
| Llama2 OS | 2 | 2 | 0 | 39 | 100% | 5% | 9% |
| Llama2 X-HCOME | 32 | 26 | 6 | 15 | 81% | 63% | 71% |

Analysis:

  • Massive Performance Boost: The expert review dramatically increases the measured performance. The F1-scores for X-HCOME jump to 70% (ChatGPT3.5), 78% (ChatGPT4), and an incredible 110% (Bard/Gemini).
  • Knowledge Discovery: The Bard/Gemini result is particularly striking. An F1-score over 100% (and recall of 122%) means the LLM not only found all relevant classes from the gold standard but also identified 9 additional, valid domain classes that were missing from the original expert-built ontology. Examples include concepts like "Surgical Intervention", "Rigidity", and "Cognitive Impairment". This confirms Hypothesis 3 and shows that the LLM can act as a knowledge discovery tool, augmenting human expertise.

6.1.3. Experiment 4 (SimX-HCOME+)

This experiment tests the new SimX-HCOME+ methodology.

The following are the results from Table 4 of the original paper:

| Method | Number of Classes | True Positives | False Positives | False Negatives | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| Gold ontology | 41 | – | – | – | – | – | – |
| ChatGPT-4 | 17 | 9 | 8 | 32 | 52% | 21% | 31% |
| ChatGPT-3.5 | 21 | 14 | 7 | 27 | 66% | 34% | 45% |
| Gemini | 22 | 15 | 7 | 26 | 68% | 36% | 48% |
| Claude | 24 | 12 | 12 | 29 | 50% | 29% | 37% |

Analysis:

  • The SimX-HCOME+ results show moderate performance, with F1-scores ranging from 31% to 48%. Gemini is the top performer with a 48% F1-score.
  • These scores are notably lower than the expert-reviewed X-HCOME scores but generally higher than the initial X-HCOME scores from Table 2. This suggests the methodology is effective, but the evaluation may not have included the crucial "expert review of false positives" step, which would likely have boosted the scores significantly.

NL to SWRL Conversion Results: The paper reports that the task of converting a natural language rule to SWRL was largely unsuccessful. The results from Table 5 are presented in the original paper in a corrupted and uninterpretable format. Based on the textual description in the paper:

  • Most LLMs (ChatGPT-4, ChatGPT-3.5, Claude) could generate syntactically correct SWRL.

  • Gemini failed to produce a valid output.

  • Overall performance was very low, as the models failed to detect the correct number and type of logical atoms required for the rule.

  • Claude performed slightly better than others but still achieved a very low F1-score of 20% in logical comparison.

    This finding highlights a key weakness of current LLMs: while good at generating natural language and structured data formats, they struggle with formal logical languages that require precise, symbolic reasoning.

6.2. Ablation Studies / Parameter Analysis

The paper presents a final analysis correlating the F1-score with the degree of human involvement, which functions as a high-level ablation study on the importance of human collaboration. The authors assign an arbitrary "Level of Human Involvement" score from 1 (minimal) to 5 (maximal) for each methodology.

The following are the levels from Table 6 of the original paper:

| Methodological Approach | OS | CoT | SimX-HCOME+ | X-HCOME | Expert Review X-HCOME |
|---|---|---|---|---|---|
| Level of Human Involvement | 1 | 2 | 3 | 4 | 5 |

The relationship between this level and the best F1-score achieved by any model for that method is plotted.

Figure 2: The graph compares the highest F1 scores from various LLM methods and the degree of human involvement in the PD domain. The x-axis represents the different methods: CoT Gemini, OS ChatGPT4, X-HCOME Bard/Gemini, Expert Review X-HCOME Bard/Gemini, and SimX-HCOME+ Gemini. The left y-axis shows the F1 score, while the right y-axis indicates the degree of human involvement, measured on a scale from 1 (minimum) to 5 (maximum).

Analysis: The graph clearly illustrates a strong positive correlation between the level of human involvement and the performance of the ontology engineering process.

  • Level 1 (OS): Low F1-score (~20%).

  • Level 2 (CoT): Similar low F1-score (~20%).

  • Level 3 (SimX-HCOME+): Moderate F1-score (~48%).

  • Level 4 (X-HCOME): F1-score of ~42% before expert review, slightly below the SimX-HCOME+ peak.

  • Level 5 (Expert Review X-HCOME): The highest F1-score by a large margin (110%).

    This visualization provides the most compelling evidence for the paper's central thesis: deep, structured human-LLM collaboration is not just beneficial but essential for achieving expert-level results in ontology engineering.
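
For readers without access to the figure, a rough reconstruction can be plotted from the values quoted above. The numbers below are the approximations listed in this analysis, not the exact values of the original figure.

```python
# Rough reconstruction of Figure 2 from the approximate values quoted above
# (a sketch only; exact numbers are in the original paper's figure).
import matplotlib.pyplot as plt

methods = ["OS", "CoT", "SimX-HCOME+", "X-HCOME", "Expert Review\nX-HCOME"]
involvement = [1, 2, 3, 4, 5]        # Levels from Table 6
best_f1 = [20, 20, 48, 42, 110]      # approximate best F1 per method (%)

fig, ax1 = plt.subplots()
ax1.bar(methods, best_f1, color="steelblue")
ax1.set_ylabel("Best F1-score (%)")
ax2 = ax1.twinx()                    # second y-axis for involvement level
ax2.plot(methods, involvement, "ro-")
ax2.set_ylabel("Level of human involvement (1-5)")
plt.tight_layout()
plt.show()
```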

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper concludes that while LLMs can autonomously generate draft ontologies, they are insufficient for creating comprehensive and accurate knowledge structures on their own. The true potential of LLMs in ontology engineering is unlocked through human-LLM collaboration.

The study successfully demonstrated that structured, iterative methodologies like X-HCOME and SimX-HCOME+, which integrate human expertise for guidance, refinement, and validation, significantly outperform simple prompting techniques. The most profound finding was that this collaboration can be synergistic; in the "Expert Review" phase, the LLM was shown to identify valid domain knowledge that was missing from the original human-built gold standard, effectively augmenting the human expert's knowledge. The research strongly advocates for a shift from viewing LLMs as autonomous agents to viewing them as powerful collaborative tools within a human-centered workflow.

7.2. Limitations & Future Work

The authors acknowledge several limitations in their study:

  • Inherent Bias: The methodologies are susceptible to biases from two sources: the LLM's training data and the specific domain expert's opinions and potential knowledge gaps.

  • Oversimplification of Ontology: The evaluation primarily focused on classes and object properties. A complete ontology also includes other critical components like data properties, axioms, and constraints, which were not thoroughly evaluated.

  • Dependence on Human Evaluation: Despite the assistance from LLMs, the methodologies still rely heavily on human evaluation and a gold standard ontology for benchmarking and validation, which can be a bottleneck.

    For future work, the authors propose an intriguing direction: leveraging the insights from these collaborative methodologies to develop a specialized GPT model tailored specifically for the task of ontology construction.

7.3. Personal Insights & Critique

This paper provides a valuable, practical contribution to the rapidly growing field of applied LLMs. Its strength lies in its systematic and empirical approach to a complex, real-world problem.

Positive Insights:

  • Pragmatic Framework: The X-HCOME methodology is a clear and actionable framework that research and industry teams could adopt to integrate LLMs into their knowledge management workflows. It formalizes the "human-in-the-loop" concept in a structured way.
  • Quantifying Collaboration: The idea of assigning and analyzing "Levels of Human Involvement" is a powerful way to frame the discourse around human-AI collaboration. It moves the conversation beyond a binary "human vs. machine" to a more nuanced spectrum.
  • LLM as a Discovery Engine: The finding that an LLM could identify gaps in an expert-curated ontology is profound. It reframes the LLM's "hallucinations" or "false positives" not just as errors to be filtered, but as potential starting points for human-led discovery.

Areas for Improvement and Critique:

  • Confusing Terminology: The paper's use of "Chain of Thought (CoT)" is non-standard. CoT typically refers to prompting an LLM to "think step by step" to break down its reasoning process. Here, it simply means splitting a prompt in two. This could be confusing for readers familiar with the standard definition. "Sequential Prompting" would have been a more accurate term.
  • Unclear SimX-HCOME+ Evaluation: The results for SimX-HCOME+ are presented without the "expert review" step, making a direct comparison with the best X-HCOME results difficult. The scores are lower, which seems counter-intuitive if it's an "enhanced" methodology. A clearer explanation of why the evaluation protocols differed would have strengthened the paper.
  • Poor Presentation Quality: The corruption of Table 5 in the provided text (and likely in the source PDF from which it was extracted) is a significant flaw. It renders the results of the NL2SWRL experiment nearly impossible to verify and reflects poorly on the paper's final presentation quality.
  • Understated Logical Limitations: The failure of LLMs in the SWRL task is a critical finding that underscores a major hurdle for true Neurosymbolic AI. While mentioned, the implications of this poor performance on formal logic could have been discussed more deeply as a key barrier that needs to be overcome.
