System Report for CCL24-Eval Task 3: Chinese Spatial Semantic Understanding Based on In-Context Learning and Chain of Thought Strategy
TL;DR Summary
This report details the team's approach to the SpaCE2024 evaluation, which combines in-context learning and chain-of-thought strategies across its five tasks. The resulting system reached an overall accuracy of 0.6024 on the test set, securing first place.
Abstract
This technical report provides a detailed introduction to the methods and achievements of our team in the Fourth Chinese Spatial Semantic Understanding Evaluation (SpaCE2024). The SpaCE2024 aims to comprehensively test a machine’s ability to understand Chinese spatial semantics across five different tasks: spatial information entity recognition, spatial information entity disambiguation, spatial information anomaly detection, spatial orientation reasoning, and spatial heteronym synonym recognition. Our team employed meticulously designed prompts combined with fine-tuning to enhance the spatial semantic understanding capabilities of large language models, thereby constructing an efficient spatial semantic understanding system. In the final evaluation, our system achieved an accuracy of 0.8947 in spatial information entity recognition, 0.9364 in spatial information entity disambiguation, 0.8480 in spatial information anomaly detection, 0.3471 in spatial orientation reasoning, and 0.5631 in spatial heteronym synonym recognition. The overall accuracy on the test set was 0.6024, earning us a first-place ranking.
In-depth Reading
1. Bibliographic Information
1.1. Title
System Report for CCL24-Eval Task 3: Chinese Spatial Semantic Understanding Based on In-Context Learning and Chain of Thought Strategy
1.2. Authors
Shiquan Wang, Weiwei Fu, Ruiyu Fang, Mengxiang Li, Zhongjiang He, Yongxiang Li, and Shuangyong Song. The authors are affiliated with the Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd. This indicates the research comes from an industrial research lab with a focus on applying artificial intelligence.
1.3. Journal/Conference
This paper is a technical report for the Fourth Chinese Spatial Semantic Understanding Evaluation (SpaCE2024), which is a shared task (Task 3) at the 24th China National Conference on Computational Linguistics (CCL24). CCL is a premier conference for computational linguistics and natural language processing in China, organized by the Chinese Information Processing Society of China (CIPS). Technical reports like this one detail the system submitted by a participating team to a competitive evaluation task.
1.4. Publication Year
- The paper does not list an explicit date, but it is a system report for the SpaCE2024 evaluation held at CCL 2024, so it was written and published in 2024.
1.5. Abstract
The abstract introduces the team's submission to the SpaCE2024 evaluation, a competition designed to test a machine's ability to understand Chinese spatial semantics. The evaluation consists of five distinct tasks: spatial information entity recognition, entity disambiguation, anomaly detection, orientation reasoning, and heteronym synonym recognition. The team's approach utilized a large language model (LLM) enhanced with meticulously designed prompts and fine-tuning. Their system achieved high accuracy scores across the tasks, culminating in an overall accuracy of 0.6024 on the test set, which secured them the first-place ranking in the competition.
1.6. Original Source Link
The provided link is /files/papers/6917c5cb110b75dcc59ae0de/paper.pdf. This appears to be a direct link to the PDF file of the paper, likely from the conference proceedings or a preprint server. Its status is a technical report submitted to a conference workshop/evaluation task.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is Chinese spatial semantic understanding. This field of natural language processing (NLP) aims to enable machines to comprehend and reason about spatial relationships, locations, and orientations as described in human language. This is a complex task because spatial language is often ambiguous, context-dependent, and requires real-world knowledge.
The immediate motivation for this work was the SpaCE2024 evaluation, a competitive benchmark designed to push the boundaries of machine capabilities in this area. The evaluation is comprehensive, breaking the problem down into five distinct sub-tasks that cover different facets of spatial understanding, from simple entity identification to complex logical reasoning.
Prior research has addressed aspects of this problem, but the advent of Large Language Models (LLMs) has opened up new paradigms. The key challenge and research gap is how to effectively harness the vast knowledge and emergent reasoning abilities of these powerful models for such a specialized and multi-faceted domain. The paper's innovative entry point is to move beyond traditional task-specific fine-tuning and instead employ advanced prompting strategies—specifically In-Context Learning (ICL) and Chain of Thought (CoT)—to guide a general-purpose LLM to solve these specialized tasks.
2.2. Main Contributions / Findings
The paper's primary contributions are threefold:
- Development of a State-of-the-Art System: The authors constructed a highly effective system for Chinese spatial semantic understanding by leveraging the Qwen1.5-72B-Chat LLM. This system demonstrated superior performance in a competitive setting.
- Systematic Application of Advanced Prompting: The work provides a practical demonstration of how In-Context Learning (providing examples in the prompt) and Chain of Thought (prompting the model to "think step-by-step") can be systematically applied and combined to solve a diverse set of complex NLP tasks. The results validate these techniques as powerful tools for adapting LLMs without extensive retraining.
- First-Place Performance in SpaCE2024: The most significant finding is the system's success. It achieved an overall accuracy of 0.6024, ranking first in the competition. This result not only validates their methodology but also sets a new state-of-the-art benchmark for the SpaCE2024 tasks. The detailed results show that the approach is particularly effective for knowledge-intensive tasks like entity recognition and disambiguation, while highlighting that complex spatial reasoning remains a significant challenge.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, one must be familiar with the following core concepts from modern NLP:
- Large Language Models (LLMs): LLMs are deep learning models, typically based on the Transformer architecture, that are trained on massive amounts of text data. This extensive training endows them with a broad understanding of grammar, facts, reasoning abilities, and cultural contexts. Models like GPT-4, Llama, and Qwen (used in this paper) can perform a wide range of language tasks, such as translation, summarization, and question answering, often with minimal task-specific training.
- In-Context Learning (ICL): This is a remarkable capability of LLMs where the model can learn to perform a new task simply by being shown a few examples (shots) within the prompt itself. This happens without any updates to the model's internal parameters (weights). The prompt typically contains a task description followed by one or more input-output examples.
  - Zero-shot: The model is given only the task description and is expected to perform the task without any examples.
  - Few-shot: The model is given a few examples (e.g., 1 to 5) in the prompt before the actual query. The paper experiments with 5-shot learning.
- Chain of Thought (CoT) Prompting: CoT is an advanced prompting technique designed to improve the reasoning ability of LLMs on complex tasks. Instead of just providing the final answer in the few-shot examples, the prompt also includes the intermediate reasoning steps that lead to the answer. By seeing these "chains of thought," the model learns to break down a new problem into smaller, manageable steps, verbalize its reasoning process, and arrive at a more accurate final answer. This is particularly useful for arithmetic, commonsense, and symbolic reasoning tasks.

The following figure from the paper provides an excellent example of a CoT prompt for a spatial entity disambiguation task.
(Figure: a multiple-choice entity-disambiguation question involving the National Museum of Chinese History and the Tiananmen Gate tower. The text gives the background and the question, and the chosen answer is accompanied by a further reasoning process.)
This example shows the model being prompted not just with the question and answer, but with a detailed thought process: 1. Identify entities. 2. Analyze their relationship. 3. Formulate a conclusion.
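Because the original figure is reproduced here only as a caption, the following is a minimal sketch of what such an ICL + CoT prompt could look like. It is written in English for readability and is purely illustrative: the authors' actual prompts are not published and would be in Chinese, and the instruction text, demonstration wording, and helper function below are assumptions.

```python
# Illustrative only: a hypothetical few-shot CoT prompt in the spirit of the figure above.
# The authors' real prompts are not published and would be written in Chinese;
# the instruction text, demonstration wording, and helper below are assumptions.

DEMONSTRATION = (
    "Question: In the sentence 'Standing on the Tiananmen Gate tower, he gazed toward the "
    "National Museum of Chinese History', does the museum refer to (A) the institution as an "
    "organization or (B) the physical building at a location?\n"
    "Reasoning: 1. Identify the spatial entities: the Tiananmen Gate tower and the museum. "
    "2. Analyze their relationship: 'gazed toward' requires a physical target with a location. "
    "3. Conclude: the museum is used here as a concrete building, not an organization.\n"
    "Answer: B"
)

def build_cot_prompt(instruction: str, demonstrations: list[str], query: str) -> str:
    """Concatenate an instruction, worked CoT demonstrations, and the new query."""
    return "\n\n".join([instruction, *demonstrations, f"Question: {query}\nReasoning:"])

prompt = build_cot_prompt(
    instruction=("You will answer a multiple-choice question about spatial entity "
                 "disambiguation. Think step by step, then give the answer letter."),
    demonstrations=[DEMONSTRATION],
    query="...",  # a new evaluation item would go here
)
print(prompt)
```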
3.2. Previous Works
The authors situate their work within the history of shared tasks on spatial language understanding.
- SemEval Competitions (2012, 2013, 2015): The Semantic Evaluation (SemEval) workshops have been instrumental in this field. Early tasks like SemEval-2012 Task 3 and SemEval-2013 Task 3 focused on Spatial Role Labeling (SpRL). The goal of SpRL is to identify spatial elements in a text and assign them specific roles (e.g., trajector, the object that moves or is located; landmark, the reference object; and spatial indicator, the term describing the relationship, like "on" or "near"). SemEval-2015 Task 8 (SpaceEval) continued this effort, standardizing datasets and evaluation for spatial understanding.
- SpaCE Evaluations (2021-2023): The Chinese Spatial Semantic Understanding Evaluation (SpaCE) is a series of evaluations specifically focused on the Chinese language. It builds upon the foundations of SemEval but introduces more diverse and challenging tasks tailored to the nuances of Chinese. This paper participates in the fourth iteration, SpaCE2024.
3.3. Technological Evolution
The methodology for solving spatial understanding tasks has evolved significantly:
- Rule-based and Feature-based Systems: Early systems relied on handcrafted grammatical rules and features to identify spatial relations.
- Traditional Machine Learning: These were followed by statistical models like Support Vector Machines (SVMs) and Conditional Random Fields (CRFs) that learned from annotated data.
- Deep Learning (pre-LLM): With the rise of deep learning, models like LSTMs and later BERT-based encoders were fine-tuned for specific spatial tasks like entity recognition and relation extraction. This required a separate fine-tuned model for each task.
- Large Language Models (LLM) Paradigm: The current state-of-the-art, as exemplified by this paper, uses a single, massive, pre-trained LLM as a general-purpose reasoner. The focus shifts from model architecture engineering to prompt engineering, using techniques like ICL and CoT to steer the model's behavior for various tasks. This paper fits squarely into this modern paradigm.
3.4. Differentiation Analysis
Compared to previous methods, this paper's approach has several key differentiators:
- Unified Model: Instead of training multiple specialized models for the five different sub-tasks, the authors use a single foundation model (Qwen1.5-72B-Chat). This is more efficient and leverages the model's general reasoning skills across all tasks.
- Prompt-Centric Adaptation: The primary method of adaptation is through sophisticated prompting (ICL and CoT), not extensive fine-tuning of the model's weights. This makes the system more flexible and less reliant on large amounts of labeled training data for each specific task.
- Focus on Reasoning: The explicit use of Chain of Thought distinguishes this work from approaches that treat tasks as simple input-output mapping. By encouraging the model to generate reasoning steps, they aim for more robust and explainable predictions, especially on the more complex tasks.
4. Methodology
4.1. Principles
The core principle of the authors' methodology is to reframe all five spatial understanding sub-tasks as a unified text generation problem that can be solved by a large language model. The approach does not rely on creating distinct model architectures for each task. Instead, it leverages the inherent knowledge and reasoning capabilities of the Qwen1.5-72B-Chat model and guides its output through meticulously crafted prompts. The key strategies employed are In-Context Learning (ICL) to provide task-specific examples and Chain of Thought (CoT) to elicit step-by-step reasoning for more complex problems.
4.2. Core Methodology In-depth
The authors formalize their approach as a probabilistic framework. For a given input sentence or context $x$, the goal is to select the most likely correct answer $y^*$ from a set of possible choices $\mathcal{Y}$. This is modeled as finding the answer that maximizes the conditional probability $P(y \mid x)$.
4.2.1. Integrated Explanation
The process can be broken down into the following steps:
Step 1: Prompt Construction

A context prompt, denoted as $C$, is constructed to guide the LLM. This prompt is the primary tool for adapting the model to a specific task. The prompt is a concatenation of several components:

- An Instruction ($I$): A clear description of the task the model needs to perform.
- Few-shot Examples ($k$ shots): A set of example pairs $E = \{(x_1, y_1), \dots, (x_k, y_k)\}$, where each pair is a demonstration of the task with an input $x_i$ and its corresponding correct output $y_i$. For Chain of Thought prompts, the output includes the intermediate reasoning steps. The paper experiments with 0-shot ($k = 0$) and 5-shot ($k = 5$) settings. These examples are drawn from the provided training dataset.

Thus, the complete prompt can be represented as either $C = E$ or $C = (I, E)$, depending on whether an explicit instruction is included.
Step 2: Probability Estimation

The LLM, denoted as $M$, takes the constructed prompt $C$ and the new input query $x$ to predict the probability of each possible answer $y \in \mathcal{Y}$. The model acts as a function that computes this probability. This is formally expressed as:

$$P(y \mid x) \triangleq M(y, C, x)$$

- $P(y \mid x)$: The conditional probability of the answer $y$ being correct, given the input $x$.
- $\triangleq$: This means "is defined as".
- $M(y, C, x)$: This represents the computation performed by the language model $M$. The model takes the candidate answer $y$, the context prompt $C$, and the current input $x$ to generate a score or probability. In practice, for multiple-choice questions, this often involves the model generating the text of the chosen option or calculating the likelihood of generating that text.
Step 3: Answer Selection

The final answer, $y^*$, is the one with the highest probability as calculated by the model in the previous step. This is a standard maximum likelihood decision rule:

$$y^* = \arg\max_{y \in \mathcal{Y}} P(y \mid x)$$

- $\arg\max_{y \in \mathcal{Y}}$: This means "find the $y$ from the set of all possible answers $\mathcal{Y}$ that maximizes the following expression".
- $P(y \mid x)$: The probability calculated in Step 2.
In essence, the model is prompted with instructions and examples, then given a new problem. It evaluates all possible answers and selects the one it deems most plausible based on the context it was given.
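As a concrete illustration of Steps 2 and 3, the sketch below scores each candidate answer by its log-likelihood under a causal LM and returns the argmax. This is not the authors' released code: the Hugging Face model ID (a smaller Qwen1.5 chat model standing in for Qwen1.5-72B-Chat), the likelihood-based scoring, and the clean tokenizer split between context and candidate are all assumptions.

```python
# A minimal sketch (not the authors' released code) of the selection rule
# y* = argmax_y P(y | C, x), with P approximated by a causal LM's log-likelihood
# of the candidate text given the prompt. Treat as illustrative, not exact.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen1.5-7B-Chat"  # assumed smaller stand-in for Qwen1.5-72B-Chat
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
model.eval()

def candidate_log_likelihood(context: str, candidate: str) -> float:
    """Sum of log-probabilities the LM assigns to the candidate tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = full_ids[0, 1:]
    start = ctx_len - 1  # first predicted position belonging to the candidate
    # Assumes the tokenizer splits context and candidate cleanly at this boundary.
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

def select_answer(prompt_with_query: str, candidates: list[str]) -> str:
    """Step 3: return the candidate the model deems most likely (the argmax rule)."""
    return max(candidates, key=lambda y: candidate_log_likelihood(prompt_with_query, y))

# Hypothetical usage: `prompt_with_query` would be the ICL/CoT prompt plus the new question.
# answer = select_answer(prompt_with_query, ["A", "B", "C", "D"])
```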
4.2.2. Model and Strategy Variations
- Model Choice: The primary model is Qwen1.5-72B-Chat, a very large and capable instruction-tuned LLM. A smaller Qwen1.5-7B-Chat model was also used for some comparative experiments.
- Ensembling (Voting): To improve the robustness and accuracy of the final predictions, the team employed an ensemble strategy referred to as Vote. As shown in their final results (Table 3), the best-performing submission (TeleAI_test_6) utilized voting. This typically involves running the prediction process multiple times with slight variations (e.g., different prompts, different model decoding parameters) and choosing the answer that appears most frequently (a majority vote). This helps to reduce the impact of random errors or model instability. A minimal sketch of such a vote appears at the end of this subsection.

The following image from the paper provides a visual example for the task of spatial orientation reasoning, which is a multiple-choice question format.
(Figure: a schematic of the National Museum of Chinese History and Tiananmen, showing the Tiananmen Gate tower and its surroundings. The locations mentioned include the Great Hall of the People and the National Museum of Chinese History, with their relative positions on the west and south sides indicated.)
This illustrates a typical input where the model must understand the relative positions of landmarks to answer the question. The authors' method would format this into a prompt, possibly with other solved examples, and feed it to the LLM to select the correct option.
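For completeness, here is a minimal sketch of the Vote strategy referenced above, under the assumption that it is a simple per-item majority vote over several prediction runs; the report does not specify the exact aggregation or how the runs differ.

```python
# A minimal majority-vote sketch (assumed aggregation; not described in detail in the report).
from collections import Counter

def majority_vote(predictions_per_run: list[list[str]]) -> list[str]:
    """predictions_per_run[r][i] is run r's answer for item i; return each item's majority answer."""
    return [Counter(per_item).most_common(1)[0][0] for per_item in zip(*predictions_per_run)]

# Hypothetical usage with three runs over four test items:
runs = [
    ["A", "B", "C", "D"],
    ["A", "B", "B", "D"],
    ["A", "C", "C", "D"],
]
print(majority_vote(runs))  # ['A', 'B', 'C', 'D']
```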
5. Experimental Setup
5.1. Datasets
The experiments were conducted using the official dataset from the SpaCE2024 competition. This dataset is divided into training, development (dev), and test sets for five distinct sub-tasks.
The five sub-tasks are:
- Spatial Information Entity Recognition: Identifying spatial entities within a text.
- Spatial Information Entity Disambiguation: Resolving ambiguity between entities with similar names or in complex contexts.
- Spatial Information Anomaly Detection: Identifying sentences that contain illogical or contradictory spatial information.
- Spatial Orientation Reasoning: Answering questions that require reasoning about directions and relative positions.
- Spatial Heteronym Synonym Recognition: Identifying whether two spatially-related terms with different names are synonyms in a given context.
The following are the results from Table 1 of the original paper, detailing the data distribution for each task:
| Task | Train | Train | Train Total | Dev | Dev | Dev Total | Test | Test | Test Total |
|---|---|---|---|---|---|---|---|---|---|
| Entity Recognition | 937 | 161 | 1098 | 226 | 24 | 250 | 513 | 27 | 600 |
| Entity Disambiguation | 1074 | 19 | 1093 | 186 | 4 | 190 | 776 | 24 | 800 |
| Anomaly Detection | 1077 | 10 | 1077 | 40 | 0 | 40 | 530 | 0 | 530 |
| Orientation Reasoning | 909 | 301 | 1210 | 468 | 207 | 675 | 1533 | 537 | 2070 |
| Heteronym Synonym Recognition | 4 | 1 | 5 | 44 | 11 | 55 | 541 | 139 | 680 |
(Note: The original table in the paper has unlabeled columns. Based on the context of NLP datasets, these likely represent positive and negative samples, or different categories within each task, but the paper does not specify.)
5.2. Evaluation Metrics
The single evaluation metric used across all tasks is Accuracy.
1. Conceptual Definition: Accuracy is a straightforward metric that measures the proportion of correct predictions made by the model out of the total number of predictions. In the context of this competition, it represents the percentage of test instances for which the system produced the correct answer. It is a suitable metric for classification-style tasks where each instance has one correct label.
2. Mathematical Formula:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
3. Symbol Explanation:
   - Number of Correct Predictions: The count of test samples where the model's output perfectly matches the ground-truth label.
   - Total Number of Predictions: The total number of samples in the evaluation set (e.g., the dev set or the test set).
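As a small illustration of the metric (not code from the paper; names are placeholders), the accuracy defined above can be computed as follows.

```python
# Illustrative helper matching the accuracy formula above.
def accuracy(predictions: list[str], gold_labels: list[str]) -> float:
    """Proportion of predictions that exactly match the ground-truth labels."""
    correct = sum(pred == gold for pred, gold in zip(predictions, gold_labels))
    return correct / len(gold_labels)

print(accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))  # 0.75
```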
5.3. Baselines
The primary baseline for comparison is the official baseline provided by the SpaCE2024 organizers. The performance of this baseline is reported in the final results table (Table 3). Additionally, the authors' own experiments with different settings (e.g., 0-shot vs. 5-shot, with/without CoT) serve as internal baselines to demonstrate the effectiveness of their chosen strategies.
6. Results & Analysis
6.1. Core Results Analysis
The authors present two sets of results: an ablation study on the development set to fine-tune their strategy, and the final results on the test set submitted to the competition.
6.1.1. Ablation Studies on the Development Set
The experiments on the SpaCE24_dev set were designed to determine the optimal prompting strategy. The key takeaways are derived by comparing the different configurations.
The following are the results from Table 2 of the original paper:
| DataSet | ICL | CoT | Train | Total | Entity Recognition | Entity Disambiguation | Anomaly Detection | Orientation Reasoning | Heteronym Synonym Recognition |
|---|---|---|---|---|---|---|---|---|---|
| SpaCE24_dev | 0-shot | w/o | w/o | 0.5570 | 0.8560 | 0.9526 | 0.8750 | 0.3170 | 0.5454 |
| SpaCE24_dev | 5-shot | w/o | w/o | 0.4330 | 0.6360 | 0.9316 | 0.4250 | 0.2119 | 0.5091 |
| SpaCE24_dev | 5-shot | w/o | w/o | 0.4240 | 0.6720 | 0.9263 | 0.5750 | 0.1719 | 0.5455 |
| SpaCE24_dev | 5-shot | w/o | w/o | 0.5686 | 0.8440 | 0.9737 | 0.9250 | 0.3304 | 0.5818 |
(Note: The original table has some inconsistencies. The second and third rows are identical except for the "Total" and some sub-task accuracies, which might be a typo. The fourth row appears to be missing a value in the CoT column. Assuming the progression is logical, the fourth row should have "w/" CoT, which explains the large performance jump. The analysis proceeds with this assumption.)
- Impact of ICL (0-shot vs. 5-shot): Comparing the 0-shot baseline (Row 1) to the 5-shot configurations reveals that simply adding examples is not always beneficial without the right strategy. The naive 5-shot approaches (Rows 2 & 3) perform worse than 0-shot. However, the final optimized 5-shot prompt (Row 4) significantly outperforms 0-shot (0.5686 vs. 0.5570). This suggests that the quality and format of the examples are critical.
- Impact of CoT: The most dramatic improvement comes from incorporating Chain of Thought. Comparing Row 4 (assumed to use CoT, as noted above) with Row 3 (w/o CoT), the overall accuracy jumps from 0.4240 to 0.5686. This highlights that enabling the model to perform step-by-step reasoning is crucial, especially for the more complex tasks like anomaly detection (0.5750 -> 0.9250) and orientation reasoning (0.1719 -> 0.3304).
- Impact of Using Training Data for Prompts (Train column): The Train column indicates whether the few-shot examples were selected from the training set. The best result (Row 4) uses training data, confirming that providing relevant, in-distribution examples is key to effective In-Context Learning.
6.1.2. Final Results on the Test Set
The final results demonstrate the system's top-ranking performance in the competition. The team submitted multiple systems, with the best one (TeleAI_test_6) using an ensemble (voting) strategy.
The following are the results from Table 3 of the original paper:
| DataSet | Submission | Vote | Total | Entity Recognition | Entity Disambiguation | Anomaly Detection | Orientation Reasoning | Heteronym Synonym Recognition |
|---|---|---|---|---|---|---|---|---|
| Baseline | — | w/o | 0.4792 | 0.7509 | 0.8818 | 0.6860 | 0.2196 | 0.4200 |
| SpaCE24_test | TeleAI_test_1 | w/o | 0.5991 | 0.8895 | 0.9312 | 0.8440 | 0.3471 | 0.5538 |
| SpaCE24_test | TeleAI_test_2 | w/ | 0.5958 | 0.8895 | 0.9273 | 0.8480 | 0.3373 | 0.5631 |
| SpaCE24_test | TeleAI_test_3 | w/o | 0.5898 | 0.8912 | 0.9364 | 0.8360 | 0.3265 | 0.5523 |
| SpaCE24_test | TeleAI_test_4 | w/o | 0.5885 | 0.8947 | 0.9260 | 0.8440 | 0.3255 | 0.5492 |
| SpaCE24_test | TeleAI_test_5 | w/ | 0.5958 | 0.8895 | 0.9273 | 0.8480 | 0.3373 | 0.5631 |
| SpaCE24_test | TeleAI_test_6 | w/ | **0.6024** | **0.8947** | **0.9364** | **0.8480** | **0.3471** | **0.5631** |
- Superiority over Baseline: The best system (TeleAI_test_6) achieves an overall accuracy of 0.6024, which is a massive 25.7% relative improvement over the official baseline's score of 0.4792. This clearly demonstrates the effectiveness of the LLM-based prompting approach.
- Task-Specific Performance: The system excels at tasks that rely heavily on semantic knowledge and context understanding. It achieves very high accuracy on Entity Disambiguation (0.9364), Entity Recognition (0.8947), and Anomaly Detection (0.8480).
- The Reasoning Challenge: The system's performance drops significantly on the Spatial Orientation Reasoning task, with an accuracy of only 0.3471. While this is still a large improvement over the baseline's 0.2196, it is the lowest score among all tasks and indicates that complex, multi-step spatial reasoning remains a major challenge even for powerful LLMs.
- Effect of Voting: Comparing the best non-voted submission (TeleAI_test_1 at 0.5991) with the best voted one (TeleAI_test_6 at 0.6024) shows that ensembling provides a small boost to the final score, making the result more robust.
7. Conclusion & Reflections
7.1. Conclusion Summary
This technical report successfully documents a first-place system for the SpaCE2024 Chinese Spatial Semantic Understanding evaluation. The authors demonstrate that by combining a powerful large language model (Qwen1.5-72B-Chat) with advanced prompting techniques—namely 5-shot In-Context Learning and Chain of Thought reasoning—it is possible to build a single, unified system that achieves state-of-the-art performance across five diverse spatial language tasks. The final system, enhanced with an ensemble voting mechanism, achieved an overall accuracy of 0.6024, significantly outperforming the official baseline and securing the top rank in the competition.
7.2. Limitations & Future Work
While the paper is a system report and does not explicitly detail its limitations, several can be inferred from the results and methodology:
- Weakness in Complex Reasoning: The most significant limitation is the model's low accuracy (0.3471) on the spatial orientation reasoning task. This suggests that while LLMs excel at pattern recognition and knowledge retrieval, their ability to perform robust, multi-step logical and spatial reasoning is still underdeveloped and a critical area for future research.
- Dependence on Prompt Engineering: The success of the system hinges on "meticulously designed prompts." This process of finding the optimal prompt is often more of an art than a science, requiring significant manual effort and trial-and-error. The system's performance may be brittle and sensitive to small changes in the prompt wording.
- Computational Cost: The use of a 72-billion-parameter model implies substantial computational resources for both inference and development, which could be a barrier to reproducibility and practical application in resource-constrained environments.
- Lack of Detailed Prompts: The paper describes the methods used but does not provide the exact prompts for each of the five tasks. Including these would greatly benefit the research community by making the work more transparent and reproducible.
Future work could focus on developing more systematic methods for prompt optimization and, more importantly, exploring hybrid approaches that integrate symbolic reasoning engines with LLMs to better tackle the challenging spatial reasoning tasks.
7.3. Personal Insights & Critique
This paper is an excellent case study of the current "prompting" paradigm in NLP. It effectively shows how to leverage the immense power of foundation models for specialized domains without costly, full-scale fine-tuning.
- Key Insight: The stark performance difference between tasks like entity disambiguation (>93%) and orientation reasoning (<35%) is highly illuminating. It provides clear evidence that LLMs' capabilities are not uniform across all types of linguistic challenges. They have largely solved problems related to semantic pattern matching but are still in their infancy when it comes to formal, abstract reasoning.
- Practical Value: The validation of Chain of Thought prompting is a crucial takeaway. The ablation study shows that simply instructing a model to "think step-by-step" can unlock significant performance gains on complex tasks. This is a practical and powerful technique for any NLP practitioner working with LLMs.
- Critique: As a technical report, the paper is concise and focused on results. However, it could be strengthened by a more in-depth qualitative analysis. For instance, an error analysis for the low-performing reasoning task could provide valuable insights into why the model fails (e.g., does it misunderstand directions, fail at transitive logic, or get confused by complex sentence structures?). This deeper analysis would elevate the paper from a report of "what worked" to a more scientific exploration of the underlying model capabilities and limitations.