CCL24-Eval 任务3系统报告:基于大型语言模型的中文空间语义评测 (CCL24-Eval Task 3 System Report: Evaluation of Chinese Spatial Semantics Based on Large Language Models)
TL;DR Summary
This study evaluates large language models' spatial semantic understanding through entity recognition, role recognition, anomaly recognition, information inference, and synonym recognition tasks. Three prompt strategies were tested; ERNIE-4 performed best with a 1-shot general prompt, and the team's method ranked sixth with an overall accuracy of 56.20%.
Abstract
This study tasks large language models with entity recognition, role recognition, anomaly recognition, information inference, and synonym recognition, in order to comprehensively evaluate their spatial semantic understanding. We use three prompting strategies, general prompts, workflow prompts, and chain-of-thought prompts, to probe this ability, and find that ERNIE-4 performs best with a 1-shot general prompt. Our method ultimately ranked sixth, with an overall accuracy score of 56.20%.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
CCL24-Eval 任务3系统报告:基于大型语言模型的中文空间语义评测 (CCL24-Eval Task 3 System Report: Evaluation of Chinese Spatial Semantics Based on Large Language Models)
1.2. Authors
Huo Shitu (霍世图), Wang Yujun (王钰君), Wu Tongjie (吴童杰). Affiliation: School of International Chinese Education, Beijing Normal University, Beijing 100875, China.
1.3. Journal/Conference
This is a system report for CCL24-Eval Task 3, which is part of the Chinese Computational Linguistics (CCL) conference series. CCL is a prominent academic conference in China focusing on computational linguistics and natural language processing. System reports for evaluation tasks like CCL24-Eval typically present participants' methods and results in a shared task setting, contributing to the understanding of current capabilities and challenges in a specific NLP area.
1.4. Publication Year
The paper implicitly refers to "CCL24-Eval" and "SpaCE2024", indicating the research was conducted for an evaluation event in 2024.
1.5. Abstract
This research aims to comprehensively evaluate the spatial semantic understanding capabilities of large language models (LLMs) by having them perform entity recognition, role recognition, anomaly detection, information inference, and synonym recognition tasks. The study explores three prompt strategies: general prompts (普通提示词), workflow prompts (工作流提示词), and chain-of-thought (CoT) (思维链), to investigate LLMs' spatial semantic understanding. The findings indicate that ERNIE-4 performs best with 1-shot general prompts. Ultimately, the proposed method achieved a sixth-place ranking in the competition with an overall accuracy score of 56.20%.
1.6. Original Source Link
/files/papers/6917c618110b75dcc59ae0e6/paper.pdf This link points to a PDF file, suggesting it is likely a conference paper or a system report submission. Its publication status is likely within the CCL24-Eval proceedings or a related workshop.
2. Executive Summary
2.1. Background & Motivation
The field of Natural Language Processing (NLP) has witnessed significant advancements due to large language models (LLMs). These models, primarily based on the Transformer architecture and leveraging Scaling Laws, have demonstrated remarkable abilities to capture complex language structures and contextual relationships, enabling them to perform various NLP tasks like machine translation and semantic analysis end-to-end. While LLMs exhibit strong semantic understanding, and even show potential as tools for theoretical linguistics, there's a need to systematically evaluate their nuanced capabilities, especially in complex domains like spatial semantics.
Spatial semantics refers to the understanding of how language expresses physical space, including movement, direction, and location. Traditionally, evaluating spatial semantic understanding in NLP systems has relied heavily on manual annotation, which is costly and difficult to scale. With the rise of generative language models, there's a growing trend to use them for evaluation tasks. This paper addresses two key questions:
- How well do large models understand spatial semantics?
- What are the strengths and weaknesses of large models in specific spatial semantic tasks?
The motivation behind this study is to leverage the capabilities of LLMs for comprehensive spatial semantic evaluation, moving beyond traditional methods and exploring the emergent abilities of these models in understanding and reasoning about space. This research, based on the Fourth Chinese Spatial Semantic Understanding Evaluation Task (SpaCE2024), aims to push the boundaries of LLM understanding in a challenging domain.
2.2. Main Contributions / Findings
The paper makes several key contributions and presents important findings:
- Comprehensive Evaluation Framework: The study designed an evaluation framework involving five distinct spatial semantic tasks: entity recognition (实体识别), role recognition (角色识别), anomaly detection (异常识别), information inference (信息推理), and synonym recognition (同义识别). These tasks, combined with different question types (single-choice and multiple-choice), provide a multi-faceted assessment of LLMs' spatial understanding.
- Exploration of Prompt Engineering Strategies: The research rigorously investigated three prompt engineering strategies (general prompts, workflow prompts, and chain-of-thought (CoT) prompts) with varying shot settings (0-shot, 1-shot, 3-shot). This exploration provides insights into how different prompting techniques influence LLM performance on spatial semantic tasks.
- Identification of Best-Performing Model and Strategy: The study found that ERNIE-4 achieved the best performance on the validation set using a 1-shot general prompt, scoring 53.88%. This strategy also yielded the team's highest accuracy on the test set, reaching 56.20%. GLM-4 with a 1-shot workflow prompt was the second-best performer.
- Insights into LLM Spatial Understanding:
  - Base Model Capability is Crucial: Models like ERNIE-4 and GLM-4, with strong underlying Chinese semantic understanding capabilities, generally performed well.
  - Impact of Prompt Quantity: Adding a single example (1-shot) significantly improved performance compared to 0-shot, but increasing to 3-shot did not consistently lead to further improvements, sometimes even causing a decrease in accuracy.
  - Prompt Strategy Complexity: Simpler general prompts sometimes outperformed more complex strategies like workflow prompts and CoT in this specific evaluation, suggesting that complexity does not always equate to better performance for spatial semantics; CoT did not show a significant advantage.
- Detailed Task-Specific Performance Analysis: The paper provides a granular breakdown of model performance across different spatial semantic tasks, highlighting that role recognition generally had the best performance, while spatial inference was the most challenging. It also noted that models struggled with abstract spatial relations, unmentioned entities, and scenarios requiring external encyclopedic knowledge.
- Competitive Ranking: The team's method achieved the 6th rank in the CCL24-Eval Task 3 closed track, demonstrating a competitive approach within the evaluation landscape.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of several concepts in natural language processing and linguistics is essential:
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data. They are designed to understand, generate, and process human language. Their "largeness" refers to their immense number of parameters (tens of billions to trillions), which allows them to capture complex patterns in language.
- Transformer Architecture: The Transformer is a neural network architecture introduced by Vaswani et al. (2017) that revolutionized NLP. Unlike previous recurrent neural networks (RNNs) or convolutional neural networks (CNNs), the Transformer relies entirely on attention mechanisms (specifically self-attention) to draw global dependencies between input and output. This allows for parallel processing of words in a sequence, making training much faster and enabling models to handle longer contexts.
- Attention Mechanism: A core component of the Transformer, the attention mechanism allows the model to weigh the importance of different parts of the input sequence when processing a specific word. For example, when translating a sentence, the model can "attend" to relevant words in the source sentence while generating each word in the target sentence. In self-attention, the attention mechanism is applied to the input sequence itself, allowing the model to learn relationships between different words within the same sentence. The Attention function can be described as mapping a query and a set of key-value pairs to an output, where the output is a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. The most common form is Scaled Dot-Product Attention (a runnable sketch follows this concept list): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices representing the input embeddings; in self-attention, all three are derived from the same input sequence.
  - $QK^T$ calculates the dot products between the queries and keys, indicating their similarity.
  - $\sqrt{d_k}$ is a scaling factor (where $d_k$ is the dimension of the keys) that prevents the dot products from becoming too large, which can push the softmax function into regions with very small gradients.
  - $\mathrm{softmax}$ normalizes the scores, turning them into probabilities.
  - The result is a weighted sum of the values, allowing the model to focus on relevant information.
- Scaling Laws: These are empirical observations (Kaplan et al., 2020) that describe how the performance of neural language models improves predictably as model size (number of parameters), dataset size, and computational budget increase. These laws have guided the development of increasingly larger and more capable LLMs.
- Prompt Engineering: This refers to the art and science of crafting effective inputs (prompts) for LLMs to guide their behavior and elicit desired outputs. It involves designing instructions, examples, or questions that steer the model towards performing specific tasks correctly.
- 0-shot, 1-shot, 3-shot Prompting: These terms describe the number of examples provided in the prompt to guide the LLM:
  - 0-shot: No examples are given; the model relies solely on its pre-trained knowledge to understand the task from the instructions.
  - 1-shot: One example of an input-output pair is provided in the prompt, demonstrating the desired format and task.
  - 3-shot: Three examples of input-output pairs are provided, offering more demonstrations to the model. Generally, few-shot prompting (1-shot, 3-shot, etc.) helps models learn the task in context without explicit fine-tuning.
- Chain-of-Thought (CoT) Prompting: A specific prompt engineering technique (Wei et al., 2022) where intermediate reasoning steps are included in the prompt examples. Instead of just showing input-output pairs, CoT prompts demonstrate how to arrive at the answer step by step, encouraging the LLM to generate its own reasoning process, which can significantly improve performance on complex reasoning tasks.
- Spatial Semantics: In linguistics and cognitive science, spatial semantics is the study of how languages encode and express information about space, including location, direction, movement, and the relationships between objects in space. This involves understanding spatial prepositions (e.g., "on," "under"), verbs of motion (e.g., "go," "come"), and spatial nouns (e.g., "top," "bottom").
- Natural Language Processing (NLP): A subfield of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language.
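As a concrete illustration of the formula in the Attention Mechanism entry above, here is a minimal NumPy sketch of scaled dot-product attention; the function and the random toy inputs are ours, not from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D query/key/value matrices."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # weighted sum of the values

# Toy self-attention: 4 tokens with 8-dimensional embeddings, Q = K = V.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```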
3.2. Previous Works
The paper contextualizes its research by reviewing significant studies in spatial semantics from both linguistic and computational perspectives.
3.2.1. Linguistic Perspectives on Spatial Semantics
Cognitive linguistics views spatial relations as fundamental to human perception and conceptualization.
- Foundational Theories:
- Cognitive Linguistics: Based on human experience and perception of the world (Zhang Min, 1998), abstract concepts are often concretized through metaphors. Spatial relations are among the earliest acquired human abilities (Akhundov, 1986; Clark, 1973; Zhang Min, 1998; Zhao Yanfang, 2001).
- Image Schema Metaphor: Lakoff and Turner (1989) posited that spatial metaphors project concrete spatial concepts onto abstract linguistic structures, conveying spatial relations and their inherent logic, forming the cognitive basis of spatial semantics.
- Space Grammar: Langacker (1982) emphasized the role of spatial semantics in language formation.
- Thematic Relations Hypothesis (TRH): Jackendoff (1983) suggested that events and states in human language's conceptual structure are organized through spatial conceptualization.
- Spatialization of Form Hypothesis (SFH): Lakoff (1987) similarly discussed the formation of basic sentence patterns from a spatial semantic viewpoint.
- Image Schemas: Johnson (1987) identified 27 crucial image schemas related to spatial semantics, fundamental for human spatial reasoning.
- Location as a Foundation: Putz and Dirven (1996) highlighted the foundational role of location in concept formation.
- Key Concepts: Talmy's (1983) Figure-Ground theory and Langacker's (1987) Trajector-Landmark relations are key tools.
- Micro-level Studies (International):
- Hawkins (1984) comprehensively studied English prepositions based on spatial cognition.
- Vandeloise (1994) systematically examined French spatial prepositions.
- Herskovits (1986) conducted an interdisciplinary survey of English spatial expressions.
- Svorou (1993) compared spatial prepositions across languages from a cognitive universality perspective.
- Micro-level Studies (Chinese):
- Liao Qiuzhong (1986) introduced the concept of reference points for directional words (方位词), moving beyond static analysis.
- Liu Ningsheng (1994) discussed how Chinese selects reference objects, target objects, and directional words to express spatial relations.
- Qi Huyang (1998) established a theoretical framework for the modern Chinese spatial system.
- Focus on 上 (shàng, up/on) and 下 (xià, down/under): Cui Xiliang (2000) analyzed "在X上" (zài X shàng, on X) and its psychological extensions. Lan Chun (2003) compared Chinese 上/下 with English up/down. Bai Lifang (2006) noted asymmetry between 上 and 下 in Chinese, with 上 being more fundamental. Xu Dan (2008) examined spatiotemporal expressions in Chinese, highlighting the use of 上/下 for time as unique to Chinese cognition.
- Dynamic Research on Motion Verbs: Lu Jianming (2002) defined displacement verbs (位移动词) as verbs "containing semantic features of displacement towards or away from the speaker," emphasizing their strong spatial semantic aspect. Subsequent work integrated this with semantic feature analysis or construction grammar (Zhang Guoxian, 2006; Yong Qian, 2013; Zeng Chuanlu, 2014).
- Recent Research: Focuses on dialectal and cross-linguistic comparisons, and the cognitive/psychological reality of spatial semantics (Jia Hongxia, 2009; Yin Weibin, 2014; Li Yunbing, 2016, 2020; Zhu Keyi, 2018).
3.2.2. Natural Language Processing (NLP) Research on Spatial Semantic Evaluation
NLP's focus is on extracting and understanding physical spatial information from natural language.
- Early Stages (Pre-Deep Learning):
- Phase One: Focused on defining hierarchical layers and relationships in spatial semantic networks (Tappan, 2004), establishing foundations for spatial information processing.
- Phase Two: Centered on specific tasks like spatial entity recognition (Kordjamshidi et al., 2011) and spatial relation determination, using semi-supervised or unsupervised machine learning on specific datasets.
- Limitations: These early methods often used non-linguistic formal approaches, failing to fully account for human expression of spatial relations in natural language, particularly abstract spatial concepts (Stock, 1998; Renz and Nebel, 2007; Bateman et al., 2007). They struggled with the inherent uncertainty and ambiguity in natural language spatial fragments.
- Addressing Ambiguity and Developing Standard Tasks:
- Spatial Role Labelling (SpRL): Kordjamshidi et al. (2011) introduced SpRL to address ambiguity in spatial prepositions. Roberts (2012) used a joint approach combining CRF models for feature extraction, maximum entropy/Naive Bayes classifiers for preposition disambiguation, and SemEval-2007 data to learn preposition meanings, achieving binary classification of spatial/non-spatial preposition meanings (Roberts and Harabagiu, 2012). This method considered all elements of spatial relations (e.g., trajector, landmark, indicator).
- SpaceEval 2013: Kolomiyets et al. (2013) extended SpRL by adding MoveLink and motion tags to annotate motion verbs/nominal events and their categories from a spatial semantic perspective.
- SpaceEval 2015: Pustejovsky et al. (2015) further expanded the framework by defining subtasks for spatial element identification/classification, motion signal recognition, and motion relation recognition, providing a comprehensive evaluation framework.
- Chinese Spatial Semantic Evaluation (SpaCE):
- The SpaCE evaluations (Zhan Weidong et al., 2022; Yue Pengxue et al., 2023) drew inspiration from these international efforts, building high-quality Chinese datasets.
- Task Setting: SpaCE goes beyond simple identification and classification, requiring orientation inference and heteronymous synonym identification, reflecting higher demands for spatial semantic understanding, especially as LLMs develop world knowledge and emergent spatial reasoning abilities.
3.3. Technological Evolution
The field has evolved from rule-based and statistical methods to machine learning, then deep learning, and now large language models.
- Early NLP: Focused on explicit rules, handcrafted features, and statistical models for spatial relation extraction.
- Deep Learning Era: Introduced neural networks (CNNs, RNNs) for feature learning and sequence modeling, improving performance on tasks like entity recognition and relation extraction.
- Transformer and LLMs: The Transformer architecture (Vaswani et al., 2017) and subsequent LLMs (BERT, the GPT series, ERNIE, GLM, Qwen, Deepseek) marked a paradigm shift. Their ability to learn complex contextual representations from massive text corpora has led to emergent capabilities in reasoning and understanding that were previously thought to be exclusive to symbolic AI or human cognition. This paper directly benefits from and evaluates these advanced LLMs.
3.4. Differentiation Analysis
Compared to previous work, this paper's core differentiation lies in its explicit focus on evaluating the spatial semantic understanding capabilities of large language models.
- Shift from Feature Engineering to Prompt Engineering: Earlier NLP spatial semantics research often involved extensive feature engineering or specialized neural network architectures. This paper, however, primarily uses prompt engineering to adapt pre-trained LLMs to the spatial semantic tasks, exploring how different prompting strategies elicit spatial reasoning.
- Broader Range of Tasks: While previous tasks like SpRL focused on identifying spatial roles, SpaCE2024 (and by extension this paper's evaluation) includes more complex reasoning tasks such as anomaly detection, information inference, and heteronymous synonym identification, which probe deeper into cognitive aspects of spatial understanding.
- Leveraging Emergent Abilities: Instead of building models specifically for spatial semantics, this research tests the inherent or emergent spatial reasoning abilities of general-purpose LLMs, which have learned spatial concepts implicitly from their vast training data. This contrasts with traditional methods that often built specialized knowledge bases or models for spatial reasoning.
4. Methodology
4.1. Principles
The core principle of this study is to leverage the advanced natural language understanding and generation capabilities of large language models (LLMs) to perform various spatial semantic tasks. The authors hypothesize that by carefully designing prompt engineering strategies, LLMs can be guided to understand and reason about complex spatial relationships embedded in natural language text. The methodology aims to systematically evaluate the models' performance across different task types and with varying levels of in-context learning (0-shot, 1-shot, 3-shot) to identify their strengths, weaknesses, and the most effective prompting approaches.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology primarily revolves around selecting representative LLMs, designing structured prompts, and systematically evaluating their outputs.
4.2.1. Model Selection
The study selected six representative large language models from different developers, covering various architectures and scales. These models represent the state-of-the-art or recent advancements at the time of the research (2024). All models are accessed via API.
The following are the results from Table 2 of the original paper:
| 模型 (Model) | 版本日期 (Version Date) | 开发者 (Developer) | 模型大小 (Model Size) | 上下文 (Context Length) | 词表大小 (Vocabulary Size) | 是否开源 (Open Source) | 调用方式 (Calling Method) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 Turbo | 04-09 | OpenAI | 未披露 (Undisclosed) | 12.8万 (128K) | 10万 (100K) | 否 (No) | API |
| GPT-4o | 05-13 | OpenAI | 未披露 (Undisclosed) | 12.8万 (128K) | 20万 (200K) | 否 (No) | API |
| GLM-4 | 未披露 (Undisclosed) | 智谱华章 (Zhipu AI) | 未披露 (Undisclosed) | 12.8万 (128K) | 未披露 (Undisclosed) | 否 (No) | API |
| ERNIE-4 | 03-29 | 百度 (Baidu) | 未披露 (Undisclosed) | 8千 (8K) | 未披露 (Undisclosed) | 否 (No) | API |
| Qwen1.5-72B-chat | 未披露 (Undisclosed) | 阿里巴巴 (Alibaba) | 720亿 (72B) | 3.2万 (32K) | 15万 (150K) | 是 (Yes) | API |
| Deepseek-V2-chat | 未披露 (Undisclosed) | 深度求索 (Deepseek AI) | 2360亿 (236B) | 3.2万 (32K) | 10万 (100K) | 是 (Yes) | API |
4.2.2. Prompt Engineering
All prompts were designed using a structured Markdown format, comprising two main parts: prompt strategy and prompt sample construction.
4.2.2.1. Prompt Strategies
Three distinct prompt strategies were employed to investigate the LLMs' spatial semantic understanding:
- General Prompt (普通提示词): This is the simplest form of prompt, providing direct instructions without specific intermediate steps or roles. It asks the model to directly provide the answer.
- Workflow Prompt (工作流提示词): This strategy assigns a role to the LLM (e.g., "an expert in spatial information entity recognition") and outlines a step-by-step workflow for the model to follow before providing the answer. This aims to guide the model's internal processing more explicitly.
- Chain of Thought (CoT) Prompt (思维链提示词): Building on the CoT technique of Wei et al. (2022), this strategy explicitly asks the model to output its thought process (Thought) before giving the final answer (Answer). This encourages the model to break down complex problems into manageable steps, making its reasoning explicit. The structure was modified to include "想法" (Thought) and "答案" (Answer) for structured output.
4.2.2.2. Prompt Sample Construction
The construction of prompt samples varied based on the shot setting:
- 0-shot: No examples were provided in the prompt; the model relied solely on the instructions. This was used for general and workflow prompts.
- 1-shot: One example of a question-answer pair was included in the prompt to demonstrate the desired output format and task. This was used for general, workflow, and CoT prompts.
- 3-shot: Three examples of question-answer pairs were included. This was used for general and workflow prompts.
Sample Selection Process: For few-shot prompting (1-shot and 3-shot), the training samples were selected carefully:
- Each data entry in the training set consists of a text (C), a question (Q), four options (O), and an answer (A); each entry was organized into a sample (C, Q, O, A).
- Sentence-BERT (Reimers and Gurevych, 2019) was used to convert these samples into vectors, capturing their semantic meaning.
- For each task category (e.g., entity recognition), the average vector of all samples within that category was calculated, serving as the cluster centroid.
- Finally, by calculating the semantic similarity between each sample vector and the cluster centroid, the 1 and 3 samples closest to the centroid were selected as the 1-shot and 3-shot training data, respectively. This method aims to select representative examples for in-context learning (see the sketch after this list).
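A minimal sketch of this centroid-based selection, assuming the sentence-transformers library; the checkpoint name, the sample fields, and the function name are illustrative assumptions, since the paper does not specify them.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def select_shots(samples: list[dict], k: int = 3) -> list[dict]:
    """Return the k samples closest to their category's centroid in
    Sentence-BERT embedding space (k=1 for 1-shot, k=3 for 3-shot)."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder checkpoint
    # Each sample holds a text C, question Q, options O, and answer A.
    texts = [f"{s['C']} {s['Q']} {s['O']} {s['A']}" for s in samples]
    vectors = model.encode(texts, normalize_embeddings=True)
    centroid = vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)  # unit-normalize the centroid
    similarities = vectors @ centroid     # cosine similarity per sample
    best = np.argsort(-similarities)[:k]  # indices of the k most central samples
    return [samples[i] for i in best]
```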
Special Considerations for CoT and Spatial Heteronymous Synonym Identification:
- For CoT prompts, the sample examples require a thought process. These thought processes were generated using GPT-4 for consistency and quality.
- The spatial heteronymous synonym identification task had very few training samples (only one multiple-choice question). To ensure 3-shot construction was possible, two single-choice questions from this task were manually adapted into multiple-choice questions.
Prompt Examples (Translated from Original Paper):
General Prompt Example
# Goal: From the four options, select the spatial information reference object in the text. Note: Only answer with one key from the option, no need to answer the value, no need to explain.
**Text:** <text>
**Question:** <question>
**Option:** <option>
**Answer:**
Workflow Prompt Example
# Role: You are an expert skilled in spatial information entity recognition.
# Goal: From the four options, select the spatial information reference object in the text. Note: Only answer with one key from the option, no need to answer the value, no need to explain.
# Workflow:
1. Read the text: Carefully read the provided text, paying special attention to spatial information descriptions.
2. Analyze options: Review all options and identify which ones might be spatial reference objects in the text.
3. Select the correct option: Compare the text with the options and choose the most matching spatial information reference object.
**Text:** <text>
**Question:** <question>
**Option:** <option>
**Answer:**
Chain of Thought Prompt Example
# Goal: From the four options, select the spatial information reference object in the text. Note: Only answer with one key from the option, no need to answer the value; write out Thought and Answer.
**Text:** <text>
**Question:** <question>
**Option:** <option>
**Thought:** <thought>
**Answer:**
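To make the templates concrete, here is a small sketch of how such a prompt could be filled programmatically; the constant and function names are ours, the template text is abridged, and the example option list is invented for illustration.

```python
# Hypothetical helper for filling the general prompt template shown above.
GENERAL_TEMPLATE = (
    "# Goal: From the four options, select the spatial information "
    "reference object in the text. Note: Only answer with one key from "
    "the option, no need to answer the value, no need to explain.\n"
    "**Text:** {text}\n"
    "**Question:** {question}\n"
    "**Option:** {option}\n"
    "**Answer:**"
)

def build_prompt(text: str, question: str, option: str) -> str:
    return GENERAL_TEMPLATE.format(text=text, question=question, option=option)

prompt = build_prompt(
    text="这个玩具手机像真的一样,里面装上一节五号电池……",
    question="__里面装上一节五号电池",
    option="A. 玩具手机 B. 口袋 C. 西装 D. 三轮车",  # illustrative options
)
```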
4.2.3. Experiment Settings
4.2.3.1. Answer Extraction
Different answer extraction methods were used based on the prompt strategy:
- General and Workflow Prompts: The prompts requested models to output options directly, separated by English commas (e.g., "A,C"). The system first converted the output into a list and then iterated through each element to extract the first character (A, B, C, or D). This accounts for cases where models might output additional text after the answer.
- Chain of Thought Prompts: These prompts required the model to output its thought process first, followed by the answer. A regular expression was used to extract the answer: `\*\*Answer:\*\*\n(.+?)(\n\n|$)`. This regex targets the content between the **Answer:** tag and the following blank line or end of output. (A sketch of both extraction paths follows this list.)
- Manual Verification: Due to variations in instruction-following capability across different models, an automatic check was implemented to ensure answers were one of A, B, C, or D. If not, manual inspection was performed.
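A minimal sketch of both extraction paths described above; the function names are ours, and the regular expression is the reconstructed pattern quoted in the list.

```python
import re

def extract_choices(output: str) -> list[str]:
    """General/workflow prompts: split on commas and keep the first character
    of each item, so 'A, C' or 'A,C (because ...)' both yield ['A', 'C']."""
    return [part.strip()[0] for part in output.split(",") if part.strip()]

def extract_cot_answer(output: str) -> str | None:
    """CoT prompts: capture what follows the '**Answer:**' tag, up to the
    next blank line or the end of the output."""
    match = re.search(r"\*\*Answer:\*\*\n(.+?)(\n\n|$)", output, re.S)
    return match.group(1).strip() if match else None

print(extract_choices("A, C"))                                  # ['A', 'C']
print(extract_cot_answer("**Thought:** ...\n**Answer:**\nB"))   # 'B'
```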
4.2.3.2. Model Output Configuration
The temperature parameter for all LLM calls was set to 0.1. Temperature controls the randomness of the model's output; a lower temperature (like 0.1) makes the output more deterministic and focused, which is desirable for evaluation tasks requiring precise answers.
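For illustration, a hypothetical call using the OpenAI Python client shows where this decoding setting sits; the client, model name, and prompt are placeholders, since each of the six vendors exposes its own API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "# Goal: ...\n**Text:** ...\n**Question:** ...\n**Option:** ...\n**Answer:**"
response = client.chat.completions.create(
    model="gpt-4-turbo",  # one of the six models; the others use their own endpoints
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,      # low temperature keeps the output near-deterministic
)
print(response.choices[0].message.content)
```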
4.2.3.3. Evaluation Metric
The evaluation metric used was Accuracy. Accuracy measures the proportion of correctly answered questions out of the total number of questions. A correct answer was scored as 1 point, and all other cases (e.g., model stating no options fit, model refusing to answer, or partially correct answers in multiple-choice questions) were scored as 0 points.
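A minimal sketch of this scoring rule, with our own function names; answers are modeled as sets of option letters so that partially correct multiple-choice responses score 0, matching the rule above.

```python
def score(predicted: set[str], gold: set[str]) -> int:
    """Exact match earns 1 point; anything else (refusals, invalid output,
    partially correct multiple-choice answers) earns 0."""
    return 1 if predicted == gold else 0

def accuracy(predictions: list[set[str]], golds: list[set[str]]) -> float:
    return sum(score(p, g) for p, g in zip(predictions, golds)) / len(golds)

# A partially correct multiple-choice answer scores 0:
print(accuracy([{"A"}, {"B", "C"}], [{"A"}, {"B", "C", "D"}]))  # 0.5
```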
5. Experimental Setup
5.1. Datasets
The dataset used for this research is from the Fourth Chinese Spatial Semantic Understanding Evaluation Task (SpaCE2024). It is designed to evaluate LLMs' spatial semantic understanding across various dimensions and contexts.
Dataset Characteristics:
- Task Categories: The dataset comprises five major task categories, further divided into nine sub-categories. These tasks assess entity recognition, role recognition, anomaly detection, spatial orientation information inference, and spatial heteronymous synonym identification.
- Question Format: All questions are in a multiple-choice format, with four options provided for each question. Both single-choice and multiple-choice questions are included.
- Domains: The dataset covers a wide range of domains, including general areas like newspapers, literary works, and primary/secondary school textbooks, as well as specialized fields such as traffic accidents, sports actions, and geographical encyclopedias. This diversity ensures a comprehensive evaluation of the models' ability to handle spatial semantics in different contexts.
- Data Distribution: As shown in Table 1, spatial orientation information inference has the largest number of questions, while spatial heteronymous synonym identification has the fewest. Single-choice questions predominate. The uneven distribution and varied complexity of question types pose a significant challenge to the evaluation.

The following are the results from Table 1 of the original paper:
| 序号 (No.) | 任务类别 (Task Category) | 任务要求 (Task Requirement) | 题型 (Question Type) | 训练集 (Training Set) | 验证集 (Validation Set) | 测试集 (Test Set) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 空间信息实体识别 (Spatial Information Entity Recognition) | 选出文本空间信息的参照物 (Select the reference object of spatial information in the text) | 单选题 (Single-choice) | 937 | 226 | 489 |
| | | | 多选题 (Multiple-choice) | 161 | 24 | 81 |
| 2 | 空间信息角色识别 (Spatial Information Role Recognition) | 选出文本空间信息的语义角色,或者选出与语义角色相对应的空间表达形式 (Select the semantic role of spatial information in the text, or the spatial expression corresponding to the semantic role) | 单选题 (Single-choice) | 1074 | 186 | 746 |
| | | | 多选题 (Multiple-choice) | 19 | 4 | 24 |
| 3 | 空间信息异常识别 (Spatial Information Anomaly Recognition) | 从四个选项中选出文本空间信息异常的语言表达 (Select the abnormal linguistic expression of spatial information from the four options) | 单选题 (Single-choice) | 1077 | 40 | 500 |
| 4 | 空间方位信息推理 (Spatial Orientation Information Inference) | 基于文本给出的推理条件进行空间方位推理,从四个选项中选出推理结果 (Perform spatial orientation inference based on the conditions given in the text, and select the inference result from the four options) | 单选题 (Single-choice) | 909 | 468 | 1509 |
| | | | 多选题 (Multiple-choice) | 301 | 207 | 531 |
| 5 | 空间异形同义识别 (Spatial Heteronymous Synonym Recognition) | 从四个选项中选出能使两个文本异形同义或异义的空间义词语 (Select the spatial semantic word from the four options that makes the two texts heteronymous synonyms, or gives them different meanings) | 单选题 (Single-choice) | 4 | 44 | 517 |
| | | | 多选题 (Multiple-choice) | 1 | 11 | 133 |
| 总计 (Total) | | | | 4483 | 1210 | 4530 |
Example Data Samples (from the paper's analysis sections):
- Spatial Information Entity Recognition (实体识别):
  - Question 1: "周游口袋里只有五元钱....所以蹬三轮车的上来拉生意时,他理都不理他们,而是从西装口袋里掏出个玩具手机,这个玩具手机像真的一样,里面装上一节五号电池,悄悄按上一个键,手机的铃声就会响起来。 (题目:__里面装上一节五号电池)" (Translation: "Zhou You only had five yuan in his pocket.... So when pedicab drivers came to solicit business, he ignored them, and instead took out a toy phone from his suit pocket. This toy phone looked real, with a AA battery inside, and quietly pressing a button would make the phone ring. (Question: __ has a AA battery inside)")
  - Question 4: "回家以后,她给丈夫算了一笔账:我每天上下班路程要花3个小时,工作8小时,中午吃饭1小时,总共在外边花12小时..... (题目:总共在___外边花12小时)" (Translation: "After returning home, she did the math for her husband: I spend 3 hours commuting every day, 8 hours working, and 1 hour eating lunch, totaling 12 hours spent outside..... (Question: Total 12 hours spent outside __)")
- Spatial Information Role Recognition (角色识别):
  - Question 251: "时间过去近两个月,木沙江·努尔墩仍清楚地记得.....在人工湖边的冰窟中,拉齐尼用一只手臂搂住孩子,另一只手努力托举着孩子.... (题目:__的手努力托举着孩子)" (Translation: "Almost two months had passed, but Musajiang Nurdon still clearly remembered..... In the ice cave by the artificial lake, Laqini held the child with one arm and struggled to lift the child with the other hand... (Question: __'s hand struggled to lift the child)")
- Spatial Information Anomaly Recognition (异常识别):
  - Question 441: "小红在下,我在上,走到四楼的东侧......(题目:异常的空间方位信息是__,要求识别出"小红在下,我在上")" (Translation: "Xiao Hong is below, I am above, walking to the east side of the fourth floor..... (Question: The abnormal spatial orientation information is __; the expected answer is 'Xiao Hong is below, I am above')")
- Spatial Orientation Information Inference (空间方位信息推理):
  - Question 481: "贺知章、李白、陈子昂、骆宾王、王维、孟浩然六个人在海边沙滩上围成一圈坐着,大家都面朝中心的篝火。六人的位置恰好形成一个正六边形。任意相邻两个人之间的距离相等,大约为一米。已知:陈子昂在骆宾王左边数起第1个位置,孟浩然在陈子昂逆时针方向的第5个位置,王维在孟浩然顺时针方向的第1个位置 (题目:孟浩然在___的斜对面)" (Translation: "He Zhizhang, Li Bai, Chen Zi'ang, Luo Binwang, Wang Wei, and Meng Haoran, six people, sat in a circle on the beach, all facing a bonfire in the center. Their positions formed a regular hexagon. The distance between any two adjacent people was equal, about one meter. Known: Chen Zi'ang is in the 1st position to the left of Luo Binwang, Meng Haoran is in the 5th position counter-clockwise from Chen Zi'ang, and Wang Wei is in the 1st position clockwise from Meng Haoran. (Question: Meng Haoran is diagonally opposite ___)")
- Spatial Heteronymous Synonym Recognition (空间异形同义识别):
  - Question 1157: "傍晚的时候,宋钢将他带回去的钱用一张旧报纸仔细包好了,放在了枕头下面.....。(题目:"回去"替换为__形成的新句可以与原句表达相同的空间场景,要求用"回来"替换"回去")" (Translation: "In the evening, Song Gang carefully wrapped the money he had brought back with an old newspaper and placed it under the pillow..... (Question: Replacing '回去' (go back) with __ forms a new sentence that expresses the same spatial scene as the original; the expected answer is '回来' (come back))")
5.2. Evaluation Metrics
The primary evaluation metric used in this study is Accuracy.
5.2.1. Accuracy
- Conceptual Definition: Accuracy is a common metric used to evaluate the overall correctness of a classification model. It quantifies the proportion of predictions that are exactly correct out of the total number of predictions made. In this context, it measures how many spatial semantic questions the model answered correctly compared to all questions, providing a straightforward measure of general performance.
- Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation:
  - Number of Correct Predictions: the count of instances where the model's predicted answer matches the true answer.
  - Total Number of Predictions: the total number of questions or instances for which the model made a prediction.

In this study, a correct answer received 1 point, while any other outcome (e.g., the model indicating no option was suitable, refusing to answer, or giving incorrect or partial answers on multiple-choice questions) received 0 points.
5.3. Baselines
This paper evaluates several state-of-the-art large language models, treating them as individual systems to be tested with different prompt engineering strategies. Therefore, the "baselines" are not traditional smaller models or specific spatial semantic parsers, but rather the performance of these large models themselves under different conditions (0-shot, 1-shot, 3-shot, and different prompt types). The models compared include:

- GPT-4 Turbo (OpenAI)
- GPT-4o (OpenAI)
- GLM-4 (Zhipu AI)
- ERNIE-4 (Baidu)
- Qwen1.5-72B-chat (Alibaba)
- Deepseek-V2-chat (Deepseek AI)

These models are representative of cutting-edge LLM technology from major developers, enabling a comprehensive assessment of current LLM capabilities in Chinese spatial semantics. The comparison across these models and various prompting techniques aims to understand which models and strategies are most effective.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate varying levels of spatial semantic understanding among the tested large language models, heavily influenced by the prompt engineering strategies employed.
The following are the results from Table 3 of the original paper:
| 模型 (Model) | 普通 (General) 0-shot | 普通 1-shot | 普通 3-shot | 工作流 (Workflow) 0-shot | 工作流 1-shot | 工作流 3-shot | 思维链 (CoT) 1-shot |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ERNIE-4 | 50.25 | 53.88 | 52.73 | 52.23 | 52.73 | 52.81 | 51.06 |
| GLM-4 | 51.24 | 52.01 | 52.23 | 50.49 | 53.14 | 50.41 | 50.82 |
| GPT-4o | 48.92 | 51.16 | 52.89 | 48.35 | 50.99 | 51.73 | 50.91 |
| GPT-4 Turbo | 48.18 | 50.99 | 51.54 | 47.43 | 51.49 | 47.77 | 50.74 |
| Deepseek-V2-chat | 48.84 | 49.83 | 49.98 | 46.69 | 49.42 | 49.83 | 46.78 |
| Qwen1.5-72B-chat | 44.71 | 46.61 | 46.45 | 42.81 | 45.70 | 45.04 | 45.45 |
Table 3: Overall Performance of Models on the Validation Set (Accuracy %)
The following are the results from Table 4 of the original paper:
| 模型 (Model) | 样本数量 (Number of Shots) | 提示词 (Prompt Strategy) | 测试集准确率 (Test Set Accuracy) |
| --- | --- | --- | --- |
| ERNIE-4 | 1 | 普通 (General) | 56.20 |
| GLM-4 | 1 | 工作流 (Workflow) | 54.52 |
Table 4: Final Performance of Models on the Test Set (Accuracy %)
Key Findings from Overall Performance:
- Top Performers: ERNIE-4 consistently emerged as a strong performer. Its best score on the validation set was 53.88% with a 1-shot general prompt, which also translated to the highest accuracy on the test set at 56.20%. GLM-4 followed closely, achieving 53.14% on the validation set with a 1-shot workflow prompt, and 54.52% on the test set.
- Importance of Base Model Capability: The results suggest that the underlying base capabilities of the large models play a crucial role. Models like ERNIE-4 and GLM-4, known for their strong Chinese semantic understanding, generally adapted well to the challenging spatial tasks. Qwen1.5-72B-chat, despite being a large model, consistently showed lower performance across all prompt strategies.
- Impact of Prompt Sample Quantity (Shots): Few-shot prompting (1-shot or 3-shot) generally led to a significant improvement over 0-shot performance. For example, ERNIE-4 jumped from 50.25% (0-shot general) to 53.88% (1-shot general). However, increasing the number of shots from 1 to 3 did not always guarantee further improvement; across different models and prompt types, accuracy increased in 7 cases and decreased in 5, indicating a complex relationship between sample quantity and performance.
- Prompt Strategy Complexity: Counter-intuitively, simpler general prompt strategies often yielded competitive or even superior results compared to more complex ones. The Chain of Thought (CoT) strategy, which explicitly encourages step-by-step reasoning, did not stand out in this evaluation, performing similarly to or sometimes even worse than general or workflow prompts for several models. This suggests that for these specific spatial semantic tasks, explicit multi-step reasoning might not always be the most effective approach or might require further refinement in prompt design.
6.2. Model Specific Performance
To delve deeper than overall accuracy, the paper analyzed each model's performance across seven dimensions: entity recognition, role recognition, anomaly recognition, spatial inference, synonym recognition, single-choice questions, and multiple-choice questions. It distinguishes between actual best performance (the score on that dimension under the model's overall optimal prompt strategy) and potential best performance (the highest score on that dimension across any prompt strategy).
The following are the results from Table 5 of the original paper:
| 模型 (Model) | 最佳类型 (Best Type) | 实体识别 (Entity Recognition) | 角色识别 (Role Recognition) | 异常识别 (Anomaly Recognition) | 空间推理 (Spatial Inference) | 同义识别 (Synonym Recognition) | 单选题 (Single-choice) | 多选题 (Multiple-choice) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERNIE-4 | 实际最佳 (Actual Best) | 79.20 | 95.26 | 87.50 | 29.92 | 65.45 | 61.20 | 25.20 |
| ERNIE-4 | 潜在最佳 (Potential Best) | 80.40 | 96.84 | 87.50 | 29.92 | 65.45 | 61.20 | 25.20 |
| GLM-4 | 实际最佳 (Actual Best) | 78.40 | 95.79 | 85.00 | 29.33 | 60.00 | 58.30 | 32.93 |
| GLM-4 | 潜在最佳 (Potential Best) | 78.40 | 96.84 | 85.00 | 29.33 | 63.63 | 59.64 | 32.93 |
| GPT-4o | 实际最佳 (Actual Best) | 76.40 | 93.68 | 80.00 | 30.52 | 60.00 | 58.09 | 32.52 |
| GPT-4o | 潜在最佳 (Potential Best) | 76.40 | 95.79 | 80.00 | 30.52 | 65.45 | 59.34 | 33.74 |
| GPT-4 Turbo | 实际最佳 (Actual Best) | 76.80 | 95.26 | 72.50 | 28.59 | 54.54 | 59.54 | 20.73 |
| GPT-4 Turbo | 潜在最佳 (Potential Best) | 76.80 | 95.26 | 80.00 | 29.48 | 61.82 | 59.92 | 23.17 |
| Deepseek-V2-chat | 实际最佳 (Actual Best) | 74.40 | 95.26 | 77.50 | 26.22 | 52.73 | 56.33 | 24.80 |
| Deepseek-V2-chat | 潜在最佳 (Potential Best) | 74.40 | 96.84 | 82.50 | 29.04 | 65.45 | 56.74 | 29.67 |
| Qwen1.5-72B-chat | 实际最佳 (Actual Best) | 71.60 | 91.05 | 67.50 | 23.11 | 52.73 | 55.50 | 11.79 |
| Qwen1.5-72B-chat | 潜在最佳 (Potential Best) | 72.40 | 93.68 | 67.50 | 24.74 | 54.54 | 55.50 | 16.67 |
Table 5: Actual Best Performance and Potential Best Performance of Models on the Validation Set (Accuracy %)
Overall Specific Task Performance:
- All models performed best on role recognition tasks and worst on spatial inference tasks.
- Performance on single-choice questions was generally higher than on multiple-choice questions.
- ERNIE-4 showed strong overall performance, excelling in four tasks and on single-choice questions. GLM-4 was a close second, achieving the highest scores in role recognition and on multiple-choice questions. Deepseek-V2-chat showed competitive potential best performance in synonym recognition, comparable to ERNIE-4 and GLM-4.
6.2.1. In Entity Recognition Tasks
Entity recognition tasks assess the model's ability to identify spatial directional words and their co-reference with entities already present in the context.
- ERNIE-4 performed well when the spatial directional words and entity relationships were fixed and explicitly present in the context. For example, in Questions 1 and 4 (see Section 5.1 for examples), where the spatial relationship is clear, ERNIE-4 showed accurate understanding.
- However, ERNIE-4's error rate increased when the target entity was not explicitly mentioned in the context, or when there were distractor entities with possessive or generalized possessive relationships. For instance, in Question 58 ("外面的雨声混合着我的哭声" - 'the rain outside mixed with my cries'), the entity "屋子" (house/room) was not present. In Question 67 ("楼上只有南面的大厅有灯亮" - 'only the hall on the south side upstairs has lights on'), "大厅" (hall) acted as a distractor when the correct reference should have been "楼上" (upstairs), which has a generalized possessive relationship with "大厅".
- Conclusion: ERNIE-4 accurately identifies entities associated with directional words that appear in the context but struggles with unmentioned entities or when distractors with possessive relationships are present.
6.2.2. In Role Recognition Tasks
Role recognition tasks examine the model's ability to identify two entities with spatial interaction relationships. These relationships can stem from possessive relationships (e.g., Question 251), event-based relationships (e.g., Question 258), or relative position relationships (e.g., Question 259).
- ERNIE-4 demonstrated very accurate understanding of these specific spatial interaction relationships between concrete entities.
- However, the model's recognition ability decreased significantly when the question asked about complex spatial relationships (including implicit spatial relationships) of abstract entities, especially those involving meta-language (abstract descriptions of spatial relations unrelated to specific text content, like "path," "direction," "start point," or "external position"). For example, the model struggled with Question 398 (asking what kind of spatial information "鄱阳城" (Poyang City) constitutes relative to the flow of "昌江" (Changjiang River)) and Question 425 (asking where the event of broken clothes being thrown into a Shanghai garbage can took place).
- Conclusion: ERNIE-4 is most accurate in judging specific spatial relationships between concrete entities, followed by specific spatial relationships of abstract entities, and weakest in judging abstract spatial relationships of abstract entities. This aligns with human cognitive patterns, where concrete objects and relationships are easier to perceive.
6.2.3. In Anomaly Recognition Tasks
Anomaly recognition tasks assess the model's capacity to identify entities with anomalous or incorrect spatial interaction relationships. Anomalies can be illogical (e.g., Question 441: "小红在下,我在上" - 'Xiao Hong is below, I am above' being the anomaly) or contradictory (e.g., Question 442: "灵车渐渐地靠近并消失在夜色中" - 'the hearse gradually approached and disappeared into the night').
- ERNIE-4 performed reasonably well in identifying relatively straightforward anomalous spatial relationships.
- However, its performance declined when the anomalous spatial relationship was complex and required inferential ability or encyclopedic knowledge. For instance, in Question 478, identifying that a car traveling from east to west turning left toward the north (向北) is anomalous (a left turn should head south) requires external knowledge about cardinal directions and turns.
- Conclusion: ERNIE-4 can generally identify intuitive anomalous spatial relationships but performs poorly when such identification necessitates complex reasoning or external encyclopedic knowledge.
6.2.4. In Spatial Inference Tasks
Spatial inference tasks require the model to deduce correct spatial relationships between entities through simple reasoning. In these problems, the model only receives the antecedent of conditional propositions from the context, and the consequent needs to be inferred based on the problem's requirements.
- ERNIE-4 showed relatively weak spatial inference capability, even for problems that are not overly complex. For example, Question 481 (see Section 5.1) involves six people sitting in a regular hexagon, requiring understanding of geometry and relative positions.
- This type of reasoning requires not only correctly extracting information from the context but also calling upon necessary encyclopedic knowledge (e.g., understanding a "regular hexagon") and then performing inference based on this combined information.
- Conclusion: Spatial inference can be viewed as a complex role recognition and entity recognition problem in which the model must process chains of entity relationships and stack complex spatial information. ERNIE-4 struggles with these tasks, indicating limitations in its ability to correctly judge complex spatial relationships that require multi-step reasoning and integration of external knowledge.
6.2.5. In Synonym Recognition Tasks
Synonym recognition tasks evaluate the model's accurate understanding of the specific spatial relationships expressed by different spatial directional words. In Chinese, some spatial directional words have different standalone meanings but can express the same spatial direction when combined with certain spatial entities. For example, in Question 1157, "回来" (come back) and "回去" (go back) are often considered antonyms, but in specific contexts, replacing one with the other might not significantly alter the spatial scene.
- To correctly perform this task, the model must understand the semantics of the "entity + directional word" sequence, rather than merely identifying entities associated with directional words (which is an entity recognition task).
- Conclusion: Comparing its performance on synonym recognition tasks with entity recognition tasks, ERNIE-4 can locate entities associated with directional words relatively well, but its understanding of their semantic nuances for replacement purposes is not outstanding.
6.2.6. Performance Summary Across Task Types
The following are the results from Table 6 of the original paper:
| 任务类别 (Task Category) | 模型表现 (Model Performance) | 影响因素 (Influencing Factors) |
| --- | --- | --- |
| 角色识别 (Role Recognition) | 好 (Good) | 具体实体的具体空间关系不受外界影响,但不容易判断抽象实体(元语言对象)的抽象空间关系 (Specific spatial relations between concrete entities are unaffected by external factors, but abstract spatial relations of abstract entities ('meta-language' objects) are difficult to judge) |
| 实体识别 (Entity Recognition) | 较好 (Fairly Good) | 表示静态的空间关系,基本不受外界影响,但出现与目标实体有领属或广义领属关系的其他实体时容易受干扰 (Represents static spatial relations, generally unaffected by external factors, but easily interfered with by other entities having possessive or generalized possessive relations with the target entity) |
| 异常识别 (Anomaly Recognition) | 较好 (Fairly Good) | 简单异常空间关系容易识别,但在需要百科知识或推理能力的问题上易受干扰 (Simple anomalous spatial relations are easy to identify, but susceptible to interference in problems requiring encyclopedic knowledge or reasoning ability) |
| 同义识别 (Synonym Recognition) | 较差 (Poor) | 表示空间关系的联系,受"实体+方位词"语义的影响 (Represents connections between spatial relations, influenced by the semantics of "entity + directional word") |
| 空间推理 (Spatial Inference) | 差 (Bad) | 参与空间主体较多,而且需要百科知识和推理能力,易受干扰项影响 (Involves many spatial agents, requires encyclopedic knowledge and reasoning ability, and is easily affected by distractors) |
Table 6: Model Performance and Influencing Factors
6.3. Ablation Studies / Parameter Analysis
While the paper doesn't present traditional ablation studies in the sense of removing model components, the comparative analysis across different prompt strategies (General, Workflow, CoT) and shot settings (0-shot, 1-shot, 3-shot) effectively functions as a form of parameter analysis or sensitivity analysis for prompt engineering.
The findings from this analysis highlight:
- The critical role of few-shot learning: Moving from 0-shot to 1-shot generally provides a substantial performance boost across models, indicating that even a single example significantly aids the models in understanding the task's requirements and desired output format.
- Diminishing or inconsistent returns for more shots: The mixed results when moving from 1-shot to 3-shot suggest that simply adding more examples is not always beneficial and might even introduce noise or overwhelm the model for these specific tasks.
- The value of simplicity in prompts: The general prompt often performed as well as, or better than, the more structured workflow prompt and the reasoning-focused CoT prompt. This indicates that for certain tasks, a concise and direct instruction might be more effective than explicitly guiding the model's internal processing or asking it to show its steps. This could be due to the nature of the spatial semantic tasks, where direct pattern matching or straightforward understanding is sometimes sufficient, or perhaps the CoT examples themselves were not optimally designed for all scenarios.

This analysis, though not a deep dive into model architecture, is crucial for understanding how to best interact with and evaluate LLMs for specialized tasks like spatial semantics.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study comprehensively evaluated the spatial semantic understanding capabilities of various large language models in the CCL2024 Chinese Spatial Semantic Evaluation Task. The research systematically explored different prompt engineering strategies, including general prompts, workflow prompts, and chain-of-thought (CoT) prompts, alongside varying shot settings (0-shot, 1-shot, 3-shot).
The key findings are:
- ERNIE-4, particularly with a 1-shot general prompt, demonstrated the strongest overall performance, achieving an accuracy of 56.20% on the test set and ranking 6th in the closed track.
- The base capabilities of the LLMs are paramount, with models like ERNIE-4 and GLM-4 excelling due to their robust Chinese semantic understanding.
- Few-shot prompting (especially 1-shot) significantly improves performance over 0-shot, but increasing the number of examples to 3-shot does not guarantee further gains.
- Simpler prompt strategies (e.g., general prompts) can be highly effective, sometimes outperforming more complex workflow or CoT approaches for spatial semantic tasks.
- Task-specific analysis revealed that models perform best on role recognition and worst on spatial inference. They struggle with abstract spatial concepts, unmentioned entities, and problems requiring external encyclopedic knowledge or complex multi-step reasoning.
7.2. Limitations & Future Work
The authors implicitly acknowledged several limitations through their detailed task-specific analysis:
- Difficulty with Abstract Spatial Relations: Models struggle to interpret abstract entities and their abstract spatial relationships, as well as meta-language spatial descriptions.
- Interference from Distractors: Entity recognition is hampered by the presence of distractor entities that have possessive or generalized possessive relationships with the target.
- Dependence on External Knowledge and Reasoning: Anomaly recognition and especially spatial inference tasks reveal a significant weakness when the solution requires invoking encyclopedic knowledge or complex multi-step reasoning beyond direct contextual extraction.
- Nuance in Synonymy: Synonym recognition tasks indicate that models have difficulty understanding the subtle semantic interplay between an entity and its associated directional word for accurate substitution.
- Uneven Dataset Distribution: The highly uneven distribution of questions across task categories (e.g., very few spatial heteronymous synonym identification training samples) could limit the generalizability of findings for those specific tasks and impact models' ability to learn those specific nuances.

Based on these observations, the paper suggests several future research directions:

- Optimizing Prompt Processing Mechanisms: Further improving how LLMs process and respond to prompts, perhaps by fine-tuning their instruction-following capabilities.
- Designing More Structured and Explicit Prompts: Developing more sophisticated prompt engineering techniques that better guide models through complex spatial reasoning, potentially making CoT more effective for spatial tasks.
- Enhancing LLM Base Capabilities: Continuing to improve the foundational architecture and training algorithms of LLMs, perhaps with more spatially-aware pre-training data or inductive biases.
- Integrating External Knowledge Bases: Combining LLMs with external knowledge graphs or encyclopedic knowledge bases to augment their reasoning capabilities for tasks requiring world knowledge.
- Interactive Decision Strategies: Exploring interactive or multi-turn prompting approaches where models can ask clarifying questions or refine their understanding through dialogue.
7.3. Personal Insights & Critique
This paper offers valuable insights into the current state of LLMs' spatial semantic understanding in Chinese.
Personal Insights:
- Emergent vs. Engineered Intelligence: The findings underscore that while LLMs exhibit impressive emergent linguistic capabilities, their common-sense spatial reasoning is still far from human-like, particularly when moving from concrete, explicit relationships to abstract, implicit, or inferential ones. This gap highlights that large-scale pre-training alone does not guarantee robust cognitive abilities like spatial reasoning.
- The Art of Prompt Engineering: The varying performance across prompt strategies and shot settings reaffirms that prompt engineering is not just a trick but a crucial interface for unlocking and guiding LLM capabilities. The fact that simpler general prompts sometimes outperform CoT for these tasks is a counter-intuitive but important finding, suggesting that for certain domains, a direct approach might be less prone to misinterpretations or spurious reasoning steps.
- Specificity of Language: The detailed breakdown of errors points to the inherent complexities of spatial semantics in Chinese, where nuanced interactions between entities and directional words, or meta-language descriptions, can trip up even advanced LLMs. This suggests that future LLM development might benefit from incorporating more linguistically informed inductive biases or structured representations for spatial concepts.
Critique:
- Lack of Error Analysis Depth: While the paper provides examples of error types, a more systematic qualitative error analysis across different models and tasks could offer deeper insights into why models fail (e.g., misinterpreting context, flawed reasoning steps, lack of world knowledge, or output format issues). This would be particularly useful for tasks like spatial inference and anomaly recognition.
- GPT-4 Generated CoT Samples: The use of GPT-4 to generate Chain of Thought examples for other models introduces a potential bias. If GPT-4 is inherently better at generating optimal reasoning steps, it might set an unfairly high bar or tailor the CoT format to its own strengths, possibly explaining why other models did not benefit as much from CoT. An alternative would be human-annotated CoT examples, or generating them with a different, less capable model.
- "Why" Questions Unanswered: The paper states that ERNIE-4 and GLM-4 have "strong base capabilities," but it does not speculate on why these specific models might be better suited for Chinese spatial semantics (e.g., differences in training data, architectural choices, or pre-training objectives specific to Chinese).
- Limited Exploration of Multiple-Choice Performance: The significantly lower performance on multiple-choice questions (e.g., ERNIE-4 at 25.20%) deserves more in-depth investigation. Is it due to the inherent complexity of multiple correct answers, or to models' difficulty in adhering to the format for listing multiple options accurately?
- Generalizability of Prompt Findings: The conclusion that "simpler is better" for prompts, while interesting, might be highly task- and dataset-specific. It would be beneficial for future work to explore whether this holds across a broader range of spatial semantic tasks or languages.

Despite these points, the paper provides a solid foundation for understanding LLM capabilities in a challenging NLP domain and highlights important avenues for future research in prompt engineering and LLM development. Its findings are valuable for researchers working on improving the cognitive reasoning abilities of large language models.