CCL24-Eval 任务3系统报告:基于大型语言模型的中文空间语义评测 (CCL24-Eval Task 3 System Report: Evaluation of Chinese Spatial Semantics Based on Large Language Models)
TL;DR Summary
This study evaluates large language models' spatial semantic understanding through entity recognition, role recognition, anomaly recognition, information inference, and synonym recognition tasks. Three prompt strategies were tested; ERNIE-4 performed best with a 1-shot general prompt, and the team's method ranked sixth with an overall accuracy of 56.20%.
Abstract
This study tasks large language models with entity recognition, role recognition, anomaly recognition, information inference, and synonym recognition, in order to comprehensively evaluate their spatial semantic understanding. We use three prompting strategies, general prompts, workflow prompts, and chain-of-thought prompts, to probe this ability, and find that ERNIE-4 performs best with a 1-shot general prompt. Our method ultimately ranked sixth, with an overall accuracy score of 56.20%.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
CCL24-Eval 任务3系统报告:基于大型语言模型的中文空间语义评测 (CCL24-Eval Task 3 System Report: Evaluation of Chinese Spatial Semantics Based on Large Language Models)
1.2. Authors
Huo Shitu (霍世图), Wang Yujun (王钰君), Wu Tongjie (吴童杰). Affiliation: School of International Chinese Education, Beijing Normal University, Beijing 100875, China.
1.3. Journal/Conference
This is a system report for CCL24-Eval Task 3, which is part of the Chinese Computational Linguistics (CCL) conference series. CCL is a prominent academic conference in China focusing on computational linguistics and natural language processing. System reports for evaluation tasks like CCL24-Eval typically present participants' methods and results in a shared task setting, contributing to the understanding of current capabilities and challenges in a specific NLP area.
1.4. Publication Year
The paper implicitly refers to "CCL24-Eval" and "SpaCE2024", indicating the research was conducted for an evaluation event in 2024.
1.5. Abstract
This research aims to comprehensively evaluate the spatial semantic understanding capabilities of large language models (LLMs) by having them perform entity recognition, role recognition, anomaly detection, information inference, and synonym recognition tasks. The study explores three prompt strategies: general prompts (普通提示词), workflow prompts (工作流提示词), and chain-of-thought (CoT) (思维链), to investigate LLMs' spatial semantic understanding. The findings indicate that ERNIE-4 performs best with 1-shot general prompts. Ultimately, the proposed method achieved a sixth-place ranking in the competition with an overall accuracy score of 56.20%.
1.6. Original Source Link
/files/papers/6917c618110b75dcc59ae0e6/paper.pdf This link points to a PDF file, suggesting it is likely a conference paper or a system report submission. Its publication status is likely within the CCL24-Eval proceedings or a related workshop.
2. Executive Summary
2.1. Background & Motivation
The field of Natural Language Processing (NLP) has witnessed significant advancements due to large language models (LLMs). These models, primarily based on the Transformer architecture and leveraging Scaling Laws, have demonstrated remarkable abilities to capture complex language structures and contextual relationships, enabling them to perform various NLP tasks like machine translation and semantic analysis end-to-end. While LLMs exhibit strong semantic understanding, and even show potential as tools for theoretical linguistics, there's a need to systematically evaluate their nuanced capabilities, especially in complex domains like spatial semantics.
Spatial semantics refers to the understanding of how language expresses physical space, including movement, direction, and location. Traditionally, evaluating spatial semantic understanding in NLP systems has relied heavily on manual annotation, which is costly and difficult to scale. With the rise of generative language models, there's a growing trend to use them for evaluation tasks. This paper addresses two key questions:
- How well do large models understand spatial semantics?
- What are the strengths and weaknesses of large models in specific spatial semantic tasks?
The motivation behind this study is to leverage the capabilities of LLMs for comprehensive spatial semantic evaluation, moving beyond traditional methods and exploring the emergent abilities of these models in understanding and reasoning about space. This research, based on the Fourth Chinese Spatial Semantic Understanding Evaluation Task (SpaCE2024), aims to push the boundaries of LLM understanding in a challenging domain.
2.2. Main Contributions / Findings
The paper makes several key contributions and presents important findings:
- Comprehensive Evaluation Framework: The study designed an evaluation framework involving five distinct spatial semantic tasks: entity recognition (实体识别), role recognition (角色识别), anomaly detection (异常识别), information inference (信息推理), and synonym recognition (同义识别). These tasks, combined with different question types (single-choice and multiple-choice), provide a multi-faceted assessment of LLMs' spatial understanding.
- Exploration of Prompt Engineering Strategies: The research rigorously investigated three prompt engineering strategies (general prompts, workflow prompts, and chain-of-thought (CoT) prompts) with varying shot settings (0-shot, 1-shot, 3-shot). This exploration provides insights into how different prompting techniques influence LLM performance on spatial semantic tasks.
- Identification of Best-Performing Model and Strategy: The study found that ERNIE-4 achieved the best performance on the validation set using a 1-shot general prompt, scoring 53.88%. This strategy also yielded the team's highest accuracy on the test set, reaching 56.20%. GLM-4 with a 1-shot workflow prompt was the second-best performer.
- Insights into LLM Spatial Understanding:
  - Base Model Capability is Crucial: Models like ERNIE-4 and GLM-4, with strong underlying Chinese semantic understanding capabilities, generally performed well.
  - Impact of Prompt Quantity: Adding a single example (1-shot) significantly improved performance compared to 0-shot, but increasing to 3-shot did not consistently lead to further improvements, sometimes even causing a decrease in accuracy.
  - Prompt Strategy Complexity: Simpler general prompts sometimes outperformed more complex strategies like workflow prompts and CoT in this specific evaluation, suggesting that complexity does not always equate to better performance for spatial semantics; CoT did not show a significant advantage.
- Detailed Task-Specific Performance Analysis: The paper provides a granular breakdown of model performance across different spatial semantic tasks, highlighting that role recognition generally had the best performance, while spatial inference was the most challenging. It also noted that models struggled with abstract spatial relations, unmentioned entities, and scenarios requiring external encyclopedic knowledge.
- Competitive Ranking: The team's method achieved the 6th rank in the CCL24-Eval Task 3 closed track, demonstrating a competitive approach within the evaluation landscape.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of several concepts in natural language processing and linguistics is essential:
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data. They are designed to understand, generate, and process human language. Their "largeness" refers to their immense number of parameters (tens of billions to trillions), which allows them to capture complex patterns in language.
- Transformer Architecture: The Transformer is a neural network architecture introduced by Vaswani et al. (2017) that revolutionized NLP. Unlike previous recurrent neural networks (RNNs) or convolutional neural networks (CNNs), the Transformer relies entirely on attention mechanisms (specifically self-attention) to draw global dependencies between input and output. This allows for parallel processing of words in a sequence, making training much faster and enabling models to handle longer contexts.
- Attention Mechanism: A core component of the Transformer, the attention mechanism allows the model to weigh the importance of different parts of the input sequence when processing a specific word. For example, when translating a sentence, the model can "attend" to relevant words in the source sentence while generating each word in the target sentence. In self-attention, the attention mechanism is applied to the input sequence itself, allowing the model to learn relationships between different words within the same sentence. The Attention function can be described as mapping a query and a set of key-value pairs to an output, where the output is a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. The most common form is Scaled Dot-Product Attention (a runnable sketch follows this concept list): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices representing the input embeddings; in self-attention, all three are derived from the same input sequence.
  - $QK^T$ calculates the dot products between the queries and keys, indicating their similarity.
  - $\sqrt{d_k}$ is a scaling factor (where $d_k$ is the dimension of the keys) that prevents the dot products from becoming too large, which can push the softmax function into regions with very small gradients.
  - $\mathrm{softmax}$ normalizes the scores, turning them into probabilities.
  - The result is a weighted sum of the values, allowing the model to focus on relevant information.
- Scaling Laws: These are empirical observations (Kaplan et al., 2020) that describe how the performance of neural language models improves predictably as model size (number of parameters), dataset size, and computational budget increase. These laws have guided the development of increasingly larger and more capable LLMs.
- Prompt Engineering: This refers to the art and science of crafting effective inputs (prompts) for LLMs to guide their behavior and elicit desired outputs. It involves designing instructions, examples, or questions that steer the model towards performing specific tasks correctly.
- 0-shot, 1-shot, 3-shot Prompting: These terms describe the number of examples provided in the prompt to guide the LLM:
  - 0-shot: No examples are given; the model relies solely on its pre-trained knowledge to understand the task from the instructions.
  - 1-shot: One example of an input-output pair is provided in the prompt, demonstrating the desired format and task.
  - 3-shot: Three examples of input-output pairs are provided, offering more demonstrations to the model. Generally, few-shot prompting (1-shot, 3-shot, etc.) helps models learn the task in context without explicit fine-tuning.
- Chain-of-Thought (CoT) Prompting: A specific prompt engineering technique (Wei et al., 2022) where intermediate reasoning steps are included in the prompt examples. Instead of just showing input-output pairs, CoT prompts demonstrate how to arrive at the answer step by step, encouraging the LLM to generate its own reasoning process, which can significantly improve performance on complex reasoning tasks.
- Spatial Semantics: In linguistics and cognitive science, spatial semantics is the study of how languages encode and express information about space, including location, direction, movement, and the relationships between objects in space. This involves understanding spatial prepositions (e.g., "on," "under"), verbs of motion (e.g., "go," "come"), and spatial nouns (e.g., "top," "bottom").
- Natural Language Processing (NLP): A subfield of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language.
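As a concrete illustration of the formula in the Attention Mechanism entry above, here is a minimal NumPy sketch of scaled dot-product attention; the function and the random toy inputs are ours, not from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D query/key/value matrices."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # weighted sum of the values

# Toy self-attention: 4 tokens with 8-dimensional embeddings, Q = K = V.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```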
3.2. Previous Works
The paper contextualizes its research by reviewing significant studies in spatial semantics from both linguistic and computational perspectives.
3.2.1. Linguistic Perspectives on Spatial Semantics
Cognitive linguistics views spatial relations as fundamental to human perception and conceptualization.
- Foundational Theories:
- Cognitive Linguistics: Based on human experience and perception of the world (Zhang Min, 1998), abstract concepts are often concretized through metaphors. Spatial relations are among the earliest acquired human abilities (Akhundov, 1986; Clark, 1973; Zhang Min, 1998; Zhao Yanfang, 2001).
- Image Schema Metaphor: Lakoff and Turner (1989) posited that spatial metaphors project concrete spatial concepts onto abstract linguistic structures, conveying spatial relations and their inherent logic, forming the cognitive basis of spatial semantics.
- Space Grammar: Langacker (1982) emphasized the role of spatial semantics in language formation.
- Thematic Relations Hypothesis (TRH): Jackendoff (1983) suggested that events and states in human language's conceptual structure are organized through spatial conceptualization.
- Spatialization of Form Hypothesis (SFH): Lakoff (1987) similarly discussed the formation of basic sentence patterns from a spatial semantic viewpoint.
- Image Schemas: Johnson (1987) identified 27 crucial image schemas related to spatial semantics, fundamental for human spatial reasoning.
- Location as a Foundation: Putz and Dirven (1996) highlighted the foundational role of location in concept formation.
- Key Concepts: Talmy's (1983) Figure-Ground theory and Langacker's (1987) Trajector-Landmark relations are key tools.
- Micro-level Studies (International):
- Hawkins (1984) comprehensively studied English prepositions based on spatial cognition.
- Vandeloise (1994) systematically examined French spatial prepositions.
- Herskovits (1986) conducted an interdisciplinary survey of English spatial expressions.
- Svorou (1993) compared spatial prepositions across languages from a cognitive universality perspective.
- Micro-level Studies (Chinese):
- Liao Qiuzhong (1986) introduced the concept of reference points for directional words (方位词), moving beyond static analysis.
- Liu Ningsheng (1994) discussed how Chinese selects reference objects, target objects, and directional words to express spatial relations.
- Qi Huyang (1998) established a theoretical framework for the modern Chinese spatial system.
- Focus on 上 (shàng, up/on) and 下 (xià, down/under): Cui Xiliang (2000) analyzed "在X上" (zài X shàng, on X) and its psychological extensions. Lan Chun (2003) compared Chinese 上/下 with English up/down. Bai Lifang (2006) noted asymmetry between 上 and 下 in Chinese, with 上 being more fundamental. Xu Dan (2008) examined spatiotemporal expressions in Chinese, highlighting the use of 上/下 for time as unique to Chinese cognition.
- Dynamic Research on Motion Verbs: Lu Jianming (2002) defined displacement verbs (位移动词) as verbs "containing semantic features of displacement towards or away from the speaker," emphasizing their strong spatial semantic aspect. Subsequent work integrated this with semantic feature analysis or construction grammar (Zhang Guoxian, 2006; Yong Qian, 2013; Zeng Chuanlu, 2014).
- Recent Research: Focuses on dialectal and cross-linguistic comparisons, and the cognitive/psychological reality of spatial semantics (Jia Hongxia, 2009; Yin Weibin, 2014; Li Yunbing, 2016, 2020; Zhu Keyi, 2018).
3.2.2. Natural Language Processing (NLP) Research on Spatial Semantic Evaluation
NLP's focus is on extracting and understanding physical spatial information from natural language.
- Early Stages (Pre-Deep Learning):
- Phase One: Focused on defining hierarchical layers and relationships in spatial semantic networks (Tappan, 2004), establishing foundations for spatial information processing.
- Phase Two: Centered on specific tasks like spatial entity recognition (Kordjamshidi et al., 2011) and spatial relation determination, using semi-supervised or unsupervised machine learning on specific datasets.
- Limitations: These early methods often used non-linguistic formal approaches, failing to fully account for human expression of spatial relations in natural language, particularly abstract spatial concepts (Stock, 1998; Renz and Nebel, 2007; Bateman et al., 2007). They struggled with the inherent uncertainty and ambiguity in natural language spatial fragments.
- Addressing Ambiguity and Developing Standard Tasks:
- Spatial Role Labelling (SpRL): Kordjamshidi et al. (2011) introduced SpRL to address ambiguity in spatial prepositions. Roberts (2012) used a joint approach combining CRF models for feature extraction, maximum entropy/Naive Bayes classifiers for preposition disambiguation, and SemEval-2007 data to learn preposition meanings, achieving binary classification of spatial/non-spatial preposition meanings (Roberts and Harabagiu, 2012). This method considered all elements of spatial relations (e.g., trajector, landmark, indicator).
- SpaceEval 2013: Kolomiyets et al. (2013) extended SpRL by adding MoveLink and motion tags to annotate motion verbs/nominal events and their categories from a spatial semantic perspective.
- SpaceEval 2015: Pustejovsky et al. (2015) further expanded the framework by defining subtasks for spatial element identification/classification, motion signal recognition, and motion relation recognition, providing a comprehensive evaluation framework.
- Chinese Spatial Semantic Evaluation (SpaCE):
- The SpaCE evaluations (Zhan Weidong et al., 2022; Yue Pengxue et al., 2023) drew inspiration from these international efforts, building high-quality Chinese datasets.
- Task Setting: SpaCE goes beyond simple identification and classification, requiring orientation inference and heteronymous synonym identification, reflecting higher demands for spatial semantic understanding, especially as LLMs develop world knowledge and emergent spatial reasoning abilities.
3.3. Technological Evolution
The field has evolved from rule-based and statistical methods to machine learning, then deep learning, and now large language models.
- Early NLP: Focused on explicit rules, handcrafted features, and statistical models for spatial relation extraction.
- Deep Learning Era: Introduced neural networks (CNNs, RNNs) for feature learning and sequence modeling, improving performance on tasks like entity recognition and relation extraction.
- Transformer and LLMs: The Transformer architecture (Vaswani et al., 2017) and subsequent LLMs (BERT, the GPT series, ERNIE, GLM, Qwen, Deepseek) marked a paradigm shift. Their ability to learn complex contextual representations from massive text corpora has led to emergent capabilities in reasoning and understanding that were previously thought to be exclusive to symbolic AI or human cognition. This paper directly benefits from and evaluates these advanced LLMs.
3.4. Differentiation Analysis
Compared to previous work, this paper's core differentiation lies in its explicit focus on evaluating the spatial semantic understanding capabilities of large language models.
- Shift from Feature Engineering to Prompt Engineering: Earlier NLP spatial semantics research often involved extensive feature engineering or specialized neural network architectures. This paper, however, primarily uses prompt engineering to adapt pre-trained LLMs to the spatial semantic tasks, exploring how different prompting strategies elicit spatial reasoning.
- Broader Range of Tasks: While previous tasks like SpRL focused on identifying spatial roles, SpaCE2024 (and by extension this paper's evaluation) includes more complex reasoning tasks such as anomaly detection, information inference, and heteronymous synonym identification, which probe deeper into cognitive aspects of spatial understanding.
- Leveraging Emergent Abilities: Instead of building models specifically for spatial semantics, this research tests the inherent or emergent spatial reasoning abilities of general-purpose LLMs, which have learned spatial concepts implicitly from their vast training data. This contrasts with traditional methods that often built specialized knowledge bases or models for spatial reasoning.
4. Methodology
4.1. Principles
The core principle of this study is to leverage the advanced natural language understanding and generation capabilities of large language models (LLMs) to perform various spatial semantic tasks. The authors hypothesize that by carefully designing prompt engineering strategies, LLMs can be guided to understand and reason about complex spatial relationships embedded in natural language text. The methodology aims to systematically evaluate the models' performance across different task types and with varying levels of in-context learning (0-shot, 1-shot, 3-shot) to identify their strengths, weaknesses, and the most effective prompting approaches.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology primarily revolves around selecting representative LLMs, designing structured prompts, and systematically evaluating their outputs.
4.2.1. Model Selection
The study selected six representative large language models from different developers, covering various architectures and scales. These models represent the state-of-the-art or recent advancements at the time of the research (2024). All models are accessed via API.
The following are the results from Table 2 of the original paper:
| 模型 (Model) | 版本日期 (Version Date) | 开发者 (Developer) | 模型大小 (Model Size) | 上下文 (Context Length) | 词表大小 (Vocabulary Size) | 是否开源 (Open Source) | 调用方式 (Calling Method) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 Turbo | 04-09 | OpenAI | 未披露 (Undisclosed) | 12.8万 (128K) | 10万 (100K) | 否 (No) | API |
| GPT-4o | 05-13 | OpenAI | 未披露 (Undisclosed) | 12.8万 (128K) | 20万 (200K) | 否 (No) | API |
| GLM-4 | 未披露 (Undisclosed) | 智谱华章 (Zhipu AI) | 未披露 (Undisclosed) | 12.8万 (128K) | 未披露 (Undisclosed) | 否 (No) | API |
| ERNIE-4 | 03-29 | 百度 (Baidu) | 未披露 (Undisclosed) | 8千 (8K) | 未披露 (Undisclosed) | 否 (No) | API |
| Qwen1.5-72B-chat | 未披露 (Undisclosed) | 阿里巴巴 (Alibaba) | 720亿 (72B) | 3.2万 (32K) | 15万 (150K) | 是 (Yes) | API |
| Deepseek-V2-chat | 未披露 (Undisclosed) | 深度求索 (Deepseek AI) | 2360亿 (236B) | 3.2万 (32K) | 10万 (100K) | 是 (Yes) | API |
4.2.2. Prompt Engineering
All prompts were designed using a structured Markdown format, comprising two main parts: prompt strategy and prompt sample construction.
4.2.2.1. Prompt Strategies
Three distinct prompt strategies were employed to investigate the LLMs' spatial semantic understanding:
- General Prompt (普通提示词): This is the simplest form of prompt, providing direct instructions without specific intermediate steps or roles. It asks the model to directly provide the answer.
- Workflow Prompt (工作流提示词): This strategy assigns a role to the LLM (e.g., "an expert in spatial information entity recognition") and outlines a step-by-step workflow for the model to follow before providing the answer. This aims to guide the model's internal processing more explicitly.
- Chain of Thought (CoT) Prompt (思维链提示词): Building on the CoT technique of Wei et al. (2022), this strategy explicitly asks the model to output its thought process (Thought) before giving the final answer (Answer). This encourages the model to break down complex problems into manageable steps, making its reasoning explicit. The structure was modified to include "想法" (Thought) and "答案" (Answer) for structured output.
4.2.2.2. Prompt Sample Construction
The construction of prompt samples varied based on the shot setting:
- 0-shot: No examples were provided in the prompt; the model relied solely on the instructions. This was used for general and workflow prompts.
- 1-shot: One example of a question-answer pair was included in the prompt to demonstrate the desired output format and task. This was used for general, workflow, and CoT prompts.
- 3-shot: Three examples of question-answer pairs were included. This was used for general and workflow prompts.
Sample Selection Process: For few-shot prompting (1-shot and 3-shot), the training samples were selected carefully:
- Each data entry in the training set consists of a text (C), a question (Q), four options (O), and an answer (A); each entry was organized into a sample (C, Q, O, A).
- Sentence-BERT (Reimers and Gurevych, 2019) was used to convert these samples into vectors, capturing their semantic meaning.
- For each task category (e.g., entity recognition), the average vector of all samples within that category was calculated, serving as the cluster centroid.
- Finally, by calculating the semantic similarity between each sample vector and the cluster centroid, the 1 and 3 samples closest to the centroid were selected as the 1-shot and 3-shot training data, respectively. This method aims to select representative examples for in-context learning (see the sketch after this list).
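A minimal sketch of this centroid-based selection, assuming the sentence-transformers library; the checkpoint name, the sample fields, and the function name are illustrative assumptions, since the paper does not specify them.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def select_shots(samples: list[dict], k: int = 3) -> list[dict]:
    """Return the k samples closest to their category's centroid in
    Sentence-BERT embedding space (k=1 for 1-shot, k=3 for 3-shot)."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder checkpoint
    # Each sample holds a text C, question Q, options O, and answer A.
    texts = [f"{s['C']} {s['Q']} {s['O']} {s['A']}" for s in samples]
    vectors = model.encode(texts, normalize_embeddings=True)
    centroid = vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)  # unit-normalize the centroid
    similarities = vectors @ centroid     # cosine similarity per sample
    best = np.argsort(-similarities)[:k]  # indices of the k most central samples
    return [samples[i] for i in best]
```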
Special Considerations for CoT and Spatial Heteronymous Synonym Identification:
- For CoT prompts, the sample examples require a thought process. These thought processes were generated using GPT-4 for consistency and quality.
- The spatial heteronymous synonym identification task had very few training samples (only one multiple-choice question). To ensure 3-shot construction was possible, two single-choice questions from this task were manually adapted into multiple-choice questions.
Prompt Examples (Translated from Original Paper):
General Prompt Example
# Goal: From the four options, select the spatial information reference object in the text. Note: Only answer with one key from the option, no need to answer the value, no need to explain.
**Text:** <text>
**Question:** <question>
**Option:** <option>
**Answer:**
Workflow Prompt Example
# Role: You are an expert skilled in spatial information entity recognition.
# Goal: From the four options, select the spatial information reference object in the text. Note: Only answer with one key from the option, no need to answer the value, no need to explain.
# Workflow:
1. Read the text: Carefully read the provided text, paying special attention to spatial information descriptions.
2. Analyze options: Review all options and identify which ones might be spatial reference objects in the text.
3. Select the correct option: Compare the text with the options and choose the most matching spatial information reference object.
**Text:** <text>
**Question:** <question>
**Option:** <option>
**Answer:**
Chain of Thought Prompt Example
# Goal: From the four options, select the spatial information reference object in the text. Note: Only answer with one key from the option, no need to answer the value; write out Thought and Answer.
**Text:** <text>
**Question:** <question>
**Option:** <option>
**Thought:** <thought>
**Answer:**
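To make the templates concrete, here is a small sketch of how such a prompt could be filled programmatically; the constant and function names are ours, the template text is abridged, and the example option list is invented for illustration.

```python
# Hypothetical helper for filling the general prompt template shown above.
GENERAL_TEMPLATE = (
    "# Goal: From the four options, select the spatial information "
    "reference object in the text. Note: Only answer with one key from "
    "the option, no need to answer the value, no need to explain.\n"
    "**Text:** {text}\n"
    "**Question:** {question}\n"
    "**Option:** {option}\n"
    "**Answer:**"
)

def build_prompt(text: str, question: str, option: str) -> str:
    return GENERAL_TEMPLATE.format(text=text, question=question, option=option)

prompt = build_prompt(
    text="这个玩具手机像真的一样,里面装上一节五号电池……",
    question="__里面装上一节五号电池",
    option="A. 玩具手机 B. 口袋 C. 西装 D. 三轮车",  # illustrative options
)
```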
4.2.3. Experiment Settings
4.2.3.1. Answer Extraction
Different answer extraction methods were used based on the prompt strategy:
- General and Workflow Prompts: The prompts requested models to output options directly, separated by English commas (e.g., "A,C"). The system first converted the output into a list and then iterated through each element to extract the first character (A, B, C, or D). This accounts for cases where models might output additional text after the answer.
- Chain of Thought Prompts: These prompts required the model to output its thought process first, followed by the answer. A regular expression was used to extract the answer: `\*\*Answer:\*\*\n(.+?)(\n\n|$)`. This regex targets the content between the **Answer:** tag and the following blank line or end of output. (A sketch of both extraction paths follows this list.)
- Manual Verification: Due to variations in instruction-following capability across different models, an automatic check was implemented to ensure answers were one of A, B, C, or D. If not, manual inspection was performed.
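A minimal sketch of both extraction paths described above; the function names are ours, and the regular expression is the reconstructed pattern quoted in the list.

```python
import re

def extract_choices(output: str) -> list[str]:
    """General/workflow prompts: split on commas and keep the first character
    of each item, so 'A, C' or 'A,C (because ...)' both yield ['A', 'C']."""
    return [part.strip()[0] for part in output.split(",") if part.strip()]

def extract_cot_answer(output: str) -> str | None:
    """CoT prompts: capture what follows the '**Answer:**' tag, up to the
    next blank line or the end of the output."""
    match = re.search(r"\*\*Answer:\*\*\n(.+?)(\n\n|$)", output, re.S)
    return match.group(1).strip() if match else None

print(extract_choices("A, C"))                                  # ['A', 'C']
print(extract_cot_answer("**Thought:** ...\n**Answer:**\nB"))   # 'B'
```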
4.2.3.2. Model Output Configuration
The temperature parameter for all LLM calls was set to 0.1. Temperature controls the randomness of the model's output; a lower temperature (like 0.1) makes the output more deterministic and focused, which is desirable for evaluation tasks requiring precise answers.
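For illustration, a hypothetical call using the OpenAI Python client shows where this decoding setting sits; the client, model name, and prompt are placeholders, since each of the six vendors exposes its own API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "# Goal: ...\n**Text:** ...\n**Question:** ...\n**Option:** ...\n**Answer:**"
response = client.chat.completions.create(
    model="gpt-4-turbo",  # one of the six models; the others use their own endpoints
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,      # low temperature keeps the output near-deterministic
)
print(response.choices[0].message.content)
```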
4.2.3.3. Evaluation Metric
The evaluation metric used was Accuracy. Accuracy measures the proportion of correctly answered questions out of the total number of questions. A correct answer was scored as 1 point, and all other cases (e.g., model stating no options fit, model refusing to answer, or partially correct answers in multiple-choice questions) were scored as 0 points.
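A minimal sketch of this scoring rule, with our own function names; answers are modeled as sets of option letters so that partially correct multiple-choice responses score 0, matching the rule above.

```python
def score(predicted: set[str], gold: set[str]) -> int:
    """Exact match earns 1 point; anything else (refusals, invalid output,
    partially correct multiple-choice answers) earns 0."""
    return 1 if predicted == gold else 0

def accuracy(predictions: list[set[str]], golds: list[set[str]]) -> float:
    return sum(score(p, g) for p, g in zip(predictions, golds)) / len(golds)

# A partially correct multiple-choice answer scores 0:
print(accuracy([{"A"}, {"B", "C"}], [{"A"}, {"B", "C", "D"}]))  # 0.5
```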
5. Experimental Setup
5.1. Datasets
The dataset used for this research is from the Fourth Chinese Spatial Semantic Understanding Evaluation Task (SpaCE2024). It is designed to evaluate LLMs' spatial semantic understanding across various dimensions and contexts.
Dataset Characteristics:
- Task Categories: The dataset comprises five major task categories, further divided into nine sub-categories. These tasks assess entity recognition, role recognition, anomaly detection, spatial orientation information inference, and spatial heteronymous synonym identification.
- Question Format: All questions are in a multiple-choice format, with four options provided for each question. Both single-choice and multiple-choice questions are included.
- Domains: The dataset covers a wide range of domains, including general areas like newspapers, literary works, and primary/secondary school textbooks, as well as specialized fields such as traffic accidents, sports actions, and geographical encyclopedias. This diversity ensures a comprehensive evaluation of the models' ability to handle spatial semantics in different contexts.
- Data Distribution: As shown in Table 1, spatial orientation information inference has the largest number of questions, while spatial heteronymous synonym identification has the fewest. Single-choice questions predominate. The uneven distribution and varied complexity of question types pose a significant challenge to the evaluation.

The following are the results from Table 1 of the original paper:
| 序号 (No.) | 任务类别 (Task Category) | 任务要求 (Task Requirement) | 题型 (Question Type) | 训练集 (Training Set) | 验证集 (Validation Set) | 测试集 (Test Set) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 空间信息实体识别 (Spatial Information Entity Recognition) | 选出文本空间信息的参照物 (Select the reference object of spatial information in the text) | 单选题 (Single-choice) | 937 | 226 | 489 |
| | | | 多选题 (Multiple-choice) | 161 | 24 | 81 |
| 2 | 空间信息角色识别 (Spatial Information Role Recognition) | 选出文本空间信息的语义角色,或者选出与语义角色相对应的空间表达形式 (Select the semantic role of spatial information in the text, or the spatial expression corresponding to the semantic role) | 单选题 (Single-choice) | 1074 | 186 | 746 |
| | | | 多选题 (Multiple-choice) | 19 | 4 | 24 |
| 3 | 空间信息异常识别 (Spatial Information Anomaly Recognition) | 从四个选项中选出文本空间信息异常的语言表达 (Select the abnormal linguistic expression of spatial information from the four options) | 单选题 (Single-choice) | 1077 | 40 | 500 |
| 4 | 空间方位信息推理 (Spatial Orientation Information Inference) | 基于文本给出的推理条件进行空间方位推理,从四个选项中选出推理结果 (Perform spatial orientation inference based on the conditions given in the text, and select the inference result from the four options) | 单选题 (Single-choice) | 909 | 468 | 1509 |
| | | | 多选题 (Multiple-choice) | 301 | 207 | 531 |
| 5 | 空间异形同义识别 (Spatial Heteronymous Synonym Recognition) | 从四个选项中选出能使两个文本异形同义或异义的空间义词语 (Select the spatial semantic word from the four options that makes the two texts heteronymous synonyms, or gives them different meanings) | 单选题 (Single-choice) | 4 | 44 | 517 |
| | | | 多选题 (Multiple-choice) | 1 | 11 | 133 |
| 总计 (Total) | | | | 4483 | 1210 | 4530 |
Example Data Samples (from the paper's analysis sections):
- Spatial Information Entity Recognition (实体识别):
  - Question 1: "周游口袋里只有五元钱....所以蹬三轮车的上来拉生意时,他理都不理他们,而是从西装口袋里掏出个玩具手机,这个玩具手机像真的一样,里面装上一节五号电池,悄悄按上一个键,手机的铃声就会响起来。 (题目:__里面装上一节五号电池)" (Translation: "Zhou You only had five yuan in his pocket.... So when pedicab drivers came to solicit business, he ignored them, and instead took out a toy phone from his suit pocket. This toy phone looked real, with a AA battery inside, and quietly pressing a button would make the phone ring. (Question: __ has a AA battery inside)")
  - Question 4: "回家以后,她给丈夫算了一笔账:我每天上下班路程要花3个小时,工作8小时,中午吃饭1小时,总共在外边花12小时..... (题目:总共在___外边花12小时)" (Translation: "After returning home, she did the math for her husband: I spend 3 hours commuting every day, 8 hours working, and 1 hour eating lunch, totaling 12 hours spent outside..... (Question: Total 12 hours spent outside __)")
- Spatial Information Role Recognition (角色识别):
  - Question 251: "时间过去近两个月,木沙江·努尔墩仍清楚地记得.....在人工湖边的冰窟中,拉齐尼用一只手臂搂住孩子,另一只手努力托举着孩子.... (题目:__的手努力托举着孩子)" (Translation: "Almost two months had passed, but Musajiang Nurdon still clearly remembered..... In the ice cave by the artificial lake, Laqini held the child with one arm and struggled to lift the child with the other hand... (Question: __'s hand struggled to lift the child)")
- Spatial Information Anomaly Recognition (异常识别):
  - Question 441: "小红在下,我在上,走到四楼的东侧......(题目:异常的空间方位信息是__,要求识别出"小红在下,我在上")" (Translation: "Xiao Hong is below, I am above, walking to the east side of the fourth floor..... (Question: The abnormal spatial orientation information is __; the expected answer is 'Xiao Hong is below, I am above')")
- Spatial Orientation Information Inference (空间方位信息推理):
  - Question 481: "贺知章、李白、陈子昂、骆宾王、王维、孟浩然六个人在海边沙滩上围成一圈坐着,大家都面朝中心的篝火。六人的位置恰好形成一个正六边形。任意相邻两个人之间的距离相等,大约为一米。已知:陈子昂在骆宾王左边数起第1个位置,孟浩然在陈子昂逆时针方向的第5个位置,王维在孟浩然顺时针方向的第1个位置 (题目:孟浩然在___的斜对面)" (Translation: "He Zhizhang, Li Bai, Chen Zi'ang, Luo Binwang, Wang Wei, and Meng Haoran, six people, sat in a circle on the beach, all facing a bonfire in the center. Their positions formed a regular hexagon. The distance between any two adjacent people was equal, about one meter. Known: Chen Zi'ang is in the 1st position to the left of Luo Binwang, Meng Haoran is in the 5th position counter-clockwise from Chen Zi'ang, and Wang Wei is in the 1st position clockwise from Meng Haoran. (Question: Meng Haoran is diagonally opposite ___)")
- Spatial Heteronymous Synonym Recognition (空间异形同义识别):
  - Question 1157: "傍晚的时候,宋钢将他带回去的钱用一张旧报纸仔细包好了,放在了枕头下面.....。(题目:"回去"替换为__形成的新句可以与原句表达相同的空间场景,要求用"回来"替换"回去")" (Translation: "In the evening, Song Gang carefully wrapped the money he had brought back with an old newspaper and placed it under the pillow..... (Question: Replacing '回去' (go back) with __ forms a new sentence that expresses the same spatial scene as the original; the expected answer is '回来' (come back))")
5.2. Evaluation Metrics
The primary evaluation metric used in this study is Accuracy.
5.2.1. Accuracy
- Conceptual Definition: Accuracy is a common metric used to evaluate the overall correctness of a classification model. It quantifies the proportion of predictions that are exactly correct out of the total number of predictions made. In this context, it measures how many spatial semantic questions the model answered correctly compared to all questions, providing a straightforward measure of general performance.
- Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation:
  - Number of Correct Predictions: the count of instances where the model's predicted answer matches the true answer.
  - Total Number of Predictions: the total number of questions or instances for which the model made a prediction.

In this study, a correct answer received 1 point, while any other outcome (e.g., the model indicating no option was suitable, refusing to answer, or giving incorrect or partial answers on multiple-choice questions) received 0 points.
5.3. Baselines
This paper evaluates several state-of-the-art large language models, treating them as individual systems to be tested with different prompt engineering strategies. Therefore, the "baselines" are not traditional smaller models or specific spatial semantic parsers, but rather the performance of these large models themselves under different conditions (0-shot, 1-shot, 3-shot, and different prompt types). The models compared include:

- GPT-4 Turbo (OpenAI)
- GPT-4o (OpenAI)
- GLM-4 (Zhipu AI)
- ERNIE-4 (Baidu)
- Qwen1.5-72B-chat (Alibaba)
- Deepseek-V2-chat (Deepseek AI)

These models are representative of cutting-edge LLM technology from major developers, enabling a comprehensive assessment of current LLM capabilities in Chinese spatial semantics. The comparison across these models and various prompting techniques aims to understand which models and strategies are most effective.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate varying levels of spatial semantic understanding among the tested large language models, heavily influenced by the prompt engineering strategies employed.
The following are the results from Table 3 of the original paper:
| 模型 (Model) | 普通 (General) 0-shot | 普通 1-shot | 普通 3-shot | 工作流 (Workflow) 0-shot | 工作流 1-shot | 工作流 3-shot | 思维链 (CoT) 1-shot |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ERNIE-4 | 50.25 | 53.88 | 52.73 | 52.23 | 52.73 | 52.81 | 51.06 |
| GLM-4 | 51.24 | 52.01 | 52.23 | 50.49 | 53.14 | 50.41 | 50.82 |
| GPT-4o | 48.92 | 51.16 | 52.89 | 48.35 | 50.99 | 51.73 | 50.91 |
| GPT-4 Turbo | 48.18 | 50.99 | 51.54 | 47.43 | 51.49 | 47.77 | 50.74 |
| Deepseek-V2-chat | 48.84 | 49.83 | 49.98 | 46.69 | 49.42 | 49.83 | 46.78 |
| Qwen1.5-72B-chat | 44.71 | 46.61 | 46.45 | 42.81 | 45.70 | 45.04 | 45.45 |
Table 3: Overall Performance of Models on the Validation Set (Accuracy %)
The following are the results from Table 4 of the original paper:
| 模型 (Model) | 样本数量 (Number of Shots) | 提示词 (Prompt Strategy) | 测试集准确率 (Test Set Accuracy) |
| --- | --- | --- | --- |
| ERNIE-4 | 1 | 普通 (General) | 56.20 |
| GLM-4 | 1 | 工作流 (Workflow) | 54.52 |
Table 4: Final Performance of Models on the Test Set (Accuracy %)
Key Findings from Overall Performance:
- Top Performers: ERNIE-4 consistently emerged as a strong performer. Its best score on the validation set was 53.88% with a 1-shot general prompt, which also translated to the highest accuracy on the test set at 56.20%. GLM-4 followed closely, achieving 53.14% on the validation set with a 1-shot workflow prompt, and 54.52% on the test set.
- Importance of Base Model Capability: The results suggest that the underlying base capabilities of the large models play a crucial role. Models like ERNIE-4 and GLM-4, known for their strong Chinese semantic understanding, generally adapted well to the challenging spatial tasks. Qwen1.5-72B-chat, despite being a large model, consistently showed lower performance across all prompt strategies.
- Impact of Prompt Sample Quantity (Shots): Few-shot prompting (1-shot or 3-shot) generally led to a significant improvement over 0-shot performance. For example, ERNIE-4 jumped from 50.25% (0-shot general) to 53.88% (1-shot general). However, increasing the number of shots from 1 to 3 did not always guarantee further improvement; across different models and prompt types, accuracy increased in 7 cases and decreased in 5, indicating a complex relationship between sample quantity and performance.
- Prompt Strategy Complexity: Counter-intuitively, simpler general prompt strategies often yielded competitive or even superior results compared to more complex ones. The Chain of Thought (CoT) strategy, which explicitly encourages step-by-step reasoning, did not stand out in this evaluation, performing similarly to or sometimes even worse than general or workflow prompts for several models. This suggests that for these specific spatial semantic tasks, explicit multi-step reasoning might not always be the most effective approach or might require further refinement in prompt design.
6.2. Model Specific Performance
To delve deeper than overall accuracy, the paper analyzed each model's performance across seven dimensions: entity recognition, role recognition, anomaly recognition, spatial inference, synonym recognition, single-choice questions, and multiple-choice questions. It distinguishes between actual best performance (the score on that dimension under the model's overall optimal prompt strategy) and potential best performance (the highest score on that dimension across any prompt strategy).
The following are the results from Table 5 of the original paper:
| 模型 (Model) | 最佳类型 (Best Type) | 实体识别 (Entity Recognition) | 角色识别 (Role Recognition) | 异常识别 (Anomaly Recognition) | 空间推理 (Spatial Inference) | 同义识别 (Synonym Recognition) | 单选题 (Single-choice) | 多选题 (Multiple-choice) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERNIE-4 | 实际最佳 (Actual Best) | 79.20 | 95.26 | 87.50 | 29.92 | 65.45 | 61.20 | 25.20 |
| ERNIE-4 | 潜在最佳 (Potential Best) | 80.40 | 96.84 | 87.50 | 29.92 | 65.45 | 61.20 | 25.20 |
| GLM-4 | 实际最佳 (Actual Best) | 78.40 | 95.79 | 85.00 | 29.33 | 60.00 | 58.30 | 32.93 |
| GLM-4 | 潜在最佳 (Potential Best) | 78.40 | 96.84 | 85.00 | 29.33 | 63.63 | 59.64 | 32.93 |
| GPT-4o | 实际最佳 (Actual Best) | 76.40 | 93.68 | 80.00 | 30.52 | 60.00 | 58.09 | 32.52 |
| GPT-4o | 潜在最佳 (Potential Best) | 76.40 | 95.79 | 80.00 | 30.52 | 65.45 | 59.34 | 33.74 |
| GPT-4 Turbo | 实际最佳 (Actual Best) | 76.80 | 95.26 | 72.50 | 28.59 | 54.54 | 59.54 | 20.73 |
| GPT-4 Turbo | 潜在最佳 (Potential Best) | 76.80 | 95.26 | 80.00 | 29.48 | 61.82 | 59.92 | 23.17 |
| Deepseek-V2-chat | 实际最佳 (Actual Best) | 74.40 | 95.26 | 77.50 | 26.22 | 52.73 | 56.33 | 24.80 |
| Deepseek-V2-chat | 潜在最佳 (Potential Best) | 74.40 | 96.84 | 82.50 | 29.04 | 65.45 | 56.74 | 29.67 |
| Qwen1.5-72B-chat | 实际最佳 (Actual Best) | 71.60 | 91.05 | 67.50 | 23.11 | 52.73 | 55.50 | 11.79 |
| Qwen1.5-72B-chat | 潜在最佳 (Potential Best) | 72.40 | 93.68 | 67.50 | 24.74 | 54.54 | 55.50 | 16.67 |
Table 5: Actual Best Performance and Potential Best Performance of Models on the Validation Set (Accuracy %)
Overall Specific Task Performance:
- All models performed best on role recognition tasks and worst on spatial inference tasks.
- Performance on single-choice questions was generally higher than on multiple-choice questions.
- ERNIE-4 showed strong overall performance, excelling in four tasks and on single-choice questions. GLM-4 was a close second, achieving the highest scores in role recognition and on multiple-choice questions. Deepseek-V2-chat showed competitive potential best performance in synonym recognition, comparable to ERNIE-4 and GLM-4.
6.2.1. In Entity Recognition Tasks
Entity recognition tasks assess the model's ability to identify spatial directional words and their co-reference with entities already present in the context.
- ERNIE-4 performed well when the spatial directional words and entity relationships were fixed and explicitly present in the context. For example, in Questions 1 and 4 (see Section 5.1 for examples), where the spatial relationship is clear, ERNIE-4 showed accurate understanding.
- However, ERNIE-4's error rate increased when the target entity was not explicitly mentioned in the context, or when there were distractor entities with possessive or generalized possessive relationships. For instance, in Question 58 ("外面的雨声混合着我的哭声" - 'the rain outside mixed with my cries'), the entity "屋子" (house/room) was not present. In Question 67 ("楼上只有南面的大厅有灯亮" - 'only the hall on the south side upstairs has lights on'), "大厅" (hall) acted as a distractor when the correct reference should have been "楼上" (upstairs), which has a generalized possessive relationship with "大厅".
- Conclusion: ERNIE-4 accurately identifies entities associated with directional words that appear in the context but struggles with unmentioned entities or when distractors with possessive relationships are present.
6.2.2. In Role Recognition Tasks
Role recognition tasks examine the model's ability to identify two entities with spatial interaction relationships. These relationships can stem from possessive relationships (e.g., Question 251), event-based relationships (e.g., Question 258), or relative position relationships (e.g., Question 259).
- ERNIE-4 demonstrated very accurate understanding of these specific spatial interaction relationships between concrete entities.
- However, the model's recognition ability decreased significantly when the question asked about complex spatial relationships (including implicit spatial relationships) of abstract entities, especially those involving meta-language (abstract descriptions of spatial relations unrelated to specific text content, like "path," "direction," "start point," or "external position"). For example, the model struggled with Question 398 (asking what kind of spatial information "鄱阳城" (Poyang City) constitutes relative to the flow of "昌江" (Changjiang River)) and Question 425 (asking where the event of broken clothes being thrown into a Shanghai garbage can took place).
- Conclusion: ERNIE-4 is most accurate in judging specific spatial relationships between concrete entities, followed by specific spatial relationships of abstract entities, and weakest in judging abstract spatial relationships of abstract entities. This aligns with human cognitive patterns, where concrete objects and relationships are easier to perceive.
6.2.3. In Anomaly Recognition Tasks
Anomaly recognition tasks assess the model's capacity to identify entities with anomalous or incorrect spatial interaction relationships. Anomalies can be illogical (e.g., Question 441: "小红在下,我在上" - 'Xiao Hong is below, I am above' being the anomaly) or contradictory (e.g., Question 442: "灵车渐渐地靠近并消失在夜色中" - 'the hearse gradually approached and disappeared into the night').
- ERNIE-4 performed reasonably well in identifying relatively straightforward anomalous spatial relationships.
- However, its performance declined when the anomalous spatial relationship was complex and required inferential ability or encyclopedic knowledge. For instance, in Question 478, identifying that a car traveling from east to west turning left toward the north (向北) is anomalous (a left turn should head south) requires external knowledge about cardinal directions and turns.
- Conclusion: ERNIE-4 can generally identify intuitive anomalous spatial relationships but performs poorly when such identification necessitates complex reasoning or external encyclopedic knowledge.
6.2.4. In Spatial Inference Tasks
Spatial inference tasks require the model to deduce correct spatial relationships between entities through simple reasoning. In these problems, the model only receives the antecedent of conditional propositions from the context, and the consequent needs to be inferred based on the problem's requirements.
- ERNIE-4 showed relatively weak spatial inference capability, even for problems that are not overly complex. For example, Question 481 (see Section 5.1) involves six people sitting in a regular hexagon, requiring understanding of geometry and relative positions.
- This type of reasoning requires not only correctly extracting information from the context but also calling upon necessary encyclopedic knowledge (e.g., understanding a "regular hexagon") and then performing inference based on this combined information.
- Conclusion: Spatial inference can be viewed as a complex role recognition and entity recognition problem in which the model must process chains of entity relationships and stack complex spatial information. ERNIE-4 struggles with these tasks, indicating limitations in its ability to correctly judge complex spatial relationships that require multi-step reasoning and integration of external knowledge.
6.2.5. In Synonym Recognition Tasks
Synonym recognition tasks evaluate the model's accurate understanding of the specific spatial relationships expressed by different spatial directional words. In Chinese, some spatial directional words have different standalone meanings but can express the same spatial direction when combined with certain spatial entities. For example, in Question 1157, "回来" (come back) and "回去" (go back) are often considered antonyms, but in specific contexts, replacing one with the other might not significantly alter the spatial scene.
- To correctly perform this task, the model must understand the semantics of the "entity + directional word" sequence, rather than merely identifying entities associated with directional words (which is an entity recognition task).
- Conclusion: Comparing its performance on synonym recognition tasks with entity recognition tasks, ERNIE-4 can locate entities associated with directional words relatively well, but its understanding of their semantic nuances for replacement purposes is not outstanding.
6.2.6. Performance Summary Across Task Types
The following are the results from Table 6 of the original paper:
| 任务类别 (Task Category) | 模型表现 (Model Performance) | 影响因素 (Influencing Factors) |
| --- | --- | --- |
| 角色识别 (Role Recognition) | 好 (Good) | 具体实体的具体空间关系不受外界影响,但不容易判断抽象实体(元语言对象)的抽象空间关系 (Specific spatial relations between concrete entities are unaffected by external factors, but abstract spatial relations of abstract entities ('meta-language' objects) are difficult to judge) |
| 实体识别 (Entity Recognition) | 较好 (Fairly Good) | 表示静态的空间关系,基本不受外界影响,但出现与目标实体有领属或广义领属关系的其他实体时容易受干扰 (Represents static spatial relations, generally unaffected by external factors, but easily interfered with by other entities having possessive or generalized possessive relations with the target entity) |
| 异常识别 (Anomaly Recognition) | 较好 (Fairly Good) | 简单异常空间关系容易识别,但在需要百科知识或推理能力的问题上易受干扰 (Simple anomalous spatial relations are easy to identify, but susceptible to interference in problems requiring encyclopedic knowledge or reasoning ability) |
| 同义识别 (Synonym Recognition) | 较差 (Poor) | 表示空间关系的联系,受"实体+方位词"语义的影响 (Represents connections between spatial relations, influenced by the semantics of "entity + directional word") |
| 空间推理 (Spatial Inference) | 差 (Bad) | 参与空间主体较多,而且需要百科知识和推理能力,易受干扰项影响 (Involves many spatial agents, requires encyclopedic knowledge and reasoning ability, and is easily affected by distractors) |
Table 6: Model Performance and Influencing Factors
6.3. Ablation Studies / Parameter Analysis
While the paper doesn't present traditional ablation studies in the sense of removing model components, the comparative analysis across different prompt strategies (General, Workflow, CoT) and shot settings (0-shot, 1-shot, 3-shot) effectively functions as a form of parameter analysis or sensitivity analysis for prompt engineering.
The findings from this analysis highlight:
- The critical role of few-shot learning: Moving from 0-shot to 1-shot generally provides a substantial performance boost across models, indicating that even a single example significantly aids the models in understanding the task's requirements and desired output format.
- Diminishing or inconsistent returns for more shots: The mixed results when moving from 1-shot to 3-shot suggest that simply adding more examples is not always beneficial and might even introduce noise or overwhelm the model for these specific tasks.
- The value of simplicity in prompts: The general prompt often performed as well as, or better than, the more structured workflow prompt and the reasoning-focused CoT prompt. This indicates that for certain tasks, a concise and direct instruction might be more effective than explicitly guiding the model's internal processing or asking it to show its steps. This could be due to the nature of the spatial semantic tasks, where direct pattern matching or straightforward understanding is sometimes sufficient, or perhaps the CoT examples themselves were not optimally designed for all scenarios.

This analysis, though not a deep dive into model architecture, is crucial for understanding how to best interact with and evaluate LLMs for specialized tasks like spatial semantics.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study comprehensively evaluated the spatial semantic understanding capabilities of various large language models in the CCL2024 Chinese Spatial Semantic Evaluation Task. The research systematically explored different prompt engineering strategies, including general prompts, workflow prompts, and chain-of-thought (CoT) prompts, alongside varying shot settings (0-shot, 1-shot, 3-shot).
The key findings are:
- ERNIE-4, particularly with a 1-shot general prompt, demonstrated the strongest overall performance, achieving an accuracy of 56.20% on the test set and ranking 6th in the closed track.
- The base capabilities of the LLMs are paramount, with models like ERNIE-4 and GLM-4 excelling due to their robust Chinese semantic understanding.
- Few-shot prompting (especially 1-shot) significantly improves performance over 0-shot, but increasing the number of examples to 3-shot does not guarantee further gains.
- Simpler prompt strategies (e.g., general prompts) can be highly effective, sometimes outperforming more complex workflow or CoT approaches for spatial semantic tasks.
- Task-specific analysis revealed that models perform best on role recognition and worst on spatial inference. They struggle with abstract spatial concepts, unmentioned entities, and problems requiring external encyclopedic knowledge or complex multi-step reasoning.
7.2. Limitations & Future Work
The authors implicitly acknowledged several limitations through their detailed task-specific analysis:
- Difficulty with Abstract Spatial Relations: Models struggle to interpret abstract entities and their abstract spatial relationships, as well as meta-language spatial descriptions.
- Interference from Distractors: Entity recognition is hampered by the presence of distractor entities that have possessive or generalized possessive relationships with the target.
- Dependence on External Knowledge and Reasoning: Anomaly recognition and especially spatial inference tasks reveal a significant weakness when the solution requires invoking encyclopedic knowledge or complex multi-step reasoning beyond direct contextual extraction.
- Nuance in Synonymy: Synonym recognition tasks indicate that models have difficulty understanding the subtle semantic interplay between an entity and its associated directional word for accurate substitution.
- Uneven Dataset Distribution: The highly uneven distribution of questions across task categories (e.g., very few spatial heteronymous synonym identification training samples) could limit the generalizability of findings for those specific tasks and impact models' ability to learn those specific nuances.

Based on these observations, the paper suggests several future research directions:

- Optimizing Prompt Processing Mechanisms: Further improving how LLMs process and respond to prompts, perhaps by fine-tuning their instruction-following capabilities.
- Designing More Structured and Explicit Prompts: Developing more sophisticated prompt engineering techniques that better guide models through complex spatial reasoning, potentially making CoT more effective for spatial tasks.
- Enhancing LLM Base Capabilities: Continuing to improve the foundational architecture and training algorithms of LLMs, perhaps with more spatially-aware pre-training data or inductive biases.
- Integrating External Knowledge Bases: Combining LLMs with external knowledge graphs or encyclopedic knowledge bases to augment their reasoning capabilities for tasks requiring world knowledge.
- Interactive Decision Strategies: Exploring interactive or multi-turn prompting approaches where models can ask clarifying questions or refine their understanding through dialogue.
7.3. Personal Insights & Critique
This paper offers valuable insights into the current state of LLMs' spatial semantic understanding in Chinese.
Personal Insights:
- Emergent vs. Engineered Intelligence: The findings underscore that while LLMs exhibit impressive emergent linguistic capabilities, their common-sense spatial reasoning is still far from human-like, particularly when moving from concrete, explicit relationships to abstract, implicit, or inferential ones. This gap highlights that large-scale pre-training alone does not guarantee robust cognitive abilities like spatial reasoning.
- The Art of Prompt Engineering: The varying performance across prompt strategies and shot settings reaffirms that prompt engineering is not just a trick but a crucial interface for unlocking and guiding LLM capabilities. The fact that simpler general prompts sometimes outperform CoT for these tasks is a counter-intuitive but important finding, suggesting that for certain domains, a direct approach might be less prone to misinterpretations or spurious reasoning steps.
- Specificity of Language: The detailed breakdown of errors points to the inherent complexities of spatial semantics in Chinese, where nuanced interactions between entities and directional words, or meta-language descriptions, can trip up even advanced LLMs. This suggests that future LLM development might benefit from incorporating more linguistically informed inductive biases or structured representations for spatial concepts.
Critique:
- Lack of Error Analysis Depth: While the paper provides examples of error types, a more systematic qualitative error analysis across different models and tasks could offer deeper insights into why models fail (e.g., misinterpreting context, flawed reasoning steps, lack of world knowledge, or output format issues). This would be particularly useful for tasks like spatial inference and anomaly recognition.
- GPT-4 Generated CoT Samples: The use of GPT-4 to generate Chain of Thought examples for other models introduces a potential bias. If GPT-4 is inherently better at generating optimal reasoning steps, it might set an unfairly high bar or tailor the CoT format to its own strengths, possibly explaining why other models did not benefit as much from CoT. An alternative would be human-annotated CoT examples, or generating them with a different, less capable model.
- "Why" Questions Unanswered: The paper states that ERNIE-4 and GLM-4 have "strong base capabilities," but it does not speculate on why these specific models might be better suited for Chinese spatial semantics (e.g., differences in training data, architectural choices, or pre-training objectives specific to Chinese).
- Limited Exploration of Multiple-Choice Performance: The significantly lower performance on multiple-choice questions (e.g., ERNIE-4 at 25.20%) deserves more in-depth investigation. Is it due to the inherent complexity of multiple correct answers, or to models' difficulty in adhering to the format for listing multiple options accurately?
- Generalizability of Prompt Findings: The conclusion that "simpler is better" for prompts, while interesting, might be highly task- and dataset-specific. It would be beneficial for future work to explore whether this holds across a broader range of spatial semantic tasks or languages.

Despite these points, the paper provides a solid foundation for understanding LLM capabilities in a challenging NLP domain and highlights important avenues for future research in prompt engineering and LLM development. Its findings are valuable for researchers working on improving the cognitive reasoning abilities of large language models.