
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Published: 06/29/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

UrbanLLaVA is a multi-modal language model designed for urban intelligence, processing four data types to enhance urban task performance. It leverages a diverse instruction dataset and a multi-stage training framework, achieving strong cross-city generalization.

Abstract

Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data (urban visual data, geo-text, structured geospatial data, and spatiotemporal series data) simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding." This title clearly indicates the development of a specific multi-modal large language model (MLLM), named UrbanLLaVA, tailored for urban intelligence applications, with a particular focus on its capabilities in spatial reasoning and understanding within urban environments.

1.2. Authors

The authors of the paper are Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, and Yong Li.

  • Jie Feng and Yong Li are affiliated with the Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China.
  • Shengyuan Wang is from the Department of Computer Science and Technology, Tsinghua University, Beijing, China.
  • Tianhui Liu is from the School of Electronic and Information Engineering, Beijing Jiaotong University, China.
  • Yanxin Xi is associated with the University of Helsinki, Finland. The contact email for Feng and Li is {fengjie, liyong07}@tsinghua.edu.cn. Their affiliations suggest a strong background in electronic engineering, computer science, and information engineering, with expertise likely spanning artificial intelligence, machine learning, and urban computing.

1.3. Journal/Conference

The paper was made public on 2025-06-29 (UTC timestamp 2025-06-29T13:04:27.000Z). Based on the provided metadata, it is a preprint on arXiv, available at https://arxiv.org/abs/2506.23219. As a preprint, it has not yet undergone formal peer review for a specific journal or conference, but arXiv is a highly reputable platform for disseminating cutting-edge research in computer science and related fields, allowing for early sharing and feedback. Many papers on arXiv are eventually published in top-tier conferences or journals.

1.4. Publication Year

The paper was published in 2025 (specifically, June 29, 2025, according to the UTC timestamp).

1.5. Abstract

The abstract introduces the challenges in urban research, which often requires understanding diverse multi-modal data but lacks a unified processing framework. It positions the recent success of multi-modal large language models (MLLMs) as a promising solution. The paper then introduces UrbanLLaVA, an MLLM specifically designed to process four types of urban data simultaneously (urban visual data, geo-text, structured geospatial data, and spatiotemporal series data) and achieve strong performance across various urban tasks, outperforming general MLLMs. Key contributions include:

  1. Curating UData, a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, ranging from local to global views of the urban environment.
  2. Proposing UTrain, a multi-stage training framework that explicitly decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the model's compatibility and downstream performance.
  3. Extending existing benchmarks to create UBench, a comprehensive benchmark for assessing MLLM performance across a wide range of urban tasks. Experimental results across three cities (Beijing, London, New York) demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal and complex cross-modal tasks, exhibiting robust generalization abilities. The source code and data are openly accessible.

The official source link for the paper is https://arxiv.org/abs/2506.23219, and the PDF link is https://arxiv.org/pdf/2506.23219v1.pdf. It is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

Urban research is inherently complex, involving a vast array of scenarios and tasks that necessitate the understanding and integration of multi-modal data. This data includes urban visual data (like street views and satellite images), geo-text (location descriptions), structured geospatial data (e.g., from OpenStreetMap), and spatiotemporal series data (e.g., traffic flows, mobility patterns). The core problem the paper addresses is the lack of a unified framework capable of comprehensively processing this diverse multi-modal urban data.

This problem is critical because a holistic understanding of urban spaces and the development of advanced reasoning capabilities for real-world urban applications depend on the effective integration of these data types. Prior research, while extensive, often falls short in several ways:

  1. Specificity: Many existing methods are designed for specific data types or are tailored to particular urban tasks, limiting their generalizability and ability to provide a comprehensive urban understanding.

  2. Unimodal Focus: Even recent advancements that integrate urban data into Large Language Models (LLMs) often focus on unimodal urban data (e.g., only geospatial text or only remote sensing images), failing to achieve true cross-modal understanding and modeling of complex urban systems.

  3. Heterogeneity Challenges: The inherent diversity and heterogeneity of urban data (different formats, scales, and semantics) pose significant challenges for data integration and comprehensive processing within a single framework.

    The paper identifies the recent success of multi-modal large language models (MLLMs) in general domains as a promising opportunity to overcome these limitations. MLLMs, with their built-in common sense and reasoning abilities, offer a pathway to unify data processing across various modalities. The paper's innovative idea is to adapt and advance MLLMs specifically for urban intelligence by creating a model that can process four major types of urban data simultaneously, thereby enabling comprehensive urban cognition and tackling diverse tasks within a unified framework.

2.2. Main Contributions / Findings

The paper introduces UrbanLLaVA and makes several key contributions:

  1. Introduction of UrbanLLaVA, a Unified Multi-modal Urban LLM: This is presented as the first MLLM designed for the unified modeling of four major types of urban data (urban visual data, geo-text, structured geospatial data, and spatiotemporal series data). Its goal is to foster comprehensive understanding and effective task-solving for urban environments, moving beyond task-specific or unimodal approaches.

  2. Development of UData, a Diverse Urban Instruction Dataset: The authors curate a systematic urban instruction data pipeline (UData) that generates high-quality synthetic data. This dataset spans multiple perspectives, from a localized view (single-modality data) to trajectory and global views (cross-modality data), capturing the multi-faceted nature of urban systems. This addresses the scarcity of high-quality cross-modality alignment data in urban research.

  3. Proposal of UTrain, an Effective Multi-stage Training Pipeline: To address challenges like training stability and balancing performance across diverse urban tasks and data modalities, the paper proposes UTrain. This three-stage pipeline explicitly decouples spatial reasoning enhancement from domain knowledge learning. The stages are: task alignment, knowledge learning, and mixture learning. This approach significantly improves UrbanLLaVA's compatibility and downstream performance.

  4. Extension of UBench, an Enhanced Multi-modal Benchmark: The paper extends existing urban benchmarks to create UBench, a systematic and comprehensive evaluation benchmark. UBench includes 12 tasks (6 adopted/extended, 6 newly introduced) designed to assess the capabilities of MLLMs in tackling a wide range of diverse urban tasks, especially those involving multi-modal data.

    The key findings from the experimental results in Beijing, London, and New York are:

  • UrbanLLaVA consistently outperforms both open-source and proprietary general MLLMs across various urban tasks, including both single-modal and complex cross-modal tasks in the UBench benchmark.
  • It demonstrates robust generalization abilities across different cities, even when trained on data from a single city (e.g., Beijing).
  • The UData dataset effectively equips smaller MLLMs with diverse urban capabilities, achieving superior performance over more advanced general MLLMs.
  • The UTrain multi-stage training pipeline successfully ensures stable training and balanced performance across diverse urban tasks.
  • UrbanLLaVA maintains its stability and competitive performance on general MLLM benchmarks (LLaVA-Bench, RealWorldQA, MM-Vet), indicating its ability to enhance specialized urban capabilities without sacrificing general MLLM strengths.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the contributions of UrbanLLaVA, a foundational understanding of several key concepts is essential for a beginner.

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are advanced artificial intelligence models trained on vast amounts of text data. They are designed to understand, generate, and process human language. At their core, LLMs use a type of neural network architecture called Transformers, which are particularly good at handling sequential data. LLMs excel at tasks like answering questions, summarizing texts, writing creative content, and translating languages. Their "largeness" comes from having billions or even trillions of parameters (the internal variables that the model learns during training), which allows them to capture complex patterns and relationships in language. The "reasoning abilities" mentioned in the abstract refer to their emergent capacity to perform complex cognitive tasks by manipulating symbolic representations, often via chain-of-thought prompting.

3.1.2. Multi-modal Large Language Models (MLLMs)

Multi-modal Large Language Models (MLLMs) extend the capabilities of LLMs by integrating information from multiple types of data, or "modalities," beyond just text. While LLMs focus solely on language, MLLMs can process and understand combinations of text, images, audio, video, and other data types. For example, a common MLLM combines text with vision, allowing it to answer questions about images (visual question answering), generate descriptions for images (image captioning), or follow visual instructions. This is typically achieved by using specialized encoders for each modality (e.g., a vision encoder for images) that convert the non-textual data into a format (embeddings) that the LLM can understand and reason with. The LLM then acts as the central reasoning engine, integrating information from all modalities to produce a coherent textual output.
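To make the encoder-projector-LLM pattern described above concrete, here is a minimal, illustrative PyTorch sketch. The module names, dimensions, and interface are assumptions for exposition, not the actual LLaVA or VILA implementation:

```python
import torch
import torch.nn as nn

class MinimalMLLM(nn.Module):
    """Illustrative only: a vision encoder turns the image into patch embeddings,
    a small projector maps them into the LLM's embedding space, and the LLM
    consumes [projected image tokens + text tokens] as one sequence."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g., a ViT returning patch features
        self.projector = nn.Linear(vision_dim, llm_dim)  # aligns the two modalities
        self.llm = llm                                   # decoder-only language model

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)          # (B, N_patches, vision_dim)
        image_tokens = self.projector(patch_feats)               # (B, N_patches, llm_dim)
        inputs = torch.cat([image_tokens, text_embeds], dim=1)   # image tokens prepended to text
        return self.llm(inputs_embeds=inputs)                    # autoregressive next-token prediction
```

In practice the projector is often pre-trained first to align the modalities, after which the full model is instruction-tuned on multi-modal data.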

3.1.3. Urban Intelligence (Urban Computing/Urban Science)

Urban intelligence (also known as urban computing or urban science) is an interdisciplinary field that applies computational and data science methods to address challenges and improve various aspects of urban life. It aims to understand the dynamics of cities, predict urban phenomena, and optimize urban systems using vast amounts of heterogeneous urban data. This includes areas like traffic management, public safety, environmental monitoring, urban planning, and social equity. The goal is to build "smarter" and more sustainable cities by leveraging technology to enhance decision-making and resource allocation.

3.1.4. Urban Data Modalities

Urban intelligence relies on diverse data types, often referred to as urban data modalities:

  • Urban Visual Data: This includes images and videos captured within urban environments. The paper specifically mentions:
    • Street View Images: Photographs taken from street level, offering a ground-level perspective of urban features (buildings, roads, storefronts, etc.). Examples include Google Street View or Baidu Map Street View.
    • Satellite Images: Aerial or overhead imagery of urban areas, providing a global or regional perspective on land use, infrastructure, and geographical features. Examples include Google Earth imagery.
  • Geo-Text: Textual data that contains geographical references or descriptions related to locations. This can include social media posts with location tags, reviews of places, descriptions of landmarks, or news articles mentioning specific areas.
  • Structured Geospatial Data: Data that represents geographic features with defined attributes and spatial coordinates. Examples include:
    • OpenStreetMap (OSM) data: A collaborative project to create a free editable map of the world. It contains information about roads, buildings, points of interest (POIs), land use, and more, all with precise geographical coordinates.
    • Geographic Information Systems (GIS) data.
  • Spatiotemporal Series Data: Data that records observations over both space and time. This is crucial for understanding dynamic urban phenomena. Examples include:
    • Trajectory Data: Sequences of geographical points (latitude, longitude, timestamp) tracing the movement of people or vehicles. This can be from GPS devices, mobile phone records, or public transit logs.
    • Environmental sensor readings (e.g., air quality over time at different locations).
    • Traffic flow data.
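As a rough illustration of how these four modalities differ in structure, the following hypothetical Python containers sketch the kind of records each modality contributes. The field names are illustrative, not the schema used by UrbanLLaVA or UData:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class POI:                       # structured geospatial data (e.g., from OpenStreetMap)
    name: str
    category: str                # e.g., "restaurant", "school"
    lon: float
    lat: float

@dataclass
class GeoText:                   # geo-text: free text tied to a place
    text: str
    lon: float
    lat: float

@dataclass
class TrajectoryPoint:           # one sample of a spatiotemporal series
    lon: float
    lat: float
    timestamp: int               # Unix time in seconds

@dataclass
class StreetViewImage:           # urban visual data (ground-level view)
    image_path: str
    lon: float
    lat: float
    heading_deg: float           # camera bearing

Trajectory = List[TrajectoryPoint]   # ordered movement of one person or vehicle
```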

3.1.5. Instruction Tuning

Instruction tuning is a fine-tuning technique used to adapt LLMs or MLLMs to follow specific instructions or perform tasks as described in natural language prompts. Instead of traditional supervised learning with input-output pairs, the model is trained on a dataset where each example consists of an instruction (e.g., "Describe this image," "Summarize this text," "What is the capital of France?") and the corresponding desired output. This process teaches the model to understand and respond to a wide variety of natural language prompts, making it more versatile and user-friendly. For MLLMs, instruction tuning often involves multi-modal inputs (e.g., an image and a textual instruction) and a textual output.
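A minimal sketch of one multi-modal instruction-tuning example and its loss masking is shown below. The JSON layout follows the common LLaVA-style convention; the exact schema used by the paper is not reproduced here, so treat the field names as assumptions:

```python
# One (multi-modal) instruction-tuning example in a LLaVA-style layout.
example = {
    "image": "street_view_001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the address shown in this street view?"},
        {"from": "gpt",   "value": "This location is on Chengfu Road, Haidian District, Beijing."},
    ],
}

def build_labels(input_ids, response_spans, ignore_index=-100):
    """Supervised fine-tuning typically penalizes only the assistant's response tokens:
    prompt/instruction tokens are masked out of the loss with ignore_index."""
    labels = [ignore_index] * len(input_ids)
    for start, end in response_spans:          # token index ranges of assistant replies
        labels[start:end] = input_ids[start:end]
    return labels
```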

3.2. Previous Works

The paper contextualizes UrbanLLaVA within the landscape of MLLMs and urban studies, highlighting the evolution from specific solutions to more generalized, unified approaches.

3.2.1. General MLLMs

The foundation of UrbanLLaVA lies in the success of general-purpose MLLMs that emerged following models like GPT-4V (OpenAI's vision-enabled GPT-4). Key examples mentioned in the paper include:

  • LLaVA [30, 31]: One of the pioneering open-source MLLMs that leveraged GPT-4V to create visual instruction tuning data, demonstrating strong visual understanding and reasoning.
  • VILA [29]: Explored the impact of training pipelines and data formats on MLLM pre-training. UrbanLLaVA uses VILA as its base model.
  • QwenVL [41] and InternVL [7, 8]: Other prominent open-source MLLMs that have shown strong performance in generic visual-linguistic tasks. These general MLLMs demonstrate powerful visual understanding and reasoning in common scenarios but often struggle with the specialized knowledge and complex reasoning required for domain-specific tasks like those in urban intelligence.

3.2.2. Domain-Specific MLLMs

Recognizing the limitations of general MLLMs in specialized fields, researchers have developed domain-specific MLLMs. Examples cited include:

  • Dolphins [34]: A multi-modal language model for autonomous driving, integrating sensor data for navigation and control.
  • LLaVA-Med [27]: Fine-tuned for answering open-ended questions about biomedical images.
  • GeoChat [26]: An early MLLM effort specifically designed for remote sensing tasks, focusing on satellite imagery. These models demonstrate the value of tailoring MLLM capabilities to particular domains but often remain specialized to a subset of data types within that domain.

3.2.3. Multi-modal Models for Urban Study (Prior to UrbanLLaVA)

Before UrbanLLaVA, urban research leveraged various methods, often focusing on specific data types:

  • Deep Learning-based Fusion Methods [53, 57]: Numerous deep learning methods have been proposed to fuse various cross-domain urban data. However, these are typically designed for specific urban tasks (e.g., predicting traffic flow, identifying land use) and lack the ability to achieve a comprehensive, unified understanding of the urban environment or perform advanced reasoning across a wide range of tasks.
  • LLMs/MLLMs for Unimodal Urban Data: More recently, LLMs and MLLMs have been explored for specific urban data types:
    • Structured Geospatial Data: Balsebre et al. [1] and Feng et al. [16] proposed methods to convert structured geospatial data into a language-compatible format to enhance the geospatial knowledge of LLMs (CityGPT [16] is an example).
    • Remote Sensing Data: Kuckreja et al. [26] (GeoChat) and Zhang et al. [52] designed instruction data to fine-tune general MLLMs for remote sensing tasks.
    • Street View Data: Hao et al. [22] fine-tuned CLIP models for urban indicator prediction by integrating street view and remote sensing data. Liu et al. [32] evaluated multi-modal language models for urban socioeconomic sensing.
    • Spatiotemporal Series Data: Li et al. [28] and Gong et al. [20] (Mobility-LLM) introduced domain-specific encoders to enhance LLM capabilities for trajectory modeling and spatiotemporal series. Feng et al. [15] proposed an agentic framework for zero-shot mobility prediction.

3.3. Technological Evolution

The technological evolution leading to UrbanLLaVA can be traced through several stages:

  1. Early Deep Learning for Urban Data: Initially, deep learning models were applied to urban data, but these were largely task-specific and often handled single modalities or limited fusion.

  2. Rise of General LLMs: The development of LLMs like GPT-3 demonstrated unprecedented language understanding and generation capabilities.

  3. Emergence of General MLLMs: The integration of vision (and other modalities) into LLMs (e.g., LLaVA, VILA, GPT-4V) created MLLMs capable of handling multi-modal input in general contexts, showing strong reasoning abilities.

  4. Domain Adaptation of LLMs/MLLMs: Researchers then began adapting LLMs and MLLMs to specific domains (e.g., medicine, autonomous driving, remote sensing). In urban studies, this initially led to models that either focused on specific data types (e.g., CityGPT for structured geospatial data, GeoChat for satellite images, Mobility-LLM for trajectories) or aimed at specific urban tasks.

    This paper's work (UrbanLLaVA) represents the next logical step in this evolution: moving from domain-specific MLLMs that handle limited data types or unimodal urban data to a comprehensive MLLM that integrates all major multi-modal urban data types into a unified framework for a wide range of urban tasks, thereby enabling holistic urban cognition and spatial reasoning.

3.4. Differentiation Analysis

Compared to the main methods in related work, UrbanLLaVA introduces several core differences and innovations:

  • Unified Multi-modal Integration: Unlike most previous works that focus on limited data types (e.g., only satellite images, only structured geospatial data, or only trajectory data) or integrate unimodal urban data into LLMs, UrbanLLaVA is explicitly designed to unify and process four major types of urban data simultaneously: urban visual data (street view and satellite), geo-text, structured geospatial data, and spatiotemporal series data. This comprehensive integration is a key differentiator.

  • Comprehensive Urban Cognition: By integrating multiple modalities, UrbanLLaVA aims for a more holistic and comprehensive understanding of urban environments, going beyond task-specific solutions or single-modal interpretations. This allows for advanced reasoning that leverages the interplay between different data types.

  • Systematic Data Curation (UData): The paper addresses the crucial challenge of cross-modality alignment by curating UData, a diverse urban instruction dataset. This dataset is meticulously structured to generate high-quality synthetic data covering location view, trajectory view, and global view, capturing the multi-faceted nature of urban systems—a level of systematic multi-view data generation not explicitly highlighted in prior urban MLLM efforts.

  • Decoupled Multi-stage Training (UTrain): To manage the heterogeneity of multi-modal urban data and the diversity of urban tasks, UrbanLLaVA proposes UTrain, a novel three-stage training pipeline. This framework decouples spatial reasoning enhancement from domain knowledge learning, which is a distinct methodological innovation aimed at improving training stability and performance across diverse urban tasks. Prior MLLM training often follows more general fine-tuning strategies without this explicit decoupling for urban contexts.

  • Enhanced Urban Benchmark (UBench): UrbanLLaVA extends existing benchmarks to create UBench, specifically designed to assess MLLMs across a wide range of urban tasks. This includes newly introduced tasks that better reflect complex multi-modal urban scenarios (e.g., STV-Outlier, SceneFunc), providing a more rigorous evaluation standard than previous benchmarks which might be less comprehensive or primarily focused on specific sub-domains.

    In essence, UrbanLLaVA moves beyond the "one data type, one task" or "unimodal LLM for urban" paradigms by offering a truly unified multi-modal framework equipped with specialized data and training strategies for comprehensive urban intelligence.

4. Methodology

4.1. Principles

The core idea behind UrbanLLaVA is to build comprehensive urban cognition and address a wide range of urban tasks by integrating multi-modal urban data into a single, unified Multi-modal Large Language Model (MLLM). The theoretical basis or intuition is that MLLMs, with their inherent common sense and reasoning capabilities (derived from their Large Language Model (LLM) backbone), can act as a central component to process heterogeneous urban data effectively. By training on a diverse dataset that systematically covers urban environments from local to global views, and by employing a specialized multi-stage training strategy, UrbanLLaVA aims to overcome the limitations of prior task-specific or unimodal approaches. This allows the model to develop sophisticated spatial reasoning and urban domain knowledge that enables it to understand complex urban phenomena and perform various tasks.

4.2. Core Methodology In-depth (Layer by Layer)

The UrbanLLaVA framework is structured into three main components: UData (data pipeline), UTrain (training pipeline), and UBench (evaluation benchmark). The following figure (Figure 2 from the original paper) illustrates the overall framework:

Figure 2. The framework of UrbanLLaVA, including UData, UTrain and UBench. (The figure shows the UData data pipeline, the UTrain training pipeline, and the UBench evaluation benchmark, with example tasks from the location view and trajectory view listed around the border.)

4.2.1. UData: Constructing Urban Instruction Data from a Multi-View Perspective of Urban Space

UData is the data pipeline responsible for generating diverse and high-quality urban instruction data across various urban scenarios. The fundamental approach is to organize the urban instruction data in a sequence that moves from a location view to a trajectory view, and finally to a global view. This ensures comprehensive spatial coverage and maintains the integrity of relationships between different modalities.

The construction of UData builds upon four kinds of original urban data:

  1. Structured geospatial data: Obtained from OpenStreetMap (e.g., points of interest, road networks, building footprints).

  2. Public trajectory data: Examples include Foursquare-checkins [48] and OpenStreetMap traces (sequences of GPS points representing movement).

  3. Satellite images: Sourced from GoogleEarth (overhead views of urban areas).

  4. Street view images: Collected from GoogleMap and BaiduMap (ground-level panoramic images).

    The following figure (Figure 3 from the original paper) provides a detailed composition of UData in Beijing:

Figure 3. The thorough composition of UData in Beijing. (The figure shows the distribution of the different types of urban data, including global view, trajectory view, and location view data, arranged in a circular layout.)

The UData construction process is divided into three stages:

4.2.1.1. Location View Data

This stage focuses on integrating structured geospatial data and single street view images.

  • Structured Geospatial Data Instructions: Following practices from previous works [1, 16], geospatial instruction data is created by designing question templates that transform basic geospatial information into natural language question-and-answer pairs. For instance, questions about landmarks, addresses, or facility types at specific locations.
  • Single Street View Image Instructions: For each street view image, three types of questions are synthesized:
    1. Template-based with Geospatial Data: Two types of questions are generated using predefined templates, populated with information from structured geospatial data (e.g., querying the address or landmark details visible in the image).
    2. General MLLM-generated Description: A general MLLM is prompted to generate a detailed description of the image content, similar to image captioning [6]. A core principle here is to ensure consistency between street view image content and structured geographical knowledge (e.g., location addresses and landmark descriptions must match). An example of Location View Data for Location Address is provided in the supplementary material (Figure 22):

Figure 22. An example of local view training instances of Location Address. (The street view shows a road in Beijing with trees, buildings, traffic signs, and roadside facilities.)
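The template-based portion of the location view data described above can be sketched as follows. The templates, record fields, and example values are invented for illustration and are not the paper's actual UData templates:

```python
import random

TEMPLATES = [
    ("What is the address of the location shown in this street view image?",
     "The image was captured at {address}."),
    ("Which landmark is closest to the place shown in this image?",
     "The nearest landmark is {landmark}, about {distance_m} meters away."),
]

def make_location_view_instructions(record):
    """record: one street view image aligned with structured geospatial data."""
    samples = []
    for question, answer_tpl in TEMPLATES:
        samples.append({
            "image": record["image_path"],
            "question": question,
            "answer": answer_tpl.format(**record),   # fill the template from geospatial attributes
        })
    return samples

record = {"image_path": "bj_stv_0001.jpg",
          "address": "Zhongguancun East Road, Haidian District",
          "landmark": "Tsinghua University East Gate",
          "distance_m": 120}
print(random.choice(make_location_view_instructions(record))["answer"])
```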

4.2.1.2. Trajectory View Data

This stage integrates geospatial data, trajectory data, and street view images to capture continuous spatial movement.

  • Text-based Trajectory Data: Two types are generated:
    1. Random Walk Routing: Generated by randomly sampling origin and destination points and creating routing instructions (e.g., "Go straight, then turn left").
    2. Real-world Trajectories: Utilizes public trajectory data (e.g., Foursquare-checkins, OpenStreetMap traces). GPS coordinates from these sources are aligned with structured geospatial data, using textual addresses to represent locations within the trajectory.
  • Vision-augmented Trajectory Data: This combines visual information with trajectories:
    1. Street View Images along Route: Extends text-based trajectory data by incorporating street view images captured along the route (excluding intersections). This data is organized in an interleaved image-text format, similar to VILA [29]. For example, a sequence of "Image 1, Turn left, Image 2, Go straight."
    2. Navigation Instructions (Vision-Language Navigation): Builds on navigation instruction formats akin to classical vision-language navigation tasks [5]. Multiple street view images are presented at intersections during a trajectory, and the model must select the correct image to guide the continuation of the journey.
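A minimal sketch of how a vision-augmented trajectory might be packed into an interleaved image-text sample (VILA-style) is shown below. The <image> placeholder convention and the field names are assumptions for illustration:

```python
def build_interleaved_route(steps):
    """steps: list of (image_path, action_text) pairs along a route,
    e.g. [("img_001.jpg", "Go straight for 200 m"), ("img_002.jpg", "Turn left")]."""
    images, text_parts = [], []
    for image_path, action in steps:
        images.append(image_path)
        text_parts.append(f"<image>\n{action}")     # image placeholder followed by the action
    prompt = ("You are walking along a route in the city. "
              "Follow the street views and instructions below, then describe the route.\n"
              + "\n".join(text_parts))
    return {"images": images, "prompt": prompt}

sample = build_interleaved_route([("img_001.jpg", "Go straight for 200 m"),
                                  ("img_002.jpg", "Turn left at the intersection")])
```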

4.2.1.3. Global View Data

This stage focuses on capturing relationships among diverse data types over longer distances, primarily using street view images and satellite images, with geospatial data as auxiliary support.

  • Basic Global View Data (Single Satellite Image):
    1. Detailed Content Description: A general MLLM is prompted to produce detailed content descriptions for individual satellite images.
    2. Spatial Coverage Summary: Location addresses within a satellite image are sampled, and a general LLM is used to summarize the spatial coverage based on these addresses.
    3. Land Use Inference with Reasoning: A general MLLM is prompted with land use ground-truth labels to generate land use inference results along with explanations (reason).
  • Multiple Satellite Images for Complex Instructions:
    1. Building Density Comparison: A task to compare building densities across multiple satellite images.
    2. Functional Point of Interest (POI) Identification: Focuses on identifying specific functional POIs within multiple images. For these tasks, manually crafted reasoning steps in a chain-of-thoughts format, supported by structured geospatial data, are provided to improve alignment between satellite images and geospatial data.
  • Street View and Satellite Image Alignment:
    1. Correct Satellite Image Selection: Given a street view image, the model must select the correct satellite image from a set, requiring understanding and matching content or address across both image types.

    2. Street View Location Pinpointing: A more challenging task involving pinpointing the location of the street view image within a specific satellite image (e.g., identifying it as being in the top-left region).

      After data generation, quality checks and filtering are performed on the synthesized data. An example of Global View Data for Landuse Inference is shown in the supplementary material (Figure 21):

Figure 21. An example of global view training instances of Landuse Inference. (The satellite view shows the distribution of facilities and buildings in an urban area, including sports fields and residential blocks.)
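To illustrate how structured geospatial data could support the chain-of-thought style answers for the multi-satellite comparison tasks, here is a hedged sketch. The building counts, field names, and reasoning template are invented for exposition and do not reproduce the paper's prompts:

```python
def building_density_comparison(tiles):
    """tiles: list of dicts like {"image": ..., "buildings": int, "area_km2": float}."""
    reasoning, densities = [], []
    for idx, tile in enumerate(tiles):
        density = tile["buildings"] / tile["area_km2"]
        densities.append(density)
        reasoning.append(f"Image {idx + 1} contains about {tile['buildings']} buildings "
                         f"over {tile['area_km2']} km^2, i.e. {density:.0f} buildings/km^2.")
    best = max(range(len(tiles)), key=lambda i: densities[i])
    reasoning.append(f"Therefore Image {best + 1} has the highest building density.")
    return {
        "images": [t["image"] for t in tiles],
        "question": "Which satellite image shows the highest building density?",
        "answer": " ".join(reasoning),          # chain-of-thought style answer grounded in counts
    }
```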

4.2.2. UTrain: A Multi-Stage Training Pipeline for Decoupling Reasoning and Knowledge Learning

The training of UrbanLLaVA faces challenges due to data heterogeneity and task diversity. The paper selects VILA1.5 [29] as the base MLLM and proposes UTrain, a three-stage tuning pipeline. The following figure (Figure 4 from the original paper) illustrates the UTrain pipeline:

Figure 4. UTrain: three-stage training pipeline. (The pipeline consists of task alignment (Stage 1), knowledge learning (Stage 2), and mixture learning (Stage 3).)

The UTrain pipeline distinguishes three types of learning procedures:

  1. Knowledge Learning: The process where UrbanLLaVA acquires foundational urban knowledge from various urban data, including information from geospatial data, pure textual trajectories, and detailed descriptions of street view and satellite images.

  2. Task Alignment: Focuses on equipping UrbanLLaVA with task-specific skills for urban applications, such as vision-language navigation, trajectory prediction, and chain-of-thoughts reasoning across multiple satellite and street view images.

  3. Mixture Learning: Represents the standard training method where all types of instruction data are mixed directly, as used by most MLLMs.

    Based on experimental observations regarding training stability and performance, the authors propose a specific three-stage tuning pipeline:

  • Stage 1: Task Alignment
    • Description: Starting with a well-trained general MLLM (e.g., VILA1.5), the model is first fine-tuned with diverse urban task-related instructions.
    • Purpose: This stage familiarizes the model with various urban tasks, leveraging its pre-existing general knowledge to understand task formats and requirements.
  • Stage 2: Knowledge Learning
    • Description: After task alignment, this stage imparts specialized urban knowledge from multi-modal urban data that is crucial for effective task resolution.
    • Purpose: Addresses the limitation that general knowledge alone is insufficient for diverse urban tasks, providing the domain-specific information necessary.
  • Stage 3: Mixture Learning
    • Description: In this final stage, the model is further tuned using a mixture of data: 1/3 domain-specific data (resampled from the first two stages) and 1/3 general textual instruction data (e.g., ShareGPT [6], UltraChat [11]).

    • Purpose: This stage enhances the model's ability to combine domain knowledge and task-specific skills for solving diverse urban tasks, ensuring robust performance across the entire spectrum of urban intelligence challenges.

      This multi-stage framework explicitly decouples the learning of reasoning capabilities (enhanced in task alignment) from domain-specific knowledge (acquired in knowledge learning), which is presented as a promising practice for MLLMs.
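The three-stage schedule can be summarized schematically as below. The fine_tune() helper and the dataset keys are placeholders (the actual pipeline reuses the official VILA training code), so this is a sketch of the stage ordering rather than a runnable reproduction of UTrain:

```python
def fine_tune(model, dataset, epochs=1):
    ...  # standard supervised instruction tuning (details omitted in this sketch)
    return model

def utrain(model, data):
    # Stage 1: task alignment -- task-style instructions (navigation, trajectory
    # prediction, multi-image chain-of-thought reasoning).
    model = fine_tune(model, data["task_alignment"])

    # Stage 2: knowledge learning -- urban knowledge from geospatial text,
    # textual trajectories, and street view / satellite image descriptions.
    model = fine_tune(model, data["knowledge"])

    # Stage 3: mixture learning -- resampled domain data mixed with general
    # textual instruction data (e.g., ShareGPT, UltraChat) to retain general skills.
    mixture = data["domain_resampled"] + data["general_text"]
    model = fine_tune(model, mixture)
    return model
```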

4.2.3. UBench: An Enhanced Multimodal Benchmark for Urban Intelligence Tasks

UBench is the evaluation benchmark designed to systematically assess MLLM capabilities in multimodal urban tasks. It expands upon existing benchmarks like CityBench [18] and Urbench [56].

The benchmark comprises 12 tasks, categorized for different data combinations:

  • Adopted/Extended Tasks (6 tasks):
    • From CityBench [18] (for structured geospatial data and trajectory modeling):
      • GeoQA: Geospatial Question Answering.
      • TrajPredict: Trajectory Prediction.
      • Navigation: Vision-language navigation.
    • From Urbench [56] (for cross-view urban tasks involving street view and satellite images):
      • Image Retrieval: Retrieve matching image across modalities.
      • Camera Localization: Localize street view position on satellite map.
      • Scene Comparison: Compare urban scenes.
  • Newly Introduced Tasks (6 tasks):
    • Single-image tasks (4 tasks): Aligned with the urban instruction data, these are designed for single street view and satellite images. The original dataset is partitioned into training and validation sets to prevent data leakage.
      • STV-Address: Address inference from a street view image.
      • STV-Landmark: Landmark recognition from a street view image.
      • SAT-Address: Address inference from a satellite image.
      • SAT-Landuse: Land use inference from a satellite image.
    • Multiple-image tasks (2 tasks):
      • STV-Outlier: A spatial consistency task for street view images. Multiple images from a single trajectory are presented, and the model must identify an outlier image not part of the trajectory.

      • SceneFunc: Extends the Scene Comparison task from Urbench. It challenges the model to select the correct satellite image that fulfills specific functional requirements (e.g., highest concentration of POIs).

        The following table (Table 1 from the original paper) details the tasks in UBench:

| Tasks | Data | Category | Metrics | Samples | Source |
|---|---|---|---|---|---|
| GeoQA | Geospatial Data | GeoQA | Avg. Accuracy | 1450 | CityBench |
| TrajPredict | Trajectory Data | Geo+Traj | Top-1 | 500 | CityBench |
| Navigation | Single STV | Geo+Traj | Success Rate | 50 | CityBench |
| SceneComp | Multi SAT | Geo+SAT | Accuracy | 200 | UrBench |
| ImgRetrieval | Multi STV & SAT | Geo+SS | Accuracy | 200 | UrBench |
| CameraLoc | Multi STV & SAT | Geo+SS | Accuracy | 200 | UrBench |
| STV-Address | Single STV | Geo+STV | Accuracy | 200 | UBench |
| STV-Landmark | Single STV | Geo+STV | Accuracy | 200 | UBench |
| SAT-Address | Single SAT | Geo+SAT | Accuracy | 200 | UBench |
| SAT-Landuse | Single SAT | Geo+SAT | Accuracy | 200 | UBench |
| STV-Outlier | Multi STV | Geo+STV | Accuracy | 200 | UBench |
| SceneFunc | Multi SAT | Geo+SAT | Accuracy | 200 | UBench |

Table 1. Detailed information about UBench for Beijing, 'STV' refers to street view image, and 'SAT' refers to satellite image.

5. Experimental Setup

5.1. Datasets

The experiments for UrbanLLaVA are conducted across three major cities: Beijing, London, and New York. Due to the large volume of data, a specific region from each city is selected for the experiments. The spatial coverage of these regions is detailed in the supplementary material (Figure 36).

Figure 36. Maps for Beijing, London and New York. (The maps mark the main streets and geographic features of the selected regions in the three cities.)

The UData dataset, specifically curated for this paper, is used for training UrbanLLaVA. UData encompasses diverse urban instruction data generated from multiple original sources as described in the methodology. The detailed statistics of UData for each city are provided in Table 10 of the supplementary material.

| City | Category | Dataset | Instances | Rounds |
|---|---|---|---|---|
| — | General | ShareGPT, UltraChat, Open-Platypus | 19866 | 3.7 |
| Beijing | Location View Data | CityQA | 19271 | 1 |
| Beijing | Location View Data | Location Address | 93246 | 1 |
| Beijing | Location View Data | Landmark Details | 51130 | 1 |
| Beijing | Location View Data | Image Description | 28798 | 1 |
| Beijing | Location View Data | Cross Modality Reasoning | 2000 | 1 |
| Beijing | Trajectory View Data | Random Walk | 9001 | 1 |
| Beijing | Trajectory View Data | Real-World Trajectory | 98 | 1 |
| Beijing | Trajectory View Data | Visual Random Walk | 8936 | 1 |
| Beijing | Trajectory View Data | Vision-Language Navigation | 3000 | 1 |
| Beijing | Global View Data | Image Content | 9315 | 1 |
| Beijing | Global View Data | Location Address | 277 | 7 |
| Beijing | Global View Data | Landuse Inference | 3642 | 1 |
| Beijing | Global View Data | Multiple SAT Comparison | 10114 | 1 |
| Beijing | Global View Data | Cross-View Data | 77204 | 1 |
| Beijing | Global View Data | Cross Modality Reasoning | 14977 | 1 |
| London | Location View Data | CityQA | 28934 | 1 |
| London | Location View Data | Location Address | 2172 | 1 |
| London | Location View Data | Landmark Details | 237 | 2 |
| London | Location View Data | Image Description | 716 | 1 |
| London | Location View Data | Cross Modality Reasoning | 1286 | 1 |
| London | Trajectory View Data | Random Walk | 16524 | 1 |
| London | Trajectory View Data | Real-World Trajectory | 98 | 1 |
| London | Trajectory View Data | Visual Random Walk | 13412 | 1 |
| London | Trajectory View Data | Vision-Language Navigation | 3000 | 1 |
| London | Global View Data | Image Content | 3853 | 1 |
| London | Global View Data | Location Address | 882 | 1 |
| London | Global View Data | Landuse Inference | 4332 | 1 |
| London | Global View Data | Multiple SAT Comparison | 4500 | 1 |
| London | Global View Data | Cross-View Data | 2172 | 1 |
| London | Global View Data | Cross Modality Reasoning | 5758 | 1 |
| New York | Location View Data | CityQA | 25413 | 1 |
| New York | Location View Data | Location Address | 94886 | 1 |
| New York | Location View Data | Landmark Details | 50404 | 1 |
| New York | Location View Data | Image Description | 24529 | 1 |
| New York | Location View Data | Cross Modality Reasoning | 2012 | 1 |
| New York | Trajectory View Data | Random Walk | 12277 | 1 |
| New York | Trajectory View Data | Real-World Trajectory | 98 | 1 |
| New York | Trajectory View Data | Visual Random Walk | 12229 | 1 |
| New York | Trajectory View Data | Vision-Language Navigation | 3000 | 1 |
| New York | Global View Data | Image Content | 18368 | 1 |
| New York | Global View Data | Location Address | 5113 | 1 |
| New York | Global View Data | Landuse Inference | 17899 | 1 |
| New York | Global View Data | Multiple SAT Comparison | 22020 | 1 |
| New York | Global View Data | Cross-View Data | 94886 | 1 |
| New York | Global View Data | Cross Modality Reasoning | 23603 | 1 |

Table 10. Basic information of UData on three cities.

Key characteristics of UData:

  • Multi-View Perspective: Data is categorized into Location View Data (e.g., CityQA, Location Address, Landmark Details, Image Description, Cross Modality Reasoning for single street views and geospatial data), Trajectory View Data (e.g., Random Walk, Real-World Trajectory, Visual Random Walk, Vision-Language Navigation), and Global View Data (e.g., Image Content, Location Address, Landuse Inference, Multiple SAT Comparison, Cross-View Data for satellite images and their integration with street views).

  • Scale: Beijing UData consists of approximately 340,000 instruction rounds, London approximately 80,000, and New York approximately 390,000.

  • General Data: The general category includes ShareGPT, UltraChat, and Open-Platypus datasets, which are standard for LLM instruction tuning and provide general language understanding capabilities.

  • Raw Data Sources: The construction of UData leverages OpenStreetMap, Foursquare-checkins, OpenStreetMap traces, GoogleEarth, GoogleMap, and BaiduMap (as detailed in the Methodology section). The raw data of the selected regions in the three cities is provided in Table 11 of the supplementary material.

    CityAoIsPoIsRoadsTrajectoryStreet View ImageSatellite Image
    Beijing46471882232021015287981533
    London137051171513221732683125556
    New York1954111112522390934244442738

Table 11. The raw data of the selected region in three cities.

These datasets were chosen because they represent the core modalities of urban data and are essential for developing comprehensive urban intelligence. They allow for testing a wide range of single-modal, cross-modal, and complex spatial reasoning tasks critical for urban applications.

5.2. Evaluation Metrics

The paper uses various metrics depending on the specific task in UBench and other general benchmarks.

5.2.1. Metrics for UBench Tasks

As detailed in Table 1 (repeated above for convenience), UBench uses the following metrics:

  • Avg. Accuracy (Average Accuracy):
    • Conceptual Definition: This metric quantifies the proportion of correct predictions across a set of diverse questions or tasks. For GeoQA, where questions might have different structures, average accuracy typically aggregates the correctness over all questions.
    • Mathematical Formula: $\text{Avg. Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\text{prediction}_i = \text{ground\_truth}_i)$
    • Symbol Explanation:
      • $N$: Total number of instances (questions/tasks) in the evaluation set.
      • $\mathbb{I}(\cdot)$: The indicator function, which returns 1 if the prediction matches the ground truth for instance $i$, and 0 otherwise.
      • $\text{prediction}_i$: The model's output for instance $i$.
      • $\text{ground\_truth}_i$: The correct answer for instance $i$.
  • Top-1 (Top-1 Accuracy):
    • Conceptual Definition: In tasks where the model predicts a ranked list of choices (e.g., for trajectory prediction, predicting the most likely next location from a set of candidates), Top-1 accuracy measures whether the single most probable prediction made by the model is correct.
    • Mathematical Formula: $ \text{Top-1 Accuracy} = \frac{\text{Number of times the top prediction is correct}}{\text{Total number of predictions}} $
    • Symbol Explanation: This is a specific form of accuracy where correctness is attributed only if the highest-ranked prediction matches the true label.
  • Success Rate:
    • Conceptual Definition: For tasks like Navigation, this metric measures the proportion of attempts where the model successfully completes the task (e.g., reaches the destination following instructions) according to predefined success criteria.
    • Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of successful task completions}}{\text{Total number of task attempts}} $
    • Symbol Explanation: The definition of "successful task completion" is task-specific (e.g., reaching a target with a certain accuracy, navigating a path without errors).
  • Accuracy:
    • Conceptual Definition: This is the most common metric, representing the proportion of correct predictions (or classifications) out of the total number of predictions made. It's used for tasks like SceneComp, ImgRetrieval, CameraLoc, STV-Address, STV-Landmark, SAT-Address, SAT-Landuse, STV-Outlier, and SceneFunc, which are typically classification or direct answer tasks.
    • Mathematical Formula: $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{\sum_{i=1}^{N} \mathbb{I}(\text{prediction}_i = \text{ground\_truth}_i)}{N}$
    • Symbol Explanation:
      • $N$: Total number of predictions.
      • $\mathbb{I}(\cdot)$: Indicator function, returning 1 if the prediction matches the ground truth, 0 otherwise.
      • $\text{prediction}_i$: The model's output for instance $i$.
      • $\text{ground\_truth}_i$: The correct answer for instance $i$.
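Minimal reference implementations of the three metric families defined above might look like this, assuming predictions and ground truths are aligned lists. The exact answer-matching rules used by UBench (e.g., option-letter parsing) are not reproduced here:

```python
def accuracy(predictions, ground_truths):
    """Fraction of exact matches; used for most UBench classification-style tasks."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

def top1_accuracy(ranked_predictions, ground_truths):
    """For TrajPredict: only the highest-ranked candidate counts."""
    correct = sum(ranked[0] == g for ranked, g in zip(ranked_predictions, ground_truths))
    return correct / len(ground_truths)

def success_rate(outcomes):
    """For Navigation: outcomes is a list of booleans from task-specific success criteria."""
    return sum(outcomes) / len(outcomes)
```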

5.2.2. Metrics for General Evaluation Tasks

For general MLLM benchmarks:

  • Rating Score from GPT4o (LLM-as-a-judge):
    • Conceptual Definition: For benchmarks like LLaVA-Bench (In-the-Wild) and MM-Vet, the performance of MLLMs is often evaluated by having another powerful LLM (in this case, GPT-4o) act as a judge. GPT-4o is given the input prompt, the model's response, and sometimes a reference answer, and then rates the quality of the model's response on a predefined scale. This approach is used for tasks where subjective quality or complex reasoning is hard to quantify with simple metrics.
    • Mathematical Formula: No single mathematical formula defines this, as it's based on an external LLM's subjective judgment. The score is typically an average of the ratings. For LLaVA-Bench, scores range from 0 to 100, and for MM-Vet, scores range from 0.0 to 1.0.
    • Symbol Explanation: Higher scores indicate better performance, as judged by GPT-4o.
  • ACC (Accuracy):
    • Conceptual Definition: Used for RealWorldQA, this is the standard accuracy metric, quantifying the proportion of correct answers to factual questions in real-world scenarios.
    • Mathematical Formula: $ \text{ACC} = \frac{\text{Number of correct answers}}{\text{Total number of questions}} $
    • Symbol Explanation: As defined for Accuracy above.
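A hedged sketch of the LLM-as-a-judge protocol is shown below, using the OpenAI Python client. The grading rubric is a simplified stand-in for the official LLaVA-Bench / MM-Vet judging templates, and the 0-10 scale is an assumption for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_response(question, reference, model_answer):
    """Ask GPT-4o to rate a model answer; rubric and scale are illustrative only."""
    prompt = (
        "You are grading an AI assistant's answer.\n"
        f"Question: {question}\nReference answer: {reference}\nAssistant answer: {model_answer}\n"
        "Rate the assistant answer from 0 to 10 and reply with only the number."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(reply.choices[0].message.content.strip())
```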

5.3. Baselines

UrbanLLaVA is compared against a comprehensive set of MLLM baselines, including both open-source and proprietary models, to demonstrate its superior performance in urban tasks.

  • Open-source MLLMs:
    • Qwen2VL [41]: Both Qwen2VL-7B and Qwen2VL-72B are included. This is a powerful vision-language model series known for its perception capabilities.
    • InternVL2 [7, 8]: Both InternVL2-8B and InternVL2-26B are used. This series is recognized for scaling vision foundation models and aligning them for generic visual-linguistic tasks.
    • VILA1.5 [29]: This is the base model from which UrbanLLaVA is fine-tuned. VILA1.5-3B, VILA1.5-8B, and VILA1.5-13B are evaluated to show the impact of model size and the improvement gained by UrbanLLaVA's specialized training.
    • LLaMA3.2 [36]: LLaMA3.2-11B and LLaMA3.2-90B are included. These models are from the LLaMA series, a popular family of LLMs, here with multi-modal extensions. It's noted that LLaMA3.2 models currently do not support multi-image input, leading to blank results for such tasks in UBench.
  • Proprietary MLLMs:
    • GPT4o [40]: A powerful commercial MLLM from OpenAI, representing the state-of-the-art in general multi-modal AI.

    • GPT4o-mini [40]: A smaller version of GPT4o, also from OpenAI.

      These baselines are representative because they cover a range of model sizes (from 3B to 90B parameters) and capabilities (both cutting-edge open-source and commercial MLLMs). Comparing against them effectively validates UrbanLLaVA's performance, especially highlighting its advantage in domain-specific urban tasks, given that general MLLMs are often not optimized for such specialized contexts.

Implementation Details:

  • The models are deployed through VLMEvalKit [13] for open-source MLLMs.
  • The maximum output tokens are set to 1000.
  • The temperature (a parameter controlling randomness in generation) is set to 0, implying deterministic and less creative outputs, suitable for factual task evaluations.
  • UrbanLLaVA is fine-tuned using codes from the official VILA [29] repository on a single 8xA100 node.
  • Training parameters: learning rate of 1e-5, maximum sequence length of 2048, batch size of 8 per GPU, and one training epoch.
  • Training UrbanLLaVA for Beijing on 4xA100 GPUs took 10.7 hours.
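The reported inference and training settings can be collected into simple configuration dictionaries for reference. The key names are generic placeholders; the actual VLMEvalKit and VILA configuration formats may differ:

```python
# Evaluation-time generation settings reported above.
generation_config = {
    "max_new_tokens": 1000,   # maximum output tokens
    "temperature": 0.0,       # deterministic decoding for factual task evaluation
    "do_sample": False,       # temperature 0 implies greedy decoding in most frameworks
}

# Fine-tuning settings reported above.
training_config = {
    "learning_rate": 1e-5,
    "max_seq_length": 2048,
    "per_device_batch_size": 8,
    "num_epochs": 1,
}
```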

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results, presented in Table 2, demonstrate that UrbanLLaVA significantly outperforms both open-source and proprietary MLLMs across diverse urban tasks in all three evaluated cities (Beijing, London, and New York). This strongly validates the effectiveness of the proposed UData and UTrain methodologies.

The following are the aggregated results from Table 2 of the original paper:

Beijing:

| Model | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS |
|---|---|---|---|---|---|
| VILA1.5-3B | 0.3873 | 0.0200 | 0.3967 | 0.3200 | 0.2575 |
| VILA1.5-8B | 0.4322 | 0.0589 | 0.4300 | 0.3488 | 0.2425 |
| VILA1.5-13B | 0.4410 | 0.1156 | 0.5167 | 0.3638 | 0.2400 |
| InternVL2-8B | 0.4709 | 0.1578 | 0.4667 | 0.3313 | 0.2325 |
| InternVL2-26B | 0.4877 | 0.1478 | 0.4550 | 0.3825 | 0.2275 |
| Qwen2VL-7B | 0.4950 | 0.1389 | 0.4383 | 0.3638 | 0.2675 |
| Qwen2VL-72B | 0.5491 | 0.1611 | 0.5817 | 0.3588 | 0.2975 |
| LLaMA3.2-11B | 0.4229 | 0.0756 | 0.4375 | 0.3075 | — |
| LLaMA3.2-90B | 0.4502 | 0.1056 | 0.5325 | 0.2925 | — |
| GPT4o-mini | 0.4542 | 0.1622 | 0.4350 | 0.3800 | 0.2475 |
| GPT4o | 0.5479 | 0.1522 | 0.4300 | 0.4125 | 0.3025 |
| UrbanLLaVA-VILA1.5-8B | 0.5682 | 0.2800 | 0.8650 | 0.6663 | 0.7025 |
| vs. VILA1.5-8B | +31.47% | +375.38% | +101.16% | +91.03% | +189.69% |
| vs. Best Baseline | +3.48% | +72.63% | +48.70% | +61.53% | +132.23% |

London:

| Model | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS |
|---|---|---|---|---|---|
| VILA1.5-3B | 0.4362 | 0.0400 | 0.2557 | 0.2850 | 0.2725 |
| VILA1.5-8B | 0.4841 | 0.0884 | 0.4495 | 0.4575 | 0.2575 |
| VILA1.5-13B | 0.4592 | 0.1298 | 0.4991 | 0.4538 | 0.2625 |
| InternVL2-8B | 0.4973 | 0.1347 | 0.4477 | 0.4763 | 0.2400 |
| InternVL2-26B | 0.5168 | 0.1288 | 0.4923 | 0.5138 | 0.2425 |
| Qwen2VL-7B | 0.4991 | 0.1560 | 0.4381 | 0.4863 | 0.2775 |
| Qwen2VL-72B | 0.5802 | 0.2322 | 0.6375 | 0.4375 | 0.3250 |
| LLaMA3.2-11B | 0.4804 | 0.1180 | 0.4000 | 0.3800 | — |
| LLaMA3.2-90B | 0.5659 | 0.2010 | 0.5450 | 0.4700 | — |
| GPT4o-mini | 0.5357 | 0.1278 | 0.4752 | 0.5388 | 0.2675 |
| GPT4o | 0.6446 | 0.1300 | 0.5469 | 0.6050 | 0.2850 |
| UrbanLLaVA-VILA1.5-8B | 0.6399 | 0.2680 | 0.7500 | 0.7100 | 0.4325 |
| vs. VILA1.5-8B | +32.18% | +203.17% | +66.85% | +55.19% | +67.96% |
| vs. Best Baseline | -0.73% | +15.42% | +17.65% | +17.36% | +33.08% |

New York:

| Model | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS |
|---|---|---|---|---|---|
| VILA1.5-3B | 0.3954 | 0.0400 | 0.4400 | 0.2713 | 0.2425 |
| VILA1.5-8B | 0.4575 | 0.1200 | 0.4983 | 0.3763 | 0.2525 |
| VILA1.5-13B | 0.4501 | 0.2350 | 0.5583 | 0.4025 | 0.2825 |
| InternVL2-8B | 0.4632 | 0.1830 | 0.4917 | 0.4175 | 0.2400 |
| InternVL2-26B | 0.4766 | 0.2240 | 0.5217 | 0.4738 | 0.2375 |
| Qwen2VL-7B | 0.4567 | 0.1700 | 0.5117 | 0.5100 | 0.2950 |
| Qwen2VL-72B | 0.5273 | 0.2540 | 0.6333 | 0.3788 | 0.3275 |
| LLaMA3.2-11B | 0.4127 | 0.1100 | 0.5200 | 0.2225 | — |
| LLaMA3.2-90B | 0.5234 | 0.1570 | 0.6825 | 0.3400 | — |
| GPT4o-mini | 0.5075 | 0.2320 | 0.5633 | 0.4775 | 0.2350 |
| GPT4o | 0.6232 | 0.2340 | 0.5767 | 0.5400 | 0.2900 |
| UrbanLLaVA-VILA1.5-8B | 0.5773 | 0.3060 | 0.8500 | 0.7725 | 0.5825 |
| vs. VILA1.5-8B | +26.19% | +155.00% | +70.57% | +105.32% | +130.69% |
| vs. Best Baseline | -7.37% | +20.47% | +24.54% | +43.06% | +77.86% |

Table 2. Main results on UBench. 'STV' refers to street view images, 'SAT' refers to satellite images, and 'SS' denotes tasks combining street view and satellite images. Detailed subtasks and metrics are listed in Table 1.

Analysis for Beijing:

  • UrbanLLaVA (using VILA1.5-8B as base) consistently outperforms all baselines across all tasks in UBench.
  • Compared to its base model VILA1.5-8B, UrbanLLaVA shows remarkable improvements:
    • GeoQA: +31.47%
    • Geo+Traj: +375.38% (a massive gain, highlighting its strength in trajectory-related tasks)
    • Geo+STV: +101.16%
    • Geo+SAT: +91.03%
    • Geo+SS: +189.69%. These improvements clearly indicate the effectiveness of UData and UTrain in equipping a smaller MLLM with advanced urban capabilities.
  • Even against the best baselines (including GPT4o and Qwen2VL-72B), UrbanLLaVA achieves significant gains, ranging from +3.48% (GeoQA) to +132.23% (Geo+SS). This demonstrates its competitive edge even against powerful commercial and large open-source MLLMs in urban-specific contexts.
  • The LLaMA3.2 series models have blank entries for tasks involving multiple images due to input limitations.

Analysis for London and New York:

  • The results generally mirror those in Beijing, demonstrating UrbanLLaVA's robust generalization abilities.
  • For London, UrbanLLaVA performs best in 4 out of 5 task groups. It is slightly inferior to GPT4o in the GeoQA task (-0.73%).
  • For New York, UrbanLLaVA also performs best in 4 out of 5 task groups, with a larger reduction of -7.37% compared to GPT4o in GeoQA.
  • The paper speculates two reasons for UrbanLLaVA's slight underperformance on GeoQA in London and New York:
    1. Lower quality of relevant data in these cities compared to Beijing, hindering knowledge acquisition.

    2. The base model VILA1.5-8B may have inherently weaker capabilities than GPT4o for general factual QA, even with UrbanLLaVA's domain fine-tuning (e.g., UrbanLLaVA@London outperforms VILA1.5-8B by 32.18% in GeoQA but still trails GPT4o).

      Overall, the results underscore that UrbanLLaVA successfully enhances the performance of smaller MLLMs on diverse urban tasks, proving its efficacy in integrating multi-modal urban data and solving complex urban challenges.

6.2. Data Presentation (Tables)

6.2.1. Main Results (Aggregated)

The aggregated results from Table 2 (main results on UBench across Beijing, London, and New York) are reproduced in full in Section 6.1 above.

6.2.2. Detailed Results per City

The detailed results for each city, which are aggregated into Table 2, are provided in the supplementary material as Table 6 (Beijing), Table 7 (London), and Table 8 (New York).

The following are the results from Table 6 of the original paper (Beijing):

Task groups: GeoQA; Geo+Traj = {TrajPredict, Navigation}; Geo+STV = {STV-Address, STV-Landmark, STV-Outlier}; Geo+SAT = {SAT-Address, SAT-Landuse, SceneComp, SceneFunc}; Geo+SS = {ImgRetrieval, CameraLoc}.

| Model | GeoQA | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2VL-7B | 0.4950 | 0.0978 | 0.18 | 0.440 | 0.755 | 0.1200 | 0.295 | 0.405 | 0.400 | 0.355 | 0.275 | 0.260 |
| Qwen2VL-72B | 0.5491 | 0.0822 | 0.24 | 0.410 | 0.785 | 0.5500 | 0.395 | 0.395 | 0.335 | 0.310 | 0.290 | 0.305 |
| InternVL2-8B | 0.4709 | 0.0957 | 0.22 | 0.420 | 0.755 | 0.2250 | 0.295 | 0.300 | 0.390 | 0.340 | 0.210 | 0.255 |
| InternVL2-26B | 0.4877 | 0.0756 | 0.22 | 0.440 | 0.755 | 0.1700 | 0.360 | 0.375 | 0.440 | 0.355 | 0.230 | 0.225 |
| VILA1.5-3B | 0.3873 | 0.0000 | 0.04 | 0.270 | 0.655 | 0.2650 | 0.275 | 0.475 | 0.295 | 0.235 | 0.250 | 0.265 |
| VILA1.5-8B | 0.4322 | 0.0578 | 0.06 | 0.270 | 0.650 | 0.3700 | 0.225 | 0.405 | 0.420 | 0.345 | 0.195 | 0.290 |
| VILA1.5-13B | 0.4410 | 0.0511 | 0.18 | 0.305 | 0.715 | 0.5300 | 0.320 | 0.320 | 0.425 | 0.390 | 0.270 | 0.210 |
| LLaMA3.2-11B | 0.4229 | 0.0711 | 0.08 | 0.280 | 0.595 | — | 0.290 | 0.325 | — | — | — | — |
| LLaMA3.2-90B | 0.4502 | 0.0711 | 0.14 | 0.295 | 0.770 | — | 0.295 | 0.290 | — | — | — | — |
| GPT4o-mini | 0.4542 | 0.0844 | 0.24 | 0.280 | 0.765 | 0.2600 | 0.350 | 0.360 | 0.465 | 0.345 | 0.205 | 0.290 |
| GPT4o | 0.5479 | 0.0844 | 0.22 | 0.405 | 0.775 | 0.1100 | 0.390 | 0.420 | 0.450 | 0.390 | 0.315 | 0.290 |
| UrbanLLaVA-VILA1.5-8B | 0.5682 | 0.1000 | 0.46 | 0.91 | 0.870 | 0.8150 | 0.780 | 0.72 | 0.585 | 0.58 | 0.785 | 0.62 |
| vs. VILA1.5-8B | +31.47% | +73.10% | +666.67% | +237.04% | +33.85% | +120.27% | +246.67% | +77.78% | +39.29% | +68.12% | +302.56% | +113.79% |
| vs. Best Baseline | +3.48% | +2.28% | +91.67% | +106.82% | +10.83% | +48.18% | +97.47% | +51.58% | +25.81% | +48.72% | +149.21% | +103.28% |

Table 6. Main results on UBench at Beijing. UrbanLLaVA significantly outperforms the other baselines in every task.
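Comparing Table 6 with Table 2, the task-group scores in Table 2 appear to be simple means of the corresponding subtask scores (this is an observation from the reported numbers, not an explicit statement in the paper). For example, UrbanLLaVA's Beijing Geo+STV group score can be recovered from its three street view subtasks:

```python
# Observation only: Table 2 group scores look like plain averages of Table 6 subtask scores.
stv_subtasks = [0.91, 0.870, 0.815]          # STV-Address, STV-Landmark, STV-Outlier (Table 6)
geo_stv = sum(stv_subtasks) / len(stv_subtasks)
print(round(geo_stv, 4))                      # 0.865, matching the Geo+STV entry in Table 2
```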

The following are the results from Table 7 of the original paper (London):

Task groups as in Table 6.

| Model | GeoQA | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2VL-7B | 0.4991 | 0.1920 | 0.12 | 0.405 | 0.760 | 0.1492 | 0.305 | 0.550 | 0.870 | 0.220 | 0.270 | 0.285 |
| Qwen2VL-72B | 0.5802 | 0.2245 | 0.24 | 0.485 | 0.875 | 0.5525 | 0.530 | 0.535 | 0.420 | 0.265 | 0.405 | 0.245 |
| InternVL2-8B | 0.4973 | 0.1694 | 0.10 | 0.290 | 0.810 | 0.2431 | 0.315 | 0.490 | 0.785 | 0.315 | 0.215 | 0.265 |
| InternVL2-26B | 0.5168 | 0.1776 | 0.08 | 0.380 | 0.865 | 0.2320 | 0.355 | 0.490 | 0.905 | 0.305 | 0.215 | 0.270 |
| VILA1.5-3B | 0.4362 | 0.0000 | 0.08 | 0.230 | 0.305 | 0.2320 | 0.200 | 0.445 | 0.295 | 0.200 | 0.290 | 0.255 |
| VILA1.5-8B | 0.4841 | 0.1367 | 0.04 | 0.330 | 0.560 | 0.4586 | 0.305 | 0.485 | 0.705 | 0.335 | 0.250 | 0.265 |
| VILA1.5-13B | 0.4592 | 0.1796 | 0.08 | 0.430 | 0.570 | 0.4972 | 0.275 | 0.350 | 0.800 | 0.390 | 0.275 | 0.250 |
| LLaMA3.2-11B | 0.4804 | 0.1959 | 0.04 | 0.360 | 0.440 | — | 0.260 | 0.500 | — | — | — | — |
| LLaMA3.2-90B | 0.5659 | 0.2020 | 0.20 | 0.375 | 0.715 | — | 0.385 | 0.555 | — | — | — | — |
| GPT4o-mini | 0.5357 | 0.1755 | 0.08 | 0.375 | 0.835 | 0.2155 | 0.390 | 0.570 | 0.855 | 0.340 | 0.290 | 0.245 |
| GPT4o | 0.6446 | 0.2000 | 0.06 | 0.580 | 0.895 | 0.1657 | 0.480 | 0.610 | 0.900 | 0.430 | 0.320 | 0.250 |
| UrbanLLaVA-VILA1.5-8B | 0.6399 | 0.1959 | 0.34 | 0.610 | 0.955 | 0.6851 | 0.575 | 0.750 | 0.955 | 0.560 | 0.605 | 0.260 |
| vs. VILA1.5-8B | +32.20% | +43.28% | +750.00% | +84.85% | +70.54% | +49.40% | +88.52% | +54.64% | +35.46% | +67.16% | +142.00% | -1.89% |
| vs. Best Baseline | -0.72% | -12.73% | +41.67% | +5.17% | +6.70% | +24.00% | +8.49% | +22.95% | +5.52% | +30.23% | +49.38% | -8.77% |

Table 7. Main results on UBench at London. UrbanLLaVA achieves better performance than other models in most tasks.

The following are the results from Table 8 of the original paper (New York):

Column groups are as in Table 6.

| Tasks@NewYork | GeoQA | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2VL-7B | 0.4567 | 0.1200 | 0.22 | 0.585 | 0.805 | 0.1450 | 0.455 | 0.395 | 0.875 | 0.315 | 0.275 | 0.315 |
| Qwen2VL-72B | 0.5273 | 0.1480 | 0.36 | 0.550 | 0.795 | 0.5550 | 0.520 | 0.235 | 0.470 | 0.290 | 0.335 | 0.320 |
| InternVL2-8B | 0.4632 | 0.1260 | 0.24 | 0.440 | 0.780 | 0.2550 | 0.395 | 0.135 | 0.835 | 0.305 | 0.245 | 0.235 |
| InternVL2-26B | 0.4766 | 0.1080 | 0.34 | 0.490 | 0.805 | 0.2700 | 0.495 | 0.225 | 0.885 | 0.290 | 0.230 | 0.245 |
| VILA1.5-3B | 0.3954 | 0.0000 | 0.08 | 0.330 | 0.745 | 0.2450 | 0.310 | 0.250 | 0.280 | 0.245 | 0.255 | 0.230 |
| VILA1.5-8B | 0.4575 | 0.1000 | 0.14 | 0.345 | 0.680 | 0.4700 | 0.235 | 0.160 | 0.795 | 0.315 | 0.260 | 0.245 |
| VILA1.5-13B | 0.4501 | 0.1100 | 0.36 | 0.375 | 0.765 | 0.5350 | 0.325 | 0.175 | 0.820 | 0.290 | 0.285 | 0.280 |
| LLaMA3.2-11B | 0.4127 | 0.1000 | 0.12 | 0.395 | 0.645 | - | 0.295 | 0.150 | - | - | - | - |
| LLaMA3.2-90B | 0.5234 | 0.1140 | 0.20 | 0.575 | 0.790 | - | 0.460 | 0.220 | - | - | - | - |
| GPT4o-mini | 0.5075 | 0.1240 | 0.34 | 0.550 | 0.880 | 0.2600 | 0.415 | 0.265 | 0.880 | 0.350 | 0.255 | 0.215 |
| GPT4o | 0.6232 | 0.1080 | 0.36 | 0.740 | 0.830 | 0.1600 | 0.610 | 0.215 | 0.930 | 0.405 | 0.305 | 0.275 |
| CityGPT-V-VILA1.5-8B | 0.5773 | 0.1120 | 0.50 | 0.920 | 0.935 | 0.6950 | 0.885 | 0.880 | 0.835 | 0.490 | 0.645 | 0.520 |
| vs. VILA1.5-8B | +26.19% | +12.00% | +257.14% | +166.67% | +37.50% | +47.87% | +276.60% | +450.00% | +5.03% | +55.56% | +148.08% | +112.24% |
| vs. Best Baseline | -7.36% | -24.32% | +38.89% | +24.32% | +6.25% | +25.23% | +45.08% | +122.78% | -10.22% | +20.99% | +92.54% | +62.50% |

Table 8. Main results on UBench at New York. UrbanLLaVA achieves better performance than other models in most tasks.

6.3. Ablation Studies / Parameter Analysis

The paper conducts several ablation studies and parameter analyses to understand the influences of different training strategies and data compositions.

6.3.1. Effects of Training Strategies (UTrain)

The paper investigates various training strategies to ensure stable and well-performing training. The UTrain multi-stage pipeline is key.

The following figure (Figure 5 from the original paper) illustrates the performance of three-stage tuning:

[Bar chart: accuracy (%) across tasks such as TrajPredict, Navigation, and SAT-Address, comparing one-stage and two-stage model performance.]

Figure 5. The performance of three-stage tuning; the gray part marks the default tuning method for MLLMs.

The following figure (Figure 6 from the original paper) illustrates the effects of the order between knowledge learning and task alignment in two-stage and three-stage tuning:

[Bar chart: accuracy on urban tasks for three training schedules, K→TA→Mix, TA→K→Mix, and one-stage K+TA; the multi-stage schedules perform better overall.]

Figure 6. (b) The effects of the order between knowledge learning and task alignment in two-stage tuning. (c) The effects of the order between knowledge learning and task alignment in three-stage tuning.

  • Multi-stage vs. Single-stage Training:
    • Three-stage: TA→K→Mix (Task Alignment → Knowledge Learning → Mixture Learning) performs best across most tasks and maintains reliable performance. This significantly surpasses one-stage: K + TA (knowledge learning and task alignment merged directly), which represents the default tuning method for MLLMs. This indicates the benefit of explicitly decoupling these learning processes.
  • Order of Task Alignment (TA) and Knowledge Learning (K):
    • In two-stage training, K→TA (Knowledge first, then Task Alignment) slightly outperforms TA→K. This suggests that acquiring foundational knowledge before aligning with specific tasks can be beneficial.

    • However, when the mixture learning stage is added (creating three-stage training), TA→K→Mix (Task Alignment first, then Knowledge, then Mixture) achieves better results than K→TA→Mix. The hypothesis is that starting with TA familiarizes the model with task formats, which Mixture Learning can then leverage effectively, even if the initial knowledge is less developed. If K comes first, the model already possesses considerable capabilities, so the impact of Mixture Learning is less significant.

      These findings confirm that the proposed UTrain three-stage pipeline effectively integrates cross-modal data to achieve stable training and balanced performance across various urban tasks; a minimal sketch of this staged schedule follows.
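
The sketch below shows the TA→K→Mix schedule as three successive supervised fine-tuning passes over different data mixtures. It is a minimal, framework-agnostic illustration; `finetune`, `task_alignment_data`, and `knowledge_data` are hypothetical placeholders, not the authors' training code.

```python
# Illustrative sketch of the three-stage UTrain schedule (TA -> K -> Mix).
# All names are placeholders standing in for the actual SFT machinery.

from typing import Callable, Sequence

def run_utrain(model,
               task_alignment_data: Sequence,   # task-format instructions (TA)
               knowledge_data: Sequence,        # urban domain knowledge (K)
               finetune: Callable):             # one SFT pass: finetune(model, data)
    # Stage 1: task alignment -- the model first learns the task formats.
    model = finetune(model, task_alignment_data)
    # Stage 2: knowledge learning -- urban knowledge is injected on top.
    model = finetune(model, knowledge_data)
    # Stage 3: mixture learning -- a joint pass over both data types
    # consolidates knowledge and task behaviour.
    mixture = list(task_alignment_data) + list(knowledge_data)
    model = finetune(model, mixture)
    return model
```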

6.3.2. Generalization Study

The paper also evaluates UrbanLLaVA's generalization capabilities on general MLLM benchmarks and across different cities.

The following table (Table 3 from the original paper) shows UrbanLLaVA's performance on general benchmarks:

Test@GeneralLLaVA-Bench (In-the-Wild)RealWorldQAMM-Vet
MetricRating ScoreACCRating Score
VILA1.5-8B60.750.37650.3518
Ours-8B58.950.40520.3239

Table 3. General benchmark results. Rating Score refers to the result from the LLM-as-a-judge method with GPT4o. For LLaVA-Bench, scores range from 0 to 100; for MM-Vet, scores range from 0.0 to 1.0. Higher scores indicate better performance.

  • UrbanLLaVA (Ours-8B) remains competitive on general benchmarks such as LLaVA-Bench, RealWorldQA, and MM-Vet. While its Rating Score on LLaVA-Bench and MM-Vet is slightly lower than that of the base VILA1.5-8B, it improves RealWorldQA accuracy. This suggests that the specialized urban training does not significantly degrade general visual and language understanding, which is crucial for general urban intelligence.

    The following figure (Figure 7 from the original paper) illustrates the generalization across cities:

    [Bar chart: per-task scores (e.g., GeoQA, TrajPredict) for Beijing, London, and New York; darker bars are our model, lighter bars the baseline.]

Figure 7. Our UrbanLLaVA trained with Beijing data and tested on London and New York. Darker bars show our model and lighter bars the baseline across tasks such as GeoQA and TrajPredict.

  • UrbanLLaVA trained on Beijing data shows competitive capabilities when tested on the London and New York benchmarks. Performance improvements over the base model are observed across all tasks in London and New York, even on out-of-domain data. This indicates that UrbanLLaVA generalizes to different data distributions and tasks, suggesting similarity structures across cities that go beyond superficial differences (a minimal sketch of this evaluation setup is given below).
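
The following sketch outlines the zero-shot cross-city check: a model fine-tuned only on Beijing UData is evaluated, without further training, on the London and New York UBench splits. `load_model` and `evaluate_ubench` are hypothetical helpers, not the released evaluation scripts.

```python
# Sketch of the cross-city generalization check (helper names are illustrative).

def cross_city_eval(load_model, evaluate_ubench):
    model = load_model("urbanllava-beijing")       # trained on Beijing data only
    scores = {}
    for city in ("Beijing", "London", "NewYork"):  # London / New York are out-of-domain
        scores[city] = evaluate_ubench(model, city=city)
    return scores
```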

6.3.3. Data Ablation Study

This study investigates the influence of different data compositions within UData. The results are shown in Table 4; all ablated models are trained with the one-stage strategy for efficiency.

The following table (Table 4 from the original paper) shows the data ablation study results:

| Task | Data View | GeoQA | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Metric | - | Avg. Acc | Top-1 | Success Rate | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy |
| Ours | - | 0.5741 | 0.0711 | 0.8550 | 0.8750 | 0.7450 | 0.7850 | 0.3600 | 0.7800 | 0.5500 | 0.5050 | 0.7300 | 0.5100 |
| w/o CityQA | Local | 0.5409 | 0.0822 ↑ | 0.8700 | 0.8900 | 0.7150 | 0.6950 ↓ | 0.4000 | 0.8050 | 0.5400 | 0.5200 | 0.7750 | 0.5200 |
| w/o STV | Local | 0.5192 ↓ | 0.0622 | 0.4300 ↓ | 0.7300 ↓ | 0.4700 ↓ | 0.7200 ↓ | 0.4200 ↑ | 0.6700 ↓ | 0.4900 ↓ | 0.4550 ↓ | 0.6250 ↓ | 0.4250 ↓ |
| w/o Traj-Text&Nav | Trajectory | 0.4769 ↓ | 0.0644 | 0.8100 | 0.8800 | 0.6350 ↓ | 0.7050 ↓ | 0.0000 ↓ | 0.7600 | 0.4950 ↓ | 0.4300 ↓ | 0.6800 ↓ | 0.4600 ↓ |
| w/o Traj-Vision | Trajectory | 0.5590 | 0.0690 | 0.8350 | 0.9050 | 0.7300 | 0.7100 ↓ | 0.3000 ↓ | 0.8000 | 0.5150 | 0.4650 | 0.7150 | 0.4950 |
| w/o SAT-Single | Global | 0.5345 | 0.0778 | 0.8600 | 0.9100 | 0.5550 ↓ | 0.4550 ↓ | 0.3800 | 0.7800 | 0.5150 | 0.4100 ↓ | 0.7200 | 0.4800 |
| w/o SAT-Multi | Global | 0.5420 | 0.0778 | 0.8500 | 0.8700 | 0.6200 ↓ | 0.6800 ↓ | 0.3400 | 0.6450 ↓ | 0.3500 ↓ | 0.3400 ↓ | 0.3950 ↓ | 0.2600 ↓ |

Table 4. Data ablation study. Down arrows (↓) mark significant performance drops and up arrows (↑) mark significant increases relative to the full-data model; the threshold is 1% for the TrajPredict task and 5% for the other tasks. All models are trained using the one-stage strategy to optimize experimental efficiency.

  • Local View Data (w/o CityQA, w/o STV):
    • Removing CityQA (textual urban geography) generally leads to slight drops in accuracy across various tasks, showing its importance for general geospatial knowledge.
    • Removing STV (street view data) results in significant performance deterioration across both single-modal (e.g., STV-Address, STV-Landmark) and multi-modal tasks, highlighting the critical role of locality knowledge for overall urban understanding.
  • Trajectory View Data (w/o Traj-Text&Nav, w/o Traj-Vision):
    • Removing Traj-Text&Nav (text-based trajectory and navigation instructions) causes a substantial drop, especially in Navigation (0% accuracy) and other related tasks, confirming its necessity for understanding continuous urban spaces and navigation.
    • Removing Traj-Vision (visual-trajectory data) also leads to notable drops, particularly for tasks involving visual guidance in trajectories, although not as severe as Traj-Text&Nav for Navigation.
  • Global View Data (w/o SAT-Single, w/o SAT-Multi):
    • Removing SAT-Single (single satellite image data for urban knowledge) impacts tasks like STV-Outlier and SceneFunc, demonstrating its role in understanding specific urban areas from an overhead perspective.

    • Removing SAT-Multi (multiple satellite images for correlations and cross-alignment with street views) leads to significant drops in ImgRetrieval, CameraLoc, and SceneFunc, proving its essential role in empowering the MLLM to handle urban tasks from a global, interconnected view.

      In summary, the ablation studies reveal that all components of UData (the local, trajectory, and global views and their sub-components) are crucial for UrbanLLaVA's comprehensive performance, particularly for tasks that directly involve those data types; a schematic of the ablation procedure is sketched below.
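
The sketch below illustrates the bookkeeping behind such an ablation row, assuming the thresholds in the Table 4 caption are absolute score differences (0.01 for TrajPredict, 0.05 otherwise). The data-source names mirror Table 4, while the helper functions and placeholder lists are hypothetical.

```python
# Sketch of the data-ablation bookkeeping behind Table 4 (helper names are illustrative).
# Each ablated model is retrained on UData with one component removed, then each task
# score is compared against the full-data model and marked with an arrow when the
# change exceeds the threshold (read here as an absolute score difference).

UDATA = {
    "CityQA": [],          # placeholder lists standing in for instruction samples
    "STV": [],
    "Traj-Text&Nav": [],
    "Traj-Vision": [],
    "SAT-Single": [],
    "SAT-Multi": [],
}

def ablation_mixture(removed: str) -> list:
    """Training mixture with one UData component left out."""
    return [x for name, part in UDATA.items() if name != removed for x in part]

def mark(task: str, full_score: float, ablated_score: float) -> str:
    """Format an ablated score with an arrow if it moved beyond the task's threshold."""
    threshold = 0.01 if task == "TrajPredict" else 0.05
    delta = ablated_score - full_score
    if delta <= -threshold:
        return f"{ablated_score:.4f} ↓"   # significant drop
    if delta >= threshold:
        return f"{ablated_score:.4f} ↑"   # significant increase
    return f"{ablated_score:.4f}"

# Example from Table 4: removing STV drops GeoQA from 0.5741 to 0.5192.
print(mark("GeoQA", 0.5741, 0.5192))  # "0.5192 ↓"
```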

6.3.4. Effects of Other Training Parameters (Supplementary Material)

The supplementary material provides additional analysis on other training parameters:

  • Learning Rate (Figure 10a): A learning rate of 1e-5 results in a smoother and lower training loss curve compared to 1e-4 (the default for VILA). This indicates that a lower learning rate is more robust when training on mixed, domain-specific structured instruction data, helping the model handle features from different modalities more stably (an illustrative configuration is given after this list).
  • Modality Separation (Figure 10b): Training with text and vision data together in one stage yields better results than separating them (one stage: text or two stage: text then vision). This suggests the benefit of early multi-modal integration during training.
  • Trained Components (Figure 10c): Experiments with different training components (T-LLM-Proj for text, V-LLM-Proj for vision) show little difference. This implies that the specific choice of which parts of the MLLM (e.g., LLM vs. projector) are trained for each modality might not be as impactful as the overall data strategy or learning rate.
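
For illustration only, a generic Hugging Face-style configuration reflecting the reported learning-rate finding might look like the following; the batch size, precision, and output path are assumptions, and this is not the authors' actual VILA training setup.

```python
# Generic fine-tuning arguments illustrating the hyperparameter finding above:
# a learning rate of 1e-5 is reported to train more smoothly on the mixed urban
# instruction data than VILA's default of 1e-4. All other values are assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="urbanllava-sft",        # hypothetical output path
    learning_rate=1e-5,                 # lower LR reported to be more stable than 1e-4
    num_train_epochs=1,                 # assumed: one pass over the stage's data
    per_device_train_batch_size=4,      # assumed value
    gradient_accumulation_steps=8,      # assumed value
    bf16=True,                          # assumed mixed-precision setting
    logging_steps=10,
)
```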

6.3.5. Effects of Training Data Size (Supplementary Material)

The following figure (Figure 11 from the original paper) presents training results with different amounts, exhibiting the high quality of UData:

[Line chart: performance score versus training-data fraction for GeoQA, Geo+Traj, Geo+STV, Geo+SAT, Geo+SS, and MMScore; scores generally rise as more training data is used, with GeoQA highest.]

Figure 11. Scaling law from training data size to performance.

  • The figure demonstrates that performance generally improves as the amount of training data increases across various urban tasks (e.g., GeoQA, Geo+Traj, Geo+STV, Geo+SAT, Geo+SS). This scaling law suggests that UData is of high quality and that UrbanLLaVA benefits from more data, indicating potential for further improvement with even larger datasets.

6.3.6. Effects of Base Model (Supplementary Material)

The following table (Table 9 from the original paper) shows the generalizability of methods on Qwen2.5-VL:

| Task Group @ Beijing | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | 0.4324 | 0.2192 | 0.4467 | 0.2850 | 0.2225 |
| + Finetuned with UData | 0.5720 ↑ | 0.1876 | 0.6833 ↑ | 0.4800 ↑ | 0.3800 ↑ |

Table 9. Evaluating the generalizability of the methods on Qwen2.5-VL.

  • The results show that UrbanLLaVA's methodology (fine-tuning with UData) is model-agnostic and can be generalized to different MLLMs. When applied to Qwen2.5-VL-7B-Instruct, fine-tuning with UData leads to significant performance improvements across most task groups (GeoQA, Geo+STV, Geo+SAT, Geo+SS). While Geo+Traj shows a slight decrease, the overall trend confirms the generalizability of the UData approach.

6.3.7. Effects of Model Size (Supplementary Material)

The following figure (Figure 12 from the original paper) shows the effects of model size:

[Chart: scores on tasks such as GeoQA, STV, and SAT for UrbanLLaVA built on 3B, 8B, and 13B backbones, showing how model size affects task performance.]

Figure 12. Results on UrbanLLaVA with different model sizes.

  • Performance generally improves with increasing parameter size for VILA1.5 models (from 3B to 13B), which is a common trend in LLMs and MLLMs.
  • However, for certain tasks, models of different sizes exhibit similar capabilities. This occurs either because the tasks are inherently challenging (e.g., TrajPredict, where even larger models struggle significantly) or relatively easy (e.g., SAT-Landuse, where even smaller models perform well).
  • The minimal performance improvement from VILA1.5-8B to VILA1.5-13B is attributed to the capabilities of the underlying LLM backbones (LLaMA3-8B and LLaMA2-13B). The authors suggest that a larger base model like VILA1.5-40B (if computational resources were available) could potentially yield much better performance.

6.4. Case Study

The paper includes case studies to qualitatively demonstrate UrbanLLaVA's capabilities in handling challenging urban tasks compared to general MLLMs.

6.4.1. SceneFunc Task

  • Task: Identify which satellite image contains the highest concentration of a specified category of Points of Interest (POIs). This requires multiple image understanding and comparison.

    The following figure (Figure 7 from the original paper) shows an example of the SceneFunc task:

    [Figure: four satellite images with their POI analysis; the prompt asks which image shows the most dining-related POIs, and the third image, covering a major commercial area, is identified as the answer.]

Figure 7. An example of the SceneFunc task, where correct answers are in green, wrong ones in red.

  • Observation: While the base model VILA1.5-8B fails to answer the question, UrbanLLaVA successfully provides the correct answer. This highlights UrbanLLaVA's strong capabilities in multi-image understanding and comparison within an urban context, making it competitive with successful closed-source models.

6.4.2. STV-Outlier Task

  • Task: Compare multiple street view images from a trajectory and identify the outlier image that does not belong. This demands implicit logical reasoning.

    The following figure (Figure 8 from the original paper) shows an example of the STV-Outlier task:

    [Figure: a reference street-view image of an urban road with a bicycle lane and four candidate street-view images to compare against it.]

Figure 8. An example of the STV-Outlier task.

  • Observation: VILA1.5-8B fails to identify the scene of the reference image. GPT-4o-mini comes closer but is still confused by a wrong option. UrbanLLaVA successfully performs this task, showcasing its ability to understand multiple images and conduct high-level implicit logical reasoning in an urban context, outperforming general MLLMs.

6.4.3. Additional Case Studies (Supplementary Material)

The supplementary material provides more case studies:

  • SAT-LandUse (Figure 13): The model needs to speculate the land use type (e.g., commercial, residential, agricultural) based on a satellite image. UrbanLLaVA responds precisely, demonstrating accurate image perception, instruction following, and urban knowledge mastering.

    [Figure: multiple-choice question asking the most likely land-use type for a satellite image; the model answers B (residential), marked in green.]

Figure 13. An example of the SAT-LandUse task. The correct answer from the model is marked in green, and our model's response is in bold. The explanation was written by a human for this question and answer.

  • STV-Landmark (Figure 14): The task is to find the closest landmark feature to a given street view image. UrbanLLaVA correctly identifies the landmark, showcasing its ability to conduct logical reasoning in a multi-modal context.

    [Figure: street-view image of an urban road with a passing white car, a bus stop, and city buildings in the background.]

Figure 14. An example of the STV-Landmark task. The correct answer from the model is marked in green, and our model's response is in bold. The explanation was written by a human for this question and answer.

  • SAT-Address (Figure 15): The model speculates the most probable address description based on a satellite image.

    [Figure: a satellite image with candidate address descriptions; option B, which matches the surrounding residential area, is the best choice.]

Figure 15. Example of a SAT-Address task.

  • STV-Address (Figure 16): The model speculates the most probable address where a street view image was taken.

    [Figure: street-view image of an open urban road with buildings and a passing white car.]

Figure 16. Example of a STV-Address task.

  • SceneComp (Figure 17): Given four satellite remote sensing images, the model chooses the one with the most buildings.

    [Figure: four satellite images of different urban areas; the question asks which image contains the most buildings, with reference answer A.]

Figure 17. An example of a SceneComp task.

  • ImgRetrieval (Figure 18): Evaluates the capability to map a given street view image to the corresponding satellite image.

    [Figure: street-view and satellite imagery showing the urban environment from different viewpoints.]

Figure 18. An example of an ImgRetrieval task.

  • CameraLoc (Figure 19): Requires the model to infer which quadrant of a satellite image corresponds to the location where a given street view image was captured.

    [Figure: an aerial view of an urban area on the left and a street-level camera view on the right, illustrating the CameraLoc task.]

Figure 19. An example of a CameraLoc task.

These case studies collectively illustrate UrbanLLaVA's enhanced spatial cognition, multi-modal understanding, and reasoning abilities in complex urban scenarios, capabilities that general MLLMs often lack.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces UrbanLLaVA, a novel multi-modal large language model (MLLM) specifically designed to enhance urban spatial cognition. UrbanLLaVA addresses a critical gap in urban research by providing a unified framework capable of integrating and processing four major types of urban data simultaneously: urban visual data (street view and satellite images), geo-text, structured geospatial data, and spatiotemporal series data.

The core contributions include:

  1. UData: A meticulously curated, diverse urban instruction dataset that systematically covers urban environments from location view to trajectory view and global view, crucial for cross-modality alignment.

  2. UTrain: An innovative three-stage training pipeline (task alignment, knowledge learning, mixture learning) that effectively decouples spatial reasoning enhancement from domain knowledge learning, leading to stable training and superior performance.

  3. UBench: An extended and comprehensive benchmark for evaluating MLLMs in a wide array of urban tasks, including several newly introduced complex multi-modal challenges.

    Experimental results across Beijing, London, and New York unequivocally demonstrate UrbanLLaVA's effectiveness. It significantly outperforms both open-source and proprietary general MLLMs in diverse urban tasks, showcasing robust generalization abilities across different cities without sacrificing its performance on general MLLM benchmarks. In summary, UrbanLLaVA represents a significant step towards building a unified foundation model with powerful perception and reasoning abilities for general urban intelligence.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose clear directions for future work:

  • Model Size Exploration: Current experiments primarily focused on the 8B model (specifically, VILA1.5-8B). The full potential of UData and UTrain on larger MLLMs (e.g., VILA1.5-40B) remains to be realized, which could lead to significantly better performance.

  • UBench Refinement: The UBench benchmark can be further improved. Refining task design and testing MLLMs' overall multi-modal capabilities from an even more fine-grained perspective would enhance its utility.

  • Inclusion of More Modalities: The current model integrates four types of urban data. However, other important modalities such as video data and time series data (beyond just trajectories) could be included to provide a more complete picture of urban intelligence.

    In the future, the authors plan to:

  • Extend UrbanLLaVA to incorporate more diverse data types relevant to urban research.

  • Tackle more advanced urban tasks that arise from various interdisciplinary fields.

7.3. Personal Insights & Critique

This paper presents a highly valuable contribution to the nascent field of urban intelligence powered by MLLMs. The rigorous approach to data curation (UData), the thoughtful multi-stage training (UTrain), and the comprehensive evaluation (UBench) are particularly commendable.

Inspirations and Applications:

  • Unified Urban Understanding: The core idea of creating a single MLLM for comprehensive urban understanding, rather than fragmented task-specific models, is incredibly powerful. This paradigm could revolutionize urban planning, resource management, disaster response, and smart city development by providing a holistic, AI-driven cognitive layer.
  • Cross-City Generalization: The demonstrated cross-city generalization ability is crucial. It suggests that models trained on one city's data can be effectively deployed in others, significantly reducing the cost and effort of city-specific model development. This implies that urban patterns and structures have underlying commonalities that MLLMs can learn.
  • Methodology for Domain Adaptation: The UTrain framework, with its decoupling of task alignment and knowledge learning, offers a valuable blueprint for adapting general MLLMs to other complex, multi-modal domains beyond urban intelligence (e.g., environmental science, industrial automation, cultural heritage). The insight into the learning rate's significance for heterogeneous data is also very practical.
  • Data-centric AI in Urban Research: The emphasis on systematic and multi-view data generation (UData) underscores the importance of data engineering in MLLM development, especially for specialized fields. The quality and structure of domain-specific data are paramount.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Data Bias and Representativeness: While UData is extensive, urban data often carries inherent biases (e.g., richer data in central areas, privacy concerns in certain locations). The paper doesn't deeply discuss how these potential biases in the source data might translate into UrbanLLaVA's cognition or how they are mitigated. This is a critical consideration for real-world urban applications.

  • Interpretability and Explainability: As MLLMs become more capable, understanding why they make certain decisions in complex urban tasks (e.g., recommending a particular traffic route, inferring land use) becomes vital, especially for decision-makers. The paper focuses on performance but less on the interpretability of its spatial reasoning.

  • Dynamic Data and Real-time Capabilities: Urban environments are constantly changing. While spatiotemporal series data is included, the paper doesn't explicitly discuss how UrbanLLaVA would handle continuous, real-time data streams or adapt to rapid urban changes. This could be a significant challenge for practical deployment.

  • Computational Cost for Larger Models: The authors acknowledge that testing with larger base models (VILA1.5-40B) was limited by resources. The training time (10.7 hours on 4xA100 for Beijing) is significant. Scaling UrbanLLaVA to larger models or for more cities might incur prohibitive computational costs, raising questions about its practical scalability without further optimization or more efficient architectures.

  • Human-in-the-Loop Integration: For urban intelligence to be truly effective, AI models need to integrate seamlessly with human experts and decision-making processes. The paper presents UrbanLLaVA as an analytical tool, but the interface and interaction mechanisms with urban planners or policymakers are not discussed.

    Overall, UrbanLLaVA is a groundbreaking work that effectively bridges the gap between general MLLMs and the complex, multi-modal demands of urban intelligence. Its methodical approach to data, training, and evaluation sets a high standard for future research in this exciting interdisciplinary domain.
