UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
TL;DR Summary
UrbanLLaVA is a multi-modal language model designed for urban intelligence, processing four data types to enhance urban task performance. It leverages a diverse instruction dataset and a multi-stage training framework, achieving strong cross-city generalization.
Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend the existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding." This title clearly indicates the development of a specific multi-modal large language model (MLLM), named UrbanLLaVA, tailored for urban intelligence applications, with a particular focus on its capabilities in spatial reasoning and understanding within urban environments.
1.2. Authors
The authors of the paper are Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, and Yong Li.
- Jie Feng and Yong Li are affiliated with the Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China.
- Shengyuan Wang is from the Department of Computer Science and Technology, Tsinghua University, Beijing, China.
- Tianhui Liu is from the School of Electronic and Information Engineering, Beijing Jiaotong University, China.
- Yanxin Xi is associated with the University of Helsinki, Finland. The contact email for Feng and Li is {fengjie, liyong07}@tsinghua.edu.cn. Their affiliations suggest a strong background in electronic engineering, computer science, and information engineering, with expertise likely spanning artificial intelligence, machine learning, and urban computing.
1.3. Journal/Conference
The paper was made public on 2025-06-29 at 13:04:27 (UTC). Based on the provided metadata, it is a preprint on arXiv, available at https://arxiv.org/abs/2506.23219. As a preprint, it has not yet undergone full peer review for a specific journal or conference, but arXiv is a highly reputable platform for disseminating cutting-edge research in computer science and related fields, allowing for early sharing and feedback. Many papers on arXiv are eventually published in top-tier conferences or journals.
1.4. Publication Year
The paper was published in 2025 (specifically, June 29, 2025, according to the UTC timestamp).
1.5. Abstract
The abstract introduces the challenges in urban research, which often requires understanding diverse multi-modal data but lacks a unified processing framework. It positions the recent success of multi-modal large language models (MLLMs) as a promising solution. The paper then introduces UrbanLLaVA, an MLLM specifically designed to process four types of urban data simultaneously (urban visual data, geo-text, structured geospatial data, and spatiotemporal series data) and achieve strong performance across various urban tasks, outperforming general MLLMs. Key contributions include:
- Curating UData, a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, ranging from local to global views of the urban environment.
- Proposing UTrain, a multi-stage training framework that explicitly decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the model's compatibility and downstream performance.
- Extending existing benchmarks to create UBench, a comprehensive benchmark for assessing MLLM performance across a wide range of urban tasks.

Experimental results across three cities (Beijing, London, New York) demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal and complex cross-modal tasks, exhibiting robust generalization abilities. The source code and data are openly accessible.
1.6. Original Source Link
The official source link for the paper is https://arxiv.org/abs/2506.23219, and the PDF link is https://arxiv.org/pdf/2506.23219v1.pdf. It is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
Urban research is inherently complex, involving a vast array of scenarios and tasks that necessitate the understanding and integration of multi-modal data. This data includes urban visual data (like street views and satellite images), geo-text (location descriptions), structured geospatial data (e.g., from OpenStreetMap), and spatiotemporal series data (e.g., traffic flows, mobility patterns). The core problem the paper addresses is the lack of a unified framework capable of comprehensively processing this diverse multi-modal urban data.
This problem is critical because a holistic understanding of urban spaces and the development of advanced reasoning capabilities for real-world urban applications depend on the effective integration of these data types. Prior research, while extensive, often falls short in several ways:
- Specificity: Many existing methods are designed for specific data types or are tailored to particular urban tasks, limiting their generalizability and ability to provide a comprehensive urban understanding.
- Unimodal Focus: Even recent advancements that integrate urban data into Large Language Models (LLMs) often focus on unimodal urban data (e.g., only geospatial text or only remote sensing images), failing to achieve true cross-modal understanding and modeling of complex urban systems.
- Heterogeneity Challenges: The inherent diversity and heterogeneity of urban data (different formats, scales, and semantics) pose significant challenges for data integration and comprehensive processing within a single framework.

The paper identifies the recent success of multi-modal large language models (MLLMs) in general domains as a promising opportunity to overcome these limitations. MLLMs, with their built-in common sense and reasoning abilities, offer a pathway to unify data processing across various modalities. The paper's innovative idea is to adapt and advance MLLMs specifically for urban intelligence by creating a model that can process four major types of urban data simultaneously, thereby enabling comprehensive urban cognition and tackling diverse tasks within a unified framework.
2.2. Main Contributions / Findings
The paper introduces UrbanLLaVA and makes several key contributions:
- Introduction of UrbanLLaVA, a Unified Multi-modal Urban LLM: This is presented as the first MLLM designed for the unified modeling of four major types of urban data (urban visual data, geo-text, structured geospatial data, and spatiotemporal series data). Its goal is to foster comprehensive understanding and effective task-solving for urban environments, moving beyond task-specific or unimodal approaches.
- Development of UData, a Diverse Urban Instruction Dataset: The authors curate a systematic urban instruction data pipeline (UData) that generates high-quality synthetic data. This dataset spans multiple perspectives, from a localized view (single-modality data) to trajectory and global views (cross-modality data), capturing the multi-faceted nature of urban systems. This addresses the scarcity of high-quality cross-modality alignment data in urban research.
- Proposal of UTrain, an Effective Multi-stage Training Pipeline: To address challenges like training stability and balancing performance across diverse urban tasks and data modalities, the paper proposes UTrain. This three-stage pipeline explicitly decouples spatial reasoning enhancement from domain knowledge learning. The stages are: task alignment, knowledge learning, and mixture learning. This approach significantly improves UrbanLLaVA's compatibility and downstream performance.
- Extension of UBench, an Enhanced Multi-modal Benchmark: The paper extends existing urban benchmarks to create UBench, a systematic and comprehensive evaluation benchmark. UBench includes 12 tasks (6 adopted/extended, 6 newly introduced) designed to assess the capabilities of MLLMs in tackling a wide range of diverse urban tasks, especially those involving multi-modal data.

The key findings from the experimental results in Beijing, London, and New York are:

- UrbanLLaVA consistently outperforms both open-source and proprietary general MLLMs across various urban tasks, including both single-modal and complex cross-modal tasks in the UBench benchmark.
- It demonstrates robust generalization abilities across different cities, even when trained on data from a single city (e.g., Beijing).
- The UData dataset effectively equips smaller MLLMs with diverse urban capabilities, achieving superior performance over more advanced general MLLMs.
- The UTrain multi-stage training pipeline successfully ensures stable training and balanced performance across diverse urban tasks.
- UrbanLLaVA maintains its stability and competitive performance on general MLLM benchmarks (LLaVA-Bench, RealWorldQA, MM-Vet), indicating its ability to enhance specialized urban capabilities without sacrificing general MLLM strengths.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the contributions of UrbanLLaVA, a foundational understanding of several key concepts is essential for a beginner.
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are advanced artificial intelligence models trained on vast amounts of text data. They are designed to understand, generate, and process human language. At their core, LLMs use a type of neural network architecture called Transformers, which are particularly good at handling sequential data. LLMs excel at tasks like answering questions, summarizing texts, writing creative content, and translating languages. Their "largeness" comes from having billions or even trillions of parameters (the internal variables that the model learns during training), which allows them to capture complex patterns and relationships in language. The "reasoning abilities" mentioned in the abstract refer to their emergent capacity to perform complex cognitive tasks by manipulating symbolic representations, often via chain-of-thought prompting.
3.1.2. Multi-modal Large Language Models (MLLMs)
Multi-modal Large Language Models (MLLMs) extend the capabilities of LLMs by integrating information from multiple types of data, or "modalities," beyond just text. While LLMs focus solely on language, MLLMs can process and understand combinations of text, images, audio, video, and other data types. For example, a common MLLM combines text with vision, allowing it to answer questions about images (visual question answering), generate descriptions for images (image captioning), or follow visual instructions. This is typically achieved by using specialized encoders for each modality (e.g., a vision encoder for images) that convert the non-textual data into a format (embeddings) that the LLM can understand and reason with. The LLM then acts as the central reasoning engine, integrating information from all modalities to produce a coherent textual output.
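To make the encoder-plus-projector pipeline concrete, here is a minimal, illustrative sketch in Python; the shapes, random stand-in features, and function names are assumptions for demonstration and do not correspond to any specific model discussed in this paper.

```python
# Conceptual sketch of how an MLLM fuses modalities: a vision encoder turns an
# image into patch embeddings, a projector maps them into the LLM's token-embedding
# space, and the LLM backbone attends over visual and text tokens together.
import numpy as np

def vision_encoder(image: np.ndarray, num_patches: int = 16, dim: int = 64) -> np.ndarray:
    """Stand-in for a ViT-style encoder: image -> patch embeddings."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(num_patches, dim))          # (patches, vision_dim)

def projector(vision_feats: np.ndarray, llm_dim: int = 128) -> np.ndarray:
    """Linear projection aligning vision features with the LLM embedding space."""
    w = np.random.default_rng(1).normal(size=(vision_feats.shape[1], llm_dim))
    return vision_feats @ w                              # (patches, llm_dim)

def embed_text(tokens: list[str], llm_dim: int = 128) -> np.ndarray:
    """Stand-in for the LLM's token-embedding lookup."""
    return np.random.default_rng(2).normal(size=(len(tokens), llm_dim))

# Build the multimodal input sequence: [visual tokens] + [text tokens].
image = np.zeros((224, 224, 3))
visual_tokens = projector(vision_encoder(image))
text_tokens = embed_text("What landmark is visible in this street view?".split())
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (24, 128): visual tokens followed by text tokens
```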
3.1.3. Urban Intelligence (Urban Computing/Urban Science)
Urban intelligence (also known as urban computing or urban science) is an interdisciplinary field that applies computational and data science methods to address challenges and improve various aspects of urban life. It aims to understand the dynamics of cities, predict urban phenomena, and optimize urban systems using vast amounts of heterogeneous urban data. This includes areas like traffic management, public safety, environmental monitoring, urban planning, and social equity. The goal is to build "smarter" and more sustainable cities by leveraging technology to enhance decision-making and resource allocation.
3.1.4. Urban Data Modalities
Urban intelligence relies on diverse data types, often referred to as urban data modalities:
- Urban Visual Data: This includes images and videos captured within urban environments. The paper specifically mentions:
  - Street View Images: Photographs taken from street level, offering a ground-level perspective of urban features (buildings, roads, storefronts, etc.). Examples include Google Street View or Baidu Map Street View.
  - Satellite Images: Aerial or overhead imagery of urban areas, providing a global or regional perspective on land use, infrastructure, and geographical features. Examples include Google Earth imagery.
- Geo-Text: Textual data that contains geographical references or descriptions related to locations. This can include social media posts with location tags, reviews of places, descriptions of landmarks, or news articles mentioning specific areas.
- Structured Geospatial Data: Data that represents geographic features with defined attributes and spatial coordinates. Examples include:
  - OpenStreetMap (OSM) data: A collaborative project to create a free editable map of the world. It contains information about roads, buildings, points of interest (POIs), land use, and more, all with precise geographical coordinates.
  - Geographic Information Systems (GIS) data.
- Spatiotemporal Series Data: Data that records observations over both space and time. This is crucial for understanding dynamic urban phenomena. Examples include:
  - Trajectory Data: Sequences of geographical points (latitude, longitude, timestamp) tracing the movement of people or vehicles. This can be from GPS devices, mobile phone records, or public transit logs.
  - Environmental sensor readings (e.g., air quality over time at different locations).
  - Traffic flow data.
3.1.5. Instruction Tuning
Instruction tuning is a fine-tuning technique used to adapt LLMs or MLLMs to follow specific instructions or perform tasks as described in natural language prompts. Instead of traditional supervised learning with input-output pairs, the model is trained on a dataset where each example consists of an instruction (e.g., "Describe this image," "Summarize this text," "What is the capital of France?") and the corresponding desired output. This process teaches the model to understand and respond to a wide variety of natural language prompts, making it more versatile and user-friendly. For MLLMs, instruction tuning often involves multi-modal inputs (e.g., an image and a textual instruction) and a textual output.
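To illustrate the idea, a single multi-modal instruction-tuning example is often stored as a record like the one below; the field names, the `<image>` placeholder convention, and the sample content are illustrative assumptions rather than the exact schema used by UData.

```python
# Illustrative multi-modal instruction-tuning record (hypothetical schema).
instruction_record = {
    "images": ["street_view_000123.jpg"],  # non-text input(s) referenced by the prompt
    "conversations": [
        {"role": "user",
         "content": "<image>\nWhat is the address of the location shown in this street view image?"},
        {"role": "assistant",
         "content": "This street view was captured on a main road in Haidian District, Beijing."},
    ],
}
```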
3.2. Previous Works
The paper contextualizes UrbanLLaVA within the landscape of MLLMs and urban studies, highlighting the evolution from specific solutions to more generalized, unified approaches.
3.2.1. General MLLMs
The foundation of UrbanLLaVA lies in the success of general-purpose MLLMs that emerged following models like GPT-4V (OpenAI's vision-enabled GPT-4). Key examples mentioned in the paper include:
- LLaVA [30, 31]: One of the pioneering open-source MLLMs that leveraged GPT-4V to create visual instruction tuning data, demonstrating strong visual understanding and reasoning.
- VILA [29]: Explored the impact of training pipelines and data formats on MLLM pre-training. UrbanLLaVA uses VILA as its base model.
- QwenVL [41] and InternVL [7, 8]: Other prominent open-source MLLMs that have shown strong performance in generic visual-linguistic tasks.

These general MLLMs demonstrate powerful visual understanding and reasoning in common scenarios but often struggle with the specialized knowledge and complex reasoning required for domain-specific tasks like those in urban intelligence.
3.2.2. Domain-Specific MLLMs
Recognizing the limitations of general MLLMs in specialized fields, researchers have developed domain-specific MLLMs. Examples cited include:
- Dolphins [34]: A multi-modal language model for autonomous driving, integrating sensor data for navigation and control.
- LLaVA-Med [27]: Fine-tuned for answering open-ended questions related to biomedical images.
- GeoChat [26]: An early MLLM effort specifically designed for remote sensing tasks, focusing on satellite imagery.

These models demonstrate the value of tailoring MLLM capabilities to particular domains but often remain specialized to a subset of data types within that domain.
3.2.3. Multi-modal Models for Urban Study (Prior to UrbanLLaVA)
Before UrbanLLaVA, urban research leveraged various methods, often focusing on specific data types:
- Deep Learning-based Fusion Methods [53, 57]: Numerous deep learning methods have been proposed to fuse various cross-domain urban data. However, these are typically designed for specific urban tasks (e.g., predicting traffic flow, identifying land use) and lack the ability to achieve a comprehensive, unified understanding of the urban environment or perform advanced reasoning across a wide range of tasks.
- LLMs/MLLMs for Unimodal Urban Data: More recently, LLMs and MLLMs have been explored for specific urban data types:
  - Structured Geospatial Data: Balsebre et al. [1] and Feng et al. [16] proposed methods to convert structured geospatial data into a language-compatible format to enhance the geospatial knowledge of LLMs (CityGPT [16] is an example).
  - Remote Sensing Data: Kuckreja et al. [26] (GeoChat) and Zhang et al. [52] designed instruction data to fine-tune general MLLMs for remote sensing tasks.
  - Street View Data: Hao et al. [22] fine-tuned CLIP models for urban indicator prediction by integrating street view and remote sensing data. Liu et al. [32] evaluated multi-modal language models for urban socioeconomic sensing.
  - Spatiotemporal Series Data: Li et al. [28] and Gong et al. [20] (Mobility-LLM) introduced domain-specific encoders to enhance LLM capabilities for trajectory modeling and spatiotemporal series. Feng et al. [15] proposed an agentic framework for zero-shot mobility prediction.
3.3. Technological Evolution
The technological evolution leading to UrbanLLaVA can be traced through several stages:
- Early Deep Learning for Urban Data: Initially, deep learning models were applied to urban data, but these were largely task-specific and often handled single modalities or limited fusion.
- Rise of General LLMs: The development of LLMs like GPT-3 demonstrated unprecedented language understanding and generation capabilities.
- Emergence of General MLLMs: The integration of vision (and other modalities) into LLMs (e.g., LLaVA, VILA, GPT-4V) created MLLMs capable of handling multi-modal input in general contexts, showing strong reasoning abilities.
- Domain Adaptation of LLMs/MLLMs: Researchers then began adapting LLMs and MLLMs to specific domains (e.g., medicine, autonomous driving, remote sensing). In urban studies, this initially led to models that either focused on specific data types (e.g., CityGPT for structured geospatial data, GeoChat for satellite images, Mobility-LLM for trajectories) or aimed at specific urban tasks.

This paper's work (UrbanLLaVA) represents the next logical step in this evolution: moving from domain-specific MLLMs that handle limited data types or unimodal urban data to a comprehensive MLLM that integrates all major multi-modal urban data types into a unified framework for a wide range of urban tasks, thereby enabling holistic urban cognition and spatial reasoning.
3.4. Differentiation Analysis
Compared to the main methods in related work, UrbanLLaVA introduces several core differences and innovations:
- Unified Multi-modal Integration: Unlike most previous works that focus on limited data types (e.g., only satellite images, only structured geospatial data, or only trajectory data) or integrate unimodal urban data into LLMs, UrbanLLaVA is explicitly designed to unify and process four major types of urban data simultaneously: urban visual data (street view and satellite), geo-text, structured geospatial data, and spatiotemporal series data. This comprehensive integration is a key differentiator.
- Comprehensive Urban Cognition: By integrating multiple modalities, UrbanLLaVA aims for a more holistic and comprehensive understanding of urban environments, going beyond task-specific solutions or single-modal interpretations. This allows for advanced reasoning that leverages the interplay between different data types.
- Systematic Data Curation (UData): The paper addresses the crucial challenge of cross-modality alignment by curating UData, a diverse urban instruction dataset. This dataset is meticulously structured to generate high-quality synthetic data covering the location view, trajectory view, and global view, capturing the multi-faceted nature of urban systems, a level of systematic multi-view data generation not explicitly highlighted in prior urban MLLM efforts.
- Decoupled Multi-stage Training (UTrain): To manage the heterogeneity of multi-modal urban data and the diversity of urban tasks, UrbanLLaVA proposes UTrain, a novel three-stage training pipeline. This framework decouples spatial reasoning enhancement from domain knowledge learning, which is a distinct methodological innovation aimed at improving training stability and performance across diverse urban tasks. Prior MLLM training often follows more general fine-tuning strategies without this explicit decoupling for urban contexts.
- Enhanced Urban Benchmark (UBench): UrbanLLaVA extends existing benchmarks to create UBench, specifically designed to assess MLLMs across a wide range of urban tasks. This includes newly introduced tasks that better reflect complex multi-modal urban scenarios (e.g., STV-Outlier, SceneFunc), providing a more rigorous evaluation standard than previous benchmarks, which might be less comprehensive or primarily focused on specific sub-domains.

In essence, UrbanLLaVA moves beyond the "one data type, one task" or "unimodal LLM for urban" paradigms by offering a truly unified multi-modal framework equipped with specialized data and training strategies for comprehensive urban intelligence.
4. Methodology
4.1. Principles
The core idea behind UrbanLLaVA is to build comprehensive urban cognition and address a wide range of urban tasks by integrating multi-modal urban data into a single, unified Multi-modal Large Language Model (MLLM). The theoretical basis or intuition is that MLLMs, with their inherent common sense and reasoning capabilities (derived from their Large Language Model (LLM) backbone), can act as a central component to process heterogeneous urban data effectively. By training on a diverse dataset that systematically covers urban environments from local to global views, and by employing a specialized multi-stage training strategy, UrbanLLaVA aims to overcome the limitations of prior task-specific or unimodal approaches. This allows the model to develop sophisticated spatial reasoning and urban domain knowledge that enables it to understand complex urban phenomena and perform various tasks.
4.2. Core Methodology In-depth (Layer by Layer)
The UrbanLLaVA framework is structured into three main components: UData (data pipeline), UTrain (training pipeline), and UBench (evaluation benchmark). The following figure (Figure 2 from the original paper) illustrates the overall framework:
The image is a schematic of the UrbanLLaVA framework, showing the data pipeline UData, the training pipeline UTrain, and the evaluation benchmark UBench. The model integrates multi-modal data to support the processing and evaluation of urban intelligence tasks, including information from the location view and the trajectory view. Examples of different tasks are listed along the border to illustrate the model's diversity and application scenarios.
Figure 2. The framework of UrbanLLaVA, including UData, UTrain and UBench
4.2.1. UData: Constructing Urban Instruction Data from a Multi-View Perspective of Urban Space
UData is the data pipeline responsible for generating diverse and high-quality urban instruction data across various urban scenarios. The fundamental approach is to organize the urban instruction data in a sequence that moves from a location view to a trajectory view, and finally to a global view. This ensures comprehensive spatial coverage and maintains the integrity of relationships between different modalities.
The construction of UData builds upon four kinds of original urban data:
- Structured geospatial data: Obtained from OpenStreetMap (e.g., points of interest, road networks, building footprints).
- Public trajectory data: Examples include Foursquare-checkins [48] and OpenStreetMap traces (sequences of GPS points representing movement).
- Satellite images: Sourced from Google Earth (overhead views of urban areas).
- Street view images: Collected from Google Map and Baidu Map (ground-level panoramic images).

The following figure (Figure 3 from the original paper) provides a detailed composition of UData in Beijing:
The image is a diagram of the composition of UData, showing the distribution of different types of urban data, including global view data, trajectory view data, and location view data. The details and quantities of each data type are presented in a circular structure, making the relationships among the data categories easy to understand.
Figure 3. The thorough composition of UData in Beijing.
The UData construction process is divided into three stages:
4.2.1.1. Location View Data
This stage focuses on integrating structured geospatial data and single street view images.
- Structured Geospatial Data Instructions: Following practices from previous works [1, 16], geospatial instruction data is created by designing question templates that transform basic geospatial information into natural language question-and-answer pairs, for instance, questions about landmarks, addresses, or facility types at specific locations (a sketch of this template-filling step follows Figure 22 below).
- Single Street View Image Instructions: For each street view image, three types of questions are synthesized:
  - Template-based with Geospatial Data: Two types of questions are generated using predefined templates, populated with information from structured geospatial data (e.g., querying the address or landmark details visible in the image).
  - General MLLM-generated Description: A general MLLM is prompted to generate a detailed description of the image content, similar to image captioning [6]. A core principle here is to ensure consistency between street view image content and structured geographical knowledge (e.g., location addresses and landmark descriptions must match).

An example of Location View Data for Location Address is provided in the supplementary material (Figure 22):
The image is a street view showing a road in Beijing, surrounded by trees and buildings. Traffic signs and facilities along the road are visible, providing detailed visual information about the urban environment.
Figure 22. An example of local view training instances of Location Address.
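As referenced above, the sketch below illustrates the template-filling idea behind the structured geospatial instructions; the template wording, field names, and the example record are assumptions, not the authors' released templates.

```python
# A minimal sketch of template-based instruction generation from a structured
# geospatial record (hypothetical templates and field names).
TEMPLATES = [
    ("What type of facility is {name}?", "{name} is a {category}."),
    ("What is the address of {name}?", "{name} is located at {address}."),
]

def make_instructions(poi: dict) -> list[dict]:
    """Turn one point-of-interest record into natural-language QA pairs."""
    return [{"instruction": q.format(**poi), "answer": a.format(**poi)}
            for q, a in TEMPLATES]

poi = {  # hypothetical OSM-style record
    "name": "Example Coffee House",
    "category": "cafe",
    "address": "12 Example Street, Haidian District, Beijing",
}
for pair in make_instructions(poi):
    print(pair)
```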
4.2.1.2. Trajectory View Data
This stage integrates geospatial data, trajectory data, and street view images to capture continuous spatial movement.
- Text-based Trajectory Data: Two types are generated:
  - Random Walk Routing: Generated by randomly sampling origin and destination points and creating routing instructions (e.g., "Go straight, then turn left").
  - Real-world Trajectories: Utilizes public trajectory data (e.g., Foursquare-checkins, OpenStreetMap traces). GPS coordinates from these sources are aligned with structured geospatial data, using textual addresses to represent locations within the trajectory.
- Vision-augmented Trajectory Data: This combines visual information with trajectories (see the sketch after this list for an illustrative sample format):
  - Street View Images along Route: Extends text-based trajectory data by incorporating street view images captured along the route (excluding intersections). This data is organized in an interleaved image-text format, similar to VILA [29], for example, a sequence of "Image 1, Turn left, Image 2, Go straight."
  - Navigation Instructions (Vision-Language Navigation): Builds on navigation instruction formats akin to classical vision-language navigation tasks [5]. Multiple street view images are presented at intersections during a trajectory, and the model must select the correct image to guide the continuation of the journey.
4.2.1.3. Global View Data
This stage focuses on capturing relationships among diverse data types over longer distances, primarily using street view images and satellite images, with geospatial data as auxiliary support.
- Basic Global View Data (Single Satellite Image):
  - Detailed Content Description: A general MLLM is prompted to produce detailed content descriptions for individual satellite images.
  - Spatial Coverage Summary: Location addresses within a satellite image are sampled, and a general LLM is used to summarize the spatial coverage based on these addresses.
  - Land Use Inference with Reasoning: A general MLLM is prompted with land use ground-truth labels to generate land use inference results along with explanations (reasons).
- Multiple Satellite Images for Complex Instructions:
  - Building Density Comparison: A task to compare building densities across multiple satellite images.
  - Functional Point of Interest (POI) Identification: Focuses on identifying specific functional POIs within multiple images. For these tasks, manually crafted reasoning steps in a chain-of-thoughts format, supported by structured geospatial data, are provided to improve alignment between satellite images and geospatial data (a sketch of such a reasoning-augmented sample follows this subsection).
- Street View and Satellite Image Alignment:
  - Correct Satellite Image Selection: Given a street view image, the model must select the correct satellite image from a set, requiring understanding and matching content or address across both image types.
  - Street View Location Pinpointing: A more challenging task involving pinpointing the location of the street view image within a specific satellite image (e.g., identifying it as being in the top-left region).

After data generation, quality checks and filtering are performed on the synthesized data. An example of Global View Data for Landuse Inference is shown in the supplementary material (Figure 21):
The image is an example of a global view training instance for land use inference, showing the distribution of facilities and buildings in an urban environment, including elements such as sports fields and residential areas.
Figure 21. An example of global view training instances of Landuse Inference.
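As referenced in the list above, the following sketch shows how a multi-satellite-image instruction with chain-of-thought style reasoning grounded in geospatial statistics could be assembled; the per-tile statistics, file names, and wording are hypothetical.

```python
# A sketch of a global-view, multi-satellite-image instruction with structured
# reasoning, in the spirit of the building-density comparison task.
def density_comparison_sample(tiles: list[dict]) -> dict:
    """Build a multi-image comparison instruction with step-by-step reasoning."""
    reasoning = [
        f"Image {i + 1} covers {t['area_km2']} km^2 and contains {t['buildings']} "
        f"building footprints, i.e. about {t['buildings'] / t['area_km2']:.0f} buildings/km^2."
        for i, t in enumerate(tiles)
    ]
    densest = max(range(len(tiles)), key=lambda i: tiles[i]["buildings"] / tiles[i]["area_km2"])
    return {
        "images": [t["file"] for t in tiles],
        "instruction": "Which satellite image shows the highest building density? Explain your reasoning.",
        "answer": " ".join(reasoning) + f" Therefore, Image {densest + 1} has the highest building density.",
    }

tiles = [{"file": "sat_a.png", "area_km2": 1.0, "buildings": 420},   # hypothetical statistics
         {"file": "sat_b.png", "area_km2": 1.0, "buildings": 180}]
print(density_comparison_sample(tiles)["answer"])
```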
4.2.2. UTrain: A Multi-Stage Training Pipeline for Decoupling Reasoning and Knowledge Learning
The training of UrbanLLaVA faces challenges due to data heterogeneity and task diversity. The paper selects VILA1.5 [29] as the base MLLM and proposes UTrain, a three-stage tuning pipeline. The following figure (Figure 4 from the original paper) illustrates the UTrain pipeline:
The image is a schematic of the three-stage UTrain training pipeline, comprising task alignment (Stage 1), knowledge learning (Stage 2), and mixture tuning (Stage 3), illustrating the training steps of the multi-modal large language model for urban intelligence tasks.
Figure 4. UTrain: three-stage training pipeline.
The UTrain pipeline distinguishes three types of learning procedures:
- Knowledge Learning: The process where UrbanLLaVA acquires foundational urban knowledge from various urban data, including information from geospatial data, pure textual trajectories, and detailed descriptions of street view and satellite images.
- Task Alignment: Focuses on equipping UrbanLLaVA with task-specific skills for urban applications, such as vision-language navigation, trajectory prediction, and chain-of-thoughts reasoning across multiple satellite and street view images.
- Mixture Learning: Represents the standard training method where all types of instruction data are mixed directly, as used by most MLLMs.

Based on experimental observations regarding training stability and performance, the authors propose a specific three-stage tuning pipeline (a schematic sketch of the schedule follows this list):

- Stage 1: Task Alignment
  - Description: Starting with a well-trained general MLLM (e.g., VILA1.5), the model is first fine-tuned with diverse urban task-related instructions.
  - Purpose: This stage familiarizes the model with various urban tasks, leveraging its pre-existing general knowledge to understand task formats and requirements.
- Stage 2: Knowledge Learning
  - Description: After task alignment, this stage imparts specialized urban knowledge from multi-modal urban data that is crucial for effective task resolution.
  - Purpose: Addresses the limitation that general knowledge alone is insufficient for diverse urban tasks, providing the domain-specific information necessary.
- Stage 3: Mixture Learning
  - Description: In this final stage, the model is further tuned using a mixture of data: 1/3 domain-specific data (resampled from the first two stages) and 1/3 general textual instruction data (e.g., ShareGPT [6], UltraChat [11]).
  - Purpose: This stage enhances the model's ability to combine domain knowledge and task-specific skills for solving diverse urban tasks, ensuring robust performance across the entire spectrum of urban intelligence challenges.

This multi-stage framework explicitly decouples the learning of reasoning capabilities (enhanced in task alignment) from domain-specific knowledge (acquired in knowledge learning), which is presented as a promising practice for MLLMs.
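As a schematic of the schedule described above, the sketch below strings the three stages together; `finetune` stands in for an ordinary supervised fine-tuning routine, and the equal-thirds mixture in stage 3 is one reading of the paper's mixture description rather than the authors' exact sampling code.

```python
# High-level sketch of the three-stage UTrain schedule described in this section.
def utrain(base_mllm, task_data, knowledge_data, general_data, finetune):
    # Stage 1: task alignment - expose the general MLLM to urban task formats.
    model = finetune(base_mllm, task_data)

    # Stage 2: knowledge learning - inject multi-modal urban domain knowledge.
    model = finetune(model, knowledge_data)

    # Stage 3: mixture learning - resample earlier data (here simply truncated
    # for brevity) and mix in general textual instructions in roughly equal parts.
    n = min(len(task_data), len(knowledge_data), len(general_data))
    mixture = task_data[:n] + knowledge_data[:n] + general_data[:n]
    model = finetune(model, mixture)
    return model
```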
4.2.3. UBench: An Enhanced Multimodal Benchmark for Urban Intelligence Tasks
UBench is the evaluation benchmark designed to systematically assess MLLM capabilities in multimodal urban tasks. It expands upon existing benchmarks like CityBench [18] and UrBench [56].
The benchmark comprises 12 tasks, categorized for different data combinations:
- Adopted/Extended Tasks (6 tasks):
  - From CityBench [18] (for structured geospatial data and trajectory modeling):
    - GeoQA: Geospatial question answering.
    - TrajPredict: Trajectory prediction.
    - Navigation: Vision-language navigation.
  - From UrBench [56] (for cross-view urban tasks involving street view and satellite images):
    - Image Retrieval: Retrieve the matching image across modalities.
    - Camera Localization: Localize the street view position on a satellite map.
    - Scene Comparison: Compare urban scenes.
- Newly Introduced Tasks (6 tasks):
  - Single-image tasks (4 tasks): Aligned with the urban instruction data, these are designed for single street view and satellite images. The original dataset is partitioned into training and validation sets to prevent data leakage.
    - STV-Address: Address inference from a street view image.
    - STV-Landmark: Landmark recognition from a street view image.
    - SAT-Address: Address inference from a satellite image.
    - SAT-Landuse: Land use inference from a satellite image.
  - Multiple-image tasks (2 tasks):
    - STV-Outlier: A spatial consistency task for street view images. Multiple images from a single trajectory are presented, and the model must identify an outlier image not part of the trajectory (a sketch of how such a multiple-choice query could be posed and scored follows Table 1 below).
    - SceneFunc: Extends the Scene Comparison task from UrBench. It challenges the model to select the correct satellite image that fulfills specific functional requirements (e.g., highest concentration of POIs).

The following table (Table 1 from the original paper) details the tasks in UBench:

| Tasks | Data | Category | Metrics | Samples | Source |
| GeoQA | Geospatial Data | GeoQA | Avg. Accuracy | 1450 | CityBench |
| TrajPredict | Trajectory Data | Geo+Traj | Top-1 | 500 | CityBench |
| Navigation | Single STV | Geo+Traj | Success Rate | 50 | CityBench |
| SceneComp | Multi SAT | Geo+SAT | Accuracy | 200 | UrBench |
| ImgRetrieval | Multi STV & SAT | Geo+SS | Accuracy | 200 | UrBench |
| CameraLoc | Multi STV & SAT | Geo+SS | Accuracy | 200 | UrBench |
| STV-Address | Single STV | Geo+STV | Accuracy | 200 | UBench |
| STV-Landmark | Single STV | Geo+STV | Accuracy | 200 | UBench |
| SAT-Address | Single SAT | Geo+SAT | Accuracy | 200 | UBench |
| SAT-Landuse | Single SAT | Geo+SAT | Accuracy | 200 | UBench |
| STV-Outlier | Multi STV | Geo+STV | Accuracy | 200 | UBench |
| SceneFunc | Multi SAT | Geo+SAT | Accuracy | 200 | UBench |
Table 1. Detailed information about UBench for Beijing, 'STV' refers to street view image, and 'SAT' refers to satellite image.
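As referenced in the task list above, the sketch below shows one way such a multiple-choice, multi-image UBench-style query (here STV-Outlier) could be posed and scored; the prompt wording, option format, and answer-extraction rule are assumptions, not the benchmark's actual implementation.

```python
# Hedged sketch of posing and scoring a multiple-choice STV-Outlier query.
import re

def build_outlier_prompt(num_images: int) -> str:
    placeholders = " ".join(f"<image {i + 1}>" for i in range(num_images))
    options = " ".join(f"({chr(65 + i)}) Image {i + 1}" for i in range(num_images))
    return (f"{placeholders}\nAll but one of these street view images were taken "
            f"along the same trajectory. Which image is the outlier?\n"
            f"Options: {options}\nAnswer with the option letter only.")

def score_choice(model_output: str, ground_truth: str) -> int:
    """Extract the first standalone option letter from the output and compare."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return int(bool(match) and match.group(1) == ground_truth)

print(build_outlier_prompt(4))
print(score_choice("The outlier is (C).", "C"))  # -> 1
```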
5. Experimental Setup
5.1. Datasets
The experiments for UrbanLLaVA are conducted across three major cities: Beijing, London, and New York. Due to the large volume of data, a specific region from each city is selected for the experiments. The spatial coverage of these regions is detailed in the supplementary material (Figure 36).
The image shows maps of Beijing, London, and New York, with the main streets and geographic features of the three cities annotated, giving an intuitive view of their spatial layouts.
Figure 36. Maps for Beijing, London and New York.
The UData dataset, specifically curated for this paper, is used for training UrbanLLaVA. UData encompasses diverse urban instruction data generated from multiple original sources as described in the methodology. The detailed statistics of UData for each city are provided in Table 10 of the supplementary material.
| City | Category | Dataset | Instances | Rounds |
| | General | ShareGPT, UltraChat, Open-Platypus | 19866 | 3.7 |
| Beijing | Location View Data | CityQA | 19271 | 1 |
| | | Location Address | 93246 | 1 |
| | | Landmark Details | 51130 | 1 |
| | | Image Description | 28798 | 1 |
| | | Cross Modality Reasoning | 2000 | 1 |
| | Trajectory View Data | Random Walk | 9001 | 1 |
| | | Real-World Trajectory | 98 | 1 |
| | | Visual Random Walk | 8936 | 1 |
| | | Vision-Language Navigation | 3000 | 1 |
| | Global View Data | Image Content | 9315 | 1 |
| | | Location Address | 2777 | |
| | | Landuse Inference | 3642 | 1 |
| | | Multiple SAT Comparison | 10114 | 1 |
| | | Cross-View Data | 77204 | 1 |
| | | Cross Modality Reasoning | 14977 | 1 |
| London | Location View Data | CityQA | 28934 | 1 |
| | | Location Address | 2172 | 1 |
| | | Landmark Details | 2372 | |
| | | Image Description | 716 | 1 |
| | | Cross Modality Reasoning | 1286 | 1 |
| | Trajectory View Data | Random Walk | 16524 | 1 |
| | | Real-World Trajectory | 98 | 1 |
| | | Visual Random Walk | 13412 | 1 |
| | | Vision-Language Navigation | 3000 | 1 |
| | Global View Data | Image Content | 3853 | 1 |
| | | Location Address | 882 | 1 |
| | | Landuse Inference | 4332 | 1 |
| | | Multiple SAT Comparison | 4500 | 1 |
| | | Cross-View Data | 2172 | 1 |
| | | Cross Modality Reasoning | 5758 | 1 |
| New York | Location View Data | CityQA | 25413 | 1 |
| | | Location Address | 94886 | 1 |
| | | Landmark Details | 50404 | 1 |
| | | Image Description | 24529 | 1 |
| | | Cross Modality Reasoning | 2012 | 1 |
| | Trajectory View Data | Random Walk | 12277 | 1 |
| | | Real-World Trajectory | 98 | 1 |
| | | Visual Random Walk | 12229 | 1 |
| | | Vision-Language Navigation | 3000 | 1 |
| | Global View Data | Image Content | 18368 | 1 |
| | | Location Address | 5113 | 1 |
| | | Landuse Inference | 17899 | 1 |
| | | Multiple SAT Comparison | 22020 | 1 |
| | | Cross-View Data | 94886 | 1 |
| | | Cross Modality Reasoning | 23603 | 1 |
Table 10. Basic information of UData on three cities.
Key characteristics of UData:
- Multi-View Perspective: Data is categorized into Location View Data (e.g., CityQA, Location Address, Landmark Details, Image Description, and Cross Modality Reasoning for single street views and geospatial data), Trajectory View Data (e.g., Random Walk, Real-World Trajectory, Visual Random Walk, Vision-Language Navigation), and Global View Data (e.g., Image Content, Location Address, Landuse Inference, Multiple SAT Comparison, and Cross-View Data for satellite images and their integration with street views).
- Scale: Beijing UData consists of approximately 340,000 instruction rounds, London approximately 80,000, and New York approximately 390,000.
- General Data: The general category includes the ShareGPT, UltraChat, and Open-Platypus datasets, which are standard for LLM instruction tuning and provide general language understanding capabilities.
- Raw Data Sources: The construction of UData leverages OpenStreetMap, Foursquare-checkins, OpenStreetMap traces, Google Earth, Google Map, and Baidu Map (as detailed in the Methodology section). The raw data of the selected regions in the three cities is provided in Table 11 of the supplementary material.

| City | AoIs | PoIs | Roads | Trajectory | Street View Image | Satellite Image |
| Beijing | 4647 | 1882 | 2320 | 21015 | 28798 | 1533 |
| London | 13705 | 11715 | 1322 | 173268 | 3125 | 556 |
| New York | 19541 | 11112 | 522 | 390934 | 24444 | 2738 |
Table 11. The raw data of the selected region in three cities.
These datasets were chosen because they represent the core modalities of urban data and are essential for developing comprehensive urban intelligence. They allow for testing a wide range of single-modal, cross-modal, and complex spatial reasoning tasks critical for urban applications.
5.2. Evaluation Metrics
The paper uses various metrics depending on the specific task in UBench and other general benchmarks.
5.2.1. Metrics for UBench Tasks
As detailed in Table 1 (repeated above for convenience), UBench uses the following metrics:
- Avg. Accuracy (Average Accuracy):
  - Conceptual Definition: This metric quantifies the proportion of correct predictions across a set of diverse questions or tasks. For GeoQA, where questions might have different structures, average accuracy typically aggregates the correctness over all questions.
  - Mathematical Formula: $ \text{Avg. Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\text{prediction}_i = \text{ground\_truth}_i) $
  - Symbol Explanation:
    - $N$: Total number of instances (questions/tasks) in the evaluation set.
    - $\mathbb{I}(\cdot)$: The indicator function, which returns 1 if the condition inside the parentheses is true (i.e., the prediction matches the ground truth for instance $i$), and 0 otherwise.
    - $\text{prediction}_i$: The model's output for instance $i$.
    - $\text{ground\_truth}_i$: The correct answer for instance $i$.
- Top-1 (Top-1 Accuracy):
  - Conceptual Definition: In tasks where the model predicts a ranked list of choices (e.g., for trajectory prediction, predicting the most likely next location from a set of candidates), Top-1 accuracy measures whether the single most probable prediction made by the model is correct.
  - Mathematical Formula: $ \text{Top-1 Accuracy} = \frac{\text{Number of times the top prediction is correct}}{\text{Total number of predictions}} $
  - Symbol Explanation: This is a specific form of accuracy where correctness is attributed only if the highest-ranked prediction matches the true label.
- Success Rate:
  - Conceptual Definition: For tasks like Navigation, this metric measures the proportion of attempts where the model successfully completes the task (e.g., reaches the destination following instructions) according to predefined success criteria.
  - Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of successful task completions}}{\text{Total number of task attempts}} $
  - Symbol Explanation: The definition of "successful task completion" is task-specific (e.g., reaching a target with a certain accuracy, navigating a path without errors).
- Accuracy:
  - Conceptual Definition: This is the most common metric, representing the proportion of correct predictions (or classifications) out of the total number of predictions made. It is used for tasks like SceneComp, ImgRetrieval, CameraLoc, STV-Address, STV-Landmark, SAT-Address, SAT-Landuse, STV-Outlier, and SceneFunc, which are typically classification or direct-answer tasks (a code sketch of these metrics follows this list).
  - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{\sum_{i=1}^{N} \mathbb{I}(\text{prediction}_i = \text{ground\_truth}_i)}{N} $
  - Symbol Explanation:
    - $N$: Total number of predictions.
    - $\mathbb{I}(\cdot)$: Indicator function, returning 1 if the prediction matches the ground truth, 0 otherwise.
    - $\text{prediction}_i$: The model's output for instance $i$.
    - $\text{ground\_truth}_i$: The correct answer for instance $i$.
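As referenced above, here is a minimal sketch of the three quantitative metrics (accuracy, Top-1 accuracy, and success rate), consistent with the formulas given in this section.

```python
# Minimal implementations of the UBench quantitative metrics.
def accuracy(predictions: list, ground_truths: list) -> float:
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

def top1_accuracy(ranked_predictions: list[list], ground_truths: list) -> float:
    """Only the highest-ranked candidate counts (e.g., next-location prediction)."""
    return accuracy([ranked[0] for ranked in ranked_predictions], ground_truths)

def success_rate(outcomes: list[bool]) -> float:
    """Share of attempts meeting the task-specific success criterion (e.g., navigation)."""
    return sum(outcomes) / len(outcomes)

print(accuracy(["cafe", "park"], ["cafe", "school"]))   # 0.5
print(top1_accuracy([["poi_3", "poi_7"]], ["poi_3"]))   # 1.0
print(success_rate([True, False, True, True]))          # 0.75
```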
5.2.2. Metrics for General Evaluation Tasks
For general MLLM benchmarks:
- Rating Score from GPT-4o (LLM-as-a-judge):
  - Conceptual Definition: For benchmarks like LLaVA-Bench (In-the-Wild) and MM-Vet, the performance of MLLMs is often evaluated by having another powerful LLM (in this case, GPT-4o) act as a judge. GPT-4o is given the input prompt, the model's response, and sometimes a reference answer, and then rates the quality of the model's response on a predefined scale. This approach is used for tasks where subjective quality or complex reasoning is hard to quantify with simple metrics.
  - Mathematical Formula: No single mathematical formula defines this, as it is based on an external LLM's subjective judgment. The score is typically an average of the ratings. For LLaVA-Bench, scores range from 0 to 100, and for MM-Vet, scores range from 0.0 to 1.0.
  - Symbol Explanation: Higher scores indicate better performance, as judged by GPT-4o.
- ACC (Accuracy):
  - Conceptual Definition: Used for RealWorldQA, this is the standard accuracy metric, quantifying the proportion of correct answers to factual questions in real-world scenarios.
  - Mathematical Formula: $ \text{ACC} = \frac{\text{Number of correct answers}}{\text{Total number of questions}} $
  - Symbol Explanation: As defined for Accuracy above.
5.3. Baselines
UrbanLLaVA is compared against a comprehensive set of MLLM baselines, including both open-source and proprietary models, to demonstrate its superior performance in urban tasks.
- Open-source MLLMs:
  - Qwen2VL [41]: Both Qwen2VL-7B and Qwen2VL-72B are included. This is a powerful vision-language model series known for its perception capabilities.
  - InternVL2 [7, 8]: Both InternVL2-8B and InternVL2-26B are used. This series is recognized for scaling vision foundation models and aligning them for generic visual-linguistic tasks.
  - VILA1.5 [29]: This is the base model from which UrbanLLaVA is fine-tuned. VILA1.5-3B, VILA1.5-8B, and VILA1.5-13B are evaluated to show the impact of model size and the improvement gained by UrbanLLaVA's specialized training.
  - LLaMA3.2 [36]: LLaMA3.2-11B and LLaMA3.2-90B are included. These models are from the LLaMA series, a popular family of LLMs, here with multi-modal extensions. It is noted that LLaMA3.2 models currently do not support multi-image input, leading to blank results for such tasks in UBench.
- Proprietary MLLMs:
  - GPT4o [40]: A powerful commercial MLLM from OpenAI, representing the state of the art in general multi-modal AI.
  - GPT4o-mini [40]: A smaller version of GPT4o, also from OpenAI.

These baselines are representative because they cover a range of model sizes (from 3B to 90B parameters) and capabilities (both cutting-edge open-source and commercial MLLMs). Comparing against them effectively validates UrbanLLaVA's performance, especially highlighting its advantage in domain-specific urban tasks, given that general MLLMs are often not optimized for such specialized contexts.
Implementation Details:
- The models are deployed through VLMEvalKit [13] for open-source MLLMs.
- The maximum output tokens are set to 1000.
- The temperature (a parameter controlling randomness in generation) is set to 0, implying deterministic and less creative outputs, suitable for factual task evaluations (a sketch of such a generation configuration follows this list).
- UrbanLLaVA is fine-tuned using code from the official VILA [29] repository on a single 8xA100 node.
- Training parameters: a learning rate of …, a maximum sequence length of 2048, a batch size of 8 per GPU, and one training epoch.
- Training UrbanLLaVA for Beijing on 4xA100 GPUs took 10.7 hours.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results, presented in Table 2, demonstrate that UrbanLLaVA significantly outperforms both open-source and proprietary MLLMs across diverse urban tasks in all three evaluated cities (Beijing, London, and New York). This strongly validates the effectiveness of the proposed UData and UTrain methodologies.
The aggregated results from Table 2 of the original paper are reproduced in full in Section 6.2.1 below. In that table, 'STV' refers to street view images, 'SAT' refers to satellite images, and 'SS' denotes tasks combining street view and satellite images; detailed subtasks and metrics can be found in Table 1.
Analysis for Beijing:

- UrbanLLaVA (using VILA1.5-8B as its base) consistently outperforms all baselines across all tasks in UBench.
- Compared to its base model VILA1.5-8B, UrbanLLaVA shows remarkable improvements:
  - GeoQA: +31.47%
  - Geo+Traj: +375.38% (a massive gain, highlighting its strength in trajectory-related tasks)
  - Geo+STV: +101.16%
  - Geo+SAT: +91.03%
  - Geo+SS: +189.69%
  These improvements clearly indicate the effectiveness of UData and UTrain in equipping a smaller MLLM with advanced urban capabilities.
- Even against the best baselines (including GPT4o and Qwen2VL-72B), UrbanLLaVA achieves significant gains, ranging from +3.48% (GeoQA) to +132.23% (Geo+SS). This demonstrates its competitive edge even against powerful commercial and large open-source MLLMs in urban-specific contexts.
- The LLaMA3.2 series models have blank entries for tasks involving multiple images due to input limitations.
Analysis for London and New York:

- The results generally mirror those in Beijing, demonstrating UrbanLLaVA's robust generalization abilities.
- For London, UrbanLLaVA performs best in 4 out of 5 task groups. It is slightly inferior to GPT4o in the GeoQA task (-0.73%).
- For New York, UrbanLLaVA also performs best in 4 out of 5 task groups, with a larger reduction of -7.37% compared to GPT4o in GeoQA.
- The paper speculates two reasons for UrbanLLaVA's slight underperformance on GeoQA in London and New York:
  - Lower quality of relevant data in these cities compared to Beijing, hindering knowledge acquisition.
  - The base model VILA1.5-8B may have inherently weaker capabilities than GPT4o for general factual QA, even with UrbanLLaVA's domain fine-tuning (e.g., UrbanLLaVA@London outperforms VILA1.5-8B by 32.18% in GeoQA but still trails GPT4o).

Overall, the results underscore that UrbanLLaVA successfully enhances the performance of smaller MLLMs on diverse urban tasks, proving its efficacy in integrating multi-modal urban data and solving complex urban challenges.
6.2. Data Presentation (Tables)
6.2.1. Main Results (Aggregated)
The following are the aggregated results from Table 2 of the original paper:
| City | Beijing | | | | | London | | | | | New York | | | | |
| Task Group | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS |
| VILA1.5-3B | 0.3873 | 0.0200 | 0.3967 | 0.3200 | 0.2575 | 0.4362 | 0.0400 | 0.2557 | 0.2850 | 0.2725 | 0.3954 | 0.0400 | 0.4400 | 0.2713 | 0.2425 |
| VILA1.5-8B | 0.4322 | 0.0589 | 0.4300 | 0.3488 | 0.2425 | 0.4841 | 0.0884 | 0.4495 | 0.4575 | 0.2575 | 0.4575 | 0.1200 | 0.4983 | 0.3763 | 0.2525 |
| VILA1.5-13B | 0.4410 | 0.1156 | 0.5167 | 0.3638 | 0.2400 | 0.4592 | 0.1298 | 0.4991 | 0.4538 | 0.2625 | 0.4501 | 0.2350 | 0.5583 | 0.4025 | 0.2825 |
| InternVL2-8B | 0.4709 | 0.1578 | 0.4667 | 0.3313 | 0.2325 | 0.4973 | 0.1347 | 0.4477 | 0.4763 | 0.2400 | 0.4632 | 0.1830 | 0.4917 | 0.4175 | 0.2400 |
| InternVL2-26B | 0.4877 | 0.1478 | 0.4550 | 0.3825 | 0.2275 | 0.5168 | 0.1288 | 0.4923 | 0.5138 | 0.2425 | 0.4766 | 0.2240 | 0.5217 | 0.4738 | 0.2375 |
| Qwen2VL-7B | 0.4950 | 0.1389 | 0.4383 | 0.3638 | 0.2675 | 0.4991 | 0.1560 | 0.4381 | 0.4863 | 0.2775 | 0.4567 | 0.1700 | 0.5117 | 0.5100 | 0.2950 |
| Qwen2VL-72B | 0.5491 | 0.1611 | 0.5817 | 0.3588 | 0.2975 | 0.5802 | 0.2322 | 0.6375 | 0.4375 | 0.3250 | 0.5273 | 0.2540 | 0.6333 | 0.3788 | 0.3275 |
| LLaMA3.2-11B | 0.4229 | 0.0756 | 0.4375 | 0.3075 | | 0.4804 | 0.1180 | 0.4000 | 0.3800 | | 0.4127 | 0.1100 | 0.5200 | 0.2225 | |
| LLaMA3.2-90B | 0.4502 | 0.1056 | 0.5325 | 0.2925 | | 0.5659 | 0.2010 | 0.5450 | 0.4700 | | 0.5234 | 0.1570 | 0.6825 | 0.3400 | |
| GPT4o-mini | 0.4542 | 0.1622 | 0.4350 | 0.3800 | 0.2475 | 0.5357 | 0.1278 | 0.4752 | 0.5388 | 0.2675 | 0.5075 | 0.2320 | 0.5633 | 0.4775 | 0.2350 |
| GPT4o | 0.5479 | 0.1522 | 0.4300 | 0.4125 | 0.3025 | 0.6446 | 0.1300 | 0.5469 | 0.6050 | 0.2850 | 0.6232 | 0.2340 | 0.5767 | 0.5400 | 0.2900 |
| UrbanLLaVA-VILA1.5-8B | 0.5682 | 0.2800 | 0.8650 | 0.6663 | 0.7025 | 0.6399 | 0.2680 | 0.7500 | 0.7100 | 0.4325 | 0.5773 | 0.3060 | 0.8500 | 0.7725 | 0.5825 |
| vs. VILA1.5-8B | +31.47% | +375.38% | +101.16% | +91.03% | +189.69% | +32.18% | +203.17% | +66.85% | +55.19% | +67.96% | +26.19% | +155.00% | +70.57% | +105.32% | +130.69% |
| vs. Best Baseline | +3.48% | +72.63% | +48.70% | +61.53% | +132.23% | -0.73% | +15.42% | +17.65% | +17.36% | +33.08% | -7.37% | +20.47% | +24.54% | +43.06% | +77.86% |
6.2.2. Detailed Results per City
The detailed results for each city, which are aggregated into Table 2, are provided in the supplementary material as Table 6 (Beijing), Table 7 (London), and Table 8 (New York).
The following are the results from Table 6 of the original paper (Beijing):
| Tasks@Beijing | GeoQA | Geo+Traj | | Geo+STV | | | Geo+SAT | | | | Geo+SS | |
| | | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
| Qwen2VL-7B | 0.4950 | 0.0978 | 0.18 | 0.440 | 0.755 | 0.1200 | 0.295 | 0.405 | 0.400 | 0.355 | 0.275 | 0.260 |
| Qwen2VL-72B | 0.5491 | 0.0822 | 0.24 | 0.410 | 0.785 | 0.5500 | 0.395 | 0.395 | 0.335 | 0.310 | 0.290 | 0.305 |
| InternVL2-8B | 0.4709 | 0.0957 | 0.22 | 0.420 | 0.755 | 0.2250 | 0.295 | 0.300 | 0.390 | 0.340 | 0.210 | 0.255 |
| InternVL2-26B | 0.4877 | 0.0756 | 0.22 | 0.440 | 0.755 | 0.1700 | 0.360 | 0.375 | 0.440 | 0.355 | 0.230 | 0.225 |
| VILA1.5-3B | 0.3873 | 0.0000 | 0.04 | 0.270 | 0.655 | 0.2650 | 0.275 | 0.475 | 0.295 | 0.235 | 0.250 | 0.265 |
| VILA1.5-8B | 0.4322 | 0.0578 | 0.06 | 0.270 | 0.650 | 0.3700 | 0.225 | 0.405 | 0.420 | 0.345 | 0.195 | 0.290 |
| VILA1.5-13B | 0.4410 | 0.0511 | 0.18 | 0.305 | 0.715 | 0.5300 | 0.320 | 0.320 | 0.425 | 0.390 | 0.270 | 0.210 |
| LLaMA3.2-11B | 0.4229 | 0.0711 | 0.08 | 0.280 | 0.595 | | 0.290 | 0.325 | | | | |
| LLaMA3.2-90B | 0.4502 | 0.0711 | 0.14 | 0.295 | 0.770 | | 0.295 | 0.290 | | | | |
| GPT4o-mini | 0.4542 | 0.0844 | 0.24 | 0.280 | 0.765 | 0.2600 | 0.350 | 0.360 | 0.465 | 0.345 | 0.205 | 0.290 |
| GPT4o | 0.5479 | 0.0844 | 0.22 | 0.405 | 0.775 | 0.1100 | 0.390 | 0.420 | 0.450 | 0.390 | 0.315 | 0.290 |
| UrbanLLaVA-VILA1.5-8B | 0.5682 | 0.1000 | 0.46 | 0.910 | 0.870 | 0.8150 | 0.780 | 0.720 | 0.585 | 0.580 | 0.785 | 0.620 |
| vs. VILA1.5-8B | +31.47% | +73.10% | +666.67% | +237.04% | +33.85% | +120.27% | +246.67% | +77.78% | +39.29% | +68.12% | +302.56% | +113.79% |
| vs. Best Baseline | +3.48% | +2.28% | +91.67% | +106.82% | +10.83% | +48.18% | +97.47% | +51.58% | +25.81% | +48.72% | +149.21% | +103.28% |
Table 6. Main results on UBench at Beijing. UrbanLLaVA significantly outperforms other baselines in every task.
The following are the results from Table 7 of the original paper (London):
| Tasks@London | GeoQA | Geo+Traj | | Geo+STV | | | Geo+SAT | | Geo+SS | | | |
| | | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
| Qwen2VL-7B | 0.4991 | 0.1920 | 0.12 | 0.405 | 0.760 | 0.1492 | 0.305 | 0.550 | 0.870 | 0.220 | 0.270 | 0.285 |
| Qwen2VL-72B | 0.5802 | 0.2245 | 0.24 | 0.485 | 0.875 | 0.5525 | 0.530 | 0.535 | 0.420 | 0.265 | 0.405 | 0.245 |
| InternVL2-8B | 0.4973 | 0.1694 | 0.10 | 0.290 | 0.810 | 0.2431 | 0.315 | 0.490 | 0.785 | 0.315 | 0.215 | 0.265 |
| InternVL2-26B | 0.5168 | 0.1776 | 0.08 | 0.380 | 0.865 | 0.2320 | 0.355 | 0.490 | 0.905 | 0.305 | 0.215 | 0.270 |
| VILA1.5-3B | 0.4362 | 0.0000 | 0.08 | 0.230 | 0.305 | 0.2320 | 0.200 | 0.445 | 0.295 | 0.200 | 0.290 | 0.255 |
| VILA1.5-8B | 0.4841 | 0.1367 | 0.04 | 0.330 | 0.560 | 0.4586 | 0.305 | 0.485 | 0.705 | 0.335 | 0.250 | 0.265 |
| VILA1.5-13B | 0.4592 | 0.1796 | 0.08 | 0.430 | 0.570 | 0.4972 | 0.275 | 0.350 | 0.800 | 0.390 | 0.275 | 0.250 |
| LLaMA3.2-11B | 0.4804 | 0.1959 | 0.04 | 0.360 | 0.440 | – | 0.260 | 0.500 | – | – | – | – |
| LLaMA3.2-90B | 0.5659 | 0.2020 | 0.20 | 0.375 | 0.715 | – | 0.385 | 0.555 | – | – | – | – |
| GPT4o-mini | 0.5357 | 0.1755 | 0.08 | 0.375 | 0.835 | 0.2155 | 0.390 | 0.570 | 0.855 | 0.340 | 0.290 | 0.245 |
| GPT4o | 0.6446 | 0.2000 | 0.06 | 0.580 | 0.895 | 0.1657 | 0.480 | 0.610 | 0.900 | 0.430 | 0.320 | 0.250 |
| UrbanLLaVA-VILA1.5-8B | 0.6399 | 0.1959 | 0.34 | 0.610 | 0.955 | 0.6851 | 0.575 | 0.750 | 0.955 | 0.560 | 0.605 | 0.260 |
| vs. VILA1.5-8B | +32.20% | +43.28% | +750.00% | +84.85% | +70.54% | +49.40% | +88.52% | +54.64% | +35.46% | +67.16% | +142.00% | -1.89% |
| vs. Best Baseline | -0.72% | -12.73% | +41.67% | +5.17% | +6.70% | +24.00% | +8.49% | +22.95% | +5.52% | +30.23% | +49.38% | -8.77% |
Table 7. Main results on UBench at London. UrbanLLaVA achieves better performance than other models in most tasks.
The following are the results from Table 8 of the original paper (New York):
| Tasks@NewYork | GeoQA | Geo+Traj | | Geo+STV | | | Geo+SAT | | Geo+SS | | | |
| | | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
| Qwen2VL-7B | 0.4567 | 0.1200 | 0.22 | 0.585 | 0.805 | 0.1450 | 0.455 | 0.395 | 0.875 | 0.315 | 0.275 | 0.315 |
| Qwen2VL-72B | 0.5273 | 0.1480 | 0.36 | 0.550 | 0.795 | 0.5550 | 0.520 | 0.235 | 0.470 | 0.290 | 0.335 | 0.320 |
| InternVL2-8B | 0.4632 | 0.1260 | 0.24 | 0.440 | 0.780 | 0.2550 | 0.395 | 0.135 | 0.835 | 0.305 | 0.245 | 0.235 |
| InternVL2-26B | 0.4766 | 0.1080 | 0.34 | 0.490 | 0.805 | 0.2700 | 0.495 | 0.225 | 0.885 | 0.290 | 0.230 | 0.245 |
| VILA1.5-3B | 0.3954 | 0.0000 | 0.08 | 0.330 | 0.745 | 0.2450 | 0.310 | 0.250 | 0.280 | 0.245 | 0.255 | 0.230 |
| VILA1.5-8B | 0.4575 | 0.1000 | 0.14 | 0.345 | 0.680 | 0.4700 | 0.235 | 0.160 | 0.795 | 0.315 | 0.260 | 0.245 |
| VILA1.5-13B | 0.4501 | 0.1100 | 0.36 | 0.375 | 0.765 | 0.5350 | 0.325 | 0.175 | 0.820 | 0.290 | 0.285 | 0.280 |
| LLaMA3.2-11B | 0.4127 | 0.1000 | 0.12 | 0.395 | 0.645 | – | 0.295 | 0.150 | – | – | – | – |
| LLaMA3.2-90B | 0.5234 | 0.1140 | 0.20 | 0.575 | 0.790 | – | 0.460 | 0.220 | – | – | – | – |
| GPT4o-mini | 0.5075 | 0.1240 | 0.34 | 0.550 | 0.880 | 0.2600 | 0.415 | 0.265 | 0.880 | 0.350 | 0.255 | 0.215 |
| GPT4o | 0.6232 | 0.1080 | 0.36 | 0.740 | 0.830 | 0.1600 | 0.610 | 0.215 | 0.930 | 0.405 | 0.305 | 0.275 |
| UrbanLLaVA-VILA1.5-8B | 0.5773 | 0.1120 | 0.50 | 0.920 | 0.935 | 0.6950 | 0.885 | 0.880 | 0.835 | 0.490 | 0.645 | 0.520 |
| vs. VILA1.5-8B | +26.19% | +12.00% | +257.14% | +166.67% | +37.50% | +47.87% | +276.60% | +450.00% | +5.03% | +55.56% | +148.08% | +112.24% |
| vs. Best Baseline | -7.36% | -24.32% | +38.89% | +24.32% | +6.25% | +25.23% | +45.08% | +122.78% | -10.22% | +20.99% | +92.54% | +62.50% |
Table 8. Main results on UBench at New York. UrbanLLaVA achieves better performance than other models in most tasks.
6.3. Ablation Studies / Parameter Analysis
The paper conducts several ablation studies and parameter analyses to understand the influences of different training strategies and data compositions.
6.3.1. Effects of Training Strategies (UTrain)
The paper investigates various training strategies to achieve stable training and strong downstream performance; the UTrain multi-stage pipeline is central to this.
The following figure (Figure 5 from the original paper) illustrates the performance of three-stage tuning:
The image is a bar chart comparing accuracy (%) across tasks such as TrajPredict, Navigation, and SAT-Address for models trained with different numbers of tuning stages.
Figure 5. The performance of three-stage tuning; the gray part is the default tuning method for MLLMs.
The following figure (Figure 6 from the original paper) illustrates the effects of the order between knowledge learning and task alignment in two-stage and three-stage tuning:
The image is a bar chart showing accuracy on various urban tasks for different stage orderings, including K→TA→Mix, TA→K→Mix, and one-stage K + TA; overall, the multi-stage approaches perform better.
Figure 6. (b) The effects of the order between knowledge learning and task alignment in two-stage tuning. (c) The effects of the order between knowledge learning and task alignment in three-stage tuning.
- Multi-stage vs. Single-stage Training: The three-stage order TA→K→Mix (Task Alignment → Knowledge Learning → Mixture Learning) performs best across most tasks and maintains reliable performance. It significantly surpasses the one-stage K + TA setting (knowledge learning and task alignment merged directly), which represents the default tuning method for MLLMs. This indicates the benefit of explicitly decoupling these learning processes.
- Order of Task Alignment (TA) and Knowledge Learning (K):
  - In two-stage training, K→TA (knowledge first, then task alignment) slightly outperforms TA→K. This suggests that acquiring foundational knowledge before aligning with specific tasks can be beneficial.
  - However, once the mixture learning stage is added (yielding three-stage training), TA→K→Mix achieves better results than K→TA→Mix. The hypothesis is that doing TA first lets the model become familiar with task formats, which mixture learning can then effectively leverage, even if the initial knowledge is less developed. If K comes first, the model already possesses considerable capabilities, so the impact of mixture learning is less significant.

These findings confirm that the proposed UTrain three-stage pipeline effectively integrates cross-modal data to achieve stable training and balanced performance across various urban tasks. A minimal sketch of such a staged pipeline is given below.
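A minimal, illustrative sketch of the staged tuning orders compared above, under the assumption that each stage is a standard supervised fine-tuning pass over a different slice of the instruction data (`run_pipeline`, `finetune`, and the data lists are placeholders, not the authors' training code):

```python
from typing import Callable, Sequence

def run_pipeline(model, stages: Sequence[tuple], finetune: Callable):
    """Fine-tune `model` on each stage's instruction data, in the given order."""
    for name, data in stages:
        print(f"stage {name}: {len(data)} samples")
        model = finetune(model, data)  # one SFT pass per stage
    return model

# Hypothetical instruction subsets standing in for UData slices.
task_alignment = ["<task-format example>"]   # TA: downstream task formats
knowledge = ["<urban-knowledge example>"]    # K: domain knowledge
mixture = task_alignment + knowledge         # Mix: joint rehearsal of both

three_stage = [("TA", task_alignment), ("K", knowledge), ("Mix", mixture)]  # best order reported
one_stage = [("K+TA", knowledge + task_alignment)]                          # default MLLM tuning

model = run_pipeline("VILA1.5-8B", three_stage, lambda m, d: m)  # dummy finetune for illustration
```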
6.3.2. Generalization Study
The paper also evaluates UrbanLLaVA's generalization capabilities on general MLLM benchmarks and across different cities.
The following table (Table 3 from the original paper) shows UrbanLLaVA's performance on general benchmarks:
| Test@General | LLaVA-Bench (In-the-Wild) | RealWorldQA | MM-Vet |
| Metric | Rating Score | ACC | Rating Score |
| VILA1.5-8B | 60.75 | 0.3765 | 0.3518 |
| Ours-8B | 58.95 | 0.4052 | 0.3239 |
Table 3. General benchmark results. Rating Score refers to the result from the LLM-as-a-judge method with GPT4o. For LLaVA-Bench, scores range from 0 to 100; for MM-Vet, scores range from 0.0 to 1.0. Higher scores indicate better performance.
- UrbanLLaVA (Ours-8B) maintains competitive stability on general benchmarks such as LLaVA-Bench, RealWorldQA, and MM-Vet. While its Rating Score on LLaVA-Bench and MM-Vet is slightly lower than that of the base VILA1.5-8B, it shows an improvement in RealWorldQA accuracy. This suggests that the specialized urban training does not significantly degrade its general visual and language understanding abilities, which is crucial for general urban intelligence.

The following figure (Figure 7 from the original paper) illustrates the generalization across cities:
Figure 7. Our UrbanLLaVA trained with Beijing data and tested on London and New York. The bar chart shows the performance score comparisons among Beijing, London, and New York across different tasks such as GeoQA and TrajPredict. The darker bars represent our model, while the lighter bars indicate the baseline, illustrating the trend of our model outperforming the baseline in these tasks.
UrbanLLaVA trained on Beijing data shows competitive capabilities when tested on the London and New York benchmarks. Performance improvements over the base model are observed across all tasks in London and New York, even for out-of-domain data. This indicates that UrbanLLaVA can generalize to different data distributions and tasks, suggesting the presence of similarity structures across cities that go beyond superficial differences.
6.3.3. Data Ablation Study
This study investigates the influence of different data compositions within UData. The results are shown in Table 4 (assuming a one-stage training strategy for efficiency).
The following table (Table 4 from the original paper) shows the data ablation study results:
| Task | Data View | GeoQA | TrajPredict | Navigation | STV-Address | STV-Landmark | STV-Outlier | SAT-Address | SAT-Landuse | SceneComp | SceneFunc | ImgRetrieval | CameraLoc |
| Metric | | Avg. Acc | Top-1 | Success Rate | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy |
| Ours (all data) | – | 0.5741 | 0.0711 | 0.8550 | 0.8750 | 0.7450 | 0.7850 | 0.3600 | 0.7800 | 0.5500 | 0.5050 | 0.7300 | 0.5100 |
| Ours w/o CityQA | Local | 0.5409 | 0.0822 ↑ | 0.8700 | 0.8900 | 0.7150 | 0.6950 ↓ | 0.4000 | 0.8050 | 0.5400 | 0.5200 | 0.7750 | 0.5200 |
| w/o STV | Local | 0.5192 ↓ | 0.0622 | 0.4300 ↓ | 0.7300 ↓ | 0.4700 ↓ | 0.7200 ↓ | 0.4200 ↑ | 0.6700 ↓ | 0.4900↓ | 0.4550 ↓ | 0.6250 ↓ | 0.4250 ↓ |
| w/o Traj-Text&Nav | Trajectory | 0.4769 ↓ | 0.0644 | 0.8100 | 0.8800 | 0.6350↓ | 0.7050 ↓ | 0.0000 ↓ | 0.7600 | 0.4950 ↓ | 0.4300 ↓ | 0.6800 ↓ | 0.4600 ↓ |
| w/o Traj-Vision | Trajectory | 0.5590 | 0.0690 | 0.8350 | 0.9050 | 0.7300 | 0.7100 ↓ | 0.3000 ↓ | 0.8000 | 0.5150 | 0.4650 | 0.7150 | 0.4950 |
| w/o SAT-Single | Global | 0.5345 | 0.0778 | 0.8600 | 0.9100 | 0.5550↓ | 0.4550 ↓ | 0.3800 | 0.7800 | 0.5150 | 0.4100 ↓ | 0.7200 | 0.4800 |
| w/o SAT-Multi | Global | 0.5420 | 0.0778 | 0.8500 | 0.8700 | 0.6200 ↓ | 0.6800 ↓ | 0.3400 | 0.6450 ↓ | 0.3500 ↓ | 0.3400 ↓ | 0.3950 ↓ | 0.2600 ↓ |
Table 4. Data ablation study. Red arrows (↓) indicate significant performance drops, and green arrows (↑) indicate significant performance increases; the significance threshold differs between the TrajPredict task and the other tasks. All models are trained using the one-stage strategy to optimize experimental efficiency.
- Local View Data (w/o CityQA, w/o STV):
  - Removing CityQA (textual urban geography) generally leads to slight drops in accuracy across various tasks, showing its importance for general geospatial knowledge.
  - Removing STV (street view data) results in significant performance deterioration across both single-modal (e.g., STV-Address, STV-Landmark) and multi-modal tasks, highlighting the critical role of locality knowledge for overall urban understanding.
- Trajectory View Data (w/o Traj-Text&Nav, w/o Traj-Vision):
  - Removing Traj-Text&Nav (text-based trajectory and navigation instructions) causes a substantial drop, especially in Navigation (0% accuracy) and other related tasks, confirming its necessity for understanding continuous urban spaces and navigation.
  - Removing Traj-Vision (visual-trajectory data) also leads to notable drops, particularly for tasks involving visual guidance in trajectories, although not as severe as Traj-Text&Nav for Navigation.
- Global View Data (w/o SAT-Single, w/o SAT-Multi):
  - Removing SAT-Single (single satellite image data for urban knowledge) impacts tasks like STV-Outlier and SceneFunc, demonstrating its role in understanding specific urban areas from an overhead perspective.
  - Removing SAT-Multi (multiple satellite images for correlations and cross-alignment with street views) leads to significant drops in ImgRetrieval, CameraLoc, and SceneFunc, proving its essential role in empowering the MLLM to handle urban tasks from a global, interconnected view.

In summary, the ablation studies reveal that all components of UData (local, trajectory, and global views, and their sub-components) are crucial for UrbanLLaVA's comprehensive performance, particularly for tasks directly involving those data types. A minimal sketch of how such "w/o X" mixtures can be assembled follows below.
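As referenced above, a minimal sketch of how "w/o X" data mixtures for this kind of ablation can be constructed (subset names follow Table 4; the empty lists are placeholders, not UData itself):

```python
# Full instruction mixture keyed by the UData subsets named in Table 4.
udata = {
    "CityQA": [], "STV": [],                  # local view
    "Traj-Text&Nav": [], "Traj-Vision": [],   # trajectory view
    "SAT-Single": [], "SAT-Multi": [],        # global view
}

def ablated_mixture(exclude: str) -> list:
    """Training mixture with exactly one subset removed."""
    return [ex for name, subset in udata.items() if name != exclude for ex in subset]

# One training run per ablation, each using the same one-stage recipe.
ablations = {f"w/o {name}": ablated_mixture(name) for name in udata}
print({k: len(v) for k, v in ablations.items()})
```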
6.3.4. Effects of Other Training Parameters (Supplementary Material)
The supplementary material provides additional analysis on other training parameters:
- Learning Rate (Figure 10a): A lower learning rate than the VILA default yields a smoother and lower training loss curve. This indicates that a reduced learning rate is more robust for training with mixed domain-specific structured instruction data, helping the model handle features from different modalities more stably.
- Modality Separation (Figure 10b): Training with text and vision data together in one stage yields better results than separating them (one stage: text, or two stages: text then vision). This suggests the benefit of early multi-modal integration during training.
- Trained Components (Figure 10c): Experiments with different training components (T-LLM-Proj for text, V-LLM-Proj for vision) show little difference. This implies that the specific choice of which parts of the MLLM (e.g., LLM vs. projector) are trained for each modality might not be as impactful as the overall data strategy or learning rate.

A hedged sketch of this kind of hyperparameter comparison is given below.
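A hedged sketch of the kind of hyperparameter grid behind Figure 10. The numeric values are illustrative placeholders rather than the paper's exact settings; only the qualitative finding that a lower-than-default learning rate is more stable comes from the text above:

```python
# Each entry would be passed to the same SFT launcher; only one axis varies at a time.
configs = [
    {"lr": 1e-5, "schedule": "text+vision jointly", "trained": "LLM + projector"},  # lower LR: smoother loss
    {"lr": 5e-5, "schedule": "text+vision jointly", "trained": "LLM + projector"},  # default-magnitude LR
    {"lr": 1e-5, "schedule": "text, then vision",   "trained": "LLM + projector"},  # modality-separated stages
]
for cfg in configs:
    print(cfg)
```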
6.3.5. Effects of Training Data Size (Supplementary Material)
The following figure (Figure 11 from the original paper) presents training results with different amounts of training data, exhibiting the high quality of UData:
The chart shows how performance varies with the proportion of training data used: curves for GeoQA, Geo+Traj, Geo+STV, Geo+SAT, Geo+SS, and MMScore generally rise as the amount of training data increases, with GeoQA benefiting the most.
Figure 11. Scaling law from training data size to performance.
- The figure demonstrates that performance generally improves as the amount of training data increases across the task groups (GeoQA, Geo+Traj, Geo+STV, Geo+SAT, Geo+SS). This scaling law suggests that UData is of high quality and that UrbanLLaVA benefits from more data, indicating potential for further improvement with even larger datasets. A minimal sketch of such a data-scaling sweep follows below.
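As mentioned above, a minimal sketch of a data-scaling sweep like the one in Figure 11 (`train_and_eval` stands in for the actual training plus UBench evaluation and is supplied by the caller; it is not a function from the paper's code):

```python
import random

def scaling_sweep(full_data, train_and_eval, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Train and evaluate on increasing random fractions of the instruction data."""
    rng = random.Random(seed)
    scores = {}
    for frac in fractions:
        subset = rng.sample(full_data, int(len(full_data) * frac))
        scores[frac] = train_and_eval(subset)
    return scores
```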
6.3.6. Effects of Base Model (Supplementary Material)
The following table (Table 9 from the original paper) shows the generalizability of methods on Qwen2.5-VL:
| Task Group @ Beijing | GeoQA | Geo+Traj | Geo+STV | Geo+SAT | Geo+SS |
| Qwen2.5-VL-7B-Instruct | 0.4324 | 0.2192 | 0.4467 | 0.2850 | 0.2225 |
| + Finetuned with UData | 0.5720↑ | 0.1876 | 0.6833↑ | 0.4800↑ | 0.3800↑ |
Table 9. Evaluating the generalizability of the methods on Qwen2.5-VL.
- The results show that UrbanLLaVA's methodology (fine-tuning with UData) is model-agnostic and can be generalized to different MLLMs. When applied to Qwen2.5-VL-7B-Instruct, fine-tuning with UData leads to significant performance improvements across most task groups (GeoQA, Geo+STV, Geo+SAT, Geo+SS). While Geo+Traj shows a slight decrease, the overall trend confirms the generalizability of the UData approach.
6.3.7. Effects of Model Size (Supplementary Material)
The following figure (Figure 12 from the original paper) shows the effects of model size:
The chart shows UrbanLLaVA's performance at different model sizes (3B, 8B, 13B) on tasks such as GeoQA, STV, and SAT, illustrating the effect of model scale on task performance.
Figure 12. Results on UrbanLLaVA with different model sizes.
- Performance generally improves with increasing parameter size for the VILA1.5 models (from 3B to 13B), which is a common trend in LLMs and MLLMs.
- However, for certain tasks, models of different sizes exhibit similar capabilities. This occurs either because the tasks are inherently challenging (e.g., TrajPredict, where even larger models struggle significantly) or relatively easy (e.g., SAT-Landuse, where even smaller models perform well).
- The minimal performance improvement from VILA1.5-8B to VILA1.5-13B is attributed to the capabilities of the underlying LLM backbones (LLaMA3-8B and LLaMA2-13B). The authors suggest that a larger base model such as VILA1.5-40B (if computational resources were available) could potentially yield much better performance.
6.4. Case Study
The paper includes case studies to qualitatively demonstrate UrbanLLaVA's capabilities in handling challenging urban tasks compared to general MLLMs.
6.4.1. SceneFunc Task
- Task: Identify which satellite image contains the highest concentration of a specified category of Points of Interest (POIs). This requires multi-image understanding and comparison.

The following figure (Figure 7 from the original paper) shows an example of the SceneFunc task:
The image shows four satellite images with their corresponding POI analysis; the prompt asks which image contains the most dining-related POIs, and the answer indicates that the third image covers a major commercial area and likely has the highest concentration of dining businesses.
Figure 7. An example of the SceneFunc task, where correct answers are in green, wrong ones in red.
- Observation: While the base model VILA1.5-8B fails to answer the question, UrbanLLaVA successfully provides the correct answer. This highlights UrbanLLaVA's strong capabilities in multi-image understanding and comparison within an urban context, making it competitive with closed-source models. A hedged sketch of what such a multi-image benchmark item might look like is given below.
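As noted above, a hedged sketch of what a SceneFunc-style multiple-choice, multi-image benchmark item and a simple answer check could look like (field names and file names are illustrative, not UBench's actual schema):

```python
scenefunc_item = {
    "task": "SceneFunc",
    "images": ["tile_a.png", "tile_b.png", "tile_c.png", "tile_d.png"],
    "question": ("Which satellite image shows the highest concentration of "
                 "dining-related POIs? Answer with A, B, C, or D."),
    "options": ("A", "B", "C", "D"),
    "answer": "C",
}

def extract_choice(reply: str, options=("A", "B", "C", "D")):
    """Return the first option letter mentioned in the model's reply, if any."""
    for ch in reply:
        if ch in options:
            return ch
    return None

print(extract_choice("The third image (C) has the densest dining POIs.") == scenefunc_item["answer"])  # True
```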
6.4.2. STV-Outlier Task
- Task: Compare multiple street view images from a trajectory and identify the outlier image that does not belong. This demands implicit logical reasoning.

The following figure (Figure 8 from the original paper) shows an example of the STV-Outlier task:
The image shows an urban road scene with four candidate images; the reference image shows a city road with a bicycle lane, and the candidates cover different scenes, some of which lack similar features.
Figure 8. An example of the STV-Outlier task.
- Observation: VILA1.5-8B fails to identify the scene of the reference image. GPT-4o-mini comes closer but is still confused by a wrong option. UrbanLLaVA successfully performs this task, showcasing its ability to understand multiple images and conduct high-level implicit logical reasoning in an urban context, outperforming general MLLMs.
6.4.3. Additional Case Studies (Supplementary Material)
The supplementary material provides more case studies:
- SAT-LandUse (Figure 13): The model needs to infer the land use type (e.g., commercial, residential, agricultural) from a satellite image. UrbanLLaVA responds precisely, demonstrating accurate image perception, instruction following, and mastery of urban knowledge.

The image shows a multiple-choice question asking for the most likely land use type given a satellite image; the correct answer is marked in green, our model answers B, and the explanation states that the area's land use is residential.
Figure 13. An example of the SAT-LandUse task. The correct answer from the model is marked in green; our model's response is in bold. The explanation was written by a human for this question and answer.
- STV-Landmark (Figure 14): The task is to find the closest landmark feature to a given street view image. UrbanLLaVA correctly identifies the landmark, showcasing its ability to conduct logical reasoning in a multi-modal context.

The image is a partial view of a city road, showing traffic and surroundings on the right side of the road: a white car passing by, a nearby bus stop, and city buildings and trees in the background, reflecting a realistic urban traffic scene.
Figure 14. An example of the STV-Landmark task. The correct answer from the model is marked in green; our model's response is in bold. The explanation was written by a human for this question and answer.
- SAT-Address (Figure 15): The model infers the most probable address description based on a satellite image.

The image shows a SAT-Address example with a satellite image and descriptions of the surrounding environment; based on the image, option B, which describes a residential area, is the best choice.
Figure 15. Example of a SAT-Address task.
- STV-Address (Figure 16): The model infers the most probable address at which a street view image was taken.

The image is a real street view of a city street, showing an open road and buildings with a white car passing by, depicting an everyday urban scene.
Figure 16. Example of a STV-Address task.
- SceneComp (Figure 17): Given four satellite remote sensing images, the model chooses the one with the most buildings.

The image shows four satellite images of different urban areas; the question asks which image contains the densest buildings, and the reference answer is A.
Figure 17. An example of a SceneComp task.
- ImgRetrieval (Figure 18): Evaluates the capability to map a given street view image to the corresponding satellite image.

The image shows urban environment data from different viewpoints, including street views, roads, and buildings, which UrbanLLaVA uses to handle multi-modal data.
Figure 18. An example of an ImgRetrieval task.
- CameraLoc (Figure 19): Requires the model to infer which quadrant of a satellite image corresponds to the location where a given street view image was captured. A toy sketch of this quadrant computation is given at the end of this subsection.

The image illustrates the CameraLoc task with two views: an aerial view of an urban area on the left and a street-level camera view on the right, contrasting spatial localization with environmental perception.
Figure 19. An example of a CameraLoc task.
These case studies collectively illustrate UrbanLLaVA's enhanced spatial cognition, multi-modal understanding, and reasoning abilities in complex urban scenarios, capabilities that general MLLMs often lack.
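For the CameraLoc task above, a toy sketch of the underlying geometry: given a satellite tile's bounding box and the street-view camera's coordinates, determine which quadrant of the tile the camera falls in. This is purely illustrative; the benchmark's actual construction may differ.

```python
def camera_quadrant(lat, lon, south, west, north, east):
    """Return 'NW', 'NE', 'SW', or 'SE' relative to the tile center."""
    center_lat = (south + north) / 2.0
    center_lon = (west + east) / 2.0
    ns = "N" if lat >= center_lat else "S"
    ew = "E" if lon >= center_lon else "W"
    return ns + ew

# A camera slightly north-east of the tile center.
print(camera_quadrant(39.9065, 116.392, south=39.900, west=116.380, north=39.910, east=116.400))  # NE
```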
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces UrbanLLaVA, a novel multi-modal large language model (MLLM) specifically designed to enhance urban spatial cognition. UrbanLLaVA addresses a critical gap in urban research by providing a unified framework capable of integrating and processing four major types of urban data simultaneously: urban visual data (street view and satellite images), geo-text, structured geospatial data, and spatiotemporal series data.
The core contributions include:
- UData: A meticulously curated, diverse urban instruction dataset that systematically covers urban environments from the location view to the trajectory view and the global view, crucial for cross-modality alignment.
- UTrain: An innovative three-stage training pipeline (task alignment, knowledge learning, mixture learning) that effectively decouples spatial reasoning enhancement from domain knowledge learning, leading to stable training and superior performance.
- UBench: An extended and comprehensive benchmark for evaluating MLLMs on a wide array of urban tasks, including several newly introduced complex multi-modal challenges.

Experimental results across Beijing, London, and New York unequivocally demonstrate UrbanLLaVA's effectiveness. It significantly outperforms both open-source and proprietary general MLLMs on diverse urban tasks and shows robust generalization across cities without sacrificing performance on general MLLM benchmarks. In summary, UrbanLLaVA represents a significant step towards building a unified foundation model with powerful perception and reasoning abilities for general urban intelligence.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose clear directions for future work:
- Model Size Exploration: Current experiments primarily focus on the 8B model (specifically, VILA1.5-8B). The full potential of UData and UTrain on larger MLLMs (e.g., VILA1.5-40B) remains to be realized, which could lead to significantly better performance.
- UBench Refinement: The UBench benchmark can be further improved. Refining task design and testing MLLMs' overall multi-modal capabilities from an even more fine-grained perspective would enhance its utility.
- Inclusion of More Modalities: The current model integrates four types of urban data. However, other important modalities such as video data and time series data (beyond just trajectories) could be included to provide a more complete picture of urban intelligence.

In the future, the authors plan to:

- Extend UrbanLLaVA to incorporate more diverse data types relevant to urban research.
- Tackle more advanced urban tasks that arise from various interdisciplinary fields.
7.3. Personal Insights & Critique
This paper presents a highly valuable contribution to the nascent field of urban intelligence powered by MLLMs. The rigorous approach to data curation (UData), the thoughtful multi-stage training (UTrain), and the comprehensive evaluation (UBench) are particularly commendable.
Inspirations and Applications:
- Unified Urban Understanding: The core idea of creating a single MLLM for comprehensive urban understanding, rather than fragmented task-specific models, is incredibly powerful. This paradigm could revolutionize urban planning, resource management, disaster response, and smart city development by providing a holistic, AI-driven cognitive layer.
- Cross-City Generalization: The demonstrated cross-city generalization ability is crucial. It suggests that models trained on one city's data can be effectively deployed in others, significantly reducing the cost and effort of city-specific model development. This implies that urban patterns and structures have underlying commonalities that MLLMs can learn.
- Methodology for Domain Adaptation: The UTrain framework, with its decoupling of task alignment and knowledge learning, offers a valuable blueprint for adapting general MLLMs to other complex, multi-modal domains beyond urban intelligence (e.g., environmental science, industrial automation, cultural heritage). The insight into the learning rate's significance for heterogeneous data is also very practical.
- Data-centric AI in Urban Research: The emphasis on systematic and multi-view data generation (UData) underscores the importance of data engineering in MLLM development, especially for specialized fields. The quality and structure of domain-specific data are paramount.
Potential Issues, Unverified Assumptions, or Areas for Improvement:

- Data Bias and Representativeness: While UData is extensive, urban data often carries inherent biases (e.g., richer data in central areas, privacy concerns in certain locations). The paper does not deeply discuss how these potential biases in the source data might translate into UrbanLLaVA's cognition or how they are mitigated. This is a critical consideration for real-world urban applications.
- Interpretability and Explainability: As MLLMs become more capable, understanding why they make certain decisions in complex urban tasks (e.g., recommending a particular traffic route, inferring land use) becomes vital, especially for decision-makers. The paper focuses on performance but less on the interpretability of its spatial reasoning.
- Dynamic Data and Real-time Capabilities: Urban environments are constantly changing. While spatiotemporal series data is included, the paper does not explicitly discuss how UrbanLLaVA would handle continuous, real-time data streams or adapt to rapid urban changes. This could be a significant challenge for practical deployment.
- Computational Cost for Larger Models: The authors acknowledge that testing with larger base models (VILA1.5-40B) was limited by resources. The training time (10.7 hours on 4×A100 for Beijing) is significant. Scaling UrbanLLaVA to larger models or more cities might incur prohibitive computational costs, raising questions about its practical scalability without further optimization or more efficient architectures.
- Human-in-the-Loop Integration: For urban intelligence to be truly effective, AI models need to integrate seamlessly with human experts and decision-making processes. The paper presents UrbanLLaVA as an analytical tool, but the interfaces and interaction mechanisms with urban planners or policymakers are not discussed.

Overall, UrbanLLaVA is a groundbreaking work that effectively bridges the gap between general MLLMs and the complex, multi-modal demands of urban intelligence. Its methodical approach to data, training, and evaluation sets a high standard for future research in this exciting interdisciplinary domain.