
MTP: Exploring Multimodal Urban Traffic Profiling with Modality Augmentation and Spectrum Fusion


TL;DR Summary

This paper introduces MTP, a multimodal framework for urban traffic profiling that learns features from numeric, visual, and textual perspectives, overcoming the limitations of unimodal approaches, and demonstrating superior performance across six real-world datasets.

Abstract

With rapid urbanization in the modern era, traffic signals from various sensors have been playing a significant role in monitoring the states of cities, which provides a strong foundation in ensuring safe travel, reducing traffic congestion and optimizing urban mobility. Most existing methods for traffic signal modeling often rely on the original data modality, i.e., numerical direct readings from the sensors in cities. However, this unimodal approach overlooks the semantic information existing in multimodal heterogeneous urban data in different perspectives, which hinders a comprehensive understanding of traffic signals and limits the accurate prediction of complex traffic dynamics. To address this problem, we propose a novel Multimodal framework, MTP, for urban Traffic Profiling, which learns multimodal features through numeric, visual, and textual perspectives. The three branches drive for a multimodal perspective of urban traffic signal learning in the frequency domain, while the frequency learning strategies delicately refine the information for extraction. Specifically, we first conduct the visual augmentation for the traffic signals, which transforms the original modality into frequency images and periodicity images for visual learning. Also, we augment descriptive texts for the traffic signals based on the specific topic, background information and item description for textual learning. To complement the numeric information, we utilize frequency multilayer perceptrons for learning on the original modality. We design a hierarchical contrastive learning on the three branches to fuse the spectrum of three modalities. Finally, extensive experiments on six real-world datasets demonstrate superior performance compared with the state-of-the-art approaches.


1. Bibliographic Information

1.1. Title

MTP: Exploring Multimodal Urban Traffic Profiling with Modality Augmentation and Spectrum Fusion

1.2. Authors

The paper lists eight authors from various academic institutions and a state information center:

  • Haolong Xiang (1, 2) - School of Software, Nanjing University of Information Science and Technology, Nanjing, P.R. China; State Key Lab. for Novel Software Technology, Nanjing University, P.R. China

  • Peisi Wang (1) - School of Software, Nanjing University of Information Science and Technology, Nanjing, P.R. China

  • Xiaolong Xu (1) - School of Software, Nanjing University of Information Science and Technology, Nanjing, P.R. China

  • Kun Yi (3) - State Information Center, P.R. China

  • Xuyun Zhang (4) - School of Computing, Macquarie University, NSW, Australia

  • Quan Z. Sheng (4) - School of Computing, Macquarie University, NSW, Australia

  • Amin Beheshti (4) - School of Computing, Macquarie University, NSW, Australia

  • Wei Fan (5, *) - School of Computer Science, University of Auckland, Auckland, New Zealand

    Their affiliations suggest expertise in software, information science, artificial intelligence, and urban computing, spanning institutions in China, Australia, and New Zealand.

1.3. Journal/Conference

The paper does not explicitly state the journal or conference it is published in, but it is available as a preprint on arXiv. Given the publication date in 2025, it is likely intended for a reputable conference or journal in AI, data mining, or intelligent transportation systems.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the limitations of existing urban traffic signal modeling methods, which primarily rely on unimodal numerical sensor data and overlook rich semantic information in heterogeneous urban data. To overcome this, the authors propose MTP, a novel Multimodal framework for urban Traffic Profiling. MTP learns multimodal features from numeric, visual, and textual perspectives, utilizing frequency learning strategies for refinement. Specifically, it visually augments traffic signals into frequency images and periodicity images for visual learning, and augments descriptive texts based on topic, background, and item description for textual learning. The original numerical modality is processed using frequency multilayer perceptrons. A hierarchical contrastive learning mechanism is designed to fuse the "spectrum" of these three modalities. Extensive experiments on six real-world datasets demonstrate MTP's superior performance compared to state-of-the-art approaches.

https://arxiv.org/abs/2511.10218v2 (Preprint) PDF Link: https://arxiv.org/pdf/2511.10218v2.pdf

2. Executive Summary

2.1. Background & Motivation

The rapid urbanization of modern cities heavily relies on traffic signals from various sensors to monitor urban states, which is crucial for ensuring safe travel, reducing congestion, and optimizing urban mobility.

The core problem the paper aims to solve is the unimodal limitation of most existing traffic signal modeling methods. These methods predominantly rely on numerical direct readings from sensors, which is a unimodal approach. This approach overlooks the rich semantic information present in multimodal heterogeneous urban data (e.g., surveillance images, textual descriptions, social media feedback). This oversight hinders a comprehensive understanding of traffic signals and limits the accuracy of predicting complex traffic dynamics.

Challenges in prior research include:

  • Static Feature Extraction: Traditional methods often use static features and assume stable data distribution, failing to adapt to the dynamic and temporal characteristics of traffic.

  • Temporal Dependence: Traffic data has strong temporal dependencies (e.g., peak vs. off-peak hours, extreme weather), making static methods inaccurate for cross-time profiling.

  • Unimodal Nature of Neural Networks: While neural networks excel at time series modeling, they are typically designed for single modalities and cannot effectively integrate spatial, visual, and textual features for deep correlation, leading to limited utilization of multimodal information.

  • Limitations of LLMs/VLMs in Traffic: Existing Large Language Models (LLMs) and Vision-Language Models (VLMs) are often optimized for specific modalities (e.g., LLMs for text, VLMs for image-text) but struggle with dynamic changes of time series features or lack the ability to model the time dimension effectively for traffic classification.

    The paper's entry point or innovative idea is to explicitly address this unimodal limitation by augmenting the original numerical traffic signals into visual and textual modalities and then learning multimodal features in the frequency domain from these three perspectives (numeric, visual, and textual). This approach aims to leverage the semantic richness of heterogeneous data for a more comprehensive understanding and accurate prediction of traffic dynamics.

2.2. Main Contributions / Findings

The primary contributions of the paper are:

  • A Novel Multimodal Framework (MTP): The proposal of MTP, a new framework for urban traffic profiling, which is the first to augment multimodal features directly from traffic signals and learn them from numeric, visual, and textual perspectives in the frequency domain.

  • Modality Augmentation Strategies:

    • Visual Augmentation: Transforming original numerical traffic signals into frequency images and periodicity images to enable visual learning and capture spatial and temporal patterns.
    • Textual Augmentation: Generating descriptive texts for traffic signals based on specific topics, background information, and item descriptions, thereby enriching semantic information for textual learning.
    • Numeric Processing: Utilizing frequency multilayer perceptrons (MLPs) to process the original numerical modality in the frequency domain, enhancing its feature extraction capabilities.
  • Hierarchical Contrastive Learning for Spectrum Fusion: Designing a hierarchical contrastive learning mechanism across the three augmented modalities to effectively fuse their "spectrum" representations. This optimizes multimodal learning by aligning semantic information and integrating diverse features.

  • Extensive Experimental Validation: Demonstrating MTP's superior performance through extensive experiments on six real-world datasets, consistently outperforming state-of-the-art baselines in F1-score and Accuracy. The ablation studies confirm the crucial contribution of each augmented modality and the fusion mechanism.

  • Qualitative Analysis: Providing t-SNE visualizations that show MTP's ability to learn highly cohesive and well-separated clusters for different traffic states, indicating strong discriminative power.

    These findings solve the problem of limited understanding and prediction accuracy in urban traffic profiling due to reliance on unimodal data. By integrating diverse perspectives and leveraging frequency domain analysis, MTP achieves a more robust and comprehensive representation of complex traffic dynamics.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the MTP framework, several foundational concepts are crucial for a beginner.

  • Multimodal Data: Refers to data that comes from different sources or modalities. In this paper, the modalities are numerical time series (sensor readings), visual (images), and textual (descriptive text). Integrating information from multiple modalities often leads to a more comprehensive understanding than using a single modality alone.
  • Urban Traffic Profiling: The task of analyzing and characterizing urban traffic conditions. This can involve identifying traffic states (e.g., smooth, slow, congested) or detecting specific events (e.g., accidents, road construction). The goal is to provide insights for traffic management, planning, and optimization.
  • Frequency Domain: A way of representing signals (like time series data) based on their frequency components rather than their amplitude over time. The Fourier Transform is a mathematical tool used to convert a signal from the time domain (or spatial domain for images) to the frequency domain. Analyzing data in the frequency domain can reveal periodic patterns or dominant oscillations that might be hidden in the time domain.
  • Fast Fourier Transform (FFT): An efficient algorithm to compute the Discrete Fourier Transform (DFT). Given a sequence of data points in the time domain, FFT transforms it into a sequence of complex numbers, each representing the amplitude and phase of a sine wave at a specific frequency.
  • Inverse Fast Fourier Transform (IFFT): The inverse operation of FFT, which converts data from the frequency domain back into the time domain (or spatial domain).
  • Multilayer Perceptron (MLP): A fundamental type of feedforward artificial neural network. It consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node (neuron) is a simple processor that takes a set of inputs, computes a weighted sum, adds a bias, and applies a non-linear activation function (like ReLU). MLPs are used for tasks like classification and regression.
  • Activation Function (ReLU): Rectified Linear Unit. A common non-linear activation function used in neural networks. It outputs the input directly if it is positive, and zero otherwise. Mathematically, $ReLU(x) = \max(0, x)$. It helps neural networks learn complex patterns.
  • Contrastive Learning: A self-supervised learning paradigm where a model learns representations by contrasting similar and dissimilar pairs of data. The goal is to bring representations of similar (positive) pairs closer in the embedding space while pushing representations of dissimilar (negative) pairs further apart. This helps the model learn discriminative features.
    • InfoNCE Loss: A specific form of contrastive loss often used in contrastive learning. It maximizes the agreement between different views (e.g., augmented versions or different modalities) of the same data point relative to other negative samples.
  • Jensen-Shannon (JS) Divergence: A method to measure the similarity between two probability distributions. It is based on Kullback-Leibler (KL) divergence but is symmetric and always finite, making it a well-behaved metric for comparing distributions. A lower JS Divergence indicates higher similarity between distributions.
  • Cross-Entropy Loss: A common loss function used in classification tasks, especially when the output is a probability distribution. It measures the difference between the predicted probability distribution and the true distribution. For multi-class classification, it aims to minimize the negative logarithm of the predicted probability for the true class.
    • For a single sample with $m$ classes, if $y_i$ is 1 for the true class and 0 otherwise, and $y_i^{\prime}$ is the predicted probability for class $i$, the loss is $-\sum_{i=1}^{m} y_i \log(y_i^{\prime})$.
  • t-SNE (t-distributed Stochastic Neighbor Embedding): A dimensionality reduction technique used for non-linear visualization of high-dimensional data. It maps high-dimensional data points to a lower-dimensional space (typically 2D or 3D) such that similar data points are modeled by nearby points and dissimilar data points are modeled by distant points with high probability. It's often used to visualize how well a model separates different classes in its learned feature space.
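
To ground the contrastive-learning and divergence concepts above, here is a minimal PyTorch sketch (ours, not from the paper) of an InfoNCE loss between two batches of embeddings and of the Jensen-Shannon divergence between two class-probability distributions; all tensor shapes and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """InfoNCE: row i of `anchor` and row i of `positive` form a positive pair;
    every other row in the batch acts as a negative."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature              # pairwise cosine similarities
    targets = torch.arange(a.size(0))           # the diagonal holds the positives
    return F.cross_entropy(logits, targets)

def js_divergence(p, q, eps=1e-12):
    """Symmetric, bounded divergence between two probability distributions."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum(dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# toy usage: 4 samples, 8-dim embeddings from two "modalities", 3-class posteriors
x_num, x_vis = torch.randn(4, 8), torch.randn(4, 8)
p1 = torch.softmax(torch.randn(4, 3), dim=-1)
p2 = torch.softmax(torch.randn(4, 3), dim=-1)
print(info_nce(x_num, x_vis).item())
print(js_divergence(p1, p2))
```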

3.2. Previous Works

The paper categorizes previous works into Traditional Traffic Time Series Profiling, Traffic Profiling with LLMs, and Traffic Profiling with VLMs.

Traditional Traffic Time Series Profiling

These methods primarily rely on single-modality data, specifically structured time series.

  • Deep Learning Techniques:
    • Convolutional Neural Networks (CNNs): (He et al. 2016; Alam et al. 2023) Excel at capturing local patterns and hierarchical features.
    • Recurrent Neural Networks (RNNs): (Jin et al. 2017; Zheng et al. 2020) Good for sequential data, capturing temporal dependencies. Long Short-Term Memory (LSTM) is a common variant.
    • Graph Neural Networks (GNNs): (Zhang et al. 2023; Deng, Wang, and Xue 2024) Used for data with graph structures, like road networks, to model spatial relationships.
    • Transformer-based methods: (Lin et al. 2022a; Zerveas et al. 2021) Leverage self-attention mechanisms to capture long-range dependencies in time series data. TST (Zerveas et al. 2021) is a direct application of the Transformer encoder to time series. PatchTST (Nie et al. 2023) processes time series as sequences of patches.
  • Limitations: These methods are inherently unimodal. They cannot integrate semantic information from images or text, making them insufficient to capture dynamic, real-world traffic conditions comprehensively.

Traffic Profiling with LLMs

Large Language Models (LLMs) are powerful in text understanding and generalization (Khattar et al. 2019; Feng et al. 2024).

  • Applications in Transportation:
    • Qian et al. (2021): A multimodal framework combining BERT (a Transformer-based LLM for text) and ResNet (a CNN for images) to capture contextual information for fake news detection.
    • Chen et al. (2024): LLM-driven frameworks for optimizing vehicle dispatching and navigation.
    • Yan et al. (2024): UrbanCLIP enhances textual information and fuses it with images via contrastive learning for urban region profiling.
  • Limitations:
    • LLMs are mostly optimized for single modalities (text). While they can understand event descriptions, they lack the ability to model the time dimension effectively for dynamic traffic changes.
    • They struggle to parse dynamic changes of time series features when applied to image data (Gruver et al. 2023).

Traffic Profiling with VLMs

Vision-Language Models (VLMs) excel at jointly processing and understanding visual and textual information.

  • Examples: BLIVA (Hu et al. 2024), EMMA (Yang et al. 2024), OmniActions (Li et al. 2024). These models show strong capabilities in visual question answering and multimodal interaction.
  • Limitations: These methods have not fully combined multimodal data to generate powerful representations specifically for road traffic profiling. While they can bridge visual and textual information, their direct application to traffic time series and diverse traffic classification tasks is underexplored.

3.3. Technological Evolution

The evolution of traffic profiling technologies can be seen as a progression:

  1. Early Statistical Methods: Relied on static features and statistical models, assuming stable data distributions. These were simple but quickly became insufficient for dynamic traffic.
  2. Unimodal Deep Learning (Time Series): The advent of CNNs, RNNs, GNNs, and Transformers allowed for modeling complex temporal and spatial dependencies in raw time series data. This marked a significant improvement in capturing dynamics but remained limited to numerical sensor readings.
  3. Emergence of Multimodal Approaches: As more diverse data (images, text) became available, researchers started exploring ways to combine them, often using concatenation or simple fusion techniques. Early attempts might involve training separate models for each modality and then merging their outputs.
  4. LLMs and VLMs for Domain-Specific Tasks: Recent breakthroughs in LLMs and VLMs demonstrated powerful multimodal capabilities in general domains. Researchers began adapting these models to specific transportation tasks, showing potential but also revealing challenges in handling time series dynamics and deeply integrating various traffic-specific modalities.
  5. MTP's Position: MTP represents a step forward in integrating LLM/VLM-like capabilities specifically for urban traffic profiling. It moves beyond simply using existing LLMs/VLMs by actively augmenting the core time series data into visual and textual forms and performing frequency-domain analysis and hierarchical contrastive fusion. This positions MTP at the forefront of multimodal traffic analysis, aiming for a comprehensive understanding by synthesizing diverse data perspectives.

3.4. Differentiation Analysis

Compared to the main methods in related work, MTP presents several core differences and innovations:

  • Active Modality Augmentation: Unlike most methods that either use existing multimodal data or rely solely on time series, MTP actively generates visual and textual representations from the original numerical time series. This is a key differentiator, as it creates multimodal perspectives even when only raw numerical data is initially available.

  • Frequency Domain Learning: A unique aspect of MTP is its focus on processing all modalities (numerical, augmented visual, augmented textual) within the frequency domain. This allows for the extraction of multi-scale and periodic features that are crucial for understanding traffic dynamics, which is not a common strategy in other multimodal traffic models.

  • Specialized Encoders for Augmented Modalities: MTP designs specific frequency-domain MLPs for numerical data, and multi-scale convolution combined with FFT and FIR filters for visual data, and text encoders with FFT and FIR filters for textual data. These are tailored to extract relevant features from the augmented representations.

  • Hierarchical Contrastive Spectrum Fusion: MTP proposes a sophisticated fusion mechanism that combines supervised and unsupervised contrastive learning with Jensen-Shannon (JS) divergence-based distribution similarity fusion. This multi-layered fusion strategy aims to align the "spectrum" (frequency-domain features) of the three modalities, leading to more robust and discriminative combined representations, which is more advanced than simple concatenation or early/late fusion strategies.

  • Comprehensive Traffic Profiling: While LLMs and VLMs have been applied to urban data, MTP specifically targets the traffic profiling task by integrating the often-overlooked temporal dynamics (via time series branch) with the semantic richness of visual and textual cues derived directly from traffic signals, thereby providing a more holistic view compared to task-specific LLM/VLM applications.

    In essence, MTP's innovation lies in its proactive generation of multimodal data from unimodal sources, its novel application of frequency domain analysis across these modalities, and its sophisticated fusion strategy to capture a deeper, more comprehensive understanding of urban traffic.

4. Methodology

4.1. Principles

The core idea behind MTP is to overcome the limitations of unimodal traffic profiling by creating and leveraging multimodal features from numerical traffic signals. The theoretical basis is that urban traffic signals contain rich semantic information that can be better understood when viewed from numeric, visual, and textual perspectives. By transforming the original numerical data into frequency images, periodicity images, and descriptive texts, MTP aims to capture diverse patterns that are hard to discern from raw numbers alone. The framework then processes these modalities in the frequency domain to extract multi-scale and periodic features, which are particularly relevant for time-series data like traffic. Finally, a hierarchical contrastive learning approach is used to fuse these distinct modal representations, ensuring semantic alignment and robust feature integration. This holistic approach is designed to provide a more comprehensive understanding and accurate prediction of complex traffic dynamics.

4.2. Core Methodology In-depth (Layer by Layer)

The MTP framework consists of three main modal encoder branches and a feature fusion scheme. As shown in Figure 1, these branches are: a) time series modality encoder, b) vision modality encoder, and c) text modality encoder. The goal is for the fused features to simultaneously retain the temporal patterns of the numerical modality, the intuitive patterns of the visual modality, and the semantic information of the text modality.

4.2.1. Time Series Modality Encoder

This module processes the original numerical traffic time series data. Its components include semantic embedding, Fast Fourier Transform (FFT), frequency-domain Multi-Layer Perceptrons (MLPs), and Inverse Fast Fourier Transform (IFFT).

  1. Semantic Embedding: Inspired by word embeddings, the input numerical data $\mathbf{X} \in \mathbb{R}^{n \times l}$ (where $n$ is the number of time steps and $l$ is the feature dimension at each time step) is first mapped into a hidden representation $\mathbf{D} \in \mathbb{R}^{n \times l \times m}$. This is achieved using a learnable weight vector $\psi \in \mathbb{R}^{l \times m}$. The process is denoted as: $ \mathbf{D} = \mathbf{X} \times \psi $

    • $\mathbf{X}$: The original numerical input data.
    • $\psi$: A learnable weight vector used for semantic embedding.
    • $\mathbf{D}$: The hidden representation of the input data, incorporating richer semantic information.
  2. Fast Fourier Transform (FFT): The semantic embedding $\mathbf{D}$ is then converted from the spatial (time) domain to the frequency domain. This step allows the model to extract multi-scale and periodic features inherent in the traffic time series. The Fourier transform of the time series embedding $\mathbf{D}$ is defined as: $ \mathcal{D}^{v}[k] = \sum_{i = 0}^{n - 1}\mathbf{D}[i]e^{-j\frac{2\pi k i}{n}} $

    • $\mathcal{D}^{v}[k]$: The $k$-th frequency component for the numerical modality; the superscript $v$ denotes the numerical ("value") modality.
    • $i$: The index over time-domain data points, ranging from $0$ to $n-1$.
    • $n$: The total number of time-domain data points.
    • $k$: The index over frequency-domain components.
    • $j$: The imaginary unit, where $j^2 = -1$.
    • $e^{-j\frac{2\pi k i}{n}}$: The complex exponential term, which expands via Euler's formula to $\cos(\frac{2\pi ki}{n}) - j\sin(\frac{2\pi ki}{n})$. This transformation yields the numerical spectrum at frequency $2\pi ki/n$.
  3. Frequency-Domain Multi-Layer Perceptrons (MLPs): The obtained frequency components $\mathcal{D}^{v}[k]$ are then fed into frequency-domain MLPs. These MLPs perform non-linear mapping and feature extraction on the frequency-domain features, enhancing their expressive ability to capture periodic and abnormal patterns in traffic time series. The operation is given by: $ \mathcal{H}_i = FMLA(\mathcal{D}^v, W, B) $

    • $\mathcal{H}_i$: The output of the Frequency-domain MLP Activation (FMLA) for the $i$-th component.

    • $\mathcal{D}^v$: The input frequency components from the FFT.

    • $W$: Complex weight matrix.

    • $B$: Complex bias vector.

      The concrete process of the frequency-domain MLPs for one layer is shown in the green box of Figure 1 and defined by: $ \mathcal{Z} = ReLU(\mathcal{H}W + B) $

    • $\mathcal{Z}$: The output of one layer of the frequency-domain MLP.

    • ReLU: The Rectified Linear Unit activation function, applied element-wise.

    • $\mathcal{H}$: Input to the current MLP layer (either $\mathcal{D}^v$ or the output of a previous MLP layer).

    • $W$: Complex weight matrix for the current layer.

    • $B$: Complex bias vector for the current layer.

      If the MLP consists of $l$ layers, the input of each layer is the output $\mathcal{Z}^{l-1}$ of the previous layer. The complex weight matrix $W$ and bias $B$ are written as $W = W_i + \eta W_j$ and $B = B_i + \eta B_j$, where $W_i, B_i$ are the real parts, $W_j, B_j$ are the imaginary parts, and $\eta$ is the imaginary unit. The complex multiplication with ReLU is then: $ \mathcal{Z}^{l} = ReLU(\mathcal{O}(\mathcal{Z}^{l-1})W_{i}^{l} - \mathcal{I}(\mathcal{Z}^{l-1})W_{j}^{l} + B_{i}^{l}) + \eta\, ReLU(\mathcal{O}(\mathcal{Z}^{l-1})W_{j}^{l} + \mathcal{I}(\mathcal{Z}^{l-1})W_{i}^{l} + B_{j}^{l}) $

    • $\mathcal{Z}^{l}$: The output of the $l$-th layer of the frequency-domain MLP.

    • $\mathcal{O}(\mathcal{Z}^{l-1})$: The real part of the complex input $\mathcal{Z}^{l-1}$.

    • $\mathcal{I}(\mathcal{Z}^{l-1})$: The imaginary part of the complex input $\mathcal{Z}^{l-1}$.

    • $W_i^l, W_j^l$: Real and imaginary parts of the weight matrix for the $l$-th layer.

    • $B_i^l, B_j^l$: Real and imaginary parts of the bias vector for the $l$-th layer.

    • $\eta$: Imaginary unit. This equation applies ReLU separately to the real and imaginary components after complex multiplication.

  4. Inverse Fast Fourier Transform (IFFT): After processing by the frequency-domain MLPs, the optimized frequency-domain features are converted back to the spatial domain using IFFT. This yields features that retain frequency-domain information but are represented in the original spatial/time domain, suitable for concatenation and similarity calculations (a code sketch of this FFT, frequency-MLP, and IFFT pipeline follows this list). The IFFT is computed as: $ D^{v}[i] = \sum_{k = 0}^{n - 1}\mathcal{D}^{v}[k]e^{j\frac{2\pi ki}{n}} $

    • $D^{v}[i]$: The $i$-th time-domain data point after inverse transformation.
    • $\mathcal{D}^{v}[k]$: The $k$-th frequency component (optimized by the MLPs).
    • $e^{j\frac{2\pi ki}{n}}$: The complex exponential term for the IFFT.
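
As a concrete illustration of this branch, the sketch below (our own minimal reconstruction, not the authors' released code) embeds a numeric series, applies an FFT along the time axis, passes the complex spectrum through one complex-valued linear layer with ReLU applied separately to the real and imaginary parts, and maps the result back with the inverse FFT; layer sizes and initialization are assumptions.

```python
import torch
import torch.nn as nn

class FrequencyMLPBranch(nn.Module):
    """Minimal sketch of the time series branch: embed -> FFT -> complex MLP -> IFFT."""
    def __init__(self, n_feats, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Linear(n_feats, emb_dim)            # semantic embedding (X * psi)
        # real and imaginary parts of the complex weight matrix and bias
        self.w_r = nn.Parameter(torch.randn(emb_dim, hidden_dim) * 0.02)
        self.w_i = nn.Parameter(torch.randn(emb_dim, hidden_dim) * 0.02)
        self.b_r = nn.Parameter(torch.zeros(hidden_dim))
        self.b_i = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x):                                   # x: (batch, n_steps, n_feats)
        d = self.embed(x)                                   # (batch, n_steps, emb_dim)
        spec = torch.fft.rfft(d, dim=1)                     # FFT over the time axis
        re, im = spec.real, spec.imag
        # complex multiplication with ReLU applied to real and imaginary parts separately
        out_r = torch.relu(re @ self.w_r - im @ self.w_i + self.b_r)
        out_i = torch.relu(re @ self.w_i + im @ self.w_r + self.b_i)
        spec_out = torch.complex(out_r, out_i)
        return torch.fft.irfft(spec_out, n=x.size(1), dim=1)  # back to the time domain

# toy usage: batch of 2 series, 24 time steps, 3 sensor channels
feats = FrequencyMLPBranch(n_feats=3, emb_dim=16, hidden_dim=16)(torch.randn(2, 24, 3))
print(feats.shape)  # torch.Size([2, 24, 16])
```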

4.2.2. Vision Modality Encoder

This module extracts visual features by first converting traffic time series data into images, then processing these images in the frequency domain.

  1. Image Generation from Time Series: The essence is to transform spatial-domain traffic time series data into visual images, converting the numerical modality into a visual one. The paper mentions two types of images: frequency images and periodicity images.

    • Frequency Domain Encoder: FFT is applied to extract frequency information from the input data.
    • Periodicity Domain Encoder: For each time stamp $t$, a periodic encoding is generated: $ \pmb{P}_t = [\sin(2\pi t / \phi), \cos(2\pi t / \phi)] $
      • $\pmb{P}_t$: The periodic encoding for time $t$.
      • $\phi$: A periodicity hyperparameter that controls the cycle length of the sine and cosine waves. These frequency representations and periodicity encodings are concatenated with the original input to form a new representation $\mathbf{X}^g$.
  2. Multi-scale Convolution: Multi-scale convolution is employed to extract hierarchical temporal patterns.

    • A 1D convolutional layer first captures local dependencies.
    • Two subsequent 2D convolutional layers are used: one halves the channel dimension, and another maps features to multiple output channels, aiming to capture global temporal structures.
    • The output features are then resized to the desired image dimensions (e.g., 64x64 pixels as per Appendix C) via bilinear interpolation and subjected to normalization. This process results in numerical representations of the generated images.
  3. FFT for Image Features: The numerical representations (image features), denoted as $\mathcal{X}^g$, are converted into the frequency domain by FFT: $ \mathcal{X}^g[k] = \sum_{i = 0}^{n - 1}\mathcal{X}^g[i]e^{-j\frac{2\pi k}{\lambda}} $

    • $\mathcal{X}^g[k]$: The $k$-th frequency component of the image features.
    • $\mathcal{X}^g[i]$: The $i$-th spatial-domain component of the image features.
    • $\lambda$: This symbol appears to be a typo and likely should be $n$ (the length of the sequence or relevant dimension), as in Equations (2) and (6). We assume it represents the sequence length for the FFT.
  4. Finite Impulse Response (FIR) Filter: To reduce noise and focus on core information in the frequency-domain features of the augmented images, an FIR filter is introduced. It is constructed using the Hamming window technique.

    • Hamming Window Function: Given a filter length $s$, the window coefficients $\omega_i$ are generated by: $ \omega_{i} = 0.54 - 0.46\cos(2\pi i / (s - 1)) $
      • $\omega_i$: The $i$-th coefficient of the Hamming window.
      • $s$: The length of the filter.
      • $i$: Index from $0$ to $s-1$. The paper writes $z\pi i / (s - 1)$, but the standard Hamming window uses $2\pi i / (s-1)$; we assume $z = 2$ and that $s-1$ spans the window range.
    • Actual Impulse Response: The actual impulse response $r^i$ is obtained by multiplying the window function with the filter's ideal impulse response $r^{\prime}$: $r^i = \omega_i \cdot r^{\prime}$.
    • Filter Bank: These impulse responses form a filter bank $\mathbf{R} = [r^1, r^2, \dots, r^s]$, which divides the input spectrum into multiple sub-bands, filtering out key features within the corresponding frequency ranges.
  5. Spectrum Compression: After filtering, spectrum compression is performed: $ \mathcal{X}_{spe}^{g} = \sum_{i = 1}^{s}\frac{1}{c}|\mathcal{X}^{g}|^{2}\odot r^{i} $

    • $\mathcal{X}_{spe}^{g}$: The compressed spectrum for the visual modality.
    • $c$: The length of the image modality (a normalization factor related to its dimensions).
    • $|\mathcal{X}^{g}|^2$: The power spectrum of the image features in the frequency domain.
    • $r^i$: The $i$-th filter from the filter bank.
    • $\odot$: Element-wise multiplication. This operation weights the spectrum, retaining important frequency components and weakening redundant information.
  6. Average Pooling: Average pooling is introduced to further reduce high-frequency noise and random fluctuations, making the spectrum more regular and improving compression efficiency: $ \mathcal{X}_{pool}^{g} = Average(\mathcal{X}_{spe}^{g}\odot \delta^{g}) $

    • $\mathcal{X}_{pool}^{g}$: The average-pooled spectrum for the visual modality.
    • Average(): The average pooling operation.
    • $\delta^{g}$: A matrix with the same dimensions as $\mathcal{X}_{spe}^{g}$, likely used for masking or weighting before pooling.
  7. Cross-modal Fusion (Image with Text): The enhanced spectrum of the images is further refined with the help of the text modality: $ \mathcal{X}_{out}^{g} = \mathcal{X}_{spe}^{g}\odot \mathcal{X}_{pool}^{t} $

    • $\mathcal{X}_{out}^{g}$: The output spectrum for the visual modality after cross-modal enhancement.
    • $\mathcal{X}_{spe}^{g}$: The compressed spectrum of the visual modality.
    • $\mathcal{X}_{pool}^{t}$: The pooled spectrum of the text modality (from the text encoder). This implies an element-wise interaction in the frequency domain.
  8. IFFT: Finally, IFFT is applied to convert the frequency-domain features $\mathcal{X}_{out}^{g}$ back into spatial representations $\pmb{X}^{g}[i]$ (a code sketch of this branch follows the list): $ \pmb{X}^{g}[i] = \sum_{k = 0}^{n - 1}\mathcal{X}_{out}^{g}[k]e^{j\frac{2\pi ki}{T}} $

    • $\pmb{X}^{g}[i]$: The $i$-th spatial-domain feature for the visual modality.
    • $\mathcal{X}_{out}^{g}[k]$: The $k$-th output frequency component for the visual modality.
    • $T$: The length of the sequence or relevant dimension for the IFFT, analogous to $n$.
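
The sketch below (ours, with assumed shapes, filter sizes, and a 64x64 target resolution; the paper's two 2D convolution layers are simplified to a single 1D convolution) illustrates the main ingredients of this branch: periodicity encoding, convolution over the augmented series, bilinear resizing to an image, and a Hamming-window FIR filter bank followed by spectrum compression and pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def periodicity_encoding(n_steps, period=24):
    """P_t = [sin(2*pi*t/phi), cos(2*pi*t/phi)] for every time stamp t."""
    t = torch.arange(n_steps, dtype=torch.float32)
    return torch.stack([torch.sin(2 * torch.pi * t / period),
                        torch.cos(2 * torch.pi * t / period)], dim=-1)

def hamming_filter_bank(s, n_bins):
    """A crude Hamming-window filter bank splitting the spectrum into s sub-bands."""
    i = torch.arange(s, dtype=torch.float32)
    window = 0.54 - 0.46 * torch.cos(2 * torch.pi * i / (s - 1))
    bank = torch.zeros(s, n_bins)
    band = n_bins // s
    for b in range(s):
        bank[b, b * band:(b + 1) * band] = window[b]   # weight each sub-band
    return bank

def vision_branch(x, period=24, s=4):
    """x: (batch, n_steps) univariate series -> compressed spectrum features."""
    batch, n = x.shape
    enc = periodicity_encoding(n, period).T.unsqueeze(0).expand(batch, -1, -1)  # (b, 2, n)
    feats = torch.cat([x.unsqueeze(1), enc], dim=1)                             # (b, 3, n)
    conv = nn.Conv1d(3, 8, kernel_size=3, padding=1)      # local patterns (trained in practice)
    img = F.interpolate(conv(feats).unsqueeze(1), size=(64, 64),
                        mode="bilinear", align_corners=False)                   # (b, 1, 64, 64)
    spec = torch.fft.rfft(img, dim=-1)                    # per-row spectrum of the image
    power = spec.abs() ** 2                               # power spectrum |X^g|^2
    bank = hamming_filter_bank(s, power.size(-1))
    compressed = torch.einsum("bchk,sk->bchs", power, bank) / power.size(-1)    # compression
    return compressed.mean(dim=-1)                        # average pooling over sub-bands

print(vision_branch(torch.randn(2, 48)).shape)            # torch.Size([2, 1, 64])
```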

4.2.3. Text Modality Encoder

This module processes textual information, either pre-defined or generated, to extract semantic features.

  1. Text Generation (if needed): If the input data lacks complete textual information, it can be generated.

    • LLMs (e.g., ChatGPT) can generate item descriptions to enhance semantic information.
    • Contextual information (topic, background, vehicle position) can be extracted directly from input data. If textual information is already complete, it's fed directly into the text encoder.
  2. Text Encoder and FFT: The text encoder generates a vector $X^t$. This vector is then converted to the frequency domain using FFT: $ \mathcal{X}^{t}[k] = \sum_{i = 0}^{n - 1}X^{t}[i]e^{-j\frac{2\pi ki}{n}} $

    • $\mathcal{X}^{t}[k]$: The $k$-th frequency component of the text features.
    • $X^{t}[i]$: The $i$-th component of the text vector.
  3. FIR Filter, Average Pooling, and Cross-modal Enhancement: As in the vision modality encoder, the text frequency features are processed by an FIR filter and average pooling. Cross-modal enhancement is then applied, this time with the image modality: $ \mathcal{X}_{out}^{t} = \mathcal{X}_{spe}^{t}\odot Average(\mathcal{X}_{spe}^{g}\odot \delta^{g}) $

    • $\mathcal{X}_{out}^{t}$: The output spectrum for the text modality after cross-modal enhancement.
    • $\mathcal{X}_{spe}^{t}$: The compressed spectrum for the text modality (obtained via the FIR filter and spectrum compression, analogous to $\mathcal{X}_{spe}^{g}$).
    • $Average(\mathcal{X}_{spe}^{g}\odot \delta^{g})$: Equivalent to $\mathcal{X}_{pool}^{g}$ from the vision modality, i.e., the average-pooled spectrum of the visual modality.
  4. IFFT: Finally, IFFT inverts these frequency-domain features $\mathcal{X}_{out}^{t}$ into spatial-domain features $X_{out}^t$ for further cross-modal fusion (a sketch of this text branch follows the list).
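
As a hedged illustration of the text branch, the sketch below assembles a description from topic, background, and sensor-reading fields (the template and field names are our assumptions; the paper uses an LLM such as ChatGPT for item descriptions), embeds it with a deliberately simple placeholder encoder, and moves the embedding into the frequency domain.

```python
import torch

def build_description(topic, background, position, reading):
    """Assemble a descriptive sentence from structured fields (template is our assumption)."""
    return (f"Topic: {topic}. Background: {background}. "
            f"Sensor position: {position}. Observed reading: {reading:.1f}.")

def toy_text_embedding(text, dim=128):
    """Placeholder embedding: MTP would use a proper text encoder here instead."""
    vec = torch.zeros(dim)
    for i, ch in enumerate(text.encode("utf-8")):
        vec[i % dim] += ch / 255.0
    return vec

desc = build_description("morning peak traffic", "weekday, light rain",
                         "sensor 17, Bay Area highway", 54.3)
emb = toy_text_embedding(desc)          # X^t
spec = torch.fft.rfft(emb)              # text features in the frequency domain
print(desc)
print(spec.shape)                       # torch.Size([65])
```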

4.2.4. Cross-modal Fusion

After each modality (numerical, visual, textual) has been processed through its respective encoder and spectral transformations, the features are fused using contrastive learning and distribution similarity fusion.

  1. Contrastive Learning: The purpose of contrastive loss is to semantically align cross-modal features. It reduces the distance between features from the same traffic scene (positive pairs) across different modalities, while increasing the distance between features from irrelevant scenes (negative pairs).

    • Supervised Loss ($\mathcal{L}(SUP)$): For labeled data, a supervised loss is calculated. Given a data instance $x_i$, its encoded features $x_{i}^{\prime}$ (numerical, visual, or textual) and real features $s_{i}$ (ground truth or target features), and a dataset with $m$ categories $\mathcal{Y} = \{\mathcal{M}_1, \mathcal{M}_2, \dots, \mathcal{M}_m\}$, the total supervised loss is: $ \mathcal{L}(SUP) = \sum_{X}\sum_{y}\Big(\sum_{x^{\prime}\in \mathcal{M}_{i}}\frac{1}{|\mathcal{M}_{i}|}\sum_{s\in \mathcal{M}_{i}, x^{\prime}\neq s}\big[\mathcal{L}_{i}(x^{\prime v},s^{v}) + \mathcal{L}_{i}(x^{\prime g},s^{g}) + \mathcal{L}_{i}(x^{\prime t},s^{t})\big]\Big) $

      • $\mathcal{L}(SUP)$: The total supervised loss.
      • $X$: The entire dataset.
      • $y$: A specific category.
      • $\mathcal{M}_i$: The set of instances belonging to category $i$.
      • $x^{\prime v}, x^{\prime g}, x^{\prime t}$: Encoded features from the numerical, visual, and textual modalities, respectively.
      • $s^{v}, s^{g}, s^{t}$: Real (ground-truth) features for the numerical, visual, and textual modalities, respectively. The notation compares each modality's encoded feature to its corresponding ground truth.
      • $\mathcal{L}_i(x^{\prime}, s)$: A loss function (e.g., MSE or a distance metric) comparing an encoded feature $x^{\prime}$ to a real feature $s$. The sums average over instances within a category and then accumulate across categories.
    • Unsupervised Loss ($\mathcal{L}(UNS)$ - InfoNCE Loss): Unsupervised learning captures differences between modalities by aligning their features. The InfoNCE loss (He et al. 2020) is used to calculate similarity: $ \mathcal{L}(UNS) = \frac{1}{3|X|}\sum_{i = 1}^{|X|}\big[\mathcal{L}_{v}(x_{i}^{v},x_{i}^{g},x_{i}^{t}) + \mathcal{L}_{g}(x_{i}^{g},x_{i}^{t},x_{i}^{v}) + \mathcal{L}_{t}(x_{i}^{t},x_{i}^{v},x_{i}^{g})\big] $

      • $\mathcal{L}(UNS)$: The total unsupervised loss.
      • $|X|$: The total number of data instances.
      • $\mathcal{L}_{v}(x_{i}^{v},x_{i}^{g},x_{i}^{t})$: The InfoNCE loss with the numerical modality $x_i^v$ as anchor, contrasting it against positive samples (the other modalities of the same instance, $x_i^g, x_i^t$) and negative samples (other instances' modalities). Each modality in turn serves as the anchor, pulling the other two modalities of the same instance closer while pushing other instances away.
  2. Distribution Similarity Fusion: To ensure semantic consistency of cross-modal features, Jensen-Shannon (JS) divergence is used to assess similarity between the probability distributions of the different modal features. Given a data instance $x$, its posterior probability in the numerical modality is $\mathbb{I}(\alpha^{v} | x^{v})$. The JS divergence between each pair of modalities is calculated and then averaged: $ \Delta = \frac{1}{3}\big(JS(\mathbb{I}(\alpha^v | x^v)\,\|\,\mathbb{I}(\alpha^g | x^g)) + JS(\mathbb{I}(\alpha^v | x^v)\,\|\,\mathbb{I}(\alpha^t | x^t)) + JS(\mathbb{I}(\alpha^{g} | x^g)\,\|\,\mathbb{I}(\alpha^t | x^t))\big) $

    • $\Delta$: The averaged JS divergence across all pairs of modalities for a given instance.

    • $JS(\mathbb{I}(\alpha^A | x^A)\,\|\,\mathbb{I}(\alpha^B | x^B))$: The Jensen-Shannon divergence between the posterior probability distributions of modalities A and B, given their respective features $x^A$ and $x^B$.

    • $\mathbb{I}(\alpha^v | x^v)$: The posterior probability distribution for the numerical modality, given its features $x^v$; analogously for the visual ($\mathbb{I}(\alpha^g | x^g)$) and textual ($\mathbb{I}(\alpha^t | x^t)$) modalities.

      The new features $\hat{x}$ after distribution similarity fusion are obtained as: $ \hat{x} = (1 - \Delta)(K^v x^v + K^g x^g + K^t x^t) + \Delta x^v + \Delta x^g + \Delta x^t $

    • $\hat{x}$: The final fused feature representation.

    • $x^v, x^g, x^t$: The spatial-domain features from the numerical, visual, and textual encoders, respectively.

    • $K^v, K^g, K^t$: Training metrics for instance $x$ in each modality; these can be read as weights indicating the reliability or importance of each modality's features.

    • $\Delta$: The averaged JS divergence. This term acts as a weighting factor: if the modalities are very similar ($\Delta$ is small), the combined weighted sum is emphasized; if they are dissimilar ($\Delta$ is large), the individual modality features are added with more weight, i.e., a more direct combination.

  3. MLP Classifier and Fusion Loss ($\mathcal{L}(CE)$): The fused representation $\hat{x}$ is then fed into a Multi-Layer Perceptron (MLP) classifier to predict the label of each data instance. Since urban traffic profiling is a multi-classification problem, multi-class cross-entropy loss is used: $ \mathcal{L}(CE) = -\mathbb{E}_{y\sim \hat{Y}}\sum_{i = 1}^{m}y_{i}\log(y_{i}^{\prime}) $

    • $\mathcal{L}(CE)$: The cross-entropy loss.
    • $\mathbb{E}_{y\sim \hat{Y}}$: Expectation over the true label distribution $\hat{Y}$.
    • $m$: Number of classes.
    • $y_i$: The true label for class $i$ (1 if it is the true class, 0 otherwise, in a one-hot encoding).
    • $y_i^{\prime}$: The predicted probability that the instance belongs to class $i$.
  4. Full Objective Loss ($\mathcal{L}$): The overall objective combines the contrastive losses and the fusion loss: $ \mathcal{L} = \alpha \mathcal{L}(\mathrm{SUP}) + \beta \mathcal{L}(\mathrm{UNS}) + \gamma \mathcal{L}(CE) $

    • $\mathcal{L}$: The total loss minimized during training.

    • $\mathcal{L}(\mathrm{SUP})$: Supervised contrastive loss.

    • $\mathcal{L}(\mathrm{UNS})$: Unsupervised InfoNCE contrastive loss.

    • $\mathcal{L}(CE)$: Cross-entropy classification loss.

    • $\alpha, \beta, \gamma$: Hyperparameters balancing the influence of the different loss components.

      The overall framework diagram (Figure 1) visually summarizes this process:

      fig 1: A schematic diagram of the multimodal urban traffic framework, showing the time series modality encoder, the vision modality encoder, and the text modality encoder. FFT and multi-scale convolution process data in the frequency and periodicity domains for information enhancement and feature extraction.

The overview of our framework. MTP learns multimodal features in the frequency domain from three perspectives: numerical, visual, and textual. These modalities are fused to provide more comprehensive features for urban traffic profiling.
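
To consolidate Section 4.2.4, the sketch below (our reconstruction under stated assumptions: per-modality classification heads supply the posteriors, the weights $K^v, K^g, K^t$ are plain scalars, and the supervised contrastive term is omitted) combines the JS-divergence-weighted fusion with the InfoNCE and cross-entropy terms of the overall objective.

```python
import torch
import torch.nn.functional as F

def js_div(p, q, eps=1e-12):
    """Jensen-Shannon divergence between rows of two probability matrices."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def info_nce(a, b, tau=0.07):
    """Cross-modal InfoNCE: row i of `a` and row i of `b` are the positive pair."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau
    return F.cross_entropy(logits, torch.arange(a.size(0)))

def fuse_and_loss(x_v, x_g, x_t, post_v, post_g, post_t, K, classifier, labels,
                  alpha=0.1, beta=0.1, gamma=1.0):
    """x_*: per-modality features; post_*: per-modality class posteriors; K: (3,) weights."""
    delta = (js_div(post_v, post_g) + js_div(post_v, post_t) + js_div(post_g, post_t)) / 3
    delta = delta.unsqueeze(-1)                                  # Delta, per instance
    weighted = K[0] * x_v + K[1] * x_g + K[2] * x_t
    x_hat = (1 - delta) * weighted + delta * (x_v + x_g + x_t)   # distribution similarity fusion
    ce = F.cross_entropy(classifier(x_hat), labels)
    uns = (info_nce(x_v, x_g) + info_nce(x_g, x_t) + info_nce(x_t, x_v)) / 3
    sup = torch.tensor(0.0)   # supervised contrastive term omitted in this sketch
    return alpha * sup + beta * uns + gamma * ce, x_hat

# toy usage: 4 instances, 16-dim features, 3 traffic classes
B, D, C = 4, 16, 3
feats = [torch.randn(B, D) for _ in range(3)]
posts = [torch.softmax(torch.randn(B, C), dim=-1) for _ in range(3)]
clf = torch.nn.Linear(D, C)
loss, fused = fuse_and_loss(*feats, *posts, torch.ones(3), clf, torch.randint(0, C, (B,)))
print(loss.item(), fused.shape)
```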

5. Experimental Setup

5.1. Datasets

The experiments are conducted on six widely-used public benchmarks for time series classification.

  • Chinatown:
    • Source: UCR/UEA Time Series Classification Archive.
    • Characteristics: Records the number of pedestrians at two different locations in a Chinatown district. It's a multivariate time series classification task.
  • MelbournePedestrian (Melbourne):
    • Source: Not explicitly stated beyond "hourly pedestrian counts from 2015 to 2017".
    • Characteristics: Contains hourly pedestrian counts at various locations in the city center of Melbourne, Australia. Used for urban mobility pattern analysis and pedestrian flow prediction.
  • PEMS-BAY:
    • Source: California Department of Transportation's Performance Measurement System (PeMS).
    • Characteristics: A large-scale traffic dataset, including traffic data from 325 sensors in the San Francisco Bay Area over 6 months.
  • METR-LA:
    • Source: Not explicitly stated, but common in traffic forecasting research (likely public).
    • Characteristics: A large-scale traffic dataset containing traffic speed data from 207 sensors on Los Angeles County highways over 4 months. It's a classic multivariate time series dataset for traffic flow and speed forecasting.
    • Label Generation Strategy for the METR-LA Dataset (from Appendix F): To classify continuous speed recordings into discrete traffic states, the Level of Service (LOS) standards from the Highway Capacity Manual (HCM) are used. The six standard LOS grades (A-F) are consolidated into three categories: Low Congestion, Moderate Congestion, and High Congestion. The numerical thresholds are based on the "Percentage of Free-Flow Speed" (FFS) method, assuming an average FFS of approximately 65 mph on METR-LA highways (a labeling sketch follows this dataset list):
      • High Congestion (Traffic Breakdown): Speed $< 40$ mph (approximately $< 60\%$ of FFS). Corresponds to LOS E/F, indicating severe congestion.
      • Moderate Congestion (Transitional Flow): Speed $\in [40, 60]$ mph. Corresponds to LOS C/D, representing unstable but not fully broken-down traffic.
      • Low Congestion (Free Flow): Speed $> 60$ mph. Corresponds to LOS A/B, indicating smooth traffic flow. This physics-informed labeling ensures interpretability and consistency with transportation engineering principles.
  • DodgerLoopDay (DodgerLoop):
    • Source: UCR/UEA Time Series Classification Archive.
    • Characteristics: Contains counts of vehicles on a road leading to the Dodgers Stadium in Los Angeles. The classification task is to distinguish between game days and non-game days based on traffic patterns.
  • PEMS-SF:
    • Source: Caltrans Performance Measurement System (PeMS).

    • Characteristics: Contains traffic occupancy rates from 963 sensors on the highways of the San Francisco Bay Area. A widely used benchmark for traffic classification.

      These datasets were chosen because they represent diverse real-world urban traffic scenarios, ranging from pedestrian counts to vehicle speeds and occupancy, and cover various scales and geographical locations. They are effective for validating the method's performance across different types of time series classification problems in urban environments.
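
A minimal sketch of the speed-to-congestion labeling described for METR-LA above; the function name and the vectorized NumPy form are ours, and the 40/60 mph thresholds follow the analysis.

```python
import numpy as np

def label_congestion(speed_mph):
    """Map speeds (mph) to congestion classes using the 40/60 mph thresholds above:
    0 = Low Congestion (> 60), 1 = Moderate (40-60), 2 = High (< 40)."""
    speed = np.asarray(speed_mph, dtype=float)
    labels = np.full(speed.shape, 1, dtype=int)      # default: moderate congestion
    labels[speed > 60] = 0                           # free flow, LOS A/B
    labels[speed < 40] = 2                           # traffic breakdown, LOS E/F
    return labels

print(label_congestion([68.2, 55.0, 31.4, 60.0]))    # -> [0 1 2 1]
```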

5.2. Evaluation Metrics

The paper adopts a set of standard classification metrics: Accuracy, Macro-Precision, Macro-Recall, and Macro F1-Score.

  1. Accuracy (Acc):

    • Conceptual Definition: Accuracy measures the proportion of total predictions that were correct. It indicates how often the classifier is correct overall.
    • Mathematical Formula: $ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $
    • Symbol Explanation:
      • TP (True Positives): Number of instances correctly predicted as positive.
      • TN (True Negatives): Number of instances correctly predicted as negative.
      • FP (False Positives): Number of instances incorrectly predicted as positive (Type I error).
      • FN (False Negatives): Number of instances incorrectly predicted as negative (Type II error).
  2. Precision (Pre):

    • Conceptual Definition: Precision measures the proportion of positive predictions that were actually correct. It answers: "Of all items the model predicted as positive, how many are truly positive?" In multi-class classification, Macro-Precision calculates precision for each class independently and then averages them, treating all classes equally.
    • Mathematical Formula: For a single class: $Precision = \frac{TP}{TP + FP}$. For Macro-Precision: $Macro\text{-}Precision = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}$
    • Symbol Explanation:
      • TP: True Positives for a given class.
      • FP: False Positives for a given class.
      • $C$: Total number of classes.
      • $TP_c$: True Positives for class $c$.
      • $FP_c$: False Positives for class $c$.
  3. Recall (Rec):

    • Conceptual Definition: Recall (also known as Sensitivity or True Positive Rate) measures the proportion of actual positives that were correctly identified. It answers: "Of all actual positive items, how many did the model correctly identify?" In multi-class classification, Macro-Recall calculates recall for each class independently and then averages them.
    • Mathematical Formula: For a single class: $Recall = \frac{TP}{TP + FN}$. For Macro-Recall: $Macro\text{-}Recall = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}$
    • Symbol Explanation:
      • TP: True Positives for a given class.
      • FN: False Negatives for a given class.
      • $C$: Total number of classes.
      • $TP_c$: True Positives for class $c$.
      • $FN_c$: False Negatives for class $c$.
  4. F1-Score (F1):

    • Conceptual Definition: The F1-score is the harmonic mean of Precision and Recall. It provides a single metric that balances both precision and recall, especially useful when there is an uneven class distribution. Macro-F1 calculates the F1-score for each class and then averages them, giving equal weight to each class.
    • Mathematical Formula: For a single class: $F1\text{-}Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$. For Macro-F1: $Macro\text{-}F1 = \frac{1}{C} \sum_{c=1}^{C} \left( 2 \cdot \frac{Precision_c \cdot Recall_c}{Precision_c + Recall_c} \right)$
    • Symbol Explanation:
      • Precision: Precision for a given class.
      • Recall: Recall for a given class.
      • $C$: Total number of classes.
      • $Precision_c$: Precision for class $c$.
      • $Recall_c$: Recall for class $c$.
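
For concreteness, the short sketch below computes these macro-averaged metrics with scikit-learn on toy predictions; the data is illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# toy predictions over 3 traffic classes (0 = low, 1 = moderate, 2 = high congestion)
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Acc={acc:.4f}  Macro-P={prec:.4f}  Macro-R={rec:.4f}  Macro-F1={f1:.4f}")
```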

5.3. Baselines

The paper compares MTP against 8 state-of-the-art time series models. These models cover various techniques from Transformer-based architectures to shapelet-based methods and pre-training frameworks.

  • TST (Time Series Transformer) (Zerveas et al. 2021): Directly applies the standard Transformer encoder architecture to the spatial domain of time series. It uses a self-attention mechanism to capture pairwise relationships across all time steps.
  • ShapeNet (Cheng et al. 2021): A shapelet-based neural network for multivariate time series classification. It learns discriminative shapelets (short, representative subsequences) and feeds these extracted features into a fully connected network to capture both local patterns and global dependencies.
  • PatchTST (Nie et al. 2023): A recent Transformer-based model that treats a time series as a sequence of patches (subseries). It processes each channel independently, enabling effective representation learning for long-term forecasting and classification tasks.
  • LightTS (Zhang, Chen, and He 2023): A lightweight framework designed for time series classification. It employs an adaptive integrated distillation technique to transfer knowledge from multiple heterogeneous teacher models into a single lightweight student model.
  • SVP-T (Zuo et al. 2023): A pre-training framework for time series data that operates on two levels. It learns representations from both the shape-level (local patterns) and velocity-level (trend information) of the time series, aiming to create more robust features for downstream tasks.
  • ModernTCN (Wang et al. 2024c): A modernized version of the classic Temporal Convolutional Network (TCN). It incorporates modern CNN design principles, such as depthwise separable convolutions, to enhance model performance and scalability.
  • CAFO (Li, Wang, and Liu 2024): A convolutional attention-based backbone network designed for time series classification tasks. It effectively combines the local feature extraction capabilities of convolutional layers with the ability of attention mechanisms to capture long-range dependencies.
  • InterpGN (Wen, Ma et al. 2025): A framework that aims to combine model performance with interpretability for time series classification. It uses learnable shapelets as an interpretable module and fuses its output with a "black-box" network via a gated mechanism to provide both accurate and explainable predictions.

5.4. Implementation Details

The experiments are implemented using the PyTorch framework on a single NVIDIA RTX 3090 GPU.

  • Modality Generation:
    • Images: Created with a size of 64x64 pixels.
    • Texts: Maximum text length is set to 128 tokens.
  • Training:
    • Epochs: Trained for 50 epochs.
    • Batch Size: 64.
    • Optimizer: AdamW.
    • Learning Rate: Initial learning rate of 1e-4.
    • Weight Decay: 0.01.
    • Learning Rate Scheduler: Uses a linear warmup.
  • Loss Function Parameters:
    • Contrastive Loss Weight ($\alpha$): 0.1.
    • Temperature Coefficient ($\tau$): 0.07. (Note: the paper mentions $\tau$ but does not show it explicitly in the overall loss; it is typically part of the InfoNCE loss, where it scales the similarity logits.)
  • Validation: A five-fold cross-validation approach is employed.
  • Runs: All experiments are run 15 times, and the arithmetic average results are reported.
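
The training recipe above maps to a short PyTorch setup. The sketch below (ours; `model`, the warmup length, and the post-warmup schedule are assumptions, since the paper only states a linear warmup) wires up AdamW with the stated learning rate and weight decay plus a LambdaLR-based warmup.

```python
import torch

def build_optimizer_and_scheduler(model, warmup_steps=100):
    """AdamW (lr=1e-4, weight_decay=0.01) with a linear warmup, as in Section 5.4.
    `model` and `warmup_steps` are placeholders for this sketch."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps          # linear warmup to the base lr
        return 1.0                                    # constant afterwards (assumption)
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# toy usage with a stand-in model and random batches of size 64
model = torch.nn.Linear(16, 3)
optimizer, scheduler = build_optimizer_and_scheduler(model)
for step in range(5):                                 # one step per batch in practice
    optimizer.zero_grad()
    loss = model(torch.randn(64, 16)).sum()           # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())
```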

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that MTP consistently achieves state-of-the-art performance across various time series classification tasks on six real-world urban traffic datasets.

The following are the results from Table 1 of the original paper:

Each cell reports F1 / Accuracy.

| Dataset | ShapeNet | TST | PatchTST | SVP-T | LightTS | ModernTCN | CAFO | InterpGN | MTP |
|---|---|---|---|---|---|---|---|---|---|
| Chinatown | 0.7206 / 0.7259 | 0.9472 / 0.9563 | 0.9714 / 0.9767 | 0.9456 / 0.9592 | 0.9680 / 0.9708 | 0.9712 / 0.9767 | 0.9784 / 0.9825 | 0.9541 / 0.9659 | 0.9820 / 0.9839 |
| Melbourne | 0.7186 / 0.7314 | 0.8246 / 0.8421 | 0.8897 / 0.8873 | 0.8030 / 0.8065 | 0.8670 / 0.8655 | 0.8732 / 0.8786 | 0.8876 / 0.8860 | 0.8392 / 0.8364 | 0.9669 / 0.9635 |
| PEMS-BAY | 0.6365 / 0.6790 | 0.6712 / 0.6882 | 0.6838 / 0.6929 | 0.6573 / 0.6844 | 0.6736 / 0.6860 | 0.6950 / 0.7055 | 0.6637 / 0.6840 | 0.6770 / 0.6989 | 0.7091 / 0.7200 |
| METR-LA | 0.7186 / 0.7314 | 0.7143 / 0.7224 | 0.7295 / 0.7425 | 0.7158 / 0.7269 | 0.7113 / 0.7229 | 0.7483 / 0.7562 | 0.7158 / 0.7266 | 0.7262 / 0.7385 | 0.7590 / 0.7684 |
| DodgerLoop | 0.1500 / 0.2153 | 0.3523 / 0.4125 | 0.4535 / 0.5750 | 0.3817 / 0.4250 | 0.5156 / 0.5625 | 0.2442 / 0.3750 | 0.3607 / 0.4500 | 0.1519 / 0.2250 | 0.5676 / 0.6000 |
| PEMS-SF | 0.6373 / 0.6503 | 0.7900 / 0.7919 | 0.7468 / 0.7446 | 0.8215 / 0.8286 | 0.7384 / 0.7514 | 0.7594 / 0.7630 | 0.7857 / 0.7919 | 0.6246 / 0.6705 | 0.8310 / 0.8277 |

Analysis:

  • Consistent Superiority: MTP achieves the best F1-score and Accuracy on Chinatown, Melbourne, PEMS-BAY, METR-LA, and DodgerLoop. On PEMS-SF, MTP achieves the best F1-score and second-best Accuracy. This consistent outperformance across diverse datasets validates the effectiveness of the proposed multimodal augmentation and frequency-domain fusion.
  • Significant Gains:
    • On Melbourne, MTP achieves an F1-score of 0.9669 and Accuracy of 0.9635, substantially higher than the next best (PatchTST with 0.8897 F1).
    • For Chinatown, MTP reaches an F1-score of 0.9820 and Accuracy of 0.9839, demonstrating strong performance in datasets with clear, classifiable patterns. CAFO and PatchTST are competitive but fall short.
    • On the highly volatile DodgerLoop dataset, which often poses a challenge for models (indicated by generally lower scores), MTP still achieves the highest F1-score of 0.5676 and Accuracy of 0.6000, significantly outperforming baselines which often struggle to exceed 0.4 F1.
  • Scalability: MTP maintains its leading performance on large-scale traffic datasets like PEMS-BAY (F1: 0.7091, Acc: 0.7200) and METR-LA (F1: 0.7590, Acc: 0.7684), suggesting its robustness across different data volumes and complexities.
  • Comparison with Baselines:
    • Transformer-based models like TST and PatchTST perform well, but MTP generally surpasses them, highlighting the benefits of multimodal augmentation and frequency-domain processing over purely temporal modeling.
    • CAFO (Convolutional Attention) and SVP-T (pre-training framework) also show strong results, but MTP's fusion of diverse perspectives provides an additional edge. The results strongly validate that MTP's approach of learning more comprehensive feature representations through multimodal augmentation and spectrum fusion leads to state-of-the-art performance.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study

The ablation study rigorously evaluates the contribution of each core component of MTP by removing individual branches (visual, textual, time series) from the full framework.

The following are the results from Table 2 of the original paper:

| Variant | Melbourne Acc | Melbourne Pre | Melbourne Rec | Melbourne F1 | DodgerLoop Acc | DodgerLoop Pre |
|---|---|---|---|---|---|---|
| MTP | 0.9672 | 0.9671 | 0.9669 | 0.9669 | 0.600 | 0.6978 |
| w/o Visual | 0.7593 | 0.7617 | 0.7595 | 0.7584 | 0.2375 | 0.0674 |
| w/o Textual | 0.9659 | 0.9660 | 0.9657 | 0.9657 | 0.5375 | 0.6248 |
| w/o TS | 0.6839 | 0.6845 | 0.6833 | 0.6766 | 0.5875 | 0.6015 |

The following are the results from Table 3 of the original paper:

| Dataset | Metric | MTP | w/o V | w/o T | w/o TS |
|---|---|---|---|---|---|
| Chinatown | Accuracy | 0.9854 | 0.9271 | 0.9796 | 0.9563 |
| Chinatown | Precision | 0.9747 | 0.9008 | 0.9653 | 0.9312 |
| Chinatown | Recall | 0.9900 | 0.9233 | 0.9859 | 0.9699 |
| Chinatown | F1-score | 0.9820 | 0.9110 | 0.9749 | 0.9475 |
| PEMS-BAY | Accuracy | 0.7128 | 0.7051 | 0.7071 | 0.6639 |
| PEMS-BAY | Precision | 0.7133 | 0.6929 | 0.7037 | 0.6606 |
| PEMS-BAY | Recall | 0.7093 | 0.6916 | 0.7066 | 0.6474 |
| PEMS-BAY | F1-score | 0.7091 | 0.6905 | 0.7050 | 0.6478 |
| METR-LA | Accuracy | 0.7680 | 0.7623 | 0.7671 | 0.7342 |
| METR-LA | Precision | 0.7592 | 0.7584 | 0.7590 | 0.7304 |
| METR-LA | Recall | 0.7592 | 0.7543 | 0.7520 | 0.7235 |
| METR-LA | F1-score | 0.7590 | 0.7552 | 0.7526 | 0.7248 |
| PEMS-SF | Accuracy | 0.7977 | 0.6127 | 0.5549 | 0.6185 |
| PEMS-SF | Precision | 0.8100 | 0.5967 | 0.5641 | 0.6035 |
| PEMS-SF | Recall | 0.7912 | 0.5999 | 0.5448 | 0.6114 |
| PEMS-SF | F1-score | 0.7888 | 0.5810 | 0.5428 | 0.5997 |

Key Findings:

  • Full Model Superiority: The complete MTP framework consistently outperforms all its ablated variants across all datasets and metrics, confirming that the integration of all three modalities and the fusion mechanism are essential for optimal performance.
  • Impact of Visual Modality (w/o V): Removing the visual branch has the most pronounced impact, especially on the DodgerLoop dataset, where the F1-score collapses from 0.5676 to 0.1048 (the value reported in the paper's text; the table excerpt above lists only Accuracy 0.2375 and Precision 0.0674 for this variant). This suggests that visual features derived from frequency and periodicity images are critical for capturing the patterns that distinguish game days from non-game days in the traffic data. Substantial drops also appear on PEMS-SF (F1 falls from 0.7888 to 0.5810) and Chinatown (F1 falls from 0.9820 to 0.9110).
  • Impact of Time Series Modality (w/o TS): Relying solely on visual and textual modalities without the original time-series data (w/o TS) leads to a substantial performance degradation. For Melbourne, F1-score drops from 0.9669 to 0.6766, and for Chinatown, from 0.9820 to 0.9475. This emphasizes that the raw numerical temporal patterns are fundamental and cannot be fully replaced by augmented modalities alone.
  • Impact of Textual Modality (w/o T): Removing the textual branch also causes a decline, though it is generally milder than removing the visual or time-series branches. For Melbourne, the F1-score decreases only slightly, from 0.9669 to 0.9657; on DodgerLoop, the Accuracy and Precision columns show a noticeable drop relative to the full model; and on PEMS-SF, the F1-score drops sharply from 0.7888 to 0.5428. This indicates that the semantic information in the textual descriptions, even when its contribution is modest, helps build a more robust understanding of traffic scenarios. Together, these findings confirm that MTP's modality augmentation and subsequent fusion strategies are crucial drivers of its superior performance.

6.2.2. Hyperparameter Sensitivity Analysis

The paper investigates the sensitivity of MTP to four key hyperparameters.

fig 2 This figure shows the effect of four hyperparameters on model performance: learning rate (a), temperature (b), alpha weight (c), and embedding dimension (d), plotted against Accuracy, Precision, Recall, and F1-score. Each subplot uses line plots to show how different parameter values affect the performance metrics.

Hyperparameter sensitivity analysis on four key parameters: (a) Learning Rate, (b) Temperature, (c) Alpha weight, and (d) Embedding dimension.

Analysis of Figure 2:

  • Learning Rate (a): Performance peaks around 1e-4. The model shows robustness within a range (e.g., 1e-3 to 1e-5), but values too high or too low lead to decreased performance, as expected.

  • Temperature (b): The optimal range for the temperature parameter in contrastive loss is between 0.05 and 0.1. Values outside this range (e.g., 0.01 or 0.5) result in lower performance, indicating the importance of correctly scaling the similarity logits in InfoNCE loss.

  • Alpha Weight (c): MTP prefers a small alpha value (the weight for the supervised contrastive loss), with the best performance achieved at 0.1. This suggests that while supervised contrastive learning is beneficial, it should not overpower the other loss components (the unsupervised contrastive loss and the cross-entropy loss). A minimal illustrative sketch of how the temperature and this weight enter the loss appears after this list.

  • Embedding Dimension (d): Performance plateaus once the embedding dimension reaches 128. Increasing the dimension beyond this point does not yield significant performance improvements but would increase computational cost. This indicates an optimal dimensionality for capturing relevant features.

    These analyses collectively validate that MTP is stable and does not necessitate extensive hyperparameter tuning, indicating a robust design.
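
To make the temperature and alpha settings above concrete, the following is a minimal PyTorch-style sketch of an InfoNCE term with temperature scaling and a weighted combination with a cross-entropy classification loss. The function names and the exact composition are assumptions for illustration; the actual framework combines supervised and unsupervised contrastive terms with separate weights, which this sketch collapses into a single alpha for brevity.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Unsupervised InfoNCE between two modality embeddings.

    z_a, z_b: (batch, dim) embeddings of the same samples from two
    branches (e.g., visual vs. time series). The temperature scales the
    similarity logits, which is what subplot (b) varies.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature            # scaled similarity logits
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def total_loss(z_v, z_t, z_s, logits_cls, labels, alpha=0.1, temperature=0.07):
    """Hypothetical weighted objective: classification + alpha * contrastive.

    A small alpha (around 0.1) keeps the contrastive alignment terms from
    overpowering the cross-entropy term, mirroring subplot (c).
    """
    l_con = (info_nce(z_v, z_s, temperature) +
             info_nce(z_t, z_s, temperature) +
             info_nce(z_v, z_t, temperature)) / 3.0
    l_ce = F.cross_entropy(logits_cls, labels)
    return l_ce + alpha * l_con
```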

6.3. Qualitative Analysis

To intuitively understand the effectiveness of MTP, t-SNE is used to visualize the feature distributions.
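
Plots like Figures 3 and 4 can be produced with standard tooling; the snippet below is a minimal, hypothetical sketch using scikit-learn's TSNE, assuming `features` holds the (num_samples, dim) embeddings extracted from a trained model and `labels` the corresponding class labels. It is not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project high-dimensional embeddings to 2D and color points by class."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)
    for cls in np.unique(labels):
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=f"Class {cls}")
    plt.title(title)
    plt.legend()
    plt.show()

# Example usage with random stand-in data (replace with real embeddings):
# plot_tsne(np.random.randn(500, 128), np.random.randint(0, 3, 500),
#           "Fused features (METR-LA)")
```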

fig 3 This figure visualizes four types of features: the final fused features (a), image features (b), text features (c), and time-series features (d). Each feature type exhibits a different spatial distribution, providing a visualization of the multimodal learning results.

Comparative t-SNE visualizations on the METR-LA dataset, which contains three types of labels.

Analysis of Figure 3 (METR-LA dataset):

  • Final Fused Features (a): The final fused features learned by the complete MTP framework form highly cohesive and clearly separated clusters in the 2D space. Samples from different classes (represented by different colors) are distinctly separated with minimal overlap. This strong visual separation directly correlates with the high classification performance reported in Table 1.

  • Single Modality Features (b, c, d): In contrast, the feature distributions from single modalities (e.g., image features (b), text features (c), time series features (d)) are more diffuse and intermingled. While some clustering might be present, the boundaries between classes are much less distinct, and there is significant overlap. This indicates that no single modality alone can provide the same level of discriminative power as their fusion. This qualitative result aligns with the conclusions from the ablation study, demonstrating that MTP's fusion module successfully integrates complementary information from different modalities to produce a more powerful and discriminative final representation.

fig 4 This figure visualizes the multimodal features in four subplots: the final fused features (a), image features (b), text features (c), and time-series features (d), showing how each type of feature is distributed in the embedding space.

Comparative t-SNE visualizations on the Chinatown dataset. Color coding: blue (Class 0), green (Class 1).

Analysis of Figure 4 (Chinatown dataset - Appendix E):

  • Similar to METR-LA, the final fused features (a) on the Chinatown dataset learned by MTP form clear and well-separated clusters, even for just two classes (blue and green). This further supports the framework's ability to learn highly discriminative representations.
  • Again, the features from individual modalities (b, c, d) show substantial overlap between classes, reinforcing the idea that multimodal fusion is key to enhancing the discriminative power.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper proposes MTP, a novel multimodal framework for urban traffic profiling. It addresses the critical limitation of existing methods that rely solely on numerical time series data, neglecting the rich semantic information available in heterogeneous urban data. MTP innovatively learns multimodal features from numeric, visual, and textual perspectives by augmenting the original traffic signals.

Key aspects of MTP include:

  • Frequency Domain Learning: The numerical branch is processed with frequency multilayer perceptrons, the raw signals are converted into frequency images and periodicity images for visual learning, and descriptive texts are generated for textual learning, with all three branches refined and fused in the frequency domain (a minimal illustrative sketch of the visual augmentation appears after this summary).

  • Hierarchical Contrastive Fusion: Employing a sophisticated hierarchical contrastive learning mechanism to fuse the spectral representations of the three modalities, ensuring semantic alignment and comprehensive feature integration.

    Extensive experiments on six real-world datasets demonstrate MTP's superior performance compared to state-of-the-art methods, validating the effectiveness of its modality augmentation and spectrum fusion strategies.
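
To make the visual augmentation described above more concrete, here is a minimal NumPy sketch of one plausible way to build a frequency image (FFT magnitudes over sliding windows) and a periodicity image (the series folded by an assumed period) from a 1-D traffic signal. The window size, hop length, and period used here are invented for illustration and may differ from the paper's construction.

```python
import numpy as np

def frequency_image(signal: np.ndarray, win: int = 64, hop: int = 16) -> np.ndarray:
    """Spectrogram-like 2D map: FFT magnitudes over sliding windows."""
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))   # (num_frames, win//2 + 1)
    return np.log1p(spec)                                   # compress dynamic range

def periodicity_image(signal: np.ndarray, period: int = 288) -> np.ndarray:
    """Fold the series by an assumed period (288 = one day at 5-minute steps)."""
    n = (len(signal) // period) * period
    return signal[:n].reshape(-1, period)                   # rows = periods, cols = phase

# Example with a synthetic daily-periodic signal (one week of 5-minute readings):
t = np.arange(288 * 7)
x = np.sin(2 * np.pi * t / 288) + 0.1 * np.random.randn(len(t))
f_img = frequency_image(x)      # shape (123, 33) for this example
p_img = periodicity_image(x)    # shape (7, 288)
```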

7.2. Limitations & Future Work

The authors point out the following limitations and suggest future research directions:

  • Integrating More Data Types: Future work will involve integrating additional types of urban modal data. This implies that MTP, in its current form, might not yet encompass all possible relevant data streams (e.g., weather data, social media sentiment, public transport schedules, GIS data beyond just location).
  • Fine-grained Cross-modal Modeling: Exploring more fine-grained modeling mechanisms for cross-modal correlations. While MTP proposes a robust fusion, there might be deeper, more nuanced ways to capture interactions between different modalities beyond current spectral multiplication and contrastive losses. This suggests a direction towards more complex relational reasoning between modalities.

7.3. Personal Insights & Critique

MTP presents a compelling solution to a well-recognized problem in urban computing. The idea of augmenting a primary modality (numerical time series) into other modalities (visual, textual) is particularly innovative, especially when explicit visual or textual data is not readily available. This makes the framework highly applicable to scenarios where sensor data is abundant, but other data types are scarce.

Inspirations and Applications:

  • Synthetic Multimodal Data Generation: The approach of generating visual and textual representations from time series data could inspire similar modality augmentation techniques in other domains where one modality is rich but others are sparse. For instance, generating descriptive text from sensor readings in industrial fault detection or healthcare.
  • Frequency Domain as a Unifying Space: The successful application of frequency domain processing across all three modalities suggests that this could be a powerful unifying representation for multimodal learning, particularly for data with inherent temporal or periodic characteristics.
  • Hierarchical Contrastive Learning: The combination of supervised and unsupervised contrastive losses with distribution similarity fusion offers a robust fusion mechanism that could be adapted to other multimodal tasks requiring semantic alignment and integration of diverse feature spaces.
  • Intelligent City Applications: Beyond traffic profiling, MTP's principles could extend to other smart city applications like energy consumption prediction, environmental monitoring, or public safety, where diverse sensor data can be augmented and fused.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Computational Cost of Augmentation: Generating images and texts, especially using LLMs for text augmentation, can be computationally intensive. The paper mentions the image size (64x64) and text length (128 tokens), but the real-time implications for large-scale, continuous urban data streams might be a practical challenge. An analysis of the latency introduced by augmentation steps would be valuable.

  • Generalizability of Text Generation: Relying on LLMs for text generation assumes that these models can accurately and consistently describe traffic events from numerical data without introducing biases or hallucinations. The quality and specificity of the generated text could significantly impact the textual encoder's performance. The methodology section mentions "specific topic, background information and item description"; the robustness of generating these accurately across diverse traffic events and locations needs further scrutiny (a purely hypothetical prompt template is sketched after this list).

  • Interpretability of Frequency Domain: While powerful, frequency domain features can sometimes be less intuitive to interpret than time-domain features, especially for non-experts. While the t-SNE plots show good separation, understanding why certain frequency components are discriminative for high congestion vs. low congestion could enhance trust and practical application.

  • Hyperparameter Sensitivity of Fusion Weights: The weights α, β, and γ for the different loss components are crucial. While the paper shows some robustness for α, the overall interplay of these weights, especially with varying dataset characteristics, could be complex to tune.

  • Choice of Periodicity Hyperparameter (ϕ): The periodicity hyperparameter ϕ in the vision modality encoder is critical. How is it determined? Is it fixed, learned, or adapted per dataset? An optimal ϕ would likely depend on the natural periodicities of the traffic data (e.g., daily and weekly cycles).

  • Scalability of FIR Filters and FFT: While FFT is efficient, the repeated application of FFT, IFFT, and FIR filtering across multiple modalities, especially for very long time series or very high-resolution images, could be a bottleneck. The current image size is relatively small, but scaling up might introduce performance concerns.

    Overall, MTP offers a robust and innovative approach to multimodal urban traffic profiling, demonstrating significant advantages. Addressing the practical considerations of computational cost, generalizability of augmentation, and deeper interpretability will be key for its widespread adoption in real-world intelligent transportation systems.
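
As a concrete illustration of the robustness concern about LLM-based text augmentation raised above, the following is a purely hypothetical prompt template organized around the paper's three axes (topic, background information, item description). The wording, field names, and the `build_prompt` helper are invented for illustration and do not come from the paper.

```python
def build_prompt(location: str, time_range: str, stats: dict) -> str:
    """Hypothetical prompt for LLM-based text augmentation of a traffic window."""
    return (
        "Topic: urban traffic flow profiling.\n"
        f"Background: sensor located at {location}, observation window {time_range}.\n"
        "Item description: summarize the traffic state given these statistics: "
        f"mean flow {stats['mean']:.1f}, peak flow {stats['max']:.1f}, "
        f"dominant period {stats['period_hours']:.1f} hours. "
        "Describe the congestion level and any notable periodic pattern in 2-3 sentences."
    )

# Example usage with made-up values:
print(build_prompt("PEMS-BAY sensor 401", "2017-05-01 06:00-12:00",
                   {"mean": 182.4, "max": 310.0, "period_hours": 24.0}))
```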
