Paper status: completed

Spatia: Video Generation with Updatable Spatial Memory

Published: 12/17/2001
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Spatia introduces a video generation framework that enhances long-term spatial and temporal consistency by maintaining an updatable 3D scene point cloud, allowing for realistic video production and interactive editing.

Abstract

Spatia, a spatial memory-aware video generation framework, maintains long-term spatial and temporal consistency by preserving and updating a 3D scene point cloud, enabling realistic video generation and interactive editing.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Spatia: Video Generation with Updatable Spatial Memory

1.2. Authors

Jinjing Zhao (University of Sydney), Fangyun Wei (Microsoft Research), Zhening Liu (HKUST), Hongyang Zhang (University of Waterloo), Chang Xu (University of Sydney), Yan Lu (Microsoft Research).

1.3. Journal/Conference

This paper appears to be a research preprint (likely intended for a major computer vision conference like CVPR or ICCV). It is hosted on ArXiv and Hugging Face.

1.4. Publication Year

The metadata provided indicates a publication date in December 2001, which is clearly erroneous. Based on the references to contemporary models such as Wan2.2 (2025) and CogVideoX (2024), this is a very recent piece of research from late 2024 or 2025.

1.5. Abstract

Existing video generation models struggle with long-term spatial and temporal consistency because they cannot effectively "remember" 3D scene structures over time. Spatia introduces an explicit spatial memory mechanism by maintaining a 3D scene point cloud. The framework iteratively generates video clips conditioned on this memory and updates the memory using visual SLAM (Simultaneous Localization and Mapping) algorithms. This approach allows for realistic dynamic object generation while keeping the background static and consistent. Additionally, Spatia enables precise camera control and 3D-aware interactive editing, such as removing or modifying objects within a scene before generating the video.

2. Executive Summary

2.1. Background & Motivation

The field of video generation has seen massive progress with models capable of creating short, high-quality clips. However, long-horizon generation (videos lasting minutes or hours) remains a major challenge.

  • The Token Problem: Videos are high-dimensional. A 5-second clip can contain around 36,000 tokens (the basic units of data that models process). To remember what happened 30 seconds ago, a model would need to process hundreds of thousands of tokens, which is prohibitively expensive on current hardware (a rough token-count sketch follows this list).

  • The Consistency Gap: Without an explicit memory of the 3D world, models often "forget" the layout of a room if the camera pans away and returns, leading to "hallucinations" where the scene changes randomly.

    The authors identify that while Large Language Models (LLMs) can handle long contexts by attending to previous text tokens, video models need a more efficient way to store "spatial history."
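To make the scale concrete, here is a back-of-envelope estimate of how quickly token counts grow. All numbers (fps, resolution, compression factors, patch size) are illustrative assumptions, not Spatia's actual configuration:

```python
# Rough token-count estimate for a latent video transformer.
# fps, resolution, compression factors, and patch size are assumed values.

def video_token_count(seconds, fps=16, height=480, width=832,
                      temporal_compress=4, spatial_compress=8, patch=2):
    """Tokens after VAE compression and patchification."""
    latent_frames = (seconds * fps) // temporal_compress
    latent_h = height // spatial_compress
    latent_w = width // spatial_compress
    tokens_per_frame = (latent_h // patch) * (latent_w // patch)
    return latent_frames * tokens_per_frame

print(video_token_count(5))    # ~31k tokens for a 5 s clip (same ballpark as the paper's 36k)
print(video_token_count(30))   # grows linearly -> hundreds of thousands of tokens
```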

2.2. Main Contributions / Findings

  1. Spatia Framework: A novel architecture that uses a 3D point cloud as a persistent, updatable spatial memory.

  2. Dynamic-Static Disentanglement: The model can generate moving people or animals (dynamic) while ensuring the environment (static) remains perfectly consistent.

  3. Explicit Camera Control: By rendering the 3D point cloud from a specific path, the model can follow complex camera trajectories with high geometric accuracy.

  4. 3D-Aware Editing: Users can modify the 3D memory (e.g., delete a sofa) and the model will generate the video with that change reflected consistently across all frames.

  5. State-of-the-Art Results: Experiments show that Spatia significantly outperforms existing models in long-term consistency and spatial accuracy.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Latent Diffusion Models (LDM) & Diffusion Transformers (DiT)

  • Diffusion: A process where a model learns to generate data by reversing a noise-adding process. It starts with pure static (noise) and gradually refines it into an image or video.
  • Latent Space: Instead of working on raw pixels (which are huge), models work in a compressed "latent" space.
  • Transformer: An architecture using self-attention to weigh the importance of different parts of the input data. In video, it helps the model understand how a pixel in Frame 1 relates to a pixel in Frame 10.

3.1.2. Tokens and Embeddings

  • Tokens: Think of these as "visual words." A video is broken down into small spatio-temporal blocks, which are then converted into numerical vectors (embeddings).
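As a minimal illustration (assumed shapes and embedding size, not the paper's actual tokenizer), a compressed video latent can be cut into spatio-temporal patches and linearly projected into token embeddings:

```python
import torch

# Minimal patchify-and-embed sketch; all sizes are assumptions, not Spatia's tokenizer.
latent = torch.randn(1, 16, 20, 60, 104)   # (batch, channels, frames, H, W)
patch = 2
B, C, T, H, W = latent.shape

# Cut the latent into 1x2x2 spatio-temporal patches and flatten each patch.
patches = latent.unfold(3, patch, patch).unfold(4, patch, patch)   # B, C, T, H/p, W/p, p, p
patches = patches.permute(0, 2, 3, 4, 1, 5, 6).reshape(B, -1, C * patch * patch)

embed = torch.nn.Linear(C * patch * patch, 1024)   # embedding dimension is assumed
tokens = embed(patches)                            # (B, num_tokens, 1024)
print(tokens.shape)                                # torch.Size([1, 31200, 1024])
```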

3.1.3. 3D Point Clouds & SLAM

  • Point Cloud: A collection of data points in a 3D coordinate system (x, y, z). It represents the external surface of an object or scene.
  • SLAM (Simultaneous Localization and Mapping): A technique used by robots and self-driving cars to build a map of an unknown environment while simultaneously keeping track of their own location within that map.

3.1.4. Flow Matching

  • Flow Matching is a newer alternative to standard diffusion. Instead of predicting the noise at each step, it learns to predict the "velocity" or the direction in which the data needs to move to reach the target image/video.

3.2. Previous Works

  • Short-term Video Models: Models like Sora, Kling, and Wan2.2 generate high-fidelity short clips but lack a mechanism for returning to the same spot in a 3D scene consistently over long periods.
  • Camera Control: Previous methods like AnimateDiff used "Motion LoRAs" (small tuned weights) to prompt camera movements, but these are often imprecise.
  • 3D Scene Generation: Methods like WonderWorld create 3D scenes but often struggle with dynamic entities (moving people).

3.3. Differentiation Analysis

The core innovation of Spatia is the Updatable Spatial Memory. While other models might use the "previous frame" as memory (temporal context), Spatia uses a 3D Point Cloud (spatial context). This allows the model to "see" the world in 3D, ensuring that if the camera turns 360 degrees, the starting point looks exactly the same when the camera returns.


4. Methodology

4.1. Principles

Spatia treats video generation as a multi-modal problem. It doesn't just look at text; it looks at:

  1. Text Instructions (What should happen?)
  2. Temporal Context (What were the previous frames?)
  3. Spatial Memory (What does the 3D world look like?)

4.2. Core Methodology In-depth (Layer by Layer)

The following figure (Figure 2 from the original paper) illustrates the training pipeline:

The figure is a schematic of the view-specific scene point cloud estimation process in the Spatia framework, covering the retrieval flow over candidate, target, and reference frames and the architecture built on the VAE encoder; the IoU computation involved is $\mathrm{IoU}(T_i, C_j)$.

4.2.1. View-Specific Scene Point Cloud Estimation

The first step is building the memory from a single image or video frame.

  1. Static/Dynamic Separation: To prevent moving objects (like a walking dog) from being baked into the "permanent" background map, Spatia uses segmentation models (SAM2, ReferDINO) to identify and remove dynamic entities.
  2. 3D Mapping: Using a tool called MapAnything, the model estimates a global scene point cloud $s$ from the static parts of the images.
  3. Camera Posing: It calculates the camera's position ($\theta$) for every frame.
  4. Projection: To make this 3D data understandable to a 2D video model, the 3D point cloud is "rendered" back into a 2D image plane from the perspective of the camera. This creates a "scene projection video" that acts as a blueprint for the generation (a minimal projection sketch follows this list).
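The projection step can be sketched with a pinhole camera model, assuming known intrinsics K and a world-to-camera pose (the paper relies on MapAnything's estimates; the code below is illustrative only and omits colors and depth handling):

```python
import numpy as np

def project_point_cloud(points_xyz, K, world_to_cam, image_hw):
    """Render a static point cloud into a 2D 'scene projection' image (sketch).

    points_xyz   : (N, 3) world-space points (illustrative)
    K            : (3, 3) pinhole intrinsics (assumed known)
    world_to_cam : (4, 4) camera extrinsics for the target view
    """
    H, W = image_hw
    # Move points into the camera frame.
    pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    cam = (world_to_cam @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > 1e-6]                      # keep points in front of the camera

    # Perspective projection onto the image plane.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)

    # Splat points into a binary occupancy image.
    mask = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    img = np.zeros((H, W), dtype=np.uint8)
    img[v[mask], u[mask]] = 255
    return img
```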

4.2.2. Reference Frame Retrieval

To further help the model, Spatia searches through previously generated frames to find "Reference Frames" $\{R\}^K$ that show the same area the camera is currently looking at.

  • It uses 3D IoU (Intersection over Union) to check for spatial overlap. If the current view and a past view overlap significantly, that past view is retrieved as a reference.
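A minimal sketch of the retrieval idea, assuming each view's visible region is approximated by an axis-aligned 3D bounding box (the paper's exact $\mathrm{IoU}(T_i, C_j)$ computation may differ, and the threshold and K below are placeholders):

```python
import numpy as np

def aabb_iou(box_a, box_b):
    """3D IoU between two axis-aligned boxes, each given as (min_xyz, max_xyz)."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter + 1e-9)

def retrieve_references(target_box, candidate_boxes, k=3, threshold=0.3):
    """Pick the K past views whose visible 3D region overlaps the target view most."""
    scores = [aabb_iou(target_box, c) for c in candidate_boxes]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] >= threshold][:k]
```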

4.2.3. Model Architecture & Flow Matching

The backbone of Spatia is based on the Wan2.2 architecture, enhanced with ControlNet blocks.

Token Extraction: The model converts inputs into tokens:

  • $X_P$: Previous video tokens.
  • $X_R$: Reference image tokens.
  • $X_{S_P}$ and $X_{S_T}$: Tokens representing the 3D scene layout for the previous and target clips.
  • $X_\tau$: Text instruction tokens.

The Network Block: The following figure (Figure 3 from the original paper) shows a single network block:

Figure 3. Illustration of a single network block composed of one ControlNet [115] block operating in parallel with four main blocks; the blocks contain FFN, Cross-Attention, and Self-Attention modules, and the figure indicates the direction of information flow. Detailed definitions of all token types are provided in Figure 2.

Each block consists of a ControlNet (which handles the spatial scene layout) and a Main Block (which handles the video generation).

Flow Matching Mathematical Flow: The model is trained using Flow Matching. We define the target video tokens as $X_T$.

  1. We sample a time step $t$ between 0 and 1.
  2. We initialize random noise $\mathbf{x}_0$ from a Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
  3. We create an intermediate sample $\mathbf{x}_t$ using linear interpolation: $ \mathbf{x}_t = (1 - t)\mathbf{x}_0 + tX_T $ Here, $\mathbf{x}_t$ is a mixture of pure noise and the final video. When $t=0$, it is all noise; when $t=1$, it is the perfect video.
  4. The model is trained to predict the "velocity" $\mathbf{u}_t$ (the direction in which the sample must move to reach the target) by minimizing the Mean Squared Error (MSE): $ \mathcal{L} = \mathbb{E}_{t, \mathbf{x}_0, X_T} \| \mathbf{v}_t - \mathbf{u}_t \|^2 $ where $\mathbf{v}_t$ is the model's prediction and $\mathbf{u}_t$ is the true velocity toward the target video.
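A minimal flow-matching training step under these definitions is sketched below; the model call and conditioning interface are placeholders, not Spatia's actual API:

```python
import torch

def flow_matching_loss(model, x_target, cond):
    """One flow-matching training step along a linear path from noise to data (sketch).

    model    : predicts velocity v_t from (x_t, t, cond) -- placeholder interface
    x_target : target video tokens X_T, shape (B, N, D)
    cond     : conditioning (text, reference, scene projection tokens, ...)
    """
    B = x_target.shape[0]
    t = torch.rand(B, device=x_target.device)        # sample t ~ U(0, 1)
    x0 = torch.randn_like(x_target)                   # pure Gaussian noise
    t_ = t.view(B, 1, 1)
    x_t = (1 - t_) * x0 + t_ * x_target               # linear interpolation x_t
    u_t = x_target - x0                               # true velocity toward X_T
    v_t = model(x_t, t, cond)                         # model's predicted velocity
    return torch.mean((v_t - u_t) ** 2)               # MSE objective
```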

4.2.4. Inference Process (Generating Videos)

The following figure (Figure 4 from the original paper) shows the iterative generation process:

Figure 4. Illustration of the Spatia inference process. At the first iteration, the user provides an initial image, from which Spatia estimates the initial 3D scene point cloud. The user then specifies a text instruction and a camera path based on the estimated scene, producing a projection video along the desired trajectory that conditions the generation of clip-1. In subsequent iterations, two steps are performed: (1) Spatia updates the spatial memory (3D scene point cloud) using all previously generated frames via MapAnything [42]; and (2) the user specifies a new text instruction and camera path based on the updated scene. Spatia then takes the reference frames (generated as described in Section 3.1.2), the previously generated clip, and the new projection video as input to produce the next video clip. Text instructions are omitted in the figure.

  1. Iteration 1: Start with an image → Build initial 3D memory → Generate first clip.

  2. Update: Use the new frames to update the 3D point cloud via MapAnything.

  3. Iteration 2: Use updated 3D memory + previous clip + new camera path → Generate second clip.

  4. Repeat: This allows the model to walk through an entire house or city while keeping the map updated and consistent.
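The overall loop can be summarized in pseudocode. Every function below is a placeholder for a component described above (MapAnything mapping, the Section 3.1.2 retrieval, the diffusion backbone), not a real Spatia API:

```python
# High-level sketch of Spatia's iterative inference loop (illustrative only).

def generate_long_video(init_image, instructions, camera_paths):
    memory = estimate_point_cloud([init_image])              # initial 3D scene memory
    clips, frames = [], [init_image]

    for text, path in zip(instructions, camera_paths):
        projection = render_projection_video(memory, path)   # blueprint along the camera path
        references = retrieve_reference_frames(memory, frames, path)
        prev_clip = clips[-1] if clips else None

        clip = generate_clip(text, prev_clip, references, projection)
        clips.append(clip)
        frames.extend(clip)

        memory = update_point_cloud(memory, frames)           # SLAM-style memory update
    return clips
```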


5. Experimental Setup

5.1. Datasets

  • RealEstate: 40,000 videos of real estate tours (useful for learning indoor/outdoor architecture).
  • SpatialVID: 10,000 high-definition videos with spatial annotations.
  • Scale: Training was performed on 64 AMD MI250 GPUs.

5.2. Evaluation Metrics

5.2.1. PSNR (Peak Signal-to-Noise Ratio)

  1. Conceptual Definition: Measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects its representation. Higher is better.
  2. Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) $
  3. Symbol Explanation: MAXI\mathrm{MAX}_I is the maximum pixel value (e.g., 255); MSE\mathrm{MSE} is the Mean Squared Error between the generated and ground-truth images.
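For example, PSNR between two 8-bit frames can be computed directly from the formula above (a straightforward sketch):

```python
import numpy as np

def psnr(generated, reference, max_val=255.0):
    """PSNR between two images of the same shape (e.g., uint8 frames)."""
    mse = np.mean((generated.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                    # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```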

5.2.2. SSIM (Structural Similarity Index Measure)

  1. Conceptual Definition: Assesses the perceived quality of digital images by comparing luminance, contrast, and structure.
  2. Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  3. Symbol Explanation: μ\mu is the average, σ2\sigma^2 is the variance, and σxy\sigma_{xy} is the covariance of images xx and yy.
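A simplified, single-window version of SSIM using global image statistics is shown below purely to illustrate the formula; practical implementations (e.g., skimage.metrics.structural_similarity) compute it over local sliding windows and average the result:

```python
import numpy as np

def ssim_global(x, y, max_val=255.0):
    """Single-window SSIM from global statistics (illustrative simplification)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```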

5.2.3. LPIPS (Learned Perceptual Image Patch Similarity)

  1. Conceptual Definition: Uses a deep neural network to measure how "similar" two images look to a human, rather than just comparing pixels. Lower is better.

5.2.4. Match Accuracy

  1. Conceptual Definition: Quantifies how well the features (corners, edges) of the final frame align with the first frame in a closed-loop video.
  2. Calculation: Uses the RoMa algorithm to count high-confidence matching points.

5.3. Baselines

The authors compare Spatia against:

  • Static Scene Models: WonderJourney, Voyager.

  • Foundation Video Models: VideoCrafter2, CogVideoX, LTX-Video, Wan2.1.


6. Results & Analysis

6.1. Core Results Analysis

Spatia achieves a significant lead in Spatial Consistency. In "closed-loop" tests (where the camera returns to the start), Spatia's final frame is nearly identical to the first, whereas other models "drift" and create a different-looking room.

The following are the results from Table 1 of the original paper:

Method | Total | Static | Dynamic | Camera | Object | Content | Photo Const | Style Const | Subject | Quality
WonderWorld [108] | 61.79 | 72.69 | 50.88 | 92.98 | 51.76 | 71.25 | 86.87 | 85.56 | 70.57 | 49.81
CogVideoX-I2V [106] | 60.64 | 62.15 | 59.12 | 38.27 | 40.07 | 36.73 | 86.21 | 88.12 | 83.22 | 62.44
Wan2.1 [87] | 55.21 | 57.56 | 52.85 | 23.53 | 40.32 | 45.44 | 78.74 | 78.36 | 77.18 | 59.38
Spatia (Ours) | 69.73 | 72.63 | 66.82 | 75.66 | 52.32 | 69.95 | 86.40 | 89.10 | 80.09 | 54.86

(Total, Static, and Dynamic are the average scores; Camera, Object, and Content are the 3D alignment scores.)

Analysis: Spatia significantly outperforms standard video models (like Wan2.1) in "Static" score and "Camera Control," while maintaining high "Dynamic" scores (the ability to move objects).

6.2. Memory Mechanism Evaluation

The following are the results from Table 3 of the original paper, evaluating "Closed-Loop" consistency:

Method | PSNR_C ↑ | SSIM_C ↑ | LPIPS_C ↓ | Match Acc ↑
ViewCrafter [113] | 14.79 | 0.481 | 0.365 | 0.447
Voyager [40] | 17.66 | 0.540 | 0.380 | 0.507
Spatia (Ours) | 19.38 | 0.579 | 0.213 | 0.698

Analysis: A Match Acc of 0.698, versus 0.507 for the next best model, shows that Spatia's memory mechanism is far more robust at re-identifying specific 3D points when the camera returns to a previously seen region.

6.3. Long-Horizon Generation

The authors tested generating multiple consecutive clips (2, 4, and 6 clips). The following are the results from Table 6 of the original paper:

Method | #Clips | PSNR_C ↑ | SSIM_C ↑ | LPIPS_C ↓
Wan2.2 [87] | 6 | 10.74 | 0.310 | 0.644
Spatia (Ours) | 6 | 18.04 | 0.541 | 0.259

Analysis: As the video gets longer, standard models (Wan2.2) rapidly lose quality and consistency (PSNR drops to 10.74). Spatia stays relatively stable even after 6 clips, demonstrating the effectiveness of the persistent 3D memory.


7. Conclusion & Reflections

7.1. Conclusion Summary

Spatia introduces a paradigm shift in long-horizon video generation by moving away from purely "temporal" context (looking at past frames) toward "spatial" memory (maintaining a 3D point cloud). This geometric grounding allows for unprecedented consistency in camera movement and scene structure, while the dynamic-static disentanglement ensures that movements remain realistic.

7.2. Limitations & Future Work

  • Point Cloud Density: As shown in Table 7, reducing point cloud density to save memory leads to visual degradation. Finding a more "compressed" but "lossless" way to store 3D memory is a future challenge.
  • SLAM Dependency: The system relies on external SLAM tools like MapAnything. If the SLAM tool fails (e.g., in very dark or featureless scenes), the video generation will suffer.
  • Inference Speed: Updating a 3D point cloud and rendering it iteratively adds computational overhead compared to simple frame-by-frame generation.

7.3. Personal Insights & Critique

Spatia is a brilliant bridge between Computer Vision (3D reconstruction) and Generative AI (Video synthesis).

  • Innovation: The use of 3D point clouds as a "KV-cache" for video is highly intuitive. Just as LLMs cache text, Spatia "caches" space.
  • Application: The 3D-aware editing shown in Figure 8 (removing a sofa) is a "killer feature." This could revolutionize movie pre-visualization and interior design.
  • Critique: The paper mentions removing dynamic entities before building the map. This works well for a dog walking through a room, but what if the scene is a crowded street? The complexity of separating hundreds of moving people from the static background might still be a bottleneck for the current framework.
