
Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting

Published: 03/21/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The Instant Gaussian Stream (IGS) framework addresses the high reconstruction time and error accumulation in dynamic-scene Free-Viewpoint Videos. It uses a generalized Anchor-driven Gaussian Motion Network for rapid Gaussian motion generation and a key-frame-guided streaming strategy that refines key frames to mitigate error accumulation, bringing the average per-frame reconstruction time down to roughly 2-3 seconds.

Abstract

Building Free-Viewpoint Videos in a streaming manner offers the advantage of rapid responsiveness compared to offline training methods, greatly enhancing user experience. However, current streaming approaches face challenges of high per-frame reconstruction time (10s+) and error accumulation, limiting their broader application. In this paper, we propose Instant Gaussian Stream (IGS), a fast and generalizable streaming framework, to address these issues. First, we introduce a generalized Anchor-driven Gaussian Motion Network, which projects multi-view 2D motion features into 3D space, using anchor points to drive the motion of all Gaussians. This generalized network generates the motion of Gaussians for each target frame in the time required for a single inference. Second, we propose a Key-frame-guided Streaming Strategy that refines each key frame, enabling accurate reconstruction of temporally complex scenes while mitigating error accumulation. We conducted extensive in-domain and cross-domain evaluations, demonstrating that our approach can achieve streaming with an average per-frame reconstruction time of 2s+, alongside an enhancement in view synthesis quality.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting

1.2. Authors

The authors of the paper are Jinbo Yan, Rui Peng, Zhiyan Wang, Luyang Tang, Jiayu Yang, Jie Liang, Jiahao Wu, and Ronggang Wang. They are affiliated with the Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University, and Pengcheng Laboratory.

1.3. Journal/Conference

This paper was posted to arXiv on 2025-03-21 (UTC) and is a recent preprint. arXiv is a well-regarded open-access preprint server for research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers on arXiv are typically pre-peer-review versions.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the challenges in building Free-Viewpoint Videos (FVV) in a streaming manner, specifically high per-frame reconstruction time (over 10 seconds) and error accumulation, which limit broader application. The authors propose Instant Gaussian Stream (IGS), a fast and generalizable streaming framework. IGS introduces a generalized Anchor-driven Gaussian Motion Network (AGM-Net) that projects multi-view 2D motion features into 3D space using anchor points to drive the motion of all Gaussians, enabling single-inference motion generation per target frame. Additionally, a Key-frame-guided Streaming Strategy refines key frames to accurately reconstruct temporally complex scenes and mitigate error accumulation. Extensive evaluations demonstrate that IGS achieves an average per-frame reconstruction time of over 2 seconds, alongside an enhancement in view synthesis quality, outperforming existing streaming methods.

Original Source Link: https://arxiv.org/abs/2503.16979v1 PDF Link: https://arxiv.org/pdf/2503.16979v1.pdf This paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficiency and quality degradation in streaming Free-Viewpoint Video (FVV) construction for dynamic scenes. FVV, which allows users to interactively view a scene from any angle, is crucial for immersive media applications like Virtual Reality (VR), Augmented Reality (AR), and sports broadcasting. Building FVVs in a streaming manner, where dynamic scenes are reconstructed frame by frame, offers significant advantages in responsiveness compared to traditional offline training methods, making it ideal for real-time interactive scenarios such as live streaming or virtual meetings.

However, current streaming approaches face two major challenges:

  1. High Per-Frame Reconstruction Time: Existing methods typically require per-frame optimization, leading to latencies often exceeding 10 seconds per frame. This severely hinders their usability in real-time applications that demand rapid responses.

  2. Error Accumulation: Over longer video sequences, errors in reconstruction can accumulate from frame to frame, progressively degrading the quality of later frames. This limits the scalability and long-term accuracy of streaming methods.

    The paper's entry point is to leverage the advancements in real-time rendering and high-quality view synthesis provided by 3D Gaussian Splatting (3DGS) and integrate a generalized motion model to overcome these limitations. The innovative idea is to move away from computationally expensive per-frame optimization by instead training a generalizable network that can infer Gaussian motion in a single pass, combined with a strategy to periodically correct accumulated errors.

2.2. Main Contributions / Findings

The primary contributions of the paper, aiming to address the high latency and error accumulation issues in streaming FVV, are:

  1. A Generalized Anchor-driven Gaussian Motion Network (AGM-Net): This novel network is designed to predict the motion of Gaussian primitives between adjacent frames in a single inference step. By using anchor points to carry and propagate 3D motion features derived from multi-view 2D motion, AGM-Net eliminates the need for time-consuming per-frame optimization, significantly reducing the per-frame reconstruction time. This is the first approach to use a generalized method for streaming reconstruction of dynamic scenes.

  2. A Key-frame-guided Streaming Strategy: This strategy enhances the model's ability to handle temporally complex scenes and effectively mitigates error accumulation. It works by periodically designating key frames that undergo a max-points-bounded refinement. This refinement corrects accumulated errors and provides a fresh starting point for subsequent frame predictions, improving overall view synthesis quality within the streaming framework.

  3. Demonstrated Fast and Generalizable Performance: Through extensive in-domain and cross-domain evaluations, the paper shows that its approach, Instant Gaussian Stream (IGS), achieves an average per-frame reconstruction time of just over 2 seconds (e.g., 2.67s for IGS-s), representing a significant improvement (up to 6x faster) over previous state-of-the-art streaming methods. Concurrently, it enhances view synthesis quality (higher PSNR) and demonstrates strong generalization capabilities to unseen environments without requiring per-frame optimization. The method also enables real-time rendering at 204 FPS while maintaining comparable storage overhead.

    In essence, the paper proposes a novel framework that makes streaming Free-Viewpoint Video construction both fast and robust, significantly improving user experience for real-time interactive applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Instant Gaussian Stream (IGS), several foundational concepts are crucial:

  • Free-Viewpoint Video (FVV): Free-Viewpoint Video refers to a technology that allows a user to interactively experience a dynamic scene from any desired viewpoint. Unlike traditional videos, which are captured from a fixed perspective, FVV enables navigation through a 3D reconstruction of a dynamic event, offering immersive experiences for applications like Virtual Reality (VR), Augmented Reality (AR), and advanced broadcasting.

  • Neural Radiance Fields (NeRF): NeRF [39] is a pioneering method for novel view synthesis (NVS) that represents a 3D scene implicitly using a Multi-Layer Perceptron (MLP). Given a 3D coordinate $(x, y, z)$ and viewing direction $(\theta, \phi)$, an MLP predicts the color and volume density at that point. By integrating these predictions along camera rays, NeRF can render photorealistic images from any viewpoint. While revolutionary, traditional NeRF models are computationally intensive for training and rendering, especially for dynamic scenes.

  • 3D Gaussian Splatting (3DGS): 3DGS [26] is a more recent and highly efficient novel view synthesis technique. It represents a 3D scene as a collection of anisotropic 3D Gaussian primitives. Each Gaussian is characterized by its 3D position ($\mu$), covariance matrix ($\Sigma$), opacity ($\alpha$), and color ($c$). During rendering, these 3D Gaussians are projected onto the 2D image plane and rendered using a differentiable splatting (rasterization-based) algorithm. This approach offers significantly faster training and real-time rendering capabilities while achieving high-fidelity results, making it a popular choice for static and dynamic scene reconstruction.

    The mathematical representation of a single 3D Gaussian primitive is given by: $ \mathcal{G}(x) = e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} $ where:

    • $\mathcal{G}(x)$ is the density of the Gaussian at point $x \in \mathbb{R}^3$.

    • $x$ is a 3D point in space.

    • $\mu \in \mathbb{R}^3$ is the 3D mean (center) of the Gaussian.

    • $\Sigma \in \mathbb{R}^{3 \times 3}$ is the 3D covariance matrix, which defines the shape and orientation of the anisotropic Gaussian. Its inverse $\Sigma^{-1}$ is used to compute the Mahalanobis distance.

      The color of a pixel is obtained by alpha blending (also known as volume rendering or front-to-back compositing) the projected Gaussians sorted by depth: $ \mathbf{c} = \sum_{i=1}^{n} c_i \alpha_i' \prod_{j=1}^{i-1} (1 - \alpha_j') $ where:

    • $\mathbf{c}$ is the final color of the pixel.

    • $n$ is the number of Gaussians covering the pixel, sorted by depth from front to back.

    • $c_i$ is the color of the $i$-th Gaussian.

    • $\alpha_i'$ is the opacity of the $i$-th Gaussian after projection onto the 2D image plane.

    • The product $\prod_{j=1}^{i-1} (1 - \alpha_j')$ is the transmittance (how much light passes through) from the camera to the $i$-th Gaussian, accounting for the opacities of all preceding Gaussians.
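
      To make the compositing rule concrete, here is a minimal NumPy sketch of front-to-back alpha blending for a single pixel; the arrays of per-Gaussian colors and projected opacities are illustrative inputs, not the paper's implementation.

```python
import numpy as np

def composite_pixel(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha blending of n Gaussians covering one pixel.

    colors: (n, 3) per-Gaussian RGB colors c_i, sorted front to back.
    alphas: (n,)   projected opacities alpha_i' in [0, 1], same order.
    """
    pixel = np.zeros(3)
    transmittance = 1.0  # running product of (1 - alpha_j') over preceding Gaussians
    for c_i, a_i in zip(colors, alphas):
        pixel += c_i * a_i * transmittance
        transmittance *= (1.0 - a_i)
    return pixel

# Toy example: a semi-transparent red Gaussian in front of a blue one.
print(composite_pixel(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
                      np.array([0.6, 0.9])))
```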

  • Streaming-based FVV: This refers to the process of reconstructing dynamic scenes frame by frame as new input data arrives, rather than waiting for an entire video sequence to be collected and processed offline. The goal is to provide a low-latency response, which is critical for real-time interactive applications.

  • Anchor Points: In the context of IGS, anchor points are a sparse set of key 3D points sampled from the scene (specifically, from the positions of Gaussian primitives). These points are used to represent and propagate motion features across the 3D space, guiding the deformation of numerous Gaussian primitives without needing to compute motion for every single Gaussian directly. This reduces computational and memory overhead.

  • Optical Flow: Optical flow is the apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. In computer vision, optical flow algorithms estimate the 2D motion vector for each pixel between two consecutive frames in a video. IGS uses an optical flow model to extract 2D motion features from multi-view images.

  • Farthest Point Sampling (FPS): FPS is a sampling algorithm commonly used in 3D computer graphics and point cloud processing. Given a set of points, FPS iteratively selects points such that each new point is as far as possible from the already selected points. This ensures that the sampled points are well-distributed and representative of the entire point set. In IGS, it's used to select anchor points from the Gaussian positions.

  • Transformer: A Transformer is a neural network architecture introduced for sequence-to-sequence tasks (like natural language processing) but now widely applied in computer vision. Its key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence (or, in this case, different 3D motion features) when processing each element. This helps capture long-range dependencies and global context effectively.

  • Quaternion: A quaternion is a number system that extends complex numbers and is often used in computer graphics and robotics to represent 3D rotations. Quaternions offer advantages over other rotation representations (like Euler angles or rotation matrices), such as avoiding gimbal lock (a loss of one degree of freedom) and providing a compact and efficient way to interpolate rotations. In IGS, quaternions are used to represent and deform the rotation of Gaussian primitives.

3.2. Previous Works

The paper contextualizes its contribution by discussing prior work in 3D Reconstruction and View Synthesis, Generalizable 3D Reconstruction for Acceleration, and Dynamic Scene Reconstruction and View Synthesis.

3.2.1. 3D Reconstruction and View Synthesis

This field has seen significant advancements, with NeRF [39] and its derivatives [1, 2, 3, 8, 16, 20, 21, 40, 46, 47, 52] establishing high-quality novel view synthesis. These methods implicitly represent scenes using MLPs. More recently, 3D Gaussian Splatting (3DGS) [26] has emerged as a powerful explicit representation, offering fast rendering speeds and high quality by using anisotropic Gaussian primitives. Subsequent works have focused on improving 3DGS further, addressing rendering quality [28, 37, 48, 74, 78, 83], geometric accuracy [22, 79, 80], compression efficiency [11, 13, 37, 72], and joint optimization of camera poses [14, 18, 49]. This paper builds upon the efficiency and quality benefits of 3DGS.

3.2.2. Generalizable 3D Reconstruction for Acceleration

Traditional 3DGS requires per-scene optimization, which is time-consuming. To accelerate this, researchers have developed generalizable models [24, 27, 54, 55, 81, 84], inspired by generalizable NeRFs [7, 25, 64, 69, 77]. These models are trained on large datasets to generalize to new scenes without extensive per-scene optimization. PixelSplat [6], for instance, uses a Transformer to encode features and decode them into Gaussian attributes. Other generalizable models [12, 15, 35, 82] utilize Transformers or Multi-View Stereo (MVS) [76] techniques to build cost volumes for fast reconstruction. The current paper is innovative in applying generalizable models to dynamic streaming scenes, leveraging their rapid inference capabilities.

3.2.3. Dynamic Scene Reconstruction and View Synthesis

Extending static scene reconstruction to dynamic scenes has been a major research direction. Initial efforts focused on adapting NeRF for dynamic scenes [5, 17, 19, 33, 34, 42, 44, 50, 60]. With the advent of 3DGS, many offline training methods [23, 31, 66, 70, 73, 75] have emerged to incorporate 3DGS for dynamic scenes. These methods achieve high-quality results but are limited by their offline nature, requiring the full video sequence before training, making them unsuitable for real-time streaming applications.

To address the real-time interaction need, several streaming methods have been proposed:

  • StreamRF [29]: An early method that formulates dynamic scene modeling for streaming by optimizing scene representations frame by frame.
  • NeRFPlayer [51]: Another streaming approach that uses decomposed Neural Radiance Fields for dynamic scenes, aiming for streamable representations.
  • ReRF [62]: Focuses on neural residual radiance fields for streamable free-viewpoint videos.
  • 3DGStream [53]: This is the most direct baseline comparison for IGS. It's a Gaussian Splatting-based streaming method that optimizes a Neural Transformation Cache to model Gaussian movements between frames. While it improves performance over earlier methods, it still relies on per-frame optimization, resulting in significant delays (over 10 seconds per frame, as noted in the IGS paper's evaluation).

3.3. Technological Evolution

The evolution of 3D scene reconstruction and novel view synthesis has progressed from implicit representations (NeRF) to more explicit and efficient ones (3DGS). Initially, efforts focused on static scenes, then extended to dynamic scenes, first in an offline setting, and more recently, in a streaming manner. The current paper, Instant Gaussian Stream (IGS), represents a significant step in this evolution by integrating the efficiency of 3DGS with a novel generalized motion prediction network and a key-frame-guided strategy to achieve truly fast and generalizable streaming reconstruction of dynamic scenes, moving beyond the per-frame optimization bottleneck of prior streaming methods like 3DGStream.

3.4. Differentiation Analysis

Compared to the main methods in related work, IGS introduces several core innovations:

  • Generalized Motion Prediction vs. Per-Frame Optimization:

    • Prior Streaming Methods (e.g., 3DGStream): These methods, while supporting streaming, still perform per-frame optimization. This means that for each new frame, they need to run an optimization loop to update the Gaussian parameters (or a transformation cache) to match the new observations. This process is computationally expensive and leads to high per-frame latencies (10s+).
    • IGS's Innovation: IGS replaces this per-frame optimization with a generalized Anchor-driven Gaussian Motion Network (AGM-Net). AGM-Net is trained once on a diverse dataset of dynamic scenes. During inference, it can predict the motion of Gaussians between frames in a single forward pass of the network. This "generalizable" nature drastically reduces the per-frame reconstruction time, achieving a speed-up of approximately 6x compared to 3DGStream.
  • Key-frame-guided Streaming Strategy vs. Continuous Propagation:

    • Prior Streaming Methods: Often rely on continuously propagating Gaussian states from one frame to the next. While efficient for small motions, this approach is prone to error accumulation over longer sequences, where small errors in each frame's prediction or optimization can compound.
    • IGS's Innovation: IGS introduces a Key-frame-guided Streaming Strategy. It periodically designates key frames that undergo a more thorough max-points-bounded refinement. For intermediate candidate frames, AGM-Net predicts motion relative to the most recent key frame. This strategy effectively "resets" or "corrects" the accumulated errors at each key frame, preventing them from propagating indefinitely and significantly improving long-term reconstruction quality, especially in temporally complex scenes with non-rigid motion or appearance/disappearance of objects.
  • Handling Non-rigid Motion and Appearance/Disappearance:

    • Prior Motion Models: Many struggle with non-rigid motion or significant topological changes (objects appearing/disappearing) because they primarily model rigid or small deformations.

    • IGS's Innovation: The max-points-bounded refinement at key frames allows IGS to optimize all parameters of the Gaussians, including supporting cloning, splitting, and filtering of Gaussians (similar to initial 3DGS optimization). This enables IGS to adapt to significant non-rigid deformations and object appearance/disappearance, making it more robust for diverse dynamic scenes.

      In summary, IGS differentiates itself by introducing a truly generalized and non-optimizing approach for inter-frame Gaussian motion combined with a strategic error-correction mechanism, addressing both the speed and quality limitations of previous streaming FVV methods.

4. Methodology

4.1. Principles

The core idea behind Instant Gaussian Stream (IGS) is to model dynamic scenes in a streaming manner with minimal per-frame reconstruction time and reduced error accumulation. This is achieved through two main principles:

  1. Generalized Motion Generation: Instead of optimizing Gaussian motion for each new frame, IGS employs a pre-trained, generalized network (the Anchor-driven Gaussian Motion Network, or AGM-Net) that can infer Gaussian transformations in a single forward pass. This network learns to predict motion from multi-view 2D features lifted to 3D anchor points.
  2. Key-frame-guided Error Mitigation: To counteract error accumulation and accurately represent temporally complex scenes, IGS uses a key-frame-guided strategy. This involves periodically refining designated key frames with a max-points-bounded refinement process, which effectively resets the accumulation of errors and allows for adaptive scene representation.

4.2. Core Methodology In-depth (Layer by Layer)

The overall pipeline of IGS is illustrated in Figure 2 (from the original paper). It shows a continuous streaming process where AGM-Net propagates Gaussians between frames, with periodic keyframe refinement.

The following figure (Figure 2 from the original paper) illustrates the overall pipeline of Instant Gaussian Stream (IGS):

Figure 2 (schematic): the figure illustrates the key-frame-guided streaming strategy of Instant Gaussian Stream, depicting motion feature extraction, anchor sampling, motion feature lifting, and the motion decoder; refining each key frame reduces error accumulation and enables fast response.

The figure depicts the Instant Gaussian Stream framework. It shows how multi-view images are fed into the Anchor-driven Gaussian Motion Network (AGM-Net) to extract motion features. These features, along with anchor points, drive the motion of Gaussian primitives from the previous frame to the current frame. The pipeline highlights a Key-frame-guided Streaming Strategy where AGM-Net is used for candidate frames, and periodically, a key-frame undergoes max-point bounded refinement to update the Gaussian primitives.

4.2.1. Preliminary: Gaussian Splatting Fundamentals

As a foundation, IGS utilizes 3D Gaussian Splatting (3DGS) [26] to represent scenes. In 3DGS, a scene is represented by a collection of anisotropic 3D Gaussian primitives. Each primitive $\mathcal{G}_i$ is defined by its:

  • Center $\mu \in \mathbb{R}^3$: The 3D position of the Gaussian.

  • 3D Covariance Matrix $\Sigma \in \mathbb{R}^{3 \times 3}$: Defines the shape and orientation of the Gaussian.

  • Opacity $\alpha \in \mathbb{R}$: Determines how transparent or opaque the Gaussian is.

  • Color $c \in \mathbb{R}^3$: The color of the Gaussian.

    The probability density function for a 3D Gaussian is given by: $ \mathcal{G}(x) = e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} $ where:

  • $\mathcal{G}(x)$ represents the density at a 3D point $x$.

  • $x$ is a 3D point in space.

  • $\mu$ is the 3D mean (center) of the Gaussian.

  • $\Sigma^{-1}$ is the inverse of the 3D covariance matrix, which dictates the spread and orientation.

    During rendering, these 3D Gaussians are first projected onto a 2D image plane. Then, the Gaussians covering a pixel are sorted by depth, and their colors are composited using point-based alpha blending: $ \mathbf{c} = \sum_{i=1}^{n} c_i \alpha_i' \prod_{j=1}^{i-1} (1 - \alpha_j') $ where:

  • $\mathbf{c}$ is the final color of the pixel.

  • $n$ is the total number of Gaussians covering the pixel, sorted from front to back.

  • $c_i$ is the color of the $i$-th Gaussian.

  • $\alpha_i'$ is the opacity of the $i$-th Gaussian after projection to 2D.

  • $\prod_{j=1}^{i-1} (1 - \alpha_j')$ is the accumulated transmittance, representing the portion of light that passes through all preceding Gaussians up to the $(i-1)$-th one.

4.2.2. Anchor-driven Gaussian Motion Network (AGM-Net)

The AGM-Net is designed to infer the motion of Gaussian primitives between the previous frame and the current frame in a single feedforward pass, eliminating the need for iterative optimization.

4.2.2.1. Motion Feature Maps

The process begins by acquiring multi-view images of the current frame, denoted as $\mathbf{I}' = (I'_1, \ldots, I'_V)$, along with their corresponding camera parameters. For each viewpoint, an image pair is formed, consisting of the current frame $I'_j$ and the previous frame $I_j$. These image pairs are then fed into an optical flow model (specifically, GM-Flow [68] is used, fine-tuned with a Swin-Transformer [36] block). This model extracts intermediate flow embeddings. A modulation layer [9, 43] is subsequently applied to inject viewpoint and depth information into these embeddings, resulting in 2D motion feature maps $F \in \mathbb{R}^{V \times C \times H \times W}$.

  • $V$: Number of views.
  • $C$: Number of channels in the feature map.
  • $H$: Height of the feature map.
  • $W$: Width of the feature map.
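
As a rough illustration of the tensor shapes involved, the sketch below builds the $F \in \mathbb{R}^{V \times C \times H \times W}$ motion feature maps from per-view flow embeddings using a FiLM-style modulation; the flow backbone is replaced by random embeddings and the modulation design is an assumption, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class ModulatedMotionFeatures(nn.Module):
    """Illustrative sketch: inject per-view camera/depth conditioning into
    2D flow embeddings via FiLM-style modulation (assumed design)."""

    def __init__(self, embed_dim: int = 128, cond_dim: int = 32):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * embed_dim)

    def forward(self, flow_embed: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # flow_embed: (V, C, H, W) intermediate embeddings from an optical flow model
        # cond:       (V, cond_dim) encoded viewpoint and depth information per view
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)  # (V, C) each
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return flow_embed * (1.0 + scale) + shift                  # (V, C, H, W)

# Example with V=4 views, C=128 channels, 128x128 maps (values from the paper's config).
F = ModulatedMotionFeatures()(torch.randn(4, 128, 128, 128), torch.randn(4, 32))
print(F.shape)  # torch.Size([4, 128, 128, 128])
```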

4.2.2.2. Anchor Sampling

To deform the Gaussian primitives $\mathcal{G}$ from the previous frame to the current one, the motion of each Gaussian needs to be computed. However, directly doing this for a large number of Gaussians is computationally and memory-intensive. To address this, AGM-Net employs an anchor-point-based approach. A sparse set of $M$ anchor points is sampled from the existing $N$ Gaussian primitives' positions using Farthest Point Sampling (FPS). $ \mathcal{C} = \mathbf{FPS}(\{\mu_i\}_{i \in N}) $ where:

  • $\mathcal{C} \in \mathbb{R}^{M \times 3}$ represents the sampled anchor points.
  • $M$ is the number of anchor points (set to 8192 in experiments).
  • $\{\mu_i\}_{i \in N}$ is the set of 3D positions of all $N$ Gaussian primitives.
  • $\mathbf{FPS}(\cdot)$ denotes the Farthest Point Sampling algorithm, which ensures the anchor points are well-distributed across the scene.
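
The anchor sampling step can be illustrated with a plain NumPy reference implementation of farthest point sampling; this is a quadratic-time sketch with an arbitrary seed point, whereas production pipelines would typically use a CUDA kernel.

```python
import numpy as np

def farthest_point_sampling(mu: np.ndarray, m: int) -> np.ndarray:
    """Select m well-spread anchor points from the N Gaussian centers mu (N, 3).

    Returns the (m, 3) anchor coordinates, i.e. C = FPS({mu_i}).
    """
    n = mu.shape[0]
    selected = np.zeros(m, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)               # arbitrary seed point
    for i in range(1, m):
        # update each point's squared distance to the nearest already-selected anchor
        diff = mu - mu[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = np.argmax(dist)                 # farthest remaining point
    return mu[selected]

# e.g. sample M = 8192 anchors from ~150k Gaussian centers
anchors = farthest_point_sampling(np.random.rand(150_000, 3), 8192)
```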

4.2.2.3. Projection-aware 3D Motion Feature Lift

The 2D motion features are then lifted into 3D space in a projection-aware manner. Each sampled anchor point $\mathcal{C}_i$ is projected onto each 2D motion feature map $F_j$ using the corresponding camera parameters. Bilinear interpolation is then used to extract high-resolution motion features for each anchor point from these projected locations. $ f_i = \frac{1}{V} \sum_{j \in V} \Psi(\Pi_j(\mathcal{C}_i), F_j) $ where:

  • $f_i \in \mathbb{R}^C$ is the aggregated 3D motion feature for the $i$-th anchor point.

  • $\Pi_j(\mathcal{C}_i)$ represents the projection of the $i$-th anchor point $\mathcal{C}_i$ onto the image plane of the $j$-th view's feature map $F_j$, using the camera parameters of view $j$.

  • $\Psi$ denotes bilinear interpolation, used to sample features from the continuous feature map $F_j$ at the projected 2D coordinates.

  • The features are averaged over all $V$ views to get a single 3D feature for each anchor.

    These 3D motion features, $\{f_i\}_{i \in M}$, are then fed into a Transformer block. This Transformer uses self-attention to further capture motion information and relationships among the anchor points within the 3D scene. $ \{z_i : z_i \in \mathbb{R}^C\}_{i \in M} = \mathbf{Transformer}(\{f_i\}_{i \in M}) $ where:

  • $\{z_i\}_{i \in M}$ are the final refined 3D motion features, each $z_i \in \mathbb{R}^C$.

  • $\mathbf{Transformer}(\cdot)$ represents the Transformer block, which processes the set of features $\{f_i\}_{i \in M}$ to produce $\{z_i\}_{i \in M}$. These features $z_i$ now encapsulate the motion information for each anchor and its spatial context.
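
As a rough sketch of the projection-aware lift, the PyTorch snippet below projects anchor points with per-view projection matrices, bilinearly samples each view's feature map with grid_sample, and averages over views. The pinhole projection convention and the helper name are assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F_nn

def lift_motion_features(anchors, feature_maps, projections):
    """Sketch of f_i = (1/V) * sum_j Psi(Pi_j(C_i), F_j).

    anchors:      (M, 3)       anchor coordinates C
    feature_maps: (V, C, H, W) 2D motion feature maps F_j
    projections:  (V, 3, 4)    camera projection matrices (assumed pinhole, pixel coords)
    """
    V, C, H, W = feature_maps.shape
    homog = torch.cat([anchors, torch.ones(len(anchors), 1)], dim=-1)   # (M, 4)
    lifted = []
    for j in range(V):
        uvw = homog @ projections[j].T                                   # (M, 3)
        uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                    # pixel coordinates
        # normalize to [-1, 1] for grid_sample (the bilinear interpolation Psi)
        grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1
        sampled = F_nn.grid_sample(feature_maps[j:j + 1],
                                   grid.view(1, -1, 1, 2),
                                   mode='bilinear', align_corners=True)  # (1, C, M, 1)
        lifted.append(sampled.squeeze(0).squeeze(-1).T)                  # (M, C)
    return torch.stack(lifted).mean(dim=0)                               # average over views

f = lift_motion_features(torch.rand(8192, 3),
                         torch.randn(4, 128, 128, 128),
                         torch.randn(4, 3, 4))
print(f.shape)  # torch.Size([8192, 128])
```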

4.2.2.4. Interpolate and Motion Decode

The 3D motion features $z_i$ stored at each anchor point are used to determine the motion of all Gaussian primitives. For each Gaussian primitive $\mathcal{G}_i$, its motion feature is obtained by interpolating from its $K$ nearest anchor points. A weighted average, based on the inverse exponential of the Euclidean distance, is used for this interpolation: $ z_i = \frac{\sum_{k \in \mathcal{N}(i)} e^{-d_k} z_k}{\sum_{k \in \mathcal{N}(i)} e^{-d_k}} $ where:

  • $z_i$ is the interpolated motion feature assigned to Gaussian primitive $\mathcal{G}_i$.

  • $\mathcal{N}(i)$ is the set of neighboring anchor points of Gaussian primitive $\mathcal{G}_i$.

  • $d_k$ is the Euclidean distance from Gaussian point $\mathcal{G}_i$ to the $k$-th anchor point $\mathcal{C}_k$. The exponential term $e^{-d_k}$ gives higher weight to closer anchors.

    Finally, a Linear head (a simple fully connected layer) decodes this interpolated motion feature $z_i$ into the specific movement parameters for the Gaussian primitive: $ d\mu_i, drot_i = \mathbf{Linear}(z_i) $ where:

  • $d\mu_i \in \mathbb{R}^3$ is the deformation of the Gaussian's position (translation vector).

  • $drot_i$ is the deformation of the rotation (represented as a quaternion delta).

    The new position and rotation of the Gaussian primitive are then updated: $ \mu_i' = \mu_i + d\mu_i $ and $ rot_i' = norm(rot_i) \times norm(drot_i) $ where:

  • $\mu_i'$ and $rot_i'$ are the new position and rotation of the Gaussian.

  • $\mu_i$ and $rot_i$ are the position and rotation from the previous frame.

  • $norm(\cdot)$ denotes quaternion normalization.

  • $\times$ represents quaternion multiplication, as commonly used for composing rotations.
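
A minimal PyTorch sketch of the interpolation, linear decoding, and update steps is given below; the K=4 neighborhood, the 7-dimensional head layout (3 values for $d\mu$, 4 for $drot$), and the Hamilton quaternion convention are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def interpolate_anchor_features(gauss_xyz, anchor_xyz, anchor_feat, k=4):
    """z_i = sum_k exp(-d_k) z_k / sum_k exp(-d_k) over the K nearest anchors."""
    d = torch.cdist(gauss_xyz, anchor_xyz)                  # (N, M) Euclidean distances
    d_k, idx = d.topk(k, dim=-1, largest=False)             # K nearest anchors per Gaussian
    w = torch.exp(-d_k)                                     # inverse-exponential weights
    w = w / w.sum(dim=-1, keepdim=True)
    return (w.unsqueeze(-1) * anchor_feat[idx]).sum(dim=1)  # (N, C)

def quat_normalize(q):
    return q / q.norm(dim=-1, keepdim=True)

def quat_multiply(q, r):
    """Hamilton product (w, x, y, z), used here to compose rotations."""
    w1, x1, y1, z1 = q.unbind(-1)
    w2, x2, y2, z2 = r.unbind(-1)
    return torch.stack([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                        w1*x2 + x1*w2 + y1*z2 - z1*y2,
                        w1*y2 - x1*z2 + y1*w2 + z1*x2,
                        w1*z2 + x1*y2 - y1*x2 + z1*w2], dim=-1)

# Decode per-Gaussian motion with a hypothetical linear head and apply the update.
head = nn.Linear(128, 7)
z = interpolate_anchor_features(torch.rand(1000, 3), torch.rand(8192, 3), torch.randn(8192, 128))
dmu, drot = head(z).split([3, 4], dim=-1)
mu_new = torch.rand(1000, 3) + dmu
rot_new = quat_multiply(quat_normalize(torch.randn(1000, 4)), quat_normalize(drot))
```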

4.2.3. Key-frame-guided Streaming

While AGM-Net efficiently propagates Gaussian motion, it primarily handles transformations of existing Gaussians and doesn't inherently account for non-rigid motion or objects appearing/disappearing. Also, without correction, small inference errors could accumulate. The Key-frame-guided Streaming strategy addresses these limitations.

4.2.3.1. Key-frame-guided strategy

The strategy works as follows:

  1. Key Frame Selection: Starting from frame 0, a key frame is designated every $w$ frames. This creates a sequence of key frames: $\{K_0, K_w, \ldots, K_{nw}\}$. All other frames are candidate frames.
  2. Gaussian Propagation: During streaming, Gaussians are deformed forward using AGM-Net. For example, from a key frame $K_{iw}$, AGM-Net predicts the motion for subsequent candidate frames until the next key frame $K_{(i+1)w}$ is reached.
  3. Key Frame Refinement: Upon reaching a key frame $K_{(i+1)w}$, the Gaussians that have been propagated to this frame are refined. This refinement process corrects errors and adapts to scene changes.
  4. Continued Streaming: After refinement, the refined Gaussians of $K_{(i+1)w}$ serve as the new base for further propagation to subsequent frames.

Advantages of this strategy:

  • Mitigates Error Accumulation: By starting from the most recently refined key frame, AGM-Net operates on a corrected state, preventing error propagation across long sequences of candidate frames.
  • Low Per-Frame Reconstruction Time: Candidate frames only require a single AGM-Net inference, avoiding costly optimization.
  • Batch Processing: Up to $w$ frames can be processed in a batch following each key frame, further accelerating the pipeline.
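
The control flow of this strategy can be summarized with the following illustrative Python pseudocode. The helper names agm_net.predict_motion and refine_key_frame are hypothetical, and predicting each candidate frame relative to the most recent key frame follows the description above.

```python
def stream_reconstruction(frames, gaussians_0, agm_net, w=5, refine_iters=50):
    """Illustrative control flow for key-frame-guided streaming (not the released code).

    frames:      per-frame multi-view image sets; frame 0 is already reconstructed
    gaussians_0: Gaussians optimized for frame 0 (the first key frame)
    agm_net:     trained Anchor-driven Gaussian Motion Network (single-pass inference)
    """
    key_frame_gaussians = gaussians_0
    outputs = [gaussians_0]
    for t in range(1, len(frames)):
        # single feed-forward inference relative to the most recent key frame;
        # up to w candidate frames after a key frame could also be processed as a batch
        gaussians_t = agm_net.predict_motion(key_frame_gaussians, frames[t])
        if t % w == 0:
            # key frame: max-points-bounded refinement corrects accumulated error
            gaussians_t = refine_key_frame(gaussians_t, frames[t], iters=refine_iters)
            key_frame_gaussians = gaussians_t
        outputs.append(gaussians_t)
    return outputs
```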

4.2.3.2. Max Points Bounded Key-frame Refinement

During the refinement of key frames, all parameters of the Gaussians are optimized, similar to the initial 3DGS optimization process. This includes cloning, splitting, and filtering Gaussians based on rendering error and density. This allows the system to:

  • Handle object deformations.

  • Account for objects appearing or disappearing in the scene.

  • Prevent error accumulation from propagating from one key frame to the next.

    However, an issue with unconstrained densification is a potential increase in the number of Gaussian primitives at each key frame, leading to:

  • Increased computational complexity.

  • Higher storage requirements.

  • Risk of overfitting, especially in sparse-viewpoint scenes common in dynamic captures.

    To counter this, a Max Points Bounded Refine method is adopted. When densifying Gaussians (e.g., splitting or cloning), the number of allowed new Gaussians is controlled by adjusting each point's gradient. This ensures that the total number of Gaussian primitives does not exceed a predefined maximum limit, thus managing resource usage and preventing overfitting.
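
The paper describes this cap only at a high level (the number of newly created Gaussians is controlled by adjusting each point's gradient). One plausible way to realize such a bound is sketched below, where only the largest accumulated gradients within the remaining point budget are kept; the exact mechanism is an assumption.

```python
import torch

def bound_densification_grads(grad_accum: torch.Tensor, num_points: int,
                              max_points: int) -> torch.Tensor:
    """Zero the accumulated positional gradients of all but the top-budget Gaussians.

    grad_accum: (N,) accumulated view-space gradient magnitudes that normally decide
    which Gaussians get cloned or split. Keeping only the largest `budget` entries is
    one way to ensure at most (max_points - num_points) new Gaussians are created in
    this densification step (illustrative policy, not the paper's exact rule).
    """
    budget = max(max_points - num_points, 0)
    bounded = torch.zeros_like(grad_accum)
    if budget > 0:
        top_idx = grad_accum.topk(min(budget, grad_accum.numel())).indices
        bounded[top_idx] = grad_accum[top_idx]
    return bounded
```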

4.2.4. Loss Function

The training process involves two phases, each using a specific loss function:

  1. Offline Training of the Generalized AGM-Net: The AGM-Net is trained once across multiple dynamic scenes. This training uses a view synthesis loss between the predicted views from the AGM-Net and the ground truth views. The loss function combines an $\mathcal{L}_1$ term and an $\mathcal{L}_{D-SSIM}$ term: $ \mathcal{L} = (1 - \lambda) \mathcal{L}_1 + \lambda \mathcal{L}_{D-SSIM} $ where:

    • $\mathcal{L}$ is the total loss function.
    • $\mathcal{L}_1$ is the L1 loss (Mean Absolute Error), which measures the absolute difference between predicted and ground truth pixel values. It encourages pixel-wise accuracy.
    • $\mathcal{L}_{D-SSIM}$ is the D-SSIM loss (1 minus the Structural Similarity Index Measure). SSIM is a perceptual metric that measures the similarity between two images, taking into account luminance, contrast, and structure. Using 1-SSIM as a loss term encourages perceptually similar predictions.
    • $\lambda$ is a weighting parameter (set to 0.2 in experiments) that balances the contribution of the $\mathcal{L}_1$ and $\mathcal{L}_{D-SSIM}$ terms.
  2. Online Training (Refinement) of Key Frames: During the streaming inference, when a key frame is encountered, the Gaussian attributes are optimized. This online training also utilizes the same loss function as above (Eq. 10). However, in this phase, the optimization is applied to the attributes of the Gaussian primitives (position, covariance, opacity, color) themselves, rather than the parameters of the neural network. This allows the Gaussians to adapt to the specific key frame's observations and correct any accumulated errors.
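
A minimal PyTorch version of this objective (Eq. 10) might look as follows; it uses a third-party differentiable SSIM, and the specific SSIM implementation is an assumption, not the authors' code.

```python
import torch
from pytorch_msssim import ssim  # any differentiable SSIM implementation works here

def igs_loss(pred: torch.Tensor, gt: torch.Tensor, lam: float = 0.2) -> torch.Tensor:
    """L = (1 - lambda) * L1 + lambda * L_D-SSIM, with lambda = 0.2 as in the paper.

    pred, gt: (B, 3, H, W) rendered and ground-truth images in [0, 1].
    """
    l1 = (pred - gt).abs().mean()
    d_ssim = 1.0 - ssim(pred, gt, data_range=1.0)
    return (1.0 - lam) * l1 + lam * d_ssim

# The same objective is used for offline AGM-Net training (updating network weights)
# and for online key-frame refinement (updating Gaussian attributes).
loss = igs_loss(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```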

5. Experimental Setup

5.1. Datasets

The authors used two primary datasets for evaluating IGS:

  • Neural 3D Video Datasets (N3DV) [30]:

    • Source: A publicly available dataset for dynamic scene reconstruction.
    • Characteristics: It includes 6 dynamic scenes, each recorded using a multi-view setup with 21 cameras.
    • Resolution: The videos have a high resolution of $2704 \times 2028$ pixels.
    • Duration: Each multi-view video comprises 300 frames.
    • Domain: Indoor scenes with various objects and human interactions.
    • Usage: Four sequences were used for training AGM-Net (offline training phase). The remaining two sequences, {cut roasted beef, sear steak}, were designated as the test set for in-domain evaluation.
  • Meeting Room Datasets [29]:

    • Source: Another dataset used for dynamic scene reconstruction.
    • Characteristics: Includes 3 dynamic scenes, recorded with 13 cameras.
    • Resolution: The videos have a resolution of $1280 \times 720$ pixels.
    • Duration: Each multi-view video also contains 300 frames.
    • Domain: Indoor meeting room environments.
    • Usage: Used for cross-domain evaluation to test the generalization capability of the model trained on N3DV.

Dataset Preparation Details:

  • For AGM-Net training, 3D Gaussians were constructed for all frames in the four N3DV training sequences, totaling 1200 frames. This process required 192 GPU hours.
  • For each frame's 3D Gaussians, motion was applied forward and backward over five frames, generating 12,000 pairs that were used to train the AGM-Net.
  • For testing, one viewpoint was selected for evaluation for both datasets, consistent with previous methods.

5.2. Evaluation Metrics

The paper evaluates the performance of IGS using several standard metrics, averaged over the full 300-frame sequence (including frame 0):

  • PSNR (Peak Signal-to-Noise Ratio) ↑ (dB):

    • Conceptual Definition: PSNR is a common objective metric used to quantify the quality of a reconstructed image compared to a reference (ground truth) image. It measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. A higher PSNR value indicates higher image quality and less distortion.
    • Mathematical Formula: $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $
    • Symbol Explanation:
      • $MAX_I$: The maximum possible pixel value of the image (e.g., 255 for 8-bit grayscale images).
      • MSE: Mean Squared Error between the reconstructed image and the ground truth image, calculated as: $ MSE = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $ where $I$ is the original image, $K$ is the reconstructed image, and $m \times n$ is the size of the images. A minimal computation sketch is given after this metrics list.
  • Storage ↓ (MB):

    • Conceptual Definition: This metric quantifies the memory footprint required to store the 3D Gaussian representation of the dynamic scene. Lower storage values are desirable for efficient deployment and distribution.
    • Calculation (for IGS): Includes the Gaussian primitives for frame 0 and each key frame, as well as residuals (displacement $d\mu$ and rotation $drot$, along with a mask of points with motion) for each candidate frame. The reported value is the average storage requirements over the 300 frames.
  • Train ↓ (s):

    • Conceptual Definition: Refers to the average time required to reconstruct (or "train") a single frame within the streaming pipeline. This includes the time for Gaussian initialization for frame 0, AGM-Net inference for candidate frames, and refinement for key frames. Lower training time per frame is critical for real-time streaming applications.
    • Calculation: The total time to construct the Free-Viewpoint Video for the entire sequence (e.g., 300 frames) divided by the total number of frames.
  • Render ↑ (FPS - Frames Per Second):

    • Conceptual Definition: Measures the speed at which novel views can be rendered from the reconstructed 3D scene. Higher FPS indicates smoother and more responsive real-time rendering.
  • DSSIM (1-SSIM) ↓:

    • Conceptual Definition: SSIM (Structural Similarity Index Measure) is a perception-based metric that measures the similarity between two images, focusing on luminance, contrast, and structure, which aligns better with human visual perception than PSNR. DSSIM usually refers to 1-SSIM, so a lower DSSIM value indicates higher similarity and better perceptual quality.
    • Mathematical Formula (for SSIM): $ SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
    • Symbol Explanation:
      • x, y: Two image patches being compared.
      • $\mu_x, \mu_y$: Mean intensity of $x$ and $y$, respectively.
      • $\sigma_x, \sigma_y$: Standard deviation of $x$ and $y$, respectively.
      • $\sigma_{xy}$: Covariance of $x$ and $y$.
      • $c_1 = (K_1 L)^2, c_2 = (K_2 L)^2$: Two small constants to avoid division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255) and $K_1, K_2 \ll 1$.
  • LPIPS (Learned Perceptual Image Patch Similarity) ↓:

    • Conceptual Definition: LPIPS is a metric that measures the perceptual distance between two images using features extracted from a pre-trained deep neural network (e.g., AlexNet or VGG). It has been shown to correlate better with human judgment of image similarity than traditional metrics like PSNR or SSIM. A lower LPIPS score indicates higher perceptual similarity (better quality).
    • Mathematical Formula: LPIPS is not defined by a simple closed-form expression; instead, it involves extracting features from a deep neural network at different layers, normalizing them, and then computing a weighted Euclidean distance between the feature stacks. Conceptually: $ LPIPS(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\phi_l(x)_{hw} - \phi_l(y)_{hw}) \|_2^2 $
    • Symbol Explanation:
      • x, y: The two image patches being compared.
      • $\phi_l$: Features extracted from layer $l$ of a pre-trained deep neural network.
      • $w_l$: A learnable vector that weights the channels of the feature map at layer $l$.
      • $\odot$: Element-wise product.
      • $H_l, W_l$: Height and width of the feature map at layer $l$.
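
For concreteness, here is a small NumPy computation of PSNR as defined above; the toy images are illustrative.

```python
import numpy as np

def psnr(reference: np.ndarray, reconstruction: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX_I^2 / MSE); higher is better."""
    mse = np.mean((reference.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy 8-bit example: a uniform error of 4 gray levels gives roughly 36 dB.
ref = np.full((64, 64), 120, dtype=np.uint8)
rec = ref + 4
print(round(psnr(ref, rec), 2))
```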

5.3. Baselines

The paper compares IGS against both offline and online (streaming) methods for dynamic scene reconstruction:

Offline Training Methods: These methods typically require the entire video sequence to be available before training, making them unsuitable for real-time streaming but often achieving high quality.

  • Kplanes [17]: An explicit radiance field representation in space, time, and appearance.

  • Realtime-4DGS [75]: A method for real-time photorealistic dynamic scene representation and rendering using 4D Gaussian Splatting.

  • 4DGS [66]: Another 4D Gaussian Splatting approach for real-time dynamic scene rendering.

  • Spacetime-GS [31]: Focuses on spacetime Gaussian feature splatting for real-time dynamic view synthesis.

  • Saro-GS [70]: A 4D Gaussian Splatting method using a scale-aware residual field and adaptive optimization.

    Online Training Methods (Streaming): These methods aim to reconstruct dynamic scenes frame by frame to support streaming, but often suffer from high latency due to per-frame optimization.

  • StreamRF [29]: A streaming radiance fields approach for 3D video synthesis.

  • 3DGStream [53]: A state-of-the-art 3DGS-based method for efficient streaming of photo-realistic free-viewpoint videos. It models Gaussian movements by optimizing a Neural Transformation Cache. The paper specifically re-evaluates 3DGStream (3DGStream†) under its own experimental environment (same initial point cloud and Gaussian Splatting Rasterization variant) for a fair comparison.

5.4. Implementation Details

5.4.1. AGM Network Configuration (Sec. 4.2)

  • Optical Flow Model: GM-Flow [68] is used to extract optical flow embeddings, with an added Swin-Transformer [36] block for fine-tuning.
  • Input Views: AGM-Net is designed to accept an arbitrary number of input views. For balancing computational complexity and performance, $V=4$ views are used.
  • Motion Map Features: Each view produces a motion map with $C=128$ channels and a resolution of $128 \times 128$.
  • Anchor Points: $M=8192$ anchor points are sampled from the Gaussian points using Farthest Point Sampling (FPS).
  • Transformer Block: The Transformer block in the 3D motion feature lift module comprises 4 layers, outputting a 3D motion feature with $C=128$ channels.
  • Rendering: A variant of Gaussian Splatting Rasterization from Rade-GS [80] is adopted for more accurate depth maps and geometry.

5.4.2. Training Hyperparameters (Sec. 4.2)

  • Input/Supervision Views: Randomly selects 4 views as input and uses 8 views for supervision during training.
  • Hardware: Training is conducted on four A100 GPUs with 40GB of memory each.
  • Epochs & Batch Size: Runs for 15 epochs with a batch size of 16.
  • Loss Function Parameter: $\lambda$ in Eq. 10 is set to 0.2.
  • Optimizer: Adam optimizer with a weight decay of 0.05, and $\beta$ values of (0.9, 0.95).
  • Learning Rate: Set to $4 \times 10^{-4}$ for training on the N3DV dataset.

5.4.3. Streaming Inference Setup (Sec. 4.3)

  • Key Frame Interval ($w$): Set to $w=5$ frames to construct key frame sequences. This means a video of 300 frames will have 60 key frames. An ablation study is performed on different $w$ values.
  • Key Frame Optimization Versions:
    • IGS-s (Ours-s): Smaller version with 50 iterations of refinement for key frames, aiming for lower per-frame latency.
    • IGS-l (Ours-l): Larger version with 100 iterations of refinement for key frames, aiming for higher reconstruction quality.
  • Densification and Pruning: Performed every 20 iterations during key frame refinement, mirroring 3DGS techniques.
  • Initial Frame (Frame 0) Gaussian Construction:
    • For test sequences, Gaussians for the 0th frame are constructed using the compression method provided by Lightgaussian [13] to reduce storage and mitigate overfitting in sparse viewpoints.
    • N3DV dataset: 6,000 iterations for training the first frame, with Gaussian compression at 5,000 iterations.
    • Meeting Room dataset: 15,000 iterations for training the first frame, with Gaussian compression at 7,000 iterations.
  • Key Frame Refinement Parameters (Supplementary Sec. C):
    • SH degree: Set to 3 for N3DV scenes and 1 for Meeting Room to mitigate overfitting in sparse viewpoints.

    • Learning Rate: For position and rotation, it's set to ten times that in 3DGS; other parameters remain consistent with 3DGS.

    • Max Points Number ($N_{max}$): Determined by the number of Gaussians in the initial frame. $N_{max} = 150,000$ for N3DV and $N_{max} = 40,000$ for the Meeting Room dataset.

      The following are the reconstruction results of Gaussian models for the first frame in each scenario from [Table C1] of the original paper:

      | Scene | PSNR↑ (dB) | Train↓ (s) | Storage↓ (MB) | Points Num |
      | --- | --- | --- | --- | --- |
      | N3DV [30]: cut roasted beef | 33.96 | 287 | 36 | 149188 |
      | N3DV [30]: sear steak | 34.03 | 287 | 35 | 143996 |
      | Meeting Room [29]: trimming | 30.36 | 540 | 3.9 | 37432 |
      | Meeting Room [29]: vrheadset | 30.68 | 540 | 4 | 38610 |

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that IGS significantly improves streaming Free-Viewpoint Video (FVV) reconstruction in terms of speed, quality, and generalizability, addressing the limitations of prior methods.

6.1.1. In-domain Evaluation (N3DV Dataset)

The in-domain evaluation on the N3DV dataset compares IGS with various offline and online state-of-the-art methods.

The following are the results from [Table 1] of the original paper:

| Method | PSNR↑ (dB) | Train↓ (s) | Render↑ (FPS) | Storage↓ (MB) |
| --- | --- | --- | --- | --- |
| Offline training |  |  |  |  |
| Kplanes [17] | 32.17 | 48 | 0.15 | 1.0 |
| Realtime-4DGS [75] | 33.68 | - | 114 | - |
| 4DGS [66] | 32.70 | 7.8 | 30 | 0.3 |
| Spacetime-GS [31] | 33.71 | 48 | 140 | 0.7 |
| Saro-GS [70] | 33.90 | - | 40 | 1.0 |
| Online training |  |  |  |  |
| StreamRF [29] | 32.09 | 15 | 8.3 | 31.4 |
| 3DGStream [53] | 33.11 | 12 | 215 | 7.8 |
| 3DGStream [53]† | 32.75 | 16.93 | 204 | 7.69 |
| Ours-s | 33.89 | 2.67 | 204 | 7.90 |
| Ours-l | 34.15 | 3.35 | 204 | 7.90 |

Note: † indicates re-evaluation under the authors' experimental environment for fair comparison.

Key Findings:

  • Per-frame Reconstruction Time: IGS-s achieves an average Train time of 2.67 seconds, and IGS-l achieves 3.35 seconds. This is a substantial improvement, representing a 6x reduction compared to 3DGStream† (16.93s), the previous state-of-the-art streaming method. This confirms IGS's ability to provide rapid responsiveness.

  • View Synthesis Quality: Both IGS-s (33.89 dB) and IGS-l (34.15 dB) achieve higher PSNR values than 3DGStream† (32.75 dB) and most offline training methods (e.g., Kplanes, 4DGS). IGS-l even surpasses Saro-GS (33.90 dB), an offline method, in PSNR.

  • Rendering Speed & Storage: IGS maintains a comparable real-time Render speed (204 FPS) and Storage overhead (7.90 MB) to 3DGStream† (204 FPS, 7.69 MB), indicating that speed and quality improvements are not at the cost of rendering efficiency or memory footprint.

    Qualitatively, Figure 5 (from the original paper) provides a visual comparison of rendering quality. IGS demonstrates superior detail rendition, particularly in complex dynamic elements like the transition between a knife and fork, and the moving hand and reflections, suggesting better handling of intricate scene dynamics.

The following figure (Figure 5 from the original paper) shows a qualitative comparison of different methods on the N3DV dataset:

Figure 5. Qualitative comparison from the N3DV dataset, showing reconstructions from SaRo-GS, 4DGS, 3DGStream, IGS, and GT at different frames, with red and green boxes marking the regions compared across methods.

The figure presents qualitative comparisons across various methods (SaRo-GS, 4DGS, 3DGStream, Ours-s, Ours-l, and GT) at different frames of a dynamic scene. It highlights Ours-s and Ours-l (IGS) showing finer details and better handling of reflections and object transitions compared to baselines.

6.1.2. Error Accumulation Mitigation

The paper also investigates the effectiveness of IGS in mitigating error accumulation by comparing PSNR trends over time with 3DGStream.

The following figure (Figure 3 from the original paper) shows the PSNR trend comparison on the sear steak sequence:

Figure 3 plots PSNR against frame index on the sear steak sequence: the red IGS-l curve remains stable around a mean of roughly 34 dB, while the green 3DGStream curve trends downward with a fitted slope of about $-3.6 \times 10^{-3}$ dB per frame.

The graph plots PSNR (dB) against Frame Index. The red curve (Ours-l) shows a relatively stable or slightly increasing PSNR trend, indicating that IGS effectively prevents error accumulation. In contrast, the green curve (3DGStream) exhibits a noticeable decline in PSNR as the frame number increases, demonstrating the issue of error accumulation in that method.

Key Findings:

  • IGS (red curve) maintains a stable PSNR quality throughout the video sequence, confirming that the Key-frame-guided Streaming strategy effectively prevents the degradation of reconstruction quality over time.
  • 3DGStream (green curve) shows a clear decline in PSNR as the frame index increases, indicating significant error accumulation.
  • While IGS's PSNR might show more fluctuation, this is attributed to 3DGStream's assumption of small inter-frame motion, leading to smoother but cumulatively erroneous results. IGS's fluctuations are likely due to the periodic keyframe refinements and the generalized motion network adapting to more complex motions.

6.1.3. Cross-domain Evaluation (Meeting Room Dataset)

To assess generalization capability, IGS (trained on N3DV) was evaluated on the unseen Meeting Room Dataset.

The following are the results from [Table 2] of the original paper:

| Method | PSNR↑ (dB) | Train↓ (s) | Render↑ (FPS) | Storage↓ (MB) |
| --- | --- | --- | --- | --- |
| 3DGStream [53]† | 28.36 | 11.51 | 252 | 7.59 |
| Ours-s | 29.24 | 2.77 | 252 | 1.26 |
| Ours-l | 30.13 | 3.20 | 252 | 1.26 |

Note: † indicates re-evaluation under the authors' experimental environment for fair comparison.

Key Findings:

  • IGS significantly outperforms 3DGStream† in all metrics on the cross-domain dataset.

  • PSNR: IGS-l (30.13 dB) shows a substantial improvement over 3DGStream† (28.36 dB).

  • Train time: IGS-s (2.77s) again achieves a much faster per-frame reconstruction time compared to 3DGStream† (11.51s).

  • Storage: IGS also demonstrates better storage efficiency (1.26 MB) than 3DGStream† (7.59 MB).

    These results highlight the strong generalization capability of IGS. It can efficiently model dynamic scenes and provide streaming capabilities in entirely new environments without requiring per-frame optimization or re-training. Qualitatively, Figure 4 (from the original paper) shows that IGS handles large displacements more accurately and produces fewer artifacts near moving objects compared to 3DGStream, leading to improved performance in temporally complex scenes.

The following figure (Figure 4 from the original paper) shows a qualitative comparison on the Meeting Room dataset:

Figure 4. Qualitative comparison from the Meeting Room dataset (cross-domain evaluation with a model trained on N3DV), comparing 3DGStream, IGS, and GT.

The figure presents a qualitative comparison between 3DGStream, Ours-l (IGS), and GT (Ground Truth) on the Meeting Room dataset. Ours-l shows a cleaner reconstruction, especially around the moving arm, with fewer motion artifacts compared to 3DGStream, which exhibits blurring or distortions.

6.2. Ablation Studies / Parameter Analysis

The authors conducted several ablation studies to validate the design choices and parameters of IGS.

6.2.1. Impact of the Pretrained Optical Flow Model

The following are the results from [Table 3] of the original paper:

| Method | PSNR↑ (dB) | Train↓ (s) | Storage↓ (MB) |
| --- | --- | --- | --- |
| No pretrained optical flow model | 31.07 | 2.65 | 7.90 |
| No projection-aware feature lift | 32.95 | 2.38 | 7.90 |
| No points-bounded refinement | 33.23 | 3.02 | 110.26 |
| Ours-s (full) | 33.62 | 2.67 | 7.90 |

To validate the effectiveness of using a pretrained optical flow model (GM-Flow with Swin-Transformer), it was replaced with a 4-layer UNet trained jointly with the overall model from scratch. The results (first row of Table 3) show a significant drop in PSNR (31.07 dB) compared to the full IGS-s model (33.62 dB). This highlights the crucial benefit of leveraging the 2D prior knowledge encoded in a well-trained optical flow model for accurate motion feature extraction.

6.2.2. Impact of Projection-aware 3D Motion Feature Lift

The Projection-aware 3D Motion Feature Lift is a key component for effectively translating 2D motion features to 3D. An alternative Transformer-based approach using cross-attention between image features and anchor points (with positional embeddings) was tested. As shown in the second row of Table 3, removing the projection-aware mechanism (No-projection-aware feature lift) leads to a PSNR of 32.95 dB, which is lower than the full IGS-s (33.62 dB), despite a slight reduction in Train time (2.38s vs 2.67s). This indicates that the projection-aware approach is crucial for IGS's performance, providing a more accurate linking of 3D anchor points to multi-view 2D features.

6.2.3. Impact of Key-frame-guided Streaming

The Key-frame-guided Streaming strategy is designed to handle error accumulation and enhance quality.

  • Without Refinement (Figure 6a): If key frame refinement is omitted (i.e., AGM-Net continuously propagates Gaussians without periodic corrections), the PSNR performance degrades significantly due to accumulated errors. Figure 6a visually demonstrates this decline.

    The following figure (Figure 6 from the original paper) shows an ablation study on Key-frame Refinement and per-frame reconstruction time:

    Figure 6. (a) Ablation study on key-frame refinement, comparing IGS-s with and without key-frame refinement in terms of PSNR over frame index. (b) Per-frame reconstruction time for IGS-s, IGS-l, and 3DGStream.

Figure 6(a) shows an ablation study where IGS-s is compared with IGS-s (no key-frame refinement). The latter experiences a continuous drop in PSNR as the frame index increases, indicating error accumulation, while IGS-s maintains a stable PSNR. Figure 6(b) illustrates the Per-frame reconstruction time for different methods, showing IGS-s and IGS-l have significantly lower and more consistent reconstruction times for candidate frames and periodic, higher times for key frames, compared to 3DGStream's consistently high per-frame time.

  • Max Points Bounded Refinement (Table 3): Without the Max Points Bounded Refine method (No-points bounded refinement), the storage requirements dramatically increase from 7.90 MB to 110.26 MB. Additionally, the PSNR drops to 33.23 dB (compared to 33.62 dB for full IGS-s), suggesting that unconstrained densification leads to overfitting and degraded view quality, especially in sparse-viewpoint scenes.

6.2.4. Impact of Key-frame Selection Interval (w)

The following are the results from [Table 4] of the original paper:

| Method | PSNR (dB)↑ | Train (s)↓ | Storage (MB)↓ |
| --- | --- | --- | --- |
| w = 1 | 33.55 | 6.38 | 36.0 |
| w = 5 | 33.62 | 2.67 | 7.90 |
| w = 10 | 30.14 | 2.75 | 1.26 |

An ablation study on the key-frame interval w was conducted:

  • w = 1 (every frame is a key frame): This configuration leads to excessive optimization, resulting in a higher train time (6.38 s) and significantly increased storage (36.0 MB). Despite the additional optimization, the PSNR (33.55 dB) is slightly lower than with w = 5, indicating overfitting to the training views and poorer generalization to test views.

  • w = 5 (chosen value): This setting strikes the best balance, achieving the highest PSNR (33.62 dB) with a low train time (2.67 s) and efficient storage (7.90 MB).

  • w = 10 (infrequent key frames): With a larger interval, the model relies more on AGM-Net propagation between key frames. The greater distance between key frames weakens performance, leading to a substantial drop in PSNR (30.14 dB), even though train time (2.75 s) and storage (1.26 MB) remain low. This confirms that periodic correction is necessary.

    The results validate that w = 5 is the optimal choice for the key-frame interval, balancing quality, speed, and storage efficiency.

6.2.5. Independent Per-frame Reconstruction Time

Figure 6(b) (from the original paper) illustrates the detailed per-frame reconstruction time profile.

  • Candidate Frames: For frames that are not key frames, the reconstruction time is consistently low, around 0.8 seconds. This is due to the AGM-Net performing a single, fast inference pass.
  • Key Frames: For key frames, the reconstruction time is higher due to the refinement process: ~4 seconds for IGS-s and ~7.5 seconds for IGS-l.
  • Comparison: Even for key frames, these times are significantly lower than 3DGStream's average of 16 seconds per frame, which is consistently high for every frame. This periodic pattern demonstrates IGS's efficiency in streaming.

6.2.6. Impact of the Number of Anchor Points

The following figure (Figure E2 from the original paper) illustrates the impact of the number of anchor points:

Figure E2. The impact of the number of anchor points on PSNR and per-frame training time.

The figure displays two curves: one for PSNR (dB) and one for per-frame training time (s) against the Number of anchor points. As the number of anchor points increases, the PSNR generally improves slightly up to a certain point (e.g., around 8192-16384 points), after which it plateaus or slightly decreases. The per-frame training time, however, shows a steep increase with a larger number of anchor points, especially beyond 8192.

This ablation study (Supplementary Sec. E.2) shows that the number of anchor points has a relatively small impact on PSNR performance within a reasonable range. However, increasing the number of anchor points significantly increases per-frame training time. The choice of M = 8192 anchor points (used in the main experiments) appears to be a good trade-off, providing sufficient dynamic detail without incurring excessive computational overhead.
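To illustrate why the anchor count M trades quality against time, below is a sketch of one plausible way anchors can drive per-Gaussian motion: inverse-distance-weighted interpolation over each Gaussian's K nearest anchors. This specific scheme is an assumption for illustration (the excerpt does not spell out IGS's exact anchor-to-Gaussian mapping), but any such mapping scales in cost with M.

```python
import torch


def propagate_anchor_motion(gauss_xyz, anchor_xyz, anchor_motion, k=4):
    """Drive every Gaussian with its K nearest anchors (illustrative sketch).

    gauss_xyz:     (N, 3) Gaussian centers
    anchor_xyz:    (M, 3) anchor positions (e.g., M = 8192 in the paper)
    anchor_motion: (M, 3) predicted anchor displacements
    Returns (N, 3) per-Gaussian displacements via inverse-distance weighting;
    the actual IGS interpolation may differ.
    """
    # Pairwise distances (N, M); for large N*M a chunked or KNN library
    # lookup would be used instead of a dense matrix.
    dists = torch.cdist(gauss_xyz, anchor_xyz)
    knn_dist, knn_idx = dists.topk(k, dim=1, largest=False)
    weights = 1.0 / (knn_dist + 1e-8)
    weights = weights / weights.sum(dim=1, keepdim=True)   # (N, k)
    neigh_motion = anchor_motion[knn_idx]                   # (N, k, 3)
    return (weights[..., None] * neigh_motion).sum(dim=1)   # (N, 3)
```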

6.2.7. Additional Ablation Studies (Supplementary Sec. D)

The authors also explored other modules that did not yield expected improvements:

  • Attention-Based View Fusion: An attempt to assign different weights to features from different viewpoints using self-attention and Softmax. The following are the results from [Table D2] of the original paper:

    | Method | PSNR (dB)↑ |
    | --- | --- |
    | Add attention-based view fusion | 33.58 |
    | Add occlusion-aware projection | 33.50 |
    | Ours-s | 33.62 |

    As shown in Table D2, adding this module (attention-based view fusion) resulted in a slightly lower PSNR (33.58 dB) than Ours-s (33.62 dB). The authors speculate this is because N3DV scenes are forward-facing, so viewpoint differences are not pronounced enough to warrant complex weighting; the module could be more beneficial for 360-degree scenes. A simplified sketch contrasting both fusion variants follows this list.

  • Occlusion-Aware Projection: An attempt to account for occlusion effects during Projection-Aware Motion Feature Lift using point rasterization [45] to ensure each pixel corresponds to only one visible anchor point. Table D2 shows that Add-Occlusion aware projection also led to a lower PSNR (33.50 dB). The reasoning is that anchor points are much sparser than Gaussian points, so significant occlusion effects are rare. Moreover, rasterization for projection might reduce the accuracy of feature extraction for sparse points.
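For clarity, here is a simplified sketch contrasting plain mean fusion (kept in Ours-s) with a softmax-weighted variant in the spirit of the tested attention-based fusion. The scoring layer and tensor shapes are illustrative assumptions, not the authors' exact module.

```python
import torch
import torch.nn as nn


class ViewFusion(nn.Module):
    """Fuse per-anchor features from V views into one feature per anchor.

    mode='mean' mirrors the simple averaging kept in the final model;
    mode='attention' sketches the tested (and discarded) weighted variant,
    reduced here to a learned per-view score followed by a softmax.
    """

    def __init__(self, dim, mode='mean'):
        super().__init__()
        self.mode = mode
        if mode == 'attention':
            self.score = nn.Linear(dim, 1)  # per-view, per-anchor scalar score

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (V, M, C) per-view, per-anchor features
        if self.mode == 'mean':
            return feats.mean(dim=0)
        # Softmax over the view axis yields per-anchor view weights.
        weights = torch.softmax(self.score(feats), dim=0)  # (V, M, 1)
        return (weights * feats).sum(dim=0)                 # (M, C)
```

In forward-facing rigs the views are so similar that the learned weights add little beyond the mean, which is consistent with the slight PSNR drop reported in Table D2.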

6.2.8. Per-scene results on N3DV (Supplementary Sec. G)

The following are the results from [Table G3] of the original paper:

| Method | cut roasted beef PSNR (dB)↑ | DSSIM↓ | LPIPS↓ | sear steak PSNR (dB)↑ | DSSIM↓ | LPIPS↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Offline training | | | | | | |
| Kplanes[17] | 31.82 | 0.017 | - | 32.52 | 0.013 | - |
| Realtime-4DGS[75] | 33.85 | - | - | 33.51 | - | - |
| 4DGS[66] | 32.90 | 0.022 | - | 32.49 | 0.022 | - |
| Spacetime-GS[31] | 33.52 | 0.011 | 0.036 | 33.89 | 0.009 | 0.030 |
| Saro-GS[70] | 33.91 | 0.021 | 0.038 | 33.89 | 0.010 | 0.036 |
| Online training | | | | | | |
| StreamRF[29] | 31.81 | - | - | 32.36 | - | - |
| 3DGStream[53] | 33.21 | - | - | 33.01 | - | - |
| 3DGStream[53]† | 32.39 | 0.015 | 0.042 | 33.12 | 0.014 | 0.036 |
| Ours-s | 33.62 | 0.012 | 0.048 | 34.16 | 0.010 | 0.038 |
| Ours-l | 33.93 | 0.011 | 0.043 | 34.35 | 0.010 | 0.035 |

Table G3 provides a detailed breakdown of PSNR, DSSIM, and LPIPS for individual scenes in the N3DV dataset. Ours-l consistently achieves the highest PSNR and competitive DSSIM and LPIPS across both "cut roasted beef" and "sear steak" scenes, further reinforcing its superior quality. For instance, on "sear steak", Ours-l achieves 34.35 dB PSNR, outperforming all other methods. Ours-s also shows strong performance, often surpassing most offline methods while being significantly faster.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Instant Gaussian Stream (IGS), a novel streaming-based framework for dynamic scene reconstruction that significantly advances the state-of-the-art. IGS addresses the critical challenges of high per-frame reconstruction time and error accumulation prevalent in previous streaming methods. Its core innovations include:

  1. A generalized Anchor-driven Gaussian Motion Network (AGM-Net): This network leverages multi-view 2D motion features projected onto 3D anchor points to infer Gaussian motion between adjacent frames in a single, fast inference step, eliminating the need for computationally expensive per-frame optimization.

  2. A Key-frame-guided Streaming Strategy: This strategy periodically refines designated key frames with a max-points-bounded approach. This not only allows for accurate reconstruction of temporally complex scenes (including non-rigid motion and object appearance/disappearance) but also effectively mitigates the propagation of error accumulation over long video sequences.

    Extensive evaluations on both in-domain (N3DV) and cross-domain (Meeting Room) datasets demonstrate that IGS achieves an average per-frame reconstruction time of just over 2 seconds (e.g., 2.67s for IGS-s), representing a 6x speedup over prior state-of-the-art streaming methods. Concurrently, IGS achieves enhanced view synthesis quality (higher PSNR), maintains efficient render speed (204 FPS), and exhibits strong generalization capabilities with comparable storage efficiency. This work marks a significant step towards practical, real-time, and high-quality Free-Viewpoint Video generation.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  1. Frame Jittering in Static Backgrounds: IGS exhibits jittering, particularly in static background areas between adjacent frames (Supplementary Figure E1). This is attributed to the current framework's lack of explicit temporal dependencies and its sensitivity to noise. The key-frame optimization can disturb background Gaussians, especially floaters (Supplementary Figure F3).

    • Future Work: Incorporating explicit temporal dependencies (e.g., modeling motion as a time series) could lead to more robust performance and smoother results. Segmenting scenes into foreground and background, and applying the segmentation mask during key-frame optimization, could help prevent unwanted background disturbances. Improving the robustness of first-frame reconstruction in sparse views is also suggested.
  2. Dependency on Initial Frame Quality: The performance of streaming dynamic scene reconstruction methods, including IGS, is influenced by the quality of the static reconstruction in the first frame. Poor initial reconstruction, such as the presence of excessive floaters around moving objects (Supplementary Figure F3) in sparse viewpoints, can negatively impact the subsequent performance of AGM-Net.

    • Future Work: Adopting more robust static reconstruction methods for the initial frame could enhance dynamic scene reconstruction results.
  3. Limited Generalization from Training Data: AGM-Net was trained on a relatively limited dataset (four sequences from the N3DV indoor dataset). This restricts its generalization capability to broader, more diverse dynamic scenes.

    • Future Work: Training on larger-scale and more diverse multi-view video sequences is a promising direction for improving generalization. The authors note that their method's reliance solely on view synthesis loss for supervision makes it amenable to large-scale datasets without requiring explicit ground truth annotations.
  4. Motion Feature Extraction: The current approach injects depth and view conditions into the embeddings of an optical flow model to provide 3D scene awareness.

    • Future Work: Leveraging more accurate and advanced long-range optical flow [65] or scene flow [38, 58] methods could further improve the motion prediction capabilities of AGM-Net.

7.3. Personal Insights & Critique

This paper presents a significant leap forward in streaming Free-Viewpoint Video reconstruction, directly tackling the core limitations of existing methods. The shift from per-frame optimization to a generalized inference network (AGM-Net) is a truly innovative and practical solution for achieving real-time performance. The Key-frame-guided Streaming strategy is a clever mechanism to prevent error accumulation without sacrificing the speed gains of the generalized network, providing a periodic "reset" and adaptation capability.

Strengths and Innovations:

  • Paradigm Shift: Moving from iterative per-frame optimization to a single-inference generalized motion network is a fundamental change that unlocks unprecedented speed for streaming 3DGS models. This addresses a critical bottleneck for real-world applications.
  • Robustness to Error Accumulation: The key-frame strategy is a well-designed solution to a persistent problem in sequential processing, enabling IGS to maintain quality over long video sequences.
  • Generalizability: The strong cross-domain performance demonstrates that the learned motion model is not just memorizing training data but capturing fundamental dynamic properties, which is crucial for practical deployment.
  • Comprehensive Evaluation: The paper includes detailed in-domain, cross-domain, and ablation studies, providing solid evidence for its claims. The re-evaluation of 3DGStream under the same conditions (3DGStream†) ensures a fair comparison.

Potential Issues and Areas for Improvement:

  • Jitter in Static Backgrounds: While acknowledged, the frame jittering in static areas (Figure E1) is a noticeable visual artifact. Explicit temporal regularization or motion segmentation could be key to resolving this. The current approach's sensitivity to initial floaters (Figure F3) suggests that improvements in base 3DGS initialization for sparse views would have a cascading positive effect.
  • Fluctuation in PSNR: The observed fluctuations in PSNR (Figure 3) suggest that while error accumulation is mitigated, the frame-to-frame consistency could still be improved. This points back to the need for explicit temporal modeling rather than treating each frame's motion prediction as largely independent from the wider temporal context beyond the previous frame.
  • Training Data Scale: The reliance on N3DV for training, while effective, underscores the common challenge in generalizable models: the breadth of generalization is inherently tied to the diversity and scale of the training data. The suggestion to use view synthesis loss for large-scale training is promising, as it avoids costly manual annotations.

Applicability and Future Value: The methods and conclusions of IGS have immense potential for real-time interactive 3D applications. Industries such as VR/AR, telepresence, live broadcasting, and robotics could directly benefit from rapid and high-quality streaming of dynamic 3D environments. The generalized motion network could potentially be adapted for other types of 3D representations or even serve as a foundation for predictive modeling in dynamic scenes. The framework's modularity, separating generalized motion inference from periodic refinement, could inspire further research into hybrid approaches for other challenging sequential prediction tasks.
