Pretraining Frame Preservation in Autoregressive Video Memory Compression
TL;DR Summary
This paper introduces a neural architecture for compressing long videos into short contexts, focusing on frame preservation. It enables autoregressive video generation with high detail retention and consistency using only about 5k tokens.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Pretraining Frame Preservation in Autoregressive Video Memory Compression
1.2. Authors
Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Affiliations include Stanford University, MIT, Carnegie Mellon University, and HKUST.
1.3. Journal/Conference
This paper is a preprint currently hosted on arXiv (arXiv:2512.23851). Given the authors' affiliations and the quality of the work, it is likely targeted for a major computer vision or machine learning conference (e.g., CVPR, ICCV, or NeurIPS).
1.4. Publication Year
Published on December 28, 2025 (UTC).
1.5. Abstract
The paper introduces a neural network architecture designed to compress long video sequences into short, manageable contexts for autoregressive video generation. The core innovation is a pretraining objective focused on "frame preservation," which ensures that high-frequency details from any random frame in the history can be retrieved. By first pretraining a memory encoder on this retrieval task, the authors can then fine-tune it as a memory component for Diffusion Transformers (DiTs). This approach allows models to process 20-second histories using only about 5k tokens, maintaining high consistency and perceptual quality while significantly reducing computational costs.
1.6. Original Source Link
- PDF Link: https://arxiv.org/pdf/2512.23851.pdf
- Code Repository: https://github.com/lllyasviel/PFP
2. Executive Summary
2.1. Background & Motivation
Generating long, coherent videos (e.g., movies or long-form stories) is a significant challenge in AI. Most current models use an autoregressive approach, where the model predicts the next chunk of video based on the previous history. However, there is a fundamental trade-off:
- Context Length: Keeping every frame in memory quickly becomes computationally infeasible due to GPU memory limits (the quadratic complexity of attention).
- Context Quality: Compressing the history (e.g., by skipping frames or downsampling) often loses fine details, leading to "drifting," where characters' faces or clothes change over time.
The researchers identified a gap: existing compression methods do not explicitly prioritize the preservation of fine details across the entire temporal span. They aimed to create a "white-box" compression model where the ability to reconstruct any past frame serves as a direct proxy for the quality of the memory.
2.2. Main Contributions / Findings
- Frame Retrieval Pretraining: Proposed a new pretraining task where a memory encoder must compress a 20-second video so that any random frame can be reconstructed with high fidelity.
- Lightweight Memory Encoder: Developed an architecture using 3D convolutions and attention that bypasses typical bottlenecks (like VAE channel limits) to output directly into the latent space of the generator.
- Efficient Autoregressive System: Demonstrated that a pretrained memory encoder allows for fine-tuning long-video models with much lower computational overhead.
- Practical Scaling: Achieved the compression of 20 seconds of video into a ~5k context length, enabling long-history processing on consumer-grade hardware (like an RTX 4070).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Autoregressive Video Generation
In this paradigm, a video is generated sequentially. To generate the next segment $x_{k+1}$, we condition the model on all previous segments $x_1, \dots, x_k$ (the history). Mathematically: $p(x_1, \dots, x_K) = \prod_{k=0}^{K-1} p(x_{k+1} \mid x_1, \dots, x_k)$.
3.1.2. Diffusion Transformers (DiT)
A Diffusion Transformer is a model that generates images or videos by starting with pure noise and gradually "denoising" it into a clear image. Unlike older models that used 2D grids (CNNs), DiTs treat video patches as "tokens" in a sequence, similar to how ChatGPT treats words.
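To make the token view concrete, here is a minimal PyTorch-style sketch (tensor shapes and patch sizes are assumptions, not the paper's settings) of how a video latent can be cut into spatio-temporal patch tokens. The token count, and hence the quadratic attention cost, grows with the number of frames, which is why long histories are expensive for a DiT.

```python
import torch

def patchify_video(latent, patch=(1, 2, 2)):
    """Split a video latent [C, T, H, W] into a sequence of patch tokens.

    Each token flattens a (pt x ph x pw) spatio-temporal patch, so the
    context length is (T/pt) * (H/ph) * (W/pw).
    """
    C, T, H, W = latent.shape
    pt, ph, pw = patch
    x = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # -> [num_tokens, token_dim]
    tokens = x.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * pt * ph * pw)
    return tokens

latent = torch.randn(16, 20, 60, 104)   # hypothetical 16-channel VAE latent
print(patchify_video(latent).shape)     # token count grows with T, H, and W
```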
3.1.3. Latent Space & VAEs
Raw video pixels are massive. Most models first use a Variational Autoencoder (VAE) to compress pixels into a smaller, abstract "latent" representation. For example, a high-resolution frame might be compressed into a latent grid that is many times smaller in each spatial dimension. The DiT works in this "latent space" to save memory.
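A back-of-the-envelope calculation illustrates the savings. The numbers below assume typical video-VAE compression factors (8× spatial, 4× temporal, 16 latent channels); they are common settings, not the paper's exact configuration.

```python
# Rough arithmetic for a hypothetical video VAE: 20 s of 480x832 video at
# 30 fps, compressed 8x spatially and 4x temporally into 16 latent channels.
frames, height, width = 20 * 30, 480, 832
lt, lh, lw = frames // 4, height // 8, width // 8
pixels = frames * height * width * 3        # raw RGB values
latents = lt * lh * lw * 16                 # latent values
print(f"pixel values: {pixels:,}  latent values: {latents:,}  "
      f"ratio: {pixels / latents:.0f}x smaller")
```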
3.1.4. Rectified Flow Matching
This is a specific training framework for diffusion models. It learns a straight-line path that transforms noise $\epsilon$ into a clean image $x_0$. The noisy version at time $t$ is calculated as $x_t = (1 - t)\,x_0 + t\,\epsilon$, where $t \in [0, 1]$ is the timestep.
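A minimal sketch of a rectified-flow training step, assuming a generic PyTorch `model(x_t, t, cond)` that predicts the velocity of the straight-line path (this is an illustration of the framework, not the paper's training code):

```python
import torch

def rectified_flow_loss(model, x0, t, cond):
    """Minimal rectified-flow training step (illustrative sketch).

    x_t is a straight-line interpolation between the clean sample x0 and
    Gaussian noise; the model regresses the constant velocity (eps - x0).
    """
    eps = torch.randn_like(x0)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))   # broadcast timestep over dims
    x_t = (1.0 - t_) * x0 + t_ * eps           # straight-line path
    v_target = eps - x0                        # velocity of that path
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)
```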
3.2. Previous Works
- Sliding Windows: Simple methods that only look at the most recent frames (e.g., the last 2 seconds) and forget everything else. This causes long-term inconsistency.
- Token Merging (ToMe): Methods that combine similar tokens to reduce context length but often blur fine details like facial features.
- FramePack: An earlier method by some of the same authors that "packs" frames into a fixed size, but this paper argues it loses too much "high-frequency" (fine-grained) detail.
3.3. Differentiation Analysis
The core difference is the Pretraining Objective. While others train the compressor and the generator simultaneously, this paper argues for an independent "pretraining for retrieval" phase. This ensures the compressor is "detail-aware" before it ever tries to help generate new frames.
4. Methodology
4.1. Principles
The intuition is simple: if a compressed representation contains enough information to reconstruct any specific frame from the past, it also contains enough information to keep future frames consistent with that past.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Stage 1: Pretraining for Frame Retrieval
The goal is to train a compression function $\phi$ that takes a long history $H$ and turns it into a compact context $\phi(H)$. To test this, the model tries to retrieve a randomly chosen subset of frames using only the compressed context.
The following figure (Figure 2 from the original paper) illustrates this pretraining process:

The Procedural Steps for Pretraining:
- Sample History: Take a 20-second video $H$.
- Compress: Pass $H$ through the encoder $\phi$ to get the compressed tokens $\phi(H)$.
- Random Masking: Select a random set of frame indices $\Omega$; these frames become the retrieval targets and are corrupted with noise, while the full history is available to the model only through the compressed context $\phi(H)$.
- Reconstruction Objective: The model tries to denoise the noisy frames at indices $\Omega$ using the information stored in $\phi(H)$. The loss function is:
L = \mathbb{E}_{H, \Omega, c, \epsilon, t_i} \left\| (\epsilon - H_{\Omega}) - G_{\theta} \left( (H_{\Omega})_{t_i}, t_i, c, \phi(H) \right) \right\|_2^2
- $H_{\Omega}$: The original clean frames at the selected positions.
- $\epsilon$: Random Gaussian noise.
- $G_{\theta}$: A trainable video diffusion model (e.g., Wan or HunyuanVideo).
- $t_i$: The diffusion timestep.
- $c$: The conditioning signal (e.g., the text prompt).
- $\phi(H)$: The compressed context providing the "memory."
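To make the objective concrete, here is a minimal PyTorch-style sketch of one pretraining step under the notation above. `phi` and `G` stand in for the memory encoder and the video diffusion model, tensor shapes and the batched index handling are assumptions, and the noising follows the rectified-flow interpolation from Section 3.1.4.

```python
import torch

def frame_retrieval_pretrain_step(phi, G, H, c, num_targets=4):
    """One training step of the frame-retrieval objective (sketch).

    phi : memory encoder, maps the full history H to a compact context phi(H)
    G   : video diffusion model conditioned on (noisy frames, t, c, phi(H))
    H   : history latents [B, T, C, h, w]; c : text conditioning
    """
    B, T = H.shape[:2]
    ctx = phi(H)                                        # compressed memory
    omega = torch.randperm(T, device=H.device)[:num_targets]
    H_omega = H[:, omega]                               # frames to retrieve
    eps = torch.randn_like(H_omega)
    t = torch.rand(B, device=H.device)
    t_ = t.view(B, 1, 1, 1, 1)
    H_omega_t = (1.0 - t_) * H_omega + t_ * eps         # noised target frames
    v_pred = G(H_omega_t, t, c, ctx)
    return torch.mean((v_pred - (eps - H_omega)) ** 2)  # flow-matching loss
```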
4.2.2. Network Architecture
The encoder is designed to be lightweight. It uses a dual-branch approach:
- Low-Resolution (LR) Branch: Processes a downsampled version of the video to capture global motion and scene structure.
- High-Resolution (HR) Branch: Processes the original resolution to extract "residual enhancing vectors" (fine details like textures).
The architecture is shown in the following figure (Figure 3 from the original paper):

Layer Breakdown:
- 3D Convolutions: Used to reduce the spatial and temporal dimensions. For example, a 4×4×2 rate means the width and height are reduced by 4×, and the time (number of frames) by 2×.
- Feature Projection: Instead of going through the narrow 16-channel VAE bottleneck, the encoder outputs directly into the DiT's inner channel dimension (e.g., 3072 or 5120 channels), preserving more data.
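The following is an illustrative PyTorch sketch of such a dual-branch design. Module names, channel widths, pooling factors, and strides are assumptions rather than the released architecture; the point is to show both branches being flattened into tokens and projected straight into the DiT's inner channel width instead of the narrow VAE channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchMemoryEncoder(nn.Module):
    """Illustrative dual-branch memory encoder (sketch, not the released model)."""

    def __init__(self, in_ch=16, dit_dim=3072):
        super().__init__()
        # 4x4x2 compression: stride (time, height, width) = (2, 4, 4)
        self.lr_branch = nn.Conv3d(in_ch, 256, kernel_size=(2, 4, 4), stride=(2, 4, 4))
        self.hr_branch = nn.Conv3d(in_ch, 256, kernel_size=(2, 4, 4), stride=(2, 4, 4))
        self.to_dit = nn.Linear(256, dit_dim)    # project into DiT channel width

    def forward(self, hist_latent):              # [B, C, T, H, W] history latent
        # LR branch: coarse, spatially pooled view for global structure
        lr_in = F.avg_pool3d(hist_latent, kernel_size=(1, 2, 2))
        lr_tokens = self.lr_branch(lr_in).flatten(2).transpose(1, 2)
        # HR branch: full-resolution input for residual detail vectors
        hr_tokens = self.hr_branch(hist_latent).flatten(2).transpose(1, 2)
        tokens = torch.cat([lr_tokens, hr_tokens], dim=1)   # concat along tokens
        return self.to_dit(tokens)                           # [B, N, dit_dim]

enc = DualBranchMemoryEncoder()
ctx = enc(torch.randn(1, 16, 40, 64, 112))       # hypothetical history latent shape
print(ctx.shape)                                 # [1, N_tokens, 3072]
```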
4.2.3. Stage 2: Fine-tuning for Autoregressive Generation
Once the encoder is pretrained, it is frozen or fine-tuned alongside the main video generator to produce new frames.
The following figure (Figure 4 from the original paper) shows the transition to fine-tuning:

The Generation Flow:
- Input: The current noise sample, the text prompt $c$, and the compressed history $\phi(H)$.
- Diffusion Step: The model denoises the next chunk of frames, conditioned on the timestep, the prompt $c$, and the compressed context $\phi(H)$, mirroring the pretraining formulation.
- Iterative Concatenation: The newly generated frames are appended to the history $H$, which is then re-compressed by $\phi$ for the next step (see the loop sketch below).
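The loop sketch below illustrates this flow. `phi` (the pretrained memory encoder), `G` (the video DiT), and `sampler` (a multi-step denoising routine) are hypothetical callables; this is a schematic of the re-compression loop, not the authors' released inference code.

```python
import torch

@torch.no_grad()
def generate_long_video(phi, G, sampler, prompt, first_chunk, num_chunks=10):
    """Autoregressive long-video generation loop (sketch).

    At every step the entire history is re-compressed by the pretrained
    memory encoder phi, so the generator G always sees a roughly constant,
    compact context instead of the full frame history.
    """
    history = [first_chunk]                    # list of latent chunks
    for _ in range(num_chunks):
        H = torch.cat(history, dim=1)          # [B, T_so_far, C, h, w]
        ctx = phi(H)                           # compact memory context
        new_chunk = sampler(G, prompt, ctx)    # denoise the next chunk from noise
        history.append(new_chunk)
    return torch.cat(history, dim=1)
```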
5. Experimental Setup
5.1. Datasets
- Size: 5 million internet videos.
- Content: A mix of horizontal (widescreen) and vertical (Shorts-style) videos.
- Annotations: Captioned using Gemini-2.5-flash in a "storyboard" format (descriptions with timestamps).
- Example: A video of a grandmother petting a cat would have captions like: "0s: Woman stands by shelf," "12s: Woman pets cat," "22s: Woman sits down."
5.2. Evaluation Metrics
5.2.1. PSNR (Peak Signal-to-Noise Ratio)
- Conceptual Definition: Measures the ratio between the maximum possible power of a signal and the power of corrupting noise. Higher is better.
- Mathematical Formula:
PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right)
- Symbol Explanation: $MAX_I$ is the maximum pixel value (e.g., 255); MSE is the Mean Squared Error between the original and reconstructed frame.
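For reference, a straightforward NumPy implementation of this metric (not tied to the paper's evaluation code):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """PSNR in dB between two frames of the same shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                    # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```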
5.2.2. SSIM (Structural Similarity Index Measure)
- Conceptual Definition: Evaluates the perceived change in structural information between two images.
- Mathematical Formula:
SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
- Symbol Explanation: $\mu_x, \mu_y$ are the means; $\sigma_x^2, \sigma_y^2$ are the variances; $\sigma_{xy}$ is the covariance; $C_1, C_2$ are constants that stabilize the division.
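A simplified, single-window NumPy version of this formula is sketched below. Reference implementations (e.g., scikit-image) use a sliding Gaussian window; this global variant only illustrates the math.

```python
import numpy as np

def ssim_global(x, y, max_val=255.0):
    """Simplified whole-image SSIM between two grayscale images (sketch)."""
    C1, C2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den
```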
5.2.3. LPIPS (Learned Perceptual Image Patch Similarity)
- Conceptual Definition: Uses a deep neural network to measure how similar two images look to a human. Lower is better.
5.2.4. Consistency Metrics (Cloth, Identity, Object)
- Conceptual Definition: Use Vision-Language Models (VLMs) like Gemini or LLaVA to answer questions such as: "Is the character wearing the same shirt as in the previous scene?"
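A sketch of how such a VLM-judged score could be computed. Here `query_vlm` is a hypothetical yes/no judge callable; the paper's actual prompts, judge model, and aggregation are not reproduced.

```python
def consistency_score(query_vlm, frames_a, frames_b, questions):
    """VLM-judged consistency metric (sketch with a hypothetical judge).

    `query_vlm(frames_a, frames_b, question)` is assumed to return True/False
    for a yes/no consistency question about the two frame sets.
    """
    votes = [query_vlm(frames_a, frames_b, q) for q in questions]
    return 100.0 * sum(votes) / len(votes)     # percentage of "yes" answers

questions = [
    "Is the character wearing the same shirt as in the previous scene?",
    "Does the character have the same face and hairstyle?",
    "Are the background objects unchanged?",
]
```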
5.3. Baselines
- Large Patchifier: Increases the patch size of the DiT (equivalent to FramePack).
- Only LR: Uses only the low-resolution branch of the encoder.
- Without Pretrain: Trains the system from scratch without the frame-retrieval pretraining phase.
6. Results & Analysis
6.1. Core Results Analysis
The experiments confirm that pretraining is the "secret sauce." Models without pretraining often "hallucinate" new details that don't match the history, while the Proposed method maintains consistent character identity and background details even after 20 seconds.
The following are the results from Table 1 of the original paper, showing the reconstruction quality during the pretraining phase:
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Large Patchifier* (4×4×2) | 12.93 | 0.412 | 0.365 |
| Only LR (4×4×2) | 15.21 | 0.472 | 0.212 |
| Without LR (4×4×2) | 15.73 | 0.423 | 0.198 |
| Proposed (4×4×2) | 17.41 | 0.596 | 0.171 |
| Proposed (2×2×2) | 19.12 | 0.683 | 0.152 |
| Proposed (2×2×1) | 20.19 | 0.705 | 0.121 |
Analysis: The "Proposed (4×4×2)" significantly outperforms the "Large Patchifier" (FramePack) across all metrics, proving that the dual-branch architecture and retrieval objective are superior at preserving details.
The following are the results from Table 2 of the original paper, evaluating the final video consistency:
| Method | Cloth ↑ (Human) | Identity ↑ (Human) | Object ↑ | User Study ELO ↑ |
|---|---|---|---|---|
| WanI2V + QwenEdit (2p) | 95.09 | 68.22 | 91.19 | 1198 |
| Only LR (4×4×2) | 91.98 | 69.22 | 85.32 | 1194 |
| Without Pretrain (4×4×2) | 87.12 | 66.99 | 81.13 | N/A |
| **Proposed (4×4×2)** | **96.12** | **70.73** | **89.89** | **1216** |
| **Proposed (2×2×2)** | **96.71** | **72.12** | **90.27** | **1218** |
Analysis: The proposed method achieves higher ELO scores and better identity/cloth consistency than existing baselines. The difference between "Without Pretrain" (87.12) and "Proposed" (96.12) in cloth consistency is a stark validation of the methodology.
6.2. Visual Comparison
The effect of pretraining is visually apparent. The following figure (Figure 6 from the original paper) shows how the model with pretraining correctly remembers facial features and clothing styles, whereas the model without it creates inconsistent characters:
The image is a schematic showing the effect of pretraining in autoregressive video memory compression. It is divided into three groups: the history (20 seconds), generation with pretraining (proposed), and generation without pretraining, comparing the image features produced in each case.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully demonstrates that frame preservation is a highly effective objective for video memory compression. By training a model to "remember and retrieve" past frames, it naturally learns to encode the specific details (faces, textures, lighting) required for long-term consistency in video generation. This allows for a massive reduction in context length (down to 5k tokens for 20s of video) without sacrificing the "storytelling" quality.
7.2. Limitations & Future Work
- Error Accumulation (Drifting): While improved, the model still faces "drifting" in very long shots (single continuous takes without cuts). The authors suggest that specialized training on "single-shot continuation" is still needed.
- Computational Cost: Pretraining on 5 million videos requires significant resources (a large GPU cluster), even if the final inference is efficient.
- Multi-Modal Integration: Future work could explore integrating audio or more complex storyboard instructions into the compression space.
7.3. Personal Insights & Critique
Innovation: The move from "implicit memory" (just training on next-frame prediction) to "explicit retrieval" is a brilliant step towards more explainable AI. It allows researchers to verify that the memory works before running the expensive generation phase.
Application: This technology is a massive win for the "indie creator" community. Being able to run long-context video generation on a 12GB GPU (RTX 4070) democratizes high-quality video storytelling, which was previously the domain of companies with massive server farms.
Critique: The paper relies heavily on VLMs (Gemini) for evaluation. While VLMs are getting better, they can have their own biases. Supplementing this with more traditional geometric or optical flow metrics for all experiments would have added another layer of rigor.