Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
TL;DR Summary
Live Avatar is an algorithm-system co-designed framework for high-fidelity, infinite-length audio-driven avatar generation. It employs a 14-billion-parameter diffusion model and introduces a timestep-forcing pipeline for low-latency streaming, enhancing temporal consistency and mitigating identity drift and color artifacts in long-form generation.
Abstract
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
In-depth Reading
1. Bibliographic Information
1.1. Title
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
1.2. Authors
The authors of this paper are Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, and Steven Hoi. The affiliations listed connect the researchers to major technology companies and universities, including Alibaba Group, University of Science and Technology of China (USTC), Beijing University of Posts and Telecommunications (BUPT), and Zhejiang University. This collaboration between industry and academia often signifies research that is both theoretically grounded and aimed at practical, large-scale applications.
1.3. Journal/Conference
The paper was submitted to arXiv, an open-access repository of electronic preprints. As a preprint, it has not yet undergone formal peer review for publication in a journal or conference. However, given the topic and the scale of the work, it is likely intended for submission to a top-tier conference in computer vision or artificial intelligence, such as CVPR, ICCV, or NeurIPS.
1.4. Publication Year
The version listed on arXiv is dated December 4, 2025, which is consistent with the arXiv identifier 2512.04677 (a December 2025 submission). The paper cites works from 2024 and 2025, confirming that it is very recent research.
1.5. Abstract
The abstract introduces Live Avatar, an "algorithm-system co-designed framework" for generating high-fidelity, real-time, and infinitely long audio-driven talking avatars. The authors identify two main obstacles for existing diffusion-based methods: sequential computation, which causes high latency, and long-horizon inconsistency, which leads to visual artifacts over time. To solve these, Live Avatar introduces three key innovations:
- Timestep-forcing Pipeline Parallelism (TPP): A distributed inference method that pipelines the denoising steps of a diffusion model across multiple GPUs to achieve low-latency, real-time streaming.
- Rolling Sink Frame Mechanism (RSFM): An algorithmic approach to maintain identity and color consistency by dynamically using a cached reference frame.
- Self-Forcing Distribution Matching Distillation: A technique to adapt a large, 14-billion-parameter model for efficient, causal (streamable) generation.

The paper reports state-of-the-art performance, achieving 20 frames per second (FPS) on 5 H800 GPUs, and claims to be the first to enable practical, real-time generation with a model of this scale.
1.6. Original Source Link
- Original Source: https://arxiv.org/abs/2512.04677
- PDF Link: https://arxiv.org/pdf/2512.04677v1.pdf
- Publication Status: This is a preprint on arXiv and has not yet been peer-reviewed.
2. Executive Summary
2.1. Background & Motivation
The creation of photorealistic digital avatars that can talk and react in real-time is a cornerstone technology for applications like virtual reality, live streaming, and digital assistants. Recent advancements in diffusion models have enabled the generation of extremely high-quality videos, but they come with significant drawbacks that hinder their use in live, interactive settings.
The paper identifies two fundamental and conflicting challenges:
- The Real-Time Fidelity Dilemma: Large-scale diffusion models (e.g., 14 billion parameters) produce stunning visual quality but are computationally intensive. The core of a diffusion model is a sequential denoising process, where each step depends on the previous one. This autoregressive nature makes inference inherently slow, leading to low frame rates that are unacceptable for real-time interaction (which typically requires ≥ 20 FPS).
- Long-Horizon Consistency: For applications requiring avatars to operate for extended periods (minutes or even hours), it is crucial to maintain a consistent identity and appearance. Existing methods often suffer from identity drift (the face slowly changes), color shifts, or other distracting visual artifacts that accumulate over time.

The motivation behind Live Avatar is to bridge this gap by creating a framework that allows a massive, high-fidelity diffusion model to run in real-time for infinite-length streams without sacrificing visual quality or consistency. The paper's innovative entry point is a co-design approach, tackling the problem from both the algorithm (how the model is trained and guided) and the system (how inference is executed on hardware) perspectives.
2.2. Main Contributions / Findings
The paper presents Live Avatar, a framework that makes several key contributions to solve the aforementioned problems:
- Timestep-forcing Pipeline Parallelism (TPP): This is the core system-level innovation. Instead of running the sequential denoising steps on one GPU, TPP assigns each step (or a small group of steps) to a different GPU. This transforms the sequential process into a parallel pipeline, where the system's throughput is limited only by the time for a single step, not the total time for all steps. This is the key to achieving real-time speeds with a large model.
- Rolling Sink Frame Mechanism (RSFM): This is the primary algorithmic solution for long-term consistency. It dynamically recalibrates the avatar's appearance by using a reference "sink" frame. This mechanism effectively suppresses the accumulation of errors, preventing identity drift and color artifacts during long generation sessions.
- Causal, Streamable Model Adaptation: The paper uses Self-Forcing Distribution Matching Distillation to "distill" a powerful, non-causal (bidirectional) teacher model into a student model that is causal (can generate frame-by-frame) and requires only a few denoising steps. This makes the model efficient enough to be deployed on the TPP inference engine. The authors also inject noise into the model's memory (KV cache) during training to make it more robust for long-term generation.
- State-of-the-Art Performance: The culmination of these contributions is a system that can generate audio-driven avatar video from a 14-billion-parameter model at 20 FPS using 5 H800 GPUs. The authors claim this is the first practical demonstration of real-time, high-fidelity, and infinite-length avatar generation at such a scale.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Diffusion Models
Diffusion models are a class of generative models that create data (like images or videos) by reversing a gradual noising process. The core idea can be understood in two stages:
- Forward Process: You start with a clean data sample (e.g., an image) and slowly add a small amount of Gaussian noise over many steps. After enough steps, the image becomes pure, indistinguishable noise. This process is fixed and mathematically defined.
- Reverse Process: The model learns to reverse this process. It takes a noisy image and predicts the noise that was added to it (or equivalently, predicts the slightly less noisy image). By repeatedly applying this denoising step, starting from pure random noise, the model can generate a brand new, clean image. The paper uses Flow Matching, a more recent and often more efficient formulation. Instead of a discrete noising process, it defines a continuous "flow" or path from a noise distribution to the data distribution. The model is trained to predict the velocity vector that moves a point along this path toward the clean data.
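To make the velocity-prediction idea concrete, here is a minimal, hypothetical flow-matching training step. The `model(x_t, t, cond)` interface and the straight-line noise-to-data path are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Minimal flow-matching sketch (hypothetical `model` predicting velocity).

    One common convention: draw a straight path x_t = (1 - t) * x0 + t * x1
    between noise x0 and clean data x1; the target velocity along it is (x1 - x0).
    """
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # one time per sample in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dimensions
    xt = (1.0 - t_) * x0 + t_ * x1                 # point on the noise->data path
    v_target = x1 - x0                             # constant velocity of the straight line
    v_pred = model(xt, t, cond)                    # the network predicts that velocity
    return torch.mean((v_pred - v_target) ** 2)
```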
3.1.2. Latent Diffusion Models
Training diffusion models directly on high-resolution images or videos is extremely computationally expensive. Latent Diffusion Models solve this by working in a compressed latent space.
- An encoder, typically part of a Variational Autoencoder (VAE), compresses the high-resolution video into a much smaller latent representation. This latent data captures the essential information but is cheaper to process.
- The diffusion model then performs its noising and denoising process entirely within this compact latent space.
- Finally, a decoder (from the same VAE) takes the denoised latent representation and reconstructs it back into a high-resolution video frame. The paper uses a causal 3D VAE, which means when encoding a frame, it only uses information from past and current frames, not future ones, which is crucial for streaming applications.
3.1.3. Autoregressive Models
Autoregressive (AR) models generate sequences one element at a time, where each new element is conditioned on the previously generated ones. For example, to generate a sentence, an AR model would generate the first word, then generate the second word based on the first, the third based on the first two, and so on. In this paper, video is generated in blocks of frames, where each new block is generated based on the previously generated blocks.
3.1.4. Model Distillation and Distribution Matching Distillation (DMD)
Model Distillation is a technique to transfer the knowledge from a large, powerful but slow "teacher" model to a smaller, faster "student" model. The goal is for the student to mimic the teacher's output behavior but with much lower computational cost. Distribution Matching Distillation (DMD) is a specific distillation method for diffusion models. The goal is to train a student model that can generate a high-quality sample in just one or a few steps. It works by making the distribution of the student's generated samples match the distribution of the teacher's generated samples (which are considered high-quality "ground truth"). This is often framed as an adversarial game where the student tries to generate samples that can "fool" a score function derived from the teacher. This allows for massive speedups in inference time.
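As a rough illustration (not the paper's exact objective), a DMD-style generator update can be sketched as below. `real_score` and `fake_score` are assumed callables returning denoising/score estimates, and the re-noising scheme is deliberately simplified.

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(x_gen, real_score, fake_score):
    """Hedged sketch of a Distribution Matching Distillation update.

    `real_score` is the frozen teacher; `fake_score` tracks the student's output
    distribution (both names are assumptions). The student's sample is re-noised,
    and the difference of the two score estimates gives the direction that pulls
    the student's distribution toward the teacher's.
    """
    t = torch.rand(x_gen.shape[0], device=x_gen.device).view(-1, *([1] * (x_gen.dim() - 1)))
    x_t = (1.0 - t) * x_gen + t * torch.randn_like(x_gen)   # simplified re-noising
    with torch.no_grad():
        grad = fake_score(x_t, t) - real_score(x_t, t)      # distribution-matching direction
    target = (x_gen - grad).detach()
    # MSE trick: the gradient of this loss with respect to x_gen is exactly `grad`.
    return 0.5 * F.mse_loss(x_gen, target, reduction="sum")
```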
3.2. Previous Works
3.2.1. Streaming and Long Video Generation
- CausVid and Self-Forcing: These works pioneered the use of distillation for streaming video generation. They address the "train-inference mismatch," where a model trained on entire video clips fails when used to generate autoregressively (one frame/block at a time). Self-Forcing trains the model to condition on its own previously generated frames, making it more robust to error accumulation. Live Avatar builds directly on this paradigm.
- LongLive and StreamDiT: These methods introduced techniques like KV-recache (managing memory of past frames) and windowed attention to enable longer video generation. Live Avatar also uses a KV cache but combines it with its unique TPP and RSFM mechanisms.
- Rolling Forcing: This work proposed a rolling window and attention sinks to improve temporal coherence, which is philosophically similar to Live Avatar's Rolling Sink Frame Mechanism, though the technical implementation differs.
3.2.2. Audio-driven Avatar Video Generation
- GAN-based methods (Wav2Lip): Early works used Generative Adversarial Networks (GANs) to achieve accurate lip-syncing but often struggled with overall video quality and naturalness.
- 3D-based methods (SadTalker): These methods first predict 3D facial motion from audio and then render a video, allowing for more control over head pose and expression but sometimes lacking photorealism.
- Diffusion-based methods (StableAvatar, OmniAvatar, WanS2V): Recent state-of-the-art methods use diffusion models to achieve high-fidelity video. WanS2V, a 14B-parameter model, is a direct predecessor to Live Avatar, which uses its architecture as a starting point. However, these models are typically very slow and not designed for real-time streaming or infinite-length generation. OmniAvatar and WanS2V can generate high-quality clips but suffer from performance degradation in long videos.
3.3. Technological Evolution
The field of audio-driven avatar generation has evolved from focusing solely on lip-sync accuracy (GANs) to incorporating head motion and expression (3D models), and finally to achieving high photorealism with diffusion models. The current frontier, which Live Avatar addresses, is not just about quality but about practical deployment. This means tackling the challenges of real-time latency and long-term stability, which are essential for interactive applications. Live Avatar represents a shift from "can we generate a good 10-second clip?" to "can we run a high-quality avatar in a live-stream for hours?".
3.4. Differentiation Analysis
Compared to prior work, Live Avatar's main innovation is the holistic co-design of algorithms and systems.
- While methods like CausVid and LongLive achieved streaming, they typically used much smaller models (e.g., 1.3B parameters) and still did not achieve the real-time performance of Live Avatar with a massive 14B model.
- While models like WanS2V and OmniAvatar have high fidelity, they are not streamable or real-time and suffer from long-horizon degradation.
- Live Avatar is the first to combine pipeline parallelism (TPP) specifically for diffusion models with advanced algorithmic techniques (RSFM, Self-Forcing Distillation) to overcome both the latency and consistency barriers simultaneously. TPP is a unique system-level contribution not seen in previous audio-driven avatar papers.

The following table (Table 1 of the original paper) summarizes this differentiation:
| Method | stream | real time | inf-len | size |
|---|---|---|---|---|
| Hallo3 [6] |  |  |  | 5B |
| StableAvatar [38] | ✗ |  |  | 1.3B |
| Wan-s2v [15] |  |  |  | 14B |
| Ditto [24] | ✗ |  |  | 0.2B |
| InfiniteTalk [47] |  | ✗ | ✓ | 14B |
| OmniAvatar [13] | ✗ | ✗ | ✗ | 14B |
| Live Avatar (ours) | ✓ | ✓ | ✓ | 14B |
This table clearly shows that Live Avatar is the only method that simultaneously achieves streaming, real-time performance, and infinite-length capability with a very large (14B) model.
4. Methodology
4.1. Principles
The core principle of Live Avatar is to resolve the tension between three competing goals: high visual fidelity, which requires large models; low latency, which is needed for real-time interaction; and long-term consistency, required for extended use. The solution is an algorithm-system co-design that tackles these problems on multiple fronts:
- Algorithm: Use model distillation to create a fast, few-step model, and design a mechanism (RSFM) to enforce long-term consistency.
- System: Invent a new parallel inference paradigm (TPP) that fundamentally breaks the sequential bottleneck of diffusion models, enabling real-time execution.
4.2. Core Methodology In-depth
4.2.1. Model Architecture
The model generates video in an autoregressive fashion, creating one block of frames at a time while conditioning on previously generated blocks. This enables streaming generation. The core generation step for a single block is defined by the following formula:
$ B_{t-1}^{i} = v_{\theta} (B_{t}^{i}, B_{t}^{(i-w):(i-1)}, I, a^{i}, t^{i}) $
Let's break down this formula:
- $v_{\theta}$: The neural network (the diffusion model), which is trained to predict the velocity that transforms a noisy latent back towards a clean one.
- $B_{t}^{i}$: The $i$-th block of frame latents at denoising step $t$. It is the primary input that the model will denoise.
- $B_{t-1}^{i}$: The output of the model, the slightly less noisy version of the block at denoising step $t-1$.
- $B_{t}^{(i-w):(i-1)}$: The historical context. It represents the noisy frame latents of the previous $w$ blocks (from block $i-w$ to $i-1$) at the same denoising step $t$. This is a crucial design choice that enables the TPP system, as it means the history required for denoising a block is at the same noise level and does not require waiting for previous blocks to be fully denoised.
- $I$: The Rolling Sink Frame, a special reference frame that provides a stable identity and appearance cue throughout the generation process.
- $a^{i}$: The audio embedding corresponding to the $i$-th block.
- $t^{i}$: The text prompt embedding for the $i$-th block (if any).

In short, to denoise the current block ($B_{t}^{i}$), the model looks at the noisy history of recent blocks ($B_{t}^{(i-w):(i-1)}$) and a stable identity reference ($I$), guided by the current audio ($a^{i}$).
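A minimal sketch of one such denoising update is shown below; the function signature, the Euler-style step, and the sign convention are assumptions made for illustration, mirroring the formula rather than reproducing the paper's implementation.

```python
import torch

@torch.no_grad()
def denoise_block(v_theta, noisy_block, history_blocks, sink_frame, audio_emb, text_emb,
                  t: float, dt: float):
    """One denoising update for block i, mirroring
    B_{t-1}^i = v_theta(B_t^i, B_t^{(i-w):(i-1)}, I, a^i, t^i).

    `v_theta` is the causal generator: it sees the current noisy block, the noisy
    history at the same noise level, the cached sink frame, and the audio/text
    conditions, and predicts a velocity that is integrated for one Euler step.
    """
    velocity = v_theta(noisy_block, history_blocks, sink_frame, audio_emb, text_emb, t)
    return noisy_block - dt * velocity  # one step along the predicted flow toward clean data
```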
4.2.2. Model Training
The training process consists of two stages, as illustrated in Figure 2.

Stage 1: Diffusion Forcing Pretraining
The first stage prepares the model for the more complex distillation stage. The goal is to pre-train a base model with a causal structure. This is done by applying a causal attention mask. Within a block of frames, attention is all-to-all (intra-block full attention). However, between blocks, a block can only attend to past blocks, not future ones (inter-block causal attention). This forces the model to learn a left-to-right, streamable structure.
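A small sketch of such a block-causal mask is given below; the token layout and the boolean convention (True = attention allowed) are assumptions, and the real model applies the mask inside its attention layers.

```python
import torch

def block_causal_mask(num_blocks: int, block_len: int) -> torch.Tensor:
    """Boolean mask: full attention inside a block, causal attention across blocks."""
    n = num_blocks * block_len
    q_block = torch.arange(n).unsqueeze(1) // block_len  # block index of each query token
    k_block = torch.arange(n).unsqueeze(0) // block_len  # block index of each key token
    return k_block <= q_block  # keys in the same or earlier blocks are visible

# block_causal_mask(3, 2): tokens in block 1 attend to blocks 0 and 1, never to block 2.
```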
Stage 2: Self-Forcing Distribution Matching Distillation
This stage distills the knowledge of a powerful teacher model into the causal student model from Stage 1, making it a fast, few-step generator. This process involves three models:
- Real Score Model: A fixed, pre-trained, bidirectional (non-causal) teacher model. It represents the ideal target distribution of high-quality video.
- Fake Score Model: A model identical in architecture to the teacher, but trained to track the distribution of the videos generated by the student.
- Causal Generator (Student): The model from Stage 1, which is being optimized to generate high-quality video in just a few steps.
The training loop works as follows:
- The causal generator produces a long video sequence autoregressively, one block at a time.
- This generated sequence is fed into both the Real Score Model and the Fake Score Model to calculate a DMD loss. This loss pushes the student's output distribution to match the teacher's.
- To improve robustness for long-term generation, a technique called History Corrupt is used. During training, random noise is injected into the Key-Value (KV) cache of historical frames. The KV cache is a memory mechanism in attention layers that stores representations of past tokens (here, past frame blocks). By corrupting this memory, the model is forced to be less reliant on a perfect history and to lean more on the stable identity information provided by the sink frame. This prevents the model from being overly sensitive to small errors that might accumulate during long-term streaming.
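A hedged sketch of the History Corrupt idea follows; the cache layout (a list of per-layer key/value tensors), the noise scale, and the corruption probability are all assumptions for illustration.

```python
import torch

def corrupt_history_kv(kv_cache, noise_std=0.1, p=0.5):
    """With probability `p`, add Gaussian noise to the cached keys/values of past
    blocks during training, so the student does not over-trust its own (possibly
    imperfect) history and leans more on the sink frame for identity."""
    corrupted = []
    for k, v in kv_cache:                      # one (key, value) pair per attention layer
        if torch.rand(()) < p:
            k = k + noise_std * torch.randn_like(k)
            v = v + noise_std * torch.randn_like(v)
        corrupted.append((k, v))
    return corrupted
```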
4.2.3. Timestep-forcing Pipeline Parallelism (TPP)
This is the system-level innovation for achieving real-time inference. Standard diffusion inference is sequential: denoising step $T$ must finish before step $T-1$ can start, and so on. This creates a latency bottleneck equal to the time per step multiplied by the number of steps. TPP breaks this bottleneck.
The process, illustrated in Figure 3, works as follows:
- Setup: For a diffusion process with $N$ denoising steps, the system uses $N$ GPUs (or $N+1$ if a dedicated VAE decoder GPU is included). Each GPU is assigned a single, fixed timestep; for example, GPU 1 always handles the first (noisiest) step, GPU 2 the second step, and so on. In the paper's case, 4 denoising steps are used, so 4 GPUs run the diffusion model.
- Warm-up: The first block of frames is generated. It passes sequentially through GPU 1, then GPU 2, GPU 3, and GPU 4. This "fills the pipeline."
- Streaming Phase: Once the pipeline is full, the system achieves maximum throughput. As soon as GPU 1 finishes its step on block $i$ and sends the latent to GPU 2, it can immediately start processing block $i+1$. All GPUs work in parallel on different blocks at different stages of denoising (see the scheduling sketch after the figure note below).
- Efficiency: The frame rate (throughput) is no longer determined by the total time for all steps, but by the latency of the slowest single step in the pipeline. Since communication between GPUs is minimal (only the small latent tensor is passed), the speed is dramatically increased. The paper reports going from ~5 FPS to over 20 FPS.
- VAE Parallelization: The final VAE decoding step (converting the latent back to a pixel-space video) can also be a bottleneck. This is offloaded to a dedicated fifth GPU, ensuring the diffusion pipeline is not held up.
Figure 3 (schematic): Live Avatar's fully pipelined streaming framework, showing how the work of each GPU is arranged across timesteps and the latency incurred at each stage.
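To make the TPP scheduling concrete, the sketch below wires a few stages together with queues. In the real system each stage would be one denoising timestep pinned to its own GPU, plus a final VAE-decode stage; here the stages are plain Python functions and every name is illustrative.

```python
import queue
import threading

def make_stage(stage_fn, in_q, out_q):
    """One pipeline stage: in steady state every stage works on a different block
    at its own fixed denoising timestep, so throughput is governed by the slowest
    single stage rather than by the sum of all steps."""
    def worker():
        while True:
            item = in_q.get()
            if item is None:           # shutdown signal propagates down the pipeline
                out_q.put(None)
                return
            out_q.put(stage_fn(item))  # e.g. one denoising step on one dedicated GPU
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

# 4 denoising stages + 1 VAE-decode stage (each would own a GPU in the real system).
stage_fns = [lambda blk, s=s: f"{blk}|step{s}" for s in range(4)] + [lambda blk: f"decode({blk})"]
queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
threads = [make_stage(fn, queues[i], queues[i + 1]) for i, fn in enumerate(stage_fns)]

for block_id in range(3):              # feed noisy latent blocks into the pipeline front
    queues[0].put(f"block{block_id}")
queues[0].put(None)

while (item := queues[-1].get()) is not None:
    print(item)                        # decoded blocks stream out in generation order
```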
4.2.4. Rolling Sink Frame Mechanism (RSFM) for Long Video Generation
This mechanism is designed to combat identity drift and color shifts during long generation sessions. It consists of two parts:
- Adaptive Attention Sink (AAS): At the beginning of generation, the model uses a provided reference image as the initial "sink frame" for identity. However, this external image might be "out-of-distribution" for the model. To solve this, as soon as the very first block of the video is generated, the model replaces the original reference image with its own first generated frame. This new frame is then used as the persistent sink frame for the rest of the entire generation process. The intuition is that by using a reference drawn from its own learned distribution, the model is less likely to drift towards unrealistic outputs over time.
- Rolling RoPE: The sink frame is cached and used as a reference in the attention mechanism. However, as generation proceeds, the temporal distance between the current frame and the fixed sink frame increases, which can weaken the sink frame's influence. Rolling RoPE addresses this by dynamically adjusting the Rotary Position Embeddings (RoPE). RoPE is a technique for encoding positional information in attention layers. By manipulating the RoPE values, the model can be "tricked" into thinking the sink frame is always at a fixed, close temporal distance to the current frame being generated. This ensures that the identity information from the sink frame remains strong and consistent throughout an infinitely long generation process.
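A minimal sketch of both parts of RSFM is given below; the cache structure, index convention, and sink/history lengths are assumptions used only to illustrate the mechanism.

```python
import torch

def adaptive_sink(cache: dict, first_generated_frame: torch.Tensor) -> dict:
    """Adaptive Attention Sink (sketch): once the first block has been generated,
    its frame replaces the external reference image as the persistent sink."""
    cache["sink_frame"] = first_generated_frame
    return cache

def sink_positions(current_start: int, history_len: int, sink_len: int = 1) -> torch.Tensor:
    """Rolling RoPE (sketch): instead of the sink frame's true, ever more distant
    temporal index, assign it a position at a fixed offset just before the rolling
    history window, so its positional distance to the current block never grows."""
    sink_start = current_start - history_len - sink_len
    return torch.arange(sink_start, sink_start + sink_len)

# The sink's relative distance to the block being generated stays constant over time:
print(sink_positions(current_start=4, history_len=3))     # tensor([0])
print(sink_positions(current_start=4000, history_len=3))  # tensor([3996])
```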
5. Experimental Setup
5.1. Datasets
- AVSpeech: A large-scale public dataset containing thousands of hours of video clips from YouTube, featuring people speaking. It provides synchronized audio and video, making it ideal for training audio-driven models. The authors followed the preprocessing protocol from a prior work (OmniAvatar) and filtered the data to keep only segments longer than 10 seconds, resulting in a final training set of 400,000 samples.
- GenBench: A custom benchmark created by the authors to evaluate the model's out-of-domain (OOD) generalization capabilities, meaning its performance on data types and styles it was not trained on. It was generated using powerful modern generative models (Gemini-2.5 Pro, Qwen-Image, CosyVoice). It includes a wide variety of characters (photorealistic, animated, non-human) and camera shots (frontal, profile, full-body). It has two subsets:
  - GenBench-ShortVideo: 100 samples of ~10 seconds each.
  - GenBench-LongVideo: 15 videos, each over 5 minutes long, to specifically test long-horizon consistency.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate different aspects of the generated videos.
- Frame-level Image Quality:
  - FID (Fréchet Inception Distance): Measures the similarity between the distribution of generated images and real images. A lower FID score indicates higher image quality and realism. It is calculated by comparing statistics (mean and covariance) of features extracted by a pre-trained InceptionV3 network (a numerical sketch is given after this metric list):
$
\mathrm{FID}(x, g) = \left\| \mu_x - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2} \right)
$
    - $\mu_x, \mu_g$: Means of Inception features for real and generated images.
    - $\Sigma_x, \Sigma_g$: Covariance matrices of Inception features.
    - $\mathrm{Tr}$: The trace of a matrix.
- Video Coherence:
  - FVD (Fréchet Video Distance): An extension of FID to videos. It measures the quality and temporal consistency of generated videos compared to real ones. Lower is better.
- Audio-Visual Synchronization:
  - Sync-C (SyncNet Confidence) & Sync-D (SyncNet Distance): These metrics use a pre-trained SyncNet model to measure how well the lip movements in the video match the input audio. Sync-C: higher is better, indicating more confidence in the match. Sync-D: lower is better, indicating a smaller distance (mismatch) between audio and video embeddings.
- Perceptual Quality:
  - IQA (Image Quality Assessment) & ASE (Aesthetics): These metrics use the Q-align model, which is trained to predict human preferences. IQA: a score for overall perceptual quality (higher is better). ASE: a score for aesthetic appeal (higher is better).
- Identity/Semantic Consistency:
  - Dino-S (DINOv2 Similarity): Measures the cosine similarity between feature embeddings of the generated frames and a reference frame, extracted using the DINOv2 model. A higher score indicates better identity preservation.
- Inference Efficiency:
  - FPS (Frames Per Second): The number of video frames generated per second. Higher is better.
  - TTFF (Time-to-First-Frame): The latency from starting the process to getting the very first frame of video. Lower is better, especially for interactive applications.
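As a numerical illustration of the FID formula above, the sketch below computes the distance from precomputed feature statistics; producing those statistics (e.g., from an InceptionV3 feature extractor) is outside this snippet.

```python
import numpy as np
from scipy import linalg

def fid_from_stats(mu_x, sigma_x, mu_g, sigma_g):
    """FID from feature statistics, following
    ||mu_x - mu_g||^2 + Tr(Sigma_x + Sigma_g - 2 (Sigma_x Sigma_g)^{1/2})."""
    diff = mu_x - mu_g
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):                              # drop tiny imaginary residue
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))

# Usage: mu_* and sigma_* are the mean vector and covariance matrix of Inception
# features (e.g., 2048-dim pooled features) over real (x) and generated (g) frames.
```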
5.3. Baselines
The paper compares Live Avatar against several state-of-the-art open-source audio-driven avatar generation models:
- Ditto
- Echomimic-V2
- Hallo3
- StableAvatar
- OmniAvatar
- WanS2V

These baselines are representative because they cover a range of recent approaches, including models with different architectures and sizes. Crucially, OmniAvatar and WanS2V are large, high-fidelity diffusion models, making them the most relevant competitors in terms of quality, even though they are not designed for real-time streaming.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate Live Avatar's strong performance, particularly in its core target areas: real-time speed and long-duration stability.
The following are the results from Table 2 of the original paper:
| Dataset | Model | ASE ↑ | IQA ↑ | Sync-C ↑ | Sync-D ↓ | Dino-S ↑ | FPS ↑ |
|---|---|---|---|---|---|---|---|
| GenBench-ShortVideo | Ditto [24] | 3.31 | 4.24 | 4.09 | 10.76 | 0.99 | 21.80 |
|  | Echomimic-V2 [33] | 2.82 | 3.61 | 5.57 | 9.13 | 0.79 | 0.53 |
|  | Hallo3 [6] | 3.12 | 3.97 | 4.74 | 10.19 | 0.94 | 0.26 |
|  | StableAvatar [38] | 3.52 | 4.47 | 3.42 | 11.33 | 0.93 | 0.64 |
|  | OmniAvatar [13] | 3.53 | 4.49 | 6.77 | 8.22 | 0.95 | 0.16 |
|  | WanS2V [14] | 3.36 | 4.29 | 5.89 | 9.08 | 0.95 | 0.25 |
|  | Ours | 3.44 | 4.35 | 5.69 | 9.13 | 0.95 | 20.88 |
| GenBench-LongVideo | Ditto [24] | 2.90 | 4.48 | 3.98 | 10.57 | 0.98 | 21.80 |
|  | Hallo3 [6] | 2.65 | 4.04 | 6.18 | 9.29 | 0.83 | 0.26 |
|  | StableAvatar [38] | 3.00 | 4.66 | 1.97 | 13.57 | 0.94 | 0.64 |
|  | OmniAvatar [13] | 2.36 | 2.86 | 8.00 | 7.59 | 0.66 | 0.16 |
|  | WanS2V [14] | 2.63 | 3.99 | 6.04 | 9.12 | 0.80 | 0.25 |
|  | Ours | 3.38 | 4.73 | 6.28 | 8.81 | 0.94 | 20.88 |
Analysis:
- Inference Speed (FPS): Live Avatar achieves 20.88 FPS, which is real-time. This is orders of magnitude faster than other high-fidelity models like OmniAvatar (0.16 FPS) and WanS2V (0.25 FPS). Only Ditto, a much smaller (0.2B) and lower-quality model, achieves a similar speed.
- Short Video Quality (GenBench-ShortVideo): On short videos, Live Avatar achieves competitive scores in aesthetics (ASE) and quality (IQA), comparable to the top-performing non-real-time models (OmniAvatar, StableAvatar). This shows that the distillation and streaming framework does not significantly degrade quality for short-form content.
- Long Video Quality (GenBench-LongVideo): This is where Live Avatar truly excels. Its ASE (3.38) and IQA (4.73) scores are significantly higher than those of all other methods. The large models OmniAvatar and WanS2V show a massive drop in quality on long videos (e.g., OmniAvatar's IQA drops from 4.49 to 2.86), confirming they suffer from long-horizon degradation. Live Avatar's quality remains stable, validating the effectiveness of the Rolling Sink Frame Mechanism (RSFM).

The qualitative results in Figure 4 visually confirm these numbers. While other models show artifacts or frozen expressions over time, Live Avatar remains consistent.
Figure 4 (chart): qualitative comparison with existing methods. The top rows show generated frames at different time points (2 s, 200 s, 400 s); the bottom plots show how the IQA and Sync-C metrics evolve over time for Ours, Ditto, Hallo3, OmniAvatar, and StableAvatar.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Study of Inference Efficiency
This study dissects the sources of the speedup.
The following are the results from Table 3 of the original paper:
| Methods | #GPUs | NFE | FPS ↑ | TTFF ↓ |
|---|---|---|---|---|
| w/o TPP | 2 | 5 | 4.26 | 3.88 |
| w/o TPP, w/ SP4GPU | 5 | 5 | 5.01 | 3.24 |
| w/o VAE Parallel | 4 | 4 | 10.16 | 4.73 |
| w/o DMD | 2 | 80 | 0.29 | 45.50 |
| Ours | 5 | 4 | 20.88 | 2.89 |
Analysis:
- Impact of DMD: Removing Distribution Matching Distillation (w/o DMD) is catastrophic. The Number of Function Evaluations (NFE) jumps from 4 to 80, and the FPS plummets to 0.29. This shows that distillation is essential for making the model fast enough in the first place.
- Impact of TPP: Removing Timestep-forcing Pipeline Parallelism (w/o TPP) but keeping the distilled model results in only 4.26 FPS. Even with standard sequence parallelism across 4 GPUs (w/ SP4GPU), the speed is only 5.01 FPS. This proves that TPP is the key system innovation responsible for the leap from ~5 FPS to ~21 FPS.
- Impact of VAE Parallelization: Without a dedicated GPU for VAE decoding (w/o VAE Parallel), the FPS is halved to 10.16. This confirms that VAE decoding is a significant bottleneck and offloading it is critical for achieving maximum throughput.
6.2.2. Study on Long Video Generation
This study validates the components of the Rolling Sink Frame Mechanism.
The following are the results from Table 4 of the original paper:
| Methods | ASE ↑ | IQA ↑ | Sync-C ↑ | Dino-S ↑ |
|---|---|---|---|---|
| w/o AAS | 3.13 | 4.44 | 6.25 | 0.91 |
| w/o Rolling RoPE | 3.38 | 4.71 | 6.29 | 0.86 |
| w/o History Corrupt | 2.90 | 3.88 | 6.14 | 0.81 |
| Ours | 3.38 | 4.73 | 6.28 | 0.93 |
Analysis:
- History Corrupt: Removing the noise injection during training leads to the largest drop in all metrics, especially IQA and Dino-S. This indicates that making the model robust to historical errors is crucial for maintaining both perceptual quality and identity.
- Adaptive Attention Sink (AAS): Removing AAS (i.e., using the original reference image throughout) causes a noticeable drop in ASE and IQA. This confirms that using an "in-distribution" generated frame as the sink helps maintain better visual quality and aesthetics.
- Rolling RoPE: Removing this component causes Dino-S (identity similarity) to drop from 0.93 to 0.86. This directly validates its role in preserving global identity by ensuring the sink frame's influence remains strong.
6.2.3. User Study
The user study provides insights into human perception, which can differ from automated metrics.
The following are the results from Table 5 of the original paper:
| Model | Naturalness ↑ | Synchronization ↑ | Consistency ↑ |
|---|---|---|---|
| Ditto [24] | 78.2 | 40.5 | 90.2 |
| EchoMimic-V2 [33] | 60.3 | 71.1 | 38.7 |
| Hallo3 [6] | 78.5 | 69.2 | 89.3 |
| StableAvatar [38] | 68.7 | 70.8 | 88.9 |
| OmniAvatar [13] | 71.1 | 78.5 | 90.8 |
| WanS2V [14] | 84.3 | 85.2 | 92.0 |
| Ours | 86.3 | 80.6 | 91.1 |
Analysis:
Live Avatarreceives the highest score for Naturalness (86.3), suggesting that its motion and expressions are perceived as more realistic by humans.- Interestingly,
OmniAvatar, which scored highest on the objectiveSync-Cmetric, is rated lower by users onSynchronizationthanWanS2V. The authors suggest this is because models can "game" metrics likeSync-Cby producing exaggerated lip movements that are technically accurate but look unnatural. Live Avatarachieves a well-balanced, high score across all three human-rated criteria, indicating it aligns well with human perception of a good talking avatar.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents Live Avatar, a framework that makes a significant leap towards the practical deployment of large-scale generative models for real-world applications. By pioneering an algorithm-system co-design, it simultaneously solves the critical challenges of real-time inference latency and long-horizon visual consistency. The key contributions—Timestep-forcing Pipeline Parallelism (TPP), Rolling Sink Frame Mechanism (RSFM), and a robust distillation strategy—enable a massive 14-billion-parameter diffusion model to generate high-fidelity, audio-driven avatar video at 20 FPS for potentially infinite durations. This work establishes a new paradigm for deploying advanced diffusion models in industrial settings and lays a solid foundation for the future of interactive digital humans.
7.2. Limitations & Future Work
The authors acknowledge two main limitations:
- Latency vs. Throughput: The proposed TPP strategy dramatically increases throughput (FPS) but does not reduce the time-to-first-frame (TTFF). The TTFF still depends on a full sequential pass through the pipeline. This makes the approach less suitable for highly interactive turn-based conversations where initial response time is critical.
- Dependence on RSFM: The system's long-term stability relies heavily on the RSFM. While effective, this reliance on a single, dynamically chosen sink frame might limit its ability to handle very complex scenarios with significant, intentional changes in identity, lighting, or background over extremely long periods.

Future work will focus on reducing the initial latency (TTFF) and further improving temporal coherence to handle even more complex and dynamic long-form generation scenarios.
7.3. Personal Insights & Critique
This paper is an excellent example of engineering-driven research that addresses a real-world bottleneck.
Positive Insights:
- The Power of Co-Design: The most significant insight is the effectiveness of tackling a problem from both the system and algorithm levels. The algorithmic improvements (RSFM) would be useless for real-time applications without the system-level speedup from TPP, and TPP would be generating low-quality, inconsistent video without the algorithmic safeguards.
- TPP is a Breakthrough Idea: Timestep-forcing Pipeline Parallelism is a genuinely clever and impactful idea. It reframes a temporally sequential problem into a spatially parallel one, a paradigm that could potentially be applied to other iterative generative models beyond diffusion. It is a powerful demonstration of how rethinking the hardware execution model can unlock new capabilities.
- RSFM and "In-Distribution" Conditioning: The Adaptive Attention Sink component of RSFM is particularly insightful. The idea of using the model's own output as a cleaner, "in-distribution" reference is elegant and effectively grounds the generation process, preventing the slow drift that plagues many autoregressive models.
Potential Issues and Critique:
- Scalability and Cost of TPP: The paper demonstrates TPP with a 4-step distilled model on 4+1 GPUs. While impressive, this raises questions about scalability. If a model required 20 or 50 steps for sufficient quality, would one need 21 or 51 high-end GPUs (like the H800)? This could make the approach prohibitively expensive for many use cases. The cost-effectiveness of TPP seems highly dependent on successful few-step distillation.
- Generalizability of the Framework: The framework is highly optimized for a specific task (audio-driven avatars) and hardware (NVIDIA H800s with fast interconnects). How easily could TPP be adapted to different model architectures or less powerful, more heterogeneous hardware setups? The tight coupling might limit its broader applicability without significant re-engineering.
- Creative Limitations of a Sink Frame: The reliance on a single sink frame, even a dynamically chosen one, is a double-edged sword. It ensures consistency, but it might also stifle creativity or prevent the avatar from naturally evolving its appearance (e.g., subtle changes in lighting as it moves). For a truly dynamic virtual world, a more flexible consistency mechanism might be needed.