- xt: The visual latents at noise level t.
- x1: The clean visual latents from the VAE.
- x0: A random noise tensor, typically sampled from a standard normal distribution N(0,1).
- t: A time-step value in [0, 1], where t=0 is pure noise and t=1 is the clean data.
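Under these definitions, a noised latent xt can be produced by interpolating between the noise x0 and the clean latents x1. The sketch below assumes the linear (rectified-flow) schedule xt = t·x1 + (1−t)·x0, which matches the t=0/t=1 convention above; the source does not spell out the exact schedule.

```python
import numpy as np

def noise_latents(x1, x0, t):
    """Interpolate between pure noise x0 (t=0) and clean VAE latents x1 (t=1).

    Assumes a linear schedule, xt = t * x1 + (1 - t) * x0; this is a
    common flow-matching choice, not a detail stated in the text."""
    return t * x1 + (1.0 - t) * x0

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 16))       # clean visual latents (toy shape)
x0 = rng.standard_normal((4, 16))   # noise sampled from N(0, 1)
xt = noise_latents(x1, x0, t=0.5)   # latents halfway along the path
```

At t=1 the function returns the clean latents exactly, and at t=0 it returns the noise, matching the endpoints described above.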
The semantic layers S(⋅) are pre-trained via distillation to mimic the behavior of a powerful vision model (SigLIP) on both clean and noised latents. The loss for this distillation is:
Ldistill = −(1/n) ∑ log sim(S(xt), SigLIP(X))
- Ldistill: The distillation loss.
- S(xt): The feature representation from the semantic layers for noised latents.
- SigLIP(X): The feature representation from the original SigLIP model for the clean input image X.
- sim(⋅,⋅): Cosine similarity function. This loss encourages the semantic layers to produce features similar to SigLIP's features.
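The loss above averages the negative log of the per-token cosine similarity between the two feature sets. A minimal numpy sketch, with the similarity clipped into (0, 1] so the logarithm is defined (an implementation detail not specified in the source):

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (n, d) feature matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def distill_loss(s_feats, siglip_feats, eps=1e-8):
    """Ldistill = -(1/n) * sum_i log sim(S(xt)_i, SigLIP(X)_i).

    s_feats:      features from the semantic layers, shape (n, d)
    siglip_feats: features from the frozen SigLIP model, shape (n, d)
    """
    sim = np.clip(cosine_sim(s_feats, siglip_feats), eps, 1.0)
    return -np.mean(np.log(sim))
```

When the semantic layers reproduce SigLIP's features exactly, every similarity is 1 and the loss is 0; any mismatch pushes the loss above 0, which is what drives the distillation.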
The final unified representation u is created by the fusion mechanism:
u=STF(S(xt),P(xt))
- u: The unified visual representation fed to the LLM.
- STF: The Spatial(-Temporal) Fusion function, which concatenates and processes the semantic and detailed features.
- P(xt): The low-level feature representation from the projector.
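Since the text says STF concatenates and then processes the two feature streams, the fusion can be sketched as channel-wise concatenation followed by a learned projection. This is only an illustration: the real STF's internals are not specified in the source, and the projection weights `w`, `b` and all dimensions here are hypothetical.

```python
import numpy as np

def spatial_temporal_fusion(sem, det, w, b):
    """Sketch of STF: concatenate semantic features S(xt) and detailed
    features P(xt) along the channel axis, then apply a single learned
    linear projection to produce the unified representation u."""
    fused = np.concatenate([sem, det], axis=-1)  # (tokens, d_sem + d_det)
    return fused @ w + b                          # (tokens, d_model)

rng = np.random.default_rng(0)
sem = rng.normal(size=(256, 768))   # S(xt): semantic features (toy sizes)
det = rng.normal(size=(256, 512))   # P(xt): low-level projector features
w = rng.normal(size=(768 + 512, 1024)) * 0.02  # hypothetical fusion weights
b = np.zeros(1024)
u = spatial_temporal_fusion(sem, det, w, b)     # unified representation for the LLM
```

Concatenation preserves both streams intact before mixing, so neither the semantic nor the low-level detail is discarded prior to the learned projection.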