xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed
Representations
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the …
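To make the effect of joint spatial and temporal compression concrete, the following minimal sketch computes the latent grid size for a clip under assumed downsampling factors. The function names and the factors (4x temporal, 8x spatial) are illustrative assumptions, not values stated in the text above, and the sketch ignores channel dimensions.

```python
from math import ceil


def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Latent (frames, height, width) after temporal and spatial downsampling.

    t_factor and s_factor are hypothetical compression factors chosen
    for illustration only.
    """
    return (
        ceil(frames / t_factor),
        ceil(height / s_factor),
        ceil(width / s_factor),
    )


def compression_ratio(frames, height, width, t_factor=4, s_factor=8):
    """Ratio of input grid positions to latent grid positions."""
    lt, lh, lw = latent_shape(frames, height, width, t_factor, s_factor)
    return (frames * height * width) / (lt * lh * lw)


if __name__ == "__main__":
    # A 16-frame 256x256 clip under the assumed factors.
    print(latent_shape(16, 256, 256))
    print(compression_ratio(16, 256, 256))
```

Under these assumed factors, compressing along the time axis as well as the two spatial axes multiplies the reduction in positions the diffusion model has to process, compared with a purely spatial autoencoder applied frame by frame.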