
Video Probabilistic Diffusion Models in Projected Latent Space
Sihyun Yu¹   Kihyuk Sohn²   Subin Kim¹   Jinwoo Shin¹
¹KAIST   ²Google Research
{sihyun.yu, subin-kim, jinwoos}@kaist.ac.kr, kihyuks@google.com
Abstract
Despite the remarkable progress in deep generative models, synthesizing high-resolution, temporally coherent videos remains a challenge due to their high dimensionality and complex temporal dynamics coupled with large spatial variations. Recent works on diffusion models
have shown their potential to solve this challenge, yet they
suffer from severe computation- and memory-inefficiency
that limit the scalability. To handle this issue, we propose
a novel generative model for videos, coined projected la-
tent video diffusion model (PVDM), a probabilistic dif-
fusion model which learns a video distribution in a low-
dimensional latent space and thus can be efficiently trained
with high-resolution videos under limited resources. Specifi-
cally, PVDM is composed of two components: (a) an autoen-
coder that projects a given video as 2D-shaped latent vectors
that factorize the complex cubic structure of video pixels and
(b) a diffusion model architecture specialized for our new fac-
torized latent space and the training/sampling procedure to
synthesize videos of arbitrary length with a single model. Ex-
periments on popular video generation datasets demonstrate
the superiority of PVDM compared with previous video syn-
thesis methods; e.g., PVDM obtains an FVD score of 639.7 on the UCF-101 long-video (128 frames) generation benchmark, improving upon the prior state-of-the-art score of 1773.4.
1. Introduction
Recent progress in deep generative models has shown promise for synthesizing high-quality, realistic samples in
various domains, such as images [9, 27, 41], audio [8, 31, 32],
3D scenes [6, 38, 48], natural languages [2, 5], etc. As
a next step forward, several works have been actively
focusing on the more challenging task of video synthe-
sis [12, 18, 21, 47, 55, 67]. In contrast to the success in other domains, the quality of generated videos still falls far short of real-world videos, due to the high dimensionality and complexity of videos, which contain intricate spatiotemporal dynamics in high-resolution frames.
Inspired by the success of diffusion models in handling
complex and large-scale image datasets [9, 40], recent ap-
proaches have attempted to design diffusion models for
videos [16, 18, 21, 22, 35, 66]. As in the image domain, these methods have shown great potential to model the video distribution with better scalability (in terms of both spatial resolution and temporal duration), even achieving photorealistic generation results [18]. However, they suffer from severe computation and memory inefficiency, as diffusion models require many iterative denoising steps in the input space to synthesize samples [51]. These bottlenecks are further amplified for videos due to their cubic RGB array structure.
Meanwhile, recent works in image generation have proposed latent diffusion models to circumvent the computation and memory inefficiency of diffusion models [15, 41, 59]. Instead of training the model on raw pixels, latent diffusion models first train an autoencoder to learn a low-dimensional latent space that succinctly parameterizes images [10, 41, 60] and then model this latent distribution. Intriguingly, this approach has shown a dramatic improvement in efficiency for synthesizing samples while even achieving state-of-the-art generation results [41]. Despite this appealing potential, however, developing a latent diffusion model for videos has been largely overlooked.

Contribution. We present a novel latent diffusion model for videos, coined projected latent video diffusion model (PVDM). Specifically, it is a two-stage framework (see Figure 1 for the overall illustration):
•
Autoencoder: We introduce an autoencoder that repre-
sents a video with three 2D image-like latent vectors by
factorizing the complex cubic array structure of videos.
Specifically, we propose 3D → 2D projections of videos
at each spatiotemporal direction to encode 3D video pix-
els as three succinct 2D latent vectors. At a high level,
we design one latent vector across the temporal direction
to parameterize the common contents of the video (e.g.,
background), and the latter two vectors to encode the mo-
tion of a video. These 2D latent vectors are beneficial for
achieving high-quality and succinct encoding of videos,
as well as enabling compute-efficient diffusion model
architecture design due to their image-like structure.
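To make the factorization concrete, below is a minimal, hypothetical sketch of the 3D → 2D projection idea in PyTorch. It uses simple mean-pooling along each axis purely for illustration; the actual PVDM autoencoder learns these projections (and the corresponding decoder), and the function and variable names here (project_to_2d_latents, z_hw, z_th, z_tw) are our own, not from the paper.

```python
# Conceptual sketch (not the paper's actual autoencoder): factorize a video
# tensor into three 2D image-like latents by projecting along each
# spatiotemporal axis. PVDM learns such projections; mean-pooling is used
# here only as a stand-in to illustrate the shapes involved.
import torch

def project_to_2d_latents(video: torch.Tensor):
    """video: [B, C, T, H, W] -> three 2D-shaped latents.

    z_hw ([B, C, H, W]): pooled over time; captures content shared across
        frames (e.g., background).
    z_th ([B, C, T, H]) and z_tw ([B, C, T, W]): each pooled over one
        spatial axis; capture motion along the temporal direction.
    """
    z_hw = video.mean(dim=2)  # project out the temporal axis
    z_th = video.mean(dim=4)  # project out the width axis
    z_tw = video.mean(dim=3)  # project out the height axis
    return z_hw, z_th, z_tw

# Example: a 16-frame 256x256 RGB clip yields three 2D latents whose total
# size is far smaller than the original cubic T*H*W pixel array.
video = torch.randn(1, 3, 16, 256, 256)
z_hw, z_th, z_tw = project_to_2d_latents(video)
print(z_hw.shape, z_th.shape, z_tw.shape)
# torch.Size([1, 3, 256, 256]) torch.Size([1, 3, 16, 256]) torch.Size([1, 3, 16, 256])
```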