chitecture that still imposes dense dependencies across all timesteps, and the latter by removing autoregressive serialization through masked generation with a bidirectional transformer. While conceptually simple, we show that the efficient dense architecture and masked generation are highly complementary and, when combined, lead to substantial improvements in modeling longer videos compared to previous works, in both training and inference. The contributions of this paper are as follows:
• We propose the Memory-efficient Bidirectional Transformer (MeBT) for generative modeling of video. Unlike prior methods, MeBT can directly learn long-range dependencies from training videos while enjoying fast inference and robustness to error propagation.
• To train MeBT on moderately long videos, we propose a simple yet effective curriculum learning scheme that guides the model to learn short- to long-term dependencies gradually over training.
• We evaluate MeBT on three challenging real-world video datasets. MeBT achieves performance competitive with the state of the art on short videos of 16 frames, and outperforms all baselines on long videos of 128 frames while being considerably more efficient in memory and computation during training and inference.
2. Background
This section introduces generative transformers for videos that utilize discrete token representation, which can be categorized into autoregressive and bidirectional models.
Let $x \in \mathbb{R}^{T \times H \times W \times 3}$ be a video. To model its generative distribution $p(x)$, prior works on transformers employ a discrete latent representation of frames $y \in \mathbb{R}^{t \times h \times w \times d}$ and model the prior distribution on the latent space $p(y)$.
Vector Quantization To map a video x into discrete to-
kens y, previous works [11, 13, 32, 48, 51] utilize vector
quantization with an encoder E that maps x onto a learn-
able codebook F = {e
i
}
U
i=1
[45]. Specifically, given
a video x, the encoder produces continuous embeddings
h = E(x) ∈ R
t×h×w×d
and searches for the nearest code
e
u
∈ F .
The encoder $E$ is trained through an autoencoding framework by introducing a decoder $D$ that takes discrete tokens $y$ and produces a reconstruction $\hat{x} = D(y)$. The encoder $E$, codebook $\mathcal{F}$, and decoder $D$ are optimized with the following training objective:
$$\mathcal{L}_q = \|x - \hat{x}\|_2^2 + \|\mathrm{sg}(h) - y\|_2^2 + \beta \|\mathrm{sg}(y) - h\|_2^2, \qquad (1)$$
where $\mathrm{sg}$ denotes the stop-gradient operator. In practice, to improve the quality of discrete representations, additional perceptual and adversarial losses are often introduced [9].
For the choice of the encoder $E$, we utilize 3D convolutional networks that compress a video in both spatial and temporal dimensions, following prior works [11, 53].
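To make the tokenization step concrete, below is a minimal PyTorch-style sketch of the nearest-code lookup and the objective in Eq. (1). The function names, the $(B, t, h, w, d)$ tensor layout, and the default $\beta = 0.25$ are illustrative assumptions rather than the implementation used in this paper; in practice, a straight-through estimator is also typically used to pass decoder gradients through the quantization step.

```python
import torch
import torch.nn.functional as F

def quantize(h, codebook):
    """Nearest-code lookup for continuous embeddings h.

    h:        (B, t, h, w, d) continuous encoder outputs E(x)
    codebook: (U, d) learnable code vectors {e_i}
    Returns the quantized embeddings and the code indices.
    """
    flat = h.reshape(-1, h.shape[-1])            # (B*t*h*w, d)
    dists = torch.cdist(flat, codebook)          # pairwise L2 distances to all codes
    idx = dists.argmin(dim=-1)                   # nearest code e_u per position
    y = codebook[idx].reshape(h.shape)           # quantized embeddings
    return y, idx.reshape(h.shape[:-1])

def vq_loss(x, x_hat, h, y, beta=0.25):
    """Training objective of Eq. (1): reconstruction + codebook + commitment terms."""
    rec = F.mse_loss(x_hat, x)                   # ||x - x_hat||^2
    codebook_term = F.mse_loss(y, h.detach())    # ||sg(h) - y||^2
    commit_term = F.mse_loss(h, y.detach())      # beta * ||sg(y) - h||^2
    return rec + codebook_term + beta * commit_term
```

After training, only the code indices are kept as the discrete tokens $y$ on which the prior models below operate.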
Autoregressive Transformers Given the discrete latent representations $y \in \mathbb{R}^{t \times h \times w \times d}$, generative modeling of videos boils down to modeling the prior $p(y)$. Prior works based on autoregressive transformers employ a sequential factorization $p(y) = \prod_{i \leq N} p(y_i \mid y_{<i})$ where $N = thw$, and use a transformer to model the conditional distribution of each token $p(y_i \mid y_{<i})$. The transformer is trained to minimize the following negative log-likelihood of the training data:
$$\mathcal{L}_a = \sum_{i \leq N} -\log p(y_i \mid y_{<i}). \qquad (2)$$
During inference, the transformer generates a video by sequentially sampling each token $y_i$ from the conditional $p(y_i \mid y_{<i})$ based on the context $y_{<i}$. The sampled tokens $y$ are then mapped back to a video using the decoder $D$.
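As a concrete reference, the sketch below illustrates the training objective of Eq. (2) and the sequential decoding loop, continuing the PyTorch-style pseudocode above. The `transformer` callable, the flattened $(B, N)$ token layout, and the start-token convention are assumptions for illustration; the model is assumed to apply a causal attention mask internally.

```python
import torch
import torch.nn.functional as F

def autoregressive_nll(transformer, tokens):
    """Eq. (2): negative log-likelihood under a causal transformer.

    tokens: (B, N) flattened code indices with N = t*h*w; `transformer`
    is assumed to return logits of shape (B, N-1, U) where position i
    attends only to tokens < i.  (A start token can be prepended so that
    the first position is also predicted.)
    """
    logits = transformer(tokens[:, :-1])                       # predict token i from y_{<i}
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

@torch.no_grad()
def sample_autoregressive(transformer, start, num_tokens, temperature=1.0):
    """Sequential decoding: one forward pass per token, each conditioned on all previous ones."""
    tokens = start                                             # (B, 1) start token
    for _ in range(num_tokens):
        logits = transformer(tokens)[:, -1] / temperature      # logits for the next position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample y_i ~ p(y_i | y_{<i})
        tokens = torch.cat([tokens, next_token], dim=1)        # context grows every step
    return tokens[:, 1:]                                       # drop the start token
```

The loop makes explicit the two costs discussed next: each of the $N$ steps attends over the full prefix, and any sampling error enters the context of all subsequent steps.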
While simple and powerful, autoregressive transformers for videos suffer from critical scaling issues. First, each conditional $p(y_n \mid y_{<n})$ involves $\mathcal{O}(n^2)$ computational cost due to the quadratic complexity of self-attention. This forces the model to utilize only short-term context in both training and inference, making it inappropriate for modeling long-term spatio-temporal coherence. Furthermore, during inference, the sequential decoding requires $N$ model predictions, each of which recursively depends on the previous ones. This leads to slow inference and, more notably, potential error propagation over space and time, since the prediction error at a given token accumulates over the remaining decoding steps. This is particularly problematic for videos, since $N$ is often very large as tokenization spans both spatial and temporal dimensions.
Bidirectional Transformers To improve the decoding efficiency of autoregressive transformers, bidirectional generative transformers have been proposed [5, 13, 56]. Contrary to autoregressive models that predict a single consecutive token at each step, a bidirectional transformer learns to predict multiple masked tokens at once based on the previously generated context. Specifically, given random masking indices $m \subseteq \{1, ..., N\}$, it models the joint distribution over the masked tokens $y_M = \{y_i \mid i \in m\}$ conditioned on the visible context $y_C = \{y_i \mid i \notin m\}$, and is trained with the following objective:
$$\mathcal{L}_b = -\log p(y_M \mid y_C, z_M) \approx -\sum_{i \in m} \log p(y_i \mid y_C, z_M), \qquad (3)$$
where the mask embeddings $z_M$ encode the positions of the masked tokens with learnable vectors. Each conditional in Eq. (3) is modeled by a transformer, but contrary to autoregressive
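For concreteness, a minimal sketch of the masked-token objective in Eq. (3) is given below, in the same PyTorch-style pseudocode as above. The `model`, `token_emb`, and `mask_emb` names and the fixed `mask_ratio` are hypothetical; in practice the masking ratio is typically sampled from a schedule rather than held constant.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(model, token_emb, mask_emb, tokens, mask_ratio=0.5):
    """Eq. (3): predict the masked tokens y_M from the visible context y_C.

    tokens:   (B, N) flattened code indices, N = t*h*w.
    mask_emb: learnable vector z_M of shape (d,) marking masked positions.
    `model` is assumed to use full (bidirectional) self-attention and to
    return logits of shape (B, N, U).
    """
    B, N = tokens.shape
    m = torch.rand(B, N, device=tokens.device) < mask_ratio      # random masking indices m
    x = token_emb(tokens)                                        # (B, N, d) token embeddings
    x = torch.where(m.unsqueeze(-1), mask_emb.expand_as(x), x)   # substitute z_M at masked positions
    logits = model(x)                                            # every position sees y_C and z_M
    return F.cross_entropy(logits[m], tokens[m])                 # -sum_{i in m} log p(y_i | y_C, z_M)
```

Since the loss covers many positions per forward pass and decoding can likewise fill in many tokens at once, this objective avoids the token-by-token serialization of the autoregressive loop above.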