chitecture that still imposes dense dependencies across all timesteps, and the latter by removing autoregressive serialization through masked generation with a bidirectional transformer. While conceptually simple, we show that the efficient dense architecture and masked generation are highly complementary and, when combined, lead to substantial improvements in modeling longer videos compared to previous works, in both training and inference. The contributions of this paper are as follows:
• We propose the Memory-efficient Bidirectional Transformer (MeBT) for generative modeling of video. Unlike prior methods, MeBT can directly learn long-range dependencies from training videos while enjoying fast inference and robustness to error propagation.
• To train MeBT on moderately long videos, we propose a simple yet effective curriculum learning scheme that guides the model to learn short- to long-term dependencies gradually over training.
• We evaluate MeBT on three challenging real-world video datasets. MeBT achieves performance competitive with the state of the art on short videos of 16 frames, and outperforms all baselines on long videos of 128 frames while being considerably more efficient in memory and computation during training and inference.
2. Background
This section introduces generative transformers for videos that utilize discrete token representation, which can be categorized into autoregressive and bidirectional models.
Let $x \in \mathbb{R}^{T \times H \times W \times 3}$ be a video. To model its generative distribution $p(x)$, prior works on transformers employ a discrete latent representation of frames $y \in \mathbb{R}^{t \times h \times w \times d}$ and model the prior distribution on the latent space $p(y)$.
Vector Quantization To map a video x into discrete to-
kens y, previous works [11, 13, 32, 48, 51] utilize vector
quantization with an encoder E that maps x onto a learn-
able codebook F = {e
i
}
U
i=1
[45]. Specifically, given
a video x, the encoder produces continuous embeddings
h = E(x) ∈ R
t×h×w×d
and searches for the nearest code
e
u
∈ F .
The encoder $E$ is trained through an autoencoding framework by introducing a decoder $D$ that takes discrete tokens $y$ and produces a reconstruction $\hat{x} = D(y)$. The encoder $E$, codebook $\mathcal{F}$, and decoder $D$ are optimized with the following training objective:
$$\mathcal{L}_q = \|x - \hat{x}\|_2^2 + \|\mathrm{sg}(h) - y\|_2^2 + \beta \|\mathrm{sg}(y) - h\|_2^2, \qquad (1)$$
where $\mathrm{sg}$ denotes the stop-gradient operator. In practice, to improve the quality of discrete representations, additional perceptual and adversarial losses are often introduced [9].
For the choice of the encoder $E$, we utilize 3D convolutional networks that compress a video in both spatial and temporal dimensions, following prior works [11, 53].
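To make the tokenization step concrete, below is a minimal PyTorch-style sketch of the nearest-code lookup and the objective in Eq. (1). The function names, the $(B, t, h, w, d)$ tensor layout, and the default $\beta = 0.25$ are illustrative assumptions rather than the implementation used in this paper; in practice, a straight-through estimator is also typically used to pass decoder gradients through the quantization step.

```python
import torch
import torch.nn.functional as F

def quantize(h, codebook):
    """Nearest-code lookup for continuous embeddings h.

    h:        (B, t, h, w, d) continuous encoder outputs E(x)
    codebook: (U, d) learnable code vectors {e_i}
    Returns the quantized embeddings and the code indices.
    """
    flat = h.reshape(-1, h.shape[-1])            # (B*t*h*w, d)
    dists = torch.cdist(flat, codebook)          # pairwise L2 distances to all codes
    idx = dists.argmin(dim=-1)                   # nearest code e_u per position
    y = codebook[idx].reshape(h.shape)           # quantized embeddings
    return y, idx.reshape(h.shape[:-1])

def vq_loss(x, x_hat, h, y, beta=0.25):
    """Training objective of Eq. (1): reconstruction + codebook + commitment terms."""
    rec = F.mse_loss(x_hat, x)                   # ||x - x_hat||^2
    codebook_term = F.mse_loss(y, h.detach())    # ||sg(h) - y||^2
    commit_term = F.mse_loss(h, y.detach())      # beta * ||sg(y) - h||^2
    return rec + codebook_term + beta * commit_term
```

After training, only the code indices are kept as the discrete tokens $y$ on which the prior models below operate.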
Autoregressive Transformers Given the discrete latent representations $y \in \mathbb{R}^{t \times h \times w \times d}$, generative modeling of videos boils down to modeling the prior $p(y)$. Prior works based on autoregressive transformers employ a sequential factorization $p(y) = \prod_{i \leq N} p(y_i \mid y_{<i})$ where $N = thw$, and use a transformer to model the conditional distribution of each token $p(y_i \mid y_{<i})$. The transformer is trained to minimize the following negative log-likelihood of the training data:
$$\mathcal{L}_a = \sum_{i \leq N} -\log p(y_i \mid y_{<i}). \qquad (2)$$
During inference, the transformer generates a video by sequentially sampling each token $y_i$ from the conditional $p(y_i \mid y_{<i})$ based on the context $y_{<i}$. The sampled tokens $y$ are then mapped back to a video using the decoder $D$.
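As a concrete reference, the sketch below illustrates the training objective of Eq. (2) and the sequential decoding loop, continuing the PyTorch-style pseudocode above. The `transformer` callable, the flattened $(B, N)$ token layout, and the start-token convention are assumptions for illustration; the model is assumed to apply a causal attention mask internally.

```python
import torch
import torch.nn.functional as F

def autoregressive_nll(transformer, tokens):
    """Eq. (2): negative log-likelihood under a causal transformer.

    tokens: (B, N) flattened code indices with N = t*h*w; `transformer`
    is assumed to return logits of shape (B, N-1, U) where position i
    attends only to tokens < i.  (A start token can be prepended so that
    the first position is also predicted.)
    """
    logits = transformer(tokens[:, :-1])                       # predict token i from y_{<i}
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

@torch.no_grad()
def sample_autoregressive(transformer, start, num_tokens, temperature=1.0):
    """Sequential decoding: one forward pass per token, each conditioned on all previous ones."""
    tokens = start                                             # (B, 1) start token
    for _ in range(num_tokens):
        logits = transformer(tokens)[:, -1] / temperature      # logits for the next position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample y_i ~ p(y_i | y_{<i})
        tokens = torch.cat([tokens, next_token], dim=1)        # context grows every step
    return tokens[:, 1:]                                       # drop the start token
```

The loop makes explicit the two costs discussed next: each of the $N$ steps attends over the full prefix, and any sampling error enters the context of all subsequent steps.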
While simple and powerful, autoregressive transformers for videos suffer from critical scaling issues. First, each conditional $p(y_n \mid y_{<n})$ involves $\mathcal{O}(n^2)$ computational cost due to the quadratic complexity of self-attention. This forces the model to utilize only short-term context in both training and inference, making it inappropriate for modeling long-term spatio-temporal coherence. Furthermore, during inference, the sequential decoding requires $N$ model predictions, each of which recursively depends on the previous ones. This leads to slow inference and, more notably, potential error propagation over space and time, since the prediction error at a given token accumulates over the remaining decoding steps. This is particularly problematic for videos, since $N$ is often very large as tokenization spans both spatial and temporal dimensions.
Bidirectional Transformers To improve the decoding efficiency of autoregressive transformers, bidirectional generative transformers have been proposed [5, 13, 56]. Contrary to autoregressive models that predict a single consecutive token at each step, a bidirectional transformer learns to predict multiple masked tokens at once based on the previously generated context. Specifically, given random masking indices $m \subseteq \{1, ..., N\}$, it models the joint distribution over the masked tokens $y_M = \{y_i \mid i \in m\}$ conditioned on the visible context $y_C = \{y_i \mid i \notin m\}$, and is trained with the following objective:
$$\mathcal{L}_b = -\log p(y_M \mid y_C, z_M) \approx -\sum_{i \in m} \log p(y_i \mid y_C, z_M), \qquad (3)$$
where the mask embeddings $z_M$ encode the positions of the masked tokens with learnable vectors. Each conditional in Eq. (3) is modeled by a transformer, but contrary to autoregressive
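For concreteness, a minimal sketch of the masked-token objective in Eq. (3) is given below, in the same PyTorch-style pseudocode as above. The `model`, `token_emb`, and `mask_emb` names and the fixed `mask_ratio` are hypothetical; in practice the masking ratio is typically sampled from a schedule rather than held constant.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(model, token_emb, mask_emb, tokens, mask_ratio=0.5):
    """Eq. (3): predict the masked tokens y_M from the visible context y_C.

    tokens:   (B, N) flattened code indices, N = t*h*w.
    mask_emb: learnable vector z_M of shape (d,) marking masked positions.
    `model` is assumed to use full (bidirectional) self-attention and to
    return logits of shape (B, N, U).
    """
    B, N = tokens.shape
    m = torch.rand(B, N, device=tokens.device) < mask_ratio      # random masking indices m
    x = token_emb(tokens)                                        # (B, N, d) token embeddings
    x = torch.where(m.unsqueeze(-1), mask_emb.expand_as(x), x)   # substitute z_M at masked positions
    logits = model(x)                                            # every position sees y_C and z_M
    return F.cross_entropy(logits[m], tokens[m])                 # -sum_{i in m} log p(y_i | y_C, z_M)
```

Since the loss covers many positions per forward pass and decoding can likewise fill in many tokens at once, this objective avoids the token-by-token serialization of the autoregressive loop above.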