Text-Visual Prompting for Efficient 2D Temporal Video Grounding

Yimeng Zhang¹,², Xin Chen², Jinghan Jia¹, Sijia Liu¹, Ke Ding²
¹Michigan State University, ²Applied AI, Intel
{zhan1853, jiajingh, liusiji5}@msu.edu, {xin.chen, ke.ding}@intel.com
Abstract
In this paper, we study the problem of temporal video
grounding (TVG), which aims to predict the starting/ending
time points of moments described by a text sentence within
a long untrimmed video. Benefiting from fine-grained 3D
visual features, the TVG techniques have achieved remark-
able progress in recent years. However, the high complexity
of 3D convolutional neural networks (CNNs) makes extract-
ing dense 3D visual features time-consuming, which calls
for intensive memory and computing resources. Towards
efficient TVG, we propose a novel text-visual prompting
(TVP) framework, which incorporates optimized perturba-
tion patterns (that we call ‘prompts’) into both visual in-
puts and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D
visual features. Further, we propose a Temporal-Distance
IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.
Code is available at Open.Intel.
1. Introduction
In recent years, we have witnessed great progress on
temporal video grounding (TVG) [30, 74]. One key to
this success comes from the fine-grained dense 3D vi-
sual features extracted by 3D convolutional neural networks
(CNNs) (e.g., C3D [56] and I3D [3]) since TVG tasks de-
mand spatial-temporal context to locate the temporal inter-
val of the moments described by the text query. However,
due to the high cost of dense 3D feature extraction, most existing TVG models only take these 3D visual features extracted by offline 3D CNNs as inputs, instead of co-training the feature extractors during TVG model training.

Figure 1. The architecture and performance comparison among TVG methods: a) 3D TVG methods [14, 16, 18, 34, 43, 60–62, 64, 67, 69, 71, 73], b) 2D TVG methods [1, 7], c) TVP-based 2D TVG (Ours), and d) overall performance comparison. Ours is the most efficient (least inference time) and achieves competitive performance compared to 3D TVG methods. In contrast to existing TVG methods, which utilize dense video features extracted by non-trainable offline 3D CNNs together with textual features, our proposed framework utilizes a trainable 2D CNN as the vision encoder to extract features from sparsely-sampled video frames with a universal set of frame-aware visual prompts, and adds text prompts in the textual feature space for end-to-end regression-based modeling.
Although models using 3D visual features (that we call ‘3D methods’ or ‘3D models’) outperform those using 2D features (that we call ‘2D methods’ or ‘2D models’), a unique advantage of 2D methods is that extracting 2D visual features can significantly reduce the cost of TVG tasks [14, 15, 30, 34, 61, 62, 69, 74, 75]. An efficient and lightweight solution with reasonable performance is also in demand in computer vision, NLP, and video-language tasks [19, 23, 38, 41, 68, 76–80]. As discussed above, methods employing 3D video features are challenging to deploy in practical applications. It thus has significant
practical and economic value to develop compact 2D solu-
tions for TVG tasks. In this paper, we ask:
How can we advance 2D TVG methods so as to achieve results comparable to 3D TVG methods?
To address this problem, we propose a novel text-visual
prompting (TVP) framework for training TVG models us-
ing 2D visual features. As shown in Fig. 1, existing 2D TVG and 3D TVG methods all utilize offline pre-trained vision encoders and language encoders to perform feature extraction. In contrast, our proposed TVP frame-
work is end-to-end trainable. Furthermore, benefiting from
text-visual prompting and cross-modal pretraining on large-
scale image-text datasets, our proposed framework could
achieve comparable performance to 3D TVG methods with
significant inference time acceleration.
Conventionally, TVG methods consist of three stages: ① extracting features from visual and text inputs; ② multimodal feature fusion; and ③ cross-modal modeling. In contrast
to conventional methods, TVP incorporates optimized input
perturbation patterns (that we call ‘prompts’) into both vi-
sual inputs and textual features of a TVG model. We apply
trainable parameters in the textual features as text prompts
and develop a universal set of frame-aware patterns as visual
prompts. Specifically, we sample a fixed number of frames
from a video and optimize text prompts for the input query
sentence and a set of visual prompts for frames with differ-
ent temporal locations during training. During testing, the same optimized visual prompts and text prompts are applied to all test-time videos. We refer readers to Fig. 2 for illustrations of the introduced visual prompts and text prompts. To the best of our knowledge, our work makes the
first attempt to utilize prompt learning to successfully im-
prove the performance of regression-based TVG tasks using
2D visual features.
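To make the prompting operations concrete, the sketch below shows one plausible way to realize them in PyTorch: a universal set of frame-aware visual prompts added to sparsely-sampled frames, and trainable text prompts prepended to the query features in the textual feature space. The module name, prompt shapes, padding-style placement, and default sizes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TextVisualPrompter(nn.Module):
    """Minimal sketch of text-visual prompting (TVP).

    Assumptions (not from the paper): visual prompts are additive
    border perturbations, one per sampled frame position, and text
    prompts are extra token embeddings prepended to the textual
    features produced by the language encoder.
    """

    def __init__(self, num_frames=10, frame_size=224, pad=32,
                 num_text_prompts=10, text_dim=768):
        super().__init__()
        # One visual prompt per temporal position, shared by all videos.
        self.visual_prompts = nn.Parameter(
            torch.zeros(num_frames, 3, frame_size, frame_size))
        # Mask restricting each perturbation to a border of width `pad`.
        mask = torch.ones(1, 1, frame_size, frame_size)
        mask[:, :, pad:-pad, pad:-pad] = 0.0
        self.register_buffer("pad_mask", mask)
        # Trainable text prompts living directly in the feature space.
        self.text_prompts = nn.Parameter(
            torch.zeros(num_text_prompts, text_dim))

    def forward(self, frames, text_feats):
        # frames: (B, T, 3, H, W) sparsely-sampled video frames
        # text_feats: (B, L, D) token features from the language encoder
        prompted_frames = frames + self.visual_prompts * self.pad_mask
        batch = text_feats.size(0)
        prompts = self.text_prompts.unsqueeze(0).expand(batch, -1, -1)
        prompted_text = torch.cat([prompts, text_feats], dim=1)
        return prompted_frames, prompted_text
```

During training, these prompt parameters are optimized jointly with the 2D vision encoder and the language encoder; at test time, the same learned prompts are reused for every video.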
Compared to 3D CNNs, 2D CNNs lose spatiotemporal information of the video during feature extraction. In-
spired by the success of transformers on the vision-language
tasks [9, 22, 35, 44, 47, 54, 55] and the recent application of
prompt learning to transformers in both vision and language
domains [2, 25, 27, 32, 37, 40], we choose a transformer as our
base TVG model and propose to utilize prompts to compen-
sate for the lack of spatiotemporal information in 2D visual
features. Furthermore, we develop a Temporal-Distance
IoU (TDIoU) loss for training our proposed framework.
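The exact TDIoU formulation is not spelled out in this section; the snippet below is only an assumed illustration of how a temporal-IoU term combined with a center-distance penalty might be written for (start, end) regression, not the paper's definition.

```python
import torch

def tdiou_style_loss(pred, target, alpha=1.0, eps=1e-6):
    """Illustrative temporal-distance IoU-style loss (an assumption,
    not the paper's exact TDIoU definition).

    pred, target: (B, 2) tensors of (start, end) in normalized time [0, 1].
    """
    p_start, p_end = pred[:, 0], pred[:, 1]
    t_start, t_end = target[:, 0], target[:, 1]

    # Temporal IoU between the predicted and ground-truth intervals.
    inter = (torch.min(p_end, t_end) - torch.max(p_start, t_start)).clamp(min=0)
    union = (torch.max(p_end, t_end) - torch.min(p_start, t_start)).clamp(min=eps)
    iou = inter / union

    # Distance penalty between interval centers, normalized by the
    # smallest window enclosing both intervals (a DIoU-like term).
    center_dist = torch.abs((p_start + p_end) / 2 - (t_start + t_end) / 2)
    penalty = center_dist / union

    return (1.0 - iou + alpha * penalty).mean()
```

Both terms are differentiable in the predicted boundaries, so such a loss can directly drive end-to-end regression-based TVG training.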
Figure 2. Text-visual prompting illustration. (a) Text prompts are directly applied in the feature space. (b) A set of visual prompts is applied to video frames in order.

There are two aspects that distinguish our proposed framework from existing works. First, our proposed framework is designed to boost the performance of regression-based TVG methods utilizing 2D CNNs as the vision encoder, rather than for transfer learning [2, 21, 26]. Second, our proposed framework utilizes a 2D CNN to extract visual features from sparsely-sampled video frames, which requires less memory and is easier to apply in practical applications compared to 3D methods [34, 60–62, 69, 75], especially for long
videos. Furthermore, thanks to the compact 2D CNN as the vision encoder, our proposed framework can co-train the language encoder and vision encoder for better multimodal feature fusion. In summary, the contributions of this work are as follows:
• We propose an effective and efficient framework to train 2D TVG models, in which we leverage TVP (text-visual prompting) to improve the utility of sparse 2D visual features without resorting to costly 3D features. To the best of our knowledge, this is the first work to expand the application of prompt learning to resolving TVG problems. Our method outperforms all 2D methods and achieves competitive performance compared to 3D TVG methods.

• Technology-wise, we integrate visual prompts with text prompts to co-improve the effectiveness of 2D visual features. On top of that, we propose a TDIoU (temporal-distance IoU)-based prompt-model co-training method to obtain high-accuracy 2D TVG models (an illustrative co-training sketch follows this list).

• Experiment-wise, we show the empirical success of our proposal in boosting the performance of 2D TVG on the Charades-STA and ActivityNet Captions datasets, e.g., a 9.79% improvement on Charades-STA and a 30.77% improvement on ActivityNet Captions, together with 5× inference time acceleration over 3D TVG methods.
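As referenced in the second contribution above, prompt-model co-training means that the prompts, the 2D vision encoder, the language encoder, and the prediction head are updated together under the TDIoU-style objective. The sketch below reuses the `TextVisualPrompter` and `tdiou_style_loss` sketches from earlier; the module interfaces (`vision_encoder`, `text_encoder`, `fusion_head`, `query_tokens`) are assumed placeholders rather than the paper's actual components.

```python
def cotrain_step(prompter, vision_encoder, text_encoder, fusion_head,
                 frames, query_tokens, target, optimizer):
    """One illustrative prompt-model co-training step (assumed interfaces)."""
    optimizer.zero_grad()
    # Language encoder produces textual features; text prompts are then
    # attached directly in that feature space (Fig. 2a).
    text_feats = text_encoder(query_tokens)                     # (B, L, D)
    prompted_frames, prompted_text = prompter(frames, text_feats)
    # Frame-aware visual prompts were added to the sampled frames, which
    # now pass through the trainable 2D vision encoder (Fig. 2b).
    b, t = prompted_frames.shape[:2]
    vis_feats = vision_encoder(prompted_frames.flatten(0, 1))   # (B*T, D)
    vis_feats = vis_feats.view(b, t, -1)
    # Cross-modal fusion and direct regression of (start, end).
    pred = fusion_head(vis_feats, prompted_text)                # (B, 2)
    loss = tdiou_style_loss(pred, target)
    loss.backward()  # gradients reach prompts, both encoders, and the head
    optimizer.step()
    return loss.item()
```

A single optimizer over all of these parameter groups is one simple choice; separate learning rates for prompts and encoders would be an equally plausible variant.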
2. Related Work
Temporal Video Grounding (TVG). The objective of TVG is to predict the starting/ending time points of target moments described by a text sentence within an untrimmed video. Early TVG solutions [7, 14, 20, 39, 62, 64, 70]
mainly employ a two-stage "propose-and-rank" pipeline: ① generating candidate moment proposals and ② ranking the candidates by their relevance to the text query.