
employed in practical applications. It is thus of significant practical and economic value to develop compact 2D solutions for TVG tasks. In this paper, we ask:
How can we advance 2D TVG methods to achieve results comparable to those of 3D TVG methods?
To address this problem, we propose a novel text-visual
prompting (TVP) framework for training TVG models us-
ing 2D visual features. As shown in Fig. 1, existing 2D and 3D TVG methods all rely on offline pre-trained vision and language encoders for feature extraction. In contrast, our proposed TVP framework is end-to-end trainable. Furthermore, benefiting from text-visual prompting and cross-modal pretraining on large-scale image-text datasets, our framework achieves performance comparable to 3D TVG methods with significantly faster inference.
Conventionally, TVG methods consist of three stages: (1) feature extraction from visual and text inputs; (2) multimodal feature fusion; (3) cross-modal modeling. In contrast
to conventional methods, TVP incorporates optimized input
perturbation patterns (that we call ‘prompts’) into both vi-
sual inputs and textual features of a TVG model. We apply
trainable parameters in the textual features as text prompts
and develop a universal set of frame-aware patterns as visual
prompts. Specifically, we sample a fixed number of frames from each video and, during training, optimize text prompts for the input query sentence and a set of visual prompts for frames at different temporal locations. During testing, the same set of optimized visual and text prompts is applied to all test videos. We refer readers to Fig. 2 for an illustration of the introduced visual and text prompts. To the best of our knowledge, our work makes the
first attempt to utilize prompt learning to successfully im-
prove the performance of regression-based TVG tasks using
2D visual features.
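To make the prompt placement concrete, the sketch below illustrates one way the two prompt types could be attached. The tensor shapes, the number of prompt tokens, the border-shaped prompt region, and all variable names are our own illustrative assumptions, not the exact implementation used in this work.

```python
import torch

# Minimal sketch of text-visual prompting (illustrative assumptions only).
num_frames, pad = 10, 16              # sampled frames per video; prompt border width
text_len, txt_dim = 20, 768           # query tokens and text feature dimension

# (a) Text prompts: trainable vectors prepended to the textual features.
text_prompts = torch.nn.Parameter(torch.randn(4, txt_dim))        # 4 prompt tokens (assumed)
text_feat = torch.randn(text_len, txt_dim)                         # language-encoder output
prompted_text = torch.cat([text_prompts, text_feat], dim=0)        # (24, 768)

# (b) Frame-aware visual prompts: one trainable pixel-space pattern per
# temporal location, applied here on a border so the frame content stays visible.
frames = torch.rand(num_frames, 3, 224, 224)                        # sampled video frames
visual_prompts = torch.nn.Parameter(torch.zeros(num_frames, 3, 224, 224))
mask = torch.zeros(1, 1, 224, 224)
mask[..., :pad, :] = 1.0
mask[..., -pad:, :] = 1.0
mask[..., :, :pad] = 1.0
mask[..., :, -pad:] = 1.0
prompted_frames = frames + visual_prompts * mask                    # frame i receives prompt i
```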
Compared to 3D CNNs, 2D CNNs lose spatiotempo-
ral information of the video during feature extraction. In-
spired by the success of transformers on vision-language
tasks [9, 22, 35, 44, 47, 54, 55] and the recent application of
prompt learning to transformers in both vision and language
domains [2, 25, 27, 32, 37, 40], we choose a transformer as our base TVG model and propose to utilize prompts to compensate for the lack of spatiotemporal information in 2D visual
features. Furthermore, we develop a Temporal-Distance
IoU (TDIoU) loss for training our proposed framework.
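The exact TDIoU formulation is developed later in the paper. Purely as a rough, non-authoritative illustration of the general idea, i.e., an IoU term over temporal segments combined with a penalty on the distance between segment centers, a sketch could look like the following, where the normalization and the weight `alpha` are our own assumptions:

```python
import torch

def tdiou_style_loss(pred, target, alpha=1.0):
    """Sketch of a temporal IoU loss with a center-distance penalty.

    pred, target: tensors of shape (batch, 2) holding (start, end) times
    normalized to [0, 1]. Illustrative only; not necessarily the exact
    TDIoU definition used in the paper.
    """
    p_start, p_end = pred[:, 0], pred[:, 1]
    t_start, t_end = target[:, 0], target[:, 1]

    # Temporal IoU between predicted and ground-truth segments.
    inter = (torch.min(p_end, t_end) - torch.max(p_start, t_start)).clamp(min=0)
    union = (torch.max(p_end, t_end) - torch.min(p_start, t_start)).clamp(min=1e-6)
    iou = inter / union

    # Distance term: gap between segment centers, normalized by the enclosing span.
    p_center = (p_start + p_end) / 2
    t_center = (t_start + t_end) / 2
    dist = (p_center - t_center).abs() / union

    return (1.0 - iou + alpha * dist).mean()
```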
Figure 2. Text-visual prompting illustration. (a) Text prompts are applied directly in the feature space. (b) A set of visual prompts is applied to the video frames in order.

There are two aspects that distinguish our proposed framework from existing works. First, our framework is designed to boost the performance of regression-based TVG methods that use 2D CNNs as the vision encoder, rather than for transfer learning [2, 21, 26]. Second, our framework uses a 2D CNN to extract visual features from sparsely sampled video frames, which requires less memory and is easier to apply in practice than 3D methods [34, 60–62, 69, 75], especially for long videos.
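To illustrate what sparse 2D feature extraction entails, a minimal sketch follows; the frame count, uniform sampling scheme, and ResNet-50 backbone are chosen purely as assumptions, not as the paper's exact configuration.

```python
import torch
import torchvision

def extract_sparse_2d_features(video: torch.Tensor, num_frames: int = 10) -> torch.Tensor:
    """Uniformly sample a few frames and encode each with a 2D CNN.

    video: (T, 3, 224, 224) float tensor of decoded frames. The sampling
    scheme, frame count, and backbone are illustrative assumptions.
    """
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    frames = video[idx]                                    # (num_frames, 3, 224, 224)

    backbone = torchvision.models.resnet50(weights=None)   # 2D CNN vision encoder
    backbone.fc = torch.nn.Identity()                       # keep pooled 2048-d features
    backbone.eval()
    with torch.no_grad():
        return backbone(frames)                              # (num_frames, 2048)
```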
Furthermore, thanks to the compact 2D CNN vision encoder, our proposed framework can co-train the language and vision encoders for better multimodal feature fusion. In summary, the contributions of this work are as follows:
• We propose an effective and efficient framework to
train 2D TVG models, in which we leverage TVP
(text-visual prompting) to improve the utility of sparse
2D visual features without resorting to costly 3D fea-
tures. To the best of our knowledge, this is the first work to expand the application of prompt learning to TVG problems. Our method outperforms all existing 2D methods and achieves performance competitive with 3D TVG methods.
• Technology-wise, we integrate visual prompts with text prompts to jointly improve the effectiveness of 2D visual features. On top of that, we propose a TDIoU (temporal-distance IoU)-based prompt-model co-training method to obtain high-accuracy 2D TVG models.
• Experiment-wise, we show the empirical success of our proposal in boosting the performance of 2D TVG on the Charades-STA and ActivityNet Captions datasets, e.g., a 9.79% improvement on Charades-STA and 30.77% on ActivityNet Captions, together with 5× inference-time acceleration over 3D TVG methods.
2. Related Work
Temporal Video Grounding (TVG). The objective of TVG is to predict the starting/ending time points of the target moment, described by a text sentence, within an untrimmed video. Early TVG solutions [7, 14, 20, 39, 62, 64, 70] mainly employ a two-stage "propose-and-rank" pipeline: (1)