Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen¹   Li-Jia Li²   Li Fei-Fei²   Abhinav Gupta¹
¹Carnegie Mellon University   ²Google
Abstract
We present a novel framework for iterative visual reason-
ing. Our framework goes beyond current recognition sys-
tems that lack the capability to reason beyond a stack of con-
volutions. The framework consists of two core modules: a
local module that uses spatial memory [4] to store previous
beliefs with parallel updates; and a global graph-reasoning
module. Our graph module has three components: a) a
knowledge graph where we represent classes as nodes and
build edges to encode different types of semantic relation-
ships between them; b) a region graph of the current image
where regions in the image are nodes and spatial relation-
ships between these regions are edges; c) an assignment
graph that assigns regions to classes. Both the local mod-
ule and the global module roll-out iteratively and cross-feed
predictions to each other to refine estimates. The final pre-
dictions are made by combining the best of both modules
with an attention mechanism. We show strong performance
over plain ConvNets, e.g. achieving an 8.4% absolute im-
provement on ADE [55] measured by per-class average pre-
cision. Analysis also shows that the framework is resilient
to missing regions for reasoning.
1. Introduction
In recent years, we have made significant advances in
standard recognition tasks such as image classification [16],
detection [37] or segmentation [3]. Most of these gains are
a result of using feed-forward end-to-end learned ConvNet
models. Unlike humans, for whom visual reasoning about space and semantics is crucial [1], our current visual systems lack any context reasoning beyond convolutions with
large receptive fields. Therefore, a critical question is how
do we incorporate both spatial and semantic reasoning as
we build next-generation vision systems.
Our goal is to build a system that can not only extract
and utilize a hierarchy of convolutional features, but also im-
prove its estimates via spatial and semantic relationships.
But what are spatial and semantic relationships and how can
they be used to improve recognition? Take a look at Fig. 1.
An example of spatial reasoning (top-left) would be: if three
regions out of four in a line are “window”, then the fourth is
also likely to be “window”.
Figure 1. Current recognition systems lack the reasoning power
beyond convolutions with large receptive fields, whereas humans
can explore the rich space of spatial and semantic relationships for
reasoning: e.g. inferring the fourth “window” even with occlusion,
or the “person” who drives the “car”. To close this gap, we present
a generic framework that also uses relationships to iteratively rea-
son and build up estimates.
An example of semantic reasoning (bottom-right) would be to recognize “school bus” even if we have seen few or no examples of it – just given examples of “bus” and knowing their connections. Finally, an ex-
ample of spatial-semantic reasoning could be: recognition
of a “car” on road should help in recognizing the “person”
inside “driving” the “car”.
A key recipe to reasoning with relationships is to it-
eratively build up estimates. Recently, there have been
efforts to incorporate such reasoning via top-down mod-
ules [38, 48] or using explicit memories [51, 32]. In the
case of top-down modules, high-level features which have
class-based information can be used in conjunction with
low-level features to improve recognition performance. An
alternative architecture is to use explicit memory. For exam-
ple, Chen & Gupta [4] perform sequential object detection,
where a spatial memory is used to store previously detected
objects, leveraging the power of ConvNets for extracting
dense context patterns beneficial for follow-up detections.
However, there are two problems with these approaches:
a) both approaches use a stack of convolutions to perform lo-
cal pixel-level reasoning [11], which can lack a global rea-
soning power that also allows regions farther away to di-
rectly communicate information; b) more importantly, both
approaches assume enough examples of relationships in
the training data so that the model can learn them from
scratch, but as the relationships grow exponentially with in-
creasing number of classes, there is not always enough data.
A lot of semantic reasoning requires learning from few or
no examples [14]. Therefore, we need ways to exploit addi-
tional structured information for visual reasoning.
In this paper, we put forward a generic framework for
both spatial and semantic reasoning. Different from current
approaches that are just relying on convolutions, our frame-
work can also learn from structured information in the form
of knowledge bases [5, 56] for visual recognition. The core
of our algorithm consists of two modules: the local mod-
ule, based on spatial memory [4], performs pixel-level rea-
soning using ConvNets. We make major improvements on
efficiency by parallel memory updates. Additionally, we in-
troduce a global module for reasoning beyond local regions.
In the global module, reasoning is based on a graph struc-
ture. It has three components: a) a knowledge graph where
we represent classes as nodes and build edges to encode dif-
ferent types of semantic relationships; b) a region graph of
the current image where regions in the image are nodes and
spatial relationships between these regions are edges; c) an
assignment graph that assigns regions to classes. Taking
advantage of such a structure, we develop a reasoning mod-
ule specifically designed to pass information on this graph.
Both the local module and the global module roll-out itera-
tively and cross-feed predictions to each other in order to re-
fine estimates. Note that local and global reasoning are not
isolated: a good image understanding is usually a compro-
mise between background knowledge learned a priori and
image-specific observations. Therefore, our full pipeline joins forces of the two modules via an attention [3] mechanism, allowing the model to rely on the most relevant features when making the final predictions.
We show strong performance over plain ConvNets using
our framework. For example, we achieve an 8.4% absolute improvement on ADE [55] measured by per-class average precision, whereas simply making the network deeper only helps by 1%.
2. Related Work
Visual Knowledge Base. Whereas the past five years in computer vision will probably be remembered as the successful resurgence of neural networks, acquiring visual knowledge at a large scale – the simplest form being labeled instances of objects [39, 30], scenes [55], relationships [25], etc. – deserves at least half the credit, since ConvNets hinge on large datasets [44]. Apart from providing labels us-
ing crowd-sourcing, attempts have also been made to ac-
cumulate structured knowledge (e.g. relationships [5], n-
grams [10]) automatically from the web. However, these
works fixate on building knowledge bases rather than us-
ing knowledge for reasoning. Our framework, while being
more general, is along the line of research that applies visual knowledge bases to end tasks, such as affordances [56],
image classification [32], or question answering [49].
Context Modeling. Modeling context, or the interplay be-
tween scenes, objects and parts is one of the central prob-
lems in computer vision. While various previous work (e.g.
scene-level reasoning [46], attributes [13, 36], structured
prediction [24, 9, 47], relationship graph [21, 31, 52]) has
approached this problem from different angles, the break-
through comes from the idea of feature learning with Con-
vNets [16]. On the surface, such models hardly use any
explicit context module for reasoning, but it is generally ac-
cepted that ConvNets are extremely effective at aggregating local pixel-level context through their ever-growing receptive fields [54]. Even the most recent developments such as
top-down module [50, 29, 43], pairwise module [40], itera-
tive feedback [48, 34, 2], attention [53], and memory [51, 4]
are motivated to leverage such power and depend on vari-
ants of convolutions for reasoning. Our work takes an im-
portant next step beyond those approaches in that it also in-
corporates learning from structured visual knowledge bases
directly to reason with spatial and semantic relationships.
Relational Reasoning. The earliest form of reasoning in ar-
tificial intelligence dates back to symbolic approaches [33],
where relations between abstract symbols are defined by
the language of mathematics and logic, and reasoning takes
place by deduction, abduction [18], etc. However, symbols
need to be grounded [15] before such systems are practi-
cally useful. Modern approaches, such as path ranking algo-
rithm [26], rely on statistical learning to extract useful pat-
terns to perform relational reasoning on structured knowl-
edge bases. As an active research area, there are recent
works also applying neural networks to the graph structured
data [42, 17, 27, 23, 35, 7, 32], or attempting to regularize
the output of networks with relationships [8] and knowl-
edge bases [20]. However, we believe that for visual data, rea-
soning should be both local and global: discarding the two-
dimensional image structure is neither efficient nor effective
for tasks that involve regions.
3. Reasoning Framework
In this section we build up our reasoning framework. Besides plain predictions p_0 from a ConvNet, it consists of two core modules that reason to predict. The first one, the local module, uses a spatial memory to store previous beliefs with parallel updates, and still falls within the regime of convolution-based reasoning (Sec. 3.1). Beyond convolutions, we present our key contribution – a global module that reasons directly between regions and classes represented as nodes in a graph (Sec. 3.2).
Figure 2. Overview of our reasoning framework. Besides a plain ConvNet that gives predictions, the framework has two modules to perform reasoning: a local one (Sec. 3.1) that uses spatial memory S_i, and reasons with another ConvNet C; and a global one (Sec. 3.2) that treats regions and classes as nodes in a graph and reasons by passing information among them. Both modules receive combined high-level and mid-level features, and roll-out iteratively (Sec. 3.3) while cross-feeding beliefs. The final prediction f is produced by combining all the predictions f_i with attentions a_i (Sec. 3.4).
Both modules build up estimates iteratively (Sec. 3.3), with beliefs cross-fed to each other. Finally, taking advantage of both local and global reasoning, we combine predictions from all iterations with an attention mechanism (Sec. 3.4) and train the model with sample re-weighting (Sec. 3.5) that focuses on hard examples (see Fig. 2).
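As a concrete illustration of the final combination step, the sketch below re-weights per-iteration predictions f_i with soft-maxed attentions a_i. This is a minimal PyTorch sketch under assumed tensor shapes; Sec. 3.4 is not included in this excerpt, so how the attentions themselves are produced is left out, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def combine_with_attention(logits, attention_scores):
    """Minimal sketch (assumed shapes, not the authors' code) of combining
    per-iteration predictions f_i with attentions a_i.

    logits:           (num_iterations, num_regions, num_classes) -- the f_i
    attention_scores: (num_iterations, num_regions)              -- the a_i
    Returns the final per-region class scores f.
    """
    # normalize attentions across iterations so each region's weights sum to 1
    weights = F.softmax(attention_scores, dim=0)            # (I, R)
    # attention-weighted sum of predictions over iterations
    return (weights.unsqueeze(-1) * logits).sum(dim=0)      # (R, num_classes)


# usage: e.g. three sets of predictions, 5 regions, 20 classes
f_i = torch.randn(3, 5, 20)
a_i = torch.randn(3, 5)
final = combine_with_attention(f_i, a_i)                    # (5, 20)
```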
3.1. Reasoning with Convolutions
Our first building block, the local module, is inspired
from [4]. At a high level, the idea is to use a spatial mem-
ory S to store previously detected objects at the very loca-
tion they have been found. S is a tensor with three dimen-
sions. The first two, height H and width W , correspond to
the reduced size (1/16) of the image. The third one, depth
D (=512), makes each cell of the memory c a vector that
stores potentially useful information at that location.
S is updated with both high-level and mid-level features.
For high-level, information regarding the estimated class la-
bel is stored. However, just knowing the class may not be ideal – more details about the shape, pose, etc. can also be useful for other objects. For example, it would be nice to
know the pose of a “person” playing tennis to recognize the
“racket”. In this paper, we use the logits f before soft-max activation, in conjunction with feature maps from a bottom convolutional layer h, to feed the memory.
Given an image region r to update, we first crop the corresponding features from the bottom layer, and resize them to a predefined square (7×7) with bi-linear interpolation as h. Since the high-level feature f is a vector covering the entire region, we append it to all the 49 locations. Two 1×1 convolutions are used to fuse the information [4] and form our input features f_r for r. The same region in the memory S is also cropped and resized to 7×7, denoted as s_r. After
this alignment, we use a convolutional gated recurrent unit
(GRU) [6] to write the memory:
$$s'_r = u \circ s_r + (1 - u) \circ \sigma\big(W_f f_r + W_s (z \circ s_r) + b\big), \qquad (1)$$
where s'_r is the updated memory for r, u is the update gate, z is the reset gate, W_f, W_s and b are convolutional weights and bias, and ∘ is the entry-wise product. σ(·) is an activation function. After the update, s'_r is placed back to S with another crop and resize operation¹.
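To make the write operation concrete, below is a minimal PyTorch sketch of forming the input features f_r (appending the logits f to the 7×7 mid-level crop h and fusing with two 1×1 convolutions) and applying the update of Eq. (1). The gate computations and the choice of σ(·) as tanh follow the standard GRU formulation and are assumptions, since they are not spelled out in this excerpt; module and variable names are ours.

```python
import torch
import torch.nn as nn

class MemoryWrite(nn.Module):
    """Sketch of the per-region memory write: fuse the mid-level crop h with
    the high-level logits f into f_r, then update the memory crop s_r via Eq. (1)."""

    def __init__(self, num_classes, mid_dim=512, depth=512):
        # num_classes is a placeholder; depth D = 512 follows the text,
        # mid_dim is an assumption for the bottom-layer feature channels
        super().__init__()
        # two 1x1 convolutions fuse the appended features into f_r
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_dim + num_classes, depth, 1), nn.ReLU(inplace=True),
            nn.Conv2d(depth, depth, 1), nn.ReLU(inplace=True),
        )
        # gates computed from [f_r, s_r] (standard GRU formulation, assumed)
        self.update_gate = nn.Conv2d(2 * depth, depth, 1)
        self.reset_gate = nn.Conv2d(2 * depth, depth, 1)
        # W_f, W_s and b of Eq. (1), realized as 1x1 convolutions
        self.w_f = nn.Conv2d(depth, depth, 1, bias=False)
        self.w_s = nn.Conv2d(depth, depth, 1, bias=True)

    def forward(self, h, f, s_r):
        # h:   (B, mid_dim, 7, 7) mid-level crop;  f: (B, num_classes) logits
        # s_r: (B, depth, 7, 7)   aligned memory crop for region r
        f_map = f[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        f_r = self.fuse(torch.cat([h, f_map], dim=1))       # append f to all 49 cells, fuse
        gate_in = torch.cat([f_r, s_r], dim=1)
        u = torch.sigmoid(self.update_gate(gate_in))         # update gate
        z = torch.sigmoid(self.reset_gate(gate_in))          # reset gate
        # Eq. (1): s'_r = u ∘ s_r + (1 - u) ∘ σ(W_f f_r + W_s (z ∘ s_r) + b)
        candidate = torch.tanh(self.w_f(f_r) + self.w_s(z * s_r))
        return u * s_r + (1.0 - u) * candidate
```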
Parallel Updates. Previous work [4] made sequential updates to memory. However, sequential inference is inefficient and GPU-intensive, limiting it to only ten outputs per image [4]. In this paper we propose to update the regions in parallel as an approximation. In overlapping cases, a cell can be covered multiple times by different regions. When placing the regions back into S, we also calculate a weight matrix Γ where each entry γ_{r,c} ∈ [0, 1] keeps track of how much a region r has contributed to a memory cell c: 1 meaning the cell is fully covered by the region, 0 meaning not covered. The final value of an updated cell is the weighted average over all contributing regions.
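The sketch below illustrates one way to realize this coverage-weighted, parallel write-back. The box/coverage representation and function name are assumptions, and the crop-and-resize extrapolation mentioned in the footnote is not reproduced; cells untouched by any region simply keep their previous values.

```python
import torch

def scatter_regions_to_memory(S, updated_crops, boxes, coverages):
    """Hedged sketch (not the authors' implementation) of the parallel write-back:
    each updated crop is placed back into its box in S, and overlapping cells take
    a coverage-weighted average over all contributing regions.

    S:             (D, H, W) spatial memory
    updated_crops: list of (D, h_r, w_r) tensors already resized to their box size
    boxes:         list of (y0, x0, y1, x1) integer boxes in memory coordinates
    coverages:     list of (h_r, w_r) weights gamma_{r,c} in [0, 1]
    """
    D, H, W = S.shape
    accum = torch.zeros_like(S)      # sum of gamma-weighted values per cell
    weight = torch.zeros(H, W)       # sum of gammas per cell
    for crop, (y0, x0, y1, x1), gamma in zip(updated_crops, boxes, coverages):
        accum[:, y0:y1, x0:x1] += crop * gamma
        weight[y0:y1, x0:x1] += gamma
    covered = weight > 0
    S_new = S.clone()
    S_new[:, covered] = accum[:, covered] / weight[covered]   # weighted average
    return S_new
```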
The actual reasoning module, a ConvNet C of three 3×3
convolutions and two 4096-D fully-connected layers, takes
S as the input, and builds connections within the local win-
dow of its receptive fields to perform prediction. Since the two-dimensional image structure and the location information are preserved in S, such an architecture is particularly useful for relationships that require spatial reasoning.
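A hedged sketch of the reasoning ConvNet C described above: three 3×3 convolutions followed by two 4096-D fully-connected layers and a classifier. That C operates on per-region 7×7 crops of S (so the fully-connected layers have a fixed input size), the intermediate channel widths, and the classification head are assumptions not confirmed by this excerpt.

```python
import torch.nn as nn

class LocalReasoningConvNet(nn.Module):
    """Sketch of the local reasoning module C: three 3x3 convolutions and two
    4096-D fully-connected layers, predicting class scores from a 7x7 crop of
    the spatial memory S (per-region cropping is an assumption)."""

    def __init__(self, num_classes, depth=512, crop=7):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(depth, depth, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(depth, depth, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(depth, depth, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(depth * crop * crop, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(4096, num_classes)

    def forward(self, s_crop):
        # s_crop: (B, depth, 7, 7) crop of the memory for one region
        return self.classifier(self.fcs(self.convs(s_crop)))
```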
3.2. Beyond Convolutions
Our second module goes beyond local regions and con-
volutions for global reasoning. Here the meaning of global
is two-fold. First is spatial, that is, we want to let regions farther away directly communicate information with each other, not confined by the receptive fields of the reasoning
module C. Second is semantic, meaning we want to take
advantage of visual knowledge bases, which can provide re-
lationships between classes that are globally true (i.e. com-
monsense) across images. To achieve both types of reason-
ing, we build a graph G = (N, E), where N and E denote node sets and edge sets, respectively. Two types of nodes are defined in N: region nodes N_r for R regions, and class nodes N_c for C classes.
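For illustration, the sketch below lays out the node set N and placeholders for the three edge groups listed in the abstract and overview (region-region spatial edges, region-to-class assignments, and class-class knowledge-graph edges). Dense adjacency matrices, the indexing scheme, and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch

def build_graph_skeleton(num_regions, num_classes):
    """Sketch of the global graph G = (N, E): the first R node indices are
    region nodes N_r, the next C are class nodes N_c, with one adjacency
    matrix per edge group (dense matrices here purely for illustration)."""
    R, C = num_regions, num_classes
    n = R + C
    region_nodes = torch.arange(0, R)        # indices of N_r
    class_nodes = torch.arange(R, R + C)     # indices of N_c
    edges = {
        "region_region": torch.zeros(n, n),  # spatial edges E_rr between regions
        "region_class": torch.zeros(n, n),   # assignment edges from regions to classes
        "class_class": torch.zeros(n, n),    # semantic (knowledge-graph) edges between classes
    }
    return region_nodes, class_nodes, edges
```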
As for E, three groups of edges are defined between nodes. First, for N_r, a spatial graph is used to encode spatial relationships between regions (E_rr). Multiple types
¹Different from previous work [4] that introduces an inverse operation to put the region back, we note that crop and resize itself with proper extrapolation can simply meet this requirement.