An alternative architecture is to use explicit memory. For example, Chen & Gupta [4] perform sequential object detection, where a spatial memory stores previously detected objects, leveraging the power of ConvNets to extract dense context patterns that benefit follow-up detections.
However, there are two problems with these approaches: a) both use stacks of convolutions to perform local pixel-level reasoning [11], which can lack the global reasoning power that lets regions far apart communicate information directly; b) more importantly, both assume there are enough examples of each relationship in the training data for the model to learn it from scratch, but since relationships grow exponentially with the number of classes, there is not always enough data. Much semantic reasoning requires learning from few or no examples [14]. Therefore, we need ways to exploit additional structured information for visual reasoning.
In this paper, we put forward a generic framework for both spatial and semantic reasoning. Different from current approaches that rely solely on convolutions, our framework can also learn from structured information in the form of knowledge bases [5, 56] for visual recognition. The core of our algorithm consists of two modules. The local module, based on spatial memory [4], performs pixel-level reasoning using ConvNets; we make major improvements in efficiency through parallel memory updates, as sketched below.
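To make the parallel update concrete, here is a minimal NumPy sketch (not the actual implementation: the memory shape, the averaging of overlapping writes, and the 0.5 blending factor are illustrative assumptions). All regions read the same previous memory and their writes are merged in a single step, rather than being applied one detection at a time:

```python
import numpy as np

def update_memory(memory, boxes, feats):
    """One parallel spatial-memory update (simplified sketch).

    memory: (H, W, D) array holding previous beliefs.
    boxes:  list of (x0, y0, x1, y1) cell coordinates, one per detected region.
    feats:  (num_regions, D) array of region features to write.
    """
    write_sum = np.zeros_like(memory)              # accumulated writes
    write_cnt = np.zeros(memory.shape[:2] + (1,))  # regions covering each cell

    for (x0, y0, x1, y1), f in zip(boxes, feats):
        write_sum[y0:y1, x0:x1] += f   # broadcast region feature over its box
        write_cnt[y0:y1, x0:x1] += 1.0

    covered = write_cnt[..., 0] > 0
    new_memory = memory.copy()
    # Convex update: blend the old belief with the mean of all parallel writes.
    new_memory[covered] = 0.5 * memory[covered] + \
        0.5 * (write_sum[covered] / write_cnt[covered])
    return new_memory

# Toy usage: two overlapping regions write 4-d features into an 8x8 memory.
mem = update_memory(np.zeros((8, 8, 4)),
                    [(0, 0, 4, 4), (2, 2, 6, 6)],
                    np.random.randn(2, 4))
```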
Additionally, we introduce a global module for reasoning beyond local regions. In the global module, reasoning is based on a graph structure with three components: a) a knowledge graph, where classes are represented as nodes and edges encode different types of semantic relationships; b) a region graph of the current image, where image regions are nodes and spatial relationships between regions are edges; c) an assignment graph that assigns regions to classes. Taking advantage of this structure, we develop a reasoning module specifically designed to pass information on the graph.
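As an illustration of reasoning on such a graph, below is a minimal sketch of one message-passing step over the three components (the row-normalized adjacencies, the shared weight matrix W, and the residual ReLU update are expository assumptions, not the exact design of Sec. 3.2):

```python
import numpy as np

def normalize_adj(A):
    """Row-normalize an adjacency matrix so each node averages its neighbors."""
    return A / np.maximum(A.sum(axis=1, keepdims=True), 1e-8)

def global_step(x_r, x_c, A_rr, A_cc, A_rc, W):
    """One propagation step on the combined graph (illustrative sketch).

    x_r: (R, D) region-node features; x_c: (C, D) class-node features.
    A_rr: (R, R) spatial edges between regions (region graph).
    A_cc: (C, C) semantic edges between classes (knowledge graph).
    A_rc: (R, C) soft assignment of regions to classes (assignment graph).
    W:    (D, D) shared linear transform applied to incoming messages.
    """
    # Regions gather messages from spatially related regions and from the
    # classes they are assigned to; classes gather from semantically related
    # classes and from their assigned regions.
    m_r = normalize_adj(A_rr) @ x_r + normalize_adj(A_rc) @ x_c
    m_c = normalize_adj(A_cc) @ x_c + normalize_adj(A_rc.T) @ x_r
    x_r_new = np.maximum(x_r + m_r @ W, 0.0)  # residual update + ReLU
    x_c_new = np.maximum(x_c + m_c @ W, 0.0)
    return x_r_new, x_c_new

# Toy usage: 3 regions, 4 classes, 8-d features, uniform initial assignment.
R, C, D = 3, 4, 8
x_r, x_c = global_step(np.random.randn(R, D), np.random.randn(C, D),
                       np.ones((R, R)), np.eye(C),
                       np.full((R, C), 1.0 / C), 0.1 * np.random.randn(D, D))
```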
Both the local and global modules roll out iteratively and cross-feed predictions to each other in order to refine estimates. Note that local and global reasoning are not isolated: good image understanding is usually a compromise between background knowledge learned a priori and image-specific observations. Therefore, our full pipeline joins the forces of the two modules through an attention [3] mechanism, allowing the model to rely on the most relevant features when making the final predictions.
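A minimal sketch of such attention-based fusion could look as follows (the per-region two-way attention scores and the softmax weighting are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_predictions(logits_local, logits_global, attn_scores):
    """Attention-weighted fusion of per-region class logits (sketch).

    logits_local, logits_global: (R, C) predictions from the two modules.
    attn_scores: (R, 2) learned scores saying, per region, how much to
    trust the local module vs. the global module.
    """
    w = softmax(attn_scores, axis=1)               # (R, 2) attention weights
    fused = w[:, :1] * logits_local + w[:, 1:] * logits_global
    return softmax(fused, axis=1)                  # final class probabilities

# Toy usage: 3 regions, 5 classes; each output row sums to 1.
p = fuse_predictions(np.random.randn(3, 5), np.random.randn(3, 5),
                     np.random.randn(3, 2))
assert np.allclose(p.sum(axis=1), 1.0)
```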
Our framework shows strong performance over plain ConvNets. For example, we achieve an 8.4% absolute improvement on ADE [55] measured by per-class average precision, whereas simply making the network deeper helps by only ∼1%.
2. Related Work
Visual Knowledge Base. Whereas the past five years in computer vision will probably be remembered as the successful resurgence of neural networks, acquiring visual knowledge at a large scale – the simplest form being labeled instances of objects [39, 30], scenes [55], relationships [25], etc. – deserves at least half the credit, since ConvNets hinge on large datasets [44]. Apart from providing labels through crowd-sourcing, attempts have also been made to accumulate structured knowledge (e.g. relationships [5], n-grams [10]) automatically from the web. However, these
works fixate on building knowledge bases rather than using knowledge for reasoning. Our framework, while more general, follows the line of research that applies visual knowledge bases to end tasks such as affordances [56], image classification [32], and question answering [49].
Context Modeling. Modeling context, or the interplay between scenes, objects, and parts, is one of the central problems in computer vision. While various previous works (e.g. scene-level reasoning [46], attributes [13, 36], structured prediction [24, 9, 47], relationship graphs [21, 31, 52]) have approached this problem from different angles, the breakthrough comes from the idea of feature learning with ConvNets [16]. On the surface, such models hardly use any explicit context module for reasoning, but it is generally accepted that ConvNets are extremely effective at aggregating local, pixel-level context through their ever-growing receptive fields [54]. Even the most recent developments, such as
top-down modules [50, 29, 43], pairwise modules [40], iterative feedback [48, 34, 2], attention [53], and memory [51, 4], are motivated to leverage this power and depend on variants of convolutions for reasoning. Our work takes an important next step beyond these approaches in that it also incorporates learning from structured visual knowledge bases directly, in order to reason with spatial and semantic relationships.
Relational Reasoning. The earliest form of reasoning in ar-
tificial intelligence dates back to symbolic approaches [33],
where relations between abstract symbols are defined by
the language of mathematics and logic, and reasoning takes
place by deduction, abduction [18], etc. However, symbols
need to be grounded [15] before such systems are practi-
cally useful. Modern approaches, such as the path ranking algorithm [26], rely on statistical learning to extract useful patterns for relational reasoning on structured knowledge bases. In this active research area, recent works have also applied neural networks to graph-structured data [42, 17, 27, 23, 35, 7, 32], or regularized the output of networks with relationships [8] and knowledge bases [20]. However, we believe that for visual data, reasoning should be both local and global: discarding the two-dimensional image structure is neither efficient nor effective for tasks that involve regions.
3. Reasoning Framework
In this section we build up our reasoning framework. Besides plain predictions p₀ from a ConvNet, it consists of
two core modules that reason to predict. The first, the local module, uses a spatial memory to store previous beliefs with parallel updates, and still falls within the regime of convolution-based reasoning (Sec. 3.1). Beyond convolutions, we
present our key contribution – a global module that reasons
directly between regions and classes represented as nodes
in a graph (Sec. 3.2). Both modules build up estimations iteratively, cross-feeding predictions to each other to refine estimates.
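Schematically, the roll-out can be pictured with the toy skeleton below (stand-in callables replace the learned modules of Sec. 3.1 and 3.2; the update order and the number of iterations are assumptions for exposition):

```python
import numpy as np

def reason(p0, local_step, global_step, fuse, num_iters=3):
    """Skeleton of the iterative roll-out with cross-fed predictions.

    p0: (R, C) plain ConvNet predictions initializing both modules.
    local_step, global_step: callables refining per-region predictions;
    fuse: attention-based combination producing the final estimate.
    """
    p_local, p_global = p0.copy(), p0.copy()
    for _ in range(num_iters):
        # Cross-feed: each module refines its estimate conditioned on the
        # other module's predictions from the previous iteration.
        p_local, p_global = local_step(p_global), global_step(p_local)
    return fuse(p_local, p_global)

# Toy usage with identity stand-ins for the learned modules.
out = reason(np.random.randn(4, 10), lambda p: p, lambda p: p,
             lambda a, b: 0.5 * (a + b))
```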