Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen¹   Li-Jia Li²   Li Fei-Fei²   Abhinav Gupta¹
¹Carnegie Mellon University   ²Google
Abstract
We present a novel framework for iterative visual reason-
ing. Our framework goes beyond current recognition sys-
tems that lack the capability to reason beyond a stack of con-
volutions. The framework consists of two core modules: a
local module that uses spatial memory [4] to store previous
beliefs with parallel updates; and a global graph-reasoning
module. Our graph module has three components: a) a
knowledge graph where we represent classes as nodes and
build edges to encode different types of semantic relation-
ships between them; b) a region graph of the current image
where regions in the image are nodes and spatial relation-
ships between these regions are edges; c) an assignment
graph that assigns regions to classes. Both the local mod-
ule and the global module roll-out iteratively and cross-feed
predictions to each other to refine estimates. The final pre-
dictions are made by combining the best of both modules
with an attention mechanism. We show strong performance
over plain ConvNets, e.g. achieving an 8.4% absolute im-
provement on ADE [55] measured by per-class average pre-
cision. Analysis also shows that the framework is resilient
to missing regions for reasoning.
1. Introduction
In recent years, we have made significant advances in
standard recognition tasks such as image classification [16],
detection [37] or segmentation [3]. Most of these gains are
a result of using feed-forward end-to-end learned ConvNet
models. Unlike humans, for whom visual reasoning about space and semantics is crucial [1], our current visual systems lack any context reasoning beyond convolutions with
large receptive fields. Therefore, a critical question is how
do we incorporate both spatial and semantic reasoning as
we build next-generation vision systems.
Our goal is to build a system that can not only extract
and utilize a hierarchy of convolutional features, but also im-
prove its estimates via spatial and semantic relationships.
But what are spatial and semantic relationships and how can
they be used to improve recognition? Take a look at Fig. 1.
An example of spatial reasoning (top-left) would be: if three
regions out of four in a line are “window”, then the fourth is
also likely to be “window”.
Figure 1. Current recognition systems lack the reasoning power
beyond convolutions with large receptive fields, whereas humans
can explore the rich space of spatial and semantic relationships for
reasoning: e.g. inferring the fourth “window” even with occlusion,
or the “person” who drives the “car”. To close this gap, we present
a generic framework that also uses relationships to iteratively rea-
son and build up estimates.
An example of semantic reasoning (bottom-right) would be to recognize “school bus” even if we have seen few or no examples of it – just given examples of “bus” and knowing their connections. Finally, an ex-
ample of spatial-semantic reasoning could be: recognition
of a “car” on road should help in recognizing the “person”
inside “driving” the “car”.
A key recipe to reasoning with relationships is to it-
eratively build up estimates. Recently, there have been
efforts to incorporate such reasoning via top-down mod-
ules [38, 48] or using explicit memories [51, 32]. In the
case of top-down modules, high-level features which have
class-based information can be used in conjunction with
low-level features to improve recognition performance. An
alternative architecture is to use explicit memory. For exam-
ple, Chen & Gupta [4] perform sequential object detection,
where a spatial memory is used to store previously detected
objects, leveraging the power of ConvNets for extracting
dense context patterns beneficial for follow-up detections.
However, there are two problems with these approaches:
a) both approaches use a stack of convolutions to perform lo-
cal pixel-level reasoning [11], which can lack a global rea-
soning power that also allows regions farther away to di-
rectly communicate information; b) more importantly, both
approaches assume enough examples of relationships in
the training data so that the model can learn them from
scratch, but as the relationships grow exponentially with in-
creasing number of classes, there is not always enough data.
A lot of semantic reasoning requires learning from few or
no examples [14]. Therefore, we need ways to exploit addi-
tional structured information for visual reasoning.
In this paper, we put forward a generic framework for
both spatial and semantic reasoning. Different from current
approaches that are just relying on convolutions, our frame-
work can also learn from structured information in the form
of knowledge bases [5, 56] for visual recognition. The core
of our algorithm consists of two modules: the local mod-
ule, based on spatial memory [4], performs pixel-level rea-
soning using ConvNets. We make major improvements on
efficiency by parallel memory updates. Additionally, we in-
troduce a global module for reasoning beyond local regions.
In the global module, reasoning is based on a graph struc-
ture. It has three components: a) a knowledge graph where
we represent classes as nodes and build edges to encode dif-
ferent types of semantic relationships; b) a region graph of
the current image where regions in the image are nodes and
spatial relationships between these regions are edges; c) an
assignment graph that assigns regions to classes. Taking
advantage of such a structure, we develop a reasoning mod-
ule specifically designed to pass information on this graph.
Both the local module and the global module roll-out itera-
tively and cross-feed predictions to each other in order to re-
fine estimates. Note that local and global reasoning are not
isolated: a good image understanding is usually a compro-
mise between background knowledge learned a priori and
image-specific observations. Therefore, our full pipeline joins forces of the two modules via an attention [3] mechanism, allowing the model to rely on the most relevant features when making the final predictions.
We show strong performance over plain ConvNets using
our framework. For example, we achieve an 8.4% absolute improvement on ADE [55] measured by per-class average precision, whereas simply making the network deeper only helps by 1%.
2. Related Work
Visual Knowledge Base. Whereas the past five years in computer vision will probably be remembered as the successful resurgence of neural networks, acquiring visual knowledge at a large scale – the simplest form being labeled instances of objects [39, 30], scenes [55], relationships [25], etc. – deserves at least half the credit, since ConvNets hinge on large datasets [44]. Apart from providing labels us-
ing crowd-sourcing, attempts have also been made to ac-
cumulate structured knowledge (e.g. relationships [5], n-
grams [10]) automatically from the web. However, these
works fixate on building knowledge bases rather than us-
ing knowledge for reasoning. Our framework, while being
more general, is along the line of research that applies visual knowledge bases to end tasks, such as affordances [56],
image classification [32], or question answering [49].
Context Modeling. Modeling context, or the interplay be-
tween scenes, objects and parts is one of the central prob-
lems in computer vision. While various previous work (e.g.
scene-level reasoning [46], attributes [13, 36], structured
prediction [24, 9, 47], relationship graph [21, 31, 52]) has
approached this problem from different angles, the break-
through comes from the idea of feature learning with Con-
vNets [16]. On the surface, such models hardly use any
explicit context module for reasoning, but it is generally ac-
cepted that ConvNets are extremely effective at aggregating local pixel-level context through their ever-growing receptive fields [54]. Even the most recent developments such as
top-down module [50, 29, 43], pairwise module [40], itera-
tive feedback [48, 34, 2], attention [53], and memory [51, 4]
are motivated to leverage such power and depend on vari-
ants of convolutions for reasoning. Our work takes an im-
portant next step beyond those approaches in that it also in-
corporates learning from structured visual knowledge bases
directly to reason with spatial and semantic relationships.
Relational Reasoning. The earliest form of reasoning in ar-
tificial intelligence dates back to symbolic approaches [33],
where relations between abstract symbols are defined by
the language of mathematics and logic, and reasoning takes
place by deduction, abduction [18], etc. However, symbols
need to be grounded [15] before such systems are practi-
cally useful. Modern approaches, such as path ranking algo-
rithm [26], rely on statistical learning to extract useful pat-
terns to perform relational reasoning on structured knowl-
edge bases. As an active research area, there are recent
works also applying neural networks to the graph structured
data [42, 17, 27, 23, 35, 7, 32], or attempting to regularize
the output of networks with relationships [8] and knowl-
edge bases [20]. However, we believe that for visual data, rea-
soning should be both local and global: discarding the two-
dimensional image structure is neither efficient nor effective
for tasks that involve regions.
3. Reasoning Framework
In this section we build up our reasoning framework. Besides plain predictions p_0 from a ConvNet, it consists of two core modules that reason to predict. The first one, the local module, uses a spatial memory to store previous beliefs with parallel updates, and still falls within the regime of convolution-based reasoning (Sec. 3.1). Beyond convolutions, we present our key contribution – a global module that reasons directly between regions and classes represented as nodes in a graph (Sec. 3.2).
Figure 2. Overview of our reasoning framework. Besides a plain ConvNet that gives predictions, the framework has two modules to perform reasoning: a local one (Sec. 3.1) that uses spatial memory S_i, and reasons with another ConvNet C; and a global one (Sec. 3.2) that treats regions and classes as nodes in a graph and reasons by passing information among them. Both modules receive combined high-level and mid-level features, and roll-out iteratively (Sec. 3.3) while cross-feeding beliefs. The final prediction f is produced by combining all the predictions f_i with attentions a_i (Sec. 3.4).
Both modules build up estimates iteratively (Sec. 3.3), with beliefs cross-fed to each other. Finally, taking advantage of both local and global reasoning, we combine predictions from all iterations with an attention mechanism (Sec. 3.4) and train the model with sample re-weighting (Sec. 3.5) that focuses on hard examples (see Fig. 2).
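As a concrete illustration of the final combination step, the sketch below re-weights per-iteration predictions f_i with soft-maxed attentions a_i. This is a minimal PyTorch sketch under assumed tensor shapes; Sec. 3.4 is not included in this excerpt, so how the attentions themselves are produced is left out, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def combine_with_attention(logits, attention_scores):
    """Minimal sketch (assumed shapes, not the authors' code) of combining
    per-iteration predictions f_i with attentions a_i.

    logits:           (num_iterations, num_regions, num_classes) -- the f_i
    attention_scores: (num_iterations, num_regions)              -- the a_i
    Returns the final per-region class scores f.
    """
    # normalize attentions across iterations so each region's weights sum to 1
    weights = F.softmax(attention_scores, dim=0)            # (I, R)
    # attention-weighted sum of predictions over iterations
    return (weights.unsqueeze(-1) * logits).sum(dim=0)      # (R, num_classes)


# usage: e.g. three sets of predictions, 5 regions, 20 classes
f_i = torch.randn(3, 5, 20)
a_i = torch.randn(3, 5)
final = combine_with_attention(f_i, a_i)                    # (5, 20)
```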
3.1. Reasoning with Convolutions
Our first building block, the local module, is inspired
from [4]. At a high level, the idea is to use a spatial mem-
ory S to store previously detected objects at the very loca-
tion they have been found. S is a tensor with three dimen-
sions. The first two, height H and width W , correspond to
the reduced size (1/16) of the image. The third one, depth
D (=512), makes each cell of the memory c a vector that
stores potentially useful information at that location.
S is updated with both high-level and mid-level features.
For high-level, information regarding the estimated class la-
bel is stored. However, just knowing the class may not be ideal – more details about the shape, pose, etc. can also be useful for other objects. For example, it would be nice to
know the pose of a “person” playing tennis to recognize the
“racket”. In this paper, we use the logits f before soft-max activation, in conjunction with feature maps from a bottom convolutional layer h, to feed the memory.
Given an image region r to update, we first crop the corresponding features from the bottom layer, and resize them to a predefined square (7×7) with bi-linear interpolation as h. Since the high-level feature f is a vector covering the entire region, we append it to all the 49 locations. Two 1×1 convolutions are used to fuse the information [4] and form our input features f_r for r. The same region in the memory S is also cropped and resized to 7×7, denoted as s_r. After
this alignment, we use a convolutional gated recurrent unit
(GRU) [6] to write the memory:
$$s'_r = u \circ s_r + (1 - u) \circ \sigma\big(W_f f_r + W_s (z \circ s_r) + b\big), \qquad (1)$$
where s'_r is the updated memory for r, u is the update gate, z is the reset gate, W_f, W_s and b are convolutional weights and bias, and ∘ is the entry-wise product. σ(·) is an activation function. After the update, s'_r is placed back to S with another crop and resize operation¹.
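To make the write operation concrete, below is a minimal PyTorch sketch of forming the input features f_r (appending the logits f to the 7×7 mid-level crop h and fusing with two 1×1 convolutions) and applying the update of Eq. (1). The gate computations and the choice of σ(·) as tanh follow the standard GRU formulation and are assumptions, since they are not spelled out in this excerpt; module and variable names are ours.

```python
import torch
import torch.nn as nn

class MemoryWrite(nn.Module):
    """Sketch of the per-region memory write: fuse the mid-level crop h with
    the high-level logits f into f_r, then update the memory crop s_r via Eq. (1)."""

    def __init__(self, num_classes, mid_dim=512, depth=512):
        # num_classes is a placeholder; depth D = 512 follows the text,
        # mid_dim is an assumption for the bottom-layer feature channels
        super().__init__()
        # two 1x1 convolutions fuse the appended features into f_r
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_dim + num_classes, depth, 1), nn.ReLU(inplace=True),
            nn.Conv2d(depth, depth, 1), nn.ReLU(inplace=True),
        )
        # gates computed from [f_r, s_r] (standard GRU formulation, assumed)
        self.update_gate = nn.Conv2d(2 * depth, depth, 1)
        self.reset_gate = nn.Conv2d(2 * depth, depth, 1)
        # W_f, W_s and b of Eq. (1), realized as 1x1 convolutions
        self.w_f = nn.Conv2d(depth, depth, 1, bias=False)
        self.w_s = nn.Conv2d(depth, depth, 1, bias=True)

    def forward(self, h, f, s_r):
        # h:   (B, mid_dim, 7, 7) mid-level crop;  f: (B, num_classes) logits
        # s_r: (B, depth, 7, 7)   aligned memory crop for region r
        f_map = f[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        f_r = self.fuse(torch.cat([h, f_map], dim=1))       # append f to all 49 cells, fuse
        gate_in = torch.cat([f_r, s_r], dim=1)
        u = torch.sigmoid(self.update_gate(gate_in))         # update gate
        z = torch.sigmoid(self.reset_gate(gate_in))          # reset gate
        # Eq. (1): s'_r = u ∘ s_r + (1 - u) ∘ σ(W_f f_r + W_s (z ∘ s_r) + b)
        candidate = torch.tanh(self.w_f(f_r) + self.w_s(z * s_r))
        return u * s_r + (1.0 - u) * candidate
```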
Parallel Updates. Previous work [4] made sequential updates to memory. However, sequential inference is inefficient and GPU-intensive, limiting it to only ten outputs per image [4]. In this paper we propose to update the regions in parallel as an approximation. In overlapping cases, a cell can be covered multiple times by different regions. When placing the regions back into S, we also calculate a weight matrix Γ where each entry γ_{r,c} ∈ [0, 1] keeps track of how much a region r has contributed to a memory cell c: 1 meaning the cell is fully covered by the region, 0 meaning not covered. The final value of an updated cell is the weighted average over all contributing regions.
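The sketch below illustrates one way to realize this coverage-weighted, parallel write-back. The box/coverage representation and function name are assumptions, and the crop-and-resize extrapolation mentioned in the footnote is not reproduced; cells untouched by any region simply keep their previous values.

```python
import torch

def scatter_regions_to_memory(S, updated_crops, boxes, coverages):
    """Hedged sketch (not the authors' implementation) of the parallel write-back:
    each updated crop is placed back into its box in S, and overlapping cells take
    a coverage-weighted average over all contributing regions.

    S:             (D, H, W) spatial memory
    updated_crops: list of (D, h_r, w_r) tensors already resized to their box size
    boxes:         list of (y0, x0, y1, x1) integer boxes in memory coordinates
    coverages:     list of (h_r, w_r) weights gamma_{r,c} in [0, 1]
    """
    D, H, W = S.shape
    accum = torch.zeros_like(S)      # sum of gamma-weighted values per cell
    weight = torch.zeros(H, W)       # sum of gammas per cell
    for crop, (y0, x0, y1, x1), gamma in zip(updated_crops, boxes, coverages):
        accum[:, y0:y1, x0:x1] += crop * gamma
        weight[y0:y1, x0:x1] += gamma
    covered = weight > 0
    S_new = S.clone()
    S_new[:, covered] = accum[:, covered] / weight[covered]   # weighted average
    return S_new
```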
The actual reasoning module, a ConvNet C of three 3×3
convolutions and two 4096-D fully-connected layers, takes
S as the input, and builds connections within the local win-
dow of its receptive fields to perform prediction. Since the two-dimensional image structure and the location information are preserved in S, such an architecture is particularly useful for relationships that require spatial reasoning.
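A hedged sketch of the reasoning ConvNet C described above: three 3×3 convolutions followed by two 4096-D fully-connected layers and a classifier. That C operates on per-region 7×7 crops of S (so the fully-connected layers have a fixed input size), the intermediate channel widths, and the classification head are assumptions not confirmed by this excerpt.

```python
import torch.nn as nn

class LocalReasoningConvNet(nn.Module):
    """Sketch of the local reasoning module C: three 3x3 convolutions and two
    4096-D fully-connected layers, predicting class scores from a 7x7 crop of
    the spatial memory S (per-region cropping is an assumption)."""

    def __init__(self, num_classes, depth=512, crop=7):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(depth, depth, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(depth, depth, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(depth, depth, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(depth * crop * crop, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(4096, num_classes)

    def forward(self, s_crop):
        # s_crop: (B, depth, 7, 7) crop of the memory for one region
        return self.classifier(self.fcs(self.convs(s_crop)))
```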
3.2. Beyond Convolutions
Our second module goes beyond local regions and con-
volutions for global reasoning. Here the meaning of global
is two-fold. First is spatial, that is, we want to let regions farther away directly communicate information with each other, not confined by the receptive fields of the reasoning
module C. Second is semantic, meaning we want to take
advantage of visual knowledge bases, which can provide re-
lationships between classes that are globally true (i.e. com-
monsense) across images. To achieve both types of reason-
ing, we build a graph G = (N, E), where N and E denote node sets and edge sets, respectively. Two types of nodes are defined in N: region nodes N_r for R regions, and class nodes N_c for C classes.
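For illustration, the sketch below lays out the node set N and placeholders for the three edge groups listed in the abstract and overview (region-region spatial edges, region-to-class assignments, and class-class knowledge-graph edges). Dense adjacency matrices, the indexing scheme, and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch

def build_graph_skeleton(num_regions, num_classes):
    """Sketch of the global graph G = (N, E): the first R node indices are
    region nodes N_r, the next C are class nodes N_c, with one adjacency
    matrix per edge group (dense matrices here purely for illustration)."""
    R, C = num_regions, num_classes
    n = R + C
    region_nodes = torch.arange(0, R)        # indices of N_r
    class_nodes = torch.arange(R, R + C)     # indices of N_c
    edges = {
        "region_region": torch.zeros(n, n),  # spatial edges E_rr between regions
        "region_class": torch.zeros(n, n),   # assignment edges from regions to classes
        "class_class": torch.zeros(n, n),    # semantic (knowledge-graph) edges between classes
    }
    return region_nodes, class_nodes, edges
```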
As for E, three groups of edges are defined between nodes. First, for N_r, a spatial graph is used to encode spatial relationships between regions (E_rr). Multiple types
¹Different from previous work [4] that introduces an inverse operation to put the region back, we note that crop and resize itself with proper extrapolation can simply meet this requirement.