CVPR 2015 Best Paper


DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time

Richard A. Newcombe

newcombe@cs.washington.edu

Dieter Fox

fox@cs.washington.edu

University of Washington, Seattle

Steven M. Seitz

seitz@cs.washington.edu

Figure 1: Real-time reconstructions of a moving scene with DynamicFusion; both the person and the camera are moving. The initially noisy and incomplete model is progressively denoised and completed over time (left to right).

Abstract

We present the first dense SLAM system capable of reconstructing non-rigidly deforming scenes in real-time, by fusing together RGBD scans captured from commodity sensors. Our DynamicFusion approach reconstructs scene geometry whilst simultaneously estimating a dense volumetric 6D motion field that warps the estimated geometry into a live frame. Like KinectFusion, our system produces increasingly denoised, detailed, and complete reconstructions as more measurements are fused, and displays the updated model in real time. Because we do not require a template or other prior scene model, the approach is applicable to a wide range of moving objects and scenes.

3D scanning traditionally involves separate capture and off-line processing phases, requiring very careful planning of the capture to make sure that every surface is covered. In practice, it's very difficult to avoid holes, requiring several iterations of capture, reconstruction, identifying holes, and recapturing missing regions to ensure a complete model. Real-time 3D reconstruction systems like KinectFusion [18, 10] represent a major advance, by providing users the ability to instantly see the reconstruction and identify regions that remain to be scanned. KinectFusion spurred a flurry of follow-up research aimed at robustifying the tracking [9, 32] and expanding its spatial mapping capabilities to larger environments [22, 19, 34, 31, 9].

However, as with all traditional SLAM and dense reconstruction systems, the most basic assumption behind KinectFusion is that the observed scene is largely static. The core question we tackle in this paper is: How can we generalise KinectFusion to reconstruct and track dynamic, non-rigid scenes in real-time? To that end, we introduce DynamicFusion, an approach based on solving for a volumetric flow field that transforms the state of the scene at each time instant into a fixed, canonical frame. In the case of a moving person, for example, this transformation undoes the person's motion, warping each body configuration into the pose of the first frame. Following these warps, the scene is effectively rigid, and standard KinectFusion updates can be used to obtain a high quality, denoised reconstruction. This progressively denoised reconstruction can then be transformed back into the live frame using the inverse map; each point in the canonical frame is transformed to its location in the live frame (see Figure 1).

Defining a canonical “rigid” space for a dynamically moving scene is not straightforward. A key contribution of our work is an approach for non-rigid transformation and fusion that retains the optimality properties of volumetric scan fusion [5], developed originally for rigid scenes. The main insight is that undoing the scene motion to enable fusion of all observations into a single fixed frame can be achieved efficiently by computing the inverse map alone. Under this transformation, each canonical point projects along a line of sight in the live camera frame. Since the optimality arguments of [5] (developed for rigid scenes) depend only on lines of sight, we can generalize their optimality results to the non-rigid case.

Our second key contribution is to represent this volumetric warp efficiently, and compute it in real time. Indeed, even a relatively low resolution, $256^3$ deformation volume would require 100 million transformation variables to be computed at frame-rate. Our solution depends on a combination of adaptive, sparse, hierarchical volumetric basis functions, and innovative algorithmic work to ensure a real-time solution on commodity hardware. As a result, DynamicFusion is the first system capable of real-time dense reconstruction in dynamic scenes using a single depth camera.
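As a rough check of the 100 million figure, the short Python sketch below counts the unknowns of a fully dense warp volume. It assumes six parameters (one 6D transformation) per voxel, the same count used for the dense parametrisation in Section 3.1; the function name is hypothetical and chosen only for this illustration.

```python
def dense_warp_parameter_count(voxels_per_side: int, dof_per_voxel: int = 6) -> int:
    """Number of unknowns if every voxel carried its own 6D transformation."""
    return dof_per_voxel * voxels_per_side ** 3


count = dense_warp_parameter_count(256)
print(f"{count:,} parameters per frame")  # 100,663,296

# A single rigid camera pose has 6 parameters, so the dense field needs
# 256**3 = 16,777,216 times as many unknowns as one rigid transformation.
ratio = count // 6
print(f"{ratio:,} times a single rigid pose")
```

The ratio is on the order of ten million, which is why a sparse parametrisation is needed for frame-rate estimation.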

The remainder of this paper is structured as follows. After discussing related work, we present an overview of DynamicFusion in Section 2 and provide technical details in Section 3. We provide experimental results in Section 4 and conclude in Section 5.

1. Related Work

While no prior work achieves real-time, template-free, non-rigid reconstruction, there are two categories of closely related work: 1) real-time non-rigid tracking algorithms, and 2) offline dynamic reconstruction techniques.

Real-time non-rigid template tracking. The vast majority of non-rigid tracking research focuses on human body parts, for which specialised shape and motion templates are learnt or manually designed. The best of these demonstrate high accuracy, real-time performance capture for tracking faces [16, 3], hands [21, 20], complete bodies [27], or general articulated objects [23, 33].

Other techniques directly track and deform more general mesh models. [12] demonstrated the ability to track a statically acquired low resolution shape template and upgrade its appearance with high frequency geometric details not present in the original model. Recently, [37] demonstrated an impressive real-time version of a similar technique, using GPU accelerated optimisations. In that system, a dense surface model of the subject is captured while remaining static, yielding a template for use in their real-time tracking pipeline. This separation into template generation and tracking limits the system to objects and scenes that are completely static during the geometric reconstruction phase, precluding reconstruction of things that won't reliably hold still (e.g., children or pets).

Offline simultaneous tracking and reconstruction of dynamic scenes. There is a growing literature on offline non-rigid tracking and reconstruction techniques. Several researchers have extended ICP to enable small non-rigid deformations, e.g., [1, 2]. Practical advancements to pair-wise 3D shape and scan alignment over larger deformations make use of reduced deformable model parametrisations [14, 4]. In particular, embedded deformation graphs [25] use a sparsely sampled set of transformation basis functions that can be efficiently and densely interpolated over space. Quasi-rigid reconstruction has also been demonstrated [15, 35] and hybrid systems, making use of a known kinematic structure (e.g., a human body), are able to perform non-rigid shape denoising [36]. Other work combines non-rigid mesh template tracking and temporal denoising and completion [13], but does not obtain a single consistent representation of the scene.

More closely related to our work are template-free techniques. An intriguing approach to template-free non-rigid alignment, introduced in [17] and [26], treats each non-rigid scan as a view from a 4D geometric observation and performs 4D shape reconstruction. [30, 29] reconstruct a fixed topology geometry by performing pair-wise scan alignment. [24] use a space-time solid incompressible flow prior that results in water tight reconstructions and is effective against noisy input point-cloud data. [28] introduce animation cartography that also estimates shape and a per frame deformation by developing a dense correspondence matching scheme that is seeded with sparse landmark matches. Recent work using multiple fixed Kinect cameras [8, 7] demonstrates larger scale non-rigid reconstruction by densely tracking and fusing all depth map data into a novel directional distance function representation.

All of these techniques require three to four orders of magnitude more time than is available within a real-time setting.

2. DynamicFusion Overview

DynamicFusion decomposes a non-rigidly deforming scene into a latent geometric surface, reconstructed into a rigid canonical space $S \subseteq \mathbb{R}^3$; and a per frame volumetric warp field that transforms that surface into the live frame. There are three core algorithmic components to the system that are performed in sequence on arrival of each new depth frame:

1. Estimation of the volumetric model-to-frame warp field parameters (Section 3.3)

2. Fusion of the live frame depth map into the canonical space via the estimated warp field (Section 3.2)

3. Adaptation of the warp-field structure to capture newly added geometry (Section 3.4)

Figure 2 provides an overview.
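The per-frame ordering of these three steps can be sketched as the Python skeleton below. This is only a control-flow illustration under stated assumptions: `process_frame` and the three callables passed to it are hypothetical placeholders, and the toy stand-ins merely record the call order rather than implement the methods of Sections 3.2 to 3.4.

```python
from typing import Any, Callable, Tuple


def process_frame(
    model: Any,
    warp_field: Any,
    depth_map: Any,
    estimate_warp: Callable[[Any, Any, Any], Any],
    fuse_depth: Callable[[Any, Any, Any], Any],
    extend_nodes: Callable[[Any, Any], Any],
) -> Tuple[Any, Any]:
    """Run the three per-frame steps in sequence for one new depth map."""
    # 1. Estimate the model-to-frame warp field for the new depth map.
    warp_field = estimate_warp(model, warp_field, depth_map)
    # 2. Fuse the depth map into the canonical model through that warp.
    model = fuse_depth(model, warp_field, depth_map)
    # 3. Extend the warp-field structure to cover newly added geometry.
    warp_field = extend_nodes(model, warp_field)
    return model, warp_field


# Toy stand-ins (not the real algorithms): they only log when they are called.
calls = []


def toy_estimate(model, warp, depth):
    calls.append("estimate_warp")
    return warp


def toy_fuse(model, warp, depth):
    calls.append("fuse_depth")
    return model + [depth]


def toy_extend(model, warp):
    calls.append("extend_nodes")
    return warp


model, warp = [], {}
for depth in ["frame0", "frame1"]:
    model, warp = process_frame(model, warp, depth, toy_estimate, toy_fuse, toy_extend)
```

Passing the steps in as callables keeps the ordering explicit while leaving each component free to be replaced by a real implementation.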

3. Technical Details

We will now describe the components of DynamicFusion in detail. First, we describe our dense volumetric warp-field parametrisation. This allows us to model per-frame deformations in the scene. The warp-field is the key extension over static state space representations used in traditional reconstruction and SLAM systems, and its estimation is the enabler of both non-rigid tracking and scene reconstruction.

3.1. Dense Non-rigid Warp Field

We represent dynamic scene motion through a volumetric warp-field, providing a per point 6D transformation $\mathcal{W} : S \mapsto \mathbf{SE}(3)$. Whereas a dense 3D translation field would be sufficient to describe time varying geometry, we have found that representing the real-world transformation of objects with both translation and rotation results in significantly better tracking and reconstruction. For each canonical point $v_c \in S$, $\mathbf{T}_{lc} = \mathcal{W}(v_c)$ transforms that point from canonical space into the live, non-rigidly deformed frame of reference.

[Figure 2 panels: (a) Initial Frame at t = 0s; (b) Raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s; (c) Node Distance; (d) Canonical Model; (e) Canonical model warped into its live frame; (f) Model Normals]

Figure 2: DynamicFusion takes an online stream of noisy depth maps (a,b) and outputs a real-time dense reconstruction of the moving scene (d,e). To achieve this, we estimate a volumetric warp (motion) field that transforms the canonical model space into the live frame, enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d,f). Simultaneously, the structure of the warp field is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through a k-nearest node average in the canonical frame (c). The resulting per-frame warp field estimate enables the progressively denoised and completed scene geometry to be transformed into the live frame in real-time (e). In (e) we also visualise motion trails for a sub-sample of model vertices over the last 1 second of scene motion together with a coordinate frame showing the rigid body component of the scene motion. In (c) we render the nearest node to model surface distance where increased distance is mapped to a lighter value.

Since we will need to estimate the warp function for each new frame, $\mathcal{W}_t$, its representation must be efficiently optimisable. One possibility is to densely sample the volume, e.g. representing a quantised $\mathbf{SE}(3)$ field at the resolution of the truncated signed distance (TSDF) geometric representation. However, a typical TSDF volume reconstruction at a relatively low resolution of $256^3$ voxels would require the solution of $6 \times 256^3$ parameters per frame, about 10 million times more than in the original KinectFusion algorithm, which only estimates a single rigid transformation. Clearly, a completely dense parametrisation of the warp function is infeasible. In reality, surfaces tend to move smoothly in space, and so we can instead use a sparse set of transformations as bases and define the dense volumetric warp function through interpolation. Due to its computational efficiency and high quality interpolation capability, we use dual-quaternion blending (DQB) [11] to define our warp function:

$$\mathcal{W}(x_c) \equiv SE3(\mathbf{DQB}(x_c)), \tag{1}$$

where the weighted average over unit dual quaternion transformations is simply

$$\mathbf{DQB}(x_c) \equiv \frac{\sum_{k \in N(x_c)} w_k(x_c)\,\hat{q}_{kc}}{\left\| \sum_{k \in N(x_c)} w_k(x_c)\,\hat{q}_{kc} \right\|},$$

with each unit dual-quaternion $\hat{q}_{kc} \in \mathbb{R}^8$. Here, $N(x)$ are the k-nearest transformation nodes to the point $x$ and $w_k : \mathbb{R}^3 \mapsto \mathbb{R}$ defines a weight that alters the radius of influence of each node and $SE3(.)$ converts from quaternions back to an $\mathbf{SE}(3)$ transformation matrix. The state of the warp-field $\mathcal{W}_t$ at time $t$ is defined by the values of a set of $n$ deformation nodes $\mathcal{N}^t_{warp} = \{dg_v, dg_w, dg_{se3}\}_t$. Each of the $i = 1..n$ nodes has a position in the canonical frame $dg^i_v \in \mathbb{R}^3$, its associated transformation $\mathbf{T}_{ic} = dg^i_{se3}$, and a radial basis weight $dg_w$ that controls the extent of the transformation

$$w_i(x_c) = \exp\left( -\frac{\left\| dg^i_v - x_c \right\|^2}{2\,(dg^i_w)^2} \right).$$

Each radius parameter $dg^i_w$ is set to ensure the node's influence overlaps with neighbouring nodes, dependent on the sampling sparsity of nodes, which we describe in detail in section (3.4). Since the warp function defines a rigid body transformation for all supported space, both position and any associated orientation of space is transformed, e.g., the vertex $v_c$ from a surface with orientation or normal $n_c$ is transformed into the live frame as $(v_t, 1)^\top = \mathcal{W}_t(v_c)(v_c^\top, 1)^\top$ and $(n_t, 0)^\top = \mathcal{W}_t(v_c)(n_c^\top, 0)^\top$. We note that scaling of space can also be represented with this warp function, since compression and expansion of space are represented by neighbouring points moving in converging and diverging directions. Finally, we note that we can factor out any rigid body transformation common to all points in the volume, e.g., due to camera motion. We therefore introduce the explicit warped model to live camera transform, $\mathbf{T}_{lw}$, and compose this onto the volumetric warp function;
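The interpolation in Eq. (1), together with the radial basis weights, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: all function names are hypothetical, quaternions are stored as (w, x, y, z), the blend is normalised by the norm of its rotation part (a common choice in dual-quaternion blending, which may differ from the norm intended in the text), quaternion signs are aligned before averaging, and the common rigid transform factored out in the text is omitted. Far from every node the weights can underflow to zero, a case this sketch does not guard against.

```python
import numpy as np


def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def dq_from_rt(rot_quat, trans):
    """Pack a rotation quaternion and a translation into a dual quaternion (8-vector)."""
    q_r = np.asarray(rot_quat, dtype=float)
    q_r = q_r / np.linalg.norm(q_r)
    t = np.array([0.0, trans[0], trans[1], trans[2]])
    return np.concatenate([q_r, 0.5 * quat_mul(t, q_r)])


def se3_from_dq(dq):
    """Convert a dual quaternion with unit rotation part to a 4x4 SE(3) matrix."""
    q_r, q_d = dq[:4], dq[4:]
    w, x, y, z = q_r
    T = np.eye(4)
    T[:3, :3] = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
    conj = q_r * np.array([1.0, -1.0, -1.0, -1.0])
    T[:3, 3] = 2.0 * quat_mul(q_d, conj)[1:]  # translation = vector part of 2 q_d q_r*
    return T


def warp(x_c, dg_v, dg_w, dg_se3, k=4):
    """Evaluate W(x_c) = SE3(DQB(x_c)) from the k nearest deformation nodes."""
    x_c = np.asarray(x_c, dtype=float)
    d2 = np.sum((dg_v - x_c) ** 2, axis=1)           # squared distances to node positions
    nearest = np.argsort(d2)[:k]                     # indices of the k nearest nodes
    weights = np.exp(-d2[nearest] / (2.0 * dg_w[nearest] ** 2))
    dqs = dg_se3[nearest].copy()
    # q and -q encode the same transform; align signs before averaging (assumption).
    flip = dqs[:, :4] @ dqs[0, :4] < 0
    dqs[flip] *= -1.0
    blended = weights @ dqs                          # weighted sum of dual quaternions
    blended /= np.linalg.norm(blended[:4])           # normalise by the rotation part
    return se3_from_dq(blended)


# Two nodes with identity rotation; the second one translates by 0.2 along y.
dg_v = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
dg_w = np.array([0.5, 0.5])
identity = [1.0, 0.0, 0.0, 0.0]
dg_se3 = np.stack([dq_from_rt(identity, [0.0, 0.0, 0.0]),
                   dq_from_rt(identity, [0.0, 0.2, 0.0])])

v_c = np.array([0.5, 0.0, 0.0])   # canonical vertex midway between the nodes
n_c = np.array([0.0, 0.0, 1.0])   # its canonical normal
T = warp(v_c, dg_v, dg_w, dg_se3, k=2)
v_t = (T @ np.append(v_c, 1.0))[:3]   # points use homogeneous coordinate 1
n_t = (T @ np.append(n_c, 0.0))[:3]   # normals use 0, so only rotation applies
```

Because the vertex is equidistant from both nodes, the two weights are equal and the blended translation is the average of the node translations, so `v_t` is approximately (0.5, 0.1, 0) while `n_t` is unchanged.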
