
Dense Continuous-Time Tracking and Mapping

with Rolling Shutter RGB-D Cameras

Christian Kerl, Jörg Stückler, and Daniel Cremers
Technische Universität München
{kerl,stueckle,cremers}@in.tum.de

Abstract

We propose a dense continuous-time tracking and mapping method for RGB-D cameras. We parametrize the camera trajectory using continuous B-splines and optimize the trajectory through dense, direct image alignment. Our method also directly models rolling shutter in both RGB and depth images within the optimization, which improves tracking and reconstruction quality for low-cost CMOS sensors.

Using a continuous trajectory representation has a number of advantages over a discrete-time representation (e.g. camera poses at the frame interval). With splines, fewer variables need to be optimized than with a discrete representation, since the trajectory can be represented with fewer control points than frames. Splines also naturally include smoothness constraints on derivatives of the trajectory estimate. Finally, the continuous trajectory representation allows compensating for rolling shutter effects, since a pose estimate is available at any exposure time of an image. Our approach demonstrates superior quality in tracking and reconstruction compared to approaches with discrete-time or global shutter assumptions.

1. Introduction

Most current approaches to simultaneous localization and mapping (SLAM) with RGB-D cameras estimate the trajectory at discrete times, e.g. one pose for each frame [2, 6, 10]. Implicitly, the discrete pose of a frame needs to apply to all pixels in the image, an assumption that is clearly violated for rolling shutter cameras. In effect, the alignment of two images is biased towards a wrong estimate for such cameras. Since most consumer-grade RGB-D cameras use rolling shutter CMOS sensors, a trajectory representation is desirable that provides a camera pose estimate for each sensor row individually and allows compensating for rolling shutter.

In this paper, we propose to optimize a continuous-time trajectory representation in the form of B-splines through direct, dense image alignment. This way, our approach can incorporate parameters of sensor exposure such as exposure durations and the time differences between the RGB and depth sensor. The trajectory can be represented with far fewer parameters, i.e. control points, than pose variables per frame. The spline parametrization also inherently adds constraints on derivatives and produces smooth trajectories. Hence, it regularizes the ill-posed problem of estimating the camera pose for each image row. A further advantage of the spline representation is the easy integration of other sensing modalities such as inertial measurement units (IMUs). These sensors can have arbitrarily lower or higher measurement rates than the camera and do not need to measure synchronously with the frames.

Figure 1: We propose a dense tracking and mapping method to estimate a continuous-time trajectory from a sequence of rolling shutter RGB-D images. Our approach significantly reduces trajectory drift, leading to more consistent 3D reconstructions (top) compared to a state-of-the-art algorithm which estimates discrete camera poses w.r.t. keyframes (bottom). Neither method involves global optimization.


By explicitly modeling rolling shutter in direct image alignment, our method achieves improvements in tracking and reconstruction quality. In our SLAM method, we use the continuous trajectory estimate to fuse RGB-D frames consistently into keyframes over time and to remove rolling shutter effects. We also accurately quantify motion blur at each pixel and use this measure to improve the quality of the color reconstruction. We compare our method with approaches estimating discrete camera poses or relying on global shutter assumptions, and demonstrate superior tracking and reconstruction performance compared to the state of the art. Figure 1 shows a 3D reconstruction of our proposed method (top) in comparison to a baseline algorithm (bottom). The better performance of our method leads to a consistent 3D model even without global optimization.

2. Related Work

Visual SLAM and structure from motion (SfM) have traditionally been formulated with discrete-time trajectories [2, 6, 10]. Recently, Furgale et al. [3] proposed a continuous-time spline-based trajectory representation for keypoint-based monocular and stereo SLAM. They explore continuous trajectories mainly to reduce the state space and to seamlessly integrate high-rate IMU measurements. Lovegrove et al. [9] formulate monocular SLAM in a compositional spline representation in the se(3) Lie algebra, overcoming problems of the Cayley-Gibbs-Rodrigues representation of poses used in [3]. Their approach compensates for rolling shutter; however, in contrast to our dense method, sparse keypoints are matched between images. They demonstrate SLAM in a small-scale calibration example and on a synthetic sequence.

Several approaches correct for rolling shutter in monocular, stereo, or RGB-D images. Klein and Murray [8] estimate camera motion in a monocular image from the movement of keypoints between frames and use this motion to determine a rolling-shutter-compensated image. Baker et al. [1] estimate translational pixel motion in a frame from optical flow. The approach in [15] estimates a linear interpolation spline of rotational camera motion from KLT feature tracks in a short window of frames. A similar approach for modeling rolling shutter is used in the bundle adjustment approach in [5]. Here, the pose of the top row of each image is estimated and interpolated linearly between frames. For RGB-D cameras, only a few approaches exist. Ringaby et al. [14] remove rolling shutter in the depth image of a structured light sensor by applying the keypoint-based correction method [15] to the infrared image. In [13], a 3-axis gyro is used to determine the rotation of the camera instead. Meilland et al. [11] assume the camera velocity to be linear during the exposure of a frame and determine the motion through direct image alignment.

Our method introduces a generalized continuous-time spline-based trajectory representation to consider rolling shutter in direct RGB-D image alignment and SLAM. We model rolling shutter in both the RGB and the depth image. We also propose to fuse the rolling shutter corrected RGB-D images in keyframes and to reduce the effects of motion blur based on the motion estimate of individual pixels.

3. Approach

The RGB-D camera provides us with a sequence of RGB and depth images in the time interval [t_min, t_max). Our goal is to estimate a trajectory function T(t) : R → SE(3) for all t ∈ [t_min, t_max) from the image sequence using dense image alignment.

An RGB image will be denoted by C : Ω_C → R^3, the derived intensity image by I : Ω_I → R, and a depth image by Z : Ω_Z → R. The corresponding timestamps are t_C, t_I, and t_Z. We do not require the timestamps of an RGB-D image pair to be synchronized. For simplicity we assume that all image domains have the same dimensions with width w and height h.

3.1. Camera Model

We model the RGB and depth cameras as rolling shutter cameras with pinhole projection including radial and tangential lens distortion. We assume the intrinsic and extrinsic parameters of both cameras to be known. The focal length will be denoted by f and the offset of the projection center by o_x along the image x-axis and o_y along the y-axis, respectively. The relative transformation between the RGB and depth camera is T_cam. For clarity of presentation we omit the lens distortion in the projection equations.

The projection function π mapping a 3D point p = (X, Y, Z)^T to a 2D pixel x = (x, y)^T is defined as:

x = \pi(p) = \left( \frac{X}{Z} f + o_x, \; \frac{Y}{Z} f + o_y \right)    (1)

The inverse projection function π⁻¹ reconstructs a 3D point given a pixel location x and a depth image Z:

p = \pi^{-1}(x, Z(x)) = \left( \frac{x - o_x}{f}, \; \frac{y - o_y}{f}, \; 1 \right)^T Z(x)    (2)
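As a reading aid, the two mappings of (1) and (2) can be written as a short Python/NumPy sketch (assuming a single focal length f and, as in the text, no lens distortion):

```python
import numpy as np

def project(p, f, ox, oy):
    """Pinhole projection pi of Eq. (1); lens distortion omitted as in the text."""
    X, Y, Z = p
    return np.array([X / Z * f + ox, Y / Z * f + oy])

def backproject(x, z, f, ox, oy):
    """Inverse projection pi^{-1} of Eq. (2): pixel x with depth z = Z(x)."""
    u, v = x
    return np.array([(u - ox) / f, (v - oy) / f, 1.0]) * z
```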

For a rolling shutter camera the pose which transforms a world point into the camera frame depends on the pixel location that the point projects to. This can be formalized as the following constraint:

\frac{t_r}{h} \left[ \pi\left( T^{-1}(t_0 + t_x)\, p \right) \right]_y \overset{!}{=} t_x    (3)

where t_0 is the time of the first row, t_r is the readout time of the camera, and h is the number of rows. The readout time specifies the time difference between the capture of the first and the last row. Here we assume that the camera reads the image row-wise. To project a point with an arbitrary T(t) we solve (3) for t_x using the Newton-Raphson method.
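A minimal sketch of this row-time solve, assuming a hypothetical helper T_inv(t) that evaluates T⁻¹(t) and using a numeric time derivative in place of the analytic one:

```python
import numpy as np

def solve_row_time(p, T_inv, t0, t_r, h, f, ox, oy, n_iters=5, eps=1e-5):
    """Solve the rolling shutter constraint (3) for t_x with Newton-Raphson.
    p is a world point; T_inv(t) returns the 4x4 world-to-camera transform
    T^{-1}(t) (hypothetical helper). A central-difference derivative stands
    in for the analytic time derivative of the pose."""
    def g(t_x):                                  # residual of constraint (3)
        q = T_inv(t0 + t_x) @ np.append(p, 1.0)
        row = q[1] / q[2] * f + oy               # [pi(.)]_y, the image row
        return t_r / h * row - t_x
    t_x = 0.5 * t_r                              # initialize at mid-exposure
    for _ in range(n_iters):
        grad = (g(t_x + eps) - g(t_x - eps)) / (2.0 * eps)
        t_x -= g(t_x) / grad
    return t_x
```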

3.2. Trajectory Representation

We use the cumulative cubic B-spline representation introduced by Lovegrove et al. [9] for the trajectory function T(t). Here we briefly repeat the definition of T(t). For a detailed derivation and formulas for the time derivatives we refer the interested reader to [9].

The trajectory is defined over a time interval [t_min, t_max). The shape of the trajectory is determined by the set T_C of m control points T_{C,l} ∈ SE(3) placed at discrete times (knots) t_0 … t_l … t_{m-1} in the time interval. The knots are uniformly distributed over the interval, and each knot interval is ∆t wide. In our case we use ∆t = 0.05 s. Every point on the trajectory is influenced by four control points: at time t ∈ [t_l, t_{l+1}) these are the control points at times t_{l-1}, t_l, t_{l+1}, t_{l+2}. The pose at t is then:

T(t) = T_{C,l-1} \prod_{k=1}^{3} \exp\left( B_k(u)\, \Omega_{l-1+k} \right)    (4)

with \Omega_l = \log\left( T_{C,l-1}^{-1} T_{C,l} \right) \in se(3), where u = (t - t_l)/\Delta t normalizes t to the interval [0, 1). The cumulative basis functions B(u) are:

B(u) = \frac{1}{6} \begin{pmatrix} 6 & 0 & 0 & 0 \\ 5 & 3 & -3 & 1 \\ 1 & 3 & 3 & -2 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ u \\ u^2 \\ u^3 \end{pmatrix}    (5)

B_k(u) selects the k-th element of B(u), with indices starting at 0. A pose T(t) transforms a point from local coordinates to world coordinates.
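The following sketch evaluates (4) and (5) for the four control poses influencing t; SciPy's generic matrix exponential and logarithm stand in for the closed-form SE(3) versions used in practice:

```python
import numpy as np
from scipy.linalg import expm, logm

C = np.array([[6, 0, 0, 0],
              [5, 3, -3, 1],
              [1, 3, 3, -2],
              [0, 0, 0, 1]], dtype=float) / 6.0

def cumulative_basis(u):
    """B(u) from Eq. (5); B[k] is the k-th entry, indices starting at 0."""
    return C @ np.array([1.0, u, u * u, u ** 3])

def spline_pose(ctrl, t, t_l, dt):
    """Evaluate T(t) via Eq. (4) for ctrl = [T_{C,l-1}, ..., T_{C,l+2}] given
    as 4x4 matrices and t in [t_l, t_l + dt). SciPy's generic matrix exp/log
    stand in for the closed-form SE(3) exponential and logarithm."""
    B = cumulative_basis((t - t_l) / dt)
    T = ctrl[0].copy()
    for k in range(1, 4):
        Omega = np.real(logm(np.linalg.inv(ctrl[k - 1]) @ ctrl[k]))  # Omega_{l-1+k}
        T = T @ expm(B[k] * Omega)
    return T
```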

Equipped with the rolling shutter camera model and the continuous-time trajectory representation, we define in the next sections the dense image alignment terms which we use to estimate the control points of the spline.

3.3. Dense Image Alignment

To estimate the trajectory function T(t) we use a dense, direct image alignment method, because such methods are more robust in scenes with few features and on blurred images. In general, direct image alignment methods estimate the motion parameters that align a given current image to a reference image or model. In each iteration we take a set of points from the reference image, apply a warping function to transfer these points to the current image, evaluate an error function measuring the consistency between the two, and finally adjust the motion parameters to increase consistency. In this process the data association between reference and current image is solved implicitly, in contrast to feature-based methods. In our case the warping function is the composition of back-projection, rigid body motion, and projection.

For rolling shutter cameras the parameters of the warping function depend on its result, as described above. Therefore, for general rigid body motions and projection functions, the warping itself involves an iterative optimization, which increases the computational load, especially for direct, dense methods, because they have to apply the warping function to a large number of points. Previous approaches therefore resort to linear motion approximations. In contrast, we present a dense image alignment term which does not require the projection into a rolling shutter image, and, as an alternative, an efficient strategy to project even with our more general trajectory representation.

We use a geometric and a photometric error term in the image alignment. In the following we describe both error terms.

Geometric Error  For the geometric error we take advantage of the fact that the roles of reference and current image can be swapped, i.e., points from the current depth image can be warped to the reference image. For the current depth image we know the capture time of every row and can evaluate the trajectory function at the appropriate point in time. The geometric error for one point is given by

r_{Z,i,j} = Z_{ref}\left( \pi\left( T^{-1}(t_{Z,ref} + t_x)\, p'_j \right) \right) - \left[ T^{-1}(t_{Z,ref} + t_x)\, p'_j \right]_Z    (6)

where p'_j = T(t_{Z,i} + t_r/h\, [x_j]_y)\, \pi^{-1}(x_j, Z_i(x_j)) is the point from the i-th image transformed into the world frame, and [\,\cdot\,]_Z selects the Z component of a point. The time offset t_x is either 0, if the reference image is rectified, or determined by solving (3). Here, rectified means that we have removed the rolling shutter effects from the image, so that a global shutter projection model applies. Note that if we have rectified the reference image, we can solve the alignment problem based on the geometric error without ever projecting into a rolling shutter image. This insight is equally applicable to ICP-like error functions.
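For concreteness, a per-pixel sketch of (6) for the rectified-reference case (t_x = 0), with hypothetical argument names for the row pose and the interpolated reference-depth lookup:

```python
import numpy as np

def geometric_residual(x_j, z_j, Z_ref, T_row, T_ref_inv, f, ox, oy):
    """Geometric residual (6) for one pixel x_j with depth z_j = Z_i(x_j),
    assuming a rectified reference image (t_x = 0). T_row is the 4x4 pose
    T(t_{Z,i} + t_r/h * [x_j]_y) of the pixel's row, T_ref_inv = T^{-1}(t_{Z,ref}),
    and Z_ref(x) is an interpolated reference-depth lookup (hypothetical names)."""
    u, v = x_j
    p = np.array([(u - ox) / f * z_j, (v - oy) / f * z_j, z_j, 1.0])  # pi^{-1}, Eq. (2)
    q = T_ref_inv @ (T_row @ p)                  # into the reference camera frame
    x_ref = np.array([q[0] / q[2] * f + ox, q[1] / q[2] * f + oy])    # pi, Eq. (1)
    return Z_ref(x_ref) - q[2]                   # [.]_Z selects the Z component
```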

Photometric Error  We define the photometric error for a point visible in the reference intensity image I_ref and the i-th intensity image I_i as:

r_{I,i,j} = I_i\left( \pi\left( T_{cam}\, T^{-1}(t_{I,i} + t_{i,x})\, p'_j \right) \right) - I_{ref}\left( \pi\left( T_{cam}\, T^{-1}(t_{I,ref} + t_{ref,x})\, p'_j \right) \right)    (7)

where p'_j = T(t_{Z,ref} + t_r/h\, [x_j]_y)\, \pi^{-1}(x_j, Z_{ref}(x_j)) is a 3D point from the reference depth image transformed into the world frame. The time offsets t_{i,x} and t_{ref,x} are determined by solving (3). If the reference depth image is rectified and has a registered intensity image, (7) can be simplified. Registered means that we have determined the intensity of every 3D point in the reference depth image beforehand, which can essentially be done by evaluating the second part of (7). However, it requires an estimate of T(t) during the frame interval of I_ref. The simplified error term is:

r_{I,i,j} = I_i\left( \pi\left( T_{cam}\, T^{-1}(t_{I,i} + t_{cur,x})\, p'_j \right) \right) - I_{ref}(x_j)    (8)

with p'_j = T(t_{Z,ref})\, \pi^{-1}(x_j, Z_{ref}(x_j)). In this case it is equivalent to the photometric error described by Meilland et al. [11], except for our more general trajectory function. In the following we only use the simplified version (8) of the photometric error, assuming we have a rectified and registered reference image. In Section 3.6 we give additional details about frame rectification and registration.
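A per-pixel sketch of the simplified residual (8); the row-dependent pose T⁻¹(t_{I,i} + t_{cur,x}) and the interpolated intensity lookup are assumed to be given (hypothetical argument names):

```python
import numpy as np

def photometric_residual(x_j, z_j, I_ref, I_i, T_ref, T_cam, T_inv_cur, f, ox, oy):
    """Simplified photometric residual (8) for one pixel x_j of the rectified,
    registered reference frame with depth z_j = Z_ref(x_j). T_ref = T(t_{Z,ref}),
    T_inv_cur = T^{-1}(t_{I,i} + t_{cur,x}) for the row the point falls into,
    and I_i(x) is an interpolated intensity lookup (hypothetical names)."""
    u, v = x_j
    p = T_ref @ np.array([(u - ox) / f * z_j, (v - oy) / f * z_j, z_j, 1.0])
    q = T_cam @ (T_inv_cur @ p)                  # into the i-th RGB camera frame
    x_cur = np.array([q[0] / q[2] * f + ox, q[1] / q[2] * f + oy])
    return I_i(x_cur) - I_ref(x_j)
```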

3.4. Non-linear Trajectory Optimization

With the point-wise photometric and geometric errors defined in the previous section, we formulate an error function over a whole image, which we minimize with respect to the spline control points parameterizing our trajectory (cf. Section 3.2). Every error term depends on multiple control points. Therefore, multiple images constrain the same control points and require us to minimize the error function jointly over multiple images. This joint optimization is in contrast to all previous dense, direct image alignment methods, which only optimize the relative transformation of the most recent image to some reference image and ignore the temporal relationship of the images.

Given a rectified and registered reference RGB-D image and a set of N rolling shutter RGB and depth images, we define the joint photometric error as:

r_I = \sum_{i}^{N} \sum_{j}^{M_i} w_{I,i,j}\, r_{I,i,j}^2    (9)

where M_i is the number of points visible in image I_i and r_{I,i,j} is the photometric error for one point as defined in (8).

Similarly, we define the joint geometric error for these N images as:

r_Z = \sum_{i}^{N} \sum_{j}^{M_i} w_{Z,i,j}\, r_{Z,i,j}^2    (10)

with M_i being the number of points in the i-th depth image which project into the reference image.

We also include a per-residual weight in the joint errors, w_{I,i,j} = w(r_{I,i,j}) in (9) and w_{Z,i,j} = w(r_{Z,i,j}) in (10). The weight function w(r) is derived from the Student-t distribution [7]:

w(r) = \frac{\nu + 1}{\nu + (r/\sigma)^2}    (11)

where ν is the number of degrees of freedom (fixed to ν = 5) and σ is the scale parameter. We estimate 2N scale parameters σ: one for the photometric and one for the geometric errors of each of the N images.
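A sketch of the weight function (11), together with one common way to estimate the scale σ by a fixed-point iteration; the paper defers the scale estimation to [7], so the second function is an assumption rather than the authors' exact procedure:

```python
import numpy as np

def student_t_weight(r, sigma, nu=5.0):
    """Per-residual weight w(r) from Eq. (11)."""
    return (nu + 1.0) / (nu + (r / sigma) ** 2)

def estimate_scale(residuals, nu=5.0, n_iters=10):
    """Fixed-point estimate of the scale sigma for one image and error type,
    a common choice following [7]; the paper does not spell this step out."""
    residuals = np.asarray(residuals, dtype=float)
    sigma2 = np.mean(residuals ** 2) + 1e-12
    for _ in range(n_iters):
        w = (nu + 1.0) / (nu + residuals ** 2 / sigma2)
        sigma2 = np.mean(w * residuals ** 2)
    return np.sqrt(sigma2)
```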

When we align two images, only the relative pose is observable. However, with our trajectory representation we optimize the poses w.r.t. the world frame. Therefore, we fix the control points defining the pose of the rectified reference image T(t_{Z,ref}). With one pose fixed, every point-wise error influences four control points. As the capture interval [t_0, t_0 + t_r) of one image might span multiple knot intervals of the spline, one image influences four or more spline control points. Let \tilde{T}_C ⊂ T_C denote the set of control points influenced by the N images, excluding the fixed ones of the reference pose. We then define our joint trajectory optimization problem for the set of optimal control point poses \tilde{T}_C^* as:

\tilde{T}_C^* = \arg\min_{\tilde{T}_C}\; r_I + r_Z    (12)

This is a non-linear weighted least squares problem, which we solve iteratively using the Gauss-Newton method. The error function is linearized with respect to small increments ∆T_C to the control points. The increments are represented using the Lie algebra se(3). In every iteration we solve the normal equations A ∆T_C = −b. The Hessian matrix A has a band-diagonal structure; every point-wise error contributes a dense 24 × 24 block to A. In Section 3.7 we provide details on how to efficiently compute the Jacobians used to construct the normal equations.

As is common in dense image alignment, we use a coarse-to-fine scheme to increase the convergence radius. The optimization for \tilde{T}_C^* starts with a low-resolution version of each of the N images and sequentially increases the resolution as the affected control points converge. To choose the appropriate resolution for an image we use the following strategy. We associate with every control point T_{C,l} an image resolution. Furthermore, we keep track of the magnitude of the increments to the control point. Once this magnitude falls below a threshold, or after a maximum number of iterations, we increase the resolution associated with the control point. We choose the lowest resolution among all affected control points per image. When control points converge on the highest resolution they are excluded from further optimization.
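A sketch of this resolution scheduling under an assumed data layout (the max-iteration fallback is omitted for brevity):

```python
def choose_image_resolutions(ctrl_levels, ctrl_increments, images_to_ctrls,
                             max_level, threshold=1e-4):
    """Coarse-to-fine scheduling sketch under an assumed data layout:
    ctrl_levels[l] is the pyramid level currently assigned to control point l
    (0 = coarsest), ctrl_increments[l] the magnitude of its last update, and
    images_to_ctrls[i] the control points affecting image i."""
    for l, inc in enumerate(ctrl_increments):
        if inc < threshold and ctrl_levels[l] < max_level:
            ctrl_levels[l] += 1                  # promote a converged control point
    # an image is processed at the coarsest level among its control points
    return {i: min(ctrl_levels[l] for l in ctrls)
            for i, ctrls in images_to_ctrls.items()}
```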

3.5. Online Trajectory Estimation

We apply our optimization strategy in an online system where new RGB and depth images become available incrementally. In our system we choose certain frames as keyframes. The latest keyframe serves as reference image in the direct image alignment; previous keyframes are merely kept for visualization purposes. For a new image the spline is expanded by adding new knots such that the capture interval [t_0, t_0 + t_r) of the new frame lies entirely inside the trajectory interval [t_min, t_max). We initialize the new control points simply with the value of the last control point. Optimization for the new image starts at the lowest resolution. Furthermore, we can remove old frames from the optimization once all their control points have converged. We fuse these converged frames into the latest keyframe as described in the next section. For the online estimation we choose the image resolution only once, before the non-linear optimization, and keep it fixed during the iterations.

At some point the latest image will have too little overlap with the keyframe to reliably estimate the pose. Therefore, we have to choose a new keyframe. The simplest strategy is to choose the last converged frame as the new keyframe. An alternative is to choose the last frame and initially use only the geometric error term with a rolling shutter reference. After the control points of this new keyframe converge, it can be rectified and the photometric and geometric terms can be used. This more involved approach helps to maintain tracking during camera motions with fast viewpoint changes. Furthermore, this strategy allows bootstrapping the system, because the geometric term does not require a rectified reference. We always propagate the depth and RGB values from the last keyframe to the new, rectified keyframe. Therefore, we lose all information about the scene which is not represented in the new keyframe, rendering our trajectory estimation an odometry method. Alternatively, if a consistent model of the scene is available, e.g. as a signed distance function volume [12] or a set of keyframes [10], the reference frames can be synthesized from this model.

3.6. Frame Rectification and Fusion

Once all control points for an RGB-D frame have converged, we rectify the frame and fuse it into the keyframe. To fuse the depth image, we first warp it to the keyframe using the GPU-based technique described by Meilland et al. [10]. To correct for the rolling shutter distortion we use a different transformation per row, which we obtain by evaluating T(t) at the corresponding timestamps. Afterwards, we update for each pixel in the keyframe the weighted average of the depth value. The weights take the distance-based uncertainty of the depth values into account. The forward warping has the benefit that we can fill pixels with missing data.

Figure 2: Illustration of our weighting to suppress motion blur in the fused keyframes: a) rectified input image with motion blur, b) computed blur-aware weights, where green indicates a high and red a low value, c) fused keyframe averaged with constant weights, and d) keyframe fused using blur-aware weights for the average. Note how our weighting scheme retains sharp details.

For the fusion of the RGB images we warp every point of the keyframe into the RGB image, taking the rolling shutter constraint into account, and look up the new RGB measurement. In practice, we also handle self-occlusions of those points. All RGB measurements for a point in the keyframe are combined using a moving average. We propose to use the following weighting function for the RGB measurements, which models the amount of motion blur:

w_{C,i}(p) = \exp\left( -\frac{\lVert x_t - x_{t-t_e} \rVert}{\sigma^2} \right)    (13)

where x_t = π(T_cam T⁻¹(t_{C,i} + t_{i,x}) p) is the pixel location that the point p (in world coordinates) projects to in the i-th RGB image, and x_{t-t_e} = π(T_cam T⁻¹(t_{C,i} + t_{i,x} - t_e) p) is the corresponding pixel at the beginning of the frame exposure. Therefore, ‖x_t - x_{t-t_e}‖ is the length of a line segment approximating the path of the point in the image during the exposure time t_e. The more pixels a point sweeps over, the stronger the motion blur and the lower w_{C,i}(p). In all experiments we set σ = 1. Figure 2 illustrates the effect of our blur-aware weight function in comparison to a simple constant weight. Figure 2a shows an RGB image which was registered to the keyframe. Figure 2b shows the corresponding weights, where green indicates high and red low values. Note how the green spot matches the sharp area in the center of the registered RGB image. The images in the bottom row show the color image of the keyframe after fusing multiple images with constant weights (Figure 2c) and with our proposed weights (Figure 2d). It is clearly visible that our weights suppress the influence of blurred images on the fusion result. Our weighting function, which quantifies the amount of motion blur, is complementary to other, e.g. normal-based, weight functions proposed previously. Whenever we propagate the fused RGB and depth values from the last keyframe to a new one, we also propagate their corresponding weights.
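A sketch of the blur-aware weight (13), with the poses at the start and end of the exposure passed in as hypothetical arguments:

```python
import numpy as np

def blur_aware_weight(p, T_cam, T_inv_end, T_inv_start, f, ox, oy, sigma=1.0):
    """Blur-aware fusion weight (13) for a world point p. T_inv_end is
    T^{-1}(t_{C,i} + t_{i,x}) and T_inv_start is T^{-1}(t_{C,i} + t_{i,x} - t_e),
    i.e. the camera poses at the end and the start of the exposure
    (hypothetical argument names)."""
    def proj(T_inv):
        q = T_cam @ (T_inv @ np.append(p, 1.0))
        return np.array([q[0] / q[2] * f + ox, q[1] / q[2] * f + oy])
    blur_length = np.linalg.norm(proj(T_inv_end) - proj(T_inv_start))
    return np.exp(-blur_length / sigma ** 2)
```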

3.7. Implementation Details

The main computational burden in dense, direct image alignment is the evaluation of the error function and the construction of the normal equations. In both parts we found the operations that evaluate the trajectory T(t) to be most costly.

For the geometric alignment, a separate transformation is required for every point, depending on its image row. Therefore, we precompute the pose of every row for all images currently involved in the optimization once per iteration. Similarly, for the projection into a rolling shutter image we need to repeatedly evaluate the pose and its derivative w.r.t. time. We found it sufficient to use the pose and time derivative of the closest integer row index, which we also precompute for every image included in the optimization. Note that the projection still yields subpixel locations. An even coarser sampling of the poses and velocities, e.g. every 5th or 10th row, might be sufficient. Using only one pose and velocity per image results in a constant velocity model.
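A sketch of the per-row precomputation, assuming a hypothetical pose_at(t) that evaluates the spline:

```python
def precompute_row_poses(pose_at, t0, t_r, h, eps=1e-4):
    """Precompute one pose and one (numeric) time derivative per image row,
    done once per Gauss-Newton iteration. pose_at(t) evaluates the spline
    T(t) as a 4x4 matrix (hypothetical helper); rows are read out uniformly
    over [t0, t0 + t_r)."""
    poses, derivatives = [], []
    for y in range(h):
        t = t0 + t_r * y / h
        poses.append(pose_at(t))
        derivatives.append((pose_at(t + eps) - pose_at(t - eps)) / (2.0 * eps))
    return poses, derivatives
```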

Another costly operation is the computation of the Jacobians and the construction of the normal equations. Lovegrove et al. [9] use numerical derivatives, which is feasible for a small number of feature points but prohibitively expensive for the number of points involved in dense image alignment. Therefore, we derive analytic expressions for the photometric and geometric error terms and a decomposition which reduces the additional effort of the spline representation, compared to a single pose per image, to a factor proportional to the image height h instead of the number of pixels h × w. The derivative of the simplified photometric error (8) w.r.t. the increments to the four control points defining T(t_{I,i} + t_{i,x}) is:

\frac{\partial r_{I,i,j}}{\partial \Delta T_C} = \left. \frac{\partial I_i(\pi(T_{cam}\, T^{-1}\, p'_j))}{\partial T} \right|_{T = T(t_{I,i} + t_{i,x})} \left. \frac{\partial T(t_{I,i} + t_{i,x})}{\partial \Delta T_C} \right|_{\Delta T_C = 0}    (14)

where the first factor is a 1 × 12 matrix well known from dense photometric alignment with a single rigid body motion, and the second factor is a 12 × 24 matrix specific to the spline trajectory representation. For the geometric error with a rectified reference image we obtain a similar expression:

\frac{\partial r_{Z,i,j}}{\partial \Delta T_C} = \left. \frac{\partial \left( Z_{ref}(\pi(T_{ref}^{-1}\, T\, p_j)) - [T_{ref}^{-1}\, T\, p_j]_Z \right)}{\partial T} \right|_{T = T(t_{Z,i} + t_{j,x})} \left. \frac{\partial T(t_{Z,i} + t_{j,x})}{\partial \Delta T_C} \right|_{\Delta T_C = 0}    (15)

with T_{ref} = T(t_{Z,ref}), p_j = \pi^{-1}(x_j, Z_i(x_j)), and t_{j,x} = t_r/h\, [x_j]_y. The key insight is that the second factor of the Jacobian in (15) is constant for all pixels in the same row. Therefore, to build the normal equations we only have to form outer products of 1 × 12 Jacobians and sum 12 × 12 matrices for all pixels of a row, instead of 1 × 24 Jacobians and 24 × 24 matrices. The same holds for (14) if we evaluate the Jacobian only at times corresponding to integer row indices. Nevertheless, for the photometric error, points from the same row in the reference image are scattered to different rows of the i-th image. It is therefore useful to sort the residual terms by row before computing the Jacobians, which ensures that the Jacobian of the pose w.r.t. the control points has to be computed only once per row. We also derived an analytic expression for the Jacobian of a pose w.r.t. its control points, which we provide in the supplementary material due to space limitations.
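The row-wise accumulation can be sketched as follows (assumed data layout; the 12 × 24 block is applied once per row):

```python
import numpy as np

def accumulate_normal_equations(row_blocks):
    """Row-wise accumulation of the normal equations, exploiting that the
    12x24 spline Jacobian (second factor in (14)/(15)) is shared by all
    pixels of a row. Each entry of row_blocks is assumed to hold
    (J1, Js, r, w): per-pixel 1x12 Jacobians stacked as (n, 12), the row's
    12x24 block, residuals (n,) and weights (n,)."""
    A = np.zeros((24, 24))
    b = np.zeros(24)
    for J1, Js, r, w in row_blocks:
        H12 = J1.T @ (w[:, None] * J1)   # weighted 12x12 sum of outer products
        g12 = J1.T @ (w * r)
        A += Js.T @ H12 @ Js             # lift to 24x24 once per row
        b += Js.T @ g12
    return A, b                          # then solve A * delta = -b
```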

4. Evaluation

We evaluate our trajectory estimation approach on simulated and real world datasets. For the experiments on synthetic data we adapt the ICL-NUIM dataset proposed by Handa et al. [4]. We perform the experiments with real data on a set of sequences which we recorded along with groundtruth trajectories from a motion capture system.

We run all experiments on a PC with an Intel Core i7-2600 CPU and 8 GB RAM. The alignment with the geometric error requires on average 210 ms/frame, and 775 ms/frame with the photometric and geometric error. Our C++ implementation computes the error terms and normal equations for the not yet converged images in parallel. We expect to achieve further speedup using a GPU implementation.

4.1. Synthetic Rolling Shutter Dataset

The ICL-NUIM dataset is a synthetic, ray-casted RGB-D dataset. It comprises eight datasets for two scenes, a living room and an office, with four different trajectories each. Besides the groundtruth trajectories, the authors also provide a model of the living room scene and tools to evaluate the reconstruction accuracy. However, Handa et al. created the datasets with the global shutter assumption. Therefore, we extend this dataset by creating four sequences with the rolling shutter model for the living room scene. To do so we raycast every row of each image from a separate camera pose. We obtain the pose of every row by fitting a cumulative B-spline trajectory to the groundtruth trajectories and evaluating it at the appropriate time. We also generate new groundtruth trajectories from the fitted spline, which differ slightly from the original ones. Figure 3 shows an example RGB-D frame from the lr kt2 sequence rendered with the global shutter and rolling shutter models, and the absolute difference of the intensity and depth values. Note how rolling shutter affects the error of most pixels in the depth images, but only pixels close to edges in the color images. We did not simulate the motion blur effect, because it would have increased the raycasting time tremendously.

4.2. Real World Dataset

We ran a second set of experiments on six datasets captured with a PrimeSense Carmine sensor inside a motion capture volume. We could not use any of the existing RGB-D benchmark datasets with groundtruth trajectory (e.g. the TUM RGB-D benchmark [16]), because they only provide pre-registered depth images, which destroys the correspondence between rows in the depth image and their capture time. The six sequences comprise three different motion patterns: translation along the horizontal camera axis (robot t1/t2), rotation around the camera z-axis (robot r1/r2), and a more general scanning motion (table1/2). Table 1 lists the length and the average translational and rotational velocity for each of these datasets. The translational velocities are among the fastest in comparison to the TUM RGB-D benchmark datasets, and the rotational velocities are larger than any present in the TUM RGB-D benchmark. We plan to release all of these new datasets to the public with this publication.

Figure 3: Example of simulated RGB-D images and comparison between the global shutter and rolling shutter models. To visualize the influence of the rolling shutter model we show the intensity difference c) between a) and b), and the depth difference f) between images d) and e). Green corresponds to zero and red to a difference of 10% for the intensity and 0.01 m for the depth. The color images differ mainly at the small number of edge pixels. In contrast, there is a large difference for many pixels in the depth images.

4.3. Results

In total we have ten datasets to evaluate our approach. We compare a baseline method (a variant of [10]) against four variants of our approach. The baseline method performs frame-to-keyframe tracking, estimating a single pose per frame with a geometric error term under a global shutter assumption. Furthermore, we fuse all frames into the current keyframe and choose new keyframes based on the same overlap criterion as in our approach. We generate the four variants of our approach by using either only the geometric error term (G) or additionally the photometric error term (P+G), and by switching between global shutter (GS) and rolling shutter (RS) models. To evaluate the different algorithms we use the absolute trajectory error (ATE) and the relative pose error per second (RPE/s) as proposed by Sturm et al. [16]. Table 2 shows the root mean squared error (RMSE) of the translational and rotational RPE/s for all five algorithms on the ten sequences. For the synthetic and real world datasets we separately report the average drift and the improvement relative to the baseline algorithm. The result of the best performing algorithm for each sequence is marked in bold. Similarly, Table 3 shows the RMSE of the ATE.

Table 1: Statistics of the real world datasets we recorded with groundtruth trajectories from a motion capture system. We performed three different motions: translation along the camera x-axis (robot t1/t2), rotation around the camera z-axis (robot r1/r2), and general motion (table1/2).

Dataset    Length [m]   avg. transl. velocity [m/s]   avg. rot. velocity [deg/s]
robot t1   7.678        0.356                         15.807
robot t2   7.671        0.359                         13.782
robot r1   2.592        0.123                         62.172
robot r2   3.203        0.153                         66.847
table1     11.494       0.423                         23.223
table2     5.273        0.153                         38.582

In general, our algorithms, which estimate a continuous-time trajectory, outperform the baseline algorithm in terms of the RPE and ATE metrics. The large difference on table1/table2, kt0, kt1 and kt3 is mainly due to temporary failures of the baseline algorithm caused by too fast motion or by frames observing only a wall. The spline representation helps to stabilize the optimization in these cases. There is only a small difference between the geometric and the combined photometric and geometric error terms on the real world datasets. We attribute this to the presence of enough geometric structure and to the fact that we neglect motion blur in the photometric error term. In contrast, on the synthetic datasets the inclusion of the photometric error term improves the performance, because of the missing motion blur and the little structure in parts of the sequences. Furthermore, modeling rolling shutter improves the translational RPE and the ATE by about 10%. More interestingly, the improvement w.r.t. rotational drift is around 30%. Considering that rotational drift causes most of the error in large scale mapping, we find this a remarkable improvement.

To summarize, we demonstrate the superior performance of our continuous-time tracking and mapping algorithm on a number of synthetic and real world datasets in comparison to a standard baseline algorithm. The continuous-time trajectory representation improves tracking performance during fast motions. The combination of photometric and geometric error terms helps to stabilize tracking in structureless regions. Modeling the rolling shutter effects increases the precision even further.

Table 2: RMSE values of the translational and rotational drift per second (RPE/s) for five different trajectory estimation methods on six real world and four synthetic datasets. The four spline-based methods use different combinations of geometric (G) and photometric (P) error terms, and global shutter (GS) and rolling shutter (RS) models. The results for real world and synthetic datasets are separately summarized by average RPE/s values and the relative improvement w.r.t. the baseline method. All algorithms estimating a continuous-time trajectory clearly outperform the baseline method.

Dataset       Baseline            Spline+GS+G         Spline+RS+G         Spline+GS+P+G       Spline+RS+P+G
              [m/s]    [deg/s]    [m/s]    [deg/s]    [m/s]    [deg/s]    [m/s]    [deg/s]    [m/s]    [deg/s]
robot t1      0.0149   0.7761     0.0168   0.8098     0.0116   0.5389     0.0133   0.9368     0.0085   0.4857
robot t2      0.0102   0.6679     0.0104   0.6649     0.0100   0.4962     0.0108   0.7730     0.0089   0.4961
robot r1      0.0180   2.1351     0.0177   2.1407     0.0173   1.1222     0.0160   2.1507     0.0156   1.1624
robot r2      0.0112   2.3313     0.0108   2.2961     0.0110   1.1619     0.0114   2.0803     0.0105   1.1911
table1        0.0827   2.2656     0.0362   1.4801     0.0158   0.9305     0.0371   1.4749     0.0160   0.9297
table2        0.1160   4.2465     0.0117   1.6441     0.0132   0.5973     0.0103   1.5749     0.0126   0.5464
average       0.0422   2.0704     0.0173   1.5059     0.0132   0.8078     0.0165   1.4984     0.0120   0.8019
improvement   0%       0%         59.0%    27.3%      68.8%    61.0%      60.9%    27.6%      71.5%    61.3%

lr kt0        0.0162   2.9869     0.0804   0.2673     0.0268   0.1059     0.0075   0.2394     0.0056   0.1170
lr kt1        0.1569   30.7227    0.0047   0.1700     0.0029   0.0625     0.0064   0.1709     0.0024   0.0590
lr kt2        0.0054   0.2744     0.0068   0.2973     0.0044   0.0723     0.0068   0.2128     0.0031   0.0655
lr kt3        0.7420   35.6279    0.0075   0.9161     0.0068   0.9427     0.0164   0.1900     0.0100   0.2482
average       0.2301   17.4030    0.0249   0.4127     0.0102   0.2958     0.0093   0.2033     0.0053   0.1224
improvement   0%       0%         89.2%    97.6%      95.6%    98.3%      96.0%    98.8%      97.7%    99.3%

Table 3: RMSE values of the absolute trajectory error (ATE); cf. the caption of Table 2 for structure and abbreviations. Modeling rolling shutter improves the ATE on the real world sequences by 10%. On the synthetic sequences the spline representation and the inclusion of the photometric term stabilize the trajectory estimate.

Dataset    Baseline [m]   Spline+GS+G [m]   Spline+RS+G [m]   Spline+GS+P+G [m]   Spline+RS+P+G [m]
robot t1   0.0129         0.0169            0.0129            0.0133              0.0092
robot t2   0.0116         0.0161            0.0082            0.0287              0.0145
robot r1   0.0170         0.0148            0.0145            0.0135              0.0137
robot r2   0.0099         0.0105            0.0108            0.0092              0.0083
table1     0.2521         0.0842            0.0379            0.0909              0.0110
table2     0.4675         0.0127            0.0111            0.0144              0.0112
average    0.1285 (0%)    0.0259 (79.9%)    0.0159 (87.6%)    0.0283 (78.0%)      0.0113 (91.2%)

lr kt0     0.1129         0.3980            0.1088            0.0259              0.0186
lr kt1     0.2651         0.0111            0.0108            0.0092              0.0054
lr kt2     0.0209         0.0173            0.0109            0.0082              0.0079
lr kt3     1.2669         0.0607            0.0676            0.0498              0.0210
average    0.4165 (0%)    0.1218 (70.8%)    0.0495 (88.1%)    0.0233 (94.4%)      0.0132 (96.8%)

5. Conclusion

In this paper we introduced a dense tracking method to estimate a continuous-time trajectory from a sequence of unsynchronized and unregistered RGB-D images. The continuous-time representation helps to better constrain individual image poses and to model the rolling shutter of color and depth cameras. Furthermore, this representation has several benefits which are now available to dense, direct tracking methods. Additionally, we showed how to use the continuous-time trajectory to easily quantify the amount of motion blur and suppress its influence on the fused 3D model. Quantitative evaluation on both synthetic and real world datasets shows that the proposed dense alignment with a continuous-time trajectory and rolling shutter model leads to drastic improvements in the trajectory error.

Acknowledgments

We thank the Chair of Information-oriented Control (ITR) at TUM for access to their motion capture system. This work has been partially funded through the ERC grant Convex Vision (#240168) and the ERC Proof of Concept grant CopyMe3D (#632200).

References

[1] S. Baker, E. Bennett, S. B. Kang, and R. Szeliski. Removing rolling shutter wobble. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2392–2399, June 2010.

[2] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard. 3D mapping with an RGB-D camera. IEEE Transactions on Robotics (T-RO), 30(1):177–187, 2013.

[3] P. Furgale, T. D. Barfoot, and G. Sibley. Continuous-time batch estimation using temporal basis functions. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 2088–2095, 2012.

[4] A. Handa, T. Whelan, J. McDonald, and A. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 1524–1531, May 2014.

[5] J. Hedborg, P.-E. Forssen, M. Felsberg, and E. Ringaby. Rolling shutter bundle adjustment. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1434–1441, June 2012.

[6] C. Kerl, J. Sturm, and D. Cremers. Dense visual SLAM for RGB-D cameras. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 2100–2106, Nov 2013.

[7] C. Kerl, J. Sturm, and D. Cremers. Robust odometry estimation for RGB-D cameras. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 3748–3754, May 2013.

[8] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Mixed and Augmented Reality (ISMAR), 2009 8th IEEE International Symposium on, pages 83–86, Oct 2009.

[9] S. Lovegrove, A. Patron-Perez, and G. Sibley. Spline fusion: A continuous-time representation for visual-inertial fusion with application to rolling shutter cameras. In Proceedings of the British Machine Vision Conference, 2013.

[10] M. Meilland and A. Comport. On unifying key-frame and voxel-based dense visual SLAM at large scales. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 3677–3683, Nov 2013.

[11] M. Meilland, T. Drummond, and A. Comport. A unified rolling shutter and motion blur model for 3D visual registration. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2016–2023, Dec 2013.

[12] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 127–136, Oct 2011.

[13] H. Ovren, P.-E. Forssen, and D. Tornqvist. Improving RGB-D scene reconstruction using rolling shutter rectification. In Y. Sun, A. Behal, and C.-K. R. Chung, editors, New Development in Robot Vision, volume 23 of Cognitive Systems Monographs, pages 55–71. Springer Berlin Heidelberg, 2015.

[14] E. Ringaby and P.-E. Forssen. Scan rectification for structured light range sensors with rolling shutters. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1575–1582, Nov 2011.

[15] E. Ringaby and P.-E. Forssen. Efficient video rectification and stabilisation for cell-phones. International Journal of Computer Vision, 96(3):335–352, 2012.

[16] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 573–580, Oct 2012.