
    Published in IET Computer Vision

    Received on 24th September 2012

    Revised on 7th June 2013

    Accepted on 16th July 2013

    doi: 10.1049/iet-cvi.2013.0017

    ISSN 1751-9632

Augmented Lagrangian-based approach for dense three-dimensional structure and motion estimation from binocular image sequences

Geert De Cubber 1,2, Hichem Sahli 1,3

1 Electronics and Information Processing (ETRO), Vrije Universiteit Brussel, Brussels 1040, Belgium
2 Mechanical Engineering, Royal Military Academy of Belgium, Brussels 1000, Belgium
3 Interuniversity Microelectronics Centre (IMEC), Heverlee 3001, Belgium
E-mail: [email protected]

Abstract: In this study, the authors propose a framework for stereo-motion integration for dense depth estimation. They formulate the stereo-motion depth reconstruction problem as a constrained minimisation one. A sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, has been implemented to address the resulting constrained optimisation problem. ALM has been chosen because of its relative insensitivity to whether the initial design points for a pseudo-objective function are feasible or not. The development of the method and results from solving the stereo-motion integration problem are presented. Although the authors' work is not the only one adopting the ALM framework in the computer vision context, to their knowledge the presented algorithm is the first to use this mathematical framework in a context of stereo-motion integration. This study describes how the stereo-motion integration problem was cast in a mathematical context and solved using the presented ALM method. Results on benchmark and real visual input data show the validity of the approach.

    1 Introduction

    1.1 Problem statement

The integration of the stereo and motion depth cues offers the potential of a superior depth reconstruction, as the combination of temporal and spatial information makes it possible to reduce the uncertainty in the depth reconstruction result and to augment its precision. However, this requires the development of a data fusion methodology which is able to combine the advantages of each method, without propagating errors induced by one of the depth reconstruction cues. Therefore the mathematical formulation of the problem of combining stereo and motion information must be carefully considered.

The dense depth reconstruction problem can be cast as a variational problem, as advocated by a number of researchers [1, 2]. The main problem in dense stereo-motion reconstruction is that the solution depends on the simultaneous evaluation of multiple constraints which have to be balanced carefully. This is sketched in Fig. 1, which shows the different constraints to be imposed for a sequence acquired with a moving binocular camera. Consider a pair of rectified stereo images I_1^l, I_1^r at time t = t_0 and a stereo pair I_2^l, I_2^r at time t = t_0 + t_k, with t_k being determined by the frame rate of the camera. A point x_1^l in the reference frame I_1^l can be related to a point x_1^r via the stereo constraint, as well as to a point x_2^l via the motion constraint. Using the stereo and motion constraints in combination, the point x_1^l can even be related to a point x_2^r, via a stereo + motion or a motion + stereo constraint. It is evident that, ideally, all these interrelations should be taken into consideration for all the pixels in all the frames in the sequence. In the following, we present such a methodology for addressing the stereo-motion integration problem for dense reconstruction.

1.2 State-of-the-art

The early work on stereo-motion integration goes back to the approach of Richards [3], relating the stereo-motion integration problem to the human vision system. Based on this analysis, Waxman and Duncan [4] proposed a stereo-motion fusion algorithm. They define a binocular difference flow as the difference between the left and right optical flow fields, where the right flow field is shifted by the current disparity field. In 1993, Li and Duncan [5] presented a method for recovering structure from stereo and motion. They assume that the cameras undergo translation, but no rotational motion. Tests on laboratory scenes presented good results; however, the constraint of having only translational motion is hard to fulfil for a real-world application.

The above-mentioned early work on stereo-motion integration generally considers only sparse features and uses

    www.ietdl.org

    98

© The Institution of Engineering and Technology 2014

IET Comput. Vis., 2014, Vol. 8, Iss. 2, pp. 98-109

    doi: 10.1049/iet-cvi.2013.0017


three-dimensional (3D) tracking techniques [6] or direct methods [7] for reconstruction. Tracking techniques track 3D tokens from frame to frame and estimate their kinematics. The motion computation problem is formulated as a tracking problem and solved using an extended Kalman filter. Direct methods use a rigid-body motion model to estimate relative camera orientation and local ranges for both the stereo and motion components of the data. The obvious disadvantage of sparse reconstruction methodologies is that no densely reconstructed model can be obtained. To overcome this problem, other researchers have proposed model-based approaches [8]. The visible scene surface is represented with a parametrically deformable, spatially adaptive, wireframe model. The model parameters are iteratively estimated using the image intensity matching criterion. The disadvantage of this kind of approach is that it only works well for reconstructing objects that can be easily modelled (small objects, statues, …), and not for unstructured environments like outdoor natural scenes.

Recent approaches to stereo-motion-based reconstruction concentrated more on dense reconstruction. The general idea of these approaches is to combine the left and right optical flows with the disparity field, for example, using space carving [9] or voxel carving [10]. Some researchers [11] emphasise the stereo constraint and only reinforce the stereo disparity estimates using optical flow information, whereas Isard and MacCormick [12] use more advanced belief propagation techniques to find the right balance between the stereo and optical flow constraints. Sudhir et al. [13] model the visual processes as a sequence of coupled Markov random fields (MRFs). The MRF formulation allows the definition of appropriate interactions between the stereo and motion processes and outlines a solution in terms of an appropriate energy function. The MRF property allows modelling the interactions between stereo and motion in terms of local probabilities, specified in terms of local energy functions. These local energy functions express constraints helping the stereo disambiguation by significantly reducing the search space. The integration algorithm as proposed by Sudhir et al. [13] makes the visual processes tightly constrained and reduces the possibility of an error. Moreover, it is able to detect stereo-occlusions and sharp object boundaries in both the disparity and the motion field. However, as this is a local method, it has difficulties when there are many regions with homogeneous intensities. In these regions, any local method of computation of stereo and motion is unreliable. Other researchers (e.g. Larsen et al. in [14]) later improved the MRF-based stereo-motion reconstruction methodology by making it able to operate on a 3D graph that includes both spatial and temporal neighbours and by introducing noise suppression methods.

As an alternative to the MRF-based approach, Strecha and Van Gool [1, 15] presented a partial differential equation (PDE)-based approach for 3D reconstruction from multi-view stereo. Their method builds upon the PDE-based approach for dense optical flow estimation by Proesmans et al. [16] and reasons on the occlusions between stereo and motion to estimate the quality or confidence of correspondences. The evolution of the confidence measures is driven by the difference between the forward and backward flows in the stereo and motion directions. Based on the above-estimated per-pixel and per-depth-cue quality or confidence measures, their weighting scheme guides at every iteration and at every pixel the relative influences of both depth cues during the evolution towards the solution.

Other researchers [17-20] use scene-flow-based methods for stereo-motion integration. Like the optical flow, 3D scene flow is defined at every point in a reference image. The difference is that the velocity vector in a scene-flow field contains not only x- and y-, but also z-velocities.

Zhang and Kambhamettu [17] formulated the problem as computing a 4D vector (u, v, w, d), where (u, v) are the components of the optical flow vector, d is the disparity and w is the disparity motion, at every point of the reference image, where the initial disparity is used as an initial guess. However, with serious occlusion and a limited number of cameras, this formulation is very difficult, because it implies solving for four unknowns at every point. At least four independent constraints are needed to make the algorithm stable. Therefore in [17], constraints on motion, disparity, smoothness and optical flow, as well as confidence measurement on the disparity estimation, have been formulated. The major disadvantage of this approach is its limitation to slowly moving Lambertian scenes under constant illumination.

The method advocated by Pons et al. in [18] handles projective distortion without any approximation of shape and motion and can be made robust to appearance changes. The metric used in their framework is the ability to predict the other input views from one input view and the estimated shape or motion. Their method consists of maximising, with respect to shape and motion, the similarity between each input view and the predicted images coming from the other views. They warp the input images to compute the predicted images, which simultaneously removes projective distortion.

Huguet and Devernay [19] proposed a method to recover the scene flow by coupling the optical flow estimation in both cameras with dense stereo matching between the images, thus reducing the number of unknowns per image point. The main advantage of this method is that it handles occlusions both for optical flow and stereo. In [20], Sizintsev and Wildes extend the scene-flow reconstruction approach by introducing a spatiotemporal quadric element, which encapsulates both spatial and temporal image structure for 3D estimation. These so-called stequels are used for spatiotemporal view matching. Whereas Huguet and Devernay [19] apply a joint smoothness term to all displacement fields, Valgaerts et al. [21] propose a regularisation strategy that penalises discontinuities in the different displacement fields separately.

    Fig. 1 Motion and stereo constraints on a binocular sequence


    1.3 Related work

As can be noted from the overview of the previous section, most of the recent research works on stereo-motion reconstruction use scene-flow-based reconstruction methods. The main disadvantage of 3D scene flow is that it is computationally quite expensive, because of the 4D nature of the problem. Therefore we formulate the stereo-motion depth reconstruction problem as a constrained minimisation one and use a sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, for solving it. This approach was presented originally by De Cubber in [22]. The use of ALM has also been proposed recently by Del Bue et al. [23]; however, they apply the technique only to singular stereo reconstruction and structure from motion, whereas we propose an ALM use for integrated stereo-motion reconstruction.

The augmented Lagrangian (AL)-based stereo-motion reconstruction methodology presented here differentiates itself from the current state-of-the-art in stereo-motion reconstruction by a number of key factors. The processing strategy, depicted in Fig. 2, considers three sources of information for the structure estimation process: left and right proximity maps from motion, and a proximity map from stereo. During optimisation, information from the (central) proximity map from stereo is transferred to the left and right proximity maps, which are the ones actually being optimised simultaneously. During the optimisation process, data is constantly being interchanged between both optimisers, as they are highly dependent. The advantage of this concurrent optimisation methodology is that it provides a symmetric processing cue. This makes it easier to handle the uncertainties induced by the unknown displacements between the different cameras, in comparison with other approaches [13] which consider only one reference image and warp all other images to this reference image for matching and depth estimation. Other researchers have noted this too and have used even more depth or proximity maps. In [1], Strecha and Van Gool combine four proximity maps d_i^l, d_{i+1}^l, d_i^r and d_{i+1}^r, as displayed in Fig. 1. The problem with using so many proximity maps, however, is that the problem size increases drastically, and with it, also the computation time.

The proposed methodology poses the dense stereo-motion reconstruction problem as a constrained optimisation problem and uses the AL to transform the estimation into an unconstrained optimisation problem, which can be solved with a classical method. Other researchers express the stereo-motion reconstruction problem as an MRF [13, 14] or a graph cut [2] optimisation problem. The approach we follow is very natural, as the stereo-motion reconstruction problem is by nature a highly constrained and tightly coupled optimisation problem, and the AL has been proven before [23, 24] to be an excellent method for this kind of problem.

    2 Methodology

    2.1 Depth reconstruction model

The stereo-motion integration problem for dense depth estimation can be regarded as a high-dimensional data fusion problem. In this paper, we formulate the stereo-motion depth reconstruction problem as a constrained minimisation one, with a suitable functional that minimises the error on the dense reconstruction. Fig. 2 illustrates the proposed methodology, where a pair of stereo images at time t is related to a consecutive pair at time t + 1.

Fig. 2 considers a binocular image stream consisting of left and right images of a stereo camera system. The left and right streams are processed individually, using the dense structure-from-motion algorithm proposed by De Cubber and Sahli in [25], resulting in, respectively, left and right proximity maps d^l and d^r. In parallel, the left and right images are combined using a stereo algorithm [26, 27], embedded in the used Bumblebee stereo camera. As a result of this stereo computation, a new proximity map from stereo, d^c, can be defined. The reason for calling this proximity map d^c lies in the fact that it is defined in the reference frame of a virtual central camera of the stereo vision system.

There exist strong interrelations between the different proximity maps d^l, d^c and d^r, which need to be expressed to ensure consistency and to improve the reconstruction result. Therefore we adopt an approach where the left proximity map d^l is optimised, subject to two constraints, relating it to d^c and d^r, respectively. In parallel, the right proximity map d^r is optimised, also subject to two constraints, relating it to d^c and d^l. The compatibility of the left and right proximities is hereby automatically ensured, as both d^l and d^r are related to d^c.

The dense stereo-motion reconstruction problem can thus be stated as the following constrained optimisation problem

Find min_{x ∈ Ω} E(x) subject to: u_i(x) = 0 for i = 1, ..., n   (1)

with E(x) as the objective functional and u_i(x) expressing a number of constraint equations.

Fig. 2 Processing strategy for a binocular sequence: from the left and right image sequences, proximity maps are calculated through stereo and dense structure from motion
These maps are iteratively improved by constrained optimisation, using the AL method


A traditional solving technique for constrained optimisation problems such as the one posed by (1) is the Lagrange multiplier method, which converts a constrained minimisation problem into an unconstrained minimisation problem of a Lagrange function. In theory, the Lagrangian methodology can be used to solve the stereo-motion reconstruction problem; however, to improve the convergence characteristics of the optimisation scheme, it is better [28] to use the augmented Lagrangian (AL) L(x, λ), with λ as the Lagrangian multiplier. The AL, which was presented by Powell and Hestenes in [29, 30], adds a quadratic penalty term to the original Lagrangian

L(x, λ) = E(x) + Σ_{i=1}^{n} λ_i u_i(x) + (ρ/2) Σ_{i=1}^{n} u_i(x)²   (2)

with a penalty parameter ρ > 0.

In the context of dense stereo-motion reconstruction, we seek to simultaneously minimise two energy functions: E^l(d^l), for the left image, and E^r(d^r), for the right image, subject to four constraints:

1. u_lc^l(d^l, d^c) = 0 relates d^l to the proximity map obtained from stereo, d^c.
2. u_lr^l(d^l, d^r) = 0 relates d^l to the proximity map of the right image, d^r.
3. u_rc^r(d^r, d^c) = 0 relates d^r to the proximity map obtained from stereo, d^c.
4. u_rl^r(d^r, d^l) = 0 relates d^r to the proximity map of the left image, d^l.
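The resulting scheme alternates between minimising the AL of (2) for fixed multipliers and updating the multipliers. This can be sketched on a toy scalar problem (the quadratic E, the single constraint u and all names below are illustrative stand-ins, not the image-based energies of this paper):

```python
# Toy illustration of the AL of (2): minimise E(x) = (x - 2)^2
# subject to the single constraint u(x) = x - 1 = 0 (solution x = 1).
def E(x):
    return (x - 2.0) ** 2

def u(x):
    return x - 1.0

def augmented_lagrangian(x, lam, rho):
    # L(x, lambda) = E(x) + lambda * u(x) + (rho / 2) * u(x)^2, cf. (2)
    return E(x) + lam * u(x) + 0.5 * rho * u(x) ** 2

rho = 10.0   # penalty parameter rho > 0
lam = 0.0    # multiplier estimate, initialised at zero
x = 0.0      # infeasible starting point (u(0) != 0)
for _ in range(50):
    # inner minimisation of L in x; closed form here since L is quadratic:
    # dL/dx = 2(x - 2) + lam + rho (x - 1) = 0
    x = (4.0 + rho - lam) / (2.0 + rho)
    lam = lam + rho * u(x)   # multiplier update

print(x, lam)   # approaches the constrained optimum x = 1, lam = 2
```

Note that the iterates remain well behaved although the starting point is infeasible, which is the insensitivity property motivating the choice of ALM above.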

According to the AL theorem and the definition given by (2), we can write the AL for the left image as follows

L^l(d^l, λ_lc^l, λ_lr^l) = E^l(d^l) + λ_lc^l u_lc^l(d^l, d^c) + (ρ/2) [u_lc^l(d^l, d^c)]² + λ_lr^l u_lr^l(d^l, d^r) + (ρ/2) [u_lr^l(d^l, d^r)]²   (3)

For the right image, we have in a similar fashion

L^r(d^r, λ_rc^r, λ_rl^r) = E^r(d^r) + λ_rc^r u_rc^r(d^r, d^c) + (ρ/2) [u_rc^r(d^r, d^c)]² + λ_rl^r u_rl^r(d^r, d^l) + (ρ/2) [u_rl^r(d^r, d^l)]²   (4)

The energy functions in (3) and (4) express the relationship between structure and motion between successive images.

It has to be noted that the approach for solving the reconstruction problem, in principle, is not tied to the formulation of the dense structure-from-motion problem, so any formulation can be chosen. Here, we use the dense structure-from-motion approach presented originally by De Cubber in [22], which formulates dense structure from motion as minimising the following energy functional [25]

E = φ_data + μ φ_regularisation   (5)

The data term is based on the image-derivatives-based optical flow constraint

φ_data = ( I_x [a d + b] + I_y [a′ d + b′] + I_t )²   (6)

where I_x and I_y denote the spatial gradients of the image in the x- and y-directions, I_t denotes the temporal gradient, d is a depth (proximity) parameter and the motion coefficients [a, a′, b, b′] are defined as a function of the camera focal length f and its translation t = (t_x, t_y, t_z) and rotation ω = (ω_x, ω_y, ω_z)

[a; a′] = Q_t t = [ −f t_x + x t_z ; −f t_y + y t_z ]

[b; b′] = Q_ω ω = [ (xy/f) ω_x − (f + x²/f) ω_y + y ω_z ; (f + y²/f) ω_x − (xy/f) ω_y − x ω_z ]   (7)
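The coefficient computation of (7) and the data term of (6) can be sketched as follows; the sign conventions follow the standard instantaneous-motion (optical flow) field and are an assumption where the scan is ambiguous, and the function names are illustrative:

```python
def motion_coefficients(x, y, f, t, omega):
    """Coefficients [a, a', b, b'] of (7) at pixel (x, y)."""
    tx, ty, tz = t          # camera translation
    wx, wy, wz = omega      # camera rotation
    a  = -f * tx + x * tz
    ap = -f * ty + y * tz
    b  = (x * y / f) * wx - (f + x ** 2 / f) * wy + y * wz
    bp = (f + y ** 2 / f) * wx - (x * y / f) * wy - x * wz
    return a, ap, b, bp

def phi_data(Ix, Iy, It, d, x, y, f, t, omega):
    """Data term (6): (Ix [a d + b] + Iy [a' d + b'] + It)^2."""
    a, ap, b, bp = motion_coefficients(x, y, f, t, omega)
    return (Ix * (a * d + b) + Iy * (ap * d + bp) + It) ** 2
```

For a static camera (t = ω = 0) the data term reduces to I_t², as expected for a brightness-constancy residual without motion.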

As expressed by (5), a regularisation is used to filter erroneous reconstruction results and to smooth and extrapolate the structure (depth) over related pixels. A key aspect here is of course to find out which pixels are related (e.g. belonging to the same object at the same distance), such that proximity information can be propagated, and which pixels are not related. Here, we make use of the Nagel-Enkelmann anisotropic regularisation model, as defined in [31]

φ_regularisation = (∇d)^T D(∇I_1) (∇d)   (8)

with D as a regularised projection matrix.

The energy functions E^l(d^l) and E^r(d^r) can then be defined as

E^l(d^l) = φ_data^l(d^l) + μ φ_regularisation^l(d^l)   (9)

E^r(d^r) = φ_data^r(d^r) + μ φ_regularisation^r(d^r)   (10)

with φ_data^l(d^l) and φ_data^r(d^r) as given by (6) for, respectively, the left and right images, and φ_regularisation^l(d^l) and φ_regularisation^r(d^r) the regularisation terms, according to (8). The diffusion parameter μ regulates the balance between the data and regularisation terms. In order to regulate this balance, μ is estimated iteratively, using the methodology described in [22].

The constraints u_ij^i(d^i, d^j), with i, j ∈ {left, centre, right}, express the similarity between an estimated proximity map d^i and another proximity map d^j. In order to calculate this similarity measure, the second proximity map must be warped to the first one. This warping process can be expressed by introducing a warping function ψ = ψ(x, d, ω, t), with d as the proximity, and ω and t as the camera rotation and translation, respectively. ψ allows defining the constraint equations u_ij^i, i, j ∈ {l, r, c}, as errors in the warping

u_ij^i(d^i, d^j) = [ d^i(x) − d^j( x + ψ(x, d^i(x), ω_ji, t_ji) ) ]²   (11)

The first constraint, u_lc^l(d^l, d^c), expresses the similarity between the estimated left proximity map d^l and the proximity map from stereo d^c. The motion that is considered in this case is in fact the displacement between the left camera and the virtual central camera, which is known a priori. Since we consider rectified stereo images, the rotational movement between the cameras is zero (ω_stereo = 0) and the translational movement is along the X-axis over a distance of half the stereo baseline b, such that t_cl = (b/2, 0, 0)^T. For estimating the depth, an iterative


procedure is proposed. Following this methodology, the current estimate of the proximity map d^l is filled in (11). As such, the warping process is integrated in the optimisation scheme and will gradually improve over time. Finally, u_lc^l(d^l, d^c) is given by

u_lc^l(d^l, d^c) = [ d^l(x) − d^c( x + ψ(x, d^l(x), 0, t_cl) ) ]²   (12)

The second constraint, u_lr^l(d^l, d^r), on the left proximity map can be obtained in the same way

u_lr^l(d^l, d^r) = [ d^l(x) − d^r( x + ψ(x, d^l(x), 0, t_st) ) ]²   (13)

Note that, in this case, we use the translation over the whole baseline, t_st = (b, 0, 0)^T, for warping the right proximity map to the left proximity map.

The constraints on the right proximity map are as follows

u_rc^r(d^r, d^c) = [ d^r(x) − d^c( x + ψ(x, d^r(x), 0, −t_st/2) ) ]²   (14)

u_rl^r(d^r, d^l) = [ d^r(x) − d^l( x + ψ(x, d^r(x), 0, −t_st) ) ]²   (15)

By integrating the definitions of the energy functions of (9) and (10), and the constraints (12)-(15), into the formulation of the AL functions, given by (3) and (4), the constrained minimisation problem stated in (1) is now completely defined. How this problem is numerically solved is discussed in the following section.

    2.2 Numerical implementation

    The discrete version of (3) is given by

(L^l)^k_{i,j} = (E^l)^k_{i,j} + (λ_lc^l)^k_{i,j} (u_lc^l)^k_{i,j} + (ρ/2) [(u_lc^l)^k_{i,j}]² + (λ_lr^l)^k_{i,j} (u_lr^l)^k_{i,j} + (ρ/2) [(u_lr^l)^k_{i,j}]²   (16)
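Per pixel, (16) is a scalar expression, so it vectorises directly over whole maps; a minimal numpy sketch, where the arrays are placeholders standing in for the per-pixel energy and constraint values:

```python
import numpy as np

def discrete_al(E_l, u_lc, u_lr, lam_lc, lam_lr, rho):
    """Per-pixel discrete AL of (16), evaluated over whole arrays."""
    return (E_l
            + lam_lc * u_lc + 0.5 * rho * u_lc ** 2
            + lam_lr * u_lr + 0.5 * rho * u_lr ** 2)
```

When both constraint residuals vanish, the AL reduces to the data energy E^l alone, as (16) requires.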

The constraints given by (12) and (13) measure the dissimilarity between the left proximity map and the (warped) central and right proximity maps, respectively. However, these proximity maps are discrete and possibly highly discontinuous, which makes them impractical to work with in an optimisation scheme. Therefore we use an interpolation function f_I(d, x, y) which interpolates the discrete function d at a continuous location (x, y). In this work, we use a bi-cubic spline interpolation function [32], and formulate the discrete version of the constraint φ_lc^l of (12) as

(φ_lc^l)^k_{i,j} = [ (d^l)^k_{i,j} − f_I( (d^c)^k, i − f (b/2) (d^l)^k_{i,j}, j ) ]²   (17)

Similarly, (φ_lr^l)^k_{i,j} is given by

(φ_lr^l)^k_{i,j} = [ (d^l)^k_{i,j} − f_I( (d^r)^k, i − f b (d^l)^k_{i,j}, j ) ]²   (18)
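Such a warped constraint can be sketched with an interpolating bi-cubic spline, here via scipy's RectBivariateSpline; the toy central map, the axis convention of the shifted index and all names are illustrative assumptions:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

rows, cols = 16, 16
i = np.arange(rows, dtype=float)
j = np.arange(cols, dtype=float)
# toy central proximity map, varying along the second axis
d_c = np.fromfunction(lambda r, c: 0.01 * c, (rows, cols))

# interpolating bi-cubic spline f_I(d_c, ., .): kx = ky = 3 by default
f_I = RectBivariateSpline(i, j, d_c)

def phi_lc(d_l, ii, jj, f, b):
    """Discrete constraint (17): (d_l - f_I(d_c, i - f (b/2) d_l, j))^2."""
    shift = f * (b / 2.0) * d_l
    return (d_l - f_I.ev(ii - shift, jj)) ** 2
```

With a zero baseline the warp is the identity, and the residual vanishes wherever d^l equals the stereo map, since an interpolating spline reproduces its data at the grid points.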

The update equations of the Lagrangian multipliers (λ)^k_{i,j} are derived as follows. When the solution x^k converges to a local minimum x*, then the λ^k must converge to the corresponding optimal Lagrange multipliers λ*. This condition can be expressed by differentiating the AL of (2) with respect to x

∇_x L(x, λ) = ∇E(x) + Σ_{i=1}^{n} λ_i ∇u_i(x) + ρ Σ_{i=1}^{n} u_i(x) ∇u_i(x)   (19)

In the local minimum, the optimality conditions on the AL require that ∇_x L(x*, λ*) = 0; comparing this with the optimality condition of the ordinary Lagrangian, we can deduce

λ_i* = λ_i + ρ u_i(x)   (20)

which gives us an update scheme for the Lagrangian multipliers, such that they converge to λ_i*

(λ_lc^l)^{k+1}_{i,j} = (λ_lc^l)^k_{i,j} + ρ (u_lc^l)^k_{i,j}
(λ_lr^l)^{k+1}_{i,j} = (λ_lr^l)^k_{i,j} + ρ (u_lr^l)^k_{i,j}
(λ_rc^r)^{k+1}_{i,j} = (λ_rc^r)^k_{i,j} + ρ (u_rc^r)^k_{i,j}
(λ_rl^r)^{k+1}_{i,j} = (λ_rl^r)^k_{i,j} + ρ (u_rl^r)^k_{i,j}
   (21)

The expression of the energy and the constraint equations completely defines the formulation of the AL of (16), governing the iterative optimisation of the left proximity map d^l. As such, the constrained optimisation problem of (1) is transformed into an unconstrained optimisation problem. To solve this unconstrained optimisation problem, we use a classical numerical solving technique, proposed by Brent in [33]. Brent's method switches between inverse parabolic interpolation and golden section search. Golden section search [34] is a methodology for finding the minimum of a bounded function by successively narrowing the range of values inside which the minimum is known to exist. This range is also updated using inverse parabolic interpolation, but only if the produced result is acceptable. If not, then the algorithm falls back to an ordinary golden section step.
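Per pixel, the inner problem is thus a one-dimensional bounded minimisation, for which scipy's bounded minimiser (a Brent-style combination of golden section and parabolic interpolation) can serve as a sketch; the quadratic stand-ins below are illustrative, not the image energies of (16):

```python
from scipy.optimize import minimize_scalar

def al_pixel(d, lam, rho, d_obs, d_stereo):
    """Toy per-pixel AL: data term (d - d_obs)^2, constraint d - d_stereo."""
    u = d - d_stereo
    return (d - d_obs) ** 2 + lam * u + 0.5 * rho * u ** 2

d_min, d_max = 0.0, 2.0   # per-pixel bracketing interval
res = minimize_scalar(al_pixel, bounds=(d_min, d_max), method='bounded',
                      args=(0.0, 10.0, 1.2, 0.8))
print(res.x)   # minimiser of the pseudo-objective within [d_min, d_max]
```

The `bounded` method searches only inside the supplied interval, which is why good per-pixel bounds on the proximity are needed, as discussed next.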

This optimisation method converges to a minimum within the search interval. Therefore it is crucial that a good initial value is available for all status variables. To estimate this initial value for the proximity field, the dense disparity map from stereo is used. The reason for this is that the camera displacement between the left and right stereo frames is well known and is fixed over time. As such, it is possible to warp the stereo data in the virtual central camera reference frame towards the left and the right image with high accuracy. Applying image warping following the perspective projection model, it is possible to define the equations providing an initial value for the left and right proximity maps d^l and d^r, based on a stereo proximity map d_st

d_initial^l(x, y) = d_st( x − f d_st(x, y) (b/2), y )
d_initial^r(x, y) = d_st( x + f d_st(x, y) (b/2), y )
   (22)

As can be noted, (22) contains no real unknown data, next to the stereo proximity map d_st.


The application of Brent's optimisation method also requires that the minimum and maximum boundaries between which the solution is to be found be known. In our case, it means that minimum and maximum proximity values must be available for each pixel of the left and right images. These minimum and maximum proximity maps are calculated based on the 3σ error interval of the initial value of the proximity maps

d_min^i = d_initial^i − 3σ(d_initial^i)
d_max^i = d_initial^i + 3σ(d_initial^i)
   (23)

where d_initial^l and d_initial^r are calculated according to (22).

For the right proximity map, a set of similar expressions can be found, starting from the AL

(L^r)^k_{i,j} = (E^r)^k_{i,j} + (λ_rc^r)^k_{i,j} (u_rc^r)^k_{i,j} + (ρ/2) [(u_rc^r)^k_{i,j}]² + (λ_rl^r)^k_{i,j} (u_rl^r)^k_{i,j} + (ρ/2) [(u_rl^r)^k_{i,j}]²   (24)

and with the constraints

(φ_rc^r)^k_{i,j} = [ (d^r)^k_{i,j} − f_I( (d^c)^k, i + f (b/2) (d^r)^k_{i,j}, j ) ]²   (25)

(φ_rl^r)^k_{i,j} = [ (d^r)^k_{i,j} − f_I( (d^l)^k, i + f b (d^r)^k_{i,j}, j ) ]²   (26)

Algorithm 1 details the constrained optimisation methodology (Fig. 3). As shown in Fig. 3, there are, in fact, two functions that are optimised at the same time: one using (L^l)^k_{i,j}, which optimises the left proximity map d^l, and one using (L^r)^k_{i,j}, which optimises the right proximity map d^r. In the proposed algorithm, these functions are optimised alternately, hereby always using the latest result for both proximity maps.

An aspect which is not depicted by Algorithm 1 is the choice of the optimal framerate. The underlying structure-from-motion algorithm uses the geometric robust information criterion scoring scheme introduced by Torr in [35] to assess the optimal framerate. This has as an effect that if the camera does not move (no translation and no rotation) between two consecutive time instants, no reconstruction will be performed.

    3 Results and analysis

3.1 Qualitative analysis using a real-world binocular video sequence

3.1.1 Evaluation methodology: The validation and evaluation of a dense stereo-motion reconstruction algorithm requires the use of an image sequence with a moving stereo camera. Hence, we recorded, using a Bumblebee stereo head, an image sequence of an office environment, as illustrated in Fig. 4, denoted hereafter as the Desk sequence. The translation of the camera is mainly along its optical axis (Z-axis) and along the positive X-axis. The rotation of the camera is almost only along the positive Y-axis.

    Fig. 3 Constrained optimisation for binocular depth reconstruction using AL


As can be seen from Fig. 4, the recorded sequence consists of a cluttered environment, presenting serious challenges for any reconstruction algorithm:

  • A cluttered environment with many objects at different scales of depth.
  • Relatively large untextured areas (e.g. the wall in the upper left), making correspondence matching very difficult.
  • Areas with specular reflection (e.g. on the poster in the upper right of the image), violating the Lambertian assumption traditionally made for stereo matching.
  • Variable lighting and heavy reflections (in the window on the upper right), causing saturation effects and incoherent pixel colours across different frames.

We will focus our evaluation on how the presented iterative optimisation methodology deals with these issues and how well it is able to reconstruct the structure of this scene. However, it must not be forgotten that this iterative optimiser is also dependent on an initialisation procedure, which can influence the reconstruction result.

The initialisation step of the iterative optimiser estimates an initial value for the left and right depth fields. This method consists of warping a stereo proximity image to the left and right camera reference frames. The initial values for the left and right proximity maps still contain a lot of "blind spots", areas where no (reliable) proximity data are available. These areas are caused by unsuccessful correspondences in the used stereo vision algorithm, which performs an area-based correlation with sum of absolute differences on band-passed images [26, 27]. This algorithm is fairly robust and has a number of validation steps that reduce the level of noise. However, the method requires texture and contrast to work correctly. Effects such as occlusions, repetitive features and specular reflections can cause problems, leading to gaps in the proximity maps. In the following discussion, we will evaluate how well the proposed dense stereo-motion algorithm copes with these blind spots and whether it is capable of filling in the areas where depth data are missing.

To compare our method with the state of the art, we implemented a more classical dense stereo-motion reconstruction approach. This approach defines classical stereo and motion constraints, based upon the constant image brightness assumption, alongside the Nagel–Enkelmann regularisation constraint. These constraints are integrated into one objective function, which is solved using a traditional trust-region method. As such, this approach presents a relatively simple and straightforward solution. This methodology serves as a baseline benchmarking method for the AL-based stereo-motion reconstruction technique.

Applying this more classical technique to the Desk sequence shown in Fig. 4 results in a depth reconstruction as shown in Fig. 5. Overall, the reconstruction of the proximity field correlates with the physical reality, as imaged in Fig. 4, but there are some serious errors in the reconstructed proximity fields, notably on the board in the

Fig. 4 Some frames of the binocular Desk sequence
a Frame 1, left image
b Frame 1, right image
c Frame 10, left image
d Frame 10, right image


Fig. 5 Proximity maps for different frames of the Desk sequence using the global optimisation algorithm
a Frame 1, left proximity d_l^1
b Frame 1, right proximity d_r^1
c Frame 10, left proximity d_l^10
d Frame 10, right proximity d_r^10

Fig. 6 Proximity maps for different frames of the Desk sequence using the AL algorithm
a Frame 1, left proximity d_l^1
b Frame 1, right proximity d_r^1
c Frame 10, left proximity d_l^10
d Frame 10, right proximity d_r^10


middle of the image. This leads us to conclude that this method is not suitable for high-quality 3D modelling. In the following, we compare these results with those obtained by the proposed AL-based stereo-motion optimisation methodology, using the same input sequence.

3.1.2 Reconstruction results: Fig. 6 shows the reconstructed left and right proximity maps using the algorithm shown in Fig. 3. The reconstructed proximity field correlates very well with the physical nature of the scene. Foreground and background objects are clearly distinguishable. The depth gradients on the left and back walls can be clearly identified, despite the fact that there is very little texture on these walls. The occurrence of specular reflection on the poster does not cause erroneous reconstruction results. The only remaining errors on the proximity field are in fact due to border effects. Indeed, at the lower left of Fig. 6a and the lower right of Fig. 6b, one can note some areas where the regularisation has smoothed out the proximity field. The reason for this lies in the lack of initial proximity data in these areas. Owing to the total absence of proximity information there, the algorithm used the solution from the neighbouring regions. In general, this was performed correctly, but because of the lack of information, the algorithm estimated the direction of regularisation wrongly at these two locations. This is quite a normal side-effect when using area-based optimisation techniques, and it can be solved by extending the image canvas before the calculations.

The result of Fig. 6 can be compared with Fig. 5, which shows the same output, but using the global optimisation approach. From this comparison, it is evident that the result of the AL-based reconstruction technique is far superior to the one using global optimisation. The global optimisation result features numerous problems: erroneous proximity values, under-regularised areas, over-regularised areas and erroneous estimation of discontinuities. None of these problems are present in the result of the AL, as shown in Fig. 6.

To show the applicability of the presented technique for 3D modelling, the individual reconstruction results were integrated to form one consistent 3D representation of the imaged environment. Fig. 7 shows four novel views of the 3D model. From the different novel viewpoints, the 3D structure of the office environment can be clearly deduced, there are no visible outliers, and all items in the scene have been reconstructed, even those with very low texture. This illustrates the capabilities of the proposed AL-based stereo-motion reconstruction technique, which allows the reconstruction of a qualitative 3D model.

3.2 Quantitative analysis using standard benchmark sequences

For quantitative analysis, we compared the performance of the proposed approach with a traditional variational scene-flow-based method using standard benchmark sequences. The selected benchmark sequences are the well-known Cones and Teddy sequences created by

Fig. 7 Reconstructed 3D model of the Desk sequence
a Novel view 1
b Novel view 2
c Novel view 3
d Novel view 4


Fig. 9 Comparison of the reconstruction result using the traditional variational scene-flow method [19] and the proposed method
Top row: reconstructed left depth image using [19]; bottom row: reconstructed left depth image using the proposed method. Left column: Cones sequence; right column: Teddy sequence
a Cones sequence, depth image at t0 using [19]
b Teddy sequence, depth image at t0 using [19]
c Cones sequence, depth image at t0 using the proposed method
d Teddy sequence, depth image at t0 using the proposed method

Fig. 8 Quantitative analysis: input images and ground truth depth maps
Top row: left input image; bottom row: ground truth left depth image. Left column: Cones sequence; right column: Teddy sequence
a Cones sequence, left image at t0
b Teddy sequence, left image at t0
c Cones sequence, ground truth depth image at t0
d Teddy sequence, ground truth depth image at t0


Scharstein and Szeliski [36, 37], shown on the top row of Fig. 8.

As a baseline algorithm, the variational scene-flow reconstruction approach presented by Huguet and Devernay [19] was chosen, as the authors provide the algorithm online, which makes it possible to perform comparison tests. To enable a fair comparison of the stereo-motion reconstruction capabilities of both algorithms, the same base stereo algorithm [38] was used to initialise both methods.

The results of any reconstruction algorithm depend largely on the correct initialisation of the algorithm and the selection of the parameters. In the initialisation phase of the proposed method, the estimation of the motion vectors t_{l,l'} and t_{r,r'} via sparse structure from motion plays an important role. To assess the validity of the motion vector estimation results, it is possible to compare the measured motion with the perceived motion between the subsequent images. For example, for the Cones sequence, the main motion is a horizontal movement, which is correctly expressed by the estimated translation vectors: t_l = [0.0800, 0.3151, 0.0988] and t_r = [0.1101, 0.3131, 0.1255]. Ideally, both vectors should be identical (as both cameras follow an identical motion pattern), which gives an idea of the errors in the motion estimation process.
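This consistency check can be made explicit with a small helper (a hypothetical illustration, not part of the original pipeline) that compares the two estimated translation vectors as printed above:

```python
import numpy as np

def translation_consistency(t_l, t_r):
    """Angular difference (degrees) and relative norm gap between the
    two estimated translation vectors; for a rigidly mounted stereo
    head both quantities should be close to zero."""
    t_l, t_r = np.asarray(t_l, float), np.asarray(t_r, float)
    cosang = np.dot(t_l, t_r) / (np.linalg.norm(t_l) * np.linalg.norm(t_r))
    angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    rel_norm = abs(np.linalg.norm(t_l) - np.linalg.norm(t_r)) / np.linalg.norm(t_l)
    return angle, rel_norm

# Vectors estimated for the Cones sequence (values as printed in the text)
angle, rel = translation_consistency([0.0800, 0.3151, 0.0988],
                                     [0.1101, 0.3131, 0.1255])
```

For the vectors above, the two estimates differ by a few degrees in direction and by a few per cent in magnitude, which quantifies the residual error of the sparse structure-from-motion step.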

Parameter tuning is a process which affects many modern reconstruction algorithms, as the parameter selection makes comparison and application of the algorithms in real situations difficult. For the proposed approach, one parameter is of major importance: the parameter deciding on the balance between the data and the regularisation term. In our experiments, a value of 0.5 was chosen for this parameter, based on previous analysis [25]. A remaining parameter of lesser importance is the threshold for stopping the iterative solver. This parameter is somewhat sequence-dependent, with typical values somewhere between 10 and 20. With regard to the benchmark algorithm by Huguet and Devernay, all parameters were chosen as provided by the authors in their original implementation.

Fig. 9 shows a qualitative comparison of the reconstruction results of both methods. The quantitative evaluation is done by computing the root-mean-square (RMS) error on the depth map, measured in pixels, as presented in Table 1. The proposed AL-based binocular dense structure-from-motion (ALBDSFM) approach has the convenient property of decreasing the residual on the objective function dramatically in the first iteration, whereas convergence slows down in subsequent iterations. For this reason, we also included the results after one iteration in the tables.
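The RMS error used in Table 1 can be computed as follows; a minimal sketch, in which the optional validity mask (for pixels lacking ground truth, e.g. occlusions) is our own assumption about how such benchmarks are typically evaluated:

```python
import numpy as np

def rms_depth_error(estimate, ground_truth, valid=None):
    """Root-mean-square error between an estimated and a ground-truth
    depth/disparity map, in the same units as the maps (pixels here).
    `valid` optionally restricts the error to pixels with ground truth."""
    diff = np.asarray(estimate, float) - np.asarray(ground_truth, float)
    if valid is not None:
        diff = diff[valid]
    return float(np.sqrt(np.mean(diff ** 2)))
```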

As can be noted from Fig. 9 and Table 1, the proposed ALBDSFM algorithm performs better on both the Cones and Teddy sequences. On the Cones sequence, the ALBDSFM approach is better capable of representing the structure of the lattice in the back, whereas this structure is completely smoothed by the SceneFlow algorithm. On the Teddy sequence, both reconstruction results are visually quite similar. It is clear that both reconstruction techniques suffer from over-segmentation. This is a typical problem of the Nagel–Enkelmann regularisation we used, and it can partly be remedied by fine-tuning the regularisation parameters; however, to keep the comparison honest, we did not perform such sequence-specific parameter tuning. The quantitative analysis on the Teddy sequence in Table 1 shows an advantage for the ALBDSFM approach.
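The over-segmentation behaviour comes from the anisotropic diffusion tensor of the Nagel–Enkelmann regulariser [31], which smooths along image edges but damps smoothing across them. A sketch of that tensor (the value of the contrast parameter lambda here is purely illustrative, not the one used in the experiments):

```python
import numpy as np

def nagel_enkelmann_tensor(Ix, Iy, lam=1.0):
    """Nagel-Enkelmann diffusion tensor built from the image gradient
    (Ix, Iy): the outer product of the gradient-perpendicular vector
    (-Iy, Ix), plus lam**2 times the identity, normalised by
    |grad I|**2 + 2*lam**2. Small lam preserves edges aggressively
    (risking over-segmentation); large lam blurs discontinuities."""
    norm = Ix**2 + Iy**2 + 2.0 * lam**2
    D = np.empty(Ix.shape + (2, 2))
    D[..., 0, 0] = (Iy**2 + lam**2) / norm
    D[..., 0, 1] = (-Ix * Iy) / norm
    D[..., 1, 0] = D[..., 0, 1]
    D[..., 1, 1] = (Ix**2 + lam**2) / norm
    return D
```

In flat regions (zero gradient) the tensor reduces to isotropic smoothing; at a strong edge, smoothing along the edge direction survives while diffusion across it is suppressed by a factor of roughly lam**2 / |grad I|**2.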

Table 2 gives an overview of the total processing times required for both algorithms. It must be noted that all experiments were performed on an Intel Core i5 central processing unit (CPU) at 1.6 GHz. The SceneFlow algorithm is a C++ application (available at http://devernay.free.fr/vision/varsceneflow), whereas the ALBDSFM is implemented in MATLAB. While neither algorithm can be called fast, it is clear that the ALBDSFM approach is much faster than the SceneFlow implementation. The processing time is mostly dependent on the computational cost of a single iteration (within the while-loop of Algorithm 1) and on the number of iterations, as the initialisation and stereo computation steps only take a few seconds. As the iteration step consists of a double optimisation step using Brent's method, its computational complexity is of the order of O(2n²), with n the number of image pixels. When high-resolution images are used, the computational cost quickly rises, which explains the relatively large processing times. However, neither of these implementations makes use of multi-threading or graphics processing unit (GPU) optimisations, so large speed gains could be obtained by applying such optimisations.
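The per-pixel double optimisation can be sketched as follows. Golden-section search stands in here for the Brent routine [33] actually used, and the objective callbacks, bounds, and sweep structure are hypothetical placeholders for the pseudo-objective of Algorithm 1; the sketch only illustrates why the cost scales with the pixel count.

```python
import numpy as np

def golden_min(f, a, b, tol=1e-6):
    """Derivative-free scalar minimisation by golden-section search,
    a simpler relative of Brent's method. Re-evaluates f each step
    for clarity rather than caching, as efficiency is not the point."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
            c, d = b - invphi * (b - a), c
        else:
            a = c
            c, d = d, a + invphi * (b - a)
    return 0.5 * (a + b)

def depth_sweep(depth_l, depth_r, objective_l, objective_r, lo=0.1, hi=10.0):
    """One sweep of the double optimisation step: every pixel of the
    left and then the right depth field is refined by an independent
    scalar minimisation, so each sweep performs two minimisations per
    pixel and the total cost grows with the image resolution."""
    for field, obj in ((depth_l, objective_l), (depth_r, objective_r)):
        h, w = field.shape
        for y in range(h):
            for x in range(w):
                field[y, x] = golden_min(lambda dv: obj(y, x, dv), lo, hi)
    return depth_l, depth_r
```

As the two per-pixel minimisations are mutually independent, this inner loop is exactly the kind of workload that multi-threading or a GPU implementation, mentioned above, would parallelise well.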

    4 Conclusions

The combination of spatial and temporal visual information makes it possible to achieve high-quality dense depth reconstruction, but comes at the cost of a high computational complexity. To this extent, we presented a novel solution to integrate the stereo and motion depth cues, by simultaneously optimising the left and right proximity fields using the AL method. The main advantage of our algorithm is the ability to exploit all the available constraints in one minimisation framework. Another advantage is that the framework is able to incorporate any given stereo reconstruction methodology. The algorithm has been implemented and applied on real imagery as well as benchmarks. A comparison of the proposed method with the variational scene-flow method shows that the quality of the obtained results far exceeds that obtained using the traditional method. The added quality with respect to normal stereo comes with a penalty of increased processing time, which remains substantial. Even though future optimisation of the implementation, by considering, for example, a GPU implementation, will certainly further reduce the processing time, we consider that the proposed approach can already be effectively used at this moment in an off-line production environment, where the proposed 3D

Table 1 RMS error in pixels on the different sequences using both methods

RMS error                      Cones   Teddy
ALBDSFM (1 iteration)          2.411   5.002
ALBDSFM (convergence)          2.381   4.961
Variational scene flow [19]    9.636   8.650

Table 2 Total processing time in minutes on the different sequences using both methods

Processing time (min)          Cones   Teddy
ALBDSFM (1 iteration)            24      24
ALBDSFM (convergence)            74     205
Variational scene flow [19]     257     243


reconstruction methodology presents an excellent reconstruction tool allowing high-quality 3D reconstruction from binocular video.

    5 Acknowledgment

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 285417.

    6 References

1 Strecha, C., Van Gool, L.J.: 'Motion–stereo integration for depth estimation'. ECCV, 2002, no. 2, pp. 170–185
2 Worby, J.A.: 'Multi-resolution graph cuts for stereo-motion estimation'. Master's thesis, University of Toronto, 2007
3 Richards, W.: 'Structure from stereo and motion', J. Opt. Soc. Am., 1985, 2, pp. 343–349
4 Waxman, A., Duncan, J.: 'Binocular image flows: steps toward stereo-motion fusion', IEEE Trans. Pattern Anal. Mach. Intell., 1986, 8, (6), pp. 715–729
5 Li, L., Duncan, J.: '3-D translational motion and structure from binocular image flows', IEEE Trans. Pattern Anal. Mach. Intell., 1993, 15, (7), pp. 657–667
6 Zhang, Z., Faugeras, O.D.: 'Three-dimensional motion computation and object segmentation in a long sequence of stereo frames', Int. J. Comput. Vis., 1992, 7, (3), pp. 211–241
7 Hanna, K.J., Okamoto, N.E.: 'Combining stereo and motion analysis for direct estimation of scene structure'. ICCV, 1993, pp. 357–365
8 Malassiotis, S., Strintzis, M.G.: 'Model-based joint motion and structure estimation from stereo images', Comput. Vis. Image Underst., 1997, 65, (1), pp. 79–94
9 Kutulakos, K.N., Seitz, S.M.: 'A theory of shape by space carving', Int. J. Comput. Vis., 2000, 38, (3), pp. 199–218
10 Neumann, J., Aloimonos, Y.: 'Spatio-temporal stereo using multi-resolution subdivision surfaces', Int. J. Comput. Vis., 2002, 47, (1–3), pp. 181–193
11 Gong, M.: 'Enforcing temporal consistency in real-time stereo estimation'. ECCV, 2006, pp. 564–577
12 Isard, M., MacCormick, J.: 'Dense motion and disparity estimation via loopy belief propagation'. ACCV, 2006, pp. 32–41
13 Sudhir, G., Banerjee, S., Biswas, K.K., Bahl, R.: 'Cooperative integration of stereopsis and optic flow computation', J. Opt. Soc. Am. A, 1995, 12, (12), pp. 2564–2572
14 Larsen, E.S., Mordohai, P., Pollefeys, M., Fuchs, H.: 'Temporally consistent reconstruction from multiple video streams'. ICCV, 2007, pp. 1–8
15 Strecha, C., Van Gool, L.: 'PDE-based multi-view depth estimation'. First Int. Symp. 3D Data Processing Visualization and Transmission (3DPVT'02), 2002, vol. 416
16 Proesmans, M., van Gool, L., Pauwels, E., Oosterlinck, A.: 'Determination of optical flow and its discontinuities using non-linear diffusion'. ECCV, 1994, pp. 295–304
17 Zhang, Y., Kambhamettu, C.: 'On 3-D scene flow and structure recovery from multiview image sequences', IEEE Trans. Syst. Man Cybern. B, 2003, 33, (4), pp. 592–606
18 Pons, J.P., Keriven, R., Faugeras, O.: 'Modelling dynamic scenes by registering multi-view image sequences'. Int. Conf. Computer Vision and Pattern Recognition, 2005, vol. 2, pp. 822–827
19 Huguet, F., Devernay, F.: 'A variational method for scene flow estimation from stereo sequences'. ICCV, 2007, pp. 1–7
20 Sizintsev, M., Wildes, R.: 'Spatiotemporal stereo and scene flow via stequel matching', IEEE Trans. Pattern Anal. Mach. Intell., 2012, 34, (6), pp. 1206–1219
21 Valgaerts, L., Bruhn, A., Zimmer, H., Weickert, J., Stoll, C., Theobalt, C.: 'Joint estimation of motion, structure and geometry from stereo sequences'. ECCV, 2010
22 De Cubber, G.: 'Variational methods for dense depth reconstruction from monocular and binocular sequences'. PhD thesis, Vrije Universiteit Brussel, March 2010
23 Del Bue, A., Xavier, J., Agapito, L., Paladini, M.: 'Bilinear modelling via augmented Lagrange multipliers (BALM)', IEEE Trans. Pattern Anal. Mach. Intell., 2012, 34, (8), pp. 1496–1508
24 Nocedal, J., Wright, S.J.: 'Numerical optimization', Springer Series in Operations Research (Springer, 1999, 2nd edn.)
25 De Cubber, G., Sahli, H.: 'Partial differential equation-based dense 3D structure and motion estimation from monocular image sequences', IET Comput. Vis., 2012, 6, (3), pp. 174–185
26 Fua, P.: 'Combining stereo and monocular information to compute dense depth maps that preserve depth discontinuities'. 12th Int. Joint Conf. Artificial Intelligence, 1991, pp. 1292–1298
27 Murray, D., Little, J.J.: 'Using real-time stereo vision for mobile robot navigation', Auton. Robots, 2000, 8, (2), pp. 161–171
28 Bertsekas, D.P.: 'Constrained optimization and Lagrange multiplier methods' (Athena Scientific, 1996)
29 Powell, M.J.D.: 'A method of nonlinear constraints in minimization problems', in 'Optimization' (Academic Press, London, 1969)
30 Hestenes, M.R.: 'Multiplier and gradient methods', J. Optim. Theory Appl., 1969, 4, pp. 303–320
31 Nagel, H., Enkelmann, W.: 'An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences', IEEE Trans. Pattern Anal. Mach. Intell., 1986, 8, (5), pp. 565–593
32 Keys, R.: 'Cubic convolution interpolation for digital image processing', IEEE Trans. Acoust. Speech Signal Process., 1981, 29, (6), pp. 1153–1160
33 Brent, R.P.: 'Algorithms for minimization without derivatives' (Prentice-Hall, Englewood Cliffs, NJ, 1973)
34 Forsythe, G.E., Malcolm, M.A., Moler, C.B.: 'Computer methods for mathematical computations' (Prentice-Hall, 1976)
35 Torr, P.H.S.: 'Bayesian model estimation and selection for epipolar geometry and generic manifold fitting', Int. J. Comput. Vis., 2002, 50, (1), pp. 35–61
36 Scharstein, D., Szeliski, R.: 'A taxonomy and evaluation of dense two-frame stereo correspondence algorithms', Int. J. Comput. Vis., 2002, 47, (1–3), pp. 7–42
37 Scharstein, D., Szeliski, R.: 'High-accuracy stereo depth maps using structured light'. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR 2003), Madison, WI, USA, 2003, vol. 1, pp. 195–202
38 Felzenszwalb, P.F., Huttenlocher, D.P.: 'Efficient belief propagation for early vision', Int. J. Comput. Vis., 2006, 70, (1), pp. 1–26
