
3D Traffic Scene Understanding from Movable Platforms

Andreas Geiger Martin Lauer Christian Wojek Christoph Stiller Raquel Urtasun

Abstract—In this paper, we present a novel probabilistic generative model for multi-object traffic scene understanding from movable platforms which reasons jointly about the 3D scene layout as well as the location and orientation of objects in the scene. In particular, the scene topology, geometry and traffic activities are inferred from short video sequences. Inspired by the impressive driving capabilities of humans, our model does not rely on GPS, lidar or map knowledge. Instead, it takes advantage of a diverse set of visual cues in the form of vehicle tracklets, vanishing points, semantic scene labels, scene flow and occupancy grids. For each of these cues we propose likelihood functions that are integrated into a probabilistic generative model. We learn all model parameters from training data using contrastive divergence. Experiments conducted on videos of 113 representative intersections show that our approach successfully infers the correct layout in a variety of very challenging scenarios. To evaluate the importance of each feature cue, experiments using different feature combinations are conducted. Furthermore, we show how by employing context derived from the proposed method we are able to improve over the state-of-the-art in terms of object detection and object orientation estimation in challenging and cluttered urban environments.

Index Terms—3D Scene Understanding, Autonomous Driving, 3D Scene Layout Estimation


1 INTRODUCTION

Recent scientific and engineering progress in advanced driver assistance systems and autonomous mobile robots makes us believe that cars will be able to drive without human intervention in the near future [31]. While autonomous driving on highways has been successfully demonstrated for several decades [21], navigating urban environments remains an unsolved problem. This is mainly due to the complexity of urban traffic situations and the limited capabilities of current onboard sensors and image understanding algorithms. In recent years, the need for solving the onboard scene understanding problem has been bypassed thanks to the use of manually labeled maps and GPS localization systems, e.g., by competitors at the DARPA Urban Challenge [13] or Google's autonomous car. While leading to impressive solutions, this approach suffers from the fact that up-to-date sub-meter accurate maps will most likely be impossible to acquire in practice. Furthermore, GPS signals are not always available, and localization can become imprecise in the presence of skyscrapers, tunnels or jammed signals. We believe that safe operation of an autonomous car can only be guaranteed if the car can interpret its environment reliably from its own sensory

• A. Geiger is with the MPI for Intelligent Systems in Tübingen and with the Institute of Measurement and Control, KIT, Germany. E-mail: [email protected]

• M. Lauer and C. Stiller are with the Institute of Measurement and Control, Karlsruhe Institute of Technology, Germany. E-mail: {lauer,stiller}@kit.edu

• C. Wojek is with Carl Zeiss Corporate Research, Germany. E-mail: [email protected]

• R. Urtasun is with the Toyota Technological Institute at Chicago, USA. E-mail: [email protected]

input, just like a human driver can navigate unknown environments even without maps [12]. Video cameras are particularly appealing sensors in this context as they are cheap and often readily available in modern vehicles. But what makes the interpretation of urban traffic scenarios so challenging compared to highways and rural roads?

First, the traffic situations are very complex. Many different traffic participants may be present and the geometric layout of roads and crossroads is more variable than the geometry of highways. Unfortunately, the availability of unambiguous visual features is limited since road markings and curb stones are often missing or occluded. Difficult illumination conditions such as cast shadows caused by vegetation or infrastructure easily confuse image processing algorithms. Furthermore, the limited aperture angle of onboard cameras, their low mount point and the limited depth perception of stereo make the inference problem very difficult. As a consequence, only objects close to the observer can be located reliably.

In this paper, we present a method that is able to robustly deal with the inherent difficulties in onboard recognition of urban intersections. Our approach reasons jointly about the 3D scene layout of intersections as well as the location and orientation of objects in the scene. We refer the reader to Fig. 1 for an illustration. Towards this goal we exploit a larger number of visual cues than existing approaches, which typically rely on road markings that are often unavailable. In particular, we propose five different visual cues, which describe the static environment encoded in terms of occupancy grids, semantic labels and vanishing points, as well as the dynamic information of vehicles encoded by scene


Fig. 1: 3D Intersection Understanding. Our system makes use of monocular feature cues (left: vehicle tracklets, scene labels, vanishing points) and stereo feature cues (middle: occupancy grid and 3D scene flow in bird's eye view, with unobserved, freespace and occupied cells) to infer the road layout and the location of traffic participants in the scene (right) from short video sequences. The observer is depicted in black.

flow and vehicle tracklets. As each of these features might be ambiguous and lead to wrong interpretations of the scene, we introduce a probabilistic model that integrates all cues in a principled manner. Our approach is composed of a scalable geometry model for intersections and a probabilistic model that relates the geometry of an intersection to the visual features. We derive learning and inference procedures for this model and demonstrate its performance on video sequences from real-world intersections. Our dataset and a MATLAB/C++ reference implementation are made publicly available.1

1. http://www.cvlibs.net/projects/intersection

A preliminary version of this paper appeared in [30], [33]. Herein, we unify both models and perform a greatly enriched analysis on a variety of different feature combinations, yielding novel insights into the importance of the individual feature cues. Moreover, we propose a principled way to learn the parameters of our model. As the partition function of our joint distribution is intractable, we re-formulate the problem in terms of a Gibbs random field and derive all required potentials. We show how the gradients of this function can be approximated by drawing samples from the model distribution, and learning can be simply performed by gradient ascent on the log-likelihood.

This paper is organized as follows. Section 2 gives an overview of related work. In Section 3 we present our probabilistic intersection model as well as our learning and inference procedures. Section 4 gives details on the computation of the visual features. Finally, Section 5 describes the data collection process as well as our experiments. Conclusions are drawn in Section 6.

2 RELATED WORK

In 1986, Ernst Dickmanns and his team equipped a van with cameras and demonstrated the first self-driving car on well-marked streets (i.e., highways) without traffic [21]. Motivated by this success, several efforts have been conducted towards autonomous driving in real traffic [10], [29], [51], triggering research on lane estimation [3], [18], [21], [41], [49] and road detection [1], [2], [17] algorithms. However, simplifying assumptions were made such as low-traffic environments, manual longitudinal control or human intervention for lane changes and highway exits. While the DARPA Urban Challenge [13] came closer to challenging urban traffic situations, the streets were wider than usual, the field of view was unobstructed and only a very limited number of traffic participants were present. Furthermore, GPS in combination with manually annotated maps was necessary for localization and navigation. Subsequently, Google gathered a team of engineers to equip a Toyota Prius with self-driving capabilities. Similarly to the participants of the Urban Challenge, Google's driver-less car is equipped with a Velodyne 3D laser scanner for perception and requires manually annotated maps at lane-level accuracy for path planning. In contrast, the approach presented in this paper aims at analyzing complex and cluttered scenes in the absence of maps or 3D point clouds.

For a long time, intersection understanding has been recognized as a difficult problem [16], [22], [24], [35], [53]. For instance, Luetzeler and Dickmanns [48] extract local image features and match these to a T-shaped intersection model that involves several parameters. They use a multifocal active camera system that allows the detection of missing lane boundary segments as an indicator for intersections. Subsequently, the segments are matched with a road map and tracked over time using a Kalman Filter. Zhu et al. [64] use a laser scanner for intersection recognition by searching for positions in front of the vehicle that may serve as a source for long virtual beams that stay collision free in a 2D static grid map. While all existing approaches focus on a small number of simple features such as road markings or 2D grids, in this paper we argue that a bigger picture of the scene that integrates several sources of information in a robust way is able to handle challenging real-world scenarios. In contrast to [64], our method is also able to deal with stereo information, which is much noisier than laser-based measurements.

In computer vision, a large body of work has focused on estimating 3D from single images [36], [39], [50], [54], [55]. Often, a Manhattan world [7], [43] is assumed to infer vanishing points from line segments. In addition, several methods try to infer the 3D locations of objects in outdoor scenarios [6], [20], [38]. In [14], [23], [60] the camera pose and the location of objects are inferred jointly. Structure-from-motion point clouds are leveraged in [11] to segment and semantically classify the environment. Recently, Singh and Kosecka [56] have proposed a semantic model for urban scene recognition segmenting buildings, road, sky, cars and trees. They employ a trained classifier for pixelwise scene classification in a panoramic image. Unfortunately, most of the existing 3D layout estimation techniques are mainly qualitative, do not model object dynamics, suffer from clutter and lack the level of accuracy necessary for real-world applications such as autonomous driving. Existing methods that take objects into account usually model the scene in terms of a simple ground plane and thus are not able to draw conclusions from the complex interplay of the objects with the larger scene layout. In contrast, we propose a method that is able to extract accurate geometric information by reasoning jointly about static and dynamic elements as well as their interplay.

For a long time, dynamic objects have been considered either in isolation [5], [26], [27], [52] or using simple motion [9], [23], [42], [47], [63] or social interaction [15], [62] models. For example, Choi et al. [15] introduce a hierarchy of activities, modeling the behavior of groups. Methods for unsupervised activity recognition and abnormality detection [44], [59] are able to recover spatio-temporal dependencies using video sequences from a static camera mounted on top of a building. While promising results have been shown, the interplay of objects with their environment is neglected and the focus is put on surveillance scenarios with a fixed camera viewpoint, limiting their applicability. In contrast, the method developed in this paper infers semantics at a higher level, such as multi-object traffic patterns at intersections, in order to improve layout and object estimation. Importantly, we perform inference on scenes that we have never seen before, and our viewpoint is substantially lower compared to the surveillance scenario, which renders the problem very challenging.

3 URBAN SCENE UNDERSTANDING

In this paper, we tackle the problem of understanding complex 3D traffic scenes from short video sequences. In particular, we are interested in inferring the scene layout (e.g., number and location of streets) as well as the 3D location and orientation of traffic participants (e.g., cars) from short video sequences captured onboard a driving vehicle. We assume a flat road surface and model the scene layout and all objects in the road coordinate system, which is located directly below the left camera in the last frame of the video, using the yaw angle and coordinate axes illustrated in Fig. 2 (right) and Fig. 7.

Fig. 2: Topology Model (left) and Geometry Model (right) in bird's eye perspective. The gray shaded areas (left) illustrate the flexibility of the crossing street (α).

The number and location of streets and vehicles are related to the image observations through our probabilistic model. In this section, we introduce all relevant notation, develop the geometric and probabilistic models and describe our inference and parameter learning procedure.

3.1 Geometric Model

Our geometric model is inspired by typical traffic scenes. We assume that the layout of the scene is dominated by up to four roads intersecting at a single location, the intersection center. All vehicles are either parked at the side of the road or drive on lanes and respect some basic rules such as right-handed traffic. Lanes are modeled using B-splines. All inbound and outbound streets are connected with a lane, except for U-turns. Road boundaries determine the border between drivable regions and areas that are likely to contain buildings and infrastructure.

Let R = {κ, c, w, r, α} be the set of random variables describing the road layout, where κ ∈ {1, . . . , 7} denotes the topology of the intersection (see Fig. 2) and c = (x, z)^T ∈ R^2 is the intersection center. Let w ∈ R+ be the street width and r ∈ [−π/4, +π/4] the global rotation, i.e., the observer's yaw orientation with respect to the incoming street. We define α ∈ [−π/4, +π/4] as the crossing angle, i.e., the relative orientation of the crossing street with respect to the incoming street. For simplicity, we assume that all opposing intersection arms are collinear and all streets share the same width. All variables are illustrated in Fig. 2. Note that our model is able to express straight roads, turns, three-armed and four-armed intersections.

For simplicity, we restrict our focus to two lanes per street, one incoming and one outgoing lane for each intersection arm. As vehicles are allowed to cross the intersection in any possible direction, we have K(K−1) lanes for a K-armed intersection. For each street we model two parking areas, one at the left side and one at the right side, yielding 2K parking areas in total. Two (out of six) lanes of a 3-armed intersection as well as one parking area are illustrated in Fig. 3 (left).

Fig. 3: Lane/Parking Areas (left) and Lane B-splines (right). Lane centerlines are defined via B-splines and parking areas are located at each road side. For tractability of our tracklet likelihood, all lanes/parking areas are discretized at 1 meter intervals (see Section 3.3.2).

Lane centerlines are modeled using quadratic B-splines [19] governed by five control points {q_1, . . . , q_5}, which are located at the center of the lane as illustrated in Fig. 3 (right). The knot vector controlling the shape of the B-splines was chosen to be (0, 0, 0, 0.1, 0.9, 1, 1, 1)^T, forcing the spline to interpolate all but the central control point. Empirically this resulted in realistic curvatures. Given all lane splines and all parking areas, we equidistantly define discrete vehicle locations s at 1 m intervals, as illustrated in Fig. 3 (left), to allow for efficient inference of the scene layout and the vehicle locations.
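To make the lane representation concrete, the following minimal C++ sketch evaluates a quadratic B-spline centerline with the knot vector stated above and discretizes it at roughly 1 m arc-length intervals. The control points, the dense sampling resolution and the output format are purely illustrative assumptions and are not taken from the reference implementation.

// Quadratic B-spline lane centerline: a minimal sketch, not the authors' code.
#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

struct Point2 { double x, z; };

// Cox-de Boor recursion for the i-th B-spline basis of degree p over knots U.
double basis(int i, int p, double u, const std::vector<double>& U) {
  if (p == 0)
    return (u >= U[i] && u < U[i + 1]) ? 1.0 : 0.0;
  double left = 0.0, right = 0.0;
  if (U[i + p] > U[i])
    left = (u - U[i]) / (U[i + p] - U[i]) * basis(i, p - 1, u, U);
  if (U[i + p + 1] > U[i + 1])
    right = (U[i + p + 1] - u) / (U[i + p + 1] - U[i + 1]) * basis(i + 1, p - 1, u, U);
  return left + right;
}

// Evaluate the quadratic spline defined by 5 control points and the knot
// vector (0 0 0 0.1 0.9 1 1 1)^T used in the paper.
Point2 evalLane(const std::array<Point2, 5>& q, double u) {
  const std::vector<double> U = {0, 0, 0, 0.1, 0.9, 1, 1, 1};
  if (u >= 1.0) u = 1.0 - 1e-9;  // keep the half-open basis well defined at u = 1
  Point2 p{0.0, 0.0};
  for (int i = 0; i < 5; ++i) {
    double b = basis(i, 2, u, U);
    p.x += b * q[i].x;
    p.z += b * q[i].z;
  }
  return p;
}

// Discretize the centerline at roughly 1 m spacing by dense sampling and
// accumulating arc length, yielding the vehicle locations s of Section 3.1.
std::vector<Point2> discretize(const std::array<Point2, 5>& q, double step = 1.0) {
  std::vector<Point2> pts{evalLane(q, 0.0)};
  double acc = 0.0;
  Point2 prev = pts.front();
  for (int k = 1; k <= 1000; ++k) {
    Point2 cur = evalLane(q, k / 1000.0);
    acc += std::hypot(cur.x - prev.x, cur.z - prev.z);
    if (acc >= step) { pts.push_back(cur); acc = 0.0; }
    prev = cur;
  }
  return pts;
}

int main() {
  // Hypothetical control points for a gentle right turn (illustrative values).
  std::array<Point2, 5> q = {{{0, 0}, {0, 10}, {2, 20}, {10, 25}, {20, 25}}};
  for (const Point2& p : discretize(q))
    std::printf("%.2f %.2f\n", p.x, p.z);
  return 0;
}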

3.2 Image Evidence

Besides the geometric model, we define a probabilistic model to explain the image evidence E = {T, V, S, F, O} with vehicle tracklets T, vanishing points V, semantic scene labels S, scene flow F and occupancy grid O. All features are mapped into the last frame of each sequence using visual odometry.

Let T = {t_1, . . . , t_{N_t}} be the set of vehicle tracklets that have been detected in the sequence. A vehicle tracklet t is defined as a sequence of object detections projected into bird's eye perspective, t = {d_1, . . . , d_{M_d}}. Here, d = (f_d, m_d, S_d, o_d) denotes a single object detection with f_d ∈ N the frame number, m_d ∈ R^2 and S_d ∈ R^{2×2} the mean and covariance describing the object location in road coordinates, and o_d ∈ R^8 the probability of the object facing into each of eight possible directions.

Let V = {v_1, . . . , v_{N_v}} be the set of vanishing points. We detect up to two vanishing points (N_v ∈ {0, 1, 2}) and represent each vanishing point by a single rotation angle around the yaw axis of the road coordinate system, i.e., v_i ∈ [0, π). The vertical vanishing point is non-informative for our task and not considered here. Let S = {s_1, . . . , s_{N_s}} be the set of semantic labels. Here, N_s is the number of image patches of size n_s × n_s pixels on a regular grid and s_i ∈ ∆^2 is a discrete probability distribution over the semantic categories road, background and sky. Both vanishing points and scene flow are computed in the last frame of each sequence as this yielded a good compromise between the quality of the results and computation time. Further, denote by F = {f_1, . . . , f_{N_f}} the set of scene flow vectors capturing the 3D motion of the scene not explained by the observer's egomotion. Each flow vector f = (p_f, q_f) is defined by its location p_f ∈ R^2 and velocity q_f ∈ R^2 on the road plane. All velocity vectors are normalized to ‖q_f‖_2 = 1 as our scene flow model does not explicitly reason about vehicle velocities. The occupancy grid O = {ρ_1, . . . , ρ_{N_o}} is represented by N_o cells of size n_o × n_o meters. Each cell ρ_i ∈ {−1, 0, +1} can be either free (−1), occupied (+1) or unobserved (0). We postpone the discussion on feature extraction to Section 4 and describe our probabilistic model in this section.

Fig. 4: Probabilistic Graphical Model corresponding to the joint distribution over image evidence E and road layout R in Eq. (1).

3.3 Probabilistic Model

We assume that all observations E = {T, V, S, F, O} are conditionally independent given the road layout R. As a consequence, the joint distribution over the image evidence E and the road parameters R factorizes as

p(E, R | Θ) = p(R | Θ) ∏_{i=1}^{N_t} p(t_i | R, Θ) ∏_{i=1}^{N_v} p(v_i | R, Θ) ∏_{i=1}^{N_s} p(s_i | R, Θ) ∏_{i=1}^{N_f} p(f_i | R, Θ) ∏_{i=1}^{N_o} p(ρ_i | R, Θ)    (1)

where the factors correspond to the prior, the vehicle tracklets, the vanishing points, the scene labels, the scene flow and the occupancy grid, respectively, and Θ denotes the set of all parameters in our model. This is illustrated in the graphical model of Fig. 4.

3.3.1 Prior

We define the prior on road parameters R as

p(R | Θ) = p(κ | Θ) p(c, r, w | κ, Θ) p(α | κ, Θ)    (2)

with

κ ∼ Cat(ξ_p)    (3)

(c, r, log w)^T | κ ∼ N(μ_p^(κ), (Λ_p^(κ))^{-1})    (4)

α | κ ∼ f_κ(α, σ_α)^{λ_p}    (5)

where Cat(·) denotes the categorical distribution and c, r and w are modeled jointly to capture correlations between the variables. The width w is modeled using a log-Normal distribution to enforce positivity. Empirically we found α to be highly multi-modal. We thus model α using kernel density estimation with a Gaussian kernel and bandwidth σ_α. The scalar λ_p controls the importance of the crossing street prior. All parameters are learned from training data as detailed in Section 3.5.
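As an illustration of the generative structure of Eqs. (2)-(5), the sketch below draws road layouts by ancestral sampling: a topology from a categorical distribution, (c, r, log w) from a Gaussian, and α from a kernel density estimate by perturbing a randomly chosen training angle. All numeric values (mixture weights, means, variances, training angles) are placeholders rather than the learned parameters, the Gaussian is sampled with a diagonal covariance for brevity, and the exponent λ_p of Eq. (5) is ignored.

// Sketch: ancestral sampling from the road-layout prior (Eqs. 2-5).
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct RoadLayout {
  int kappa;       // topology index in {1,...,7}
  double cx, cz;   // intersection center c
  double r;        // global rotation (yaw w.r.t. incoming street)
  double w;        // street width
  double alpha;    // crossing angle
};

RoadLayout sampleLayout(std::mt19937& rng) {
  // kappa ~ Cat(xi_p), placeholder weights.
  std::discrete_distribution<int> cat({0.1, 0.1, 0.15, 0.15, 0.15, 0.15, 0.2});
  int kappa = cat(rng) + 1;

  // (c, r, log w) | kappa ~ N(mu_p, Lambda_p^-1); diagonal placeholder values.
  std::normal_distribution<double> n01(0.0, 1.0);
  double cx = 0.0 + 5.0 * n01(rng);
  double cz = 30.0 + 10.0 * n01(rng);
  double r = 0.0 + 0.1 * n01(rng);
  double w = std::exp(std::log(10.0) + 0.2 * n01(rng));  // log-normal width

  // alpha | kappa ~ KDE: pick a training angle and perturb it by the bandwidth.
  std::vector<double> trainAlpha = {-0.3, 0.0, 0.0, 0.2};  // placeholder data
  std::uniform_int_distribution<size_t> pick(0, trainAlpha.size() - 1);
  double alpha = trainAlpha[pick(rng)] + 0.1 * n01(rng);   // sigma_alpha = 0.1

  return {kappa, cx, cz, r, w, alpha};
}

int main() {
  std::mt19937 rng(42);
  for (int i = 0; i < 3; ++i) {
    RoadLayout R = sampleLayout(rng);
    std::printf("kappa=%d c=(%.1f,%.1f) r=%.2f w=%.1f alpha=%.2f\n",
                R.kappa, R.cx, R.cz, R.r, R.w, R.alpha);
  }
  return 0;
}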


3.3.2 Vehicle Tracklets

Assuming a uniform prior on all lanes and parking areas, we model vehicle tracklets as

p(t | R, Θ) = ∑_{l=1}^{L} p(t, l | R)    (6)

p(t, l | R) = p(l | R) p(t | l, R) ∝ p(t | l, R)    (7)

where the tracklet index i and the dependency on the parameters Θ have been dropped for clarity and the latent lane (or parking spot) index l has been marginalized. To evaluate the tracklet posterior for lanes p_l(t | l, R), we discretize the lane spline at 1 meter intervals and augment the observation model with an additional discrete latent variable s per object detection d, which indexes the location on the lane, as illustrated in Fig. 3. We employ a simple left-to-right Hidden Markov Model to model the dynamics. Marginalizing over all hidden states s_1, . . . , s_{M_d} yields

p_l(t | l, R) = ∑_{s_1,...,s_{M_d}} p(s_1) p_l(d_1 | s_1, l, R) ∏_{j=2}^{M_d} p(s_j | s_{j−1}) p_l(d_j | s_j, l, R)    (8)

where M_d denotes the number of object detections in the tracklet. We allow tracklets to start anywhere on the lane with equal probability. Our motion model is simple, yet effective. By constraining all tracklets to move forward with uniform probability, i.e.,

p(s_j | s_{j−1}) = 1 / (M_l − s_{j−1} + 1)  if s_j ≥ s_{j−1},  and 0 otherwise    (9)

lanes on crossing streets can be distinguished purely based on the vehicle's motion. Here, M_l is the number of spline points on lane l. The emission probability for lanes p_l(d | s, l, R) is factorized into the object's location and the object's orientation

p_l(d | s, l, R) = p(m_d | s, l, R, S_d) p(o_d | s, l, R)    (10)

The object location is modeled as a Gaussian mixture

p(m_d | s, l, R, S_d) = (1 − ζ_t) p_in(m_d | s, l, R, S_d) + ζ_t p_out(m_d | s, l, R)    (11)

with inlier and outlier distributions defined by

p_in(m_d | s, l, R, S_d) ∝ exp( −(φ_t − m_d)^T S_d^{-1} (φ_t − m_d) / 2 )

p_out(m_d | s, l, R) ∝ exp( −m_d^T m_d / (2σ_out^2) )    (12)

respectively. Here, φ_t(s, l, R) ∈ R^2 denotes the 2D location of spline point s on lane l according to the B-spline model presented in Section 3.1, ζ_t is the outlier probability and σ_out is a parameter controlling the 'spread' of the outlier distribution.

Fig. 5: Semantic Scene Label Likelihood. Intuitively, the semantic likelihood measures the 'overlap' of the rendered model hypothesis (left) with the semantic labels estimated by a classifier (right).

For the object orientation likelihood, we impose a categorical distribution over 8 object orientations o_d

p(o_d | s, l, R) = ∏_{i=1}^{8} o_{d,i}^{[ϕ_t(s,l,R) = i]}    (13)

where ϕ_t(s, l, R) ∈ {1, . . . , 8} selects the orientation bin corresponding to the viewpoint relative to the observer: the relative viewing direction is computed from the tangent of lane l at spline point s. Intuitively, Eq. (13) encourages lane associations such that the estimated vehicle orientation and the direction of the lane coincide.

As parking cars are static, the tracklet likelihood for parking areas reduces to

p_p(t | l, R) = ∑_s ∏_{j=1}^{M_d} p(s) p_p(d_j | s, l, R)    (14)

assuming a uniform prior on the location s. Note that we do not make any assumptions about the orientation of a parked car. Thus the emission probability becomes

p_p(d | s, l, R) = (1/8) p(m_d | s, l, R, S_d)    (15)

with p(m_d | s, l, R, S_d) defined in Eq. (11).
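The marginalization of Eq. (8) is exactly the forward algorithm of a left-to-right HMM. The sketch below evaluates the tracklet-lane likelihood with the forward recursion and the transition model of Eq. (9); for brevity the emission is reduced to an isotropic Gaussian around the discretized spline points instead of the full inlier/outlier mixture of Eqs. (11)-(12) and the orientation term of Eq. (13). Lane geometry, detections and the noise level are made-up examples.

// Sketch: tracklet-lane likelihood of Eq. (8) via the forward algorithm.
#include <cmath>
#include <cstdio>
#include <vector>

struct Point2 { double x, z; };

// p(t | l, R): forward recursion over the discrete spline positions s.
// lane: spline points at 1 m spacing; det: detection means m_d (bird's eye).
double trackletLaneLikelihood(const std::vector<Point2>& lane,
                              const std::vector<Point2>& det,
                              double sigma = 2.0) {
  const size_t M = lane.size();
  auto emit = [&](const Point2& d, size_t s) {
    double dx = d.x - lane[s].x, dz = d.z - lane[s].z;
    return std::exp(-(dx * dx + dz * dz) / (2.0 * sigma * sigma));
  };

  // Initialization: uniform start position p(s1) = 1/M.
  std::vector<double> fwd(M);
  for (size_t s = 0; s < M; ++s) fwd[s] = emit(det[0], s) / double(M);

  // Recursion with the forward-only transition of Eq. (9):
  // p(s_j | s_{j-1}) = 1 / (M - s_{j-1}) for s_j >= s_{j-1} (0-based here).
  for (size_t j = 1; j < det.size(); ++j) {
    std::vector<double> nxt(M, 0.0);
    for (size_t sp = 0; sp < M; ++sp) {
      if (fwd[sp] == 0.0) continue;
      double trans = 1.0 / double(M - sp);
      for (size_t s = sp; s < M; ++s)
        nxt[s] += fwd[sp] * trans * emit(det[j], s);
    }
    fwd.swap(nxt);
  }

  double lik = 0.0;
  for (double v : fwd) lik += v;  // marginalize the final state
  return lik;
}

int main() {
  // Hypothetical straight lane and a tracklet moving along it.
  std::vector<Point2> lane, det;
  for (int i = 0; i < 30; ++i) lane.push_back({0.0, double(i)});
  for (int j = 0; j < 5; ++j) det.push_back({0.3, 2.0 + 3.0 * j});
  std::printf("p(t|l,R) ~ %g\n", trackletLaneLikelihood(lane, det));
  return 0;
}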

3.3.3 Vanishing Points

We model the vanishing point likelihood over v ∈ [0, π) as

p(v | R, Θ) ∝ ζ_v + (1 − ζ_v) exp(−λ_v φ_v(v, R, Θ))    (16)

with orientation error

φ_v(v, R, Θ) = 1 − cos(2v − 2ϕ_v(R))    (17)

derived from the von Mises distribution, which is a continuous probability distribution on the circle. Here, ζ_v is a small constant capturing outlier detections, ϕ_v(R) is the orientation of the closest street, based on the current road model configuration R, and λ_v is a precision parameter that controls the importance of this term.
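The vanishing point term is a one-liner; the sketch below evaluates Eqs. (16)-(17) up to the normalization constant, with placeholder values for ζ_v and λ_v.

// Sketch: unnormalized vanishing point likelihood of Eqs. (16)-(17).
#include <cmath>
#include <cstdio>

// v, streetOrientation in [0, pi); returns unnormalized p(v | R, Theta).
double vanishingPointLikelihood(double v, double streetOrientation,
                                double zeta_v = 1e-10, double lambda_v = 1.0) {
  // Orientation error derived from the von Mises distribution (Eq. 17);
  // doubling the angles makes the error pi-periodic.
  double phi_v = 1.0 - std::cos(2.0 * v - 2.0 * streetOrientation);
  return zeta_v + (1.0 - zeta_v) * std::exp(-lambda_v * phi_v);
}

int main() {
  // A detected vanishing direction about 5 degrees off the closest street.
  double v = 0.0872665;  // ~5 deg in radians
  std::printf("aligned: %g  off: %g\n",
              vanishingPointLikelihood(0.0, 0.0),
              vanishingPointLikelihood(v, 0.0));
  return 0;
}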

3.3.4 Semantic Scene Labels

The semantic scene label likelihood [33] for each n_s × n_s pixels image patch s is modeled as

p(s | R, Θ) ∝ exp( (λ_s / N_s) w_{s,φ_s(R)} s_{φ_s(R)} )    (18)

where N_s is the total number of image patches (see Section 4.3) and λ_s is a parameter controlling the importance of this cue. φ_s(R) ∈ {1, 2, 3} selects the class label of s according to a 'virtual' segmentation of the scene which depends on the reprojection of the road layout R, such that s_{φ_s(R)} returns the probability of that class label given a semantic classifier. This is illustrated in Fig. 5. Additionally, the weight vector w_s ∈ R^3 controls the importance of each semantic class. Our model uses three classes: foreground (e.g., road), background (e.g., buildings, vegetation) and sky. To simplify matters, we assume that the background (i.e., buildings, trees) starts directly behind the curb of the road and that buildings reach a height of four stories on average, thereby defining the background area which separates the sky from the road region. Facades adjacent to the observer's own street are not considered. Despite these approximations, we have observed that many inner-city scenes in our dataset follow this scheme closely.

Fig. 6: Scene Flow (left) and Occupancy Grid Likelihood (right). The proposed scene flow likelihood encourages flow vectors to agree with the lane geometry. The geometric occupancy prior (right) can be envisioned as a 'template' of freespace and occupied areas.
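Accumulated over all patches, Eq. (18) turns into a weighted sum of classifier probabilities at the classes predicted by the rendered layout. The sketch below computes this score in the log domain; the rendered class map and classifier distributions are hypothetical inputs, and only the class weights follow Table 2.

// Sketch: accumulating the semantic scene label term of Eq. (18).
#include <array>
#include <cstdio>
#include <vector>

// classes: 0 = road/foreground, 1 = background, 2 = sky
double semanticLogScore(const std::vector<int>& renderedClass,
                        const std::vector<std::array<double, 3>>& classifierProb,
                        double lambda_s = 1.0) {
  const std::array<double, 3> w_s = {1.0, 1.0, 4.0};  // scene label weights (Table 2)
  const double Ns = double(renderedClass.size());
  double score = 0.0;
  for (size_t i = 0; i < renderedClass.size(); ++i) {
    int c = renderedClass[i];                       // phi_s(R) for patch i
    score += (lambda_s / Ns) * w_s[c] * classifierProb[i][c];
  }
  return score;  // log p(S | R, Theta) up to an additive constant
}

int main() {
  // Two toy patches: one rendered as road, one as sky.
  std::vector<int> rendered = {0, 2};
  std::vector<std::array<double, 3>> prob = {{{0.7, 0.2, 0.1}}, {{0.1, 0.2, 0.7}}};
  std::printf("score = %g\n", semanticLogScore(rendered, prob));
  return 0;
}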

3.3.5 Scene Flow

In contrast to the tracklet observations, the 3D scene flow likelihood directly explains all moving objects in the scene given the road layout R. Thus, moving objects that do not fit the appearance model of the car detector (e.g., trucks, tractors, quad bikes, motorbikes) are considered here as well. The probability of a scene flow vector depends on its proximity to the closest lane and on how well its velocity vector aligns with the tangent of the respective B-spline at the corresponding foot point

p(f | R, Θ) ∝ φ_f(f, R, Θ)^{1/N_f}    (19)

where

φ_f(·) = ζ_f exp( −‖p_f‖_2^2 / (2σ_out^2) ) + (1 − ζ_f) exp( −λ_f1 ‖p_f − ϕ_f(p_f, R)‖_2^2 − λ_f2 (1 − q_f^T ϕ'_f(p_f, R)) )

with parameters ζ_f, λ_f1, λ_f2 and σ_out. Here, ζ_f accounts for outliers, and λ_f1 and λ_f2 control the importance of the location and the orientation agreement. N_f normalizes for the number of scene flow vectors and, similar to the vehicle tracklet model from Section 3.3.2, σ_out denotes the width of the outlier distribution. The functions ϕ_f(p_f, R) ∈ R^2 and ϕ'_f(p_f, R) ∈ R^2 return the spline foot point and tangent vector at the location closest to p_f, respectively. This is illustrated in Fig. 6. The dependencies are modeled as a hard mixture, i.e., for each flow vector we select the spline l that maximizes Eq. (19).
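The sketch below evaluates the unnormalized flow likelihood as a hard mixture over lanes: for each lane it finds the closest discretized spline point, approximates the tangent by a finite difference, and combines the location and alignment terms with the outlier component. The parameter values and the toy lane are assumptions for illustration.

// Sketch: unnormalized scene flow likelihood of Eq. (19) as a hard mixture over lanes.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Vec2 { double x, z; };

struct FlowVector { Vec2 p; Vec2 q; };  // location and unit velocity on the road plane

// One lane = discretized spline points; tangents approximated by finite differences.
double flowLikelihood(const FlowVector& f, const std::vector<std::vector<Vec2>>& lanes,
                      double zeta_f = 1e-15, double lambda_f1 = 0.1,
                      double lambda_f2 = 1.0, double sigma_out = 70.0) {
  double best = 0.0;
  for (const auto& lane : lanes) {
    // closest spline point (foot point) and its tangent
    double bestD2 = 1e18; Vec2 foot{0, 0}, tang{1, 0};
    for (size_t i = 0; i + 1 < lane.size(); ++i) {
      double dx = f.p.x - lane[i].x, dz = f.p.z - lane[i].z, d2 = dx * dx + dz * dz;
      if (d2 < bestD2) {
        bestD2 = d2;
        foot = lane[i];
        double tx = lane[i + 1].x - lane[i].x, tz = lane[i + 1].z - lane[i].z;
        double n = std::hypot(tx, tz);
        tang = {tx / n, tz / n};
      }
    }
    double align = f.q.x * tang.x + f.q.z * tang.z;
    double inlier = std::exp(-lambda_f1 * bestD2 - lambda_f2 * (1.0 - align));
    double outlier = std::exp(-(f.p.x * f.p.x + f.p.z * f.p.z) / (2 * sigma_out * sigma_out));
    // hard mixture: keep the lane that explains the flow vector best
    best = std::max(best, zeta_f * outlier + (1.0 - zeta_f) * inlier);
  }
  return best;  // phi_f(f, R, Theta); Eq. (19) raises this to the power 1/N_f
}

int main() {
  std::vector<std::vector<Vec2>> lanes(1);
  for (int i = 0; i < 30; ++i) lanes[0].push_back({0.0, double(i)});
  FlowVector f{{0.5, 10.0}, {0.0, 1.0}};  // a vector moving along the lane
  std::printf("phi_f = %g\n", flowLikelihood(f, lanes));
  return 0;
}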

3.3.6 Occupancy Grid

We assume that the road area should coincide with free space while non-road areas may be covered by buildings or vegetation. Thus, we use stereo information to compute an occupancy grid, which represents occupied and free space in bird's eye perspective; see Fig. 1 (middle) for an illustration. The occupancy likelihood of each cell ρ in the grid is modeled as follows

p(ρ | R, Θ) ∝ exp( (λ_o / N_o) ρ · φ_o(ρ, R) )    (20)

where φ_o(ρ, R) ∈ {w_{o,1}, w_{o,2}, w_{o,3}} is a mapping that for any cell ρ ∈ {−1, 0, +1} returns the value (or weight) of a model-dependent geometric prior expressing the belief on the location of free space (i.e., road) and buildings alongside the road. The geometric prior is illustrated in Fig. 6 for the case of a right turn. Intuitively, it encourages free space where the road is located and obstacles elsewhere, with a preference towards the roadside region. λ_o controls the strength of this term.
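Summed over all cells, Eq. (20) reduces to a weighted inner product between the observed grid and the geometric prior template. In the sketch below the template is represented by a per-cell region label (road, roadside, far); the assignment of the three weights w_o = (−1, 4, 1) to these regions is our assumption for illustration, as is the toy grid.

// Sketch: occupancy grid term of Eq. (20) accumulated over all cells.
#include <cstdio>
#include <vector>

enum CellState { FREE = -1, UNOBSERVED = 0, OCCUPIED = +1 };
enum PriorRegion { ROAD = 0, ROADSIDE = 1, FAR = 2 };

double occupancyLogScore(const std::vector<int>& grid,          // rho_i in {-1,0,+1}
                         const std::vector<int>& priorRegion,   // region per cell under R
                         double lambda_o = 1.0) {
  // Weights of the geometric prior: road cells should be free (negative weight),
  // roadside cells occupied, far-off cells mildly occupied (cf. Fig. 6, Table 2).
  const double w_o[3] = {-1.0, 4.0, 1.0};
  const double No = double(grid.size());
  double score = 0.0;
  for (size_t i = 0; i < grid.size(); ++i)
    score += (lambda_o / No) * double(grid[i]) * w_o[priorRegion[i]];
  return score;  // log p(O | R, Theta) up to an additive constant
}

int main() {
  // Toy 4-cell grid: two road cells observed free, one roadside cell occupied,
  // one unobserved cell (contributes nothing).
  std::vector<int> grid = {FREE, FREE, OCCUPIED, UNOBSERVED};
  std::vector<int> region = {ROAD, ROAD, ROADSIDE, FAR};
  std::printf("score = %g\n", occupancyLogScore(grid, region));
  return 0;
}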

3.4 Inference

As inference is a key component for learning, we start with a discussion on how to obtain samples from the joint distribution using Markov Chain Monte Carlo techniques. Given the image evidence E, we are interested in determining the underlying road layout R and the location of cars C = {(l, s)} in the scene, where l denotes the lane index and s contains the spline points of all detections in a tracklet. Unfortunately, the posteriors involved in this computation have no analytical solution and cannot be computed in closed form. Thus we approximate them using Metropolis-Hastings sampling [4]. To keep computations tractable, we split the problem into two sub-problems: we first estimate R while marginalizing C as

R = argmax_R p(R | E, Θ) = argmax_R ∑_C p(R, C | E, Θ)    (21)

through Eq. (1), Eq. (8) and Eq. (14). Given an estimate of R, we infer the object locations C as

C = argmax_C p(C | E, R, Θ)    (22)

Both steps are detailed in the following.

3.4.1 Inferring the Road Layout

For maximizing Eq. (21), we draw n_inf samples using Markov Chain Monte Carlo and select the one with the highest probability as our MAP estimate R. We exploit a combination of local, inter-topology and global moves to obtain a well-mixing Markov chain. While local moves modify R only slightly, global moves sample R directly from the prior. This ensures a quick traversal of the search space, while still exploring local modes. For local moves we choose symmetric proposals in the form of Gaussians centered on the previous state. Table 1 gives an overview of the move categories, which are selected at random. Each sample requires the evaluation of p(R | E, Θ) up to a normalizing constant. The marginalization in Eq. (8) can be carried out efficiently using the forward algorithm for hidden Markov models. To avoid trans-dimensional moves, we include α in all models.

Local Metropolis Proposals (33%)
1. Vary center of crossroads c (σ_c)
2. Vary width of all roads w (σ_w)
3. Vary angle of crossing street α (σ_α)
4. Vary overall orientation r (σ_r)
5. Vary center c and width w jointly
6. Vary center c, width w, angle α and rotation r jointly

Inter-Topology Metropolis Proposals (33%)
7. Re-sample κ uniformly

Global Metropolis-Hastings Proposals (33%)
8. Re-sample all parameters R = {κ, c, w, r, α} from the prior

TABLE 1: Metropolis-Hastings Proposals for Inference. We randomly choose a local, inter-topology or global move with uniform probability.
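The sampler of Table 1 can be sketched as follows: at every iteration one of the three proposal families is chosen uniformly and the proposal is accepted with the usual Metropolis ratio. The target log-density and the prior sampler are passed in as callables, the proposal standard deviations and the toy target in main are placeholders, and global prior proposals are treated with a plain Metropolis acceptance for brevity (a simplification of full Metropolis-Hastings).

// Sketch: layout sampler with local, inter-topology and global proposals.
#include <cmath>
#include <cstdio>
#include <functional>
#include <random>

struct RoadLayout { int kappa; double cx, cz, r, w, alpha; };

RoadLayout sampleMH(const std::function<double(const RoadLayout&)>& logTarget,
                    const std::function<RoadLayout(std::mt19937&)>& samplePrior,
                    int n_inf, std::mt19937& rng) {
  std::normal_distribution<double> n01(0.0, 1.0);
  std::uniform_real_distribution<double> u01(0.0, 1.0);
  std::uniform_int_distribution<int> topo(1, 7);

  RoadLayout cur = samplePrior(rng);
  double curLp = logTarget(cur);
  RoadLayout best = cur;
  double bestLp = curLp;

  for (int it = 0; it < n_inf; ++it) {
    RoadLayout prop = cur;
    double move = u01(rng);
    if (move < 1.0 / 3.0) {            // local move: Gaussian perturbation
      prop.cx += 1.0 * n01(rng);
      prop.cz += 1.0 * n01(rng);
      prop.w  += 0.5 * n01(rng);
      prop.r  += 0.05 * n01(rng);
      prop.alpha += 0.05 * n01(rng);
    } else if (move < 2.0 / 3.0) {     // inter-topology move
      prop.kappa = topo(rng);
    } else {                           // global move: re-sample from the prior
      prop = samplePrior(rng);
    }
    double propLp = logTarget(prop);
    if (std::log(u01(rng)) < propLp - curLp) { cur = prop; curLp = propLp; }
    if (curLp > bestLp) { best = cur; bestLp = curLp; }
  }
  return best;  // MAP estimate over the drawn samples
}

int main() {
  std::mt19937 rng(0);
  // Toy target density: prefers an intersection ~30 m ahead with ~10 m streets.
  auto logTarget = [](const RoadLayout& R) {
    return -0.5 * ((R.cz - 30.0) * (R.cz - 30.0) / 25.0 + (R.w - 10.0) * (R.w - 10.0));
  };
  auto samplePrior = [](std::mt19937& g) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return RoadLayout{1 + int(u(g) * 6.99), 0.0, 60.0 * u(g), 0.0, 5.0 + 10.0 * u(g), 0.0};
  };
  RoadLayout R = sampleMH(logTarget, samplePrior, 10000, rng);
  std::printf("kappa=%d cz=%.1f w=%.1f\n", R.kappa, R.cz, R.w);
  return 0;
}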

3.4.2 Inferring the Location of Objects

Given the road model R, we are now interested in recovering the location of cars C = {(l_1, s_1), . . . , (l_{N_t}, s_{N_t})}. Conditioned on R, all tracklets become independent and the inference problem decomposes into sub-problems. Neglecting the tracklet index i and the dependency on Θ for clarity, and observing that p(t) is constant, l can be inferred by computing the marginal over object locations

l = argmax_l p(l | t, R) = argmax_l p(t, l | R)    (23)

with p(t, l | R) defined by Eq. (7). Given l, we have

s_1, . . . , s_{M_d} = argmax_{s_1,...,s_{M_d}} p_l(t, s_1, . . . , s_{M_d} | l, R)    (24)

which can be easily inferred using Viterbi decoding for hidden Markov models [8].
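Given the lane assignment, Eq. (24) is standard Viterbi decoding over the discretized spline positions with the forward-only transition of Eq. (9). The sketch below assumes the per-detection emission probabilities are already computed; for long tracklets one would work in log-space to avoid underflow. The toy values in main are illustrative only.

// Sketch: Viterbi decoding of the spline positions s_1..s_Md (Eq. 24).
#include <cstdio>
#include <vector>

// emit[j][s] = p_l(d_j | s, l, R) for detection j and spline position s.
std::vector<int> viterbiSplinePositions(const std::vector<std::vector<double>>& emit) {
  const size_t Md = emit.size(), M = emit[0].size();
  std::vector<std::vector<double>> delta(Md, std::vector<double>(M, 0.0));
  std::vector<std::vector<int>> back(Md, std::vector<int>(M, 0));

  for (size_t s = 0; s < M; ++s) delta[0][s] = emit[0][s] / double(M);  // uniform p(s1)

  for (size_t j = 1; j < Md; ++j)
    for (size_t s = 0; s < M; ++s) {
      double best = 0.0; int arg = 0;
      for (size_t sp = 0; sp <= s; ++sp) {             // forward-only transitions
        double v = delta[j - 1][sp] * (1.0 / double(M - sp));
        if (v > best) { best = v; arg = int(sp); }
      }
      delta[j][s] = best * emit[j][s];
      back[j][s] = arg;
    }

  // Backtrack the most probable state sequence.
  std::vector<int> path(Md);
  double best = -1.0;
  for (size_t s = 0; s < M; ++s)
    if (delta[Md - 1][s] > best) { best = delta[Md - 1][s]; path[Md - 1] = int(s); }
  for (size_t j = Md - 1; j > 0; --j) path[j - 1] = back[j][path[j]];
  return path;
}

int main() {
  // Toy example: 3 detections, 5 spline positions, emissions peaked at 1, 2, 4.
  std::vector<std::vector<double>> emit = {
      {0.1, 0.7, 0.1, 0.05, 0.05},
      {0.05, 0.1, 0.7, 0.1, 0.05},
      {0.05, 0.05, 0.1, 0.1, 0.7}};
  for (int s : viterbiSplinePositions(emit)) std::printf("%d ", s);
  std::printf("\n");
  return 0;
}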

3.5 Learning

The parameters Θ = {λ_p, ξ_p, λ_t, λ_v, λ_s, λ_f1, λ_f2, λ_o} of our model are learned from held-out training data using maximum likelihood and contrastive divergence [37]. Let (E, R) be a training set with E = {E_1, . . . , E_D} denoting the image evidence and R = {R_1, . . . , R_D} the annotated road layout of each sequence. The parameter set Θ maximizing the likelihood of the data is given by

Θ = argmax_Θ p(E, R | Θ)    (25)

with

p(E, R | Θ) = ∏_{d=1}^{D} p(E_d, R_d | Θ)    (26)

Unfortunately, maximizing Eq. (26) directly for Θ is intractable due to the partition function. Instead, we rewrite p(E_d, R_d | Θ) in terms of a Gibbs random field

p(E_d, R_d | Θ) = (1 / Z_d(Θ)) exp(−Ψ(E_d, R_d, Θ))    (27)

where Ψ(E_d, R_d, Θ) is the sum of a set of potential functions {ψ_i} corresponding to the probability distributions in Section 3.3, which are described at the end of this section. The log-likelihood function is given by

L(E, R | Θ) = ∑_{d=1}^{D} log p(E_d, R_d | Θ) = −∑_{d=1}^{D} ( Ψ(E_d, R_d, Θ) + log Z_d(Θ) )    (28)

Taking the partial derivative of L(E, R | Θ) with respect to a parameter θ_i ∈ Θ, we obtain

∂L(E, R | Θ) / ∂θ_i = −∑_{d=1}^{D} ( ∂Ψ(E_d, R_d, Θ) / ∂θ_i + ∂ log Z_d(Θ) / ∂θ_i )    (29)

While the first term is easy to evaluate, the second term can be rewritten as

∂ log Z_d(Θ) / ∂θ_i = (1 / Z_d(Θ)) ∫ ∂/∂θ_i exp(−Ψ(E_d, R, Θ)) dR = −⟨ ∂Ψ(E_d, R, Θ) / ∂θ_i ⟩_{p(E_d, R | Θ)}    (30)

where ⟨·⟩_{p(·)} denotes the expected value with respect to p(·). Note that in contrast to [37] the potentials Ψ additionally depend on E_d in our case. While it is impossible to evaluate this expression exactly, it can be approximated by drawing samples using MCMC as described in Section 3.4. Sampling exhaustively from the model distribution is computationally prohibitive. We therefore make use of contrastive divergence [37], and instead of running the Markov chain until convergence, we only run it for a small number of n_learn iterations per gradient update. By starting the chain at the training data we obtain samples in all places where we want the model distribution to be accurate. Running the MCMC sampler for a couple of iterations already draws the samples closer to the model distribution. For more details and derivations we refer the interested reader to [28], [37].

Energy Potentials: The joint potential Ψ(E, R, Θ) in Eq. (27) decomposes as

Ψ(E, R, Θ) = ψ_p(R, Θ) + ψ_t(T, R, Θ) + ψ_v(V, R, Θ) + ψ_s(S, R, Θ) + ψ_f(F, R, Θ) + ψ_o(O, R, Θ)

using the same subscript notation as in Eq. (1). Taking the logarithm of Eq. (2) we obtain

ψ_p(R, Θ) = −λ_p log f_κ(α) − ∑_{i=1}^{7} [κ = i] log ξ_{p,i} + (1/2) φ_p(R, μ_p^(κ))^T Λ_p^(κ) φ_p(R, μ_p^(κ))    (31)

with φ_p(R, μ_p^(κ)) = (c, r, log w)^T − μ_p^(κ). We set μ_p ∈ R^4 and Λ_p ∈ R^{4×4} to their empirical marginals and learn ξ_p, λ_p ∈ Θ using the approach described in Section 3.5.

The derivatives are readily derived from Eq. (31). Taking the logarithm of Eqs. (6)-(20) we obtain:

ψ_t = −(λ_t / N_t) ∑_{i=1}^{N_t} log ∑_{l=1}^{L} p(t_i, l | R)

ψ_v = −∑_{i=1}^{N_v} log [ ζ_v + (1 − ζ_v) exp(−λ_v φ_v(v_i, R, Θ)) ]

ψ_s = −(λ_s / N_s) ∑_{i=1}^{N_s} w_{s,φ_s(R)} s_{i,φ_s(R)}

ψ_f = −(1 / N_f) ∑_{i=1}^{N_f} log φ_f(f_i, R, Θ)

ψ_o = −(λ_o / N_o) ∑_{i=1}^{N_o} ρ_i φ_o(ρ_i, R)

Here, we have added an additional degree of freedom λ_t to the tracklet potential ψ_t, which accommodates violations of the naïve Bayesian observation model and controls the relative strength of the tracklet feature with respect to the prior and all other features.
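Structurally, contrastive divergence learning for this model boils down to the update sketched below: for every training sequence, compare the energy gradient at the annotated layout with the gradient at a sample obtained from a short Markov chain started at that layout, and take a gradient ascent step. The energy gradient and the short chain are left abstract (std::function), and the toy quantities in main are placeholders; this is a sketch of the update structure of Eqs. (29)-(30), not the reference implementation.

// Sketch: contrastive-divergence gradient ascent on the log-likelihood of Eq. (28).
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

struct RoadLayout { int kappa; double cx, cz, r, w, alpha; };

using EnergyGrad = std::function<std::vector<double>(int, const RoadLayout&, const std::vector<double>&)>;
// Runs n_learn MCMC iterations targeting p(R | E_d, Theta), started at Rstart.
using ShortChain = std::function<RoadLayout(int, const RoadLayout&, const std::vector<double>&, std::mt19937&)>;

std::vector<double> learnCD(const std::vector<RoadLayout>& annotated,  // R_d (training data)
                            const EnergyGrad& gradPsi, const ShortChain& cdChain,
                            std::vector<double> theta, int n_iter = 500,
                            double stepSize = 1e-2) {
  std::mt19937 rng(0);
  for (int it = 0; it < n_iter; ++it) {
    std::vector<double> grad(theta.size(), 0.0);
    for (int d = 0; d < int(annotated.size()); ++d) {
      // Data term: -d/dtheta Psi(E_d, R_d, Theta).
      std::vector<double> gData = gradPsi(d, annotated[d], theta);
      // Model term: approximate the expectation in Eq. (30) by a sample from
      // a short chain started at the training layout (contrastive divergence).
      RoadLayout sample = cdChain(d, annotated[d], theta, rng);
      std::vector<double> gModel = gradPsi(d, sample, theta);
      for (size_t i = 0; i < theta.size(); ++i)
        grad[i] += -gData[i] + gModel[i];   // Eq. (29) with Eq. (30) plugged in
    }
    for (size_t i = 0; i < theta.size(); ++i)
      theta[i] += stepSize * grad[i];        // gradient ascent on the log-likelihood
  }
  return theta;
}

int main() {
  // Toy setup with a 2-parameter model and a "chain" that just perturbs R.
  auto gradPsi = [](int, const RoadLayout& R, const std::vector<double>&) {
    return std::vector<double>{R.w, R.alpha};
  };
  auto chain = [](int, const RoadLayout& R, const std::vector<double>&, std::mt19937& g) {
    std::normal_distribution<double> n01(0.0, 1.0);
    RoadLayout s = R; s.w += 0.1 * n01(g); s.alpha += 0.1 * n01(g); return s;
  };
  std::vector<RoadLayout> train = {{3, 0, 30, 0, 10, 0.1}};
  std::vector<double> theta = learnCD(train, gradPsi, chain, {1.0, 1.0}, 100);
  std::printf("theta = (%.3f, %.3f)\n", theta[0], theta[1]);
  return 0;
}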

4 IMAGE EVIDENCE

This section describes the feature cues used by our probabilistic model described in Section 3.3. Using motion information from visual odometry, we represent all features in the reference coordinate system which is located on the road surface below the left camera coordinate system in the last frame of each sequence, as illustrated in Fig. 7. While vehicle tracklets, vanishing points and semantic scene labels can be computed from a monocular video stream, the scene flow and occupancy grid features require a stereo setup. The importance of each of these cues is investigated in our experiments.

4.1 Vehicle Tracklets

We define a tracklet as a set of object detections projected into bird's eye perspective, t = {d_1, . . . , d_{M_d}} with d = (f_d, m_d, S_d, o_d). Here, f_d ∈ N is the frame number and m_d ∈ R^2, S_d ∈ R^{2×2} are the mean and covariance matrix of the object location. o_d ∈ R^8 describes the probability of the vehicle being viewed from any of 8 possible orientations. Our goal now is to associate object detections to tracklets and project them into 3D using cues such as the object size or the bounding box ground contact point in combination with the height and the pitch angle of the camera. Association of detections to tracklets is performed in image-scale space to better account for uncertainties of the object detector.

For detection, we train the part-based object detector of [25] on a large set of manually annotated images. As our training set comprises ground truth orientation labels, we modify the latent SVM [25] such that the latent component variables are fixed to the orientations. We normalize o to one using the softmax transformation, applied to the scores of all detections that overlap the non-maximum-suppressed ones.

Fig. 7: Projection of 2D Object Detections into 3D. By learning the statistics of the vehicle dimensions, we are able to relate the 2D object bounding box to the location of the object in 3D.

Tracking is performed in two simple stages. First, we associate all object detections frame-by-frame using the Hungarian method [45]. The affinity matrix is computed using geometry and appearance cues of the object. As geometry cue we employ the bounding box intersection-over-union score. The appearance cue is computed by correlating the bounding box region in the previous frame with the bounding box region in the current frame, using a small margin (20%) to account for the localization uncertainty of the object detector. In a second stage we associate tracklets to each other. We allow for bridging occlusions of up to 20 frames and make use of the Hungarian algorithm for optimal tracklet association. Towards this goal, we employ a geometry cue [28] that predicts the object location and size from the bounding boxes in the other tracklet. Similar to above, object appearance is compared via normalized cross-correlation.
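The sketch below illustrates the first association stage: an affinity built from bounding-box intersection-over-union and an appearance score (a placeholder for the normalized cross-correlation used in the paper), followed by a gated assignment. The greedy matching is a simplified stand-in for the optimal Hungarian method, and the 50/50 cue weighting and gate value are assumptions.

// Sketch: frame-to-frame data association with IoU + appearance affinities.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Box { double x1, y1, x2, y2; };

double iou(const Box& a, const Box& b) {
  double ix = std::max(0.0, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  double iy = std::max(0.0, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  double inter = ix * iy;
  double areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
  double areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
  return inter / (areaA + areaB - inter);
}

// Returns, for each detection in the previous frame, the matched index in the
// current frame (or -1). 'appearance[i][j]' would hold the NCC score.
std::vector<int> associate(const std::vector<Box>& prev, const std::vector<Box>& cur,
                           const std::vector<std::vector<double>>& appearance,
                           double gate = 0.5) {
  std::vector<int> match(prev.size(), -1);
  std::vector<bool> used(cur.size(), false);
  for (size_t i = 0; i < prev.size(); ++i) {
    double best = gate; int bestJ = -1;
    for (size_t j = 0; j < cur.size(); ++j) {
      if (used[j]) continue;
      double affinity = 0.5 * iou(prev[i], cur[j]) + 0.5 * appearance[i][j];
      if (affinity > best) { best = affinity; bestJ = int(j); }
    }
    if (bestJ >= 0) { match[i] = bestJ; used[bestJ] = true; }
  }
  return match;
}

int main() {
  std::vector<Box> prev = {{100, 100, 160, 140}};
  std::vector<Box> cur = {{105, 102, 165, 142}, {300, 120, 340, 150}};
  std::vector<std::vector<double>> app = {{0.9, 0.2}};
  std::printf("match: %d\n", associate(prev, cur, app)[0]);
  return 0;
}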

Given the associated bounding boxes, we estimate the 3D location of the vehicles relative to the camera. Let (u, v)^T be the image coordinates of the object's bounding box bottom-center and let w, h be its width and height. Let (x, 0, z)^T be the 3D location of an object in ground plane coordinates (y = 0) as illustrated in Fig. 7. Further, let ∆x, ∆y be the object width and height in meters, measured via parallel projection to the z = 0 plane. Finally, let o denote the MAP orientation of the vehicle as returned by the object detector. Assuming a uniform prior over x and z, the posterior on the object's 3D location can be factorized as p(x, z | u, v) p(z | w, ∆x, o) p(z | h, ∆y). Here, the first term relates the bounding box position (u, v)^T to the 3D location (x, 0, z)^T, and the second and last term model the relationship between the distance z and the width w and height h of the bounding box. As the bounding box width varies with the orientation of a vehicle, p(z | w, ∆x, o) depends on o. We learn a separate set of parameters for each o, but will drop this dependency in the following for clarity of presentation. Let x, z | u, v ∼ N(μ_1, Λ_1^{-1}), z | w, ∆x ∼ N(μ_2, λ_2^{-1}) and z | h, ∆y ∼ N(μ_3, λ_3^{-1}). Then, x, z | u, v, w, h, ∆x, ∆y ∼ N(μ, Σ) with

μ = ΣΛ_1 μ_1 + ΣΛ_2 (0, μ_2)^T + ΣΛ_3 (0, μ_3)^T,    Σ = (Λ_1 + Λ_2 + Λ_3)^{-1}    (32)

where Λ_1 has full rank and Λ_2, Λ_3 are singular matrices of the form Λ_2 = diag(0, λ_2) and Λ_3 = diag(0, λ_3). Assuming a standard pinhole camera model we have

(u, v, 1)^T = P_{3×4} (x, 0, z, 1)^T    (33)

      ∆x_1   ∆x_2   ∆x_3   ∆y
μ     1.86   4.37   2.82   1.59
σ     0.28   0.64   0.63   0.22

Fig. 8: Object Size Statistics. ∆x and ∆y are the width and height after parallel projection to z = 0. Cars in our dataset are ∼1.9 m wide and ∼4.4 m long.

where P = K T R is the product of a calibration matrix K_{3×3}, the transformation from ground plane coordinates to camera coordinates T_{3×4}, and an additional pitch error θ, parameterized by the rotation matrix R_{4×4}(θ). Given (u, v)^T we obtain (x, z)^T by solving the linear system A (x, z)^T = b with

A = [ uP_31 − P_11   uP_33 − P_13 ;  vP_31 − P_21   vP_33 − P_23 ],    b = [ P_14 − uP_34 ;  P_24 − vP_34 ]

where we have made use of ⟨θ⟩_{p(θ)} = 0. The covariance of (x, z)^T can be approximated using error propagation

]where we have made use of 〈θ〉p(θ) = 0. The covarianceof[x z]T can be approximated using error propagation

Λ1 = Σ−11 , Σ1 = J diag(σ2

u, σ2v , σ

2θ) JT (34)

where the Jacobian J ∈ R3×3 is given by

J =(∂∂uA−1b ∂

∂vA−1b ∂∂θA

−1b)

(35)

with ∂(A−1b) = A−1(∂b− ∂AA−1b

). Furthermore, for

a pinhole camera with focal length f we have µ2 = z =f∆xw . We obtain the precision in z as

λ_2 = σ_2^{-2},    σ_2^2 = J [ σ_w^2  0 ;  0  σ_∆x^2 ] J^T,    J = ( −f∆x/w^2   f/w )    (36)

Similarly, we have μ_3 = z = f∆y/h with precision

λ_3 = σ_3^{-2},    σ_3^2 = J [ σ_h^2  0 ;  0  σ_∆y^2 ] J^T,    J = ( −f∆y/h^2   f/h )    (37)

The unknown parameters of the proposed projection model σ_u, σ_v, σ_w, σ_h, ∆x, ∆y, σ_∆x and σ_∆y are determined empirically from a held-out training set composed of 1020 stereo images with 3634 annotated vehicle 2D bounding boxes, including orientation labels (8 bins). The object depth is computed from the median disparity within the respective bounding box of the corresponding disparity maps. Due to the characteristics of sliding-window detectors, we expect the noise to be dependent on the object scale. We model this relationship using linear regression with respect to the bounding box height h. We learn a separate ∆x for each of the three car orientation classes illustrated in Fig. 8.

As the raw 3D location estimates {(m, S)} are noisy due to the low camera viewpoint, the uncertainties in the object detector and the ground plane estimation process, we temporally integrate detections within a tracklet t using a Kalman smoother [40] assuming constant velocity.
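The sketch below illustrates the two depth cues for a single detection: intersecting the viewing ray of the bounding box bottom-center with the ground plane by solving the 2×2 system of Eq. (33), and estimating depth from the apparent width via z = f∆x/w, followed by a precision-weighted fusion of the two scalar depth estimates. This scalar fusion is a simplification of Eq. (32), and the camera parameters, detection and noise levels are made-up values.

// Sketch: 3D localization of a detection from its 2D bounding box.
#include <cstdio>

struct P34 { double m[3][4]; };  // projection matrix P (ground plane to image)

// Ground-plane intersection of pixel (u, v): solve A [x z]^T = b (cf. Eq. 33).
void groundIntersection(const P34& P, double u, double v, double& x, double& z) {
  double a11 = u * P.m[2][0] - P.m[0][0], a12 = u * P.m[2][2] - P.m[0][2];
  double a21 = v * P.m[2][0] - P.m[1][0], a22 = v * P.m[2][2] - P.m[1][2];
  double b1 = P.m[0][3] - u * P.m[2][3], b2 = P.m[1][3] - v * P.m[2][3];
  double det = a11 * a22 - a12 * a21;
  x = (b1 * a22 - a12 * b2) / det;
  z = (a11 * b2 - a21 * b1) / det;
}

int main() {
  // Placeholder pinhole camera: f = 700 px, principal point (600, 180),
  // mounted 1.65 m above the ground plane, zero pitch.
  const double f = 700, cu = 600, cv = 180, h = 1.65;
  P34 P = {{{f, 0, cu, 0}, {0, -f, cv, f * h}, {0, 0, 1, 0}}};

  // Bounding box bottom-center and width of a hypothetical detection.
  const double u = 693.3, v = 257.0, w = 86.8;

  // Cue 1: ray / ground-plane intersection.
  double x, zRay;
  groundIntersection(P, u, v, x, zRay);

  // Cue 2: depth from apparent width, z = f * Dx / w (Dx ~ 1.86 m, cf. Fig. 8).
  const double dx = 1.86;
  double zWidth = f * dx / w;

  // Fuse the two depth estimates by their (placeholder) precisions.
  const double sigmaRay = 2.0, sigmaWidth = 3.0;      // meters, illustrative
  double lamRay = 1.0 / (sigmaRay * sigmaRay), lamW = 1.0 / (sigmaWidth * sigmaWidth);
  double zFused = (lamRay * zRay + lamW * zWidth) / (lamRay + lamW);

  std::printf("x=%.2f  z_ray=%.2f  z_width=%.2f  z_fused=%.2f\n", x, zRay, zWidth, zFused);
  return 0;
}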

Fig. 9: Structured Line Segments. ROC classification results into structure and clutter (left). Qualitative results from Kosecka et al. [43] (top-right) vs. our method (bottom-right). Red corresponds to structure.

4.2 Vanishing Points

We detect up to two vanishing points V = {v_1, . . . , v_{N_v}} in the last frame of each sequence, where a vanishing point is defined by a rotation angle around the y-axis in road coordinates, i.e., we assume that all vanishing lines are collinear with the ground plane and v_i represents the yaw angle. All 3D lines that are collinear with a vanishing line intersect at the same vanishing point. For typical scenarios, two vanishing points are dominant: one which is collinear with the forward facing street and one which is collinear with the crossing street. This is illustrated in Fig. 1.

In order to detect vanishing points we make use of the method described by Kosecka et al. [43], which we have modified slightly to take into account the known camera calibration information. We restrict the search space such that all vanishing lines are collinear with the ground plane. Additionally, we relax the model to also allow for non-orthogonal vanishing points as this is required by the intersection types encountered in our dataset. Unfortunately, traditional vanishing point detection methods [7], [43] require relatively clean scenarios and tend to fail in the presence of clutter such as cast shadows or defects in the road surface. We learn a k-nearest-neighbor classifier based on a held-out annotated set of 185 images, in which all detected line segments have been manually labeled as either structure (e.g., road markings, building facades) or clutter. The classifier's confidence on structure is used as a weight in the vanishing point voting process. Details on the feature set which we use can be found in [33].

Fig. 9 (left) shows the ROC curve for classifying lines into structure and clutter. The curves have been obtained by adjusting k for the k-nn classifier in the learning-based method and by varying the inlier threshold for [43]. Fig. 9 (right) compares the classification results for a particular scene: while the cast shadows in the lower-left part of the image cause wrong evidence for traditional detectors (top), the proposed classification step is able to reject most of these line segments (bottom).

4.3 Semantic Scene Labels

For extracting semantic information in the form of scene labels we use the joint boosting framework proposed in [58] to learn a strong classifier. Following [61], we divide the last image of each sequence into patches of size n_s × n_s pixels and classify them into the categories road, background and sky. Details on the feature set can be found in [33]. In order to avoid hard decisions and to interpret the boosting confidences as probabilities we apply the softmax transformation [46] to the resulting scores. The semantic label of a single patch is defined as s ∈ ∆^2, where ∆^2 is the unit 2-simplex, i.e., the space of possible categorical distributions. For training, we use a held-out dataset of 200 hand-labeled images. After softmax normalization we obtain a label distribution s for each image patch s ∈ S, which is used in our semantic scene label likelihood described in Section 3.3.4.

4.4 Scene Flow and Egomotion

We compute the vehicle's egomotion and 3D scene flow using the method presented in [34]. Towards this goal we accumulate all vectors in the reference (road) coordinate system, compensate for the egomotion and threshold them by their length, i.e., we remove short vectors that are likely to belong to the static environment. As the 3D scene flow likelihood does not account for object velocities, we normalize all flow vectors to unit length and project them onto the road plane as illustrated in Fig. 1 (red arrows in the occupancy grid).

4.5 Occupancy Grid

Buildings represent obstacles in the scene and thus should never coincide with drivable regions (road). This assumption is incorporated into the occupancy grid feature. We construct a 2D grid in road plane coordinates from disparity measurements which classifies the area in front of the vehicle into the categories obstacle, free space and unobserved, as illustrated in Fig. 1.

We make use of the efficient large-scale stereo matching algorithm ELAS [32] to compute disparity maps for each frame in each sequence. Given the disparity maps and the motion estimates from Section 4.4, we compute a 2D occupancy grid of the environment in reference road coordinates, representing obstacles and drivable (road) areas. The occupancy grid is obtained by assuming spatial independence amongst cells, applying the discrete Bayes filter [57] and ignoring moving objects (tracklets, scene flow).
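Per cell, the discrete Bayes filter amounts to a log-odds update under the spatial independence assumption. The sketch below updates a single cell from a made-up sequence of stereo 'hits' and maps the belief to the three states {−1, 0, +1} used by the model; the inverse sensor probabilities and decision thresholds are placeholders, not values from the paper.

// Sketch: per-cell occupancy estimation with a discrete (log-odds) Bayes filter.
#include <cmath>
#include <cstdio>
#include <vector>

// Log-odds update for one cell: l <- l + log(p(z|occ)/p(z|free)).
double updateCell(double logOdds, bool hit,
                  double pHitGivenOcc = 0.7, double pHitGivenFree = 0.3) {
  double pz_occ = hit ? pHitGivenOcc : 1.0 - pHitGivenOcc;
  double pz_free = hit ? pHitGivenFree : 1.0 - pHitGivenFree;
  return logOdds + std::log(pz_occ / pz_free);
}

// Map the filtered belief to the three states used by the model:
// -1 (free), 0 (unobserved / undecided), +1 (occupied).
int classifyCell(double logOdds, int numUpdates) {
  if (numUpdates == 0) return 0;
  if (logOdds > 1.0) return +1;
  if (logOdds < -1.0) return -1;
  return 0;
}

int main() {
  std::vector<bool> measurements = {true, true, false, true};  // stereo "hits"
  double logOdds = 0.0;  // prior odds 1:1
  for (bool hit : measurements) logOdds = updateCell(logOdds, hit);
  std::printf("log-odds=%.2f  state=%+d\n", logOdds,
              classifyCell(logOdds, int(measurements.size())));
  return 0;
}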

5 EXPERIMENTAL EVALUATION

We recorded a dataset of 113 video sequences of real traffic with a duration of 5 to 30 seconds each. This corresponds to a total of 9438 frames or ∼944 seconds. All sequences end when entering the intersection, i.e., when the autonomous system would need to take a decision. Fig. 11 shows the large variability in appearance. We annotate the data using Google Maps aerial images. For each intersection in the database we labeled the center of the intersection as well as the number, orientation and width of the intersecting streets in bird's eye perspective.

We then map the annotated geometry into the road coordinate system using the GPS coordinates of the vehicle. Note that the GPS is only used for annotation. Additionally, for all vehicle tracklets that have been detected by the approach described in Section 4.1, we annotate their lane index l, including parking. For vehicles that have been associated to a lane, the tangent at the closest foot point of lane l is used as object orientation ground truth. Furthermore, we manually annotate all lanes in each scenario with a binary label indicating whether the lane is active, i.e., whether vehicles move on that lane.

5.1 Settings

Our experiments evaluate the performance of the proposed system as well as the importance of each individual feature cue, abbreviated as

P = Prior (see Section 3.3.1)
T = Tracklets (see Section 3.3.2)
V = Vanishing Lines (see Section 3.3.3)
S = Semantic Labels (see Section 3.3.4)
F = Scene Flow (see Section 3.3.5)
O = Occupancy Grid (see Section 3.3.6)

To gain insights into the strengths and weaknesses of each cue, we evaluate the prior alone, all terms individually (PT, PV, PS, PF, PO), all terms but one (PVSFO, PTSFO, PTVFO, PTVSO, PTVSF) and the full model (PTVSFO). For each setting we learn a separate set of parameters.

For evaluating the proposed model, we leverage 10-fold cross-validation and learn the model parameters Θ = {λ_p, ξ_p, λ_t, λ_v, λ_s, λ_f1, λ_f2, λ_o} in each fold using contrastive divergence, as described in Section 3.5. All parameters are initialized to θ_i = 1. Convergence typically occurred after 150-250 gradient ascent steps. We excluded ζ_t, ζ_v and ζ_f from the maximum likelihood estimation process described in Section 3.5 as we found them hard to optimize. Instead, we estimated them using cross-validation. All parameters which are not part of Θ have been determined empirically and are summarized in Table 2. Inference is performed by drawing n_inf = 10,000 samples and computing the MAP solution. Our mixed C++/MATLAB implementation requires ∼8 seconds per frame for inference and about an hour for learning the model parameters on a standard PC using 8 CPU cores. Most time is spent on feature extraction, especially object detection.

TABLE 2: Setting of Constants in the Model. For reproducibility of our results, we specify all constants in our model. On acceptance of this paper, we will also release the data and MATLAB/C++ code.

Prior (Section 3.3.1):
σ_α = 0.1 rad        KDE kernel bandwidth

Vehicle Tracklets (Section 3.3.2 and Section 4.1):
τ_d = 0.2            NMS overlap threshold (object detection)
τ_t1 = 0.5           Gating threshold (tracking stage 1)
τ_t2 = 0.3           Gating threshold (tracking stage 2)
ζ_t = 10^-20         Outlier threshold
σ_out = 70 m         Std. deviation of outlier distribution

Vanishing Points (Section 3.3.3):
ζ_v = 10^-10         Outlier threshold

Semantic Scene Labels (Section 3.3.4 and Section 4.3):
n_s = 4 px           Image patch (superpixel) size
w_s = (1, 1, 4)      Scene label weights

Scene Flow (Section 3.3.5 and Section 4.4):
n_f = 50             Number of RANSAC samples
ζ_f = 10^-15         Outlier threshold
σ_out = 70 m         Std. deviation of outlier distribution

Occupancy Grid (Section 3.3.6 and Section 4.5):
n_o = 1 m            Occupancy grid cell size
w_o = (−1, 4, 1)     Weights of geometric prior
∆_o = (2, 20) m      Margins of geometric prior

Inference and Learning (Section 3.4 and Section 3.5):
n_inf = 10,000       Number of samples at inference
n_learn = 10         Number of samples per learning iteration
n_iter = 500         Number of learning iterations

5.2 Topology and Geometry

To judge the performance of the proposed model, we evaluate the estimation results of each setting against several metrics. First, we measure the accuracy in topology estimation, which is the percentage of all 113 cases in which the correct topology κ has been recovered. We propose three geometric metrics: the average Euclidean error when estimating the intersection center, the average street orientation error and the road area overlap. Regarding the street orientation, we assign each street to its (rotationally) closest counterpart in the ground truth layout in order to decouple the orientation measure from the estimated topology κ. More precisely, we take the layout with the smaller number of streets and assign all streets to their closest counterparts in the layout with the larger number of streets. Finally, the road area overlap measures to what extent the estimated road layout overlaps with the ground truth by computing the intersection-over-union of both road areas. Here, the road area is defined as the concave hull enclosing all streets up to a distance of three times the average ground truth street width.

As evidenced by Table 3, each feature is able to improve results compared to the prior (columns 1-6). The strongest cues in our framework are the vehicle tracklets, the 3D scene flow and the occupancy grid features. This indicates that, despite its noisy nature, depth information is very important. The smallest gain in performance is observed for the vanishing point feature, as it only improves performance in combination with other cues.

Performance improves further when combining features. In terms of topology estimation, the best results have been obtained by making use of all information. Regarding the geometric error measures, all settings that include the occupancy grid feature perform comparably well. This leads us to the conclusion that occupancy information is complementary, while other cues such as 3D scene flow and vehicle tracklets can partly replace each other.

5.3 Tracklet Associations and Semantic Activities

Besides the geometric reasoning discussed so far, an important aspect in real-world applications is to understand the scene at a higher level. This includes the association of vehicle tracklets to lanes ('Tracklet Accuracy' in Table 3) as well as the detection of active lanes ('Lane Accuracy' in Table 3), where we assume a single lane connecting each inbound with each outbound street, as described in Section 3.1. By active we refer to lanes on which at least one vehicle is moving. For evaluating the above-mentioned metrics we extract all unique tracklets, defined by a minimum length of 10 meters. We define a lane as active if at least one tracklet has been uniquely assigned to it. The tracklet and lane accuracies for all settings are depicted in Table 3 (rows 5 and 6). As expected, the best results are obtained when either the 3D scene flow or the vehicle tracklet features are present, with 80% accuracy in tracklet associations and 90% accuracy in detecting active lanes.
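The following sketch spells out these two evaluation rules. The tracklet attributes (length_m, assigned_lane, gt_lane) are hypothetical names introduced only to make the protocol explicit.

```python
def unique_tracklets(tracklets, min_length_m=10.0):
    """'Unique' tracklets are those covering at least 10 meters."""
    return [t for t in tracklets if t.length_m >= min_length_m]

def active_lanes(tracklets, min_length_m=10.0):
    """A lane counts as active if at least one unique tracklet is assigned to it."""
    return {t.assigned_lane for t in unique_tracklets(tracklets, min_length_m)
            if t.assigned_lane is not None}

def tracklet_accuracy(tracklets, min_length_m=10.0):
    """Fraction of unique tracklets that were assigned to their ground-truth lane."""
    unique = unique_tracklets(tracklets, min_length_m)
    correct = sum(t.assigned_lane == t.gt_lane for t in unique)
    return correct / max(len(unique), 1)
```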

5.4 Object Detection and Orientation Estimation

As we have shown in Section 5.2, objects help in estimating the layout and geometry of the scene. Conversely, knowledge about the road layout should also help to improve the performance of object detectors, both in detection accuracy and in object orientation estimation.

First, we re-estimate the orientations of all objects that are used as input to our model. For associating the tracklets to lanes and the detections to lane spline points, we employ the inference procedure described in Section 3.4.2. We select the tangent angle at the associated spline's foot point s on the inferred lane l as our novel orientation estimate. Table 3 (row 7) shows that we are able to significantly reduce the orientation error of moving vehicles from 32.6 degrees, corresponding to the orientation error of the raw detections (not depicted in the table), down to 14.0 degrees when using our model in combination with vehicle tracklets or 3D scene flow.
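A minimal sketch of this re-estimation step is given below, assuming the inferred lane is available as a parametric B-spline (here a SciPy tck representation, which is an assumption and not our released code) and s is the spline parameter of the associated foot point.

```python
import numpy as np
from scipy.interpolate import splev

def orientation_from_lane(tck, s):
    """Tangent angle of the lane spline at foot point parameter s, used as
    the re-estimated object orientation in the road plane."""
    dx, dy = splev(s, tck, der=1)  # first derivative of the parametric spline
    return float(np.arctan2(dy, dx))
```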

We also manually annotated all cars in the last frame of each sequence using 2D bounding boxes, yielding 355 labeled car instances in total. Given the object detections and the inferred road geometry from Section 5.2, we re-score each object detection by adding the following term to the scores of [25]:

$$
0.5 \left[ \max_{l} \exp\!\left( -\frac{\Delta_l^2}{2 w^2} \right) + \sum_{i=1}^{3} \exp\!\left( -\frac{(x_i - \mu_i)^2}{2 \sigma_i^2} \right) \right] - 1
$$

Here, ∆l is the distance of a car detection to lane spline l, w is the estimated street width, and {µi, σi} are the mean and standard deviation of the object width, height and position, respectively. These estimates are obtained from a held-out training set using maximum likelihood estimation. Fig. 10 depicts the precision-recall curves for the L-SVM baseline [25] and our approach.
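A hedged sketch of this re-scoring term is shown below; variable names mirror the equation, with dists_to_lanes holding the distances ∆l of a detection to all lane splines and x, mu, sigma the three-vectors of the object width, height and position and their held-out ML estimates. The term rewards detections that lie close to an inferred lane and whose size and position agree with the training statistics, and is added to the detector score of [25].

```python
import numpy as np

def rescore_term(dists_to_lanes, w, x, mu, sigma):
    """Additive re-scoring term, roughly in (-1, 1]: lane proximity plus
    agreement of width/height/position with held-out statistics."""
    d = np.asarray(dists_to_lanes, dtype=float)
    lane_term = np.max(np.exp(-d ** 2 / (2.0 * w ** 2)))
    x, mu, sigma = (np.asarray(v, dtype=float) for v in (x, mu, sigma))
    prior_term = np.sum(np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)))
    return float(0.5 * (lane_term + prior_term) - 1.0)
```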


                                P     PT    PV    PS    PF    PO   PVSFO  PTSFO  PTVFO  PTVSO  PTVSF  PTVSFO
Topology Accuracy (%, ↑)      19.5   64.6  20.3  55.8  66.4  83.2  91.1   89.4   91.1   90.3   72.6   92.0
Location Error (m, ↓)          6.8    5.1   6.7   7.9   4.5   4.0   3.0    2.9    2.7    3.1    4.8    3.0
Street Orient. Err. (deg, ↓)   8.7    5.4   8.7   7.3   5.2   5.1   3.6    3.6    3.6    3.6    4.7    3.6
Road Area Overlap (%, ↑)      33.1   51.0  32.9  41.2  53.6  63.4  69.3   69.7   71.2   68.9   57.5   69.9
Tracklet Accuracy (%, ↑)      28.1   78.7  30.0  35.8  79.2  65.1  80.7   80.3   79.8   77.6   81.6   82.0
Lane Accuracy (%, ↑)          77.3   90.3  78.0  80.2  90.5  85.2  89.7   89.4   89.8   88.3   90.2   89.7
Object Orient. Err. (deg, ↓)  53.0   17.7  54.4  45.1  17.5  24.2  14.0   15.1   14.9   15.4   15.1   14.3
Object Detection AP (%, ↑)    73.6   73.7  73.7  73.1  73.6  74.1  74.1   74.1   74.2   74.1   73.7   74.0

Legend: P = Prior, T = Tracklets, V = Vanishing Lines, S = Semantic Labels, F = Scene Flow, O = Occupancy Grid

TABLE 3: Quantitative Results. ↑ (↓) means higher (lower) is better. All numbers represent averages over all 113 sequences.

Fig. 10: Improving Object Detection. Precision-recall curves for the L-SVM baseline [25] and our PTVSFO setting (left) and an example where our algorithm (bottom-right) is able to eliminate false positives of a state-of-the-art object detector [25] (top-right).

Note that our geometric and topological constraints increase detection performance significantly, improving average precision from 69.9% to 74.2%. The benefits of including this knowledge in the detection process are also illustrated in Fig. 10 (right). In order to include the partly occluded car on the right in the detection result, the threshold of the baseline has to be lowered to a value which produces two false positives (top). In contrast, our re-scored ranking is able to handle this case (bottom).

5.5 Qualitative Results

Fig. 11 illustrates our inference results for the setting 'PTVSFO', with the most likely lane for each unique tracklet indicated by an arrow. The ego-vehicle (observer) is depicted in black. For most sequences the road layout has been estimated correctly and the vehicles have been assigned to the correct lanes. Only vehicles that are very far away or visible for only a couple of frames pose problems in terms of their lane associations. However, note that in many cases this did not affect the estimated layout. More results can be found in the supplementary video available from our project website (http://www.cvlibs.net/projects/intersection).

6 CONCLUSIONS AND FUTURE DIRECTIONS

We have developed an approach that allows autonomous vehicles to estimate the layout of urban intersections based on onboard stereo imagery alone. Our model does not rely on strong prior knowledge such as intersection maps, but infers all information from different types of visual features that describe the static environment of the crossroads (i.e., facades of houses) and the motions of objects (i.e., cars) in the scene. Different from previous approaches, our method does not rely on traditional features like lane markings or curb stones, which are often unavailable in urban scenarios. While we found performance improvements for all features, occupancy grids as well as vehicle tracklets and 3D scene flow have been identified as the strongest and most important cues.

To accommodate the ambiguities in the visual information, we developed a probabilistic model to describe the appearance of the proposed features relative to the intersection geometry. We utilized Markov Chain Monte Carlo sampling for inference to cope with the complex relationship between the crossroad layout and the appearance of the features. As evidenced by the experiments, our approach is able to recognize urban intersections reliably, with an accuracy of up to 90% on a data set of 113 realistic real-world intersections. For most of these intersections, traditional approaches based on lane markings and curb stones would have failed due to the absence of these feature cues. Moreover, we found that context from our model helps to improve the performance of state-of-the-art object detectors, both in terms of detecting objects and estimating their orientation.

Our approach serves as a basis for future improvements. Currently, the assumption that tracklets are independent can lead to implausible configurations such as cars colliding with each other. Including higher-level background knowledge on how cars can pass an intersection will help to reduce ambiguities and increase robustness. Furthermore, improved sensor observations and more computing power will allow for more accurate motion models and lane representations. Another interesting direction of research will be to integrate information from other traffic participants (e.g., pedestrians) into the model to perform collaborative inference.

Andreas Geiger received his Diploma in computer science and his Ph.D. degree from Karlsruhe Institute of Technology in 2008 and 2013. Currently, he is a research scientist in the Perceiving Systems group at the Max Planck Institute for Intelligent Systems in Tübingen. His research interests include computer vision, machine learning and scene understanding.


Fig. 11: Inference Results. For each sequence we show the input image (top) and the inference result (bottom). Arrows indicate the predicted driving direction(s). The last row shows four cases where topology estimation has failed. However, note that in many cases the lanes are still associated correctly.

Martin Lauer studied computer science at Karlsruhe University and received his Ph.D. in computer science from the University of Osnabrück. As a research group leader he works in the fields of machine vision, robotics, and machine learning at Karlsruhe Institute of Technology.

Christian Wojek received his Diplom degree in Computer Science from the University of Karlsruhe in 2006 and his PhD from TU Darmstadt in 2010. He was awarded a DAAD scholarship to visit McGill University from 2004 to 2005. He was with MPI Informatics Saarbrücken as a postdoctoral fellow from 2010 to 2011 and joined Carl Zeiss Corporate Research in 2011. His research interests are object detection, scene understanding and activity recognition, in particular with application to real-world scenarios.

Christoph Stiller is a professor at the Karlsruhe Institute of Technology and head of the Institute of Measurement and Control. At the same time he is a director at the FZI Research Center for Information Technology, heading the mobile perception research group. His research focus is on stereo vision, camera-based scene understanding, driver assistance systems, and autonomous vehicles.

Raquel Urtasun is an Assistant Professor at TTI-Chicago. Previously, she was a postdoctoral research scientist at UC Berkeley and ICSI and a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. She completed her PhD at the Computer Vision Laboratory at EPFL, Switzerland. Her major interests are statistical machine learning and computer vision.


REFERENCES

[1] Y. Alon, A. Ferencz, and A. Shashua. Off-road path following using region classification and geometric projection constraints. In CVPR, 2006.
[2] J. M. Alvarez, T. Gevers, and A. M. Lopez. 3D scene priors for road detection. In CVPR, 2010.
[3] M. Aly. Real time detection of lane markers in urban streets. In IV, 2008.
[4] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.
[5] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, 2010.
[6] S. Bao, M. Sun, and S. Savarese. Toward coherent object detection and scene layout understanding. In CVPR, 2010.
[7] O. Barinova, V. Lempitsky, E. Tretyak, and P. Kohli. Geometric image parsing in man-made environments. In ECCV, 2010.
[8] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[9] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In ICCV, 2009.
[10] A. Broggi, P. Medici, P. Zani, A. Coati, and M. Panciroli. Autonomous vehicles control in the VisLab Intercontinental Autonomous Challenge. ARC, 36(1):161–171, 2012.
[11] G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using SfM point clouds. In ECCV, 2008.
[12] M. A. Brubaker, A. Geiger, and R. Urtasun. Lost! Leveraging the crowd for probabilistic visual self-localization. In CVPR, 2013.
[13] M. Buehler, K. Iagnemma, and S. Singh, editors. The DARPA Urban Challenge, volume 56 of Advanced Robotics, 2009.
[14] W. Choi and S. Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. In ECCV, 2010.
[15] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, 2012.
[16] J. Crisman and C. Thorpe. SCARF: A color vision system that tracks roads and intersections. TRA, 9(1):49–58, February 1993.
[17] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski. Self-supervised monocular road detection in desert terrain. In RSS, 2006.
[18] R. Danescu and S. Nedevschi. Probabilistic lane tracking in difficult road scenarios using stereovision. TITS, 10(2):272–282, 2009.
[19] C. De Boor. A Practical Guide to Splines. Number 27 in Applied Mathematical Sciences. Springer-Verlag, 1978.
[20] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV, 2009.
[21] E. D. Dickmanns and B. D. Mysliwetz. Recursive 3-D road and relative ego-state recognition. PAMI, 14(2):199–213, February 1992.
[22] W. Enkelmann, G. Struck, and J. Geisler. ROMA - a system for model-based analysis of road markings. In IV, 1995.
[23] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Robust multi-person tracking from a mobile platform. PAMI, 31:1831–1846, 2009.
[24] A. Ess, T. Mueller, H. Grabner, and L. van Gool. Segmentation-based urban traffic scene understanding. In BMVC, 2009.
[25] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32:1627–1645, 2010.
[26] D. M. Gavrila. A Bayesian, exemplar-based approach to hierarchical shape matching. PAMI, 29:1408–1421, 2007.
[27] D. M. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV, 73:41–59, 2007.
[28] A. Geiger. Probabilistic Models for 3D Urban Scene Understanding from Movable Platforms. PhD thesis, KIT, 2013.
[29] A. Geiger, M. Lauer, F. Moosmann, B. Ranft, H. Rapp, C. Stiller, and J. Ziegler. Team AnnieWAY's entry to the Grand Cooperative Driving Challenge 2011. TITS, 2012.
[30] A. Geiger, M. Lauer, and R. Urtasun. A generative model for 3D urban scene understanding from movable platforms. In CVPR, 2011.
[31] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[32] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In ACCV, 2010.
[33] A. Geiger, C. Wojek, and R. Urtasun. Joint 3D estimation of objects and scene layout. In NIPS, 2011.
[34] A. Geiger, J. Ziegler, and C. Stiller. StereoScan: Dense 3D reconstruction in real-time. In IV, 2011.
[35] V. Gengenbach, H. H. Nagel, F. Heimes, G. Struck, and H. Kollnig. Model-based recognition of intersections and lane structures. In IV, 1995.
[36] A. Gupta, A. A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV, 2010.
[37] G. Hinton. Training products of experts by minimizing contrastive divergence. NC, 14(8):1771–1800, 2002.
[38] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 80:3–15, 2008.
[39] D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75(1):151–172, October 2007.
[40] R. E. Kalman. A new approach to linear filtering and prediction problems. JBE, 1(82):35–45, 1960.
[41] V. Kastrinaki. A survey of video processing techniques for traffic applications. IVC, 21(4):359, 2003.
[42] Z. Khan, T. Balch, and F. Dellaert. MCMC-based particle filtering for tracking a variable number of interacting targets. PAMI, 27(11):1805–1819, 2005.
[43] J. Kosecka and W. Zhang. Video compass. In ECCV, 2002.
[44] D. Kuettel, M. D. Breitenstein, L. V. Gool, and V. Ferrari. What's going on?: Discovering spatio-temporal dependencies in dynamic scenes. In CVPR, 2010.
[45] H. W. Kuhn. The Hungarian method for the assignment problem. NRLQ, 2:83–97, 1955.
[46] S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In ICCV, 2005.
[47] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled detection and tracking from static cameras and moving vehicles. PAMI, 30(10):1683–1698, 2008.
[48] M. Lutzeler and E. D. Dickmanns. EMS-Vision: Recognition of intersections on unmarked road networks. In IV, 2000.
[49] J. C. McCall and M. M. Trivedi. Video-based lane estimation and tracking for driver assistance: survey, system, and evaluation. TITS, 7(1):20–37, March 2006.
[50] V. Nedovic, A. W. M. Smeulders, A. Redert, and J. M. Geusebroek. Stages as models of scene geometry. PAMI, 32:1673–1687, 2010.
[51] D. Pomerleau. RALPH: Rapidly adapting lateral position handler. In IV, pages 506–511, 1995.
[52] D. Ramanan and D. Forsyth. Finding and tracking people from the bottom up. In CVPR, 2003.
[53] C. Rasmussen. Road shape classification for detecting and negotiating intersections. In IV, 2003.
[54] A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. IJCV, 76:53–69, 2008.
[55] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. PAMI, 31:824–840, 2009.
[56] G. Singh and J. Kosecka. Acquiring semantics induced topology in urban environments. In ICRA, 2012.
[57] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT Press, 2005.
[58] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.
[59] X. Wang, X. Ma, and W. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. PAMI, 31:539–555, 2009.
[60] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele. Monocular visual scene understanding: Understanding multi-object traffic scenes. PAMI, 2012.
[61] C. Wojek and B. Schiele. A dynamic conditional random field model for joint labeling of object and scene classes. In ECCV, 2008.
[62] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with and where are you going? In CVPR, 2011.
[63] Q. Yu and G. Medioni. Multiple-target tracking by spatiotemporal Monte Carlo Markov chain data association. PAMI, 31(12):2196–2210, 2009.
[64] Q. Zhu, L. Chen, Q. Li, M. Li, A. Nuchter, and J. Wang. 3D lidar point cloud based intersection recognition for autonomous driving. In IV, 2012.