
IEEE Intelligent Systems, vol. 13, no. 6, 1999.

Autonomous Driving Approaches Downtown

U. Franke, D. Gavrila, S. Görzig, F. Lindner, F. Paetzold, C. Wöhler
Daimler-Benz Research, T728
D-70546 Stuttgart, Germany
{franke, paetzold, goerzig}@dbag.stg.DaimlerBenz.com
{gavrila, lindner, woehler}@dbag.ulm.DaimlerBenz.com

Keywords: Machine Vision, Image Understanding, Autonomous Driving

Abstract

Most computer vision systems for vehicle guidance developed in the past were designed for the comparatively simple highway scenario. Autonomous driving in the much more complex scenario of urban traffic, or driver assistance systems like Intelligent Stop&Go, are new challenges not only from the algorithmic but also from the system architecture point of view. This contribution describes our current work on these topics. It includes the appropriate algorithms as well as approaches to control the various vision modules.

1. Introduction

Systems for vision-based navigation in the early 80's were based on the experience gained from static image processing. Since little computational power was available at that time, it was common to take a picture, analyse it, and drive some distance in a blind fashion before the vehicle was stopped again for the next picture.

In 1986, Dickmanns [Dic86] demonstrated autonomous driving on highways at speeds up to 100 km/h using only a couple of 8086 processors. His idea was to use Kalman filters to restrict the possible interpretations of the scene so that they are consistent with the dynamics of the considered systems as well as the interpretations derived in the past.

A large number of vision systems for lateral and longitudinal vehicle guidance, lane departure warning and collision avoidance have been developed during the last 10 years all over the world (e.g. [Tho88], [Web95], [Fra95], [Pom96]).

Most of the known autonomous demonstrators have been designed for highway traffic since this scenario is relatively simple: lanes are usually well marked and built with slowly changing curvature, traffic signs are large and clearly visible, and other vehicles are the only potential obstacles that need to be considered. As shown in the final presentation of the European Prometheus project, the Daimler-Benz demonstrator vehicle VITA II is able to drive autonomously on highways and perform overtaking manoeuvres without any interaction [Ulm94].

A vision system would be even more attractive for future customers if its use were not limited to highway-like roads, but were also extended to support the driver in everyday traffic situations, including city traffic. Consequently, future computer vision research for traffic applications will have to consider a much wider range of situations than it does today.

Particularly attractive for driver assistance is the urban traffic environment. Imagine an Intelligent Stop&Go system that is able to behave like a human driver: it not only keeps the distance to its leader constant, as a radar-based system would do, but also follows the leader laterally. Moreover, it stops at red traffic lights and stop signs, gives right of way to other vehicles if necessary and tries to avoid collisions with children running across the street.

Unfortunately, vision on urban roads turns out to be much more difficult than on highways due to the complexity of this environment. Figure 1 shows an everyday example that we have to understand.

In this contribution we describe our approach to building an intelligent real-time vision system for this scenario. This includes stereo vision for depth-based obstacle detection and tracking, a framework for monocular detection and recognition of relevant objects, and an attempt to realise such a system without the necessity of a supercomputer in the trunk. The computational power in our demonstrator car UTA (Urban Traffic Assistant) is currently limited to three 200 MHz PowerPCs.

2. Urban Applications and Vision Tasks

In order to limit the complexity of autonomous driving, we intend to realise the described Intelligent Stop&Go system first. The goal is to detect an appropriate leading vehicle and to signal the driver that the system is ready for autonomous following. It will be his responsibility to activate the system. If the system cannot continue to follow the leader (e.g. because it is changing lanes), it shall be allowed to drive a limited distance in autonomous mode, searching for a new leader, before control of the vehicle is turned over to the driver again. Control will also be given back if the vehicle has to stop in front of a stop sign or if it is the first vehicle in front of a red traffic light.

Besides an Intelligent Stop&Go system as sketched above, driver assistance systems like rear-end collision avoidance or red traffic light recognition and warning are also of interest for urban traffic.

The most important perception tasks that have to be performed in order to build such systems are:

• The leading vehicle must be detected, and its distance, speed and acceleration must be estimated in the longitudinal and lateral directions.

• The course of the lane must be extracted even if it is not given by well-painted markings and does not show clothoidal geometry.

• Small traffic signs and traffic lights have to be detected and recognised in a highly coloured environment.

• Additional traffic participants like bicyclists or pedestrians must be detected and classified.

• Stationary obstacles that limit the available free space, e.g. parking cars, must be detected.

3. Stereo-based Obstacle Detection and Tracking

For navigation in urban traffic it is necessary to build an internal 3D map of the environment in front of the car. This map must include position and motion estimates of relevant traffic participants and potential obstacles. In contrast to the highway scenario, where one can concentrate on looking for the rear sides of leading vehicles, our system has to deal with a large number of different objects.

Fig. 1: Image understanding in the urban environment is more challenging than on highways.

Several schemes for object detection in traffic scenes have been investigated in the past. Besides 2D model-based approaches searching for rectangular, symmetric shapes, inverse perspective mapping techniques [Bro97], optical-flow-based approaches [Enk97] and correlation-based stereo systems [San96] have been tested.

The most direct method to derive 3D information is binocular stereo vision. The key problem is the correspondence analysis. Unfortunately, classical approaches like area-based correlation techniques or edge-based approaches are computationally very expensive. To overcome this problem, we have developed a feature-based approach that is tailored to our specific needs and runs in real time on a 200 MHz PowerPC 604 [Fra96].

This scheme classifies each pixel according to the grey values of its four direct neighbours. It is checked whether each neighbour is much brighter, much darker or of similar brightness compared to the considered central pixel. This leads to 3⁴ = 81 different classes encoding edges and corners at different orientations.
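As an illustration, here is a minimal NumPy sketch of this classification, assuming a greyscale image array and an illustrative brightness threshold t (the paper does not state the exact threshold):

```python
import numpy as np

def classify_pixels(img, t=10):
    # Compare each pixel with its four direct neighbours and encode each
    # comparison as 0 (similar), 1 (much darker) or 2 (much brighter),
    # giving 3**4 = 81 structure classes.
    img = img.astype(np.int16)
    h, w = img.shape
    c = img[1:-1, 1:-1]
    classes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    neighbours = [img[:-2, 1:-1], img[2:, 1:-1], img[1:-1, :-2], img[1:-1, 2:]]
    for i, n in enumerate(neighbours):
        code = np.where(n < c - t, 1, np.where(n > c + t, 2, 0))
        classes += code.astype(np.uint8) * 3**i
    return classes  # class 0 = homogeneous area, ignored later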

Fig. 3.1 shows a stereo image pair taken from our camera system with a base width of 30 cm. The result of the structure classification is shown in Fig. 3.2 for the left image. Different grey values represent different structures; pixels in homogeneous areas are assigned to the "white" class and ignored in the sequel.

The correspondence analysis works on these feature images. The search for possibly corresponding pixels is reduced to a simple test of whether two pixels belong to the same class. Thanks to the epipolar constraint and the fact that the cameras are mounted with parallel optical axes, pixels with identical classes need only be searched for on corresponding image rows.

It is obvious that this classification scheme cannot guarantee uniqueness of the correspondences. If ambiguities occur, the solution giving the smallest disparity, i.e. the largest distance, is chosen. This prevents wrong correspondences caused by, for example, periodic structures from generating phantom obstacles close to the camera. In addition, measurements that violate the ordering constraint are ignored.
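A sketch of the row-wise matching under these constraints; the rectified left/right geometry and the search range max_disp are assumptions, and the ordering-constraint check is omitted for brevity:

```python
def match_row(left_classes, right_classes, row, max_disp=64):
    # For each feature pixel in the left image, search the same row of the
    # right image (epipolar constraint, parallel optical axes) for pixels
    # of identical class.  Ambiguities are resolved in favour of the
    # smallest disparity, i.e. the largest distance.
    disparities = {}
    for x in range(len(left_classes[row])):
        cls = left_classes[row][x]
        if cls == 0:                        # homogeneous pixel, no feature
            continue
        for d in range(max_disp + 1):       # d = 0 is the farthest solution
            if x - d >= 0 and right_classes[row][x - d] == cls:
                disparities[x] = d          # first hit = smallest disparity
                break
    return disparities
```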

The outcome of the correspondence analysis is a disparity image, which is the basis for all subsequent steps. Fig. 3.3 visualises such an image.

Fig. 3.1: Stereo image pair.

Fig. 3.2: Classified pixels.

Fig. 3.3: Distance image. The brightness of the feature pixels is proportional to their distance.


3.1 Structures on the Road

For the recognition of road boundaries it is helpful to know which structures in the image lie on the road surface. Such structures can easily be extracted from the disparity image if the road can be assumed to be (nearly) flat.

If the camera orientation is known, all points having a disparity in a certain interval around the expected value lie on the road; those with larger disparities belong to objects above the road. Looking for all points which lie within a height interval of [-15 cm, 15 cm] relative to the road yields the result displayed in Fig. 3.4.
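For a flat road, the expected disparity is a linear function of the image row. A sketch under this assumption (the paper does not give its exact formulation), with B the stereo base width, f the focal length in pixels, H the camera height and pitch the camera pitch angle:

```python
import math

def expected_road_disparity(v, f, B, H, pitch):
    # Flat-road model: a pixel in image row v (relative to the principal
    # point, positive downwards) lies on the road plane at disparity
    #     d(v) = (B / H) * (v * cos(pitch) + f * sin(pitch)).
    # Road points have a disparity close to d(v); obstacle points above
    # the road show a clearly larger disparity.
    return (B / H) * (v * math.cos(pitch) + f * math.sin(pitch))
```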

3.2 Estimation of Camera Height and Pitch Angle

In practice, camera height and pitch angle are not constant during driving. Fortunately, the relevant camera parameters can themselves be efficiently estimated from the extracted road surface points. Least squares techniques or Kalman filtering can be used to minimise the sum of squared residuals between expected and found disparities.
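Since the flat-road disparity sketched above is linear in two parameters that encode height and pitch, a least-squares fit over the extracted road points recovers both; a sketch with variable names of our choosing:

```python
import numpy as np

def fit_height_and_pitch(v, d, f, B):
    # Road pixels obey d = p0*v + p1*f with p0 = B*cos(pitch)/H and
    # p1 = B*sin(pitch)/H.  Solve for (p0, p1) in the least-squares
    # sense, then recover pitch angle and camera height.
    A = np.column_stack([v, np.full_like(v, f, dtype=float)])
    (p0, p1), *_ = np.linalg.lstsq(A, d, rcond=None)
    pitch = np.arctan2(p1, p0)
    H = B * np.cos(pitch) / p0
    return H, pitch
```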

Results of a test drive are shown in Fig. 3.5. During acceleration the pitch angle decreases, while it increases during braking manoeuvres. In the time between 80 and 120 seconds the speed was nearly constant; the oscillations of the pitch angle are due to the uneven road surface.

3.3 Obstacle Detection and Tracking

If all features on the road plane are removed, a 2D depth map containing the remaining features can be generated. The map shown in Fig. 3.6 covers an area of 30 m in length and 8 m in width. The leading vehicle causes the two peaks at 15 m distance and lateral positions of 0 m and -2 m. This map is used to detect objects that are subsequently tracked. In each loop, already tracked objects are deleted from this depth map prior to the detection.

The detection step delivers a rough estimate of the object width. A rectangular box is fitted to the cluster of feature points that contributed to the extracted area in the depth map. This cluster is tracked from frame to frame. For the estimation of the obstacle distance, a disparity histogram of the object's feature points is computed. In order to obtain a disparity estimate with subpixel accuracy, a parabola is fitted to the maximum of this histogram.

In the current version, an arbitrary number of objects can be considered. Sometimes the right and left parts of a vehicle are initially tracked as two distinct objects. These objects are merged on a higher "object level" if their relative position and motion fulfil reasonable conditions.

Fig. 3.4: Classified road pixels.

Fig. 3.5: Estimated camera pitch angle [rad] during acceleration and deceleration, together with the averaged pitch angle and the vehicle speed [m/sec], plotted over time.

Fig. 3.6: Depth map (30 m deep, lateral positions from -2 m to +2 m); the peaks are caused by the car and the trees.

From the positions of the objects relative to the camera system, their motion states, i.e. speed and acceleration in the longitudinal as well as the lateral direction, are estimated by means of Kalman filters.
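The paper does not spell out the filter design; here is a minimal sketch, assuming a linear Kalman filter with a constant-acceleration state (position, velocity, acceleration) per axis and the stereo-measured position as the observation. One filter instance per axis then tracks each object.

```python
import numpy as np

def make_ca_model(dt):
    # Constant-acceleration model for one axis: state x = (pos, vel, acc).
    F = np.array([[1.0, dt, 0.5 * dt ** 2],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    H = np.array([[1.0, 0.0, 0.0]])       # only the position is measured
    return F, H

def kalman_step(x, P, z, F, H, Q, R):
    # One predict/update cycle of a linear Kalman filter.
    x = F @ x                              # state prediction
    P = F @ P @ F.T + Q                    # covariance prediction
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```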

The longitudinal motion parameters are the inputs for a distance controller. Fig. 3.7 shows the results of a test drive in the city of Esslingen, Germany. The desired distance is composed of a safety distance of 10 meters and a time headway of 1 second, i.e. d_desired = 10 m + v · 1 s. Notice the small distance error when the leader stops.

4. Object Recognition

The previous section dealt with the use of stereo vision to detect and track obstacles in front of the vehicle. One essential thing that makes Stop&Go intelligent is the ability to recognise objects. Two classes of objects are relevant for this application:

• elements of the infrastructure (road boundaries, road markings, traffic signs and traffic lights) and

• traffic participants (pedestrians, (motor)bicycles and vehicles).

How do we recognise various objects? Although devising a general framework is difficult, we often find ourselves applying two steps: detection and classification. The purpose of the detection step is to efficiently obtain a region of interest (ROI), i.e. a region in image space or parameter space that could be associated with a potential object. Besides the described stereo approach, we have also developed detection methods based on shape, colour and motion. For example, shape and colour cues are used to find potential traffic signs and arrows on the road (see Subsections 4.1 and 4.2), and motion is used to find potential pedestrians (see Subsection 4.4).

Once a ROI has been obtained, more computation-intensive algorithms are applied to "recognise" the object, i.e. to estimate model parameters. In our application to natural scenes, objects have a wide variety of appearances because of shape variability, different viewing angles and illumination changes. Because explicit models are seldom available, we derive models implicitly by learning from examples. Recognition is seen as a classification process. We have implemented a large set of classifiers for this purpose, i.e. polynomial classifiers, principal component analysis, radial basis functions, (time delay) neural networks and support vector machines (the latter in collaboration with the MIT CBCL Laboratory [Pap98]). Our emphasis on learning approaches to object recognition is backed up by many hours of data recorded on a digital video recorder while driving in urban traffic. Interesting data segments are labelled off-line.

4.1 Road Boundaries and Markings

4.1.1 Road Recognition

Road recognition has two major objectives. As a stand-alone application, it enables automatic road following. In the context of a Stop&Go system, lane changes of the leading vehicle have to be registered in order to return control of the vehicle to the driver, as stated in Section 1.

Fig. 3.7: Autonomous vehicle following in city traffic (ego and leader speed, measured and desired distance [m], plotted over time).

Furthermore, the lateral control behaviour of the Stop&Go system can be improved. As the standard solution, lateral guidance is accomplished by a so-called lateral tow bar controller. It requires distance and angle with respect to the leading vehicle, as measured by the stereo vision system described above. A tow-bar-controlled vehicle drives approximately along the trajectory of the leading vehicle but tends to cut the corner in narrow turns. This undesirable behaviour can be corrected if the position of the ego vehicle relative to the lane is known.

Lanes of urban roads are not as well defined as those of highways. Lane boundaries often show poor visibility. Various objects, such as traffic participants or background infrastructure, clutter the image. The road topology comprises a variety of different lane elements such as stop lines, zebra crossings and forbidden zones. A comprehensive geometrical road model cannot be readily defined. All these characteristics suggest a data-driven, global image analysis that robustly separates lane structures from background.

The global detection analyses polygonal contour images. Fig. 4.1 shows an image of the considered road, Fig. 4.2 the extracted edges. A database organises the polygons with respect to their length, orientation, position and mutual spatial relations such as parallelism and collinearity. It provides fast filters for these attributes. Arbitrary combinations of properties can be specified to detect possible road structures. Detected lane boundaries are subsequently classified as curbs, markings and clutter by a polynomial classifier. Results of the lane boundary recognition are shown in Fig. 4.2 as dark lines; a recognised pedestrian crossing is shown in Fig. 4.3. Regions where obstacles have been detected by the stereo vision system of Section 3 are ruled out as possible positions of road structures.

Even though the entire global detection needs only 80-100 ms on a PowerPC 604, data-driven methods are computationally quite expensive. As learned from the work on highway lane following, model-based estimation of the road geometry is very efficient in that only small portions of the image have to be considered when lane markings are locally tracked in image sequences.

The urban lane tracker combines global analysis with this local principle. For as long as possible, boundaries detected by the global analysis are tracked locally in order to estimate the vehicle's lateral position and yaw angle in a few milliseconds. Tracking and global detection are supervised to collaborate efficiently [Pae98].

Fig. 4.1: Urban road scenario.

Fig. 4.2: Polygonal edge image (light) and detected lane structures (dark).

Fig. 4.3: Detected zebra crossing.

4.1.2 Arrow Recognition

The recognition of arrows on the road follows the two-step procedure, detection and classification, mentioned at the beginning of this section. It uses shape and colour cues in a region-based approach. The detection step consists of a colour segmentation step and a filtering step.

The colour (i.e. greyscale) segmentation step involves reducing the number of colours in the original image to a handful. In this application, this reduction is based on the minima and plateaus of the greyscale histogram. Following this greyscale segmentation, a colour connected components (CCC) analysis is applied to the segmented image. The algorithm produces a database containing information about all regions in the segmented image. Among the computed attributes are the area, bounding box, aspect ratio, length and smoothness of contour, as well as additional features which are determined on a need basis due to their computational cost. We have developed a query language for this region database, dubbed "Meta-CCC", which allows queries based on attributes of single regions as well as on relations between regions (e.g. adjacency, proximity, enclosure and collinearity). The filtering step thus involves formulating a query to select candidate regions from the database. The resulting set is normalised for size and given as input to a radial basis function (RBF) classifier. Fig. 4.4 shows the original image and the obtained result.
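A sketch of the filtering idea on top of a connected-components labelling; the attributes and thresholds shown (area, aspect ratio) are illustrative stand-ins for a Meta-CCC query, not the actual query language:

```python
import numpy as np
from scipy import ndimage

def candidate_regions(segmented, min_area=50, max_aspect=3.0):
    # Label the connected components of one colour class and keep regions
    # whose area and bounding-box aspect ratio are plausible candidates.
    labels, n = ndimage.label(segmented)
    candidates = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        area = int((labels[sl] == i).sum())
        aspect = max(h, w) / min(h, w)
        if area >= min_area and aspect <= max_aspect:
            candidates.append(sl)
    return candidates
```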

4.2 Traffic Signs

4.2.1 Colour

Our colour traffic sign recognition system [Jan93] goes back to the Prometheus project and was originally developed with the highway scenario in mind. Overall, it follows the same steps as described in the previous subsection on road arrow detection, that is, colour segmentation, filtering and classification. The colour segmentation step involves pixel classification using a look-up table. The look-up table was generated off-line in a training process using a polynomial classifier. The outcome of the segmentation process are pixels labelled "red", "blue", "yellow" and "uncoloured". As before, we apply the CCC algorithm and formulate queries on the resulting regions using the Meta-CCC procedure. Typical queries would include searching for red regions of particular shape which enclose an uncoloured region. The next step, classification, is done with an RBF classifier in a multi-stage process. The input is a colour-normalised pictograph, extracted from the original image at the candidate locations provided by the Meta-CCC procedure. The classification stages involve colour, shape and pixel values. The results are stabilised by integration over time [Jan93].

Fig. 4.4: Above: street scene displaying a direction arrow. Below: the segmented and classified arrow.

Fig. 4.5: Recognised traffic signs in a wide-angle view.

There are a number of challenges to traffic sign recognition in the urban scenario. First, the field of view needs to be extended, which results in a lower effective image resolution. At the same time, the use of wide-angle lenses introduces significant lens distortion. Traffic signs are also not viewed head-on, as most of the time on highways. In the city, they must be recognised when they are lateral to the camera and appear skewed in the image. The sketched system has been extended to cope with these problems, and the urban-specific signs that are relevant for our goals have been added. Figure 4.5 shows some results.

4.2.2 Greyscale

Clearly, colour information is beneficial for traffic sign recognition. Yet there are still some open issues regarding colour segmentation. The difficulty is that there is quite a range in which the same "true" traffic sign colour can appear in the image, depending on the illumination conditions, i.e. whether the camera looks towards or away from the sun, day or night, rain or sunshine. In order to answer the question of how successful we can be without colour cues, we are also working on a shape-based traffic sign detection system in greyscale, described here. This system might also be useful in case colour cannot be used because of cost considerations.

The greyscale detection system works on edge features rather than region features. The approach uses a hierarchical template matching technique based on distance transforms (DTs) [Gav98]. DT-based matching allows the matching of arbitrary (binary) patterns in an image. These could be circles, ellipses and triangles, but also non-parameterised patterns, for example outlines of pedestrians (see also Subsection 4.4.1).

The pre-processing step involves reading a test image (Figure 4.6a), computing a thresholded edge image (Figure 4.6c), and computing its distance image ("chamfer image", Figure 4.6d). The distance image has the same size as the binary edge image; at each pixel it contains the image distance to the nearest edge pixel of the corresponding binary edge image.

The matching step involves correlating a binary shape pattern (e.g. Figure 4.6b) with the distance image; at each template location, a correlation measure gives the sum of the distances from the template points to the nearest image edge points. A low value denotes a good match (low dissimilarity); a zero value denotes a perfect match. If the measure is below a user-defined threshold, a pattern is considered detected.
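A sketch of DT-based matching with an exhaustive translation search (the actual system searches coarse-to-fine, as described below); edge_img is assumed to be a boolean edge map and template_pts the (row, col) coordinates of the binary pattern:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def match_template(edge_img, template_pts, threshold=2.0):
    # "Chamfer image": at each pixel, the distance to the nearest edge
    # pixel of the binary edge image.
    dt = distance_transform_edt(~edge_img)
    h, w = edge_img.shape
    th = int(template_pts[:, 0].max()) + 1
    tw = int(template_pts[:, 1].max()) + 1
    hits = []
    for y in range(h - th):
        for x in range(w - tw):
            # Correlation measure: mean nearest-edge distance over all
            # template points; low values denote good matches.
            score = dt[template_pts[:, 0] + y, template_pts[:, 1] + x].mean()
            if score < threshold:
                hits.append((x, y, score))
    return hits
```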

The advantage of matching a template with the DT image is that the resulting similarity measure will be smoother as a function of the template parameters (transformation, shape). This enables the use of various efficient search algorithms to lock onto the correct solution. It also allows more variability between a template and an object of interest in the image. Matching with the edge image (or the unsegmented gradient image, not shown here), on the other hand, typically provides strong peak responses but rapidly declining off-peak responses, which do not facilitate efficient search strategies or allow object variability.

Fig. 4.6: (a) original image, (b) template, (c) edge image and (d) distance-transform image (see text).


We have extended the basic DT-matching scheme with a hierarchical approach: in addition to a coarse-to-fine search over the translation parameters, templates are grouped off-line into a template hierarchy based on their similarity. This way, multiple templates can be matched simultaneously at the coarse levels of the search, resulting in various speed-up factors. The template hierarchy can be constructed manually or generated automatically from available examples (i.e. learning). Furthermore, in matching, features are distinguished by type and separate DTs are computed for each type (e.g. based on edge orientations). For details, refer to [Gav98].

Figure 4.7 illustrates the hierarchical approach. The white dots indicate locations where the match between the image and a (prototype) template of the template tree was good enough to consider matching with more specific templates (e.g. its children) on a finer grid. The final detection result is also shown. More detection results are given in Figure 4.8, including some false positives (c and i). We achieve typical detection rates of 90% on single frames, with 4-6% false positives. More than 95% of the false positives were rejected in a subsequent pictograph classification stage using an RBF network.

4.3 Traffic Lights

The recognition of traffic lights follows the same three steps used before: colour segmentation, filtering and classification. Colour segmentation uses a simple look-up table to determine image parts in the traffic light colours red, yellow and green. By applying the before-mentioned colour connected components (CCC) algorithm to the segmented image, only blob-like regions whose area lies within a certain range are selected as possible traffic light candidates using the Meta-CCC procedure.
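A sketch of look-up-table colour segmentation; the 6-bit-per-channel quantisation is an assumption to keep the table small, and the table itself would be filled off-line from labelled training pixels:

```python
import numpy as np

def segment_colours(img_rgb, lut):
    # Look-up-table pixel classification: the LUT maps a quantised RGB
    # triple to a label (0 = uncoloured, 1 = red, 2 = yellow, 3 = green).
    # Six bits per channel keep the table at 2**18 entries.
    q = (img_rgb >> 2).astype(np.uint32)
    idx = (q[..., 0] << 12) | (q[..., 1] << 6) | q[..., 2]
    return lut[idx]
```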

Fig. 4.7: Traffic sign detection: (a) day and (b) night (white dots denote intermediate results, i.e. the locations matched during the hierarchical search).

Fig. 4.8: More detection results (a)-(i).

A region of interest (ROI) of a size adapted to the blob diameter is then cropped such that it contains not only the luminous part of the traffic light but also its dark box. The ROI is normalised to a uniform size. Eventually, a local contrast normalisation by means of a simulated Mahowald retina is carried out additionally (see Fig. 4.9).

This pre-processed ROI serves as the input to a three-layer feed-forward neural network that performs the actual object recognition task. It is constructed such that a neuron of the second network layer does not "see" the complete underlying image but only a small region of it, i.e. its receptive field. These receptive fields extract local features from the input image that have been learned during the training process. The actual classification takes place in the higher network layers. The network has K output neurons for the K different object classes to be distinguished; the output neuron with the highest activation denotes the class to which the object is assigned. In the case of traffic lights, we have K=2 classes, the class "traffic light" and the class "garbage".

The appearances of red, red-yellow, yellow and green traffic lights are trained separately. On a 133 MHz PowerPC, the algorithm runs at a rate of about 4 images per second, depending on the number of traffic light candidates. In experiments, we found recognition rates above 90%, with false positive rates below 2%.

4.4 Pedestrians

This subsection deals with work in progress on recognising the most vulnerable traffic participants, pedestrians. We can recognise pedestrians either by their shape (Subsection 4.4.1) or by their characteristic walking pattern (Subsection 4.4.2).

4.4.1 Towards Pedestrian Recognition from Shape

Fig. 4.9: Example of the recognition of a red traffic light. Additionally shown is the cropped and normalised ROI before (left) and after (right) pre-processing by means of a simulated Mahowald retina.

Fig. 4.10: Examples from the pedestrian database.

We are currently compiling a large set of pedestrian outlines to account for the wide variability of pedestrian shapes in the real world. Our first approach to pedestrian recognition by shape involves blurring the pedestrian outlines in the database (Fig. 4.10) and performing a principal component analysis on the resulting data set. This yields a compact representation of the original image data in terms of the eigenvectors. The first eigenvectors (and the mean) represent characteristic features of the pedestrian distribution; the last eigenvectors mostly represent noise (see Fig. 4.11). In order to find pedestrians in a test image, we apply a gradient operator to the image and normalise for image energy. The resulting normalised gradient image is then correlated with the first few eigenvectors of the pedestrian data set. Preliminary findings show that we need no more than the first 10 eigenvectors. A threshold on the correlation value determines whether a pedestrian is recognised or not; see Fig. 4.12 for a matching result. Besides the correct solution, we also obtain a false positive at the windows. The latter could be ruled out if the camera position with respect to the road were considered. For a different learning-based approach to pedestrian recognition, see [Pap98].
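A sketch of the two ingredients, PCA basis and correlation score; the exact correlation measure is not given in the paper, so the score shown (summed magnitudes of the projections onto the first eigenvectors) is an assumption:

```python
import numpy as np

def pca_basis(outlines, k=10):
    # outlines: (n_samples, n_pixels) matrix of blurred, vectorised
    # pedestrian outlines.  Returns the mean and the first k principal
    # components, obtained via an SVD of the centred data.
    mean = outlines.mean(axis=0)
    _, _, Vt = np.linalg.svd(outlines - mean, full_matrices=False)
    return mean, Vt[:k]

def pedestrian_score(window, mean, components):
    # Correlate an energy-normalised gradient window with the first
    # eigenvectors; a large summed response suggests a pedestrian.
    w = window.ravel() - mean
    w /= np.linalg.norm(w) + 1e-9
    return float(np.abs(components @ w).sum())
```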

4.4.2 Towards Pedestrian Recognition from Motion

The neural network used for traffic light recognition can be extended into the temporal dimension; the input then consists of greyscale image sequences. This results in a time delay neural network (TDNN) architecture with spatio-temporal receptive fields.

The receptive fields act as filters that extract spatio-temporal features from the input image sequence. Thus, features are not hard-coded but learned during a training process. Subsequent classification relies on the "filtered" image sequences rather than on the raw image data.
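The first TDNN layer thus amounts to convolving the image sequence with learned 3D kernels; a sketch in which the kernels stand in for the learned receptive fields:

```python
import numpy as np
from scipy.ndimage import convolve

def spatio_temporal_features(seq, kernels):
    # seq: (T, H, W) greyscale image sequence; kernels: list of learned
    # 3D receptive fields of shape (t, h, w).  Convolving each kernel
    # over space and time mimics the first layer of the TDNN.
    return [convolve(seq.astype(float), k, mode="constant") for k in kernels]
```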

The TDNN is currently applied to the recognition of pedestrian gait patterns. A detection step determines the approximate positions of possible pedestrians; we have two methods at our disposal:

• Colour clustering on monocular images in a combined colour/position feature space [Hei98]. A fast polynomial classifier selects the clusters possibly containing a pedestrian's legs by evaluating temporal changes of a shape-dependent cluster feature.

• 3D segmentation by stereo vision: the stereo algorithm described in Section 3 determines bounding boxes circumscribing obstacles in the scene.

Final verification on the candidate region sequence is then performed by the TDNN; compare Fig. 4.13. We are currently examining whether the TDNN approach can be used for segmentation-free object and motion detection and recognition.

Fig. 4.13: Example of a sequence of stereo images (only the left image of each pair is shown) on which a 3D segmentation has been performed in order to determine the bounding boxes shown. The TDNN recognised the corresponding image regions as the legs of a pedestrian. Left: example of a resulting cropped and scaled image sequence fed into the TDNN.

Fig. 4.11: Principal component analysis: eigenvectors 0 (mean), 1, 2, 25.

Fig. 4.12: Matching results: the pedestrian is correctly detected; a false alarm occurs in the window above.


5. Putting Things Together

In our well-known Prometheus demonstrator VITA II for autonomous driving on highways, each vision module was assigned to a subset of processors in a static configuration. The modules were running permanently, even though some results were of no interest for vehicle guidance at that moment.

To use this architecture for computer vision in complex inner-city traffic would require even more resources. But why look for new traffic signs or continuously determine the lane width while the car is stopped at a red traffic light? While following a leading vehicle, why not process only those tasks which are useful in that situation?

Although we were successful with the mentioned brute-force approach, such questions reveal some disadvantages:

• There is no concept for controlling the modules, e.g. to focus the resources on relevant tasks for specific situations.

• It is not scalable to a larger number of modules.

• There is no uniform concept for the interconnection and cooperation of modules.

• The development of new applications usually requires extensive reimplementations.

• Reuse of old modules can be difficult due to missing interfaces.

The growing complexity of autonomous systems hence requires an architecture which can cope with these problems. The pursued approach is to make an explicit distinction between the modules (obstacle detection, vehicle control, etc.) on the one hand and the connection of modules to perform a specific application (e.g. autonomous Stop&Go driving) on the other hand.

Each module is connected to the system by a module interface. This allows developers to concentrate on solving the computer vision problem instead of having to worry about how to connect their work to the application environment.

An application is formed by connecting together a group of these modules. The lane keeping application, for example, can be implemented by connecting the lane detection module and the lateral vehicle control module.

This flexible architecture allows the development of various applications without modifying the modules. Moreover, the system allows the connections to be changed at runtime. This enables the economical use of computational resources and the adaptation of the system to the current situation. It is possible, for example, to accomplish lane keeping on highways and switch dynamically to the Stop&Go application in the city. For a more detailed description of the system see [Goe98].
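A minimal sketch of this module/connection distinction; the actual interfaces of the system described in [Goe98] are not public in detail, so all names here are ours:

```python
class Module:
    # Uniform module interface: every vision or control module exposes
    # named inputs/outputs and a single processing step.
    def __init__(self, name):
        self.name, self.inputs, self.outputs = name, {}, {}

    def process(self):
        raise NotImplementedError

class Application:
    # An application = a set of modules plus the connections between
    # their ports; the connection list may be changed at runtime.
    def __init__(self, modules):
        self.modules = modules
        self.links = []                    # (src, out_port, dst, in_port)

    def connect(self, src, out_port, dst, in_port):
        self.links.append((src, out_port, dst, in_port))

    def step(self):
        for src, op, dst, ip in self.links:
            dst.inputs[ip] = src.outputs.get(op)
        for m in self.modules:
            m.process()
```

Switching from lane keeping to Stop&Go then amounts to replacing the link set rather than the modules themselves.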

Fig. 5.1: Visualisation of the objects seen by the system. The touch screen acts as the interface between developer and vision modules.


So far, most of the modules described in this paper have been integrated into the system. Currently, our demonstrator car UTA is able to recognise traffic lights, traffic signs, overtaking vehicles, direction arrows and pedestrian crossings, and to follow a leading vehicle (Stop&Go) autonomously. The system is running on three 200 MHz AIX/PowerPCs for the computer vision modules and a Windows/Pentium II for the visualisation of the results (see Fig. 5.1). The next steps are to include the missing modules and to adapt the system to multiple environments and applications:

• autonomous driving on highways

• speed limit assistant: the driver is warned when driving faster than allowed on the current road

• lane departure warning

• enhanced cruise control: the vehicle slows down when a car in front falls below the desired speed, and raises the speed again if there is no obstacle ahead.

6. The Road Ahead

The experience gained from tests on urban roads strengthens our confidence that computer vision will go downtown, and that vision-based driver assistance and autonomous driving will appear in the inner-city environment in the not so distant future.

In this paper we have focused on computer vision issues, but there are related issues which are also of great interest for this kind of application.

Angle of View: A serious problem in urban traffic is the necessary angle of view. On the one hand, we need a wide-angle lens to see traffic lights and signs when we are close to them. On the other hand, the precision of stereo-based distance measurement and the performance of object detection and recognition, in particular traffic sign recognition, grow with the focal length. We expect that these problems will be solved with high-resolution chips of 1000² or 2000² pixels rather than with rotatable cameras, vario lenses or multi-camera systems.

Camera Dynamics: A second sensor problem is the insufficient dynamic range of the common CCD. The new logarithmic CMOS chips (high dynamic range chips) promise a way out of this dilemma. These chips will help us on sunny days to see structures in shadowed areas, as well as at night-time, when bright lights glare into the camera and confuse the automatic camera control.

Application-Specific Hardware: Another important problem we have to solve in order to bring intelligent vision systems to the market is the price of the appropriate hardware. Analogue chips for early vision steps and modern FPGA technology are possible solutions for future mobile vision systems. We are currently investigating the possibility of using these programmable arrays to speed up edge detection and the sketched distance transform.

Besides these technical issues, there are important legal (i.e. liability) and acceptance issues which need to be resolved before vehicles with an Intelligent Stop&Go system can be driven downtown by our customers.


References

[Bro97] A. Broggi: "Obstacle and Lane Detection on the ARGO", IEEE Conf. on Intelligent Transportation Systems ITSC '97, Boston, 9-12 Nov. 1997.

[Dic86] E. D. Dickmanns, A. Zapp: "A curvature-based scheme for improving road vehicle guidance by computer vision", Proc. SPIE Conf. on Mobile Robots, Vol. 727, 1986, pp. 161-16.

[Enk97] W. Enkelmann: "Robust Obstacle Detection and Tracking by Motion Analysis", IEEE Conf. on Intelligent Transportation Systems ITSC '97, Boston, 9-12 Nov. 1997.

[Fra95] U. Franke, S. Mehring, A. Suissa, S. Hahn: "The Daimler-Benz Steering Assistant - a spin-off from autonomous driving", Intelligent Vehicles '94, Paris, October 1994, pp. 120-126.

[Fra96] U. Franke, I. Kutzbach: "Fast Stereo based Object Detection for Stop&Go Traffic", Intelligent Vehicles '96, Tokyo, pp. 339-344.

[Gav98] D. Gavrila: "Multi-feature Hierarchical Template Matching Using Distance Transforms", ICPR '98.

[Goe98] S. Görzig, U. Franke: "ANTS - Intelligent Vision in Urban Traffic", IEEE Conf. on Intelligent Transportation Systems, Stuttgart, October 1998.

[Hei98] B. Heisele, C. Wöhler: "Motion-Based Recognition of Pedestrians", ICPR '98.

[Jan93] R. Janssen, W. Ritter, F. Stein, S. Ott: "Hybrid Approach for Traffic Sign Recognition", Proc. Intelligent Vehicles Conference, 1993.

[Pae98] F. Paetzold, U. Franke: "Road Recognition in Urban Environment", IEEE Conf. on Intelligent Transportation Systems, Stuttgart, October 1998.

[Pap98] C. Papageorgiou, M. Oren, T. Poggio: "A General Framework for Object Detection", Int. Conf. on Computer Vision, 1998.

[Pom96] D. Pomerleau, T. Jochem: "Rapidly Adapting Machine Vision for Automated Vehicle Steering", IEEE Expert, Vol. 11, No. 2, pp. 19-27.

[San96] K. Saneyoshi: "3-D image recognition system by means of stereoscopy combined with ordinary image processing", Intelligent Vehicles '94, Paris, 24-26 Oct. 1994, pp. 13-18.

[Tho88] C. Thorpe, M. Hebert, T. Kanade, S. Shafer: "Vision and Navigation for the Carnegie-Mellon Navlab", IEEE Trans. PAMI, Vol. 10, No. 3, 1988.

[Ulm94] B. Ulmer: "VITA II - Active collision avoidance in real traffic", Intelligent Vehicles '94, Paris, 24-26 Oct. 1994, pp. 1-6.

[Web95] J. Weber et al.: "New results in stereo-based automatic vehicle guidance", Intelligent Vehicles '95, Detroit, 25-26 Sept. 1995, pp. 530-535.