Task-Dependent Scene Interpretation in Driver Assistance
PhD Thesis
Thomas Paul Michalke
Task-Dependent Scene Interpretation in Driver Assistance
Dissertation approved by Department 18,
Electrical Engineering and Information Technology,
of Technische Universität Darmstadt
in fulfillment of the requirements for the degree of Doktor-Ingenieur (Dr.-Ing.)
submitted by
Dipl.-Wirtsch.-Ing. Thomas Paul Michalke
born on 06.05.1979
in Gera
Referee: Prof. Dr.-Ing. Jürgen Adamy
Co-referee: Prof. Dr.-Ing. Edgar Körner
Date of submission: 09.04.2009
Date of oral examination: 29.06.2009
D17
Darmstadt 2009
Acknowledgements
The PhD project was carried out over three years, between February 2006 and January 2009,
while I was working as a PhD student at the Control Theory and Robotics Lab at Darmstadt
University and the Honda Research Institute Europe in Offenbach. In many ways I am
deeply indebted to numerous people working in these two facilities.
I want to thank my supervising professor, Prof. Dr.-Ing. Jürgen Adamy, head of the
Control Theory and Robotics Lab, for all his encouragement and belief in the success of
this work. I want to thank all my colleagues at the Control Theory and Robotics Lab for
their technical, professional, and administrative support. Especially, I want to thank my
colleague Robert Kastner, who shares and supports many of my beliefs, as our short but
fruitful cooperation and numerous inspiring discussions have shown. I also want to thank
Robert Kastner for proofreading this work.
I want to express my gratitude to all students who participated in projects related to
my PhD work. My thanks go, among others, to Shi Xuehui, Michael Herbert, Imran Bashir
Bhatti, Wang Zheng, Yan Jiajie, Pol Blasco Moreno, Andreas Schlensag, Marco-Antonio
Garcia-Ochoa, Conrad Klytta, Ming Zhao, Sun Hailin, Jochen Schmell, and Zhang Lyan.
My PhD project was supervised at the Honda Research Institute in Offenbach, where I
realized the major part of my PhD project. I am grateful to Prof. Dr.-Ing. Edgar Körner,
president of the Honda Research Institute, whose visions and ideas have also guided my
work in numerous ways. Besides major professional contributions, a key factor for my
successfully finished work was the extensive access to numerous costly facilities, including
the hardware to simulate and test my approaches in real-time on a prototype vehicle. In
particular, my thanks go to Dr.-Ing. Jannik Fritsch at Honda for the close supervision of
my work even in busy times and his experience-driven warnings of the numerous possible
pitfalls of a PhD thesis.
I want to thank all other people who followed and contributed to the project and remain
unnamed.
Finally, I owe my parents a debt of gratitude for all their support during my personal
and professional education.
I am most indebted to my wife Gabriele Michalke, for her unfailing understanding
and support during the long evenings I communicated too much with my computer, for
hearing my complaints, and for giving me encouragement, approval, acknowledgement, and her
love.
I dedicate these lines to my parents, who taught me that measurable facts always last
longer than fancy but hollow phrases alone.
Contents
Acknowledgements iii
List of Symbols vii
List of Abbreviations ix
Abstract x
Kurzzusammenfassung xi
1 Introduction 1
1.1 Motivation - Going beyond State-of-the-Art in Driver Assistance . . . . . . 1
1.2 Scope - Inspiration from Biology . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions to Community . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Feature Space 8
2.1 Static Attention Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Intensity Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Orientation Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 RGBY Color Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Depth Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Biological Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2 Depth from Stereo Disparity . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Depth from Object Knowledge . . . . . . . . . . . . . . . . . . . . . 33
2.2.4 Depth from Bird’s Eye View . . . . . . . . . . . . . . . . . . . . . . 35
2.2.5 Depth from Time to Contact . . . . . . . . . . . . . . . . . . . . . . 37
2.2.6 Depth from Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3 Motion Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.1 Differential Images . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.2 Detection of Dynamic Objects . . . . . . . . . . . . . . . . . . . . . 44
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 Task-dependent Tunable Visual Attention 51
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Real-World Challenges for Top-Down Attention Systems . . . . . . . . . . 54
3.3 Modeling Attention: From a Robustness Point of View . . . . . . . . . . . 55
3.4 Functional Comparison to other Top-Down Attention Models . . . . . . . . 61
3.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Road Detection in Unconstrained Environments 69
4.1 Adaptive Multi-Cue Fusion for Detecting Unmarked Roads in Inner-City . 69
4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1.2 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2 Temporal Integration for Feature-Based Road Detection Systems . . . . . . 86
4.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.2 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5 Integrated System Approaches for Scene Interpretation 101
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Advanced Driver Assistance on Highways . . . . . . . . . . . . . . . . . . . 103
5.2.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Advanced Driver Assistance in Inner-City . . . . . . . . . . . . . . . . . . 115
5.3.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6 Summary and Outlook 129
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2 Limitations and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A Annex 136
A.1 Gaussian Image Pyramid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
A.2 Kolmogorov-Smirnov Test of Goodness of Fit . . . . . . . . . . . . . . . . 137
A.3 World to Image Transformation . . . . . . . . . . . . . . . . . . . . . . . . 137
A.4 Time to Contact - Further Evaluation Results . . . . . . . . . . . . . . . . 140
A.5 High Attention-Feature Selectivity . . . . . . . . . . . . . . . . . . . . . . . 141
Bibliography 143
Curriculum Vitae 152
Publications 154
Index 157
List of Symbols
c0 Velocity of propagation (speed of light)
D(u, v) Disparity
∆f Doppler frequency shift
∆φ Orientation selectivity of the Gabor filter bank
DRate Average detection rate
εfinal Threshold for computation of final road map
k0 Two dimensional wave number vector, which defines the direction of
selectivity of the Gabor kernel
f Focal length (in [m])
f0 Carrier frequency
F0 Theoretic cumulative frequency
fu Focal length normalized to the pixel width (in [pixels])
fv Focal length normalized to the pixel height (in [pixels])
frate Frame rate
fcenter Normalized center frequency of the Difference of Gaussian kernel
Fe Cumulative frequency of a sample
Fi,k Feature map of sub-feature i and feature modality k
Gu 2D Gaussian derivative in horizontal direction
Gv 2D Gaussian derivative in vertical direction
Hworld Height of an object in the world (in [m])
Him Height of an object in the image (in [pixels])
Hit Average FoA hit number
I Image
Igray Gray scale image
λ Weight for the linear combination of bottom-up and top-down saliency
ψ Aperture angle an object has when projected on the image plane
p(xi) Probability distribution of feature xi
r Parameter that defines the overlap of two adjacent Gabor kernels
SBU Bottom-up saliency map
STD Top-down saliency map
Stotal Overall saliency map
σi Standard deviation of internal Gaussian function of DoG
σe Standard deviation of external Gaussian function of DoG
t1, t2, t3 Translational camera offsets (position of the camera relative to the world
coordinate system)
θX , θY ,θZ Pitch angle, yaw angle, roll angle of the camera
tu Horizontal size of an image pixel (in [m])
tv Vertical size of an image pixel (in [m])
ttof Time of flight of electromagnetic wave
θ̇Y Yaw rate
U Width of the image
V Height of the image
v Vertical pixel position (of undistorted image)
u Horizontal pixel position (of undistorted image)
u0 Horizontal position of principal camera point (approximately the hori-
zontal position of the middle of the image)
v0 Vertical position of principal camera point (approximately the vertical
position of the middle of the image)
vd Vertical pixel position of distorted image
ud Horizontal pixel position of distorted image
wTDi Top-down attention weight of sub-feature map i
wsparsei Popout weight used as sparseness operator of sub-feature map i
Wworld Width of an object in the world (in [m])
Wim Width of an object in the image (in [pixels])
List of Abbreviations
ACC Adaptive Cruise Control
ADAS Advanced Driver Assistance System
AKTIV Adaptive und Kooperative Technologien für den Intelligenten Verkehr
(Adaptive Cooperative Technologies for Intelligent Traffic)
APIA Active Passive Integration Approach
BRM Binary Road Map
BU Bottom-Up (data-driven processes)
CAPS Combined Active and Passive Safety
CCD Charge-Coupled Device
CIE Commission Internationale de l'Éclairage
DARPA Defense Advanced Research Projects Agency
DFT Discrete Fourier Transform
DoG Difference of Gaussians
eBRM Extended Binary Road Map
EKF Extended Kalman Filter
FoA Focus of Attention
fRPM Final Road Probability Map
GPS Global Positioning System
HMI Human Machine Interface
IPP Intel Performance Primitives
LTM Long Term Memory
MREO Mean Relative Error in Offset
MRER Mean Relative Error in Radius
NCC Normalized Cross Correlation
PReVENT Preventive and Active Safety Applications
ROC Receiver Operator Characteristic
RoI Region of Interest
RPM Road Probability Map
RPROP Resilient backPROPagation
RTBOS Real-Time Brain-like Operation System
SNR Signal to Noise Ratio
STM Short Term Memory
TD Top-Down (knowledge-driven processes)
TTC Time to Contact
TTL Time to Live
Abstract
Increasingly complex driver assistance functionalities are developed and combined in to-
day's vehicles. Typically, these functionalities run as independent modules, each bringing
its own sensors, processing devices, and actuators. In general, no information fusion, i.e.,
cross talk between modules, takes place. However, information fusion of the available
sensors and processing modules could lead to a new quality of driver assistance functionalities
in terms of performance and robustness. Furthermore, typical driver assistance functionalities
on the market are based on highly specialized and optimized algorithms that show sound
performance only for a restricted number of clearly defined use cases. Also, the combination
of several of these rigid systems as a means to reach the long-term goal of autonomous
driving will not lead to robust system performance, taking the immense variety of traffic
situations into account.
In contrast, in the work presented here a flexible, biologically inspired driver as-
sistance system is developed that adapts its modules and the data exchange between
modules online, depending on the task. More specifically, the morphology of the brain as
well as brain-like signal processing principles are mimicked in order to increase the robustness
and flexibility of the system. The development process aimed at reaching a generic system
structure that supports several system tasks (e.g., detect fast objects, redetect once tracked
and later lost objects, predict object trajectories, or find cars on the road). In order to
include information about the scene context in the system, a robust unmarked road detection
module as well as an approach for the temporal integration of road segments is developed.
The realized driver assistance system is tested online and in real-time on a prototype car.
In one of the presented online test scenarios, a stationary car is detected in a highway con-
struction site based on cameras as the main sensor. In order to allow the system to interact
with its environment, a three-phase danger-handling scheme is integrated into the system.
Following an acoustic warning, the belt pretensioner is activated, after which the vehicle
brakes autonomously, preventing a crash. The gathered results prove the applicability of
the developed biologically inspired driver assistance system in real-world scenarios.
Extensive system evaluation shows that different system properties are in close compli-
ance with measurements gathered in psychophysical studies on humans. Based on these
results, it can be stated that the realized advanced driver assistance system closely models
important human information processing principles, allowing the usage of the system as
an attentional co-pilot for human drivers.
Kurzzusammenfassung
Increasingly complex driver assistance functionalities are installed and combined in modern
vehicles. Typically, these functionalities operate as independent modules, each with its own
separate sensors, computing hardware, and actuators. In general, no information fusion
(i.e., exchange and combination of data) takes place between the modules. Yet an
information fusion of the different sensors and computing hardware would lead to a new
quality of driver assistance functions, since performance and robustness could be improved.
Furthermore, typical commercially available driver assistance functions are based on highly
specialized and optimized algorithms that can operate reliably only in clearly defined cases.
Considering the variety of possible traffic situations, the strategic goal of autonomous
driving will not become attainable through the mere combination of a large number of
these rigid functionalities.
In contrast, this work develops a flexible, biologically inspired driver assistance system
whose modules and inter-module connections can be adapted at runtime depending on the
task. More precisely, the structure and known information processing principles of the
human brain are mimicked in order to achieve greater robustness and flexibility of the
system. The system development process aimed at a generic system structure that, unlike
known systems, supports a large number of tasks without being built and optimized for
individual tasks (e.g., detection of fast-moving objects, search for previously found and
later lost objects, prediction of object trajectories, finding vehicles on the road). In order
to provide the system with context information about the scene, a detection system for
unmarked roads as well as an approach for the temporal integration of road segments was
developed.
The driver assistance system was verified in real-time on a test vehicle. In one of the
presented test scenarios, a stationary vehicle in a highway construction site is detected
based on camera data alone. To enable an interaction of the system with its environment,
a three-phase danger-handling scheme is executed. After an acoustic warning, the belt
pretensioners are activated, and the vehicle is subsequently braked autonomously in order
to prevent a collision. The obtained results demonstrate the applicability of the developed
biologically motivated driver assistance system in real-world applications.
Extensive evaluation revealed system properties that have also been observed in
psychophysical studies on humans. Based on these results, it can be stated that the
realized driver assistance system closely models important human signal processing
principles, which enables the use of the system as an attention-based co-driver for human
drivers.
1 Introduction
Mobility is a central issue in modern economies. The need for individual and flexible
transportation systems has made the car one of the most influential products of our
time. Today's customers expect a high degree of comfort and safety in vehicles, which is
underscored by the increasing share of electronic equipment in automobiles.
Besides comfort functions, such as multimedia equipment, driver assistance functionalities
of various kinds come with today's vehicles. Such systems are designed to diminish
the effects of frequent types of road accidents (e.g., blind spot warning systems prevent
highway accidents caused by carelessness during passing maneuvers).
In the following Section, independent driver assistance functionalities that are available
on the market and presented in the literature are described. As will become apparent, the
combination of several of these individual functionalities poses major challenges. In order to
solve these challenges, integrated driver assistance systems are needed. However, regarding
the long-term goal of autonomous driving, existing integrated concepts lack the necessary
flexibility to cope with the high scene complexity and variety of scenarios present in the
traffic domain. To address this challenge, a new biologically motivated system
design is proposed. A list of the novel contributions presented in this doctoral thesis as
well as an overview of the remaining Chapters closes this introduction.
1.1 Motivation - Going beyond State-of-the-Art in Driver
Assistance
Numerous highly specialized and robust driver assistance functionalities exist on the market
and are presented in various publications. For example, many automotive suppliers have
implemented lane marking detection systems that, e.g., warn drivers in case they leave
the lane unintentionally (so-called Lane Keeping Assistant). Recently available
Stop&Go Adaptive Cruise Control (ACC) systems allow following a preceding car at an
appropriate distance even in case of a traffic jam. More complex functionalities still in
prototype status exist, concerning, among others, pedestrian detection in inner-city traffic
and detection of the free drivable area in front of the car. By now, a large number of driver
assistance functionalities can be found in upper class vehicles. Furthermore, their number
is increasing in correspondence with the growth of overall electronic equipment in vehicles
(see Fig. 1.1a). All commercially available driver assistance approaches have in common
that they solve very restricted tasks in clearly defined scenarios using highly specialized
algorithms. However, these functionalities typically run completely independently of each
other, without sharing information or sensor data. Each functionality brings its own sensors
and actuators. When extrapolating this development, problems will arise when the Human
Machine Interfaces (HMIs) and actuators of many independent assistance functionalities
interfere in highly complex scenarios.
Figure 1.1: (a) Global market of electrics and electronics in vehicles and vehicle parts in billion
Euro and annual growth [Hertel, 2007], (b) Autonomous vehicles in action during the DARPA
Urban Challenge 2007.
For example, a slow vehicle on the highway will provoke a warning of a collision avoidance
system. The surprised driver changes the lane abruptly. The Lane Keeping Assistant warns
the driver when crossing the central lane marking without activating the direction indicator.
All this might confuse the driver in this critical situation. Car manufacturers try to solve
this dilemma by using different human sensory interaction channels for the available HMIs,
but manage to do so only incompletely. Future, more complex functionalities will not
permit such simple solutions at all.
Based on this example, the need for an integrated driver assistance system becomes
apparent. Such systems integrate independent assistance functionalities coherently, which
allows the realization of conflict-free HMIs and actuator control procedures, thereby reduc-
ing system costs. Furthermore, several integrated functionalities can share the same
sensors and actuators. However, the most important advantage of an integrated system
is that a fusion of the input of all available sensors and functionalities is possible. Such
information fusion can result in an advanced driver assistance system (ADAS) allowing
complex assistance functionalities. For example, the results of a road detector and ob-
stacle detector can be fused to build up an internal representation of the environment, in
which objects can be tracked and their trajectories predicted in order to avoid collisions
and reduce the number of false-positive collision warnings.
Such integrated systems are very rare on the market and in the literature, which mostly
deals with independent driver assistance functionalities and their optimization. In the
following, some of the few existing approaches are introduced briefly in order to set
them apart from the system presented here in terms of their goals and performance.
For example, the prototype vehicles presented during the Urban Challenge [WWW,
2007a] of the Defense Advanced Research Projects Agency (DARPA) were able to perform
several driving tasks autonomously in a simplified inner-city environment (see Fig. 1.1b).
However, only a restricted number of cars as traffic participants and no pedestrians or
bicycles were present. Moreover, previously provided detailed annotated maps and the
usage of the Global Positioning System (GPS) reduced the problem complexity even fur-
ther and made it possible to solve the driving tasks without camera data. Several roughly related
European projects exist, but these concentrate on restricted issues of an integrated driver
assistance system only. For example, Safespot [WWW, 2007c] aims at preventing road acci-
dents based on cooperative systems, as e.g., vehicle-to-vehicle and vehicle-to-infrastructure
communication. Furthermore, the project “Preventive and Active Safety Applications”
(PReVENT) [WWW, 2006], which was initiated by a cooperation of the European automotive
industry, concentrates on fusing well-known driver assistance functionalities already on the
market. Summarizing, the named projects focus on improving existing driver assistance
functionalities (e.g., enhanced digital maps, lane keeping in challenging environmental con-
ditions) without reaching a higher level of functional integration.
In the project AKTIV (Adaptive Cooperative Technologies for Intelligent Traffic), whose
main sub-projects are financed by the German Federal Ministry of Economy, prototypes for
collision avoidance systems and assisting functions for intersections using stereo cameras
and vehicle-to-vehicle communication are developed and tested. The tested systems are
able to detect red traffic lights and the right-of-way at intersections. Although the results
gathered in the performed online tests look promising, the system can handle only a limited
number of well-defined scenarios (e.g., a pedestrian crossing the road, intersection equipped
with traffic lights). The system is based on specifically designed, dedicated modules that
are built and optimized only for these specific scenarios.
Based on the so-called 6D-approach [Badino et al., 2008], Daimler Research has devel-
oped a more innovative advanced driver assistance functionality that detects and evaluates
the free space in front of the car. The functionality relies on stereo cameras and integrates
the optical flow and depth measurements over time. The gathered information is visual-
ized at the car's dashboard and fused with a prototypical intersection assistant. However,
the question of how to link the free space detection module into a complex driver assistance
system is not addressed, which leaves the potential of free space information in the system
context unused.
Regarding complete architectures for intelligent vehicles in literature, [Franke et al.,
2001] and [Broggi et al., 2001] have presented approaches that focus mainly on the de-
sign of a framework that combines several reactive systems. The presented systems show
impressive results in specific scenarios and offer a good scalability in terms of computational
aspects, but the challenge of functional integration and interaction is not solved.
There are some integrated driver assistance systems already on the market that aim at
the fusion of basic sensors and actuators available in modern cars. For example, in the
project APIA (Active Passive Integration Approach) the automotive component supplier
Continental has developed a system that integrates active and passive driver assistance
functionalities (e.g., belt pretensioner, airbag, brakes, and sensing systems) in order to
decrease the braking distance and the severity of vehicle accidents. Although the general
approach is promising, the integration can so far only improve existing driver assistance
functionalities.
The automotive component supplier Bosch follows a related direction of thought. The
initiative “Vehicle Motion Management” aims at integrating and linking all available sen-
sors, actuators as well as safety and comfort functionalities in order to support and inform
the driver situation-dependently. However, products available so far that partly realize this
large-scale integration concern only the improvement of vehicle dynamics. With CAPS
(Combined Active and Passive Safety) Bosch has introduced a system that functionally
integrates the actuators of safety-related vehicle functionalities.
In addition to the lack of solutions for large-scale integration, all these approaches have
in common that they are restricted in terms of the supported scenarios and thereby do not
show the required robustness in highly complex real-world scenarios. This lack of flexibility
results from the typically very rigid structure of such systems, caused by
a design process that is focused mainly on the optimal fulfillment of individual, clearly
restricted tasks.
The doctoral thesis presented here aims to overcome these restrictions. The central idea
of this work is to solve the complexity challenge of the environment on the system level
by designing a generic system that mimics the human brain. More specifically, a driver
assistance system is developed that gets inspiration from known signal processing principles
and the structural organization of the brain. For example, instead of collecting all existing
environmental information followed by a late selection of relevant data, as is done in
most technical systems, the biologically inspired approach of early information selection
is realized. Predominantly, environmental data that is compatible with the current system
expectation will reach higher processing levels, which reduces the problem complexity. The
described complex selection and suppression principle is named attention and is one of the
key aspects of the doctoral thesis at hand.
1.2 Scope - Inspiration from Biology
As opposed to the existing classical driver assistance functionalities presented before, the
system developed here takes the human as a role model. This is done on the micro-level
by getting inspiration from human signal processing principles. More specifically, on the
micro-level the type and parameterization of the supported visual features is derived from
biology (e.g., edge filters are drawn from the form of receptive fields of neurons, color pro-
cessing principles are inspired by the processes on the retina). Please refer to Chapter 2
for details. Also, on the macro-level the system gets inspiration from the human brain,
since the organization and combination of signal flows in the brain is mimicked. For exam-
ple, a brain-like separation between the processing pathways for the detection of motion
respectively position of objects (“motion pathway”) and for the classification of objects
(“form pathway”) allows the generic task-dependent adaptation of signal processing (see
Sect. 5.2). Furthermore, design principles inspired by biology are used to increase the
robustness of system modules. For example, the attention sub-system in Chapter 3 or the
unmarked road detection sub-system in Chapter 4 are adapted dependent on the environ-
ment, meaning that all essential system parameters are computed dynamically based on
the characteristics of the current scene, which assures a robust system performance even
in challenging lighting or weather conditions. Based on these exemplary brain-
like principles, a biologically motivated driver assistance system is developed and tested
online. The presented system explicitly draws on biological motivation in cases where
classical engineering-based approaches fail or cannot do better.
The presented thesis aims at developing an integrated driver assistance system that
fuses the data of various sensors (e.g., laser, internal sensors, dense camera-based stereo)
and higher level modules (e.g., unmarked road detection module, short term memory with
objects detected in the environment). The gathered information is used to actively improve
internal processes and modules in the vision sub-system as well as to control actuators (e.g.,
the belt pretensioner, acoustic warnings, or brakes).
Based on this large-scale information fusion, a new quality of driver assistance function-
alities is reached and tested online on a prototype vehicle. For example, the system is able
to detect a stationary obstacle and brake autonomously, where current driver assistance
functionalities for radar-based collision mitigation fail (see Sect. 5.2). The system is fur-
thermore able to actively generate and test predictions concerning the behavior of other
objects in the environment. For example, the driver assistance system actively searches
for cars when car-like openings in the detected road segment are found (please refer to
Sect. 5.3).
The central assumption of this doctoral thesis, developed in the following Chapters, is
that only a generic system will be able to cope with the high number of possible scenarios
in the traffic domain. So, instead of designing and fusing specific methods that solve
restricted, clearly specified tasks, the system proposed here is organized in a way that
allows generic processing in terms of the supported tasks. More specifically, the existing
system modules can be modified online by adapting parameters and links between the
modules dependent on the current system task. The central system component that is
based on this assumption is the attention sub-system (described in detail in Chapter 3),
which is used as generic front-end for all visual processing.
1.3 Contributions to Community
In the following, the novelties presented in this doctoral thesis are summarized briefly.
Starting with novelties on the level of modules and functionalities, the following approaches
were developed:

• Biologically motivated filter kernels described in the literature are extended in a computationally efficient way, adding novel visual features to the system,
• A novel biologically motivated suppression of the horizon edge is proposed,
• A well-known lane marking detection approach is extended by a biologically motivated preprocessing step that improves the detection performance,
• A well-known approach for the detection of moving objects is extended by including accumulated top-down knowledge of the environment,
• An innovative mono-camera-based depth cue is formalized to be suitable for the vehicle domain.

As described before, the thesis at hand stresses the role of system design and large-scale
information fusion. On the system level, the following novelties were realized:

• A robust human-like attention system running in real-time is developed that is based on five novel principles solving typical attention-related challenges,
• An attention-based vision system is applied online in real-world scenarios of the vehicle domain,
• An adaptive unmarked road detection system is proposed that relies on four novel principles,
• A generic, computationally efficient temporal integration approach for improving existing unmarked road detection systems is developed,
• A driver assistance system on a prototype vehicle is realized that allows autonomous emergency braking on highways based on vision as the major cue,
• A biologically motivated driver assistance system is realized that integrates environmental context information in order to facilitate safe processing in inner-city scenarios.
1.4 Overview
The thesis is structured as follows: In Chapter 2 the implemented biologically motivated
feature space is described. Proposed extensions to state-of-the-art approaches in liter-
ature are evaluated qualitatively and quantitatively based on known stand-alone driver
assistance functionalities (e.g., marked lane detection). Chapter 3 elaborates on the pro-
posed human-like attention approach, which integrates the previously described features
allowing generic task-dependent scene decomposition. The comprehensive description of
the realized attention sub-system centers on concepts that improve the system robustness
allowing its application in real-world traffic scenes. Chapter 4 focuses on visual-feature-
based approaches that allow a robust detection of unmarked roads and also introduces
the concept of temporal integration for improving the road detection performance in com-
plex scenes such as inner-city traffic. Finally, the central Chapter 5 proposes a biologically motivated
driver assistance system, which relies on the human-like attention system as generic visual
front-end of all task-related processing. The integration of the unmarked road detection
system introduces environmental context into the driver assistance system. Online results
gathered on a prototype car that brakes autonomously in a complex highway construction
site and the evaluation of internal system representations in an inner-city scene allow the
assessment of the proposed system. Chapter 6 summarizes the PhD thesis, giving a com-
prehensive overview of the contributions to the community. Furthermore, the limitations
of the system and an outlook to future extensions are given.
2 Feature Space
The following Chapter describes the developed biologically inspired feature space our
ADAS architecture relies on for fulfilling its generic vision tasks. Typical features
for biologically motivated vision systems are intensity, orientation, and color (see, e.g.,
[Frintrop et al., 2005, Itti et al., 1998]). These features are often preferred because they
are so-called basic features: a feature is a basic feature if, among other things, it allows
an efficient visual search. The efficiency of visual search tasks is assessed by psy-
chophysical studies that determine the reaction time of subjects to visual impulses (see
[Treisman, 1993] and [Wolfe and Horowitz, 2004] for a summary). More specifically, a ba-
sic feature allows an efficient parallel search, i.e., in a search task with a growing number
of distractors the mean search time is constant or only slightly increasing (see Fig. 2.1). A
basic feature allows a clear differentiation against distractors that do not possess this
feature (i.e., one differentiating feature exists in these so-called feature search tasks).
In addition to the three named features, recent biologically motivated systems also incor-
porate depth and motion (see, e.g., [Aziz and Mertsching, 2008]). Both features are marked
as basic features by most researchers (see [Wolfe and Horowitz, 2004] for an overview). An
important property of these basic features, and a reason for their efficiency in visual search
tasks is that they draw or guide attention. The attention principle plays an important role
for the here developed ADAS since it will allow solving specific vision tasks in a generic
fashion. It will be described in detail in Chapter 3. In a nutshell, the attention features
introduced in the following are combined to form a saliency map that is the key aspect and
major output of a human-like attention system. This saliency map shows high activation
at image regions that contain a high level of information in terms of a specific vision task
(top-down driven activation) or because the image regions differ strongly from the rest of
the image, meaning that a high local entropy is present (bottom-up driven activation).
In the following Section, three well-known biologically motivated visual features together
with some important conceptual extensions are described. These features are static, i.e.,
they depend on the current image only (as opposed to dynamic features that also depend
on previous images). These static features will be Difference of Gaussians (DoG) filters
for detecting intensity changes, Gabor filters for detecting oriented structures, and RGBY-
colors. In the second and third Sections of this Chapter, higher-level features such as
different biologically inspired depth sources as well as different motion features are described.
2.1 Static Attention Features
In the following, specific filter types, and thereby an extensively used image processing
method, will be motivated by showing their resemblance to the processing in the human
brain.

Figure 2.1: Efficient and inefficient visual search tasks: (a) Efficient search: Orientation is a
basic feature, (b) Inefficient search: Numeric character differentiation is not a basic feature.
This resemblance is based on the fact that receptive fields of neurons (i.e., their measur-
able transfer functions, see [Flores-Herr, 2001]) are equivalent to filter kernels. Therefore,
the signal processing principles in neurons can be described in computational image pro-
cessing by a convolution, as stated in [von Seelen, 1970]. Furthermore, in the first stage
of human visual processing the image is sampled with the resolution of the photorecep-
tors on the retina. All this makes computational image filtering based on convolution
with biologically inspired filter kernels a close approximation of human visual signal
processing. Figure 2.2 visualizes the neuronal signal processing that equals the convolution:
$$I_{\mathrm{filt}}(u, v) = g(u, v) * I(u, v) = \sum_{u'=0}^{U-1} \sum_{v'=0}^{V-1} g(u', v')\, I(u - u', v - v'). \quad (2.1)$$
Figure 2.2: Simple static neuron model with receptive field and synaptic weights g(u, v) that
are equivalent to a symmetric filter kernel used in image convolution.
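To make this correspondence concrete, the following minimal Python sketch (an illustration under assumed values, not the thesis implementation) applies Equ. (2.1): the kernel g plays the role of the synaptic weights of the neuron model in Fig. 2.2, and the 3x3 center-surround weights are an assumption.

```python
import numpy as np
from scipy.ndimage import convolve

def filter_image(image: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Convolve an intensity image with a receptive-field kernel g,
    mirroring the neuronal weighting of Equ. (2.1)."""
    return convolve(image.astype(float), g, mode="nearest")

# Assumed 3x3 center-surround weights in the spirit of Fig. 2.2:
# excitatory center, inhibitory surround, zero mean.
g = np.array([[-1.0, -1.0, -1.0],
              [-1.0,  8.0, -1.0],
              [-1.0, -1.0, -1.0]]) / 8.0

image = np.random.rand(64, 64)     # stand-in for the sampled retinal image
response = filter_image(image, g)  # one filter response value per pixel
```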
In the following two subsections, the technical equivalents of two basic measured recep-
tive field types are described (DoG and Gabor). After that, with RGBY, a color space is
described that mirrors the processing on the retina.
2.1.1 Intensity Feature
In the following, the Difference of Gaussians feature is biologically motivated and a pa-
rameterization is derived that allows the implementation of a filter bank for a sparse signal
decomposition.
Biological Motivation
In vitro receptive field measurements of ganglion cells in the retina of macaque mon-
keys have shown a characteristic center-surround behavior [Flores-Herr, 2001]. Also mea-
surements of receptive fields of neurons located in early regions of the visual pathway of
macaque monkeys have shown a similar characteristic [Trapp, 1998]. In other words, the
receptive fields are selective to monotonous regions (blobs), which differ from the back-
ground in terms of their intensity. An example for such a contrast is shown in the lower
left image in Fig. 2.4b. Furthermore, theories and supporting measurements exist, which
allow to interpret specific brain regions in the human visual pathway as filter banks that
decompose an input image in terms of the existing frequencies [Mallot, 2002]. The realized
attention system extends this principle to all static and dynamic features.
In the following, the measured center-surround behavior is modeled using a filter kernel
of two 2D Gaussian functions that are subtracted (Difference of Gaussians). A parameter-
ization of these Gaussian functions will be provided that will allow the implementation of
a low-loss filter bank for said filter kernel.
The Difference of Gaussians (DoG) filter is selective to homogeneous regions (blobs) of
different sizes (see Fig. 2.4). The filter kernel is not orientation selective (i.e., it is isotropic).
In its basic form, the center of the DoG filter kernel is excitatory and the lateral region is
inhibitory. The discrete DoG filter kernel results from sampling a Gaussian curve with a
small variance $\sigma_i^2$, from which a Gaussian curve with a bigger variance $\sigma_e^2$ is subtracted
(see also Fig. 2.3):
$$\mathrm{DoG}(u, v) = \frac{1}{2\pi\sigma_i^2}\, e^{-\frac{u^2+v^2}{2\sigma_i^2}} - \frac{1}{2\pi\sigma_e^2}\, e^{-\frac{u^2+v^2}{2\sigma_e^2}}. \quad (2.2)$$
On a more qualitative level, the DoG subtracts the mean weighted intensity of a smaller
center region from the mean weighted intensity of a bigger surround for each image pixel.
As will become apparent in Chapter 3, in order to yield high hit rates in top-down related
search (i.e., when searching for a specific object using the saliency map), the features of
an attention system need high selectivity to provide as many supporting and inhibiting
(i.e., suppressing) maps as possible. At the same time, high efficiency is needed due
to constraints in computational resources. An approach fulfilling these demands is the
separation of the DoG filter into on-center (called on-off in the following) and off-center
selectivity (off-on), as emphasized in [Frintrop, 2006] (see Fig. 2.5a and Fig. 2.5b).
Figure 2.3: One-dimensional on-off DoG (black) and the two Gaussian functions it is composed
of (positive Gaussian function with small standard deviation in blue, negative Gaussian function
in red).
Thereby, the attention system can differentiate between bright blobs on a dark background
and dark blobs on a bright background. To realize such an on-off/off-on separation the
DoG filter response is separated into its positive and negative part, which is equivalent
to the computationally more demanding independent filtering with the two different filter
kernels depicted in the lower left corner of Fig. 2.5a and Fig. 2.5b respectively.
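The following sketch illustrates this separation (kernel size, sigma values, and the test image are assumptions for illustration): the DoG kernel of Equ. (2.2) is applied once, and the response is then rectified into the two channels.

```python
import numpy as np
from scipy.ndimage import convolve

def dog_kernel(sigma_i: float, sigma_e: float, size: int = 15) -> np.ndarray:
    """Sample the DoG of Equ. (2.2): narrow excitatory center (sigma_i)
    minus wide inhibitory surround (sigma_e)."""
    n = np.arange(size) - size // 2
    u, v = np.meshgrid(n, n)
    rr = u**2 + v**2
    center = np.exp(-rr / (2 * sigma_i**2)) / (2 * np.pi * sigma_i**2)
    surround = np.exp(-rr / (2 * sigma_e**2)) / (2 * np.pi * sigma_e**2)
    return center - surround

def on_off_split(image: np.ndarray, kernel: np.ndarray):
    """One filter operation, then rectification: the positive part is the
    on-off channel (bright blob on dark ground), the negative part the
    off-on channel (dark blob on bright ground)."""
    resp = convolve(image.astype(float), kernel, mode="nearest")
    return np.maximum(resp, 0.0), np.maximum(-resp, 0.0)

# Illustrative sigma values; see the parameterization derived below
on_off, off_on = on_off_split(np.random.rand(128, 128), dog_kernel(0.7, 1.12))
```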
As described at the beginning of this Section, the attention features are computed on
different scales allowing a decomposition of the signal into overlapping frequency chunks.
In the following, the appropriate parameterization of the DoG kernel is derived, which
makes such a decomposition possible.
Parameterization of the DoG Kernel
The parameters σi and σe in Equ. (2.2) determine the frequency characteristic of the filter.
As described in Annex A.1, for efficient filtering a Gaussian pyramid approach is used
that scales the input image while the filter kernel is not changed in size. In order to assure an
accurate filtering procedure, the normalized central frequency fcenter of the band-pass-type
DoG filter needs to be 0.25, which equals a period length of 4 pixels and hence a blob of
2x2 pixels or a line of 2 pixels of any orientation. If fcenter = 0.25 it can be assured that a
low-loss pyramid-based image filtering can be done.
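A minimal sketch of this pyramid idea (the smoothing sigma and number of levels are assumptions; Annex A.1 describes the actual scheme): the image is repeatedly smoothed and subsampled by a factor of two, so that the unchanged filter kernel becomes selective to structures of doubling size.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image: np.ndarray, levels: int = 5) -> list:
    """Build a Gaussian pyramid: anti-alias smoothing, then subsampling
    by 2 per level; the same fixed-size kernel is applied on every level."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma=1.0)  # sigma assumed
        pyramid.append(smoothed[::2, ::2])                  # subsample by 2
    return pyramid
```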
In order to derive said DoG parameterization, the frequency domain representation of the
DoG kernel and its dependencies are considered. To decompose the problem, it is used
that the 2D DFT can be expressed as a combination of two one-dimensional Discrete
Fourier Transforms (DFT) [Proakis and Manolakis, 2006]:

$$\mathrm{DFT}_{u,v}\left(f(u,v)\right) = \mathrm{DFT}_u\left(\mathrm{DFT}_v\left(f(u,v)\right)\right). \quad (2.3)$$
Figure 2.4: (a) Test image containing all frequencies in all orientations, (b)-(f) Different levels
of the DoG filter bank with the filter response for the test image on top, the filter kernel in the
image domain at bottom left, and the filter kernel in the frequency domain at bottom right.
(a) DoG on-off, (b) DoG off-on, (c) 0° even Gabor on-off, (d) 0° even Gabor off-on, (e) 0° odd Gabor on-off, (f) 0° odd Gabor off-on
Figure 2.5: Application of filter kernels on simple test images (negative filter response is cut
off). Both the 2 DoG features (a), (b) and the 4 Gabor features (c)-(f) are realized with one
filter operation each. Every picture shows on the left the used input test image and on the
right the respective filter response for the filter kernel in the bottom left corner.
According to Equ. (2.3), the computation of the 2D DFT is equivalent to applying the
DFT to the two image dimensions independently. Therefore, also for the 2D case, the
following transformation rule can be applied, which gives the transform of a 1D Gaussian
to the frequency domain:

$$\mathrm{DFT}\!\left( \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{n^2}{2\sigma^2}} \right) = e^{-\frac{\sigma^2 \omega^2}{2}}. \quad (2.4)$$
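The separability stated in Equ. (2.3) is easy to check numerically; the following sketch (an illustration using numpy's FFT as the DFT) confirms it on a random image before the rule is applied to the DoG below.

```python
import numpy as np

image = np.random.rand(32, 32)

# Equ. (2.3): the 2D DFT equals a 1D DFT over rows followed by one over columns.
full_2d = np.fft.fft2(image)
row_then_col = np.fft.fft(np.fft.fft(image, axis=1), axis=0)

assert np.allclose(full_2d, row_then_col)
```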
Based on Equ. (2.3) and (2.4), the 2D DFT of the DoG is given in Equ. (2.5):

$$\begin{aligned}
\mathrm{DFT}_{u,v}(\mathrm{DoG})
&= \mathrm{DFT}_{u,v}\!\left( \frac{1}{2\pi\sigma_i^2}\, e^{-\frac{u^2+v^2}{2\sigma_i^2}} - \frac{1}{2\pi\sigma_e^2}\, e^{-\frac{u^2+v^2}{2\sigma_e^2}} \right) \\
&= \mathrm{DFT}_u\!\left( \mathrm{DFT}_v\!\left( \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{v^2}{2\sigma_i^2}} \right) \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{u^2}{2\sigma_i^2}} \right)
 - \mathrm{DFT}_u\!\left( \mathrm{DFT}_v\!\left( \frac{1}{\sqrt{2\pi}\,\sigma_e}\, e^{-\frac{v^2}{2\sigma_e^2}} \right) \frac{1}{\sqrt{2\pi}\,\sigma_e}\, e^{-\frac{u^2}{2\sigma_e^2}} \right) \\
&= e^{-\frac{(\omega_u^2+\omega_v^2)\,\sigma_i^2}{2}} - e^{-\frac{(\omega_u^2+\omega_v^2)\,\sigma_e^2}{2}} \\
&= e^{-2\pi^2\left(f_u^2+f_v^2\right)\sigma_i^2} - e^{-2\pi^2\left(f_u^2+f_v^2\right)\sigma_e^2}. \qquad (2.5)
\end{aligned}$$
The DoG has a band-pass frequency characteristic (i.e., the constant component of the
resulting filter response, $\mathrm{DFT}_{u,v}(\mathrm{DoG})(\omega_u = 0, \omega_v = 0)$, is equal to zero). The main
parameter of a band-pass filter is the center frequency $f_{\mathrm{center}}$. The value $f_{\mathrm{center}}$ is the
frequency where the transfer function of the filter has its maximum; in other words, it marks
the line width the filter is selective to. To find the filter parameter $\sigma_i$, a reformulation of
Equ. (2.5) in polar coordinates is helpful ($f_u = R\sin(\alpha)$ and $f_v = R\cos(\alpha)$):

$$\mathrm{DoG}(R) = e^{-2\pi^2 R^2 \sigma_i^2} - e^{-2\pi^2 R^2 \sigma_e^2}. \quad (2.6)$$

To find the extremum, the derivative of Equ. (2.6) is taken (see Equ. (2.7)) and set to zero
(see Equ. (2.8)). After introducing $\gamma_{\mathrm{DoG}} = \frac{\sigma_e}{\sigma_i}$ as the ratio between the outer and inner
Gaussian and rearranging, Equ. (2.9) results:

$$\frac{\partial\,\mathrm{DoG}(R)}{\partial R} = 4\pi^2 R \sigma_e^2\, e^{-2\pi^2 R^2 \sigma_e^2} - 4\pi^2 R \sigma_i^2\, e^{-2\pi^2 R^2 \sigma_i^2} \quad (2.7)$$

$$0 = 4\pi^2 R \left( \sigma_e^2\, e^{-2\pi^2 R^2 \sigma_e^2} - \sigma_i^2\, e^{-2\pi^2 R^2 \sigma_i^2} \right) \quad (2.8)$$

$$0 = e^{-2\pi^2 R^2 \sigma_i^2} \left( \gamma_{\mathrm{DoG}}^2\, e^{-2\pi^2 R^2 \sigma_i^2 \left(\gamma_{\mathrm{DoG}}^2 - 1\right)} - 1 \right) \quad (2.9)$$

$$1 = \gamma_{\mathrm{DoG}}^2\, e^{-2\pi^2 R^2 \sigma_i^2 \left(\gamma_{\mathrm{DoG}}^2 - 1\right)}$$

$$-2\pi^2 R^2 \sigma_i^2 \left( \gamma_{\mathrm{DoG}}^2 - 1 \right) = \ln\!\left( \frac{1}{\gamma_{\mathrm{DoG}}^2} \right)$$

$$R^2 = \frac{\ln(\gamma_{\mathrm{DoG}})}{\pi^2 \sigma_i^2 \left( \gamma_{\mathrm{DoG}}^2 - 1 \right)} \quad (2.10)$$
Notice that $f_{\mathrm{center}}$ represents the norm in the 2D frequency domain ($f_{\mathrm{center}} = \sqrt{f_u^2 + f_v^2}$),
which means the radius $R$ in Equ. (2.10) is set to $f_{\mathrm{center}}$. Rearranging Equ. (2.10), the
dependency of the optimal frequency $f_{\mathrm{center}}$ is found to be Equ. (2.11), and hence $\sigma_i$
results in Equ. (2.12):

$$f_{\mathrm{center}} = \frac{1}{\pi \sigma_i} \sqrt{\frac{\ln(\gamma_{\mathrm{DoG}})}{\gamma_{\mathrm{DoG}}^2 - 1}} \quad (2.11)$$

$$\sigma_i = \frac{1}{\pi f_{\mathrm{center}}} \sqrt{\frac{\ln(\gamma_{\mathrm{DoG}})}{\gamma_{\mathrm{DoG}}^2 - 1}} \quad (2.12)$$

It can be shown that the second derivative is negative at this point, marking the extremum
as a maximum. As described before, $f_{\mathrm{center}}$ in Equ. (2.12) is set to 0.25. Furthermore, e.g.,
[Mallot, 2002] recommends setting the ratio between the outer and inner Gaussian to
$\gamma_{\mathrm{DoG}} = 1.6$, which results in a low-loss signal decomposition with only few redundancies
between scales.
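As a numerical illustration (a sketch with the values stated in the text: $f_{\mathrm{center}} = 0.25$, $\gamma_{\mathrm{DoG}} = 1.6$), the kernel parameters follow directly from Equ. (2.12), and the transfer function of Equ. (2.5) can be checked to peak at $f_{\mathrm{center}}$:

```python
import numpy as np

f_center = 0.25   # normalized center frequency: period of 4 pixels
gamma = 1.6       # ratio sigma_e / sigma_i recommended by [Mallot, 2002]

# Equ. (2.12): standard deviation of the inner Gaussian
sigma_i = np.sqrt(np.log(gamma) / (gamma**2 - 1)) / (np.pi * f_center)
sigma_e = gamma * sigma_i

# Sanity check: the band-pass transfer function of Equ. (2.5) peaks at f_center
f = np.linspace(0.0, 0.5, 2001)
H = np.exp(-2 * np.pi**2 * f**2 * sigma_i**2) - np.exp(-2 * np.pi**2 * f**2 * sigma_e**2)
print(sigma_i, sigma_e, f[np.argmax(H)])  # approx. 0.70, 1.12, 0.25
```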
Discussion
As described in the previous subsection, the separation of the DoG filter response into on-off
and off-on contrasts increases the feature space without additional computational costs.
In the following, it is qualitatively shown that this approach increases the performance
of a marked lane detection system. The approach extends known algorithms for lane
marking detection by adding a biologically motivated filter step for preprocessing. More
specifically, the DoG filter is used as input feature. Figure 2.6a shows a typical inner-city
scenario with strong shadows on the road. For detecting the lane markings the view from
above (the so-called bird’s eye view) is computed (see Fig. 2.6b), as will be described
in Sect. 2.2.4. On the bird’s eye view lane marking-like contrasts (bright image regions
on a darker background) are detected by the DoG filter after which a clothoid-model-
based approach for detecting the markings is used (see, e.g., [Dickmanns and Mysliwetz,
1992, Franke et al., 2007, Ramstroem and Christensen, 2005] for related clothoid-based
approaches). Figure 2.6c depicts the DoG filter results without the described on-off/off-
on separation. Since lane markings have a typical on-off contrast (white/yellow markings
on a darker street), the on-off DoG filter results should be used, since these contain less
false-positive activations (Fig. 2.6d). For example, in [Luo-Wai, 2008] the pre-filtered road
image still contains the lane marking unspecific off-on contrasts (e.g., shadows on the road).
Such off-on contrasts are filtered out in our marked street detection approach to improve
the road detection performance.
For a quantitative evaluation of the influence of the described on-off DoG separation an
implemented lane marking detection system is used. The system gets a DoG-filtered edge
image without on-off (please refer to Fig. 2.6c) and with on-off separation (as shown in
Fig. 2.6d). The gathered results are summarized in Tab. 2.1. The evaluation shows the
improvement in accuracy of the detected offset (i.e., horizontal position of lane markings)
and radius of the road based on manually labeled ground truth data of 330 highway frames
(see Fig. 2.7 for a visualization of the scenario and the gathered results).
Table 2.1: Mean relative error of detection results (offset and radius of the lane marking model), with $\mathrm{MREO} = \frac{1}{N}\sum \frac{GT_{\mathrm{offset}} - \mathrm{offset}}{GT_{\mathrm{offset}}}$ and $\mathrm{MRER} = \frac{1}{N}\sum \frac{GT_{\mathrm{radius}} - \mathrm{radius}}{GT_{\mathrm{radius}}}$.

Type of input data preprocessing | Mean relative error in offset (MREO) | Mean relative error in radius (MRER)
Without DoG on-off separation    | 4.46                                 | 80.87
With DoG on-off separation       | 4.35                                 | 72.22
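The measures of Tab. 2.1 can be reproduced as in the following sketch (the arrays are illustrative stand-ins for the 330 labeled frames; taking the absolute relative error per frame is an assumption):

```python
import numpy as np

def mean_relative_error(ground_truth: np.ndarray, estimate: np.ndarray) -> float:
    """Mean relative error over the N labeled frames, cf. Tab. 2.1."""
    return float(np.mean(np.abs(ground_truth - estimate) / np.abs(ground_truth)))

# Illustrative per-frame values (offset of one lane marking, in m)
gt_offset = np.array([-1.80, 1.75, 5.10])
detected_offset = np.array([-1.90, 1.50, 5.09])
print(mean_relative_error(gt_offset, detected_offset))
```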
Figure 2.6: Exemplary performance gain of on-off DoG separation as pre-processing step of a
lane marking detection system: (a) Input image, (b) Bird’s eye view, (c) DoG result without
on-off separation, (d) DoG result with on-off contrasts only (off-on contrasts are filtered out).
2.1.2 Orientation Feature
In the following, the Gabor feature for the detection of lines and edges is biologically
motivated and a parameterization is derived that allows the implementation of a filter
bank for a sparse signal decomposition.
Biological Motivation
According to [Hubel and Wiesel, 1962], the lower layers of the cortex in cats contain ori-
entation selective neuron populations. Please note that for lines and edges no 360 degree
direction, but only 180 degree orientation is defined.
(Overlaid detection results per panel: marker offsets −1.80 m / 1.75 m / 5.10 m, road radius >3000 m, no turn; marker offsets −1.90 m / 1.50 m / 5.09 m, road radius >3000 m, no turn; marker offsets −1.90 m / 1.50 m / 4.90 m, road radius 2378 m, left turn.)
Figure 2.7: Sample images of the evaluation scene (lane marking detection results visualized).
The activation of these neuron populations decreases by 50% when the stimulus is rotated
by 15-20 degrees relative to the preferred orientation. According to the spatial frequency
theory (refer to [Palmer, 1999] for details),
these results give biological motivation for a filter bank with an angle selectivity of 30-40
degree. Also a frequency (respectively scale) selectivity of these neuron populations was
proven to exist based on experiments (see [Marcelja, 1980]). The receptive fields of these
neurons can be described by even and odd Gabor functions (pairs of quadrature filters),
which represent a Gaussian kernel modulated by a sinusoid with a phase shift of 0 and
90 degrees, respectively. As a visualization, Fig. 2.8 depicts a one-dimensional even and
odd Gabor kernel.
Figure 2.8: (a) One-dimensional even Gabor function (black), the Gaussian function (blue)
and the modulating cosine function (red) it is composed of, (b) One-dimensional odd Gabor
function.
Parameterization of the Gabor Kernel
The even Gabor filter kernel, which is selective to lines, equals the real part of $g(\mathbf{x})$ in
Equ. (2.13). The odd Gabor filter kernel, which is selective to edges, equals the imaginary
part of $g(\mathbf{x})$ in Equ. (2.13), with $\mathbf{x} = [u \;\; v]^T$:

$$g(\mathbf{x}) = \frac{1}{2\pi a b}\, e^{-\frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}}\, e^{j\, \mathbf{k}_0^T \mathbf{x}} \quad (2.13)$$

with

$$\mathbf{k}_0 = \begin{bmatrix} |\mathbf{k}_0| \cos\phi \\ |\mathbf{k}_0| \sin\phi \end{bmatrix}, \qquad
\mathbf{A} = \mathbf{R}\,\mathbf{P}\,\mathbf{R}^T =
\begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}
\begin{bmatrix} a^{-2} & 0 \\ 0 & b^{-2} \end{bmatrix}
\begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix}.$$
The variances $a^2$ and $b^2$ influence the size of the underlying Gaussian function in the two
image dimensions. According to the biological measurements, the filter kernel has a
modulation orthogonal to the longer principal axis of the Gaussian curve, which requires
$a^2$ to be smaller than $b^2$. The angle $\phi$ in the rotation matrix $\mathbf{R}$ determines the orientation
of the filter kernel:

$$\phi = m\,\Delta\phi \quad \text{and} \quad \Delta\phi = \frac{2\pi}{M} \quad \text{with} \quad m \in [0\,..\,M-1]. \quad (2.14)$$
The wave number vector k0 determines the 2D period length of the modulated complex
oscillation and thereby the selectivity of the kernels in the frequency domain. The value
|k0| determines the line width the filter is selective to. It is important to note that |k0| is
constant for all filter orientations.
The factor $\gamma_{\mathrm{Gabor}}$ is introduced as the ratio between the principal axes $a$ and $b$ (i.e., width
and height) of the underlying Gaussian curve:

$$\gamma_{\mathrm{Gabor}} = \frac{a}{b} \quad \text{with} \quad a < b. \quad (2.15)$$
In [Jones et al., 1987], $\gamma_{\mathrm{Gabor}}$ was measured to be between 0.25 and 1.0, with a mean of 0.6,
in neuronal receptive fields in the cat cortex. Given these measurements and Equ. (2.14),
the number of orientation channels $M$ is typically set to a value between 4 and 18. As
shown in [Trapp, 1998], the same parameter setting for the size of the Gaussian curve can
be applied to all orientations of a specific frequency channel. Based on this, the following
generic equation for $\gamma_{\mathrm{Gabor}}$ is derived in [Trapp, 1998]:

$$\gamma_{\mathrm{Gabor}} = \frac{a}{b} = \frac{3\sin\left(\frac{\Delta\phi}{2}\right)}{\sqrt{1 - 9\left(\cos\frac{\Delta\phi}{2} - 1\right)^2}}. \quad (2.16)$$
Since the filter bank uses a Gaussian resolution pyramid the filter kernel is also independent
of the frequency channel (i.e., scale).
It is now sufficient to determine the parameter $a$ in Equ. (2.16), which encodes the overlap
in the frequency domain between the filter kernels of a specific orientation of two adjacent
frequency channels (scales). In [Trapp, 1998], a generic formulation for the parameter $a$ is
proposed that contains the parameter $r$, which determines the value at the overlap of two
adjacent transfer functions (adjacent in orientation or scale):

$$a = \frac{3\sqrt{-2\ln(r)}}{|\mathbf{k}_0|}. \quad (2.17)$$
The parameter r is a value between 0 and 1 and is the ratio between the overlap value and
the maximal value of the transfer function. The bigger $r$, the more adjacent filter kernels
overlap in the frequency domain. Setting $r = 0.5$ yields a reversible (non-dissipative and
disjoint) signal decomposition compatible with wavelet theory. This means that during
Gabor decomposition of an image and subsequent recombination nothing is lost and no
redundancy exists.
Summarizing, the filter bank formulation uses the parameter ∆φ, which defines the number of orientation channels. Furthermore, the overlap value r and the value |k_0| define the frequency selectivity of the filter. A typical value for |k_0| is 0.25, which makes the filter selective to lines with a period length of 4 and hence a line thickness of 2 pixels. Figure 2.9 depicts a Gabor filter bank with ∆φ = 45 degrees, r = 0.5, and |k_0| = 0.25.
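To make this parameterization concrete, the following Python sketch builds one even/odd kernel pair from Equ. (2.13), (2.14), (2.15), and (2.17). It is an illustration under assumptions, not the thesis implementation: the function name and the truncation rule for the kernel support are ours, and γ_Gabor is passed in directly (e.g., the biological mean of 0.6 from [Jones et al., 1987]) instead of being evaluated via Equ. (2.16).

import numpy as np

def gabor_pair(m=0, M=4, k0_abs=0.25, r=0.5, gamma=0.6):
    """Sketch of the even/odd Gabor kernel pair of Equ. (2.13)."""
    phi = m * 2.0 * np.pi / M                     # orientation, Equ. (2.14)
    a = 3.0 * np.sqrt(-2.0 * np.log(r)) / k0_abs  # envelope size, Equ. (2.17)
    b = a / gamma                                 # gamma = a/b, Equ. (2.15)
    # A = R P R^T: rotated, anisotropic Gaussian envelope
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    A = R @ np.diag([a**-2, b**-2]) @ R.T
    k0 = k0_abs * np.array([np.cos(phi), np.sin(phi)])  # wave number vector
    half = int(np.ceil(2.5 * b))                  # truncate the envelope (our rule)
    u, v = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    x = np.stack([u, v], axis=-1).astype(float)
    g = (np.exp(-0.5 * np.einsum('...i,ij,...j->...', x, A, x))
         * np.exp(1j * (x @ k0)) / (2.0 * np.pi * a * b))
    return g.real, g.imag                         # even: lines, odd: edges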
As stated before, Gabor functions model the receptive field characteristics in early layers of the vision system of mammals. Gabor filters are well suited for the local detection of image frequencies since they optimally fulfill the trade-off between a good resolution in the frequency and image domain, which represents the “uncertainty principle of quantum mechanics” for image processing. In other words, the time-bandwidth product reaches its lower bound for Gabor functions, as was shown in [Gabor, 1946], which is equivalent to the fact that Gabor filter pairs are optimally localized both in the image and frequency domain. For a very descriptive mathematical formulation of these facts see [Trapp, 1998]. A good localization in the image domain allows small filter kernels and hence minimizes the calculation time for filter operations. Additionally, a good localization in the frequency domain allows an efficient use of Gaussian pyramids as well as sparseness in the orientation-selective feature maps. An optimized localization in the image domain alone leads to sparse lines in the Gabor filter response (no repeating patterns), but lines of adjacent orientations are amplified as well. An optimized localization in the frequency domain leads to good selectivity regarding the orientation of lines, but the filter response in the image domain shows no isolated line at a certain location, only an unlocalized pattern of lines of the specific orientation.
Summarizing qualitatively, Gabor filters are selective to oriented lines (i.e., contours) when filtering an image with the even part of the Gabor kernel, and selective to oriented edges (i.e., steps) when using the odd part of the Gabor kernel. Additionally, Gabor filters are scale selective (i.e., selective to a certain thickness of lines respectively sharpness of edges).
Figure 2.9: (a) Test image containing all frequencies in all orientations, (b)-(f) Even Gabor
(without on-off/off-on separation), orientation 0 degree on 5 scales, on top: filter response to
test image, bottom left: filter kernel in image domain, bottom right: filter kernel in frequency
domain, (g)-(l) Even Gabor orientation 45 degree, (m)-(q) Even Gabor orientation 90 degree,
(r)-(v) Even Gabor orientation 135 degree.
Conceptional Extensions
Besides the well-known concept of decomposing the Gabor filter response into its odd and even part (i.e., computing the real and imaginary part of the filter response), an additional decomposition is done here, which is motivated by the DoG filter decomposition. More specifically, we transfer the DoG on-off center concept to the Gabor filter and separate the odd and even Gabor responses into their positive and negative parts (please refer to Fig. 2.5c-f for a visualization). The proposed decomposition increases the performance of the ADAS attention system. For example, an on-off versus off-on even Gabor separation allows for the efficient separation of specifically oriented white lane markings from shadows on the road. Also, as shown in the following subsection, an on-off/off-on separation for the odd Gabor allows for the crisp suppression of the sky edge present in most scenes in the car domain.
In sum, 4 different Gabor-based feature types are derived from one filtering step (even
on-off, odd on-off, even off-on, odd off-on). Each of these 4 Gabor feature types consists of
20 independently weighable sub-feature maps (4 orientations on 5 scales each). Hence, the
Gabor filter bank in the used parameterization allows for 80 independent filter responses
and hence features.
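A minimal sketch of this separation (function name ours): the positive and negative parts of a filter response are obtained by simple rectification.

import numpy as np

def split_on_off(response):
    """Separate a filter response into its on-off (positive) and
    off-on (negative) part, analogous to the DoG center decomposition."""
    return np.maximum(response, 0.0), np.maximum(-response, 0.0)

# applied to the even and odd part of a complex Gabor response `resp`:
# even_on, even_off = split_on_off(resp.real)
# odd_on,  odd_off  = split_on_off(resp.imag)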
Discussion
As discussed in the last subsection, a decomposition of the even and odd Gabor features into their on-off and off-on components is proposed in order to increase the feature selectivity. In the car domain, the search performance is strongly influenced by the horizon edge, which is present in most images of highways and country roads. In the following, this serves as an exemplary problem for showing the importance of a high feature selectivity. Typically, the horizon edge is removed by mapping out the sky in the input image, which might not be biologically plausible and is error-prone. Instead, we suppress the horizon edge directly in the attention by weighting the sub-feature maps (the required weighting procedure is described in Chapter 3), based on the high selectivity of the attention features. The gain of this approach is depicted in Fig. 2.10c, which shows the diminished influence of the horizon edge on the saliency map of the real-world example in Fig. 2.10a. For a quantitative evaluation of the performance gain based on attentional sky suppression, refer to Sect. 3.5.
For a further qualitative assessment of the gain of on-off/off-on separation, see Fig. 2.11a-d. Here, a successful top-down search for the black vehicle is only possible using the full separability of the input feature space.
When comparing the properties of Gabor and DoG filters, one notes that a DoG filter response of a specific scale is equivalent to a combination of the 4 Gabor filter responses for the 4 orientations of the same scale. Still, using both features in an attention system (instead of Gabor alone) is reasonable, because the discrete nature of image filtering leads to a certain loss in selectivity for combined Gabor feature maps. Supporting the DoG feature in the attention system makes up for these losses.
However, it is important to note that DoG and Gabor filters are not independent, which
will play an important role in the normalization process described in Chapter 3 (see page 61). Normalization will be done to assure the comparability of all attention features, which is commonly neglected in comparable systems but is important for a robust vision system.
Figure 2.10: Evaluation of selectivity, (a) Input image, (b) Original bottom-up attention without sky suppression, (c) Modified bottom-up attention with attentional sky suppression (top-down influence), using suppressive odd Gabor filter kernels in low scales, (d) Bottom-up attention with traditional sky suppression.
2.1.3 RGBY Color Space
Numerous color spaces are known in image processing (e.g., RGB, HSV, XYZ, Lab). The
implemented biologically motivated RGBY color space shows several important advantages.
The color space is introduced and assessed in the following.
Biological Motivation
The so-called human search asymmetries, which are measured in psychophysical studies (e.g., an inclined line among vertical lines is detected more easily than a vertical line among inclined lines), were conceptualized by [Treisman and Gormican, 1988]. The authors propose a theory which states that it is easier to detect feature deviations of a non-canonical feature among canonical features than the other way around. The term canonical feature is related to the term basic feature (refer to the introduction of Chapter 2 for details on basic features).
(a) (b)
(c) (d)
Figure 2.11: Evaluation of the gain of on-off/off-on decomposition and usage of DoG as feature, (a) Shady input image with top-down search target black car, (b) Bottom-up attention, (c) Top-down attention without on-off feature separation (overall attention is negative, maximum is not on the car, search is not successful), (d) Top-down attention with on-off feature separation, only positive values are displayed, search is successful.
In short, basic features define specific feature types that guide attention
(e.g., lines, color), whereas a canonical feature defines characteristics of a certain feature
parameterization within a basic feature type (e.g., a subset of certain orientations leads to
a line feature that is canonical). Hence, canonical features can be understood as feature
parameterizations that mimic the way neuron populations are tuned in the human vision
system (e.g., lines of 0 degree and 45 degree orientation are canonical). Feature deviations
are represented by a combination of canonical features (e.g., a 10 degree line is represented
by a combination of 0 degree and 45 degree neurons). By finding search asymmetries,
such canonical features can be located. The psychophysical tests described in [Treisman,
1993] revealed typical search asymmetries for colors, which suggest that a canonical fea-
ture parameterization for colors should differentiate between red, green, blue, and yellow.
Furthermore, the human morphology in the early visual processing on the retina supports
this notion. More specifically, cells exist on the retina that are tuned to red-green and
blue-yellow contrasts (so-called color opponents). Both facts give a biological motivation for preferring RGBY colors to any of the earlier mentioned color spaces.
Computation of RGBY Colors
A very basic and thereby computationally efficient approach to compute RGBY colors was
proposed by [Itti et al., 1998] (the so-called color opponent approach from the Neuromor-
phic Vision Toolkit of Itti):
R = R - \frac{G + B}{2}    (2.18)

G = G - \frac{R + B}{2}    (2.19)

B = B - \frac{R + G}{2}    (2.20)

Y = \frac{R + G}{2} - \frac{|R - G|}{2} - B.    (2.21)
Drawbacks of this approach are the missing white balance and the missing uniformity of the resulting color maps. A color space is uniform if the distance between adjacent colors is equal over the whole color space (as related to the human color perception, i.e., the human ability to distinguish between colors). Uniform color spaces hence numerically represent colors very similarly to the human color perception.
A more complex approach for computing RGBY colors is described in the following.
The basic idea is based on the work of [Frintrop, 2006]. This more complex approach
has a number of important advantages, namely its incorporated white balance and its uniformity. Since the computational demands are moderate, the said RGBY approach is included in the feature space of our attention system.
RGBY colors are based on the Lab color space. The Lab color space, like the Luv color space, was defined in 1976 by the CIE (Commission Internationale de l'Éclairage) as a more accurate model of the human color perception. Both color spaces are uniform. The well-known HSV (Hue, Saturation, Value) and XYZ (X and Z contain color information, Y luminance information) are examples of non-uniform color spaces. In the uniform Lab color space, “L” holds luminance information, “a” represents the red-green contrast, and “b” the blue-yellow contrast. Since the L-channel is independent of the color information, Lab shows a certain extent of invariance against changes in illumination (similar to the HSV color space). A basic illumination invariance is very important for the proposed attention system. At the end of this subsection, the illumination invariance properties of the RGB and RGBY color spaces will be compared for a color-based detection of signal boards in twilight. The Lab color space is computed based on the following equations:
L = 116\left(\frac{Y}{Y_n}\right)^{1/3} - 16    (2.22)

a = 500\left[\left(\frac{X}{X_n}\right)^{1/3} - \left(\frac{Y}{Y_n}\right)^{1/3}\right]    (2.23)

b = 200\left[\left(\frac{Y}{Y_n}\right)^{1/3} - \left(\frac{Z}{Z_n}\right)^{1/3}\right].    (2.24)
24
2 Feature Space
With default values for a full-spectrum light source: X_n = 242.4, Y_n = 255.0, Z_n = 277.7. It depends on the XYZ color space:

\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} 0.490 & 0.310 & 0.200 \\ 0.177 & 0.812 & 0.011 \\ 0.000 & 0.010 & 0.990 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}.    (2.25)
The XYZ color space incorporates a white balance mechanism based on the reference
values Xn, Yn, and Zn, which are the XYZ values of a white reference patch in the image
[Forsyth and Ponce, 2003]. A white balance is necessary in order to adapt the perceived colors to the spectrum of the current light source. More specifically, the sensed colors have to be shifted and biased depending on this spectrum.
More qualitatively, this mechanism assures that the human can recognize a bright yellow
cab in full spectrum noon light as well as in reddish evening light.
The CIE XYZ color space was initially developed in order to better match the characteristics of monitors based on RGB color information. The elements of the XYZ transformation matrix have to be selected depending on the monitor type. In the system proposed here, the transformation matrix proposed in [Jaehne, 2005] is used.
To compute RGBY, [Frintrop, 2006] proposes the Euclidean distance between the Lab color pixels of the image and the four Lab reference colors (a_ref,R = 0 and b_ref,R = 127 for red, a_ref,G = 127 and b_ref,G = 127 for green, a_ref,Y = 127 and b_ref,Y = 0 for yellow, and a_ref,B = 127 and b_ref,B = 127 for blue). Exemplarily, Equ. (2.26) to (2.28) show the computation of the R color map of the RGBY space, which has to be applied pixel-wise over the whole image:

P_{ref,R} = (a_{ref,R},\, b_{ref,R}) = (0, 127)    (2.26)

R_{final} = \mathrm{dist}(P_{Lab} - P_{ref,R}) = \|(a, b) - (a_{ref,R},\, b_{ref,R})\|    (2.27)

= \sqrt{(a - a_{ref,R})^2 + (b - b_{ref,R})^2}.    (2.28)
Based on this, 4 color maps will result that contain only non-negative values (see
[Frintrop, 2006]). For a numerical representation of the interdependencies between RGBY
and RGB see Tab. 2.2.
However, there is a drawback in the approach of [Frintrop, 2006], which makes the resulting color maps inappropriate for use in the attention system proposed here. The problem is that the color maps computed this way are not independent. More specifically, the R map equals the inverted G map and the B map equals the inverted Y map, which means that only 2 independent color maps exist. Thereby, selectivity is lost. Furthermore, the R map, for example, holds zeros at image positions of pure green, whereas for an attention system a zero value in the red and green color maps should define the intermediate value between red and green.
Table 2.2: Numerical interdependencies between RGB and RGBY.

Reference color | RGB (red, green, blue) | RGBY without normalization (red, green, blue, yellow)
RGB red         | 255, 0, 0              | 217.7, 81.9, 100.0, 210.0
RGB green       | 0, 255, 0              | 92.7, 228.8, 96.7, 227.2
RGB blue        | 0, 0, 255              | 232.7, 118.0, 247.9, 81.5
RGB yellow      | 255, 255, 0            | 141.6, 176.1, 39.0, 222.5
Therefore, in the following section a rescaling is proposed, which leads to four independent RGBY color maps. Furthermore, it will be shown that the on-off/off-on separation for RGBY colors is system-immanent and therefore already included. Additionally, so-called double color opponent maps are proposed that are selective to color contrasts, and their importance for the attention system is shown by example.
Conceptional Extensions
In order to allow a more suitable decomposition of the color maps, the approach of Equ. (2.28) is adapted, leading to the following equations:

R_{tmp} = \sqrt{(a - a_{ref,R})^2 + (b - b_{ref,R})^2}/R_{max} - 0.536    (2.29)

G_{tmp} = \sqrt{(a - a_{ref,G})^2 + (b - b_{ref,G})^2}/G_{max} - 0.555    (2.30)

B_{tmp} = \sqrt{(a - a_{ref,B})^2 + (b - b_{ref,B})^2}/B_{max} - 0.512    (2.31)

Y_{tmp} = \sqrt{(a - a_{ref,Y})^2 + (b - b_{ref,Y})^2}/Y_{max} - 0.559    (2.32)

With: R_{max} = 232.7, G_{max} = 228.8, B_{max} = 247.9, Y_{max} = 227.2

R_{final} = \begin{cases} 2R_{tmp} & \forall\, R_{tmp} > 0 \\ 0 & \forall\, R_{tmp} \le 0 \end{cases}    (2.33)

G_{final} = \begin{cases} 2G_{tmp} & \forall\, G_{tmp} > 0 \\ 0 & \forall\, G_{tmp} \le 0 \end{cases}    (2.34)

B_{final} = \begin{cases} 2B_{tmp} & \forall\, B_{tmp} > 0 \\ 0 & \forall\, B_{tmp} \le 0 \end{cases}    (2.35)

Y_{final} = \begin{cases} 2Y_{tmp} & \forall\, Y_{tmp} > 0 \\ 0 & \forall\, Y_{tmp} \le 0. \end{cases}    (2.36)
As the equations show, a normalization is done prior to the decomposition into four independent color channels. The resulting RGBY color maps do not contain redundancies after the proposed decomposition procedure.
As a result, four normalized independent RGBY color maps are obtained. An additional decomposition of all these maps into on-off and off-on components, which would double the number of color feature maps, is not possible. This is the case since the decomposition of the RG and BY maps into four independent color maps is already equivalent to an on-off/off-on decomposition. Furthermore, image pyramids for all four color maps are built (see Annex A.1) in order to allow a separation between colored blobs of different sizes. The computed four independent color pyramids can be used for an attention-based, task-driven search (e.g., a search for red objects of a certain size). The concept of task-driven attention (so-called top-down attention) is described in detail in Chapter 3. A sketch of the complete color map computation is given below.
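The following Python sketch traces the chain of Equ. (2.25), (2.23), (2.24), and (2.29) to (2.36). It is an illustration only: the reference colors used here are placed on the opponent axes of the Lab space for readability and are an assumption that deviates from the exact values listed above; function and constant names are ours.

import numpy as np

# RGB -> XYZ matrix of Equ. (2.25) and the white reference of the text
M_XYZ = np.array([[0.490, 0.310, 0.200],
                  [0.177, 0.812, 0.011],
                  [0.000, 0.010, 0.990]])
XN, YN, ZN = 242.4, 255.0, 277.7

# Illustrative (a_ref, b_ref) pairs on the Lab opponent axes -- an
# assumption for this sketch, not the exact reference values of the text.
REF = {'R': (127.0, 0.0), 'G': (-127.0, 0.0),
       'Y': (0.0, 127.0), 'B': (0.0, -127.0)}
MAX = {'R': 232.7, 'G': 228.8, 'B': 247.9, 'Y': 227.2}   # Equ. (2.29)-(2.32)
BIAS = {'R': 0.536, 'G': 0.555, 'B': 0.512, 'Y': 0.559}

def rgby_maps(rgb):
    """rgb: float image of shape (h, w, 3) with values in [0, 255].
    Returns the four independent RGBY maps of Equ. (2.29)-(2.36)."""
    xyz = rgb @ M_XYZ.T
    fy = np.cbrt(xyz[..., 1] / YN)
    a = 500.0 * (np.cbrt(xyz[..., 0] / XN) - fy)   # Equ. (2.23)
    b = 200.0 * (fy - np.cbrt(xyz[..., 2] / ZN))   # Equ. (2.24)
    maps = {}
    for c in 'RGBY':
        a_ref, b_ref = REF[c]
        tmp = np.hypot(a - a_ref, b - b_ref) / MAX[c] - BIAS[c]
        maps[c] = np.where(tmp > 0.0, 2.0 * tmp, 0.0)  # Equ. (2.33)-(2.36)
    return maps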
However, colors themselves are not a cue for contrast-driven attention (so-called bottom-up attention). For example, a monotonous red image should not lead to contrast-driven attention. How such a contrast-driven (bottom-up) attention is computed will be described in Chapter 3. A color-based contrast-driven attention should be guided to image positions showing color contrasts (e.g., a green blob on a red background). The color feature type required for such operations is termed double color opponency in the literature. Said double color opponent maps are obtained by filtering all four RGBY color maps with a Difference of Gaussians kernel based on image pyramids (refer to Annex A.1 for details on filtering with image pyramids). As a result, we obtain five double color opponent maps for each color map, also allowing a differentiation between red blobs on a green background and vice versa. The obtained color contrast maps can now be used to detect targets based on their color contrast (e.g., traffic signs of typically bright colors in a typically less colorful traffic environment). In addition to bottom-up searches, such color contrasts can be used for the top-down search for objects with high color contrasts.
In sum, the color feature space consists of four RGBY color pyramids (i.e., 20 color feature maps) and 20 double color opponent maps. Top-down search hence has 40 color feature maps at its disposal, bottom-up search 20 color maps.
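On top of the RGBY maps, a double color opponent map can be sketched as a DoG-filtered color map; the function name and the center-surround size ratio below are assumptions for illustration.

from scipy.ndimage import gaussian_filter

def double_opponent(color_map, sigma_center=1.0, ratio=1.6):
    """Sketch: double color opponency as a DoG on one RGBY map.
    Positive values mark blobs of this color on an opposing surround."""
    center = gaussian_filter(color_map, sigma_center)
    surround = gaussian_filter(color_map, ratio * sigma_center)
    return center - surround

# applied on every level of each RGBY pyramid -> 4 x 5 = 20 opponent maps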
Discussion
In the following, the performance of the RGBY color space is assessed and compared qualitatively to RGB. For that, a complex traffic scene in twilight is used (see Fig. 2.12a), in which the present signal board should be found based on color information. As can be expected, the absolute R-channel of RGB shown in Fig. 2.12b is not sufficient for a successful color-based detection of the signal board. However, even when normalizing the RGB R-channel by the overall sum of all three color channels, the signal board cannot be detected (see Fig. 2.12c). The reason is that the low-light situation changes the color of the signal board despite the automatic white balance of the camera. When using the proposed non-linear RGBY colors, a separation on the RGBY R-channel is possible despite the challenging lighting conditions (see Fig. 2.12d). An attention-based search for such signal boards can hence rely on the highly discriminating R-channel, which boosts the performance of the attention system compared to the usage of RGB colors. Extensive testing with HSV colors has also shown inferior performance compared to the proposed RGBY color space.
Figure 2.12: (a) Input image of road scene in twilight (signal board marked in red), (b) Absolute R-channel of RGB-space, (c) Relative R-channel of RGB-space (i.e., R-channel normalized to the sum of RGB-channels, with thresholding), (d) R-channel of RGBY space (with thresholding).
2.2 Depth Features
Accurate depth information is of vital importance for a driver assistance system. Typical commercial applications for assisting the driver use Radar or Lidar data. Such sensors deliver accurate but sparse depth information of the scene. So far, only a few commercial driver assistance systems use vision, despite the fact that its information density is comparatively high. During the projection of the 3D world onto the 2D image chip, one dimension, the depth information, is lost. Recovering the depth cannot be done with 100% certainty, i.e., 2D images are ambiguous in terms of depth. To solve this challenge, several depth cues are fused. After a biological motivation, the implemented depth cues are described.
2.2.1 Biological Motivation
Following [Palmer, 1999], several stereo-related cues and at least nine monocular depth cues exist that allow the human to reliably perceive the depth in the environment. In the following, some of these monocular cues are listed and described shortly:

- Depth from object knowledge (known object size in the world as reference for the measured object size on the camera plane)
- Depth from ground plane assumption (assuming a flat world, the vertical image position is proportional to the object depth)
- Depth from blur (optimizing the edge sharpness by changing the focal length of the camera)
- Depth from Time to Contact (infer the time that remains until collision from the growth of the perceived object size)
- Depth from relative size (several objects of the same type in different distances)
- Depth from shading (positioning of shades relative to objects)
- Depth from texture gradient (depth-dependent image frequencies on homogeneously textured surfaces)
- Depth from aerial/atmospheric perspective (blue bias on objects that are far away)

The following subsections describe the five monocular and binocular (stereoscopic) depth cues our ADAS is based on in order to perform its various vision tasks.
2.2.2 Depth from Stereo Disparity
The perception of stereoscopic depth is based on the interpretation of the differences between the projected images of both eyes (the so-called parallax). An isolated point in the 3D world is projected to slightly different positions on the retina of both eyes, since these have a horizontal distance, the so-called basic distance. The horizontal shift between the images is called lateral disparity, see [Mallot, 2002]. In addition to the lateral disparity, other flavors of disparity exist (see [Mallot, 2002]) that can also cause an impression of depth; still, the lateral disparity seems to be the most important disparity-related depth cue and is therefore also the focus of the following reflections. For detecting lateral disparity (for simplification called disparity in the following), the detection of correspondences between the left and right eye is necessary. Here, ambiguities are possible due to differences in illumination and partial occlusion between both images. Especially local regions of low texture can lead to the well-known aperture problem, which is also a challenge for the optical flow computation (refer to [Willert et al., 2006]). Furthermore, differences and changes in the internal optical parameters of both eyes exist that influence the projections and hence the detected lateral disparity. Still, the human vision system can cope with these challenges by continuous adaptation mechanisms. How these challenges are solved by the human vision system is largely unknown. Designing a technical stereo system that closely mimics the processing steps in the brain is therefore not possible up to now. The engineered approaches show sound results, but also have their limitations.
Figure 2.13 depicts the individual processing steps, which are needed for computing
dense 3D world coordinates from stereo based on an engineering-driven approach.
Figure 2.13: Processing steps for computing dense 3D world coordinates from stereo.
After capturing pairs of images, the camera lens distortion is corrected for both cameras
independently. The undistortion step is essential in order to make the mapping of the
3D world to the 2D image plane comparable for both cameras, which is a prerequisite
when computing the stereo disparity in the following step. Based on the captured stereo
images, the undistorted vertical and horizontal pixel coordinates v and u are computed from the initial (distorted) coordinates v_d and u_d:

u = (1 + k_1\beta^2 + k_2\beta^4)\,u_d + 2k_3 u_d v_d + k_4(\beta^2 + 2u_d^2)    (2.37)

v = (1 + k_1\beta^2 + k_2\beta^4)\,v_d + k_3(\beta^2 + 2v_d^2) + 2k_4 u_d v_d    (2.38)

with \beta = \sqrt{u_d^2 + v_d^2}.
The undistortion is based on a lens distortion model (described in [Heikkila and Silven, 1997]) that uses radial (k_1 and k_2) and tangential (k_3 and k_4) distortion coefficients. For both cameras, these coefficients are determined offline using captured images of a checkerboard pattern, based on the camera calibration toolbox [J.Y.Bouguet, 2007] that is available on the Internet.
Furthermore, the cameras are oriented differently in the world (i.e., the camera angles
θX , θY , and θZ are different for both cameras). In order to allow an efficient search for
correspondences between the two camera images, these angles need to be compensated
(i.e., the optical axes of both cameras need to be parallel). In theory, this could be done
physically by adapting the camera position. However, this is not possible with the needed accuracy. The usual approach is to virtually adapt the camera angles by shifting and
remapping the image pixels of both cameras, which is called rectification. Typically, a
linear rectification is realized, which means both camera images are rescaled, rotated and
shifted in horizontal and vertical direction in order to compensate the differences in the
camera angles. The rectification is done using the commercial “Small Vision System”
software [Konolige, 1997].
For the rectification process the camera angles of both cameras are required. These can
be computed based on the following Equations that describe the 3D world to 2D image
mapping:
u = -f_u \frac{r_{11}(X - t_1) + r_{12}(Y - t_2) + r_{13}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + u_0    (2.39)

v = -f_v \frac{r_{21}(X - t_1) + r_{22}(Y - t_2) + r_{23}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + v_0.    (2.40)
Equation (2.39) and (2.40) use the 3 camera angles θX , θY , and θZ , the 3 translational
camera offsets t1, t2, t3 (see Fig. 2.16b), the horizontal and vertical principal point u0
and v0 as well as the horizontal and vertical focal lengths fu and fv (focal lengths that
are normalized to the horizontal and vertical pixel size respectively). In sum 12 unknown
variables exist (the elements of the rotation matrix Equ. (2.41) as well as the position of
the camera in the world t1, t2, and t3).
R = R_X R_Y R_Z = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix}    (2.41)
For determining these 12 variables the calibration scene shown in Fig. 2.14 is used, for
which the 3D world position of the marked points was measured manually with a laser
device and stored.
Based on internal interdependencies (orthogonality equations of the rotation matrix,
see Equ. (2.42)) and the correspondences between the stored 3D world position and the
measured image position for 3 points, all 12 parameters and hence the camera angles θX ,
θY , and θZ can be determined.
r_{11}^2 + r_{12}^2 + r_{13}^2 - 1 = 0
r_{21}^2 + r_{22}^2 + r_{23}^2 - 1 = 0
r_{31}^2 + r_{32}^2 + r_{33}^2 - 1 = 0    (2.42)
r_{11}r_{21} + r_{12}r_{22} + r_{13}r_{23} = 0
r_{11}r_{31} + r_{12}r_{32} + r_{13}r_{33} = 0
r_{21}r_{31} + r_{22}r_{32} + r_{23}r_{33} = 0
After repeating the described procedure for the second camera, the image rectification
can be done. After the rectification, an efficient search for correspondences between the
left and right image can be done. More specifically, for each pixel and its neighborhood in
Figure 2.14: Calibration scene with measured 3D world calibration points: (a) Left image
(calibration points marked), (b) Right image.
one of the camera images the best match in the other camera image is determined using
a correspondence search with a probabilistic matching algorithm (refer to [Willert et al.,
2006]). Since both images are undistorted and rectified the correspondence search between
the images can be restricted to horizontal shifts, which makes the procedure very efficient.
The result of the correspondence search is a dense disparity map D(u, v), which contains
a measured horizontal shift for all image positions.
Based on the disparity image the 3D world position for all image pixels can be computed
using:
Z_{stereo}(u, v) = \frac{f_u B}{D(u, v)} + t_3    (2.43)

Y_{stereo}(u, v) = \frac{Z(v - v_0)}{f_v} + t_2    (2.44)

X_{stereo}(u, v) = \frac{Z(u - u_0)}{f_u} + t_1.    (2.45)

With: B ... basic distance between the left and right camera's principal points
f_u, f_v ... normalized focal lengths [in pixels]
D(u, v) ... disparity
u_0, v_0 ... principal point
t_1, t_2, t_3 ... translational camera offset.
The equations are derived by transforming Equ. (2.39) and (2.40), setting all camera
angles to zero, since the disparity computation was done on rectified images.
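As an illustration, the following sketch (function and parameter names ours) evaluates Equ. (2.43) to (2.45) on a dense disparity map:

import numpy as np

def stereo_to_world(D, fu, fv, B, u0, v0, t=(0.0, 0.0, 0.0)):
    """Dense 3D coordinates from a disparity map D(u, v) following
    Equ. (2.43)-(2.45); t = (t1, t2, t3) is the camera offset."""
    D = np.asarray(D, dtype=float)
    h, w = D.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    with np.errstate(divide='ignore'):
        Z = fu * B / D + t[2]          # Equ. (2.43); D == 0 yields inf
    Y = Z * (v - v0) / fv + t[1]       # Equ. (2.44)
    X = Z * (u - u0) / fu + t[0]       # Equ. (2.45)
    return X, Y, Z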
In the last step, the stereo maps are unrectified (i.e., the prior rectification is neutralized)
to make them comparable to the input image on which all other processing steps are
running. This is realized by remapping the pixel values of the rectified stereo maps based
on Equ. (2.39) and (2.40), which results in unrectified stereo maps.
Figure 2.15 depicts a typical example for the resulting unrectified stereo maps in an
inner-city scenario.
Figure 2.15: Dense 3D world positions for all image pixels based on stereo from a probabilistic
matching approach [Willert et al., 2006].
Conceptional Extensions
Analyzing Equ. (2.43) to (2.45) and Fig. 2.15, it can be seen that the stereo maps are dense (i.e., for all image pixels a 3D world position is computed). However, at image positions near the car the computed values are not sufficiently accurate. When using a threshold on the stereo confidence map that was calculated during the disparity computation, these pixels can be identified. Furthermore, the thereby identified pixels can be corrected using an inter-modality depth cue fusion (see Sect. 4.1.2 for details) with the depth cues described in the following.
2.2.3 Depth from Object Knowledge
Depth from object knowledge calculates the distance of an object Z_obj based on knowledge about the area the object covers on the image plane (width W_im and height H_im in pixels), the width and height of the object in the world drawn from experience (W_world and H_world) as well as the intrinsic parameters of the sensor (f_u = f/t_u and f_v = f/t_v, with the focal length f and the horizontal and vertical pixel sizes t_u and t_v):

Z_{obj,W} \approx \frac{W_{world} f_u}{W_{im}} \quad \text{and} \quad Z_{obj,H} \approx \frac{H_{world} f_v}{H_{im}}.    (2.46)
A prerequisite for depth from object knowledge is a reliable segmentation algorithm. Currently, we use a histogram-based segmentation on an image region that is pre-segmented by a region growing algorithm (see [Jaehne, 2005]) running on the saliency map (see Fig. 2.19c on page 43 for a visualization of the gathered segmentation results).
In the following, Equ. (2.46) is derived. Without loss of generality, we simplify
Equ. (2.39) and Equ. (2.40) with θY = 0 and θZ = 0, which do not influence the ob-
ject distance, but would make the following steps more cumbersome. Furthermore, we can
set t3 = 0, since the Z coordinate of the center point of our coordinate system equals the
principal point of both cameras (see Fig. 2.16b). Using Equ. (2.39), two bordering points
of an object with the same height Y (i.e., having the same vertical pixel value v) and depth
Zobj,W have the following width Wim in pixels:
W_{im} = u_1 - u_2 = \frac{(X_1 - t_1)f_u + (Y - t_2)u_0\sin\theta_X + Z_{obj,W}\, u_0\cos\theta_X}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X} - \frac{(X_2 - t_1)f_u + (Y - t_2)u_0\sin\theta_X + Z_{obj,W}\, u_0\cos\theta_X}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X}.    (2.47)
This can be reformulated to:

W_{im} = \frac{(X_1 - X_2)f_u}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X} = \frac{W_{world}\, f_u}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X}.    (2.48)
In Equ. (2.48), (Y − t_2) sin θ_X is small, because θ_X is between −5° and 5°. The term represents the part of the distance that is induced by the shift of the object in the height direction Y. Since traffic-relevant objects are typically near a defined road plane, Y values between 0 and −10 m can be expected (note that due to the right-hand rule the Y axis is defined negatively). The induced error is hence small and stays below the uncertainty induced by the segmentation done for determining W_im. Additionally, cos θ_X is close to 1. Therefore, we can simplify Equ. (2.48) to:

W_{im} \approx \frac{W_{world}\, f_u}{Z_{obj,W}}.    (2.49)
Transposed to Z_{obj,W}, the distance of an object can be computed, given the width in the world W_world, the width in pixels W_im, the size of the pixels on the image chip t_u, as well as the focal length f:

Z_{obj,W} \approx \frac{W_{world}\, f_u}{W_{im}} = \frac{W_{world}\, f}{W_{im}\, t_u}.    (2.50)
Similarly, to compute the depth Z_{obj,H} based on the known object height in the 3D world H_world, the following equation can be found:

H_{im} = v_1(Y_1 = 0) - v_2(Y_2 = H_{world}) = \frac{Z_{obj,H}\, f_v H_{world}}{(h\sin\theta_X + Z_{obj,H}\cos\theta_X)((h - 1)\sin\theta_X + Z_{obj,H}\cos\theta_X)}.    (2.51)
Transposed to Z_{obj,H}, the following equation can be inferred:

Z_{obj,H} = -\frac{p}{2} + \sqrt{\left(\frac{p}{2}\right)^2 - q}    (2.52)

With:

p = \frac{2h\sin\theta_X\cos\theta_X H_{im} - \cos\theta_X\sin\theta_X H_{im} - f_v H_{world}}{H_{im}\cos^2\theta_X}

q = \frac{(h^2 - h)\sin^2\theta_X}{\cos^2\theta_X}.

Since sin θ_X is small and cos θ_X close to 1, we get:

p \approx -\frac{f_v H_{world}}{H_{im}} \quad \text{and} \quad q \approx 0.    (2.53)
Finally, we get Equ. (2.54) and can thereby compute the distance Z_{obj,H}, given the object height in the 3D world H_world, the height of the object in the image H_im, the focal length f, and the height of the pixels on the image chip t_v:

Z_{obj,H} \approx \frac{H_{world}\, f_v}{H_{im}} = \frac{H_{world}\, f}{H_{im}\, t_v}.    (2.54)
Again, the error induced by this simplification is small and thereby stays below the uncertainty of the segmented object height in the image H_im.
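In code, Equ. (2.50) reduces to a one-liner; the numbers in the usage example are illustrative assumptions, not calibration values of the test vehicle:

def depth_from_size(w_im_px, w_world_m, fu_px):
    """Equ. (2.50): object distance from its known world width, its
    measured pixel width, and the normalized focal length f_u."""
    return w_world_m * fu_px / w_im_px

# e.g., a car of ~1.8 m width imaged 60 px wide with an assumed fu = 800 px:
# depth_from_size(60, 1.8, 800) -> 24.0 m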
2.2.4 Depth from Bird’s Eye View
For computing the distance of objects that are positioned on the drivable path, the bird's eye view is used. The bird's eye view is a metric representation of the scene as viewed from above (see Fig. 2.16a). The cue is able to detect and estimate the distance of objects present on the ego vehicle's lane and the neighboring lanes (as opposed to the perspective image). Working on this representation for estimating object distances has the advantage that the cumbersome non-linear projection from 3D world coordinates to the 2D image plane (see Equ. (2.39) and (2.40)) is compensated. As such, world position coordinates can be assigned directly to a detected object without further processing. Furthermore, by this transformation, the detection of lanes and objects can be realized more easily than on the projected camera image, since expectations regarding typical metric lane widths can be integrated easily into the algorithm. The bird's eye view is calculated on the undistorted pixels v and u based on Equ. (2.39) and (2.40) by inverse perspective mapping of the 3D world points X, Y, and Z (see Fig. 2.16b for a visualization of the used coordinate system) to the 2D (u,v) image plane. The equations describe how to map a 3D position of the world to the 2D image plane (refer to [Broggi, 1995]). More specifically, only those image pixels (u,v) are mapped that are required to make the metric bird's eye view (i.e., the XZ-plane) dense. The approach thereby also leads to low computational demands.
Figure 2.16: (a) Visualization of the bird’s eye view, (b) Coordinate system and position of
the camera.
The usage of inverse perspective mapping makes the inversion of Equ. (2.39) and (2.40) obsolete when computing the bird's eye view.
As can be seen in Equ. (2.39) and (2.40) the 3D world position coordinates X, Y , and
Z of all image pixels (u,v) are required. By using a monocular system, one dimension (the
depth Z) is lost. A solution to this dilemma is the so-called flat plane assumption. Here,
for all pixels in the image, the height Y is set to 0. Based on this, only objects in the
image with Y = 0 (especially, the street we are interested in) are mapped correctly to the
bird’s eye view, while all the other regions are stretched to infinity in the bird’s eye view
(for example the car in Fig. 2.19d).
Now, a vertical grow algorithm with dynamic thresholds searches for discontinuities in
the bird’s eye view and assigns a distance value to them (see Fig. 2.19d).
In the rectified image (i.e., the image is virtually remapped to be equivalent to an
image with all 3 camera angles zero, see page 31 for details on the image rectification) the
following direct relation between the vertical pixel value v and the depth Zbirds exists:
Z_{birds} = \frac{f_v\, t_2}{v - v_0}    (2.55)

With: t_2 ... camera height above the ground
v_0 ... vertical principal point
v ... vertical pixel position that shows a significant contrast change
f_v ... normalized focal length.
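Equ. (2.55) is equally compact in code; the numbers in the example are assumptions for illustration:

def depth_from_row(v, v0, fv_px, cam_height_m):
    """Equ. (2.55): depth of a point on the flat road from its vertical
    image position in the rectified image (valid only for v > v0)."""
    return fv_px * cam_height_m / (v - v0)

# e.g., assumed fv = 800 px, camera 1.2 m above ground, contrast change
# at v = 140, principal point v0 = 120: depth_from_row(140, 120, 800, 1.2) -> 48.0 m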
Conceptional Extensions
In case the flat plane assumption is not fulfilled (i.e., the street surface is not flat) the
bird’s eye view is inaccurate, which decreases the quality of all algorithms that run on
the bird’s eye view (e.g., depth estimation or temporal road integration, see Sect. 4.2.2).
To allow a stable bird’s eye view even in case of non-flat street surfaces and pitching of
the vehicle, stereo data is used for plane fitting. In order to enhance the robustness of the
correction, only pixels that belong to the currently detected street segment (see Chapter 4)
are used for surface estimation. More specifically, the differences between the orientation
and position of the coordinate axes and the street surface in terms of the pitch ∆θX and
roll angle ∆θZ , as well as the height of the camera over the ground ∆t2 are computed:
Y = Y0 + cZ + dX (2.56)
∆θZ = atan(d) (2.57)
∆θX = atan(c) (2.58)
∆t2 = Y0. (2.59)
This is done based on the 3D positions of all image pixels derived from the stereo disparity (see Fig. 2.15 for the 3D data of a sample image). The flat plane assumption Y = 0 can be replaced by Y = f(X,Z), leading to an extended bird's eye view. In our implementation, a first-order model of the street surface (a linear hyperplane) is used, as shown in Equ. (2.56) (see [Li et al., 2004] for more details). Results have shown that higher-order models for plane fitting lead to inferior performance. The reason for this is the restricted number of 3D measurement points at the borders of the image, because only reliable pixels belonging to the detected street are used for the surface estimation. Since the estimated surface is noisy (stereo data is calculated based on the error-prone correlation between the left and the right camera image), a linear Kalman filter is used on the parameters Y_0, c, and d, which improves the performance considerably. A further possible improvement would be to use a model of the vehicle kinetics (containing damper and spring characteristics and a realistic distribution of the vehicle mass) for the Kalman prediction (as proposed in [Cech et al., 2004]) instead of the linear Kalman prediction model used here.
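A minimal sketch of this surface estimation, fitting Equ. (2.56) by plain least squares and deriving the corrections of Equ. (2.57) to (2.59); the Kalman filtering of Y_0, c, and d is omitted and the function name is ours:

import numpy as np

def fit_road_plane(X, Y, Z):
    """Least-squares fit of the linear road surface model of Equ. (2.56),
    Y = Y0 + c*Z + d*X, on 3D stereo points of the detected road segment."""
    A = np.column_stack([np.ones_like(Z), Z, X])
    (y0, c, d), *_ = np.linalg.lstsq(A, Y, rcond=None)
    d_theta_x = np.arctan(c)   # pitch correction, Equ. (2.58)
    d_theta_z = np.arctan(d)   # roll correction, Equ. (2.57)
    d_t2 = y0                  # camera height correction, Equ. (2.59)
    return d_theta_x, d_theta_z, d_t2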
2.2.5 Depth from Time to Contact
The time to contact (TTC) is the quantity of time it takes an observer to reach an approaching surface in case the observer continues its current relative motion (see [Palmer, 1999]). The TTC is believed to be a basic cue for the behavior generation of simple organisms, e.g., for flies during the landing task [Borst, 1990]. Various concepts for computing the TTC are known.
The potential of the TTC was first researched by [Hoyle, 1957], who found that the optical aperture angle ψ an object takes up on the retina, related to the change of ψ in time, is an approximate measure for the TTC (please also refer to Fig. 2.17):

TTC \approx \frac{\psi}{d\psi/dt}.    (2.60)
It is important to note that the TTC can be derived without information of the object
distance or relative velocity. The concept was extended by [Hoyle, 1957] to the so-called
tau-function, which states that the TTC can be derived from the ratio of any visually
perceived spatial quantity to its derivative.
As an example of the tau function, [Palmer, 1999] gives the distance of an image position to the Focus of Expansion (i.e., the image point in which all longitudinal lines meet) related to its derivative. The tau function is also applicable for an observer that passes an approaching object without colliding with it (refer to [Kaiser and Mowafy, 1993]).

Figure 2.17: Object aperture angle ψ.

Also binocular information (stereo disparity) can be used to derive the TTC (refer to [Regan, 2002] and
[Harris, 2004]). In Equ. (2.61) the TTC is given depending on the object distance and
disparity (with B as the distance of the stereo cameras, Z the object distance, and D(u, v)
the stereo disparity):
TTC \approx \frac{B}{Z\,[dD(u, v)/dt]}.    (2.61)
However, for the later depth cue fusion we are interested in cues that are independent in terms of the required input signals. The stereo disparity is already used as a depth cue in our system. Therefore, only the TTC depth cue defined in Equation (2.60) is helpful as a starting point for the following conceptional extensions, which will make the approach accessible for the vehicle domain.
Conceptional Extensions
In order to make the TTC more accessible for object depth estimation in the vehicle domain, the concept was extended from the expansion of image regions to contractions (i.e., an increasing distance to the object of interest). Furthermore, to our knowledge the TTC has not been used before to compute object depth in real-world scenarios of the vehicle domain. In the following, a universal approach for computing the depth from TTC in the vehicle domain is given. As will be shown on ground truth data, the concept is theoretically sound, but has some drawbacks in a real-world application.
For computing depth from TTC, the image-related size of an object b_t for three consecutive frames is required (see Fig. 2.18). Furthermore, we assume a constant object motion v_obj within these frames. Given a frame rate f_rate of 11 Hz, this assumption is justified.
The ego vehicle motion v_{ego,t} is also needed; it is accessible via the CAN bus of today's vehicles.
Figure 2.18: Depth from Time to Contact measurements on three consecutive frames for an
approaching vehicle (i.e., b1 < b2 < b3).
For the scenario visualized in Fig. 2.18, which assumes a decreasing distance Dt to the
car (i.e., D1 > D2 > D3 or b1 < b2 < b3) the following Equ. (2.62) to (2.72) can be derived
and resolved to Equ. (2.73) (with f as the focal length and L the object size in the world).
D_1 = (v_{ego,1} - v_{obj})\, t_{TTC1}    (2.62)

D_2 = (v_{ego,2} - v_{obj})\, t_{TTC2}    (2.63)

D_1 = \frac{fL}{b_1}    (2.64)

D_2 = \frac{fL}{b_2}    (2.65)

D_1 = \frac{(v_{ego,1} - v_{ego,2})\, b_2\, t_{TTC1}\, t_{TTC2}}{b_2\, t_{TTC1} - b_1\, t_{TTC2}}    (2.66)

t_{TTC1} \approx \frac{\psi_1}{d\psi_1/dt} = \frac{\psi_1}{(\psi_2 - \psi_1) f_{rate}}    (2.67)

t_{TTC2} \approx \frac{\psi_2}{d\psi_2/dt} = \frac{\psi_2}{(\psi_3 - \psi_2) f_{rate}}    (2.68)

\psi_1 = \arctan(b_1/f)    (2.70)

\psi_2 = \arctan(b_2/f)    (2.71)

\psi_3 = \arctan(b_3/f)    (2.72)

D_1 \approx (v_{ego,1} - v_{ego,2}) \cdot \frac{b_2\arctan(b_1/f)\arctan(b_2/f)}{b_2\arctan(b_2/f)[\arctan(b_2/f) - \arctan(b_1/f)] - b_1\arctan(b_1/f)[\arctan(b_3/f) - \arctan(b_2/f)]}    (2.73)
Note that Equ. (2.73) depends on easily accessible input data alone (the measured object widths on the camera chip b_1, b_2, and b_3 as well as the vehicle ego motion v_{ego,t}). In addition to the direct plausibility check of the computed object distance D_1, the object motion v_obj can be computed and assessed for plausibility:

v_{obj} \approx \frac{D_1 + v_{ego,1}\, t_{TTC1}}{t_{TTC1}}.    (2.74)
For an increasing object distance Dt (i.e., b1 > b2 > b3), a TTC computation based on
Equ. (2.60) and hence Equ. (2.73) is not possible. However, when redefining the TTC for
objects that move away, we get Equ. (2.75) and (2.76) as well as Equ. (2.77) and (2.78):
t_{TTC1} \approx -\frac{\psi_1}{d\psi_1/dt} = \frac{\psi_1}{(\psi_1 - \psi_2) f_{rate}}    (2.75)

t_{TTC2} \approx -\frac{\psi_2}{d\psi_2/dt} = \frac{\psi_2}{(\psi_2 - \psi_3) f_{rate}}    (2.76)

D_1 = (v_{obj} - v_{ego,1})\, t_{TTC1}    (2.77)

D_2 = (v_{obj} - v_{ego,2})\, t_{TTC2}.    (2.78)
For the case of an increasing object distance, these equations can be resolved to:

D_1 \approx (v_{ego,2} - v_{ego,1}) \cdot \frac{b_2\arctan(b_2/f)\arctan(b_3/f)}{b_2\arctan(b_3/f)[\arctan(b_1/f) - \arctan(b_2/f)] - b_1\arctan(b_2/f)[\arctan(b_2/f) - \arctan(b_3/f)]}.    (2.79)

With a similar approach, for the two remaining cases (b_1 < b_2 > b_3 and b_1 > b_2 < b_3) we get Equ. (2.80) and (2.81):

D_1 \approx (v_{ego,2} - v_{ego,1}) \cdot \frac{b_2\arctan(b_2/f)\arctan(b_2/f)}{b_1\arctan(b_2/f)[\arctan(b_3/f) - \arctan(b_2/f)] + b_2\arctan(b_2/f)[\arctan(b_1/f) - \arctan(b_2/f)]}    (2.80)

D_1 \approx (v_{ego,1} - v_{ego,2}) \cdot \frac{b_2\arctan(b_1/f)\arctan(b_3/f)}{b_1\arctan(b_1/f)[\arctan(b_2/f) - \arctan(b_3/f)] + b_2\arctan(b_3/f)[\arctan(b_2/f) - \arctan(b_1/f)]}.    (2.81)
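For the approaching case (b_1 < b_2 < b_3), the chain from the measured widths to D_1 can be sketched as follows. The code follows Equ. (2.66) to (2.72) as stated above and is an illustration, not the evaluated implementation; the function name is ours.

import numpy as np

def depth_from_ttc(b1, b2, b3, v_ego1, v_ego2, f, f_rate):
    """Sketch of Equ. (2.66)-(2.72) for an approaching object
    (b1 < b2 < b3): estimate D1 from three consecutive object widths
    on the chip and the ego velocities of the first two frames;
    requires v_ego1 != v_ego2, otherwise the estimate degenerates."""
    psi1, psi2, psi3 = np.arctan(np.array([b1, b2, b3]) / f)  # Equ. (2.70)-(2.72)
    t1 = psi1 / ((psi2 - psi1) * f_rate)                      # Equ. (2.67)
    t2 = psi2 / ((psi3 - psi2) * f_rate)                      # Equ. (2.68)
    return (v_ego1 - v_ego2) * b2 * t1 * t2 / (b2 * t1 - b1 * t2)  # Equ. (2.66)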
Discussion
The results gathered in Tab. 2.3 for the case b_1 < b_2 > b_3 show that the TTC computation works well when using manually segmented ground truth data. The accuracy gathered thereby is roughly comparable to the error measured in studies with human subjects ([Gray and Regan, 1998] reported errors between 2.6 and 3.0% for approaching small objects). In contrast, TTC computation on a real-world test stream showed poor performance. This is due to the fact that a robust segmentation algorithm (including a tracking of the segmentation parameters) would be required by the described TTC approach in order to reach the necessary robustness in real-world scenarios and error rates that are consistent with psychophysical studies. The segmentation algorithms required in this context would need sub-pixel accuracy, which could be realized in a brain-like way by mimicking the hyperacuity principle based on population coding (see [Mallot, 2002]). Since such an algorithm is not in the focus of the current work, depth from TTC was not included in the developed depth fusion approach (refer to Sect. 5.2.1). Additionally, as shown in Equ. (2.67) and (2.68), the TTC is calculated using a temporal derivative of the measured object segments. Taking the derivative of the already uncertain segmentation results reduces the signal-to-noise ratio (SNR) even further [Mallot, 2002]. Furthermore, it is important to note that the introduced depth from TTC approach can only be applied in case the object velocity v_obj is, as assumed, approximately constant in the considered time interval. However, a check that evaluates whether the object motion v_obj (computable by Equ. (2.74)) is within plausible boundaries can be used to verify if this condition was observed. All this makes depth from TTC a cue that delivers a coarse estimation of the object depth. It should hence be fused with other, more reliable depth cues, allowing at least an approximate depth measurement in case everything else fails.
Table 2.3: Examples for depth from TTC for b_1 < b_2 > b_3 (f_rate = 3).

Ego velocity v_ego,1 (v_ego,2) [m/s] | Ground truth distance D_1 (D_2, D_3) [m] | Resulting v_obj [m/s] | Computed depth from TTC Z_TTC [m] | Relative error |D_1 − Z_TTC|/D_1 [%]
24 (21.9) | 39 (39.5, 39.2) | 22.50 | 41.38 | 6.10
24 (21.9) | 38 (38.5, 38.2) | 22.50 | 40.43 | 6.39
24 (21.9) | 37 (37.5, 37.2) | 22.50 | 39.48 | 6.70
24 (21.9) | 36 (36.5, 36.2) | 22.50 | 38.54 | 7.06
24 (21.9) | 35 (35.5, 35.2) | 22.50 | 37.60 | 7.43
24 (21.9) | 34 (34.5, 34.2) | 22.50 | 36.66 | 7.82
24 (21.9) | 33 (33.5, 33.2) | 22.50 | 35.73 | 8.27
Mean relative error: 7.11
2.2.6 Depth from Radar
Depth from Radar (Radio Detecting and Ranging) is obtained from a commercial standard
vehicle equipment sensor, which delivers sparse point-wise measurements of low longitudi-
nal but higher lateral uncertainty (for an example see Fig. 2.19b). Radar sensors evaluate
the reflections (echoes) of bundled microwave beams (typically between 400 MHz and 80 GHz) for detecting, localizing, tracking, and classifying objects. More specifically, the time of flight t_tof is used to determine the object distance Z_radar:

Z_{radar} = \frac{c_0 \cdot t_{tof}}{2}.    (2.82)

With: c_0 ... velocity of propagation (speed of light) ≈ 300000 km/s
t_{tof} ... time of flight (to the object and back).
For measuring the time of flight the individual beam packages must be marked and
recognized, which can be done by modulation and demodulation of the signal amplitude,
frequency or phase. The object velocity vdop is determined based on the Doppler shift ∆f :
v_{dop} = \frac{c_0 \cdot \Delta f}{2 f_0}.    (2.83)

With: c_0 ... velocity of propagation (speed of light) ≈ 300000 km/s
\Delta f ... measured Doppler frequency shift
f_0 ... carrier frequency.
Using Radar sensors, the object distance and velocity can hence be measured with inde-
pendent approaches. Different from visual sensors, Radar is very robust against changing
weather conditions [Winner, 2007], which makes it an important cue that increases the
system robustness.
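Both relations are direct to implement; the sensor parameters in the example below are assumptions for illustration only:

C0 = 299_792_458.0                 # speed of light [m/s]

def radar_distance(t_tof_s):
    """Equ. (2.82): object distance from the round-trip time of flight."""
    return C0 * t_tof_s / 2.0

def radar_velocity(delta_f_hz, f0_hz):
    """Equ. (2.83): relative object velocity from the Doppler shift."""
    return C0 * delta_f_hz / (2.0 * f0_hz)

# e.g., an assumed 77 GHz sensor measuring a 1 us echo and a 5.13 kHz
# Doppler shift: radar_distance(1e-6) -> ~150 m;
# radar_velocity(5130, 77e9) -> ~10 m/s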
2.3 Motion Features
The role of motion in a biologically motivated driver assistance system is twofold. First, a
decomposition of the scene regarding the magnitude and direction of motion is required in
order to increase the selectivity of the attention system. Second, a robust decomposition of
the scene into dynamic and static objects is required. Especially features for decomposing
the scene into static and dynamic objects are of high relevance in driver assistance, since
dynamic objects require fast system reactions and reliable predictions. Different system
modules should take detected dynamic objects into account. For example, appropriate
motion models for dynamic objects should be included into the collision avoidance and
path planning modules.
2.3.1 Differential Images
A well-known feature for motion detection is the differential image (i.e., the difference between the current and the previous frame), which is usually found in applications in the surveillance domain.
Figure 2.19: Used depth cues: Depth from (a) Stereo disparity, (b) Radar, (c) Object knowl-
edge, (d) Bird’s eye view.
In its classical form, this procedure has the general drawback
that the motion cannot be localized and classified in terms of magnitude and direction, because only changes in intensity are evaluated by this approach. Furthermore, in the car domain the influence of the vehicle ego motion on differential images is high. This makes it impossible to detect dynamic (i.e., moving) objects and to differentiate them from static scene content based on differential images. Despite these drawbacks, differential images are used as a feature in the presented ADAS because of two important advantages. First, differential images as a motion feature show strong activations at image regions that contain static and dynamic objects near the moving ego vehicle, which is of high importance in driver assistance. Second, the approach has the highest computational efficiency among the existing motion features.
Conceptional Extensions
The used implementation computes differential images on multiple scales, which allows a
decomposition in different motion magnitudes (see Fig. 2.20). Higher scales (i.e., lower
image resolutions) represent higher motion magnitudes. This is an important extension to known approaches.

Figure 2.20: Differential images: (a) Input image, (b) to (f) Motion feature on 5 scales.
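A sketch of this multi-scale computation (the function name and the pyramid smoothing parameter are ours):

import numpy as np
from scipy.ndimage import gaussian_filter

def diff_image_pyramid(img_t, img_prev, levels=5):
    """Sketch: differential images on multiple pyramid scales; higher
    levels (lower resolution) respond to larger motion magnitudes."""
    maps = []
    a, b = img_t.astype(float), img_prev.astype(float)
    for _ in range(levels):
        maps.append(np.abs(a - b))
        # Gaussian pyramid step: smooth, then subsample by factor 2
        a = gaussian_filter(a, 1.0)[::2, ::2]
        b = gaussian_filter(b, 1.0)[::2, ::2]
    return maps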
As stated before, a differentiation between dynamic and static scene elements is not possible with differential images. Therefore, a more sophisticated approach for motion estimation is needed. In the next subsection, an approach for estimating the object motion is described. The approach is based on a biologically motivated optical flow algorithm [Willert et al., 2007].
2.3.2 Detection of Dynamic Objects
In the following, an approach for the visual vehicle ego motion compensation based on
stereo disparity is described, which compensates the ego-motion-induced movement of
static scene elements. Based on this compensation, dynamic objects can be detected.
System Description
In order to describe the object motion detection approach, a detailed system description is given that roughly relates to [Schmudderich et al., 2008]. The authors realized an ego motion compensation on the robot ASIMO. Based on that, moving human interactors are detected even while ASIMO itself is moving. Different from the indoor environment with constant illumination and a restricted number of objects, the system presented here runs on real-world, outdoor data of the vehicle domain. The system (see Fig. 2.21) uses the current and previous gray value images (I_t and I_{t−1}) as input. Furthermore, the vehicle yaw rate \dot{\theta}_Y and longitudinal velocity from the CAN bus as well as the stereo disparity are required. Based on Equ. (2.43), (2.44), and (2.45), the 3D world position of all objects in
the scene can be computed, leading to the 3 maps X, Y, and Z (refer to Fig. 2.15 on page
33).
For compensating the vehicle ego motion on the image plane, the following processing steps are realized: First, an extended single track model is applied (refer to Sect. 4.2.2 for details) that computes the longitudinal vehicle motion ∆Z and the lateral vehicle motion ∆X in world coordinates since the last captured frame. The yaw angle change ∆θ since the last frame can be derived from the yaw rate \dot{\theta}_Y. Now, the current 3D coordinates (X(u, v), Y(u, v), and Z(u, v)) of all image pixels (u, v) that are computed from the stereo disparity are corrected by the computed longitudinal and lateral motion as well as the yaw angle. More specifically, ∆X and ∆Z are used as offsets on the world coordinate maps X(u, v) and Z(u, v). The yaw angle change directly corrects the yaw angle θ_Y of the pin hole camera model (see Equ. (A.1) and (A.2) on page 138). By using a pin hole camera model on the corrected 3D coordinates, a change in 2D pixel coordinates is computed. This change represents how the image is expected to change due to the vehicle ego motion. The information can be used to warp the pixels of the current image I_t back in time (so-called backward warping), assuming the overall scene to be static. This results in the warped image I^+_{t−1}. Computing the optical flow (i.e., the pixel-wise motion of image regions between two consecutive images of the same camera) between the previous camera image I_{t−1} and the warped image I^+_{t−1} reveals the dynamic objects present in the scene.
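The following sketch condenses the warping step. It is an illustration under assumptions: the signs and the order of the motion corrections depend on the coordinate conventions, holes from the scatter operation are left unfilled, and the function name is ours.

import numpy as np

def backward_warp(img_t, X, Y, Z, dX, dZ, d_theta, fu, fv, u0, v0):
    """Sketch of backward warping: undo the ego motion on the 3D stereo
    maps of the current frame, reproject with a pin-hole model, and
    scatter the pixels of I_t to their predicted old positions."""
    # undo the translation and the yaw rotation (around the Y axis)
    Xc, Zc = X - dX, Z - dZ
    c, s = np.cos(-d_theta), np.sin(-d_theta)
    Xr, Zr = c * Xc + s * Zc, -s * Xc + c * Zc
    # pin-hole projection of the corrected 3D points (cf. Equ. (2.39)/(2.40))
    u = np.round(-fu * Xr / Zr + u0).astype(int)
    v = np.round(-fv * Y / Zr + v0).astype(int)
    h, w = img_t.shape
    valid = (Zr > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    warped = np.full(img_t.shape, np.nan)   # I+_{t-1}; holes stay NaN
    warped[v[valid], u[valid]] = img_t[valid]  # last write wins on collisions
    return warped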
Several postprocessing steps improve the robustness. More specifically, morphological operations assure that small clutter is rejected. The Pearson measure described in
[Schmudderich et al., 2008] boosts flow values with high magnitude and high correlation
confidence. The correlation confidence is a by-product of the NCC-based optical flow
computation (see [Willert et al., 2006]).
Conceptional Extensions
As an extension to the state-of-the-art system of [Schmudderich et al., 2008], the system described here incorporates top-down (TD) information about static scene objects to improve the quality. For known static image regions, e.g., regions containing road (see Sect. 4.1.2 for the robust unmarked road detection system our ADAS disposes of) or sky, the predicted values I^+_{t−1} are set to I_{t−1} and will hence not produce optical flow when comparing the back-warped image I^+_{t−1} and the previous image I_{t−1}. This procedure improves the system robustness and decreases false positive detections of dynamic objects.
Discussion
In the following, the realized object motion detection system is evaluated in terms of
computational demands and the detection performance in four real-world scenarios.
Table 2.4 lists the computation times of the required system modules depicted in Fig. 2.21 for images of 200x150 pixels. In sum, the motion detection algorithm needs about 1.4 s per image (a processing rate of 0.7 Hz) when running on a single computer. The image data as well as the vehicle state data from the CAN bus are transmitted via LAN to a Toshiba Tecra A7 (2 GHz Core Duo) running our RTBOS integration middleware [Ceravola et al., 2006] on top of Linux.
Figure 2.21: System overview: Object motion detection based on backward image warping.
Table 2.4: Computational demands of the dynamic object detection system.

Module                                | Computation time [ms]
Rectification                         | 88
Stereo (SVS)                          | 21
Single track model                    | 11
Backward warping                      | 3
Optical flow: displacement maps       | 625
Optical flow: confidence maps         | 525
Optical flow                          | 65
Postprocessing: Pearson threshold     | 13
Postprocessing: 3x3 median filter     | 77
Postprocessing: morphological opening | 1.5
Σ Computation time                    | 1429.5
The algorithm was implemented in C using an optimized image processing library based on the Intel IPP [Intel, 2006]. Since our ADAS receives images at a frame rate of 10 Hz, ego-motion-compensated data is accessible and can be combined in the attention system only for every 7th image, even after distributing the system over 2 computers. The envisioned future acceleration of the optical flow algorithms will increase this rate.
Figure 2.22: Visualization of test scenarios, (a) Target vehicle crossing the road from left, (b)
Target vehicle from left stops in the middle of the road, (c) Target vehicle from left, camera
vehicle turns right and follows, (d) Pedestrian passes the road.
In the following, a quantitative system evaluation on four test scenarios (see visualization
in Fig. 2.22) is realized. In the first scenario, a target vehicle is crossing the road in front
of the ego vehicle, while the ego vehicle moves straight along the road. In the second
scenario, the target vehicle stops in the middle of the road. In the third scenario, the
target vehicle crosses the road, while the ego vehicle follows and turns right. In the last
scenario, a pedestrian crosses the road in front of the moving ego vehicle.
In Fig. 2.23, the gathered detection results for all four scenarios are depicted showing a
robust detection of dynamic image regions. Especially, the results gathered for scenario 3
show that the incorporation of vehicle kinematics (in form of a single track model) allows
the detection of moved objects even during heavy lateral movements.
Figure 2.23: Object motion detection with camera vehicle motion (right: input image, left:
detected dynamic objects), (a) Car crossing the road in front, (b) Car from left stops in the
middle of the road, (c) Car from left, camera vehicle turns right and follows, (d) Pedestrian
passes the road.
2.4 Summary
Chapter 2 describes the feature space, the realized Advanced Driver Assistance System
(ADAS) relies on. Hence, the Chapter provides the theoretical foundations for the remain-
ing work, since the described features are required by the ADAS for fulfilling its various
driver assistance tasks. More specifically, the attention system described in the following
Chapter is based on various static and dynamic feature maps that are weighted and com-
bined in a robust way. All these features are biologically motivated, meaning that known
processing principles of the vision pathway in the human brain are mimicked (e.g., the
signal processing characteristics of neurons in the brain).
Related to static features, Difference of Gaussian (DoG) filters are introduced. DoG
filters are selective to homogeneous regions in the image that differ from the background
in terms of their intensity. A decomposition of the DoG filter responses allows an efficient
separation of dark regions on a bright background (off-on contrast) from bright regions
on a dark background (on-off contrast). Based on image pyramids, a DoG filter bank is
realized that allows the detection of homogeneous regions of five different sizes. In sum,
10 DoG feature maps are accessible.
Furthermore, a Gabor filter bank is introduced. Based on the orientation and scale
selective Gabor kernel, lines and edges of five different sizes and four orientations can be
detected. By applying an additional separation into on-off and off-on contrasts, in sum 80
Gabor feature maps are computed.
As a further static feature, the biologically motivated RGBY color space is described, which
mimics the color processing on the human retina. Based on the RGBY colors, four pyramids
of color maps are obtained (in sum 20 feature maps). Furthermore, based on the RGBY
color maps, 20 RGBY color contrast maps are computed that assess local changes in the
color maps.
In addition to these features, a depth map of the scene is computed. During the projection
of the 3D world to the 2D image, the depth is lost. Therefore, approaches that recover
the depth are error prone. The realized ADAS has access to five different depth cues that are
combined in order to increase the accuracy. The realized depth sources are stereo disparity,
depth from object knowledge, depth from Time to Contact, depth from the bird's eye view,
and Radar-based depth. In sum, 130 static feature maps are accessible for the proposed
ADAS (10 DoG, 80 Gabor, 20 RGBY color maps, 20 RGBY color contrast maps).
As dynamic feature maps, differential images (i.e., the difference of two consecutive
images), computed on five scales, are used. However, based on differential images, a
separation between static (e.g., parking cars) and dynamic objects (e.g., moving cars) is not
possible due to the motion of the camera vehicle. To solve this challenge, a system for
the detection of dynamic objects is described and tested. In sum, six dynamic feature
maps are accessible to the ADAS (5 motion maps from differential images, 1 motion map
for dynamic objects), which makes an overall number of 136 feature maps.
Summing up, the novelties described in Chapter 2 are:

• Computationally efficient decomposition of the Gabor filter response in on-off and off-on components, allowing a gain in selectivity for the attention system,

• A formalization for allowing the computation of depth from Time to Contact (for approaching and departing objects) with sensor data that is usually accessible in today's vehicles,

• Detection of ego-moved objects is improved with early incorporation of top-down knowledge that prevents the false detection of known static scene elements (e.g., road, sky, parking vehicles),

• Robust suppression of the horizon edge based on highly selective attention features.
The large number of innovative and robust feature maps described here is the basis
for numerous system-related novelties described in the remaining Chapters. For example,
said feature maps are weighted and combined in the attention system that is described in
Chapter 3. Thereby, the saliency map (i.e., the output of the attention system) can be
used to actively search for objects dependent on the current system task. Based on that, a
task-dependent decomposition of the input image is possible that increases the relevance of
input data to higher system layers. Hence, the biologically motivated attention approach
described in the following is one of the key aspects of the thesis at hand.
3 Task-dependent Tunable Visual Attention
Facilities for controlling and managing traffic are always visually conspicuous. For example,
lane markers are white on a typically dark road, and traffic signs or traffic lights have
bright colors. Accordingly, in many countries flashy advertisement is prohibited in
the proximity of roads. These examples exploit a key aspect of human visual processing
- the principle of early selection. Since vision is the human sensory modality with the
highest information density, this principle significantly accelerates the processing of vision
data. More specifically, the abundance of visual stimuli in the world is prefiltered or
preselected early to match the restricted cognitive capacity of the human brain. In plain
words, the principle of early selection suppresses sensor data that is not relevant to the
current needs or goals of the system, causing a colorful, bright traffic sign to visually pop
out in a traffic scenario. For realizing this early selection principle, humans rely on the
so-called attention mechanism, which preselects the scene elements.
More specifically, the human vision system filters the high abundance of environmental
information by attending to scene elements that either pop out in the scene (i.e., ob-
jects that are visually conspicuous) or match the current task best (i.e., objects that are
compliant to the current internal state or need/task of the system), while suppressing
the rest. For both attention guiding principles psychophysical and neurological evidence
exists (see [Corbetta and Shulman, 2002, Egeth and Yantis, 1997]). Following this prin-
ciple, technical vision systems have been developed that prefilter a scene by decomposing
it into its features (see [Wolfe and Horowitz, 2004]) and recombining these to a saliency
map that contains high activation at regions that differ strongly from the surrounding (i.e.,
bottom-up (BU) attention, see [Koch and Ullman, 1985]). More recent system implementa-
tions additionally include the modulatory influence of task relevance into the saliency (i.e.,
top-down (TD) attention, see [Tsotsos et al., 1995] as one of the first and [Frintrop, 2006,
Navalpakkam and Itti, 2005] as the most recent and probably most influential approaches).
In these systems, instead of scanning the whole scene in search for certain objects in a brute
force way, the use of TD attention allows a full scene decomposition despite restraints in
computational resources. In principle, the vision input data is serialized with respect to
its importance for the current task. Based on this, computationally demanding processing
stages located higher in the architecture work on prefiltered data of improved relevance,
which saves computation time and allows complex real-time vision applications.
During the vision system design we aimed at a computationally efficient system
implementation for online use in vehicles. The overall system should be flexible, meaning that
a new system task should not lead to the necessity of realizing new modules or a structural
redesign of the whole system. Getting our inspiration from biology we therefore aimed
at a system that exhibits specific properties without being specifically designed for these
properties (e.g., our system is able to locate the horizon edge or detect fast moving objects
or red traffic signs without being explicitly designed for these tasks).
The design goals of our TD attention sub-system comprised the development of an
object and task-specific tunable saliency map suitable for the real-world scenarios in the
car domain.
However, the robustness of biological attention systems is difficult to achieve, given e.g.,
the high variability of scene content, changes in illumination, and scene dynamics. Most
computational attention models do not show real-time capability and are mainly tested
in a controlled indoor environment on artificial scenes. Important aspects discriminating
real-world scenes from indoor and artificial scenes are the dynamics in the environment
(e.g., changing lighting and weather conditions, dynamic scene content) as well as the high
scene complexity (e.g., cluttered scenes). Dealing with such scenarios requires a strong
system adaptation capability with respect to changes in the environment. Here, we focus
on five conceptual issues crucial for closing the gap between artificial and natural attention
systems operating on real-world scenes. We show the feasibility of our approach on vision
data from the car domain. The described TD tunable attention system is used as front-end
of the vision system of an advanced driver assistance system (ADAS) described in Chapter
5, whose architecture is inspired by the human brain.
After elaborating on related approaches in Section 3.1, Section 3.2 will describe specific
challenges for an attention system under real-world conditions. Section 3.3 will describe
our attention sub-system in detail pointing out the solutions to the denoted challenges.
Taking up these challenges, Section 3.4 compares the proposed attention system on a functional
level to two other, influential attention approaches from literature. Section 3.5 underlines
the potential of the described solutions based on results calculated on different real-world
scenes, after which Chapter 3 is summarized.
3.1 Related Work
In the past, the human vision system has been examined in a large number of studies.
For example, the psychophysical experiments of [Simons and Chabris, 1995] impressively
showed that the task has a modulating effect on attention. The gathered results were for-
malized in the concept of inattentional blindness. In their experiments participants did not
notice unexpected events (like a black gorilla walking through an indoor scene) when the
task (counting ball contacts of a white basketball team) involved features complementary
to the unexpected events (see Fig. 3.1).
Related to the vehicle domain the task-dependent nature of gazing has also been proven
while steering a car. Recently, it was shown in [Most and Astur, 2007] that the performance
for dangerous situation detection (a colored motorcycle veering into the vehicle’s path)
strongly depends on the feature-match between the current distracting visual task and the
unexpected obstacle. In another example, the gaze of drivers in a virtual environment was
examined [Shinoda et al., 2001]. The results show that the performance in detecting stop
Figure 3.1: Psychophysical study conducted by [Simons and Chabris, 1995] marking the hu-
man visual attention as strong mediator between the world and our perception of the world.
signs is heavily modulated by context (i.e., top-down) factors and not only by bottom-
up visual saliency. Endowing a vision architecture for an intelligent car with similar,
task-based attention can result in a gain of performance with minimal additional resource
requirements (see Sect. 3.5).
In most research on human visual attention the focus is on the bottom-up detection
of salient features/objects in a scene (for a review of biologically evident attention fea-
tures see [Wolfe and Horowitz, 2004]). A well-known computational model for saliency
calculation is the approach by [Itti et al., 1998] that is used in a number of implemented
systems. Recently, this approach has been extended by various researchers to account for
task-dependent aspects of visual attention (see, e.g., [Frintrop et al., 2005, Goerick et al.,
2005, Hawes and Wyatt, 2006]) by applying dynamic weights to different processing stages.
The tasks are often to find a specific object within a predominantly static indoor scene.
A more complete view on a possible architecture for a visual system incorporating task-
dependent visual attention is given by [Navalpakkam and Itti, 2005, 2006]. The proposed
architecture combines top-down (TD) and bottom-up (BU) influences by using TD weights
on the calculated BU features. However, there is no separation between the untuned BU
saliency map and the calculated TD saliency maps allowing a weighted combination, which
would ensure the preservation of BU influence in all system states. The system is evaluated
mainly on static indoor scenes and a few static outdoor scenes. Furthermore, there are only
few attention-based vision systems that use a motion feature (see [Backer and Mertsching,
2000, Tsotsos et al., 2004]). Given the importance of motion in the human visual percep-
tion, we see modeling the influence of scene dynamics on attention as a key issue to realize
robust human-like vision systems.
In Sect. 3.4, we chose the two related top-down attention systems of
[Navalpakkam and Itti, 2005] and [Frintrop, 2006] for a detailed structural and functional
comparison, since these impacted our work most.
However, numerous other psychophysical and computational attention models ex-
ist (please refer to [Frintrop, 2006, Frintrop et al., 2009, Heinke and Humphreys, 2005,
Itti et al., 2005] for a comprehensive overview of the latest developments in attention re-
search and [Findlay and Gilchrist, 2003] for an overview of related psychophysical studies).
Turning to the domain of vision systems developed for ADAS, there have been few at-
tempts to incorporate aspects of the human visual system into complete systems. With
respect to attention processing, a saliency-based traffic sign detection and recognition sys-
tem was demonstrated in [Ouerhani, 2003]. In terms of complete vision systems, one of the
most prominent examples is a system developed in the group of E. Dickmanns [Dickmanns,
2004]. It uses several active cameras mimicking the active nature of gaze control in the
human visual system. However, the processing framework is not closely related to the
human visual system. Without a tunable attention system and with TD aspects that are
limited to a number of object-specific approaches for classification, no dynamic preselection
of image regions is performed. A more biologically inspired approach has been presented
by Farber [Farber, 2005]. However, this publication as well as the recently started German
Transregional Collaborative Research Centre “Cognitive Automobiles” [Stiller et al.,
2007] addresses mainly human-inspired behavior planning, whereas our work focuses more on
task-dependent perception aspects.
The only other known vision system approach that attempts to explicitly model aspects
of the human visual system is described by [Matzka et al., 2008]. The system is somewhat
related to the ADAS presented here. However, published after our work (see, e.g.,
[Michalke et al., 2007]), the approach allows for a simple attention-based decomposition of
road scenes but without incorporating object knowledge or context information. Addition-
ally, the overall system organization is not biologically inspired and hence shows limitations
in its flexibility.
In contrast to the ADAS presented here, a tendency of most large-scale research projects
like, e.g., the European PreVENT project [WWW, 2006] is the decomposition of the overall
functionality into many building blocks and combining these blocks into subsets for solving
isolated tasks. While this ’divide and conquer’ approach does lead to impressive results
in specific settings, we believe the challenge of integrating all these functionalities into a
coherently working flexible system is not yet solved.
3.2 Real-World Challenges for Top-Down Attention Systems

In the following, we describe the challenges a TD attention system faces when used on
real-world images.
(1) High feature selectivity: In order to yield high hit rates in TD search, an attention
system needs high feature selectivity, i.e., as many supporting and inhibiting feature
maps as possible. For this, the used features must be selected and parameterized
appropriately. Even more important for high selectivity is the use of modulatory TD weights
on all sub-feature maps and scales. Many TD attention approaches allow TD weighting
only on a high integration level (e.g., no weighting on scale level [Frintrop et al., 2005])
or without using the full potential of features (e.g., no on-off/off-on feature separation
[Navalpakkam and Itti, 2005]) which leads to a performance loss. Our system fulfills both
aspects. Based on the extended selectivity of our attention sub-system, we can handle
specific challenges of the car domain, as dealing with the horizon edge present in most
images.
(2) Comparable TD and BU saliency maps: Typically, the TD and BU saliency maps
are combined to an overall saliency, on which the Focus of Attention (FoA) is calculated.
The combination requires comparable TD and BU saliency maps, making a normalization
necessary. Humans face the same challenge when elements popping out compete with
task-relevant scene elements for attention. A prominent procedure in the literature normalizes
each feature map to its current maximum (see [Navalpakkam and Itti, 2005], which is based
on [Itti et al., 1998]); this has some drawbacks that our approach avoids.
(3) Comparability of modalities: Similarly, the combination of different, a priori
incomparable modalities must be achieved (e.g., deciding on the relative importance of edges
versus color). We realize this by the biological principle of homeostasis, which we define as
the reversible adaptation of essential processes of a (biological) system to the environment
(see e.g., [Hardy, 1983]).
(4) Support of conjunctions of weak object features in the TD path: Another
important robustness aspect is the support of conjunctions of weak object features in the
TD path of the attention sub-system. That is, an object having a number of mediocre
feature activations, but no feature map popping out, should still yield a clear maximum
when combined on the overall saliency.
(5) Changing lighting conditions: In a real-world scene, changing lighting conditions
heavily influence the features the saliency map is composed of, and hence the performance of
the attention system suffers. As the calculated TD weights are based on the features of training
images (see Sect. 3.3), the TD weights are illumination-dependent as well. Put differently,
the TD weights are optimal for the specific illumination, and thereby for the contrast,
present in the training images. The usage of TD weights on test images with a differing
illumination will lead to an inferior TD search performance. Instead of adapting the TD
weights dependent on the illumination, a local exposure control is proposed in order to
adjust the contrast of the training images as well as the test images before applying TD
weight calculation and TD search.
3.3 Modeling Attention: From a Robustness Point of View
The organization of Sect. 3.3 follows the consecutive processing steps of the current
ADAS attention sub-system as depicted in Fig. 3.2. After a short description of the
general purpose of the BU and TD pathways, their combination to the overall saliency
is described. Following this overview, the used modalities (feature types) are specified
followed by the entropy measure that is used for the camera exposure control. Next, the
different steps of the feature postprocessing are described. The TD feature weighting, the
homeostasis process to get the conspicuity maps (i.e., modalities) comparable, as well as
the final BU/TD saliency normalization are the final processing steps in our attention
architecture.
The attention system consists of a BU and a TD pathway. The TD pathway (red
enclosed region in Fig. 3.2) allows an object- and task-dependent filtering of the input
data. All image regions containing features that match the current system task well are
supported (excitation), while the others are suppressed (inhibition), resulting in a sparse
task-dependent scene representation. Opposed to that, the BU pathway (blue enclosed
region in Fig. 3.2) supports an object- and task-unspecific filtering of the input data,
supporting scene elements that differ from their surroundings. The BU pathway is important
for a task-unspecific analysis of the scene, supporting task-unrelated but salient scene elements.

Figure 3.2: Visual attention sub-system (dashed lines correspond to TD links). [Block diagram: the RGB input image is decomposed on image pyramids of 5 scales (starting from 256x256) into the modalities motion, DoG intensity (on-off/off-on), odd and even Gabor (on-off/off-on), RGBY colors, and RG/BY double color opponency; an entropy measure drives a recurrent cycle that controls the camera exposure time; BU and TD feature weights, sparseness weights, and conspicuity weights (adapted by homeostasis) combine the maps into conspicuity maps and, after normalization with w_norm^BU and w_norm^TD, into the overall saliency via the parameter λ.]

The BU and TD saliency maps are linearly combined to an overall saliency map. This
map is used to generate FoAs that represent the scene elements higher system layers work
on. The combination is realized using the parameter λ (on the right hand side in Fig. 3.2)
that is set dependent on the system state, emphasizing the BU and/or TD influence (see
Equ. (3.9)). Due to this combination the system also detects scene elements that do
not match the current TD system task. By ensuring a certain BU influence, such scene
elements are not suppressed, which would otherwise lead to the so-called inattentional
blindness phenomenon (i.e., the complete perceptual suppression of scene elements as described
in [Simons and Chabris, 1995]).
Turning to the processing details, the following modalities are calculated on the captured
color images: RGBY colors (inspired by [Frintrop, 2006]), intensity by a Difference of
Gaussian (DoG) kernel, oriented lines and edges by a Gabor kernel, motion by differential
images (see Chapter 2 for details on the used feature maps), and entropy using the structure
tensor.
In the following, the modalities are briefly recapitulated, after which the entropy measure
used to set the camera exposure is specified. The features motion
and color are used differently for the BU and TD path. The BU path uses double color
opponency from RGBY colors by applying a DoG on 5 scales on the RG and BY color
opponent maps. The filter results are separated into their positive and negative parts
(on-off/off-on separation, whose importance is emphasized in [Frintrop, 2006]) leading to 4
pyramids of double color opponent RG,GR,BY and YB-maps. The TD path uses the same
color feature but additionally 4 pyramids of the absolute RGBY maps. Absolute RGBY
colors do not support the BU pop-out character and are hence not used in the BU path. A
DoG filter bank is applied on 5 scales separating on-off and off-on effects. Furthermore, a
Gabor filter bank on 4 orientations (0, π/4, π/2, 3π/4) and 5 scales is calculated separately
for lines and edges (even and odd Gabor). The realized Gabor filter bank ensures disjoint
decomposition of the input image. The detailed mathematical formulation of the used
Gabor filter bank can be found in Chapter 2. Motivated from DoG the concept of on-
off/off-on separation is transferred to Gabor allowing e.g., the crisp separation of the sky
edge or street markings from shadows on the street. Motion from differential images on
5 scales is used in the BU path alone. Since this simple motion concept cannot separate
static objects from self-moving objects, it is not helpful in TD search. The entropy T is
based on the absolute gradient strength of the structure tensor A on the image Igray:
T = \frac{\det(A)}{\mathrm{trace}(A)}. \qquad (3.1)
The matrix A is calculated using the derivatives of Gaussian filters G_u and G_v and a
rectangular filter of size W:

A = \begin{bmatrix} \Sigma_W (G_u * I_{gray})^2 & \Sigma_W (G_u * I_{gray})(G_v * I_{gray}) \\ \Sigma_W (G_v * I_{gray})(G_u * I_{gray}) & \Sigma_W (G_v * I_{gray})^2 \end{bmatrix} \qquad (3.2)

G_u(u, v) = -\frac{u}{2\pi\sigma^4} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right), \quad G_v(u, v) = -\frac{v}{2\pi\sigma^4} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right).
We use the entropy as a means to adapt the camera exposure and not as a feature.
The local exposure control works on the accumulated activation T_sum = Σ_RoI T on an
image region of interest (RoI) (e.g., coming from the appearance-based object tracker that
is part of our ADAS; for details see [Michalke et al., 2007]). Here we get inspiration from
the human local contrast normalization. The exposure time is recursively modified in
search of a maximum of T_sum, which maximizes the contrast on the defined image regions.
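As a rough sketch of this measure, the following Python code approximates Equ. (3.1)/(3.2) with scipy's derivative-of-Gaussian and rectangular filters; the kernel sizes and the RoI handling are assumptions, not the exact parameterization of the thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def entropy_T(I_gray, sigma=1.5, W=5):
    """Structure-tensor measure T = det(A)/trace(A) of Equ. (3.1)."""
    Gu = gaussian_filter(I_gray, sigma, order=(0, 1))  # derivative along u
    Gv = gaussian_filter(I_gray, sigma, order=(1, 0))  # derivative along v
    # rectangular filter of size W sums the tensor entries (Equ. (3.2))
    Auu = uniform_filter(Gu * Gu, W)
    Avv = uniform_filter(Gv * Gv, W)
    Auv = uniform_filter(Gu * Gv, W)
    det = Auu * Avv - Auv ** 2
    trace = Auu + Avv
    return det / np.maximum(trace, 1e-12)

def exposure_score(I_gray, roi_mask):
    """Accumulated activation T_sum on a region of interest; the exposure
    time is varied recursively in search of a maximum of this score."""
    return entropy_T(I_gray)[roi_mask].sum()
```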
As described in Chapter 2, the system disposes of 136 independently weighable sub-feature
maps.
Following the calculation of the raw features, a postprocessing step on all sub-feature
maps is performed (see Fig. 3.3). The feature postprocessing consists of 5 steps. First,
all sub-features are normalized to the maximal value that can be expected for the specific
sub-feature map (not the current maximum on the map). For example, for DoG and
Gabor this is done by determining the filter response for the ideal input pattern, i.e., the
maximum possible filter response.

Figure 3.3: Postprocessing of feature maps in BU and TD path. [Block diagram: min-max normalization to the expected sub-feature maximum, squaring (signal power), nonlinear noise suppression (parameter K_supp, adapted via a TD link to the BU saliency normalization), bilinear resize to 256x256, and, in the BU path only, multiplication with the sparseness weight w_i^sparse.]

The ideal input pattern is generated by setting all pixels
to 1 whose matching pixel positions in the filter kernel are bigger than 0. Figure 3.4 shows
the resulting ideal DoG and 0◦ even Gabor input patterns that are derived from the given
filter kernels. This procedure ensures comparability between sub-features of one modality
(e.g., all sub-feature maps of the motion modality).

Figure 3.4: Input patterns that maximize the filter response. The maximum of this filter response is used for sub-feature normalization: (a) Ideal DoG input pattern, (b) Ideal 0° even Gabor input pattern.

Next, the signal power is calculated by squaring and a dynamic neuronal suppression using a sigmoid function, which is applied
for noise suppression. A parameter K_supp shifts the sigmoid function horizontally, which
influences the degree of noise suppression and thereby the sparseness of the resulting
sub-feature maps. After a bilinear resize to the resolution 256x256, which allows a later feature
combination, the BU feature postprocessing multiplies a sparseness weight w_i^sparse that
ensures pop-out by boosting sub-feature maps with sparse activation:

w_i^{sparse} = \sqrt{\frac{2^s}{\sum_{\forall u,v:\, F_{i,k}(u,v) > \xi} F_{i,k}(u,v)}} \quad \text{for } s = [0, 4] \text{ and } \xi = 0.9 \cdot \mathrm{Max}(F_{i,k}). \qquad (3.3)
The sparseness operator is not used in the TD path (see red enclosed region in Fig. 3.3)
in order to prevent the suppression of weak object features.
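A compact sketch of the five postprocessing steps for one sub-feature map follows. The exact sigmoid shape and the omitted resize are assumptions; F_ideal_max denotes the filter response to the ideal input pattern of Fig. 3.4.

```python
import numpy as np

def postprocess(F, F_ideal_max, K_supp, s, bu_path=True):
    """Postprocessing of one sub-feature map (Fig. 3.3)."""
    F = F / F_ideal_max                      # 1) normalize to expected maximum
    F = F ** 2                               # 2) signal power (squaring)
    F = 1.0 / (1.0 + np.exp(-(F - K_supp)))  # 3) sigmoid noise suppression,
                                             #    shifted horizontally by K_supp
    # 4) bilinear resize to 256x256 omitted in this sketch
    if bu_path:                              # 5) sparseness weight, BU path only
        xi = 0.9 * F.max()
        w_sparse = np.sqrt(2.0 ** s / F[F > xi].sum())  # Equ. (3.3)
        F = w_sparse * F
    return F
```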
Later in the TD path a weighting on all 136 sub-feature maps takes place to realize
inhibition and excitation. The TD-related tuning on the feature level is motivated from
58
3 Task-dependent Tunable Visual Attention
the fact that neurobiological studies have shown that attentional influences are present
very early in the human visual pathway (see [Treue, 2003]). Furthermore, neurophysiological
studies on monkeys have shown that attention-based modulation of neuronal activities leads
to an increase in activity in case the bias matches the preferences of cell populations. But
also a suppression of neuron populations can be encountered in case the attended features
do not match the preferences of the cells (see [Treue, 2003]). These measurements motivate
the usage of supporting (excitation) and suppressing (inhibition) feature TD weights as
realized in the presented attention system.
The TD weights w_i^{TD} are calculated in an offline step (inspired by [Frintrop et al., 2005]
but extended):

w_i^{TD} = \begin{cases} \mathrm{SNR}_i & \forall\, \mathrm{SNR}_i \ge 1 \\ -\frac{1}{\mathrm{SNR}_i} & \forall\, \mathrm{SNR}_i < 1 \end{cases} \quad \text{with} \quad \mathrm{SNR}_i = \frac{\frac{1}{N_{i,obj}} \sum_{\forall u,v:\, F^{TD}_{i,obj}(u,v) > \phi} F^{TD}_{i,obj}(u,v)}{\frac{1}{N_{i,surr}} \sum_{\forall u,v:\, F^{TD}_{i,surr}(u,v) > \phi} F^{TD}_{i,surr}(u,v)}. \qquad (3.4)
The average activation in the object region is related to the average activation in the
surround on each feature map F_i^{TD}, taking only the N_i pixels above the threshold
φ = K_conj · Max(F_i^{TD}) with K_conj ∈ (0, 1] into account.

As opposed to the weighting scheme proposed by [Frintrop et al., 2005], in the approach
presented here the threshold φ assures that numerous small values on a feature map
do not even out rarely present large ones. The proposed threshold φ improves the TD search
performance, since large values influence the TD search performance overproportionally. In
the BU path only excitation (w_i^{BU} ≥ 0) takes place, since without object or task knowledge
nothing can be inhibited in BU. For a more detailed discussion of feature map weighting
see [Frintrop, 2006, Michalke et al., 2007].
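A minimal sketch of the weight calculation of Equ. (3.4), taking the complement of the object region as surround (a simplification of the thesis' surround definition):

```python
import numpy as np

def td_weight(F, obj_mask, K_conj=0.5):
    """Excitatory/inhibitory TD weight for one sub-feature map F."""
    phi = K_conj * F.max()              # threshold phi = K_conj * Max(F)
    def mean_above(region):             # average of the pixels above phi
        vals = F[region & (F > phi)]
        return vals.mean() if vals.size else 1e-12
    snr = mean_above(obj_mask) / mean_above(~obj_mask)
    return snr if snr >= 1.0 else -1.0 / snr
```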
As visualized in Fig. 3.2, the j = 1..M conspicuity maps C_j^{BU} and C_j^{TD} result from a
weighted combination of the N_j BU and TD sub-feature maps within a certain feature
type j:

C_j^{BU} = \sum_{i=1}^{N_j} w_{i,j}^{BU} F_{i,j}^{BU} \qquad (3.5)

C_j^{TD} = \sum_{i=1}^{N_j} w_{i,j}^{TD} F_{i,j}^{TD}. \qquad (3.6)
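In code, the combination of Equ. (3.5)/(3.6) is a plain weighted sum over the sub-feature maps of one modality (a sketch; all map shapes are assumed equal after the resize step):

```python
import numpy as np

def conspicuity_map(sub_feature_maps, weights):
    """Weighted combination of the N_j sub-feature maps of modality j."""
    C = np.zeros_like(sub_feature_maps[0])
    for F, w in zip(sub_feature_maps, weights):
        C += w * F
    return C
```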
The sub-feature normalization procedure ensures intra-feature comparability, but for the
overall combination, comparability between modalities (i.e., conspicuity maps) is required
as well. We solve the normalization problem of the conspicuity maps by dynamically
adapting the conspicuity weights w_{C_j} for weighting the BU and TD conspicuity maps
C_j^{BU} and C_j^{TD}. This concept mimics the homeostasis process in biological systems (see
e.g., [Hardy, 1983]), which we understand as the property of a biological system to regulate
its internal processes in order to broaden the range of environmental conditions in which
the system is able to survive. More specifically, the w_{C_j}(t) are set to equalize the activation
on all j = 1..M BU conspicuity maps, taking only the N_j pixels over the threshold
ξ = 0.9 · Max(C_j^{BU}) into account:

w_{C_j}(t) = \frac{1}{\frac{1}{N_j} \sum_{\forall u,v \text{ with } C_j^{BU}(u,v) > \xi} C_j^{BU}(u, v)} \quad \text{and } \xi = 0.9 \cdot \mathrm{Max}(C_j^{BU}). \qquad (3.7)
Exponential smoothing is used to fuse the old conspicuity weights w_{C_j}(t-1) with the
newly optimized ones \hat{w}_{C_j}(t) from Equ. (3.7):

w_{C_j}(t) = \alpha \hat{w}_{C_j}(t) + (1 - \alpha) w_{C_j}(t-1) \quad \text{for } j = 1..M. \qquad (3.8)
The parameter α sets the velocity of the adaptation and could be adapted online depen-
dent on the gist (i.e., basic environmental situation) via a TD link. In case of fast changes
in the environment α could be set high for a brief interval, e.g., while passing a tunnel or
low in case the car stops. Additionally, we use thresholds for all M conspicuity maps based
on a sigma interval of recorded scene statistics to avoid complete adaptation to extreme
environmental situations.
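The adaptation loop of Equ. (3.7)/(3.8) can be sketched as follows; the guard against empty or all-zero maps is an addition for numerical safety only.

```python
import numpy as np

def update_conspicuity_weights(C_bu_maps, w_prev, alpha=0.05):
    """Homeostasis: equalize the mean activation of the strongest pixels
    across the M BU conspicuity maps (Equ. (3.7)) and fuse the result
    with the previous weights by exponential smoothing (Equ. (3.8))."""
    w_new = []
    for C in C_bu_maps:                 # one map per modality j = 1..M
        xi = 0.9 * C.max()
        strong = C[C > xi]              # the N_j pixels above the threshold
        mean_act = strong.mean() if strong.size else 1e-12
        w_new.append(1.0 / max(mean_act, 1e-12))
    w_new = np.asarray(w_new)
    return alpha * w_new + (1.0 - alpha) * w_prev
```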
Before combining the BU and TD saliency maps using the parameter λ (see Equ. (3.9)
and Fig. 3.2), a final normalization step takes place. Like the sub-feature and conspicuity
maps, the saliency maps are normalized to the maximum expected value:

S_{total} = \lambda S^{TD} + (1 - \lambda) S^{BU} \qquad (3.9)
For this we have to step back through the attention sub-system, taking into account
all weights (w_i^{sparse}, w_i^{BU}, w_i^{TD}) and the internal disjointness/conjointness of the features,
to determine the highest value (v_{max,j}^{BU} and v_{max,j}^{TD}) a single pixel can achieve in each BU
and TD conspicuity map j. We define a feature as internally disjoint (conjoint) when
the input image is decomposed without (with) redundancy in the sub-feature space. In
other words, the recombination of disjoint (conjoint) sub-feature maps of adjacent scales or
orientations is equal to (bigger than) the decomposed input image. Since DoG and Gabor
are designed to be internally disjoint between scales and orientations (see Chapter 2), the
maximum pixel value on a conspicuity map j is equal to the maximum of the product of
all sub-feature and/or sparseness weights of the sub-features it is composed of (w_i^{sparse} and
w_i^{BU} for BU as well as w_i^{TD} for TD). Motion is conjoint between scales; therefore we sum
up the product of all sub-feature motion weights w_i^{BU} and their corresponding w_i^{sparse} to
get the maximally expected value on the motion conspicuity map. The contribution of the
color feature to the saliency normalization weight is similar but more complex.
Since, apart from DoG and Gabor, there is disjointness between the conspicuity maps, the
maximum possible pixel values for all BU and TD conspicuity maps, calculated as described
above, are multiplied with the corresponding w_{C_j} and added to obtain the normalization
weights w_{norm}^{TD} and w_{norm}^{BU} for the TD and BU attention (please also refer to
Fig. 3.2 for the position where the normalization weights are applied):
w_{norm}^{BU} = \frac{1}{\sum_{j=1}^{M} k_j\, w_{C_j}\, v_{max,j}^{BU}} \qquad (3.10)

w_{norm}^{TD} = \frac{1}{\sum_{j=1}^{M} k_j\, w_{C_j}\, v_{max,j}^{TD}} \qquad (3.11)

with

k_j = \begin{cases} 0.5 & \text{for } j \in \{\text{DoG}, \text{Gabor}\} \\ 1 & \text{for } j \notin \{\text{DoG}, \text{Gabor}\}. \end{cases}
Using this approach, w_{norm}^{TD} will adapt when the TD weight set changes.
It is important to note that DoG and Gabor features are conjoint, meaning that they
represent the same signal characteristics. Put differently, the conspicuity maps for DoG and
Gabor are not independent. As discussed in Chapter 2, using both DoG and Gabor is still
helpful, since the signal decomposition is different for both filter types. The conjointness
is taken into account in the attention normalization procedure in Equ. (3.10) and (3.11)
in the form of the factor k_j that decreases the integral influence of DoG and Gabor on the
overall attention.
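Putting Equ. (3.9)-(3.11) together, the final combination can be sketched as below; v_max_td/v_max_bu are the per-modality maximum pixel values derived above, kj is the DoG/Gabor conjointness factor, and all argument names are illustrative.

```python
import numpy as np

def overall_saliency(S_td, S_bu, w_c, v_max_td, v_max_bu, kj, lam):
    """Normalize the TD and BU saliency maps to their maximally
    achievable pixel values (Equ. (3.10)/(3.11)) and combine them
    with the parameter lambda (Equ. (3.9))."""
    w_norm_td = 1.0 / np.sum(kj * w_c * v_max_td)
    w_norm_bu = 1.0 / np.sum(kj * w_c * v_max_bu)
    return lam * (w_norm_td * S_td) + (1.0 - lam) * (w_norm_bu * S_bu)
```

Since w_norm^TD depends on the current TD weight set, it has to be recomputed whenever the search object changes.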
3.4 Functional Comparison to Other Top-Down Attention Models
Given the abundance of computational attention models (see [Heinke and Humphreys,
2005] for a review) we selected the two related approaches of [Navalpakkam and Itti, 2005]
and [Frintrop, 2006] for a detailed structural comparison, since these impacted our work
most. Then, we summarize what makes our approach particularly appropriate for the
real-world car domain.
The system of [Navalpakkam and Itti, 2005] is based on the BU attention model
Neuromorphic Vision Toolkit (NVT) [Itti et al., 1998] but adds TD to the system. Each
feature map is normalized to its current maximum, resulting in a loss of information about
the absolute level of activity and a boosting of noise in case the activation is low. Given such
a normalization procedure and the object dependence of the TD weights, the BU and TD
saliency maps are not comparable, since the relative influence of the TD map varies when
the TD weight set is changed. Additionally, the BU and TD saliency maps are not weighted
separately for combination. As features a speed-optimized RGBY (leading to an inferior
separability performance), a DoG intensity feature and Gabor filter on 4 orientations (both
without on-off/off-on or line/edge separation) are used on 6 scales starting at a resolution
of 640x480. The system uses TD weights on all sub-feature maps resulting in 42 weights
that allow reasonable selectivity. A DoG-based normalization operator (see [Itti et al.,
1998]) is applied for pop-out support and to diminish the noise resulting from the used
feature normalization. However, the absolute map activation and therefore comparability
is lost.
The system of [Frintrop, 2006] integrates BU and TD attention and is real-time capable
(see [Frintrop et al., 2007]). It was evaluated mainly on indoor scenes. The system normal-
izes the features to their current maximum, resulting in the same problems as described
above. The BU and TD saliency maps are weighted separately for combination. Following
the argumentation above the used normalization makes these combination weights depen-
dent on the used TD weight set and thereby object-dependent. As features the system uses
double color opponency based on an efficient RGBY color space implementation, a DoG
intensity feature (with on-off/off-on separation), and a Gabor with 4 orientations starting
from 300x300 resolution. A total of 13 TD weights are used on feature (integrated over all
scales) and conspicuity maps. For pop-out support a uniqueness operator is used.
The most important differences between the systems are the following: We obtain high selectivity
by decomposing the DoG (on-off/off-on separation) and Gabor (on-off/off-on separation,
lines and edges) features without increasing the calculation time. Furthermore, the usage
of TD weights on all sub-feature maps and scales results in 136 independent tunable feature
weights that increase the selectivity. The resulting scale variance of the TD weights is not
a crucial issue in the car domain. The RGBY is used as color and double color opponency.
In contrast to [Frintrop, 2006, Navalpakkam and Itti, 2005], we use motion to support
scene dynamics. All sub-feature maps and the BU and TD saliency maps are normalized
without losing information or boosting noise, thereby preventing false-
positive detections. Comparability of modalities is assured via homeostasis. The attention
sub-system works on 5 scales starting at a resolution of 256x256. Experiments have shown
that in the car domain bigger image sizes do not improve the attention system performance.
Our system supports conjunction of weak features since the sparseness operator is not
used in the TD path. Illumination invariance is reached by image region-specific exposure
control that is coupled tightly to the system.
3.5 Experiments and Results
In the following, we evaluate the system properties related to the challenges described in
Sect. 3.2. All results are calculated on five real-world data sets (cars, reflection poles,
construction site, inner-city stream, toys in an indoor scene) accessible on the internet (see
[BenchmarkData, 2008a]).
(1) High feature selectivity: In the car domain the search performance is strongly
influenced by the horizon edge present in most images of highways and country roads.
This serves as an example problem for showing the importance of high feature selectivity.
Typically, the horizon edge is removed by masking out the sky in the input image, which
might not be biologically plausible. Based on the high selectivity of the attention features,
we instead suppress the horizon edge directly in the saliency by weighting the sub-feature
maps. The gain of this approach is depicted in Fig. 3.5, which shows the diminished influence
of the horizon edge on the (TD modified) BU saliency of the real-world example in Fig. 3.6b.
For evaluation the average FoA hit number (Hit) and the average detection rate (DRate) were
calculated. While DRate is the ratio of the number of found task-relevant objects to the
overall number of task-relevant objects, Hit states that the object was found on average
with the Hit'th generated FoA. Hence, the smaller Hit, the earlier an object is detected
(see [Frintrop, 2006] for a more detailed definition of these measures). Table 3.1 shows the
significant performance gain of attentional sky suppression versus no horizon edge handling
and masking of the sky based on these measures.
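For illustration, the two measures can be computed from per-image FoA sequences as sketched below, assuming for simplicity one task-relevant object per test image (the thesis counts all task-relevant objects):

```python
def hit_and_drate(foa_hits):
    """foa_hits: per test image, a list of booleans in FoA generation
    order, True where the FoA lies on the searched object."""
    ranks, found = [], 0
    for seq in foa_hits:
        for rank, on_target in enumerate(seq, start=1):
            if on_target:              # object found with the rank'th FoA
                ranks.append(rank)
                found += 1
                break
    hit = sum(ranks) / len(ranks) if ranks else float("nan")
    drate = found / len(foa_hits)
    return hit, drate
```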
Figure 3.5: Evaluation of selectivity (based on the input image depicted in Fig. 3.6b): (a) Original BU saliency, (b) Modified BU saliency with attentional sky suppression (TD influence), using suppressive odd Gabor filter kernels (e.g., 0° off-on, 45° off-on, 135° on-off) in low scales, (c) BU saliency, masked sky (standard method).
Table 3.1: Benefit of attentional sky suppression on real-world data.

Search target   # test images   a) original BU   b) attentional sky supp.   c) sky masked
                                Hit (DRate)      Hit (DRate)                Hit (DRate)
Cars            54              3.06 (56.3%)     2.19 (71.4%)               2.47 (71.4%)
(2) Comparable TD and BU saliency maps: The used feature normalization prevents
noise on the saliency map and ensures the preservation of the absolute level of
feature activation. Using a TD weight set that supports certain object-specific features,
our normalization hence ensures that the TD map will show high activation if and only if
the searched object is really present. Figure 3.6f shows that the maximum saliency value
on the TD saliency map for cars rises when the car comes into view (see [BenchmarkData,
2008a] for a downloadable result stream).
The influence of combining the now comparable TD and BU saliency maps for cars and
reflection posts (reflection posts are, e.g., useful for unmarked road detection as done in
[von Trzebiatowski et al., 2004]) as trained search objects is depicted in Tab. 3.2, showing
that TD improves the search performance considerably. It is important to note that besides
an exchange of the training images no modification of the system structure is required
when changing the search object. For evaluation the average FoA hit number (Hit) and
average detection rate (DRate) were calculated. The choice of training images has only
small influence on the search performance, as the comparable results for different sets of
training images in Tab. 3.2 show.
The evaluation shows the best hit numbers and the highest detection rates for pure TD search
(λ = 1). However, it is important to note that pure TD search would lead to a suppression
of unexpected objects (inattentional blindness, see Sect. 3.1) and would hence potentially
cause dangerous situations.
Figure 3.6: Evaluation of normalization: (a), (b) Input images (frame 1 and frame 45), (c) TD saliency tuned to cars (frame 45), (d) TD saliency tuned to signal boards (frame 45), (e) TD saliency tuned to cars on frame 1 (noise, since no car is present), (f) Maximum saliency activation level on the BU, TD car, and TD signal board maps, (g) Dynamically adapted conspicuity weights w_{C_j} (homeostasis) for the M = 7 modalities.
Table 3.2: Linear combination of BU and TD saliency, influence on search performance (λ = 0 equals pure BU and λ = 1 pure TD search); the BU column is independent of the training set.

Target           # Test images   Training set       λ = 0 (BU)     λ = 0.5 (BU & TD)   λ = 1 (TD)
                 (objects)       (# training im)    Hit (DRate)    Hit (DRate)         Hit (DRate)
Cars             54 (58)         self test          3.06 (56.9%)   1.56 (93.1%)        1.53 (100%)
                                 Train. set 1 (3)                  1.87 (89.7%)        1.82 (96.6%)
                                 Train. set 2 (2)                  1.90 (84.5%)        1.76 (93.1%)
                                 Train. set 3 (3)                  1.96 (82.8%)        1.94 (93.1%)
                                 Train. set 4 (3)                  1.84 (86.2%)        1.74 (93.1%)
Reflect. posts   56 (113)        self test          2.97 (33.6%)   1.78 (59.8%)        1.85 (66.3%)
                                 Train. set 1 (6)                  2.10 (51.3%)        2.25 (52.2%)
                                 Train. set 2 (7)                  2.20 (51.3%)        2.28 (51.3%)
                                 Train. set 3 (7)                  2.07 (51.3%)        2.36 (52.2%)
                                 Train. set 4 (5)                  2.10 (51.3%)        2.30 (51.3%)
(3) Comparability of modalities: The used dynamic adaptation of w_{C_j} (homeostasis,
see Equation (3.8)) causes a twofold performance gain. First, the a priori incomparable
modalities can be combined, yielding well balanced BU and TD saliency maps. Second,
the system adapts to the dynamics of the environment, preventing varying modalities from
influencing the system performance (e.g., with this procedure the R color channel will not
be overrepresented in the saliency in the red evening sun). Figure 3.6g depicts the
dynamically adapted w_{C_j}. Table 3.3 shows a noticeable SNR gain on the overall saliency for
26 traffic-relevant objects (e.g., traffic lights, road signs, cars), comparing the dynamically
adapted w_{C_j} vector with a static w_{C_j} vector that was locally optimized on the stream.
Table 3.3: Comparability of modalities via homeostasis.

Traffic-relevant objects   # images (objects)   SNR_obj using static w_{C_j}   SNR_obj using dynamic w_{C_j}
Inner-city stream          20 (26)              2.56                           2.86 (+11.7%)
(4) Support of conjunctions of weak object features in the TD path is assured,
since w_i^sparse is used in BU only. Evaluation on 54 images with cars as TD search object
shows that the average object signal to noise ratio (SNR_obj) on the TD saliency map
(defined as the mean activation in the object versus its surround) decreases by 9% when
w_i^sparse is also used in the TD path. For evaluation we define weak object feature maps as
having the current maximum outside the object region but still having object values of at
least 60% of the maximum within the object. For the used 54 traffic scene images, 11% of
all feature maps are weak. In case weak feature maps are used to optimally support the
TD saliency in an excitatory way, SNR_obj on the TD saliency map increases by 25%. The
results are aggregated in Tab. 3.4. Figure 3.7a shows that the number of excitatory TD
weights w_i^TD decreases the bigger K_conj (see Equ. (3.4)) is. An object-dependent trade-off
exists, since the TD saliency map gets sparser the bigger K_conj is.
Table 3.4: Improvement of SNR due to support of weak feature conjunctions.

TD search target   # test images   SNR_obj with w_i^sparse   SNR_obj without w_i^sparse   SNR_obj with optimal weak feat. excitation
Cars               54              6.72                      7.32 (+9%)                   8.41 (+25%)
(5) Changing lighting conditions: The feature activation of an image region depends
on the illumination. Hence, the TD weight set is only optimal for the lighting conditions
present in the training images, and the TD search performance decreases when the illumination
changes without an adaptation of the camera exposure. It is important to note that in
a real-world scene the optimal exposure in varying illumination is different for each object
(see Fig. 3.7b and c), making the exposure control dependent on the current task of the
system. Evaluation based on a complex indoor test setting, where the illumination could
be controlled, shows that the realized exposure control leads to an illumination invariance
of the TD weight sets (see Tab. 3.5).
Table 3.5: Illumination invariance of TD weight sets due to dedicated exposure control. Target: toys in a complex indoor setup, 20 test images (20 objects), training illumination 75 lx; average hit number (and detection rate [%]), TD search λ = 1.

                  Training illum.   without exposure control    with exposure control
                  75 lx             150 lx        15 lx         150 lx        15 lx
Hit (DRate)       1.95 (100%)       2.74 (95%)    2.83 (30%)    1.80 (100%)   2.0 (100%)
3.6 Summary
Chapter 3 describes a flexible biologically motivated attention system that is used as the
front-end of our ADAS. The 136 feature maps described in Chapter 2 are independently
weighted and combined to the so-called saliency map that is the key aspect and resulting
output of the attention system. The amplitude of the saliency map (i.e., its activation in
neurobiological terms) encodes the conspicuousness of an image region. A high saliency
Figure 3.7: Evaluation of illumination influence: (a) Number of excitatory TD weights depending on the feature preprocessing parameter K_conj, (b) Image regions used for exposure optimization (whole image, lower half, and car), (c) Energy function: accumulated entropy T_sum with object-dependent optima for the regions whole image, lower half, and car.
value can result 1) from an object that visually differs strongly from its surroundings
(sensory-driven or bottom-up attention) or 2) from an object that matches the current
search task (goal-driven or top-down attention). Both the bottom-up (BU) and top-down
(TD) attention weight and combine the implemented features. For both attention types
the feature maps are normalized to their potential maximum (not the current maximum)
in order to assure a comparability of feature maps of the same modality. However, the
feature post-processing for both attention types also differs in certain aspects. For the
BU attention a sparseness weight is applied that boosts feature maps having a strong,
locally restricted maximum. For the TD attention no sparseness weight is applied in
order to assure that feature maps without a clear maximum (so-called weak feature maps)
can contribute to the TD attention. All feature maps are weighted with object-specific
TD weights that are computed based on a weight calculation scheme that evaluates the
feature characteristics of training images showing the searched object. In a nutshell, the
weight calculation scheme boosts feature maps that are typical for the searched object and
suppresses feature maps not compatible with the searched object.
The different sub-feature maps are independently weighted for the BU and TD case and
combined to so-called conspicuity maps, which represent the different modalities of the
system (e.g., colors, motion, lines). It is important to note that the conspicuity maps are
a priori not comparable. Based on the biologically motivated principle of homeostasis the
conspicuity maps are normalized in order to get them comparable before their combination
to the BU respectively TD saliency map. In the last step the BU and TD pathways are
combined, which requires the saliency maps to be comparable. However, the absolute value
of the TD saliency map depends on the TD weight set and thereby on the searched objects.
Therefore, a normalization procedure for the BU and TD saliency maps is introduced that
preserves the information present in the saliency map amplitude.
A last robustness-enhancing approach handles the problem that the TD weight sets are
optimal only for the illumination present in the training images. In case a different lighting
situation is present in the training images than in the test images, the performance
of the TD search suffers. This means that a priori the TD weight sets are not invariant
to illumination changes. Instead of adapting the TD weight sets dependent on the
illumination, an image region-specific exposure control is proposed. When applying said
exposure control, extensive testing showed that the TD search performance and hence the
TD weight sets are independent of illumination changes.
The introduction of several new approaches into the attention system allows for robustly
coping with the real-world requirements of the car domain. More specifically, the following
robustness-related novelties were introduced in Chapter 3:

• An attention system relying on high feature selectivity based on 136 independently tunable feature maps,

• A sub-feature normalization procedure that assures the comparability of BU and TD attention without losing information about the absolute signal amplitude,

• A biologically motivated homeostasis approach for making diverse modalities comparable,

• Support of weak feature conjunctions in TD search mode,

• An image region-specific exposure control that assures the illumination invariance of the TD weight sets.
In the following Chapter 4 a robust approach for unmarked road detection is described
that in combination with the proposed attention system allows building complex driver
assistance systems presented in Chapter 5. Additionally, in Chapter 5 the real-time capa-
bility of our attention system in a real-world test setup will be shown. More specifically,
a test setup will be described, in which our prototype car was reliably able to brake au-
tonomously in an emergency situation (see [Michalke et al., 2007]).
4 Road Detection in Unconstrained Environments
The importance of driver assistance systems for further decreasing the number of traffic
accidents is a widely acknowledged fact. The growing complexity of the tasks these
Advanced Driver Assistance Systems have to handle leads to complex systems that fuse
information from many sensory devices and incorporate the processing results of multiple
other modules. One important field of interest for such systems are applications like, e.g.,
the “Honda Intelligent Driver Support System” [Ikegaya et al., 1998], which supports the driver
in staying in the lane and maintaining a safe distance from the car in front. Other systems focus
on collision avoidance based on autonomous steering and braking (see, e.g., [Schorn et al.,
2006]) as well as path-planning even in unstructured environments (see, e.g., [Dang et al.,
2006]). All these applications need a robust detection of the drivable road area. The more
safety-relevant applications become, the more the required quality of the detected drivable
road area must be improved. As “drivable road area” we define the space in front, which
the car can move on safely in a physical sense, but without taking symbolic information
into account (e.g., one-way-street, traffic signs).
First vision-based approaches for detecting the drivable road area on unmarked streets
were introduced in recent years. Although most of these visual-feature-based approaches
show sound results in scenarios of limited complexity, they seem to lack the necessary
system-inherent flexibility to run in complex environments under changing lighting con-
ditions. To cope with such environments, in Sect. 4.1 we introduce an architecture for
robust unmarked road detection. The system relies on four novel approaches that permit
the autonomous adaptation of important system parameters to the environment. As the
presented results show, the approach allows for robust road detection on unmarked inner-
city streets without manual tuning of internal parameters. This is different from most
approaches in literature that rely on strong rigid road models and offline set parameters.
In order to further stabilize the gathered results, in Sect. 4.2 a novel, generic approach for
improving unmarked road detection systems by temporal integration is proposed.
4.1 Adaptive Multi-Cue Fusion for Detecting Unmarked Roads in Inner-City
In this section, a robust system approach for detecting the drivable area on unmarked
roads is presented. Based on four novel techniques, which extend known unmarked road
detection approaches, the proposed system reliably detects the road in complex scenarios
by autonomously adapting its internal parameters. As evaluation on inner-city streams
shows, the presented techniques are an important step toward more generic and robust
driving-path detection for unmarked roads. Unlike other approaches, no scene-dependent
manual adaptation of system parameters is required. The input images used for the evalu-
ation, corresponding ground truth data, and a result stream are accessible on the internet
[BenchmarkData, 2009a].
4.1.1 Related Work
Initial approaches for lane detection on marked roads date back to the 1990s (see [Broggi,
1995] for an overview of the early approaches). These systems, commercially available
to date, are restricted to marked roads with a predictable course, based on a clothoid lane
model that is also used in the road construction of motorways. In recent years, the focus
of road detection research has shifted to unmarked country roads and inner-city streets.
To this end, current prototype systems evaluate and fuse different visual features. In the
following, the structure of such visual-feature-based systems is analyzed. It is shown that
despite the large number of existing road detection systems, some important techniques for
increasing road detection robustness have not been considered so far.
Image training regions: Current approaches for road detection often use street train-
ing regions in front of the car in order to parameterize the probability distributions that
describe the road feature characteristics (e.g., [Rotaru et al., 2004, Soquet et al., 2007], see
also Fig. 4.4). Only very few approaches partially incorporate information of non-road im-
age regions to improve road detection (e.g., [Apostoloff and Zelinsky, 2003, Franke et al.,
2007]). However, to our knowledge no approach uses the full potential of non-road informa-
tion, e.g., for the autonomous adaptation of internal system parameters and the dynamic
online assessment of the cue quality, as it is done in our system.
Features: Typical visual features for road detection in state-of-the-art systems are:
texture (edge density) on the intensity map [Franke et al., 2007, Hong et al., 2002, Sha et al.,
2007], stereo disparity [Lombardi et al., 2005, Soquet et al., 2007], HSI color [Franke et al.,
2007, Lin and Chen, 1991, Rotaru et al., 2004, Soquet et al., 2007], or depth from Lidar /
Radar [Dahlkamp et al., 2006, Rasmussen, 2002]. Many system approaches use the edge
density (structure) feature on the intensity map. However, edge density on further feature
maps has so far not been considered. To our knowledge, no approach uses the edge density
on color maps for road detection. During the evaluation of our system, we found the
edge density computed on color maps to be a robust cue for detecting the road.
Feature granularity: Numerous system approaches rely on probabilistic methods for
classifying street and non-street pixels (e.g., [Apostoloff and Zelinsky, 2003, Aufrere et al.,
2004, Franke et al., 2007, Ramstroem and Christensen, 2005, Smuda et al., 2006]). Such
iconic (i.e., pixel-based) approaches do not include information on the neighborhood of
a pixel, but handle all pixels independently. Nevertheless, discontinuities in the feature
maps often contain important information that allows an improved scene decomposition
(e.g., curbstones that separate the road from the sidewalk). Other approaches stress the
importance of region-based information and use region growing or vertical filling (e.g.,
[Chern and Cheng, 2003, Mateus et al., 2005, Rotaru et al., 2004]). Such approaches are
often sensitive to changing lighting conditions that cause large gradients in the feature maps
(e.g., shadows on the road). Both the iconic and the region-based system approaches
have important advantages that partially compensate for their respective drawbacks. However,
to our knowledge no system approach for road detection uses both to the same extent.
Road modeling: Many of the recent feature-based systems use road models
of varying complexity that support the feature-based road detection (e.g.,
[Dickmanns and Mysliwetz, 1992, Franke et al., 2007, Ramstroem and Christensen, 2005]
use clothoids, [Lombardi et al., 2005] distinguishes between left, right, and straight street
course, [Sotelo et al., 2004] uses second order polynomials). For country roads and highways
such approaches seem to yield sound results. Nevertheless, as further discussed in
Sect. 4.1.2, we claim that such rigid street models are not flexible enough to run robustly
on inner-city streets, which often show abrupt changes in their course as well as occlusions
of significant parts of the drivable road area. However, some kind of road model seems to
be necessary in order to improve the robustness of the road detection. This dilemma can be
resolved by relying on a generic and flexible road model that makes only simple assumptions
about the course of the road. One of the few system approaches that follows this idea
is presented in [Lin and Chen, 1991]. The authors point out that the road area typically
covers between 30 and 85% of the image. The feature thresholds are adapted in order to
reach this ratio. Unfortunately, the proposed approach is restricted in its flexibility, since
the ratio is set offline without constantly adapting it to match the current characteristics
of the scene.
To sum up, existing state-of-the-art road detection systems are marked by a limited
flexibility, which restricts their application to country roads and highways. In order to
allow reliable road detection in more complex inner-city scenarios, we propose four novel
techniques that enhance robustness and system-inherent flexibility by enabling adaptation
to the environment. To our knowledge, a combination of these techniques has not been
used for road detection before.
In detail, these techniques are:
- Using street and non-street training regions (see Fig. 4.4) that both adapt the feature probability distributions,
- Using an edge density (structure) feature computed on the HSI hue and saturation maps,
- Combining iconic and region-based feature processing,
- Fusing feature-based road detection with a dynamic and generic road model.
In the following section, details about our road detection system embedding these four
techniques are given. The presented system approach is not restricted to inner-city streets,
but was tested on country roads and highways as well.
4.1.2 System Description
In the following, the realized system architecture for unmarked road detection is described
(see Fig. 4.1). It relies on our four novel techniques that enhance the system-inherent
flexibility. After giving a rough overview on the individual processing steps, all system
modules are described in detail.
Our system takes RGB input images, stereo disparity (from two parallel cameras), and
Radar data as input. Knowledge about previously detected objects in the scene can be
used as optional input. The system detects the road based on six robust features that are
evaluated and fused in a probabilistic way. For this step, street and non-street training
regions are defined in the input image. In parallel, the system detects present lane mark-
ings with a biologically motivated filter approach. The lane markings are fused with the
detected road segments. In the final step, a binary road map is computed relying on a
road model that adapts itself to the environment.
Next, the system is described in more detail. In the first step, different features are
calculated on the 400x300 pixel RGB input images. The features we use are saturation
and hue of the HSI color space (see, e.g., [Jaehne, 2005]). Furthermore, we apply the
structure tensor in Equ. (4.1) (with W being a 9x9 region around the current pixel) to
compute the edge density Ej (see Equ. (4.2)) on the hue, saturation, and intensity of the
HSI color space:
$$A_j(u, v) = \begin{bmatrix} \sum_W (G_u * F_j)^2 & \sum_W (G_u * F_j)(G_v * F_j) \\ \sum_W (G_v * F_j)(G_u * F_j) & \sum_W (G_v * F_j)^2 \end{bmatrix} \quad (4.1)$$

with $j \in \{\text{hue}, \text{saturation}, \text{intensity}\}$ and

$$G_u(u, v) = -\frac{u}{2\pi\sigma^4}\exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right)$$

$$G_v(u, v) = -\frac{v}{2\pi\sigma^4}\exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right)$$

$$E_j(u, v) = \frac{\det(A_j(u, v))}{\operatorname{trace}(A_j(u, v))}. \quad (4.2)$$
Typically, the edge density computed on these feature channels is different for the road
and the rest of the scene, which makes it a reliable feature.
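For illustration, a minimal Python sketch of Equ. (4.1) and (4.2) is given below. It is not the thesis implementation (which is realized in C with the Intel IPP), and the kernel size and σ are assumed example values; the summation over W is realized by a mean filter, which scales E_j only by a constant factor.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def edge_density(F, sigma=1.0, ksize=9, win=9):
    """Edge density E_j = det(A_j)/trace(A_j) of a feature map F (Equ. 4.1, 4.2)."""
    r = ksize // 2
    u, v = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    g = np.exp(-(u**2 + v**2) / (2.0 * sigma**2))
    Gu = -u / (2.0 * np.pi * sigma**4) * g          # Gaussian derivative kernel G_u
    Gv = -v / (2.0 * np.pi * sigma**4) * g          # Gaussian derivative kernel G_v
    Fu = convolve(F.astype(float), Gu)              # G_u * F_j
    Fv = convolve(F.astype(float), Gv)              # G_v * F_j
    # Structure tensor entries of Equ. (4.1); the mean over the window W
    # is proportional to the sum used in the thesis.
    Auu = uniform_filter(Fu * Fu, win)
    Avv = uniform_filter(Fv * Fv, win)
    Auv = uniform_filter(Fu * Fv, win)
    det = Auu * Avv - Auv**2
    trace = Auu + Avv
    return np.where(trace > 1e-12, det / np.maximum(trace, 1e-12), 0.0)  # Equ. (4.2)
```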
Furthermore, vision-based stereo data is used as feature. For computing stereo vision,
the camera images are rectified in order to facilitate the correspondence search between
the two camera images (i.e., the images are remapped, virtually aligning the two camera
coordinate systems with the world coordinate system). The intrinsic (i.e., internal camera
properties, like the focal length and the principal point) and extrinsic (i.e., external camera
properties, like camera angles and offsets) camera parameters required for this step were
determined using the freely available calibration toolbox [J.Y.Bouguet, 2007]. The toolbox
was applied on a calibration scene similar to the one described in [Marita et al., 2007] (see
also Sect. 2.2.2 for details on the computation of stereo disparity). There is no dynamic
change of the camera pitch angle, since on the one hand the input images are pitch-
corrected using a correlation-based method similar to [Broggi and Grisleri, 2005]. On the
other hand, we assume a flat road, which is present in most inner-city environments.
When using the system in an urban environment, the course of the road and hence the
camera angles could be estimated using a surface model (e.g., a hyperplane; please refer to
[Michalke et al., 2008c] for details).
Figure 4.1: System overview: Adaptive road detection system (red modules contain novel techniques).

The image rectification assures that the camera angles
(including the static pitch angle) will not influence the stereo results. The correspondence
search yields a disparity map. Based on the disparity map three dense maps containing
the 3D-world positions for all image pixels can be obtained (see Fig. 4.5). The stereo data
is remapped using the measured camera angles in order to have the stereo maps and the
image comparable in terms of the pixel position of objects.
The stereo maps are postprocessed for solving the problem of missing disparity values
near to the car (see Fig. 4.2b). More specifically, during the computation of the stereo
disparity no correspondence search is performed in image regions near to the car, since
this would come at the cost of a high computation time. We solve this problem by searching
line-wise for high horizontal gradients in the bird's eye view of the camera image (for
information on this representation see [Broggi, 1995]), taking only the area directly in front
of the car (e.g., the first 10 meters) into account (see the example in Fig. 4.2a). Based on
Radar data and the low vertical gradients in the bird's eye view, it is assured that no
objects are present in this area. The area between the found gradients, which mark the
road borders, is assumed to be road. The image regions in the bird's eye view representation
are mapped to the perspective image with a pin-hole camera model (see Annex A.3), which
includes the determined intrinsic and extrinsic camera parameters (e.g., static camera
angles). Based on the perspectively mapped road regions, the three stereo maps are
corrected assuming a perfectly flat plane (see the resulting corrected depth map in Fig. 4.2c).
Since only the region directly in front of the car is corrected, the error induced by a non-flat
road plane can be considered small. To eliminate even this error, the estimated camera
angles coming from the optional surface model could be included into the pin-hole camera
model.
Tests have shown that large shadows on the road result in poor stereo quality, since
the correspondence search gets difficult in dark, noisy image regions. This supports the use
of additional cues that are to some extent invariant to shadows, as done in the presented
system (e.g., the HSI color space). Altogether, our system relies on six different cues for
road detection (see Tab. 4.1 for an overview).
Table 4.1: Used visual features for unmarked road detection.
MODALITY | Cue # | VISUAL ROAD DETECTION FEATURE
Color | 1 | Hue
Color | 2 | Saturation
Structure | 3 | Edge density on Hue
Structure | 4 | Edge density on Saturation
Structure | 5 | Edge density on Intensity
Stereo | 6 | Height of objects in scene
Figure 4.2: (a) Gradient-based road search on the bird's eye view of the image depicted in Fig. 4.4, (b) Missing disparity values near to the camera vehicle induce false and missing depth values, (c) Corrected depth map.

Figure 4.3: Structuring element for region growing for (a) the left image half, (b) the right image half.

In the second step, binary road maps (BRM) and road probability maps (RPM) for
the six feature maps are computed. The BRMs are binary maps that hold “1” for pixels
belonging to the detected street and zero for the rest. The six BRMs are calculated with
a region-growing algorithm, by which region-related feature properties are incorporated.
As opposed to that, the six RPMs contain continuous probability values that assess the
"road-likeness" of the feature values for all pixels independently. Both map types rely on the same
normal distribution, see Equ. (4.13) and (4.14). The parameters of the normal distribution
are calculated using a street and at least 2 non-street training regions (see Fig. 4.4). Please
note that the training region needs to be set beyond the regions of corrected height values
(see Fig. 4.2c). The training regions are adapted dynamically depending on the scene.
For example, it is assured that no obstacle is within the training region by incorporating
Radar data.

Figure 4.4: Visualization of street and non-street training regions.

Furthermore, the size of the street training region is set proportionally to the
velocity of the ego vehicle, to exploit the fact that typically no near obstacles exist during
fast driving, e.g., on highways. The street and non-street training regions are chosen by
considering the height map of the scene derived from the stereo disparity map (see Fig. 4.5)
and existing knowledge about objects in the scene. In the following, the computation of
the BRMs and RPMs is described in more detail.
Figure 4.5: Dense 3D-world position for all image pixels based on stereo vision. The X, Y,
and Z-maps contain dense 3D-world data of the scene in image coordinates. The X-map codes
the horizontal world position, the Y-map the vertical world position (object height), and the
Z-map the depth.
For computing the BRMs, a region-growing algorithm that connects continuous regions
in the feature maps is applied (i.e., the neighborhood of a pixel is evaluated). This
is done in order to get crisp borders between the road and the sidewalks, which
often have road-like features. The region growing uses two different structuring elements
for the left and right half of the image (see Fig. 4.3), which is motivated by the typical
course of roads in a perspective image. The region-growing algorithm recursively sets all
pixels that are adjacent to the currently known street segment in BRMi to "1", when the
corresponding pixels in feature map i are within the confidence interval (see Equ. (4.3)
and Equ. (4.4)).
$$\bar{x}_i - \beta_i\sigma_i < x_i < \bar{x}_i + \beta_i\sigma_i \quad \forall i = 1..5 \quad (4.3)$$

with $\beta_i = 4\,d_i(H_{s_i}, H_{n_i}) \quad \forall i = 1..6$

$$\bar{x}_6 - \varepsilon_Y(v) < x_6 < \bar{x}_6 + \varepsilon_Y(v) \quad (4.4)$$

$$\text{with} \quad \varepsilon_Y(v) = \beta_6\left(\sigma_6 - \sigma_q(v_{\text{train}})\right) + \sigma_q(v) \quad (4.5)$$

$$d_i(H_{s_i}, H_{n_i}) = \sqrt{1 - \gamma_i(H_{s_i}, H_{n_i})} \quad \forall i = 1..6 \quad (4.6)$$

$$\gamma_i(H_{s_i}, H_{n_i}) = \sum_{\forall x} \sqrt{H_{s_i}(x)\,H_{n_i}(x)} \quad \forall i = 1..6 \quad (4.7)$$
The region-growing algorithm starts from the road training region. The normal-distribution-based
confidence interval in Equ. (4.3) uses the feature thresholds $\bar{x}_i \pm \beta_i\sigma_i$,
which are independently calculated for all five visual features. Here, the parameter
$\bar{x}_i$ is the mean and $\sigma_i$ the standard deviation of the normal distribution calculated on the
street training region. The parameter $\beta_i$ is introduced in order to adapt the confidence
interval to the current scene properties. Different from $\bar{x}_i$ and $\sigma_i$, which are calculated on
the street training region alone, the threshold parameter $\beta_i$ changes dynamically depending
on the characteristics of the street and non-street training regions. More specifically, the
parameter $\beta_i$, which influences the feature thresholds, is calculated from $d_i$ (see Equ. (4.6)).
The parameter $d_i$ is the distance between the two histograms $H_{s_i}$ and $H_{n_i}$ of the street
and non-street training regions for the $i = 1..6$ features. The measure $d_i$ is based on the
Bhattacharyya coefficient $\gamma_i(H_{s_i}, H_{n_i})$ (see Equ. (4.7)), which assesses the similarity of two
histograms. Based on $\beta_i$ the confidence interval is adapted (see Equ. (4.3)). The larger the
difference between the street and the non-street areas on a feature map is, the bigger
the confidence interval becomes.
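The adaptation of $\beta_i$ can be sketched in a few lines; bin count and value range below are assumed parameters, not values from the thesis.

```python
import numpy as np

def beta_from_regions(street_vals, nonstreet_vals, bins=32, rng=(0.0, 1.0)):
    """Confidence-interval factor beta_i = 4 * d_i derived from the
    Bhattacharyya coefficient of the street/non-street histograms
    (Equ. 4.6 and 4.7)."""
    Hs, _ = np.histogram(street_vals, bins=bins, range=rng)
    Hn, _ = np.histogram(nonstreet_vals, bins=bins, range=rng)
    Hs = Hs / Hs.sum()                     # normalize to probability masses
    Hn = Hn / Hn.sum()
    gamma = np.sum(np.sqrt(Hs * Hn))       # Bhattacharyya coefficient (4.7)
    d = np.sqrt(max(0.0, 1.0 - gamma))     # histogram distance (4.6)
    return 4.0 * d                         # beta_i used in Equ. (4.3)
```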
Different from the five visual cues (hue, saturation, and the three edge density maps),
the normal distribution of the stereo height Y also depends on the measured distance to
the car. This is empirically plausible, since Y is a function of the stereo disparity D(u, v),
and the relative influence of the quantization error of D(u, v) (measured in pixels) grows
as D(u, v) becomes smaller and hence the distance of a road segment to the car becomes larger.
Hence, the part σq of the standard deviation of the stereo height cue that is induced by
the quantization error of D(u, v) increases with growing distance to the car. In order to
mathematically assess the error propagation of the quantization error of disparity D(u, v)
to the stereo height Y their functional relation is required. The stereo height x6 = Y
can be computed using Equ. (4.8) (with B as the horizontal distance between the stereo
cameras, h the camera height, v the vertical pixel position and v0 the vertical principal
point of the camera).
$$x_6 = Y = \frac{B \cdot (v - v_0)}{D(u, v)} - h \quad (4.8)$$

$$D_{\text{surf}}(v) = \frac{B \cdot (v - v_0)}{h} \quad (4.9)$$

$$\sigma_D = \frac{\Delta g}{\sqrt{12}} = \frac{1}{\sqrt{12}} \quad (4.10)$$

$$\sigma_q(v) \approx \sigma_D \left| \frac{dY}{dD} \right|_{D(u,v) = D_{\text{surf}}(v)} \approx \sigma_D \left| \frac{-B \cdot (v - v_0)}{[D_{\text{surf}}(v)]^2} \right| \quad (4.11)$$

$$\sigma_q(v) \approx \frac{1}{\sqrt{12}} \left| \frac{h^2}{B \cdot (v - v_0)} \right| \quad (4.12)$$
Equation (4.10) defines the standard deviation $\sigma_D$ of the disparity (measured in pixels),
which is induced by the quantization error (the step size $\Delta g$ is set to 1 pixel). For computing
the propagated standard deviation $\sigma_q$ (required in Equ. (4.5)), we use Equ. (4.11) (refer to
[Jaehne, 2005]), which describes how the standard deviation of a random variable (here the
disparity D(u, v)) is propagated through a function (here Y(D)). We are interested in the
disparity on the road surface $D_{\text{surf}}$ alone (see Equ. (4.9), obtained by rearranging Equ. (4.8)
with Y = 0). Hence, $D_{\text{surf}}$ defines the position at which Equ. (4.11) is linearized. Here,
the vertical pixel position v is a parameter of the distribution. For the quantization-error-induced
standard deviation of the height cue Y, we finally find Equ. (4.12). The
hyperbolic form of Equ. (4.12) confirms the empirical assumptions made above. Based on that,
the confidence interval $\varepsilon_Y$ for the stereo height Y (see Equ. (4.4)) includes the standard
deviation $\sigma_q(v)$, which is adapted depending on the current vertical image position v of the
pixel in focus (see Equ. (4.5)). Besides adding $\sigma_q(v)$ in Equ. (4.5), the standard
deviation $\sigma_6$, computed on the training region of the Y map, needs to be corrected by
$\sigma_q(v_{\text{train}})$ present at the vertical image position $v_{\text{train}}$ of the training region. As a result, we
now have six BRMs for the six features.
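Equation (4.12) is easy to evaluate numerically; baseline, camera height, and principal point below are assumed example values, not our calibration.

```python
import numpy as np

def sigma_q(v, B=0.3, h=1.2, v0=150.0):
    """Quantization-induced std. dev. of the stereo height cue (Equ. 4.12);
    B: baseline [m], h: camera height [m], v0: vertical principal point [px]."""
    return (1.0 / np.sqrt(12.0)) * np.abs(h**2 / (B * (v - v0)))

# The uncertainty grows hyperbolically toward the horizon (v -> v0):
print(sigma_q(np.array([290.0, 200.0, 160.0])))   # near, mid, far image rows
```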
In addition to the region-based processing for calculating the BRMs, a pixel-based
(iconic) processing for computing the RPMs is done (i.e., each pixel is handled independently
of its surround). All pixel values $x_i$ receive a probability value $p(x_i)$, which
results in six independent Road Probability Maps (RPMs) for the six features:
$$p(x_i) = e^{-\frac{(x_i - \bar{x}_i)^2}{2\sigma_i^2}} \quad \forall i = 1..5 \quad (4.13)$$

$$p(x_6) = e^{-\frac{x_6^2}{2\left[\sigma_6 - \sigma_q(v_{\text{train}}) + \sigma_q(v)\right]^2}}. \quad (4.14)$$
The probability distribution for the stereo-based height cue Y (see Equ. (4.14)) assumes
a mean height of zero ($\bar{x}_6 = 0$) and adapts $\sigma_q(v)$ during the computation of RPM6 and
BRM6 dependent on the vertical pixel position v. The approach assumes a normal distribution
of the six features in the street training region and beyond. As described for $x_6$, a
position-dependent variance was introduced. We verify the assumed normal distribution
with statistical tests of goodness of fit for all features independently (see Sect. 4.1.3).
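The iconic RPM computation of Equ. (4.13) and (4.14) reduces to simple array operations; a minimal sketch with hypothetical argument names:

```python
import numpy as np

def rpm_visual(F, mean, sigma):
    """Iconic road probability map of a visual feature map F (Equ. 4.13);
    mean/sigma are estimated on the street training region."""
    return np.exp(-(F - mean)**2 / (2.0 * sigma**2))

def rpm_height(Y, sigma6, sq_train, sq_rows):
    """RPM of the stereo height cue with position-dependent standard deviation
    (Equ. 4.14); sq_rows holds sigma_q(v) broadcast over the image rows."""
    s = sigma6 - sq_train + sq_rows
    return np.exp(-Y**2 / (2.0 * s**2))
```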
In the third step, the computed BRMs and RPMs are fused with the detected lane
markings. More specifically, the RPMs for all features are set to a high probability at
the detected lane markings. The lane-marking detection is done with the biologically
motivated Difference of Gaussian (DoG) kernel (see Fig. 4.6a), which takes the receptive
fields of neurons in the retina as a role model. The DoG filter kernel is adapted to be
selective to bright structures on a dark background, the so-called on-off contrasts, without
reacting to dark structures on a brighter background. Figure 4.6c shows the filter response
on the inner-city frame shown in Fig. 4.6b. All image regions with on-off contrasts that
have a height within the confidence interval of Equ. (4.4) and that lie below the horizon are
detected as lane markings (see Fig. 4.6d). The separation between on-off and off-on
contrasts reduces the number of false positive road marking detections. For example, in
[Luo-Wai, 2008] the prefiltered road image still contains off-on contrasts that are unspecific
to lane markings (e.g., traffic signs in front of a bright sky). Such off-on contrasts are filtered out
in our approach to improve the road detection performance.
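The on-off selectivity can be sketched as a clipped center-surround difference; the two σ values below are assumed, not the thesis's tuned parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def on_off_dog(intensity, sigma_center=1.0, sigma_surround=3.0):
    """On-off DoG response: center minus surround with the negative half-wave
    clipped, so only bright-on-dark (on-off) structures respond."""
    img = intensity.astype(float)
    center = gaussian_filter(img, sigma_center)
    surround = gaussian_filter(img, sigma_surround)
    return np.maximum(center - surround, 0.0)
```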
The six iconic RPMs and their respective BRMs are combined by multiplication, which
leads to six extended RPMs (eRPM):

$$\text{eRPM}_i = \text{RPM}_i \cdot \text{BRM}_i \quad \forall i = 1..6. \quad (4.15)$$
Based on this, the advantage of probability-based computation is preserved. At the same
time, discontinuities in the feature maps can be detected. As a result, the advantages of
both approaches are combined.
Next, all eRPMs are fused resulting in the final RPM (fRPM) using the geometric mean:
$$\text{fRPM} = \left( \prod_{i=1}^{6} \text{eRPM}_i \right)^{1/6}. \quad (4.16)$$
In the fourth and final step, the Final Road Map is determined by applying a threshold
$\varepsilon_{\text{final}}$ to the fRPM:

$$\text{Final Road Map}(u, v) = \begin{cases} 1 & \forall\, \text{fRPM}(u,v) > \varepsilon_{\text{final}} \\ 0 & \text{else.} \end{cases} \quad (4.17)$$
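Steps (4.15)-(4.17) reduce to a few array operations; a minimal sketch with hypothetical names:

```python
import numpy as np

def fuse_road_maps(rpms, brms, eps_final):
    """Fusion of iconic RPMs and region-based BRMs (Equ. 4.15-4.17);
    rpms and brms are stacks of shape (6, H, W)."""
    erpms = rpms * brms                            # eRPM_i (4.15)
    frpm = np.prod(erpms, axis=0) ** (1.0 / 6.0)   # geometric mean (4.16)
    return (frpm > eps_final).astype(np.uint8)     # Final Road Map (4.17)
```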
The threshold $\varepsilon_{\text{final}}$ is set dynamically based on the correlation results of the three
currently most reliable feature maps, in order to get a prediction of the current relative
size of the road versus the rest of the image. For these three features the currently best HSI
color feature (hue or saturation), the best structure feature (structure on hue, saturation,
or intensity), as well as stereo are selected.

Figure 4.6: (a) On-off Difference of Gaussian (DoG) filtering on two test images with on-off and off-on contrast (left) as well as the respective filter responses (right), (b) Inner-city test frame, (c) On-off DoG filter response for bright contrasts on a dark background (with lane markings popping out), (d) Detected lane markings (after fusion of DoG and object height from stereo).

For the selection process the Bhattacharyya coefficient $\gamma_i(H_{s_i}, H_{n_i})$ is evaluated (see
Equ. (4.7)), by which the separability of the street and non-street histograms $H_{s_i}$ and $H_{n_i}$
can be assessed.
Hence, the computation of the Final Road Map relies on a simple road model (expected
fraction of the road area in the current image, termed road-to-image-ratio). No assump-
tions are made regarding the current position of the road in the image. As our evaluation
results in Sect. 4.1.3 show, it is of crucial importance to adapt the said expected fraction
dynamically to the current scene. This dynamic adaptation enables the system to run
robustly in complex scenes, such as inner-city scenarios.
For adapting εfinal the control loop depicted in Fig. 4.7 is used. The threshold εfinal is
adapted by a gradient method based on Equ. (4.22). In the following, the applied procedure
is described in detail. It uses the BRMs of the three most reliable feature maps A, B, and
C, which are combined into the road reference map (i.e., the feature product R that represents
the expected road area), as depicted in Fig. 4.7. The four binary maps are summed up, which
results in four scalar values $S_{\{A,B,C,R\}}$:
$$S_X = \sum_{\forall(u,v)} \text{BRM}_X(u, v) \quad \text{with } X \in \{A, B, C, R\}. \quad (4.18)$$
The values S{A,B,C,R} represent the integral number of pixels detected as road for the
three feature maps and the road reference map.
Figure 4.7: Control loop to adapt the final road detection threshold εfinal.
Then, the parameter κ is calculated:
$$\kappa = \frac{1}{3}\left(\frac{S_R}{S_A} + \frac{S_R}{S_B} + \frac{S_R}{S_C}\right). \quad (4.19)$$
It represents the mean percentage with which the three most reliable feature maps
correspond to the road reference map R. The larger $\kappa$ is, the better the features match
each other, i.e., the more similar the three feature maps are. The degree of similarity of
these features gives a hint about what to expect from the remaining cues and can hence
be used to adapt εfinal. The Final Road Map is computed (see Equ. (4.17), where εfinal is
set to a typical initial value for bootstrapping) and summed up yielding the scalar value
SFRM:
$$S_{\text{FRM}} = \sum_{\forall(u,v)} \text{Final Road Map}(u, v). \quad (4.20)$$
Next, it is checked if the calculated scalar value SFRM fulfills:
$$\frac{1}{\kappa} < \frac{S_{\text{FRM}}}{S_R} < 1.2\,\frac{1}{\kappa}. \quad (4.21)$$
If the inequality is fulfilled, the Final Road Map is valid. If not, $\varepsilon_{\text{final}}$ is adapted
incrementally based on the following equation (with $\alpha^- < 1$ and $\alpha^+ > 1$) until
inequality (4.21) is fulfilled:

$$\varepsilon_{\text{final}}(t) = \begin{cases} \alpha^-\,\varepsilon_{\text{final}}(t-1) & \text{when } \frac{S_{\text{FRM}}}{S_R} < \frac{1}{\kappa} \\ \alpha^+\,\varepsilon_{\text{final}}(t-1) & \text{when } \frac{S_{\text{FRM}}}{S_R} > 1.2\,\frac{1}{\kappa}. \end{cases} \quad (4.22)$$
Equation (4.22) is motivated by the well-known Resilient Backpropagation (RPROP)
approach. The step sizes $\alpha^+$ and $\alpha^-$ are adapted using the SuperSAB approach (see
[Adamy, 2007] for details). The processing stops after 100 iterations at the latest.
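A simplified sketch of this control loop is given below; fixed step factors replace the SuperSAB-adapted $\alpha^\pm$ of the thesis, and the initial threshold is an assumed bootstrapping value.

```python
import numpy as np

def adapt_eps_final(frpm, S_R, kappa, eps0=0.5, a_minus=0.9, a_plus=1.1,
                    max_iter=100):
    """Adapt eps_final until inequality (4.21) holds (simplified version of
    Equ. 4.22 with constant step factors instead of SuperSAB)."""
    eps = eps0
    for _ in range(max_iter):
        S_FRM = np.sum(frpm > eps)          # Equ. (4.17) + (4.20)
        ratio = S_FRM / S_R
        if ratio < 1.0 / kappa:
            eps *= a_minus                  # road map too small: lower threshold
        elif ratio > 1.2 / kappa:
            eps *= a_plus                   # road map too large: raise threshold
        else:
            break                           # Equ. (4.21) fulfilled
    return eps
```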
In the following section, our system approach is evaluated based on an inner-city sce-
nario.
4.1.3 Experiments and Results
In the following, accumulated road detection results on 440 frames of an inner-city stream
are presented (please also refer to [Michalke et al., 2009c] for a more extensive system
evaluation). The performance gain reached by incorporating street and non-street training
regions as well as the dynamic road model is assessed. The results of statistical tests of
goodness of fit are given, which support the assumption of a normal distribution for the
color and structure features within the street-training region. In a final step, details of the
needed computation time on our test vehicle are given. The inner-city result stream, the
input images, and the manually annotated ground truth street segments are available on
the internet [BenchmarkData, 2009a] for benchmark testing.
In order to evaluate our system, we apply the following equations to the resulting road
segment:
$$\text{Completeness} = \frac{TP}{TP + FN} \quad (4.23)$$

$$\text{Correctness} = \frac{TP}{TP + FP} \quad (4.24)$$

$$\text{Quality} = \frac{TP}{TP + FP + FN}. \quad (4.25)$$
The Equations define different ground-truth-based measures, which were taken from
[Lombardi et al., 2005] (with pixels being True Positive (TP), False Negative (FN), False
Positive (FP)).
On a descriptive level, the Completeness states, based on the given ground truth data, how
much of the present road was actually detected. The Correctness states how much of the
detected road is actually road; it penalizes the trivial solution of classifying everything as
road, which would lead to a Completeness of 100%. The Quality combines both measures,
since a trade-off between Completeness and Correctness is possible. Based on this, the
Quality measure should be used for a comparison, since it weights the FP and FN pixels
equally. For a more detailed analysis, the Completeness and Correctness state what exactly
caused a difference in Quality. The
necessary ground truth data was produced by accurate manual annotation of the road in
the 440 images.
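The three measures translate directly into code; a short sketch for binary maps:

```python
import numpy as np

def road_measures(detected, ground_truth):
    """Completeness, Correctness, and Quality (Equ. 4.23-4.25) of a binary
    road map against a binary ground truth annotation."""
    det, gt = detected.astype(bool), ground_truth.astype(bool)
    TP = np.sum(det & gt)                  # correctly detected road pixels
    FP = np.sum(det & ~gt)                 # falsely detected road pixels
    FN = np.sum(~det & gt)                 # missed road pixels
    return (TP / (TP + FN),                # Completeness (4.23)
            TP / (TP + FP),                # Correctness  (4.24)
            TP / (TP + FP + FN))           # Quality      (4.25)
```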
In order to evaluate the novel techniques, the three measures were calculated on the
detected road segments of 440 image frames for three system instances. The first instance
is our system as proposed in Sect. 4.1.2 with all four novel techniques running. The second
system instance is equivalent to the first but runs with a constant road-to-image-ratio
(i.e., with a rigid road model). The third system is equivalent to the first but uses no
non-street training regions, which makes the confidence interval thresholds less adaptive
to the environment (βi = const., see Equ. (4.3)).
We used 220 frames of our inner-city scenario as training data for the two competing,
less adaptive systems in order to tune the road-to-image-ratio of the second system and
the confidence interval factors $\beta_i$ of the third system. The accumulated results in Tab. 4.2 show
that all three systems have a similar performance in terms of Quality on the training data.
On the training images, the highest Quality is reached by the second competing system,
which uses a rigid road model.
Table 4.2: Comparison of our road detection system with two competing systems, each running without one of the proposed novel techniques, on training images.

Road detection approach | # training images | Correctness | Completeness | Quality
Our system | - | 96% | 75% | 73%
Without non-street training areas | 220 | 96% | 73% | 71%
With rigid road model | 220 | 88% | 84% | 75%
The accumulated results for the training sequence are plausible, since both competing
systems were tuned to run with good performance on the training images. In contrast, our
system adapts itself to the environment based on the four described techniques. Therefore,
no manual tuning of our system to the training sequence was done.
For the actual evaluation, the two competing systems were run on consecutive parts of
the remaining stream (in sum 220 images) that were used for testing. In a direct comparison
between our system and the rigid-road-model system, we obtained the results depicted in
Tab. 4.3. Table 4.4 shows the results of the comparison between our system and the system
without non-street training regions. In both cases, our system significantly outperforms
the competing systems in terms of Quality (75% compared to 68%, and 69% compared to
50%). These results confirm the benefit of the system-inherent adaptation capabilities offered
by the proposed four techniques.
Table 4.3: Comparison of our system and an equivalent system with a rigid road model on a test stream with a narrow street.

Road detection approach | # test images | Correctness | Completeness | Quality
Our system | 120 | 97% | 77% | 75%
With rigid road model | 120 | 77% | 85% | 68%
As Tab. 4.3 and 4.4 reveal, the Correctness of the found street segments is high, which means
that only a small number of false positive street pixels is found. However, the gathered results
show that the detection performance varies between frames. This is due to the changing content
of the training region in front of the car. Thereby, the system possibly adapts to local
characteristics present in the current training region that might differ from the current
global road characteristics. Furthermore, local illumination changes that depend on the
current view angle and lighting conditions influence the detection performance. To solve
this, a temporal integration method was developed, which is introduced in the following
Section 4.2.
Table 4.4: Comparison of our system and an equivalent system without non-street training regions for a shady test stream.

Road detection approach | # test images | Correctness | Completeness | Quality
Our system | 100 | 99% | 68% | 69%
Without non-street training regions | 100 | 99% | 50% | 50%
Figure 4.8: Example images of the benchmark inner-city stream (First column: Input image with ground truth road segment, Second column: First benchmark system with rigid road model, Third column: Second benchmark system without non-street training regions, Last columns (highlighted): Resulting road segment of our system improved by temporal integration (see Sect. 4.2)).
For further evaluation, Fig. 4.8 shows typical results of our system compared to the two
competing systems and the ground truth data, based on four sample frames of the inner-city
stream (the full stream is available at [BenchmarkData, 2009a]). As can be seen, our
system performs better in complex scenes and scenes with strong shadows on the road.
As described in Sect. 4.1.2, a central assumption of our system is that the features in the
street training region are normally distributed (see the confidence interval defined in Equ. (4.3)).
Therefore, the Kolmogorov-Smirnov (KS) test of goodness of fit with its Lilliefors extension
[Lilliefors, 1967] is used to check whether the hypothesis of a normal distribution $F_e(x)$ for the six
system features can be rejected. For this, the cumulative frequency $F_0(x)$ for all features in the
street training region was measured for 30 consecutive inner-city frames. Exemplarily,
Tab. 4.5 shows the results for the edge density computed on the hue for a single frame.
The maximal deviation from the tested normal distribution ($\mu = 0.1559$ and $\sigma^2 = 0.0088$)
was $d = 0.1044$ and consequently below the allowed margin of $d_{th} = 0.271$ (level of
significance $\alpha = 0.05$). For the remaining 29 frames similar results were accumulated. Also
for the other five features the KS test showed that the hypothesis of a normal distribution
could not be rejected. As long as no object is present in the training region, we can therefore
assume that the features on the road are normally distributed.
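Such a check can be reproduced, e.g., with the Lilliefors test from statsmodels (assuming that library is available); the sample below is synthetic, generated from the moments reported above, not the thesis data.

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

def normality_not_rejected(feature_values, alpha=0.05):
    """KS test with Lilliefors correction (mean and variance estimated from
    the sample); returns True if the normality hypothesis is NOT rejected."""
    d_stat, p_value = lilliefors(feature_values, dist='norm')
    return p_value > alpha

# Synthetic example with the moments reported in the text (mu=0.1559, var=0.0088)
samples = np.random.normal(0.1559, np.sqrt(0.0088), 4819)
print(normality_not_rejected(samples))
```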
Table 4.5: Kolmogorov-Smirnov test of goodness of fit for the edge density feature computed
on the hue.
Hue edge density (9 classes) | Cumulative frequency F0(x) | Tested normal distribution Fe(x) | Statistical measure d = max|Fe(x) − F0(x)|
0.0353 | 755/4819 = 0.1567 | 0.1296 | 0.0271
0.1016 | 2332/4819 = 0.4839 | 0.3795 | 0.1044
0.1678 | 3589/4819 = 0.7448 | 0.6725 | 0.0722
0.2341 | 4258/4819 = 0.8836 | 0.8814 | 0.0022
0.3003 | 4574/4819 = 0.9492 | 0.9719 | 0.0228
0.3665 | 4713/4819 = 0.9780 | 0.9958 | 0.0178
0.4328 | 4765/4819 = 0.9888 | 0.9996 | 0.0108
0.4990 | 4793/4819 = 0.9946 | 1 | 0.0054
0.5653 | 4819/4819 = 1 | 1 | 0
For the experiments we use a Honda Legend prototype car equipped with an mvBlueFox
CCD (charge-coupled device) color camera from Matrix Vision delivering images of
800x600 pixels at 10 Hz, which is hence the processing rate our road detection module
should approximately reach. The image data as well as the laser and vehicle state data
from the CAN bus are transmitted via LAN to several Toshiba Tecra A7 (2 GHz Core
Duo) laptops running our RTBOS integration middleware [Ceravola et al., 2006] on top of Linux.
The road detection component, together with other driver assistance components (see, e.g.,
[Michalke et al., 2007]), is implemented in C using an optimized image processing library
based on the Intel IPP [Intel, 2006]. Table 4.6 shows the computational demands of different
sub-modules of the presented approach running on one of these laptops. The overall
computation time of our road detection system currently amounts to 123.5 ms (8.1 Hz),
which allows real-time processing on our prototype vehicle.
Table 4.6: Computation time (M - Including detection of lane markings, T - Including temporal
integration approach).
M | T | Used RAM [in MB] | Comp. time [in ms] (frame rate [in Hz])
- | - | 185 | 93.5 (10.7)
X | - | 203 | 101.0 (9.9)
- | X | 214 | 105.0 (9.5)
X | X | 233 | 123.5 (8.1)
In the following section, a tracking procedure based on temporal integration is proposed,
which stabilizes the gained road detection results, e.g., in case of difficult lighting conditions.
4.2 Temporal Integration for Feature-Based Road
Detection Systems
Although existing state-of-the-art systems for unmarked road detection show promising
results, the detected road segments often contain holes and show a detection performance
that strongly varies in time depending on environmental conditions (see also previous Sec-
tion 4.1). The varying detection performance is due to the changing content of the training
region in front of the car. Thereby, the system possibly adapts to local characteristics
present in the current training region that might differ from the global road characteris-
tics. Furthermore, local illumination changes that depend on the current view angle and
lighting conditions influence the detection performance. See Fig. 4.9 for a visualization of
both effects.
In the following section, a real-time capable approach for improving the road detection
results of this type of state-of-the-art system is presented, which adds a generic postprocessing
step. Our proposed architecture removes the drawbacks of these systems using a
temporal integration approach based on the bird’s eye view. In order to test the proposed
approach, the visual-feature-based road detection system described in Sect. 4.1 is used.
Still, this road detection system can be exchanged with any other state-of-the-art system.
Evaluation results computed on inner-city data show that this approach is an important
enhancement for all visual-feature-based road detection systems. One of the used streams
and corresponding ground truth data is accessible on the internet for benchmark testing.
The proposed approach is a crucial step toward robust road detection in complex scenarios
that allows building high-level applications, as, e.g., active collision avoidance or trajectory
planning, based on vision as the major cue.
4.2.1 Related Work
The concept of temporal integration is used in various applications in the field of computer
vision for driver assistance.

Figure 4.9: Causes for varying road detection performance: (a) Illumination change with dependence on the view angle, (b) Sample image showing a typical illumination gradient, (c) Schematic example: Training region in the sun and resulting detected street segment (in white), (d) Schematic example: Training region in the shade and resulting detected street segment (in white).

For example, [Gepperth et al., 2007] uses spatiotemporal inte-
gration to improve the classifier performance when detecting signal boards and cars. Other
applications for improving the classifier performance rely on (temporal-integration-based)
voting mechanisms, which are widely used in numerous domains (see [Bauer and Kohavi,
1999] for an overview). Also the well-known Kalman filter approach [Kalman, 1960] sta-
bilizes its state estimate by temporal integration (fusion of measured and predicted data).
In [Nieto et al., 2007] temporal integration is used to determine the camera parameters,
thereby stabilizing the input image of a marked lane detection system running online in a
car.
Also for clothoid-model-based lane detection on highways and country roads (see,
e.g., [Dickmanns and Mysliwetz, 1992] and [Franke et al., 2007]) temporal integration was
found to improve the detection performance. Still, the usage of such model-based approaches
for road detection in complex inner-city scenes is heavily restricted due to the
unpredictable and abruptly changing course of the road and various occlusions of road
parts. Figure 4.10a shows the complexity of a hand-labeled ground truth road segment
for an inner-city frame that can hardly be modeled using, e.g., a clothoid model. Therefore,
model-based temporal integration is also not possible and will not show the desired
results in such complex scenarios.
Figure 4.10: Exemplary inner-city frame: (a) Hand-labeled ground truth street segment, (b)
Optical flow (colors code the direction of the motion), (c) Bird’s eye view.
Newer road detection approaches that rely on the statistical evaluation of different image
features (see, e.g., [Rotaru et al., 2004] and [Soquet et al., 2007]) can handle such scenarios
but have the drawbacks discussed at the beginning of Sect. 4.2. Nevertheless,
also for these systems temporal integration can and should be used for making the road
segment detection more robust. To this end, the most direct approach would be to use the
optical flow that reflects the magnitude and direction of the motion of image regions, as
shown in Fig. 4.10b. Based on that, the current position of a street segment detected in
the past can be determined and used for a fusion with the current road detection results.
However, the optical flow has certain drawbacks. First, its to date high computational
cost makes it scarcely applicable in domains with hard real-time constraints, such as the car
domain. Second, the optical flow cannot be calculated at the borders of an image and is
error-prone due to ambiguities resulting from the aperture problem, illumination changes,
and camera noise [Willert et al., 2007]. Instead of detecting the motion of all image regions
based on the optical flow, the approach proposed here concentrates on the drivable street
plane alone, relying on the bird’s eye view (see Fig. 4.10c and Fig. 2.16a).
4.2.2 System Description
In the following, a rough overview of our approach of bird’s-eye-view-based temporal road
integration is given (see Fig. 4.11). Thereafter, all processing steps and their theoretical
background are described in more detail.
As input data our system uses 400x300 monocular gray value images and a binary map
of the currently detected street segment. The images are used for calculating the bird’s
eye view, which is a representation of the scene as viewed from above (see Fig. 4.12a and
Fig. 4.10c). In the following step, the bird’s eye view is used for detecting the motion of the
static vehicle environment based on Normalized Cross Correlation (NCC). Based on these
correlation results the current and past street segments are fused by temporal integration
on the bird’s eye view. The fused street segments are then mapped back to the perspective
view corresponding to the input image.
The system takes optional input data that improves the quality and makes the temporal
integration more robust. As such optional input data, stereo images as well as the longi-
tudinal ego velocity and yaw rate of the CAN bus of our prototype vehicle are processed.
The depth map that is calculated from stereo images (using the commercial “Small Vision
System” [Konolige, 1997], see Sect. 2.2.2) is the basis for correcting the changes in the pitch
and roll angle. An uncompensated change in the pitch and roll angles makes the bird's eye
view unstable in case the car brakes or the street profile is not flat. The CAN data is used
for predicting the motion of the car based on a single track model. The predicted motion
is used for determining the anchor for the correlation on the bird’s eye view. The usage of
CAN data makes the system faster. Still, without CAN data the detection quality is not
reduced.
In the following, the processing steps (as depicted in Fig. 4.11) are described in more
detail. First, the camera lens distortion is corrected. The undistorted vertical and horizontal
pixel positions v and u are computed from the initial (distorted) vertical and horizontal
pixel positions $v_d$ and $u_d$ based on:
$$u = (1 + k_1\beta^2 + k_2\beta^4)\,u_d + 2k_3 u_d v_d + k_4(\beta^2 + 2u_d^2) \quad (4.26)$$

$$v = (1 + k_1\beta^2 + k_2\beta^4)\,v_d + k_3(\beta^2 + 2v_d^2) + 2k_4 u_d v_d \quad (4.27)$$

with $\beta = \sqrt{u_d^2 + v_d^2}$.
The undistortion is based on a lens distortion model (described in [Heikkila and Silven,
1997]) that uses radial (k1 and k2) and tangential distortion coefficients (k3 and k4). The
undistortion step is essential in order to allow a correct mapping of the image pixels to
the bird's eye view. Since the bird's eye view is a metric representation, the undistortion
step makes sure that the proportions in the bird's eye view match those in the world.
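A sketch of Equ. (4.26)/(4.27) is given below; it follows the reconstruction above (standard radial/tangential model of [Heikkila and Silven, 1997]), and the coordinate normalization convention of the calibration toolbox is assumed rather than reproduced.

```python
import numpy as np

def undistort(ud, vd, k1, k2, k3, k4):
    """Undistortion following Equ. (4.26)/(4.27): k1, k2 radial and
    k3, k4 tangential coefficients; ud, vd are the distorted coordinates."""
    beta2 = ud**2 + vd**2                       # beta^2 = ud^2 + vd^2
    radial = 1.0 + k1 * beta2 + k2 * beta2**2
    u = radial * ud + 2.0 * k3 * ud * vd + k4 * (beta2 + 2.0 * ud**2)
    v = radial * vd + k3 * (beta2 + 2.0 * vd**2) + 2.0 * k4 * ud * vd
    return u, v
```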
Then the bird’s eye view is calculated on the undistorted pixels v and u based on
Equ. (4.28) and (4.29) by inverse perspective mapping of the 3D world points X, Y , and
Z to the 2D (u,v) image plane (see Fig. 4.12b for the notation in our coordinate system).
The equations describe how to map a 3D position of the world to the 2D image plane
(refer to [Broggi, 1995]). More specifically, only the image pixels (u,v) that are needed
to get a dense metric bird’s eye view plane are mapped into the XZ-plane. The usage of
inverse perspective mapping makes the inversion of Equ. (4.28) and (4.29) for calculating
the bird's eye view obsolete. Equations (4.28) and (4.29) use the three camera angles $\theta_X$, $\theta_Y$,
and $\theta_Z$, the three translational camera offsets $t_1$, $t_2$, $t_3$ (see Fig. 4.12b), the horizontal and
vertical principal point coordinates $u_0$ and $v_0$, as well as the horizontal and vertical focal lengths $f_u$ and
fv. The intrinsic (i.e., internal camera properties, like the focal length and the principal
point) and extrinsic (i.e., external camera properties, like camera angles and offsets) camera
parameters were determined using the freely available calibration toolbox [J.Y.Bouguet,
2007] and a calibration scene similar to the one described in [Marita et al., 2007].
Figure 4.11: System structure: Temporal road segment integration (the dashed module can be
exchanged with the road detection algorithm preferred by the user, optional module highlighted
in red).
As can be seen in Equ. (4.28) and (4.29) the 3D world position coordinates X, Y , and
Z of all image pixels (u,v) are needed:
$$u = -f_u\,\frac{r_{11}(X - t_1) + r_{12}(Y - t_2) + r_{13}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + u_0 \quad (4.28)$$

$$v = -f_v\,\frac{r_{21}(X - t_1) + r_{22}(Y - t_2) + r_{23}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + v_0 \quad (4.29)$$

with $Y = 0$,

$$R = R_X R_Y R_Z = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix},$$

and

$$\begin{aligned}
r_{11} &= \cos(\theta_Z)\cos(\theta_Y) \\
r_{12} &= -\sin(\theta_Z)\cos(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{13} &= \sin(\theta_Z)\sin(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{21} &= \sin(\theta_Z)\cos(\theta_Y) \\
r_{22} &= \cos(\theta_Z)\cos(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{23} &= -\cos(\theta_Z)\sin(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{31} &= -\sin(\theta_Y) \\
r_{32} &= \cos(\theta_Y)\sin(\theta_X) \\
r_{33} &= \cos(\theta_Y)\cos(\theta_X).
\end{aligned}$$
By using a monocular system, one dimension (the depth Z) is lost. A solution to this
dilemma is the so-called flat plane assumption. Here, for all pixels in the image, the height
Y is set to 0. Based on this, only objects in the image with Y = 0 (especially, the street we
are interested in) are mapped correctly to the bird’s eye view, while all the other regions
are stretched to infinity in the bird’s eye view (for example the cars in Fig. 4.10c).
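A sketch of this flat-plane inverse perspective mapping is given below; grid extent, resolution, and all calibration values are assumed example parameters. Each metric grid point on the Y = 0 plane is projected into the image via Equ. (4.28)/(4.29) and the gray value is sampled there.

```python
import numpy as np

def rotation(th_x, th_y, th_z):
    """R = R_X R_Y R_Z with the entries listed below Equ. (4.29)."""
    cx, sx = np.cos(th_x), np.sin(th_x)
    cy, sy = np.cos(th_y), np.sin(th_y)
    cz, sz = np.cos(th_z), np.sin(th_z)
    return np.array([
        [cz * cy, -sz * cx + cz * sy * sx,  sz * sx + cz * sy * cx],
        [sz * cy,  cz * cx + sz * sy * sx, -cz * sx + sz * sy * cx],
        [-sy,      cy * sx,                 cy * cx]])

def birds_eye_view(img, R, t, fu, fv, u0, v0,
                   x_rng=(-10.0, 10.0), z_rng=(0.0, 50.0), res=0.1):
    """Flat-plane inverse perspective mapping: every metric grid point
    (X, 0, Z) is projected into the image via Equ. (4.28)/(4.29) and the
    gray value is sampled there (nearest neighbor)."""
    X, Z = np.meshgrid(np.arange(x_rng[0], x_rng[1], res),
                       np.arange(z_rng[1], z_rng[0], -res))
    P0 = X - t[0]
    P1 = -t[1] * np.ones_like(X)                   # Y = 0 (flat plane assumption)
    P2 = Z - t[2]
    den = R[2, 0] * P0 + R[2, 1] * P1 + R[2, 2] * P2
    den = np.where(np.abs(den) < 1e-9, 1e-9, den)  # guard against the horizon
    u = (-fu * (R[0, 0] * P0 + R[0, 1] * P1 + R[0, 2] * P2) / den + u0).astype(int)
    v = (-fv * (R[1, 0] * P0 + R[1, 1] * P1 + R[1, 2] * P2) / den + v0).astype(int)
    bev = np.zeros(X.shape, dtype=img.dtype)
    ok = (u >= 0) & (u < img.shape[1]) & (v >= 0) & (v < img.shape[0])
    bev[ok] = img[v[ok], u[ok]]
    return bev
```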
In case this assumption is not fulfilled (i.e., the street surface is not flat) the bird’s eye
view is inaccurate, which leads to decreasing quality of the temporal integration. To allow
a stable bird’s eye view even in case of non-flat street surfaces and pitching of the vehicle,
stereo data from our stereo camera setup is used. In order to enhance the robustness of
the correction, only pixels that belong to the currently detected street segment are used
for surface estimation. More specifically, the differences between the coordinate axes and
the street surface in terms of the pitch ∆θX and roll angle ∆θZ , as well as the height of
the camera over the ground ∆t2 are computed:
Y = Y0 + aZ + bX (4.30)
∆θZ = atan(b) (4.31)
∆θX = atan(a) (4.32)
∆t2 = Y0. (4.33)
Figure 4.12: (a) Visualization of the bird’s eye view, (b) Coordinate system and position of
the camera (car is heading in Z-direction), (c) Single track vehicle model.
This is done based on the 3D position for all image pixels derived from the stereo disparity
(see Fig. 4.5 for 3D data of a sample image). The flat plane assumption Y = 0 is then
replaced by Y = f(X,Z) leading to an extended bird’s eye view. In our implementation a
first order model for the street surface (linear hyperplane) is used as shown in Equ. (4.30)
(see [Li et al., 2004] for more details). Results have shown that higher order models lead
to inferior performance. The reason for this is the restricted number of 3D measurement
points at the borders of the image, since only reliable pixels belonging to the detected
street are used for the surface estimation. Since the estimated surface is noisy (stereo
data is calculated based on an error-prone correlation between the left and right image), a
linear Kalman filter is applied to the parameters Y0, a, and b, which raises the performance
considerably. A possible improvement would be to use a model of the vehicle kinetics
(containing damper and spring characteristics, realistic distribution of the vehicle mass) for
the Kalman prediction (as proposed in [Cech et al., 2004]) instead of the linear prediction
model used here.
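The plane fit of Equ. (4.30)-(4.33) is a linear least-squares problem; a minimal sketch (the subsequent Kalman filtering of Y0, a, b is omitted):

```python
import numpy as np

def fit_street_plane(X, Y, Z, street_mask):
    """Least-squares fit of Y = Y0 + a*Z + b*X (Equ. 4.30) on the 3D points of
    the detected street segment; returns the plane parameters and the derived
    pitch/roll/height corrections (Equ. 4.31-4.33)."""
    xs, ys, zs = X[street_mask], Y[street_mask], Z[street_mask]
    A = np.column_stack([np.ones_like(zs), zs, xs])     # columns: 1, Z, X
    (Y0, a, b), *_ = np.linalg.lstsq(A, ys, rcond=None)
    d_theta_X = np.arctan(a)    # pitch correction, Equ. (4.32)
    d_theta_Z = np.arctan(b)    # roll correction,  Equ. (4.31)
    d_t2 = Y0                   # camera height correction, Equ. (4.33)
    return (Y0, a, b), (d_theta_X, d_theta_Z, d_t2)
```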
By NCC-based correlation between the current and the stored previous bird’s eye view
the vehicle motion (∆X, ∆Z) since the previous time step is detected. A single track
vehicle model, as depicted in Fig. 4.12c, predicts the starting point xt = xt−1 + ∆x and
zt = zt−1 +∆z of the NCC correlation patch of time step t-1 in the current bird’s eye view
map. The values ∆x and ∆z are calculated based on the sample time T , the distance of
the camera from the rear wheel l, as well as the yaw rate $\dot{\theta}_Y$ and longitudinal velocity $\dot{Z}$ from
the CAN bus (see single track model Equ. (4.34) and (4.35)):

$$\Delta x = \frac{\dot{Z}}{\dot{\theta}_Y}\left(1 - \cos(\dot{\theta}_Y T)\right) + \sin(\dot{\theta}_Y T)\,l \quad (4.34)$$

$$\Delta z = \frac{\dot{Z}}{\dot{\theta}_Y}\sin(\dot{\theta}_Y T) + \cos(\dot{\theta}_Y T)\,l - l. \quad (4.35)$$
The derived longitudinal and lateral motions as well as the rotational change (i.e., yaw angle)
between the current and the previous bird's eye view are stored along with the incremental
motion of the previous N = 40 frames (equivalent to 4 seconds of processing by our
prototype vehicle's vision system).
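A sketch of Equ. (4.34)/(4.35) with an explicit handling of the straight-driving limit (yaw rate → 0), which the closed-form expressions do not cover; the function name is hypothetical:

```python
import numpy as np

def anchor_shift(z_dot, yaw_rate, T, l):
    """Predicted shift (dx, dz) of the NCC anchor from the single track model
    (Equ. 4.34/4.35); the limit yaw_rate -> 0 is handled separately since the
    closed form divides by the yaw rate."""
    if abs(yaw_rate) < 1e-6:
        return 0.0, z_dot * T               # pure forward translation
    dx = z_dot / yaw_rate * (1.0 - np.cos(yaw_rate * T)) + np.sin(yaw_rate * T) * l
    dz = z_dot / yaw_rate * np.sin(yaw_rate * T) + np.cos(yaw_rate * T) * l - l
    return dx, dz
```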
The NCC correlation patch on the bird’s eye view is selected to contain enough structure
(using the entropy-based measure described in Sect. 3.3), which improves the accuracy of
the NCC. Furthermore, it is assured that the patch belongs to the detected street and
that it is not too far away from the ego vehicle, since the resolution of the bird’s eye view
decreases with growing distance to the vehicle.
The bird’s eye view maps of the detected street segments of the previous N = 40 frames
are calculated and stored. The stored incremental motion during the past 4 seconds is
integrated and used to shift all stored bird’s eye view street segments correspondingly.
Then the shifted previous 40 bird’s eye view street segments are weighted (weights αt) and
summed up by:
$$S_{\text{integ}} = \sum_{t=1}^{N} \alpha_t S_t \quad \text{with} \quad \sum_{t=1}^{N} \alpha_t = N. \quad (4.36)$$
Thereafter, the sum of the street segments Sinteg(X,Z) is related to the maximum pos-
sible number of overlaid street segments Smax(X,Z), which results in an Integrated Road
Probability Map (IRPM):
$$\text{IRPM} = \frac{S_{\text{integ}}(X,Z)}{S_{\max}(X,Z)}. \quad (4.37)$$
Please note that $S_{\max}(X,Z)$ changes depending on the position in the bird's eye view
map. The following final threshold operation determines the final temporally integrated
street segment $S_{\text{final}}$ in the bird's eye view representation:
$$S_{\text{final}} = \begin{cases} 1 & \forall\, \text{IRPM}(X,Z) \geq \beta \\ 0 & \forall\, \text{IRPM}(X,Z) < \beta. \end{cases} \quad (4.38)$$
The weight $\alpha_1$ in Equation (4.36) is set high to ensure that the pixels in the currently
detected street segment are with a high probability also present in the final temporally
integrated street segment. The other weights $\alpha_t$ could be set dynamically depending on a
quality measure of the bird's-eye-view-based NCC or of the road detection system, as well as
on the capturing time t. The threshold $\beta$ in Equ. (4.38) is currently set to 0.7. This means
that a pixel is classified as street if at least 70% of the overlaid past street segments have
voted for street.
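Steps (4.36)-(4.38) in a compact sketch; the segment maps are assumed to be already motion-compensated (shifted), as described above:

```python
import numpy as np

def integrate_segments(shifted_segments, weights, S_max, beta=0.7):
    """Temporal integration (Equ. 4.36-4.38). shifted_segments: (N, H, W)
    binary bird's-eye street maps, already motion-compensated; weights sum
    to N; S_max: per-pixel maximum number of overlapping segments."""
    S_integ = np.tensordot(weights, shifted_segments, axes=1)   # Equ. (4.36)
    irpm = S_integ / S_max                                      # Equ. (4.37)
    return (irpm >= beta).astype(np.uint8)                      # Equ. (4.38)
```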
Figure 4.13: Final morphological fill operation for closing spaces in the street segment that are
due to perspective mapping (justified openings are preserved): (a) Raw perspectively mapped
street segment, (b) After morphological closing.
Next, the final temporally integrated street segment $S_{\text{final}}$ is mapped back to the image
using Equ. (4.28) and (4.29). For this operation, the resolution of the street segment in
the bird's eye view representation needs to be high (achieved by upsampling by a
factor of 4) in order to allow a lossless perspective mapping of the street segment.
The perspective mapping step produces equidistant, periodic spaces in the street segment
directly in front of the car (see Fig. 4.13a). These spaces are filled using a morphological
close operation with a small morphological structuring element to prevent adding too
many false positive street pixels (see Fig. 4.13b). In other words, openings in the bird's
eye street segment (that, e.g., correspond to objects on the street) are retained in the final
perspectively mapped street segment. Such openings are explicitly checked for objects in
the implemented ADAS (see Chapter 5).
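A sketch of this closing step using OpenCV; the 3x3 elliptical structuring element is an assumed choice for the "small" element mentioned above:

```python
import cv2
import numpy as np

def close_mapping_gaps(road_mask):
    """Morphological closing with a small structuring element: fills the thin
    periodic gaps caused by the perspective mapping while keeping the larger,
    justified openings (e.g., objects on the street)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    return cv2.morphologyEx(road_mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
```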
The following section shows that the proposed temporal integration procedure results in
an enhanced street segmentation. The final detected street segment has fewer holes and is
dynamically more stable than that of other approaches, which allows complex path-related
applications.
4.2.3 Experiments and Results
In this section, we evaluate the performance of our system by applying it to the results of
the state-of-the-art road detection algorithm described in Sect. 4.1. As described before,
the proposed temporal integration approach can work on top of all road detection algo-
rithms for unmarked roads and is therefore interchangeable. Additionally, the required
computation time for the proposed temporal integration approach is given.
Figure 4.14 shows qualitative results of the various system modules of our system. The
depicted snapshot is part of a result stream showing our system running on 160 consecutive
frames of an inner-city course. The input images and stereo data used for the evaluation as
well as the ground truth data and results are accessible on the internet [BenchmarkData,
2008b] for open benchmark testing. Additionally, Fig. 4.14e shows a kind of 360◦ represen-
tation of the environment that is derived from the combination of all stored bird’s eye view
maps of the past 4 seconds. This representation builds up gradually after the algorithm
starts. It could be used for higher-level trajectory planning algorithms.
The white rectangle in Fig. 4.14b and d-f represents the position of our prototype vehicle,
while the black regions are outside the field of vision of our vehicle cameras.
Figure 4.14: (a) System input image, (b) Input image in bird’s eye view, (c) System input:
Detected road segment of road detection module, (d) Detected road segment in bird’s eye view,
(e) Temporal integration of bird’s eye view images of past 4 seconds, (f) Temporal integration
of detected road segments, (g) System output: Integrated road segment mapped back to the
perspective image.
In order to evaluate our algorithm with respect to its impact on the road detection
performance, we adopt the ground-truth-based measures (see Equations (4.23), (4.24),
and (4.25)) defined in Sect. 4.1.3. The necessary ground truth data was produced by
accurate manual annotation of the 440 test images (see Fig. 4.10a for a sample).
The three measures were then calculated on the detected street segments of 440 image frames of two inner-city streams. The gathered results are depicted in Tab. 4.7. There, the standard street detection algorithm without temporal integration is compared to our approach. Furthermore, our approach is compared to one that uses the optical flow for temporal integration (based on the state-of-the-art optical flow algorithm described in [Willert et al., 2006]), and finally to our approach using only the mandatory input data (i.e., without stereo data). In all four cases the same street detection algorithm was used in order to allow a fair comparison. As the results in Tab. 4.7 show, the highest
Table 4.7: Comparison of different methods for temporal integration.
Road detection approach                # test    Correct-   Comple-   Quality
(BEV: bird's eye view)                 images    ness       teness
No temp. integration                   440       98.1%      61.5%     60.5%
Temp. integration, BEV                 440       95.2%      94.1%     89.9%
Temp. integration, optical flow        440       92.6%      72.4%     68.1%
Temp. integration, BEV, no stereo      440       96.9%      84.0%     81.7%
Quality (89.9%, compared to the 60.5% of the initial street detection algorithm) is reached with temporal integration based on our algorithm. Without stereo data our algorithm still reaches a Quality of 81.7%. Optical-flow-based temporal integration reaches a Quality of merely 68.1%, which is due to the well-known aperture problem (see, e.g., [Willert et al., 2006]) and the illumination changes present in the streams. The initial road detection approach without temporal integration has the highest Correctness with 98.1%, but this comes at the cost of a reduced Completeness of merely 61.5%. Our temporal integration approach decreases the Correctness from 98.1% to 95.2%, but it increases the Completeness disproportionately (from 61.5% to 94.1%).
For further evaluation, Fig. 4.15 shows typical results of a standard street detection
algorithm compared to results gathered with the proposed temporal integration approach
based on 4 sample images of the inner-city stream.
For the experiments we use a Honda Legend prototype car equipped with a mvBlueFox CCD color camera from Matrix Vision delivering images of 800x600 pixels at 10 Hz, which is hence the processing rate our road detection module must at least reach. The image data as well as the laser and vehicle state data from the CAN bus are transmitted via LAN to several Toshiba Tecra A7 laptops (2 GHz Core Duo) running our RTBOS integration middleware [Ceravola et al., 2006] on top of Linux. The road detection component together
Figure 4.15: Example images (frames 24, 97, 105, and 147) of the used inner-city stream (left: standard approach, i.e., the input street segment for our approach; right: our approach after temporal integration). The last image is visually enhanced to improve its legibility when printed.
with other driver assistance components (see, e.g., [Michalke et al., 2007]) is implemented in C using an optimized image processing library based on the Intel IPP [Intel, 2006]. The road detection component is set to run on a single core.
Table 4.8 shows the computational demands of different sub-modules of the presented approach and compares them to the qualitatively inferior approach based on the optical flow (as shown in Tab. 4.7). The reasonably parameterized state-of-the-art optical flow
Table 4.8: Comparison of computational demands for temporal integration on the bird’s eye
view and using optical flow.
Module / sub-module                        Comp. time [ms]   (frame rate [Hz])
Temp. integration, BEV (sum)                 49.8            (≈ 20)
  Bird's eye view                             6.9
  Correlation sub-module                     14.7
  Temp. integration                          20.0
  Perspective mapping to image plane          8.2
Temp. integration, optical flow (sum)      >537.0            (≈ 2)
implementation (based on [Willert et al., 2006]) needs 537.0 ms (≈ 2 Hz), without taking into account further system modules that this approach additionally requires. The overall computation time of our temporal integration system amounts to 49.8 ms (≈ 20 Hz). Combined with the realized unmarked road detection system described in Section 4.1, real-time processing on our prototype vehicle is achieved (refer to Tab. 4.6).
4.3 Summary
In Chapter 4 an unmarked road detection system based on vision as the major cue was described and evaluated in real-time. At run time the system dynamically adapts central system parameters to the environment, allowing robust road detection under changing environmental conditions. More specifically, a road training region in front of the car is used in order to derive the visual properties of the road. Furthermore, two non-road training regions are used to determine how well the road can be separated from the rest of the scene in the current scenario. This separability information is used to parameterize the feature fusion processes of the six visual features the system relies on to detect the road. The visual features are processed iconically (i.e., the road likeliness of each independent pixel is determined) and region-based (i.e., the properties of neighboring pixels are taken into account). Furthermore, instead of relying on strong, rigid road models as proposed in the literature, the system presented here uses only a simple, dynamic road model and puts stronger weight on the visual feature information. The novel feature edge density (structure), computed on HSI hue and saturation, is introduced as a reliable cue for detecting the road. With the usage of on-off Difference of Gaussians filters for lane marker detection, a further robust, biologically inspired approach was included in the system. The lane marker information gathered in this way is fused with the detected unmarked road segments. The detected road segments match the ground truth data well in most situations. However, in case of shadows on the road the detected road segment contains holes and becomes temporally unstable.
In order to improve this situation, a generic tracking approach for unmarked road detection systems was introduced that is based on temporal integration. The system fuses the detected road segments of the past and present frames in the bird’s eye view and allows robust unmarked road detection in shady conditions. The temporal integration approach was tested on the road detection system described before, but is suitable to improve any state-of-the-art system for unmarked road detection.
Summarizing, in Chapter 4 the following novelties were introduced:

- The edge density (structure) is used as a novel feature for road detection, computed on the HSI hue and saturation maps,
- The usage of street and non-street training regions that both adapt the feature probability distributions during iconic feature processing in the unmarked road detection system,
- A combination of iconic (i.e., pixel-based) and region-based feature processing for road detection,
- A fusion between feature-based and dynamically adapting model-based road detection,
- A biologically motivated on-off DoG filter for lane marking detection, allowing the fusion of the detected lane markings with the detected unmarked road. Using on-off DoG ensures that only white lane markings on a darker background are detected, suppressing off-on contrasts like shadows or tar seams on the road (see the sketch after this list),
- A generic tracking approach based on temporal integration in the bird’s eye view, allowing the stabilization of road detection results of state-of-the-art unmarked road detection systems.
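As a minimal sketch of the on-off DoG idea (the sigma values are illustrative assumptions, not the parameters used in the thesis):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def on_off_dog(gray, sigma_center=1.0, sigma_surround=3.0):
        # Center-surround filtering: narrow center minus wide surround
        center = gaussian_filter(gray.astype(np.float32), sigma_center)
        surround = gaussian_filter(gray.astype(np.float32), sigma_surround)
        dog = center - surround
        # Half-wave rectification keeps on-off (bright-on-dark) responses,
        # such as white lane markings, and suppresses off-on contrasts,
        # such as shadows or tar seams
        return np.maximum(dog, 0.0)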
Both the road detection system and the temporal integration approach were tested online and in real-time on a test vehicle. The detected road is used as input cue for the Advanced Driver Assistance Systems described in Chapter 5, where it improves the performance of various ADAS modules and allows the development of complex driver assistance functionalities.
5 Integrated System Approaches for
Scene Interpretation
Following the preceding description of biologically motivated visual features in Chapter 2, the attention sub-system in Chapter 3, and the unmarked road detection sub-system in Chapter 4, in the current chapter all these approaches are combined into a generic, biologically inspired Advanced Driver Assistance System. After introducing some of the few existing biologically inspired driver assistance systems in Sect. 5.1, along with the major differences to our system approach, Sect. 5.2 describes the developed attention-based ADAS and the evaluation results we gathered in an online highway scenario. In Sect. 5.3 this system is extended by, among other things, fusing the detected road, which allows for robust operation in inner-city scenarios; a summary closes Chapter 5.
5.1 Related Work
Today’s Advanced Driver Assistance Systems effectively support the driver in clearly de-
fined traffic situations like keeping the distance to the forward vehicle. For this pur-
pose Radar sensors, Lidar sensors, and cameras are used to extract parameters of the
scene, like, e.g., headway distances, relative velocities, and relative position of lane mark-
ers ahead. Such approaches resulted in specialized commercial products improving the
driving safety (e.g., the “Honda Collision Mitigation Brake System” [Kodaka and Gayko,
2004, Kodaka et al., 2003] to help the driver to avoid rear end collisions in case the for-
ward vehicle brakes unexpectedly). Although traffic rules and road infrastructure, like, e.g., lane markings, restrict the complexity of what to sense while driving, the perception systems of today’s ADAS are capable of recognizing simple traffic situations only. Furthermore, driving in normal traffic scenes can be done mainly in a rather reactive way by staying in the middle of the lane and keeping an appropriate distance.
However, for assisting the driver over the full range of driving tasks, less reactive, more intelligent systems are required. The goal of realizing such Advanced Driver Assistance Systems (ADAS) can be approached from two directions: either searching for the best engineering solution or taking the human as a role model. As noted above, today’s ADAS are engineered for supporting the driver in clearly defined traffic situations. While it may be argued that the quality of an engineered system in terms of isolated aspects, e.g., object detection or tracking, is often sound, the solutions lack the necessary flexibility. Small changes in the task and/or environment often necessitate redesigning the whole system in order to add new features and modules, as well as adapting how they are linked. Taking the high quality of signal processing reached
in biology into account, one promising way for building such intelligent systems is to take
the human as a role model, mimicking known signal processing principles in the human
brain.
Recently, the topic of researching intelligent cars has gained increasing interest, as documented by the DARPA Urban Challenge [WWW, 2007a] and the European Information Society 2010 Intelligent Car Initiative [WWW, 2007b], as well as several European projects like, e.g., Safespot or PReVENT. As described in Chapter 1, the results gathered by such purely engineering-driven approaches are somewhat limited.
With regard to vision systems developed for ADAS, there have been few attempts to
incorporate aspects of the human visual system into complete systems. One of the most
prominent examples is a system developed in the group of E. Dickmanns [Dickmanns,
2004]. It uses several active cameras mimicking the active nature of gaze control in the
human visual system. However, the processing framework is not closely related to the
human visual system. Without a tunable bottom-up attention system and with top-down
aspects that are limited to a number of object-specific features for classification, no dynamic
preselection of image regions is performed.
With respect to attention-based approaches for the vehicle domain, a saliency-based
traffic sign detection and recognition system was proposed by [Ouerhani, 2003]. A fur-
ther biologically inspired system approach has been presented by [Farber, 2005]. This
publication as well as the recently started German Transregional Collaborative Research
Centre “Cognitive Automobiles” [Stiller et al., 2007] address mainly human-inspired be-
havior planning whereas our work focuses more on task-dependent perception aspects.
A vision system approach that is in some aspects related to the ADAS presented here is described by [Matzka et al., 2008]. Published after our work (see, e.g., [Michalke et al., 2007]), the approach allows for a simple attention-based decomposition of road scenes but without incorporating object knowledge or pre-knowledge. Additionally, the overall system organization is not biologically motivated and hence shows limitations in its flexibility.
For assisting the driver over the full range of driving tasks in all kinds of challenging situations and going beyond simple reactive behaviors, a more sophisticated task-dependent processing strategy is required. We see an adequate organization of perception using a generic vision system as a major challenge on the way to this target. When assessing biological vision systems, it becomes apparent that they are highly flexible and capable of adapting to severe changes in the task and/or the environment. Hence, one of our design goals on the way to an “all-situation” ADAS is to implement a biologically motivated, cognitive vision system as the perceptual front-end of an ADAS, which can handle the wide variety of situations typically encountered when driving a car. Note that only if an ADAS vision system attends to the relevant surrounding traffic and obstacles will it be fast enough to assist the driver in real-time during all dangerous situations.
More specifically, one possible biologically inspired way to solve this challenge is to realize
a task-dependent perception using top-down links. In this paradigm, the same scene can
be decomposed in different ways depending on the current task. A promising approach is
to use an attention system that can be modulated in a task-oriented way, i.e., based on
the current context. For example, while driving at high speed, the center of the field of
view becomes more important than the surround. Furthermore, only if the vision system attends fast enough to the relevant parts of the surrounding traffic and obstacles will it be able to assist the driver in all dangerous situations.
The computational model of the human attention system described in Chapter 3 is used
as front-end of a biologically inspired driver assistance system that determines the “how”
and “when” of scene decomposition and interpretation.
Recently, some authors have stressed the role of incorporating context into attention-based scene analysis. For example, [Torralba, 2003] proposes a combination of a bottom-up saliency map and a top-down context-driven approach. The top-down path uses spatial statistics, which are learned during an offline learning phase, to modulate the bottom-up saliency map. This differs from the system described here, where no offline spatial prior learning phase is required. In our online system, context is incorporated in the form of top-down weights that are modified at run time as well as by fusing road information.
To our knowledge, in the car domain no task-dependent tunable vision system that
mimics human attention processes exists.
5.2 Advanced Driver Assistance on Highways
Based on the paradigm of a task-dependent tunable vision system, Sect. 5.2 describes a
vision architecture that is being developed as perceptual front-end of an ADAS. The pro-
posed system provides a framework that enables task-dependent tuning of visual processes
via object-specific weighting of input features of the attention system described in Chap-
ter 3. The system generates an appropriate reaction in dangerous situations (autonomous
braking). Its architecture is inspired by findings of human visual system research and or-
ganizes the different functionalities in a similar way. For a first proof of concept, we focus
on assisting the driver during a critical situation in a construction site. The system has
been implemented using a software framework for component integration and is evaluated
on a number of test streams. It achieves real-time performance on a prototype car, which
has been demonstrated live on a testing range.
Section 5.2 is organized as follows: In Section 5.2.1 an overview of the system architecture and its individual components is provided. For the analysis of the attention system, we evaluated the construction site scenario to illustrate the performance of the top-down approach in a complex environment. The obtained results, demonstrating the feasibility and benefits of top-down attention in a complex ADAS, are described in Sect. 5.2.2.
5.2.1 System Description
In the following, a rough overview of the implemented vision system structure for driver
assistance is given. Subsequently, crucial system parts are described in more detail.
Overview
The overall architecture concept to realize task-based visual processing is depicted in
Fig. 5.1. It contains a distinction between a “what” and a “where” processing path,
somewhat similar to the human visual system where the ventral and dorsal pathway are
typically associated with these two functions. Among other things, the “where” path-
way in the human brain is believed to perform the localization and coarse tracking of a
small number of objects that are relevant to the current task. This tracking is performed
by the human visual system without focusing the eye gaze on individual objects to be
tracked [Cavanagh and Alvarez, 2005], i.e., tracking does not require high resolution. In
contrast, the “what” pathway considers the detailed analysis of a single spot in the image.
In the human visual system this is intimately bound to the current eye gaze, as the human
eye possesses a high resolution in the central 2-3◦ (foveal retina area) of the visual field
only.
In our vision system the eye gaze is realized virtually, as the camera mounted in the car has a constant resolution over the complete field of view. Changing the eye gaze is therefore equivalent to shifting the processing to another spot of the input image. This spot is analyzed in the “what” pathway in full resolution, while the whole image is analyzed in the “where” path in lower resolution. Processing in these two pathways is believed to occur in parallel in the human brain, but their interactions are not yet known in much detail. We here adopt the idea of continuously tracking a small number of objects in each image of the incoming visual stream to coarsely represent the current scene while at the same time acquiring more detailed information on one additional object. We therefore have two analysis processes running in parallel in our system, indicated by the two circular arrows in Fig. 5.1.
The detailed organization of the two processing streams in our architecture concept is
as follows: The input image is analyzed in the “what” path (depicted left in Fig. 5.1) for
salient locations using a variety of visual features including orientation, intensity, color,
and motion. This visual attention combines bottom-up (BU) and top-down (TD) path-
ways and is described in full detail in Chapter 3. The resulting saliency map S_total is modulated by suppressing image regions that contain known objects, i.e., that have been detected earlier. The system stores all detected objects in a so-called Short Term Memory (STM) that provides the position information of known objects as a top-down link. The
suppression of saliency areas is also known as Inhibition of Return (IoR) in the human
visual system [Klein, 2000]. The performance gain of using this IoR approach and the
influence on the STM will be shown in Sect. 5.2.2. A simple maximum search is used on
the resulting saliency map to find the currently most salient point in the scene, the Focus
of Attention (FoA). At this position the Region of Interest (RoI) is determined by region
growing on the overall saliency map using the FoA as seed. In the final step of the “what”
cycle, the resulting RoI as well as its position (pos) are fed to the fast feed-forward object
recognition system described in the following subsections (see page 106).
After object recognition, the image region, its position, and the object label (pos, RoI,
ID) are stored in the STM in order to be coarsely tracked in subsequent images in the
“where” path. Before insertion, it is checked whether the new object can be associated with
a known object based on its position, size, and label; if a matching object is found, the
object already stored in the STM is updated. One iteration is concluded by calculating
distance (dist) for all objects in the STM based on fusing measurements from Radar,
depth from familiar object size (i.e., object knowledge, see [Palmer, 1999]), and depth
Figure 5.1: Architecture concept of vision-based driver assistance system.
from bird’s eye view (see [Broggi, 1995] for the computation of the bird’s eye view) using an Extended Kalman Filter. Details on the Extended Kalman Filter are given in the following subsections (see page 107). The distance information is stored in a separate egocentric representation that is directly suitable for calculating the current danger level and generating a warning message if necessary.
All objects contained in the STM are constantly tracked in the “where” path using an appearance-based tracker. The tracker uses a second order motion model for predicting
object positions on the image plane and a local correlation step for the refinement of the
new object positions. In each iteration the position is updated in the STM and a new
template RoI is stored. In case the prediction does not match (no good correlation found)
the object is deleted from the STM and therefore its position will not be inhibited in the
“what” pathway anymore. Consequently, the attention will be focused on the missing
object in one of the next images if the object is still present and salient. This way, all
objects being recognized and behaving as predicted are coarsely tracked while the “what”
attention is always focused on new objects and objects behaving unexpectedly.
However, objects that can be tracked should not remain in the STM forever, as this would mean that the system cannot correct a wrong object label. This is avoided by deleting an object from the STM after N frames, i.e., objects have a lifetime of N frames.
of N frames. This is equivalent to limiting the capacity of the STM to N objects in scenes
with more than N objects. Note that the rather simple tracking method is sufficient for
many applications in the automotive domain where most objects are rigid (e.g., a car) and
therefore the main appearance changes are limited to small translations and scalings.
The novelty of our architecture lies in the introduction of top-down aspects (like, e.g.,
task-dependent tunable attention generation via sets of weights and, in parallel, inhibiting
known object positions predicted by tracking) resulting in the ability to cope with highly
dynamic traffic scenes using limited computational resources. The top-down tunable atten-
tion system is a key aspect of our ADAS, since such preprocessing leads to a considerable
reduction of scene complexity by restricting further processing steps to image regions that
are interesting according to the current system task. This not only saves computational resources but also implicitly reduces the number of false positive detections, as, e.g., the object classifier only gets RoIs that are likely to be a car based on their current saliency profile.
Attention Sub-System
The ADAS uses the biologically motivated attention sub-system as its generic visual front-
end (see Chapter 3 for a detailed description) that is tuned by applying 136 independent
modifiable feature weights. As BU weights w_i^BU we choose a set of weights that shows good performance for most situations in the car environment. In the object-unspecific bottom-up path no inhibition takes place (i.e., feature maps are only added up), since its purpose is to evaluate the general unspecific saliency of a scene. For modulating the TD attention in the ADAS described here, we currently use TD weight sets for signal boards and cars (w_i,sigboard^TD and w_i,car^TD) that were calculated in a supervised training step using Equ. (3.4) on page 59. In Sect. 5.3 this concept is extended to allow calculating these weights dynamically at runtime to track known objects and search for new objects.
The overall saliency map S_total, the output of the attention system, is calculated by linearly combining the normalized bottom-up S_BU and top-down S_TD saliency maps, dependent on the current task of the ADAS, using the parameter λ. With increasing λ, the top-down saliency contributes more to the final saliency map, leading to a focus of attention on specific objects. The overall saliency map is passed on to the FoA generation.
Object Recognition
For object recognition we use a view-based approach, where we perform classification only
on the image patch provided by the FoA segmentation. Note that object recognition
operates on the original image resolution of 800×600 pixels, i.e., the RoI position and size
provided by the saliency system are transformed appropriately.
The object recognition module is based on the biologically motivated processing architec-
ture proposed in [Wersing and Korner, 2003]. It uses a strategy similar to the hierarchical
processing in the “what” pathway of the human visual system by creating a classification
hierarchy. Unsupervised learning is used for the lower levels of the hierarchy to determine
general features that are suitable for representing arbitrary objects robustly with regard
to local invariance transformations like local shift and small rotations. Only at the highest level of the hierarchy is object-specific learning carried out, i.e., only this layer has to
be trained for different objects. This architecture can be applied to the difficult case of
segmentation-free recognition that we have to deal with as the saliency segmentation only
provides an approximate RoI with rectangular shape and no object-specific segmentation.
Training is done by presenting several thousand color RoI images with changing back-
grounds for back views of cars and signal boards (see also [Gepperth et al., 2007]). The
learning algorithm automatically extracts the relevant object structure and neglects the
clutter in the surround. The output of the classifier is the identity (ID) of the recognized object and a confidence value; a threshold is used to reject object hypotheses with low confidence. The threshold is chosen so that only a small number of false positives can
occur for cars, as a wrong car detection could lead to a false emergency braking. If a car is
not recognized due to the high threshold, it is stored in the STM as unknown and tracked
for N frames before it is removed from the STM. Subsequently, if the car is still a salient object, a new FoA will be generated and recognition is performed again. As the car may now be closer due to the ego motion of our vehicle, the image patch may be larger and may therefore yield a higher confidence, resulting in a correct recognition.
As described in Sect. 5.2.1, the presented system uses, in the “what” pathway, a cascade of attention-based object detection followed by appearance-based object classification. According to [Neisser, 1967], object recognition in human perception is organized in the same way. As argued above, the central hypothesis regarding the attention-based preselection presented here is that it saves computation time and lowers the number of false positive classifications due to the high relevance of the input data at the classifier stage. However, the question arises whether, in terms of computational demands, the approach is superior to an exhaustive classification of the whole image (e.g., by classifying overlapping image patches). As argued in [Frintrop, 2006], in case of a complex and thereby slow classifier, the advantages of an attention system are obvious. Since in the vehicle domain false detections might have severe consequences, a reliable and hence complex classifier [Wersing and Korner, 2003] was applied in the presented system.
Even for applications that allow the usage of fast (and less reliable) classifiers, such as Viola-Jones (see [Viola and Jones, 2001]), the usage of an attention system saves computational resources, as was shown in [Frintrop, 2006]. The results gathered there show that already in case of more than one object class, the computation time needed by the attention system is compensated by the need for fewer classifier cycles. Furthermore, based on numerous experiments, [Frintrop, 2006] could show that the number of false classifications is reduced when using an attention system for preselecting image regions as compared to applying exhaustive classification.
Depth Cues
The current ADAS uses the four independent depth sources introduced in Sect. 2.2 (see
Fig. 5.2) that are combined using weak fusion (see [Landy et al., 1995]). Weak fusion
combines the depth sources based on the reliability of the specific cues. It is realized
here using an Extended Kalman Filter (EKF) that combines at each time step the cues
via dynamic weights depending on the static sensor variances (calculated offline) and the
currently available depth sources. Note that not every cue is available in each time step.
The EKF uses a second order process model for its prediction step that models the relevant
kinematics in the car domain (velocity and acceleration). The cues show strong differences
in accuracy (especially depth from bird’s eye view and object knowledge show a high
variance). However, this is uncritical, since the sensor variances (that were determined
offline) are taken into account during the EKF-based sensor fusion, see also [Fritsch et al.,
2008]. The resulting depth values are assigned to detected objects in the image. In the
presented ADAS the following depth sources are used for fusion in the EKF (see Sect. 2.2
for more details on these depth cues):

- Depth from Radar,
- Depth from bird’s eye view,
- Depth from object knowledge,
- Depth from Stereo.
A prerequisite for depth from object knowledge is a reliable segmentation algorithm.
Currently we use histogram-based segmentation on an image region that is pre-segmented
by our region growing algorithm working on the saliency (see Fig. 5.2c).
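As a simplified scalar sketch of this variance-weighted weak fusion (the real system uses a full EKF with a second order process model; the function below is an illustrative assumption, not the thesis implementation):

    def weak_fusion_update(estimate, variance, cues):
        # cues: list of (measurement, sensor_variance) pairs for the depth
        # sources available in this time step; missing cues are simply omitted
        for z, r in cues:
            k = variance / (variance + r)  # gain: low-variance cues dominate
            estimate += k * (z - estimate)
            variance *= (1.0 - k)
        return estimate, variance

Fusing, for instance, a low-variance Radar measurement with a high-variance bird’s eye view estimate moves the result mostly toward the Radar value, matching the behavior described above.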
Figure 5.2: Used depth cues: Depth from (a) Radar, (b) Bird’s eye view, (c) Object knowledge.
5.2.2 Experiments and Results
Evaluation of Depth Fusion
Figure 5.3 shows the EKF-based fusion of depth measurements for a car that drives in front
of our prototype vehicle through an inner-city scenario (see Fig. 5.2). For the EKF we
used the sensor variances σ_radar = 0.3, σ_birds = 2.8, and σ_obj = 2.7 as well as the process variance σ_process = 0.023 for the prediction step. Note that the usage of two additional
monocular depth cues of high variance fused with the low variance Radar cue ensures the
availability of depth values even if the interesting objects are outside of the Radar beam.
Figure 5.3: Depth from bird’s eye view, object knowledge, Radar and fusion with EKF.
Experimental Setup for System Evaluation
Scenario: In order to evaluate the proposed system in a challenging situation, we concen-
trate on typical construction sites on highways. This situation is quite frequent and a traffic
jam ending exactly within a construction site is a highly dangerous situation: due to the
S-curve in many construction sites, the driver will notice a braking or stopping car quite
late as the signal boards limit the field of view (see Fig. 5.4a). Our ADAS implementation
uses a 3-phase danger handling scheme depending on the distance and relative speed of a
recognized obstacle. For example, when the vehicle drives at around 40 km/h and a static
obstacle is detected in front at less than 33 meters, in the first warning phase a visual
and acoustic warning is issued and the brakes are prepared. If the dangerous situation is
not resolved by the human driver, the second phase triggers the belt pretensioner and the
brakes are engaged with a deceleration of 0.25 g followed by hard braking of 0.6 g in the
third phase.
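A minimal sketch of such a 3-phase scheme is given below; only the 33 m warning distance at roughly 40 km/h and the 0.25 g / 0.6 g decelerations come from the text, while the phase-2 and phase-3 distance boundaries are purely illustrative assumptions:

    def danger_phase(distance_m, warn_dist_m=33.0):
        # Phase boundaries below warn_dist_m are assumed for illustration
        if distance_m >= warn_dist_m:
            return "no action"
        if distance_m >= 0.66 * warn_dist_m:
            return "phase 1: visual/acoustic warning, prepare brakes"
        if distance_m >= 0.33 * warn_dist_m:
            return "phase 2: belt pretensioner, brake at 0.25 g"
        return "phase 3: hard braking at 0.6 g"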
Figure 5.4: Scenario: (a) Schematic sketch of the construction site scenario. Stationary car
is visible from 48 meters on. (b) Real scenario.
Technical setup: For the experiments we used a Honda Legend prototype car equipped
with a mvBlueFox CCD color camera from Matrix Vision delivering images of 800×600
pixels at 10 Hz. The image data as well as the Radar and vehicle state data from the CAN
bus can be recorded. The recorded data is used during offline evaluation. For online pro-
cessing all data is transmitted via Ethernet to two laptops (2 GHz Core Duo) running our
RTBOS (Real-Time Brain-like Operation System) integration middleware [Ceravola et al.,
2006] on top of Linux. The individual RTBOS components are implemented in C using an
optimized image processing library based on the Intel Performance Primitives (IPP) [Intel, 2006].
Test data for training and evaluation: In order to gain sufficient training data and
for evaluating the actual system performance, we set up an exemplary construction site on
a private driving range where we recorded data and performed the actual online tests.
Influence of Parameters on Detection Performance
All results described in the following are obtained by averaging over 10 recorded streams in order to lessen the influence of statistical outliers. As performance metric we will use the detection
distance as this is a good indicator for the efficiency of the saliency system in analyzing
complex visual scenes under time constraints. As in each time step of the system running at 10 Hz one FoA is analyzed in the “what” pathway and potentially added to the STM, we will use [frames] (equivalent to 1/10 second) as the time unit.
In the first step the object detection distance is evaluated depending on STM size N and
the TD parameter λ (setting the amount of TD influence) while using a TD weight set
trained on cars. Figure 5.5 shows the distance to the stationary car when the first FoA hits
Figure 5.5: Stationary car detection distance depending on the TD attention parameter λ=0,
0.25, 0.5, 0.75, and 1 as well as the STM size N=1,2,3,5, and 7 when using ground truth for
detecting a hit.
the car, where a hit is defined by hand-labeled ground truth on the recorded streams. It can be
seen that the larger the TD influence (search task: find cars) expressed by λ, the earlier
the car is detected. Similarly, the more objects are stored in the STM (object number N),
the earlier the car is detected as a large part of the visual scene is already contained as
(unknown) objects in the STM and therefore inhibited in the saliency map. It can also
be deduced that with growing N the influence of TD is reduced since the scene coverage
increases.
Including the task of object recognition in the evaluation, Fig. 5.6 shows the distance to
the stationary car when the first FoA hits the target and this RoI is recognized as car by
the object classifier. Since the used classification threshold was set high to obtain a low
false-positive error rate at the cost of a high false-negative error rate, the distance when the
car is detected is smaller than in the evaluation with ground truth. Differing from Fig. 5.5, at large values of N (see Fig. 5.6 for N=7) the detection distance worsens again. The reason for this effect is that our system does not use object segmentation algorithms but performs segmentation directly on the saliency image, which can lead to enlarged patches that suppress the surround of the found objects as well. In this way, the borders of the car might be suppressed by adjacent signal board patches, leading to incomplete car FoAs that are not sufficient for correct classification by the used object classifier. The likelihood of this happening grows with the size N of the STM. However, a growing N also improves the scene coverage. This trade-off leads to the measured results.
Figure 5.6: Stationary car detection distance depending on the TD attention parameter λ=0,
0.25, 0.5, 0.75, and 1 as well as the STM size N=1,2,3,5, and 7 when using the classifier for
detecting a hit.
Based on Fig. 5.6 the best choice of λ for detecting cars would be 1, which equals pure
TD search mode. However, such a parameterization is not appropriate because this leads to
a reduced capability of detecting other objects that are only prominent in the BU saliency
(see Fig. 5.7). Here we see that with growing λ the average detection distance of signal
boards (the only other object class besides cars in our evaluation) drops. Stated differently,
the system ignores all other objects while searching for cars in pure TD mode (λ = 1),
which might lead to dangerous situations. The default value for λ was hence set to 0.5 for
the online tests.
Figure 5.7: Detection distance depending on the TD attention parameter λ=0, 0.25, 0.5,
0.75, and 1. Average detection distance of signal boards and the stationary car using the
object classifier for an STM size of N=1,2, and 5.
In the previous evaluations we assumed that the scene contains more than N objects and used a fixed STM size, which is equivalent to storing any object for N frames independent of, e.g., whether it was correctly recognized. We now introduce an object-specific Time To Live (TTL) defining for how many frames an object is stored in the STM before it is removed. In this way, unknown objects can be tracked for only a short time before a new recognition attempt is carried out if the image region is still salient. Figure 5.8 shows how the choice of the TTL influences the system performance. For an object-unspecific TTL of 5 frames the curve is identical to Fig. 5.7 for N=5. For the object-specific case we choose TTL_sigboard = 6 frames for signal boards, TTL_cars = 20 frames, and TTL_unknown = 3 frames, leading on average to N=5 objects in the STM for the construction site streams. Note that the low value of TTL_unknown and the high value of TTL_cars both support setting the object recognition threshold high: it is very likely to get an unknown object that is in fact a car (a false negative), but rather unlikely to get a car detection that is a false positive.
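A hedged sketch of this object-specific TTL bookkeeping (the TTL values are the frame counts from the text; the data layout and function name are illustrative assumptions):

    TTL = {"sigboard": 6, "car": 20, "unknown": 3}  # lifetime in frames

    def age_stm(stm):
        # stm: list of dicts with keys "label" and "age" (frames in the STM)
        kept = []
        for obj in stm:
            obj["age"] += 1
            if obj["age"] < TTL.get(obj["label"], TTL["unknown"]):
                kept.append(obj)  # keep tracking and inhibiting this object
            # else: the object is dropped, its region becomes salient again,
            # and a new recognition attempt can be made
        return kept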
A clear gain in detection performance can be seen when using object-dependent TTL
Figure 5.8: Stationary car detection distance depending on the TD attention parameter λ=0,
0.25, 0.5, 0.75, and 1 while using object-unspecific and object-specific TTL values.
values, which is due to the fact that FoAs that hit the car very early are often too small for a reliable classification. These unknown scene parts are suppressed only for 3 frames before the classifier gets a second chance to detect the car. This object-specific TTL parameterization was used during the online tests described below.
Evaluation of System Performance
We evaluated the warning generation in detail offline on the 10 recorded construction site streams also used for evaluation in the previous subsection. In all streams, the ADAS was able to recognize and track the car from a distance of between 42 and 32 meters, while the car was fully visible from a distance of about 48 meters.
During documented online system tests in the setting depicted in Fig. 5.4, with our prototype vehicle driving 40 km/h, our system detected the stationary car in time in 57 of 60 cases and issued the 3 warning phases as expected, including autonomous braking. In the remaining cases, either the object recognition detected a signal board as a car and the braking was performed too early, or the FoA generation did not deliver a good car RoI position, so that the fusion of the car RoI with Radar data failed and no warning/braking was performed at all. Note that in our vision-based proof-of-concept system we completely rely on vision and do not make use of an additional Radar-based emergency braking, which would be needed in real traffic as a backup for situations in which our vision system fails.
In the following Sect. 5.3, an extended ADAS is presented that improves the system
presented so far in several aspects.
5.3 Advanced Driver Assistance in Inner-City
In this section, we present a highly integrated vision architecture for an advanced driver
assistance system inspired by human cognitive principles. As in Sect. 5.2, the system uses
an attention system as the flexible and generic front-end for all visual processing, allowing
a task-specific scene decomposition and search for known objects (based on a short term
memory) as well as generic object classes (based on a long term memory). Knowledge
fusion, e.g., between an internal 3D representation and a reliable road detection module, improves the system performance. The system heavily relies on top-down links to modulate
lower processing levels, resulting in a high system robustness.
While Sect. 5.2 concentrated mainly on the usage of saliency-based attention in the system context (see also [Fritsch et al., 2008, Michalke et al., 2009a, 2008a]), this section describes the additional incorporation of environmental 3D representations and static domain-specific tasks, in order to use context information (“where is the road”) to guide attention and, therefore, the analysis of the overall scene (see also [Michalke et al., 2008b]). For all acquired information our enhanced system builds up internal 3D representations that support scene analysis and at the same time serve for behavior generation. Using a metric representation of the road area in combination with detected traffic objects, the system can focus its processing on relevant objects in the context of the current road area. For example, this allows the system to warn and brake autonomously if a parked car is detected on our lane and, while by-passing it, the pro-actively adapted attention detects oncoming traffic on the road.
5.3.1 System Description
The proposed overall architecture concept for a robust attention-based scene analysis is
depicted in Fig. 5.9. It consists of four major parts: the “what” pathway, the “where”
pathway, a part executing static domain-specific tasks, and the behavior generation. The
distinction between “what” and “where” processing path is somewhat similar to the hu-
man visual system where the dorsal and ventral pathway are typically associated with
these two functions (see, e.g., [Palmer, 1999]). Among other things, the “where” pathway
in the human brain is believed to perform the localization and tracking of a small number
of objects. In contrast, the “what” pathway considers the detailed analysis of a single spot
in the image (see theories of spatial attention, e.g., spotlight theory [Palmer, 1999]). Nev-
ertheless, an ADAS also requires specific information of the road and its shape, generated
by the static domain-specific part.
The “What” Pathway
Starting in the “what” pathway the 400x300 color input image is analyzed by calculating
the saliency map Stotal. The saliency map Stotal results from a weighted linear combination
Figure 5.9: System structure allowing attention-based scene analysis (see page 117 for re-
maining system graph).
of N = 136 biologically inspired input feature maps Fi:
S_total = λ Σ_{i=1}^{N} w_i^TD F_i + (1 − λ) Σ_{i=1}^{N} w_i^BU F_i .        (5.1)
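A minimal Python sketch of Equation (5.1); all names are illustrative, as the actual system is implemented in C:

    import numpy as np

    def total_saliency(features, w_td, w_bu, lam):
        # features: list of N feature maps F_i (2D arrays of equal shape)
        # w_td, w_bu: TD and BU weight vectors of length N
        # lam: relative importance of TD vs. BU search, lambda in [0, 1]
        td = sum(w * F for w, F in zip(w_td, features))
        bu = sum(w * F for w, F in zip(w_bu, features))
        return lam * td + (1.0 - lam) * bu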
More specifically, we filter the image using, among others, Difference of Gaussians (DoG) and Gabor filter kernels that model the characteristics of neural receptive fields measured in the mammalian brain. Furthermore, we use the RGBY color space [Frintrop, 2006] as attention feature, which models the processing of photoreceptors on the retina (see Sect. 2.1.3 for details on the computation of the color feature). Additionally, with the incorporation of differential images and an approach for the detection of moving objects, dynamic features are included in the system (see Sect. 2.3). All features are computed on 5 scales relying on the well-known principle of image pyramids in order to allow computationally efficient filtering (see Annex A.1). All feature maps are postprocessed non-linearly in order to suppress noise and boost conspicuous or prominent scene parts (see Sect. 3.3 and [Michalke et al., 2008c] for a detailed description of these nonlinear processing steps).
The top-down (TD) attention can be tuned (i.e., parameterized) task-dependently to search for specific objects. This is done by applying a TD weight set w_i^TD that is computed and adapted online based on Equation (5.2), where the threshold is φ = K_conj · max(F_i) with K_conj ∈ (0, 1] (see Fig. 5.10a for a visualization). Equation (5.2) is equivalent to Equ. (3.4)
on page 59 in Chapter 3, but is reformulated to match the following description of the online weight computation. The weights w_i^TD dynamically boost feature maps that are important for our current task/object class in focus and suppress the rest. The bottom-up (BU) weights w_i^BU are set object-unspecifically in order to detect unexpected, potentially dangerous scene elements. The parameter λ ∈ [0, 1] (see Equation (5.1)) determines the current relative importance of TD and BU search in the system. For more details on the attention system please refer to [Michalke et al., 2008a] and Chapter 3. It is important to note that the TD weights (calculated using Equation (5.2)) depend on the features present in the background (rest) of the current image, since the background information is used to differentiate the searched object from the rest of the image [Frintrop, 2006]:
w_i^TD = {  m_RoI,i / m_rest,i      for m_RoI,i / m_rest,i ≥ 1
         { −m_rest,i / m_RoI,i      for m_RoI,i / m_rest,i < 1        (5.2)

with m_{RoI,rest},i = ( Σ_{(u,v) ∈ {RoI,rest}} F_i(u,v) ) / size of region {RoI,rest}

and F_i(u,v) = { F_i(u,v)   ∀ (u,v) with F_i(u,v) ≥ φ
              { 0           else.
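The following Python sketch mirrors Equation (5.2); the boolean RoI mask and the K_conj value are illustrative assumptions:

    import numpy as np

    def td_weight(F, roi_mask, k_conj=0.5):
        # Threshold phi = K_conj * max(F_i) suppresses weak activations
        phi = k_conj * F.max()
        Ft = np.where(F >= phi, F, 0.0)
        m_roi = Ft[roi_mask].mean()    # mean activation on the object (RoI)
        m_rest = Ft[~roi_mask].mean()  # mean activation on the background
        # Boost features that separate the object from the background and
        # give a negative weight to those dominated by the background
        return m_roi / m_rest if m_roi >= m_rest else -m_rest / m_roi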
117
5 Integrated System Approaches for Scene Interpretation
Because of this, it is not sufficient to store the TD weight sets w_i^TD of different object classes directly and switch between them during online processing. Instead, an aggregated form of all object feature maps F_i,RoI is stored (equivalent to the value m_RoI,i in Equ. (5.2)). To compensate for the dependency on the background, the stored object feature maps are fused with the feature maps of the current image before calculating the TD weights. In plain words, the system takes the current scene characteristics (i.e., its features) into account in order to determine the TD weight set that shows maximum performance in the current frame. Put differently, the described separability approach includes the current scene context on a sensory level.
As described in Sect. 5.2, we detect the maximum on the current saliency map S_total and get the focus of attention (FoA) by generic region-growing-based segmentation on S_total.
In the following, only the FoA is classified using a state-of-the-art object classifier that is
based on neural nets [Wersing and Korner, 2003]. This procedure (attention generation,
FoA segmentation and classification) models the saccadic eye movements of mammals,
where a complex scene is scanned and decomposed by sequential focusing of objects in
the central 2-3◦ foveal retina area of the visual field. The system uses a time integrating
mechanism to decide on the object class, in order to improve the reliability of the classifier
decision. More specifically, all detected objects are tracked and reclassified in the following
frames. On each frame a majority decision (voting) on the current and all stored classifier
results decides on the object class.
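A minimal sketch of this voting over stored classifier results (the data layout is assumed for illustration):

    from collections import Counter

    def voted_class(stored_labels):
        # stored_labels: classifier decisions for one tracked object from
        # the current and all past frames; the majority label wins
        label, _ = Counter(stored_labels).most_common(1)[0]
        return label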
The proposed system incorporates the biologically motivated concept of TD links. Based on these links, information on higher levels of knowledge integration modulates lower levels of knowledge integration. This brain-like concept improves robustness, increases the relevance of input data for higher system levels, and accelerates the system reaction (see evaluation results in Sect. 5.3.2). Our system uses such links for the task-specific modulation of the TD attention (i.e., by adapting system parameters online, e.g., the previously described TD weights w_i^TD) and for suppressing the detected road as context information in all feature maps F_i before fusing them into the overall saliency S_total. Additionally, TD links are used for the modulation of the attention based on detected car-like openings in the found drivable road segment (see “where” path in Fig. 5.9). Such car-like openings are detected by searching for car-sized holes in the road segment that is transformed to the metric bird’s eye view (for an example see Fig. 5.12d) by inverse perspective mapping. In a nutshell, the bird’s eye view is the representation of the scene as viewed from above, computed by transforming a monocular camera image taking intrinsic and extrinsic camera parameters into account (refer to [Michalke et al., 2008c] and Annex A.3 for more details).
The “Where” Pathway
The next step is the fusion between the newly detected object and the already known ones. The result is further processed in the “where” pathway and stored in the short term memory (STM). The objects in the STM are then suppressed in the currently calculated saliency map to enable the system to focus on new objects. The principle of suppressing known objects has been shown to exist in the human visual system as well and is termed inhibition of return (IoR); refer to [Klein, 2000] for details.
All known objects are tracked using a 2D tracker that is based on normalized cross correlation (NCC). The tracker gets its anchor (i.e., the 2D pixel position where the correlation-based object search on the new image is started) from a Kalman-filter-based prediction on the 3D representation, taking the motion of the camera vehicle and the tracked object into account. The predicted 3D position is transformed to 2D pixel positions (u,v) using a pinhole camera model that contains all intrinsic and extrinsic camera parameters (in detail, these are the 3 camera angles θX, θY, and θZ, the 3 translational camera offsets t1, t2, t3, the horizontal and vertical principal point coordinates u0 and v0, as well as the horizontal and vertical normalized focal lengths fu and fv); refer to Equations (A.1) and (A.2).
In case the NCC tracker is able to re-detect the object in 2D pixel coordinates, the 3D
position in the representation is updated using 4 different depth cues for the 2D pixel (u,v)
to 3D world (Xobj, Yobj, Zobj) transformation. More specifically, our system uses stereo
data, Radar, depth from object knowledge, and depth from bird’s eye view (see Fig. 5.12
and [Fritsch et al., 2008, Michalke et al., 2008c] for more details on these cues). The cur-
rently available depth cues are combined using the biologically motivated principle of weak
fusion (see [Landy et al., 1995]). Weak fusion combines the depth sources based on their
reliability (i.e., sensor variances). The fusion is realized using an Extended Kalman Filter
(EKF) that combines the cues based on dynamically adapted weights depending on the
static predefined sensor variances and the currently existing depth sources, as not every
cue is available in each time step. The EKF uses a second order process model for its
prediction step that models the relevant kinematics of the car (velocity and acceleration).
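The reliability-based weighting underlying the weak fusion can be illustrated by the following sketch, which reduces the EKF update to a plain inverse-variance weighted mean over the currently available cues; the variance values are made-up examples, not the offline-determined ones of the real system.

import numpy as np

# assumed per-cue measurement variances in m^2 (illustrative values only)
SENSOR_VAR = {"stereo": 0.5, "radar": 0.1,
              "object_knowledge": 4.0, "birds_eye_view": 6.0}

def fuse_depth(measurements):
    """measurements: cue name -> measured depth in m; only the cues
    available in the current time step are passed in."""
    w = np.array([1.0 / SENSOR_VAR[cue] for cue in measurements])
    z = np.array(list(measurements.values()))
    return float(np.dot(w, z) / w.sum())  # reliable cues dominate

print(fuse_depth({"stereo": 26.1, "radar": 25.4}))  # close to the Radar value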
Objects whose updated position leaves the represented surrounding scene or whose Kalman
variances are too high (i.e., they received no new measurements for several frames) are
deleted from the STM. The concept of appearance-based 2D tracking (analysis of motion
in 2D) supported by a 3D representation (interpretation of motion in 3D) was found in
humans as well [Palmer, 1999]. From a technical point of view, the advantage of this
approach is the simple correction of the ego motion relying on the internal 3D representa-
tion. The vehicle ego motion (translations ∆Xe and ∆Ze, as well as the change of the yaw
angle ∆θX) is determined based on a standard single track model and compensated in the
Kalman prediction step (see Equation (5.3) and (5.4) for the state vector E and process
model A):
$$E = \begin{bmatrix} Z_{obj} & X_{obj} & v_{Z,obj} & v_{X,obj} & 1 \end{bmatrix}^{T} \qquad (5.3)$$

$$A = \begin{bmatrix}
\cos(\Delta\theta_X) & \sin(\Delta\theta_X) & T & 0 & -\Delta Z_e \\
-\sin(\Delta\theta_X) & \cos(\Delta\theta_X) & 0 & T & -\Delta X_e \\
0 & 0 & \cos(\Delta\theta_X) & \sin(\Delta\theta_X) & 0 \\
0 & 0 & -\sin(\Delta\theta_X) & \cos(\Delta\theta_X) & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix} \qquad (5.4)$$
Therefore, no computationally intensive optical-flow-based prediction is needed. The
main source of strong object motion in the 2D image is removed by correcting for the
ego-motion-based position change of the objects, which eases the tracking task considerably.
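A minimal sketch of this ego-motion-compensated prediction step is given below, assuming the state of Equation (5.3) augmented by a constant 1 so that the affine offsets of Equation (5.4) apply; T denotes the cycle time and the increments come from the single track model.

import numpy as np

def predict_state(E, dtheta, dZe, dXe, T=0.1):
    """E: state [Z_obj, X_obj, v_Z, v_X, 1]; dtheta: yaw angle change;
    dZe, dXe: ego translation between two frames (Equation (5.4))."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    A = np.array([[ c,  s,  T,  0, -dZe],   # rotate and shift the position
                  [-s,  c,  0,  T, -dXe],
                  [ 0,  0,  c,  s,  0  ],   # rotate the velocity part
                  [ 0,  0, -s,  c,  0  ],
                  [ 0,  0,  0,  0,  1  ]])  # keep the homogeneous 1
    return A @ E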
A comparison between the current Kalman-fused 3D object position Pt = [Zobj, Xobj]
and the predicted 3D object position P′t decides, based on the state variances σ²_{Pt}
and σ²_{P′t}, if the tracked object is static or dynamic (see Fig. 5.10b). P′t is calculated by an
ego-motion-based prediction starting from the stored Kalman-fused value Pt−4. For the
comparison, βth is used as a threshold on the measure β(Pt, P′t) defined in:
$$\beta(P_t, P'_t) = \left\lVert \frac{P_t - P'_t}{\sqrt{\left|\sigma^2_{P_t}\right| + \left|\sigma^2_{P'_t}\right|}} \right\rVert \qquad (5.5)$$
The calculated measure is motivated by a statistical parameter test that checks for
the equality of two distributions. It showed good performance on various test streams. If
β(Pt, P′t) is bigger than βth (i.e., the object is detected to be dynamic), the Kalman filter
receives the object ego motion v_{Z,obj} ≠ 0 and v_{X,obj} ≠ 0, derived from the integrated
object position change D^{obj,ego}_t, as measurement (see Fig. 5.10b).
Figure 5.10: (a) Visualization of the object training region (RoI) for TD weight calculation
against the background (rest), (b) Prediction of object ego motion (dots: Kalman tracked
object position, squares: predicted object position including measured ego motion, dashed line:
accumulated object ego motion)
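A compact sketch of the static/dynamic decision of Equation (5.5) could look as follows; the threshold value βth used here is an illustrative assumption.

import numpy as np

def is_dynamic(P_t, P_pred, var_t, var_pred, beta_th=2.0):
    """P_t: Kalman-fused position [Z, X]; P_pred: ego-motion-only
    prediction from P_{t-4}; var_t, var_pred: the state variances."""
    beta = np.linalg.norm((np.asarray(P_t) - np.asarray(P_pred)) /
                          np.sqrt(np.abs(var_t) + np.abs(var_pred)))
    return beta > beta_th  # True -> the object moves on its own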
From a representational point of view, the "where" pathway of our system consists on
the one hand of the STM, which stores all properties of sensed objects in a 3D representation,
and on the other hand of a long term memory (LTM), which stores the generic properties
of object classes. The LTM is filled offline with typical patches and corresponding feature
maps F_i of specific object classes. For evaluation purposes we use cars, reflection posts,
and signal boards as LTM content, but our system can detect any other object type as
well, if the attention and the object classifier are trained accordingly. In the default state
the system searches for the generic LTM object class car. This is done by calculating the
geometric mean of all TD weight sets of the LTM objects, which were computed based on
Equation (5.2). These weights tune the TD attention in the "what" pathway.
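The combination of the LTM weight sets can be sketched as follows; the geometric mean is taken per feature map over all stored patches of the searched class (assuming strictly positive weights from Equation (5.2)).

import numpy as np

def class_weight_set(ltm_weight_sets):
    """ltm_weight_sets: array of shape (num_patches, num_feature_maps)
    holding the TD weight sets of all LTM patches of one object class.
    Returns the geometric mean weight per feature map."""
    return np.exp(np.log(ltm_weight_sets).mean(axis=0))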
As described above, in case the tracker has re-detected the object in the current frame,
the 3D representation is updated. In case the tracker loses the object, the system searches
for the lost STM object in the following frames. This is realized by calculating a TD
weight set that is specific to the lost STM object using Equation (5.2). The object O_f
found by the STM search is then compared to the searched object O_s by means of the
distance measure δ(O_f, O_s), which is based on the Bhattacharyya coefficient (a measure for
determining the similarity between two histograms) calculated on the histograms H_i^{O_f}
and H_i^{O_s} of all N object feature maps:
$$\delta(O_f, O_s) = \sum_{i=1}^{N} \sqrt{1 - \gamma\!\left(H_i^{O_f}, H_i^{O_s}\right)} \qquad (5.6)$$

$$\gamma\!\left(H_i^{O_f}, H_i^{O_s}\right) = \sum_{\forall u,v} \sqrt{H_i^{O_f}(u,v) \, H_i^{O_s}(u,v)}$$
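A direct transcription of Equation (5.6) is sketched below; the histograms are assumed to be normalized to sum 1 per feature map, so the Bhattacharyya coefficient stays within [0, 1].

import numpy as np

def bhattacharyya_coeff(h1, h2):
    # gamma in Equation (5.6): overlap of two normalized histograms
    return float(np.sum(np.sqrt(h1 * h2)))

def object_distance(H_f, H_s):
    """H_f, H_s: lists of per-feature-map histograms of the found and
    the searched object; a small delta means similar objects."""
    return sum(np.sqrt(1.0 - min(bhattacharyya_coeff(a, b), 1.0))
               for a, b in zip(H_f, H_s))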
The LTM and STM object search run in parallel as visually indicated in Fig. 5.9. It is
important to note that our system is not restricted to the detection and tracking of cars,
reflection posts, and signal boards. By using different LTM object patches and by offline
training of our object classifier in combination with the generic concept of online tunable
TD attention our system is highly dynamic and flexible.
Static Domain-Specific Tasks
The third major part of our system handles the domain-specific tasks of marked and un-
marked lane detection. The marked lane detection is based on a standard Hough transform
whose input signal is generated by our generic attention system. The scale-selective TD
attention weight set used here boosts white and yellow structures on a darker background
(so-called on-off contrast), to which the biologically motivated DoG filter (see Sect. 2.1.1) is
selective. The yellow on-off structures are weighted more strongly than the white ones to
allow the handling of lane markings in construction sites.
The state-of-the-art unmarked lane detection evaluates a street training region in front
of the car and two non-street training regions at the side of the road (see Sect. 4.1). The
features (stereo, edge density, color hue, color saturation) in the street training region
are used to detect the drivable road based on dynamic probability distributions for all
cues. Additionally, region growing that starts at the street training region assures a crisp
distinction between the road and the sidewalk. The region growing uses dynamic, self-
adaptive thresholds that are derived from the feature characteristics in the street training
region as compared to the non-street training regions. No fixed parameters for detecting the
road are used, which makes the system adaptive to its environment and hence robust.
A temporal integration procedure between the current and past detected road segments
based on the bird’s eye view is used to increase the completeness of the detected street by
decreasing the number of false negative road pixels (see Sect. 4.2). More specifically, based
on the measured ego motion of the car the road segments detected in the past are shifted
and fused with the currently detected road segments. Refer to [Michalke et al., 2008c] for
a comprehensive description of the temporal integration procedure. In the final step, a
fusion between the marked and unmarked detected road segments is used to derive the
present drivable lanes.
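A strongly simplified sketch of the temporal integration described above is given below: the previous bird's-eye road map is moved by the measured ego motion and blended with the current detection. The pure translation, the grid shift, and the blend factor are assumptions; the full procedure is described in [Michalke et al., 2008c].

import numpy as np

def integrate_road(prev_bev, curr_bev, dz_cells, dx_cells, alpha=0.7):
    """prev_bev, curr_bev: road confidence grids in the metric bird's
    eye view; dz_cells, dx_cells: ego translation in grid cells."""
    # align the old map with the current ego pose; note that np.roll
    # wraps around at the borders, which a real implementation would
    # mask out with zeros
    shifted = np.roll(np.roll(prev_bev, dz_cells, axis=0),
                      dx_cells, axis=1)
    return alpha * curr_bev + (1.0 - alpha) * shifted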
Behavior Control
The system can interact with the world via a behavior control module. Currently our ADAS
implementation uses a three-phase danger handling scheme depending on the distance and
relative speed of a recognized obstacle (see also Sect. 5.2.2). When an obstacle is detected
in front at a rapidly decreasing distance, a visual and acoustic warning is issued and the
brakes are prepared. In the second phase the brakes are engaged with a deceleration of
0.25 g followed by hard braking of 0.6 g in the third phase. Other behaviors, like trajectory
planning and active steering, as well as the detection of possible collisions and their active
avoidance based on predictions on internal 3D representations are possible and planned in
the near future.
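The danger handling scheme can be sketched as a simple phase selection; only the deceleration values of 0.25 g and 0.6 g are taken from the text, while the time-to-contact thresholds are illustrative assumptions.

def danger_phase(distance_m, closing_speed_mps):
    """Select the danger handling phase from the obstacle distance and
    the relative (closing) speed."""
    ttc = distance_m / closing_speed_mps if closing_speed_mps > 0 else float("inf")
    if ttc < 1.0:
        return "phase 3: hard braking with 0.6 g"
    if ttc < 2.0:
        return "phase 2: braking with 0.25 g"
    if ttc < 3.5:
        return "phase 1: visual/acoustic warning, brakes prepared"
    return "no intervention"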
5.3.2 Experiments and Results
In the following, we will evaluate individual system modules that are most important for
our cognitive ADAS architecture. Also, the overall system performance will be assessed
based on the construction site scenario described in Sect. 5.2.2.
Evaluation of System Modules
Evaluation of attention sub-system: In order to evaluate the generic nature of the
attention-based TD search, we used cars and reflection posts (useful for unmarked road
detection as done, e.g., in [von Trzebiatowski et al., 2004]) as LTM search objects. The
results are depicted in Tab. 5.1, showing that incorporating TD information improves
the search performance considerably. Please note that when changing the LTM search
object, besides exchanging the LTM image patches and appropriately training the object
classifier, no modification of the system structure is required. For evaluation,
the measures average FoA hit number (Hit) and average detection rate (DRate) were
calculated. While DRate is the ratio of the number of found task-relevant objects to the
overall number of task-relevant objects, Hit states that the object was found on average
with the Hit’th generated FoA. Hence, the smaller Hit is, the earlier an object is detected
(see [Frintrop, 2006] for more details on these measures). The choice of training images has
only small influence on the search performance as the comparable results for different sets
of training images show (see Tab. 5.1). The evaluation shows the highest hit numbers and
detection rates for pure TD search (λ = 1). However, as will be discussed in the following
a combination of BU and TD influence is recommended in the attention system .
The presented results support the generic nature of the TD-tunable attention sub-system
during object search. Moreover, we see the attention system as a common tunable front-
end for the various other system tasks, e.g., lane marking detection (as described in
Sect. 5.3.1). Following this concept, the task-specifically tunable attention system can be used
for scene decomposition and analysis, as it is shown exemplarily on two typical German
highway scenes in Fig. 5.11.
122
5 Integrated System Approaches for Scene Interpretation
Table 5.1: Search performance for BU- and TD-based LTM object search for cars and reflection
posts for 2 different training sets.
Target                    # Test images   # Training   Hit (DRate)       Hit (DRate)
                          (objects)       images       pure BU (λ = 0)   pure TD (λ = 1)
Cars, self test           54              -            -                 1.53 (100%)
Cars, T.set 1             54 (58)         3            3.06 (56.9%)      1.82 (96.6%)
Cars, T.set 2             54 (58)         3            -                 1.74 (93.1%)
Reflect. posts, self test 56              -            -                 1.85 (66.3%)
Reflect. posts, T.set 1   56 (113)        6            2.97 (33.6%)      2.25 (52.2%)
Reflect. posts, T.set 2   56 (113)        7            -                 2.36 (52.2%)
Figure 5.11: Attention-based scene decomposition: (a) Highway scene, (b) TD attention
tuned to lane markings, (c) TD attention tuned to cars, (d) Construction site, (e) TD at-
tention tuned to signal boards, (f) TD attention tuned to cars.
123
5 Integrated System Approaches for Scene Interpretation
Evaluation of classifier performance: For a proof of concept, we trained the classifier
to distinguish cars from non-cars (clutter). A set of image segments generated by our
vision system during online operation was used for training. It contains 11000 square
image patches of size 64x64 pixels, and was divided into the classes “cars” (2952 patches),
“signal boards” (2408 patches) and “clutter” (5803 patches) by visual inspection. Car
segments contain complete back-views of cars (at any position) which must be at least half
as large as the patch in both dimensions. At equal false positive and false negative rates
(the equal error rate), an error of 4.7% for cars and 9.7% for signal boards was obtained on
equally large test sets. The performance of the trained classifier is shown in Fig. 5.13a in
the form of a receiver operator characteristic (ROC) curve that visualizes the trade-off
between false positive (clutter recognized as object) and false negative (object recognized
as clutter) detections when varying the classification threshold. The ROC curve was
generated using 5-fold cross validation. Furthermore, the quality of the classification is
enhanced by the voting process described in Sect. 5.3.1.
Qualitative evaluation of depth cues: For a more qualitative evaluation, Fig. 5.12
shows the unpreprocessed results for all depth cues on a typical inner-city sample. The
cues show strong differences in accuracy (especially depth from bird's eye view and depth
from object knowledge show a high variance). However, this is uncritical, since the sensor
variances (determined offline) are taken into account during the EKF-based sensor fusion
(see Sect. 5.2.2 for a more detailed depth cue evaluation).
Evaluation of Overall System Performance
The performance gain from incorporating the detected drivable street, the internal metric 3D
representation, and the TD links is evaluated on a real-world construction site scenario. The
results gathered with the proposed system are then compared with the previous system
described in Sect. 5.2.
In the previous section, we concentrated on typical construction sites on highways. A
traffic jam ending exactly within a construction site is a highly dangerous situation: due
to the S-curve in many construction sites, the driver will notice a braking or stopping car
quite late (see Fig. 5.4 on page 110). The evaluation was done offline by averaging over
3 streams that were stored during the online demonstration of the previous ADAS. As
depicted in Fig. 5.13b the current system architecture can classify the stationary car from
25 to 42 meters on. How early the car is detected depends on how much TD influence is
incorporated. For λ = 0 the car is detected late, because only visually conspicuous object
features are incorporated that draw BU attention. For a growing λ the car is detected earlier,
since "car-like" features are boosted more strongly in the TD attention. Based on Fig. 5.13b the
best choice of λ for detecting cars would be 1, which equals pure TD search mode. However,
such a parameterization is not appropriate because this leads to a reduced capability of
detecting other objects that are only prominent in the BU saliency map. As depicted in
Fig. 5.13b with growing λ (i.e., with growing influence of car features in the attention) the
mean detection distance of signal boards as BU salient objects drops. Stated differently,
the system ignores all other objects while searching for cars in pure TD mode (λ = 1),
124
5 Integrated System Approaches for Scene Interpretation
Figure 5.12: (a) Depth from stereo (calculated as a median over the object region), (b) Depth
from Radar, (c) Depth from object knowledge (for all objects detected as cars), (d) Depth from
bird’s eye view (using threshold-based detection of intensity changes on the road).
which might lead to dangerous situations. The measured effect has also been shown to exist in
humans and is termed "inattentional blindness" (see Sect. 3.1 and [Simons and Chabris,
1995]). This suggests setting λ to an intermediate value of about 0.5, which was also the
setting used during our online tests (see [Fritsch et al., 2008]).
Also, compared to the previous system described in Sect. 5.2, a better system performance
was achieved for all λ-values. In the previous system, an appearance-based 2D tracking
was used, as opposed to the 3D-representation-supported tracking presented here. Furthermore,
the TD weights were computed offline, as opposed to the online LTM object search in the
current system. Additionally, in the current system information drawn from the road
detection module is included and fused into the attention module (see Sect. 5.3.1). The
attained performance gain affirms the soundness of these cognitive system extensions.
For further system evaluation, Fig. 5.14 depicts internal system variables for three se-
quential frames of an inner-city stream with cars as LTM search object. As described in
Sect. 5.3.1, for each new image the attention is calculated and a new FoA is generated
via maximum search and segmentation on the saliency map. The detected road area (and
thereby also the present lane markings) is mapped out of the saliency map, which de-
creases the false positive rate of generated FoAs, i.e., fewer non-car FoAs are generated. In
Figure 5.13: (a) Receiver operator characteristic curve for cars (back and front views) and
signal boards, (b) Comparison between previous and current system implementation: Stationary
car detection distance depending on TD attention parameter λ = 0, 0.25, 0.5, 0.75, and 1.
For both systems a comparable parameter set was used.
the first frame, the car in front is detected and stored in the representation, based on a
car-like hole in the detected street segment that modulates the attention. Please note that
cars 2 and 3 are not stored in the internal representation, since their positions are beyond
the represented road environment.
5.4 Summary
In this chapter, the visual features described in Chapter 2, the attention sub-system
of Chapter 3, and the unmarked road detection sub-system of Chapter 4 are combined
into a generic, biologically inspired Advanced Driver Assistance System. The attention
sub-system that weights and combines all visual features allows a task-dependent scene
decomposition, which enables the system to react in real-time to challenging situations.
The ADAS described in Sect. 5.2 uses the attention sub-system as front-end for the
detection of a stationary vehicle in a highway construction site. In a nutshell, based
on an object-specific weight set, the attention sub-system suppresses all scene elements
not relevant to the current task, while boosting task-related scene content. The thereby
preselected scene elements are classified by a biologically inspired state-of-the-art classifier.
In order to allow a thorough scene exploration, already analyzed scene elements are suppressed
in the saliency map. Since the camera vehicle moves and typically other dynamic objects
are present, a tracking approach is included in the system. In order to assess the potential
danger of detected and classified vehicles, depth information is needed. For this,
four depth cues described in Chapter 2 are combined based on a biologically inspired
fusion approach. In case a quickly approaching vehicle is detected, a danger handling
scheme is initiated, meaning that after an acoustic warning and the activation of the belt
pretensioner, an autonomous braking of the ego vehicle is done. Different from available
systems on the market and in literature, the object detection is based on vision as the
Width in m
Dis
tan
ce (
de
pth
) in
m
−5 0 5
25
20
15
10
5
0
Width in m
Dis
tan
ce (
de
pth
) in
m
−5 0 5
25
20
15
10
5
0
Width in m
Dis
tan
ce (
de
pth
) in
m
−5 0 5
25
20
15
10
5
0
Frame 194
Car 1
Frame 196
Car 3 Car 2
Frame 197
Car 4
Figure 5.14: System evaluation on example images of an inner-city stream. Left column:
visualization of found FoAs, middle: calculated saliency map Stotal (previously found objects
are suppressed by inhibition of return), Right column: Visualization of internal representation
(dashed line marks the border of the vision field).
main cue.
In Sect. 5.3 the driver assistance system is significantly extended. The detected road, as
introduced in Chapter 4, is fused into the system, improving the quality of different system
modules. For instance, the detected road can be suppressed in the visual feature maps,
allowing a more efficient attention-based object search. Car-like openings in the detected
road are used to spatially modulate and guide the attention in order to search for cars in
these image regions. As a further extension, the object-specific attention weight sets are
computed online, allowing the detection of multiple object classes. The tracker is improved
by coupling it to an environmental 3D representation that compensates the 3D position
of all known objects by the measured ego motion of the camera vehicle. Additionally, a
voting mechanism is used in order to improve the robustness of the object classification,
meaning that an object, once found, is classified several times in the subsequent frames in
order to increase the confidence in the detected object class. All these improvements allow
a faster detection of dangerous objects. This was shown with the help of the previously
introduced example scenario of a stationary vehicle in a construction site, where a faster
system reaction could be achieved.
In the following, the realized novelties of Chapter 5 are listed:

- A driver assistance system on a prototype vehicle was implemented that allows autonomous emergency braking on highways based on vision as the major cue,
- The realized driver assistance system is based on an attention system as generic front-end of all visual processing, allowing task-dependent scene decomposition and interpretation,
- A driver assistance system was realized that fuses the detected drivable road into various system modules and thereby includes environmental context information, in order to allow safe processing in inner-city scenarios.
Based on the evaluation results, it could be shown that, with the inattentional blindness
phenomenon, specific attention-related properties of the human vision system can be re-
produced with the driver assistance system presented here. Based on these results, the
realized ADAS can be understood as an attentive co-pilot supporting the human driver
while closely mimicking human visual processing.
6 Summary and Outlook
In Sect. 6.1 the thesis is summarized, emphasizing the role of the biologically inspired ap-
proaches that are part of the realized Advanced Driver Assistance System. Based on that,
remaining functional limitations and ongoing system extensions are described in Sect. 6.2.
6.1 Summary
In modern vehicles, numerous driver assistance functionalities exist that support the driver
in typical traffic situations. In general, each of these functionalities brings its own sensors,
processing devices, and actuators. No information fusion takes place between the functionalities,
due to, among other things, unsolved questions in system design that come along with high
system complexity. Thereby, the potential to improve the robustness of so far independent
system modules as well as the potential to develop higher-level functionalities is ignored.
Furthermore, the number of driver assistance functionalities is growing constantly in to-
day’s vehicles. In the near future, this will lead to problems regarding the limited number
of interaction channels to inform the driver of dangerous events (so-called human machine
interface (HMI)). Already today, in specific traffic scenarios a contradicting HMI access of
different driver assistance functionalities can occur. For solving these challenges, a large-
scale driver assistance system is required that integrates and fuses various functionalities
(leading to so-called advanced driver assistance systems (ADAS)). Despite the apparent
necessity of such systems, only a few of these approaches exist in literature. Said systems
typically rely on rigid system structures and safely run in clearly defined scenarios only.
While it may be argued that the quality of such engineered systems in terms of isolated
aspects, e.g., object detection or tracking, is often sound, the solutions lack the necessary
flexibility.
Instead of following classical engineering-based approaches, in the thesis at hand, an
ADAS is developed that solves the challenges of complex system design by mimicking
known information processing principles in the human brain. On the micro-level the real-
ized system copies the signal processing characteristics of neurons in order to reach robust
image filtering. On the macro-level the system gets inspiration from the way the human
brain organizes higher level signal flows in order to reach a generic system structure (e.g.,
a task-dependent tunable attention system).
More specifically, on the micro-level various static and dynamic visual features are real-
ized that are biologically inspired. Inspiration is drawn from known processing principles
of the vision pathway in the human brain (i.e., the signal processing characteristics of
neurons in the brain). Features for the detection of specific intensity changes, oriented
lines and edges as well as a retina-like color space are introduced and tested. Since during
the projection of 3D world objects to the 2D image plane the object depth is lost, a dis-
tance feature becomes necessary. Five biologically motivated depth cues and their dynamic
fusion are described. The realized depth sources are stereo disparity, depth from object
knowledge, depth from time to contact, depth from the bird's eye view, and Radar-based
depth. Thereby, in sum 130 static feature maps are accessible, by which the proposed
ADAS can sense the world. Additionally, two dynamic feature types are described that
allow the detection of ego-propelled objects in the scene. In sum, six dynamic feature maps
are accessible to the ADAS. This makes an overall number of 136 feature maps.
The 136 biologically inspired feature maps are combined in the attention system that
is used as common front-end of all vision processes of our ADAS. The key aspect and
output of the attention system is the saliency map, whose amplitude (i.e., its activation
in neurobiological terms) encodes the level of information contained in an image region.
A high activation can be caused by 1) an object that visually differs strongly from its
surrounding environment (sensory-driven or bottom-up attention) or 2) by an image region
that matches the current searched object properties (goal-driven or top-down attention).
Both the bottom-up (BU) and top-down (TD) attention use a weighted combination of
the features. Five novelties increase the adaptivity of the attention system to the scene
and thereby assure a high robustness, which allows for building applications for outdoor
scenarios of the dynamic vehicle domain.
The importance of context information for improving the performance of driver assis-
tance functionalities is a widely acknowledged fact in the community. Therefore the ADAS
contains a real-time unmarked road detection system that relies on vision as the major cue.
Four system-related novelties assure sound performance in terms of the detection quality.
Based on these novelties, at run time the system dynamically adapts important system
parameters, allowing robust road detection under changing environmental conditions. As
the evaluation showed, the road segments detected in this way match ground truth data well
in most situations. However, in case of shadows on the road the detected road segments
contain holes and become unstable over time. As a further novelty, for solving these challenges a generic
tracking approach for unmarked road detection systems was introduced that is based on
temporal integration (i.e., integration of road detection results over time). The temporal
integration approach was successfully tested on the implemented system described before,
but would be suitable to improve any comparable state-of-the-art unmarked road detection
system.
In the central part of the PhD thesis, the realized visual features, the attention sub-
system, and the unmarked road detection sub-system including the temporal integration
approach are combined into a generic, biologically motivated Advanced Driver Assistance
System. The attention sub-system, which weights and combines all visual features, allows
a task-dependent scene decomposition that enables the system to react in real-time to chal-
lenging situations. In the first developed instance of an ADAS, the attention sub-system
is used as front-end for the detection of a stationary vehicle in a highway construction site.
In case a quickly approaching vehicle is detected by the system, a danger handling scheme
is initiated that in the final phase allows the system to brake autonomously. Different from
commercially available systems and approaches in literature, the object detection is based
on vision as the main cue.
A second instance of the driver assistance system was developed that contains numerous
extensions that rely on a higher level of information fusion between modules (e.g., close
fusion of the detected road into the attention system) and the inclusion of more dynamics
into the system (e.g., online computation and adaptation of weight sets for tuning the visual
preprocessing). The improvements allow a faster detection of dangerous objects. This was
exemplarily shown on the previously described detection task of a stationary vehicle in the
construction site, where faster reaction times to the stationary vehicle could be reached.
Based on an extensive system evaluation, system properties could be demonstrated that
were measured in psychophysical studies on humans as well. Similar to humans, it could be
shown that a number of 5 to 7 stored and tracked objects in the short term memory results
in the best overall system performance when searching for a dangerous object under time
constraints. Furthermore, it could be shown that the reaction time to unexpected traffic-
relevant objects grows with an increasing focus on a specific object class (a phenomenon
that is called inattentional blindness in psychophysical studies with humans). Based on
these results, it can be stated that the realized ADAS closely models important human
information processing principles allowing the usage of the system as attentional co-pilot
for human drivers.
6.2 Limitations and Outlook
The performance assessment of the human vision system has revealed capabilities that
exceed all known computational vision systems having a comparable image resolution. For
example, [Manz et al., 2007] provides the following equation for computing the maximum
distance Dv at which a human is able to visually classify an object in good weather conditions:
$$D_v = \frac{a}{\tan\left(\frac{\pi S}{10800}\right)} \qquad (6.1)$$

with:
a = object size in m
S = factor of acuteness of vision (usually 1.0).
A vehicle of width a = 2.5m with an acuteness factor S = 1.0 can hence be classified from
approximately 2500 meters on.
As stated in Sect. 5.2.2, the ADAS proposed here is able to classify cars from about 42
meters on. Among other things (like, e.g., differences in image resolution), the large dif-
ference in performance between the human and a technical vision system might be due to
the fact that the human integrates information about the scene context and gathered ex-
perience. In the discussed example, context information about the road (e.g., the object is
positioned on the road), which restricts the potential type of object, and object-class-related
experience (e.g., a car is a fast moving object on the road) are crucial. First approaches to
include context into the ADAS are realized in the PhD thesis at hand, in order to improve
the object detection. More specifically, the detected road is searched for car-like openings
and used to modulate the attention and object segmentation (see Sect. 5.3.1). However,
more context information (e.g., road type: inner-city, urban road, highway) needs to be
included, in order to further improve the system performance. [Kastner et al., 2009] de-
scribes a robust, real-time capable system for said basic scene classification, which we plan
to incorporate in our ADAS in the near future.
The PhD thesis at hand concentrated mainly on saliency-based attention and building
of a generic system that allows the dynamic modulation of modules and links between
modules at run-time. Our further work focuses on ways to control the designed cognitive
system based on reinforcement learning on high system level.
In order to tackle this goal, the previously presented ADAS was extended by a simply
structured control module, which realizes a functional mapping from measured internal system
states (the input feature space) to parameters controlling the system behavior (see
[Michalke et al., 2009b] for an extensive description of the realized approach). Based on this
approach, first promising results could be gathered on a complex real-world test scenario.
In the scenario the camera ego vehicle detects and tracks a bicycle, which the ego vehicle
overtakes. Based on the internal 3D representation (described in Sect. 5.3.1 on page 118)
the bicycle position is predicted even while being outside the field of view of the camera.
The ego vehicle stops to turn right and “remembers” the previously detected bicycle.
The ego vehicle waits for the bicycle to reappear. In order to allow a fast redetection,
the top-down attention is tuned to the bicycle (see Fig. 6.1, 6.2, 6.3) finally allowing its
instantaneous redetection. Summarizing, at runtime the system builds up and verifies
expectations about the environment and thereby autonomously tunes internal parameters and
processes, which improves and accelerates the system reaction. The complete result video is
accessible on the internet at [BenchmarkData, 2009b].
The goal of the current system extensions is to develop control strategies that allow an
appropriate and safe system reaction in various environmental situations. As it has become
apparent in the thesis at hand, the key aspect for reaching an “all situation ADAS” is a
generic system structure. Therefore, low-complexity system control strategies seem to
be sufficient. In other words, the cognitive complexity is distributed over the system in
multiple processing loops that can be tuned and modulated. Therefore, no complex central
control system is necessary, which increases robustness and could also allow the learning
of control strategies in the future.
More specifically, after the successful test of the exemplarily tested, low-complexity control
approach described above, the next step will focus on learning the functional mapping
between the measured input feature space and the output control parameter space. A possible
way could be to replay stored scenarios of critical traffic situations from a database. As
learning signal, dangerous objects could be labeled, which the system has to detect fast
enough to prevent a collision. In case the system is too slow, the scenario is replayed while
changing the functional mapping between input and output signals of the behavior control
module. Measuring and mimicking the reactions of an experienced driver is also envisioned
based on this approach.
As motivated earlier, the central assumption is that a robust learning system requires
a generic system structure with a high number of degrees of freedom for controlling the
system reaction and measuring the system states. Such a system was realized in the PhD
thesis at hand (see Sect. 5.3) allowing the future learning of control strategies and hence
offering a promising way to realize an “all situation” ADAS.
Figure 6.1: Visualization of system states for bicycle stream: (a) Scene exploration mode (no
dynamic object present), (b) Tracking the bicycle, ego vehicle is closing in.
Figure 6.2: Visualization of system states for bicycle stream: (a) Shortly after overtaking the
bicycle, (b) Blind prediction of bicycle.
Figure 6.3: Visualization of system states for bicycle stream: (a) Ego car searching actively
for the bicycle, waiting to turn right, (b) Bicycle redetected successfully, ego vehicle turns right.
A Annex
A.1 Gaussian Image Pyramid
Figure A.1 visualizes the image filtering methods with and without usage of an image pyra-
mid for a comparison of the computational demands. Since filtering in the frequency domain
is faster for both methods, the comparison is done in the frequency domain (correspond-
ing to the right side of Fig. A.1a and b, respectively). As can be seen qualitatively, the
computational demands for applying the FFT are lower when using an image pyramid (see
Fig. A.1b). Note that zero-padding is done in the image domain to bring both the image and
the kernel to the same size before applying the FFT. Furthermore, the multiplication
in the frequency domain as well as the transformation back to the image domain is more ef-
ficient when using a filter pyramid. According to [Jaehne, 2005], filtering with an image
pyramid of infinitely many scales (i.e., steps) takes 4/3 of the computation time of a single scale
without usage of an image pyramid. The same factor is found when comparing Fig. A.1a
and b. Therefore, already when filtering an image on two scales, an image pyramid will be
faster. The more scales are used, the higher the gain in computation time will be. A more
quantitative (i.e., mathematical) derivation of the performance increase can be found in
[Jaehne, 2005].
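The factor 4/3 follows directly from a geometric series over the per-scale pixel counts: halving the resolution quarters the number of pixels per scale, so the total cost relative to one full-resolution filtering is

$$\sum_{k=0}^{\infty} \left(\frac{1}{4}\right)^k = \frac{1}{1 - \frac{1}{4}} = \frac{4}{3}.$$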
Figure A.1: Assessment of pyramid-based image filtering: (a) Filtering without an image pyramid (up-scaling the kernel), (b) Filtering with an image pyramid (down-scaling the image).
A.2 Kolmogorov-Smirnov Test of Goodness of Fit
In Sect. 4.1.3 the Kolmogorov-Smirnov (KS) test of goodness of fit with its Lilliefors ex-
tension is used in order to statistically verify whether the features in the road training region
are normally distributed.
In the following, the KS-test steps realized in Sect. 4.1.3 are described and motivated in
detail. The KS-test used here checks the null hypothesis that a sample follows a normal
distribution with a certain measured variance σ² and mean value µ against a given level
of significance α = 0.05. Normally distributed features are a prerequisite for the visual
feature fusion process proposed in Sect. 4.1.2.
As opposed to the well-known χ² goodness-of-fit test, the KS-test can also be used
with small sample sizes. As test statistic the KS-test does not use the difference between the
absolute frequency of the sample and the theoretical probability function, but is based on
the difference between the cumulative frequency of the sample Fe(x) and the theoretical
cumulative frequency F0(x). In case both cumulative frequencies match closely, the
observed absolute deviation |Fe(x) − F0(x)| will be small. Therefore, on a qualitative
level, the observed maximum of the difference between both cumulative frequencies, d =
max |Fe(x) − F0(x)|, is suitable as test statistic for the KS-test.
Hence, in the following the test statistic d is used to verify the null hypothesis against a
level of significance α = 0.05. Kolmogorov and Smirnov have shown that the distribution
of the test statistic d is independent of the theoretical distribution the test is used for
(in our case the normal distribution), but depends on the sample size alone. This allows
defining a general chart of the test statistic. However, it is important to note that the
variance σ² of the theoretical distribution is a priori not known in our case, but must
be estimated from the test sample. This contradicts an important assumption of the
KS-test in its basic form. By estimating the standard deviation from the test sample, a
negative test result becomes less probable. The KS-test in its basic form will then have a too
high critical value for the test statistic (i.e., the border value is too large). This means
that the critical value has to be set lower. For the case that the mean value µ and variance
σ² are estimated from the test sample, [Lilliefors, 1967] has published the corrected values
of the test statistic for a goodness-of-fit test for normally distributed test samples. It is
important to note that, opposed to the KS-test, for the KS-Lilliefors test the distribution
of the test statistic depends on the theoretical distribution.
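As an illustration, the test can be run with a few lines of Python; this sketch assumes the lilliefors implementation of the statsmodels package, which applies exactly the correction discussed above.

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

def road_feature_is_normal(samples, alpha=0.05):
    """Null hypothesis: the feature samples of the road training region
    follow a normal distribution whose mean and variance are estimated
    from the sample itself (Lilliefors-corrected KS-test)."""
    stat, p_value = lilliefors(np.asarray(samples), dist="norm")
    return p_value > alpha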
A.3 World to Image Transformation
A 3D world position can be transformed to a 2D pixel position (u,v) using a pin hole
camera model that contains all intrinsic and extrinsic camera parameters (in detail these
are the 3 camera angles θX , θY , and θZ , which are aggregated in the rotation matrix R, the
3 translational camera offsets t1, t2, t3, the horizontal and vertical principal point u0 and
v0, as well as the normalized horizontal and vertical focal lengths fu = f/tu and fv = f/tv),
see Equations (A.1) and (A.2).
Figure A.2: (a) Visualization of internal camera parameters, (b) Coordinate system and ex-
ternal camera parameters.
$$u = -f_u \, \frac{r_{11}(X - t_1) + r_{12}(Y - t_2) + r_{13}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + u_0 \qquad (A.1)$$

$$v = -f_v \, \frac{r_{21}(X - t_1) + r_{22}(Y - t_2) + r_{23}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + v_0 \qquad (A.2)$$

$$R = R_X R_Y R_Z = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$$

with:

$$\begin{aligned}
r_{11} &= \cos(\theta_Z)\cos(\theta_Y) \\
r_{12} &= -\sin(\theta_Z)\cos(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{13} &= \sin(\theta_Z)\sin(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{21} &= \sin(\theta_Z)\cos(\theta_Y) \\
r_{22} &= \cos(\theta_Z)\cos(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{23} &= -\cos(\theta_Z)\sin(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{31} &= -\sin(\theta_Y) \\
r_{32} &= \cos(\theta_Y)\sin(\theta_X) \\
r_{33} &= \cos(\theta_Y)\cos(\theta_X)
\end{aligned}$$
Figure A.2 gives a visualization of all the named internal and external camera parameters.
Equations (A.1) and (A.2) can also be expressed in homogeneous coordinates (see Equa-
tions (A.3), (A.4), and (A.5)), which decreases the computational demands considerably.
$$u = \frac{X_{cam}}{Z_{cam}} \qquad (A.3)$$

$$v = \frac{Y_{cam}}{Z_{cam}} \qquad (A.4)$$

with:

$$\begin{bmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{bmatrix} = M \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (A.5)$$

$$M = M_p^h R_X^h R_Y^h R_Z^h T^h$$

$$R_X^h = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(\theta_X) & -\sin(\theta_X) & 0 \\ 0 & \sin(\theta_X) & \cos(\theta_X) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad
R_Y^h = \begin{bmatrix} \cos(\theta_Y) & 0 & \sin(\theta_Y) & 0 \\ 0 & 1 & 0 & 0 \\ -\sin(\theta_Y) & 0 & \cos(\theta_Y) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$R_Z^h = \begin{bmatrix} \cos(\theta_Z) & -\sin(\theta_Z) & 0 & 0 \\ \sin(\theta_Z) & \cos(\theta_Z) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad
T^h = \begin{bmatrix} 1 & 0 & 0 & t_1 \\ 0 & 1 & 0 & t_2 \\ 0 & 0 & 1 & t_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$M_p^h = \begin{bmatrix} f_u & 0 & u_0 & 0 \\ 0 & f_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
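For illustration, the homogeneous world-to-image transformation of Equations (A.3) to (A.5) can be written as the following sketch (angles in rad, translations in m, intrinsic parameters in pixels); the parameter values used in a call are application-specific.

import numpy as np

def world_to_image(P_world, thX, thY, thZ, t, fu, fv, u0, v0):
    """Project the 3D world point P_world = [X, Y, Z] to the 2D pixel
    position (u, v) using the matrices of Equation (A.5)."""
    def RhX(a): return np.array([[1, 0, 0, 0], [0, np.cos(a), -np.sin(a), 0],
                                 [0, np.sin(a), np.cos(a), 0], [0, 0, 0, 1]])
    def RhY(a): return np.array([[np.cos(a), 0, np.sin(a), 0], [0, 1, 0, 0],
                                 [-np.sin(a), 0, np.cos(a), 0], [0, 0, 0, 1]])
    def RhZ(a): return np.array([[np.cos(a), -np.sin(a), 0, 0],
                                 [np.sin(a), np.cos(a), 0, 0],
                                 [0, 0, 1, 0], [0, 0, 0, 1]])
    Th = np.eye(4)
    Th[:3, 3] = t                       # translational camera offsets
    Mhp = np.array([[fu, 0, u0, 0], [0, fv, v0, 0],
                    [0, 0, 1, 0], [0, 0, 0, 1]])  # projection matrix
    M = Mhp @ RhX(thX) @ RhY(thY) @ RhZ(thZ) @ Th
    Xc, Yc, Zc, _ = M @ np.append(np.asarray(P_world, float), 1.0)
    return Xc / Zc, Yc / Zc             # Equations (A.3) and (A.4)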
A.4 Time to Contact - Further Evaluation Results
In addition to the results given in Sect. 2.2.5, further evaluation results, gathered by eval-
uating presegmented, synthetic image data of cars, are shown in Tab. A.1, A.2, and A.3. As
stated in Sect. 2.2.5, the results gathered here are roughly comparable to the measurements
accumulated in psychophysical experiments with humans.
Table A.1: Some examples of depth from TTC for an object moving away (i.e., b1 > b2 > b3) with a frame rate frate = 3.

Ego velocity      Ground truth distance   Resulting      Computed depth   Relative error
vego,1 (vego,2)   D1 (D2, D3) [in m]      vobj [in m/s]  from TTC ZTTC    |D1−ZTTC|/D1 [in %]
[in m/s]
24 (24.9)         39 (40, 40.7)           20.93          42.73            9.56
24 (24.9)         38 (39, 39.7)           20.93          41.79            9.97
24 (24.9)         37 (38, 38.7)           20.93          40.85            10.41
24 (24.9)         36 (36.5, 36.7)         22.49          38.70            7.50
24 (24.9)         35 (35.5, 35.7)         22.49          37.76            7.89
24 (24.9)         34 (34.5, 34.7)         22.49          36.83            8.32
24 (24.9)         33 (33.5, 33.7)         22.49          35.90            8.79
Mean relative error                                                       8.92
Table A.2: Some examples of depth from TTC with b1 > b2 < b3 and a frame rate frate = 3.

Ego velocity      Ground truth distance   Resulting      Computed depth   Relative error
vego,1 (vego,2)   D1 (D2, D3) [in m]      vobj [in m/s]  from TTC ZTTC    |D1−ZTTC|/D1 [in %]
[in m/s]
24 (25.5)         33 (33.3, 33.1)         23.10          35.34            7.09
24 (25.5)         34 (34.3, 34.1)         23.10          36.27            6.68
24 (25.5)         35 (35.3, 35.1)         23.10          37.21            6.31
24 (25.5)         36 (36.3, 36.1)         23.10          38.15            5.97
24 (25.5)         37 (37.3, 37.1)         23.10          39.10            5.68
24 (25.5)         38 (38.3, 38.1)         23.10          40.04            5.37
24 (25.5)         39 (39.3, 39.1)         23.10          41.00            5.13
Mean relative error                                                       6.03
Table A.3: Some examples of depth from TTC for an approaching object (i.e., b1 < b2 < b3) with a frame rate frate = 3.

Ego velocity      Ground truth distance   Resulting      Computed depth   Relative error
vego,1 (vego,2)   D1 (D2, D3) [in m]      vobj [in m/s]  from TTC ZTTC    |D1−ZTTC|/D1 [in %]
[in m/s]
24 (24.9)         45 (42, 38.7)           14.52          49.12            9.16
24 (24.9)         42 (40, 37.7)           17.83          45.00            7.14
24 (24.9)         40 (38, 35.7)           17.79          43.26            8.15
24 (24.9)         38 (36, 33.7)           17.75          41.57            9.39
24 (24.9)         36 (34, 31.7)           17.70          39.93            10.92
24 (24.9)         34 (32, 29.7)           17.64          38.36            12.82
24 (24.9)         33 (32, 30.7)           20.95          35.82            8.55
Mean relative error                                                       9.45
A.5 High Attention-Feature Selectivity
In the following, an indoor application of the attention system is shown that highlights
the performance of the approach. Figure A.3a shows a complex scene of a bookshelf.
Marked in red is my favorite book "A journey into the brain and beyond". However,
someone has removed my book and put it back at a different location (see Fig. A.3d).
Since the scene is highly complex (refer to the dense BU attention in Fig. A.3c), the book is
hard to find. Based on a stored training image, the described attention system is now able
to compute a TD weight set. Based on the training image the TD search is successful (see
Fig. A.3b), which allows a first positive assessment of the TD weight set that corresponds
to the features of my book. A TD-attention-based search on the current test image (see
Fig. A.3d) leads to Fig. A.3e, where the TD attention yields a clear saliency maximum.
This allows the fast relocation of my favorite book (see Fig. A.3f).
The described application is somewhat related to approaches shown in [Frintrop, 2006].
Still, the test examples in [Frintrop, 2006] are much simpler in terms of scene complexity.
Figure A.3: (a) Search target (favorite book) marked by rectangle (remembered training im-
age), (b) TD attention computed on the remembered training image (TD weights are stored),
(c) Dense BU attention showing the complexity of the test scene, (d) Test image with changed
position of the book, (e) TD attention on the test image based on the stored TD weights, (f)
Relocated book on the test image.
Bibliography
Adamy, J. (2007). Fuzzy-Logik, Neuronale Netze und Evolutionäre Algorithmen. Shaker
Verlag, Aachen.
Apostoloff, N. and Zelinsky, A. (2003). Robust vision based lane tracking using multiple
cues and particle filtering. In IEEE Intelligent Vehicles Symposium.
Aufrere, R., Marion, V., Laneurit, J., Lewandowski, C., Morillon, J., and Chapuis, R.
(2004). Road sides recognition in non-structured environment by vision. In IEEE Intel-
ligent Vehicles Symposium, Parma.
Aziz, Z. and Mertsching, B. (2008). Visual search in static and dynamic scenes using fine-
grain top-down visual attention. In Lecture Notes in Computer Science, volume 5008,
pages 3–12.
Backer, G. and Mertsching, B. (2000). Integrating depth and motion into the attentional
control of an active vision system. In G. Baratoff, H. Neumann, (Eds.), Dynamische
Perzeption, St. Augustin (Infix), pages 69–74.
Badino, H., Vaudrey, T., Franke, U., and Meyer, R. (2008). Stereo-based free space
computation in complex traffic scenarios. In IEEE Southwest Symposium on Image
Analysis and Interpretation, New Mexico.
Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algo-
rithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139.
BenchmarkData (2008a). http://www.rtr.tu-darmstadt.de/~tmichalk/ICVS2008_BenchmarkData/.

BenchmarkData (2008b). http://www.rtr.tu-darmstadt.de/~tmichalk/ITSC_TempIntegration/.

BenchmarkData (2009a). http://www.rtr.tu-darmstadt.de/~tmichalk/IV2009_RoadDetectionSystem/.

BenchmarkData (2009b). http://www.rtr.tu-darmstadt.de/~tmichalk/IV2009_ADASControl/.
Borst, A. (1990). How do flies land? From behavior to neural circuits. In BioScience,
volume 40, pages 292–299.
Broggi, A. (1995). Robust real-time lane and road detection in critical shadow conditions.
In Proc. Int. Symp. on Computer Vision, Parma. IEEE.
Broggi, A., Bertozzi, M., Conte, G., and Fascioli, A. (2001). ARGO prototype vehicle.
In Vlacic, L., Parent, M., and Harashima, F., editors, Intelligent Vehicle Technologies.
Butterworth Heinemann, Oxford.
Broggi, A. and Grisleri, P. (2005). A software video stabilization system for automotive
oriented applications. In Procs. IEEE Vehicular Technology Conference, Stockholm,
Sweden.
Cavanagh, P. and Alvarez, G. (2005). Tracking multiple targets with multifocal attention.
Trends in Cognitive Sciences, 9:350–355.
Cech, M., Niem, W., Abraham, S., and Stiller, C. (2004). Dynamic ego-pose estimation
for driver assistance in urban environments. In IEEE Intelligent Vehicles Symposium,
pages 43–48.
Ceravola, A., Joublin, F., Dunn, M., Eggert, J., and Goerick, C. (2006). Integrated research
and development environment for real-time distributed embodied intelligent systems. In
Proc. Int. Conf. on Robots and Intelligent Systems, pages 1631–1637.
Chern, M. and Cheng, S. (2003). Finding road boundaries from the unstructured rural road
scene. In 16th IPPR Conference on Computer Vision, Graphics and Image Processing.
Corbetta, M. and Shulman, G. (2002). Control of goal-directed and stimulus-driven atten-
tion in the brain. Nature Reviews Neuroscience, 3:201–215.
Dahlkamp, H., Kaehler, A., Stavens, D., Thrun, S., and Bradski, G. (2006). Self-supervised
monocular road detection in desert terrain. In Proceedings of Robotics: Science and
Systems, Philadelphia, USA.
Dang, T., Kammel, S., Duchow, C., Hummel, B., and Stiller, C. (2006). Path planning
for autonomous driving based on stereoscopic and monoscopic vision cues. In IEEE
Proceedings of the 2006 American Control Conference, pages 191–196.
Dickmanns, E. (2004). Three-stage visual perception for vertebrate-type dynamic machine
vision. In Engineering of Intelligent Systems (EIS), Madeira.
Dickmanns, E. and Mysliwetz, B. (1992). Recursive 3-d road and relative ego-state recog-
nition. IEEE Trans. Pattern Anal. Mach. Intell., 14(2):199–213.
Egeth, H. and Yantis, S. (1997). Visual attention: control, representation, and time course.
Annual Review of Psychology, 48:269–297.
Farber, G. (2005). Biological aspects in technical sensor systems. In Proc. Advanced
Microsystems for Automotive Applications, pages 3–22, Berlin.
Findlay, J. and Gilchrist, I. (2003). Active Vision: The psychology of looking and seeing.
Oxford University Press.
Flores-Herr, N. (2001). Das hemmende Umfeld von Ganglienzellen in der Netzhaut des
Auges. PhD thesis, Frankfurt am Main, Johann Wolfgang Goethe-Universitat.
Forsyth, D. and Ponce, J. (2003). Computer Vision: A Modern Approach. Prentice Hall,
Berkeley.
Franke, U., Gavrila, D., Gern, A., Gorzig, S., Janssen, R., Paetzold, F., and Wohler, C.
(2001). From door to door - principles and applications of computer vision for driver
assistant systems. In Vlacic, L., Parent, M., and Harashima, F., editors, Intelligent
Vehicle Technologies. Butterworth Heinemann, Oxford.
Franke, U., Loose, H., and Knoeppel, C. (2007). Lane recognition on country roads. In
IEEE Intelligent Vehicles Symposium, pages 99–104.
Frintrop, S. (2006). VOCUS: A Visual Attention System for Object Detection and Goal-
Directed Search. PhD thesis, University of Bonn Germany.
Frintrop, S., Backer, G., and Rome, E. (2005). Goal-directed search with a top-down
modulated computational attention system. In DAGM-Symposium, pages 117–124.
Frintrop, S., Klodt, M., and Rome, E. (2007). A real-time visual attention system using
integral images. In Int. Conf. on Computer Vision Systems, Bielefeld.
Frintrop, S., Rome, E., and Christensen, H. (2009). Computational visual attention systems
and their cognitive foundation: a survey. ACM Transactions on Applied Percerption
(TAP).
Fritsch, J., Michalke, T., Gepperth, A., Bone, S., Waibel, F., Kleinehagenbrock, M., Gayko,
J., and Goerick, C. (2008). Towards a human-like vision system for driver assistance. In
IEEE Intelligent Vehicles Symposium, Eindhoven.
Gabor, D. (1946). Theory of communication. J. IEE, 93:429–457.
Gepperth, A., Mersch, B., Goerick, C., and Fritsch, J. (2007). Color object recognition in
real-world scenes. In de Sa, J., editor, J. Marques de Sa et al. (Eds.): Artificial Neural
Networks, 17th International Conference ICANN, Part II, Lecture Notes in computer
science, pages 583–592. Springer Verlag Berlin Heidelberg New York.
Goerick, C., Wersing, H., Mikhailova, I., and Dunn, M. (2005). Peripersonal space and
object recognition for humanoids. In Proc. Int. Conf. on Humanoid Robots.
Gray, R. and Regan, D. (1998). Accuracy of estimating time to collision using binocular
and monocular information. In Vision Research, volume 38, pages 499–512.
Hardy, R. (1983). Homeostasis. Arnold.
Harris, J. (2004). Binocular vision: moving closer to reality. In Philosophical Transactions
of the Royal Society, volume 42, pages 2721–2739.
Hawes, N. and Wyatt, J. (2006). Towards context-sensitive visual attention. In Proceedings
of the Second Int. Cognitive Vision Workshop, Graz, Austria.
Heikkila, J. and Silven, O. (1997). A four-step camera calibration procedure with implicit
image correction.
Heinke, D. and Humphreys, G. (2005). Computational models of visual selective attention:
a review. In Houghton, G., editor, Connectionist Models in Psychology, pages 273–312.
Psychology Press.
Hertel, G. (2007). Mercer-Studie Autoelektronik. Automobilelektronik, pages 26–27.
Hong, T., Chang, T., Rasmussen, C., and Shneier, M. (2002). Road detection and tracking
for autonomous mobile robots. In Proceedings of SPIE Aerosense Conference.
Hoyle, F. (1957). The black cloud. London: Penguin.
Hubel, D. and Wiesel, T. (1962). Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. Journal of Physiology, 160:106–154.
Ikegaya, M., Asanuma, N., Ishida, S., and Kondo, S. (1998). Development of a lane
following assistance system. In Int. Symp. on Advanced Vehicle Control, Nagoya.
Intel (2006). Integrated Performance Primitives. http://www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/302910.htm.
Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for
rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259.
Itti, L., Rees, G., and Tsotsos, J., editors (2005). Neurobiology of Attention. Elsevier.
Jaehne, B. (2005). Digital Image Processing. Springer, Berlin.
Jones, J., Stepnoski, A., and Palmer, L. (1987). The two-dimensional spectral structure of
simple receptive fields in the cats striate cortex. Journal of Neurophysiology, 58(6):1233–
1258.
Bouguet, J. Y. (2007). Camera Calibration Toolbox for Matlab.
http://www.vision.caltech.edu/bouguetj.
Kaiser, M. and Mowafy, L. (1993). Optical specification of time-to-passage: observers’
sensitivity to global tau. Journal of Experimental Psychology, 19:1028–1040.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans-
actions of the ASME–Journal of Basic Engineering, 82(Series D):35–45.
Kastner, R., Schneider, F., Michalke, T., Fritsch, J., and Goerick, C. (2009). Image-
based classification of driving scenes by Hierarchical Principal Component Classification
(HPCC). In IEEE Intelligent Vehicles Symposium, Xian.
Klein, R. (2000). Inhibition of return. Trends in Cognitive Science, 4(4):138–145.
Koch, C. and Ullman, S. (1985). Shifts in selective visual attention: towards the underlying
neural circuitry. Human Neurobiology, 4(4):219–227.
Kodaka, K. and Gayko, J. (2004). Intelligent systems for active and passive safety -
Collision Mitigation Brake System. In Proc. of the ATA EL conference 2004, Parma.
Kodaka, K., Otabe, M., Urai, Y., and Koike, H. (2003). Rear-end collision velocity reduc-
tion system. In Proc. 2003 SAE World Congress, Detroit.
Konolige, K. (1997). Small Vision System: Hardware and implementation. In Eighth
International Symposium on Robotics Research.
Landy, M., Maloney, L., Johnston, E., and Young, M. (1995). Measurement and modeling
of depth cue combination: in defense of weak fusion. Vision Research, 35(3):389–412.
Li, X., Yao, X., Murphey, Y., Karlsen, R., and Gerhart, G. (2004). A real-time vehicle
detection and tracking system in outdoor traffic scenes. In Proceedings of the 17th
International Conference on Pattern Recognition.
Lilliefors, W. (1967). On the Kolmogorov-Smirnov test for normality with mean and
variance unknown. Journal of the American Statistical Association, 62:399–402.
Lin, X. and Chen, S. (1991). Color image segmentation using modified HSI system for
road following. In IEEE International Conference on Robotics and Automation.
Lombardi, P., Zanin, M., and Messelodi, S. (2005). Unified stereovision for ground, road
and obstacle detection. In IEEE Intelligent Vehicles Symposium.
Luo-Wai, T. (2008). Lane detection using directional random walks. In IEEE Intelligent
Vehicles Symposium, Eindhoven.
Mallot, H. (2002). Computational vision: Information processing in perception and visual
behavior. MIT Press, Cambridge, MA.
Manz, K., Kooß, D., Klinger, K., and Schellinger, S. (2007). Entwicklung von Kri-
terien zur Bewertung der Fahrzeugbeleuchtung im Hinblick auf ein NCAP fur aktive
Fahrzeugsicherheit. Universitat Karlsruhe, Lichttechnisches Institut.
Marcelja, S. (1980). Mathematical description of the response of simple cortical cells. J.
Optical Society of America, 70(11):1297–1300.
Marita, T., Oniga, F., Nedevschi, S., Graf, T., and Schmidt, R. (2007). Camera calibration
method for far range stereovision sensors used in vehicles. In IEEE Intelligent Vehicles
Symposium, pages 356–363.
Mateus, D., Avina, G., and Devy, M. (2005). Robot visual navigation in semi-structured
outdoor environments. In IEEE International Conference on Robotics and Automation,
Barcelona.
Matzka, S., Petillot, Y., and Wallace, A. (2008). Proactive sensor-resource allocation using
optical sensors. In VDI-Berichte 2038, pages 159–167.
Michalke, T., Fritsch, J., Gepperth, A., and Goerick, C. (2009a). Robust top-down
attention for a human-like driver assistance system. Computer Vision and Image
Understanding, Special Issue on Intelligent Vision Systems, Elsevier. To appear end of
2009.
Michalke, T., Fritsch, J., and Goerick, C. (2008a). Enhancing robustness of a saliency-
based attention system for driver assistance. In The 6th International Conference on
Computer Vision Systems (ICVS), Santorini, Greece. Lecture Notes in Computer Science,
volume 5008, Springer, pages 43–55.
Michalke, T., Gepperth, A., Schneider, M., Fritsch, J., and Goerick, C. (2007). Towards
a human-like vision system for resource-constrained intelligent cars. In Int. Conf. on
Computer Vision Systems, Bielefeld.
Michalke, T., Kastner, R., Adamy, J., Bone, S., Waibel, F., Kleinehagenbrock, M., Gayko,
J., Gepperth, A., Fritsch, J., and Goerick, C. (2008b). An attention-based system ap-
proach for scene analysis in driver assistance. at - Automatisierungstechnik, 56(11):575–
584.
Michalke, T., Kastner, R., Fritsch, J., and Goerick, C. (2008c). A generic temporal integra-
tion approach for enhancing feature-based road-detection systems. In IEEE Intelligent
Transportation Systems Conference, Beijing.
Michalke, T., Kastner, R., Fritsch, J., and Goerick, C. (2009b). Towards a proactive
biologically-inspired advanced driver assistance system. In IEEE Intelligent Vehicles
Symposium, Xian.
Michalke, T., Kastner, R., Herbert, M., Fritsch, J., and Goerick, C. (2009c). Adaptive
multi-cue fusion for robust detection of unmarked inner-city streets. In IEEE Intelligent
Vehicles Symposium, Xian.
Most, S. and Astur, R. (2007). Feature-based attentional set as a cause of traffic accidents.
Visual Cognition, 15(2):125–132.
Navalpakkam, V. and Itti, L. (2005). Modeling the influence of task on attention. Vision
Research, 45(2):205–231.
Navalpakkam, V. and Itti, L. (2006). An integrated model of top-down and bottom-up
attention for optimal object detection. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 2049–2056.
Neisser, U. (1967). Cognitive Psychology. Appleton-Century-Crofts, New York.
Nieto, M., Salgado, L., Jaureguizar, F., and Cabrera, J. (2007). Stabilization of inverse
perspective mapping images based on robust vanishing point estimation. In IEEE In-
telligent Vehicles Symposium.
Ouerhani, N. (2003). Visual Attention: From Bio-Inspired Modeling to Real-Time Imple-
mentation. PhD thesis, Universite de Neuchatel, Institut de Microtechnique.
Palmer, S. (1999). Vision Science: Photons to Phenomenology. MIT Press.
Proakis, J. and Manolakis, D. (2006). Digital Signal Processing. Pearson Prentice Hall.
Ramstroem, O. and Christensen, H. (2005). A method for following unmarked roads. In
IEEE Intelligent Vehicles Symposium, pages 650–655.
Rasmussen, C. (2002). Combining laser range, color and texture cues for autonomous road
following. In IEEE International Conference on Robotics and Automation, Washington
DC.
Regan, D. (2002). Binocular information about time to collision and time to passage.
Vision Research, 42:2479–2484.
Rotaru, C., Graf, T., and Zhang, J. (2004). Extracting road features from color images
using a cognitive approach. In IEEE Intelligent Vehicles Symposium.
Schmudderich, J., Willert, V., Eggert, J., Rebhan, S., Goerick, C., Sagerer, G., and
Koerner, E. (2008). Estimating object proper motion using optical flow, kinematics,
and depth information. IEEE Transactions on Systems, Man, and Cybernetics, Part B,
38(4):1139–1151.
Schorn, M., Stahlin, U., Khanafer, A., and Isermann, R. (2006). Nonlinear trajectory
following control for automatic steering of a collision avoiding vehicle. In IEEE Inter-
national Conference on Multisensor Fusion and Integration for Intelligent Systems.
Sha, Y., Zhang, G., and Yang, Y. (2007). A road detection algorithm by boosting using
feature combination. In IEEE Intelligent Vehicles Symposium, Istanbul.
Shinoda, H., Hayhoe, M., and Shrivastava, A. (2001). What controls attention in natural
environments? Vision Research, 41:3535–3546.
Simons, D. and Chabris, C. (1999). Gorillas in our midst: Sustained inattentional blindness
for dynamic events. Perception, 28(9):1059–1074.
Smuda, P., Schweiger, R., Neumann, H., and Ritter, W. (2006). Multiple cue data fusion
with particle filters for road course detection in vision systems. In IEEE Intelligent
Vehicles Symposium, Tokyo.
Soquet, N., Aubert, D., and Hautiere, N. (2007). Road segmentation supervised by an
extended v-disparity algorithm for autonomous navigation. In IEEE Intelligent Vehicles
Symposium.
Sotelo, M., Rodriguez, F., and Magdalena, L. (2004). VIRTUOUS: Vision-based road
transportation for unmanned operation on urban-like scenarios. IEEE Transactions on
Intelligent Transportation Systems, 5(2):69–83.
Stiller, C., Farber, G., and Kammel, S. (2007). Cooperative Cognitive Automobiles. In
IEEE Intelligent Vehicles Symposium, pages 215–220.
Torralba, A. (2003). Contextual priming for object detection. International Journal of
Computer Vision, 53(2):169–191.
Trapp, R. (1998). Stereoskopische Korrespondenzbestimmung mit impliziter Detektion von
Okklusionen. PhD thesis, University of Paderborn, Germany.
Treisman, A. (1993). The perception of features and objects. In Baddeley, A. and
Weiskrantz, L., editors, Attention: Selection, awareness, and control, pages 5–35. Claren-
don Press, Oxford.
Treisman, A. and Gormican, S. (1988). Feature analysis in early vision: Evidence from
search asymmetries. Psychological Review, 95:15–48.
Treue, S. (2003). Visual attention: the where, what, how and why of saliency. Current
Opinion in Neurobiology, 13(4):428–432.
Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., and Nuflo, F. (1995). Modeling
visual attention via selective tuning. Artificial Intelligence, 78(1-2):507–545.
Tsotsos, J., Liu, Y., Martinez-Trujillo, J., Pomplun, M., Simine, E., and Zhou, K. (2004).
Attending to visual motion. CVIU, 100(1-2):3–40.
Viola, P. and Jones, M. J. (2001). Robust real-time object detection. In Second Interna-
tional Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada.
von Seelen, W. (1970). Zur Informationsverarbeitung im visuellen System der Wirbeltiere.
Kybernetik, 7:43–60.
von Trzebiatowski, M., Gern, A., Franke, U., Kaeppeler, U.-P., and Levi, P. (2004). De-
tecting reflection posts - lane recognition on country roads. In IEEE Intelligent Vehicles
Symposium.
Wersing, H. and Korner, E. (2003). Learning optimized features for hierarchical models of
invariant object recognition. Neural Computation, 15(7):1559–1588.
Willert, V., Eggert, J., Adamy, J., and Koerner, E. (2006). Non-gaussian velocity distribu-
tions integrated over space, time and scales. IEEE Transactions on Systems, Man and
Cybernetics B, 36(3):482–493.
Willert, V., Toussaint, M., Eggert, J., and Korner, E. (2007). Uncertainty optimization
for robust dynamic optical flow estimation. In Proceedings of the 2007 International
Conference on Machine Learning and Applications (ICMLA). IEEE.
Winner, H. (2007). Fahrerassistenzsysteme. Lecture notes (Vorlesungsskript), Technische
Universitat Darmstadt.
Wolfe, J. and Horowitz, T. (2004). What attributes guide the deployment of visual atten-
tion and how do they do it? Nat. Reviews Neuroscience, 5(6):495–501.
WWW (2006). European Project PReVENT. http://www.prevent-ip.org/.
WWW (2007a). DARPA Urban Challenge. http://www.darpa.mil/grandchallenge/.
WWW (2007b). European Commission Information Society Intelligent Car Initiative.
http://ec.europa.eu/informationsociety/activities/intelligentcar/.
WWW (2007c). European project SAFESPOT. http://www.safespot-eu.org.
Curriculum Vitae
Personal Details
Name: Thomas Paul Michalke
Date of birth: 06.05.1979
Place of birth: Weimar / Thuringen, Germany
Nationality: German
Marital status: Married
Education
Feb. 2006 to Jan. 2009 Ph.D. student, Control Theory and Robotics Lab, Darmstadt
University of Technology, Germany, in close cooperation
with the Honda Research Institute Europe in Offenbach;
topic: camera-based driver assistance using biologically
inspired signal processing principles
Sept. 2002 to July 2003 Study visit at the Technical University of Lyon (Ecole
Centrale de Lyon) in France: learned French, completed
a student thesis, and attended technical courses
June 2001 Obtained bachelor’s degree
Sept. 1998 to Jan. 2006 Studied for a Master’s degree in industrial engineering at
the Technical University of Darmstadt, with an economics
major in operations research and a technical major in
telecommunications engineering and data processing
technology
July 1998 Obtained baccalaureate at the Friedrich Schiller grammar
school in Weimar, courses: English, mathematics, Russian
Work Experience
Since July 2009 Daimler AG, research and development at Daimler
EvoBus.
Feb. 2006 to April 2009 Honda Research Institute Europe in Offenbach: research
and development of a biologically motivated, generic
driver assistance system; real-time implementation on a
prototype car used for online demonstrations; fusion of
sensor data from innovative sensor concepts such as PMD,
laser, and camera-based stereo; publications at
international conferences and in journals.
Aug. 2005 to Jan. 2006 Diploma thesis in a development department at BOSCH
in Ditzingen, Germany: analysis of crash signals, the
corresponding application of crash algorithms, and the
development of signal analysis software.
April 2005 to June 2005 Practical training in a development department at BOSCH
in Farmington Hills, Michigan, USA: worked on a
time-critical passenger safety project, programming a
Texas Instruments microcontroller in C.
Dec. 2004 to April 2005 Practical training in a development department at BOSCH
in Ditzingen, Germany: worked on a complex group
software project on signal processing and analysis in
Matlab.
Dec. 2002 to July 2003 Member of a research team at LEOM (a semi-private
electronics research laboratory) in Lyon: worked on new
concepts for the transmission of high-frequency signals
with light.
Publications
Conference papers
• T. Michalke, R. Kastner, J. Fritsch, C. Goerick: Towards a Proactive Biologically-
inspired Advanced Driver Assistance System, IEEE Intelligent Vehicles Symposium,
Xian, 2009
• T. Michalke, R. Kastner, M. Herbert, J. Fritsch, C. Goerick: Adaptive Multi-Cue
Fusion for Robust Detection of Unmarked Inner-City Streets, IEEE Intelligent
Vehicles Symposium, Xian, 2009
• R. Kastner, F. Schneider, T. Michalke, J. Fritsch, C. Goerick: Image-based classifi-
cation of driving scenes by Hierarchical Principal Component Classification (HPCC),
IEEE Intelligent Vehicles Symposium, Xian, 2009
• T. Michalke, R. Kastner, J. Fritsch, C. Goerick: A generic temporal integration ap-
proach for enhancing feature-based road-detection systems, IEEE Intelligent Trans-
portation Systems Conference, Beijing, 2008
• J. Fritsch, T. Michalke, A. Gepperth, S. Bone, F. Waibel, M. Kleinehagenbrock,
J. Gayko, C. Goerick: Towards a Human-like Vision System for Driver Assistance,
IEEE Intelligent Vehicles Symposium, Eindhoven, 2008
• T. Michalke, J. Fritsch, C. Goerick: Enhancing Robustness of a Saliency-based At-
tention System for Driver Assistance, Int. Conf. on Computer Vision Systems,
Santorini, 2008
• T. Michalke, M. Schneider, A. Gepperth, J. Fritsch, C. Goerick: Towards a Human-
Like Vision System for Resource-Constrained Intelligent Cars, Int. Conf. on Com-
puter Vision Systems, Bielefeld, 2007
• M. Briere, L. Carrel, T. Michalke, F. Mieyeville, I. O’Connor, F. Gaffiot: Design and
Behavioral Modeling Tools for Optical Network-on-Chip, IEEE Proceedings of the
Design, Automation and Test in Europe Conference and Exhibition (DATE 2004),
Paris, 2004
Journal papers
• T. Michalke, R. Kastner, J. Adamy, A. Gepperth, S. Bone, F. Waibel, M. Kleine-
hagenbrock, J. Gayko, J. Fritsch, C. Goerick: An attention-based system approach
for scene analysis in driver assistance, Automatisierungstechnik (AT), at-
Schwerpunktheft “Kognitive Automobile”, 2008
• T. Michalke, J. Fritsch, A. Gepperth, C. Goerick: Robust Top-Down Attention for a
Human-like Driver Assistance System, Computer Vision and Image Understanding,
Special Issue On: Intelligent Vision Systems, Elsevier (to appear end of 2009)
Books
• T. Michalke, J. Fritsch, C. Goerick: Enhancing Robustness of a Saliency-based At-
tention System for Driver Assistance, in: 6th International Conference on Computer
Vision Systems, ICVS 2008, Santorini, Greece, May 12-15, 2008, Proceedings, Series:
Lecture Notes in Computer Science, Vol. 5008, Sublibrary: Theoretical Computer
Science and General Issues, Gasteratos, Antonios; Vincze, Markus; Tsotsos, John
(Eds.), 2008
Patents
• Christof Karner, Thomas Michalke: Vorrichtung zu Crashklassifizierung (device for
crash classification), 2006, patent number: DE 10 2006 038 348 A1 2008.02.21
• Martin Heckmann, Jannik Fritsch, Thomas Michalke: Driving Path Identification
via Online Adaptation of the Driving Path Model, 2008 (pending)
• Thomas Michalke, Robert Kastner, Jannik Fritsch: System and method for object
motion detection based on multiple 3D warping and vehicle equipped with such
system, 2008 (pending)
Supervised Student Theses
• Thesis: Evaluation of different tracking algorithms and their implementation in the
context of an environmental representation for a driver assistance system, Shi Xuehui,
2006
• Thesis: Road detection on unmarked roads, Michael Herbert, 2007
• Diploma thesis: Autonomous learning in intelligent vehicles, Imran Bashir Bhatti,
2008
• Thesis: Computationally efficient lane detection on marked roads for a driver
assistance system, Wang Zheng, 2007
• Thesis: Implementation of a fuzzy-based central control unit for a complex driver
assistance system, Yan Jiajie, 2007
• Thesis: Biologically motivated filter adaptation for robust image interpretation, Pol
Blasco Moreno, 2007
• Diploma thesis: Detection of object proper motion by fusion of stereo vision with
optical flow for a driver assistance system, Andreas Schlensag, 2007
• Diploma thesis: Bio-inspired tracking of traffic-relevant objects, Marco-Antonio
Garcia-Ochoa, 2008
• Bachelor thesis: Attention-based edge and contour detection with artificial neural
networks, Conrad Klytta, 2007
• Diploma thesis: Pitch angle correction for a driver assistance system, Ming Zhao,
2007
• Thesis: Robust depth cue integration in driver assistance, Sun Hailin, 2008
• Thesis: Biologically motivated motion detection and classification, Jochen Schmell,
2008
Supervised Seminars
• Seminar: Lane detection on unmarked roads, Aleksandar Aleksandrov, Christian
Schmell, Jochen Schmell
• Seminar: Guided search and tracking using a top-down attention model, Jean-Pierre
Hickey, Jingmin Zhang
• Seminar: Investigation of weight sets for different objects, Tobias Pietsch, Sebastian
Waz
• Seminar: Vision-based control of a mobile robot, Jonatan Antoni, Daniel Donigus
• Seminar: Visual position tracking of a mobile robot, Stefanie Apprich, Said Azzam,
Ulrich Schmieder
Supervised Internships
• Student apprentice: Model- and appearance-based pitch estimation, Zhang Lyan
• Student apprentice: Lane detection and appearance-based pitch estimation, Andre
Justus
• Student apprentice: Temporal integration for improving lane detection systems,
Jochen Schmell
Supervised Tutorials
• Tutorial: Fuzzy logic, neural networks, and evolutionary algorithms (Fuzzy-Logik,
Neuronale Netze und Evolutionare Algorithmen)
• Tutorial: Control engineering laboratory II (Regelungstechnisches Praktikum II)
• Administrative supervision: Project seminar on robotics and computational
intelligence (Projektseminar Robotik und Computational Intelligence)
• Laboratory course: Control of servo drives (Regelung von Servoantrieben)
• Administrative supervision: Process control engineering (Prozessleittechnik)
Index
Attention 51
Basic feature 8
Bird’s eye view 35
Bottom-up attention 51
Canonical feature 22
Depth from object knowledge 33
Difference of Gaussian (DoG) filter 10
Differential images 42
Disparity 29
Double color opponents 26
Dynamic neuronal suppression 58
Early selection principle 51
Flat plane assumption 36
Gabor filter 16
Homeostasis 55
Inattentional blindness 52
Kolmogorov-Smirnov (KS) test of goodness of fit 137
KS-Lilliefors test of goodness of fit 137
Object motion detection 44
Parallel search 8
Plane fitting 37
Radar 42
Rectification 31
RGBY color space 22
Short term memory 104
Sigmoid function 58
Single track model 89
Sparseness weight 58
Stereoscopic depth 29
Structure tensor 57
Tau-function 38
Temporal integration 86
Time to contact 37
Top-down attention 51
Undistortion 30
Uniformity of color spaces 24
Unmarked road detection 69
Voting 118
Weak fusion 107
Weak object feature conjunctions 55