Task-Dependent Scene Interpretation in Driver Assistance
PhD Thesis
Thomas Paul Michalke
Task-Dependent Scene Interpretation in Driver Assistance
Dissertation approved by Department 18,
Electrical Engineering and Information Technology,
of Technische Universität Darmstadt
in fulfillment of the requirements for the degree of Doktor-Ingenieur (Dr.-Ing.)
submitted by
Dipl.-Wirtsch.-Ing. Thomas Paul Michalke
born on 06.05.1979
in Gera
Referee: Prof. Dr.-Ing. Jürgen Adamy
Co-referee: Prof. Dr.-Ing. Edgar Körner
Date of submission: 09.04.2009
Date of oral examination: 29.06.2009
D17
Darmstadt 2009
Acknowledgements
The PhD project was carried out over three years, between February 2006 and January 2009,
while I was working as a PhD student at the Control Theory and Robotics Lab at Darmstadt
University and the Honda Research Institute Europe in Offenbach. In many ways I am
deeply indebted to numerous people working in these two facilities.
I want to thank my supervising professor, Prof. Dr.-Ing. Jürgen Adamy, head of the
Control Theory and Robotics Lab, for all his encouragement and belief in the success of
this work. I want to thank all my colleagues at the Control Theory and Robotics Lab for
their technical, professional, and administrative support. Especially, I want to thank my
colleague Robert Kastner, who shares and supports many of my beliefs, as our short but
fruitful cooperation and numerous inspiring discussions have shown. I also want to thank
Robert Kastner for proofreading this work.
I want to express my gratitude to all students who participated in projects related to
my PhD work. My thanks go, among others, to Shi Xuehui, Michael Herbert, Imran Bashir
Bhatti, Wang Zheng, Yan Jiajie, Pol Blasco Moreno, Andreas Schlensag, Marco-Antonio
Garcia-Ochoa, Conrad Klytta, Ming Zhao, Sun Hailin, Jochen Schmell, and Zhang Lyan.
My PhD project was supervised at the Honda Research Institute in Offenbach, where I
realized the major part of my PhD project. I am grateful to Prof. Dr.-Ing. Edgar Körner,
president of the Honda Research Institute, whose visions and ideas have also guided my
work in numerous ways. Besides major professional contributions, a key factor for my
successfully finished work was the extensive access to numerous costly facilities, including
the hardware to simulate and test my approaches in real-time on a prototype vehicle. In
particular, my thanks go to Dr.-Ing. Jannik Fritsch at Honda for the close supervision of
my work even in busy times and his experience-driven warnings of the numerous possible
pitfalls of a PhD thesis.
I want to thank all other people who followed and contributed to the project and remain
unnamed.
Finally, I owe my parents a debt of gratitude for all their support during my personal
and professional education.
I am most indebted to my wife Gabriele Michalke, for her unfailing understanding
and support during the long evenings I communicated too much with my computer, for
hearing my complaints, and for giving me encouragement, approval, acknowledgement, and her
love.
I dedicate these lines to my parents, who taught me that measurable facts always last
longer than fancy but hollow phrases alone.
Contents
Acknowledgements iii
List of Symbols vii
List of Abbreviations ix
Abstract x
Kurzzusammenfassung xi
1 Introduction 1
1.1 Motivation - Going beyond State-of-the-Art in Driver Assistance . . . . . . 1
1.2 Scope - Inspiration from Biology . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions to Community . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Feature Space 8
2.1 Static Attention Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Intensity Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Orientation Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 RGBY Color Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Depth Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Biological Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2 Depth from Stereo Disparity . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Depth from Object Knowledge . . . . . . . . . . . . . . . . . . . . . 33
2.2.4 Depth from Bird’s Eye View . . . . . . . . . . . . . . . . . . . . . . 35
2.2.5 Depth from Time to Contact . . . . . . . . . . . . . . . . . . . . . . 37
2.2.6 Depth from Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3 Motion Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.1 Differential Images . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.2 Detection of Dynamic Objects . . . . . . . . . . . . . . . . . . . . . 44
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 Task-dependent Tunable Visual Attention 51
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Real-World Challenges for Top-Down Attention Systems . . . . . . . . . . 54
3.3 Modeling Attention: From a Robustness Point of View . . . . . . . . . . . 55
3.4 Functional Comparison to other Top-Down Attention Models . . . . . . . . 61
3.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Road Detection in Unconstrained Environments 69
4.1 Adaptive Multi-Cue Fusion for Detecting Unmarked Roads in Inner-City . 69
4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1.2 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2 Temporal Integration for Feature-Based Road Detection Systems . . . . . . 86
4.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.2 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5 Integrated System Approaches for Scene Interpretation 101
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Advanced Driver Assistance on Highways . . . . . . . . . . . . . . . . . . . 103
5.2.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Advanced Driver Assistance in Inner-City . . . . . . . . . . . . . . . . . . 115
5.3.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6 Summary and Outlook 129
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2 Limitations and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A Annex 136
A.1 Gaussian Image Pyramid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
A.2 Kolmogorov-Smirnov Test of Goodness of Fit . . . . . . . . . . . . . . . . 137
A.3 World to Image Transformation . . . . . . . . . . . . . . . . . . . . . . . . 137
A.4 Time to Contact - Further Evaluation Results . . . . . . . . . . . . . . . . 140
A.5 High Attention-Feature Selectivity . . . . . . . . . . . . . . . . . . . . . . . 141
Bibliography 143
Curriculum Vitae 152
Publications 154
Index 157
List of Symbols
c0 Velocity of propagation (speed of light)
D(u, v) Disparity
∆f Doppler frequency shift
∆φ Orientation selectivity of the Gabor filter bank
DRate Average detection rate
εfinal Threshold for computation of final road map
k0 Two dimensional wave number vector, which defines the direction of
selectivity of the Gabor kernel
f Focal length (in [m])
f0 Carrier frequency
F0 Theoretic cumulative frequency
fu Focal length normalized to the pixel width (in [pixels])
fv Focal length normalized to the pixel height (in [pixels])
frate Frame rate
fcenter Normalized center frequency of the Difference of Gaussian kernel
Fe Cumulative frequency of a sample
Fi,k Feature map of sub-feature i and feature modality k
Gu 2D Gaussian derivative in horizontal direction
Gv 2D Gaussian derivative in vertical direction
Hworld Height of an object in the world (in [m])
Him Height of an object in the image (in [pixels])
Hit Average FoA hit number
I Image
Igray Gray scale image
λ Weight for the linear combination of bottom-up and top-down saliency
ψ Aperture angle an object has when projected on the image plane
p(xi) Probability distribution of feature xi
r Parameter that defines the overlap of two adjacent Gabor kernels
SBU Bottom-up saliency map
STD Top-down saliency map
Stotal Overall saliency map
σi Standard deviation of internal Gaussian function of DoG
σe Standard deviation of external Gaussian function of DoG
t1, t2, t3 Translational camera offsets (position of the camera relative to the world
coordinate system)
θX , θY ,θZ Pitch angle, yaw angle, roll angle of the camera
tu Horizontal size of an image pixel (in [m])
tv Vertical size of an image pixel (in [m])
ttof Time of flight of electromagnetic wave
θ̇Y Yaw rate
U Width of the image
V Height of the image
v Vertical pixel position (of undistorted image)
u Horizontal pixel position (of undistorted image)
u0 Horizontal position of principal camera point (approximately the hori-
zontal position of the middle of the image)
v0 Vertical position of principal camera point (approximately the vertical
position of the middle of the image)
vd Vertical pixel position of distorted image
ud Horizontal pixel position of distorted image
wTDi Top-down attention weight of sub-feature map i
wsparsei Popout weight used as sparseness operator of sub-feature map i
Wworld Width of an object in the world (in [m])
Wim Width of an object in the image (in [pixels])
List of Abbreviations
ACC Adaptive Cruise Control
ADAS Advanced Driver Assistance System
AKTIV Adaptive und Kooperative Technologien für den Intelligenten Verkehr
(Adaptive Cooperative Technologies for Intelligent Traffic)
APIA Active Passive Integration Approach
BRM Binary Road Map
BU Bottom-Up (data-driven processes)
CAPS Combined Active and Passive Safety
CCD Charge-Coupled Device
CIE Commission Internationale de l'Éclairage
DARPA Defense Advanced Research Projects Agency
DFT Discrete Fourier Transform
DoG Difference of Gaussians
eBRM Extended Binary Road Map
EKF Extended Kalman Filter
FoA Focus of Attention
fRPM Final Road Probability Map
GPS Global Positioning System
HMI Human Machine Interface
IPP Intel Performance Primitives
LTM Long Term Memory
MREO Mean Relative Error in Offset
MRER Mean Relative Error in Radius
NCC Normalized Cross Correlation
PReVENT Preventive and Active Safety Applications
ROC Receiver Operator Characteristic
RoI Region of Interest
RPM Road Probability Map
RPROP Resilient backPROPagation
RTBOS Real-Time Brain-like Operation System
SNR Signal to Noise Ratio
STM Short Term Memory
TD Top-Down (knowledge-driven processes)
TTC Time to Contact
TTL Time to Live
Abstract
Increasingly complex driver assistance functionalities are developed and combined in to-
day's vehicles. Typically, these functionalities run as independent modules, each bringing
its own sensors, processing devices, and actuators. In general, no information fusion, i.e.,
cross talk between modules, takes place. However, information fusion of the available
sensors and processing modules could lead to a new quality of driver assistance functionalities
in terms of performance and robustness. Furthermore, typical driver assistance functionalities
on the market are based on highly specialized and optimized algorithms that show sound
performance only for a restricted number of clearly defined use cases. Also, the combination
of several of these rigid systems as a means to reach the long-term goal of autonomous
driving will not lead to robust system performance, taking the immense variety of traffic
situations into account.
In contrast, in the work presented here a flexible, biologically inspired driver as-
sistance system is developed that adapts its modules and the data exchange between
modules online, depending on the task. More specifically, the morphology of the brain as
well as brain-like signal processing principles are mimicked in order to increase the robustness
and flexibility of the system. The development process aimed at reaching a generic system
structure that supports several system tasks (e.g., detect fast objects, redetect once tracked
and later lost objects, predict object trajectories, or find cars on the road). In order to
include information about the scene context in the system, a robust unmarked road detection
module as well as an approach for the temporal integration of road segments is developed.
The realized driver assistance system is tested online and in real-time on a prototype car.
In one of the presented online test scenarios, a stationary car is detected in a highway con-
struction site based on cameras as the main sensor. In order to allow the system to interact
with its environment, a three-phase danger-handling scheme is integrated into the system.
Following an acoustic warning, the belt pretensioner is activated, after which the vehicle
brakes autonomously, preventing a crash. The gathered results prove the applicability of
the developed biologically inspired driver assistance system in real-world scenarios.
Extensive system evaluation shows that different system properties are in close compli-
ance with measurements gathered in psychophysical studies on humans. Based on these
results, it can be stated that the realized advanced driver assistance system closely models
important human information processing principles, allowing the usage of the system as
an attentional co-pilot for human drivers.
Kurzzusammenfassung
Increasingly complex driver assistance functionalities are installed and combined in modern
vehicles. Typically, these functionalities operate as independent modules, each with its own
separate sensors, computing hardware, and actuators. In general, no information fusion
(i.e., exchange and combination of data) takes place between the modules. Yet an
information fusion of the different sensors and computing hardware would lead to a new
quality of driver assistance functions, since performance and robustness could be improved.
Furthermore, typical commercially available driver assistance functions are based on highly
specialized and optimized algorithms that can operate reliably only in clearly defined cases.
Considering the variety of possible traffic situations, the strategic goal of autonomous
driving will not become attainable through the mere combination of a large number of
these rigid functionalities.
In contrast, this work develops a flexible, biologically inspired driver assistance system
whose modules and inter-module connections can be adapted at runtime depending on the
task. More precisely, the structure and known information processing principles of the
human brain are mimicked in order to achieve greater robustness and flexibility of the
system. The system development process aimed at a generic system structure that, unlike
known systems, supports a large number of tasks without being built and optimized for
individual tasks (e.g., detection of fast-moving objects, search for previously found and
later lost objects, prediction of object trajectories, finding vehicles on the road). In order
to provide the system with context information about the scene, a detection system for
unmarked roads as well as an approach for the temporal integration of road segments was
developed.
The driver assistance system was verified in real-time on a test vehicle. In one of the
presented test scenarios, a stationary vehicle in a highway construction site is detected
based on camera data alone. To enable an interaction of the system with its environment,
a three-phase danger-handling scheme is executed. After an acoustic warning, the belt
pretensioners are activated, and the vehicle is subsequently braked autonomously in order
to prevent a collision. The obtained results demonstrate the applicability of the developed
biologically motivated driver assistance system in real-world applications.
Extensive evaluation revealed system properties that have also been observed in
psychophysical studies on humans. Based on these results, it can be stated that the
realized driver assistance system closely models important human signal processing
principles, which enables the use of the system as an attention-based co-driver for human
drivers.
1 Introduction
Mobility is a central issue in modern economies. The need for individual and flexible
transportation systems has made the car one of the most influential products of our
time. Today's customers expect a high degree of comfort and safety in vehicles, which is
underscored by the increasing share of electronic equipment in automobiles.
Besides comfort functions, such as multimedia equipment, driver assistance functionalities
of various kinds come with today's vehicles. Such systems are designed to diminish
the effects of frequent types of road accidents (e.g., blind spot warning systems prevent
highway accidents caused by carelessness during passing maneuvers).
In the following Section, independent driver assistance functionalities that are available
on the market and presented in the literature are described. As will become apparent, the
combination of several of these individual functionalities poses major challenges. In order to
solve these challenges, integrated driver assistance systems are needed. However, regarding
the long-term goal of autonomous driving, existing integrated concepts lack the necessary
flexibility to cope with the high scene complexity and variety of scenarios present in the
traffic domain. To address this challenge, a new biologically motivated system
design is proposed. A list of the novel contributions presented in this doctoral thesis as
well as an overview of the remaining Chapters closes this introduction.
1.1 Motivation - Going beyond State-of-the-Art in Driver
Assistance
Numerous highly specialized and robust driver assistance functionalities exist on the market
and are presented in various publications. For example, many automotive suppliers have
implemented lane marking detection systems that, e.g., warn drivers in case they leave
the lane unintentionally (so-called Lane Keeping Assistant). Recently available
Stop&Go Adaptive Cruise Control (ACC) systems allow following a preceding car at an
appropriate distance even in case of a traffic jam. More complex functionalities still in
prototype status exist, concerning, among others, pedestrian detection in inner-city traffic
and detection of the free drivable area in front of the car. By now, a large number of driver
assistance functionalities can be found in upper class vehicles. Furthermore, their number
is increasing in correspondence with the growth of overall electronic equipment in vehicles
(see Fig. 1.1a). All commercially available driver assistance approaches have in common
that they solve very restricted tasks in clearly defined scenarios using highly specialized
algorithms. However, these functionalities typically run completely independently of each
other, without sharing information or sensor data. Each functionality brings its own sensors
and actuators. When extrapolating this development, problems will arise when the Human
Machine Interfaces (HMIs) and actuators of many independent assistance functionalities
interfere in highly complex scenarios.
Figure 1.1: (a) Global market of electrics and electronics in vehicles and vehicle parts in billion
Euro and annual growth [Hertel, 2007], (b) Autonomous vehicles in action during the DARPA
Urban Challenge 2007.
For example, a slow vehicle on the highway will provoke a warning of a collision avoidance
system. The surprised driver changes the lane abruptly. The Lane Keeping Assistant warns
the driver when crossing the central lane marking without activating the direction indicator.
All this might confuse the driver in this critical situation. Car manufacturers try to solve
this dilemma by using different human sensory interaction channels for the available HMIs,
but manage to do so only incompletely. Future, more complex functionalities will not
permit such simple solutions at all.
Based on this example, the need for an integrated driver assistance system becomes
apparent. Such systems integrate independent assistance functionalities coherently, which
allows the realization of conflict-free HMIs and actuator control procedures, thereby reduc-
ing system costs. Furthermore, several integrated functionalities can share the same
sensors and actuators. However, the most important advantage of an integrated system
is that a fusion of the input of all available sensors and functionalities is possible. Such
information fusion can result in an advanced driver assistance system (ADAS) allowing
complex assistance functionalities. For example, the results of a road detector and ob-
stacle detector can be fused to build up an internal representation of the environment, in
which objects can be tracked and their trajectories predicted in order to avoid collisions
and reduce the number of false-positive collision warnings.
Such integrated systems are very rare on the market and in the literature, which mostly
deals with independent driver assistance functionalities and their optimization. In the
following, some of the few existing approaches are introduced briefly in order to set
them apart from the system presented here in terms of their goals and performance.
For example, the prototype vehicles presented during the Urban Challenge [WWW,
2007a] of the Defense Advanced Research Projects Agency (DARPA) were able to perform
several driving tasks autonomously in a simplified inner-city environment (see Fig. 1.1b).
However, only a restricted number of cars as traffic participants and no pedestrians or
bicycles were present. Moreover, previously provided detailed annotated maps and the
usage of the Global Positioning System (GPS) reduced the problem complexity even fur-
ther and made it possible to solve the driving tasks without camera data. Several roughly related
European projects exist, but these concentrate on restricted issues of an integrated driver
assistance system only. For example, Safespot [WWW, 2007c] aims at preventing road acci-
dents based on cooperative systems, as e.g., vehicle-to-vehicle and vehicle-to-infrastructure
communication. Furthermore, the project “Preventive and Active Safety Applications”
(PReVENT) [WWW, 2006], which was initiated by a cooperation of the European automotive
industry, concentrates on fusing well-known driver assistance functionalities already on the
market. Summarizing, the named projects focus on improving existing driver assistance
functionalities (e.g., enhanced digital maps, lane keeping in challenging environmental con-
ditions) without reaching a higher level of functional integration.
In the project AKTIV (Adaptive Cooperative Technologies for Intelligent Traffic), whose
main sub-projects are financed by the German Federal Ministry of Economy, prototypes for
collision avoidance systems and assisting functions for intersections using stereo cameras
and vehicle-to-vehicle communication are developed and tested. The tested systems are
able to detect red traffic lights and the right-of-way at intersections. Although the results
gathered in the performed online tests look promising, the system can handle only a limited
number of well-defined scenarios (e.g., a pedestrian crossing the road, intersection equipped
with traffic lights). The system is based on specifically designed, dedicated modules that
are built and optimized only for these specific scenarios.
Based on the so-called 6D-approach [Badino et al., 2008], Daimler Research has devel-
oped a more innovative advanced driver assistance functionality that detects and evaluates
the free space in front of the car. The functionality relies on stereo cameras and integrates
the optical flow and depth measurements over time. The gathered information is visual-
ized at the car's dashboard and fused with a prototypical intersection assistant. However,
the question of how to link the free space detection module into a complex driver assistance
system is not addressed, which leaves the potential of free space information in the system
context unused.
Regarding complete architectures for intelligent vehicles in literature, [Franke et al.,
2001] and [Broggi et al., 2001] have presented approaches that focus mainly on the de-
sign of a framework that combines several reactive systems. The presented systems show
impressive results in specific scenarios and offer a good scalability in terms of computational
aspects, but the challenge of functional integration and interaction is not solved.
There are some integrated driver assistance systems already on the market that aim at
the fusion of basic sensors and actuators available in modern cars. For example, in the
project APIA (Active Passive Integration Approach) the automotive component supplier
Continental has developed a system that integrates active and passive driver assistance
functionalities (e.g., belt pretensioner, airbag, brakes, and sensing systems) in order to
decrease the braking distance and the severity of vehicle accidents. Although the general
approach is promising, the integration can so far only improve existing driver assistance
functionalities.
The automotive component supplier Bosch follows a related direction of thought. The
initiative “Vehicle Motion Management” aims at integrating and linking all available sen-
sors, actuators as well as safety and comfort functionalities in order to support and inform
the driver situation-dependently. However, products available so far that partly realize this
large-scale integration concern only the improvement of vehicle dynamics. With CAPS
(Combined Active and Passive Safety) Bosch has introduced a system that functionally
integrates the actuators of safety-related vehicle functionalities.
In addition to the lack of solutions for large-scale integration, all these approaches have
in common that they are restricted in terms of the supported scenarios and thereby do not
show the required robustness in highly complex real-world scenarios. This lack of flexibility
results from the typically very rigid structure of such systems, caused by
a design process that is focused mainly on the optimal fulfillment of individual, clearly
restricted tasks.
The doctoral thesis presented here aims to overcome these restrictions. The central idea
of this work is to solve the complexity challenge of the environment on the system level
by designing a generic system that mimics the human brain. More specifically, a driver
assistance system is developed that gets inspiration from known signal processing principles
and the structural organization of the brain. For example, instead of collecting all existing
environmental information followed by a late selection of relevant data, as is done in
most technical systems, the biologically inspired approach of early information selection
is realized. Predominantly, environmental data that is compatible with the current system
expectation will reach higher processing levels, which reduces the problem complexity. The
described complex selection and suppression principle is named attention and is one of the
key aspects of the doctoral thesis at hand.
1.2 Scope - Inspiration from Biology
As opposed to the existing classical driver assistance functionalities presented before, the
system developed here takes the human as a role model. This is done on the micro-level
by getting inspiration from human signal processing principles. More specifically, on the
micro-level the type and parameterization of the supported visual features is derived from
biology (e.g., edge filters are drawn from the form of receptive fields of neurons, color pro-
cessing principles are inspired by the processes on the retina). Please refer to Chapter 2
for details. Also, on the macro-level the system gets inspiration from the human brain,
since the organization and combination of signal flows in the brain is mimicked. For exam-
ple, a brain-like separation between the processing pathways for the detection of motion
respectively position of objects (“motion pathway”) and for the classification of objects
(“form pathway”) allows the generic task-dependent adaptation of signal processing (see
Sect. 5.2). Furthermore, design principles inspired by biology are used to increase the
robustness of system modules. For example, the attention sub-system in Chapter 3 or the
unmarked road detection sub-system in Chapter 4 are adapted dependent on the environ-
ment, meaning that all essential system parameters are computed dynamically based on
the characteristics of the current scene, which assures a robust system performance even
in challenging lighting or weather conditions. Based on these exemplary brain-
like principles, a biologically motivated driver assistance system is developed and tested
online. The presented system explicitly draws on biological motivation in cases where
classical engineering-based approaches fail or cannot do better.
The presented thesis aims at developing an integrated driver assistance system that
fuses the data of various sensors (e.g., laser, internal sensors, dense camera-based stereo)
and higher level modules (e.g., unmarked road detection module, short term memory with
objects detected in the environment). The gathered information is used to actively improve
internal processes and modules in the vision sub-system as well as to control actuators (e.g.,
the belt pretensioner, acoustic warnings, or brakes).
Based on this large-scale information fusion, a new quality of driver assistance function-
alities is reached and tested online on a prototype vehicle. For example, the system is able
to detect a stationary obstacle and brake autonomously, where current driver assistance
functionalities for radar-based collision mitigation fail (see Sect. 5.2). The system is fur-
thermore able to actively generate and test predictions concerning the behavior of other
objects in the environment. For example, the driver assistance system actively searches
for cars when car-like openings in the detected road segment are found (please refer to
Sect. 5.3).
The central assumption of this doctoral thesis, developed in the following Chapters, is
that only a generic system will be able to cope with the high number of possible scenarios
in the traffic domain. So, instead of designing and fusing specific methods that solve
restricted, clearly specified tasks, the system proposed here is organized in a way that
allows generic processing in terms of the supported tasks. More specifically, the existing
system modules can be modified online by adapting parameters and links between the
modules dependent on the current system task. The central system component that is
based on this assumption is the attention sub-system (described in detail in Chapter 3),
which is used as generic front-end for all visual processing.
1.3 Contributions to Community
In the following, the novelties presented in this doctoral thesis are summarized briefly.
Starting with novelties on the level of modules and functionalities, the following approaches
were developed:

• Biologically motivated filter kernels described in the literature are extended in a computationally efficient way, adding novel visual features to the system,
• A novel biologically motivated suppression of the horizon edge is proposed,
• A well-known lane marking detection approach is extended by a biologically motivated preprocessing step that improves the detection performance,
• A well-known approach for the detection of moving objects is extended by including accumulated top-down knowledge of the environment,
• An innovative mono-camera-based depth cue is formalized to be suitable for the vehicle domain.

As described before, the thesis at hand stresses the role of system design and large-scale
information fusion. On the system level, the following novelties were realized:

• A robust human-like attention system running in real-time is developed that is based on five novel principles solving typical attention-related challenges,
• An attention-based vision system is applied online in real-world scenarios of the vehicle domain,
• An adaptive unmarked road detection system is proposed that relies on four novel principles,
• A generic, computationally efficient temporal integration approach for improving existing unmarked road detection systems is developed,
• A driver assistance system on a prototype vehicle is realized that allows autonomous emergency braking on highways based on vision as the major cue,
• A biologically motivated driver assistance system is realized that integrates environmental context information in order to facilitate safe processing in inner-city scenarios.
1.4 Overview
The thesis is structured as follows: In Chapter 2 the implemented biologically motivated
feature space is described. Proposed extensions to state-of-the-art approaches in liter-
ature are evaluated qualitatively and quantitatively based on known stand-alone driver
assistance functionalities (e.g., marked lane detection). Chapter 3 elaborates on the pro-
posed human-like attention approach, which integrates the previously described features
allowing generic task-dependent scene decomposition. The comprehensive description of
the realized attention sub-system centers on concepts that improve the system robustness
allowing its application in real-world traffic scenes. Chapter 4 focuses on visual-feature-
based approaches that allow a robust detection of unmarked roads and also introduces
the concept of temporal integration for improving the road detection performance in com-
plex scenes such as inner-city traffic. Finally, the central Chapter 5 proposes a biologically motivated
driver assistance system, which relies on the human-like attention system as generic visual
front-end of all task-related processing. The integration of the unmarked road detection
system introduces environmental context into the driver assistance system. Online results
gathered on a prototype car that brakes autonomously in a complex highway construction
site and the evaluation of internal system representations in an inner-city scene allow the
assessment of the proposed system. Chapter 6 summarizes the PhD thesis, giving a com-
prehensive overview of the contributions to the community. Furthermore, the limitations
of the system and an outlook to future extensions are given.
2 Feature Space
The following Chapter describes the developed biologically inspired feature space our
ADAS architecture relies on for fulfilling its generic vision tasks. Typical features
for biologically motivated vision systems are intensity, orientation, and color (see, e.g.,
[Frintrop et al., 2005, Itti et al., 1998]). These features are often preferred because they
are so-called basic features: a feature is a basic feature if, among other things, it allows
an efficient visual search. The efficiency of visual search tasks is assessed by psy-
chophysical studies that determine the reaction time of subjects to visual impulses (see
[Treisman, 1993] and [Wolfe and Horowitz, 2004] for a summary). More specifically, a ba-
sic feature allows an efficient parallel search, i.e., in a search task with a growing number
of distractors the mean search time is constant or only slightly increasing (see Fig. 2.1). A
basic feature allows a clear differentiation against distractors that do not possess this
feature (i.e., one differentiating feature exists in these so-called feature search tasks).
In addition to the three named features, recent biologically motivated systems also incor-
porate depth and motion (see, e.g., [Aziz and Mertsching, 2008]). Both features are marked
as basic features by most researchers (see [Wolfe and Horowitz, 2004] for an overview). An
important property of these basic features, and a reason for their efficiency in visual search
tasks is that they draw or guide attention. The attention principle plays an important role
for the here developed ADAS since it will allow solving specific vision tasks in a generic
fashion. It will be described in detail in Chapter 3. In a nutshell, the attention features
introduced in the following are combined to form a saliency map that is the key aspect and
major output of a human-like attention system. This saliency map shows high activation
at image regions that contain a high level of information in terms of a specific vision task
(top-down driven activation) or because the image regions differ strongly from the rest of
the image, meaning that a high local entropy is present (bottom-up driven activation).
In the following Section, three well-known biologically motivated visual features together
with some important conceptual extensions are described. These features are static, i.e.,
they depend on the current image only (as opposed to dynamic features that also depend
on previous images). These static features will be Difference of Gaussians (DoG) filters
for detecting intensity changes, Gabor filters for detecting oriented structures, and RGBY-
colors. In the second and third Sections of this Chapter, higher-level features such as
different biologically inspired depth sources as well as different motion features are described.
2.1 Static Attention Features
In the following, specific filter types, and thereby an extensively used image processing
method, will be motivated by showing their resemblance to the processing in the human
brain.

Figure 2.1: Efficient and inefficient visual search tasks: (a) Efficient search: Orientation is a
basic feature, (b) Inefficient search: Numeric character differentiation is not a basic feature.
This resemblance is based on the fact that receptive fields of neurons (i.e., their measur-
able transfer functions, see [Flores-Herr, 2001]) are equivalent to filter kernels. Therefore,
the signal processing principles in neurons can be described in computational image pro-
cessing by a convolution, as stated in [von Seelen, 1970]. Furthermore, in the first stage
of human visual processing the image is sampled with the resolution of the photorecep-
tors on the retina. All this makes computational image filtering based on convolution
with biologically inspired filter kernels a close approximation of human visual signal
processing. Figure 2.2 visualizes the neuronal signal processing that equals the convolution:
$$I_{\mathrm{filt}}(u, v) = g(u, v) * I(u, v) = \sum_{u'=0}^{U-1} \sum_{v'=0}^{V-1} g(u', v')\, I(u - u', v - v'). \quad (2.1)$$
Figure 2.2: Simple static neuron model with receptive field and synaptic weights g(u, v) that
are equivalent to a symmetric filter kernel used in image convolution.
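To make this correspondence concrete, the following minimal Python sketch (an illustration under assumed values, not the thesis implementation) applies Equ. (2.1): the kernel g plays the role of the synaptic weights of the neuron model in Fig. 2.2, and the 3x3 center-surround weights are an assumption.

```python
import numpy as np
from scipy.ndimage import convolve

def filter_image(image: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Convolve an intensity image with a receptive-field kernel g,
    mirroring the neuronal weighting of Equ. (2.1)."""
    return convolve(image.astype(float), g, mode="nearest")

# Assumed 3x3 center-surround weights in the spirit of Fig. 2.2:
# excitatory center, inhibitory surround, zero mean.
g = np.array([[-1.0, -1.0, -1.0],
              [-1.0,  8.0, -1.0],
              [-1.0, -1.0, -1.0]]) / 8.0

image = np.random.rand(64, 64)     # stand-in for the sampled retinal image
response = filter_image(image, g)  # one filter response value per pixel
```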
In the following two subsections, the technical equivalents of two basic measured recep-
tive field types are described (DoG and Gabor). After that, with RGBY, a color space is
described that mirrors the processing on the retina.
2.1.1 Intensity Feature
In the following, the Difference of Gaussians feature is biologically motivated and a pa-
rameterization is derived that allows the implementation of a filter bank for a sparse signal
decomposition.
Biological Motivation
In vitro receptive field measurements of ganglion cells in the retina of macaque mon-
keys have shown a characteristic center-surround behavior [Flores-Herr, 2001]. Also mea-
surements of receptive fields of neurons located in early regions of the visual pathway of
macaque monkeys have shown a similar characteristic [Trapp, 1998]. In other words, the
receptive fields are selective to monotonous regions (blobs), which differ from the back-
ground in terms of their intensity. An example for such a contrast is shown in the lower
left image in Fig. 2.4b. Furthermore, theories and supporting measurements exist, which
allow to interpret specific brain regions in the human visual pathway as filter banks that
decompose an input image in terms of the existing frequencies [Mallot, 2002]. The realized
attention system extends this principle to all static and dynamic features.
In the following, the measured center-surround behavior is modeled using a filter kernel
of two 2D Gaussian functions that are subtracted (Difference of Gaussians). A parameter-
ization of these Gaussian functions will be provided that will allow the implementation of
a low-loss filter bank for said filter kernel.
The Difference of Gaussians (DoG) filter is selective to homogeneous regions (blobs) of
different sizes (see Fig. 2.4). The filter kernel is not orientation selective (i.e., it is isotropic).
In its basic form, the center of the DoG filter kernel is excitatory and the lateral region is
inhibitory. The discrete DoG filter kernel results from sampling a Gaussian curve with a
small variance $\sigma_i^2$, from which a Gaussian curve with a bigger variance $\sigma_e^2$ is subtracted
(see also Fig. 2.3):
$$\mathrm{DoG}(u, v) = \frac{1}{2\pi\sigma_i^2}\, e^{-\frac{u^2+v^2}{2\sigma_i^2}} - \frac{1}{2\pi\sigma_e^2}\, e^{-\frac{u^2+v^2}{2\sigma_e^2}}. \quad (2.2)$$
On a more qualitative level, the DoG subtracts the mean weighted intensity of a smaller
center region from the mean weighted intensity of a bigger surround for each image pixel.
As will become apparent in Chapter 3, in order to yield high hit rates in top-down related
search (i.e., when searching for a specific object using the saliency map), the features of
an attention system need high selectivity to provide as many supporting and inhibiting
(i.e., suppressing) maps as possible. At the same time, high efficiency is needed due
to constraints in computational resources. An approach fulfilling these demands is the
separation of the DoG filter into on-center (called on-off in the following) and off-center
selectivity (off-on), as emphasized in [Frintrop, 2006] (see Fig. 2.5a and Fig. 2.5b).
Figure 2.3: One-dimensional on-off DoG (black) and the two Gaussian functions it is composed
of (positive Gaussian function with small standard deviation in blue, negative Gaussian function
in red).
Thereby, the attention system can differentiate between bright blobs on a dark background
and dark blobs on a bright background. To realize such an on-off/off-on separation the
DoG filter response is separated into its positive and negative part, which is equivalent
to the computationally more demanding independent filtering with the two different filter
kernels depicted in the lower left corner of Fig. 2.5a and Fig. 2.5b respectively.
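The following sketch illustrates this separation (kernel size, sigma values, and the test image are assumptions for illustration): the DoG kernel of Equ. (2.2) is applied once, and the response is then rectified into the two channels.

```python
import numpy as np
from scipy.ndimage import convolve

def dog_kernel(sigma_i: float, sigma_e: float, size: int = 15) -> np.ndarray:
    """Sample the DoG of Equ. (2.2): narrow excitatory center (sigma_i)
    minus wide inhibitory surround (sigma_e)."""
    n = np.arange(size) - size // 2
    u, v = np.meshgrid(n, n)
    rr = u**2 + v**2
    center = np.exp(-rr / (2 * sigma_i**2)) / (2 * np.pi * sigma_i**2)
    surround = np.exp(-rr / (2 * sigma_e**2)) / (2 * np.pi * sigma_e**2)
    return center - surround

def on_off_split(image: np.ndarray, kernel: np.ndarray):
    """One filter operation, then rectification: the positive part is the
    on-off channel (bright blob on dark ground), the negative part the
    off-on channel (dark blob on bright ground)."""
    resp = convolve(image.astype(float), kernel, mode="nearest")
    return np.maximum(resp, 0.0), np.maximum(-resp, 0.0)

# Illustrative sigma values; see the parameterization derived below
on_off, off_on = on_off_split(np.random.rand(128, 128), dog_kernel(0.7, 1.12))
```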
As described at the beginning of this Section, the attention features are computed on
different scales allowing a decomposition of the signal into overlapping frequency chunks.
In the following, the appropriate parameterization of the DoG kernel is derived, which
makes such a decomposition possible.
Parameterization of the DoG Kernel
The parameters σi and σe in Equ. (2.2) determine the frequency characteristic of the filter.
As described in Annex A.1, for efficient filtering a Gaussian pyramid approach is used
that scales the input image while the filter kernel is not changed in size. In order to assure an
accurate filtering procedure, the normalized central frequency fcenter of the band-pass-type
DoG filter needs to be 0.25, which equals a period length of 4 pixels and hence a blob of
2x2 pixels or a line of 2 pixels of any orientation. If fcenter = 0.25 it can be assured that a
low-loss pyramid-based image filtering can be done.
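A minimal sketch of this pyramid idea (the smoothing sigma and number of levels are assumptions; Annex A.1 describes the actual scheme): the image is repeatedly smoothed and subsampled by a factor of two, so that the unchanged filter kernel becomes selective to structures of doubling size.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image: np.ndarray, levels: int = 5) -> list:
    """Build a Gaussian pyramid: anti-alias smoothing, then subsampling
    by 2 per level; the same fixed-size kernel is applied on every level."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma=1.0)  # sigma assumed
        pyramid.append(smoothed[::2, ::2])                  # subsample by 2
    return pyramid
```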
In order to derive said DoG parameterization, the frequency domain representation of the
DoG kernel and its dependencies are considered. To decompose the problem, it is used
that the 2D DFT can be expressed as a combination of two one-dimensional Discrete
Fourier Transforms (DFT) [Proakis and Manolakis, 2006]:

$$\mathrm{DFT}_{u,v}\left(f(u,v)\right) = \mathrm{DFT}_u\left(\mathrm{DFT}_v\left(f(u,v)\right)\right). \quad (2.3)$$
Figure 2.4: (a) Test image containing all frequencies in all orientations, (b)-(f) Different levels
of the DoG filter bank with the filter response for the test image on top, the filter kernel in the
image domain at bottom left, and the filter kernel in the frequency domain at bottom right.
(a) DoG on-off, (b) DoG off-on, (c) 0° even Gabor on-off, (d) 0° even Gabor off-on, (e) 0° odd Gabor on-off, (f) 0° odd Gabor off-on
Figure 2.5: Application of filter kernels on simple test images (negative filter response is cut
off). Both the 2 DoG features (a), (b) and the 4 Gabor features (c)-(f) are realized with one
filter operation each. Every picture shows on the left the used input test image and on the
right the respective filter response for the filter kernel in the bottom left corner.
According to Equ. (2.3), the computation of the 2D DFT is equivalent to applying the
DFT to the two image dimensions independently. Therefore, also for the 2D case, the
following transformation rule can be applied, which gives the transform of a 1D Gaussian
to the frequency domain:

$$\mathrm{DFT}\!\left( \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{n^2}{2\sigma^2}} \right) = e^{-\frac{\sigma^2 \omega^2}{2}}. \quad (2.4)$$
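The separability stated in Equ. (2.3) is easy to check numerically; the following sketch (an illustration using numpy's FFT as the DFT) confirms it on a random image before the rule is applied to the DoG below.

```python
import numpy as np

image = np.random.rand(32, 32)

# Equ. (2.3): the 2D DFT equals a 1D DFT over rows followed by one over columns.
full_2d = np.fft.fft2(image)
row_then_col = np.fft.fft(np.fft.fft(image, axis=1), axis=0)

assert np.allclose(full_2d, row_then_col)
```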
Based on Equ. (2.3) and (2.4), the 2D DFT of the DoG is given in Equ. (2.5):

$$\begin{aligned}
\mathrm{DFT}_{u,v}(\mathrm{DoG})
&= \mathrm{DFT}_{u,v}\!\left( \frac{1}{2\pi\sigma_i^2}\, e^{-\frac{u^2+v^2}{2\sigma_i^2}} - \frac{1}{2\pi\sigma_e^2}\, e^{-\frac{u^2+v^2}{2\sigma_e^2}} \right) \\
&= \mathrm{DFT}_u\!\left( \mathrm{DFT}_v\!\left( \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{v^2}{2\sigma_i^2}} \right) \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{u^2}{2\sigma_i^2}} \right)
 - \mathrm{DFT}_u\!\left( \mathrm{DFT}_v\!\left( \frac{1}{\sqrt{2\pi}\,\sigma_e}\, e^{-\frac{v^2}{2\sigma_e^2}} \right) \frac{1}{\sqrt{2\pi}\,\sigma_e}\, e^{-\frac{u^2}{2\sigma_e^2}} \right) \\
&= e^{-\frac{(\omega_u^2+\omega_v^2)\,\sigma_i^2}{2}} - e^{-\frac{(\omega_u^2+\omega_v^2)\,\sigma_e^2}{2}} \\
&= e^{-2\pi^2\left(f_u^2+f_v^2\right)\sigma_i^2} - e^{-2\pi^2\left(f_u^2+f_v^2\right)\sigma_e^2}. \qquad (2.5)
\end{aligned}$$
The DoG has a band-pass frequency characteristic (i.e., the constant component of the
resulting filter response, $\mathrm{DFT}_{u,v}(\mathrm{DoG})(\omega_u = 0, \omega_v = 0)$, is equal to zero). The main
parameter of a band-pass filter is the center frequency $f_{\mathrm{center}}$. The value $f_{\mathrm{center}}$ is the
frequency where the transfer function of the filter has its maximum; in other words, it marks
the line width the filter is selective to. To find the filter parameter $\sigma_i$, a reformulation of
Equ. (2.5) in polar coordinates is helpful ($f_u = R\sin(\alpha)$ and $f_v = R\cos(\alpha)$):

$$\mathrm{DoG}(R) = e^{-2\pi^2 R^2 \sigma_i^2} - e^{-2\pi^2 R^2 \sigma_e^2}. \quad (2.6)$$

To find the extremum, the derivative of Equ. (2.6) is taken (see Equ. (2.7)) and set to zero
(see Equ. (2.8)). After introducing $\gamma_{\mathrm{DoG}} = \frac{\sigma_e}{\sigma_i}$ as the ratio between the outer and inner
Gaussian and rearranging, Equ. (2.9) results:

$$\frac{\partial\,\mathrm{DoG}(R)}{\partial R} = 4\pi^2 R \sigma_e^2\, e^{-2\pi^2 R^2 \sigma_e^2} - 4\pi^2 R \sigma_i^2\, e^{-2\pi^2 R^2 \sigma_i^2} \quad (2.7)$$

$$0 = 4\pi^2 R \left( \sigma_e^2\, e^{-2\pi^2 R^2 \sigma_e^2} - \sigma_i^2\, e^{-2\pi^2 R^2 \sigma_i^2} \right) \quad (2.8)$$

$$0 = e^{-2\pi^2 R^2 \sigma_i^2} \left( \gamma_{\mathrm{DoG}}^2\, e^{-2\pi^2 R^2 \sigma_i^2 \left(\gamma_{\mathrm{DoG}}^2 - 1\right)} - 1 \right) \quad (2.9)$$

$$1 = \gamma_{\mathrm{DoG}}^2\, e^{-2\pi^2 R^2 \sigma_i^2 \left(\gamma_{\mathrm{DoG}}^2 - 1\right)}$$

$$-2\pi^2 R^2 \sigma_i^2 \left( \gamma_{\mathrm{DoG}}^2 - 1 \right) = \ln\!\left( \frac{1}{\gamma_{\mathrm{DoG}}^2} \right)$$

$$R^2 = \frac{\ln(\gamma_{\mathrm{DoG}})}{\pi^2 \sigma_i^2 \left( \gamma_{\mathrm{DoG}}^2 - 1 \right)} \quad (2.10)$$
Notice that $f_{\mathrm{center}}$ represents the norm in the 2D frequency domain ($f_{\mathrm{center}} = \sqrt{f_u^2 + f_v^2}$),
which means the radius $R$ in Equ. (2.10) is set to $f_{\mathrm{center}}$. Rearranging Equ. (2.10), the
dependency of the optimal frequency $f_{\mathrm{center}}$ is found to be Equ. (2.11), and hence $\sigma_i$
results in Equ. (2.12):

$$f_{\mathrm{center}} = \frac{1}{\pi \sigma_i} \sqrt{\frac{\ln(\gamma_{\mathrm{DoG}})}{\gamma_{\mathrm{DoG}}^2 - 1}} \quad (2.11)$$

$$\sigma_i = \frac{1}{\pi f_{\mathrm{center}}} \sqrt{\frac{\ln(\gamma_{\mathrm{DoG}})}{\gamma_{\mathrm{DoG}}^2 - 1}} \quad (2.12)$$

It can be shown that the second derivative is negative at this point, marking the extremum
as a maximum. As described before, $f_{\mathrm{center}}$ in Equ. (2.12) is set to 0.25. Furthermore, e.g.,
[Mallot, 2002] recommends setting the ratio between the outer and inner Gaussian to
$\gamma_{\mathrm{DoG}} = 1.6$, which results in a low-loss signal decomposition with only few redundancies
between scales.
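As a numerical illustration (a sketch with the values stated in the text: $f_{\mathrm{center}} = 0.25$, $\gamma_{\mathrm{DoG}} = 1.6$), the kernel parameters follow directly from Equ. (2.12), and the transfer function of Equ. (2.5) can be checked to peak at $f_{\mathrm{center}}$:

```python
import numpy as np

f_center = 0.25   # normalized center frequency: period of 4 pixels
gamma = 1.6       # ratio sigma_e / sigma_i recommended by [Mallot, 2002]

# Equ. (2.12): standard deviation of the inner Gaussian
sigma_i = np.sqrt(np.log(gamma) / (gamma**2 - 1)) / (np.pi * f_center)
sigma_e = gamma * sigma_i

# Sanity check: the band-pass transfer function of Equ. (2.5) peaks at f_center
f = np.linspace(0.0, 0.5, 2001)
H = np.exp(-2 * np.pi**2 * f**2 * sigma_i**2) - np.exp(-2 * np.pi**2 * f**2 * sigma_e**2)
print(sigma_i, sigma_e, f[np.argmax(H)])  # approx. 0.70, 1.12, 0.25
```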
Discussion
As described in the previous subsection, the separation of the DoG filter response into on-off
and off-on contrasts increases the feature space without additional computational costs.
In the following, it is qualitatively shown that this approach increases the performance
of a marked lane detection system. The approach extends known algorithms for lane
marking detection by adding a biologically motivated filter step for preprocessing. More
specifically, the DoG filter is used as input feature. Figure 2.6a shows a typical inner-city
scenario with strong shadows on the road. For detecting the lane markings the view from
above (the so-called bird’s eye view) is computed (see Fig. 2.6b), as will be described
in Sect. 2.2.4. On the bird’s eye view lane marking-like contrasts (bright image regions
on a darker background) are detected by the DoG filter after which a clothoid-model-
based approach for detecting the markings is used (see, e.g., [Dickmanns and Mysliwetz,
1992, Franke et al., 2007, Ramstroem and Christensen, 2005] for related clothoid-based
approaches). Figure 2.6c depicts the DoG filter results without the described on-off/off-
on separation. Since lane markings have a typical on-off contrast (white/yellow markings
on a darker street), the on-off DoG filter results should be used, since these contain less
false-positive activations (Fig. 2.6d). For example, in [Luo-Wai, 2008] the pre-filtered road
image still contains the lane marking unspecific off-on contrasts (e.g., shadows on the road).
Such off-on contrasts are filtered out in our marked street detection approach to improve
the road detection performance.
For a quantitative evaluation of the influence of the described on-off DoG separation an
implemented lane marking detection system is used. The system gets a DoG-filtered edge
image without on-off (please refer to Fig. 2.6c) and with on-off separation (as shown in
Fig. 2.6d). The gathered results are summarized in Tab. 2.1. The evaluation shows the
improvement in accuracy of the detected offset (i.e., horizontal position of lane markings)
and radius of the road based on manually labeled ground truth data of 330 highway frames
(see Fig. 2.7 for a visualization of the scenario and the gathered results).
Table 2.1: Mean relative error of detection results (offset and radius of the lane marking model), with $\mathrm{MREO} = \frac{1}{N}\sum \frac{GT_{\mathrm{offset}} - \mathrm{offset}}{GT_{\mathrm{offset}}}$ and $\mathrm{MRER} = \frac{1}{N}\sum \frac{GT_{\mathrm{radius}} - \mathrm{radius}}{GT_{\mathrm{radius}}}$.

Type of input data preprocessing | Mean relative error in offset (MREO) | Mean relative error in radius (MRER)
Without DoG on-off separation    | 4.46                                 | 80.87
With DoG on-off separation       | 4.35                                 | 72.22
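The measures of Tab. 2.1 can be reproduced as in the following sketch (the arrays are illustrative stand-ins for the 330 labeled frames; taking the absolute relative error per frame is an assumption):

```python
import numpy as np

def mean_relative_error(ground_truth: np.ndarray, estimate: np.ndarray) -> float:
    """Mean relative error over the N labeled frames, cf. Tab. 2.1."""
    return float(np.mean(np.abs(ground_truth - estimate) / np.abs(ground_truth)))

# Illustrative per-frame values (offset of one lane marking, in m)
gt_offset = np.array([-1.80, 1.75, 5.10])
detected_offset = np.array([-1.90, 1.50, 5.09])
print(mean_relative_error(gt_offset, detected_offset))
```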
Figure 2.6: Exemplary performance gain of on-off DoG separation as pre-processing step of a
lane marking detection system: (a) Input image, (b) Bird’s eye view, (c) DoG result without
on-off separation, (d) DoG result with on-off contrasts only (off-on contrasts are filtered out).
2.1.2 Orientation Feature
In the following, the Gabor feature for the detection of lines and edges is biologically
motivated and a parameterization is derived that allows the implementation of a filter
bank for a sparse signal decomposition.
Biological Motivation
According to [Hubel and Wiesel, 1962], the lower layers of the cortex in cats contain ori-
entation selective neuron populations. Please note that for lines and edges no 360 degree
direction, but only 180 degree orientation is defined.
(Overlaid detection results per panel: marker offsets −1.80 m / 1.75 m / 5.10 m, road radius >3000 m, no turn; marker offsets −1.90 m / 1.50 m / 5.09 m, road radius >3000 m, no turn; marker offsets −1.90 m / 1.50 m / 4.90 m, road radius 2378 m, left turn.)
Figure 2.7: Sample images of the evaluation scene (lane marking detection results visualized).
The activation of these neuron populations decreases by 50% when the stimulus is rotated
by 15-20 degrees relative to the preferred orientation. According to the spatial frequency
theory (refer to [Palmer, 1999] for details),
these results give biological motivation for a filter bank with an angle selectivity of 30-40
degree. Also a frequency (respectively scale) selectivity of these neuron populations was
proven to exist based on experiments (see [Marcelja, 1980]). The receptive fields of these
neurons can be described by even and odd Gabor functions (pairs of quadrature filters),
which represent a Gaussian kernel modulated by a sinusoid with a phase shift of 0 and
90 degrees, respectively. As a visualization, Fig. 2.8 depicts a one-dimensional even and
odd Gabor kernel.
Figure 2.8: (a) One-dimensional even Gabor function (black), the Gaussian function (blue)
and the modulating cosine function (red) it is composed of, (b) One-dimensional odd Gabor
function.
Parameterization of the Gabor Kernel
The even Gabor filter kernel, which is selective to lines, equals the real part of $g(\mathbf{x})$ in
Equ. (2.13). The odd Gabor filter kernel, which is selective to edges, equals the imaginary
part of $g(\mathbf{x})$ in Equ. (2.13), with $\mathbf{x} = [u \;\; v]^T$:

$$g(\mathbf{x}) = \frac{1}{2\pi a b}\, e^{-\frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}}\, e^{j\, \mathbf{k}_0^T \mathbf{x}} \quad (2.13)$$

with

$$\mathbf{k}_0 = \begin{bmatrix} |\mathbf{k}_0| \cos\phi \\ |\mathbf{k}_0| \sin\phi \end{bmatrix}, \qquad
\mathbf{A} = \mathbf{R}\,\mathbf{P}\,\mathbf{R}^T =
\begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}
\begin{bmatrix} a^{-2} & 0 \\ 0 & b^{-2} \end{bmatrix}
\begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix}.$$
The variances $a^2$ and $b^2$ influence the size of the underlying Gaussian function in the two
image dimensions. According to the biological measurements, the filter kernel has a
modulation orthogonal to the longer principal axis of the Gaussian curve, which requires
$a^2$ to be smaller than $b^2$. The angle $\phi$ in the rotation matrix $\mathbf{R}$ determines the orientation
of the filter kernel:

$$\phi = m\,\Delta\phi \quad \text{and} \quad \Delta\phi = \frac{2\pi}{M} \quad \text{with} \quad m \in [0\,..\,M-1]. \quad (2.14)$$
The wave number vector k0 determines the 2D period length of the modulated complex
oscillation and thereby the selectivity of the kernels in the frequency domain. The value
|k0| determines the line width the filter is selective to. It is important to note that |k0| is
constant for all filter orientations.
The factor $\gamma_{\mathrm{Gabor}}$ is introduced as the ratio between the principal axes $a$ and $b$ (i.e., width
and height) of the underlying Gaussian curve:

$$\gamma_{\mathrm{Gabor}} = \frac{a}{b} \quad \text{with} \quad a < b. \quad (2.15)$$
In [Jones et al., 1987], $\gamma_{\mathrm{Gabor}}$ was measured to be between 0.25 and 1.0, with a mean of 0.6,
in neuronal receptive fields in the cat cortex. Given these measurements and Equ. (2.14),
the number of orientation channels $M$ is typically set to a value between 4 and 18. As
shown in [Trapp, 1998], the same parameter setting for the size of the Gaussian curve can
be applied to all orientations of a specific frequency channel. Based on this, the following
generic equation for $\gamma_{\mathrm{Gabor}}$ is derived in [Trapp, 1998]:

$$\gamma_{\mathrm{Gabor}} = \frac{a}{b} = \frac{3\sin\left(\frac{\Delta\phi}{2}\right)}{\sqrt{1 - 9\left(\cos\frac{\Delta\phi}{2} - 1\right)^2}}. \quad (2.16)$$
Since the filter bank uses a Gaussian resolution pyramid the filter kernel is also independent
of the frequency channel (i.e., scale).
It is now sufficient to determine the parameter $a$ in Equ. (2.16), which encodes the overlap
in the frequency domain between the filter kernels of a specific orientation of two adjacent
frequency channels (scales). In [Trapp, 1998], a generic formulation for the parameter $a$ is
proposed that contains the parameter $r$, which determines the value at the overlap of two
adjacent transfer functions (adjacent in orientation or scale):

$$a = \frac{3\sqrt{-2\ln(r)}}{|\mathbf{k}_0|}. \quad (2.17)$$
The parameter r is a value between 0 and 1 and is the ratio between the overlap value and
the maximal value of the transfer function. The bigger $r$, the more adjacent filter kernels
overlap in the frequency domain. Setting $r = 0.5$ yields a reversible (non-dissipative and
disjoint) signal decomposition compatible with wavelet theory. This means that during
Gabor decomposition of an image and subsequent recombination nothing is lost and no
redundancy exists.
Summarizing, the filter bank formulation uses the parameter ∆φ, which defines the number of orientation channels. Furthermore, the overlap value r and the value |k_0| define the frequency selectivity of the filter. A typical value for |k_0| is 0.25, which makes the filter selective to lines with a period length of 4 and hence a line thickness of 2 pixels. Figure 2.9 depicts a Gabor filter bank with ∆φ = 45 degrees, r = 0.5, and |k_0| = 0.25.
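To make this parameterization concrete, the following Python sketch builds one even/odd kernel pair from Equ. (2.13), (2.14), (2.15), and (2.17). It is an illustration under assumptions, not the thesis implementation: the function name and the truncation rule for the kernel support are ours, and γ_Gabor is passed in directly (e.g., the biological mean of 0.6 from [Jones et al., 1987]) instead of being evaluated via Equ. (2.16).

import numpy as np

def gabor_pair(m=0, M=4, k0_abs=0.25, r=0.5, gamma=0.6):
    """Sketch of the even/odd Gabor kernel pair of Equ. (2.13)."""
    phi = m * 2.0 * np.pi / M                     # orientation, Equ. (2.14)
    a = 3.0 * np.sqrt(-2.0 * np.log(r)) / k0_abs  # envelope size, Equ. (2.17)
    b = a / gamma                                 # gamma = a/b, Equ. (2.15)
    # A = R P R^T: rotated, anisotropic Gaussian envelope
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    A = R @ np.diag([a**-2, b**-2]) @ R.T
    k0 = k0_abs * np.array([np.cos(phi), np.sin(phi)])  # wave number vector
    half = int(np.ceil(2.5 * b))                  # truncate the envelope (our rule)
    u, v = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    x = np.stack([u, v], axis=-1).astype(float)
    g = (np.exp(-0.5 * np.einsum('...i,ij,...j->...', x, A, x))
         * np.exp(1j * (x @ k0)) / (2.0 * np.pi * a * b))
    return g.real, g.imag                         # even: lines, odd: edges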
As stated before, Gabor functions model the receptive field characteristics in early layers of the vision system of mammals. Gabor filters are well suited for the local detection of image frequencies since they optimally fulfill the trade-off between a good resolution in the frequency and image domain, which represents the “uncertainty principle of quantum mechanics” for image processing. In other words, the time-bandwidth product reaches its lower bound for Gabor functions, as was shown in [Gabor, 1946], which is equivalent to the fact that Gabor filter pairs are optimally localized both in the image and frequency domain. For a very descriptive mathematical formulation of these facts see [Trapp, 1998]. A good localization in the image domain allows small filter kernels and hence minimizes the calculation time for filter operations. Additionally, a good localization in the frequency domain allows an efficient use of Gaussian pyramids as well as sparseness in the orientation-selective feature maps. An optimized localization in the image domain alone leads to sparse lines in the Gabor filter response (no repeating patterns), but lines of adjacent orientations are amplified as well. An optimized localization in the frequency domain leads to good selectivity regarding the orientation of lines, but the filter response in the image domain shows no isolated line at a certain location, only an unlocalized pattern of lines of the specific orientation.
Summarizing qualitatively, Gabor filters are selective to oriented lines (i.e., contours) when filtering an image with the even part of the Gabor kernel, and selective to oriented edges (i.e., steps) when using the odd part of the Gabor kernel. Additionally, Gabor filters are scale selective (i.e., selective to a certain thickness of lines respectively sharpness of edges).
Figure 2.9: (a) Test image containing all frequencies in all orientations, (b)-(f) Even Gabor
(without on-off/off-on separation), orientation 0 degree on 5 scales, on top: filter response to
test image, bottom left: filter kernel in image domain, bottom right: filter kernel in frequency
domain, (g)-(l) Even Gabor orientation 45 degree, (m)-(q) Even Gabor orientation 90 degree,
(r)-(v) Even Gabor orientation 135 degree.
Conceptional Extensions
Besides the well-known concept of decomposing the Gabor filter response into its odd and even part (i.e., computing the real and imaginary part of the filter response), an additional decomposition is done here, which is motivated by the DoG filter decomposition. More specifically, we transfer the DoG on-off center concept to the Gabor filter and separate the odd and even Gabor responses into their positive and negative parts (please refer to Fig. 2.5c-f for a visualization). The proposed decomposition increases the performance of the ADAS attention system. For example, an on-off versus off-on even Gabor separation allows for the efficient separation of specifically oriented white lane markings from shadows on the road. Also, as shown in the following subsection, an on-off/off-on separation for the odd Gabor allows for the crisp suppression of the sky edge present in most scenes in the car domain.
In sum, 4 different Gabor-based feature types are derived from one filtering step (even
on-off, odd on-off, even off-on, odd off-on). Each of these 4 Gabor feature types consists of
20 independently weighable sub-feature maps (4 orientations on 5 scales each). Hence, the
Gabor filter bank in the used parameterization allows for 80 independent filter responses
and hence features.
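A minimal sketch of this separation (function name ours): the positive and negative parts of a filter response are obtained by simple rectification.

import numpy as np

def split_on_off(response):
    """Separate a filter response into its on-off (positive) and
    off-on (negative) part, analogous to the DoG center decomposition."""
    return np.maximum(response, 0.0), np.maximum(-response, 0.0)

# applied to the even and odd part of a complex Gabor response `resp`:
# even_on, even_off = split_on_off(resp.real)
# odd_on,  odd_off  = split_on_off(resp.imag)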
Discussion
As discussed in the last subsection, a decomposition of the even and odd Gabor features into their on-off and off-on components is proposed in order to increase the feature selectivity. In the car domain, the search performance is strongly influenced by the horizon edge, which is present in most images of highways and country roads. In the following, this serves as an exemplary problem for showing the importance of a high feature selectivity. Typically, the horizon edge is removed by mapping out the sky in the input image, which might not be biologically plausible and is error-prone. Instead, we suppress the horizon edge directly in the attention by weighting the sub-feature maps (the required weighting procedure is described in Chapter 3), based on the high selectivity of the attention features. The gain of this approach is depicted in Fig. 2.10c, which shows the diminished influence of the horizon edge on the saliency map of the real-world example in Fig. 2.10a. For a quantitative evaluation of the performance gain based on attentional sky suppression, refer to Sect. 3.5.
For a further qualitative assessment of the gain of on-off/off-on separation, see Fig. 2.11a-d. Here, a successful top-down search for the black vehicle is only possible using the full separability of the input feature space.
When comparing the properties of Gabor and DoG filters, one notes that a DoG filter response of a specific scale is equivalent to a combination of the 4 Gabor filter responses for the 4 orientations of the same scale. Still, using both features in an attention system (instead of Gabor alone) is reasonable, because the discrete nature of image filtering leads to a certain loss in selectivity for combined Gabor feature maps. Supporting the DoG feature in the attention system makes up for these losses.
However, it is important to note that DoG and Gabor filters are not independent, which
will play an important role in the normalization process described in Chapter 3 (see page 61). Normalization will be done to assure the comparability of all attention features, which is commonly neglected in comparable systems but is important for a robust vision system.
Figure 2.10: Evaluation of selectivity, (a) Input image, (b) Original bottom-up attention without sky suppression, (c) Modified bottom-up attention with attentional sky suppression (top-down influence), using suppressive odd Gabor filter kernels in low scales, (d) Bottom-up attention with traditional sky suppression.
2.1.3 RGBY Color Space
Numerous color spaces are known in image processing (e.g., RGB, HSV, XYZ, Lab). The
implemented biologically motivated RGBY color space shows several important advantages.
The color space is introduced and assessed in the following.
Biological Motivation
The so-called human search asymmetries, which are measured in psychophysical studies (e.g., an inclined line among vertical lines is detected more easily than a vertical line among inclined lines), were conceptualized by [Treisman and Gormican, 1988]. The authors propose a theory which states that it is easier to detect feature deviations of a non-canonical feature among canonical features than the other way around. The term canonical feature is related to the term basic feature (refer to the introduction of Chapter 2 for details on basic features).
(a) (b)
(c) (d)
Figure 2.11: Evaluation of the gain of on-off/off-on decomposition and usage of DoG as feature, (a) Shady input image with top-down search target black car, (b) Bottom-up attention, (c) Top-down attention without on-off feature separation (overall attention is negative, maximum is not on the car, search is not successful), (d) Top-down attention with on-off feature separation, only positive values are displayed, search is successful.
In short, basic features define specific feature types that guide attention
(e.g., lines, color), whereas a canonical feature defines characteristics of a certain feature
parameterization within a basic feature type (e.g., a subset of certain orientations leads to
a line feature that is canonical). Hence, canonical features can be understood as feature
parameterizations that mimic the way neuron populations are tuned in the human vision
system (e.g., lines of 0 degree and 45 degree orientation are canonical). Feature deviations
are represented by a combination of canonical features (e.g., a 10 degree line is represented
by a combination of 0 degree and 45 degree neurons). By finding search asymmetries,
such canonical features can be located. The psychophysical tests described in [Treisman,
1993] revealed typical search asymmetries for colors, which suggest that a canonical fea-
ture parameterization for colors should differentiate between red, green, blue, and yellow.
Furthermore, the human morphology in the early visual processing on the retina supports
this notion. More specifically, cells exist on the retina that are tuned to red-green and
blue-yellow contrasts (so-called color opponents). Both facts give a biological motivation for preferring RGBY colors to any of the earlier mentioned color spaces.
Computation of RGBY Colors
A very basic and thereby computationally efficient approach to compute RGBY colors was
proposed by [Itti et al., 1998] (the so-called color opponent approach from the Neuromor-
phic Vision Toolkit of Itti):
R = R - \frac{G + B}{2}    (2.18)

G = G - \frac{R + B}{2}    (2.19)

B = B - \frac{R + G}{2}    (2.20)

Y = \frac{R + G}{2} - \frac{|R - G|}{2} - B.    (2.21)
Drawbacks of this approach are the missing white balance and the missing uniformity of the resulting color maps. A color space is uniform if the distance between adjacent colors is equal over the whole color space (as related to the human color perception, i.e., the human ability to distinguish between colors). Uniform color spaces hence numerically represent colors very similarly to the human color perception.
A more complex approach for computing RGBY colors is described in the following.
The basic idea is based on the work of [Frintrop, 2006]. This more complex approach
has a number of important advantages, namely its incorporated white balance and its uniformity. Since the computational demands are moderate, the said RGBY approach is included in the feature space of our attention system.
RGBY colors are based on the Lab color space. The Lab color space, like the Luv color space, was defined in 1976 by the CIE (Commission Internationale de l'Éclairage) as a more accurate model of the human color perception. Both color spaces are uniform. The well-known HSV (Hue, Saturation, Value) and XYZ (X and Z contain color information, Y luminance information) are examples of non-uniform color spaces. In the uniform Lab color space, “L” holds luminance information, “a” represents the red-green contrast, and “b” the blue-yellow contrast. Since the L-channel is independent of the color information, Lab shows a certain extent of invariance against changes in illumination (similar to the HSV color space). A basic illumination invariance is very important for the proposed attention system. At the end of this subsection, the illumination invariance properties of the RGB and RGBY color spaces will be compared for a color-based detection of signal boards in twilight. The Lab color space is computed based on the following equations:
L = 116\left(\frac{Y}{Y_n}\right)^{1/3} - 16    (2.22)

a = 500\left[\left(\frac{X}{X_n}\right)^{1/3} - \left(\frac{Y}{Y_n}\right)^{1/3}\right]    (2.23)

b = 200\left[\left(\frac{Y}{Y_n}\right)^{1/3} - \left(\frac{Z}{Z_n}\right)^{1/3}\right].    (2.24)
24
2 Feature Space
With default values for a full-spectrum light source: X_n = 242.4, Y_n = 255.0, Z_n = 277.7. It depends on the XYZ color space:

\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} 0.490 & 0.310 & 0.200 \\ 0.177 & 0.812 & 0.011 \\ 0.000 & 0.010 & 0.990 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}.    (2.25)
The XYZ color space incorporates a white balance mechanism based on the reference
values Xn, Yn, and Zn, which are the XYZ values of a white reference patch in the image
[Forsyth and Ponce, 2003]. A white balance is necessary in order to adapt the perceived colors to the spectrum of the current light source. More specifically, the sensed colors have to be shifted and biased depending on this spectrum.
More qualitatively, this mechanism assures that the human can recognize a bright yellow
cab in full spectrum noon light as well as in reddish evening light.
The CIE XYZ color space was initially developed in order to better match the characteristics of monitors based on RGB color information. The elements of the XYZ transformation matrix have to be selected depending on the monitor type. In the system proposed here, the transformation matrix proposed in [Jaehne, 2005] is used.
To compute RGBY, [Frintrop, 2006] proposes the Euclidean distance between the Lab color pixels of the image and the four Lab reference colors (a_ref,R = 0 and b_ref,R = 127 for red, a_ref,G = 127 and b_ref,G = 127 for green, a_ref,Y = 127 and b_ref,Y = 0 for yellow, and a_ref,B = 127 and b_ref,B = 127 for blue). Exemplarily, Equ. (2.26) to (2.28) show the computation of the R color map of the RGBY space, which has to be applied pixel-wise over the whole image:

P_{ref,R} = (a_{ref,R},\, b_{ref,R}) = (0, 127)    (2.26)

R_{final} = \mathrm{dist}(P_{Lab} - P_{ref,R}) = \|(a, b) - (a_{ref,R},\, b_{ref,R})\|    (2.27)

= \sqrt{(a - a_{ref,R})^2 + (b - b_{ref,R})^2}.    (2.28)
Based on this, 4 color maps will result that contain only non-negative values (see
[Frintrop, 2006]). For a numerical representation of the interdependencies between RGBY
and RGB see Tab. 2.2.
However, there is a drawback in the approach of [Frintrop, 2006], which makes the resulting color maps inappropriate for use in the attention system proposed here. The problem is that the color maps computed this way are not independent. More specifically, the R map equals the inverted G map and the B map equals the inverted Y map, which means that only 2 independent color maps exist. Thereby, selectivity is lost. Furthermore, the R map, for example, holds zeros at image positions of pure green, whereas for an attention system a zero value in the red and green color maps should define the intermediate value between red and green.
Table 2.2: Numerical interdependencies between RGB and RGBY.

Reference color | RGB (red, green, blue) | RGBY without normalization (red, green, blue, yellow)
RGB red         | 255, 0, 0              | 217.7, 81.9, 100.0, 210.0
RGB green       | 0, 255, 0              | 92.7, 228.8, 96.7, 227.2
RGB blue        | 0, 0, 255              | 232.7, 118.0, 247.9, 81.5
RGB yellow      | 255, 255, 0            | 141.6, 176.1, 39.0, 222.5
Therefore, in the following section a rescaling is proposed, which leads to four independent RGBY color maps. Furthermore, it will be shown that the on-off/off-on separation for RGBY colors is system-immanent and therefore already included. Additionally, so-called double color opponent maps are proposed that are selective to color contrasts, and their importance for the attention system is shown by example.
Conceptional Extensions
In order to allow a more suitable decomposition of the color maps, the approach of Equ. (2.28) is adapted, leading to the following equations:

R_{tmp} = \sqrt{(a - a_{ref,R})^2 + (b - b_{ref,R})^2}/R_{max} - 0.536    (2.29)

G_{tmp} = \sqrt{(a - a_{ref,G})^2 + (b - b_{ref,G})^2}/G_{max} - 0.555    (2.30)

B_{tmp} = \sqrt{(a - a_{ref,B})^2 + (b - b_{ref,B})^2}/B_{max} - 0.512    (2.31)

Y_{tmp} = \sqrt{(a - a_{ref,Y})^2 + (b - b_{ref,Y})^2}/Y_{max} - 0.559    (2.32)

With: R_{max} = 232.7, G_{max} = 228.8, B_{max} = 247.9, Y_{max} = 227.2

R_{final} = \begin{cases} 2R_{tmp} & \forall\, R_{tmp} > 0 \\ 0 & \forall\, R_{tmp} \le 0 \end{cases}    (2.33)

G_{final} = \begin{cases} 2G_{tmp} & \forall\, G_{tmp} > 0 \\ 0 & \forall\, G_{tmp} \le 0 \end{cases}    (2.34)

B_{final} = \begin{cases} 2B_{tmp} & \forall\, B_{tmp} > 0 \\ 0 & \forall\, B_{tmp} \le 0 \end{cases}    (2.35)

Y_{final} = \begin{cases} 2Y_{tmp} & \forall\, Y_{tmp} > 0 \\ 0 & \forall\, Y_{tmp} \le 0. \end{cases}    (2.36)
As the equations show, a normalization is done prior to the decomposition into four independent color channels. The resulting RGBY color maps do not contain redundancies after the proposed decomposition procedure.
As a result, four normalized independent RGBY color maps are obtained. An additional decomposition of all these maps into on-off and off-on components, which would double the number of color feature maps, is not possible. This is the case since the decomposition of the RG and BY maps into four independent color maps is already equivalent to an on-off/off-on decomposition. Furthermore, image pyramids for all four color maps are built (see Annex A.1) in order to allow a separation between colored blobs of different sizes. The computed four independent color pyramids can be used for an attention-based, task-driven search (e.g., a search for red objects of a certain size). The concept of task-driven attention (so-called top-down attention) is described in detail in Chapter 3. A sketch of the complete color map computation is given below.
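The following Python sketch traces the chain of Equ. (2.25), (2.23), (2.24), and (2.29) to (2.36). It is an illustration only: the reference colors used here are placed on the opponent axes of the Lab space for readability and are an assumption that deviates from the exact values listed above; function and constant names are ours.

import numpy as np

# RGB -> XYZ matrix of Equ. (2.25) and the white reference of the text
M_XYZ = np.array([[0.490, 0.310, 0.200],
                  [0.177, 0.812, 0.011],
                  [0.000, 0.010, 0.990]])
XN, YN, ZN = 242.4, 255.0, 277.7

# Illustrative (a_ref, b_ref) pairs on the Lab opponent axes -- an
# assumption for this sketch, not the exact reference values of the text.
REF = {'R': (127.0, 0.0), 'G': (-127.0, 0.0),
       'Y': (0.0, 127.0), 'B': (0.0, -127.0)}
MAX = {'R': 232.7, 'G': 228.8, 'B': 247.9, 'Y': 227.2}   # Equ. (2.29)-(2.32)
BIAS = {'R': 0.536, 'G': 0.555, 'B': 0.512, 'Y': 0.559}

def rgby_maps(rgb):
    """rgb: float image of shape (h, w, 3) with values in [0, 255].
    Returns the four independent RGBY maps of Equ. (2.29)-(2.36)."""
    xyz = rgb @ M_XYZ.T
    fy = np.cbrt(xyz[..., 1] / YN)
    a = 500.0 * (np.cbrt(xyz[..., 0] / XN) - fy)   # Equ. (2.23)
    b = 200.0 * (fy - np.cbrt(xyz[..., 2] / ZN))   # Equ. (2.24)
    maps = {}
    for c in 'RGBY':
        a_ref, b_ref = REF[c]
        tmp = np.hypot(a - a_ref, b - b_ref) / MAX[c] - BIAS[c]
        maps[c] = np.where(tmp > 0.0, 2.0 * tmp, 0.0)  # Equ. (2.33)-(2.36)
    return maps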
However, colors themselves are not a cue for contrast-driven attention (so-called bottom-up attention). For example, a monotonous red image should not lead to contrast-driven attention. How such a contrast-driven (bottom-up) attention is computed will be described in Chapter 3. A color-based contrast-driven attention should be guided to image positions showing color contrasts (e.g., a green blob on a red background). The color feature type required for such operations is termed double color opponency in the literature. Said double color opponent maps are obtained by filtering all four RGBY color maps with a Difference of Gaussians kernel based on image pyramids (refer to Annex A.1 for details on filtering with image pyramids). As a result, we obtain five double color opponent maps for each color map, also allowing a differentiation between red blobs on a green background and vice versa. The obtained color contrast maps can now be used to detect targets based on their color contrast (e.g., traffic signs of typically bright colors in a typically less colorful traffic environment). In addition to bottom-up searches, such color contrasts can be used for the top-down search for objects with high color contrasts.
In sum, the color feature space consists of four RGBY color pyramids (i.e., 20 color feature maps) and 20 double color opponent maps. Top-down search hence has 40 color feature maps at its disposal, bottom-up search 20 color maps.
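On top of the RGBY maps, a double color opponent map can be sketched as a DoG-filtered color map; the function name and the center-surround size ratio below are assumptions for illustration.

from scipy.ndimage import gaussian_filter

def double_opponent(color_map, sigma_center=1.0, ratio=1.6):
    """Sketch: double color opponency as a DoG on one RGBY map.
    Positive values mark blobs of this color on an opposing surround."""
    center = gaussian_filter(color_map, sigma_center)
    surround = gaussian_filter(color_map, ratio * sigma_center)
    return center - surround

# applied on every level of each RGBY pyramid -> 4 x 5 = 20 opponent maps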
Discussion
In the following, the performance of the RGBY color space is assessed and compared qualitatively to RGB. For that, a complex traffic scene in twilight is used (see Fig. 2.12a), in which the present signal board should be found based on color information. As can be expected, the absolute R-channel of RGB shown in Fig. 2.12b is not sufficient for a successful color-based detection of the signal board. However, even when normalizing the RGB R-channel by the overall sum of all three color channels, the signal board cannot be detected (see Fig. 2.12c). The reason is that the low-light situation changes the color of the signal board despite the automatic white balance of the camera. When using the proposed non-linear RGBY colors, a separation on the RGBY R-channel is possible despite the challenging lighting conditions (see Fig. 2.12d). An attention-based search for such signal boards can hence rely on the highly discriminating R-channel, which boosts the performance of the attention system compared to the usage of RGB colors. Extensive testing with HSV colors has also shown inferior performance compared to the proposed RGBY color space.
Figure 2.12: (a) Input image of road scene in twilight (signal board marked in red), (b) Absolute R-channel of RGB-space, (c) Relative R-channel of RGB-space (i.e., R-channel normalized to the sum of RGB-channels, with thresholding), (d) R-channel of RGBY space (with thresholding).
2.2 Depth Features
Accurate depth information is of vital importance for a driver assistance system. Typical commercial applications for assisting the driver use Radar or Lidar data. Such sensors deliver accurate but sparse depth information of the scene. So far, only a few commercial driver assistance systems use vision, despite the fact that its information density is comparatively high. During the projection of the 3D world onto the 2D image chip, one dimension, the depth information, is lost. Recovering the depth cannot be done with 100% certainty, i.e., 2D images are ambiguous in terms of depth. To solve this challenge, several depth cues are fused. After a biological motivation, the implemented depth cues are described.
2.2.1 Biological Motivation
Following [Palmer, 1999], several stereo-related cues and at least nine monocular depth cues exist that allow the human to reliably perceive the depth in the environment. In the following, some of these monocular cues are listed and described shortly:

- Depth from object knowledge (known object size in the world as reference for the measured object size on the camera plane)
- Depth from ground plane assumption (assuming a flat world, the vertical image position is proportional to the object depth)
- Depth from blur (optimizing the edge sharpness by changing the focal length of the camera)
- Depth from Time to Contact (infer the time that remains until collision from the growth of the perceived object size)
- Depth from relative size (several objects of the same type in different distances)
- Depth from shading (positioning of shades relative to objects)
- Depth from texture gradient (depth-dependent image frequencies on homogeneously textured surfaces)
- Depth from aerial/atmospheric perspective (blue bias on objects that are far away)

The following subsections describe the five monocular and binocular (stereoscopic) depth cues our ADAS is based on in order to perform its various vision tasks.
2.2.2 Depth from Stereo Disparity
The perception of stereoscopic depth is based on the interpretation of the differences between the projected images of both eyes (the so-called parallax). An isolated point in the 3D world is projected to slightly different positions on the retina of both eyes, since these have a horizontal distance, the so-called basic distance. The horizontal shift between the images is called lateral disparity, see [Mallot, 2002]. In addition to the lateral disparity, other flavors of disparity exist (see [Mallot, 2002]) that can also cause an impression of depth; still, the lateral disparity seems to be the most important disparity-related depth cue and is therefore also the focus of the following reflections. For detecting lateral disparity (for simplification called disparity in the following), the detection of correspondences between the left and right eye is necessary. Here, ambiguities are possible due to differences in illumination and partial occlusion between both images. Especially local regions of low texture can lead to the well-known aperture problem, which is also a challenge for the optical flow computation (refer to [Willert et al., 2006]). Furthermore, differences and changes in the internal optical parameters of both eyes exist that influence the projections and hence the detected lateral disparity. Still, the human vision system can cope with these challenges by continuous adaptation mechanisms. How these challenges are solved by the human vision system is largely unknown. Designing a technical stereo system that closely mimics the processing steps in the brain is therefore not possible up to now. The engineered approaches show sound results, but also have their limitations.
Figure 2.13 depicts the individual processing steps, which are needed for computing
dense 3D world coordinates from stereo based on an engineering-driven approach.
Figure 2.13: Processing steps for computing dense 3D world coordinates from stereo.
After capturing pairs of images, the camera lens distortion is corrected for both cameras
independently. The undistortion step is essential in order to make the mapping of the
3D world to the 2D image plane comparable for both cameras, which is a prerequisite
when computing the stereo disparity in the following step. Based on the captured stereo
images, the undistorted vertical and horizontal pixel coordinates v and u are computed from the initial (distorted) coordinates v_d and u_d:

u = (1 + k_1\beta^2 + k_2\beta^4)\,u_d + 2k_3 u_d v_d + k_4(\beta^2 + 2u_d^2)    (2.37)

v = (1 + k_1\beta^2 + k_2\beta^4)\,v_d + k_3(\beta^2 + 2v_d^2) + 2k_4 u_d v_d    (2.38)

with \beta = \sqrt{u_d^2 + v_d^2}.
The undistortion is based on a lens distortion model (described in [Heikkila and Silven, 1997]) that uses radial (k_1 and k_2) and tangential (k_3 and k_4) distortion coefficients. For both cameras, these coefficients are determined offline using captured images of a checkerboard pattern, based on the camera calibration toolbox [J.Y.Bouguet, 2007] that is available on the Internet.
Furthermore, the cameras are oriented differently in the world (i.e., the camera angles
θX , θY , and θZ are different for both cameras). In order to allow an efficient search for
correspondences between the two camera images, these angles need to be compensated
(i.e., the optical axes of both cameras need to be parallel). In theory, this could be done
physically by adapting the camera position. However, this is not possible with the needed accuracy. The usual approach is to virtually adapt the camera angles by shifting and
remapping the image pixels of both cameras, which is called rectification. Typically, a
linear rectification is realized, which means both camera images are rescaled, rotated and
shifted in horizontal and vertical direction in order to compensate the differences in the
camera angles. The rectification is done using the commercial “Small Vision System”
software [Konolige, 1997].
For the rectification process the camera angles of both cameras are required. These can
be computed based on the following Equations that describe the 3D world to 2D image
mapping:
u = -f_u \frac{r_{11}(X - t_1) + r_{12}(Y - t_2) + r_{13}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + u_0    (2.39)

v = -f_v \frac{r_{21}(X - t_1) + r_{22}(Y - t_2) + r_{23}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + v_0.    (2.40)
Equation (2.39) and (2.40) use the 3 camera angles θX , θY , and θZ , the 3 translational
camera offsets t1, t2, t3 (see Fig. 2.16b), the horizontal and vertical principal point u0
and v0 as well as the horizontal and vertical focal lengths fu and fv (focal lengths that
are normalized to the horizontal and vertical pixel size respectively). In sum 12 unknown
variables exist (the elements of the rotation matrix Equ. (2.41) as well as the position of
the camera in the world t1, t2, and t3).
R = R_X R_Y R_Z = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix}    (2.41)
For determining these 12 variables the calibration scene shown in Fig. 2.14 is used, for
which the 3D world position of the marked points was measured manually with a laser
device and stored.
Based on internal interdependencies (orthogonality equations of the rotation matrix,
see Equ. (2.42)) and the correspondences between the stored 3D world position and the
measured image position for 3 points, all 12 parameters and hence the camera angles θX ,
θY , and θZ can be determined.
r_{11}^2 + r_{12}^2 + r_{13}^2 - 1 = 0
r_{21}^2 + r_{22}^2 + r_{23}^2 - 1 = 0
r_{31}^2 + r_{32}^2 + r_{33}^2 - 1 = 0    (2.42)
r_{11}r_{21} + r_{12}r_{22} + r_{13}r_{23} = 0
r_{11}r_{31} + r_{12}r_{32} + r_{13}r_{33} = 0
r_{21}r_{31} + r_{22}r_{32} + r_{23}r_{33} = 0
After repeating the described procedure for the second camera, the image rectification
can be done. After the rectification, an efficient search for correspondences between the
left and right image can be done. More specifically, for each pixel and its neighborhood in
Figure 2.14: Calibration scene with measured 3D world calibration points: (a) Left image
(calibration points marked), (b) Right image.
one of the camera images the best match in the other camera image is determined using
a correspondence search with a probabilistic matching algorithm (refer to [Willert et al.,
2006]). Since both images are undistorted and rectified the correspondence search between
the images can be restricted to horizontal shifts, which makes the procedure very efficient.
The result of the correspondence search is a dense disparity map D(u, v), which contains
a measured horizontal shift for all image positions.
Based on the disparity image the 3D world position for all image pixels can be computed
using:
Z_{stereo}(u, v) = \frac{f_u B}{D(u, v)} + t_3    (2.43)

Y_{stereo}(u, v) = \frac{Z(v - v_0)}{f_v} + t_2    (2.44)

X_{stereo}(u, v) = \frac{Z(u - u_0)}{f_u} + t_1.    (2.45)

With: B ... basic distance between the left and right camera's principal points
f_u, f_v ... normalized focal lengths [in pixels]
D(u, v) ... disparity
u_0, v_0 ... principal point
t_1, t_2, t_3 ... translational camera offset.
The equations are derived by transforming Equ. (2.39) and (2.40), setting all camera
angles to zero, since the disparity computation was done on rectified images.
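As an illustration, the following sketch (function and parameter names ours) evaluates Equ. (2.43) to (2.45) on a dense disparity map:

import numpy as np

def stereo_to_world(D, fu, fv, B, u0, v0, t=(0.0, 0.0, 0.0)):
    """Dense 3D coordinates from a disparity map D(u, v) following
    Equ. (2.43)-(2.45); t = (t1, t2, t3) is the camera offset."""
    D = np.asarray(D, dtype=float)
    h, w = D.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    with np.errstate(divide='ignore'):
        Z = fu * B / D + t[2]          # Equ. (2.43); D == 0 yields inf
    Y = Z * (v - v0) / fv + t[1]       # Equ. (2.44)
    X = Z * (u - u0) / fu + t[0]       # Equ. (2.45)
    return X, Y, Z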
In the last step, the stereo maps are unrectified (i.e., the prior rectification is neutralized)
to make them comparable to the input image on which all other processing steps are
running. This is realized by remapping the pixel values of the rectified stereo maps based
on Equ. (2.39) and (2.40), which results in unrectified stereo maps.
Figure 2.15 depicts a typical example for the resulting unrectified stereo maps in an
inner-city scenario.
Figure 2.15: Dense 3D world positions for all image pixels based on stereo from a probabilistic
matching approach [Willert et al., 2006].
Conceptional Extensions
Analyzing Equ. (2.43) to (2.45) and Fig. 2.15, it can be seen that the stereo maps are dense (i.e., for all image pixels a 3D world position is computed). However, at image positions near the car the computed values are not sufficiently accurate. When using a threshold on the stereo confidence map that was calculated during the disparity computation, these pixels can be identified. Furthermore, the thereby identified pixels can be corrected using an inter-modality depth cue fusion (see Sect. 4.1.2 for details) with the depth cues described in the following.
2.2.3 Depth from Object Knowledge
Depth from object knowledge calculates the distance of an object Z_obj based on knowledge about the area the object covers on the image plane (width W_im and height H_im in pixels), the width and height of the object in the world drawn from experience (W_world and H_world) as well as the intrinsic parameters of the sensor (f_u = f/t_u and f_v = f/t_v, with the focal length f and the horizontal and vertical pixel sizes t_u and t_v):

Z_{obj,W} \approx \frac{W_{world} f_u}{W_{im}} \quad \text{and} \quad Z_{obj,H} \approx \frac{H_{world} f_v}{H_{im}}.    (2.46)
A prerequisite for depth from object knowledge is a reliable segmentation algorithm. Currently, we use a histogram-based segmentation on an image region that is pre-segmented by a region growing algorithm (see [Jaehne, 2005]) running on the saliency map (see Fig. 2.19c on page 43 for a visualization of the gathered segmentation results).
In the following, Equ. (2.46) is derived. Without loss of generality, we simplify
Equ. (2.39) and Equ. (2.40) with θY = 0 and θZ = 0, which do not influence the ob-
ject distance, but would make the following steps more cumbersome. Furthermore, we can
set t3 = 0, since the Z coordinate of the center point of our coordinate system equals the
principal point of both cameras (see Fig. 2.16b). Using Equ. (2.39), two bordering points
of an object with the same height Y (i.e., having the same vertical pixel value v) and depth
Zobj,W have the following width Wim in pixels:
W_{im} = u_1 - u_2 = \frac{(X_1 - t_1)f_u + (Y - t_2)u_0\sin\theta_X + Z_{obj,W}\, u_0\cos\theta_X}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X} - \frac{(X_2 - t_1)f_u + (Y - t_2)u_0\sin\theta_X + Z_{obj,W}\, u_0\cos\theta_X}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X}.    (2.47)
This can be reformulated to:

W_{im} = \frac{(X_1 - X_2)f_u}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X} = \frac{W_{world}\, f_u}{(Y - t_2)\sin\theta_X + Z_{obj,W}\cos\theta_X}.    (2.48)
In Equ. (2.48), (Y − t_2) sin θ_X is small, because θ_X is between −5° and 5°. The term represents the part of the distance that is induced by the shift of the object in the height direction Y. Since traffic-relevant objects are typically near a defined road plane, Y values between 0 and −10 m can be expected (note that due to the right-hand rule the Y axis is defined negatively). The induced error is hence small and stays below the uncertainty induced by the segmentation done for determining W_im. Additionally, cos θ_X is close to 1. Therefore, we can simplify Equ. (2.48) to:

W_{im} \approx \frac{W_{world}\, f_u}{Z_{obj,W}}.    (2.49)
Transposed to Z_{obj,W}, the distance of an object can be computed, given the width in the world W_world, the width in pixels W_im, the size of the pixels on the image chip t_u, as well as the focal length f:

Z_{obj,W} \approx \frac{W_{world}\, f_u}{W_{im}} = \frac{W_{world}\, f}{W_{im}\, t_u}.    (2.50)
Similarly, to compute the depth Z_{obj,H} based on the known object height in the 3D world H_world, the following equation can be found:

H_{im} = v_1(Y_1 = 0) - v_2(Y_2 = H_{world}) = \frac{Z_{obj,H}\, f_v H_{world}}{(h\sin\theta_X + Z_{obj,H}\cos\theta_X)((h - 1)\sin\theta_X + Z_{obj,H}\cos\theta_X)}.    (2.51)
Transposed to Z_{obj,H}, the following equation can be inferred:

Z_{obj,H} = -\frac{p}{2} + \sqrt{\left(\frac{p}{2}\right)^2 - q}    (2.52)

With:

p = \frac{2h\sin\theta_X\cos\theta_X H_{im} - \cos\theta_X\sin\theta_X H_{im} - f_v H_{world}}{H_{im}\cos^2\theta_X}

q = \frac{(h^2 - h)\sin^2\theta_X}{\cos^2\theta_X}.

Since sin θ_X is small and cos θ_X close to 1, we get:

p \approx -\frac{f_v H_{world}}{H_{im}} \quad \text{and} \quad q \approx 0.    (2.53)
Finally, we get Equ. (2.54) and can thereby compute the distance Z_{obj,H}, given the object height in the 3D world H_world, the height of the object in the image H_im, the focal length f, and the height of the pixels on the image chip t_v:

Z_{obj,H} \approx \frac{H_{world}\, f_v}{H_{im}} = \frac{H_{world}\, f}{H_{im}\, t_v}.    (2.54)
Again, the error induced by this simplification is small and thereby stays below the uncertainty of the segmented object height in the image H_im.
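In code, Equ. (2.50) reduces to a one-liner; the numbers in the usage example are illustrative assumptions, not calibration values of the test vehicle:

def depth_from_size(w_im_px, w_world_m, fu_px):
    """Equ. (2.50): object distance from its known world width, its
    measured pixel width, and the normalized focal length f_u."""
    return w_world_m * fu_px / w_im_px

# e.g., a car of ~1.8 m width imaged 60 px wide with an assumed fu = 800 px:
# depth_from_size(60, 1.8, 800) -> 24.0 m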
2.2.4 Depth from Bird’s Eye View
For computing the distance of objects that are positioned on the drivable path, the bird's eye view is used. The bird's eye view is a metric representation of the scene as viewed from above (see Fig. 2.16a). The cue is able to detect and estimate the distance of objects present on the ego vehicle's lane and the neighboring lanes (as opposed to the perspective image). Working on this representation for estimating object distances has the advantage that the cumbersome non-linear projection from 3D world coordinates to the 2D image plane (see Equ. (2.39) and (2.40)) is compensated. As such, world position coordinates can be assigned directly to a detected object without further processing. Furthermore, by this transformation, the detection of lanes and objects can be realized more easily than on the projected camera image, since expectations regarding typical metric lane widths can be integrated easily into the algorithm. The bird's eye view is calculated on the undistorted pixels v and u based on Equ. (2.39) and (2.40) by inverse perspective mapping of the 3D world points X, Y, and Z (see Fig. 2.16b for a visualization of the used coordinate system) to the 2D (u,v) image plane. The equations describe how to map a 3D position of the world to the 2D image plane (refer to [Broggi, 1995]). More specifically, only those image pixels (u,v) are mapped that are required to make the metric bird's eye view (i.e., the XZ-plane) dense. The approach thereby also leads to low computational demands.
Figure 2.16: (a) Visualization of the bird’s eye view, (b) Coordinate system and position of
the camera.
The usage of inverse perspective mapping makes the inversion of Equ. (2.39) and (2.40) obsolete when computing the bird's eye view.
As can be seen in Equ. (2.39) and (2.40) the 3D world position coordinates X, Y , and
Z of all image pixels (u,v) are required. By using a monocular system, one dimension (the
depth Z) is lost. A solution to this dilemma is the so-called flat plane assumption. Here,
for all pixels in the image, the height Y is set to 0. Based on this, only objects in the
image with Y = 0 (especially, the street we are interested in) are mapped correctly to the
bird’s eye view, while all the other regions are stretched to infinity in the bird’s eye view
(for example the car in Fig. 2.19d).
Now, a vertical grow algorithm with dynamic thresholds searches for discontinuities in
the bird’s eye view and assigns a distance value to them (see Fig. 2.19d).
In the rectified image (i.e., the image is virtually remapped to be equivalent to an
image with all 3 camera angles zero, see page 31 for details on the image rectification) the
following direct relation between the vertical pixel value v and the depth Zbirds exists:
Z_{birds} = \frac{f_v\, t_2}{v - v_0}    (2.55)

With: t_2 ... camera height above the ground
v_0 ... vertical principal point
v ... vertical pixel position that shows a significant contrast change
f_v ... normalized focal length.
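Equ. (2.55) is equally compact in code; the numbers in the example are assumptions for illustration:

def depth_from_row(v, v0, fv_px, cam_height_m):
    """Equ. (2.55): depth of a point on the flat road from its vertical
    image position in the rectified image (valid only for v > v0)."""
    return fv_px * cam_height_m / (v - v0)

# e.g., assumed fv = 800 px, camera 1.2 m above ground, contrast change
# at v = 140, principal point v0 = 120: depth_from_row(140, 120, 800, 1.2) -> 48.0 m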
Conceptional Extensions
In case the flat plane assumption is not fulfilled (i.e., the street surface is not flat) the
bird’s eye view is inaccurate, which decreases the quality of all algorithms that run on
the bird’s eye view (e.g., depth estimation or temporal road integration, see Sect. 4.2.2).
To allow a stable bird’s eye view even in case of non-flat street surfaces and pitching of
the vehicle, stereo data is used for plane fitting. In order to enhance the robustness of the
correction, only pixels that belong to the currently detected street segment (see Chapter 4)
are used for surface estimation. More specifically, the differences between the orientation
and position of the coordinate axes and the street surface in terms of the pitch ∆θX and
roll angle ∆θZ , as well as the height of the camera over the ground ∆t2 are computed:
Y = Y0 + cZ + dX (2.56)
∆θZ = atan(d) (2.57)
∆θX = atan(c) (2.58)
∆t2 = Y0. (2.59)
This is done based on the 3D positions of all image pixels derived from the stereo disparity (see Fig. 2.15 for the 3D data of a sample image). The flat plane assumption Y = 0 can be replaced by Y = f(X,Z), leading to an extended bird's eye view. In our implementation, a first-order model of the street surface (a linear hyperplane) is used, as shown in Equ. (2.56) (see [Li et al., 2004] for more details). Results have shown that higher-order models for plane fitting lead to inferior performance. The reason for this is the restricted number of 3D measurement points at the borders of the image, because only reliable pixels belonging to the detected street are used for the surface estimation. Since the estimated surface is noisy (stereo data is calculated based on the error-prone correlation between the left and the right camera image), a linear Kalman filter is used on the parameters Y_0, c, and d, which improves the performance considerably. A further possible improvement would be to use a model of the vehicle kinetics (containing damper and spring characteristics and a realistic distribution of the vehicle mass) for the Kalman prediction (as proposed in [Cech et al., 2004]) instead of the linear Kalman prediction model used here.
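A minimal sketch of this surface estimation, fitting Equ. (2.56) by plain least squares and deriving the corrections of Equ. (2.57) to (2.59); the Kalman filtering of Y_0, c, and d is omitted and the function name is ours:

import numpy as np

def fit_road_plane(X, Y, Z):
    """Least-squares fit of the linear road surface model of Equ. (2.56),
    Y = Y0 + c*Z + d*X, on 3D stereo points of the detected road segment."""
    A = np.column_stack([np.ones_like(Z), Z, X])
    (y0, c, d), *_ = np.linalg.lstsq(A, Y, rcond=None)
    d_theta_x = np.arctan(c)   # pitch correction, Equ. (2.58)
    d_theta_z = np.arctan(d)   # roll correction, Equ. (2.57)
    d_t2 = y0                  # camera height correction, Equ. (2.59)
    return d_theta_x, d_theta_z, d_t2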
2.2.5 Depth from Time to Contact
The time to contact (TTC) is the quantity of time it takes an observer to reach an approaching surface in case the observer continues its current relative motion (see [Palmer, 1999]). The TTC is believed to be a basic cue for the behavior generation of simple organisms, e.g., for flies during the landing task [Borst, 1990]. Various concepts for computing the TTC are known.
The potential of the TTC was first researched by [Hoyle, 1957], who found that the optical aperture angle ψ an object takes up on the retina, related to the change of ψ in time, is an approximate measure for the TTC (please also refer to Fig. 2.17):

TTC \approx \frac{\psi}{d\psi/dt}.    (2.60)
It is important to note that the TTC can be derived without information of the object
distance or relative velocity. The concept was extended by [Hoyle, 1957] to the so-called
tau-function, which states that the TTC can be derived from the ratio of any visually
perceived spatial quantity to its derivative.
As an example of the tau function, [Palmer, 1999] gives the distance of an image position to the Focus of Expansion (i.e., the image point in which all longitudinal lines meet) related to its derivative. The tau function is also applicable for an observer that passes an approaching object without colliding with it (refer to [Kaiser and Mowafy, 1993]).

Figure 2.17: Object aperture angle ψ.

Also binocular information (stereo disparity) can be used to derive the TTC (refer to [Regan, 2002] and
[Harris, 2004]). In Equ. (2.61) the TTC is given depending on the object distance and
disparity (with B as the distance of the stereo cameras, Z the object distance, and D(u, v)
the stereo disparity):
TTC \approx \frac{B}{Z\,[dD(u, v)/dt]}.    (2.61)
However, for the later depth cue fusion we are interested in cues that are independent in terms of the required input signals. The stereo disparity is already used as a depth cue in our system. Therefore, only the TTC depth cue defined in Equation (2.60) is helpful as a starting point for the following conceptional extensions, which will make the approach accessible for the vehicle domain.
Conceptional Extensions
In order to make the TTC more accessible for object depth estimation in the vehicle domain, the concept was extended from the expansion of image regions to contractions (i.e., an increasing distance to the object of interest). Furthermore, to our knowledge the TTC has not been used before to compute object depth in real-world scenarios of the vehicle domain. In the following, a universal approach for computing the depth from TTC in the vehicle domain is given. As will be shown on ground truth data, the concept is theoretically sound, but has some drawbacks in a real-world application.
For computing depth from TTC, the image-related size of an object b_t for three consecutive frames is required (see Fig. 2.18). Furthermore, we assume a constant object motion v_obj within these frames. Given a frame rate f_rate of 11 Hz, this assumption is justified.
The ego vehicle motion v_{ego,t} is also needed; it is accessible via the CAN bus of today's vehicles.
Figure 2.18: Depth from Time to Contact measurements on three consecutive frames for an
approaching vehicle (i.e., b1 < b2 < b3).
For the scenario visualized in Fig. 2.18, which assumes a decreasing distance Dt to the
car (i.e., D1 > D2 > D3 or b1 < b2 < b3) the following Equ. (2.62) to (2.72) can be derived
and resolved to Equ. (2.73) (with f as the focal length and L the object size in the world).
D_1 = (v_{ego,1} - v_{obj})\, t_{TTC1}    (2.62)

D_2 = (v_{ego,2} - v_{obj})\, t_{TTC2}    (2.63)

D_1 = \frac{fL}{b_1}    (2.64)

D_2 = \frac{fL}{b_2}    (2.65)

D_1 = \frac{(v_{ego,1} - v_{ego,2})\, b_2\, t_{TTC1}\, t_{TTC2}}{b_2\, t_{TTC1} - b_1\, t_{TTC2}}    (2.66)

t_{TTC1} \approx \frac{\psi_1}{d\psi_1/dt} = \frac{\psi_1}{(\psi_2 - \psi_1) f_{rate}}    (2.67)

t_{TTC2} \approx \frac{\psi_2}{d\psi_2/dt} = \frac{\psi_2}{(\psi_3 - \psi_2) f_{rate}}    (2.68)

\psi_1 = \arctan(b_1/f)    (2.70)

\psi_2 = \arctan(b_2/f)    (2.71)

\psi_3 = \arctan(b_3/f)    (2.72)

D_1 \approx (v_{ego,1} - v_{ego,2}) \cdot \frac{b_2\arctan(b_1/f)\arctan(b_2/f)}{b_2\arctan(b_2/f)[\arctan(b_2/f) - \arctan(b_1/f)] - b_1\arctan(b_1/f)[\arctan(b_3/f) - \arctan(b_2/f)]}    (2.73)
Note that Equ. (2.73) depends on easily accessible input data alone (the measured object widths on the camera chip b_1, b_2, and b_3 as well as the vehicle ego motion v_{ego,t}). In addition to the direct plausibility check of the computed object distance D_1, the object motion v_obj can be computed and assessed for plausibility:

v_{obj} \approx \frac{D_1 + v_{ego,1}\, t_{TTC1}}{t_{TTC1}}.    (2.74)
For an increasing object distance Dt (i.e., b1 > b2 > b3), a TTC computation based on
Equ. (2.60) and hence Equ. (2.73) is not possible. However, when redefining the TTC for
objects that move away, we get Equ. (2.75) and (2.76) as well as Equ. (2.77) and (2.78):
t_{TTC1} \approx -\frac{\psi_1}{d\psi_1/dt} = \frac{\psi_1}{(\psi_1 - \psi_2) f_{rate}}    (2.75)

t_{TTC2} \approx -\frac{\psi_2}{d\psi_2/dt} = \frac{\psi_2}{(\psi_2 - \psi_3) f_{rate}}    (2.76)

D_1 = (v_{obj} - v_{ego,1})\, t_{TTC1}    (2.77)

D_2 = (v_{obj} - v_{ego,2})\, t_{TTC2}.    (2.78)
For the case of an increasing object distance, these equations can be resolved to:

D_1 \approx (v_{ego,2} - v_{ego,1}) \cdot \frac{b_2\arctan(b_2/f)\arctan(b_3/f)}{b_2\arctan(b_3/f)[\arctan(b_1/f) - \arctan(b_2/f)] - b_1\arctan(b_2/f)[\arctan(b_2/f) - \arctan(b_3/f)]}.    (2.79)

With a similar approach, for the two remaining cases (b_1 < b_2 > b_3 and b_1 > b_2 < b_3) we get Equ. (2.80) and (2.81):

D_1 \approx (v_{ego,2} - v_{ego,1}) \cdot \frac{b_2\arctan(b_2/f)\arctan(b_2/f)}{b_1\arctan(b_2/f)[\arctan(b_3/f) - \arctan(b_2/f)] + b_2\arctan(b_2/f)[\arctan(b_1/f) - \arctan(b_2/f)]}    (2.80)

D_1 \approx (v_{ego,1} - v_{ego,2}) \cdot \frac{b_2\arctan(b_1/f)\arctan(b_3/f)}{b_1\arctan(b_1/f)[\arctan(b_2/f) - \arctan(b_3/f)] + b_2\arctan(b_3/f)[\arctan(b_2/f) - \arctan(b_1/f)]}.    (2.81)
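For the approaching case (b_1 < b_2 < b_3), the chain from the measured widths to D_1 can be sketched as follows. The code follows Equ. (2.66) to (2.72) as stated above and is an illustration, not the evaluated implementation; the function name is ours.

import numpy as np

def depth_from_ttc(b1, b2, b3, v_ego1, v_ego2, f, f_rate):
    """Sketch of Equ. (2.66)-(2.72) for an approaching object
    (b1 < b2 < b3): estimate D1 from three consecutive object widths
    on the chip and the ego velocities of the first two frames;
    requires v_ego1 != v_ego2, otherwise the estimate degenerates."""
    psi1, psi2, psi3 = np.arctan(np.array([b1, b2, b3]) / f)  # Equ. (2.70)-(2.72)
    t1 = psi1 / ((psi2 - psi1) * f_rate)                      # Equ. (2.67)
    t2 = psi2 / ((psi3 - psi2) * f_rate)                      # Equ. (2.68)
    return (v_ego1 - v_ego2) * b2 * t1 * t2 / (b2 * t1 - b1 * t2)  # Equ. (2.66)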
Discussion
The results gathered in Tab. 2.3 for the case b_1 < b_2 > b_3 show that the TTC computation works well when using manually segmented ground truth data. The accuracy gathered thereby is roughly comparable to the error measured in studies with human subjects ([Gray and Regan, 1998] reported errors between 2.6 and 3.0% for approaching small objects). In contrast, TTC computation on a real-world test stream showed poor performance. This is due to the fact that a robust segmentation algorithm (including a tracking of the segmentation parameters) would be required by the described TTC approach in order to reach the necessary robustness in real-world scenarios and error rates that are consistent with psychophysical studies. The segmentation algorithms required in this context would need sub-pixel accuracy, which could be realized in a brain-like way by mimicking the hyperacuity principle based on population coding (see [Mallot, 2002]). Since such an algorithm is not in the focus of the current work, depth from TTC was not included in the developed depth fusion approach (refer to Sect. 5.2.1). Additionally, as shown in Equ. (2.67) and (2.68), the TTC is calculated using a temporal derivative of the measured object segments. Taking the derivative of the already uncertain segmentation results reduces the signal-to-noise ratio (SNR) even further [Mallot, 2002]. Furthermore, it is important to note that the introduced depth from TTC approach can only be applied in case the object velocity v_obj is, as assumed, approximately constant in the considered time interval. However, a check that evaluates whether the object motion v_obj (computable by Equ. (2.74)) is within plausible boundaries can be used to verify if this condition was observed. All this makes depth from TTC a cue that delivers a coarse estimation of the object depth. It should hence be fused with other, more reliable depth cues, allowing at least an approximate depth measurement in case everything else fails.
Table 2.3: Examples for depth from TTC for b_1 < b_2 > b_3 (f_rate = 3).

Ego velocity v_ego,1 (v_ego,2) [m/s] | Ground truth distance D_1 (D_2, D_3) [m] | Resulting v_obj [m/s] | Computed depth from TTC Z_TTC [m] | Relative error |D_1 − Z_TTC|/D_1 [%]
24 (21.9) | 39 (39.5, 39.2) | 22.50 | 41.38 | 6.10
24 (21.9) | 38 (38.5, 38.2) | 22.50 | 40.43 | 6.39
24 (21.9) | 37 (37.5, 37.2) | 22.50 | 39.48 | 6.70
24 (21.9) | 36 (36.5, 36.2) | 22.50 | 38.54 | 7.06
24 (21.9) | 35 (35.5, 35.2) | 22.50 | 37.60 | 7.43
24 (21.9) | 34 (34.5, 34.2) | 22.50 | 36.66 | 7.82
24 (21.9) | 33 (33.5, 33.2) | 22.50 | 35.73 | 8.27
Mean relative error: 7.11
2.2.6 Depth from Radar
Depth from Radar (Radio Detecting and Ranging) is obtained from a commercial standard
vehicle equipment sensor, which delivers sparse point-wise measurements of low longitudi-
nal but higher lateral uncertainty (for an example see Fig. 2.19b). Radar sensors evaluate
the reflections (echoes) of bundled microwave beams (typically between 400 MHz and 80 GHz) for detecting, localizing, tracking, and classifying objects. More specifically, the time of flight t_tof is used to determine the object distance Z_radar:

Z_{radar} = \frac{c_0 \cdot t_{tof}}{2}.    (2.82)

With: c_0 ... velocity of propagation (speed of light) ≈ 300000 km/s
t_{tof} ... time of flight (to the object and back).
For measuring the time of flight the individual beam packages must be marked and
recognized, which can be done by modulation and demodulation of the signal amplitude,
frequency or phase. The object velocity vdop is determined based on the Doppler shift ∆f :
v_{dop} = \frac{c_0 \cdot \Delta f}{2 f_0}.    (2.83)

With: c_0 ... velocity of propagation (speed of light) ≈ 300000 km/s
\Delta f ... measured Doppler frequency shift
f_0 ... carrier frequency.
Using Radar sensors, the object distance and velocity can hence be measured with inde-
pendent approaches. Different from visual sensors, Radar is very robust against changing
weather conditions [Winner, 2007], which makes it an important cue that increases the
system robustness.
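Both relations are direct to implement; the sensor parameters in the example below are assumptions for illustration only:

C0 = 299_792_458.0                 # speed of light [m/s]

def radar_distance(t_tof_s):
    """Equ. (2.82): object distance from the round-trip time of flight."""
    return C0 * t_tof_s / 2.0

def radar_velocity(delta_f_hz, f0_hz):
    """Equ. (2.83): relative object velocity from the Doppler shift."""
    return C0 * delta_f_hz / (2.0 * f0_hz)

# e.g., an assumed 77 GHz sensor measuring a 1 us echo and a 5.13 kHz
# Doppler shift: radar_distance(1e-6) -> ~150 m;
# radar_velocity(5130, 77e9) -> ~10 m/s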
2.3 Motion Features
The role of motion in a biologically motivated driver assistance system is twofold. First, a
decomposition of the scene regarding the magnitude and direction of motion is required in
order to increase the selectivity of the attention system. Second, a robust decomposition of
the scene into dynamic and static objects is required. Especially features for decomposing
the scene into static and dynamic objects are of high relevance in driver assistance, since
dynamic objects require fast system reactions and reliable predictions. Different system
modules should take detected dynamic objects into account. For example, appropriate
motion models for dynamic objects should be included into the collision avoidance and
path planning modules.
2.3.1 Differential Images
A well-known feature for motion detection is the differential image (i.e., the difference between the current and the previous frame), which is usually found in applications in the surveillance domain.
Figure 2.19: Used depth cues: Depth from (a) Stereo disparity, (b) Radar, (c) Object knowl-
edge, (d) Bird’s eye view.
In its classical form, this procedure has the general drawback
that the motion cannot be localized and classified in terms of magnitude and direction, because only changes in intensity are evaluated by this approach. Furthermore, in the car domain the influence of the vehicle ego motion on differential images is high. This makes it impossible to detect dynamic (i.e., moving) objects and to differentiate them from static scene content based on differential images. Despite these drawbacks, differential images are used as a feature in the presented ADAS because of two important advantages. First, differential images as a motion feature show strong activations at image regions that contain static and dynamic objects near the moving ego vehicle, which is of high importance in driver assistance. Second, the approach has the highest computational efficiency among the existing motion features.
Conceptional Extensions
The used implementation computes differential images on multiple scales, which allows a
decomposition in different motion magnitudes (see Fig. 2.20). Higher scales (i.e., lower
image resolutions) represent higher motion magnitudes. This is an important extension to known approaches.

Figure 2.20: Differential images: (a) Input image, (b) to (f) Motion feature on 5 scales.
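A sketch of this multi-scale computation (the function name and the pyramid smoothing parameter are ours):

import numpy as np
from scipy.ndimage import gaussian_filter

def diff_image_pyramid(img_t, img_prev, levels=5):
    """Sketch: differential images on multiple pyramid scales; higher
    levels (lower resolution) respond to larger motion magnitudes."""
    maps = []
    a, b = img_t.astype(float), img_prev.astype(float)
    for _ in range(levels):
        maps.append(np.abs(a - b))
        # Gaussian pyramid step: smooth, then subsample by factor 2
        a = gaussian_filter(a, 1.0)[::2, ::2]
        b = gaussian_filter(b, 1.0)[::2, ::2]
    return maps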
As stated before, a differentiation between dynamic and static scene elements is not possible with differential images. Therefore, a more sophisticated approach for motion estimation is needed. In the next subsection, an approach for estimating the object motion is described. The approach is based on a biologically motivated optical flow algorithm [Willert et al., 2007].
2.3.2 Detection of Dynamic Objects
In the following, an approach for the visual vehicle ego motion compensation based on
stereo disparity is described, which compensates the ego-motion-induced movement of
static scene elements. Based on this compensation, dynamic objects can be detected.
System Description
In order to describe the object motion detection approach, a detailed system description is given that roughly relates to [Schmudderich et al., 2008]. The authors realized an ego motion compensation on the robot ASIMO. Based on that, moving human interactors are detected even while ASIMO itself is moving. Different from the indoor environment with constant illumination and a restricted number of objects, the system presented here runs on real-world, outdoor data of the vehicle domain. The system (see Fig. 2.21) uses the current and previous gray value images (I_t and I_{t−1}) as input. Furthermore, the vehicle yaw rate \dot{\theta}_Y and longitudinal velocity from the CAN bus as well as the stereo disparity are required. Based on Equ. (2.43), (2.44), and (2.45), the 3D world position of all objects in
the scene can be computed, leading to the 3 maps X, Y, and Z (refer to Fig. 2.15 on page
33).
For compensating the vehicle ego motion on the image plane, the following processing steps are realized: First, an extended single track model is applied (refer to Sect. 4.2.2 for details) that computes the longitudinal vehicle motion ∆Z and the lateral vehicle motion ∆X in world coordinates since the last captured frame. The yaw angle change ∆θ since the last frame can be derived from the yaw rate \dot{\theta}_Y. Now, the current 3D coordinates (X(u, v), Y(u, v), and Z(u, v)) of all image pixels (u, v) that are computed from the stereo disparity are corrected by the computed longitudinal and lateral motion as well as the yaw angle. More specifically, ∆X and ∆Z are used as offsets on the world coordinate maps X(u, v) and Z(u, v). The yaw angle change directly corrects the yaw angle θ_Y of the pin hole camera model (see Equ. (A.1) and (A.2) on page 138). By using a pin hole camera model on the corrected 3D coordinates, a change in 2D pixel coordinates is computed. This change represents how the image is expected to change due to the vehicle ego motion. The information can be used to warp the pixels of the current image I_t back in time (so-called backward warping), assuming the overall scene to be static. This results in the warped image I^+_{t−1}. Computing the optical flow (i.e., the pixel-wise motion of image regions between two consecutive images of the same camera) between the previous camera image I_{t−1} and the warped image I^+_{t−1} reveals the dynamic objects present in the scene.
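The following sketch condenses the warping step. It is an illustration under assumptions: the signs and the order of the motion corrections depend on the coordinate conventions, holes from the scatter operation are left unfilled, and the function name is ours.

import numpy as np

def backward_warp(img_t, X, Y, Z, dX, dZ, d_theta, fu, fv, u0, v0):
    """Sketch of backward warping: undo the ego motion on the 3D stereo
    maps of the current frame, reproject with a pin-hole model, and
    scatter the pixels of I_t to their predicted old positions."""
    # undo the translation and the yaw rotation (around the Y axis)
    Xc, Zc = X - dX, Z - dZ
    c, s = np.cos(-d_theta), np.sin(-d_theta)
    Xr, Zr = c * Xc + s * Zc, -s * Xc + c * Zc
    # pin-hole projection of the corrected 3D points (cf. Equ. (2.39)/(2.40))
    u = np.round(-fu * Xr / Zr + u0).astype(int)
    v = np.round(-fv * Y / Zr + v0).astype(int)
    h, w = img_t.shape
    valid = (Zr > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    warped = np.full(img_t.shape, np.nan)   # I+_{t-1}; holes stay NaN
    warped[v[valid], u[valid]] = img_t[valid]  # last write wins on collisions
    return warped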
Several postprocessing steps improve the robustness. More specifically, morphological operations assure that small clutter is rejected. The Pearson measure described in
[Schmudderich et al., 2008] boosts flow values with high magnitude and high correlation
confidence. The correlation confidence is a by-product of the NCC-based optical flow
computation (see [Willert et al., 2006]).
Conceptional Extensions
As an extension to the state-of-the-art system of [Schmudderich et al., 2008], the system described here incorporates top-down (TD) information about static scene objects to improve the quality. For known static image regions, e.g., regions containing road (see Sect. 4.1.2 for the robust unmarked road detection system our ADAS disposes of) or sky, the predicted values I^+_{t−1} are set to I_{t−1} and will hence not produce optical flow when comparing the back-warped image I^+_{t−1} and the previous image I_{t−1}. This procedure improves the system robustness and decreases false positive detections of dynamic objects.
Discussion
In the following, the realized object motion detection system is evaluated in terms of
computational demands and the detection performance in four real-world scenarios.
Table 2.4 lists the computation times of the required system modules depicted in Fig. 2.21 for images of 200x150 pixels. In sum, the motion detection algorithm needs about 1.4 s per image (a processing rate of 0.7 Hz) when running on a single computer. The image data as well as the vehicle state data from the CAN bus are transmitted via LAN to a Toshiba Tecra A7 (2 GHz Core Duo) running our RTBOS integration middleware [Ceravola et al., 2006] on top of Linux.
Figure 2.21: System overview: Object motion detection based on backward image warping.
Table 2.4: Computational demands of the dynamic object detection system.

Module                                | Computation time [ms]
Rectification                         | 88
Stereo (SVS)                          | 21
Single track model                    | 11
Backward warping                      | 3
Optical flow: displacement maps       | 625
Optical flow: confidence maps         | 525
Optical flow                          | 65
Postprocessing: Pearson threshold     | 13
Postprocessing: 3x3 median filter     | 77
Postprocessing: morphological opening | 1.5
Σ Computation time                    | 1429.5
The algorithm was implemented in C using an optimized image processing library based on the Intel IPP [Intel, 2006]. Since our ADAS receives images at a frame rate of 10 Hz, ego-motion-compensated data is accessible and can be combined in the attention system only for every 7th image, even after distributing the system over 2 computers. The envisioned future acceleration of the optical flow algorithms will increase this rate.
Figure 2.22: Visualization of test scenarios, (a) Target vehicle crossing the road from left, (b)
Target vehicle from left stops in the middle of the road, (c) Target vehicle from left, camera
vehicle turns right and follows, (d) Pedestrian passes the road.
In the following, a quantitative system evaluation on four test scenarios (see visualization
in Fig. 2.22) is realized. In the first scenario, a target vehicle is crossing the road in front
of the ego vehicle, while the ego vehicle moves straight along the road. In the second
scenario, the target vehicle stops in the middle of the road. In the third scenario, the
target vehicle crosses the road, while the ego vehicle follows and turns right. In the last
scenario, a pedestrian crosses the road in front of the moving ego vehicle.
In Fig. 2.23, the gathered detection results for all four scenarios are depicted showing a
robust detection of dynamic image regions. Especially, the results gathered for scenario 3
show that the incorporation of vehicle kinematics (in form of a single track model) allows
the detection of moved objects even during heavy lateral movements.
Figure 2.23: Object motion detection with camera vehicle motion (right: input image, left:
detected dynamic objects), (a) Car crossing the road in front, (b) Car from left stops in the
middle of the road, (c) Car from left, camera vehicle turns right and follows, (d) Pedestrian
passes the road.
2.4 Summary
Chapter 2 describes the feature space, the realized Advanced Driver Assistance System
(ADAS) relies on. Hence, the Chapter provides the theoretical foundations for the remain-
ing work, since the described features are required by the ADAS for fulfilling its various
driver assistance tasks. More specifically, the attention system described in the following
Chapter is based on various static and dynamic feature maps that are weighted and com-
bined in a robust way. All these features are biologically motivated, meaning that known
processing principles of the vision pathway in the human brain are mimicked (e.g., the
signal processing characteristics of neurons in the brain).
Related to static features, Difference of Gaussian (DoG) filters are introduced. DoG
filters are selective to homogeneous regions in the image that differ from the background
in terms of their intensity. A decomposition of the DoG filter responses allows an efficient
separation of dark regions on a bright background (off-on contrast) from bright regions
on a dark background (on-off contrast). Based on image pyramids, a DoG filter bank is
realized that allows the detection of homogeneous regions of five different sizes. In sum,
10 DoG feature maps are accessible.
Furthermore, a Gabor filter bank is introduced. Based on the orientation and scale
selective Gabor kernel, lines and edges of five different sizes and four orientations can be
detected. By applying an additional separation into on-off and off-on contrasts, in sum 80
Gabor feature maps are computed.
As a further static feature, the biologically motivated RGBY color space is described, which
mimics the color processing on the human retina. Based on the RGBY colors, four pyramids
of color maps are obtained (in sum 20 feature maps). Furthermore, based on the RGBY
color maps, 20 RGBY color contrast maps are computed that assess local changes in the
color maps.
In addition to these features, a depth map of the scene is computed. During the projection
of the 3D world to the 2D image, the depth is lost. Therefore, approaches that recover
the depth are error prone. The realized ADAS has access to five different depth cues that are
combined in order to increase the accuracy. The realized depth sources are stereo disparity,
depth from object knowledge, depth from Time to Contact, depth from the bird's eye view,
and Radar-based depth. In sum, 130 static feature maps are accessible for the proposed
ADAS (10 DoG, 80 Gabor, 20 RGBY color maps, 20 RGBY color contrast maps).
As dynamic feature maps, differential images (i.e., the difference of two consecutive
images), computed on five scales, are used. However, based on differential images, a
separation between static (e.g., parking cars) and dynamic objects (e.g., moving cars) is not
possible due to the motion of the camera vehicle. To solve this challenge, a system for
the detection of dynamic objects is described and tested. In sum, six dynamic feature
maps are accessible to the ADAS (5 motion maps from differential images, 1 motion map
for dynamic objects), which makes an overall number of 136 feature maps.
Summing up, the novelties described in Chapter 2 are:

• Computationally efficient decomposition of the Gabor filter response in on-off and off-on components, allowing a gain in selectivity for the attention system,

• A formalization for allowing the computation of depth from Time to Contact (for approaching and departing objects) with sensor data that is usually accessible in today's vehicles,

• Detection of ego-moved objects is improved with early incorporation of top-down knowledge that prevents the false detection of known static scene elements (e.g., road, sky, parking vehicles),

• Robust suppression of the horizon edge based on highly selective attention features.
The large number of innovative and robust feature maps described here is the basis
for numerous system-related novelties described in the remaining Chapters. For example,
said feature maps are weighted and combined in the attention system that is described in
Chapter 3. Thereby, the saliency map (i.e., the output of the attention system) can be
used to actively search for objects dependent on the current system task. Based on that, a
task-dependent decomposition of the input image is possible that increases the relevance of
input data to higher system layers. Hence, the biologically motivated attention approach
described in the following is one of the key aspects of the thesis at hand.
3 Task-dependent Tunable Visual Attention
Facilities for controlling and managing traffic are always visually conspicuous. For example,
lane markers are white on a typically dark road, and traffic signs or traffic lights have
bright colors. Accordingly, in many countries flashy advertisement is prohibited in
the proximity of roads. These examples exploit a key aspect of human visual processing
- the principle of early selection. Since vision is the human sensory modality with the
highest information density, this principle significantly accelerates the processing of vision
data. More specifically, the abundance of visual stimuli in the world is prefiltered or
preselected early to match the restricted cognitive capacity of the human brain. In plain
words, the principle of early selection suppresses sensor data that is not relevant to the
current needs or goals of the system, causing a colorful, bright traffic sign to visually pop
out in a traffic scenario. For realizing this early selection principle, humans rely on the
so-called attention mechanism, which preselects the scene elements.
More specifically, the human vision system filters the high abundance of environmental
information by attending to scene elements that either pop out in the scene (i.e., ob-
jects that are visually conspicuous) or match the current task best (i.e., objects that are
compliant to the current internal state or need/task of the system), while suppressing
the rest. For both attention guiding principles psychophysical and neurological evidence
exists (see [Corbetta and Shulman, 2002, Egeth and Yantis, 1997]). Following this prin-
ciple, technical vision systems have been developed that prefilter a scene by decomposing
it into its features (see [Wolfe and Horowitz, 2004]) and recombining these to a saliency
map that contains high activation at regions that differ strongly from the surrounding (i.e.,
bottom-up (BU) attention, see [Koch and Ullman, 1985]). More recent system implementa-
tions additionally include the modulatory influence of task relevance into the saliency (i.e.,
top-down (TD) attention, see [Tsotsos et al., 1995] as one of the first and [Frintrop, 2006,
Navalpakkam and Itti, 2005] as the most recent and probably most influential approaches).
In these systems, instead of scanning the whole scene in search for certain objects in a brute
force way, the use of TD attention allows a full scene decomposition despite restraints in
computational resources. In principle, the vision input data is serialized with respect to
its importance for the current task. Based on this, computationally demanding processing
stages located higher in the architecture work on prefiltered data of improved relevance,
which saves computation time and allows complex real-time vision applications.
During the vision system design we aimed at a computationally efficient system
implementation for online use in vehicles. The overall system should be flexible, meaning that
a new system task should not lead to the necessity of realizing new modules or a structural
redesign of the whole system. Getting our inspiration from biology we therefore aimed
at a system that exhibits specific properties without being specifically designed for these
properties (e.g., our system is able to locate the horizon edge or detect fast moving objects
or red traffic signs without being explicitly designed for these tasks).
The design goals of our TD attention sub-system comprised the development of an
object and task-specific tunable saliency map suitable for the real-world scenarios in the
car domain.
However, the robustness of biological attention systems is difficult to achieve, given e.g.,
the high variability of scene content, changes in illumination, and scene dynamics. Most
computational attention models do not show real-time capability and are mainly tested
in a controlled indoor environment on artificial scenes. Important aspects discriminating
real-world scenes from indoor and artificial scenes are the dynamics in the environment
(e.g., changing lighting and weather conditions, dynamic scene content) as well as the high
scene complexity (e.g., cluttered scenes). Dealing with such scenarios requires a strong
system adaptation capability with respect to changes in the environment. Here, we focus
on five conceptual issues crucial for closing the gap between artificial and natural attention
systems operating on real-world scenes. We show the feasibility of our approach on vision
data from the car domain. The described TD tunable attention system is used as front-end
of the vision system of an advanced driver assistance system (ADAS) described in Chapter
5, whose architecture is inspired by the human brain.
After elaborating on related approaches in Section 3.1, Section 3.2 will describe specific
challenges for an attention system under real-world conditions. Section 3.3 will describe
our attention sub-system in detail pointing out the solutions to the denoted challenges.
Taking up these challenges, Section 3.4 compares the proposed attention system on a functional
level to two other, influential attention approaches from literature. Section 3.5 underlines
the potential of the described solutions based on results calculated on different real-world
scenes, after which Chapter 3 is summarized.
3.1 Related Work
In the past, the human vision system has been examined in a large number of studies.
For example, the psychophysical experiments of [Simons and Chabris, 1995] impressively
showed that the task has a modulating effect on attention. The gathered results were for-
malized in the concept of inattentional blindness. In their experiments participants did not
notice unexpected events (like a black gorilla walking through an indoor scene) when the
task (counting ball contacts of a white basketball team) involved features complementary
to the unexpected events (see Fig. 3.1).
Related to the vehicle domain the task-dependent nature of gazing has also been proven
while steering a car. Recently, it was shown in [Most and Astur, 2007] that the performance
for dangerous situation detection (a colored motorcycle veering into the vehicle’s path)
strongly depends on the feature-match between the current distracting visual task and the
unexpected obstacle. In another example, the gaze of drivers in a virtual environment was
examined [Shinoda et al., 2001]. The results show that the performance in detecting stop
Figure 3.1: Psychophysical study conducted by [Simons and Chabris, 1995] marking the hu-
man visual attention as strong mediator between the world and our perception of the world.
signs is heavily modulated by context (i.e., top-down) factors and not only by bottom-
up visual saliency. Endowing a vision architecture for an intelligent car with similar,
task-based attention can result in a gain of performance with minimal additional resource
requirements (see Sect. 3.5).
In most research on human visual attention the focus is on the bottom-up detection
of salient features/objects in a scene (for a review of biologically evident attention fea-
tures see [Wolfe and Horowitz, 2004]). A well-known computational model for saliency
calculation is the approach by [Itti et al., 1998] that is used in a number of implemented
systems. Recently, this approach has been extended by various researchers to account for
task-dependent aspects of visual attention (see, e.g., [Frintrop et al., 2005, Goerick et al.,
2005, Hawes and Wyatt, 2006]) by applying dynamic weights to different processing stages.
The tasks are often to find a specific object within a predominantly static indoor scene.
A more complete view on a possible architecture for a visual system incorporating task-
dependent visual attention is given by [Navalpakkam and Itti, 2005, 2006]. The proposed
architecture combines top-down (TD) and bottom-up (BU) influences by using TD weights
on the calculated BU features. However, there is no separation between the untuned BU
saliency map and the calculated TD saliency maps allowing a weighted combination, which
would ensure the preservation of BU influence in all system states. The system is evaluated
mainly on static indoor scenes and a few static outdoor scenes. Furthermore, there are only
few attention-based vision systems that use a motion feature (see [Backer and Mertsching,
2000, Tsotsos et al., 2004]). Given the importance of motion in the human visual percep-
tion, we see modeling the influence of scene dynamics on attention as a key issue to realize
robust human-like vision systems.
In Sect. 3.4, we chose the two related top-down attention systems of
[Navalpakkam and Itti, 2005] and [Frintrop, 2006] for a detailed structural and functional
comparison, since these impacted our work most.
However, numerous other psychophysical and computational attention models ex-
ist (please refer to [Frintrop, 2006, Frintrop et al., 2009, Heinke and Humphreys, 2005,
Itti et al., 2005] for a comprehensive overview of the latest developments in attention re-
search and [Findlay and Gilchrist, 2003] for an overview of related psychophysical studies).
Turning to the domain of vision systems developed for ADAS, there have been few at-
tempts to incorporate aspects of the human visual system into complete systems. With
respect to attention processing, a saliency-based traffic sign detection and recognition sys-
tem was demonstrated in [Ouerhani, 2003]. In terms of complete vision systems, one of the
most prominent examples is a system developed in the group of E. Dickmanns [Dickmanns,
2004]. It uses several active cameras mimicking the active nature of gaze control in the
human visual system. However, the processing framework is not closely related to the
human visual system. Without a tunable attention system and with TD aspects that are
limited to a number of object-specific approaches for classification, no dynamic preselection
of image regions is performed. A more biologically inspired approach has been presented
by Farber [Farber, 2005]. However, this publication as well as the recently started German
Transregional Collaborative Research Centre “Cognitive Automobiles” [Stiller et al.,
2007] addresses mainly human-inspired behavior planning, whereas our work focuses more on
task-dependent perception aspects.
The only other known vision system approach that attempts to explicitly model aspects
of the human visual system is described by [Matzka et al., 2008]. The system is somewhat
related to the ADAS presented here. However, published after our work (see, e.g.,
[Michalke et al., 2007]), the approach allows for a simple attention-based decomposition of
road scenes but without incorporating object knowledge or context information. Addition-
ally, the overall system organization is not biologically inspired and hence shows limitations
in its flexibility.
In contrast to the ADAS presented here, a tendency of most large-scale research projects
like, e.g., the European PreVENT project [WWW, 2006] is the decomposition of the overall
functionality into many building blocks and combining these blocks into subsets for solving
isolated tasks. While this ’divide and conquer’ approach does lead to impressive results
in specific settings, we believe the challenge of integrating all these functionalities into a
coherently working flexible system is not yet solved.
3.2 Real-World Challenges for Top-Down Attention Systems

In the following, we describe the challenges a TD attention system faces when used on
real-world images.
(1) High feature selectivity: In order to yield high hit rates in TD search, an attention
system needs high feature selectivity, i.e., as many supporting and inhibiting feature
maps as possible. For this, the used features must be selected and parameterized
appropriately. Even more important for high selectivity is the use of modulatory TD weights
on all sub-feature maps and scales. Many TD attention approaches allow TD weighting
only on a high integration level (e.g., no weighting on scale level [Frintrop et al., 2005])
or without using the full potential of features (e.g., no on-off/off-on feature separation
[Navalpakkam and Itti, 2005]) which leads to a performance loss. Our system fulfills both
aspects. Based on the extended selectivity of our attention sub-system, we can handle
specific challenges of the car domain, as dealing with the horizon edge present in most
images.
(2) Comparable TD and BU saliency maps: Typically, the TD and BU saliency maps
are combined to an overall saliency, on which the Focus of Attention (FoA) is calculated.
The combination requires comparable TD and BU saliency maps, making a normalization
necessary. Humans face the same challenge when elements popping out compete with
task-relevant scene elements for attention. A prominent procedure in the literature normalizes
each feature map to its current maximum (see [Navalpakkam and Itti, 2005], which is based
on [Itti et al., 1998]); this has some drawbacks that our approach avoids.
(3) Comparability of modalities: Similarly, the combination of different, a priori
incomparable modalities must be achieved (e.g., deciding on the relative importance of edges
versus color). We realize this by the biological principle of homeostasis, which we define as
the reversible adaptation of essential processes of a (biological) system to the environment
(see e.g., [Hardy, 1983]).
(4) Support of conjunctions of weak object features in the TD path: Another
important robustness aspect is the support of conjunctions of weak object features in the
TD path of the attention sub-system. That is, an object having a number of mediocre
feature activations, but no feature map popping out, should still yield a clear maximum
when combined on the overall saliency.
(5) Changing lighting conditions: In a real-world scene, changing lighting conditions
heavily influence the features the saliency map is composed of, and hence the performance of
the attention system suffers. As the calculated TD weights are based on the features of training
images (see Sect. 3.3), the TD weights are illumination-dependent as well. Put differently,
the TD weights are optimal for the specific illumination, and thereby for the contrast,
present in the training images. The usage of TD weights on test images with a differing
illumination will lead to an inferior TD search performance. Instead of adapting the TD
weights dependent on the illumination, a local exposure control is proposed in order to
adjust the contrast of the training images as well as the test images before applying TD
weight calculation and TD search.
3.3 Modeling Attention: From a Robustness Point of View
The organization of Sect. 3.3 follows the consecutive processing steps of the current
ADAS attention sub-system as depicted in Fig. 3.2. After a short description of the
general purpose of the BU and TD pathways, their combination to the overall saliency
is described. Following this overview, the used modalities (feature types) are specified
followed by the entropy measure that is used for the camera exposure control. Next, the
different steps of the feature postprocessing are described. The TD feature weighting, the
homeostasis process to get the conspicuity maps (i.e., modalities) comparable, as well as
the final BU/TD saliency normalization are the final processing steps in our attention
architecture.
The attention system consists of a BU and a TD pathway. The TD pathway (red
enclosed region in Fig. 3.2) allows an object- and task-dependent filtering of the input
data. All image regions containing features that match the current system task well are
supported (excitation), while the others are suppressed (inhibition), resulting in a sparse
task-dependent scene representation. Opposed to that, the BU pathway (blue enclosed
region in Fig. 3.2) supports an object- and task-unspecific filtering of the input data,
supporting scene elements that differ from their surroundings. The BU pathway is important
for a task-unspecific analysis of the scene, supporting task-unrelated but salient scene elements.

Figure 3.2: Visual attention sub-system (dashed lines correspond to TD links). [Block diagram: the RGB input image is decomposed on image pyramids of 5 scales (starting from 256x256) into the modalities motion, DoG intensity (on-off/off-on), odd and even Gabor (on-off/off-on), RGBY colors, and RG/BY double color opponency; an entropy measure drives a recurrent cycle that controls the camera exposure time; BU and TD feature weights, sparseness weights, and conspicuity weights (adapted by homeostasis) combine the maps into conspicuity maps and, after normalization with w_norm^BU and w_norm^TD, into the overall saliency via the parameter λ.]

The BU and TD saliency maps are linearly combined to an overall saliency map. This
map is used to generate FoAs that represent the scene elements higher system layers work
on. The combination is realized using the parameter λ (on the right hand side in Fig. 3.2)
that is set dependent on the system state, emphasizing the BU and/or TD influence (see
Equ. (3.9)). Due to this combination the system also detects scene elements that do
not match the current TD system task. By ensuring a certain BU influence, such scene
elements are not suppressed, which would otherwise lead to the so-called inattentional
blindness phenomenon (i.e., the complete perceptual suppression of scene elements as described
in [Simons and Chabris, 1995]).
Turning to the processing details, the following modalities are calculated on the captured
color images: RGBY colors (inspired by [Frintrop, 2006]), intensity by a Difference of
Gaussian (DoG) kernel, oriented lines and edges by a Gabor kernel, motion by differential
images (see Chapter 2 for details on the used feature maps), and entropy using the structure
tensor.
In the following, the modalities are briefly recapitulated, after which the entropy measure
used to set the camera exposure is specified. The features motion
and color are used differently for the BU and TD path. The BU path uses double color
opponency from RGBY colors by applying a DoG on 5 scales on the RG and BY color
opponent maps. The filter results are separated into their positive and negative parts
(on-off/off-on separation, whose importance is emphasized in [Frintrop, 2006]) leading to 4
pyramids of double color opponent RG,GR,BY and YB-maps. The TD path uses the same
color feature but additionally 4 pyramids of the absolute RGBY maps. Absolute RGBY
colors do not support the BU pop-out character and are hence not used in the BU path. A
DoG filter bank is applied on 5 scales separating on-off and off-on effects. Furthermore, a
Gabor filter bank on 4 orientations (0, π/4, π/2, 3π/4) and 5 scales is calculated separately
for lines and edges (even and odd Gabor). The realized Gabor filter bank ensures disjoint
decomposition of the input image. The detailed mathematical formulation of the used
Gabor filter bank can be found in Chapter 2. Motivated from DoG the concept of on-
off/off-on separation is transferred to Gabor allowing e.g., the crisp separation of the sky
edge or street markings from shadows on the street. Motion from differential images on
5 scales is used in the BU path alone. Since this simple motion concept cannot separate
static objects from self-moving objects, it is not helpful in TD search. The entropy T is
based on the absolute gradient strength of the structure tensor A on the image Igray:
T = \frac{\det(A)}{\mathrm{trace}(A)}. \qquad (3.1)
The matrix A is calculated using the derivatives of Gaussian filters G_u and G_v and a
rectangular filter of size W:

A = \begin{bmatrix} \Sigma_W (G_u * I_{gray})^2 & \Sigma_W (G_u * I_{gray})(G_v * I_{gray}) \\ \Sigma_W (G_v * I_{gray})(G_u * I_{gray}) & \Sigma_W (G_v * I_{gray})^2 \end{bmatrix} \qquad (3.2)

G_u(u, v) = -\frac{u}{2\pi\sigma^4} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right), \quad G_v(u, v) = -\frac{v}{2\pi\sigma^4} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right).
We use the entropy as a means to adapt the camera exposure and not as a feature.
The local exposure control works on the accumulated activation T_sum = Σ_RoI T on an
image region of interest (RoI) (e.g., coming from the appearance-based object tracker that
is part of our ADAS; for details see [Michalke et al., 2007]). Here we get inspiration from
the human local contrast normalization. The exposure time is recursively modified in
search of a maximum of T_sum, which maximizes the contrast on the defined image regions.
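As a rough sketch of this measure, the following Python code approximates Equ. (3.1)/(3.2) with scipy's derivative-of-Gaussian and rectangular filters; the kernel sizes and the RoI handling are assumptions, not the exact parameterization of the thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def entropy_T(I_gray, sigma=1.5, W=5):
    """Structure-tensor measure T = det(A)/trace(A) of Equ. (3.1)."""
    Gu = gaussian_filter(I_gray, sigma, order=(0, 1))  # derivative along u
    Gv = gaussian_filter(I_gray, sigma, order=(1, 0))  # derivative along v
    # rectangular filter of size W sums the tensor entries (Equ. (3.2))
    Auu = uniform_filter(Gu * Gu, W)
    Avv = uniform_filter(Gv * Gv, W)
    Auv = uniform_filter(Gu * Gv, W)
    det = Auu * Avv - Auv ** 2
    trace = Auu + Avv
    return det / np.maximum(trace, 1e-12)

def exposure_score(I_gray, roi_mask):
    """Accumulated activation T_sum on a region of interest; the exposure
    time is varied recursively in search of a maximum of this score."""
    return entropy_T(I_gray)[roi_mask].sum()
```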
As described in Chapter 2, the system disposes of 136 independently weighable sub-feature
maps.
Following the calculation of the raw features, a postprocessing step on all sub-feature
maps is performed (see Fig. 3.3). The feature postprocessing consists of 5 steps. First,
all sub-features are normalized to the maximal value that can be expected for the specific
sub-feature map (not the current maximum on the map). For example, for DoG and
Gabor this is done by determining the filter response for the ideal input pattern, i.e., the
maximum possible filter response.

Figure 3.3: Postprocessing of feature maps in BU and TD path. [Block diagram: min-max normalization to the expected sub-feature maximum, squaring (signal power), nonlinear noise suppression (parameter K_supp, adapted via a TD link to the BU saliency normalization), bilinear resize to 256x256, and, in the BU path only, multiplication with the sparseness weight w_i^sparse.]

The ideal input pattern is generated by setting all pixels
to 1 whose matching pixel positions in the filter kernel are bigger than 0. Figure 3.4 shows
the resulting ideal DoG and 0◦ even Gabor input patterns that are derived from the given
filter kernels. This procedure ensures comparability between sub-features of one modality
(e.g., all sub-feature maps of the motion modality).

Figure 3.4: Input patterns that maximize the filter response. The maximum of this filter response is used for sub-feature normalization: (a) Ideal DoG input pattern, (b) Ideal 0° even Gabor input pattern.

Next, the signal power is calculated by squaring and a dynamic neuronal suppression using a sigmoid function, which is applied
for noise suppression. A parameter K_supp shifts the sigmoid function horizontally, which
influences the degree of noise suppression and thereby the sparseness of the resulting
sub-feature maps. After a bilinear resize to the resolution 256x256, which allows a later feature
combination, the BU feature postprocessing multiplies a sparseness weight w_i^sparse that
ensures pop-out by boosting sub-feature maps with sparse activation:

w_i^{sparse} = \sqrt{\frac{2^s}{\sum_{\forall u,v:\, F_{i,k}(u,v) > \xi} F_{i,k}(u,v)}} \quad \text{for } s = [0, 4] \text{ and } \xi = 0.9 \cdot \mathrm{Max}(F_{i,k}). \qquad (3.3)
The sparseness operator is not used in the TD path (see red enclosed region in Fig. 3.3)
in order to prevent the suppression of weak object features.
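A compact sketch of the five postprocessing steps for one sub-feature map follows. The exact sigmoid shape and the omitted resize are assumptions; F_ideal_max denotes the filter response to the ideal input pattern of Fig. 3.4.

```python
import numpy as np

def postprocess(F, F_ideal_max, K_supp, s, bu_path=True):
    """Postprocessing of one sub-feature map (Fig. 3.3)."""
    F = F / F_ideal_max                      # 1) normalize to expected maximum
    F = F ** 2                               # 2) signal power (squaring)
    F = 1.0 / (1.0 + np.exp(-(F - K_supp)))  # 3) sigmoid noise suppression,
                                             #    shifted horizontally by K_supp
    # 4) bilinear resize to 256x256 omitted in this sketch
    if bu_path:                              # 5) sparseness weight, BU path only
        xi = 0.9 * F.max()
        w_sparse = np.sqrt(2.0 ** s / F[F > xi].sum())  # Equ. (3.3)
        F = w_sparse * F
    return F
```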
Later in the TD path a weighting on all 136 sub-feature maps takes place to realize
inhibition and excitation. The TD-related tuning on the feature level is motivated from
58
3 Task-dependent Tunable Visual Attention
the fact that neurobiological studies have shown that attentional influences are present
very early in the human visual pathway (see [Treue, 2003]). Furthermore, neurophysiological
studies on monkeys have shown that attention-based modulation of neuronal activities leads
to an increase in activity in case the bias matches the preferences of cell populations. But
also a suppression of neuron populations can be encountered in case the attended features
do not match the preferences of the cells (see [Treue, 2003]). These measurements motivate
the usage of supporting (excitation) and suppressing (inhibition) feature TD weights as
realized in the presented attention system.
The TD weights w_i^{TD} are calculated in an offline step (inspired by [Frintrop et al., 2005]
but extended):

w_i^{TD} = \begin{cases} \mathrm{SNR}_i & \forall\, \mathrm{SNR}_i \ge 1 \\ -\frac{1}{\mathrm{SNR}_i} & \forall\, \mathrm{SNR}_i < 1 \end{cases} \quad \text{with} \quad \mathrm{SNR}_i = \frac{\frac{1}{N_{i,obj}} \sum_{\forall u,v:\, F^{TD}_{i,obj}(u,v) > \phi} F^{TD}_{i,obj}(u,v)}{\frac{1}{N_{i,surr}} \sum_{\forall u,v:\, F^{TD}_{i,surr}(u,v) > \phi} F^{TD}_{i,surr}(u,v)}. \qquad (3.4)
The average activation in the object region is related to the average activation in the
surround on each feature map F_i^{TD}, taking only the N_i pixels above the threshold
φ = K_conj · Max(F_i^{TD}) with K_conj ∈ (0, 1] into account.

As opposed to the weighting scheme proposed by [Frintrop et al., 2005], in the approach
presented here the threshold φ assures that numerous small values on a feature map
do not even out rarely present large ones. The proposed threshold φ improves the TD search
performance, since large values influence the TD search performance overproportionally. In
the BU path only excitation (w_i^{BU} ≥ 0) takes place, since without object or task knowledge
nothing can be inhibited in BU. For a more detailed discussion of feature map weighting
see [Frintrop, 2006, Michalke et al., 2007].
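A minimal sketch of the weight calculation of Equ. (3.4), taking the complement of the object region as surround (a simplification of the thesis' surround definition):

```python
import numpy as np

def td_weight(F, obj_mask, K_conj=0.5):
    """Excitatory/inhibitory TD weight for one sub-feature map F."""
    phi = K_conj * F.max()              # threshold phi = K_conj * Max(F)
    def mean_above(region):             # average of the pixels above phi
        vals = F[region & (F > phi)]
        return vals.mean() if vals.size else 1e-12
    snr = mean_above(obj_mask) / mean_above(~obj_mask)
    return snr if snr >= 1.0 else -1.0 / snr
```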
As visualized in Fig. 3.2, the j = 1..M conspicuity maps C_j^{BU} and C_j^{TD} result from a
weighted combination of the N_j BU and TD sub-feature maps within a certain feature
type j:

C_j^{BU} = \sum_{i=1}^{N_j} w_{i,j}^{BU} F_{i,j}^{BU} \qquad (3.5)

C_j^{TD} = \sum_{i=1}^{N_j} w_{i,j}^{TD} F_{i,j}^{TD}. \qquad (3.6)
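In code, the combination of Equ. (3.5)/(3.6) is a plain weighted sum over the sub-feature maps of one modality (a sketch; all map shapes are assumed equal after the resize step):

```python
import numpy as np

def conspicuity_map(sub_feature_maps, weights):
    """Weighted combination of the N_j sub-feature maps of modality j."""
    C = np.zeros_like(sub_feature_maps[0])
    for F, w in zip(sub_feature_maps, weights):
        C += w * F
    return C
```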
The sub-feature normalization procedure ensures intra-feature comparability, but for the
overall combination, comparability between modalities (i.e., conspicuity maps) is required
as well. We solve the normalization problem of the conspicuity maps by dynamically
adapting the conspicuity weights w_{C_j} for weighting the BU and TD conspicuity maps
C_j^{BU} and C_j^{TD}. This concept mimics the homeostasis process in biological systems (see
e.g., [Hardy, 1983]), which we understand as the property of a biological system to regulate
its internal processes in order to broaden the range of environmental conditions in which
the system is able to survive. More specifically, the w_{C_j}(t) are set to equalize the activation
on all j = 1..M BU conspicuity maps, taking only the N_j pixels over the threshold
ξ = 0.9 · Max(C_j^{BU}) into account:

w_{C_j}(t) = \frac{1}{\frac{1}{N_j} \sum_{\forall u,v \text{ with } C_j^{BU}(u,v) > \xi} C_j^{BU}(u, v)} \quad \text{and } \xi = 0.9 \cdot \mathrm{Max}(C_j^{BU}). \qquad (3.7)
Exponential smoothing is used to fuse the old conspicuity weights w_{C_j}(t-1) with the
newly optimized ones \hat{w}_{C_j}(t) from Equ. (3.7):

w_{C_j}(t) = \alpha \hat{w}_{C_j}(t) + (1 - \alpha) w_{C_j}(t-1) \quad \text{for } j = 1..M. \qquad (3.8)
The parameter α sets the velocity of the adaptation and could be adapted online depen-
dent on the gist (i.e., basic environmental situation) via a TD link. In case of fast changes
in the environment α could be set high for a brief interval, e.g., while passing a tunnel or
low in case the car stops. Additionally, we use thresholds for all M conspicuity maps based
on a sigma interval of recorded scene statistics to avoid complete adaptation to extreme
environmental situations.
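The adaptation loop of Equ. (3.7)/(3.8) can be sketched as follows; the guard against empty or all-zero maps is an addition for numerical safety only.

```python
import numpy as np

def update_conspicuity_weights(C_bu_maps, w_prev, alpha=0.05):
    """Homeostasis: equalize the mean activation of the strongest pixels
    across the M BU conspicuity maps (Equ. (3.7)) and fuse the result
    with the previous weights by exponential smoothing (Equ. (3.8))."""
    w_new = []
    for C in C_bu_maps:                 # one map per modality j = 1..M
        xi = 0.9 * C.max()
        strong = C[C > xi]              # the N_j pixels above the threshold
        mean_act = strong.mean() if strong.size else 1e-12
        w_new.append(1.0 / max(mean_act, 1e-12))
    w_new = np.asarray(w_new)
    return alpha * w_new + (1.0 - alpha) * w_prev
```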
Before combining the BU and TD saliency maps using the parameter λ (see Equ. (3.9)
and Fig. 3.2), a final normalization step takes place. Like the sub-feature and conspicuity
maps, the saliency maps are normalized to the maximum expected value:

S_{total} = \lambda S^{TD} + (1 - \lambda) S^{BU} \qquad (3.9)
For this we have to step back through the attention sub-system, taking into account
all weights (w_i^{sparse}, w_i^{BU}, w_i^{TD}) and the internal disjointness/conjointness of the features,
to determine the highest value (v_{max,j}^{BU} and v_{max,j}^{TD}) a single pixel can achieve in each BU
and TD conspicuity map j. We define a feature as internally disjoint (conjoint) when
the input image is decomposed without (with) redundancy in the sub-feature space. In
other words, the recombination of disjoint (conjoint) sub-feature maps of adjacent scales or
orientations is equal to (bigger than) the decomposed input image. Since DoG and Gabor
are designed to be internally disjoint between scales and orientations (see Chapter 2), the
maximum pixel value on a conspicuity map j is equal to the maximum of the product of
all sub-feature and/or sparseness weights of the sub-features it is composed of (w_i^{sparse} and
w_i^{BU} for BU as well as w_i^{TD} for TD). Motion is conjoint between scales; therefore we sum
up the product of all sub-feature motion weights w_i^{BU} and their corresponding w_i^{sparse} to
get the maximally expected value on the motion conspicuity map. The contribution of the
color feature to the saliency normalization weight is similar but more complex.
Since, apart from DoG and Gabor, there is disjointness between the conspicuity maps, the
maximum possible pixel values for all BU and TD conspicuity maps, calculated as described
above, are multiplied with the corresponding w_{C_j} and added to obtain the normalization
weights w_{norm}^{TD} and w_{norm}^{BU} for the TD and BU attention (please also refer to
Fig. 3.2 for the position where the normalization weights are applied):
w_{norm}^{BU} = \frac{1}{\sum_{j=1}^{M} k_j\, w_{C_j}\, v_{max,j}^{BU}} \qquad (3.10)

w_{norm}^{TD} = \frac{1}{\sum_{j=1}^{M} k_j\, w_{C_j}\, v_{max,j}^{TD}} \qquad (3.11)

with

k_j = \begin{cases} 0.5 & \text{for } j \in \{\text{DoG}, \text{Gabor}\} \\ 1 & \text{for } j \notin \{\text{DoG}, \text{Gabor}\}. \end{cases}
Using this approach, w_{norm}^{TD} will adapt when the TD weight set changes.
It is important to note that DoG and Gabor features are conjoint, meaning that they
represent the same signal characteristics. Put differently, the conspicuity maps for DoG and
Gabor are not independent. As discussed in Chapter 2, using both DoG and Gabor is still
helpful, since the signal decomposition is different for both filter types. The conjointness
is taken into account in the attention normalization procedure in Equ. (3.10) and (3.11)
in the form of the factor k_j that decreases the integral influence of DoG and Gabor on the
overall attention.
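Putting Equ. (3.9)-(3.11) together, the final combination can be sketched as below; v_max_td/v_max_bu are the per-modality maximum pixel values derived above, kj is the DoG/Gabor conjointness factor, and all argument names are illustrative.

```python
import numpy as np

def overall_saliency(S_td, S_bu, w_c, v_max_td, v_max_bu, kj, lam):
    """Normalize the TD and BU saliency maps to their maximally
    achievable pixel values (Equ. (3.10)/(3.11)) and combine them
    with the parameter lambda (Equ. (3.9))."""
    w_norm_td = 1.0 / np.sum(kj * w_c * v_max_td)
    w_norm_bu = 1.0 / np.sum(kj * w_c * v_max_bu)
    return lam * (w_norm_td * S_td) + (1.0 - lam) * (w_norm_bu * S_bu)
```

Since w_norm^TD depends on the current TD weight set, it has to be recomputed whenever the search object changes.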
3.4 Functional Comparison to Other Top-Down Attention Models
Given the abundance of computational attention models (see [Heinke and Humphreys,
2005] for a review) we selected the two related approaches of [Navalpakkam and Itti, 2005]
and [Frintrop, 2006] for a detailed structural comparison, since these impacted our work
most. Then, we summarize what makes our approach particularly appropriate for the
real-world car domain.
The system of [Navalpakkam and Itti, 2005] is based on the BU attention model
Neuromorphic Vision Toolkit (NVT) [Itti et al., 1998] but adds TD to the system. Each
feature map is normalized to its current maximum, resulting in a loss of information about
the absolute level of activity and a boosting of noise in case the activation is low. Given such
a normalization procedure and the object dependence of the TD weights, the BU and TD
saliency maps are not comparable, since the relative influence of the TD map varies when
the TD weight set is changed. Additionally, the BU and TD saliency maps are not weighted
separately for combination. As features a speed-optimized RGBY (leading to an inferior
separability performance), a DoG intensity feature and Gabor filter on 4 orientations (both
without on-off/off-on or line/edge separation) are used on 6 scales starting at a resolution
of 640x480. The system uses TD weights on all sub-feature maps resulting in 42 weights
that allow reasonable selectivity. A DoG-based normalization operator (see [Itti et al.,
1998]) is applied for pop-out support and to diminish the noise resulting from the used
feature normalization. However, the absolute map activation and therefore comparability
is lost.
The system of [Frintrop, 2006] integrates BU and TD attention and is real-time capable
(see [Frintrop et al., 2007]). It was evaluated mainly on indoor scenes. The system normal-
izes the features to their current maximum, resulting in the same problems as described
above. The BU and TD saliency maps are weighted separately for combination. Following
the argumentation above the used normalization makes these combination weights depen-
dent on the used TD weight set and thereby object-dependent. As features the system uses
double color opponency based on an efficient RGBY color space implementation, a DoG
intensity feature (with on-off/off-on separation), and a Gabor with 4 orientations starting
from 300x300 resolution. A total of 13 TD weights are used on feature (integrated over all
scales) and conspicuity maps. For pop-out support a uniqueness operator is used.
The most important differences between the systems are the following: We obtain high selectivity
by decomposing the DoG (on-off/off-on separation) and Gabor (on-off/off-on separation,
lines and edges) features without increasing the calculation time. Furthermore, the usage
of TD weights on all sub-feature maps and scales results in 136 independent tunable feature
weights that increase the selectivity. The resulting scale variance of the TD weights is not
a crucial issue in the car domain. The RGBY is used as color and double color opponency.
In contrast to [Frintrop, 2006, Navalpakkam and Itti, 2005], we use motion to support
scene dynamics. All sub-feature maps and the BU and TD saliency maps are normalized
without losing information or boosting noise, thereby preventing false-
positive detections. Comparability of modalities is assured via homeostasis. The attention
sub-system works on 5 scales starting at a resolution of 256x256. Experiments have shown
that in the car domain bigger image sizes do not improve the attention system performance.
Our system supports conjunction of weak features since the sparseness operator is not
used in the TD path. Illumination invariance is reached by image region-specific exposure
control that is coupled tightly to the system.
3.5 Experiments and Results
In the following, we evaluate the system properties related to the challenges described in
Sect. 3.2. All results are calculated on five real-world data sets (cars, reflection poles,
construction site, inner-city stream, toys in an indoor scene) accessible on the internet (see
[BenchmarkData, 2008a]).
(1) High feature selectivity: In the car domain the search performance is strongly
influenced by the horizon edge present in most images of highways and country roads.
This serves as an example problem for showing the importance of high feature selectivity.
Typically, the horizon edge is removed by masking out the sky in the input image, which
might not be biologically plausible. Based on the high selectivity of the attention features,
we instead suppress the horizon edge directly in the saliency by weighting the sub-feature
maps. The gain of this approach is depicted in Fig. 3.5, which shows the diminished influence
of the horizon edge on the (TD modified) BU saliency of the real-world example in Fig. 3.6b.
For evaluation the average FoA hit number (Hit) and the average detection rate (DRate) were
calculated. While DRate is the ratio of the number of found task-relevant objects to the
overall number of task-relevant objects, Hit states that the object was found on average
with the Hit'th generated FoA. Hence, the smaller Hit, the earlier an object is detected
(see [Frintrop, 2006] for a more detailed definition of these measures). Table 3.1 shows the
significant performance gain of attentional sky suppression versus no horizon edge handling
and masking of the sky based on these measures.
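For illustration, the two measures can be computed from per-image FoA sequences as sketched below, assuming for simplicity one task-relevant object per test image (the thesis counts all task-relevant objects):

```python
def hit_and_drate(foa_hits):
    """foa_hits: per test image, a list of booleans in FoA generation
    order, True where the FoA lies on the searched object."""
    ranks, found = [], 0
    for seq in foa_hits:
        for rank, on_target in enumerate(seq, start=1):
            if on_target:              # object found with the rank'th FoA
                ranks.append(rank)
                found += 1
                break
    hit = sum(ranks) / len(ranks) if ranks else float("nan")
    drate = found / len(foa_hits)
    return hit, drate
```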
Figure 3.5: Evaluation of selectivity (based on the input image depicted in Fig. 3.6b): (a) Original BU saliency, (b) Modified BU saliency with attentional sky suppression (TD influence), using suppressive odd Gabor filter kernels (e.g., 0° off-on, 45° off-on, 135° on-off) in low scales, (c) BU saliency, masked sky (standard method).
Table 3.1: Benefit of attentional sky suppression on real-world data.

Search target   # test images   a) original BU   b) attentional sky supp.   c) sky masked
                                Hit (DRate)      Hit (DRate)                Hit (DRate)
Cars            54              3.06 (56.3%)     2.19 (71.4%)               2.47 (71.4%)
(2) Comparable TD and BU saliency maps: The used feature normalization prevents
noise on the saliency map and ensures the preservation of the absolute level of
feature activation. Using a TD weight set that supports certain object-specific features,
our normalization hence ensures that the TD map will show high activation if and only if
the searched object is really present. Figure 3.6f shows that the maximum saliency value
on the TD saliency map for cars rises when the car comes into view (see [BenchmarkData,
2008a] for a downloadable result stream).
The influence of combining the now comparable TD and BU saliency maps for cars and
reflection posts (reflection posts are, e.g., useful for unmarked road detection as done in
[von Trzebiatowski et al., 2004]) as trained search objects is depicted in Tab. 3.2, showing
that TD improves the search performance considerably. It is important to note that besides
an exchange of the training images no modification of the system structure is required
when changing the search object. For evaluation the average FoA hit number (Hit) and
average detection rate (DRate) were calculated. The choice of training images has only
small influence on the search performance, as the comparable results for different sets of
training images in Tab. 3.2 show.
The evaluation shows the best hit numbers and the highest detection rates for pure TD search
(λ = 1). However, it is important to note that pure TD search would lead to a suppression
of unexpected objects (inattentional blindness, see Sect. 3.1) and would hence potentially
cause dangerous situations.
Figure 3.6: Evaluation of normalization: (a), (b) Input images (frame 1 and frame 45), (c) TD saliency tuned to cars (frame 45), (d) TD saliency tuned to signal boards (frame 45), (e) TD saliency tuned to cars on frame 1 (noise, since no car is present), (f) Maximum saliency activation level on the BU, TD car, and TD signal board maps, (g) Dynamically adapted conspicuity weights w_{C_j} (homeostasis) for the M = 7 modalities.
Table 3.2: Linear combination of BU and TD saliency, influence on search performance (λ = 0 equals pure BU and λ = 1 pure TD search); the BU column is independent of the training set.

Target           # Test images   Training set       λ = 0 (BU)     λ = 0.5 (BU & TD)   λ = 1 (TD)
                 (objects)       (# training im)    Hit (DRate)    Hit (DRate)         Hit (DRate)
Cars             54 (58)         self test          3.06 (56.9%)   1.56 (93.1%)        1.53 (100%)
                                 Train. set 1 (3)                  1.87 (89.7%)        1.82 (96.6%)
                                 Train. set 2 (2)                  1.90 (84.5%)        1.76 (93.1%)
                                 Train. set 3 (3)                  1.96 (82.8%)        1.94 (93.1%)
                                 Train. set 4 (3)                  1.84 (86.2%)        1.74 (93.1%)
Reflect. posts   56 (113)        self test          2.97 (33.6%)   1.78 (59.8%)        1.85 (66.3%)
                                 Train. set 1 (6)                  2.10 (51.3%)        2.25 (52.2%)
                                 Train. set 2 (7)                  2.20 (51.3%)        2.28 (51.3%)
                                 Train. set 3 (7)                  2.07 (51.3%)        2.36 (52.2%)
                                 Train. set 4 (5)                  2.10 (51.3%)        2.30 (51.3%)
(3) Comparability of modalities: The used dynamic adaptation of w_{C_j} (homeostasis,
see Equation (3.8)) causes a twofold performance gain. First, the a priori incomparable
modalities can be combined, yielding well balanced BU and TD saliency maps. Second,
the system adapts to the dynamics of the environment, preventing varying modalities from
influencing the system performance (e.g., with this procedure the R color channel will not
be overrepresented in the saliency in the red evening sun). Figure 3.6g depicts the
dynamically adapted w_{C_j}. Table 3.3 shows a noticeable SNR gain on the overall saliency for
26 traffic-relevant objects (e.g., traffic lights, road signs, cars), comparing the dynamically
adapted w_{C_j} vector with a static w_{C_j} vector that was locally optimized on the stream.
Table 3.3: Comparability of modalities via homeostasis.

Traffic-relevant objects   # images (objects)   SNR_obj using static w_{C_j}   SNR_obj using dynamic w_{C_j}
Inner-city stream          20 (26)              2.56                           2.86 (+11.7%)
(4) Support of conjunctions of weak object features in the TD path is assured,
since w_i^sparse is used in BU only. Evaluation on 54 images with cars as TD search object
shows that the average object signal to noise ratio (SNR_obj) on the TD saliency map
(defined as the mean activation in the object versus its surround) decreases by 9% when
w_i^sparse is also used in the TD path. For evaluation we define weak object feature maps as
having the current maximum outside the object region but still having object values of at
least 60% of the maximum within the object. For the used 54 traffic scene images, 11% of
all feature maps are weak. In case weak feature maps are used to optimally support the
TD saliency in an excitatory way, SNR_obj on the TD saliency map increases by 25%. The
results are aggregated in Tab. 3.4. Figure 3.7a shows that the number of excitatory TD
weights w_i^TD decreases the bigger K_conj (see Equ. (3.4)) is. An object-dependent trade-off
exists, since the TD saliency map gets sparser the bigger K_conj is.
Table 3.4: Improvement of SNR due to support of weak feature conjunctions.

TD search target   # test images   SNR_obj with w_i^sparse   SNR_obj without w_i^sparse   SNR_obj with optimal weak feat. excitation
Cars               54              6.72                      7.32 (+9%)                   8.41 (+25%)
(5) Changing lighting conditions: The feature activation of an image region depends
on the illumination. Hence, the TD weight set is only optimal for the lighting conditions
present in the training images, and the TD search performance decreases when the illumination
changes without an adaptation of the camera exposure. It is important to note that in
a real-world scene the optimal exposure in varying illumination is different for each object
(see Fig. 3.7b and c), making the exposure control dependent on the current task of the
system. Evaluation based on a complex indoor test setting, where the illumination could
be controlled, shows that the realized exposure control leads to an illumination invariance
of the TD weight sets (see Tab. 3.5).
Table 3.5: Illumination invariance of TD weight sets due to dedicated exposure control. Target: toys in a complex indoor setup, 20 test images (20 objects), training illumination 75 lx; average hit number (and detection rate [%]), TD search λ = 1.

                  Training illum.   without exposure control    with exposure control
                  75 lx             150 lx        15 lx         150 lx        15 lx
Hit (DRate)       1.95 (100%)       2.74 (95%)    2.83 (30%)    1.80 (100%)   2.0 (100%)
3.6 Summary
Chapter 3 describes a flexible biologically motivated attention system that is used as the
front-end of our ADAS. The 136 feature maps described in Chapter 2 are independently
weighted and combined to the so-called saliency map that is the key aspect and resulting
output of the attention system. The amplitude of the saliency map (i.e., its activation in
neurobiological terms) encodes the conspicuousness of an image region. A high saliency
Figure 3.7: Evaluation of illumination influence: (a) Number of excitatory TD weights depending on the feature preprocessing parameter K_conj, (b) Image regions used for exposure optimization (whole image, lower half, and car), (c) Energy function: accumulated entropy T_sum with object-dependent optima for the regions whole image, lower half, and car.
value can result 1) from an object that visually differs strongly from its surroundings
(sensory-driven or bottom-up attention) or 2) from an object that matches the current
search task (goal-driven or top-down attention). Both the bottom-up (BU) and top-down
(TD) attention weight and combine the implemented features. For both attention types
the feature maps are normalized to their potential maximum (not the current maximum)
in order to assure a comparability of feature maps of the same modality. However, the
feature post-processing for both attention types also differs in certain aspects. For the
BU attention a sparseness weight is applied that boosts feature maps having a strong,
locally restricted maximum. For the TD attention no sparseness weight is applied in
order to assure that feature maps without a clear maximum (so-called weak feature maps)
can contribute to the TD attention. All feature maps are weighted with object-specific
TD weights that are computed based on a weight calculation scheme that evaluates the
feature characteristics of training images showing the searched object. In a nutshell, the
weight calculation scheme boosts feature maps that are typical for the searched object and
suppresses feature maps not compatible with the searched object.
The different sub-feature maps are independently weighted for the BU and TD case and
combined to so-called conspicuity maps, which represent the different modalities of the
system (e.g., colors, motion, lines). It is important to note that the conspicuity maps are
a priori not comparable. Based on the biologically motivated principle of homeostasis the
conspicuity maps are normalized in order to get them comparable before their combination
to the BU respectively TD saliency map. In the last step the BU and TD pathways are
combined, which requires the saliency maps to be comparable. However, the absolute value
of the TD saliency map depends on the TD weight set and thereby on the searched objects.
Therefore, a normalization procedure for the BU and TD saliency maps is introduced that
preserves the information present in the saliency map amplitude.
A last robustness-enhancing approach handles the problem that the TD weight sets are
optimal only for the illumination present in the training images. In case a different lighting
situation is present in the training images than in the test images, the performance
of the TD search suffers. This means that a priori the TD weight sets are not invariant
to illumination changes. Instead of adapting the TD weight sets dependent on the
illumination, an image region-specific exposure control is proposed. When applying said
exposure control, extensive testing showed that the TD search performance and hence the
TD weight sets are independent of illumination changes.
The introduction of several new approaches into the attention system allows for robustly
coping with the real-world requirements of the car domain. More specifically, the following
robustness-related novelties were introduced in Chapter 3:

• An attention system relying on high feature selectivity based on 136 independently tunable feature maps,

• A sub-feature normalization procedure that assures the comparability of BU and TD attention without losing information about the absolute signal amplitude,

• A biologically motivated homeostasis approach for making diverse modalities comparable,

• Support of weak feature conjunctions in TD search mode,

• An image region-specific exposure control that assures the illumination invariance of the TD weight sets.
In the following Chapter 4 a robust approach for unmarked road detection is described
that in combination with the proposed attention system allows building complex driver
assistance systems presented in Chapter 5. Additionally, in Chapter 5 the real-time capa-
bility of our attention system in a real-world test setup will be shown. More specifically,
a test setup will be described, in which our prototype car was reliably able to brake au-
tonomously in an emergency situation (see [Michalke et al., 2007]).
4 Road Detection in Unconstrained Environments
The importance of driver assistance systems for further decreasing the number of traffic
accidents is a widely acknowledged fact. The growing complexity of the tasks these
Advanced Driver Assistance Systems have to handle leads to complex systems that fuse
information from many sensory devices and incorporate the processing results of multiple
other modules. One important field of interest for such systems are applications like, e.g.,
the “Honda Intelligent Driver Support System” [Ikegaya et al., 1998], which supports the driver
in staying in the lane and maintaining a safe distance from the car in front. Other systems focus
on collision avoidance based on autonomous steering and braking (see, e.g., [Schorn et al.,
2006]) as well as path-planning even in unstructured environments (see, e.g., [Dang et al.,
2006]). All these applications need a robust detection of the drivable road area. The more
safety-relevant applications become, the more the required quality of the detected drivable
road area must be improved. As “drivable road area” we define the space in front, which
the car can move on safely in a physical sense, but without taking symbolic information
into account (e.g., one-way-street, traffic signs).
First vision-based approaches for detecting the drivable road area on unmarked streets
were introduced in recent years. Although most of these visual-feature-based approaches
show sound results in scenarios of limited complexity, they seem to lack the necessary
system-inherent flexibility to run in complex environments under changing lighting con-
ditions. To cope with such environments, in Sect. 4.1 we introduce an architecture for
robust unmarked road detection. The system relies on four novel approaches that permit
the autonomous adaptation of important system parameters to the environment. As the
presented results show, the approach allows for robust road detection on unmarked inner-
city streets without manual tuning of internal parameters. This is different from most
approaches in literature that rely on strong rigid road models and offline set parameters.
In order to further stabilize the gathered results, in Sect. 4.2 a novel, generic approach for
improving unmarked road detection systems by temporal integration is proposed.
4.1 Adaptive Multi-Cue Fusion for Detecting Unmarked Roads in Inner-City
In this section, a robust system approach for detecting the drivable area on unmarked
roads is presented. Based on four novel techniques, which extend known unmarked road
detection approaches, the proposed system reliably detects the road in complex scenarios
by autonomously adapting its internal parameters. As evaluation on inner-city streams
shows, the presented techniques are an important step toward more generic and robust
driving-path detection for unmarked roads. Unlike other approaches, no scene-dependent
manual adaptation of system parameters is required. The input images used for the evalu-
ation, corresponding ground truth data, and a result stream are accessible on the internet
[BenchmarkData, 2009a].
4.1.1 Related Work
Initial approaches for lane detection on marked roads date back to the 1990s (see [Broggi,
1995] for an overview of the early approaches). These systems, commercially available
to date, are restricted to marked roads with a predictable course, based on a clothoid lane
model that is also used in the road construction of motorways. In recent years, the focus
of road detection research has shifted to unmarked country roads and inner-city streets.
To this end, current prototype systems evaluate and fuse different visual features. In the
following, the structure of such visual-feature-based systems is analyzed. It is shown that
despite the large number of existing road detection systems, some important techniques for
increasing road detection robustness have not been considered so far.
Image training regions: Current approaches for road detection often use street train-
ing regions in front of the car in order to parameterize the probability distributions that
describe the road feature characteristics (e.g., [Rotaru et al., 2004, Soquet et al., 2007], see
also Fig. 4.4). Only very few approaches partially incorporate information of non-road im-
age regions to improve road detection (e.g., [Apostoloff and Zelinsky, 2003, Franke et al.,
2007]). However, to our knowledge no approach uses the full potential of non-road informa-
tion, e.g., for the autonomous adaptation of internal system parameters and the dynamic
online assessment of the cue quality, as it is done in our system.
Features: Typical visual features for road detection in state-of-the-art systems are:
texture (edge density) on the intensity map [Franke et al., 2007, Hong et al., 2002, Sha et al.,
2007], stereo disparity [Lombardi et al., 2005, Soquet et al., 2007], HSI color [Franke et al.,
2007, Lin and Chen, 1991, Rotaru et al., 2004, Soquet et al., 2007], or depth from Lidar /
Radar [Dahlkamp et al., 2006, Rasmussen, 2002]. Many system approaches use the edge
density (structure) feature on the intensity map. However, edge density on further feature
maps has so far not been considered. To our knowledge, no approach uses the edge density
on color maps for road detection. During the evaluation of our system, we found the
edge density computed on color maps to be a robust cue for detecting the road.
Feature granularity: Numerous system approaches rely on probabilistic methods for
classifying street and non-street pixels (e.g., [Apostoloff and Zelinsky, 2003, Aufrere et al.,
2004, Franke et al., 2007, Ramstroem and Christensen, 2005, Smuda et al., 2006]). Such
iconic (i.e., pixel-based) approaches do not include information on the neighborhood of
a pixel, but handle all pixels independently. Nevertheless, discontinuities in the feature
maps often contain important information that allows an improved scene decomposition
(e.g., curbstones that separate the road from the sidewalk). Other approaches stress the
importance of region-based information and use region growing or vertical filling (e.g.,
[Chern and Cheng, 2003, Mateus et al., 2005, Rotaru et al., 2004]). Such approaches are
often sensitive to changing lighting conditions that cause large gradients in the feature maps
(e.g., shadows on the road). Both the iconic and the region-based system approaches
have important advantages that partially compensate for their respective drawbacks. However,
to our knowledge no system approach for road detection uses both to the same extent.
Road modeling: Many of the recent feature-based systems use road models
of varying complexity that support the feature-based road detection (e.g.,
[Dickmanns and Mysliwetz, 1992, Franke et al., 2007, Ramstroem and Christensen, 2005]
use clothoids, [Lombardi et al., 2005] distinguishes between left, right, and straight street
course, [Sotelo et al., 2004] uses second order polynomials). For country roads and highways
such approaches seem to yield sound results. Nevertheless, as further discussed in
Sect. 4.1.2, we claim that such rigid street models are not flexible enough to run robustly
on inner-city streets, which often show abrupt changes in their course as well as occlusions
of significant parts of the drivable road area. However, some kind of road model seems to
be necessary in order to improve the robustness of the road detection. This dilemma can be
resolved by relying on a generic and flexible road model that makes only simple assumptions
about the course of the road. One of the few system approaches that follows this idea
is presented in [Lin and Chen, 1991]. The authors point out that the road area typically
covers between 30 and 85% of the image. The feature thresholds are adapted in order to
reach this ratio. Unfortunately, the proposed approach is restricted in its flexibility, since
the ratio is set offline without constantly adapting it to match the current characteristics
of the scene.
To sum up, existing state-of-the-art road detection systems are marked by a limited
flexibility, which restricts their application to country roads and highways. In order to
allow reliable road detection in more complex inner-city scenarios, we propose four novel
techniques that enhance robustness and system-inherent flexibility by enabling adaptation
to the environment. To our knowledge, a combination of these techniques has not been
used for road detection before.
In detail, these techniques are:
- Using street and non-street training regions (see Fig. 4.4) that both adapt the feature probability distributions,
- Using an edge density (structure) feature computed on the HSI hue and saturation maps,
- Combining iconic and region-based feature processing,
- Fusing feature-based road detection with a dynamic and generic road model.
In the following section, details about our road detection system embedding these four
techniques are given. The presented system approach is not restricted to inner-city streets,
but was tested on country roads and highways as well.
4.1.2 System Description
In the following, the realized system architecture for unmarked road detection is described
(see Fig. 4.1). It relies on our four novel techniques that enhance the system-inherent
flexibility. After giving a rough overview on the individual processing steps, all system
modules are described in detail.
Our system takes RGB input images, stereo disparity (from two parallel cameras), and
Radar data as input. Knowledge about previously detected objects in the scene can be
used as optional input. The system detects the road based on six robust features that are
evaluated and fused in a probabilistic way. For this step, street and non-street training
regions are defined in the input image. In parallel, the system detects present lane mark-
ings with a biologically motivated filter approach. The lane markings are fused with the
detected road segments. In the final step, a binary road map is computed relying on a
road model that adapts itself to the environment.
Next, the system is described in more detail. In the first step, different features are
calculated on the 400x300 pixel RGB input images. The features we use are saturation
and hue of the HSI color space (see, e.g., [Jaehne, 2005]). Furthermore, we apply the
structure tensor in Equ. (4.1) (with W being a 9x9 region around the current pixel) to
compute the edge density Ej (see Equ. (4.2)) on the hue, saturation, and intensity of the
HSI color space:
$$A_j(u, v) = \begin{bmatrix} \sum_W (G_u * F_j)^2 & \sum_W (G_u * F_j)(G_v * F_j) \\ \sum_W (G_v * F_j)(G_u * F_j) & \sum_W (G_v * F_j)^2 \end{bmatrix} \quad (4.1)$$

with $j \in \{\text{hue}, \text{saturation}, \text{intensity}\}$ and

$$G_u(u, v) = -\frac{u}{2\pi\sigma^4}\exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right)$$

$$G_v(u, v) = -\frac{v}{2\pi\sigma^4}\exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right)$$

$$E_j(u, v) = \frac{\det(A_j(u, v))}{\operatorname{trace}(A_j(u, v))}. \quad (4.2)$$
Typically, the edge density computed on these feature channels is different for the road
and the rest of the scene, which makes it a reliable feature.
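For illustration, a minimal Python sketch of Equ. (4.1) and (4.2) is given below. It is not the thesis implementation (which is realized in C with the Intel IPP), and the kernel size and σ are assumed example values; the summation over W is realized by a mean filter, which scales E_j only by a constant factor.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def edge_density(F, sigma=1.0, ksize=9, win=9):
    """Edge density E_j = det(A_j)/trace(A_j) of a feature map F (Equ. 4.1, 4.2)."""
    r = ksize // 2
    u, v = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    g = np.exp(-(u**2 + v**2) / (2.0 * sigma**2))
    Gu = -u / (2.0 * np.pi * sigma**4) * g          # Gaussian derivative kernel G_u
    Gv = -v / (2.0 * np.pi * sigma**4) * g          # Gaussian derivative kernel G_v
    Fu = convolve(F.astype(float), Gu)              # G_u * F_j
    Fv = convolve(F.astype(float), Gv)              # G_v * F_j
    # Structure tensor entries of Equ. (4.1); the mean over the window W
    # is proportional to the sum used in the thesis.
    Auu = uniform_filter(Fu * Fu, win)
    Avv = uniform_filter(Fv * Fv, win)
    Auv = uniform_filter(Fu * Fv, win)
    det = Auu * Avv - Auv**2
    trace = Auu + Avv
    return np.where(trace > 1e-12, det / np.maximum(trace, 1e-12), 0.0)  # Equ. (4.2)
```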
Furthermore, vision-based stereo data is used as feature. For computing stereo vision,
the camera images are rectified in order to facilitate the correspondence search between
the two camera images (i.e., the images are remapped, virtually aligning the two camera
coordinate systems with the world coordinate system). The intrinsic (i.e., internal camera
properties, like the focal length and the principal point) and extrinsic (i.e., external camera
properties, like camera angles and offsets) camera parameters required for this step were
determined using the freely available calibration toolbox [J.Y.Bouguet, 2007]. The toolbox
was applied on a calibration scene similar to the one described in [Marita et al., 2007] (see
also Sect. 2.2.2 for details on the computation of stereo disparity). There is no dynamic
change of the camera pitch angle, since on the one hand the input images are pitch-
corrected using a correlation-based method similar to [Broggi and Grisleri, 2005]. On the
other hand, we assume a flat road, which is present in most inner-city environments.
When using the system in an urban environment, the course of the road and hence the
camera angles could be estimated using a surface model (e.g., a hyperplane; please refer to
[Michalke et al., 2008c] for details).
Figure 4.1: System overview: Adaptive road detection system (red modules contain novel techniques).

The image rectification assures that the camera angles
(including the static pitch angle) will not influence the stereo results. The correspondence
search yields a disparity map. Based on the disparity map three dense maps containing
the 3D-world positions for all image pixels can be obtained (see Fig. 4.5). The stereo data
is remapped using the measured camera angles in order to have the stereo maps and the
image comparable in terms of the pixel position of objects.
The stereo maps are postprocessed for solving the problem of missing disparity values
near to the car (see Fig. 4.2b). More specifically, during the computation of the stereo
disparity no correspondence search is performed in image regions near to the car, since
this would come at the cost of a high computation time. We solve this problem by searching
line-wise for high horizontal gradients in the bird's eye view of the camera image (for
information on this representation see [Broggi, 1995]), taking only the area directly in front
of the car (e.g., the first 10 meters) into account (see the example in Fig. 4.2a). Based on
Radar data and the low vertical gradients in the bird's eye view, it is assured that no
objects are present in this area. The area between the found gradients, which mark the
road borders, is assumed to be road. The image regions in the bird's eye view representation
are mapped to the perspective image with a pin-hole camera model (see Annex A.3), which
includes the determined intrinsic and extrinsic camera parameters (e.g., static camera
angles). Based on the perspectively mapped road regions, the three stereo maps are
corrected assuming a perfectly flat plane (see the resulting corrected depth map in Fig. 4.2c).
Since only the region directly in front of the car is corrected, the error induced by a non-flat
road plane can be considered small. To eliminate even this error, the estimated camera
angles coming from the optional surface model could be included into the pin-hole camera
model.
Tests have shown that large shadows on the road result in poor stereo quality, since
the correspondence search gets difficult in dark, noisy image regions. This supports the use
of additional cues that are to some extent invariant to shadows, as done in the presented
system (e.g., the HSI color space). Altogether, our system relies on six different cues for
road detection (see Tab. 4.1 for an overview).
Table 4.1: Used visual features for unmarked road detection.
MODALITY | Cue # | VISUAL ROAD DETECTION FEATURE
Color | 1 | Hue
Color | 2 | Saturation
Structure | 3 | Edge density on Hue
Structure | 4 | Edge density on Saturation
Structure | 5 | Edge density on Intensity
Stereo | 6 | Height of objects in scene
Figure 4.2: (a) Gradient-based road search on the bird's eye view of the image depicted in Fig. 4.4, (b) Missing disparity values near to the camera vehicle induce false and missing depth values, (c) Corrected depth map.

Figure 4.3: Structuring element for region growing for (a) the left image half, (b) the right image half.

In the second step, binary road maps (BRM) and road probability maps (RPM) for
the six feature maps are computed. The BRMs are binary maps that hold “1” for pixels
belonging to the detected street and zero for the rest. The six BRMs are calculated with
a region-growing algorithm, by which region-related feature properties are incorporated.
As opposed to that, the six RPMs contain continuous probability values that assess the
"road-likeness" of the feature values for all pixels independently. Both map types rely on the same
normal distribution, see Equ. (4.13) and (4.14). The parameters of the normal distribution
are calculated using a street and at least 2 non-street training regions (see Fig. 4.4). Please
note that the training region needs to be set beyond the regions of corrected height values
(see Fig. 4.2c). The training regions are adapted dynamically depending on the scene.
For example, it is assured that no obstacle is within the training region by incorporating
Radar data.

Figure 4.4: Visualization of street and non-street training regions.

Furthermore, the size of the street training region is set proportionally to the
velocity of the ego vehicle, to exploit the fact that typically no near obstacles exist during
fast driving, e.g., on highways. The street and non-street training regions are chosen by
considering the height map of the scene derived from the stereo disparity map (see Fig. 4.5)
and existing knowledge about objects in the scene. In the following, the computation of
the BRMs and RPMs is described in more detail.
Figure 4.5: Dense 3D-world position for all image pixels based on stereo vision. The X, Y,
and Z-maps contain dense 3D-world data of the scene in image coordinates. The X-map codes
the horizontal world position, the Y-map the vertical world position (object height), and the
Z-map the depth.
For computing the BRMs, a region-growing algorithm that connects continuous regions
in the feature maps is applied (i.e., the neighborhood of a pixel is evaluated). This
is done in order to get crisp borders between the road and the sidewalks, which
often have road-like features. The region growing uses two different structuring elements
for the left and right half of the image (see Fig. 4.3), which is motivated by the typical
course of roads in a perspective image. The region-growing algorithm recursively sets all
pixels that are adjacent to the currently known street segment in BRMi to "1", when the
corresponding pixels in feature map i are within the confidence interval (see Equ. (4.3)
and Equ. (4.4)).
$$\bar{x}_i - \beta_i\sigma_i < x_i < \bar{x}_i + \beta_i\sigma_i \quad \forall i = 1..5 \quad (4.3)$$

with $\beta_i = 4\,d_i(H_{s_i}, H_{n_i}) \quad \forall i = 1..6$

$$\bar{x}_6 - \varepsilon_Y(v) < x_6 < \bar{x}_6 + \varepsilon_Y(v) \quad (4.4)$$

$$\text{with} \quad \varepsilon_Y(v) = \beta_6\left(\sigma_6 - \sigma_q(v_{\text{train}})\right) + \sigma_q(v) \quad (4.5)$$

$$d_i(H_{s_i}, H_{n_i}) = \sqrt{1 - \gamma_i(H_{s_i}, H_{n_i})} \quad \forall i = 1..6 \quad (4.6)$$

$$\gamma_i(H_{s_i}, H_{n_i}) = \sum_{\forall x} \sqrt{H_{s_i}(x)\,H_{n_i}(x)} \quad \forall i = 1..6 \quad (4.7)$$
The region-growing algorithm starts from the road training region. The normal-distribution-based
confidence interval in Equ. (4.3) uses the feature thresholds $\bar{x}_i \pm \beta_i\sigma_i$,
which are independently calculated for all five visual features. Here, the parameter
$\bar{x}_i$ is the mean and $\sigma_i$ the standard deviation of the normal distribution calculated on the
street training region. The parameter $\beta_i$ is introduced in order to adapt the confidence
interval to the current scene properties. Different from $\bar{x}_i$ and $\sigma_i$, which are calculated on
the street training region alone, the threshold parameter $\beta_i$ changes dynamically depending
on the characteristics of the street and non-street training regions. More specifically, the
parameter $\beta_i$, which influences the feature thresholds, is calculated from $d_i$ (see Equ. (4.6)).
The parameter $d_i$ is the distance between the two histograms $H_{s_i}$ and $H_{n_i}$ of the street
and non-street training regions for the $i = 1..6$ features. The measure $d_i$ is based on the
Bhattacharyya coefficient $\gamma_i(H_{s_i}, H_{n_i})$ (see Equ. (4.7)), which assesses the similarity of two
histograms. Based on $\beta_i$ the confidence interval is adapted (see Equ. (4.3)). The larger the
difference between the street and the non-street areas on a feature map is, the bigger
the confidence interval becomes.
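The adaptation of $\beta_i$ can be sketched in a few lines; bin count and value range below are assumed parameters, not values from the thesis.

```python
import numpy as np

def beta_from_regions(street_vals, nonstreet_vals, bins=32, rng=(0.0, 1.0)):
    """Confidence-interval factor beta_i = 4 * d_i derived from the
    Bhattacharyya coefficient of the street/non-street histograms
    (Equ. 4.6 and 4.7)."""
    Hs, _ = np.histogram(street_vals, bins=bins, range=rng)
    Hn, _ = np.histogram(nonstreet_vals, bins=bins, range=rng)
    Hs = Hs / Hs.sum()                     # normalize to probability masses
    Hn = Hn / Hn.sum()
    gamma = np.sum(np.sqrt(Hs * Hn))       # Bhattacharyya coefficient (4.7)
    d = np.sqrt(max(0.0, 1.0 - gamma))     # histogram distance (4.6)
    return 4.0 * d                         # beta_i used in Equ. (4.3)
```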
Different from the five visual cues (hue, saturation, and the three edge density maps),
the normal distribution of the stereo height Y also depends on the measured distance to
the car. This is empirically plausible, since Y is a function of the stereo disparity D(u, v),
and the relative influence of the quantization error of D(u, v) (measured in pixels) grows
as D(u, v) becomes smaller and hence the distance of a road segment to the car becomes larger.
Hence, the part σq of the standard deviation of the stereo height cue that is induced by
the quantization error of D(u, v) increases with growing distance to the car. In order to
mathematically assess the error propagation of the quantization error of disparity D(u, v)
to the stereo height Y their functional relation is required. The stereo height x6 = Y
can be computed using Equ. (4.8) (with B as the horizontal distance between the stereo
cameras, h the camera height, v the vertical pixel position and v0 the vertical principal
point of the camera).
$$x_6 = Y = \frac{B \cdot (v - v_0)}{D(u, v)} - h \quad (4.8)$$

$$D_{\text{surf}}(v) = \frac{B \cdot (v - v_0)}{h} \quad (4.9)$$

$$\sigma_D = \frac{\Delta g}{\sqrt{12}} = \frac{1}{\sqrt{12}} \quad (4.10)$$

$$\sigma_q(v) \approx \sigma_D \left| \frac{dY}{dD} \right|_{D(u,v) = D_{\text{surf}}(v)} \approx \sigma_D \left| \frac{-B \cdot (v - v_0)}{[D_{\text{surf}}(v)]^2} \right| \quad (4.11)$$

$$\sigma_q(v) \approx \frac{1}{\sqrt{12}} \left| \frac{h^2}{B \cdot (v - v_0)} \right| \quad (4.12)$$
Equation (4.10) defines the standard deviation $\sigma_D$ of the disparity (measured in pixels),
which is induced by the quantization error (the step size $\Delta g$ is set to 1 pixel). For computing
the propagated standard deviation $\sigma_q$ (required in Equ. (4.5)), we use Equ. (4.11) (refer to
[Jaehne, 2005]), which describes how the standard deviation of a random variable (here the
disparity D(u, v)) is propagated through a function (here Y(D)). We are interested in the
disparity on the road surface $D_{\text{surf}}$ alone (see Equ. (4.9), obtained by rearranging Equ. (4.8)
with Y = 0). Hence, $D_{\text{surf}}$ defines the position at which Equ. (4.11) is linearized. Here,
the vertical pixel position v is a parameter of the distribution. For the quantization-error-induced
standard deviation of the height cue Y, we finally find Equ. (4.12). The
hyperbolic form of Equ. (4.12) confirms the empirical assumptions made above. Based on that,
the confidence interval $\varepsilon_Y$ for the stereo height Y (see Equ. (4.4)) includes the standard
deviation $\sigma_q(v)$, which is adapted depending on the current vertical image position v of the
pixel in focus (see Equ. (4.5)). Besides adding $\sigma_q(v)$ in Equ. (4.5), the standard
deviation $\sigma_6$, computed on the training region of the Y map, needs to be corrected by
$\sigma_q(v_{\text{train}})$ present at the vertical image position $v_{\text{train}}$ of the training region. As a result, we
now have six BRMs for the six features.
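Equation (4.12) is easy to evaluate numerically; baseline, camera height, and principal point below are assumed example values, not our calibration.

```python
import numpy as np

def sigma_q(v, B=0.3, h=1.2, v0=150.0):
    """Quantization-induced std. dev. of the stereo height cue (Equ. 4.12);
    B: baseline [m], h: camera height [m], v0: vertical principal point [px]."""
    return (1.0 / np.sqrt(12.0)) * np.abs(h**2 / (B * (v - v0)))

# The uncertainty grows hyperbolically toward the horizon (v -> v0):
print(sigma_q(np.array([290.0, 200.0, 160.0])))   # near, mid, far image rows
```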
In addition to the region-based processing for calculating the BRMs, a pixel-based
(iconic) processing for computing the RPMs is done (i.e., each pixel is handled independently
of its surround). All pixel values $x_i$ receive a probability value $p(x_i)$, which
results in six independent Road Probability Maps (RPMs) for the six features:
$$p(x_i) = e^{-\frac{(x_i - \bar{x}_i)^2}{2\sigma_i^2}} \quad \forall i = 1..5 \quad (4.13)$$

$$p(x_6) = e^{-\frac{x_6^2}{2\left[\sigma_6 - \sigma_q(v_{\text{train}}) + \sigma_q(v)\right]^2}}. \quad (4.14)$$
The probability distribution for the stereo-based height cue Y (see Equ. (4.14)) assumes
a mean height of zero ($\bar{x}_6 = 0$) and adapts $\sigma_q(v)$ during the computation of RPM6 and
BRM6 dependent on the vertical pixel position v. The approach assumes a normal distribution
of the six features in the street training region and beyond. As described for $x_6$, a
position-dependent variance was introduced. We verify the assumed normal distribution
with statistical tests of goodness of fit for all features independently (see Sect. 4.1.3).
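The iconic RPM computation of Equ. (4.13) and (4.14) reduces to simple array operations; a minimal sketch with hypothetical argument names:

```python
import numpy as np

def rpm_visual(F, mean, sigma):
    """Iconic road probability map of a visual feature map F (Equ. 4.13);
    mean/sigma are estimated on the street training region."""
    return np.exp(-(F - mean)**2 / (2.0 * sigma**2))

def rpm_height(Y, sigma6, sq_train, sq_rows):
    """RPM of the stereo height cue with position-dependent standard deviation
    (Equ. 4.14); sq_rows holds sigma_q(v) broadcast over the image rows."""
    s = sigma6 - sq_train + sq_rows
    return np.exp(-Y**2 / (2.0 * s**2))
```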
In the third step, the computed BRMs and RPMs are fused with the detected lane
markings. More specifically, the RPMs for all features are set to a high probability at
the detected lane markings. The lane-marking detection is done with the biologically
motivated Difference of Gaussian (DoG) kernel (see Fig. 4.6a), which takes the receptive
fields of neurons in the retina as a role model. The DoG filter kernel is adapted to be
selective to bright structures on a dark background, the so-called on-off contrasts, without
reacting to dark structures on a brighter background. Figure 4.6c shows the filter response
on the inner-city frame shown in Fig. 4.6b. All image regions with on-off contrasts that
have a height within the confidence interval of Equ. (4.4) and that lie below the horizon are
detected as lane markings (see Fig. 4.6d). The separation between on-off and off-on
contrasts reduces the number of false positive road marking detections. For example, in
[Luo-Wai, 2008] the prefiltered road image still contains off-on contrasts that are unspecific
to lane markings (e.g., traffic signs in front of a bright sky). Such off-on contrasts are filtered out
in our approach to improve the road detection performance.
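The on-off selectivity can be sketched as a clipped center-surround difference; the two σ values below are assumed, not the thesis's tuned parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def on_off_dog(intensity, sigma_center=1.0, sigma_surround=3.0):
    """On-off DoG response: center minus surround with the negative half-wave
    clipped, so only bright-on-dark (on-off) structures respond."""
    img = intensity.astype(float)
    center = gaussian_filter(img, sigma_center)
    surround = gaussian_filter(img, sigma_surround)
    return np.maximum(center - surround, 0.0)
```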
The six iconic RPMs and their respective BRMs are combined by multiplication, which
leads to six extended RPMs (eRPM):

$$\text{eRPM}_i = \text{RPM}_i \cdot \text{BRM}_i \quad \forall i = 1..6. \quad (4.15)$$
Based on this, the advantage of probability-based computation is preserved. At the same
time, discontinuities in the feature maps can be detected. As a result, the advantages of
both approaches are combined.
Next, all eRPMs are fused resulting in the final RPM (fRPM) using the geometric mean:
$$\text{fRPM} = \left( \prod_{i=1}^{6} \text{eRPM}_i \right)^{1/6}. \quad (4.16)$$
In the fourth and final step, the Final Road Map is determined by applying a threshold
$\varepsilon_{\text{final}}$ to the fRPM:

$$\text{Final Road Map}(u, v) = \begin{cases} 1 & \forall\, \text{fRPM}(u,v) > \varepsilon_{\text{final}} \\ 0 & \text{else.} \end{cases} \quad (4.17)$$
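Steps (4.15)-(4.17) reduce to a few array operations; a minimal sketch with hypothetical names:

```python
import numpy as np

def fuse_road_maps(rpms, brms, eps_final):
    """Fusion of iconic RPMs and region-based BRMs (Equ. 4.15-4.17);
    rpms and brms are stacks of shape (6, H, W)."""
    erpms = rpms * brms                            # eRPM_i (4.15)
    frpm = np.prod(erpms, axis=0) ** (1.0 / 6.0)   # geometric mean (4.16)
    return (frpm > eps_final).astype(np.uint8)     # Final Road Map (4.17)
```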
The threshold $\varepsilon_{\text{final}}$ is set dynamically based on the correlation results of the three
currently most reliable feature maps, in order to get a prediction of the current relative
size of the road versus the rest of the image. For these three features the currently best HSI
color feature (hue or saturation), the best structure feature (structure on hue, saturation,
or intensity), as well as stereo are selected.

Figure 4.6: (a) On-off Difference of Gaussian (DoG) filtering on two test images with on-off and off-on contrast (left) as well as the respective filter responses (right), (b) Inner-city test frame, (c) On-off DoG filter response for bright contrasts on a dark background (with lane markings popping out), (d) Detected lane markings (after fusion of DoG and object height from stereo).

For the selection process the Bhattacharyya coefficient $\gamma_i(H_{s_i}, H_{n_i})$ is evaluated (see
Equ. (4.7)), by which the separability of the street and non-street histograms $H_{s_i}$ and $H_{n_i}$
can be assessed.
Hence, the computation of the Final Road Map relies on a simple road model (expected
fraction of the road area in the current image, termed road-to-image-ratio). No assump-
tions are made regarding the current position of the road in the image. As our evaluation
results in Sect. 4.1.3 show, it is of crucial importance to adapt the said expected fraction
dynamically to the current scene. This dynamic adaptation enables the system to run
robustly in complex scenes, such as inner-city scenarios.
For adapting εfinal the control loop depicted in Fig. 4.7 is used. The threshold εfinal is
adapted by a gradient method based on Equ. (4.22). In the following, the applied procedure
is described in detail. It uses the BRMs of the three most reliable feature maps A, B, and
C, which are combined into the road reference map (i.e., the feature product R that represents
the expected road area), as depicted in Fig. 4.7. The four binary maps are summed up, which
results in four scalar values $S_{\{A,B,C,R\}}$:
$$S_X = \sum_{\forall(u,v)} \text{BRM}_X(u, v) \quad \text{with } X \in \{A, B, C, R\}. \quad (4.18)$$
The values S{A,B,C,R} represent the integral number of pixels detected as road for the
three feature maps and the road reference map.
Figure 4.7: Control loop to adapt the final road detection threshold εfinal.
Then, the parameter κ is calculated:
$$\kappa = \frac{1}{3}\left(\frac{S_R}{S_A} + \frac{S_R}{S_B} + \frac{S_R}{S_C}\right). \quad (4.19)$$
It represents the mean percentage with which the three most reliable feature maps
correspond to the road reference map R. The larger $\kappa$ is, the better the features match
each other, i.e., the more similar the three feature maps are. The degree of similarity of
these features gives a hint about what to expect from the remaining cues and can hence
be used to adapt εfinal. The Final Road Map is computed (see Equ. (4.17), where εfinal is
set to a typical initial value for bootstrapping) and summed up yielding the scalar value
SFRM:
$$S_{\text{FRM}} = \sum_{\forall(u,v)} \text{Final Road Map}(u, v). \quad (4.20)$$
Next, it is checked if the calculated scalar value SFRM fulfills:
$$\frac{1}{\kappa} < \frac{S_{\text{FRM}}}{S_R} < 1.2\,\frac{1}{\kappa}. \quad (4.21)$$
If the inequality is fulfilled, the Final Road Map is valid. If not, $\varepsilon_{\text{final}}$ is adapted
incrementally based on the following equation (with $\alpha^- < 1$ and $\alpha^+ > 1$) until
inequality (4.21) is fulfilled:

$$\varepsilon_{\text{final}}(t) = \begin{cases} \alpha^-\,\varepsilon_{\text{final}}(t-1) & \text{when } \frac{S_{\text{FRM}}}{S_R} < \frac{1}{\kappa} \\ \alpha^+\,\varepsilon_{\text{final}}(t-1) & \text{when } \frac{S_{\text{FRM}}}{S_R} > 1.2\,\frac{1}{\kappa}. \end{cases} \quad (4.22)$$
Equation (4.22) is motivated by the well-known Resilient Backpropagation (RPROP)
approach. The step sizes $\alpha^+$ and $\alpha^-$ are adapted using the SuperSAB approach (see
[Adamy, 2007] for details). The processing stops after 100 iterations at the latest.
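A simplified sketch of this control loop is given below; fixed step factors replace the SuperSAB-adapted $\alpha^\pm$ of the thesis, and the initial threshold is an assumed bootstrapping value.

```python
import numpy as np

def adapt_eps_final(frpm, S_R, kappa, eps0=0.5, a_minus=0.9, a_plus=1.1,
                    max_iter=100):
    """Adapt eps_final until inequality (4.21) holds (simplified version of
    Equ. 4.22 with constant step factors instead of SuperSAB)."""
    eps = eps0
    for _ in range(max_iter):
        S_FRM = np.sum(frpm > eps)          # Equ. (4.17) + (4.20)
        ratio = S_FRM / S_R
        if ratio < 1.0 / kappa:
            eps *= a_minus                  # road map too small: lower threshold
        elif ratio > 1.2 / kappa:
            eps *= a_plus                   # road map too large: raise threshold
        else:
            break                           # Equ. (4.21) fulfilled
    return eps
```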
In the following section, our system approach is evaluated based on an inner-city sce-
nario.
4.1.3 Experiments and Results
In the following, accumulated road detection results on 440 frames of an inner-city stream
are presented (please also refer to [Michalke et al., 2009c] for a more extensive system
evaluation). The performance gain reached by incorporating street and non-street training
regions as well as the dynamic road model is assessed. The results of statistical tests of
goodness of fit are given, which support the assumption of a normal distribution for the
color and structure features within the street-training region. In a final step, details of the
needed computation time on our test vehicle are given. The inner-city result stream, the
input images, and the manually annotated ground truth street segments are available on
the internet [BenchmarkData, 2009a] for benchmark testing.
In order to evaluate our system, we apply the following equations to the resulting road
segment:
$$\text{Completeness} = \frac{TP}{TP + FN} \quad (4.23)$$

$$\text{Correctness} = \frac{TP}{TP + FP} \quad (4.24)$$

$$\text{Quality} = \frac{TP}{TP + FP + FN}. \quad (4.25)$$
The Equations define different ground-truth-based measures, which were taken from
[Lombardi et al., 2005] (with pixels being True Positive (TP), False Negative (FN), False
Positive (FP)).
On a descriptive level, the Completeness states, based on the given ground truth data, how
much of the present road was actually detected. The Correctness states how much of the
detected road is actually road; it penalizes the trivial solution of classifying everything as
road, which would lead to a Completeness of 100%. The Quality combines both measures,
since a trade-off between Completeness and Correctness is possible. Based on this, the
Quality measure should be used for a comparison, since it weights the FP and FN pixels
equally. For a more detailed analysis, the Completeness and Correctness state what exactly
caused a difference in Quality. The
necessary ground truth data was produced by accurate manual annotation of the road in
the 440 images.
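The three measures translate directly into code; a short sketch for binary maps:

```python
import numpy as np

def road_measures(detected, ground_truth):
    """Completeness, Correctness, and Quality (Equ. 4.23-4.25) of a binary
    road map against a binary ground truth annotation."""
    det, gt = detected.astype(bool), ground_truth.astype(bool)
    TP = np.sum(det & gt)                  # correctly detected road pixels
    FP = np.sum(det & ~gt)                 # falsely detected road pixels
    FN = np.sum(~det & gt)                 # missed road pixels
    return (TP / (TP + FN),                # Completeness (4.23)
            TP / (TP + FP),                # Correctness  (4.24)
            TP / (TP + FP + FN))           # Quality      (4.25)
```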
In order to evaluate the novel techniques, the three measures were calculated on the
detected road segments of 440 image frames for three system instances. The first instance
is our system as proposed in Sect. 4.1.2 with all four novel techniques running. The second
system instance is equivalent to the first but runs with a constant road-to-image-ratio
(i.e., with a rigid road model). The third system is equivalent to the first but uses no
non-street training regions, which makes the confidence interval thresholds less adaptive
to the environment (βi = const., see Equ. (4.3)).
We used 220 frames of our inner-city scenario as training data for the two competing,
less adaptive systems in order to tune the road-to-image-ratio of the second system and
the confidence interval factors $\beta_i$ of the third system. The accumulated results in Tab. 4.2 show
that all three systems have a similar performance in terms of Quality on the training data.
On the training images, the highest Quality is reached by the second competing system,
which uses a rigid road model.
Table 4.2: Comparison of our road detection system with two competing systems, each running without one of the proposed novel techniques, on training images.

Road detection approach | # training images | Correctness | Completeness | Quality
Our system | - | 96% | 75% | 73%
Without non-street training areas | 220 | 96% | 73% | 71%
With rigid road model | 220 | 88% | 84% | 75%
The accumulated results for the training sequence are plausible, since both competing
systems were tuned to run with good performance on the training images. In contrast, our
system adapts itself to the environment based on the four described techniques. Therefore,
no manual tuning of our system to the training sequence was done.
For the actual evaluation, the two competing systems were run on consecutive parts of
the remaining stream (in sum 220 images) that were used for testing. In a direct comparison
between our system and the rigid-road-model system, we obtained the results depicted in
Tab. 4.3. Table 4.4 shows the results of the comparison between our system and the system
without non-street training regions. In both cases, our system significantly outperforms
the competing systems in terms of Quality (75% compared to 68%, and 69% compared to
50%). These results confirm the benefit of the system-inherent adaptation capabilities offered
by the proposed four techniques.
Table 4.3: Comparison of our system and an equivalent system with a rigid road model on a test stream with a narrow street.

Road detection approach | # test images | Correctness | Completeness | Quality
Our system | 120 | 97% | 77% | 75%
With rigid road model | 120 | 77% | 85% | 68%
As Tab. 4.3 and 4.4 reveal, the Correctness of the found street segments is high, which means
that only a small number of false positive street pixels is found. However, the gathered results
show that the detection performance varies between frames. This is due to the changing content
of the training region in front of the car. Thereby, the system possibly adapts to local
characteristics present in the current training region that might differ from the current
global road characteristics. Furthermore, local illumination changes that depend on the
current view angle and lighting conditions influence the detection performance. To solve
this, a temporal integration method was developed, which is introduced in the following
Section 4.2.
Table 4.4: Comparison of our system and an equivalent system without non-street training regions for a shady test stream.

Road detection approach | # test images | Correctness | Completeness | Quality
Our system | 100 | 99% | 68% | 69%
Without non-street training regions | 100 | 99% | 50% | 50%
Figure 4.8: Example images of the benchmark inner-city stream (First column: Input image with ground truth road segment, Second column: First benchmark system with rigid road model, Third column: Second benchmark system without non-street training regions, Last columns (highlighted): Resulting road segment of our system improved by temporal integration (see Sect. 4.2)).
For further evaluation, Fig. 4.8 shows typical results of our system compared to the two
competing systems and the ground truth data, based on four sample frames of the inner-city
stream (the full stream is available at [BenchmarkData, 2009a]). As can be seen, our
system performs better in complex scenes and scenes with strong shadows on the road.
As described in Sect. 4.1.2, a central assumption of our system is that the features in the
street training region are normally distributed (see the confidence interval defined in Equ. (4.3)).
Therefore, the Kolmogorov-Smirnov (KS) test of goodness of fit with its Lilliefors extension
[Lilliefors, 1967] is used to check whether the hypothesis of a normal distribution $F_e(x)$ for the six
system features can be rejected. For this, the cumulative frequency $F_0(x)$ for all features in the
street training region was measured for 30 consecutive inner-city frames. Exemplarily,
Tab. 4.5 shows the results for the edge density computed on the hue for a single frame.
The maximal deviation from the tested normal distribution ($\mu = 0.1559$ and $\sigma^2 = 0.0088$)
was $d = 0.1044$ and consequently below the allowed margin of $d_{th} = 0.271$ (level of
significance $\alpha = 0.05$). For the remaining 29 frames similar results were accumulated. Also
for the other five features the KS test showed that the hypothesis of a normal distribution
could not be rejected. As long as no object is present in the training region, we can therefore
assume that the features on the road are normally distributed.
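Such a check can be reproduced, e.g., with the Lilliefors test from statsmodels (assuming that library is available); the sample below is synthetic, generated from the moments reported above, not the thesis data.

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

def normality_not_rejected(feature_values, alpha=0.05):
    """KS test with Lilliefors correction (mean and variance estimated from
    the sample); returns True if the normality hypothesis is NOT rejected."""
    d_stat, p_value = lilliefors(feature_values, dist='norm')
    return p_value > alpha

# Synthetic example with the moments reported in the text (mu=0.1559, var=0.0088)
samples = np.random.normal(0.1559, np.sqrt(0.0088), 4819)
print(normality_not_rejected(samples))
```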
Table 4.5: Kolmogorov-Smirnov test of goodness of fit for the edge density feature computed
on the hue.
Hue edge density (9 classes) | Cumulative frequency F0(x) | Tested normal distribution Fe(x) | Statistical measure d = max|Fe(x) − F0(x)|
0.0353 | 755/4819 = 0.1567 | 0.1296 | 0.0271
0.1016 | 2332/4819 = 0.4839 | 0.3795 | 0.1044
0.1678 | 3589/4819 = 0.7448 | 0.6725 | 0.0722
0.2341 | 4258/4819 = 0.8836 | 0.8814 | 0.0022
0.3003 | 4574/4819 = 0.9492 | 0.9719 | 0.0228
0.3665 | 4713/4819 = 0.9780 | 0.9958 | 0.0178
0.4328 | 4765/4819 = 0.9888 | 0.9996 | 0.0108
0.4990 | 4793/4819 = 0.9946 | 1 | 0.0054
0.5653 | 4819/4819 = 1 | 1 | 0
For the experiments we use a Honda Legend prototype car equipped with an mvBlueFox
CCD (charge-coupled device) color camera from Matrix Vision delivering images of
800x600 pixels at 10 Hz, which is hence the processing rate our road detection module
should approximately reach. The image data as well as the laser and vehicle state data
from the CAN bus are transmitted via LAN to several Toshiba Tecra A7 (2 GHz Core
Duo) laptops running our RTBOS integration middleware [Ceravola et al., 2006] on top of Linux.
The road detection component, together with other driver assistance components (see, e.g.,
[Michalke et al., 2007]), is implemented in C using an optimized image processing library
based on the Intel IPP [Intel, 2006]. Table 4.6 shows the computational demands of different
sub-modules of the presented approach running on one of these laptops. The overall
computation time of our road detection system currently amounts to 123.5 ms (8.1 Hz),
which allows real-time processing on our prototype vehicle.
Table 4.6: Computation time (M - Including detection of lane markings, T - Including temporal
integration approach).
M | T | Used RAM [in MB] | Comp. time [in ms] (frame rate [in Hz])
- | - | 185 | 93.5 (10.7)
X | - | 203 | 101.0 (9.9)
- | X | 214 | 105.0 (9.5)
X | X | 233 | 123.5 (8.1)
In the following section, a tracking procedure based on temporal integration is proposed,
which stabilizes the gained road detection results, e.g., in case of difficult lighting conditions.
4.2 Temporal Integration for Feature-Based Road
Detection Systems
Although existing state-of-the-art systems for unmarked road detection show promising
results, the detected road segments often contain holes and show a detection performance
that strongly varies in time depending on environmental conditions (see also previous Sec-
tion 4.1). The varying detection performance is due to the changing content of the training
region in front of the car. Thereby, the system possibly adapts to local characteristics
present in the current training region that might differ from the global road characteris-
tics. Furthermore, local illumination changes that depend on the current view angle and
lighting conditions influence the detection performance. See Fig. 4.9 for a visualization of
both effects.
In the following section, a real-time capable approach for improving the road detection
results of this type of state-of-the-art system is presented, which adds a generic postprocessing
step. Our proposed architecture removes the drawbacks of these systems using a
temporal integration approach based on the bird’s eye view. In order to test the proposed
approach, the visual-feature-based road detection system described in Sect. 4.1 is used.
Still, this road detection system can be exchanged with any other state-of-the-art system.
Evaluation results computed on inner-city data show that this approach is an important
enhancement for all visual-feature-based road detection systems. One of the used streams
and corresponding ground truth data is accessible on the internet for benchmark testing.
The proposed approach is a crucial step toward robust road detection in complex scenarios
that allows building high-level applications, as, e.g., active collision avoidance or trajectory
planning, based on vision as the major cue.
4.2.1 Related Work
The concept of temporal integration is used in various applications in the field of computer
vision for driver assistance.

Figure 4.9: Causes for varying road detection performance: (a) Illumination change with dependence on the view angle, (b) Sample image showing a typical illumination gradient, (c) Schematic example: Training region in the sun and resulting detected street segment (in white), (d) Schematic example: Training region in the shade and resulting detected street segment (in white).

For example, [Gepperth et al., 2007] uses spatiotemporal inte-
gration to improve the classifier performance when detecting signal boards and cars. Other
applications for improving the classifier performance rely on (temporal-integration-based)
voting mechanisms, which are widely used in numerous domains (see [Bauer and Kohavi,
1999] for an overview). Also the well-known Kalman filter approach [Kalman, 1960] sta-
bilizes its state estimate by temporal integration (fusion of measured and predicted data).
In [Nieto et al., 2007] temporal integration is used to determine the camera parameters,
thereby stabilizing the input image of a marked lane detection system running online in a
car.
Also for clothoid-model-based lane detection on highways and country roads (see,
e.g., [Dickmanns and Mysliwetz, 1992] and [Franke et al., 2007]) temporal integration was
found to improve the detection performance. Still, the usage of such model-based approaches
for road detection in complex inner-city scenes is heavily restricted due to the
unpredictable and abruptly changing course of the road and various occlusions of road
parts. Figure 4.10a shows the complexity of a hand-labeled ground truth road segment
for an inner-city frame that can hardly be modeled using, e.g., a clothoid model. Therefore,
model-based temporal integration is also not possible and will not show the desired
results in such complex scenarios.
Figure 4.10: Exemplary inner-city frame: (a) Hand-labeled ground truth street segment, (b)
Optical flow (colors code the direction of the motion), (c) Bird’s eye view.
Newer road detection approaches that rely on the statistical evaluation of different image
features (see, e.g., [Rotaru et al., 2004] and [Soquet et al., 2007]) can handle such scenarios
but have the drawbacks discussed at the beginning of Sect. 4.2. Nevertheless,
also for these systems temporal integration can and should be used for making the road
segment detection more robust. To this end, the most direct approach would be to use the
optical flow that reflects the magnitude and direction of the motion of image regions, as
shown in Fig. 4.10b. Based on that, the current position of a street segment detected in
the past can be determined and used for a fusion with the current road detection results.
However, the optical flow has certain drawbacks. First, its to date high computational
cost makes it scarcely applicable in domains with hard real-time constraints, such as the car
domain. Second, the optical flow cannot be calculated at the borders of an image and is
error-prone due to ambiguities resulting from the aperture problem, illumination changes,
and camera noise [Willert et al., 2007]. Instead of detecting the motion of all image regions
based on the optical flow, the approach proposed here concentrates on the drivable street
plane alone, relying on the bird’s eye view (see Fig. 4.10c and Fig. 2.16a).
4.2.2 System Description
In the following, a rough overview of our approach of bird’s-eye-view-based temporal road
integration is given (see Fig. 4.11). Thereafter, all processing steps and their theoretical
background are described in more detail.
As input data our system uses 400x300 monocular gray value images and a binary map
of the currently detected street segment. The images are used for calculating the bird’s
eye view, which is a representation of the scene as viewed from above (see Fig. 4.12a and
Fig. 4.10c). In the following step, the bird’s eye view is used for detecting the motion of the
static vehicle environment based on Normalized Cross Correlation (NCC). Based on these
correlation results the current and past street segments are fused by temporal integration
on the bird’s eye view. The fused street segments are then mapped back to the perspective
view corresponding to the input image.
The system takes optional input data that improves the quality and makes the temporal
integration more robust. As such optional input data, stereo images as well as the longi-
tudinal ego velocity and yaw rate of the CAN bus of our prototype vehicle are processed.
The depth map that is calculated from stereo images (using the commercial “Small Vision
System” [Konolige, 1997], see Sect. 2.2.2) is the basis for correcting the changes in the pitch
and roll angle. An uncompensated change in the pitch and roll angles makes the bird's eye
view unstable in case the car brakes or the street profile is not flat. The CAN data is used
for predicting the motion of the car based on a single track model. The predicted motion
is used for determining the anchor for the correlation on the bird’s eye view. The usage of
CAN data makes the system faster. Still, without CAN data the detection quality is not
reduced.
In the following, the processing steps (as depicted in Fig. 4.11) are described in more
detail. First, the camera lens distortion is corrected. The undistorted vertical and horizontal
pixel positions v and u are computed from the initial (distorted) vertical and horizontal
pixel positions $v_d$ and $u_d$ based on:
$$u = (1 + k_1\beta^2 + k_2\beta^4)\,u_d + 2k_3 u_d v_d + k_4(\beta^2 + 2u_d^2) \quad (4.26)$$

$$v = (1 + k_1\beta^2 + k_2\beta^4)\,v_d + k_3(\beta^2 + 2v_d^2) + 2k_4 u_d v_d \quad (4.27)$$

with $\beta = \sqrt{u_d^2 + v_d^2}$.
The undistortion is based on a lens distortion model (described in [Heikkila and Silven,
1997]) that uses radial (k1 and k2) and tangential distortion coefficients (k3 and k4). The
undistortion step is essential in order to allow a correct mapping of the image pixels to
the bird's eye view. Since the bird's eye view is a metric representation, the undistortion
step makes sure that the proportions in the bird's eye view match those in the world.
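A sketch of Equ. (4.26)/(4.27) is given below; it follows the reconstruction above (standard radial/tangential model of [Heikkila and Silven, 1997]), and the coordinate normalization convention of the calibration toolbox is assumed rather than reproduced.

```python
import numpy as np

def undistort(ud, vd, k1, k2, k3, k4):
    """Undistortion following Equ. (4.26)/(4.27): k1, k2 radial and
    k3, k4 tangential coefficients; ud, vd are the distorted coordinates."""
    beta2 = ud**2 + vd**2                       # beta^2 = ud^2 + vd^2
    radial = 1.0 + k1 * beta2 + k2 * beta2**2
    u = radial * ud + 2.0 * k3 * ud * vd + k4 * (beta2 + 2.0 * ud**2)
    v = radial * vd + k3 * (beta2 + 2.0 * vd**2) + 2.0 * k4 * ud * vd
    return u, v
```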
Then the bird’s eye view is calculated on the undistorted pixels v and u based on
Equ. (4.28) and (4.29) by inverse perspective mapping of the 3D world points X, Y , and
Z to the 2D (u,v) image plane (see Fig. 4.12b for the notation in our coordinate system).
The equations describe how to map a 3D position of the world to the 2D image plane
(refer to [Broggi, 1995]). More specifically, only the image pixels (u,v) that are needed
to get a dense metric bird’s eye view plane are mapped into the XZ-plane. The usage of
inverse perspective mapping makes the inversion of Equ. (4.28) and (4.29) for calculating
the bird's eye view obsolete. Equations (4.28) and (4.29) use the three camera angles $\theta_X$, $\theta_Y$,
and $\theta_Z$, the three translational camera offsets $t_1$, $t_2$, $t_3$ (see Fig. 4.12b), the horizontal and
vertical principal point coordinates $u_0$ and $v_0$, as well as the horizontal and vertical focal lengths $f_u$ and
fv. The intrinsic (i.e., internal camera properties, like the focal length and the principal
point) and extrinsic (i.e., external camera properties, like camera angles and offsets) camera
parameters were determined using the freely available calibration toolbox [J.Y.Bouguet,
2007] and a calibration scene similar to the one described in [Marita et al., 2007].
Figure 4.11: System structure: Temporal road segment integration (the dashed module can be
exchanged with the road detection algorithm preferred by the user, optional module highlighted
in red).
As can be seen in Equ. (4.28) and (4.29) the 3D world position coordinates X, Y , and
Z of all image pixels (u,v) are needed:
$$u = -f_u\,\frac{r_{11}(X - t_1) + r_{12}(Y - t_2) + r_{13}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + u_0 \quad (4.28)$$

$$v = -f_v\,\frac{r_{21}(X - t_1) + r_{22}(Y - t_2) + r_{23}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + v_0 \quad (4.29)$$

with $Y = 0$,

$$R = R_X R_Y R_Z = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix},$$

and

$$\begin{aligned}
r_{11} &= \cos(\theta_Z)\cos(\theta_Y) \\
r_{12} &= -\sin(\theta_Z)\cos(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{13} &= \sin(\theta_Z)\sin(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{21} &= \sin(\theta_Z)\cos(\theta_Y) \\
r_{22} &= \cos(\theta_Z)\cos(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{23} &= -\cos(\theta_Z)\sin(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{31} &= -\sin(\theta_Y) \\
r_{32} &= \cos(\theta_Y)\sin(\theta_X) \\
r_{33} &= \cos(\theta_Y)\cos(\theta_X).
\end{aligned}$$
By using a monocular system, one dimension (the depth Z) is lost. A solution to this
dilemma is the so-called flat plane assumption. Here, for all pixels in the image, the height
Y is set to 0. Based on this, only objects in the image with Y = 0 (especially, the street we
are interested in) are mapped correctly to the bird’s eye view, while all the other regions
are stretched to infinity in the bird’s eye view (for example the cars in Fig. 4.10c).
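A sketch of this flat-plane inverse perspective mapping is given below; grid extent, resolution, and all calibration values are assumed example parameters. Each metric grid point on the Y = 0 plane is projected into the image via Equ. (4.28)/(4.29) and the gray value is sampled there.

```python
import numpy as np

def rotation(th_x, th_y, th_z):
    """R = R_X R_Y R_Z with the entries listed below Equ. (4.29)."""
    cx, sx = np.cos(th_x), np.sin(th_x)
    cy, sy = np.cos(th_y), np.sin(th_y)
    cz, sz = np.cos(th_z), np.sin(th_z)
    return np.array([
        [cz * cy, -sz * cx + cz * sy * sx,  sz * sx + cz * sy * cx],
        [sz * cy,  cz * cx + sz * sy * sx, -cz * sx + sz * sy * cx],
        [-sy,      cy * sx,                 cy * cx]])

def birds_eye_view(img, R, t, fu, fv, u0, v0,
                   x_rng=(-10.0, 10.0), z_rng=(0.0, 50.0), res=0.1):
    """Flat-plane inverse perspective mapping: every metric grid point
    (X, 0, Z) is projected into the image via Equ. (4.28)/(4.29) and the
    gray value is sampled there (nearest neighbor)."""
    X, Z = np.meshgrid(np.arange(x_rng[0], x_rng[1], res),
                       np.arange(z_rng[1], z_rng[0], -res))
    P0 = X - t[0]
    P1 = -t[1] * np.ones_like(X)                   # Y = 0 (flat plane assumption)
    P2 = Z - t[2]
    den = R[2, 0] * P0 + R[2, 1] * P1 + R[2, 2] * P2
    den = np.where(np.abs(den) < 1e-9, 1e-9, den)  # guard against the horizon
    u = (-fu * (R[0, 0] * P0 + R[0, 1] * P1 + R[0, 2] * P2) / den + u0).astype(int)
    v = (-fv * (R[1, 0] * P0 + R[1, 1] * P1 + R[1, 2] * P2) / den + v0).astype(int)
    bev = np.zeros(X.shape, dtype=img.dtype)
    ok = (u >= 0) & (u < img.shape[1]) & (v >= 0) & (v < img.shape[0])
    bev[ok] = img[v[ok], u[ok]]
    return bev
```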
In case this assumption is not fulfilled (i.e., the street surface is not flat) the bird’s eye
view is inaccurate, which leads to decreasing quality of the temporal integration. To allow
a stable bird’s eye view even in case of non-flat street surfaces and pitching of the vehicle,
stereo data from our stereo camera setup is used. In order to enhance the robustness of
the correction, only pixels that belong to the currently detected street segment are used
for surface estimation. More specifically, the differences between the coordinate axes and
the street surface in terms of the pitch ∆θX and roll angle ∆θZ , as well as the height of
the camera over the ground ∆t2 are computed:
Y = Y0 + aZ + bX (4.30)
∆θZ = atan(b) (4.31)
∆θX = atan(a) (4.32)
∆t2 = Y0. (4.33)
Figure 4.12: (a) Visualization of the bird’s eye view, (b) Coordinate system and position of
the camera (car is heading in Z-direction), (c) Single track vehicle model.
This is done based on the 3D position for all image pixels derived from the stereo disparity
(see Fig. 4.5 for 3D data of a sample image). The flat plane assumption Y = 0 is then
replaced by Y = f(X,Z) leading to an extended bird’s eye view. In our implementation a
first order model for the street surface (linear hyperplane) is used as shown in Equ. (4.30)
(see [Li et al., 2004] for more details). Results have shown that higher order models lead
to inferior performance. The reason for this is the restricted number of 3D measurement
points at the borders of the image, since only reliable pixels belonging to the detected
street are used for the surface estimation. Since the estimated surface is noisy (stereo
data is calculated based on an error-prone correlation between the left and right image), a
linear Kalman filter is applied to the parameters Y0, a, and b, which raises the performance
considerably. A possible improvement would be to use a model of the vehicle kinetics
(containing damper and spring characteristics, realistic distribution of the vehicle mass) for
the Kalman prediction (as proposed in [Cech et al., 2004]) instead of the linear prediction
model used here.
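The plane fit of Equ. (4.30)-(4.33) is a linear least-squares problem; a minimal sketch (the subsequent Kalman filtering of Y0, a, b is omitted):

```python
import numpy as np

def fit_street_plane(X, Y, Z, street_mask):
    """Least-squares fit of Y = Y0 + a*Z + b*X (Equ. 4.30) on the 3D points of
    the detected street segment; returns the plane parameters and the derived
    pitch/roll/height corrections (Equ. 4.31-4.33)."""
    xs, ys, zs = X[street_mask], Y[street_mask], Z[street_mask]
    A = np.column_stack([np.ones_like(zs), zs, xs])     # columns: 1, Z, X
    (Y0, a, b), *_ = np.linalg.lstsq(A, ys, rcond=None)
    d_theta_X = np.arctan(a)    # pitch correction, Equ. (4.32)
    d_theta_Z = np.arctan(b)    # roll correction,  Equ. (4.31)
    d_t2 = Y0                   # camera height correction, Equ. (4.33)
    return (Y0, a, b), (d_theta_X, d_theta_Z, d_t2)
```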
By NCC-based correlation between the current and the stored previous bird’s eye view
the vehicle motion (∆X, ∆Z) since the previous time step is detected. A single track
vehicle model, as depicted in Fig. 4.12c, predicts the starting point xt = xt−1 + ∆x and
zt = zt−1 +∆z of the NCC correlation patch of time step t-1 in the current bird’s eye view
map. The values ∆x and ∆z are calculated based on the sample time T , the distance of
the camera from the rear wheel l, as well as the yaw rate $\dot{\theta}_Y$ and longitudinal velocity $\dot{Z}$ from
the CAN bus (see single track model Equ. (4.34) and (4.35)):

$$\Delta x = \frac{\dot{Z}}{\dot{\theta}_Y}\left(1 - \cos(\dot{\theta}_Y T)\right) + \sin(\dot{\theta}_Y T)\,l \quad (4.34)$$

$$\Delta z = \frac{\dot{Z}}{\dot{\theta}_Y}\sin(\dot{\theta}_Y T) + \cos(\dot{\theta}_Y T)\,l - l. \quad (4.35)$$
The derived longitudinal and lateral motions as well as the rotational change (i.e., yaw angle)
between the current and the previous bird's eye view are stored along with the incremental
motion of the previous N = 40 frames (equivalent to 4 seconds of processing by our
prototype vehicle's vision system).
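A sketch of Equ. (4.34)/(4.35) with an explicit handling of the straight-driving limit (yaw rate → 0), which the closed-form expressions do not cover; the function name is hypothetical:

```python
import numpy as np

def anchor_shift(z_dot, yaw_rate, T, l):
    """Predicted shift (dx, dz) of the NCC anchor from the single track model
    (Equ. 4.34/4.35); the limit yaw_rate -> 0 is handled separately since the
    closed form divides by the yaw rate."""
    if abs(yaw_rate) < 1e-6:
        return 0.0, z_dot * T               # pure forward translation
    dx = z_dot / yaw_rate * (1.0 - np.cos(yaw_rate * T)) + np.sin(yaw_rate * T) * l
    dz = z_dot / yaw_rate * np.sin(yaw_rate * T) + np.cos(yaw_rate * T) * l - l
    return dx, dz
```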
The NCC correlation patch on the bird’s eye view is selected to contain enough structure
(using the entropy-based measure described in Sect. 3.3), which improves the accuracy of
the NCC. Furthermore, it is assured that the patch belongs to the detected street and
that it is not too far away from the ego vehicle, since the resolution of the bird’s eye view
decreases with growing distance to the vehicle.
The bird’s eye view maps of the detected street segments of the previous N = 40 frames
are calculated and stored. The stored incremental motion during the past 4 seconds is
integrated and used to shift all stored bird’s eye view street segments correspondingly.
Then the shifted previous 40 bird’s eye view street segments are weighted (weights αt) and
summed up by:
$$S_{\text{integ}} = \sum_{t=1}^{N} \alpha_t S_t \quad \text{with} \quad \sum_{t=1}^{N} \alpha_t = N. \quad (4.36)$$
Thereafter, the sum of the street segments Sinteg(X,Z) is related to the maximum pos-
sible number of overlaid street segments Smax(X,Z), which results in an Integrated Road
Probability Map (IRPM):
$$\text{IRPM} = \frac{S_{\text{integ}}(X,Z)}{S_{\max}(X,Z)}. \quad (4.37)$$
Please note that $S_{\max}(X,Z)$ changes depending on the position in the bird's eye view
map. The following final threshold operation determines the final temporally integrated
street segment $S_{\text{final}}$ in the bird's eye view representation:
$$S_{\text{final}} = \begin{cases} 1 & \forall\, \text{IRPM}(X,Z) \geq \beta \\ 0 & \forall\, \text{IRPM}(X,Z) < \beta. \end{cases} \quad (4.38)$$
The weight $\alpha_1$ in Equation (4.36) is set high to ensure that the pixels in the currently
detected street segment are with a high probability also present in the final temporally
integrated street segment. The other weights $\alpha_t$ could be set dynamically depending on a
quality measure of the bird's-eye-view-based NCC or of the road detection system, as well as
on the capturing time t. The threshold $\beta$ in Equ. (4.38) is currently set to 0.7. This means
that a pixel is classified as street if at least 70% of the overlaid past street segments have
voted for street.
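Steps (4.36)-(4.38) in a compact sketch; the segment maps are assumed to be already motion-compensated (shifted), as described above:

```python
import numpy as np

def integrate_segments(shifted_segments, weights, S_max, beta=0.7):
    """Temporal integration (Equ. 4.36-4.38). shifted_segments: (N, H, W)
    binary bird's-eye street maps, already motion-compensated; weights sum
    to N; S_max: per-pixel maximum number of overlapping segments."""
    S_integ = np.tensordot(weights, shifted_segments, axes=1)   # Equ. (4.36)
    irpm = S_integ / S_max                                      # Equ. (4.37)
    return (irpm >= beta).astype(np.uint8)                      # Equ. (4.38)
```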
Figure 4.13: Final morphological fill operation for closing spaces in the street segment that are
due to perspective mapping (justified openings are preserved): (a) Raw perspectively mapped
street segment, (b) After morphological closing.
Next, the final temporally integrated street segment $S_{\text{final}}$ is mapped back to the image
using Equ. (4.28) and (4.29). For this operation, the resolution of the street segment in
the bird's eye view representation needs to be high (achieved by upsampling by a
factor of 4) in order to allow a lossless perspective mapping of the street segment.
The perspective mapping step produces equidistant, periodic spaces in the street segment
directly in front of the car (see Fig. 4.13a). These spaces are filled using a morphological
close operation with a small morphological structuring element to prevent adding too
many false positive street pixels (see Fig. 4.13b). In other words, openings in the bird's
eye street segment (that, e.g., correspond to objects on the street) are retained in the final
perspectively mapped street segment. Such openings are explicitly checked for objects in
the implemented ADAS (see Chapter 5).
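A sketch of this closing step using OpenCV; the 3x3 elliptical structuring element is an assumed choice for the "small" element mentioned above:

```python
import cv2
import numpy as np

def close_mapping_gaps(road_mask):
    """Morphological closing with a small structuring element: fills the thin
    periodic gaps caused by the perspective mapping while keeping the larger,
    justified openings (e.g., objects on the street)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    return cv2.morphologyEx(road_mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
```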
The following section shows that the proposed temporal integration procedure results in
an enhanced street segmentation. The final detected street segment has fewer holes and is
dynamically more stable than that of other approaches, which allows complex path-related
applications.
4.2.3 Experiments and Results
In this section, we evaluate the performance of our system by applying it to the results of
the state-of-the-art road detection algorithm described in Sect. 4.1. As described before,
the proposed temporal integration approach can work on top of all road detection algo-
rithms for unmarked roads and is therefore interchangeable. Additionally, the required
computation time for the proposed temporal integration approach is given.
Figure 4.14 shows qualitative results of the various system modules of our system. The
depicted snapshot is part of a result stream showing our system running on 160 consecutive
frames of an inner-city course. The input images and stereo data used for the evaluation as
well as the ground truth data and results are accessible on the internet [BenchmarkData,
2008b] for open benchmark testing. Additionally, Fig. 4.14e shows a kind of 360◦ represen-
tation of the environment that is derived from the combination of all stored bird’s eye view
maps of the past 4 seconds. This representation builds up gradually after the algorithm
starts. It could be used for higher-level trajectory planning algorithms.
The white rectangle in Fig. 4.14b and d-f represents the position of our prototype vehicle,
while the black regions are outside the field of vision of our vehicle cameras.
Figure 4.14: (a) System input image, (b) Input image in bird’s eye view, (c) System input:
Detected road segment of road detection module, (d) Detected road segment in bird’s eye view,
(e) Temporal integration of bird’s eye view images of past 4 seconds, (f) Temporal integration
of detected road segments, (g) System output: Integrated road segment mapped back to the
perspective image.
In order to evaluate our algorithm with respect to its impact on the road detection
performance, we adopt the ground-truth-based measures (see Equations (4.23), (4.24),
and (4.25)) defined in Sect. 4.1.3. The necessary ground truth data was produced by
accurate manual annotation of the 440 test images (see Fig. 4.10a for a sample).
The three measures were then calculated on the detected street segments of 440 image frames of two inner-city streams. The gathered results are depicted in Tab. 4.7. There, the standard street detection algorithm without temporal integration is compared to our approach. Furthermore, our approach is compared to one that uses the optical flow for temporal integration (based on the state-of-the-art optical flow algorithm described in [Willert et al., 2006]), and finally to our approach using only the mandatory input data (i.e., without stereo data). In all four cases the same street detection algorithm was used in order to allow a fair comparison. As the results in Tab. 4.7 show, the highest
Table 4.7: Comparison of different methods for temporal integration.
Road detection approach                # test    Correct-   Comple-   Quality
(BEV: bird's eye view)                 images    ness       teness
No temp. integration                   440       98.1%      61.5%     60.5%
Temp. integration, BEV                 440       95.2%      94.1%     89.9%
Temp. integration, optical flow        440       92.6%      72.4%     68.1%
Temp. integration, BEV, no stereo      440       96.9%      84.0%     81.7%
Quality (89.9%, compared to the 60.5% of the initial street detection algorithm) is reached with temporal integration based on our algorithm. Without stereo data our algorithm still reaches a Quality of 81.7%. Optical-flow-based temporal integration reaches a Quality of merely 68.1%, which is due to the well-known aperture problem (see, e.g., [Willert et al., 2006]) and the illumination changes present in the streams. The initial road detection approach without temporal integration has the highest Correctness with 98.1%, but this comes at the cost of a reduced Completeness of merely 61.5%. Our temporal integration approach decreases the Correctness from 98.1% to 95.2%, but it increases the Completeness disproportionately (from 61.5% to 94.1%).
For further evaluation, Fig. 4.15 shows typical results of a standard street detection
algorithm compared to results gathered with the proposed temporal integration approach
based on 4 sample images of the inner-city stream.
For the experiments we use a Honda Legend prototype car equipped with a mvBlueFox CCD color camera from Matrix Vision delivering images of 800x600 pixels at 10 Hz, which is hence the processing rate our road detection module must at least reach. The image data as well as the laser and vehicle state data from the CAN bus are transmitted via LAN to several Toshiba Tecra A7 laptops (2 GHz Core Duo) running our RTBOS integration middleware [Ceravola et al., 2006] on top of Linux. The road detection component together
Figure 4.15: Example images (frames 24, 97, 105, and 147) of the used inner-city stream (left: standard approach, i.e., the input street segment for our approach; right: our approach after temporal integration). The last image is visually enhanced to improve its legibility when printed.
with other driver assistance components (see, e.g., [Michalke et al., 2007]) is implemented in C using an optimized image processing library based on the Intel IPP [Intel, 2006]. The road detection component is set to run on a single core.
Table 4.8 shows the computational demands of different sub-modules of the presented approach and compares them to the qualitatively inferior approach based on the optical flow (as shown in Tab. 4.7). The reasonably parameterized state-of-the-art optical flow
Table 4.8: Comparison of computational demands for temporal integration on the bird’s eye
view and using optical flow.
Module / sub-module                        Comp. time [ms]   (frame rate [Hz])
Temp. integration, BEV (sum)                 49.8            (≈ 20)
  Bird's eye view                             6.9
  Correlation sub-module                     14.7
  Temp. integration                          20.0
  Perspective mapping to image plane          8.2
Temp. integration, optical flow (sum)      >537.0            (≈ 2)
implementation (based on [Willert et al., 2006]) needs 537.0 ms (≈ 2 Hz), without taking into account further system modules that this approach additionally requires. The overall computation time of our temporal integration system amounts to 49.8 ms (≈ 20 Hz). Combined with the realized unmarked road detection system described in Section 4.1, real-time processing on our prototype vehicle is achieved (refer to Tab. 4.6).
4.3 Summary
In Chapter 4 an unmarked road detection system based on vision as the major cue was described and evaluated in real-time. At run time the system dynamically adapts central system parameters to the environment, allowing robust road detection under changing environmental conditions. More specifically, a road training region in front of the car is used in order to derive the visual properties of the road. Furthermore, two non-road training regions are used to determine how well the road can be separated from the rest of the scene in the current scenario. This separability information is used to parameterize the feature fusion processes of the six visual features the system relies on to detect the road. The visual features are processed iconically (i.e., the road likeliness of each independent pixel is determined) and region-based (i.e., the properties of neighboring pixels are taken into account). Furthermore, instead of relying on strong, rigid road models as proposed in the literature, the system presented here uses only a simple, dynamic road model and puts stronger weight on the visual feature information. The novel feature edge density (structure), computed on HSI hue and saturation, is introduced as a reliable cue for detecting the road. With the usage of on-off Difference of Gaussians filters for lane marker detection, a further robust, biologically inspired approach was included in the system. The lane marker information gathered in this way is fused with the detected unmarked road segments. The detected road segments match the ground truth data well in most situations. However, in case of shadows on the road the detected road segment contains holes and becomes temporally unstable.
In order to improve this situation, a generic tracking approach for unmarked road detection systems was introduced that is based on temporal integration. The system fuses the detected road segments of the past and present frames in the bird’s eye view and allows robust unmarked road detection in shady conditions. The temporal integration approach was tested on the road detection system described before, but is suitable to improve any state-of-the-art system for unmarked road detection.
Summarizing, in Chapter 4 the following novelties were introduced:

- The edge density (structure) is used as a novel feature for road detection, computed on the HSI hue and saturation maps,
- The usage of street and non-street training regions that both adapt the feature probability distributions during iconic feature processing in the unmarked road detection system,
- A combination of iconic (i.e., pixel-based) and region-based feature processing for road detection,
- A fusion between feature-based and dynamically adapting model-based road detection,
- A biologically motivated on-off DoG filter for lane marking detection, allowing the fusion of the detected lane markings with the detected unmarked road. Using on-off DoG ensures that only white lane markings on a darker background are detected, suppressing off-on contrasts like shadows or tar seams on the road (see the sketch after this list),
- A generic tracking approach based on temporal integration in the bird’s eye view, allowing the stabilization of road detection results of state-of-the-art unmarked road detection systems.
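As a minimal sketch of the on-off DoG idea (the sigma values are illustrative assumptions, not the parameters used in the thesis):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def on_off_dog(gray, sigma_center=1.0, sigma_surround=3.0):
        # Center-surround filtering: narrow center minus wide surround
        center = gaussian_filter(gray.astype(np.float32), sigma_center)
        surround = gaussian_filter(gray.astype(np.float32), sigma_surround)
        dog = center - surround
        # Half-wave rectification keeps on-off (bright-on-dark) responses,
        # such as white lane markings, and suppresses off-on contrasts,
        # such as shadows or tar seams
        return np.maximum(dog, 0.0)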
Both the road detection system and the temporal integration approach were tested online and in real-time on a test vehicle. The detected road is used as input cue for the Advanced Driver Assistance Systems described in Chapter 5, where it improves the performance of various ADAS modules and allows the development of complex driver assistance functionalities.
5 Integrated System Approaches for
Scene Interpretation
Following the preceding description of biologically motivated visual features in Chapter 2, the attention sub-system in Chapter 3, and the unmarked road detection sub-system in Chapter 4, in the current chapter all these approaches are combined into a generic, biologically inspired Advanced Driver Assistance System. After introducing some of the few existing biologically inspired driver assistance systems in Sect. 5.1, along with the major differences to our system approach, Sect. 5.2 describes the developed attention-based ADAS and the evaluation results we gathered in an online highway scenario. In Sect. 5.3 this system is extended by, among other things, fusing the detected road, which allows for robust operation in inner-city scenarios; a summary closes Chapter 5.
5.1 Related Work
Today’s Advanced Driver Assistance Systems effectively support the driver in clearly de-
fined traffic situations like keeping the distance to the forward vehicle. For this pur-
pose Radar sensors, Lidar sensors, and cameras are used to extract parameters of the
scene, like, e.g., headway distances, relative velocities, and relative position of lane mark-
ers ahead. Such approaches resulted in specialized commercial products improving the
driving safety (e.g., the “Honda Collision Mitigation Brake System” [Kodaka and Gayko,
2004, Kodaka et al., 2003] to help the driver to avoid rear end collisions in case the for-
ward vehicle brakes unexpectedly). Although traffic rules and road infrastructure, like, e.g., lane markings, restrict the complexity of what to sense while driving, the perception systems of today’s ADAS are capable of recognizing simple traffic situations only. Furthermore, driving in normal traffic scenes can be done mainly in a rather reactive way by staying in the middle of the lane and keeping an appropriate distance.
However, for assisting the driver over the full range of driving tasks, less reactive, more intelligent systems are required. The goal of realizing such Advanced Driver Assistance Systems (ADAS) can be approached from two directions: either searching for the best engineering solution or taking the human as a role model. As noted above, today’s ADAS are engineered for supporting the driver in clearly defined traffic situations. While it may be argued that the quality of an engineered system in terms of isolated aspects, e.g., object detection or tracking, is often sound, the solutions lack the necessary flexibility. Small changes in the task and/or environment often necessitate redesigning the whole system in order to add new features and modules, as well as adapting how they are linked. Taking the high quality of signal processing reached
in biology into account, one promising way for building such intelligent systems is to take
the human as a role model, mimicking known signal processing principles in the human
brain.
Recently, the topic of researching intelligent cars has gained increasing interest, as documented by the DARPA Urban Challenge [WWW, 2007a] and the European Information Society 2010 Intelligent Car Initiative [WWW, 2007b], as well as several European projects like, e.g., Safespot or PReVENT. As described in Chapter 1, the results gathered by such purely engineering-driven approaches are somewhat limited.
With regard to vision systems developed for ADAS, there have been few attempts to
incorporate aspects of the human visual system into complete systems. One of the most
prominent examples is a system developed in the group of E. Dickmanns [Dickmanns,
2004]. It uses several active cameras mimicking the active nature of gaze control in the
human visual system. However, the processing framework is not closely related to the
human visual system. Without a tunable bottom-up attention system and with top-down
aspects that are limited to a number of object-specific features for classification, no dynamic
preselection of image regions is performed.
With respect to attention-based approaches for the vehicle domain, a saliency-based
traffic sign detection and recognition system was proposed by [Ouerhani, 2003]. A fur-
ther biologically inspired system approach has been presented by [Farber, 2005]. This
publication as well as the recently started German Transregional Collaborative Research
Centre “Cognitive Automobiles” [Stiller et al., 2007] address mainly human-inspired be-
havior planning whereas our work focuses more on task-dependent perception aspects.
A vision system approach that is in some aspects related to the ADAS presented here is described by [Matzka et al., 2008]. Published after our work (see, e.g., [Michalke et al., 2007]), the approach allows for a simple attention-based decomposition of road scenes but without incorporating object knowledge or pre-knowledge. Additionally, the overall system organization is not biologically motivated and hence shows limitations in its flexibility.
For assisting the driver over the full range of driving tasks in all kinds of challenging situations and going beyond simple reactive behaviors, a more sophisticated task-dependent processing strategy is required. We see an adequate organization of perception using a generic vision system as a major challenge on the way to this target. When assessing biological vision systems, it becomes apparent that they are highly flexible and capable of adapting to severe changes in the task and/or the environment. Hence, one of our design goals on the way to an “all-situation” ADAS is to implement a biologically motivated, cognitive vision system as the perceptual front-end of an ADAS, which can handle the wide variety of situations typically encountered when driving a car. Note that only if an ADAS vision system attends to the relevant surrounding traffic and obstacles will it be fast enough to assist the driver in real-time during all dangerous situations.
More specifically, one possible biologically inspired way to solve this challenge is to realize
a task-dependent perception using top-down links. In this paradigm, the same scene can
be decomposed in different ways depending on the current task. A promising approach is
to use an attention system that can be modulated in a task-oriented way, i.e., based on
the current context. For example, while driving at high speed, the center of the field of
view becomes more important than the surround. Furthermore, only if the vision system attends fast enough to the relevant parts of the surrounding traffic and obstacles will it be able to assist the driver in all dangerous situations.
The computational model of the human attention system described in Chapter 3 is used
as front-end of a biologically inspired driver assistance system that determines the “how”
and “when” of scene decomposition and interpretation.
Recently, some authors have stressed the role of incorporating context into attention-based scene analysis. For example, [Torralba, 2003] proposes a combination of a bottom-up saliency map and a top-down context-driven approach. The top-down path uses spatial statistics, which are learned during an offline learning phase, to modulate the bottom-up saliency map. This differs from the system described here, where no offline spatial prior learning phase is required. In our online system, context is incorporated in the form of top-down weights that are modified at run time as well as by fusing road information.
To our knowledge, in the car domain no task-dependent tunable vision system that
mimics human attention processes exists.
5.2 Advanced Driver Assistance on Highways
Based on the paradigm of a task-dependent tunable vision system, Sect. 5.2 describes a
vision architecture that is being developed as perceptual front-end of an ADAS. The pro-
posed system provides a framework that enables task-dependent tuning of visual processes
via object-specific weighting of input features of the attention system described in Chap-
ter 3. The system generates an appropriate reaction in dangerous situations (autonomous
braking). Its architecture is inspired by findings of human visual system research and or-
ganizes the different functionalities in a similar way. For a first proof of concept, we focus
on assisting the driver during a critical situation in a construction site. The system has
been implemented using a software framework for component integration and is evaluated
on a number of test streams. It achieves real-time performance on a prototype car, which
has been demonstrated live on a testing range.
Section 5.2 is organized as follows: In Section 5.2.1 an overview of the system architecture and its individual components is provided. For the analysis of the attention system, we evaluated the construction site scenario to illustrate the performance of the top-down approach in a complex environment. The obtained results, demonstrating the feasibility and benefits of top-down attention in a complex ADAS, are described in Sect. 5.2.2.
5.2.1 System Description
In the following, a rough overview of the implemented vision system structure for driver
assistance is given. Subsequently, crucial system parts are described in more detail.
Overview
The overall architecture concept to realize task-based visual processing is depicted in
Fig. 5.1. It contains a distinction between a “what” and a “where” processing path,
somewhat similar to the human visual system where the ventral and dorsal pathway are
typically associated with these two functions. Among other things, the “where” path-
way in the human brain is believed to perform the localization and coarse tracking of a
small number of objects that are relevant to the current task. This tracking is performed
by the human visual system without focusing the eye gaze on individual objects to be
tracked [Cavanagh and Alvarez, 2005], i.e., tracking does not require high resolution. In
contrast, the “what” pathway considers the detailed analysis of a single spot in the image.
In the human visual system this is intimately bound to the current eye gaze, as the human
eye possesses a high resolution in the central 2-3◦ (foveal retina area) of the visual field
only.
In our vision system the eye gaze is realized virtually, as the camera mounted in the car has a constant resolution over the complete field of view. Changing the eye gaze is therefore equivalent to shifting the processing to another spot of the input image. This spot is analyzed in the “what” pathway in full resolution, while the whole image is analyzed in the “where” path in lower resolution. Processing in these two pathways is believed to occur in parallel in the human brain, but their interactions are not yet known in much detail. We here adopt the idea of continuously tracking a small number of objects in each image of the incoming visual stream to coarsely represent the current scene while at the same time acquiring more detailed information on one additional object. We therefore have two analysis processes running in parallel in our system, indicated by the two circular arrows in Fig. 5.1.
The detailed organization of the two processing streams in our architecture concept is
as follows: The input image is analyzed in the “what” path (depicted left in Fig. 5.1) for
salient locations using a variety of visual features including orientation, intensity, color,
and motion. This visual attention combines bottom-up (BU) and top-down (TD) path-
ways and is described in full detail in Chapter 3. The resulting saliency map S_total is modulated by suppressing image regions that contain known objects, i.e., that have been detected earlier. The system stores all detected objects in a so-called Short Term Memory (STM) that provides the position information of known objects as a top-down link. The
suppression of saliency areas is also known as Inhibition of Return (IoR) in the human
visual system [Klein, 2000]. The performance gain of using this IoR approach and the
influence on the STM will be shown in Sect. 5.2.2. A simple maximum search is used on
the resulting saliency map to find the currently most salient point in the scene, the Focus
of Attention (FoA). At this position the Region of Interest (RoI) is determined by region
growing on the overall saliency map using the FoA as seed. In the final step of the “what”
cycle, the resulting RoI as well as its position (pos) are fed to the fast feed-forward object
recognition system described in the following subsections (see page 106).
After object recognition, the image region, its position, and the object label (pos, RoI,
ID) are stored in the STM in order to be coarsely tracked in subsequent images in the
“where” path. Before insertion, it is checked whether the new object can be associated with
a known object based on its position, size, and label; if a matching object is found, the
object already stored in the STM is updated. One iteration is concluded by calculating
distance (dist) for all objects in the STM based on fusing measurements from Radar,
depth from familiar object size (i.e., object knowledge, see [Palmer, 1999]), and depth
Figure 5.1: Architecture concept of vision-based driver assistance system.
from bird’s eye view (see [Broggi, 1995] for the computation of the bird’s eye view) using an Extended Kalman Filter. Details on the Extended Kalman Filter are given in the following subsections (see page 107). The distance information is stored in a separate egocentric representation that is directly suitable for calculating the current danger level and generating a warning message if necessary.
All objects contained in the STM are constantly tracked in the “where” path using an appearance-based tracker. The tracker uses a second order motion model for predicting
object positions on the image plane and a local correlation step for the refinement of the
new object positions. In each iteration the position is updated in the STM and a new
template RoI is stored. In case the prediction does not match (no good correlation found)
the object is deleted from the STM and therefore its position will not be inhibited in the
“what” pathway anymore. Consequently, the attention will be focused on the missing
object in one of the next images if the object is still present and salient. This way, all
objects being recognized and behaving as predicted are coarsely tracked while the “what”
attention is always focused on new objects and objects behaving unexpectedly.
However, objects that can be tracked should not remain in the STM forever, as this would mean that the system cannot correct a wrong object label. This is avoided by deleting an object from the STM after N frames, i.e., objects have a lifetime of N frames.
of N frames. This is equivalent to limiting the capacity of the STM to N objects in scenes
with more than N objects. Note that the rather simple tracking method is sufficient for
many applications in the automotive domain where most objects are rigid (e.g., a car) and
therefore the main appearance changes are limited to small translations and scalings.
The novelty of our architecture lies in the introduction of top-down aspects (like, e.g.,
task-dependent tunable attention generation via sets of weights and, in parallel, inhibiting
known object positions predicted by tracking) resulting in the ability to cope with highly
dynamic traffic scenes using limited computational resources. The top-down tunable atten-
tion system is a key aspect of our ADAS, since such preprocessing leads to a considerable
reduction of scene complexity by restricting further processing steps to image regions that
are interesting according to the current system task. This not only saves computational resources but also implicitly reduces the number of false positive detections, as, e.g., the object classifier only gets RoIs that are likely to be a car based on their current saliency profile.
Attention Sub-System
The ADAS uses the biologically motivated attention sub-system as its generic visual front-
end (see Chapter 3 for a detailed description) that is tuned by applying 136 independent
modifiable feature weights. As BU weights w_i^BU we choose a set of weights that shows good performance for most situations in the car environment. In the object-unspecific bottom-up path no inhibition takes place (i.e., feature maps are only added up), since its purpose is to evaluate the general unspecific saliency of a scene. For modulating the TD attention in the ADAS described here, we currently use TD weight sets for signal boards and cars (w_i,sigboard^TD and w_i,car^TD) that were calculated in a supervised training step using Equ. (3.4) on page 59. In Sect. 5.3 this concept is extended to allow calculating these weights dynamically at runtime to track known objects and search for new objects.
The overall saliency map S_total, the output of the attention system, is calculated by linearly combining the normalized bottom-up S_BU and top-down S_TD saliency maps, dependent on the current task of the ADAS, using the parameter λ. With increasing λ, the top-down saliency contributes more to the final saliency map, leading to a focus of attention on specific objects. The overall saliency map is passed on to the FoA generation.
Object Recognition
For object recognition we use a view-based approach, where we perform classification only
on the image patch provided by the FoA segmentation. Note that object recognition
operates on the original image resolution of 800×600 pixels, i.e., the RoI position and size
provided by the saliency system are transformed appropriately.
The object recognition module is based on the biologically motivated processing architec-
ture proposed in [Wersing and Korner, 2003]. It uses a strategy similar to the hierarchical
processing in the “what” pathway of the human visual system by creating a classification
hierarchy. Unsupervised learning is used for the lower levels of the hierarchy to determine
general features that are suitable for representing arbitrary objects robustly with regard
to local invariance transformations like local shift and small rotations. Only at the highest level of the hierarchy is object-specific learning carried out, i.e., only this layer has to
be trained for different objects. This architecture can be applied to the difficult case of
segmentation-free recognition that we have to deal with as the saliency segmentation only
provides an approximate RoI with rectangular shape and no object-specific segmentation.
Training is done by presenting several thousand color RoI images with changing back-
grounds for back views of cars and signal boards (see also [Gepperth et al., 2007]). The
learning algorithm automatically extracts the relevant object structure and neglects the
clutter in the surround. The output of the classifier is the identity (ID) of the recognized object and a confidence value; a threshold is used to reject object hypotheses with low confidence. The threshold is chosen so that only a small number of false positives can
occur for cars, as a wrong car detection could lead to a false emergency braking. If a car is
not recognized due to the high threshold, it is stored in the STM as unknown and tracked
for N frames before it is removed from the STM. Subsequently, if the car is still a salient object, a new FoA will be generated and recognition is performed again. As the car may now be closer due to the ego motion of our vehicle, the image patch may be larger and may therefore yield a higher confidence, resulting in a correct recognition.
As described in Sect. 5.2.1, the presented system uses, in the “what” pathway, a cascade of attention-based object detection followed by appearance-based object classification. According to [Neisser, 1967], object recognition in human perception is organized in the same way. As argued above, the central hypothesis regarding the attention-based preselection presented here is that it saves computation time and lowers the number of false positive classifications due to the high relevance of the input data at the classifier stage. However, the question arises whether, in terms of computational demands, the approach is superior to an exhaustive classification of the whole image (e.g., by classifying overlapping image patches). As argued in [Frintrop, 2006], in case of a complex and thereby slow classifier, the advantages of an attention system are obvious. Since in the vehicle domain false detections might have severe consequences, a reliable and hence complex classifier [Wersing and Korner, 2003] was applied in the presented system.
Even for applications that allow the usage of fast (and less reliable) classifiers, such as Viola-Jones (see [Viola and Jones, 2001]), the usage of an attention system saves computational resources, as was shown in [Frintrop, 2006]. The results gathered there show that already in case of more than one object class, the computation time needed by the attention system is compensated by the need for fewer classifier cycles. Furthermore, based on numerous experiments, [Frintrop, 2006] could show that the number of false classifications is reduced when using an attention system for preselecting image regions as compared to applying exhaustive classification.
Depth Cues
The current ADAS uses the four independent depth sources introduced in Sect. 2.2 (see
Fig. 5.2) that are combined using weak fusion (see [Landy et al., 1995]). Weak fusion
combines the depth sources based on the reliability of the specific cues. It is realized
here using an Extended Kalman Filter (EKF) that combines at each time step the cues
via dynamic weights depending on the static sensor variances (calculated offline) and the
currently available depth sources. Note that not every cue is available in each time step.
The EKF uses a second order process model for its prediction step that models the relevant
kinematics in the car domain (velocity and acceleration). The cues show strong differences
in accuracy (especially depth from bird’s eye view and object knowledge show a high
variance). However, this is uncritical, since the sensor variances (that were determined
offline) are taken into account during the EKF-based sensor fusion, see also [Fritsch et al.,
2008]. The resulting depth values are assigned to detected objects in the image. In the
presented ADAS the following depth sources are used for fusion in the EKF (see Sect. 2.2
for more details on these depth cues):

- Depth from Radar,
- Depth from bird’s eye view,
- Depth from object knowledge,
- Depth from Stereo.
A prerequisite for depth from object knowledge is a reliable segmentation algorithm.
Currently we use histogram-based segmentation on an image region that is pre-segmented
by our region growing algorithm working on the saliency (see Fig. 5.2c).
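As a simplified scalar sketch of this variance-weighted weak fusion (the real system uses a full EKF with a second order process model; the function below is an illustrative assumption, not the thesis implementation):

    def weak_fusion_update(estimate, variance, cues):
        # cues: list of (measurement, sensor_variance) pairs for the depth
        # sources available in this time step; missing cues are simply omitted
        for z, r in cues:
            k = variance / (variance + r)  # gain: low-variance cues dominate
            estimate += k * (z - estimate)
            variance *= (1.0 - k)
        return estimate, variance

Fusing, for instance, a low-variance Radar measurement with a high-variance bird’s eye view estimate moves the result mostly toward the Radar value, matching the behavior described above.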
Figure 5.2: Used depth cues: Depth from (a) Radar, (b) Bird’s eye view, (c) Object knowledge.
5.2.2 Experiments and Results
Evaluation of Depth Fusion
Figure 5.3 shows the EKF-based fusion of depth measurements for a car that drives in front
of our prototype vehicle through an inner-city scenario (see Fig. 5.2). For the EKF we
used the sensor variances σ_radar = 0.3, σ_birds = 2.8, and σ_obj = 2.7 as well as the process variance σ_process = 0.023 for the prediction step. Note that the usage of two additional
monocular depth cues of high variance fused with the low variance Radar cue ensures the
availability of depth values even if the interesting objects are outside of the Radar beam.
Figure 5.3: Depth from bird’s eye view, object knowledge, Radar and fusion with EKF.
Experimental Setup for System Evaluation
Scenario: In order to evaluate the proposed system in a challenging situation, we concen-
trate on typical construction sites on highways. This situation is quite frequent and a traffic
jam ending exactly within a construction site is a highly dangerous situation: due to the
S-curve in many construction sites, the driver will notice a braking or stopping car quite
late as the signal boards limit the field of view (see Fig. 5.4a). Our ADAS implementation
uses a 3-phase danger handling scheme depending on the distance and relative speed of a
recognized obstacle. For example, when the vehicle drives at around 40 km/h and a static
obstacle is detected in front at less than 33 meters, in the first warning phase a visual
and acoustic warning is issued and the brakes are prepared. If the dangerous situation is
not resolved by the human driver, the second phase triggers the belt pretensioner and the
brakes are engaged with a deceleration of 0.25 g followed by hard braking of 0.6 g in the
third phase.
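A minimal sketch of such a 3-phase scheme is given below; only the 33 m warning distance at roughly 40 km/h and the 0.25 g / 0.6 g decelerations come from the text, while the phase-2 and phase-3 distance boundaries are purely illustrative assumptions:

    def danger_phase(distance_m, warn_dist_m=33.0):
        # Phase boundaries below warn_dist_m are assumed for illustration
        if distance_m >= warn_dist_m:
            return "no action"
        if distance_m >= 0.66 * warn_dist_m:
            return "phase 1: visual/acoustic warning, prepare brakes"
        if distance_m >= 0.33 * warn_dist_m:
            return "phase 2: belt pretensioner, brake at 0.25 g"
        return "phase 3: hard braking at 0.6 g"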
Figure 5.4: Scenario: (a) Schematic sketch of the construction site scenario. Stationary car
is visible from 48 meters on. (b) Real scenario.
Technical setup: For the experiments we used a Honda Legend prototype car equipped
with a mvBlueFox CCD color camera from Matrix Vision delivering images of 800×600
pixels at 10 Hz. The image data as well as the Radar and vehicle state data from the CAN
bus can be recorded. The recorded data is used during offline evaluation. For online pro-
cessing all data is transmitted via Ethernet to two laptops (2 GHz Core Duo) running our
RTBOS (Real-Time Brain-like Operation System) integration middleware [Ceravola et al.,
2006] on top of Linux. The individual RTBOS components are implemented in C using an
optimized image processing library based on the Intel Performance Primitives (IPP) [Intel, 2006].
Test data for training and evaluation: In order to gain sufficient training data and
for evaluating the actual system performance, we set up an exemplary construction site on
a private driving range where we recorded data and performed the actual online tests.
Influence of Parameters on Detection Performance
All results described in the following are obtained by averaging over 10 recorded streams in order to lessen the influence of statistical outliers. As performance metric we will use the detection
distance as this is a good indicator for the efficiency of the saliency system in analyzing
complex visual scenes under time constraints. As in each time step of the system running at 10 Hz one FoA is analyzed in the “what” pathway and potentially added to the STM, we will use [frames] (equivalent to 1/10 second) as the time unit.
In the first step the object detection distance is evaluated depending on STM size N and
the TD parameter λ (setting the amount of TD influence) while using a TD weight set
trained on cars. Figure 5.5 shows the distance to the stationary car when the first FoA hits
Figure 5.5: Stationary car detection distance depending on the TD attention parameter λ=0,
0.25, 0.5, 0.75, and 1 as well as the STM size N=1,2,3,5, and 7 when using ground truth for
detecting a hit.
the car, where a hit is defined by hand-labeled ground truth on the recorded streams. It can be
seen that the larger the TD influence (search task: find cars) expressed by λ, the earlier
the car is detected. Similarly, the more objects are stored in the STM (object number N),
the earlier the car is detected as a large part of the visual scene is already contained as
(unknown) objects in the STM and therefore inhibited in the saliency map. It can also
be deduced that with growing N the influence of TD is reduced since the scene coverage
increases.
Including the task of object recognition in the evaluation, Fig. 5.6 shows the distance to
the stationary car when the first FoA hits the target and this RoI is recognized as car by
the object classifier. Since the used classification threshold was set high to obtain a low
false-positive error rate at the cost of a high false-negative error rate, the distance when the
car is detected is smaller than in the evaluation with ground truth. Differing from Fig. 5.5, at large values of N (see Fig. 5.6 for N=7) the detection distance worsens again. The reason for this effect is that our system does not use object segmentation algorithms but performs segmentation directly on the saliency image, which can lead to enlarged patches that suppress the surround of the found objects as well. In this way, the borders of the car might be suppressed by adjacent signal board patches, leading to incomplete car FoAs that are not sufficient for correct classification by the used object classifier. The likelihood of this happening grows with the size N of the STM. However, a growing N also improves the scene coverage. This trade-off leads to the measured results.
Figure 5.6: Stationary car detection distance depending on the TD attention parameter λ=0,
0.25, 0.5, 0.75, and 1 as well as the STM size N=1,2,3,5, and 7 when using the classifier for
detecting a hit.
Based on Fig. 5.6 the best choice of λ for detecting cars would be 1, which equals pure
TD search mode. However, such a parameterization is not appropriate because this leads to
a reduced capability of detecting other objects that are only prominent in the BU saliency
(see Fig. 5.7). Here we see that with growing λ the average detection distance of signal
boards (the only other object class besides cars in our evaluation) drops. Stated differently,
the system ignores all other objects while searching for cars in pure TD mode (λ = 1),
which might lead to dangerous situations. The default value for λ was hence set to 0.5 for
the online tests.
Figure 5.7: Detection distance depending on the TD attention parameter λ=0, 0.25, 0.5,
0.75, and 1. Average detection distance of signal boards and the stationary car using the
object classifier for an STM size of N=1,2, and 5.
In the previous evaluations we assumed that the scene contains more than N objects and used a fixed STM size, which is equivalent to storing any object for N frames independent of, e.g., whether it was correctly recognized. We now introduce an object-specific Time To Live (TTL) defining for how many frames an object is stored in the STM before it is removed. In this way, unknown objects can be tracked for only a short time before a new recognition attempt is carried out if the image region is still salient. Figure 5.8 shows how the choice of the TTL influences the system performance. For an object-unspecific TTL of 5 frames the curve is identical to Fig. 5.7 for N=5. For the object-specific case we choose TTL_sigboard = 6 frames for signal boards, TTL_cars = 20 frames, and TTL_unknown = 3 frames, leading on average to N=5 objects in the STM for the construction site streams. Note that the low value of TTL_unknown and the high value of TTL_cars both support setting the object recognition threshold high: it is very likely to get an unknown object that is in fact a car (a false negative), but rather unlikely to get a car detection that is a false positive.
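A hedged sketch of this object-specific TTL bookkeeping (the TTL values are the frame counts from the text; the data layout and function name are illustrative assumptions):

    TTL = {"sigboard": 6, "car": 20, "unknown": 3}  # lifetime in frames

    def age_stm(stm):
        # stm: list of dicts with keys "label" and "age" (frames in the STM)
        kept = []
        for obj in stm:
            obj["age"] += 1
            if obj["age"] < TTL.get(obj["label"], TTL["unknown"]):
                kept.append(obj)  # keep tracking and inhibiting this object
            # else: the object is dropped, its region becomes salient again,
            # and a new recognition attempt can be made
        return kept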
A clear gain in detection performance can be seen when using object-dependent TTL
Figure 5.8: Stationary car detection distance depending on the TD attention parameter λ=0,
0.25, 0.5, 0.75, and 1 while using object-unspecific and object-specific TTL values.
values, which is due to the fact that FoAs that hit the car very early are often too small for a reliable classification. These unknown scene parts are suppressed only for 3 frames before the classifier gets a second chance to detect the car. This object-specific TTL parameterization was used during the online tests described below.
Evaluation of System Performance
We evaluated the warning generation in detail offline on the 10 recorded construction site streams also used for evaluation in the previous subsection. In all streams, the ADAS was able to recognize and track the car from a distance of between 42 and 32 meters, while the car was fully visible from a distance of about 48 meters.
During documented online system tests in the setting depicted in Fig. 5.4, with our prototype vehicle driving 40 km/h, our system detected the stationary car in time in 57 of 60 cases and issued the 3 warning phases as expected, including autonomous braking. In the remaining cases, either the object recognition detected a signal board as a car and the braking was performed too early, or the FoA generation did not deliver a good car RoI position, so that the fusion of the car RoI with Radar data failed and no warning/braking was performed at all. Note that in our vision-based proof-of-concept system we completely rely on vision and do not make use of an additional Radar-based emergency braking, which would be needed in real traffic as a backup for situations in which our vision system fails.
In the following Sect. 5.3, an extended ADAS is presented that improves the system
presented so far in several aspects.
5.3 Advanced Driver Assistance in Inner-City
In this section, we present a highly integrated vision architecture for an advanced driver
assistance system inspired by human cognitive principles. As in Sect. 5.2, the system uses
an attention system as the flexible and generic front-end for all visual processing, allowing
a task-specific scene decomposition and search for known objects (based on a short term
memory) as well as generic object classes (based on a long term memory). Knowledge
fusion, e.g., between an internal 3D representation and a reliable road detection module, improves the system performance. The system heavily relies on top-down links to modulate
lower processing levels, resulting in a high system robustness.
While Sect. 5.2 concentrated mainly on the usage of saliency-based attention in the system context (see also [Fritsch et al., 2008, Michalke et al., 2009a, 2008a]), this section describes the additional incorporation of environmental 3D representations and static domain-specific tasks, in order to use context information (“where is the road”) to guide attention and, therefore, the analysis of the overall scene (see also [Michalke et al., 2008b]). For all acquired information our enhanced system builds up internal 3D representations that support scene analysis and at the same time serve for behavior generation. Using a metric representation of the road area in combination with detected traffic objects, the system can focus its processing on relevant objects in the context of the current road area. For example, this allows the system to warn and brake autonomously if a parked car is detected on our lane and, while by-passing it, the pro-actively adapted attention detects oncoming traffic on the road.
5.3.1 System Description
The proposed overall architecture concept for a robust attention-based scene analysis is
depicted in Fig. 5.9. It consists of four major parts: the “what” pathway, the “where”
pathway, a part executing static domain-specific tasks, and the behavior generation. The
distinction between “what” and “where” processing path is somewhat similar to the hu-
man visual system where the dorsal and ventral pathway are typically associated with
these two functions (see, e.g., [Palmer, 1999]). Among other things, the “where” pathway
in the human brain is believed to perform the localization and tracking of a small number
of objects. In contrast, the “what” pathway considers the detailed analysis of a single spot
in the image (see theories of spatial attention, e.g., spotlight theory [Palmer, 1999]). Nev-
ertheless, an ADAS also requires specific information of the road and its shape, generated
by the static domain-specific part.
The “What” Pathway
Starting in the “what” pathway the 400x300 color input image is analyzed by calculating
the saliency map Stotal. The saliency map Stotal results from a weighted linear combination
Figure 5.9: System structure allowing attention-based scene analysis (see page 117 for re-
maining system graph).
of N = 136 biologically inspired input feature maps Fi:
S_total = λ Σ_{i=1}^{N} w_i^TD F_i + (1 − λ) Σ_{i=1}^{N} w_i^BU F_i .        (5.1)
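A minimal Python sketch of Equation (5.1); all names are illustrative, as the actual system is implemented in C:

    import numpy as np

    def total_saliency(features, w_td, w_bu, lam):
        # features: list of N feature maps F_i (2D arrays of equal shape)
        # w_td, w_bu: TD and BU weight vectors of length N
        # lam: relative importance of TD vs. BU search, lambda in [0, 1]
        td = sum(w * F for w, F in zip(w_td, features))
        bu = sum(w * F for w, F in zip(w_bu, features))
        return lam * td + (1.0 - lam) * bu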
More specifically, we filter the image using, among others, Difference of Gaussians (DoG) and Gabor filter kernels that model the characteristics of neural receptive fields measured in the mammalian brain. Furthermore, we use the RGBY color space [Frintrop, 2006] as attention feature, which models the processing of photoreceptors on the retina (see Sect. 2.1.3 for details on the computation of the color feature). Additionally, with the incorporation of differential images and an approach for the detection of moving objects, dynamic features are included in the system (see Sect. 2.3). All features are computed on 5 scales relying on the well-known principle of image pyramids in order to allow computationally efficient filtering (see Annex A.1). All feature maps are postprocessed non-linearly in order to suppress noise and boost conspicuous or prominent scene parts (see Sect. 3.3 and [Michalke et al., 2008c] for a detailed description of these nonlinear processing steps).
The top-down (TD) attention can be tuned (i.e., parameterized) task-dependently to search for specific objects. This is done by applying a TD weight set w_i^TD that is computed and adapted online based on Equation (5.2), where the threshold is φ = K_conj · max(F_i) with K_conj ∈ (0, 1] (see Fig. 5.10a for a visualization). Equation (5.2) is equivalent to Equ. (3.4)
on page 59 in Chapter 3, but is reformulated to match the following description of the online weight computation. The weights w_i^TD dynamically boost feature maps that are important for our current task/object class in focus and suppress the rest. The bottom-up (BU) weights w_i^BU are set object-unspecifically in order to detect unexpected, potentially dangerous scene elements. The parameter λ ∈ [0, 1] (see Equation (5.1)) determines the current relative importance of TD and BU search in the system. For more details on the attention system please refer to [Michalke et al., 2008a] and Chapter 3. It is important to note that the TD weights (calculated using Equation (5.2)) depend on the features present in the background (rest) of the current image, since the background information is used to differentiate the searched object from the rest of the image [Frintrop, 2006]:
w_i^TD = {  m_RoI,i / m_rest,i      for m_RoI,i / m_rest,i ≥ 1
         { −m_rest,i / m_RoI,i      for m_RoI,i / m_rest,i < 1        (5.2)

with m_{RoI,rest},i = ( Σ_{(u,v) ∈ {RoI,rest}} F_i(u,v) ) / size of region {RoI,rest}

and F_i(u,v) = { F_i(u,v)   ∀ (u,v) with F_i(u,v) ≥ φ
              { 0           else.
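The following Python sketch mirrors Equation (5.2); the boolean RoI mask and the K_conj value are illustrative assumptions:

    import numpy as np

    def td_weight(F, roi_mask, k_conj=0.5):
        # Threshold phi = K_conj * max(F_i) suppresses weak activations
        phi = k_conj * F.max()
        Ft = np.where(F >= phi, F, 0.0)
        m_roi = Ft[roi_mask].mean()    # mean activation on the object (RoI)
        m_rest = Ft[~roi_mask].mean()  # mean activation on the background
        # Boost features that separate the object from the background and
        # give a negative weight to those dominated by the background
        return m_roi / m_rest if m_roi >= m_rest else -m_rest / m_roi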
117
5 Integrated System Approaches for Scene Interpretation
Because of this, it is not sufficient to store the TD weight sets w_i^TD of different object classes directly and switch between them during online processing. Instead, an aggregated form of all object feature maps F_i,RoI is stored (equivalent to the value m_RoI,i in Equ. (5.2)). To compensate for the dependency on the background, the stored object feature maps are fused with the feature maps of the current image before calculating the TD weights. In plain words, the system takes the current scene characteristics (i.e., its features) into account in order to determine the TD weight set that shows maximum performance in the current frame. Put differently, the described separability approach includes the current scene context on a sensory level.
As described in Sect. 5.2, we detect the maximum on the current saliency map S_total and get the focus of attention (FoA) by generic region-growing-based segmentation on S_total.
In the following, only the FoA is classified using a state-of-the-art object classifier that is
based on neural nets [Wersing and Korner, 2003]. This procedure (attention generation,
FoA segmentation and classification) models the saccadic eye movements of mammals,
where a complex scene is scanned and decomposed by sequential focusing of objects in
the central 2-3◦ foveal retina area of the visual field. The system uses a time integrating
mechanism to decide on the object class, in order to improve the reliability of the classifier
decision. More specifically, all detected objects are tracked and reclassified in the following
frames. On each frame a majority decision (voting) on the current and all stored classifier
results decides on the object class.
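A minimal sketch of this voting over stored classifier results (the data layout is assumed for illustration):

    from collections import Counter

    def voted_class(stored_labels):
        # stored_labels: classifier decisions for one tracked object from
        # the current and all past frames; the majority label wins
        label, _ = Counter(stored_labels).most_common(1)[0]
        return label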
The proposed system incorporates the biologically motivated concept of TD links. Based on these links, information on higher levels of knowledge integration modulates lower levels of knowledge integration. This brain-like concept improves robustness, increases the relevance of input data for higher system levels, and accelerates the system reaction (see evaluation results in Sect. 5.3.2). Our system uses such links for the task-specific modulation of the TD attention (i.e., by adapting system parameters online, e.g., the previously described TD weights w_i^TD) and for suppressing the detected road as context information in all feature maps F_i before fusing them into the overall saliency S_total. Additionally, TD links are used for the modulation of the attention based on detected car-like openings in the found drivable road segment (see “where” path in Fig. 5.9). Such car-like openings are detected by searching for car-sized holes in the road segment that is transformed to the metric bird’s eye view (for an example see Fig. 5.12d) by inverse perspective mapping. In a nutshell, the bird’s eye view is the representation of the scene as viewed from above, computed by transforming a monocular camera image taking intrinsic and extrinsic camera parameters into account (refer to [Michalke et al., 2008c] and Annex A.3 for more details).
The “Where” Pathway
The next step is the fusion between the newly detected object and the already known ones. The result is further processed in the “where” pathway and stored in the short term memory (STM). The objects in the STM are then suppressed in the currently calculated saliency map to enable the system to focus on new objects. The principle of suppressing known objects has been shown to exist in the human visual system as well and is termed inhibition of return (IoR); refer to [Klein, 2000] for details.
All known objects are tracked using a 2D tracker that is based on normalized cross correlation (NCC). The tracker gets its anchor (i.e., the 2D pixel position where the correlation-based object search on the new image is started) from a Kalman-filter-based prediction on the 3D representation, taking the motion of the camera vehicle and the tracked object into account. The predicted 3D position is transformed to 2D pixel positions (u,v) using a pinhole camera model that contains all intrinsic and extrinsic camera parameters (in detail, these are the 3 camera angles θX, θY, and θZ, the 3 translational camera offsets t1, t2, t3, the horizontal and vertical principal point coordinates u0 and v0, as well as the horizontal and vertical normalized focal lengths fu and fv); refer to Equations (A.1) and (A.2).
In case the NCC tracker is able to re-detect the object in 2D pixel coordinates, the 3D
position in the representation is updated using 4 different depth cues for the 2D pixel (u,v)
to 3D world (Xobj, Yobj, Zobj) transformation. More specifically, our system uses stereo
data, Radar, depth from object knowledge, and depth from bird’s eye view (see Fig. 5.12
and [Fritsch et al., 2008, Michalke et al., 2008c] for more details on these cues). The cur-
rently available depth cues are combined using the biologically motivated principle of weak
fusion (see [Landy et al., 1995]). Weak fusion combines the depth sources based on their
reliability (i.e., sensor variances). The fusion is realized using an Extended Kalman Filter
(EKF) that combines the cues based on dynamically adapted weights depending on the
static predefined sensor variances and the currently existing depth sources, as not every
cue is available in each time step. The EKF uses a second order process model for its
prediction step that models the relevant kinematics of the car (velocity and acceleration).
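The reliability-based weighting underlying the weak fusion can be illustrated by the following sketch, which reduces the EKF update to a plain inverse-variance weighted mean over the currently available cues; the variance values are made-up examples, not the offline-determined ones of the real system.

import numpy as np

# assumed per-cue measurement variances in m^2 (illustrative values only)
SENSOR_VAR = {"stereo": 0.5, "radar": 0.1,
              "object_knowledge": 4.0, "birds_eye_view": 6.0}

def fuse_depth(measurements):
    """measurements: cue name -> measured depth in m; only the cues
    available in the current time step are passed in."""
    w = np.array([1.0 / SENSOR_VAR[cue] for cue in measurements])
    z = np.array(list(measurements.values()))
    return float(np.dot(w, z) / w.sum())  # reliable cues dominate

print(fuse_depth({"stereo": 26.1, "radar": 25.4}))  # close to the Radar value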
Objects whose updated position leaves the represented surrounding scene or whose Kalman
variances are too high (i.e., they received no new measurements for several frames) are
deleted from the STM. The concept of appearance-based 2D tracking (analysis of motion
in 2D) supported by a 3D representation (interpretation of motion in 3D) was found in
humans as well [Palmer, 1999]. From a technical point of view, the advantage of this
approach is the simple correction of the ego motion relying on the internal 3D representa-
tion. The vehicle ego motion (translations ∆Xe and ∆Ze, as well as the change of the yaw
angle ∆θX) is determined based on a standard single track model and compensated in the
Kalman prediction step (see Equation (5.3) and (5.4) for the state vector E and process
model A):
$$E = \begin{bmatrix} Z_{obj} & X_{obj} & v_{Z,obj} & v_{X,obj} & 1 \end{bmatrix}^{T} \qquad (5.3)$$

$$A = \begin{bmatrix}
\cos(\Delta\theta_X) & \sin(\Delta\theta_X) & T & 0 & -\Delta Z_e \\
-\sin(\Delta\theta_X) & \cos(\Delta\theta_X) & 0 & T & -\Delta X_e \\
0 & 0 & \cos(\Delta\theta_X) & \sin(\Delta\theta_X) & 0 \\
0 & 0 & -\sin(\Delta\theta_X) & \cos(\Delta\theta_X) & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix} \qquad (5.4)$$
Therefore, no computationally intensive optical-flow-based prediction is needed. The
main source of strong object motion in the 2D image is removed by correcting for the
ego-motion-based position change of the objects, which eases the tracking task considerably.
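A minimal sketch of this ego-motion-compensated prediction step is given below, assuming the state of Equation (5.3) augmented by a constant 1 so that the affine offsets of Equation (5.4) apply; T denotes the cycle time and the increments come from the single track model.

import numpy as np

def predict_state(E, dtheta, dZe, dXe, T=0.1):
    """E: state [Z_obj, X_obj, v_Z, v_X, 1]; dtheta: yaw angle change;
    dZe, dXe: ego translation between two frames (Equation (5.4))."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    A = np.array([[ c,  s,  T,  0, -dZe],   # rotate and shift the position
                  [-s,  c,  0,  T, -dXe],
                  [ 0,  0,  c,  s,  0  ],   # rotate the velocity part
                  [ 0,  0, -s,  c,  0  ],
                  [ 0,  0,  0,  0,  1  ]])  # keep the homogeneous 1
    return A @ E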
A comparison between the current Kalman-fused 3D object position Pt = [Zobj, Xobj]
and the predicted 3D object position P′t decides, based on the state variances σ²_{Pt}
and σ²_{P′t}, if the tracked object is static or dynamic (see Fig. 5.10b). P′t is calculated by an
ego-motion-based prediction starting from the stored Kalman-fused value Pt−4. For the
comparison, βth is used as a threshold on the measure β(Pt, P′t) defined in:
$$\beta(P_t, P'_t) = \left\lVert \frac{P_t - P'_t}{\sqrt{\left|\sigma^2_{P_t}\right| + \left|\sigma^2_{P'_t}\right|}} \right\rVert \qquad (5.5)$$
The calculated measure is motivated by a statistical parameter test that checks for
the equality of two distributions. It showed good performance on various test streams. If
β(Pt, P′t) is bigger than βth (i.e., the object is detected to be dynamic), the Kalman filter
receives the object ego motion v_{Z,obj} ≠ 0 and v_{X,obj} ≠ 0, derived from the integrated
object position change D^{obj,ego}_t, as measurement (see Fig. 5.10b).
Figure 5.10: (a) Visualization of the object training region (RoI) for TD weight calculation
against the background (rest), (b) Prediction of object ego motion (dots: Kalman tracked
object position, squares: predicted object position including measured ego motion, dashed line:
accumulated object ego motion)
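A compact sketch of the static/dynamic decision of Equation (5.5) could look as follows; the threshold value βth used here is an illustrative assumption.

import numpy as np

def is_dynamic(P_t, P_pred, var_t, var_pred, beta_th=2.0):
    """P_t: Kalman-fused position [Z, X]; P_pred: ego-motion-only
    prediction from P_{t-4}; var_t, var_pred: the state variances."""
    beta = np.linalg.norm((np.asarray(P_t) - np.asarray(P_pred)) /
                          np.sqrt(np.abs(var_t) + np.abs(var_pred)))
    return beta > beta_th  # True -> the object moves on its own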
From a representational point of view, the "where" pathway of our system consists on
the one hand of the STM, which stores all properties of sensed objects in a 3D representation,
and on the other hand of a long term memory (LTM), which stores the generic properties
of object classes. The LTM is filled offline with typical patches and corresponding feature
maps F_i of specific object classes. For evaluation purposes we use cars, reflection posts,
and signal boards as LTM content, but our system can detect any other object type as
well, if the attention and the object classifier are trained accordingly. In the default state
the system searches for the generic LTM object class car. This is done by calculating the
geometric mean of all TD weight sets of the LTM objects, which were computed based on
Equation (5.2). These weights tune the TD attention in the "what" pathway.
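The combination of the LTM weight sets can be sketched as follows; the geometric mean is taken per feature map over all stored patches of the searched class (assuming strictly positive weights from Equation (5.2)).

import numpy as np

def class_weight_set(ltm_weight_sets):
    """ltm_weight_sets: array of shape (num_patches, num_feature_maps)
    holding the TD weight sets of all LTM patches of one object class.
    Returns the geometric mean weight per feature map."""
    return np.exp(np.log(ltm_weight_sets).mean(axis=0))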
As described above, in case the tracker has re-detected the object in the current frame,
the 3D representation is updated. In case the tracker loses the object, the system searches
for the lost STM object in the following frames. This is realized by calculating a TD
weight set that is specific to the lost STM object using Equation (5.2). The object O_f
found by the STM search is then compared to the searched object O_s by means of the
distance measure δ(O_f, O_s), which is based on the Bhattacharyya coefficient (a measure for
determining the similarity between two histograms) calculated on the histograms H_i^{O_f}
and H_i^{O_s} of all N object feature maps:
$$\delta(O_f, O_s) = \sum_{i=1}^{N} \sqrt{1 - \gamma\!\left(H_i^{O_f}, H_i^{O_s}\right)} \qquad (5.6)$$

$$\gamma\!\left(H_i^{O_f}, H_i^{O_s}\right) = \sum_{\forall u,v} \sqrt{H_i^{O_f}(u,v) \, H_i^{O_s}(u,v)}$$
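A direct transcription of Equation (5.6) is sketched below; the histograms are assumed to be normalized to sum 1 per feature map, so the Bhattacharyya coefficient stays within [0, 1].

import numpy as np

def bhattacharyya_coeff(h1, h2):
    # gamma in Equation (5.6): overlap of two normalized histograms
    return float(np.sum(np.sqrt(h1 * h2)))

def object_distance(H_f, H_s):
    """H_f, H_s: lists of per-feature-map histograms of the found and
    the searched object; a small delta means similar objects."""
    return sum(np.sqrt(1.0 - min(bhattacharyya_coeff(a, b), 1.0))
               for a, b in zip(H_f, H_s))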
The LTM and STM object search run in parallel as visually indicated in Fig. 5.9. It is
important to note that our system is not restricted to the detection and tracking of cars,
reflection posts, and signal boards. By using different LTM object patches and by offline
training of our object classifier in combination with the generic concept of online tunable
TD attention our system is highly dynamic and flexible.
Static Domain-Specific Tasks
The third major part of our system handles the domain-specific tasks of marked and un-
marked lane detection. The marked lane detection is based on a standard Hough transform
whose input signal is generated by our generic attention system. The scale-selective TD
attention weight set used here boosts white and yellow structures on a darker background
(so-called on-off contrast), to which the biologically motivated DoG filter (see Sect. 2.1.1) is
selective. The yellow on-off structures are weighted more strongly than the white ones to
allow the handling of lane markings in construction sites.
The state-of-the-art unmarked lane detection evaluates a street training region in front
of the car and two non-street training regions at the side of the road (see Sect. 4.1). The
features (stereo, edge density, color hue, color saturation) in the street training region
are used to detect the drivable road based on dynamic probability distributions for all
cues. Additionally, region growing that starts at the street training region assures a crisp
distinction between the road and the sidewalk. The region growing uses dynamic, self-
adaptive thresholds that are derived from the feature characteristics in the street training
region as compared to the non-street training regions. No fixed parameters for detecting the
road are used, which makes the system adaptive to its environment and hence robust.
A temporal integration procedure between the current and past detected road segments
based on the bird’s eye view is used to increase the completeness of the detected street by
decreasing the number of false negative road pixels (see Sect. 4.2). More specifically, based
on the measured ego motion of the car the road segments detected in the past are shifted
and fused with the currently detected road segments. Refer to [Michalke et al., 2008c] for
a comprehensive description of the temporal integration procedure. In the final step, a
fusion between the marked and unmarked detected road segments is used to derive the
present drivable lanes.
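A strongly simplified sketch of the temporal integration described above is given below: the previous bird's-eye road map is moved by the measured ego motion and blended with the current detection. The pure translation, the grid shift, and the blend factor are assumptions; the full procedure is described in [Michalke et al., 2008c].

import numpy as np

def integrate_road(prev_bev, curr_bev, dz_cells, dx_cells, alpha=0.7):
    """prev_bev, curr_bev: road confidence grids in the metric bird's
    eye view; dz_cells, dx_cells: ego translation in grid cells."""
    # align the old map with the current ego pose; note that np.roll
    # wraps around at the borders, which a real implementation would
    # mask out with zeros
    shifted = np.roll(np.roll(prev_bev, dz_cells, axis=0),
                      dx_cells, axis=1)
    return alpha * curr_bev + (1.0 - alpha) * shifted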
Behavior Control
The system can interact with the world via a behavior control module. Currently our ADAS
implementation uses a three-phase danger handling scheme depending on the distance and
relative speed of a recognized obstacle (see also Sect. 5.2.2). When an obstacle is detected
in front at a rapidly decreasing distance, a visual and acoustic warning is issued and the
brakes are prepared. In the second phase the brakes are engaged with a deceleration of
0.25 g followed by hard braking of 0.6 g in the third phase. Other behaviors, like trajectory
planning and active steering, as well as the detection of possible collisions and their active
avoidance based on predictions on internal 3D representations are possible and planned in
the near future.
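The danger handling scheme can be sketched as a simple phase selection; only the deceleration values of 0.25 g and 0.6 g are taken from the text, while the time-to-contact thresholds are illustrative assumptions.

def danger_phase(distance_m, closing_speed_mps):
    """Select the danger handling phase from the obstacle distance and
    the relative (closing) speed."""
    ttc = distance_m / closing_speed_mps if closing_speed_mps > 0 else float("inf")
    if ttc < 1.0:
        return "phase 3: hard braking with 0.6 g"
    if ttc < 2.0:
        return "phase 2: braking with 0.25 g"
    if ttc < 3.5:
        return "phase 1: visual/acoustic warning, brakes prepared"
    return "no intervention"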
5.3.2 Experiments and Results
In the following, we will evaluate individual system modules that are most important for
our cognitive ADAS architecture. Also, the overall system performance will be assessed
based on the construction site scenario described in Sect. 5.2.2.
Evaluation of System Modules
Evaluation of attention sub-system: In order to evaluate the generic nature of the
attention-based TD search, we used cars and reflection posts (useful for unmarked road
detection as done, e.g., in [von Trzebiatowski et al., 2004]) as LTM search objects. The
results are depicted in Tab. 5.1, showing that incorporating TD information improves
the search performance considerably. Please note that when changing the LTM search
object, besides exchanging the LTM image patches and appropriately training the object
classifier, no modification of the system structure is required. For evaluation,
the measures average FoA hit number (Hit) and average detection rate (DRate) were
calculated. While DRate is the ratio of the number of found task-relevant objects to the
overall number of task-relevant objects, Hit states that the object was found on average
with the Hit’th generated FoA. Hence, the smaller Hit is, the earlier an object is detected
(see [Frintrop, 2006] for more details on these measures). The choice of training images has
only small influence on the search performance as the comparable results for different sets
of training images show (see Tab. 5.1). The evaluation shows the highest hit numbers and
detection rates for pure TD search (λ = 1). However, as will be discussed in the following
a combination of BU and TD influence is recommended in the attention system .
The presented results support the generic nature of the TD-tunable attention sub-system
during object search. Moreover, we see the attention system as a common tunable front-
end for the various other system tasks, e.g., lane marking detection (as described in
Sect. 5.3.1). Following this concept, the task-specifically tunable attention system can be used
for scene decomposition and analysis, as it is shown exemplarily on two typical German
highway scenes in Fig. 5.11.
122
5 Integrated System Approaches for Scene Interpretation
Table 5.1: Search performance for BU- and TD-based LTM object search for cars and reflection
posts for 2 different training sets.
Target                    # Test images   # Training   Hit (DRate)       Hit (DRate)
                          (objects)       images       pure BU (λ = 0)   pure TD (λ = 1)
Cars, self test           54              -            -                 1.53 (100%)
Cars, T.set 1             54 (58)         3            3.06 (56.9%)      1.82 (96.6%)
Cars, T.set 2             54 (58)         3            -                 1.74 (93.1%)
Reflect. posts, self test 56              -            -                 1.85 (66.3%)
Reflect. posts, T.set 1   56 (113)        6            2.97 (33.6%)      2.25 (52.2%)
Reflect. posts, T.set 2   56 (113)        7            -                 2.36 (52.2%)
Figure 5.11: Attention-based scene decomposition: (a) Highway scene, (b) TD attention
tuned to lane markings, (c) TD attention tuned to cars, (d) Construction site, (e) TD at-
tention tuned to signal boards, (f) TD attention tuned to cars.
123
5 Integrated System Approaches for Scene Interpretation
Evaluation of classifier performance: For a proof of concept, we trained the classifier
to distinguish cars from non-cars (clutter). A set of image segments generated by our
vision system during online operation was used for training. It contains 11000 square
image patches of size 64x64 pixels, and was divided into the classes “cars” (2952 patches),
“signal boards” (2408 patches) and “clutter” (5803 patches) by visual inspection. Car
segments contain complete back-views of cars (at any position) which must be at least half
as large as the patch in both dimensions. At equal false positive and false negative rates
(the equal error rate), an error of 4.7% for cars and 9.7% for signal boards was obtained on
equally large test sets. The performance of the trained classifier is shown in Fig. 5.13a in
the form of a receiver operator characteristic (ROC) curve that visualizes the trade-off
between false positive (clutter recognized as object) and false negative (object recognized
as clutter) detections when varying the classification threshold. The ROC curve was
generated using 5-fold cross validation. Furthermore, the quality of the classification is
enhanced by the voting process described in Sect. 5.3.1.
Qualitative evaluation of depth cues: For a more qualitative evaluation, Fig. 5.12
shows the unpreprocessed results for all depth cues on a typical inner-city sample. The
cues show strong differences in accuracy (especially depth from bird's eye view and depth
from object knowledge show a high variance). However, this is uncritical, since the sensor
variances (determined offline) are taken into account during the EKF-based sensor fusion
(see Sect. 5.2.2 for a more detailed depth cue evaluation).
Evaluation of Overall System Performance
The performance gain from incorporating the detected drivable street, the internal metric 3D
representation, and the TD links is evaluated on a real-world construction site scenario. The
results gathered with the proposed system are then compared with the previous system
described in Sect. 5.2.
In the previous section, we concentrated on typical construction sites on highways. A
traffic jam ending exactly within a construction site is a highly dangerous situation: due
to the S-curve in many construction sites, the driver will notice a braking or stopping car
quite late (see Fig. 5.4 on page 110). The evaluation was done offline by averaging over
3 streams that were stored during the online demonstration of the previous ADAS. As
depicted in Fig. 5.13b the current system architecture can classify the stationary car from
25 to 42 meters on. How early the car is detected depends on how much TD influence is
incorporated. For λ = 0 the car is detected late, because only visually conspicuous object
features are incorporated that draw BU attention. For a growing λ the car is detected earlier,
since "car-like" features are boosted more strongly in the TD attention. Based on Fig. 5.13b the
best choice of λ for detecting cars would be 1, which equals pure TD search mode. However,
such a parameterization is not appropriate because this leads to a reduced capability of
detecting other objects that are only prominent in the BU saliency map. As depicted in
Fig. 5.13b with growing λ (i.e., with growing influence of car features in the attention) the
mean detection distance of signal boards as BU salient objects drops. Stated differently,
the system ignores all other objects while searching for cars in pure TD mode (λ = 1),
124
5 Integrated System Approaches for Scene Interpretation
Figure 5.12: (a) Depth from stereo (calculated as a median over the object region), (b) Depth
from Radar, (c) Depth from object knowledge (for all objects detected as cars), (d) Depth from
bird’s eye view (using threshold-based detection of intensity changes on the road).
which might lead to dangerous situations. The measured effect has also been shown to exist in
humans and is termed "inattentional blindness" (see Sect. 3.1 and [Simons and Chabris,
1995]). This suggests setting λ to an intermediate value of about 0.5, which was also the
setting used during our online tests (see [Fritsch et al., 2008]).
Also, compared to the previous system described in Sect. 5.2, a better system performance
was achieved for all λ-values. In the previous system, an appearance-based 2D tracking
was used, as opposed to the 3D-representation-supported tracking presented here. Furthermore,
the TD weights were computed offline, as opposed to the online LTM object search in the
current system. Additionally, in the current system information drawn from the road
detection module is included and fused into the attention module (see Sect. 5.3.1). The
attained performance gain affirms the soundness of these cognitive system extensions.
For further system evaluation, Fig. 5.14 depicts internal system variables for three se-
quential frames of an inner-city stream with cars as LTM search object. As described in
Sect. 5.3.1, for each new image the attention is calculated and a new FoA is generated
via maximum search and segmentation on the saliency map. The detected road area (and
thereby also the present lane markings) is mapped out of the saliency map, which de-
creases the false positive rate of generated FoAs, i.e., fewer non-car FoAs are generated. In
Figure 5.13: (a) Receiver operator characteristic curve for cars (back and front views) and
signal boards, (b) Comparison between previous and current system implementation: Stationary
car detection distance depending on TD attention parameter λ = 0, 0.25, 0.5, 0.75, and 1.
For both systems a comparable parameter set was used.
the first frame, the car in front is detected and stored in the representation, based on a
car-like hole in the detected street segment that modulates the attention. Please note that
cars 2 and 3 are not stored in the internal representation, since their positions are beyond
the represented road environment.
5.4 Summary
In this chapter, the visual features described in Chapter 2, the attention sub-system
of Chapter 3, and the unmarked road detection sub-system of Chapter 4 are combined
into a generic, biologically inspired Advanced Driver Assistance System. The attention
sub-system that weights and combines all visual features allows a task-dependent scene
decomposition, which enables the system to react in real-time to challenging situations.
The ADAS described in Sect. 5.2 uses the attention sub-system as front-end for the
detection of a stationary vehicle in a highway construction site. In a nutshell, based
on an object-specific weight set, the attention sub-system suppresses all scene elements
not relevant to the current task, while boosting task-related scene content. The thereby
preselected scene elements are classified by a biologically inspired state-of-the-art classifier.
In order to allow a thorough scene exploration, already analyzed scene elements are suppressed
in the saliency map. Since the camera vehicle moves and typically other dynamic objects
are present, a tracking approach is included in the system. In order to assess the potential
danger of detected and classified vehicles, depth information is needed. For this,
four depth cues described in Chapter 2 are combined based on a biologically inspired
fusion approach. In case a quickly approaching vehicle is detected, a danger handling
scheme is initiated, meaning that after an acoustic warning and the activation of the belt
pretensioner, an autonomous braking of the ego vehicle is done. Different from available
systems on the market and in literature, the object detection is based on vision as the
Width in m
Dis
tan
ce (
de
pth
) in
m
−5 0 5
25
20
15
10
5
0
Width in m
Dis
tan
ce (
de
pth
) in
m
−5 0 5
25
20
15
10
5
0
Width in m
Dis
tan
ce (
de
pth
) in
m
−5 0 5
25
20
15
10
5
0
Frame 194
Car 1
Frame 196
Car 3 Car 2
Frame 197
Car 4
Figure 5.14: System evaluation on example images of an inner-city stream. Left column:
visualization of found FoAs, middle: calculated saliency map Stotal (previously found objects
are suppressed by inhibition of return), Right column: Visualization of internal representation
(dashed line marks the border of the vision field).
main cue.
In Sect. 5.3 the driver assistance system is significantly extended. The detected road, as
introduced in Chapter 4, is fused into the system, improving the quality of different system
modules. For instance, the detected road can be suppressed in the visual feature maps,
allowing a more efficient attention-based object search. Car-like openings in the detected
road are used to spatially modulate and guide the attention in order to search for cars in
these image regions. As a further extension, the object-specific attention weight sets are
computed online, allowing the detection of multiple object classes. The tracker is improved
by coupling it to an environmental 3D representation that compensates the 3D position
of all known objects by the measured ego motion of the camera vehicle. Additionally, a
voting mechanism is used in order to improve the robustness of the object classification,
meaning that an object, once found, is classified several times in the subsequent frames in
order to increase the confidence in the detected object class. All these improvements allow
a faster detection of dangerous objects. This was shown with the help of the previously
introduced example scenario of a stationary vehicle in a construction site, where a faster
system reaction could be achieved.
In the following, the realized novelties of Chapter 5 are listed:

- A driver assistance system on a prototype vehicle was implemented that allows autonomous emergency braking on highways based on vision as the major cue,
- The realized driver assistance system is based on an attention system as generic front-end of all visual processing, allowing task-dependent scene decomposition and interpretation,
- A driver assistance system was realized that fuses the detected drivable road into various system modules and thereby includes environmental context information, in order to allow safe processing in inner-city scenarios.
Based on the evaluation results, it could be shown that, with the inattentional blindness
phenomenon, specific attention-related properties of the human vision system can be re-
produced with the driver assistance system presented here. Based on these results, the
realized ADAS can be understood as an attentive co-pilot supporting the human driver
while closely mimicking human visual processing.
6 Summary and Outlook
In Sect. 6.1 the thesis is summarized, emphasizing the role of the biologically inspired ap-
proaches that are part of the realized Advanced Driver Assistance System. Based on that,
remaining functional limitations and ongoing system extensions are described in Sect. 6.2.
6.1 Summary
In modern vehicles, numerous driver assistance functionalities exist that support the driver
in typical traffic situations. In general, each of these functionalities brings its own sensors,
processing devices, and actuators. No information fusion takes place between the functionalities,
due to, among other things, unsolved questions in system design that come along with high
system complexity. Thereby, the potential to improve the robustness of so far independent
system modules as well as the potential to develop higher-level functionalities is ignored.
Furthermore, the number of driver assistance functionalities is growing constantly in to-
day’s vehicles. In the near future, this will lead to problems regarding the limited number
of interaction channels to inform the driver of dangerous events (so-called human machine
interface (HMI)). Already today, in specific traffic scenarios a contradicting HMI access of
different driver assistance functionalities can occur. For solving these challenges, a large-
scale driver assistance system is required that integrates and fuses various functionalities
(leading to so-called advanced driver assistance systems (ADAS)). Despite the apparent
necessity of such systems, only a few of these approaches exist in literature. Said systems
typically rely on rigid system structures and safely run in clearly defined scenarios only.
While it may be argued that the quality of such engineered systems in terms of isolated
aspects, e.g., object detection or tracking, is often sound, the solutions lack the necessary
flexibility.
Instead of following classical engineering-based approaches, in the thesis at hand, an
ADAS is developed that solves the challenges of complex system design by mimicking
known information processing principles in the human brain. On the micro-level the real-
ized system copies the signal processing characteristics of neurons in order to reach robust
image filtering. On the macro-level the system gets inspiration from the way the human
brain organizes higher level signal flows in order to reach a generic system structure (e.g.,
a task-dependent tunable attention system).
More specifically, on the micro-level various static and dynamic visual features are real-
ized that are biologically inspired. Inspiration is drawn from known processing principles
of the vision pathway in the human brain (i.e., the signal processing characteristics of
neurons in the brain). Features for the detection of specific intensity changes, oriented
lines and edges as well as a retina-like color space are introduced and tested. Since during
the projection of 3D world objects to the 2D image plane the object depth is lost, a dis-
tance feature becomes necessary. Five biologically motivated depth cues and their dynamic
fusion are described. The realized depth sources are stereo disparity, depth from object
knowledge, depth from time to contact, depth from the bird's eye view, and Radar-based
depth. Thereby, in sum 130 static feature maps are accessible, by which the proposed
ADAS can sense the world. Additionally, two dynamic feature types are described that
allow the detection of ego-propelled objects in the scene. In sum, six dynamic feature maps
are accessible to the ADAS. This makes an overall number of 136 feature maps.
The 136 biologically inspired feature maps are combined in the attention system that
is used as common front-end of all vision processes of our ADAS. The key aspect and
output of the attention system is the saliency map, whose amplitude (i.e., its activation
in neurobiological terms) encodes the level of information contained in an image region.
A high activation can be caused by 1) an object that visually differs strongly from its
surrounding environment (sensory-driven or bottom-up attention) or 2) by an image region
that matches the current searched object properties (goal-driven or top-down attention).
Both the bottom-up (BU) and top-down (TD) attention use a weighted combination of
the features. Five novelties increase the adaptivity of the attention system to the scene
and thereby assure a high robustness, which allows for building applications for outdoor
scenarios of the dynamic vehicle domain.
The importance of context information for improving the performance of driver assis-
tance functionalities is a widely acknowledged fact in the community. Therefore the ADAS
contains a real-time unmarked road detection system that relies on vision as the major cue.
Four system-related novelties assure sound performance in terms of the detection quality.
Based on these novelties, at run time the system dynamically adapts important system
parameters, allowing robust road detection under changing environmental conditions. As
the evaluation showed, the road segments detected in this way match ground truth data well
in most situations. However, in case of shadows on the road the detected road segments
contain holes and become unstable over time. As a further novelty, for solving these challenges a generic
tracking approach for unmarked road detection systems was introduced that is based on
temporal integration (i.e., integration of road detection results over time). The temporal
integration approach was successfully tested on the implemented system described before,
but would be suitable to improve any comparable state-of-the-art unmarked road detection
system.
In the central part of the PhD thesis, the realized visual features, the attention sub-
system, and the unmarked road detection sub-system including the temporal integration
approach are combined into a generic, biologically motivated Advanced Driver Assistance
System. The attention sub-system, which weights and combines all visual features, allows
a task-dependent scene decomposition that enables the system to react in real-time to chal-
lenging situations. In the first developed instance of an ADAS, the attention sub-system
is used as front-end for the detection of a stationary vehicle in a highway construction site.
In case a quickly approaching vehicle is detected by the system, a danger handling scheme
is initiated that in the final phase allows the system to brake autonomously. Different from
commercially available systems and approaches in literature, the object detection is based
on vision as the main cue.
A second instance of the driver assistance system was developed that contains numerous
extensions that rely on a higher level of information fusion between modules (e.g., close
fusion of the detected road into the attention system) and the inclusion of more dynamics
into the system (e.g., online computation and adaptation of weight sets for tuning the visual
preprocessing). The improvements allow a faster detection of dangerous objects. This was
exemplarily shown on the previously described detection task of a stationary vehicle in the
construction site, where faster reaction times to the stationary vehicle could be reached.
Based on an extensive system evaluation, system properties could be demonstrated that
were measured in psychophysical studies on humans as well. Similar to humans, it could be
shown that a number of 5 to 7 stored and tracked objects in the short term memory results
in the best overall system performance when searching for a dangerous object under time
constraints. Furthermore, it could be shown that the reaction time to unexpected traffic-
relevant objects grows with an increasing focus on a specific object class (a phenomenon
that is called inattentional blindness in psychophysical studies with humans). Based on
these results, it can be stated that the realized ADAS closely models important human
information processing principles allowing the usage of the system as attentional co-pilot
for human drivers.
6.2 Limitations and Outlook
The performance assessment of the human vision system has revealed capabilities that
exceed all known computational vision systems having a comparable image resolution. For
example, [Manz et al., 2007] provides the following equation for computing the maximum
distance Dv at which a human is able to visually classify an object in good weather conditions:
$$D_v = \frac{a}{\tan\left(\frac{\pi S}{10800}\right)} \qquad (6.1)$$

with:
a = object size in m
S = factor of acuteness of vision (usually 1.0).
A vehicle of width a = 2.5m with an acuteness factor S = 1.0 can hence be classified from
approximately 2500 meters on.
As stated in Sect. 5.2.2, the ADAS proposed here is able to classify cars from about 42
meters on. Among other things (like, e.g., differences in image resolution), the large dif-
ference in performance between the human and a technical vision system might be due to
the fact that the human integrates information about the scene context and gathered ex-
perience. In the discussed example, context information about the road (e.g., the object is
positioned on the road), which restricts the potential type of object, and object-class-related
experience (e.g., a car is a fast moving object on the road) are crucial. First approaches to
include context into the ADAS are realized in the PhD thesis at hand, in order to improve
the object detection. More specifically, the detected road is searched for car-like openings
and used to modulate the attention and object segmentation (see Sect. 5.3.1). However,
more context information (e.g., road type: inner-city, urban road, highway) needs to be
included, in order to further improve the system performance. [Kastner et al., 2009] de-
scribes a robust, real-time capable system for said basic scene classification, which we plan
to incorporate in our ADAS in the near future.
The PhD thesis at hand concentrated mainly on saliency-based attention and building
of a generic system that allows the dynamic modulation of modules and links between
modules at run-time. Our further work focuses on ways to control the designed cognitive
system based on reinforcement learning on high system level.
In order to tackle this goal, the previously presented ADAS was extended by a simply
structured control module, which realizes a functional mapping from measured internal system
states (the input feature space) to parameters controlling the system behavior (see
[Michalke et al., 2009b] for an extensive description of the realized approach). Based on this
approach, first promising results could be gathered on a complex real-world test scenario.
In the scenario the camera ego vehicle detects and tracks a bicycle, which the ego vehicle
overtakes. Based on the internal 3D representation (described in Sect. 5.3.1 on page 118)
the bicycle position is predicted even while being outside the field of view of the camera.
The ego vehicle stops to turn right and “remembers” the previously detected bicycle.
The ego vehicle waits for the bicycle to reappear. In order to allow a fast redetection,
the top-down attention is tuned to the bicycle (see Fig. 6.1, 6.2, 6.3) finally allowing its
instantaneous redetection. Summarizing, at runtime the system builds up and verifies
expectations about the environment and thereby autonomously tunes internal parameters and
processes, which improves and accelerates the system reaction. The complete result video is
accessible on the internet at [BenchmarkData, 2009b].
The goal of the current system extensions is to develop control strategies that allow an
appropriate and safe system reaction in various environmental situations. As it has become
apparent in the thesis at hand, the key aspect for reaching an “all situation ADAS” is a
generic system structure. Therefore, low-complexity system control strategies seem to
be sufficient. In other words, the cognitive complexity is distributed over the system in
multiple processing loops that can be tuned and modulated. Therefore, no complex central
control system is necessary, which increases robustness and could also allow the learning
of control strategies in the future.
More specifically, after the successful test of the exemplarily tested, low-complexity control
approach described above, the next step will focus on learning the functional mapping
between the measured input feature space and the output control parameter space. A possible
way could be to replay stored scenarios of critical traffic situations from a database. As
learning signal, dangerous objects could be labeled, which the system has to detect fast
enough to prevent a collision. In case the system is too slow, the scenario is replayed while
changing the functional mapping between input and output signals of the behavior control
module. Measuring and mimicking the reactions of an experienced driver is also envisioned
based on this approach.
As motivated earlier, the central assumption is that a robust learning system requires
a generic system structure with a high number of degrees of freedom for controlling the
system reaction and measuring the system states. Such a system was realized in the PhD
thesis at hand (see Sect. 5.3) allowing the future learning of control strategies and hence
offering a promising way to realize an “all situation” ADAS.
Figure 6.1: Visualization of system states for bicycle stream: (a) Scene exploration mode (no
dynamic object present), (b) Tracking the bicycle, ego vehicle is closing in.
Figure 6.2: Visualization of system states for bicycle stream: (a) Shortly after overtaking the
bicycle, (b) Blind prediction of bicycle.
Figure 6.3: Visualization of system states for bicycle stream: (a) Ego car searching actively
for the bicycle, waiting to turn right, (b) Bicycle redetected successfully, ego vehicle turns right.
A Annex
A.1 Gaussian Image Pyramid
Figure A.1 visualizes the image filtering methods with and without usage of an image pyra-
mid for a comparison of the computational demands. Since filtering in the frequency domain
is faster for both methods, the comparison is done in the frequency domain (correspond-
ing to the right side of Fig. A.1a and b, respectively). As can be seen qualitatively, the
computational demands for applying the FFT are lower when using an image pyramid (see
Fig. A.1b). Note that zero-padding is done in the image domain to bring both the image and
the kernel to the same size before applying the FFT. Furthermore, the multiplication
in the frequency domain as well as the transformation back to the image domain is more ef-
ficient when using a filter pyramid. According to [Jaehne, 2005], filtering with an image
pyramid of infinitely many scales (i.e., steps) takes 4/3 of the computation time of a single scale
without usage of an image pyramid. The same factor is found when comparing Fig. A.1a
and b. Therefore, already when filtering an image on two scales, an image pyramid will be
faster. The more scales are used, the higher the gain in computation time will be. A more
quantitative (i.e., mathematical) derivation of the performance increase can be found in
[Jaehne, 2005].
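The factor 4/3 follows directly from a geometric series over the per-scale pixel counts: halving the resolution quarters the number of pixels per scale, so the total cost relative to one full-resolution filtering is

$$\sum_{k=0}^{\infty} \left(\frac{1}{4}\right)^k = \frac{1}{1 - \frac{1}{4}} = \frac{4}{3}.$$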
Figure A.1: Assessment of pyramid-based image filtering: (a) Filtering without an image pyramid (up-scaling the kernel), (b) Filtering with an image pyramid (down-scaling the image).
A.2 Kolmogorov-Smirnov Test of Goodness of Fit
In Sect. 4.1.3 the Kolmogorov-Smirnov (KS) test of goodness of fit with its Lilliefors ex-
tension is used in order to statistically verify whether the features in the road training region
are normally distributed.
In the following, the KS-test steps realized in Sect. 4.1.3 are described and motivated in
detail. The KS-test used here checks the null hypothesis that a sample follows a normal
distribution with a certain measured variance σ² and mean value µ against a given level
of significance α = 0.05. Normally distributed features are a prerequisite for the visual
feature fusion process proposed in Sect. 4.1.2.
As opposed to the well-known χ² goodness-of-fit test, the KS-test can also be used
with small sample sizes. As test statistic the KS-test does not use the difference between the
absolute frequency of the sample and the theoretical probability function, but is based on
the difference between the cumulative frequency of the sample Fe(x) and the theoretical
cumulative frequency F0(x). In case both cumulative frequencies match closely, the
observed absolute deviation |Fe(x) − F0(x)| will be small. Therefore, on a qualitative
level, the observed maximum of the difference between both cumulative frequencies, d =
max |Fe(x) − F0(x)|, is suitable as test statistic for the KS-test.
Hence, in the following the test statistic d is used to verify the null hypothesis against a
level of significance α = 0.05. Kolmogorov and Smirnov have shown that the distribution
of the test statistic d is independent of the theoretical distribution the test is used for
(in our case the normal distribution), but depends on the sample size alone. This allows
defining a general chart of the test statistic. However, it is important to note that the
variance σ² of the theoretical distribution is a priori not known in our case, but must
be estimated from the test sample. This contradicts an important assumption of the
KS-test in its basic form. By estimating the standard deviation from the test sample, a
negative test result becomes less probable. The KS-test in its basic form will then have a too
high critical value for the test statistic (i.e., the border value is too large). This means
that the critical value has to be set lower. For the case that the mean value µ and variance
σ² are estimated from the test sample, [Lilliefors, 1967] has published the corrected values
of the test statistic for a goodness-of-fit test for normally distributed test samples. It is
important to note that, opposed to the KS-test, for the KS-Lilliefors test the distribution
of the test statistic depends on the theoretical distribution.
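As an illustration, the test can be run with a few lines of Python; this sketch assumes the lilliefors implementation of the statsmodels package, which applies exactly the correction discussed above.

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

def road_feature_is_normal(samples, alpha=0.05):
    """Null hypothesis: the feature samples of the road training region
    follow a normal distribution whose mean and variance are estimated
    from the sample itself (Lilliefors-corrected KS-test)."""
    stat, p_value = lilliefors(np.asarray(samples), dist="norm")
    return p_value > alpha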
A.3 World to Image Transformation
A 3D world position can be transformed to a 2D pixel position (u,v) using a pin hole
camera model that contains all intrinsic and extrinsic camera parameters (in detail these
are the 3 camera angles θX , θY , and θZ , which are aggregated in the rotation matrix R, the
3 translational camera offsets t1, t2, t3, the horizontal and vertical principal point u0 and
v0, as well as the normalized horizontal and vertical focal lengths fu = f/tu and fv = f/tv),
see Equations (A.1) and (A.2).
Figure A.2: (a) Visualization of internal camera parameters, (b) Coordinate system and ex-
ternal camera parameters.
$$u = -f_u \, \frac{r_{11}(X - t_1) + r_{12}(Y - t_2) + r_{13}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + u_0 \qquad (A.1)$$

$$v = -f_v \, \frac{r_{21}(X - t_1) + r_{22}(Y - t_2) + r_{23}(Z - t_3)}{r_{31}(X - t_1) + r_{32}(Y - t_2) + r_{33}(Z - t_3)} + v_0 \qquad (A.2)$$

$$R = R_X R_Y R_Z = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$$

with:

$$\begin{aligned}
r_{11} &= \cos(\theta_Z)\cos(\theta_Y) \\
r_{12} &= -\sin(\theta_Z)\cos(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{13} &= \sin(\theta_Z)\sin(\theta_X) + \cos(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{21} &= \sin(\theta_Z)\cos(\theta_Y) \\
r_{22} &= \cos(\theta_Z)\cos(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\sin(\theta_X) \\
r_{23} &= -\cos(\theta_Z)\sin(\theta_X) + \sin(\theta_Z)\sin(\theta_Y)\cos(\theta_X) \\
r_{31} &= -\sin(\theta_Y) \\
r_{32} &= \cos(\theta_Y)\sin(\theta_X) \\
r_{33} &= \cos(\theta_Y)\cos(\theta_X)
\end{aligned}$$
Figure A.2 gives a visualization of all the named internal and external camera parameters.
Equations (A.1) and (A.2) can also be expressed in homogeneous coordinates (see Equa-
tions (A.3), (A.4), and (A.5)), which decreases the computational demands considerably.
$$u = \frac{X_{cam}}{Z_{cam}} \qquad (A.3)$$

$$v = \frac{Y_{cam}}{Z_{cam}} \qquad (A.4)$$

with:

$$\begin{bmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{bmatrix} = M \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (A.5)$$

$$M = M_p^h R_X^h R_Y^h R_Z^h T^h$$

$$R_X^h = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(\theta_X) & -\sin(\theta_X) & 0 \\ 0 & \sin(\theta_X) & \cos(\theta_X) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad
R_Y^h = \begin{bmatrix} \cos(\theta_Y) & 0 & \sin(\theta_Y) & 0 \\ 0 & 1 & 0 & 0 \\ -\sin(\theta_Y) & 0 & \cos(\theta_Y) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$R_Z^h = \begin{bmatrix} \cos(\theta_Z) & -\sin(\theta_Z) & 0 & 0 \\ \sin(\theta_Z) & \cos(\theta_Z) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad
T^h = \begin{bmatrix} 1 & 0 & 0 & t_1 \\ 0 & 1 & 0 & t_2 \\ 0 & 0 & 1 & t_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$M_p^h = \begin{bmatrix} f_u & 0 & u_0 & 0 \\ 0 & f_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
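For illustration, the homogeneous world-to-image transformation of Equations (A.3) to (A.5) can be written as the following sketch (angles in rad, translations in m, intrinsic parameters in pixels); the parameter values used in a call are application-specific.

import numpy as np

def world_to_image(P_world, thX, thY, thZ, t, fu, fv, u0, v0):
    """Project the 3D world point P_world = [X, Y, Z] to the 2D pixel
    position (u, v) using the matrices of Equation (A.5)."""
    def RhX(a): return np.array([[1, 0, 0, 0], [0, np.cos(a), -np.sin(a), 0],
                                 [0, np.sin(a), np.cos(a), 0], [0, 0, 0, 1]])
    def RhY(a): return np.array([[np.cos(a), 0, np.sin(a), 0], [0, 1, 0, 0],
                                 [-np.sin(a), 0, np.cos(a), 0], [0, 0, 0, 1]])
    def RhZ(a): return np.array([[np.cos(a), -np.sin(a), 0, 0],
                                 [np.sin(a), np.cos(a), 0, 0],
                                 [0, 0, 1, 0], [0, 0, 0, 1]])
    Th = np.eye(4)
    Th[:3, 3] = t                       # translational camera offsets
    Mhp = np.array([[fu, 0, u0, 0], [0, fv, v0, 0],
                    [0, 0, 1, 0], [0, 0, 0, 1]])  # projection matrix
    M = Mhp @ RhX(thX) @ RhY(thY) @ RhZ(thZ) @ Th
    Xc, Yc, Zc, _ = M @ np.append(np.asarray(P_world, float), 1.0)
    return Xc / Zc, Yc / Zc             # Equations (A.3) and (A.4)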
A.4 Time to Contact - Further Evaluation Results
In addition to the results given in Sect. 2.2.5, further evaluation results, gathered by eval-
uating presegmented, synthetic image data of cars, are shown in Tab. A.1, A.2, and A.3. As
stated in Sect. 2.2.5, the results gathered here are roughly comparable to the measurements
accumulated in psychophysical experiments with humans.
Table A.1: Some examples of depth from TTC for an object moving away (i.e., b1 > b2 > b3) with a frame rate frate = 3.

Ego velocity      Ground truth distance   Resulting      Computed depth   Relative error
vego,1 (vego,2)   D1 (D2, D3) [in m]      vobj [in m/s]  from TTC ZTTC    |D1−ZTTC|/D1 [in %]
[in m/s]
24 (24.9)         39 (40, 40.7)           20.93          42.73            9.56
24 (24.9)         38 (39, 39.7)           20.93          41.79            9.97
24 (24.9)         37 (38, 38.7)           20.93          40.85            10.41
24 (24.9)         36 (36.5, 36.7)         22.49          38.70            7.50
24 (24.9)         35 (35.5, 35.7)         22.49          37.76            7.89
24 (24.9)         34 (34.5, 34.7)         22.49          36.83            8.32
24 (24.9)         33 (33.5, 33.7)         22.49          35.90            8.79
Mean relative error                                                       8.92
Table A.2: Some examples of depth from TTC with b1 > b2 < b3 and a frame rate frate = 3.

Ego velocity      Ground truth distance   Resulting      Computed depth   Relative error
vego,1 (vego,2)   D1 (D2, D3) [in m]      vobj [in m/s]  from TTC ZTTC    |D1−ZTTC|/D1 [in %]
[in m/s]
24 (25.5)         33 (33.3, 33.1)         23.10          35.34            7.09
24 (25.5)         34 (34.3, 34.1)         23.10          36.27            6.68
24 (25.5)         35 (35.3, 35.1)         23.10          37.21            6.31
24 (25.5)         36 (36.3, 36.1)         23.10          38.15            5.97
24 (25.5)         37 (37.3, 37.1)         23.10          39.10            5.68
24 (25.5)         38 (38.3, 38.1)         23.10          40.04            5.37
24 (25.5)         39 (39.3, 39.1)         23.10          41.00            5.13
Mean relative error                                                       6.03
Table A.3: Some examples of depth from TTC for an approaching object (i.e., b1 < b2 < b3) with a frame rate frate = 3.

Ego velocity      Ground truth distance   Resulting      Computed depth   Relative error
vego,1 (vego,2)   D1 (D2, D3) [in m]      vobj [in m/s]  from TTC ZTTC    |D1−ZTTC|/D1 [in %]
[in m/s]
24 (24.9)         45 (42, 38.7)           14.52          49.12            9.16
24 (24.9)         42 (40, 37.7)           17.83          45.00            7.14
24 (24.9)         40 (38, 35.7)           17.79          43.26            8.15
24 (24.9)         38 (36, 33.7)           17.75          41.57            9.39
24 (24.9)         36 (34, 31.7)           17.70          39.93            10.92
24 (24.9)         34 (32, 29.7)           17.64          38.36            12.82
24 (24.9)         33 (32, 30.7)           20.95          35.82            8.55
Mean relative error                                                       9.45
A.5 High Attention-Feature Selectivity
In the following, an indoor application of the attention system is shown that highlights
the performance of the approach. Figure A.3a shows a complex scene of a bookshelf.
Marked in red is my favorite book "A journey into the brain and beyond". However,
someone has removed my book and put it back at a different location (see Fig. A.3d).
Since the scene is highly complex (refer to the dense BU attention in Fig. A.3c), the book is
hard to find. Based on a stored training image, the described attention system is now able
to compute a TD weight set. Based on the training image the TD search is successful (see
Fig. A.3b), which allows a first positive assessment of the TD weight set that corresponds
to the features of my book. A TD-attention-based search on the current test image (see
Fig. A.3d) leads to Fig. A.3e, where the TD attention yields a clear saliency maximum.
This allows the fast relocation of my favorite book (see Fig. A.3f).
The described application is somewhat related to approaches shown in [Frintrop, 2006].
Still, the test examples in [Frintrop, 2006] are much simpler in terms of scene complexity.
Figure A.3: (a) Search target (favorite book) marked by rectangle (remembered training im-
age), (b) TD attention computed on the remembered training image (TD weights are stored),
(c) Dense BU attention showing the complexity of the test scene, (d) Test image with changed
position of the book, (e) TD attention on the test image based on the stored TD weights, (f)
Relocated book on the test image.
Bibliography
Adamy, J. (2007). Fuzzy-Logik, Neuronale Netze und Evolutionäre Algorithmen. Shaker
Verlag, Aachen.
Apostoloff, N. and Zelinsky, A. (2003). Robust vision based lane tracking using multiple
cues and particle filtering. In IEEE Intelligent Vehicles Symposium.
Aufrere, R., Marion, V., Laneurit, J., Lewandowski, C., Morillon, J., and Chapuis, R.
(2004). Road sides recognition in non-structured environment by vision. In IEEE Intel-
ligent Vehicles Symposium, Parma.
Aziz, Z. and Mertsching, B. (2008). Visual search in static and dynamic scenes using fine-
grain top-down visual attention. In Lecture Notes in Computer Science, volume 5008,
pages 3–12.
Backer, G. and Mertsching, B. (2000). Integrating depth and motion into the attentional
control of an active vision system. In G. Baratoff, H. Neumann, (Eds.), Dynamische
Perzeption, St. Augustin (Infix), pages 69–74.
Badino, H., Vaudrey, T., Franke, U., and Meyer, R. (2008). Stereo-based free space
computation in complex traffic scenarios. In IEEE Southwest Symposium on Image
Analysis and Interpretation, New Mexico.
Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algo-
rithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139.
BenchmarkData (2008a). http://www.rtr.tu-darmstadt.de/~tmichalk/ICVS2008_BenchmarkData/.

BenchmarkData (2008b). http://www.rtr.tu-darmstadt.de/~tmichalk/ITSC_TempIntegration/.

BenchmarkData (2009a). http://www.rtr.tu-darmstadt.de/~tmichalk/IV2009_RoadDetectionSystem/.

BenchmarkData (2009b). http://www.rtr.tu-darmstadt.de/~tmichalk/IV2009_ADASControl/.
Borst, A. (1990). How do flies land? From behavior to neural circuits. In BioScience,
volume 40, pages 292–299.
Broggi, A. (1995). Robust real-time lane and road detection in critical shadow conditions.
In Proc. Int. Symp. on Computer Vision, Parma. IEEE.
Broggi, A., Bertozzi, M., Conte, G., and Fascioli, A. (2001). ARGO prototype vehicle.
In Vlacic, L., Parent, M., and Harashima, F., editors, Intelligent Vehicle Technologies.
Butterworth Heinemann, Oxford.
Broggi, A. and Grisleri, P. (2005). A software video stabilization system for automotive
oriented applications. In Procs. IEEE Vehicular Technology Conference, Stockholm,
Sweden.
Cavanagh, P. and Alvarez, G. (2005). Tracking multiple targets with multifocal attention.
Trends in Cognitive Sciences, 9:350–355.
Cech, M., Niem, W., Abraham, S., and Stiller, C. (2004). Dynamic ego-pose estimation
for driver assistance in urban environments. In IEEE Intelligent Vehicles Symposium,
pages 43–48.
Ceravola, A., Joublin, F., Dunn, M., Eggert, J., and Goerick, C. (2006). Integrated research
and development environment for real-time distributed embodied intelligent systems. In
Proc. Int. Conf. on Robots and Intelligent Systems, pages 1631–1637.
Chern, M. and Cheng, S. (2003). Finding road boundaries from the unstructured rural road
scene. In 16th IPPR Conference on Computer Vision, Graphics and Image Processing.
Corbetta, M. and Shulman, G. (2002). Control of goal-directed and stimulus-driven atten-
tion in the brain. Nature Reviews Neuroscience, 3:201–215.
Dahlkamp, H., Kaehler, A., Stavens, D., Thrun, S., and Bradski, G. (2006). Self-supervised
monocular road detection in desert terrain. In Proceedings of Robotics: Science and
Systems, Philadelphia, USA.
Dang, T., Kammel, S., Duchow, C., Hummel, B., and Stiller, C. (2006). Path planning
for autonomous driving based on stereoscopic and monoscopic vision cues. In IEEE
Proceedings of the 2006 American Control Conference, pages 191–196.
Dickmanns, E. (2004). Three-stage visual perception for vertebrate-type dynamic machine
vision. In Engineering of Intelligent Systems (EIS), Madeira.
Dickmanns, E. and Mysliwetz, B. (1992). Recursive 3-d road and relative ego-state recog-
nition. IEEE Trans. Pattern Anal. Mach. Intell., 14(2):199–213.
Egeth, H. and Yantis, S. (1997). Visual attention: control, representation, and time course.
Annual Review of Psychology, 48:269–297.
Farber, G. (2005). Biological aspects in technical sensor systems. In Proc. Advanced
Microsystems for Automotive Applications, pages 3–22, Berlin.
Findlay, J. and Gilchrist, I. (2003). Active Vision: The psychology of looking and seeing.
Oxford University Press.
Flores-Herr, N. (2001). Das hemmende Umfeld von Ganglienzellen in der Netzhaut des
Auges. PhD thesis, Frankfurt am Main, Johann Wolfgang Goethe-Universitat.
Forsyth, D. and Ponce, J. (2003). Computer Vision: A Modern Approach. Prentice Hall,
Berkeley.
Franke, U., Gavrila, D., Gern, A., Gorzig, S., Janssen, R., Paetzold, F., and Wohler, C.
(2001). From door to door - principles and applications of computer vision for driver
assistant systems. In Vlacic, L., Parent, M., and Harashima, F., editors, Intelligent
Vehicle Technologies. Butterworth Heinemann, Oxford.
Franke, U., Loose, H., and Knoeppel, C. (2007). Lane recognition on country roads. In
IEEE Intelligent Vehicles Symposium, pages 99–104.
Frintrop, S. (2006). VOCUS: A Visual Attention System for Object Detection and Goal-
Directed Search. PhD thesis, University of Bonn Germany.
Frintrop, S., Backer, G., and Rome, E. (2005). Goal-directed search with a top-down
modulated computational attention system. In DAGM-Symposium, pages 117–124.
Frintrop, S., Klodt, M., and Rome, E. (2007). A real-time visual attention system using
integral images. In Int. Conf. on Computer Vision Systems, Bielefeld.
Frintrop, S., Rome, E., and Christensen, H. (2009). Computational visual attention systems
and their cognitive foundation: a survey. ACM Transactions on Applied Percerption
(TAP).
Fritsch, J., Michalke, T., Gepperth, A., Bone, S., Waibel, F., Kleinehagenbrock, M., Gayko,
J., and Goerick, C. (2008). Towards a human-like vision system for driver assistance. In
IEEE Intelligent Vehicles Symposium, Eindhoven.
Gabor, D. (1946). Theory of communication. J. IEE, 93:429–457.
Gepperth, A., Mersch, B., Goerick, C., and Fritsch, J. (2007). Color object recognition in
real-world scenes. In de Sa, J., editor, J. Marques de Sa et al. (Eds.): Artificial Neural
Networks, 17th International Conference ICANN, Part II, Lecture Notes in computer
science, pages 583–592. Springer Verlag Berlin Heidelberg New York.
Goerick, C., Wersing, H., Mikhailova, I., and Dunn, M. (2005). Peripersonal space and
object recognition for humanoids. In Proc. Int. Conf. on Humanoid Robots.
Gray, R. and Regan, D. (1998). Accuracy of estimating time to collision using binocular
and monocular information. In Vision Research, volume 38, pages 499–512.
Hardy, R. (1983). Homeostasis. Arnold.
Harris, J. (2004). Binocular vision: moving closer to reality. In Philosophical Transactions
of the Royal Society, volume 42, pages 2721–2739.
Hawes, N. and Wyatt, J. (2006). Towards context-sensitive visual attention. In Proceedings
of the Second Int. Cognitive Vision Workshop, Graz, Austria.
Heikkila, J. and Silven, O. (1997). A four-step camera calibration procedure with implicit
image correction.
Heinke, D. and Humphreys, G. (2005). Computational models of visual selective attention:
a review. In Houghton, G., editor, Connectionist Models in Psychology, pages 273–312.
Psychology Press.
Hertel, G. (2007). Mercer-Studie Autoelektronik. Automobilelektronik, pages 26–27.
Hong, T., Chang, T., Rasmussen, C., and Shneier, M. (2002). Road detection and tracking
for autonomous mobile robots. In Proceedings of SPIE Aerosense Conference.
Hoyle, F. (1957). The black cloud. London: Penguin.
Hubel, D. and Wiesel, T. (1962). Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. Journal of Physiology, 160:106–154.
Ikegaya, M., Asanuma, N., Ishida, S., and Kondo, S. (1998). Development of a lane
following assistance system. In Int. Symp. on Advanced Vehicle Control, Nagoya.
Intel (2006). Integrated Performance Primitives. http://www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/302910.htm.
Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for
rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259.
Itti, L., Rees, G., and Tsotsos, J., editors (2005). Neurobiology of Attention. Elsevier.
Jaehne, B. (2005). Digital Image Processing. Springer, Berlin.
Jones, J., Stepnoski, A., and Palmer, L. (1987). The two-dimensional spectral structure of
simple receptive fields in the cats striate cortex. Journal of Neurophysiology, 58(6):1233–
1258.
Bouguet, J. Y. (2007). Camera Calibration Toolbox for Matlab.
http://www.vision.caltech.edu/bouguetj.
Kaiser, M. and Mowafy, L. (1993). Optical specification of time-to-passage: observers’
sensitivity to global tau. Journal of Experimental Psychology, 19:1028–1040.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans-
actions of the ASME–Journal of Basic Engineering, 82(Series D):35–45.
Kastner, R., Schneider, F., Michalke, T., Fritsch, J., and Goerick, C. (2009). Image-
based classification of driving scenes by Hierarchical Principal Component Classification
(HPCC). In IEEE Intelligent Vehicles Symposium, Xian.
Klein, R. (2000). Inhibition of return. Trends in Cognitive Science, 4(4):138–145.
Koch, C. and Ullman, S. (1985). Shifts in selective visual attention: towards the underlying
neural circuitry. Human Neurobiology, 4(4):219–227.
Kodaka, K. and Gayko, J. (2004). Intelligent systems for active and passive safety -
Collision Mitigation Brake System. In Proc. of the ATA EL conference 2004, Parma.
Kodaka, K., Otabe, M., Urai, Y., and Koike, H. (2003). Rear-end collision velocity reduc-
tion system. In Proc. 2003 SAE World Congress, Detroit.
Konolige, K. (1997). Small Vision System: Hardware and implementation. In Eighth
International Symposium on Robotics Research.
Landy, M., Maloney, L., Johnston, E., and Young, M. (1995). Measurement and modeling
of depth cue combination: in defense of weak fusion. Vision Research, 35(3):389–412.
Li, X., Yao, X., Murphey, Y., Karlsen, R., and Gerhart, G. (2004). A real-time vehicle
detection and tracking system in outdoor traffic scenes. In Proceedings of the 17th
International Conference on Pattern Recognition.
Lilliefors, W. (1967). On the Kolmogorov-Smirnov test for normality with mean and
variance unknown. Journal of the American Statistical Association, 62:399–402.
Lin, X. and Chen, S. (1991). Color image segmentation using modified HSI system for
road following. In IEEE International Conference on Robotics and Automation.
Lombardi, P., Zanin, M., and Messelodi, S. (2005). Unified stereovision for ground, road
and obstacle detection. In IEEE Intelligent Vehicles Symposium.
Luo-Wai, T. (2008). Lane detection using directional random walks. In IEEE Intelligent
Vehicles Symposium, Eindhoven.
Mallot, H. (2002). Computational vision: Information processing in perception and visual
behavior. MIT Press, Cambridge, MA.
Manz, K., Kooß, D., Klinger, K., and Schellinger, S. (2007). Entwicklung von Kri-
terien zur Bewertung der Fahrzeugbeleuchtung im Hinblick auf ein NCAP fur aktive
Fahrzeugsicherheit. Universitat Karlsruhe, Lichttechnisches Institut.
Marcelja, S. (1980). Mathematical description of the response of simple cortical cells. J.
Optical Society of America, 70(11):1297–1300.
Marita, T., Oniga, F., Nedevschi, S., Graf, T., and Schmidt, R. (2007). Camera calibration
method for far range stereovision sensors used in vehicles. In IEEE Intelligent Vehicles
Symposium, pages 356–363.
Mateus, D., Avina, G., and Devy, M. (2005). Robot visual navigation in semi-structured
outdoor environments. In IEEE International Conference on Robotics and Automation,
Barcelona.
Matzka, S., Petillot, Y., and Wallace, A. (2008). Proactive sensor-resource allocation using
optical sensors. In VDI-Berichte 2038, pages 159–167.
Michalke, T., Fritsch, J., Gepperth, A., and Goerick, C. (2009a). Robust top-down
attention for a human-like driver assistance system. Computer Vision and Image
Understanding, Special Issue on Intelligent Vision Systems, Elsevier. To appear end of
2009.
Michalke, T., Fritsch, J., and Goerick, C. (2008a). Enhancing robustness of a saliency-
based attention system for driver assistance. In The 6th International Conference on
Computer Vision Systems (ICVS), Santorini, Greece. Lecture Notes in Computer Science,
volume 5008, Springer, pages 43–55.
Michalke, T., Gepperth, A., Schneider, M., Fritsch, J., and Goerick, C. (2007). Towards
a human-like vision system for resource-constrained intelligent cars. In Int. Conf. on
Computer Vision Systems, Bielefeld.
Michalke, T., Kastner, R., Adamy, J., Bone, S., Waibel, F., Kleinehagenbrock, M., Gayko,
J., Gepperth, A., Fritsch, J., and Goerick, C. (2008b). An attention-based system ap-
proach for scene analysis in driver assistance. at - Automatisierungstechnik, 56(11):575–
584.
Michalke, T., Kastner, R., Fritsch, J., and Goerick, C. (2008c). A generic temporal integra-
tion approach for enhancing feature-based road-detection systems. In IEEE Intelligent
Transportation Systems Conference, Beijing.
Michalke, T., Kastner, R., Fritsch, J., and Goerick, C. (2009b). Towards a proactive
biologically-inspired advanced driver assistance system. In IEEE Intelligent Vehicles
Symposium, Xian.
Michalke, T., Kastner, R., Herbert, M., Fritsch, J., and Goerick, C. (2009c). Adaptive
multi-cue fusion for robust detection of unmarked inner-city streets. In IEEE Intelligent
Vehicles Symposium, Xian.
Most, S. and Astur, R. (2007). Feature-based attentional set as a cause of traffic accidents.
Visual Cognition, 15(2):125–132.
Navalpakkam, V. and Itti, L. (2005). Modeling the influence of task on attention. Vision
Research, 45(2):205–231.
Navalpakkam, V. and Itti, L. (2006). An integrated model of top-down and bottom-up
attention for optimal object detection. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 2049–2056.
Neisser, U. (1967). Cognitive Psychology. Appleton-Century-Crofts, New York.
Nieto, M., Salgado, L., Jaureguizar, F., and Cabrera, J. (2007). Stabilization of inverse
perspective mapping images based on robust vanishing point estimation. In IEEE In-
telligent Vehicles Symposium.
Ouerhani, N. (2003). Visual Attention: From Bio-Inspired Modeling to Real-Time Imple-
mentation. PhD thesis, Universite de Neuchatel, Institut de Microtechnique.
Palmer, S. (1999). Vision Science: Photons to Phenomenology. MIT Press.
Proakis, J. and Manolakis, D. (2006). Digital Signal Processing. Pearson Prentice Hall.
Ramstroem, O. and Christensen, H. (2005). A method for following unmarked roads. In
IEEE Intelligent Vehicles Symposium, pages 650–655.
Rasmussen, C. (2002). Combining laser range, color and texture cues for autonomous road
following. In IEEE International Conference on Robotics and Automation, Washington
DC.
Regan, D. (2002). Binocular information about time to collision and time to passage.
Vision Research, 42:2479–2484.
Rotaru, C., Graf, T., and Zhang, J. (2004). Extracting road features from color images
using a cognitive approach. In IEEE Intelligent Vehicles Symposium.
Schmudderich, J., Willert, V., Eggert, J., Rebhan, S., Goerick, C., Sagerer, G., and
Koerner, E. (2008). Estimating object proper motion using optical flow, kinematics,
and depth information. IEEE Transactions on Systems, Man, and Cybernetics, Part B,
38(4):1139–1151.
Schorn, M., Stahlin, U., Khanafer, A., and Isermann, R. (2006). Nonlinear trajectory
following control for automatic steering of a collision avoiding vehicle. In IEEE Inter-
national Conference on Multisensor Fusion and Integration for Intelligent Systems.
Sha, Y., Zhang, G., and Yang, Y. (2007). A road detection algorithm by boosting using
feature combination. In IEEE Intelligent Vehicles Symposium, Istanbul.
Shinoda, H., Hayhoe, M., and Shrivastava, A. (2001). What controls attention in natural
environments? Vision Research, 41:3535–3546.
Simons, D. and Chabris, C. (1999). Gorillas in our midst: Sustained inattentional blindness
for dynamic events. Perception, 28(9):1059–1074.
Smuda, P., Schweiger, R., Neumann, H., and Ritter, W. (2006). Multiple cue data fusion
with particle filters for road course detection in vision systems. In IEEE Intelligent
Vehicles Symposium, Tokyo.
Soquet, N., Aubert, D., and Hautiere, N. (2007). Road segmentation supervised by an
extended v-disparity algorithm for autonomous navigation. In IEEE Intelligent Vehicles
Symposium.
Sotelo, M., Rodriguez, F., and Magdalena, L. (2004). VIRTUOUS: Vision-based road
transportation for unmanned operation on urban-like scenarios. IEEE Transactions on
Intelligent Transportation Systems, 5(2):69–83.
Stiller, C., Farber, G., and Kammel, S. (2007). Cooperative Cognitive Automobiles. In
IEEE Intelligent Vehicles Symposium, pages 215–220.
Torralba, A. (2003). Contextual priming for object detection. International Journal of
Computer Vision, 53(2):169–191.
Trapp, R. (1998). Stereoskopische Korrespondenzbestimmung mit impliziter Detektion von
Okklusionen. PhD thesis, University of Paderborn, Germany.
Treisman, A. (1993). The perception of features and objects. In Baddeley, A. and
Weiskrantz, L., editors, Attention: Selection, awareness, and control, pages 5–35. Claren-
don Press, Oxford.
Treisman, A. and Gormican, S. (1988). Feature analysis in early vision: Evidence from
search asymmetries. Psychological Review, 95:15–48.
Treue, S. (2003). Visual attention: the where, what, how and why of saliency. Current
Opinion in Neurobiology, 13(4):428–432.
Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., and Nuflo, F. (1995). Modeling
visual attention via selective tuning. Artificial Intelligence, 78(1-2):507–545.
Tsotsos, J., Liu, Y., Martinez-Trujillo, J., Pomplun, M., Simine, E., and Zhou, K. (2004).
Attending to visual motion. CVIU, 100(1-2):3–40.
Viola, P. and Jones, M. J. (2001). Robust real-time object detection. In Second Interna-
tional Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada.
von Seelen, W. (1970). Zur Informationsverarbeitung im visuellen System der Wirbeltiere.
Kybernetik, 7:43–60.
von Trzebiatowski, M., Gern, A., Franke, U., Kaeppeler, U.-P., and Levi, P. (2004). De-
tecting reflection posts - lane recognition on country roads. In IEEE Intelligent Vehicles
Symposium.
Wersing, H. and Korner, E. (2003). Learning optimized features for hierarchical models of
invariant object recognition. Neural Computation, 15(7):1559–1588.
Willert, V., Eggert, J., Adamy, J., and Koerner, E. (2006). Non-gaussian velocity distribu-
tions integrated over space, time and scales. IEEE Transactions on Systems, Man and
Cybernetics B, 36(3):482–493.
Willert, V., Toussaint, M., Eggert, J., and Korner, E. (2007). Uncertainty optimization
for robust dynamic optical flow estimation. In Proceedings of the 2007 International
Conference on Machine Learning and Applications (ICMLA). IEEE.
Winner, H. (2007). Fahrerassistenzsysteme. Lecture notes (Vorlesungsskript), Technische
Universitat Darmstadt.
Wolfe, J. and Horowitz, T. (2004). What attributes guide the deployment of visual atten-
tion and how do they do it? Nat. Reviews Neuroscience, 5(6):495–501.
WWW (2006). European Project PReVENT. http://www.prevent-ip.org/.
WWW (2007a). DARPA Urban Challenge. http://www.darpa.mil/grandchallenge/.
WWW (2007b). European Commission Information Society Intelligent Car Initiative.
http://ec.europa.eu/informationsociety/activities/intelligentcar/.
WWW (2007c). European project SAFESPOT. http://www.safespot-eu.org.
Curriculum Vitae
Personal Details
Name: Thomas Paul Michalke
Date of birth: 06.05.1979
Place of birth: Weimar / Thuringen, Germany
Nationality: German
Marital status: Married
Education
Feb. 2006 to Jan. 2009 Ph.D. student, Control Theory and Robotics Lab, Darmstadt
University of Technology, Germany, in close cooperation
with the Honda Research Institute Europe in Offenbach;
topic: camera-based driver assistance using biologically
inspired signal processing principles
Sept. 2002 to July 2003 Study visit at the Technical University of Lyon (Ecole
Centrale de Lyon) in France: learned French, completed
a student thesis, and attended technical courses
June 2001 Obtained bachelor’s degree
Sept. 1998 to Jan. 2006 Studied for a Master’s degree in industrial engineering at
the Technical University of Darmstadt, with an economics
major in operations research and a technical major in
telecommunications engineering and data processing
technology
July 1998 Obtained baccalaureate at the Friedrich Schiller grammar
school in Weimar, courses: English, mathematics, Russian
Work Experience
Since July 2009 Daimler AG, research and development at Daimler
EvoBus.
Feb. 2006 to April 2009 Honda Research Institute Europe in Offenbach: research
and development of a biologically motivated, generic
driver assistance system; real-time implementation on a
prototype car used for online demonstrations; fusion of
sensor data from innovative sensor concepts such as PMD,
laser, and camera-based stereo; publications at
international conferences and in journals.
Aug. 2005 to Jan. 2006 Diploma thesis in a development department at BOSCH
in Ditzingen, Germany: analysis of crash signals, the
corresponding application of crash algorithms, and the
development of signal analysis software.
April 2005 to June 2005 Practical training in a development department at BOSCH
in Farmington Hills, Michigan, USA: worked on a
time-critical passenger safety project, programming a
Texas Instruments microcontroller in C.
Dec. 2004 to April 2005 Practical training in a development department at BOSCH
in Ditzingen, Germany: worked on a complex group
software project on signal processing and analysis in
Matlab.
Dec. 2002 to July 2003 Member of a research team at LEOM (a semi-private
electronics research laboratory) in Lyon: worked on new
concepts for the transmission of high-frequency signals
with light.
Publications
Conference papers
• T. Michalke, R. Kastner, J. Fritsch, C. Goerick: Towards a Proactive Biologically-
inspired Advanced Driver Assistance System, IEEE Intelligent Vehicles Symposium,
Xian, 2009
• T. Michalke, R. Kastner, M. Herbert, J. Fritsch, C. Goerick: Adaptive Multi-Cue
Fusion for Robust Detection of Unmarked Inner-City Streets, IEEE Intelligent
Vehicles Symposium, Xian, 2009
• R. Kastner, F. Schneider, T. Michalke, J. Fritsch, C. Goerick: Image-based classifi-
cation of driving scenes by Hierarchical Principal Component Classification (HPCC),
IEEE Intelligent Vehicles Symposium, Xian, 2009
• T. Michalke, R. Kastner, J. Fritsch, C. Goerick: A generic temporal integration ap-
proach for enhancing feature-based road-detection systems, IEEE Intelligent Trans-
portation Systems Conference, Beijing, 2008
• J. Fritsch, T. Michalke, A. Gepperth, S. Bone, F. Waibel, M. Kleinehagenbrock,
J. Gayko, C. Goerick: Towards a Human-like Vision System for Driver Assistance,
IEEE Intelligent Vehicles Symposium, Eindhoven, 2008
• T. Michalke, J. Fritsch, C. Goerick: Enhancing Robustness of a Saliency-based At-
tention System for Driver Assistance, Int. Conf. on Computer Vision Systems,
Santorini, 2008
• T. Michalke, M. Schneider, A. Gepperth, J. Fritsch, C. Goerick: Towards a Human-
Like Vision System for Resource-Constrained Intelligent Cars, Int. Conf. on Com-
puter Vision Systems, Bielefeld, 2007
• M. Briere, L. Carrel, T. Michalke, F. Mieyeville, I. O’Connor, F. Gaffiot: Design and
Behavioral Modeling Tools for Optical Network-on-Chip, IEEE Proceedings of the
Design, Automation and Test in Europe Conference and Exhibition (DATE 2004),
Paris, 2004
Journal papers
• T. Michalke, R. Kastner, J. Adamy, A. Gepperth, S. Bone, F. Waibel, M. Kleine-
hagenbrock, J. Gayko, J. Fritsch, C. Goerick: An attention-based system approach
for scene analysis in driver assistance, Automatisierungstechnik (AT), at-
Schwerpunktheft “Kognitive Automobile”, 2008
• T. Michalke, J. Fritsch, A. Gepperth, C. Goerick: Robust Top-Down Attention for a
Human-like Driver Assistance System, Computer Vision and Image Understanding,
Special Issue On: Intelligent Vision Systems, Elsevier (to appear end of 2009)
Books
• T. Michalke, J. Fritsch, C. Goerick: Enhancing Robustness of a Saliency-based At-
tention System for Driver Assistance, in: 6th International Conference on Computer
Vision Systems, ICVS 2008, Santorini, Greece, May 12-15, 2008, Proceedings, Series:
Lecture Notes in Computer Science, Vol. 5008, Sublibrary: Theoretical Computer
Science and General Issues, Gasteratos, Antonios; Vincze, Markus; Tsotsos, John
(Eds.), 2008
Patents
• Christof Karner, Thomas Michalke: Vorrichtung zu Crashklassifizierung (device for
crash classification), 2006, patent number: DE 10 2006 038 348 A1 2008.02.21
• Martin Heckmann, Jannik Fritsch, Thomas Michalke: Driving Path Identification
via Online Adaptation of the Driving Path Model, 2008 (pending)
• Thomas Michalke, Robert Kastner, Jannik Fritsch: System and method for object
motion detection based on multiple 3D warping and vehicle equipped with such
system, 2008 (pending)
Supervised Student Theses
• Thesis: Evaluation of different tracking algorithms and their implementation in the
context of an environmental representation for a driver assistance system, Shi Xuehui,
2006
• Thesis: Road detection on unmarked roads, Michael Herbert, 2007
• Diploma thesis: Autonomous learning in intelligent vehicles, Imran Bashir Bhatti,
2008
• Thesis: Computationally efficient lane detection on marked roads for a driver
assistance system, Wang Zheng, 2007
• Thesis: Implementation of a fuzzy-based central control unit for a complex driver
assistance system, Yan Jiajie, 2007
• Thesis: Biologically motivated filter adaptation for robust image interpretation, Pol
Blasco Moreno, 2007
• Diploma thesis: Detection of object proper motion by fusion of stereo vision with
optical flow for a driver assistance system, Andreas Schlensag, 2007
• Diploma thesis: Bio-inspired tracking of traffic-relevant objects, Marco-Antonio
Garcia-Ochoa, 2008
• Bachelor thesis: Attention-based edge and contour detection with artificial neural
networks, Conrad Klytta, 2007
• Diploma thesis: Pitch angle correction for a driver assistance system, Ming Zhao,
2007
• Thesis: Robust depth cue integration in driver assistance, Sun Hailin, 2008
• Thesis: Biologically motivated motion detection and classification, Jochen Schmell,
2008
Supervised Seminars
• Seminar: Lane detection on unmarked roads, Aleksandar Aleksandrov, Christian
Schmell, Jochen Schmell
• Seminar: Guided search and tracking using a top-down attention model, Jean-Pierre
Hickey, Jingmin Zhang
• Seminar: Investigation of weight sets for different objects, Tobias Pietsch, Sebastian
Waz
• Seminar: Vision-based control of a mobile robot, Jonatan Antoni, Daniel Donigus
• Seminar: Visual position tracking of a mobile robot, Stefanie Apprich, Said Azzam,
Ulrich Schmieder
Supervised Internships
• Student apprentice: Model- and appearance-based pitch estimation, Zhang Lyan
• Student apprentice: Lane detection and appearance-based pitch estimation, Andre
Justus
• Student apprentice: Temporal integration for improving lane detection systems,
Jochen Schmell
Supervised Tutorials
• Tutorial: Fuzzy logic, neural networks, and evolutionary algorithms (Fuzzy-Logik,
Neuronale Netze und Evolutionare Algorithmen)
• Tutorial: Control engineering laboratory II (Regelungstechnisches Praktikum II)
• Administrative supervision: Project seminar on robotics and computational
intelligence (Projektseminar Robotik und Computational Intelligence)
• Laboratory course: Control of servo drives (Regelung von Servoantrieben)
• Administrative supervision: Process control engineering (Prozessleittechnik)
Index
Attention 51
Basic feature 8
Bird’s eye view 35
Bottom-up attention 51
Canonical feature 22
Depth from object knowledge 33
Difference of Gaussian (DoG) filter 10
Differential images 42
Disparity 29
Double color opponents 26
Dynamic neuronal suppression 58
Early selection principle 51
Flat plane assumption 36
Gabor filter 16
Homeostasis 55
Inattentional blindness 52
Kolmogorov-Smirnov (KS) test of goodness of fit 137
KS-Lilliefors test of goodness of fit 137
Object motion detection 44
Parallel search 8
Plane fitting 37
Radar 42
Rectification 31
RGBY color space 22
Short term memory 104
Sigmoid function 58
Single track model 89
Sparseness weight 58
Stereoscopic depth 29
Structure tensor 57
Tau-function 38
Temporal integration 86
Time to contact 37
Top-down attention 51
Undistortion 30
Uniformity of color spaces 24
Unmarked road detection 69
Voting 118
Weak fusion 107
Weak object feature conjunctions 55