
Chapter 8

Perception for autonomous systems

8.1. Introduction

Perception of the world in which an autonomous system operates is a key issue in creating adaptive and intelligent behavior. It refers to the process of sensing, the extraction of information from the real world, and the interpretation of that information. Without sensing, an autonomous system would be blind, deaf and without touch. It could only perform pre-planned actions without any feedback about how well it performs those actions. It could only operate in a static environment, because it could not check whether the model of the environment on which actions were planned has changed. So no unforeseen moving objects or humans could be in the environment, and all motion would have to be performed without errors. To make a truly intelligent system, it should be able to perceive its environment and react to changes in it. For autonomous systems such as robot arms, mobile robots and intelligent cars, the information needed about the environment is geometrical information. Examples are the position of objects to be grasped, the lane position of the road to be followed, or the position of obstacles to be avoided. Other types of information may also be needed, depending upon the task the system has to perform. When a mobile cart has to operate in a human environment such as an office or a hospital, it should also be able to react to human gestures and spoken words like "stop". For real-life applications, systems should be very robust, and a system should combine multiple sources of information. It should not be necessary that somebody follows the system with a red emergency-stop button to stop it in case of emergency. Besides sensing for navigation, other applications need specific sensors to perform their task. Examples are autonomous systems for surveillance and safety, detection of air pollution or forest fires, and systems for land-mine detection, to mention a few.

Perception forms the basis for low-level reactive behavior in autonomous systems. The interpreted sensor information is used to follow lanes or a wall, or to avoid collisions. At a higher level an autonomous system needs the perception of the real world to explore the environment and build a map of it, to be able to locate goals and to find its way through an environment. In this case the interpreted data is fused with previous data to acquire or improve a map of the environment and to locate the autonomous system in the map.

Many different processes are needed to transform sensed real-world data into useful information for an autonomous system. In the next sections we will describe the main steps involved in the perception process, from sensing to interpretation.

8.2 From sensing to interpretation

When we want to observe the real world with an autonomous system, we need a measuring device connected to the system that obtains information from the real world. This information has to be processed and interpreted so that it can be used by the autonomous system to realize its mission.

A block diagram with the basic components of this process is given in figure 8.1. The components of this process are:

• Physical sensors or transducers. These convert a physical quantity into an electrical signal.

• AD converter. It samples (in time) and quantizes (in value) the electrical signal into a discrete signal that forms the input for the computer hardware.

• Signal processing. The hardware and software that extract the information of interest from this signal and reduce the noise.

• Signal interpretation. The subsequent level of software extracts information about the real world from the signal and forms the interpretation domain.

Figure 8.1: Basic components of the sensing process.

Example
Let us examine an ultrasonic sensor system that obtains the distance to an object. An ultrasonic sensor measures the time between a transmitted sound pulse and the received reflected sound pulse. The measured time is converted to distance by multiplying it by the sound velocity in air and dividing by two, because the pulse travels to the object and back. A basic ultrasonic sensor consists of:

• an oscillator and a loudspeaker that together can generate a short high-frequency sound pulse, and a microphone that converts the reflected sound pulse into an electrical signal,

• an AD Converter which converts the electrical signal into a discrete signal,



• processing hardware that can detect the reflected pulse in this electrical signal and compute the delay between emission and reception, and converts this delay into a binary number which represents the time delay,

• software which will try to interpret the time delay as a distance, incorporating facts like multiple reflections and the temperature-dependent sound velocity, which have an influence on the output of the sensor (a small sketch of this conversion is given below).
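As an illustration of this last step, the sketch below converts a measured round-trip delay into a distance using a temperature-corrected speed of sound. The function names and example values are invented for illustration; the linear approximation for the speed of sound is a standard textbook formula.

    def speed_of_sound(temperature_c):
        """Approximate speed of sound in air (m/s) as a function of temperature (deg C)."""
        return 331.3 + 0.606 * temperature_c

    def delay_to_distance(delay_s, temperature_c=20.0):
        """Convert a measured round-trip time delay (s) into a distance (m).

        The pulse travels to the object and back, so the one-way distance
        is half the product of the delay and the speed of sound.
        """
        return 0.5 * delay_s * speed_of_sound(temperature_c)

    # Example: a 6 ms round-trip delay at 20 degrees Celsius is roughly 1.03 m.
    print(delay_to_distance(6e-3, temperature_c=20.0))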

The output of the sensor, distance in a certain direction, is used as input by the various software modules in the sensing system. These modules combine the readings from various ultrasonic sensors to detect obstacles, recognize and reject erroneous readings, and use the remaining readings to construct a map of the environment.

Sensor
The first component of a sensing process is the physical sensor, which transforms a real-world quantity i into an electrical signal x. An example is a photodiode, which converts the brightness of the incident illumination into a voltage. So an electrical signal, the voltage, is used to inform us about the actual illumination. The input i (illumination) into the sensor is transformed into an output x (voltage) which we measure.

The type of sensor needed depends upon the application. If we are interested in sound, we need a microphone to convert variations in air pressure into an electrical signal. For images we may use a video camera to obtain a video signal which represents the brightness over a whole image.


Figure 8.2: Physical sensor which transforms physical quantity i into electrical quantity x.

As long as there is a strict monotone functional relation x = f(i) between x and i, and a suitable model exists that describes this relationship, we can use the sensor. If the monotone relation is known, we can find an estimate i' for each measured level x, namely i' = f⁻¹(x). Thus, for each voltage (x) we know the corresponding illumination (i'). See figure 8.2.

A simple sensor, such as the photodiode, realizes a conversion between physical quantities. This conversion has to be a monotone relation between the quantities. Often the sensor system is more complex and consists of a number of stages in which subsequent processes take place. An example was the ultrasonic sensor.
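A minimal sketch of how such a monotone relation can be inverted in practice: store calibration pairs (i, x) and interpolate the inverse numerically. The calibration numbers below are invented; only the monotonicity of x = f(i) matters.

    import numpy as np

    # Hypothetical calibration: known illumination levels i and the voltages x
    # that a photodiode produced for them (monotone response).
    i_cal = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])    # illumination (arbitrary units)
    x_cal = np.array([0.0, 0.9, 1.7, 2.4, 3.0, 3.5])    # measured voltage (V)

    def estimate_illumination(x_measured):
        """Estimate i' = f^-1(x) by linear interpolation of the calibration table.

        Requires x_cal to be strictly increasing, i.e. f must be monotone.
        """
        return np.interp(x_measured, x_cal, i_cal)

    print(estimate_illumination(2.0))   # roughly 2.4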

AD Conversion
The next block in the sensing process represents the conversion of the electrical signal into a discrete signal x[n]. This is realized by an analog-to-digital converter, which samples and quantizes a continuous signal into digital numbers that can be sent to a computer. The essential parameters of an AD converter are the sampling frequency and the number of bits into which the signal is quantized.

Often you can buy a digital measurement instrument which can directly be connected to a computer bus or port and combines the sensor and the AD converter.
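To make the two AD-converter parameters concrete, the sketch below samples an artificial analog signal at a chosen sampling frequency and quantizes it to a chosen number of bits. The signal, rates and function names are illustrative, not taken from any particular converter.

    import numpy as np

    def sample_and_quantize(signal_fn, duration_s, fs_hz, n_bits, full_scale):
        """Sample a continuous signal at fs_hz and quantize it to n_bits.

        signal_fn maps time (s) to an analog value in [-full_scale, +full_scale].
        Returns the integer codes produced by an idealized AD converter.
        """
        t = np.arange(0.0, duration_s, 1.0 / fs_hz)      # sampling (time)
        x = signal_fn(t)
        levels = 2 ** n_bits
        step = 2.0 * full_scale / levels                 # quantization (value)
        codes = np.clip(np.round(x / step), -levels // 2, levels // 2 - 1)
        return codes.astype(int)

    # A 50 Hz test signal sampled at 1 kHz with an 8-bit converter.
    codes = sample_and_quantize(lambda t: np.sin(2 * np.pi * 50 * t),
                                duration_s=0.02, fs_hz=1000, n_bits=8, full_scale=1.0)
    print(codes)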

Signal processing
It can be that the discrete signal is directly usable without any further processing, as in the case of the photodiode, which gives as measurement the digital value representing the brightness. But usually the discrete signal x[n] needs further processing, because it is contaminated with noise which has to be suppressed, or because we want to filter out the information in which we are interested. The processed signal is denoted by y[n] and is situated in the same domain as the discrete signal x[n].

Signal interpretation
The most difficult part of the perception process is the signal interpretation, which interprets the signal in terms of a model of the real world. For an ultrasonic sensor, for instance, the time delay has to be interpreted as a distance to a wall or an obstacle and matched to a map of the environment. For robot soccer we need to find the position of the ball, the team members and the opponents from a sequence of video images in order to plan the action. This interpretation of images is a difficult task for the computer, despite the fact that a human is very good at it. In the section on interpretation we will focus on the interpretation of images and image sequences, in particular on those aspects of computer vision that obtain geometrical information. Geometrical information is needed for localization, navigation and exploration by autonomous systems.

The data analysis of specific distance sensors is often more straightforward. In that case distances and locations are measured directly, and they can be combined in a fusion step, which will be discussed in the next chapter.


8.2.1. Observation of physical quantity over time and space

Until now we dealt with systems whose signals are functions of time, i.e. the physical quantity changes as a function of time. Such signals are called time signals. Changes in the observed value may also depend on quantities other than time, for example location. A camera consists of an array of light-sensitive sites in which the brightness at each location in the array is transformed into a gray value. The output of the camera is an analog video signal x(X,Y,t) which gives the brightness as a function of the position (X,Y) in the array. A video signal has to be digitized by a video digitizer (frame grabber) before a computer can process it. The output of the frame grabber consists of digital values denoted by x[X,Y,n], which is an array of gray values. This array is the digitized image stored in a computer. The camera together with the frame grabber form the measuring instrument, which gives as measurement output a digitized image sequence. This image usually has to be processed, to filter out noise, etc., to obtain an image y[X,Y,n] (see figure 8.3). This belongs to the field of image processing (see the course on Image Processing). The last step in the sensing process is image interpretation or computer vision.

Figure 8.3: Basic components of the sensing process of a camera.

Signals are mathematically represented as functions of one or more independent variables. The independent variable in the real world is a continuous variable, which has to be quantized into discrete numbers: the sample moments in time or the pixel locations in space. In the computer we have discrete-time signals defined at discrete times, and thus the independent variable has discrete values. The gray value of an image is defined as a function of two discrete independent spatial variables. The pixel size is determined by the step size on the two axes. So discrete signals are represented as sequences of numbers.

8.2.2. Uncertainty and measurement error

A sensor transforms the physical quantity i of interest into an electrical signal x. When we repeatedly sample the output of a sensor under the same conditions, the output will never be exactly the same. Small variations are present in the sensor output. These variations find their origin in the physical properties of the sensor (every electrical circuit generates some noise) or result from external disturbances (the air temperature influences the speed of sound). The variations can also originate from the quantization process, i.e. the process that quantizes the continuous signal into the discrete signal. The existence of variations is responsible for the fact that there is a basic uncertainty in the sensor value. This uncertainty is usually called the measurement error. The measurement error ∆x is the difference between the measured quantity xm and the real quantity x = f(i):

∆x = xm − x.

We can distinguish between two kinds of measurement errors: systematic and random errors. If an ultrasonic sensor is calibrated under circumstances which differ from those where it is used (for example different temperatures), a measurement error results. This systematic error is due to the fact that the sound velocity in air depends on the temperature. We then use an incorrect model or incorrect parameter settings for the sensor. The result is that the measurements are systematically too low or too high. Another example is a wheel with a shaft encoder to measure traveled distance. If we use the shaft encoder to measure the traveled distance in centimeters, we have to find a good estimate for the parameter which converts shaft encoder counts into centimeters under its operational conditions. Otherwise, we introduce a systematic error. Finding the conversion parameter is not simple, because it is dependent on the air pressure of the tire. These errors are called systematic errors and are usually caused by the wrong choice of parameters or models, or by physical failures.

Other errors can be caused by the way the measurements are taken or by small errors in the sensor itself. If the shaft encoder skips a count, an error is introduced, which is called a random error when it occurs occasionally. Random errors have the property that they differ when repeated measurements are taken.
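The two kinds of error can be made concrete with a small simulation (all numbers invented): an ultrasonic sensor calibrated at the wrong temperature gives a systematic offset, while timing noise gives a random spread.

    import numpy as np

    def measure_distance(true_distance, calib_temp_c, actual_temp_c, noise_std, rng):
        """Simulate one ultrasonic distance measurement.

        The true time of flight uses the actual speed of sound, but the sensor
        converts it back using the (possibly wrong) calibration temperature:
        this mismatch produces a systematic error, the timing noise a random one.
        """
        c_actual = 331.3 + 0.606 * actual_temp_c
        c_calib = 331.3 + 0.606 * calib_temp_c
        t_flight = 2.0 * true_distance / c_actual
        t_measured = t_flight + rng.normal(0.0, noise_std)    # random timing noise
        return 0.5 * t_measured * c_calib

    rng = np.random.default_rng(1)
    samples = [measure_distance(2.0, calib_temp_c=20.0, actual_temp_c=35.0,
                                noise_std=5e-6, rng=rng) for _ in range(1000)]
    print(np.mean(samples) - 2.0)   # systematic error (bias), roughly -5 cm here
    print(np.std(samples))          # random error (spread)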

8.2.3. Complex measurement systems

In general we do not deal with a single measurement sensor, but with more complex measurement systems, incorporating not only the sensor but also the processing of the sensor data and the control of sensor parameters. In modern equipment, processing capabilities are often integrated with sensing. This makes it hard to define where in a complex system the sensing ends and the processing begins.

In the next section we will discuss the different properties of measuring instruments and the way they are put together. As sensing involves physics, this section will have a certain physical flavor. In the following section we will focus on vision sensors, which have attracted enormous attention in the research on autonomous systems. In the next chapter we will focus on techniques for sensor data fusion, i.e. fusion of information from multiple sensors. An autonomous system must fuse information from multiple sensors to obtain more complete and more accurate information about the world. The resulting observations obtained from the sensors have to be integrated into a consistent consensus view of the environment, which can be used to plan and guide the execution of tasks. The architecture of a generic sensor data system will be discussed, and subsequently the various levels of the system together with the higher-level fusion methods.

8.3. Sensor principles

There are sensors on the market for measuring all kinds of physical entities. There are sensors to measure distance, small displacements, temperature, forces and torques, velocity, acceleration, flow and all kinds of radiation. In the context of autonomous systems we are especially interested in the sensing needed to position robot arms and to navigate mobile robots. So we will focus our attention on sensors to measure distances and to locate objects in an environment, enabling autonomous systems to find goals and to avoid collisions. Which other sensors are needed depends strongly upon the application of the autonomous system. For instance, when a system has to maneuver in a human environment we would like it to recognize speech and gestures, so sensors for sound and motion are required. For a system to monitor air pollution and smog formation, particles and certain gases in the air, together with the weather conditions, should be measured.

Radiant signals: light intensity, wavelength, polarization, phase, reflectance, transmittance
Mechanical signals: position, distance, velocity, acceleration, force, torque, pressure
Thermal signals: temperature, specific heat, heat flow
Electrical signals: voltage, current, charge, resistance, inductance, capacitance, dielectric constant, electric polarization, frequency, pulse duration
Magnetic signals: field intensity, flux density, moment, magnetization, permeability
Chemical signals: composition, concentration, reaction rate, toxicity, pH

Table 8.1: Physical properties in the signal domain after Middelhoek et al.

A sensor has to transform a signal from the outside world into an electrical signal. Following Lion [1], six different domains can be distinguished in these signals from the outside world: radiant signals, mechanical signals, thermal signals, electrical signals, magnetic signals and chemical signals. Physical properties of importance in these different signal domains are listed in table 8.1 [2]. Conversion from one signal domain to another is based on one of the many existing physical and chemical effects and measurement principles that have been developed. Because of the immense number of measuring principles and devices, reviews of this field often have an encyclopedic character.

Signals are carried by some form of energy. Sensors that transform this incoming energy into the electrical energy of the sensor output are called self-generating sensors. No additional source of energy is needed to obtain the measured sensor signal. Examples are a solar cell, converting light energy into an electrical signal for measuring illumination, or a piezo-electric microphone converting the mechanical energy of acoustical waves into an electrical signal.

When an additional energy source is needed for the operation of the sensor, we call the sensor a modulating sensor. The energy source is modulated by the measured quantity. An example is an angular position decoder, which counts the number of holes in a rotating disc by interrupting a light beam.

A sensor has in general a spatial resolution: it measures a certain physical quantity at a certain location. When we measure the temperature, we do so at (or around) the place where the sensor is present. As to their spatial extension, point, line and area sensors can be distinguished, which produce a single value, a profile or an image of the measured quantity.

It is important that sensors are robust, small and low-cost. Despite all the work done in this field, there is still a demand for better and lower-priced sensors. With the development of low-cost microelectronic devices, new sensors can open new markets. Sensors for the detection of the quality of food, for the consumption of gas and electricity, for the continuous inspection of the correct operation of all kinds of systems such as street illumination, and for the identification of persons and goods are some examples where low-cost sensing may open new markets.

In sensor technology, the material and the measurement principle used play an important role. The development of solid-state sensors based upon silicon has boosted sensor development [2]. The use of silicon not only makes it possible to apply the well-developed production methods of integrated circuits to sensor production, but also makes it feasible to combine the sensing and the processing of the sensor signal on a single chip. This makes it possible to improve the characteristics of a sensor at a much lower price and with better performance than with discrete components. Sensors combining sensing and processing are often called 'smart sensors'.

We will now shortly review the different principles used to create sensors for the five (non-electrical) signal domains, and then discuss in more depth the sensors needed in autonomous systems. These are position sensors to measure the trajectory of an autonomous system, image sensors to interpret the environment of the system, and distance sensors to obtain 3D information about the environment.

Mechanical signals
There is an important difference between sensors that measure position with or without mechanical contact with the real world. The measurement of positions in images using image-processing techniques has made it possible to measure positions remotely. Various physical principles are exploited for measuring position or proximity, including inductive, capacitive, resistive and optical techniques. To measure distances in robotics applications, ultrasonic sensors, laser range scanners and radar systems are used.

Force and pressure cannot be measured directly. First a force or pressure has to be converted into a displacement, and the displacement can be measured with one of the techniques described above.

Radiant signals
Electromagnetic radiation includes, besides visible, infrared and ultraviolet light, also radio waves, microwaves, X-rays and gamma rays. They differ in wavelength, ranging from 10⁴ m for long radio waves to 10⁻¹⁴ m for gamma rays. The wavelength of visible light is between 400 nm (violet) and 700 nm (red). The wavelength of radar ranges from a few centimeters to about 1 m.

In this text we will concentrate on visible light. Solid-state sensors for (visible) light are mainly based on the photoelectric effect, which converts light particles (photons) into electrical charge. Image sensors like CCD cameras are nowadays very cheap and form a rich source of information about the environment around an autonomous system.

Thermal signals
The resistance of a metal or a semiconductor depends upon temperature. This relation is well known and is exploited for temperature sensing. Also the base-emitter voltage of a bipolar transistor is temperature dependent, and is used in many commercially available low-cost temperature sensors.

Self-generating temperature sensors can be obtained using the Seebeck effect. When two wires made from different metals are welded together at one point, and this junction point is heated or cooled with respect to the remaining parts of the so-called thermocouple, a voltage is present between the open ends. For small temperature differences, the voltage is proportional to the temperature difference.

Magnetic signals
Most low-cost magnetic sensors are based on the Hall effect. When a magnetic field is applied to a conductor in which a current flows, a voltage difference results over the conductor in the direction perpendicular to the current and the magnetic field. Because this effect is quite substantial in semiconductors, semiconductor Hall plates are low-cost and used in many commercial devices.

Many materials change their resistivity upon application of a magnetic field. This so-called magneto-resistance effect can be exploited also for building magnetic sensors.

Chemical signals
For monitoring the environment, the measurement of specific components within gas mixtures is necessary. This has strongly motivated research into miniature, low-cost (and possibly disposable) chemical sensors. The chemical signal can be directly converted into an electrical signal, or first converted into an optical, mechanical or thermal signal, which is then converted into an electrical signal.

As an example, a sensor can be built for measuring the CO concentration in air by determining the attenuation of an infrared beam. As CO absorbs infrared light, the attenuation is a measure for the concentration.

Many chemical sensors are based on the measurement of the change of the conductivity or the dielectric constant of a material when it is exposed to a gas or electrolyte. Such a material can be a metal oxide. For instance, the electrical conductivity of tin dioxide, when heated, changes with the concentration of methane. In this way a sensor for the presence of gas can be built. Also many organic materials change their conductivity when exposed to a gas. However, since the conductivity of these materials is very low, they are hard to use. Chemical sensors exist for the measurement of many gases such as carbon monoxide (CO), carbon dioxide (CO2), oxygen (O2) and ozone (O3). Sensors for humidity and acidity (pH) also belong to this type.

A disadvantage of most chemical sensors is that they are not only sensitive to one chemical measurand but usually respond to many, which makes it necessary to use these sensors under well-defined conditions.

An important class of chemical sensors is the group of biosensors. One type of biosensor is the acoustic biosensor [4]. In such a sensor a vibrating quartz crystal is coated with a biochemical which is specific for the matter to be detected. When this coating, such as an antibody, binds to the matter to be measured, such as an antigen, the mass of the coating increases. This leads to a change in the resonance frequency of the quartz crystal, as the resonance frequency is directly related to the mass. The frequency change can be measured accurately. In this way a sensitive and specific sensor can be realized.


8.4. Internal position sensors

To measure the position of a robot arm or the distance traveled by a mobile robot we need angular information. This angular information can be obtained from the shaft of a robot arm or from the wheel axis of a mobile system, using internal position sensors. Most commonly, resolvers or absolute encoders are used for this.

Resolver
The LVDT (Linear Variable Differential Transformer) and the resolver are based upon the inductive principle. With the LVDT, linear displacements can be measured: a core is moved within a special transformer whose output voltage varies linearly with the position of the core. With a resolver, angular rotations can be measured of the rotary shaft on which it is mounted. A two-phase clock drives the stator and rotor windings of the resolver. The phase between the stator and rotor signals is measured and converted into an angular position.

Absolute encoders
Absolute encoders are high-precision rotary devices that are mounted on the shaft of a rotary drive, like a resolver. They encode the angular position as a binary code. This code is read from one or more discs with concentric rings of photographed or etched codes. In figure 8.4 this principle is illustrated for 16 positions with 4 code rings. A large encoder may have 10 to 20 rings and is quite expensive. Cheaper solutions can be found with incremental encoders, which count the number of steps. However, in this case no absolute position is obtained.

Figure 8.4: Absolute encoder for 16 positions in binary and Gray code
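Absolute encoders often use a Gray code because only one ring changes between adjacent positions. The sketch below, assuming the 4-ring, 16-position layout of figure 8.4, decodes a Gray-code reading into a binary position index and an angle.

    import math

    def gray_to_binary(gray):
        """Decode a Gray-code ring pattern (as an integer) to a binary position index."""
        mask = gray >> 1
        while mask:
            gray ^= mask
            mask >>= 1
        return gray

    def encoder_angle(gray_reading, n_rings=4):
        """Angular position (radians) of an absolute encoder with n_rings code rings."""
        positions = 2 ** n_rings
        return 2.0 * math.pi * gray_to_binary(gray_reading) / positions

    # Reading 0b0110 corresponds to position 4 of 16, i.e. a quarter turn.
    print(encoder_angle(0b0110))   # about 1.571 rad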


8.5. Image sensors

Image sensors are applied to obtain structural information. The requirements with respect to absolute accuracy are in general lower than those for single transducers. The information is, however, much more complex. Image sensors are built as an array of elementary transducers which are electronically scanned. The spatial extension of the elementary transducers should be small, so that these spatial extensions do not overlap.

It is also possible to obtain images by scanning the scene. There are three places in an imaging system where scanning can take place: in the illumination, the object/sensor positioning and in the sensor itself. A scene can be scanned with a single light beam while the reflected light is measured. This technique is applied in laser scanners, to obtain both a light intensity image and an image representing the distance to objects. This may involve slow mechanical scanning and is in general expensive. Scanning can also be obtained by moving the sensor over the scene. This method is applied in flatbed scanners, where a line image sensor is combined with one-dimensional mechanical motion to access successive lines of the image. In an array image sensor the two-dimensional scanning process is completely electronic.

A good design of a vision system involves the total system. An optimal choice of illumination, optical system and sensor in relation to the material properties of the object or the scene to be measured is essential [5],[6],[7]. Video systems are a very popular and cheap way to obtain images.

8.5.1. Video systems
Video systems originate from the entertainment industry. This means that this market has defined the standards for video systems: the American EIA norm and the European CCIR norm. In the EIA norm a video image consists of 525 lines, with 30 image frames per second. In the European CCIR norm a video image consists of 625 lines, with 25 image frames per second. To prevent flickering of the displayed images at such a low number of images per second, video systems are interlaced. This means that an image is split into two fields: one consisting of the odd image lines and the other consisting of the even image lines. The lines of one field are displayed in between the lines of the previous field, and the resulting local repetition frequency in the image is twice as high as when the complete frame was displayed.

This is illustrated in figure 8.5a for the CCIR norm (with fields of 312.5 lines). The video signal represents the brightness of the image along the lines of the fields and also contains synchronization pulses, which indicate the beginning of a line and of a field. These synchronization pulses also take their share of the video signal (the so-called retrace time, which is also needed for the display device to position the writing beam at the beginning of the next line). This results in a smaller effective scanned area than would be expected from the given number of lines and times. This effective area, where real image data is transmitted, is shown in figure 8.5b and is 74% of the total time (and so of the area).
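The 74% figure can be checked with the timing numbers from figure 8.5 and section 8.5.3 (51.7 µs active out of a 64 µs line, 575 active lines out of 625):

    # Active fraction of a CCIR video line and frame (values taken from the text).
    line_fraction = 51.7 / 64.0        # active line time / total line time
    frame_fraction = 575.0 / 625.0     # active lines / total lines
    print(line_fraction * frame_fraction)   # about 0.743, i.e. roughly 74%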

(Figure 8.5a shows the interlaced lines of the first (odd) field and of the second (even) field; figure 8.5b marks the effective scanned area of 575 lines by 768 pixels within the full frame of 625 lines (40 ms) and 956 pixels per 64 µs line.)

Figure 8.5: CCIR video system. The principle of interlacing (a) and the effective scanned area (b).

8.5.2. Solid state video sensors

A video sensor has three important functions:

• light to charge conversion

• spatial accumulation of charge carriers

• signal reading.

A solid-state video sensor consists of an array of photosensitive sites. Charges are created by the photoelectric effect, which frees electrons as a result of the illumination. The amount of charge accumulated at a photo site is a linear function of the local incident illumination and of the integration time. The scanning and signal reading are based on the principle of charge-coupled devices (CCD), basically analog shift registers.

Finite amounts of electrical charge called 'packets' are stored at specific locations in the silicon semiconductor material. These locations, called storage elements, are created by the field of a pair of gate electrodes close to the surface. By placing the storage elements close together so that there is an overlap between adjacent elements, a charge packet can pass from one storage element to another. This transfer of a packet is realized by alternately raising and lowering the voltage on adjacent gate electrodes.


Figure 8.6: Layout of a frame-transfer solid-state video sensor.

In figure 8.6 the layout is sketched of a popular CCD solid-state video sensor using the frame transfer method. This sensor is divided into an image section and a storage section [10]. The accumulation of charges takes place in vertical CCD registers, and the charges are transferred to the storage section (dashed) during the vertical retrace time. Then the accumulation starts again at the photo sites while the storage section is read out line by line. The storage section is shifted line by line into the horizontal read-out register (lower section), from which, after amplification and adding the synchronization pulses, the video signal is obtained.

In a solid-state sensor the spatial accumulation of charges is separated from the signal read-out. This makes it possible to use an accumulation time different from the read-out time. In the high-speed shutter option the accumulation time is reduced. This makes the sensor less sensitive, but because of the short accumulation time the motion blur can be considerably reduced; for example, the water drops of a waterfall become visible. We can also do the opposite: enlarge the accumulation time. This makes the sensor more sensitive and useful in bad illumination conditions. This enlargement is, however, limited by thermal noise. Therefore, in low-light applications cooled solid-state sensors are sometimes applied.

The spectral response of a solid-state sensor has a peak around 800 nm. Temperature is an important factor: the storage-related parameters degrade rapidly at temperatures above 70 °C (thermal relaxation).

In video cameras for the consumer market, single-chip color sensors are realized by gluing a color filter on-chip. This reduces the resolution of the sensor by a factor of 3. When color is not important, a black-and-white camera gives the highest resolution for the same price! In professional video cameras three solid-state sensors are used for the three primary colors, and there is no reduction in resolution.

Solid-state cameras have no distortion of the picture geometry, no burn-in or lag. However, when a very bright spot is present in the image, the CCD registers onto which this spot is projected saturate, and bright columns appear in the image. Solid-state sensors are small, lightweight and mechanically rugged. The lowest light conditions of consumer cameras require around 1 - 3 lux.

Resolution: video sensors developed for the consumer video market have sizes around 600 x 576 pixels. The organization and set-up of the array sensors is of course largely determined by the video norms for this market. Special array sensors for image-processing applications are also available. An example is the Megaplus camera [11], with square pixels and a resolution of 1340 x 1037 pixels (Megaplus is a trademark of the Videk Company).

Signal-to-noise ratio: this depends upon the illumination and ranges in commercial devices from 50 dB up to 64 dB. Besides these common properties, the following properties are also found in specifications of solid-state sensors:

• Photo Response Non-Uniformity (PRNU): the difference of the response levels between the most and least sensitive elements under uniform illumination.

• Element defects: the number of defective photo sites in the sensor. In a consumer video solid-state sensor, 604 columns x 575 lines, or in total 350,000 photo sites, are present. At this moment array sensors with fewer than 10 defects in an image are commercially available.

8.5.3. Video digitizers (frame grabbers)

A video signal has to be digitized by a video digitizer before a computer can process it. Several commercial video digitizers exist to input a video signal into a computer system. Digitizing an image frame of a CCIR video signal takes 40 ms. A sample frequency of 14.8 MHz is necessary to obtain square pixels (picture elements) in the CCIR system. Because of the retrace time, the effective scanning area (CCIR) is 768 pixels on a line and 576 lines in an image (for square pixels). This is illustrated in figure 8.5b. The line time of the CCIR system is 64 µs (15625 Hz), of which 51.7 µs is the horizontal scan time and 12.3 µs is the retrace time.

Three main functions are present in a video digitizer: AD conversion, synchronization and image storage.

• A video digitizer converts an analog video signal into digital values. The number of bits required depends on the signal-to-noise ratio of the image sensor. This ratio depends among other things upon the illumination and is in the order of 50 dB. This corresponds to the 8 bits present in most commercial video digitizers.

• Synchronization of the sampling instants of the video digitizer with the scanning of the video source is one of the most crucial parts of a video digitizer. When the video source is free-running, the video digitizer has to adjust its sample clock to the external source, so that a fixed number of sample points falls into each line (defined by the interval between two line-synchronization pulses in the video signal). When we want square pixels in the digital image of a CCIR-norm video signal, there must be 956 pixels on a line. This means that the sample clock cannot be fixed but must be adjusted to the video signal. In particular, when the video source is a video recorder, line and frame frequency may vary considerably and such an adjustment is essential.

• When the (solid-state) sensor device delivers not only a video signal but also its pixel (scan) clock, the AD conversion can take place completely synchronously with the scanning of the photo sites, and each sample point in the digital image corresponds in that case to one photo site in the solid-state sensor.

• The digitized image is stored in a video memory in the video digitizer. Often this video memory can also be displayed. When the video digitizer logically resides on a processor bus, the video memory may be mapped into the working space of the processor. Image processing may take place on this stored image in the video memory.

It is good to make a clear distinction between square pixels (photo sites) of a solid-state sensor and square pixels in a digital image. Square pixels of a solid-state sensor are a result of the geometry of the layout of the photo sites. The image values of the photo sites constitute the video signal at the rate of the pixel scan clock in the sensor. Square pixels in the digital image result from the fixed number of sample moments between two successive line pulses in the video signal, as defined by the rate of the sample clock in the video digitizer. Only when the scan clock rate in the sensor is the same as the sample clock rate in the video digitizer does a one-to-one relationship exist between a photo site in the sensor and a pixel in the digital image, and only then will a sensor with square 'pixels' produce square pixels in the digital image! For these sensors, besides the video signal, the pixel-scan clock is also needed by the video digitizer. When we have no pixel-clock connection, the sampling clock of the video digitizer defines the length of the pixels. So when this clock rate is 14.8 MHz we have square pixels in the digital image, even when the photo sites have a rectangular size. In this case a requantization of the photo sites occurs.


Figure 8.7: Requantization due to different pixel and sample clocks. (The figure compares the CCD photo sites along the position axis, the analog and sampled video signals along the time axis, the resulting digital image, and the photo sites compared to the digital image.)

When, for instance, the photo sites are larger than the pixel length defined by the digitizer clock, some photo sites are sampled twice and some are only sampled once. As the video signal in general passes a low-pass filter within the solid-state sensor, an interpolation takes place, with requantization as a result.

8.5.4. Position sensitive devices (PSDs).

! "#$

!! "#$

! "%

&' "#%

( ) * + ,

- + .

- + /

- ) . - ) /

0 1 2 $ 1 2

Figure 8.8: Line and area PSDs


The position of a light beam can be calculated with a PSD. In the one-dimensional configuration illustrated in figure 8.8, a PSD consists of a rectangular (e.g. 34 × 2.5 mm²) diode. The backside of the diode is fully covered with metal and forms the return electrode. The front side is the light-sensitive side, with two contacts A and B.

When a light beam hits the device, a current is generated by the photoelectric effect. This current is split into two currents ia and ib to the contacts A and B. The PSD has been manufactured to realize an extremely constant surface resistance of the layer (1%). So the resistors Ra and Rb are proportional to the lengths a and b of the layer on either side of the light spot, and

ia / ib = Rb / Ra = b / a = (D − a) / a,

where D = a + b is the length of the device. Hence

(ib − ia) / (ib + ia) = 2a/D − 1.

Thus (ib − ia)/(ib + ia) varies linearly with a and gives the position of the light spot. In general the light beam has a certain diameter; the output of the PSD then represents the center of gravity of the beam. The spectral response of a PSD ranges from 400 nm (blue) to 1000 nm (infrared), with a peak at 900 nm. The sensitivity is around 0.6 A/watt.

The resolution obtainable with a PSD is determined by the noise in the signals ia and ib. Attainable accuracies are in the range of 1:10⁴. The influence of dark current and environmental light can be largely reduced by the use of alternating or pulsed light.
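A sketch of the position calculation that follows from the relation above; D is the length of the sensitive layer, and the example currents are invented.

    def psd_position(i_a, i_b, length_mm):
        """Position a (mm) of the light spot along a 1D PSD of length length_mm.

        Uses (i_b - i_a)/(i_b + i_a) = 2a/D - 1, i.e. a = D/2 * (1 + (i_b - i_a)/(i_b + i_a)).
        """
        ratio = (i_b - i_a) / (i_b + i_a)
        return 0.5 * length_mm * (1.0 + ratio)

    # Equal currents put the spot in the middle of a 34 mm PSD.
    print(psd_position(1.0e-6, 1.0e-6, length_mm=34.0))   # 17.0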

8.6 Sensor systems for distance measurement

8.6.1 Time of flight principles

Ultrasonic distance sensors
Ultrasonic sensors are based on the time-of-flight principle. An ultrasonic pulse is sent, and the time it takes before the reflected sound is received again by the transducer (in the meantime switched over to receiving mode) is a measure of the distance. This distance is computed by multiplying the measured time interval by the velocity of sound in air and dividing by two. Because the sound velocity in air is temperature dependent, changes in temperature influence the measurement. Very cheap distance measurements can be realized in this way, resulting in a high popularity of these types of sensors.




Figure 8.9: Ultrasonic sensor

The Polaroid Company was the first to use this method for distance measurement in its cameras. Although this sensor is very cheap, the disadvantage of ultrasonic sensors is the relatively large opening angle (cone) in which the sound pulse is transmitted and received. You know that there is an object present at a certain distance, but you do not have a very precise location. Combination of sensor readings at different locations (sensor data fusion) is needed to model the environment with ultrasonic sensors.

Laser range finder
Laser range finders, also called LIDAR (LIght Detection And Ranging), work on the same principle as ultrasonic sensors, but they emit a light pulse instead of a sound pulse and measure the time-of-flight of the reflected light. As the speed of light is about 3·10⁸ m/s, the times to be measured are in the order of 1-10 nanoseconds, which requires an ultra-fast clock. Another measurement principle is to modulate the intensity of the laser beam (typically with 5 MHz) and to measure the phase shift of the reflected light. To obtain a 1D range scan of the environment, the laser beam is deflected by a rotating mirror, scanning the environment in a (semi)circle.

An example of a laser range scanner is the PLS scanner of the SICK Optic Company. To give an impression of the possibilities of laser range scanners, the technical specifications of this PLS scanner are given below:
- 180 degree scan
- 360 range measurements per scan (0.5 degree angular resolution)
- rotating hanging mirror at 12 Hz
- maximum range 50 meters
- distance accuracy 5 cm
- eye-safe
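To show how such a scan is typically used, the sketch below converts the polar range readings of a 180-degree scan into Cartesian points in the scanner frame. It is a generic conversion under the stated assumptions, not a driver for the SICK PLS.

    import math

    def scan_to_points(ranges_m, fov_deg=180.0):
        """Convert evenly spaced range readings (m) over fov_deg into (x, y) points
        in the scanner coordinate frame, with the scan centred on the x-axis."""
        n = len(ranges_m)
        points = []
        for k, r in enumerate(ranges_m):
            angle = math.radians(-fov_deg / 2.0 + k * fov_deg / n)
            points.append((r * math.cos(angle), r * math.sin(angle)))
        return points

    # A fake scan of 360 readings (0.5 degree steps), all 2 m away.
    points = scan_to_points([2.0] * 360)
    print(points[0], points[180])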

Radar (RAdio Detection And Ranging)
A radar system emits a short pulse of energy in the radio frequency domain, with wavelengths ranging from a few centimeters to about 1 m, and again uses the time-of-flight to measure the distance. Continuous-wave radar broadcasts a modulated continuous wave. Signals reflected from objects that are moving relative to the antenna will be of different frequencies because of the Doppler effect. Since the wave travels to the target and back, the difference in frequency is twice the transmitted frequency multiplied by the ratio of the target velocity to the speed of light. In this way, besides the distance, also the speed can be measured. Thus, a target moving towards the radar at 90 km/h shifts the frequency of a 10-cm (3000-MHz) radar by 500 Hz. Because of the high speed of modern solid-state hardware, radar has become an affordable sensor, which can be used, for instance, in intelligent vehicles to measure the distances and velocities to predecessors even in bad weather conditions.
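The 500 Hz example can be reproduced with the two-way Doppler relation (a standard radar formula, written out here as a small check):

    def doppler_shift_hz(target_speed_mps, transmit_freq_hz, c=3.0e8):
        """Radar Doppler shift for a target approaching at target_speed_mps.

        The reflected wave travels out and back, hence the factor two.
        """
        return 2.0 * target_speed_mps * transmit_freq_hz / c

    # 90 km/h towards a 3000 MHz (10 cm) radar:
    print(doppler_shift_hz(90 / 3.6, 3.0e9))   # 500.0 Hz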

8.6.2. Triangulation principle

The methods using light to obtain range or distance images (also called 2½D images) are mainly based on triangulation. In an active approach, a scanning light source is used in combination with an imaging system. The triangulation principle is also applied in stereo vision in a passive way, which we will discuss under computer vision.

When a small light beam illuminates a scene, the distance to the illuminated scene element can be calculated by triangulation. Only the position of the illuminated element is necessary to calculate the distance, so both image sensors and PSDs may be used to calculate this distance.
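A minimal sketch of the triangulation geometry of figure 8.10 under an assumed simple configuration: the laser beam runs parallel to the optical axis at a baseline b from the lens, the lens has focal length f, and the spot is imaged at offset u on the sensor, giving Z = f·b/u. The symbols and numbers are illustrative, not taken from the text.

    def triangulation_depth(u_mm, focal_mm, baseline_mm):
        """Distance Z (mm) to the illuminated point, for a laser beam parallel to the
        optical axis at offset baseline_mm, imaged at offset u_mm on the sensor."""
        return focal_mm * baseline_mm / u_mm

    # Example: f = 16 mm, baseline 100 mm, spot imaged 1.6 mm off-centre -> 1 m away.
    print(triangulation_depth(1.6, 16.0, 100.0))   # 1000.0 mm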

Figure 8.10: Principle of triangulation. A laser beam illuminates the object, and a lens images the spot onto a position detector; a difference in distance results in a displacement on the detector.

A range image can be obtained by a complete scan of the scene with a light beam. One dimension of the scanning can be present in the movement of the object (or of the system). In that case only a one-dimensional range profile has to be calculated, in a plane intersection across the scene, perpendicular to the direction of motion. The total image is obtained by combining the range profiles of the successive object positions. When no movement of the object is present, an additional, mechanical motion of the whole sensor system is necessary. Commercial systems based on these principles are available, but they are in general slow.

Let us look in some more detail at how a distance profile can be obtained. In figure 8.11 a configuration is given to obtain such a profile across a plane intersection of a scene. The illumination and the detection beam are simultaneously moved over the scene by a rotating mirror. Because of the simultaneous scanning, a linear imaging array or a linear PSD is sufficient to calculate the distance by triangulation.

Figure 8.11: Distance profile obtained with a scanning laser beam

When, instead of a linear PSD, an image sensor is mounted and a light plane is used to illuminate the scene, no scanning by a rotating mirror is needed. Instead of scanning with a beam, the scanning is now part of the image sensor. From the position of the light plane on each row in the 2D image, the distance for the intersection of that row with the scene is calculated. Because no mechanical scanning is necessary, this method is considerably faster than the previous one.

Recently, space encoding techniques have been proposed which are in principle very fast (Inokuchi et al. [9]). Instead of scanning the scene with a light plane, the different light planes are coded in binary masks. For instance, n masks can code 2ⁿ different scan lines.

8.7. Image Interpretation: Computer Vision

Visual sensors form a rich source of information for autonomous systems. Also in biological autonomous systems, the eyes are predominantly used for very complex tasks. The interpretation of what the eyes see in the language of the real world seems to happen without any difficulty. This is quite different for autonomous systems. The amount of data is huge, and the extraction of the essential information for the autonomous system is complex and difficult. There exists no general computer vision system; there are only solutions for special applications. A visual sensor produces large amounts of information. Take as an example a color video camera, which produces 25 color images per second. Each image contains about 768×576 pixels for each of the three colors. Since each pixel value is a measurement, this results in roughly 3·10⁷ measurements per second. Compared to, for instance, an ultrasonic sensor, which operates at roughly 10 measurements per second, we can see that a standard visual transducer produces 6 orders of magnitude more information than a sonar transducer. If we consider the cost per measurement, the video camera turns out to be an extremely cheap device.

Another important item is the computational complexity of a vision system. Since the amount of information from a visual transducer is huge, this can limit the applicability of visual sensors in a real-time system. As a result of this limit, there is a general tendency in real-time systems to reduce the information from the visual transducer as fast as possible, in order to lower the computational complexity. This reduction is achieved through a model of what is measured. The easier the model parameters can be estimated from the measurement, the lower the computational complexity of the sensor system.

As an example of a vision system as described so far, consider a fixed "security camera". The virtual sensor should produce a signal ("1") if there is an intruder in the building, else ("0"). The task of the image interpretation part is thus to translate the camera images into a binary number. Notice that the data reduction here is enormous: from the camera image to a once-per-second intruder measurement is a 10⁷ reduction in measurements! The model used in this sensor is twofold: the intruder is moving whereas the building does not, and intruders are bigger than mice. We can translate this model into our visual sensor. Since the camera is fixed in place, motion is measured by subtracting two images taken at different times. The resulting "time difference" image is zero if nothing is moving in the environment. If an area of this image is non-zero, then something in that area has moved in the environment. Computing the size of that area gives the measurement for classifying the movement as an intruder or as a mouse. If the area exceeds a predefined threshold, the sensor produces a signal indicating that there is an intruder in the building. Notice that if the building lights are turned off, the virtual sensor will produce an intruder signal: the change of light is not taken into account in the model and will produce an incorrect measurement.
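A sketch of the virtual intruder sensor described above, based on frame differencing and an area threshold. The threshold values and image sizes are invented and, as noted in the text, a real system would also have to deal with lighting changes.

    import numpy as np

    def intruder_detected(frame_prev, frame_curr, diff_threshold=30, min_area_pixels=500):
        """Return 1 if the moving area between two gray-value frames exceeds min_area_pixels.

        frame_prev and frame_curr are 2D arrays of gray values of the same shape.
        """
        diff = np.abs(frame_curr.astype(int) - frame_prev.astype(int))
        moving_area = int(np.count_nonzero(diff > diff_threshold))
        return 1 if moving_area > min_area_pixels else 0

    # Two synthetic 576x768 frames; a 40x40 "intruder" appears in the second one.
    prev = np.zeros((576, 768), dtype=np.uint8)
    curr = prev.copy()
    curr[100:140, 200:240] = 200
    print(intruder_detected(prev, curr))   # 1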

We will state it again: the more general the model used is, the more generally applicable the measurements of the virtual sensor are; but since fewer assumptions about the environment can be made, the more complex the image interpretation becomes. An image is always a 2D projection of the 3D real world, so we have to interpret images in a 3D context. How to do that depends strongly on the amount of information we have about a scene. Much knowledge about the scene makes the interpretation easier and more specific; little information makes the approach more general but also much harder.


When only little information is available about the environment from which the scene originates, exploration of the environment is needed. In that case techniques are used which create a distance map of the scene. We will discuss those techniques later (stereo vision, optic flow); with them we can navigate in unknown environments like Mars or find walls and obstacles in an office environment.

When a lot of information is present, like what objects in the scene should look like or which landmarks are present in a scene, we have a quite accurate model of what to expect and we can directly use that model to match the scene. Examples are navigation on known landmarks or on the line markings of a freeway.

Besides the question of location (where?) there is also the question of identification (what?). It makes a problem easier when those two aspects can be separated. Depending upon the application, one aspect can be much more important than the other. In navigation the precise location of an obstacle is important and not so much what the obstacle looks like. On the other hand, knowing that an intruder is present, that a certain room is occupied, that a gesture is made to stop your robot, or that a traffic light is red is more important than the precise location. Different techniques for image processing and interpretation can be used for this type of question.

In the next section we will discuss the camera projection, which lays the basis for the projection of the scene onto the image plane. In the following sections we will continue with an overview of vision approaches with increasing levels of complexity, and end up with very general visual systems which give the position of viewed objects in the environment and visual systems which measure the velocity of these objects.

8.7.1. The Camera Projection

Figure 8.12: A pinhole camera produces an image that is a perspective projection of the environment. It is convenient to use a coordinate system in which the XY plane is parallel to the image plane and the origin is at the pinhole. The Z-axis then corresponds to the optical axis. (a) The image plane is located behind the pinhole, resulting in a mirrored image. (b) It is more convenient to think of the image plane as located in front of the pinhole, so that there is no mirroring.



We define a visual transducer as a device which registers the radiation (light) that has interacted with physical objects in the environment. Since this is a very broad definition, we restrict the transducers to be two-dimensional, thereby eliminating point- and one-dimensional devices. Examples of devices which register the reflected light on a two-dimensional surface are the eye, a photo camera and a video camera.

The fundamental model of the transformation realized by these devices is the point projection. These devices act approximately as a pinhole camera: the image results from projecting points in the 3D environment through a single point of projection, the pinhole, onto an image plane. As the image plane is behind the point of projection, the image is reversed, as shown in figure 8.12a. However, it is more intuitive to recompose the geometry so that the point of projection corresponds to a viewpoint behind the image plane and the image occurs right side up, as shown in figure 8.12b. With these figures we can mathematically define the projection of a point P=(P_x,P_y,P_z) on the surface of an object onto the image plane point p=(p_x,p_y). If we define the focal length f as the distance from the camera center to the image plane, we can describe this projective relation by

    p_x = f \frac{P_x}{P_z} , \qquad p_y = f \frac{P_y}{P_z}            (8.1a)

What is measured at a point p is thus the reflected light from the point P on the object in the direction of the pinhole, which is the origin of our coordinate system.
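
As a minimal sketch (assuming NumPy and a point already expressed in the camera coordinate system of figure 8.12b), equation (8.1a) translates directly into code:

    import numpy as np

    def project_point(P: np.ndarray, f: float) -> np.ndarray:
        """Pinhole projection of a 3D point P = (Px, Py, Pz), given in the
        camera frame, onto the image point p = (px, py) (equation 8.1a)."""
        Px, Py, Pz = P
        return np.array([f * Px / Pz, f * Py / Pz])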

The pinhole model is only an approximation of the point projection occurring with real lenses. A deviation from this pinhole model is often present and is called pin-cushion or barrel distortion. This distortion can be described by a 5-parameter model (parameters: K1, K2, K3, P1, P2) which is added to the linear projection (Wolf, 1986 [8.18]; Woltring, 1980 [8.19]; Beyer, 1991 [8.20]).

    p_x = f \frac{P_x}{P_z} + \Delta_x , \qquad p_y = f \frac{P_y}{P_z} + \Delta_y            (8.1b)

These additive terms (Δ_x, Δ_y) are given by:

    \Delta_x = K_1 p_x r^2 + K_2 p_x r^4 + K_3 p_x r^6 + 2 P_1 p_x p_y + P_2 (r^2 + 2 p_x^2)
    \Delta_y = K_1 p_y r^2 + K_2 p_y r^4 + K_3 p_y r^6 + P_1 (r^2 + 2 p_y^2) + 2 P_2 p_x p_y

with r^2 = p_x^2 + p_y^2. In most practical cases the two parameters K1 and K2 are dominant and give a sufficient description of the distortion. Given these two parameters, a corrected image can be calculated by image warping.
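
As a minimal sketch of the dominant radial part of this model (keeping only K1 and K2, and using the distortion terms as reconstructed above):

    def radial_distortion(px: float, py: float, K1: float, K2: float):
        """Additive radial distortion terms (dx, dy) of equation (8.1b),
        keeping only the dominant coefficients K1 and K2."""
        r2 = px * px + py * py
        dx = K1 * px * r2 + K2 * px * r2 ** 2
        dy = K1 * py * r2 + K2 * py * r2 ** 2
        return dx, dy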


8.7.2 3D vision with landmarks

When the environment in which a robot operates is known, the system is able to locate itself and to navigate from the positions of known objects and landmarks in the scene. We have to find those landmarks and objects in the (successive) images which the system obtains from the environment. From the positions of the landmarks in the scene, the 3D position of the robot can be calculated. We will now briefly illustrate how techniques from image processing can be effectively used to find objects and landmarks in some typical example situations.

- Color histogram techniques enable us to find known colored landmarks based on their specific color combinations. Multiple landmarks in an image can be matched to the known model of the environment, from which the position of the system can be calculated.

- Matching a known object to its image enables us to determine its position and orientation. In particular, matching multiple point features from the 2D projection to the 3D object is a powerful technique, which can also be based on the shape of objects.

- Tracking. We do not analyze unrelated images. The images are snapshots of successive moments during the motion of the mobile robot. This means that in the image sequence we can track objects and predict in the next image where earlier detected objects should be found. This makes the analysis more robust and faster, as it decreases the image search space.

Landmarks from color histograms
Swain and Ballard (1991) introduced a technique called color indexing, which was later improved by Enneser and Medioni (1993). This proved to be a powerful technique to find known colored objects in scene images. Given a discrete color space (e.g., red, green and blue), the color histogram is obtained by discretizing the image colors and counting the number of times each discrete color occurs in the image. Histograms are invariant to translation and rotation about the viewing axis, and change only slowly under change of angle of view, change in scale, and occlusion (Swain and Ballard).

Local color histograms L(k,l) are calculated at successive image positions (k,l) and compared to the model histogram M of the object we are looking for. When, for instance, the red, green and blue video signals are quantized into 8 levels, the color histogram consists of 8×8×8 = 512 buckets. A measure is defined for how well the histograms match, and the best matching histograms L(k,l) give the possible locations (k,l) of the object in the image. The local histograms can be compared based on their weighted intersection, which is defined as follows:


Let M_i be the count in bucket i of the model histogram and L_i the count in bucket i of the local histogram at position (k,l); the weighted intersection is then given by:

    I = \sum_i w_i \min(L_i, M_i)

The weights w_i could give more importance to colors that are specific to the model and occur less frequently in the whole image. When H is the histogram of the whole image and M is the model histogram, w_i could be chosen as:

    w_i = \frac{M_i}{H_i}
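
A minimal sketch of both steps (assuming NumPy, an 8-bit RGB image, and 8 quantization levels per channel; the function names are illustrative):

    import numpy as np

    def color_histogram(img: np.ndarray, bins: int = 8) -> np.ndarray:
        """3D color histogram of an RGB image with `bins` levels per channel
        (8 levels per channel gives 8*8*8 = 512 buckets)."""
        q = (img.astype(np.uint16) * bins) // 256              # quantize each channel
        idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
        return np.bincount(idx.ravel(), minlength=bins ** 3).astype(float)

    def weighted_intersection(L: np.ndarray, M: np.ndarray, H: np.ndarray) -> float:
        """Weighted intersection I = sum_i w_i * min(L_i, M_i) with w_i = M_i / H_i
        (buckets that are empty in the whole-image histogram H are skipped)."""
        L, M, H = (np.asarray(a, dtype=float) for a in (L, M, H))
        w = np.divide(M, H, out=np.zeros_like(M), where=H > 0)
        return float(np.sum(w * np.minimum(L, M)))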

Triangulation with 3 landmarks
Suppose that we have a 3D model of the environment, so that the positions of landmarks in the 3D environment are known. It can be proven that in that case three landmarks in an image are sufficient to calculate the position of the camera in the environment. The position of the landmarks in the image gives us the angular direction under which we see the landmarks, and so the angular separation between the landmarks can be calculated. Assume that we see three landmarks A, B and C in our camera image, of which the 2D floor plan is given in figure 8.13a. The angular separation between A and B is α and between B and C is β.


Figure 8.13. Triangulation from 3 landmarks, based on circumscribed circles.

From geometry we know that a point making a fixed angle with two other points must lie on a fixed circumscribed circle through all three points. This circle can be calculated from the position of the base of the triangle (A and B) and the top angle α.



In the same way, the camera must lie on the circumscribed circle of the triangle through B and C with top angle β. One of the points of intersection of the two circles is the camera position (the other is the point B). This means that the position of the camera can be found in an easy way. The noise present in the pixel positions of the landmarks results in an uncertainty region, the area between the two circles for the maximum and minimum value of the top angle. This is illustrated in figure 8.13b for an error in the image position of point A. The uncertainty in the top angle results in an intersection area, where the camera is located, instead of an intersection point.

When more than three landmarks are present, we can calculate all possible circles, and an estimate of the camera position can be obtained from the intersections of the circles. This approach and its accuracy are described by Madsen and Andersen [8.21].

Localization with n landmarks
Another way to find the camera position is based on a least squares approach: the camera position is found which minimizes the error between the measured positions of landmarks in the image and their positions calculated from the 3D model.

In the previous section on camera projection we saw that a point P in 3D space is projected onto the image as given by equation (8.1). In this equation the position of the point P is expressed in the coordinate system of the camera (see also figure 8.12). The landmarks, however, are given in the coordinate system of the model. The translation and rotation between these two coordinate systems is exactly what we are looking for: the position of the mobile robot in the model. We will assume that the mobile robot is driving over the flat surface of a floor and that the camera is mounted parallel to that floor, so we are only interested in the position of the robot with respect to the ground floor. This means that we assume that the XZ plane of both coordinate systems (parallel to the ground floor) is the same. The transformation between a point P given in the camera system as P^C and in the model system as P^M is then given by:

    P^C_x = P^M_x \cos\theta - P^M_z \sin\theta + t_x
    P^C_z = P^M_x \sin\theta + P^M_z \cos\theta + t_z
    P^C_y = P^M_y

in which θ is the rotation angle between the two coordinate systems and t_x and t_z are the translations in the x and z direction between the two coordinate systems. This is illustrated in figure 8.14.


Figure 8.14: Camera and model coordinate systems.

We now have to find those values of these three variables that minimize the difference between the position p_xi of a found landmark in the image and the projection of the model position of that landmark. So we have to minimize

    E = \sum_i \left( p_{xi} - \frac{P^C_{xi}}{P^C_{zi}} \right)^2
      = \sum_i \left( p_{xi} - \frac{P^M_{xi} \cos\theta - P^M_{zi} \sin\theta + t_x}{P^M_{xi} \sin\theta + P^M_{zi} \cos\theta + t_z} \right)^2

with respect to t_x, t_z and θ. Standard techniques, as discussed in the previous chapter (7.3), can be used for that.

As the images constitute a sequence, most effort goes into the minimization for the first image. For the following images the difference in the position parameters with respect to the previous image will be small, so the next minimum can be obtained much more easily. Kalman filtering can also be used to predict the next expected position and orientation of the robot.
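
A minimal sketch of this minimization (assuming SciPy's least_squares optimizer, unit focal length folded into the image coordinates, and an illustrative initial guess x0, e.g. the pose found for the previous image):

    import numpy as np
    from scipy.optimize import least_squares

    def localize(landmarks_model: np.ndarray, px_measured: np.ndarray,
                 x0=(0.0, 1.0, 0.0)):
        """Estimate (tx, tz, theta) by minimizing the reprojection error E.
        landmarks_model is an (N, 2) array of ground-plane coordinates
        (PMx, PMz); px_measured contains the N measured image x-coordinates."""
        PMx, PMz = landmarks_model[:, 0], landmarks_model[:, 1]

        def residuals(params):
            tx, tz, theta = params
            PCx = PMx * np.cos(theta) - PMz * np.sin(theta) + tx
            PCz = PMx * np.sin(theta) + PMz * np.cos(theta) + tz
            return px_measured - PCx / PCz

        return least_squares(residuals, x0=np.asarray(x0, dtype=float)).x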

Object localization with matching

To find the position and orientation of known objects, the same technique can be applied as we discussed before when we have colored landmarks on an object. When this is not the case, we have to base the localization on the shape characteristics of the object. In general, the shape of objects can be extracted with edge detection and region growing techniques.


Characteristic elements of man-made objects are typically straight lines and corners, which can be found with local image filters or extracted from object contours. These features then have to be matched to the 3D object. Well-known problems are multiple matches: a corner, for instance, can be present at different positions on a 3D object. Besides that, we only see the front side of an object, so only a part of the object can be matched. This is even more complex when a part of the object is occluded. When different objects are present in the environment we have to match the image to different object models, which complicates the matching considerably.

Figure 8.15: Object localization by matching

Often the features found in the image and the connections between them are stored in a 2D graph together with their attributes. In the same way, the model of the 3D object is represented in a 3D graph, and the problem becomes one of graph matching. Efficient techniques for the matching of graphs with known models exist (Gavrila & Groen [8.17]). For each match, the position error between projected model features and those in the image can be minimized to obtain the position and orientation of the object, as we discussed before. When different objects are present in the scene, the resulting error can also be used to recognize the correct object: the error should be sufficiently small.

By combining information from additional sources, the analysis can be made easier and more robust. An example is the combination of a laser range scanner and vision to locate the position of a mobile robot in an office environment (Weckesser & Dillmann [8.14]). We will discuss these techniques in more detail in the next chapter.


8.8. 3D vision in unknown environments

On the question of why we can see objects in depth, that vision is "three dimensional", the most common reply is that we use the difference between the object positions in the images in our left and right eyes to judge depth. Binocular stereo is certainly not the only depth cue used in biological systems. Birds, for example, have their eyes located at the sides of their head, which makes it impossible to see the same object with two eyes at the same time. This means that birds cannot use binocular stereo at all! There are thus other cues for extracting depth from visual information; these are called depth cues.

The most commonly known depth cues used in artificial and biological systems are:

1. Size: the size of a known object in the environment, as observed by the viewer, seems to get smaller as distance increases, and vice versa.

2. Binocular stereo: points on the surfaces of objects in the environment are imaged at different positions, depending on their distance to the eyes.

3. Optic flow: the velocity of (points on) objects as projected on the viewer's eye depends on the distance from the viewer; objects far away seem to move slower, and vice versa.

4. Partial occlusion: if a first object is partially occluded by a second object, the first object is located further away from the viewer, since the second object is between the viewer and the first object.

5. Texture gradients: think of a grain field. The further you look, the finer the texture of the grain field gets.

The first three cues can be applied in autonomous systems. The last two, partial occlusion and texture gradients, are more of biological interest due to implementation difficulties. In principle, additional information is needed for shape from size and texture gradients: one needs to know what size or texture to expect in the 3D scene. Partial occlusion gives only limited information. Stereo vision and optic flow do not need additional information, and so we will focus on those two methods in the following sections.

8.8.1. Distance from size

One of the listed depth cues is the size cue. Suppose we have a model of the object in the environment, and part of this model is the size (P_x, P_y, 0) or the area A = P_x P_y of the object itself. Let the object be located at (0,0,P_z) and the camera have unit focal length f = 1. If we calculate the projection of the object area on the image plane using equation (8.1a), we find that the projected area a is given by:


    a = p_x p_y = \frac{A}{P_z^2}

An object thus seems to get smaller as the distance to the object increases. But from the model we know that it is not the object size that is changing but the distance to the object. This result is so trivial that we almost forget that the distance can only be estimated if we have a model (area) of the object. So when we know the area of the object, we can directly measure the distance as

    P_z = \sqrt{\frac{A}{a}}

If the size of the object is overestimated in the model, the object will appear to be located at a larger distance. This seems to be one of the reasons why (small) children get hit by cars: in bad lighting conditions they are perceived as full-grown humans, and the driver overestimates the distance to them, resulting in a (too) late reaction.

Having a model of the size of objects in the environment is a very restricting assumption. In environments in which the objects are not known a priori, the size cue is of no use since there is no model of the object seen. There are, however, other depth cues which do not need such a size model of the objects. We will discuss those in the next sections.
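
A one-function sketch of the size cue (unit focal length as in the derivation above; with a general focal length f the projected area scales with f^2):

    import math

    def distance_from_size(area_model: float, area_image: float, f: float = 1.0) -> float:
        """Distance to an object of known frontal area A from its projected
        image area a:  a = f^2 * A / Pz^2  =>  Pz = f * sqrt(A / a)."""
        return f * math.sqrt(area_model / area_image)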

8.9. Stereo Vision

In this section we deal with stereo vision: the reconstruction of the 3-D coordinates of points in a scene, given two images obtained by cameras with known relative positions and orientations. The basic problems to be solved in stereo vision are shown in figure 8.16. Two pinhole cameras of the type described in the previous section form images p1 and p2 of the physical point P. As shown in the figure, we have chosen three coordinate systems: one in each image plane, (x1,y1) and (x2,y2), and one in 3-D space, which is sometimes called the world reference frame. The distance between the two optical centers C1 and C2 of the cameras is called the baseline.

What we observe are two images formed in the image planes L and R. Given these two images we want to solve two problems:

1. For a point p1 in plane L, decide which point p2 in plane R it corresponds to. Correspondence means that they are the images of the same physical point P. This is the correspondence problem.

2. Given corresponding points p1 and p2, compute the 3-D coordinates of P in the world reference frame. This is the reconstruction problem.


With the simple model of figure 8.16, the second question can be answered by intersecting the lines <C1,p1>¹ and <C2,p2>. The result essentially depends on how accurately we know the positions of the image centers and the image planes L and R.


Figure 8.16: The 3-D vision problem

As we shall see, the main difficulty is to come up with good answers to the so-called correspondence problem: given a point in image 1, what is the corresponding point in image 2? Since there are, in general, several possibilities for the choice of the corresponding point in image 2, the stereo correspondence problem is said to be ambiguous. In this section an overview of stereo vision will be given, with answers to questions like: which points correspond, and which constraints can be used to reduce this ambiguity?

8.9.1 Reducing the search area: the Epipolar Constraint
It should be clear that, unless some care is taken, the correspondence problem is ambiguous and very computationally intensive. Given a point p1 in L, it may a priori be put in correspondence with any point in R. To solve this difficulty, we must use constraints to reduce the number of possible matches for any given point p1. We will show a very powerful constraint that arises from the geometry of stereo vision, which is called the epipolar constraint.

We will make a simple statement with the help of figure 8.17. In this figure we see that, given p1 in L, all possible physical points P that may have produced p1 are on the line <p1,C1>.

¹ We use the notation <x y> for a line through the points x and y. Similarly we use <x y z> for a plane through the points x, y and z.


Figure 8.17: The epipolar constraint. The points C1, P and C2 specify the epipolar plane. Intersection of this plane with the image planes L and R gives the epipolar lines ep1 and ep2. Points on ep1 correspond to points on ep2! (a) The general setup of two cameras viewing the environment. (b) If the image planes are parallel, the epipolar lines correspond to the image rows, which makes the analysis more convenient.

As a direct consequence, all possible matches p2 of p1 in R are located on the projection of this line <p1,C1> onto the image R, resulting in the line ep2. The line ep2 is called the epipolar line of point p1 in the plane R.

Furthermore, the line ep2 goes through the point E2, which is the intersection of the line <C1,C2> with the plane R. E2 is called the epipole of the right camera with respect to the left.

Note that the epipolar lines ep1 and ep2 are the intersections of the epipolar plane <C1 P C2> with the planes L and R. From this it is easily seen that all epipolar lines pass through the corresponding epipoles.

The epipolar constraint is thus that, for a given point p1 in the plane L, its possible matches in plane R all lie on a line. Therefore, we have reduced the dimension of our search space from two dimensions to one. Let us now calculate the distance of a point P for the special case that the two cameras are parallel: the image planes are the same, as illustrated in figure 8.18. The projection of the point P on the X-axis in the left image, p_x1, and in the right image, p_x2, is given by:

    p_{x1} = f \frac{P_{x1}}{P_{z1}}  \;\rightarrow\;  P_{x1} = \frac{p_{x1} P_{z1}}{f}


    p_{x2} = f \frac{P_{x2}}{P_{z2}}  \;\rightarrow\;  P_{x2} = \frac{p_{x2} P_{z2}}{f}

in which f is the focal distance.


Figure 8.18: Calculation of the distance of a point P from stereo vision, and the uncertainty region resulting from the pixel error.

When B is the length of the stereo baseline we get:

    P_{x2} = P_{x1} + B
    P_{z1} = P_{z2} = P_z

    P_z = \frac{f B}{p_{x2} - p_{x1}}

And so we know the distance of the point P. It is also interesting to learn what the error is in this depth obtained with stereo vision. This error results from the error in the location of the corresponding points in the left and right image. When this error is Δx, the resulting error in depth ΔP_z is:

    \Delta P_z = \left| \frac{\partial P_z}{\partial p_{x1}} \right| \Delta x + \left| \frac{\partial P_z}{\partial p_{x2}} \right| \Delta x

and ΔP_z is given by:

    \Delta P_z = \frac{2 \Delta x \, f B}{(p_{x2} - p_{x1})^2}

The error in depth increases for greater depth. This is illustrated in figure 8.18.
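
A small sketch of both formulas (the units must be consistent: with f and the image coordinates in pixels and B in meters, the depth comes out in meters):

    def stereo_depth(px1: float, px2: float, f: float, B: float) -> float:
        """Depth of a point seen by two parallel cameras with baseline B:
        Pz = f * B / (px2 - px1)."""
        return f * B / (px2 - px1)

    def stereo_depth_error(px1: float, px2: float, f: float, B: float,
                           dx: float) -> float:
        """First-order depth error for a pixel localization error dx in each
        image: dPz = 2 * dx * f * B / (px2 - px1)**2."""
        return 2.0 * dx * f * B / (px2 - px1) ** 2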


8.9.2 Matching by dynamic programming
Now we are ready to tackle the key problem in stereo vision: how can we locate the corresponding points in two images? As we have seen, if a point on the surface of an object is visible in both images, then the two image points must lie on corresponding epipolar lines.

Our first thought might be that we ought to analyze each image first and extract "features". These might be objects that have been identified, or places where there are distinctive gray-level patterns that we have some confidence in matching. So how do we identify these conjugate points? A convenient feature for this purpose is an edge. We might also look for gray-level corners in the images. Since feature methods rely on a discrete number of points in the image (better: on an epipolar line) we call them discrete methods. These methods try to find the best match between feature points in the left image L and the right image R under the assumption that a feature point in L can have at most one correspondence in R and that the resulting matches vary smoothly over the image. Interested readers are referred to the book by Faugeras [8.15]. Only distance information is obtained at the positions where features are observed. Distance information for points in between has to be deduced from a model of the scene and the segmentation of the images.

Another approach is to use correlation methods, which can be thought of as continuous up to the pixel discretization. The idea is that the gray-level value of a single image point is not a good criterion to match between the two images, since there will be many pixels with approximately the same gray level in the second image (epipolar line). So why not take the gray-level surrounding of an image point and match this with a gray-level surrounding in the second image? In fact we are then correlating a small region in L with the search region in R. Formally, the correlation of two regions with size (N_x × N_y) located at p1=(x1,y1) in L and p2=(x2,y2) in R equals

    C(p_1, p_2) = \frac{1}{k} \sum_{i=-N_x}^{N_x} \sum_{j=-N_y}^{N_y}
        \left( I_L(x_1+i, y_1+j) - \mu_L \right) \left( I_R(x_2+i, y_2+j) - \mu_R \right)            (8.2)

where k = σ_L(x_1,y_1) σ_R(x_2,y_2) Ω. Here μ, σ and Ω denote the mean, the standard deviation and the area of the region, respectively. At the point where the regions match best, the correlation will be at its maximum. The difference in location between the left and right image is called the disparity. For color images, the intersection of color histograms can be used as a measure for the correspondence between the left and right images, as was discussed in 8.7.2.
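
A minimal sketch of equation (8.2) for gray-level images stored as NumPy arrays (the windows are assumed to lie fully inside both images; indexing is [row, column] = [y, x]):

    import numpy as np

    def ncc(left: np.ndarray, right: np.ndarray,
            p1: tuple, p2: tuple, Nx: int, Ny: int) -> float:
        """Normalized correlation of the (2Nx+1) x (2Ny+1) windows around
        p1 = (x1, y1) in the left image and p2 = (x2, y2) in the right image."""
        (x1, y1), (x2, y2) = p1, p2
        wL = left[y1 - Ny:y1 + Ny + 1, x1 - Nx:x1 + Nx + 1].astype(float)
        wR = right[y2 - Ny:y2 + Ny + 1, x2 - Nx:x2 + Nx + 1].astype(float)
        dL, dR = wL - wL.mean(), wR - wR.mean()
        k = dL.std() * dR.std() * dL.size        # sigma_L * sigma_R * Omega
        return float((dL * dR).sum() / k) if k > 0 else 0.0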

Given a point in the left image of a stereo pair, the corresponding point in the right image lies on the epipolar line, so the search for this point can be restricted to that line. For each point at position p1 on the left epipolar line, the corresponding point at position p2 in the right image has to be found.


The disparity D is then p2 - p1. This is illustrated in figure 8.19, where the vertical distance to the diagonal is the disparity. The diagonal itself indicates zero disparity; here we have taken the origins of the two epipolar lines in such a way that zero disparity lies in the middle of our 3D region of interest. Given the fields of view of the left and right cameras, there is only a limited number of points p2 on the right epipolar line which could possibly match a certain point p1 on the left epipolar line. This limits our search region for p2. This is illustrated in figure 8.19 by the darker area around disparity 0.

Figure 8.19: The matching space of left and right epipolar lines (Ziegler [8.16]).

Matching the points on the left and right epipolar lines can be seen as finding an optimal path from left to right through the allowed disparities in between the left and right epipolar lines. We will call that the matching space. For the allowed region of the matching space we can calculate the correlation according to equation (8.2) and find a path which minimizes the total difference between the left and right image along the epipolar lines. As the correlation for identical image parts is 1, we may use |1 - C(p1,p2)| as a measure to sum along the path, so we want to minimize the sum S over all possible paths through the matching space:

    S = \min_{\text{paths}} \sum_{i=1}^{K} \left| 1 - C(p_1^i, p_2^{j(i)}) \right|

where j(i) denotes the point on the right epipolar line that the path assigns to the i-th point on the left epipolar line.


Figure 8.20: Occlusions are jumps in the matching space of the epipolar lines (axes: left epipolar line versus right epipolar line; the path runs over foreground and background regions).

Figure 8.21: An example of reconstructing the environment: (a)(b) a stereo pair of images; (c) with the correlation technique the disparity is estimated; (d) the reconstruction of the depth superimposed on the original left image.


As occlusions may occur, jumps in the horizontal or vertical direction are allowed without cost. This minimization can be done with standard dynamic programming methods; a small sketch is given below. Details of how to realize this minimization for practical situations can be found in Ziegler [8.16]. To illustrate the theory, we show a depth reconstruction of a stereo pair in figure 8.21.
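
A minimal dynamic-programming sketch for one pair of epipolar lines (the text allows cost-free occlusion jumps; here a small occlusion penalty is used to keep the optimal path well defined, and only the minimal cost S is returned, without the backtracking that recovers the disparity path):

    import numpy as np

    def dp_epipolar_match(cost: np.ndarray, occlusion: float = 0.2) -> float:
        """cost[i, j] is the dissimilarity |1 - C(p1_i, p2_j)| between point i
        on the left epipolar line and point j on the right epipolar line.
        Diagonal steps are matches; horizontal/vertical steps model occlusions."""
        n, m = cost.shape
        S = np.full((n + 1, m + 1), np.inf)
        S[0, 0] = 0.0
        S[1:, 0] = occlusion * np.arange(1, n + 1)     # left points occluded
        S[0, 1:] = occlusion * np.arange(1, m + 1)     # right points occluded
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                S[i, j] = min(S[i - 1, j - 1] + cost[i - 1, j - 1],   # match
                              S[i - 1, j] + occlusion,                # skip left point
                              S[i, j - 1] + occlusion)                # skip right point
        return float(S[n, m])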

8.10 Shape from Motion

When objects move in front of a camera, or when a camera moves through a fixed environment, there are corresponding changes in the image over time. We will describe in this section how these changes can be used to recover the shape and motion of objects in the environment. There are three fundamental questions to be answered: 1: if a point on an object in the environment moves, how does the corresponding point move in the image plane? 2: what can we derive from the motion of points in the image about the motion and depth of 3D points in the environment? 3: how do we estimate the motion of points in the image from time varying images?

8.10.1. The Motion Field

We first define the motion field, which assigns a velocity vector to each point in the image. At a particular instant of time, a point p=(p_x,p_y) corresponds to some point P=(P_x,P_y,P_z) on the surface of an object. The camera model as shown in figure 8.12 and equation 8.1 relates the two. Now let the object point P have a velocity V=(V_x,V_y,V_z) relative to the camera. This induces a motion v in the corresponding image point p, where

    v = (v_x, v_y) = \left( \frac{dp_x}{dt}, \frac{dp_y}{dt} \right)

By using the perspective projection relation with unit focal length (f=1) we can conclude that:

    v_x = \frac{V_x - p_x V_z}{P_z} , \qquad v_y = \frac{V_y - p_y V_z}{P_z}            (8.3)

What is important here is that a vector can be assigned in this way to every image point. These vectors v constitute the motion field. An example of such a motion field is shown in figure 8.22.
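
Equation 8.3 in code form (a small sketch assuming NumPy, unit focal length, and a point and velocity expressed in the camera frame):

    import numpy as np

    def motion_field_vector(P: np.ndarray, V: np.ndarray) -> np.ndarray:
        """Image velocity v = (vx, vy) induced by a 3D point P = (Px, Py, Pz)
        moving with velocity V = (Vx, Vy, Vz) relative to the camera."""
        px, py = P[0] / P[2], P[1] / P[2]       # projection with f = 1
        return np.array([(V[0] - px * V[2]) / P[2],
                         (V[1] - py * V[2]) / P[2]])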


Figure 8.22: An example of a motion field (right) for approaching a hill, together with the three-dimensional map (left) it comes from. Notice that points far away exhibit a smaller motion vector than points close by.

8.10.2. Structure and Motion from the Motion Field

We now know that there exists a theoretical two-dimensional motion field, which equals the three-dimensional motion of objects as projected on the image plane. Let us take a closer look at equation 8.3. There are two important things to notice.

First, we see that a motion variable V_i is always divided by the depth variable P_z. This means that neither the absolute motion nor the absolute depth to the object can be estimated from the motion field alone; it is only possible to compute relative motion parameters, which we denote as:

    \tilde V_x = \frac{V_x}{P_z} , \qquad \tilde V_y = \frac{V_y}{P_z} , \qquad \tilde V_z = \frac{V_z}{P_z}

This means that an environmental map of depths P_z/V_z is scaled by the velocity of objects in the environment relative to the camera. Likewise, as the camera has a larger velocity, objects in the environment appear to be closer. This can be made very intuitive: a point moving close to the camera with small velocity can induce the same motion field as a point far away with great velocity! Once again, the structure of objects, the depth, can only be derived as a relative quantity.



Figure 8.23: The motion field of a moving plane. We see that all motion vectors originate from a single point, the focus of expansion. (a) The plane moves towards the camera with velocity (0,0,V_z): the focus of expansion is located at the center of the image. (b) The plane moves with velocity (V_x,0,0): the focus of expansion is located at minus infinity in the X direction.

Now let's look with a camera at a single rigid object moving with velocity V. Since the object is rigid, all points on the object move with the same velocity! This object produces a motion field as described in equation 8.3. In figure 8.23 the motion field of a moving plane is shown, oriented parallel to the image plane (P_z is constant). In the left image we can clearly see that there is a location in the image where all motion vectors originate. This location is called the focus of expansion and we denote it with F=(F_x,F_y). This location is easily computed, since the motion vector at this location is zero! Solving for v_x = 0 and v_y = 0 in equation 8.3 we find that:

    F_x = \frac{V_x}{V_z} = \frac{\tilde V_x}{\tilde V_z} , \qquad F_y = \frac{V_y}{V_z} = \frac{\tilde V_y}{\tilde V_z}            (8.4)

Now when does a point in the environment have a zero motion vector in the image plane? If it is not moving with respect to the camera, or if the camera is moving in exactly the direction in which the point lies! To put it another way: the direction of the point and the direction of translation coincide. The focus of expansion thus gives the projected location of the environment point the camera is heading for. This opens perspectives for collision detection or goal-directed behavior: if the focus of expansion represents an obstacle, this will lead to a collision. Further on we will show an example of this.


8.10.3. The Optic Flow

Going back to the motion field, our first thought at how the motion field for points can be measured from the time varying images is to "track" them over time. A number of points is identified in an image at time t and we find where these points are in the next image at time t+1. This seems simple at first sight, but it isn't so trivial! How do you identify points uniquely so that there is no confusion between them? Where do you search for the points in the next image, remembering that searching is computationally expensive? What if a point is occluded and cannot be found in the next image? This approach is feasible when we are dealing with a static camera and some moving objects in the scene. The dynamic situation with a moving camera needs a different approach: optic flow.

For the case of a relatively slowly time varying image, we can make assumptions which ease the estimation of the motion field. The question that thus arises is of course: how do we extract the motion field from the time varying image without tracking "special points" over time? Let us first look at when the motion field cannot be computed from these images.

Brightness patterns in the image move as the objects that give rise to them move. Optic flow is the apparent motion of the brightness pattern. Ideally the optic flow will correspond to the motion field, but this is not always true. Consider for example a perfectly uniform sphere rotating in front of the camera, illuminated by a single source. Since the image does not change over time, the optic flow equals zero in the whole image. However, the motion field is certainly not zero over the image, since the sphere is rotating. In general we can state that the optic flow is the observable part of the motion field. So we will rephrase the question to: how do we extract the optic flow from the time varying image?

Let us consider the image intensity I(p_x,p_y,t) at a point p in the image plane. We will use a short notation for I(p_x,p_y,t) and write it as I. If we take the total time derivative İ of I, we obtain:

    \dot I = \frac{dI}{dt} = \frac{\partial I}{\partial p_x} \frac{dp_x}{dt} + \frac{\partial I}{\partial p_y} \frac{dp_y}{dt} + \frac{\partial I}{\partial t}

Now by definition dp_x/dt = v_x and dp_y/dt = v_y, therefore the previous equation can be written as

    \dot I = \frac{\partial I}{\partial p_x} v_x + \frac{\partial I}{\partial p_y} v_y + \frac{\partial I}{\partial t}

We are now in the position to make an assumption about the object in the environment which has great consequences: the intensity of a point P on the object in the environment does not change over time. This means that the intensity of the moving image point p does not change either, such that

    \dot I = 0


This results in the so-called optical flow constraint:

    \frac{\partial I}{\partial p_x} v_x + \frac{\partial I}{\partial p_y} v_y = - \frac{\partial I}{\partial t}            (8.5)

Since we can measure ∂I/∂p_x, ∂I/∂p_y and ∂I/∂t, we have one constraint on the optic flow v at p in the image plane. Since v has two components and we only have one constraint, only the magnitude (given a direction) can be computed, or vice versa. This is called the aperture problem and is explained in figure 8.24. We call the vectors satisfying this single constraint normal flow vectors n; their direction is taken to be along the image gradient, perpendicular to the edge.

Figure 8.24: The aperture problem. (a) A moving edge is observed through a rectangular window; the motion of the edge appears to be v. (b) The real motion of the edge is shown; however, from the observation in (a) alone there is no way of telling what the real motion of the edge was. (c) The best you can do is define the motion of the edge relative to a direction. Here we have shown the normal flow n.

As we mentioned, the optic flow has two components, so an additional constraint is needed. We will now give the general model of the second constraint: the object in the scene is continuous or smooth in space and time. This means that the optic flow varies slowly over small location changes in the image plane. We can thus make the assumption that the optic flow is constant in a small neighborhood in the image plane. This means that we have to measure the gradients at two nearby points and solve the optic flow vector from them. However, since image gradients are measurements, they will be measured with additional noise. A better approach is to use more measurements.

Let's see how this works out. Define a small (2N_x+1)×(2N_y+1) region in the image around p as S_p. Our assumption is that the optic flow v is constant in S_p. For every point in this region the optic flow is constrained by equation 8.5. We should thus find a v=(v_x,v_y) which fits the constraints as well as possible. This results in minimizing the functional E_p:


    E_p = \sum_{r \in S_p} \left( \frac{\partial I}{\partial p_x} v_x + \frac{\partial I}{\partial p_y} v_y + \frac{\partial I}{\partial t} \right)^2

We can minimize E_p with respect to the (unknown) optic flow vector in that region, which just means solving for v in (see section 7.3):

    \frac{\partial E_p}{\partial v} = 0

This leads to a very simple linear system A v = b, where the optic flow in region S_p can be solved using a simple matrix inversion, with:

    A = \begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix} , \qquad
    b = \begin{pmatrix} - \sum I_x I_t \\ - \sum I_y I_t \end{pmatrix} , \qquad
    I_x = \frac{\partial I}{\partial p_x}(r) , \; I_y = \frac{\partial I}{\partial p_y}(r) , \; I_t = \frac{\partial I}{\partial t}(r)            (8.6)

where all sums run over the points r in the region S_p.
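
A minimal sketch of this local least-squares flow estimate (assuming NumPy and gradient images Ix, Iy, It already computed for the region S_p, e.g. by finite differences between two consecutive frames):

    import numpy as np

    def local_optic_flow(Ix: np.ndarray, Iy: np.ndarray, It: np.ndarray) -> np.ndarray:
        """Solve A v = b (equation 8.6) for the constant flow v = (vx, vy)
        in one region; Ix, Iy, It hold the gradients sampled in that region."""
        A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                      [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
        b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
        # A is singular in uniform regions: the aperture problem strikes again.
        return np.linalg.solve(A, b)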

8.10.4. An Image Processing viewpoint

We have deliberately started the section on structure from motion with the geometrical properties of two- and three-dimensional motion. This is an autonomous systems or robotics approach to the theory. However, it may be convenient to look at the theory from another viewpoint, that of image processing. This viewpoint will most probably lead to the question: "okay, I have an image sequence; what can I do to compute the structure and motion of the environment?"

What we can measure from the image sequence are, among others, the spatial and temporal intensity gradients (I_x, I_y, I_t) of equation 8.6. The previous section gives a technique for estimating the optic flow from these gradients, using equation 8.6. The optic flow is an approximation of the motion field, and this motion field is the basic ingredient for the analysis of object structure and motion. We have seen that only the velocity-scaled depth can be recovered from the motion field (equation 8.3). We will first show some simple examples of how to estimate these depths.

Example: Motion in the viewing direction

Suppose the camera is mounted in front of a train, and the viewing direction is parallel to the tracks. All objects in the environment will move with an unknown velocity V=(0,0,-V_z) towards the camera. From the optic flow and equation 8.3 we can reconstruct a relative depth map (P_z / V_z) by using


    \frac{P_z}{V_z} = \frac{1}{\tilde V_z} = \frac{p_x}{v_x} , \qquad \frac{P_z}{V_z} = \frac{1}{\tilde V_z} = \frac{p_y}{v_y}

Example: Motion perpendicular to the viewing direction

We now align the camera such that it is perpendicular to the train tracks and the Y-axis is perpendicular to the ground plane, like looking out of the train window. All objects in the environment will move with an unknown velocity V=(V_x,0,0) parallel to the image plane. A relative depth map can be reconstructed in the same way as in the previous example, but now the depth is scaled by the velocity V_x:

    \frac{P_z}{V_x} = \frac{1}{\tilde V_x} = \frac{1}{v_x}

Example: General Motion

In the case of general motion, the velocity of the camera is V=(V_x,V_y,V_z). In this case the camera need not be aligned with any axis. As we have seen, there is a special location in the image plane, the focus of expansion, which gives the direction the camera is heading. We can estimate this location with two optic flow vectors v^1 and v^2 at image positions p^1 and p^2. Rewriting equation 8.3 as:

    v_x = (F_x - p_x) \frac{V_z}{P_z} = (F_x - p_x) \tilde V_z , \qquad
    v_y = (F_y - p_y) \frac{V_z}{P_z} = (F_y - p_y) \tilde V_z

gives us for two optic flow vectors:

    F_x = \frac{v_x^1}{\tilde V_z^1} + p_x^1 , \qquad F_y = \frac{v_y^1}{\tilde V_z^1} + p_y^1 , \qquad
    F_x = \frac{v_x^2}{\tilde V_z^2} + p_x^2 , \qquad F_y = \frac{v_y^2}{\tilde V_z^2} + p_y^2

We now have four unknowns (F_x, F_y, \tilde V_z^1, \tilde V_z^2) and four equations. The location of the focus of expansion and the relative depths of these two points can thus be computed from these equations (solving the system is left as an exercise).

If more than two points (N > 2) are taken into the computation of F, a least-mean-square approximation can be made, which is of course more robust to noise! For N points in the image we then want to minimize:

    E_x = \sum_{i=1}^{N} \left( v_x^i - (F_x - p_x^i) \tilde V_z^i \right)^2 , \qquad
    E_y = \sum_{i=1}^{N} \left( v_y^i - (F_y - p_y^i) \tilde V_z^i \right)^2


However, extra points do not constrain the velocity-scaled depth \tilde V_z^i of point i, since with every extra point a new velocity-scaled depth has to be estimated!
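
A minimal sketch of this least-squares estimation for N points (assuming SciPy; the parameters are F and one velocity-scaled depth per point, and the crude initial guess is only illustrative; the two-point closed form remains the exercise above):

    import numpy as np
    from scipy.optimize import least_squares

    def estimate_foe(p: np.ndarray, v: np.ndarray):
        """Estimate the focus of expansion F = (Fx, Fy) and the velocity-scaled
        depths Vz_i from N image points p (N, 2) and their optic flow vectors
        v (N, 2), using the model v_i = (F - p_i) * Vz_i."""
        N = len(p)

        def residuals(params):
            F, Vz = params[:2], params[2:]
            predicted = (F[None, :] - p) * Vz[:, None]
            return (v - predicted).ravel()

        x0 = np.concatenate([p.mean(axis=0), np.ones(N)])   # illustrative start
        sol = least_squares(residuals, x0).x
        return sol[:2], sol[2:]          # F and the velocity-scaled depths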

8.10.5. Application: The Braking Task

We have seen that it is possible to estimate the observed motion field, the optic flow, from the spatial and temporal derivatives of a time varying image. We also know that the estimated depth to objects in the environment is scaled by their velocity relative to the camera. The latter may be seen as a serious disadvantage: unless the motion of objects is known absolutely, no absolute depth to an object can be obtained using the technique described in this section! But it is exactly this property that makes "structure from motion" well suited for navigation: it is the combination of environment structure and motion of the system that contains the information needed for navigation.

We will demonstrate this with an example, the braking task: given an autonomous system that moves with an unknown velocity V=(V_x,V_y,V_z) towards a wall at unknown depth, the task is to stop exactly in front of the wall by controlling the deceleration of the system. This seems like a difficult task if the velocity of the system is unknown and thus, as a consequence, the depth to the wall cannot be measured absolutely from the optic flow. So how do we obtain a useful measurement to control the deceleration a_2?

Now the requirement that the system is at rest in front of the wall means that the velocity V_z(t) of the system and the distance P_z(t) to the wall equal zero at a certain time t_d, resulting in:

Pz(td) = 0 and Vz(td) = 0 (8.7)

This translates into the standard Newtonian equations of uniformly accelerated motion:

    P_z(t) = a_0 - a_1 t + \tfrac{1}{2} a_2 t^2 , \qquad V_z(t) = - a_1 + a_2 t            (8.8)

Note that the sign of the velocity is reversed: the system moves towards the wall. Inserting the time t = t_d into equation 8.8, in combination with equation 8.7, results in

    P_z(t_d) = 0 \;\rightarrow\; a_0 = a_1 t_d - \tfrac{1}{2} a_2 t_d^2 , \qquad
    V_z(t_d) = 0 \;\rightarrow\; t_d = \frac{a_1}{a_2}


Combining these two equations by eliminating td reveals that:

    a_0 = \frac{a_1^2}{2 a_2}            (8.9)

So much for the Newtonian constraints. What can we measure from the optic flow? If the system were to keep moving with its current velocity -V_z(t) toward the wall, then the time left before crashing into the wall equals the distance to the wall P_z(t) divided by the velocity. We call this the time to contact τ(t):

    \tau(t) = - \frac{P_z(t)}{V_z(t)}            (8.10)

It is exactly this quantity that can be measured: the time to contact equals the inverse of the measurable quantity \tilde V_z in equation 8.3. We can express the time to contact in the coefficients of the Newtonian laws. From equation 8.8 we see that at the current time t=0:

    \tau(t) = \frac{a_0}{a_1} , \qquad a_0 = P_z(t=0) , \qquad a_1 = - V_z(t=0)            (8.11)

If we look at the time derivative of the time to contact at time (t=0):

    \frac{\partial \tau(t)}{\partial t} \bigg|_{t=0} = -1 + \frac{P_z(t) \, a_2}{V_z(t)^2} \bigg|_{t=0} = -1 + \frac{a_0 a_2}{a_1^2}

Now that looks like an interesting quantity, because we can measure it directly from the optic flow and it entails the stopping constraint 8.9. Thus from equation 8.9 we see that

    \frac{\partial \tau(t)}{\partial t} = -1 + \frac{1}{2} = - \frac{1}{2}

The requirement to stop exactly in front of the wall has thus been translated into the visual domain as the condition ∂τ/∂t = -1/2. We can now control the deceleration a_2 by monitoring this requirement, and thus have a highly valuable measurement to control a_2.
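
A toy controller sketch built on this condition (not from the text; the proportional gain and the discrete approximation of ∂τ/∂t from two successive time-to-contact measurements, τ = 1/Ṽ_z, are illustrative assumptions):

    def braking_command(tau: float, tau_prev: float, dt: float,
                        a2: float, gain: float = 0.5) -> float:
        """Adjust the deceleration a2 so that d(tau)/dt approaches -1/2.
        Braking harder than needed gives d(tau)/dt > -1/2 (reduce a2);
        braking too softly gives d(tau)/dt < -1/2 (increase a2)."""
        dtau_dt = (tau - tau_prev) / dt
        error = dtau_dt - (-0.5)
        return max(0.0, a2 - gain * error)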

8.11. Evaluation of Vision sensors

This section has shown that visual transducers produce an enormous amount of information, from which different types of virtual sensors can be derived. We have focussed on environment motion and environment depth sensors. Depending upon the practical application there is of course a trade-off with direct distance sensors such as a LIDAR.


State of the art vision methods can estimate the relative distance to a static environment. Stereo vision suffers from the correspondence problem: a point in the left image may correspond to numerous points in the right image, and vice versa. Still, it is used frequently in autonomous systems due to the fact that it is the only virtual visual sensor which produces an absolute estimate of the depth to the environment. However, implementing stereo vision on an autonomous system requires calibration of the camera centers and the orientation of the image planes. This is generally a very tough procedure requiring a lot of time and effort.

Optic flow is a virtual sensor which estimates the projected motion of environment objects relative to the camera. From this virtual sensor a velocity-scaled depth map of the environment can be calculated. Since the optic flow has long been assumed to be a very noisy quantity that is not easily computed from time varying images, not many implementations on autonomous systems are known. However, it is particularly suited for dynamic situations, when the camera is mounted on a mobile system.

Bibliography

1. K. Lion, Transducers: problems and prospects, IEEE Trans. Ind. Electron. & Control Instrum., IECI-16 (1969), pp. 2-5.

2. S. Middelhoek, S.A. Audet, Silicon Sensors, TUD, Department of Electrical Engineering, Et 05-31.

3. L.N. Reijers, H.J.L.M. de Haas, Flexibele Produktie Automatisering, deel III: Industriele robots, Technische Uitgeverij De Vey Mestdagh BV, Middelburg.

4. R. Schasfoort, Chemische sensorontwikkeling bij TNO, Sensornieuws, vol. 2, 3, December 1993, pp. 8-10.

5. A. Novini, Before you buy a Vision System..., Manufacturing Engineering, March 1985, Vol. 94, no. 3, pp. 42-48.

6. H.E. Schroeder, Practical Illumination Concept and Technique for Machine Vision Applications, Proc. Robots 8, June 4-7, 1984, Detroit, MI, pp. 14-27/14-43.

7. R.A. Jarvis, A Perspective on Range Finding Techniques for Computer Vision, IEEE PAMI, Vol. PAMI-5, No. 4, March 1983, pp. 122-139.

8. H.R. Everett, Survey of Collision Avoidance and Ranging Sensors for Mobile Robots, Robotics and Autonomous Systems, 5, 1989.

9. S. Inokuchi, K. Sato, F. Mutsuda, Range-Imaging System for 3-D Object Recognition, Proc. Seventh Intl. Conf. on Pattern Recognition, July 30 - August 2, 1984, Montreal, Canada, pp. 806-808.

10. Philips, The frame-transfer sensor: an attractive alternative to the TV camera tube, Philips Technical Publication 150, 1985.

11. Videk, Megaplus camera: CCD Camera for High Resolution Applications, Videk, New York.

12. M.J. Swain, D.H. Ballard, Color indexing, Int. Journal of Computer Vision, Vol. 7, no. 1, pp. 11-32, 1991.

13. F. Enneser, G. Medioni, Finding Waldo, or Focus of Attention Using Local Color Information, 1993.

14. P. Weckesser, R. Dillmann, Navigating a Mobile Service-Robot in a Natural Environment Using Sensor-Fusion Techniques, Proc. IROS'97, Grenoble, Sept. 1997, pp. 1423-1428.

15. O. Faugeras, Three-Dimensional Computer Vision, The MIT Press, 1993.

16. M. Ziegler, Region-based analysis and coding of stereoscopic video, Akademischer Verlag München, 1997, ISBN 3-929115-96-4.

17. D. Gavrila, F.C.A. Groen, 3D object recognition from 2D images using geometric hashing, Pattern Recognition Letters 13, 1992, pp. 263-278.

18. P.R. Wolf, Elements of Photogrammetry, McGraw-Hill, 1986.

19. H.J. Woltring, Planar control in multi-camera calibration for 3D-gait studies, J. Biomechanics, Vol. 13, 1980, pp. 39-48.

20. H.A. Beyer, An Introduction to Photogrammetric Camera Calibration, invited paper, Seminaire Orasis, St. Malo, September 24-27, 1991, pp. 37-42.

21. C.B. Madsen, C.S. Andersen, Optimal Landmark Selection for Triangulation of Robot Position, J. Robotics and Autonomous Systems, Vol. 13:4, 1998, pp. 277-292.