
Omnidirectional stereo vision using fisheye lenses

Marius Drulea, Istvan Szakats, Andrei Vatavu, Sergiu Nedevschi

Technical University of Cluj-Napoca, Romania, Computer Science Department

{marius.drulea, istvan.szakats, andrei.vatavu, sergiu.nedevschi}@cs.utcluj.ro

Abstract—This work presents an omnidirectional stereo system with a direct application to autonomous logistics. The omnidirectional system uses fisheye lenses to inspect the surroundings of the automated forklift. The lenses are mounted on the top of the vehicle to allow a 360° scene reconstruction with a single stereo pair. The system provides the list of obstacles detected around the vehicle. The reconstruction of the scene is possible via a division of the fisheye images into several rectified images. A stereo matching algorithm is applied to each pair of rectified images. The computed 3D points corresponding to the rectified pairs are unified into a single cloud. The obstacle detection module operates on the unified cloud of 3D points.

Keywords—omnidirectional stereovision, fisheye lens, multi-channel rectification, stereo reconstruction

I. INTRODUCTION

The PAN-Robots project [1] proposes a new generation of autonomous guided vehicles (AGVs) for autonomous logistics factories. The vehicles are equipped with advanced on-board sensors to increase flexibility and safety. The on-board sensors provide two functionalities: autonomous navigation and collision avoidance.

In order to detect possible collisions, the sensors detect the obstacles around the vehicle. In the state of the art, this is achieved using laser scanners: they perform a single horizontal scan of the scene and report all the obstacles that intersect the scan plane. The system fuses the information from multiple laser scanners to provide a complete 360° detection around the AGV. The scanning plane hits the majority of the obstacles, but not all of them. Hanging or protruding objects are exceptions, and the AGVs might collide with such obstacles. Objects with irregular shapes might also create undesired situations; a worker carrying a long ladder is such an example. Humans are more difficult to detect when they are partially occluded or when they are very close to other objects.

The PAN-Robots project proposes a novel sensor: a stereo system consisting of two high-resolution cameras and two fisheye lenses. The stereo sensor is mounted on the top of the AGV and is oriented toward the ground; Figure 1 depicts this setup. In this manner, the stereo sensor is able to monitor the entire area surrounding the AGV. The fisheye lenses play an important role in this sensor. With conventional rectilinear lenses, only a restricted area around the AGV could be processed, and several conventional stereo sensors would have to be used to monitor the AGV's surroundings. With the proposed stereo sensor this is no longer necessary: a single sensor suffices for this task. The fisheye stereo sensor has to detect, track and classify all the objects around the AGV. Moreover, it should detect hanging and protruding objects and obstacles with irregular shapes. The AGV system fuses the list of objects detected by the stereo sensor with the list of objects detected by the laser scanners.

Up to the current project phase, we have implemented modules for synchronous image acquisition, calibration, rectification, stereo reconstruction and obstacle detection. These modules are the subject of this paper. The tracking and classification modules will be implemented in future phases of the project.

The rest of the paper is organized as follows: Section II presents the stereo sensor requirements and Section III presents an overview of our system. A synchronous image acquisition module for Gigabit Ethernet cameras is described in Section IV. Sections V to VIII present the image processing components of the system, from stereo calibration and rectification to the object detection module. Section IX presents the experimental results of the system.

II. SENSOR REQUIREMENTS

A. Range

For the derivation of the range, the following worst-case scenario is considered. The AGV operates at full speed and is approaching an object; at the same time, the object moves towards the AGV at its maximum speed. The AGV velocity is 2 m/s. The maximum time for detection, tracking and classification is one second. Once the obstacle is detected, the AGV must have enough space and time to brake and stop in order to avoid a collision. The AGV is able to brake at 1.0 m/s² (hard braking) and at 0.5 m/s² (soft braking). Table I presents the required range for different obstacles. According to the table, a running person (2 m/s) is safe if the detection range is at least 10 m and hard braking is involved. Therefore, 10 m is the minimum acceptable range of the stereo sensor. For a complete derivation of these values, see [2].
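As an illustration, the following sketch reproduces the values of Table I under a simple reaction-plus-braking model: the AGV travels at full speed during the one-second detection interval and while braking, and the object keeps approaching during the whole maneuver. This is our own reconstruction of the model, not the derivation from [2].

```python
# Hypothetical sketch of the worst-case range model behind Table I.
# Assumption: required range = AGV travel (reaction + braking distance)
# plus object travel during the reaction and braking time.

def required_range(v_obj, a_brake, v_agv=2.0, t_detect=1.0):
    """Minimum detection range [m] so the AGV can stop before the object."""
    t_brake = v_agv / a_brake                             # time to stop [s]
    d_agv = v_agv * t_detect + v_agv**2 / (2 * a_brake)   # AGV travel [m]
    d_obj = v_obj * (t_detect + t_brake)                  # object travel [m]
    return d_agv + d_obj

for v_obj, label in [(0.0, "stationary object"), (0.8, "slow person"),
                     (2.0, "fast person"), (5.0, "forklift")]:
    hard = required_range(v_obj, 1.0)
    soft = required_range(v_obj, 0.5)
    print(f"{label:17s}: hard {hard:4.1f} m, soft {soft:4.1f} m")
```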

B. Field of view

The field of view depends on the required range and on the mounting position. In order to use a single stereo sensor, it has to be mounted on the top of the AGV, as depicted in Figure 1. From this position, the sensor is able to monitor the entire area around the AGV. The maximum mounting height is between 4 m and 4.5 m; it is restricted by the ceiling height. A pole is used to mount the sensor at the desired height. The minimum range measured from the pole is slightly larger than 10 m because the pole is not placed at the front of the AGV; finally, the minimum range from the pole is about 11 m.

This work was supported by the research project PAN-Robots. PAN-Robots is funded by the European Commission under the 7th Framework Programme, Grant Agreement no. 314193. The partners of the consortium thank the European Commission for supporting the work of this project.


This range covers possible changes of the pole position. For a correct detection, objects of 1 m height should be entirely visible at 11 m from the sensor. With a sensor height of 4 m, this results in a minimum field of view of 150°.

TABLE I RANGE REQUIREMENTS

Object velocity v_obj       | Required range at a_AGV,brake = 1.0 m/s² (hard braking) | Required range at a_AGV,brake = 0.5 m/s² (soft braking)
0 m/s (stationary object)   | 4.0 m                                                   | 6.0 m
0.8 m/s (slow person)       | 6.4 m                                                   | 10.0 m
2 m/s (fast person)         | 10.0 m                                                  | 16.0 m
5 m/s (forklift)            | 19.0 m                                                  | 31.0 m

C. Data processing requirements

The system has to detect, track and classify objects at 10 frames/second. The complete list of requirements is given below (see [2]).

Object detection. The objects have to be detected with the following accuracies:
- Position: the estimated location of the object with respect to the local sensor coordinate system; accuracy: 3 % of the object distance.
- Size: the estimated size of the object in terms of width, length and height; accuracy: ±0.1 m at 1 σ.
The rate of obstacle detection should exceed 95 %.

Object tracking. Each detected object needs to be tracked with respect to the following properties:
- Velocity: the estimated velocity of the object (scalar value); accuracy: ±0.5 m/s at 1 σ.
- Heading: the estimated direction of motion with respect to the local sensor coordinate system; accuracy: ±15° at 1 σ.
- Age: the number of successive scans in which an object has been tracked.

Object classification. The recognized objects need to be classified as one of the following: pedestrian, AGV, manual forklift, electric tricycle, other object. The class should be supplemented by a classification quality indicating the confidence in the classification decision. The rate of correct classifications should exceed 90 %.

III. SYSTEM OVERVIEW

The architecture of the system is given in Figure 2. The sensor consists of two Gigabit Ethernet Manta G-419B NIR cameras and two fisheye lenses. The system uses a special pair of fisheye lenses [6], tailored to the requirements of the PAN-Robots project. The system is compatible with any fisheye lens, but the one described in [6] provides better specifications in the context of the project. The baseline between the cameras is 20 cm. We use an Arduino Uno microcontroller to synchronously trigger the cameras at a frequency of 10 Hz. Although the trigger is simultaneous for the two cameras, the computer receives the frames asynchronously and this might lead to an incorrect pairing of the frames. Section IV presents a method that correctly pairs the received frames. Once we receive the left and right fisheye images, we split them into three pairs of rectified images. The rectification module uses the intrinsic and extrinsic parameters of the cameras; the calibration module computes these parameters offline. Section V provides details about the calibration model and describes the rectification module. The stereo reconstruction algorithm presented in Section VI computes the 3D locations of the image points. An NVIDIA GeForce GTX 780 Ti GPU carries out the computationally intensive operations of the 3D reconstruction algorithm. The computed point clouds and the pairs of images are the inputs of the obstacle detection, tracking and classification module. In the current project phase, only the obstacle detection module is implemented; Sections VII and VIII describe it. This module exports the list of detected obstacles from the data processing layer to the application layer, which uses the obstacles for collision avoidance and local path planning [3], [19], [20].

Figure 1. The mounting position, range and field of view of the fisheye stereo sensor

Figure 2. The architecture of the omnidirectional stereo system

IV. SYNCHRONOUS IMAGE ACQUISITION

Industrial cameras for machine vision are available with many different interfaces: CameraLink, FireWire, Gigabit Ethernet (GigE) and USB 3.0. Long, flexible cables are required within the PAN-Robots project. FireWire/USB (with maximum lengths of only 4.5/5.0 m) and CameraLink (with longer but thicker, more rigid cables) did not suffice, so the GigE interface was chosen and Allied Vision Manta G-419B NIR cameras were used for the stereo system. GigE cameras can be synchronized using the Precision Time Protocol (PTP), but that requires an industrial Ethernet switch in addition to the Gigabit Ethernet boards. Instead of PTP, a common external trigger was supplied to both cameras to start the exposure. The trigger signal is generated periodically by an Arduino board. Using the output pins of the cameras, the time difference between the start of the exposures of the two images was measured with an oscilloscope to be about 1 microsecond.

The synchronous trigger ensures that the acquisition starts at the same time for both cameras. However, the transmission of the frames from the cameras to the computer might introduce errors. Some frames (left, right or both) might be dropped due to transmission errors. Usually, the right image arrives after the left image, or vice versa; sometimes two left images arrive and then two right images. Without an acquisition board and without PTP to synchronize the clocks of the two cameras, the following method was implemented in software to match the left/right images of a stereo pair. At acquisition start the clocks of both cameras are reset, so the first frame pair has to arrive with a timestamp difference of only a few milliseconds. Due to clock skew between the cameras, this difference grows over time; to match a stereo pair, the timestamp difference is therefore required not to change by more than 1 millisecond between successive matched frames. Another complication is that frames are sometimes missing from one or the other camera, or they might arrive out of order. With properly configured Ethernet boards this does not happen frequently (for example, only when the system is too busy to receive the Ethernet packets), but to make the acquisition robust to such events, the software searches a few previously received and unmatched frames to find the stereo pairs. In some rare cases the frames stop arriving indefinitely, so a watchdog thread resets the acquisition after a timeout. For maximum performance, a zero-copy solution was implemented: the buffers in which the images first arrive are passed directly to the rest of the application, those buffers are not reused until the application no longer needs them, and additional frame buffers are supplied to the acquisition driver as necessary to maintain the maximum frame rate.
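The pairing logic described above can be sketched as follows. The class and field names (for example `FramePairer` and `frame.timestamp`) are illustrative, not the project's actual code; the tolerances reflect the figures quoted in the text.

```python
from collections import deque

# Illustrative sketch of the left/right frame pairing described above.
# Frames are assumed to carry a `timestamp` field set by the camera at
# exposure start; both camera clocks are reset when acquisition starts.
# Out-of-order and missing frames are handled by keeping a small window
# of unmatched frames per camera.

MAX_DRIFT = 0.001   # allowed change of the L-R timestamp offset per pair [s]
FIRST_TOL = 0.005   # tolerance for the very first pair (clocks just reset) [s]
WINDOW = 5          # number of unmatched frames kept per camera

class FramePairer:
    def __init__(self):
        self.left = deque(maxlen=WINDOW)
        self.right = deque(maxlen=WINDOW)
        self.offset = None          # last accepted (t_left - t_right)

    def push(self, camera, frame):
        """Add a frame from camera 'L' or 'R'; return a stereo pair if found."""
        (self.left if camera == "L" else self.right).append(frame)
        ref = 0.0 if self.offset is None else self.offset
        tol = FIRST_TOL if self.offset is None else MAX_DRIFT
        for l in list(self.left):
            for r in list(self.right):
                diff = l.timestamp - r.timestamp
                if abs(diff - ref) <= tol:
                    self.offset = diff
                    self.left.remove(l)
                    self.right.remove(r)
                    return (l, r)    # a synchronized stereo pair
        return None                  # the mate has not arrived yet
```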

V. CAMERA CALIBRATION AND MULTI-CHANNEL RECTIFICATION

A. Fisheye lens description and calibration

The fisheye lens is an ultra-wide-angle lens and can provide fields of view of up to 185°, both horizontally and vertically. The images obtained with such a lens are highly distorted. The most popular are the equiangular fisheye lenses, for which the angular resolution provided by the lens is constant. They are characterized by the equality

r = f · α   (1)

where f is the focal length of the lens, α is the angle of the incoming ray and r is the projection of the ray on the imaging plane (see Figure 3). This relationship is theoretical; in reality there are deviations from this equality due to imperfections in lens production and assembly.

For the calibration of the lens we use the toolbox described in [4]. Kannala [4] derived a projection function that can cope with any type of ultra-wide-angle lens. The model is quite complex and depends on 23 parameters; it is briefly described below. It begins with a generalization of the theoretical projection function (1):

r(α) = k1·α + k2·α³ + k3·α⁵ + k4·α⁷ + k5·α⁹   (2)

This polynomial representation theoretically fits all ultra-wide-angle lenses. Real lenses also exhibit distortions, so this projection function alone does not suffice to characterize the lens; [4] therefore also uses a general distortion profile, given below. The radial distortion is characterized by the following formula:

Δr(α, φ) = (l1·α + l2·α³ + l3·α⁵)·(i1·cos φ + i2·sin φ + i3·cos 2φ + i4·sin 2φ)   (3)

Similarly, the tangential distortion is given by:

Δt(α, φ) = (m1·α + m2·α³ + m3·α⁵)·(j1·cos φ + j2·sin φ + j3·cos 2φ + j4·sin 2φ)   (4)

Including the distortions leads to the following projection function:

[x, y]ᵀ = F_mm(X, Y, Z) = r(α)·[cos φ, sin φ]ᵀ + Δr(α, φ)·[cos φ, sin φ]ᵀ + Δt(α, φ)·[−sin φ, cos φ]ᵀ   (5)

In this equality, (α, φ) are the angles of the incoming ray from the point P(X, Y, Z) and can be easily computed from the coordinates (X, Y, Z). The coordinates of P are given in mm, and p(x, y) is the projection of the point P onto the imager plane. Note that the coordinates (x, y) are still in mm; it remains to transform them into pixel coordinates. Let (u0, v0) be the principal point and let t_u and t_v be the number of pixels per mm in the horizontal and vertical directions, respectively. The final projection function is:

[u, v]ᵀ = F(X, Y, Z) = diag(t_u, t_v)·[x, y]ᵀ + [u0, v0]ᵀ   (6)

The projection function has 23 intrinsic parameters (k1-k5, t_u, t_v, u0, v0, l1-l3, i1-i4, m1-m3, j1-j4). We use the calibration procedure described in [4] and a checkerboard pattern to calculate these parameters. The calibration involves a non-linear optimization. Interested readers can find details about the calibration in [4].
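As an illustration of Eqs. (2)-(6), the sketch below implements the projection function F with numpy. The function and parameter names are ours, and the parameter values of the actual calibrated lens are not reproduced here.

```python
import numpy as np

# Minimal sketch of the 23-parameter projection model of Eqs. (2)-(6).
# k: 5 radial coefficients, l/i: radial distortion, m/j: tangential distortion,
# (tu, tv): pixels per mm, (u0, v0): principal point. Values are placeholders.

def project(P, k, l, i, m, j, tu, tv, u0, v0):
    """Project a 3D point P = (X, Y, Z) in camera coordinates (mm) to pixels."""
    X, Y, Z = P
    alpha = np.arctan2(np.hypot(X, Y), Z)            # angle to the optical axis
    phi = np.arctan2(Y, X)                           # azimuth of the ray
    a = np.array([alpha, alpha**3, alpha**5, alpha**7, alpha**9])
    r = np.dot(k, a)                                 # Eq. (2), radial projection
    trig = np.array([np.cos(phi), np.sin(phi),
                     np.cos(2 * phi), np.sin(2 * phi)])
    dr = np.dot(l, a[:3]) * np.dot(i, trig)          # Eq. (3), radial distortion
    dt = np.dot(m, a[:3]) * np.dot(j, trig)          # Eq. (4), tangential distortion
    ur = np.array([np.cos(phi), np.sin(phi)])        # radial unit vector
    ut = np.array([-np.sin(phi), np.cos(phi)])       # tangential unit vector
    x, y = (r + dr) * ur + dt * ut                   # Eq. (5), image plane [mm]
    return np.array([tu * x + u0, tv * y + v0])      # Eq. (6), pixel coordinates
```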

Figure 3. The projection model of the fisheye lens. The front, central and back virtual imagers used for rectification.

Besides the intrinsic parameters, the calibration procedure also provides the 3D coordinates of the checkerboard points, denoted CK_L and CK_R, in the left and right camera coordinate systems respectively. This helps in computing the rotation R and translation T of the right camera w.r.t. the left camera. The left and right coordinates are related by the following formula:

CK_L = R·CK_R + T   (7)

The rotation and the translation can be easily computed using the least squares method. In order to reach a canonical system and to generate rectified pairs of images, we find R_L, R_R and t = [‖T‖, 0, 0]ᵀ such that

R_L·CK_L = R_R·CK_R + t   (8)

This technique can be found in [5].
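The paper only states that R and T are obtained by least squares; one standard closed-form solution for Eq. (7) is the SVD-based (Kabsch) alignment of the two checkerboard point sets, sketched below under that assumption.

```python
import numpy as np

# Possible least-squares solution for R and T in Eq. (7): SVD-based rigid
# alignment of the checkerboard point sets (a standard technique, shown here
# for illustration; the paper does not detail its exact procedure).

def rigid_align(ck_r, ck_l):
    """Find R, T such that ck_l ≈ R @ ck_r + T. Inputs are 3xN point arrays."""
    mu_r = ck_r.mean(axis=1, keepdims=True)
    mu_l = ck_l.mean(axis=1, keepdims=True)
    H = (ck_r - mu_r) @ (ck_l - mu_l).T              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                               # proper rotation (det = +1)
    T = mu_l - R @ mu_r
    return R, T
```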


Figure 4. The division of the fisheye image into three channels. First column: the original left fisheye image and its front, central and back rectified images. Second column: the original right fisheye image and the disparities for each left-right pair of rectified images.

B. Multi-channel rectification

The rectification is a key operation in our system. Using this operation, the original highly distorted fisheye image is divided into three perspective images. We apply this division to both the left and the right fisheye image. We then apply a stereo reconstruction algorithm to the left-right pairs of rectified images and generate a single point cloud from all the 3D points produced by the stereo reconstruction. The obstacle detection module operates on this cloud of 3D points.

In order to generate the rectified images, we create a front, a central and a back virtual pinhole imager, as depicted in Figure 3. The imagers are in fact rectangles in 3D space, defined relative to the optical center of the objective. Suppose we want to obtain a rectified image I^L_front of N × M pixels from the front imager of the left camera. We denote the corners of the front imager by V1, V2, V3 and V4 and split the imager into a uniformly sampled grid of size N × M. The point P_i,j(X, Y, Z) at location (i, j) is a linear combination of the points V1, V2, V3, V4:

A_i = ((N − i)·V1 + i·V2)/N
B_i = ((N − i)·V3 + i·V4)/N
P_i,j = ((M − j)·A_i + j·B_i)/M   (9)

The intensity of the pixel (i, j) in the rectified image is recovered by means of the projection function F. We also take into account the rotations R_L and R_R, which transform the fisheye system into a canonical system. We have the following relationship:

I^X_front(i, j) = I_fisheye(F(R_Xᵀ·P_i,j)),  X ∈ {L, R}   (10)

where I_fisheye represents the original fisheye image. In a similar manner we calculate the rectified images I^X_central and I^X_back, X ∈ {L, R}. This results in three rectified left images and three rectified right images. Examples of such images are depicted in Figure 4.
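A minimal sketch of how one rectified channel can be generated from Eqs. (9)-(10) is given below. It reuses the `project` function from the previous sketch and uses nearest-neighbour sampling; the interpolation scheme and looping strategy of the real implementation are not specified in the paper.

```python
import numpy as np

# Sketch of multi-channel rectification (Eqs. (9)-(10)). V1..V4 are the 3D
# corners of one virtual imager (front, central or back), R_x is the rectifying
# rotation R_L or R_R, and cam_params holds the 23 intrinsic parameters
# (k, l, i, m, j, tu, tv, u0, v0) expected by `project`.

def rectify_channel(fisheye, V1, V2, V3, V4, R_x, N, M, cam_params):
    """Build an N x M rectified image by sampling the fisheye image."""
    out = np.zeros((N, M), dtype=fisheye.dtype)
    for ii in range(N):
        A = ((N - ii) * V1 + ii * V2) / N            # Eq. (9)
        B = ((N - ii) * V3 + ii * V4) / N
        for jj in range(M):
            P = ((M - jj) * A + jj * B) / M          # 3D point on the imager
            u, v = project(R_x.T @ P, **cam_params)  # Eq. (10)
            ui, vi = int(round(u)), int(round(v))    # nearest-neighbour lookup
            if 0 <= vi < fisheye.shape[0] and 0 <= ui < fisheye.shape[1]:
                out[ii, jj] = fisheye[vi, ui]
    return out
```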

VI. STEREO RECONSTRUCTION ALGORITHM

We apply the stereo reconstruction algorithm described in [18] to each left-right rectified pair. An example disparity map is shown in Figure 4. Once the disparity of each channel (front, central, back) is available, we calculate the 3D positions of the pixels for all the channels. We take into account the rotation of each virtual imager w.r.t. the physical imager; in our rectification R_front = rot(90°), R_central = I3 and R_back = rot(−90°). All the computed 3D points reside in the AGV's coordinate system.

The stereo matching algorithm [18] operates as follows. A sparse census transform of the images is computed in a 9x9 window. The left and right pixels are compared using the Hamming distance and the results are aggregated over a 5x5 window. Once the cost volume for all disparities is computed, semi-global matching (SGM) is applied to minimize an energy function which assigns an adaptive penalty to large disparity discontinuities, depending on the image gradient at each pixel. A winner-takes-all (WTA) strategy with confidence filtering is employed for each pixel in the left image to find the best matching pixel in the right image and vice versa. Left-right consistency checking is performed to filter errors at occlusions. Finally, to estimate depth with sub-pixel accuracy, an interpolation function adapted to this particular stereo algorithm is used [23]. To achieve real-time performance, the algorithm was implemented in CUDA and runs on modern NVIDIA GPUs.
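To illustrate the data term only, the toy sketch below computes a dense census transform and the Hamming matching cost for a single disparity. It is a didactic simplification, not the sparse 9x9 census, 5x5 aggregation, SGM optimization or CUDA implementation of [18]; border wrap-around is ignored for brevity.

```python
import numpy as np

# Toy census transform (5x5 dense) and Hamming matching cost for one disparity.

def census(img, radius=2):
    """Per-pixel bit string: is each neighbour darker than the centre pixel?"""
    codes = np.zeros(img.shape, dtype=np.uint32)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            codes = (codes << np.uint32(1)) | (shifted < img)
    return codes

def matching_cost(left, right, d):
    """Hamming distance between left codes and right codes shifted by disparity d."""
    cl, cr = census(left), census(right)
    xor = cl ^ np.roll(cr, d, axis=1)                # align right pixel (x - d)
    bits = np.unpackbits(xor.view(np.uint8).reshape(*xor.shape, 4), axis=-1)
    return bits.sum(axis=-1)                         # popcount per pixel
```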

VII. INTERMEDIATE ENVIRONMENT REPRESENTATION WITH DIGITAL ELEVATION MAPS

In this section we propose a spatial modeling solution that maps the dense stereo data into a more compact representation. The dense stereo data consists of all the 3D points from all the rectified pairs of images. The proposed method can be decomposed into three main steps. In the first step, each 3D point is mapped into the digital elevation map (DEM) grid by scaling its lateral X and longitudinal Z coordinates; the number of points that fall in the same cell defines a density value. Next, the raw density values are used to compute a density map by incorporating the stereo uncertainties. The second step consists in estimating the ground plane model by using a RANSAC approach and a least-squares refinement. In the last step, we use the estimated ground plane model and the point densities to classify the measurements as ground cells or obstacle cells. Next we detail each of these steps.


A. Mapping 3D points into DEM

The Digital Elevation Map space is represented by a 2D grid consisting of discrete 10 cm x 10 cm cells. Each cell m_xz is described by its position (x, z), density value d_xz, height h_xz and type c_xz (object or ground):

m_xz = (x, z, h_xz, d_xz, c_xz)   (11)

where c_xz ∈ {Object, Ground}. For each 3D point P(X, Y, Z), the corresponding grid cell position (x, z) is computed by scaling down its lateral X and longitudinal Z coordinates. As the result of the mapping step, each grid cell stores the number of associated 3D points N_xz, a minimum height value Y_min and a maximum height value Y_max. The minimum height values Y_min are used as the point hypotheses for the ground-plane model computation, while the maximum height values Y_max describe the elevation of the associated grid cell. However, due to errors introduced by the sensorial system, especially in the matching step, the reconstructed 3D information may be inaccurate, and the inaccurate 3D data may in turn produce erroneous grid mapping results. Therefore, we first need to define an error model and then estimate the DEM density values by taking into account the introduced stereo uncertainties. An example of the obtained density map is shown in Figure 5.
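A minimal sketch of this mapping step is given below. The cell size follows the 10 cm x 10 cm grid of the paper, while the grid extents (taken from the area of interest quoted in Section IX) and the variable names are illustrative assumptions.

```python
import numpy as np

# Sketch of the DEM mapping step: bin the unified 3D point cloud into
# 10 cm x 10 cm cells and keep, per cell, the raw density N_xz and the
# minimum/maximum heights Y_min / Y_max.

CELL = 0.10                      # cell size [m]
X_RANGE = (-5.0, 5.0)            # lateral extent of the grid [m] (assumed)
Z_RANGE = (-8.0, 11.0)           # longitudinal extent of the grid [m] (assumed)

def build_dem(points):
    """points: (N, 3) array of (X, Y, Z) in the AGV coordinate system."""
    nx = int((X_RANGE[1] - X_RANGE[0]) / CELL)
    nz = int((Z_RANGE[1] - Z_RANGE[0]) / CELL)
    count = np.zeros((nx, nz), dtype=np.int32)
    y_min = np.full((nx, nz), np.inf)
    y_max = np.full((nx, nz), -np.inf)
    for X, Y, Z in points:
        x = int((X - X_RANGE[0]) / CELL)             # scaled lateral coordinate
        z = int((Z - Z_RANGE[0]) / CELL)             # scaled longitudinal coordinate
        if 0 <= x < nx and 0 <= z < nz:
            count[x, z] += 1                         # raw density value N_xz
            y_min[x, z] = min(y_min[x, z], Y)        # ground-plane hypothesis
            y_max[x, z] = max(y_max[x, z], Y)        # cell elevation
    return count, y_min, y_max
```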

Figure 5. Computing the DEM density values. a) The left camera image. b) The raw density values obtained by counting the number N_xz of associated 3D points for each DEM cell m_xz. c) The error standard deviation values expressed as gray intensities; the error values increase quadratically with the depth along the optical axis. d) The densities obtained by taking into account the probability distributions of all neighboring measurements.

B. Ground Plane Model Estimation

In order to detect the ground plane we use a RANSAC approach combined with a least-squares refinement step. Instead of including the whole 3D point cloud as input to the algorithm, we select a smaller subset by considering the 3D points with the minimum height values Y_min. These values are computed for each DEM cell in the mapping phase. Each RANSAC iteration is described by the following main steps:

Step 1: Sampling. This step consists in selecting a random subset of three non-collinear points. The three points represent the hypothetical inliers and are used to calculate the parameters of the corresponding plane model.

Step 2: Estimate the consensus set. The distance from all 3D observations to the candidate plane model is computed. The points with a distance less than a given threshold are selected as inliers. These points form the consensus set.

Step 3: Testing. In this step the obtained number of inliers is compared with the number of inliers from the previous iteration. The consensus set with a better score is saved.

Figure 6. a) Left grayscale image. b) The Elevation Map projected on the left image; the points are classified as Ground (blue) or Object (red). c) The 3D view including the 3D points, the Elevation Map, the detected ground surface represented as a white grid and the detected objects represented as oriented 3D boxes.

Having the resulting consensus set {p_i(x_i, y_i, z_i) | i = 1..N}, we estimate a refined plane model by applying a least-squares minimization. An example of the estimated ground surface is illustrated in Figure 6.
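The three RANSAC steps and the refinement can be sketched as follows; the iteration count, inlier threshold and the SVD-based (total least-squares) refinement are illustrative choices, not the exact settings of the paper.

```python
import numpy as np

# Sketch of the RANSAC ground-plane estimation of Section VII.B.

def fit_plane_ransac(pts, n_iter=200, thresh=0.05, seed=0):
    """pts: (N, 3) array of the Y_min cell points. Returns (n, d), n·p + d = 0."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iter):
        sample = pts[rng.choice(len(pts), 3, replace=False)]   # Step 1: sampling
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                                           # collinear sample
        n /= norm
        d = -n @ sample[0]
        inliers = np.abs(pts @ n + d) < thresh                 # Step 2: consensus set
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers                             # Step 3: keep best set
    # Least-squares refinement on the best consensus set (plane through the
    # centroid, normal = smallest singular vector of the centred inliers).
    P = pts[best_inliers]
    c = P.mean(axis=0)
    n = np.linalg.svd(P - c)[2][-1]
    return n, -n @ c
```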

C. 3D Point Classification

Having the estimated ground plane, we classify all 3D points {p_i(x_i, y_i, z_i) | i = 1..N} as Objects or Ground. The points with a distance to the ground plane smaller than a given threshold are classified as Ground; otherwise the points are classified as Objects. The same classification method is performed at the DEM cell level: each DEM cell is classified as belonging either to the ground surface or to an obstacle, based on the point-to-plane distance.


VIII. EXTRACTING OBSTACLES FROM DEM INFORMATION

Obstacle Detection by Grouping DEM Cells

This method consists in grouping the DEM cells into individual clusters (blobs). The grouping is performed on the 2D grid, so that connected entities are determined based on a proximity criterion. Each individual blob is described by an oriented bounding box (see Figure 6), which can be used as primary information for subsequent processing tasks such as collision detection, tracking or path planning. A sketch of this grouping step is given below.
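The sketch uses 8-connectivity as the proximity criterion and a PCA-based oriented box per blob; the paper does not specify the exact proximity criterion or box-fitting method, so these are assumptions.

```python
import numpy as np
from collections import deque

# Sketch of obstacle extraction: connected-component labelling of obstacle
# cells on the 2D DEM grid, followed by an oriented bounding box per blob.

def group_obstacle_cells(is_obstacle):
    """is_obstacle: 2D boolean grid. Returns a list of (centre, axes, extents)."""
    labels = -np.ones(is_obstacle.shape, dtype=int)
    blobs = []
    for seed in zip(*np.nonzero(is_obstacle)):
        if labels[seed] != -1:
            continue
        cells, queue = [], deque([seed])
        labels[seed] = len(blobs)
        while queue:                                 # flood fill one blob
            x, z = queue.popleft()
            cells.append((x, z))
            for dx in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    n = (x + dx, z + dz)
                    if (0 <= n[0] < is_obstacle.shape[0]
                            and 0 <= n[1] < is_obstacle.shape[1]
                            and is_obstacle[n] and labels[n] == -1):
                        labels[n] = len(blobs)
                        queue.append(n)
        pts = np.array(cells, dtype=float)
        centre = pts.mean(axis=0)
        axes = np.linalg.svd(pts - centre)[2]        # principal directions
        extents = np.ptp((pts - centre) @ axes.T, axis=0)   # oriented box size
        blobs.append((centre, axes, extents))
    return blobs
```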

IX. RESULTS AND EVALUATIONS

The preliminary functionality of the system is obstacle detection and localization, and this functionality was considered in this test. A grid of 5x4 global positions was measured and marked on the ground. The positions were measured using the existing AGV localization system, which provides an accuracy of ±1 cm [21], [22]. Pallets and boxes were installed at the selected locations and the size of each obstacle was manually measured; the height of the objects varies from 40 cm to 3 m. After all the objects were installed, the AGV started to navigate among them. The AGV sends its location to the 3D Vision ECU at a frequency of 10 Hz.

The software module installed on the 3D Vision ECU detects the objects in the area of interest and their 3D positions. The area of interest around the AGV is a rectangular region of 19 m x 10 m: 11 m in front of the AGV, 8 m behind it and 5 m on the left/right side. Using the current AGV global position, the system computes the positions of the ground-truth objects in the AGV's local coordinate system. The detected objects are then matched and compared with the ground-truth objects. Figure 7 depicts a sample from the evaluation of the system and Table II shows the average measurement errors. The average detection rate is 76%. The distance between the correct object position and the estimated position is 22 cm on average. The width, height and length are estimated with average errors of 26 cm, 11 cm and 32 cm, respectively. The cumulated number of ground-truth objects over all the frames is 10684.

TABLE II SUMMARY OF 3D PERCEPTION SYSTEM EVALUATION

detection rate          76 %
mean localization error 22 cm
mean width error        26 cm
mean height error       11 cm
mean length error       32 cm
ground-truth objects    10684
missing objects         2551
number of frames        1418

Figure 7. Sample from the evaluation of the omnidirectional stereo sensor. Green: detected ground-truth object; cyan: the ground-truth object as it was detected by the sensor; red: undetected ground-truth object.

X. REFERENCES

[1] http://www.pan-robots.eu/
[2] PAN-Robots – Plug And Navigate Robots for Smart Factories: User Needs and Requirements, 2013.
[3] PAN-Robots – Plug And Navigate Robots for Smart Factories: Specification and Architecture, 2013.
[4] J. Kannala and S. S. Brandt, "A generic camera model and calibration method for conventional, wide-angle and fish-eye lenses," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, Aug. 2006.
[5] http://www.vision.caltech.edu/bouguetj/calib_doc/
[6] A. Mäyrä, M. Aikio, and M. Kumpulainen, "Fisheye optics for omnidirectional perception," in Proc. IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), 2014.
[7] C. Rabe, U. Franke, and S. Gehrig, "Fast detection of moving objects in complex scenarios," in Proc. IEEE Intelligent Vehicles Symposium, June 2007, pp. 398-403.
[8] R. Labayrade, D. Aubert, and J.-P. Tarel, "Real time obstacle detection in stereovision on non flat road geometry through 'v-disparity' representation," in Proc. IEEE Intelligent Vehicle Symposium, June 2002, vol. 2, pp. 646-651.
[9] D. Pfeiffer and U. Franke, "Efficient representation of traffic scenes by means of dynamic stixels," in Proc. IEEE Intelligent Vehicles Symposium (IV), June 2010, pp. 217-224.
[10] T.-N. Nguyen, B. Michaelis, A. Al-Hamadi, M. Tornow, and M. Meinecke, "Stereo-camera-based urban environment perception using occupancy grid and object tracking," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 1, pp. 154-165, March 2012.
[11] H. Lategahn, W. Derendarz, T. Graf, B. Kitt, and J. Effertz, "Occupancy grid computation from dense stereo and sparse structure and motion points for automotive applications," in Proc. IEEE Intelligent Vehicles Symposium (IV), June 2010, pp. 819-824.
[12] F. Oniga and S. Nedevschi, "Processing dense stereo data using elevation maps: road surface, traffic isle and obstacle detection," IEEE Transactions on Vehicular Technology, 2010.
[13] A. Elfes, "Sonar-based real-world mapping and navigation," IEEE Journal of Robotics and Automation, vol. RA-3, no. 3, pp. 249-265, June 1987.
[14] D. Gallup, J.-M. Frahm, P. Mordohai, and M. Pollefeys, "Variable baseline/resolution stereo," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2008, pp. 1-8.
[15] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.
[16] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381-395, 1981.
[17] A. Vatavu, S. Nedevschi, and F. Oniga, "Real time object delimiters extraction for environment representation in driving scenarios," in Proc. ICINCO-RA 2009, Milano, Italy, 2009, pp. 86-93.
[18] I. Haller and S. Nedevschi, "GPU optimization of the SGM stereo algorithm," in Proc. IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Aug. 2010, pp. 197-202.
[19] L. Sabattini, V. Digani, C. Secchi, G. Cotena, D. Ronzoni, M. Foppoli, and F. Oleari, "Technological roadmap to boost the introduction of AGVs in industrial applications," in Proc. IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Sept. 2013, pp. 203-208.
[20] V. Digani, L. Sabattini, C. Secchi, and C. Fantuzzi, "Towards decentralized coordination of multi robot systems in industrial environments: a hierarchical traffic control strategy," in Proc. IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Sept. 2013, pp. 209-215.
[21] C. Reinke and P. Beinschob, "Strategies for contour-based self-localization in large-scale modern warehouses," in Proc. IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Sept. 2013, pp. 223-227.
[22] P. Beinschob and C. Reinke, "Strategies for 3D data acquisition and mapping in large-scale modern warehouses," in Proc. IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Sept. 2013, pp. 229-234.
[23] I. Haller and S. Nedevschi, "Design of interpolation functions for subpixel-accuracy stereo-vision systems," IEEE Transactions on Image Processing, vol. 21, no. 2, pp. 889-898, 2012.