


IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 8, NO. 4, DECEMBER 2007

On Color-, Infrared-, and Multimodal-Stereo Approaches to Pedestrian Detection

Stephen J. Krotosky and Mohan Manubhai Trivedi

Abstract—This paper presents an analysis of color-, infrared-, and multimodal-stereo approaches to pedestrian detection. We design a four-camera experimental testbed consisting of two color and two infrared cameras for capturing and analyzing various configuration permutations for pedestrian detection. We incorporate this four-camera system in a test vehicle and conduct comparative experiments of stereo-based approaches to obstacle detection using unimodal color and infrared imageries. A detailed analysis of the color and infrared features used to classify detected obstacles into pedestrian regions is used to motivate the development of a multimodal solution to pedestrian detection. We propose a multimodal trifocal framework consisting of a stereo pair of color cameras coupled with an infrared camera. We use this framework to combine multimodal-image features for pedestrian detection and to demonstrate that the detection performance is significantly higher when color, disparity, and infrared features are used together. This result motivates experiments and discussion toward achieving multimodal-feature combination using a single color and a single infrared camera arranged in a cross-spectral stereo pair. We demonstrate an approach to registering multiple objects across modalities and provide an experimental analysis that highlights issues and challenges of pursuing the cross-spectral approach to multimodal and multiperspective pedestrian analysis.

Index Terms—Active safety, collision avoidance, intelligent vehicles, person detection, tracking.

I. INTRODUCTION

PEDESTRIAN safety is a problem of global significance. Of the 1.17 million yearly worldwide traffic fatalities, 65% are pedestrian-related [1]. In fully industrialized nations, pedestrian safety remains a high priority, with pedestrian fatalities accounting for 10.9% of all traffic deaths in the United States [2] and fatalities in Britain twice as likely for pedestrians as for vehicle occupants [3]. In rapidly industrializing countries, pedestrian fatalities are overwhelmingly more costly in both proportion and sheer volume. Studies have found that pedestrian fatalities accounted for over half of all traffic deaths in both China [4] and India [5]. Naturally, an issue of this impact has received significant attention from all aspects of the research community. Ongoing computer-vision research is making strides to detect and to track pedestrians from both moving vehicles and transportation infrastructure. These approaches to pedestrian detection use visual or infrared imagery [6] in both monocular and stereo-camera configurations.

Manuscript received January 15, 2007; revised April 23, 2007 and June 21, 2007. This work was supported in part by the Technical Support Working Group and in part by the U.C. Discovery Grant. The Associate Editor for this paper was U. Nunes.

The authors are with the Computer Vision and Robotics Research Laboratory, University of California, San Diego, CA 92093-0434 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TITS.2007.908722

The choice of visual or infrared imagery is significant, as each provides disparate, yet complementary information about a scene. Visual cameras capture the reflective-light properties of objects in a scene, whereas infrared cameras are sensitive to the thermal emissivity properties of the same objects. Features extracted from each modality can be used to determine the presence of pedestrians in a scene. Additionally, their combination can provide a level of feature robustness beyond what is readily obtained from a single camera type. Multiple-camera systems have also been incorporated into pedestrian detection in order to extract depth estimates, which are crucial to the tasks of collision mitigation and occlusion handling. In order to register unimodal stereo imagery, correspondence-matching techniques [7] are often sufficient. However, in a multimodal multiperspective system, the different appearance of objects in the visual and infrared imagery makes finding a robust correspondence technique challenging [8].

This paper presents research toward the development of a multimodal multiperspective system that can extract the features that are necessary for robust pedestrian detection. We design an experimental testbed consisting of two color and two infrared cameras for comparing multicamera approaches to pedestrian detection. We perform comparative experiments of stereo-based detection approaches using unimodal imagery, demonstrate the high obstacle-detection rate achievable with both color and infrared imageries, and analyze the features and properties of the color and infrared imageries that are useful in classifying the detected obstacles into pedestrian regions.

From this analysis, we propose a multimodal trifocal framework consisting of a stereo pair of color cameras coupled with a single infrared camera. Using a calibrated three-camera setup allows accurate and robust registration of color, disparity, and infrared features using the properties of the trifocal tensor. We demonstrate that the combination of color, disparity, and infrared information can yield significant gains in pedestrian detection compared with detectors trained on only unimodal or stereo features. This result motivates experiments and discussion of a cross-spectral stereo framework for pedestrian detection. Using a single color and a single infrared camera arranged in a stereo pair, we demonstrate an approach to registering color and infrared features and discuss the issues and challenges of pursuing the cross-spectral framework for multimodal and multiperspective pedestrian analysis.



II. RELATED RESEARCH

Our focus on pedestrian detection is concerned with the methodologies and challenges of conventional camera systems. Specifically, we review studies that utilize color and infrared imageries in single- and multicamera configurations. For a more comprehensive review of computer-vision-based approaches to pedestrian detection, we refer the reader to a recent survey paper by Gandhi and Trivedi [9].

Typically, to find pedestrians in crowded and varied scenes with a single camera, a trained set of features used to identify pedestrian regions is extracted. In color imagery, common features include Haar wavelet [10] or Gabor filter [11] responses, component-based gradient responses [12], image contours with mean-field models [13], implicit shape models [14], and local receptive fields [15].

Similarly, features are extracted in monocular infrared approaches. Typically, the features extracted from the infrared imagery are selected for their relation to the unique thermal signature of humans, which enables straightforward segmentation. Such features include thermal hotspots [16], body-model templates [17], shape-independent multidimensional histograms, inertial and contrast-based features [18], and histograms of oriented gradients [19].

The features extracted from monocular imagery are then typically used in a classification scheme trained on many positive and negative examples. The most common approach to classification is to use a support vector machine (SVM) [10], [12], [13], [15], [16], [19], [20]. Additional approaches to classification include template matching [17], [21], convolutional neural networks [22], and Chamfer distance matching [14].

While good pedestrian detection in monocular imagery can be achieved, a single-camera approach is limited in one critical area: accurate and reliable depth estimation. To achieve this, a multicamera system is necessary, typically arranged in a stereo-vision configuration. Visual stereo-camera systems [23]–[25] have utilized dense-stereo matching to identify candidate pedestrian regions and to determine their distance from the camera. Infrared-stereo-camera systems have followed, which combine the benefits of infrared features with the powerful depth estimation inherent in stereo vision [21], [26]. Additionally, a four-camera system that separately combines color-stereo and infrared-stereo subsystems has been investigated [27]. In typical stereo approaches to pedestrian detection, depth estimates yield an initial set of obstacle regions that can then be classified as pedestrians using monocular-image features.

III. STEREO-BASED PEDESTRIAN DETECTION

A fundamental step in analyzing pedestrians in stereo imagery is to detect obstacles and to localize their position in 3-D space. We adapt a classical approach to obstacle detection in stereo imagery proposed by Labayrade et al. [28], which utilizes the concept of v-disparity to identify obstacles in a scene. The v-disparity is a histogram of the stereo disparity image that accumulates the disparity values present in each row of the image. This histogram has been shown to be useful in identifying obstacles when the camera is relatively parallel to the imaged scene so that objects appear at distinct planes in the disparity domain [24], [25], [27].

Fig. 1. Flowchart of the stereo disparity-based obstacle-detection algorithm.

A. Disparity-Based Obstacle Detection

Our goal is to provide a comparative analysis of color-stereo and infrared-stereo imageries for pedestrian detection. We use the v-disparity approach to obstacle detection so that it can be implemented for both color-stereo and infrared-stereo imageries without modification. We examine each approach's ability to generate robust stereo disparities for determining obstacle areas in a scene. This comparison of low-level detection accuracy will lead to an evaluation of each camera type's potential for higher level pedestrian classification and analysis. Fig. 1 shows a flowchart of the obstacle-detection algorithm.

1) Dense-Stereo Matching: We first perform dense-stereo matching to yield disparity estimates of the imaged scene. We select the correspondence-matching algorithm by Konolige [29] for its ease of use and reliable disparity generation for both color-stereo and infrared-stereo imageries. Example disparity images from each approach are shown in Fig. 2.

2) u- and v-Disparity Image Generation: The u- and v-disparity images are histograms that bin the disparity values d for each column or row in the image, respectively. The resulting v-disparity histogram image indicates the density of disparities for each image row v, whereas the u-disparity image shows the density of disparities for each image column u. Fig. 3 shows examples of u-disparity images, and Fig. 4 shows the corresponding v-disparity images generated from the color-stereo and infrared-stereo disparity maps in Fig. 2.


Fig. 2. Example disparity images from color- and infrared-stereo images. (a) Color. (b) Infrared.

Fig. 3. Example u-disparity images from color- and infrared-stereo images. (a) Color. (b) Infrared.

Fig. 4. Example v-disparity images from color- and infrared-stereo images along with the detected ground plane. (a) Color with ground plane. (b) Infrared with ground plane.

Notice that the u-disparity images in Fig. 3 show three distinct horizontal regions corresponding to the three pedestrians in the scene. It is these regions that we wish to detect in order to build candidate pedestrian areas. The region spanning the entire length at the top of the u-disparity image indicates the background plane and can be filtered from processing. Similarly, the v-disparity images in Fig. 4 show vertical peaks of high density for both the background plane and the range of disparities containing pedestrians. These regions also need to be detected to build pedestrian candidates. Additionally, the downward-sloping trend across the rows of the v-disparity image is exploited to estimate the ground plane in the scene [28].

3) Ground-Plane Estimation: To estimate the ground plane, we extract candidate points in the v-disparity image. For each column corresponding to a disparity d in the v-disparity image, we select the lowest pixel location whose value is above a threshold as a candidate ground-plane point.

Fig. 5. ROI generation in u- and v-disparity images with color- and infrared-stereo images. (a) Color u-disparity. (b) Infrared u-disparity. (c) Color v-disparity. (d) Infrared v-disparity.

Fig. 6. Bounding-box candidates with color- and infrared-stereo images. (a) Color. (b) Infrared.

The ground plane is estimated by fitting these candidate points to a line with a robust linear-regression scheme that uses weighted least squares, iteratively reweighted with the bisquare weighting function. Fig. 5(b) and (d) shows the v-disparity images for color-stereo and infrared-stereo imageries with the candidate ground-plane points in red and the fitted ground-plane estimate plotted in cyan. Using dense stereo with robust point-candidate generation and iterative line fitting, we obtain robust ground-plane estimates in both color- and infrared-stereo imageries.
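A compact sketch of such a robust fit is given below, assuming the candidate (disparity, row) points have already been extracted. The bisquare tuning constant c = 4.685 and the MAD-based scale estimate are standard choices from the robust-statistics literature, not values reported in the paper.

```python
import numpy as np

def fit_ground_plane(d, v, n_iter=10, c=4.685):
    """Fit v = a*d + b to candidate ground points by iteratively
    reweighted least squares with Tukey's bisquare weights."""
    d = np.asarray(d, dtype=float)
    v = np.asarray(v, dtype=float)
    w = np.ones_like(d)
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        A = np.stack([d, np.ones_like(d)], axis=1)
        W = np.diag(w)
        # Weighted least squares: (A^T W A) x = A^T W v
        a, b = np.linalg.solve(A.T @ W @ A, A.T @ W @ v)
        r = v - (a * d + b)                        # residuals
        s = np.median(np.abs(r)) / 0.6745 + 1e-9   # robust scale (MAD)
        u = r / (c * s)
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)  # bisquare weights
    return a, b
```

Points far from the current line receive zero weight, so occasional obstacle pixels that leak into the candidate set do not corrupt the fitted ground line.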

4) Candidate-Bounding-Box Generation: Bounding-box candidates can be extracted from regions of interest (ROIs) in the u- and v-disparity images. The ROIs in the u-disparity image are extracted by scanning the rows of the image for continuous spans where the histogram value exceeds a given threshold. Fig. 5(a) and (b) overlays the extracted regions in green on the u-disparity images. The ROIs are extracted from the v-disparity image by selecting columns where the sum of the histogram values above the ground plane is greater than the threshold. The ROI spans from the ground plane to the highest point in the column that exceeds the given threshold. Fig. 5(c) and (d) shows the extracted regions in green on the v-disparity images.

The candidate bounding boxes are selected from the ROIs in the u- and v-disparity images based on their disparity values. For a given disparity d, the widths of the bounding boxes are determined by the ROIs found in the u-disparity image, and the heights are derived from the ROIs in the v-disparity image. Large bounding boxes associated with background regions are filtered, and the remaining candidates are shown in Fig. 6.
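Under the same assumptions as before (thresholded u-/v-disparity histograms and a fitted ground line), a simplified version of this box construction might look as follows; the threshold value and the helper names are hypothetical, and the background-box filtering is omitted for brevity.

```python
import numpy as np

def runs_above(row, thresh):
    """Return [(start, end)] column spans where row values exceed thresh."""
    mask = row > thresh
    spans, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(mask)))
    return spans

def candidate_boxes(u_disp, v_disp, ground, thresh=20):
    """Boxes per disparity d: widths from u-disparity runs, heights from
    the v-disparity column between its top response and the ground row.

    ground: callable mapping disparity d -> ground-plane image row."""
    boxes = []
    for d in range(u_disp.shape[0]):
        v_ground = int(ground(d))
        v_col = v_disp[:, d]
        if v_col[:v_ground].sum() <= thresh:      # no obstacle mass above ground
            continue
        above = np.nonzero(v_col[:v_ground] > thresh)[0]
        if len(above) == 0:
            continue
        v_top = above[0]
        for (u0, u1) in runs_above(u_disp[d], thresh):
            boxes.append((u0, v_top, u1, v_ground, d))  # (left, top, right, bottom, d)
    return boxes
```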


Fig. 7. Example of the final selection of pedestrian candidates after bounding-box merging with color- and infrared-stereo images. (a) Color. (b) Infrared.

5) Candidate Filtering and Merging: As shown in Fig. 6, there are often multiple overlapping candidate bounding boxes generated. This occurs because the disparities associated with a single pedestrian span a range of values, particularly as the pedestrian moves closer to the camera. We merge significantly overlapping candidates if the disparities associated with the bounding boxes are close. The final pedestrian-candidate bounding boxes are shown in Fig. 7. Notice how the overlapping candidates have merged into the correct bounding boxes corresponding to the pedestrians in the scene.
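A possible greedy interpretation of this merging rule is sketched below; the overlap and disparity thresholds are illustrative choices, as the paper does not report its exact criteria.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of boxes (left, top, right, bottom, d)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / max(area(a) + area(b) - inter, 1e-9)

def merge_candidates(boxes, iou_thresh=0.5, d_thresh=2):
    """Greedily merge boxes that overlap significantly and whose
    disparities are close, keeping the union of the merged extents."""
    merged = []
    for box in sorted(boxes, key=lambda b: -b[4]):   # nearest (largest d) first
        for i, m in enumerate(merged):
            if overlap_ratio(box, m) > iou_thresh and abs(box[4] - m[4]) <= d_thresh:
                merged[i] = (min(box[0], m[0]), min(box[1], m[1]),
                             max(box[2], m[2]), max(box[3], m[3]), m[4])
                break
        else:
            merged.append(box)
    return merged
```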

B. Experimental Framework and Testbed

We establish a framework for experimenting with and analyzing pedestrian-detection approaches to facilitate a direct side-by-side comparison of the data coming from color-stereo and infrared-stereo imageries. A custom rig was designed, consisting of a matched color-stereo pair and a matched infrared-stereo pair. The two pairs share identical baselines and are aligned in pitch, roll, and yaw to maximize the similarities in the field of view. Calibration data were obtained by illuminating a checkerboard pattern with high-intensity halogen bulbs so that the checks would be visible in both color and infrared imageries, and standard calibration techniques could be applied to obtain the intrinsic and extrinsic parameters of the cameras.

The calibrated rig was mounted on the grill of the Laboratory for Intelligent and Safe Automobiles (LISA)-P testbed [30], [31], a Volkswagen Passat equipped with the computing, power, and cabling requirements necessary to synchronously capture and save the four simultaneous camera streams. Fig. 8 shows the four-camera rig properly arranged and mounted on the LISA-P.

C. Experimental Analysis of Disparity-Based Obstacle Detection in Color- and Infrared-Stereo Imageries

Experiments were conducted in which multiple pedestrians walk in front of the LISA-P testbed with varying degrees of depth, complexity, and occlusion.

Fig. 8. Experimental testbed: Two color cameras and two infrared cameras arranged in stereo pairs and mounted to the front of the LISA-P testbed.

Fig. 9. Merge and miss errors from pedestrian-candidate generation. (a) Color merged. (b) IR merged. (c) Color missed. (d) IR missed.

To allow for direct comparison, color and infrared videos were captured synchronously and were analyzed using the disparity-based obstacle-detection algorithm in Section III-A. Successful detection was indicated by a bounding box that is correctly overlaid on a corresponding pedestrian region. If our merging process combined two separate pedestrian regions, we consider the detection correct, yet note it as a "merge error" [Fig. 9(a) and (b)]. We reason that errors associated with lack of sophistication in our chosen merging algorithm should not adversely affect the detection rate, as the desire is to evaluate the effectiveness of identifying pedestrian regions and not the robustness of the merging procedure. This is also a fair assessment for collision mitigation, as finding all the critical areas in the scene is given priority over discerning the merged bounding boxes. Therefore, false negatives were counted only if a pedestrian region was missed [Fig. 9(c) and (d)], and false positives were counted when a bounding box selected a nonpedestrian region. Still, had we incorporated the merge errors, the total detection rate would decrease by only 1% for the color and 1.4% for the infrared. Table I shows the compiled results of the comparative experiments, and Fig. 10 shows additional examples of detection.

IV. STEREO-BASED PEDESTRIAN-DETECTION ANALYSIS

Our comparative experiments in Section III with stereo-based pedestrian detection for the color and infrared imageries indicate a very high level of detection accuracy and a low false-positive rate in both modalities.


TABLE I
COMPARISON BETWEEN COLOR- AND INFRARED-STEREO IMAGERIES FOR DISPARITY-BASED OBSTACLE DETECTION

However, we provide a deeper analysis to help understand and evaluate the success of these experiments.

We note that the difference in the pedestrian counts in Table I comes from the position and view differences of the color-stereo and infrared-stereo cameras. As only pedestrians that are fully visible in the image are considered, there are frames where a pedestrian is only visible in one modality. However, given the high number of examples, the detection rates can be directly compared despite the different tallies.

The experiments yielded such a high rate of detection because the captured images did not include nonpedestrian obstacles, such as other vehicles or bicyclists, so any detected obstacle region is assumed to be a pedestrian. For our experiments, this assumption is appropriate, as we are interested in evaluating how color and infrared dense-stereo correspondences can be used in low-level pedestrian detection. In that respect, our experiments demonstrate that both achieve high rates of low-level obstacle detection, which is an imperative first step toward robust pedestrian detection and collision mitigation. However, in real-world driving scenarios, this is not sufficient for pedestrian detection. Detected obstacles can include a variety of objects found in common driving scenes other than pedestrians, and additional processing is necessary to filter the detected obstacles to identify pedestrians.

For example, bounds on pedestrian bounding-box features, such as size, disparity, and aspect ratio, can be learned or heuristically selected to filter out bounding boxes associated with other objects in the scene [27]. However, such size-based filtering techniques will have difficulty with nonpedestrian bounding boxes that fall within the selected bounds of pedestrian candidates. Additionally, the selection of appropriately robust bounds is a challenging task, as bounding-box sizes can vary significantly with changes in pedestrian pose and disparity fidelity. To achieve a more reliable detection of pedestrian candidates, it is necessary to use discriminant image features in a learning framework, such as those discussed in Section II.

While justification can be made for selecting either color or infrared features for pedestrian detection, a more interesting proposition is to use both, obtaining a much larger set of discriminant image features in a system that incorporates all features to improve detection. For example, the thermal "hotspots" of humans that often make pedestrians easily segmentable can be combined with the color-segmentation features common to challenging tasks, such as detecting articulated poses for classifying human interactions [32].

Although stereo color and infrared analyses can be separately combined [27], a more economical and desirable solution would be to combine the color, disparity, and infrared features in an integrated detection framework. In Section V, we propose a multimodal trifocal framework consisting of a stereo pair of color cameras coupled with a single infrared camera. Such a setup allows for accurate and robust registration of the color and infrared imageries using the trifocal tensor. We use this registration framework to design a pedestrian detector that integrates color, disparity, and infrared features and yields higher detection rates than using separate features.

In Section VI, we investigate the feasibility of this integrated detection framework using a minimum-camera cross-spectral stereo system with a single color and a single infrared camera. The challenge is to register image features in cross-spectral stereo, where conventional and state-of-the-art stereo-correspondence algorithms fail due to the disparate nature of the color and infrared imageries. As a step toward a dense-stereo algorithm for cross-spectral stereo imagery, we propose a stereo-registration algorithm for multimodal imagery [8], evaluate its applicability to pedestrian detection, and highlight the challenges of achieving robustness in this framework.

V. MULTIMODAL TRIFOCAL FRAMEWORK FOR PEDESTRIAN DETECTION

The benefits of color-, disparity-, and infrared-image features can be incorporated using a three-camera approach consisting of a standard color-stereo rig paired with a single infrared camera. The trifocal framework, shown in Fig. 11, uses disparity estimates from the stereo imagery to register corresponding pixels in the infrared imagery. This can be done quickly and efficiently with the trifocal tensor, the set of matrices relating the correspondences between the three images.

The trifocal tensor can be estimated by minimizing the algebraic error of point correspondences [33]. The point correspondences can be obtained for trifocal imagery using the same calibration techniques used for stereo calibration, where the calibration board is visible in each trifocal image. While only seven point–point–point correspondences are required to compute the trifocal tensor, in practice, we use many more correspondences to smooth errors in the point estimates. The resulting trifocal tensor is written as T = [T_1, T_2, T_3], where T_i is a 3 × 3 matrix for the ith image in the set. From this tensor notation, standard two-view geometry parameters, such as the fundamental matrices F, the epipoles e, and the projection matrices P, can be determined.

Additionally, given a point correspondence x′ ↔ x′′, we can estimate the point transfer to the third image point x as

$$[\mathbf{x}']_{\times} \Big( \sum_i x_i \mathbf{T}_i \Big) [\mathbf{x}'']_{\times} = \mathbf{0}_{3 \times 3}. \qquad (1)$$

The dense-stereo matching gives the x′ ↔ x′′ correspondences, and the infrared point transfer is estimated and aligned to the color reference image.
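One way to realize this transfer numerically is to note that (1) is linear in the unknown point x: stacking the nine scalar equations gives a 9 × 3 system whose null vector is the transferred homogeneous point. The sketch below implements this least-squares reading of (1); it is our own linear-algebraic formulation, not necessarily the authors' solver.

```python
import numpy as np

def skew(p):
    """3x3 skew-symmetric matrix such that skew(p) @ q = cross(p, q)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def transfer_point(T, xp, xpp):
    """Transfer a correspondence x' <-> x'' to the third image via eq. (1).

    T: (3, 3, 3) trifocal tensor, T[i] the 3x3 matrix T_i.
    xp, xpp: homogeneous points from the dense color-stereo match.
    Each entry of [x']_x (sum_i x_i T_i) [x'']_x = 0 is linear in x,
    so the nine equations are stacked and solved for x by SVD.
    """
    A = np.zeros((9, 3))
    for i in range(3):
        A[:, i] = (skew(xp) @ T[i] @ skew(xpp)).ravel()
    _, _, vt = np.linalg.svd(A)
    x = vt[-1]            # null vector = transferred homogeneous point
    return x / x[2]       # normalize to (u, v, 1)
```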


Fig. 10. Example of the final selection of pedestrian candidates with color- and infrared-stereo input images.

Fig. 11. Flowchart of the trifocal-tensor approach to pedestrian detection for the color-stereo and infrared framework.

Fig. 12. Registered color, disparity, and infrared imageries using the trifocal tensor. (a) Color. (b) Disparity. (c) Aligned infrared.

Fig. 12 shows an example set of the registered trifocal imagery.

A. Experimental Evaluation of Pedestrian Detection Using Color-, Disparity-, and Infrared-Image Features

To determine the effect of using multimodal features for pedestrian detection, we use the trifocal framework to register the color, disparity, and infrared imageries into a single five-channel multispectral image, allowing for the comparison of pedestrian detectors that make use of different combinations of image features. To train the detectors, positive pedestrian samples are manually annotated, and for each positive sample, ten negative samples are generated by moving the positive bounding box to a random nonoverlapping position in the image. All samples are resized to a common size (24 × 60 pixels), as shown in Fig. 13.

We elect to extract histogram-of-oriented-gradient features similar to those proposed by Dalal and Triggs [34]. For each of the color, disparity, and infrared images, we compute an X × Y × Θ element histogram, where X, Y, and Θ are the numbers of histogram bins in width, height, and gradient orientation, respectively. For our experiments, we use a 4 × 4 × 8-element histogram, resulting in a 128-element feature vector for each image type. This descriptor was selected on the notion that gradient information can discriminate a pedestrian from other objects. While we make no claims of feature optimality, gradient-based features are common in the pedestrian-detection literature, and we feel that their use is sufficient for evaluating the effect of multispectral-image features on detection accuracy.
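A simplified version of such an X × Y × Θ gradient histogram (without the block normalization of the full Dalal–Triggs descriptor) can be written as follows; the magnitude weighting and the unsigned-orientation convention are our assumptions.

```python
import numpy as np

def grad_orientation_histogram(img, X=4, Y=4, T=8):
    """Coarse X x Y x Theta histogram of gradient orientations over a
    single-channel sample (128 values for the 4x4x8 case in the text)."""
    img = img.astype(float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # unsigned orientation [0, pi)
    h, w = img.shape
    hist = np.zeros((X, Y, T))
    xs = np.minimum((np.arange(w) * X) // w, X - 1)    # spatial bin per column
    ys = np.minimum((np.arange(h) * Y) // h, Y - 1)    # spatial bin per row
    ts = np.minimum((ang / np.pi * T).astype(int), T - 1)
    for v in range(h):
        for u in range(w):
            hist[xs[u], ys[v], ts[v, u]] += mag[v, u]  # magnitude-weighted vote
    return hist.ravel()                                # 128-element feature vector

# One 128-vector per modality (grayscale color, disparity, infrared) would be
# concatenated to form the joint feature for a 24x60 sample.
```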

We train pedestrian detectors for all combinations of the color, disparity, and infrared features using an SVM with a radial basis function as the kernel type [35]. We train each SVM using 865 annotated positive samples (and 8650 negative samples) collected from video obtained while driving the LISA-P testbed in store parking lots and local roads in La Jolla, California. Similarly, we evaluate using a test set of 641 positive samples and 6410 negative samples from a separate set of videos obtained while driving the LISA-P. Pedestrians in the training and testing sets range from approximately 3 to 30 m from the vehicle. The resulting receiver-operating-characteristic (ROC) curves are plotted in Fig. 14, and detection rates for a 5% false-positive rate are shown in Table II.
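For illustration, the training step might look like the sketch below. We use scikit-learn's SVC, which wraps the LIBSVM library cited in [35]; the placeholder features, labels, and default RBF hyperparameters are ours, since the paper does not report C or gamma.

```python
import numpy as np
from sklearn.svm import SVC

# X: (n_samples, n_features) stacked descriptors; y: 1 = pedestrian, 0 = background.
# A feature vector is the concatenation of the per-modality 128-element histograms,
# e.g., 384 features for color + disparity + infrared.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))     # placeholder features for illustration only
y = np.concatenate([np.ones(100), np.zeros(100)])

clf = SVC(kernel="rbf")             # RBF kernel, as in the paper
clf.fit(X, y)
scores = clf.decision_function(X)   # sweep a threshold on this to trace an ROC curve
```

Training one such classifier per feature combination (color, disparity, infrared, and their unions) and sweeping the decision threshold yields the ROC comparison described next.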

The pedestrian detector that combines the color, disparity, and infrared features outperforms the other detectors by a significant margin. By integrating the features, we exploit the complementary nature of multimodal imagery to yield more than a 5% increase in detection for a 5% false-positive rate. We also note that the combinations of color + infrared and color + disparity do not outperform the detector that is trained only on color. We suspect that this is because gradient-based features are not suitable for discriminating pedestrians in the low-contrast disparity and infrared images. This drop in performance is evident in the detectors that are trained only on disparity or infrared. Given the relatively low number of positive samples, the addition of only disparity or infrared seems only to add noise.


Fig. 13. Selection of positive and negative samples used for training pedestrian detectors. Each sample consists of color, disparity, and infrared images. (a) Positive samples. (b) Negative samples.

Fig. 14. ROC for pedestrian detection. The combination of color, disparity, and infrared features performs the best.

TABLE II
PEDESTRIAN-DETECTION RATE FOR 5% FALSE-POSITIVE RATE

It is then all the more interesting that the color + disparity + infrared-trained detector performs so well. The discriminant gains from combining all the features greatly outweigh the noise added from nonideal gradient features. We anticipate that greater gains in accuracy could be achieved by using more discriminant features in each image spectrum.

VI. CROSS-SPECTRAL STEREO-CORRESPONDENCE MATCHING FOR PEDESTRIAN DETECTION

The multimodal trifocal framework demonstrates the benefit of integrating the color, disparity, and infrared features for pedestrian detection. While an attractive framework, its requirement of two color cameras for the stereo-correspondence matching is redundant from a feature perspective. We investigate achieving the stereo-correspondence matching using cross-spectral stereo, i.e., a single color and a single infrared camera. While a cross-spectral stereo system has the potential to integrate the color, disparity, and infrared detail, the nontrivial problem of accurate and robust stereo registration must first be resolved.

Toward achieving this, we have developed an algorithm for matching regions in cross-spectral stereo images [8]. This approach gives a robust disparity estimation with statistical confidence values for images that have an initial object segmentation. Fig. 15 shows the algorithmic framework of the region-based stereo algorithm.

The acquired and rectified image pairs are denoted as I_L for the left color image and I_R for the right infrared image. Due to the large differences in imaging characteristics, the matching is focused on the foreground pixels from an initial segmentation estimate. To obtain the segmentation in a moving vehicle, we use an optical-flow-based approach to detect moving pedestrians in the scene [36]. Our experiments have shown that this approach is relatively robust at low speeds (< 10 mi/h) and could be adapted for higher speeds with egomotion estimation. Low-speed analysis is useful in a variety of driving scenarios, including parking lots, residential and shopping areas, and starting or stopping at a traffic signal. Additionally, while stationary pedestrians pose a segmentation issue for optical-flow techniques, we expect that static objects above the ground can be identified through long-term tracking of the scene.

Given the optical-flow estimates for motion in the horizontal m_u and vertical m_v directions, as well as occluded regions m_occ, we estimate the foreground regions F where there is motion in either the horizontal or vertical direction and no occlusion. Morphological operations smooth the estimate.

$$F = \big( (|m_u| > 0) \cup (|m_v| > 0) \big) \cap (m_{occ} = 0). \qquad (2)$$

We denote the color and infrared foreground images as F_L and F_R, respectively, which are shown in Fig. 16.
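A direct transcription of (2) with a simple morphological cleanup step might read as follows; the structuring-element sizes are our assumptions, as the paper does not specify its morphological operations.

```python
import numpy as np
from scipy import ndimage

def foreground_mask(mu, mv, mocc):
    """Foreground per eq. (2): motion in either direction and not occluded."""
    F = ((np.abs(mu) > 0) | (np.abs(mv) > 0)) & (mocc == 0)
    # Morphological smoothing of the raw estimate (sizes are illustrative).
    F = ndimage.binary_closing(F, structure=np.ones((5, 5)))
    F = ndimage.binary_opening(F, structure=np.ones((3, 3)))
    return F
```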


Fig. 15. Flowchart of region-based correspondence matching in cross-spectral stereo for pedestrian detection.

Fig. 16. Outlined foreground extraction for color and infrared images. (a) Color segmentation. (b) Infrared segmentation.

The color image is also converted to grayscale for mutual-information-based matching.

The matching is performed by fixing a window in one foreground image and sliding a correspondence window along the second image. Given the height h and width w of the image, for each column i ∈ {0, …, w}, let W_{L,i} be a reference window in the left image of height h* and width M. The width M is experimentally determined for a given scene and is typically less than the width of the target object in the scene. In our case, the value of M was 31 pixels. The height h* is the largest span of the foreground within the reference window. The correspondence window W_{R,i,d} in the right image also has height h* but is located at column i + d, where d is a disparity offset. For a given column i, a reference window is determined, and the correspondence values are found for all d ∈ {d_min, …, d_max}.

Given the two correspondence windows W_{L,i} and W_{R,i,d}, we first linearly quantize the image to N levels such that $N \approx \sqrt{M h^* / 8}$ [37], as this rule has been shown to determine the number of levels needed to give good results for maximizing the mutual information between image regions. The similarity between the two image patches can be measured by the mutual information between them, which is defined as

$$I(L,R) = \sum_{l,r} P_{L,R}(l,r) \log \frac{P_{L,R}(l,r)}{P_L(l)\,P_R(r)} \qquad (3)$$

where P_{L,R}(l,r) is the joint probability mass function (pmf), and P_L(l) and P_R(r) are the marginal pmfs of the left and right image patches, respectively. P_{L,R}(l,r) is computed as the normalized 2-D histogram of the image intensities, and the marginal probabilities are determined by summing along one dimension of this histogram.
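The mutual-information score of (3), including the quantization rule $N \approx \sqrt{M h^*/8}$, can be computed from a joint histogram as in the sketch below (our own minimal implementation).

```python
import numpy as np

def mutual_information(patch_l, patch_r, n_levels):
    """Mutual information of two equal-size patches per eq. (3).

    Patches are linearly quantized to n_levels gray levels, with
    n_levels chosen as roughly sqrt(M * h_star / 8) per [37].
    """
    def quantize(p):
        p = p.astype(float)
        rng = p.max() - p.min()
        q = ((p - p.min()) / (rng + 1e-9) * n_levels).astype(int)
        return np.minimum(q, n_levels - 1)

    l = quantize(patch_l).ravel()
    r = quantize(patch_r).ravel()
    joint = np.zeros((n_levels, n_levels))
    np.add.at(joint, (l, r), 1)
    joint /= joint.sum()                 # joint pmf P_LR
    pl = joint.sum(axis=1)               # marginal P_L
    pr = joint.sum(axis=0)               # marginal P_R
    nz = joint > 0                       # avoid log(0) terms
    return np.sum(joint[nz] * np.log(joint[nz] / (pl[:, None] * pr[None, :])[nz]))
```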

We define the mutual information between the two correspondence windows as I_{i,d}, where i is the center of the reference window and i + d is the center of the moving window. For each column i, we compute I_{i,d} for d ∈ {d_min, …, d_max}. We choose the best disparity d*_i as the one that maximizes the mutual information

$$d_i^* = \arg\max_d I_{i,d}. \qquad (4)$$

Fig. 17 shows example correspondence windows and a plot of the mutual information for the range of disparities. The red box in the color image is the reference window, and the green boxes in the infrared image are the candidate match windows.

We assign a vote for d*_i to all the foreground pixels in the reference window. Define a disparity voting matrix D_L of size (h, w, d_max − d_min + 1) over the range of disparities. Then, for each foreground pixel in a given reference window W_{L,i}, i.e., (u, v) ∈ (W_{L,i} ∩ F_L), we accumulate the disparity voting matrix at D_L(u, v, d*_i). Since the correspondence windows are M pixels wide, each column in the disparity voting matrix will have M votes. For each pixel (u, v) in the image, D_L can be thought of as a distribution of matching disparities from the correspondence windows. Since it is assumed that a single person is at a single distance from the camera, a good match should have a large number of votes for a single disparity value, whereas a poor match would be distributed across the range of disparity values. The best disparity value and its corresponding confidence at each pixel are then found as

$$D_L^*(u, v) = \arg\max_d D_L(u, v, d) \qquad (5)$$

$$C_L^*(u, v) = \max_d D_L(u, v, d). \qquad (6)$$

For a pixel (u, v), the value of C*_L(u, v) is the number of votes for the best disparity value D*_L(u, v). A higher confidence value indicates that the disparity maximized the mutual information for a large number of correspondence windows, and in turn, the disparity value is more likely to be accurate.
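The voting scheme behind (5) and (6) can be sketched as follows, assuming the per-column winning disparities d*_i from (4) have already been computed; the centered-window indexing is our reading of the paper's description.

```python
import numpy as np

def disparity_voting(best_d, fg, M, d_min, d_max):
    """Accumulate per-pixel disparity votes per eqs. (5) and (6).

    best_d[i]: winning disparity d*_i for the reference window at column i.
    fg: (h, w) boolean foreground mask; each window spans M columns.
    Returns D_star (best disparity) and C_star (vote count) per pixel.
    """
    h, w = fg.shape
    n_d = d_max - d_min + 1
    D = np.zeros((h, w, n_d), dtype=np.int32)
    half = M // 2
    for i in range(w):
        u0, u1 = max(0, i - half), min(w, i + half + 1)
        # Vote only on foreground pixels inside the reference window.
        D[:, u0:u1, best_d[i] - d_min] += fg[:, u0:u1]
    D_star = D.argmax(axis=2) + d_min   # eq. (5)
    C_star = D.max(axis=2)              # eq. (6)
    return D_star, C_star
```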


Fig. 17. Mutual information for finding corresponding windows in cross-spectral stereo imagery. (a) Color image. (b) Infrared image. (c) Mutual information for correspondence window.

Fig. 18. Resulting disparity image D* from combining the left and right disparity images D*_L and D*_R, as defined in (7). (a) Disparity image. (b) Unaligned. (c) Aligned.

Values for D*_R and C*_R are similarly determined by making the right image the reference. The values of D*_R and C*_R are then shifted by their disparities so that they align to the left image. The aligned disparity images are then combined using an AND operation, which experimentally gives the most robust results. For all pixels (u, v) such that C*_L(u, v) > 0 and C*_R(u, v) > 0

$$D^*(u, v) = \begin{cases} D_L^*(u, v), & C_L^*(u, v) \ge C_R^*(u, v) \\ D_R^*(u, v), & C_L^*(u, v) < C_R^*(u, v). \end{cases} \qquad (7)$$

The resulting disparity image D*(u, v) can be used to register multiple objects in the scene, even at very different depths from the camera. Fig. 18 shows the registration result for the images carried throughout the algorithmic derivation. Fig. 18(a)–(c) shows the disparity image D*, the initial alignment of the color and infrared images, and the alignment after shifting the foreground pixels by the resulting disparity image, respectively. The infrared foreground pixels are overlaid (in green) on the color foreground pixels (in purple). The cross-spectral stereo-correspondence matching successfully aligns the foreground areas of the three people in the scene.
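Equation (7), applied after the right-referenced maps have been shifted into the left frame, reduces to a few masked assignments; the sentinel value for unassigned pixels below is our own convention.

```python
import numpy as np

def combine_disparities(DL, CL, DR, CR):
    """Combine left- and right-referenced disparity maps per eq. (7).

    Assumes DR/CR have already been shifted into the left image's frame.
    Pixels where either confidence is zero are left unassigned (-1).
    """
    D = -np.ones_like(DL)
    valid = (CL > 0) & (CR > 0)           # AND of the aligned maps
    take_left = valid & (CL >= CR)
    take_right = valid & (CL < CR)
    D[take_left] = DL[take_left]
    D[take_right] = DR[take_right]
    return D
```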

A. Experimental Analysis of Cross-Spectral Stereo-Correspondence Matching for Pedestrian Detection

Using the same experiments performed in Section III-C, we analyze the cross-spectral stereo-correspondence matching of pedestrian regions in an outdoor environment. The goal was to demonstrate successful matching for various configurations of people in different positions, at different distances from the camera, and with different levels of occlusion. We evaluate the registration by visually inspecting the alignment of the corresponding color and infrared pedestrian regions. Visually well-aligned regions are considered correct, and misaligned, missing, or partially aligned regions are deemed incorrect.

TABLE III
CROSS-SPECTRAL STEREO REGISTRATION OF PEDESTRIAN REGIONS

Fig. 19. Cross-spectral stereo-registration results for pedestrian detection. (a) Color. (b) Infrared. (c) Unaligned. (d) Aligned.

Table III summarizes our experimental analysis, and Fig. 19 shows examples of correct correspondence matching. Additional experiments [38] demonstrate the approach's robustness to different capture devices and environmental conditions.


Fig. 20. Disparity-discontinuity errors in cross-spectral stereo analysis due to artifacting arising from windowed correspondence matching. (a) Color. (b) Infrared. (c) Disparity. (d) Aligned.

One challenge associated with this approach to cross-spectral stereo lies in the vertical artifacts from the multiple voting windows, which give the resulting registration hard vertical edges at disparity discontinuities. This is most evident when the inherent disparity discontinuity of occluding pedestrians is forced to a vertical edge, as shown in Fig. 20. Despite these artifacts, we still identify the two distinct obstacle regions. Additionally, incorporating subpixel interpolation would improve the registration, as the integer-based disparity matching of our approach can easily be off by a pixel in either direction of the correct match.

The initial segmentation, while necessary for the success of this algorithm, is limiting in several aspects. First, segmentation is challenging, and the result can often be noisy, easily over- or underestimating the true object boundaries. We motivated the initial segmentation as a way of providing appropriately sized regions for matching the features in the color and infrared imageries. However, the very idea of an initial segmentation precludes registration estimates for regions that are not within the segmentation boundaries. Clearly, a better approach would be to register features from the entire image. Achieving this is an open research challenge that we are actively pursuing. We feel that a multifeature-matching approach that can integrate structural feature matching, such as edges, with pixel- or area-based matching is promising.

VII. DISCUSSION AND CONCLUDING REMARKS

The depth estimates obtained from vehicle-mounted stereo imagery give rise to a v-disparity-based approach for extracting the obstacle regions from the scene. We have outlined such an algorithm and have provided comparative experiments indicating that color- and infrared-based stereo disparities are both capable of highly accurate pedestrian detection (> 98%) with low false positives (< 1%). Given these high detection rates, the selection of an appropriate camera system for pedestrian detection turns to each modality's ability to classify the detected obstacles as pedestrians. Because of the disparate physical processes that yield color and thermal images, extractable features are largely unique to each modality. As previous approaches have demonstrated that both color- and infrared-image features can be used for classifying pedestrians, we propose a multimodal trifocal framework that integrates color, depth, and infrared features for pedestrian detection.

The multimodal trifocal solution pairs a color-stereo rig with a single infrared camera to accurately register pixels in each image.

We use this framework to demonstrate that integrating color, disparity, and infrared features for training a pedestrian detector yields improved accuracy over detectors that utilize only unimodal or stereo features. From a cost-benefit perspective, we suggest that the multimodal trifocal framework is likely the best approach, as it can achieve the benefits of multimodality seen in higher camera-count solutions, yet maintains a robustness not yet seen in two-camera cross-spectral solutions. Future areas for investigation include a more extensive evaluation of the color, disparity, and infrared features. Additionally, an integrated object-candidate-generation and pedestrian-detection algorithm using the multimodal trifocal framework would be useful for evaluating robustness to various lighting and environmental conditions.

In cross-spectral stereo analysis, the disparate nature of the multimodal imagery that we hope to exploit in feature extraction makes correspondence matching challenging. We have established an object-level registration scheme for establishing correspondences and have experimentally demonstrated successful registration of object regions across the color and infrared imageries. The 87% registration rate shows the feasibility of creating a multimodal-feature set in a cross-spectral stereo framework. Although the initial-segmentation requirement places limits on the generality and robustness of the approach, we feel that this is a good first step toward the development of a cross-spectral stereo-correspondence algorithm that generates disparity images similar to those of conventional stereo algorithms for unimodal imagery. We believe that advancement may be obtained by exploring multiple-feature or hierarchical matching schemes that can integrate structural feature matching, such as edges, with pixel- or area-based matching.

These multimodal and multiperspective approaches provide insight into the overall active-safety paradigm. Pedestrian safety is one of the many aspects of the driving environment that needs to be monitored to ensure safety in the vehicle and the surrounding areas [31]. The multimodal-feature set that is extractable from a multimodal trifocal or cross-spectral stereo solution could provide a robust and unified framework for analyzing the vehicular environment [39], as well as higher level driver-intent analysis, such as lane changing [40], turning [20], or braking [41].

REFERENCES

[1] [Online]. Available: http://www.worldbank.org/html/fpd/transport/roads/safety.htm

[2] Traffic safety facts 2004: A compilation of motor vehicle crash data from the fatality analysis reporting system and the general estimates system. Nat. Highway Traffic Safety Assoc., U.S. Dept. Transp. [Online]. Available: http://www-nrd.nhtsa.dot.gov/pdf/nrd-30/NCSA/TSFAnn/TSF2004.pdf

[3] J. R. Crandall, K. S. Bhalla, and N. J. Madeley, "Designing road vehicles for pedestrian protection," Brit. Med. J., vol. 324, no. 7346, pp. 1145–1148, May 2002.

[4] D. Mohan, "Traffic safety and health in Indian cities," J. Transp. Infrastruct., vol. 9, no. 1, pp. 79–94, 2002.

[5] S. K. Singh, "Review of urban transportation in India," J. Public Transp., vol. 8, no. 1, pp. 79–97, 2005.

[6] Y. Fang, K. Yamada, Y. Ninomiya, B. Horn, and I. Masaki, "Comparison between infrared-image-based and visible-image-based approaches for pedestrian detection," in Proc. IEEE Intell. Veh. Symp., 2003, pp. 505–510.

[7] D. Scharstein and R. Szeliski, Middlebury College stereo vision research page, 2005. [Online]. Available: http://cat.middlebury.edu/stereo/

[8] S. J. Krotosky and M. M. Trivedi, "Mutual information based registration of multimodal stereo videos for person tracking," Comput. Vis. Image Underst., vol. 106, no. 2/3, pp. 270–287, May/Jun. 2007.

[9] T. Gandhi and M. M. Trivedi, "Pedestrian protection systems: Issues, survey, and challenges," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 3, pp. 413–430, Sep. 2007.

[10] L. Andreone, F. Bellotti, A. De Gloria, and R. Laulette, "SVM-based pedestrian recognition on near-infrared images," in Proc. 4th Int. Symp. Image Signal Process. Anal., 2005, pp. 274–278.

[11] H. Cheng, N. Zheng, and J. Qin, "Pedestrian detection using sparse Gabor filter and support vector machine," in Proc. IEEE Conf. Intell. Veh., 2005, pp. 583–587.

[12] A. Shashua, Y. Gdalyahu, and G. Hayun, "Pedestrian detection for driving assistance systems: Single-frame classification and system level performance," in Proc. IEEE Conf. Intell. Veh., 2004, pp. 1–6.

[13] Y. Wu, T. Yu, and G. Hua, "A statistical field model for pedestrian detection," in Proc. Comput. Vis. Pattern Recog., 2005, pp. 1023–1030.

[14] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in Proc. Comput. Vis. Pattern Recog., 2005, pp. 878–885.

[15] S. Munder and D. Gavrila, "An experimental study on pedestrian classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1863–1868, Nov. 2006.

[16] F. Xu, X. Liu, and K. Fujimura, "Pedestrian detection and tracking with night vision," IEEE Trans. Intell. Transp. Syst., vol. 6, no. 1, pp. 63–71, Mar. 2005.

[17] A. Broggi, A. Fascioli, P. Grisleri, T. Graf, and M. Meinecke, "Model-based validation approaches and matching techniques for automotive vision based pedestrian detection," in Proc. Comput. Vis. Pattern Recog., 2005, p. 1.

[18] Y. Fang, K. Yamada, Y. Ninomiya, B. K. P. Horn, and I. Masaki, "A shape-independent method for pedestrian detection with far-infrared images," IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. 1679–1697, Nov. 2004.

[19] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, "Pedestrian detection using infrared images and histograms of oriented gradients," in Proc. IEEE Conf. Intell. Veh., 2006, pp. 206–212.

[20] S. Cheng and M. M. Trivedi, "Turn-intent analysis using body pose for intelligent driver assistance," Pervasive Comput., vol. 5, no. 4, pp. 28–37, Oct.–Dec. 2006.

[21] M. Bertozzi, A. Broggi, C. Caraffi, M. D. Rose, M. Felisa, and G. Vezzoni, "Pedestrian detection by means of far-infrared stereo vision," Comput. Vis. Image Underst., vol. 106, no. 2/3, pp. 194–204, May/Jun. 2007.

[22] M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata, "Pedestrian detection with convolutional neural networks," in Proc. IEEE Intell. Veh. Symp., 2005, pp. 224–229.

[23] L. Zhao and C. Thorpe, "Stereo- and neural network-based pedestrian detection," IEEE Trans. Intell. Transp. Syst., vol. 1, no. 3, pp. 148–154, Sep. 2000.

[24] G. Grubb, A. Zelinsky, L. Nilsson, and M. Rilbe, "3D vision sensing for improved pedestrian safety," in Proc. IEEE Conf. Intell. Veh., 2004, pp. 19–24.

[25] P. Alfonso, D. F. Llorca, M. A. Sotelo, L. M. Bergasa, P. Revenga de Toro, J. Nuevo, M. Ocana, and M. A. G. Garrido, "Combination of feature extraction methods for SVM pedestrian detection," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 2, pp. 292–307, Jun. 2007.

[26] X. Lie and K. Fujimura, "Pedestrian detection using stereo night vision," IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. 1657–1665, Nov. 2004.

[27] M. Bertozzi, A. Broggi, M. Felias, G. Vezzoni, and M. Del Rose, "Low-level pedestrian detection by means of visible and far infra-red tetra-vision," in Proc. IEEE Conf. Intell. Veh., 2006, pp. 231–236.

[28] R. Labayrade, D. Aubert, and J.-P. Tarel, "Real time obstacle detection in stereovision on non flat road geometry through 'v-disparity' representation," in Proc. IEEE Conf. Intell. Veh., 2002, pp. 646–651.

[29] K. Konolige, "Small vision systems: Hardware and implementation," in Proc. 8th Int. Symp. Robot. Res., 1997, pp. 111–116.

[30] M. M. Trivedi, S. Y. Cheng, E. M. C. Childers, and S. J. Krotosky, "Occupant posture analysis with stereo and thermal infrared video: Algorithms and experimental evaluation," IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. 1698–1712, Nov. 2004.

[31] M. M. Trivedi, T. Gandhi, and J. McCall, "Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 1, pp. 108–120, Mar. 2007.

[32] S. Park and M. M. Trivedi, "Multi-person interaction and activity analysis: A synergistic track- and body-level analysis framework," Mach. Vis. Appl. (Special Issue on Novel Concepts and Challenges for Generation of Visual Surveillance Systems), vol. 18, no. 3/4, pp. 151–166, Aug. 2007.

[33] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2002.

[34] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. Comput. Vis. Pattern Recog., 2005, pp. 886–893.

[35] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/cjlin/libsvm

[36] A. S. Ogale and Y. Aloimonos, "A roadmap to the integration of early visual modules," Int. J. Comput. Vis. (Special Issue on Early Cognitive Vision), vol. 72, no. 1, pp. 9–25, Apr. 2007.

[37] P. Thevenaz and M. Unser, "Optimization of mutual information for multiresolution image registration," IEEE Trans. Image Process., vol. 9, no. 12, pp. 2083–2099, Dec. 2000.

[38] S. J. Krotosky and M. M. Trivedi, "Multimodal stereo image registration for pedestrian detection," in Proc. IEEE Conf. Intell. Transp. Syst., 2006, pp. 109–114.

[39] T. Gandhi and M. M. Trivedi, "Vehicle surround capture: Survey of techniques and a novel omni-video-based approach for dynamic panoramic surround maps," IEEE Trans. Intell. Transp. Syst., vol. 7, no. 3, pp. 293–308, Sep. 2006.

[40] J. McCall, D. Wipf, M. Trivedi, and B. Rao, "Lane change intent analysis using robust operators and sparse Bayesian learning," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 3, pp. 431–440, Sep. 2007.

[41] J. McCall and M. Trivedi, "Driver behavior and situation aware brake assistance for intelligent vehicles," Proc. IEEE (Special Issue on Advanced Automobile Technologies), vol. 95, no. 2, pp. 374–387, Feb. 2007.

Stephen J. Krotosky received the B.S. degree in computer engineering from the University of Delaware, Newark, in 2001, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California, San Diego, in 2004 and 2007, respectively, specializing in signal and image processing.

He is currently an Algorithm Development Engineer with the Advanced Multimedia and Signal Processing Division, Science Applications International Corporation, San Diego, CA.

Mohan Manubhai Trivedi received the Ph.D. degree in electrical engineering from Utah State University, Logan.

He is a Professor with the Department of Electrical and Computer Engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory, University of California, San Diego. His research interests include computer vision, intelligent vehicles and transportation systems, and human–machine interfaces.

Dr. Trivedi is a member of the IEEE Computer Society, from which he received both the Pioneer Award and the Meritorious Service Award, and a Fellow of the International Society for Optical Engineering.