improving the performance of pedestrian detectors using ...carneiro/publications/pr-2016.pdf ·...

Improving the performance of pedestrian

detectors using convolutional learning

David Ribeiro a Jacinto C. Nascimento a Alexandre Bernardino a

Gustavo Carneiro b

aInstituto de Sistemas e RoboticaInstituto Superior Tecnico 1049-001 Lisboa

PORTUGALbAustralian Centre of Visual Technologies, The University of Adelaide

AUSTRALIA

Abstract

We present new achievements on the use of deep convolutional neural networks(CNN) in the problem of pedestrian detection (PD). In this paper, we aim to addressthe following questions: (i) Given non-deep state-of-the-art pedestrian detectors (e.g.ACF, LDCF), is it possible to improve their top performances ?; (ii) is it possible toapply a pre-trained deep model to these detectors to boost their performances in thePD context ? In this paper, we address the aforementioned questions by cascadingCNN models (pre-trained on Imagenet) with state-of-the-art non-deep pedestriandetectors. Furthermore, we also show that this strategy is extensible to differentsegmentation maps (e.g. RGB, gradient, LUV) computed from the same pedestrianbounding box (i.e. the proposal). We demonstrate that the proposed approach isable to boost the detection performance of state-of-the-art non-deep pedestriandetectors. We apply the proposed methodology to address the pedestrian detectionproblem on the publicly available datasets INRIA and Caltech.

1 Introduction

Outstanding progress has been made in pedestrian detection (PD) in the lastdecade and reaching state-of-the-art results is becoming harder to achieve.

This work was partially supported by FCT[UID/EEA/50009/2013], and by IST-ID through a research scholarship for the project INCENTIVO/EEI/LA009/2014,

Email addresses: [email protected] (David Ribeiro),[email protected] (Jacinto C. Nascimento), [email protected] (AlexandreBernardino), [email protected] (Gustavo Carneiro).

Preprint submitted to Pattern Recognition 6 April 2016

There exist many works in the PD field that testify the intensive study onthis subject [4–10]. One of the reasons is that pedestrians are among the mostimportant objects to be detected in images, and a large number of applica-tions could benefit from this. For instance, in the fields of mobile robotics toinform Human-Robot Interaction systems, or automotive providing input toAdvanced Driver Assistance Systems. The PD task is complex due to sev-eral difficulties that arise in this context. For instance, the high variabilitythat characterizes the pedestrians projections on the camera image plane; theappearance of a pedestrian on the image that is influenced by the person’spose; clothing; the atmospheric conditions that contribute to the illuminationchanges; and the background clutter, all play a role in making PD a difficultproblem to be solved. Another difficulty inherent to this problem is occlusion,which has received special attention in the community, giving rise to threedifferent types of metrics [11] that allow to quantify how occluded the pedes-trian is: no occlusion, partial occlusion and reasonable (combination of partialocclusion and no occlusion). This contributed to the standardization of theevaluation methodologies and to provide a fair comparison among differentdetectors. Also, the occlusion motivated the development of two main typesof detectors tailored to deal with this specific issue: piq the ones with priorknowledge of the occlusion types [12,13] and piiq the ones that divide thepedestrian into several parts and infer visibility [14,15].

The performance of conventional, handcrafted features has plateaued in recentyears, however, new developments in deep compositional architectures havegained significant attention and have greatly advanced the performance ofthe state-of-the-art concerning image classification, localization and detection.Indeed, deep learning have brought successful results in several domains ofapplication, being an active research topic in computer vision community. Oneof the advantages provided by deep learning models, and specifically by deepconvolutional neural networks (CNN) [16], is the high-level features producedby the top layers of the model that allow to largely improve the classificationresults, compared to performance produced by hand-crafted features (see [32]for discussion). However, the training process for CNNs requires large amountsof annotated samples to avoid overfitting. This issue has been tackled withtransfer learning, which retrains, via fine-tuning, publicly available models(e.g. models that have been trained with large annotated databases [18]) usingsmaller datasets [19]. This fine-tuning process has been adopted in severalworks, since it has been shown to improve the generalization of the model (i.e.regularization) compared to a model that is trained with randomly initializedweights using only the small datasets [13,19]. This strategy has been appliedwith success in other domains of application, such as in medical image analysis[20].

Despite the popularity of pedestrian detection, only few works have applieddeep neural networks to this problem [21]. One of the first published works

2

using convnets for PD is proposed in [7]. A similar research path is followed inthis paper. We use pre-trained models in order to determine if the fine-tuningprocess improves the generalization of the model, compared to a model trainedwith randomly initialized weights on small datasets. To accomplish this we usethe recently proposed VGG model (see [22]) 1 .

One of the popular approach for PD is based on a sliding window paradigm.However, such procedure can easily become intractable. With CNNs becomingmore of a commodity in the computer vision field, a number of attempts havebeen published to overcome the above difficulty. For instance, [23], [24] werethe best-performing methods at ILSVRC-2013, using a small receptive windowsize and small stride of the first convolutional layer. We follow a differentstrategy, pioneered by [25], where selective search identifies promising imageregions where to use more expensive features.

In this paper, our main objective is to answer the question: ”Can a deep CNNmodel improve the performance of non-deep PD detectors?”. More concisely, inthis paper we propose to cascade non-deep state-of-the-art detectors, namely,ACF [10], LDCF [26] and Spatial pooling + [27] (including also the classicViola and Jones (VJ) [3]) with a deep CNN model. We show that such a schemeimproves the performance of the base non-deep detectors and generalizes wellto different feature maps. Figure 1 illustrates the proposed methodology.

Fig. 1. Illustration of proposed methodology. The switch means that we only selectone non-deep pedestrian detector to generate the proposals.

1.1 Related Work

Although pedestrian detection is being addressed in a considerable amount ofworks, the application of deep neural networks to this problem has only beentackled very recently 2 . Indeed, the success of CNNs is witnessed in several

1 Details are also available in http://www.robots.ox.ac.uk/~vgg/research/

very_deep/.2 Being the majority of the recent works published in CVPR 2015 and ICCV 2015.

3

http://www.robots.ox.ac.uk/~vgg/research/very_deep/


works, but in other context of applications. For instance, feature matching isaddressed in [28], where it is shown that features learned via convolutionalneural networks outperforms SIFT on descriptor matching context. Stereomatching is another topic where CNNs show improved performance [29], com-pared to previous approaches. Another context of application is that of scenerecognition. Although, large datasets (e.g. ImageNet) are available for objectrecognition, this is not true in the scene recognition context. One of the firstworks in this field is proposed in [30], testifying that deep features are ef-ficiently learned in a scene-centric database (called Places) with over sevenmillion labeled pictures of scene. In [31] it is proposed a hybrid architectureto perform pose estimation. It is shown that in a monocular image settings,the use of the CNN (jointly used with MRF) is successfully applied to theproblem of articulated human pose estimation.

Object detection and classification (using Imagenet) have also received atten-tion in several works. In [18] the large-scale data collection process of ILSVRCis described; datasets such as ILSVRC-2010 and ILSVRC-2012 are used in [17];SPP-nets [33] and ”inception” [34] are proposed in the same context of appli-cation. A review is available in [35], where an in depth analysis of ten objectproposal methods is available along with four baselines regarding ground truthannotation recall (on Pascal VOC 2007 and ImageNet 2013).

Another related topic of research is detection proposals for object recognition.In [25] it is proposed a scheme for generating possible object locations. Objectbounding box proposals using edges [36] is shown to be effective in termsof computational efficiency. Another class of relevant works is presented in[37], in which several choices of features (i.e. HOG,HOF,CSS) and classifiers (i.e.linear SVM) are explored to optimize sliding window based approaches.

Methods based on decision trees (but not using convnets) have been appliedwith success in PD task. In [38], a collection of the main approaches for PDdetection are put together. From this study, decision forest based methodsemerge to provide among the best results. By using different combinations offeatures, we are able to obtain distinct methodologies in this context.

Square channel features are addressed in [8], where it is proposed the Roerei

detector using HOG+colour only, moving away from the classic HOG+SVM whichhas been one of the benchmark methods in the field. Informed Haar-like Fea-tures [9] obtain high performance, where the pedestrian shapes are modeledusing three rectangles geared towards different body parts. Based on this ob-servation, compact Haar-like features are proposed. However, these (manual)features are specially designed for the pedestrian detection task. In [27] spa-tial pooling is used, where three visual descriptors are investigated, such asHOG, LPB and CSS. Regionlets [39] is also another method known in this class ofapproaches, integrating various types of features from competing local regions.

4

Despite the unquestionable value of the above works, only a few, however,have applied deep learning to the task of pedestrian detection, which is thefocus of this paper.

1.2 Contributions

This paper presents the following contributions. First, we aim at using verydeep learning based approaches to face the problem of PD. Along this goal,we experimentally conduct results with pre-trained vs. randomly initializedmodels for the same architecture. Second, we propose to alleviate the compu-tational burden that comes from a sliding window exhaustive search, througha selective search approach that significantly improves the computational ef-ficiency. To accomplish this we cascade several state-of-the-art non-deep PDdetectors with a deep CNN model. We conduct experiments, to show that suchcascade strategy boosts the performance of the non-deep detectors. Finally,we generalize the methodology for several different feature maps. This allowsto ascertain the accuracy of an individual feature map, being able to figure outwhat are the most effective/suitable channel map features for the pedestriandetection task.

2 Methodology

Let us consider a pedestrian dataset D txpiq,mpiq,ypiquNi1 where x denotesthe original RGB CNN input bounding box, i.e. x : Ω Ñ Rp 3 and Ω denotingthe image lattice; m is the feature map computed from x. In this paper,we resort to use common features that most state-of-the-art methodologiesalso use. More specifically we use m tmRGB,mGm,mGx,mYUV,mLUVu. Theabove notation stands for different feature maps obtained from x as follows:RGB color space, Gradient magnitude (Gm), Gradient in x (Gx), YUV colorspace and LUV color space. Finally, y P Y t0, 1u denotes the class (label)indicating if x contains (or not) the pedestrian; i indexes the ith pedestrianbounding box used for training. Fig. 2 illustrates some of the feature mapsused. Also, we have available a dataset for pre-training the CNN, i.e. rD tprxqpnq, pryqpnqun, with rx : Ω Ñ Rp and ry P Y t0, 1uC (where C is thenumber of classes in the pre-trained model, in this case 1000).

Deep Convolutional Neural Networks: A CNN model consists of a net-work containing several processing stages. Each stage comprises two differentlayers: piq a convolutional layer with an activation non-linear function, and

3 p is three for RGB, Gx, YUV and LUV, and is one for Gm.

5

Fig. 2. Illustration of some of the feature maps used: (a) RGB, (b) Gx, (c) YUV.

piiq a non-linear subsampling layer. In piq the convolution filters are appliedto the image, whereas in piiq a reduction in the input image size is performed.These two stages are followed by several fully connected layers and a multi-nomial logistic regression layer [17]. Fig. 3 illustrates the architecture of ourapproach. The convolution layers compute the output at location j from theinput location i using the filter Wk and bias bk at the kth stage using:

xkpjq σ ¸iPΩpjq

xk1piq Wkpi, jq bkpjq

(1)

where σp.q is a non-linear function, e.g. the logistic or rectification linear unit;the operator represents the convolution, and Ωpjq are the input region ad-dresses. The non-linear subsampling layers are defined by xkpjq Ó pxk1pjqq,where Ó p.q is the pooling function that subsamples the values from the datainput region Ωpjq. The fully connected layers consist of the convolution in(1) using a separate filter for each output location, and the whole input fromthe previous layer. The regression layers compute the probability of the ithclass using the features xL from the Lth layer with the softmax functionypiq expxL

°j expxL

. The inference process is accomplished in a feedforward fash-

ion, and the training stage is carried out with stochastic gradient descent tominimize the cross entropy loss over the training set via back propagation (see[17] for details).

Let us denote the following set of CNN parameters as rΘ rrθcnv, rθfc, rθlrs. Each

element in rΘ is a set of parameters representing: the convolutional and non-linear subsampling layers (rθcnv), the fully connected layers (rθfc), and multino-

mial logistic regression layer (rθlr). Thus, the process of pre-training a CNN can

be formalized as ry fprx, rΘq. The above pre-training process ry, is defined

as training M1 stages using rθcnv, followed by M2 fully connected layers, usingrθfc, and by one multinomial logistic regression layer that minimizes the cross-entropy loss function rθlr, over the data set rD (see Fig. 3 for an illustrationof the proposal). This pre-trained model can be further used by taking thefirst M1 M2 layers to initialize the training of a new model [19], in a process

6

that is known as fine tuning. This fine tuning is crucial to achieve the bestclassification results in a transfer learning context. Using the above strategy,we take the set of parameters trθcnv, rθfcu and introduce a new multinomial lo-gistic regression layer θlr, and fine tune the CNN model by minimizing thecross-entropy loss function using the training set D. The process describedabove will produce the following models for each pedestrian bounding box:y

RGB fpx,ΘRGBq yGm fpx,ΘGmq, y

Gx fpx,ΘGxq, yYUV fpx,ΘYUVq

and yLUV fpx,ΘLUVq.

Fig. 3. The proposed methodology uses several CNN models, one for each featuremap.

3 Experimental setup

This section discusses results obtained with the proposed methodology in IN-RIA and Caltech datasets. Since, the detection quality depends on the effectof positive and negative training sets, a detailed description is provided onhow these two sets are built: in Sec. 3.1 we describe how the positives andnegative training sets are generated in the INRIA dataset, whereas in Sec. 3.2we describe how this procedure in conducted in the case of Caltech dataset.

3.1 CNN training for INRIA dataset

The experiments are performed on the full INRIA dataset [4], comprising 614positive and 1218 negative images for training, and 288 images for testing.The methodology used to obtain the training and validation sets is as follows.We use a positive training set, Pos 1237 samples 4 extracted from the614 positive images using the original ground truth bounding boxes. Then,we perform data augmentation comprising: piq an additional set of 1237 by

4 By samples we mean a bounding box or window of width height pixel size.

7

performing a horizontal flipping over the set Pos, obtaining a set of Pp1qos 2474

samples, and piiq performing random deformations (comprising translationsand scaling) in the interval R s0, 5r pixels over the set Pp1q

os , obtaining anew set of Pp2q

os 4948 samples. The deformations consist in performing cubicinterpolation between randomly chosen starting and end values for the widthand height of the image (i.e. obtained from a uniform distribution on theinterval R).

To obtain the negative sample set Neg, we run the non-fully-trained LDCFdetector. By non-fully-trained, we mean that, the LDCF detector was takenafter the first training stage as in [1], being only trained with a segment ofthe total number of images. This is applied on the 1218 negative full images,obtaining a set of Neg 12552 samples. An upper-bound of 18 negative can-didate samples per image for negative detection is used.

Such procedure (i.e. using the training proposals of the detector as in [21]),provides a richer set of negatives than the standard method of randomly ex-tracting windows. The latter approach, provides both informative (e.g. resem-bling pedestrians) and non-informative (e.g. sky or blank patches) negatives.However, the number of non-informative samples is quite large, and from theexperiments conducted, they have a negative impact in the performance. Fig.4, illustrates the negatives selection mechanism: the two leftmost negativesamples (in figures (a) and (b)) result from randomly extracting patches fromthe image and the two rightmost negative samples (in figures (a) and (b))result from running the non-fully-trained LDCF detector. The above proce-dure amounts to obtain a total of 17500 samples, from which 15751 are takenfor training (90%) and 1749 for validation (10%). Here the proposals have a100 41 size.

3.2 CNN training for Caltech dataset

This section describes the training data collection process in the Caltechdataset. To obtain the negatives we take every 30th frame in the Caltechdataset. As previously, we apply the non-fully-trained LDCF detector overthe selected negative images and collect negative samples based on the follow-ing rules: (i) a training proposal is considered to be a negative example if itdoes not exceed an Intersection-over-Union (IoU) of 0.1, that is, if IoU 0.1;(ii) for the consistency of the experiments we maintain the negative selectionmechanism used in the INRIA dataset, i.e., resorting to the non-fully-trainedLDCF detector; and (iii) we define an upper bound of five windows per im-age. Under these conditions, we obtain Neg 17540 samples. To obtain thepositives, we use a sampling in which we consider every 3rd frame in the se-quence of the Caltech dataset. We take the initial set of Pos 16376 samples,

8

extracted from the images using the original ground truth bounding boxes.From this set, we take 1000 samples forming an auxiliar set Pp1q

os 1000 sam-ples, and perform a horizontal flipping over Pp1q

os , obtaining an additional setPp2q

os 1000 samples. Finally, from Pp1qos and Pp2q

os , we perform a random de-formation in the range of R s0, 5r pixels (as in the INRIA dataset). Thisamounts to obtain, Pp3q

os 1000, and Pp4qos 1000, respectively. To summarize,

we have the final set of positives Pos 16376 3000 (i.e., additional samplesobtained from Ppiq

os , with i P t2, 3, 4u). Here the proposals have a 50 20 size.

The training data obtained as detailed in Sec. 3.1, Sec. 3.2 will be used for allthe detectors to provide a fair comparison of the results.

3.3 Testing phase

For testing purposes, we follow the evaluation protocol as in [11] that uses, asperformance metric, the log average miss rate over nine points in the range of102 to 100 False Positives Per Image (FPPI). The evaluation for the Caltechdataset is done according to the reasonable setting [11].

To show the benefits of the approach, with respect to the base detectors, weapply the CNN over their test proposals (see Fig. 1). More specifically, we runthe following detectors over the INRIA test set: ACF and LDCF detectors 5 .Then, we compute the feature maps (see Section 2) over the ACF and LDCFtest proposals (of size 100 41 pixels) and we run the CNN. The number ofground truth bounding box annotations is 589.

The same procedure is also applied to the Caltech dataset, but adding theSpatial pooling + detector. Also, the previously used ACF detector is re-placed by the ACF Caltech +, denoted by ACF+ [26]. The proposals have sizeof 50 20 pixels. We measure the performance on the reasonable setting, i.e.,no occlusion or partial occlusion in pedestrians with more than 50 pixels ofheight. The number of ground truth bounding box annotations is 1014.

Original network model: we use the VGG Very Deep 16 architecture (VGG-VD16) 6 , that corresponds to the configuration D in [22]. The architectureconsists of a CNN that takes a 2242243 input image and contains: 13 con-volutional layers, five non-linear sub-sampling operations (i.e., max-pooling),three fully connected layers and a final multinomial logistic regression layer.More specifically, all max-pooling layers sub-sample a 22 input window by a

5 We do not run the detector Spatial pooling + since no optical flow informationis available in the INRIA dataset6 See additional details in http://www.robots.ox.ac.uk/~vgg/research/very_

deep/

9



(a) (b)

Fig. 4. Leftmost position in (a) and (b): Two negative windows randomly extractedfrom full INRIA images. Rightmost position in (a) and (b): Two negative windowsobtained by running the non-fully-trained LDCF detector in full INRIA images.

factor of 2 and the convolutional and fully connected layers have the RectifiedLinear Unit (ReLU) as the activation function. Moreover, all convolutionallayers from 1 to 13 have a receptive field of size 3 3, but different numberof filters (or channels). In fact, layers 1 and 2 have 64 filters (each), followedby max-pooling; layers 3 and 4 contain 128 filters (each), followed by max-pooling; layers 5, 6 and 7 have 256 filters (each), followed by max-pooling;layers from 8 to 13 have 512 filters (each), followed by max-pooling after lay-ers 10 and 13; layers 14 and 15 have 4096 filters (each), and layer 16 has1000 filters corresponding to the ILSVRC [18] classes, followed by the soft-max. This model is pre-trained with Imagenet [18] (1K visual classes, 1.2Mtraining, 50K validation and 100K test images).

Network changes: In order to reduce the computational demands of theVGG-VD16 model, the original CNN input size of 224 224 3 was reducedto 64 64 3. However, this input size does not allow to perform inferenceafter the first fully connected layer, unless the size of the weights is changed. 7

Furthermore, for the PD task, we need to replace the layer 16 and the soft-max, with a new layer and soft-max adapted for only two classes (pedestrianor non-pedestrian). Consequently, the parameters of the three fully connectedlayers were randomly initialized from a Gaussian distribution with zero meanand variance equal to 0.01, allowing to perform the mentioned changes. Af-terwards, we fine-tune the modified network using all feature maps (see Fig.2) obtained from the pedestrian dataset (as depicted in Fig. 3). The use ofthe proposed pre-training can be seen as a regularization approach, whichcan be compared to other forms of regularization, such as data augmentation,obtained by artificially augmenting the training set with random geometrictransformations. Finally, the minibatch size is 100, the number of epochs is10, the learning rate is set to 0.001, and momentum is 0.9. No special attentionwas made to fine-tune the above hyperparameters.

Section 4 performs a comparison between the baseline performance of these

7 We conducted additional experiments in which an alternative was to adequatelychange the stride of some of the previous layers. This however, results in moreexpensive and time consuming computations.

10

detectors and with the application of the very deep VGG network.

4 Results

This section shows the results, where we first perform a comparison betweenthe randomly initialised and the pre-trained networks. In this comparisonstage, we also study different modes of initializing the CNN model, namelyrandom initialization and Xavier Improved initialization [40]. This is donefor one feature map (i.e. RGB). Second, we experimentally show that, deeparchitectures are useful since they improve the performance of all of the testedstate-of-the-art pedestrian detectors. Furthermore, this can be generalized forseveral and quite different feature maps. Finally, we perform a comparisonwith state-of-the-art applied in the PD task, and show that our techniqueexhibits very competitive results in the field.

4.1 Initialization of the very Deep CNN

To perform a comparison between random initialization and pre-trained mod-els, we start by performing a random initialization of the deep CNN. Morespecifically, we adopt the methodology as in [20,32], that is, random weightsdrawn from Gaussian distributions with fixed standard deviation. However, weexperienced that such initialization does not help the deep CNN to converge,drastically hampering its performance. Thus, we confirm what is already re-ported in [22,40], i.e., such initialization strategy can stall learning due to theinstability of gradient in deep nets (in our case 13 conv layers). However, al-ternatives are available in the literature. Glorot and Bengio [41] proposed theso called Xavier initialization in the attempt to overpass the above mentioneddifficulty. However, we follow the robust initialization method as proposed in[40], that particulary considers the rectifiers nonlinearities - Parametric Rec-tified Linear Unit (PReLU). Recall, that the Xavier initialization is based onthe linearity assumption, which is not valid in ReLU (as in our case) andPReLU. This methodology helps the convergence of very deep models, be-ing possible to train it from scratch. We denote this initialization as Xavierimproved initialization.

Table 1 shows a comparison of the two modes for the network initializationfor the RGB feature map.

11

Table 1Log average miss rate % using two modes for the initialization (see text). Results areshown for the detectors, using the RGB feature map, and for both data sequences.

Random Init. Xavier Improved VGG Pre-training

ACF LDCF SP+ ACF LDCF SP+

INRIA 21.14% 21.83% - 15.13% 12.45% -

Caltech 31.21% 30.15% 23.77% 23.34% 21.66% 16.66%

Table 2Log average miss rate % for the proposals generated by several detectors on theINRIA and Caltech datasets, and the CNN post-processing, using different featuremaps to perform the fine-tuning.

Dataset Detector Proposals RGB GM Gx YUV LUV

ACF 16.83 15.13 15.74 14.82 15.94 14.99

INRIA LDCF 13.89 12.45 13.33 12.61 12.78 12.66

SP+ - - - - - -

VJ 72.48 63.47 62.20 64.04 64.17 63.84

ACF+ 29.54 23.34 27.63 27.24 26.31 25.68

Caltech LDCF 25.19 21.66 24.59 24.67 24.31 23.24

SP+ 21.48 16.66 20.43 20.04 19.87 18.71

VJ 94.73 86.50 89.97 91.73 89.74 88.81

4.2 Improving the pedestrian detectors

We now consider the results with the pre-trained VGG model (VGG-VD16)as detailed in Sec. 4.1, that provided the best results as shown in Table 1.Table 2, summarizes the performance of all the detectors considered in thispaper, for several feature maps and for both of the datasets. The column pro-posals corresponds to the baseline performance of the detectors, the remainingcolumns correspond to the improvement of having the cascade of the detectorwith the very deep CNN model. 8

Figure 5 shows the results for the INRIA dataset, displaying a comparisonwith state-of-the art pedestrian (non-deep) detectors and our approach thatcascades the ACF and LDCF with a very deep CNN. Recall that for thisdataset the optical flow is not available, not being possible to use the Spatial

pooling +. The results of the cascade are shown in black bold. It is shown

8 The results were obtained with either 256 or 512 as the third dimension in thefirst fully connected layer.

12

the best performances for these detectors. A log average miss rate of 14.82% isachieved for the ACF detector using the Gx feature map, and a log average missrate of 12.45% is achieved for the LDCF detector using the RGB feature map.Notice that the CNN can always improve the performance of any non-deepdetector. Recall that the worst detector (Viola and Jones) cannot deterioratethe cascade (i.e. VJ + CNN) performance.

Figure 6 shows the results for the Caltech dataset. Again, it is shown theperformances cascading the three detectors with the very deep CNN. The topperformance is reached 16.66%, cascading the Spatial pooling + and VGG.

Figure 7 shows a comparison with the state-of-the art methodologies for theCaltech dataset. Here, the results of non-deep detectors are shown, as well asthe deep methodologies that were proposed very recently. Our proposal hasthe third best result.

We have computed the obtained running time figures for both datasets. Table3, shows the computational effort for both the datasets considering the cas-cade of the non-deep detectors with the CNN. These training and testing aremeasured using Matconvnet training [42] on a 2.50 GHz Intel Core i7-4710 HQwith 12 GB, a 64 bit architecture and graphics cards NVIDIA GeForce GTX850M (main memory) and Intel HD Graphics 4600 (secondary memory).

The training time, refers to the time spent training the CNN model (not thenon-deep full detector). The test time concerns the time spent to forward theproposals through the CNN model. Recall that, the test time depends on thenumber of proposals generated by the non-deep detector. Thus, for the INRIAdataset, the number of proposals for the four detectors are the following:Prop ACF 1835, Prop LDCF 940, and Prop VJ 12323. For the Caltechdata set the number of proposals are: Prop ACF 114K, Prop LDCF 54K,Prop SP+ 228K, and Prop VJ 190867. This justifies the larger test timespent for the ACF, VJ and SP+ detectors.

4.3 Analysis of the obtained results

In this section we analyse the obtained results and the improvement intro-duced by the application of the CNN over the proposals. More specifically, weexperimentally show that the CNN is able to:

Significantly reduce the number of False Positives (FP) Slightly reduces the number of True Positives (TP)

The ideal scenario would be to decrease the number of FP, while maintainingthe TP.

13

Table 3Running time figures for the two initialization modes, obtained for both datasetsand corresponding miss rate (MR), using the RGB feature map.

Dataset Rand. Init. Xavier Improved Pre-trained VGG-VD16

MR LDCF+VGG proposals = 21.83% MR LDCF+VGG proposals = 12.45%

MR ACF+VGG proposals = 21.14% MR ACF+VGG proposals = 15.13%

INRIA train time = 5.1 hrs. train time = 5.1 hrs.

LDCF test time = 40.1 sec. LDCF test time = 33.0 sec.

ACF test time = 63.9 sec. ACF test time = 67.3 sec.

MR LDCF+VGG proposals = 30.15% MR LDCF+VGG proposals = 21.66%

MR ACF+ +VGG proposals = 31.21% MR ACF+ +VGG proposals = 23.34%

MR SP+ +VGG proposals = 23.77% MR SP+ +VGG proposals = 16.66%

Caltech train time = 12.17 hrs. train time = 11.4 hrs.

LDCF test time = 30.86 min. LDCF test time = 34.6 min.

ACF+ test time = 1.37 hrs. ACF+ test time = 1.5 hrs

SP+ test time = 3.11 hrs. SP+ test time = 3 hrs.

We conducted experiments on both INRIA and Caltech datasets to obtain thenumber of FP and TP, before and after the application of the CNN. Tables 4,5, 6 and 7, present the mentioned metrics for each of the four methods: VJ,ACF, LDCF and SP+, respectively.

In general, the number of false positives is significantly reduced after the ap-plication of the CNN framework in the initial non-deep detectors proposals.This shows that our classification refinement strategy for the proposals is ef-fective. As a negative side effect, the true positives number tend to decrease.However, these changes are substantially less significant to the methods over-all performance when compared with the false positives reduction achieved bythe CNN framework.

For instance, in the SP++RGB case (our best result for the Caltech dataset),the false positives after the application of the CNN represent only approxi-mately 5% of its original value (i.e., more than a tenfold FP reduction), whilethe true positives still constitute approximately 95% of its original value. Theimpact experienced by the method’s performance (using the usual metrics [11])is positive, leading to an improvement of 4.82 percentage points, comparingto the base SP+ detector (as shown in the Table 2).

14

Table 4Number of True Positives (TP) and False Positives (FP) for the proposals generatedby the Viola and Jones (VJ) detector on the INRIA and Caltech datasets, and theCNN post-processing, using different feature maps to perform the fine-tuning.

Dataset Metrics VJ Proposals RGB GM Gx YUV LUV

INRIA TP 464 450 427 438 443 452

FP 11859 939 631 816 773 809

Caltech TP 212 202 194 178 183 191

FP 17108 902 2622 3778 1894 1517

Table 5Number of True Positives (TP) and False Positives (FP) for the proposals generatedby the ACF (ACF+) detector on the INRIA and Caltech datasets, and the CNNpost-processing, using different feature maps to perform the fine-tuning.

Dataset Metrics ACF Proposals RGB GM Gx YUV LUV

INRIA TP 551 546 533 542 540 546

FP 1284 249 197 237 242 284

Caltech TP 952 900 889 866 889 899

FP 111087 7742 28596 16187 17234 12840

Table 6Number of True Positives (TP) and False Positives (FP) for the proposals generatedby the LDCF detector on the INRIA and Caltech datasets, and the CNN post-processing, using different feature maps to perform the fine-tuning.

Dataset Metrics LDCF Proposals RGB GM Gx YUV LUV

INRIA TP 554 550 543 549 549 552

FP 386 127 104 119 122 140

Caltech TP 947 894 896 874 889 891

FP 51460 3712 14020 8847 8473 6528

Table 7Number of True Positives (TP) and False Positives (FP) for the proposals generatedby the Spatial Pooling+ (SP+) detector on the INRIA and Caltech datasets, andthe CNN post-processing, using different feature maps to perform the fine-tuning.

Dataset Metrics SP+ Proposals RGB GM Gx YUV LUV

INRIA TP - - - - - -

FP - - - - - -

Caltech TP 985 939 932 913 917 926

FP 223968 11576 58092 34775 34605 22257

15

Fig. 5. Results in INRIA dataset. Improvement over the state-of-the-art non-deepdetectors, cascading the ACF and LDCF with very deep CNN. The best result(shown in black bold) is 12.45% log average Miss Rate, achieved for the RGB featuremap, cascading the LDCF with VGG-VD16.

5 Discussion and Conclusions

A novel methodology for pedestrian detection based on very deep convolu-tional learning is presented. The main outcome is that, it is experimentallyobserved that an improvement over the top non-deep state-of-the-art detectorsfor the task of PD is achieved. This is accomplished by cascading state-of-the-art non-deep detectors with a deep compositional architecture. This architec-ture not only allows fast test times but also actually improves the performanceof the base detectors.

Detailing now the experiments conducted in this paper, one of the first issuesto be tackled is that of initializing the architecture. It was experimentallydemonstrated, that the initialization of the very deep architecture plays arelevant role in the performance. Although, we have used a robust initialization

16

Fig. 6. Results in Caltech dataset with reasonable setting. Improvement over thestate-of-the-art non-deep detectors cascading the ACF, LDCF and SP+ with verydeep CNN. The best result (shown in black bold) is 16.66% log average miss rate,achieved for the RGB feature map, cascading the SP+ with VGG-VD16. Here wehave included a cascaded deep detector (SCF + AlexNet - 23.32%) since it is similarto our approach.

that allows for a deep CNN to be trained from scratch , better results areachieved using the pre-trained models. Table 1 shows this issue for the RGBfeature map. Concerning the adopted cascade approach, we also concludedthat the improvement of the detectors is observed for both of the datasets, andalso, that these results can be generalized varying the feature map computedfrom the bounding box. Table 2 testifies precisely this fact for different featuremaps. Figures 5 and 6 show a comparison with other well known non-deepdetectors used for PD. We note that, the (ACF,VGG), (LDCF,VGG) and (Spatialpooling +, VGG) exhibit top performances for both of the datasets.

Although, not being the main goal of this paper, we also compared the per-formance with deep methodologies recently proposed. Figure 7 shows theseresults, where we plot our best result obtained cascading Spatial pooling

17

Fig. 7. Comparison with state-of-the-art approaches.

+ and VGG-VD16 architecture. It is seen that our methodology is in the top-three best methods in the literature for the Caltech reasonable dataset. Thebest approaches are [43,44] that were very recently proposed.

Although our approach is highly competitive, there is still room for improve-ment. Possible extensions of the methodology should be studied. For instance,and since, there is an improvement for various different feature maps, one ap-proach could be to fuse in a principled way the individual responses of thesefeature maps. Other alternative could be to select more heterogeneous inputchannels (such as pedestrian body parts, or different views of the pedestrians),including distinct detectors for proposals and integrating a multiscale schemein the network. Moreover, we could further demonstrate the generality of ourmethodology in improving any non-deep pedestrian detector. Therefore, wecould apply the proposed framework to other top-performing non-deep detec-tors in the literature, for example, Checkerboards and Checkerboards+ [2].Finally, our approach can benefit from recent results in deep learning, namely,novel ideas in deep-state-of-the-art pedestrian detectors could be adapted to

18

improve our architecture. For example, the shift handling procedure, usingfully convolutional neural networks, employed in the Deep Parts method [43]could be used to further enhance our approach. We could also change thenetwork model (e.g. to the GoogLeNet as in [43]), and extract and combinefeatures from the last CNN layers to perform the classification.

References

[1] Piotr Dollar, Piotr’s Computer Vision Matlab Toolbox (PMT),URL http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html

[2] S. Zhang, R. Benenson, B. Schiele, Filtered channel features for pedestriandetection, in: CVPR, 2015.

[3] P. Viola, M. Jones, Robust Real-Time Face Detection, in: IJCV, 2004.

[4] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in:CVPR, 2005.

[5] P. Dollar, Z. Tu, P. Perona, S. Belongie, Integral channel features, in: BMVC,2009.

[6] R. Benenson, M. Mathias, R. Timofte, L. V. Gool, Pedestrian detection at 100frames per second, in: CVPR, 2012.

[7] S. C. P. Sermanet, K. Kavukcuoglu, Y. LeCun, Pedestrian detection withunsupervised multi-stage feature learning, in: CVPR, 2013.

[8] R. Benenson, M. Mathias, T. Tuytelaars, L. V. Gool, Seeking the strongest rigiddetector, in: CVPR, 2013.

[9] S. Zhang, C. Bauckhage, A. Cremers, Informed haarlike features improvepedestrian detection, in: CVPR, 2014.

[10] P. Dollar, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for objectdetection 36 (8), 1532–1545, 2014.

[11] P. Dollar, C. Wojec, B. Schiele, P. Perona, Pedestrian detection: An evaluationof the state of the art, IEEE Trans. Pattern Anal. Machine Intell., 2012.

[12] C. Wojek, S. Walk, S. Roth, B. Schiele, Monocular 3d scene understanding withexplicit occlusion reasoning, in: CVPR, 2011.

[13] M. Mathias, R. Benenson, R. Timofte, L. V. Gool, Handling occlusions withfranken-classifiers, in: ICCV, 2013.

[14] W. Ouyang, X. Wang, A discriminative deep model for pedestrian detectionwith occlusion handling, in: CVPR, 2012.

[15] W. Ouyang, X. Wang, Joint deep learning for pedestrian detection, in: ICCV,2013.

19

http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html

[16] Y. Bengio, Learning deep architectures for ai, Foundations and Trends inMachine Learning 2 (1), 1–127, 2009.

[17] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deepconvolutional neural networks, in: NIPS, 2012.

[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNetLarge Scale Visual Recognition Challenge, Int. Journal of Comp. Vision (IJCV)115 (3), 211–252, 2015. doi:10.1007/s11263-015-0816-y.

[19] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features indeep neural networks?, in: NIPS, pp. 3320–3328, 2014.

[20] G. Carneiro, J. C. Nascimento, A. P. Bradley, Unregistered multiviewmammogram analysis with pre-trained deep learning models, in: MICCAI, 2015.

[21] J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look atpedestrians, CoRR abs/1501.05790.URL http://arxiv.org/abs/1501.05790

[22] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scaleimage recognition, ICLR, 2015.

[23] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks,in: ECCV, 2014.

[24] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat:Integrated recognition, localization and detection using convolutional networks,in: ICLR, 2014.

[25] J. Uijlings, K. van de Sande, T. Gevers, A. Smeulders., Selective search forobject recognition., IJCV, 2013.

[26] W. Nam, P.Dollar, J. H. Han, Local decorrelation for improved pedestriandetection, in: NIPS, 2014.

[27] S. Paisitkriangkrai, C. Shen, A. van den Hengel, Strengthening the effectivenessof pedestrian detection with spatially pooled features, in: ECCV, 2014.

[28] P. Fischer, A. Dosovitskiy, T. Brox, Descriptor matching with convolutionalneural networks: a comparison to SIFT, CoRR abs/1405.5769, 2014.

[29] J. Zbontar, Y. LeCun, Computing the stereo matching cost with a convolutionalneural network, CoRR abs/1409.4326.

[30] X. Chen, P. Wei, W. Ke, Q. Ye, J. Jiao, Pedestrian detection with deepconvolutional neural network, in: ACCV workshop, 2014.

[31] J. Tompson, A. Jain, Y. LeCun, C. Bregler, Joint training of a convolutionalnetwork and a graphical model for human pose estimation, 2014.

[32] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deepconvolutional neural networks, in: NIPS, 2012.

20

http://dx.doi.org/10.1007/s11263-015-0816-y

http://arxiv.org/abs/1501.05790



[33] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutionalnetworks for visual recognition., in: ECCV, 2014.

[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, CoRRabs/1409.4842, 2014.

[35] J. Hosang, R. Benenson, B. Schiele., How good are detection proposals, really?,in: BMVC, 2014.

[36] C. L. Zitnick, P. Dollar, Edge boxes: Locating object proposals from edges, in:ECCV, 2014.

[37] S. Walk, N. Majer, K. Schindler and B. Schiele, New features and insights forpedestrian detection, in: CVPR, 2010.

[38] R. Benenson, M. Omran, J. Hosang, B. Schiele, Ten years of pedestriandetection, what have we learned?, in: ECCV workshop, 2014.

[39] X. Wang, M. Yang, S. Zhu, Y. Lin, Regionlets for generic object detection, in:ICCV, 2013.

[40] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, 2015.

[41] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforwardneural networks, in: Int. Conf. on Artificial Intelligence and Statistics, pp. 249–256, 2010.

[42] A. Vedaldi, K. Lenc, Matconvnet – convolutional neural networks for matlab,in: Proceeding of the ACM Int. Conf. on Multimedia, 2015.

[43] Y. Tian, P. Luo, X. Wang, X. Tang, Deep learning strong parts for pedestriandetection, in: ICCV, 2015.

[44] Z. Cai, M. Saberian, N. Vasconcelos, Learning complexity-aware cascades fordeep pedestrian detection, in: ICCV, 2015.

21

improving the performance of pedestrian detectors using ...carneiro/publications/pr-2016.pdf ·...

Documents