


Learning Multi-scale Features for Foreground Segmentation

Long Ang Lim

Ankara University

Department of Computer Engineering

[email protected]

Hacer Yalim Keles

Ankara University

Department of Computer Engineering

[email protected]

Abstract

Foreground segmentation algorithms aim at segmenting moving objects from the background robustly under various challenging scenarios. Encoder-decoder type deep neural networks that are used in this domain recently achieve impressive segmentation results. In this work, we propose a novel and robust encoder-decoder neural network that can be trained end-to-end using only a few training examples. The proposed method extends the Feature Pooling Module (FPM) of FgSegNet by introducing feature fusions inside this module, which is capable of extracting multi-scale features within images; this results in a feature pooling that is robust against camera motion and can alleviate the need for multi-scale inputs to the network. Our method outperforms all existing state-of-the-art methods on the CDnet2014 dataset with an average overall F-Measure of 0.9847. We also evaluate the effectiveness of our method on the SBI2015 and UCSD Background Subtraction datasets. The source code of the proposed method is made available at https://github.com/lim-anggun/FgSegNet_v2.

Keywords — Foreground segmentation, convolutional neural networks, feature pooling module, background subtraction

1 Introduction

Extracting foreground objects from video sequences is one of the major and challenging tasks in the computer vision domain. The resulting foreground objects, also known as moving objects, can be used in various computer vision tasks [1–6]. Extracting objects of interest from stationary-camera videos is especially challenging when the video sequences contain difficult scenarios such as sudden or gradual illumination changes, shadows, dynamic background motion, camera motion, camouflage or subtle regions.

Fig. 1: A comparison between our method and some current state-of-the-art methods on the cameraJitter category.

Several foreground segmentation approaches have been proposed to address these problems [7–13]; most of them rely on building a stationary background model, an approach that is not very effective in adapting to these challenging scenarios.

Convolutional Neural Networks (CNNs) [14] based on gradient learning are very powerful in extracting useful feature representations from data [15] and have been successfully used in many practical applications [14, 16–22]. In particular, Fully Convolutional Networks (FCNs) that are based on transfer learning [16, 19] have shown significant improvements over conventional approaches by large margins. In this case, the knowledge gained from an image classification problem is adapted to the dense spatial class prediction domain, where each pixel in an image is marked with a class label; such a prediction requires an understanding of both higher-level and lower-level contextual information in a scene. However, this is usually difficult because contextual details are lost through the feature resolution reduction caused by consecutive pooling and strided convolution operations in the pre-trained models. One may remove downsampling operations to keep high-resolution feature maps; however, this is computationally more expensive, and it becomes harder to expand the effective receptive fields.



To take advantage of low-level features at large resolution, [16] recently removed the last block of VGG-16 [18], fine-tuned the last remaining block, and aggregated contextual information at multiple scales using a Feature Pooling Module (FPM).

Motivated by the recent success of deep neural networks for foreground segmentation, we adopt the same encoder that is used in [16], which we found improves performance more than other pre-trained networks, and we propose modifications to the original FPM module to capture wide-range multi-scale information, resulting in a module that is more robust against camera movements. In contrast to [21], which transfers max-pooling indices from the encoder to the decoder, and [22], which copies feature maps directly from the encoder to the decoder to refine segmentation results, we use global average pooling (GAP) of encoder features to guide the high-level features in the decoder part.

In summary, our key contributions are:

• We propose a robust foreground segmentation network that can be trained using only a few training examples without incorporating temporal data, yet provides highly accurate segmentation results.

• We improve the FPM module by fusing multi-scale features inside it, resulting in a feature pooling that is robust against camera motion and can alleviate the need for multi-scale inputs to the network.

• We propose a novel decoder network, where high-level features in the decoder part are guided by low-level feature coefficients from the encoder part.

• Our method exceeds the state-of-the-art performance on the Change Detection 2014 Challenge, SBI2015 and UCSD Background Subtraction datasets.

• We provide ablation studies of the design choices in this work, and the source code is made publicly available to facilitate future research.

2 Related Works

Foreground segmentation, also known as background subtraction, is one of the major tasks in computer vision, and various methods have been proposed in this domain. Most conventional approaches rely on building a background model for a specific video sequence. To model the background, statistical (parametric) methods based on Gaussians were proposed [7, 8, 11] to classify each pixel as a background or foreground pixel. However, parametric methods are computationally inefficient; to alleviate this problem, various non-parametric methods [9, 10, 13] have been proposed. Furthermore, [12] uses genetic programming to select different change detection methods and combine their results.

Recently, deep learning based methods [16, 23, 24] have shown impressive results and outperform all classical approaches by large margins. There are different training strategies in this domain; for example, [25, 26] use a patch-wise training strategy where background patches and image patches are combined and then fed to CNNs to predict foreground probabilities of the center pixels of the patches. However, this approach is computationally inefficient, may cause overfitting due to redundant pixels, loses higher-level context within the patches, and requires a large number of patches for training. [16, 23, 24, 27, 28] approach the problem by feeding whole-resolution images to the network to predict foreground masks. Some methods take advantage of temporal data [24, 27], while others train the networks by combining image frames with generated background models [25–28]. The number of training frames utilized to produce the model also differs across approaches; [25], [26], [23], [16] and [28] use 50%, 5%, 200 frames, 200 frames and 70% of each video sequence, respectively, whereas [24] splits video sequences into chunks of 10 frames and uses 70% for training. In this research, we generate a model for each scene and aim to use only a few frames for training, so that the ground-truth requirement for different scenes is very low. We believe, as [16, 23] also do, that this strategy is essential for a system that works in different domains in practice, since pixel-level ground-truth generation for a large number of frames is a time-consuming and difficult process. Hence, we adopt the same training-frame selection strategy as in [16, 23], where only 200 frames or fewer are used for training. This strategy considerably mitigates user intervention in labeling ground-truths.

Dilated Convolution: Dilated convolution, also known as atrous convolution, has recently been applied in the semantic segmentation domain [29–32]; the idea is to enlarge the field-of-view of the network without increasing the number of learned parameters.



Fig. 2: The flow of the FgSegNet_v2 architecture.

Motivated by the recent success of these works, we proposed in [16] an FPM module with parallel dilated convolution layers that is plugged on top of a single-input encoder; it provides foreground segmentation results comparable to those of a multi-input encoder.

3 The Method

In this section, we revisit our previous work, FgSegNet [16], covering both the encoder and the FPM module. For more details, one may refer to the original paper.

3.1 The Encoder Network

Motivated by the low-level features of the pre-trained VGG-16 net [18], in FgSegNet we utilize the first four blocks of VGG-16 by removing the last (i.e., fifth) block and the third max-pooling layer; this yields higher-resolution feature maps. Dropout [33] layers are inserted after every convolutional layer of the modified net, and the last remaining block is then fine-tuned. In this work, we use the same encoder architecture as in the FgSegNet implementation. We observe that this modified net improves the performance compared to other pre-trained nets.
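For illustration, a minimal Keras sketch of such a truncated VGG-16 encoder is given below; the dropout rate, the exact dropout placement and the layer-freezing scheme are our assumptions, not the authors' released code.

```python
# Sketch of the truncated VGG-16 encoder: first four blocks, third max-pooling
# removed, Dropout after the convolution layers, only the last kept block fine-tuned.
# Dropout rate and freezing scheme are assumptions.
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Dropout

def build_encoder(input_shape=(None, None, 3), dropout_rate=0.5):
    inp = Input(shape=input_shape)
    vgg = VGG16(include_top=False, weights='imagenet')
    layer_names = ['block1_conv1', 'block1_conv2', 'block1_pool',
                   'block2_conv1', 'block2_conv2', 'block2_pool',
                   'block3_conv1', 'block3_conv2', 'block3_conv3',  # block3_pool removed
                   'block4_conv1', 'block4_conv2', 'block4_conv3']  # block4_pool and block 5 not used
    x = inp
    for name in layer_names:
        layer = vgg.get_layer(name)
        layer.trainable = name.startswith('block4')  # fine-tune only the last kept block
        x = layer(x)
        if 'conv' in name:
            x = Dropout(dropout_rate)(x)             # dropout after each convolution
    return inp, x  # x: encoder feature maps F fed to the M-FPM module
```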

3.2 The Modified FPM

Given feature maps F, which are obtained from the output of our encoder, the original FPM module [16] pools features at multiple scales by applying several convolutional layers with different dilation rates, as well as a max-pooling layer followed by a 1x1 convolution, on the same features F; the pooled features are then concatenated along the depth dimension.

Fig. 3: The modified FPM module, M-FPM. IN (InstanceNormalization), SD (SpatialDropout). All convolution layers have 64 features.

Finally, the concatenated features are passed through BatchNormalization [34] and SpatialDropout [35] layers. In this work, we improve the original FPM module by proposing modifications in two parts (Fig. 3): (1) the resulting features f_a from a plain 3x3-conv are concatenated with the features F and progressively pooled by a 3x3-conv with a dilation rate of 4, resulting in features f_b. Then, F and f_b are concatenated and fed to a 3x3-conv with a dilation rate of 8, resulting in features f_c. Again, F and f_c are concatenated and fed to a 3x3-conv with a dilation rate of 16. Finally, all features are concatenated to form 5x64-depth features, which we call F′; F′ contains multi-scale features with wider receptive fields than those in [16]. (2) We replace BatchNormalization with InstanceNormalization [36], since we empirically observe that InstanceNormalization gives slightly better performance with a small batch size. Since multiple pooling layers operate on the same features F, the concatenated features F′ are likely to be correlated.



To promote independent feature maps, instead of standard Dropout, where individual neurons in a feature plane are dropped, SpatialDropout is used to drop entire 2D feature maps at some rate (i.e., 0.25) when adjacent pixels within the feature maps are strongly correlated. We observe that SpatialDropout helps to improve the performance and prevents overfitting in our network. Note that we apply the ReLU non-linearity once, right after InstanceNormalization, in the M-FPM module. We will refer to this modified FPM as M-FPM from here on.
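A minimal Keras sketch of this progressive feature fusion is shown below; how the five 64-channel branches are formed (in particular the max-pool + 1x1-conv branch) follows our reading of the text, and BatchNormalization stands in for InstanceNormalization (available, e.g., in keras-contrib) only to keep the sketch dependency-free.

```python
# Sketch of the M-FPM idea: progressive multi-scale feature fusion with dilated
# convolutions. Branch composition and layer ordering are our assumptions.
from keras.layers import (Conv2D, MaxPooling2D, Concatenate, Activation,
                          BatchNormalization, SpatialDropout2D)

def m_fpm(F, filters=64):
    """F: encoder output feature maps. Returns multi-scale pooled features F'."""
    # max-pooling followed by a 1x1 conv, as in the original FPM
    f_pool = Conv2D(filters, 1, padding='same')(
        MaxPooling2D(pool_size=2, strides=1, padding='same')(F))
    # plain 3x3 conv
    f_a = Conv2D(filters, 3, padding='same')(F)
    # progressive fusion: each dilated conv sees F concatenated with the previous output
    f_b = Conv2D(filters, 3, dilation_rate=4,  padding='same')(Concatenate()([F, f_a]))
    f_c = Conv2D(filters, 3, dilation_rate=8,  padding='same')(Concatenate()([F, f_b]))
    f_d = Conv2D(filters, 3, dilation_rate=16, padding='same')(Concatenate()([F, f_c]))
    # concatenate all branches -> 5 x 64 channels (F')
    out = Concatenate()([f_pool, f_a, f_b, f_c, f_d])
    # the paper uses InstanceNormalization; BatchNormalization is a stand-in here
    out = Activation('relu')(BatchNormalization()(out))
    return SpatialDropout2D(0.25)(out)
```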

3.3 The Decoder with GAP Module

Our decoder network with two GAPs is illustrated in Fig. 2. The decoder contains a stack of three 3x3-conv layers and a 1x1-conv layer, where each 3x3-conv layer is followed by InstanceNormalization and the 1x1-conv layer is the projection from feature space to image space. All 3x3-conv layers have 64 feature maps, except the 1x1-conv layer, which contains a single feature slice. Note that ReLU non-linearities are applied after InstanceNormalization and a sigmoid activation function is applied after the 1x1-conv in the decoder part.

Global Average Pooling (GAP): Two coefficient vectors combine information from the low-level features of the encoder and the high-level features of the decoder: the first coefficient vector is pooled from the second convolution layer of the encoder, right before the max-pooling layer, and the second is pooled from the fourth convolution layer. Since the 3x3-conv layers have 64 features in the decoder part, the fourth convolution layer of the encoder is first projected from 128 to 64 features. Both coefficient vectors (say α_i) are multiplied with the output features of the first and second convolutional layers (say f_ij) in the decoder part (see Fig. 2 for details). The scaled features are added to the original features to form features f′_ij, where f′_ij = α_i · f_ij + f_ij, i ∈ [0, 63] is the index of the feature depth and j is the index of an element in each feature slice. Finally, the concatenated f′_ij are upscaled by 2x using bilinear interpolation and fed to the next layers. We observe that the GAP module adds very little computational cost but improves the overall performance.
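A minimal Keras sketch of this GAP-based guidance is given below, assuming the encoder features have already been projected to the same channel count as the decoder features (the paper projects 128 to 64 with a 1x1 conv); tensor names are illustrative.

```python
# Sketch of GAP-guided feature scaling: f'_ij = alpha_i * f_ij + f_ij, where alpha
# is a per-channel coefficient vector pooled from an encoder layer.
from keras import backend as K
from keras.layers import GlobalAveragePooling2D, Lambda

def gap_scale(decoder_feat, encoder_feat):
    """Scale decoder features with coefficients pooled from encoder features."""
    alpha = GlobalAveragePooling2D()(encoder_feat)      # shape: (batch, channels)

    def scale(tensors):
        f, a = tensors
        a = K.expand_dims(K.expand_dims(a, 1), 1)       # -> (batch, 1, 1, channels)
        return a * f + f                                # alpha * f + f

    return Lambda(scale)([decoder_feat, alpha])
```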

4 Training Protocol

We implement our models using the Keras framework [37] with a TensorFlow backend on a single NVIDIA GTX 970 GPU. We follow the same training procedure as [16]; hence, keeping the pre-trained coefficients of the original VGG-16 net, only the last modified block is fine-tuned. We train using the RMSProp optimizer (setting rho to 0.9 and epsilon to 1e-08) with a batch size of 1. We use an initial learning rate of 1e-4 and reduce it by a factor of 10 when the validation loss stops improving for 5 epochs. We set a maximum of 100 epochs and stop training early when the validation loss stops improving for 10 epochs. The training frames (e.g., 200 frames) are randomly shuffled before the training+validation split, and then split into 80% for training and 20% for validation. The binary cross-entropy loss is used to measure the agreement between true and predicted labels. Because background and foreground pixels are highly imbalanced in the scenes, we alleviate the imbalanced data classification problem by giving more weight to the rare class (foreground) and less weight to the majority class (background) during training. Furthermore, since the outputs of the sigmoid activation are in the range [0,1], we use them as probability values; we threshold the activations to obtain discrete binary class labels for foreground and background.
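A Keras sketch of this training configuration is given below; the class-weighting implementation (a weighted binary cross-entropy) is one plausible way to realize the scheme described above, and the weight values are placeholders, not the authors' settings.

```python
# Sketch of the training setup: RMSprop, LR reduction on plateau, early stopping,
# and a class-weighted binary cross-entropy for the foreground/background imbalance.
from keras import backend as K
from keras.optimizers import RMSprop
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

def weighted_bce(fg_weight=5.0, bg_weight=1.0):   # weight values are placeholders
    """Binary cross-entropy that up-weights the rare foreground class."""
    def loss(y_true, y_pred):
        bce = K.binary_crossentropy(y_true, y_pred)
        weights = y_true * fg_weight + (1.0 - y_true) * bg_weight
        return K.mean(weights * bce)
    return loss

optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08)
callbacks = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5),  # reduce LR by 10x
    EarlyStopping(monitor='val_loss', patience=10),                 # stop after 10 idle epochs
]
# model.compile(optimizer=optimizer, loss=weighted_bce())
# model.fit(frames, masks, batch_size=1, epochs=100,
#           validation_split=0.2, callbacks=callbacks)
```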

We mainly evaluate the model performance using the F-Measure and the Percentage of Wrong Classifications (PWC), where we want to maximize the F-Measure while minimizing the PWC. Given true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), the F-Measure is defined by:

F\text{-}Measure = \frac{2 \times precision \times recall}{precision + recall} \quad (1)

where precision = \frac{TP}{TP + FP} and recall = \frac{TP}{TP + FN}. PWC is defined by:

PWC = \frac{100 \times (FP + FN)}{TP + FP + TN + FN} \quad (2)
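For concreteness, a small helper implementing Eq. (1) and Eq. (2) directly from confusion-matrix counts:

```python
# F-Measure (Eq. 1) and Percentage of Wrong Classifications (Eq. 2) from TP/FP/FN/TN.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pwc(tp, fp, fn, tn):
    return 100.0 * (fp + fn) / (tp + fp + tn + fn)
```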

5 Results and Discussion

In this section, we evaluate the effectiveness of our method using three different datasets, namely CDnet2014, SBI2015 and UCSD Background Subtraction. Each of these datasets contains challenging scenarios and is widely used in foreground/background segmentation research.



Table 1: The results with GAP and no GAP (25f / 200f).

Category    | no GAP F-Measure | no GAP PWC      | GAP F-Measure   | GAP PWC
cameraJit   | 0.9506 / 0.9936  | 0.3026 / 0.0436 | 0.9740 / 0.9936 | 0.2271 / 0.0438
badWeather  | 0.9781 / 0.9858  | 0.0657 / 0.0271 | 0.9783 / 0.9848 | 0.0639 / 0.0295
dynamicBg   | 0.9636 / 0.9878  | 0.0311 / 0.0052 | 0.9665 / 0.9881 | 0.0325 / 0.0054
intermitt   | 0.9597 / 0.9929  | 0.2997 / 0.0794 | 0.9735 / 0.9935 | 0.3268 / 0.0707
shadow      | 0.9840 / 0.9960  | 0.1265 / 0.0279 | 0.9853 / 0.9959 | 0.1159 / 0.0290
turbulence  | 0.9600 / 0.9779  | 0.0439 / 0.0230 | 0.9587 / 0.9762 | 0.0438 / 0.0232

5.1 Experiments on CDnet Dataset

The CDnet2014 dataset [38] contains 11 categories, where each category contains 4 to 6 video sequences; in total, there are 53 different video sequences. Each video sequence contains between 600 and 7999 frames, with spatial resolutions varying from 320x240 to 720x576. Moreover, this dataset covers various challenging scenarios such as illumination change, hard shadows, highly dynamic background motion and camera motion.

Unlike most previous works that use large numbers of training frames, [16, 23] use only a few frames, i.e. 50 and 200 frames, for training+validation in order to alleviate the ground-truth labelling burden. Similar to these works, we use the same training set (i.e., 200 frames) provided by [16], and we further reduce the number of training frames by randomly selecting 25 frames from the set of 200 frames to perform an additional experiment.

The Global Average Pooling Experiments: In order to evaluate the effectiveness of the GAP layers in the proposed network, we performed two sets of experiments on the 6 most challenging categories of the CDnet2014 dataset (cameraJitter, badWeather, dynamicBackground, intermittentObjectMotion, shadow, turbulence); this subset contains 30 video sequences in total. The selected video sequences contain between 1150 and 7999 frames. We use only 25 frames for training+validation (as mentioned above) and the remaining frames for testing.

In the first setting, we remove the GAP layers entirely from the network (referred to as no GAP below), and in the second setting we keep the GAP layers. As can be seen from Table 1, the network with GAP improves over no GAP in most categories by some margins. In particular, GAP improves over no GAP by 2.34 points in the cameraJitter category.

Fig. 4: The improved M-FPM module compared to the original FPM. (a) input image, (b) ground-truth, (c) M-FPM + proposed decoder result, (d) FPM + proposed decoder result, (e) FgSegNet_M result and (f) FgSegNet_S result.

The Modified FPM (M-FPM) Experiments: In this study, we demonstrate the effectiveness of the M-FPM module compared to the original FPM proposed in [16]. We again perform two sets of experiments; in the first setting, we use the proposed decoder with the M-FPM module, while in the second, we use the proposed decoder with the original FPM module. The experimental results are illustrated in Fig. 4 using a challenging scene where previous networks produce many false positives. As can be seen, (1) the proposed M-FPM module (Fig. 4, (c)) produces fewer false positives than the original FPM module (Fig. 4, (d)); (2) the proposed decoder (Fig. 4, (d)) is effective compared to the decoder of the FgSegNet family [16] (Fig. 4, (f)); (3) since FgSegNet_M [16] fuses and jointly learns features from a multi-input network, it is robust to camera movement (Fig. 4, (e)). In comparison, we empirically observe that the M-FPM module can mitigate the need for a multi-input network, which is computationally more expensive, by introducing multi-scale feature fusion instead, resulting in a robust feature pooling module. As can be seen from Fig. 4, the proposed method (Fig. 4, (c)) produces far fewer false positives than the FgSegNet family (Fig. 4, (e) and (f)) and improves over FgSegNet_M, FgSegNet_S and Cascade CNN [23] by 0.43, 0.56 and 5.92 points, respectively, in the PTZ category (see Table 3).



Table 2: The test results obtained by manually and randomly selecting 25 and 200 frames from the CDnet2014 dataset across 11 categories. Each row shows the average results of a category (25f / 200f).

Category    | FPR              | FNR             | Recall          | Precision       | PWC             | F-Measure
baseline    | 0.0002 / 0.00004 | 0.0100 / 0.0038 | 0.9900 / 0.9962 | 0.9942 / 0.9985 | 0.0480 / 0.0117 | 0.9921 / 0.9974
cameraJit   | 0.0004 / 0.00012 | 0.0419 / 0.0093 | 0.9581 / 0.9907 | 0.9907 / 0.9965 | 0.2271 / 0.0438 | 0.9740 / 0.9936
badWeather  | 0.0003 / 0.00009 | 0.0257 / 0.0215 | 0.9743 / 0.9785 | 0.9825 / 0.9911 | 0.0639 / 0.0295 | 0.9783 / 0.9848
dynamicBg   | 0.0001 / 0.00002 | 0.0315 / 0.0075 | 0.9685 / 0.9925 | 0.9655 / 0.9840 | 0.0325 / 0.0054 | 0.9665 / 0.9881
intermitt   | 0.0017 / 0.00015 | 0.0243 / 0.0104 | 0.9757 / 0.9896 | 0.9720 / 0.9976 | 0.3268 / 0.0707 | 0.9735 / 0.9935
lowFrameR.  | 0.0003 / 0.00008 | 0.2496 / 0.0956 | 0.7504 / 0.9044 | 0.7860 / 0.8782 | 0.1581 / 0.0299 | 0.7670 / 0.8897
nightVid.   | 0.0008 / 0.00022 | 0.1197 / 0.0363 | 0.8803 / 0.9637 | 0.9540 / 0.9861 | 0.3048 / 0.0802 | 0.9148 / 0.9747
PTZ         | 0.0002 / 0.00004 | 0.0870 / 0.0215 | 0.9130 / 0.9785 | 0.9776 / 0.9834 | 0.0892 / 0.0128 | 0.9423 / 0.9809
shadow      | 0.0003 / 0.0001  | 0.0203 / 0.0056 | 0.9797 / 0.9944 | 0.9911 / 0.9974 | 0.1159 / 0.0290 | 0.9853 / 0.9959
thermal     | 0.0009 / 0.00024 | 0.0456 / 0.0089 | 0.9544 / 0.9911 | 0.9815 / 0.9947 | 0.2471 / 0.0575 | 0.9677 / 0.9929
turbulence  | 0.0002 / 0.00011 | 0.0369 / 0.0221 | 0.9631 / 0.9779 | 0.9546 / 0.9747 | 0.0438 / 0.0232 | 0.9587 / 0.9762
Overall     | 0.0005 / 0.0001  | 0.0630 / 0.0220 | 0.9370 / 0.9780 | 0.9591 / 0.9802 | 0.1507 / 0.0358 | 0.9473 / 0.9789

Table 3: A comparison among 6 methods across 11 categories. Each row shows the average F-Measure of a method in each category and overall. Note that we consider all the frames in the ground-truths of the CDnet2014 dataset.

Method           | baseline | camJit | badWeat | dynaBg | intermit | lowFrame | nightVid | PTZ    | shadow | thermal | turbul. | Overall
FgSegNet_v2      | 0.9980   | 0.9961 | 0.9900  | 0.9950 | 0.9939   | 0.9579   | 0.9816   | 0.9936 | 0.9966 | 0.9942  | 0.9815  | 0.9890
FgSegNet_S [16]  | 0.9980   | 0.9951 | 0.9902  | 0.9952 | 0.9942   | 0.9511   | 0.9837   | 0.9880 | 0.9967 | 0.9945  | 0.9796  | 0.9878
FgSegNet_M [16]  | 0.9975   | 0.9945 | 0.9838  | 0.9939 | 0.9933   | 0.9558   | 0.9779   | 0.9893 | 0.9954 | 0.9923  | 0.9776  | 0.9865
Cascade CNN [23] | 0.9786   | 0.9758 | 0.9451  | 0.9658 | 0.8505   | 0.8804   | 0.8926   | 0.9344 | 0.9593 | 0.8958  | 0.9215  | 0.9272
DeepBS [26]      | 0.9580   | 0.8990 | 0.8647  | 0.8761 | 0.6097   | 0.5900   | 0.6359   | 0.3306 | 0.9304 | 0.7583  | 0.8993  | 0.7593
IUTIS-5 [12]     | 0.9567   | 0.8332 | 0.8289  | 0.8902 | 0.7296   | 0.7911   | 0.5132   | 0.4703 | 0.9084 | 0.8303  | 0.8507  | 0.7820

We also show the effectiveness of the proposed M-FPM in Fig. 1.

After evaluating the effectiveness of the extensions proposed in this work on the challenging subset of the CDnet2014 dataset, we perform further experiments using the proposed architecture configuration (Fig. 2). As mentioned above, labeling dense ground-truths requires considerable human effort; to reduce the labelling burden, [16, 23] use only a few training examples. We adopt the same idea for the 200-frame experiment; in this work, we further reduce the number of training examples by 8x, to 25 frames. Specifically, we perform two sets of experiments, using 25 frames and 200 frames, and report the test results in Table 2. As can be seen, for the 25-frame experiments we obtain an overall F-Measure of 0.9473 across 11 categories. Increasing the number of frames to 200 increases the F-Measure by 3.16 points compared to the 25-frame results. Similarly, the PWC decreases by 0.115 when we increase the number of frames from 25 to 200. Note that the training frames are not included in these evaluations; only the test frames are used. We use a threshold of 0.7 for the 25-frame experiments and 0.9 for the 200-frame experiments, since the network provides its best performance with these settings.

We further compare the results of the proposed method and the state-of-the-art methods in Table 3.

Table 4: A comparison with the state-of-the-art methods. These average results are obtained from the Change Detection 2014 Challenge.

Method           | Precision | Recall | PWC    | F-Measure
FgSegNet_v2      | 0.9823    | 0.9891 | 0.0402 | 0.9847
FgSegNet_S [16]  | 0.9751    | 0.9896 | 0.0461 | 0.9804
FgSegNet_M [16]  | 0.9758    | 0.9836 | 0.0559 | 0.9770
Cascade CNN [23] | 0.8997    | 0.9506 | 0.4052 | 0.9209
DeepBS [26]      | 0.8332    | 0.7545 | 1.9920 | 0.7458
IUTIS-5 [12]     | 0.8087    | 0.7849 | 1.1986 | 0.7717

Note that, to make the comparison consistent in terms of the number of frames (following changedetection.net), we tested our model using all the provided ground-truths in the CDnet2014 dataset. As can be seen, our method (FgSegNet_v2) improves over the current state-of-the-art methods by some margins. In particular, it improves significantly in the camera-motion categories discussed above, i.e. PTZ and cameraJitter, where the camera is shaking, panning, tilting or zooming around the scene. This eliminates the tradeoff of using multi-input feature fusion as in FgSegNet_M.

We further perform another experiment using the training frames provided by [23] and evaluate our model on the Change Detection 2014 Challenge (changedetection.net). The comparison is provided in Table 4. As can be seen, our method outperforms the existing state-of-the-art methods by some margins.



Fig. 5: Some comparisons among 5 methods. (a) input images, (b) ground-truths, (c) our segmentation results, (d) FgSegNet_S results, (e) FgSegNet_M results, (f) Cascade CNN results and (g) DeepBS results. # (frame number)

Specifically, among the deep learning based methods, it improves over FgSegNet_S, FgSegNet_M, Cascade CNN and DeepBS by 0.43, 0.77, 6.38 and 23.89 points, respectively. It also improves over all traditional methods by more than 21.3 points. Our method was ranked number 1 at the time of submission.

In order to display the generalization capability of all methods on some exemplary frames from the dataset, we provide segmentation results in two ways: first, we choose some segmentation results randomly within the range of the provided ground-truths (i.e., the intermitObjMotion and thermal categories); second, we randomly choose example frames from the test set for which the ground-truths are not publicly shared (i.e., the PTZ, nightVideo and badWeather categories). As can be seen from Fig. 5, our method gives good segmentation results, especially in the intermitObjMotion category, where a car that stopped in the scene for a long time suddenly starts moving; this is a case where most methods fail dramatically (e.g., the method in (g)). As for the test results where ground-truths are not available in Fig. 5 (PTZ, nightVideo, badWeather categories), our method generalizes well to completely unseen data where the challenging content of the scenes changes over time, and it produces good segmentation masks compared to the other methods.

Our method performs well in almost all categories, except the lowFrameRate category, where it performs poorly (F-Measure = 0.9579) compared to the other categories (see Table 3).



This low performance is primarily due to one challenging video sequence in the lowFrameRate category, which contains extremely small foreground objects in dynamic scenes with gradual illumination changes (Fig. 6). In this case, the network may pay more attention to the majority class (background) and less attention to the rare class (foreground), resulting in the misclassification of very small foreground objects. However, the proposed method still improves over the best existing method by some margins in this category. Furthermore, our method fails to detect objects blended into the scene (see Fig. 4); in such cases, an object is blended into the background completely, and it is hard even for a human to distinguish foreground from background.

Fig. 6: The video sequence in the lowFrameRate category on which our method performs poorly. The first, second and third columns show the input image, the ground-truth and our segmentation result, respectively.

5.2 Experiments on SBI2015 Dataset

We conduct additional experiments on the Scene Background Initialization 2015 (SBI2015) dataset [39], which contains 14 video sequences with ground-truth labels provided by [23]. We follow the same training protocol as in [16, 23], where 20% of the frames are used for training+validation and the remaining 80% for testing; if we denote the number of training examples as tn, then tn ∈ [2, 148] in these experiments.

The test results are shown in Table 5. As can be seen, our method outperforms the previous state-of-the-art methods by some margins. Concretely, it improves over the FgSegNet family by 0.22 and 0.59 points, and over Cascade CNN [23] by 9.21 points. Similarly, the PWC of our method is significantly lower than that of the other methods. We obtain the lowest performance on the Toscana sequence, with an F-Measure of 0.9291; this is primarily due to the very small number of training frames (only 2 frames for training+validation). We depict some exemplary results in Fig. 7. As can be seen, our method produces good segmentation masks, especially in the HighwayI video sequence, where hard shadows are eliminated completely.

Table 5: The test results on the SBI2015 dataset with a threshold of 0.3, and comparisons with the state-of-the-art methods.

Video Seq.        | FPR    | FNR    | F-Measure | PWC
Board             | 0.0009 | 0.0019 | 0.9979    | 0.1213
Candela m1.10     | 0.0003 | 0.0037 | 0.9950    | 0.0399
CAVIAR1           | 0.0001 | 0.0007 | 0.9988    | 0.0086
CAVIAR2           | 0.0001 | 0.0092 | 0.9834    | 0.0131
CaVignal          | 0.0027 | 0.0076 | 0.9859    | 0.3310
Foliage           | 0.0771 | 0.0207 | 0.9732    | 3.7675
HallAndMonitor    | 0.0002 | 0.0051 | 0.9926    | 0.0357
HighwayI          | 0.0008 | 0.0068 | 0.9924    | 0.1358
HighwayII         | 0.0001 | 0.0051 | 0.9952    | 0.0289
HumanBody2        | 0.0009 | 0.0079 | 0.9920    | 0.1636
IBMtest2          | 0.0007 | 0.0220 | 0.9817    | 0.1680
PeopleAndFoliage  | 0.0066 | 0.0102 | 0.9919    | 0.8468
Snellen           | 0.0211 | 0.0147 | 0.9644    | 1.7573
Toscana           | 0.0046 | 0.1155 | 0.9291    | 2.5901

FgSegNet_v2       | 0.0083 | 0.0165 | 0.9853    | 0.7148
FgSegNet_S [16]   | 0.0090 | 0.0146 | 0.9831    | 0.8524
FgSegNet_M [16]   | 0.0059 | 0.0310 | 0.9794    | 0.9431
Cascade CNN [23]  | –      | –      | 0.8932    | 5.5800

5.3 Experiments on UCSD Dataset

Similar to [16], we further evaluate our method on the UCSD Background Subtraction dataset [40], which contains 18 video sequences with ground-truth labels. This dataset contains highly dynamic backgrounds, which are extremely challenging, and the number of frames is relatively small compared to the CDnet2014 and SBI2015 datasets. We use the same training/testing splits provided by [16]: first, a 20% split is used for training, i.e. tn ∈ [3, 23], and 80% for testing; second, 50% is used for training, i.e. tn ∈ [7, 56], and the remaining 50% for testing. The test results are given in Table 6. Although there are very few training+validation frames, we obtain an average F-Measure of 0.8945 with the 20% split and 0.9203 with the 50% split. Our method produces results comparable to the previous methods, while the PWC decreases remarkably compared to them. We also depict some segmentation results in Fig. 8.

6 Conclusion

In this work, we propose a robust encoder-decoder network that can be trained end-to-end in a supervised manner. Our network is simple, yet it learns accurate foreground segmentation from a few training examples, which alleviates the ground-truth labeling burden.



Fig. 7: A comparison on the SBI2015 dataset. Each column shows raw images, the ground-truths, our segmentation results, FgSegNet_S and FgSegNet_M results, respectively.

Table 6: The test results on the UCSD dataset with a threshold of 0.6, and comparisons with the state-of-the-art methods (20% split / 50% split).

Video Seq.       | FPR             | FNR             | F-Measure       | PWC
birds            | 0.0025 / 0.0021 | 0.1423 / 0.1162 | 0.8649 / 0.8884 | 0.5205 / 0.4315
boats            | 0.0009 / 0.0008 | 0.0729 / 0.0382 | 0.9213 / 0.9437 | 0.1678 / 0.1213
bottle           | 0.0014 / 0.0009 | 0.0406 / 0.0476 | 0.9550 / 0.9605 | 0.2462 / 0.2134
chopper          | 0.0023 / 0.0017 | 0.0760 / 0.0810 | 0.9140 / 0.9232 | 0.3991 / 0.3544
cyclists         | 0.0030 / 0.0019 | 0.0738 / 0.0488 | 0.9213 / 0.9492 | 0.5382 / 0.3457
flock            | 0.0048 / 0.0036 | 0.0641 / 0.0375 | 0.9383 / 0.9591 | 0.9270 / 0.6179
freeway          | 0.0013 / 0.0028 | 0.3012 / 0.1349 | 0.7787 / 0.8394 | 0.5480 / 0.4635
hockey           | 0.0197 / 0.0138 | 0.0716 / 0.0527 | 0.9165 / 0.9400 | 2.8430 / 2.0359
jump             | 0.0066 / 0.0041 | 0.0746 / 0.0464 | 0.9358 / 0.9603 | 1.4309 / 0.8892
landing          | 0.0007 / 0.0006 | 0.0653 / 0.0559 | 0.9245 / 0.9388 | 0.1278 / 0.1031
ocean            | 0.0012 / 0.0010 | 0.1014 / 0.0607 | 0.8931 / 0.9243 | 0.2253 / 0.1615
peds             | 0.0048 / 0.0038 | 0.0955 / 0.0907 | 0.8776 / 0.8942 | 0.7499 / 0.6402
rain             | 0.0029 / 0.0025 | 0.1180 / 0.0558 | 0.9174 / 0.9534 | 1.0490 / 0.5881
skiing           | 0.0022 / 0.0018 | 0.0846 / 0.0586 | 0.9171 / 0.9385 | 0.4338 / 0.3251
surfers          | 0.0015 / 0.0012 | 0.1147 / 0.0764 | 0.8887 / 0.9173 | 0.2990 / 0.2232
surf             | 0.0008 / 0.0005 | 0.2941 / 0.2321 | 0.7307 / 0.7968 | 0.1778 / 0.1343
traffic          | 0.0017 / 0.0015 | 0.0899 / 0.0566 | 0.9070 / 0.9301 | 0.3186 / 0.2427
zodiac           | 0.0003 / 0.0002 | 0.0735 / 0.0815 | 0.8988 / 0.9086 | 0.0429 / 0.0383

FgSegNet_v2      | 0.0033 / 0.0025 | 0.1086 / 0.0762 | 0.8945 / 0.9203 | 0.6136 / 0.4405
FgSegNet_S [16]  | 0.0058 / 0.0039 | 0.0559 / 0.0544 | 0.8822 / 0.9139 | 0.7052 / 0.5024
FgSegNet_M [16]  | 0.0037 / 0.0027 | 0.0904 / 0.0714 | 0.8948 / 0.9203 | 0.6260 / 0.4637

Fig. 8: A comparison on the UCSD dataset. Each column shows raw images, the ground-truths, our segmentation results, FgSegNet_S and FgSegNet_M results, respectively.



We improve the original FPM module by fusing multi-scale features inside it, resulting in a module that is robust against camera motion and alleviates the need for training the network with multi-scale inputs. We further propose a simple decoder, which helps improve the performance. Our method requires neither post-processing to refine the segmentation results nor temporal data. The experimental results show that our network outperforms existing state-of-the-art methods on several benchmarks. As future work, we plan to incorporate temporal data and redesign the method so that it can learn from a very small number of examples.

Acknowledgements

We would like to thank the CDnet2014 benchmark [38] for making the segmentation masks of all methods publicly available, which allowed us to perform different types of comparisons. We also thank the authors of [23], who made their training frames publicly available for follow-up researchers.

References

[1] S. Brutzer, B. Hoferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1937–1944, IEEE, 2011.
[2] F. Porikli and O. Tuzel, "Human body tracking by adaptive background models and mean-shift analysis," in IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 1–9, 2003.
[3] S. Zhu and L. Xia, "Human action recognition based on fusion features extraction of adaptive background subtraction and optical flow model," Mathematical Problems in Engineering, vol. 2015, 2015.
[4] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, no. 6, pp. 976–990, 2010.
[5] S.-C. S. Cheung and C. Kamath, "Robust techniques for background subtraction in urban traffic video," in Proceedings of SPIE, vol. 5308, pp. 881–892, 2004.
[6] A. Basharat, A. Gritai, and M. Shah, "Learning object motion patterns for anomaly detection and improved object detection," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8, IEEE, 2008.
[7] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, vol. 2, pp. 246–252, IEEE, 1999.
[8] Z. Zivkovic, "Improved adaptive Gaussian mixture model for background subtraction," in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 2, pp. 28–31, IEEE, 2004.
[9] O. Barnich and M. Van Droogenbroeck, "ViBe: A universal background subtraction algorithm for video sequences," IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709–1724, 2011.
[10] M. Van Droogenbroeck and O. Paquot, "Background subtraction: Experiments and improvements for ViBe," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 32–37, IEEE, 2012.
[11] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," Video-Based Surveillance Systems, vol. 1, pp. 135–144, 2002.
[12] S. Bianco, G. Ciocca, and R. Schettini, "How far can you get by combining change detection algorithms?," in International Conference on Image Analysis and Processing, pp. 96–107, Springer, 2017.
[13] M. Hofmann, P. Tiefenbacher, and G. Rigoll, "Background segmentation with feedback: The pixel-based adaptive segmenter," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 38–43, IEEE, 2012.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[15] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision, pp. 818–833, Springer, 2014.
[16] L. A. Lim and H. Y. Keles, "Foreground segmentation using convolutional neural networks for multi-scale feature encoding," Pattern Recognition Letters, DOI:10.1016/j.patrec.2018.08.002, 2018.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[19] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
[20] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015.
[21] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," arXiv preprint arXiv:1511.00561, 2015.
[22] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
[23] Y. Wang, Z. Luo, and P.-M. Jodoin, "Interactive deep learning method for segmenting moving objects," Pattern Recognition Letters, vol. 96, pp. 66–75, 2017.
[24] D. Sakkos, H. Liu, J. Han, and L. Shao, "End-to-end video background subtraction with 3D convolutional neural networks," Multimedia Tools and Applications, pp. 1–19, 2017.
[25] M. Braham and M. Van Droogenbroeck, "Deep background subtraction with scene-specific convolutional neural networks," in Systems, Signals and Image Processing (IWSSIP), 2016 International Conference on, pp. 1–4, IEEE, 2016.
[26] M. Babaee, D. T. Dinh, and G. Rigoll, "A deep convolutional neural network for background subtraction," arXiv preprint arXiv:1702.01731, 2017.
[27] K. Lim, W.-D. Jang, and C.-S. Kim, "Background subtraction using encoder-decoder structured convolutional neural network," in Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on, pp. 1–6, IEEE, 2017.
[28] L. P. Cinelli, L. A. Thomaz, A. F. da Silva, E. A. da Silva, and S. L. Netto, "Foreground segmentation for anomaly detection in surveillance videos using deep residual networks," 2017.
[29] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[30] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[31] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[32] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," arXiv preprint arXiv:1802.02611, 2018.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[34] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[35] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656, 2015.
[36] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," CoRR, vol. abs/1607.08022, 2016.
[37] F. Chollet et al., "Keras." https://keras.io, 2015.
[38] Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar, "CDnet 2014: An expanded change detection benchmark dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 387–394, 2014.
[39] L. Maddalena and A. Petrosino, "Towards benchmarking scene background initialization," in International Conference on Image Analysis and Processing, pp. 469–476, Springer, 2015.
[40] V. Mahadevan and N. Vasconcelos, "Spatiotemporal saliency in dynamic scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 171–177, 2010.
