
Deep learning

8.4. Networks for semantic segmentation

François Fleuret

https://fleuret.org/dlc/

The historical approach to image segmentation was to define a measure of similarity between pixels, and to cluster groups of similar pixels. Such approaches account poorly for semantic content.

The deep-learning approach re-casts semantic segmentation as pixel classification, and re-uses networks trained for image classification by making them fully convolutional.

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 1 / 8

Shelhamer et al. (2016) proposed the FCN (“Fully Convolutional Network”) that uses a pre-trained classification network (e.g. VGG 16 layers).

The fully connected layers are converted to 1 × 1 convolutional filters, and the final one retrained for 21 output channels (VOC 20 classes + “background”).

Since VGG16 has 5 max-pooling layers with 2 × 2 kernels, with proper padding, the output is 1/2⁵ = 1/32 the size of the input.

This map is then up-scaled with a transposed convolution layer with kernel 64 × 64 and stride 32 × 32 to get a final map of the same size as the input image.

Training is achieved with full images and a pixel-wise cross-entropy loss, starting with a pre-trained VGG16. All layers are fine-tuned, although fixing the up-scaling transposed convolution to bilinear interpolation does just as well.
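To make this concrete, here is a minimal PyTorch sketch of an FCN-32s-style model built on torchvision's VGG16. This is my own simplified illustration, not the original implementation; the `weights` argument and the exact kernel/padding choices are assumptions and may need adapting to your torchvision version.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FCN32s(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # Keep only the convolutional part of VGG16 (its output is 1/32 of the input size).
        self.features = vgg16(weights="IMAGENET1K_V1").features
        # Replace the fully connected layers with 1x1 convolutions ("fc-conv").
        self.head = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),
        )
        # Up-scale the 1/32 class maps back to the input resolution (deconv x32).
        self.upscale = nn.ConvTranspose2d(num_classes, num_classes,
                                          kernel_size=64, stride=32,
                                          padding=16, bias=False)

    def forward(self, x):
        y = self.head(self.features(x))
        return self.upscale(y)

model = FCN32s()
x = torch.randn(1, 3, 224, 224)
logits = model(x)                                  # (1, 21, 224, 224)
# Training would use a pixel-wise cross-entropy, e.g.
# loss = nn.CrossEntropyLoss()(logits, target)     # target: (1, 224, 224) class indices
```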

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 2 / 8

Notes

The “background” class is added for pixels that do not belong to any of the defined object classes, which avoids forcing the network to make an inconsistent choice.

Since segmentation aims at classifying the individual pixels, the final tensor should be of the same size as the input image. Since the activation maps have been reduced by the pooling operations, their size has to be increased back.

As seen in lecture 7.1 “Transposed convolutions”, up-sampling an activation map can be done with bilinear interpolation, or with transposed convolution layers.
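To make the remark about fixing the up-scaling to bilinear concrete, here is a small sketch (my own illustration, not the lecture's code) of a transposed convolution whose kernel is initialized to, and optionally frozen at, bilinear interpolation weights:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    # Standard bilinear up-sampling weights, one identical kernel per channel.
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size).float()
    filt_1d = 1 - torch.abs(og - center) / factor
    filt = filt_1d[:, None] * filt_1d[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = filt
    return weight

up = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, padding=16, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(21, 64))
up.weight.requires_grad_(False)  # freeze, i.e. "fixing the up-scaling to bilinear"

x = torch.randn(1, 21, 7, 7)
print(up(x).shape)  # torch.Size([1, 21, 224, 224])
```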

[Architecture diagram (FCN-32s): VGG16 trunk: 2× conv/relu + maxpool (3d input → 64d, 1/2 resolution); 2× conv/relu + maxpool (→ 128d, 1/4); 3× conv/relu + maxpool (→ 256d, 1/8); 3× conv/relu + maxpool (→ 512d, 1/16); 3× conv/relu + maxpool (→ 512d, 1/32); 2× fc-conv/relu (→ 4096d, 1/32); fc-conv (→ 21d, 1/32); deconv ×32 (→ 21d, full resolution).]

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 2 / 8

Notes

The last fc-conv produces the prediction for the 21 classes from activation maps that are 1/32 of the original image size. These classification maps are then up-sampled by a factor of 32, back to the original image size, by the transposed convolution layer.

Although the FCN achieved almost state-of-the-art results when published, its main weakness is the coarseness of the signal from which the final output is produced (1/32 of the original resolution).

Shelhamer et al. proposed an additional element, that consists of using the same prediction/up-scaling from intermediate layers of the VGG network.

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 3 / 8

[Architecture diagram (skip version, FCN-8s): same VGG16 trunk as above, down to 4096d activation maps at 1/32 resolution. An fc-conv produces 21d class maps at 1/32, which a deconv ×2 brings to 1/16; these are summed with 21d fc-conv predictions computed from the 1/16 feature maps; the sum is brought to 1/8 by another deconv ×2 and summed with 21d fc-conv predictions computed from the 1/8 feature maps; a final deconv ×8 brings the 21d maps to the full resolution.]

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 3 / 8

Notes

The coarseness of the prediction is reduced by adding intermediate predictions, which are less refined in terms of features, but of greater resolution.
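Below is a minimal sketch of this kind of multi-resolution fusion (my own simplified illustration, not the original FCN-8s code): class maps predicted at 1/32, 1/16 and 1/8 of the input resolution are progressively up-sampled and summed before a final ×8 up-scaling. The channel counts and kernel sizes are assumptions chosen to match a VGG16 trunk.

```python
import torch
import torch.nn as nn

class SkipFusionHead(nn.Module):
    """Fuses per-class score maps computed at 1/32, 1/16 and 1/8 resolution."""
    def __init__(self, c32, c16, c8, num_classes=21):
        super().__init__()
        # One 1x1 "fc-conv" score layer per feature resolution.
        self.score32 = nn.Conv2d(c32, num_classes, kernel_size=1)
        self.score16 = nn.Conv2d(c16, num_classes, kernel_size=1)
        self.score8  = nn.Conv2d(c8,  num_classes, kernel_size=1)
        # Learned x2 and x8 up-scalings (transposed convolutions).
        self.up2a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1, bias=False)
        self.up2b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1, bias=False)
        self.up8  = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4, bias=False)

    def forward(self, f32, f16, f8):
        s = self.up2a(self.score32(f32)) + self.score16(f16)   # 1/16 resolution
        s = self.up2b(s) + self.score8(f8)                     # 1/8 resolution
        return self.up8(s)                                     # full resolution

head = SkipFusionHead(c32=512, c16=512, c8=256)
f8  = torch.randn(1, 256, 28, 28)   # 1/8 feature maps of a 224x224 input
f16 = torch.randn(1, 512, 14, 14)   # 1/16 feature maps
f32 = torch.randn(1, 512, 7, 7)     # 1/32 feature maps
print(head(f32, f16, f8).shape)     # torch.Size([1, 21, 224, 224])
```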

[Fig. 6 from Shelhamer et al. (2016); columns: FCN-8s, SDS [14], Ground Truth, Image.]

Fig. 6. Fully convolutional networks improve performance on PASCAL. The left column shows the output of our most accurate net, FCN-8s. The second shows the output of the previous best method by Hariharan et al. [14]. Notice the fine structures recovered (first row), ability to separate closely interacting objects (second row), and robustness to occluders (third row). The fifth and sixth rows show failure cases: the net sees lifejackets in a boat as people and confuses human hair with a dog.

6 ANALYSIS

We examine the learning and inference of fully convolutional networks. Masking experiments investigate the role of context and shape by reducing the input to only foreground, only background, or shape alone. Defining a “null” background model checks the necessity of learning a background classifier for semantic segmentation. We detail an approximation between momentum and batch size to further tune whole image learning. Finally, we measure bounds on task accuracy for given output resolutions to show there is still much to improve.

6.1 Cues

Given the large receptive field size of an FCN, it is natural to wonder about the relative importance of foreground and background pixels in the prediction. Is foreground appearance sufficient for inference, or does the context influence the output? Conversely, can a network learn to recognize a class by its shape and context alone?

Masking To explore these issues we experiment with masked versions of the standard PASCAL VOC segmentation challenge. We both mask input to networks trained on normal PASCAL, and learn new networks on the masked PASCAL. See Table 8 for masked results.

TABLE 8
The role of foreground, background, and shape cues. All scores are the mean intersection over union metric excluding background. The architecture and optimization are fixed to those of FCN-32s (Reference) and only input masking differs.

              train        test
              FG    BG     FG    BG     mean IU
Reference     keep  keep   keep  keep   84.8
Reference-FG  keep  keep   keep  mask   81.0
Reference-BG  keep  keep   mask  keep   19.8
FG-only       keep  mask   keep  mask   76.1
BG-only       mask  keep   mask  keep   37.8
Shape         mask  mask   mask  mask   29.1

Masking the foreground at inference time is catastrophic. However, masking the foreground during learning yields a network capable of recognizing object segments without observing a single pixel of the labeled class. Masking the background has little effect overall but does lead to class confusion in certain cases. When the background is masked during both learning and inference, the network unsurprisingly achieves nearly perfect background accuracy; however certain classes are more confused. All-in-all this suggests that FCNs do incorporate context even though decisions are driven by foreground pixels.

To separate the contribution of shape, we learn a net restricted to the simple input of foreground/background masks. The accuracy in this shape-only condition is lower than when only the foreground is masked, suggesting that the net is capable of learning context to boost recognition. Nonetheless, it is surprisingly accurate. See Figure 7.

Background modeling It is standard in detection and semantic segmentation to have a background model. This model usually takes the same form as the models for the classes of interest, but is supervised by negative instances. In our experiments we have followed the same approach, learning parameters to score all classes including background. Is this actually necessary, or do class models suffice?

To investigate, we define a net with a “null” background model that gives a constant score of zero. Instead of training with the softmax loss, which induces competition by normalizing across classes, we train with the sigmoid cross-entropy loss, which independently normalizes each score. For inference each pixel is assigned the highest scoring class. In all other respects the experiment is identical to our FCN-32s on PASCAL VOC. The null background net scores 1 point lower than the reference FCN-32s and a control FCN-32s trained on all classes including background with the sigmoid cross-entropy loss. To put this drop in perspective, note that discarding the background model in this way reduces the total number of parameters by less than 0.1%. Nonetheless, this result suggests that learning a dedicated background model for semantic segmentation is not vital.
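A rough sketch of such a “null” background setup (my own illustration of the idea, not the authors' code): only the 20 object classes get learned scores, each supervised with an independent sigmoid cross-entropy, and the background is a fixed zero score appended at inference time.

```python
import torch
import torch.nn.functional as F

def null_background_loss(scores, target):
    # scores: (B, 20, H, W) object-class scores; target: (B, H, W) with 0 = background,
    # 1..20 = object classes. Each class is supervised independently with a sigmoid
    # cross-entropy, and the background has no learned score at all.
    onehot = F.one_hot(target, num_classes=21)[..., 1:].permute(0, 3, 1, 2).float()
    return F.binary_cross_entropy_with_logits(scores, onehot)

def null_background_predict(scores):
    # Append a constant zero score for the background, then take the per-pixel argmax.
    zeros = torch.zeros_like(scores[:, :1])
    return torch.cat([zeros, scores], dim=1).argmax(dim=1)   # (B, H, W), 0 = background

scores = torch.randn(2, 20, 32, 32)
target = torch.randint(0, 21, (2, 32, 32))
loss = null_background_loss(scores, target)
pred = null_background_predict(scores)                       # (2, 32, 32)
```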

6.2 Momentum and batch size

In comparing optimization schemes for FCNs, we find that “heavy” online learning with high momentum trains more accurate models in less wall clock time (see Section 4.2). Here we detail a relationship between momentum and batch size that motivates heavy learning.

Left column is the best network from Shelhamer et al. (2016).

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 4 / 8

Fig. 7. FCNs learn to recognize by shape when deprived of other input detail. From left to right: regular image (not seen by network), ground truth, output, mask input.

By writing the updates computed by gradient accumulation as a non-recursive sum, we will see that momentum and batch size can be approximately traded off, which suggests alternative training parameters. Let g_t be the step taken by minibatch SGD with momentum at time t,

g_t = −η ∑_{i=0}^{k−1} ∇_θ ℓ(x_{kt+i}; θ_{t−1}) + p·g_{t−1},

where ℓ(x; θ) is the loss for example x and parameters θ, p < 1 is the momentum, k is the batch size, and η is the learning rate. Expanding this recurrence as an infinite sum with geometric coefficients, we have

g_t = −η ∑_{s=0}^{∞} ∑_{i=0}^{k−1} p^s ∇_θ ℓ(x_{k(t−s)+i}; θ_{t−s}).

In other words, each example is included in the sum with coefficient p^⌊j/k⌋, where the index j orders the examples from most recently considered to least recently considered. Approximating this expression by dropping the floor, we see that learning with momentum p and batch size k appears to be similar to learning with momentum p′ and batch size k′ if p^(1/k) = p′^(1/k′). Note that this is not an exact equivalence: a smaller batch size results in more frequent weight updates, and may make more learning progress for the same number of gradient computations. For typical FCN values of momentum 0.9 and a batch size of 20 images, an approximately equivalent training regime uses momentum 0.9^(1/20) ≈ 0.99 and a batch size of one, resulting in online learning. In practice, we find that online learning works well and yields better FCN models in less wall clock time.
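As a quick numerical check of this trade-off (my own illustration, not from the paper):

```python
# Equivalent momentum for batch size 1, starting from momentum 0.9 with batch size 20,
# using the approximate rule p**(1/k) == p_prime**(1/k_prime), i.e. p_prime = p**(k_prime/k).
p, k, k_prime = 0.9, 20, 1
p_prime = p ** (k_prime / k)
print(p_prime)  # ~0.9947, i.e. roughly the 0.99 quoted in the text
```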

6.3 Upper bounds on IU

FCNs achieve good performance on the mean IU segmentation metric even with spatially coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various resolutions. We do this by downsampling ground truth images and then upsampling back to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.

factor   mean IU
128      50.9
 64      73.3
 32      86.1
 16      92.8
  8      96.4
  4      98.5

Pixel-perfect prediction is clearly not necessary to achieve mean IU well above state-of-the-art, and, conversely, mean IU is not a good measure of fine-scale accuracy. The gaps between oracle and state-of-the-art accuracy at every stride suggest that recognition and not resolution is the bottleneck for this metric.
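A sketch of how such an oracle bound could be computed (my own illustration of the procedure described above, with hypothetical helper names; not the authors' evaluation code):

```python
import torch
import torch.nn.functional as F

def oracle_prediction(gt, factor):
    # Simulate the best prediction achievable at a coarse output resolution by
    # down-sampling the ground-truth label map and up-sampling it back, both with
    # nearest-neighbor interpolation (labels are categorical).
    gt = gt[None, None].float()                                    # (1, 1, H, W)
    h, w = gt.shape[-2:]
    coarse = F.interpolate(gt, size=(max(1, h // factor), max(1, w // factor)), mode="nearest")
    return F.interpolate(coarse, size=(h, w), mode="nearest")[0, 0].long()

def mean_iu(pred, gt, num_classes=21):
    # Mean intersection-over-union over the classes present in prediction or ground truth.
    ius = []
    for c in range(num_classes):
        inter = ((pred == c) & (gt == c)).sum().item()
        union = ((pred == c) | (gt == c)).sum().item()
        if union > 0:
            ius.append(inter / union)
    return sum(ius) / len(ius)

gt = torch.randint(0, 21, (256, 256))          # a dummy label map
print(mean_iu(oracle_prediction(gt, 32), gt))  # oracle mean IU at 1/32 resolution
```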

7 CONCLUSION

Fully convolutional networks are a rich class of models that address many pixelwise tasks. FCNs for semantic segmentation dramatically improve accuracy by transferring pre-trained classifier weights, fusing different layer representations, and learning end-to-end on whole images. End-to-end, pixel-to-pixel operation simultaneously simplifies and speeds up learning and inference. All code for this paper is open source in Caffe, and all models are freely available in the Caffe Model Zoo. Further works have demonstrated the generality of fully convolutional networks for a variety of image-to-image tasks.


Results with a network trained from mask only (Shelhamer et al., 2016).

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 5 / 8

The most sophisticated object detection methods achieve instance segmentation and estimate a segmentation mask per detected object.

Mask R-CNN (He et al., 2017) adds a branch to the Faster R-CNN model to estimate a mask for each detected region of interest.
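For reference, a pre-trained Mask R-CNN is available in torchvision; here is a minimal usage sketch, assuming a recent torchvision version (the exact weight argument differs across releases, so treat the call below as an assumption to check against your installation):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a Mask R-CNN with a ResNet-50-FPN backbone pre-trained on COCO.
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)          # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    output = model([image])[0]           # one dict per input image

# 'boxes' (N, 4), 'labels' (N,), 'scores' (N,) as in Faster R-CNN,
# plus 'masks' (N, 1, H, W): one soft mask per detected instance.
print(output["boxes"].shape, output["masks"].shape)
```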

Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

Facebook AI Research (FAIR)

Abstract

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron.

1. Introduction

The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN [12, 36] and Fully Convolutional Network (FCN) [30] frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.

Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances.¹ Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.

[Figure 1. The Mask R-CNN framework for instance segmentation. Diagram labels: RoIAlign, class, box, conv.]

Our method, called Mask R-CNN, extends Faster R-CNN [36] by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.

In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being [...]

¹ Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection.


(He et al., 2017)

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 6 / 8

Notes

Instance segmentation aims not only at classifying the individual pixels in the image, but also at identifying the instance of the class when the same object class is present multiple times, e.g. “car #1”, “car #2”.


Figure 5. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).

                      backbone               AP    AP50  AP75  APS   APM   APL
MNC [10]              ResNet-101-C4          24.6  44.3  24.8  4.7   25.9  43.6
FCIS [26] +OHEM       ResNet-101-C5-dilated  29.2  49.5  -     7.1   31.3  50.0
FCIS+++ [26] +OHEM    ResNet-101-C5-dilated  33.6  54.5  -     -     -     -
Mask R-CNN            ResNet-101-C4          33.1  54.9  34.8  12.1  35.6  51.1
Mask R-CNN            ResNet-101-FPN         35.7  58.0  37.8  15.5  38.1  52.4
Mask R-CNN            ResNeXt-101-FPN        37.1  60.0  39.4  16.9  39.9  53.5

Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [38]. All entries are single-model results.

[The mask branch] can predict K masks per RoI, but we only use the k-th mask, where k is the class predicted by the classification branch. The m×m floating-number mask output is then resized to the RoI size, and binarized at a threshold of 0.5.

Note that since we only compute masks on the top 100 detection boxes, Mask R-CNN adds a small overhead to its Faster R-CNN counterpart (e.g., ∼20% on typical models).
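A small sketch of this mask post-processing step (my own illustration of the procedure described above, with simplified shapes and a hypothetical helper name; the real implementation also maps the mask back into the full image):

```python
import torch
import torch.nn.functional as F

def postprocess_masks(mask_logits, labels, roi_sizes, threshold=0.5):
    # mask_logits: (N, K, m, m) raw mask outputs, one m x m map per class and per RoI.
    # labels:      (N,) class predicted by the classification branch for each RoI.
    # roi_sizes:   list of (h, w) sizes of the detected boxes.
    binary_masks = []
    for i, (h, w) in enumerate(roi_sizes):
        # Keep only the mask of the predicted class k.
        m = mask_logits[i, labels[i]][None, None]                # (1, 1, m, m)
        # Resize the floating-point mask to the RoI size, then binarize at 0.5.
        m = torch.sigmoid(F.interpolate(m, size=(h, w), mode="bilinear", align_corners=False))
        binary_masks.append(m[0, 0] > threshold)
    return binary_masks

masks = postprocess_masks(torch.randn(2, 80, 28, 28), torch.tensor([3, 17]), [(120, 60), (200, 90)])
print(masks[0].shape, masks[0].dtype)  # torch.Size([120, 60]) torch.bool
```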

4. Experiments: Instance Segmentation

We perform a thorough comparison of Mask R-CNN to the state of the art along with comprehensive ablations on the COCO dataset [28]. We report the standard COCO metrics including AP (averaged over IoU thresholds), AP50, AP75, and APS, APM, APL (AP at different scales). Unless noted, AP is evaluated using mask IoU. As in previous work [5, 27], we train using the union of 80k train images and a 35k subset of val images (trainval35k), and report ablations on the remaining 5k val images (minival). We also report results on test-dev [28].

4.1. Main Results

We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of our model outperform baseline variants of previous state-of-the-art models. This includes MNC [10] and FCIS [26], the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN with the ResNet-101-FPN backbone outperforms FCIS+++ [26], which includes multi-scale train/test, horizontal flip test, and online hard example mining (OHEM) [38]. While outside the scope of this work, we expect many such improvements to be applicable to ours.

Mask R-CNN outputs are visualized in Figures 2 and 5. Mask R-CNN achieves good results even under challenging conditions. In Figure 6 we compare our Mask R-CNN baseline and FCIS+++ [26]. FCIS+++ exhibits systematic artifacts on overlapping instances, suggesting that it is challenged by the fundamental difficulty of instance segmentation. Mask R-CNN shows no such artifacts.


(He et al., 2017)

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 7 / 8

It is noteworthy that for detection and semantic segmentation, there is a heavy re-use of large networks trained for classification.

The models themselves, as much as the source code of the algorithm that produced them, or the training data, are generic and re-usable assets.

François Fleuret Deep learning / 8.4. Networks for semantic segmentation 8 / 8

References

K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In International Conference on Computer Vision, pages 2980–2988, 2017.

E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1605.06211, 2016.