Semantic Segmentation
Dr. Eyal Gruss
Director of AI, Flatspace
Talpiyot
PhD Physics
Machine Learning • Researcher • Consultant • Entrepreneur
Digital Artist
For photorealistic VR experience
3D Model
Using deep neural networks
Architectural Interpretation
Bitmap Floorplan
An AI-powered service that creates a VR model from a simple floorplan.
Flatspace
Demo video: http://flatspace.xyz
Image Recognition (ImageNet ILSVRC)
1.2M train images, 100k test images, 1000 categories
[Chart: top-5 classification error by year]
• 2010: 28.19%
• 2011: 25.77%
• 2012: 16.42% (AlexNet; move to deep neural networks)
• 2013: 11.74%
• 2014: 6.66% (GoogLeNet)
• 2015: 3.57% (Microsoft Residual Net)
• 2016: 2.99% (Trimps-Soushen, Ministry of Public Security, China)
• 2017: 2.25% (Momenta/Oxford)
• Human level: 5.10% (Karpathy)
Object Detection and Recognition (ImageNet)
googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html (Szegedy et al., GoogLeNet)
Live: • VGG • YOLO • YOLO v2 • LeCun
Concurrence, Localization
Occlusion
Out of context
Counting
Tracking
Multi Instance Semantic Segmentation
Li et al., arxiv.org/abs/1611.07709
Won the COCO 2016 Detection Challenge (for segmentation)
Adversarial Perturbations Against Semantic Segmentation
Fischer et al., arxiv.org/abs/1703.01101
Xie et al., arxiv.org/abs/1703.08603
Metzen et al., arxiv.org/abs/1704.05712
Cisse et al., arxiv.org/abs/1707.05373
Other related tasks
• Edge detection
• Semantic edge detection
• Surface normals
• Matting / objectness (foreground/background)
• Saliency / memorability
• Pose estimation
• Depth estimation
• Optical flow interpolation and estimation
• Motion prediction
• E.g. Eigen and Fergus, UberNet, PixelNet combine several of the above
This talk: Semantic Segmentation
aka: scene labeling / scene parsing / dense prediction / dense labeling / pixel-level classification
[Figure panels: (d) input, (e) semantic segmentation, (f) naive instance segmentation, (g) instance segmentation]
Datasets and use cases
• General
  • Pascal VOC 2012
  • MS COCO (evaluation only for instance segmentation)
  • ADE20K / SceneParse150K (all pixels annotated)
  • DAVIS 2017 (video; review)
• Urban (e.g. for autonomous vehicles)
  • Cityscapes (all pixels annotated)
  • CMP Facades (strong priors)
  • KITTI road/lane
  • CamVid (all pixels annotated, video)
• Aerial / Satellite
  • ISPRS Potsdam and Vaihingen
  • DSTL Kaggle (multi-modal)
• Human parsing (LIP, MHP)
• Medical imaging (can be 2.5D/multi-view)
• More: riemenschneider.hayko.at/vision/dataset
Pascal VOC 2012 (train + validation): 11,530 images, 6,929 segmentations, 20 classes + background
github.com/nightrome/really-awesome-semantic-segmentation
Evaluation metrics
• Pixel accuracy (dominated by background class)
• Mean accuracy over classes (individual class recall does not penalize false pos; must include background class)
• Jaccard index = Intersection over Union (IoU) = |GT ∩ Pred| / |GT ∪ Pred| = TP / (TP + FN + FP)
  • IoU <= Recall = TP / GT and IoU <= Precision = TP / Pred
  • Usually: mean over classes, on the whole dataset
  • Can include or exclude the background class
  • Can be mean over images instead of whole dataset
  • Can be frequency weighted (unbalanced, similar to pixel accuracy)
  • Can be weighted by inverse instance size (Cityscapes, important in traffic use cases)
  • Can be averaged with e.g. pixel accuracy (ADE20K)
• Dice index = F1 score = 2 |GT ∩ Pred| / (|GT| + |Pred|) = 2TP / (2TP + FN + FP)
  • = Harmonic mean of Recall and Precision
  • = 2 IoU / (1 + IoU), monotonic with IoU
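A minimal NumPy sketch of the pixel-level metrics above, assuming integer label maps and averaging over the whole input. The function names and the toy arrays are illustrative only and do not come from any benchmark toolkit.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(gt, pred, num_classes):
    cm = confusion_matrix(gt, pred, num_classes).astype(float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp  # GT pixels of the class that were missed
    fp = cm.sum(axis=0) - tp  # pixels wrongly assigned to the class
    with np.errstate(divide="ignore", invalid="ignore"):
        recall = tp / (tp + fn)
        iou = tp / (tp + fn + fp)
        dice = 2 * tp / (2 * tp + fn + fp)
    return {
        "pixel_accuracy": tp.sum() / cm.sum(),
        "mean_accuracy": np.nanmean(recall),  # classes absent from GT are skipped
        "iou": iou,
        "dice": dice,
        "mean_iou": np.nanmean(iou),
    }

# Toy example: two classes (0 = background, 1 = object).
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])
m = segmentation_metrics(gt, pred, num_classes=2)
print(m["iou"], m["mean_iou"], m["pixel_accuracy"])
```

Whether the background class enters the mean, and whether the mean is taken per image or per dataset, is a benchmark choice; this sketch includes every class that appears.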
Evaluation metrics
• Trimap IoU around boundaries 4/8px (Krähenbühl and Koltun, Kohli et al.)
• Boundary F1 (BF) - Nearest boundary pixel distance (Csurka et al.)
  • For some distance error tolerance = e.g. 0.75% of the image diagonal
  • Can be averaged with IoU (DAVIS)
• Average precision (AP) = Area under the precision-recall curve (MS COCO)
  • Here precision, recall are instance-level given some IoU threshold e.g. 0.5
  • Can be additionally averaged over different thresholds (e.g. 0.5 - 0.95 in steps of 0.05)
  • Multiple detections (instance fragmentation) are counted as false positives beyond the best
  • Primary metric for instance segmentation (pixel-level metrics can be ambiguous)
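A rough sketch of instance-level AP at a single IoU threshold. It assumes the matching step is already done: each detection carries a flag saying whether it was the best match to a ground-truth instance at the chosen threshold, so duplicates are already flagged as false positives. It sums the raw precision-recall curve without interpolation, which is a simplification and not the exact MS COCO procedure. The function name and inputs are hypothetical.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the uninterpolated precision-recall curve."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # most confident first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_gt
    # Weight each precision value by the recall gained at that detection.
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * recall_steps))

# Three detections, two ground-truth instances; the second detection is a miss.
ap = average_precision(scores=[0.9, 0.8, 0.7],
                       is_true_positive=[True, False, True],
                       num_gt=2)
print(ap)
```

Averaging this value over IoU thresholds 0.5 to 0.95 would need the matching flags recomputed per threshold.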
Loss
• Cross entropy = - sum_{classes} sum_{pixels} p * log(q)
  • p = targets; q = output probabilities
  • Can be weighted by inverse class size
  • Can be weighted to emphasize areas around edges (U-Net, Meyer)
• IoU approximated with probabilities = sum_{classes} [(sum_{pixels} p * q) / (sum_{pixels} (p + q - p * q))]
  • Approximation is needed since IoU is discrete
  • Makes sense since this is our evaluation metric
  • Multiclass formulation is balanced over class sizes
  • Rediscovered in the literature from time to time [1-16]
  • Visualead reported mixed results
  • Loss =
    • - IoU [1 2 3 4 5 6]
    • - Dice [7 8 9 10]
    • - Tversky generalization [11]
    • 0.1 * CE + 0.9 * (1 - Dice) [12]
    • CE - log(IoU) [13]
    • Other approximations [14 15 16 (TBD in TF)]
• Total variation smoothing = sum_{classes} sum_{x,y} (|q_{x+1,y} - q_{x,y}| + |q_{x,y+1} - q_{x,y}|)
• Adversarial (later)
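The loss terms above can be sketched in plain NumPy as follows. This assumes p (one-hot targets) and q (softmax outputs) have shape (classes, H, W). The small eps is an added stabilizer that is not on the slide. In training these would be written in an autodiff framework, and the sketch only shows the arithmetic.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-7):
    # - sum over classes and pixels of p * log(q)
    return float(-np.sum(p * np.log(q + eps)))

def soft_iou(p, q, eps=1e-7):
    # Per-class intersection and union computed on probabilities.
    inter = np.sum(p * q, axis=(1, 2))
    union = np.sum(p + q - p * q, axis=(1, 2))
    return float(np.sum(inter / (union + eps)))  # summed over classes

def soft_dice(p, q, eps=1e-7):
    inter = np.sum(p * q, axis=(1, 2))
    total = np.sum(p + q, axis=(1, 2))
    return float(np.sum(2 * inter / (total + eps)))

def total_variation(q):
    # Absolute differences between horizontal and vertical neighbours.
    dx = np.abs(q[:, :, 1:] - q[:, :, :-1])
    dy = np.abs(q[:, 1:, :] - q[:, :-1, :])
    return float(dx.sum() + dy.sum())

# Two classes on a 2x2 image: top row is class 0, bottom row is class 1.
p = np.array([[[1.0, 1.0], [0.0, 0.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
q = np.full_like(p, 0.5)       # maximally uncertain prediction
iou_loss = -soft_iou(p, q)     # "Loss = - IoU" from the list above
print(iou_loss, cross_entropy(p, q), total_variation(p))
```

With a perfect prediction (q equal to p) the soft IoU reaches the number of classes, so the loss is bounded below by minus that count.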
Architectures
1. Patchwise CNN
2. FCN
3. DeepLab
4. DeconvNet
5. U-Net
6. SegNet
7. Dilated Convolutions (Yu and Koltun)
8. 100-Layer Tiramisu (DenseNets)
9. Wide ResNet
10. PSPNet
11. Adversarial
12. PolygonRNN
13. Mask R-CNN
14. Semi-supervised with unsupervised loss
Patchwise CNN
• Ning et al., http://yann.lecun.com/exdb/publis/pdf/ning-05.pdf
• Ciresan et al., people.idsia.ch/%7Ejuergen/nips2012.pdf
• A sliding window CNN classifies each pixel in turn
Fully Convolutional NN
• cs231n.github.io/convolutional-networks/#converting-fc-layers-to-conv-layers
• Start from a CNN classifier
• Convert fully connected to conv (with filter size = input volume, no padding):
  • CNN -> 7*7*512 -> fc(4096) -> 4096 -> fc(1000) -> 1000
  • CNN -> 7*7*512 -> conv(7*7*4096) -> 1*1*4096 -> conv(1*1*1000) -> 1*1*1000
• Can take arbitrarily larger input:
  • 224*224 -> 7*7*512 -> 1*1*1000
  • 384*384 -> 12*12*512 -> 6*6*1000
• Equivalent to sliding a patchwise CNN, but with a single pass that is much more efficient due to convolution sharing
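A small NumPy check of the fc-to-conv conversion described above, on toy sizes (7*7*4 features and 5 outputs instead of 7*7*512 and 4096). The helper name fc_as_conv is made up for this sketch. It reshapes the fully connected weights into k*k kernels and slides them without padding.

```python
import numpy as np

def fc_as_conv(features, fc_weights, k):
    """features: (H, W, C); fc_weights: (out, k*k*C) for flattened k*k*C inputs."""
    h, w, c = features.shape
    out = fc_weights.shape[0]
    kernels = fc_weights.reshape(out, k, k, c)  # same memory order as flatten
    result = np.empty((h - k + 1, w - k + 1, out))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = features[i:i + k, j:j + k, :]
            result[i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return result

rng = np.random.default_rng(0)
fc_weights = rng.normal(size=(5, 7 * 7 * 4))

# Input of the original size: the conv output is 1*1 and equals the fc output.
small = rng.normal(size=(7, 7, 4))
print(np.allclose(fc_as_conv(small, fc_weights, 7)[0, 0], fc_weights @ small.reshape(-1)))

# Larger input: a 12*12 feature map gives a 6*6 map of fc outputs.
large = rng.normal(size=(12, 12, 4))
dense_out = fc_as_conv(large, fc_weights, 7)
print(dense_out.shape)
```

Each location of the 6*6 map is what the original classifier would output on the corresponding 7*7 window of features.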
Deconvolution/Upconvolution Layers
• FC convolution transposed
• cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
• Fractionally strided convolution
• github.com/vdumoulin/conv_arithmetic
[Figure: convolution over the input with stride = 2 vs. stride = 1/2]
(Resolution Increasing Convolutions)
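A 1D NumPy sketch of a strided convolution and its transpose, to illustrate how the transposed (fractionally strided) version increases resolution. Function names are illustrative. Real layers are 2D, multi-channel and learned.

```python
import numpy as np

def conv1d(x, kernel, stride):
    """Valid (no padding) strided convolution, written as correlation."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(n_out)])

def conv1d_transpose(y, kernel, stride):
    """Each input value stamps a scaled copy of the kernel into the output."""
    k = len(kernel)
    out = np.zeros((len(y) - 1) * stride + k)
    for i, v in enumerate(y):
        out[i * stride:i * stride + k] += v * kernel
    return out

kernel = np.array([1.0, 2.0, 1.0])
x = np.arange(7, dtype=float)
down = conv1d(x, kernel, stride=2)           # length 7 -> 3
up = conv1d_transpose(down, kernel, stride=2)  # length 3 -> 7
print(len(down), len(up))
```

The transposed operation restores the spatial size but not the values; it is the adjoint of the strided convolution, not its inverse.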
Fully Convolutional Network (FCN; 2014-11)
• Long et al., arxiv.org/abs/1411.4038
• Shelhamer et al., arxiv.org/abs/1605.06211
• Start from classification CNN pre-trained on ImageNet (AlexNet/VGG-16/GoogLeNet) and convert fully connected to conv (conv7)
• Replace final layer to 1*1*21 and add bilinear upsampling to get full spatial output (FCN-32s)
• Add x2 deconv (initialized as bilinear) on conv7 and sum with conv prediction added to pool4
• Add bilinear upsampling to get full spatial output (FCN-16s). Fine tune from FCN-32s
• Do similarly for above fuse and pool3 (FCN-8s)
• Pascal VOC 2012 IoU = 62.2%-67.2% (up from 51.6%)
• 100-175 ms (vs. 50 s)
• 134M params
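The "initialized as bilinear" deconv above can be sketched as follows. This assumes the common construction of a bilinear interpolation kernel for an integer upsampling factor; the function name is illustrative and the weights would then be fine-tuned.

```python
import numpy as np

def bilinear_kernel(factor):
    """2D bilinear interpolation weights for upsampling by an integer factor."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return ((1 - np.abs(rows - center) / factor) *
            (1 - np.abs(cols - center) / factor))

kernel = bilinear_kernel(2)  # initial weights for the x2 deconv
print(kernel.shape)
print(kernel[1])
```

For a multi-class score map the same kernel would be placed on the diagonal of the (class, class) filter bank, so each class channel is upsampled independently at initialization.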
DeepLab (2014-12)
• Chen et al. (Google), arxiv.org/abs/1412.7062
• VGG-16 pre-trained on ImageNet -> fully conv
• Cancel last two max-pool
• Change conv after above to x2/x4 dilated convolutions
• Train with x8 subsampled targets (IoU<90.7%). Infer with bilinear upsampling.
• Fully connected CRF (raw image dependent potential) post-processing in inference (+ 3%-5%)
• Add multi-scale layers fine tuned separately (similar to FCN-8s but with concats and convs)
• Increase dilation for first FC layer to x12 (large field of view) + change FC kernel, filters
• 20.5M params
• Pascal VOC 2012 IoU = 71.6%
• V2: arxiv.org/abs/1606.00915 with ResNet-101 + “atrous spatial pyramid pooling”
  • Pascal VOC 2012 IoU = 79.7%, Cityscapes IoU = 70.4%
• V3: arxiv.org/abs/1706.05587
  • Pascal VOC 2012 IoU = 86.9%, Cityscapes IoU = 81.3% (SOTA 2017)
[Figure panels: before softmax, after softmax]
hole = atrous = dilated convolutions increase field of view without decreasing resolution, or adding parameters
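A 1D NumPy sketch of a dilated (atrous) convolution: the same three weights cover a wider span as the dilation grows, and the output keeps the input resolution up to border effects. The function name and the toy signal are illustrative.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid dilated convolution (written as correlation), stride 1."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # field of view of a single output
    n_out = len(x) - span + 1
    return np.array([np.dot(x[i:i + span:dilation], kernel)
                     for i in range(n_out)])

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])      # 3 parameters in both cases
dense = dilated_conv1d(x, kernel, dilation=1)   # field of view 3
atrous = dilated_conv1d(x, kernel, dilation=2)  # field of view 5
print(dense, atrous)
```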
DeconvNet (2015-05)
• Noh et al., arxiv.org/abs/1505.04366
• VGG-16 pre-trained on ImageNet
• Unpooling layers use saved max pooling indices
• Symmetric encoder-decoder: multiple deconvolutions + BatchNorm + ReLU (no dropout)
• Relies on region proposals. Training with two-stage curriculum learning:
  • 1. Instances cropped to GT bounding boxes * 1.2, all non-class pixels labeled as background
  • 2. Object proposals from edge-box * 1.2
• Inference:
  • Top 50 objectness score of 2000 edge-box object proposals, max per pixel/class before softmax
  • Fully connected CRF post-processing (+ ~1%)
• 252M params
• Pascal VOC 2012 IoU = 70.5%
• Ensemble with FCN-8s = 72.5%
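A sketch of the unpooling idea used here: max pooling remembers where each maximum came from, and unpooling writes the value back at that location and leaves zeros elsewhere. It assumes a single-channel map whose sides are divisible by the pool size. Function names are illustrative.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    h, w = x.shape
    # Group pixels into (h/size, w/size) blocks of size*size values each.
    blocks = (x.reshape(h // size, size, w // size, size)
               .transpose(0, 2, 1, 3)
               .reshape(h // size, w // size, size * size))
    return blocks.max(axis=-1), blocks.argmax(axis=-1)

def max_unpool(pooled, idx, size=2):
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, idx] = pooled  # put each max back at its saved position
    return (blocks.reshape(ph, pw, size, size)
                  .transpose(0, 2, 1, 3)
                  .reshape(ph * size, pw * size))

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 7, 0],
              [6, 0, 0, 0]])
pooled, idx = max_pool_with_indices(x)
unpooled = max_unpool(pooled, idx)
print(pooled)
print(unpooled)
```

The unpooled map is sparse, which is why the decoder follows each unpooling with learned deconvolutions that densify it.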
U-Net (2015-05)
• Ronneberger et al., arxiv.org/abs/1505.04597
• No VGG! Not pre-trained!
• Skip connections to keep resolution!
• Separate deconv to: learned 2x2 upconv + (3x3 regular conv + ReLU) * 2
• Weighting to emphasize areas around morphological edges
• Implementations I’ve seen use half the filters and padding
dropout
SegNet (2015-11)
• Badrinarayanan et al., arxiv.org/abs/1505.07293 arxiv.org/abs/1511.00561
• VGG-16 pre-trained on ImageNet (without fully connected layers)
• Unpooling layers use saved max pooling indices like in DeconvNet
• Deconvolutions + BatchNorm + ReLU (some dropout)
• They compare various decoders, and dropouts (arxiv.org/abs/1511.02680)
• Pascal VOC 2012 IoU = 59.9%
Dilated Convolutions (2015-11)
• Yu and Koltun, arxiv.org/abs/1511.07122
• Front-end network + Context aggregation network
• Front-end is a truncated VGG-16 like DeepLab + dilated convs, pre-trained on Pascal VOC 2012
• Context aggregation is a 7-layer uniform resolution dilated convs + ReLUs, with increasing dilations and initialized to unit filters
• Train with x8 subsampled targets. Front-end is trained first. Then context is added and trained with fixed front-end
• Possible post-processing with fully connected CRF / CRF-RNN
• Front-end alone: Pascal VOC 2012 IoU = 71.3%
• Front-end + Context + CRF-RNN: Pascal VOC 2012 IoU = 75.3%
• Dilation10: Cityscapes IoU = 67.1%
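A short calculation of how the receptive field grows in a stack of dilated 3x3 convolutions. The dilation schedule 1, 1, 2, 4, 8, 16, 1 is an assumption taken from the basic context module in the Yu and Koltun paper and is not stated on the slide.

```python
def receptive_fields(dilations, kernel_size=3):
    """Receptive field (one side, in pixels) after each stride-1 dilated layer."""
    rf, fields = 1, []
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * dilation
        fields.append(rf)
    return fields

# Assumed schedule for the seven 3x3 layers of the context module.
fields = receptive_fields([1, 1, 2, 4, 8, 16, 1])
print(fields)

# Same depth without dilation, for comparison.
print(receptive_fields([1] * 7))
```

The doubling dilations give a roughly exponential growth of the field of view at constant resolution and constant parameters per layer.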
The One Hundred Layers Tiramisu (2016-11)
• Jegou et al., arxiv.org/abs/1611.09326
• DenseNets (few params, easy training)
• Encoder-Decoder with skip connections
• 56 – 103 layers
• 1.5M – 9.4M params
• No pre-training
• No / negative results on large benchmarks
Wide ResNet (2016-11)
• Wu et al., arxiv.org/abs/1611.10080
• Wider or Deeper ResNets? Wider!
  • See also Littwin and Wolf, arxiv.org/abs/1611.02525
• Wide 7-block ResNet pre-trained for classification, adapted to dilated a la DeepLab
• Pascal VOC 2012 IoU = 82.5%
• Cityscapes IoU = 78.4%
• ADE20K avg(pixel acc., IoU) = 56.74%
Pyramid Scene Parsing (PSPNet; 2016-12)
• Zhao et al., arxiv.org/abs/1612.01105 (trained models: Caffe, Keras)
• Pre-trained dilated 101-269 ResNet + deep supervision auxiliary loss + pyramid pooling module
• Pascal VOC 2012 IoU = 85.4% (1st place 2016)
• Cityscapes IoU = 80.2% (1st place 2016). Video
• ADE20K avg(pixel acc., IoU) = 57.21% (1st place 2016)
SOTA! (2016)
Mismatched Relationship
Confusion Categories
Inconspicuous Classes
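A simplified NumPy sketch of a pyramid pooling module. It assumes bin sizes 1, 2, 3 and 6 (as in the PSPNet paper, not stated on the slide) and a feature map whose sides are divisible by each bin size. It uses nearest-neighbour upsampling instead of bilinear and omits the 1x1 channel-reduction convolutions, so it only shows the pooling and concatenation pattern.

```python
import numpy as np

def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6)):
    """feat: (C, H, W) -> (C * (1 + len(bin_sizes)), H, W)."""
    c, h, w = feat.shape
    branches = [feat]
    for b in bin_sizes:
        # Average-pool into a b x b grid of regions.
        pooled = feat.reshape(c, b, h // b, b, w // b).mean(axis=(2, 4))
        # Upsample back to H x W (nearest neighbour in this sketch).
        up = pooled.repeat(h // b, axis=1).repeat(w // b, axis=2)
        branches.append(up)
    return np.concatenate(branches, axis=0)

feat = np.arange(72, dtype=float).reshape(2, 6, 6)
out = pyramid_pooling(feat)
print(out.shape)
```

The 1x1-bin branch carries a global average to every pixel, which is the kind of scene-level context that helps with the mismatched, confused and inconspicuous cases listed above.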
Generative Adversarial Nets
Goodfellow et al., arxiv.org/abs/1406.2661
Generator (Creator)
Discriminator (Curator)
Fake or Real?
Fake
Real
Image to Image Translation With Conditional Adversarial Networks (PatchGAN)
Isola et al., phillipi.github.io/pix2pix
Interactive: affinelayer.com/pixsrv
Guide: ml4a.github.io/guides/Pix2Pix
fotogenerator.npocloud.nl
Adversarial (2016-09)
• Idea is to regulate naturalness (strong and smooth classes, sharp boundaries, denoising, global structure)
• David Golan et al. (2016-09, first one AFAIK)
• Pix2pix, Isola et al., arxiv.org/abs/1611.07004
  • Generator is U-Net style (with skip connections)
  • 4x4 conv with stride 2 - BatchNorm - ReLU (+ some dropout). No max-pooling.
  • L1 loss for generator
  • “PatchGAN” discriminator takes both image and segmentation, averages over 70x70 patches
  • Adversarial loss hurts! Cityscapes IoU = 29%
  • L1 only: Cityscapes IoU = 35%
• FAIR, Luc et al., arxiv.org/abs/1611.08408
  • Generator is Yu and Koltun’s Dilated8
  • Cross-entropy loss for generator
  • Discriminator issue: we feed it continuous probabilities (cannot do SGD with discrete labels), but GT are discrete
    • Tested product with image and scaling GT, as alternative input to discriminator, but results were the same
  • Pascal VOC 2012 IoU = 73.3% (compare to Yu and Koltun’s 71.3%). Adversarial ~ 2%
  • Several citations using this
PolygonRNN (2017-04)
• Castrejon et al. CSC2523_Project_Report, arxiv.org/abs/1704.05548
• Sparse representation using polygons
• Cityscapes IoU = 61.4% per instance, assuming given bounding boxes
• Can speed up manual annotation
• CVPR 2017 Best Paper Honorable Mention Award (video)
(ConvLSTM)
Mask R-CNN (2017-03)
• He et al., arxiv.org/abs/1703.06870 (tutorial)
• Instance segmentation SOTA
Semi-Supervised Semantic Segmentation with Unsupervised Total Variation Loss
• Javanmard et al., arxiv.org/abs/1605.01368
[Figure panels: Supervised (10 pix/image), Proposed (10 pix/image), Full labels, GT]
Meta references
• Janai et al., arxiv.org/abs/1704.05519 (chapter 6)
• Garcia-Garcia et al., arxiv.org/abs/1704.06857
• meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years
• blog.qure.ai/notes/semantic-segmentation-deep-learning-review
• handong1587.github.io/deep_learning/2015/10/09/segmentation
• github.com/kjw0612/awesome-deep-vision#semantic-segmentation
• github.com/mrgloom/Semantic-Segmentation-Evaluation
• github.com/fchollet/keras/issues/6538