Download - Semantic Segmentation - NVIDIAon-demand.gputechconf.com/gtc-il/2017/presentation/sil7145-eyal-gr… · This talk: Semantic Segmentation aka: scene labeling / scene parsing / dense

Semantic Segmentation

Dr. Eyal Gruss

Director of AI, Flatspace

Talpiyot

PhD Physics

Machine Learning • Researcher • Consultant • Entrepreneur

Digital Artist

Eyal Gruss

For photorealistic VR experience

3D Model

Using deep neural networks

Architectural Interpretation

Bitmap Floorplan

An AI-powered service that creates a VR model from a simple floorplan.

Flatspace

Demo video: http://flatspace.xyz


Chart: top-5 classification error by year (y-axis from 0% to 30%)

2010: 28.19%, 2011: 25.77%, 2012: 16.42%, 2013: 11.74%, 2014: 6.66%, 2015: 3.57%, 2016: 2.99%, 2017: 2.25%, human level: 5.10%

Move to deep neural networks: AlexNet

Image Recognition (ImageNet ILSVRC)

GoogLeNet

Microsoft Residual Net

1.2M train images, 100k test images, 1000 categories

Trimps-Soushen, Ministry of Public Security, China

Karpathy

Momenta/Oxford

http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

http://www.image-net.org/challenges/LSVRC/

http://arxiv.org/abs/1409.4842




https://www.youtube.com/watch?v=NaoVOOhVC3w

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

https://arxiv.org/abs/1709.01507

Object Detection and Recognition (ImageNet)

googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html (Szegedy et al., GoogLeNet)

Live: • VGG • YOLO • YOLO v2 • LeCun

Concurrence, Localization

Occlusion

Out of context

Counting

Tracking

http://googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html

https://www.youtube.com/watch?v=n5uP_LP9SmM

https://www.youtube.com/watch?v=r6ZzopHEO1U

https://www.youtube.com/watch?v=VOC3huqHrss

https://youtu.be/uTPtEW6WsjM?t=27m30s

Multi Instance Semantic Segmentation

Li et al., arxiv.org/abs/1611.07709

Won the COCO 2016 Detection Challenge (for segmentation)

http://mscoco.org/dataset/#detections-challenge2016

Adversarial Perturbations Against Semantic Segmentation

Fischer et al., arxiv.org/abs/1703.01101

Xie et al., arxiv.org/abs/1703.08603

Metzen et al., arxiv.org/abs/1704.05712

Cisse et al., arxiv.org/abs/1707.05373





Other related tasks

• Edge detection

• Semantic edge detection

• Surface normals

• Matting / objectness (foreground/background)

• Saliency / memorability

• Pose estimation

• Depth estimation

• Optical flow interpolation and estimation

• Motion prediction

• E.g. Eigen and Fergus, UberNet, PixelNet combine several of the above




This talk: Semantic Segmentation

aka: scene labeling / scene parsing / dense prediction / dense labeling / pixel-level classification

Figure panels: (d) input, (e) semantic segmentation, (f) naive instance segmentation, (g) instance segmentation

Datasets and use cases

• General
  • Pascal VOC 2012
  • MS COCO (evaluation only for instance segmentation)
  • ADE20K / SceneParse150K (all pixels annotated)
  • DAVIS 2017 (video; review)

• Urban (e.g. for autonomous vehicles)
  • Cityscapes (all pixels annotated)
  • CMP Facades (strong priors)
  • KITTI road/lane
  • CamVid (all pixels annotated, video)

• Aerial / Satellite
  • ISPRS Potsdam and Vaihingen
  • DSTL Kaggle (multi-modal)

• Human parsing (LIP, MHP)

• Medical imaging (can be 2.5D/multi-view)

• More: riemenschneider.hayko.at/vision/dataset

http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6

http://mscoco.org/dataset/#detections-leaderboard

http://groups.csail.mit.edu/vision/datasets/ADE20K/

http://sceneparsing.csail.mit.edu/results2016.html

http://davischallenge.org/challenge2017/leaderboard.html

https://medium.com/@eddiesmo/a-meta-analysis-of-davis-2017-video-object-segmentation-challenge-c438790b3b56

https://www.cityscapes-dataset.com/benchmarks/#pixel-level-results

http://cmp.felk.cvut.cz/~tylecr1/facade/

http://www.cvlibs.net/datasets/kitti/eval_road.php

http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/

http://www2.isprs.org/potsdam-2d-semantic-labeling.html

http://www2.isprs.org/vaihingen-2d-semantic-labeling-contest.html

https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection/leaderboard



http://riemenschneider.hayko.at/vision/dataset/

Pascal VOC 2012 (Train+Validation): 11,530 images, 6,929 segmentations, 20 classes + background

github.com/nightrome/really-awesome-semantic-segmentation

https://github.com/nightrome/really-awesome-semantic-segmentation

Evaluation metrics

• Pixel accuracy (dominated by background class)

• Mean accuracy over classes (individual class recall does not penalize false pos; must include background class)

• Jaccard index = Intersection over Union (IoU) = (GT ∩ Pred) / (GT ∪ Pred) = TP / (TP + FN + FP)
  • IoU <= Recall = TP / GT, and IoU <= Precision = TP / Pred
  • Usually: mean over classes, on the whole dataset
  • Can include or exclude the background class
  • Can be mean over images instead of whole dataset
  • Can be frequency weighted (unbalanced, similar to pixel accuracy)
  • Can be weighted by inverse instance size (Cityscapes, important in traffic use cases)
  • Can be averaged with e.g. pixel accuracy (ADE20K)

• Dice index = F1 score = 2(GT ∩ Pred) / (GT + Pred) = 2TP / (2TP + FN + FP)
  • = Harmonic mean of Recall and Precision
  • = 2IoU / (1 + IoU), monotonic with IoU

https://www.cityscapes-dataset.com/benchmarks/
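The pixel-level metrics above can all be read off a per-class confusion matrix. The following is a minimal NumPy sketch, not a benchmark's official scoring code: the function names and the toy label maps are made up for illustration, and it assumes every class appears at least once (otherwise the per-class ratios divide by zero).

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(gt, pred, num_classes):
    cm = confusion_matrix(gt, pred, num_classes).astype(float)
    tp = np.diag(cm)              # correctly labeled pixels per class
    fn = cm.sum(axis=1) - tp      # pixels of the class that were missed
    fp = cm.sum(axis=0) - tp      # pixels wrongly assigned to the class
    iou = tp / (tp + fn + fp)
    dice = 2 * tp / (2 * tp + fn + fp)
    return {
        "pixel_accuracy": tp.sum() / cm.sum(),
        "mean_accuracy": np.mean(tp / (tp + fn)),  # mean per-class recall
        "iou": iou,
        "dice": dice,
        "mean_iou": iou.mean(),
    }

# Toy 3x4 label maps with two classes (0 = background, 1 = object).
gt = np.array([[0, 0, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 1],
                 [0, 0, 0, 1]])
m = segmentation_metrics(gt, pred, num_classes=2)
print(m["pixel_accuracy"], m["iou"], m["dice"])
```

In this toy case the object class has TP = 3, FN = 1, FP = 1, so its IoU is 0.6 and its Dice is 0.75, which agrees with Dice = 2IoU / (1 + IoU).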

Evaluation metrics

• Trimap IoU around boundaries 4/8px (Krähenbühl and Koltun, Kohli et al.)

• Boundary F1 (BF) - Nearest boundary pixel distance (Csurka et al.)
  • For some distance error tolerance, e.g. 0.75% of the image diagonal
  • Can be averaged with IoU (DAVIS)

• Average precision (AP) = Area under the precision-recall curve (MS COCO)
  • Here precision and recall are instance-level, given some IoU threshold, e.g. 0.5
  • Can be additionally averaged over different thresholds (e.g. 0.5 - 0.95 in steps of 0.05)
  • Multiple detections (instance fragmentation) are counted as false positives beyond the best
  • Primary metric for instance segmentation (pixel-level metrics can be ambiguous)
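To make the instance-level AP concrete, here is a simplified sketch. It is an assumption-laden illustration rather than the official MS COCO evaluation: it uses uninterpolated AP (the mean of the precision values at each true positive), matches each detection only to its single best-overlapping ground-truth instance, and all function names and toy masks are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def average_precision(gt_masks, pred_masks, scores, iou_thresh=0.5):
    """Uninterpolated AP at one IoU threshold (simplified matching)."""
    order = np.argsort(scores)[::-1]          # highest score first
    matched = set()
    tp = np.zeros(len(order))
    for rank, p in enumerate(order):
        ious = [mask_iou(pred_masks[p], g) for g in gt_masks]
        best = int(np.argmax(ious))
        # A second detection of an already matched instance stays a false positive.
        if ious[best] >= iou_thresh and best not in matched:
            matched.add(best)
            tp[rank] = 1
    precision = np.cumsum(tp) / (np.arange(len(order)) + 1)
    return float(np.sum(precision * tp) / len(gt_masks))

# Two ground-truth instances on a 4x4 grid: top half and bottom half.
gt0 = np.zeros((4, 4), dtype=bool); gt0[:2, :] = True
gt1 = np.zeros((4, 4), dtype=bool); gt1[2:, :] = True
# Three detections: exact top, a fragment of the top (IoU 0.75), exact bottom.
frag = np.zeros((4, 4), dtype=bool); frag[:2, :3] = True
preds, scores = [gt0.copy(), frag, gt1.copy()], [0.9, 0.8, 0.7]

ap50 = average_precision([gt0, gt1], preds, scores, iou_thresh=0.5)
# Averaging over thresholds 0.5, 0.55, ..., 0.95.
ap_avg = np.mean([average_precision([gt0, gt1], preds, scores, t)
                  for t in np.linspace(0.5, 0.95, 10)])
print(ap50, ap_avg)
```

The fragment overlaps the top instance well enough to pass the threshold, but that instance is already claimed by the higher-scoring detection, so the fragment lowers precision at rank 2 and AP comes out as (1 + 2/3) / 2.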


https://www.inf.ethz.ch/personal/ladickyl/robust_ijcv09.pdf

http://www.bmva.org/bmvc/2013/Papers/paper0032/paper0032.pdf

https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Perazzi_A_Benchmark_Dataset_CVPR_2016_paper.pdf

http://mscoco.org/dataset/#detections-eval

Loss

• Cross entropy = $-\sum_{\text{classes}} \sum_{\text{pixels}} p \log q$
  • p = targets; q = output probabilities
  • Can be weighted by inverse class size
  • Can be weighted to emphasize areas around edges (U-Net, Meyer)

• IoU approximated with probabilities = $\sum_{\text{classes}} \left[ \sum_{\text{pixels}} p q \,/\, \sum_{\text{pixels}} (p + q - p q) \right]$
  • Approximation is needed since IoU is discrete
  • Makes sense since this is our evaluation metric
  • Multiclass formulation is balanced over class sizes
  • Rediscovered in literature from time to time [1-16]
  • Visualead reported mixed results
  • Loss =
    • - IoU [1 2 3 4 5 6]
    • - Dice [7 8 9 10]
    • - Tversky generalization [11]
    • 0.1 * CE + 0.9 * (1 - Dice) [12]
    • CE - log(IoU) [13]
    • Other approximations [14 15 16 (TBD in TF)]

• Total variation smoothing = $\sum_{\text{classes}} \sum_{x,y} \left( |q_{x+1,y} - q_{x,y}| + |q_{x,y+1} - q_{x,y}| \right)$

• Adversarial (later)
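The loss terms above can be sketched in NumPy as follows. This is an illustrative forward computation only (no gradients), under assumptions not stated on the slide: arrays have shape (classes, height, width), the soft IoU and Dice are returned per class so the caller can sum or average them, and `eps` is a hypothetical stabilizer against empty classes.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-7):
    """p: one-hot targets, q: predicted probabilities, both (C, H, W)."""
    return -np.sum(p * np.log(np.clip(q, eps, 1.0)))

def soft_iou(p, q, eps=1e-7):
    """Per-class IoU with probabilities in place of hard labels."""
    inter = np.sum(p * q, axis=(1, 2))
    union = np.sum(p + q - p * q, axis=(1, 2))
    return inter / (union + eps)

def soft_dice(p, q, eps=1e-7):
    inter = np.sum(p * q, axis=(1, 2))
    return 2 * inter / (np.sum(p, axis=(1, 2)) + np.sum(q, axis=(1, 2)) + eps)

def total_variation(q):
    """Sum of absolute differences between horizontal and vertical neighbors."""
    dx = np.abs(q[:, :, 1:] - q[:, :, :-1])
    dy = np.abs(q[:, 1:, :] - q[:, :-1, :])
    return dx.sum() + dy.sum()

# Toy target: 2 classes on a 2x4 image, class 0 on the left, class 1 on the right.
p = np.zeros((2, 2, 4))
p[0, :, :2] = 1.0
p[1, :, 2:] = 1.0
q = np.full_like(p, 0.5)          # an uninformative prediction

loss_iou = -soft_iou(p, q).mean()
loss_mix = 0.1 * cross_entropy(p, q) + 0.9 * (1 - soft_dice(p, q).mean())
print(soft_iou(p, q), soft_dice(p, q), total_variation(p), loss_iou, loss_mix)
```

Averaging over classes instead of summing, as done here, differs from the slide's formula only by a constant factor of 1 / (number of classes). For the uniform prediction each class has soft IoU 1/3 and soft Dice 1/2, consistent with Dice = 2IoU / (1 + IoU).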

https://github.com/arthurmeyer/Saliency_Detection_Convolutional_Autoencoder/issues/1

https://www.facebook.com/samsungnexttlv/videos/1278502672273222

http://www.philkr.net/papers/2013-06-01-icml/2013-06-01-icml.pdf


http://www.cs.umanitoba.ca/~ywang/papers/isvc16.pdf

https://github.com/jocicmarko/ultrasound-nerve-segmentation



https://github.com/jocicmarko/ultrasound-nerve-segmentation


http://angusg.com/writing/2016/12/28/optimizing-iou-semantic-segmentation.html



https://github.com/lopuhin/kaggle-dstl

blog.kaggle.com/2017/05/09/dstl-satellite-imagery-competition-3rd-place-winners-interview-vladimir-sergey


https://github.com/bermanmaxim/jaccardSegment

http://proceedings.mlr.press/v54/eban17a/eban17a.pdf

Architectures

1. Patchwise CNN

2. FCN

3. DeepLab

4. DeconvNet

5. U-Net

6. SegNet

7. Dilated Convolutions (Yu and Koltun)

8. 100-Layer Tiramisu (DenseNets)

9. Wide ResNet

10. PSPNet

11. Adversarial

12. PolygonRNN

13. Mask R-CNN

14. Semi-supervised with unsupervised loss

Patchwise CNN

• Ning et al., http://yann.lecun.com/exdb/publis/pdf/ning-05.pdf

• Ciresan et al., people.idsia.ch/%7Ejuergen/nips2012.pdf

• A sliding window CNN classifies each pixel in turn

http://yann.lecun.com/exdb/publis/pdf/ning-05.pdf

http://people.idsia.ch/~juergen/nips2012.pdf

Fully Convolutional NN

• cs231n.github.io/convolutional-networks/#converting-fc-layers-to-conv-layers

• Start from a CNN classifier

• Convert fully connected to conv (with filter size = input volume, no padding):
  • CNN -> 7*7*512 -> fc(4096) -> 4096 -> fc(1000) -> 1000
  • CNN -> 7*7*512 -> conv(7*7*4096) -> 1*1*4096 -> conv(1*1*1000) -> 1*1*1000

• Can take arbitrarily larger input:
  • 224*224 -> 7*7*512 -> 1*1*1000
  • 384*384 -> 12*12*512 -> 6*6*1000

• Equivalent to sliding a patchwise CNN, but with a single pass that is much more efficient due to convolution sharing

http://cs231n.github.io/convolutional-networks/#converting-fc-layers-to-conv-layers
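The equivalence between a fully connected layer and a convolution whose filter covers the whole input volume can be checked numerically. The sketch below uses toy sizes (a 7*7*8 feature map and 16 outputs, standing in for 7*7*512 and 4096) and hypothetical function names; it is a demonstration of the idea, not code from any framework.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 7      # spatial size the classifier was trained on
C, K = 8, 16   # toy channel / output counts (assumed, to keep this fast)

fc_weights = rng.standard_normal((K, H * W * C))

def fc_layer(feat):
    """Fully connected layer on a flattened H*W*C feature map."""
    return fc_weights @ feat.reshape(-1)

# The same weights viewed as K filters of size H*W*C.
conv_kernel = fc_weights.reshape(K, H, W, C)

def conv_layer(feat):
    """Valid (no padding) stride-1 convolution with the reshaped FC weights."""
    out_h = feat.shape[0] - H + 1
    out_w = feat.shape[1] - W + 1
    out = np.empty((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            patch = feat[i:i + H, j:j + W, :]
            out[i, j] = np.tensordot(conv_kernel, patch, axes=3)
    return out

small = rng.standard_normal((7, 7, C))     # the original input volume
large = rng.standard_normal((12, 12, C))   # features of a larger image

print(conv_layer(small).shape)   # (1, 1, 16)
print(conv_layer(large).shape)   # (6, 6, 16)
```

On the 12*12 feature map, each of the 6*6 output positions equals what the original classifier would produce on the corresponding 7*7 window, but the shared computation is done once.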

Deconvolution/Upconvolution Layers

• FC convolution transposed

• cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

• Fractionally strided convolution

• github.com/vdumoulin/conv_arithmetic

Stride = 2 Stride = 1/2

input

(Resolution Increasing Convolutions)

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

https://github.com/vdumoulin/conv_arithmetic
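A one-dimensional sketch may help show why a transposed convolution increases resolution: each input value stamps a scaled copy of the kernel into the output, with stamps placed `stride` positions apart. The function names are illustrative, and the final check uses the standard fact that a transposed convolution is the adjoint of the ordinary strided convolution with the same kernel.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Ordinary valid convolution (cross-correlation), downsamples by stride."""
    n_out = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(w)], w)
                     for i in range(n_out)])

def conv1d_transpose(y, w, stride=1):
    """Transposed convolution: upsamples by stride."""
    out = np.zeros((len(y) - 1) * stride + len(w))
    for i, v in enumerate(y):
        out[i * stride:i * stride + len(w)] += v * w
    return out

y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0, 1.0])
up = conv1d_transpose(y, w, stride=2)
print(up)   # 3 inputs become 7 outputs; overlapping stamps are summed

# Adjoint check: <conv(x), y> equals <x, conv_transpose(y)>.
rng = np.random.default_rng(0)
x = rng.standard_normal(7)
lhs = np.dot(conv1d(x, w, stride=2), y)
rhs = np.dot(x, up)
```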

Fully Convolutional Network (FCN; 2014-11)

• Long et al., arxiv.org/abs/1411.4038

• Shelhamer et al., arxiv.org/abs/1605.06211

• Start from classification CNN pre-trained on ImageNet (AlexNet/VGG-16/GoogLeNet) and convert fully connected to conv (conv7)

• Replace final layer to 1*1*21 and add bilinear upsampling to get full spatial output (FCN-32s)

• Add x2 deconv (initialized as bilinear) on conv7 and sum with conv prediction added to pool4

• Add bilinear upsampling to get full spatial output (FCN-16s). Fine tune from FCN-32s

• Do similarly for above fuse and pool3 (FCN-8s)

• Pascal VOC 2012 IoU=62.2%-67.2% (up from 51.6%)

• 100-175 ms (vs. 50 s)

• 134M params
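The "initialized as bilinear" step can be illustrated with the commonly used construction of a bilinear interpolation kernel. This is a sketch of that general recipe, not the authors' code; the kernel size of 4 for x2 upsampling is an assumption made here.

```python
import numpy as np

def bilinear_kernel(size):
    """2D bilinear interpolation weights for a size x size deconv filter."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return ((1 - np.abs(rows - center) / factor) *
            (1 - np.abs(cols - center) / factor))

k = bilinear_kernel(4)   # assumed filter size for x2 upsampling
print(k)
```

Each row and column follows the 1D profile 0.25, 0.75, 0.75, 0.25. Starting the x2 deconv from these weights means training begins from plain bilinear upsampling and can then refine it.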



DeepLab (2014-12)

• Chen et al. (Google), arxiv.org/abs/1412.7062

• VGG-16 pre-trained on ImageNet -> fully conv

• Cancel last two max-pool

• Change conv after above to x2/x4 dilated convolutions

• Train with x8 subsampled targets (IoU<90.7%). Infer with bilinear upsampling.

• Fully connected CRF (raw image dependent potential) post-processing in inference (+ 3%-5%)

• Add multi-scale layers fine tuned separately (similar to FCN-8s but with concats and convs)

• Increase dilation for first FC layer to x12 (large field of view) + change FC kernel, filters

• 20.5M params

• Pascal VOC 2012 IoU = 71.6%

• V2: arxiv.org/abs/1606.00915 with ResNet-101 + “atrous spatial pyramid pooling”
  • Pascal VOC 2012 IoU = 79.7%, Cityscapes IoU = 70.4%

• V3: arxiv.org/abs/1706.05587
  • Pascal VOC 2012 IoU = 86.9%, Cityscapes IoU = 81.3% (SOTA 2017)

Figure labels: before softmax, after softmax

hole = atrous = dilated convolutions increase field of view without decreasing resolution, or adding parameters
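A small 1D sketch shows the effect of dilation: the same three weights cover a wider span of the input when holes are inserted between them. The function name is illustrative, and the example uses a valid (unpadded) convolution, so the output is shorter here; with suitable padding the output keeps the input resolution.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Valid 1D convolution with gaps of (dilation - 1) between kernel taps."""
    span = (len(w) - 1) * dilation + 1     # field of view of one output
    n_out = len(x) - span + 1
    return np.array([np.dot(x[i:i + span:dilation], w) for i in range(n_out)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])             # same 3 parameters in both cases

dense = dilated_conv1d(x, w, dilation=1)   # each output sees 3 inputs
holes = dilated_conv1d(x, w, dilation=4)   # each output sees a span of 9 inputs
print(dense, holes)
```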





DeconvNet (2015-05)

• Noh et al., arxiv.org/abs/1505.04366

• VGG-16 pre-trained on ImageNet

• Unpooling layers use saved max pooling indices

• Symmetric encoder-decoder: multiple deconvolutions + BatchNorm + ReLU (no dropout)

• Relies on region proposals. Training with two-stage curriculum learning:
  • 1. Instances cropped to GT bounding boxes * 1.2, all non-class pixels labeled as background
  • 2. Object proposals from edge-box * 1.2

• Inference:
  • Top 50 objectness score of 2000 edge-box object proposals, max per pixel/class before softmax
  • Fully connected CRF post-processing (+ ~1%)

• 252M params


• Ensemble with FCN-8s = 72.5%


https://web.bii.a-star.edu.sg/~zhangxw/files/EdgeBoxes_ECCV2014.pdf
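Unpooling with saved max-pooling indices can be sketched as follows for a single-channel map with non-overlapping 2x2 windows. The function names and the toy input are illustrative assumptions; real implementations work on batched multi-channel tensors.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also returns the argmax in each window."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    flat = blocks.reshape(h // size, w // size, size * size)
    return flat.max(axis=2), flat.argmax(axis=2)

def max_unpool(pooled, idx, size=2):
    """Put each pooled value back at its saved position; everything else is zero."""
    ph, pw = pooled.shape
    flat = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    flat[rows, cols, idx] = pooled
    blocks = flat.reshape(ph, pw, size, size).transpose(0, 2, 1, 3)
    return blocks.reshape(ph * size, pw * size)

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 6, 0],
              [7, 0, 0, 0]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx)
print(pooled)
print(restored)
```

The unpooled map is sparse: only the positions that won the max survive, which is why the decoder follows unpooling with convolutions (or deconvolutions) that densify it.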

U-Net (2015-05)

• Ronneberger et al., arxiv.org/abs/1505.04597

• No VGG! Not pre-trained!

• Skip connections to keep res.!

• Separate deconv to: learned 2x2 upconv + (3x3 regular conv + ReLU) * 2

• Weighting to emphasize areas around morphological edges

• Implementations I’ve seen use half the filters and padding

dropout


SegNet (2015-11)

• Badrinarayanan et al., arxiv.org/abs/1505.07293, arxiv.org/abs/1511.00561

• VGG-16 pre-trained on ImageNet (without fully connected layers)

• Unpooling layers use saved max pooling indices like in DeconvNet

• Deconvolutions + BatchNorm + ReLU (some dropout)

• They compare various decoders, and dropouts (arxiv.org/abs/1511.02680)

• Pascal VOC 2012 IoU = 59.9%




Dilated Convolutions (2015-11)

• Yu and Koltun, arxiv.org/abs/1511.07122

• Front-end network + Context aggregation network

• Front-end is a truncated VGG-16 like DeepLab + dilated convs, pre-trained on Pascal VOC 2012

• Context aggregation is a 7-layer uniform resolution dilated convs + ReLUs, with increasing dilations and initialized to unit filters

• Train with x8 subsampled targets. Front-end is trained first. Then context is added and trained with fixed front-end

• Possible post-processing with fully connected CRF / CRF-RNN

• Front-end alone: Pascal VOC 2012 IoU = 71.3%

• Front-end + Context + CRF-RNN: Pascal VOC 2012 IoU = 75.3%

• Dilation10: Cityscapes IoU = 67.1%
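The reason a stack of dilated convolutions aggregates context so cheaply is that the receptive field grows by (kernel - 1) * dilation per layer while resolution stays fixed. The sketch below computes that growth; the dilation schedule is an assumption chosen for illustration (the slide only says "increasing dilations"), as is the 3x3 kernel size.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field size after each stride-1 dilated conv layer."""
    rf = 1
    sizes = []
    for d in dilations:
        rf += (kernel - 1) * d
        sizes.append(rf)
    return sizes

# Assumed 7-layer schedule with 3x3 kernels; not taken from the slide.
dilations = [1, 1, 2, 4, 8, 16, 1]
sizes = receptive_field(dilations)
print(sizes)   # [3, 5, 9, 17, 33, 65, 67]
```

Seven undilated 3x3 layers would reach only a 15-pixel receptive field, so under this assumed schedule the dilations widen the context by more than four times with the same number of parameters.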



The One Hundred Layers Tiramisu (2016-11)

• Jegou et al., arxiv.org/abs/1611.09326

• DenseNets (few params, easy training)

• Encoder-Decoder with skip connections

• 56 – 103 layers

• 1.5M – 9.4M params

• No pre-training

• No / negative results on large benchmarks



https://github.com/SimJeg/FC-DenseNet/issues/10#issuecomment-286800017

Wide ResNet (2016-11)

• Wu et al., arxiv.org/abs/1611.10080

• Wider or Deeper ResNets? Wider!
  • See also Littwin and Wolf, arxiv.org/abs/1611.02525

• Wide 7-block ResNet pre-trained for classification, adapted to dilated a la DeepLab


• Cityscapes IoU = 78.4%

• ADE20K avg(pixel acc., IoU) = 56.74%



Pyramid Scene Parsing (PSPNet; 2016-12)

• Zhao et al., arxiv.org/abs/1612.01105 (trained models: Caffe, Keras)

• Pre-trained dilated 101-269 ResNet + deep supervision auxiliary loss + pyramid pooling module

• Pascal VOC 2012 IoU = 85.4% (1st place 2016)

• Cityscapes IoU = 80.2% (1st place 2016). Video

• ADE20K avg(pixel acc., IoU) = 57.21% (1st place 2016)

SOTA! (2016)


https://github.com/hszhao/PSPNet

https://github.com/Vladkryvoruchko/PSPNet-Keras-tensorflow

https://www.youtube.com/watch?v=rB1BmBOkKTw

Mismatched Relationship

Confusion Categories

Inconspicuous Classes

Generative Adversarial Nets

Goodfellow et al., arxiv.org/abs/1406.2661

Generator (Creator)

Discriminator (Curator)

Fake or Real?

Fake

Real


Image to Image Translation With Conditional Adversarial Networks (PatchGAN)

Isola et al., phillipi.github.io/pix2pix

Interactive: affinelayer.com/pixsrv

Guide: ml4a.github.io/guides/Pix2Pix

fotogenerator.npocloud.nl

https://phillipi.github.io/pix2pix/

http://affinelayer.com/pixsrv/index.html

https://ml4a.github.io/guides/Pix2Pix/

http://fotogenerator.npocloud.nl/

Adversarial (2016-09)

• Idea is to regulate naturalness (strong and smooth classes, sharp boundaries, denoising, global structure)

• David Golan et al. (2016-09, first one AFAIK)

• Pix2pix, Isola et al., arxiv.org/abs/1611.07004
  • Generator is U-Net style (with skip connections)
  • 4x4 Conv with stride 2 - BatchNorm - ReLU (+ some dropout). No max-pooling.
  • L1 loss for generator
  • “PatchGAN” discriminator takes both image and segmentation, averages over 70x70 patches
  • Adversarial loss hurts! Cityscapes IoU = 29%
  • L1 only: Cityscapes IoU = 35%

• FAIR, Luc et al., arxiv.org/abs/1611.08408
  • Generator is Yu and Koltun’s Dilated8
  • Cross-entropy loss for generator
  • Discriminator issue: we feed it continuous probabilities (cannot do SGD with discrete labels), but GT are discrete
  • Tested product with image and scaling GT, as alternative input to discriminator, but results were the same
  • Pascal VOC 2012 IoU = 73.3% (compare to Yu and Koltun’s 71.3%). Adversarial ~ 2%
  • Several citations using this

https://www.dropbox.com/s/2hz3aqxbn15cxx8/ddh_paper.pdf



http://adsabs.harvard.edu/cgi-bin/nph-ref_query?bibcode=2016arXiv161108408L&refs=CITATIONS&db_key=PRE

PolygonRNN (2017-04)

• Castrejon et al. CSC2523_Project_Report, arxiv.org/abs/1704.05548

• Sparse representation using polygons

• Cityscapes IoU = 61.4% per instance, assuming given bounding boxes

• Can speed-up manual annotation

• CVPR 2017 Best Paper Honorable Mention Award (video)

(ConvLSTM)

http://www.cs.utoronto.ca/~kamyar/documents/CSC2523_Project_Report.pdf


https://www.youtube.com/watch?v=S1UUR4FlJ84

Mask R-CNN (2017-03)

• He et al., arxiv.org/abs/1703.06870 (tutorial)

• Instance segmentation SOTA


http://deeplearning.csail.mit.edu/instance_ross.pdf

Semi-Supervised Semantic Segmentation with Unsupervised Total Variation Loss

• Javanmard et al., arxiv.org/abs/1605.01368

Figure columns: Supervised (10 pix/image), Proposed (10 pix/image), Full labels, GT


Meta references

• Janai et al., arxiv.org/abs/1704.05519 (chapter 6)

• Garcia-Garcia et al., arxiv.org/abs/1704.06857

• meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years

• blog.qure.ai/notes/semantic-segmentation-deep-learning-review

• handong1587.github.io/deep_learning/2015/10/09/segmentation

• github.com/kjw0612/awesome-deep-vision#semantic-segmentation

• github.com/mrgloom/Semantic-Segmentation-Evaluation

• github.com/fchollet/keras/issues/6538



https://meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years.html

http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review

https://handong1587.github.io/deep_learning/2015/10/09/segmentation.html

https://github.com/kjw0612/awesome-deep-vision#semantic-segmentation

https://github.com/mrgloom/Semantic-Segmentation-Evaluation

https://github.com/fchollet/keras/issues/6538

Thanks!

• Slides: bit.ly/semantic-segmentation

• Contact: [email protected]

http://bit.ly/semantic-segmentation

mailto:[email protected]
