Lecture 7: Semantic Segmentation
Bohyung Han, Computer Vision Lab, POSTECH ([email protected])
CSED703R: Deep Learning for Visual Recognition (2016S)
Semantic Segmentation
• Segmenting an image into regions according to their semantic meaning
Supervised Learning
Fully Convolutional Network
• Network architecture [Long15]
• End-to-end CNN architecture for semantic segmentation
• Interprets fully connected layers as convolutional layers

[Long15] J. Long, E. Shelhamer, and T. Darrell: Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
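The "interpret fully connected layers as convolutional layers" step can be sketched numerically. This is a toy sketch with made-up sizes (not VGG's, and not the authors' code): a fully connected layer applied to a flattened feature map is equivalent to a convolution whose kernel covers the entire map.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, D = 3, 2, 2, 5            # toy channel/spatial/output sizes
x = rng.standard_normal((C, H, W))
W_fc = rng.standard_normal((D, C * H * W))   # fully connected weights

# Fully connected: flatten the feature map, then matrix-multiply.
y_fc = W_fc @ x.reshape(-1)

# "Convolutionalized": the same weights viewed as D kernels of shape
# CxHxW, each correlated with the (same-sized) input map.
W_conv = W_fc.reshape(D, C, H, W)
y_conv = np.array([(k * x).sum() for k in W_conv])

assert np.allclose(y_fc, y_conv)   # identical outputs
```

Because the "kernel" covers the whole map, applying it to a larger input produces a spatial grid of outputs instead of a single vector, which is what enables dense prediction.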
Deconvolution

[Figure: a 500x500x3 input image produces a coarse 16x16x21 class score map, which is upsampled by deconvolution]
Deconvolution Filter
• Bilinear interpolation filter: the same filter for every class, and no filter learning!
• How does this deconvolution work? The deconvolution layer is fixed; only the convolutional layers of the network are fine-tuned with segmentation ground truth.

[Figure: fixed 64x64 bilinear interpolation filter; convolutional layers pretrained on ImageNet and fine-tuned for segmentation]
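The fixed bilinear filter can be constructed as follows; this is a sketch of the standard bilinear-kernel initializer for upsampling filters (`bilinear_kernel` is a name chosen here, not from the source).

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape (size, size): the
    fixed, class-independent filter used for upsampling."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]          # row and column index grids
    return ((1 - abs(og[0] - center) / factor)
            * (1 - abs(og[1] - center) / factor))

k = bilinear_kernel(4)                   # e.g. a 4x4 kernel for 2x upsampling
assert np.allclose(k, k.T)               # symmetric tent-shaped weights
assert np.isclose(k[1, 1], 0.5625)       # peak near the kernel center
```

Since the kernel is deterministic, every class channel shares it, which is exactly why no filter learning is needed in this design.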
DeconvNet
• Learning a deep deconvolution network
• Conceptually more reasonable
• Better at identifying fine structures of objects
• Designed to generate outputs from a larger solution space
• Capable of predicting dense output scores
• Instance-wise training and prediction
• Difficult to learn: memory intensive, large number of parameters

[Noh15] H. Noh, S. Hong, B. Han: Learning Deconvolution Network for Semantic Segmentation. ICCV 2015
Operations in Deconvolution Network
• Unpooling: places activations back at the pooled locations, preserving the structure of the activations
• Deconvolution: densifies sparse activations; filters act as bases to reconstruct object shape
• ReLU: same as in the convolution network
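The unpooling operation above can be sketched for a single 2x2 window; this is a toy NumPy sketch of the idea (max pooling records a "switch", unpooling places the value back), not the DeconvNet implementation.

```python
import numpy as np

# One 2x2 pooling window with its activations.
x = np.array([[1., 3.],
              [4., 2.]])

flat_idx = x.argmax()            # switch: remember where the max came from
pooled = x.flat[flat_idx]        # 2x2 max pooling -> single value

# Unpooling: put the pooled activation back at the recorded location,
# leaving every other position zero (a sparse but structured map).
unpooled = np.zeros_like(x)
unpooled.flat[flat_idx] = pooled

assert pooled == 4.0
assert unpooled[1, 0] == 4.0 and unpooled.sum() == 4.0
```

The subsequent deconvolution layers then densify this sparse map, as the slide describes.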
How Does the Deconvolution Network Work?
• Visualization of activations
[Figure: activations visualized at successive layers: Deconv 14x14, Unpool 28x28, Deconv 28x28, Unpool 56x56, Deconv 56x56, Unpool 112x112, Deconv 112x112]
Training and Inference
• Instance-wise training
• Data augmentation: object proposals, random cropping, flipping
• Two-stage training: binary segmentation with ground truth, then full segmentation with object proposals
• Batch normalization
• Instance-wise prediction
• Each class corresponds to one of the channels in the output layer; the label of a pixel is given by a max operation over all channels.
• Aggregation of 50 object proposals: max operation over all proposals
[Figure: pipeline: 1. input image, 2. object proposals, 3. prediction by DeconvNet and aggregation, 4. results]
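The prediction-and-aggregation step can be sketched with random score maps standing in for the per-proposal network outputs (all shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, n_proposals = 21, 4, 4, 50
# One CxHxW class score map per proposal, already placed back into
# image coordinates (random values stand in for DeconvNet outputs).
proposal_scores = rng.standard_normal((n_proposals, C, H, W))

aggregated = proposal_scores.max(axis=0)   # pixel-wise max over proposals
labels = aggregated.argmax(axis=0)         # per-pixel max over class channels

assert aggregated.shape == (C, H, W)
assert labels.shape == (H, W) and labels.max() < C
```

Both aggregation steps are simple max reductions, which is why instance-wise predictions combine cheaply into a full-image result.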
Results
Semi‐Supervised Learning
Motivation
• Challenges in existing supervised learning approaches
• Heavy labeling effort in semantic segmentation: pixel-wise segmentation labels are much more expensive to obtain than other kinds of labels
• Difficult to extend to other classes and to handle more classes
Problem Definition
• Weakly supervised learning with hybrid annotations
• Many weak annotations: image-level object class labels
• Few strong annotations: full segmentation labels

[Figure: example images with image-level labels: person/bike, person/horse, boat/person/dining table/potted plant/tv-monitor]
DecoupledNet
• Architecture
• Classification network
• Segmentation network
• Bridging layers
• Characteristics
• Decouples classification and segmentation
• Train the classification network first, then learn the rest of the network

[Hong15] S. Hong*, H. Noh*, B. Han: Decoupled Deep Neural Network for Semi-Supervised Semantic Segmentation. NIPS 2015
Classification Network
• Specification
• Input: image x
• Output: 20-dimensional class label vector f(x; θ_c) ∈ R^20
• Construction
• Fine-tuning from the VGG 16-layer net
• Transferable from any other existing classification network

min_{θ_c} Σ_i e_c(f(x_i; θ_c), y_i), where y_i ∈ {0,1}^20 is the ground-truth label vector.
Segmentation Network
• Specification
• Input: class-specific activation map g of the input image
• Output: two-channel class-specific binary segmentation map f(g; θ_s)
• Construction
• Adopts DeconvNet, customized for binary segmentation

min_{θ_s} Σ_i e_s(f(g_i; θ_s), z_i), where z_i is the binary ground-truth segmentation mask.
Bridging Layers
• Specification
• Input: concatenation, in the channel direction, of the pool5 feature maps and the class-specific saliency maps
• Output: class-specific activation map
• Construction
• Fully connected layers
• Spatial information: pool5 feature maps
• Class-specific information: obtained by backpropagating the class score down to pool5
Class‐Specific Information
• Class-specific saliency map [Simonyan14]
• Given an image, pixels related to a specific class can be identified by computing the gradient of the class score with respect to the image:

w = ∂S_c/∂x, computed by backpropagation through the network (the chain rule over all layers).

[Simonyan14] K. Simonyan, A. Vedaldi, A. Zisserman: Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. ICLR Workshop, 2014.
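For a toy linear scorer, the gradient-based saliency map can be computed and verified directly; this is a sketch under the assumption of a linear "network" (a real CNN would require autograd), with all sizes made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_classes = 6, 3
W = rng.standard_normal((n_classes, n_pixels))  # linear class scorers
x = rng.standard_normal(n_pixels)               # flattened "image"

c = (W @ x).argmax()           # predicted class
# For S_c(x) = w_c . x, the gradient dS_c/dx is just w_c; large
# magnitudes mark pixels the class score is most sensitive to.
saliency = np.abs(W[c])

# Finite-difference check of the gradient at pixel 0.
eps = 1e-6
x2 = x.copy(); x2[0] += eps
fd = ((W @ x2)[c] - (W @ x)[c]) / eps
assert np.isclose(fd, W[c, 0], atol=1e-4)
```

In the actual method the same gradient is obtained by one backward pass of the classification network.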
Segmentation Maps
Inference

• Iterations are needed: a segmentation map is computed for each identified label, using the same segmentation network with different class-specific information.
• For each identified label l, the bridging layers produce a class-specific activation map g_l, and the segmentation network outputs a binary map f(g_l; θ_s).
• The final label of each pixel is given by the max operation over the foreground channels of all class-specific segmentation maps.
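The per-label inference loop and the final max aggregation can be sketched as follows; random maps stand in for the segmentation network's foreground outputs, and the label set is a toy assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W = 4, 4
identified = [3, 7, 12]        # labels found by the classification network

# One foreground score map per identified label, produced by running the
# SAME segmentation network with different class-specific inputs.
fg_maps = {l: rng.standard_normal((H, W)) for l in identified}

stacked = np.stack([fg_maps[l] for l in identified])   # (3, H, W)
winner = stacked.argmax(axis=0)                        # index into identified
labels = np.array(identified)[winner]                  # per-pixel class label

assert labels.shape == (H, W)
assert set(np.unique(labels)) <= set(identified)
```

Only the classes the classifier actually detected are ever considered, which is what keeps the number of iterations small.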
Qualitative Results
Quantitative Results
[Table: comparison to other algorithms on the PASCAL VOC 2012 validation set]

[Table: per-class accuracy on the PASCAL VOC 2012 test set]
Weakly‐Supervised Learning
Problem Definition
• Semantic segmentation by weakly supervised learning
• Image-level object class labels only
• Bounding boxes (and corresponding labels) only
• Scribbles (and corresponding labels) only
• Approaches
• Constrained optimization
• Iterative optimization
• Transfer learning
Multiple Instance Learning

• Training

[Figure: input image and Overfeat features]

P. O. Pinheiro, R. Collobert: From Image-level to Pixel-level Labeling with Convolutional Networks. CVPR 2015
Multiple Instance Learning

• Inference

P. O. Pinheiro, R. Collobert: From Image-level to Pixel-level Labeling with Convolutional Networks. CVPR 2015
Constrained Convolutional Neural Network
• With image-level class labels only
• Define the objective function with constraints
• Estimate a latent probability distribution for optimization

find θ subject to A Q(θ) ≥ b

min_{θ,P} D(P || Q(θ)) subject to A P ≥ b, Σ_X P(X) = 1

D. Pathak, P. Krähenbühl, T. Darrell: Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. ICCV 2015
Constraints
• Suppression constraint: suppress all labels not in the image.
• Foreground constraint: make positive labels visible.
• Background constraint: define lower and upper bounds on the background area.
• Size constraint: define an upper bound on the area of a class.

Σ_i P_i(l) = 0 ∀ l ∉ L (suppression); Σ_i P_i(l) ≥ α ∀ l ∈ L (foreground); lower and upper bounds on Σ_i P_i(bg) (background)
Optimization
• Iterative method with a slack variable

Find P: argmin_P D(P || Q(θ)) subject to A P ≥ b, Σ_X P(X) = 1

Find θ: minimize the cross-entropy −Σ_X P(X) log Q_X(θ) by SGD, where Q(θ) ∝ exp(f(θ)) is the network's output distribution
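A highly simplified stand-in for the "Find P" projection step, enforcing only the suppression constraint by zeroing absent labels and renormalizing (the paper performs a proper constrained KL projection; sizes here are toy):

```python
import numpy as np

rng = np.random.default_rng(3)
n_pixels, n_labels = 5, 4
Q = rng.random((n_pixels, n_labels))        # network output distribution
Q /= Q.sum(axis=1, keepdims=True)

present = np.array([True, False, True, False])   # image-level label set

# Crude projection: zero the probability of labels absent from the
# image, then renormalize each pixel's distribution.
P = Q * present
P /= P.sum(axis=1, keepdims=True)

assert np.allclose(P.sum(axis=1), 1.0)      # still a valid distribution
assert (P[:, ~present] == 0).all()          # suppression satisfied
```

The θ update then treats this projected P as the target of a standard cross-entropy loss.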
Results
[Figure: qualitative results: original image, ground truth, with labels, with labels + tags]
BoxSup

• With bounding box annotations only: candidate segments (object proposals) overlapping each box serve as estimated segmentation masks, with candidates penalized in proportion to (1 − IoU) between segment and box.

J. Dai, K. He, J. Sun: BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. ICCV 2015
Semantic Segmentation by Transfer Learning
• Data
• Source domain: image-level class labels and pixel-wise segmentation annotations
• Target domain: image-level class labels only
• Source and target domains are composed of exclusive sets of categories.
• Goal: semantic segmentation of target-domain images by transferring segmentation knowledge from source-domain data
• Impact: scalability to datasets with a large number of classes with minimal human supervision
Network Architecture
• Components
• Encoder: pre-trained VGG-16
• Decoder: deconvolution network
• Classifier: two fully connected layers
• Attention model: multiplicative interactions; attention weights are a softmax over spatial locations, α_i = exp(s_i) / Σ_j exp(s_j), combined with the encoder features by element-wise multiplication (⊙)

[Hong15] Seunghoon Hong, Junhyuk Oh, Bohyung Han, Honglak Lee: Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network. CVPR 2016
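The multiplicative attention can be sketched as a spatial softmax followed by element-wise weighting; shapes and the scoring inputs below are toy assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(4)
C, H, W = 8, 3, 3
feats = rng.standard_normal((C, H, W))     # encoder feature map
scores = rng.standard_normal((H, W))       # per-location attention scores

# Softmax over all spatial locations (numerically stabilized).
e = np.exp(scores - scores.max())
alpha = e / e.sum()

# Multiplicative interaction: element-wise (⊙) weighting of features,
# broadcast over the channel dimension.
attended = feats * alpha

assert np.isclose(alpha.sum(), 1.0)
assert attended.shape == (C, H, W)
```

Because the weights sum to one over locations, the attended map concentrates the decoder's input on class-relevant regions.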
Network Architecture
• Training strategy
• With segmentation annotations (source domain): train both the decoder and the attention model
• With image-level class labels (target domain): train both the classifier and the attention model
• Encoder is fixed: VGG 16-layer net

• Loss function: minimize, over the decoder, classifier, and attention parameters, the segmentation loss summed over source-domain images plus the classification loss summed over source- and target-domain images
Attention
• Attention vs. densified attention

[Figure: attention map compared with densified attention map]
Results
[Figure: input image, ground truth, densified attention, BaselineNet, TransferNet, TransferNet+CRF]
Results
DeepLab with Various Supervisions
DeepLab
• Same as the FCN approach, but with higher classification resolution
• Hole (atrous) algorithm
• Makes the feature map denser
• Reuses the existing CNN architecture

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR 2015

Smaller stride but a larger (dilated) filter for both pooling and convolution!
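The hole (atrous) trick can be illustrated by dilating a kernel with zeros; `dilate_kernel` is a name chosen here for illustration (real implementations skip input samples rather than materializing zeros).

```python
import numpy as np

def dilate_kernel(k, rate):
    """Insert (rate-1) zeros between kernel taps: the effective
    receptive field grows without adding any parameters."""
    n = k.shape[0]
    out = np.zeros((rate * (n - 1) + 1,) * 2, dtype=k.dtype)
    out[::rate, ::rate] = k
    return out

k = np.ones((3, 3))
k2 = dilate_kernel(k, 2)      # effective 5x5 support, still only 9 weights

assert k2.shape == (5, 5)
assert k2.sum() == k.sum()    # same taps, no new parameters
```

Combined with a reduced stride, this keeps the pretrained filters usable while producing a denser score map.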
Fully Connected Conditional Random Field
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j),

where the unary term θ_i comes from the DCNN scores and the pairwise term θ_ij couples every pair of pixels through Gaussian kernels on position and color.

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR 2015
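The Gaussian pairwise potentials can be sketched for a single pixel pair: an appearance kernel (position + color) plus a smoothness kernel (position only). The weights and bandwidths below are illustrative placeholders, not the tuned values from the paper.

```python
import numpy as np

def pairwise_weight(p_i, p_j, I_i, I_j,
                    w1=1.0, w2=1.0, s_alpha=60., s_beta=10., s_gamma=3.):
    """Pairwise coupling between pixels i and j: appearance kernel
    (position + RGB) plus smoothness kernel (position only)."""
    dp = np.sum((p_i - p_j) ** 2)    # squared position distance
    dI = np.sum((I_i - I_j) ** 2)    # squared color distance
    appearance = w1 * np.exp(-dp / (2 * s_alpha**2) - dI / (2 * s_beta**2))
    smoothness = w2 * np.exp(-dp / (2 * s_gamma**2))
    return appearance + smoothness

w_near = pairwise_weight(np.array([0, 0]), np.array([1, 0]),
                         np.array([100., 100., 100.]),
                         np.array([101., 100., 100.]))
w_far = pairwise_weight(np.array([0, 0]), np.array([200, 0]),
                        np.array([100., 100., 100.]),
                        np.array([0., 0., 0.]))
assert w_near > w_far   # nearby, similar-color pixels couple more strongly
```

Because every pixel pair is coupled, inference relies on efficient mean-field approximation rather than evaluating these weights pair by pair.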
Results by DeepLab‐CRF
WSSL
• Baseline: semantic segmentation with pixel-level annotations

• Goals: estimate pixel-level labels by learning a CNN based on
• Image-level annotations
• Bounding box annotations
• Hybrid annotations: many image-level annotations and few segmentation annotations

max_θ Σ log P(y | x; θ), where P(y | x; θ) ∝ Π_m exp(f(y_m | x; θ)); f is the output of the DCNN, θ the model parameters, and y_m the label of pixel m.

G. Papandreou, L.-C. Chen, K. P. Murphy, A. L. Yuille: Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation. ICCV 2015
EM using Weak Annotations
• E-step: estimate the latent segmentation
ŷ = argmax_y log P(y | x; θ) + log P(z | y), where z is the weak (image-level) annotation
• M-step: maximize the log-likelihood log P(ŷ | x; θ) by SGD
• EM-Fixed: add a constant bias to the scores of the classes present in the image,
f'(y_m = l) = f(y_m = l | x; θ) + b_l, with b_l = b_fg if label l is present (b_bg for background) and b_l = 0 otherwise, then take ŷ_m = argmax_l f'(y_m = l)
• EM-Adapt: the biases are set adaptively; log P(z | y) takes the form of a cardinality potential.
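The EM-Fixed E-step can be sketched as follows; random scores stand in for the DCNN output, and the bias values and label set are toy choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(5)
H, W, L = 3, 3, 5
f = rng.standard_normal((L, H, W))   # DCNN scores f(y_m = l | x; theta)
present = {0, 2}                     # background (0) plus one image label
b_fg, b_bg = 10.0, 8.0               # fixed biases (toy magnitudes)

# Add the fixed bias only to classes present in the image.
biased = f.copy()
for l in range(L):
    if l == 0:
        biased[l] += b_bg
    elif l in present:
        biased[l] += b_fg

y_hat = biased.argmax(axis=0)        # E-step: latent per-pixel labels

# Absent classes cannot win once present ones are strongly biased.
assert set(np.unique(y_hat)) <= present
```

The M-step then runs ordinary SGD treating ŷ as if it were ground truth.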
With image‐level class labels
With bounding box annotations
• Estimate pixel-level annotations from bounding boxes using a fully connected CRF
• Apply the EM-Fixed algorithm
With Hybrid Annotation
• Few segmentation annotations
• Many image-level class annotations