
Page 1:

Deep Learning for Computer Vision, Fall 2019

http://vllab.ee.ntu.edu.tw/dlcv.html (primary)

https://ceiba.ntu.edu.tw/1081DLCV (grade, etc.)

Yu-Chiang Frank Wang 王鈺強

Dept. Electrical Engineering, National Taiwan University

2019/12/10

Page 2:

What’s to be Covered …

• Learning Beyond Images
• 2D/3D Visual Data
• Depth Images

2

Page 3:

From 2D to 3D Visual Data

• Robotics
• Augmented Reality

• Autonomous Driving

3

http://cseweb.ucsd.edu/~haosu/slides/3ddl.pdf
https://www.androidauthority.com/shop-amazon-augmented-reality-right-now-841238/
https://arxiv.org/pdf/1711.08488.pdf

Page 4:

3D Deep Learning Tasks

• 3D geometry analysis
• 3D synthesis

• 3D-assisted image analysis

4

Page 5:

3D Deep Learning Tasks

• What We Will Focus on Today…

5

Page 6:

Representations of 3D Data

• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models

6

Page 7:

Representations of 3D Data

• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models

7

Page 8:

Representations of 3D Data

• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models

8

Page 9:

Representations of 3D Data

• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models

9

Page 10:

Representations of 3D Data

• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models

10

Page 11:

Representations of 3D Data

• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models

11

Page 12:

In this lecture, we mainly focus on…

• Multi-view RGB(D) Images
• Volumetric
• Polygonal Mesh
• Point Cloud
• Primitive-based CAD Models

12

Page 13:

Can We Directly Apply CNN on 3D Data?

13

Page 14:

Can We Directly Apply CNN on 3D Data?

• What Kind of 3D Data?
  • (O) Multi-view RGB(D) Images, Volumetric (regular grids; convolution applies directly)
  • (X) Polygonal Mesh, Point Cloud, Primitive-based CAD Models (irregular structures; standard CNNs cannot consume them directly)

14

Page 15:

Can We Directly Apply CNN on 3D Data?

15

• Convolution for 2D images:
• Convolution in 3D:
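The difference is easiest to see in code. A minimal sketch (PyTorch is our choice of framework here, not the slides'): 2D convolution slides a k x k kernel over the spatial dimensions of an image, while 3D convolution slides a k x k x k kernel over a voxel grid, so compute and memory grow cubically with resolution.

```python
import torch
import torch.nn as nn

# 2D convolution: a k x k kernel slides over (H, W) of an image tensor.
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 64, 64)           # (N, C, H, W)
print(conv2d(image).shape)                  # torch.Size([1, 16, 64, 64])

# 3D convolution: a k x k x k kernel slides over (D, H, W) of a voxel grid.
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
voxels = torch.randn(1, 1, 32, 32, 32)      # (N, C, D, H, W)
print(conv3d(voxels).shape)                 # torch.Size([1, 16, 32, 32, 32])
```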

Page 16:

Multi-view Representation

● Able to leverage the large body of CNN work on image analysis
● What viewpoints should be selected?
● What if the input is noisy or incomplete?
● Invisible (self-occluded) points are not processed
● Aggregating view representations is challenging (not trivial)

16

Page 17:

Related Work on Multi-View Representation

• Classification
  • Multi-view CNN (MVCNN) for shape recognition [ICCV’15]

• Segmentation
  • 3D shape segmentation with projective conv nets [CVPR’17]

17

Page 18:

Classification: MVCNN

• State-of-the-art performance for 3D classification (>90%)
• View pooling (all branches in the first stage of the network share the same parameters in CNN1)

18

Page 19:

Classification: MVCNN

• Synthesize the info from all views into a single & compact 3D shape descriptor

• Element-wise max operation across views in the view-pooling layer

• Closely related to max-pooling and maxout layers; the only difference is the dimension over which the operation is performed

• Features are obtained after view pooling.
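A minimal sketch of the view-pooling operation (the tensor shapes are illustrative assumptions, not the paper's implementation details):

```python
import torch

def view_pool(view_features: torch.Tensor) -> torch.Tensor:
    """Element-wise max over the view dimension.

    view_features: (num_views, batch, feat_dim) per-view features from the
    shared CNN1 branches.
    """
    pooled, _ = view_features.max(dim=0)   # (batch, feat_dim)
    return pooled

# Hypothetical shapes: 12 rendered views, batch of 4, 4096-d features.
feats = torch.randn(12, 4, 4096)
descriptor = view_pool(feats)              # single compact shape descriptor
```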

19

Page 20:

Related Work on Multi-View Representation

• Classification
  • Multi-view CNN (MVCNN) for shape recognition [ICCV’15]

• Segmentation
  • 3D shape segmentation with projective conv nets [CVPR’17]

20

Page 21:

Segmentation: ShapePFCN

• Combines image-based FCNs and surface-based CRFs (conditional random fields)

• Preprocess 3D mesh data into shaded images and depth images

21

Page 22:

Segmentation: ShapePFCN

• FCN:
  • VGG-16 pretrained network
  • Two additional modifications:
    1. The input is a 2-channel image (2-channel 3x3 filters instead of 3-channel RGB ones).
    2. The output of the original FCN is modified: the original FCN outputs L confidence maps of size 64 x 64 pixels, followed by a conversion into L probability maps via softmax. Instead, ShapePFCN upsamples the confidence maps to 512 x 512 pixels through a transposed convolutional layer (sketched below).
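A hedged sketch of modification 2 (the kernel size and padding are our choices that happen to give the required 8x upsampling; the paper's exact layer may differ):

```python
import torch
import torch.nn as nn

L = 5  # hypothetical number of part labels
# One transposed convolution with stride 8 maps 64 x 64 maps to 512 x 512:
# out = (64 - 1) * 8 - 2 * 4 + 16 = 512
upsample = nn.ConvTranspose2d(L, L, kernel_size=16, stride=8, padding=4)

conf_64 = torch.randn(1, L, 64, 64)     # FCN confidence maps
conf_512 = upsample(conf_64)
print(conf_512.shape)                   # torch.Size([1, 5, 512, 512])
```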

22

Page 23:

Segmentation: ShapePFCN

• Image2Surface Projection Layer
  • Aggregates the confidence maps across views and projects them back onto the 3D surface.
  • The L confidence maps extracted from the FCN are stacked into an M x 512 x 512 x L image; the projection layer takes this 4D image as input (M input images).
  • The surface reference images are likewise stacked into a 3D M x 512 x 512 image.
  • The layer outputs an FS x L array, where FS is the number of polygons of the shape S.

23

Page 24:

Segmentation: ShapePFCN

• Surface CRF (Conditional Random Field)
  • Converts label confidences (i.e., soft labels) into hard labels
  • In addition, the CRF handles label discontinuities across complex surfaces, which can be expected due to the upsampling in the FCN

24

Page 25:

Segmentation: ShapePFCN

• Example results

25

Page 26:

Properties of Volumetric Data

• Easy to operate on
• Info loss in voxelization
• But low resolution…

26

Page 27:

Related Work on Volumetric/Voxel Data

• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]

• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]

27

Page 28:

Classification: 3DShapeNets

• Directly utilize 3D shape info

• Accuracy: ~77%

• MVCNN outperforms volumetric methods in terms of classification.

28

Page 29:

Classification: 3DShapeNets

• Why does MVCNN work better?
  • It leverages the capabilities of 2D-based DL/CNN models.
  • It benefits from a large amount of 2D image data (e.g., ImageNet) for pretraining the CNN architectures.

29

Page 30:

Related Work on Volumetric/Voxel Data

• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]

• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]

30

Page 31:

3D Reconstruction: 3D-R2N2

• Supervised Learning

• Recurrent 3D CNN

• 3D LSTM units

• Single or multi-view 3D reconstruction

• Voxel-wise cross-entropy loss

31

Page 32:

3D Reconstruction: 3D-R2N2

• Supervised learning

• Ground truth 3D volume/voxels available

• Voxel-wise cross-entropy loss
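A minimal sketch of a voxel-wise cross-entropy loss under a binary-occupancy formulation (shapes and resolution are illustrative):

```python
import torch
import torch.nn.functional as F

def voxel_cross_entropy(logits: torch.Tensor, gt_occupancy: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy averaged over every voxel.

    logits:       (batch, D, H, W) raw network outputs
    gt_occupancy: (batch, D, H, W) ground-truth voxels in {0, 1}
    """
    return F.binary_cross_entropy_with_logits(logits, gt_occupancy.float())

pred = torch.randn(2, 32, 32, 32)               # hypothetical predictions
gt = (torch.rand(2, 32, 32, 32) > 0.5)          # hypothetical GT volume
loss = voxel_cross_entropy(pred, gt)
```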

32

Page 33:

3D Reconstruction: 3D-R2N2

• 3D LSTM units

• Input can be a single image or a series of images.

• Resolves multiple viewpoints seamlessly

33

Page 34:

3D Reconstruction: 3D-R2N2

• Example reconstruction results (left: single image input; right: multiple image inputs)

34

Page 35:

Related Work on Volumetric/Voxel Data

• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]

• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]

35

Page 36:

3D Reconstruction: PTN

• Projecting 3D volume into 2D masks

36

Page 37:

3D Reconstruction: PTN

• Loss function:
  • Reconstruction loss
  • Projection loss (see the sketch below)
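A toy sketch of the projection-loss idea (PTN's actual layer is a differentiable perspective transformer; here a simple orthographic max-projection stands in for it): the predicted volume is projected to a silhouette and compared against a ground-truth 2D mask, so no 3D supervision is required.

```python
import torch
import torch.nn.functional as F

def silhouette(voxels: torch.Tensor, axis: int = 1) -> torch.Tensor:
    """Project occupancy probabilities to a 2D mask by taking the max along
    one axis. voxels: (batch, D, H, W) with values in [0, 1]."""
    return voxels.max(dim=axis).values

def projection_loss(pred_voxels: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """Compare the projected silhouette of the predicted volume against a
    ground-truth 2D mask."""
    return F.binary_cross_entropy(silhouette(pred_voxels), gt_mask)

pred = torch.rand(2, 32, 32, 32)                  # hypothetical occupancies
mask = (torch.rand(2, 32, 32) > 0.5).float()      # hypothetical GT masks
loss = projection_loss(pred, mask)
```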

37

Page 38:

Related Work on Volumetric/Voxel Data

• Classification
  • 3D ShapeNets: A Deep Representation for Volumetric Shapes [CVPR’15]

• 3D Reconstruction
  • 3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction [ECCV’16]
  • Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision [NIPS’16]
  • Weakly Supervised 3D Reconstruction with Adversarial Constraint [3DV’17]

38

Page 39:

Weakly Supervised 3D Reconstruction

• Image and approximated viewpoints as inputs

• 2D masks as supervision

• Raytrace pooling layer enables perspective projection and backpropagation

• Constrain 3D reconstruction to the manifold of unlabeled realistic 3D shapes

39

Page 40:

Weakly Supervised 3D Reconstruction

• Image and approximated viewpoints as inputs

• 2D masks as supervision

• Raytrace pooling layer enables perspective projection and backpropagation

• Constrain 3D reconstruction to the manifold of unlabeled realistic 3D shapes

40

Page 41:

Point Cloud

• Unordered point set

• The same object can be represented in different orders (see below).

• Needs to be invariant to geometric transformations (e.g., rotation, translation)

41

Page 42:

Related Work on Point Cloud

• Classification/segmentation
  • PointNet: Deep Learning on Point Sets for 3D Classification/Segmentation [CVPR’17]

• 3D Reconstruction
  • A Point Set Generation Net for 3D Object Reconstruction from a Single Image [CVPR’17]

• Unsupervised Learning (i.e., Autoencoder for Data Recovery)
  • FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds [CVPR’18]

42

Page 43:

Classification/Segmentation

• Classification/segmentation via Point Clouds

43

Page 44:

Classification/Segmentation via PointNet

• Unordered input → shared per-point perceptron + max pooling

• Interaction among points: concatenate local and global features

• Invariance under transformation
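A minimal PointNet-style sketch (layer widths are illustrative, and the paper's input/feature transform networks are omitted): a shared per-point MLP followed by a symmetric max-pool makes the prediction invariant to the ordering of the points.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + symmetric max-pool for classification."""
    def __init__(self, num_classes: int = 40):
        super().__init__()
        self.mlp = nn.Sequential(          # applied independently to each point
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1024, 1), nn.ReLU(),
        )
        self.head = nn.Linear(1024, num_classes)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, 3, num_points)
        feats = self.mlp(points)           # (batch, 1024, num_points)
        global_feat, _ = feats.max(dim=2)  # order-invariant global descriptor
        return self.head(global_feat)

logits = TinyPointNet()(torch.randn(4, 3, 1024))  # (4, 40)
```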

44

Page 45:

Related Work on Point Cloud

• Classification/segmentation
  • PointNet: Deep Learning on Point Sets for 3D Classification/Segmentation [CVPR’17]

• 3D Reconstruction
  • A Point Set Generation Net for 3D Object Reconstruction from a Single Image [CVPR’17]

• Unsupervised Learning (i.e., Autoencoder for Data Recovery)
  • FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds [CVPR’18]

45

Page 46:

3D Reconstruction via Point Cloud

• A Point Set Generation Net for 3D Object Reconstruction from a Single Image

46

Page 47:

3D Reconstruction via Point Cloud

• Settings:
  • Supervised learning with ground-truth point clouds
  • 3D shapes represented as unordered point sets
  • Two-branch prediction: fully connected layers for intrinsic structure + deconvolution for smooth surfaces
  • Loss function: Chamfer distance
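A minimal sketch of the Chamfer distance between two point sets (a dense O(nm) implementation for clarity; production code typically uses batched or accelerated nearest-neighbor search):

```python
import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets p1: (n, 3), p2: (m, 3).

    For every point in one set, find the squared distance to its nearest
    neighbor in the other set; average both directions.
    """
    d = torch.cdist(p1, p2) ** 2                 # (n, m) pairwise sq. dists
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

loss = chamfer_distance(torch.randn(1024, 3), torch.randn(1024, 3))
```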

47

Page 48:

3D Reconstruction via Point Cloud

• Example results

48

Page 49:

Related Work on Point Cloud

• Classification/segmentation
  • PointNet: Deep Learning on Point Sets for 3D Classification/Segmentation [CVPR’17]

• 3D Reconstruction
  • A Point Set Generation Net for 3D Object Reconstruction from a Single Image [CVPR’17]

• Unsupervised Learning (i.e., Autoencoder for Data Recovery)
  • FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds [CVPR’18]

49

Page 50:

Unsupervised Learning for Point Clouds

• Autoencoder for 3D Point Clouds

50

Page 51:

FoldingNet for Recovering Point Clouds

• Graph representation for point clouds

• Analogous to convolution on images: each pixel’s spatial ordering and neighborhood remain unchanged even as the feature channels of the input expand in the top conv layers.

51

Page 52:

Encoder in FoldingNet

• n: number of points in input point cloud

• Use kNN graph, compute local covariance of k = 16 points along xyz

• Perceptron: per-point function

52

Page 53:

Encoder in FoldingNet

• n: number of points in input point cloud

• Use kNN graph, compute local covariance of k = 16 points along xyz

• Perceptron: per-point function

• Graph layer: perceptron + graph max pooling
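A minimal sketch of the graph-layer idea (shapes and helper names are ours, not the paper's): the per-point perceptron is as in PointNet; graph max pooling then replaces each point's feature with the element-wise max over its k-nearest-neighbor neighborhood.

```python
import torch

def knn_indices(points: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Indices of the k nearest neighbors of every point (self included).
    points: (n, 3) -> (n, k)."""
    dists = torch.cdist(points, points)            # (n, n)
    return dists.topk(k, largest=False).indices

def graph_max_pool(feats: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Element-wise max over each point's neighborhood.
    feats: (n, c), idx: (n, k) -> (n, c)."""
    return feats[idx].max(dim=1).values            # gather (n, k, c), pool

pts = torch.randn(2048, 3)
feats = torch.randn(2048, 64)
pooled = graph_max_pool(feats, knn_indices(pts))
```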

53

Page 54:

Decoder in FoldingNet

54

Page 55:

Remarks for FoldingNet

• Transfer classification accuracy

• Efficiency of representation learning and feature extraction

• Use point clouds from ShapeNet to train autoencoder

• Classification: train an SVM on another dataset using codewords obtained from the encoder

55

Page 56:

What’s to be Covered …

• Learning Beyond Images
• 2D/3D Visual Data
• Depth Images

56

Page 57:

Depth Estimation from a Single Image

• Depth estimation from a single image in a semi-supervised setting
  • Use supervised and unsupervised cues simultaneously
  • Supervised cue: sparse depth image
  • Unsupervised cue: stereo pair consistency

57
Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017

Page 58:

• Overall loss function:
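The equation itself did not survive extraction. As a hedged sketch of its structure only (the λ weights and subscripts are our placeholder notation, not necessarily the paper's symbols), the overall objective combines the supervised term, the unsupervised stereo-consistency term, and a smoothness regularizer:

```latex
\mathcal{L}(\theta) \,=\, \lambda_{S}\,\mathcal{L}_{\mathrm{supervised}}(\theta)
\,+\, \lambda_{U}\,\mathcal{L}_{\mathrm{unsupervised}}(\theta)
\,+\, \lambda_{R}\,\mathcal{L}_{\mathrm{smoothness}}(\theta)
```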

58
Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017

Page 59:

• The supervised loss measures the deviation of the predicted depth map from the ground truth depth values.

59
Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017

Page 60:

• The unsupervised loss quantifies the direct image alignment error in both directions:

60
Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017

Page 61:

• Network architecture: ResNet encoder + decoder

61
Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017

Page 62:

• Example results

62
Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017

Page 63:

Unsupervised Depth Estimation: Depth Image Estimation from a Single Image

• Estimate the depth image from a single RGB input image without supervision

• Render the disparity map from a single image
• Depth info can be estimated from the disparity map
• The disparity map is able to warp the left image to the right image (and vice versa)
• Training data: stereo image pairs only

63
Unsupervised Monocular Depth Estimation with Left-Right Consistency, CVPR 2017

Page 64:

• Simultaneously render the disparity maps from both views

• Enforce consistency between the two recovered disparity maps, which leads to accurate results w/o supervision during training.

64Unsupervised Monocular Depth Estimation with Left-Right Consistency, CVPR 2017

Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)

Page 65:

65
Unsupervised Monocular Depth Estimation with Left-Right Consistency, CVPR 2017

● Reconstruction loss:

● Smoothness loss:

● Consistency loss:
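The equations themselves were lost in extraction; the following is a hedged reconstruction from the cited paper (Godard et al.), where $\tilde{I}$ is the reconstructed image, $d^l, d^r$ are the left/right disparities, $\alpha$ weights the SSIM term, and $N$ is the number of pixels:

```latex
\begin{align*}
C_{ap} &= \frac{1}{N}\sum_{i,j}\Big[\alpha\,\frac{1-\mathrm{SSIM}(I_{ij},\tilde{I}_{ij})}{2} + (1-\alpha)\,\lVert I_{ij}-\tilde{I}_{ij}\rVert\Big] \\
C_{ds} &= \frac{1}{N}\sum_{i,j}\Big[\lvert\partial_x d_{ij}\rvert\, e^{-\lVert\partial_x I_{ij}\rVert} + \lvert\partial_y d_{ij}\rvert\, e^{-\lVert\partial_y I_{ij}\rVert}\Big] \\
C_{lr} &= \frac{1}{N}\sum_{i,j}\big\lvert d^{l}_{ij} - d^{r}_{ij+d^{l}_{ij}}\big\rvert
\end{align*}
```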

Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)

Page 66:

• Estimate the depth image by inferring the disparity maps (left & right) from the single input image.

66
Unsupervised Monocular Depth Estimation with Left-Right Consistency, CVPR 2017

Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)

Page 67:

• Example results

67
Unsupervised Monocular Depth Estimation with Left-Right Consistency, CVPR 2017

Unsupervised Depth Estimation: Depth Image Estimation from a Single Image (cont’d)

Page 68:

• Goal: depth estimation from a single image/frame + camera pose estimation from multiple consecutive frames

• Challenge
  • No supervision (i.e., no ground-truth info available)
  • Only the input video sequence is available

• Method
  • Use single-view depth and pose networks for video frame recovery

• Experiments
  • KITTI, Cityscapes, and Make3D datasets

68
Unsupervised Learning of Depth and Ego-motion from Video, CVPR 2017 (oral)

Unsupervised Depth Estimation: Depth Image Estimation from a Video Sequence

Page 69:

Introduction

69

Goal (Inference)

• Estimate the depth image and camera pose without supervision from ground-truth depth, stereo pairs, or camera-pose information

Method

• The key supervision signal for the proposed method comes from view synthesis: synthesizing a new image of the scene seen from a different camera pose

Page 70:

Method Overview

[Figure: training overview. From the training sequence, a depth network (Dep. NN) takes the target frame and a pose network (Pose NN) takes nearby frames; the predicted depth and pose information are combined to render the predicted target image.]

Page 71:

Method

71

• Concept: render the target view. A depth network (Depth NN) predicts the depth $\hat{D}_t$ of the RGB image $I_t$ at time $t$ (the target view), and a camera-pose network (Cam-Pose NN) predicts the transformation matrix $\hat{T}_{t \to t+1}$; $K$ denotes the camera parameters. The RGB image at time $t+1$ is synthesized as

$$\hat{I}_{t+1} = K \,\hat{T}_{t \to t+1}\, \hat{D}_t(I_t)\, K^{-1} I_t$$

• Training minimizes the photometric difference between the synthesized and the observed frames:

$$\mathcal{L} = \sum_t \left| \hat{I}_{t+1} - I_{t+1} \right|$$
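A hedged sketch of this view-synthesis warp in PyTorch (pinhole camera model, no boundary or occlusion handling; shapes and helper names are ours, not the paper's): each target pixel is back-projected with the predicted depth, moved by the predicted pose, re-projected with K, and used to bilinearly sample the nearby frame; the L1 difference to the observed frame gives the photometric loss.

```python
import torch
import torch.nn.functional as F

def synthesize_view(nearby_img, depth, pose, K):
    """Differentiable view synthesis via  p' ~ K T D(p) K^-1 p.

    nearby_img: (1, 3, H, W) frame at time t+1
    depth:      (H, W)       predicted depth for the target frame at time t
    pose:       (4, 4)       predicted transform T_{t -> t+1}
    K:          (3, 3)       camera intrinsics
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float().reshape(3, -1)

    cam = K.inverse() @ pix * depth.reshape(1, -1)        # back-project to 3D
    cam = torch.cat([cam, torch.ones(1, H * W)], dim=0)   # homogeneous coords
    proj = K @ (pose @ cam)[:3]                           # re-project
    uv = proj[:2] / proj[2:].clamp(min=1e-6)              # pixel coordinates

    grid = torch.stack([2 * uv[0] / (W - 1) - 1,          # normalize for
                        2 * uv[1] / (H - 1) - 1], -1)     # grid_sample
    return F.grid_sample(nearby_img, grid.reshape(1, H, W, 2),
                         align_corners=True)

# Reconstruct the target frame by sampling the nearby frame, then compare:
# recon = synthesize_view(nearby_img, depth, pose, K)
# loss = (recon - target_img).abs().mean()   # L1 photometric loss
```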

Page 72:

Experiment: Depth Estimation

72

• Dataset
  • KITTI, Cityscapes, and Make3D

Page 73:

• 3D reconstruction from a single image (with GT 3D but no pose info)

• A single-image & pose-aware 3D reconstruction DL framework

• Extract camera pose info from 2D-3D self-consistency without supervision

73
Liao, Yang, Lin, Chen, Kuo, Chiu, & Wang, Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency, IEEE ICASSP 2019.

Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency

Page 74:

• Experiments

74
Liao, Yang, Lin, Chen, Kuo, Chiu, & Wang, Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency, IEEE ICASSP 2019.

Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency

(a) Input image (b) GT mask (c) Predicted mask (d) Projection of predicted shapes
(e) GT voxel (f) Predicted voxel (g) GT pose-aware mesh (h) Predicted pose-aware mesh

Page 75:

• Experiments

75
Liao, Yang, Lin, Chen, Kuo, Chiu, & Wang, Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency, IEEE ICASSP 2019.

Learning Pose-Aware 3D Reconstruction via 2D-3D Self-Consistency

Comparison with other fully supervised methods in terms of IoU.

Quantitative results of pose-aware 3D shape reconstruction in terms of IoU; “pred” means that shapes or poses are predicted.

Quantitative results of 3D-2D projection in terms of IoU; we evaluate IoU between GT masks and different projections.

Quantitative results of pose estimation and mask segmentation.

Page 76:

• Unsupervised monocular depth estimation

• Joint exploitation of scene semantics via semantic segmentation

• No GT for the depth image

• Improved scene representation for joint depth estimation & semantic segmentation

76
P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019

Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation

Po-Yi Chen*1, Alexander H. Liu*1, Yen-Cheng Liu2, Yu-Chiang Frank Wang1

1 National Taiwan University  2 Georgia Institute of Technology

Page 77:

Introduction – Unsupervised Monocular Depth Estimation

Monocular Depth Estimation

[Figure: a model maps a 2D image to a disparity map (depth). For unsupervised training with stereo views, the model predicts the disparity from the left view, warps the right view with it to reconstruct the left view, and trains on the reconstruction error.]

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019
77

Page 78:

Introduction – Our Goal

Learning semantic-aware scene representation for depth estimation.

[Figure: a scene representation of the 2D scene links geometric understanding (depth estimation) and semantic understanding (semantic segmentation), with content consistency between the two tasks.]

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019
78

Page 79:

Methodology – Overview

- Scene representation from a single network
- Multi-task learning on depth estimation and semantic segmentation
- Refine depth estimation with semantic information

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019
79

Page 80:

Methodology – Shared Scene Representation

- Unified network architecture
- Controllable cross-modal prediction
- Multi-task learning from disjoint datasets

[Figure: a shared encoder-decoder (Enc, Dec) produces the scene representation conditioned on a task identity; a softmax head outputs semantic labels, and a sigmoid head with pixel-wise avg. pooling outputs the disparity map.]

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019
80

Page 81:

Methodology – Self-supervised learning

1. Left-Right Semantic Consistency
- Unsupervised learning relies on left-right consistency of color values (RGB images)
- Such consistency may be affected by photometric changes (e.g., reflections on glass)
- We propose left-right consistency at the semantic level

[Figure: top: the model warps the right view with the left disparity to reconstruct the left view, enforcing consistency with the left view; bottom: the model warps the right semantic map with the left disparity to reconstruct the left semantic map, enforcing consistency with the left semantic prediction.]

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019
81

Page 82:

Methodology – Self-supervised learning

2. Semantic-Guided Disparity Smoothness
- Disparity should change smoothly within a single object
- Pseudo object boundaries can be obtained from the semantic prediction
- We propose to regularize disparity smoothness within these boundaries (see the sketch below)

[Figure panels: semantic prediction, object boundary, regularized smoothness.]
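A toy sketch of the idea (our simplification, not the paper's exact loss; the boundary mask is assumed to come from edges of the semantic prediction): disparity gradients are penalized everywhere except across the pseudo object boundaries.

```python
import torch

def semantic_guided_smoothness(disp: torch.Tensor, boundary: torch.Tensor) -> torch.Tensor:
    """Penalize disparity gradients only within objects.

    disp:     (1, H, W) predicted disparity
    boundary: (1, H, W) in {0, 1}; 1 marks a semantic edge pixel
    """
    dx = (disp[:, :, 1:] - disp[:, :, :-1]).abs()   # horizontal gradients
    dy = (disp[:, 1:, :] - disp[:, :-1, :]).abs()   # vertical gradients
    keep_x = 1.0 - boundary[:, :, 1:]               # drop penalty across edges
    keep_y = 1.0 - boundary[:, 1:, :]
    return (dx * keep_x).mean() + (dy * keep_y).mean()

disp = torch.rand(1, 128, 256)
boundary = (torch.rand(1, 128, 256) > 0.95).float()  # hypothetical edge mask
loss = semantic_guided_smoothness(disp, boundary)
```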

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019
82

Page 83:

Experiments – Setup

- Dataset
  - Stereo pairs from the KITTI dataset (for unsupervised depth estimation)
  - Single views from the Cityscapes dataset (for supervised semantic segmentation)
- Model
  - Encoder: 14-layer dilated residual network
  - Decoder: 8-layer transposed convolution network
  - Instead of using a separate decoder for each view, we introduce a horizontal-flipping technique

Page 84:

Experiments – Results on depth estimation

- State-of-the-art results on unsupervised depth estimation
- Leveraging semantic segmentation, the performance can be further improved

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019

Page 85:

Experiments – Study on multi-task learning

- Semantic segmentation & depth estimation benefit each other
- A shared decoder & task identity improve the robustness of multi-task learning
- Our framework yields a better scene representation

P. Chen, A. Liu, Y. Liu, & Y.-C. F. Wang, Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation, CVPR 2019

Page 86:

References

● 3D Deep Learning Tutorial
  ○ http://3ddl.stanford.edu/CVPR17_Tutorial_Overview.pdf
  ○ http://3ddl.stanford.edu/CVPR17_Tutorial_MVCNN_3DCNN_v3.pdf
  ○ http://cseweb.ucsd.edu/~haosu/slides/3ddl.pdf
  ○ https://cse291-i.github.io/

● List of 3D deep learning related projects

○ https://github.com/timzhang642/3D-Machine-Learning

86