Indoor Scene Segmentation using a Structured Light Sensor
Nathan Silberman and Rob Fergus
ICCV 2011 Workshop on 3D Representation and Recognition
Courant Institute


TRANSCRIPT

Page 1: Indoor Scene Segmentation using a Structured Light Sensor

Indoor Scene Segmentation using a Structured Light Sensor

Nathan Silberman and Rob Fergus

ICCV 2011 Workshop on 3D Representation and Recognition

Courant Institute

Page 2: Indoor Scene Segmentation using a Structured Light Sensor

Overview

Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues

Page 3: Indoor Scene Segmentation using a Structured Light Sensor

Motivation
• Indoor scene recognition is hard
  – Far less texture than outdoor scenes
  – More geometric structure

Page 4: Indoor Scene Segmentation using a Structured Light Sensor

Motivation
• Indoor scene recognition is hard
  – Far less texture than outdoor scenes
  – More geometric structure

• Kinect gives us a depth map (and RGB)
  – Direct access to shape and geometry information

Page 5: Indoor Scene Segmentation using a Structured Light Sensor

Overview

Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues

Page 6: Indoor Scene Segmentation using a Structured Light Sensor

Capturing our Dataset

Page 7: Indoor Scene Segmentation using a Structured Light Sensor

Statistics of the Dataset

Scene Type     Scenes    Frames    Labeled Frames*
Bathroom          6        5,588        76
Bedroom          17       22,764       480
Bookstore         3       27,173       784
Cafe              1        1,933        48
Kitchen          10       12,643       285
Living Room      13       19,262       355
Office           14       19,254       319
Total            64      108,617     2,347

* Labels obtained via LabelMe

Page 8: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Living Room

RGB Raw Depth Labels

Page 9: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Living Room

RGB Depth* Labels

* Bilateral Filtering used to clean up raw depth image

Page 10: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Bathroom

RGB Depth Labels

Page 11: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Bedroom

RGB Depth Labels

Page 12: Indoor Scene Segmentation using a Structured Light Sensor

Existing Depth Datasets

[1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011.
[2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.

RGB-D Dataset [1]

Stanford Make3d [2]

Page 13: Indoor Scene Segmentation using a Structured Light Sensor

Existing Depth Datasets

[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011.
[2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.

Point Cloud Data [1] B3DO [2]

Page 14: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Freely Available
http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html

Page 15: Indoor Scene Segmentation using a Structured Light Sensor

Overview

Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues

Page 16: Indoor Scene Segmentation using a Structured Light Sensor

Segmentation using CRF Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

• Standard CRF formulation
• Optimized via graph cuts
• Discrete label set (~12 classes)
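Below is a minimal sketch, not the authors' code, of evaluating this cost for one candidate labeling; the names unary_cost, pairs, and smoothness_weight are hypothetical, and in practice the minimizing labeling is found with graph cuts (e.g. alpha-expansion over the discrete label set) rather than by enumerating labelings.

```python
import numpy as np

def crf_cost(labels, unary_cost, pairs, smoothness_weight):
    """Cost(labels) = sum_i LocalTerm(label i)
                    + sum_{(i,j)} SpatialSmoothness(label i, label j)

    labels:            (N,) int array, one label per pixel
    unary_cost:        (N, K) array, local cost of assigning each of K labels to each pixel
    pairs:             (M, 2) int array of indices of adjacent pixels
    smoothness_weight: scalar Potts penalty applied when neighboring labels disagree
    """
    local = unary_cost[np.arange(labels.size), labels].sum()
    disagree = labels[pairs[:, 0]] != labels[pairs[:, 1]]
    return local + smoothness_weight * disagree.sum()
```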

Page 17: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

LocalTerm(label i) = Appearance(label i | descriptor i) · Location(i)

Page 18: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

LocalTerm(label i) = Appearance(label i | descriptor i) · Location(i)

Page 19: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Term

Appearance(label i | descriptor i)

Several descriptor types to choose from:
• RGB-SIFT
• Depth-SIFT
• Depth-SPIN
• RGBD-SIFT
• RGB-SIFT/D-SPIN

Page 20: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: RGB-SIFT

Extracted Over Discrete Grid

RGB image from the Kinect

128 D

Page 21: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: Depth-SIFT

Depth image from Kinect with linear scaling

128 D

Extracted Over Discrete Grid

Page 22: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: Depth-SPIN

Depth image from Kinect with linear scaling

50 D

Radius

Depth

A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE PAMI, 21(5):433–449, 1999

Extracted Over Discrete Grid

Page 23: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: RGBD-SIFT

Concatenate

256 D

RGB image from the Kinect

Depth image from Kinect with linear scaling

Page 24: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: RGB-SIFT/D-SPIN

Concatenate

RGB image from the Kinect

Depth image from Kinect with linear scaling

178 D
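As a small illustration of how the combined descriptors reach the dimensionalities quoted on these slides, here is a sketch; the variable names are made up and the SIFT/spin extraction itself is omitted.

```python
import numpy as np

# Hypothetical per-location descriptors (names are illustrative only):
# 128-D SIFT on the RGB image, 128-D SIFT on the depth image, 50-D spin image on depth.
sift_rgb   = np.zeros(128)
sift_depth = np.zeros(128)
spin_depth = np.zeros(50)

rgbd_sift       = np.concatenate([sift_rgb, sift_depth])  # 256-D RGBD-SIFT
rgb_sift_d_spin = np.concatenate([sift_rgb, spin_depth])  # 178-D RGB-SIFT/D-SPIN
assert rgbd_sift.size == 256 and rgb_sift_d_spin.size == 178
```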

Page 25: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

Descriptor at each location

Appearance(label i | descriptor i)
• Modeled by a neural network with a single hidden layer

Page 26: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

Descriptor at each location

Appearance(label i | descriptor i)

13 Classes

1000-D Hidden Layer

128/178/256-D Input

Softmax output layer

Page 27: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

13 Classes

1000-D Hidden Layer

128/178/256-D Input

Descriptor at each location

Probability Distribution over classes

Appearance(label i | descriptor i)

Interpreted as p(label | descriptor)

Page 28: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

13 Classes

1000-D Hidden Layer

128/178/256-D Input

Descriptor at each location

Probability Distribution over classes

Appearance(label i | descriptor i)

Trained with backpropagation
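A minimal sketch of the appearance network's forward pass; the weight names are hypothetical and the tanh hidden nonlinearity is an assumption (the slides only specify the layer sizes and the softmax output).

```python
import numpy as np

def appearance_term(descriptor, W1, b1, W2, b2):
    """Single-hidden-layer network as described on the slides:
    descriptor (128/178/256-D) -> 1000-D hidden layer -> softmax over 13 classes.

    W1: (input_dim, 1000), b1: (1000,), W2: (1000, 13), b2: (13,)
    Weights are assumed to have been trained with backpropagation.
    Returns p(label | descriptor) as a length-13 probability vector."""
    h = np.tanh(descriptor @ W1 + b1)   # hidden activations (nonlinearity is an assumption)
    logits = h @ W2 + b2
    logits -= logits.max()              # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```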

Page 29: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

LocalTerm(label i) = Appearance(label i | descriptor i) · Location(i)

Page 30: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

LocalTerm(label i) = Appearance(label i | descriptor i) · Location(i)

Page 31: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

LocalTerm(label i) = Appearance(label i | descriptor i) · Location(i)

Location(i): 2D priors / 3D priors

Page 32: Indoor Scene Segmentation using a Structured Light Sensor

Location Priors: 2D

• 2D priors are histograms of P(class, location)
• Smoothed to avoid image-specific artifacts
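One plausible way to build such a smoothed prior is sketched below; the Gaussian smoothing and its bandwidth are assumptions, since the slide only says the histograms are smoothed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_2d_prior(label_maps, num_classes, sigma=10.0):
    """2D location prior: for each class, a histogram of where that class occurs
    in image coordinates, smoothed to avoid image-specific artifacts.

    label_maps: list of (H, W) integer label images (all the same size)
    Returns an array of shape (num_classes, H, W); sigma is illustrative."""
    H, W = label_maps[0].shape
    counts = np.zeros((num_classes, H, W))
    for lm in label_maps:
        for c in range(num_classes):
            counts[c] += (lm == c)
    smoothed = np.stack([gaussian_filter(counts[c], sigma) for c in range(num_classes)])
    return smoothed / smoothed.sum()    # normalize to a joint histogram P(class, location)
```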

Page 33: Indoor Scene Segmentation using a Structured Light Sensor

Motivation: 3D Location Priors

• 2D Priors don’t capture 3d geomety• 3D Priors can be built from depth data

• Rooms are of different shapes and sizes, how do we align them?

Page 34: Indoor Scene Segmentation using a Structured Light Sensor

Motivation: 3D Location Priors

• To align rooms, we’ll use a normalized cylindrical coordinate system:

Band of maximum depths along each vertical scanline
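A sketch of one plausible reading of this normalization: each pixel's depth is divided by the maximum depth in its image column. The exact construction of the cylindrical coordinate system is not spelled out on the slide.

```python
import numpy as np

def relative_depth(depth):
    """Express each pixel's depth relative to the band of maximum depths along
    its vertical scanline (column), so values run roughly from 0 (near the
    camera) to 1 (back of the room). This is an assumed reading of the slide."""
    max_per_column = depth.max(axis=0, keepdims=True)   # (1, W) band of maximum depths
    return depth / np.maximum(max_per_column, 1e-6)
```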

Page 35: Indoor Scene Segmentation using a Structured Light Sensor

Relative Depth Distributions

[Figure: density of relative depth (0 to 1) for Table, Television, Bed, and Wall]

Page 36: Indoor Scene Segmentation using a Structured Light Sensor

Location Priors: 3D

Page 37: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

LocalTerm(label i) = Appearance(label i | descriptor i) · Location(i)

Location(i): 2D priors / 3D priors

Page 38: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

SpatialSmoothness(label i, label j): penalty for adjacent labels disagreeing (standard Potts model)

Page 39: Indoor Scene Segmentation using a Structured Light Sensor

Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label i) + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label i, label j)

Spatial modulation of the smoothness term (see the sketch after this list):
• None
• RGB edges
• Depth edges
• RGB + depth edges
• Superpixel edges
• Superpixel + RGB edges
• Superpixel + depth edges
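The sketch below shows one way such modulation could look: the Potts penalty between two adjacent pixels is scaled down where an RGB and/or depth edge separates them. The exponential form and the weight lam are assumptions; the slides only list which cues were tried.

```python
import numpy as np

def pairwise_weight(rgb_edge, depth_edge, lam=1.0, use_rgb=True, use_depth=True):
    """Spatially modulated Potts weight: strong edges between two adjacent
    pixels reduce the penalty for their labels disagreeing. `rgb_edge` and
    `depth_edge` are edge strengths for the pixel pair; form and lam are
    illustrative assumptions."""
    strength = 0.0
    if use_rgb:
        strength = strength + rgb_edge
    if use_depth:
        strength = strength + depth_edge
    return lam * np.exp(-strength)
```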

Page 40: Indoor Scene Segmentation using a Structured Light Sensor

Experimental Setup

• 60% train (~1,408 images)
• 40% test (~939 images)
• 10-fold cross-validation
• Images of the same scene never appear in both train and test (see the split sketch below)
• Performance criterion is pixel-level classification accuracy (mean of the confusion-matrix diagonal)
• 12 most common classes, plus 1 background class (formed from the rest)
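A sketch of a scene-grouped split consistent with these constraints; the function and argument names are hypothetical, and splitting by scene means the train fraction is only approximate.

```python
import numpy as np

def scene_split(scene_ids, train_frac=0.6, seed=0):
    """Split frames into train/test so that all frames from a given scene land
    on the same side (a scene never appears in both sets).
    scene_ids: array with one scene identifier per frame."""
    rng = np.random.default_rng(seed)
    scenes = rng.permutation(np.unique(scene_ids))
    train_scenes = set(scenes[: int(round(train_frac * len(scenes)))])
    is_train = np.array([s in train_scenes for s in scene_ids])
    return np.where(is_train)[0], np.where(~is_train)[0]
```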

Page 41: Indoor Scene Segmentation using a Structured Light Sensor

Evaluating Descriptors

[Bar chart: pixel-level accuracy (%) of Unary vs. CRF models for RGB-SIFT, Depth-SIFT, Depth-SPIN, RGBD-SIFT, and RGB-SIFT/D-SPIN descriptors, grouped into 2D and 3D descriptors; y-axis roughly 30–50%]

Page 42: Indoor Scene Segmentation using a Structured Light Sensor

Evaluating Location Priors

[Bar chart: pixel-level accuracy (%) of Unary vs. CRF models for RGB-SIFT, RGB-SIFT + 2D priors, RGBD-SIFT, RGBD-SIFT + 2D priors, RGBD-SIFT + 3D priors, and RGBD-SIFT + 3D priors (abs); y-axis roughly 30–55%]

Page 43: Indoor Scene Segmentation using a Structured Light Sensor
Page 44: Indoor Scene Segmentation using a Structured Light Sensor
Page 45: Indoor Scene Segmentation using a Structured Light Sensor

Conclusion

• Kinect depth signal helps scene parsing
• Still a long way from great performance
• Shown standard approaches on RGB-D data
• Lots of potential for more sophisticated methods
• No complicated geometric reasoning
• http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html

Page 46: Indoor Scene Segmentation using a Structured Light Sensor

Preprocessing the Data

[1] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011

We use open-source calibration software [1] to infer:
• Parameters of the RGB and depth cameras
• Homography between the cameras

Page 47: Indoor Scene Segmentation using a Structured Light Sensor

Preprocessing the data

• Bilateral filter used to diffuse depth across regions of similar RGB intensity

• Naïve GPU implementation runs in ~100 ms
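For illustration, a naive CPU sketch of such a cross-bilateral fill; the paper's GPU implementation and its parameters are not reproduced here, and the Gaussian weights and the zero-encoding of missing depth are assumptions.

```python
import numpy as np

def cross_bilateral_fill(depth, intensity, radius=5, sigma_s=3.0, sigma_r=0.1):
    """Diffuse depth from neighboring pixels whose RGB intensity is similar.
    depth, intensity: (H, W) float arrays; missing depth assumed encoded as 0."""
    H, W = depth.shape
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            d = depth[y0:y1, x0:x1]
            i = intensity[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            w_spatial = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            w_range = np.exp(-((i - intensity[y, x]) ** 2) / (2 * sigma_r ** 2))
            w = w_spatial * w_range * (d > 0)          # ignore missing depth
            out[y, x] = (w * d).sum() / max(w.sum(), 1e-6)
    return out
```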

Page 48: Indoor Scene Segmentation using a Structured Light Sensor

Motivation

Results from spatial pyramid-based classification [1] using 5 indoor scene types. Contrast this with the 81% achieved by [1] on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes.

[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.