
Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning

Allan Zelener

Dissertation Proposal

December 12th 2016

Overview

1. Introduction to 3D Object Identification

2. Completed Work

• Part-based Object Classification of Vehicle Point Clouds.

• CNN-based Object Segmentation in LIDAR with Missing Points.

3. Proposed Work

• Joint localization, segmentation, classification, and 3D pose estimation.

• Depth-sensitive localization.

• Depth-sensitive subpixel methods for segmentation.

• Spatial transformers for pose estimation.

• Domain adaptation and shape completion from synthetic data.

• Timeline for completion.

Identifying 3D Objects

• Real world objects have a 3D shape and a position in a 3D scene.

• Objects may be oriented with respect to some reference pose.

• These object properties are associated with their semantic class.

Identifying 3D Objects

Identifying Objects in 2D Images

Fei-Fei, Karpathy, Johnson (http://cs231n.stanford.edu/slides/winter1516_lecture13.pdf)


Identifying 3D Objects in 2D Images

• 3D oriented CAD models mapped to 2D image regions.

• Approximate 3D shape based on selected models.

• Relative 3D position and scale may still be ambiguous.

• Visual perspective cues required to estimate object properties.

Yu et al., ObjectNet3D: A Large Scale Database for 3D Object Recognition

Identifying 3D Objects in 3D Images

Song et al., SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite

• 3D sensors provide accurate pointwise depth measurements.

• Object position and scale can be determined from a single 3D image.

Challenges in 3D Images

• Missing measurements due to sensor properties.

• Partial 3D data based on limited viewpoints.

• Difficult large-scale annotation compared to 2D images.

• Feature representations for 3D properties.

Manual Labeling of 3D Point Cloud

Completed Work

• Classification of Vehicle Parts in Unstructured 3D Point Clouds

• RANSAC point clustering for planar parts.

• Part-based structured model for classifying parts and overall object class.

Classification of Vehicle Parts in Unstructured 3D Point Clouds, Zelener, Mordohai, and Stamos, 3DV, 2014.

Local Feature Extraction

• Density weighted spin images.

• Dense sampling of keypoints on a uniformly spaced voxel grid.

• Normals oriented away from the object centroid.

• K-means clustering to generate bag-of-words codebook.

• Baseline object descriptor is normalized count vector of codebook features.
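The baseline descriptor can be sketched as below, assuming the codebook (the k-means centers) has already been learned. The function names and the toy 2-D features are hypothetical and only for illustration; real spin images are much higher-dimensional and this is not the author's implementation.

```python
def nearest_codeword(feature, codebook):
    """Index of the codebook center closest to `feature` (squared Euclidean)."""
    def sq_dist(center):
        return sum((f - c) ** 2 for f, c in zip(feature, center))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def bow_descriptor(features, codebook):
    """Normalized count vector of codebook assignments for one object."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_codeword(feature, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy example (hypothetical data): 3 codewords, 4 local features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
descriptor = bow_descriptor(features, codebook)
```

Here two of the four features fall nearest to the second codeword, so the descriptor puts half of its mass on that bin.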

K-Means Spin Image Codebook ($k = 50$)

Automatic Part Segmentation

• Iterative RANSAC plane fitting.

• Candidate planes from faces of convex hull.

• Robust re-estimation of planes using PCA.

• For vehicles, five planar parts cover most of the surface.
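A minimal sketch of iterative RANSAC plane extraction follows. It is a simplification under stated assumptions: candidate planes come from random point triples rather than convex hull faces, and the PCA re-estimation step is omitted. The threshold, iteration count, and function names are hypothetical.

```python
import random


def plane_from_points(p, q, r):
    """Return (unit normal n, offset d) with n.x + d = 0, or None if degenerate."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-12:
        return None  # collinear sample, no unique plane
    n = [c / norm for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))


def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Return (plane, inlier_indices) for the sampled plane with the most inliers."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [i for i, p in enumerate(points)
                   if abs(sum(n[k] * p[k] for k in range(3)) + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


def iterative_ransac(points, max_planes=5, min_inliers=3):
    """Peel off planes one at a time, removing each plane's inliers."""
    remaining, parts = list(points), []
    while len(parts) < max_planes and len(remaining) >= 3:
        plane, inliers = ransac_plane(remaining)
        if plane is None or len(inliers) < min_inliers:
            break
        taken = set(inliers)
        parts.append([remaining[i] for i in inliers])
        remaining = [p for i, p in enumerate(remaining) if i not in taken]
    return parts
```

The default `max_planes=5` mirrors the observation that five planar parts cover most of a vehicle's surface; parts are returned in segmentation order.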

Colored by Segmentation Order

Convex Hull Examples

Part-Level Features

• Spin image bag-of-words.

• Average height $\bar{h}$.

• Horizontal/vertical indicator $I(\mathbf{n}) = 0$ if $\mathbf{n}^T \mathbf{z} > \cos(\pi/4)$, and $1$ otherwise.

• Mean, median, and max of plane fit errors.

• Eigenvalues from plane fitting 𝜆1, 𝜆2, 𝜆3 (in descending order).

• Linearity ($\lambda_1 - \lambda_2$) and Planarity ($\lambda_2 - \lambda_3$).
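The indicator and the eigenvalue-based features above can be written out directly. This is a small sketch; it assumes unit-length normals and takes z as the up axis, and the function names are hypothetical.

```python
import math


def orientation_indicator(normal, up=(0.0, 0.0, 1.0)):
    """I(n) = 0 if n^T z > cos(pi/4), else 1 (normal assumed unit length)."""
    dot = sum(a * b for a, b in zip(normal, up))
    return 0 if dot > math.cos(math.pi / 4) else 1


def shape_features(eigenvalues):
    """Linearity and planarity from plane-fit eigenvalues (sorted descending here)."""
    l1, l2, l3 = sorted(eigenvalues, reverse=True)
    return {"linearity": l1 - l2, "planarity": l2 - l3}
```

For example, an upward-facing roof normal `(0, 0, 1)` gives indicator 0, while a side-facing door normal `(1, 0, 0)` gives 1.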

Pairwise Part Features

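• Dot product of normals, $\mathbf{n}_1^T \mathbf{n}_2$.

• Absolute difference in average heights, $|\bar{h}_1 - \bar{h}_2|$.

• Distance between centroids, $\|\mathbf{c}_1 - \mathbf{c}_2\|$.

• Closest distance between points, $\min_{i \in P_1, j \in P_2} \|\mathbf{p}_{1,i} - \mathbf{p}_{2,j}\|$.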

• Coplanarity as mean, median, and max cross-plane fit errors.
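A sketch of the first four pairwise features, assuming each part is a list of 3D points with z as height and a unit normal. The coplanarity errors are left out because they depend on the plane-fit details; names are hypothetical.

```python
import math


def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def pairwise_features(points1, normal1, points2, normal2):
    """Geometric features relating two planar parts (z is taken as height)."""
    c1, c2 = centroid(points1), centroid(points2)
    return {
        "normal_dot": sum(a * b for a, b in zip(normal1, normal2)),
        "height_diff": abs(c1[2] - c2[2]),
        "centroid_dist": math.dist(c1, c2),
        # Brute-force closest pair; fine for a sketch, quadratic in part size.
        "closest_dist": min(math.dist(p, q) for p in points1 for q in points2),
    }
```

The closest-point search is brute force here; a spatial index would be the natural replacement for large parts.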

Structured Part Modeling

• Generalized HMM as sequence of parts and final class variable.

• Trained discriminatively by structured averaged perceptron.

• Parts reordered in sequence based on 𝐼(𝑛) and average height.

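Figure: graphical model with a sequence of part variables a1, a2, …, an over part observations x1, x2, …, xn, and a final class variable c.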

Experimental Results for Part Classification

• Evaluation on Ottawa dataset with 155 sedans and 67 SUVs.

• Structured part modeling provides increased performance for part classification.

• Manual segmentation provides increase for classification of all parts per object.

Part Classification Comparison

Experimental Results for Object Classification

• SP gives significant gains over baseline perceptron model.

• Manual segmentation with SP exceeds unstructured baselines.

Sedan vs SUV Object Classification (No Part Segmentation vs. Part Segmentation)

Comparison Between Automatic and Manual Segmentation

• Under-segmentation from unbounded plane fitting.

• Merged semantic part classes like roof-hood and roof-trunk.

• Inconsistent labeling behavior at boundaries and noisy points.

Automatic

Manual

Conclusions for Part-based Classification

PROS

• RANSAC segmentation is robust to many complexities of 3D data.

• Structured part-based method shows improvement over bag-of-words with local features.

• Pairwise features based on geometric properties improve classification performance.

CONS

• RANSAC segmentation is not equivalent to semantic segmentation.

• Labeling ground truth parts for every possible object class may be infeasible.

• RANSAC segmentation, features, and structure model are determined before training the classifier.

CNN-Based Object Segmentation

• Segmentation on LIDAR scanning grid with missing points.

• CNN training procedure for LIDAR data.

• CNN-based features extracted from small set of initial feature maps for 3D images.

CNN-Based Object Segmentation in Urban LIDAR with Missing Points, Zelener and Stamos, 3DV, 2016.

Missing Points in LIDAR

• Contiguous LIDAR scanlines form 2.5D grid of scanner measurements.

• Laser reflection causes missing points on objects in the grid.

• We can label and infer over these positions.

Missing Points in Gray on Scanning Grid

Missing Points on Vehicles are Labeled

Preprocessing Pipeline

• Sample positive and negative locations in large LIDAR scene piece.

• Extract 𝑀 × 𝑀 patch as input to CNN.

• Predict labels for central 𝐾 × 𝐾 region, 𝐾 ≤ 𝑀. (𝑀 = 64, 𝐾 = 8)
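The patch and target geometry can be sketched as follows. The scan is assumed to be a 2D grid stored as a list of rows, and the boundary handling (filling out-of-range cells) is an assumption, not something stated in the slides.

```python
def extract_patch(grid, row, col, m=64, fill=None):
    """M x M patch centered near (row, col); out-of-range cells get `fill`."""
    top, left = row - m // 2, col - m // 2
    rows, cols = len(grid), len(grid[0])
    return [[grid[r][c] if 0 <= r < rows and 0 <= c < cols else fill
             for c in range(left, left + m)]
            for r in range(top, top + m)]


def central_target(labels_patch, k=8):
    """Central K x K region of an M x M label patch."""
    m = len(labels_patch)
    start = (m - k) // 2
    return [row[start:start + k] for row in labels_patch[start:start + k]]
```

With M = 64 and K = 8, the target covers rows and columns 28 through 35 of the patch, so each prediction sees 28 cells of context on every side.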

Initial Feature Maps

• Compute normalized feature maps from 3D points in 𝑀 × 𝑀 patch.

• Assume values within each patch are distributed ~𝒩(0,1) after normalization, truncated to [−6, 6].

• Missing data given max value (6) in clip range.
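A minimal sketch of the per-patch normalization. It assumes missing measurements are marked as `None` and that the mean and standard deviation are computed over valid points only; both are assumptions about details the slides do not spell out.

```python
import math


def normalize_feature_map(patch, clip=6.0):
    """Standardize valid values within a patch, clip to [-clip, clip],
    and assign missing entries (None) the maximum value `clip`."""
    valid = [v for row in patch for v in row if v is not None]
    mean = sum(valid) / len(valid) if valid else 0.0
    var = sum((v - mean) ** 2 for v in valid) / len(valid) if valid else 0.0
    std = math.sqrt(var) or 1.0  # avoid division by zero on flat patches
    return [[clip if v is None else max(-clip, min(clip, (v - mean) / std))
             for v in row]
            for row in patch]
```

Because valid values rarely reach six standard deviations, the value 6 effectively acts as a marker for missing points in the depth and height maps.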

Figure: Relative Depth and Relative Height feature maps (color scale −6 to 6).

Initial Feature Maps

• Angle and missing mask describe sensor properties.

• Angle normalized as before and missing mask in {0,1}.

Figure: Angle feature map (color scale −6 to 6) and Missing Mask (values 0 or 1).

Initial Feature Maps

• Signed Angle from Hadjiliadis and Stamos. 3DPVT 2010.

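Figure: Signed Angle feature map (color scale −6 to 6), with a diagram of consecutive scan points 𝑝1, 𝑝2, 𝑝3 along the scanning direction, the vectors 𝑣1 and 𝑣2 between them, and the vertical axis $\hat{z}$.

$\mathrm{SignedAngle}(p_2) = \arccos(\hat{z} \cdot \hat{v}_2) \cdot \mathrm{sgn}(v_1 \cdot v_2)$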

• Horizontal surfaces at 90 degrees.

• Vertical surfaces at 0 degrees.

• Sharp changes yield negative sign.
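The signed angle feature can be sketched as below. It assumes v1 = p2 − p1 and v2 = p3 − p2 for consecutive points along the scanning direction, that z is the vertical axis, and that p2 and p3 are distinct; these definitions are inferred from the diagram labels rather than stated explicitly.

```python
import math


def signed_angle(p1, p2, p3, z_hat=(0.0, 0.0, 1.0)):
    """acos(z_hat . v2_hat) * sgn(v1 . v2), returned in degrees.

    Assumes v1 = p2 - p1 and v2 = p3 - p2 along the scanning direction.
    """
    v1 = [b - a for a, b in zip(p1, p2)]
    v2 = [b - a for a, b in zip(p2, p3)]
    norm2 = math.sqrt(sum(c * c for c in v2))
    cos_angle = sum(z * c for z, c in zip(z_hat, v2)) / norm2
    # Clamp to guard against rounding slightly outside [-1, 1].
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    dot = sum(a * b for a, b in zip(v1, v2))
    sign = (dot > 0) - (dot < 0)
    return angle * sign
```

Points along flat ground give about 90 degrees, points climbing a vertical wall give 0, and a scan that doubles back on itself gives a negative value.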

Model Overview

• Baseline CNN architecture.

• ReLU nonlinear activation functions.

• L2-regularization on affine layers.

• Dropout regularization on final layer.

• Predict binary label for each point in the 𝐾 × 𝐾 target.

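• Total model loss is binary cross entropy plus L2-regularization:

$$\mathcal{L}(\mathbf{x}, \mathbf{y}) = -\sum_{k=1}^{K^2} \left[ y_k \log p_k + (1 - y_k) \log (1 - p_k) \right] + \frac{\lambda}{2} \sum_{l=1}^{L} \| W_l \|_2^2$$

Figure: Input Patch (64, 64, 5) → Conv 5 × 5 → (32, 32, 32) → Conv 5 × 5 → (16, 16, 64) → Affine → 512 → Affine → 64 (= 𝐾²) → Output Labels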
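The total loss, binary cross entropy summed over the K² target points plus an L2 penalty weighted by λ/2 over the layer weights, can be computed as in this sketch. Each layer's weights are assumed to be flattened into a plain list, and the default `lam` and `eps` values are hypothetical.

```python
import math


def total_loss(y_true, p_pred, weights, lam=1e-4, eps=1e-12):
    """Binary cross entropy over the K^2 targets plus (lam / 2) * sum ||W_l||_2^2.

    y_true: 0/1 labels, p_pred: predicted probabilities,
    weights: one flattened list of weights per regularized layer.
    """
    bce = -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
               for y, p in zip(y_true, p_pred))
    l2 = sum(w * w for layer in weights for w in layer)
    return bce + 0.5 * lam * l2
```

The `eps` floor only protects the logarithm when a predicted probability is exactly 0 or 1.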

Results from Vehicle Point Detection using CNN [patch size = 64 x 64, target size = 8 x 8]

nyc_0 (in-sample) test piece

nyc_1 test piece

True Positive – Yellow

True Negative – Dark Blue

False Positive – Cyan

False Negative – Orange



Nyc_0 (In-sample) Test: Recall .85, Precision .73

Nyc_1 Test: Recall .85, Precision .73

Experimental Results

Input Feature Map Comparison

D – Depth, H – Height, A – Angle, S – Signed Angle, M – Missing Mask

Impact of Using Missing Point Labels

• Training with missing point labels improves precision.

• Missing point labels allow for complete segmentation.

DHASM with Missing Point Labels

DHASM with No Missing Point Labels

Experimental Results

Use of Missing Point Labels

NML – No Missing Labels

Conclusions for CNN-Based Segmentation

• CNN for LIDAR learned using a sampling-based training pipeline.

• We can predict class labels over missing points in LIDAR.

• Incorporating missing points improves precision.

• Input feature maps that describe 3D shape and sensor properties have a significant effect on performance.

Proposed Work

• Extend CNN model to multiclass object localization, segmentation, classification, and pose estimation in 3D images.

• Examine design and structure of CNN components for 3D images:

• Depth-sensitive localization.

• Depth-sensitive subpixel methods for segmentation.

• Spatial transformer for pose estimation.

• Utilize domain adaptation from synthetic data for auxiliary training data and missing point reconstruction.

Novelty of Proposed Work

• Multi-task model for all tasks.

• Previous models only address up to three of the proposed tasks.

• Addition of 3D object pose estimation.

• Improve performance on all tasks by adapting current state-of-the-art techniques to the domain of 3D objects.

• Balance between 2.5D image and 3D voxel representation.

• Incorporation of additional datasets.

• Comparison across urban LIDAR and indoor RGB-D domains.

• Missing point estimation from synthetic data or multi-view reconstruction.

• Domain adaptation from synthetic datasets.

2D Object Localization in LIDAR (In Progress)

• Preliminary results at 0.8 confidence threshold.

• Based on YOLO single-shot architecture.

• Can be used for region proposal or extended to 3D bounding boxes.

Google Street View Dataset: Ground Truth Pose Labeling

• Automatic fit of bounding boxes.

• PCA to fit non-axis-aligned boxes.

• Manual tool to:

(a) select front face (different color) for orientation (default is selected automatically);

(b) change size/position/orientation of boxes in case of incomplete objects.

Multi-task Model for Object Identification

• Shared representation can be applicable for multiple tasks.

• Tasks: Object localization, segmentation, classification, and pose estimation.

• Error signal for each task trains weights for shared representation.

Source: Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades

Multi-task Model for Object Identification

• Straightforward extension to orientation estimation.

• Assume objects are upright, estimate rotation about gravity axis.

Source: Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades
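Under the upright assumption, pose reduces to a single yaw angle about the gravity axis. The sketch below shows how an oriented 3D box follows from a center, a size, and that angle; it takes z as the gravity axis, and the parameterization is a hypothetical illustration rather than the proposed model's output format.

```python
import math


def box_corners_upright(center, size, yaw):
    """8 corners of an upright 3D box rotated by `yaw` radians about the z axis.

    center = (cx, cy, cz), size = (length, width, height); z is the gravity axis.
    """
    cx, cy, cz = center
    length, width, height = size
    cos_t, sin_t = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                x, y = sx * length, sy * width
                corners.append((cx + cos_t * x - sin_t * y,
                                cy + sin_t * x + cos_t * y,
                                cz + sz * height))
    return corners
```

Rotating a 4 × 2 × 1 box by 90 degrees swaps its footprint, so the extent along x becomes 2 and the extent along y becomes 4, while the height is unchanged.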


Localization for 3D Objects in Voxel Space

• 3D voxel input representation (TSDF).

• Voxel gives relative position, anchor box gives shape prior.

• Network estimates adjustments for box position and dimensions.

Source: Song and Xiao, Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images

Depth-Sensitive Localization

• We aim to maintain a non-volumetric 2.5D input representation.

• Partition viewing volume and consider localization in depth slices.

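Figure: 2.5D Input of shape $(W, H, F_{in})$ → 2D Conv → output of shape $(X, Y, Z \times A \times 6)$ → 3D Box, with the viewing volume partitioned into depth slices $z_1, z_2, z_3, z_4$ and anchor boxes of dimensions $(a_x, a_y, a_z)$.

$b_{\hat{x}} = x_i + d_x$

$b_{\hat{y}} = y_i + d_y$

$b_{\hat{z}} = z_i + d_z$

$b_{\text{width}} = a_x \cdot s_x$

$b_{\text{height}} = a_y \cdot s_y$

$b_{\text{depth}} = a_z \cdot s_z$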
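The box equations can be read as a decoding step applied to each grid cell, depth slice, and anchor: the six predicted values per anchor are three position offsets and three scale factors. A minimal sketch, with a hypothetical function name and argument layout:

```python
def decode_box(cell_center, anchor, offsets, scales):
    """Decode one 3D box from a (cell, depth slice, anchor) prediction.

    cell_center = (x_i, y_i, z_i), anchor = (a_x, a_y, a_z),
    offsets = (d_x, d_y, d_z), scales = (s_x, s_y, s_z).
    Returns (position, dimensions) as two 3-tuples.
    """
    position = tuple(c + d for c, d in zip(cell_center, offsets))
    dimensions = tuple(a * s for a, s in zip(anchor, scales))
    return position, dimensions


# Hypothetical example: a car-sized anchor in the slice centered at z = 20.
position, dimensions = decode_box((10.0, 5.0, 20.0), (4.0, 1.5, 2.0),
                                  (0.5, -0.25, 1.0), (1.5, 1.0, 0.5))
```

The anchor acts as the shape prior, so the network only has to predict multiplicative corrections to its dimensions.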

Subpixel Convolutions

• Pooled CNN features can still encode higher resolution information.

• Upscale back through “deconvolution” or subpixel convolution.

• Used in state-of-the-art segmentation networks.

Source: Shi et al., Is the deconvolution layer the same as a convolutional layer?

Figure panels: Padded Image, Zero-padded Sub-Pixel Image, Subpixel Filter, Filter Activations.

Subpixel Convolutions

• Independent subpixel filter weights can be separated.

• All convolutions are in low resolution then interleaved to upsample at the end of the network.

Source: Shi et al., Is the deconvolution layer the same as a convolutional layer?
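The interleaving step that turns r² low-resolution feature maps into one map upsampled by a factor of r can be sketched as follows. The channel ordering used here is one common convention and is an assumption; frameworks differ in how they lay out the channels.

```python
def pixel_shuffle(channels, r):
    """Interleave r*r low-resolution maps (each H x W) into one (rH x rW) map.

    channels[dy * r + dx][y][x] lands at output[y * r + dy][x * r + dx].
    """
    if len(channels) != r * r:
        raise ValueError("expected r*r input maps")
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for idx, fmap in enumerate(channels):
        dy, dx = divmod(idx, r)
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = fmap[y][x]
    return out
```

No arithmetic happens in this step: every convolution runs at low resolution, and the upsampling is purely a rearrangement of their outputs.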

Figure panels: Padded Image, Separate Filters, Filter Activations, Combined Filter Activations.

Position-sensitive Score Maps

• Subpixel-like features can be specialized for a given task.

Source: Dai et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks

Depth-sensitive Score Maps

• We can extend this approach to be depth-sensitive.

Figure: conv feature maps → conv producing $k^3(C + 1)$ depth-sensitive score maps (top-left-back, top-left-center, …, bottom-right-center, bottom-right-front) → pool over $k \times k \times k$ bins → vote → $C + 1$ class scores.
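One possible reading of depth-sensitive pooling and voting is sketched below. Each pixel inside a region of interest is assigned to one of k³ bins from its image position and its measured depth, scores are averaged per bin from that bin's own score maps, and the non-empty bins then vote by averaging. The binning rule, the data layout, and all names are assumptions; the proposal does not fix these details.

```python
def depth_sensitive_vote(score_maps, depth, roi, depth_range, k, num_classes):
    """Average position- and depth-sensitive scores inside one ROI.

    score_maps[b][c][y][x]: score for bin b (of k**3) and class c (of C + 1).
    depth[y][x]: depth of each pixel; roi = (x0, y0, x1, y1), exclusive ends;
    depth_range = (z0, z1). Returns C + 1 voted scores.
    """
    x0, y0, x1, y1 = roi
    z0, z1 = depth_range
    sums = [[0.0] * (num_classes + 1) for _ in range(k ** 3)]
    counts = [0] * (k ** 3)
    for y in range(y0, y1):
        for x in range(x0, x1):
            z = depth[y][x]
            if not z0 <= z < z1:
                continue  # pixel lies outside the ROI's depth extent
            bx = (x - x0) * k // (x1 - x0)
            by = (y - y0) * k // (y1 - y0)
            bz = min(int((z - z0) * k / (z1 - z0)), k - 1)
            b = (by * k + bx) * k + bz
            counts[b] += 1
            for c in range(num_classes + 1):
                sums[b][c] += score_maps[b][c][y][x]
    # Per-bin average, skipping bins that received no pixels.
    pooled = [[s / n for s in row] for row, n in zip(sums, counts) if n]
    if not pooled:
        return [0.0] * (num_classes + 1)
    return [sum(col) / len(pooled) for col in zip(*pooled)]
```

Skipping empty bins is a design choice suited to 2.5D input, where only one depth is observed per pixel and most of the k³ bins in a region stay empty.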

Spatial Transformers for Pose Estimation

• General method for parameterized transforms between feature maps.

• Interpolation of transformed sampling grid.

• Estimated transformation is related to 3D object pose.
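The sampling step of a spatial transformer can be sketched as an affine map applied to the output grid followed by bilinear interpolation of the input. For simplicity this sketch works in pixel coordinates rather than the normalized coordinates typically used, and in a trained model the parameters `theta` would be predicted by a small network; both points are simplifying assumptions.

```python
import math


def bilinear_sample(image, x, y):
    """Bilinearly interpolate image (H x W list of lists) at real-valued (x, y);
    neighbors outside the image contribute zero."""
    h, w = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                weight = (1 - abs(x - xx)) * (1 - abs(y - yy))
                value += weight * image[yy][xx]
    return value


def affine_transform(image, theta, out_h, out_w):
    """Sample `image` on an affine-transformed grid.

    theta = ((a, b, tx), (c, d, ty)) maps output pixel (x, y) to the input
    location (a*x + b*y + tx, c*x + d*y + ty).
    """
    (a, b, tx), (c, d, ty) = theta
    return [[bilinear_sample(image, a * x + b * y + tx, c * x + d * y + ty)
             for x in range(out_w)]
            for y in range(out_h)]
```

Because the interpolation weights vary smoothly with `theta`, gradients can flow back to the predicted transformation, which is what ties the estimated transform to object pose during training.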

Complete Model Sketch

Figure: 2.5D image feature maps → conv → shared feature maps → down convs → multi-scale depth-sensitive localization → ROI pooling and spatial transformer → depth-sensitive segmentation, classification, pose estimation.

Timeline for Completion

• December 2016

• Select and prepare new datasets for experiments.

• Annotate Street View dataset with object bounding boxes.

• Extend current localization and segmentation implementations for baselines.

• Begin implementation of classification and pose estimation baselines.

• January 2017

• Complete implementation of baseline models and begin training models for evaluation on a chosen dataset.

• Implement baseline multi-task model.


• February 2017

• Begin some experiments with architectures using:

• Depth-sensitive localization.

• Depth-sensitive subpixel convolution for segmentation.

• 3D object pose estimation with spatial transformers.

• March 2017

• Prepare paper for ICCV 2017 submission including experiments on:

• Multi-task learning for 3D object identification.

• One of the proposed depth-sensitive experimental architectures.

• Consider additional experiments on domain adaptation and missing point reconstruction.


• April 2017

• Dissertation writing.

• Continuation of experiments.

• May 2017

• Dissertation defense.

• Prepare paper submission to 3DV 2017 containing additional experiments.

Additional Slides

Google Street View Dataset

• Google R5 Street View Dataset

• All but two pieces of NYC 0 used for training.

• Remaining runs used for evaluation.

KITTI Dataset

• 3D bounding boxes for vehicles, cyclists, and pedestrians in LIDAR.

• Precise segmentation labels not included in benchmark.

Synthia Dataset

• Synthetic urban scenes for simulated RGB-D scans.

• Exact labels for semantic segmentation but 3D poses are not given.

• Domain adaptation required for effective use on real-world data.

• Missing point reconstruction task can be simulated.

Indoor RGB-D Datasets

• SUN RGB-D and SceneNN.

• Class, segmentation, and oriented 3D bounding boxes included.

• Reconstructed shape can be used for missing points.

Assumptions for Proposed Work

• Single 3D image from LIDAR sensor sweep or RGB-D camera.

• Excludes video, multiview registration, and volumetric sensors.

• Possible shape completion only for missing (non-occluded) scan points.

• Excluding complete volumetric shape reconstruction and database matching.

Hua et al., SceneNN: A Scene Meshes Dataset with aNNotations

Wu et al., 3D ShapeNets: A Deep Representation for Volumetric Shapes
