Deep Watershed Transform for Instance Segmentation

Min Bai & Raquel Urtasun

To appear at IEEE CVPR 2017 in Hawaii
Presented at NVIDIA GTC 2017

Semantic Segmentation
● Input: RGB image
● Output at each pixel:
  ○ Semantic label

Instance Segmentation
● Input: RGB image
● Output at each pixel:
  ○ Semantic label
  ○ Instance label
    ■ Same for each pixel in an object
    ■ Different among objects
  ○ Difficulty: How to phrase the problem?

Applications
● Object tracking (image credit: Davi Frossard)
● Interacting with the environment (image credit: http://www.rethinkrobotics.com/build-a-bot/)
● Providing useful information to other algorithms, such as optical flow (image credit: Shenlong Wang)

Semantic Segmentation
● Semantic segmentation is a well-studied problem
  ○ Our instance segmentation method leverages an existing technique
  ○ H. Zhao et al., Pyramid Scene Parsing Network, https://arxiv.org/abs/1612.01105

Image credit: H. Zhao et al.

Watershed Transform
● Classical image segmentation technique

Image (left) credit: Adrian Fisher

Scalar Field and Gradient

Image source: Wikipedia, by Vivekj78 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15346899

● Scalar field: a single number at each pixel

● Gradient: a vector at each pixel, pointing in the direction of steepest ascent
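
As a small illustration of the two definitions above, the following sketch estimates the gradient of a scalar field stored as a list of rows, using finite differences. The function name `gradient` and the toy field are assumptions made for this example and do not come from the slides.

```python
def gradient(field):
    """Estimate (d/dx, d/dy) at every pixel of a 2-D scalar field.

    Uses central differences in the interior and one-sided differences
    at the border. Assumes the field has at least 2 rows and 2 columns.
    """
    h, w = len(field), len(field[0])
    grad = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp neighbour indices so border pixels use one-sided differences.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            gx = (field[y][x1] - field[y][x0]) / (x1 - x0)
            gy = (field[y1][x] - field[y0][x]) / (y1 - y0)
            grad[y][x] = (gx, gy)
    return grad


# Toy scalar field f(x, y) = x + 2*y on a 3 x 4 grid (illustrative only).
field = [[x + 2 * y for x in range(4)] for y in range(3)]
grad = gradient(field)
print(grad[1][1])  # gradient vector at an interior pixel
```

For this linear field the estimate is the same vector at every pixel: one unit per step in x and two units per step in y.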

Overview of Approach

[Figure: pipeline panels: Input Image, Semantic Segmentation, Gradient of Energy Landscape, Energy Landscape, Predicted Instances]

Why Predict Direction First?

[Figure panels: Input Image, Energy Landscape, Direction of Gradient]

Much sharper difference in the direction label at the boundary!
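
A toy 1-D illustration of this point, not the authors' code: two objects sit side by side in one row of pixels. The sketch assumes, purely for illustration, that energy is the distance to the nearest edge of the pixel's own instance and that direction is the sign of the step that increases energy. The helper names `runs` and `energy_and_direction` are hypothetical.

```python
def runs(labels):
    """Yield (start, end) index pairs for each run of equal labels."""
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            yield start, i - 1
            start = i


def energy_and_direction(labels):
    """Per-pixel energy and ascent direction for a 1-D row of instance labels."""
    energy, direction = [], []
    for s, e in runs(labels):
        for i in range(s, e + 1):
            left, right = i - s + 1, e - i + 1  # distance to each edge of the run
            energy.append(min(left, right))
            # +1: energy rises to the right, -1: to the left, 0: at a ridge.
            direction.append((right > left) - (right < left))
    return energy, direction


labels = [1, 1, 1, 1, 2, 2, 2, 2]  # two adjacent instances
energy, direction = energy_and_direction(labels)
print(energy)     # [1, 2, 2, 1, 1, 2, 2, 1]
print(direction)  # [1, 1, -1, -1, 1, 1, -1, -1]
```

Across the boundary between pixels 3 and 4 the energy stays at 1 on both sides, while the direction flips from -1 to +1, which is the kind of sharp change the slide refers to.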

Overall Network

[Figure: Direction Prediction Network (panels: Input Image, Semantic Segmentation, Ground Truth Directions, Predicted Directions) and Energy Prediction Network (panels: Ground Truth Energy, Predicted Energy, Ground Truth Instances, Predicted Instances)]

Training and Inference
● Pre-train both networks
● End-to-end fine-tuning
● Network trained on NVIDIA DGX-1
  ○ Approximately 25 hours total for training on one GP100 core
  ○ ~0.1 s per image for forward pass
  ○ Thank you NVIDIA for the generous gift!

Image source: www.nvidia.com

Cityscapes Dataset
● 2975 training / 500 validation / 1525 testing images
● Instances: car, truck, bus, train, person, rider, motorcycle, bicycle

Cityscapes Instance Segmentation Leaderboard

Method                  AP*      AP* @ 50%   AP* @ 50m   AP* @ 100m
van den Brand et al.    2.3%     3.7%        3.9%        4.9%
Cordts et al.           4.6%     12.9%       7.7%        10.3%
Uhrig et al.            8.9%     21.1%       15.3%       16.7%
Ours                    19.4%    35.3%       31.4%       36.8%

* Average Precision (AP): higher is better

Recently, new approaches have achieved even higher performance.
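
For readers unfamiliar with the "@ 50%" column: a predicted instance is matched to a ground-truth instance only if their overlap, measured as intersection over union (IoU), reaches the threshold. The sketch below shows that overlap test on masks stored as sets of pixel coordinates. The helper name `mask_iou` and the toy masks are assumptions for illustration; this is not the benchmark's evaluation script.

```python
def mask_iou(pred, gt):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


pred = {(0, 0), (0, 1), (1, 0), (1, 1)}      # predicted instance (2 x 2 block)
gt_close = {(0, 0), (0, 1), (1, 0)}          # ground truth that mostly overlaps
gt_shifted = {(0, 1), (1, 1), (0, 2), (1, 2)}  # ground truth shifted one column

print(mask_iou(pred, gt_close) >= 0.5)    # True: 3 shared pixels out of 4
print(mask_iou(pred, gt_shifted) >= 0.5)  # False: 2 shared pixels out of 6
```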

Sample Output

[Figure panels: Input RGB, Semantic Segmentation, Direction Prediction, Energy Prediction, Predicted Instances, Ground Truth Instances]

Preliminary TorontoCity Aerial Instance Segmentation

[Figure panels: Input RGB, Semantic Segmentation (ResNet), Predicted Building Instances]

Method       Weighted Coverage*   AP*       Recall* @ 50%   Precision* @ 50%
FCN-8        41.92%               11.37%    21.50%          36.00%
ResNet-56    40.65%               12.13%    18.90%          45.36%
Ours         56.22%               21.22%    67.16%          63.67%

* higher is better

In Summary...

● Simple technique for instance segmentation
● Encodes object instances as energy map
● Predicts gradient direction as intermediate task for better supervision
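
One way to picture "encodes object instances as energy map": if energy is low near instance boundaries and higher inside objects, then cutting the map at a level and labelling the connected regions that remain separates touching objects. The sketch below assumes a fixed cut level and 4-connectivity. The function name `instances_from_energy`, the level values, and the toy map are illustrative, and the slides do not spell out the exact post-processing.

```python
from collections import deque


def instances_from_energy(energy, level):
    """Label 4-connected regions whose energy is at least `level`.

    Returns (labels, count), where labels holds 0 for background and
    1..count for the regions found.
    """
    h, w = len(energy), len(energy[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if energy[y][x] < level or labels[y][x]:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and energy[ny][nx] >= level and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Two touching objects: energy dips to 1 where they meet (toy values).
energy = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 2, 2, 1, 1, 2, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
_, merged = instances_from_energy(energy, level=1)
labels, separated = instances_from_energy(energy, level=2)
print(merged, separated)  # 1 2
print(labels[1])          # [0, 1, 1, 0, 0, 2, 2, 0]
```

Cutting at level 1 leaves the two objects merged into one region, while cutting at level 2 splits them. The pixels removed by the cut would still need to be assigned back to an instance in a full pipeline.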
