analyzing semantic segmentation using hybrid human-machine crfs

1
Analyzing Semantic Segmentation Using Hybrid Human- Machine CRFs Roozbeh Mottaghi 1 , Sanja Fidler 2 , Jian Yao 2 , Raquel Urtasun 2 , Devi Parikh 3 1 UCLA 2 TTI Chicago 3 Virginia Tech Our goal is to identify bottlenecks in holistic scene understanding models for semantic segmentation. • We compute CRF potentials based on: - Responses of human subjects - Ground Truth - Machine - Removing components Findings from our human studies inspire an approach that results in state-of-the- art accuracy on MSRC. Summary sky water bir d boat Scene recognition Object shape Object detection CRF Segmentat ion Huma n Machin e Remove Context Ground Truth Holistic CRF Model Which scene is depicted? Which classes are present? True/False detection? cow, sheep, aeroplane, face, car, bicycle, flower, sign, bird, book, chair, cat, dog, body, boat Super-segment labeling Segment labeling • Plug-and-Play architecture • No restrictions on the form of the potentials Human & Machine Potentials • We used Amazon Mechanical Turk to produce human potentials. • 10 subjects participated in our study for each task (total of 500 subjects). • In total, we had ~300K tasks. • We used MSRC dataset for our experiments. Segment & Super-segment Potentials: • Machine: TextonBoost classifier • Humans: Buildi ng Grass Tree Cow Sheep Sky Aeroplan e Water Human Face Car Bicycl e Flower Sign Bird Book Chair Road Cat Dog Body Boat Recognize this Class Unary: Likelihood of presence of a certain class • Machine: Frequency in training data • Humans: Given a pair of categories, we asked “Which category is more likely?” Class-class Co-occurrence: • Machine: Co-occurrence matrix from training data • Humans: We asked “Which scenario is more likely to occur in an image? Observing (cow and grass) or (cow and airplane)?” Human Machine Object Detection: • Machine: DPM (Felzenszwalb et al.) • Humans: Ground truth which is provided by humans. Shape Prior: • Machine: Per component average mask of examples. (78.2%) • Humans: We asked them to draw object contour along the boundary of superpixels (80.2%) Bird Cow Chair Car, cmp. 1 Car, cmp. 3 interface used by human subjects Scene Potential: • Machine: Spatial Pyramid Match over SIFT, RGB, GIST, … (81.8%) • Humans: We asked the human subjects to classify the images into one of 21 scene classes. 90.4% 86.8% Experiments with Human-Machine CRFs 89.8% 85.3% The CRF model performs better with the less accurate human segment potential. Goal of subsequent analysis is to explain the boost. H S, H SS M S, M SS H S, M SS M S, H SS M S + H S, M SS M S + H S, H SS 70 74 78 82 average per class recall H S H SS M S M SS H S M SS M S H SS (M+H) S M SS (M+H) S H SS w/o H: Human M: Machine S: Segment SS: Supersegment 76 78 80 Window size Average per- class recall We tried several hypotheses including incorporating scale, handling overfitting, predicting human potentials from machine potentials, etc., but none of them explained the boost. Humans and machines make complementary mistakes: Human and machine errors become similar when we reduce the window size for TextonBoost: Machine: 200x200 windows Human Machine: 30x30 windows A simple change of using multiple window sizes in the segment classifier provides a significant improvement. The type of mistakes are more important than the number of mistakes. Average Recall Breaking the connection between layers causes a significant decrease in accuracy when we have human at one level and machine at another. inconsiste nt Machine Accuracy: 77.4 > Human Accuracy: 72.2 car face sheep J. Yao, S. Fidler and R. Urtasun, Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation, CVPR 2012. 2.4% gras s face book sky tree cat cow shee p flow er cha ir buildi ng bicycl e car aeropla ne wate r bird dog road body sign boat gras s face book sky tree cat cow shee p flow er cha ir buildi ng bicycl e car aeropla ne wate r bird dog road body sign boat

Upload: carissa-cleveland

Post on 30-Dec-2015

28 views

Category:

Documents


0 download

DESCRIPTION

Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs. water. bird. boat. sky. Roozbeh Mottaghi 1 , Sanja Fidler 2 , Jian Yao 2 , Raquel Urtasun 2 , Devi Parikh 3. 1 UCLA 2 TTI Chicago 3 Virginia Tech. Scene Potential: - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Analyzing Semantic Segmentation  Using Hybrid Human-Machine CRFs

Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFsRoozbeh Mottaghi1, Sanja Fidler2, Jian Yao2, Raquel Urtasun2, Devi Parikh3

1 UCLA 2 TTI Chicago 3 Virginia Tech

• Our goal is to identify bottlenecks in holistic scene understanding models for semantic segmentation.

• We compute CRF potentials based on: - Responses of human subjects- Ground Truth- Machine- Removing components

• Findings from our human studies inspire an approach that results in state-of-the-art accuracy on MSRC.

Summary

sky

water bird

boat

Scene recognition

Object shape Object detection

CRF

Segmentation

Human Machine Remove

Context

Ground Truth

Holistic CRF Model

Which scene is depicted?

Which classes are present?

True/False detection?cow, sheep, aeroplane, face, car, bicycle, flower, sign, bird, book, chair, cat, dog, body, boat

Super-segment labeling

Segment labeling

• Plug-and-Play architecture • No restrictions on the form of the potentials

Human & Machine Potentials

• We used Amazon Mechanical Turk to produce human potentials.

• 10 subjects participated in our study for each task (total of 500 subjects).

• In total, we had ~300K tasks.

• We used MSRC dataset for our experiments.

Segment & Super-segment Potentials:• Machine: TextonBoost classifier• Humans:

Building

Grass

TreeCow

Sheep

Sky

Aeroplane

Water

Human Face

Car

Bicycle

FlowerSign

Bird

Book

ChairRoad

Cat

DogBody

Boat

Recognize this

Class Unary: Likelihood of presence of a certain class• Machine: Frequency in training data• Humans: Given a pair of categories, we asked “Which category is more likely?”

Class-class Co-occurrence:• Machine: Co-occurrence matrix from training data• Humans: We asked “Which scenario is more likely to occur in an image? Observing (cow and grass) or (cow and airplane)?”

Human Machine

Object Detection:• Machine: DPM (Felzenszwalb et al.)• Humans: Ground truth which is provided by humans.

Shape Prior:• Machine: Per component average mask of examples. (78.2%)• Humans: We asked them to draw object contour along the boundary of superpixels (80.2%)

Bird Cow Chair

Car, cmp. 1 Car, cmp. 3

interface used by human subjects

Scene Potential:• Machine: Spatial Pyramid Match over SIFT, RGB, GIST, … (81.8%)• Humans: We asked the human subjects to classify the images into one of 21 scene classes.

90.4% 86.8%

Experiments with Human-Machine CRFs

89.8% 85.3%

The CRF model performs better with the less accurate human segment potential. Goal of subsequent analysis is to explain the boost.

H S, H SS M S, M SS H S, M SS M S, H SS M S + H S, M SS

M S + H S, H SS

70

72

74

76

78

80

82

84

aver

age

per c

lass

reca

ll

H S H SS

M S M SS

H SM SS

M SH SS

(M+H) SM SS

(M+H) SH SS

w/o

H: HumanM: Machine

S: SegmentSS: Supersegment

200x200

10x10 20x20 30x30 40x4076

77

78

79

80

Window size

Aver

age

per-

clas

s re

call

We tried several hypotheses including incorporating scale, handling overfitting, predicting human potentials from machine potentials, etc., but none of them explained the boost.

Humans and machines make complementary mistakes:

Human and machine errors become similar when we reduce the window size for TextonBoost:

Machine: 200x200 windows Human Machine: 30x30 windows

A simple change of using multiple window sizes in the segment classifier provides a significant improvement.

The type of mistakes are more important than the number of mistakes.

Ave

rage

Rec

all

Breaking the connection between layers causes a significant decrease in accuracy when we have human at one level and machine at another.

inconsistent

Machine Accuracy: 77.4 > Human Accuracy: 72.2

car face sheep

J. Yao, S. Fidler and R. Urtasun, Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation, CVPR 2012.

2.4%

grass face book

sky tree cat cow sheep flower chair building bicycle car

aeroplane water bird dog road body sign

boat

grass

face

book sky

tree

cat

cow sheep flower chairbuilding bicyclecar

aeroplane water

bird

dog

roadbody sign

boat