Object detection, deep learning, and R-CNNs. Ross Girshick, Microsoft Research. Guest lecture for UW CSE 455, Nov. 24, 2014


Page 1:

Object detection, deep learning, and R-CNNs

Ross Girshick

Microsoft Research

Guest lecture for UW CSE 455
Nov. 24, 2014

Page 2:

Outline

• Object detection
  • the task, evaluation, datasets

• Convolutional Neural Networks (CNNs)
  • overview and history

• Region-based Convolutional Networks (R-CNNs)

Page 3:

Image classification

• classes
• Task: assign correct class label to the whole image

Digit classification (MNIST)

Object recognition (Caltech-101)

Page 4:

Classification vs. Detection

Dog

Dog Dog

Page 5:

Problem formulation

person

motorbike

Input Desired output

{ airplane, bird, motorbike, person, sofa }

Page 6:

Evaluating a detector

Test image (previously unseen)

Page 7:

First detection ...

‘person’ detector predictions

0.9

Page 8:

Second detection ...

0.9

0.6

‘person’ detector predictions

Page 9:

Third detection ...

0.9

0.6

0.2

‘person’ detector predictions

Page 10:

Compare to ground truth

ground truth ‘person’ boxes

0.9

0.6

0.2

‘person’ detector predictions

Page 11:

Sort by confidence

[Figure: detections sorted by confidence (0.9, 0.8, 0.6, 0.5, 0.2, 0.1); three are marked ✓ and three are marked X]

✓ true positive (high overlap)

X false positive (no overlap, low overlap, or duplicate)
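The ✓ / X labeling can be sketched in code. This is only an illustration under stated assumptions: boxes are (x1, y1, x2, y2) tuples, "high overlap" is taken to mean intersection-over-union of at least 0.5 (the usual PASCAL VOC choice, not stated on the slide), and the names `iou` and `label_detections` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, ground_truth, threshold=0.5):
    """Return (score, is_true_positive) pairs, highest confidence first.

    Each ground-truth box may be matched once, so a second detection of
    the same object (a duplicate) is labeled a false positive.
    """
    matched = set()
    labels = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_overlap = None, 0.0
        for j, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if overlap > best_overlap:
                best_j, best_overlap = j, overlap
        is_tp = (best_j is not None and best_overlap >= threshold
                 and best_j not in matched)
        if is_tp:
            matched.add(best_j)
        labels.append((score, is_tp))
    return labels


# Hypothetical boxes: one person, three detections.
ground_truth = [(0, 0, 10, 10)]
detections = [
    (0.9, (0, 0, 10, 10)),    # high overlap: true positive
    (0.6, (1, 1, 10, 10)),    # same object again: duplicate
    (0.2, (20, 20, 30, 30)),  # no overlap
]
labels = label_detections(detections, ground_truth)
```

Matching in order of decreasing confidence is what makes the lower-scoring duplicate, rather than the first detection, count as the false positive.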

Page 12:

Evaluation metric

[Figure: detections sorted by confidence (0.9, 0.8, 0.6, 0.5, 0.2, 0.1), each marked ✓ or X, with a threshold t]

$$\mathrm{precision}@t = \frac{\#\,\text{true positives}@t}{\#\,\text{true positives}@t + \#\,\text{false positives}@t}$$

$$\mathrm{recall}@t = \frac{\#\,\text{true positives}@t}{\#\,\text{ground truth objects}}$$
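A small sketch of precision@t and recall@t. The ✓ / X order and the count of four ground-truth objects below are hypothetical (the slide does not pin them down), and keeping detections with score >= t is an assumed convention.

```python
def precision_recall_at(labels, num_ground_truth, t):
    """labels: (score, is_true_positive) pairs; t: confidence threshold."""
    kept = [is_tp for score, is_tp in labels if score >= t]
    tp = sum(kept)
    fp = len(kept) - tp
    # Assumption: with no detections kept, precision is reported as 1.0.
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / num_ground_truth
    return precision, recall


# Hypothetical labeling of the six ranked detections.
ranked = [(0.9, True), (0.8, True), (0.6, False),
          (0.5, True), (0.2, False), (0.1, False)]
precision, recall = precision_recall_at(ranked, num_ground_truth=4, t=0.6)
# At t = 0.6 three detections are kept (two ✓, one X):
# precision = 2/3 and recall = 2/4.
```

Lowering t can only keep more detections, so recall never decreases, while precision may move in either direction.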

Page 13:

Evaluation metric

Average Precision (AP)
• 0% is worst
• 100% is best

mean AP over classes (mAP)

[Figure: detections sorted by confidence (0.9, 0.8, 0.6, 0.5, 0.2, 0.1), each marked ✓ or X]
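One way to compute AP, as a sketch only: this is the simple non-interpolated variant (the precision at each true positive's rank, summed and divided by the number of ground-truth objects). PASCAL VOC uses an interpolated variant, so the numbers differ slightly. The ✓ / X order, the ground-truth count, and the per-class values are hypothetical.

```python
def average_precision(labels, num_ground_truth):
    """Non-interpolated AP from (score, is_true_positive) pairs."""
    ranked = sorted(labels, key=lambda d: d[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (score, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            total += true_positives / rank  # precision at this rank
    return total / num_ground_truth


def mean_average_precision(ap_per_class):
    """mAP: the unweighted mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical labeling of the six ranked detections, 4 ground-truth objects.
ranked = [(0.9, True), (0.8, True), (0.6, False),
          (0.5, True), (0.2, False), (0.1, False)]
ap = average_precision(ranked, num_ground_truth=4)
# (1/1 + 2/2 + 3/4) / 4 = 0.6875

# Hypothetical per-class APs.
m_ap = mean_average_precision({"person": 0.5, "dog": 1.0})
```

Dividing by the number of ground-truth objects, not the number of true positives, penalizes objects that are never detected.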

Page 14:

Pedestrians

Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005

AP ~77%
More sophisticated methods: AP ~90%

(a) average gradient image over training examples
(b) each “pixel” shows max positive SVM weight in the block centered on that pixel
(c) same as (b) for negative SVM weights
(d) test image
(e) its R-HOG descriptor
(f) R-HOG descriptor weighted by positive SVM weights
(g) R-HOG descriptor weighted by negative SVM weights

Page 15:

Why did it work?

Average gradient image

Page 16:

Generic categories

Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?

PASCAL Visual Object Classes (VOC) dataset

Page 17:

Generic categories

Why doesn’t this work (as well)?

Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?

PASCAL Visual Object Classes (VOC) dataset

Page 18:

Quiz time

Page 19:

Warm up

This is an average image of which object class?

Page 20:

Warm up

pedestrian

Page 21:

A little harder

?

Page 22:

A little harder

?

Hint: airplane, bicycle, bus, car, cat, chair, cow, dog, dining table

Page 23:

A little harder

bicycle (PASCAL)

Page 24:

A little harder, yet

?

Page 25:

A little harder, yet

?

Hint: white blob on a green background

Page 26:

A little harder, yet

sheep (PASCAL)

Page 27:

Impossible?

?

Page 28:

Impossible?

dog (PASCAL)

Page 29:

Impossible?

dog (PASCAL)

Why does the mean look like this?

There’s no alignment between the examples!

How do we combat this?
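A toy illustration of why missing alignment washes out the mean. Everything here is hypothetical: each "image" is a 1-D list holding a single bright pixel that stands in for the object.

```python
def bump(position, length=8):
    """A toy 'object': one bright pixel at the given position."""
    return [1.0 if i == position else 0.0 for i in range(length)]


def mean_signal(signals):
    """Element-wise mean of equally long signals."""
    n = len(signals)
    return [sum(s[i] for s in signals) / n for i in range(len(signals[0]))]


# Aligned examples (like cropped pedestrians): the object is always at index 3.
mean_aligned = mean_signal([bump(3) for _ in range(4)])

# Unaligned examples (like PASCAL dogs): the object appears at varying positions.
mean_misaligned = mean_signal([bump(p) for p in (1, 3, 4, 6)])
```

The aligned mean keeps one sharp peak of height 1.0, while the unaligned mean spreads the same total mass over four peaks of height 0.25.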

Page 30:

PASCAL VOC detection history

[Figure: mean Average Precision (mAP), 0% to 70%, vs. year, 2006 to 2015. Method labels, in order: DPM; DPM, HOG+BOW; DPM, MKL; DPM++; DPM++, MKL, Selective Search; Selective Search, DPM++, MKL. mAP values shown: 17%, 23%, 28%, 37%, 41%, 41%]

Page 31:

Part-based models & multiple features (MKL)

[Figure: PASCAL VOC mAP (0% to 70%) vs. year (2006 to 2015) for the DPM, MKL, and Selective Search methods (17%, 23%, 28%, 37%, 41%, 41%), annotated "rapid performance improvements"]

Page 32:

Kitchen-sink approaches

[Figure: PASCAL VOC mAP (0% to 70%) vs. year (2006 to 2015) for the DPM, MKL, and Selective Search methods (17%, 23%, 28%, 37%, 41%, 41%), annotated "increasing complexity & plateau"]

Page 33:

Region-based Convolutional Networks (R-CNNs)

[Figure: PASCAL VOC mAP (0% to 70%) vs. year (2006 to 2015) for the DPM, MKL, and Selective Search methods (17%, 23%, 28%, 37%, 41%, 41%), plus two new points: 53% and 62%, labeled R-CNN v1 and R-CNN v2]

[R-CNN. Girshick et al. CVPR 2014]

Page 34:

Region-based Convolutional Networks (R-CNNs)

[Figure: PASCAL VOC mAP (0% to 70%) vs. year (2006 to 2015), annotated "~5 years" and "~1 year"]

[R-CNN. Girshick et al. CVPR 2014]

Page 35:

Convolutional Neural Networks
• Overview

Page 36:

Standard Neural Networks

$$\mathbf{x} = (x_1, \ldots, x_{784})^T, \qquad z_j = g(\mathbf{w}_j^T \mathbf{x}), \qquad g(t) = \frac{1}{1 + e^{-t}}$$

“Fully connected”
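The fully connected layer can be written out directly from the formulas. A minimal sketch with no bias term (matching the slide); the function names and the two example weight vectors are hypothetical.

```python
import math


def sigmoid(t):
    """g(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def fully_connected(x, weights):
    """z_j = g(w_j^T x): every unit j sees every input x_i."""
    return [sigmoid(sum(w_i * x_i for w_i, x_i in zip(w_j, x)))
            for w_j in weights]


# A 784-dimensional input (e.g. a flattened 28x28 MNIST digit), here all zeros,
# and two hidden units with hypothetical weights.
x = [0.0] * 784
weights = [[0.1] * 784, [-0.2] * 784]
z = fully_connected(x, weights)  # each entry is g(0) = 0.5
```

Each unit needs its own 784 weights, which is the cost that local connectivity and weight sharing reduce.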

Page 37:

From NNs to Convolutional NNs
• Local connectivity
• Shared (“tied”) weights
• Multiple feature maps
• Pooling

Page 38:

Convolutional NNs

• Local connectivity

• Each green unit is only connected to (3) neighboring blue units

compare

Page 39:

Convolutional NNs

• Shared (“tied”) weights

• All green units share the same parameters

• Each green unit computes the same function, but with a different input window

[Figure: the weights w1, w2, w3 repeated on each green unit]

Page 40:

Convolutional NNs

• Convolution with 1-D filter:

• All green units share the same parameters

• Each green unit computes the same function, but with a different input window

[Figure: filter weights w1, w2, w3]
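Local connectivity plus shared weights amounts to sliding one small filter along the input. A minimal sketch, assuming no padding and no filter flip (strictly a cross-correlation, which is how CNN layers are commonly implemented); `conv1d` and the example values are hypothetical.

```python
def conv1d(x, w):
    """Apply the same weights w to every length-len(w) window of x."""
    k = len(w)
    return [sum(w[i] * x[j + i] for i in range(k))
            for j in range(len(x) - k + 1)]


blue_units = [1, 2, 3, 4, 5]  # input layer
w = [1, 0, -1]                # shared weights w1, w2, w3
green_units = conv1d(blue_units, w)
# Each green unit sees 3 neighboring blue units:
# [1*1 + 0*2 - 1*3, 1*2 + 0*3 - 1*4, 1*3 + 0*4 - 1*5] = [-2, -2, -2]
```

The layer has 3 parameters regardless of input length; a fully connected layer with the same 5 inputs and 3 outputs would have 15.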


Page 45:

Convolutional NNs

• Multiple feature maps

• All orange units compute the same function, but with different input windows

• Orange and green units compute different functions

Feature map 1 (array of green units), weights w1, w2, w3

Feature map 2 (array of orange units), weights w'1, w'2, w'3
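Multiple feature maps are several filters run over the same input. A sketch under the same assumptions as a plain sliding-window filter (no padding, no flip); the filter values are hypothetical.

```python
def conv1d(x, w):
    """Apply the same weights w to every length-len(w) window of x."""
    k = len(w)
    return [sum(w[i] * x[j + i] for i in range(k))
            for j in range(len(x) - k + 1)]


def feature_maps(x, filters):
    """One output array per filter; weights are shared within each map only."""
    return [conv1d(x, w) for w in filters]


x = [1, 2, 3, 4, 5]
green_filter = [1, 0, -1]  # w1, w2, w3
orange_filter = [1, 1, 1]  # w'1, w'2, w'3
maps = feature_maps(x, [green_filter, orange_filter])
# maps[0] (green) = [-2, -2, -2], maps[1] (orange) = [6, 9, 12]
```

Units in the same map differ only in their input window; units at the same position in different maps differ only in their weights.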

Page 46:

Convolutional NNs

• Pooling (max, average)

• Pooling area: 2 units

• Pooling stride: 2 units

• Subsamples feature maps

[Figure: unit values 1, 4, 0, 3, 4, 3]
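A sketch of 1-D pooling with area 2 and stride 2. Reading the figure values as four input units (1, 4, 0, 3) and two max-pooled outputs (4, 3) is an assumption, though the numbers are consistent with it; the function names are hypothetical.

```python
def pool1d(x, area=2, stride=2, op=max):
    """Summarize each window of `area` units, moving `stride` units at a time."""
    return [op(x[i:i + area]) for i in range(0, len(x) - area + 1, stride)]


def average(window):
    return sum(window) / len(window)


units = [1, 4, 0, 3]
max_pooled = pool1d(units)              # [max(1, 4), max(0, 3)] = [4, 3]
avg_pooled = pool1d(units, op=average)  # [2.5, 1.5]
```

With stride equal to the pooling area the windows do not overlap, so the feature map is subsampled by a factor of 2.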

Page 47:

Image

Pooling

Convolution

2D input
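The same two operations on a 2D input, as a rough sketch. Assumptions: a single-channel image stored as a list of rows, no padding, no filter flip, and a hypothetical 1x2 horizontal-gradient filter.

```python
def conv2d(image, kernel):
    """Slide the kernel over every position where it fits inside the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool2d(fmap, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    return [[max(fmap[r + a][c + b]
                 for a in range(size) for b in range(size))
             for c in range(0, len(fmap[0]) - size + 1, stride)]
            for r in range(0, len(fmap) - size + 1, stride)]


# A 4x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(4)]
edge_filter = [[-1, 1]]  # responds where intensity rises left to right

feature_map = conv2d(image, edge_filter)  # 4x4; each row is [0, 1, 0, 0]
pooled = max_pool2d(feature_map)          # 2x2; [[1, 0], [1, 0]]
```

After pooling, the edge response survives at a quarter of the resolution.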

Page 48:

1989

Backpropagation applied to handwritten zip code recognition, LeCun et al., 1989

Page 49:

Historical perspective – 1980

Page 50:

Historical perspective – 1980

Hubel and Wiesel 1962

Included basic ingredients of ConvNets, but no supervised learning algorithm

Page 51:

Supervised learning – 1986

Early demonstration that error backpropagation can be used for supervised training of neural nets (including ConvNets)

Gradient descent training with error backpropagation
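Gradient descent with error backpropagation, reduced to the smallest case that still needs the chain rule: a hypothetical two-layer network with one sigmoid unit per layer, y = g(v · g(w · x)), trained on a single example with squared error. The learning rate, initial weights, and target are arbitrary choices for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_step(w, v, x, target, lr=0.5):
    """One gradient-descent update; returns new (w, v) and the loss before it."""
    # Forward pass.
    h = sigmoid(w * x)
    y = sigmoid(v * h)
    loss = 0.5 * (y - target) ** 2
    # Backward pass: propagate the error from the output toward the input.
    d_out = (y - target) * y * (1.0 - y)  # dloss / d(v*h)
    grad_v = d_out * h
    d_hidden = d_out * v                  # dloss / dh
    grad_w = d_hidden * h * (1.0 - h) * x
    # Gradient-descent update.
    return w - lr * grad_w, v - lr * grad_v, loss


w, v = 0.1, 0.1
losses = []
for _ in range(200):
    w, v, loss = train_step(w, v, x=1.0, target=0.9)
    losses.append(loss)
# The loss shrinks as the weights follow the negative gradient.
```

The hidden-layer gradient reuses `d_out` from the output layer, which is the reuse that makes backpropagation efficient in deeper networks.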

Page 52:

Supervised learning – 1986

“T” vs. “C” problem Simple ConvNet

Page 53:

Practical ConvNets

Gradient-Based Learning Applied to Document Recognition, LeCun et al., 1998

Page 54:

Demo

• http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
• ConvNetJS by Andrej Karpathy (Ph.D. student at Stanford)

Software libraries
• Caffe (C++, Python, MATLAB)
• Torch7 (C++, Lua)
• Theano (Python)

Page 55:

The fall of ConvNets

• The rise of Support Vector Machines (SVMs)
• Mathematical advantages (theory, convex optimization)
• Competitive performance on tasks such as digit classification
• Neural nets became unpopular in the mid 1990s

Page 56:

The key to SVMs

• It’s all about the features

Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005

[Figure: HOG features and SVM weights (+) (-)]

Page 57:

Core idea of “deep learning”

• Input: the “raw” signal (image, waveform, …)

• Features: hierarchy of features is learned from the raw input

Page 58:

• If SVMs killed neural nets, how did they come back (in computer vision)?

Page 59:

What’s new since the 1980s?

• More layers
  • LeNet-3 and LeNet-5 had 3 and 5 learnable layers
  • Current models have 8 – 20+
• “ReLU” non-linearities (Rectified Linear Unit)
  • Gradient doesn’t vanish
• “Dropout” regularization
• Fast GPU implementations
• More data

[Figure: plot of g(x) against x]
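Two of these ingredients in a few lines, as a sketch only. The ReLU is g(x) = max(0, x); its gradient stays at 1 for any positive input, whereas the sigmoid's gradient shrinks toward 0 for large inputs. The dropout function uses the "inverted" training-time variant (survivors are rescaled by 1 / (1 - p)), a common implementation choice assumed here, not something the slide specifies.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def dropout(activations, p=0.5, rng=random):
    """Zero each unit with probability p; rescale the rest (training time only)."""
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]


# For a large positive input the sigmoid gradient nearly vanishes; ReLU's does not.
grads_at_10 = (relu_grad(10.0), sigmoid_grad(10.0))

# Hypothetical layer of 10 units, all active at 1.0, with a seeded generator.
dropped = dropout([1.0] * 10, p=0.5, rng=random.Random(0))
```

For non-positive inputs the ReLU gradient is 0, so "doesn't vanish" applies to units that are active.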

Page 60:

Ross’s Own System: Region CNNs

Page 61:

Competitive Results

Page 62:

Top Regions for Six Object Classes