deep fisher networks and class saliency maps for and...

33
Deep Fisher Networks and Class Saliency Maps for Object Classification and Localisation Karén Simonyan, Andrea Vedaldi, Andrew Zisserman Visual Geometry Group, University of Oxford

Upload: others

Post on 07-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Deep Fisher Networks and Class Saliency Maps for 

Object Classification and Localisation

Karén Simonyan, Andrea Vedaldi, Andrew ZissermanVisual Geometry Group, University of Oxford

Page 2: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Outline

• Classification challenge• can Fisher Vector encodings be improved by a deep architecture?• deep Fisher Network (FN)• combination of two deep models: Convolutional Network (CN) and deep Fisher Network

• Localisation challenge• visualization of class saliency maps  and  per‐image foreground  pixels from a single classification CN

• bounding boxes computed from foreground pixels• weak supervision: only image class labels used for training

Page 3: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Bag of Visual Words (BOW) pipeline

... ... ... ...

VQ

Linear SVM

dogs

Shallow Image Encoding & Classification• Dense SIFT features

[Luong & Malik, 1999][Varma & Zisserman, 2003]

[Csurka et al, 2004] [Vogel & Schiele, 2004]

[Jurie & Triggs, 2005][Lazebnik et al, 2006]

[Bosch et al, 2006]

Page 4: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

soft-assignment to GMM

1st order stats (k-th Gaussian):

2nd order stats (k-th Gaussian):

80-D 80-D 80-D

FV dimensionality: 80×2×512=81,920(for a mixture of 512 Gaussians)

stacking e.g. if SIFT x reduced to 80 dimensions by PCA

Dense set of local SIFT features → Fisher vector (high dim)

Fisher Vector (FV) – Encoding

Perronnin et al CVPR 07 & 10, ECCV 10

Page 5: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Learn projection onto a low-dim space where classes are well-separated

• Joint learning of projection and projected-space classifiers (WSABIE):

• Or project onto the space of classifier scores:

• are linear SVM classifiers in the high-dimensional FV space

• fast-to-learn

Projection LearningFisher vector (high dim) → low dimensional representation

Page 6: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Deep Fisher Network

Dense feature extractionSIFT, colour

One vs. rest linear SVMs

low-dim FV encoder

Spatial stacking

L2 norm. & PCA

FV encoder

SSR & L2 norm.

SSR & L2 norm.

input image

0-th layer

1-st Fisher layer (local & global pooling)

2-nd Fisher layer(global pooling)

classifier layer

Dense feature extraction

SIFT, raw patches, …

One vs rest linear SVMs

FV encoder

SSR & L2 norm.

Shallow Fisher Vector

Page 7: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Fisher Layer

Compressedlocal Fisher encoding

Spatial stacking(2×2)

L2 norm‐n & PCA 

decorrelation

featurew

h

80

w/2

h/2

82000

w/2

h/2

1000

w/2

h/2

4000

w/2

h/2

256

Page 8: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Deep Fisher Network

Dense feature extractionSIFT, colour

One vs. rest linear SVMs

low-dim FV encoder

Spatial stacking

L2 norm. & PCA

FV encoder

SSR & L2 norm.

SSR & L2 norm.

input image

0-th layer

1-st Fisher layer (local & global pooling)

2-nd Fisher layer(global pooling)

classifier layer

Dense feature extraction

SIFT, raw patches, …

One vs rest linear SVMs

FV encoder

SSR & L2 norm.

Shallow Fisher Vector

Page 9: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Classification Results for Fisher Network

ImageNet 2010 challenge dataset: • 1.2M images, 1K classes• SIFT & colour features• Learning: 2‐3 days on 200 CPU cores  (MATLAB + MEX implementation)

Improved classification accuracy by adding layer

Page 10: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Deep ConvNet Implementation

• Based on cuda‐convnet [Krizhevsky et al., 2012]• 8 weight layers (rather narrow):

conv64‐conv256‐conv256‐conv256‐conv256‐full4096‐full4096‐full1000

• Jittering:• cropping, flipping, PCA‐aligned noise• random occlusion:

• Single ConvNet instance

Page 11: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Classification Results

ImageNet 2012 challenge dataset: • 1.2M images, 1K classes• top‐5 classification accuracy

Method top‐5 accuracyFV encoding (our 2012 entry) 72.7%Deep FishNet 76.9%Deep ConvNet [Krizhevsky et al., 2012] 81.8%

83.6% (5 ConvNets)Deep ConvNet (our implementation) 82.3%Deep ConvNet + Deep FishNet 84.8%

ConvNet and FisherNet are complementary

Page 12: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Outline

• Classification challenge• can Fisher Vector encodings be improved by a deep architecture?• deep Fisher Network (FN)• combination of two deep models: Convolutional Network (CN) and deep Fisher Network

• Localisation challenge• visualization of class saliency maps  and  per‐image foreground  pixels from a single classification CN

• bounding boxes computed from foreground pixels• weak supervision: only image class labels used for training

Page 13: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Deep inside ConvNets: what Has Been Learnt?

ConvNet class model visualisation• find a (regularised) image with a high class score                :

with a fixed learnt model

• compute                         using back‐prop

Cf ConvNet training• max log‐likelihood of the correct class 

• using back‐prop

Visualizing higher‐layer features of a deep network. Erhan, D., Bengio, Y., Courville, A., Vincent, P. Technical report, University of Montreal, 2009.

fully connectedclassifier layer

soft-max layer

Page 14: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

fox

Page 15: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

pepper

Page 16: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

dumbbell

Page 17: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Deep inside ConvNets: what Has Been Learnt?

ConvNet class model visualisation• find a (regularised) image with a high class score                :

with a fixed learnt model

• compute                         using back‐prop

NB                             

Visualizing higher‐layer features of a deep network. Erhan, D., Bengio, Y., Courville, A., Vincent, P. Technical report, University of Montreal, 2009.

fully connectedclassifier layer

soft-max layer

gives less prominent visualisation, as it concentrates on reducing  scores of other classes

Page 18: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Deep inside ConvNets:What Makes an Image Belong to a Class?

• ConvNets are highly non‐linear → local linear approxima on

• 1st order expansion of a class score around a given image    :

• has the same dimensions as image• magnitude of     defines a saliency map for image     and class 

– computed using back‐prop

– score of  ‐th class

How to Explain Individual Classification Decisions. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.‐R. JMLR, 2010.

Page 19: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Saliency Maps For Top‐1 Class

Page 20: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Saliency Maps For Top‐1 Class

Page 21: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Saliency Maps For Top‐1 Class

Page 22: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Weakly supervised• computed using class‐n ConvNet, trained on image class labels• no additional annotation required (e.g. boxes or masks)

• Highlights discriminative object parts• Instant computation – no sliding window• Fires on several object instances

• Related to deconvnet [Zeiler and Fergus, 2013]• very similar for convolution, max‐pooling, and RELU layers• but we also back‐prop through fully‐connected layers

Image Saliency Map

Page 23: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Saliency Maps for Object Localisation• Image → top‐k class → class saliency map → object box

Page 24: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Given an image and a saliency map:

BBox Localisation for ILSVRC Submission

Page 25: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Given an image and a saliency map:1. Foreground/background mask

using thresholds on saliency

BBox Localisation for ILSVRC Submission

blue – foregroundcyan – backgroundred – undefined

Page 26: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Given an image and a saliency map:1. Foreground/background mask

using thresholds on saliency2. GraphCut colour segmentation

[Boykov and Jolly, 2001]

BBox Localisation for ILSVRC Submission

Page 27: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Given an image and a saliency map:1. Foreground/background mask

using thresholds on saliency2. GraphCut colour segmentation

[Boykov and Jolly, 2001]3. Bounding box of the largest 

connected component

• Colour information propagates segmentation from the most discriminative areas

BBox Localisation for ILSVRC Submission

Page 28: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Segmentation‐Localisation Examples

Page 29: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Segmentation‐Localisation Examples

Page 30: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Segmentation‐Localisation Failure Cases

• Several object instances

Page 31: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Segmentation isn’t propagated from the salient parts

Segmentation‐Localisation Failure Cases

Page 32: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

• Limitations of GraphCut segmentation

Segmentation‐Localisation Failure Cases

Page 33: Deep Fisher Networks and Class Saliency Maps for and ...image-net.org/challenges/LSVRC/2013/slides/ILSVRC_az.pdf · Deep inside ConvNets: What Makes an Image Belong to a Class? •

Summary

• Fisher encoding benefits from stacking

• Deep FishNet is complementary to Deep ConvNet

• Class saliency maps are useful for localisation• location of discriminative object parts• weakly supervised: bounding boxes not used for training• fast to compute