
Learning Message-Passing Inference Machines for Structured Prediction
Stephane Ross, Daniel Munoz, Martial Hebert & J. Andrew Bagnell

Poster sections: Proposed Approach: Inference Machines · General Learning Procedures for Inference Machines · Learning Synchronous Message-Passing Inference · Learning Asynchronous Message-Passing Inference · Theoretical Guarantees · 3D Point Cloud Classification [2] · Surface Layout Estimation [6]

Overview: Structured Prediction for Scene Labeling

Predict a class label at every site (pixel/superpixel/segment) in a scene:

Naive Approach: Independent Predictions (Features → Predictor)
• Simple, fast and tractable
• No context (usually poor accuracy)

Typical Approach: Graphical Models (Features → Model → Inference)
• Exploits context (models spatial relationships between neighbors)
• Intractable without approximations (e.g. Tree model / Loopy BP / Graph-cut with a limited class of potentials)
• Optimization with approximations can lead to poor performance [5]

Pairwise CRF Model
• Models the conditional joint distribution over labelings Y of the scene given input features X:

  $P(Y \mid X) \propto \prod_{i \in V} \phi_i(Y_i, X) \prod_{(i,j) \in E} \psi_{ij}(Y_i, Y_j, X)$

• Learn the node potentials φ and the edge potentials ψ from training data.

Loopy Belief Propagation (BP)
• Approximate inference procedure to find the most likely labeling.
• Iteratively visits all nodes and edges in the graph to update neighbors' messages until convergence:

  $m_{v \to f}(y_v) \propto \phi_v(y_v, x_v) \prod_{f' \in N_v \setminus f} m_{f' \to v}(y_v)$

  $m_{f \to v}(y_v) \propto \sum_{y' : y'_v = y_v} \psi_f(y', x) \prod_{v' \in N_f \setminus v} m_{v' \to f}(y'_{v'})$

  $P(Y_v = y_v) \propto \phi_v(y_v, x_v) \prod_{f' \in N_v} m_{f' \to v}(y_v)$
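To make these updates concrete, below is a minimal NumPy sketch of sum-product loopy BP on a pairwise model; the data structures (`unary` node potentials, `pairwise` edge potentials, an edge list) and the function name are illustrative assumptions, not code from the poster.

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, n_iters=10):
    """Sum-product loopy BP on a pairwise model (illustrative sketch).

    unary:    dict node -> (K,) array of node potentials phi_i
    pairwise: dict (i, j) -> (K, K) array of edge potentials psi_ij
    edges:    list of undirected edges (i, j)
    Returns approximate per-node beliefs P(Y_i = y_i).
    """
    nbrs = {v: [] for v in unary}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    K = len(next(iter(unary.values())))
    # m[(i, j)] is the message sent from i to j (a distribution over labels)
    m = {}
    for i, j in edges:
        m[(i, j)] = np.ones(K) / K
        m[(j, i)] = np.ones(K) / K

    for _ in range(n_iters):
        new_m = {}
        for i, j in m:
            # product of i's unary potential and incoming messages, excluding j
            incoming = [m[(k, i)] for k in nbrs[i] if k != j]
            pre = unary[i] * (np.prod(incoming, axis=0) if incoming else 1.0)
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            msg = psi.T @ pre                      # sum over i's labels
            new_m[(i, j)] = msg / msg.sum()
        m = new_m

    beliefs = {}
    for v in unary:
        incoming = [m[(k, v)] for k in nbrs[v]]
        b = unary[v] * (np.prod(incoming, axis=0) if incoming else 1.0)
        beliefs[v] = b / b.sum()
    return beliefs
```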

Iterate many times over graph: [diagram: Features + Neighbors' Predictions → Predictor]

Exploits context

Exploits discriminative power of classifiers

Strong theoretical guarantees

Accuracy/speed trade-off via more/less complex predictors

Given any message-passing algorithm (e.g. Mean-Field, BP):
• Keep the same algorithm structure.
• Message updates are replaced by a learned predictor h's outputs, e.g. a logistic regressor.
• h outputs messages (a distribution over labels) to send to neighbors, given input features and neighbors' messages.
• h minimizes the loss of the inference's output prediction.
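As a rough illustration of this idea, a single pass of such an inference machine can be written as below; for simplicity every site keeps one outgoing message (as in a mean-field-style schedule), the input encoding (site features concatenated with the mean of incoming messages) is an assumption, and `h` can be any classifier exposing `predict_proba`, e.g. a scikit-learn logistic regressor.

```python
import numpy as np

def inference_machine_pass(h, feats, msgs, nbrs, n_classes):
    """One synchronous pass: every site's message is recomputed by the learned
    predictor h from the site's features and its neighbors' current messages.
    feats[v]: feature vector of site v; msgs[v]: current message (label
    distribution) of site v; nbrs[v]: list of v's neighbors."""
    new_msgs = {}
    for v, x_v in feats.items():
        incoming = [msgs[u] for u in nbrs[v]] or [np.ones(n_classes) / n_classes]
        inp = np.concatenate([x_v, np.mean(incoming, axis=0)])
        # predict_proba returns a distribution over labels: the new message
        new_msgs[v] = h.predict_proba(inp.reshape(1, -1))[0]
    return new_msgs

# e.g. h = sklearn.linear_model.LogisticRegression().fit(X_train, y_train)
```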

Inference procedure treated as a black box function to optimize, instead of learning a graphical model ([3,4] are special cases of our approach).

[Figure: a three-node example graph (A, B, C); over steps 1-3, the learned predictor h produces the message of C to B, the message of C to A, and finally the prediction for C.]

[Figure: dependency graph over the messages AB, AC, BA, BC, CA, CB and the node outputs A, B, C, showing the interdependencies between outputs for 3 asynchronous passes on the example graph.]
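For contrast with the synchronous pass sketched earlier, an asynchronous schedule updates messages in place, so sites visited later in a pass already see the messages produced earlier in the same pass (the dependencies sketched in the figure); this is an illustrative variant reusing the same assumed encoding, not the poster's exact schedule.

```python
import numpy as np

def inference_machine_async_pass(h, feats, msgs, order, nbrs, n_classes):
    """One asynchronous pass: messages are overwritten in place, so a site
    late in `order` sees the messages its neighbors produced earlier in
    this same pass."""
    for v in order:
        incoming = [msgs[u] for u in nbrs[v]] or [np.ones(n_classes) / n_classes]
        inp = np.concatenate([feats[v], np.mean(incoming, axis=0)])
        msgs[v] = h.predict_proba(inp.reshape(1, -1))[0]   # in-place update
    return msgs
```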

Issue: Predictor’s outputs change its future inputs (non-iid)

Forward Training: Learn a separate predictor for each inference pass:

[Diagram: the training scenes' messages at the 1st pass form the 1st-pass dataset, used for supervised learning of the 1st predictor; the training scenes' messages at the 2nd pass (obtained after running the 1st predictor) form the 2nd-pass dataset, used to learn the 2nd predictor; and so on for each pass. Example scenes are drawn as 3x3 grids of sites numbered 1-9.]
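A compact sketch of this procedure, under assumptions not stated on the poster: `build_examples(scene, msgs)` pairs each site's features and current messages with the site's true label, `run_pass(h, scene, msgs)` performs one message-passing pass with predictor h, and each scene is assumed to expose `name` and `initial_messages()`; the poster's base predictor, a logistic regressor, is used for the supervised learning step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def forward_train(scenes, n_passes, build_examples, run_pass):
    """Forward training sketch: learn one predictor per inference pass."""
    predictors = []
    state = {s.name: s.initial_messages() for s in scenes}   # messages so far
    for _ in range(n_passes):
        # 1) dataset for this pass, built from the messages produced so far
        Xs, ys = zip(*[build_examples(s, state[s.name]) for s in scenes])
        X, y = np.vstack(Xs), np.concatenate(ys)
        # 2) supervised learning of this pass's predictor
        h_t = LogisticRegression(max_iter=500).fit(X, y)
        predictors.append(h_t)
        # 3) roll every training scene forward one pass with the new predictor
        state = {s.name: run_pass(h_t, s, state[s.name]) for s in scenes}
    return predictors
```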

Each training example pairs a site's features and its messages with the site's ground-truth label as the target. For the example 3x3 scene:

Features | Messages
x1       | x12, x14
x2       | x23, x25, x21
x3       | x32, x36
x4       | x45, x47, x41
x5       | x56, x58, x54, x52
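One possible form of the `build_examples` helper assumed in the forward-training sketch above, mirroring the Features / Messages / Target layout of the table; the scene attribute names (`features`, `nbrs`, `labels`, `n_classes`) are illustrative.

```python
import numpy as np

def build_examples(scene, msgs):
    """Turn one scene and its current messages into supervised examples:
    input = site features + summary of incoming messages, target = true label."""
    X, y = [], []
    for v, x_v in scene.features.items():
        incoming = [msgs[u] for u in scene.nbrs[v]] or \
                   [np.ones(scene.n_classes) / scene.n_classes]
        X.append(np.concatenate([x_v, np.mean(incoming, axis=0)]))
        y.append(scene.labels[v])                # target = ground-truth label
    return np.vstack(X), np.asarray(y)
```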

Dataset Aggregation (DAgger) [1]: Learn one predictor for all inference passes.

[Diagram: starting from an initial predictor, run inference on the training scenes to generate new data, add it to the aggregate dataset, run supervised learning on the aggregate dataset to obtain a new predictor, and repeat.]
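A matching sketch of DAgger for inference machines, reusing the assumed `build_examples` / `run_pass` helpers and an assumed initial predictor `init_h`: each iteration runs inference on the training scenes with the current predictor, adds the newly generated examples to the aggregate dataset, and retrains a single shared predictor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger_train(scenes, n_passes, n_iters, build_examples, run_pass, init_h):
    """DAgger sketch: one predictor for all passes, trained on an aggregate
    dataset that grows with every iteration."""
    agg_X, agg_y = [], []
    h = init_h
    for _ in range(n_iters):
        for s in scenes:
            msgs = s.initial_messages()
            for _ in range(n_passes):
                X, y = build_examples(s, msgs)   # new data from this pass
                agg_X.append(X)
                agg_y.append(y)
                msgs = run_pass(h, s, msgs)
        # supervised learning on everything collected so far
        h = LogisticRegression(max_iter=500).fit(np.vstack(agg_X),
                                                 np.concatenate(agg_y))
    return h
```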

Forward Training (similar to [3,4]): If the predictors have ε avg. loss on the supervised learning task, then the sequence will have ε avg. loss during inference.

DAgger: In N iterations, if we achieve ε avg. loss on the aggregate dataset, we are guaranteed to find a predictor h that has avg. loss of ε + O(1/N) during inference.

For both: performance at inference is guaranteed to be at least as good as naive independent predictions.

Experimental Results

Point Cloud Classification [2]:
• Label each 3D point into 5 classes
• 5-NN pairwise structure + segments (sketched below)
• Logistic regressor as base predictor

Surface Layout Estimation [6]:
• Label each superpixel into 7 classes
• Adjacent pairwise structure + segments
• Logistic regressor as base predictor
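A short sketch of how the 5-NN pairwise structure over the 3D points could be built with scikit-learn's NearestNeighbors; the function name and the choice to symmetrize edges are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_edges(points, k=5):
    """Undirected k-NN edges over an (N, 3) array of 3D point coordinates."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nn.kneighbors(points)          # idx[i, 0] is point i itself
    edges = {tuple(sorted((i, int(j)))) for i, row in enumerate(idx)
             for j in row[1:]}
    return sorted(edges)
```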

[Qualitative result figures: ground-truth labelings compared against the inference machine (BP and Mean-Field versions), Hoiem [6], and M3N-F [2].]

The Robotics Institute, Carnegie Mellon University

References
[1] S. Ross, G. Gordon & J. A. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011.
[2] D. Munoz, J. A. Bagnell, N. Vandapel & M. Hebert. Contextual Classification with Functional Max-Margin Markov Networks. CVPR 2009.
[3] Z. Tu & X. Bai. Auto-context and Its Application to High-level Vision Tasks and 3D Brain Image Segmentation. PAMI 2010.
[4] D. Munoz, J. A. Bagnell & M. Hebert. Stacked Hierarchical Labeling. ECCV 2010.
[5] A. Kulesza & F. Pereira. Structured Learning with Approximate Inference. NIPS 2008.
[6] D. Hoiem, A. A. Efros & M. Hebert. Recovering Surface Layout from an Image. IJCV 2007.