Crash Course on Machine Learning Part V
Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause

Page 1:

Crash Course on Machine Learning Part V

Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause

Page 2:

Structured Prediction

• Use local information
• Exploit correlations

[Figure: handwritten letters of the word "brace"]

Page 3:

Min-max Formulation

LP duality

Page 4:

Before

QP duality

Exponentially many constraints/variables

Page 5:

After

By QP duality

Dual inherits structure from the problem-specific inference LP. Variables correspond to a decomposition of the variables of the flat case.

Page 6:

The Connection

[Figure: candidate labelings of the handwritten word ("bcare", "brore", "broce", "brace") with their probabilities and the induced single-letter and letter-pair marginals]

Page 7:

Duals and Kernels

• Kernel trick works: factored dual
• Local functions (log-potentials) can use kernels

Page 8:

3D Mapping

Laser Range Finder

GPS

IMU

Data provided by: Michael Montemerlo & Sebastian Thrun

Label: ground, building, tree, shrub
Training: 30 thousand points. Testing: 3 million points.

Pages 9–13: (image-only slides, no transcript text)

Alternatives: Perceptron

• Simple iterative method
• Unstable for structured output: fewer instances, big updates
  – May not converge if non-separable
  – Noisy
• Voted / averaged perceptron [Freund & Schapire 99; Collins 02]
  – Regularize / reduce variance by averaging over iterations
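As a concrete (hypothetical) illustration, here is a minimal sketch of the averaged structured perceptron; the joint feature map phi and the argmax inference routine are placeholders you would supply for your structured problem:

    import numpy as np

    def averaged_structured_perceptron(data, phi, argmax, dim, epochs=10):
        """Averaged structured perceptron [Collins 02].

        data   : list of (x, y_true) pairs
        phi    : joint feature map, phi(x, y) -> np.ndarray of length dim
        argmax : inference routine, argmax(w, x) -> highest-scoring y under w
        """
        w = np.zeros(dim)        # current weights
        w_sum = np.zeros(dim)    # running sum for averaging
        t = 0
        for _ in range(epochs):
            for x, y_true in data:
                y_hat = argmax(w, x)                    # best guess under current w
                if y_hat != y_true:                     # mistake-driven update
                    w += phi(x, y_true) - phi(x, y_hat)
                w_sum += w                              # accumulate for the average
                t += 1
        return w_sum / t  # averaging regularizes / reduces variance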

Page 14:

Alternatives: Constraint Generation [Collins 02; Altun et al., 03]

• Add most violated constraint
• Handles several more general loss functions
• Need to re-solve the QP many times
• Theorem: only a polynomial # of constraints is needed to achieve ε-error [Tsochantaridis et al., 04]
• Worst-case # of constraints larger than factored

Page 15:

Integration

• Feature passing
• Margin-based
  – Max-margin structure learning
• Probabilistic
  – Graphical models

Page 16:

Graphical Models

• Joint distribution
  – Factoring using (conditional) independence among variables

• Representation

• Inference

• Learning

Page 17:

Big Picture

• Two problems with using full joint distribution tables as our probabilistic models:
  – Unless there are only a few variables, the joint is WAY too big to represent explicitly
  – Hard to learn (estimate) anything empirically about more than a few variables at a time
• Describe complex joint distributions (models) using simple, local distributions
  – We describe how variables locally interact
  – Local interactions chain together to give global, indirect interactions

Page 18:

Joint Distribution

• For n variables with domain size d:
  – joint distribution table with d^n − 1 free parameters
• Size of the representation if we use the chain rule:

Concretely, counting free parameters and using that each distribution sums to one:

(d−1) + d(d−1) + d²(d−1) + … + d^(n−1)(d−1) = [(d^n − 1)/(d − 1)](d−1) = d^n − 1

so the chain rule alone does not shrink the representation.
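A quick numeric check of this identity, with arbitrary (made-up) d and n:

    d, n = 3, 5  # domain size and number of variables, chosen arbitrarily

    # Free parameters of P(x_k | x_1..x_{k-1}) under the chain rule:
    # d^(k-1) parent configurations, each a distribution with d - 1 free values.
    chain_rule = sum(d**k * (d - 1) for k in range(n))

    print(chain_rule, d**n - 1)  # both print 242: no savings without independence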

Page 19:

Conditional Independence

• Two variables X, Y are conditionally independent given Z iff P(X, Y | Z) = P(X | Z) P(Y | Z), equivalently P(X | Y, Z) = P(X | Z)

• What about this domain?
  – Traffic
  – Umbrella
  – Raining

Page 20:

Representation

Explicitly model uncertainty and dependency structure.

[Figure: the same four variables a, b, c, d drawn three ways: directed graph, undirected graph, factor graph]

Key concept: Markov blanket

Page 21:

Bayes Net: Notation

• Nodes: variables
  – Can be assigned (observed) or unassigned (unobserved)
• Arcs: interactions
  – Indicate "direct influence" between variables
  – Formally: encode conditional independence

[Figure: example nodes Weather, Cavity, Toothache, Catch]

Page 22:

Example: Coin Flips

• N independent coin flips
• No interactions between variables – absolute independence

X1   X2   …   Xn

Page 23:

Example: Traffic

• Variables:
  – Traffic
  – Rain
• Model 1: absolute independence
• Model 2: rain causes traffic
• Which makes more sense?

Rain → Traffic

Page 24:

Semantics

• A set of nodes, one per variable X
• A directed, acyclic graph
• A conditional distribution for each node
  – A collection of distributions over X, one for each combination of the parents' values
  – Conditional Probability Table (CPT)

[Figure: node X with parents A1, A2, …, An]

A Bayes net = Topology (graph) + Local Conditional Probabilities

Page 25:

Example: Alarm

• Variables:
  – Alarm
  – Burglary
  – Earthquake
  – Radio
  – Calls John

[Figure: Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]

Page 26:

Example: Alarm

[Figure: the same network annotated with its CPTs: P(E), P(B), P(R|E), P(A|E,B), P(C|A)]

P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
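To make the factorization concrete, here is a small sketch that multiplies the five CPTs and answers a query by enumeration; the probability values are illustrative, not taken from the slides:

    import itertools

    # Illustrative CPT values (not from the slides)
    P_E = {True: 0.002, False: 0.998}
    P_B = {True: 0.001, False: 0.999}
    P_R_given_E = {True: 0.9, False: 0.01}                       # P(R=1 | E)
    P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                    (False, True): 0.29, (False, False): 0.001}  # P(A=1 | B, E)
    P_C_given_A = {True: 0.9, False: 0.05}                       # P(C=1 | A)

    def bern(p, v):  # P(X=v) given P(X=True) = p
        return p if v else 1.0 - p

    def joint(e, b, r, a, c):
        # P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
        return (P_E[e] * P_B[b] * bern(P_R_given_E[e], r)
                * bern(P_A_given_BE[(b, e)], a) * bern(P_C_given_A[a], c))

    # Query by enumeration: P(Burglary | Call)
    tf = (True, False)
    num = sum(joint(e, True, r, a, True) for e, r, a in itertools.product(tf, tf, tf))
    den = sum(joint(e, b, r, a, True) for e, b, r, a in itertools.product(tf, repeat=4))
    print(num / den)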

Page 27:

Bayes Net Size

• How big is a joint distribution over N Boolean variables?  2^N
• How big is a CPT whose node has k parents?  2^(k+1)
• How big is a BN with n nodes if nodes have up to k parents?  ≤ n · 2^(k+1)
• BNs:
  – Compact representation
  – Use local properties to define CPTs
  – Answer queries more easily
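For instance, with made-up sizes of n = 20 Boolean variables and at most k = 3 parents per node:

    n, k = 20, 3
    print(2**n - 1)        # full joint table: 1,048,575 free parameters
    print(n * 2**(k + 1))  # Bayes net CPTs: at most 320 entries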

Page 28:

Independence in BN

• BNs provide a compact representation for joint distributions
  – Take advantage of conditional independence
• Given a BN, let's answer independence questions:
  – Are two nodes independent given certain evidence?
  – What can we say about X, Z? (Example: low pressure, rain, traffic)

X → Y → Z

Page 29:

Causal Chains

• Question: Is Z independent of X given Y?

X → Y → Z
• X: low pressure
• Y: rain
• Z: traffic

Page 30:

Common Cause

• Are X, Z independent?
• Are X, Z independent given Y?
• Observing Y blocks the influence between X and Z

X ← Y → Z
• Y: low pressure
• X: rain
• Z: cold

Page 31:

Common Effect

• Are X, Z independent?
• Are X, Z independent given Y?
• Observing Y activates influence between X and Z

X → Y ← Z
• X: rain
• Y: traffic
• Z: ball game
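This "explaining away" effect can be checked numerically. A sketch with made-up numbers, where traffic (Y) is made more likely by rain (X) or a ball game (Z):

    import itertools

    P_X, P_Z = 0.3, 0.2                   # made-up priors: rain, ball game
    P_Y = {(0, 0): 0.1, (0, 1): 0.5,
           (1, 0): 0.7, (1, 1): 0.9}      # made-up P(traffic=1 | rain, game)

    def joint(x, y, z):
        px = P_X if x else 1 - P_X
        pz = P_Z if z else 1 - P_Z
        py = P_Y[(x, z)] if y else 1 - P_Y[(x, z)]
        return px * py * pz

    def prob_rain(given):
        """P(X=1 | given), by enumeration over all (x, y, z)."""
        states = list(itertools.product([0, 1], repeat=3))
        den = sum(joint(x, y, z) for x, y, z in states if given(x, y, z))
        num = sum(joint(x, y, z) for x, y, z in states if x and given(x, y, z))
        return num / den

    print(prob_rain(lambda x, y, z: True))     # prior P(X=1) = 0.30
    print(prob_rain(lambda x, y, z: y))        # P(X=1 | Y=1) ~ 0.64: rises
    print(prob_rain(lambda x, y, z: y and z))  # P(X=1 | Y=1, Z=1) ~ 0.44:
                                               # the game "explains away" the traffic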

Page 32:

Independence in BNs

• Any complex BN structure can be analyzed using these three cases

[Figure: the Alarm network again: Earthquake, Radio, Burglary, Alarm, Call]

Page 33:

Directed acyclic graph (Bayes net)

[Figure: a → b, b → c, b → d]

P(a,b,c,d) = P(c|b) P(d|b) P(b|a) P(a)

• Can model causality
• Parameter learning
  – Decomposes: learn each term separately (ML)
• Inference
  – Simple exact inference if tree-shaped (belief propagation)

Page 34:

Directed acyclic graph (Bayes net)

[Figure: a → b, b → c, b → d, a → d]

P(a,b,c,d) = P(c|b) P(d|a,b) P(b|a) P(a)

• Can model causality
• Parameter learning
  – Decomposes: learn each term separately (ML)
• Inference
  – Simple exact inference if tree-shaped (belief propagation)
  – Loops require approximation
    • Loopy BP
    • Tree-reweighted BP
    • Sampling

Page 35:

Directed graph

• Example: places and scenes

[Figure: Place (office, kitchen, street, etc.) with an arrow to each object present: Car, Person, Toaster, Microwave, Fire Hydrant]

P(place, car, person, toaster, micro, hydrant) = P(place) P(car | place) P(person | place) … P(hydrant | place)
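A sketch of how such a model scores places from the objects detected; the priors and likelihoods below are hypothetical:

    # Hypothetical parameters: P(place) and P(object present | place)
    P_place = {"office": 0.4, "kitchen": 0.3, "street": 0.3}
    P_obj = {
        "office":  {"car": 0.05, "person": 0.8, "toaster": 0.02, "microwave": 0.10},
        "kitchen": {"car": 0.01, "person": 0.6, "toaster": 0.70, "microwave": 0.80},
        "street":  {"car": 0.90, "person": 0.7, "toaster": 0.01, "microwave": 0.01},
    }

    def place_posterior(observed):
        """observed: dict object -> bool. Returns P(place | objects)."""
        scores = {}
        for place, prior in P_place.items():
            s = prior                              # P(place) ...
            for obj, present in observed.items():  # ... times prod_o P(o | place)
                p = P_obj[place][obj]
                s *= p if present else 1 - p
            scores[place] = s
        z = sum(scores.values())                   # normalize
        return {k: v / z for k, v in scores.items()}

    print(place_posterior({"car": False, "person": True,
                           "toaster": True, "microwave": True}))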

Page 36:

Undirected graph (Markov Networks)

• Does not model causality
• Often pairwise
• Parameter learning difficult
• Inference usually approximate

[Figure: four-node pairwise network x1, x2, x3, x4]

P(x | data) = (1/Z) ∏_{i=1..4} φ(x_i; data) ∏_{(i,j) ∈ edges} ψ(x_i, x_j; data)

Page 37:

Markov Networks

• Example: "label smoothing" grid

Binary nodes, with pairwise potential:

            x_j = 0   x_j = 1
  x_i = 0      0         K
  x_i = 1      K         0

Page 38:

Image De-Noising

[Figure: original image and noisy image]

Page 39:

Image De-Noising

Page 40:

Image De-Noising

[Figure: noisy image and restored image (ICM)]
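A minimal numpy sketch of ICM (iterated conditional modes) for binary denoising under an Ising-style smoothness prior; the data weight beta and pairwise weight K are made-up constants:

    import numpy as np

    def icm_denoise(noisy, K=1.0, beta=2.0, sweeps=10):
        """noisy: 2-D array with values in {-1, +1}. Returns a denoised copy."""
        x = noisy.copy()
        H, W = x.shape
        for _ in range(sweeps):
            for i in range(H):
                for j in range(W):
                    # Sum of the 4-connected neighbors
                    nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                             if 0 <= a < H and 0 <= b < W)
                    # Greedily pick the label s minimizing the local energy
                    #   -beta * s * noisy[i, j]  (stay close to the observation)
                    #   -K    * s * nb           (agree with the neighbors)
                    x[i, j] = 1 if beta * noisy[i, j] + K * nb >= 0 else -1
        return x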

Page 41:

Factor graphs

• A general representation

[Figure: a Bayes net over a, b, c, d and its equivalent factor graph]

Page 42:

Factor graphs

• A general representation

[Figure: a Markov net over a, b, c, d and its equivalent factor graph]

Page 43:

Factor graphs

P(a,b,c,d) = f1(a,b,c) f2(d) f3(a,d)

Write it as a factor graph.

Page 44:

Inference in Graphical Models

• Joint
• Marginal
• Max

• Exact inference is HARD

Page 45:

Approximate Inference

Page 46:

Approximation

Page 47:

Sampling a Multinomial Distribution
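One standard way to do this is the inverse-CDF method; a minimal sketch:

    import numpy as np

    def sample_multinomial(p, rng=None):
        """Draw index i with probability p[i]; p is a sequence summing to 1."""
        rng = rng or np.random.default_rng()
        u = rng.random()                       # uniform draw in [0, 1)
        return int(np.searchsorted(np.cumsum(p), u, side="right"))

    print(sample_multinomial([0.2, 0.5, 0.3]))  # 0, 1, 2 w.p. 0.2, 0.5, 0.3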

Page 48:

Sampling from a BN

– Compute marginals
– Compute conditionals
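A sketch of ancestral (forward) sampling on the Rain → Traffic net, with rejection to estimate a conditional; the numbers are made up:

    import random

    P_rain = 0.3                          # made-up prior
    P_traffic = {True: 0.8, False: 0.2}   # made-up P(traffic | rain)

    def sample():
        r = random.random() < P_rain         # sample parents first,
        t = random.random() < P_traffic[r]   # then children given parents
        return r, t

    samples = [sample() for _ in range(100_000)]

    # Marginal: P(traffic) ~ fraction of samples with traffic
    print(sum(t for _, t in samples) / len(samples))

    # Conditional by rejection: P(rain | traffic) -- keep samples with traffic
    kept = [r for r, t in samples if t]
    print(sum(kept) / len(kept))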

Page 49:

Belief Propagation

• Very general
• Approximate, except for tree-shaped graphs
  – Generalized variants of BP can converge better on graphs with many loops or strong potentials
• Standard packages available (BNT toolbox)
• To learn more:
  – Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Understanding Belief Propagation and Its Generalizations", Technical Report, 2001: http://www.merl.com/publications/TR2001-022/

Page 50:

Belief Propagation

"Beliefs" and "messages":

b_i(x_i) ∝ ∏_{a ∈ N(i)} m_{a→i}(x_i)

b_a(X_a) ∝ f_a(X_a) ∏_{i ∈ N(a)} ∏_{b ∈ N(i)\a} m_{b→i}(x_i)

The "belief" is the BP approximation of the marginal probability.

Page 51:

BP Message-update Rules

Using b_i(x_i) ∝ Σ_{X_a\x_i} b_a(X_a), we get

m_{a→i}(x_i) = Σ_{X_a\x_i} f_a(X_a) ∏_{j ∈ N(a)\i} ∏_{b ∈ N(j)\a} m_{b→j}(x_j)

[Figure: the message from factor a to variable i]
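On a chain these updates reduce to one forward and one backward pass and give exact marginals. A minimal numpy sketch with made-up potentials phi (unary) and a shared psi (pairwise):

    import numpy as np

    def chain_bp_marginals(phi, psi):
        """Exact marginals on a chain x1 - x2 - ... - xn.

        phi: (n, d) unary potentials phi_i(x_i)
        psi: (d, d) pairwise potential psi(x_i, x_{i+1})
        """
        n, d = phi.shape
        fwd = np.ones((n, d))   # fwd[i]: message arriving at node i from the left
        bwd = np.ones((n, d))   # bwd[i]: message arriving at node i from the right
        for i in range(1, n):
            m = (fwd[i - 1] * phi[i - 1]) @ psi    # sum-product step
            fwd[i] = m / m.sum()                   # normalize for stability
        for i in range(n - 2, -1, -1):
            m = psi @ (bwd[i + 1] * phi[i + 1])
            bwd[i] = m / m.sum()
        b = fwd * phi * bwd                        # belief: product of incoming messages
        return b / b.sum(axis=1, keepdims=True)

    phi = np.array([[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]])  # made-up unaries
    psi = np.array([[2.0, 1.0], [1.0, 2.0]])              # favors equal neighbors
    print(chain_bp_marginals(phi, psi))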

Page 52:

Inference: Graph Cuts

• Associative: edge potentials penalize different labels
• Associative binary networks can be solved optimally (and quickly) using graph cuts
• Multilabel associative networks can be handled by alpha-expansion or alpha-beta swaps
• To learn more:
  – http://www.cs.cornell.edu/~rdz/graphcuts.html
  – Classic paper: What Energy Functions Can Be Minimized via Graph Cuts? (Kolmogorov and Zabih, ECCV '02 / PAMI '04)

Page 53:

Graph Cuts: Binary MRF

Graph cuts optimize a cost function with:
• Unary terms (compatibility of the data with label y)
• Pairwise terms (compatibility of neighboring labels)

Summary of approach:
• Associate each possible solution with a minimum cut on a graph
• Set capacities on the graph so that the cost of a cut matches the cost function
• Use augmenting paths to find the minimum cut
• This minimizes the cost function and finds the MAP solution
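A sketch of this construction for a binary MRF with nonnegative pairwise costs, solved here with networkx's min-cut (the energies are made up; real systems use dedicated maxflow code such as Boykov-Kolmogorov):

    import networkx as nx

    # Made-up binary MRF: unary[i] = (cost of y_i = 0, cost of y_i = 1);
    # each edge (i, j, w) pays w whenever y_i != y_j (associative, w >= 0).
    unary = {0: (4.0, 1.0), 1: (3.0, 2.0), 2: (1.0, 5.0)}
    edges = [(0, 1, 2.0), (1, 2, 2.0)]

    G = nx.DiGraph()
    for i, (c0, c1) in unary.items():
        G.add_edge("s", i, capacity=c1)  # cut iff i ends on the sink side (y_i = 1)
        G.add_edge(i, "t", capacity=c0)  # cut iff i stays on the source side (y_i = 0)
    for i, j, w in edges:
        G.add_edge(i, j, capacity=w)     # cut iff the two labels disagree
        G.add_edge(j, i, capacity=w)

    cut_value, (source_side, _) = nx.minimum_cut(G, "s", "t")
    labels = {i: 0 if i in source_side else 1 for i in unary}
    print(cut_value, labels)             # minimum energy and the MAP labeling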

Page 54:

Denoising Results

[Figure: original image, then restorations with the pairwise cost increasing]