
Learning Tree Conditional Random Fields
Joseph K. Bradley, Carlos Guestrin

We want to model conditional correlations.

Reading people’s minds

X: fMRI voxels

Y: semantic features
• Metal?
• Manmade?
• Found in house?
• ...

predict

Predict independently? Yi ~ X, for all i

Correlated!

E.g.,
• Person? & Live in water?
• Colorful? & Yellow?

Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

(Application from Palatucci et al., 2009)

Conditional Random Fields (CRFs)

(Lafferty et al., 2001)

Q(Y | X) = (1/Z(X)) ∏_j Φ_j(YCj, XCj)

Pro: Avoid modeling P(X)

In fMRI, X ≈ 500 to 10,000 voxels

[Figure: example CRF over Y1, ..., Y4 with pairwise factors Φ1,2(Y1, Y2, X), Φ2,3, Φ2,4]

The factors Φ_j encode the conditional independence structure of Y given X.


With X = x observed, the normalization Z(x) depends on x.

Con: Compute Z(x) for each inference.

Exact inference is intractable in general; approximate inference is expensive.

Use tree CRFs!

Tree CRFs:

QT(Y | X) = (1/Z(X)) ∏_{(i,j)∈T} Φij(Yi, Yj, X)

Pro: Fast, exact inference.
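To make the "fast, exact inference" claim concrete, here is a minimal sketch (not from the talk) of computing the partition function Z(x) exactly on a chain, the simplest tree, with binary Yi and tabular log-potentials; the same message-passing idea extends to any tree.

```python
import numpy as np

def chain_log_partition(log_phi):
    """Exact log Z(x) for a chain CRF with binary nodes Y1, ..., Yn.

    log_phi[t][a, b] = log Phi_t(Y_t = a, Y_{t+1} = b, x); the dependence on the
    observed x is already baked into each 2x2 table.  One forward pass costs
    O(n), versus the exponential cost of summing over all joint assignments.
    """
    alpha = np.zeros(2)  # alpha[a] = log-sum of factor products over prefixes ending in Y_t = a
    for table in log_phi:
        # new_alpha[b] = logsumexp over a of ( alpha[a] + log_phi[t][a, b] )
        alpha = np.logaddexp(alpha[0] + table[0], alpha[1] + table[1])
    return float(np.logaddexp(alpha[0], alpha[1]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    log_phi = [rng.normal(size=(2, 2)) for _ in range(5)]  # chain over 6 binary Y's

    # brute-force check over all 2^6 joint assignments
    ys = np.array(np.meshgrid(*[[0, 1]] * 6, indexing="ij")).reshape(6, -1).T
    brute = np.logaddexp.reduce(
        [sum(log_phi[t][y[t], y[t + 1]] for t in range(5)) for y in ys]
    )
    print(chain_log_partition(log_phi), brute)  # the two values agree
```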

CRF Structure Learning

Tree CRFs: fast, exact inference; avoid modeling P(X).

QT(Y | X) = (1/Z(X)) ∏_{(i,j)∈T} Φij(Yi, Yj, X)

Two problems: structure learning (choosing the tree edges over Y) and feature selection (choosing the inputs for each factor).

Use local inputs Φij(Yi, Yj, Xij) (scalable) instead of global inputs Φij(Yi, Yj, X) (not scalable):

QT(Y | X) = (1/Z(X)) ∏_{(i,j)∈T} Φij(Yi, Yj, Xij)

This work

Goals:

• Structured conditional models P(Y|X)
• Scalable methods

Approach:
• Tree structures
• Local inputs Xij
• Max spanning trees

Outline
• Gold standard
• Max spanning trees
• Generalized edge weights
• Heuristic weights
• Experiments: synthetic & fMRI

Related work

Method | Feature selection? | Tractable models?
Torralba et al. (2004): Boosted Random Fields | Yes | No
Schmidt et al. (2008): Block-L1 regularized pseudolikelihood | No | No
Shahaf et al. (2009): Edge weight + low-treewidth model | No | Yes

Vs. our work: choice of edge weights; local inputs.

Chow-Liu

For generative models:

QT(Y) = (1/Z) ∏_{(i,j)∈T} Φij(Yi, Yj)

Qdisc(Y) = (1/Z) ∏_i Φi(Yi)

E[log QT(Y) − log Qdisc(Y)] = ∑_{(i,j)∈T} I(Yi; Yj)

Chow-Liu weights each candidate edge (Yi, Yj) by the mutual information I(Yi; Yj) and takes the max spanning tree.
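As a concrete (if standard) illustration of the Chow-Liu edge weight, the sketch below estimates I(Yi; Yj) from paired binary samples with a plug-in estimator; variable names and the 0/1 encoding are assumptions, not from the talk.

```python
import numpy as np

def mutual_information(yi, yj, eps=1e-12):
    """Plug-in estimate of I(Yi; Yj) from paired binary (0/1) samples:
    I(Yi; Yj) = sum over a, b of p(a, b) log[ p(a, b) / (p(a) p(b)) ]."""
    yi, yj = np.asarray(yi), np.asarray(yj)
    joint = np.zeros((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            joint[a, b] = np.mean((yi == a) & (yj == b))
    pi, pj = joint.sum(axis=1), joint.sum(axis=0)   # marginals of Yi, Yj
    ratio = joint / (np.outer(pi, pj) + eps)
    return float(np.sum(joint * np.log(ratio + eps)))

# Chow-Liu: weight every candidate edge (i, j) by mutual_information(Y[:, i], Y[:, j])
# and take the maximum-weight spanning tree over those weights.
```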

Chow-Liu for CRFs?

For CRFs with global inputs:

QT(Y | X) = (1/Z(X)) ∏_{(i,j)∈T} Φij(Yi, Yj, X)

Qdisc(Y | X) = (1/Z(X)) ∏_i Φi(Yi, X)

E[log QT(Y | X) − log Qdisc(Y | X)] = ∑_{(i,j)∈T} I(Yi; Yj | X)

Global CMI (Conditional Mutual Information):
Pro: "gold standard"
Con: I(Yi; Yj | X) intractable for big X

Where now?

Global CMI is the "gold standard," but I(Yi; Yj | X) is intractable for big X.

Algorithmic framework

Given: data {(y(i), x(i))}; input mapping Yi → Xi.

Weight each potential edge (Yi, Yj) with Score(i, j).
Choose the max spanning tree.

Local inputs!
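The framework above is easy to sketch end to end; the code below (illustrative names, not the authors' implementation) scores every candidate edge with a pluggable Score function that sees only the local inputs Xij, then extracts the max spanning tree with Prim's algorithm.

```python
import numpy as np

def max_spanning_tree(score):
    """Prim's algorithm on a dense symmetric score matrix; returns the edge list
    of the maximum-weight spanning tree over nodes 0, ..., p-1."""
    p = score.shape[0]
    visited = np.zeros(p, dtype=bool)
    visited[0] = True
    best = score[0].copy()            # best score connecting each node to the tree so far
    parent = np.zeros(p, dtype=int)   # which tree node achieves that best score
    edges = []
    for _ in range(p - 1):
        j = int(np.argmax(np.where(visited, -np.inf, best)))
        edges.append((int(parent[j]), j))
        visited[j] = True
        improved = score[j] > best
        parent[improved] = j
        best = np.maximum(best, score[j])
    return edges

def learn_tree_structure(Y, X, input_map, score_fn):
    """Weight each candidate edge (Yi, Yj) with score_fn evaluated on the *local*
    inputs Xij = X[:, input_map[i] + input_map[j]], then take the max spanning tree.
    score_fn is pluggable: local CMI, piecewise likelihood, DCI, ..."""
    p = Y.shape[1]
    S = np.full((p, p), -np.inf)
    for i in range(p):
        for j in range(i + 1, p):
            cols = input_map[i] + input_map[j]           # column indices of Xij
            S[i, j] = S[j, i] = score_fn(Y[:, i], Y[:, j], X[:, cols])
    return max_spanning_tree(S)
```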

Generalized edge scores

Key step: Weight edge (Yi,Yj) with Score(i,j).

Local Linear Entropy Scores: Score(i, j) = a linear combination of entropies over Yi, Yj, Xi, Xj.

E.g., Local Conditional Mutual Information:
I(Yi; Yj | Xij) = H(Yi | Xij) + H(Yj | Xij) − H(Yi, Yj | Xij)

Theorem: Assume the true P(Y|X) is a tree CRF (with non-trivial parameters). No Local Linear Entropy Score can recover all such tree CRFs (even with exact entropies).


Outline
• Gold standard
• Max spanning trees
• Generalized edge weights
• Heuristic weights
• Experiments: synthetic & fMRI

Heuristics

• Piecewise likelihood (PWL)
• Local CMI
• DCI

Piecewise likelihood (PWL)

log QT(Y | X) ≥ ∑_{(i,j)∈T} log P(Yij | Xij)

Sutton and McCallum (2005,2007): PWL for parameter learning

Main idea: Bound Z(X)

For tree CRFs, optimal parameters give:

Score(i, j) = E[log P(Yij | Xij)]

Edge score w/ local inputs Xij

• Bounds log likelihood
• Fails on a simple counterexample
• Does badly in practice
• Helps explain other edge scores
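A minimal sketch of how one might compute the PWL edge score, assuming binary labels and using a multinomial logistic regression as the stand-in estimate of P(Yij | Xij); the talk's 10-fold CV for choosing regularization is omitted, and all names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pwl_score(yi, yj, Xij):
    """Piecewise-likelihood edge score: Score(i, j) = E[log P(Yij | Xij)],
    with P(Yij | Xij) estimated by a small conditional model for the pair.

    yi, yj: binary (0/1) label columns; Xij: local input features for this edge."""
    labels = 2 * np.asarray(yi) + np.asarray(yj)   # encode the 4 joint outcomes of (Yi, Yj)
    if len(np.unique(labels)) == 1:
        return 0.0                                 # degenerate pair: log P = 0
    clf = LogisticRegression(max_iter=1000).fit(Xij, labels)
    probs = clf.predict_proba(Xij)
    idx = np.searchsorted(clf.classes_, labels)    # column of each sample's true class
    return float(np.mean(np.log(probs[np.arange(len(labels)), idx] + 1e-12)))
```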

Piecewise likelihood (PWL)

Score(i, j) = E[log P(Yij | Xij)] = −H(Yij | Xij)

[Figure: true P(Y, X) is a chain Y1 – Y2 – ... – Yn with inputs X1, ..., Xn]

With a strong potential on Y2: H(Y2j | X2j) ≈ H(Yj | X2j) ≤ H(Yjk | Xjk) for all j, k, so PWL chooses edges (2, j) over edges (j, k).

[Figure: the learned tree is a star centered at Y2]

Local Conditional Mutual Info

Score(i, j) = −H(Yij | Xij) + H(Yi | Xij) + H(Yj | Xij) = I(Yi; Yj | Xij)

• Decomposable score w/ local inputs Xij
• Does pretty well in practice
• Can fail with strong potentials

Theorem: Local CMI bounds the log likelihood gain:

E[log QT(Y | X) − log Qdisc(Y | X)] ≤ ∑_{(i,j)∈T} I(Yi; Yj | Xij)
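For small discrete state spaces, the Local CMI score can be computed directly from empirical counts; the sketch below assumes binary Yi, Yj and a low-dimensional discrete Xij so that plug-in conditional entropies are reliable (names are illustrative).

```python
import numpy as np
from collections import Counter

def cond_entropy(target, cond):
    """Empirical H(target | cond) in nats; each row of target/cond is a tuple of
    discrete values (small state spaces assumed)."""
    n = len(cond)
    joint = Counter(zip(map(tuple, target), map(tuple, cond)))
    cond_counts = Counter(map(tuple, cond))
    return -sum((cnt / n) * np.log(cnt / cond_counts[c]) for (_, c), cnt in joint.items())

def local_cmi_score(yi, yj, Xij):
    """Score(i, j) = I(Yi; Yj | Xij)
                   = H(Yi | Xij) + H(Yj | Xij) - H(Yi, Yj | Xij)."""
    yi = np.asarray(yi).reshape(-1, 1)
    yj = np.asarray(yj).reshape(-1, 1)
    yij = np.hstack([yi, yj])
    Xij = np.asarray(Xij).reshape(len(yi), -1)   # ensure a 2-D array of local inputs
    return cond_entropy(yi, Xij) + cond_entropy(yj, Xij) - cond_entropy(yij, Xij)
```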

Local Conditional Mutual Info

Score(i, j) = I(Yi; Yj | Xij)

[Figure: true P(Y, X) is a chain Y1 – Y2 – ... – Yn with inputs X1, ..., Xn]

With a strong potential on Y2: I(Y2; Yj | X2j) ≈ 0 for all j, so Local CMI avoids edges touching Y2 and, e.g., links Y1 and Y3 directly via Φ1,3(Y1, Y3, X1,3).

Decomposable Conditional Influence (DCI)

Score(i, j) = −H(Yij | Xij) + H(Yi | Xi) + H(Yj | Xj)

(Compare Local CMI: −H(Yij | Xij) + H(Yi | Xij) + H(Yj | Xij). The −H(Yij | Xij) term comes from PWL; the H(Yi | Xi) + H(Yj | Xj) terms come from Qdisc(Y | X).)

• Exact measure of gain for some edges
• Edge score w/ local inputs Xij
• Succeeds on the counterexample
• Does best in practice

Experiments

Algorithmic details

Given: data {(y(i), x(i))}; input mapping Yi → Xi.

Compute edge scores:
• DCI(i, j) = −H(Yij | Xij) + H(Yi | Xi) + H(Yj | Xj)
• Regress P(Yij | Xij) (10-fold CV to choose regularization)

Choose max spanning tree.

Parameter learning:
• Conjugate gradient on L2-regularized log likelihood
• 10-fold CV to choose regularization
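For completeness, a sketch of the DCI score itself, using the same plug-in conditional-entropy estimator as the Local CMI sketch above (discrete variables assumed; names illustrative). The only difference from Local CMI is that the two singleton entropies condition on Xi and Xj separately rather than on the pooled Xij.

```python
import numpy as np
from collections import Counter

def cond_entropy(target, cond):
    """Empirical H(target | cond) in nats; rows are tuples of discrete values."""
    n = len(cond)
    joint = Counter(zip(map(tuple, target), map(tuple, cond)))
    cond_counts = Counter(map(tuple, cond))
    return -sum((cnt / n) * np.log(cnt / cond_counts[c]) for (_, c), cnt in joint.items())

def dci_score(yi, yj, Xi, Xj):
    """DCI(i, j) = -H(Yij | Xij) + H(Yi | Xi) + H(Yj | Xj),
    where Xij is the concatenation of the local inputs Xi and Xj."""
    yi = np.asarray(yi).reshape(-1, 1)
    yj = np.asarray(yj).reshape(-1, 1)
    Xi = np.asarray(Xi).reshape(len(yi), -1)
    Xj = np.asarray(Xj).reshape(len(yj), -1)
    yij, Xij = np.hstack([yi, yj]), np.hstack([Xi, Xj])
    return -cond_entropy(yij, Xij) + cond_entropy(yi, Xi) + cond_entropy(yj, Xj)
```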

Synthetic experiments

• Binary Y, X; tabular edge factors
• Natural input mapping: Yi → Xi

[Figure: chain P(Y|X) over Y1, ..., Yn with inputs X1, ..., Xn; the P(X) shown makes P(Y, X) intractable]

Synthetic experiments

• P(Y|X), P(X): chains & trees
• P(Y, X): tractable & intractable

[Figure: example tree P(Y|X) over Y1, ..., Y5, with tractable and intractable structures for P(X) over X1, ..., X5]

Synthetic experiments

• P(Y|X): chains & trees
• P(Y, X): tractable & intractable
• Φ(Yij, Xij): with & without cross-factors
• Associative (all positive & alternating +/-) & random factors

[Figure: chain over Y1, ..., Yn with inputs X1, ..., Xn, with cross-factors]

Synthetic: vary # train exs.

Setup: tree; intractable P(Y, X); associative Φ (alternating +/-); |Y| = 40; 1000 test examples.

[Figure: results plots]

Synthetic: vary model size

Fixed 50 train exs., 1000 test exs.

fMRI experiments

X (500 fMRI voxels)

Y (218 semantic features)
• Metal?
• Manmade?
• Found in house?
• ...

predict

Data, setup from Palatucci et al. (2009)

Decode (hand-built map)

Object (60 total)
• Bear
• Screwdriver
• ...

Zero-shot learning: Can predict objects not in training data (given decoding).

Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

fMRI experiments

X (500 fMRI voxels) → predict → Y (218 semantic features)

Y, X real-valued → Gaussian factors:

Φ(y, x) = exp( −(1/2) ‖Ay − (Cx + b)‖² )

• Input mapping: regressed Yi ~ Y-i, X; chose top K inputs
• Added fixed Φ(Yi, X) ∝ P(Yi | X)
• Regularized A and C, b separately
• CV for parameter learning very expensive; do CV on subject 0 only
• 2 methods: CRF1: K = 10 & Φ(Yij, Xij); CRF2: K = 20 & Φ(Yij)

Accuracy: (for zero-shot learning)

Hold out objects i,j.

Predict Y(i)’, Y(j)’

If ||Y(i) - Y(i)’||2 < ||Y(j) - Y(i)’||2 then we got i right.
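The leave-two-out accuracy check described above is straightforward to spell out; a minimal sketch (illustrative names, binary choice between the two held-out objects):

```python
import numpy as np

def zero_shot_correct(y_true_i, y_true_j, y_pred_i):
    """Object i counts as correct if the prediction Y(i)' is closer (in L2 norm) to
    the true semantic vector of object i than to that of the other held-out object j."""
    return (np.linalg.norm(np.asarray(y_true_i) - np.asarray(y_pred_i))
            < np.linalg.norm(np.asarray(y_true_j) - np.asarray(y_pred_i)))

# Averaging zero_shot_correct over all held-out object pairs (i, j), with each
# object in both roles, gives the zero-shot accuracy reported below.
```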

fMRI experiments: results

[Figure: results plots]

• Accuracy: CRFs a bit worse
• Log likelihood: CRFs better
• Squared error: CRFs better

Conclusion

• Scalable learning of CRF structure
• Analyzed edge scores for spanning-tree methods
• Local Linear Entropy Scores are imperfect
• The heuristics have pleasing theoretical properties and empirical success; we recommend DCI

Future work

Templated CRFs

Learning edge score

Assumptions on model/factors which give learnability

Thank you!

References

M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J. D. Lafferty, A. McCallum, F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.

(extra slides)

B: Score Decay Assumption

B: Example complexity

Future work: Templated CRFs

Learn a template, e.g.:
• Score(i, j) = DCI(i, j)
• Parametrization: Φ(Yij, Xij) from P(Yij | Xij)

WebKB (Craven et al., 1998): given webpages {(Yi = page type, Xi = content)}.

Use the template to:
• Choose a tree over pages
• Instantiate parameters

P(Y | X = x) = P(pages' types | pages' content)

• Requires local inputs
• Potentially very fast

Future work: Learn score

Given training queries:
• Data
• Ground-truth model (e.g., from an expensive structure-learning method)

Learn function Score(Yi,Yj) for MST algorithm.