
Page 1: Learning CRFs with Hierarchical Features: An Application to Go

Learning CRFs with Hierarchical Features: An Application to Go

Scott Sanner (University of Toronto)
Thore Graepel, Ralf Herbrich, Tom Minka (Microsoft Research)

(with thanks to David Stern and Mykel Kochenderfer)

Page 2: Learning CRFs with Hierarchical Features: An Application to Go

The Game of Go

• Started about 4000 years ago in ancient China
• About 60 million players worldwide
• 2 players: Black and White
• Board: 19×19 grid
• Rules:
  – Turn: one stone placed on a vertex
  – Capture: a stone is captured by surrounding it (see the sketch below)
• Aim: gather territory by surrounding it

[Board diagram: White territory, Black territory]
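Concretely, capture can be implemented as a flood fill that collects a chain of stones together with its liberties. A minimal sketch (my own illustration, not the talk's code):

```python
# Minimal capture check for Go. A chain is captured when it has no
# liberties, i.e. no empty vertices adjacent to any stone in the chain.

def neighbors(i, j, n=19):
    """Yield the on-board neighbors of vertex (i, j)."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield i + di, j + dj

def chain_and_liberties(board, i, j):
    """Flood-fill the chain containing (i, j); return (chain, liberties).

    `board` is a 19x19 list of lists with entries 'B', 'W', or None.
    """
    color = board[i][j]
    chain, libs, stack = {(i, j)}, set(), [(i, j)]
    while stack:
        for p in neighbors(*stack.pop()):
            if board[p[0]][p[1]] is None:
                libs.add(p)
            elif board[p[0]][p[1]] == color and p not in chain:
                chain.add(p)
                stack.append(p)
    return chain, libs
```

After a move, any adjacent enemy chain whose liberty set becomes empty is removed from the board.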

Page 3: Learning CRFs with Hierarchical Features: An Application to Go

Territory Prediction

• Goal: predict the territory distribution $P(\vec{s} \mid \vec{c})$ given the board $\vec{c}$...
• How to predict territory?
  – Could use simulated play:
    • Monte Carlo averaging is an excellent estimator (sketched below)
    • But costly: avg. ~180 legal moves per turn, >100 moves per game
  – We learn to directly predict the territory distribution $P(\vec{s} \mid \vec{c})$:
    • Learn from expert data
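For reference, the Monte Carlo baseline averages final ownership over random games. A minimal sketch, assuming a `random_playout` helper that plays the position to completion:

```python
import numpy as np

def monte_carlo_territory(board, random_playout, n_rollouts=100):
    """Estimate E[s_i] for every vertex by averaging random playouts.

    `random_playout(board)` is an assumed helper that plays the position
    to the end and returns a 19x19 array of final ownership values in
    {-1 (White), +1 (Black)}.
    """
    total = np.zeros((19, 19))
    for _ in range(n_rollouts):
        total += random_playout(board)
    return total / n_rollouts  # per-vertex expected territory in [-1, 1]
```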

Page 4: Learning CRFs with Hierarchical Features: An Application to Go

Talk Outline

• Hierarchical pattern features
• Independent pattern-based classifiers
  – Best way to combine features?
• CRF models
  – Coupling factor model (with patterns)
  – Best training / inference approximation to circumvent intractability?
• Evaluation and conclusions

Page 5: Learning CRFs with Hierarchical Features: An Application to Go

Hierarchical Patterns

• Centered on a single position
• Exact configuration of stones
• Fixed match region (template)
• 8 nested templates
• 3.8 million patterns mined
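A hedged sketch of how such a lookup might work: hash the exact stone configuration inside each nested template and look it up in the mined pattern table. The region shapes and table format below are placeholders, not the talk's actual definitions:

```python
# Illustrative nested-template lookup. The region shapes below are
# placeholders; the talk's actual 8 templates are fixed nested regions.
TEMPLATES = [
    [(0, 0)],                                    # template 1: the vertex
    [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)],  # template 2: + shape
    # ... up to template 8, each a superset of the previous region
]

def matching_patterns(board, i, j, pattern_table):
    """Return mined-pattern ids matching at (i, j), smallest template first.

    `board` maps (row, col) -> 'B' / 'W' / None; off-board reads 'edge'.
    `pattern_table` maps (template, configuration) keys to pattern ids.
    """
    matches = []
    for t, offsets in enumerate(TEMPLATES):
        key = (t, tuple(board.get((i + di, j + dj), 'edge')
                        for di, dj in offsets))
        if key in pattern_table:
            matches.append(pattern_table[key])
    return matches
```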

Page 6: Learning CRFs with Hierarchical Features: An Application to Go

Models

Vertex variables: $s_i$, $c_i$

(a) Independent pattern-based classifiers  (b) CRF  (c) Pattern CRF

Unary pattern-based factors:

$$\psi_i(s_i = +1, \vec{c}) = \exp\left( \sum_{\vec{\pi} \in \vec{\Pi}} \lambda_{\vec{\pi}} \cdot I_{\vec{\pi}}(\vec{c}, i) \right)$$

Coupling factors:

$$\psi_{(i,j)}(s_i, s_j, c_i, c_j) = \exp\left( \sum_{k=1}^{36} \lambda_k \cdot I_k(s_i, s_j, c_i, c_j) \right)$$
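A minimal sketch of evaluating these two factor types, reusing `matching_patterns` from above; the weight tables and the `config_index` helper are hypothetical:

```python
import math

def unary_factor(board, i, j, lam_unary, pattern_table):
    """psi_i(s_i = +1, c): exponentiated sum of matched-pattern weights."""
    score = sum(lam_unary[p]
                for p in matching_patterns(board, i, j, pattern_table))
    return math.exp(score)

def coupling_factor(s_i, s_j, c_i, c_j, lam_couple, config_index):
    """psi_(i,j): one weight per joint (s_i, s_j, c_i, c_j) configuration.

    With s in {-1, +1} and c in {black, white, empty} there are
    2 * 2 * 3 * 3 = 36 joint configurations, matching the 36 indicators.
    `config_index` (assumed) maps a configuration to an index in 0..35.
    """
    return math.exp(lam_couple[config_index(s_i, s_j, c_i, c_j)])
```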

Page 7: Learning CRFs with Hierarchical Features: An Application to Go

Independent Pattern-based Classifiers

Page 8: Learning CRFs with Hierarchical Features: An Application to Go

Inference and Training

• Up to 8 pattern sizes may match at any vertex
• Which pattern to use?
  – Smallest pattern: $\psi_i(s_i = +1, \vec{c}) = P(s_i \mid \vec{\pi}_{\min}(\vec{c}, i))$
  – Largest pattern: $\psi_i(s_i = +1, \vec{c}) = P(s_i \mid \vec{\pi}_{\max}(\vec{c}, i))$
• Or, combine all patterns:
  – Logistic regression (sketched below): $\psi_i(s_i = +1, \vec{c}) = \exp\left( \sum_{\vec{\pi} \in \vec{\Pi}} \lambda_{\vec{\pi}} \cdot I_{\vec{\pi}}(\vec{c}, i) \right)$
  – Bayesian model averaging…
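A sketch of the logistic-regression combination, assuming the $s_i = -1$ factor is normalized to 1 so the prediction reduces to a sigmoid of the summed weights of matched patterns:

```python
import math

def predict_territory(board, i, j, lam, pattern_table):
    """P(s_i = +1 | c) under the logistic-regression combination.

    Each pattern matching at vertex i acts as a binary feature; with the
    s_i = -1 factor normalized to 1 (an assumption of this sketch), the
    prediction is the sigmoid of the summed matched-pattern weights.
    """
    z = sum(lam[p] for p in matching_patterns(board, i, j, pattern_table))
    return 1.0 / (1.0 + math.exp(-z))
```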

Page 9: Learning CRFs with Hierarchical Features: An Application to Go

Bayesian Model Averaging

• Bayesian approach to combining models:

$$P(s_j \mid \vec{c}, D) = \sum_{\tau \in \mathcal{T}} P(s_j \mid \tau, \vec{c}, D) \, P(\tau \mid \vec{c}, D)$$

• Now examine the model "weight":

$$P(\tau \mid \vec{c}, D) = \frac{P(D \mid \tau, \vec{c}) \, P(\tau \mid \vec{c})}{\sum_{\tau' \in \mathcal{T}} P(D \mid \tau', \vec{c}) \, P(\tau' \mid \vec{c})}$$

• Model must apply to all data!
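A hedged sketch of the averaging step, assuming each model exposes a marginal likelihood on the training data and a per-vertex predictive probability (the method names are mine), with a uniform prior over models:

```python
def bma_predict(models, board, i, j, data):
    """P(s_j = +1 | c, D) = sum_tau P(s_j | tau, c, D) P(tau | c, D).

    `models` is a list of objects with (assumed) methods:
      .evidence(data)        -> P(D | tau)       (marginal likelihood)
      .predict(board, i, j)  -> P(s = +1 | tau, c)
    A uniform prior P(tau) is assumed, so it cancels in the weights.
    """
    evidences = [m.evidence(data) for m in models]
    z = sum(evidences)
    return sum(w / z * m.predict(board, i, j)
               for w, m in zip(evidences, models))
```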

Page 10: Learning CRFs with Hierarchical Features: An Application to Go

Hierarchical Tree Models

• Arrange patterns into decision trees $\tau_i$:
• Model $\tau_i$ provides predictions on all data
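One plausible reading, sketched under my own assumptions about the tree structure (children keyed by pattern id, a prediction stored at each node): the tree refines from smaller to larger nested patterns, so it yields a prediction at every vertex.

```python
def tree_predict(board, i, j, tree, pattern_table):
    """Walk matched patterns from smallest to largest template; return the
    prediction stored at the deepest tree node whose pattern matches.

    `tree` (assumed) has a `root` whose `children` are keyed by pattern id
    and whose `prediction` is P(s = +1) estimated from training counts.
    """
    node = tree.root
    for p in matching_patterns(board, i, j, pattern_table):
        if p in node.children:
            node = node.children[p]
        else:
            break
    return node.prediction
```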

Page 11: Learning CRFs with Hierarchical Features: An Application to Go

CRF & Pattern CRF

Page 12: Learning CRFs with Hierarchical Features: An Application to Go

Inference and Training

• Inference
  – Exact inference is slow for 19×19 grids
  – Loopy BP is faster, but biased
  – Sampling is unbiased, but slower than Loopy BP
• Training
  – Maximum likelihood requires inference!

$$\frac{\partial l}{\partial \lambda_j} = \sum_{d \in D} \left( I_j(\vec{s}^{(d)}, \vec{c}^{(d)}) - \sum_{\vec{s}} I_j(\vec{s}, \vec{c}^{(d)}) \, P(\vec{s} \mid \vec{c}^{(d)}) \right)$$

  – Other approximate methods…
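A minimal sketch of this gradient, empirical feature counts minus expected counts under the model, with the expectation supplied by an approximate inference routine such as loopy BP (the helper signatures are assumed):

```python
def loglik_gradient(data, features, marginals):
    """d l / d lambda_j = sum_d [ I_j(s^(d), c^(d)) - E_{P(s|c^(d))}[I_j] ].

    `features[j](s, c)` (assumed) evaluates the indicator I_j;
    `marginals(c)` (assumed) returns a list of approximate expectations
    E[I_j | c], one per feature, e.g. computed from loopy BP marginals.
    """
    grad = [0.0] * len(features)
    for s_d, c_d in data:
        expected = marginals(c_d)  # list of E[I_j | c^(d)]
        for j, I_j in enumerate(features):
            grad[j] += I_j(s_d, c_d) - expected[j]
    return grad
```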

Page 13: Learning CRFs with Hierarchical Features: An Application to Go

Pseudolikelihood

• Standard log-likelihood:

$$l(\vec{\lambda}) = \sum_{d \in D} \log P(\vec{s}^{(d)} \mid \vec{c}^{(d)})$$

• Edge-based pseudo log-likelihood:

$$pl(\vec{\lambda}) = \sum_{d \in D} \sum_{f \in F} \log P\left( \vec{s}_f^{(d)} \,\middle|\, \vec{c}_f^{(d)}, \mathrm{MB}_F(f)^{(d)} \right)$$

• Then inference during training is purely local
• Long-range effects are captured in the data
• Note: only valid for training, in the presence of fully labeled data
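A sketch of one edge-based term for an edge factor $f = (i, j)$: condition on the observed labels in the Markov blanket and normalize over the four joint settings of $(s_i, s_j)$. The `edge_logpot` helper is hypothetical:

```python
import math
from itertools import product

def edge_pseudolikelihood(s, i, j, edge_logpot):
    """log P(s_i, s_j | observed labels of the rest) for one edge factor.

    `edge_logpot(si, sj)` (assumed) sums the log-potentials of every
    factor touching vertices i or j, with all other labels clamped to
    their observed values from s.
    """
    num = edge_logpot(s[i], s[j])
    z = math.log(sum(math.exp(edge_logpot(si, sj))
                     for si, sj in product((-1, +1), repeat=2)))
    return num - z
```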

Page 14: Learning CRFs with Hierarchical Features: An Application to Go

Local Training

• Piecewise:
• Shared Unary Piecewise:
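The slide's formulas did not survive extraction. In piecewise training in the sense of Sutton and McCallum, each factor is trained as an isolated model with its own local normalization; a sketch of one piecewise term for an edge factor, under that reading:

```python
import math
from itertools import product

def piecewise_term(s_i, s_j, factor_logpot):
    """One piecewise objective term for an edge factor: the factor is
    trained as if it were an isolated model, normalizing over its own
    variables only rather than over the whole board."""
    z = math.log(sum(math.exp(factor_logpot(a, b))
                     for a, b in product((-1, +1), repeat=2)))
    return factor_logpot(s_i, s_j) - z
```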

Page 15: Learning CRFs with Hierarchical Features: An Application to Go


Evaluation

Page 16: Learning CRFs with Hierarchical Features: An Application to Go

Models & Algorithms

• Model & algorithm specification:
  – Model / Training (/ Inference, if not obvious)
• Models & algorithms evaluated:
  – Indep / {Smallest, Largest} Pattern
  – Indep / BMA-Tree {Uniform, Exp}
  – Indep / Log Regr
  – CRF / ML Loopy BP (/ Swendsen-Wang)
  – Pattern CRF / Pseudolikelihood (Edge)
  – Pattern CRF / (S. U.) Piecewise
  – Monte Carlo

Page 17: Learning CRFs with Hierarchical Features: An Application to Go

Training Time

• Approximate time for various models and algorithms to reach convergence:

  Algorithm                         Training Time
  Indep / Largest Pattern           < 45 min
  Indep / BMA-Tree                  < 45 min
  Pattern CRF / Piecewise           ~ 2 hrs
  Indep / Log Regr                  ~ 5 hrs
  Pattern CRF / Pseudolikelihood    ~ 12 hrs
  CRF / ML Loopy BP                 > 2 days

Page 18: Learning CRFs with Hierarchical Features: An Application to Go

Inference Time

• Average time to evaluate $P(\vec{s} \mid \vec{c})$ for various models and algorithms on a 19×19 board:

  Algorithm                             Inference Time
  Indep / Smallest & Largest Pattern         1.7 ms
  Indep / BMA-Tree & Log Regr                6.0 ms
  CRF / Loopy BP                           101.0 ms
  Pattern CRF / Loopy BP                   214.6 ms
  Monte Carlo                            2,967.5 ms
  CRF / Swendsen-Wang Sampling          10,568.7 ms

Page 19: Learning CRFs with Hierarchical Features: An Application to Go

Performance Metrics

• Vertex Error (classification error):

$$\frac{1}{|G|} \sum_{i=1}^{|G|} I\left( \operatorname{sgn}\left( E_{P(\vec{s} \mid \vec{c}^{(d)})}[s_i] \right) \neq \operatorname{sgn}\left( s_i^{(d)} \right) \right)$$

• Net Error (score error):

$$\frac{1}{2|G|} \left| \sum_{i=1}^{|G|} E_{P(\vec{s} \mid \vec{c}^{(d)})}[s_i] - \sum_{i=1}^{|G|} s_i^{(d)} \right|$$

• Log Likelihood (model fit):

$$\log P(\vec{s}^{(d)} \mid \vec{c}^{(d)})$$
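These two error metrics transcribe directly into code, where `mu` holds the per-vertex marginal means $E[s_i]$ produced by the model under evaluation:

```python
import numpy as np

def vertex_error(mu, s_true):
    """Fraction of vertices whose predicted territory sign is wrong."""
    return np.mean(np.sign(mu) != np.sign(s_true))

def net_error(mu, s_true):
    """Absolute difference between total predicted and true score,
    normalized by 2|G| so the result lies in [0, 1]."""
    return abs(mu.sum() - s_true.sum()) / (2 * mu.size)
```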

Page 20: Learning CRFs with Hierarchical Features: An Application to Go

Performance Tradeoffs I

[Scatter plot: "Net Error vs. Vertex Error Tradeoff"; x-axis: Vertex Error (0.18–0.34), y-axis: Net Error (0–0.12). Series: Indep / Smallest Pattern; Indep / Largest Pattern; Indep / BMA-Tree Uniform; Indep / BMA-Tree Exp; Indep / Log Regr; CRF / ML Loopy BP; CRF / ML Loopy BP / Swendsen-Wang; Pattern CRF / Pseudolikelihood Edge; Pattern CRF / Pseudolikelihood; Pattern CRF / Piecewise; Pattern CRF / S.U. Piecewise; Monte Carlo]

Page 21: Learning CRFs with Hierarchical Features: An Application to Go

Why is Vertex Error better for CRFs?

• Coupling factors help realize stable configurations
• Compare the previous unary-only independent model to the unary-and-coupling model:
  – Independent models make inconsistent predictions
  – Loopy BP smooths these predictions (but too much?)

[Board visualizations: BMA-Tree model vs. coupling model with Loopy BP]

Page 22: Learning CRFs with Hierarchical Features: An Application to Go

Why is Net Error worse for CRFs?

• Use sampling to examine the bias of Loopy BP
  – Unbiased inference in the limit
  – Can be run over all test data, but still too costly for training
• Smoothing gets rid of local inconsistencies
• But errors reinforce each other!

[Board visualizations: Loopy Belief Propagation vs. Swendsen-Wang Sampling]

Page 23: Learning CRFs with Hierarchical Features: An Application to Go

Bias of Local Training

• Problems with piecewise training:
  – Very biased when used in conjunction with Loopy BP
  – Predictions are good (low Vertex Error), just saturated
  – This accounts for the poor Log Likelihood & Net Error…

[Board visualizations: ML-trained vs. piecewise-trained]

Page 24: Learning CRFs with Hierarchical Features: An Application to Go

Performance Tradeoffs II

[Scatter plot: "Net Error vs. -Log Likelihood Tradeoff"; x-axis: -Log Likelihood (0.4–1.3), y-axis: Net Error (0–0.12). Same series as Performance Tradeoffs I.]

Page 25: Learning CRFs with Hierarchical Features: An Application to Go

Conclusions

Two general messages:

(1) CRFs vs. independent models:
• Pattern CRFs should theoretically be better
• However, the time cost is high
• Time can be saved with approximate training / inference
• But then CRFs may perform worse than independent classifiers, depending on the metric

(2) For independent models:
• The problem of choosing an appropriate neighborhood can be finessed by Bayesian model averaging

Page 26: Learning CRFs with Hierarchical Features: An Application to Go


Thank you!

Questions?