Probabilistic Reasoning and Learning with Permutations
Thesis Defense, 7/29/2011
Jonathan Huang
Collaborators:
Carlos Guestrin (CMU)
Leonidas Guibas (Stanford)
Xiaoye Jiang (Stanford)
Ashish Kapoor (Microsoft)
2
Political Elections in Ireland
“But Ireland's complicated [election] system of proportional representation, … could upset the front-runner and help… the Fianna Fail candidate running second in the polls, to snatch victory.”
“Recent polling … indicates Doherty [Sinn Fein Party] is leading the race.”
3
Proportional Representation
Pros:
- Encourages coalition governments
- Discourages negative campaigning
- No wasted votes – empowers voters

Used in: Irish Parliament, Maltese Parliament, Australian Senate, Iceland Constitutional Assembly, Academy Awards, University of Cambridge, Scotland local governments, Cambridge (Mass.) local elections, …

Con: Far more complex than plurality voting…
4
2002 Irish Election Data
Ireland
64,081 votes, 14 candidates
Major Parties: Fianna Fail (FF), Fine Gael (FG)
Minor Parties: Independents (I), Green Party (GP), Christian Solidarity (CS), Labour (L), Sinn Fein (SF)
[Gormley, Murphy, 2006]
Statistical analysis of voting data can:
- Predict winners
- Identify “voting-blocs”
- Formulate campaign strategies
- Engender an informed, effective democracy
5
Distributions over Permutations
Rankings over candidates (each row gives the rank assigned to candidates A, B, C, D):

A B C D | Probability
1 2 3 4 | 0
2 1 3 4 | 0
1 3 2 4 | 1/10
3 1 2 4 | 0
2 3 1 4 | 1/20
3 2 1 4 | 1/5
1 2 4 3 | 0
“With probability 1/10: Candidate A ranked first, Candidate B ranked third, Candidate C ranked second, Candidate D ranked last”
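For concreteness, the table above can be held as a dictionary keyed by rank vectors; a minimal sketch (only the nonzero rows are stored, and we renormalize over the shown support purely for illustration):

```python
# Toy distribution over rankings of candidates A, B, C, D, matching the table:
# each key gives (rank of A, rank of B, rank of C, rank of D).
dist = {
    (1, 3, 2, 4): 1 / 10,   # A first, B third, C second, D last
    (2, 3, 1, 4): 1 / 20,
    (3, 2, 1, 4): 1 / 5,
}
# The remaining mass sits on rankings not shown; here we condition on the
# shown support just to illustrate marginalization.
total = sum(dist.values())

# Marginal probability that candidate A (first component) is ranked first.
p_A_first = sum(p for r, p in dist.items() if r[0] == 1) / total
print(p_A_first)
```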
6
Permutations are Ubiquitous!
7
[Figure: examples of permutation data – Politics (ranked ballots), Preferences (e.g. one movie preferred over another: > > ), Multiobject Tracking (tracks matched to identities).]

7
Problem #1: Representation
n  | n!        | Storage requirements
9  | 362,880   | 3 megabytes
12 | 4.8x10^8  | 9.5 terabytes
15 | 1.31x10^12 | 1729 petabytes (!!)
How can we tractably represent distributions over n! permutations in storage?
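The blow-up is easy to reproduce; a quick sketch, assuming 8 bytes per stored probability (the slide's own storage figures may use a different accounting):

```python
import math

# Storage needed to tabulate one float per permutation, for growing n.
# This is why explicit tabulation over n! permutations breaks down fast.
for n in (9, 12, 15):
    n_perms = math.factorial(n)
    gib = n_perms * 8 / 2**30   # assumes 8 bytes per probability
    print(f"n={n:2d}  n!={n_perms:.3g}  ~{gib:,.0f} GiB")
```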
8
First-order summary [Shin et al, ‘03]
For each (j,i) pair, store P(candidate j is in rank i)

[Figure: 14 candidates x 14 ranks first-order matrix for the Irish data; candidates grouped by party (FF, FG, I, GP, CS, SF, L); probabilities range ~0.05–0.25.]

- 25% of voters rank Sinn Fein last
- 10% of voters rank Sinn Fein first

Pro: n^2 versus n! storage
Con: Really coarse representation – can’t compute P(Sinn Fein candidate is first and Fianna Fail candidate is second)
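Estimating such a first-order matrix from data is a one-pass count; a small sketch (using the illustrative convention that a ranking lists the item placed at each rank):

```python
import numpy as np

# Estimate the first-order summary M[j, i] = P(item j lands in rank i)
# from a sample of rankings. Here sigma[i] = item placed at rank i.
def first_order_matrix(rankings, n):
    M = np.zeros((n, n))
    for sigma in rankings:
        for rank, item in enumerate(sigma):
            M[item, rank] += 1
    return M / len(rankings)

data = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (0, 1, 2)]   # toy sample, 3 items
M = first_order_matrix(data, 3)
print(M[0, 0])   # fraction of rankings placing item 0 first
```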
9
Decomposable Distributions

Additive Decomposition: decompose functions on permutations into sums of simpler functions.

Multiplicative Decomposition: decompose functions on permutations into products of simpler functions.
10
Additive (Fourier) Decompositions
f(x) = .6 x [basis] + .2 x [basis] + .1 x [basis] + … + .01 x [basis]
(Fourier coefficients multiply Fourier basis functions, ordered from low frequency to high frequency.)

Approximate distributions over permutations with low frequency basis functions ([Kondor 2007, Huang 2007, Huang 2009]): store only the low frequency coefficients of f, e.g. the leading entries of (.6, .2, .1, .1, .05, .01, .01, 0, 0, 0).
11
Fourier coefficients for permutations
Fourier coefficients for distributions on permutations are matrix-valued, ordered from low frequency to high frequency.
Can exactly reconstruct all n! original probabilities
Can exactly reconstruct all first-order probabilities with first two matrices
Can exactly reconstruct all second-order probabilities with first three matrices
[Diaconis, ‘88]
12
Second order summary (submatrix)
[Figure: heatmap over rank pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) x candidate pairs (FF,FG), (FF,FF), (FF,SF), (FG,FF), (FG,SF), (FF,SF); values range 0–0.07.]

7% of voters placed two Fianna Fail candidates consecutively in ranks 1 and 2.
Capture higher order dependencies with O(n4) storage
13
Accuracy/Storage Trade-off
Probability | Fourier interpretation
0th order   | Lowest frequency Fourier coefficient
1st order   | Reconstructible from O(n^2) lowest frequency coefficients
2nd order   | Reconstructible from O(n^4) lowest frequency coefficients
3rd order   | Reconstructible from O(n^6) lowest frequency coefficients
…           | …
nth order   | Requires all n! Fourier coefficients

Problem #1: Representation. Storing a low frequency Fourier approximation is equivalent to storing low-order probabilities (and can be done in polynomial space).

Low-frequency Fourier approximations generalize the first-order summary!
14
Contributions (table: Representation and Inference, for the Additive (Fourier) Decomposition and the Multiplicative Decomposition)

Representation, Additive (Fourier) Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference
15
Problem #2: Probabilistic Inference in Ranking

- What are the odds that someone will rank Sinn Fein first if he ranks Fianna Fail second?
- If a voter ranks Labour first, is he more likely to prefer Fine Gael over Fianna Fail?
- If I prefer Titanic to Star Wars, am I likely to also prefer The English Patient to Jurassic Park?
16
Problem #2: Inference
Bayes Rule: posterior = likelihood x prior (normalized)

P(candidate ranking σ | z = “Fianna Fail ranked second”)

Complexity: O(n!) – the sum runs over all rankings ABCD, BACD, ACBD, CABD, …

How can we efficiently compute a posterior based on a new observation?
17
Inference with Fourier coefficients

Given: prior P(ranking) and likelihood P(“Sinn Fein is first” | ranking)
Compute: posterior P(ranking | “Sinn Fein is first”)

From Signal Processing: pointwise products correspond to convolutions of Fourier coefficients.
18
Inference with Fourier coefficients [Huang et al, NIPS 2007]

Given the prior P(ranking) and the likelihood P(“Sinn Fein is first” | ranking), compute the posterior P(ranking | “Sinn Fein is first”).

Pointwise products correspond to (generalized) convolution in the Fourier domain. Our algorithm applies to arbitrary distributions defined over arbitrary finite groups.
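The product-to-convolution fact can be checked in the simplest abelian special case, the cyclic group Z_N with the ordinary DFT; the symmetric-group case in the thesis replaces these scalar coefficients with matrices. A minimal numpy sketch:

```python
import numpy as np

# On Z_N: a pointwise product of two functions (e.g. prior x likelihood)
# transforms to a (1/N-scaled) circular convolution of their DFT coefficients.
rng = np.random.default_rng(0)
N = 8
f = rng.random(N)   # stand-in "prior" over Z_N
g = rng.random(N)   # stand-in "likelihood" over Z_N
F, G = np.fft.fft(f), np.fft.fft(g)

# Circular convolution of the Fourier coefficient vectors, done directly.
conv = np.array([sum(F[m] * G[(k - m) % N] for m in range(N)) for k in range(N)])

# Pointwise product in the "ranking" domain = convolution in the Fourier domain.
assert np.allclose(np.fft.fft(f * g), conv / N)
```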
19
Bandlimiting [Huang et al, NIPS 2009]

- Discard “high-frequency” coefficients after conditioning
- Equivalently, maintain low-order probabilities

Theorem. Given the rth order terms of the prior and an sth order likelihood, the (r-s)th order terms of the posterior can be exactly computed.

(Fourier methods work best on low-order observations.)
Dealing with the Impossible
Infeasible approximations (e.g. negative probabilities) can arise due to bandlimiting
20
[Figure: feasible vs. infeasible Fourier coefficients; an infeasible approximation is mapped to the nearest feasible Fourier coefficients.]

Solution [Huang, 2007]: Project to the space of coefficients corresponding to feasible probabilities. (Efficient projection to a relaxed polytope is possible using a quadratic program.)
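The idea of repairing an infeasible approximation can be illustrated with the simplest analogue: Euclidean projection of a vector with "negative probabilities" onto the probability simplex, via the standard sort-and-threshold algorithm. (The thesis instead projects Fourier coefficients onto a relaxed marginal polytope with a quadratic program; this is only a sketch of the idea.)

```python
import numpy as np

# Euclidean projection onto the probability simplex {p >= 0, sum(p) = 1}.
def project_simplex(v):
    u = np.sort(v)[::-1]                  # sort descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > (css - 1))[0][-1]   # last index kept positive
    theta = (css[rho] - 1) / (rho + 1)            # shift so mass sums to 1
    return np.maximum(v - theta, 0.0)

p = project_simplex(np.array([0.8, 0.5, -0.1]))   # "negative probability" fixed
print(p, p.sum())
```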
21
Permutations in Tracking
Track 1 Track 2 Track 3 Track 4
Applications:
- Monitoring for assisted living
- Video analysis for sports
- Video surveillance for crowds
22
Probabilistic Inference in Tracking
[Figure: four tracks over time, with mixing events at tracks (1,2), (1,3), and (1,4).]

Inference problem: Where is Alice?
23
Simulated tracking data: projection to the marginal polytope versus no projection (n=6)

[Figure: error (~0–0.12, lower is better) for 1st, 2nd, and 3rd order approximations, with and without projection, compared against approximation by a uniform distribution.]
24
Tracking with a camera network

Camera network data:
- 8 cameras, multi-view, occlusion effects
- 11 individuals in lab
- Identity observations obtained from color histograms
- Mixing events declared when people walk close to each other

[Figure: % of tracks correctly identified (0–60%, higher is better): omniscient tracker vs. time-independent classification vs. w/o projection vs. 2nd order w/ projection.]
Problem #2: Inference can be formulated in the Fourier domain as (generalized) convolution, and approximated via bandlimiting/projections; low-order observations = polytime, accurate inference.
25
Contributions (table: Representation and Inference, for the Additive (Fourier) Decomposition and the Multiplicative Decomposition)

Representation, Additive (Fourier) Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference, Additive (Fourier) Decomposition:
- Polytime Fourier domain conditioning algorithm for finite groups
- Approximation guarantee for low order observations
[NIPS07, JMLR09]
26
Even polynomial is too slow…
Representation depth | # Fourier coefficients
1st order | O(n^2)
2nd order | O(n^4)
3rd order | O(n^6)
4th order | O(n^8)

[Figure: running time in seconds (lower is better) of exact inference and of 1st/2nd/3rd order approximations, for n = 4…8.]

Can we achieve more compact representations?
27
Riffled Independence [Huang, Guestrin, 2009]

Idea: Assume a ranking is created by “shuffling” smaller, independent rankings.

Rank Veggies: Artichoke > Broccoli
Rank Fruits: Cherry > Dates

Interleave (riffle shuffle) the veggie/fruit rankings to form a complete ranking, e.g. Artichoke > Broccoli > Cherry > Dates (one of several possible interleavings).

Riffle independent distributions can be represented with a reduced set of parameters!
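The generative story above is easy to simulate; a sketch with uniform factors for simplicity (a riffle independent model allows arbitrary distributions over each relative ranking and over the interleaving):

```python
import random

# Sample a full ranking by riffling two independently drawn sub-rankings:
# rank set A, rank set B, then choose which slots of the full ranking the
# A-items occupy (the interleaving).
def riffle_sample(set_a, set_b, rng):
    ranked_a = rng.sample(set_a, len(set_a))   # relative ranking of set A
    ranked_b = rng.sample(set_b, len(set_b))   # relative ranking of set B
    n = len(set_a) + len(set_b)
    slots = sorted(rng.sample(range(n), len(set_a)))   # interleaving
    full = [None] * n
    for slot, item in zip(slots, ranked_a):
        full[slot] = item
    rest = iter(ranked_b)
    return [x if x is not None else next(rest) for x in full]

full_ranking = riffle_sample(["Artichoke", "Broccoli"], ["Cherry", "Dates"],
                             random.Random(0))
print(full_ranking)
```

By construction, the relative order of the veggies and the relative order of the fruits are each preserved in the output; only the interleaving varies.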
28
American Psych. Assoc. (APA) Election (1980)

5738 full ballots, 5 candidates (dataset from [Diaconis, ‘89]):
1. William Bevan
2. Ira Iscoe
3. Charles Kiesler
4. Max Siegle
5. Logan Wright

[Figure: probability of each of the 5! = 120 permutations; blue line: candidate {2} riffle independent of candidates {1,3,4,5}.]

Empirically, we can find approximate riffled independence in real datasets.
29
Parameter Counting

#(rankings): 5! = 120. Can we do better?

Item set decomposition {1,2,3,4,5} into {1,3,4,5} and {2}:
- Relative ranking of candidates {1,3,4,5}: 4! = 24
- Relative ranking of candidate {2}: 1! = 1
- Interleaving candidate {2} with remaining candidates: 5
Total # of model parameters < 30

Decomposing {1,3,4,5} further into {4,3} and {1,5}:
- Relative ranking of candidates {4,3}: 2! = 2
- Relative ranking of candidates {1,5}: 2! = 2
- Interleaving candidates {4,3} with candidates {1,5}: 6
Total # of model parameters < 16

Problem #1: Representation. Distributions which decompose into riffle independent factors can be represented using exponentially fewer parameters.
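The counts above follow a simple recursion over the hierarchy; a sketch, using the (assumed) encoding that a leaf is a tuple of items with a full distribution over its k! relative rankings, and an internal node [left, right] adds an interleaving distribution over C(n, n_left) merge patterns:

```python
import math

def n_items(h):
    # A leaf is a tuple of items; an internal node is a list [left, right].
    return len(h) if isinstance(h, tuple) else sum(n_items(c) for c in h)

def n_params(h):
    if isinstance(h, tuple):
        return math.factorial(len(h))     # full factor over the leaf's items
    left, right = h
    return (n_params(left) + n_params(right)
            + math.comb(n_items(h), n_items(left)))   # interleaving params

print(n_params((1, 2, 3, 4, 5)))            # flat model: 5! = 120
print(n_params([(1, 3, 4, 5), (2,)]))       # one split:  24 + 1 + 5 = 30
print(n_params([[(4, 3), (1, 5)], (2,)]))   # two splits: (2+2+6) + 1 + 5 = 16
```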
30
Hierarchical Decompositions: Drawing a Ranking (Food Preferences)

1. Rank fruits; rank vegetables
2. Interleave fruits/vegetables
3. Rank junk food
4. Interleave healthy foods with junk food

Problem: For the APA data, we don’t know the hierarchy!
31
Reverse Engineering the Hierarchy

Machine learning approach: use the structure that best explains the data.

Data (rankings such as ABCD, BACD, BADC, BDCA, CBDA, CABD, CBAD, CDBA, DCBA, DCAB, ADBC) → Structure Learning Algorithm → Hierarchy (e.g. {A,B,C,D} splits into {C,D,A} and {B}; {C,D,A} splits into {A} and {C,D}).

Core Problem: Given ranked data, determine whether subsets are riffle independent.
32
Measuring riffled independence [Huang, Guestrin, 2010]

Riffled independence: absolute rankings of Fruits are not informative about relative rankings within Vegetables.

Idea: measure independence between singleton rankings (preference over Fruit i) and pairwise rankings (relative preference over Vegetables j, k). If i and (j,k) lie on opposite sides of the split, mutual information = 0.
33
Tripletwise objective function

Measuring departure from riffled independence: minimize, over candidate splits (A, B), the sum of mutual informations between item i’s rank and the relative order of j and k, for triplets (i; j,k) crossing the split. Triplets lying entirely within one set play no role in the objective.

There are exponentially many possible splits, but an efficient minimization algorithm works with high probability.
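The per-triplet quantity can be estimated from samples; a plug-in sketch of the mutual information between the rank of item i and the indicator of j preceding k (near-zero values support riffled independence of i from the set containing j, k):

```python
import math
from collections import Counter

# Plug-in estimate of I(sigma(i); [sigma(j) < sigma(k)]) from sample
# rankings, where sigma maps items to ranks.
def mutual_information(rankings, i, j, k):
    joint = Counter()
    for sigma in rankings:
        joint[(sigma[i], sigma[j] < sigma[k])] += 1
    n = len(rankings)
    px, py = Counter(), Counter()
    for (x, y), c in joint.items():
        px[x] += c
        py[y] += c
    mi = 0.0
    for (x, y), c in joint.items():
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

from itertools import permutations
data = [tuple(p) for p in permutations(range(3))]   # uniform: independent
print(mutual_information(data, 0, 1, 2))            # → 0.0
```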
34
Learning Structure from APA Data

Learned hierarchy: {1,2,3,4,5} splits into {2} and {1,3,4,5}; {1,3,4,5} splits into {1,3} and {4,5}.

Candidates:
- {1,3} = research psychologists (1. William Bevan, 3. Charles Kiesler)
- {4,5} = clinical psychologists (4. Max Siegle, 5. Logan Wright)
- {2} = community psychologists (2. Ira Iscoe)

The hierarchy respects the political coalition structure of the APA!

# model parameters: 11

[Figure: “true” first order vs. hierarchical first order matrices (5 candidates x 5 ranks, values ~0–0.25).]
35
Structure learning with synthetic data

[Figure: log-likelihood (higher is better) vs. log10(# samples) for: true structure known, learned structure, and a random 1-chain ([Doignon et al, 2004]); 16 items, 4 items in each leaf.]

Theorem: Our algorithm recovers the riffle independent split with high probability, given polynomially many samples (under mild assumptions on connectivity). [Huang, Guestrin, 2011]
36
Irish Election (No Structure Learning)

Major parties riffle independent of minor parties?

[Figure: “true” first order probabilities vs. riffle independent approximation (14 candidates x 14 ranks; candidates FF, FG, I, GP, CS, SF, L; values ~0–0.25).]

The Sinn Fein and Christian Solidarity columns are not well captured by a single split!
37
Structure Learning on the Irish Election

Learned hierarchy:
- {1,…,14} splits into {1,…,11,13,14} and {12} (Sinn Fein)
- {1,…,11,13,14} splits into {11} (Christian Solidarity) and {1,…,10,13,14}
- {1,…,10,13,14} splits into {2,3,5,6,7,8,9,10,14} and {1,4,13} (Fianna Fail)
- {2,3,5,6,7,8,9,10,14} splits into {2,5,6} (Fine Gael) and {3,7,8,9,10,14} (Independents, Labour, Green)

# Parameters: full model ~87 billion; hierarchical model ~1000
Running time: brute force optimization 70.2s; our method 2.3s

[Figure: “true” first order vs. learned first order matrices (14 candidates x 14 ranks, values 0–0.25).]
38
Preference Analysis (for Sushi)

5000 preference rankings of 10 types of sushi.

Contenders:
1. Ebi (shrimp)
2. Anago (sea eel)
3. Maguro (tuna)
4. Ika (squid)
5. Uni (sea urchin)
6. Sake (salmon roe)
7. Tamago (egg)
8. Toro (fatty tuna)
9. Tekka-maki (tuna roll)
10. Kappa-maki (cucumber roll)

[Figure: first-order matrix, sushi x ranks (first to last).]
Fatty tuna (Toro) is a favorite! No one likes cucumber roll!
39
Sushi Hierarchy

- {1,…,10} splits into {2} (sea eel) and {1,3,4,5,6,7,8,9,10}
- {1,3,4,5,6,7,8,9,10} splits into {4} (squid) and {1,3,5,6,7,8,9,10}
- {1,3,5,6,7,8,9,10} splits into {5,6} (sea urchin, salmon roe) and {1,3,7,8,9,10}
- {1,3,7,8,9,10} splits into {1} (shrimp) and {3,7,8,9,10}
- {3,7,8,9,10} splits into {3,8,9} (tuna, fatty tuna, tuna roll) and {7,10} (egg, cucumber roll)
40
Contributions (table: Representation and Inference, for the Additive (Fourier) Decomposition and the Multiplicative (Riffle Independent) Decomposition)

Representation, Additive (Fourier) Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference, Additive (Fourier) Decomposition:
- Polytime Fourier domain conditioning algorithm for finite groups
- Approximation guarantee for low order observations
[NIPS07, JMLR09]

Representation, Multiplicative (Riffle Independent) Decomposition:
- Introduction of Hierarchical Riffled Independence models
- Structure learning algorithm with polynomial time/samples guarantee
[NIPS09, ICML10, EJS11]
41
Top-k Inference Problem

[Figure: histogram of the number of candidates specified (k = 2…14) vs. number of votes (0–20,000).]

Most voters rank just their top 3 or top 4 candidates.

Inference problem: Given an observation of a voter’s top-k rankings, infer his preferences over the remaining candidates.
42
Inference in Riffled Independent Models

Bayes Rule: posterior = likelihood x prior (normalized) – an O(n!) operation naively.

Decomposition | Can efficiently perform inference with
Fourier (Additive) | Low order likelihoods (observations depend on few items)
Riffle Independent (Multiplicative) | ????

Answer: Efficient inference is possible if and only if observations take the form of partial rankings (including top-k observations)!
43
The Top-1 Inference Problem

Bayes rule complexity: factorial in the number of items? Sometimes we can decompose the observation into smaller observations.

Under the hierarchy {all candidates} split into {1,2,3} (Fianna Fail) and {4,5,6,7,8} (other candidates), the observation “Candidate 3 (FF party) ranked in first place” decomposes as:
- Interleaving observation: a Fianna Fail candidate ranked in first place overall
- Fianna Fail observation: candidate 3 ranked first among FF candidates

Bayes rule complexity: linear in # of parameters.

Top-1 inference always decomposes into inference for each node in the hierarchical model.
44
Efficient inference for partial rankings

In general, there are many forms of partial rankings, allowing items to be tied:

- First place observations: G|ABCDEFH – “G in first place”
- Top-k observations: G|F|A|BCDEH – “G in first place, F in second, A in third”
- Approval voting observations: ACFG|BDEH – “Approve of candidates in {A,C,F,G}”

[Figure: approval-voting ballot marking candidates from Fine Gael, Fianna Fail, Sinn Fein, Independent, Green, Labour, Socialist.]
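Any of these observation types defines a consistency predicate on full rankings; conditioning can be sketched by brute force, which is the O(n!) baseline the thesis improves on (riffle independent models run in time linear in the number of parameters). A toy sketch with four hypothetical items:

```python
from itertools import permutations

# Brute-force Bayes conditioning on a partial ranking observation.
def condition(dist, consistent):
    post = {s: p for s, p in dist.items() if consistent(s)}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

items = "ABCD"
prior = {s: 1 / 24 for s in permutations(items)}   # uniform over 4! orderings

# Top-2 observation "A|B|CD": A in first place, B in second.
top2 = condition(prior, lambda s: s[0] == "A" and s[1] == "B")
print(len(top2))   # 2 consistent orderings remain (C, D in either order)
```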
45
Main Theorem [Huang, Kapoor, 2011]

Theorem: Any partial ranking observation is decomposable with respect to any hierarchy. I.e., inference for partial rankings is efficient, with running time linear in #(parameters).

[Figure: observations decomposable w.r.t. hierarchies H1, H2, H3; partial rankings lie in the intersection. But what’s out there beyond the intersection?]

Converse to Main Theorem: Every observation that decomposes with respect to all hierarchies takes the form of some partial ranking.
46
Learning with Top-k Votes (Irish Data)

[Figure: negative log-likelihood (~21,800–22,600; lower is better) for the riffle independent model and the nonparametric Mallows model [Lebanon, 2008], trained on full rankings only vs. full rankings + partial rankings.]

Using inference, we can efficiently build accurate, interpretable models of partial rankings.
47
Contributions (table: Representation and Inference, for the Additive Decomposition and the Multiplicative Decomposition)

Representation, Additive Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference, Additive Decomposition:
- Polytime Fourier domain conditioning algorithm for finite groups
- Approximation guarantee for low order observations
[NIPS07, JMLR09]

Representation, Multiplicative Decomposition:
- Introduction of Hierarchical Riffled Independence models
- Structure learning algorithm with polynomial time/samples guarantee
[NIPS09, ICML10, EJS11]

Inference, Multiplicative Decomposition:
- Decomposability theorem for partial rankings
- Learning distributions with partial rankings
[NIPS-CSS10]

Algorithms for exploiting both decompositions for scalable inference [AISTATS08, NIPS09, EJS11]
48
Main Technical Contributions

- Fourier theoretic conditioning algorithm with projection to the marginal polytope [NIPS07, JMLR09]
- Fourier theoretic characterization of probabilistic independence [AISTATS07]
- Definition of riffled independence [NIPS09]
- Polynomial sample/time complexity structure learning algorithms [ICML10]
- Theoretical connection between efficient inference in riffle independent models and partial ranking [UAI11]
- Tractable model estimation algorithm with partial rankings [UAI11]
49
Thank You

Carlos Guestrin
Leo Guibas, John Lafferty, Drew Bagnell, Alex Smola
Ashish Kapoor, Eric Horvitz, Ali Rahimi
Risi Kondor, Marina Meila, Guy Lebanon, Tiberio Caetano, Xiaoye Jiang
SELECT Lab, Michelle Martin
Friends
Lucia Castellanos
Billy, Farn-lin, and Jonah Huang