Probabilistic Reasoning and Learning with Permutations
Thesis Defense, 7/29/2011
Jonathan Huang
Collaborators:
Carlos Guestrin (CMU)
Leonidas Guibas (Stanford)
Xiaoye Jiang (Stanford)
Ashish Kapoor (Microsoft)
2
Political Elections in Ireland
“But Ireland's complicated [election] system of proportional representation, … could upset the front-runner and help… the Fianna Fail candidate running second in the polls, to snatch victory.”
“Recent polling … indicates Doherty [Sinn Fein Party] is leading the race.”
3
Proportional Representation
Pros:
- Encourages coalition governments
- Discourages negative campaigning
- No wasted votes – empowers voters

Used in: Irish Parliament, Maltese Parliament, Australian Senate, Iceland Constitutional Assembly, Academy Awards, University of Cambridge, Scotland local governments, Cambridge (Mass.) local elections, …

Con: Far more complex than plurality voting…
4
2002 Irish Election Data
Ireland
64,081 votes, 14 candidates
Major Parties: Fianna Fail (FF), Fine Gael (FG)
Minor Parties: Independents (I), Green Party (GP), Christian Solidarity (CS), Labour (L), Sinn Fein (SF)
[Gormley, Murphy, 2006]
Statistical analysis of voting data can:
- Predict winners
- Identify “voting-blocs”
- Formulate campaign strategies
- Engender an informed, effective democracy
5
Distributions over Permutations
Rankings over candidates (each row gives the rank assigned to candidates A, B, C, D):

A B C D | Probability
1 2 3 4 | 0
2 1 3 4 | 0
1 3 2 4 | 1/10
3 1 2 4 | 0
2 3 1 4 | 1/20
3 2 1 4 | 1/5
1 2 4 3 | 0
“With probability 1/10: Candidate A ranked first, Candidate B ranked third, Candidate C ranked second, Candidate D ranked last”
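For concreteness, the table above can be held as a dictionary keyed by rank vectors; a minimal sketch (only the nonzero rows are stored, and we renormalize over the shown support purely for illustration):

```python
# Toy distribution over rankings of candidates A, B, C, D, matching the table:
# each key gives (rank of A, rank of B, rank of C, rank of D).
dist = {
    (1, 3, 2, 4): 1 / 10,   # A first, B third, C second, D last
    (2, 3, 1, 4): 1 / 20,
    (3, 2, 1, 4): 1 / 5,
}
# The remaining mass sits on rankings not shown; here we condition on the
# shown support just to illustrate marginalization.
total = sum(dist.values())

# Marginal probability that candidate A (first component) is ranked first.
p_A_first = sum(p for r, p in dist.items() if r[0] == 1) / total
print(p_A_first)
```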
6
Permutations are Ubiquitous!
7
[Figure: examples of permutation data – Politics (ranked ballots), Preferences (e.g. one movie preferred over another: > > ), Multiobject Tracking (tracks matched to identities).]

7
Problem #1: Representation
n  | n!        | Storage requirements
9  | 362,880   | 3 megabytes
12 | 4.8x10^8  | 9.5 terabytes
15 | 1.31x10^12 | 1729 petabytes (!!)
How can we tractably represent distributions over n! permutations in storage?
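The blow-up is easy to reproduce; a quick sketch, assuming 8 bytes per stored probability (the slide's own storage figures may use a different accounting):

```python
import math

# Storage needed to tabulate one float per permutation, for growing n.
# This is why explicit tabulation over n! permutations breaks down fast.
for n in (9, 12, 15):
    n_perms = math.factorial(n)
    gib = n_perms * 8 / 2**30   # assumes 8 bytes per probability
    print(f"n={n:2d}  n!={n_perms:.3g}  ~{gib:,.0f} GiB")
```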
8
First-order summary [Shin et al, ‘03]
For each (j,i) pair, store P(candidate j is in rank i)

[Figure: 14 candidates x 14 ranks first-order matrix for the Irish data; candidates grouped by party (FF, FG, I, GP, CS, SF, L); probabilities range ~0.05–0.25.]

- 25% of voters rank Sinn Fein last
- 10% of voters rank Sinn Fein first

Pro: n^2 versus n! storage
Con: Really coarse representation – can’t compute P(Sinn Fein candidate is first and Fianna Fail candidate is second)
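Estimating such a first-order matrix from data is a one-pass count; a small sketch (using the illustrative convention that a ranking lists the item placed at each rank):

```python
import numpy as np

# Estimate the first-order summary M[j, i] = P(item j lands in rank i)
# from a sample of rankings. Here sigma[i] = item placed at rank i.
def first_order_matrix(rankings, n):
    M = np.zeros((n, n))
    for sigma in rankings:
        for rank, item in enumerate(sigma):
            M[item, rank] += 1
    return M / len(rankings)

data = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (0, 1, 2)]   # toy sample, 3 items
M = first_order_matrix(data, 3)
print(M[0, 0])   # fraction of rankings placing item 0 first
```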
9
Decomposable Distributions

Additive Decomposition: decompose functions on permutations into sums of simpler functions.

Multiplicative Decomposition: decompose functions on permutations into products of simpler functions.
10
Additive (Fourier) Decompositions
f(x) = .6 x [basis] + .2 x [basis] + .1 x [basis] + … + .01 x [basis]
(Fourier coefficients multiply Fourier basis functions, ordered from low frequency to high frequency.)

Approximate distributions over permutations with low frequency basis functions ([Kondor 2007, Huang 2007, Huang 2009]): store only the low frequency coefficients of f, e.g. the leading entries of (.6, .2, .1, .1, .05, .01, .01, 0, 0, 0).
11
Fourier coefficients for permutations
Fourier coefficients for distributions on permutations are matrix-valued, ordered from low frequency to high frequency.
Can exactly reconstruct all n! original probabilities
Can exactly reconstruct all first-order probabilities with first two matrices
Can exactly reconstruct all second-order probabilities with first three matrices
[Diaconis, ‘88]
12
Second order summary (submatrix)
[Figure: heatmap over rank pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) x candidate pairs (FF,FG), (FF,FF), (FF,SF), (FG,FF), (FG,SF), (FF,SF); values range 0–0.07.]

7% of voters placed two Fianna Fail candidates consecutively in ranks 1 and 2.
Capture higher order dependencies with O(n4) storage
13
Accuracy/Storage Trade-off
Probability | Fourier interpretation
0th order   | Lowest frequency Fourier coefficient
1st order   | Reconstructible from O(n^2) lowest frequency coefficients
2nd order   | Reconstructible from O(n^4) lowest frequency coefficients
3rd order   | Reconstructible from O(n^6) lowest frequency coefficients
…           | …
nth order   | Requires all n! Fourier coefficients

Problem #1: Representation. Storing a low frequency Fourier approximation is equivalent to storing low-order probabilities (and can be done in polynomial space).

Low-frequency Fourier approximations generalize the first-order summary!
14
Contributions (table: Representation and Inference, for the Additive (Fourier) Decomposition and the Multiplicative Decomposition)

Representation, Additive (Fourier) Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference
15
Problem #2: Probabilistic Inference in Ranking

- What are the odds that someone will rank Sinn Fein first if he ranks Fianna Fail second?
- If a voter ranks Labour first, is he more likely to prefer Fine Gael over Fianna Fail?
- If I prefer Titanic to Star Wars, am I likely to also prefer The English Patient to Jurassic Park?
16
Problem #2: Inference
Bayes Rule: posterior = likelihood x prior (normalized)

P(candidate ranking σ | z = “Fianna Fail ranked second”)

Complexity: O(n!) – the sum runs over all rankings ABCD, BACD, ACBD, CABD, …

How can we efficiently compute a posterior based on a new observation?
17
Inference with Fourier coefficients

Given: prior P(ranking) and likelihood P(“Sinn Fein is first” | ranking)
Compute: posterior P(ranking | “Sinn Fein is first”)

From Signal Processing: pointwise products correspond to convolutions of Fourier coefficients.
18
Inference with Fourier coefficients [Huang et al, NIPS 2007]

Given the prior P(ranking) and the likelihood P(“Sinn Fein is first” | ranking), compute the posterior P(ranking | “Sinn Fein is first”).

Pointwise products correspond to (generalized) convolution in the Fourier domain. Our algorithm applies to arbitrary distributions defined over arbitrary finite groups.
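The product-to-convolution fact can be checked in the simplest abelian special case, the cyclic group Z_N with the ordinary DFT; the symmetric-group case in the thesis replaces these scalar coefficients with matrices. A minimal numpy sketch:

```python
import numpy as np

# On Z_N: a pointwise product of two functions (e.g. prior x likelihood)
# transforms to a (1/N-scaled) circular convolution of their DFT coefficients.
rng = np.random.default_rng(0)
N = 8
f = rng.random(N)   # stand-in "prior" over Z_N
g = rng.random(N)   # stand-in "likelihood" over Z_N
F, G = np.fft.fft(f), np.fft.fft(g)

# Circular convolution of the Fourier coefficient vectors, done directly.
conv = np.array([sum(F[m] * G[(k - m) % N] for m in range(N)) for k in range(N)])

# Pointwise product in the "ranking" domain = convolution in the Fourier domain.
assert np.allclose(np.fft.fft(f * g), conv / N)
```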
19
Bandlimiting [Huang et al, NIPS 2009]

- Discard “high-frequency” coefficients after conditioning
- Equivalently, maintain low-order probabilities

Theorem. Given the rth order terms of the prior and an sth order likelihood, the (r-s)th order terms of the posterior can be exactly computed.

(Fourier methods work best on low-order observations.)
Dealing with the Impossible
Infeasible approximations (e.g. negative probabilities) can arise due to bandlimiting
20
[Figure: feasible vs. infeasible Fourier coefficients; an infeasible approximation is mapped to the nearest feasible Fourier coefficients.]

Solution [Huang, 2007]: Project to the space of coefficients corresponding to feasible probabilities. (Efficient projection to a relaxed polytope is possible using a quadratic program.)
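The idea of repairing an infeasible approximation can be illustrated with the simplest analogue: Euclidean projection of a vector with "negative probabilities" onto the probability simplex, via the standard sort-and-threshold algorithm. (The thesis instead projects Fourier coefficients onto a relaxed marginal polytope with a quadratic program; this is only a sketch of the idea.)

```python
import numpy as np

# Euclidean projection onto the probability simplex {p >= 0, sum(p) = 1}.
def project_simplex(v):
    u = np.sort(v)[::-1]                  # sort descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > (css - 1))[0][-1]   # last index kept positive
    theta = (css[rho] - 1) / (rho + 1)            # shift so mass sums to 1
    return np.maximum(v - theta, 0.0)

p = project_simplex(np.array([0.8, 0.5, -0.1]))   # "negative probability" fixed
print(p, p.sum())
```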
21
Permutations in Tracking
Track 1 Track 2 Track 3 Track 4
Applications:
- Monitoring for assisted living
- Video analysis for sports
- Video surveillance for crowds
22
Probabilistic Inference in Tracking
[Figure: four tracks over time, with mixing events at tracks (1,2), (1,3), and (1,4).]

Inference problem: Where is Alice?
23
Simulated tracking data: projection to the marginal polytope versus no projection (n=6)

[Figure: error (~0–0.12, lower is better) for 1st, 2nd, and 3rd order approximations, with and without projection, compared against approximation by a uniform distribution.]
24
Tracking with a camera network

Camera network data:
- 8 cameras, multi-view, occlusion effects
- 11 individuals in lab
- Identity observations obtained from color histograms
- Mixing events declared when people walk close to each other

[Figure: % of tracks correctly identified (0–60%, higher is better): omniscient tracker vs. time-independent classification vs. w/o projection vs. 2nd order w/ projection.]
Problem #2: Inference can be formulated in the Fourier domain as (generalized) convolution, and approximated via bandlimiting/projections; low-order observations = polytime, accurate inference.
25
Contributions (table: Representation and Inference, for the Additive (Fourier) Decomposition and the Multiplicative Decomposition)

Representation, Additive (Fourier) Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference, Additive (Fourier) Decomposition:
- Polytime Fourier domain conditioning algorithm for finite groups
- Approximation guarantee for low order observations
[NIPS07, JMLR09]
26
Even polynomial is too slow…
Representation depth | # Fourier coefficients
1st order | O(n^2)
2nd order | O(n^4)
3rd order | O(n^6)
4th order | O(n^8)

[Figure: running time in seconds (lower is better) of exact inference and of 1st/2nd/3rd order approximations, for n = 4…8.]

Can we achieve more compact representations?
27
Riffled Independence [Huang, Guestrin, 2009]

Idea: Assume a ranking is created by “shuffling” smaller, independent rankings.

Rank Veggies: Artichoke > Broccoli
Rank Fruits: Cherry > Dates

Interleave (riffle shuffle) the veggie/fruit rankings to form a complete ranking, e.g. Artichoke > Broccoli > Cherry > Dates (one of several possible interleavings).

Riffle independent distributions can be represented with a reduced set of parameters!
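The generative story above is easy to simulate; a sketch with uniform factors for simplicity (a riffle independent model allows arbitrary distributions over each relative ranking and over the interleaving):

```python
import random

# Sample a full ranking by riffling two independently drawn sub-rankings:
# rank set A, rank set B, then choose which slots of the full ranking the
# A-items occupy (the interleaving).
def riffle_sample(set_a, set_b, rng):
    ranked_a = rng.sample(set_a, len(set_a))   # relative ranking of set A
    ranked_b = rng.sample(set_b, len(set_b))   # relative ranking of set B
    n = len(set_a) + len(set_b)
    slots = sorted(rng.sample(range(n), len(set_a)))   # interleaving
    full = [None] * n
    for slot, item in zip(slots, ranked_a):
        full[slot] = item
    rest = iter(ranked_b)
    return [x if x is not None else next(rest) for x in full]

full_ranking = riffle_sample(["Artichoke", "Broccoli"], ["Cherry", "Dates"],
                             random.Random(0))
print(full_ranking)
```

By construction, the relative order of the veggies and the relative order of the fruits are each preserved in the output; only the interleaving varies.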
28
American Psych. Assoc. (APA) Election (1980)

5738 full ballots, 5 candidates (dataset from [Diaconis, ‘89]):
1. William Bevan
2. Ira Iscoe
3. Charles Kiesler
4. Max Siegle
5. Logan Wright

[Figure: probability of each of the 5! = 120 permutations; blue line: candidate {2} riffle independent of candidates {1,3,4,5}.]

Empirically, we can find approximate riffled independence in real datasets.
29
Parameter Counting

#(rankings): 5! = 120. Can we do better?

Item set decomposition {1,2,3,4,5} into {1,3,4,5} and {2}:
- Relative ranking of candidates {1,3,4,5}: 4! = 24
- Relative ranking of candidate {2}: 1! = 1
- Interleaving candidate {2} with remaining candidates: 5
Total # of model parameters < 30

Decomposing {1,3,4,5} further into {4,3} and {1,5}:
- Relative ranking of candidates {4,3}: 2! = 2
- Relative ranking of candidates {1,5}: 2! = 2
- Interleaving candidates {4,3} with candidates {1,5}: 6
Total # of model parameters < 16

Problem #1: Representation. Distributions which decompose into riffle independent factors can be represented using exponentially fewer parameters.
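The counts above follow a simple recursion over the hierarchy; a sketch, using the (assumed) encoding that a leaf is a tuple of items with a full distribution over its k! relative rankings, and an internal node [left, right] adds an interleaving distribution over C(n, n_left) merge patterns:

```python
import math

def n_items(h):
    # A leaf is a tuple of items; an internal node is a list [left, right].
    return len(h) if isinstance(h, tuple) else sum(n_items(c) for c in h)

def n_params(h):
    if isinstance(h, tuple):
        return math.factorial(len(h))     # full factor over the leaf's items
    left, right = h
    return (n_params(left) + n_params(right)
            + math.comb(n_items(h), n_items(left)))   # interleaving params

print(n_params((1, 2, 3, 4, 5)))            # flat model: 5! = 120
print(n_params([(1, 3, 4, 5), (2,)]))       # one split:  24 + 1 + 5 = 30
print(n_params([[(4, 3), (1, 5)], (2,)]))   # two splits: (2+2+6) + 1 + 5 = 16
```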
30
Hierarchical Decompositions: Drawing a Ranking (Food Preferences)

1. Rank fruits; rank vegetables
2. Interleave fruits/vegetables
3. Rank junk food
4. Interleave healthy foods with junk food

Problem: For the APA data, we don’t know the hierarchy!
31
Reverse Engineering the Hierarchy

Machine learning approach: use the structure that best explains the data.

Data (rankings such as ABCD, BACD, BADC, BDCA, CBDA, CABD, CBAD, CDBA, DCBA, DCAB, ADBC) → Structure Learning Algorithm → Hierarchy (e.g. {A,B,C,D} splits into {C,D,A} and {B}; {C,D,A} splits into {A} and {C,D}).

Core Problem: Given ranked data, determine whether subsets are riffle independent.
32
Measuring riffled independence [Huang, Guestrin, 2010]

Riffled independence: absolute rankings of Fruits are not informative about relative rankings within Vegetables.

Idea: measure independence between singleton rankings (preference over Fruit i) and pairwise rankings (relative preference over Vegetables j, k). If i and (j,k) lie on opposite sides of the split, mutual information = 0.
33
Tripletwise objective function

Measuring departure from riffled independence: minimize, over candidate splits (A, B), the sum of mutual informations between item i’s rank and the relative order of j and k, for triplets (i; j,k) crossing the split. Triplets lying entirely within one set play no role in the objective.

There are exponentially many possible splits, but an efficient minimization algorithm works with high probability.
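The per-triplet quantity can be estimated from samples; a plug-in sketch of the mutual information between the rank of item i and the indicator of j preceding k (near-zero values support riffled independence of i from the set containing j, k):

```python
import math
from collections import Counter

# Plug-in estimate of I(sigma(i); [sigma(j) < sigma(k)]) from sample
# rankings, where sigma maps items to ranks.
def mutual_information(rankings, i, j, k):
    joint = Counter()
    for sigma in rankings:
        joint[(sigma[i], sigma[j] < sigma[k])] += 1
    n = len(rankings)
    px, py = Counter(), Counter()
    for (x, y), c in joint.items():
        px[x] += c
        py[y] += c
    mi = 0.0
    for (x, y), c in joint.items():
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

from itertools import permutations
data = [tuple(p) for p in permutations(range(3))]   # uniform: independent
print(mutual_information(data, 0, 1, 2))            # → 0.0
```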
34
Learning Structure from APA Data

Learned hierarchy: {1,2,3,4,5} splits into {2} and {1,3,4,5}; {1,3,4,5} splits into {1,3} and {4,5}.

Candidates:
- {1,3} = research psychologists (1. William Bevan, 3. Charles Kiesler)
- {4,5} = clinical psychologists (4. Max Siegle, 5. Logan Wright)
- {2} = community psychologists (2. Ira Iscoe)

The hierarchy respects the political coalition structure of the APA!

# model parameters: 11

[Figure: “true” first order vs. hierarchical first order matrices (5 candidates x 5 ranks, values ~0–0.25).]
35
Structure learning with synthetic data

[Figure: log-likelihood (higher is better) vs. log10(# samples) for: true structure known, learned structure, and a random 1-chain ([Doignon et al, 2004]); 16 items, 4 items in each leaf.]

Theorem: Our algorithm recovers the riffle independent split with high probability, given polynomially many samples (under mild assumptions on connectivity). [Huang, Guestrin, 2011]
36
Irish Election (No Structure Learning)

Major parties riffle independent of minor parties?

[Figure: “true” first order probabilities vs. riffle independent approximation (14 candidates x 14 ranks; candidates FF, FG, I, GP, CS, SF, L; values ~0–0.25).]

The Sinn Fein and Christian Solidarity columns are not well captured by a single split!
37
Structure Learning on the Irish Election

Learned hierarchy:
- {1,…,14} splits into {1,…,11,13,14} and {12} (Sinn Fein)
- {1,…,11,13,14} splits into {11} (Christian Solidarity) and {1,…,10,13,14}
- {1,…,10,13,14} splits into {2,3,5,6,7,8,9,10,14} and {1,4,13} (Fianna Fail)
- {2,3,5,6,7,8,9,10,14} splits into {2,5,6} (Fine Gael) and {3,7,8,9,10,14} (Independents, Labour, Green)

# Parameters: full model ~87 billion; hierarchical model ~1000
Running time: brute force optimization 70.2s; our method 2.3s

[Figure: “true” first order vs. learned first order matrices (14 candidates x 14 ranks, values 0–0.25).]
38
Preference Analysis (for Sushi)

5000 preference rankings of 10 types of sushi.

Contenders:
1. Ebi (shrimp)
2. Anago (sea eel)
3. Maguro (tuna)
4. Ika (squid)
5. Uni (sea urchin)
6. Sake (salmon roe)
7. Tamago (egg)
8. Toro (fatty tuna)
9. Tekka-maki (tuna roll)
10. Kappa-maki (cucumber roll)

[Figure: first-order matrix, sushi x ranks (first to last).]
Fatty tuna (Toro) is a favorite! No one likes cucumber roll!
39
Sushi Hierarchy

- {1,…,10} splits into {2} (sea eel) and {1,3,4,5,6,7,8,9,10}
- {1,3,4,5,6,7,8,9,10} splits into {4} (squid) and {1,3,5,6,7,8,9,10}
- {1,3,5,6,7,8,9,10} splits into {5,6} (sea urchin, salmon roe) and {1,3,7,8,9,10}
- {1,3,7,8,9,10} splits into {1} (shrimp) and {3,7,8,9,10}
- {3,7,8,9,10} splits into {3,8,9} (tuna, fatty tuna, tuna roll) and {7,10} (egg, cucumber roll)
40
Contributions (table: Representation and Inference, for the Additive (Fourier) Decomposition and the Multiplicative (Riffle Independent) Decomposition)

Representation, Additive (Fourier) Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference, Additive (Fourier) Decomposition:
- Polytime Fourier domain conditioning algorithm for finite groups
- Approximation guarantee for low order observations
[NIPS07, JMLR09]

Representation, Multiplicative (Riffle Independent) Decomposition:
- Introduction of Hierarchical Riffled Independence models
- Structure learning algorithm with polynomial time/samples guarantee
[NIPS09, ICML10, EJS11]
41
Top-k Inference Problem

[Figure: histogram of the number of candidates specified (k = 2…14) vs. number of votes (0–20,000).]

Most voters rank just their top 3 or top 4 candidates.

Inference problem: Given an observation of a voter’s top-k rankings, infer his preferences over the remaining candidates.
42
Inference in Riffled Independent Models

Bayes Rule: posterior = likelihood x prior (normalized) – an O(n!) operation naively.

Decomposition | Can efficiently perform inference with
Fourier (Additive) | Low order likelihoods (observations depend on few items)
Riffle Independent (Multiplicative) | ????

Answer: Efficient inference is possible if and only if observations take the form of partial rankings (including top-k observations)!
43
The Top-1 Inference Problem

Bayes rule complexity: factorial in the number of items? Sometimes we can decompose the observation into smaller observations.

Under the hierarchy {all candidates} split into {1,2,3} (Fianna Fail) and {4,5,6,7,8} (other candidates), the observation “Candidate 3 (FF party) ranked in first place” decomposes as:
- Interleaving observation: a Fianna Fail candidate ranked in first place overall
- Fianna Fail observation: candidate 3 ranked first among FF candidates

Bayes rule complexity: linear in # of parameters.

Top-1 inference always decomposes into inference for each node in the hierarchical model.
44
Efficient inference for partial rankings

In general, there are many forms of partial rankings, allowing items to be tied:

- First place observations: G|ABCDEFH – “G in first place”
- Top-k observations: G|F|A|BCDEH – “G in first place, F in second, A in third”
- Approval voting observations: ACFG|BDEH – “Approve of candidates in {A,C,F,G}”

[Figure: approval-voting ballot marking candidates from Fine Gael, Fianna Fail, Sinn Fein, Independent, Green, Labour, Socialist.]
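Any of these observation types defines a consistency predicate on full rankings; conditioning can be sketched by brute force, which is the O(n!) baseline the thesis improves on (riffle independent models run in time linear in the number of parameters). A toy sketch with four hypothetical items:

```python
from itertools import permutations

# Brute-force Bayes conditioning on a partial ranking observation.
def condition(dist, consistent):
    post = {s: p for s, p in dist.items() if consistent(s)}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

items = "ABCD"
prior = {s: 1 / 24 for s in permutations(items)}   # uniform over 4! orderings

# Top-2 observation "A|B|CD": A in first place, B in second.
top2 = condition(prior, lambda s: s[0] == "A" and s[1] == "B")
print(len(top2))   # 2 consistent orderings remain (C, D in either order)
```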
45
Main Theorem [Huang, Kapoor, 2011]

Theorem: Any partial ranking observation is decomposable with respect to any hierarchy. I.e., inference for partial rankings is efficient, with running time linear in #(parameters).

[Figure: observations decomposable w.r.t. hierarchies H1, H2, H3; partial rankings lie in the intersection. But what’s out there beyond the intersection?]

Converse to Main Theorem: Every observation that decomposes with respect to all hierarchies takes the form of some partial ranking.
46
Learning with Top-k Votes (Irish Data)

[Figure: negative log-likelihood (~21,800–22,600; lower is better) for the riffle independent model and the nonparametric Mallows model [Lebanon, 2008], trained on full rankings only vs. full rankings + partial rankings.]

Using inference, we can efficiently build accurate, interpretable models of partial rankings.
47
Contributions (table: Representation and Inference, for the Additive Decomposition and the Multiplicative Decomposition)

Representation, Additive Decomposition:
- Polynomial storage for approximate distributions
- Low frequency = maintaining probabilities over small sets
[NIPS07, JMLR09]

Inference, Additive Decomposition:
- Polytime Fourier domain conditioning algorithm for finite groups
- Approximation guarantee for low order observations
[NIPS07, JMLR09]

Representation, Multiplicative Decomposition:
- Introduction of Hierarchical Riffled Independence models
- Structure learning algorithm with polynomial time/samples guarantee
[NIPS09, ICML10, EJS11]

Inference, Multiplicative Decomposition:
- Decomposability theorem for partial rankings
- Learning distributions with partial rankings
[NIPS-CSS10]

Algorithms for exploiting both decompositions for scalable inference [AISTATS08, NIPS09, EJS11]
48
Main Technical Contributions

- Fourier theoretic conditioning algorithm with projection to the marginal polytope [NIPS07, JMLR09]
- Fourier theoretic characterization of probabilistic independence [AISTATS07]
- Definition of riffled independence [NIPS09]
- Polynomial sample/time complexity structure learning algorithms [ICML10]
- Theoretical connection between efficient inference in riffle independent models and partial ranking [UAI11]
- Tractable model estimation algorithm with partial rankings [UAI11]
49
Thank You

Carlos Guestrin
Leo Guibas, John Lafferty, Drew Bagnell, Alex Smola
Ashish Kapoor, Eric Horvitz, Ali Rahimi
Risi Kondor, Marina Meila, Guy Lebanon, Tiberio Caetano, Xiaoye Jiang
SELECT Lab, Michelle Martin
Friends
Lucia Castellanos
Billy, Farn-lin, and Jonah Huang