population stratification with limited data by kamalika chaudhuri, eran halperin, satish rao and...

Population Stratification with Limited Data

By Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou

The Problem

Given: Samples from two hidden distributions P_1 and P_2

Unknown labels

Each sample/individual: k features with 0/1 values

Population P_1: feature f is 1 w.p. p^1_f

Population P_2: feature f is 1 w.p. p^2_f

Unknown feature probabilities

The Problem

Given: 2n samples from two hidden distributions P_1 and P_2

Unknown labels

Goal: Classify each individual correctly for most inputs

Applications

Preprocessing step in statistical analysis:
  Analyze the factors that cause a complex disease, such as cancer
  Cluster the samples into populations, then apply statistical analysis

Collaborative Filtering:
  A feature can be “likes Star Wars or not”
  Cluster users into types using the features

Our Results

Need some separation between the distributions!

Measure of separation γ: distance between means

γ = (L_1 distance between means) / k, or

γ = (L_2^2 distance between means) / k

Our Results:

Optimization function and poly-time algorithm: γk = Ω(√k log n)

Optimization function: γk = Ω(log n)

Our Results

This talk: Optimization function and poly-time algorithm: γk = Ω(√k log n)

Example:

P_1: For each feature f, p^1_f = ½

P_2: For each feature f, p^2_f = ½ + √log n/√k

Information-theoretically optimal: There exist two distributions with this separation and constant overlap in probability mass
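As a rough numerical check of the example, the sketch below computes both separation measures (L_1 and squared L_2 distance between the mean vectors, each divided by k) for p^1_f = ½ and p^2_f = ½ + √log n/√k. The helper names and the values of n and k are hypothetical, and the natural logarithm is an assumption, since the slides do not fix a base.

```python
import math

def separations(p1, p2):
    """Return (L_1 distance / k, squared L_2 distance / k) between the mean vectors."""
    k = len(p1)
    l1 = sum(abs(a - b) for a, b in zip(p1, p2))
    l2_sq = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return l1 / k, l2_sq / k

def example_instance(n, k):
    """The example: every feature has p^1_f = 1/2 and p^2_f = 1/2 + sqrt(log n)/sqrt(k)."""
    shift = math.sqrt(math.log(n)) / math.sqrt(k)  # natural log is an assumption
    if shift > 0.5:
        raise ValueError("k is too small for 1/2 + shift to be a probability")
    return [0.5] * k, [0.5 + shift] * k

n, k = 1000, 400  # illustrative sizes only
p1, p2 = example_instance(n, k)
sep_l1, sep_l2 = separations(p1, p2)
# k times the L_1 measure agrees with sqrt(k log n), up to rounding.
print(sep_l1 * k, math.sqrt(k * math.log(n)))
# k times the squared L_2 measure agrees with log n, up to rounding.
print(sep_l2 * k, math.log(n))
```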

Optimization Function

What measure to optimize to get the correct clustering?

Need a robust measure which works for small separations

A Robust Measure

Find the best balanced partition (S, S’) such that:

Σ_f |N_f(S) - N_f(S’)|

is maximum

N_f(S), N_f(S’): # of individuals with feature f in S, S’


Theorem: Optimizing this measure provides the correct partition w.h.p. if

γk = Ω(√k log n)

Proof Sketch:

How does the optimal partition behave?

(I) E[f(P)] = γkn + k√n

Pr[ |f(P) - E[f]| > n√k ] ≤ 2^-n

(II) E[f(Any partition)] = k√n

Pr[ |f(P) - E[f]| > n√k ] ≤ 2^-n

The partition with the optimal value of f in (I) dominates all the partitions in (II) w.h.p. for the separation conditions
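For very small inputs, the measure Σ_f |N_f(S) - N_f(S’)| can be evaluated on every balanced partition by brute force, which makes the "dominates" claim concrete. The sketch below is exponential in the number of individuals and is not the poly-time algorithm of the talk; the function names and the toy data are hypothetical.

```python
from itertools import combinations

def feature_counts(samples, members):
    """N_f(members) for every feature f: how many members have feature f."""
    k = len(samples[0])
    return [sum(samples[i][f] for i in members) for f in range(k)]

def objective(samples, S):
    """Sum over features f of |N_f(S) - N_f(S')|, where S' is the complement of S."""
    S = set(S)
    rest = [i for i in range(len(samples)) if i not in S]
    n_s = feature_counts(samples, S)
    n_rest = feature_counts(samples, rest)
    return sum(abs(a - b) for a, b in zip(n_s, n_rest))

def best_balanced_partition(samples):
    """Brute force over balanced partitions; assumes an even number of individuals."""
    m = len(samples)
    # Keep individual 0 in S so that each partition is enumerated once.
    candidates = combinations(range(1, m), m // 2 - 1)
    best = max(candidates, key=lambda c: objective(samples, (0,) + c))
    return sorted((0,) + best)

# Toy data: rows 0-2 mostly have the features, rows 3-5 mostly lack them.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
print(best_balanced_partition(toy), objective(toy, [0, 1, 2]))  # prints: [0, 1, 2] 8
```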

An Algorithm

How can we find the partition which optimizes this measure?

Theorem: There exists an algorithm which finds the correct partition when

γk = Ω(√k log2 n)

Running Time : O(nk log2 n)

An Algorithm

Algorithm:

1. Divide individuals into two sets: A and B

2. Start with a random partition of A

3. Iterate log n times:

   1. Classify B using the current partition of A and a proximity score

   2. Do the same for A, using the current partition of B

An Algorithm

Iterate:
  Classify B using current partition of A and a score
  And vice versa.

Random Partition: (1/2 + 1/√n) imbalance

Each iteration produces a partition with more imbalance

Classification Score

Our Score: For each feature f,

If N_f(S) > N_f(S’): add 1 to the score if f is present, else subtract 1

If N_f(S) < N_f(S’): add 1 to the score if f is absent, else subtract 1

Classify:

Individuals above the median score: S

Individuals below the median score: S’
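The score and the median split can be sketched as follows. Two details are assumptions, because the slides do not specify them: a feature with N_f(S) = N_f(S’) contributes nothing to the score, and the median split is done by rank (top half to S, bottom half to S’), so ties at the median are broken by sort order. The function names and toy data are hypothetical.

```python
def proximity_scores(samples, S, S_prime, targets, features):
    """Score each target individual against the partition (S, S')."""
    result = []
    for i in targets:
        score = 0
        for f in features:
            n_s = sum(samples[j][f] for j in S)
            n_sp = sum(samples[j][f] for j in S_prime)
            if n_s > n_sp:
                # f is more common in S: presence counts for S, absence against.
                score += 1 if samples[i][f] == 1 else -1
            elif n_s < n_sp:
                # f is more common in S': absence counts for S, presence against.
                score += 1 if samples[i][f] == 0 else -1
            # Tie: no contribution (assumption, not stated in the slides).
        result.append(score)
    return result

def classify(samples, S, S_prime, targets, features):
    """Top half of targets by score goes to S, bottom half to S' (rank-based median split)."""
    scores = proximity_scores(samples, S, S_prime, targets, features)
    ranked = [i for _, i in sorted(zip(scores, targets), key=lambda t: -t[0])]
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
# Partition of individuals 0, 1, 3, 4 is known; classify individuals 2 and 5.
print(proximity_scores(toy, [0, 1], [3, 4], [2, 5], range(4)))  # prints: [2, -2]
print(classify(toy, [0, 1], [3, 4], [2, 5], range(4)))          # prints: ([2], [5])
```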

Classification

Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c]

Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition with our separation conditions.

O(log n) rounds needed to get the correct partition

Use a fresh set of features in each round to get independence
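Putting the pieces together, the alternating procedure might look like the sketch below. It is not the authors' implementation. Assumptions made here: the number of rounds is ⌈log_2 m⌉ for m individuals, every classification step (not just every round) gets its own disjoint block of features, ties are scored as zero, and the median split is rank-based. All names and the example probabilities (0.3 and 0.7) are hypothetical.

```python
import math
import random

def classify(samples, S, S_prime, targets, features):
    """Split targets in half by proximity score to S, using only the given features."""
    sign = {}
    for f in features:
        n_s = sum(samples[i][f] for i in S)
        n_sp = sum(samples[i][f] for i in S_prime)
        sign[f] = (n_s > n_sp) - (n_s < n_sp)  # +1, -1, or 0 on a tie (assumption)

    def score(i):
        # Presence of f adds sign[f]; absence of f subtracts it.
        return sum(sign[f] * (1 if samples[i][f] else -1) for f in features)

    ranked = sorted(targets, key=score, reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

def stratify(samples, seed=0):
    """Alternately reclassify B from A's partition and A from B's partition."""
    rng = random.Random(seed)
    m, k = len(samples), len(samples[0])
    rounds = max(1, math.ceil(math.log2(m)))
    if k < 2 * rounds:
        raise ValueError("need at least 2 * rounds features for disjoint blocks")
    people = list(range(m))
    rng.shuffle(people)
    A, B = people[: m // 2], people[m // 2:]
    # A is already shuffled, so cutting it in half is a random partition of A.
    part_A = (A[: len(A) // 2], A[len(A) // 2:])
    feats = list(range(k))
    rng.shuffle(feats)
    # One disjoint feature block per classification step (assumption).
    blocks = [feats[j:: 2 * rounds] for j in range(2 * rounds)]
    for r in range(rounds):
        part_B = classify(samples, part_A[0], part_A[1], B, blocks[2 * r])
        part_A = classify(samples, part_B[0], part_B[1], A, blocks[2 * r + 1])
    return part_A[0] + part_B[0], part_A[1] + part_B[1]

# Illustrative run on synthetic data with hidden labels.
rng = random.Random(1)
n, k = 20, 120
labels = [0] * n + [1] * n
samples = [[int(rng.random() < (0.3 if y == 0 else 0.7)) for _ in range(k)]
           for y in labels]
S, S_prime = stratify(samples)
from_pop_2 = sum(labels[i] for i in S)
# Fraction of S that comes from its majority population.
print(len(S), len(S_prime), max(from_pop_2, len(S) - from_pop_2) / len(S))
```

Because every split is rank-based, both sides always have the same size; how pure they are depends on the separation, the number of features per block, and the random starting partition.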

Proof Sketch:

Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c]

Initially:

G ≈ (log n)

X, Y ≈ Bin(k, ½)

[Figure: Population 1 and Population 2, with G marked]

G = ( 2 k√n)

Proof Sketch:

Lemma: If the current partition has (1/2 + ε)-imbalance, the next iteration produces a partition with (1/2 + 2ε)-imbalance [for ε < c]

G = ( 2 k√n)

Pr[ Correct Classification ]

= ½ + Ga/√k /(½ + ½)

> ½ + 2ε

[From separation conditions]

Proof Sketch:

Lemma: If the current partition has (1/2 + c)-imbalance, the next iteration produces the correct partition with our separation conditions.


G = ( 2 k√n)

All but a 1/poly(n) fraction is correctly classified

Related Work

Learning Mixtures of Gaussians [D99]:
  Best performance by Spectral Algorithms [VW02, AM05, KSV05]

Our algorithm:
  Matches the bounds in [VW02] for two clusters
  Not a spectral algorithm!

Open Questions

How to extend our algorithm to work for multiple clusters?

What is the relationship between our algorithm and spectral algorithms?
  Matches spectral algorithms of [M01] for two-way graph partitioning
  Can our algorithm do better?

Thank You!
