Ramya Korlakai Vinayak · 2019-06-08
TRANSCRIPT
Maximum Likelihood Estimation for Learning Populations of Parameters
Ramya Korlakai Vinayak
joint work with Weihao Kong, Gregory Valiant, Sham Kakade
Paul G. Allen School of CSE
Postdoctoral Researcher
Poster #189 · [email protected]
Motivation: Large yet Sparse Data

Example: Flu data. Suppose that, for a large random subset of the population in California, we observe whether each person caught the flu in each of the last 5 years.

- Let p_i denote person i's probability of catching the flu (unknown); think of it as the bias of coin i.
- From t = 5 observations, e.g. x_i = {1 0 0 1 0} so x_i = 2, the empirical estimate is p̂_i = x_i / t = 0.4 ± 0.45.
- The population size is large, often hundreds of thousands or millions.
- The number of observations per individual is limited (sparse), prohibiting accurate estimation of the parameters of interest.
- Application domains: epidemiology, social sciences, psychology, medicine, biology.

Goal: Can we learn the distribution of the biases over the population?
Why? It is useful for downstream analysis: testing and estimating properties of the distribution.
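The ±0.45 above is the scale of a worst-case normal-approximation confidence interval with only t = 5 observations. A minimal sketch (the true probability 0.4 is a hypothetical value for illustration):

```python
import math
import random

random.seed(0)

t = 5            # observations (years) per person
p_true = 0.4     # hypothetical true flu probability for one person

# Empirical estimate from t Bernoulli observations.
x = sum(random.random() < p_true for _ in range(t))
p_hat = x / t

# Worst-case (p = 1/2) normal-approximation 95% half-width: this is
# roughly the +/- 0.45 on the slide, so per-person estimates are coarse.
half_width = 1.96 * math.sqrt(0.25 / t)
print(p_hat, round(half_width, 2))
```

With t = 5 the estimate can only take six values and its confidence interval nearly covers [0, 1], which is exactly why per-individual estimation is hopeless here.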
Model: Non-parametric Mixture of Binomials (Lord 1965, 1969)

[Figure: an example true distribution P* on [0, 1].]

- N independent coins, i = 1, 2, ..., N. Each coin has its own bias p_i ~ P*, drawn from the unknown true distribution P* on [0, 1].
- We get to observe t tosses of every coin. Observations: X_i ~ Bin(t, p_i) ∈ {0, 1, ..., t}. For example, with t = 5 tosses, x_i = {1 0 0 1 0} gives x_i = 2.
- Given {X_i}_{i=1}^N, return an estimate P̂ of P*.
- Error metric: the Wasserstein-1 distance (Earth Mover's Distance), W_1(P*, P̂).
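The generative model above is easy to simulate. A minimal sketch; the two-spike choice of P* and the sizes are assumptions of this example, not anything from the talk:

```python
import random

random.seed(1)

N, t = 100_000, 5   # many coins, few tosses each (the sparse regime)

def sample_bias():
    # Stand-in for the unknown P*: a two-spike mixture on {0.3, 0.7}.
    return 0.3 if random.random() < 0.5 else 0.7

# Each coin draws its bias from P*, then we observe X_i ~ Bin(t, p_i).
biases = [sample_bias() for _ in range(N)]
X = [sum(random.random() < p for _ in range(t)) for p in biases]
```

Only the counts X are handed to the estimator; the biases themselves stay hidden.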
Learning with Sparse Observations is Non-trivial

(N = number of coins; t = number of tosses per coin.)

- The empirical plug-in estimator is bad: P̂_plug-in = histogram{X_1/t, ..., X_i/t, ..., X_N/t} incurs an error of Θ(1/√t) when t ≪ N.
- Many recent works estimate symmetric properties of a discrete distribution from sparse observations (Paninski 2003; Valiant and Valiant 2011; Jiao et al. 2015; Orlitsky et al. 2016; Acharya et al. 2017; ...), but the setting in this work is different.
- Tian et al. 2017 proposed a moment-matching estimator that achieves the optimal error of O(1/t) when t < c log N. Its weakness is that it fails to obtain the optimal error when t > c log N, due to the higher variance of the larger moments.

What about the Maximum Likelihood Estimator?
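To see why the plug-in estimator stalls at the Θ(1/√t) scale, one can simulate the extreme toy case where P* is a point mass (an assumed choice for this sketch): the histogram of X_i/t stays spread out no matter how large N is.

```python
import random

random.seed(2)

N, t = 50_000, 5

# Toy extreme case: every coin is fair, so P* is a point mass at 0.5.
X = [sum(random.random() < 0.5 for _ in range(t)) for _ in range(N)]

# Plug-in estimate: histogram of X_i / t over the grid {0, 1/t, ..., 1}.
counts = [0] * (t + 1)
for x in X:
    counts[x] += 1

# W1 distance to a point mass is just the expected distance to that point.
w1 = sum(counts[s] * abs(s / t - 0.5) for s in range(t + 1)) / N
print(round(w1, 3))  # stays on the 1/sqrt(t) scale, however large N is
```

Increasing N sharpens the histogram around its limit, but that limit is Bin(t, 1/2)/t rather than the point mass, so the W1 error does not vanish with N.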
Maximum Likelihood Estimator

Sufficient statistic: the fingerprint vector h = [h_0, h_1, ..., h_s, ..., h_t], where h_s = (# coins that show s heads) / N, for s = 0, 1, ..., t.

[Figure: bar chart of the fingerprint h_s over s = 0, 1, ..., 5.]

P̂_mle ∈ argmin over Q ∈ dist[0, 1] of KL(observed h ‖ expected h under the distribution Q)

- NOT the empirical estimator.
- A convex optimization problem: efficient (polynomial time).
- Proposed in the late 1960s by Frederic Lord in the context of psychological testing; several works study the geometry, identifiability, and uniqueness of the MLE solution (Lord 1965, 1969; Turnbull 1976; Laird 1978; Lindsay 1983; Wood 1999).

How well does the MLE recover the distribution?
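One standard way to attack this convex program is EM over a fixed grid of candidate biases. The sketch below is a discretized stand-in, not the paper's exact procedure; the grid, the two-spike P*, and all sizes are assumptions of this example.

```python
import math
import random

random.seed(3)

# Simulated data: N coins, t tosses each, biases from a two-spike P*.
N, t = 20_000, 5
biases = [0.2 if random.random() < 0.5 else 0.8 for _ in range(N)]
X = [sum(random.random() < b for _ in range(t)) for b in biases]

# Sufficient statistic: fingerprint h_s = (# coins showing s heads) / N.
h = [0.0] * (t + 1)
for x in X:
    h[x] += 1.0 / N

# Binomial likelihood of each count s under each candidate bias p.
grid = [k / 100 for k in range(101)]
binom = [[math.comb(t, s) * p**s * (1 - p)**(t - s) for s in range(t + 1)]
         for p in grid]

# EM for the mixing weights q over the grid (drives down the KL objective).
q = [1.0 / len(grid)] * len(grid)
for _ in range(300):
    m = [sum(q[k] * binom[k][s] for k in range(len(grid)))  # expected fingerprint
         for s in range(t + 1)]
    q = [q[k] * sum(h[s] * binom[k][s] / m[s] for s in range(t + 1))
         for k in range(len(grid))]

mean_hat = sum(qk * p for qk, p in zip(q, grid))
print(round(mean_hat, 3))  # should land near the true mean bias 0.5
```

Each EM step reweights grid points by how well they explain the observed fingerprint, so the fitted mixing distribution concentrates near the spikes of P*.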
Main Results: MLE is Minimax Optimal in the Sparse Regime

Sparse regime: t ≪ N. (N = number of coins; t = number of tosses per coin.) The guarantees are non-asymptotic.

Theorem 1. With probability ≥ 1 − δ, the MLE achieves the following error bounds:
- Small sample regime: when t < c log N, W_1(P*, P̂_mle) = O_δ(1/t).
- Medium sample regime: when c log N ≤ t ≤ N^(2/9 − ε), W_1(P*, P̂_mle) = O_δ(1/√(t log N)).

Theorem 2. Matching minimax lower bounds:
inf over estimators f, sup over P of E[W_1(P, f(X))] ≥ Ω(1/t) ∨ Ω(1/√(t log N)).
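For intuition about the error metric in these bounds: for two distributions on [0, 1] supported on a common sorted grid, W_1 is the integral of the absolute CDF difference. A minimal sketch with hypothetical inputs:

```python
def w1_on_grid(p, q, grid):
    """W1 between two distributions with weights p and q on the same
    sorted grid: the integral of the absolute CDF difference."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(grid) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (grid[i + 1] - grid[i])
    return total

# Hypothetical inputs: a point mass at 0.5 vs. the uniform distribution
# over the grid {0, 0.1, ..., 1.0}.
grid = [k / 10 for k in range(11)]
point_mass = [1.0 if i == 5 else 0.0 for i in range(11)]
uniform = [1 / 11] * 11
print(w1_on_grid(point_mass, uniform, grid))  # 3/11 = 0.2727...
```

Because W_1 integrates the CDF gap, it rewards getting the bulk of the mass in roughly the right place even when no individual bias can be pinned down.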
Novel Proof: Polynomial Approximations

New bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions on [0, 1]:

f̂_t(x) = Σ_{j=0}^{t} b_j C(t, j) x^j (1 − x)^(t − j)

Performance on Real Data

[Figure: CDF vs. p plots for two datasets, "Political Leanings" and "Flight Delays", comparing MLE, TVK17, and the empirical estimator.]
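For context on the polynomial family in the display above: the classical Bernstein construction takes b_j = f(j/t) and converges uniformly for Lipschitz functions. The sketch below evaluates that classical choice on one Lipschitz-1 test function; it does not reproduce the paper's new coefficient bounds.

```python
import math

def bernstein(x, coeffs, t):
    """Evaluate sum_{j=0}^{t} b_j * C(t, j) * x^j * (1 - x)^(t - j)."""
    return sum(b * math.comb(t, j) * x**j * (1 - x)**(t - j)
               for j, b in enumerate(coeffs))

def f(x):
    return abs(x - 0.5)  # a Lipschitz-1 test function on [0, 1]

t = 60
coeffs = [f(j / t) for j in range(t + 1)]  # classical choice b_j = f(j/t)

err = max(abs(f(k / 200) - bernstein(k / 200, coeffs, t))
          for k in range(201))
print(err)  # uniform error; shrinks roughly like 1/sqrt(t)
```

The proof hinges on controlling the size of the coefficients b_j, not just the approximation error, which is what ties the polynomial approximation to the fingerprint-based likelihood.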
Summary

- Learning the distribution of parameters over a population, with sparse observations per individual.
- The MLE is minimax optimal even with sparse observations!
- Novel proof: new bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions.
- Performance on real data: [Figure: CDF plots for "Political Leanings" and "Flight Delays", comparing MLE, TVK17, and the empirical estimator.]

[email protected] · Poster #189