
Page 1:

Maximum Likelihood Estimation for Learning Populations of Parameters

Ramya Korlakai Vinayak

joint work with Weihao Kong, Gregory Valiant, Sham Kakade

Paul G. Allen School of CSE

Postdoctoral Researcher

Poster #189 · [email protected]


Pages 2-8:


Motivation: Large yet Sparse Data

• Population size is large, often hundreds of thousands or millions.

• The number of observations per individual is limited (sparse), prohibiting accurate estimation of the parameters of interest.

• Application domains: epidemiology, social sciences, psychology, medicine, biology.

Example: Flu data. Suppose that for a large random subset of the population of California, we observe whether each person caught the flu in each of the last 5 years.

Let $p_i$ denote the (unknown) probability that person $i$ catches the flu in a given year, i.e., the bias of coin $i$. For a person with observations $\{1, 0, 0, 1, 0\}$ we have $x_i = 2$ heads in $t = 5$ tosses, so the naive estimate is

$$\hat{p}_i = \frac{x_i}{t} = 0.4 \pm 0.45.$$
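The $\pm 0.45$ is consistent with (presumably) a two-standard-deviation interval for the naive estimator,

$$2\sqrt{\frac{\hat{p}_i(1 - \hat{p}_i)}{t}} = 2\sqrt{\frac{0.4 \times 0.6}{5}} \approx 0.44,$$

so with only $t = 5$ observations per person, the individual estimates are far too noisy to be useful on their own.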

Goal: Can we learn the distribution of the biases over the population?

Why? The distribution is useful for downstream analysis: testing and estimating its properties.


Pages 9-12:

Model: Non-parametric Mixture of Binomials

[Figure: the true distribution $P^\star$ on $[0, 1]$.]

• $N$ independent coins, $i = 1, 2, \ldots, N$. Each coin has its own bias drawn from the true distribution $P^\star$ (unknown): $p_i \sim P^\star$.

• We get to observe $t$ tosses for every coin. Observations: $X_i \sim \mathrm{Bin}(t, p_i) \in \{0, 1, \ldots, t\}$. For example, $t = 5$ tosses $\{1, 0, 0, 1, 0\}$ give $x_i = 2$.

• Given $\{X_i\}_{i=1}^N$, return an estimate $\hat{P}$ of $P^\star$.

(Lord 1965, 1969)

• Error metric: Wasserstein-1 distance (Earth Mover's Distance), $W_1\big(P^\star, \hat{P}\big)$.
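To make the generative model concrete, here is a minimal simulation sketch; the choice $P^\star = \mathrm{Beta}(2, 5)$ and all names are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

N, t = 100_000, 5             # many coins, very few tosses per coin
p = rng.beta(2, 5, size=N)    # biases p_i ~ P* (here P* = Beta(2, 5); unknown in practice)
X = rng.binomial(t, p)        # observations X_i ~ Bin(t, p_i)

# With t = 5, the naive per-coin estimates X_i / t take only 6 distinct values,
# so they cannot resolve a continuous P* no matter how large N is.
print(np.unique(X / t))       # [0.  0.2 0.4 0.6 0.8 1. ]
```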


Pages 13-16:

Learning with Sparse Observations is Non-trivial


• Many recent works estimate symmetric properties of a discrete distribution from sparse observations (Paninski 2003; Valiant and Valiant 2011; Jiao et al. 2015; Orlitsky et al. 2016; Acharya et al. 2017; …). The setting in this work is different.

• The empirical plug-in estimator is bad: $\hat{P}_{\text{plug-in}} = \text{histogram}\Big\{\frac{X_1}{t}, \ldots, \frac{X_i}{t}, \ldots, \frac{X_N}{t}\Big\}$. When $t \ll N$, it incurs error of order $\frac{1}{\sqrt{t}}$ (a numerical sketch follows after this slide).

• Tian et al. 2017 proposed a moment-matching-based estimator that achieves the optimal error of $O\big(\frac{1}{t}\big)$ when $t < c \log N$. Its weakness is that it fails to obtain the optimal error when $t > c \log N$, due to the higher variance of the larger moments.

What about the Maximum Likelihood Estimator?
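Before turning to the MLE, here is a rough numerical sketch of the plug-in baseline above, reusing the simulated data from the model sketch; the $\mathrm{Beta}(2, 5)$ choice of $P^\star$, the grid, and all names are illustrative assumptions. $W_1$ between two distributions on $[0, 1]$ equals the integral of the absolute difference of their CDFs.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
N, t = 100_000, 5
p = rng.beta(2, 5, size=N)              # biases p_i ~ P* = Beta(2, 5)
X = rng.binomial(t, p)                  # observations X_i ~ Bin(t, p_i)

grid = np.linspace(0.0, 1.0, 1001)
true_cdf = beta(2, 5).cdf(grid)         # CDF of P* (known here only because we simulated)

# Plug-in estimate: the empirical CDF of the naive per-coin estimates X_i / t.
plugin_cdf = np.searchsorted(np.sort(X / t), grid, side="right") / N

# W1 on [0, 1]: Riemann-sum approximation of the integral of |CDF difference|.
w1_plugin = np.mean(np.abs(true_cdf - plugin_cdf))
print(w1_plugin)                        # error of order 1/sqrt(t), not 1/t
```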


Pages 17-21:

Maximum Likelihood Estimator

Sufficient statistic: the fingerprint vector $h = [h_0, h_1, \ldots, h_s, \ldots, h_t]$, where

$$h_s = \frac{\#\,\text{coins that show } s \text{ heads}}{N}, \qquad s = 0, 1, \ldots, t.$$

[Figure: bar chart of the fingerprint $h_s$ against $s = 0, 1, \ldots, 5$.]

$$\hat{P}_{\text{mle}} \in \arg\min_{Q \in \text{dist}[0,1]} \mathrm{KL}\big(\text{observed } h \,\big\|\, \text{expected } h \text{ under the distribution } Q\big)$$

• NOT the empirical estimator.
• Convex optimization: efficient (polynomial time); a minimal solver sketch follows after this slide.
• Proposed in the late 1960s by Frederic Lord in the context of psychological testing. Several works study the geometry, identifiability, and uniqueness of the MLE solution (Lord 1965, 1969; Turnbull 1976; Laird 1978; Lindsay 1983; Wood 1999).

How well does the MLE recover the distribution?
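As promised above, here is a minimal sketch of one standard way to solve this program: an EM-style fixed-point iteration for the mixing weights over a fixed grid on $[0, 1]$. The grid size, iteration count, and all names are illustrative assumptions; the talk only asserts that the problem is convex and solvable in polynomial time. Minimizing the KL between the observed and expected fingerprints is equivalent to maximizing the fingerprint likelihood, and each EM step below increases that likelihood.

```python
import numpy as np
from scipy.stats import binom

def npmle_em(X, t, grid_size=200, iters=2000):
    """Sketch: nonparametric MLE of the mixing distribution on a grid."""
    N = len(X)
    h = np.bincount(X, minlength=t + 1) / N          # observed fingerprint h_s
    q = np.linspace(0.0, 1.0, grid_size)             # candidate biases
    B = binom.pmf(np.arange(t + 1)[:, None], t, q)   # B[s, j] = P(s heads | bias q_j)
    w = np.full(grid_size, 1.0 / grid_size)          # uniform initial weights
    for _ in range(iters):
        expected = np.maximum(B @ w, 1e-300)         # expected fingerprint under Q
        w *= B.T @ (h / expected)                    # multiplicative EM update; sum(w) stays 1
    return q, w                                      # support points and weights of P_hat

# Usage on the simulated data above: q, w = npmle_em(X, t);
# the estimated CDF is then np.cumsum(w) along q.
```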


Pages 22-23:

Main Results: MLE is Minimax Optimal in Sparse Regime

($N$ = number of coins, $t$ = number of tosses per coin; sparse regime: $t \ll N$. The guarantees are non-asymptotic.)

Theorem 1. The MLE achieves the following error bounds, each holding with probability $\geq 1 - \delta$:

• Small sample regime: when $t < c \log N$,
$$W_1\big(P^\star, \hat{P}_{\text{mle}}\big) = O_\delta\Big(\frac{1}{t}\Big).$$

• Medium sample regime: when $c \log N \leq t \leq N^{2/9 - \epsilon}$,
$$W_1\big(P^\star, \hat{P}_{\text{mle}}\big) = O_\delta\Big(\frac{1}{\sqrt{t \log N}}\Big).$$

Theorem 2 (matching minimax lower bound).

$$\inf_f \sup_P \, \mathbb{E}\big[W_1(P, f(X))\big] \geq \Omega\Big(\frac{1}{t}\Big) \vee \Omega\Big(\frac{1}{\sqrt{t \log N}}\Big).$$
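To see why the regimes split at $t \approx \log N$, compare the two terms of the lower bound:

$$\frac{1}{t} \geq \frac{1}{\sqrt{t \log N}} \iff t \log N \geq t^2 \iff t \leq \log N,$$

so the $1/t$ term binds when $t \lesssim \log N$ and the $1/\sqrt{t \log N}$ term takes over beyond that, matching the two cases of Theorem 1 up to constants.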


Pages 24-25:

Novel Proof: Polynomial Approximations

New bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions on $[0, 1]$:

$$\hat{f}_t(x) = \sum_{j=0}^{t} b_j \binom{t}{j} x^j (1 - x)^{t - j} \qquad \text{(Bernstein polynomial)}$$

Performance on Real Data

[Figure: estimated CDFs on two real datasets, Political Leanings and Flight Delays, comparing MLE, TVK17, and the empirical estimator.]
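As a small illustration of the objects in the proof, here is a sketch that evaluates a Bernstein polynomial from its coefficients $b_j$; the target $f(x) = |x - 1/2|$ and the classical choice $b_j = f(j/t)$ are illustrative assumptions, not the new coefficient bounds from the paper.

```python
import numpy as np
from scipy.stats import binom

def bernstein(b, x):
    """Evaluate f_t(x) = sum_j b_j * C(t, j) * x^j * (1 - x)^(t - j)."""
    t = len(b) - 1
    # binom.pmf(j, t, x) is exactly the Bernstein basis C(t, j) x^j (1 - x)^(t - j).
    basis = binom.pmf(np.arange(t + 1)[:, None], t, np.asarray(x, dtype=float))
    return b @ basis

t = 50
f = lambda x: np.abs(x - 0.5)        # a Lipschitz-1 function on [0, 1]
b = f(np.arange(t + 1) / t)          # classical coefficients b_j = f(j / t)

x = np.linspace(0.0, 1.0, 5)
print(bernstein(b, x))               # approximates f(x), sup error of order 1/sqrt(t)
```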

Page 26:

Summary

Learning the distribution of parameters over a population with sparse observations per individual:

• The MLE is minimax optimal even with sparse observations!
• Novel proof: new bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions.
• Performance on real data. [Figure: the same CDF comparisons (Political Leanings, Flight Delays) as above.]

[email protected] · Poster #189