Ramya Korlakai Vinayak · 2019-06-08
TRANSCRIPT
Maximum Likelihood Estimation for Learning Populations of Parameters
Ramya Korlakai Vinayak
joint work with Weihao Kong, Gregory Valiant, Sham Kakade
Paul G. Allen School of CSE
Postdoctoral Researcher
Poster #189 · [email protected]
Motivation: Large yet Sparse Data

Example: Flu data. Suppose that, for a large random subset of the population in California, we observe whether each person caught the flu in each of the last 5 years.

- Let p_i denote person i's probability of catching the flu (unknown); think of it as the bias of coin i.
- From t = 5 observations, e.g. x_i = {1 0 0 1 0} so x_i = 2, the empirical estimate is p̂_i = x_i / t = 0.4 ± 0.45.
- The population size is large, often hundreds of thousands or millions.
- The number of observations per individual is limited (sparse), prohibiting accurate estimation of the parameters of interest.
- Application domains: epidemiology, social sciences, psychology, medicine, biology.

Goal: Can we learn the distribution of the biases over the population?
Why? It is useful for downstream analysis: testing and estimating properties of the distribution.
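The ±0.45 above is the scale of a worst-case normal-approximation confidence interval with only t = 5 observations. A minimal sketch (the true probability 0.4 is a hypothetical value for illustration):

```python
import math
import random

random.seed(0)

t = 5            # observations (years) per person
p_true = 0.4     # hypothetical true flu probability for one person

# Empirical estimate from t Bernoulli observations.
x = sum(random.random() < p_true for _ in range(t))
p_hat = x / t

# Worst-case (p = 1/2) normal-approximation 95% half-width: this is
# roughly the +/- 0.45 on the slide, so per-person estimates are coarse.
half_width = 1.96 * math.sqrt(0.25 / t)
print(p_hat, round(half_width, 2))
```

With t = 5 the estimate can only take six values and its confidence interval nearly covers [0, 1], which is exactly why per-individual estimation is hopeless here.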
Model: Non-parametric Mixture of Binomials (Lord 1965, 1969)

[Figure: an example true distribution P* on [0, 1].]

- N independent coins, i = 1, 2, ..., N. Each coin has its own bias p_i ~ P*, drawn from the unknown true distribution P* on [0, 1].
- We get to observe t tosses of every coin. Observations: X_i ~ Bin(t, p_i) ∈ {0, 1, ..., t}. For example, with t = 5 tosses, x_i = {1 0 0 1 0} gives x_i = 2.
- Given {X_i}_{i=1}^N, return an estimate P̂ of P*.
- Error metric: the Wasserstein-1 distance (Earth Mover's Distance), W_1(P*, P̂).
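The generative model above is easy to simulate. A minimal sketch; the two-spike choice of P* and the sizes are assumptions of this example, not anything from the talk:

```python
import random

random.seed(1)

N, t = 100_000, 5   # many coins, few tosses each (the sparse regime)

def sample_bias():
    # Stand-in for the unknown P*: a two-spike mixture on {0.3, 0.7}.
    return 0.3 if random.random() < 0.5 else 0.7

# Each coin draws its bias from P*, then we observe X_i ~ Bin(t, p_i).
biases = [sample_bias() for _ in range(N)]
X = [sum(random.random() < p for _ in range(t)) for p in biases]
```

Only the counts X are handed to the estimator; the biases themselves stay hidden.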
Learning with Sparse Observations is Non-trivial

(N = number of coins; t = number of tosses per coin.)

- The empirical plug-in estimator is bad: P̂_plug-in = histogram{X_1/t, ..., X_i/t, ..., X_N/t} incurs an error of Θ(1/√t) when t ≪ N.
- Many recent works estimate symmetric properties of a discrete distribution from sparse observations (Paninski 2003; Valiant and Valiant 2011; Jiao et al. 2015; Orlitsky et al. 2016; Acharya et al. 2017; ...), but the setting in this work is different.
- Tian et al. 2017 proposed a moment-matching estimator that achieves the optimal error of O(1/t) when t < c log N. Its weakness is that it fails to obtain the optimal error when t > c log N, due to the higher variance of the larger moments.

What about the Maximum Likelihood Estimator?
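To see why the plug-in estimator stalls at the Θ(1/√t) scale, one can simulate the extreme toy case where P* is a point mass (an assumed choice for this sketch): the histogram of X_i/t stays spread out no matter how large N is.

```python
import random

random.seed(2)

N, t = 50_000, 5

# Toy extreme case: every coin is fair, so P* is a point mass at 0.5.
X = [sum(random.random() < 0.5 for _ in range(t)) for _ in range(N)]

# Plug-in estimate: histogram of X_i / t over the grid {0, 1/t, ..., 1}.
counts = [0] * (t + 1)
for x in X:
    counts[x] += 1

# W1 distance to a point mass is just the expected distance to that point.
w1 = sum(counts[s] * abs(s / t - 0.5) for s in range(t + 1)) / N
print(round(w1, 3))  # stays on the 1/sqrt(t) scale, however large N is
```

Increasing N sharpens the histogram around its limit, but that limit is Bin(t, 1/2)/t rather than the point mass, so the W1 error does not vanish with N.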
Maximum Likelihood Estimator

Sufficient statistic: the fingerprint vector h = [h_0, h_1, ..., h_s, ..., h_t], where h_s = (# coins that show s heads) / N, for s = 0, 1, ..., t.

[Figure: bar chart of the fingerprint h_s over s = 0, 1, ..., 5.]

P̂_mle ∈ argmin over Q ∈ dist[0, 1] of KL(observed h ‖ expected h under the distribution Q)

- NOT the empirical estimator.
- A convex optimization problem: efficient (polynomial time).
- Proposed in the late 1960s by Frederic Lord in the context of psychological testing; several works study the geometry, identifiability, and uniqueness of the MLE solution (Lord 1965, 1969; Turnbull 1976; Laird 1978; Lindsay 1983; Wood 1999).

How well does the MLE recover the distribution?
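One standard way to attack this convex program is EM over a fixed grid of candidate biases. The sketch below is a discretized stand-in, not the paper's exact procedure; the grid, the two-spike P*, and all sizes are assumptions of this example.

```python
import math
import random

random.seed(3)

# Simulated data: N coins, t tosses each, biases from a two-spike P*.
N, t = 20_000, 5
biases = [0.2 if random.random() < 0.5 else 0.8 for _ in range(N)]
X = [sum(random.random() < b for _ in range(t)) for b in biases]

# Sufficient statistic: fingerprint h_s = (# coins showing s heads) / N.
h = [0.0] * (t + 1)
for x in X:
    h[x] += 1.0 / N

# Binomial likelihood of each count s under each candidate bias p.
grid = [k / 100 for k in range(101)]
binom = [[math.comb(t, s) * p**s * (1 - p)**(t - s) for s in range(t + 1)]
         for p in grid]

# EM for the mixing weights q over the grid (drives down the KL objective).
q = [1.0 / len(grid)] * len(grid)
for _ in range(300):
    m = [sum(q[k] * binom[k][s] for k in range(len(grid)))  # expected fingerprint
         for s in range(t + 1)]
    q = [q[k] * sum(h[s] * binom[k][s] / m[s] for s in range(t + 1))
         for k in range(len(grid))]

mean_hat = sum(qk * p for qk, p in zip(q, grid))
print(round(mean_hat, 3))  # should land near the true mean bias 0.5
```

Each EM step reweights grid points by how well they explain the observed fingerprint, so the fitted mixing distribution concentrates near the spikes of P*.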
Main Results: MLE is Minimax Optimal in the Sparse Regime

Sparse regime: t ≪ N. (N = number of coins; t = number of tosses per coin.) The guarantees are non-asymptotic.

Theorem 1. With probability ≥ 1 − δ, the MLE achieves the following error bounds:
- Small sample regime: when t < c log N, W_1(P*, P̂_mle) = O_δ(1/t).
- Medium sample regime: when c log N ≤ t ≤ N^(2/9 − ε), W_1(P*, P̂_mle) = O_δ(1/√(t log N)).

Theorem 2. Matching minimax lower bounds:
inf over estimators f, sup over P of E[W_1(P, f(X))] ≥ Ω(1/t) ∨ Ω(1/√(t log N)).
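For intuition about the error metric in these bounds: for two distributions on [0, 1] supported on a common sorted grid, W_1 is the integral of the absolute CDF difference. A minimal sketch with hypothetical inputs:

```python
def w1_on_grid(p, q, grid):
    """W1 between two distributions with weights p and q on the same
    sorted grid: the integral of the absolute CDF difference."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(grid) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (grid[i + 1] - grid[i])
    return total

# Hypothetical inputs: a point mass at 0.5 vs. the uniform distribution
# over the grid {0, 0.1, ..., 1.0}.
grid = [k / 10 for k in range(11)]
point_mass = [1.0 if i == 5 else 0.0 for i in range(11)]
uniform = [1 / 11] * 11
print(w1_on_grid(point_mass, uniform, grid))  # 3/11 = 0.2727...
```

Because W_1 integrates the CDF gap, it rewards getting the bulk of the mass in roughly the right place even when no individual bias can be pinned down.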
Novel Proof: Polynomial Approximations

New bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions on [0, 1]:

f̂_t(x) = Σ_{j=0}^{t} b_j C(t, j) x^j (1 − x)^(t − j)

Performance on Real Data

[Figure: CDF vs. p plots for two datasets, "Political Leanings" and "Flight Delays", comparing MLE, TVK17, and the empirical estimator.]
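For context on the polynomial family in the display above: the classical Bernstein construction takes b_j = f(j/t) and converges uniformly for Lipschitz functions. The sketch below evaluates that classical choice on one Lipschitz-1 test function; it does not reproduce the paper's new coefficient bounds.

```python
import math

def bernstein(x, coeffs, t):
    """Evaluate sum_{j=0}^{t} b_j * C(t, j) * x^j * (1 - x)^(t - j)."""
    return sum(b * math.comb(t, j) * x**j * (1 - x)**(t - j)
               for j, b in enumerate(coeffs))

def f(x):
    return abs(x - 0.5)  # a Lipschitz-1 test function on [0, 1]

t = 60
coeffs = [f(j / t) for j in range(t + 1)]  # classical choice b_j = f(j/t)

err = max(abs(f(k / 200) - bernstein(k / 200, coeffs, t))
          for k in range(201))
print(err)  # uniform error; shrinks roughly like 1/sqrt(t)
```

The proof hinges on controlling the size of the coefficients b_j, not just the approximation error, which is what ties the polynomial approximation to the fingerprint-based likelihood.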
Summary

- Learning the distribution of parameters over a population, with sparse observations per individual.
- The MLE is minimax optimal even with sparse observations!
- Novel proof: new bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions.
- Performance on real data: [Figure: CDF plots for "Political Leanings" and "Flight Delays", comparing MLE, TVK17, and the empirical estimator.]

[email protected] · Poster #189