statistical paradises and paradoxes in big data (i): law ... · menu 1 xiao-li meng department of...
TRANSCRIPT
Menu 1
Xiao-Li Meng, Department of Statistics, Harvard University
Motivation
Soup
“Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Statistical Paradises and Paradoxes in Big Data (I):
Law of Large Populations, Big Data Paradox, and the 2016 US Election
Xiao-Li Meng, Department of Statistics, Harvard University
Meng (2018). Annals of Applied Statistics, No. 2, 685-726. https://statistics.fas.harvard.edu/people/xiao-li-meng
Many thanks to Stephen Ansolabehere and Shiro Kuriwaki for the CCES (Cooperative Congressional Election Study) data and analysis on the 2016 US election.
Motivating questions
We know that a 5% random sample is better than a 5% non-random sample in measurable ways (e.g., bias, predictive power).
But is an 80% non-random sample “better” than a 5% random sample in measurable terms? 90%? 95%? 99%? (Jeremy Wu of US Census Bureau, 2012, Seminar at Harvard Statistics)
“Which one should we trust more: a 1% survey with 60% response rate or a non-probabilistic dataset covering 80% of the population?” (Keiding and Louis, 2015, Joint Statistical Meetings; and JRSSB, 2016)
A Bit of History: Theory and Practice
Law of Large Numbers: Jakob Bernoulli (1713)
Central Limit Theorem: Abraham de Moivre (1733): $\text{error} \propto 1/\sqrt{n}$, where $n$ is the sample size
Survey Sampling:
Graunt (1662); Laplace (1882)
The “intellectually violent revolution” in 1895 by Anders Kiær, Statistics Norway
Landmark paper: Jerzy Neyman (1934)
The “revolution” lasted about 50 years (Jelke Bethlehem, 2009)
First implementation in US Census: 1940, led by Morris Hansen
When/why can we ignore the population size?
Think about tasting soup ...
Stir it well, then a few bits are sufficient regardless of the size of the container!
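The soup analogy can be checked with a small simulation. The sketch below is illustrative only: the binary population, the share `p = 0.46`, the seed, and the function name `srs_rmse` are assumptions, not part of the talk. It draws simple random samples of the same size n from populations of very different sizes N and reports the root-mean-squared error of the sample mean.

```python
import numpy as np

def srs_rmse(N, n, p=0.46, reps=2000, seed=0):
    """RMSE of the sample mean over repeated simple random samples (no replacement)."""
    rng = np.random.default_rng(seed)
    # Hypothetical population: round(p * N) ones, the rest zeros
    k = int(round(p * N))
    mu = k / N
    # The number of ones in an SRS of size n is hypergeometric
    ones_in_sample = rng.hypergeometric(ngood=k, nbad=N - k, nsample=n, size=reps)
    errors = ones_in_sample / n - mu
    return float(np.sqrt(np.mean(errors ** 2)))

# Same sample size, population sizes spanning two orders of magnitude
rmse_by_N = {N: srs_rmse(N, n=1000) for N in (10**5, 10**6, 10**7)}
for N, rmse in rmse_by_N.items():
    print(f"N = {N:>10,d}  n = 1000  RMSE = {rmse:.4f}")
```

Under these assumptions the three RMSE values should be close to one another even though N changes by a factor of 100: once the soup is well stirred, only the size of the spoon matters.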
2016 US Presidential Election
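$n$: number of respondents to an election survey
$N$: number of (actual) voters in US
$X_j = 1$: plan to vote for Trump; $X_j = 0$ otherwise
$R_j = 1$: report (honestly) voting plan; $R_j = 0$ otherwise
Estimating Trump’s share $\mu_N = \mathrm{Ave}(X_j)$ by the sample average:
$$\hat{\mu}_n = \frac{R_1 X_1 + \cdots + R_N X_N}{n} = \frac{\mathrm{Ave}(R_j X_j)}{\mathrm{Ave}(R_j)}$$
Actual estimation error:
$$\hat{\mu}_n - \mu_N = \frac{\mathrm{Ave}(R_j X_j)}{\mathrm{Ave}(R_j)} - \mathrm{Ave}(X_j) = \left[\frac{\mathrm{Ave}(R_j X_j) - \mathrm{Ave}(R_j)\,\mathrm{Ave}(X_j)}{\sigma_R \sigma_X}\right] \times \frac{\sigma_R}{\mathrm{Ave}(R_j)} \times \sigma_X$$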
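The error identity is purely algebraic, so it can be verified numerically on any finite population. The following is a minimal sketch with made-up inputs: the population size, the 46% share, and the response probabilities are hypothetical and chosen only so that R and X are slightly correlated. Ave is taken over all N units, and the standard deviations use the divide-by-N convention.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000  # hypothetical population size

# Hypothetical population: X_j = 1 for a Trump voter, 0 otherwise
X = rng.binomial(1, 0.46, size=N)
# Hypothetical response mechanism: Trump voters respond slightly less often
R = rng.binomial(1, np.where(X == 1, 0.009, 0.011))

n = R.sum()
mu_N = X.mean()                 # Ave(X_j)
mu_hat = (R * X).sum() / n      # Ave(R_j X_j) / Ave(R_j)

# Population (divide-by-N) standard deviations and correlation
sigma_R, sigma_X = R.std(), X.std()
rho = ((R * X).mean() - R.mean() * X.mean()) / (sigma_R * sigma_X)

actual_error = mu_hat - mu_N
product = rho * (sigma_R / R.mean()) * sigma_X
print(actual_error, product)
```

The two printed numbers should agree up to floating-point rounding, whatever response mechanism is plugged in.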
Data quality, quantity, and uncertainty
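Because $\sigma_R^2 = f(1 - f)$ with $f = \mathrm{Ave}(R_j) = n/N$, we have
$$\text{Error} = \underbrace{\hat{\rho}_{R,X}}_{\text{Data Quality}} \times \underbrace{\sqrt{\frac{N - n}{n}}}_{\text{Data Quantity}} \times \underbrace{\sigma_X}_{\text{Problem Difficulty}}$$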
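To get a feel for how the three factors trade off, the product can be evaluated directly. This is a sketch under assumed inputs: the values of N, the correlation `rho_hat`, and `sigma_X` below are hypothetical, and `estimation_error` is an illustrative helper name.

```python
import math

def estimation_error(rho_hat, n, N, sigma_X):
    """Error = data quality x data quantity x problem difficulty."""
    data_quantity = math.sqrt((N - n) / n)
    return rho_hat * data_quantity * sigma_X

# Hypothetical inputs, chosen only for illustration
N = 100_000_000
sigma_X = 0.5
rho_hat = -0.005

errors = {n: estimation_error(rho_hat, n, N, sigma_X)
          for n in (1_000_000, 10_000_000, 50_000_000)}
for n, err in errors.items():
    print(f"f = {n / N:.2f}  error = {err:+.4f}")
```

With the data quality held fixed, the error shrinks only through the factor $\sqrt{(N - n)/n}$, that is, through the sampling fraction $f$ rather than through $n$ alone.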
Data Defect Index (d.d.i.)
Mean Squared Error (MSE):
$$\mathrm{MSE}(\hat{\mu}_n) = E_R(\hat{\rho}^2) \times \frac{N - n}{n} \times \sigma_X^2$$
Data Defect Index (d.d.i.): $D_I = E_R(\hat{\rho}^2)$
For Simple Random Sample (SRS): $D_I = (N - 1)^{-1}$
For probabilistic samples in general: $D_I \propto N^{-1}$
Deep trouble when $D_I$ does not vanish with $N^{-1}$;
or equivalently when $\hat{\rho}$ does not vanish with $N^{-1/2}$ ...
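The SRS value $D_I = (N - 1)^{-1}$ can be checked by Monte Carlo. The sketch below assumes a small hypothetical binary population (N = 200, n = 50, share 0.4); these numbers, the seed, and the number of replications are illustrative choices. It averages the squared correlation between the sampling indicator R and X over repeated simple random samples.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, reps = 200, 50, 5000

# Hypothetical fixed binary population
X = rng.binomial(1, 0.4, size=N)

rho_sq = np.empty(reps)
for i in range(reps):
    R = np.zeros(N)
    R[rng.choice(N, size=n, replace=False)] = 1.0  # simple random sample
    rho_sq[i] = np.corrcoef(R, X)[0, 1] ** 2

ddi_estimate = rho_sq.mean()
print(ddi_estimate, 1 / (N - 1))
```

The Monte Carlo average should land near $1/(N - 1)$, up to simulation noise. A selection mechanism that depends on X would instead keep the average bounded away from zero as N grows.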
A Law of Large Populations (LLP)
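If $\rho = E_R(\hat{\rho}) \neq 0$, then on average the relative error grows with $\sqrt{N}$:
$$\frac{\text{Actual Error}}{\text{Benchmark SRS Standard Error}} = \sqrt{N - 1}\,\hat{\rho}$$
The (lack-of) design effect (Deff):
$$\text{Deff} = \frac{\text{MSE}}{\text{Benchmark SRS MSE}} = (N - 1) D_I$$
Paradigm shift for “Big Data”:
From $\underbrace{\sigma/\sqrt{n}}_{\text{random error}}$ to $\underbrace{\hat{\rho}\sqrt{N}}_{\text{relative systematic bias}}$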
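A short calculation makes the growth with $\sqrt{N}$ concrete. The helper names and the value `rho_hat = 0.005` are assumptions for illustration; only the two formulas come from the slide.

```python
import math

def relative_error(rho_hat, N):
    """Actual error over the benchmark SRS standard error: sqrt(N - 1) * rho_hat."""
    return math.sqrt(N - 1) * rho_hat

def design_effect(ddi, N):
    """Deff = MSE / benchmark SRS MSE = (N - 1) * D_I."""
    return (N - 1) * ddi

rho_hat = 0.005  # hypothetical data defect correlation
ratios = {N: relative_error(rho_hat, N) for N in (10**4, 10**6, 10**8)}
for N, r in ratios.items():
    print(f"N = {N:>11,d}  error / SRS standard error = {r:.2f}")
```

For a fixed nonzero correlation, each hundredfold increase in the population size multiplies the ratio by ten, so a nominal SRS confidence interval becomes increasingly misleading as N grows.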
Effective Sample Size
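The Effective Sample Size $n_{\text{eff}}$ of a “Big Data” set
Equate its MSE to that from an SRS of size $n_{\text{eff}}$:
$$D_I \left[\frac{N - n}{n}\right] \sigma^2 = \frac{1}{N - 1} \left[\frac{N - n_{\text{eff}}}{n_{\text{eff}}}\right] \sigma^2$$
What matters is the relative size $f = n/N$:
$$n_{\text{eff}} = \frac{n}{1 + (1 - f)[(N - 1) D_I - 1]} \approx \frac{f}{1 - f} \cdot \frac{1}{\hat{\rho}^2}$$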
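Both the exact expression and its approximation are easy to evaluate. The sketch below uses hypothetical inputs: a population of 100 million, a data set covering 80% of it, and `rho_hat = 0.005`, with $D_I$ replaced by $\hat{\rho}^2$ as in the approximation. The function names are illustrative.

```python
def effective_sample_size(n, N, ddi):
    """Exact n_eff from equating the big-data MSE to an SRS MSE."""
    f = n / N
    return n / (1 + (1 - f) * ((N - 1) * ddi - 1))

def effective_sample_size_approx(n, N, rho_hat):
    """Approximation f / (1 - f) / rho_hat**2, valid when (N - 1) * D_I >> 1."""
    f = n / N
    return f / (1 - f) / rho_hat ** 2

# Hypothetical inputs: a data set covering 80% of the population
N = 100_000_000
n = 80_000_000
rho_hat = 0.005

n_eff_exact = effective_sample_size(n, N, rho_hat ** 2)
n_eff_approx = effective_sample_size_approx(n, N, rho_hat)
print(round(n_eff_exact), round(n_eff_approx))
```

With these assumed numbers, 80 million records carry roughly the information of a simple random sample of about 160,000. As a sanity check, plugging in the SRS value $D_I = 1/(N - 1)$ returns $n_{\text{eff}} = n$.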
![Page 42: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/42.jpg)
Gaining 2020 Vision: Assessing the behavioral ρ̂ using validated voter counts (≈ 35,000)
CCES: Cooperative Congressional Election Study (conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers on Oct 4 - Nov 6, 2016 (YouGov); analysis assisted by Shiro Kuriwaki)
[Figure, two panels: validated-voter poll estimate (vertical axis, 0% to 100%) against final popular vote share (horizontal axis, 0% to 100%), with regions annotated “Poll underestimated” and “Poll overestimated” support.]
Clinton panel: Root Mean Squared Error: 0.07. Reasonable predictions for Clinton’s vote share.
Trump panel: Root Mean Squared Error: 0.15. Serious underestimation of Trump’s vote share.
![Page 45: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/45.jpg)
Assessing ρ̂ using state-level data
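Let $\mu_N$ be the true share and $\hat\mu_n$ the estimated share. Then
$$\hat\rho = \frac{\hat\mu_n - \mu_N}{\sqrt{\frac{N-n}{n}\,\sigma^2}}, \qquad \sigma^2 = \mu_N(1-\mu_N)$$
[Figure, two histograms of the state-level values: horizontal axes “Clinton $\rho_N$” and “Trump $\rho_N$” from −0.010 to 0.010, vertical axis Count from 0 to 12.]
Clinton: $\hat\rho \approx -0.0002 \pm 0.0006$ (panel annotation: −0.00021 ± 0.00061)
Trump: $\hat\rho \approx -0.0045 \pm 0.0006$ (panel annotation: −0.0045 ± 0.00056)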
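
The formula for $\hat\rho$ can be checked on a synthetic finite population: backing $\hat\rho$ out of the estimation error should reproduce the correlation between the response indicator and the vote indicator. The sketch below is an illustration only; the population, the response rates and all function names are hypothetical and are not the CCES data.

```python
import math
import random


def rho_hat_from_error(mu_hat: float, mu_true: float, n: int, N: int) -> float:
    """Back out rho-hat from the error of an estimated binary share."""
    sigma2 = mu_true * (1.0 - mu_true)
    return (mu_hat - mu_true) / math.sqrt((N - n) / n * sigma2)


def population_corr(x, y) -> float:
    """Pearson correlation over a finite population (divide-by-N moments)."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / m
    vx = sum((a - mx) ** 2 for a in x) / m
    vy = sum((b - my) ** 2 for b in y) / m
    return cov / math.sqrt(vx * vy)


# Synthetic population (all numbers hypothetical): G = votes for the candidate,
# R = responds to the poll; supporters respond slightly less often.
rng = random.Random(0)
N = 20_000
G = [1 if rng.random() < 0.46 else 0 for _ in range(N)]
R = [1 if rng.random() < (0.05 if g else 0.06) else 0 for g in G]

n = sum(R)
mu_true = sum(G) / N
mu_hat = sum(g for g, r in zip(G, R) if r) / n

via_error = rho_hat_from_error(mu_hat, mu_true, n, N)
direct = population_corr(R, G)
print(f"n={n}, rho-hat from error={via_error:.6f}, corr(R, G)={direct:.6f}")
```

The two numbers agree up to floating-point rounding, because the error decomposition is an algebraic identity rather than an approximation.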
![Page 48: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/48.jpg)
What’s the implication of ρ̂ = −0.005?
Many (major) survey results published before Nov 8, 2016;
Roughly amounts to 1% of eligible voters: n ≈ 2,300,000;
Equivalent to 2,300 surveys of 1,000 respondents each.
When $\hat\rho = -0.005 = -1/200$, $D_I = 1/40000$, and hence
$$n_{\text{eff}} = \frac{f}{1-f}\,\frac{1}{D_I} = \frac{1}{99}\times 40000 \approx 404!$$
A 99.98% reduction in n, caused by ρ̂ = −0.005.
Butterfly Effect due to Law of Large Populations (LLP)
$$\text{Relative Error} = \sqrt{N-1}\,\hat\rho$$
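
The arithmetic on this slide can be reproduced in a few lines of Python. The inputs are the slide's own numbers; the implied population size `N = n / f` and the margin-of-error line (worst-case σ = 0.5, normal approximation) are assumptions added for illustration.

```python
import math

f = 0.01           # relative size from the slide: about 1% of eligible voters
n = 2_300_000      # combined sample size from the slide
rho_hat = -0.005   # correlation value from the slide

d_i = rho_hat ** 2               # D_I = 1/40000
n_eff = f / (1.0 - f) / d_i      # effective sample size, about 404
reduction = 1.0 - n_eff / n      # share of the nominal n that is lost

# Assumption (not on the slide): worst-case sigma = 0.5 and a normal
# approximation, to express n_eff as a 95% margin of error.
moe = 1.96 * math.sqrt(0.25 / n_eff)

# Assumption: population size implied by n and f, about 230 million.
N = round(n / f)
rel_error = math.sqrt(N - 1) * rho_hat  # LLP relative error

print(f"n_eff ~ {n_eff:.0f}")
print(f"reduction in n ~ {100 * reduction:.2f}%")
print(f"95% margin of error of an SRS of that size ~ {100 * moe:.1f} points")
print(f"relative error ~ {rel_error:.1f} benchmark SRS standard errors")
```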
![Page 53: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/53.jpg)
Visualizing LLP: Actual Coverage for Clinton
[Figure: Clinton $Z_n$ (vertical axis, ticks at −10, −5, −2, 0, 2, 5) against $\log_{10}$(Total Voters) (horizontal axis, 5.5 to 7.0), with one labeled point for each of the 50 states and DC.]
![Page 54: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/54.jpg)
Visualizing LLP: Actual Coverage for Trump
[Figure: Trump $Z_n$ (vertical axis, ticks at −10, −5, −2, 0, 2, 5) against $\log_{10}$(Total Voters) (horizontal axis, 5.5 to 7.0), with one labeled point for each of the 50 states and DC.]
![Page 55: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/55.jpg)
The Big Data Paradox:
If we do not pay attention to data quality, then
The bigger the data,
the surer we fool ourselves.
![Page 56: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/56.jpg)
Lessons Learned ...
Lesson 1: What matters most is the quality, not the quantity.
Lesson 2: Don’t ignore seemingly tiny probabilistic datasets when combining data sources.
Lesson 3: Watch the relative size, not the absolute size.
Lesson 4: Classical theory is BIG for “big data”, as long as we let it go outside the classical box.
![Page 60: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s](https://reader035.vdocuments.net/reader035/viewer/2022080506/5f791bebb21df4286d09c750/html5/thumbnails/60.jpg)
In case you are kind enough to invite me again ...
The sequel: Meng (2018/9)
Statistical Paradises and Paradoxes in Big Data (II): Multi-resolution Inference, Simpson’s Paradox, and Individualized Treatments