filling in missing data: elections, 1.5cm0.3mm,...
TRANSCRIPT
Filling in Missing Data:Elections, , Healthcare
Madeleine Udell
Assistant ProfessorOperations Research and Information Engineering
Cornell University
Cornell WAM, March 16 2019
1 / 38
Definitions
what is operations research?
my answer: using data to make decisions.
I using data to make predictions
I using predictions to make decisions
2 / 38
Definitions
what is operations research?
my answer: using data to make decisions.
I using data to make predictions
I using predictions to make decisions
2 / 38
Outline
Elections
Data Tables
Generalized Low Rank Models
Applications
Bias
3 / 38
Analytics for political campaigns
goal: allocate limited resources to optimize electoral vote
4 / 38
2012
5 / 38
2012 electoral map
6 / 38
Optimization on the Obama campaign
−30 −20 −10 0 10 20 30Margin
0
10
20
30
40
50
60
70Ele
ctora
l vote
s
7 / 38
Optimization on the Obama campaign
−30 −20 −10 0 10 20 30Margin
0
10
20
30
40
50
60
70Ele
ctora
l vote
s
won by just enough...
wasted effort
8 / 38
2016 electoral map
9 / 38
Not much (effective) optimization on the Clinton
campaign
0
10
20
30
40
50
60
70
80
-48--46
-46--44
-44--42
-42--40
-40--38
-38--36
-36--34
-34--32
-32--30
-30--28
-28--26
-26--24
-24--22
-22--20
-20--18
-18--16
-16--14
-14--12
-12--10
-10--8
-8--6
-6--4
-4--2
-2-0 0-2
2-4
4-6
6-8
8-10
10-12
12-14
14-16
16-18
18-20
20-22
22-24
24-26
26-28
28-30
30-32
32-34
>34
ElectoralV
otes(sum
perbin)
StateMarginHRC-DJTin%
2016ElectoralVotesHistogram(source:RCPTh.11/1010p)
10 / 38
Predictions and decisions in electoral campaigns
predictions
I who will vote?
I who will they vote for?
I how effective are interventions?
decisions
I volunteers: who should they talk to?
I money: what ads to display on what platforms?
I candidate’s time: where to travel?
I policy positions: what to emphasize?
to maximize probability of electoral win
11 / 38
Outline
Elections
Data Tables
Generalized Low Rank Models
Applications
Bias
12 / 38
Data table: politics
age gender state income education voted? support · · ·29 F CT $53,000 college yes Clinton · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·
41 F NV $23,000 ? yes Trump · · ·...
......
......
......
...
goals:
I detect demographic groups?
I find typical responses?
I identify related features?
I impute missing entries?
13 / 38
Data table: politics
age gender state income education voted? support · · ·29 F CT $53,000 college yes Clinton · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·
41 F NV $23,000 ? yes Trump · · ·...
......
......
......
...
goals:
I detect demographic groups?
I find typical responses?
I identify related features?
I impute missing entries?13 / 38
Data table
m examples (patients, respondents, assets)n features (tests, questions, performance indicators) A
=
A11 · · · A1n...
. . ....
Am1 · · · Amn
I ith row of A is feature vector for ith example
I jth column of A gives values for jth feature across allexamples
14 / 38
Outline
Elections
Data Tables
Generalized Low Rank Models
Applications
Bias
15 / 38
Low rank model
given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX
[ Y]≈
A
i.e., xiyj ≈ Aij , whereX
=
—x1—...
—xm—
[Y
]=
| |y1 · · · yn| |
interpretation:
I X and Y are (compressed) representation of AI xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature jI inner product xiyj approximates Aij
16 / 38
Low rank model
given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX
[ Y]≈
A
i.e., xiyj ≈ Aij , whereX
=
—x1—...
—xm—
[Y
]=
| |y1 · · · yn| |
interpretation:
I X and Y are (compressed) representation of A
I xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature jI inner product xiyj approximates Aij
16 / 38
Low rank model
given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX
[ Y]≈
A
i.e., xiyj ≈ Aij , whereX
=
—x1—...
—xm—
[Y
]=
| |y1 · · · yn| |
interpretation:
I X and Y are (compressed) representation of AI xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature j
I inner product xiyj approximates Aij
16 / 38
Low rank model
given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX
[ Y]≈
A
i.e., xiyj ≈ Aij , whereX
=
—x1—...
—xm—
[Y
]=
| |y1 · · · yn| |
interpretation:
I X and Y are (compressed) representation of AI xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature jI inner product xiyj approximates Aij
16 / 38
Why?
I reduce storage; speed transmission
I understand (visualize, cluster)
I remove noise
I infer missing data
I simplify data processing
17 / 38
Principal components analysis
PCA: for A ∈ Rm×n,
minimize ‖A− XY ‖2F =
∑mi=1
∑nj=1(Aij − xiyj)
2
with variables X ∈ Rm×k , Y ∈ Rk×n
I old roots (Pearson 1901, Hotelling 1933)
I least squares low rank fitting
I (analytical) solution via SVD of A = UΣV T
I (numerical) solution via alternating minimization
18 / 38
Generalized low rank model
minimize∑
(i ,j)∈Ω Lj(xiyj ,Aij)
I loss functions Lj for each columnI e.g., different losses for reals, booleans, categoricals,
ordinals, . . .
I observe only (i , j) ∈ Ω (other entries are missing)
Note:
I can be (NP-)hard to optimize exactly
I alternating minimiziation still works well
19 / 38
Matrix completion
observe Aij only for (i , j) ∈ Ω ⊂ 1, . . . ,m × 1, . . . , n
minimize∑
(i ,j)∈Ω(Aij − xiyj)2 + γ‖X‖2
F + γ‖Y ‖2F
two regimes:
I some entries missing: don’t waste data; “borrowstrength” from entries that are not missing
I most entries missing: matrix completion still works!
20 / 38
Losses
minimize∑
(i ,j)∈Ω Lj(xiyj ,Aij) +∑m
i=1 ri (xi ) +∑n
j=1 rj(yj)
choose loss L(u, a) adapted to data type:
data type loss L(u, a)
real quadratic (u − a)2
real absolute value |u − a|real huber huber(u − a)
boolean hinge (1− ua)+
boolean logistic log(1 + exp(−au))
integer poisson exp(u)− au + a log a− a
ordinal ordinal hinge∑a−1
a′=1(1− u + a′)++∑da′=a+1(1 + u − a′)+
categorical one-vs-all (1− ua)+ +∑
a′ 6=a(1 + ua′)+
categorical multinomial logit exp(ua)∑da′=1 exp(ua′ ) 21 / 38
Implementations
find code to fit GLRMs in
I Python (serial)
I Julia (shared memory parallel)
I Spark (parallel distributed)
I H20 (parallel distributed)
example: (Julia) fit rank 5 GLRM in 2 lines of code:
glrm, labels = GLRM(A, 5)
X,Y = fit!(glrm)
22 / 38
Implementations
find code to fit GLRMs in
I Python (serial)
I Julia (shared memory parallel)
I Spark (parallel distributed)
I H20 (parallel distributed)
example: (Julia) fit rank 5 GLRM in 2 lines of code:
glrm, labels = GLRM(A, 5)
X,Y = fit!(glrm)
22 / 38
Outline
Elections
Data Tables
Generalized Low Rank Models
Applications
Bias
23 / 38
Impute heterogeneous data
mixed data types
−12−9−6−3036912
remove entries
−12−9−6−3036912
qpca rank 10 recovery
−12−9−6−3036912
error
−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0
glrm rank 10 recovery
−12−9−6−3036912
error
−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0
24 / 38
Impute heterogeneous data
mixed data types
−12−9−6−3036912
remove entries
−12−9−6−3036912
qpca rank 10 recovery
−12−9−6−3036912
error
−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0
glrm rank 10 recovery
−12−9−6−3036912
error
−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0
24 / 38
Impute heterogeneous data
mixed data types
−12−9−6−3036912
remove entries
−12−9−6−3036912
qpca rank 10 recovery
−12−9−6−3036912
error
−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0
glrm rank 10 recovery
−12−9−6−3036912
error
−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0
24 / 38
Hospitalizations are low rank
hospitalization data set
GLRM outperforms PCA
(Schuler Liu, Wan, Callahan, U, Stark, Shah 2016)
25 / 38
American community survey
2013 ACS:
I 3M respondents, 87 economic/demographic surveyquestionsI incomeI cost of utilities (water, gas, electric)I weeks worked per yearI hours worked per weekI home ownershipI looking for workI use foodstampsI education levelI state of residenceI . . .
I 1/3 of responses missing
26 / 38
American community survey
most similar features (in demography space):
I Alaska: Montana, North Dakota
I California: Illinois, cost of water
I Colorado: Oregon, Idaho
I Ohio: Indiana, Michigan
I Pennsylvania: Massachusetts, New Jersey
I Virginia: Maryland, Connecticut
I Hours worked: weeks worked, education
27 / 38
Low rank models for dimensionality reduction1
U.S. Wage & Hour Division (WHD) compliance actions:
company zip violations · · ·Holiday Inn 14850 109 · · ·
Moosewood Restaurant 14850 0 · · ·Cornell Orchards 14850 0 · · ·
Lakeside Nursing Home 14850 53 · · ·...
......
I 208,806 rows (cases) × 252 columns (violation info)
I 32,989 zip codes. . .
1labor law violation demo: https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.census.labor.violations.large.R
28 / 38
Low rank models for dimensionality reduction
ACS demographic data (US census bureau):
zip unemployment mean income · · ·94305 12% $47,000 · · ·06511 19% $32,000 · · ·60647 23% $23,000 · · ·94121 4% $178,000 · · ·
......
...
I 32,989 rows (zip codes) × 150 columns (demographic info)
I GLRM embeds zip codes into (low dimensional)demography space
29 / 38
Low rank models for dimensionality reduction
Zip code features:
30 / 38
Low rank models for dimensionality reduction
build 3 sets of features to predict violations:
I categorical: expand zip code to categorical variable
I concatenate: join tables on zip
I GLRM: replace zip code by low dimensional zip codefeatures
fit a supervised (deep learning) model:
method train error test error runtime
categorical 0.2091690 0.2173612 23.7600000concatenate 0.2258872 0.2515906 4.4700000
GLRM 0.1790884 0.1933637 4.3600000
31 / 38
Outline
Elections
Data Tables
Generalized Low Rank Models
Applications
Bias
32 / 38
The trouble with polls
Q: are people who respond to polls like people who don’t?
A: no:
There is a 19-year-old black man in Illinois who hasno idea of the role he is playing in this election.
He is sure he is going to vote for Donald J. Trump.In some polls, he’s weighted as much as 30 times
more than the average respondent, and as much as 300times more than the least-weighted respondent.
http://www.nytimes.com/2016/10/13/upshot/
how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.
html
33 / 38
The trouble with polls
Q: are people who respond to polls like people who don’t?A: no:
There is a 19-year-old black man in Illinois who hasno idea of the role he is playing in this election.
He is sure he is going to vote for Donald J. Trump.In some polls, he’s weighted as much as 30 times
more than the average respondent, and as much as 300times more than the least-weighted respondent.
http://www.nytimes.com/2016/10/13/upshot/
how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.
html
33 / 38
Correct biased sample
two types of people
I type A always fill out all questionsI type B leave question 3 blank half the time
question 1 question 2 question 3 question 4 · · ·2.7 yes 4 yes · · ·9.2 no ? no · · ·2.7 yes 4 yes · · ·9.2 no 1 no · · ·9.2 no 1 no · · ·9.2 no ? no · · ·
......
.... . .
estimate population mean of question 3
I excluding missing entries: 2.5I imputing missing entries: 2
34 / 38
Correct biased sample
two types of people
I type A always fill out all questionsI type B leave question 3 blank half the time
question 1 question 2 question 3 question 4 · · ·2.7 yes 4 yes · · ·9.2 no ? no · · ·2.7 yes 4 yes · · ·9.2 no 1 no · · ·9.2 no 1 no · · ·9.2 no ? no · · ·
......
.... . .
estimate population mean of question 3 iftype B people have two subtypes:
I one that answers “1” to question 3I another that doesn’t answer, but whose true answer is “27”
35 / 38
How does this apply to election models?
simple model: suppose that in each demographic group,
I there are some Trump and some Clinton supporters
I the Trump supporters respond to pollsters at lower rates (orlie about their support)
there is no way to detect this from polling data!
n.b. confidence intervals (as usually computed)
I account for statistical error
I do not account for systematic error
36 / 38
How does this apply to election models?
simple model: suppose that in each demographic group,
I there are some Trump and some Clinton supporters
I the Trump supporters respond to pollsters at lower rates (orlie about their support)
there is no way to detect this from polling data!
n.b. confidence intervals (as usually computed)
I account for statistical error
I do not account for systematic error
36 / 38
Dealing with systematic bias
for correct estimation, need outcome to be independent ofmissingness conditional on covariates
support for Trump ⊥ non-response | demographics
problem with systematic bias:
I even if you know it exists, you don’t know how much!
I modeling systemic bias?
37 / 38
Dealing with systematic bias
for correct estimation, need outcome to be independent ofmissingness conditional on covariates
support for Trump ⊥ non-response | demographics
problem with systematic bias:
I even if you know it exists, you don’t know how much!
I modeling systemic bias?
37 / 38
Dealing with systematic bias
for correct estimation, need outcome to be independent ofmissingness conditional on covariates
support for Trump ⊥ non-response | demographics
problem with systematic bias:
I even if you know it exists, you don’t know how much!
I modeling systemic bias?
37 / 38
Summary
generalized low rank models
I fill in missing data
I handle huge, heterogeneous data coherently
I transform big messy data into small clean data
paper:http://arxiv.org/abs/1410.0342
code:https://github.com/madeleineudell/LowRankModels.jl
38 / 38