Filling in Missing Data: Elections, ____, Healthcare

Madeleine Udell

Assistant Professor, Operations Research and Information Engineering

Cornell University

Cornell WAM, March 16 2019

1 / 38

Definitions

what is operations research?

my answer: using data to make decisions.

- using data to make predictions
- using predictions to make decisions

2 / 38

Outline

Elections

Data Tables

Generalized Low Rank Models

Applications

Bias

3 / 38

Analytics for political campaigns

goal: allocate limited resources to optimize electoral vote

4 / 38

2012

5 / 38

2012 electoral map

6 / 38

Optimization on the Obama campaign

[Figure: electoral votes (vertical axis, 0 to 70) plotted against margin (horizontal axis, −30 to 30)]

7 / 38

Optimization on the Obama campaign

[Figure: electoral votes (vertical axis, 0 to 70) plotted against margin (horizontal axis, −30 to 30), with two annotations:]

won by just enough...

wasted effort

8 / 38

2016 electoral map

9 / 38

Not much (effective) optimization on the Clinton campaign

[Figure: 2016 Electoral Votes Histogram (source: RCP Th. 11/10 10p). Horizontal axis: state margin HRC-DJT in %, in two-point bins from −48 to >34. Vertical axis: electoral votes (sum per bin), 0 to 80.]

10 / 38

Predictions and decisions in electoral campaigns

predictions

- who will vote?
- who will they vote for?
- how effective are interventions?

decisions

- volunteers: who should they talk to?
- money: what ads to display on what platforms?
- candidate’s time: where to travel?
- policy positions: what to emphasize?

to maximize probability of electoral win

11 / 38


Data table: politics

age  gender  state  income    education    voted?  support  ...
29   F       CT     $53,000   college      yes     Clinton  ...
57   ?       NY     $19,000   high school  yes     ?        ...
?    M       CA     $102,000  masters      no      Trump    ...
41   F       NV     $23,000   ?            yes     Trump    ...
...  ...     ...    ...       ...          ...     ...

goals:

- detect demographic groups?
- find typical responses?
- identify related features?
- impute missing entries?

13 / 38


Data table

m examples (patients, respondents, assets)

n features (tests, questions, performance indicators)

$$A = \begin{bmatrix} A_{11} & \cdots & A_{1n} \\ \vdots & \ddots & \vdots \\ A_{m1} & \cdots & A_{mn} \end{bmatrix}$$

- ith row of A is feature vector for ith example
- jth column of A gives values for jth feature across all examples

14 / 38


Low rank model

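given: $m \times n$ data table $A$, $k \ll m, n$

find: $X \in \mathbf{R}^{m \times k}$, $Y \in \mathbf{R}^{k \times n}$ for which $XY \approx A$

i.e., $x_i y_j \approx A_{ij}$, where

$$X = \begin{bmatrix} - x_1 - \\ \vdots \\ - x_m - \end{bmatrix}, \qquad Y = \begin{bmatrix} | & & | \\ y_1 & \cdots & y_n \\ | & & | \end{bmatrix}$$

interpretation:

- $X$ and $Y$ are (compressed) representation of $A$
- $x_i^T \in \mathbf{R}^k$ is a point associated with example $i$
- $y_j \in \mathbf{R}^k$ is a point associated with feature $j$
- inner product $x_i y_j$ approximates $A_{ij}$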
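A tiny numerical illustration of the factorization may help. The sketch below uses made-up numbers and a hypothetical helper `matmul`; it builds a rank 1 table from a 3 × 1 factor X and a 1 × 4 factor Y, and counts how many numbers each representation needs.

```python
# Tiny illustration of a rank-1 model: the table is the product X Y.
# All numbers are made up for illustration.

def matmul(X, Y):
    """Multiply an m x k matrix X by a k x n matrix Y (lists of lists)."""
    m, k, n = len(X), len(Y), len(Y[0])
    return [[sum(X[i][l] * Y[l][j] for l in range(k)) for j in range(n)]
            for i in range(m)]

X = [[1.0], [2.0], [3.0]]       # m = 3 examples, k = 1
Y = [[1.0, 0.5, 2.0, 4.0]]      # k = 1, n = 4 features

A_hat = matmul(X, Y)            # entry (i, j) is the inner product x_i y_j

m, k, n = len(X), len(Y), len(Y[0])
stored_full = m * n             # numbers needed to store the table directly
stored_factors = m * k + k * n  # numbers needed to store X and Y instead
```

Here the factors take 7 numbers instead of 12; the saving grows quickly when k is much smaller than m and n.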

16 / 38


Why?

- reduce storage; speed transmission
- understand (visualize, cluster)
- remove noise
- infer missing data
- simplify data processing

17 / 38

Principal components analysis

PCA: for $A \in \mathbf{R}^{m \times n}$,

minimize $\|A - XY\|_F^2 = \sum_{i=1}^m \sum_{j=1}^n (A_{ij} - x_i y_j)^2$

with variables $X \in \mathbf{R}^{m \times k}$, $Y \in \mathbf{R}^{k \times n}$

- old roots (Pearson 1901, Hotelling 1933)
- least squares low rank fitting
- (analytical) solution via SVD of $A = U \Sigma V^T$
- (numerical) solution via alternating minimization
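The SVD solution can be written out explicitly. By the Eckart-Young theorem, keeping only the top k singular values gives a minimizer. In the block below, $U_k$ and $V_k$ denote the first k columns of $U$ and $V$, and $\Sigma_k$ the leading $k \times k$ block of $\Sigma$; splitting $\Sigma_k$ evenly between the two factors is one convenient choice among many, since any invertible $k \times k$ rescaling of X and Y gives the same product.

```latex
A = U \Sigma V^T, \qquad
X = U_k \Sigma_k^{1/2}, \qquad
Y = \Sigma_k^{1/2} V_k^T, \qquad
\|A - XY\|_F^2 = \sum_{i > k} \sigma_i^2
```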

18 / 38

Generalized low rank model

minimize $\sum_{(i,j) \in \Omega} L_j(x_i y_j, A_{ij})$

- loss functions $L_j$ for each column
  - e.g., different losses for reals, booleans, categoricals, ordinals, ...
- observe only $(i,j) \in \Omega$ (other entries are missing)

Note:

- can be (NP-)hard to optimize exactly
- alternating minimization still works well

19 / 38

Matrix completion

observe $A_{ij}$ only for $(i,j) \in \Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$

minimize $\sum_{(i,j) \in \Omega} (A_{ij} - x_i y_j)^2 + \gamma \|X\|_F^2 + \gamma \|Y\|_F^2$

two regimes:

- some entries missing: don’t waste data; “borrow strength” from entries that are not missing
- most entries missing: matrix completion still works!
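The objective above can be minimized by alternating between the two factors. The sketch below is a minimal pure Python version restricted to rank k = 1 for brevity, so each update is a one-dimensional ridge regression. The function names, the starting point, the value of gamma, and the example table are all assumptions made for illustration, not the method of any particular package.

```python
# Alternating minimization for rank-1 matrix completion (pure Python sketch).
# None marks a missing entry; gamma is the regularization weight.

def objective(A, x, y, gamma):
    """Regularized squared error over the observed entries only."""
    fit = sum((a - x[i] * y[j]) ** 2
              for i, row in enumerate(A)
              for j, a in enumerate(row) if a is not None)
    return fit + gamma * sum(v * v for v in x) + gamma * sum(v * v for v in y)

def complete_rank1(A, gamma=1e-3, iters=200):
    m, n = len(A), len(A[0])
    x, y = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Fix y; each x_i solves a one-dimensional ridge regression.
        for i in range(m):
            obs = [j for j in range(n) if A[i][j] is not None]
            x[i] = (sum(A[i][j] * y[j] for j in obs)
                    / (gamma + sum(y[j] ** 2 for j in obs)))
        # Fix x; each y_j solves a one-dimensional ridge regression.
        for j in range(n):
            obs = [i for i in range(m) if A[i][j] is not None]
            y[j] = (sum(A[i][j] * x[i] for i in obs)
                    / (gamma + sum(x[i] ** 2 for i in obs)))
    return x, y

# Made-up rank-1 table (outer product of [1, 2, 3] and [1, 2, 4]) with holes.
A = [[1.0, 2.0, None],
     [2.0, None, 8.0],
     [None, 6.0, 12.0]]

x, y = complete_rank1(A)
filled = [[x[i] * y[j] for j in range(3)] for i in range(3)]
```

Each block update exactly minimizes the objective with the other factor held fixed, so the objective never increases. On this example the filled-in entries come out close to the values of the underlying rank 1 table, up to a small shrinkage from the regularizer.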

20 / 38

Losses

minimize $\sum_{(i,j) \in \Omega} L_j(x_i y_j, A_{ij}) + \sum_{i=1}^m r_i(x_i) + \sum_{j=1}^n r_j(y_j)$

choose loss $L(u, a)$ adapted to data type:

data type    loss               L(u, a)
real         quadratic          $(u - a)^2$
real         absolute value     $|u - a|$
real         huber              $\mathrm{huber}(u - a)$
boolean      hinge              $(1 - ua)_+$
boolean      logistic           $\log(1 + \exp(-au))$
integer      poisson            $\exp(u) - au + a \log a - a$
ordinal      ordinal hinge      $\sum_{a'=1}^{a-1} (1 - u + a')_+ + \sum_{a'=a+1}^{d} (1 + u - a')_+$
categorical  one-vs-all         $(1 - u_a)_+ + \sum_{a' \neq a} (1 + u_{a'})_+$
categorical  multinomial logit  $\frac{\exp(u_a)}{\sum_{a'=1}^{d} \exp(u_{a'})}$

21 / 38
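The scalar losses in the table translate directly into code. The sketch below is one possible rendering in plain Python, not the code of any GLRM package. It assumes booleans are encoded as −1 or +1, that the Huber threshold is 1 (the table does not specify it), and that a log a is taken to be 0 when a = 0.

```python
import math

def quadratic(u, a):
    return (u - a) ** 2

def absolute(u, a):
    return abs(u - a)

def huber(u, a, delta=1.0):
    """Quadratic near zero, linear in the tails; delta=1 is an assumed threshold."""
    r = abs(u - a)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def hinge(u, a):
    """Boolean a encoded as -1 or +1."""
    return max(0.0, 1.0 - u * a)

def logistic(u, a):
    """Boolean a encoded as -1 or +1."""
    return math.log(1.0 + math.exp(-a * u))

def poisson(u, a):
    """Count a >= 0; the term a*log(a) is taken to be 0 when a == 0."""
    a_log_a = a * math.log(a) if a > 0 else 0.0
    return math.exp(u) - a * u + a_log_a - a

def ordinal_hinge(u, a, d):
    """Ordinal a in {1, ..., d}."""
    below = sum(max(0.0, 1.0 - u + ap) for ap in range(1, a))
    above = sum(max(0.0, 1.0 + u - ap) for ap in range(a + 1, d + 1))
    return below + above
```

Choosing a loss per column then amounts to picking one of these functions for each feature according to its data type.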

Implementations

find code to fit GLRMs in

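- Python (serial)
- Julia (shared memory parallel)
- Spark (parallel distributed)
- H2O (parallel distributed)

example: (Julia) fit rank 5 GLRM in 2 lines of code:

using LowRankModels

glrm, labels = GLRM(A, 5)   # build a rank 5 model; this two-argument form assumes A is a DataFrame

X, Y = fit!(glrm)   # fit the model; X and Y are the low rank factors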

22 / 38


Impute heterogeneous data

[Figure: image panels titled "mixed data types", "remove entries", "qpca rank 10 recovery", "error", "glrm rank 10 recovery", "error". The data and recovery panels use a color scale from −12 to 12; the error panels use a scale from −3.0 to 3.0.]

24 / 38

Hospitalizations are low rank

hospitalization data set

GLRM outperforms PCA

(Schuler, Liu, Wan, Callahan, U, Stark, Shah 2016)

25 / 38

American community survey

2013 ACS:

- 3M respondents, 87 economic/demographic survey questions
  - income
  - cost of utilities (water, gas, electric)
  - weeks worked per year
  - hours worked per week
  - home ownership
  - looking for work
  - use foodstamps
  - education level
  - state of residence
  - ...
- 1/3 of responses missing

26 / 38

American community survey

most similar features (in demography space):

- Alaska: Montana, North Dakota
- California: Illinois, cost of water
- Colorado: Oregon, Idaho
- Ohio: Indiana, Michigan
- Pennsylvania: Massachusetts, New Jersey
- Virginia: Maryland, Connecticut
- Hours worked: weeks worked, education

27 / 38

Low rank models for dimensionality reduction [1]

U.S. Wage & Hour Division (WHD) compliance actions:

company                zip    violations  ...
Holiday Inn            14850  109         ...
Moosewood Restaurant   14850  0           ...
Cornell Orchards       14850  0           ...
Lakeside Nursing Home  14850  53          ...
...                    ...    ...

- 208,806 rows (cases) × 252 columns (violation info)
- 32,989 zip codes...

[1] labor law violation demo: https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.census.labor.violations.large.R

28 / 38


Low rank models for dimensionality reduction

ACS demographic data (US Census Bureau):

zip    unemployment  mean income  ...
94305  12%           $47,000      ...
06511  19%           $32,000      ...
60647  23%           $23,000      ...
94121  4%            $178,000     ...
...    ...           ...

- 32,989 rows (zip codes) × 150 columns (demographic info)
- GLRM embeds zip codes into (low dimensional) demography space

29 / 38


Zip code features:

30 / 38


build 3 sets of features to predict violations:

- categorical: expand zip code to categorical variable
- concatenate: join tables on zip
- GLRM: replace zip code by low dimensional zip code features

fit a supervised (deep learning) model:

method       train error  test error  runtime
categorical  0.2091690    0.2173612   23.7600000
concatenate  0.2258872    0.2515906   4.4700000
GLRM         0.1790884    0.1933637   4.3600000

31 / 38


The trouble with polls

Q: are people who respond to polls like people who don’t?

A: no:

"There is a 19-year-old black man in Illinois who has no idea of the role he is playing in this election.

He is sure he is going to vote for Donald J. Trump.

In some polls, he’s weighted as much as 30 times more than the average respondent, and as much as 300 times more than the least-weighted respondent."

http://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html

33 / 38





Correct biased sample

two types of people

- type A always fill out all questions
- type B leave question 3 blank half the time

question 1  question 2  question 3  question 4  ...
2.7         yes         4           yes         ...
9.2         no          ?           no          ...
2.7         yes         4           yes         ...
9.2         no          1           no          ...
9.2         no          1           no          ...
9.2         no          ?           no          ...
...         ...         ...         ...

estimate population mean of question 3

- excluding missing entries: 2.5
- imputing missing entries: 2
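The two estimates can be reproduced from the six displayed rows. The sketch below treats those rows as the whole sample (an assumption, since the table continues) and fills each blank with the average question 3 answer among rows that agree on the other questions. That matching rule is a hypothetical stand-in for a low rank fit, chosen only to keep the example short.

```python
# The six displayed rows; None marks a blank answer to question 3.
rows = [
    (2.7, "yes", 4,    "yes"),
    (9.2, "no",  None, "no"),
    (2.7, "yes", 4,    "yes"),
    (9.2, "no",  1,    "no"),
    (9.2, "no",  1,    "no"),
    (9.2, "no",  None, "no"),
]

def mean(values):
    return sum(values) / len(values)

# Estimate 1: drop the rows where question 3 is blank.
observed = [r[2] for r in rows if r[2] is not None]
mean_excluding = mean(observed)

# Estimate 2: fill each blank with the average question 3 answer among rows
# that match on the other questions (hypothetical stand-in for a low rank fit).
def impute(row):
    matches = [r[2] for r in rows
               if r[2] is not None
               and (r[0], r[1], r[3]) == (row[0], row[1], row[3])]
    return mean(matches)

filled = [r[2] if r[2] is not None else impute(r) for r in rows]
mean_imputing = mean(filled)
```

Dropping the blanks over-represents type A respondents, who always answer; imputation restores the type B rows to their proper weight.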

34 / 38

Correct biased sample

two types of people

- type A always fill out all questions
- type B leave question 3 blank half the time

question 1  question 2  question 3  question 4  ...
2.7         yes         4           yes         ...
9.2         no          ?           no          ...
2.7         yes         4           yes         ...
9.2         no          1           no          ...
9.2         no          1           no          ...
9.2         no          ?           no          ...
...         ...         ...         ...

estimate population mean of question 3 if type B people have two subtypes:

- one that answers “1” to question 3
- another that doesn’t answer, but whose true answer is “27”
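As a worked check, assume the six displayed rows are the whole sample and that both blank rows belong to the non-answering subtype. The true mean of question 3 would then be:

```latex
\frac{4 + 27 + 4 + 1 + 1 + 27}{6} = \frac{64}{6} \approx 10.7
```

Neither dropping the blanks (2.5) nor imputing them from similar respondents (2) comes close, because the answer is tied to the decision not to respond.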

35 / 38

How does this apply to election models?

simple model: suppose that in each demographic group,

- there are some Trump and some Clinton supporters
- the Trump supporters respond to pollsters at lower rates (or lie about their support)

there is no way to detect this from polling data!
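A small calculation shows why. The sketch below uses made-up support shares and response rates, and a hypothetical helper `observed_trump_share`, to build two different worlds that produce exactly the same poll result within one demographic group.

```python
from fractions import Fraction

def observed_trump_share(true_share, trump_rate, clinton_rate):
    """Share of poll respondents who report supporting Trump."""
    trump = true_share * trump_rate
    clinton = (1 - true_share) * clinton_rate
    return trump / (trump + clinton)

# World 1: half the group supports Trump, but they respond half as often.
world_1 = observed_trump_share(Fraction(1, 2), Fraction(1, 4), Fraction(1, 2))

# World 2: one third supports Trump and everyone responds at the same rate.
world_2 = observed_trump_share(Fraction(1, 3), Fraction(1, 2), Fraction(1, 2))
```

Both worlds show one third of respondents supporting Trump, so the poll alone cannot tell a true share of one half from a true share of one third.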

n.b. confidence intervals (as usually computed)

- account for statistical error
- do not account for systematic error

36 / 38

Dealing with systematic bias

for correct estimation, need outcome to be independent of missingness conditional on covariates

support for Trump ⊥ non-response | demographics
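When this conditional independence holds, respondents within each demographic group are representative of that group, so weighting the group averages by known population shares recovers the population average. The sketch below uses made-up groups, shares, and responses (all hypothetical) to contrast this with the naive average.

```python
# Known population share of each demographic group (hypothetical numbers).
population_share = {"young": 0.5, "old": 0.5}

# Poll responses as (group, supports_trump); the young group is under-sampled.
responses = [
    ("young", 0), ("young", 1),
    ("old", 1), ("old", 1), ("old", 1), ("old", 0),
    ("old", 1), ("old", 0), ("old", 1), ("old", 1),
]

def group_mean(group):
    votes = [v for g, v in responses if g == group]
    return sum(votes) / len(votes)

# Naive estimate: average over whoever responded.
naive = sum(v for _, v in responses) / len(responses)

# Reweighted estimate: group averages weighted by population shares.
reweighted = sum(share * group_mean(g) for g, share in population_share.items())
```

The reweighting only corrects bias that is explained by demographics; if response depends on support within a group, the group averages themselves are wrong and no choice of weights repairs them.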

problem with systematic bias:

- even if you know it exists, you don’t know how much!
- modeling systematic bias?

37 / 38

Summary

generalized low rank models

- fill in missing data
- handle huge, heterogeneous data coherently
- transform big messy data into small clean data

paper: http://arxiv.org/abs/1410.0342

code: https://github.com/madeleineudell/LowRankModels.jl

38 / 38
