filling in missing data: elections, 1.5cm0.3mm,...

50
Filling in Missing Data: Elections, , Healthcare Madeleine Udell Assistant Professor Operations Research and Information Engineering Cornell University Cornell WAM, March 16 2019 1 / 38

Upload: others

Post on 12-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Filling in Missing Data:Elections, , Healthcare

Madeleine Udell

Assistant ProfessorOperations Research and Information Engineering

Cornell University

Cornell WAM, March 16 2019

1 / 38

Page 2: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Definitions

what is operations research?

my answer: using data to make decisions.

I using data to make predictions

I using predictions to make decisions

2 / 38

Page 3: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Definitions

what is operations research?

my answer: using data to make decisions.

I using data to make predictions

I using predictions to make decisions

2 / 38

Page 4: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Outline

Elections

Data Tables

Generalized Low Rank Models

Applications

Bias

3 / 38

Page 5: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Analytics for political campaigns

goal: allocate limited resources to optimize electoral vote

4 / 38

Page 6: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

2012

5 / 38

Page 7: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

2012 electoral map

6 / 38

Page 8: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Optimization on the Obama campaign

−30 −20 −10 0 10 20 30Margin

0

10

20

30

40

50

60

70Ele

ctora

l vote

s

7 / 38

Page 9: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Optimization on the Obama campaign

−30 −20 −10 0 10 20 30Margin

0

10

20

30

40

50

60

70Ele

ctora

l vote

s

won by just enough...

wasted effort

8 / 38

Page 10: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

2016 electoral map

9 / 38

Page 11: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Not much (effective) optimization on the Clinton

campaign

0

10

20

30

40

50

60

70

80

-48--46

-46--44

-44--42

-42--40

-40--38

-38--36

-36--34

-34--32

-32--30

-30--28

-28--26

-26--24

-24--22

-22--20

-20--18

-18--16

-16--14

-14--12

-12--10

-10--8

-8--6

-6--4

-4--2

-2-0 0-2

2-4

4-6

6-8

8-10

10-12

12-14

14-16

16-18

18-20

20-22

22-24

24-26

26-28

28-30

30-32

32-34

>34

ElectoralV

otes(sum

perbin)

StateMarginHRC-DJTin%

2016ElectoralVotesHistogram(source:RCPTh.11/1010p)

10 / 38

Page 12: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Predictions and decisions in electoral campaigns

predictions

I who will vote?

I who will they vote for?

I how effective are interventions?

decisions

I volunteers: who should they talk to?

I money: what ads to display on what platforms?

I candidate’s time: where to travel?

I policy positions: what to emphasize?

to maximize probability of electoral win

11 / 38

Page 13: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Outline

Elections

Data Tables

Generalized Low Rank Models

Applications

Bias

12 / 38

Page 14: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Data table: politics

age gender state income education voted? support · · ·29 F CT $53,000 college yes Clinton · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·

41 F NV $23,000 ? yes Trump · · ·...

......

......

......

...

goals:

I detect demographic groups?

I find typical responses?

I identify related features?

I impute missing entries?

13 / 38

Page 15: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Data table: politics

age gender state income education voted? support · · ·29 F CT $53,000 college yes Clinton · · ·57 ? NY $19,000 high school yes ? · · ·? M CA $102,000 masters no Trump · · ·

41 F NV $23,000 ? yes Trump · · ·...

......

......

......

...

goals:

I detect demographic groups?

I find typical responses?

I identify related features?

I impute missing entries?13 / 38

Page 16: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Data table

m examples (patients, respondents, assets)n features (tests, questions, performance indicators) A

=

A11 · · · A1n...

. . ....

Am1 · · · Amn

I ith row of A is feature vector for ith example

I jth column of A gives values for jth feature across allexamples

14 / 38

Page 17: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Outline

Elections

Data Tables

Generalized Low Rank Models

Applications

Bias

15 / 38

Page 18: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank model

given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX

[ Y]≈

A

i.e., xiyj ≈ Aij , whereX

=

—x1—...

—xm—

[Y

]=

| |y1 · · · yn| |

interpretation:

I X and Y are (compressed) representation of AI xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature jI inner product xiyj approximates Aij

16 / 38

Page 19: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank model

given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX

[ Y]≈

A

i.e., xiyj ≈ Aij , whereX

=

—x1—...

—xm—

[Y

]=

| |y1 · · · yn| |

interpretation:

I X and Y are (compressed) representation of A

I xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature jI inner product xiyj approximates Aij

16 / 38

Page 20: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank model

given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX

[ Y]≈

A

i.e., xiyj ≈ Aij , whereX

=

—x1—...

—xm—

[Y

]=

| |y1 · · · yn| |

interpretation:

I X and Y are (compressed) representation of AI xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature j

I inner product xiyj approximates Aij

16 / 38

Page 21: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank model

given: m × n data table A, k m, nfind: X ∈ Rm×k , Y ∈ Rk×n for whichX

[ Y]≈

A

i.e., xiyj ≈ Aij , whereX

=

—x1—...

—xm—

[Y

]=

| |y1 · · · yn| |

interpretation:

I X and Y are (compressed) representation of AI xTi ∈ Rk is a point associated with example iI yj ∈ Rk is a point associated with feature jI inner product xiyj approximates Aij

16 / 38

Page 22: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Why?

I reduce storage; speed transmission

I understand (visualize, cluster)

I remove noise

I infer missing data

I simplify data processing

17 / 38

Page 23: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Principal components analysis

PCA: for A ∈ Rm×n,

minimize ‖A− XY ‖2F =

∑mi=1

∑nj=1(Aij − xiyj)

2

with variables X ∈ Rm×k , Y ∈ Rk×n

I old roots (Pearson 1901, Hotelling 1933)

I least squares low rank fitting

I (analytical) solution via SVD of A = UΣV T

I (numerical) solution via alternating minimization

18 / 38

Page 24: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Generalized low rank model

minimize∑

(i ,j)∈Ω Lj(xiyj ,Aij)

I loss functions Lj for each columnI e.g., different losses for reals, booleans, categoricals,

ordinals, . . .

I observe only (i , j) ∈ Ω (other entries are missing)

Note:

I can be (NP-)hard to optimize exactly

I alternating minimiziation still works well

19 / 38

Page 25: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Matrix completion

observe Aij only for (i , j) ∈ Ω ⊂ 1, . . . ,m × 1, . . . , n

minimize∑

(i ,j)∈Ω(Aij − xiyj)2 + γ‖X‖2

F + γ‖Y ‖2F

two regimes:

I some entries missing: don’t waste data; “borrowstrength” from entries that are not missing

I most entries missing: matrix completion still works!

20 / 38

Page 26: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Losses

minimize∑

(i ,j)∈Ω Lj(xiyj ,Aij) +∑m

i=1 ri (xi ) +∑n

j=1 rj(yj)

choose loss L(u, a) adapted to data type:

data type loss L(u, a)

real quadratic (u − a)2

real absolute value |u − a|real huber huber(u − a)

boolean hinge (1− ua)+

boolean logistic log(1 + exp(−au))

integer poisson exp(u)− au + a log a− a

ordinal ordinal hinge∑a−1

a′=1(1− u + a′)++∑da′=a+1(1 + u − a′)+

categorical one-vs-all (1− ua)+ +∑

a′ 6=a(1 + ua′)+

categorical multinomial logit exp(ua)∑da′=1 exp(ua′ ) 21 / 38

Page 27: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Implementations

find code to fit GLRMs in

I Python (serial)

I Julia (shared memory parallel)

I Spark (parallel distributed)

I H20 (parallel distributed)

example: (Julia) fit rank 5 GLRM in 2 lines of code:

glrm, labels = GLRM(A, 5)

X,Y = fit!(glrm)

22 / 38

Page 28: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Implementations

find code to fit GLRMs in

I Python (serial)

I Julia (shared memory parallel)

I Spark (parallel distributed)

I H20 (parallel distributed)

example: (Julia) fit rank 5 GLRM in 2 lines of code:

glrm, labels = GLRM(A, 5)

X,Y = fit!(glrm)

22 / 38

Page 29: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Outline

Elections

Data Tables

Generalized Low Rank Models

Applications

Bias

23 / 38

Page 30: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Impute heterogeneous data

mixed data types

−12−9−6−3036912

remove entries

−12−9−6−3036912

qpca rank 10 recovery

−12−9−6−3036912

error

−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0

glrm rank 10 recovery

−12−9−6−3036912

error

−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0

24 / 38

Page 31: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Impute heterogeneous data

mixed data types

−12−9−6−3036912

remove entries

−12−9−6−3036912

qpca rank 10 recovery

−12−9−6−3036912

error

−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0

glrm rank 10 recovery

−12−9−6−3036912

error

−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0

24 / 38

Page 32: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Impute heterogeneous data

mixed data types

−12−9−6−3036912

remove entries

−12−9−6−3036912

qpca rank 10 recovery

−12−9−6−3036912

error

−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0

glrm rank 10 recovery

−12−9−6−3036912

error

−3.0−2.4−1.8−1.2−0.60.00.61.21.82.43.0

24 / 38

Page 33: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Hospitalizations are low rank

hospitalization data set

GLRM outperforms PCA

(Schuler Liu, Wan, Callahan, U, Stark, Shah 2016)

25 / 38

Page 34: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

American community survey

2013 ACS:

I 3M respondents, 87 economic/demographic surveyquestionsI incomeI cost of utilities (water, gas, electric)I weeks worked per yearI hours worked per weekI home ownershipI looking for workI use foodstampsI education levelI state of residenceI . . .

I 1/3 of responses missing

26 / 38

Page 35: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

American community survey

most similar features (in demography space):

I Alaska: Montana, North Dakota

I California: Illinois, cost of water

I Colorado: Oregon, Idaho

I Ohio: Indiana, Michigan

I Pennsylvania: Massachusetts, New Jersey

I Virginia: Maryland, Connecticut

I Hours worked: weeks worked, education

27 / 38

Page 36: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank models for dimensionality reduction1

U.S. Wage & Hour Division (WHD) compliance actions:

company zip violations · · ·Holiday Inn 14850 109 · · ·

Moosewood Restaurant 14850 0 · · ·Cornell Orchards 14850 0 · · ·

Lakeside Nursing Home 14850 53 · · ·...

......

I 208,806 rows (cases) × 252 columns (violation info)

I 32,989 zip codes. . .

1labor law violation demo: https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.census.labor.violations.large.R

28 / 38

Page 37: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank models for dimensionality reduction

ACS demographic data (US census bureau):

zip unemployment mean income · · ·94305 12% $47,000 · · ·06511 19% $32,000 · · ·60647 23% $23,000 · · ·94121 4% $178,000 · · ·

......

...

I 32,989 rows (zip codes) × 150 columns (demographic info)

I GLRM embeds zip codes into (low dimensional)demography space

29 / 38

Page 38: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank models for dimensionality reduction

Zip code features:

30 / 38

Page 39: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Low rank models for dimensionality reduction

build 3 sets of features to predict violations:

I categorical: expand zip code to categorical variable

I concatenate: join tables on zip

I GLRM: replace zip code by low dimensional zip codefeatures

fit a supervised (deep learning) model:

method train error test error runtime

categorical 0.2091690 0.2173612 23.7600000concatenate 0.2258872 0.2515906 4.4700000

GLRM 0.1790884 0.1933637 4.3600000

31 / 38

Page 40: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Outline

Elections

Data Tables

Generalized Low Rank Models

Applications

Bias

32 / 38

Page 41: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

The trouble with polls

Q: are people who respond to polls like people who don’t?

A: no:

There is a 19-year-old black man in Illinois who hasno idea of the role he is playing in this election.

He is sure he is going to vote for Donald J. Trump.In some polls, he’s weighted as much as 30 times

more than the average respondent, and as much as 300times more than the least-weighted respondent.

http://www.nytimes.com/2016/10/13/upshot/

how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.

html

33 / 38

Page 42: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

The trouble with polls

Q: are people who respond to polls like people who don’t?A: no:

There is a 19-year-old black man in Illinois who hasno idea of the role he is playing in this election.

He is sure he is going to vote for Donald J. Trump.In some polls, he’s weighted as much as 30 times

more than the average respondent, and as much as 300times more than the least-weighted respondent.

http://www.nytimes.com/2016/10/13/upshot/

how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.

html

33 / 38

Page 43: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Correct biased sample

two types of people

I type A always fill out all questionsI type B leave question 3 blank half the time

question 1 question 2 question 3 question 4 · · ·2.7 yes 4 yes · · ·9.2 no ? no · · ·2.7 yes 4 yes · · ·9.2 no 1 no · · ·9.2 no 1 no · · ·9.2 no ? no · · ·

......

.... . .

estimate population mean of question 3

I excluding missing entries: 2.5I imputing missing entries: 2

34 / 38

Page 44: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Correct biased sample

two types of people

I type A always fill out all questionsI type B leave question 3 blank half the time

question 1 question 2 question 3 question 4 · · ·2.7 yes 4 yes · · ·9.2 no ? no · · ·2.7 yes 4 yes · · ·9.2 no 1 no · · ·9.2 no 1 no · · ·9.2 no ? no · · ·

......

.... . .

estimate population mean of question 3 iftype B people have two subtypes:

I one that answers “1” to question 3I another that doesn’t answer, but whose true answer is “27”

35 / 38

Page 45: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

How does this apply to election models?

simple model: suppose that in each demographic group,

I there are some Trump and some Clinton supporters

I the Trump supporters respond to pollsters at lower rates (orlie about their support)

there is no way to detect this from polling data!

n.b. confidence intervals (as usually computed)

I account for statistical error

I do not account for systematic error

36 / 38

Page 46: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

How does this apply to election models?

simple model: suppose that in each demographic group,

I there are some Trump and some Clinton supporters

I the Trump supporters respond to pollsters at lower rates (orlie about their support)

there is no way to detect this from polling data!

n.b. confidence intervals (as usually computed)

I account for statistical error

I do not account for systematic error

36 / 38

Page 47: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Dealing with systematic bias

for correct estimation, need outcome to be independent ofmissingness conditional on covariates

support for Trump ⊥ non-response | demographics

problem with systematic bias:

I even if you know it exists, you don’t know how much!

I modeling systemic bias?

37 / 38

Page 48: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Dealing with systematic bias

for correct estimation, need outcome to be independent ofmissingness conditional on covariates

support for Trump ⊥ non-response | demographics

problem with systematic bias:

I even if you know it exists, you don’t know how much!

I modeling systemic bias?

37 / 38

Page 49: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Dealing with systematic bias

for correct estimation, need outcome to be independent ofmissingness conditional on covariates

support for Trump ⊥ non-response | demographics

problem with systematic bias:

I even if you know it exists, you don’t know how much!

I modeling systemic bias?

37 / 38

Page 50: Filling in Missing Data: Elections, 1.5cm0.3mm, Healthcarepi.math.cornell.edu/files/AWM/wamglrm.pdf · I inner product x iy j approximates A ij ... I least squares low rank tting

Summary

generalized low rank models

I fill in missing data

I handle huge, heterogeneous data coherently

I transform big messy data into small clean data

paper:http://arxiv.org/abs/1410.0342

code:https://github.com/madeleineudell/LowRankModels.jl

38 / 38