Big Data Research Methods – Contemporary Analysis

Big Data & Research Methods
Presented by Grant Stanley, CEO, and Tadd Wood, Chief Data Scientist
Contemporary Analysis, 1209 Harney Street, Suite 200, Omaha, NE 68102

Upload: gstanley87

Post on 28-Nov-2014

Category:

Data & Analytics


DESCRIPTION

We are turning more and more “work” over to computers. However, this comes with a lot of responsibility. As we automate work, the impact of bad policies and decisions grows exponentially. We need to be vigilant to make sure that our work produces accurate results using sound research methods. We need to remember that the process of research is as important as the results. It is easy to forsake methodology, as Big Data distances researchers from the research process and puts the focus on data collection, storage, and processing. However, practicing solid methods is the best way to produce accurate results. During this presentation we will explore important research topics. For example, we will explore the exponential increase in noise (spurious relationships) as the number of variables increases and time horizons narrow. We will also cover ways to detect and prevent spurious relationships in a Big Data context.

TRANSCRIPT

Page 1: Big Data Research Methods – Contemporary Analysis

Big Data & Research Methods

PRESENTED BY

Grant Stanley, CEO
Tadd Wood, Chief Data Scientist

Contemporary Analysis
1209 Harney Street, Suite 200, Omaha, NE 68102

Page 2: Big Data Research Methods – Contemporary Analysis

Contemporary Analysis canworksmart.com

INTRO

The process of research is as important as the results.

• Correct research methods improve results,

• And allow others to collaborate and improve your work.

Page 3: Big Data Research Methods – Contemporary Analysis

INTRO

We’ll explore the dangers of:

• Spurious Correlation

• Sampling Errors

• Model Selection

• Heteroscedasticity

• Overfitting

• Lack of Background

• Solutions instead of Theories

• Lack of the Scientific Method

• Correlation vs. Causation

Page 4: Big Data Research Methods – Contemporary Analysis

INTRO

Big Data can’t just be about collecting, processing & storing more data.

It has to be put to use. We need to conduct research, build models, and develop reports.

Page 5: Big Data Research Methods – Contemporary Analysis

THE DANGER OF FALSE POSITIVES

The car has little impact without the highway or interstate.

If we take Big Data beyond engineering, we are building the equivalent of the highway or interstate for the computer & Internet.

Page 6: Big Data Research Methods – Contemporary Analysis

SPURIOUS RELATIONSHIPS

A spurious relationship occurs when two or more events or variables have no direct causal connection, yet a connection is wrongly inferred because of coincidence or the presence of an unseen third factor.

Page 7: Big Data Research Methods – Contemporary Analysis

SPURIOUS RELATIONSHIPS

Big Data Errors: Spurious Correlations

[Figure: number of spurious correlations (y-axis, tick labels from 20,000 to 140,000) plotted against number of variables (x-axis, 500 to 2000).]
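
One way to see why the count of spurious correlations climbs so quickly: n variables form n(n-1)/2 distinct pairs, so the number of chances for a coincidental match grows with the square of the variable count. The sketch below is only an illustration, not the presenters' calculation. The 5% false-positive rate, the 0.8 threshold, the 10-observation series, and all function names are assumptions chosen for the example.

```python
import math
import random


def pair_count(n_vars):
    """Number of distinct variable pairs: n * (n - 1) / 2."""
    return n_vars * (n_vars - 1) // 2


def expected_false_positives(n_vars, alpha=0.05):
    """Pairs expected to look 'significant' by chance at an assumed level alpha."""
    return alpha * pair_count(n_vars)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_vars, n_obs=10, threshold=0.8, seed=0):
    """Count pairs of pure-noise variables whose |r| meets the threshold."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    hits = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(pearson(data[i], data[j])) >= threshold:
                hits += 1
    return hits


if __name__ == "__main__":
    for n in (500, 1000, 1500, 2000):
        print(n, pair_count(n), expected_false_positives(n))
    # Every variable is independent noise, so any hit is spurious by construction.
    print(count_spurious(40), count_spurious(80))
```

Short series make the problem worse: with only 10 observations per variable, a high correlation between two unrelated noise series is far more likely than with a long history.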

Page 8: Big Data Research Methods – Contemporary Analysis

SPURIOUS RELATIONSHIPS

Maine’s divorce rate with US margarine consumption

Year | Divorce rate in Maine, divorces per 1000 people (US Census) | Consumption of margarine (US), per capita in pounds (USDA)
2000 | 5 | 8.2
2001 | 4.7 | 7
2002 | 4.6 | 6.5
2003 | 4.4 | 5.3
2004 | 4.3 | 5.2
2005 | 4.1 | 4
2006 | 4.2 | 4.6
2007 | 4.2 | 4.5
2008 | 4.2 | 4.2
2009 | 4.1 | 3.7

Correlation: 0.992558

[Figure: two-axis line chart for 2000 to 2009. One axis: divorces per 1000 people (4 to 5). Other axis: margarine consumption in pounds (3 to 9). Series: Divorce rate in Maine; Per capita consumption of margarine (US).]
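
The correlation quoted on this slide can be recomputed directly from the ten yearly values. The snippet below is a minimal sketch using the plain Pearson formula. The variable and function names are illustrative, and the data are copied from the slide.

```python
import math

# Values transcribed from the slide, 2000 through 2009.
divorce_rate_maine = [5, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_lbs_per_capita = [8.2, 7, 6.5, 5.3, 5.2, 4, 4.6, 4.5, 4.2, 3.7]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson(divorce_rate_maine, margarine_lbs_per_capita)
print(f"r = {r:.6f}")
```

A coefficient this close to 1 from only ten points says nothing about a mechanism. Both series simply trend downward over the same decade.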

Page 9: Big Data Research Methods – Contemporary Analysis

SAMPLING

There are two reasons for sampling a population:

• Collecting and processing all of the data is too costly or impossible.

• To ensure that the results are representative of the population.

Page 10: Big Data Research Methods – Contemporary Analysis

SAMPLING

Sampling still matters in Big Data.

Data is not information. It is simply a representation of information. You have to think about what the data you are using represents.

Page 11: Big Data Research Methods – Contemporary Analysis

SAMPLING

Is smartphone data representative of the population?

[Figure: two stacked bar charts, Gender by Platform and Age by Platform, each with an iPhone bar and an Android bar on a 0% to 100% scale. Value labels in extraction order: male 57%, 73%; female 43%, 27%; 17 or younger 7%, 13%; 18 - 24 12%, 17%; 25 - 34 21%, 30%; 35 - 44 21%, 21%; 45+ 32%, 25%.]
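
When a sample is skewed like this, one common remedy (not taken from the slides) is to reweight each group by the ratio of its population share to its sample share. The sketch below assumes the 57% / 43% gender split shown for one platform as the sample. The population shares and the group averages are hypothetical numbers chosen only to make the arithmetic visible.

```python
def poststratification_weights(sample_share, population_share):
    """Weight per group = population share / sample share."""
    return {g: population_share[g] / sample_share[g] for g in sample_share}


def weighted_mean(values_by_group, sample_share, weights):
    """Mean of group-level values after reweighting the sample."""
    total = sum(sample_share[g] * weights[g] for g in sample_share)
    weighted = sum(values_by_group[g] * sample_share[g] * weights[g]
                   for g in sample_share)
    return weighted / total


sample_share = {"male": 0.57, "female": 0.43}      # one gender split from the slide
population_share = {"male": 0.49, "female": 0.51}  # hypothetical target population
outcome = {"male": 10.0, "female": 20.0}           # hypothetical group averages

weights = poststratification_weights(sample_share, population_share)
raw = sum(outcome[g] * sample_share[g] for g in sample_share)
adjusted = weighted_mean(outcome, sample_share, weights)
print(weights, raw, adjusted)
```

Reweighting only corrects for the variables you weight on. It cannot repair a sample that is missing a group entirely.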

Page 12: Big Data Research Methods – Contemporary Analysis

MODEL SELECTION

Ordinary least squares (OLS) is not a catch-all.

You have to know your data. Is it continuous, discrete, binary, ordinal, or categorical? Is your data symmetric or asymmetric? Are there outliers?
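
A small illustration of why the data type matters: fitting a straight line by least squares to a binary outcome produces "probabilities" below 0 and above 1. The data and function names below are made up for the example. A model built for binary outcomes, such as logistic regression, avoids this by construction.

```python
def fit_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # hypothetical binary outcome

intercept, slope = fit_ols(x, y)
predictions = [intercept + slope * xi for xi in x]

# Predictions that cannot be read as probabilities.
out_of_range = [round(p, 3) for p in predictions if p < 0 or p > 1]
print(out_of_range)
```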

Page 13: Big Data Research Methods – Contemporary Analysis

MODEL SELECTION

Page 14: Big Data Research Methods – Contemporary Analysis

HETEROSCEDASTICITY

Heteroscedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it.
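
A rough way to check for this is to fit a line, split the residuals at the midpoint of the predictor, and compare their spread in the two halves. The sketch below does that on simulated data. It is an informal check in the spirit of a Goldfeld-Quandt comparison, not a formal test, and the simulation parameters and function names are assumptions.

```python
import random


def fit_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def mean_square(values):
    """Average of squared values."""
    return sum(v * v for v in values) / len(values)


def residual_spread_ratio(x, y):
    """Mean squared residual for the upper half of x divided by the lower half."""
    intercept, slope = fit_ols(x, y)
    pairs = sorted(zip(x, y))
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    half = len(residuals) // 2
    return mean_square(residuals[half:]) / mean_square(residuals[:half])


def simulate(n=400, heteroscedastic=True, seed=0):
    """Linear data whose noise either grows with x or stays constant."""
    rng = random.Random(seed)
    x = list(range(1, n + 1))
    y = []
    for xi in x:
        sd = 0.5 * xi if heteroscedastic else 50.0
        y.append(10 + 2 * xi + rng.gauss(0, sd))
    return x, y


ratio_hetero = residual_spread_ratio(*simulate(heteroscedastic=True))
ratio_homo = residual_spread_ratio(*simulate(heteroscedastic=False))
print(ratio_hetero, ratio_homo)
```

A ratio near 1 is what constant variance looks like. A ratio well above 1 means the prediction error grows with the predictor, so a single error estimate for the whole range is misleading.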

Page 15: Big Data Research Methods – Contemporary Analysis

HETEROSCEDASTICITY

Predicting equipment pricing based on machine hours

[Figure: market price (y-axis) against hours on machine (x-axis), with labels T1, T2, and T3 and the fitted line Ŷ = a + bx.]

Page 16: Big Data Research Methods – Contemporary Analysis

[Figure: six panels in two rows. Top row: Unbiased & Homoscedastic, Biased & Homoscedastic, Biased & Homoscedastic. Bottom row: Unbiased & Heteroscedastic, Biased & Heteroscedastic, Biased & Heteroscedastic.]

Page 17: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Overfitting occurs when a statistical model captures more than just the underlying relationships.

The model is fitted to as much of the data as possible, including random errors, outliers, and noise.

Page 18: Big Data Research Methods – Contemporary Analysis

OVERFITTING

An overfitted model nearly perfectly matches the training set, but does not perform well with new data. While an overfitted model looks great, it will have poor predictive performance.

Page 19: Big Data Research Methods – Contemporary Analysis

OVERFITTING

The mark of a good model isn’t how well it performs on the data used to build the model, but on fresh data outside of the training data set.
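
The gap between training and fresh-data performance can be reproduced on a toy problem. In the sketch below, a model that memorizes every training point (1-nearest-neighbour) is compared with a single-threshold rule on noisy labels. The data generator, the 20% label noise, and all names are assumptions for illustration, not the presenters' voter model.

```python
import random


def make_data(n, rng, noise=0.2):
    """x in [0, 1); the true label is x > 0.5, flipped with probability noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data


def accuracy(predict, data):
    """Share of points whose predicted label matches the observed label."""
    return sum(predict(x) == label for x, label in data) / len(data)


def memorizer(train):
    """1-nearest-neighbour: reproduces every training label, noise included."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict


def threshold_model(train):
    """Single cut-point picked from a coarse grid by training accuracy."""
    grid = [i / 20 for i in range(21)]
    best = max(grid, key=lambda t: accuracy(lambda x: x > t, train))
    return lambda x: x > best


rng = random.Random(42)
train, test = make_data(300, rng), make_data(1000, rng)
models = {"memorizer": memorizer(train), "threshold": threshold_model(train)}
results = {name: {"train": accuracy(m, train), "test": accuracy(m, test)}
           for name, m in models.items()}
print(results)
```

The memorizer scores perfectly on the data it was built from because it has stored the noise. The simpler rule scores lower in training but holds up on the held-out set.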

Page 20: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Overfitting Example: Training Classification Table

General Election (Observed) | General Election (Predicted): Did not vote | General Election (Predicted): Voted | Percentage Correct
Did not vote | 132423 | 3 | 99.99773%
Voted | 0 | 411099 | 100%
Overall Correct Percentage | | | 100%

Page 21: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Overfitting Example: Prediction Classification Table

General Election (Observed) | General Election (Predicted): Did not vote | General Election (Predicted): Voted | Percentage Correct
Did not vote | 35726 | 4068 | 90%
Voted | 45924 | 77199 | 63%
Overall Correct Percentage | | | 69%
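
The percentages in the training and prediction classification tables for the overfitting example follow directly from the cell counts. The helper below is a small sketch (the function name and dictionary layout are illustrative) that recomputes the per-row and overall share correct from those counts.

```python
def classification_summary(table):
    """table[observed][predicted] -> (per-class share correct, overall share correct)."""
    per_class = {obs: row[obs] / sum(row.values()) for obs, row in table.items()}
    correct = sum(row[obs] for obs, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return per_class, correct / total


# Cell counts transcribed from the two overfitting-example tables.
training = {
    "did_not_vote": {"did_not_vote": 132423, "voted": 3},
    "voted": {"did_not_vote": 0, "voted": 411099},
}
prediction = {
    "did_not_vote": {"did_not_vote": 35726, "voted": 4068},
    "voted": {"did_not_vote": 45924, "voted": 77199},
}

train_per_class, train_overall = classification_summary(training)
pred_per_class, pred_overall = classification_summary(prediction)
print(train_overall, pred_per_class, pred_overall)
```

Putting the two overall figures side by side is the whole diagnosis: near-perfect accuracy on the training table and roughly 69% on new data.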

Page 22: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Overfitting Example: Variables

Variable | B (Coefficients) | Standard Error | Wald | Significance | 95% C.I. for EXP(B), Lower | 95% C.I. for EXP(B), Upper
NumberOfPastRaces | 63.840 | 106.208 | .361 | .548 | .000 | 1.35E+118
Primary_03072000_Voter | -66.218 | 106.264 | .388 | .533 | .000 | 4.95E+61
General_1107200_Voter | -61.971 | 106.219 | .340 | .560 | .000 | 3.16E+63
Special_05082001_Voter | -58.129 | 111.165 | .273 | .601 | .000 | 2.39E+69
General_11062001_Voter | -60.658 | 106.181 | .326 | .568 | .000 | 1.09E+64
Primary_05072002_Voter | -57.806 | 99.816 | .335 | .563 | .000 | 7.23E+59
General_11052002_Voter | -63.208 | 106.206 | .354 | .552 | .000 | 8.94E+62
Special_05062003_Voter | -66.393 | 106.249 | .390 | .532 | .000 | 4.03E+61
General_11042003_Voter | -64.056 | 106.209 | .364 | .546 | .000 | 3.85E+62
Primary_03022004_Voter | -63.836 | 106.204 | .361 | .548 | .000 | 4.76E+62
Special_02052005_Voter | -58.510 | 111.784 | .274 | .601 | .000 | 5.50E+69
General_11082005_Voter | -65.617 | 106.238 | .381 | .537 | .000 | 8.56E+61
Special_02072006_Voter | -56.952 | 305.188 | .035 | .852 | .000 | 1.10E+235
Primary_05022006_Voter | -64.696 | 106.220 | .371 | .542 | .000 | 2.08E+62
General_11072006_Voter | -64.074 | 106.210 | .364 | .546 | .000 | 3.79E+62
Primary_05082007_Voter | -65.976 | 106.233 | .386 | .535 | .000 | 5.93E+61
Primary_09112007_Voter | -57.949 | 15652.399 | .000 | .997 | .000 | —
General_11062007_Voter | -67.465 | 106.231 | .403 | .525 | .000 | 1.33E+61
General_12112007_Voter | -75.855 | 106.274 | .509 | .475 | .000 | 3.29E+57
Primary_03042008_Voter | -62.602 | 106.214 | .347 | .556 | .000 | 1.67E+63
General_11042008_Voter | -64.100 | 106.220 | .364 | .546 | .000 | 3.77E+62
Primary_05052009_Voter | -57.094 | 98.053 | .339 | .560 | .000 | 4.56E+58
Primary_09152009_Voter | -54.792 | 7118.311 | .000 | .994 | .000 | —
General_11032009_Voter | -55.176 | 98.071 | .317 | .574 | .000 | 3.28E+59
Primary_05042010_Voter | -65.564 | 106.234 | .381 | .537 | .000 | 8.97E+61
Primary_07132010_Voter | -56.331 | 45432.804 | .000 | .999 | .000 | —
Primary_09072010_Voter | -57.607 | 3684.807 | .000 | .998 | .000 | —
General_11022010_Voter | -63.431 | 106.214 | .357 | .550 | .000 | 7.28E+62
Primary_05032011_Voter | -57.848 | 136.939 | .178 | .673 | .000 | 2.75E+91
General_11082011_Voter | -54.865 | 98.255 | .312 | .577 | .000 | 6.42E+59
Primary_03062012_Voter | -55.419 | 95.847 | .334 | .563 | .000 | 3.29E+57
Primary_05072013_Voter | -58.652 | 110.873 | .280 | .597 | .000 | 8.00E+68
General_11052013_Voter | -62.617 | 106.196 | .348 | .555 | .000 | 1.58E+63
Constant | -115.093 | 212.413 | .294 | .588 | |
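
The columns of this table are linked by two simple formulas, assuming it reports the usual Wald statistic, (B / SE)², and a normal-approximation interval, exp(B ± 1.96 × SE). The sketch below applies both to the NumberOfPastRaces row. The function names are illustrative and the 1.96 multiplier is the standard 95% value, assumed here rather than stated on the slide.

```python
import math


def wald_statistic(b, se):
    """Squared ratio of a coefficient to its standard error."""
    return (b / se) ** 2


def exp_b_confidence_interval(b, se, z=1.96):
    """Normal-approximation interval for exp(B)."""
    return math.exp(b - z * se), math.exp(b + z * se)


b, se = 63.840, 106.208  # NumberOfPastRaces row of the overfitting example
wald = wald_statistic(b, se)
lower, upper = exp_b_confidence_interval(b, se)
print(f"Wald = {wald:.3f}, lower = {lower:.3e}, upper = {upper:.3e}")
```

When the standard error is larger than the coefficient itself, the interval for exp(B) runs from effectively zero to an astronomically large number. The coefficient carries almost no usable information, however well the model classifies its own training data.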

Page 23: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Simple Model Example: Variables

Variable | B (Coefficients) | Standard Error | Wald | Significance | 95% C.I. for EXP(B), Lower | 95% C.I. for EXP(B), Upper
Age_life_bin_1 | .344 | .019 | 312.341 | .000 | 1.358 | 1.466
Age_life_bin_2 | .282 | .017 | 266.954 | .000 | 1.282 | 1.372
Age_life_bin_3 | .180 | .017 | 109.330 | .000 | 1.158 | 1.239
Age_life_bin_4 | .133 | .018 | 53.146 | .000 | 1.102 | 1.184
Age_life_bin_5 | .055 | .019 | 8.719 | .003 | 1.019 | 1.096
Age_life_bin_7 | -.342 | .029 | 139.262 | .000 | .671 | .752
Age_life_bin_8 | -1.949 | .029 | 4636.533 | .000 | .135 | .151
Party_affiliation_D | .523 | .037 | 202.630 | .000 | 1.570 | 1.814
Party_affiliation_R | .692 | .027 | 656.239 | .000 | 1.895 | 2.106
NumberOfPastRaces | .480 | .002 | 63659.304 | .000 | 1.611 | 1.623
Constant | -1.332 | .017 | 6041.871 | .000 | |

Page 24: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Simple Model Example: Training Classification Table

General Election (Observed) | General Election (Predicted): Did not vote | General Election (Predicted): Voted | Percentage Correct
Did not vote | 95397 | 37029 | 72%
Voted | 43439 | 367660 | 89%
Overall Correct Percentage | | | 85%

Page 25: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Simple Model Example: Prediction Classification Table

General Election (Observed) | General Election (Predicted): Did not vote | General Election (Predicted): Voted | Percentage Correct
Did not vote | 72167 | 9483 | 88%
Voted | 15131 | 66136 | 81%
Overall Correct Percentage | | | 85%

Page 26: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Big Data Errors: Spurious Correlations

[Figure: number of spurious correlations (y-axis, tick labels from 20,000 to 140,000) plotted against number of variables (x-axis, 500 to 2000).]

Page 27: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Overstuffing Example: Variables

Variable | B (Coefficients) | Standard Error | Wald | Significance | 95% C.I. for EXP(B), Lower | 95% C.I. for EXP(B), Upper
Age_life_bin_1 | .331 | .020 | 286.120 | .000 | 1.339 | 1.446
Age_life_bin_2 | .281 | .017 | 263.325 | .000 | 1.281 | 1.371
Age_life_bin_3 | .184 | .017 | 113.157 | .000 | 1.162 | 1.243
Age_life_bin_4 | .134 | .018 | 53.857 | .000 | 1.103 | 1.185
Age_life_bin_5 | .058 | .019 | 9.629 | .002 | 1.022 | 1.099
Age_life_bin_7 | -.348 | .029 | 143.259 | .000 | .667 | .748
Age_life_bin_8 | -1.959 | .029 | 4687.305 | .000 | .133 | .149
Party_affiliation_D | .513 | .037 | 194.040 | .000 | 1.554 | 1.796
Party_affiliation_R | .684 | .027 | 637.417 | .000 | 1.879 | 2.089
NumberOfPastRaces | .478 | .002 | 62834.614 | .000 | 1.608 | 1.620
Residential_Zip_3 | -.364 | .127 | 8.181 | .004 | .541 | .892
Residential_Zip_7 | .360 | .063 | 32.902 | .000 | 1.268 | 1.622
Residential_Zip_8 | .428 | .218 | 3.834 | .050 | 1.000 | 2.354
Residential_Zip_16 | -.125 | .023 | 28.277 | .000 | .843 | .924
Residential_Zip_17 | .127 | .058 | 4.797 | .029 | 1.013 | 1.272
Residential_Zip_18 | -.356 | .044 | 64.141 | .000 | .642 | .764
Residential_Zip_19 | -.283 | .026 | 117.878 | .000 | .716 | .793
Residential_Zip_21 | .115 | .037 | 9.801 | .002 | 1.044 | 1.206
Residential_Zip_22 | .113 | .026 | 19.024 | .000 | 1.064 | 1.178
Residential_Zip_25 | -.182 | .024 | 59.045 | .000 | .796 | .873
Residential_Zip_26 | .074 | .032 | 5.248 | .022 | 1.011 | 1.148
Residential_Zip_27 | -.132 | .033 | 16.081 | .000 | .821 | .935
Residential_Zip_28 | -.077 | .023 | 11.484 | .001 | .885 | .968
Residential_Zip_29 | -.160 | .038 | 17.765 | .000 | .791 | .918
Residential_Zip_30 | -.191 | .044 | 18.638 | .000 | .758 | .901
Residential_Zip_33 | -.059 | .030 | 3.945 | .047 | .889 | .999
Residential_Zip_35 | .104 | .026 | 15.662 | .000 | 1.054 | 1.168
Residential_Zip_41 | .140 | .018 | 57.675 | .000 | 1.109 | 1.193
Residential_Zip_42 | .156 | .039 | 16.010 | .000 | 1.083 | 1.262
Residential_Zip_45 | .138 | .024 | 32.782 | .000 | 1.095 | 1.204
Residential_Zip_46 | -.065 | .018 | 12.838 | .000 | .904 | .971
Residential_Zip_48 | .261 | .022 | 136.998 | .000 | 1.243 | 1.357
Residential_Zip_50 | .164 | .025 | 41.633 | .000 | 1.121 | 1.239
Residential_Zip_51 | .157 | .031 | 26.169 | .000 | 1.102 | 1.243
Residential_Zip_53 | .114 | .033 | 11.628 | .001 | 1.050 | 1.197
Residential_Zip_54 | .104 | .029 | 13.215 | .000 | 1.049 | 1.174
Residential_Zip_56 | .116 | .032 | 13.238 | .000 | 1.055 | 1.196
Residential_Zip_59 | .094 | .032 | 8.647 | .003 | 1.032 | 1.170
Local_School_District_6 | -.375 | .055 | 47.296 | .000 | .618 | .765
Local_School_District_7 | .078 | .016 | 23.389 | .000 | 1.047 | 1.115
Local_School_District_9 | -.501 | .057 | 77.534 | .000 | .542 | .677
Local_School_District_10 | -.255 | .033 | 61.473 | .000 | .727 | .826
Constant | -1.332 | .018 | 5513.792 | .000 | |

Page 28: Big Data Research Methods – Contemporary Analysis

OVERFITTING

Overstuffing Example: Training Classification Table

General Election (Observed) | General Election (Predicted): Did not vote | General Election (Predicted): Voted | Percentage Correct
Did not vote | 93029 | 39397 | 70%
Voted | 36228 | 374871 | 91%
Overall Correct Percentage | | | 86%

Page 29: Big Data Research Methods – Contemporary Analysis

LACK OF BACKGROUND

The farther we are from the work, the more likely we are to be tricked by the data.

We owe it to the end user to get out of the library, and try to understand what we are modeling.

Page 30: Big Data Research Methods – Contemporary Analysis

SOLUTIONS INSTEAD OF THEORIES

There is an element of data science that should be frustrating, confusing, and despair-inducing.

It should make us stand back in awe of the complexity of the world, and not of the simplicity to which we can reduce it.

Page 31: Big Data Research Methods – Contemporary Analysis

SOLUTIONS INSTEAD OF THEORIES

“The great thing about economics, is that we admit that we know nothing about anything”

- Thomas Piketty, author of “Capital in the Twenty-First Century”

Page 32: Big Data Research Methods – Contemporary Analysis

SOLUTIONS INSTEAD OF THEORIES

As we learn more, we realize there’s more to learn.

The hallmark of genius is the sharp awareness of what is and what is not possible. We become aware of complexity, ambiguity and nuance.

Page 33: Big Data Research Methods – Contemporary Analysis

CORRELATION & CAUSATION

The anthem of the Big Data age is “correlation does not imply causation.”

Page 34: Big Data Research Methods – Contemporary Analysis

CORRELATION & CAUSATION

The problem is that this statement is tautological. It is always correct, and can never be wrong.

Page 35: Big Data Research Methods – Contemporary Analysis

CORRELATION & CAUSATION

Don’t let people use it as a kill switch for discussion.

• True causation is pretty rare. There are few things where, if I do this, this will happen.

• Research should create discussions, not shut them down. Models can’t explain everything. There is always an “X” variable that captures the unknown.

Page 36: Big Data Research Methods – Contemporary Analysis

SOLUTIONS INSTEAD OF THEORIES

Page 37: Big Data Research Methods – Contemporary Analysis

FAILING TO AUDIT

Primary reasons that we fail to have our work peer-reviewed:

• Lack of funding to “repeat” work.

• We hide behind the complexity of our work.

Page 38: Big Data Research Methods – Contemporary Analysis

FAILING TO AUDIT

Page 39: Big Data Research Methods – Contemporary Analysis

FAILING TO AUDIT

Other tools:

• R Markdown: for creating webpages and documents in R

• IPython notebooks: for creating websites and documents interactively in Python

• Galaxy Project: for creating reproducible workflows. (Favorable for people with less scripting experience.)
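
Whatever tool is used, an audit is easier when each run records enough to be repeated. The sketch below is one hypothetical way to do that in plain Python: fix the random seed and save a small manifest with the interpreter version, platform, and a hash of the input data. The field names, the seed value, and the sample data are all assumptions for illustration. None of this is a feature of the tools listed above.

```python
import hashlib
import json
import platform
import random
import sys


def run_manifest(seed, data_bytes):
    """Collect what a reviewer needs to repeat a run (fields are illustrative)."""
    return {
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }


def seeded_sample(seed, population, k):
    """Draw the same sample every time for a given seed."""
    return random.Random(seed).sample(population, k)


data = b"voter_id,age,voted\n1,34,1\n2,51,0\n"  # hypothetical input file contents
manifest = run_manifest(seed=2014, data_bytes=data)
print(json.dumps(manifest, indent=2))
print(seeded_sample(manifest["seed"], list(range(100)), 5))
```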

Page 40: Big Data Research Methods – Contemporary Analysis

TRAINING

We offer training on:

• Data Visualization

• Managerial Statistics

• Predictive Modeling

You will be introduced to:

• R

• SPSS

• Tableau

• MySQL

• Git

Page 41: Big Data Research Methods – Contemporary Analysis

TRAINING

Training sessions last 3 days.

We will work through projects, practice different approaches, and discuss which approach is best for different scenarios.

Page 42: Big Data Research Methods – Contemporary Analysis

Questions & Learn More

QUESTIONS?
Grant Stanley, CEO
Contemporary Analysis
1209 Harney Street, Suite 200
Omaha, NE 68102
[email protected]
(402) 679-8398