Big Data Research Methods – Contemporary Analysis
DESCRIPTION
We are turning more and more “work” over to computers. However, this comes with a lot of responsibility. As we automate work, the impact of bad policies and decisions grows exponentially. We need to be vigilant to make sure that our work produces accurate results using sound research methods. We need to remember that the process of research is as important as the results. It is easy to forsake methodology, as Big Data distances researchers from the research process and puts the focus on data collection, storage, and processing. However, practicing solid methods is the best way to produce accurate results. During this presentation we will explore important research topics. For example, we will explore the exponential increase in noise (spurious relationships) as the number of variables increases and time horizons narrow. We will also cover ways to detect and prevent spurious relationships in a Big Data context.
TRANSCRIPT
Big Data & Research Methods
PRESENTED BY
Grant Stanley, CEO
Tadd Wood, Chief Data Scientist
Contemporary Analysis
1209 Harney Street, Suite 200
Omaha, NE 68102
Contemporary Analysis canworksmart.com
INTRO
The process of research is as important as the results.
• Correct research methods improve results,
• And allow others to collaborate and improve your work.
INTRO
We’ll explore the dangers of:
• Spurious Correlation
• Sampling Errors
• Model Selection
• Heteroscedasticity
• Overfitting
• Lack of Background
• Solutions instead of Theories
• Lack of the Scientific Method
• Correlation vs. Causation
INTRO
Big Data can’t just be about collecting, processing & storing more data.
It has to be put to use. We need to conduct research, build models, and develop reports.
THE DANGER OF FALSE POSITIVES
The car has little impact without the highway or interstate.
If we take Big Data beyond engineering, we are building the equivalent of the highway or interstate for the computer & Internet.
SPURIOUS RELATIONSHIPS
A spurious relationship arises when two or more events or variables have no direct causal connection, yet a connection is wrongly inferred, due to either coincidence or the presence of a third, unseen factor.
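The "third, unseen factor" case can be illustrated with a small simulation. This is a minimal sketch on made-up data, not anything taken from the presentation: the variables `x`, `y`, and `z`, the noise levels, and the seed are all assumptions chosen for illustration. Both `x` and `y` depend only on the hidden factor `z`, yet they correlate strongly until `z` is accounted for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, zs):
    """Residuals of ys after a least-squares line fit on zs."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_z = sum(zs) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    intercept = mean_y - slope * mean_z
    return [y - (intercept + slope * z) for y, z in zip(ys, zs)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000
# ASSUMPTION: simulated data; z is the hidden third factor.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]  # driven by z, not by y
y = [zi + rng.gauss(0, 0.5) for zi in z]  # driven by z, not by x

raw_r = pearson(x, y)
# Correlate what is left of x and y once z has been regressed out.
partial_r = pearson(residuals(x, z), residuals(y, z))

print(f"correlation of x and y:        {raw_r:.3f}")
print(f"correlation after removing z:  {partial_r:.3f}")
```

If the hidden factor is never measured, only the first number is available, which is why background knowledge about what might be driving both variables matters.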
SPURIOUS RELATIONSHIPS
Big Data Errors: Spurious Correlations
[Figure: number of spurious correlations (vertical axis, ticks at 20,000, 80,000, and 140,000) plotted against number of variables (horizontal axis, 500 to 2000)]
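One way to see why spurious correlations multiply is to count them in pure noise. The sketch below is an illustration under stated assumptions, not a reproduction of the chart: the number of observations (`n_obs`), the cut-off (`threshold`), and the variable counts are hypothetical choices. Every variable is independent random noise, so any strong correlation found is spurious by construction, and the number of pairs to check grows as k(k-1)/2.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_obs = 10       # ASSUMPTION: ten observations per variable (a short time horizon)
max_vars = 80    # ASSUMPTION: largest number of variables considered
threshold = 0.7  # ASSUMPTION: |r| above this counts as a "strong" correlation

# Independent noise: no variable has any real relationship to another.
data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(max_vars)]

results = {}
for k in (10, 20, 40, 80):
    # Use the first k variables, so each step only adds new pairs.
    pairs = list(itertools.combinations(range(k), 2))
    strong = sum(1 for i, j in pairs
                 if abs(pearson(data[i], data[j])) > threshold)
    results[k] = (len(pairs), strong)
    print(f"{k:>3} variables: {len(pairs):>5} pairs, "
          f"{strong:>3} with |r| > {threshold}")
```

Shortening `n_obs` or lowering `threshold` makes the count of chance "findings" rise faster, which matches the point about narrow time horizons.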
SPURIOUS RELATIONSHIPS
Maine’s divorce rate with US margarine consumption
Year    Divorce rate in Maine,                  Consumption of margarine (US),
        divorces per 1000 people (US Census)    per capita in pounds (USDA)
2000    5                                       8.2
2001    4.7                                     7
2002    4.6                                     6.5
2003    4.4                                     5.3
2004    4.3                                     5.2
2005    4.1                                     4
2006    4.2                                     4.6
2007    4.2                                     4.5
2008    4.2                                     4.2
2009    4.1                                     3.7
Correlation: 0.992558
[Figure: line chart for 2000 to 2009 with two vertical axes, divorce rate in Maine (divorces per 1000 people) and per capita consumption of margarine in the US (pounds)]
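The reported correlation can be checked directly from the ten yearly values on the slide. The sketch below uses the standard Pearson formula; the function and variable names are my own.

```python
import math

# Values as given on the slide, 2000 to 2009.
divorce_rate_maine = [5, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_per_capita = [8.2, 7, 6.5, 5.3, 5.2, 4, 4.6, 4.5, 4.2, 3.7]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(divorce_rate_maine, margarine_per_capita)
print(f"Pearson r = {r:.6f}")
```

A coefficient this close to 1 from only ten points says nothing about whether margarine and divorce are connected; it only says the two series happened to decline together.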
SAMPLING
There are two reasons for sampling a population:
• Collecting and processing data on the whole population is too costly or impossible.
• To ensure that the results are representative of the population.
SAMPLING
Sampling still matters in Big Data.
Data is not information. It is simply a representation of information. You have to think about what the data you are using represents.
SAMPLING
Is smartphone data representative of the population?
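[Figure: two bar charts, "Gender by Platform" and "Age by Platform", each comparing iPhone and Android on a 0% to 100% scale]
Gender by Platform: 57% male and 43% female on one platform; 73% male and 27% female on the other.
Age by Platform (two values per age band, one per platform): 17 or younger 7% and 13%; 18 - 24 12% and 17%; 25 - 34 21% and 30%; 35 - 44 21% and 21%; 45+ 32% and 25%.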
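To make the representativeness question concrete, the sketch below shows how a skewed sample can shift an estimate and how reweighting can correct it. Only the 73% male and 27% female split comes from the chart. The 50/50 population split and the per-group rates are hypothetical numbers chosen for illustration, and post-stratification weighting is one common correction, not necessarily the one the presenters use.

```python
# Sample composition from the chart (the 73% male / 27% female platform).
sample_share = {"male": 0.73, "female": 0.27}

# ASSUMPTION: hypothetical target population, split evenly.
population_share = {"male": 0.50, "female": 0.50}

# ASSUMPTION: hypothetical rate of some behaviour of interest in each group.
group_rate = {"male": 0.40, "female": 0.60}

# Naive estimate: average over the sample as collected.
naive = sum(sample_share[g] * group_rate[g] for g in sample_share)

# Post-stratification: weight each group by population share / sample share.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}
weighted = (
    sum(sample_share[g] * weights[g] * group_rate[g] for g in sample_share)
    / sum(sample_share[g] * weights[g] for g in sample_share)
)

print(f"naive estimate:    {naive:.3f}")
print(f"weighted estimate: {weighted:.3f}")
```

Weighting only helps for characteristics that were measured; groups missing from the data entirely cannot be weighted back in.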
MODEL SELECTION
OLS is not a catch-all.
You have to know your data. Is it continuous, discrete, binary, ordinal, or categorical? Is your data symmetric or asymmetric? Are there outliers?
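As one illustration of why the data type matters, the sketch below fits an ordinary least squares line to a binary outcome. The data are made up for the example. Because OLS treats the 0/1 outcome as continuous, some fitted values fall below 0 and above 1, which cannot be read as probabilities; a model designed for binary outcomes, such as logistic regression, keeps its predictions inside that range.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# ASSUMPTION: toy data, a binary outcome that switches from 0 to 1 at x = 10.
xs = list(range(20))
ys = [0 if x < 10 else 1 for x in xs]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]

print(f"smallest fitted value: {min(predictions):.3f}")
print(f"largest fitted value:  {max(predictions):.3f}")
```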
HETEROSCEDASTICITY
Heteroscedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it.
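HETEROSCEDASTICITY
Predicting equipment pricing based on machine hours
[Figure: market price (vertical axis) plotted against hours on machine (horizontal axis), with points T1, T2, and T3 marked and the fitted line ŷ = a + bx]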
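A rough check for heteroscedasticity is to fit the line and compare the spread of the residuals at low and high values of the predictor. The sketch below does this on simulated data: the price formula, the noise that grows with machine hours, and the seed are assumptions made for illustration, not figures from the presentation. The comparison is in the spirit of a split-sample test such as Goldfeld-Quandt, but it is a simplified version and not that test itself.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_square(values):
    """Average squared value, used here as the residual spread."""
    return sum(v * v for v in values) / len(values)


rng = random.Random(1)  # fixed seed so the run is repeatable
# ASSUMPTION: simulated equipment data; noise grows with hours on the machine.
hours = [rng.uniform(100, 5000) for _ in range(300)]
price = [100_000 - 10 * h + rng.gauss(0, 2 * h) for h in hours]

intercept, slope = fit_line(hours, price)

# Sort residuals by machine hours, then compare the lowest and highest thirds.
by_hours = sorted((h, p - (intercept + slope * h)) for h, p in zip(hours, price))
third = len(by_hours) // 3
low_resid = [r for _, r in by_hours[:third]]
high_resid = [r for _, r in by_hours[-third:]]

ratio = mean_square(high_resid) / mean_square(low_resid)
print(f"residual variance, high hours vs low hours: {ratio:.1f}x")
```

A ratio near 1 would suggest roughly constant variance; a ratio far from 1 suggests the single fitted line is much less reliable in one part of the range than the other.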
[Figure: six panels. Top row: Unbiased & Homoscedastic; Biased & Homoscedastic; Biased & Homoscedastic. Bottom row: Unbiased & Heteroscedastic; Biased & Heteroscedastic; Biased & Heteroscedastic.]
OVERFITTING
Overfitting occurs when a statistical model captures more than just the underlying relationships.
The model is fitted to as much data as possible, including random errors, outliers, and noise.
OVERFITTING
An overfitted model nearly perfectly matches the training set, but does not perform well with new data. While an overfitted model looks great, it will have poor predictive performance.
OVERFITTING
The mark of a good model isn’t how well it performs on the data used to build the model, but how well it performs on fresh data outside of the training data set.
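The gap between training and fresh-data performance can be shown with a toy comparison. In this sketch, a model that memorises the training set (a 1-nearest-neighbour rule) is compared with a single cut-off rule. The data, the 20% label noise, and the seed are made up for illustration, and neither model is the one used in the voter example.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable


def make_data(n):
    """x in [0, 1); the true rule is x > 0.5, but 20% of labels are flipped."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # ASSUMPTION: 20% random label noise
            label = 1 - label
        rows.append((x, label))
    return rows


train = make_data(500)
test = make_data(1000)  # fresh data the models never saw


def nearest_neighbour(x):
    """Memorising model: copy the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(predict, rows):
    """Share of rows whose label the model predicts correctly."""
    return sum(predict(x) == label for x, label in rows) / len(rows)


# Simple model: one cut-off, chosen using the training data only.
cutoffs = [i / 20 for i in range(21)]
best_cutoff = max(cutoffs,
                  key=lambda c: accuracy(lambda x: int(x > c), train))


def threshold_rule(x):
    return int(x > best_cutoff)


nn_train = accuracy(nearest_neighbour, train)
nn_test = accuracy(nearest_neighbour, test)
simple_train = accuracy(threshold_rule, train)
simple_test = accuracy(threshold_rule, test)

print(f"memorising model: train {nn_train:.0%}, fresh data {nn_test:.0%}")
print(f"simple cut-off:   train {simple_train:.0%}, fresh data {simple_test:.0%}")
```

The memorising model reproduces the noise in its training labels, so its training score says little about how it will do on new cases.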
OVERFITTING
Overfitting Example: Training Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted      Percentage Correct
Did not vote                   132423          3          99.99773%
Voted                          0               411099     100%
Overall Correct Percentage                                100%
OVERFITTING
Overfitting Example: Prediction Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted      Percentage Correct
Did not vote                   35726           4068       90%
Voted                          45924           77199      63%
Overall Correct Percentage                                69%
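The percentages in the two classification tables follow from the counts. The sketch below recomputes them from the numbers shown in the training and prediction tables; the function name and dictionary layout are my own.

```python
def percent_correct(table):
    """table[observed][predicted] holds counts.

    Returns (share correct per observed class, overall share correct).
    """
    per_row = {obs: table[obs][obs] / sum(table[obs].values()) for obs in table}
    total = sum(sum(row.values()) for row in table.values())
    overall = sum(table[obs][obs] for obs in table) / total
    return per_row, overall


# Counts copied from the overfitting example's classification tables.
training = {
    "did_not_vote": {"did_not_vote": 132423, "voted": 3},
    "voted": {"did_not_vote": 0, "voted": 411099},
}
prediction = {
    "did_not_vote": {"did_not_vote": 35726, "voted": 4068},
    "voted": {"did_not_vote": 45924, "voted": 77199},
}

train_rows, train_overall = percent_correct(training)
pred_rows, pred_overall = percent_correct(prediction)

print(f"training:   overall {train_overall:.4%}")
print(f"prediction: overall {pred_overall:.0%}, "
      f"voted {pred_rows['voted']:.0%}, "
      f"did not vote {pred_rows['did_not_vote']:.0%}")
```

The drop from near-perfect accuracy on the training data to about 69% on the prediction data is the overfitting signature described above.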
OVERFITTING
Overfitting Example: Variables (Lower and Upper are the 95% C.I. for EXP(B))
Variable                  B (Coefficients)  Standard Error  Wald       Significance  Lower   Upper
NumberOfPastRaces         63.840            106.208         .361       .548          .000    1.35E+118
Primary_03072000_Voter    -66.218           106.264         .388       .533          .000    4.95E+61
General_1107200_Voter     -61.971           106.219         .340       .560          .000    3.16E+63
Special_05082001_Voter    -58.129           111.165         .273       .601          .000    2.39E+69
General_11062001_Voter    -60.658           106.181         .326       .568          .000    1.09E+64
Primary_05072002_Voter    -57.806           99.816          .335       .563          .000    7.23E+59
General_11052002_Voter    -63.208           106.206         .354       .552          .000    8.94E+62
Special_05062003_Voter    -66.393           106.249         .390       .532          .000    4.03E+61
General_11042003_Voter    -64.056           106.209         .364       .546          .000    3.85E+62
Primary_03022004_Voter    -63.836           106.204         .361       .548          .000    4.76E+62
Special_02052005_Voter    -58.510           111.784         .274       .601          .000    5.50E+69
General_11082005_Voter    -65.617           106.238         .381       .537          .000    8.56E+61
Special_02072006_Voter    -56.952           305.188         .035       .852          .000    1.10E+235
Primary_05022006_Voter    -64.696           106.220         .371       .542          .000    2.08E+62
General_11072006_Voter    -64.074           106.210         .364       .546          .000    3.79E+62
Primary_05082007_Voter    -65.976           106.233         .386       .535          .000    5.93E+61
Primary_09112007_Voter    -57.949           15652.399       .000       .997          .000    n/a
General_11062007_Voter    -67.465           106.231         .403       .525          .000    1.33E+61
General_12112007_Voter    -75.855           106.274         .509       .475          .000    3.29E+57
Primary_03042008_Voter    -62.602           106.214         .347       .556          .000    1.67E+63
General_11042008_Voter    -64.100           106.220         .364       .546          .000    3.77E+62
Primary_05052009_Voter    -57.094           98.053          .339       .560          .000    4.56E+58
Primary_09152009_Voter    -54.792           7118.311        .000       .994          .000    n/a
General_11032009_Voter    -55.176           98.071          .317       .574          .000    3.28E+59
Primary_05042010_Voter    -65.564           106.234         .381       .537          .000    8.97E+61
Primary_07132010_Voter    -56.331           45432.804       .000       .999          .000    n/a
Primary_09072010_Voter    -57.607           3684.807        .000       .998          .000    n/a
General_11022010_Voter    -63.431           106.214         .357       .550          .000    7.28E+62
Primary_05032011_Voter    -57.848           136.939         .178       .673          .000    2.75E+91
General_11082011_Voter    -54.865           98.255          .312       .577          .000    6.42E+59
Primary_03062012_Voter    -55.419           95.847          .334       .563          .000    3.29E+57
Primary_05072013_Voter    -58.652           110.873         .280       .597          .000    8.00E+68
General_11052013_Voter    -62.617           106.196         .348       .555          .000    1.58E+63
Constant                  -115.093          212.413         .294       .588
OVERFITTING
Simple Model Example: Variables (Lower and Upper are the 95% C.I. for EXP(B))
Variable                  B (Coefficients)  Standard Error  Wald       Significance  Lower   Upper
Age_life_bin_1            .344              .019            312.341    .000          1.358   1.466
Age_life_bin_2            .282              .017            266.954    .000          1.282   1.372
Age_life_bin_3            .180              .017            109.330    .000          1.158   1.239
Age_life_bin_4            .133              .018            53.146     .000          1.102   1.184
Age_life_bin_5            .055              .019            8.719      .003          1.019   1.096
Age_life_bin_7            -.342             .029            139.262    .000          .671    .752
Age_life_bin_8            -1.949            .029            4636.533   .000          .135    .151
Party_affiliation_D       .523              .037            202.630    .000          1.570   1.814
Party_affiliation_R       .692              .027            656.239    .000          1.895   2.106
NumberOfPastRaces         .480              .002            63659.304  .000          1.611   1.623
Constant                  -1.332            .017            6041.871   .000
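The contrast between the two coefficient tables is easier to read once the columns are connected. The sketch below assumes the tables follow the usual logistic-regression output conventions, where the Wald statistic is (B / Standard Error) squared and the 95% interval for EXP(B) is exp(B ± 1.96 × Standard Error); the presentation does not state this, so treat it as an assumption. Because the inputs are the rounded values printed in the tables, results can differ slightly from the printed columns.

```python
import math


def wald_and_ci(b, se, z=1.96):
    """Wald statistic and 95% confidence interval for EXP(B).

    ASSUMPTION: standard conventions, Wald = (B / SE)^2 and
    interval = exp(B - z * SE) to exp(B + z * SE).
    """
    wald = (b / se) ** 2
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    return wald, lower, upper


# NumberOfPastRaces in the overfitted model: huge standard error.
wald_over, lower_over, upper_over = wald_and_ci(63.840, 106.208)

# Age_life_bin_8 in the simple model: small standard error.
wald_simple, lower_simple, upper_simple = wald_and_ci(-1.949, 0.029)

print(f"overfitted: Wald {wald_over:.3f}, "
      f"C.I. {lower_over:.3g} to {upper_over:.3g}")
print(f"simple:     Wald {wald_simple:.1f}, "
      f"C.I. {lower_simple:.3f} to {upper_simple:.3f}")
```

In the overfitted model the interval for EXP(B) runs from effectively zero to an astronomically large number, so the estimate carries almost no information. In the simple model the interval is narrow and excludes 1.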
OVERFITTING
Simple Model Example: Training Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted      Percentage Correct
Did not vote                   95397           37029      72%
Voted                          43439           367660     89%
Overall Correct Percentage                                85%
OVERFITTING
Simple Model Example: Prediction Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted      Percentage Correct
Did not vote                   72167           9483       88%
Voted                          15131           66136      81%
Overall Correct Percentage                                85%
OVERFITTING
Big Data Errors: Spurious Correlations
[Figure: number of spurious correlations (vertical axis, ticks at 20,000, 80,000, and 140,000) plotted against number of variables (horizontal axis, 500 to 2000)]
OVERFITTING
Overstuffing Example: Variables (Lower and Upper are the 95% C.I. for EXP(B))
Variable                  B (Coefficients)  Standard Error  Wald       Significance  Lower   Upper
Age_life_bin_1            .331              .020            286.120    .000          1.339   1.446
Age_life_bin_2            .281              .017            263.325    .000          1.281   1.371
Age_life_bin_3            .184              .017            113.157    .000          1.162   1.243
Age_life_bin_4            .134              .018            53.857     .000          1.103   1.185
Age_life_bin_5            .058              .019            9.629      .002          1.022   1.099
Age_life_bin_7            -.348             .029            143.259    .000          .667    .748
Age_life_bin_8            -1.959            .029            4687.305   .000          .133    .149
Party_affiliation_D       .513              .037            194.040    .000          1.554   1.796
Party_affiliation_R       .684              .027            637.417    .000          1.879   2.089
NumberOfPastRaces         .478              .002            62834.614  .000          1.608   1.620
Residential_Zip_3         -.364             .127            8.181      .004          .541    .892
Residential_Zip_7         .360              .063            32.902     .000          1.268   1.622
Residential_Zip_8         .428              .218            3.834      .050          1.000   2.354
Residential_Zip_16        -.125             .023            28.277     .000          .843    .924
Residential_Zip_17        .127              .058            4.797      .029          1.013   1.272
Residential_Zip_18        -.356             .044            64.141     .000          .642    .764
Residential_Zip_19        -.283             .026            117.878    .000          .716    .793
Residential_Zip_21        .115              .037            9.801      .002          1.044   1.206
Residential_Zip_22        .113              .026            19.024     .000          1.064   1.178
Residential_Zip_25        -.182             .024            59.045     .000          .796    .873
Residential_Zip_26        .074              .032            5.248      .022          1.011   1.148
Residential_Zip_27        -.132             .033            16.081     .000          .821    .935
Residential_Zip_28        -.077             .023            11.484     .001          .885    .968
Residential_Zip_29        -.160             .038            17.765     .000          .791    .918
Residential_Zip_30        -.191             .044            18.638     .000          .758    .901
Residential_Zip_33        -.059             .030            3.945      .047          .889    .999
Residential_Zip_35        .104              .026            15.662     .000          1.054   1.168
Residential_Zip_41        .140              .018            57.675     .000          1.109   1.193
Residential_Zip_42        .156              .039            16.010     .000          1.083   1.262
Residential_Zip_45        .138              .024            32.782     .000          1.095   1.204
Residential_Zip_46        -.065             .018            12.838     .000          .904    .971
Residential_Zip_48        .261              .022            136.998    .000          1.243   1.357
Residential_Zip_50        .164              .025            41.633     .000          1.121   1.239
Residential_Zip_51        .157              .031            26.169     .000          1.102   1.243
Residential_Zip_53        .114              .033            11.628     .001          1.050   1.197
Residential_Zip_54        .104              .029            13.215     .000          1.049   1.174
Residential_Zip_56        .116              .032            13.238     .000          1.055   1.196
Residential_Zip_59        .094              .032            8.647      .003          1.032   1.170
Local_School_District_6   -.375             .055            47.296     .000          .618    .765
Local_School_District_7   .078              .016            23.389     .000          1.047   1.115
Local_School_District_9   -.501             .057            77.534     .000          .542    .677
Local_School_District_10  -.255             .033            61.473     .000          .727    .826
Constant                  -1.332            .018            5513.792   .000
OVERFITTING
Overstuffing Example: Training Classification Table
                               General Election (Predicted)
General Election (Observed)    Did not vote    Voted      Percentage Correct
Did not vote                   93029           39397      70%
Voted                          36228           374871     91%
Overall Correct Percentage                                86%
LACK OF BACKGROUND
The farther we are from the work, the more likely we are to be tricked by the data.
We owe it to the end user to get out of the library, and try to understand what we are modeling.
SOLUTIONS INSTEAD OF THEORIES
There is an element of data science that should be frustrating, confusing, & despair-inducing.
It should make us stand back in awe of the complexity of the world, and not the simplicity to which we can reduce it.
SOLUTIONS INSTEAD OF THEORIES
“The great thing about economics, is that we admit that we know nothing about anything”
- Thomas Piketty, author of “Capital in the Twenty-First Century”
SOLUTIONS INSTEAD OF THEORIES
As we learn more, we realize there’s more to learn.
The hallmark of genius is the sharp awareness of what is and what is not possible. We become aware of complexity, ambiguity and nuance.
CORRELATION & CAUSATION
The anthem of the Big Data age is “correlation does not imply causation.”
CORRELATION & CAUSATION
The problem is that this statement is tautological. It is always correct, and can never be wrong.
CORRELATION & CAUSATION
Don’t let people use it as a kill switch to discussion.
• True causation is pretty rare. There are few things where, if I do this, this will happen.
• Research should create discussions, not shut them down. Models can’t explain everything. There is always an “X” variable that captures the unknown.
FAILING TO AUDIT
Primary reasons that we fail to have our work peer-reviewed:
• Lack of funding to “repeat” work.
• We hide behind the complexity of our work.
FAILING TO AUDIT
Other tools:
• R Markdown: for creating webpages and documents in R
• IPython notebooks: for creating websites and documents interactively in Python
• Galaxy Project: for creating reproducible workflows (favorable for people with less scripting experience)
TRAINING
We offer training on:
• Data Visualization
• Managerial Statistics
• Predictive Modeling
You will be introduced to:
• R
• SPSS
• Tableau
• MySQL
• Git
TRAINING
Training sessions last 3 days.
We will work through projects, practice different approaches, and learn which approach is best for different scenarios.
Questions & Learn more.
QUESTIONS?
Grant Stanley, CEO
Contemporary Analysis
1209 Harney Street, Suite 200
Omaha, NE [email protected]
(402) 679-8398