Data Science: The Main Course @ KCDC 2016
DATA SCIENCE: THE MAIN COURSE
I Can Science Data, and So Can You!
Arthur Doler @arthurdoler
HOW MANY APPETIZERS HAVE YOU EATEN?
Sources: Mediawiki, Publicdomainpictures.net
SO WE’RE SKIPPING RIGHT TO THE MAIN COURSE
YOU HAVE THE DATA
YOU HAVE THE POWER
Sources: Mattel, he-manreviewed.net
WHAT’S FOR DINNER
Picking your problem
Using Knitr/R Markdown
Building a linear predictor
Making a predictive, repeatable document
WHAT’S NOT FOR DINNER
Learning R
Exhaustive discussion of statistics
Exhaustive discussion of regression modeling
Ways to run R in production
STEP 0: KNOW YOUR RECIPE FOR REPEATABILITY
Learn to Knit you some R
knitr ≈ Sweave + cacheSweave + pgfSweave + weaver + animation::saveLatex +
R2HTML::RweaveHTML + highlight::HighlightWeaveLatex + 0.2 * brew + 0.1 *
SweaveListingUtils + more
Source: Reddit
R Code
Markup
R Code
Markup
Markup
WHAT?! WHY IS THIS A GOOD IDEA?
Do you love me?
Y  N
LET’S GO FIND THAT RECIPE!
Source: Reddit
STEP 1: SHOP FOR YOUR INGREDIENTS
Finding the question to ask
WHAT ARE YOU TRYING TO DO?
Finding or proving a correlation
Looking for outliers
Building a predictive model
LET’S BUILD A LINEAR PREDICTIVE MODEL
Source: Wikipedia
WHAT ARE YOUR VARIABLES?
• Material Category
• Material ID
• Time-to-Incapacitation
• 1000 / Time-to-Incapacitation
• Carbon Monoxide
• Hydrogen Cyanide
• Hydrogen Sulfide
• Hydrochloric Acid
• Hydrobromic Acid
• Nitrogen Dioxide
• Sulfur Dioxide
WHAT DO YOU CARE ABOUT?
FORMULATE YOUR QUESTION
LET’S HEAD TO THE STORE!
Source: Reddit
STEP 2: GET YOUR MISE EN PLACE
Dividing your data
WHERE IS THE VALUE IN A PREDICTIVE MODEL?
WE BUILD OUR MODEL WITH A TRAINING SET
PARTITIONING YOUR DATA PREVENTS OVERTRAINING
²⁄³ Training
¹⁄³ Test
½ Training
¼ Test
¼ Validation
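A minimal sketch of the two-thirds / one-third split in base R. The data frame `smoke`, its column names, and its values are hypothetical stand-ins generated on the spot, not the dataset used in the talk.

```r
# Synthetic stand-in for the talk's dataset; column names and values are hypothetical.
set.seed(42)
n <- 90
smoke <- data.frame(
  carbon_monoxide  = runif(n, 100, 5000),
  hydrogen_cyanide = runif(n, 0, 200)
)

# Shuffle the row numbers so the split does not depend on file order.
shuffled <- sample(n)

# The first two thirds of the shuffled rows train the model; the rest are held out.
n_train <- floor(2 * n / 3)
train <- smoke[shuffled[seq_len(n_train)], ]
test  <- smoke[shuffled[-seq_len(n_train)], ]

cat("training rows:", nrow(train), " test rows:", nrow(test), "\n")
```

The half / quarter / quarter variant works the same way: cut the shuffled index vector into three pieces instead of two. Fixing the seed keeps the split identical every time the document is re-knit.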
LET’S MEASURE EVERYTHING OUT!
Source: Reddit
STEP 3: COOK UP YOUR PREDICTOR
Training your model
ONE WARNING FIRST
DO YOU NEED TO UNDERSTAND YOUR PREDICTOR?
LET’S GO COOK UP THE MODEL!
Source: Reddit
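A rough sketch of the training step, assuming a plain `lm()` fit with 1000 / Time-to-Incapacitation as the response and two of the gas concentrations as predictors. The data are simulated, and the data frame `fires`, its column names, and the coefficients used to generate it are all hypothetical.

```r
# Simulated data; the linear relationship below is invented for illustration only.
set.seed(1)
n   <- 120
co  <- runif(n, 500, 5000)   # hypothetical carbon monoxide readings
hcn <- runif(n, 0, 200)      # hypothetical hydrogen cyanide readings
inv_time <- 20 + 0.01 * co + 0.2 * hcn + rnorm(n, sd = 2)
fires <- data.frame(carbon_monoxide = co, hydrogen_cyanide = hcn, inv_time = inv_time)

# Hold out one third of the rows so the fit is judged on data it has not seen.
train_idx <- sample(n, 80)
train <- fires[train_idx, ]
test  <- fires[-train_idx, ]

# Fit the linear model on the training set only.
fit <- lm(inv_time ~ carbon_monoxide + hydrogen_cyanide, data = train)

# Predict the held-out rows and summarize the error as root mean squared error.
predicted <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$inv_time - predicted)^2))

print(coef(fit))
cat("test RMSE:", rmse, "\n")
```

`summary(fit)` gives coefficient estimates, standard errors, and R-squared, which helps when the predictor needs to be understood and not only used.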
WHY DID 1000/TIME_TO_INCAPACITATION WORK BETTER?
STEP 3A: TRIM THE FAT
Eliminating Outliers
LET’S GO CUT!
Source: Reddit
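One possible way to trim outliers, sketched here with Cook's distance and the common 4 / n rule of thumb. The slides do not say which criterion the talk used, so both the method and the cutoff are assumptions, and the data are simulated with two outliers planted on purpose.

```r
# Simulated data with two deliberately planted outliers (rows 10 and 25).
set.seed(7)
n <- 60
x <- runif(n, 0, 100)
y <- 5 + 0.5 * x + rnorm(n, sd = 2)
y[c(10, 25)] <- y[c(10, 25)] + 60
d <- data.frame(x = x, y = y)

# Fit once on everything, then measure how much each row influences the fit.
fit <- lm(y ~ x, data = d)
influence <- cooks.distance(fit)

# Rule of thumb (an assumption here): flag rows with Cook's distance above 4 / n.
cutoff <- 4 / n
keep <- influence <= cutoff

# Refit on the rows that survived the trim.
trimmed <- d[keep, ]
refit <- lm(y ~ x, data = trimmed)

cat("rows dropped:", sum(!keep), "\n")
cat("residual SD before:", sigma(fit), " after:", sigma(refit), "\n")
```

Dropping rows changes what the model claims to describe, so it is worth recording in the report which rows were removed and why.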
STEP 4: GARNISH WITH GRAPHICS
Adding visualizations to your report
plot
ggplot2
Source: Wikimedia
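A small sketch of a predicted-versus-actual plot using base `plot()`. The values are simulated stand-ins for real model output, and the file name comes from `tempfile()`; inside a knitr chunk the `pdf()` and `dev.off()` calls would normally be unnecessary because knitr captures the figure itself.

```r
# Simulated actual and predicted values; stand-ins for predict(fit, newdata = test).
set.seed(3)
n <- 50
actual    <- runif(n, 20, 80)
predicted <- actual + rnorm(n, sd = 4)

# Draw to a temporary PDF so the script also runs outside an interactive session.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(actual, predicted,
     xlab = "Actual 1000 / time-to-incapacitation",
     ylab = "Predicted 1000 / time-to-incapacitation",
     main = "Predicted vs. actual (simulated data)")
# Points on the dashed line are perfect predictions.
abline(a = 0, b = 1, lty = 2)
invisible(dev.off())

cat("figure written to", out_file, "\n")
```

The same figure could be built with ggplot2 using `ggplot()`, `geom_point()`, and `geom_abline()`, at the cost of an extra package dependency.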
LET’S FINISH UP THAT REPORT!
Source: Reddit
1. Know Your Recipe for Repeatability
2. Shop for Your Column Ingredients
3. Get Your Data Divided
4. Cook Up Your Predictor
   1. Trim the Outlier Fat
5. Garnish with Graphics
QUESTIONS?
Source: Reddit