data science, what even?!
DESCRIPTION
Presented an abridged version of my "What is data science" talk at #websummit 2013. This talk goes over the required skill set as defined by Drew Conway and his famous Venn diagram, and also outlines the Data Scientific Method put forward by Dr. Patil. The talk has two main parts; the second goes over some of the packages and technologies we use, minus the storage part.
TRANSCRIPT
Data Science?! What even...
David Coallier, @davidcoallier
Data Scientist, Engine Yard
And I cook... A lot.
(n-1) items
Adapting.
Feedback.
Indifference.
Young mathematically inclined minds
We knew everything.
First Bad Assumption.
So we asked “experts”.
Wrong Ingredients
Bad Data
Tasted like sh*t
From Our Results: We had questions.
Found Expertise: Not Online.
Data Scientific Method
Find a Question: Your Hypothesis
Current Data: What do you have?
Features & Tests: Try it.
Analyse Results: Won’t be pretty.
Conversation: Framed. By. Data.
But....
Good Discussions: Imply good data scientists
[Venn diagram, after Drew Conway: three circles labelled Hacking Skills, Maths & Stats, and Expertise. Pairwise overlaps: Machine Learning, Research, Danger Zone!!! Centre: Data Science.]
Business: Don’t need an MBA
In other words.
1. Hacking
2. Maths & Stats
3. Expertise
Apply Method: Data Scientific
1. Question
2. Current Data
3. Features/Tests
4. Analyse
5. Converse
Find a Question: Let’s imagine Github
Upgrade Repos: Affect users as little as possible
import csv
# Read every row of the CSV file into a list
with open('repo1.csv') as f:
    content = list(csv.reader(f))
$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \geq 0$
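As a minimal sketch of how the Poisson formula above could be evaluated, the snippet below computes the probability mass function with the Python standard library only. The rate `lam = 2.0` and the reading of k as "commits per day" are assumptions made for illustration, not figures from the talk.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam: lam^k * e^(-lam) / k!"""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical rate: an average of 2 commits per day (not a figure from the talk).
lam = 2.0
probabilities = [poisson_pmf(k, lam) for k in range(50)]

# Probability of seeing 0, 1, 2, 3 or 4 commits in a day.
for k in range(5):
    print(k, round(probabilities[k], 4))

# The probabilities over all k >= 0 sum to 1 (up to floating-point error).
total = sum(probabilities)
```

Comparing observed counts against these probabilities is one way to check whether a Poisson model is a reasonable fit before drawing conclusions from it.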
Converse: Present Findings
Iterate: Commits aren’t key.
KPIs are key: Indicators from experience
Questions: Super Important.
Just test it...
We are Human. Emotional Connection
What next? Second Hypothesis.
Focus on Data: Relevant to your KPIs.
Data gives you the what
Humans give you the why
Turn Information
Into
Actionable Insight
Create Discussions: Introspection Engines
Seeing, Feeling it: The brain sees.
Not regressions
Not p-values
Not slopes
Not F-statistics
Not coefficients
Question Data: Not Visualisations.
Toolbox: What do we use?
R: Modeling, Testing, Prototyping
RStudio: The IDE
lubridate and zoo
Dealing with Dates...
yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm different timezone
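To make the problem concrete, here is a small Python sketch (standard library only) that normalises a few of the layouts listed above. The helper name `parse_timestamp` and the order of `FORMATS` are assumptions for illustration. The mm/dd/yy layout is left out on purpose, because a value such as 01/02/03 cannot be told apart from yy/mm/dd without knowing where the data came from.

```python
from datetime import datetime, timezone

# Candidate layouts, tried in order. Ambiguous inputs match the first layout
# that fits, so the order encodes an assumption about the data.
FORMATS = [
    "%Y-%m-%d %H:%M:%S",  # YYYY-mm-dd HH:MM:ss
    "%y/%m/%d",           # yy/mm/dd
    "%y-%m-%d",           # yy-mm-dd
]

def parse_timestamp(raw):
    """Return a datetime for a string in one of FORMATS or a Unix epoch."""
    try:
        # Epoch seconds such as 1363784094.513425 are interpreted as UTC.
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised date format: %r" % raw)

for raw in ["13/03/20", "13-03-20", "2013-03-20 12:54:54", "1363784094.513425"]:
    print(raw, "->", parse_timestamp(raw))
```

Note that only the epoch branch returns a timezone-aware value; the string layouts carry no zone, which is exactly the kind of mismatch that needs resolving before any comparison.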
reshape2: Reshape your Data
ggplot2: Visualise your Data
RCurl, RJSONIO: Find more Data
Hmisc: Miscellaneous useful functions
forecast: Can you guess?
garch: Generalized Autoregressive Conditional Heteroskedasticity
quantmod: Statistical Financial Trading
library(quantmod)
getSymbols('AAPL')
barChart(AAPL)
addMACD()
xts: Extensible Time Series
igraph: Study Networks
maptools: Read & View Maps
library(maps)  # map() comes from the maps package
map('state',
    region = c(row.names(USArrests)),
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
    fill = T)
Python: Scientific Computing
scipy.stats
scipy.stats: Descriptive Statistics
from scipy.stats import describe
s = [1, 2, 1, 3, 4, 5]
print(describe(s))
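For readers without SciPy installed, the following standard-library sketch reproduces part of that summary for the same sample. It uses the `statistics` module in place of `scipy.stats.describe`, and it covers only the count, min/max, mean and sample variance; `describe` also reports skewness and kurtosis, which are omitted here.

```python
import statistics

s = [1, 2, 1, 3, 4, 5]

summary = {
    "nobs": len(s),
    "minmax": (min(s), max(s)),
    "mean": statistics.mean(s),
    "variance": statistics.variance(s),  # sample variance (n - 1 denominator)
}
print(summary)
```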
scipy.stats: Probability Distributions
Example: Poisson Distribution
$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \geq 0$
from scipy.stats import poisson
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
print(p.mean())
print(p.sum())
...
NumPy: Linear Algebra
$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
import numpy as np
x = np.array([[1, 0], [0, 1]])
val, vec = np.linalg.eig(x)  # eig returns (eigenvalues, eigenvectors)
np.linalg.eigvals(x)
>>> np.linalg.eig(x)
(array([ 1.,  1.]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))
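To see where those eigenvalues come from, the sketch below works the 2x2 case by hand in plain Python: the eigenvalues are the roots of the characteristic polynomial t^2 - (a + d) t + (ad - bc). The function `eigenvalues_2x2` is a hypothetical helper written for this illustration, not part of NumPy, and it only handles matrices whose eigenvalues are real.

```python
import math

def eigenvalues_2x2(m):
    """Eigenvalues of a real 2x2 matrix [[a, b], [c, d]] with real eigenvalues.

    They are the roots of t^2 - (a + d) t + (a d - b c) = 0.
    """
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; not handled in this sketch")
    root = math.sqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)

identity = [[1, 0], [0, 1]]
print(eigenvalues_2x2(identity))  # (1.0, 1.0)
```

For the identity matrix the trace is 2 and the determinant is 1, so the discriminant is 0 and both roots equal 1, matching the eigenvalues NumPy reports.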
Matplotlib: Python Plotting
statsmodels: Advanced Statistics Modeling
NLTK: Natural Language Tool Kit
scikit-learn: Machine Learning
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
PyBrain... Machine Learning
PyMC: Bayesian Inference
Pattern: Web Mining for Python
NetworkX: Study Networks
MILK: Machine Learning
Pandas: easy-to-use data structures
from pandas import DataFrame
x = DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
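As a rough illustration of what that filter computes, the same selection can be written with plain Python lists. This is a standard-library stand-in for the pandas calls, not how pandas works internally; the variable names are chosen for this sketch.

```python
rows = [{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}]

# Keep only the ages above 20, mirroring x[x['age'] > 20].
over_20 = [row["age"] for row in rows if row["age"] > 20]

count = len(over_20)
mean_age = sum(over_20) / count
print(count, mean_age)  # 2 23.5
```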
Python vs R? Different Purposes
Dogfooding: Data Scientific Method
Original Question: What is Data Science?
Back to you: For questioning