Copyright © 2014, SAS Institute Inc. All rights reserved.
DATA SCIENCE: HYPE AND REALITY
PATRICK HALL
About me
SAS Enterprise Miner, 2012
Cloudera Data Scientist, 2014
Statistician: Do you use Kolmogorov–Smirnov often?
Data Scientist: No, I mix my martinis with gin.
Statistician: So, you have no SQL experience?
Data Scientist: That’s right, I have NoSQL experience.
Audience poll
Is data science a new field?
• Sources of data
• Technologies
Is data science a true mathematical science?
NETWORKS TEXT
Intro to data science
Historical roots
J. W. Tukey, The Future of Data Analysis, 1962
International Federation of Classification Societies, 1996
William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, 2001
Data science Venn diagram 1.0
Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Drew Conway, 2010
Data science Venn diagram 2.0
http://joelgrus.com/wp-content/uploads/2013/06/VennDiagram2.png
Source: "Of the unicorn" by Special Collections, University of Houston Libraries - http://digital.lib.uh.edu/u?/p15195coll18,33. Licensed under CC0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Oftheunicorn.jpg#/media/File:Oftheunicorn.jpg
Intro to machine learning
Data science Venn diagram 1.0
Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
A closer look at machine learning

UNSUPERVISED LEARNING (don’t know Y)
• A priori rules
• Clustering: k-means clustering, mean shift clustering, spectral clustering
• Kernel density estimation
• Nonnegative matrix factorization
• PCA: kernel PCA, sparse PCA
• Singular value decomposition
• SOM

SUPERVISED LEARNING (know Y)
• Regression: LASSO regression, logistic regression, ridge regression
• Decision tree: gradient boosting, random forests
• Neural networks
• SVM
• Naïve Bayes
• Neighbors
• Gaussian processes

SEMI-SUPERVISED LEARNING (sometimes know Y)
• Prediction and classification*
• Clustering*
• EM
• TSVM
• Manifold regularization
• Autoencoders
• Multilayer perceptron
• Restricted Boltzmann machines
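
The difference between knowing Y and not knowing Y can be sketched in a few lines. This is only a minimal, standard-library illustration: the toy data and the function names `nearest_neighbor_predict` and `k_means_1d` are hypothetical and do not come from the talk.

```python
from statistics import mean

def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: labels Y are known, so return the label of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

def k_means_1d(xs, centers, iterations=10):
    """Unsupervised: no Y, so group points around k centers (Lloyd's algorithm in 1-D)."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster; leave it in place if the cluster is empty.
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers

# Hypothetical toy data: two well-separated groups.
xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
ys = ["low", "low", "low", "high", "high", "high"]

print(nearest_neighbor_predict(xs, ys, 4.5))  # uses the known labels
print(k_means_1d(xs, centers=[0.0, 6.0]))     # finds the two groups without labels
```

The supervised function needs `ys`; the clustering function never sees it, which is the whole distinction the slide draws.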
Sacrificing interpretability for accuracy
[Figure: hill and plateau sample data, with fits from traditional regression, a decision tree, and a neural network]
The shocking truth revealed!
http://www.kdnuggets.com/2015/10/deep-learning-vapnik-einstein-devil-yandex-conference.html
Most time is spent cleaning and preprocessing the data!
Small data tools
• Multicore CPU
• GPU
• Solid state drive (SSD)
• 64+ GB of RAM
• Scalable algorithms
[Diagram labels: Workstation; MPI Based; Software client; Software server; Data; Data scientist]
How do we turn our insights into a production system?
Identify/Formulate Problem
Data Preparation/Exploration
Model Building
Deploy Model
Evaluate/Monitor Model
ESTIMATION VS. PREDICTION: DIFFERENT MINDSETS

Estimation (Regression, Discriminant Analysis)
• Assumptions
• Parsimony
• Interpretation
• What happened? Why?

Prediction (Machine Learning)
• Production Deployment
• Predictive Accuracy
• What will happen?
The ‘Analytics’ folks: I just built 850 new models. When can you put them into production?
The ‘IT’ folks: !!!???!!!
big data tools
• Distributed storage platform
  Hadoop Distributed File System (HDFS)
  Massively parallel (MPP) databases
• Distributed analytics platform
  Disk-enabled: Hadoop MapReduce
  In-memory: H2O.ai, SAS® High-Performance Analytics, SAS® LASR Analytic Server, Spark ML/MLlib
[Diagram labels: MPI Based; Data scientist; Software client; Distributed data and software on multiple servers]
Data growth
[Chart: World’s data in zettabytes, 1991 to 2016 (y-axis 0 to 50). Source: Oracle 2012]
Data growth
(1 zettabyte = 1 billion terabytes)
In 2008
Typical server hard drive was 500 GB with a transfer rate of 98 MB/sec
An entire disk could be transferred in 85 minutes
In 2013
Typical server hard drive was 4 TB with a transfer rate of 150 MB/sec
An entire disk could be transferred in 440 minutes
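
The transfer times follow directly from capacity divided by transfer rate. A small check of that arithmetic, assuming decimal units (1 GB = 1000 MB) and a hypothetical helper name `full_disk_read_minutes`:

```python
def full_disk_read_minutes(capacity_gb, rate_mb_per_sec):
    """Minutes needed to read a whole disk sequentially, using 1 GB = 1000 MB."""
    seconds = capacity_gb * 1000 / rate_mb_per_sec
    return seconds / 60

print(round(full_disk_read_minutes(500, 98)))    # 2008 drive: 500 GB at 98 MB/sec
print(round(full_disk_read_minutes(4000, 150)))  # 2013 drive: 4 TB at 150 MB/sec
```

This gives roughly 85 and 444 minutes, in line with the rounded figures on the slide: capacity grew eightfold while the transfer rate grew by only about half, so reading a full disk takes about five times longer.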
[Chart: Average price of 1 MB of RAM, 2000 to 2010 (y-axis $0.00 to $1.20)]
[Chart: CPU speed in MHz, 1978 to 2008 (y-axis 0 to 4000)]
• Disk capacities are getting bigger, but disks are not spinning faster
• Processors are not running much faster, but they have more cores
• RAM is becoming affordable
So …
• To handle all of this new data we distribute it on clusters of computers
• Most modern analytical architectures take advantage of in-memory, distributed processing
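
The pattern behind in-memory, distributed processing can be sketched on a single machine: split the data, let each worker summarize its own partition in memory, and combine only the small summaries. In this illustration a thread pool stands in for the cluster nodes, and the function names are hypothetical; a real cluster would move this work onto separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split data into n_parts roughly equal chunks, as a cluster spreads it over nodes."""
    return [data[i::n_parts] for i in range(n_parts)]

def partial_sum_and_count(chunk):
    """Work done locally on one node: only a small summary leaves the node."""
    return sum(chunk), len(chunk)

def distributed_mean(data, n_workers=4):
    """Compute a mean from per-partition summaries instead of one pass over all the data."""
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(distributed_mean(list(range(1, 101))))
```

The design point is that each partition is touched only by its own worker, so the amount of data that has to travel between workers stays small no matter how large the partitions are.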
Hadoop and Spark
• Bulk ETL
• Batch processing
• Deployment
• Online transactions
• Advanced Analytics
MapReduce is a difficult framework for iterative, sophisticated algorithms
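
A rough sketch of why iteration is awkward in this model: an algorithm such as gradient descent needs one complete map-and-reduce pass over the data for every update of its parameters. The example below simulates that on hypothetical in-memory partitions with plain Python; it does not use any Hadoop API, and the function names are illustrative only.

```python
from functools import reduce

# Hypothetical data split across three "nodes"; the true relationship is y = 3x.
partitions = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(4.0, 12.0), (5.0, 15.0)],
]

def map_gradient(chunk, w):
    """Map step: one node's partial squared-error gradient for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in chunk), len(chunk)

def reduce_pair(a, b):
    """Reduce step: add partial gradients and observation counts."""
    return a[0] + b[0], a[1] + b[1]

w = 0.0
passes = 0
for _ in range(50):  # every iteration is a full map + reduce pass over all partitions
    mapped = [map_gradient(chunk, w) for chunk in partitions]
    grad_sum, n = reduce(reduce_pair, mapped)
    w -= 0.01 * grad_sum / n  # gradient descent step with a fixed learning rate
    passes += 1

print(passes, round(w, 3))
```

Here the 50 passes are cheap because everything stays in memory. In a disk-based MapReduce job, each pass would typically re-read its input and write its output to disk, which is one reason in-memory engines are favored for iterative algorithms.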
https://github.com/szilard/benchm-ml
• “Hadoop Corporate Adoption Remains Low”
• Death of RDBMS exaggerated
• Big data adoption will require time
Parting shot
Use the scientific method.
http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html
Keep the Science in Data Science: http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html
An Introduction to Machine Learning: http://blogs.sas.com/content/sascom/2015/08/11/an-introduction-to-machine-learning/
SAS Data Mining Community: https://communities.sas.com/
Where you can find me …
Quora: www.quora.com
GitHub: github.com/jphall663, github.com/sassoftware
Twitter: @jpatrickhall