the rise of data science in the age of big data analytics: why data distillation and machine...
DESCRIPTION
Big Data matters because we want to use it to make sense of our world. It’s tempting to think there’s some “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous. (Like, bringing-down-the-entire-financial-system dangerous.) Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and the problem, we can not only make better predictions but also fill in gaps in our knowledge, and even find answers to questions we hadn’t thought of yet.
TRANSCRIPT
Revolution Confidential
The Rise of Data Science in the Age of Big Data Analytics
Why Data Distillation and Machine Learning Aren’t Enough
David M Smith
VP Marketing and Community
Revolution Analytics
Today, we’ll discuss:
• What is Data Science?
• Why machine learning isn’t enough
• Why Data Science works
• The Data Scientist’s Toolkit
• The Future of Big Data Analytics
• Closing thoughts and resources
© Dov Harrington, CC BY 2.0
http://www.flickr.com/photos/idovermani/4110546683/
Where is it safe to fish near San Francisco?
San Francisco Estuary Institute
http://www.sfei.org/tools/wqt
Hurricane Sandy
Bob Rudis
http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/
Hurricane Sandy
Ed Chen
http://blog.echen.me/hurricane-sandy-outages/
When did Michael Jackson have his biggest hits?
New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
Three Essential Skills of Data Scientists
Drew Conway
http://www.dataists.com/2010/09/the-data-science-venn-diagram/
[Venn diagram labels]
Data Integration, Mashups, Applications
Models, Visualization, Predictions, Uncertainty
Problems, Data Sources, Credibility
Effective Data Applications
Image © Abode of Chaos, CC BY 2.0
http://www.flickr.com/photos/home_of_chaos/6418989233/
Machine learning (ML) for predictions
[Diagram with three panels]
Building the Model: Response, Features, ML, scoring rules
Validating the Model: Validation set, Responses, scoring rules, Predictions, “Accuracy”
Scoring new data: New Data, scoring rules, Predictions (scores)
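The three steps on this slide (building the model, validating it, and scoring new data) can be sketched in a few lines. This is only an illustration: a nearest-centroid classifier stands in for "ML", the data is made up, and the function names `build_model`, `score`, and `accuracy` are hypothetical rather than taken from the talk or from any R package.

```python
# Illustrative sketch only: a nearest-centroid classifier stands in for "ML".
from collections import defaultdict


def build_model(features, responses):
    """Building the model: learn scoring rules (one centroid per response)."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, responses):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def score(rules, features):
    """Apply the scoring rules: predict the response with the closest centroid."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(rules, key=lambda y: dist2(rules[y], x)) for x in features]


def accuracy(predictions, responses):
    """Validating the model: fraction of predictions matching known responses."""
    hits = sum(p == r for p, r in zip(predictions, responses))
    return hits / len(responses)


# Made-up data, for illustration only.
train_x = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]]
train_y = ["low", "low", "high", "high"]
valid_x = [[0.9, 1.1], [4.1, 3.9]]
valid_y = ["low", "high"]
new_x = [[1.1, 0.9], [3.9, 4.1]]

rules = build_model(train_x, train_y)                      # building the model
valid_accuracy = accuracy(score(rules, valid_x), valid_y)  # validating the model
new_scores = score(rules, new_x)                           # scoring new data
```

Note that the only feedback this loop produces is a single "accuracy" number on the validation set, which is the limitation the following slides go on to discuss.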
Problem: A lack of perspective
Image © 2010 David M Smith. Some rights reserved CC BY 2.0
Problem: Lack of credibility
Problem: Complexity
Data Science to the Rescue!
Answer Unasked Questions
Revolutions blog: “The Uncanny Valley of Big Data”
http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html
Fill in knowledge gaps
“More data beats better algorithms, every time” – Google
“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly
Google Research, “The Unreasonable Effectiveness of Data”: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd
TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
Avoid ineffective reactions
Stupid Data Miner Tricks
http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf
[Chart label: S&P 500]
© Henricks Photos CC-BY-ND 2.0
http://www.flickr.com/photos/hendricksphotos/3240667626/
0. Data (Big & Messy)
1. A language for programming with data
Download the White Paper: R is Hot
bit.ly/r-is-hot
Grant awards to homeless veterans FY09
Data: Data.gov
Analysis: Drew Conway
User-defined functions
Internet API interface
XML parsing
Custom graphics
Data import and pre-processing
Iterative data processing
2. Speed. Lots and lots of speed.
Variable Transformation
Model Estimation
Model Refinement
Model Comparison / Benchmarking
Feature Selection
Sampling
Aggregation
Data
Predictions
Use all available computing cycles
[Diagram labels]
Multicore Processor (4, 8, 16+ cores)
Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), Core n (Thread n)
Shared Memory
Data
Disk
3. Algorithms that don’t choke on Big Data
PEMAs: Parallel External-Memory Algorithms
[Diagram labels]
BIG DATA
Master Node
Compute Node (x4)
Data Partition (x4)
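The idea behind an external-memory algorithm can be sketched as follows: read the data one chunk at a time, reduce each chunk to a small intermediate result, and combine those results at the end. The sketch below assumes a simple mean-and-variance computation. The helper names `read_chunks`, `summarize`, `combine`, and `finalize` are hypothetical and are not the Revolution R Enterprise API. The chunks are processed sequentially here, but because `combine` does not care about order, each chunk summary could in principle be computed on a separate core or compute node.

```python
# Illustrative sketch only: chunked mean and variance, not a production algorithm.
from functools import reduce


def read_chunks(values, chunk_size):
    """Stand-in for reading a large file from disk one block at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def summarize(chunk):
    """Per-chunk intermediate result: count, sum, and sum of squares."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))


def combine(a, b):
    """Merge two intermediate results; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(total):
    """Turn the combined result into a mean and a population variance.

    The sum-of-squares formula is simple but can lose precision on very
    large or badly scaled data; it is used here only for clarity.
    """
    n, s, ss = total
    mean = s / n
    return mean, ss / n - mean * mean


# Made-up data; only one chunk is held in memory at a time.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
partials = (summarize(chunk) for chunk in read_chunks(data, chunk_size=3))
mean, variance = finalize(reduce(combine, partials))
```

The design point is that only the three-number summaries travel between workers and the master, never the raw data.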
Drink less coffee!
Single-threaded, non-optimized algorithms
Optimized, parallelized algorithms
4. Move code to data (not vice versa)
Map-Reduce
RHadoop: http://bit.ly/RHadoop
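As a rough illustration of the map-reduce pattern named on this slide, the sketch below runs a mapper over each data partition, groups the emitted key/value pairs by key, and then reduces each group. Everything here is a single-process stand-in: the partitions, the `mapper` and `reducer` functions, and the helper names are made up for illustration, and this is not the RHadoop or Hadoop API.

```python
# Illustrative sketch only: map-reduce simulated in one process.
from collections import defaultdict


def map_phase(partition, mapper):
    """Run the mapper where the partition lives; only (key, value) pairs leave."""
    pairs = []
    for record in partition:
        pairs.extend(mapper(record))
    return pairs


def shuffle(all_pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical partitions of (state, amount) records, e.g. one per compute node.
partitions = [
    [("CA", 120.0), ("NY", 80.0)],
    [("CA", 30.0), ("TX", 50.0)],
    [("NY", 20.0)],
]


def mapper(record):
    state, amount = record
    return [(state, amount)]


def reducer(key, values):
    return sum(values)


mapped = [pair for part in partitions for pair in map_phase(part, mapper)]
totals = reduce_phase(shuffle(mapped), reducer)
```

In a real cluster the mapper would be shipped to the nodes that hold each partition, which is the sense in which the code moves to the data.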
Big Data Appliances
More info: http://bit.ly/R-Netezza
Play Nice with Others
Presentation Layer
• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets
Analytics Layer
• R
Data Layer
• Relational datastores
• Unstructured datastores
What every data scientist needs
[Table columns: Open-Source R, Revolution R Enterprise]
Interface with multiple data sources
Exploratory data analysis
Wide range of statistical methods
High-speed computation
Big Data support
Data/code locality (Hadoop, etc.)
Print-quality data visualization
Scheduled batch production
Works in a multi-tool ecosystem
Integration into Data Apps
Revolution R Enterprise: Big-Data R
[Table columns: Open-Source R, Revolution R Enterprise]
Interface with multiple data sources
Exploratory data analysis
Wide range of statistical methods
High-speed computation
Big Data support
Data/code locality (Hadoop, etc.)
Print-quality data visualization
Scheduled batch production
Works in a multi-tool ecosystem
Integration into Data Apps
www.revolutionanalytics.com/products
Image © www.tinyplanetphotography.com
And … the future?
Even more data
Cloud computing
Demand for Data Scientists
Diverging paradigms for data analytics
http://www.indeed.com/jobtrends
Diverging data paradigms
[Diagram labels]
Hadoop, NoSQL
Files, Clusters
Data Appliances
More data, better fault tolerance
Easier programming, better performance
Exploration, Modeling
Storage, Preprocessing
Production
Data Science in Production
Real-time Big Data Analytics: From Deployment to Production
Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time
www.revolutionanalytics.com/news-events/free-webinars/
Building Data Science Teams
DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI
Statistics and Data Science graduates
Kaggle and Chorus
Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/
Closing Thoughts
The Data Science process leads to more powerful and more useful models
Data Scientists need a technology platform to think about, explore, and model data
Revolution R Enterprise is R for Big Data
Resources
Revolution R Enterprise: R for Big Data
www.revolutionanalytics.com/products
RHadoop: Connecting R and Hadoop
bit.ly/r-hadoop
Contact David Smith
[email protected]
@revodavid
blog.revolutionanalytics.com
Thank you.
www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR
The leading commercial provider of software and support for the popular open source R statistics language.