Data Driven Action: A Primer on Data Science
Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
SpringOne 2GX, Washington, DC
Sarah Aerni (@iTweetSarah)
Srivatsan Ramanujam (@being_bayesian)
Jarrod Vawdrey (@jjvawdrey)
Agenda
• Approaches and Open Source Tools for Wrangling and Modeling Massive Datasets: Sarah Aerni
• Text Analytics at Scale on MPP: Srivatsan Ramanujam
• A Scalable Framework for Real Time Monitoring & Prediction of Sensor Data: Jarrod Vawdrey
Our everyday devices are smart and talk to us
Connected devices take action to make daily life easier.
But what else?
How can IoT help prevent accidents like the Macondo disaster?
In all industries, billions of data points represent opportunities for the Internet of Things
• Gene Sequencing: cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
• Smart Grids: reading smart meters every 15 minutes is 3000x more data intensive
• Social Media: Facebook uploads 250 million photos each day
• Oil Exploration: oil rigs generate 25,000 data points per second
• Stock Market
• Video Surveillance
• Medical Imaging
• Mobile Sensors
Smart Systems = Sensors + Digital Brain + Actuators
Data Science for Building Models: Problem Formulation, Data Step, Modeling Step, Application Step
Sensors & Actuators
Data Lake
How can data drive true, automated action?
How does this…
…become this?
• How is data collected?
• Where is it stored and processed?
• Is there real signal or just noise?
• How can we build a predictive model?
• When is the right time to take action?
Critical considerations for successful modeling
How to build a predictive model at scale
• Data-driven paradigms, data cleansing and feature engineering
• Tradeoffs between model accuracy and timeliness
• Derive insight from models to change processes
Use Cases
• Oil Drilling
• Vaccine Manufacturing
• Treating Patients
[Image: Drilling into the San Andreas Fault at Parkfield, California. Credit: Stephen H. Hickman, USGS]
Data: The New Oil
• Oil & gas generates large amounts of data from sensors, enabling data-driven approaches to improve operations
Predictive maintenance
• Motivation: Failure costs estimated at $150,000/incident (billions annually)*
• Goals
  – Early warning system
  – Insights into prominent features impacting operation and failure
  – Reduction of non-productive drill time
  – Reduced incidents
*http://blog.pivotal.io/pivotal/case-studies-2/data-as-the-new-oil-producing-value-for-the-oil-gas-industry
How are models built using sensor data?
Steps: Integrating & Cleansing, Feature Building, Modeling
Integrating & Cleansing: primary data sources, combined into Integrated Data
Operator Data (thousands of records)
• Failure details
• Component details
• Drill Bit details
Drill Rig Sensor Data (billions of records)
• Rate of Penetration (ROP)
• RPM
• Weight on Bit (WOB)
[Plot: ROP over time, annotated with drill bit changes]
[Plot: WOB over time]
Feature Building
• A failure occurred at the end of this run
[Plots: bit position, RPM, ROP and WOB over the course of the run]
• Taking a window of time prior to failure, what features should we extract (e.g. variance of RPM, max bit position velocity)?
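The windowed feature extraction asked about above can be pictured with a minimal Python sketch. It assumes readings arrive as (time, RPM, bit position) records and that the analyst picks the window length. The names `Reading` and `window_features` and the sample values are hypothetical illustrations, not taken from the talk.

```python
from statistics import pvariance
from typing import NamedTuple


class Reading(NamedTuple):
    t: float             # seconds since the start of the run (assumed unit)
    rpm: float
    bit_position: float  # position of the drill bit (assumed unit)


def window_features(readings, failure_time, window):
    """Compute features from the `window` seconds leading up to `failure_time`."""
    win = sorted(
        (r for r in readings if failure_time - window <= r.t <= failure_time),
        key=lambda r: r.t,
    )
    if len(win) < 2:
        raise ValueError("need at least two readings in the window")
    # Bit position velocity: change in position per unit time between
    # consecutive readings (pairs with identical timestamps are skipped).
    velocities = [
        (b.bit_position - a.bit_position) / (b.t - a.t)
        for a, b in zip(win, win[1:])
        if b.t > a.t
    ]
    return {
        "rpm_variance": pvariance([r.rpm for r in win]),
        "max_bit_velocity": max(velocities, default=0.0),
    }


# Made-up run that "fails" at t = 4; features use the last 2 seconds.
run = [
    Reading(0, 100, 0.0), Reading(1, 102, 1.0), Reading(2, 98, 2.5),
    Reading(3, 120, 3.0), Reading(4, 80, 3.2),
]
features = window_features(run, failure_time=4, window=2)
```

In practice the same function would be applied to every run, failed or not, so that a model can compare windows that preceded a failure with windows that did not.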
How are models built using sensor data? Modeling
Predict occurrence of equipment failure in a chosen future time window
• Logistic Regression
• Elastic Net Regularized Regression (Binomial)
• Support Vector Machines
Predict remaining life of equipment
• Cox Proportional Hazards Regression
Predict Rate-of-Penetration
• Linear Regression
• Elastic Net Regularized Regression (Gaussian)
• Support Vector Machines
Finding linear dependencies between variables
ROP = c0 + cWOB * WOB
[Scatter plot: Rate of Penetration (ROP) vs. Weight on Bit (WOB)]
Linear Regression: Streaming Algorithm
How to compute with a single scan?
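One common answer, sketched here in Python under the assumption of a single predictor (WOB) and ordinary least squares: keep five running sums while scanning the rows once, then solve for the coefficients in closed form. The class name `StreamingLinearFit` and the sample data are hypothetical, and this is not claimed to be the exact implementation used in the talk.

```python
class StreamingLinearFit:
    """Accumulates sufficient statistics for y = c0 + c1 * x in one pass."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        # Each row is seen once and can then be discarded.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        c1 = (self.n * self.sxy - self.sx * self.sy) / denom
        c0 = (self.sy - c1 * self.sx) / self.n
        return c0, c1


# Made-up (WOB, ROP) pairs lying exactly on ROP = 20 + 0.5 * WOB.
fit = StreamingLinearFit()
for wob, rop in [(0, 20.0), (40, 40.0), (80, 60.0), (120, 80.0)]:
    fit.update(wob, rop)
c0, c_wob = fit.coefficients()
```

Memory use is constant regardless of the number of rows. Raw sums of squares can lose precision on large or poorly scaled data, so a production version would likely centre or rescale the inputs.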
Linear Regression: Parallel Computation
Segment 1 Segment 2
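The parallel version follows from the fact that the running sums are additive: each segment can aggregate its own rows independently, and the partial results are merged by element-wise addition before solving. The Python sketch below assumes a single predictor; `partial_sums`, `merge`, `solve`, and the two-segment split of the data are hypothetical names and values chosen for illustration.

```python
from functools import reduce


def partial_sums(rows):
    """Per-segment aggregate (n, sum x, sum y, sum x*x, sum x*y) over local rows only."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combining two segments' aggregates is element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def solve(stats):
    """Closed-form least squares for y = c0 + c1 * x from the merged sums."""
    n, sx, sy, sxx, sxy = stats
    c1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    c0 = (sy - c1 * sx) / n
    return c0, c1


# The same made-up (WOB, ROP) rows, split across two segments.
segment_1 = [(0, 20.0), (40, 40.0)]
segment_2 = [(80, 60.0), (120, 80.0)]
combined = reduce(merge, [partial_sums(segment_1), partial_sums(segment_2)])
c0, c_wob = solve(combined)
```

Because `merge` is associative and commutative, the order in which segment results arrive does not matter, which is what lets the work spread over any number of segments.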
Linear regression on 10 million rows in seconds
[Plot: execution time (s) vs. number of independent variables, for 6, 12, 18 and 24 segments]
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
Big Data Machine Learning in SQL
http://madlib.net/

Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• PMML Export
Data Science Toolkit
PLATFORM
KEY TOOLS
KEY LANGUAGES: SQL
Opportunities for Data-Driven Decisions in Pharma
Internet of Things in Manufacturing
A pipeline of sensors and opportunities for optimizing output
Input materials, Mix, Incubate, Filter, Centrifuge, Final Product (>6 months)
[Sensor plots: Temp vs. Time; Absorbance vs. Elution volume; Velocity vs. Time]
• What opportunities exist for intervention, correction?
• Which attributes should be used as features in a model?
• When is the appropriate time to take action?
Predicting vaccine potency using manufacturing data
Model generation and evaluation
Input materials, Mix, Incubate, Filter, Centrifuge, Final Product (>6 months)
[Plot: Predicted Potency vs. True Potency]
Data Integration
• Tracing product through pipeline
• Integrating manual and automated data collection
• Missing data and outliers
Feature Building
• Extract multiple features from particular steps (duration, mean, median, etc.)
• Considerations
  – Tunable vs. measures
  – Step in pipeline
Modeling
• Partial least squares
• Random forest
• Regularized regression
Building insights from models
Interpreting the utility of a measure obtained during manufacturing based on model outcomes
• Some features may reveal tunable parameters to alter potency, others may simply be markers
• Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes
[Scatter plots: Potency vs. assayed value (correlation = 0.45); Potency vs. duration of a step (correlation = 0.38)]
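For reference, correlations like those quoted for the two plots can be computed as a Pearson coefficient between a manufacturing measure and potency. The Python sketch below assumes the Pearson definition (the slides do not say which correlation was used). The function name `pearson` and the sample values are hypothetical and are not the talk's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Made-up values: duration of one step and the potency measured later.
step_duration = [1.0, 2.0, 3.0, 4.0]
potency = [2.0, 1.0, 4.0, 3.0]
r = pearson(step_duration, potency)
```

A correlation of this kind does not by itself say whether a feature is a tunable parameter or merely a marker. That distinction comes from knowing which measures the process can actually control.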
Internet of Things in Healthcare
Improving Patient Outcomes and Increasing Efficiency
Powering the Connected Hospital
Beyond monitor alerts for crashing patients: prediction means prevention
CDC, 2011: Number of Health Care Visits Per Year (Age Adjusted)
[Calendar graphic: September through December 2013, with the labels "A Snapshot" and "Another Snapshot"]
What happens between doctor visits?
[Chart: Blood Glucose and its Target Zone, shown over the September through December 2013 calendar with the labels "A Snapshot" and "Another Snapshot"]
The Promise of Internet of Humans
• Smart contact lenses and sensors to identify and alert patients before catastrophic events (e.g. blood sugar drop for diabetics)
• Wearables to track patient disease progression using objective measures
• Track patient adherence
• Detect disease outbreaks using sequencing in sewer system samples
• ECG monitoring on mobile phones for early alerting of stroke
Scaling Java Applications for NLP on MPP through PL/Java
Srivatsan Ramanujam (@being_bayesian)
Text Analysis at Scale: Business Use Cases
Sentiment Analysis for Churn Prediction
Customer
A major telecom company
Business Problem
Reducing churn through more accurate models
Challenges
• Existing models only used structured features
• Call center memos had poor structure and lots of typos
Solution
• Built sentiment analysis models to predict churn and topic models to understand topics of conversation in call center memos
• Achieved a 16% improvement in the ROC curve for churn prediction
Predicting Commodity Futures using Twitter
Customer
A major agri-business cooperative
Business Problem
Predict price of commodity futures through Twitter
Challenges
• Language on Twitter does not adhere to rules of grammar and has poor structure
• No domain-specific labeled corpus of tweet sentiment, so the problem is semi-supervised
Solution
• Built Sentiment Analysis and Text Regression algorithms to predict commodity futures from Tweets
• Established the foundation for blending the structured data (market fundamentals) with unstructured data (tweets)
http://www.slideshare.net/SrivatsanRamanujam/sramanujam-taw-2014
Platform and Tools
Pipeline of a Data Science Driven App
Data Feeds
SpringXD: Ingest, Filter, Enrich, Sink
Data Lake: Greenplum
Model Building, Model Tuning, Continuous Model Improvement: MLlib, PL/X
Apps
Business Levers
Pivotal Greenplum MPP DB
Think of it as multiple PostgreSQL servers
• Master
• Segments/Workers
Rows are distributed across segments by a particular field (or randomly)
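The two distribution policies can be illustrated with a small Python sketch. It uses CRC32 as a stand-in hash and a seeded random generator. The actual hash function and placement logic inside Greenplum are not described in these slides, so everything here (`segment_for`, `distribute`, the `rig_id` column) is a hypothetical illustration.

```python
import random
import zlib


def segment_for(key, num_segments):
    """Hash distribution: rows with the same key always land on the same segment."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, num_segments, key=None, seed=0):
    """Assign rows to segments by a key column, or randomly when key is None."""
    rng = random.Random(seed)
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        if key is None:
            idx = rng.randrange(num_segments)          # random distribution
        else:
            idx = segment_for(row[key], num_segments)  # distribution by field
        segments[idx].append(row)
    return segments


rows = [
    {"rig_id": rig, "rop": 50 + i}
    for i, rig in enumerate(["A", "B", "C", "A", "B", "A"])
]
by_rig = distribute(rows, num_segments=4, key="rig_id")
at_random = distribute(rows, num_segments=4)
```

Distributing by a field keeps all rows for one key together, which helps per-key computations. Random distribution spreads rows more evenly when no single field is a natural choice.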
Data Parallelism through PL/X
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
[Architecture diagram: SQL arrives at the Master Host (with a Standby Master), which connects through the Interconnect to several Segment Hosts, each running multiple Segments]
User Defined Functions
    -- SQL wrapper
    CREATE FUNCTION pymax (a integer, b integer)
    RETURNS integer
    -- Source language code
    AS $$
      if a > b:
        return a
      return b
    $$ LANGUAGE plpythonu;  -- Source language declaration
PL/X: Libraries
We are able to tap into the vast collection of libraries from the open source ecosystem in languages like Python, R and Java and apply them to data-parallel problems.
[Logos of libraries usable through PL/X, including CoreNLP]
http://www.slideshare.net/SrivatsanRamanujam/pivotal-data-labs-technology-and-tools-in-our-data-scientists-arsenal
Model Parallelism
Data-parallel computation via PL/X libraries only allows us to run ‘n’ models in parallel. This works well when we are building one model for each value of the group-by column, but we need parallelized algorithms to build a single model on all the available data.
For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
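One common way to build a single model over data spread across segments is to have each segment accumulate partial statistics that are then merged and finalized. The sketch below shows that pattern for a one-variable least-squares line. It illustrates the general idea only and is not MADlib's code; the function names and the toy data are assumptions.

```python
def transition(state, x, y):
    """Fold one row into a segment's running sums."""
    n, sx, sy, sxx, sxy = state
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)


def merge(s1, s2):
    """Combine the partial sums from two segments."""
    return tuple(a + b for a, b in zip(s1, s2))


def final(state):
    """Solve for slope and intercept from the merged sums."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


EMPTY = (0, 0.0, 0.0, 0.0, 0.0)

# Toy (x, y) rows on three segments, generated from y = 2x + 1.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

# Each segment scans only its own rows.
partials = []
for rows in segments:
    state = EMPTY
    for x, y in rows:
        state = transition(state, x, y)
    partials.append(state)

# The partial states are merged into one and the model is solved once.
total = EMPTY
for partial in partials:
    total = merge(total, partial)

slope, intercept = final(total)
```

No segment ever sees another segment's rows, yet the result is the same line that a single pass over all the data would produce.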
MADlib: Scalable, in-database ML
GPText
[Diagram: SQL arrives at the Master (with a Standby), which fans out to the Segments, each with a GPText instance]
• Scalable: one text processor instance per segment (“MPP Text”); can scale linearly
• High availability: replicated
• Database management features: Backup/Restore, Online Expansion, Data Recovery, Performance Monitoring
• Full text search: flexible indexing and search (stemming, phonetic search, multi-lingual search, etc.)
• Join structured data and text in one query
• Advanced text analytics platform: can be run ad hoc, supports multiple machine learning algorithms, extensible
Sentiment Analysis of Tweets
Sentiment Analysis – Challenges
• Language on Twitter doesn’t adhere to rules of grammar, syntax or spelling
• We don’t have labeled data for our problem. The tweets aren’t tagged with sentiment
• Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a tweet, but without context, sentiment prediction is futile! (Example: “Cool”)
Extracting Context – Part-of-speech tagging
Part-of-speech tagging (POS tagging) helps us extract contexts associated with sentiment words (an adjectives dictionary you may have access to).
This simple approach was first described in a classic paper by Peter D. Turney in 2002 and it is as follows:
1. Apply POS-tagging on sentences to tag words and their part-of-speech.
2. Extract 2-token phrases that provide context (ex: ADJECTIVE followed by NOUN, NOUN followed by ADJECTIVE, ADVERB followed by a VERB).
3. Use a reference corpus to identify count of co-occurrence of your extracted phrases with a strongly positive word like “excellent” compared to a strongly negative word like “poor” and use that to compute a “polarity score”.
4. Sentiment associated with your sentence can be the average polarity score of all phrases in your sentence.
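Steps 2 to 4 can be sketched in a few lines of Python. The polarity score follows the form used in Turney's paper, log2( hits(phrase NEAR “excellent”) · hits(“poor”) / ( hits(phrase NEAR “poor”) · hits(“excellent”) ) ). Everything else here is an assumption for illustration: the tag names, the toy co-occurrence counts, the small smoothing constant, and the input, which is presumed to be already tagged (step 1).

```python
import math

# Two-token part-of-speech patterns named in step 2.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "ADJ"), ("ADV", "VERB")}


def extract_phrases(tagged):
    """Return adjacent word pairs whose tags match one of PATTERNS."""
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PATTERNS
    ]


def polarity(phrase, near_pos, near_neg, n_pos, n_neg, eps=0.01):
    """Log-ratio of co-occurrence with the positive vs. the negative anchor word."""
    hits_pos = near_pos.get(phrase, 0) + eps  # eps avoids division by zero
    hits_neg = near_neg.get(phrase, 0) + eps
    return math.log2((hits_pos * n_neg) / (hits_neg * n_pos))


def sentence_sentiment(tagged, near_pos, near_neg, n_pos, n_neg):
    """Average polarity over all extracted phrases (0.0 if none are found)."""
    phrases = extract_phrases(tagged)
    if not phrases:
        return 0.0
    scores = [polarity(p, near_pos, near_neg, n_pos, n_neg) for p in phrases]
    return sum(scores) / len(scores)


# Made-up tagged tweet and made-up reference-corpus counts.
tagged = [("great", "ADJ"), ("phone", "NOUN"), ("really", "ADV"), ("rocks", "VERB")]
near_excellent = {"great phone": 8, "really rocks": 4}
near_poor = {"great phone": 2, "really rocks": 1}

score = sentence_sentiment(tagged, near_excellent, near_poor, n_pos=100, n_neg=100)
```

A positive `score` means the phrases co-occur more often with “excellent” than with “poor” in the reference corpus; a negative one means the opposite.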
Sentiment Analysis – Approach
[Pipeline diagram]
• Part-of-speech tagger (1): break up tweets into tokens and tag their parts-of-speech
• Phrase Extraction
• Phrasal Polarity Scoring (Semi-Supervised Sentiment Classification)
• Sentiment Scored Tweets: use learned phrasal polarities to score sentiment of new tweets
Parallelized ArkTweetNLP to achieve fast parts-of-speech tagging on tweets
Custom algorithm to extract contextual cues & score sentiment of tweets
(1) Parts-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
DS Pipeline: Topic and Sentiment Analysis of Tweets
Topic and Sentiment Analysis of Tweets
[Pipeline diagram: Tweet Stream (55 million tweets/day) -> stored on Data Lake -> loaded as external tables (PXF/gpfdist) -> parallel parsing of JSON and extraction of fields using PXF -> topic analysis through MADlib pLDA and sentiment analysis through custom PL/Python functions -> Pivotal Cloud Foundry]
http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database
Spring XD Components
Spring XD SNE
    [user@smdw ~]$ xd-singlenode --hadoopDistro phd20
Spring XD Shell
    [user@smdw ~]$ xd-shell --hadoopDistro phd20
Create Stream: HTTP Source, HDFS Sink
    xd:> stream create --name gnipdecahose --definition "http --port=9009 | hdfs --directory=/user/decahose/ --partitionPath=dateFormat('yyyy/MM/dd')" --deploy
Destroy stream
    xd:> stream destroy --name gnipdecahose
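Once a stream with an HTTP source is deployed, data enters it as HTTP POST requests on the configured port. The sketch below only builds such a request with the Python standard library. The host name, the JSON payload and the content type are assumptions; only port 9009 comes from the stream definition above.

```python
import json
import urllib.request


def build_post(payload, host="localhost", port=9009):
    """Build (but do not send) a POST carrying one JSON record to the HTTP source."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical record; real decahose payloads are much larger.
request = build_post({"id": 1, "text": "hello from the decahose"})

# Sending requires the stream to be deployed and reachable:
# urllib.request.urlopen(request)
```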
Parallel Parsing of JSON using PXF
Raw JSON
Parallel Parsing of JSON using PXF
http://blog.pivotal.io/pivotal/products/analyzing-raw-twitter-data-using-hawq-and-pxf
Natively parse JSON from external table
PL/Java: In-database parallel POS-tagging
UDT and UDF Usage
PL/Java: In-database parallel POS-tagging
Demo: Topic and Sentiment Analysis of Tweets
Topic and Sentiment Analysis Engine (Demo)
http://www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
A Scalable Framework For Real Time Monitoring & Prediction Of Sensor Data
Jarrod Vawdrey (@jjvawdrey)
The Explosion Of Sensor Data
Organizations have started to apply sensors to all kinds of operational equipment in order to gain added visibility into their day-to-day activities
Examples
• Connected cars can produce more than 25GB of data per hour
• 18 F1 cars (100+ sensors each) generated 243 terabytes of data from their vehicles at the 2014 U.S. Grand Prix
• A GE jet engine produces ~1 terabyte of data on a single cross-country flight
Sources:
http://www.bloomberg.com/bw/articles/2012-12-06/ge-tries-to-make-its-machines-cool-and-connected
https://www.hds.com/assets/pdf/hitachi-point-of-view-internet-on-wheels-and-hitachi-ltd.pdf
http://www.forbes.com/sites/frankbi/2014/11/13/how-formula-one-teams-are-using-big-data-to-get-the-inside-edge/
The Value Of Collecting Sensor Data
When used effectively, the data streaming off of sensors can be a huge source of value, providing insights that generate additional revenue and reduce operating costs.
Examples
• UPS: Through the use of telematics, UPS has been able to optimize delivery schedules and reduce gas usage by 25 gallons per driver per year. In the US this will reduce fuel consumption by 1.4 million gallons annually.
• US Government: Using data collected from sensors across data centers, the General Services Administration was able to reduce total data center power usage by 17% (~$30k) in its USDA facility.
• Dundee Precious Metals: Outfitting miners and machinery with internet-enabled sensors helped DPM lower production costs from $60 a ton to $40.
Sources:
http://www.automotive-fleet.com/article/story/2010/07/green-fleet-telematics-sensor-equipped-trucks-help-ups-control-costs/page/2.aspx
http://energy.gov/eere/femp/wireless-sensor-networks-data-centers
http://www.wsj.com/articles/mining-sensor-data-to-run-a-better-gold-mine-1424226463
Server Room Test Use Case
Problem: A single Supermicro 1U server (2x Xeon 2.66 GHz processors, 64 GB RAM, 4x 2 TB HDD) heats up a 6 ft x 4 ft server room (closet) to above safe operating temperature (95°F) in under two hours … the Hadoop cluster in the test server room had 4 servers + 2 switches!
[Photo: Test Hadoop cluster]
Server Room Test Use Case
Solution: Add an air conditioning unit to the server room and keep servers within the 41°F to 95°F safe operating temperature.
New Problem: The 12 amp A/C requires a separate power circuit and breaker. If the A/C trips its breaker, there is no guarantee that the servers will also shut off.
Still concerned with overheating & now concerned with power consumption!
[Photo: Tripp Lite Directed AC]
Server Room Test Use Case
Solution: Add sensors to server room in order to …
• Monitor temperature remotely
• Predict & alert potential A/C and server failures which could cause overheating
• Optimize AC temperature setting
USB Temperature Sensor
http://www.amazon.com/gp/product/B002VA813U?psc=1&redirect=true&ref_=oh_aui_detailpage_o04_s00
Arduino LM35 Temperature Sensor
http://www.lightinthebox.com/digital-temperature-sensor-module-ds18b20-for-arduino-55-125_p903326.html
Challenges Found With Adding Sensors
Challenges identified in the server room test environment are commonly found in business critical applications
Data Collection
• Sensor failure
• Sensor becomes detached from collection system
• Sensors may not be collocated
• Integrating external sources
Data Volumes
• Handling large data volumes
• Trade-off between granularity of data and volume
• Data storage which allows rapid access for analysis and modeling
Measurement
• Handling missing values
• Building aggregate metrics after system failure
Machine Learning
• Feature engineering for real time scoring
• Model performance testing and retraining indicators
Planning For Real Time Prediction: Two Primary Approaches
Batch Modeling & Real Time Scoring (Offline Learning)
• Batch Modeling: Model developed on a full training dataset and published to scoring mechanism
• Real Time Scoring: Model applied to new data as it becomes available to generate prediction
Online Learning
• Each data point in a sequence is used to update the model as it becomes available and produce a prediction
Hybrid approaches also exist
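The difference between the two approaches can be seen in a small Python sketch. The batch model is fitted once on the full training set and then only scores new points; the online model makes a prediction for each arriving point and immediately updates itself with it. The one-variable linear model, the learning rate and the synthetic data are assumptions chosen to keep the example short.

```python
def fit_batch(xs, ys):
    """Offline learning: least-squares line fitted on the full training set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def score(model, x):
    """Real time scoring: apply an already trained model to a new point."""
    slope, intercept = model
    return slope * x + intercept


class OnlineLinearModel:
    """Online learning: predict for each point, then update with its error."""

    def __init__(self, learning_rate=0.05):
        self.slope = 0.0
        self.intercept = 0.0
        self.learning_rate = learning_rate

    def predict_then_update(self, x, y):
        prediction = self.slope * x + self.intercept
        error = y - prediction
        # One stochastic-gradient step on the squared error of this point.
        self.slope += self.learning_rate * error * x
        self.intercept += self.learning_rate * error
        return prediction


# Synthetic readings generated from y = 2x + 1.
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 for x in xs]

batch_model = fit_batch(xs, ys)
batch_prediction = score(batch_model, 6.0)

online_model = OnlineLinearModel()
for _ in range(30):  # replay the stream a few times so the model settles
    for x, y in zip(xs, ys):
        online_model.predict_then_update(x, y)
```

The batch model needs retraining to absorb new behaviour, while the online model adapts continuously at the cost of depending on a well-chosen learning rate.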
Tools Used: Spring XD
“Spring XD is a unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export.” http://projects.spring.io/spring-xd/
Key Concepts
• Streams: Define a data pipe from source to sink that may pass through multiple processors
  • Source: The data provider (e.g. HTTP, JDBC, RabbitMQ)
  • Processor: Processing tasks operate on data being passed through a stream
  • Sink: Termination point for data in a stream (e.g. HDFS, JDBC, File)
• Jobs: Batch processors launched from Spring-XD
• Taps: ‘Listen’ to data being passed through a stream
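To make the vocabulary concrete, here is a toy Python model of a stream with a source, two processors, a sink and a tap. It is not the Spring XD API; `run_stream`, the message format and the processor functions are hypothetical, and the tap is attached at the source for simplicity. The 95°F threshold is the safe operating limit quoted for the server room.

```python
def run_stream(source, processors, sink, taps=()):
    """Push each message from the source through the processors to the sink."""
    for message in source:
        for tap in taps:
            tap(message)  # a tap only listens; it does not alter the flow
        for process in processors:
            message = process(message)
            if message is None:  # a processor may filter a message out
                break
        else:
            sink(message)


def drop_missing(message):
    """Processor: discard readings with no temperature value."""
    return message if message["temp_f"] is not None else None


def flag_hot(message):
    """Processor: enrich the reading with an over-temperature flag."""
    return {**message, "too_hot": message["temp_f"] > 95.0}


# Hypothetical sensor readings standing in for the source.
readings = [
    {"sensor": "rack1", "temp_f": 78.0},
    {"sensor": "rack2", "temp_f": None},
    {"sensor": "rack1", "temp_f": 96.5},
]

stored, seen = [], []
run_stream(iter(readings), [drop_missing, flag_hot], stored.append, taps=[seen.append])
```

Here `stored` plays the role of the sink and ends up with the two valid, enriched readings, while the tap in `seen` observes all three raw messages.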
Tools Used: Other
• Python (programming language): scripts to interface with sensors and send readings to RabbitMQ; real time model scoring
• RabbitMQ (message broker): queue sensor readings; short term readings cache if connection to Spring-XD drops
• HDFS (Hadoop distributed file system): store sensor readings, other data and models
• HAWQ (SQL query engine for Hadoop): access data stored in Hadoop using SQL; provide access to R for modeling (via pl/r)
• R (statistical programming language/application): batch modeling
• MADlib (scalable SQL machine learning library): batch modeling
• Redis (in-memory data store): short term application data storage (e.g. counters, aggregates)
Framework For Monitoring & Modeling (Offline)
[Architecture diagram: Sensors (S1, S2, … SN) -> Python Listeners (P1, P2, … PN) -> RabbitMQ Message Broker (Source) -> Spring-XD -> HDFS/HAWQ (Sink); within Spring-XD: Real-time Monitoring (Tap), Real-time Model Scoring (Processor), Batch Model Training (Job)]
Framework For Monitoring & Modeling (Online)
[Architecture diagram: Sensors (S1, S2, … SN) -> Python Listeners (P1, P2, … PN) -> RabbitMQ Message Broker (Source) -> Spring-XD -> HDFS/HAWQ (Sink); within Spring-XD: Real-time Monitoring (Tap), Online Learning (Tap)]
Server Room Test Use Case – Model Development
Server Data
• 4x Rack Sensors (USB)
• 1x Server Room Sensor (USB)
• 1x Outside Room Sensor (Arduino)
• 1x Outdoor Sensor (Arduino)
• A/C settings temperature
• Ganglia RRD data (Server logs)
External Data
• Weather Underground 10 Day Forecast
[Pipeline diagram: Data Cleanup & Integration -> Feature Generation -> Time series models (ARIMA, VARs) and Event prediction models (Log Reg, SVM)]
Server Room Test Use Case - Application
• Node.js application used to serve results remotely
• D3.js visualization of observed and predicted readings
• Python package ‘smtplib’ used to send email alerts when failure or out of range temperature event predicted
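An email alert of the kind described in the last bullet could look roughly like the sketch below, built on the standard `smtplib` and `email` modules. The addresses, the SMTP host and port, and the message wording are placeholders; the 41°F to 95°F range is the safe operating range quoted earlier for the server room.

```python
import smtplib
from email.message import EmailMessage

SAFE_RANGE_F = (41.0, 95.0)  # safe operating range quoted for the server room


def build_alert(predicted_temp_f, sender="alerts@example.com", recipient="ops@example.com"):
    """Return an alert email if the predicted temperature is out of range, else None."""
    low, high = SAFE_RANGE_F
    if low <= predicted_temp_f <= high:
        return None
    message = EmailMessage()
    message["Subject"] = f"Server room alert: predicted {predicted_temp_f:.1f}F"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        f"Predicted temperature {predicted_temp_f:.1f}F is outside "
        f"the safe range {low:.0f}F to {high:.0f}F."
    )
    return message


def send_alert(message, host="localhost", port=25):
    """Send the alert through an SMTP server (host and port are placeholders)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


alert = build_alert(101.3)
# send_alert(alert)  # requires a reachable SMTP server
```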
Key Takeaways
• Organizations that have embraced data collection from sensors and are using this data to generate real time actionable insights are already generating huge amounts of value
• Many challenges exist when working with streaming data, which can be addressed using a framework built around Spring-XD
• Flexibility to plug and play the best tool for the job is crucial in implementing a scalable real time system
Check out the Pivotal Data Science Blog!
http://blog.pivotal.io/data-science-pivotal
FOR FURTHER INFO, CHECK OUT…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
• Or reach out to your local Pivotal Account Executive…
Safe Harbor Statement
The following is intended to outline the general direction of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. This information is provided without warranty of any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. These purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings in this presentation remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward looking information in this presentation.