An Analytics Platform for Connected Vehicles

Predictive Maintenance in Connected Vehicles, Frank McQuillan, June 21, 2016, Big Data and Analytics Platform

Upload: data-engineers-guild-meetup-group

Post on 08-Feb-2017

TRANSCRIPT

Page 1: An Analytics Platform for Connected Vehicles

Predictive Maintenance in Connected Vehicles

Frank McQuillan June 21, 2016

Big Data and Analytics Platform

Page 2: An Analytics Platform for Connected Vehicles

"The primary goal at the moment is predictive maintenance, being able to detect defects at the earliest stage. We have to find the right correlation patterns … and incoming data to predict upcoming malfunctions and their consequences."

– Dirk Ruger, Head of after-sale analytics and digital processes at BMW
http://www.v3.co.uk/v3-uk/news/2407083/big-data-analytics-driving-predictive-car-maintenance-at-bmw

Page 3: An Analytics Platform for Connected Vehicles

1. This is a system design problem

Page 4: An Analytics Platform for Connected Vehicles

1. This is a hard system design problem

Page 5: An Analytics Platform for Connected Vehicles

2. Open source is the starting point

Page 6: An Analytics Platform for Connected Vehicles

2. Open source is just the starting point

Page 7: An Analytics Platform for Connected Vehicles

3. There is no single best design

Page 8: An Analytics Platform for Connected Vehicles

3. There is an acceptable design

Page 9: An Analytics Platform for Connected Vehicles

(Back End) Platform Characteristics

•  Connectivity to multiple data sources
•  Data ingestion
•  Real-time streaming analytics
•  Persist data to big data store
•  Tools for data exploration, build/score data science models
•  Build applications that consume model outputs
•  Deploy and manage those applications in the cloud (operationalization)

Page 10: An Analytics Platform for Connected Vehicles

Reference Architecture

[Diagram. Components shown: vehicle data (streaming); connector; Rabbit/Kafka source; Spring Cloud Data Flow; exploratory data analysis & model training; publish model info.; Microservices (Spring Boot) with /load_model and /score_model; training (offline), scoring (online); web or mobile app dashboard.]
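One way to read the /load_model and /score_model labels in the reference architecture is as a pair of operations: offline training publishes a model, and the online path scores each streaming reading against it. The sketch below is a minimal illustration in plain Python, not the Spring Boot services from the talk. The JSON model format, the logistic regression form, the in-memory registry, and the sensor names (`engine_temp_c`, `vibration_g`) are all assumptions made for the example.

```python
import json
import math

# Hypothetical in-memory registry. The slides name the two endpoints
# but do not describe their payloads or storage.
_MODELS = {}


def load_model(name, model_json):
    """Register a model published by the offline training step.

    Assumes a JSON document with logistic regression coefficients, e.g.
    {"intercept": -4.0, "weights": {"engine_temp_c": 0.03}}.
    """
    model = json.loads(model_json)
    intercept = float(model["intercept"])
    weights = {k: float(v) for k, v in model["weights"].items()}
    _MODELS[name] = (intercept, weights)


def score_model(name, reading):
    """Return a failure probability in [0, 1] for one vehicle reading."""
    intercept, weights = _MODELS[name]
    # Missing sensor values are treated as 0.0 (an assumption of this sketch).
    z = intercept + sum(w * float(reading.get(f, 0.0)) for f, w in weights.items())
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    load_model(
        "brake_wear",
        '{"intercept": -4.0, "weights": {"engine_temp_c": 0.03, "vibration_g": 1.5}}',
    )
    print(score_model("brake_wear", {"engine_temp_c": 95.0, "vibration_g": 0.8}))
```

Keeping loading separate from scoring lets a retrained model be swapped in without touching the streaming path, which appears to be the intent of the offline/online split in the diagram.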

Page 11: An Analytics Platform for Connected Vehicles

Reference Architecture (same diagram as Page 10)

Page 12: An Analytics Platform for Connected Vehicles

Apache HAWQ (incubating)

Pivotal HDB

Page 13: An Analytics Platform for Connected Vehicles

What is Apache HAWQ / Pivotal HDB?

http://hawq.incubator.apache.org/

Page 14: An Analytics Platform for Connected Vehicles

Journey to Open Source

[Timeline, 1986 to 2015. Milestones shown:]

•  Michael Stonebraker develops Postgres at UCB
•  Postgres adds support for SQL
•  Open Source PostgreSQL
•  PostgreSQL 7.0 released
•  PostgreSQL 8.0 released
•  Greenplum forks PostgreSQL
•  Hadoop 1.0 Released
•  HAWQ & MADlib go Apache
•  HAWQ launched
•  Hadoop 2.0 Released
•  MADlib launched
•  Greenplum open sourced

Page 15: An Analytics Platform for Connected Vehicles

HAWQ / Pivotal HDB Advantages

•  Advanced Analytics Performance
–  Exceptional MPP performance, low latency, ACID reliability, data federation
•  Most Complete Language Compliance
–  Higher degree of SQL compatibility, SQL-92, 99, 2003, OLAP (leverage existing SQL skills)
•  Advanced Query Optimizer
–  Maximize performance and do advanced queries with confidence
•  Elastic Architecture for Scalability
–  Scale up/down or scale in/out, expand/shrink clusters on the fly
•  Integrated w/ MADlib Machine Learning
–  Advanced MPP analytics, data science at scale, directly on Hadoop data

Page 16: An Analytics Platform for Connected Vehicles

HAWQ Extension Framework (PXF)

•  Enables connectivity between Pivotal HDB and other stores (Hive, HBase, HDFS files)
•  Provides an extensible framework to add support for custom services
•  Operates as a separate service in Hadoop
•  Low latency on large data sets
•  Considers cost model of federated sources

[Diagram: HAWQ connects through PXF to HDFS (Hadoop Distributed File System), Hive, HBase, and other services.]

Page 17: An Analytics Platform for Connected Vehicles

Greenplum Database

Page 18: An Analytics Platform for Connected Vehicles

Greenplum Database

•  SQL Based:
–  Load and query like any SQL database
–  MPP shared-nothing parallelization
–  Automatic data distribution without tuning
•  Linear Scalability:
–  Linear scaling of capacity, loading, users and concurrency
•  Analytics Optimized:
–  Analytics-oriented query optimization, write locking, storage management, data compression, etc.
•  Extensible for Analytics:
–  MADlib machine learning library

http://greenplum.org/

Page 19: An Analytics Platform for Connected Vehicles

MPP Shared Nothing Architecture

[Diagram: SQL enters at the Master Host, backed by a Standby Master. The Interconnect links the master to Segment Hosts node1, node2, node3, … nodeN, each shown with four Segment Instances.]

•  Master Host and Standby Master Host: the master coordinates work with the Segment Hosts
•  Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
•  Segment Hosts have their own CPU, disk and memory (shared nothing)
•  Interconnect: high speed interconnect for continuous pipelining of data processing
•  Flexible framework for processing large datasets
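The shared-nothing idea above can be sketched in a few lines: rows are spread across segments by hashing a distribution key, each segment aggregates only its own rows, and the master combines the partial results. This is an illustration only, not Greenplum code. The CRC32 hash, the segment count, the thread pool standing in for segment processes, and the column names (`vehicle_id`, `engine_temp_c`) are assumptions made for the example.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(distribution_key):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(str(distribution_key).encode("utf-8")) % NUM_SEGMENTS


def distribute(rows, key):
    """Give every segment its own private slice of the table."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key])].append(row)
    return segments


def segment_partial(rows):
    """Runs on one segment: per-vehicle (sum, count) over local rows only."""
    acc = defaultdict(lambda: (0.0, 0))
    for row in rows:
        total, count = acc[row["vehicle_id"]]
        acc[row["vehicle_id"]] = (total + row["engine_temp_c"], count + 1)
    return dict(acc)


def master_avg(rows):
    """Master: fan work out to the segments, then merge their partials."""
    segments = distribute(rows, "vehicle_id")
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_partial, segments))
    merged = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for vehicle, (total, count) in partial.items():
            m_total, m_count = merged[vehicle]
            merged[vehicle] = (m_total + total, m_count + count)
    return {vehicle: total / count for vehicle, (total, count) in merged.items()}


if __name__ == "__main__":
    readings = [
        {"vehicle_id": "A", "engine_temp_c": 90.0},
        {"vehicle_id": "A", "engine_temp_c": 100.0},
        {"vehicle_id": "B", "engine_temp_c": 80.0},
    ]
    print(master_avg(readings))
```

Because the aggregate is split into a partial step and a merge step, no segment ever needs to see another segment's rows, which is what allows capacity to grow by adding hosts.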

Page 20: An Analytics Platform for Connected Vehicles

Apache MADlib (Incubating)

Distributed In-Database Machine Learning

Page 21: An Analytics Platform for Connected Vehicles

Scalable, In-Database Machine Learning

•  Open source https://github.com/apache/incubator-madlib
•  Downloads and docs http://madlib.incubator.apache.org/
•  Wiki https://cwiki.apache.org/confluence/display/MADLIB/

Page 22: An Analytics Platform for Connected Vehicles

History

The MADlib project was initiated in 2011 by EMC/Greenplum architects and Joe Hellerstein from Univ. of California, Berkeley.

UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.

1- dude, you got skills.
2- dude, you got mad skills.

Page 23: An Analytics Platform for Connected Vehicles

Functions (April 2016)

Linear Systems
•  Sparse and Dense Solvers
•  Linear Algebra

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low Rank

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Ordinal Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Robust Variance (Huber-White), Clustered Variance, Marginal Effects

Other Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Apriori)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Random Forest
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
•  Naïve Bayes
•  Support Vector Machines (SVM)

Descriptive Statistics
•  Sketch-Based Estimators
–  CountMin (Cormode-Muthukrishnan)
–  FM (Flajolet-Martin)
–  MFV (Most Frequent Values)
•  Correlation and Covariance
•  Summary

Inferential Statistics
•  Hypothesis Tests

Time Series
•  ARIMA

Path Functions
•  Operations on Pattern Matches

Utility Modules
•  Array and Matrix Operations
•  Sparse Vectors
•  Random Sampling
•  Probability Functions
•  Data Preparation
•  PMML Export
•  Conjugate Gradient
•  Stemming
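To give a feel for the sketch-based estimators in the list above, here is a minimal Count-Min sketch in Python. It is a generic textbook version, not MADlib's implementation. The table dimensions, the use of salted BLAKE2b as the hash family, and the diagnostic trouble codes in the usage lines are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that never underestimates a count."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """Yield one (row, column) cell per hash function."""
        data = str(item).encode("utf-8")
        for row in range(self.depth):
            # A different salt per row acts as an independent hash function.
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other):
        """Add another sketch of the same shape into this one, cell by cell."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same dimensions")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


if __name__ == "__main__":
    sketch = CountMinSketch()
    for code in ["P0300", "P0300", "P0171", "P0300"]:
        sketch.add(code)
    print(sketch.estimate("P0300"))
```

The `merge` method is what makes this structure a natural fit for parallel execution: each worker can build a sketch over its own data, and the sketches are summed at the end.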

Page 24: An Analytics Platform for Connected Vehicles

MADlib Features

•  Better parallelism
–  Algorithms designed to leverage MPP and Hadoop architecture
•  Better scalability
–  Algorithms scale as your data set scales
•  Better predictive accuracy
–  Can use all data, not a sample
•  ASF open source (incubating)
–  Active and growing community
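The parallelism claim above can be illustrated with simple linear regression computed from per-segment sufficient statistics, following the transition / merge / final shape of a user-defined aggregate. This is a sketch of the general pattern, not MADlib source code. The function names, the tuple layout of the state, and the sample data are assumptions made for the example.

```python
from functools import reduce

# Running state: (n, sum_x, sum_y, sum_xx, sum_xy)
ZERO = (0, 0.0, 0.0, 0.0, 0.0)


def transition(state, point):
    """Fold one (x, y) row into a segment's running state."""
    n, sx, sy, sxx, sxy = state
    x, y = point
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)


def merge(a, b):
    """Combine the states of two segments by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def final(state):
    """Turn the merged state into (slope, intercept) by least squares."""
    n, sx, sy, sxx, sxy = state
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def fit(segments):
    """Each segment scans only its own rows; only small states are merged."""
    partials = [reduce(transition, rows, ZERO) for rows in segments]
    return final(reduce(merge, partials, ZERO))


if __name__ == "__main__":
    # Three hypothetical segments holding points on the line y = 2x + 1.
    data = [[(1.0, 3.0), (2.0, 5.0)], [(3.0, 7.0)], []]
    print(fit(data))
```

Only five numbers per segment cross the network regardless of table size, which is why this style of algorithm can use all of the data rather than a sample.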

Page 25: An Analytics Platform for Connected Vehicles

Supported Platforms

•  Greenplum Database
•  PostgreSQL
•  Apache HAWQ (incubating)

Scale-out machine learning on open source, MPP execution engines.

Page 26: An Analytics Platform for Connected Vehicles

Reference Architecture (same diagram as Page 10)

Page 27: An Analytics Platform for Connected Vehicles

Spring Cloud Data Flow

Page 28: An Analytics Platform for Connected Vehicles

https://cloud.spring.io/spring-cloud-dataflow/

Page 29: An Analytics Platform for Connected Vehicles

Page 30: An Analytics Platform for Connected Vehicles

Reference Architecture (same diagram as Page 10)

Page 31: An Analytics Platform for Connected Vehicles

Apache Geode (incubating)

Pivotal Gemfire

Page 32: An Analytics Platform for Connected Vehicles

Apache Geode / Pivotal Gemfire

An in-memory, distributed database with strong consistency built to support low latency transactional applications at extreme scale.

http://geode.incubator.apache.org/

Page 33: An Analytics Platform for Connected Vehicles

Apache Geode / Pivotal Gemfire

•  Cloud-ready, infrastructure-agnostic
•  Horizontal scalability
•  Automatic failover
•  Reliable eventing model
•  Multi-site high availability
•  Seamless integration with analytical databases

[Diagram labels: App 1, App 2, App 3]

Page 34: An Analytics Platform for Connected Vehicles

Pivotal Big Data Suite: open source data management portfolio

•  Complete platform
•  Hadoop Native SQL
•  Deployment options
•  Based on open source
•  Flexible licensing
•  Advanced data services

Products:

•  PIVOTAL GREENPLUM DATABASE: data warehouse database based on open source Greenplum Database
•  PIVOTAL HDB: open source analytical database for Apache Hadoop based on Apache HAWQ
•  PIVOTAL GEMFIRE: open source application and transaction data grid based on Apache Geode

Page 35: An Analytics Platform for Connected Vehicles

Other Architectures

Page 36: An Analytics Platform for Connected Vehicles

https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/2016-06-06_berlin_buzzwords_nat_poc2indus_nonne.pdf

Page 37: An Analytics Platform for Connected Vehicles

http://enterprise.microsoft.com/en-us/industries/discrete-manufacturing/learning-leaders-manufacturers-using-iot-reimagine-connected-services-customer-experiences/

Page 38: An Analytics Platform for Connected Vehicles

https://www.ge.com/digital/sites/default/files/predix-platform-brief-ge-digital.pdf

Page 39: An Analytics Platform for Connected Vehicles

https://www.mapr.com/developercentral/lambda-architecture

Page 40: An Analytics Platform for Connected Vehicles

Platform Challenges

•  Managing complexity
•  Integration
•  Open source: how to choose and keep up?
•  Data security and lineage
•  IT and car development cycles are not in sync
•  Multiple vendors involved (e.g., carmakers need mobile partners)

Page 41: An Analytics Platform for Connected Vehicles

Platform Challenges (2)

•  Car dealers need to connect to the platform
•  Who will pay for connected car services?
•  Fleet management

Page 42: An Analytics Platform for Connected Vehicles

This is a system design problem that YOU can help solve