how spark enables the internet of things: efficient integration of multiple spark components for...

© 2015 IBM Corporation How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases Paula Ta-Shma IBM Research [email protected] Joint work with: Adnan Akbar, University of Surrey Michael Factor, IBM Research Guy Hadash, IBM Research Juan Sancho, ATOS

Upload: sparktc

Post on 09-Apr-2017


TRANSCRIPT

Page 1: How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases

© 2015 IBM Corporation

How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases

Paula Ta-Shma, IBM Research, [email protected]

Joint work with:
Adnan Akbar, University of Surrey
Michael Factor, IBM Research
Guy Hadash, IBM Research
Juan Sancho, ATOS

Page 2

The Evolution of Data Collection

Internet of Things

Page 3

IoT: The Biggest Big Data

The IoT market will grow to $1.7 trillion in 2020 (IDC)

By 2020 the number of networked devices will be 30 billion (IDC), more than 4 times the entire global population

[Chart: Global Data Volume in Exabytes for 2005, 2012 and 2017. Series: Sensors (Internet of Things), VoIP, Social Media (video, audio and text), Enterprise Data]

Page 4

EMT Madrid Bus Company Needs to Make Decisions According to Current and Predicted Future Traffic State

The Problem
– EMT needs to staff control rooms where employees manually analyze Madrid traffic sensor output. This can be slow and costly.

Objective
– Improve customer satisfaction and reduce costs by responding more efficiently and quickly to real-time traffic problems

Approach
– Monitor data from up to 3000 sensors. React by rerouting buses, modifying traffic lights, etc., based upon knowledge derived from historical data

[Figure panels: Today, Tomorrow]

Page 5

Generic IoT Architecture – Data Flow

1. Collect historical time series data
– Collect data from devices
– Aggregate into objects
– Index and/or partition

[Diagram labels: IoT, Secor, Swift]

Page 6

Generic IoT Architecture – Data Flow

2. Learn patterns in data
– May be time/location dependent
– Generate thresholds, classifiers etc.

[Diagram labels: Secor, Swift]

Page 7

Generic IoT Architecture – Data Flow

3. Apply what was learned on real time data stream
– Take action

[Diagram labels: IoT, Secor, CEP, Swift]

Page 8

Generic IoT Architecture – Data Flow

[Diagram labels: IoT, CEP, Secor, Swift]

Green Flows: Real time
Purple Flows: Batch

Page 9

IoT Architecture – Madrid Traffic – Ingestion Flow

Aim: Collect historical timeseries data for analysis
– Continuously collect data from up to 3000 Madrid council traffic sensors via web service
  - Data includes traffic speeds and intensities, updated every 5 mins
– Push the messages to Kafka
– Use Secor to aggregate multiple messages into a single Swift object
  - According to policy, e.g., every 60 mins
  - Possibly partition the data, e.g. according to date
  - Convert to Parquet format
  - Annotate with metadata, e.g., min/max speed, start/end time
– Index Swift objects according to their metadata using ElasticSearch

[Diagram labels: IoT, Secor, Swift]
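The aggregation and annotation steps can be pictured with a small, self-contained sketch. It is not Secor's implementation: the record fields (`sensor`, `ts`, `speed`), the sensor id, the object naming scheme and the function name `aggregate_hourly` are assumptions made for illustration, and plain Python dicts stand in for Swift objects and their metadata.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(messages):
    """Group sensor messages into one 'object' per hour and annotate each
    object with min/max speed and start/end time metadata.
    Illustrative only: field names and object names are hypothetical."""
    buckets = defaultdict(list)
    for msg in messages:
        ts = datetime.fromisoformat(msg["ts"])
        # Truncate the timestamp to the hour to obtain the object key.
        key = ts.replace(minute=0, second=0, microsecond=0)
        buckets[key].append(msg)

    objects = []
    for key in sorted(buckets):
        # ISO-8601 strings sort chronologically, so a string sort is enough.
        records = sorted(buckets[key], key=lambda m: m["ts"])
        speeds = [m["speed"] for m in records]
        objects.append({
            "name": key.strftime("%Y-%m-%d/%H"),  # partitioned by date, then hour
            "records": records,
            "metadata": {
                "min_speed": min(speeds),
                "max_speed": max(speeds),
                "start_time": records[0]["ts"],
                "end_time": records[-1]["ts"],
            },
        })
    return objects


# Hypothetical readings from a single sensor.
messages = [
    {"sensor": "PM10001", "ts": "2015-06-01T08:00:00", "speed": 42.0},
    {"sensor": "PM10001", "ts": "2015-06-01T08:05:00", "speed": 17.5},
    {"sensor": "PM10001", "ts": "2015-06-01T09:00:00", "speed": 55.0},
]
hourly_objects = aggregate_hourly(messages)
```

Here the three messages end up in two objects, one per hour, each carrying the metadata that a search index could later be built on.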

Page 10

IoT Architecture – Madrid Traffic – Data Access

Aim: Access data efficiently and cost effectively
– Store IoT data in OpenStack Swift object storage
  - Open source, low cost deployment, and highly scalable
– Parquet data is accessible via Spark SQL
– Optimized predicate pushdown
  - Custom Spark SQL external data source driver
  - Uses object metadata indexes
  - Searches for Swift objects whose min/max values overlap requested ranges

Get all data for morning traffic:
SELECT codigo, intensidad, velocidad FROM madridtraffic WHERE tf >= '08:00:00' AND tf <= '12:00:00'

Brute force method: 13245 Swift requests
Optimized predicate pushdown: 616 Swift requests
21.5 times improvement

[Diagram label: Swift]
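The min/max overlap test behind this kind of pruning is simple enough to sketch. The snippet below is an illustration only, not the custom data source driver: the index layout, the object names and the `min_tf` / `max_tf` keys are hypothetical, and a Python list stands in for the metadata index.

```python
def overlaps(obj_min, obj_max, query_min, query_max):
    """True when the closed interval [obj_min, obj_max] intersects
    the closed interval [query_min, query_max]."""
    return obj_min <= query_max and obj_max >= query_min


def select_objects(index, query_min, query_max):
    """Return names of objects whose min/max metadata overlaps the query range."""
    return [
        entry["name"]
        for entry in index
        if overlaps(entry["min_tf"], entry["max_tf"], query_min, query_max)
    ]


# Hypothetical metadata index: one entry per stored object.
# Zero-padded HH:MM:SS strings compare correctly as plain strings.
index = [
    {"name": "obj-00", "min_tf": "00:00:00", "max_tf": "05:59:59"},
    {"name": "obj-06", "min_tf": "06:00:00", "max_tf": "08:59:59"},
    {"name": "obj-09", "min_tf": "09:00:00", "max_tf": "11:59:59"},
    {"name": "obj-12", "min_tf": "12:00:00", "max_tf": "17:59:59"},
    {"name": "obj-18", "min_tf": "18:00:00", "max_tf": "23:59:59"},
]

# Same range as the morning-traffic query: tf >= '08:00:00' AND tf <= '12:00:00'
selected = select_objects(index, "08:00:00", "12:00:00")
```

Only the three objects whose ranges touch the requested interval would need to be fetched; the other two are skipped without issuing a request for them.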

Page 11

IoT Architecture – Madrid Traffic – Machine Learning

Aim: Learn to differentiate between ‘good’ and ‘bad’ traffic
– Depends on context
  - Time (morning/evening), Day (weekday/weekend)
  - Location
– Use Spark MLlib k-means clustering
– Produce threshold values for real-time decision making
– Re-run algorithm when quality of clusters decreases
  - Can use silhouette index to measure quality

[Diagram label: Swift]
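As a rough illustration of the quality check, the silhouette index can be computed directly from its textbook definition. This is a plain-Python sketch on one-dimensional data, not the code used in the project; the sample speeds and the idea of comparing a good labelling with a poor one are assumptions made for the example.

```python
def silhouette_index(points, labels):
    """Mean silhouette coefficient for 1-D points with cluster labels."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical speed readings with a well-separated and a poor labelling.
speeds = [10, 12, 11, 50, 52, 51]
good = silhouette_index(speeds, [0, 0, 0, 1, 1, 1])
bad = silhouette_index(speeds, [0, 1, 0, 1, 0, 1])
```

A value close to 1 indicates compact, well-separated clusters. A drop in the index on fresh data could serve as the trigger for re-running the clustering, with the cut-off value left as a tuning choice.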

Page 12

IoT Architecture – Madrid Traffic – Machine Learning

Event Detection:

• Use Spark MLlib k-means clustering to separate data into 2 clusters

• Find the midpoint between the 2 cluster centres

• Use this midpoint to generate the thresholds

• Repeat for each context e.g. time period (morning, afternoon, evening, night)
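The event-detection steps above can be sketched without Spark. The snippet below swaps in a tiny pure-Python k-means (k = 2, one-dimensional, initialised at the minimum and maximum) for Spark MLlib, so it only illustrates the midpoint idea; the sample speeds and context names are made up.

```python
def two_means_1d(values, iterations=100):
    """Lloyd's k-means for k=2 on 1-D data, initialised at the min and max."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iterations):
        # Assign every value to the nearer of the two centres.
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values identical: nothing to split
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def threshold_for(values):
    """Midpoint between the two cluster centres."""
    lo, hi = two_means_1d(values)
    return (lo + hi) / 2.0


# Hypothetical speed samples, one list per context (time period).
speeds_by_context = {
    "morning": [12, 15, 14, 48, 52, 50],
    "night": [55, 60, 58, 30, 28, 32],
}
thresholds = {ctx: threshold_for(v) for ctx, v in speeds_by_context.items()}
```

Each context gets its own threshold, so the same speed can count as congested at night and as normal in the morning rush.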

Anomaly Detection:

• Use a single cluster and define an anomaly to be further than a certain distance from the cluster centre

[Figure: Morning Traffic on Weekdays]
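The single-cluster anomaly test can be sketched in a few lines. This is an illustrative stand-in rather than the MLlib-based code: the (speed, intensity) samples and the distance limit of 30 are hypothetical, and in practice the features would usually be scaled before distances are compared.

```python
import math


def fit_single_cluster(points):
    """Centre of a single cluster: the component-wise mean of the points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def is_anomaly(point, centre, max_distance):
    """Flag a point that lies further than max_distance from the centre."""
    return math.dist(point, centre) > max_distance


# Hypothetical historical readings as (speed, intensity) pairs.
history = [(40.0, 300.0), (42.0, 310.0), (38.0, 290.0)]
centre = fit_single_cluster(history)

normal_reading = is_anomaly((41.0, 305.0), centre, max_distance=30.0)
odd_reading = is_anomaly((5.0, 900.0), centre, max_distance=30.0)
```

The first reading stays within the limit and is treated as normal; the second lies far from the centre and is flagged.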

Page 13

IoT Architecture – Madrid Traffic – Real Time Decision Making

Aim: Respond in real time to traffic conditions
– Use Complex Event Processing (CEP) approach
  - Rule based
  - Process events record by record
  - CEP rules are typically defined manually but in many cases it is difficult to get them right
  - We automate this process and make it smart
  - uCEP has a small footprint, can be run at the edge

[Diagram labels: CEP, IoT]
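A rule that processes events record by record might look like the sketch below. It does not use uCEP or its rule language: the event fields, the threshold value of 31.8 and the "three consecutive slow readings" condition are all assumptions chosen to show the shape of a stateful, threshold-driven rule.

```python
def make_congestion_rule(speed_threshold, consecutive=3):
    """Build a stateful rule that fires when `consecutive` successive
    readings from the same sensor fall below `speed_threshold`."""
    streak = {}  # sensor id -> current run of slow readings

    def on_event(event):
        sensor = event["sensor"]
        if event["speed"] < speed_threshold:
            streak[sensor] = streak.get(sensor, 0) + 1
        else:
            streak[sensor] = 0
        # Fire exactly once, at the moment the run reaches the required length.
        if streak[sensor] == consecutive:
            return {"type": "congestion", "sensor": sensor, "at": event["ts"]}
        return None

    return on_event


# Hypothetical threshold and event stream for one sensor.
rule = make_congestion_rule(speed_threshold=31.8, consecutive=3)
events = [
    {"sensor": "s1", "ts": "08:00", "speed": 45},
    {"sensor": "s1", "ts": "08:05", "speed": 20},
    {"sensor": "s1", "ts": "08:10", "speed": 18},
    {"sensor": "s1", "ts": "08:15", "speed": 15},
    {"sensor": "s1", "ts": "08:20", "speed": 14},
    {"sensor": "s1", "ts": "08:25", "speed": 50},
]
alerts = [alert for alert in (rule(e) for e in events) if alert is not None]
```

Because the threshold is a parameter rather than a hard-coded constant, it can be supplied by a learning step instead of being tuned by hand.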

Work in Progress

Proactive approach:

• Use Spark streaming linear regression to predict traffic behavior (e.g. speed, intensity) for near future

• Apply CEP on predicted data

• Respond pro-actively to predicted events such as traffic congestion

– e.g. EMT can proactively re-route buses
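The proactive idea, predicting first and then applying the rule to the prediction, can be sketched as follows. To keep the example exact and dependency-free, a closed-form least-squares line over a sliding window stands in for Spark's streaming linear regression; the window size, horizon, threshold and readings are hypothetical.

```python
from collections import deque


class TrendPredictor:
    """Fit a least-squares line to the last `window` readings and
    extrapolate `horizon` steps ahead."""

    def __init__(self, window=4, horizon=1):
        self.values = deque(maxlen=window)
        self.horizon = horizon

    def update(self, value):
        self.values.append(float(value))

    def predict(self):
        n = len(self.values)
        if n < 2:
            return None  # not enough data to fit a line
        t_mean = (n - 1) / 2.0
        y_mean = sum(self.values) / n
        num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(self.values))
        den = sum((t - t_mean) ** 2 for t in range(n))
        slope = num / den
        intercept = y_mean - slope * t_mean
        return intercept + slope * (n - 1 + self.horizon)


predictor = TrendPredictor(window=4, horizon=1)
SPEED_THRESHOLD = 31.8  # hypothetical threshold value
warnings = []
for ts, speed in [("08:00", 50), ("08:05", 45), ("08:10", 40), ("08:15", 35)]:
    predictor.update(speed)
    predicted = predictor.predict()
    # Apply the rule to the predicted value rather than the observed one.
    if predicted is not None and predicted < SPEED_THRESHOLD:
        warnings.append((ts, predicted))
```

At 08:15 the observed speed (35) is still above the threshold, but the extrapolated next reading (30) is not, so the warning is raised one step before the congestion would be observed.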

Page 14

Demo

Page 15

Our Architecture Applies to Many IoT Use Cases

Energy/utilities
– Anomaly detection
  - Pipe leakage
  - Appliance malfunction
– Occupancy detection

Healthcare
– Healthcare patient monitoring/alert/response

Insurance
– Driver behavior and location monitoring

Transportation
– Connected vehicles, engine diagnostics, automated service scheduling

Logistics
– Goods tracking, sensitive goods management

Page 16

The Madrid Traffic Use Case on IBM Bluemix

Joint work with Naeem Altaf and team

[Diagram labels: Data Sources, Madrid Traffic Sensors, MQTT, Node-RED, Message Bus, Secor, Data Storage, Object Storage, Data Analytics, Apache Spark, Data Visualization, Freeboard Dashboard]

Page 17

Thank You !

Page 18

Backup

Page 19

COSMOS
Funding: EU FP7 at level of 2PY x 3 years
Started: Sept 2013
Coordinator: ATOS
Technical partners: IBM, NTUA, Univ Surrey, Siemens, ATOS
Use Case Partners: Hildebrand/Camden, EMT Madrid Bus Transport/Madrid Council, III Taiwan – Smart Cities use cases
Project Vision: Enable ‘things’ to interact with each other based on shared experience, trust, reputation etc.

Page 20

IBM Bluemix Data Analytics for IoT Architecture

Page 21

Kafka + Secor

What is it?
– Apache Kafka is a high throughput distributed publish/subscribe messaging system.
– Secor is an open source tool developed by Pinterest, which aggregates Kafka messages and saves them as an S3 object.

What extensions were needed?
– Support for OpenStack Swift as a Secor target. We also added support for Parquet format and for annotating objects with metadata to support indexing and search.

What is the value of integration with Swift?
– Enables bringing new data and applications to Swift, which is an open source solution. Parquet and metadata search enable improved performance for batch analytics.

Status
– We contributed OpenStack Swift support to the Secor community and it is now part of Secor.

Page 22

Parquet

What is it?
– A column based semi-structured, schema-based storage format supported by Hadoop and Spark. Enables column-wise compression and projection pushdown.

What integration is needed?
– Since Swift is now part of the Hadoop ecosystem, no additional integration is needed. Data in Swift can be stored in Apache Parquet format, inheriting associated advantages.

Status
– Spark SQL supports storing tabular data in Parquet format in Hadoop compatible storage systems such as Swift.

Page 23

elasticsearch

What is it?
– A distributed, scalable, real-time search and analytics engine, built on Apache Lucene.

What integration is needed?
– Index object metadata allowing search for objects by attributes.

What is the value of integration with Swift?
– Use search to select objects for further processing, e.g., relevant objects for analytics.
  - Note that S3 does not yet have native search according to metadata.

Status
– The IBM SoftLayer object service includes a basic implementation of metadata search; at IBM Research, we added extensions such as data type support and range searches.

Page 24

Power of data. Simplicity of design. Speed of innovation.

IBM Spark

For up-to-date information and news about Spark and the Spark Technology Center, sign up for our newsletter at www.spark.tc