Spark and MapR Streams: A Motivating Example

© 2017 MapR Technologies

Upload: ian-downard

Post on 05-Apr-2017

Abstract

• Businesses are discovering the untapped potential of large datasets and data streams through the use of technologies for big data processing and storage. By leveraging these assets they’re creating a new generation of applications that derive value from data they used to throw away.

• In this presentation we’ll discuss how to build operational environments for these types of applications with the MapR Converged Data Platform and we’ll walk through an example of a next-generation application that uses Java APIs for MapR Streams, Apache Spark, Apache Hive, and MapR-DB.

• We’ll see how these technologies can be used to join and transform unbounded datasets to find signals and derive new data streams for a financial scenario involving real-time algorithmic trading and historical analysis using SQL.

• We’ll also discuss how MapR enables you to run real-time data applications with the speed, reliability, and security you need for a production environment.

• Keywords: MapR, Spark, Kafka, NoSQL, JSON, Zeppelin, Hive, streaming

Contact Info

Ian Downard, Technical Evangelist at MapR

Personal Blog: http://bigendiandata.com

Twitter: @iandownard

Learning Goals

1. Appreciate the opportunity of the time we’re in.

2. Become familiar with MapR

3. Become familiar with Spark

4. Feel empowered.

Why Now?

• But Moore’s law has applied for a long time

• Why is data exploding now?

• Why not 10 years ago?

• Why not 20?

Because data wasn’t available?

• If it were just availability of data, then existing big companies would have adopted big data technology first

They didn’t

Because processing it was too expensive?

• If it were just a net positive value, then finance companies should have adopted first, because they have higher opportunity value per byte

They didn’t

Backwards adoption

• Under almost any argument, startups would not have adopted big data technology first

They did

Everywhere at Once?

• Something very strange is happening
  – Big data is being applied at many different scales
  – By large companies and small

Why?

Data Analytics Scaling Laws

• Analytics scaling is all about:
  – Big gains for little initial effort
  – Rapidly diminishing returns

• The key to net value is how costs scale
  – Old school: exponential scaling
  – Big data: linear scaling, low constant

• Cost/performance has radically changed
  – Cluster computing, commodity hardware, data science frameworks…
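The scaling argument can be made concrete with a small sketch. Net value is taken here as value minus cost. The value curve and both cost curves below, including every constant, are hypothetical choices made only to illustrate diminishing returns against exponential and linear cost scaling; none of them come from the slides.

```java
import java.util.function.DoubleUnaryOperator;

final class NetValueSketch {
    // Hypothetical diminishing-returns value curve: big gains first, then a plateau near 1.
    static double value(double scale) {
        return 1.0 - Math.exp(-scale / 200.0);
    }

    // Hypothetical "old school" cost: negligible at small scale, then exponential growth.
    static double exponentialCost(double scale) {
        return 0.01 * (Math.exp(scale / 100.0) - 1.0);
    }

    // Hypothetical "big data" cost: a fixed overhead plus a low linear slope.
    static double linearCost(double scale) {
        return 0.05 + 0.0002 * scale;
    }

    static double netValue(double scale, DoubleUnaryOperator cost) {
        return value(scale) - cost.applyAsDouble(scale);
    }

    // Grid search over integer scales 0..maxScale for the highest net value.
    static int bestScale(DoubleUnaryOperator cost, int maxScale) {
        int best = 0;
        for (int s = 1; s <= maxScale; s++) {
            if (netValue(s, cost) > netValue(best, cost)) {
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int oldBest = bestScale(NetValueSketch::exponentialCost, 2000);
        int newBest = bestScale(NetValueSketch::linearCost, 2000);
        System.out.printf("exponential cost: best scale %d, net value %.3f%n",
                oldBest, netValue(oldBest, NetValueSketch::exponentialCost));
        System.out.printf("linear cost:      best scale %d, net value %.3f%n",
                newBest, netValue(newBest, NetValueSketch::linearCost));
    }
}
```

With these assumed curves, the linear-cost optimum sits at a larger scale than the exponential-cost optimum, while at very small scale the fixed overhead makes the linear-cost option the worse of the two.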

[Figure: Value (0 to 1) plotted against Scale (0 to 2,000)]

Most data isn’t worth much in isolation

First data is valuable

Later data is dregs

[Figure: Value (0 to 1) plotted against Scale (0 to 2,000)]

Suddenly worth processing

First data is valuable

Later data is dregs

But has high aggregate value

[Figure: Value (0 to 1) plotted against Scale (0 to 2,000)]

If we can handle the scale

It’s really big

So what makes that possible?

[Figure: two plots of Value (0 to 1) against Scale (0 to 2,000)]

Net value optimum has a sharp peak well before maximum effort

[Figure: Value (0 to 1) plotted against Scale (0 to 2,000)]

But scaling laws are changing both slope and shape

[Figure: Value (0 to 1) plotted against Scale (0 to 2,000)]

More than just a little

[Figure: Value (0 to 1) plotted against Scale (0 to 2,000)]

They are changing a LOT!

[Figure: Value (0 to 1) plotted against Scale (0 to 2,000)]

Initially, linear cost scaling actually makes things worse

Then a tipping point is reached and things change radically …

MapR Overview

How do you persist data?

All major persistence abstractions are one of these:

• Files

• Streams

• Tables

[Diagram: source data, stream processing & storage, and final output storage, each on its own cluster: Kafka, Spark, Cassandra / Mongo, HDFS]

“Classic” streaming involves single-purpose clusters.

[Diagram: source data, stream processing & storage, and final output storage on one cluster: MapR Streams, Spark, MapR-DB, MapR-FS]

MapR converges the data layer into a single cluster.

What is MapR?

A Converged Data Platform

MapR Converged Data Platform

[Diagram: open source engines & tools, commercial engines & applications, custom apps, and cloud and managed services sit on top of the platform. Data processing layer: web-scale storage (MapR-FS), database (MapR-DB), event streaming (MapR Streams), search and others. APIs: HDFS API; POSIX, NFS; HBase API; JSON API; Kafka API. Enterprise-grade platform services: high availability, real time, unified security, multi-tenancy, disaster recovery, global namespace. Unified management and monitoring spans the stack.]

“Convergence” means…

• One cluster that does it all: Files + Tables + Streams

• Standard APIs for everything

• A distributed file system that looks “normal” (POSIX)

• Unified Management

• Global Namespace

• Mirroring, Replication, and Snapshots
  – Synchronize files, tables, and streams across datacenters
  – True failover for your applications

How do I use MapR?

• Installs on Linux (e.g. Ubuntu, Red Hat), typically to a block device, and typically to a cluster of 3 or more nodes.

• Packaged as a scriptable / web-based installer, cloud marketplace offers, Docker containers

• Sandbox VMs for your laptops.

MapR In Action

Apply MapR as a data layer for containers.

[Diagram labels: Producer; Servlet Engine; HTTP Log; Browser]

Procedure

1. Download Sandbox
  – Configure for Host-only Adapter

2. Download GitHub repo

3. Compile code

4. Build Docker images

5. Create the MapR Streams topics

6. Run the Docker containers

Docker / MapR demo commands

1. git clone https://github.com/mapr-demos/mapr-pacc-sample

2. maprcli stream create -path /apps/sensors -produceperm p -consumeperm p -topicperm p

3. maprcli stream topic create -path /apps/sensors -topic computer

4. /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-consumer.sh --new-consumer --bootstrap-server this.will.be.ignored:9092 --topic /apps/sensors:computer

5. docker run -it -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr --name producer -i -t mapr-sensor-producer

6. docker run -it --privileged --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr -e MAPR_MOUNT_PATH=/mapr -p 8080:8080 --device /dev/fuse --name web -i -t mapr-web-consumer

7. Open http://localhost:8080

8. Open http://192.168.99.3:8443

References

• MapR Sandbox
  http://maprdocs.mapr.com/home/SandboxHadoop/t_install_sandbox_vbox.html

• MapR sample application
  https://mapr.com/blog/getting-started-mapr-client-container/

• MapR Tutorials
  https://mapr.com/developercentral/code/

Apache Spark

https://databricks.com/spark/about

Resilient Distributed Datasets (RDDs)

• RDDs let programmers perform in-memory computations on large distributed datasets in a fault-tolerant manner

• An RDD is a representation of data that may or may not be on your local machine. It’s partitioned across the cluster (like a distributed Java Collection).

• An RDD is immutable
  – JavaRDD<String> lines = sc.textFile("/path/to/data.log")

• When you read data, nothing gets loaded. You’re not even opening it. We first declare the operations that we’re going to perform, then in the end the data is loaded and operated upon when we perform an action that materializes the data.
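The declare-first, run-on-action behavior described above can be observed without a cluster. The sketch below uses the JDK's java.util.stream as an analogy, not the Spark API: a stream pipeline, like a chain of RDD transformations, does no work until a terminal operation asks for a result. The class name and the sample log lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyPipelineDemo {
    /** Returns {predicate calls before the terminal operation, calls after it, number of matches}. */
    static long[] run(List<String> lines) {
        AtomicInteger predicateCalls = new AtomicInteger();

        // Declaring the filter does not touch any element yet (compare: an RDD transformation).
        Stream<String> errors = lines.stream().filter(line -> {
            predicateCalls.incrementAndGet();
            return line.contains("ERROR");
        });
        int callsBeforeAction = predicateCalls.get();

        // The terminal operation runs the pipeline (compare: an RDD action such as collect()).
        List<String> matches = errors.collect(Collectors.toList());
        int callsAfterAction = predicateCalls.get();

        return new long[] {callsBeforeAction, callsAfterAction, matches.size()};
    }

    public static void main(String[] args) {
        long[] result = run(Arrays.asList("INFO start", "ERROR disk full", "INFO done"));
        System.out.println("predicate calls before action: " + result[0]);
        System.out.println("predicate calls after action:  " + result[1]);
        System.out.println("matching lines:                " + result[2]);
    }
}
```

The analogy stops at laziness: a JDK stream is single-use and lives in one JVM, whereas an RDD is partitioned across the cluster and can be recomputed from its lineage.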

Resilient Distributed Datasets (RDDs)

1. Start by reading from files, a DB, etc. to create a top-level RDD

2. Lazy transformations: .filter(), .map(), .sample(), and shuffling operations such as .repartition()

3. Actions (retrieval of the data) trigger the computation to finally run: .count(), .collect(), .saveToCassandra() (Spark Cassandra Connector). Note that .collect() pulls all the data into the driver JVM.

4. Once you have an RDD you’d like to keep working on, you can call .cache() on it to keep it around, so you don’t have to derive it again. By default, cache() stores the RDD in memory (storage level MEMORY_ONLY); use .persist() with another storage level to spill to disk.

Resilient Distributed Datasets (RDDs)

• The RDD is the building block of Spark.
  – DataFrame, Dataset, DStream, etc. are all abstractions built on RDDs

• Immutable

• Operated on by lambda functions

• Lazily evaluated

• Kick off parallel execution with actions like collect(), count(), etc.

What is Spark Streaming?

• Enables scalable, high-throughput, fault-tolerant stream processing of live data

• Run continuous SQL queries on data pushed into Kafka

[Diagram labels: Data Sources; Data Sinks]

[Diagram labels: tail -f; MapR Streams store and expose stream data for processing; output action]

Spark Streaming Architecture

• Processed results are pushed out in batches

[Diagram: Spark Streaming divides the input data stream by batch interval (data from time 0 to 1, 1 to 2, 2 to 3) into RDD @ time 1, RDD @ time 2, and RDD @ time 3; Spark turns them into batches of processed results]
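The batch-interval idea can be sketched in plain Java, with no Spark dependency: events are assigned to a batch by dividing their timestamp by the batch interval, and each batch is then handed off as one unit. The class name, the millisecond timestamps, and the 1-second interval are assumptions for illustration; this is not how Spark Streaming is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatchSketch {
    /**
     * Groups non-negative event timestamps (ms) into batches.
     * Batch k holds the events with k * interval <= timestamp < (k + 1) * interval.
     */
    static Map<Long, List<Long>> toBatches(List<Long> timestampsMillis, long batchIntervalMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long timestamp : timestampsMillis) {
            long batchIndex = timestamp / batchIntervalMillis;
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(timestamp);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(200L, 900L, 1500L, 2100L, 2900L);
        // Each map entry plays the role of one "RDD @ time k+1" in the diagram.
        toBatches(events, 1000L).forEach((batch, items) ->
                System.out.println("batch " + batch + " -> " + items));
    }
}
```

A shorter interval lowers latency but produces more, smaller batches to schedule; a longer one does the opposite.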

Spark In Action

• Spark Shell

• Spark SQL in Zeppelin

• Spark SQL Databricks Notebook

• Spark Streaming Java API

• Debugging Spark with IntelliJ

Databricks Cloud

• Spark notebook in the cloud
  – https://community.cloud.databricks.com/

• Sample notebooks:
  – https://databricks.com/resources/type/example-notebooks

Spark Shell (aka REPL)

• If you install Spark locally, you get this.

• Evaluates each command as soon as you type it in, and shows you the output.

• Fantastic way to experiment, with tab completion.

Apache Zeppelin

Debugging Spark with IntelliJ

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000

Monitoring

http://[hostname]:4040/jobs/

Spark Streaming + ML on MapR

• Predict the location and time of taxi requests.
  – https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1
  – https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2

[Diagram labels: streaming topic (location and time of taxi requests); Kmeans Clustering (Spark ML on Uber dataset); Classification Models (Spark ML); predicted and actual pickup locations and times; Ridership analytics (Zeppelin)]

Streaming + ML demo procedure

1. Create topics:

maprcli stream create -path /user/mapr/stream -produceperm p -consumeperm p -topicperm p

maprcli stream topic create -path /user/mapr/stream -topic ubers -partitions 3

maprcli stream topic create -path /user/mapr/stream -topic uberp -partitions 3

2. Create and save the kmeans model to /mapr/my.cluster.com/user/mapr/data/savemodel:

/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkml.uber.ClusterUber --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar

3. Send test dataset to a stream (just to illustrate using a stream):

java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgProducer /user/mapr/stream:ubers /mapr/my.cluster.com/user/mapr/data/uber.csv

4. Monitor the test dataset (optional, on nodeb):

java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgConsumer /user/mapr/stream:ubers

5. Use the model to predict the cluster for incoming taxi telemetry, and output predictions to a topic:

/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerProducer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/data/savemodel /user/mapr/stream:ubers /user/mapr/stream:uberp

6. Read the predictions topic and put it into a format that we can analyze ad hoc in SQL:

/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/stream:uberp

7. Open http://nodea:4040

Real-Time Stock Market Analysis

https://mapr.com/appblueprint

https://github.com/mapr-demos/finserv-application-blueprint

Advanced Concept: Look Back for n Seconds on a Topic

[Diagram: a timeline showing a Data Topic and an Offset Topic]

Time                 t₀    t₁    t₂    t₃    t₄    t₅
Data Topic offset    3253  3347  3467  3608  3798  3913

Offset Topic: Key = Time t, Value = Offset of Data Topic at t
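One way to picture the look-back is with an in-memory stand-in for the offset topic: a sorted map from time to data-topic offset, queried for the latest entry at or before "now minus n seconds". The sketch below assumes t₀…t₅ are one second apart (times 0 to 5), which the slide does not state; the class and method names are hypothetical, and a real implementation would read these pairs from the offset topic and then seek the data-topic consumer to the returned offset.

```java
import java.util.Map;
import java.util.TreeMap;

final class OffsetIndexSketch {
    // Stand-in for the offset topic: key = time in seconds, value = data-topic offset at that time.
    private final TreeMap<Long, Long> offsetByTime = new TreeMap<>();

    void record(long timeSeconds, long dataTopicOffset) {
        offsetByTime.put(timeSeconds, dataTopicOffset);
    }

    /**
     * Returns the data-topic offset to start reading from in order to cover the last
     * nSeconds, or -1 if the index does not reach back that far.
     */
    long offsetForLookBack(long nowSeconds, long nSeconds) {
        // floorEntry gives the latest recorded time at or before the start of the window,
        // so the replay covers at least the requested n seconds.
        Map.Entry<Long, Long> entry = offsetByTime.floorEntry(nowSeconds - nSeconds);
        return entry == null ? -1L : entry.getValue();
    }

    public static void main(String[] args) {
        OffsetIndexSketch index = new OffsetIndexSketch();
        long[] offsets = {3253, 3347, 3467, 3608, 3798, 3913}; // values from the slide
        for (int t = 0; t < offsets.length; t++) {
            index.record(t, offsets[t]); // assumed: t0..t5 are seconds 0..5
        }
        System.out.println("look back 2s from t=5: start at offset " + index.offsetForLookBack(5, 2));
    }
}
```

Choosing the floor entry rather than the ceiling errs toward replaying slightly more than n seconds, which is usually the safer side for a look-back window.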

MapR Streams vs Kafka

Call To Action

• You can foster innovation just by making data available.

• Seeking career advancement?
  – Coursera classes on data science, ML, Spark, etc.
  – Be a polyglot.
  – Enable data science from development to production.

• You can apply those skills in ANY industry.

• Don’t be afraid of not knowing much.

• 87% of career builders attribute career benefit to completing online courses (Harvard Business Review, Coursera)
  – Be better equipped for current job, find a new job, change career.

All Industries
• ETL / DW optimization
• Mainframe optimization
• Real-time application & network monitoring
• Security information & event management
• Recommendation engines & targeting

Web 2.0
• Customer 360
• Click-stream analysis
• Social media analysis
• Ad optimization

Healthcare
• Patient system of record
• Smart hospitals
• Biometrics
• Patient vital monitoring
• Fraud detection

Telecom
• Crowd-based antenna optimization
• Charging & billing
• Equipment monitoring & preventative maintenance
• Smart meter analysis

Oil & Gas
• Pump monitoring & alerting
• Seismic trace identification
• Equipment maintenance
• Safety & environment
• Security

Financial Services
• Real-time fraud/risk monitoring
• Mobile notifications of transactions

Retail
• Real-time supply chain optimization
• Customer location optimization
• Real-time coupons

Ad Tech
• Ad targeting & optimization
• Global campaign dashboards

Have an interesting use case? Let’s talk!