
© 2017 MapR Technologies

Spark and MapR Streams: A Motivating Example


Abstract

• Businesses are discovering the untapped potential of large datasets and data streams through the use of technologies for big data processing and storage. By leveraging these assets they’re creating a new generation of applications that derive value from data they used to throw away.

• In this presentation we’ll discuss how to build operational environments for these types of applications with the MapR Converged Data Platform and we’ll walk through an example of a next-generation application that uses Java APIs for MapR Streams, Apache Spark, Apache Hive, and MapR-DB.

• We’ll see how these technologies can be used to join and transform unbounded datasets to find signals and derive new data streams for a financial scenario involving real-time algorithmic trading and historical analysis using SQL.

• We’ll also discuss how MapR enables you to run real-time data applications with the speed, reliability, and security you need for a production environment.

• Keywords: MapR, Spark, Kafka, NoSQL, JSON, Zeppelin, Hive, streaming


Contact Info

Ian Downard, Technical Evangelist at MapR

Personal Blog: http://bigendiandata.com

Twitter: @iandownard


Learning Goals

1. Appreciate the opportunity of the time we’re in.

2. Become familiar with MapR

3. Become familiar with Spark

4. Feel empowered.


Why Now?

• But Moore’s law has applied for a long time

• Why is data exploding now?

• Why not 10 years ago?

• Why not 20?


Because data wasn’t available?

• If it were just availability of data, then existing big companies would have adopted big data technology first

They didn’t


Because processing it was too expensive?

• If it were just a net positive value, then finance companies should have adopted first because they have a higher opportunity value per byte

They didn’t


Backwards adoption

• Under almost any argument, startups would not have adopted big data technology first

They did


Everywhere at Once?

• Something very strange is happening
– Big data is being applied at many different scales
– By large companies and small

Why?


Data Analytics Scaling Laws

• Analytics scaling is all about:
– Big gains for little initial effort
– Rapidly diminishing returns

• The key to net value is how costs scale
– Old school: exponential scaling
– Big data: linear scaling, low constant

• Cost/performance has radically changed
– Cluster computing, commodity hardware, data science frameworks…
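The scaling argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the talk: the value curve, both cost curves, and every constant in them are hypothetical choices made only to show how net value (value minus cost) behaves when cost grows exponentially versus linearly over the same 0 to 2,000 scale range.

```java
import java.util.function.DoubleUnaryOperator;

class ScalingLawsSketch {
    // Hypothetical value curve: big early gains, rapidly diminishing returns.
    static double value(double scale) {
        return 1.0 - Math.exp(-scale / 300.0);
    }

    // Hypothetical "old school" cost: grows exponentially with scale.
    static double exponentialCost(double scale) {
        return 0.01 * (Math.exp(scale / 200.0) - 1.0);
    }

    // Hypothetical "big data" cost: grows linearly with a small constant.
    static double linearCost(double scale) {
        return 0.0002 * scale;
    }

    // Grid search over scale 0..2000 for the scale with the highest net value.
    static int bestScale(DoubleUnaryOperator cost) {
        int best = 0;
        double bestNet = Double.NEGATIVE_INFINITY;
        for (int s = 0; s <= 2000; s++) {
            double net = value(s) - cost.applyAsDouble(s);
            if (net > bestNet) {
                bestNet = net;
                best = s;
            }
        }
        return best;
    }

    // First scale at which the linear cost becomes cheaper than the
    // exponential cost, or -1 if that never happens in the range.
    static int tippingPoint() {
        for (int s = 1; s <= 2000; s++) {
            if (linearCost(s) < exponentialCost(s)) {
                return s;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("best scale, exponential cost: " + bestScale(ScalingLawsSketch::exponentialCost));
        System.out.println("best scale, linear cost:      " + bestScale(ScalingLawsSketch::linearCost));
        System.out.println("tipping point:                " + tippingPoint());
    }
}
```

With these assumed curves, the linear cost is actually the more expensive one at small scale, and only past the tipping point does it win, which moves the net-value optimum to a much larger scale.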


[Figure: value versus scale, with scale running from 0 to 2,000 and value from 0 to 1]

Most data isn’t worth much in isolation

First data is valuable

Later data is dregs


[Figure: value versus scale]

Suddenly worth processing

First data is valuable

Later data is dregs

But has high aggregate value


[Figure: value versus scale]

If we can handle the scale

It’s really big


So what makes that possible?


[Figure: value versus scale]

Net value optimum has a sharp peak well before maximum effort


[Figure: value versus scale]

But scaling laws are changing both slope and shape


[Figure: value versus scale]

More than just a little


[Figure: value versus scale]

They are changing a LOT!


[Figure: value versus scale]

Initially, linear cost scaling actually makes things worse

Then a tipping point is reached and things change radically …


MapR Overview


How do you persist data?


All major persistence abstractions are one of these:

• Files
• Streams
• Tables


[Diagram labels: Source Data; Stream Processing & Storage; Final Output; Storage. Components shown: HDFS, Kafka, Spark, Cassandra / Mongo]

“Classic” streaming involves single-purpose clusters.


[Diagram labels: Source Data; Stream Processing & Storage; Final Output; Storage. Components shown: MapR-FS, MapR Streams, Spark, MapR-DB]

MapR converges the data layer into a single cluster.


What is MapR?

A Converged Data Platform


[Diagram: MapR Converged Data Platform]

• Open Source Engines & Tools; Commercial Engines & Applications; Custom Apps; Cloud and Managed Services; Search and Others
• Data Processing
• APIs: HDFS API; POSIX, NFS; HBase API; JSON API; Kafka API
• Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
• Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
• Unified Management and Monitoring


“Convergence” means…

• One cluster that does it all: Files + Tables + Streams
• Standard APIs for everything
• A distributed file system that looks “normal” (POSIX)
• Unified Management
• Global Namespace
• Mirroring, Replication, and Snapshots
– Synchronize files, tables, and streams across datacenters
– True failover for your applications


How do I use MapR?

• Installs on Linux (e.g. Ubuntu, Red Hat), typically to a block device, and typically to a cluster of 3 or more nodes.

• Packaged as a scriptable / web-based installer, cloud marketplace offers, and Docker containers.

• Sandbox VMs for your laptop.


MapR In Action


Apply MapR as a data layer for containers.

[Diagram labels: Producer, Servlet Engine, HTTP Log, Browser]


Procedure

1. Download Sandbox
– Configure for Host-only Adapter
2. Download github repo
3. Compile code
4. Build Docker images
5. Create the MapR Stream topics
6. Run the Docker containers


Docker / MapR demo commands

1. git clone https://github.com/mapr-demos/mapr-pacc-sample

2. maprcli stream create -path /apps/sensors -produceperm p -consumeperm p -topicperm p

3. maprcli stream topic create -path /apps/sensors -topic computer

4. /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-consumer.sh --new-consumer --bootstrap-server this.will.be.ignored:9092 --topic /apps/sensors:computer

5. docker run -it -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr --name producer -i -t mapr-sensor-producer

6. docker run -it --privileged --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr -e MAPR_MOUNT_PATH=/mapr -p 8080:8080 --device /dev/fuse --name web -i -t mapr-web-consumer

7. Open http://localhost:8080

8. Open http://192.168.99.3:8443


References

• MapR Sandbox: http://maprdocs.mapr.com/home/SandboxHadoop/t_install_sandbox_vbox.html

• MapR sample application: https://mapr.com/blog/getting-started-mapr-client-container/

• MapR Tutorials: https://mapr.com/developercentral/code/


Apache Spark


https://databricks.com/spark/about


Resilient Distributed Datasets (RDDs)

• RDDs let programmers perform in-memory computations on large distributed datasets in a fault-tolerant manner.

• An RDD is a representation of data that may or may not be on your local machine. It’s partitioned across the cluster (like a distributed Java Collection).

• An RDD is immutable.
– JavaRDD<String> lines = sc.textFile("/path/to/data.log");

• When you read data, nothing gets loaded. You’re not even opening the file. You first declare the operations you’re going to perform; the data is loaded and operated upon only when you perform an action that materializes it.
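The declare-first, run-later behavior described above can be tried without a cluster. The sketch below is an analogy only: it uses the JDK's java.util.stream API, not Spark, and the class name, method name, and sample log lines are all made up for illustration. Like an RDD transformation, an intermediate stream operation only records what to do; like an RDD action, the terminal operation is what finally reads the data.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyPipelineSketch {
    // Counts how many input lines the pipeline has actually read.
    static final AtomicInteger linesInspected = new AtomicInteger();

    // Declares the work (comparable to a transformation); nothing runs yet.
    static Stream<String> errorLines(List<String> lines) {
        return lines.stream()
                .peek(line -> linesInspected.incrementAndGet())
                .filter(line -> line.contains("ERROR"));
    }

    public static void main(String[] args) {
        List<String> log = List.of("INFO start", "ERROR disk full", "INFO retry", "ERROR timeout");

        Stream<String> pipeline = errorLines(log);
        System.out.println("inspected after declaring:  " + linesInspected.get());

        // The terminal operation plays the role of an action: it forces evaluation.
        List<String> errors = pipeline.collect(Collectors.toList());
        System.out.println("inspected after collecting: " + linesInspected.get());
        System.out.println("errors: " + errors);
    }
}
```

The counter stays at zero until collect() runs, which is the same reason sc.textFile(...) returns immediately even for a very large file. Unlike an RDD, a Java stream is single-use and runs in one JVM, so the analogy covers laziness only, not partitioning or fault tolerance.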


Resilient Distributed Datasets (RDDs)

1. Start by reading from files, a database, etc. to create a top-level RDD.

2. Apply lazy transformations: .filter(), .map(), .sample(), and operations that shuffle data across partitions.

3. Actions (retrieval of the data) trigger execution. Examples: .count(), .collect(), and .saveToCassandra() (from the Spark Cassandra connector). .collect() pulls all the data into the driver JVM.

4. Once you have an RDD you’d like to keep working on, call .cache() on it so it doesn’t have to be derived again. By default, cache() keeps the RDD in memory; use .persist() with a different storage level to save it to disk.


Resilient Distributed Datasets (RDDs)

• The RDD is the building block of Spark.
– DataFrame, Dataset, DStream, etc. are all abstractions built on RDDs.

• Immutable
• Operated on by lambda functions
• Lazily evaluated
• Kick off parallel execution with actions like collect(), count(), etc.


What is Spark Streaming?

• Enables scalable, high-throughput, fault-tolerant stream processing of live data

• Run continuous SQL queries on data pushed into Kafka

[Diagram labels: Data Sources, Data Sinks]


tail -f

MapR Streams store and expose stream data for processing

Output action


Spark Streaming Architecture

• Processed results are pushed out in batches

[Diagram: input data stream → Spark Streaming → Spark → batches of processed results. The input stream is cut at each batch interval into RDD @ time 1 (data from time 0 to 1), RDD @ time 2, RDD @ time 3.]
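The batching idea in this architecture can be sketched in a few lines of plain Java. This is an illustrative model only, not Spark's implementation or API: the class, the method, and the sample events are hypothetical, and an in-memory map of lists stands in for the sequence of per-interval RDDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MicroBatchSketch {
    // Groups events into consecutive batches by arrival time.
    // Batch k holds events with k * interval <= timestamp < (k + 1) * interval.
    static Map<Long, List<String>> toBatches(long[] timestampsMs, String[] payloads, long batchIntervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long batchIndex = timestampsMs[i] / batchIntervalMs;
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(payloads[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] times = {0, 400, 1000, 1999, 2500};
        String[] events = {"a", "b", "c", "d", "e"};

        // With a 1-second batch interval, each map entry plays the role of one RDD.
        Map<Long, List<String>> batches = toBatches(times, events, 1000);
        batches.forEach((index, batch) ->
                System.out.println("batch " + index + " -> " + batch));
    }
}
```

A shorter batch interval lowers latency but produces more, smaller batches; choosing the interval is a tuning decision for any micro-batch system.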


Spark In Action

• Spark Shell
• Spark SQL in Zeppelin
• Spark SQL Databricks Notebook
• Spark Streaming Java API
• Debugging Spark with IntelliJ


Databricks Cloud

• Spark notebook in the cloud
– https://community.cloud.databricks.com/

• Sample notebooks:
– https://databricks.com/resources/type/example-notebooks


Spark Shell (aka REPL)

• If you install Spark locally, you get this.

• Evaluates each command as soon as you enter it and shows you the output.

• Fantastic way to experiment, with tab completion.


Apache Zeppelin


Debugging Spark with IntelliJ

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000


Monitoring

http://[hostname]:4040/jobs/


Spark Streaming + ML on MapR

• Predict the location and time of taxi requests.
– https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1
– https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2

[Diagram labels: streaming topic (location and time of taxi requests); K-means clustering (Spark ML on Uber dataset); Classification models (Spark ML); Predicted and actual pickup locations and times; Ridership analytics (Zeppelin)]


Streaming + ML demo procedure

1. Create topics:

maprcli stream create -path /user/mapr/stream -produceperm p -consumeperm p -topicperm p

maprcli stream topic create -path /user/mapr/stream -topic ubers -partitions 3

maprcli stream topic create -path /user/mapr/stream -topic uberp -partitions 3

2. Create and save the k-means model to /mapr/my.cluster.com/user/mapr/data/savemodel:

/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkml.uber.ClusterUber --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar

3. Send the test dataset to a stream (just to illustrate using a stream):

java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgProducer /user/mapr/stream:ubers /mapr/my.cluster.com/user/mapr/data/uber.csv

4. Monitor the test dataset (optional, on nodeb):

java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgConsumer /user/mapr/stream:ubers

5. Use the model to predict the cluster for incoming taxi telemetry and output the predictions to a topic:

/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerProducer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/data/savemodel /user/mapr/stream:ubers /user/mapr/stream:uberp

6. Read the predictions topic and put it into a format that we can analyze ad hoc in SQL:

/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/stream:uberp

7. Open http://nodea:4040


Real-Time Stock Market Analysis

https://mapr.com/appblueprint

https://github.com/mapr-demos/finserv-application-blueprint


Advanced Concept: Look Back for n Seconds on a Topic

[Diagram: a Data Topic and an Offset Topic drawn against a shared time axis]

Time:    t₀    t₁    t₂    t₃    t₄    t₅
Offset:  3253  3347  3467  3608  3798  3913

Offset Topic: Key = Time t, Value = Offset of Data Topic at t
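One way to picture the lookup this enables is with a sorted map. The sketch below is a hypothetical in-memory model, not the blueprint's code: the class and method names are invented, a TreeMap stands in for the offset topic, and the times t₀ to t₅ are assumed to be 0 to 5 seconds. The offsets are the ones shown on the slide.

```java
import java.util.Map;
import java.util.TreeMap;

class OffsetLookbackSketch {
    // In-memory stand-in for the offset topic:
    // key = time in seconds, value = data-topic offset at that time.
    private final TreeMap<Long, Long> offsetByTime = new TreeMap<>();

    void record(long timeSeconds, long dataTopicOffset) {
        offsetByTime.put(timeSeconds, dataTopicOffset);
    }

    // Returns the data-topic offset to start reading from in order to cover
    // at least the last nSeconds, or -1 if no offsets have been recorded.
    long lookBack(long nowSeconds, long nSeconds) {
        if (offsetByTime.isEmpty()) {
            return -1L;
        }
        // Latest recorded time at or before the target, so nothing is missed.
        Map.Entry<Long, Long> entry = offsetByTime.floorEntry(nowSeconds - nSeconds);
        if (entry == null) {
            // Target is older than anything recorded: start from the oldest entry.
            entry = offsetByTime.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        OffsetLookbackSketch index = new OffsetLookbackSketch();
        long[] offsets = {3253, 3347, 3467, 3608, 3798, 3913};
        for (int t = 0; t < offsets.length; t++) {
            index.record(t, offsets[t]);
        }
        System.out.println("offset for a 2-second look back at t=5: " + index.lookBack(5, 2));
    }
}
```

In a real consumer the returned offset would then be handed to the client's seek call for the data topic's partition. Rounding down to the earlier entry is a deliberate choice here: it may replay slightly more than n seconds rather than slightly less.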


MapR Streams vs Kafka


Call To Action

• You can foster innovation just by making data available.

• Seeking career advancement?
– Coursera classes on data science, ML, Spark, etc.
– Be a polyglot.
– Enable data science from development to production.

• You can apply those skills in ANY industry.

• Don’t be afraid of not knowing much.

• 87% of career builders attribute career benefit to completing online courses (Harvard Business Review, Coursera)
– Be better equipped for your current job, find a new job, or change career.


All Industries
• ETL / DW optimization
• Mainframe optimization
• Real-time application & network monitoring
• Security information & event management

Web 2.0
• Recommendation engines & targeting
• Customer 360
• Click-stream analysis
• Social media analysis
• Ad optimization

Healthcare
• Patient system of record
• Smart hospitals
• Biometrics
• Patient vital monitoring
• Fraud detection

Telecom
• Crowd-based antenna optimization
• Charging & billing
• Equipment monitoring & preventative maintenance
• Smart meter analysis

Oil & Gas
• Pump monitoring & alerting
• Seismic trace identification
• Equipment maintenance
• Safety & environment
• Security

Financial Services
• Real-time fraud/risk monitoring
• Mobile notifications of transactions

Retail
• Real-time supply chain optimization
• Customer location optimization
• Real-time coupons

Ad Tech
• Ad targeting & optimization
• Global campaign dashboards

Have an interesting use case? Let’s talk!
