Apache Spark Streaming and HBase
TRANSCRIPT
Overview of Apache Spark Streaming
Carol McDonald
Agenda
• Why Apache Spark Streaming?
• What is Apache Spark Streaming?
  – Key Concepts and Architecture
• How it works, by example
Why Spark Streaming?
• Process time series data, with results in near real time
• Use cases:
  – Social network trends
  – Website statistics and monitoring
  – Fraud detection
  – Advertising click monetization
• Data: sensors, system metrics, events, log files, stock tickers, user activity
• High volume and velocity
[Diagram: time-stamped data is continuously put into a table, providing data for real-time monitoring]
What is time series data?
• Anything with a timestamp:
  – Sensor data
  – Log files
  – Phones
  – Credit card transactions
  – Web user behaviour
  – Social media
  – Geodata
Why Spark Streaming?

What if you want to analyze data as it arrives?
For example, time series data: sensors, clicks, logs, stats.
Batch Processing
It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees
It was hot at 6:05 yesterday!
Batch processing may be too late for some events
Event Processing
It's 6:05 and 90 degrees
Someone should open a window!
Streaming
It's becoming important to process events as they arrive.
What is Spark Streaming?
• An extension of the core Spark API
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
[Diagram: live data flows from data sources through Spark Streaming to data sinks]
Stream Processing Architecture
[Diagram: streaming sources/apps feed data ingest (topics, MapR-FS) into stream processing with Spark Streaming, which writes to data storage (MapR-DB, MapR-FS) that serves apps]
Key Concepts
• Data Sources:
  – File based: HDFS (MapR-FS)
  – Network based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor (topics)
• Transformations
• Output Operations
Spark Streaming Architecture
• Divide the data stream into batches of X seconds
  – Called a DStream = a sequence of RDDs
[Diagram: Spark Streaming splits the input data stream by batch interval; data from time 0 to 1 becomes the RDD @ time 1, data from time 1 to 2 becomes the RDD @ time 2, and so on]
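To see the batching, here is a minimal sketch (assuming a linesDStream like the one created on a later slide); the Scala API's foreachRDD can pass the batch time along with each RDD:

// each batch interval yields one RDD, tagged with its batch time
linesDStream.foreachRDD { (rdd, time) =>
  println(s"batch at $time contains ${rdd.count()} lines")
}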
Resilient Distributed Datasets (RDD)

Spark revolves around RDDs:
• Read-only collection of elements
• Operated on in parallel
• Cached in memory
  – Or on disk
• Fault tolerant
Working With RDDs
[Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]

linesRDD = sc.textFile("SomeFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # Error line
Process DStream
• Process using transformations – each transformation creates new RDDs
[Diagram: a transform such as map, reduce, or count is applied to every RDD in the DStream; for each batch interval (time 0-1, time 1-2, …) it produces a corresponding RDD in a new DStream]
Key Concepts
• Data Sources
• Transformations: create a new DStream
  – Standard RDD operations: map, filter, union, reduce, join, …
  – Stateful operations: updateStateByKey(function), countByValueAndWindow, … (see the sketch below)
• Output Operations
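As a hedged illustration of a stateful operation, here is a minimal sketch of a running word count with updateStateByKey; the socket source, port, and checkpoint directory are assumptions, not from the slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val sparkConf = new SparkConf().setAppName("StatefulWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("/tmp/checkpoint")   // stateful operations require checkpointing

// keep a running count of each word across all batches seen so far
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)))
counts.print()

ssc.start()
ssc.awaitTermination()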
Spark Streaming Architecture
• Processed results are pushed out in batches
[Diagram: the input data stream becomes DStream RDD batches; Spark Streaming hands each RDD to Spark, which pushes out batches of processed results]
Key Concepts
• Data Sources
• Transformations
• Output Operations: trigger computation (see the sketch below)
  – saveAsHadoopFiles – save to HDFS (MapR-FS)
  – saveAsHadoopDataset – save to HBase (MapR-DB)
  – saveAsTextFiles
  – foreachRDD – do anything with each batch of RDDs
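A minimal sketch of output operations triggering computation (sensorDStream and the output path are assumptions, anticipating the example that follows):

// print the first elements of each batch to the driver log
sensorDStream.print()
// write each batch to a new directory named with the given prefix and batch time
sensorDStream.saveAsTextFiles("/mapr/out/sensor")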
Learning Goals
• How it works by example
Use Case: Time Series Data
Data for real-time monitoring
[Diagram: oil pump sensor data is read by Spark Streaming for Spark processing]
Convert Line of CSV data to Sensor Object
case class Sensor(resid: String, date: String, time: String, hz: Double,
                  disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
         p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
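For example, a single CSV line parses into a Sensor as follows (the sample values are illustrative, matching the row-key format on the schema slide):

val line = "COHUTTA,3/10/14,1:01,10.37,1.79,11.24,0.04,84.0,1.02"
val sensor = parseSensor(line)
// sensor.resid == "COHUTTA", sensor.psi == 84.0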
Schema
• All events are stored; the data CF could be set to expire old data
• Filtered alerts are put in the alerts CF
• Daily summaries are put in the stats CF

Row key               | CF data: hz … psi | CF alerts: psi … | CF stats: hz_avg … psi_min
COHUTTA_3/10/14_1:01  | 10.37 … 84        | 0                |
COHUTTA_3/10/14       |                   |                  | 10 … 0
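A hedged sketch of creating this table with its three column families; the table name and TTL value are assumptions, using the HBase 0.98-era admin API:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin

val hbaseConf = HBaseConfiguration.create()
val admin = new HBaseAdmin(hbaseConf)
val desc = new HTableDescriptor(TableName.valueOf("sensor"))   // hypothetical table name
val dataCF = new HColumnDescriptor("data")
dataCF.setTimeToLive(60 * 60 * 24 * 30)   // expire raw events after ~30 days (assumption)
desc.addFamily(dataCF)
desc.addFamily(new HColumnDescriptor("alerts"))
desc.addFamily(new HColumnDescriptor("stats"))
admin.createTable(desc)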
Basic Steps for Spark Streaming code
These are the basic steps for Spark Streaming code:
1. Create a DStream
   1. Apply transformations
   2. Apply output operations
2. Start receiving data and processing it, using streamingContext.start()
3. Wait for the processing to be stopped, using streamingContext.awaitTermination()
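Put together, a minimal skeleton of these steps (the path, batch interval, and parseSensor follow the slides that come next):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("SensorStream"), Seconds(2))

// 1. create a DStream, apply transformations and output operations
val sensorDStream = ssc.textFileStream("/mapr/stream").map(parseSensor)
sensorDStream.print()

// 2. start receiving data and processing it
ssc.start()
// 3. wait for the processing to be stopped
ssc.awaitTermination()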
Create a DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")

[Diagram: linesDStream is a sequence of RDDs representing a stream of data; each batch interval (time 0-1, time 1-2, …) is stored in memory as an RDD]
Process DStream
val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)

[Diagram: map creates new RDDs for every batch; for each batch interval (time 0-1, time 1-2, …), the RDD in linesDStream is mapped to a new RDD in sensorDStream]
Process DStream
// for each RDD
sensorDStream.foreachRDD { rdd =>
  // filter sensor data for low psi
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
  . . .
}
DataFrame and SQL Operations
// for each RDD: parse into a sensor object, filter
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRDD.toDF().registerTempTable("alert")
  // join alert data with pump maintenance info
  val res = sqlContext.sql(
    "select s.resid, s.psi, p.pumpType from alert s " +
    "join pump p on s.resid = p.resid " +
    "join maint m on p.resid = m.resid")
  . . .
}
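The query assumes pump and maint tables were registered earlier; a hedged sketch of how the pump lookup table might be registered (the Pump case class, columns, and file path are illustrative assumptions):

import org.apache.spark.sql.SQLContext

case class Pump(resid: String, pumpType: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// register a pump lookup table parsed from a CSV file
sc.textFile("/mapr/pump.csv")
  .map(_.split(","))
  .map(p => Pump(p(0), p(1)))
  .toDF()
  .registerTempTable("pump")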
Save to HBase
// for each RDD: parse into a sensor object, filter
sensorDStream.foreachRDD { rdd =>
  . . .
  // convert alert to Put object, write to HBase alerts CF
  alertRDD.map(Sensor.convertToPutAlert)
    .saveAsHadoopDataset(jobConfig)
}
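convertToPutAlert itself is not shown on the slides; a hedged sketch of what it might look like, writing the low psi reading into the alerts column family (the column name and row-key format are assumptions based on the schema slide):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

def convertToPutAlert(sensor: Sensor): (ImmutableBytesWritable, Put) = {
  // row key: resid_date_time, matching the schema slide
  val rowkey = sensor.resid + "_" + sensor.date + "_" + sensor.time
  val put = new Put(Bytes.toBytes(rowkey))
  put.add(Bytes.toBytes("alerts"), Bytes.toBytes("psi"), Bytes.toBytes(sensor.psi))
  (new ImmutableBytesWritable(Bytes.toBytes(rowkey)), put)
}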
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: for each batch interval (time 0-1, time 1-2, …), map converts the RDDs of the sensor DStream into Put objects, and save writes them to HBase; an output operation persists data to external storage]
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
  . . .
}

// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Using HBase as a Source and Sink
[Diagram: a Spark application reads from and writes to an HBase database]
EXAMPLE: calculate and store summaries – a pre-computed, materialized view
HBase Read and Write
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

[Diagram: newAPIHadoopRDD scans the HBase table and returns (row key, Result) pairs; saveAsHadoopDataset writes (key, Put) pairs back to HBase]
Read HBase
// Load an RDD of (rowkey, Result) tuples from the HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// get the Result
val resultRDD = hBaseRDD.map(tuple => tuple._2)

// transform into an RDD of (RowKey, ColumnValue)s
val keyValueRDD = resultRDD.map(result =>
  (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))

// group by rowkey, get statistics for the column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
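StatCounter comes from Spark's util package and computes summary statistics in a single pass; a small sketch of what it exposes (the sample values are illustrative):

import org.apache.spark.util.StatCounter

val stats = StatCounter(Seq(10.0, 12.0, 8.0))
stats.mean    // 10.0
stats.min     // 8.0
stats.max     // 12.0
stats.stdev   // population standard deviation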
Write HBase
// configure the HBase output
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)

// convert psi stats to Put objects and write to the HBase table stats column family
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
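convertToPut is likewise not shown; a hedged sketch writing selected StatCounter fields into the stats column family (the psi_avg and psi_max column names are assumptions alongside psi_min from the schema slide):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.util.StatCounter

def convertToPut(key: String, stats: StatCounter): (ImmutableBytesWritable, Put) = {
  val put = new Put(Bytes.toBytes(key))
  // daily summary columns for the psi readings
  put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_min"), Bytes.toBytes(stats.min))
  put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_avg"), Bytes.toBytes(stats.mean))
  put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_max"), Bytes.toBytes(stats.max))
  (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
}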
MapR Blog: Spark Streaming with HBase
• https://www.mapr.com/blog/spark-streaming-hbase
Free HBase On Demand Training (includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
Soon to Come
• Spark On Demand Training – https://www.mapr.com/services/mapr-academy/
References
• Spark web site: http://spark.apache.org/
• Databricks: https://databricks.com/
• Spark on MapR: http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark – O'Reilly book
• Apache Spark
Q & A
Engage with us!
@mapr · maprtech · MapR · mapr-technologies