Apache Spark Streaming and HBase
TRANSCRIPT
Overview of Apache Spark Streaming
Carol McDonald
Agenda
• Why Apache Spark Streaming?
• What is Apache Spark Streaming?
  – Key Concepts and Architecture
• How it works, by example
Why Spark Streaming?
• Process time series data, with results in near real time
• Use cases:
  – Social network trends
  – Website statistics and monitoring
  – Fraud detection
  – Advertising click monetization
• Data: sensors, system metrics, events, log files, stock tickers, user activity
• High volume and velocity
[Diagram: time-stamped data is continuously put into a table, providing data for real-time monitoring]
What is time series data?
• Anything with a timestamp:
  – Sensor data
  – Log files
  – Phones
  – Credit card transactions
  – Web user behaviour
  – Social media
  – Geodata
Why Spark Streaming?

What if you want to analyze data as it arrives?
For example, time series data: sensors, clicks, logs, stats.
Batch Processing
It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees
It was hot at 6:05 yesterday!
Batch processing may be too late for some events
Event Processing
It's 6:05 and 90 degrees
Someone should open a window!
Streaming
It's becoming important to process events as they arrive.
What is Spark Streaming?
• An extension of the core Spark API
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
[Diagram: live data flows from data sources through Spark Streaming to data sinks]
Stream Processing Architecture
[Diagram: streaming sources/apps feed data ingest (topics, MapR-FS) into stream processing with Spark Streaming, which writes to data storage (MapR-DB, MapR-FS) that serves apps]
Key Concepts
• Data Sources:
  – File based: HDFS (MapR-FS)
  – Network based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor (topics)
• Transformations
• Output Operations
Spark Streaming Architecture
• Divide the data stream into batches of X seconds
  – Called a DStream = a sequence of RDDs
[Diagram: Spark Streaming splits the input data stream by batch interval; data from time 0 to 1 becomes the RDD @ time 1, data from time 1 to 2 becomes the RDD @ time 2, and so on]
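To see the batching, here is a minimal sketch (assuming a linesDStream like the one created on a later slide); the Scala API's foreachRDD can pass the batch time along with each RDD:

// each batch interval yields one RDD, tagged with its batch time
linesDStream.foreachRDD { (rdd, time) =>
  println(s"batch at $time contains ${rdd.count()} lines")
}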
Resilient Distributed Datasets (RDD)

Spark revolves around RDDs:
• Read-only collection of elements
• Operated on in parallel
• Cached in memory
  – Or on disk
• Fault tolerant
Working With RDDs
[Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]

linesRDD = sc.textFile("SomeFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # Error line
Process DStream
• Process using transformations – each transformation creates new RDDs
[Diagram: a transform such as map, reduce, or count is applied to every RDD in the DStream; for each batch interval (time 0-1, time 1-2, …) it produces a corresponding RDD in a new DStream]
Key Concepts
• Data Sources
• Transformations: create a new DStream
  – Standard RDD operations: map, filter, union, reduce, join, …
  – Stateful operations: updateStateByKey(function), countByValueAndWindow, … (see the sketch below)
• Output Operations
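As a hedged illustration of a stateful operation, here is a minimal sketch of a running word count with updateStateByKey; the socket source, port, and checkpoint directory are assumptions, not from the slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val sparkConf = new SparkConf().setAppName("StatefulWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("/tmp/checkpoint")   // stateful operations require checkpointing

// keep a running count of each word across all batches seen so far
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).updateStateByKey(
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)))
counts.print()

ssc.start()
ssc.awaitTermination()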
Spark Streaming Architecture
• Processed results are pushed out in batches
[Diagram: the input data stream becomes DStream RDD batches; Spark Streaming hands each RDD to Spark, which pushes out batches of processed results]
Key Concepts
• Data Sources
• Transformations
• Output Operations: trigger computation (see the sketch below)
  – saveAsHadoopFiles – save to HDFS (MapR-FS)
  – saveAsHadoopDataset – save to HBase (MapR-DB)
  – saveAsTextFiles
  – foreachRDD – do anything with each batch of RDDs
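A minimal sketch of output operations triggering computation (sensorDStream and the output path are assumptions, anticipating the example that follows):

// print the first elements of each batch to the driver log
sensorDStream.print()
// write each batch to a new directory named with the given prefix and batch time
sensorDStream.saveAsTextFiles("/mapr/out/sensor")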
Learning Goals
• How it works by example
Use Case: Time Series Data
Data for real-time monitoring
[Diagram: oil pump sensor data is read by Spark Streaming for Spark processing]
Convert Line of CSV data to Sensor Object
case class Sensor(resid: String, date: String, time: String, hz: Double,
                  disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
         p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
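For example, a single CSV line parses into a Sensor as follows (the sample values are illustrative, matching the row-key format on the schema slide):

val line = "COHUTTA,3/10/14,1:01,10.37,1.79,11.24,0.04,84.0,1.02"
val sensor = parseSensor(line)
// sensor.resid == "COHUTTA", sensor.psi == 84.0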
Schema
• All events are stored; the data CF could be set to expire old data
• Filtered alerts are put in the alerts CF
• Daily summaries are put in the stats CF

Row key               | CF data: hz … psi | CF alerts: psi … | CF stats: hz_avg … psi_min
COHUTTA_3/10/14_1:01  | 10.37 … 84        | 0                |
COHUTTA_3/10/14       |                   |                  | 10 … 0
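A hedged sketch of creating this table with its three column families; the table name and TTL value are assumptions, using the HBase 0.98-era admin API:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin

val hbaseConf = HBaseConfiguration.create()
val admin = new HBaseAdmin(hbaseConf)
val desc = new HTableDescriptor(TableName.valueOf("sensor"))   // hypothetical table name
val dataCF = new HColumnDescriptor("data")
dataCF.setTimeToLive(60 * 60 * 24 * 30)   // expire raw events after ~30 days (assumption)
desc.addFamily(dataCF)
desc.addFamily(new HColumnDescriptor("alerts"))
desc.addFamily(new HColumnDescriptor("stats"))
admin.createTable(desc)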
Basic Steps for Spark Streaming code
These are the basic steps for Spark Streaming code:
1. Create a DStream
   1. Apply transformations
   2. Apply output operations
2. Start receiving data and processing it, using streamingContext.start()
3. Wait for the processing to be stopped, using streamingContext.awaitTermination()
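Put together, a minimal skeleton of these steps (the path, batch interval, and parseSensor follow the slides that come next):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("SensorStream"), Seconds(2))

// 1. create a DStream, apply transformations and output operations
val sensorDStream = ssc.textFileStream("/mapr/stream").map(parseSensor)
sensorDStream.print()

// 2. start receiving data and processing it
ssc.start()
// 3. wait for the processing to be stopped
ssc.awaitTermination()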
Create a DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")

[Diagram: linesDStream is a sequence of RDDs representing a stream of data; each batch interval (time 0-1, time 1-2, …) is stored in memory as an RDD]
Process DStream
val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)

[Diagram: map creates new RDDs for every batch; for each batch interval (time 0-1, time 1-2, …), the RDD in linesDStream is mapped to a new RDD in sensorDStream]
Process DStream
// for each RDD
sensorDStream.foreachRDD { rdd =>
  // filter sensor data for low psi
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
  . . .
}
DataFrame and SQL Operations
// for each RDD: parse into a sensor object, filter
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRDD.toDF().registerTempTable("alert")
  // join alert data with pump maintenance info
  val res = sqlContext.sql(
    "select s.resid, s.psi, p.pumpType from alert s " +
    "join pump p on s.resid = p.resid " +
    "join maint m on p.resid = m.resid")
  . . .
}
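The query assumes pump and maint tables were registered earlier; a hedged sketch of how the pump lookup table might be registered (the Pump case class, columns, and file path are illustrative assumptions):

import org.apache.spark.sql.SQLContext

case class Pump(resid: String, pumpType: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// register a pump lookup table parsed from a CSV file
sc.textFile("/mapr/pump.csv")
  .map(_.split(","))
  .map(p => Pump(p(0), p(1)))
  .toDF()
  .registerTempTable("pump")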
Save to HBase
// for each RDD: parse into a sensor object, filter
sensorDStream.foreachRDD { rdd =>
  . . .
  // convert alert to Put object, write to HBase alerts CF
  alertRDD.map(Sensor.convertToPutAlert)
    .saveAsHadoopDataset(jobConfig)
}
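convertToPutAlert itself is not shown on the slides; a hedged sketch of what it might look like, writing the low psi reading into the alerts column family (the column name and row-key format are assumptions based on the schema slide):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

def convertToPutAlert(sensor: Sensor): (ImmutableBytesWritable, Put) = {
  // row key: resid_date_time, matching the schema slide
  val rowkey = sensor.resid + "_" + sensor.date + "_" + sensor.time
  val put = new Put(Bytes.toBytes(rowkey))
  put.add(Bytes.toBytes("alerts"), Bytes.toBytes("psi"), Bytes.toBytes(sensor.psi))
  (new ImmutableBytesWritable(Bytes.toBytes(rowkey)), put)
}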
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: for each batch interval (time 0-1, time 1-2, …), map converts the RDDs of the sensor DStream into Put objects, and save writes them to HBase; an output operation persists data to external storage]
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
  . . .
}

// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Using HBase as a Source and Sink
[Diagram: a Spark application reads from and writes to an HBase database]
EXAMPLE: calculate and store summaries – a pre-computed, materialized view
HBase Read and Write
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

[Diagram: newAPIHadoopRDD scans the HBase table and returns (row key, Result) pairs; saveAsHadoopDataset writes (key, Put) pairs back to HBase]
Read HBase
// Load an RDD of (rowkey, Result) tuples from the HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// get the Result
val resultRDD = hBaseRDD.map(tuple => tuple._2)

// transform into an RDD of (RowKey, ColumnValue)s
val keyValueRDD = resultRDD.map(result =>
  (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))

// group by rowkey, get statistics for the column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
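StatCounter comes from Spark's util package and computes summary statistics in a single pass; a small sketch of what it exposes (the sample values are illustrative):

import org.apache.spark.util.StatCounter

val stats = StatCounter(Seq(10.0, 12.0, 8.0))
stats.mean    // 10.0
stats.min     // 8.0
stats.max     // 12.0
stats.stdev   // population standard deviation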
Write HBase
// configure the HBase output
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)

// convert psi stats to Put objects and write to the HBase table stats column family
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
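convertToPut is likewise not shown; a hedged sketch writing selected StatCounter fields into the stats column family (the psi_avg and psi_max column names are assumptions alongside psi_min from the schema slide):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.util.StatCounter

def convertToPut(key: String, stats: StatCounter): (ImmutableBytesWritable, Put) = {
  val put = new Put(Bytes.toBytes(key))
  // daily summary columns for the psi readings
  put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_min"), Bytes.toBytes(stats.min))
  put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_avg"), Bytes.toBytes(stats.mean))
  put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_max"), Bytes.toBytes(stats.max))
  (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
}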
MapR Blog: Spark Streaming with HBase
• https://www.mapr.com/blog/spark-streaming-hbase
Free HBase On Demand Training (includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
Soon to Come
• Spark On Demand Training – https://www.mapr.com/services/mapr-academy/
References
• Spark web site: http://spark.apache.org/
• Databricks: https://databricks.com/
• Spark on MapR: http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark – O'Reilly book
• Apache Spark
Q & A
Engage with us!
@mapr · maprtech · MapR · mapr-technologies