Spark + Flume (Seattle)
TRANSCRIPT
© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Flume + Spark Streaming = Ingest + Processing
What is Flume
• Collection, Aggregation of streaming Event Data
• Typically used for log data
• Significant advantages over ad-hoc solutions
• Reliable, Scalable, Manageable, Customizable and High Performance
• Declarative, Dynamic Configuration
• Contextual Routing
• Feature rich
• Fully extensible
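The "declarative, dynamic configuration" above is a plain properties file. A minimal sketch (agent name a1, the port, and component names are hypothetical) wiring a Netcat source through a memory channel to a logger sink:

```properties
# Hypothetical agent "a1": netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Because the configuration is dynamic, the agent can reload a changed file without a restart.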
Core Concepts: Event
An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination. An Event is a byte array payload accompanied by optional headers.
• Payload is opaque to Flume
• Headers are specified as an unordered collection of string key-value pairs, with keys being unique across the collection
• Headers can be used for contextual routing
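The shape of an Event, and how headers enable contextual routing, can be pictured with a hypothetical Python sketch (this is not the actual org.apache.flume.Event API; all names here are made up):

```python
class Event:
    """Sketch of a Flume-style Event: opaque byte payload + string headers."""
    def __init__(self, payload, headers=None):
        self.payload = payload              # opaque to Flume; never inspected
        self.headers = dict(headers or {})  # unordered string key-value pairs, keys unique

# Headers carry routing context; e.g. an interceptor could have set "host".
e = Event(b"192.168.0.1 - GET /index.html",
          {"timestamp": "1422749982", "host": "web01"})

def route(event):
    # Contextual routing: choose a channel from a header value, not the payload.
    host = event.headers.get("host", "")
    return "channel-web" if host.startswith("web") else "channel-default"
```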
Core Concepts: Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system from which event data is consumed
• Not needed in all cases
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks and other components that enable the transportation of events from one place to another.
• Fundamental part of a Flume flow
• Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components
Inside a Flume agent
Typical Aggregation Flow
[Client]+ → Agent → [Agent]* → Destination
Core Concepts: Source
An active component that receives events from a specialized location or mechanism and places them on one or more Channels.
• Different Source types:
• Specialized sources for integrating with well-known systems. Examples: Syslog, Netcat
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro
• Require at least one channel to function
Sources
• Different Source types:
• Specialized sources for integrating with well-known systems. Examples: Spooling Files, Syslog, Netcat, JMS
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro, Thrift
• Require at least one channel to function
Core Concepts: Channel
A passive component that buffers the incoming events until they are drained by Sinks.
• Different Channels offer different levels of persistence:
• Memory Channel: volatile
• Data lost if JVM or machine restarts
• File Channel: backed by a WAL implementation
• Data not lost unless the disk dies
• Eventually, when the agent comes back, data can be accessed
• Channels are fully transactional
• Provide weak ordering guarantees
• Can work with any number of Sources and Sinks
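The persistence trade-off above shows up directly in channel configuration. An illustrative fragment (agent name, directories, and capacities are hypothetical):

```properties
# Volatile but fast: events lost if the JVM or machine restarts
a1.channels.mc.type = memory
a1.channels.mc.capacity = 100000

# Durable: write-ahead log on disk survives agent restarts
a1.channels.fc.type = file
a1.channels.fc.checkpointDir = /var/flume/checkpoint
a1.channels.fc.dataDirs = /var/flume/data
a1.channels.fc.capacity = 1000000
```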
Core Concepts: Sink
An active component that removes events from a Channel and transmits them to their next hop destination.
• Different types of Sinks:
• Terminal sinks that deposit events to their final destination. For example: HDFS, HBase, Morphline-Solr, Elasticsearch
• Sinks support serialization to user’s preferred formats.
• HDFS sink supports time-based and arbitrary bucketing of data while writing to HDFS.
• IPC sink for Agent-to-Agent communication: Avro, Thrift
• Require exactly one channel to function
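Time-based bucketing in the HDFS sink is driven by escape sequences in the target path, which are expanded from the event's timestamp header. A sketch (path, agent, and component names are hypothetical):

```properties
# One directory per hour, derived from each event's timestamp header
a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = c1
a1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events/%Y/%m/%d/%H
a1.sinks.hdfs1.hdfs.fileType = DataStream
```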
Flow Reliability
Normal Flow
Communication Failure between Agents
Communication Restored, Flow back to Normal
Flow Handling
Channels decouple impedance of upstream and downstream
• Upstream burstiness is damped by channels
• Downstream failures are transparently absorbed by channels
Sizing of channel capacity is key in realizing these benefits
Interceptors
Interceptor
An Interceptor is a component applied to a source in pre-specified order to enable decorating and filtering of events where necessary.
• Built-in Interceptors allow adding headers such as timestamps, hostname, static markers etc.
• Custom interceptors can introspect event payload to create specific headers where necessary
Contextual Routing
Channel Selector
A Channel Selector allows a Source to select one or more Channels from all the Channels that the Source is configured with based on preset criteria.
• Built-in Channel Selectors:
• Replicating: for duplicating the events
• Multiplexing: for routing based on headers
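A multiplexing Channel Selector is configured on the source. A sketch routing on a hypothetical "type" header (agent and channel names are made up):

```properties
# Route on the "type" header: errors to c1, everything else to c2
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.default = c2
```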
Contextual Routing
• Terminal Sinks can directly use Headers to make destination selections
• HDFS Sink can use header values to create a dynamic path for the files that events will be added to.
• Some headers such as timestamps can be used in a more sophisticated manner
• Custom Channel Selector can be used for doing specialized routing where necessary
Client API
• Simple API that can be used to send data to Flume agents.
• Simplest form – send a batch of events to one agent.
• Can be used to send data to multiple agents in a round-robin, random or failover fashion (send data to one till it fails).
• Java only.
• flume.thrift can be used to generate code for other languages.
• Use with Thrift source.
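The round-robin/random/failover behavior is selected through client properties passed to the Client SDK. A sketch of a load-balancing setup (hostnames and ports are hypothetical):

```properties
# Load-balance across two agents; use default_failover instead to
# send to one host until it fails, then move to the next
client.type = default_loadbalance
hosts = h1 h2
hosts.h1 = agent1.example.com:41414
hosts.h2 = agent2.example.com:41414
host-selector = round_robin
```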
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%
How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...
Spark: Easy and Fast Big Data
• Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run
• General execution graphs
• In-memory storage
2-5× less code; up to 10× faster on disk, 100× in memory
Spark Architecture
[Diagram: a Driver coordinates multiple Workers, each caching Data in RAM]
RDDs
RDD = Resilient Distributed Datasets
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
Spark Streaming
Extension of Apache Spark’s Core API for Stream Processing.
The framework provides:
• Fault Tolerance
• Scalability
• High Throughput
Canonical Stream Processing Architecture
[Diagram: Data Sources → Kafka (data ingest) → App 1, App 2, ... → Kafka / Flume → HDFS / HBase]
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming
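The micro-batch idea can be sketched outside Spark in a few lines of plain Python (a toy illustration, not Spark's implementation): the same function runs unchanged over each small batch, which is why batch and streaming code can be shared.

```python
def micro_batches(stream, batch_size):
    """Discretize an event stream into fixed-size batches (toy DStream)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # each batch plays the role of one RDD
            batch = []
    if batch:
        yield batch              # emit the final partial batch

def count_tags(batch):
    # Identical logic could run on a static data set or on each micro-batch.
    return sum(1 for e in batch if e.startswith("#"))

events = ["#spark", "hello", "#flume", "#kafka", "world"]
counts = [count_tags(b) for b in micro_batches(events, 2)]  # -> [1, 2, 0]
```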
“Micro-batch” Architecture

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: the tweets DStream is split into batches at t, t+1, t+2; flatMap produces the hashTags DStream batch by batch, and each batch is saved]

Stream composed of small (1-10s) batch computations
Use DStreams for Windowing Functions
Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.
Connectors
• Flume
• Kafka
• Amazon Kinesis
• MQTT
• Twitter (?!!)
Flume Polling Input DStream
• Reliable Receiver to pull data from Flume agents
• Configure Flume agents to run the SparkSink (org.apache.spark.streaming.flume.sink.SparkSink)
• Use FlumeUtils.createPollingStream to create a FlumePollingInputDStream configured to connect to the Flume agent(s)
• Enable WAL in Spark configuration to ensure no data loss
• Depends on Flume’s transactions for reliability
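On the Flume side, the SparkSink named above is configured like any other sink. A sketch (agent name, bind address, port, and channel are hypothetical):

```properties
# Flume agent exposing a SparkSink for the polling receiver to pull from
a1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.spark.hostname = 0.0.0.0
a1.sinks.spark.port = 41415
a1.sinks.spark.channel = c1
```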
Flume Connector
• The Flume SparkSink and the Flume receiver on Spark communicate over Avro, using Flume’s standard event format.
• No additional costs for using Spark Streaming vs just sending data from one Flume agent to another!
• With WAL enabled, no data loss!
Flume Transactions
• Sink starts a transaction with the channel
• Sink then pulls events out and sends them to the receiver
• The receiver stores all events in a single store call
• Once the store call returns, the receiver sends an ACK to the sink
• The sink then commits the transaction
• Timeouts/NACKs trigger a rollback and data is re-sent

[Diagram, built up over several slides: the Spark Sink starts a transaction on the Flume Channel, streams events to the Receiver in the Executor, the Receiver stores them as a Block in the Block Manager, sends an ACK back, and the Sink commits the transaction]
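The ACK-gated commit can be sketched as a toy protocol in plain Python (hypothetical names; not Flume's actual implementation): events leave the channel only tentatively, and a failed or missing ACK puts them back for re-delivery.

```python
class Channel:
    """Toy transactional channel: take/commit/rollback."""
    def __init__(self, events):
        self.events, self.in_flight = list(events), []
    def take(self, n):                     # start txn: tentatively remove events
        self.in_flight, self.events = self.events[:n], self.events[n:]
        return self.in_flight
    def commit(self):                      # ACK received: discard in-flight events
        self.in_flight = []
    def rollback(self):                    # NACK/timeout: put events back for resend
        self.events = self.in_flight + self.events
        self.in_flight = []

def deliver(channel, receiver_store, batch_size=2):
    batch = channel.take(batch_size)
    try:
        receiver_store(batch)              # single store call in the receiver
        channel.commit()                   # ACK -> commit the transaction
    except Exception:
        channel.rollback()                 # failure -> events will be re-sent

ch = Channel(["e1", "e2", "e3"])
stored = []
deliver(ch, stored.extend)                 # succeeds: e1, e2 stored and committed
def failing(_): raise IOError("receiver down")
deliver(ch, failing)                       # fails: e3 rolled back into the channel
```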
Summary
• Clients send Events to Agents
• Agents host a number of Flume components – Sources, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks.
• Sources and Sinks are active components, whereas Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on channel(s) selected by the configured Channel Selector
• The Sink Processor identifies a Sink to invoke, which takes Events from a Channel and sends them to their next-hop destination
• Channel operations are transactional to guarantee one-hop delivery semantics
• Channel persistence allows for ensuring end-to-end reliability