Spark + Flume (Seattle)
TRANSCRIPT
© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Flume + Spark Streaming = Ingest + Processing
What is Flume
• Collection, Aggregation of streaming Event Data
• Typically used for log data
• Significant advantages over ad-hoc solutions
• Reliable, Scalable, Manageable, Customizable and High Performance
• Declarative, Dynamic Configuration
• Contextual Routing
• Feature rich
• Fully extensible
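The "declarative, dynamic configuration" above is a plain properties file. A minimal sketch (agent name a1, the port, and component names are hypothetical) wiring a Netcat source through a memory channel to a logger sink:

```properties
# Hypothetical agent "a1": netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Because the configuration is dynamic, the agent can reload a changed file without a restart.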
Core Concepts: Event
An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination. An Event is a byte array payload accompanied by optional headers.
• Payload is opaque to Flume
• Headers are specified as an unordered collection of string key-value pairs, with keys being unique across the collection
• Headers can be used for contextual routing
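The shape of an Event, and how headers enable contextual routing, can be pictured with a hypothetical Python sketch (this is not the actual org.apache.flume.Event API; all names here are made up):

```python
class Event:
    """Sketch of a Flume-style Event: opaque byte payload + string headers."""
    def __init__(self, payload, headers=None):
        self.payload = payload              # opaque to Flume; never inspected
        self.headers = dict(headers or {})  # unordered string key-value pairs, keys unique

# Headers carry routing context; e.g. an interceptor could have set "host".
e = Event(b"192.168.0.1 - GET /index.html",
          {"timestamp": "1422749982", "host": "web01"})

def route(event):
    # Contextual routing: choose a channel from a header value, not the payload.
    host = event.headers.get("host", "")
    return "channel-web" if host.startswith("web") else "channel-default"
```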
Core Concepts: Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system from which event data is consumed
• Not needed in all cases
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks and other components that enable the transportation of events from one place to another.
• Fundamental part of a Flume flow
• Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components
Inside a Flume agent
Typical Aggregation Flow
[Client]+ → Agent → [Agent]* → Destination
Core Concepts: Source
An active component that receives events from a specialized location or mechanism and places them on one or more Channels.
• Different Source types:
• Specialized sources for integrating with well-known systems. Examples: Syslog, Netcat
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro
• Require at least one channel to function
Sources
• Different Source types:
• Specialized sources for integrating with well-known systems. Examples: Spooling Files, Syslog, Netcat, JMS
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro, Thrift
• Require at least one channel to function
Core Concepts: Channel
A passive component that buffers the incoming events until they are drained by Sinks.
• Different Channels offer different levels of persistence:
• Memory Channel: volatile
• Data lost if JVM or machine restarts
• File Channel: backed by a WAL implementation
• Data not lost unless the disk dies
• Eventually, when the agent comes back, data can be accessed
• Channels are fully transactional
• Provide weak ordering guarantees
• Can work with any number of Sources and Sinks
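The persistence trade-off above shows up directly in channel configuration. An illustrative fragment (agent name, directories, and capacities are hypothetical):

```properties
# Volatile but fast: events lost if the JVM or machine restarts
a1.channels.mc.type = memory
a1.channels.mc.capacity = 100000

# Durable: write-ahead log on disk survives agent restarts
a1.channels.fc.type = file
a1.channels.fc.checkpointDir = /var/flume/checkpoint
a1.channels.fc.dataDirs = /var/flume/data
a1.channels.fc.capacity = 1000000
```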
Core Concepts: Sink
An active component that removes events from a Channel and transmits them to their next hop destination.
• Different types of Sinks:
• Terminal sinks that deposit events to their final destination. For example: HDFS, HBase, Morphline-Solr, Elasticsearch
• Sinks support serialization to user’s preferred formats.
• HDFS sink supports time-based and arbitrary bucketing of data while writing to HDFS.
• IPC sink for Agent-to-Agent communication: Avro, Thrift
• Require exactly one channel to function
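Time-based bucketing in the HDFS sink is driven by escape sequences in the target path, which are expanded from the event's timestamp header. A sketch (path, agent, and component names are hypothetical):

```properties
# One directory per hour, derived from each event's timestamp header
a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = c1
a1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events/%Y/%m/%d/%H
a1.sinks.hdfs1.hdfs.fileType = DataStream
```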
Flow Reliability
Normal Flow
Communication Failure between Agents
Communication Restored, Flow back to Normal
Flow Handling
Channels decouple impedance of upstream and downstream
• Upstream burstiness is damped by channels
• Downstream failures are transparently absorbed by channels
Sizing of channel capacity is key in realizing these benefits
Interceptors
Interceptor
An Interceptor is a component applied to a source in pre-specified order to enable decorating and filtering of events where necessary.
• Built-in Interceptors allow adding headers such as timestamps, hostname, static markers etc.
• Custom interceptors can introspect event payload to create specific headers where necessary
Contextual Routing
Channel Selector
A Channel Selector allows a Source to select one or more Channels from all the Channels that the Source is configured with based on preset criteria.
• Built-in Channel Selectors:
• Replicating: for duplicating the events
• Multiplexing: for routing based on headers
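A multiplexing Channel Selector is configured on the source. A sketch routing on a hypothetical "type" header (agent and channel names are made up):

```properties
# Route on the "type" header: errors to c1, everything else to c2
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.default = c2
```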
Contextual Routing
• Terminal Sinks can directly use Headers to make destination selections
• HDFS Sink can use header values to create a dynamic path for the files that events will be added to.
• Some headers such as timestamps can be used in a more sophisticated manner
• Custom Channel Selector can be used for doing specialized routing where necessary
Client API
• Simple API that can be used to send data to Flume agents.
• Simplest form – send a batch of events to one agent.
• Can be used to send data to multiple agents in a round-robin, random or failover fashion (send data to one till it fails).
• Java only.
• flume.thrift can be used to generate code for other languages.
• Use with Thrift source.
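The round-robin/random/failover behavior is selected through client properties passed to the Client SDK. A sketch of a load-balancing setup (hostnames and ports are hypothetical):

```properties
# Load-balance across two agents; use default_failover instead to
# send to one host until it fails, then move to the next
client.type = default_loadbalance
hosts = h1 h2
hosts.h1 = agent1.example.com:41414
hosts.h2 = agent2.example.com:41414
host-selector = round_robin
```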
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%
How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...
Spark: Easy and Fast Big Data
• Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run
• General execution graphs
• In-memory storage
2-5× less code; up to 10× faster on disk, 100× in memory
Spark Architecture
[Diagram: a Driver coordinates multiple Workers, each caching Data in RAM]
RDDs
RDD = Resilient Distributed Datasets
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
Spark Streaming
Extension of Apache Spark’s Core API for Stream Processing.
The framework provides:
• Fault Tolerance
• Scalability
• High Throughput
Canonical Stream Processing Architecture
[Diagram: Data Sources → Kafka (data ingest) → App 1, App 2, ... → Kafka / Flume → HDFS / HBase]
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming
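The micro-batch idea can be sketched outside Spark in a few lines of plain Python (a toy illustration, not Spark's implementation): the same function runs unchanged over each small batch, which is why batch and streaming code can be shared.

```python
def micro_batches(stream, batch_size):
    """Discretize an event stream into fixed-size batches (toy DStream)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # each batch plays the role of one RDD
            batch = []
    if batch:
        yield batch              # emit the final partial batch

def count_tags(batch):
    # Identical logic could run on a static data set or on each micro-batch.
    return sum(1 for e in batch if e.startswith("#"))

events = ["#spark", "hello", "#flume", "#kafka", "world"]
counts = [count_tags(b) for b in micro_batches(events, 2)]  # -> [1, 2, 0]
```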
“Micro-batch” Architecture

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: the tweets DStream is split into batches at t, t+1, t+2; flatMap produces the hashTags DStream batch by batch, and each batch is saved]

Stream composed of small (1-10s) batch computations
Use DStreams for Windowing Functions
Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.
Connectors
• Flume
• Kafka
• Amazon Kinesis
• MQTT
• Twitter (?!!)
Flume Polling Input DStream
• Reliable Receiver to pull data from Flume agents
• Configure Flume agents to run the SparkSink (org.apache.spark.streaming.flume.sink.SparkSink)
• Use FlumeUtils.createPollingStream to create a FlumePollingInputDStream configured to connect to the Flume agent(s)
• Enable WAL in Spark configuration to ensure no data loss
• Depends on Flume’s transactions for reliability
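On the Flume side, the SparkSink named above is configured like any other sink. A sketch (agent name, bind address, port, and channel are hypothetical):

```properties
# Flume agent exposing a SparkSink for the polling receiver to pull from
a1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.spark.hostname = 0.0.0.0
a1.sinks.spark.port = 41415
a1.sinks.spark.channel = c1
```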
Flume Connector
• The Flume SparkSink and the Flume receiver on Spark communicate over Avro, using Flume’s standard event format.
• No additional costs for using Spark Streaming vs just sending data from one Flume agent to another!
• With WAL enabled, no data loss!
Flume Transactions
• Sink starts a transaction with the channel
• Sink then pulls events out and sends them to the receiver
• The receiver stores all events in a single store call
• Once the store call returns, the receiver sends an ACK to the sink
• The sink then commits the transaction
• Timeouts/NACKs trigger a rollback and data is re-sent

[Diagram, built up over several slides: the Spark Sink starts a transaction on the Flume Channel, streams events to the Receiver in the Executor, the Receiver stores them as a Block in the Block Manager, sends an ACK back, and the Sink commits the transaction]
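The ACK-gated commit can be sketched as a toy protocol in plain Python (hypothetical names; not Flume's actual implementation): events leave the channel only tentatively, and a failed or missing ACK puts them back for re-delivery.

```python
class Channel:
    """Toy transactional channel: take/commit/rollback."""
    def __init__(self, events):
        self.events, self.in_flight = list(events), []
    def take(self, n):                     # start txn: tentatively remove events
        self.in_flight, self.events = self.events[:n], self.events[n:]
        return self.in_flight
    def commit(self):                      # ACK received: discard in-flight events
        self.in_flight = []
    def rollback(self):                    # NACK/timeout: put events back for resend
        self.events = self.in_flight + self.events
        self.in_flight = []

def deliver(channel, receiver_store, batch_size=2):
    batch = channel.take(batch_size)
    try:
        receiver_store(batch)              # single store call in the receiver
        channel.commit()                   # ACK -> commit the transaction
    except Exception:
        channel.rollback()                 # failure -> events will be re-sent

ch = Channel(["e1", "e2", "e3"])
stored = []
deliver(ch, stored.extend)                 # succeeds: e1, e2 stored and committed
def failing(_): raise IOError("receiver down")
deliver(ch, failing)                       # fails: e3 rolled back into the channel
```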
Summary
• Clients send Events to Agents
• Agents host a number of Flume components – Sources, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks.
• Sources and Sinks are active components, whereas Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on channel(s) selected by the configured Channel Selector
• The Sink Processor identifies a Sink to invoke, which takes Events from a Channel and sends them to their next-hop destination
• Channel operations are transactional to guarantee one-hop delivery semantics
• Channel persistence allows for ensuring end-to-end reliability