Spark+flume seattle


TRANSCRIPT

Page 1: Spark+flume seattle

Hari Shreedharan, Software Engineer @ Cloudera

Committer/PMC Member, Apache Flume

Committer, Apache Sqoop

Contributor, Apache Spark

Author, Using Flume (O’Reilly)

Flume + Spark Streaming = Ingest + Processing

Page 2: Spark+flume seattle

What is Flume

• Collection, Aggregation of streaming Event Data

• Typically used for log data

• Significant advantages over ad-hoc solutions

• Reliable, Scalable, Manageable, Customizable and High Performance

• Declarative, Dynamic Configuration

• Contextual Routing

• Feature rich

• Fully extensible
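
To make "declarative configuration" concrete, here is a minimal sketch of a single-agent properties file; the agent and component names (agent, src, ch, snk) are hypothetical:

agent.sources = src
agent.channels = ch
agent.sinks = snk

agent.sources.src.type = netcat
agent.sources.src.bind = 0.0.0.0
agent.sources.src.port = 44444
agent.sources.src.channels = ch

agent.channels.ch.type = memory
agent.channels.ch.capacity = 10000

agent.sinks.snk.type = logger
agent.sinks.snk.channel = ch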

Page 3: Spark+flume seattle

Core Concepts: Event

An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination. An Event is a byte-array payload accompanied by optional headers.

• Payload is opaque to Flume

• Headers are specified as an unordered collection of string key-value pairs, with keys being unique across the collection

• Headers can be used for contextual routing
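
As a sketch of the event model, the Client SDK's EventBuilder can attach headers to an opaque payload; the header keys and values here are invented for illustration:

import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._
import org.apache.flume.event.EventBuilder

// Headers are string key-value pairs; the body is an opaque byte array
val headers = Map("host" -> "web01", "datacenter" -> "us-west").asJava
val event = EventBuilder.withBody("a log line", StandardCharsets.UTF_8, headers)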

Page 4: Spark+flume seattle

Core Concepts: Client

An entity that generates events and sends them to one or more Agents.

• Example

• Flume log4j Appender

• Custom Client using Client SDK (org.apache.flume.api)

• Embedded Agent – An agent embedded within your application

• Decouples Flume from the system that the event data is consumed from

• Not needed in all cases

Page 5: Spark+flume seattle

Core Concepts: Agent

A container for hosting Sources, Channels, Sinks and other components that enable the transportation of events from one place to another.

• Fundamental part of a Flume flow

• Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components

Page 6: Spark+flume seattle

Inside a Flume agent

Page 7: Spark+flume seattle

Typical Aggregation Flow

[Client]+ → Agent → [Agent]* → Destination

Page 8: Spark+flume seattle

Core Concepts: Source

An active component that receives events from a specialized location or mechanism and places them on one or more Channels.

• Different Source types:

• Specialized sources for integrating with well-known systems. Example: Syslog, Netcat

• Auto-Generating Sources: Exec, SEQ

• IPC sources for Agent-to-Agent communication: Avro

• Require at least one channel to function

Page 9: Spark+flume seattle

Sources

• Different Source types:

• Specialized sources for integrating with well-known systems. Example: Spooling Files, Syslog, Netcat, JMS

• Auto-Generating Sources: Exec, SEQ

• IPC sources for Agent-to-Agent communication: Avro, Thrift

• Require at least one channel to function

Page 10: Spark+flume seattle

Core Concepts: Channel

A passive component that buffers the incoming events until they are drained by Sinks.

• Different Channels offer different levels of persistence:

  • Memory Channel: volatile; data is lost if the JVM or machine restarts

  • File Channel: backed by a WAL implementation; data is not lost unless the disk dies, and when the agent comes back the data can be accessed

• Channels are fully transactional

• Provide weak ordering guarantees

• Can work with any number of Sources and Sinks

Page 11: Spark+flume seattle

Core Concepts: Sink

An active component that removes events from a Channel and transmits them to their next hop destination.

• Different types of Sinks:

• Terminal sinks that deposit events to their final destination. For example: HDFS, HBase, Morphline-Solr, Elasticsearch

• Sinks support serialization to user’s preferred formats.

• HDFS sink supports time-based and arbitrary bucketing of data while writing to HDFS.

• IPC sink for Agent-to-Agent communication: Avro, Thrift

• Require exactly one channel to function

Page 12: Spark+flume seattle

Flow Reliability

Normal Flow

Communication Failure between Agents

Communication Restored, Flow back to Normal

Page 13: Spark+flume seattle

Flow Handling

Channels decouple impedance of upstream and downstream

• Upstream burstiness is damped by channels

• Downstream failures are transparently absorbed by channels

Sizing of channel capacity is key in realizing these benefits
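
As a rough illustration of the sizing knobs involved (the agent and channel names are placeholders), a File Channel's capacity bounds how large a burst or downstream outage it can absorb:

agent.channels.ch.type = file
agent.channels.ch.checkpointDir = /flume/checkpoint
agent.channels.ch.dataDirs = /flume/data
# Maximum events buffered: size to cover bursts plus expected downstream downtime
agent.channels.ch.capacity = 1000000
agent.channels.ch.transactionCapacity = 10000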

Page 14: Spark+flume seattle

Interceptors

Interceptor

An Interceptor is a component applied to a source in pre-specified order to enable decorating and filtering of events where necessary.

• Built-in Interceptors allow adding headers such as timestamps, hostname, static markers etc.

• Custom interceptors can introspect event payload to create specific headers where necessary
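
For example, the built-in timestamp and host interceptors can be chained on a source like this (the names ts and hostint are arbitrary):

agent.sources.src.interceptors = ts hostint
agent.sources.src.interceptors.ts.type = timestamp
agent.sources.src.interceptors.hostint.type = host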

Page 15: Spark+flume seattle

Contextual Routing

Channel Selector

A Channel Selector allows a Source to select one or more Channels from all the Channels that the Source is configured with based on preset criteria.

• Built-in Channel Selectors:

• Replicating: for duplicating the events

• Multiplexing: for routing based on headers
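
A sketch of a multiplexing selector that routes on a hypothetical country header:

agent.sources.src.channels = ch1 ch2
agent.sources.src.selector.type = multiplexing
agent.sources.src.selector.header = country
agent.sources.src.selector.mapping.US = ch1
agent.sources.src.selector.default = ch2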

Page 16: Spark+flume seattle

Contextual Routing

• Terminal Sinks can directly use Headers to make destination selections

• HDFS Sink can use header values to create dynamic paths for the files that events will be written to.

• Some headers such as timestamps can be used in a more sophisticated manner

• Custom Channel Selector can be used for doing specialized routing where necessary
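
For instance, the HDFS Sink can interpolate header values and timestamps into its path; the path below is hypothetical, and the time escapes require a timestamp header (e.g. from the timestamp interceptor):

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = ch
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/events/%{country}/%Y-%m-%d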

Page 17: Spark+flume seattle

Client API

• Simple API that can be used to send data to Flume agents.

• Simplest form – send a batch of events to one agent.

• Can be used to send data to multiple agents in a round-robin, random or failover fashion (send data to one till it fails).

• Java only.

• flume.thrift can be used to generate code for other languages.

• Use with Thrift source.
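
A minimal sketch of the Client SDK in use; the host and port are placeholders for an agent running an Avro source:

import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._
import org.apache.flume.api.RpcClientFactory
import org.apache.flume.event.EventBuilder

val client = RpcClientFactory.getDefaultInstance("agent-host", 41414)
try {
  val batch = (1 to 100).map(i => EventBuilder.withBody(s"event $i", StandardCharsets.UTF_8))
  client.appendBatch(batch.asJava)  // send a batch of events in one call
} finally {
  client.close()
}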

Page 18: Spark+flume seattle

Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates

• Exponential data growth from mobile, web, social

• Connected devices: 9B in 2012 to 50B by 2020

• Over 1 trillion sensors by 2020

• Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real time?

• Value can quickly degrade → capture value immediately

• From reactive analysis to direct operational impact

• Unlocks new competitive advantages

• Requires a completely new approach...

Page 19: Spark+flume seattle

Spark: Easy and Fast Big Data

• Easy to Develop

  • Rich APIs in Java, Scala, Python

  • Interactive shell

• Fast to Run

  • General execution graphs

  • In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory

Page 20: Spark+flume seattle

Spark Architecture

[Diagram: a Driver coordinating multiple Workers, each holding Data in RAM]

Page 21: Spark+flume seattle

RDDs

RDD = Resilient Distributed Dataset

• Immutable representation of data

• Operations on one RDD create a new one

• Memory caching layer that stores data in a distributed, fault-tolerant cache

• Created by parallel transformations on data in stable storage

• Lazy materialization

Two observations:

a. Can fall back to disk when the dataset does not fit in memory

b. Provides fault tolerance through the concept of lineage
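
A small sketch of these properties (assuming a SparkContext named sc, as in the Spark shell):

val nums    = sc.parallelize(1 to 1000000)  // RDD created by a parallel transformation
val squares = nums.map(n => n.toLong * n)   // new immutable RDD; lazy, nothing runs yet
squares.cache()                             // ask Spark to keep it in the in-memory cache
val total   = squares.reduce(_ + _)         // action: triggers the actual computation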

Page 22: Spark+flume seattle

Spark Streaming

Extension of Apache Spark’s Core API, for Stream Processing.

The Framework Provides

Fault Tolerance

Scalability

High-Throughput

Page 23: Spark+flume seattle

Canonical Stream Processing Architecture

[Diagram: Data Sources → data ingest via Flume and Kafka → App 1, App 2, … → HDFS, HBase]

Page 24: Spark+flume seattle

Spark Streaming

• Incoming data represented as Discretized Streams (DStreams)

• Stream is broken down into micro-batches

• Each micro-batch is an RDD – can share code between batch and streaming

Page 25: Spark+flume seattle

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: the tweets DStream is broken into batches at t, t+1, t+2; flatMap turns each tweets batch into a hashTags batch, which is then saved]

"Micro-batch" architecture: the stream is composed of small (1-10s) batch computations.

Page 26: Spark+flume seattle

Use DStreams for Windowing Functions
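
Continuing the hashTags DStream from the earlier example, a minimal windowing sketch: a 60-second window that slides every 10 seconds.

import org.apache.spark.streaming.Seconds

val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
tagCounts.print() // prints a sample of each window's counts every slide interval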

Page 27: Spark+flume seattle

Spark Streaming Use-Cases

• Real-time dashboards

• Show approximate results in real-time

• Reconcile periodically with source-of-truth using Spark

• Joins of multiple streams

• Time-based or count-based “windows”

• Combine multiple sources of input to produce composite data

• Re-use RDDs created by Streaming in other Spark jobs.

Page 28: Spark+flume seattle

Connectors

• Flume

• Kafka

• Amazon Kinesis

• MQTT

• Twitter (?!!)

Page 29: Spark+flume seattle

Flume Polling Input DStream

• Reliable Receiver to pull data from Flume agents

• Configure Flume agents to run the SparkSink (org.apache.spark.streaming.flume.sink.SparkSink)

• Use FlumeUtils.createPollingStream to create a FlumePollingInputDStream configured to connect to the Flume agent(s)

• Enable WAL in Spark configuration to ensure no data loss

• Depends on Flume's transactions for reliability
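
A hedged end-to-end sketch of this recipe; host names, ports, and component names below are placeholders. On the Flume side, point a sink at the SparkSink class:

agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = agent-host
agent.sinks.spark.port = 9999
agent.sinks.spark.channel = ch

On the Spark side, create the polling stream against that host and port:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf()
  .setAppName("flume-polling")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // enable the WAL
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints") // hypothetical path; the WAL needs a checkpoint dir

val events = FlumeUtils.createPollingStream(ssc, "agent-host", 9999)
// SparkFlumeEvent wraps the Avro event; the body is an opaque byte buffer
events.map(e => new String(e.event.getBody.array())).print()

ssc.start()
ssc.awaitTermination()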

Page 30: Spark+flume seattle

Flume Connector

• The Flume Sink and the Flume receiver on Spark communicate over Avro, using the standard Flume event format.

• No additional costs for using Spark Streaming vs just sending data from one Flume agent to another!

• With WAL enabled, no data loss!

Page 31: Spark+flume seattle

Flume Transactions

• Sink starts a transaction with the channel

[Diagram: a Flume Agent (Flume Channel feeding the Spark Sink) and a Spark Executor (Receiver and Block Manager); the sink starts a transaction ("Start Txn")]

Page 32: Spark+flume seattle

Flume Transactions

• Sink starts a transaction with the channel

• Sink then pulls events out and sends them to the receiver

[Diagram: as above, with an Event Stream flowing from the Spark Sink to the Receiver]

Page 33: Spark+flume seattle

Flume Transactions

• Sink starts a transaction with the channel

• Sink then pulls events out and sends them to the receiver

• The receiver stores all events in a single store call

[Diagram: as above, with the Receiver storing the events as a Block in the Block Manager]

Page 34: Spark+flume seattle

Flume Transactions

• Sink starts a transaction with the channel

• Sink then pulls events out and sends them to the receiver

• The receiver stores all events in a single store call

• Once the store call returns, the receiver sends an ACK to the sink

[Diagram: as above, with an ACK flowing from the Receiver back to the Spark Sink]

Page 35: Spark+flume seattle

Flume Transactions

[Diagram: as above, with the sink committing the transaction ("Commit Txn")]

• Sink starts a transaction with the channel

• Sink then pulls events out and sends them to the receiver

• The receiver stores all events in a single store call

• Once the store call returns, the receiver sends an ACK to the sink

• The sink then commits the transaction

• Timeouts/NACKs trigger a rollback and data is re-sent.
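
This protocol rides on Flume's channel transaction API. A minimal sketch of how a sink (the Spark Sink included) drains one batch inside a transaction; the send helper is hypothetical and stands in for shipping events to the receiver and awaiting its ACK:

import org.apache.flume.{Channel, Event, Transaction}

def drainOnce(channel: Channel, send: Seq[Event] => Unit): Unit = {
  val tx: Transaction = channel.getTransaction
  tx.begin()
  try {
    // take() returns null once the channel is empty
    val batch = Iterator.continually(channel.take()).takeWhile(_ != null).take(100).toSeq
    send(batch)
    tx.commit()      // only now are the events removed from the channel
  } catch {
    case e: Exception =>
      tx.rollback()  // timeout/NACK path: events remain in the channel for re-send
      throw e
  } finally {
    tx.close()
  }
}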

Page 36: Spark+flume seattle

Summary

• Clients send Events to Agents

• Agents host a number of Flume components: Sources, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks

• Sources and Sinks are active components, whereas Channels are passive

• Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on channel(s) selected by the configured Channel Selector

• Sink Processor identifies a sink to invoke, which takes Events from a Channel and sends them to the next-hop destination

• Channel operations are transactional to guarantee one-hop delivery semantics

• Channel persistence allows for ensuring end-to-end reliability

Page 37: Spark+flume seattle

Thank you

[email protected] / @harisr1234