streaming architecture patterns

Best practices for streaming applications. O’Reilly Webcast, June 21st/22nd, 2016. Mark Grover | @mark_grover | Software Engineer. Ted Malaska | @TedMalaska | Principal Solutions Architect.

Upload: hadooparchbook

Post on 06-Jan-2017

TRANSCRIPT

Page 1: Streaming architecture patterns

Best practices for streaming applications

O’Reilly Webcast, June 21st/22nd, 2016

Mark Grover | @mark_grover | Software Engineer

Ted Malaska | @TedMalaska | Principal Solutions Architect

Page 2: Streaming architecture patterns

About the presenters

Ted Malaska

• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years
  – Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner

Mark Grover

• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume

Page 3: Streaming architecture patterns

About the book

• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook

Page 4: Streaming architecture patterns

Goal

Page 5: Streaming architecture patterns

Understand common use-cases for streaming and their architectures

Page 6: Streaming architecture patterns

What is streaming?

Page 7: Streaming architecture patterns

When to stream, and when not to

• Real-time: constant low milliseconds and under
• Near real-time: low milliseconds to seconds, delay in case of failures
• Batch: 10s of seconds or more, re-run in case of failures

Page 9: Streaming architecture patterns

No free lunch

• Real-time: constant low milliseconds and under
• Near real-time: low milliseconds to seconds, delay in case of failures
• Batch: 10s of seconds or more, re-run in case of failures

Trade-off: “difficult” architectures, lower latency (towards real-time) versus “easier” architectures, higher latency (towards batch)

Page 10: Streaming architecture patterns

Use-cases for streaming

Page 11: Streaming architecture patterns

Use-case categories

• Ingestion
• Simple transformations
  – Decision (e.g. anomaly detection)
• Simple counts
  – Lambda, etc.
• Advanced usage
  – Machine learning
  – Windowing

Page 12: Streaming architecture patterns

Ingestion & Transformations

Page 13: Streaming architecture patterns

What is ingestion?

[Diagram: Source systems → Streaming engine → Destination system]

Page 14: Streaming architecture patterns

But there are multiple sources

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Streaming engine → (ingest) → Destination system]

Page 15: Streaming architecture patterns

But..

• Sources, sinks, ingestion channels may go down
• Sources, sinks producing/consuming at different rates (buffering)
• Regular maintenance windows may need to be scheduled
• You need a resilient message broker (pub/sub)

Page 16: Streaming architecture patterns

Need for a message broker

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Destination system]

Page 17: Streaming architecture patterns

Kafka

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Destination system]

Page 18: Streaming architecture patterns

Destination systems

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Destination system]

Most common “destination” is a storage system

Page 19: Streaming architecture patterns

Architecture diagram with a broker

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Storage system]

Page 20: Streaming architecture patterns

Streaming engines

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Storage system]

Labeled on the diagram: Kafka Connect, Apache Flume, Apache Beam (incubating)

Page 21: Streaming architecture patterns

Storage options

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Storage system]

Labeled on the diagram: Kafka Connect, Apache Flume, Apache Beam (incubating)

Page 22: Streaming architecture patterns

Semantics

At most once, Exactly once, At least once

Page 23: Streaming architecture patterns

Semantic types

• At most once
  – Not good for many cases
  – Only where performance/SLA is more important than accuracy
• Exactly once
  – Expensive to achieve but desirable
• At least once
  – Easiest to achieve
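The three semantics can be sketched with a toy sender. This is an illustration only: `FlakyChannel` is a hypothetical stand-in for a broker that can lose a message or lose an acknowledgement, and the message IDs are assumed to be assigned by the producer.

```python
class FlakyChannel:
    """Hypothetical channel that can drop a message or lose an ack, once each."""

    def __init__(self, drop_once=(), lose_ack_once=()):
        self.drop_once = set(drop_once)
        self.lose_ack_once = set(lose_ack_once)
        self.delivered = []  # everything the receiving side actually got

    def send(self, msg_id, payload):
        """Return True only if the sender sees an acknowledgement."""
        if msg_id in self.drop_once:
            self.drop_once.discard(msg_id)
            return False  # message lost in transit
        self.delivered.append((msg_id, payload))
        if msg_id in self.lose_ack_once:
            self.lose_ack_once.discard(msg_id)
            return False  # delivered, but the ack was lost
        return True


def send_at_most_once(channel, messages):
    # Fire and forget: no retries, so no duplicates, but losses stay lost.
    for msg_id, payload in messages:
        channel.send(msg_id, payload)


def send_at_least_once(channel, messages):
    # Retry until acked: nothing is lost, but a lost ack causes a duplicate.
    for msg_id, payload in messages:
        while not channel.send(msg_id, payload):
            pass


def dedupe(delivered):
    # Idempotent consumer: skip IDs already processed. This approximates
    # exactly-once processing on top of at-least-once delivery.
    seen, result = set(), []
    for msg_id, payload in delivered:
        if msg_id not in seen:
            seen.add(msg_id)
            result.append(payload)
    return result


messages = [(1, "a"), (2, "b"), (3, "c")]

lossy = FlakyChannel(drop_once=[2])
send_at_most_once(lossy, messages)
at_most_once = [p for _, p in lossy.delivered]

retrying = FlakyChannel(drop_once=[2], lose_ack_once=[3])
send_at_least_once(retrying, messages)
at_least_once = [p for _, p in retrying.delivered]
exactly_once = dedupe(retrying.delivered)
```

In this toy run, at most once loses a record, at least once duplicates one, and the deduplicating consumer sees each record once. The extra bookkeeping in `dedupe` hints at why exactly once is the expensive option.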

Page 24: Streaming architecture patterns

Review

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Destination system]

Page 25: Streaming architecture patterns

Semantics of our architecture

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming engine → (push) → Destination system]

Annotations on the diagram: “At least once”; “At least once, ordered, partitioned”; “It depends”; “It depends”

Page 26: Streaming architecture patterns

Transforming data in flight

Page 27: Streaming architecture patterns

Streaming architecture for ingestion

[Diagram: Source Systems 1, 2 and 3 → (ingest) → Message broker → (extract) → Streaming ingestion process (Kafka Connect, Apache Flume) → (push) → Storage system]

The streaming ingestion process can be used to do simple transformations

Page 28: Streaming architecture patterns

Ingestion and/or Transformation

1. Zero Transformation
  – No transformation, plain ingest, no schema validation
  – Keep the original format - SequenceFiles, Text, etc.
  – Allows storing data that may have errors in the schema
2. Format Transformation
  – Simply change the format of a field, for example
  – Structured format, e.g. Avro
  – Which does schema validation
3. Enrichment Transformation
  – Atomic
  – Contextual

Page 29: Streaming architecture patterns

#3 - Enrichment transformations

Atomic
• Need to work with one event at a time
• Mask a credit card number
• Add processing time or offset to the record

Contextual
• Need to refer to external context
• Example - convert zip code to state, by looking up a cache
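A minimal sketch of the two kinds of enrichment, using the examples above. The event fields (`card_number`, `zip`) and the `ZIP_TO_STATE` table are hypothetical; the table stands in for whatever cache holds the context.

```python
import time

# Hypothetical lookup table standing in for a cache of dimension data.
ZIP_TO_STATE = {"94105": "CA", "10001": "NY"}


def mask_card(event):
    """Atomic: needs nothing beyond the event itself."""
    card = event["card_number"]
    masked = "*" * (len(card) - 4) + card[-4:]
    return {**event, "card_number": masked}


def add_processing_time(event, now=None):
    """Atomic: stamps the record with the time it was processed."""
    return {**event, "processed_at": time.time() if now is None else now}


def add_state(event, zip_to_state=ZIP_TO_STATE):
    """Contextual: needs a lookup outside the event (zip code -> state)."""
    return {**event, "state": zip_to_state.get(event["zip"], "UNKNOWN")}


event = {"card_number": "4111111111111111", "zip": "94105"}
enriched = add_state(add_processing_time(mask_card(event), now=0.0))
```

The atomic functions can run anywhere with no setup, while `add_state` only works where the lookup table is reachable, which is the storage question the following slides address.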

Page 30: Streaming architecture patterns

Atomic transformations

• Require no context
• All streaming engines support it

Page 31: Streaming architecture patterns

Contextual transformations

• Well supported by many streaming engines
• Need to store the context somewhere

Page 32: Streaming architecture patterns

Where to store the context

1. Locally Broadcast Cached Dim Data
  – Local to Process (On Heap, Off Heap)
  – Local to Node (Off Process)
2. Partitioned Cache
  – Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)

Page 33: Streaming architecture patterns

#1a - Locally broadcast cached data

Could be On heap or Off heap

Page 34: Streaming architecture patterns

#1b - Off process cached data

Data is cached on the node, outside of the process, potentially in an external system like RocksDB

Page 35: Streaming architecture patterns

#2 - Partitioned cache data

Data is partitioned based on field(s) and then cached
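One way to picture a partitioned cache, as a rough sketch: a stable hash of the partitioning field decides which partition holds each key, so every partition caches only its own slice. `PartitionedCache`, `partition_for` and the partition count are hypothetical names chosen for this illustration.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # CRC32 is stable across processes and restarts, so the same key
    # always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedCache:
    """Each partition caches only the keys that hash to it."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.partitions = [{} for _ in range(num_partitions)]

    def put(self, key, value):
        self.partitions[partition_for(key, len(self.partitions))][key] = value

    def get(self, key):
        return self.partitions[partition_for(key, len(self.partitions))].get(key)


cache = PartitionedCache()
for zip_code, state in [("94105", "CA"), ("10001", "NY"), ("60601", "IL")]:
    cache.put(zip_code, state)
```

In a real engine the incoming events would be shuffled with the same function, so that an event and the context it needs meet on the same partition.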

Page 36: Streaming architecture patterns

#3 - External fetch

Data fetched from external system

Page 37: Streaming architecture patterns

A combination (partitioned cache + external)
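A combination of a cache and an external fetch can be sketched as a read-through cache: look locally first, and go to the external system only on a miss. `ExternalStore` below is a hypothetical in-memory stand-in for a system such as HBase or Memcached, not a client for either.

```python
class ExternalStore:
    """Hypothetical stand-in for an external system such as HBase or Memcached."""

    def __init__(self, data):
        self.data = dict(data)
        self.fetches = 0  # counts round trips, to show what the cache saves

    def fetch(self, key):
        self.fetches += 1
        return self.data.get(key)


class ReadThroughCache:
    def __init__(self, store):
        self.store = store
        self.local = {}

    def get(self, key):
        # Miss: fetch once from the external system and remember the answer.
        if key not in self.local:
            self.local[key] = self.store.fetch(key)
        return self.local[key]


store = ExternalStore({"94105": "CA", "10001": "NY"})
cache = ReadThroughCache(store)
states = [cache.get(z) for z in ["94105", "94105", "10001", "94105"]]
```

Four lookups cost two external fetches here. The sketch leaves out eviction and invalidation, which a real deployment would need once the context can change.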

Page 38: Streaming architecture patterns

Anomaly detection using contextual transformations

Page 39: Streaming architecture patterns

Storage systems

When to use which one?

Page 40: Streaming architecture patterns

Storage Considerations

• Throughput
• Access Patterns
  – Scanning
  – Indexed
  – Reversed Indexed
• Transaction Level
  – Record/Document
  – File

Page 41: Streaming architecture patterns

File Level

• HDFS
• S3

Page 42: Streaming architecture patterns

NoSQL

• HBase
• Cassandra
• MongoDB

Page 43: Streaming architecture patterns

Search

• Solr
• Elasticsearch

Page 44: Streaming architecture patterns

NoSQL-SQL

• Kudu

Page 45: Streaming architecture patterns

Streaming engines

Comparison

Page 46: Streaming architecture patterns

© Cloudera, Inc. All rights reserved.

Tricks With Producers

• Send Source ID (requires partitioning in Kafka)
• Seq
• UUID
• UUID plus time
• Partition on SourceID
• Watch out for repartitions and partition failovers
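A rough sketch of how these producer-side tricks fit together: each record carries its source ID, a per-source sequence number and a UUID, and the partition is derived from the source ID so that one source's records stay in order. The `Producer` class only builds records and does not talk to Kafka; its name, fields and the partition count are hypothetical.

```python
import itertools
import uuid
import zlib


class Producer:
    """Hypothetical producer wrapper; it only builds records, it does not send."""

    def __init__(self, source_id, num_partitions):
        self.source_id = source_id
        self.num_partitions = num_partitions
        self.seq = itertools.count()  # per-source sequence number

    def build(self, payload):
        key = self.source_id.encode("utf-8")
        return {
            "source_id": self.source_id,
            "seq": next(self.seq),
            "uuid": str(uuid.uuid4()),
            "payload": payload,
            # Partition on source ID: all records of a source share a partition.
            "partition": zlib.crc32(key) % self.num_partitions,
        }


def find_problems(records):
    """Consumer-side check: per source, report duplicates and gaps in seq.

    Assumes records of one source arrive in order, which is what
    partitioning on the source ID is meant to provide.
    """
    last_seen, duplicates, gaps = {}, [], []
    for r in records:
        src, seq = r["source_id"], r["seq"]
        prev = last_seen.get(src, -1)
        if seq <= prev:
            duplicates.append((src, seq))
            continue
        if seq > prev + 1:
            gaps.append((src, prev + 1, seq - 1))
        last_seen[src] = seq
    return duplicates, gaps


producer = Producer("sensor-7", num_partitions=8)
r0, r1, r2 = (producer.build(p) for p in ["a", "b", "c"])
duplicates, gaps = find_problems([r0, r1, r1, r2])
```

The in-order assumption is exactly what a repartition or a partition failover can break, which is one reading of the warning in the last bullet.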

Page 47: Streaming architecture patterns

Streaming Engines

• Consumer
• Flume, KafkaConnect, Streaming Engine
• Storm
• Spark Streaming
• Flink
• Kafka Streams

Page 48: Streaming architecture patterns

Consumer: Flume, KafkaConnect

• Simple and works
• Low latency
• High throughput
• Interceptors
• Transformations
• Alerting
• Ingestion

Page 49: Streaming architecture patterns

Consumer: Streaming Engines

• Not so great at HDFS ingestion
• But great for record storage systems
  • HBase
  • Cassandra
  • Kudu
  • Solr
  • Elasticsearch

Page 50: Streaming architecture patterns

Storm

• Old gen
• Low latency
• Low throughput
• At least once
• Around forever
• Topology based

Page 51: Streaming architecture patterns

Spark Streaming

• The juggernaut
• Higher latency
• High throughput
• Exactly once
• SQL
• MLlib
• Highly used
• Easy to debug/unit test
• Easy to transition from batch
• Flow language
• 600 commits in a month and about 100 meetups

Page 52: Streaming architecture patterns

Spark Streaming

[Diagram: DStreams as sequences of RDDs. In each batch (first batch, second batch), Source → Receiver → RDD, followed by a single pass of Filter → Count → Print.]

Page 53: Streaming architecture patterns

Spark Streaming

[Diagram: DStreams as sequences of RDDs with RDD partitions. For the pre-first, first and second batches, Source → Receiver → RDD, followed by a single pass of Filter → Count → Print. Additional labels: Stateful RDD 1, Stateful RDD 2.]

Page 54: Streaming architecture patterns

Flink

• “I’m better than Spark, why doesn’t anyone use me?”
• Very much like Spark but not as feature rich
• Lower latency
• Micro batch -> ABS
  • Asynchronous Barrier Snapshotting
• Flow language
• ~1/6th the commits and meetups
• But Slim loves it ☺

Page 55: Streaming architecture patterns

Flink - ABS

[Diagram: an operator and its buffer]

Page 56: Streaming architecture patterns

Flink - ABS

[Diagram: two operators, each with a buffer. Labels: “Barrier 1A hit”, “Barrier 1B still behind”.]

Page 57: Streaming architecture patterns

Flink - ABS

[Diagram: two operators, each with a buffer. Labels on one: “Barrier 1A hit”, “Barrier 1B still behind”. Labels on the other: “Both barriers hit”, “Check point”.]

Page 58: Streaming architecture patterns

Flink - ABS

[Diagram: two operators, each with a buffer. Labels: “Both barriers hit”, “Check point”, “Barrier is combined and can move on”, “Buffer can be flushed out”.]
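The barrier sequence in these ABS slides can be approximated with a toy operator. This is a simplified sketch of barrier alignment, not Flink's implementation: it assumes one barrier in flight at a time, and `AligningOperator` and `BARRIER` are hypothetical names.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker; real barriers carry a checkpoint ID


class AligningOperator:
    """Sums integers from several input channels, snapshotting at barriers."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.blocked = set()   # channels whose barrier has already arrived
        self.buffer = deque()  # records held back from blocked channels
        self.total = 0         # the operator's state
        self.checkpoints = []  # state snapshots, one per completed barrier

    def on_event(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.channels:  # both barriers hit
                self._checkpoint()
        elif channel in self.blocked:
            # This record comes after the barrier: hold it so it does not
            # leak into the snapshot while other barriers are still behind.
            self.buffer.append(item)
        else:
            self.total += item

    def _checkpoint(self):
        self.checkpoints.append(self.total)  # snapshot the state
        self.blocked.clear()                 # the combined barrier moves on
        while self.buffer:                   # the buffer can be flushed out
            self.total += self.buffer.popleft()


op = AligningOperator(channels=["A", "B"])
events = [
    ("A", 1), ("B", 2),
    ("A", BARRIER),  # barrier 1A hit, barrier 1B still behind
    ("A", 10),       # buffered: belongs after the checkpoint
    ("B", 3),        # still before barrier 1B, so processed normally
    ("B", BARRIER),  # both barriers hit: checkpoint, then flush
    ("B", 4),
]
for channel, item in events:
    op.on_event(channel, item)
```

The snapshot holds 6 (1 + 2 + 3), everything before the barriers and nothing after them, while processing continues to a final total of 20.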

Page 59: Streaming architecture patterns

Kafka Streams

• The new kid on the block
• When you only have Kafka
• Low latency
• High throughput
• Not exactly once
• Very young
• Flow language
• Very different hardware profile than others
• Not widely supported
• Not widely used
• Worries about separation of concerns

Page 60: Streaming architecture patterns

Summary about Engines

• Ingestion
  • Flume and KafkaConnect
• Super real time and special
  • Consumer
• Counting, MLlib, SQL
  • Spark
• Maybe future and cool
  • Flink and KafkaStreams
• Odd man out
  • Storm

Page 61: Streaming architecture patterns

Abstractions

[Diagram: abstraction layers and the streaming engines]

• Code abstractions: Beam
• SQL abstraction: SQL
• UI abstraction: StreamSets
• Streaming engines

Page 62: Streaming architecture patterns

Counting

Page 63: Streaming architecture patterns

Streaming and Counting

• Counting is easy, right?
• Back to only once

Page 64: Streaming architecture patterns

We started with Lambda

[Diagram: Lambda architecture, with the labels Pipe, Speed Layer, Batch Layer, Persist Results, Speed Results, Batch Results and Serving Layer]

Page 65: Streaming architecture patterns

Why did Streaming Suck

• Increments with Cassandra
  • Double increment
  • No strong consistency
• Storm without Kafka
  • Not only once
  • Not at least once
• Batch would have to re-process EVERY record to remove dups

Page 66: Streaming architecture patterns

We have come a long way

• We don’t have to use increments any more and we can have consistency
  • HBase
• We can have state in our streaming platform
  • Spark Streaming
• We don’t lose data
  • Spark Streaming
  • Kafka
  • Other options
• Full universe of deduping
  • Again HBase with versions
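One reading of the move away from increments, as a sketch: an increment that is retried after a lost acknowledgement counts twice, while a put of a running total held as streaming state can be repeated safely. The `Store` class is a hypothetical in-memory stand-in, not an HBase or Cassandra client, and the `retried` argument simulates which writes get sent twice.

```python
class Store:
    """Hypothetical key-value store standing in for HBase or Cassandra."""

    def __init__(self):
        self.rows = {}

    def increment(self, key, delta):
        self.rows[key] = self.rows.get(key, 0) + delta

    def put(self, key, value):
        self.rows[key] = value


def write_with_increments(store, key, deltas, retried):
    for i, delta in enumerate(deltas):
        store.increment(key, delta)
        if i in retried:  # ack lost, so the same write is sent again
            store.increment(key, delta)  # double increment


def write_with_puts(store, key, deltas, retried):
    total = 0  # running total kept as state in the streaming job
    for i, delta in enumerate(deltas):
        total += delta
        store.put(key, total)
        if i in retried:
            store.put(key, total)  # same value again: harmless


deltas = [5, 3, 2]

inc_store = Store()
write_with_increments(inc_store, "clicks", deltas, retried={1})

put_store = Store()
write_with_puts(put_store, "clicks", deltas, retried={1})
```

With one retried write, the increment path ends at 13 instead of 10, while the put path stays correct because writing the same total twice changes nothing.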

Page 67: Streaming architecture patterns

Increments

Page 68: Streaming architecture patterns

Puts with State

Page 69: Streaming architecture patterns

Advanced streaming

When to use which one?

Page 70: Streaming architecture patterns

Advanced Streaming

• Ad-hoc will identify value
• Ad-hoc will become batch
• The value will demand less latency on batch
• Batch will become streaming

Page 71: Streaming architecture patterns

Advanced Streaming

• Requirements for ideal batch-to-streaming frameworks
  • Something that can span both paradigms
  • Something that can use the tools of ad-hoc
    • SQL
    • MLlib
    • R
    • Scala
    • Java
  • Development through a common IDE
  • Debugging
  • Unit testing
  • Common deployment model

Page 72: Streaming architecture patterns

Advanced Streaming

• In Spark Streaming
  • A DStream is a collection of RDDs with respect to micro-batch intervals
• If we can access RDDs in Spark Streaming
  • We can convert to Vectors
    • KMeans
    • Principal component analysis
  • We can convert to LabeledPoint
    • NaiveBayes
    • Random Forest
    • Linear Support Vector Machines
  • We can convert to DataFrames
    • SQL
    • R
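The micro-batch idea can be pictured without Spark at all. The sketch below is a plain-Python analogy and does not use the Spark API: lists stand in for RDDs, `micro_batches` is a hypothetical helper that groups timestamped events by interval, and `filter_count` is an ordinary batch function applied to each micro-batch.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into per-interval batches, in order."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    return [batches[k] for k in sorted(batches)]


def filter_count(batch, threshold=10):
    # An ordinary batch function: it knows nothing about streaming.
    return sum(1 for v in batch if v >= threshold)


events = [(0.2, 5), (0.9, 12), (1.1, 30), (1.7, 8), (2.5, 11)]
batches = micro_batches(events, interval=1.0)
counts = [filter_count(b) for b in batches]
```

Because each batch is just a collection, any tool that works on a collection can be pointed at it, which is the argument the bullets make for reusing the batch and ad-hoc libraries on a stream. Intervals with no events are simply skipped in this sketch.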

Page 73: Streaming architecture patterns

Wrap-up

Page 74: Streaming architecture patterns

Our original goal

Understand common use-cases for streaming and their architectures

Page 75: Streaming architecture patterns

Common streaming use-cases

• Ingestion
  – Transformation
• Counting
  – Lambda, etc.
• Advanced streaming

Page 76: Streaming architecture patterns

Thank you!

Mark Grover | @mark_grover

Ted Malaska | @TedMalaska

@hadooparchbook

hadooparchitecturebook.com

Page 77: Streaming architecture patterns

Transformations with context