Designing Agile Data Pipelines
Ashish Singh | Software Engineer, Cloudera

TRANSCRIPT

Page 1:

Designing Agile Data Pipelines

Ashish Singh | Software Engineer, Cloudera

Page 2:

About Me
•  Software Engineer @ Cloudera
•  Contributed to Kafka, Hive, Parquet and Sentry
•  Used to work in HPC
•  @singhasdev

Page 3:

“Big Data” is stuck at The Lab.

Page 4:

We want to move to The Factory

Page 6:

What does it mean to “Systemize”?
•  Ability to easily add new data sources
•  Ability to easily improve and expand analytics
•  Ease of data access through standardized metadata and storage
•  Ability to discover mistakes and to recover from them
•  Ability to safely experiment with new approaches

Page 7:

We will discuss:
•  Architectures
•  Patterns
•  Ingest
•  Storage
•  Schemas
•  Metadata
•  Streaming
•  Experimenting
•  Recovery

We will not discuss:
•  Actual decision making
•  Data Science
•  Machine learning
•  Algorithms

Page 8:

So how do we build real data architectures?

Page 9:

The Data Bus

Page 10:

Data pipelines start like this:
[diagram: a single client sending data to a single backend]

Page 11:

Then we reuse them:
[diagram: several clients sending data to the same backend]

Page 12:

Then we add multiple backends:
[diagram: several clients sending data to the original backend and to another backend]

Page 13:

Then it starts to look like this:
[diagram: many clients wired point-to-point to many backends]

Page 14:

With maybe some of this:
[diagram: the same clients and backends, with the point-to-point connections growing into an unmanageable tangle]

Page 15:

Adding applications should be easier. We need:
•  Shared infrastructure for sending records
•  Infrastructure that must scale
•  A set of agreed-upon record schemas

Page 16:

Kafka-Based Ingest Architecture
[diagram: source systems act as producers publishing into Kafka brokers, while Hadoop, security systems, real-time monitoring, and the data warehouse act as consumers]

Kafka decouples data pipelines.
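To make the decoupling concrete, here is a minimal, hedged sketch of a source system publishing records to a Kafka topic from Scala using the standard Java producer client. The broker address, topic name, and record contents are illustrative assumptions, not something specified in the deck.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object OrdersProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // illustrative broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The producer only needs to know the topic name, not who will consume the records.
    producer.send(new ProducerRecord[String, String]("orders", "order-42", """{"id":42,"amount":19.99}"""))
    producer.close()
  }
}

Consumers (Hadoop, monitoring, the data warehouse) can then be added or removed without touching the producers.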

Page 17:

Retain All Data

Page 18:

Data Pipeline – The Traditional View
[diagram: raw data is cleaned, enriched, and aggregated; only the input and the final output are treated as worth keeping, and the intermediate datasets are dismissed as a waste of disk space]

Page 19:

It is all valuable data
[diagram: raw, clean, enriched, aggregated, and filtered data each feed something downstream: dashboards, reports, data scientists, and alerts]

Page 20:

Hadoop-Based ETL – The FileSystem is the DB

/user/…
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
/data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated

Page 21:

Store intermediate data

/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=20131101
/etl/pharmacy/fraud/orders/deduped/date=20131101
/etl/pharmacy/fraud/orders/validated/date=20131101
/etl/pharmacy/fraud/orders_labs/merged/date=20131101
/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101
/etl/pharmacy/fraud/orders_labs/ranked/date=20131101
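A small helper for composing these paths keeps every job on the same layout. The sketch below is only illustrative; the case class and method names are assumptions, not part of the deck.

// Hypothetical helper for the /etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id> layout.
case class EtlDataset(bizUnit: String, app: String, dataset: String) {
  def stagePath(stage: String, datasetId: String): String =
    s"/etl/$bizUnit/$app/$dataset/$stage/$datasetId"
}

object EtlPaths {
  def main(args: Array[String]): Unit = {
    val orders = EtlDataset("pharmacy", "fraud", "orders")
    println(orders.stagePath("raw", "date=20131101"))        // /etl/pharmacy/fraud/orders/raw/date=20131101
    println(orders.stagePath("validated", "date=20131101"))  // /etl/pharmacy/fraud/orders/validated/date=20131101
  }
}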


Page 22:

Batch ETL is old news

Page 23:

Small Problem!
•  HDFS is optimized for large chunks of data
•  Don’t write individual events or micro-batches
•  Think 100MB-2GB batches
•  What do we do with small events?

Page 24:

Well, we have this data bus…
[diagram: a Kafka topic with three partitions; each partition is an ordered, append-only sequence of offsets, with old records at the head and new writes appended at the tail]
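One way to reconcile small events with HDFS's preference for large files is to let Kafka absorb the events and have a consumer flush them in big batches. The sketch below illustrates the idea in Scala, assuming the standard Kafka Java consumer and the Hadoop FileSystem API; the topic name, thresholds, and output path are invented for the example, and offset handling is omitted.

import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.kafka.clients.consumer.KafkaConsumer

object BatchingSink {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // illustrative
    props.put("group.id", "hdfs-batcher")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("orders"))

    val fs = FileSystem.get(new Configuration())
    val buffer = new StringBuilder
    val flushChars = 128L * 1024 * 1024  // aim for large files, not one tiny file per event

    while (true) {
      for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
        buffer.append(record.value()).append('\n')

      if (buffer.length >= flushChars) {
        // Write one large file per flush instead of one file per event.
        val out = fs.create(new Path(s"/etl/pharmacy/fraud/orders/raw/batch-${System.currentTimeMillis()}"))
        out.write(buffer.toString.getBytes("UTF-8"))
        out.close()
        buffer.clear()
      }
    }
  }
}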

Page 25:

Kafka has topics. How about:

<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged
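Under this convention every pipeline stage gets its own topic, so standing up a new stage is mostly a matter of creating the next topic in the chain. A hedged sketch using Kafka's AdminClient follows; the broker address, partition count, and replication factor are assumptions for the example.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

object CreateStageTopics {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // illustrative
    val admin = AdminClient.create(props)

    // One topic per <biz unit>.<app>.<dataset>.<stage>
    val stages = Seq("raw", "deduped", "validated")
    val topics = stages.map(stage => new NewTopic(s"pharmacy.fraud.orders.$stage", 3, 1.toShort))

    admin.createTopics(topics.asJava).all().get()
    admin.close()
  }
}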


Page 26:

It’s (almost) all topics
[diagram: the data flow from page 19, with the raw, clean, enriched, aggregated, and filtered datasets each living in a Kafka topic; only the end consumers (dashboards, reports, data scientists, alerts) sit outside the bus]

Page 27:

Benefits
•  Recover from accidents
•  Debug suspicious results
•  Fix algorithm errors
•  Experiment with new algorithms

Page 28:

Kinda Lambda

Page 29:

Lambda Architecture
•  Immutable events
•  Store intermediate stages
•  Combine batches and streams
•  Reprocessing

Page 30:

What we don’t like:
Maintaining two applications, often in two languages, that do the same thing.

Page 31:

Pain Avoidance #1 – Use Spark + Spark Streaming
•  Spark is awesome for batch, so why not?
  –  The new kid that isn’t that new anymore
  –  Easily 10x less code
  –  Extremely easy and powerful API
  –  Very good for machine learning
  –  Scala, Java, and Python
  –  RDDs
  –  DAG engine
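As a point of reference for how compact the batch side can be, here is a hedged sketch of a Spark batch job that counts error lines in one day's worth of data; the input path and the definition of an "error line" are assumptions for the example.

import org.apache.spark.{SparkConf, SparkContext}

object BatchErrorCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchErrorCount")
    val sc = new SparkContext(conf)

    // Hypothetical input: one of the /etl/... stage directories from page 21.
    val lines = sc.textFile("/etl/pharmacy/fraud/orders/raw/date=20131101")
    val errorCount = lines.filter(_.contains("ERROR")).count()

    println(s"Errors in batch: $errorCount")
    sc.stop()
  }
}

The streaming version on page 34 does essentially the same work, just on a DStream instead of a single RDD.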


Page 32:

Spark Streaming
•  Calling Spark in a loop
•  Extends RDDs with DStream
•  Very little code changes from ETL to streaming

Page 33:

Spark Streaming
[diagram: a receiver turns the source into a series of RDDs; in each batch interval (pre-first batch, first batch, second batch) the same single-pass Filter → Count → Print job runs over the newly arrived RDDs]

Page 34:

Small Example

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))
// updateStateByKey below requires a checkpoint directory (illustrative path)
ssc.checkpoint("/tmp/checkpoints")

// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)

// Count the errors in each RDD in the stream
// (ErrorCount.countErrors and updateFunc are helpers defined elsewhere in the example)
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)

errCountStream.foreachRDD(rdd => {
  println("Errors this batch: %d".format(rdd.first()._2))
})

ssc.start()
ssc.awaitTermination()

Page 35:

Pain Avoidance #2 – Split the Stream
Why do we even need stream + batch?
•  Batch efficiencies
•  Re-process to fix errors
•  Re-process after delayed arrival

What if we could re-play data?
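With Kafka we can: a consumer can rewind to the beginning of a topic and re-read whatever the brokers still retain. The sketch below is an illustrative, hedged example of replaying a topic from the earliest retained offset under a fresh consumer group; the topic, group, and broker names are assumptions.

import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ReplayTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // illustrative
    props.put("group.id", "orders-reprocess-v2")    // fresh group, so no committed offsets apply
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("pharmacy.fraud.orders.raw"))

    consumer.poll(Duration.ofSeconds(1))             // join the group and receive partition assignments
    consumer.seekToBeginning(consumer.assignment())  // rewind to the oldest retained records

    while (true) {
      for (record <- consumer.poll(Duration.ofSeconds(1)).asScala) {
        // Run the new version of the algorithm over the replayed history.
        println(s"reprocessing offset ${record.offset()}: ${record.value()}")
      }
    }
  }
}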


Page 36:

Kafka + Stream Processing
[diagram: a streaming app (v1) reads the Kafka topic and produces result set 1, which an application consumes]

Page 37:

Let’s re-process with a new algorithm
[diagram: streaming app v2 re-reads the same Kafka topic from the beginning and produces result set 2 alongside result set 1]

Page 38:

Let’s re-process with a new algorithm
[diagram: the same setup as page 37; with both result sets available, the consuming application can switch to result set 2 once it is ready]

Page 39:

Oh no, we just got a bunch of data for yesterday!
[diagram: two instances of the streaming app read from the same Kafka topic, one processing today’s data and one re-processing yesterday’s late arrivals]

Page 40:

Note:

No need to choose between the approaches. There are good reasons to do both.

Page 41:

Prediction:

The batch vs. streaming distinction is going away.

Page 42:

Yes, you really need a Schema

Page 43:

Schema is a MUST HAVE for data integration

Page 44:

[diagram: the point-to-point tangle of clients and backends from page 13, shown again]

Page 45:

Remember that we want this?
[diagram: the Kafka-based ingest architecture from page 16: source systems as producers, Kafka brokers in the middle, and Hadoop, security systems, real-time monitoring, and the data warehouse as consumers]

Page 46:

This means we need this:
[diagram: the same architecture, with a schema repository alongside Kafka so that producers and consumers agree on record formats]

Page 47:

We can do it in a few ways:
•  People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”
•  There’s utility code for reading/writing messages that everyone reuses
•  Schema embedded in the message
•  A centralized repository for schemas
  –  Each message has a schema ID
  –  Each topic has a schema ID
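With a centralized repository, each serialized message typically carries a small schema ID instead of the full schema, and consumers use the ID to look the schema up. The sketch below shows one common framing (a magic byte, a 4-byte schema ID, then the payload); the exact layout is an illustrative assumption, not something the deck prescribes.

import java.nio.ByteBuffer

object SchemaFraming {
  val MagicByte: Byte = 0x0

  // Prepend the schema ID so consumers can fetch the right schema from the repository.
  def frame(schemaId: Int, payload: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 4 + payload.length)
    buf.put(MagicByte).putInt(schemaId).put(payload)
    buf.array()
  }

  // Split a framed message back into (schema ID, payload).
  def unframe(message: Array[Byte]): (Int, Array[Byte]) = {
    val buf = ByteBuffer.wrap(message)
    require(buf.get() == MagicByte, "unknown framing")
    val schemaId = buf.getInt()
    val payload = new Array[Byte](buf.remaining())
    buf.get(payload)
    (schemaId, payload)
  }
}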


Page 48:

I ❤ Avro
•  Define schemas
•  Generate code for objects
•  Serialize / deserialize into bytes or JSON
•  Embed the schema in files / records… or not
•  Support for our favorite languages… except Go
•  Schema evolution
  –  Add and remove fields without breaking anything
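As a concrete illustration, here is a hedged sketch that defines a small Avro schema and serializes/deserializes a record with Avro's generic API from Scala; the Order record and its fields are invented for the example.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroRoundTrip {
  // Invented schema for the example; real schemas would live in the schema repository.
  val schemaJson =
    """{"type":"record","name":"Order","namespace":"pharmacy.fraud",
      | "fields":[{"name":"id","type":"long"},{"name":"amount","type":"double"}]}""".stripMargin
  val schema = new Schema.Parser().parse(schemaJson)

  def main(args: Array[String]): Unit = {
    // Build a record and serialize it to bytes.
    val record = new GenericData.Record(schema)
    record.put("id", 42L)
    record.put("amount", 19.99)

    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()

    // Deserialize the bytes back into a record using the same schema.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val readBack = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
    println(readBack.get("id"))  // 42
  }
}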


Page 49:

Schemas are Agile
•  Schemas allow adding readers and writers easily
•  Schemas allow modifying readers and writers independently
•  Schemas can evolve as the system grows
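To show what "evolve as the system grows" looks like in practice, here is a hedged sketch of Avro schema resolution: a reader whose schema adds a field with a default can still decode records written with the older Order schema from the previous sketch. The added field is invented for the illustration.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

object SchemaEvolution {
  // Newer reader schema: adds "channel" with a default, so old records still decode.
  val readerSchema = new Schema.Parser().parse(
    """{"type":"record","name":"Order","namespace":"pharmacy.fraud",
      | "fields":[{"name":"id","type":"long"},
      |           {"name":"amount","type":"double"},
      |           {"name":"channel","type":"string","default":"unknown"}]}""".stripMargin)

  def readOldRecord(oldBytes: Array[Byte], writerSchema: Schema): GenericRecord = {
    // Avro resolves writer schema -> reader schema; missing fields get their declared defaults.
    val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
    reader.read(null, DecoderFactory.get().binaryDecoder(oldBytes, null))
  }
}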



Page 51:

Woah, that was lots of stuff!

Page 52:

Recap – if you remember nothing else…
•  After the POC, it’s time for production
•  Goal: evolve fast without breaking things

For this you need to:
•  Keep all data
•  Design the pipeline for error recovery (batch or stream)
•  Integrate with a data bus
•  And use schemas

Page 53:

Thank you