Page 1: LinkedIn-Teradata Summit feb 25, 2015

Stream Processing with Samza

Navina Ramesh

DDS, Data Infrastructure

February 25, 2015

Outline

• Introduction

• Use Cases at LinkedIn

• Architecture & Concepts

Response latency

[Figure: latency spectrum, from Synchronous (0 ms), through Stream Processing (milliseconds to minutes), to Later (possibly much later)]

Use cases @ LinkedIn

• Data standardization platform (Project “Waterloo”)

• Call graph assembly

• Metrics & Monitoring

Call graph assembly

Step                         Map-reduce/Hadoop   Samza
Filter/redirect records      Mapper              Repartition job
Process the grouped records  Reducer             Aggregation job
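The two stages in the table can be sketched in plain Java without any Samza dependency. This is only an illustration under stated assumptions: the `CallEvent` type, its `requestId` and `service` fields, and the hash-based routing are hypothetical and do not come from the slides. The first method plays the repartition role (route each record by key), the second plays the aggregation role (process the records grouped under each key).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-stage pipeline: repartition by key, then aggregate per key. */
final class CallGraphSketch {
    /** One service-call event; requestId (assumed) ties together the calls of one request. */
    static final class CallEvent {
        final String requestId;
        final String service;

        CallEvent(String requestId, String service) {
            this.requestId = requestId;
            this.service = service;
        }
    }

    /** Stage 1 (the mapper role): route each event to a partition chosen by its key. */
    static List<List<CallEvent>> repartition(List<CallEvent> events, int numPartitions) {
        List<List<CallEvent>> out = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            out.add(new ArrayList<>());
        }
        for (CallEvent e : events) {
            out.get(Math.floorMod(e.requestId.hashCode(), numPartitions)).add(e);
        }
        return out;
    }

    /** Stage 2 (the reducer role): group one partition's events by request. */
    static Map<String, List<String>> aggregate(List<CallEvent> partition) {
        Map<String, List<String>> callsByRequest = new LinkedHashMap<>();
        for (CallEvent e : partition) {
            callsByRequest.computeIfAbsent(e.requestId, k -> new ArrayList<>()).add(e.service);
        }
        return callsByRequest;
    }
}
```

Because every event for a given key lands in the same partition, each aggregation instance can work on its partition independently of the others.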

Samza Concepts & Architecture

• Streams

• Tasks

• Jobs

• Stateful Stream Processing

Streams

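[Figure: a stream split into three partitions, each an ordered sequence of messages; the next append goes to the end of a partition]

Partition 0: 1 2 3 4 5 6
Partition 1: 1 2 3 4 5
Partition 2: 1 2 3 4 5 6 7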
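A minimal in-memory model of the figure, assuming each partition is an append-only list and that a message's position in its partition serves as its offset. The class name, the key-hash routing, and the 0-based offsets are assumptions made for illustration (the figure numbers messages from 1); this is not Samza or Kafka code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of a partitioned, append-only stream. */
final class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Picks a partition from the key's hash so equal keys share a partition (assumed policy). */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends to the end of the key's partition and returns the new message's offset. */
    long append(String key, String message) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(message);
        return partition.size() - 1;
    }

    /** Reads the message stored at an offset within one partition. */
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    /** The offset the next append to this partition would receive. */
    long nextOffset(int partition) {
        return partitions.get(partition).size();
    }
}
```

Ordering is only defined within a partition in this model; nothing relates offset 3 of one partition to offset 3 of another.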

Tasks

[Figure: Partition 0 and Task 1]

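Tasks

import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.*;
import org.apache.samza.task.*;

class PageKeyViewsCounterTask implements StreamTask {
  // Output stream; the system and stream names here are placeholders.
  private final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
  // In-memory view counts per page key.
  private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    // Create the counter the first time a page key is seen, then increment it.
    int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}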

Tasks

[Figure: PageKeyViewsCounterTask; input Page Views - Partition 0 (messages 1 2 3 4); Output Count Stream; partitions labeled Partition 0 and Partition 1]

Tasks

[Figure: PageKeyViewsCounterTask; input Page Views - Partition 0 (messages 1 2 3 4); Output Count Stream; Checkpoint Stream (Partition 1, value 2); partitions labeled Partition 0 and Partition 1]
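One way to read the checkpoint figure is that the task periodically records how far it has read in its input partition, so a restarted task can resume from that point. The sketch below assumes that reading; `CheckpointStore`, the checkpoint interval, and the simulated crash are all hypothetical and stand in for the checkpoint stream rather than reproducing Samza's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical counter task that checkpoints its input offset and resumes from it. */
final class CheckpointSketch {
    /** Stand-in for the checkpoint stream: holds the last offset the task recorded. */
    static final class CheckpointStore {
        private long lastProcessed = -1; // -1 means nothing has been checkpointed yet

        void write(long offset) { lastProcessed = offset; }

        long read() { return lastProcessed; }
    }

    /** In-memory counts; these are not checkpointed and vanish with the task instance. */
    final Map<String, Integer> counts = new HashMap<>();

    /**
     * Processes messages after the checkpointed offset, stopping after maxMessages
     * to imitate a crash. A checkpoint is written every checkpointInterval messages.
     */
    void run(List<String> partition, CheckpointStore store, int checkpointInterval, int maxMessages) {
        long start = store.read() + 1;
        int processed = 0;
        for (long offset = start; offset < partition.size() && processed < maxMessages; offset++) {
            counts.merge(partition.get((int) offset), 1, Integer::sum);
            processed++;
            if (processed % checkpointInterval == 0) {
                store.write(offset);
            }
        }
    }
}
```

In this sketch a message processed after the last checkpoint but before the crash is processed again on restart, and the in-memory counts start empty in the new instance, which is why state needs its own recovery path.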

Jobs

[Figure: AdViews and AdClicks streams; Task 1, Task 2, Task 3; AdClickThroughRate]

Stream Processing is Hard

• Partitioning

• Re-processing

• Failure semantics

• State

• Joins to services or database

• Non-determinism

Jobs

[Figure: AdViews and AdClicks streams; Task 1, Task 2, Task 3; AdClickThroughRate]

SELECT AdViews.id, COUNT(AdViews) views, COUNT(AdClicks) clicks, clicks/views ctr
FROM AdViews
LEFT JOIN AdClicks
ON AdViews.id = AdClicks.id
GROUP BY AdViews.id
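The query above can be approximated by a task that keeps per-ad counters as local state and updates them as messages arrive from either stream. This is a sketch only: the class and method names are hypothetical, and it assumes both streams are partitioned by ad id so that one task sees all views and clicks for a given ad.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical task that joins ad views and ad clicks by ad id using local state. */
final class AdClickThroughRateSketch {
    private final Map<String, Integer> views = new HashMap<>();
    private final Map<String, Integer> clicks = new HashMap<>();

    /** Called for each message from the AdViews stream. */
    void onView(String adId) {
        views.merge(adId, 1, Integer::sum);
    }

    /** Called for each message from the AdClicks stream. */
    void onClick(String adId) {
        clicks.merge(adId, 1, Integer::sum);
    }

    /** clicks / views for one ad; 0.0 when the ad has no recorded views. */
    double ctr(String adId) {
        int v = views.getOrDefault(adId, 0);
        if (v == 0) {
            return 0.0;
        }
        return clicks.getOrDefault(adId, 0) / (double) v;
    }
}
```

Unlike the batch query, the result here is a running value that changes with every incoming view or click.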

Stateful Tasks

[Figure: Stream A; Task 1, Task 2, Task 3; Stream B; Changelog Stream]
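The changelog stream in the figure suggests that each write to a task's local state is also appended to a stream, so the state can be rebuilt by replaying that stream. The sketch below assumes that behavior; the class, the `Change` entry type, and the in-memory list standing in for the changelog stream are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical key-value store that logs every write so it can be rebuilt. */
final class ChangelogStoreSketch {
    /** One changelog entry: the key and the value written for it. */
    static final class Change {
        final String key;
        final int value;

        Change(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Integer> local = new HashMap<>();
    private final List<Change> changelog; // stand-in for the changelog stream

    ChangelogStoreSketch(List<Change> changelog) {
        this.changelog = changelog;
        // Restore: replay the log in order; later entries overwrite earlier ones.
        for (Change c : changelog) {
            local.put(c.key, c.value);
        }
    }

    /** Writes locally and appends the same change to the changelog. */
    void put(String key, int value) {
        local.put(key, value);
        changelog.add(new Change(key, value));
    }

    /** Returns the current value for a key, or 0 if it was never written. */
    int get(String key) {
        return local.getOrDefault(key, 0);
    }
}
```

A replacement task constructed over the same changelog ends up with the same key-value contents as the task that failed, without contacting any external database.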

Resources

• What’s next?

– Support for SQL operators over streams

– Samza without YARN

• Get involved:

– Apache – http://samza.apache.org

– Dev Mailing List – [email protected]

– JIRA – https://issues.apache.org/jira/browse/SAMZA