LinkedIn - Teradata Summit, Feb 25, 2015

Post on 17-Jul-2015

Stream Processing with Samza

Navina Ramesh

DDS, Data Infrastructure

February 25, 2015

Outline

• Introduction

• Use Cases at LinkedIn

• Architecture & Concepts

Response latency

• Synchronous: 0 ms

• Stream Processing: milliseconds to minutes

• Later. Possibly much later.

Use cases @ LinkedIn

• Data standardization platform (Project “Waterloo”)

• Call graph assembly

• Metrics & Monitoring

Call graph assembly

Step                          Map-reduce/Hadoop   Samza
Filter/redirect records       Mapper              Repartition job
Process the grouped records   Reducer             Aggregation job
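
The mapping above can be sketched in plain Java, with no Samza or Hadoop dependency. This is only an illustration under stated assumptions: the class `CallGraphSketch`, the `CallEvent` type, and the `requestId` grouping key are hypothetical and do not come from the talk. The first stage plays the mapper's role by choosing a partition from the grouping key, and the second stage plays the reducer's role by grouping the records that arrive in one partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical two-stage pipeline; class, field, and method names are illustrative.
class CallGraphSketch {
    // One service-call event; requestId is an assumed grouping key.
    static final class CallEvent {
        final String requestId;
        final String service;

        CallEvent(String requestId, String service) {
            this.requestId = requestId;
            this.service = service;
        }
    }

    // Stage 1 (the mapper's role, a repartition job): pick the output partition
    // from the grouping key so all events of one request meet in one partition.
    static int repartition(CallEvent event, int partitionCount) {
        return Math.floorMod(event.requestId.hashCode(), partitionCount);
    }

    // Stage 2 (the reducer's role, an aggregation job): group one partition's
    // events by request, keeping arrival order within each request.
    static Map<String, List<String>> aggregate(List<CallEvent> partitionEvents) {
        Map<String, List<String>> callsByRequest = new TreeMap<>();
        for (CallEvent e : partitionEvents) {
            callsByRequest.computeIfAbsent(e.requestId, k -> new ArrayList<>()).add(e.service);
        }
        return callsByRequest;
    }
}
```

Because the partition depends only on the key, two events with the same `requestId` always reach the same aggregation task, whichever task emitted them.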

Samza Concepts & Architecture

• Streams

• Tasks

• Jobs

• Stateful Stream Processing

Streams

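[Figure: a stream split into Partition 0, Partition 1, and Partition 2. Each partition holds its own numbered sequence of messages, and the next append goes to the end of a partition.]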
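
A stream made of independent, append-only partitions can be modeled in a few lines of Java. This is a minimal sketch, not a Samza or Kafka API: the class `PartitionedStream` and its methods are hypothetical, and routing by key hash is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned stream; not a Samza or Kafka API.
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Route a message by key hash (an assumption), append it to the end of
    // that partition, and return the partition number.
    int append(String key, String message) {
        int partition = Math.floorMod(key.hashCode(), partitions.size());
        partitions.get(partition).add(message);
        return partition;
    }

    // Messages within one partition are read back by offset, in append order.
    String read(int partition, int offset) {
        return partitions.get(partition).get(offset);
    }

    // Number of messages in a partition, which is also the offset of the next append.
    int size(int partition) {
        return partitions.get(partition).size();
    }
}
```

Ordering holds only within a partition; the sketch makes no ordering promise across partitions.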

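Tasks

[Figure: Partition 0 feeds Task 1.]

import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.*;
import org.apache.samza.task.*;

class PageKeyViewsCounterTask implements StreamTask {
  // Output stream; the system and stream names here are placeholders.
  private final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
  private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    // Create the counter on the first view of a page key, then increment it.
    int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}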

Tasks

[Figure: PageKeyViewsCounterTask reads messages 1 to 4 from Page Views - Partition 0 and writes to the Output Count Stream, which has Partition 0 and Partition 1.]

Tasks

[Figure: PageKeyViewsCounterTask reads messages 1 to 4 from Page Views - Partition 0 and writes to the Output Count Stream (Partition 0 and Partition 1). A Checkpoint Stream holds the value 2.]
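
The role of the checkpoint can be illustrated with a small plain-Java model. It is a sketch under assumptions, not the Samza checkpoint API: the class `CheckpointingCounter` and its methods are hypothetical, and the checkpoint is modeled as a single integer holding the next input offset to process.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of offset checkpointing; not the Samza checkpoint API.
class CheckpointingCounter {
    private final Map<String, Integer> counts = new HashMap<>();
    private int nextOffset; // next input offset to process

    CheckpointingCounter(int checkpointedOffset) {
        // Resume where the last checkpoint left off.
        this.nextOffset = checkpointedOffset;
    }

    // Process messages from nextOffset up to (not including) endOffset.
    void run(List<String> partition, int endOffset) {
        for (; nextOffset < endOffset; nextOffset++) {
            counts.merge(partition.get(nextOffset), 1, Integer::sum);
        }
    }

    // The value that would be written to the checkpoint stream.
    int checkpoint() {
        return nextOffset;
    }

    int count(String pageKey) {
        return counts.getOrDefault(pageKey, 0);
    }
}
```

In this model a restarted task resumes from the checkpointed offset, but its in-memory counts start empty again. Anything processed after the last checkpoint would be read a second time, so each message is handled at least once rather than exactly once.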

Jobs

[Figure: the AdViews and AdClicks streams feed Task 1, Task 2, and Task 3, which produce AdClickThroughRate.]

Stream Processing is Hard

• Partitioning

• Re-processing

• Failure semantics

• State

• Joins to services or database

• Non-determinism

Jobs

[Figure: AdViews and AdClicks streams, Task 1 to Task 3, AdClickThroughRate output.]

SELECT AdViews.id, COUNT(AdViews) views, COUNT(AdClicks) clicks, clicks/views ctr
FROM AdViews
LEFT JOIN AdClicks
ON AdViews.id = AdClicks.id
GROUP BY AdViews.id
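
As a rough streaming counterpart to the query, the sketch below keeps per-ad view and click counters and derives the click-through rate on demand. It is plain Java and purely illustrative: the class `AdClickThroughRate` and the methods `onView`, `onClick`, and `ctr` are hypothetical names, not code from the talk or a Samza API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a click-through-rate computation; not code from the talk.
class AdClickThroughRate {
    // adId -> {views, clicks}
    private final Map<String, int[]> stats = new HashMap<>();

    // Called for each message from the views stream.
    void onView(String adId) {
        stats.computeIfAbsent(adId, k -> new int[2])[0]++;
    }

    // Called for each message from the clicks stream.
    void onClick(String adId) {
        stats.computeIfAbsent(adId, k -> new int[2])[1]++;
    }

    // clicks / views for one ad; 0.0 when the ad has not been viewed yet.
    double ctr(String adId) {
        int[] s = stats.get(adId);
        return (s == null || s[0] == 0) ? 0.0 : (double) s[1] / s[0];
    }
}
```

The counters are exactly the kind of per-task state that must survive a restart, which is why views and clicks for the same ad need to reach the same task.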

Stateful Tasks

[Figure: Task 1, Task 2, and Task 3 read Stream A and write Stream B.]

[Figure: the same tasks, Stream A and Stream B, with a Changelog Stream added.]
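
One way to picture a changelog is as a log of every state write that can be replayed to rebuild a task's local store. The sketch below models that idea in plain Java. It is an assumption-laden illustration, not the Samza key-value store API: the class `ChangelogBackedStore` is hypothetical, and the changelog is modeled as an in-memory list of key/value pairs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of changelog-backed task state; not the Samza KeyValueStore API.
class ChangelogBackedStore {
    private final Map<String, Integer> local = new HashMap<>();
    private final List<String[]> changelog; // each entry is {key, value}

    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
        // Restore: replay every logged write; later entries overwrite earlier ones.
        for (String[] entry : changelog) {
            local.put(entry[0], Integer.parseInt(entry[1]));
        }
    }

    // Update local state and log the write so it can be replayed later.
    void put(String key, int value) {
        local.put(key, value);
        changelog.add(new String[] {key, Integer.toString(value)});
    }

    int get(String key) {
        return local.getOrDefault(key, 0);
    }
}
```

Reads stay local and fast, while a replacement task can recover the same state by constructing a new store over the existing changelog.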

Resources

• What’s next?

– Support for SQL operators over streams

– Samza without YARN

• Get involved:

– Apache – http://samza.apache.org

– Dev Mailing List – dev@samza.apache.org

– JIRA – https://issues.apache.org/jira/browse/SAMZA
