Joining Infinity — Windowless Stream Processing with Flink Sanjar Akhmedov, Software Engineer, ResearchGate

Upload: flink-forward

Post on 08-Jan-2017


Page 1: Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink

Joining Infinity — Windowless Stream Processing with Flink

Sanjar Akhmedov, Software Engineer, ResearchGate

Page 2:

ResearchGate is a social network for scientists.

It started when two researchers discovered first-hand that collaborating with a friend or colleague on the other side of the world was no easy task.

Page 3:

Connect the world of science. Make research open to all.

Page 4:

Structured system

We have changed, and are continuing to change, how scientific knowledge is shared and discovered.

Page 5:

10,000,000 Members

102,000,000 Publications

30,000,000 Visitors

Pages 6-7:

Feature: Research Timeline

Page 8:

Diverse data sources

Proxy

Frontend

Services

memcache, MongoDB, Solr, PostgreSQL

Infinispan, HBase, MongoDB, Solr

Page 9:

Big data pipeline

Change data capture

Import

Hadoop cluster

Export

Page 10:

Data Model

Account (1) -- Claim -- (*) Author

Publication (1) -- Authorship -- (*) Author

Page 11:

Hypothetical SQL

Publication (1) -- Authorship -- (*) Author

    CREATE TABLE publications (
        id SERIAL PRIMARY KEY,
        author_ids INTEGER[]
    );

Account (1) -- Claim -- (*) Author

    CREATE TABLE accounts (
        id SERIAL PRIMARY KEY,
        claimed_author_ids INTEGER[]
    );

    CREATE MATERIALIZED VIEW account_publications
    REFRESH FAST ON COMMIT
    AS
    SELECT
        accounts.id AS account_id,
        publications.id AS publication_id
    FROM accounts
    JOIN publications
        ON ANY (accounts.claimed_author_ids) = ANY (publications.author_ids);
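The join condition in the hypothetical view reads as "an account matches a publication when their author ID sets overlap". A minimal in-memory sketch of that condition in Java is shown here; the class name `BatchJoinSketch`, the string IDs, and the `accountId:publicationId` output format are assumptions made for illustration and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a batch, single-machine version of the join condition.
final class BatchJoinSketch {

    /** Returns sorted "accountId:publicationId" pairs whose author ID sets overlap. */
    static List<String> join(Map<String, Set<Integer>> claimedAuthorIds,
                             Map<String, Set<Integer>> publicationAuthorIds) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> account : claimedAuthorIds.entrySet()) {
            for (Map.Entry<String, Set<Integer>> publication : publicationAuthorIds.entrySet()) {
                // disjoint() is true when the two sets share no element.
                if (!Collections.disjoint(account.getValue(), publication.getValue())) {
                    result.add(account.getKey() + ":" + publication.getKey());
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> accounts = Map.of("Alice", Set.of(2), "Bob", Set.of(1));
        Map<String, Set<Integer>> publications = Map.of("Paper1", Set.of(1));
        System.out.println(join(accounts, publications)); // prints [Bob:Paper1]
    }
}
```

This nested loop needs both datasets in one process, which is exactly what the challenges on the following slide rule out.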

Page 12:

Challenges

• Data sources are distributed across different DBs

• Dataset doesn’t fit in memory on a single machine

• Join process must be fault tolerant

• Deploy changes fast

• Up-to-date join result in near real-time

• Join result must be accurate

Page 13:

Change data capture (CDC)

User -> Request -> Microservice -> Write -> DB

Sync: Cache

Sync: Solr/ES

Sync: HBase/HDFS

Pages 14-18:

Change data capture (CDC)

User -> Request -> Microservice -> Write -> DB

Extract: DB -> Log

Log entries (key, value): (K2, 1), (K1, 4), (K1, Ø), (KN, 42) …

Sync: Log -> Cache, HBase/HDFS, Solr/ES
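The log on these slides can be read as a sequence of keyed upserts, where Ø marks a delete. A minimal sketch of how a downstream store such as the cache might replay it follows; the class `ChangeLogReplay`, the `apply` method, and the use of `null` for Ø are assumptions for illustration, not the actual sync code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: replaying a change log into a key-value view.
final class ChangeLogReplay {

    /** Applies one change record to the view; a null value stands for a delete (Ø). */
    static void apply(Map<String, Integer> view, String key, Integer value) {
        if (value == null) {
            view.remove(key);
        } else {
            view.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> cache = new LinkedHashMap<>();
        apply(cache, "K2", 1);
        apply(cache, "K1", 4);
        apply(cache, "K1", null); // K1 Ø: the earlier value of K1 is removed
        apply(cache, "KN", 42);
        System.out.println(cache); // prints {K2=1, KN=42}
    }
}
```

Because every consumer replays the same ordered log, the cache, the search index, and the Hadoop side can each be rebuilt independently from it.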

Page 19:

Join two CDC streams into one

NoSQL1 -> Kafka -> Flink Streaming Join

SQL -> Kafka -> Flink Streaming Join

Flink Streaming Join -> Kafka -> NoSQL2

Page 20:

Flink job topology

Accounts Stream and Publications Stream -> Join (CoFlatMap) -> Account Publications

Join state is partitioned by author: Author 1, Author 2, … Author N

Page 21:

Prototype implementation

    DataStream<Account> accounts = kafkaTopic("accounts");
    DataStream<Publication> publications = kafkaTopic("publications");
    DataStream<AccountPublication> result = accounts.connect(publications)
        // Key both streams by author ID so matching records reach the same keyed state.
        .keyBy("claimedAuthorId", "publicationAuthorId")
        .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {

            // Last account ID and last publication ID seen for the current author key.
            transient ValueState<String> authorAccount;
            transient ValueState<String> authorPublication;

            @Override
            public void open(Configuration parameters) throws Exception {
                authorAccount = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("authorAccount", String.class, null));
                authorPublication = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("authorPublication", String.class, null));
            }

            // Account side: remember the account, emit if a publication is already known.
            @Override
            public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
                authorAccount.update(account.id);
                if (authorPublication.value() != null) {
                    out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
                }
            }

            // Publication side: remember the publication, emit if an account is already known.
            @Override
            public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
                authorPublication.update(publication.id);
                if (authorAccount.value() != null) {
                    out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
                }
            }
        });

Pages 28-38:

Example dataflow

Accounts Stream and Publications Stream -> Join (state per author: Author 1, Author 2, … Author N) -> Account Publications

• Accounts stream emits (Alice, 2). The record is routed to Author 2, whose state becomes: Alice.

• Accounts stream emits (Bob, 1). The record is routed to Author 1, whose state becomes: Bob.

• Publications stream emits (Paper1, 1). The record is routed to Author 1, whose state becomes: Bob, Paper1.

• Author 1 now holds both sides, so the join emits K1 (Bob, Paper1) to Account Publications.
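The walk-through above can be reproduced without a cluster. The following plain-Java sketch mimics the prototype's two per-author state slots with ordinary maps; it is not Flink code, and the class `KeyedJoinSimulation` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-process stand-in for the keyed CoFlatMap join.
final class KeyedJoinSimulation {

    // Per-author state, mirroring the two ValueState fields of the prototype.
    private final Map<Integer, String> authorAccount = new HashMap<>();
    private final Map<Integer, String> authorPublication = new HashMap<>();

    // Everything the join has emitted so far.
    final List<String> output = new ArrayList<>();

    void onAccount(String accountId, int claimedAuthorId) {
        authorAccount.put(claimedAuthorId, accountId);
        emitIfComplete(claimedAuthorId);
    }

    void onPublication(String publicationId, int authorId) {
        authorPublication.put(authorId, publicationId);
        emitIfComplete(authorId);
    }

    // Emits a pair only when both sides are known for this author.
    private void emitIfComplete(int authorId) {
        String account = authorAccount.get(authorId);
        String publication = authorPublication.get(authorId);
        if (account != null && publication != null) {
            output.add("(" + account + ", " + publication + ")");
        }
    }

    public static void main(String[] args) {
        KeyedJoinSimulation join = new KeyedJoinSimulation();
        join.onAccount("Alice", 2);
        join.onAccount("Bob", 1);
        join.onPublication("Paper1", 1);
        System.out.println(join.output); // prints [(Bob, Paper1)]
    }
}
```

Like the prototype, this sketch keeps a single account and a single publication per author and has no notion of deletes, which is the gap the later slides address.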

Page 39:

Challenges

• ✔ Data sources are distributed across different DBs

• ✔ Dataset doesn’t fit in memory on a single machine

• ✔ Join process must be fault tolerant

• ✔ Deploy changes fast

• ✔ Up-to-date join result in near real-time

• ? Join result must be accurate

Pages 40-46:

Paper1 gets deleted

• State before the delete: the Accounts stream has emitted (Alice, 2) and (Bob, 1), and the Publications stream has emitted (Paper1, 1). Author 1 holds Bob, Paper1; Author 2 holds Alice. Account Publications contains K1 (Bob, Paper1).

• Publications stream emits (Paper1, Ø). Where should the join route it? The delete no longer carries an author ID.

• Need previous value: a "Diff with Previous State" step between the Publications Stream and the Join recovers it.

• The join emits K1 Ø to Account Publications.

• Need K1 here, e.g. K1 = 𝒇(Bob, Paper1)
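One way to picture the "Diff with Previous State" step is an operator that remembers the last author IDs seen per publication and turns each new record into per-author commands. The sketch below is plain Java under that assumption; the class `DiffWithPreviousState`, the `UPSERT`/`DELETE` command strings, and `null` for Ø are illustrative choices, not the talk's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: diffing each publication record against its previous value.
final class DiffWithPreviousState {

    // Last known author IDs per publication ID.
    private final Map<String, Set<Integer>> previousAuthorIds = new HashMap<>();

    /** authorIds == null models a delete (Ø). Returns the commands to send to the join. */
    List<String> onPublication(String publicationId, Set<Integer> authorIds) {
        Set<Integer> before = previousAuthorIds.getOrDefault(publicationId, Set.of());
        Set<Integer> after = authorIds == null ? Set.of() : authorIds;
        List<String> commands = new ArrayList<>();
        // Authors that disappeared: the join must retract the old pairing.
        for (Integer authorId : new TreeSet<>(before)) {
            if (!after.contains(authorId)) {
                commands.add("DELETE author=" + authorId + " publication=" + publicationId);
            }
        }
        // Authors that appeared: the join must create a new pairing.
        for (Integer authorId : new TreeSet<>(after)) {
            if (!before.contains(authorId)) {
                commands.add("UPSERT author=" + authorId + " publication=" + publicationId);
            }
        }
        if (after.isEmpty()) {
            previousAuthorIds.remove(publicationId);
        } else {
            previousAuthorIds.put(publicationId, after);
        }
        return commands;
    }

    public static void main(String[] args) {
        DiffWithPreviousState diff = new DiffWithPreviousState();
        System.out.println(diff.onPublication("Paper1", Set.of(1)));
        // prints [UPSERT author=1 publication=Paper1]
        System.out.println(diff.onPublication("Paper1", null));
        // prints [DELETE author=1 publication=Paper1]
    }
}
```

Because the delete command now names Author 1 explicitly, it can be routed to the same keyed state that holds Bob and Paper1.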

Pages 47-51:

Paper1 gets updated

• State before the update: the Accounts stream has emitted (Alice, 2) and (Bob, 1), and the Publications stream has emitted (Paper1, 1). Author 1 holds Bob, Paper1; Author 2 holds Alice. Account Publications contains K1 (Bob, Paper1).

• Publications stream emits (Paper1, 2): the author of Paper1 changes from Author 1 to Author 2.

• The record is routed to Author 2, whose state becomes: Alice, Paper1. Author 1 still holds Bob, Paper1.

• The join emits (Alice, Paper1), while K1 (Bob, Paper1) stays in Account Publications.

• Under which key should the new result be stored? ?? (Alice, Paper1)

Pages 52-62:

Alice claims Paper1 via different author

• Starting point: the Publications stream has emitted (Paper1, 1) and then (Paper1, (1, 2)); the Accounts stream has emitted (Alice, 2) and (Bob, 1). Author 1 holds Bob, Paper1; Author 2 holds Alice, Paper1. Account Publications contains K1 (Bob, Paper1) and K2 (Alice, Paper1).

• Accounts stream emits (Bob, Ø). Author 1 state becomes: Ø, Paper1. The join emits K1 Ø.

• Accounts stream emits (Alice, 1). In flight: 2. (Alice, 1). Author 1 state becomes: Alice, Paper1. Author 2 state becomes: Ø, Paper1.

• Author 1 emits (Alice, Paper1). If the key depends only on the account and the publication, Account Publications receives both K2 (Alice, Paper1) and K2 Ø.

• Pick correct natural IDs, e.g. K3 = 𝒇(Alice, Author1, Paper1). Account Publications then receives K3 (Alice, Paper1) and K2 Ø.
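The effect of the key choice can be checked with a small replay. The sketch below applies the same upsert and delete to an output view under two key functions; the class `NaturalIdSketch`, the `/`-joined key format, and the order in which the two commands arrive are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: comparing a coarse key with a natural ID that includes the author.
final class NaturalIdSketch {

    /** Builds the output key, with or without the author ID as part of it. */
    static String key(String account, int author, String publication, boolean includeAuthor) {
        return includeAuthor
            ? account + "/" + author + "/" + publication
            : account + "/" + publication;
    }

    /** Replays "Alice moves from Author 2 to Author 1" against an output view. */
    static Map<String, String> replay(boolean includeAuthor) {
        Map<String, String> view = new TreeMap<>();
        // Existing result: Alice is linked to Paper1 via Author 2.
        view.put(key("Alice", 2, "Paper1", includeAuthor), "(Alice, Paper1)");
        // Upsert emitted by the Author 1 partition.
        view.put(key("Alice", 1, "Paper1", includeAuthor), "(Alice, Paper1)");
        // Delete emitted by the Author 2 partition, assumed here to arrive last.
        view.remove(key("Alice", 2, "Paper1", includeAuthor));
        return view;
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // prints {}
        System.out.println(replay(true));  // prints {Alice/1/Paper1=(Alice, Paper1)}
    }
}
```

With the coarse key, the late delete wipes out the fresh result; with the author included in the key, the delete and the upsert address different entries and the final view is correct.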

Page 63:

How to solve deletes and updates

• Keep previous element state to update previous join result

• Stream elements are not domain entities but commands such as delete or upsert

• Joined stream must have natural IDs to propagate deletes and updates

Pages 64-65:

Generic join graph

• Accounts Stream -> Diff (state per account: Alice, Bob) -> Join

• Publications Stream -> Diff (state per publication: Paper1 … PaperN) -> Join

• Join (state per author: Author1 … AuthorM) -> Account Publications

• Operate on commands

Page 66:

Memory requirements

• Diff after Accounts Stream: full copy of Accounts stream

• Diff after Publications Stream: full copy of Publications stream

• Join: full copy of Accounts stream on left side, full copy of Publications stream on right side

Page 67:

Network load

• Same graph as above: Accounts Stream and Publications Stream -> Diff -> Join -> Account Publications

• Two reshuffles, each crossing the network

Page 68:

Resource considerations

• In addition to handling Kafka traffic, we need to reshuffle all data twice over the network

• We need to keep two full copies of each joined stream in memory

Page 69:

Questions

We are hiring: www.researchgate.net/careers
