Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink


Transcript of Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink

Page 1

Joining Infinity — Windowless Stream Processing with Flink

Sanjar Akhmedov, Software Engineer, ResearchGate

Page 2

It started when two researchers discovered first-hand that collaborating with a friend or colleague on the other side of the world was no easy task.

ResearchGate is a social network for scientists.

Page 3

Connect the world of science. Make research open to all.

Page 4

Structured system

We have changed, and are continuing to change, how scientific knowledge is shared and discovered.

Page 5

11,000,000+ Members

110,000,000+ Publications

1,300,000,000+ Citations

Page 6

Feature: Research Timeline

Page 7

Feature: Research Timeline

Page 8

Diverse data sources

Diagram: Proxy → Frontend → Services, backed by memcache, MongoDB, Solr, PostgreSQL, Infinispan, and HBase.

Page 9

Big data pipeline

Diagram: Change data capture → Import → Hadoop cluster → Export.

Page 10

Data Model

Diagram: an Account has 1..* Claims and a Publication has 1..* Authorships; both Claim and Authorship point to an Author.

Page 11

Hypothetical SQL

CREATE TABLE publications (
  id SERIAL PRIMARY KEY,
  author_ids INTEGER[]
);

CREATE TABLE accounts (
  id SERIAL PRIMARY KEY,
  claimed_author_ids INTEGER[]
);

CREATE MATERIALIZED VIEW account_publications
REFRESH FAST ON COMMIT
AS
SELECT
  accounts.id AS account_id,
  publications.id AS publication_id
FROM accounts
JOIN publications
  ON ANY (accounts.claimed_author_ids) = ANY (publications.author_ids);
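The view above is hypothetical: no mainstream database incrementally maintains a materialized view over an ANY-to-ANY array join like this. Conceptually it computes the pairing below, sketched here as plain Java (class and method names are illustrative, not part of the talk):

```java
import java.util.*;

public class AccountPublicationsJoin {
    // Emits an "accountId:publicationId" pair whenever an account's claimed
    // author ids intersect a publication's author ids; this mirrors the
    // semantics of the hypothetical account_publications view.
    public static Set<String> join(Map<Integer, Set<Integer>> accounts,
                                   Map<Integer, Set<Integer>> publications) {
        Set<String> pairs = new LinkedHashSet<>();
        for (Map.Entry<Integer, Set<Integer>> account : accounts.entrySet()) {
            for (Map.Entry<Integer, Set<Integer>> publication : publications.entrySet()) {
                for (int authorId : account.getValue()) {
                    if (publication.getValue().contains(authorId)) {
                        pairs.add(account.getKey() + ":" + publication.getKey());
                        break; // one pair per (account, publication) is enough
                    }
                }
            }
        }
        return pairs;
    }
}
```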

Page 12

Challenges

• Data sources are distributed across different DBs
• Dataset doesn’t fit in memory on a single machine
• Join process must be fault tolerant
• Deploy changes fast
• Up-to-date join result in near real-time
• Join result must be accurate

Page 13

Change data capture (CDC)

Diagram: User → Request → Microservice → Write → DB; the DB is synced to Cache, Solr/ES, and HBase/HDFS.

Page 14

Change data capture (CDC)

Diagram: the Microservice writes to the DB, and the changes are extracted into a log of keyed entries: (K2, 1), (K1, 4).

Page 15

Change data capture (CDC)

Diagram: the log now holds (K2, 1), (K1, 4), (K1, Ø); Ø marks a delete.

Page 16

Change data capture (CDC)

Diagram: the log grows with further entries: (K2, 1), (K1, 4), (K1, Ø), …, (KN, 42).

Page 17

Change data capture (CDC)

Diagram: the Cache is now synced from the extracted log instead of being written directly.

Page 18

Change data capture (CDC)

Diagram: Solr/ES and HBase/HDFS are likewise synced from the extracted log, so all downstream stores consume the same change stream.
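The extracted log behaves like a keyed changelog: the latest value per key wins and Ø is a delete marker, as with tombstones in a compacted Kafka topic. A minimal sketch of replaying such a log, with null standing in for Ø (illustrative, not the actual extractor):

```java
import java.util.*;

public class ChangelogReplay {
    // Replays (key, value) change events in order. A null value plays the role
    // of the slides' tombstone marker and removes the key from derived state.
    public static Map<String, Integer> replay(List<Map.Entry<String, Integer>> log) {
        Map<String, Integer> state = new HashMap<>();
        for (Map.Entry<String, Integer> event : log) {
            if (event.getValue() == null) {
                state.remove(event.getKey());                // delete
            } else {
                state.put(event.getKey(), event.getValue()); // upsert, last write wins
            }
        }
        return state;
    }
}
```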

Page 19

Join two CDC streams into one

Diagram: a SQL source and NoSQL1 each feed a CDC topic in Kafka; a Flink streaming join consumes both topics and writes the joined stream back to Kafka, which feeds NoSQL2.

Page 20

Flink job topology

Diagram: the Accounts stream and the Publications stream are keyed by author (Author 1, Author 2, … Author N) into a Join (CoFlatMap), which emits AccountPublications.

Page 21: Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink

DataStream<Account> accounts = kafkaTopic("accounts");DataStream<Publication> publications = kafkaTopic("publications");DataStream<AccountPublication> result = accounts.connect(publications)

.keyBy("claimedAuthorId", "publicationAuthorId")

.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {

transient ValueState<String> authorAccount;transient ValueState<String> authorPublication;

public void open(Configuration parameters) throws Exception {authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));

}

public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {authorAccount.update(account.id);if (authorPublication.value() != null) {

out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));}

}

public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {authorPublication.update(publication.id);if (authorAccount.value() != null) {

out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));}

}});

Prototype implementation
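Stripped of the Flink API, the RichCoFlatMapFunction keeps one account slot and one publication slot per author key and emits a pair as soon as both sides are present. The same logic as a self-contained sketch without Flink (class and method names are illustrative):

```java
import java.util.*;

public class AuthorKeyJoin {
    // Per author key: the last account id and publication id seen
    // (standing in for the two keyed ValueStates).
    private final Map<Integer, String> accountByAuthor = new HashMap<>();
    private final Map<Integer, String> publicationByAuthor = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    // Mirrors flatMap1: remember the account, emit a pair if the
    // publication side has already arrived for this author.
    public void onAccount(int authorId, String accountId) {
        accountByAuthor.put(authorId, accountId);
        if (publicationByAuthor.containsKey(authorId)) {
            emitted.add(accountId + "-" + publicationByAuthor.get(authorId));
        }
    }

    // Mirrors flatMap2: remember the publication, emit a pair if the
    // account side has already arrived for this author.
    public void onPublication(int authorId, String publicationId) {
        publicationByAuthor.put(authorId, publicationId);
        if (accountByAuthor.containsKey(authorId)) {
            emitted.add(accountByAuthor.get(authorId) + "-" + publicationId);
        }
    }

    public List<String> emitted() { return emitted; }
}
```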

Pages 22-27: repeated copies of the prototype implementation slide (code identical to Page 21).

Page 28

Example dataflow

Diagram: the Accounts changelog contains (Alice, 2); the Publications changelog is empty; the join state (Author 1 … Author N) is empty.

Page 29

Example dataflow

Diagram: the (Alice, 2) event is read from the Accounts stream into the join.

Page 30

Example dataflow

Diagram: the join stores Alice in the Author 2 slot.

Page 31

Example dataflow

Diagram: the Accounts changelog gains (Bob, 1); Alice is stored under Author 2.

Page 32

Example dataflow

Diagram: the (Bob, 1) event flows from the Accounts stream to the join.

Page 33

Example dataflow

Diagram: the join stores Bob in the Author 1 slot.

Page 34

Example dataflow

Diagram: the Publications changelog gains (Paper1, 1).

Page 35

Example dataflow

Diagram: the (Paper1, 1) event flows from the Publications stream to the join.

Page 36

Example dataflow

Diagram: Paper1 joins Bob in the Author 1 slot.

Page 37

Example dataflow

Diagram: the Author 1 slot now holds a complete (Bob, Paper1) match.

Page 38

Example dataflow

Diagram: the join emits K1 → (Bob, Paper1) to the AccountPublications output.

Page 39: Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink

• ✔ Data sources are distributed across different DBs

• ✔ Dataset doesn’t fit in memory on a single machine

• ✔ Join process must be fault tolerant

• ✔ Deploy changes fast

• ✔ Up-to-date join result in near real-time

• ? Join result must be accurate

Challenges

Page 40

Paper1 gets deleted

Diagram: the Publications changelog gains a tombstone (Paper1, Ø); the output so far is K1 → (Bob, Paper1); the Author 1 slot holds (Bob, Paper1), Author 2 holds Alice.

Page 41

Paper1 gets deleted

Diagram: the (Paper1, Ø) tombstone flows toward the join.

Page 42

Paper1 gets deleted

Diagram: the tombstone reaches the join, but it is unclear what to emit: the output key for the now-stale (Bob, Paper1) pair is unknown (marked "?").

Page 43

Paper1 gets deleted

Diagram: to retract the stale pair, the pipeline needs the previous value of Paper1 ("Need previous value").

Page 44

Paper1 gets deleted

Diagram: a "Diff with Previous" operator with its own state is inserted on the Publications stream in front of the join.

Page 45

Paper1 gets deleted

Diagram: the join emits a tombstone (K1, Ø), retracting the (Bob, Paper1) pair from AccountPublications.

Page 46

Paper1 gets deleted

Diagram: to emit the tombstone, K1 must be re-derivable at the join: "Need K1 here, e.g. K1 = 𝒇(Bob, Paper1)".
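The tombstone can only be routed if the output key is a deterministic function of the joined values, so the same key can be re-derived whenever either input is later deleted. A minimal sketch (the encoding is illustrative; any stable, collision-free scheme works):

```java
public class NaturalKey {
    // K = f(account, publication): re-derivable at any time, so a later
    // delete of either side can emit (K, tombstone) for the joined pair.
    public static String of(String accountId, String publicationId) {
        return accountId + "|" + publicationId;
    }
}
```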

Page 47

Paper1 gets updated

Diagram: the Publications changelog gains (Paper1, 2), reassigning Paper1 from Author 1 to Author 2; the output so far is K1 → (Bob, Paper1); Author 1 holds (Bob, Paper1), Author 2 holds Alice.

Page 48

Paper1 gets updated

Diagram: the (Paper1, 2) update flows toward the join.

Page 49

Paper1 gets updated

Diagram: the join stores Paper1 in the Author 2 slot next to Alice.

Page 50

Paper1 gets updated

Diagram: the join emits a new (Alice, Paper1) pair to the output.

Page 51

Paper1 gets updated

Diagram: the new (Alice, Paper1) pair has no output key yet (marked "??"), and the stale (Bob, Paper1) pair under Author 1 was never retracted.

Page 52

Alice claims Paper1 via different author

Diagram: Paper1 now lists authors (1, 2); the output holds K1 → (Bob, Paper1) and K2 → (Alice, Paper1); Author 1 holds (Bob, Paper1), Author 2 holds (Alice, Paper1).

Page 53

Alice claims Paper1 via different author

Diagram: the Accounts changelog gains a tombstone (Bob, Ø); Bob's claim is removed.

Page 54

Alice claims Paper1 via different author

Diagram: the (Bob, Ø) tombstone flows toward the join.

Page 55

Alice claims Paper1 via different author

Diagram: Bob is cleared from the Author 1 slot and the join emits (K1, Ø), retracting (Bob, Paper1).

Page 56

Alice claims Paper1 via different author

Diagram: state after the retraction: Author 1 holds only Paper1, Author 2 holds (Alice, Paper1).

Page 57

Alice claims Paper1 via different author

Diagram: Alice now claims authorship via Author 1 instead: the Accounts changelog gains (Alice, 1).

Page 58

Alice claims Paper1 via different author

Diagram: the (Alice, 1) event flows toward the join (step 2).

Page 59

Alice claims Paper1 via different author

Diagram: Alice moves to the Author 1 slot, joining Paper1; the Author 2 slot now holds only Paper1.

Page 60

Alice claims Paper1 via different author

Diagram: the Author 1 match emits a new (Alice, Paper1) pair.

Page 61

Alice claims Paper1 via different author

Diagram: with K2 = 𝒇(Alice, Paper1), the new pair is emitted as K2 → (Alice, Paper1), but retracting Alice's old Author 2 claim also emits (K2, Ø); the same key gets an upsert and a delete, so the final result depends on ordering.

Page 62

Alice claims Paper1 via different author

Diagram: including the author in the key avoids the collision: the new pair is emitted under K3, so the (K2, Ø) retraction no longer deletes it. Pick correct natural IDs, e.g. K3 = 𝒇(Alice, Author1, Paper1).

Page 63

How to solve deletes and updates

• Keep previous element state to update the previous join result
• Stream elements are not domain entities but commands such as delete or upsert
• Joined stream must have natural IDs to propagate deletes and updates
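The first two points combine into the "diff with previous" step: remember the output keys last derived for each input entity, then turn each new input into upsert and delete commands. A sketch of that operator outside Flink (illustrative names; the real job would keep this state in Flink's keyed state):

```java
import java.util.*;

public class DiffWithPrevious {
    // The set of output keys last derived for each input entity.
    private final Map<String, Set<String>> previous = new HashMap<>();

    // Compares the newly derived output keys against the previous ones and
    // returns commands to forward: "upsert" for new pairs, "delete" for
    // pairs that disappeared (the tombstone Ø on the slides).
    public Map<String, String> update(String inputId, Set<String> current) {
        Set<String> before = previous.getOrDefault(inputId, Set.of());
        Map<String, String> commands = new LinkedHashMap<>();
        for (String key : current) {
            if (!before.contains(key)) commands.put(key, "upsert");
        }
        for (String key : before) {
            if (!current.contains(key)) commands.put(key, "delete");
        }
        previous.put(inputId, current);
        return commands;
    }
}
```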

Page 64

Generic join graph

Diagram: the Accounts stream passes through a Diff keyed by account (Alice, Bob, …) and the Publications stream through a Diff keyed by publication (Paper1 … PaperN); both feed a Join keyed by author (Author1 … AuthorM), which emits AccountPublications.

Page 65

Generic join graph

Operate on commands

Diagram: same topology; every edge carries commands (upserts and deletes) rather than domain entities.

Page 66

Memory requirements

Diagram: each Diff operator keeps a full copy of its input stream (Accounts and Publications respectively); the Join additionally keeps a full copy of the Accounts stream on its left side and a full copy of the Publications stream on its right side.

Page 67

Network load

Diagram: the streams are reshuffled over the network twice: once into the Diff operators and once into the Join.

Page 68

Resource considerations

• In addition to handling Kafka traffic we need to reshuffle all data twice over the network
• We need to keep two full copies of each joined stream in memory

Page 69

Questions

We are hiring - www.researchgate.net/careers

Page 70