Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark

Post on 13-Apr-2017


Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu

Gábor Hermann
ghermann@ilab.sztaki.hu

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 688191.

About us
• Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
• Informatics Laboratory
• "Big Data – Momentum" research group
• "Data Mining and Search" research group
• Research group with strong industry ties
• Ericsson, Rovio, Portugal Telecom, etc.

Agenda
1. Recommendation systems and matrix factorization
2. Batch vs. online
3. Matrix factorization
   1. Online
   2. Batch + online
4. Solution in Spark & Flink
5. Conclusions

Recommendation systems


Recommendation with matrix factorization

[Figure: rating matrix 𝑅 with users (Zoltán, Gábor) as rows and movies (Rogue One, Interstellar) as columns; the cells hold star ratings.]

Zoltán rated Rogue One with 5 stars

Recommendation with matrix factorization

𝑈 ∙ 𝐼 ≈ 𝑅

[Figure: the rating matrix 𝑅 is approximated by the product of the user matrix 𝑈 (one user vector per user) and the item matrix 𝐼 (one item vector per item). In the example the user vectors are (5, -6, -1) and (5, 4, -4), and the item vectors are (3, 2, 5) and (5, 3, 2).]

Latent factors: level of action, level of drama, X factor

$$\min_{u_*, i_*} \sum_{(p,q) \in \kappa_R} \left( r_{pq} - \mu - b_p - b_q - u_p i_q \right)^2 + \lambda \sum_{p \in \kappa_U} \left( \lVert u_p \rVert^2 + b_p^2 \right) + \lambda \sum_{q \in \kappa_I} \left( \lVert i_q \rVert^2 + b_q^2 \right)$$

Zoltán rated Rogue One with 5 stars
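One common way to minimize an objective of this shape is stochastic gradient descent over individual ratings. The slides mention SGD later but do not spell out the update rule, so the following is only a sketch under assumptions: the learning rate η is a symbol introduced here, constant factors of 2 are absorbed into η, and the regularization terms are applied once per processed rating. Both vector updates use the values of u_p and i_q from before the step.

```latex
\begin{aligned}
e_{pq} &= r_{pq} - \mu - b_p - b_q - u_p i_q \\
u_p &\leftarrow u_p + \eta \left( e_{pq}\, i_q - \lambda\, u_p \right) \\
i_q &\leftarrow i_q + \eta \left( e_{pq}\, u_p - \lambda\, i_q \right) \\
b_p &\leftarrow b_p + \eta \left( e_{pq} - \lambda\, b_p \right) \\
b_q &\leftarrow b_q + \eta \left( e_{pq} - \lambda\, b_q \right)
\end{aligned}
```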

Recommendation with matrix factorization

𝑈 ∙ 𝐼 ≈ 𝑅

Zoltán rated Rogue One with 5 stars

Would Gábor like Interstellar?

[Figure: the unknown cell of 𝑅 is marked with "?". The user vector (5, 4, -4) and the item vector (3, 2, 5) are highlighted, and the "?" is then replaced by their product, 3.]
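The prediction step can be sketched in a few lines of Scala. The vectors (5, 4, -4) and (3, 2, 5) are the ones highlighted on the slide; the object and function names are made up for this illustration, and `Vector` here is the Scala standard-library collection, not necessarily the vector type used in the talk's code.

```scala
object MfPredict {
  // Predicted rating = dot product of a user vector and an item vector.
  def predict(user: Vector[Double], item: Vector[Double]): Double = {
    require(user.length == item.length, "vectors must have the same number of latent factors")
    user.zip(item).map { case (u, i) => u * i }.sum
  }

  def main(args: Array[String]): Unit = {
    val gabor        = Vector(5.0, 4.0, -4.0) // user vector highlighted on the slide
    val interstellar = Vector(3.0, 2.0, 5.0)  // item vector highlighted on the slide
    // 5*3 + 4*2 + (-4)*5 = 3
    println(predict(gabor, interstellar))
  }
}
```

The same function applied to the other pair of vectors, (5, -6, -1) and (5, 3, 2), gives 5, which matches Zoltán's rating of Rogue One.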

Batch training

[Figure: ratings arrive as [user; item; time; rating] records and are collected in PERSISTENT STORAGE; the rating matrix 𝑅 (Zoltán, Gábor × Rogue One, Interstellar) is built from the stored records, and the user vectors (𝑈) and item vectors (𝐼) appear once training over the stored data has run.]
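A minimal sketch of batch training, assuming plain SGD without the bias terms: the whole stored dataset is revisited for several epochs. The record layout follows the [user; item; time; rating] tuple from the slide; everything else (names, hyperparameters, the random initialization, and all toy ratings except Zoltán's 5 stars for Rogue One) is an assumption made for illustration.

```scala
object BatchMf {
  final case class Rating(user: String, item: String, time: Long, rating: Double)
  type Factors = Map[String, Vector[Double]]

  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Sum of squared errors over the stored ratings (no bias terms in this sketch).
  def squaredError(stored: Seq[Rating], users: Factors, items: Factors): Double =
    stored.map(r => math.pow(r.rating - dot(users(r.user), items(r.item)), 2)).sum

  // Small random starting vectors, one per distinct id.
  def init(ids: Seq[String], k: Int, rng: scala.util.Random): Factors =
    ids.distinct.map(id => id -> Vector.fill(k)(rng.nextDouble() * 0.1)).toMap

  // Batch training: every epoch passes over the complete stored dataset.
  def train(stored: Seq[Rating], users0: Factors, items0: Factors,
            epochs: Int, eta: Double, lambda: Double): (Factors, Factors) = {
    var users = users0
    var items = items0
    for (_ <- 1 to epochs; r <- stored) {
      val u = users(r.user)
      val i = items(r.item)
      val err = r.rating - dot(u, i)
      // Both updates use the vectors from before the step.
      users = users.updated(r.user, u.zip(i).map { case (uk, ik) => uk + eta * (err * ik - lambda * uk) })
      items = items.updated(r.item, u.zip(i).map { case (uk, ik) => ik + eta * (err * uk - lambda * ik) })
    }
    (users, items)
  }

  def main(args: Array[String]): Unit = {
    val stored = Seq(
      Rating("Zoltán", "Rogue One", 1L, 5.0),    // from the slide
      Rating("Zoltán", "Interstellar", 2L, 2.0), // made-up toy value
      Rating("Gábor", "Rogue One", 3L, 4.0))     // made-up toy value
    val rng = new scala.util.Random(42)
    val u0 = init(stored.map(_.user), 3, rng)
    val i0 = init(stored.map(_.item), 3, rng)
    val (u, i) = train(stored, u0, i0, epochs = 200, eta = 0.02, lambda = 0.01)
    println(s"squared error before: ${squaredError(stored, u0, i0)}")
    println(s"squared error after:  ${squaredError(stored, u, i)}")
  }
}
```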

Online training

[Figure: ratings arrive as a stream of [user; item; time; rating] records (2, 5, 4, 2, 4); each rating is consumed as it arrives and immediately updates the corresponding user vector and item vector, so the values in 𝑈 and 𝐼 change from one build of the slide to the next.]
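In contrast to batch training, an online learner touches each rating once, when it arrives, and keeps no history. A minimal sketch, again assuming plain SGD without bias terms; the class name, hyperparameters and the lazy creation of vectors for previously unseen users and items are assumptions made for illustration, not the talk's implementation.

```scala
import scala.collection.mutable

object OnlineMf {
  final case class Rating(user: String, item: String, time: Long, rating: Double)

  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  final class Model(numFactors: Int, eta: Double, lambda: Double, seed: Long = 42L) {
    private val rng = new scala.util.Random(seed)
    val users = mutable.Map.empty[String, Vector[Double]]
    val items = mutable.Map.empty[String, Vector[Double]]

    // Small random vector for a user or item seen for the first time.
    private def fresh(): Vector[Double] = Vector.fill(numFactors)(rng.nextDouble() * 0.1)

    // Returns 0.0 when either the user or the item is still unknown.
    def predict(user: String, item: String): Double =
      (for (u <- users.get(user); i <- items.get(item)) yield dot(u, i)).getOrElse(0.0)

    // Consume one rating: look up (or create) both vectors, take one SGD step, store the results.
    def update(r: Rating): Unit = {
      val u = users.getOrElseUpdate(r.user, fresh())
      val i = items.getOrElseUpdate(r.item, fresh())
      val err = r.rating - dot(u, i)
      users(r.user) = u.zip(i).map { case (uk, ik) => uk + eta * (err * ik - lambda * uk) }
      items(r.item) = u.zip(i).map { case (uk, ik) => ik + eta * (err * uk - lambda * ik) }
    }
  }
}
```

A stream consumer would call `model.update(rating)` for every incoming record and can call `model.predict(user, item)` at any point in between.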

Batch + online combination

But how to scale?
• Spotify streamed 20 billion hours of music in 2015
• YouTube: over a billion users, billions of video views every day
• Use distributed data-analytics frameworks
• How can we combine batch + online?

Apache Spark vs. Apache Flink

Distributed online matrix factorization

[Figure: the user vectors (𝑈) and item vectors (𝐼) are distributed, and ratings arrive as a stream of [user; item; time; rating] records (2, 5, 4, 2, 4). Processing one rating takes three steps:]
• need to co-locate (the rating, its user vector and its item vector)
• then update
• send updates

Distributed online matrix factorization

[Figure: two ratings from the stream (5, 4, 2, 4) are picked up at the same time: process two ratings in parallel.]

• Concurrent modification
• Similar problem with batch SGD
• Distributed SGD (Gemulla et al. 2011)
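The core idea of distributed SGD is to split users and items into d blocks each and to process, at any one time, only blocks of ratings that share no user block and no item block, so no two workers ever modify the same vector. A minimal sketch of such a schedule follows; the names are made up for this illustration, and `blockOf` uses a simple hash assignment purely as an assumption (any partitioner would do).

```scala
object DsgdSchedule {
  // Block coordinates: (user block, item block).
  type Block = (Int, Int)

  // Stratum s pairs user block b with item block (b + s) mod d,
  // so blocks within one stratum never share a user block or an item block.
  def stratum(d: Int, s: Int): Seq[Block] =
    (0 until d).map(b => (b, (b + s) % d))

  // All d strata; together they cover each of the d * d blocks exactly once.
  def schedule(d: Int): Seq[Seq[Block]] =
    (0 until d).map(s => stratum(d, s))

  // Hypothetical assignment of a user or item id to one of d blocks.
  def blockOf(id: String, d: Int): Int =
    java.lang.Math.floorMod(id.hashCode, d)

  def main(args: Array[String]): Unit =
    schedule(3).zipWithIndex.foreach { case (blocks, s) =>
      println(s"stratum $s: ${blocks.mkString(", ")}")
    }
}
```

Within one stratum the d blocks can be handed to d workers in parallel; the strata themselves are processed one after another.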

Online MF in Spark

val ratings: DStream[Rating] = ...   // we have our input

val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] = ...   // would like to have output like this

• updateStateByKey?
• Use batch DSGD for online updates! (discussion in issue SPARK-6407)

Online MF in Spark

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Rating, UserId, ItemId, Vector, batchDSGD, applyUserUpdates and
// applyItemUpdates are not shown on the slide.

// We have our input.
val ratings: DStream[Rating] = ...

// Need to represent the factor matrices.
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...

// Would like to have output like this; use transform to allow RDD operations.
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
  ratings.transform { (rs: RDD[Rating]) =>
    // Compute updates.
    val updates = batchDSGD(rs, users, items)
    // Apply updates to get the updated matrices.
    users = applyUserUpdates(users, updates)
    items = applyItemUpdates(items, updates)
    updates
  }

Online MF in Spark
• Performance degrades over time
• Problem: tracking the lineage graph
• Solution: use checkpointing

Online MF in Flink

[Figure: two long-running operators with state, one holding the user vectors and one holding the item vectors, connected by a backward edge in the dataflow (stream loop).]

Online MF in Flink

[Figure: a rating travels through the user-vector operator and the item-vector operator in four steps; the vectors shown change once the update is applied.]
1. rating event
2. rating event & user vector
3. apply update
4. user vector update

WARNING! The Loops API (iterative streams) is not mature enough yet, but there is ongoing effort.
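The four steps can be mimicked without any framework. The sketch below is a plain-Scala simulation of the two stateful operators and the messages exchanged between them; it does not use Flink's API, all class and method names are invented for illustration, and the update is a bare SGD step without regularization or bias terms.

```scala
import scala.collection.mutable

object StreamLoopSketch {
  type Vec = Vector[Double]
  final case class Rating(user: String, item: String, rating: Double)
  final case class RatingWithUserVector(rating: Rating, userVector: Vec) // message of step 2
  final case class UserVectorUpdate(user: String, vector: Vec)           // message of step 4

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // Long-running operator holding the user vectors.
  final class UserOperator(init: () => Vec) {
    val state = mutable.Map.empty[String, Vec]
    // Steps 1 and 2: receive the rating event and attach the current user vector.
    def onRating(r: Rating): RatingWithUserVector =
      RatingWithUserVector(r, state.getOrElseUpdate(r.user, init()))
    // End of step 4: store the vector that came back on the backward edge.
    def onUpdate(u: UserVectorUpdate): Unit = {
      state.update(u.user, u.vector)
    }
  }

  // Long-running operator holding the item vectors.
  final class ItemOperator(init: () => Vec, eta: Double) {
    val state = mutable.Map.empty[String, Vec]
    // Step 3: compute both new vectors and keep the item vector locally.
    // Step 4: return the new user vector so it can be sent back.
    def onRatingWithUserVector(m: RatingWithUserVector): UserVectorUpdate = {
      val i = state.getOrElseUpdate(m.rating.item, init())
      val u = m.userVector
      val err = m.rating.rating - dot(u, i)
      state.update(m.rating.item, i.zip(u).map { case (ik, uk) => ik + eta * err * uk })
      UserVectorUpdate(m.rating.user, u.zip(i).map { case (uk, ik) => uk + eta * err * ik })
    }
  }

  // Drive one rating through the whole loop, synchronously.
  def process(r: Rating, users: UserOperator, items: ItemOperator): Unit =
    users.onUpdate(items.onRatingWithUserVector(users.onRating(r)))
}
```

In a real streaming job the messages would travel asynchronously, so a second rating for the same user could reach the user operator before the first update has come back; the synchronous `process` call above hides that.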

Online MF: Spark vs. Flink

Combining batch + online in Spark
• Easy: can run batch training periodically on the whole dataset

Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
• Could only do it with an external system

• Batch with Streaming API
• Feasible!
• Asynchronous training (Schelter et al. 2014)

• Batch + online
• Both with Streaming API
• Share matrices in common state
• Parameter Server approach

Lessons learned

Implementation
• Flink: More complex solution, harder to implement
• Spark: Easier to use: could use batch for streaming

Generality
• Flink: Can express finer-grained updates
• Spark: Updates limited by mini-batch

Code stability
• Flink: Some parts are not mature enough (e.g. Loops API)
• Spark: More mature

Performance
• Flink: Optimal for online learning, can perform well on batch
• Spark: Not always optimal for online learning (e.g. online MF)

Handling data skew
• Flink: Currently hard to relocate long-running operators
• Spark: Periodic scheduling enables easier modification of partitioning

Machine learning
• Flink: Incomplete ML library and other efforts for ML in Flink
• Spark: Spark MLlib is mature and used in production

Thank you for your attention

Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu

Gábor Hermann
ghermann@ilab.sztaki.hu

Source code: https://github.com/gaborhermann/large-scale-recommendation

Measurements

Batch + online combination
• 30M music listening Last.fm dataset
• Weekly batch training
• Evaluation: weekly average, evaluated on every incoming listening
• Around 45,000 users

Online MF: Spark vs. Flink
• 30M music listening Last.fm dataset read from 12 Kafka partitions
• Spark batch duration: 5 sec
• Time of processing X ratings
• DSGD algorithm
• Using 6 nodes, 4 cores each
• Spark 2.1.0, Flink 1.2.0

Batch on Flink Streaming
• Movielens 1M movie rating dataset
• Using 6 nodes, 4 cores each