Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015

32
Till Rohrmann Flink committer / data Artisans [email protected] @stsffap Computing Recommendations at Extreme Scale with Apache Flink

Transcript of Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015

Till Rohrmann Flink committer / data Artisans

[email protected] @stsffap

Computing Recommendations at Extreme Scale with Apache Flink

Recommendations: Collaborative Filtering

1

Recommendations §  Omnipresent nowadays §  Important for user

experience and sales

2

Recommending Websites

3

Collaborative Filtering §  Recommend items based on users with

similar preferences §  Latent factor models capture underlying

characteristics of items and preferences of user

§  Predicted preference:

4

r̂u,i = xuT yi

Rating Matrix §  Explicit or implicit ratings §  Prediction goal: Rating for unseen items

5

Items

Users10 5 ?? 2 107 ? 5

!

"

###

$

%

&&& Princess  

Facebook  

Matrix Factorization §  Calculate low rank approximation to obtain latent factors

6

minX,Y ru,i − xuT yi( )

2+λ nu xu

2+ ni yi

2

i∑

u∑#

$%

&

'(

ru,i≠0∑

Items

Users10 5 ?? 2 107 ? 5

!

"

###

$

%

&&&≈ X

!

"

###

$

%

&&&• Y( )Princess  

Facebook  

§  = rating of user u for item i §  = latent factors of user u §  = latent factors of item i

§  = regularization constant §  = number of rated items for user u §  = number of ratings for item i

xuyi

λnuni

ru,i

Alternating Least Squares §  Hard to optimize since we have two

variables §  Fixing one variables gives quadratic

problem

7

X,Y

xu = YSuY T +λnuΙ( )−1Yru

T

Siiu =

1 if ru,i ≠ 00 else

"#$

%$

We  only  need  the  item  vectors  rated  by  user  u  

ALS Algorithm §  Update user matrix

8

Items

Users10 5 ?? 2 107 ? 5

!

"

###

$

%

&&&≈ X

!

"

###

$

%

&&&• Y( )

Keep  fixed  

Calculate  update  

ALS Algorithm contd. §  Update item matrix

9

Items

Users10 5 ?? 2 107 ? 5

!

"

###

$

%

&&&≈ X

!

"

###

$

%

&&&• Y( )

Keep  fixed  

Calculate  update  

ALS Algorithm contd. §  Repeat update step until convergence

10

Items

Users10 5 ?? 2 107 ? 5

!

"

###

$

%

&&&≈ X

!

"

###

$

%

&&&• Y( )

Keep  fixed  

Calculate  update  

Apache Flink

11

What is Apache Flink? Apache  Flink  deep-­‐dive  by  Stephan  Ewen  Tomorrow,  12:20  –  13:00  on  Stage  3  

12

Gel

ly

Tabl

e

Flin

kML

SAM

OA

DataSet (Java/Scala/Python) DataStream (Java/Scala)

Had

oop

M/R

Local Remote Yarn Tez Embedded

Dat

aflow

Dat

aflow

(WiP

)

MRQ

L

Tabl

e

Casc

adin

g (W

iP)

Streaming dataflow runtime

Why Using Flink for ALS? §  Expressive API §  Pipelined stream processor §  Closed loop iterations §  Operations on managed memory

13

Expressive APIs §  DataSet: Abstraction for distributed data §  Computation specified as sequence of lazily

evaluated transformations

14

case  class  Word(word:  String,  frequency:  Int)    val  lines:  DataSet[String]  =  env.readTextFile(…)    lines.flatMap(line  =>  line.split(“  “).map(word  =>  Word(word,  1))  

 .groupBy(“word”).sum(“frequency”)    .print()  

Program Execution

15

case  class  Path  (from:  Long,  to:  Long)  val  tc  =  edges.iterate(10)  {        paths:  DataSet[Path]  =>          val  next  =  paths              .join(edges)              .where("to")              .equalTo("from")  {                  (path,  edge)  =>                        Path(path.from,  edge.to)              }              .union(paths)              .distinct()          next      }  

Optimizer

Type extraction stack

Task scheduling

Dataflow metadata

Pre-flight (Client)

Master Workers

Data Source orders.tbl

Filter

Map DataSource

lineitem.tbl

Join Hybrid Hash

buildHT

probe

hash-part [0] hash-part [0]

GroupRed sort

forward

Program

Dataflow Graph

deploy operators

track intermediate

results

Pipelined Stream Processor

16

Avoiding  materializaBon  of  intermediate  results  

Iterate by looping

§  for/while loop in client submits one job per iteration step

§  Data reuse by caching in memory and/or disk

Step Step Step Step Step

Client

17

Iterate in the Dataflow

18

Memory Management

19

Memory Management

20

ALS implementations with Apache Flink

21

Naïve Implementation 1.  Join item vectors

with ratings 2.  Group on user ID 3.  Compute new

user vectors

22

xu = YSuY T +λnuΙ( )−1Yru

T

Pros and Cons of Naïve ALS §  Pros •  Easy to implement

§  Cons •  Item vectors are sent redundantly to network

nodes •  Two shuffle steps make execution expensive

23

Blocked ALS Implementation 1.  Create user and item

rating blocks 2.  Cache them on

worker nodes 3.  Send all item vectors

needed by user rating block bundled

4.  Compute block of item vectors

24 Based  on  Spark’s  MLlib  implementaBon  

Pros and Cons of Blocked ALS §  Pros •  Reduces network load by avoiding data

duplication •  Caching ratings: Only one shuffle step needed

§  Cons •  Duplicates the rating matrix (user block/item

block partitioning)

25

Performance Comparison

26

•  40  node  GCE  cluster,  highmem-­‐8  

•  10  ALS  iteraBons  with  50  latent  factors  

•  RaBng  matrix  has  28  billion  non  zero  entries:  Scale  of  NeAlix  or  SpoCfy  

Machine Learning with FlinkML §  FlinkML contains

blocked ALS §  Support for many

other tasks •  Clustering •  Regression •  Classification

§  scikit-learn like pipeline support

27

val  als  =  ALS()    val  ratingDS  =      env.readCsvFile[(Int,  Int,  Double)](ratingData)    val  parameters  =  ParameterMap()      .add(ALS.Iterations,  10)      .add(ALS.NumFactors,  50)      .add(ALS.Lambda,  1.5)    als.fit(ratingDS,  parameters)    val  testingDS  =  env.readCsvFile[(Int,  Int)](testingData)    val  predictions  =  als.predict(testingDS)  

Closing

28

What Have You Seen? §  How to use collaborative filtering to make

recommendations §  Apache Flink, a powerful parallel stream

processing engine §  How to use Apache Flink and alternating least

squares to factorize really large matrices

29

30

flink.apache.org @ApacheFlink