Computing Recommendations at Extreme Scale with Apache Flink


Page 1: Computing Recommendations at Extreme Scale with Apache Flink

Till Rohrmann, Flink committer / data Artisans
[email protected] / @stsffap

Computing Recommendations at Extreme Scale with Apache Flink

Page 2: Computing Recommendations at Extreme Scale with Apache Flink

Recommendations: Collaborative Filtering


Page 3: Computing Recommendations at Extreme Scale with Apache Flink

Recommendations

Omnipresent nowadays

Important for user experience and sales


Page 4: Computing Recommendations at Extreme Scale with Apache Flink

Recommending Websites


Page 5: Computing Recommendations at Extreme Scale with Apache Flink

Collaborative Filtering

Recommend items based on users with similar preferences

Latent factor models capture underlying characteristics of items and preferences of users

Predicted preference:


$$\hat{r}_{u,i} = x_u^T y_i$$
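As a toy illustration (hypothetical numbers, two latent factors per user and item), the predicted preference is just the dot product of the two factor vectors:

$$x_u = \begin{pmatrix} 1.2 \\ 0.3 \end{pmatrix}, \quad y_i = \begin{pmatrix} 0.8 \\ 2.0 \end{pmatrix} \quad\Rightarrow\quad \hat{r}_{u,i} = x_u^T y_i = 1.2 \cdot 0.8 + 0.3 \cdot 2.0 = 1.56$$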

Page 6: Computing Recommendations at Extreme Scale with Apache Flink

Rating Matrix

Explicit or implicit ratings

Prediction goal: rating for unseen items


[Figure: rating matrix with users as rows and items (e.g., "Princess", "Facebook") as columns; some entries are rated, others are unknown (?)]

Page 7: Computing Recommendations at Extreme Scale with Apache Flink

Matrix Factorization

Calculate a low-rank approximation to obtain latent factors


$$\min_{X,Y} \sum_{u,i \,:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \|x_u\|^2 + \sum_i n_i \|y_i\|^2 \right)$$

where

$r_{u,i}$ = rating of user u for item i
$x_u$ = latent factors of user u
$y_i$ = latent factors of item i
$\lambda$ = regularization constant
$n_u$ = number of rated items for user u
$n_i$ = number of ratings for item i

[Figure: rating matrix (users × items) factorized into a user factor matrix X and an item factor matrix Y]
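A minimal sketch in plain Scala (hypothetical toy numbers, not part of the talk) that evaluates this regularized objective for fixed factor matrices X and Y:

// Toy data: (user, item) -> rating; unrated pairs are simply absent
val ratings = Map((0, 0) -> 10.0, (0, 1) -> 5.0, (1, 1) -> 2.0)
val x = Array(Array(2.0, 1.0), Array(0.5, 0.5))   // user factor vectors x_u
val y = Array(Array(3.0, 2.0), Array(1.0, 1.5))   // item factor vectors y_i
val lambda = 1.5                                  // regularization constant

def dot(a: Array[Double], b: Array[Double]) = a.zip(b).map { case (p, q) => p * q }.sum

val nu = ratings.keys.groupBy(_._1).map { case (u, ks) => u -> ks.size }  // rated items per user
val ni = ratings.keys.groupBy(_._2).map { case (i, ks) => i -> ks.size }  // ratings per item

val squaredError = ratings.map { case ((u, i), r) => math.pow(r - dot(x(u), y(i)), 2) }.sum
val regularizer  = lambda * (nu.map { case (u, n) => n * dot(x(u), x(u)) }.sum +
                             ni.map { case (i, n) => n * dot(y(i), y(i)) }.sum)
val objective = squaredError + regularizer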

Page 8: Computing Recommendations at Extreme Scale with Apache Flink

Alternating Least Squares

Hard to optimize since we have two variables X and Y

Fixing one variable gives a quadratic problem

We only need the item vectors rated by user u

$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T \qquad \text{where } S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$$
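A minimal sketch of this closed-form update for a single user, using the Breeze linear-algebra library and hypothetical toy numbers (an illustration of the formula, not Flink's implementation):

import breeze.linalg.{DenseMatrix, DenseVector, diag}

val k = 2                                        // number of latent factors
val Y = DenseMatrix((3.0, 2.0), (1.0, 1.5))      // k x #items, column i = y_i
val rU = DenseVector(10.0, 0.0)                  // ratings of user u, 0 = unrated
val lambda = 1.5

val sU = diag(rU.map(v => if (v != 0.0) 1.0 else 0.0))  // diagonal selector S^u
val nU = rU.toArray.count(_ != 0.0)                     // items rated by user u

// x_u = (Y S^u Y^T + lambda * n_u * I)^(-1) Y r_u^T
val xU = (Y * sU * Y.t + DenseMatrix.eye[Double](k) * (lambda * nU)) \ (Y * rU)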

Page 9: Computing Recommendations at Extreme Scale with Apache Flink

ALS Algorithm

Update user matrix


Keep Y fixed

Calculate update for X

[Figure: rating matrix with the item matrix Y held fixed while the user matrix X is recomputed]

Page 10: Computing Recommendations at Extreme Scale with Apache Flink

ALS Algorithm contd.

Update item matrix


Keep X fixed

Calculate update for Y

[Figure: rating matrix with the user matrix X held fixed while the item matrix Y is recomputed]

Page 11: Computing Recommendations at Extreme Scale with Apache Flink

ALS Algorithm contd.

Repeat update step until convergence


Keep one factor matrix fixed

Calculate update for the other

[Figure: rating matrix with the user matrix X and the item matrix Y updated alternately]

Page 12: Computing Recommendations at Extreme Scale with Apache Flink

Apache Flink


Page 13: Computing Recommendations at Extreme Scale with Apache Flink

What is Apache Flink?

Apache Flink deep-dive by Stephan Ewen
Tomorrow, 12:20 – 13:00 on Stage 3


[Figure: Apache Flink software stack — libraries and APIs (Gelly, Table, FlinkML, SAMOA, Dataflow (WiP), MRQL, Cascading (WiP), Hadoop M/R) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs and the streaming dataflow runtime, with Local, Remote, YARN, Tez and Embedded deployment]

Page 14: Computing Recommendations at Extreme Scale with Apache Flink

Why Use Flink for ALS?

Expressive API

Pipelined stream processor

Closed-loop iterations

Operations on managed memory


Page 15: Computing Recommendations at Extreme Scale with Apache Flink

Expressive APIs

DataSet: abstraction for distributed data

Computation specified as a sequence of lazily evaluated transformations


case class Word(word: String, frequency: Int)

val lines: DataSet[String] = env.readTextFile(…)

lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word")
  .sum("frequency")
  .print()

Page 16: Computing Recommendations at Extreme Scale with Apache Flink

Pipelined Stream Processor


Avoiding materialization of intermediate results

Page 17: Computing Recommendations at Extreme Scale with Apache Flink

Iterate in the Dataflow


Page 18: Computing Recommendations at Extreme Scale with Apache Flink

Memory Management


Page 19: Computing Recommendations at Extreme Scale with Apache Flink

ALS implementations with Apache Flink


Page 20: Computing Recommendations at Extreme Scale with Apache Flink

Naïve Implementation

1. Join item vectors with ratings
2. Group on user ID
3. Compute new user vectors

$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$$
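A rough sketch of these three steps on Flink's Scala DataSet API (hypothetical toy data; the group-wise computation is only stubbed out, a real implementation would solve the least-squares system from the previous slides):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val itemVectors: DataSet[(Int, Array[Double])] = env.fromCollection(Seq(
  (0, Array(1.0, 0.5)), (1, Array(0.2, 0.8))))                 // (item, vector)
val ratings: DataSet[(Int, Int, Double)] = env.fromCollection(Seq(
  (0, 0, 10.0), (0, 1, 5.0), (1, 1, 2.0)))                     // (user, item, rating)

// 1. Join item vectors with ratings on the item id
val joined = itemVectors.join(ratings).where(0).equalTo(1) {
  (vec, rating) => (rating._1, vec._2, rating._3)              // (user, itemVector, rating)
}

// 2. Group on user ID, 3. compute new user vectors
val newUserVectors = joined.groupBy(0).reduceGroup { entries =>
  val rated = entries.toList
  val user = rated.head._1
  // Placeholder: element-wise sum of the rated item vectors;
  // the real update solves the normal equations from slide 8 here
  val vector = rated.map(_._2).reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
  (user, vector)
}

newUserVectors.print()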

Page 21: Computing Recommendations at Extreme Scale with Apache Flink

Pros and Cons of Naïve ALS

Pros
• Easy to implement

Cons
• Item vectors are sent redundantly to network nodes
• Two shuffle steps make execution expensive

Page 22: Computing Recommendations at Extreme Scale with Apache Flink

Blocked ALS Implementation

1. Create user and item rating blocks
2. Cache them on worker nodes
3. Send all item vectors needed by a user rating block bundled
4. Compute block of item vectors

Based on Spark’s MLlib implementation
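A conceptual sketch of the blocking step in plain Scala (hypothetical modulo-hash partitioning, not the actual MLlib/FlinkML code):

val numBlocks = 4
val ratings = Seq((0, 0, 10.0), (0, 1, 5.0), (1, 1, 2.0), (5, 2, 7.0))  // (user, item, rating)

// The rating matrix is partitioned twice: once by user block, once by item block,
// so each block can be cached on a worker between iterations
val userRatingBlocks = ratings.groupBy { case (user, _, _) => user % numBlocks }
val itemRatingBlocks = ratings.groupBy { case (_, item, _) => item % numBlocks }

// For each user rating block, the item ids whose vectors must be shipped to it,
// bundled into one message per (item block, user block) pair
val itemsNeededPerUserBlock =
  userRatingBlocks.map { case (block, rs) => block -> rs.map(_._2).toSet }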

Page 23: Computing Recommendations at Extreme Scale with Apache Flink

Pros and Cons of Blocked ALS

Pros
• Reduces network load by avoiding data duplication
• Caching ratings: only one shuffle step needed

Cons
• Duplicates the rating matrix (user block / item block partitioning)

Page 24: Computing Recommendations at Extreme Scale with Apache Flink

Performance Comparison


• 40-node GCE cluster, highmem-8

• 10 ALS iterations with 50 latent factors

• Rating matrix has 28 billion non-zero entries: the scale of Netflix or Spotify

Page 25: Computing Recommendations at Extreme Scale with Apache Flink

Machine Learning with FlinkML

FlinkML contains blocked ALS

Support for many other tasks
• Clustering
• Regression
• Classification

scikit-learn like pipeline support


val als = ALS()

val ratingDS = env.readCsvFile[(Int, Int, Double)](ratingData)

val parameters = ParameterMap()
  .add(ALS.Iterations, 10)
  .add(ALS.NumFactors, 50)
  .add(ALS.Lambda, 1.5)

als.fit(ratingDS, parameters)

val testingDS = env.readCsvFile[(Int, Int)](testingData)

val predictions = als.predict(testingDS)

Page 26: Computing Recommendations at Extreme Scale with Apache Flink

Closing


Page 27: Computing Recommendations at Extreme Scale with Apache Flink

What Have You Seen?

How to use collaborative filtering to make recommendations

Apache Flink, a powerful parallel stream processing engine

How to use Apache Flink and alternating least squares to factorize really large matrices


Page 28: Computing Recommendations at Extreme Scale with Apache Flink


Page 29: Computing Recommendations at Extreme Scale with Apache Flink

flink.apache.org
@ApacheFlink