Computing Recommendations at Extreme Scale with Apache Flink


Page 1: Computing Recommendations at Extreme Scale with Apache Flink

Till Rohrmann, Flink committer / data Artisans
[email protected] / @stsffap

Computing Recommendations at Extreme Scale with Apache Flink

Page 2: Computing Recommendations at Extreme Scale with Apache Flink

Recommendations: Collaborative Filtering


Page 3: Computing Recommendations at Extreme Scale with Apache Flink

Recommendations

Omnipresent nowadays

Important for user experience and sales


Page 4: Computing Recommendations at Extreme Scale with Apache Flink

Recommending Websites


Page 5: Computing Recommendations at Extreme Scale with Apache Flink

Collaborative Filtering

Recommend items based on users with similar preferences

Latent factor models capture underlying characteristics of items and preferences of users

Predicted preference:


$$\hat{r}_{u,i} = x_u^T y_i$$
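As a toy illustration (hypothetical numbers, two latent factors per user and item), the predicted preference is just the dot product of the two factor vectors:

$$x_u = \begin{pmatrix} 1.2 \\ 0.3 \end{pmatrix}, \quad y_i = \begin{pmatrix} 0.8 \\ 2.0 \end{pmatrix} \quad\Rightarrow\quad \hat{r}_{u,i} = x_u^T y_i = 1.2 \cdot 0.8 + 0.3 \cdot 2.0 = 1.56$$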

Page 6: Computing Recommendations at Extreme Scale with Apache Flink

Rating Matrix

Explicit or implicit ratings

Prediction goal: rating for unseen items


[Figure: rating matrix with users as rows and items (e.g., "Princess", "Facebook") as columns; some entries are rated, others are unknown (?)]

Page 7: Computing Recommendations at Extreme Scale with Apache Flink

Matrix Factorization

Calculate a low-rank approximation to obtain latent factors


$$\min_{X,Y} \sum_{u,i \,:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \|x_u\|^2 + \sum_i n_i \|y_i\|^2 \right)$$

where

$r_{u,i}$ = rating of user u for item i
$x_u$ = latent factors of user u
$y_i$ = latent factors of item i
$\lambda$ = regularization constant
$n_u$ = number of rated items for user u
$n_i$ = number of ratings for item i

[Figure: rating matrix (users × items) factorized into a user factor matrix X and an item factor matrix Y]
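A minimal sketch in plain Scala (hypothetical toy numbers, not part of the talk) that evaluates this regularized objective for fixed factor matrices X and Y:

// Toy data: (user, item) -> rating; unrated pairs are simply absent
val ratings = Map((0, 0) -> 10.0, (0, 1) -> 5.0, (1, 1) -> 2.0)
val x = Array(Array(2.0, 1.0), Array(0.5, 0.5))   // user factor vectors x_u
val y = Array(Array(3.0, 2.0), Array(1.0, 1.5))   // item factor vectors y_i
val lambda = 1.5                                  // regularization constant

def dot(a: Array[Double], b: Array[Double]) = a.zip(b).map { case (p, q) => p * q }.sum

val nu = ratings.keys.groupBy(_._1).map { case (u, ks) => u -> ks.size }  // rated items per user
val ni = ratings.keys.groupBy(_._2).map { case (i, ks) => i -> ks.size }  // ratings per item

val squaredError = ratings.map { case ((u, i), r) => math.pow(r - dot(x(u), y(i)), 2) }.sum
val regularizer  = lambda * (nu.map { case (u, n) => n * dot(x(u), x(u)) }.sum +
                             ni.map { case (i, n) => n * dot(y(i), y(i)) }.sum)
val objective = squaredError + regularizer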

Page 8: Computing Recommendations at Extreme Scale with Apache Flink

Alternating Least Squares

Hard to optimize since we have two variables X and Y

Fixing one variable gives a quadratic problem

We only need the item vectors rated by user u

$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T \qquad \text{where } S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$$
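A minimal sketch of this closed-form update for a single user, using the Breeze linear-algebra library and hypothetical toy numbers (an illustration of the formula, not Flink's implementation):

import breeze.linalg.{DenseMatrix, DenseVector, diag}

val k = 2                                        // number of latent factors
val Y = DenseMatrix((3.0, 2.0), (1.0, 1.5))      // k x #items, column i = y_i
val rU = DenseVector(10.0, 0.0)                  // ratings of user u, 0 = unrated
val lambda = 1.5

val sU = diag(rU.map(v => if (v != 0.0) 1.0 else 0.0))  // diagonal selector S^u
val nU = rU.toArray.count(_ != 0.0)                     // items rated by user u

// x_u = (Y S^u Y^T + lambda * n_u * I)^(-1) Y r_u^T
val xU = (Y * sU * Y.t + DenseMatrix.eye[Double](k) * (lambda * nU)) \ (Y * rU)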

Page 9: Computing Recommendations at Extreme Scale with Apache Flink

ALS Algorithm

Update user matrix


Keep Y fixed

Calculate update for X

[Figure: rating matrix with the item matrix Y held fixed while the user matrix X is recomputed]

Page 10: Computing Recommendations at Extreme Scale with Apache Flink

ALS Algorithm contd.

Update item matrix


Keep X fixed

Calculate update for Y

[Figure: rating matrix with the user matrix X held fixed while the item matrix Y is recomputed]

Page 11: Computing Recommendations at Extreme Scale with Apache Flink

ALS Algorithm contd.

Repeat update step until convergence


Keep one factor matrix fixed

Calculate update for the other

[Figure: rating matrix with the user matrix X and the item matrix Y updated alternately]

Page 12: Computing Recommendations at Extreme Scale with Apache Flink

Apache Flink


Page 13: Computing Recommendations at Extreme Scale with Apache Flink

What is Apache Flink?

Apache Flink deep-dive by Stephan Ewen
Tomorrow, 12:20 – 13:00 on Stage 3


[Figure: Apache Flink software stack — libraries and APIs (Gelly, Table, FlinkML, SAMOA, Dataflow (WiP), MRQL, Cascading (WiP), Hadoop M/R) on top of the DataSet (Java/Scala/Python) and DataStream (Java/Scala) APIs and the streaming dataflow runtime, with Local, Remote, YARN, Tez and Embedded deployment]

Page 14: Computing Recommendations at Extreme Scale with Apache Flink

Why Use Flink for ALS?

Expressive API

Pipelined stream processor

Closed-loop iterations

Operations on managed memory


Page 15: Computing Recommendations at Extreme Scale with Apache Flink

Expressive APIs

DataSet: abstraction for distributed data

Computation specified as a sequence of lazily evaluated transformations


case class Word(word: String, frequency: Int)

val lines: DataSet[String] = env.readTextFile(…)

lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word")
  .sum("frequency")
  .print()

Page 16: Computing Recommendations at Extreme Scale with Apache Flink

Pipelined Stream Processor


Avoiding materialization of intermediate results

Page 17: Computing Recommendations at Extreme Scale with Apache Flink

Iterate in the Dataflow


Page 18: Computing Recommendations at Extreme Scale with Apache Flink

Memory Management


Page 19: Computing Recommendations at Extreme Scale with Apache Flink

ALS implementations with Apache Flink


Page 20: Computing Recommendations at Extreme Scale with Apache Flink

Naïve Implementation

1. Join item vectors with ratings
2. Group on user ID
3. Compute new user vectors

$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$$
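A rough sketch of these three steps on Flink's Scala DataSet API (hypothetical toy data; the group-wise computation is only stubbed out, a real implementation would solve the least-squares system from the previous slides):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val itemVectors: DataSet[(Int, Array[Double])] = env.fromCollection(Seq(
  (0, Array(1.0, 0.5)), (1, Array(0.2, 0.8))))                 // (item, vector)
val ratings: DataSet[(Int, Int, Double)] = env.fromCollection(Seq(
  (0, 0, 10.0), (0, 1, 5.0), (1, 1, 2.0)))                     // (user, item, rating)

// 1. Join item vectors with ratings on the item id
val joined = itemVectors.join(ratings).where(0).equalTo(1) {
  (vec, rating) => (rating._1, vec._2, rating._3)              // (user, itemVector, rating)
}

// 2. Group on user ID, 3. compute new user vectors
val newUserVectors = joined.groupBy(0).reduceGroup { entries =>
  val rated = entries.toList
  val user = rated.head._1
  // Placeholder: element-wise sum of the rated item vectors;
  // the real update solves the normal equations from slide 8 here
  val vector = rated.map(_._2).reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
  (user, vector)
}

newUserVectors.print()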

Page 21: Computing Recommendations at Extreme Scale with Apache Flink

Pros and Cons of Naïve ALS

Pros
• Easy to implement

Cons
• Item vectors are sent redundantly to network nodes
• Two shuffle steps make execution expensive

Page 22: Computing Recommendations at Extreme Scale with Apache Flink

Blocked ALS Implementation

1. Create user and item rating blocks
2. Cache them on worker nodes
3. Send all item vectors needed by a user rating block bundled
4. Compute block of item vectors

Based on Spark’s MLlib implementation
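A conceptual sketch of the blocking step in plain Scala (hypothetical modulo-hash partitioning, not the actual MLlib/FlinkML code):

val numBlocks = 4
val ratings = Seq((0, 0, 10.0), (0, 1, 5.0), (1, 1, 2.0), (5, 2, 7.0))  // (user, item, rating)

// The rating matrix is partitioned twice: once by user block, once by item block,
// so each block can be cached on a worker between iterations
val userRatingBlocks = ratings.groupBy { case (user, _, _) => user % numBlocks }
val itemRatingBlocks = ratings.groupBy { case (_, item, _) => item % numBlocks }

// For each user rating block, the item ids whose vectors must be shipped to it,
// bundled into one message per (item block, user block) pair
val itemsNeededPerUserBlock =
  userRatingBlocks.map { case (block, rs) => block -> rs.map(_._2).toSet }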

Page 23: Computing Recommendations at Extreme Scale with Apache Flink

Pros and Cons of Blocked ALS

Pros
• Reduces network load by avoiding data duplication
• Caching ratings: only one shuffle step needed

Cons
• Duplicates the rating matrix (user block / item block partitioning)

Page 24: Computing Recommendations at Extreme Scale with Apache Flink

Performance Comparison


• 40-node GCE cluster, highmem-8

• 10 ALS iterations with 50 latent factors

• Rating matrix has 28 billion non-zero entries: the scale of Netflix or Spotify

Page 25: Computing Recommendations at Extreme Scale with Apache Flink

Machine Learning with FlinkML

FlinkML contains blocked ALS

Support for many other tasks
• Clustering
• Regression
• Classification

scikit-learn like pipeline support


val als = ALS()

val ratingDS = env.readCsvFile[(Int, Int, Double)](ratingData)

val parameters = ParameterMap()
  .add(ALS.Iterations, 10)
  .add(ALS.NumFactors, 50)
  .add(ALS.Lambda, 1.5)

als.fit(ratingDS, parameters)

val testingDS = env.readCsvFile[(Int, Int)](testingData)

val predictions = als.predict(testingDS)

Page 26: Computing Recommendations at Extreme Scale with Apache Flink

Closing


Page 27: Computing Recommendations at Extreme Scale with Apache Flink

What Have You Seen?

How to use collaborative filtering to make recommendations

Apache Flink, a powerful parallel stream processing engine

How to use Apache Flink and alternating least squares to factorize really large matrices


Page 28: Computing Recommendations at Extreme Scale with Apache Flink


Page 29: Computing Recommendations at Extreme Scale with Apache Flink

flink.apache.org
@ApacheFlink