TRANSCRIPT
Large Scale Machine Learning for Information Retrieval
Bo Long, Liang Zhang, LinkedIn Applied Relevance Science
CIKM 2013, San Francisco
Structure of This Tutorial
• Part I: Introduction
  – Challenges from Big Data
  – Introduction to Distributed Computing
  – Introduction to Map-Reduce
  – Web Recommender Systems: Our Motivating Problem
• Part II: Large Scale Logistic Regression
• Part III: Parallel Matrix Factorization
• Part IV: Bag of Little Bootstraps
• Part V: Future of Distributed Computing Infrastructure
Big Data becoming Ubiquitous
• Bioinformatics • Astronomy • Internet • Telecommunications • Climatology • …
Big Data: Some size estimates
• 1000 human genomes: > 100TB of data (1000 Genomes Project)
• Sloan Digital Sky Survey: 200GB of data per night (>140TB aggregated)
• Facebook: a billion monthly active users
• LinkedIn: 238M members worldwide
• Twitter: 500 million tweets a day
• Over 6 billion mobile phones in the world generating data every day
Challenges: Machine Learning for Big Data
• Before:
  – A small/medium-sized data set
  – Train models on a single machine
• Now:
  – Almost infinite supply of data
  – Does the problem become easier? (Asymptotic theory) Not quite
• More data comes with more heterogeneity
• Need to change our thinking in terms of
  – Models
  – Model fitting approaches
  – Data analysis
  – …
Some Challenges
• Exploratory Data Analysis (EDA), Visualization
  – Retrospective (on terabytes)
  – More real time (streaming computations every few minutes/hours)
• Machine Learning Models
  – Scale (computational challenge)
  – Curse of dimensionality: millions of predictors, heterogeneity
  – Temporal and spatial correlations
Some Challenges (continued)
• Experiments
  – To test new methods, test hypotheses from randomized experiments
  – Adaptive experiments
• Forecasting
  – Planning, advertising
• Many more we are not fully well versed in
Introduction to Distributed Computing
Distributed Computing for Big Data
• Distributed computing is an invaluable tool for scaling computations over big data
• Some distributed computing models:
  – Multi-threading
  – Graphics Processing Units (GPU)
  – Message Passing Interface (MPI)
  – Map-Reduce
Evaluating a method for a problem
• Scalability: process X GB in Y hours
• Ease of use
• Reliability (fault tolerance)
• Cost: hardware and cost of maintenance
• Good for the computations required? E.g., iterative versus one-pass
• Resource sharing
Multithreading
• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle gigabytes of data
• Reliable
Graphics Processing Units (GPU)
• Number of cores:
  – CPU: order of 10
  – GPU: order of 1000 (smaller cores)
• Can be >100x faster than a CPU: parallel, computationally intensive tasks are off-loaded to the GPU
• Good for certain computationally intensive tasks
• Can only handle gigabytes of data
• Not trivial to use; requires a good understanding of the low-level architecture for efficient use
Message Passing Interface (MPI)
• Language-independent communication protocol among processes (e.g., on different computers)
• Most suitable for the master/slave model
• Can handle terabytes of data
• Good for iterative processing
• Fault tolerance is low
Map-Reduce (Dean & Ghemawat, 2004)
[Diagram: Data → Mappers → Reducers → Output]
• Computation is split into Map (scatter) and Reduce (gather) stages
• Easy to use: the user needs to implement only two functions, Mapper and Reducer
• Easily handles terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)
Comparison of Distributed Computing Methods
• Scalability (data size): Multithreading: gigabytes; GPU: gigabytes; MPI: terabytes; Map-Reduce: terabytes
• Fault tolerance: Multithreading: high; GPU: high; MPI: low; Map-Reduce: high
• Maintenance cost: Multithreading: low; GPU: medium; MPI: medium; Map-Reduce: medium-high
• Iterative process complexity: Multithreading: cheap; GPU: cheap; MPI: cheap; Map-Reduce: usually expensive
• Resource sharing: Multithreading: hard; GPU: hard; MPI: easy; Map-Reduce: easy
• Easy to implement? Multithreading: easy; GPU: needs understanding of low-level GPU architecture; MPI: easy; Map-Reduce: easy
Introduction to Map-Reduce
Example Problem
• Tabulating word counts in a corpus of documents
• Output: <word, count>
Word Count Through Map-Reduce
• Document 1: "Hello World Bye World" → Mapper 1 → <Hello, 1> <World, 1> <Bye, 1> <World, 1>
• Document 2: "Hello Hadoop Goodbye Hadoop" → Mapper 2 → <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
• Reducer 1 (words A-G) → <Bye, 1> <Goodbye, 1>
• Reducer 2 (words H-Z) → <Hello, 2> <World, 2> <Hadoop, 2>
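The word-count flow above can be simulated in a few lines; a minimal single-process sketch (function names are ours — a real Hadoop job implements the Mapper and Reducer interfaces and runs them on separate machines):

```python
from itertools import groupby

def mapper(document):
    # Emit a <word, 1> pair for every word in the document.
    for word in document.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum all counts received for one key (word).
    return (word, sum(counts))

def map_reduce(documents):
    # Map phase: run every mapper and collect the emitted pairs.
    pairs = [kv for doc in documents for kv in mapper(doc)]
    # Shuffle phase: group pairs by key, as the framework would.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(k, (v for _, v in group))
                for k, group in groupby(pairs, key=lambda kv: kv[0]))

counts = map_reduce(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
# counts == {"Bye": 1, "Goodbye": 1, "Hadoop": 2, "Hello": 2, "World": 2}
```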
Key Ideas about Map-Reduce
[Diagram: Big Data → Partition 1 … Partition N → Mapper 1 … Mapper N → <Key, Value> pairs → Reducer 1 … Reducer M → Output 1 … Output M]
Key Ideas about Map-Reduce
• Data are split into partitions and stored on disk across many different machines (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values corresponding to that key according to the customized reducer function, and outputs the result
More on Map-Reduce
• Depends on distributed file systems
• Typically the mappers are the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
  – Data transmission from mappers to reducers is through disk copy
• Iterative processing through Map-Reduce
  – Each iteration becomes a Map-Reduce job
  – Can be expensive, since the Map-Reduce overhead is high
Web Recommender Systems: Our Motivating Problem
Web Recommender Systems at LinkedIn
Similar problem: content recommendation on the Yahoo! front page (the Today module)
• Recommend content links (out of 30-40, editorially programmed)
• 4 slots exposed (F1-F4); F1 has maximum exposure
• Routes traffic to other Y! properties
Setting:
• User i visits, with user features (e.g., industry, behavioral features, demographic features, …)
• The algorithm selects item j from a set of candidates, with item features (e.g., keywords, categories, …)
• Response y_ij for pair (i, j): click or not
Which item should we recommend to the user? → The item with the highest expected gain in utility. Sample utilities:
• CTR
• Revenue (e.g., CTR × bid)
• Time spent
• …
Web Recommender Systems
Challenges of Web Recommender Systems
• Large scale data
• Low-latency real-time ranking
• Low positive rates (sparsity problem)
• Properties of the item set to rank per user
  – Dynamic: new items may get added at any time
  – Personalized:
    • Ads: advertisers specify the targeted users to show their ads to
    • Update stream in a social network: only from connections
  – May have temporal patterns, e.g., a novelty effect
• New users come to the site every second
A Naïve Model (Most-Popular)
• Use the empirical CTR of each item to rank
• Item j: CTR_j = #Clicks_j / #Views_j
• Advantage:
  – Quick update of CTR, e.g., every 5 minutes
• Problems:
  – Different users might like different items
  – Cold start & sparsity: many items may have very few clicks
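The naive most-popular model can be sketched as follows (the class and names are ours, not from the tutorial); note how an item with one lucky click out of one view immediately outranks a well-measured item, previewing the sparsity and explore-exploit issues discussed later:

```python
from collections import defaultdict

class MostPopularRanker:
    """Rank items by empirical CTR = #clicks / #views (the naive model)."""

    def __init__(self):
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)

    def record(self, item, clicked):
        # Quick to update: just two counters per item.
        self.views[item] += 1
        self.clicks[item] += int(clicked)

    def ctr(self, item):
        # Undefined (here 0.0) for never-shown items: the cold-start problem.
        return self.clicks[item] / self.views[item] if self.views[item] else 0.0

    def rank(self, items):
        return sorted(items, key=self.ctr, reverse=True)

ranker = MostPopularRanker()
for clicked in [1, 0, 0, 0]:      # item A: 1 click / 4 views  -> CTR 0.25
    ranker.record("A", clicked)
ranker.record("B", 1)             # item B: 1 click / 1 view   -> CTR 1.0 (sparse!)
# ranker.rank(["A", "B"]) == ["B", "A"]
```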
Segmented Most-Popular
• Estimate a separate CTR per (user segment, item) pair, e.g., segmenting by gender:
  – CTR for (Male, Item 1), (Female, Item 1), (Male, Item 2), (Female, Item 2)
• However: sparsity dramatically increases as the number of features grows
• Curse of dimensionality
Another Challenge: Data Collection
• Assume the most-popular model serves all traffic
• Item A: true CTR 5%, 500 clicks, 10,000 views
• Item B: true CTR 10%, 0 clicks, 10 views
• We end up always serving item A!
• Multi-armed bandit problem (explore-exploit):
  – A naïve approach: ε-greedy -- use an x% bucket to serve random traffic
  – Better approaches later in the tutorial
Challenges from Data Analysis
• Example: how do we generate confidence intervals for model performance metrics like AUC?
• Single machine:
  – Bootstrap the test data 100 times
  – Compute std(AUC) from the AUCs on the 100 bootstrapped data sets
• How do we do it for large scale data?
  – Bootstrap is non-trivial!
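On a single machine the recipe above is only a few lines; a minimal sketch (function names are ours) using a rank-based AUC and 100 resamples:

```python
import random

def auc(scores, labels):
    # Mann-Whitney statistic: fraction of (positive, negative) pairs
    # where the positive example receives the higher score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_std(scores, labels, n_boot=100, seed=0):
    # Resample the test set with replacement, one AUC per replicate.
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:                       # need both classes present
            aucs.append(auc([scores[i] for i in idx], lab))
    mean = sum(aucs) / n_boot
    return (sum((a - mean) ** 2 for a in aucs) / (n_boot - 1)) ** 0.5
```

The difficulty at scale is the resampling step itself: drawing n-out-of-n samples from terabytes of data per replicate is exactly what the Bag of Little Bootstraps (Part IV) is designed to avoid.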
Large Scale Logistic Regression for Response Prediction in Web Recommender Systems
Recap: Response Prediction in Web Recommender Systems at LinkedIn
• User i visits, with user features (e.g., industry, behavioral features, demographic features, …)
• The algorithm selects item j from a set of candidates, with item features (e.g., keywords, categories, …)
• Response y_ij for pair (i, j): click or not
Which item should we recommend to the user? → The item with the highest expected gain in utility. Sample utilities:
• CTR
• Revenue (e.g., CTR × bid)
• Time spent
• …
Possible Approaches
• Logistic Regression
• Naïve Bayes
• Support Vector Machines (SVM)
• Decision Trees
• Neural Networks
• …
Challenges of Response Prediction
• Abundance of features
  – Context: time, item placement, IP geo, other content on the page
  – Member: gender, age, location, industry, title, function
  – Item: topic, keywords, standardized features
• Need models that can handle sparse data and many features
  – Logistic regression, decision trees/GBDTs, etc.
• Has to perform at scale
• Logistic regression strikes a good balance
  – Well understood, industry proven
  – Can be scaled for both training and inference
  – Straightforward to reason about the model
Response Prediction for Ads
• Items to rank: ad campaigns created by advertisers
• Ranking function: predicted CTR × campaign bid
• The system shows the top-ranked items to the user
• Advertisers are charged if clicked (second-price auction)
CTR Prediction Model for Ads
• Feature vectors
  – Member feature vector: x_i
  – Campaign feature vector: c_j
  – Context feature vector: z_k
• Model: a cold-start component plus a warm-start per-campaign component
• Cold-start: Θ_c. Warm-start: Θ_w. Both can have L2 penalties.
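The model equation itself is an image in the original slides and did not survive transcription; a standard form consistent with the cold-start/warm-start description (our reconstruction, not necessarily the exact production model) is:

```latex
% Log-odds of a click by member i on campaign j in context k:
% a global cold-start part \Theta_c shared across all campaigns, plus a
% per-campaign warm-start part \Theta_{w,j} on the member features.
\operatorname{logit} p_{ijk}
  = \underbrace{\Theta_c^{\top} f(x_i, c_j, z_k)}_{\text{cold-start}}
  + \underbrace{\Theta_{w,j}^{\top} x_i}_{\text{warm-start (per campaign)}}
```

Here f(·) denotes the joint feature map over member, campaign, and context features; both Θ_c and each Θ_{w,j} can carry L2 penalties as the slide states.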
Model Fitting
• Single machine (well understood)
  – Conjugate gradient
  – L-BFGS
  – Trust region
  – …
• Model training with large scale data
  – The cold-start component Θ_c is more stable
    • Weekly/bi-weekly training is good enough
    • However: difficulty from the need for large-scale logistic regression
  – The warm-start per-campaign model Θ_w is more dynamic
    • New items can get generated at any time
    • Big loss if opportunities are missed
    • Need to update the warm-start component as frequently as possible
Large Scale Logistic Regression
Per-item logistic regression given Θc
Challenges of Large Scale Logistic Regression
• Logistic regression for small data sets has been studied and widely applied since the 1970s
• Unknown weights can be learned through Newton's method, conjugate gradient ascent, coordinate descent, etc.
• However, for web applications we usually see
  – Hundreds of millions of samples
  – Hundreds of thousands of features
• Q: How do we fit large scale logistic regression in a scalable way?
Large Scale Logistic Regression
• Naïve approach:
  – Partition the data and run logistic regression on each partition
  – Take the mean of the learned coefficients
  – Problem: not guaranteed to converge to the single-machine model!
• Alternating Direction Method of Multipliers (ADMM)
  – Boyd et al., 2011
  – Set up the constraint that each partition's coefficients equal a global consensus
  – Solve the optimization problem using Lagrange multipliers
Model Fitting for Large-Scale Ads Data
A Deep Dive into ADMM
Dual Ascent Method
• Consider a convex optimization problem
• Lagrangian for the problem
• Dual ascent
2 Precursors

In this section, we briefly review two optimization algorithms that are precursors to the alternating direction method of multipliers. While we will not use this material in the sequel, it provides some useful background and motivation.

2.1 Dual Ascent

Consider the equality-constrained convex optimization problem

    minimize f(x)
    subject to Ax = b,        (2.1)

with variable x ∈ R^n, where A ∈ R^{m×n} and f : R^n → R is convex. The Lagrangian for problem (2.1) is

    L(x, y) = f(x) + y^T(Ax - b)

and the dual function is

    g(y) = inf_x L(x, y) = -f*(-A^T y) - b^T y,

where y is the dual variable or Lagrange multiplier, and f* is the convex conjugate of f; see [20, §3.3] or [140, §12] for background. The dual problem is

    maximize g(y),

with variable y ∈ R^m. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. We can recover a primal optimal point x* from a dual optimal point y* as

    x* = argmin_x L(x, y*),

provided there is only one minimizer of L(x, y*). (This is the case if, e.g., f is strictly convex.) In the sequel, we will use the notation argmin_x F(x) to denote any minimizer of F, even when F does not have a unique minimizer.

In the dual ascent method, we solve the dual problem using gradient ascent. Assuming that g is differentiable, the gradient ∇g(y) can be evaluated as follows. We first find x+ = argmin_x L(x, y); then we have ∇g(y) = Ax+ - b, which is the residual for the equality constraint. The dual ascent method consists of iterating the updates

    x^{k+1} := argmin_x L(x, y^k)           (2.2)
    y^{k+1} := y^k + α^k (Ax^{k+1} - b),    (2.3)

where α^k > 0 is a step size, and the superscript is the iteration counter. The first step (2.2) is an x-minimization step, and the second step (2.3) is a dual variable update. The dual variable y can be interpreted as a vector of prices, and the y-update is then called a price update or price adjustment step. This algorithm is called dual ascent since, with appropriate choice of α^k, the dual function increases in each step, i.e., g(y^{k+1}) > g(y^k).

The dual ascent method can be used even in some cases when g is not differentiable. In this case, the residual Ax^{k+1} - b is not the gradient of g, but the negative of a subgradient of -g. This case requires a different choice of α^k than when g is differentiable, and convergence is not monotone; it is often the case that g(y^{k+1}) ≤ g(y^k). In this case, the algorithm is usually called the dual subgradient method [152].

If α^k is chosen appropriately and several other assumptions hold, then x^k converges to an optimal point and y^k converges to an optimal dual point.
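The two updates (2.2)-(2.3) can be sketched directly in NumPy; a toy example (our illustration, not from the tutorial) with f(x) = ||x||²/2, for which the x-minimization step has the closed form x = -Aᵀy:

```python
import numpy as np

def dual_ascent(f_argmin, A, b, alpha=0.1, iters=200):
    """Dual ascent for: minimize f(x) subject to Ax = b.
    f_argmin(y) must return argmin_x L(x, y) for the given dual variable."""
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        x = f_argmin(y)                  # x-minimization step (2.2)
        y = y + alpha * (A @ x - b)      # dual (price) update (2.3)
    return x, y

# For f(x) = ||x||^2 / 2, grad_x L = x + A^T y = 0 gives x = -A^T y.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, y = dual_ascent(lambda y: -A.T @ y, A, b)
# Converges to the minimum-norm solution x = (0.5, 0.5) of x1 + x2 = 1.
```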
Augmented Lagrangians
• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• The value of ρ influences the convergence rate
In dual decomposition, the local x_i-updates can be carried out in parallel; the local results must then be collected (gathered) in order to compute the residual Ax^{k+1} - b. Once the (global) dual variable y^{k+1} is computed, it must be distributed (broadcast) to the processors that carry out the N individual x_i-minimization steps (2.4).

Dual decomposition is an old idea in optimization, and traces back at least to the early 1960s. Related ideas appear in well-known work by Dantzig and Wolfe [44] and Benders [13] on large-scale linear programming, as well as in Dantzig's seminal book [43]. The general idea of dual decomposition appears to be originally due to Everett [69], and is explored in many early references [107, 84, 117, 14]. The use of nondifferentiable optimization, such as the subgradient method, to solve the dual problem is discussed by Shor [152]. Good references on dual methods and decomposition include the book by Bertsekas [16, chapter 6] and the survey by Nedic and Ozdaglar [131] on distributed optimization, which discusses dual decomposition methods and consensus problems. A number of papers also discuss variants on standard dual decomposition, such as [129].

More generally, decentralized optimization has been an active topic of research since the 1980s. For instance, Tsitsiklis and his co-authors worked on a number of decentralized detection and consensus problems involving the minimization of a smooth function f known to multiple agents [160, 161, 17]. Some good reference books on parallel optimization include those by Bertsekas and Tsitsiklis [17] and Censor and Zenios [31]. There has also been some recent work on problems where each agent has its own convex, potentially nondifferentiable, objective function [130]. See [54] for a recent discussion of distributed methods for graph-structured optimization problems.
2.3 Augmented Lagrangians and the Method of Multipliers

Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method, and in particular, to yield convergence without assumptions like strict convexity or finiteness of f. The augmented Lagrangian for (2.1) is

    L_ρ(x, y) = f(x) + y^T(Ax - b) + (ρ/2)||Ax - b||_2^2,    (2.6)

where ρ > 0 is called the penalty parameter. (Note that L_0 is the standard Lagrangian for the problem.) The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem

    minimize f(x) + (ρ/2)||Ax - b||_2^2
    subject to Ax = b.

This problem is clearly equivalent to the original problem (2.1), since for any feasible x the term added to the objective is zero. The associated dual function is g_ρ(y) = inf_x L_ρ(x, y).

The benefit of including the penalty term is that g_ρ can be shown to be differentiable under rather mild conditions on the original problem. The gradient of the augmented dual function is found the same way as with the ordinary Lagrangian, i.e., by minimizing over x, and then evaluating the resulting equality constraint residual. Applying dual ascent to the modified problem yields the algorithm

    x^{k+1} := argmin_x L_ρ(x, y^k)        (2.7)
    y^{k+1} := y^k + ρ(Ax^{k+1} - b),      (2.8)

which is known as the method of multipliers for solving (2.1). This is the same as standard dual ascent, except that the x-minimization step uses the augmented Lagrangian, and the penalty parameter ρ is used as the step size α^k. The method of multipliers converges under far more general conditions than dual ascent, including cases when f takes on the value +∞ or is not strictly convex.

It is easy to motivate the choice of the particular step size ρ in the dual update (2.8). For simplicity, we assume here that f is differentiable, though this is not required for the algorithm to work. The optimality conditions for (2.1) are primal and dual feasibility, i.e.,

    Ax* - b = 0,    ∇f(x*) + A^T y* = 0,

respectively. By definition, x^{k+1} minimizes L_ρ(x, y^k), so

    0 = ∇_x L_ρ(x^{k+1}, y^k)
      = ∇f(x^{k+1}) + A^T(y^k + ρ(Ax^{k+1} - b))
      = ∇f(x^{k+1}) + A^T y^{k+1}.
Alternating Direction Method of Multipliers (ADMM)
• Problem
• Augmented Lagrangian
• ADMM
3 Alternating Direction Method of Multipliers

3.1 Algorithm

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems of the form

    minimize f(x) + g(z)
    subject to Ax + Bz = c        (3.1)

with variables x ∈ R^n and z ∈ R^m, where A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ R^p. We will assume that f and g are convex; more specific assumptions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called x there, has been split into two parts, called x and z here, with the objective function separable across this splitting. The optimal value of problem (3.1) will be denoted by

    p* = inf{ f(x) + g(z) | Ax + Bz = c }.

As in the method of multipliers, we form the augmented Lagrangian

    L_ρ(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + (ρ/2)||Ax + Bz - c||_2^2.

ADMM consists of the iterations

    x^{k+1} := argmin_x L_ρ(x, z^k, y^k)             (3.2)
    z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)         (3.3)
    y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} - c),     (3.4)

where ρ > 0. The algorithm is very similar to dual ascent and the method of multipliers: it consists of an x-minimization step (3.2), a z-minimization step (3.3), and a dual variable update (3.4). As in the method of multipliers, the dual variable update uses a step size equal to the augmented Lagrangian parameter ρ.

The method of multipliers for (3.1) has the form

    (x^{k+1}, z^{k+1}) := argmin_{x,z} L_ρ(x, z, y^k)
    y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} - c).

Here the augmented Lagrangian is minimized jointly with respect to the two primal variables. In ADMM, on the other hand, x and z are updated in an alternating or sequential fashion, which accounts for the term alternating direction. ADMM can be viewed as a version of the method of multipliers where a single Gauss-Seidel pass [90, §10.1] over x and z is used instead of the usual joint minimization. Separating the minimization over x and z into two steps is precisely what allows for decomposition when f or g are separable.

The algorithm state in ADMM consists of z^k and y^k. In other words, (z^{k+1}, y^{k+1}) is a function of (z^k, y^k). The variable x^k is not part of the state; it is an intermediate result computed from the previous state (z^{k-1}, y^{k-1}).

If we switch (re-label) x and z, f and g, and A and B in the problem (3.1), we obtain a variation on ADMM with the order of the x-update step (3.2) and z-update step (3.3) reversed. The roles of x and z are almost symmetric, but not quite, since the dual update is done after the z-update but before the x-update.
Large Scale Logistic Regression via ADMM
• Notation:
  – (A, b): the data
  – x_i: coefficients for partition i
  – z: consensus coefficient vector
  – r(z): penalty component, such as ||z||^2
• Problem
• ADMM
Some common loss functions are hinge loss (1 - μ_i)_+, exponential loss exp(-μ_i), and logistic loss log(1 + exp(-μ_i)); the most common regularizers are ℓ1 and ℓ2 (squared). The support vector machine (SVM) [151] corresponds to hinge loss with a quadratic penalty, while exponential loss yields boosting [78] and logistic loss yields logistic regression.

8.2 Splitting across Examples

Here we discuss how to solve the model fitting problem (8.1) with a modest number of features but a very large number of training examples. Most classical statistical estimation problems belong to this regime, with large volumes of relatively low-dimensional data. The goal is to solve the problem in a distributed way, with each processor handling a subset of the training data. This is useful either when there are so many training examples that it is inconvenient or impossible to process them on a single machine, or when the data is naturally collected or stored in a distributed fashion. This includes, for example, online social network data, webserver access logs, wireless sensor networks, and many cloud computing applications more generally.

We partition A and b by rows,

    A = (A_1; …; A_N),    b = (b_1; …; b_N),

with A_i ∈ R^{m_i×n} and b_i ∈ R^{m_i}, where Σ_{i=1}^N m_i = m. Thus, A_i and b_i represent the ith block of data and will be handled by the ith processor. We first put the model fitting problem in the consensus form

    minimize Σ_{i=1}^N l_i(A_i x_i - b_i) + r(z)
    subject to x_i - z = 0,  i = 1, …, N,        (8.3)

with variables x_i ∈ R^n and z ∈ R^n. Here, l_i refers (with some abuse of notation) to the loss function for the ith block of data. The problem can now be solved by applying the generic global variable consensus ADMM algorithm described in §7.1, given here with scaled dual variable:

    x_i^{k+1} := argmin_{x_i} ( l_i(A_i x_i - b_i) + (ρ/2)||x_i - z^k + u_i^k||_2^2 )
    z^{k+1}   := argmin_z ( r(z) + (Nρ/2)||z - x̄^{k+1} - ū^k||_2^2 )
    u_i^{k+1} := u_i^k + x_i^{k+1} - z^{k+1}.

The first step, which consists of an ℓ2-regularized model fitting problem, can be carried out in parallel for each data block. The second step requires gathering variables to form the average. The minimization in the second step can be carried out componentwise (and usually analytically) when r is assumed to be fully separable.

The algorithm described above only requires that the loss function l be separable across the blocks of data; the regularizer r does not need to be separable at all. (However, when r is not separable, the z-update may require the solution of a nontrivial optimization problem.)

8.2.1 Lasso

For the lasso, this yields the distributed algorithm

    x_i^{k+1} := argmin_{x_i} ( (1/2)||A_i x_i - b_i||_2^2 + (ρ/2)||x_i - z^k + u_i^k||_2^2 )
    z^{k+1}   := S_{λ/(ρN)}(x̄^{k+1} + ū^k)
    u_i^{k+1} := u_i^k + x_i^{k+1} - z^{k+1}.

Each x_i-update takes the form of a Tikhonov-regularized least squares (i.e., ridge regression) problem, with analytical solution

    x_i^{k+1} := (A_i^T A_i + ρI)^{-1}(A_i^T b_i + ρ(z^k - u_i^k)).

The techniques from §4.2 apply: if a direct method is used, then the factorization of A_i^T A_i + ρI can be cached to speed up subsequent updates, and if m_i < n, then the matrix inversion lemma can be applied to let us factor the smaller matrix A_i A_i^T + ρI instead. Comparing this distributed-data lasso algorithm with the serial version in §6.4, we see that the only difference is the collection and averaging steps, which couple the computations for the data blocks.

An ADMM-based distributed lasso algorithm is described in [121], with applications in signal processing and wireless communications.
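The distributed lasso updates above translate directly into NumPy; a minimal sketch (names are ours), with the cached ridge matrix A_iᵀA_i + ρI and the soft-thresholding z-update:

```python
import numpy as np

def soft_threshold(v, kappa):
    # S_kappa(v): elementwise soft-thresholding operator.
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def distributed_lasso(A_blocks, b_blocks, lam=0.1, rho=1.0, iters=300):
    """Consensus ADMM for lasso with data split by rows into N blocks:
    minimize (1/2) sum_i ||A_i x_i - b_i||^2 + lam ||z||_1, s.t. x_i = z."""
    N = len(A_blocks)
    n = A_blocks[0].shape[1]
    x = [np.zeros(n) for _ in range(N)]
    u = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    # Cache each block's ridge matrix A_i^T A_i + rho I across iterations.
    lhs = [Ai.T @ Ai + rho * np.eye(n) for Ai in A_blocks]
    for _ in range(iters):
        for i in range(N):   # x-updates: ridge problems, parallelizable
            rhs = A_blocks[i].T @ b_blocks[i] + rho * (z - u[i])
            x[i] = np.linalg.solve(lhs[i], rhs)
        xbar = sum(x) / N    # gather step: averages over blocks
        ubar = sum(u) / N
        z = soft_threshold(xbar + ubar, lam / (rho * N))   # z-update
        for i in range(N):   # scaled dual updates
            u[i] += x[i] - z
    return z
```

For production use the per-block solves would use a cached Cholesky factorization rather than `np.linalg.solve` each iteration, but the structure is the same.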
Per-partition Regression
Consensus Computation
Large Scale Logistic Regression via ADMM
[Diagram: BIG DATA → Partition 1, Partition 2, Partition 3, …, Partition K]
• Iteration 1: a logistic regression is fit on each partition in parallel, then a consensus computation combines the partition coefficients
• Iteration 2 (and beyond): the consensus is sent back to every partition, each partition re-fits its logistic regression, and the consensus is recomputed
An Example Implementation
• ADMM for logistic regression model fitting with an L2 penalty
• Each iteration of ADMM is a Map-Reduce job
  – Mapper: partition the data into K partitions
  – Reducer: for each partition, use liblinear/glmnet to fit an L2-penalized logistic regression
  – Gateway: compute the consensus from the results of all reducers, and send the consensus back to each reducer node
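The reducer/gateway loop above can be sketched in Python; a minimal single-process simulation (we substitute an L-BFGS solve for liblinear/glmnet, and `fit_partition`/`admm_logistic` are our names, not the tutorial's):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_partition(X, y, z, u, rho):
    """'Reducer': argmin_x  -loglik(x; X, y) + (rho/2)||x - z + u||^2,
    solved with L-BFGS as a stand-in for liblinear/glmnet on one partition."""
    def obj(x):
        m = X @ x
        nll = np.sum(np.logaddexp(0.0, m) - y * m)   # negative log-likelihood
        return nll + 0.5 * rho * np.sum((x - z + u) ** 2)
    def grad(x):
        return X.T @ (sigmoid(X @ x) - y) + rho * (x - z + u)
    return minimize(obj, z, jac=grad, method="L-BFGS-B").x

def admm_logistic(partitions, lam=0.01, rho=1.0, iters=20):
    """'Gateway' loop: each ADMM iteration corresponds to one Map-Reduce job."""
    N = len(partitions)
    n = partitions[0][0].shape[1]
    us = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    for _ in range(iters):
        # Reducers: per-partition fits, independent and parallelizable.
        xs = [fit_partition(X, y, z, us[i], rho)
              for i, (X, y) in enumerate(partitions)]
        # Gateway: consensus z-update for the penalty r(z) = (lam/2)||z||^2.
        z = rho * N * (sum(xs) / N + sum(us) / N) / (lam + rho * N)
        # Scaled dual updates, one per partition.
        us = [us[i] + xs[i] - z for i in range(N)]
    return z
```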
Better Convergence Can Be Achieved By
1. Better initialization: use the results of the naïve method to initialize the parameters
2. Adaptively changing the step size (ρ) at each iteration based on the convergence status of the consensus
• ADMM-M: ADMM with tuning 1
• ADMM-MA: ADMM with tunings 1+2
Evaluation of ADMM Convergence
KDD Cup 2010 Data
• Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keep covariates with >= 10 occurrences => 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples
Avg Training Log-likelihood vs. Number of Iterations
Test AUC vs. Number of Iterations
[Figures: average training log-likelihood and test AUC over 20 iterations for ADMM, ADMM-M, and ADMM-MA]
Offline Ads Data: Effect of ADMM Convergence Tunings
Explore-Exploit
Recap: Challenges from Data Collection
• Assume the most-popular model serves all traffic
• Item A: true CTR 5%, 500 clicks, 10,000 views
• Item B: true CTR 10%, 0 clicks, 10 views
• We end up always serving item A!
• Multi-armed bandit problem (explore-exploit):
  – A naïve approach: ε-greedy -- use an x% bucket to serve random traffic
  – Q: Can we do better?
    • The number of items is very large and ad clicks are very sparse
    • ε-greedy will also incur revenue loss
Existing Approaches for Explore-Exploit
• ε-greedy
• UCB (Upper Confidence Bound)
• Gittins Index
• Thompson sampling
Thompson Sampling (Thompson 1933)
• Basic idea:
  – For every user request and item to rank, draw a sample from the posterior of Θ_c and Θ_w
  – Compute the CTR using the sampled coefficients instead of the posterior mode
  – More exploration for items that have high posterior variances due to small sample sizes
• Has been shown empirically to perform very well at large scale (Chapelle and Li 2011)
• Our approach:
  – Θ_c is learned from a lot of data, so its posterior variance can be approximated as 0
  – Θ_w is campaign-specific; we draw a sample from its posterior, learned from the per-item logistic regression
  – Laplace approximation for computing the posterior variance
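The per-request scoring step can be sketched as follows; a minimal NumPy sketch (function names are ours), assuming each campaign's warm-start posterior has already been summarized by a Laplace-approximated Gaussian mean and covariance:

```python
import numpy as np

def thompson_score(x_user, w_mean, w_cov, cold_score, rng):
    """Score one campaign for one request: the cold-start part is treated as
    fixed (posterior variance ~ 0), while the warm-start coefficients are
    drawn from their Laplace-approximated Gaussian posterior."""
    w = rng.multivariate_normal(w_mean, w_cov)       # one posterior sample
    return 1.0 / (1.0 + np.exp(-(cold_score + x_user @ w)))

def pick_campaign(x_user, campaigns, rng):
    """campaigns: list of (w_mean, w_cov, cold_score), one tuple per campaign.
    Rank by sampled CTR: high-variance (cold) campaigns get explored."""
    scores = [thompson_score(x_user, m, c, s, rng) for m, c, s in campaigns]
    return int(np.argmax(scores))
```

With zero posterior covariance this reduces to ranking by the posterior-mode CTR; a campaign with few samples (large covariance) will sometimes draw a high CTR and win the slot, which is exactly the exploration behavior described above.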
Experiments
Models Considered
• CONTROL: per-campaign CTR counting model
• COLD-ONLY: only the cold-start component
• LASER: our model (cold-start + warm-start)
• LASER-EE: our model with explore-exploit using Thompson sampling
Metrics
• Model metrics
  – Test log-likelihood
  – AUC/ROC
  – Observed/Expected ratio
• Business metrics (online A/B test)
  – CTR
  – CPM (revenue per impression)
Observed / Expected Ratio
• Observed: #clicks in the data
• Expected: sum of predicted CTR over all impressions
• Not a "standard" classifier metric, but in many ways more useful for this application
• What we usually see: Observed / Expected < 1
• Over-prediction is a serious problem with ads
  – An instance of the "winner's curse": when choosing from among thousands of candidates, an item with a mistakenly over-estimated CTR may end up winning the auction
• Particularly helpful for examining per-segment data
  – E.g., by bid, number of impressions in training (warmness), geo, etc.
  – Allows us to see where the model might be giving too much weight to the wrong campaigns
• High correlation between the O/E ratio and online model performance
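Computing the O/E ratio per segment is straightforward; a minimal sketch (names and the example rows are ours):

```python
from collections import defaultdict

def oe_by_segment(rows):
    """rows: (segment, clicked, predicted_ctr) per impression.
    Returns Observed/Expected per segment; values well below 1
    indicate systematic over-prediction in that segment."""
    observed = defaultdict(int)
    expected = defaultdict(float)
    for segment, clicked, p in rows:
        observed[segment] += clicked
        expected[segment] += p
    return {seg: observed[seg] / expected[seg] for seg in expected}

rows = [
    ("cold", 0, 0.10), ("cold", 0, 0.10),   # 0 clicks vs 0.2 expected
    ("warm", 1, 0.05), ("warm", 0, 0.05),   # 1 click vs 0.1 expected
]
ratios = oe_by_segment(rows)
# ratios == {"cold": 0.0, "warm": 10.0}
```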
Offline Experiment
• 45 days of advertising event logs
• First 30 days for training data
• Last 15 days for test data
• After downsampling, both the train and test data have hundreds of millions of events
• Thousands of features
Offline: ROC Curves
[Figure: ROC curves (true positive rate vs. false positive rate); AUCs: CONTROL 0.672, COLD-ONLY 0.757, LASER 0.778]
Online A/B Test
• Three models
  – CONTROL (10%)
  – LASER (85%)
  – LASER-EE (5%)
• Segmented analysis
  – 8 segments by campaign warmness
    • Degree of warmness: the number of training samples available for the campaign in the training data
    • Segment #1: campaigns with almost no data in training
    • Segment #8: campaigns served most heavily in the previous batches, so their CTR estimates can be quite accurate
Daily CTR Lift Over Control
[Figure: daily percentage CTR lift over CONTROL for LASER and LASER-EE, Days 1-7; absolute percentages redacted]
Daily CPM Lift Over Control
[Figure: daily percentage eCPM lift over CONTROL for LASER and LASER-EE, Days 1-7; absolute percentages redacted]
CTR Lift by Campaign Warmness Segments
[Figure: percentage CTR lift for LASER and LASER-EE across warmness segments 1-8; values redacted]
CPM Lift By Campaign Warmness Segments
[Figure: lift percentage of CPM by campaign warmness segment (1–8); series: LASER, LASER-EE (numeric values redacted in the deck)]
O/E Ratio By Campaign Warmness Segments
[Figure: observed clicks / expected clicks (0.5–1.0) by campaign warmness segment (1–8); series: CONTROL, LASER, LASER-EE]
Number of Campaigns Served: Improvement from E/E
Insights • Overall performance:
– LASER and LASER-EE are both much better than control – LASER and LASER-EE performance are very similar
• Segmented analysis by campaign warmness – Segment #1 (very cold)
• LASER-EE much worse than LASER due to its exploration property • LASER much better than CONTROL due to cold-start features
– Segments #3 - #5 • LASER-EE significantly better than LASER • Winner’s curse hit LASER
– Segment #6 - #8 (very warm) • LASER-EE and LASER are equivalent
• Number of campaigns served – LASER-EE serves significantly more campaigns than LASER – Provides a healthier marketplace
Parallel Matrix Factorization
Recap: Properties of Web Recommender Systems
• One or multiple metrics to optimize – Click Through Rate (CTR) (focus of this talk) – Revenue per impression – Time spent on the landing page – Ad conversion rate – …
• Imbalanced data (clicks are very sparse) • Large scale data • Cold-start
– User features: Age, gender, position, industry, … – Item features: Category, key words, creator features, …
Matrix Factorization Models
• Response matrix Y – yij = 1 if user i clicked item j – yij = 0 if user i viewed item j but did not click
• Y is a highly incomplete matrix • Y ≈ U’V
– ui’ is the i-‐th row of U, latent profile of user i – vj’ is the j-‐th row of V, latent profile of item j
• A widely used approach, e.g., in the Netflix competition
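As a toy illustration of fitting Y ≈ U′V on observed entries only, here is a pure-Python SGD sketch (not the MCEM approach developed later in this part; the `factorize` helper and its toy data are hypothetical):

```python
import random

def factorize(Y, k=2, steps=2000, lr=0.05, reg=0.01, seed=0):
    """Sketch of low-rank factorization Y ~ U V' fit by SGD on
    observed entries only; Y maps (user, item) -> response."""
    rng = random.Random(seed)
    users = {i for i, _ in Y}
    items = {j for _, j in Y}
    U = {i: [rng.gauss(0, 0.1) for _ in range(k)] for i in users}
    V = {j: [rng.gauss(0, 0.1) for _ in range(k)] for j in items}
    obs = list(Y.items())
    for _ in range(steps):
        (i, j), y = rng.choice(obs)
        pred = sum(a * b for a, b in zip(U[i], V[j]))
        err = y - pred
        for f in range(k):
            ui, vj = U[i][f], V[j][f]
            # gradient step with small L2 regularization
            U[i][f] += lr * (err * vj - reg * ui)
            V[j][f] += lr * (err * ui - reg * vj)
    return U, V

# Observed (user, item) -> response; unobserved pairs are simply absent
Y = {(0, 0): 1, (0, 1): 0, (1, 0): 1, (2, 1): 0}
U, V = factorize(Y)
```

Unobserved cells are never touched, which is the key point for a highly incomplete Y.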
Probabilistic Matrix Factorization
[Graphical model: global features, user effect, item effect, user factors, item factors]
Salakhutdinov and Mnih (2007)
Flexible Regression Priors •
• g(·), h(·), G(·), H(·) can be any regression functions, e.g. simple linear regression
• Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)
User covariates
Item covariates
• Monte Carlo EM (Booth and Hobert 1999) • Let • Let • E Step:
– Obtain N samples of the conditional posterior
• M Step:
Model Fitting Using MCEM (Single Machine)
Handling Binary Responses
• Gaussian responses: have closed form • Binary responses + Logistic: no longer closed form
• Variational approximation (VAR)
• Adaptive rejection sampling (ARS)
Variational Approximation (VAR)
• Initially let ξij = 1 • Before each E-step, create a pseudo Gaussian response for each binary observation
• Run E-Step and M-Step using the Gaussian pseudo responses
• After the M-step, let Agarwal and Chen (KDD 2009)
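The pseudo-Gaussian response step can be sketched with the standard Jaakkola–Jordan variational bound for the logistic function; the exact parameterization used by Agarwal and Chen (2009) may differ, so treat this as an illustrative sketch.

```python
import math

def lam(xi):
    """Jaakkola-Jordan variational coefficient lambda(xi) = tanh(xi/2) / (4 xi)."""
    return math.tanh(xi / 2.0) / (4.0 * xi)

def pseudo_gaussian(y, xi):
    """Turn a binary observation y in {0, 1} into a Gaussian pseudo
    observation (mean z, variance v) under the variational bound:
    precision 2*lambda(xi), working response (y - 1/2) / (2*lambda(xi))."""
    l = lam(xi)
    return (y - 0.5) / (2.0 * l), 1.0 / (2.0 * l)

z1, v1 = pseudo_gaussian(1, 1.0)  # positive pseudo response for a click
z0, v0 = pseudo_gaussian(0, 1.0)  # negative pseudo response for a non-click
```

The bound σ(x) ≥ σ(ξ) exp((x−ξ)/2 − λ(ξ)(x²−ξ²)) is tight at x = ±ξ, which is why ξ is re-estimated after each M-step.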
Adaptive Rejection Sampling (ARS)
• For each E-step, obtain precise conditional posterior samples of
• Adaptively update the upper and lower bounds of the density function to do rejection sampling
Adaptive Rejection Sampling (ARS) • Goal: Obtain a sample x* from a log-concave target density
function p(x) • Try to avoid evaluating p(x) as much as possible • Initialize point set S with 3 initial points • Construct lower bound lower(x) and upper bound upper(x) using
the current S • Let e(x) = exp(upper(x)), s(x) = exp(lower(x)) • Let e1(x) be the density function from e(x) • Repeat the following steps until a sample is obtained:
– Draw x* ~ e1(x) and z ~ Unif(0,1) – If z <= s(x*)/e(x*), accept x* and break – If z <= p(x*)/e(x*), accept x* and break; otherwise reject x* – If x* rejected, update e(x) and s(x) by constructing new lower and
upper bounds
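The accept/reject loop above can be sketched with a simplified, non-adaptive version: a fixed constant envelope and a fixed squeeze function, so the cheap squeeze test accepts most draws without evaluating the target (the full ARS algorithm rebuilds both bounds from the point set S after each rejection, which this sketch omits).

```python
import random, math

def rejection_sample(log_p, lower_s, upper_e, proposal, rng):
    """One draw via rejection sampling with a squeeze: try the cheap
    lower bound s(x) first; evaluate the target p(x) only when the
    squeeze test fails (simplified, non-adaptive ARS loop)."""
    while True:
        x = proposal(rng)                # draw x* from the envelope density
        z = rng.random()                 # z ~ Unif(0, 1)
        e = upper_e(x)
        if z <= lower_s(x) / e:          # squeeze test: accept without p(x)
            return x
        if z <= math.exp(log_p(x)) / e:  # full test against the target
            return x                     # otherwise reject and retry

# Target: standard normal restricted to [-3, 3] (unnormalized, log-concave)
log_p = lambda x: -0.5 * x * x
s = lambda x: max(0.0, 1.0 - 0.5 * x * x)  # squeeze: exp(-t) >= 1 - t
e = lambda x: 1.0                          # constant upper envelope
prop = lambda rng: rng.uniform(-3.0, 3.0)  # envelope density is uniform here

rng = random.Random(7)
draws = [rejection_sample(log_p, s, e, prop, rng) for _ in range(4000)]
```

Adaptivity matters in practice because a tight piecewise-exponential envelope keeps the rejection rate, and hence the number of p(x) evaluations, low.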
Single machine model fitting is good, but…
• We are working with very large scale data sets
• Parallel matrix factorization methods using Map-Reduce have to be developed
• Monte Carlo EM (Booth and Hobert 1999) • Let • Let • E Step:
– Obtain N samples of the conditional posterior
• M Step:
Model Fitting Using MCEM
Parallel Matrix Factorization • Partition data into m partitions • For each partition run the MCEM algorithm and get .
•
• Ensemble runs: for k = 1, … , n – Repartition data into m partitions with a new seed – Run an E-step-only job for each partition given
• Average over user/item factors for all partitions and k’s to obtain the final estimate
One Map-Reduce job
Parallel Matrix Factorization • Partition data into m partitions • For each partition run the MCEM algorithm and get .
•
• Ensemble runs: for k = 1, … , n – Repartition data into m partitions with a new seed – Run an E-step-only job for each partition given
• Average over user/item factors for all partitions and k’s to obtain the final estimate
Each ensemble run is a Map-Reduce job
Key Points
• Partitioning is tricky! – By events? By items? By users?
• Empirically, “divide and conquer” + averaging to obtain works well!
• Ensemble runs: After obtaining , we run n E-step-only jobs and take the average, for each job using a different user-item mix.
Identifiability Issues (MCEM-ARSID)
• The same log-likelihood can be achieved by – g( ) = g( ) + r, h( ) = h( ) – r
• Center α, β, u to zero mean every E-step
– u = -u, v = -v • Constrain v to be positive
– Switching u.1, v.1 with u.2, v.2 • ui ~ N(G(xi), I), vj ~ N(H(xj), λI) • Constraint: Diagonal entries λ1 >= λ2 >= …
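The zero-mean centering fix above can be sketched as follows (an illustrative helper, not code from the tutorial): subtract each column mean from the latent factors and return the removed means, which the model fit folds back into the fixed-effect part.

```python
def center_columns(M):
    """Subtract the column mean from each latent-factor column,
    returning the centered matrix and the removed column means."""
    rows, cols = len(M), len(M[0])
    means = [sum(M[r][c] for r in range(rows)) / rows for c in range(cols)]
    centered = [[M[r][c] - means[c] for c in range(cols)] for r in range(rows)]
    return centered, means

U = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Uc, mu = center_columns(U)  # column means: [3.0, 4.0]
```

Applied after every E-step, this removes the translation invariance; the sign and column-switching invariances need the separate positivity and ordered-variance constraints listed above.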
MovieLens 1M Data
• 75% training and 25% test split by time • Imbalanced data
– User rating = 1: Positive – User rating = 2, 3, 4, 5: Negative – 5% positive rate
• Balanced data – User rating = 1, 2, 3: Positive – User rating = 4, 5: Negative – 44% positive rate
• MCEM-VAR and MCEM-ARS slightly better than SGD for 1 partition • A lot of effort spent on tuning L2 penalty parameters and learning rate
for SGD
Big difference between VAR and ARS for imbalanced data!
Yahoo! FrontPage Today Module Recommendation
How to Handle New Articles
• New articles can get generated any time during the day
• Problem: How to quickly build models to reflect user preferences for fresh articles
• Offline large-scale matrix factorization cannot be real-time!
Online Logistic Regression (OLR) § User i with feature xi, article j § Binary response y (click/non-click) § §
§ Prior § Using Laplace approximation or variational Bayesian
methods to obtain the posterior
§ Prior for next batch § Can approximate and as diagonal for high-dimensional xi
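A minimal sketch of one OLR batch update with a diagonal Gaussian prior and a Laplace approximation, assuming a single observation per batch for simplicity; `olr_update` is a hypothetical helper, not the tutorial's implementation.

```python
import math

def olr_update(m, q, x, y, iters=20, lr=0.1):
    """One batch update of online logistic regression with diagonal
    Gaussian prior N(m, diag(1/q)): find the MAP weights by gradient
    ascent on the log posterior, then update the diagonal precision
    from the Hessian at the mode (Laplace approximation).
    x: feature vector, y in {0, 1}, q: per-coordinate precisions."""
    w = list(m)
    d = len(m)
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for i in range(d):
            # log-likelihood gradient minus the prior pull toward m
            g = (y - p) * x[i] - q[i] * (w[i] - m[i])
            w[i] += lr * g
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    # diagonal Hessian of the negative log posterior at the mode
    q_new = [q[i] + p * (1 - p) * x[i] * x[i] for i in range(d)]
    return w, q_new  # this posterior becomes the prior for the next batch

m2, q2 = olr_update([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], 1)
```

The returned (mean, precision) pair is exactly what the slide means by "prior for next batch"; the diagonal approximation keeps the update O(dim) for high-dimensional xi.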
User Features for OLR • Age, gender, job position etc. for logged-in users
• General behavior targeting (BT) features – Music? Finance? Politics?
• User profiles from historical view/click behavior on previous items in the data, e.g. – Item-profile: use previously clicked item ids as the user profile – Category-profile: use item category affinity scores as the profile. The score can simply be the user’s historical CTR on each category.
– Are there better ways to generate user profiles? – Yes! By matrix factorization!
Matrix Factorization For User Profile in OLR
• Offline user profile building period: obtain the user factor for user i
• Online modeling using OLR – If a user has a profile (warm-start), use as the user covariates
– If not (cold-start), use as the user covariates
Large Yahoo! Front Page Data • Data for building user profiles: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events
• Data for training and testing the OLR model: Randomly served data with 2.4M clicks in July 2011
• Heavy users contributed around 30% of clicks • User covariates / features for OLR:
– Intercept-only (MOST POPULAR) – 124 Behavior targeting features (BT-ONLY) – BT + top 1000 clicked article ids (ITEM-PROFILE) – BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)
– BT + user latent factors from matrix factorization
Offline Evaluation Metric Related to Clicks
• For model M and J live items (articles) at any time
• If M = random (constant) model, E[S(M)] = #clicks
• Unbiased estimate of expected total clicks (Langford et al. 2008)
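A replay-style sketch of this estimator, assuming uniformly random serving over J live items (the event tuple layout and `replay_estimate` helper are hypothetical): count a click only when the model would have chosen the item that was actually served, and scale by J to undo the 1/J match probability.

```python
def replay_estimate(events, model, J):
    """Unbiased estimate of total clicks a model would collect, evaluated
    on uniformly-random served logs. Each event is (user, served_item,
    click); model(user, items) picks one of the J live items."""
    items = list(range(J))
    s = 0.0
    for user, served, click in events:
        if model(user, items) == served:  # model matches the log
            s += J * click                # importance weight J = 1 / (1/J)
    return s

# Random logging over J=2 items; a constant model that always picks item 0
events = [("u1", 0, 1), ("u1", 1, 1), ("u2", 0, 0), ("u2", 1, 0)]
pick_zero = lambda user, items: items[0]
est = replay_estimate(events, pick_zero, J=2)
```

For a constant model on uniform logs, E[S(M)] equals the total click count, matching the sanity check on the slide.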
Click Lift Performance For Different User Profiles
Warm Start: Users with at least one sample in the training data Cold Start: Users with no data in the training data
Conclusion
• Applied ARS to the probabilistic matrix factorization framework for imbalanced binary data
• Model fitting approach for large scale data using Map-Reduce
• Identifiability is the key for parallel matrix factorization (cold-start)
• Our proposed approach significantly outperforms baseline methods such as variational approximation
Bag of Little Bootstraps
Kleiner et al. 2012
Motivation: Challenges from Data Analysis
• Example: How to generate confidence intervals of model performance metrics like AUC?
• Single machine: – Bootstrap the test data 100 times – Compute std(AUC) from the AUC in the 100 test data sets
• How do we do it for large scale data? – Bootstrap is non-trivial!
Bootstrap (Efron, 1979) • A re-sampling based method to obtain the statistical distribution of sample estimators
• Why are we interested? – Re-sampling is embarrassingly parallelizable
• For example: – Standard deviation of the mean of N samples (μ) – For i = 1 to r do
• Randomly sample with replacement N times from the original sample -> bootstrap data i
• Compute the mean of the i-th bootstrap data -> μi
– Estimate of Sd(μ) = Sd([μ1,…μr]) – r is usually a large number, e.g. 200
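The single-machine procedure above can be sketched directly (an illustrative helper; `r = 200` as on the slide would work the same way):

```python
import random

def bootstrap_sd_of_mean(data, r=200, seed=0):
    """Bootstrap estimate of the standard deviation of the sample mean:
    r resamples with replacement, each of size N = len(data)."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(r):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    mu = sum(means) / r
    return (sum((m - mu) ** 2 for m in means) / (r - 1)) ** 0.5

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(400)]
sd_hat = bootstrap_sd_of_mean(data)  # should be near 1/sqrt(400) = 0.05
```

The inner resampling of N points per replicate is exactly the cost that becomes prohibitive at web scale, motivating the next two slides.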
Bootstrap for Big Data
• Can have r nodes running in parallel, each sampling one bootstrap data set
• However… – N can be very large – Data may not fit into memory – Collecting N samples with replacement on each node can be computationally expensive
M out of N Bootstrap (Bickel et al. 1997)
• Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N
• Apply an analytical correction to SdM(μ) to obtain Sd(μ) using prior knowledge of the convergence rate of the sample estimates
• However… – Prior knowledge is required – The choice of M is critical to performance – Finding the optimal value of M needs more computation
Bag of Little Bootstraps (BLB) • Example: Standard deviation of the mean • Generate S sampled data sets, each obtained by randomly
sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b)
• For each data set p = 1 to S do – For i = 1 to r do
• N samples with replacement on data of size b • Compute the mean of the resampled data μpi
– Compute Sdp(μ) = Sd([μp1,…μpr]) • Estimate of Sd(μ) = Avg([Sd1(μ),…, SdS(μ)])
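The steps above can be sketched for the standard-deviation-of-the-mean example (an illustrative helper; a production version would draw the size-N resample as multinomial counts rather than N individual draws):

```python
import random

def blb_sd_of_mean(data, S=5, r=50, gamma=0.7, seed=0):
    """Bag of Little Bootstraps estimate of Sd(mean): S subsets of size
    b = N**gamma, each bootstrapped r times with resamples of the FULL
    size N, then average the S per-subset estimates."""
    rng = random.Random(seed)
    N = len(data)
    b = int(N ** gamma)
    sds = []
    for _ in range(S):
        subset = rng.sample(data, b)  # size-b subsample without replacement
        means = []
        for _ in range(r):
            # N draws with replacement, but only from the b subset values
            means.append(sum(subset[rng.randrange(b)] for _ in range(N)) / N)
        mu = sum(means) / r
        sds.append((sum((m - mu) ** 2 for m in means) / (r - 1)) ** 0.5)
    return sum(sds) / S

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(400)]
est = blb_sd_of_mean(data)  # should again be near 1/sqrt(400) = 0.05
```

Each resample is still of size N, which is what makes the per-subset estimates comparable to the full bootstrap without the b-to-N correction that m-out-of-n requires.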
Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from size-N data – ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b)
• For each data set p = 1 to S do – For i = 1 to r do
• Sample N samples with replacement on data of size b • Compute the estimate of the resampled data θpi
– Compute ξp(θ) = ξ([θp1,…θpr]) • Estimate of ξ(θ) = Avg([ξ1(θ),…, ξS(θ)])
Bag of Little Bootstraps (BLB) • Interest: ξ(θ), where θ is an estimate obtained from size-N data – ξ is some function of θ, such as the standard deviation, …
• Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each with size b)
• For each data set p = 1 to S do – For i = 1 to r do
• Sample N samples with replacement on the data of size b • Compute the estimate of the resampled data θpi
– Compute ξp(θ) = ξ([θp1,…θpr]) • Estimate of ξ(θ) = Avg([ξ1(θ),…, ξS(θ)])
[Diagram: the inner resampling loop runs in Mappers, per-subset aggregation in Reducers, and the final average at the Gateway]
Why is BLB Efficient?
• Before: – N samples with replacement from size-N data is expensive when N is large
• Now: – N samples with replacement from size-b data – b can be several orders of magnitude smaller than N (e.g. b = Nγ, γ in [0.5, 1))
– Equivalent to: A multinomial sampler with dim = b – Storage = O(b), Computational complexity = O(b)
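The multinomial equivalence can be sketched as follows (an illustrative helper; it draws the N categorical samples one by one for clarity, whereas an O(b) implementation would sample the multinomial counts directly): the size-N resample is fully described by b counts, so only O(b) values are ever stored.

```python
import random

def multinomial_weights(b, N, rng):
    """Counts (w_1, ..., w_b) summing to N: how many times each of the
    b subset points appears in a size-N resample. Storage is O(b) even
    though the resample has N 'virtual' rows."""
    counts = [0] * b
    for _ in range(N):
        counts[rng.randrange(b)] += 1
    return counts

def weighted_mean(subset, counts, N):
    """The resample mean computed from the b weights alone."""
    return sum(x * w for x, w in zip(subset, counts)) / N

rng = random.Random(3)
subset = [1.0, 2.0, 4.0]
w = multinomial_weights(len(subset), 1000, rng)
m = weighted_mean(subset, w, 1000)  # close to the subset mean 7/3
```

Any estimator that accepts per-row weights (weighted regression, weighted means, …) can consume these counts directly, which is what keeps each BLB node's memory footprint at O(b).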
Simulation Experiment
• 95% CI of Logistic Regression Coefficients • N = 20000, 10 explanatory variables • Relative Error = |Estimated CI width – True CI width| / True CI width
• BLB-γ: BLB with b = Nγ • BOFN-γ: b out of N sampling with b = Nγ
• BOOT: Naïve bootstrap
Simulation Experiment
Real Data
• 95% CI of Logistic Regression Coefficients • N = 6M, 3000 explanatory variables • Data size = 150GB, r = 50, S = 5, γ = 0.7
Data stored on disk Data stored in memory using Spark
Hyper-parameter Selection
• From empirical experiments in the paper – S >= 3 and r >= 50 is sufficient for low relative errors
• Adaptively selecting r and S – Increase r and S until the estimated value converges – For r: Continue to process resamples and update ξp(θ) until it has ceased to change significantly.
– For S: Continue to process more subsamples (i.e., increasing S) until BLB’s output value has stabilized.
Hyper-parameter Selection
BLB-0.7
• Adaptively selecting r and S – Increase r and S until the estimated value converges
Adaptive BLB
Summary of BLB
• A new algorithm for bootstrapping on big data
• Advantages – Fast and efficient – Easy to parallelize – Easy to understand and implement – Friendly to Hadoop, makes it routine to perform statistical calculations on Big Data
Part V: Future of Distributed Computing Infrastructure
Problem of ADMM in Hadoop
• ADMM in Hadoop can take hours to converge
• Is there a better way to handle iterative learning processes?
Spark overview
§ What is Spark – A fast cluster computing system compatible with Hadoop
§ Works with HDFS – Improves efficiency through in-memory computing primitives – Written in Scala, provides APIs in Scala, Java and Python
§ Often 2-10× less code with Scala
§ What can it do – Developed for iterative algorithms
§ Load data into memory and query it repeatedly more quickly than with disk-based systems like Hadoop
– Retain the attractive properties of MapReduce § Fault tolerance, data locality, scalability
Iterative Process in Hadoop
[Diagram: each Hadoop iteration writes intermediate results to disk: Disk → Mappers → Disk → Reducers, repeated for every iteration]
Iterative Process in Spark
[Diagram: in Spark, data is read from disk once, then each iteration’s Mappers and Reducers pass results through memory]
Spark cluster speed-up preliminary results
[Figure: seconds per iteration (0–140) vs. number of partitions (300–1200), comparing 120 CPUs and 816 CPUs]
Initial Spark Result Summary
• Spark is very efficient for large-scale machine learning purposes – ADMM on Ads data, 20 mins (Spark) vs. 4-5 hours (Hadoop)
– Provides an offline experimental system to vet various modeling ideas with high agility
– Models get trained faster and thus can react to temporal shifts of the online system more quickly
GraphLab
• An open-source graph-based, high performance, distributed computation framework in C++
• http://graphlab.org/ • HDFS integration • Major design
– Sparse data with local dependencies – Iterative algorithms – Potentially asynchronous execution among nodes
GraphLab
• Graph-parallel • Map-Reduce: computation applied to independent records
• GraphLab: dependent records stored as vertices in a large distributed data-graph
• Computation in parallel on each vertex, which can interact with neighboring vertices
GraphX
• Combines the advantages of both data-parallel and graph-parallel systems
• Distributed graph computation on Spark
Final Remarks
Summary • Large scale machine learning is still a new area with a lot of open problems
• The advancing distributed computing infrastructure plays a big role in the algorithms
• In today’s tutorial, we covered – Distributed computing and Map-Reduce – Web recommender systems: our motivating problem – Large scale logistic regression – Parallel matrix factorization – Bag of little bootstraps: an analysis tool for big data – Future of distributed computing infrastructure
Bibliography
Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 19–28. ACM.
Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 703–712. ACM.
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 95–104. ACM.
Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 265–285.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Bickel, P. J., Götze, F., and van Zwet, W. R. (2012). Resampling fewer than n observations: gains, losses, and remedies for losses, 267–297. Springer New York.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Bibliography
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. arXiv preprint arXiv:1206.6415.
Khanna, R., Zhang, L., Agarwal, D., and Chen, B. (2013). Parallel matrix factorization for binary response. IEEE BigData 2013.
Salakhutdinov, R. and Mnih, A. (2007). Probabilistic matrix factorization. Advances in Neural Information Processing Systems.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible regression priors. In Proceedings of the Fifth ACM Conference on Recommender Systems, 13–20. ACM.