Misha Bilenko, Principal Researcher, Microsoft at MLconf Seattle - 5/01/15


Many Shades of Scale: Big Learning

Beyond Big Data

Misha Bilenko

Principal Researcher

Microsoft Azure Machine Learning

ML ♥ More Data

[Plot: what we see in production (Banko and Brill, 2001)]

[Plot: what we used to learn in school (Mooney, 1996)]

ML ♥ More Data

[Plot: what we see in production (Banko and Brill, 2001)]

Is training on more examples all there is to it?

Big Learning ≠ Learning(BigData)

• Big data: size → distributing storage and processing

• Big learning: scale bottlenecks in training and prediction

• Classic bottlenecks: bytes and cycles. Large datasets → distribute training on larger hardware (FPGAs, GPUs, cores, clusters)

• Other scaling dimensions: features, components/people


Learning from Counts with DRACuLa

DRACuLa: Distributed Robust Algorithm for Count-based Learning

Joint work with Chris Meek (MSR), Wenhan Wang and Pete Luferenko (Azure ML)

Scaling to many Features

Learning with relational data

𝑝(π‘π‘™π‘–π‘π‘˜|π‘Žπ‘‘,π‘π‘œπ‘›π‘‘π‘’π‘₯𝑑,π‘’π‘ π‘’π‘Ÿ) adid = 1010054353adText = K2 ski sale!adURL= www.k2.com/sale

Userid = 0xb49129827048dd9bIP = 131.107.65.14

Query = powder skisQCategories = {skiing, outdoor gear}

6

#π‘’π‘ π‘’π‘Ÿπ‘ ~109 #π‘žπ‘’π‘’π‘Ÿπ‘–π‘’π‘ ~109+ #π‘Žπ‘‘π‘ ~107 # π‘Žπ‘‘ Γ— π‘žπ‘’π‘’π‘Ÿπ‘¦ ~1010+

• Information retrieval

• Advertising, recommending, search: item, page/query, user

• Transaction classification

• Payment fraud: transaction, product, user

• Email spam: message, sender, recipient

• Intrusion detection: session, system, user

• IoT: device, location

Learning with relational data

𝑝(π‘π‘™π‘–π‘π‘˜|π‘’π‘ π‘’π‘Ÿ,π‘π‘œπ‘›π‘‘π‘’π‘₯𝑑,π‘Žπ‘‘)

adid: 1010054353adText: Fall ski sale!adURL: www.k2.com/sale

userid 0xb49129827048dd9bIP 131.107.65.14

query powder skisqCategories {skiing, outdoor gear}

7

• Problem: representing high-cardinality attributes as features

• Scalable: to billions of attribute values

• Efficient: ~10^5+ predictions/sec/node

• Flexible: for a variety of downstream learners

• Adaptive: to distribution change

• Standard approaches: binary features, hashing

• What everyone should use in industry: learning with counts

• Formalization and generalization

Standard approach 1: binary (one-hot, indicator)

Attributes are mapped to indices based on lookup tables

- Not scalable: cannot support high-cardinality attributes
- Not efficient: large value-index dictionary must be retained
- Not flexible: only linear learners are practical
- Not adaptive: doesn't support drift in attribute values

[Figure: concatenated one-hot blocks of size #userIPs, #ads, #queries, #queries × #ads, with single active indices idx_u(131.107.65.14), idx_a(k2.com), idx_q(powder skis), idx(powder skis, k2.com)]

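For concreteness, a minimal sketch (not from the talk; helper names are hypothetical) of lookup-table one-hot featurization, showing the value-to-index dictionary that causes the scalability problems above:

```python
# Minimal sketch of lookup-table one-hot encoding (illustrative only).
# The dictionary must retain every attribute value ever seen (not scalable,
# not efficient), and values first seen at prediction time have no index.
value_to_index: dict[str, int] = {}

def one_hot_index(value: str) -> int:
    # Grow the lookup table on first sight of a value.
    if value not in value_to_index:
        value_to_index[value] = len(value_to_index)
    return value_to_index[value]

def featurize(attributes: list[str]) -> dict[int, float]:
    # Sparse binary vector: active index -> 1.0.
    return {one_hot_index(a): 1.0 for a in attributes}

print(featurize(["131.107.65.14", "k2.com", "powder skis"]))
```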

Standard approach 1+: feature hashing

Attributes are mapped to indices via hashing: h(x_i) = hash(x_i) mod m

• Collisions are rare; dot products unbiased

+ Scalable: no mapping tables
+ Efficient: low cost, preserves sparsity
- Not flexible: only linear learners are practical
± Adaptive: new values OK, no temporal effects

[Figure: hashed feature vector φ(x) of dimensionality m ~ 10^7, with active indices h(powder skis + k2.com), h(powder skis), h(k2.com), h(131.107.65.14)]

[Moody '89, Tarjan-Skadron '05, Weinberger+ '08]


Learning with counts

• Features are per-label counts [+odds] [+backoff]

φ = [N+, N−, log(N+) − log(N−), IsRest]

• log(N+) − log(N−) = log(p(+)/p(−)): log-odds/Naïve Bayes estimate

• N+, N−: indicators of confidence of the naïve estimate

• IsRest: indicator of back-off vs. "real count"

131.107.65.14 → Counts(131.107.65.14)
k2.com → Counts(k2.com)
powder skis → Counts(powder skis)
powder skis, k2.com → Counts(powder skis, k2.com)

IP              N+       N−
173.194.33.9    46964    993424
87.250.251.11   31       843
131.107.65.14   12       430
…               …        …
REST            745623   13964931

𝝓(π‘ͺ𝒐𝒖𝒏𝒕𝒔 (𝑰𝑷)) 𝝓(π‘ͺ𝒐𝒖𝒏𝒕𝒔 (𝒂𝒅)) 𝝓(π‘ͺ𝒐𝒖𝒏𝒕𝒔 (π’’π’–π’†π’“π’š)) 𝝓(π‘ͺ𝒐𝒖𝒏𝒕𝒔 (π’’π’–π’†π’“π’š, 𝒂𝒅))

Learning with counts

• Features are per-label counts [+odds] [+backoff]

φ = [N+, N−, log(N+) − log(N−), IsRest]

+ Scalable: "head" in memory + tail in backoff; or: count-min sketch
+ Efficient: low cost, low dimensionality
+ Flexible: low dimensionality works well with non-linear learners
+ Adaptive: new values easily added, back-off for infrequent values, temporal counts

φ(Counts(user)) φ(Counts(ad)) φ(Counts(query)) φ(Counts(query × ad))

131.107.65.14 → Counts(131.107.65.14)
k2.com → Counts(k2.com)
powder skis → Counts(powder skis)
powder skis, k2.com → Counts(powder skis, k2.com)

φ(Counts(IP)) φ(Counts(ad)) φ(Counts(query)) φ(Counts(query, ad))

IP              N+       N−
173.194.33.9    46964    993424
87.250.251.11   31       843
131.107.65.14   12       430
…               …        …
REST            745623   13964931

Backoff is a pain. Count-Min Sketches to the Rescue! [Cormode-Muthukrishnan '04]

Intuition: correct for collisions by using multiple hashes

Sketch: matrix M (d × w), one hash function h_j per row

Count: for each hash function j, M[j][h_j(i)]++ (update time O(d))

Featurize: min_j M[j][h_j(i)] (estimation time O(d))
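A minimal count-min sketch along the lines above (the hash construction is my own choice); per-label counts would keep one sketch per label or use a composite (value, label) key:

```python
import hashlib

class CountMinSketch:
    """d x w count matrix; min over the d hashed rows corrects for collisions."""

    def __init__(self, d: int = 4, w: int = 2**20):
        self.d, self.w = d, w
        self.M = [[0] * w for _ in range(d)]

    def _h(self, j: int, key: str) -> int:
        # j-th hash function: salt the key with the row index.
        digest = hashlib.md5(f"{j}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.w

    def update(self, key: str, delta: int = 1) -> None:
        # Count: for each hash function, M[j][h_j(i)]++ -- O(d) update.
        for j in range(self.d):
            self.M[j][self._h(j, key)] += delta

    def estimate(self, key: str) -> int:
        # Featurize: min_j M[j][h_j(i)] -- O(d), never underestimates.
        return min(self.M[j][self._h(j, key)] for j in range(self.d))

cms = CountMinSketch()
for _ in range(42):
    cms.update("powder skis")
print(cms.estimate("powder skis"))  # 42 (plus any collision noise)
```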

Learning from counts: aggregation

Aggregate Count(y, bin(x)) for different bin(x); a toy sketch follows the tables below

• Standard MapReduce

• Bin function: any projection

• Backoff options: "tail bin", hashing, hierarchical (shrinkage)

IP              N+       N−
173.194.33.9    46964    993424
87.250.251.11   31       843
131.253.13.32   12       430
…               …        …
REST            745623   13964931

query          N+        N−
facebook       281912    7957321
dozen roses    32791     640964
…              …         …
REST           6321789   43477252

Query × AdId       N+        N−
facebook, ad1      54546     978964
facebook, ad2      232343    8431467
dozen roses, ad3   12973     430982
…                  …         …
REST               4419312   52754683

[Timeline: counting runs over the data stream up to T_now]

IP[2]         N+       N−
173.194.*.*   46964    993424
87.250.*.*    6341     91356
131.253.*.*   75126    430826
…             …        …

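The promised toy, single-node stand-in for the MapReduce aggregation (records and field names are invented): pick bin functions, then accumulate (N+, N−) per bin value:

```python
from collections import defaultdict

# Invented click-log records, for illustration only.
log = [
    {"ip": "173.194.33.9", "query": "facebook", "ad": "ad1", "click": 1},
    {"ip": "131.253.13.32", "query": "dozen roses", "ad": "ad3", "click": 0},
    {"ip": "173.194.33.9", "query": "facebook", "ad": "ad2", "click": 0},
]

# Bin functions: any projection of the example works,
# including hierarchical backoff such as the first two IP octets.
bins = {
    "IP": lambda r: r["ip"],
    "IP[2]": lambda r: ".".join(r["ip"].split(".")[:2]) + ".*.*",
    "query": lambda r: r["query"],
    "query x ad": lambda r: (r["query"], r["ad"]),
}

# tables[bin_name][bin_value] = [N+, N−]
tables = {name: defaultdict(lambda: [0, 0]) for name in bins}
for r in log:
    for name, bin_fn in bins.items():
        tables[name][bin_fn(r)][1 - r["click"]] += 1

print(dict(tables["IP[2]"]))
```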

Learning from counts: combiner training

IP              N+       N−
173.194.33.9    46964    993424
87.250.251.11   31       843
131.253.13.32   12       430
…               …        …
REST            745623   13964931

query          N+        N−
facebook       281912    7957321
dozen roses    32791     640964
…              …         …
REST           6321789   43477252

[Diagram: counting over the stream up to T_now; a predictor is trained on aggregated count features (N+, N−, ln N+ − ln N−, IsBackoff, …) together with the original numeric features]

Train non-linear model on count-based features (a toy sketch follows the table below)

• Counts, transforms, lookup properties

• Additional features can be injected

Query × AdId       N+        N−
facebook, ad1      54546     978964
facebook, ad2      232343    8431467
dozen roses, ad3   12973     430982
…                  …         …
REST               4419312   52754683

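The promised toy combiner-training sketch, with scikit-learn's gradient-boosted trees standing in for the non-linear learner; the count table, records, and the extra numeric feature are invented:

```python
import math
from sklearn.ensemble import GradientBoostingClassifier

counts = {"facebook": (281912, 7957321), "dozen roses": (32791, 640964),
          "REST": (6321789, 43477252)}

def phi(value):
    # [N+, N−, log-odds, IsRest], as defined earlier.
    is_rest = value not in counts
    n_pos, n_neg = counts.get(value, counts["REST"])
    return [n_pos, n_neg, math.log(n_pos + 0.5) - math.log(n_neg + 0.5),
            float(is_rest)]

# Each row: count features for the query, plus an original numeric feature.
X = [phi("facebook") + [0.7],
     phi("dozen roses") + [0.1],
     phi("rare query") + [0.4]]
y = [1, 0, 0]  # click / no-click labels

combiner = GradientBoostingClassifier(n_estimators=10).fit(X, y)
print(combiner.predict_proba([phi("facebook") + [0.5]]))
```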

Prediction with counts

IP              N+       N−
173.194.33.9    46964    993424
87.250.251.11   31       843
131.253.13.32   12       430
…               …        …
REST            745623   13964931

query          N+        N−
facebook       281912    7957321
dozen roses    32791     640964
…              …         …
REST           6321789   43477252

URL × Country   N+        N−
url1, US        54546     978964
url2, CA        232343    8431467
url3, FR        12973     430982
…               …         …
REST            4419312   52754683

[Diagram: counts keep accumulating past T_train up to T_now; the combiner trained at T_train consumes the aggregated count features (N+, N−, ln N+ − ln N−, IsBackoff, …) at prediction time]

• Counts are updated continuously

• Combiner re-training is infrequent


Where did it come from?

Li et al. 2010

Pavlov et al. 2009

Lee et al. 1998

Yeh and Patt, 1991


Hillard et al. 2011

• De-facto standard in the online advertising industry

• Rediscovered by everyone who really cares about accuracy

Do we need to separate counting and training?

• Can we use the same data for both counting and featurization?

• Bad idea: leakage = count features contain labels → overfitting

• Combiner dedicates capacity to decoding example's label from features

• Can we hold out each example's label during train-set featurization?

• Bad idea: leakage and bias

• Illustration: two examples, same feature values, different labels (click and non-click)

• Different representations are inconsistent and allow decoding the label

[Diagram: counting → train predictor]

Example ID   Label   N+[a]      N−[a]
1            +       N_a+ − 1   N_a−
2            −       N_a+       N_a− − 1

Solution via Differential privacy

• What is leakage? Revealing information about any individual label

• Formally: count table C_T is ε-leakage-proof if it yields the same features for any x whether built from T or from T′ = T \ (x_i, y_i)

• Theorem: adding noise sampled from Laplace(k/ε) makes counts ε-leakage-proof

• Typically 1 ≤ k ≤ 100

• Concretely: N+ ← N+ + LaplaceRand(0, 10k), N− ← N− + LaplaceRand(0, 10k)

• In practice: LaplaceRand(0,1) sufficient
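A minimal sketch of the noising step; it relies on the fact that the difference of two i.i.d. exponential draws with scale b is Laplace(0, b)-distributed (function names are hypothetical):

```python
import random

def laplace_rand(scale: float) -> float:
    # Difference of two iid Exponential(1/scale) draws ~ Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def leakage_proof_counts(n_pos: int, n_neg: int, k: int = 1):
    # N+ <- N+ + LaplaceRand(0, 10k), N− <- N− + LaplaceRand(0, 10k), per the slide.
    return n_pos + laplace_rand(10 * k), n_neg + laplace_rand(10 * k)

print(leakage_proof_counts(12, 430))
```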

Learning from counts: why it works

• State-of-the-art accuracy

• Easy to implement on standard clusters

• Monitorable and debuggable

• Temporal changes easy to monitor

• Easy emergency recovery (bot attacks, etc.)

• Error debugging (which feature to blame)

• Modular (vs. monolithic)

• Components: learners and count features

• People: multiple feature/learner authors


Big Learning: Pipelines and Teams

Ravi: text features in R

Jim: matrix projections

Vera: sweeping boosted trees

Steph: count features on Hadoop

How to scale up Machine Learning to parallel and distributed data scientists?

AzureML

• Cloud-hosted, graphical environment for creating, training, evaluating, sharing, and deploying machine learning models

• Supports versioning and collaboration

• Dozens of ML algorithms, extensible via R and Python

[Diagram: ML Studio and API]

Learning with Counts in Azure ML

Criteo 1TB dataset

Counting: an hour on HDInsight Hadoop cluster

Training: minutes in AzureML Studio

Deployment: one click to RRS service

Maximizing Utilization: Keeping it Asynchronous

• Macro-level: concurrently executing pipelines

• Micro-level: asynchronous optimization (with overwriting updates)

• Hogwild SGD [Recht-Re], Downpour SGD [Google Brain]

• Parameter Server [Smola et al.]

• GraphLab [Guestrin et al.]

• SA-SDCA [Tran, Hosseini, Xiao, Finley, B.]

Semi-Asynchronous SDCA: state-of-the-art linear learning

• SDCA: Stochastic Dual Coordinate Ascent [Shalev-Shwartz & Zhang]

• Plot: SGD marries SVM and they have a beautiful baby

• Algorithm: for each example, update the example's α_i, then re-estimate weights

• Let's make it asynchronous, Hogwild-style!

• Problem: primal and dual diverge

• Solution: separate thread for primal-dual synchronization

• Taking it out-of-memory: block pseudo-random data loading

SGD update: w_{t+1} ← w_t − γ_t (λ w_t − y_i φ_i′(w_t · x_i) x_i)

SDCA update: α_i^t ← α_i^{t−1} + Δα_i, then w_t ← w_{t−1} + (Δα_i / (λn)) x_i

Keeping it asynchronous: it pays off

In closing: Big Learning = Streetfighting

• Big features are resource-hungry: learning with counts, projections…

• Make them distributed and easy to compute/monitor

• Big learners are resource-hungry

• Parallelize them (preferably asynchronously)

• Big pipelines are resource-hungry: authored by many humans

• Run them in a collaborative cloud environment