Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models

Post on 18-Dec-2015

Page 1: How to Work with Large Datasets to Build Predictive Models

Girish Nathan, Misha Bilenko

Microsoft Azure Machine Learning

Page 2: Agenda

1. How to Work with Large Datasets
• Sample Dataset: NYC Taxi
• HDInsight (Hadoop on Azure)
• IPython notebook and HDInsight

2. Building Predictive Models
• Azure ML Studio
• Learning with Counts

3. Putting it all together: Learning with Counts and HDInsight

Page 3: Sample Data: NYC Taxi

• One-year log of NYC taxi rides
• 60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip (driver id, times, locations) and fare (fare, tip, tolls)

• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, IPython, Azure ML Studio
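As a rough illustration of the data-wrangling step, the sketch below joins trip records with fare records and derives a binary "tipped" label. The column names (medallion, hack_license, pickup_datetime, tip_amount, and so on), the join key, and the sample values are assumptions made for illustration and are not taken from the slides.

```python
import csv
import io

# Tiny in-memory stand-ins for the trip and fare files (values are made up).
TRIP_CSV = """medallion,hack_license,pickup_datetime,passenger_count,trip_distance
M1,H1,2013-01-01 00:00:00,1,2.5
M2,H2,2013-01-01 00:05:00,2,0.8
"""
FARE_CSV = """medallion,hack_license,pickup_datetime,fare_amount,tip_amount,tolls_amount
M1,H1,2013-01-01 00:00:00,10.0,2.0,0.0
M2,H2,2013-01-01 00:05:00,5.0,0.0,0.0
"""

# Assumed join key shared by both files.
KEY = ("medallion", "hack_license", "pickup_datetime")


def join_trip_and_fare(trip_file, fare_file):
    """Join trips to fares on KEY and add a binary 'tipped' label."""
    fares = {tuple(row[k] for k in KEY): row for row in csv.DictReader(fare_file)}
    joined = []
    for trip in csv.DictReader(trip_file):
        fare = fares.get(tuple(trip[k] for k in KEY))
        if fare is None:
            continue  # no matching fare record for this trip
        row = dict(trip)
        row.update(fare)
        row["tipped"] = 1 if float(fare["tip_amount"]) > 0 else 0
        joined.append(row)
    return joined


rows = join_trip_and_fare(io.StringIO(TRIP_CSV), io.StringIO(FARE_CSV))
```

At the 60 GB scale of the real dataset, a join like this would be expressed as a Hive query on HDInsight rather than run in memory.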

Page 4: HDInsight: Hadoop on Azure

• 100% Apache Hadoop as an Azure service
• Can deploy on Windows or Linux
• Provides Map-Reduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code

    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;
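For readers less familiar with SQL, the Hive query on this slide computes a top-10 frequency count of the values in Col1. A minimal in-memory Python equivalent might look like the sketch below; the function name and sample rows are hypothetical, and on a real cluster Hive distributes this work as Map-Reduce jobs.

```python
from collections import Counter


def top_values(rows, column, limit=10):
    """Count occurrences of each value in `column`, most frequent first."""
    counts = Counter(row[column] for row in rows)
    return counts.most_common(limit)


# Hypothetical sample rows standing in for Your_Table.
sample_rows = [
    {"Col1": "a"}, {"Col1": "b"}, {"Col1": "a"},
    {"Col1": "c"}, {"Col1": "a"}, {"Col1": "b"},
]
top = top_values(sample_rows, "Col1", limit=2)
```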

Page 5: IPython Notebook

• Web-based Python REPL environment
• Combines authoring, execution, visualization
• Can author and execute HDInsight Hive queries
• Sample query (Python code snippet)

    import json
    import urllib2  # Python 2 standard library

    def submit_hive_query(self):
        # POST the Hive parameters and keep the returned job id.
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']

    def query(self, queryString):
        self.submit_hive_query()

Example query string: SELECT * FROM sample_table LIMIT 10;
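The snippet on this slide is Python 2 and omits how the request parameters are built. A minimal Python 3 sketch of the same idea follows, assuming the cluster exposes the WebHCat (Templeton) REST endpoint for Hive. The helper names, the example URL, and the use of the `user.name`, `execute`, and `statusdir` parameters are assumptions about the setup, and authentication is left out.

```python
import json
import urllib.parse
import urllib.request


def build_hive_request(endpoint_url, user, query, status_dir):
    """Build (but do not send) a POST request that submits a Hive query."""
    params = urllib.parse.urlencode({
        "user.name": user,        # assumed WebHCat parameter names
        "execute": query,
        "statusdir": status_dir,
    }).encode("utf-8")
    return urllib.request.Request(endpoint_url, data=params, method="POST")


def parse_job_id(response_body):
    """Extract the job id from the JSON body returned on submission."""
    return json.loads(response_body)["id"]


# Hypothetical endpoint; sending would be urllib.request.urlopen(request).
request = build_hive_request(
    "https://example.invalid/templeton/v1/hive",
    "admin",
    "SELECT * FROM sample_table LIMIT 10;",
    "/example/status",
)
```

Separating request construction from sending keeps the sketch testable without a live cluster.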

Page 6: What is Azure ML Studio?

• Fully managed cloud service

• Browser-based authoring of dataflow

• Best-in-class machine learning algorithms

• Support for R/Python/SQL

• Collaborative data science

• Quickly deploy models as web services/REST APIs

• Publish to a gallery for collaboration with the community

Page 7: Learning with Counts, a.k.a. Dracula

(Distributed Robust Algorithm for CoUnt-based LeArning)

Misha Bilenko

Microsoft Azure Machine Learning, Microsoft Research

Page 8: Large-scale learning in multi-entity domains

Example:

adid = 1010054353
adText = K2 ski sale!
adURL = www.k2.com/sale

userid = 0xb49129827048dd9b
IP = 131.107.65.14

query = powder skis
qCategories = {skiing, outdoor gear}

Cardinalities: $\#\text{users} \sim 10^{9}$, $\#\text{queries} \sim 10^{9}+$, $\#\text{ads} \sim 10^{7}$, $\#(\text{ad} \times \text{query}) \sim 10^{10}+$

• Information retrieval
  • Advertising, recommending, search: item, page/query, user

• Transaction classification
  • Payment fraud: transaction, product, user
  • Email spam: message, sender, recipient
  • Intrusion detection: session, system, user
  • IoT: device, location

Page 9: Large-scale learning in multi-entity domains

Example:

adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale

userid: 0xb49129827048dd9b
IP: 131.107.65.14

query: powder skis
qCategories: {skiing, outdoor gear}

• Problem: representing high-cardinality attributes as features
  • Scalable: to billions of attribute values
  • Efficient: predictions/sec
  • Flexible: for a variety of downstream learners
  • Adaptive: to distribution change

• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization
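For contrast with count-based features, here is a minimal sketch of one of the standard approaches named on this slide, feature hashing. The function name, the bucket count, and the choice of CRC32 as the hash are illustrative assumptions.

```python
import zlib


def hashed_one_hot(value, n_buckets=16):
    """Map a high-cardinality value to a fixed-length binary vector."""
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    index = zlib.crc32(value.encode("utf-8")) % n_buckets
    vector = [0] * n_buckets
    vector[index] = 1
    return vector


ip_vector = hashed_one_hot("131.107.65.14")
```

Unrelated values can collide in the same bucket, and the dimensionality stays at `n_buckets` per attribute, whereas count features use only a handful of numbers per attribute.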

Page 10: Learning with Counts

• Features are transforms of conditional statistics (per-label counts)
  Feature vector for an attribute value = [N+, N-, log(N+) - log(N-), IsBackoff]
• log(N+) - log(N-) = log(N+/N-): log-odds / Naïve Bayes estimate

• N+, N-: indicators of confidence of the naïve estimate

• IsBackoff (also labeled IsFromRest): indicator of back-off vs. "real" count

[Figure: count features are looked up for each attribute value of the example: IP 131.107.65.14, ad domain k2.com, query "powder skis", and the pair (powder skis, k2.com)]

IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.107.65.14    12        430
…                …         …
REST             745623    13964931
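The lookup described on this slide can be sketched as follows, using the IP table values shown above. The function and variable names are hypothetical, and reading the two count columns as (N+, N-) is an assumption based on the order of the feature vector.

```python
import math

# Per-value counts, assumed to be (N+, N-); values copied from the IP table.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
}
# Aggregate counts for all values that have no row of their own.
REST_COUNTS = (745623, 13964931)


def count_features(value, table=IP_COUNTS, rest=REST_COUNTS):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = rest if is_backoff else table[value]
    # Assumes both counts are positive, as in the slide's table.
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, int(is_backoff)]


seen = count_features("131.107.65.14")   # has its own row
unseen = count_features("10.0.0.1")      # falls back to REST
```

An unseen value gets the REST counts with IsBackoff set to 1, so the downstream learner can tell a backed-off estimate from a real one.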

Page 11: Learning with Counts

• Features are transforms of conditional counts
  Feature vector for an attribute value = [N+, N-, log(N+) - log(N-), IsBackoff]

• Scalable: "head" in memory + tail in backoff; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts

[Figure and IP count table repeated from page 10]
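The count-min sketch mentioned on this slide stores approximate counts in fixed memory. Below is a minimal sketch of the data structure; the class name, default sizes, and the salted-CRC32 hashing are illustrative assumptions, and a production version would use properly independent hash functions.

```python
import zlib


class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Salt the key with the row number so each row hashes differently.
        data = ("%d:%s" % (row, key)).encode("utf-8")
        return zlib.crc32(data) % self.width

    def add(self, key, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += amount

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.rows[row][self._index(key, row)]
                   for row in range(self.depth))


# One sketch per label would hold N+ and N- separately.
positives = CountMinSketch(width=64, depth=3)
for _ in range(5):
    positives.add("131.107.65.14")
positives.add("87.250.251.11", amount=2)
```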

Page 12: Learning with Counts: aggregation

Aggregate counts for different bin functions
• Standard MapReduce
• Bin function: any projection
• Backoff options: "tail bin", hashing, hierarchical (shrinkage)

[Figure: timeline; counts are accumulated over the "Counting" period up to T_now]

IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.253.13.32    12        430
…                …         …
REST             745623    13964931

query            N+        N-
facebook         281912    7957321
dozen roses      32791     640964
…                …         …
REST             6321789   43477252

Query × AdId       N+        N-
facebook, ad1      54546     978964
facebook, ad2      232343    8431467
dozen roses, ad3   12973     430982
…                  …         …
REST               4419312   52754683

IP[2]            N+        N-
173.194.*.*      46964     993424
87.250.*.*       6341      91356
131.253.*.*      75126     430826
…                …         …
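The aggregation step can be sketched as a single pass that applies every bin function to every record and increments per-label counters, followed by a "tail bin" back-off that folds rare values into REST. The record field names (`ip`, `query`, `ad_id`, `label`), the helper names, and the threshold are hypothetical. On a cluster the same logic would run as a Map-Reduce job, with the bin key as the map output key.

```python
from collections import defaultdict


def ip_prefix2(record):
    """Project an IP address onto its first two octets, e.g. 173.194.*.*"""
    return ".".join(record["ip"].split(".")[:2]) + ".*.*"


# Each bin function projects a record onto the key of one count table.
BIN_FUNCTIONS = {
    "IP": lambda r: r["ip"],
    "IP[2]": ip_prefix2,
    "query": lambda r: r["query"],
    "Query x AdId": lambda r: (r["query"], r["ad_id"]),
}


def aggregate_counts(records):
    """Build one table per bin function, mapping key -> [N+, N-]."""
    tables = {name: defaultdict(lambda: [0, 0]) for name in BIN_FUNCTIONS}
    for record in records:
        slot = 0 if record["label"] else 1
        for name, bin_fn in BIN_FUNCTIONS.items():
            tables[name][bin_fn(record)][slot] += 1
    return tables


def apply_tail_bin(table, min_total):
    """Fold keys with fewer than min_total observations into a REST row."""
    kept, rest = {}, [0, 0]
    for key, (n_pos, n_neg) in table.items():
        if n_pos + n_neg >= min_total:
            kept[key] = (n_pos, n_neg)
        else:
            rest[0] += n_pos
            rest[1] += n_neg
    kept["REST"] = tuple(rest)
    return kept


records = [
    {"ip": "173.194.33.9", "query": "facebook", "ad_id": "ad1", "label": 1},
    {"ip": "173.194.33.9", "query": "facebook", "ad_id": "ad2", "label": 0},
    {"ip": "173.194.7.7", "query": "dozen roses", "ad_id": "ad3", "label": 0},
]
tables = aggregate_counts(records)
ip_table = apply_tail_bin(tables["IP"], min_total=2)
```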

Page 13: Learning with Counts: combiner training

Train non-linear model on count-based features

• Counts, transforms, lookup properties

• Additional features can be injected

[Figure: timeline with a "Counting" period up to T_now, followed by "Train predictor". The predictor's inputs are the aggregated features (N+, N-, ln N+ - ln N-, IsBackoff, …) looked up from the IP, query, and Query × AdId count tables, together with the original numeric features. The count tables are repeated from page 12.]

Page 14: Prediction with counts

• Counts are updated continuously

• Combiner re-training infrequent

[Figure: timeline marked T_train and T_now, with a "Counting" period. The aggregated features (N+, N-, ln N+ - ln N-, IsBackoff, …) from the IP, query, and URL × Country count tables are combined with the original numeric features. The IP and query tables are repeated from page 12.]

URL × Country    N+        N-
url1, US         54546     978964
url2, CA         232343    8431467
url3, FR         12973     430982
…                …         …
REST             4419312   52754683
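The split between continuously updated counts and an infrequently retrained combiner can be illustrated with the sketch below. The class and function names are hypothetical, the add-one smoothing is an assumption to avoid log(0), and the fixed logistic function is only a stand-in for whatever non-linear model was trained at T_train.

```python
import math
from collections import defaultdict


class OnlineCountTable:
    """Count table that keeps absorbing labeled events after training."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # value -> [N+, N-]

    def update(self, value, label):
        self.counts[value][0 if label else 1] += 1

    def features(self, value):
        n_pos, n_neg = self.counts.get(value, (0, 0))
        # Add-one smoothing (an assumption) keeps the logs finite.
        return [n_pos, n_neg, math.log(n_pos + 1) - math.log(n_neg + 1)]


def frozen_combiner(features):
    """Stand-in for the model trained at T_train; it is never updated here."""
    return 1.0 / (1.0 + math.exp(-features[2]))


table = OnlineCountTable()
before = frozen_combiner(table.features("url1, US"))
for label in (1, 1, 1, 0):          # new events arriving after T_train
    table.update("url1, US", label)
after = frozen_combiner(table.features("url1, US"))
```

The prediction for the same value moves as its counts change, even though the combiner itself is untouched.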

Page 15: What is great about learning with counts?

• State-of-the-art accuracy
• Good fit for map-reduce
• Modular (vs. monolithic)
  • Learner can be tuned/monitored/replaced in isolation

• Monitorable, debuggable (this is HUGE in practice!)
  • Temporal changes easy to monitor
  • Easy emergency recovery (remove bot attacks, etc.)
  • Decomposable predictions
  • Error debugging (which feature can we blame…)

Page 16: Learning with Counts: in Azure ML

Page 17: Putting it all together

• HDInsight: large data storage and map-reduce processing

• Azure ML: cloud ML and analytics accessible anywhere

• Learning with Counts: intuitive, flexible large-scale ML solution

Page 18: Thanks for your time

Useful links:

http://azure.microsoft.com/ml - Sign up for your free Azure ML trial

http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML

Need Azure ML for teaching in the classroom? - Contact the speakers

Other questions? - Contact the speakers

Speakers:
Misha Bilenko: [email protected]
Girish Nathan: [email protected]