ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak


Post on 21-Feb-2017


Page 1: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak

ModelDB: A system to manage machine learning models

Manasi Vartak, PhD Student, MIT DB Group

Page 2

People

Manasi Vartak PhD student, MIT

Srinidhi Viswanathan MEng, MIT

Samuel Madden Faculty, MIT

Matei Zaharia Faculty, Stanford

Harihar Subramanyam MEng, MIT

Wei-En Lee MEng student, MIT

Page 3

Building a default prediction algorithm

                 Profession          Credit History                     Risk of Default
Barack Obama     Politician          Reasonable                         0.3
Lindsay Lohan    Struggling artist   Poor                               0.7
Warren Buffett   Investor            Has more money than our company    0.0
…                …                   …                                  …

Page 4

Accuracy: 62%

Model 1

Page 5

Model 3

RandomForestClassifier

val udf1: (Int => Int) = (delayed..)
df.withColumn("timesDelayed", udf1)

Page 6

RandomForestClassifier

df.withColumn("timesDelayed", udf1)
  .withColumn("percentPaid", udf2)

val lrGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(5, 10, 15))
  .addGrid(rf.numTrees, Array(50, 100))

Model 5

credit-default-clean.csv
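The grid on this slide grows quickly: three values of maxDepth times two values of numTrees already yields six candidate models from a single pipeline. As a rough illustration only, here is a minimal sketch in plain Python (not the spark.ml API; the helper name `expand_grid` is hypothetical) that enumerates the combinations:

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of parameter values in `grid`."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Parameter values copied from the slide's ParamGridBuilder call.
grid = {"maxDepth": [5, 10, 15], "numTrees": [50, 100]}
candidates = expand_grid(grid)

print(len(candidates))  # 6
for candidate in candidates:
    print(candidate)
```

Each extra feature column or preprocessing step multiplies this count again, which is how a project can reach "Model 50" within a few iterations.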

Page 7

df.withColumn("timesDelayed", udf1)
  .withColumn("percentPaid", udf2)
  .withColumn("creditUsed", udf3)

val lrGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.01, 0.1, 0.5, 0.7))

val scaler = new StandardScaler()
  .setInputCol("features")

val labelIndexer1 = new LabelIndexer()
val labelIndexer2 = new LabelIndexer()
…

Model 50

val udf1: (Int => Int) = (delayed..)
val udf2: (String, Int) = …

credit-default-clean.csv

Page 8

No one in here tracks (all of) their models…and this is not unusual

I’m willing to bet…

Page 9

Why is this a problem?

• No record of experiments

• Insights lost along the way

• Difficult to reproduce results

• Cannot search for or query models

• Difficult to collaborate

Did my colleague do that already?

How did normalization affect my ROC?

How does someone review your model?

Where’s the LR model I tried last week with featureX?

What params did I use?

Page 10

Model Management

track, store and index modeling artifacts so that they may subsequently be reproduced, shared, queried, and analyzed
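To make the definition concrete, here is a minimal sketch of the track / store / index / query loop as a toy in-memory registry. It is not ModelDB's API: `ModelRecord`, `ModelRegistry`, and their methods are hypothetical names, and the accuracy values other than the 62% from the earlier slide are invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    model_id: int
    model_type: str
    params: dict
    metrics: dict
    tags: set = field(default_factory=set)


class ModelRegistry:
    """Toy in-memory store: track records, index them by type, answer queries."""

    def __init__(self):
        self._records = []   # storage, in insertion order
        self._by_type = {}   # index: model_type -> list of model ids

    def track(self, model_type, params, metrics, tags=()):
        record = ModelRecord(len(self._records) + 1, model_type,
                             dict(params), dict(metrics), set(tags))
        self._records.append(record)
        self._by_type.setdefault(model_type, []).append(record.model_id)
        return record.model_id

    def get(self, model_id):
        return self._records[model_id - 1]

    def query(self, model_type=None, tag=None, min_metric=None):
        """Filter by type, tag, and/or a (metric_name, threshold) pair."""
        if model_type is not None:
            ids = self._by_type.get(model_type, [])
        else:
            ids = [r.model_id for r in self._records]
        results = [self.get(i) for i in ids]
        if tag is not None:
            results = [r for r in results if tag in r.tags]
        if min_metric is not None:
            name, threshold = min_metric
            results = [r for r in results
                       if r.metrics.get(name, float("-inf")) >= threshold]
        return results


registry = ModelRegistry()
# Illustrative values only; they are not results from the talk.
registry.track("LogisticRegression", {"elasticNetParam": 0.1},
               {"accuracy": 0.62}, tags={"featureX"})
registry.track("RandomForestClassifier", {"maxDepth": 10, "numTrees": 50},
               {"accuracy": 0.71})
registry.track("LogisticRegression", {"elasticNetParam": 0.5},
               {"accuracy": 0.66})

# "Where's the LR model I tried last week with featureX?"
hits = registry.query(model_type="LogisticRegression", tag="featureX")
print([h.params for h in hits])
```

A real system would persist the records and index far more than the model type, but the same three operations (track, index, query) carry over.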

Page 11

ModelDB: a system to manage machine learning models

http://modeldb.csail.mit.edu

Page 12

ModelDB: an end-to-end model management system

Model artifact Storage & Versioning

Query

Ingest models, metadata

Collaboration, Reproducibility

track

store & index

query, reproduce++

Page 13

Demo

Page 14

ModelDB w/scikit-learn

Page 15

ModelDB Architecture & Design Decisions

1. Support for diverse languages and environments

2. Minimal changes to existing workflows

3. Rich visual interface

4. Support for complex queries

spark.ml

scikit-learn

ModelDB Backend

Storage

thrift

Scala

Python

ModelDB Frontend:

vis + query

Native Client

Events
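One way to read the "Native Client → Events → Backend" part of the diagram together with design decision 2 (minimal changes to existing workflows): the client wraps the calls a user already makes and emits an event as a side effect. The sketch below assumes that reading. All names (`EventSink`, `logged_fit`, the event fields) are hypothetical, and the sink is an in-memory list rather than a Thrift call to a backend.

```python
import functools
import time


class EventSink:
    """Stand-in for the backend; a real client would send events over the wire."""

    def __init__(self):
        self.events = []

    def emit(self, event):
        self.events.append(event)


def logged_fit(sink):
    """Decorator: run the wrapped fit function unchanged, then emit an event."""
    def decorator(fit_fn):
        @functools.wraps(fit_fn)
        def wrapper(data, **params):
            start = time.perf_counter()
            model = fit_fn(data, **params)
            sink.emit({
                "type": "FitEvent",  # hypothetical event name
                "function": fit_fn.__name__,
                "params": dict(params),
                "num_rows": len(data),
                "seconds": time.perf_counter() - start,
            })
            return model
        return wrapper
    return decorator


sink = EventSink()


@logged_fit(sink)
def fit_mean_model(data, shrinkage=0.0):
    """Toy 'model': the mean of the data, shrunk toward zero."""
    return (1.0 - shrinkage) * sum(data) / len(data)


model = fit_mean_model([1.0, 2.0, 3.0], shrinkage=0.5)
print(model)                     # 1.0
print(sink.events[0]["params"])  # {'shrinkage': 0.5}
```

The training code itself is untouched apart from the decorator line, which is the property the design decision asks for.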

Page 16

ModelDB Features

• Experiment tracking: log models, params, pipelines etc. via ModelDB API

• Versioning: every modeling run = version

• Reproducibility: all pipeline details, params logged

• Comparisons, queries, search: model search, query, comparison via frontend

• Collaboration: central repository of models; review models, annotate
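The idea that every modeling run is a version, and that versions can be compared, can be sketched as an append-only history with a parameter diff. This is an illustrative toy, not ModelDB's implementation: `RunHistory`, `log_run`, and `diff_params` are hypothetical names, and the second run's accuracy is an invented value.

```python
class RunHistory:
    """Append-only history: each logged run becomes the next version number."""

    def __init__(self):
        self._versions = []

    def log_run(self, params, metrics):
        self._versions.append({"params": dict(params), "metrics": dict(metrics)})
        return len(self._versions)  # version numbers start at 1

    def diff_params(self, version_a, version_b):
        """Map each parameter that differs to its (value_in_a, value_in_b) pair."""
        a = self._versions[version_a - 1]["params"]
        b = self._versions[version_b - 1]["params"]
        return {
            key: (a.get(key), b.get(key))
            for key in sorted(a.keys() | b.keys())
            if a.get(key) != b.get(key)
        }


history = RunHistory()
# Illustrative values only.
v1 = history.log_run({"maxDepth": 5, "numTrees": 50}, {"accuracy": 0.62})
v2 = history.log_run({"maxDepth": 10, "numTrees": 50, "scaled": True},
                     {"accuracy": 0.70})

print(history.diff_params(v1, v2))
# {'maxDepth': (5, 10), 'scaled': (None, True)}
```

A parameter missing from one run shows up as `None` on that side of the pair, which is enough to answer "What params did I use?" for any two versions.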

Page 17

Ongoing Work

• Unified querying of modeling artifacts

• Mining data in ModelDB

• Model monitoring and retraining

Page 18

ModelDB available now!

http://modeldb.csail.mit.edu *MIT License

Page 19

ModelDB available now!

• Download, try it out!

• Tell us what you think; what can we do better?

• Contribute! (see Issues on repo for some ideas)

Page 20

ModelDB: a system to manage machine learning models

@DataCereal

http://modeldb.csail.mit.edu