Download - ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak
![Page 1: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/1.jpg)
ModelDB: A system to manage machine learning models
Manasi VartakPhD Student, MIT DB Group
![Page 2: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/2.jpg)
People
Manasi Vartak PhD student, MIT
Srinidhi Viswanathan MEng, MIT
Samuel Madden Faculty, MIT
Matei Zaharia Faculty, Stanford
Harihar Subramanyam MEng, MIT
Wei-En Lee MEng student, MIT
![Page 3: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/3.jpg)
Building a default prediction algorithm
Profession Credit History Risk of Default
Politician Reasonable 0.3
Struggling artist Poor 0.7
InvestorHas more
money than our company
0.0
… … … …
Barack Obama
Lindsay Lohan
Warren Buffet
![Page 4: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/4.jpg)
Accuracy: 62%
Model 1
![Page 5: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/5.jpg)
Model 3
RandomForestClassifier
val udf1: (Int => Int) = (delayed..) df.withColumn(“timesDelayed”, udf1)
![Page 6: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/6.jpg)
RandomForestClassifier
df.withColumn(“timesDelayed”, udf1) .withColumn(“percentPaid”, udf2)
val lrGrid = new ParamGridBuilder() .addGrid(rf.maxDepth, Array(5, 10, 15)) .addGrid(rf.numTrees, Array(50, 100))
Model 5
credit-default-clean.csv
![Page 7: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/7.jpg)
df.withColumn(“timesDelayed”, udf1) .withColumn(“percentPaid”, udf2).withColumn(“creditUsed”, udf3)
…
val lrGrid = new ParamGridBuilder() .addGrid(lr.elasticNetParam, Array(0.01, 0.1, 0.5, 0.7))
val scaler = new StandardScaler() .setInputCol(“features”)
…
val labelIndexer1 = new LabelIndexer() val labelIndexer2 = new LabelIndexer() …
Model 50
val udf1: (Int => Int) = (delayed..) val udf2: (String, Int) = …
credit-default-clean.csv
![Page 8: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/8.jpg)
No one in here tracks (all of) their models…and this is not unusual
I’m willing to bet…
![Page 9: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/9.jpg)
Why is this a problem?• No record of experiments
• Insights lost along the way
• Difficult to reproduce results
• Cannot search for or query models
• Difficult to collaborate
Did my colleague do that already? How did normalization affect my ROC?
How does someone review your model?
Where’s the LR model I tried last week with featureX?
What params did I use?
![Page 10: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/10.jpg)
Model Management
track, store and index modeling artifacts so that they may subsequently be reproduced, shared, queried, and analyzed
![Page 11: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/11.jpg)
ModelDB: a system to manage machine learning models
http://modeldb.csail.mit.edu
![Page 12: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/12.jpg)
ModelDB: an end-to-end model management system
Model artifact Storage & Versioning
Query
Ingest models, metadata
Collaboration, Reproducibilitytrack
store & index query, reproduce++
![Page 13: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/13.jpg)
Demo
![Page 14: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/14.jpg)
ModelDB w/scikit-learn
![Page 15: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/15.jpg)
ModelDB Architecture & Design Decisions
1. Support for diverse languages and environments
2. Minimal changes to existing workflows
3. Rich visual interface
4. Support for complex queries
spark.ml
scikit-learn
ModelDB Backend
Storage
thrift
Scala
Python
…
ModelDB Frontend:
vis + query
Native ClientEvents
![Page 16: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/16.jpg)
ModelDB Features• Experiment tracking
• Versioning
• Reproducibility
• Comparisons, queries, search
• Collaboration
Log models, params, pipelines etc. via ModelDB API
Model search, query, comparison via frontend
Central repository of models Review models, annotate
All pipeline details, params logged
Every modeling run = version
![Page 17: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/17.jpg)
Ongoing Work• Unified querying of modeling artifacts
• Mining data in ModelDB
• Model monitoring and retraining
![Page 19: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/19.jpg)
ModelDB available now!• Download, try it out!
• Tell us what you think; what can we do better?
• Contribute! (see Issues on repo for some ideas)
![Page 20: ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak](https://reader034.vdocuments.net/reader034/viewer/2022042600/58abb4431a28ab04618b4d37/html5/thumbnails/20.jpg)
ModelDB: a system to manage machine learning models
[email protected] | @DataCereal
http://modeldb.csail.mit.edu