Data Versioning and Reproducible ML with DVC and MLflow
![Page 1: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/1.jpg)
Data Versioning and Reproducible ML with DVC and MLflow
Petra Kaferle Devisschere
Data Scientist, Adaltas
![Page 2: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/2.jpg)
Introduction
About me:
▪ Data Scientist
▪ 13 years of experience in Data Analytics in Life Sciences
▪ Developed solutions for automating frequently used protocols
About Adaltas:
▪ Involved with Big Data since 2010
▪ 16 experts in France, 2 in Morocco
▪ Partner with Cloudera, Databricks, D2IQ
▪ Active in the Open Source community
▪ Teaching Big Data at 6 universities
![Page 3: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/3.jpg)
Agenda
▪ Versioning in Data Science
▪ Data Versioning
▪ Intro to DVC and MLflow
▪ Demo with DVC and MLflow
![Page 4: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/4.jpg)
Problem definition
▪ Dataset
  ▪ Tabular
  ▪ Images, video, sound, text
▪ Code
  ▪ Functionality, Hyper-parameters
  ▪ Data pre-processing, Modeling
▪ Results
  ▪ Numeric (metrics)
  ▪ Plots
▪ Environment
  ▪ Dependencies
What do we need to track in a Data Science project to be able to reproduce the results?
![Page 5: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/5.jpg)
Data versioning use cases
▪ Data engineering during data cleaning and dataset preparation
▪ Test data in software engineering to ensure the functionality of the software
▪ Development and testing of database applications
▪ Ontology versioning for Semantic Web
▪ ...
Data beyond Data Science
![Page 6: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/6.jpg)
Data is at the heart of any ML project
▪ Data quality management
▪ Reproducible training and re-training
▪ Automation of testing or deployment
▪ Use in applications and web services
▪ Audit
Why is the exact version of a dataset important?
![Page 7: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/7.jpg)
Classical DV solutions and their downsides
▪ Git
  ▪ Not suitable for big files
  ▪ Difficulties with binary files
▪ Git-LFS
  ▪ File size limitation (e.g. 2 GB for a GitHub free account and 5 GB for GitHub Enterprise Cloud)
  ▪ Data needs to be stored on the servers of the Git provider
▪ External storage
  ▪ Checksums must be calculated for every change in the dataset and placed under version control
Classical Software Engineering tools are not enough to solve the ML reproducibility crisis
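The manual external-storage approach can be sketched in a few lines of Python. This is only an illustration, not something shown in the talk: the function names, the JSON manifest format, and the choice of SHA-256 are all assumptions. The data stays wherever it is stored, and only the small manifest of checksums would be committed to Git.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map every file under data_dir (by relative path) to its checksum."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_manifest(data_dir, manifest_path):
    """Write the manifest as sorted JSON so Git diffs stay readable."""
    manifest = build_manifest(data_dir)
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(manifest_path).write_text(text)
    return manifest
```

Every change to the dataset requires regenerating and committing the manifest by hand, which is the overhead the slide points to.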
![Page 8: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/8.jpg)
Our proposed solution
Combining Data Version Control (DVC) and MLflow
DVC
- solves Git’s limitations
- easy to learn
MLflow
- extensive and easy tracking
- user-friendly UI
![Page 9: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/9.jpg)
What is DVC?
https://dvc.org/
▪ Tracks datasets and ML projects
▪ Works with many types of storages (cloud, local, HDFS, HTTP…)
▪ Runs on top of a Git repository
▪ Supports building and running pipelines
We will focus on dataset tracking.
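The idea behind this kind of dataset tracking can be illustrated with a toy content-addressed store. This is a simplified sketch and not DVC's implementation: the `track` and `checkout` functions, the `.pointer.json` file, and the store layout are hypothetical stand-ins for the small metadata file that would be committed to Git while the data itself lives elsewhere.

```python
import hashlib
import json
import shutil
from pathlib import Path


def _digest(path):
    """Content hash of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def track(data_file, store_dir):
    """Copy data_file into the store under its hash and write a pointer file."""
    data_file = Path(data_file)
    digest = _digest(data_file)
    target = Path(store_dir) / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(data_file, target)
    # Hypothetical pointer format: small enough to version with Git.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    info = {"path": data_file.name, "sha256": digest}
    pointer.write_text(json.dumps(info, indent=2) + "\n")
    return pointer


def checkout(pointer_file, store_dir):
    """Restore the data version that a pointer file refers to."""
    pointer_file = Path(pointer_file)
    info = json.loads(pointer_file.read_text())
    digest = info["sha256"]
    source = Path(store_dir) / digest[:2] / digest[2:]
    destination = pointer_file.with_name(info["path"])
    shutil.copy2(source, destination)
    return destination
```

Checking out an older Git commit brings back an older pointer file, and `checkout` then restores the matching data from the store. The store could sit on any of the storage types listed above.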
![Page 10: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/10.jpg)
What is MLflow?
https://mlflow.org/
We will focus on experiment tracking.
![Page 11: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/11.jpg)
Let’s put them together!
Best of both worlds
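The demo itself is not captured in this transcript, so the following is only one possible way to connect the two tools: attach an identifier of the dataset version to each MLflow run. The tag names `data.pointer` and `data.version`, the helper functions, and the idea of hashing the small pointer file (for example a `.dvc` file) are assumptions made for illustration. The MLflow calls used are `mlflow.start_run`, `mlflow.set_tags`, `mlflow.log_params`, and `mlflow.log_metrics`.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(pointer_file):
    """Hash the small pointer file that identifies the dataset version."""
    return hashlib.sha256(Path(pointer_file).read_bytes()).hexdigest()


def build_run_record(pointer_file, params, metrics):
    """Gather everything one run should log, including the data version."""
    return {
        "tags": {
            "data.pointer": str(pointer_file),  # hypothetical tag name
            "data.version": dataset_fingerprint(pointer_file),  # hypothetical tag name
        },
        "params": dict(params),
        "metrics": {name: float(value) for name, value in metrics.items()},
    }


def log_to_mlflow(record):
    """Send the record to MLflow; requires the mlflow package to be installed."""
    import mlflow  # imported here so the helpers above work without it

    with mlflow.start_run():
        mlflow.set_tags(record["tags"])
        mlflow.log_params(record["params"])
        mlflow.log_metrics(record["metrics"])
```

A call might look like `log_to_mlflow(build_run_record("data/train.csv.dvc", {"max_depth": 5}, {"accuracy": 0.91}))`, where the path and the values are placeholders. Runs in the MLflow UI can then be filtered or compared by the data version they were trained on.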
![Page 12: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/12.jpg)
Demo
![Page 13: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/13.jpg)
Recap
Let’s look again at what we just did
![Page 14: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/14.jpg)
Thank you for your attention!
![Page 15: Data Versioning and Reproducible ML with DVC and MLflow](https://reader036.vdocuments.net/reader036/viewer/2022071303/62ccc85edf449f3ec338b02e/html5/thumbnails/15.jpg)
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.