Introducing Krylov
eBay AI Platform - Machine Learning Made Easy

GPU Technology Conference, 2018

Henry Saputra
Technical Lead for Krylov - eBay Unified AI Platform

Agenda

1. Data Science and Machine Learning at eBay
2. Introducing Krylov
3. Compute Cluster and Accelerator Support with NVIDIA GPU
4. Quickstart Example
5. Future Roadmap
6. Q & A

Data Science and Machine Learning at eBay

eBay Patterns - Tools and Frameworks

Tools

• Languages: R, Python, Scala, C++
• IDE-like: RStudio, Notebooks (Jupyter), Python IDE
• Frameworks: NumPy, SciPy, matplotlib, Scikit-learn, Spark MLlib, H2O, Weka, XGBoost, Moses
• Pipelines: Cron, Luigi, Apache Airflow, Apache Oozie

Patterns for ML Training

• Single node
• Distributed training
• Deep learning (GPUs)

Key takeaway = CHOICE

1. Flexibility of software
2. Flexibility of hardware configuration

Problems and Challenges

1. 50%-70% of the work is plumbing
   a. Accessing and moving secured data
   b. Environment and tools setup
   c. Sub-optimal compute instances - NVIDIA GPUs and high-memory/CPU instances
   d. Long wait time from platform and infrastructure
2. Loss of productivity and opportunities
   a. ML lifecycle management of models and features
   b. Building robust model training pipelines: data preparation, algorithm, hyperparameter tuning, cross-validation
3. Collaboration is almost impossible
4. Research vs. applied ML

Introducing Krylov: Unified eBay AI Platform

Overview

● Krylov is the core project of the eBay unified AI Platform initiative, which aims to provide an easy-to-use and powerful cloud-based data science and machine learning platform.
● The objective of the project is to give machine learning jobs easy access to secured data and eBay cloud computing resources.
● The main goals for the Krylov initiative are:
  ○ Easy and secure access to training datasets
  ○ Access to compute on high-performance machines, such as GPUs, or clusters of machines
  ○ Familiar tools and flexible software to run machine learning model training jobs
  ○ Interactive data analysis and visualization, with multi-tenancy support to allow quick prototyping of algorithms and data access
  ○ Sharing and collaboration on ML work between teams at eBay

ML Lifecycle Management

[Figure: lifecycle diagram labeled "Data + Lifecycle Management". Stages shown:]

• Model Building: interactive, iterative
• Model Training: automatable, repeatable, scalable
• Model Inferencing: deployable, scalable
• Model Re-fitting: interactive, iterative
• Model Re-training: interactive, iterative

Krylov Staircase Design for AI Platform

eBay AI Platform Components

[Figure: component diagram. Labels recovered from the slide, grouped by apparent layer:]

• Infrastructure - Krylov: GPU Tall instances, Fast Storage
• AI Engine - Krylov: Learning Pipelines, Model Experimentation, Data Scientist Workspaces, Model Lifecycle Management
• Data: Preparation, Movement, Discovery, Access
• AI Hub (Shared Repository)
• AI Modules: Speech Recognition, Machine Translation, Computer Vision, Information Retrieval, Natural Language Understanding, …
• Inferencing

Krylov High Level Architecture

Krylov Main Features and Concepts

1. Client Command Line Interface (CLI) via the krylovctl program
2. ML Application and Run Specification
3. ML Pipelines: Workflow and Workspace
4. Namespaces - for quota and data isolation
5. Jobs and Runs - managed by Krylov Tools and Minions
6. Secure Data Access - HDFS, NFS, OpenStack Swift, custom

Krylov CLI - krylovctl

Krylov ML Application

● A Krylov ML Application is a versioned unit of deployment that contains the declaration of the developers' programs
● Implemented as a client project that serves as the source for building a deployment artifact
● Three main parts:
  ○ mlapplication.json and artifact.json configuration files
  ○ Source code of the programs
  ○ Dependency management via Dockerfile
● Supported types of programs: JVM languages (Java, Scala), Python, shell script
● Using the ML Application as the source, developers build a deployment artifact, which the Run Specification file references to deploy it onto one of the nodes in the cluster

Krylov ML Application Example

{
  "tasks": {
    "prepare_data": {
      "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
      "parameters": {
        "className": "com.ebay.krylov.helloai.HelloWorld"
      }
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {
        "file": "helloai-python/helloai/helloworld.py",
        "args": []
      }
    },
    ...
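To make the task declaration concrete, here is a minimal Python sketch that reads the two tasks from the example and reports which program type and entry point each one uses. It is only an illustration of the file's shape, not Krylov code: the excerpt is closed off so that it parses, and the helper name `summarize_tasks` is hypothetical.

```python
import json

# Excerpt modeled on the slide's mlapplication.json. The slide elides the rest
# of the file with "..."; the braces are closed here only so the text parses.
ML_APPLICATION = json.loads("""
{
  "tasks": {
    "prepare_data": {
      "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
      "parameters": {"className": "com.ebay.krylov.helloai.HelloWorld"}
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {"file": "helloai-python/helloai/helloworld.py", "args": []}
    }
  }
}
""")


def summarize_tasks(app):
    """Map each Task name to (short program class name, entry point).

    Hypothetical helper for illustration; it is not part of krylovctl.
    """
    summary = {}
    for name, task in app["tasks"].items():
        program = task["program"].rsplit(".", 1)[-1]
        params = task.get("parameters", {})
        # The entry point is the JVM class or the Python file, whichever is set.
        entry = params.get("className") or params.get("file")
        summary[name] = (program, entry)
    return summary


if __name__ == "__main__":
    for name, (program, entry) in summarize_tasks(ML_APPLICATION).items():
        print(f"{name}: {program} -> {entry}")
```

Each Task names a Program class plus the parameters that class needs, which is how one ML Application can mix JVM and Python steps.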

Krylov Run Specification

● The Krylov Run Specification is a runtime configuration that supplies configuration overrides and parameters for each Task in an ML Application job submission
● It tells the Krylov master API server which artifact, created from the ML Application, will be used in the compute cluster
● Defined as a runspec.json file, or passed as an argument to the krylovctl client program
● The runspec.json file also defines the compute resources, such as which NVIDIA GPUs to use, CPU, memory, and which Docker image provides the dependencies used by the ML Application programs

Krylov Run Specification Example

{
  "jobName": "job-sample",
  "artifact": "myartifact",
  "artifactTag": "latest",
  "mlApplication": "com.ebay.oss.krylov.workflow.app.GenericMLApplication",
  "applicationParameters": {},
  "tasks": {
    "prepare_data": {
      "taskParameters": {
        "prepare_data_parameter_key": "prepare_data_parameter_value"
      }
    }
  }
}
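The override behavior can be sketched in a few lines of Python. The slides do not say how Krylov merges a Run Specification into the ML Application, so this sketch assumes the simplest rule: a `taskParameters` entry replaces any application parameter with the same key. The function name `effective_task_parameters` is hypothetical.

```python
import copy

# Trimmed stand-ins for the two files shown on the slides.
ML_APPLICATION = {
    "tasks": {
        "prepare_data": {
            "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
            "parameters": {"className": "com.ebay.krylov.helloai.HelloWorld"},
        }
    }
}
RUN_SPEC = {
    "jobName": "job-sample",
    "tasks": {
        "prepare_data": {
            "taskParameters": {
                "prepare_data_parameter_key": "prepare_data_parameter_value"
            }
        }
    },
}


def effective_task_parameters(ml_application, run_spec):
    """Return {task name: parameters} after applying run-spec overrides.

    Assumption (not confirmed by the slides): a run-spec value wins over an
    application value that has the same key.
    """
    merged = {
        name: copy.deepcopy(task.get("parameters", {}))
        for name, task in ml_application["tasks"].items()
    }
    for name, override in run_spec.get("tasks", {}).items():
        if name not in merged:
            raise KeyError(f"run spec refers to undeclared task: {name}")
        merged[name].update(override.get("taskParameters", {}))
    return merged


if __name__ == "__main__":
    print(effective_task_parameters(ML_APPLICATION, RUN_SPEC))
```

Rejecting overrides for undeclared tasks is a design choice of the sketch; it surfaces typos in runspec.json before a job is submitted.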

Krylov ML Pipelines: Workflow

● A Krylov ML batch lifecycle pipeline is defined as a Krylov Workflow definition
  ○ Declarative
  ○ Default Generic Workflow
● Important concepts for Krylov Workflow:
  ○ Workflow - a single pipeline defined within Krylov and the unit of deployment for an ML Application
    ■ Each Workflow contains one or more Tasks
    ■ The Tasks are connected to each other in a Directed Acyclic Graph (DAG) structure
  ○ Task - the smallest unit of execution; it runs a developer's Program and executes on a single machine
  ○ Flows - one or more key-value pairs, each mapping a name to a declaration of a Task DAG
  ○ Flow - the key, chosen from the Flows definition, that will be run

Workflow Example in mlapplication.json

{
  "tasks": {
    ...
  },
  "flows": {
    "sample_flow": {
      "prepare_data": ["train_model"],
      "train_model": ["output"]
    }
  },
  "flow": "sample_flow"
}
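Because a Flow is a DAG of Tasks, a valid run order is a topological sort of that graph. The sketch below derives one with Python's standard `graphlib` module (Python 3.9+). It assumes each key in a flow lists its downstream tasks, which is how the example reads but is not stated on the slide; `execution_order` is a hypothetical helper, and Krylov's actual scheduler may work differently.

```python
from graphlib import CycleError, TopologicalSorter

# The "flows" and "flow" entries from the slide; "tasks" is elided there.
SAMPLE = {
    "flows": {
        "sample_flow": {
            "prepare_data": ["train_model"],
            "train_model": ["output"],
        }
    },
    "flow": "sample_flow",
}


def execution_order(mlapplication):
    """Return the selected flow's tasks in an order that respects the DAG.

    Assumption: each key maps to the tasks that run after it.
    Raises graphlib.CycleError if the flow is not acyclic.
    """
    dag = mlapplication["flows"][mlapplication["flow"]]
    sorter = TopologicalSorter()
    for task, downstream in dag.items():
        sorter.add(task)  # register the task even if nothing precedes it
        for nxt in downstream:
            sorter.add(nxt, task)  # nxt depends on task
    return list(sorter.static_order())


if __name__ == "__main__":
    try:
        print(" -> ".join(execution_order(SAMPLE)))
    except CycleError as err:
        print("flow is not a DAG:", err)
```

Since `static_order` raises `CycleError` on a cyclic graph, the same call doubles as a check that a Flow really is acyclic.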

Workflow Runs Flow

Krylov ML Pipelines: Workspace

● A Workspace is an interactive web application that lets developers use a web browser for ML model prototyping, data preparation, and exploration
● The Workspace runs as Jupyter Notebook servers launched on high-CPU/high-memory or NVIDIA GPU instances
● Krylov enhances the JupyterHub project to allow distributed launching of multi-tenant Jupyter Notebook servers in the Krylov compute cluster using Kubernetes
● A Krylov Workspace uses a configuration file at creation time to override and customize default parameters

Workspace Deployment Flow

Krylov Compute Cluster

Krylov Cluster Infrastructure

Krylov Compute Cluster Deployment

Krylov Cluster Monitoring

● Metrics - Grafana, InfluxDB, and Telegraf for GPU monitoring
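In a Telegraf and InfluxDB setup, each sample is written as one record in InfluxDB line protocol, which Grafana then queries. As a rough illustration of what a GPU sample looks like on the wire, the sketch below formats one such record. The measurement, tag, and field names are made up for the example and are not taken from the Krylov deployment; escaping of spaces and commas in names is omitted for brevity.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol.

    Layout: measurement,tag=value,... field=value,... timestamp
    Simplification: names and tag values are assumed to contain no spaces,
    commas, or equals signs, so no escaping is applied to them.
    """
    def fmt(value):
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"  # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        return '"' + str(value).replace('"', '\\"') + '"'  # strings are quoted

    tag_part = "".join(f",{key}={val}" for key, val in sorted(tags.items()))
    field_part = ",".join(f"{key}={fmt(val)}" for key, val in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Illustrative values only.
    print(
        to_line_protocol(
            "gpu",
            {"host": "node1", "index": "0"},
            {"utilization_gpu": 87, "temperature_gpu": 64.5},
            1520000000000000000,
        )
    )
```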

Krylov Metrics Management Flow

Krylov Compute Resources Management

Quickstart Example

Steps to Submit Krylov Workflow Job with CLI

1. Download the krylovctl program from the Krylov release repository
2. Run `krylovctl project create` to create a new project on the local machine
3. Update or add code to the Krylov project for the machine learning programs
4. Register each program as a Program within a Task in mlapplication.json
5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG)
6. Run `krylovctl project build` to build the project
7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file
8. Run `krylovctl artifact upload` to upload the artifact file for remote execution
9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the compute cluster
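Steps 6 to 9 lend themselves to a small wrapper script. The Python sketch below is one way to chain them, under several assumptions: krylovctl is on the PATH, the subcommands take no extra arguments (the slide shows none, and none are invented here), and a local run skips the upload because the slide ties uploading to remote execution. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Subcommands exactly as named on the slide. Any flags or arguments they may
# need are not shown there, so none are added here.
BUILD = ["krylovctl", "project", "build"]
CREATE = ["krylovctl", "artifact", "create"]
UPLOAD = ["krylovctl", "artifact", "upload"]


def plan(remote=True):
    """Return the command sequence for steps 6-9.

    Assumption: local execution does not need the upload step.
    """
    commands = [BUILD, CREATE]
    if remote:
        commands.append(UPLOAD)
    commands.append(["krylovctl", "job", "submit" if remote else "run"])
    return commands


def run(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run(plan(remote=True), dry_run=True)
```

Passing `check=True` makes a failed build stop the sequence before anything is uploaded or submitted.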

Demo Time

● Here we go ...

Future Roadmap

1. Inferencing Platform
2. Exploration and documentation of RESTful APIs for job management
3. Data Source and Dataset abstraction via Krylov SDKs
4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation
5. Distributed Deep Learning
6. AutoML - Hyperparameter Tuning
7. AI Hub to share ML Applications and Datasets

Questions?