easyml: ease the process of machine learning with data flojunxu/publications/sosp2017-aisys... ·...
TRANSCRIPT
EasyML:EasetheProcessofMachineLearningwithDataFlow
JunXu
InstituteofComputingTechnology,ChineseAcademyofSciencesOct.28,2017
1
SOSPAI SystemWorkshopShanghai
Oct.28,2017
ApplyingMachineLearningisnotEasy
2
UI
Volume
Simple UI is helpful
Collaboratingandsharing
Volume
Share the data, algorithms, and
experience
The barrier comes not only from the advanced algorithms, but also from the complex process of using the algorithms!
Mobility
Volume
Can access the service everywhere
LargeScaleMachineLearningSystem@ICT
3
Ourfocus
Data Storage and ManagementLarge scale data management
HDFSStructured data management
MySQL
Distributed Computing
Map-Reduce
LIB: scalable machine learning algorithms
Conventional ML algorithms
Data pre-processing, ETL, model evaluation …
Easy ML: interactive graphical UIDesigner: ML task creation, editing,
submitting and managementMonitor: task monitoring, result visualization, and task reusing
Spark TensorFlow
DL/RL algorithms for ranking & matching
Focus of EasyML
DesignofEasyMachineLearning
4
Data Storage and ManagementLarge scale data management
HDFSStructured data management
MySQL
Distributed Computing
Map-Reduce Spark TensorFlow
Scheduling: Oozie
Execute command lines Program status
Interactive GUI (GWT)
Dataflow DAG
Workflow DAG
Job status, program status
designer monitor
Submit Oozie job
Node: program / dataEdge: dataflow
Node: program / start / end / fork / joinEdge: dependency
KeyFeatures— Resource Management
• Managingthealgorithms,data,andtasks• Uploadingalgorithmsanddata
5
UploadingnewalgorithmsManagingprograms,data,andtasks
KeyFeatures— TaskDesign
• CreatingthetaskDAG(usuallybycloningandeditinganexistingtask)withdrag-and-dropmanner
• Settingtheparametersforeachnode• Submittingforexecution
6
KeyFeatures— TaskMonitoring
• Monitoringstatusoftasksandnodes• Checking/downloadingtheoutputs• Visualizingthedata/models
7
Data/resultsvisualizationTask/nodestatusmonitoring
KeyFeatures— TaskReusing
• Editing(appendingnodes,deletingnodes,andchangingparameters)andre-submiting
8
DeployasWebServicehttp://159.226.40.104:18080/dev
• Advantages– Sharing:sharedata/programs/tasksamongusers– Collaborating:workingtogetherforonetask– Mobility:accessingwithwebbrowsersanywhere– Open:ETLfordataimport/export;canrunthird-partyalgorithms
92017/10/30
Brower EasyML service Hadoop/Spark/TensorFlow cluster
SourceSharedatGithubhttps://github.com/ICT-BDA/EasyML
• Top1JavaprojectatGithub trendingforoneweek
• 1400+starsand~300forks• CIKM2016bestdemocandidate
[Guo etal.,CIKM’16]
10
RelatedSystems
• StanfordCodaLab– Acollaborativeplatformforreproducible
research• Rapid Miner Studio
– InteractiveGUI andintegratedmachinelearningalgorithms
• MicrosoftAzuremachinelearning– GUI-basedIDEforconstructingand
operationalizingmachinelearningworkflowonAzure
• Alibaba御膳房/DT PAI• The Fourth Paradigm Prophet
11
Summary
• Ease theprocessofusingmachinelearning– MachinelearningprocessasdataflowDAG– InteractiveGUIfordesigningandmanagingMLtasks– Deployedasawebservice
• Resourcemanagement• Taskdesign• Taskmonitoring• Resultreusing
– Sourcecode@Githubhttps://github.com/ICT-BDA/EasyML
12
EasyML Team
• Xinjie Chen,ICTCAS• Zhaohui Li,ICTCAS• Tianyou Guo,Sogou Inc.• Jianpeng Hou,GoogleChina• PingLi,Tencent Wechat• Jiashuo Cao,CUIT• DongHuang,UCAS• Xiaohui Yan• Xueqi Cheng,ICTCAS
13