

Industrial Level Deep Learning Training Infrastructure—the Practice and Experience from SenseTime

Shengen Yan

SenseTime Group Limited.

The Success of Deep Learning

[Figure: Google Search interest in "deep learning", 2006-01 to 2016-01, rising sharply after AlexNet won ImageNet in 2012]

What Led to the Success?

Model Capacity: The Key to High Performance

# Layers: LeNet 5, AlexNet (2012) 8, GoogLeNet (2014) 22, ResNet (2016) 169, Ours 1207

Computation power: years → months → weeks → days

Accelerate the training time from several years to several days!

01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.

02 DeepLink: a large-scale cluster platform designed for deep learning.

03 Applications: delivers many application models.

Deep Learning is Complicated

The deep learning community developed frameworks to make life easier.

[Figure: GoogLeNet (2014) architecture]

Deep Learning Training Frameworks

‣ SenseTime Deep Learning Training Package (Parrots)

• Memory efficient

• Computation efficient

• Both model parallel & data parallel

• Support huge model

• Scalability

Memory Footprint Optimization

High-level compiler back-end optimizations on the intermediate representation.

Optimizations: liveness analysis on the computation graph, and generating a graph with mirror (re-compute) nodes.

Chen T, Xu B, Zhang C, et al. Training deep nets with sublinear memory cost[J]. arXiv preprint arXiv:1604.06174, 2016.
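The mirror (re-compute) idea above can be sketched in a few lines. This is a minimal, hypothetical illustration using a toy chain of tanh layers, not the framework's actual implementation: instead of storing every activation for the backward pass, we checkpoint every k-th one and re-compute the activations inside each segment when the gradient reaches it.

```python
import math

# Toy chain of n tanh "layers": y = tanh(tanh(...tanh(x))).
# d tanh(a)/da = 1 - tanh(a)^2, so the input gradient is a product of
# local derivatives applied in reverse order.

def backward_full(x, n):
    """Standard backprop: keep all n+1 activations in memory (O(n))."""
    acts = [x]
    for _ in range(n):
        acts.append(math.tanh(acts[-1]))
    grad = 1.0
    for i in range(n, 0, -1):
        grad *= 1.0 - acts[i] * acts[i]
    return grad, len(acts)              # gradient, peak stored activations

def backward_checkpointed(x, n, k):
    """Sublinear-memory backprop: checkpoint every k layers, then
    re-compute each segment's activations during the backward pass
    (the 'mirror' / re-compute nodes of the generated graph)."""
    ckpts = {0: x}
    a = x
    for i in range(1, n + 1):
        a = math.tanh(a)
        if i % k == 0 and i < n:
            ckpts[i] = a
    grad = 1.0
    peak = len(ckpts)
    for seg_start in range(((n - 1) // k) * k, -1, -k):
        seg_end = min(seg_start + k, n)
        acts = [ckpts[seg_start]]       # re-compute this segment only
        for _ in range(seg_end - seg_start):
            acts.append(math.tanh(acts[-1]))
        peak = max(peak, len(ckpts) + len(acts))
        for i in range(len(acts) - 1, 0, -1):
            grad *= 1.0 - acts[i] * acts[i]
    return grad, peak
```

With n = 12 and k = 4 the gradients match, while peak storage drops from 13 activations to 8 (3 checkpoints plus one 5-element segment), at the cost of one extra forward pass, which is the trade-off of Chen et al.'s sublinear-memory scheme.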

Memory Footprint Optimization

Model capacity and memory usage efficiency, higher is better:

[Figure: Relative memory usage efficiency for VGG, ResNet50, ResNet152, Inception V4, ResNet269, and Inception-ResNet, comparing Ours, MXNet, TensorFlow, Chainer, Caffe, and Torch]

Single-GPU Performance

Milliseconds per iteration (lower is better):

              Batch-32   Batch-64   Batch-128
  Caffe        497.5      1045       1965
  Chainer      200        290        543
  TensorFlow   178.6      315.7      587.2
  Parrots      122.7      225.6      471

Communication Optimization

Support Multi-GPUs and Multi-Nodes

Three procedures: Copy, Allreduce, Copy

Optimizations:

• Master-slave threads to overlap communication with computation

• GPU direct communication

• Ring allreduce message passing

[Diagram: GPU0–GPU3 copy gradients to CPU memory, allreduce with other nodes, copy results back]
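The ring allreduce mentioned above can be simulated in plain Python. This is an illustrative sketch of the algorithm (a reduce-scatter phase followed by an allgather phase), not SenseTime's communication library: each rank repeatedly passes one chunk to its right neighbour, so per-rank traffic is about 2(p-1)/p of the vector regardless of ring size.

```python
def ring_allreduce(data):
    """Simulate ring allreduce; each element of `data` is one rank's vector.

    Phase 1 (reduce-scatter): in p-1 steps each rank sends one chunk to its
    right neighbour, which adds it into its own copy. Afterwards rank r holds
    the fully reduced chunk (r+1) % p.
    Phase 2 (allgather): in p-1 more steps the reduced chunks travel around
    the ring until every rank holds the complete sum.
    """
    p = len(data)                        # number of ranks in the ring
    n = len(data[0])
    assert n % p == 0, "vector length must divide evenly into p chunks"
    c = n // p                           # chunk size
    buf = [list(v) for v in data]
    # reduce-scatter
    for step in range(p - 1):
        for r in range(p):
            dst = (r + 1) % p
            ci = (r - step) % p          # chunk rank r forwards this step
            for j in range(ci * c, (ci + 1) * c):
                buf[dst][j] += buf[r][j]
    # allgather
    for step in range(p - 1):
        for r in range(p):
            dst = (r + 1) % p
            ci = (r + 1 - step) % p      # reduced chunk rank r forwards
            for j in range(ci * c, (ci + 1) * c):
                buf[dst][j] = buf[r][j]
    return buf
```

For example, four ranks holding [1,2,3,4], [5,6,7,8], [9,10,11,12], and [13,14,15,16] all end up with the element-wise sum [28, 32, 36, 40].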

Scalability

[Figure: Milliseconds per iteration and scaling efficiency for 1, 2, 3, 4, 8, 16, 24, and 32 GPUs, single node vs. multiple nodes]
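The scalability plot reports both time per iteration and scaling efficiency. The slide does not define the metric, but a common convention (assumed here) is strong-scaling efficiency: the achieved speedup t(1)/t(n) divided by the ideal speedup n.

```python
def scaling_efficiency(t1, tn, n):
    """Strong-scaling efficiency: achieved speedup t1/tn over ideal speedup n.
    1.0 means perfect linear scaling; values drop as communication dominates."""
    return t1 / (n * tn)

# Hypothetical timings for illustration only (not the measured numbers
# from the talk):
eff = scaling_efficiency(1000.0, 40.0, 32)   # 1000 / (32 * 40) = 0.78125
```

On one GPU the definition gives exactly 1.0, which is why efficiency curves always start at 1 on the left of such plots.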

01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.

02 DeepLink: a large-scale cluster platform designed for deep learning.

03 Applications: delivers many application models.

The Role of the Supercomputer

It is like the highway system of a city: a key infrastructure of AI.

Supercomputing Centers for AI: the key infrastructure for AI research.

[Diagram: DeepLink connecting Data, Computation, and Model]

Challenges

‣ Interconnects at multiple levels

• GPUs, nodes, sub-networks

‣ Distributed data

• Random access becomes particularly difficult

‣ Scale vs. stability

• Failures of individual nodes/links

‣ Human resources

• Engineers who understand both deep learning and HPC are hard to find

DeepLink Clusters: Designed for Deep Learning

Software-hardware co-design: high-performance hardware with customized middleware. Maximize respective strengths while ensuring optimal cooperation.

• High speed interconnects

• High performance GPU computing

• Efficient distributed storage

• Distributed storage & cache system (optimized for small files)

• Distributed deep learning framework

• Task scheduling & monitoring
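The bullet on a cache system optimized for small files can be made concrete with a sketch. This is a hypothetical node-local LRU cache in front of slow shared storage; the class name and `fetch` callback are illustrative, not DeepLink's actual API. The point is that training reads millions of small files (images, labels), so serving repeats from local memory avoids hammering the distributed filesystem with random accesses.

```python
from collections import OrderedDict

class SmallFileCache:
    """Node-local LRU cache in front of a slow shared filesystem (sketch)."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity          # max number of cached files
        self.fetch = fetch                # callback: path -> bytes (slow path)
        self.cache = OrderedDict()        # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.cache:
            self.hits += 1
            self.cache.move_to_end(path)  # mark as most recently used
            return self.cache[path]
        self.misses += 1
        data = self.fetch(path)           # slow: go to shared storage
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return data
```

Repeated epochs over the same dataset turn almost all reads into cache hits after the first pass, which is where the "optimized for small files" claim pays off.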

Platform overview

Heterogeneous deep learning supercomputer

High speed storage system

Operation/Maintenance/Monitoring System

Lightweight virtualization

Task scheduling system

Distributed training software

Deep Learning Training Visualization System

Customized communication library for deep learning

Computation library

Distributed cache system

Software Platform

Training Visualization

DeepLink in SenseTime

>3000 GPUs

01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.

02 DeepLink: a large-scale cluster platform designed for deep learning.

03 Applications: delivers many application models.

THANK YOU
