
Page 1: Panel – The Impact of AI Workloads on Datacenter Compute and Memory (2019-03-29)

Panel – The Impact of AI Workloads on Datacenter Compute and Memory

Samsung @ The Heart of Your Data
Unparalleled Product Breadth & Technology Leadership

Marc Tremblay, Microsoft
Distinguished Engineer, Azure H/W
Responsible for silicon / systems roadmap for AI
Full-stack expert from application requirements to silicon
Previously CTO of Microelectronics @ Sun

Sumit Gupta, IBM
Vice President, AI, ML and HPC business
Responsible for strategy and HW/SW products for Watson ML accelerator and Spectrum compute
Previously GM of the AI & GPU data center business, Nvidia

Cliff Young, Google
Software Engineer, Google Brain team
HW/SW co-design and machine learning
One of the TPU designers
Previously at D.E. Shaw and Bell Labs

Maxim Naumov, Facebook
Research Scientist, Facebook AI Research
Deep learning, parallel algorithms and systems
Co-developed many of Nvidia's GPU-accelerated libraries
Previously at Nvidia Research and Intel

Greg Diamos, Baidu
AI Research Lead, Silicon Valley AI Lab (SVAIL)
Co-developed Deep Speech and Deep Voice
Contributor to the Volta SIMT scheduler, compiler and microarchitecture
Previously at Nvidia

Rob Ober, Nvidia
Chief Platform Architect, Datacenter products
GPU deployments for AI and DL at hyperscalers
CPU architecture, storage systems, SSD, networking, wireless and power management
Previously at SanDisk, LSI, Apple, etc.

Anand Iyer, Samsung
Director, Planning/Technology, Semiconductor products
Data center and AI: HBM / accelerators / CPU
CPU, network infrastructure, near-memory compute
Previously at Broadcom, Cavium and Digital

Page 2

Virtuous cycle of big data vs compute growth chasm

Data

Compute

Data is being created faster than our ability to make sense of it.

2 trillion searches / year

350M photos

4PB / day

Flu Outbreak: Google knows before CDC

Page 3

ML/DL Systems – Market Demand and Key Application Domains

https://code.fb.com/ai-research/scaling-neural-machine-translation-to-bigger-data-sets-with-faster-training-and-inference
https://code.fb.com/ml-applications/expanding-automatic-machine-translation-to-more-languages

[Figure: three key application domains.
Vision: image/video processed by convolutions to a label (e.g. "cat").
Recommenders: dense and sparse features go through embedding lookups and interactions into NNs, producing a probability of a click.
NMT: sequence-to-sequence translation with attention over hidden states (e.g. "I like cats" translated to Russian "мне нравятся коты").]
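To make the recommender path above concrete (dense and sparse features, embedding lookups, feature interactions, an NN producing a click probability), here is a minimal NumPy sketch. It is not Facebook's production model: the table sizes, embedding dimension, layer widths, and pairwise dot-product interaction are illustrative assumptions.

```python
# Minimal sketch of the recommender flow on this slide: dense features and
# sparse (categorical) features -> embedding lookups -> feature interactions
# -> an MLP -> probability of a click. All sizes and the random, untrained
# weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

NUM_TABLES, ROWS_PER_TABLE, EMB_DIM = 8, 10_000, 32   # assumed, not from the deck
DENSE_DIM, BATCH = 13, 4

emb_tables = [rng.standard_normal((ROWS_PER_TABLE, EMB_DIM)).astype(np.float32)
              for _ in range(NUM_TABLES)]

def mlp(x, dims):
    """Plain fully connected stack; ReLU on hidden layers, linear output."""
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        w = rng.standard_normal((d_in, d_out)).astype(np.float32) * 0.1
        x = x @ w
        if i < len(dims) - 2:
            x = np.maximum(x, 0.0)
    return x

def forward(dense, sparse_ids):
    # Bottom MLP on dense features.
    d = mlp(dense, [DENSE_DIM, 64, EMB_DIM])
    # Embedding lookups: one gather per sparse feature (memory-bound, very low op intensity).
    embs = [emb_tables[t][sparse_ids[:, t]] for t in range(NUM_TABLES)]
    feats = np.stack([d] + embs, axis=1)                # (BATCH, 1+NUM_TABLES, EMB_DIM)
    # Pairwise dot-product interactions between all feature vectors.
    inter = np.einsum('bik,bjk->bij', feats, feats).reshape(BATCH, -1)
    # Top MLP -> single logit -> click probability.
    z = mlp(np.concatenate([d, inter], axis=1), [EMB_DIM + inter.shape[1], 64, 1])
    return 1.0 / (1.0 + np.exp(-z))                     # sigmoid

dense = rng.standard_normal((BATCH, DENSE_DIM)).astype(np.float32)
sparse_ids = rng.integers(0, ROWS_PER_TABLE, size=(BATCH, NUM_TABLES))
print(forward(dense, sparse_ids).ravel())               # per-example click probabilities
```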

[1] “Deep Learning Inference in Data Centers: Characterization, Performance Optimizations and Hardware Implications”, ArXiv, 2018

DL inference in data centers [1]

Page 4

Resource Requirements

Category | Model Types            | Model Size (W) | Max. Activations | Op. Intensity (W)     | Op. Intensity (X and W)
RecSys   | FCs                    | 1-10M          | > 10K            | 20-200                | 20-200
RecSys   | Embeddings             | > 10 Billion   | > 10K            | 1-2                   | 1-2
CV       | ResNeXt101-32x4-48     | 43-829M        | 2-29M            | Avg. 380 / Min. 100   | Avg. 188 / Min. 28
CV       | Faster-RCNN-ShuffleNet | 6M             | 13M              | Avg. 3.5K / Min. 2.5K | Avg. 145 / Min. 4
CV       | ResNeXt3D-101          | 21M            | 58M              | Avg. 22K / Min. 2K    | Avg. 172 / Min. 6
NLP      | seq2seq                | 100M-1B        | > 100K           | 2-20                  | 2-20
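To make the "Op. Intensity (W)" column concrete for FC layers: every weight is reused once per example in the batch, so operations per weight equal roughly twice the batch size, which is why small-batch RecSys and NMT FCs land in the 20-200 range. A minimal sketch; the layer dimensions are illustrative assumptions, not values from [1].

```python
# Operational intensity of a fully connected (FC) layer: C = X (M x K) times W^T (K x N).
# FLOPs = 2*M*N*K; weight traffic = K*N values; activation traffic = M*K + M*N values.
# Ops per weight is therefore 2*M, i.e. twice the batch size.
def fc_intensity(batch_m, in_k, out_n):
    flops = 2 * batch_m * out_n * in_k
    ops_per_weight = flops / (in_k * out_n)
    ops_per_weight_and_act = flops / (in_k * out_n + batch_m * in_k + batch_m * out_n)
    return ops_per_weight, ops_per_weight_and_act

# Illustrative layer size (assumption): K = N = 1024.
for m in (1, 10, 100):
    w, xw = fc_intensity(m, 1024, 1024)
    print(f"batch={m:4d}  ops/weight={w:6.1f}  ops/(weights+activations)={xw:6.1f}")
```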

Page 5

Roofline: AI Application Performance

For the foreseeable future, off-chip memory bandwidth will often be the constraining resource in system performance.
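A one-line roofline model makes this claim concrete: attainable throughput is the smaller of peak compute and memory bandwidth times operational intensity. The sketch below uses the hypothetical accelerator from Figure 3 of [1] (100 int8 Top/s, 100 GB/s DRAM, 1 TB/s on-chip); the sample operational intensities are illustrative assumptions.

```python
# Roofline: attainable ops/s = min(peak_compute, bandwidth * operational_intensity).
# Accelerator numbers are the hypothetical device from Figure 3 of [1]; the sample
# operational intensities below are illustrative, not measured values.
PEAK_OPS = 100e12        # 100 int8 Top/s
BW_DRAM = 100e9          # 100 GB/s off-chip
BW_ONCHIP = 1e12         # 1 TB/s on-chip

def roofline(op_intensity_ops_per_byte, bandwidth, peak=PEAK_OPS):
    return min(peak, bandwidth * op_intensity_ops_per_byte)

for name, oi in [("embedding lookup", 1), ("small-batch FC", 20), ("conv layer", 500)]:
    dram = roofline(oi, BW_DRAM)
    onchip = roofline(oi, BW_ONCHIP)
    print(f"{name:16s} intensity={oi:4d} ops/B  "
          f"from DRAM: {dram/1e12:6.2f} Top/s  from on-chip: {onchip/1e12:6.2f} Top/s")
```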

System balance: Memory <-> Compute <-> Communication

Memory Access for:

Network / program config and control flow

Training data mini-batch compute flow

Compute consumes: Mini-batch data

Communication for: All-reduce

Embedding table insertion

Research + Development

Time to train and accuracy

Multiple runs for exploration, sometimes overnight

Production
Optimal work/$

Optimal work/watt

Time to train

Page 6

Hardware Trends

Excerpt from [1]:

Figure 3: Runtime roofline analysis of different ML models varying on-chip memory capacity of a hypothetical accelerator with 100 int8 Top/s compute and 100 GB/s DRAM bandwidth. The importance of on-chip 1 TB/s (solid) and 10 TB/s (dashed) bandwidth is showcased under a variety of workloads.

[...] 800×600 input images and ResNeXt-3D for videos). The FC layers in recommendation and NMT models use small batch sizes, so performance is bound by off-chip memory bandwidth unless parameters can fit on-chip. The batch size can be increased while maintaining latency with higher compute throughput of accelerators [34], but only up to a point due to other application requirements. The number of operations per weight in CV models is generally high, but the number of operations per activation is not as high (some layers in the ShuffleNet and ResNeXt-3D models are as low as 4 or 6). This is why the performance of ShuffleNet and ResNeXt-3D varies considerably depending on on-chip memory bandwidth, as shown in Figure 3. Had we only considered their minimum 2K operations per weight, we would expect that 1 TB/s of on-chip memory is sufficient to saturate the peak 100 Top/s compute throughput of the hypothetical accelerator. As the application would be compute bound with 1 TB/s of on-chip memory bandwidth, we would expect there to be no performance difference between 1 TB/s and 10 TB/s.

Third, common primitive operations are not just canonical multiplications of square matrices, but often involve tall-and-skinny matrices or vectors. These problem shapes arise from group/depth-wise convolutions that have recently become popular in CV, and from small batch sizes in Recommendation/NMT models due to their latency constraints. Therefore, it is desired to have a combination of 1) matrix-matrix engines to execute the bulk of FLOPs from compute-intensive models in an energy-efficient manner and 2) powerful enough vector engines to handle the remaining operations. More details are described in the next section.

Figure 4: CPU time breakdown across data centers.

Figure 5: Common activation (a) and weight (b) matrix shapes; markers distinguish FCs, group convolutions with few channels per group (depth-wise convolution is an extreme case with 1 channel per group), and all other operations.

2.3. Computation Kernels

Figure 4 shows the breakdown of operations across all data centers. We count CPU operations because for inference we often work with a small batch size in order to meet latency constraints, and therefore GPUs are not widely used (Table 1). Notice that FCs are the most time-consuming operation, followed by tensor manipulations and embedding lookups.

Figure 5 shows common matrix shapes encountered in our DL inference workloads. For activation matrices in convolution layers, we put dimensions of lowered (i.e. im2col'd) matrices, but that does not necessarily mean lowering is used. That is, the reduction dimension is multiplied by the filter size (e.g., 9 for 3×3 filters). In convolution layers, the number of rows of activation matrices is batch_size × H_out × W_out, where H_out × W_out is the size of each output channel. We call this number of rows the effective batch size or batch/spatial dimension.

Time spent in Caffe2 operators in data centers [1]

Common activation and weight matrix shapes (X_{M×K} × W^T_{K×N}) [1]
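The matrix shapes in Figure 5 follow directly from the layer dimensions: after (logical) im2col lowering, a convolution becomes a GEMM whose row count is the effective batch size batch_size × H_out × W_out and whose reduction dimension is (C_in / groups) × kernel_h × kernel_w. A small sketch with illustrative layer sizes (not the exact layers measured in [1]):

```python
# Shape of the lowered (im2col'd) activation matrix for a convolution, as used in
# Figure 5 of [1]: rows = batch * H_out * W_out ("effective batch size"),
# reduction dim = (C_in / groups) * kh * kw, weight columns per group = C_out / groups.
def lowered_gemm_shape(batch, c_in, h, w, c_out, kh, kw, stride=1, pad=0, groups=1):
    h_out = (h + 2 * pad - kh) // stride + 1
    w_out = (w + 2 * pad - kw) // stride + 1
    m = batch * h_out * w_out            # effective batch size (rows of activations)
    k = (c_in // groups) * kh * kw       # reduction dimension
    n = c_out // groups                  # output channels per group
    return m, k, n

# Illustrative examples (assumed sizes): a dense 3x3 conv and a depth-wise 3x3 conv
# on a 56x56 feature map at batch size 1.
print(lowered_gemm_shape(1, 64, 56, 56, 64, 3, 3, pad=1))              # roughly square GEMM
print(lowered_gemm_shape(1, 64, 56, 56, 64, 3, 3, pad=1, groups=64))   # tall and skinny
```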


• High memory bandwidth and capacity for embeddings (see the sketch below)

• Support for powerful matrix and vector engines

• Large on-chip memory for inference with small batches

• Support for half-precision floating-point computation

[1] “Deep Learning Inference in Data Centers: Characterization, Performance Optimizations and Hardware Implications”, ArXiv, 2018
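The first bullet follows from the Resource Requirements table: tables with more than 10 billion rows cannot live in on-chip memory, and each fetched value participates in only 1-2 operations, so lookups are nearly pure memory traffic. A back-of-the-envelope sketch; the embedding dimension, element width, lookups per request, and request rate are assumed for illustration.

```python
# Back-of-the-envelope capacity and traffic demand for recommendation embeddings.
# The row count (> 10 billion) comes from the Resource Requirements table; embedding
# dimension, element width, lookups per request, and QPS are illustrative assumptions.
ROWS = 10e9              # > 10 billion embedding rows
DIM = 64                 # assumed embedding dimension
BYTES_PER_ELEM = 2       # assumed fp16 storage
LOOKUPS_PER_REQ = 500    # assumed pooled lookups per inference request
QPS = 100_000            # assumed requests per second served by the model

capacity_tb = ROWS * DIM * BYTES_PER_ELEM / 1e12
traffic_gb_s = QPS * LOOKUPS_PER_REQ * DIM * BYTES_PER_ELEM / 1e9

print(f"embedding table capacity: {capacity_tb:.2f} TB")   # far beyond on-chip SRAM
print(f"lookup traffic          : {traffic_gb_s:.1f} GB/s")
# Each fetched value is used in only 1-2 operations (per the table above), so this
# traffic cannot be hidden behind compute: it is raw memory bandwidth demand.
```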

Page 7

Sample Workload Characterization

Embedding table hit rates and access histograms [2]

[2] “Bandana: Using Non-Volatile Memory for Storing Deep Learning Models”, SysML, 2019
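A toy version of the kind of analysis behind [2]: simulate skewed embedding-row accesses against a small LRU cache in DRAM that fronts a larger table on slower memory (e.g. NVM), and measure the hit rate. The Zipf skew, table size, and cache size are illustrative assumptions, not Facebook's measured access distributions.

```python
# Toy simulation of embedding-row caching, in the spirit of [2]: a small DRAM cache
# (LRU) in front of a large table kept on slower memory. All parameters below are
# illustrative assumptions.
import random
from collections import OrderedDict

random.seed(0)
NUM_ROWS = 1_000_000        # rows in one embedding table (assumed)
CACHE_ROWS = 50_000         # rows that fit in the DRAM cache (assumed, 5%)
NUM_ACCESSES = 200_000
ZIPF_S = 1.05               # skew of the popularity distribution (assumed)

# Zipf-like popularity weights over rows (rank 1 is the most popular row).
weights = [1.0 / (rank ** ZIPF_S) for rank in range(1, NUM_ROWS + 1)]

cache = OrderedDict()       # row_id -> None, ordered by recency (LRU)
hits = 0
for row in random.choices(range(NUM_ROWS), weights=weights, k=NUM_ACCESSES):
    if row in cache:
        hits += 1
        cache.move_to_end(row)          # mark as most recently used
    else:
        cache[row] = None
        if len(cache) > CACHE_ROWS:
            cache.popitem(last=False)   # evict the least recently used row
print(f"DRAM hit rate with a {CACHE_ROWS / NUM_ROWS:.0%} cache: {hits / NUM_ACCESSES:.1%}")
```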

Page 8

MLPerf

Page 9

The Move to The Edge

By 2022, 7 out of every 10 bytes of data created will never see a data center.

CPU

Memory

Storage

Let Data Speak for Itself!

Considerations:
• Compute closer to data
• Smarter data movement
• Faster time to insight

Page 10

Panel – The Impact of AI Workloads on Datacenter Compute and Memory

Samsung @ The Heart of Your Data
Unparalleled Product Breadth & Technology Leadership

Panelists: Marc Tremblay (Microsoft), Sumit Gupta (IBM), Cliff Young (Google), Maxim Naumov (Facebook), Greg Diamos (Baidu), Rob Ober (Nvidia), Anand Iyer (Samsung). Bios as on Page 1.

Page 11

Cloud AI (HBM): 6x faster training time, 8x training cost effectiveness

Edge computing AI (GDDR6): 2x faster data access, 2x hot data feeding

On-device AI (LPDDR5): 1.5x bandwidth, privacy & fast response

Samsung @ The Heart of Your Data
Visit us at Booth #726