TRANSCRIPT · 2019-03-29
Panel – The Impact of AI Workloads on Datacenter Compute and Memory
Samsung @ The Heart of Your Data: Unparalleled Product Breadth & Technology Leadership
Marc Tremblay, Microsoft
Sumit Gupta, IBM
Cliff Young, Google
Maxim Naumov, Facebook
Greg Diamos, Baidu
Rob Ober, Nvidia
Anand Iyer, Samsung
Distinguished Engineer, Azure H/W
Responsible for silicon / systems roadmap for AI
Full-stack expert, from application requirements to silicon
Previously CTO of Microelectronics @ Sun
Vice President, AI, ML and HPC business
Responsible for strategy and HW/SW products for Watson ML accelerator and Spectrum compute
Previously GM AI & GPU data center business, Nvidia
Software Engineer, Google Brain team
HW/SW Codesign and Machine Learning
One of the TPU designers
Previously at D.E. Shaw and Bell Labs
Research Scientist, Facebook AI Research
Deep Learning, Parallel Algorithms and Systems
Co-developed many of Nvidia GPU-accelerated libraries
Previously at Nvidia Research and Intel
AI Research Lead, Baidu Silicon Valley AI Lab
Co-developed Deep Speech and Deep Voice
Contributor to Volta SIMT scheduler, Compiler and Microarchitecture
Previously at Nvidia
Chief Platform Architect, Datacenter products
GPU deployments for AI and DL @ Hyperscalers
CPU architecture, storage systems, SSDs, networking, wireless and power management
Previously at SanDisk, LSI, Apple, etc.
Director, Planning/Technology, Semiconductor products
Data center and AI HBM/Accelerators/CPU
CPU, Network infra, near memory compute
Previously at Broadcom, Cavium and Digital
Virtuous cycle of big data vs. compute growth chasm
[Diagram: data growth outpacing compute growth]
Data is being created faster than our ability to make sense of it:
• 2 trillion searches / year
• 350M photos
• 4 PB / day
• Flu outbreak: Google knows before the CDC
ML/DL Systems – Market Demand and Key Application Domains
https://code.fb.com/ai-research/scaling-neural-machine-translation-to-bigger-data-sets-with-faster-training-and-inference
https://code.fb.com/ml-applications/expanding-automatic-machine-translation-to-more-languages
[Diagram: three key application domains]
• Vision: image/video -> convolutions -> class label (e.g., "cat")
• Recommenders: dense features -> NNs; sparse features -> embedding lookups -> interactions -> NNs -> probability of a click (sketched below)
• NMT: encoder-decoder with attention over hidden states, e.g., "I like cats" -> "мне нравятся коты" (Russian output)
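As a rough illustration of the recommender path above, here is a minimal NumPy sketch of the sparse-feature flow: embedding lookups, a simple interaction, a tiny FC layer, and a sigmoid click probability. All table sizes, feature counts, and the single-layer "NN" are hypothetical, and the interaction step is simplified to a concatenation rather than the dot-product interactions real models use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
NUM_ROWS, EMB_DIM = 100_000, 64      # embedding table: rows x dim
NUM_SPARSE, NUM_DENSE = 8, 13        # sparse ids and dense features per request

emb_table = rng.standard_normal((NUM_ROWS, EMB_DIM), dtype=np.float32)
mlp_w = rng.standard_normal((NUM_DENSE + NUM_SPARSE * EMB_DIM, 1), dtype=np.float32)

def click_probability(dense, sparse_ids):
    """Dense features feed the NN directly; sparse ids are gathered from the
    embedding table (the memory-bandwidth-heavy step in recommenders)."""
    emb = emb_table[sparse_ids]                 # gather: NUM_SPARSE x EMB_DIM
    x = np.concatenate([dense, emb.ravel()])    # crude "interaction": concat
    logit = x @ mlp_w                           # tiny FC layer
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid -> click probability

p = click_probability(rng.standard_normal(NUM_DENSE).astype(np.float32),
                      rng.integers(0, NUM_ROWS, NUM_SPARSE))
print(float(p[0]))
```

The gather over the embedding table is the step that drives the memory-bandwidth and capacity requirements discussed later.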
[1] “Deep Learning Inference in Data Centers: Characterization, Performance Optimizations and Hardware Implications”, ArXiv, 2018
DL inference in data centers [1]
Resource requirements of representative DL inference workloads [1] (W = weights, X = activations; op. intensity in ops per element accessed):

Category | Model Types | Model Size (# weights) | Max. Live Activations | Op. Intensity (W only) | Op. Intensity (X and W)
RecSys | FCs | 1-10M | >10K | 20-200 | 20-200
RecSys | Embeddings | >10 billion | >10K | 1-2 | 1-2
CV | ResNeXt101-32x4-48 | 43-829M | 2-29M | Avg. 380 / Min. 100 | Avg. 188 / Min. 28
CV | Faster-RCNN-ShuffleNet | 6M | 13M | Avg. 3.5K / Min. 2.5K | Avg. 145 / Min. 4
CV | ResNeXt3D-101 | 21M | 58M | Avg. 22K / Min. 2K | Avg. 172 / Min. 6
NLP | seq2seq | 100M-1B | >100K | 2-20 | 2-20
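The operational-intensity columns above can be reproduced by simple counting. Below is a minimal sketch for an FC layer, with made-up dimensions and a multiply-accumulate counted as 2 ops, as is usual in roofline analyses. At small batch sizes, ops per weight is just 2 × batch, which is why RecSys FCs land in the 20-200 range.

```python
def fc_op_intensity(batch, in_dim, out_dim):
    """Ops per element read/written for a fully connected layer,
    counting a multiply-accumulate as 2 ops."""
    ops = 2 * batch * in_dim * out_dim
    weights = in_dim * out_dim
    activations = batch * (in_dim + out_dim)    # inputs read + outputs written
    return ops / weights, ops / (weights + activations)

# Hypothetical small-batch recommender FC layer.
per_w, per_xw = fc_op_intensity(batch=20, in_dim=512, out_dim=128)
print(f"ops/weight = {per_w:.0f}, ops/(weights + activations) = {per_xw:.0f}")
```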
Roofline: AI Application Performance
For the foreseeable future, off-chip memory bandwidth will often be the constraining resource in system performance.
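A minimal roofline sketch, using the hypothetical accelerator from the Figure 3 caption below (100 int8 TOP/s peak, 100 GB/s off-chip DRAM bandwidth); the operational intensities are taken loosely from the table above, treating ops per element as ops per byte for int8 data.

```python
# Peak compute and off-chip bandwidth of the hypothetical accelerator in Figure 3.
PEAK_OPS = 100e12      # 100 int8 TOP/s
DRAM_BW  = 100e9       # 100 GB/s

def attainable(op_intensity_ops_per_byte):
    """Roofline: attainable throughput is the lower of the compute roof
    and the bandwidth roof (bandwidth x operational intensity)."""
    return min(PEAK_OPS, DRAM_BW * op_intensity_ops_per_byte)

for oi in (2, 40, 380, 3500):   # roughly the op intensities in the table above
    print(f"OI = {oi:5d} ops/byte -> {attainable(oi) / 1e12:6.1f} TOP/s")
```

Embedding lookups and small-batch FCs sit far below the compute roof, which is why off-chip bandwidth dominates their performance.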
System balance: Memory <-> Compute <-> Communication (a rough per-step cost sketch follows this list)
• Memory access for: network / program config and control flow; training-data mini-batch compute flow
• Compute consumes: mini-batch data
• Communication for: all-reduce; embedding table insertion
Research + development: time to train and accuracy; multiple runs for exploration, sometimes overnight
Production: optimal work/$, optimal work/watt, time to train
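The sketch below is a back-of-envelope balance model for one data-parallel training step. Every hardware constant is an assumption chosen only to show how the memory, compute, and communication terms trade off; the all-reduce term uses the standard ring all-reduce volume factor of 2(p-1)/p.

```python
# All constants are assumed, round numbers for a hypothetical accelerator node.
FLOPS_PEAK = 100e12    # peak compute, FLOP/s
MEM_BW     = 1e12      # memory bandwidth, bytes/s
NET_BW     = 25e9      # per-accelerator interconnect bandwidth, bytes/s
WORKERS    = 8         # data-parallel workers

def step_time(flops, bytes_moved, grad_bytes):
    """Per-step compute, memory, and all-reduce communication terms."""
    t_compute = flops / FLOPS_PEAK
    t_memory  = bytes_moved / MEM_BW
    # Ring all-reduce sends/receives ~2*(p-1)/p of the gradient volume per worker.
    t_comm    = 2 * (WORKERS - 1) / WORKERS * grad_bytes / NET_BW
    # If the three fully overlap, the slowest dominates; if not, they add up.
    return max(t_compute, t_memory, t_comm), t_compute + t_memory + t_comm

best, worst = step_time(flops=2e12, bytes_moved=5e9, grad_bytes=400e6)
print(f"per step: {best * 1e3:.0f} ms (fully overlapped) to {worst * 1e3:.0f} ms (serialized)")
```

Whichever term dominates tells you which resource to scale next, which is the "system balance" point of this slide.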
Hardware Trends
Figure 3: Runtime roofline analysis of different ML models, varying the on-chip memory capacity of a hypothetical accelerator with 100 int8 TOP/s compute and 100 GB/s DRAM bandwidth. The importance of 1 TB/s (solid) and 10 TB/s (dashed) on-chip bandwidth is showcased under a variety of workloads.
[...] 800×600 input images and ResNeXt-3D for videos). The FC layers in recommendation and NMT models use small batch sizes, so performance is bound by off-chip memory bandwidth unless the parameters can fit on-chip. The batch size can be increased while maintaining latency with the higher compute throughput of accelerators [34], but only up to a point due to other application requirements. The number of operations per weight in CV models is generally high, but the number of operations per activation is not as high (some layers in the ShuffleNet and ResNeXt-3D models are as low as 4 or 6). This is why the performance of ShuffleNet and ResNeXt-3D varies considerably depending on on-chip memory bandwidth, as shown in Figure 3. Had we only considered their minimum 2K operations per weight, we would expect that 1 TB/s of on-chip memory bandwidth is sufficient to saturate the peak 100 TOP/s compute throughput of the hypothetical accelerator. As the application would be compute bound with 1 TB/s of on-chip memory bandwidth, we would expect no performance difference between 1 TB/s and 10 TB/s.
Third, common primitive operations are not just canonical multiplications of square matrices, but often involve tall-and-skinny matrices or vectors. These problem shapes arise from the group/depth-wise convolutions that have recently become popular in CV, and from the small batch sizes in recommendation/NMT models due to their latency constraints. Therefore, it is desirable to have a combination of 1) matrix-matrix engines to execute the bulk of FLOPs from compute-intensive models in an energy-efficient manner and 2) sufficiently powerful vector engines to handle the remaining operations. More details are described in the next section.
Figure 4: CPU time breakdown across data centers.
Figure 5: Common (a) activation and (b) weight matrix shapes. One marker denotes FCs, another group convolutions with few channels per group (depth-wise convolution is an extreme case with 1 channel per group), and a third all other operations.
2.3. Computation Kernels
Figure 4 shows the breakdown of operations across all data centers. We count CPU operations because for inference we often work with a small batch size in order to meet latency constraints, and therefore GPUs are not widely used (Table 1). Notice that FCs are the most time-consuming operation, followed by tensor manipulations and embedding lookups.
Figure 5 shows common matrix shapes encountered in our DL inference workloads. For activation matrices in convolution layers, we list the dimensions of the lowered (i.e., im2col'd) matrices, but that does not necessarily mean lowering is used. That is, the reduction dimension is multiplied by the filter size (e.g., 9 for 3×3 filters). In convolution layers, the number of rows of an activation matrix is batch_size × H_out × W_out, where H_out × W_out is the size of each output channel. We call this number of rows the effective batch size or batch/spatial dimension.
Time spent in Caffe2 operators in data centers [1]
Common activation and weight matrix shapes (activations X of size M×K times transposed weights Wᵀ of size K×N) [1]
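As a sketch of the shapes the excerpt describes (rows = batch_size × H_out × W_out, reduction dimension = channels × filter size), the helper below computes the GEMM that an im2col-lowered convolution maps to. The layer dimensions are hypothetical examples.

```python
def lowered_gemm_shapes(batch, in_ch, out_ch, h_out, w_out, kernel=3):
    """Shapes of the GEMM an im2col-lowered convolution maps to:
    activations X are (M x K), transposed weights are (K x N)."""
    m = batch * h_out * w_out        # effective batch = batch/spatial dimension
    k = in_ch * kernel * kernel      # reduction dim: channels x filter size
    n = out_ch
    return (m, k), (k, n)

# Hypothetical 3x3 conv layer at batch size 1, typical for latency-bound inference.
x_shape, w_shape = lowered_gemm_shapes(batch=1, in_ch=256, out_ch=256,
                                       h_out=14, w_out=14)
print(x_shape, w_shape)   # (196, 2304) (2304, 256): small M relative to K and N
```

Small effective batch sizes give exactly the tall-and-skinny / skewed GEMM shapes the excerpt argues vector engines must handle alongside matrix-matrix engines.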
Hardware implications [1]:
• High memory bandwidth and capacity for embeddings
• Support for powerful matrix and vector engines
• Large on-chip memory for inference with small batches
• Support for half-precision floating-point computation
[1] “Deep Learning Inference in Data Centers: Characterization, Performance Optimizations and Hardware Implications”, ArXiv, 2018
Sample Workload Characterization
Embedding table hit rates and access histograms [2]
[2] “Bandana: Using Non-Volatile Memory for Storing Deep Learning Models”, SysML, 2019
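A toy illustration of why those hit rates matter when hot embedding rows are cached in DRAM and the bulk is kept on NVM, as in Bandana [2]. The access skew, cache size, and LRU policy below are all assumptions chosen only to show the effect, not a description of the actual system.

```python
import random
from collections import OrderedDict

# Toy experiment: how much of a large embedding table is actually touched?
# The heavy-tailed skew and all sizes are illustrative assumptions.
random.seed(0)
NUM_ROWS, CACHE_ROWS, LOOKUPS = 1_000_000, 50_000, 200_000

def skewed_row():
    # Crude heavy-tailed row id: a few rows are vastly more popular than the rest.
    return int(NUM_ROWS ** random.random()) % NUM_ROWS

cache, hits = OrderedDict(), 0
for _ in range(LOOKUPS):
    row = skewed_row()
    if row in cache:
        hits += 1
        cache.move_to_end(row)          # LRU: refresh on hit
    else:
        cache[row] = True
        if len(cache) > CACHE_ROWS:
            cache.popitem(last=False)   # evict the coldest row
print(f"DRAM cache covers {CACHE_ROWS / NUM_ROWS:.0%} of rows, "
      f"hit rate {hits / LOOKUPS:.0%}")
```

With a skewed access distribution, a DRAM cache holding only a few percent of the rows can absorb most lookups, which is what the hit-rate and access histograms are measuring.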
MLPerf
The Move to The Edge
By 2022, 7 out of every 10 bytes of data created will never see a data center.
[Diagram: CPU - Memory - Storage]
Let Data Speak for Itself!
• Compute closer to Data
• Smarter Data Movement
• Faster Time to Insight
Considerations
Cloud AI: HBM
• 6x faster training time
• 8x training cost effectiveness
Edge computing AI: GDDR6
• 2x faster data access
• 2x hot data feeding
On-device AI: LPDDR5
• 1.5x bandwidth; privacy & fast response
Samsung @ The Heart of Your Data. Visit us at Booth #726