Thinking Outside the Die: Architecting the ML Accelerator of the Future
TRANSCRIPT
Sean Lie, Co-founder & Chief HW Architect, Cerebras
© 2021 Cerebras Systems Inc. All Rights Reserved
Cerebras Systems
Building and deploying a new class of computer system
Designed to accelerate AI and change the future of AI work
Founded in 2016
350+ Engineers
Offices
Silicon Valley | San Diego | Toronto | Tokyo
Customers
North America | Asia | Europe
1800x more compute
In just 2 years
Tomorrow, multi-trillion parameter models
Exponential Growth of Neural Networks
[Chart: "Memory and compute requirements": total training compute (PFLOP-days, log scale 1 to 100,000) vs. model memory requirement (GB, log scale 1 to 100,000). 2018: BERT Base (110M), BERT Large (340M). 2019: GPT-2 (1.5B), Megatron-LM (8B), T5 (11B). 2020+: T-NLG (17B), GPT-3 (175B), MT-NLG (530B), MSFT-1T (1T).]
A lot of innovation in the last few years
For example…
• Process: 16nm, 12nm, 7nm
• Memory: HBM, HBM2
• Precision: float16, bfloat16
• Datapath: Systolic array, Tensor cores
ML Acceleration Improvements in Industry
Low precision numerics
Dense GEMM datapath
Format     Sign  Exponent  Mantissa
FP32       1b    8b        23b
FP16       1b    5b        10b
BFLOAT16   1b    8b        7b
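A minimal sketch of the trade-off (standard-library Python, not from the talk): bfloat16 keeps float32's 8-bit exponent range and simply drops mantissa precision, so to first order it is the top 16 bits of the float32 pattern.

import struct

def float32_fields(x: float) -> str:
    """IEEE-754 float32 bit pattern as sign | exponent | mantissa."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} | {s[1:9]} | {s[9:]}"  # 1b sign, 8b exponent, 23b mantissa

def bfloat16_fields(x: float) -> str:
    """bfloat16 is (to first order) the top 16 bits of the float32 pattern."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits >> 16:016b}"
    return f"{s[0]} | {s[1:9]} | {s[9:]}"  # 1b sign, 8b exponent, 7b mantissa

print(float32_fields(3.14159))   # full 23-bit mantissa
print(bfloat16_fields(3.14159))  # same exponent range, only 7 mantissa bits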
Process Technology Gains
Moore's Law is not dead!
[Chart: transistor density (transistors/mm², log scale 10^6 to 10^8) vs. year (2006 to 2022): G80 (90nm), GTX480 (40nm), P100 (16nm), V100 (12nm), A100 (7nm).]
Architecture and Microarchitecture Gains
Architecture Matters!
[Chart: peak FLOPS/transistor (log scale 10^2 to 10^4) vs. year (2006 to 2022): G80, GTX480, P100, V100, A100.]
In the last 2 years…
What our industry delivered:
• 2x: Process technology
• 3x: Architecture and µarch
• 300x: Cluster scale-out
What ML needs:
• BERT → GPT-3
• 1800x more compute
Meeting the Need
Cluster Scale-out Dominated Performance Gains
Google TPUs in one of its data centers. Image credit: Google (cloud.google.com/tpu)
Last 2 years
2x: Process technology
3x: Architecture and µarch
300x: Cluster scale-out
The Answer is Just Scale-out, Right?
Scale-out is needed…
But not practical
And not very scalable
Next 2 years?
2x: Process technology
3x: Architecture and µarch
300x: Cluster scale-out
100k chip clusters?
Distribution complexity scales dramatically with cluster size
Massive models need massive memory, massive compute, and massive communication.
On giant clusters of small devices, all three become intertwined, distributed problems.
This forces inefficient, fine-grained partitioning and coordination of memory, compute, and communication across thousands of devices.
Limits of Existing Scale-out
Last 2 years
2x: Process technology
3x: Architecture and µarch
300x: Cluster scale-out
Next 2 years
2x: Process technology
3x: Architecture and µarch
1-10x: Cluster scale-out
Traditional Scale-out is Not Enough
1k-10k chip clusters
Path to 60x gains
But we need 1800x
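A quick sanity check of that gap, multiplying the slide's own factors:

process, arch, scaleout = 2, 3, 10   # plausible next-2-years gains per the slide
print(process * arch * scaleout)     # 60x total
print(1800 / 60)                     # 30x short of what ML demand requires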
We must find ways to:
• Improve process gains
• Improve architecture gains
• Improve scale-out
Imagine more balanced scaling:
• 2x → 20-100x: Process technology
• 3x → 30-150x: Architecture and µarch
• 300x: Scalable cluster scale-out
The ML Demand Challenge
Requires thinking outside the box with co-design
But is it possible?
Process: Amplifying Moore's Law
[Chart: die size (mm²) vs. year (1960 to 2030): 4004, 8080, 80286, 80486, Pentium, G80, GTX480, P100, V100, A100. Die sizes have flattened in recent years.]
Transistor density improvement continues!
Historically, amplified with bigger chips
But die sizes are not getting bigger
Because of challenges in…
1. Yield
2. Lithography
3. Power and cooling
Building Bigger Chips
Extending Beyond a Single Die in Industry
• Single reticle chip: the largest traditional chip
• Chiplets and MCM: 2-5x larger
• Advanced interposer: 10-20x larger
Cerebras WSE-2 56x Larger than Traditional Chip
The Largest Chip Ever Built
46,225 mm2 silicon
2.6 Trillion transistors
850,000 AI optimized cores
40 Gigabytes on-chip memory
20 Petabyte/s memory bandwidth
220 Petabit/s fabric bandwidth
7nm Process technology at TSMC
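A quick check of the 56x figure against the ~826 mm² A100 die area quoted later in this talk:

print(46225 / 826)   # ~56: WSE-2 silicon area vs. the largest reticle-limited chip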
• Impossible to yield a full wafer with zero defects
• Silicon and process defects are inevitable, even in a mature process
1. Solving Yield Challenges
Defects
Die
Co-designed with chip architecture
Redundancy
• Uniform small core architecture enables redundancy to address yield at very low cost
• Design includes redundant cores and redundant fabric links
• Redundant cores replace defective cores
• Extra links reconnect fabric to restore logical 2D mesh
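A toy sketch of the idea in Python (illustrative only; the defect map, row-level granularity, and spare counts here are assumptions, not the actual WSE scheme):

# Hypothetical defect map from wafer test: (row, col) of bad cores.
ROWS, COLS, SPARE_ROWS = 10, 8, 2
defects = {(3, 5), (7, 2)}

# Any row containing a defect is retired; spare rows keep the logical mesh full.
good_rows = [r for r in range(ROWS)
             if not any((r, c) in defects for c in range(COLS))]
logical_to_physical = good_rows[: ROWS - SPARE_ROWS]

print(logical_to_physical)   # logical mesh re-indexed onto defect-free rows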
• Standard fabrication process requires die to be independent
• Scribe line separates each die
• Scribe line used as mechanical barrier for die cutting and for test structures
2. Solving Lithography Limitations
• Add wires across scribe line in partnership with TSMC
• Extend 2D mesh across die
• Same connectivity between cores and across scribe lines creates a homogeneous array
Cross-Die Wires
Co-designed with fab and chip architecture
Advantages of On-chip Fabric
• Same bandwidth on die as across die
• Short wires enable high bandwidth at low cost
• Spanning <1mm on silicon is highly efficient
• On-chip bandwidth is “easy”
• Shifts complexity to system and package
                  Reticle area   Bandwidth             Power
                  (mm²)          (TB/s)   (GB/s/mm²)   (pJ/bit)   (W)
Nvidia A100       826            0.6      0.7          10         60
Cerebras WSE-2    525            3.2      6.1          0.15       4.8
Ratio                            5.3x     8.4x         66.6x      12.5x
Nvidia estimates based on PCIe 5.0 SERDES design in 7nm
Inter-reticle IO
Co-designed with package
Concentrated high density exceeds traditional power & cooling capabilities
• Power delivery: current density too high for power-plane distribution in the PCB
• Heat removal: heat density too high for direct air cooling
3. Solving Power and Cooling
• Power delivery: current flow is distributed in the 3rd dimension, perpendicular to the wafer
• Heat removal: water carries heat from the wafer through a cold plate
Using the 3rd Dimension
Co-designed with system
The Co-designed System
Engine Block
The Co-designed System
Cerebras CS-2
Power Supplies
Heat Exchanger
Water Pumps
Built from the ground up to run the WSE-2
Single chip equivalent to 56x traditional chips
Combined with process shrinks…
Imagine more balanced scaling:
✅ 2x → 20-100x: Process technology
3x → 30-150x: Architecture and µarch
300x: Scalable cluster scale-out
Cerebras CS-2 System
Cluster-scale performance on a single system
Architecture: Designing From the Ground Up for Neural Networks
Neural networks are expressed as GEMMs
• Leading to high density GEMM datapaths
Not a path to significant gains because…
• Physical and power limits cap the number of FMACs
• Coarse granularity creates fitting overheads
We need to change the rules
What if… we could run fewer FLOPS?
Driving More Compute Traditionally
Dense GEMM datapath
Neural Networks are Naturally Sparse
Natural sparsity arises from common ML techniques in neural networks
• e.g., nonlinearities that zero activations
• ReLU, Dropout, …
Natural sparsity arises in training backward pass even when forward pass is dense
• Backward pass accounts for 2/3 of compute
• Nonlinear derivatives that zero deltas
• Hard Sigmoid, Max Pool, …
[Diagram: a dense network vs. a sparse network.]
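A minimal numpy illustration of the first point: ReLU alone zeros roughly half of randomly distributed pre-activations.

import numpy as np

rng = np.random.default_rng(0)
pre_act = rng.standard_normal((1024, 1024)).astype(np.float32)
post_act = np.maximum(pre_act, 0.0)            # ReLU zeros all negatives

print(f"activation sparsity: {np.mean(post_act == 0.0):.1%}")  # ~50%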
Neural Networks Can be Made Sparse
Technique                    Sparsity   FLOP ↓   Reference
Fixed Sparse Training        90%        8x       Lottery Ticket [MIT CSAIL]
Dynamic Sparse Training      80%        2x       Rigging the Lottery [Google Brain, DeepMind]
Scaling-up Sparse Training   90%+       10x+     Pruning scaling laws [MIT CSAIL]
Monte Carlo DropConnect      50%        2x       DropConnect in Bayesian Nets [Nature]
ML Community has invented various sparsity techniques
Sparsity Research Shows 10x+ Opportunity
Existing GEMM architectures are dense-only
Fundamental design limitations:
1. Memory bandwidth limitation
2. Structured computation
What if we could solve these?
Architecture and ML Co-design for Sparsity
Traditional memory bandwidth is low
• Central shared memory is slow & far away
• Requires high data reuse and caching
Distributed memory has full bandwidth
• All memory is fully distributed with cores
• Cores get full performance without caching
1. Full Memory Bandwidth
Co-designed with system and ML
Full Performance on All BLAS Levels

              Matrix-Matrix    Matrix-Vector    Vector-Vector    Vector-Scalar
Operation     GEMM             GEMV             DOT              AXPY
Math          C ← αAB + βC     y ← αAx + βy     α ← (x, y)       y ← αx + y
Bytes/FLOP    0.005            1                2                3

GPUs sustain full performance only at the GEMM end of this range; the bandwidth jump from GEMM to GEMV is where Cerebras keeps running at full performance, across all four levels.
Sparse GEMM is one AXPY per non-zero weight
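The bytes/FLOP figures follow from counting operand traffic per arithmetic operation. A sketch assuming 2-byte (fp16) operands, with the dimension N chosen so the GEMM figure matches the table:

def bytes_per_flop(elems_moved: int, flops: int, elem_bytes: int = 2) -> float:
    """Arithmetic intensity: bytes of operand traffic per FLOP (fp16 operands)."""
    return elems_moved * elem_bytes / flops

N = 600  # matrix/vector dimension (illustrative)
print("GEMM:", bytes_per_flop(3 * N * N, 2 * N**3))       # read A,B + write C -> 3/N = 0.005
print("GEMV:", bytes_per_flop(N * N + 2 * N, 2 * N * N))  # dominated by A -> ~1
print("DOT :", bytes_per_flop(2 * N + 1, 2 * N))          # read x,y -> ~2
print("AXPY:", bytes_per_flop(3 * N, 2 * N))              # read x,y + write y -> 3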
2. Architecture for Unstructured Sparsity
Dataflow scheduling in hardware
• Data and control received from fabric
• Triggers instruction lookup to set up tensor control state machine
• State machine schedules datapath cycles to consume fabric input
• Datapath output is written back to memory or fabric output
[Diagram: core dataflow. Fabric input delivers data and control; the dataflow trigger looks up an instruction (e.g., fmac [z]=[z],[w],a) that sets up the tensor control state machine; the state machine schedules FMAC datapath cycles against registers; results are written back to memory or to fabric output.]
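A minimal Python sketch of that control flow (illustrative only, not the actual instruction set): an arriving wavelet's control tag triggers an instruction that consumes the data operand against local tensor state.

import numpy as np

z = np.zeros(16)            # local accumulator tensor [z]
w = np.arange(16.0)         # local weight tensor [w]

def on_wavelet(ctrl: str, a: float) -> None:
    """Fabric input (ctrl, data) triggers instruction lookup and execution."""
    if ctrl == "fmac":
        z[:] += w * a       # fmac [z]=[z],[w],a: datapath cycles per element
    # other ctrl tags would trigger other tensor instructions

on_wavelet("fmac", 0.5)     # one incoming datum drives a whole tensor operation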
Dataflow sparsity harvesting
• Sender filters out sparse zero data
• Receiver skips unnecessary processing
Enabled by fine-grained execution datapaths
• Small cores with independent instructions
• Optimized for dynamic non-uniform work
Co-designed with ML and compiler/kernels
Core Designed for Sparsity
The Wafer is the Matrix-Multiply array
• High-capacity local memory stores all activations across compute fabric
• Large compute core array receives sparse weight stream and triggers multiplies with activations
• Massive memory bandwidth enables full performance of operands to the datapath
• High BW interconnect enables partial sum accumulation across wafer at full performance
• No matrix blocking or partitioning required
Sparse MatMul Kernel
[Diagram: the Wafer Scale Engine core array, organized along feature and sequence dimensions, with sparse weights streaming in and a partial-sum ring accumulating results across the wafer.]
Same flow supports dense and sparse
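A numpy sketch of the same flow (the harvesting idea, not Cerebras kernel code): GEMM decomposed into one AXPY per non-zero weight, so zeros are never scheduled.

import numpy as np

def matmul_as_axpys(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Y = W @ X computed as one AXPY (y += a*x) per non-zero weight."""
    Y = np.zeros((W.shape[0], X.shape[1]), dtype=X.dtype)
    for i, j in zip(*np.nonzero(W)):   # sender filters out zero weights
        Y[i] += W[i, j] * X[j]         # AXPY against the activation row
    return Y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * (rng.random((64, 64)) > 0.9)  # ~90% zeros
X = rng.standard_normal((64, 16))
assert np.allclose(matmul_as_axpys(W, X), W @ X)   # same result, ~10x fewer AXPYs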
Sparsity reduces time-to-accuracy
• WSE runs AXPY at full performance
• Limited only by low fixed overheads
• Minimized by high bandwidth interconnect
• Reduced as networks grow larger
• Accelerates all unstructured sparsity
• Fully dynamic and fine-grained
• Even fully random patterns
Near-linear sparsity acceleration
[Chart: measured speedup vs. unstructured sparsity factor (% of zeros, 0% to 90%) on a GPT-3 layer (12k x 12k MatMul). The measured curve is near-linear, approaching 10x at 90% sparsity.]
Demonstrated Unstructured Sparsity Speedup
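The linear reference in that chart is just the skipped-FLOP ratio, 1 / (1 - sparsity); a few lines reproduce the tick marks:

for s in (0.0, 0.50, 0.67, 0.75, 0.80, 0.83, 0.86, 0.88, 0.89, 0.90):
    print(f"{s:4.0%} zeros -> {1 / (1 - s):4.1f}x ideal speedup")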
Unstructured sparsity acceleration enables a new level of ML co-design
Sparsity opportunity likely increases with model size
Combined with other architecture improvements…
Imagine more balanced scaling:
✅ 2x → 20-100x: Process technology
✅ 3x → 30-150x: Architecture and µarch
300x: Scalable cluster scale-out
Beyond FLOPS
Enabling existing and novel sparse ML techniques
Scale-out: Inherently Scalable Clustering
Several existing scale-out techniques
Challenges to Existing Scale-out
Data Parallel (Device 1 | Device 2)
• Multiple samples at a time (Sample 1 … Sample N)
• Limitation: parameter memory limits

Pipelined Model Parallel (Device 1 | Device 2)
• Multiple layers at a time
• Limitations: communication overhead; N² activation memory (see the sketch after this list)

Tensor Model Parallel (Device 1 | Device 2)
• Multiple splits at a time (Sample 1)
• Limitations: communication overhead; complex partitioning

All share the same limitation: memory tied to compute
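Why pipelining costs O(N²) activation memory, as a back-of-envelope sketch (the stage accounting here is the standard argument, not from the slide): with N stages kept full, roughly N micro-batches are in flight, and earlier stages hold their activations longest.

N = 8  # pipeline stages = devices
# Stage k holds activations for micro-batches it has run forward on but not
# yet back-propagated: about N - k of them when the pipeline is full.
in_flight = sum(N - k for k in range(N))
print(in_flight)   # 36 = N*(N+1)/2, i.e. O(N^2) activation copies cluster-wide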
Thinking outside the device
Cluster Level Co-design
Scale model size and training speed independently
The Cluster is the ML Accelerator: Disaggregation of memory and compute
[Diagram: CS-2 Compute keeps activations local; external model memory streams weights in and gradients out, one layer at a time; the training database supplies samples and labels.]
CS-2: 850,000 compute cores in a single chip
MemoryX Technology: up to 120 trillion parameters on a single CS-2
SwarmX Interconnect Technology: near-linear performance scaling up to 192 CS-2s
Built for extreme-scale neural networks:
• Weights stored externally off-wafer
• Weights streamed onto wafer to compute layer
• Only activations are resident on wafer
• Execute one layer at a time
Decoupling weight optimizer compute
• Gradients streamed out of wafer
• Weight update occurs in MemoryX
Weight Streaming Execution Model
[Diagram: MemoryX (weight memory plus optimizer compute) streams weights to the CS-2 and receives gradients back; a dataset server feeds training samples.]
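A toy numpy rendering of this execution model (plain Python, not the Cerebras software stack): weights live in a "MemoryX" list, the wafer holds only activations, layers execute one at a time, and updates happen off-wafer.

import numpy as np

rng = np.random.default_rng(0)
memoryx = [rng.standard_normal((32, 32)) * 0.1 for _ in range(4)]  # weights off-wafer
x = rng.standard_normal((8, 32))                                   # batch on "wafer"

acts = [x]
for W in memoryx:                                  # stream weights in, one layer at a time
    acts.append(np.maximum(acts[-1] @ W, 0.0))     # ReLU MLP forward

delta = acts[-1]                                   # toy loss gradient (target = 0)
for i in reversed(range(len(memoryx))):
    delta = delta * (acts[i + 1] > 0)              # ReLU derivative
    grad_W = acts[i].T @ delta                     # gradient streamed off-wafer
    prev_delta = delta @ memoryx[i].T              # propagate before the update
    memoryx[i] -= 0.01 * grad_W                    # optimizer step "in MemoryX"
    delta = prev_delta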
• Efficient performance scaling by removing latency-sensitive communication
• Coarse-grained pipelining
• Forward/delta/gradient are fully pipelined
• All streams within an iteration have no inter-layer dependencies
• Fine-grained pipelining
• Overlapping of weight update and forward pass covers inter-iteration dependency
Solving the Latency Problem
[Timeline: across iteration N's backward pass and iteration N+1's forward pass, the CS-2 runs per-layer gradient and forward computation (L2 gradient, L1 gradient, then L1 forward, L2 forward) while MemoryX overlaps per-layer weight updates and weight stream-out (L1, L2, L3), covering the inter-iteration dependency.]
Purpose-built to support large neural network execution:
• 4TB – 2.4PB capacity
• 200 billion – 120 trillion weights with optimizer state
• DRAM and flash hybrid storage
• Internal compute for weight update/optimizer
• Handles intelligent pipelining to mask latency
Scalable to extreme model sizes
Capacity scaling independent from compute
MemoryX Technology
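Both ends of that range work out to about 20 bytes per parameter, consistent with a weight plus optimizer state (the exact breakdown is an assumption, e.g. an FP32 weight plus Adam-style moments):

TB, PB = 10**12, 10**15
print(4 * TB / 200e9)      # 20.0 bytes/parameter at the low end
print(2.4 * PB / 120e12)   # 20.0 bytes/parameter at the high end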
• Data parallel training across CS-2s
• Weights are broadcast to all CS-2s
• Gradients are reduced on way back
• Multi-system scaling with the same execution model as single system
• Same system architecture
• Same network execution flow
• Same software user interface
SwarmX Fabric Connects Multiple CS-2s
Scalable to extreme model sizes
Compute scaling independent from capacity
[Diagram: MemoryX (weight memory plus optimizer compute) connects through the SwarmX fabric to multiple CS-2s; weights are broadcast out and gradients are reduced on the way back.]
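A minimal sketch of the broadcast/reduce pattern (generic data parallelism in numpy, not SwarmX APIs):

import numpy as np

def broadcast(weights: np.ndarray, n: int) -> list[np.ndarray]:
    return [weights.copy() for _ in range(n)]        # same weights to every CS-2

def reduce_grads(grads: list[np.ndarray]) -> np.ndarray:
    return np.mean(grads, axis=0)                    # reduced on the way back

W = np.zeros(10)
replicas = broadcast(W, n=4)
grads = [np.ones(10) * i for i in range(4)]          # stand-in per-system gradients
W -= 0.1 * reduce_grads(grads)                       # weight update stays in MemoryX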
Weight sparsity induced in MemoryX
• Sparse weights streamed to all CS-2s
• Sparse gradients reduced on the way back
• Sparse weight updates on sparse matrix
No change to the weight streaming model
Same flow supports dense and sparse
Streaming Sparse Weights
[Diagram: MemoryX (weight memory plus optimizer with a sparsify step) streams sparse weights through SwarmX to multiple CS-2s running sparse compute; sparse gradients are reduced on the way back.]
Near-Linear Performance Scaling
[Chart: projected speedup vs. number of CS-2s (1 to 256, both axes log scale) for NLP model sizes of 10B, 100B, 1T, 10T, and 100T parameters; larger models scale closer to linear. Reference model sizes: Megatron-LM 8B, T5 11B, T-NLG 17B, GPT-3 175B, MT-NLG 530B, MSFT-1T 1T.]
Projections based on Scaling Laws for Neural Language Models [OpenAI]
Cluster-level purpose-built memory, compute, and interconnect
Disaggregation avoids traditional scale-out complexity
Imagine more balanced scaling:
✅ 2x → 20-100x: Process technology
✅ 3x → 30-150x: Architecture and µarch
✅ 300x: Scalable cluster scale-out
The Cluster is the ML Accelerator
Is it possible?
The Grand ML Demand Challenge
[Chart: the opening "Memory and compute requirements" plot, rebuilt point by point: BERT Base (110M) and BERT Large (340M) in 2018; GPT-2 (1.5B), Megatron-LM (8B), and T5 (11B) in 2019; T-NLG (17B), GPT-3 (175B), MT-NLG (530B), and MSFT-1T (1T) in 2020+; extending toward "???".]
Architecting the ML Accelerator of the Future
Thinking outside the die
Pushing beyond FLOPS
Cluster is the ML Accelerator
Changing the rules through co-design