Thinking Outside the Die: Architecting the ML Accelerator of the Future
TRANSCRIPT
Sean Lie, Co-founder & Chief HW Architect, Cerebras
© 2021 Cerebras Systems Inc. All Rights Reserved
Cerebras Systems
Building and deploying a new class of computer system
Designed to accelerate AI and change the future of AI work
Founded in 2016
350+ Engineers
Offices
Silicon Valley | San Diego | Toronto | Tokyo
Customers
North America | Asia | Europe
1800x more compute
In just 2 years
Tomorrow, multi-trillion parameter models
Exponential Growth of Neural Networks
[Chart: "Memory and compute requirements": total training compute (PFLOP-days, log scale 1 to 100,000) vs. model memory requirement (GB, log scale 1 to 100,000). 2018: BERT Base (110M), BERT Large (340M). 2019: GPT-2 (1.5B), Megatron-LM (8B), T5 (11B). 2020+: T-NLG (17B), GPT-3 (175B), MT-NLG (530B), MSFT-1T (1T).]
A lot of innovation in the last few years
For example…
• Process: 16nm, 12nm, 7nm
• Memory: HBM, HBM2
• Precision: float16, bfloat16
• Datapath: Systolic array, Tensor cores
ML Acceleration Improvements in Industry
Low precision numerics
Dense GEMM datapath
Format     Sign  Exponent  Mantissa
FP32       1b    8b        23b
FP16       1b    5b        10b
BFLOAT16   1b    8b        7b
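A minimal sketch of the trade-off (standard-library Python, not from the talk): bfloat16 keeps float32's 8-bit exponent range and simply drops mantissa precision, so to first order it is the top 16 bits of the float32 pattern.

import struct

def float32_fields(x: float) -> str:
    """IEEE-754 float32 bit pattern as sign | exponent | mantissa."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} | {s[1:9]} | {s[9:]}"  # 1b sign, 8b exponent, 23b mantissa

def bfloat16_fields(x: float) -> str:
    """bfloat16 is (to first order) the top 16 bits of the float32 pattern."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits >> 16:016b}"
    return f"{s[0]} | {s[1:9]} | {s[9:]}"  # 1b sign, 8b exponent, 7b mantissa

print(float32_fields(3.14159))   # full 23-bit mantissa
print(bfloat16_fields(3.14159))  # same exponent range, only 7 mantissa bits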
Process Technology Gains
Moore's Law is not dead!
[Chart: transistor density (transistors/mm², log scale 10^6 to 10^8) vs. year (2006 to 2022): G80 (90nm), GTX480 (40nm), P100 (16nm), V100 (12nm), A100 (7nm).]
Architecture and Microarchitecture Gains
Architecture Matters!
[Chart: peak FLOPS/transistor (log scale 10^2 to 10^4) vs. year (2006 to 2022): G80, GTX480, P100, V100, A100.]
In the last 2 years…
What our industry delivered:
• 2x: Process technology
• 3x: Architecture and µarch
• 300x: Cluster scale-out
What ML needs:
• BERT → GPT-3
• 1800x more compute
Meeting the Need
Cluster Scale-out Dominated Performance Gains
Google TPUs in one of its data centers. Image credit: Google (cloud.google.com/tpu)
Last 2 years
2x: Process technology
3x: Architecture and µarch
300x: Cluster scale-out
The Answer is Just Scale-out, Right?
Scale-out is needed…
But not practical
And not very scalable
Next 2 years?
2x: Process technology
3x: Architecture and µarch
300x: Cluster scale-out
100k chip clusters?
Distribution complexity scales dramatically with cluster size
Massive models need massive memory, massive compute, and massive communication.
On giant clusters of small devices, all three become intertwined, distributed problems.
This forces inefficient, fine-grained partitioning and coordination of memory, compute, and communication across thousands of devices.
Limits of Existing Scale-out
Last 2 years
2x: Process technology
3x: Architecture and µarch
300x: Cluster scale-out
Next 2 years
2x: Process technology
3x: Architecture and µarch
1-10x: Cluster scale-out
Traditional Scale-out is Not Enough
1k-10k chip clusters
Path to 60x gains
But we need 1800x
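A quick sanity check of that gap, multiplying the slide's own factors:

process, arch, scaleout = 2, 3, 10   # plausible next-2-years gains per the slide
print(process * arch * scaleout)     # 60x total
print(1800 / 60)                     # 30x short of what ML demand requires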
We must find ways to:
• Improve process gains
• Improve architecture gains
• Improve scale-out
Imagine more balanced scaling:
• 2x → 20-100x: Process technology
• 3x → 30-150x: Architecture and µarch
• 300x: Scalable cluster scale-out
The ML Demand Challenge
Requires thinking outside the box with co-design
But is it possible?
Process: Amplifying Moore's Law
[Chart: die size (mm²) vs. year (1960 to 2030): 4004, 8080, 80286, 80486, Pentium, G80, GTX480, P100, V100, A100. Die sizes have flattened in recent years.]
Transistor density improvement continues!
Historically, amplified with bigger chips
But die sizes are not getting bigger
Because of challenges in…
1. Yield
2. Lithography
3. Power and cooling
Building Bigger Chips
Extending Beyond a Single Die in Industry
• Single reticle chip: the largest traditional chip
• Chiplets and MCM: 2-5x larger
• Advanced interposer: 10-20x larger
Cerebras WSE-2 56x Larger than Traditional Chip
The Largest Chip Ever Built
46,225 mm2 silicon
2.6 Trillion transistors
850,000 AI optimized cores
40 Gigabytes on-chip memory
20 Petabyte/s memory bandwidth
220 Petabit/s fabric bandwidth
7nm Process technology at TSMC
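A quick check of the 56x figure against the ~826 mm² A100 die area quoted later in this talk:

print(46225 / 826)   # ~56: WSE-2 silicon area vs. the largest reticle-limited chip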
• Impossible to yield a full wafer with zero defects
• Silicon and process defects are inevitable, even in a mature process
1. Solving Yield Challenges
Defects
Die
Co-designed with chip architecture
Redundancy
• Uniform small core architecture enables redundancy to address yield at very low cost
• Design includes redundant cores and redundant fabric links
• Redundant cores replace defective cores
• Extra links reconnect fabric to restore logical 2D mesh
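A toy sketch of the idea in Python (illustrative only; the defect map, row-level granularity, and spare counts here are assumptions, not the actual WSE scheme):

# Hypothetical defect map from wafer test: (row, col) of bad cores.
ROWS, COLS, SPARE_ROWS = 10, 8, 2
defects = {(3, 5), (7, 2)}

# Any row containing a defect is retired; spare rows keep the logical mesh full.
good_rows = [r for r in range(ROWS)
             if not any((r, c) in defects for c in range(COLS))]
logical_to_physical = good_rows[: ROWS - SPARE_ROWS]

print(logical_to_physical)   # logical mesh re-indexed onto defect-free rows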
• Standard fabrication process requires die to be independent
• Scribe line separates each die
• Scribe line used as mechanical barrier for die cutting and for test structures
2. Solving Lithography Limitations
• Add wires across scribe line in partnership with TSMC
• Extend 2D mesh across die
• Same connectivity between cores and across scribe lines creates a homogeneous array
Cross-Die Wires
Co-designed with fab and chip architecture
Advantages of On-chip Fabric
• Same bandwidth on die as across die
• Short wires enable high bandwidth at low cost
• Spanning <1mm on silicon is highly efficient
• On-chip bandwidth is “easy”
• Shifts complexity to system and package
                  Reticle area   Bandwidth             Power
                  (mm²)          (TB/s)   (GB/s/mm²)   (pJ/bit)   (W)
Nvidia A100       826            0.6      0.7          10         60
Cerebras WSE-2    525            3.2      6.1          0.15       4.8
Ratio                            5.3x     8.4x         66.6x      12.5x
Nvidia estimates based on PCIe 5.0 SERDES design in 7nm
Inter-reticle IO
Co-designed with package
Concentrated high density exceeds traditional power & cooling capabilities
• Power delivery: current density too high for power-plane distribution in the PCB
• Heat removal: heat density too high for direct air cooling
3. Solving Power and Cooling
• Power delivery: current flow is distributed in the 3rd dimension, perpendicular to the wafer
• Heat removal: water carries heat from the wafer through a cold plate
Using the 3rd Dimension
Co-designed with system
The Co-designed System
Engine Block
The Co-designed System
Cerebras CS-2
Power Supplies
Heat Exchanger
Water Pumps
Built from the ground up to run the WSE-2
Single chip equivalent to 56x traditional chips
Combined with process shrinks…
Imagine more balanced scaling:
✅ 2x → 20-100x: Process technology
3x → 30-150x: Architecture and µarch
300x: Scalable cluster scale-out
Cerebras CS-2 System
Cluster-scale performance on a single system
Architecture: Designing From the Ground Up for Neural Networks
Neural networks are expressed as GEMMs
• Leading to high density GEMM datapaths
Not a path to significant gains because…
• Physical and power limits cap the number of FMACs
• Coarse granularity creates fitting overheads
We need to change the rules
What if… we could run fewer FLOPS?
Driving More Compute Traditionally
Dense GEMM datapath
Neural Networks are Naturally Sparse
Natural sparsity arises from common ML techniques in neural networks
• e.g., nonlinearities that zero activations
• ReLU, Dropout, …
Natural sparsity arises in training backward pass even when forward pass is dense
• Backward pass accounts for 2/3 of compute
• Nonlinear derivatives that zero deltas
• Hard Sigmoid, Max Pool, …
[Diagram: a dense network vs. a sparse network.]
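A minimal numpy illustration of the first point: ReLU alone zeros roughly half of randomly distributed pre-activations.

import numpy as np

rng = np.random.default_rng(0)
pre_act = rng.standard_normal((1024, 1024)).astype(np.float32)
post_act = np.maximum(pre_act, 0.0)            # ReLU zeros all negatives

print(f"activation sparsity: {np.mean(post_act == 0.0):.1%}")  # ~50%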
Neural Networks Can be Made Sparse
Technique                    Sparsity   FLOP ↓   Reference
Fixed Sparse Training        90%        8x       Lottery Ticket [MIT CSAIL]
Dynamic Sparse Training      80%        2x       Rigging the Lottery [Google Brain, DeepMind]
Scaling-up Sparse Training   90%+       10x+     Pruning scaling laws [MIT CSAIL]
Monte Carlo DropConnect      50%        2x       DropConnect in Bayesian Nets [Nature]
ML Community has invented various sparsity techniques
Sparsity Research Shows 10x+ Opportunity
Existing GEMM architectures are dense-only
Fundamental design limitations:
1. Memory bandwidth limitation
2. Structured computation
What if we could solve these?
Architecture and ML Co-design for Sparsity
Traditional memory bandwidth is low
• Central shared memory is slow & far away
• Requires high data reuse and caching
Distributed memory has full bandwidth
• All memory is fully distributed with cores
• Cores get full performance without caching
1. Full Memory Bandwidth
Co-designed with system and ML
Full Performance on All BLAS Levels

              Matrix-Matrix    Matrix-Vector    Vector-Vector    Vector-Scalar
Operation     GEMM             GEMV             DOT              AXPY
Math          C ← αAB + βC     y ← αAx + βy     α ← (x, y)       y ← αx + y
Bytes/FLOP    0.005            1                2                3

GPUs sustain full performance only at the GEMM end of this range; the bandwidth jump from GEMM to GEMV is where Cerebras keeps running at full performance, across all four levels.
Sparse GEMM is one AXPY per non-zero weight
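The bytes/FLOP figures follow from counting operand traffic per arithmetic operation. A sketch assuming 2-byte (fp16) operands, with the dimension N chosen so the GEMM figure matches the table:

def bytes_per_flop(elems_moved: int, flops: int, elem_bytes: int = 2) -> float:
    """Arithmetic intensity: bytes of operand traffic per FLOP (fp16 operands)."""
    return elems_moved * elem_bytes / flops

N = 600  # matrix/vector dimension (illustrative)
print("GEMM:", bytes_per_flop(3 * N * N, 2 * N**3))       # read A,B + write C -> 3/N = 0.005
print("GEMV:", bytes_per_flop(N * N + 2 * N, 2 * N * N))  # dominated by A -> ~1
print("DOT :", bytes_per_flop(2 * N + 1, 2 * N))          # read x,y -> ~2
print("AXPY:", bytes_per_flop(3 * N, 2 * N))              # read x,y + write y -> 3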
2. Architecture for Unstructured Sparsity
Dataflow scheduling in hardware
• Data and control received from fabric
• Triggers instruction lookup to set up tensor control state machine
• State machine schedules datapath cycles to consume fabric input
• Datapath output is written back to memory or fabric output
[Diagram: core dataflow. Fabric input delivers data and control; the dataflow trigger looks up an instruction (e.g., fmac [z]=[z],[w],a) that sets up the tensor control state machine; the state machine schedules FMAC datapath cycles against registers; results are written back to memory or to fabric output.]
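A minimal Python sketch of that control flow (illustrative only, not the actual instruction set): an arriving wavelet's control tag triggers an instruction that consumes the data operand against local tensor state.

import numpy as np

z = np.zeros(16)            # local accumulator tensor [z]
w = np.arange(16.0)         # local weight tensor [w]

def on_wavelet(ctrl: str, a: float) -> None:
    """Fabric input (ctrl, data) triggers instruction lookup and execution."""
    if ctrl == "fmac":
        z[:] += w * a       # fmac [z]=[z],[w],a: datapath cycles per element
    # other ctrl tags would trigger other tensor instructions

on_wavelet("fmac", 0.5)     # one incoming datum drives a whole tensor operation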
Dataflow sparsity harvesting
• Sender filters out sparse zero data
• Receiver skips unnecessary processing
Enabled by fine-grained execution datapaths
• Small cores with independent instructions
• Optimized for dynamic non-uniform work
Co-designed with ML and compiler/kernels
Core Designed for Sparsity
The Wafer is the Matrix-Multiply array
• High-capacity local memory stores all activations across compute fabric
• Large compute core array receives sparse weight stream and triggers multiplies with activations
• Massive memory bandwidth enables full performance of operands to the datapath
• High BW interconnect enables partial sum accumulation across wafer at full performance
• No matrix blocking or partitioning required
Sparse MatMul Kernel
[Diagram: the Wafer Scale Engine core array, organized along feature and sequence dimensions, with sparse weights streaming in and a partial-sum ring accumulating results across the wafer.]
Same flow supports dense and sparse
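A numpy sketch of the same flow (the harvesting idea, not Cerebras kernel code): GEMM decomposed into one AXPY per non-zero weight, so zeros are never scheduled.

import numpy as np

def matmul_as_axpys(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Y = W @ X computed as one AXPY (y += a*x) per non-zero weight."""
    Y = np.zeros((W.shape[0], X.shape[1]), dtype=X.dtype)
    for i, j in zip(*np.nonzero(W)):   # sender filters out zero weights
        Y[i] += W[i, j] * X[j]         # AXPY against the activation row
    return Y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * (rng.random((64, 64)) > 0.9)  # ~90% zeros
X = rng.standard_normal((64, 16))
assert np.allclose(matmul_as_axpys(W, X), W @ X)   # same result, ~10x fewer AXPYs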
Sparsity reduces time-to-accuracy
• WSE runs AXPY at full performance
• Limited only by low fixed overheads
• Minimized by high bandwidth interconnect
• Reduced as networks grow larger
• Accelerates all unstructured sparsity
• Fully dynamic and fine-grained
• Even fully random patterns
Near-linear sparsity acceleration
[Chart: measured speedup vs. unstructured sparsity factor (% of zeros, 0% to 90%) on a GPT-3 layer (12k x 12k MatMul). The measured curve is near-linear, approaching 10x at 90% sparsity.]
Demonstrated Unstructured Sparsity Speedup
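The linear reference in that chart is just the skipped-FLOP ratio, 1 / (1 - sparsity); a few lines reproduce the tick marks:

for s in (0.0, 0.50, 0.67, 0.75, 0.80, 0.83, 0.86, 0.88, 0.89, 0.90):
    print(f"{s:4.0%} zeros -> {1 / (1 - s):4.1f}x ideal speedup")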
Unstructured sparsity acceleration enables a new level of ML co-design
Sparsity opportunity likely increases with model size
Combined with other architecture improvements…
Imagine more balanced scaling:
✅ 2x → 20-100x: Process technology
✅ 3x → 30-150x: Architecture and µarch
300x: Scalable cluster scale-out
Beyond FLOPS
Enabling existing and novel sparse ML techniques
Scale-out: Inherently Scalable Clustering
Several existing scale-out techniques
Challenges to Existing Scale-out
Data Parallel (Device 1 | Device 2)
• Multiple samples at a time (Sample 1 … Sample N)
• Limitation: parameter memory limits

Pipelined Model Parallel (Device 1 | Device 2)
• Multiple layers at a time
• Limitations: communication overhead; N² activation memory (see the sketch after this list)

Tensor Model Parallel (Device 1 | Device 2)
• Multiple splits at a time (Sample 1)
• Limitations: communication overhead; complex partitioning

All share the same limitation: memory tied to compute
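Why pipelining costs O(N²) activation memory, as a back-of-envelope sketch (the stage accounting here is the standard argument, not from the slide): with N stages kept full, roughly N micro-batches are in flight, and earlier stages hold their activations longest.

N = 8  # pipeline stages = devices
# Stage k holds activations for micro-batches it has run forward on but not
# yet back-propagated: about N - k of them when the pipeline is full.
in_flight = sum(N - k for k in range(N))
print(in_flight)   # 36 = N*(N+1)/2, i.e. O(N^2) activation copies cluster-wide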
Thinking outside the device
Cluster Level Co-design
Scale model size and training speed independently
The Cluster is the ML Accelerator: Disaggregation of memory and compute
[Diagram: CS-2 Compute keeps activations local; external model memory streams weights in and gradients out, one layer at a time; the training database supplies samples and labels.]
CS-2: 850,000 compute cores in a single chip
MemoryX Technology: up to 120 trillion parameters on a single CS-2
SwarmX Interconnect Technology: near-linear performance scaling up to 192 CS-2s
Built for extreme-scale neural networks:
• Weights stored externally off-wafer
• Weights streamed onto wafer to compute layer
• Only activations are resident on wafer
• Execute one layer at a time
Decoupling weight optimizer compute
• Gradients streamed out of wafer
• Weight update occurs in MemoryX
Weight Streaming Execution Model
[Diagram: MemoryX (weight memory plus optimizer compute) streams weights to the CS-2 and receives gradients back; a dataset server feeds training samples.]
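A toy numpy rendering of this execution model (plain Python, not the Cerebras software stack): weights live in a "MemoryX" list, the wafer holds only activations, layers execute one at a time, and updates happen off-wafer.

import numpy as np

rng = np.random.default_rng(0)
memoryx = [rng.standard_normal((32, 32)) * 0.1 for _ in range(4)]  # weights off-wafer
x = rng.standard_normal((8, 32))                                   # batch on "wafer"

acts = [x]
for W in memoryx:                                  # stream weights in, one layer at a time
    acts.append(np.maximum(acts[-1] @ W, 0.0))     # ReLU MLP forward

delta = acts[-1]                                   # toy loss gradient (target = 0)
for i in reversed(range(len(memoryx))):
    delta = delta * (acts[i + 1] > 0)              # ReLU derivative
    grad_W = acts[i].T @ delta                     # gradient streamed off-wafer
    prev_delta = delta @ memoryx[i].T              # propagate before the update
    memoryx[i] -= 0.01 * grad_W                    # optimizer step "in MemoryX"
    delta = prev_delta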
• Efficient performance scaling by removing latency-sensitive communication
• Coarse-grained pipelining
• Forward/delta/gradient are fully pipelined
• All streams within an iteration have no inter-layer dependencies
• Fine-grained pipelining
• Overlapping of weight update and forward pass covers inter-iteration dependency
Solving the Latency Problem
[Timeline: across iteration N's backward pass and iteration N+1's forward pass, the CS-2 runs per-layer gradient and forward computation (L2 gradient, L1 gradient, then L1 forward, L2 forward) while MemoryX overlaps per-layer weight updates and weight stream-out (L1, L2, L3), covering the inter-iteration dependency.]
Purpose-built to support large neural network execution:
• 4TB – 2.4PB capacity
• 200 billion – 120 trillion weights with optimizer state
• DRAM and flash hybrid storage
• Internal compute for weight update/optimizer
• Handles intelligent pipelining to mask latency
Scalable to extreme model sizes
Capacity scaling independent from compute
MemoryX Technology
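Both ends of that range work out to about 20 bytes per parameter, consistent with a weight plus optimizer state (the exact breakdown is an assumption, e.g. an FP32 weight plus Adam-style moments):

TB, PB = 10**12, 10**15
print(4 * TB / 200e9)      # 20.0 bytes/parameter at the low end
print(2.4 * PB / 120e12)   # 20.0 bytes/parameter at the high end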
• Data parallel training across CS-2s
• Weights are broadcast to all CS-2s
• Gradients are reduced on way back
• Multi-system scaling with the same execution model as single system
• Same system architecture
• Same network execution flow
• Same software user interface
SwarmX Fabric Connects Multiple CS-2s
Scalable to extreme model sizes
Compute scaling independent from capacity
[Diagram: MemoryX (weight memory plus optimizer compute) connects through the SwarmX fabric to multiple CS-2s; weights are broadcast out and gradients are reduced on the way back.]
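A minimal sketch of the broadcast/reduce pattern (generic data parallelism in numpy, not SwarmX APIs):

import numpy as np

def broadcast(weights: np.ndarray, n: int) -> list[np.ndarray]:
    return [weights.copy() for _ in range(n)]        # same weights to every CS-2

def reduce_grads(grads: list[np.ndarray]) -> np.ndarray:
    return np.mean(grads, axis=0)                    # reduced on the way back

W = np.zeros(10)
replicas = broadcast(W, n=4)
grads = [np.ones(10) * i for i in range(4)]          # stand-in per-system gradients
W -= 0.1 * reduce_grads(grads)                       # weight update stays in MemoryX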
Weight sparsity induced in MemoryX
• Sparse weights streamed to all CS-2s
• Sparse gradients reduced on the way back
• Sparse weight updates on sparse matrix
No change to the weight streaming model
Same flow supports dense and sparse
Streaming Sparse Weights
[Diagram: MemoryX (weight memory plus optimizer with a sparsify step) streams sparse weights through SwarmX to multiple CS-2s running sparse compute; sparse gradients are reduced on the way back.]
Near-Linear Performance Scaling
[Chart: projected speedup vs. number of CS-2s (1 to 256, both axes log scale) for NLP model sizes of 10B, 100B, 1T, 10T, and 100T parameters; larger models scale closer to linear. Reference model sizes: Megatron-LM 8B, T5 11B, T-NLG 17B, GPT-3 175B, MT-NLG 530B, MSFT-1T 1T.]
Projections based on Scaling Laws for Neural Language Models [OpenAI]
Cluster-level purpose-built memory, compute, and interconnect
Disaggregation avoids traditional scale-out complexity
Imagine more balanced scaling:
✅ 2x → 20-100x: Process technology
✅ 3x → 30-150x: Architecture and µarch
✅ 300x: Scalable cluster scale-out
The Cluster is the ML Accelerator
Is it possible?
The Grand ML Demand Challenge
[Chart: the opening "Memory and compute requirements" plot, rebuilt point by point: BERT Base (110M) and BERT Large (340M) in 2018; GPT-2 (1.5B), Megatron-LM (8B), and T5 (11B) in 2019; T-NLG (17B), GPT-3 (175B), MT-NLG (530B), and MSFT-1T (1T) in 2020+; extending toward "???".]
Architecting the ML Accelerator of the Future
Thinking outside the die
Pushing beyond FLOPS
Cluster is the ML Accelerator
Changing the rules through co-design