
Page 1: Waferscale Architectures

Waferscale Architectures

Saptadeep Pal, Subramanian S. Iyer and Puneet Gupta University of California, Los Angeles

In collaboration with Rakesh Kumar

University of Illinois at Urbana-Champaign

Page 2: Waferscale Architectures

Table of Contents

• Why and How Waferscale?

• Design Challenges - A Waferscale GPU Case Study

• Prototype - Waferscale Graph Processor


Page 3: Waferscale Architectures

Massively Parallel Applications

Social Networks

Molecular Dynamics Simulation

Artificial Intelligence

Weather Modelling

Big Data

Multi-Physics Simulation

Page 4: Waferscale Architectures

Communication is Expensive!!


Page 5: Waferscale Architectures

One Large Chip


Page 6: Waferscale Architectures

A Brief History of Waferscale Computing

Gene Amdahl's Trilogy Systems, Tandem Computers, Fujitsu

Other efforts: ITT Corporation, Texas Instruments. Recent efforts: SpiNNaker


Page 7: Waferscale Architectures

What Happened to Waferscale Integration?

Their Approach to Waferscale: Monolithic

Didn’t work out (e.g., Trilogy Systems was one of the biggest financial disasters in Silicon Valley before 2001)

As chip area grows, the probability of defects and the impact of variability grow with it.

Some mitigation is possible through voting logic (TMR, etc.), but it is prohibitively expensive.

Deemed commercially unviable

Page 8: Waferscale Architectures

Time to Give Waferscale Another Go?

However, to achieve waferscale integration, we need to solve the yield problem


Page 9: Waferscale Architectures

Re-imagining WSI Technology

UCLA Silicon Interconnect Fabric (Si-IF)*

[Figure: Si-IF cross-section with 2 µm, 10 µm, and 100 µm feature dimensions]

*UCLA CHIPS Programme: https://www.chips.ucla.edu/research/project/4

Allows heterogeneous waferscale integration with high yield. Measured bond yield >99%.

Q: What do we need from waferscale integration? A: High-density interconnection.


Page 10: Waferscale Architectures

A Case for Waferscale GPU

GPU applications scale well with compute and memory resources


Page 11: Waferscale Architectures

Waferscale GPU Overview

• GPU die = 500 mm²
• 3D DRAM die = 100 mm²
• Total area = 700 mm²
• GPU die power = 200 W
• DRAM die power = 35 W
• Total power = 270 W

A 300 mm wafer has enough area for about 72 GPU modules (GPMs); a back-of-envelope check follows below.

A GPU Module: GPM
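To make the module budget concrete, here is a minimal back-of-envelope sketch in Python (the ~71% wafer-area utilization is inferred from the slide's 72-module figure, not stated on the slide):

```python
import math

# Per-module numbers from the slide (one GPM = GPU die + 3D DRAM).
GPM_AREA_MM2 = 700.0   # total module area
GPM_POWER_W  = 270.0   # total module power

# A 300 mm wafer: raw area of the full circle, ~70,686 mm^2.
raw_wafer_area = math.pi * (300.0 / 2) ** 2

# Upper bound ignoring edge exclusion and inter-module spacing.
upper_bound = raw_wafer_area // GPM_AREA_MM2               # ~100 modules

# The slide quotes "about 72" GPMs; the implied wafer-area utilization is:
implied_utilization = 72 * GPM_AREA_MM2 / raw_wafer_area   # ~0.71

print(f"raw wafer area      : {raw_wafer_area:,.0f} mm^2")
print(f"upper-bound modules : {upper_bound:.0f}")
print(f"utilization at 72   : {implied_utilization:.0%}")
print(f"power for 72 GPMs   : {72 * GPM_POWER_W / 1000:.1f} kW")  # ~19.4 kW
```

At 72 modules the aggregate power would be roughly 19 kW, which is why the thermal and power-delivery constraints on the next slides end up limiting the design to far fewer GPMs.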


Page 12: Waferscale Architectures

Architecting a Waferscale GPU

Q: Can we build a 72-GPM waferscale GPU?

Three major physical constraints:

1. Thermal

➢ Waferscale GPU would dissipate kWs of power

2. Power Delivery

➢ How to supply kWs of power to the GPU modules?
➢ Voltage Regulator Module (VRM) overhead?

3. Network of GPMs
➢ Si-IF has up to 4 metal layers; what network topology to build?


Page 13: Waferscale Architectures

Thermal Design Power

➢ Forced air-cooling with two heat sinks

➢ ~12 kW TDP, i.e., 34 GPMs can be supported


Page 14: Waferscale Architectures

Power Delivery

Input Voltage (V) | Number of Metal Layers | VRM + DeCap Area per GPM (mm²) | Number of GPMs
1                 | 68                     | 300                            | 50
3.3               | 8                      | 1020                           | 29
12                | 2                      | 1380                           | 24
48                | 2                      | 2460                           | 15

[Charts: number of metal layers and VRM + decap area overhead vs. input voltage]
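The GPM counts in the table are consistent with a fixed usable wafer area divided by the module area plus its VRM and decap overhead. A minimal sketch, assuming roughly 50,000 mm² of usable wafer area (a figure inferred from the table, not stated on the slide) and the 700 mm² GPM from earlier:

```python
# Reproducing the table's GPM counts from the per-GPM VRM + decap overhead.
USABLE_AREA_MM2 = 50_000   # assumed usable wafer area, inferred from the table
GPM_AREA_MM2 = 700         # module area from the GPM overview slide

vrm_overhead_mm2 = {       # input voltage (V) -> VRM + decap area per GPM (mm^2)
    1: 300,
    3.3: 1020,
    12: 1380,
    48: 2460,
}

for vin, overhead in vrm_overhead_mm2.items():
    n_gpms = USABLE_AREA_MM2 // (GPM_AREA_MM2 + overhead)
    print(f"Vin = {vin:>4} V -> {n_gpms} GPMs")   # 50, 29, 24, 15

# Higher input voltage needs fewer power-delivery metal layers but implies a
# larger step-down ratio, hence more VRM/decap area per GPM and fewer GPMs.
```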

Page 15: Waferscale Architectures

Stacked Power Delivery

Input Voltage (V) | No Stack | 2-GPM Stack | 4-GPM Stack
12                | 24       | 33          | 40
48                | 15       | 24          | 34

(Table entries: number of GPMs supported.)

• One VRM shared by multiple GPMs in stack

• Decreases step down ratio of VRM

• GPMs need to be activity balanced for equal voltage drop (see the sketch below)

• Intermediate regulators for voltage stability
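Why activity balancing matters: in a voltage stack the same current flows through every GPM in series, so the stack voltage divides according to each module's effective load. A minimal illustrative sketch, modelling each GPM as a simple resistive load whose conductance scales with its activity (an assumption made here for illustration, not the deck's model):

```python
# Illustrative model of a 4-GPM voltage stack: the same stack current flows
# through every GPM, so the stack voltage divides in proportion to each
# module's effective resistance. Activity imbalance therefore skews the
# per-GPM voltages away from the nominal V_stack / 4.
V_STACK = 4.0                           # e.g. 4 GPMs nominally at ~1 V each
activity_balanced = [1.0, 1.0, 1.0, 1.0]
activity_skewed   = [1.3, 1.0, 1.0, 0.7]   # one busy GPM, one mostly idle

def per_gpm_voltages(activity, v_stack=V_STACK):
    # Effective resistance of a GPM drops as its activity (current draw) rises.
    resistance = [1.0 / a for a in activity]
    total = sum(resistance)
    return [v_stack * r / total for r in resistance]

print("balanced:", [f"{v:.2f} V" for v in per_gpm_voltages(activity_balanced)])
print("skewed  :", [f"{v:.2f} V" for v in per_gpm_voltages(activity_skewed)])
# The skewed case leaves the idle GPM over-voltage and the busy one
# under-voltage, which is why the slide calls for activity balancing and
# intermediate regulators.
```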


Page 16: Waferscale Architectures

Waferscale Inter-GPM Network

[Figure: Mesh, 1-D Torus, and 2-D Torus inter-GPM topologies]

Num. Layers | Topology  | Inter-GPM BW (TBps) | Si-IF Yield
1           | Mesh      | 0.75                | 95.9%
2           | Mesh      | 1.5                 | 91.9%
2           | 1-D Torus | 1.5                 | 84.3%
3           | 2-D Torus | 1.9                 | 74%
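The deck does not state how the Si-IF yield numbers were computed, but they behave like a standard Poisson area-defect model in which yield falls exponentially with total wiring area (note that the 2-layer mesh yield is roughly the square of the 1-layer mesh yield). A minimal sketch of that model, with placeholder defect density and wiring area chosen only to show the trend:

```python
import math

def wiring_yield(defect_density_per_cm2: float, wiring_area_cm2: float) -> float:
    """Poisson area-defect model: probability the wiring has zero killer defects."""
    return math.exp(-defect_density_per_cm2 * wiring_area_cm2)

# Placeholder values (assumptions, not from the deck): more metal layers or
# longer torus wrap-around links mean more wiring area, hence lower yield.
D0 = 1e-4                   # killer defects per cm^2
one_layer_mesh_area = 420   # cm^2 of critical wiring area for a 1-layer mesh

print(f"1-layer mesh: {wiring_yield(D0, one_layer_mesh_area):.1%}")      # ~95.9%
print(f"2-layer mesh: {wiring_yield(D0, 2 * one_layer_mesh_area):.1%}")  # ~91.9%
# Under this model, doubling the wiring squares the yield: 0.959**2 ~= 0.919.
```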


Page 17: Waferscale Architectures

Final WS-GPU Architectures

24-GPM floorplan without voltage stacking; 40-GPM floorplan with voltage stacking

Inter-GPM Network: Mesh


Page 18: Waferscale Architectures

Results – WS-GPU Performance Improvement

• WSI with 24 GPMs performs 2.97x better than multi-MCM configuration (EDP: 9.3x)

• WSI with 40 GPMs performs 5.2x better than multi-MCM configuration (EDP: 22.5x)

Waferscale integration can provide large performance and energy benefits
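Since EDP (energy-delay product) is the product of energy and delay, the quoted speedups and EDP gains together imply the energy savings; a quick check:

```python
# EDP = energy x delay, so energy gain = EDP gain / speedup.
configs = {
    "24-GPM WS-GPU": {"speedup": 2.97, "edp_gain": 9.3},
    "40-GPM WS-GPU": {"speedup": 5.2,  "edp_gain": 22.5},
}
for name, c in configs.items():
    energy_gain = c["edp_gain"] / c["speedup"]
    print(f"{name}: ~{energy_gain:.1f}x lower energy than the multi-MCM baseline")
# ~3.1x and ~4.3x lower energy, respectively.
```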



Page 19: Waferscale Architectures

Waferscale Graph Accelerator Prototype


Page 20: Waferscale Architectures

Graph Acceleration using Waferscale Architectures

Bandwidth beyond 50 TBps @ 10 ns/iteration
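The 50 TBps figure is simply the per-iteration memory traffic divided by the iteration time; the ~500 KB per-iteration value below is derived from the slide's two quoted numbers rather than read off the plot:

```python
# Required bandwidth = bytes moved per iteration / time per iteration.
bytes_per_iteration = 5e5    # ~500 KB of traffic per iteration (derived)
iteration_time_s = 10e-9     # 10 ns per iteration (from the slide)

bandwidth_Bps = bytes_per_iteration / iteration_time_s
print(f"required bandwidth: {bandwidth_Bps / 1e12:.0f} TB/s")   # 50 TB/s
# As the worker count grows, the memory-requirements plot below rises further,
# so the bandwidth demand quickly exceeds what conventionally packaged
# processors can deliver.
```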

[Plot: utilization with perfect scheduling vs. number of parallel workers (log2)]

[Plot: memory throughput required (bytes/iteration) vs. number of parallel workers (log2)]

Massive number of processing cores needed


Graph workloads exhibit irregular memory access pattern

Page 21: Waferscale Architectures

ARM Cortex-M3 based Prototype

(1) Scalable up to a 64x64 tile array (4096 tiles); 57,344 cores in total (see the consistency check after this list)

(2) Unified memory architecture and >100 TB/s total memory bandwidth

(3) Custom implementation of waferscale read, write and synchronization primitives.

(4) Fault-tolerant dual network scheme.

(5) GALS clocking using the waferscale network.

(6) Simple software libraries implemented for easy and intuitive programming: programmers can treat the entire wafer as a single multiprocessor

Planned tape-out: late 2019. Bring-up: Q2 2020.
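A quick consistency check on items (1) and (2); the cores-per-tile and per-tile bandwidth figures are derived here and not stated on the slide:

```python
# Consistency check on the prototype's headline numbers.
tiles = 64 * 64                 # 64x64 tile array -> 4096 tiles
total_cores = 57_344            # from the slide
print(f"cores per tile     : {total_cores / tiles:.0f}")    # 14

total_bw_TBps = 100             # ">100 TB/s total memory bandwidth"
per_tile_GBps = total_bw_TBps * 1e12 / tiles / 1e9
print(f"bandwidth per tile : ~{per_tile_GBps:.0f} GB/s")    # ~24 GB/s
```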

Compute die

Memory die


Page 22: Waferscale Architectures

Summary and Conclusion

• Communication between packaged processors is a major bottleneck
• Si-IF technology enables waferscale integration at high yield
• Waferscale GPU versus multi-MCM system: 5.2x performance improvement, 22.5x EDP improvement
• Graph applications are a good fit for waferscale processing
• ARM Cortex-M3 based waferscale prototype under active development


Page 23: Waferscale Architectures

Acknowledgement

• This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) through ONR grant N00014-16-1-263, UCOP through grant MRP-17-454999, and the UCLA CHIPS Consortium.

• Collaborators:
  UCLA: SivaChandra Jangam, Adeel A. Bajwa, Shi Bu, Prof. Sudhakar Pamarti
  UIUC: Matthew Tomei, Daniel Petrisko, Nick Cebry, Jinyang Liu


Page 24: Waferscale Architectures

Thank You


Page 25: Waferscale Architectures

Backup


Page 26: Waferscale Architectures

Waferscale Graph Processor Architecture

• 256 3D-stacked nodes with simple cores in the logic layer (2 TB total memory)

• Large intra-node bandwidth in the 3D memories (up to 1 TBps)

• Mesh-based inter-node network; 1.5 TBps link bandwidth using Si-IF

• Unified memory architecture: any core can directly access any memory (a mapping sketch follows below)
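The deck does not say how the unified address space is laid out across the nodes; the sketch below assumes plain high-order interleaving of the 2 TB space across 256 nodes of 8 GB each, purely as an illustration of what "any core can directly access any memory" could look like to software:

```python
# Hypothetical mapping of a flat 2 TB address space across 256 nodes.
# The interleaving scheme is an assumption, not the deck's design.
NODES = 256
TOTAL_MEMORY = 2 * 2**40              # 2 TB unified address space
NODE_MEMORY = TOTAL_MEMORY // NODES   # 8 GB per 3D-stacked node

def locate(addr: int) -> tuple[int, int]:
    """Return (node id, local byte offset) for a flat global address."""
    assert 0 <= addr < TOTAL_MEMORY
    return addr // NODE_MEMORY, addr % NODE_MEMORY

# A remote access: this word lives on node 200 and is reached over the
# Si-IF mesh without any explicit message passing by the programmer.
node, offset = locate(0x19000000000)
print(f"node {node}, offset {offset:#x}")
```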


Page 27: Waferscale Architectures

Re-imagining Waferscale Integration

Bond the dies onto the interconnect wafer

Small known-good dies

A wafer with interconnect wiring only


Page 28: Waferscale Architectures


Why Interposers or Packages Do not Scale

[Figures: cross-sections of a standard package (silicon die on package substrate, BGA solder balls 0.5-1 mm, organic PCB) and of an interposer-based package (silicon dies on a TSV interposer, package substrate, BGA balls, PCB)]

Page 29: Waferscale Architectures

Experimental Methodology

Simulator: in-house trace-based GPU simulator (validated against gem5-gpu)

Baselines: MCM-GPU, iso-GPM multi-MCM GPU integrated on PCB



Page 30: Waferscale Architectures

Graph Processing Requires a New Architecture

[Figure: spectrum labelled "Moving Away from Cores" - CPUs and GPUs, which assume memory locality]

Graph Applications Exhibit Poor Memory Locality

Page 31: Waferscale Architectures

Graph Processing Requires a New Architecture

[Figure: spectrum labelled "Moving Away from Cores" - CPUs and GPUs (require locality), accelerators (require a predictable layout and expensive pre-processing), near-memory computing / HPC clusters (require nothing); the ideal architecture provides high random-access bandwidth]