

Parallel VLSI CAD Algorithms for Energy Efficient Heterogeneous Computing Platforms
Xueqian Zhao, Zhuo Feng

Department of Electrical and Computer Engineering, Michigan Technological University, Houghton, MI

1. INTRODUCTION


2. CAPACITANCE EXTRACTION

FMMGpu algorithm flow (see panel c):
1. Compute and combine all coefficient matrices
2. Reorder and cluster panel cubes
3. Initialize GPU memory allocation
4. Perform workload balancing and SM assignment
5. Run the GMRES solver until convergence (a matrix-free sketch follows below)
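To show where the GMRES step fits, the sketch below drives SciPy's GMRES with a matrix-free LinearOperator; the dense product inside fmm_matvec merely stands in for the FMM passes, and all names are illustrative rather than the actual FMMGpu API.

```python
# Sketch: GMRES driven by a matrix-free "FMM" potential evaluation.
# The dense product stands in for the FMM passes; names are illustrative only.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(1)
n = 200
Phi = rng.random((n, n)) + n * np.eye(n)       # stand-in potential coefficient matrix

def fmm_matvec(q):
    # In FMMGpu this would run the direct (Q2P), upward (Q2M, M2M),
    # downward (M2L, L2L), and evaluation (M2P, L2P) passes on the GPU.
    return Phi @ q

A = LinearOperator((n, n), matvec=fmm_matvec)
v = np.ones(n)                                 # prescribed panel potentials (1 V)
q, info = gmres(A, v)
print("converged:", info == 0, " residual:", np.linalg.norm(Phi @ q - v))
```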

b) FMM Turns the O(n²) Cost into O(n) via Approximations

The potential of the panels in cube k splits into a near-field part and a far-field part:

$$p_k=\sum_{y_i\in\Omega_d(k)}\Phi(x_k,y_i)\,q_i+\sum_{y_i\notin\Omega_d(k)}\Phi(x_k,y_i)\,q_i$$

The first sum, over panels in the direct-interaction domain $\Omega_d(k)$, is solved directly; the second sum is solved approximately.
Multipole expansion: approximate the panel charges within a small sphere.
Local expansion: approximate the panel potentials within a small sphere.
Key steps of each FMM pass:
– Direct Pass: Q2P
– Upward Pass: Q2M, M2M
– Downward Pass: M2L, L2L
– Evaluation Pass: M2P, L2P
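A tiny numeric illustration of why the far-field sum can be approximated (not FMMGpu code, just point charges and a low-order expansion about the cluster center):

```python
# Toy check: far-field potential of a charge cluster, computed directly and via a
# monopole + dipole expansion about the cluster center (1/R kernel).
import numpy as np

rng = np.random.default_rng(0)
src = 0.05 * rng.standard_normal((50, 3))   # sources inside a small sphere near the origin
q = rng.standard_normal(50)                 # charges (arbitrary units)
target = np.array([2.0, 0.5, 0.3])          # well-separated evaluation point

direct = np.sum(q / np.linalg.norm(target - src, axis=1))   # direct O(n) evaluation

center = src.mean(axis=0)
r_vec = target - center
r = np.linalg.norm(r_vec)
monopole = q.sum() / r
dipole = np.dot(q @ (src - center), r_vec) / r**3
approx = monopole + dipole

print(f"direct = {direct:.6f}  multipole = {approx:.6f}  "
      f"rel. error = {abs(direct - approx) / abs(direct):.2e}")
```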

a) Problem Formulation
[Figure: conductor surfaces discretized into panels a, b, c, d, with a 1 V source applied.]
Potential induced at panel a by the charge q_b on panel b:

$$V_{b\to a}=\frac{q_b}{\varepsilon A_b}\int_{\text{panel }b}\frac{dA_b}{R_{ba}}$$

Summation over all panels:

$$V_a=\sum_i \frac{q_i}{\varepsilon A_i}\int_{\text{panel }i}\frac{dA_i}{R_{ia}}=\sum_i\Phi_{ia}\,q_i$$
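The formulation above leads to a dense linear system Φq = v. The toy script below builds such a system for two parallel plates, using a centroid approximation for the mutual coefficients and an analytic self term for square panels; it is an illustrative sketch under those assumptions, not the poster's extraction code.

```python
# Toy collocation-based capacitance extraction (assumed: square panels, uniform panel
# charge, centroid approximation for mutual terms, analytic square-panel self term).
import numpy as np

eps0 = 8.854e-12
side, n, gap = 1.0, 4, 0.2            # 1 m x 1 m plates, n x n panels each, 0.2 m apart
h = side / n                          # panel edge length

xy = np.array([(i + 0.5) * h for i in range(n)])
grid = np.array([(x, y) for x in xy for y in xy])
cent = np.vstack([np.c_[grid, np.zeros(n * n)],        # conductor 0 at z = 0 (0 V)
                  np.c_[grid, np.full(n * n, gap)]])   # conductor 1 at z = gap (1 V)
m = len(cent)

# Phi[a, i]: potential at the center of panel a per unit total charge on panel i
Phi = np.empty((m, m))
for a in range(m):
    R = np.linalg.norm(cent[a] - cent, axis=1)
    Phi[a] = 1.0 / (4 * np.pi * eps0 * np.where(R > 0, R, 1.0))
    Phi[a, a] = 4 * h * np.log(1 + np.sqrt(2)) / (4 * np.pi * eps0 * h * h)

v = np.concatenate([np.zeros(n * n), np.ones(n * n)])  # panel potentials (0 V / 1 V)
q = np.linalg.solve(Phi, v)                            # panel charges
print(f"charge on the 1 V conductor: {q[n * n:].sum():.3e} C "
      f"(ideal parallel-plate C = {eps0 * side**2 / gap:.3e} F)")
```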

c) FMMGpu Data Structure and Algorithm Flow
The potentials of the panels in an observer cube k are evaluated from the direct charges of a nearby source cube i and the multipole expansions of a well-separated source cube j:

$$p_k=\Phi^{(k,i)}_{Q2P}\,q_i+\Phi^{(k,j)}_{M2P}\,m_j=\begin{bmatrix}\Phi^{(k,i)}_{Q2P}&\Phi^{(k,j)}_{M2P}\end{bmatrix}\begin{bmatrix}q_i\\ m_j\end{bmatrix}$$

where
– $p_k\in\mathbb{R}^{n\times 1}$: panel potentials in cube k
– $q_i\in\mathbb{R}^{m\times 1}$: direct charges of source cube i
– $m_j\in\mathbb{R}^{s\times 1}$: multipole expansions of source cube j
– $\Phi^{(k,i)}_{Q2P}\in\mathbb{R}^{n\times m}$: Q2P coefficient matrix
– $\Phi^{(k,j)}_{M2P}\in\mathbb{R}^{n\times s}$: M2P coefficient matrix

For GPU evaluation, the per-cube blocks are merged into mixed coefficient matrices $\Phi^{mix}_{Q2P}$ and $\Phi^{mix}_{M2P}$, combined (⊗) with the Q2P and M2P index matrices $I^{(k,i)}_{Q2P}\in\mathbb{R}^{n\times m}$ and $I^{(k,j)}_{M2P}\in\mathbb{R}^{n\times s}$ that gather the required entries from a single global vector

$$Vec_{global}=\begin{bmatrix}q_1^T\cdots q_x^T & m_1^T\cdots m_y^T & l_1^T\cdots l_z^T\end{bmatrix}^T$$
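The concatenated form is what makes the evaluation GPU-friendly: many small per-cube products collapse into one larger product over the global vector. A quick NumPy check of that identity, with random stand-in matrices and made-up sizes, is shown below.

```python
# Check that [Phi_Q2P | Phi_M2P] [q; m] equals Phi_Q2P q + Phi_M2P m.
# Sizes and matrices are random stand-ins, not real FMM coefficients.
import numpy as np

n, m, s = 6, 4, 9                       # panels in cube k, charges in cube i, multipole terms in cube j
rng = np.random.default_rng(0)
Phi_q2p = rng.random((n, m))            # Q2P coefficient matrix
Phi_m2p = rng.random((n, s))            # M2P coefficient matrix
q, mom = rng.random(m), rng.random(s)   # direct charges and multipole expansions

p_separate = Phi_q2p @ q + Phi_m2p @ mom
p_combined = np.hstack([Phi_q2p, Phi_m2p]) @ np.concatenate([q, mom])
print(np.allclose(p_separate, p_combined))   # True
```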

d) GPU Performance Modeling and Workload Balancing
[Figure: runtime vs. number of SMs, comparing ideal linear speedups (runtime T0/n on n SMs) with realistic speedups and marking a super-linear speedup range.]
[Figure: number of coefficients per cube index after ordering; the coefficient-matrix clusters are assigned 8, 3, 2, and 2 SMs, respectively.]
Small test runs are used to find the optimal SM assignment: the per-cluster scores $S_i$ are normalized by the number of SMs on the GPU,

$$N_i=\frac{S_i}{\sum_{i=1}^{p}S_i}\times N_{sm}$$
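A minimal sketch of that normalization (assuming the scores come from the small-test runs and that every cluster gets at least one SM; the rounding scheme here is an illustrative choice, not necessarily the paper's):

```python
# Proportional SM assignment: N_i = S_i / sum(S) * N_sm, rounded to whole SMs.
import numpy as np

def assign_sms(scores, n_sm):
    """Distribute n_sm streaming multiprocessors across clusters proportionally to scores."""
    scores = np.asarray(scores, dtype=float)
    raw = scores / scores.sum() * n_sm
    alloc = np.maximum(np.floor(raw).astype(int), 1)   # assumed: at least one SM per cluster
    # Hand out any remaining SMs to the clusters with the largest fractional remainders
    while alloc.sum() < n_sm:
        alloc[np.argmax(raw - alloc)] += 1
    return alloc

print(assign_sms([8.1, 3.2, 2.0, 1.7], n_sm=15))   # -> [8 3 2 2]
```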

e) Experimental Results [4]
[Figure: runtime of a single FMM iteration on CPU vs. GPU (ms) across test cases, with overall speedups of 18X–30X.]
[Figure: runtime of each kernel (Precond, Direct, Upward, Downward, Evaluation) on CPU vs. GPU (ms), with per-kernel speedups ranging from 3.5X to 45X.]

3. PERFORMANCE MODELING & OPTIMIZATION FOR ELECTRICAL & THERMAL SIMULATIONS

3D Parasitic Extraction: capacitance extraction for delay, noise, and power estimation computes the charge distributions of discretized panels.

Full-Chip Thermal Simulation: thermal simulation of 3-D dense mesh structures requires solving for many millions of unknowns.

[Figures: 3D interconnections; power delivery network.]

a) GPU Has Evolved into a Very Flexible and Powerful Processor

c) Latest NVIDIA Fermi GPU Architecture

[Figure: Fermi GPU with 16 streaming multiprocessors (SM1–SM16) sharing the GPU global memory; each SM contains an instruction L1 cache, shared memory & L1 cache, an array of scalar processors (SPs), and special function units (SFUs).]

– All processors do the same job on different data sets (SIMD)
– More transistors are devoted to data processing
– Fewer transistors are used for data caching and control flow

d) Concurrent Kernel Executions
[Figure: serial kernel execution (kernels 1–6 launched one after another, each occupying only part of the SMs) vs. concurrent kernel execution (independent kernels share the SMs, raising occupancy and shortening the total running time).]
– GPU concurrent kernel execution can improve the SM occupancy.
– Fermi GPUs allow at most 16 different kernels to execute at the same time.

[Figure: 3D-ICs with microchannels and multiple substrate layers.]

Large Power Grid Simulation: DC and transient simulations of power grids can involve hundreds of millions of nodes.

b) Heterogeneous CPU-GPU Computing Achieves Much Better Energy Efficiency

Examples: Intel Larrabee, AMD APU, NVIDIA Tegra 2.

b) Analytical GPU Runtime Performance Model (ISCA'09)
MWP (Memory Warp Parallelism): the maximum number of warps per SM that can access memory simultaneously.
CWP (Computation Warp Parallelism): the number of warps the SM processor can execute during one memory-warp waiting period, plus one.

[Figure: warp execution timelines for an "8 computation + 1 memory" example, illustrating additional delay and synchronization effects, and the contrasting case where MWP is greater than CWP.]
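To make the two regimes concrete, here is a deliberately simplified runtime estimator that only follows the MWP/CWP definitions quoted above; it is not the full ISCA'09 model, and all latency parameters are made up for illustration.

```python
# Toy per-SM runtime estimate following the MWP/CWP definitions above.
# NOT the full ISCA'09 model; parameters below are illustrative only.
def estimate_cycles(n_warps, comp_cycles, mem_latency, max_mem_warps):
    """n_warps: active warps per SM; comp_cycles: computation cycles per warp between
    memory accesses; mem_latency: memory waiting period (cycles); max_mem_warps:
    warps that can overlap memory access (bandwidth limit)."""
    mwp = min(max_mem_warps, n_warps)
    cwp = min(mem_latency // comp_cycles + 1, n_warps)
    if mwp >= cwp:
        # Computation dominates: memory latency is hidden behind other warps' work
        return n_warps * comp_cycles + mem_latency
    # Memory dominates: warps wait for memory in groups of size MWP
    return (n_warps / mwp) * mem_latency + comp_cycles

print(estimate_cycles(n_warps=8, comp_cycles=20,  mem_latency=400, max_mem_warps=4))
print(estimate_cycles(n_warps=8, comp_cycles=200, mem_latency=400, max_mem_warps=4))
```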

c) Adaptive Performance Modeling (PPoPP'10)
Work Flow Graph (WFG): an extension of the control flow graph of a GPU kernel. Computation nodes C1…Cn and memory nodes M are connected from entry to exit; edges carry the instruction issue latency Lc, the memory loading latency Lm, and memory dependencies.
[Figure: generic WFG (entry → C1 → M → … → Cn → exit) and the WFG of the multigrid smoother's inner loop, with global, shared, and texture memory nodes (Mg, Ms, Mt) for the CPU and GPU versions.]
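As a rough sketch (not the PPoPP'10 tool), a WFG can be stored as a small DAG whose nodes carry either an instruction-issue latency Lc or a memory-loading latency Lm; a longest-path traversal then gives a naive latency estimate for one pass through the kernel body. Node names and latencies below are made up.

```python
# Toy WFG: computation nodes carry Lc, memory nodes carry Lm; edges encode issue
# order and memory dependencies. Names and latencies are illustrative only.
from functools import lru_cache

LAT_C, LAT_M = 4, 400                             # assumed Lc and Lm (cycles)
nodes = {'entry': 0, 'C1': LAT_C, 'M1': LAT_M, 'C2': LAT_C, 'exit': 0}
edges = {'entry': ['C1'], 'C1': ['M1', 'C2'], 'C2': ['M1'], 'M1': ['exit'], 'exit': []}

@lru_cache(maxsize=None)
def longest_path(node):
    """Critical-path latency from `node` to the WFG exit."""
    return nodes[node] + max((longest_path(n) for n in edges[node]), default=0)

print(longest_path('entry'))                      # naive latency estimate for one pass
```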

4. REFERENCES

[1] Z. Feng, X. Zhao, and Z. Zeng. Robust parallel preconditioned power grid simulation on GPU with adaptive runtime performance modeling and optimization. IEEE TCAD, 2011.
[2] Z. Feng and Z. Zeng. Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis. In Proc. IEEE/ACM DAC, 2010.
[3] Z. Feng and P. Li. Fast thermal analysis on GPU for 3D-ICs with integrated microchannel cooling. In Proc. IEEE/ACM ICCAD, 2010.
[4] X. Zhao and Z. Feng. Fast multipole method on GPU: Tackling 3-D capacitance extraction on massively parallel SIMD platforms. In Proc. IEEE/ACM DAC, 2011.

[Flowchart: FMMGpu iteration (Section 2). Panel charges are initialized on the CPU; each iteration then runs the precondition pass, direct pass (Q2P), upward pass (Q2M, M2M), downward pass (M2L, L2L), and evaluation pass (M2P, L2P) on the GPU, computes the residual, and repeats until convergence.]

d) Experimental Results
Chip-level Electrical [1,2] and Thermal Modeling & Simulation [3]:
[Figure: worst-case voltage waveform prediction, Cholmod vs. GPU solver, with a zoomed view around t ≈ 2.3 ns; the GPU solver achieves 22X speedups.]
[Figure: IC temperature prediction showing the full-chip temperature distribution (roughly 50–90 °C).]
Performance Modeling & Optimization [1]:
[Figure: worst-case results with 25X speedups; the optimal simulation configurations are identified automatically.]

a) Multigrid Preconditioned Krylov Subspace Iterative Solver

[Figure: 3D multi-layer irregular electrical/thermal grid with VDD pads; the original grid resides in host (CPU) memory and its GPU-friendly representation in GPU global memory.]

DC: $Gx=b$
TR: $Gx(t)+C\dfrac{dx(t)}{dt}=b(t)$
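The DC and transient formulations differ only in the extra C dx/dt term. As a minimal sketch (assuming a backward-Euler discretization with a fixed step h, which the poster does not specify), each transient step reduces to a DC-like solve:

```python
# Minimal sketch: backward-Euler time stepping for G x(t) + C dx/dt = b(t).
# Assumed: fixed step h and a tiny hand-made system; not the poster's GPU solver.
import numpy as np

G = np.array([[2.0, -1.0], [-1.0, 2.0]])     # conductance matrix
C = np.diag([1e-3, 2e-3])                    # capacitance matrix
b = lambda t: np.array([1.0, 0.0])           # excitation (constant here)

h, x = 1e-4, np.zeros(2)
A = G + C / h                                # same matrix every step -> factor once
for step in range(1, 6):
    t = step * h
    x = np.linalg.solve(A, b(t) + C @ x / h) # (G + C/h) x_{n+1} = b(t_{n+1}) + (C/h) x_n
    print(f"t = {t:.1e} s, x = {x}")
```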

Jacobi Smoother using Sparse Matrix

[Figure: sparsity pattern of an example 8×8 matrix (nonzeros a1,1, a1,4, a1,5; a2,2, a2,3, a2,6; a3,2, a3,3, a3,8; a4,1, a4,4; a5,1, a5,5, a5,7; a6,2, a6,6, a6,8; a7,5, a7,7; a8,3, a8,6, a8,8) stored in a GPU-friendly sparse format.]
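A minimal weighted-Jacobi smoother on a CSR sparse matrix is sketched below; this is the textbook iteration, not the GMD smoother of [1,2], and the matrix is a 1-D Poisson-like stand-in rather than a 3-D irregular grid.

```python
# Minimal sketch of a weighted Jacobi smoother on a sparse matrix (not the GMD code).
import numpy as np
import scipy.sparse as sp

def jacobi_smooth(A, b, x, sweeps=3, omega=2.0 / 3.0):
    """x <- x + omega * D^{-1} (b - A x), repeated `sweeps` times."""
    d_inv = 1.0 / A.diagonal()
    for _ in range(sweeps):
        x = x + omega * d_inv * (b - A @ x)
    return x

n = 64
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = jacobi_smooth(A, b, np.zeros(n))
print("residual norm after 3 sweeps:", np.linalg.norm(b - A @ x))
```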

Geometric Multigrid Solver (GMD): combines matrix and geometric representations to form a GPU-friendly multi-level iterative algorithm.

MGPCG Algorithm on GPU (a sketch follows below):
1. Set the initial solution
2. Compute the initial residual and search direction
3. Update the solution and residual
4. Apply multigrid preconditioning
5. Check convergence: if not converged, update the search direction and return to step 3; otherwise return the final solution
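The flow above is standard preconditioned CG with a multigrid V-cycle as the preconditioner. The sketch below follows that flow but, for brevity, substitutes a simple diagonal (Jacobi) preconditioner where the multigrid preconditioning step would go; the matrix and sizes are toy stand-ins rather than the MGPCG implementation of [1,2].

```python
# Sketch of the MGPCG flow above, with a diagonal preconditioner standing in
# for the multigrid V-cycle (illustrative only, not the implementation in [1,2]).
import numpy as np
import scipy.sparse as sp

def pcg(A, b, precond, tol=1e-8, max_iter=200):
    x = np.zeros_like(b)                    # set initial solution
    r = b - A @ x                           # initial residual
    z = precond(r)                          # (multigrid) preconditioning
    p = z.copy()                            # initial search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p                      # update solution
        r -= alpha * Ap                     # update residual
        if np.linalg.norm(r) < tol:         # check convergence
            break
        z = precond(r)                      # preconditioning
        rz_new = r @ z
        p = z + (rz_new / rz) * p           # update search direction
        rz = rz_new
    return x

n = 128
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
d_inv = 1.0 / A.diagonal()
x = pcg(A, b, precond=lambda r: d_inv * r)
print("final residual norm:", np.linalg.norm(b - A @ x))
```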

Memory bandwidth: 180 GB/s (GPU) vs. 30 GB/s (CPU); peak throughput: 1350 GFLOPS (GPU) vs. 200 GFLOPS (CPU).

– Achieve highly energy-efficient computing and CPU-GPU communications
– Need to reinvent CAD algorithms: circuit simulations, interconnect modeling, performance optimizations, …