

Parallel VLSI CAD Algorithms for Energy Efficient Heterogeneous Computing Platforms
Xueqian Zhao, Zhuo Feng

Department of Electrical and Computer Engineering, Michigan Technological University, Houghton, MI

1. INTRODUCTION


2. CAPACITANCE EXTRACTION

FMMGpu algorithm flow (see panel c):
1. Compute and combine all coefficient matrices
2. Reorder and cluster panel cubes
3. Initialize GPU memory allocation
4. Perform workload balancing and SM assignment
5. Run the GMRES solver until convergence (a matrix-free sketch follows below)
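To show where the GMRES step fits, the sketch below drives SciPy's GMRES with a matrix-free LinearOperator; the dense product inside fmm_matvec merely stands in for the FMM passes, and all names are illustrative rather than the actual FMMGpu API.

```python
# Sketch: GMRES driven by a matrix-free "FMM" potential evaluation.
# The dense product stands in for the FMM passes; names are illustrative only.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(1)
n = 200
Phi = rng.random((n, n)) + n * np.eye(n)       # stand-in potential coefficient matrix

def fmm_matvec(q):
    # In FMMGpu this would run the direct (Q2P), upward (Q2M, M2M),
    # downward (M2L, L2L), and evaluation (M2P, L2P) passes on the GPU.
    return Phi @ q

A = LinearOperator((n, n), matvec=fmm_matvec)
v = np.ones(n)                                 # prescribed panel potentials (1 V)
q, info = gmres(A, v)
print("converged:", info == 0, " residual:", np.linalg.norm(Phi @ q - v))
```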

b) FMM Turns the O(n²) Cost into O(n) via Approximations

The potential of the panels in cube k splits into a near-field part and a far-field part:

$$p_k=\sum_{y_i\in\Omega_d(k)}\Phi(x_k,y_i)\,q_i+\sum_{y_i\notin\Omega_d(k)}\Phi(x_k,y_i)\,q_i$$

The first sum, over panels in the direct-interaction domain $\Omega_d(k)$, is solved directly; the second sum is solved approximately.
Multipole expansion: approximate the panel charges within a small sphere.
Local expansion: approximate the panel potentials within a small sphere.
Key steps of each FMM pass:
– Direct Pass: Q2P
– Upward Pass: Q2M, M2M
– Downward Pass: M2L, L2L
– Evaluation Pass: M2P, L2P
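A tiny numeric illustration of why the far-field sum can be approximated (not FMMGpu code, just point charges and a low-order expansion about the cluster center):

```python
# Toy check: far-field potential of a charge cluster, computed directly and via a
# monopole + dipole expansion about the cluster center (1/R kernel).
import numpy as np

rng = np.random.default_rng(0)
src = 0.05 * rng.standard_normal((50, 3))   # sources inside a small sphere near the origin
q = rng.standard_normal(50)                 # charges (arbitrary units)
target = np.array([2.0, 0.5, 0.3])          # well-separated evaluation point

direct = np.sum(q / np.linalg.norm(target - src, axis=1))   # direct O(n) evaluation

center = src.mean(axis=0)
r_vec = target - center
r = np.linalg.norm(r_vec)
monopole = q.sum() / r
dipole = np.dot(q @ (src - center), r_vec) / r**3
approx = monopole + dipole

print(f"direct = {direct:.6f}  multipole = {approx:.6f}  "
      f"rel. error = {abs(direct - approx) / abs(direct):.2e}")
```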

a) Problem Formulation
[Figure: conductor surfaces discretized into panels a, b, c, d, with a 1 V source applied.]
Potential induced at panel a by the charge q_b on panel b:

$$V_{b\to a}=\frac{q_b}{\varepsilon A_b}\int_{\text{panel }b}\frac{dA_b}{R_{ba}}$$

Summation over all panels:

$$V_a=\sum_i \frac{q_i}{\varepsilon A_i}\int_{\text{panel }i}\frac{dA_i}{R_{ia}}=\sum_i\Phi_{ia}\,q_i$$
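The formulation above leads to a dense linear system Φq = v. The toy script below builds such a system for two parallel plates, using a centroid approximation for the mutual coefficients and an analytic self term for square panels; it is an illustrative sketch under those assumptions, not the poster's extraction code.

```python
# Toy collocation-based capacitance extraction (assumed: square panels, uniform panel
# charge, centroid approximation for mutual terms, analytic square-panel self term).
import numpy as np

eps0 = 8.854e-12
side, n, gap = 1.0, 4, 0.2            # 1 m x 1 m plates, n x n panels each, 0.2 m apart
h = side / n                          # panel edge length

xy = np.array([(i + 0.5) * h for i in range(n)])
grid = np.array([(x, y) for x in xy for y in xy])
cent = np.vstack([np.c_[grid, np.zeros(n * n)],        # conductor 0 at z = 0 (0 V)
                  np.c_[grid, np.full(n * n, gap)]])   # conductor 1 at z = gap (1 V)
m = len(cent)

# Phi[a, i]: potential at the center of panel a per unit total charge on panel i
Phi = np.empty((m, m))
for a in range(m):
    R = np.linalg.norm(cent[a] - cent, axis=1)
    Phi[a] = 1.0 / (4 * np.pi * eps0 * np.where(R > 0, R, 1.0))
    Phi[a, a] = 4 * h * np.log(1 + np.sqrt(2)) / (4 * np.pi * eps0 * h * h)

v = np.concatenate([np.zeros(n * n), np.ones(n * n)])  # panel potentials (0 V / 1 V)
q = np.linalg.solve(Phi, v)                            # panel charges
print(f"charge on the 1 V conductor: {q[n * n:].sum():.3e} C "
      f"(ideal parallel-plate C = {eps0 * side**2 / gap:.3e} F)")
```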

c) FMMGpu Data Structure and Algorithm Flow
The potentials of the panels in an observer cube k are evaluated from the direct charges of a nearby source cube i and the multipole expansions of a well-separated source cube j:

$$p_k=\Phi^{(k,i)}_{Q2P}\,q_i+\Phi^{(k,j)}_{M2P}\,m_j=\begin{bmatrix}\Phi^{(k,i)}_{Q2P}&\Phi^{(k,j)}_{M2P}\end{bmatrix}\begin{bmatrix}q_i\\ m_j\end{bmatrix}$$

where
– $p_k\in\mathbb{R}^{n\times 1}$: panel potentials in cube k
– $q_i\in\mathbb{R}^{m\times 1}$: direct charges of source cube i
– $m_j\in\mathbb{R}^{s\times 1}$: multipole expansions of source cube j
– $\Phi^{(k,i)}_{Q2P}\in\mathbb{R}^{n\times m}$: Q2P coefficient matrix
– $\Phi^{(k,j)}_{M2P}\in\mathbb{R}^{n\times s}$: M2P coefficient matrix

For GPU evaluation, the per-cube blocks are merged into mixed coefficient matrices $\Phi^{mix}_{Q2P}$ and $\Phi^{mix}_{M2P}$, combined (⊗) with the Q2P and M2P index matrices $I^{(k,i)}_{Q2P}\in\mathbb{R}^{n\times m}$ and $I^{(k,j)}_{M2P}\in\mathbb{R}^{n\times s}$ that gather the required entries from a single global vector

$$Vec_{global}=\begin{bmatrix}q_1^T\cdots q_x^T & m_1^T\cdots m_y^T & l_1^T\cdots l_z^T\end{bmatrix}^T$$
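The concatenated form is what makes the evaluation GPU-friendly: many small per-cube products collapse into one larger product over the global vector. A quick NumPy check of that identity, with random stand-in matrices and made-up sizes, is shown below.

```python
# Check that [Phi_Q2P | Phi_M2P] [q; m] equals Phi_Q2P q + Phi_M2P m.
# Sizes and matrices are random stand-ins, not real FMM coefficients.
import numpy as np

n, m, s = 6, 4, 9                       # panels in cube k, charges in cube i, multipole terms in cube j
rng = np.random.default_rng(0)
Phi_q2p = rng.random((n, m))            # Q2P coefficient matrix
Phi_m2p = rng.random((n, s))            # M2P coefficient matrix
q, mom = rng.random(m), rng.random(s)   # direct charges and multipole expansions

p_separate = Phi_q2p @ q + Phi_m2p @ mom
p_combined = np.hstack([Phi_q2p, Phi_m2p]) @ np.concatenate([q, mom])
print(np.allclose(p_separate, p_combined))   # True
```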

d) GPU Performance Modeling and Workload Balancing
[Figure: runtime vs. number of SMs, comparing ideal linear speedups (runtime T0/n on n SMs) with realistic speedups and marking a super-linear speedup range.]
[Figure: number of coefficients per cube index after ordering; the coefficient-matrix clusters are assigned 8, 3, 2, and 2 SMs, respectively.]
Small test runs are used to find the optimal SM assignment: the per-cluster scores $S_i$ are normalized by the number of SMs on the GPU,

$$N_i=\frac{S_i}{\sum_{i=1}^{p}S_i}\times N_{sm}$$
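A minimal sketch of that normalization (assuming the scores come from the small-test runs and that every cluster gets at least one SM; the rounding scheme here is an illustrative choice, not necessarily the paper's):

```python
# Proportional SM assignment: N_i = S_i / sum(S) * N_sm, rounded to whole SMs.
import numpy as np

def assign_sms(scores, n_sm):
    """Distribute n_sm streaming multiprocessors across clusters proportionally to scores."""
    scores = np.asarray(scores, dtype=float)
    raw = scores / scores.sum() * n_sm
    alloc = np.maximum(np.floor(raw).astype(int), 1)   # assumed: at least one SM per cluster
    # Hand out any remaining SMs to the clusters with the largest fractional remainders
    while alloc.sum() < n_sm:
        alloc[np.argmax(raw - alloc)] += 1
    return alloc

print(assign_sms([8.1, 3.2, 2.0, 1.7], n_sm=15))   # -> [8 3 2 2]
```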

e) Experimental Results [4]
[Figure: runtime of a single FMM iteration on CPU vs. GPU (ms) across test cases, with overall speedups of 18X–30X.]
[Figure: runtime of each kernel (Precond, Direct, Upward, Downward, Evaluation) on CPU vs. GPU (ms), with per-kernel speedups ranging from 3.5X to 45X.]

3. PERFORMANCE MODELING & OPTIMIZATION FOR ELECTRICAL & THERMAL SIMULATIONS

3D Parasitic Extraction: capacitance extraction for delay, noise, and power estimation computes the charge distributions of discretized panels.

Full-Chip Thermal Simulation: thermal simulation of 3-D dense mesh structures requires solving for many millions of unknowns.

[Figures: 3D interconnections; power delivery network.]

a) GPU Has Evolved into a Very Flexible and Powerful Processor

c) Latest NVIDIA Fermi GPU Architecture

[Figure: Fermi GPU with 16 streaming multiprocessors (SM1–SM16) sharing the GPU global memory; each SM contains an instruction L1 cache, shared memory & L1 cache, an array of scalar processors (SPs), and special function units (SFUs).]

– All processors do the same job on different data sets (SIMD)
– More transistors are devoted to data processing
– Fewer transistors are used for data caching and control flow

d) Concurrent Kernel Executions
[Figure: serial kernel execution (kernels 1–6 launched one after another, each occupying only part of the SMs) vs. concurrent kernel execution (independent kernels share the SMs, raising occupancy and shortening the total running time).]
– GPU concurrent kernel execution can improve the SM occupancy.
– Fermi GPUs allow at most 16 different kernels to execute at the same time.

[Figure: 3D-ICs with microchannels and multiple substrate layers.]

Large Power Grid Simulation: DC and transient simulations of power grids can involve hundreds of millions of nodes.

b) Heterogeneous CPU-GPU Computing Achieves Much Better Energy Efficiency

Examples: Intel Larrabee, AMD APU, NVIDIA Tegra 2.

b) Analytical GPU Runtime Performance Model (ISCA'09)
MWP (Memory Warp Parallelism): the maximum number of warps per SM that can access memory simultaneously.
CWP (Computation Warp Parallelism): the number of warps the SM processor can execute during one memory-warp waiting period, plus one.

[Figure: warp execution timelines for an "8 computation + 1 memory" example, illustrating additional delay and synchronization effects, and the contrasting case where MWP is greater than CWP.]
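To make the two regimes concrete, here is a deliberately simplified runtime estimator that only follows the MWP/CWP definitions quoted above; it is not the full ISCA'09 model, and all latency parameters are made up for illustration.

```python
# Toy per-SM runtime estimate following the MWP/CWP definitions above.
# NOT the full ISCA'09 model; parameters below are illustrative only.
def estimate_cycles(n_warps, comp_cycles, mem_latency, max_mem_warps):
    """n_warps: active warps per SM; comp_cycles: computation cycles per warp between
    memory accesses; mem_latency: memory waiting period (cycles); max_mem_warps:
    warps that can overlap memory access (bandwidth limit)."""
    mwp = min(max_mem_warps, n_warps)
    cwp = min(mem_latency // comp_cycles + 1, n_warps)
    if mwp >= cwp:
        # Computation dominates: memory latency is hidden behind other warps' work
        return n_warps * comp_cycles + mem_latency
    # Memory dominates: warps wait for memory in groups of size MWP
    return (n_warps / mwp) * mem_latency + comp_cycles

print(estimate_cycles(n_warps=8, comp_cycles=20,  mem_latency=400, max_mem_warps=4))
print(estimate_cycles(n_warps=8, comp_cycles=200, mem_latency=400, max_mem_warps=4))
```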

c) Adaptive Performance Modeling (PPoPP'10)
Work Flow Graph (WFG): an extension of the control flow graph of a GPU kernel. Computation nodes C1…Cn and memory nodes M are connected from entry to exit; edges carry the instruction issue latency Lc, the memory loading latency Lm, and memory dependencies.
[Figure: generic WFG (entry → C1 → M → … → Cn → exit) and the WFG of the multigrid smoother's inner loop, with global, shared, and texture memory nodes (Mg, Ms, Mt) for the CPU and GPU versions.]
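As a rough sketch (not the PPoPP'10 tool), a WFG can be stored as a small DAG whose nodes carry either an instruction-issue latency Lc or a memory-loading latency Lm; a longest-path traversal then gives a naive latency estimate for one pass through the kernel body. Node names and latencies below are made up.

```python
# Toy WFG: computation nodes carry Lc, memory nodes carry Lm; edges encode issue
# order and memory dependencies. Names and latencies are illustrative only.
from functools import lru_cache

LAT_C, LAT_M = 4, 400                             # assumed Lc and Lm (cycles)
nodes = {'entry': 0, 'C1': LAT_C, 'M1': LAT_M, 'C2': LAT_C, 'exit': 0}
edges = {'entry': ['C1'], 'C1': ['M1', 'C2'], 'C2': ['M1'], 'M1': ['exit'], 'exit': []}

@lru_cache(maxsize=None)
def longest_path(node):
    """Critical-path latency from `node` to the WFG exit."""
    return nodes[node] + max((longest_path(n) for n in edges[node]), default=0)

print(longest_path('entry'))                      # naive latency estimate for one pass
```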

4. REFERENCES

[1] Z. Feng, X. Zhao, and Z. Zeng. Robust parallel preconditioned power grid simulation on GPU with adaptive runtime performance modeling and optimization. IEEE TCAD, 2011.
[2] Z. Feng and Z. Zeng. Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis. In Proc. IEEE/ACM DAC, 2010.
[3] Z. Feng and P. Li. Fast thermal analysis on GPU for 3D-ICs with integrated microchannel cooling. In Proc. IEEE/ACM ICCAD, 2010.
[4] X. Zhao and Z. Feng. Fast multipole method on GPU: Tackling 3-D capacitance extraction on massively parallel SIMD platforms. In Proc. IEEE/ACM DAC, 2011.

[Flowchart: FMMGpu iteration (Section 2). Panel charges are initialized on the CPU; each iteration then runs the precondition pass, direct pass (Q2P), upward pass (Q2M, M2M), downward pass (M2L, L2L), and evaluation pass (M2P, L2P) on the GPU, computes the residual, and repeats until convergence.]

d) Experimental Results
Chip-level Electrical [1,2] and Thermal Modeling & Simulation [3]:
[Figure: worst-case voltage waveform prediction, Cholmod vs. GPU solver, with a zoomed view around t ≈ 2.3 ns; the GPU solver achieves 22X speedups.]
[Figure: IC temperature prediction showing the full-chip temperature distribution (roughly 50–90 °C).]
Performance Modeling & Optimization [1]:
[Figure: worst-case results with 25X speedups; the optimal simulation configurations are identified automatically.]

a) Multigrid Preconditioned Krylov Subspace Iterative Solver

[Figure: 3D multi-layer irregular electrical/thermal grid with VDD pads; the original grid resides in host (CPU) memory and its GPU-friendly representation in GPU global memory.]

DC: $Gx=b$
TR: $Gx(t)+C\dfrac{dx(t)}{dt}=b(t)$
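The DC and transient formulations differ only in the extra C dx/dt term. As a minimal sketch (assuming a backward-Euler discretization with a fixed step h, which the poster does not specify), each transient step reduces to a DC-like solve:

```python
# Minimal sketch: backward-Euler time stepping for G x(t) + C dx/dt = b(t).
# Assumed: fixed step h and a tiny hand-made system; not the poster's GPU solver.
import numpy as np

G = np.array([[2.0, -1.0], [-1.0, 2.0]])     # conductance matrix
C = np.diag([1e-3, 2e-3])                    # capacitance matrix
b = lambda t: np.array([1.0, 0.0])           # excitation (constant here)

h, x = 1e-4, np.zeros(2)
A = G + C / h                                # same matrix every step -> factor once
for step in range(1, 6):
    t = step * h
    x = np.linalg.solve(A, b(t) + C @ x / h) # (G + C/h) x_{n+1} = b(t_{n+1}) + (C/h) x_n
    print(f"t = {t:.1e} s, x = {x}")
```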

Jacobi Smoother using Sparse Matrix

[Figure: sparsity pattern of an example 8×8 matrix (nonzeros a1,1, a1,4, a1,5; a2,2, a2,3, a2,6; a3,2, a3,3, a3,8; a4,1, a4,4; a5,1, a5,5, a5,7; a6,2, a6,6, a6,8; a7,5, a7,7; a8,3, a8,6, a8,8) stored in a GPU-friendly sparse format.]
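A minimal weighted-Jacobi smoother on a CSR sparse matrix is sketched below; this is the textbook iteration, not the GMD smoother of [1,2], and the matrix is a 1-D Poisson-like stand-in rather than a 3-D irregular grid.

```python
# Minimal sketch of a weighted Jacobi smoother on a sparse matrix (not the GMD code).
import numpy as np
import scipy.sparse as sp

def jacobi_smooth(A, b, x, sweeps=3, omega=2.0 / 3.0):
    """x <- x + omega * D^{-1} (b - A x), repeated `sweeps` times."""
    d_inv = 1.0 / A.diagonal()
    for _ in range(sweeps):
        x = x + omega * d_inv * (b - A @ x)
    return x

n = 64
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = jacobi_smooth(A, b, np.zeros(n))
print("residual norm after 3 sweeps:", np.linalg.norm(b - A @ x))
```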

Geometric Multigrid Solver (GMD): combines matrix and geometric representations to form a GPU-friendly multi-level iterative algorithm.

MGPCG Algorithm on GPU (a sketch follows below):
1. Set the initial solution
2. Compute the initial residual and search direction
3. Update the solution and residual
4. Apply multigrid preconditioning
5. Check convergence: if not converged, update the search direction and return to step 3; otherwise return the final solution
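The flow above is standard preconditioned CG with a multigrid V-cycle as the preconditioner. The sketch below follows that flow but, for brevity, substitutes a simple diagonal (Jacobi) preconditioner where the multigrid preconditioning step would go; the matrix and sizes are toy stand-ins rather than the MGPCG implementation of [1,2].

```python
# Sketch of the MGPCG flow above, with a diagonal preconditioner standing in
# for the multigrid V-cycle (illustrative only, not the implementation in [1,2]).
import numpy as np
import scipy.sparse as sp

def pcg(A, b, precond, tol=1e-8, max_iter=200):
    x = np.zeros_like(b)                    # set initial solution
    r = b - A @ x                           # initial residual
    z = precond(r)                          # (multigrid) preconditioning
    p = z.copy()                            # initial search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p                      # update solution
        r -= alpha * Ap                     # update residual
        if np.linalg.norm(r) < tol:         # check convergence
            break
        z = precond(r)                      # preconditioning
        rz_new = r @ z
        p = z + (rz_new / rz) * p           # update search direction
        rz = rz_new
    return x

n = 128
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
d_inv = 1.0 / A.diagonal()
x = pcg(A, b, precond=lambda r: d_inv * r)
print("final residual norm:", np.linalg.norm(b - A @ x))
```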

Memory bandwidth: 180 GB/s (GPU) vs. 30 GB/s (CPU); peak throughput: 1350 GFLOPS (GPU) vs. 200 GFLOPS (CPU).

– Achieve highly energy-efficient computing and CPU-GPU communications
– Need to reinvent CAD algorithms: circuit simulations, interconnect modeling, performance optimizations, …