
Large-Scale Nonlinear Device-Level Power Electronic Circuit Simulation on GPUs¹

Venkata Dinavahi

University of Alberta, Edmonton, Alberta, Canada.

Aug. 2019

¹ S. Yan, Z. Zhou, V. Dinavahi, “Large-Scale Nonlinear Device-Level Power Electronic Circuit Simulation on Massively Parallel Graphics Processing Architectures,” IEEE Trans. on Power Electronics, vol. 33, no. 6, pp. 4660-4678, June 2018.

AMPS Panel Session IEEE PES GM, Atlanta, 2019.


Outline

1 Introduction

2 GPU Specs

3 Massive-Thread Parallel Modules for Power Electronic Circuit Simulation

4 Case Studies and Data Analysis

5 Conclusion


Introduction

Power Electronic Circuit Simulation

Computational speed is an overriding concern in large-scale power electronic circuit simulation using device-level circuit simulators.

Modeling complex systems composed of power electronic subsystems, such as HVDC grids, transportation systems, renewable energy systems, and microgrids, can be very challenging.

A compromise must be made between system size, modeling complexity, and simulation duration to obtain a reasonable execution time for the application.


Introduction: Applications

Computational bottlenecks arise due to the repeated nonlinear transient simulations required for:

robust design of power electronic systems, involving optimization to fine-tune parameters at the circuit and component levels, which needs several design iterations and hundreds of simulation runs;

simulation-based statistical and sensitivity analysis to reduce design costs and time;

comprehensive fault and reliability analysis of systems using a matrix of faults representing possible device/component failures;

visualization and analysis of large sets of post-simulation results to extract performance indices;

detailed modeling of complex mixed-signal and multi-domain subsystems with widely different time constants, e.g., electronic, electrical, magnetic, thermal, and hydraulic systems.


Introduction: Simulation Tools

A variety of device-level simulators exist, such as SaberRD, Orcad, PSIM, PLECS, LTSpice, PECS, PETS, DesignLab, etc.

Plethora of models for semiconductor devices such as diodes, BJTs, JFETs, MOSFETs, IGBTs, and basic circuit components.

Assortment of studies such as nonlinear dc, transient, linear ac (small-signal), and Monte Carlo analysis; tools offer native mixed-signal capabilities or provide co-simulation interfaces.

Common attribute: single-thread sequential program execution on the CPU; some distributed processing is possible on multiple CPU cores.

Model order reduction techniques for improving execution speed include circuit size reduction and the use of averaged or linearized models for the switching devices to observe only the system-level behavior.

Evaluating comprehensive system behavior entails maintaining models of varying size and complexity on different simulation tools.


Streaming Multiprocessors of GPUs

The attained task parallelism is coarse-grained at best. The resulting computational efficiency of device-level circuit simulators was primarily derived from an increase in CPU clock speed, which until the mid-2000s could be relied upon to provide the necessary acceleration. However, computer chip manufacturers no longer rely on clock speed for higher processing power; they have implemented multicore CPU and many-core GPU architectures to increase chip performance. While multiple-thread concepts on CPUs, such as hyperthreading, were introduced early on, they were hardly taken advantage of by circuit simulators, mainly due to the cumbersome task of rewriting the program code to enable multiple threads of execution.

Model order reduction is the usual course for improving computational speed in device-level circuit simulators. Model simplifications include circuit size reduction and using averaged or linearized models for the switching devices to observe only the system-level behavior. However, evaluating the system's comprehensive behavior entails maintaining many different models of varying size and complexity on different simulation tools. It would be far more effective if the same simulation tool could efficiently show both the system-level and device-level results over long time frames. Taking the IGBT as an example, there are several models ranging from detailed physics-based models to ideal switch models. Hefner proposed the first complete analytical physics-based model available for a device-level circuit simulator, which is implemented in SaberRD [9], [10]. As the most common voltage source converter for HVdc applications, the modular multilevel converter (MMC) is normally simulated using simplified IGBT and diode models because of its complex structure consisting of a series of submodules (SMs); device-level details of the IGBT and diode cannot be observed using simplified system-level models of the MMC.

Since the GPU offers a massively parallel architecture composed of thousands of cores grouped into streaming multiprocessors, it has higher compute power and throughput in floating-point calculations than traditional CPUs, even though its clock frequency is lower, as compared in Table III. The attained data parallelism is fine-grained and conforms to the single instruction, multiple data (SIMD) format; therefore, to fully exploit GPU acceleration, the device models and the numerical algorithms have to be rewritten in the SIMD format [11]. Mature application programming interfaces are available for SIMD abstraction, such as CUDA, DirectCompute, and OpenCL. The user develops C/C++ code interlaced with special constructs and functions to access the parallel cores and distributed memories on the GPU. Furthermore, optimized numerical libraries such as CUBLAS [12] and CUFFT [13] are available for linear solvers and data processing. GPU-based massively parallel processing has been used worldwide for myriad applications, such as computational fluid dynamics, life sciences, medical imaging, game physics, and seismic simulations, and impressive acceleration has been reported [14]-[17]. For power system computation, GPUs have been used for various studies, such as transient stability simulation [18], electromagnetic transient simulation [19], dynamic state estimation [20], [21], ionized field computation [22], and power flow calculation [23].
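As a concrete illustration of the SIMD mapping described above, the following is a minimal CUDA sketch, not taken from the paper's code, in which one thread evaluates one device's nonlinear current; the kernel name, array names, and parameter values are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// One thread evaluates one device's nonlinear current: every thread runs the
// same instructions (SIMD) on its own device's data.
__global__ void diodeCurrentKernel(const double* vE, double* iD,
                                   double Is, double Vt, int nDevices)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // global device index
    if (k < nDevices)
        iD[k] = Is * (exp(vE[k] / Vt) - 1.0);        // same formula, private data
}

int main()
{
    const int n = 1 << 20;                           // e.g. ~1e6 device models
    double *vE_d, *iD_d;
    cudaMalloc(&vE_d, n * sizeof(double));
    cudaMalloc(&iD_d, n * sizeof(double));
    // ... host would fill vE and copy it over with cudaMemcpy (omitted) ...
    int threads = 256;                               // warp-aligned block size
    int blocks  = (n + threads - 1) / threads;
    diodeCurrentKernel<<<blocks, threads>>>(vE_d, iD_d, 1e-12, 0.026, n);
    cudaDeviceSynchronize();
    cudaFree(vE_d);
    cudaFree(iD_d);
    return 0;
}
```

The device models discussed later follow this pattern: each physical device is assigned one thread, and the host orchestrates kernel launches and data movement over PCIe.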

TABLE I: GPU specifications

                                   GK110 (Kepler)   GP104 (Pascal)
Fabrication                        28 nm            16 nm
Number of single-precision cores   2880             2560
Memory bandwidth                   336 GB/s         320 GB/s
Memory size                        6 GB             8 GB
Memory type                        GDDR5            GDDR5X
Core clock                         837 MHz          1607 MHz

In this paper, a massively parallel simulation of large-scale power electronic circuits using nonlinear physics-based device-level models on the GPU architecture is proposed. The massive-thread implementations of the physics-based IGBT and power diode utilize Hefner's IGBT model and Lauritzen's diode model [24]. The numerical solution modules, including the Newton-Raphson iterative method and the matrix equation solution using partial LU decomposition, are implemented in the massive-thread framework. Based on the MMC circuit characteristic, a mathematical method is proposed to decompose the semiblock Jacobian matrix. In addition, a variable time-stepping scheme using the predictor-corrector method is adopted to increase simulation efficiency. The GPU-based massive-thread simulation is compared with the SaberRD simulator as well as a complete CPU implementation, in terms of accuracy and computational efficiency.

This paper is organized as follows. Section II briefly describes the GPU hardware architecture and CUDA abstraction for parallel computation. Section III describes the details of the developed massive-thread parallel modules for the IGBT, power diode, and the numerical algorithms. Section IV presents the case studies of the MMC with simulation results and discussion. Finally, Section V gives the conclusion of this paper.

II. BACKGROUND ON GPU ARCHITECTURE AND PROGRAMMING INTERFACE

Two GPU architectures, NVIDIA Kepler GK110 (2012) and Pascal GP104 (2016), whose specifications are given in Table I, are used to implement the massively parallel power electronic circuit simulation codes. In hybrid computational systems, the CPU and GPU cooperate as host and device, respectively. The host sends all essential instructions and data to the device through the PCIe 3.0 ×16 interface at up to 15.754 GB/s. Instructions are distributed through the GigaThread interface to each streaming multiprocessor, and data are transferred to the global memory on the GPU board. Every 32 threads in a streaming multiprocessor are grouped into one warp as an execution unit, which operates simultaneously, while other threads are parallelized by the pipeline automatically. Finally, results saved in global memory are transferred back to the host through the PCIe 3.0 bus.

Growing from NVIDIA's Fermi architecture, the Kepler architecture has 15 streaming multiprocessors with registers, caches, and shared memory, as shown in Fig. 1(a) [25]. Each multiprocessor contains 192 CUDA cores, which is 3× that ...

[Fig. 1: Streaming multiprocessor layout. (a) Kepler. (b) Pascal.]


Nonlinear Physics-Based Power Diode Model

Physics-based Diode Model

[Figure: Physics-based power diode model. (a) Device structure (anode, P+, i-region, N+, cathode) with junction voltage vE and i-region voltage vM. (b) Nonlinear circuit with reverse-recovery current iR, junction capacitance CJ, and contact resistance RS. (c) Discretized equivalent circuit with conductances gR and gJ, equivalent current sources iReq and iJeq, resistances RM and RS, and node voltages vA(t), vin(t), vK(t).]


Nonlinear Physics-Based Power Diode Model

Nonlinear Diode Model

Reverse recovery phenomena:
\[
i_R(t) = \frac{q_E(t) - q_M(t)}{T_M}, \tag{1}
\]
\[
0 = \frac{dq_M(t)}{dt} + \frac{q_M(t)}{\tau} - \frac{q_E(t) - q_M(t)}{T_M}, \tag{2}
\]
\[
q_E(t) = I_S \tau \left( e^{v_E(t)/V_T} - 1 \right), \tag{3}
\]
Voltage drop across the i-region:
\[
v_M(t) = \frac{V_T T_M \, i(t)}{q_M(t)}. \tag{4}
\]
Voltage drop across the diode:
\[
v(t) = 2v_M(t) + 2v_E(t) + R_S \left[ i_E(t) + \frac{dq_J(t)}{dt} \right], \tag{5}
\]


Nonlinear Physics-Based Power Diode Model

Nonlinear Diode Model

Charge on the junction capacitance:
\[
q_J(t) = \int C_J(t) \, d(2v_E). \tag{6}
\]
The junction capacitance C_J(t) is given as
\[
C_J(t) =
\begin{cases}
\dfrac{C_{J0}}{\left(1 - \dfrac{2v_E(t)}{\phi_B}\right)^{m}}, & v_E < \dfrac{\phi_B}{4} \\[2ex]
\left[ 2^{m+2} m \dfrac{v_E(t)}{\phi_B} - (m-1)2^m \right] C_{J0}, & v_E \ge \dfrac{\phi_B}{4},
\end{cases} \tag{7}
\]
and the discretized i-region charge is obtained as
\[
q_M = \frac{\Delta t \, q_E(t)}{2T_M \left(1 + \frac{k_1 \Delta t}{2}\right)} + \frac{q_{hist}(t - \Delta t)}{1 + \frac{k_1 \Delta t}{2}}, \tag{8}
\]


Nonlinear Physics-Based Power Diode Model

where the history term is given as
\[
q_{hist}(t - \Delta t) = \frac{\Delta t}{2T_M} q_E(t - \Delta t) - \frac{k_1 \Delta t}{2} q_M(t - \Delta t). \tag{9}
\]
The equivalent reverse recovery current i_Req is given as
\[
i_{Req} = k_2 I_S \tau \left( e^{v_E(t)/V_T} - 1 \right) - \frac{q_{hist}(t - \Delta t)}{T_M \left(1 + \frac{k_1 \Delta t}{2}\right)} - 2v_E(t)\, g_R, \tag{10}
\]
where g_R is the dynamic conductance defined as
\[
g_R = \frac{1}{2V_T} k_2 I_S \tau \, e^{v_E(t)/V_T}. \tag{11}
\]
The equivalent junction capacitance current i_Jeq is obtained as
\[
i_{Jeq} = i_J(t) - 2v_E(t)\, g_J, \tag{12}
\]


Nonlinear Physics-Based Power Diode Model

The equivalent junction conductance g_J is given as
\[
g_J = \frac{2}{\Delta t} C_J(t). \tag{13}
\]
The discretized and linearized diode equations can then be written as
\[
\mathbf{G}_{Diode} \cdot \mathbf{V}_{Diode} = \mathbf{I}_{Diode\,eq}, \tag{14}
\]
where
\[
\mathbf{G}_{Diode} =
\begin{bmatrix}
g_R + g_J & -g_R - g_J & 0 \\
-g_R - g_J & g_R + g_J + \frac{1}{R_M + R_S} & -\frac{1}{R_M + R_S} \\
0 & -\frac{1}{R_M + R_S} & \frac{1}{R_M + R_S}
\end{bmatrix}, \tag{15}
\]
\[
\mathbf{V}_{Diode} = [v_A, \; v_{in}, \; v_K]^T, \tag{16}
\]
\[
\mathbf{I}_{Diode\,eq} = [-i_{Req} - i_{Jeq}, \; i_{Req} + i_{Jeq}, \; 0]^T. \tag{17}
\]
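To make the per-device parallelism concrete, the following is a hedged sketch of how equations (10)-(13) could be evaluated with one CUDA thread per diode; the kernel and array names are illustrative assumptions, and the host-side allocation and launch follow the pattern of the earlier CUDA sketch.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// One thread per diode: evaluate the linearized conductances and equivalent
// current sources of (10)-(13) for the companion model above.
__global__ void diodeCompanionKernel(const double* vE, const double* qHist,
                                     const double* cJ, const double* iJ,
                                     double* gR, double* iReq,
                                     double* gJ, double* iJeq,
                                     double k1, double k2, double Is, double tau,
                                     double Tm, double Vt, double dt, int nDiodes)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nDiodes) return;
    double e = k2 * Is * tau * exp(vE[k] / Vt);      // shared exponential term
    gR[k]   = e / (2.0 * Vt);                        // eq. (11)
    iReq[k] = (e - k2 * Is * tau)                    // k2*Is*tau*(exp(.) - 1)
              - qHist[k] / (Tm * (1.0 + 0.5 * k1 * dt))
              - 2.0 * vE[k] * gR[k];                 // eq. (10)
    gJ[k]   = 2.0 * cJ[k] / dt;                      // eq. (13)
    iJeq[k] = iJ[k] - 2.0 * vE[k] * gJ[k];           // eq. (12)
}
```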


Nonlinear Physics-Based Power Diode Model

Massive-Thread Parallel Implementation of Diode Model

[Figure: Massive-thread parallel implementation of the diode model. For n diodes, Kernel0 (reverse recovery and junction capacitance units RJ1...RJn, one thread per diode) reads vE(t) and qhist(t-Δt) from global memory and computes gR(t), iReq(t), gJ(t), iJeq(t), and qhist(t); Kernel1 (nonlinear solution units NS1...NSn) then solves for the node voltages v(t); the two kernels repeat until convergence, with the results written back to global memory.]


Nonlinear Physics-Based IGBT Model

Physics-based IGBT Model

                        

[Figure: Physics-based IGBT model. (a) Device structure (anode, gate, cathode; p+ / n- / p+ layers with a depletion region) showing the internal MOSFET and BJT, the capacitances Coxs, Coxd, Cgdj, Cdsj, Ccer, Cm, Cebj+Cebd, and the base resistance rb. (b) Equivalent circuit with nonlinear capacitances Cgs, Cgd, Cdsj, Cmult, Ceb, Ccer, current sources imos, imult, icss, ibss, and base resistance rb between the gate (g), cathode (c), anode (a), drain (d), emitter (e), and base (b) nodes.]


Nonlinear Physics-Based IGBT Model

IGBT nodal equation:
\[
\mathbf{G}_{IGBT} \cdot \mathbf{V}_{IGBT} = \mathbf{I}_{IGBT\,eq}, \tag{18}
\]
where
\[
\mathbf{V}_{IGBT} = [v_c \; v_g \; v_a \; v_d \; v_e]^T. \tag{19}
\]
Using the N-R method to solve the nonlinear matrix equation (18), ΔV_IGBT is obtained to update V_IGBT iteratively:
\[
\mathbf{G}_{IGBT}^{(n)} \, \Delta\mathbf{V}_{IGBT}^{(n+1)} = -\mathbf{I}_{IGBT}^{(n)}, \tag{20}
\]
where
\[
\mathbf{I}_{IGBT}^{(n)} = \mathbf{G}_{IGBT}^{(n)} \mathbf{V}_{IGBT}^{(n)} - \mathbf{I}_{IGBT\,eq}^{(n)}. \tag{21}
\]
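A minimal sketch of the residual assembly in (21) for a single IGBT, assuming a dense 5×5 row-major conductance matrix; this is an illustration, not the paper's implementation.

```cuda
// Residual assembly of (21) for one IGBT: I^(n) = G^(n) V^(n) - Ieq^(n).
// A dense 5x5 row-major layout is assumed; the solution of G^(n) dV = -I^(n)
// in (20) would follow this step.
__host__ __device__ void igbtResidual(const double G[25], const double V[5],
                                      const double Ieq[5], double I[5])
{
    for (int r = 0; r < 5; ++r) {
        double acc = -Ieq[r];
        for (int c = 0; c < 5; ++c)
            acc += G[r * 5 + c] * V[c];   // (G^(n) V^(n)) row r
        I[r] = acc;                       // -I[r] becomes the RHS of (20)
    }
}
```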


Nonlinear Physics-Based IGBT Model

[Fig. 6: Massive-thread parallel implementation of the physics-based IGBT model.]

ΔV_IGBT is obtained to update V_IGBT iteratively. Therefore, the iterative equation is given as
\[
\mathbf{G}_{IGBT}^{(n)} \, \Delta\mathbf{V}_{IGBT}^{(n+1)} = -\mathbf{I}_{IGBT}^{(n)}, \tag{42}
\]
where
\[
\mathbf{I}_{IGBT}^{(n)} = \mathbf{G}_{IGBT}^{(n)} \mathbf{V}_{IGBT}^{(n)} - \mathbf{I}_{IGBT\,eq}^{(n)}. \tag{43}
\]
3) Parallel Massive-Thread Mapping: For a system containing n IGBTs, the massive-thread parallel implementation is shown in Fig. 6 and is described in Algorithm 2. Among the 12 kernels involved in the module, Kernel0 and Kernel1 check the p-i-n junction and MOSFET junction voltage limits within successive Newton-Raphson iterations; the intermediate parameters p0, Q, and M are processed in Kernel2, Kernel3, and Kernel4; the six nonlinear capacitors C_gd, C_gs, C_dsj, C_mult, C_eb, and C_cer are updated in Kernel5 and Kernel6; with those capacitances, the equivalent conductance G_Q and parallel current source I_Qeq ...
\[
\mathbf{I}_{IGBT\,eq} =
\begin{bmatrix} i_{c\,eq} \\ i_{g\,eq} \\ i_{a\,eq} \\ i_{d\,eq} \\ i_{e\,eq} \end{bmatrix}
=
\begin{bmatrix}
i_{mult\,eq} + i_{Cmult\,eq} + i_{css\,eq} + i_{Ccer\,eq} + i_{Cgs\,eq} + i_{Cdsj\,eq} + i_{mos\,eq} \\
-i_{Cgs\,eq} + i_{Cdg\,eq} \\
-i_{T\,eq} \\
i_{Ceb\,eq} + i_{bss\,eq} - i_{mos\,eq} - i_{Cdg\,eq} - i_{Cdsj\,eq} - i_{mult\,eq} - i_{Cmult\,eq} \\
i_{css\,eq} - i_{bss\,eq} - i_{Ccer\,eq} - i_{Ceb\,eq} + i_{T\,eq}
\end{bmatrix} \tag{40}
\]
[Eq. (41): the 5×5 conductance matrix G_IGBT, assembled from the linearized conductances g_mos, g_T, g_css, g_bss, g_mult and the capacitance conductances g_Cgs, g_Cgd, g_Cdsj, g_Ceb, g_Ccer, g_Cmult over the collector, gate, anode, drain, and emitter nodes.]


Nonlinear Physics-Based IGBT Model

Discretized and Linearized IGBT Model

[Figure: Discretized and linearized IGBT equivalent circuit. Each nonlinear element of the physics-based model (MOSFET channel imos, BJT transport current iT, steady-state and quasi-saturation currents icss and ibss, avalanche multiplication imult, and the capacitances Cgs, Cgd, Cdsj, Ceb, Ccer, Cmult) is replaced by a conductance in parallel with an equivalent current source (g_x, i_x_eq) connected between the gate, cathode, anode, drain, emitter, and base nodes.]


Nonlinear Physics-Based IGBT Model

Massive-Thread Parallel Implementation of IGBT Model

[Figure: Massive-thread parallel implementation of the IGBT model. Kernel0 and Kernel1 limit the p-i-n and MOSFET junction voltages v_eb, v_bc, v_gs between Newton-Raphson iterations; Kernel2, Kernel3, and Kernel4 compute the intermediate parameters p0, Q, and M; Kernel5 and Kernel6 update the nonlinear capacitances (Cgd, Cgs, Cdsj, Ceb, Ccer, Cmult) and their charges; Kernel7 evaluates the current sources iT, ibss, icss, imos, imult; Kernel8 to Kernel11 assemble the per-device Jacobian (IGBT-AE1...IGBT-AEn), solve for ΔV^n(t), and check convergence. Charges Q, capacitor currents I_Q, the time-step size Δt, C_cer, and the node voltages V are exchanged through global memory.]


Nonlinear Physics-Based IGBT Model

Modular Multi-Level Converter (MMC) Circuit

[Figure: Three-phase modular multilevel converter (MMC) circuit. Each phase unit (a, b, c) has upper and lower arms of n series-connected submodules (SM1...SMn) with arm inductors La, Lb, Lc between the dc rails +Vdc and -Vdc. Each half-bridge submodule contains switches S1 and S2 with antiparallel diodes D1 and D2 and a capacitor Cm; the phase outputs va, vb, vc feed an R-L network (RL, LL) against the source voltages vsa, vsb, vsc, with arm currents ia1, ia2, ib1, ib2, ic1, ic2.]


Nonlinear Physics-Based IGBT Model

MMC Control System

[Figure: MMC control system. (a) Output control that generates the abc phase reference voltages for modulation from the dq voltages. (b) Capacitor-voltage averaging and balancing control using phase-shifted carrier signals.]


Behavioral IGBT Model

Behavioral IGBT model

[Figure: Behavior-based IGBT/diode submodule model. (a)-(d) The ideal IGBT and ideal diode (threshold voltages vce0, vd0, series resistances rS_IGBT, rS_d) and the capacitor companion branch (history voltage vcap_h, resistance rcap) reduce each submodule to a Thévenin circuit (TC) with equivalent source vSM and resistance rSM. (e)-(f) Kernel0 updates the Thévenin circuits TC1...TCn (one thread per SM) from the gate signal g, vcap_h(t-Δt), icap(t-Δt), and the arm current iarm(t-Δt); node-voltage units NV1...NVn then solve for iarm(t) and refresh vcap_h(t) and icap(t) in global memory.]


Behavioral IGBT Model

Massive-Thread Implementation of Behavioral IGBT Model

[Figure: the same behavior-based submodule reduction and kernel mapping as on the previous slide, emphasizing the massive-thread implementation: Kernel0 computes the per-SM Thévenin equivalents (TC1...TCn) and the node-voltage units (NV1...NVn) update the arm quantities through global memory.]
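A hedged sketch of the per-submodule Thévenin update suggested by the figure: one thread per SM collapses the ideal-switch half-bridge and the capacitor companion branch into (vSM, rSM). The complementary gating and the series/parallel arrangement assumed here are illustrative and are not taken from the paper's code.

```cuda
#include <cuda_runtime.h>

// One thread per submodule: collapse the ideal-switch half-bridge SM and the
// capacitor companion branch into a Thevenin equivalent (vSM, rSM).
__global__ void smTheveninKernel(const int* gate, const double* vCapHist,
                                 double rOn, double rOff, double rCap,
                                 double* vSM, double* rSM, int nSM)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nSM) return;
    double r1 = gate[k] ? rOn : rOff;            // upper switch S1/D1
    double r2 = gate[k] ? rOff : rOn;            // lower switch S2/D2 (complementary)
    double rSeries = r1 + rCap;                  // capacitor branch through S1
    rSM[k] = r2 * rSeries / (r2 + rSeries);      // parallel combination at terminals
    vSM[k] = vCapHist[k] * r2 / (r2 + rSeries);  // divider of the history voltage
}
```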


Numerical Solvers

Numerical Solvers

Modules

Massive-Thread Parallel Newton-Raphson Algorithm.

Novel Relaxation Based Jacobian Matrix Update.

Parallel Partial LU Decomposition.

Predictor-Corrector Variable Time-Stepping Scheme (a minimal sketch of the step-size control is given below).
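The following is a hypothetical sketch of how a predictor-corrector scheme can drive the variable time step: the explicit prediction is compared with the corrected value and the step is refined or coarsened accordingly. The tolerances, scaling factors, and limits are illustrative assumptions, not values from the paper.

```cuda
#include <algorithm>
#include <cmath>

// Hypothetical step-size control for the predictor-corrector scheme.
double nextTimeStep(double dt, double predicted, double corrected,
                    double tol = 1e-4, double dtMin = 1e-9, double dtMax = 1e-6)
{
    double err = std::fabs(corrected - predicted);   // local error estimate
    if (err > tol)
        dt *= 0.5;                                   // too large: refine the step
    else if (err < 0.1 * tol)
        dt *= 2.0;                                   // very smooth: coarsen the step
    return std::min(std::max(dt, dtMin), dtMax);     // clamp to the allowed range
}
```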


Results and Accuracy Comparison

Results

Single-phase five-level physics-based MMC simulation results comparison between GPU (top) and SaberRD (bottom).

[Fig. 17: Single-phase five-level physics-based MMC simulation results, GPU (top) vs. SaberRD (bottom). (a) Vout comparison (40-80 ms, about ±900 V); (b) zoomed Vout (57-65 ms); (c) Vout FFT: 60 Hz component 502.7 V (GPU) vs. 506.6 V (SaberRD), 2.5 kHz component 39.4 V vs. 45.3 V; (d) Iload comparison (about ±200 A); (e) zoomed Iload (57-65 ms); (f) Iload FFT: 60 Hz 155.5 A vs. 156.0 A, 2.5 kHz 3.55 A vs. 3.81 A; (g) arm capacitor voltages Vcap1-Vcap4 (460-500 V); (h) IGBT gate-on waveforms: tr = 0.19 μs, td(on) = 1.09 μs (GPU) vs. tr = 0.20 μs, td(on) = 1.12 μs (SaberRD); (i) IGBT gate-off waveforms: tf = 0.36 μs, td(off) = 3.1 μs (GPU) vs. tf = 0.35 μs, td(off) = 3.09 μs (SaberRD).]

TABLE II: Device-level switching times and power dissipation for the IGBT-diode unit

Switching time (μs)    SaberRD   GPU      Power dissipation (W)   SaberRD   GPU
t_d(on)  (IGBT)        1.12      1.09     P_on    (IGBT)          112.31    113.02
t_r      (IGBT)        0.20      0.19     P_off   (IGBT)          74.01     74.84
t_d(off) (IGBT)        3.09      3.10     P_cond  (IGBT)          287.52    289.06
t_f      (IGBT)        0.35      0.36     P_cond  (Diode)         7.48      7.45
t_rr     (Diode)       0.66      0.64     P_rr    (Diode)         9.95      10.07

Table II lists the switching times and power dissipation of the lower IGBT-diode unit in the SM, referred to as S2 and D2, respectively, in Fig. 7, during the switching and conducting states, which are calculated using
\[
P_{IGBT} = \frac{\int v_{ce}(t)\, i_c(t)\, dt}{T_s} \tag{68}
\]
\[
P_{Diode} = \frac{\int v_f(t)\, i_f(t)\, dt}{T_s} \tag{69}
\]
for the IGBT and diode, respectively, where T_s is the switching period.
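For example, (68) can be evaluated numerically from sampled waveforms with a simple trapezoidal rule; this sketch assumes uniformly sampled vce and ic arrays and is not the paper's implementation.

```cuda
#include <vector>
#include <cstddef>

// Trapezoidal evaluation of (68): average IGBT power over one switching period
// Ts from uniformly sampled vce(t) and ic(t).
double igbtAveragePower(const std::vector<double>& vce,
                        const std::vector<double>& ic,
                        double dt, double Ts)
{
    double energy = 0.0;                             // integral of vce*ic dt
    for (std::size_t k = 1; k < vce.size(); ++k) {
        double p0 = vce[k - 1] * ic[k - 1];
        double p1 = vce[k] * ic[k];
        energy += 0.5 * (p0 + p1) * dt;              // trapezoid on one interval
    }
    return energy / Ts;                              // eq. (68)
}
```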

The simulation results of the three-phase 11-level MMC circuit with nonlinear physics-based models, including output voltages and load currents, are shown in Fig. 18(a), and are compared with the simulation results of the same MMC circuit with behavior-based models, shown in Fig. 18(b). The waveforms of the physics-based model have much more detail than those from the behavior-based model because of the nonlinear capacitances and dynamic current sources in the former model. When the MMC ...


Results and Accuracy Comparison

Results

Three-phase 11-level physics-based and behavior-based MMC simulation comparison.

[Fig. 18: Three-phase 11-level MMC simulation on the GPU (60-80 ms). (a) Physics-based models: three-phase output voltages Vout (about ±2 kV), load currents Ia, Ib, Ic (about ±100 A), and upper/lower-arm capacitor voltages Vcap (380-420 V). (b) Behavior-based models: the same quantities over the same window.]


Results and Accuracy Comparison

Results

201-level behavior-based MMC output voltages, load currents, and arm capacitor voltages.

[Figure: 201-level behavior-based MMC results (50-66 ms). (a) Three-phase output voltages Vout (about ±40 kV); (b) load currents (about ±1000 A); (c) upper/lower-arm capacitor voltages (roughly 240-480 V); (d)-(f) zoomed views of Vout, Iload, and Vcap resolving the individual voltage steps.]


Execution Time and Speedup Comparison

Results

Single-phase nonlinear physics-based MMC circuit execution time and speedup comparison.

TABLE IV: SaberRD, CPU, and GPU execution times for the one-phase nonlinear physics-based MMC

N_L   N_SM   SaberRD (s)   CPU (s)   GPU_K (s)   GPU_P (s)   Speedup GPU_K   Speedup GPU_P
2     2      53            50.8      143.1       70.9        0.35            0.72
3     4      100           99.9      162.4       81.6        0.62            1.22
4     6      149           143.8     175.5       87.4        0.82            1.65
5     8      202           192.5     184.7       90.2        1.04            2.13
6     10     -             240.9     195.2       94.1        1.23            2.56
7     12     -             297.3     212.7       101.3       1.40            2.93
8     14     -             351.8     227.5       107.5       1.55            3.27
9     16     -             411.5     237.9       111.9       1.73            3.68
10    18     -             481.5     250.2       115.6       1.92            4.17
11    20     -             557.8     268.5       121.2       2.08            4.60

Fig. 20. One-phase nonlinear physics-based MMC circuit execution time and speedup comparison.

Since the run time of the CPU program was close to that of SaberRD according to the existing numbers, the CPU results were used to calculate the speedup, replacing the incomplete SaberRD results for circuits with more than eight SMs. As shown in Fig. 20, although the speedups of the GPUs are less than 1 when the level count of the MMC circuit is low (up to 4 levels for GPU_K and 2 levels for GPU_P), they increase steadily with the circuit scale and approach 2.08 for GPU_K and 4.60 for GPU_P. Benefitting from the higher clock and memory data rate listed in Table III, GPU_P doubles the speedup of GPU_K.




Execution Time and Speedup Comparison

Results

Three-phase nonlinear physics-based MMC circuit execution times and speedup comparison.


Although some speedup is obtained in the simulations of single-phase MMC circuits, it is not representative of the full computational capability of GPUs for MMC simulation. In order to feed more data to the GPUs and realize their compute potential, three-phase physics-based MMC circuits from 2-level to 11-level are simulated for 100 ms with the variable time-stepping method; the execution times are listed in Table V. For three-phase systems, SaberRD has difficulty converging beyond the two-level MMC circuit and is therefore absent from the comparison. In addition, two Pascal GPUs (GTX 1080) are included in the comparison since there is sufficient compute load to be processed. When the three-phase MMC circuit reaches 11 levels with 60 SMs, the acceleration of the GPUs is obvious: 5.61 for GPU_K and 9.58 for GPU_P. Moreover, the 2-GPU_P platform obtains a speedup of up to 14.67, almost triple the performance of the Kepler GPU, as shown in Fig. 21.

TABLE V: CPU and GPU execution times of the three-phase nonlinear physics-based MMC circuit

N_L   N_SM   CPU (s)   GPU_K (s)   GPU_P (s)   2GPU_P (s)   Speedup GPU_K   GPU_P   2GPU_P
2     6      150.8     172.2       82.4        79.9         0.88            1.83    1.89
3     12     285.9     185.9       93.4        83.2         1.54            3.06    3.44
4     18     423.5     206.7       103.2       88.7         2.05            4.10    4.77
5     24     600.3     227.2       115.9       95.8         2.64            5.18    6.27
6     30     774.1     251.5       129.2       101.5        3.08            5.99    7.63
7     36     991.1     286.2       148.4       109.2        3.46            6.68    9.08
8     42     1249.2    302.3       164.2       117.8        4.13            7.61    10.60
9     48     1495.4    322.9       182.3       127.2        4.63            8.20    11.76
10    54     1782.4    357.2       202.5       134.4        4.99            8.80    13.26
11    60     2137.3    381.1       223.2       145.7        5.61            9.58    14.67

Fig. 21. Three-phase nonlinear physics-based MMC circuit execution times and speedup comparison.


The specific structure of the MMC circuit, consisting of upper and lower arms, simplifies the task management and the computational load balancing of the 2-GPU_P platform, and only the data of one common node need to be exchanged between the two MMC arms.




Execution Time and Speedup Comparison

Results

Three-phase behavior-based MMC circuit execution times and speedup comparison.


TABLE VI: CPU and GPU execution times of the three-phase behavior-based MMC circuit

N_L   N_SM   N_IGBT   CPU (s)   GPU_K (s)   GPU_P (s)   2GPU_P (s)   Speedup GPU_K   GPU_P   2GPU_P
11    60     120      37.2      35.5        14.1        14.3         1.05            2.64    2.60
21    120    240      67.4      35.9        14.4        14.5         1.88            4.68    4.65
41    240    480      129.2     36.2        14.5        14.6         3.57            8.91    8.85
51    300    600      156.0     36.6        14.6        14.7         4.26            10.68   10.61
81    480    960      245.8     38.2        14.9        14.9         6.43            16.50   16.50
101   600    1200     302.8     40.9        16.4        15.1         7.40            18.46   20.05
126   750    1500     376.2     41.9        16.8        15.4         8.98            22.39   24.43
151   900    1800     453.2     43.1        17.2        15.8         10.52           26.35   28.68
176   1050   2100     526.2     46.2        18.8        16.8         11.39           27.99   31.32
201   1200   2400     595.6     47.3        19.0        17.1         12.59           31.35   34.83
251   1500   3000     755.2     49.7        20.1        17.6         15.20           37.57   42.91
301   1800   3600     894.5     53.2        21.4        18.5         16.81           41.80   48.35
401   2400   4800     1188.2    56.1        23.2        19.9         21.18           51.22   59.71
501   3000   6000     1495.6    62.5        25.5        21.6         23.93           58.65   69.24


Since the Jacobian matrix is prone to singularity as the voltage level increases, the nonlinear physics-based MMC circuits become hard to converge; therefore, MMC circuits larger than 11-level are simulated with the behavior-based model, from 11-level to 501-level, for 1 s with a 10 μs fixed time-step method; the execution times and speedups are listed in Table VI.



Conclusion: Power Electronic Circuit Simulation

System-level and device-level simulations provide a different focus in power electronic circuit simulation.

Adopting nonlinear device-level models to construct large-scale converter circuits gives an opportunity to study detailed device interactions while simultaneously addressing the challenge of large-scale system solution.

The cost is a high computational burden due to the complex system model. Sequential implementation results in higher execution times as the number of converter levels grows, which makes the simulation impractical.

Massively parallel algorithms and GPU implementation of device-level models and the corresponding numerical solvers provide significant acceleration while improving convergence and accuracy.


Thank you!


IGBT Model Equations

Currents. Steady-state collector current:
\[
i_{css} = \frac{i_T}{1 + b} + \frac{4 b D_p Q}{(1 + b) W^2}, \tag{22}
\]
The anode current:
\[
i_T = \frac{v_{ae}}{r_b}. \tag{23}
\]
Base resistance:
\[
r_b =
\begin{cases}
\dfrac{W}{q \mu_n A N_B}, & v_{eb} \le 0 \\[2ex]
\dfrac{W}{q \mu_{eff} A n_{eff}}, & v_{eb} > 0,
\end{cases} \tag{24}
\]


BJT current:
\[
i_{bss} = \frac{Q}{\tau_{HL}} + \frac{4 Q^2 N_{scl}^2 \, i_{sne}}{Q_B^2 \, n_i^2}, \tag{25}
\]
MOSFET current:
\[
i_{mos} =
\begin{cases}
0, & v_{gs} < v_T \\[1ex]
K_p (v_{gs} - v_T) v_{ds} - \dfrac{K_p v_{ds}^2}{2}, & v_{ds} \le v_{gs} - v_T \\[1ex]
\dfrac{K_p (v_{gs} - v_T)^2}{2}, & v_{ds} > v_{gs} - v_T,
\end{cases} \tag{26}
\]
Avalanche multiplication current:
\[
i_{mult} = (M - 1)(i_{mos} + i_{css} + i_{c\,cer}) + M i_{gen}, \tag{27}
\]
Charges and capacitances. Gate-source charge:
\[
Q_{gs} = C_{gs} v_{gs}. \tag{28}
\]


Gate-drain capacitance:
\[
C_{gd} =
\begin{cases}
C_{oxd}, & v_{ds} \le v_{gs} - v_{Td} \\[1ex]
\dfrac{C_{gdj} C_{oxd}}{C_{gdj} + C_{oxd}}, & v_{ds} > v_{gs} - v_{Td},
\end{cases} \tag{29}
\]
Gate-drain overlap depletion capacitance:
\[
C_{gdj} = \frac{A_{gd} \, \varepsilon_{si}}{W_{gdj}}, \tag{30}
\]
Gate-drain charge:
\[
Q_{gd} =
\begin{cases}
C_{oxd} v_{dg}, & v_{ds} \le v_{gs} - v_{Td} \\[1ex]
\dfrac{q N_B \varepsilon_{si} A_{gd}^2}{C_{oxd}} \left[ \dfrac{C_{oxd} W_{gdj}}{\varepsilon_{si} A_{gd}} - \ln\!\left(1 + \dfrac{C_{oxd} W_{gdj}}{\varepsilon_{si} A_{gd}}\right) \right] - C_{oxd} v_{Td}, & v_{ds} > v_{gs} - v_{Td}.
\end{cases} \tag{31}
\]


Drain-source depletion capacitance:
\[
C_{dsj} = \frac{(A - A_{gd}) \varepsilon_{si}}{W_{dsj}}, \tag{32}
\]
Drain-source depletion charge:
\[
Q_{ds} = A_{ds} \sqrt{2 \varepsilon_{si} (v_{ds} + 0.6)\, q N_{scl}}. \tag{33}
\]
The emitter-base capacitance C_eb is solved from ∂Q_eb/∂V_eb as
\[
C_{eb} = -\frac{q N_B \varepsilon_{si} A^2}{Q - Q_{bi}}. \tag{34}
\]


Collector-emitter redistribution capacitance:
\[
C_{cer} = \frac{Q \, C_{bcj}}{3 Q_B}, \tag{35}
\]
The carrier multiplication charge and capacitance:
\[
Q_{mult} = (M - 1) Q_{ce} \quad \text{and} \quad C_{mult} = (M - 1) C_{cer}. \tag{36}
\]
Linearized companion currents at N-R iteration n+1:
\[
i_{mos}^{n+1} = i_{mos\,eq}^{n} + g_{mos\,gs}^{n} v_{gs}^{n+1} + g_{mos\,ds}^{n} v_{ds}^{n+1}, \tag{37}
\]
\[
i_{T}^{n+1} = i_{T\,eq}^{n} + g_{T\,ae}^{n} v_{ae}^{n+1} + g_{T\,bc}^{n} v_{bc}^{n+1} + g_{T\,eb}^{n} v_{eb}^{n+1}, \tag{38}
\]
\[
i_{css}^{n+1} = i_{css\,eq}^{n} + g_{css\,bc}^{n} v_{bc}^{n+1} + g_{css\,ae}^{n} v_{ae}^{n+1} + g_{css\,eb}^{n} v_{eb}^{n+1}, \tag{39}
\]
\[
i_{bss}^{n+1} = i_{bss\,eq}^{n} + g_{bss\,eb}^{n} v_{eb}^{n+1} + g_{bss\,bc}^{n} v_{bc}^{n+1}, \tag{40}
\]
\[
i_{mult}^{n+1} = i_{mult\,eq}^{n} + g_{mult\,ds}^{n} v_{ds}^{n+1} + g_{mult\,gs}^{n} v_{gs}^{n+1} + g_{mult\,ae}^{n} v_{ae}^{n+1} + g_{mult\,eb}^{n} v_{eb}^{n+1}. \tag{41}
\]


Parallel N-R

To find the solution of a nonlinear system F(X) consisting of k unknown variables, utilizing the Jacobian matrix J_F(X) for linear approximation gives the following equation:
\[
0 = F(X^n) + J_F(X^n)(X^{n+1} - X^n). \tag{42}
\]
The Jacobian matrix J_F(X) is a k × k matrix of first-order partial derivatives of F given as
\[
J_F = \frac{dF}{dX} =
\begin{bmatrix}
\dfrac{\partial F_1}{\partial X_1} & \cdots & \dfrac{\partial F_1}{\partial X_k} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F_k}{\partial X_1} & \cdots & \dfrac{\partial F_k}{\partial X_k}
\end{bmatrix}. \tag{43}
\]
Solving for the root of F(X) is numerically replaced by solving (42) for (X^{n+1} - X^n) and updating X^{n+1} from X^n.
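As a one-dimensional illustration of (42)-(43), the following runnable sketch applies Newton-Raphson to a diode-resistor node; the parameter values are illustrative.

```cuda
#include <cmath>
#include <cstdio>

// 1-D illustration of (42): Newton-Raphson for a diode in series with a
// resistor, f(v) = Is*(exp(v/Vt) - 1) + (v - Vsrc)/R = 0.
int main()
{
    const double Is = 1e-12, Vt = 0.026, R = 100.0, Vsrc = 5.0;
    double v = 0.6;                                          // initial guess
    for (int it = 0; it < 50; ++it) {
        double f  = Is * (std::exp(v / Vt) - 1.0) + (v - Vsrc) / R;
        double df = Is / Vt * std::exp(v / Vt) + 1.0 / R;    // scalar Jacobian
        double dv = -f / df;                                 // solve J*dv = -f
        v += dv;
        if (std::fabs(dv) < 1e-12) break;                    // convergence check
    }
    std::printf("diode node voltage = %.6f V\n", v);
    return 0;
}
```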


The solution process is repeated until ||X^{n+1} - X^n|| has converged. According to KCL,
\[
\Sigma I(V) = 0. \tag{44}
\]
Applying (42) to (44) results in
\[
J_{V^n}(V^{n+1} - V^n) = -\Sigma I^n. \tag{45}
\]
The voltage vector for the solution of KCL at each node is given as
\[
V^n = [v_c^{(n)} \; v_g^{(n)} \; v_a^{(n)} \; v_d^{(n)} \; v_e^{(n)}]^T. \tag{46}
\]
Applying the N-R iteration results in
\[
G_{IGBT}^{(n)}(V^{n+1} - V^n) = -I^n, \tag{47}
\]
where the conductance matrix G_IGBT^{(n)} is the Jacobian matrix J_{V^n}, and the right-hand-side vector is given as
\[
-I^n = [i_c^{(n)} \; i_g^{(n)} \; i_a^{(n)} \; i_d^{(n)} \; i_e^{(n)}]^T. \tag{48}
\]


Massive-Thread Parallel N-R

Fig. 9. Massive-thread parallel implementation of the Newton-Raphson method.

In the massive-thread parallel module for the Newton-Raphson method, four kernels are involved. Kernel0 and Kernel1 update I^n, the sum of the currents leaving the nodes, and the Jacobian (conductance) matrix G^n, using current units (CU_1 ... CU_n) and Jacobian-matrix units (JM_1 ... JM_n) held in global memory. The data are copied to shared memory, where Kernel2 solves the matrix equation for \Delta V^n by Gaussian elimination (Gaussian_1 ... Gaussian_n). Kernel3 then updates V^{n+1} with voltage units (VU_1 ... VU_n) and checks the convergence criterion \Delta V^n \le \epsilon; the iterations continue until V converges or the maximum number of iterations is reached.
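A skeletal CUDA driver for this four-kernel loop might look as follows, assuming one device per thread and five unknowns per device, and eliding the device-model evaluation and the per-device dense solve (e.g., an elimination like the one sketched earlier); all names are illustrative.

// Sketch of the Kernel0-Kernel3 iteration loop; model evaluation and the
// per-device solve are left as comments. One thread handles one device.
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 5;  // unknowns per device

__global__ void kernelCurrents(const double* V, double* I, int nDev) {
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < nDev) { /* CU_d: rebuild nodal current sums I[d*N .. d*N+4] */ }
}
__global__ void kernelJacobian(const double* V, double* G, int nDev) {
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < nDev) { /* JM_d: rebuild conductance block G[d*N*N ...] */ }
}
__global__ void kernelSolve(const double* G, const double* I, double* dV, int nDev) {
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < nDev) { /* Gaussian_d: copy to shared memory, solve G*dV = -I */ }
}
__global__ void kernelUpdate(double* V, const double* dV, double* maxDv, int nDev) {
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < nDev) {
        double m = 0.0;
        for (int i = 0; i < N; ++i) { V[d*N+i] += dV[d*N+i]; m = fmax(m, fabs(dV[d*N+i])); }
        maxDv[d] = m;  // VU_d: per-device convergence metric
    }
}

// Host-side Newton loop: launch the kernels until every device converges.
void newtonIteration(double* V, double* I, double* G, double* dV, double* maxDv,
                     int nDev, double eps, int maxIter) {
    int threads = 128, blocks = (nDev + threads - 1) / threads;
    std::vector<double> h(nDev);
    for (int it = 0; it < maxIter; ++it) {
        kernelCurrents<<<blocks, threads>>>(V, I, nDev);
        kernelJacobian<<<blocks, threads>>>(V, G, nDev);
        kernelSolve<<<blocks, threads>>>(G, I, dV, nDev);
        kernelUpdate<<<blocks, threads>>>(V, dV, maxDv, nDev);
        cudaMemcpy(h.data(), maxDv, nDev * sizeof(double), cudaMemcpyDeviceToHost);
        double worst = 0.0;
        for (double v : h) worst = worst > v ? worst : v;
        if (worst <= eps) break;  // Delta V^n <= epsilon for every device
    }
}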



Relaxation Based Jacobian Update

Although the sparse matrix G_{SM}^{orig} is irregular, it can be reshaped by eliminating the two elements G_{6,11} and G_{11,6}, brought about by the capacitance, using the relaxation algorithm. After the transformation, the linearized equation of a SM,

G_{SM}^{orig(n)} \cdot \Delta V^{n+1} = -I^{orig(n)},  (51)

is updated to

G_{SM}^{n} \cdot \Delta V^{n+1} = -I^{n},  (52)

where G_{SM} is given in (50) and I^n comes from I^{orig(n)}, whose 6th and 11th elements are adjusted by G_{6,11}\,\Delta v_{11}^{n} and G_{11,6}\,\Delta v_{6}^{n}, respectively, using known values from the previous iteration. Although an outer Gauss-Jacobi loop over the N-R iterations is required to guarantee the convergence of \Delta v^{n}, the updated G_{SM} has a better bordered-block pattern, which benefits parallel implementation on the GPU.
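A compact sketch of this relaxation step for a single sub-module is given below; the 0-based indices correspond to nodes 6 and 11 of (49)-(50), the sign convention assumes the system is solved as G_SM · ΔV = −I as in (52), and the helper name is illustrative.

// Sketch: relax the sub-module Jacobian by removing G(6,11) and G(11,6)
// and moving their contribution to the right-hand side, using the
// previous-iteration voltage increments (0-based indices 5 and 10).
constexpr int NSM = 11;  // nodes per sub-module

void relaxSubmodule(double G[NSM][NSM], double I[NSM], const double dVprev[NSM]) {
    // adjust the 6th and 11th right-hand-side entries (see the text above)
    I[5]  += G[5][10] * dVprev[10];
    I[10] += G[10][5] * dVprev[5];
    // drop the capacitive couplings so G_SM gains the bordered-block pattern
    G[5][10] = 0.0;
    G[10][5] = 0.0;
}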



MMC Sub-Module Conductance


Applying the physics-based IGBT and power diode models described above, each SM has five nodes for the IGBT and one internal node for the power diode, which results in the original sub-module Jacobian matrix G_{SM}^{orig} in (49); the relaxed matrix G_{SM} used in (52) is given in (50).

G_{SM}^{orig} =
\begin{bmatrix}
G_{1,1} & G_{1,2} & G_{1,3} & G_{1,4} & G_{1,5} & G_{1,6} & G_{1,7} & G_{1,8} & G_{1,9} & G_{1,10} & G_{1,11}\\
G_{2,1} & G_{2,2} & G_{2,3} & G_{2,4} & G_{2,5} & G_{2,6} & & & & & \\
G_{3,1} & G_{3,2} & G_{3,3} & G_{3,4} & G_{3,5} & G_{3,6} & & & & & \\
G_{4,1} & G_{4,2} & G_{4,3} & G_{4,4} & G_{4,5} & G_{4,6} & & & & & \\
G_{5,1} & G_{5,2} & G_{5,3} & G_{5,4} & G_{5,5} & G_{5,6} & & & & & \\
G_{6,1} & G_{6,2} & G_{6,3} & G_{6,4} & G_{6,5} & G_{6,6} & & & & & G_{6,11}\\
G_{7,1} & & & & & & G_{7,7} & G_{7,8} & G_{7,9} & G_{7,10} & G_{7,11}\\
G_{8,1} & & & & & & G_{8,7} & G_{8,8} & G_{8,9} & G_{8,10} & G_{8,11}\\
G_{9,1} & & & & & & G_{9,7} & G_{9,8} & G_{9,9} & G_{9,10} & G_{9,11}\\
G_{10,1} & & & & & & G_{10,7} & G_{10,8} & G_{10,9} & G_{10,10} & G_{10,11}\\
G_{11,1} & & & & & G_{11,6} & G_{11,7} & G_{11,8} & G_{11,9} & G_{11,10} & G_{11,11}
\end{bmatrix}  (49)

G_{SM} =
\begin{bmatrix}
G_{1,1} & G_{1,2} & G_{1,3} & G_{1,4} & G_{1,5} & G_{1,6} & G_{1,7} & G_{1,8} & G_{1,9} & G_{1,10} & G_{1,11}\\
G_{2,1} & G_{2,2} & G_{2,3} & G_{2,4} & G_{2,5} & G_{2,6} & & & & & \\
G_{3,1} & G_{3,2} & G_{3,3} & G_{3,4} & G_{3,5} & G_{3,6} & & & & & \\
G_{4,1} & G_{4,2} & G_{4,3} & G_{4,4} & G_{4,5} & G_{4,6} & & & & & \\
G_{5,1} & G_{5,2} & G_{5,3} & G_{5,4} & G_{5,5} & G_{5,6} & & & & & \\
G_{6,1} & G_{6,2} & G_{6,3} & G_{6,4} & G_{6,5} & G_{6,6} & & & & & \\
G_{7,1} & & & & & & G_{7,7} & G_{7,8} & G_{7,9} & G_{7,10} & G_{7,11}\\
G_{8,1} & & & & & & G_{8,7} & G_{8,8} & G_{8,9} & G_{8,10} & G_{8,11}\\
G_{9,1} & & & & & & G_{9,7} & G_{9,8} & G_{9,9} & G_{9,10} & G_{9,11}\\
G_{10,1} & & & & & & G_{10,7} & G_{10,8} & G_{10,9} & G_{10,10} & G_{10,11}\\
G_{11,1} & & & & & & G_{11,7} & G_{11,8} & G_{11,9} & G_{11,10} & G_{11,11}
\end{bmatrix}  (50)



Partial LU Decomposition for MMC

For a k-SM MMC circuit, each cascade node connects to two nonlinear IGBT-diode units. Therefore, the complete structure of the Jacobian matrix G_MMC is shown in Fig. 10(a).


Fig. 10. Jacobian matrices for the MMC circuit: (a) original G_MMC; (b) updated G*_MMC using relaxation.


The connections on Node 1 and Node 11 of every G_SM are relaxed to decouple the large Jacobian matrix. Thus, the updated Jacobian matrix G*_MMC is shown in Fig. 10(b), and the Newton-Raphson linearized equation, given as

G_{MMC}^{n} \cdot \Delta V^{n+1} = -I^{n},  (53)

is also updated as

G_{MMC}^{*(n)} \cdot \Delta V^{n+1} = -I^{*(n)},  (54)

where -I^{*(n)} is -I^n adjusted by previous-iteration values. Processing the LU decompositions for all A matrices in G*_MMC, we obtain

l_j \cdot u_j = A_j \quad (j = 1, 2, \ldots, 2k).  (55)

Then, the border vectors f_j and g_j in Fig. 11 can be found according to the following relations:

f_j \cdot u_j = c_j, \qquad l_j \cdot g_j = d_j.  (56)


Fig. 11. Partial LU decomposition for G*_MMC.


Since l_j and u_j are lower and upper triangular matrices, f_j and g_j can be solved with forward and backward substitutions. Lastly, the elements at the connecting nodes, h_i (i = 1, 2, \ldots, k) in Fig. 11, are calculated with the e_i in Fig. 10 as

h_i = e_i - f_{2i-1} \cdot g_{2i-1} - f_{2i} \cdot g_{2i}.  (57)

The proposed partial LU decomposition can be computed in parallel due to the decoupled structure of the updated Jacobian matrix G*_MMC, including the LU decomposition of each A_j, the calculation of the border vectors f_j and g_j, and the updating of the connecting-node elements h_i.
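A hedged sketch of the per-block factorization in (55)-(57) is shown below: each A_j is treated as a small dense block with border row c_j and border column d_j, and the connecting-node element h_i is updated from the resulting border vectors. The dense LU is unpivoted for brevity, and all names and layouts are illustrative.

// Sketch: partial LU step for one block A_j of G*_MMC, following (55)-(57).
// A is factored in place into L (unit lower) and U (upper); c is the border
// row, d the border column; f and g are the resulting border vectors.
#include <vector>
#include <numeric>

// In-place Doolittle LU without pivoting (sketch; blocks assumed well conditioned).
void luInPlace(std::vector<double>& A, int m) {
    for (int k = 0; k < m; ++k)
        for (int r = k + 1; r < m; ++r) {
            double f = A[r*m + k] / A[k*m + k];
            A[r*m + k] = f;                                   // store L below the diagonal
            for (int c = k + 1; c < m; ++c) A[r*m + c] -= f * A[k*m + c];
        }
}

// f * U = c  ->  solve for the border row f by forward substitution on U^T.
void solveBorderRow(const std::vector<double>& LU, int m,
                    const std::vector<double>& c, std::vector<double>& f) {
    for (int j = 0; j < m; ++j) {
        double s = c[j];
        for (int i = 0; i < j; ++i) s -= f[i] * LU[i*m + j];  // U(i,j), i < j
        f[j] = s / LU[j*m + j];
    }
}

// L * g = d  ->  solve for the border column g (L has unit diagonal).
void solveBorderCol(const std::vector<double>& LU, int m,
                    const std::vector<double>& d, std::vector<double>& g) {
    for (int i = 0; i < m; ++i) {
        double s = d[i];
        for (int j = 0; j < i; ++j) s -= LU[i*m + j] * g[j];  // L(i,j), j < i
        g[i] = s;
    }
}

// (57): connecting-node update h_i = e_i - f_{2i-1}.g_{2i-1} - f_{2i}.g_{2i}
double connectNode(double e, const std::vector<double>& f1, const std::vector<double>& g1,
                   const std::vector<double>& f2, const std::vector<double>& g2) {
    return e - std::inner_product(f1.begin(), f1.end(), g1.begin(), 0.0)
             - std::inner_product(f2.begin(), f2.end(), g2.begin(), 0.0);
}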



Blocked Forward and Backward Substitutions

After obtaining the semi-lower and upper triangular matrices using partial LU decomposition, blocked forward and backward substitutions are used to obtain the solution of the linear system. For the equation LU \cdot x = b, defining

U \cdot x = y,  (58)

we obtain

L \cdot y = b.  (59)


Fig. 12. Blocked linear MMC system solver: (a) blocked forward substitution; (b) blocked backward substitution.


The blocked forward substitution shown in Fig. 12(a) is used to solve for y in (59). All elements of y_{2i-1} and most elements of y_{2i}, except for the last one, can be solved directly; then, the last element of the previous block y_{2i-2} can be obtained with the solved elements in y_{2i-1} and y_{2i}; and so forth, the missing element of y_{2i} can be obtained from the solution of the next block. Fortunately, y_{2k} in the last block can be fully solved since it has no overlap border vector f. Because of the decoupled structure of L, the blocked forward substitution can be accomplished by GPU-based parallelism. Similarly, the solution of (58) can be obtained by the blocked backward substitution shown in Fig. 12(b) with y_j. First, all connecting elements h_i can be solved in parallel. Second, the border vector g_j can be eliminated by updating the right-hand-side vector. Finally, backward substitution is implemented on all blocked U_j in parallel.
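The per-block triangular solves at the heart of this procedure are ordinary forward and backward substitutions, sketched below; the overlap handling of each block's last element and the elimination of the border vectors g_j described above are omitted, so this is only the inner building block that each GPU thread (or thread block) would run on its own L_j and U_j.

#include <vector>

// forward substitution: L y = b, with L unit lower triangular
void forwardSub(const std::vector<double>& L, const std::vector<double>& b,
                std::vector<double>& y, int m) {
    for (int i = 0; i < m; ++i) {
        double s = b[i];
        for (int j = 0; j < i; ++j) s -= L[i*m + j] * y[j];
        y[i] = s;
    }
}

// backward substitution: U x = y, with U upper triangular
void backwardSub(const std::vector<double>& U, const std::vector<double>& y,
                 std::vector<double>& x, int m) {
    for (int i = m - 1; i >= 0; --i) {
        double s = y[i];
        for (int j = i + 1; j < m; ++j) s -= U[i*m + j] * x[j];
        x[i] = s / U[i*m + i];
    }
}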



Predictor-Corrector Variable Time-Stepping Scheme

[Flowchart: predictor-corrector variable time-stepping scheme. After initializing t, Δt, and the LTE, each step forms a Forward Euler prediction V_pred; the corrector N-R scheme computes V_corr with a Backward Euler approximation around discontinuous points and with the trapezoidal method otherwise, iterating until the error between the latest two iterations is below tolerance. The local truncation error (LTE) is then updated and checked: if it exceeds its tolerance, Δt is reduced and the step is repeated; otherwise t_last is updated and Δt is increased, with Δt reset at discontinuous points. The loop advances t until t = t_end.]
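A host-side sketch of this control loop is shown below on a toy RC circuit, with Forward Euler as predictor, the trapezoidal rule as corrector, and |V_corr − V_pred| as the LTE estimate; the real simulator replaces the toy equation with the N-R circuit solution and switches the corrector to Backward Euler right after discontinuities. All numerical values are placeholders.

// Sketch: predictor-corrector variable time-stepping on dv/dt = (Vs - v)/tau.
// Forward Euler predicts, the trapezoidal rule corrects, and the difference
// between the two serves as the local-truncation-error (LTE) estimate.
#include <cmath>
#include <cstdio>

static double f(double v) { return (1.0 - v) / 1e-3; }  // Vs = 1 V, tau = 1 ms

int main() {
    double t = 0.0, t_end = 5e-3, dt = 1e-5;
    const double dt_min = 1e-8, dt_max = 2e-4, tol = 1e-5;
    double v = 0.0;
    while (t < t_end) {
        double v_pred = v + dt * f(v);                          // predictor (Forward Euler)
        // trapezoidal corrector, iterated to convergence (fixed point here;
        // the device simulator uses the N-R scheme instead)
        double v_corr = v_pred;
        for (int it = 0; it < 20; ++it) {
            double next = v + 0.5 * dt * (f(v) + f(v_corr));
            if (std::fabs(next - v_corr) < 1e-12) { v_corr = next; break; }
            v_corr = next;
        }
        double lte = std::fabs(v_corr - v_pred);                // LTE estimate
        if (lte > tol && dt > dt_min) { dt = std::fmax(0.5 * dt, dt_min); continue; }  // reject step
        v = v_corr; t += dt;                                    // accept step
        if (lte < 0.1 * tol) dt = std::fmin(2.0 * dt, dt_max);  // enlarge dt
    }
    printf("v(%g s) = %.6f V\n", t, v);
    return 0;
}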
