Optimization and tuning techniques of lattice QCD for Blue Gene


Page 1

Optimization and tuning techniques of lattice QCD for Blue Gene

Jun Doi, Tokyo Research Laboratory, IBM Japan

Page 2

Agenda

Part I:

– Optimization of lattice QCD program using double FPU instructions

Part II:

– Parallelization of lattice QCD and optimization of communication

Page 3

Part I:

Optimization of lattice QCD program using double FPU instructions

Page 4

Optimization of lattice QCD for Blue Gene

Our lattice QCD program

– Wilson’s method

– Original program is written in C++

Optimization using double FPU instructions

– We used inline assembly to optimize complex arithmetic

• We have to schedule the instructions ourselves instead of leaving it to the compiler

Page 5

Wilson-Dirac operator

The operator exchanges colors by multiplying each of the 4 spin components (3 colors each) by a 3x3 gauge matrix, for the 8 directions x+, x-, y+, y-, z+, z-, t+ and t-.

• ψ : 4x3 spinor
• U : 3x3 gauge matrix
• U† : Hermitian conjugate of U
• (1+γ), (1−γ) : 4x4 projector matrices

With uxp, uxm, uyp, uym, uzp, uzm, utp and utm denoting the eight directional terms, the operator is

$$D\psi(x,y,z,t) = \psi(x,y,z,t) - \kappa\left[\mathrm{uxp} + \mathrm{uxm} + \mathrm{uyp} + \mathrm{uym} + \mathrm{uzp} + \mathrm{uzm} + \mathrm{utp} + \mathrm{utm}\right](x,y,z,t)$$

where

$$\begin{aligned}
\mathrm{uxp}(x,y,z,t) &= (1-\gamma_x)\,U_x(x,y,z,t)\,\psi(x{+}1,y,z,t) &
\mathrm{uxm}(x,y,z,t) &= (1+\gamma_x)\,U_x^\dagger(x{-}1,y,z,t)\,\psi(x{-}1,y,z,t) \\
\mathrm{uyp}(x,y,z,t) &= (1-\gamma_y)\,U_y(x,y,z,t)\,\psi(x,y{+}1,z,t) &
\mathrm{uym}(x,y,z,t) &= (1+\gamma_y)\,U_y^\dagger(x,y{-}1,z,t)\,\psi(x,y{-}1,z,t) \\
\mathrm{uzp}(x,y,z,t) &= (1-\gamma_z)\,U_z(x,y,z,t)\,\psi(x,y,z{+}1,t) &
\mathrm{uzm}(x,y,z,t) &= (1+\gamma_z)\,U_z^\dagger(x,y,z{-}1,t)\,\psi(x,y,z{-}1,t) \\
\mathrm{utp}(x,y,z,t) &= (1-\gamma_t)\,U_t(x,y,z,t)\,\psi(x,y,z,t{+}1) &
\mathrm{utm}(x,y,z,t) &= (1+\gamma_t)\,U_t^\dagger(x,y,z,t{-}1)\,\psi(x,y,z,t{-}1)
\end{aligned}$$

[Figure: a 2D slice of the lattice around the site (x,y,z,t), showing its neighbors (x−1,y,z,t), (x+1,y,z,t), (x,y−1,z,t), (x,y+1,z,t) and the link matrices Ux(x,y,z,t), Ux(x−1,y,z,t), Uy(x,y,z,t), Uy(x,y−1,z,t) connecting them.]

Page 6

uxp : the part of the Wilson-Dirac operator for the x+ direction

$$\mathrm{uxp}(x) = (1-\gamma_x)\,U_x(x)\,\psi(x+1)$$

Projector:

$$1-\gamma_x = \begin{pmatrix} 1 & 0 & 0 & i \\ 0 & 1 & i & 0 \\ 0 & -i & 1 & 0 \\ -i & 0 & 0 & 1 \end{pmatrix}$$

Multiplying by this symmetric projector, we can merge the 4 spinor components into 2 to calculate uxp, in three steps:

I. Merging 4 spinors into 2:

$$C = \psi_1(x+1) + i\,\psi_4(x+1), \qquad D = \psi_2(x+1) + i\,\psi_3(x+1)$$

II. Multiplying the 2 spinors by the gauge matrix:

$$A = U_x(x)\,C, \qquad B = U_x(x)\,D$$

III. Adding to the 4 spinors:

$$\mathrm{uxp}(x) = \begin{pmatrix} A \\ B \\ -iB \\ -iA \end{pmatrix}$$
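As a scalar sketch of these three steps (my own illustration using std::complex; the real code works on R,G,B color components held in registers, as the following slides show):

#include <complex>
#include <array>

using cplx   = std::complex<double>;
using Color  = std::array<cplx, 3>;    // one spin component: 3 colors
using Spinor = std::array<Color, 4>;   // 4 spin components

// Multiply one color vector by a 3x3 gauge matrix.
static Color mul(const cplx u[3][3], const Color& x) {
    Color y{};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            y[r] += u[r][c] * x[c];
    return y;
}

// uxp(x) = (1 - gamma_x) U_x(x) psi(x+1), computed in the three steps above.
Spinor uxp(const cplx u[3][3], const Spinor& psi_xp1) {
    const cplx I(0, 1);
    Color C{}, D{};
    for (int c = 0; c < 3; ++c) {
        C[c] = psi_xp1[0][c] + I * psi_xp1[3][c];   // I.  C = psi1 + i*psi4
        D[c] = psi_xp1[1][c] + I * psi_xp1[2][c];   //     D = psi2 + i*psi3
    }
    Color A = mul(u, C), B = mul(u, D);             // II. A = U*C, B = U*D
    Spinor out{};
    for (int c = 0; c < 3; ++c) {
        out[0][c] = A[c];                           // III. result is (A, B, -iB, -iA)
        out[1][c] = B[c];
        out[2][c] = -I * B[c];
        out[3][c] = -I * A[c];
    }
    return out;
}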

Page 7

Floating point register usage for the u?p, u?m operators

– 4x3 spinor to accumulate the result = 12 registers
– 4x3 spinor of the neighboring lattice point (input) = 12 registers
– 3x3 gauge matrix = 9 registers
– So 33 registers would be needed, plus additional registers for constant values, but only 32 are available.

Actual register assignment:

– FR0 to FR11 : 12 registers to accumulate the result (kept loaded across all directions)
– FR12 to FR17 : 6 registers for the gauge matrix; the last 3 elements are loaded after the first 3 have been multiplied
– FR18 to FR29 : 12 registers for the input spinor and to save the result — merging the 4 spinors into 2 needs only 6 registers, and the other 6 save the result of the gauge multiply before adding to the 4x3 spinor
– FR30, FR31 : 2 registers for others (constants)

Page 8

Step I : Merging 4 spinors into 2 spinors

Input registers: FR18–FR20 = spinor 1 (R,G,B), FR21–FR23 = spinor 2, FR24–FR26 = spinor 3, FR27–FR29 = spinor 4.

$$C = \psi_1 + i\,\psi_4, \qquad D = \psi_2 + i\,\psi_3$$

After the merge: FR18–FR20 = C (R,G,B), FR21–FR23 = D (R,G,B).

For v = x + i * y: Re(v) = Re(x) − Im(y), Im(v) = Im(x) + Re(y)

double unit[2] = {1,1};

LFPDX 31,unit
LFPDX 18,spinor1_R
LFPDX 27,spinor4_R
FXCXNPMA 18,31,27,18

Page 9

FXCXNPMA instruction

[Figure: the double FPU — a primary FPU on registers FR0–FR31 and a secondary FPU on registers FR32–FR63, each with a fused MUL/ADD datapath taking operands A, B, C.]

For FXCXNPMA 18,31,27,18 the two pipes compute:

– primary: FR18 = FR18 − FR63 × FR59 (the secondary halves of FR31 and FR27)
– secondary: FR50 = FR50 + FR63 × FR27 (the secondary half of FR31 times the primary half of FR27)

so with the constant (1,1) in FR31/FR63 it performs the merge of step I:

double unit[2] = {1,1};

LFPDX 31,unit
LFPDX 18,spinor1_R
LFPDX 27,spinor4_R
FXCXNPMA 18,31,27,18
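As a plain-C++ model of what this instruction does (an illustrative sketch built from the register behavior shown above; the Quad type and function name are mine, not IBM's):

#include <cstdio>

// A 16-byte quad as the double FPU sees it: primary and secondary doubles.
// For complex data, primary = real part, secondary = imaginary part.
struct Quad { double p, s; };

// Scalar model of FXCXNPMA T,A,B,C:
//   primary(T)   = primary(C)   - secondary(A) * secondary(B)
//   secondary(T) = secondary(C) + secondary(A) * primary(B)
Quad fxcxnpma(Quad a, Quad b, Quad c) {
    return { c.p - a.s * b.s, c.s + a.s * b.p };
}

int main() {
    Quad unit = {1.0, 1.0};   // the {1,1} constant loaded into FR31
    Quad psi1 = {2.0, 3.0};   // spinor 1, one color component (re, im)
    Quad psi4 = {5.0, 7.0};   // spinor 4, same color component
    // Step I of the merge: C = psi1 + i*psi4
    Quad c = fxcxnpma(unit, psi4, psi1);
    printf("C = (%g, %g)\n", c.p, c.s);   // (2-7, 3+5) = (-5, 8)
    return 0;
}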

Page 10

Step II : Multiplying the spinor by the 3x3 gauge matrix (for + directions)

x: input spinor, y: output spinor, u: 3x3 gauge matrix

y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];

u[0][0] * x[0] : multiplying 2 complex numbers

FXPMUL (y[0],u[0][0],x[0])          re(y[0]) = re(u[0][0])*re(x[0]);   im(y[0]) = re(u[0][0])*im(x[0])
FXCXNPMA (y[0],u[0][0],x[0],y[0])   re(y[0]) += -im(u[0][0])*im(x[0]); im(y[0]) += im(u[0][0])*re(x[0])

+ u[0][1] * x[1] + u[0][2] * x[2] : using FMA instructions

FXCPMADD (y[0],u[0][1],x[1],y[0])
FXCXNPMA (y[0],u[0][1],x[1],y[0])
FXCPMADD (y[0],u[0][2],x[2],y[0])
FXCXNPMA (y[0],u[0][2],x[2],y[0])
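Putting scalar models of the three instructions together (same illustrative Quad type as before; the re/im behavior is taken from the annotations above) shows that the FXPMUL + FXCXNPMA pair is one complex product, and each further FXCPMADD + FXCXNPMA pair accumulates one more:

#include <complex>
#include <cstdio>

struct Quad { double p, s; };   // primary = real, secondary = imaginary

// Scalar models of the double FPU ops (my own sketch):
Quad fxpmul  (Quad a, Quad b)         { return { a.p * b.p,       a.p * b.s       }; }
Quad fxcpmadd(Quad a, Quad b, Quad c) { return { c.p + a.p * b.p, c.s + a.p * b.s }; }
Quad fxcxnpma(Quad a, Quad b, Quad c) { return { c.p - a.s * b.s, c.s + a.s * b.p }; }

int main() {
    Quad u0[3] = {{1,2},{3,4},{5,6}};   // row 0 of the gauge matrix
    Quad x [3] = {{7,8},{9,1},{2,3}};   // input spinor components
    // y[0] = u[0][0]*x[0] + u[0][1]*x[1] + u[0][2]*x[2], one row of step II
    Quad y = fxpmul  (u0[0], x[0]);
    y      = fxcxnpma(u0[0], x[0], y);
    y      = fxcpmadd(u0[1], x[1], y);
    y      = fxcxnpma(u0[1], x[1], y);
    y      = fxcpmadd(u0[2], x[2], y);
    y      = fxcxnpma(u0[2], x[2], y);
    // Check against std::complex arithmetic
    std::complex<double> ref = std::complex<double>(1,2) * std::complex<double>(7,8)
                             + std::complex<double>(3,4) * std::complex<double>(9,1)
                             + std::complex<double>(5,6) * std::complex<double>(2,3);
    printf("y   = (%g, %g)\n", y.p, y.s);
    printf("ref = (%g, %g)\n", ref.real(), ref.imag());
    return 0;
}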

Page 11

FXPMUL and FXCPMADD instructions

[Figure: the same primary/secondary FPU datapath as above.]

FXPMUL 5,3,4 computes:

– primary: FR5 = FR3 × FR4
– secondary: FR37 = FR3 × FR36 (the primary half of FR3 times the secondary half of FR4)

FXCPMADD 10,0,1,5 computes:

– primary: FR10 = FR5 + FR0 × FR1
– secondary: FR42 = FR37 + FR0 × FR33

Page 12

Multiplying by the Hermitian gauge matrix (for − directions)

x: input spinor, y: output spinor, ~u: complex conjugate of u

y[0] = ~u[0][0] * x[0] + ~u[1][0] * x[1] + ~u[2][0] * x[2];
y[1] = ~u[0][1] * x[0] + ~u[1][1] * x[1] + ~u[2][1] * x[2];
y[2] = ~u[0][2] * x[0] + ~u[1][2] * x[1] + ~u[2][2] * x[2];

~u[0][0] * x[0] : multiplying by a conjugated complex number

FXPMUL (y[0],u[0][0],x[0])          re(y[0]) = re(u[0][0])*re(x[0]);  im(y[0]) = re(u[0][0])*im(x[0])
FXCXNSMA (y[0],u[0][0],x[0],y[0])   re(y[0]) += im(u[0][0])*im(x[0]); im(y[0]) += -im(u[0][0])*re(x[0])

+ ~u[1][0] * x[1] + ~u[2][0] * x[2] : using FMA instructions

FXCPMADD (y[0],u[1][0],x[1],y[0])
FXCXNSMA (y[0],u[1][0],x[1],y[0])
FXCPMADD (y[0],u[2][0],x[2],y[0])
FXCXNSMA (y[0],u[2][0],x[2],y[0])
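The same kind of scalar sketch for the conjugated product used here (the model functions are my own, built from the re/im annotations above):

#include <complex>
#include <cstdio>

struct Quad { double p, s; };

Quad fxpmul  (Quad a, Quad b)         { return { a.p * b.p,       a.p * b.s       }; }
// FXCXNSMA: primary adds sec(A)*sec(B), secondary subtracts sec(A)*pri(B)
Quad fxcxnsma(Quad a, Quad b, Quad c) { return { c.p + a.s * b.s, c.s - a.s * b.p }; }

int main() {
    Quad u = {1, 2}, x = {3, 4};
    // ~u * x via FXPMUL + FXCXNSMA
    Quad y = fxcxnsma(u, x, fxpmul(u, x));
    std::complex<double> ref = std::conj(std::complex<double>(1,2)) * std::complex<double>(3,4);
    printf("y = (%g, %g), ref = (%g, %g)\n", y.p, y.s, ref.real(), ref.imag());  // (11, -2)
    return 0;
}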

Page 13

Optimization of the instruction pipeline

y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];

Calculation order with 1 multiplication (Y = u*X only):

FXPMUL (y[0],u[0][0],x[0])
FXPMUL (y[1],u[1][0],x[0])
FXPMUL (y[2],u[2][0],x[0])
FXCXNPMA (y[0],u[0][0],x[0],y[0])
FXCXNPMA (y[1],u[1][0],x[0],y[1])
FXCXNPMA (y[2],u[2][0],x[0],y[2])
FXCPMADD (y[0],u[0][1],x[1],y[0])
...

Each result is used again only 3 cycles after it is produced: the pipeline stalls.

Calculation order with 2 multiplications, multiplying 2 spinors Yc = u*Xa and Yd = u*Xb together:

FXPMUL (yc[0],u[0][0],xa[0])
FXPMUL (yd[0],u[0][0],xb[0])
FXPMUL (yc[1],u[1][0],xa[0])
FXPMUL (yd[1],u[1][0],xb[0])
FXPMUL (yc[2],u[2][0],xa[0])
FXPMUL (yd[2],u[2][0],xb[0])
FXCXNPMA (yc[0],u[0][0],xa[0],yc[0])
FXCXNPMA (yd[0],u[0][0],xb[0],yd[0])
FXCXNPMA (yc[1],u[1][0],xa[0],yc[1])
FXCXNPMA (yd[1],u[1][0],xb[0],yd[1])
FXCXNPMA (yc[2],u[2][0],xa[0],yc[2])
FXCXNPMA (yd[2],u[2][0],xb[0],yd[2])
FXCPMADD (yc[0],u[0][1],xa[1],yc[0])
...

Now 6 cycles pass before each result is needed again: the pipeline does not stall.
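Rendered with the scalar models from the earlier sketches, the interleaved order looks like this (my own C++ illustration; in the real kernel each call is one inline-assembly instruction, written in exactly this order so the compiler cannot undo the schedule):

struct Quad { double p, s; };
Quad fxpmul  (Quad a, Quad b)         { return { a.p * b.p,       a.p * b.s       }; }
Quad fxcpmadd(Quad a, Quad b, Quad c) { return { c.p + a.p * b.p, c.s + a.p * b.s }; }
Quad fxcxnpma(Quad a, Quad b, Quad c) { return { c.p - a.s * b.s, c.s + a.s * b.p }; }

// Two independent products yc = u*xa and yd = u*xb, interleaved so that no
// instruction needs the result of one issued shortly before it.
void mul_two_spinors(const Quad u[3][3], const Quad xa[3], const Quad xb[3],
                     Quad yc[3], Quad yd[3]) {
    // column 0: start both dependency chains
    for (int r = 0; r < 3; ++r) {
        yc[r] = fxpmul(u[r][0], xa[0]);
        yd[r] = fxpmul(u[r][0], xb[0]);
    }
    for (int r = 0; r < 3; ++r) {
        yc[r] = fxcxnpma(u[r][0], xa[0], yc[r]);
        yd[r] = fxcxnpma(u[r][0], xb[0], yd[r]);
    }
    // columns 1 and 2: accumulate with FMA pairs, still interleaved
    for (int c = 1; c < 3; ++c) {
        for (int r = 0; r < 3; ++r) {
            yc[r] = fxcpmadd(u[r][c], xa[c], yc[r]);
            yd[r] = fxcpmadd(u[r][c], xb[c], yd[r]);
        }
        for (int r = 0; r < 3; ++r) {
            yc[r] = fxcxnpma(u[r][c], xa[c], yc[r]);
            yd[r] = fxcxnpma(u[r][c], xb[c], yd[r]);
        }
    }
}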

Page 14

Loading the gauge matrix into registers

For the − directions (uxm, uym, uzm, utm operators) the calculation order

y[0] = ~u[0][0] * x[0] + ~u[1][0] * x[1] + ~u[2][0] * x[2];
y[1] = ~u[0][1] * x[0] + ~u[1][1] * x[1] + ~u[2][1] * x[2];
y[2] = ~u[0][2] * x[0] + ~u[1][2] * x[1] + ~u[2][2] * x[2];

consumes the matrix row by row, in the same order it is stored in the array, so it can be loaded sequentially into the 6 matrix registers:

lfpdux u,12 //u[0][0]
lfpdux u,13 //u[0][1]
lfpdux u,14 //u[0][2]
   y = ~u[0] * x[0]
lfpdux u,15 //u[1][0]
lfpdux u,16 //u[1][1]
lfpdux u,17 //u[1][2]
   y += ~u[1] * x[1]
lfpdux u,12 //u[2][0]
lfpdux u,13 //u[2][1]
lfpdux u,14 //u[2][2]
   y += ~u[2] * x[2]

For the + directions (uxp, uyp, uzp, utp operators) the calculation order

y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];

consumes the matrix column by column (y = u[·][0]*x[0], then y += u[·][1]*x[1], then y += u[·][2]*x[2]), which does not match the order of the matrix in the array. To still load the matrix sequentially, 2 additional temporary registers (FR30, FR31) are used and the elements are loaded into a permuted register assignment:

lfpdux u,14 //u[0][0]
lfpdux u,15 //u[0][1]
lfpdux u,12 //u[0][2]
lfpdux u,17 //u[1][0]
lfpdux u,16 //u[1][1]
lfpdux u,13 //u[1][2]
lfpdux u,30 //u[2][0]
lfpdux u,31 //u[2][1]
lfpdux u,14 //u[2][2]

(FR14 is reused for u[2][2] once column 0 has been consumed.)

Page 15

Step III: Adding the result to the 4x3 spinor

Result registers: FR24–FR26 = A (R,G,B), FR27–FR29 = B (R,G,B).
Accumulator registers: FR0–FR2 = spinor 1, FR3–FR5 = spinor 2, FR6–FR8 = spinor 3, FR9–FR11 = spinor 4.

The result (A, B, −iB, −iA) is added as: spinor1 += A, spinor2 += B, spinor3 −= iB, spinor4 −= iA.

For spinor 1 and 2: v = v + w; Re(v) = Re(v) + Re(w); Im(v) = Im(v) + Im(w)

LFPDX 0,spinor1_R
LFPDX 24,A_R
FPADD 0,0,24

For spinor 3 and 4: v = v − i * w; Re(v) = Re(v) + Im(w); Im(v) = Im(v) − Re(w) — FXCXNSMA to subtract:

double unit[2] = {1,1};

LFPDX 31,unit
LFPDX 9,spinor4_R
LFPDX 24,A_R
FXCXNSMA 9,31,24,9

Page 16

Multiplying κ into the 4x3 spinor

Instead of applying κ once, as in

$$D\psi(x,y,z,t) = \psi(x,y,z,t) - \kappa\left[\mathrm{uxp} + \mathrm{uxm} + \mathrm{uyp} + \mathrm{uym} + \mathrm{uzp} + \mathrm{uzm} + \mathrm{utp} + \mathrm{utm}\right](x,y,z,t),$$

κ is multiplied into every u?p, u?m operator:

$$D\psi(x,y,z,t) = \psi(x,y,z,t) - \kappa\,\mathrm{uxp}(x) - \kappa\,\mathrm{uxm}(x) - \kappa\,\mathrm{uyp}(x) - \kappa\,\mathrm{uym}(x) - \kappa\,\mathrm{uzp}(x) - \kappa\,\mathrm{uzm}(x) - \kappa\,\mathrm{utp}(x) - \kappa\,\mathrm{utm}(x)$$

This change increases the amount of calculation, but it does not increase the number of double FPU instructions (the sketch below shows why), and it allows the eight operators to be calculated out of order.
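To see why the instruction count is unchanged, here is a scalar sketch (the model function is mine, from the earlier sketches, and the κ value is made up): with κ loaded into the constant register, the accumulate v += κ·w is a single FXCPMADD, exactly where the plain v += w would have been a single FPADD.

#include <cstdio>

struct Quad { double p, s; };

Quad fxcpmadd(Quad a, Quad b, Quad c) { return { c.p + a.p * b.p, c.s + a.p * b.s }; }

int main() {
    double kappa = 0.125;        // illustrative value; the real kappa is a run parameter
    Quad kreg    = {kappa, kappa};  // kappa loaded into both halves of FR31
    Quad spinor1 = {1.0, 2.0};      // accumulator
    Quad A       = {4.0, 8.0};      // result of one direction's gauge multiply
    // Instead of FPADD (spinor1 += A) plus a final scaling pass over the sum,
    // one FXCPMADD does spinor1 += kappa * A for this direction:
    spinor1 = fxcpmadd(kreg, A, spinor1);
    printf("spinor1 = (%g, %g)\n", spinor1.p, spinor1.s);   // (1.5, 3.0)
    return 0;
}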

Page 17

Adding the result to the 4x3 spinor with the κ multiplication folded in

Result registers: FR24–FR26 = A (R,G,B), FR27–FR29 = B (R,G,B); accumulators FR0–FR11 as in step III. The result is now added as: spinor1 += κA, spinor2 += κB, spinor3 −= κiB, spinor4 −= κiA.

For spinor 1 and 2: v = v + κ * w; Re(v) = Re(v) + κ*Re(w); Im(v) = Im(v) + κ*Im(w)

double kappa[2] = {κ, κ};

LFPDX 31,kappa
LFPDX 0,spinor1_R
LFPDX 24,A_R
FXCPMADD 0,31,24,0

For spinor 3 and 4: v = v − κ * i * w; Re(v) = Re(v) + κ*Im(w); Im(v) = Im(v) − κ*Re(w)

LFPDX 31,kappa
LFPDX 9,spinor4_R
LFPDX 24,A_R
FXCXNSMA 9,31,24,9

For comparison, without κ (step III above): v = v + w was an FPADD, and v = v − i*w was the same FXCXNSMA but with unit[2] = {1,1} in FR31. Folding κ in only changes the constant and turns the FPADD into an FXCPMADD, so the instruction count stays the same.

Page 18

Part II:

Parallelization of lattice QCD and optimization of communication

Page 19

Parallelization of lattice QCD and optimization of communication

Decrease communication time as much as we can:

– Limit data exchange to neighboring nodes on the torus
  • Shortest paths, and the data exchanges never conflict
– Map the 4D lattice onto the torus network

MPI is richer than we need for such limited communication:

– The overhead of calling MPI functions is too big for this restricted pattern

We used the torus packet HW directly:

– Very small latency to send and receive data
– We can send/recv directly from/to the registers of the double FPU
  • We do not need a buffer in memory
– We can overlap the sends in 6 directions with local computations
  • We can hide the communication time

Page 20

Parallelization of lattice QCD

[Figure: a 2D (X,Y) lattice solved on 1 CPU, and the same lattice divided among CPUs mapped onto the physical topology of the network, to avoid conflicts in the data exchange.]

Page 21

Mapping the lattice onto the torus network of Blue Gene

How to divide the 4D lattice over the 3D torus network:

– In virtual node mode, communication between the 2 CPUs in the same compute node acts as a 4th torus dimension

We mapped the x, y, z, t axes of the 4D lattice onto the T, X, Y, Z axes of this virtual 4D torus:

– The fastest communication is between the 2 CPUs in a compute node
– x of the lattice is the innermost loop of the spinor and gauge arrays, so it is mapped to T

[Figure: the QCD lattice axes (x,y,z,t) mapped to the Blue Gene torus axes (T,X,Y,Z), with CPU0 and CPU1 of each compute node forming the T direction.]
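As a concrete reading of this mapping, a small sketch (my own illustration, not the paper's code; the names and the even division are assumptions): dividing each global coordinate by the local extent gives the torus coordinate, with x selecting the CPU within the compute node.

struct NodeCoord { int T, X, Y, Z; };   // T = CPU0/CPU1 within a compute node

// Which virtual-torus node owns global site (x,y,z,t), assuming each CPU
// holds a local block of size lx * ly * lz * lt and the axes map
// x->T, y->X, z->Y, t->Z as described above.
NodeCoord owner_of_site(int x, int y, int z, int t,
                        int lx, int ly, int lz, int lt) {
    NodeCoord n;
    n.T = x / lx;   // innermost lattice axis -> fastest link (intra-node)
    n.X = y / ly;
    n.Y = z / lz;
    n.Z = t / lt;
    return n;
}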

Page 22

Data exchange through the torus packet network

[Figure: on the sending side, CPU0/CPU1 store 16 bytes at a time in parallel from a double FPU register into the memory-mapped send FIFOs (Send FIFO X+, X−, Y+, ..., Z−); each packet is a 16-byte header (destination, size, etc.) followed by 16-byte data chunks. On the receiving side, 16-byte parallel loads move the data from the recv FIFOs (Recv FIFO X+, X−, Y+, ..., Z−) back into double FPU registers.]

– Store data into a memory-mapped FIFO to send it to a neighboring node
– The 6 FIFOs are independent and can transfer data at the same time
– The packet size is a multiple of 32 bytes, up to 256 bytes (including the 16-byte header)

Page 23

Exchanging 2 spinors between neighboring nodes: sending 2 spinors in the + direction for uym

Sender: merge the 4 spinors into 2 (C in FR18–FR20, D in FR21–FR23) and send directly from the registers, storing the data into the send FIFO (e.g. X+) as if it were part of memory.

Receiver: wait until all the data has been received in the recv FIFO buffer (1 KB) on the opposite side (X−), load it from the FIFO into FR18–FR23, multiply by the gauge matrix, and add to the 4x3 spinor.

The size of 2 spinors is 2 x 3 x 16 = 96 bytes, so 1 packet is big enough to send them.

Page 24

Exchanging 2 spinors between neighboring nodes: sending 2 spinors in the − direction for uyp

For uyp the gauge matrix is multiplied on the sending side: merge the 4 spinors into 2, multiply by the gauge matrix (A in FR24–FR26, B in FR27–FR29), and store A and B into the send FIFO (X−). The receiver waits until all the data has been received in the recv FIFO (X+), loads A and B into FR24–FR29, and adds them to the 4x3 spinor.

Page 25

Exchanging 2 spinors between the 2 CPUs: sending 2 spinors in the + direction for uxm

The T direction goes through shared memory instead of a torus FIFO. CPU0 merges the 4 spinors into 2 (C in FR18–FR20, D in FR21–FR23) and stores the data to shared memory; the 2 CPUs then synchronize at a lockbox barrier, which is passed only after all the data is stored; CPU1 loads the data from shared memory, multiplies the gauge matrix, and adds to the 4x3 spinor.

Shared memory is not a FIFO, so the 2 CPUs have to synchronize to make sure all the data has been written to shared memory. (For safety, it is better to synchronize before the send as well.)

Page 26

Overlapping data exchange and computations

The torus packet HW can send in 6 directions independently:

– After storing to a FIFO, the data transfer is non-blocking
  • We can overlap it with computation or with sends in other directions

But the 6 send FIFOs are shared by the 2 CPUs in a compute node:

– The 2 sets of 3 FIFOs, (X+,Y+,Z+) and (X−,Y−,Z−), are assigned one set per CPU

Loop for CPU0: SEND_X+, SEND_Y+, SEND_Z+ → local computations → data exchange between the 2 CPUs → RECV_X−, RECV_Y−, RECV_Z−

Loop for CPU1: SEND_X−, SEND_Y−, SEND_Z− → local computations → data exchange between the 2 CPUs → RECV_X+, RECV_Y+, RECV_Z+

The actual time to transfer data between compute nodes can be hidden if there is enough CPU work between the send and the receive.

Page 27

Special communication API for lattice QCD

Limitation:

– Only for exchanging data between neighboring nodes and between the 2 CPUs in a compute node

What we can do with the API:

– API function to prepare the packet header
– API macros to send/recv data from/to FPU registers
  • Communication between nodes (X, Y, Z) and between CPUs (T) can be handled in the same way
  • These macros are used with inline assembly to optimize the instruction pipeline together with computations
– API function for the internal barrier between the 2 CPUs
– API functions to send/recv through a user buffer
  • These functions are used if we do not want to use inline assembly

Page 28

Comparison of the API for QCD and MPI

Comparison of bandwidth for ping-pong communication:

[Figure: two plots of bandwidth [MB/sec] versus data size [B] (100 to 10000), each comparing MPI with the API for QCD. Left: between 2 neighboring compute nodes (scale up to 160 MB/sec); right: between the 2 CPUs in a compute node (scale up to 1600 MB/sec).]

Between 2 neighboring compute nodes the API for QCD is 2-3 times as fast as MPI (with 1 packet = 256 bytes it reaches about 40 MB/sec); between the 2 CPUs in a compute node it is about 10 times as fast as MPI.
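For reference, the MPI side of such a ping-pong comparison is typically measured with a loop like the sketch below (my own minimal version, not the benchmark actually used here): each message size is timed over many round trips, and half the round-trip time gives the one-way bandwidth.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Minimal ping-pong bandwidth measurement between ranks 0 and 1.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int iters = 1000;
    for (int size = 96; size <= 12288; size *= 2) {   // ~100 B .. ~10 KB
        std::vector<char> buf(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0)   // one-way time per message = dt / (2*iters)
            printf("%6d bytes: %8.2f MB/sec\n", size, size * 2.0 * iters / dt / 1e6);
    }
    MPI_Finalize();
    return 0;
}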

Page 29

Sending data using the API macros

#define BGLNET_WORK_REG 30
#define BGLNET_HEADER_REG 30   // tells the API which register is used to load the packet header

BGLNet_Send_WaitReady(BGLNET_X_PLUS,fifo,6);   // waits until the send FIFO has room and sets the FIFO address
loop for several times:
  (calculate something to send)
  BGLNet_Send_Enqueue_Header(fifo);   // loads and sends the packet header
  BGLNet_Send_Enqueue(fifo,24);       // sends data directly from register FR24
  FPADD(21, 3, 6);
  BGLNet_Send_Enqueue(fifo,25);
  FPSUB(18, 0, 9);
  BGLNet_Send_Enqueue(fifo,26);
  FPSUB(19, 1,10);
  BGLNet_Send_Enqueue(fifo,27);
  FPSUB(20, 2,11);
  BGLNet_Send_Enqueue(fifo,28);
  FPADD(22, 4, 7);
  BGLNet_Send_Enqueue(fifo,29);
  FPADD(23, 5, 8);
  BGLNet_Send_Packet(fifo);           // sends an additional 16 bytes to complete the packet
end of loop:

Computations can be inserted between the macros, just as computation is interleaved with loads and stores to optimize the instruction pipeline.

Page 30

Receiving data using the API macros

#define BGLNET_WORK_REG 30
#define BGLNET_HEADER_REG 30   // tells the API which register is used to receive the packet header

BGLNet_Recv_WaitReady(BGLNET_X_MINUS,fifo,Nx*8);   // waits until all data is received in the FIFO buffer and sets the FIFO address
loop for Nx times:
  BGLNet_Recv_Dequeue_Header(fifo);   // receives the packet header
  FXCSMADD( 0,31,24, 0);
  BGLNet_Recv_Dequeue(fifo,12);       // receives data into register FR12
  FXCSMADD( 1,31,25, 1);
  BGLNet_Recv_Dequeue(fifo,13);
  FXCSMADD( 2,31,26, 2);
  BGLNet_Recv_Dequeue(fifo,14);
  FXCSMADD( 3,31,27, 3);
  BGLNet_Recv_Dequeue(fifo,15);
  FXCSMADD( 4,31,28, 4);
  BGLNet_Recv_Dequeue(fifo,16);
  FXCSMADD( 5,31,29, 5);
  BGLNet_Recv_Dequeue(fifo,17);
  FXCPMADD( 6,31,27, 6);
  BGLNet_Recv_Packet(fifo);           // receives the additional 16 bytes that complete the packet
end of loop:

As on the sending side, computations are interleaved between the dequeue macros to fill the instruction pipeline.

Page 31

Sending data between the 2 CPUs

BGLNet_Send_WaitReady(BGLNET_T_PLUS,fifo,6);   // does not wait; sets the address of the shared memory into the pointer "fifo"
(calculate something to send)
BGLNet_Send_Enqueue(fifo,0);       // no packet header is needed; sends data from register FR0
FXNMSUB(21,24,31,21);
BGLNet_Send_Enqueue(fifo + 1,1);   // shared memory is not a FIFO, so the address is advanced for each store
FXNMSUB(18,27,31,18);
BGLNet_Send_Enqueue(fifo + 2,2);
FXNMSUB(22,25,31,22);
BGLNet_Send_Enqueue(fifo + 3,3);
FXNMSUB(23,26,31,23);
BGLNet_Send_Enqueue(fifo + 4,4);
FXNMSUB(19,28,31,19);
BGLNet_Send_Enqueue(fifo + 5,5);
FXNMSUB(20,29,31,20);
BGLNet_InternalBarrier();          // makes sure all data is stored in shared memory;
                                   // the receiver also calls this function before receiving the data

Page 32

Simpler way to access the torus packet HW

The 2 API functions for send and recv can be used without inline assembly.

Send data from a user buffer:

void BGLNet_Send(void* pData,int dir,int size);

– Copies size bytes from the user buffer pData into the send FIFO (for dir = X, Y or Z) or into shared memory (for dir = T)
– Returns as soon as all the data has been copied to the FIFO (non-blocking send)
– Can send up to 32 spinors at once

Receive data into a user buffer:

void BGLNet_Recv(void* pData,int dir,int size);

– Copies size bytes from the recv FIFO (for dir = X, Y or Z) or from shared memory (for dir = T) into the user buffer pData
– Waits until all the data has been received
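A sketch of how these two functions could drive one direction of the halo exchange (my own illustration: the loop structure, buffer layout and the placeholder direction values are assumptions; only the two function signatures come from this slide):

// Declarations as given on this slide. BGLNET_X_PLUS / BGLNET_X_MINUS appear
// in the macro examples above; the values here are placeholders, the real
// ones come from the API header.
extern "C" void BGLNet_Send(void* pData, int dir, int size);
extern "C" void BGLNet_Recv(void* pData, int dir, int size);
#ifndef BGLNET_X_PLUS
#define BGLNET_X_PLUS  0
#define BGLNET_X_MINUS 1
#endif

// Two merged spinors: 2 spinors x 3 colors x (re,im) doubles = 96 bytes.
struct HalfSpinor { double c[6][2]; };

// Send the local boundary in X+ and receive the halo from X-.
void exchange_x_face(HalfSpinor* sendbuf, HalfSpinor* recvbuf, int nsites) {
    for (int i = 0; i < nsites; ++i)
        BGLNet_Send(&sendbuf[i], BGLNET_X_PLUS, sizeof(HalfSpinor));   // non-blocking
    for (int i = 0; i < nsites; ++i)
        BGLNet_Recv(&recvbuf[i], BGLNET_X_MINUS, sizeof(HalfSpinor));  // blocks until arrived
}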

Page 33

Optimization results for our lattice QCD program

Sustained performance as a fraction of peak performance:

Strong scaling
                                              global lattice size
                                              16x16x16x32            24x24x24x48
  1/2 rack (2.8 TFLOPS), 8x8x8x2  = 1024 CPUs 24.33% (0.68 TFLOPS)   29.23% (0.82 TFLOPS)
  1 rack   (5.6 TFLOPS), 8x8x16x2 = 2048 CPUs 22.78% (1.28 TFLOPS)   28.57% (1.60 TFLOPS)

Weak scaling
                                              global lattice size
                                              8x8x8x16               12x12x12x24
  1 node card, 4x4x2x2 = 64 CPUs              25.45%                 29.84%

For comparison: inline assembly with MPI (using a buffer to send everything at once) reaches 17.88% (1/2 rack, 24x24x24x48).

Page 34

Easier ways to optimize and still get performance

All measured on 1/2 rack with a 24x24x24x48 lattice:

  XLC 8.0 with MPI, original source code:                                          6.66%
  XLC 8.0 with MPI, adding alignx() to tell the compiler about 16-byte alignment:  7.42%
  XLC 8.0 with the API for QCD, non-overlapped communication:                      7.95%
  XLC 8.0 with the API for QCD, overlapping the 3 sends for X, Y, Z:               9.19%

The compiler's built-in functions for the double FPU instructions are much easier to use than inline assembly, and overlapping communication with computations through the API for QCD gains further performance.

Page 35

Summary

Optimization using double FPU instructions

– We used inline assembly to optimize complex arithmetic

Parallelization of lattice QCD

– Mapping the 4D lattice onto a virtual 4D torus network to limit communication to neighboring compute nodes

Optimization of communication

– We used torus packet HW directly

• We developed an API for QCD to make it easier to use
