BG/L architecture and high performance QCD
P. Vranas
IBM Watson Research Lab
BlueGene/L:
A Three Dimensional Torus
BlueGene/L

Chip           2 processors                                 2.8/5.6 GF/s    4 MB
Compute Card   2 chips, 1x2x1                               5.6/11.2 GF/s   1.0 GB
Node Card      32 chips (4x4x2); 16 compute, 0-2 IO cards   90/180 GF/s     16 GB
Rack           32 node cards, cabled 8x8x16                 2.8/5.6 TF/s    512 GB
System         64 racks, 64x32x32                           180/360 TF/s    32 TB
BlueGene/L Compute ASIC

[Block diagram. Two PPC 440 CPUs (one usable as an I/O processor), each with a "double FPU" and 32k/32k L1 caches; L2 with prefetch; a multiported shared SRAM buffer; a shared L3 directory for the EDRAM (includes ECC) in front of 4 MB of EDRAM serving as L3 cache or memory; a DDR controller with ECC driving 144-bit-wide external DDR (256 MB); PLB (4:1); Gbit Ethernet; and JTAG access. Network interfaces: torus (6 out and 6 in, each link at 1.4 Gb/s), tree (3 out and 3 in, each link at 2.8 Gb/s), and global interrupt (4 global barriers or interrupts). Internal datapaths are 128/256 bits wide (1024+144 ECC bits to the EDRAM), with snooped L1 buses; labeled bandwidths include 2.7 GB/s, 5.5 GB/s, 11 GB/s, and 22 GB/s.]
Dual Node Compute Card
9 x 512 Mb DRAM; 16B interface; no external termination
Heatsinks designed for 15W
206 mm (8.125") wide, 54 mm (2.125") high; 14 layers, single sided, ground referenced
Metral 4000 high speed differential connector (180 pins)
32-way (4x4x2) node card
Custom dual-voltage DC-DC converters; I2C control
IO Gb Ethernet connectors through tailstock; latching and retention
Midplane connector (450 pins): torus, tree, barrier, clock, Ethernet service port
16 compute cards
2 optional IO cards
Ethernet-JTAG FPGA
360 TF peak; footprint 8.5 m x 17 m
64 racks at LLNL
BlueGene/L Compute Rack Power

[Chart: ~25 kW max power per rack @ 700 MHz, 1.6 V, broken down among node cards, AC-DC conversion loss, DC-DC conversion loss, fans, link cards, and the service card; efficiency figures of 89% and 87% are marked for the conversion stages.]

Per node: ASIC 14.4 W, DRAM 5 W.
MF/W (peak): 250
MF/W (sustained Linpack): 172
BG/L is the fastest computer ever built.
BlueGene/L Link “Eye” Measurements at 1.6 Gb/s
Signal path includes module, card wire (86 cm), and card edge connectors
Signal path includes module, card wire (2 x 10 cm), cable connectors, and 8 m cable
Torus top level

[Block diagram: each of the two CPUs has a processor injection path into the network senders and a processor reception path from the network receivers, which connect to the network wires.]
Torus network hardware packets
The hardware packets come in sizes of S = 32, 64, ..., 256 bytes.
Hardware header (routing etc.): 8 bytes
Payload: S - 8 bytes
Packet tail (CRC etc.): 4 bytes
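As a minimal sketch, the layout above can be written as a C struct. The names here are illustrative, not the actual BG/L system-software definitions; the slide does not say whether the 4-byte tail is counted inside S, so it is shown as hardware-managed.

    /* Illustrative layout of one torus hardware packet of size S bytes
       (S = 32, 64, ..., 256); names are hypothetical. */
    #include <stdint.h>

    enum { S_MAX = 256 };

    typedef struct {
        uint8_t header[8];          /* hardware header: routing etc.            */
        uint8_t payload[S_MAX - 8]; /* S - 8 bytes of payload at the largest S  */
    } torus_packet_t;               /* 4-byte tail (CRC etc.) added by hardware */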
Torus interface fifos
The CPUs access the torus via memory-mapped torus fifos.
Each fifo has 1 Kbyte of SRAM memory.
There are 6 normal-priority injection fifos.
There are 2 high-priority injection fifos.
Injection fifos are not associated with network directions. For example, a packet going out the z+ direction can be injected into any fifo.
There are 2 groups of normal-priority reception fifos. Each group has 6 reception fifos, one for each direction (x+, x-, y+, y-, z+, z-).
The packet header has a bit that specifies into which group the packet should be received. A packet received from the z- direction with header group bit 0 will go to the z- fifo of group 0.
There are 2 groups of high-priority reception fifos. Each group has 1 fifo. All packets with the header high-priority bit set will go to the corresponding fifo.
All fifos have status bits that can be read from specific hardware addresses. The status indicates how full a fifo is.
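A minimal sketch of such a status read, assuming a hypothetical memory-mapped base address and status encoding (the real address map and bit layout belong to the BG/L system software):

    /* Hypothetical memory-mapped fifo status; address and encoding are
       illustrative only. */
    #define TORUS_STATUS_BASE ((volatile unsigned int *)0xB0010000) /* hypothetical */

    static inline unsigned int torus_fifo_status(int fifo)
    {
        /* one status word per fifo, indicating how full it is */
        return TORUS_STATUS_BASE[fifo];
    }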
Torus communications code: injection
Prepare a complete packet in memory: 8 bytes of hardware header followed by the desired payload.
It must be aligned at a 16-byte memory boundary (quad aligned).
It must have a size of 32, 64, ..., up to 256 bytes.
Pick a torus fifo to inject your packet into.
Read the status bits of that fifo from the corresponding fifo-status hardware address. These include the available space in the fifo.
Keep polling until the fifo has enough space for your packet.
Use the double FPU (DFPU) QuadLoad to load the first quad (16 bytes) into a DFPU register.
Use the DFPU QuadStore to store the 16 bytes into the desired torus fifo. Each fifo has a specific hardware address.
Repeat until all bytes are stored in the fifo.
Done. The torus hardware will deliver your packet to the destination node specified in the hardware header.
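Putting the steps together, a minimal injection sketch, assuming the IBM XL compiler's double-FPU intrinsics (__lfpd for QuadLoad, __stfpd for QuadStore) and hypothetical fifo addresses and status decoding:

    #include <complex.h>

    #define TORUS_INJ_FIFO0   ((double *)0xB0020000)                /* hypothetical */
    #define TORUS_INJ_STATUS0 ((volatile unsigned int *)0xB0010000) /* hypothetical */

    /* hypothetical decode of the status word into free space, in bytes */
    extern unsigned int free_bytes(unsigned int status);

    /* packet: quad-aligned, 8-byte header plus payload; bytes: 32, 64, ..., 256 */
    void torus_inject(double *packet, int bytes)
    {
        /* poll until the fifo has room for the whole packet */
        while (free_bytes(*TORUS_INJ_STATUS0) < (unsigned int)bytes)
            ;
        /* QuadLoad from memory, QuadStore to the fifo, 16 bytes at a time */
        for (int q = 0; q < bytes / 16; q++) {
            double _Complex quad = __lfpd(packet + 2 * q);
            __stfpd(TORUS_INJ_FIFO0, quad);
        }
        /* done: the hardware delivers the packet per the header */
    }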
Torus communications code: reception
Read the status bits of the reception fifos. These indicate the number of bytes in each reception fifo. The status is updated only after a full packet is completely in the reception fifo.
Keep polling until a reception fifo has data to be read.
Use the double FPU (DFPU) QuadLoad to load the first quad (16 bytes) from the corresponding fifo hardware address into a DFPU register. This is the packet header, and it contains the size of the packet.
Use the DFPU QuadStore to store the 16 bytes into the desired memory location.
Repeat until all bytes of that packet are read from the fifo and stored into memory. (You know how many times to read since the header carries the packet size.)
Remember that QuadStores store data to quad-aligned memory addresses.
Done. The torus hardware has advanced the fifo to the next packet received (if any).
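The matching reception sketch, under the same assumptions (XL double-FPU intrinsics, hypothetical addresses and header decoding):

    #include <complex.h>

    #define TORUS_RCV_FIFO0   ((double *)0xB0030000)                /* hypothetical */
    #define TORUS_RCV_STATUS0 ((volatile unsigned int *)0xB0011000) /* hypothetical */

    /* hypothetical helpers: status decode and header size extraction */
    extern unsigned int bytes_in_fifo(unsigned int status);
    extern int packet_size(const double *header);

    void torus_receive(double *buf) /* buf: quad-aligned destination */
    {
        /* poll until a complete packet sits in the fifo */
        while (bytes_in_fifo(*TORUS_RCV_STATUS0) == 0)
            ;
        /* the first quad is the header; it carries the packet size */
        double _Complex quad = __lfpd(TORUS_RCV_FIFO0);
        __stfpd(buf, quad);
        int bytes = packet_size(buf);
        /* drain the remaining quads of this packet into memory */
        for (int q = 1; q < bytes / 16; q++) {
            quad = __lfpd(TORUS_RCV_FIFO0);
            __stfpd(buf + 2 * q, quad);
        }
    }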
Routing

Virtual cut-through, dynamic routing with bubble escape and priority channels.

[Diagram: multiple virtual channels (VC) plus a bubble-escape channel (VCB) and a priority channel (VCP).]
Routing examples
A hardware implementation of multicasting along a line
Deterministic and adaptive routing
All to all performance

[Plot: torus all-to-all bandwidth as a percentage of torus peak vs. message size (1 to 1,000,000 bytes), for 32-way (4x4x2) and 512-way (8x8x8) partitions.]
The double FPU
The BG/L chip has two 440 cores. Each core has a double FPU.
The DFPU has two register files (primary and secondary). Each has 32 64-bit floating-point registers.
There are floating-point instructions that allow load/store and manipulation of all registers.
These instructions are an extension to the PowerPC Book E instruction set.
The DFPU is ideal for complex arithmetic.
The primary and secondary registers can be loaded independently or simultaneously. For example, R4-primary and R4-secondary can be loaded with a single QuadLoad instruction. In this case the data must come from a quad-aligned address.
Similarly with stores. (A small sketch follows the register diagram below.)
[Diagram: primary register file FPR P0-P31 and secondary register file FPR S0-S31.]
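A small sketch of the simultaneous load/store described above, assuming the IBM XL compiler's double-FPU intrinsics, where a double _Complex value maps to a primary/secondary register pair:

    #include <complex.h>

    /* c[i] = a[i] + b[i], two doubles at a time; all pointers quad-aligned,
       n even. Each __lfpd fills one primary/secondary register pair and
       __fpadd adds both halves in a single instruction. */
    void vadd(double *c, double *a, double *b, int n)
    {
        for (int i = 0; i < n; i += 2) {
            double _Complex va = __lfpd(a + i); /* a[i] -> primary, a[i+1] -> secondary */
            double _Complex vb = __lfpd(b + i);
            __stfpd(c + i, __fpadd(va, vb));    /* one parallel add, one QuadStore */
        }
    }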
BlueGene/L and QCD at night:
"Physics is what physicists do at night." (R. Feynman)
The 1 sustained-Teraflops landmark

1 sustained Teraflops for 8.5 hours on 1024 nodes (1 rack), June 2004.
Two-flavor dynamical Wilson HMC Phi algorithm, β = 5.2, κ = 0.18, V = 32^3 x 64.

[Plot: chiral condensate (0.31 to 0.365) vs. configuration number (0 to 50).]
QCD on BlueGene/L machines (1/25/06)
More than 20 racks (112 Teraflops) worldwide, mostly for QCD.
LLNL and IBM Watson will possibly run some QCD.
…
One chip hardware

[Block diagram. CPU0 and CPU1, each with two MADD units (the double FPU) and 32 KB / 32 KB L1 caches; L2 with prefetch; 4 MB L3; fifos feeding the network senders/receivers of the 3D torus (virtual cut-through routing); the tree network (combine/broadcast, ~5 µs roundtrip); and external DDR (1 GB for 2 nodes).]
QCD on the hardware
1) Virtual node mode:
CPU0 and CPU1 act as independent "virtual nodes".
Each one does both computations and communications.
The 4th direction is along the two CPUs (it can also be "spread" across the machine via "hand-coded" cut-through routing or MPI).
The two CPUs communicate via common memory buffers.
Computations and communications cannot overlap.
Peak compute performance is then 5.6 GFlops.
QCD on the hardware
2) Co-processor mode:
CPU0 does all the computations.
CPU1 does most of the communications (MPI etc.).
The 4th direction is internal to CPU0, or it can be "spread" across the machine using "hand-coded" cut-through routing or MPI.
Communications can overlap with computations.
Peak compute performance is then 5.6/2 = 2.8 GFlops.
Optimized Wilson D with even/odd preconditioning in virtual node mode
Inner most kernel code is in C/C++ inline assembly.
Algorithm is similar to the one used in CM2 and QCDSP:
Spin project in the 4 "backward" directions.
Spin project in the 4 "forward" directions and multiply with the gauge field.
Communicate "backward" and "forward" spinors to nearest neighbors.
Multiply the "backward" spinors with the gauge field and spin reconstruct.
Spin reconstruct the "forward" spinors.
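As an illustration of the first step, a sketch of the "backward" spin projection (1 - gamma_mu) psi for one direction in plain C; the gamma-matrix convention is one common choice and is an assumption, not necessarily the one used in the BG/L kernel:

    #include <complex.h>
    typedef double _Complex cplx;

    /* psi: 4 spin components x 3 colors; half: the 2 independent spin
       components of (1 - gamma_x) psi in an assumed chiral-basis convention. */
    void spproj_minus_x(cplx half[2][3], const cplx psi[4][3])
    {
        for (int c = 0; c < 3; c++) {
            half[0][c] = psi[0][c] - I * psi[3][c];
            half[1][c] = psi[1][c] - I * psi[2][c];
        }
    }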
All computations use the double-Hummer multiply/add instructions.
All floating-point computations are carefully arranged to avoid pipeline conflicts.
Memory storage ordering is chosen for minimal pointer arithmetic.
Quad load/stores are carefully arranged to take advantage of the cache hierarchy and the CPU's ability to issue up to 3 outstanding loads.
Computations almost fully overlap with load/stores. Local performance is bounded by memory access to L3.
A very thin and effective nearest-neighbor communication layer interacts directly with the torus network hardware to do the data transfers.
Global sums are done via fast torus or tree routines.
Communications do not overlap with computations or memory access.
Small local size: fast L1 memory access but more communication. Large local size: slower L3 memory access but less communication.
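A back-of-the-envelope view of that last tradeoff: for a local 4-d lattice of linear size L, compute work scales with the volume L^4 while nearest-neighbor communication scales with the surface, roughly 8L^3 sites, so their ratio falls as 8/L while the working set grows out of L1 toward L3.

    /* comm/compute ~ surface/volume ~ 8*L^3 / L^4 = 8/L for local linear
       size L: halving L doubles the relative communication cost. */
    static inline double comm_per_site_ratio(double L) { return 8.0 / L; }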
Cycle breakdown
                   2^4 (pcycles/site)   16x4^3 (pcycles/site)
cmat_two_spproj    457                  489
comm               1537                 432
mat_reconstruct    388                  479
reconstruct        154                  193
Dslash             2564                 1596
Theoretical best   324                  324
Performance        12.6%                20.3%
For the Wilson Dslash operator with even/odd preconditioning.
Processor cycle measurements (pcycles) in virtual node mode.
The lattices are the local lattices on each core.
Wilson kernel node performance
Spin-projection and even/odd preconditioning (“squashed” along x dir)
Numbers are for single chip with self-wrapped links
Full inverter (with torus global sum)
% of peak     2^4    4x2^3   4^4    8x4^3   8^2x4^2   16x4^3
D no comms    31.5   28.2    25.9   27.1    27.1      27.8
D             12.6   15.4    15.6   19.5    19.7      20.3
Inverter      13.1   15.3    15.4   18.7    18.8      19.0
[Plot: QCD CG inverter, Wilson fermions with even/odd preconditioning, 1 core in torus loopback. Sustained performance % (10 to 40) vs. local volume in lattice points (0 to 1100), for Dslash with no comms, Dslash, and the CG inverter.]
Weak Scaling (fixed local size)
Spin-projection and even/odd preconditioning.
Full inverter (with torus global sum)
16x4x4x4 local lattice. CG iterations = 21.
Machine          ½ chip     midplane      1 rack        2 racks
Cores            1          1024          2048          4096
Global lattice   4x4x4x16   32x32x32x32   32x32x64x32   32x64x64x32
% of peak        19.0       18.9          18.8          18.7
[Plot: QCD CG inverter, Wilson fermions, 21 CG iterations, 16x4x4x4 local lattice. Sustained performance % (10 to 30) vs. number of CPUs (0 to 4500); an annotation marks the theoretical max at ~75%.]
Special OS tricks (not necessarily dirty)

It was found that L1 evictions cause delays due to increased L3 traffic. To avoid some of this, the "temporary" spin-projected two-component spinors are stored into memory with the L1 attribute write-through/SWOA (store without allocate). An OS function is called that returns a pointer to memory and a fixed size; that image of memory has the above attributes. This increased performance from 16% to 19%.

The on-chip, core-to-core communications are done with a local copy in common memory. It was found that the copy was faster if it was done via the common SRAM. An OS function is called that returns a pointer to memory and a fixed size; that image of memory is in SRAM and has a size of about 1 KB. This increased performance from 19% to 20%.

Under construction: an OS function that splits the L1 cache into two pieces (standard and transient). Loads into the transient L1 will not get evicted or cause evictions. Since the gauge fields are not modified during inversion, this is an ideal place to store them.

These functions exist in controlX, the IBM Watson software group's experimental kernel. They have not migrated to the BG/L standard software release.
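An illustrative sketch of how a kernel might consume such OS functions; every name and signature below is hypothetical, since the actual controlX interface is not given here:

    /* Hypothetical controlX-style interface (illustrative names only). */
    extern void *cx_writethrough_swoa_region(unsigned int *size); /* W/SWOA L1 attribute */
    extern void *cx_sram_scratch(unsigned int *size);             /* ~1 KB shared SRAM   */

    void setup_regions(void)
    {
        unsigned int sz;
        /* temporary spin-projected two-spinors: stores bypass L1 allocation */
        void *tmp = cx_writethrough_swoa_region(&sz);
        /* core-to-core copy buffer in the shared SRAM */
        void *scratch = cx_sram_scratch(&sz);
        (void)tmp; (void)scratch;
    }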
Full QCD physics system
The physics code (besides the Wilson Dslash) is the Columbia C++ physics system (CPS).
The full system ported very easily and worked immediately.
The BG/L additions/modifications to the system have been kept isolated.
Acknowledgement
We would like to thank the QCDOC collaboration for useful discussions and for providing us with the Columbia physics system software.
BlueGene next generations:
P
Q
What would you do?
… if they come to you with 1 Petaflop for a month?
QCD, the movie:
QCD thermal phase transition: a clip from a BG/L lattice simulation.
This clip is from a state-of-the-art simulation of QCD on ½ a rack of a BG/L machine (2.8 Teraflops). It took about 2 days.
It shows 2-flavor dynamical QCD on a 16x16x16x4 lattice with the DWF 5th dimension set to 24 sites.
The pion mass is about 400 MeV. The color of each lattice point is the value of the Polyakov loop, which can fluctuate between -3 and 3. Think of it as a spin system. The graph shows the volume average of the Polyakov line. This value is directly related to the single-quark free energy. In the confined phase there are no free quarks and the value is low (not zero because of screening); in the quark-gluon plasma phase quarks can exist alone and the value is large.
G. Bhanot, D. Chen, A. Gara, P. Heidelberger, J. Sexton, P. Vranas, B. Walkup