Topic 7: Multiprocessor Systems (Shared Memory)
Eduard Ayguadé and Josep Llosa
These slides have been prepared using material that is part of the teaching material of other professors at the Computer Architecture Department (Jesús Labarta, Miguel Valero, …). Other material available on the internet has also been used to prepare this chapter's slides.
Throughput vs. parallel programming

Throughput:
- Multiple, unrelated instruction streams (programs) that execute concurrently on multiple processors
- Multiprogramming: n tasks on p processors; each task receives p/n processors

Parallel programming:
- Multiple, related, interacting instruction streams (a single program) that execute concurrently to increase the speed of a single program
- 1 task on m processors; each processor receives 1/m of the task: reduces response time
Example: heat distribution problem

[Figure: a 1-D rod of points with boundary temperatures Temp=20 (left) and Temp=100 (right) and interior points initially at Temp=0; the temperature distribution evolves over time steps Time=0, 1, 2, …, 99, 100.]

Each interior point x_i is updated from its two neighbours at the previous time step:

x_i(t) = ( x_{i-1}(t-1) + x_{i+1}(t-1) ) / 2
Heat distribution: sequential program

int t;                  /* time step */
int i;                  /* array index */
float x[n+2], y[n+2];   /* temperatures */

x[0] = y[0] = 20; x[n+1] = y[n+1] = 100;
for (i = 1; i < n+1; i++)
    x[i] = 0;

for (t = 1; t <= 100; t++) {
    for (i = 1; i < n+1; i++)
        y[i] = 0.5 * (x[i-1] + x[i+1]);
    swap(x, y);
}

[Figure: arrays x (values at time t-1) and y (values at time t).]
Heat distribution: shared memory

int t;                  /* time step */
int i;                  /* my processor id */
float temp;
shared float x[n+2];    /* temperatures */

i = GetID() + 1;        /* GetID returns 0 .. P-1 */
x[i] = 0;
if (i == 1) { x[0] = 20; x[n+1] = 100; }

for (t = 1; t <= 100; t++) {
    temp = 0.5 * (x[i-1] + x[i+1]);
    x[i] = temp;
}

[Figure: one point assigned to each processor.]
Heat distribution: shared memory

int t;                  /* time step */
int i;                  /* my processor id */
float temp;
shared float x[n+2];    /* temperatures */

i = GetID() + 1;        /* GetID returns 0 .. P-1 */
x[i] = 0;
if (i == 1) { x[0] = 20; x[n+1] = 100; }

for (t = 1; t <= 100; t++) {
    barrier(n);
    temp = 0.5 * (x[i-1] + x[i+1]);
    barrier(n);
    x[i] = temp;
}
barrier(n);

[Figure: timeline for processors P1, P2, P3, …, Pn, alternating computing phases and waiting at barriers.]
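For reference, the pseudocode above maps almost directly onto POSIX threads. The following is a minimal sketch under stated assumptions (it is not part of the original slides): one thread per interior point, N = 16 points, and barrier(n) rendered as pthread_barrier_wait on a barrier initialised for N threads.

#include <pthread.h>
#include <stdio.h>

#define N 16                       /* interior points; one thread per point (assumption) */

static float x[N + 2];             /* shared temperatures */
static pthread_barrier_t barr;     /* plays the role of barrier(n) */

static void *worker(void *arg)
{
    int i = (int)(long)arg + 1;    /* my point, 1 .. N */
    float temp;

    x[i] = 0;
    if (i == 1) { x[0] = 20; x[N + 1] = 100; }

    for (int t = 1; t <= 100; t++) {
        pthread_barrier_wait(&barr);   /* previous writes (or init) are visible */
        temp = 0.5f * (x[i - 1] + x[i + 1]);
        pthread_barrier_wait(&barr);   /* all reads of the old values are done */
        x[i] = temp;
    }
    pthread_barrier_wait(&barr);       /* all final writes are done */
    return NULL;
}

int main(void)
{
    pthread_t th[N];
    pthread_barrier_init(&barr, NULL, N);
    for (long p = 0; p < N; p++)
        pthread_create(&th[p], NULL, worker, (void *)p);
    for (long p = 0; p < N; p++)
        pthread_join(&th[p], NULL);
    printf("x[1]=%.2f  x[N]=%.2f\n", x[1], x[N]);
    return 0;
}

Compile with -pthread; the barrier placement mirrors the slide: one wait before the reads, one between the reads and the writes, and one after the loop.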
Heat distribution: shared memory

1002 points, 10 processors (100 points per processor):

int t /* time step */, i;
int id;                          /* my processor id */
shared float x[1002];            /* temperatures */
int leftmost, rightmost;         /* processor boundary points */
float x_leftmost, x_rightmost;   /* their temperatures */

id = GetID();
leftmost = id*100 + 1; rightmost = (id+1)*100;
if (id == 0) { x[0] = 20; x[1001] = 100; }
for (i = leftmost; i <= rightmost; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    barrier(n);
    x_leftmost  = 0.5 * (x[leftmost-1]  + x[leftmost+1]);
    x_rightmost = 0.5 * (x[rightmost-1] + x[rightmost+1]);
    barrier(n);
    for (i = leftmost+1; i < rightmost; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
    x[leftmost]  = x_leftmost;
    x[rightmost] = x_rightmost;
}
barrier(n);
Don't forget Amdahl's law

Example: p processors; program: 5% sequential, 95% suitable for speedup
T(1) = Ts + Tp = 0.05 T + 0.95 T
T(p) = Ts + (Tp / p) + overhead(p)
S(100) = T(1) / T(100) < 16.8
S(p) = T(1) / T(p) = (Ts + Tp) / (Ts + Tp/p + overhead(p))
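A quick numeric check of the bound above (a small sketch, not from the slides; overhead(p) is taken as zero):

#include <stdio.h>

int main(void)
{
    double Ts = 0.05, Tp = 0.95;                /* fractions of T(1) */
    int procs[] = { 10, 100, 1000 };
    for (int k = 0; k < 3; k++) {
        int p = procs[k];
        double S = (Ts + Tp) / (Ts + Tp / p);   /* overhead(p) assumed 0 */
        printf("S(%4d) = %5.2f\n", p, S);       /* S(100) ~= 16.8 */
    }
    return 0;
}

However large p grows, S(p) stays below 1/0.05 = 20.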
Shared-memory parallel programming

- Any processor can directly reference any memory location
- Communication occurs implicitly as a result of loads and stores

[Figure: virtual address spaces for a collection of processes P0 … Pn communicating via shared addresses. The shared portion of each address space maps onto common physical addresses in the machine's physical address space; each process also keeps a private portion. A store by one processor and a load by another to the shared portion implement the communication.]
Models of shared-memory multiprocessors

The Uniform Memory Access (UMA) model:
- The physical memory is shared by all processors
- All processors have equal access to all memory addresses
- Also called SMP (Symmetric Multi-Processor)

[Figure: three UMA organizations, each with processors P1 … Pn reaching memory M through an interconnection network: no cache, a shared cache, and private per-processor caches.]
On a board: Quad Xeon MP server

- All coherence and multiprocessing glue in the processor module
- Up to 4 processors

On a chip: dual-core PowerPC 970MP
On a chip: POWER5

- Dual-core SMT processor; 8-way superscalar SMT cores
- 276M transistors, 389 mm² die
- Operating in the lab at 1.8 GHz & 1.3 V
- 1.9 MB L2 cache – point of coherency
- On-chip L3 directory, memory controller
- Technology: 130 nm lithography, SOI, Cu wiring, 8 layers of metal
- High-speed elastic bus interface
- I/Os: 2313 signal, 3057 power

On a chip: Intel dual-core chips

- Intel Core Duo processor with two cores, a unified 2 MB L2 cache, and 152 million transistors
- Intel Pentium D (dual core) and eXtreme Edition (dual core, multithreaded), 2 x 1 MB L2 cache, up to 3.2 GHz
Chip multiprocessor vs. multithreading

[Figure: left, a chip multiprocessor — CPU1 … CPUn, each with its own PC, w-issue logic, instruction fetch (IF), register file (RF), execution units (EU) and private instruction/data caches (IC, DC), all sharing an SRAM secondary cache and DRAM. Right, a multithreaded core — a single k-way issue pipeline with program counters PC1 … PCm and register files 1 … m sharing one instruction fetch unit, the execution units and queues, the I and D caches, the SRAM secondary cache and DRAM.]

SMT Issue Slot

SMT:
- ILP (instruction-level parallelism) like a superscalar and TLP (thread-level parallelism) like a multiprocessor
- It hides long latencies like a multithreaded processor

[Figure: issue-slot occupancy over time (processor cycles) for Thread1 … Thread4 on: a superscalar (4 issue slots), a multithreaded processor (4 issue slots, one thread active per cycle), an SMT (4 issue slots, instructions from several threads per cycle) and an MP2 (2 x 2 issue slots).]
SMT Implementation

Straightforward extension to a conventional superscalar design:
- The fetch unit is shared among the different threads; fetch policies designed to improve fetch effectiveness
- Single pool of physical registers for renaming

Hyperthreading:
- Intel's form of SMT, available on Xeon servers
- 30% improvement with 2 threads
[Figure: SMT pipeline — multiple program counters (PCs) feed a shared Fetch stage, followed by Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass and Data Cache.]
Models of shared-memory multiprocessors

Distributed shared-memory or Non-Uniform Memory Access (NUMA) model:
- Shared memory is physically distributed among the processors
- References to memory on other nodes must be sent across the interconnection network (transparently to the programmer)
- Access to local memory is much faster than access to remote memory

[Figure: NUMA organization — nodes (P1, cache, M1) … (Pn, cache, Mn) connected by an interconnection network.]
POWER5 Multi-chip Module

- 95 mm x 95 mm
- Four POWER5 chips, 2 processors per chip, 2-way simultaneous multithreaded
- Four L3 cache chips
- 4,491 signal I/Os
- 89 layers of metal

[Figure: MCM photograph showing the four POWER5 chips and four L3 chips, with Memory, I/O, JTAG, on-book and off-book connections. Block diagram: the MCM with four POWER5 chips, each with its L3 cache, four memory ports and four I/O ports.]
16-way Building Block

[Figure: 16-way building block ("book") built around the MCM — POWER5 chips, each with its L3 cache, memory and I/O connections.]
64-way SMP Interconnection

Synchronization

Why do we need synchronization? We need to know when it is safe for different processors to access a shared-memory location or to signal a certain event.

Issues for synchronization:
- Uninterruptible instruction to fetch and update memory (atomic operation)
- User-level synchronization operations built using these primitives (e.g. locks, barriers, …)
- For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization
Synchronization

Components:
- Acquire method: acquire the right to the synchronization
- Waiting algorithm:
  - Busy wait (resource consumption during wait)
  - Blocking (needs a mechanism to awake the waiter)
  - Two-phase: wait for a while, then block
- Release method: allow others to proceed
Synchronization

What's wrong with the following synchronization code? (assume that initially flag=0)

Both processors Pi and Pj execute:

lock:    ld  r1, flag
         st  flag, #1
         bnz r1, lock

unlock:  st  flag, #0
Uninterruptible Fetch and Update

Test-and-set instruction:
- Reads a value and sets the location to 1
- The read and write sequence is indivisible

Atomic exchange:
- Interchanges a value in a register with a value in memory
- The swap operation is indivisible

lock:  t&s r1, flag

Uninterruptible Fetch and Update

Fetch-and-op (addr, register):
- Reads the addr value into register
- Replaces the addr value with op(addr value)
- Common variants: fetch-and-increment, fetch-and-decrement, fetch-and-add (requires an additional addend register argument)

Compare-and-swap (addr, reg1, reg2):
- Compares the addr value with the contents of reg1
- If equal, swaps the addr value with reg2
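For reference, modern C exposes equivalent primitives directly; a minimal sketch using C11 <stdatomic.h> (the variable and values are illustrative, not from the slides):

#include <stdatomic.h>
#include <stdio.h>

atomic_int counter = 0;             /* shared location */

int main(void)
{
    /* fetch-and-add: returns the old value and adds 1 atomically */
    int old = atomic_fetch_add(&counter, 1);

    /* compare-and-swap: if counter == expected, replace it with 42 */
    int expected = 1;
    _Bool ok = atomic_compare_exchange_strong(&counter, &expected, 42);

    printf("old=%d ok=%d counter=%d\n", old, (int)ok, counter);
    return 0;
}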
User-level synchronization

Spin locks: the processor continuously tries to acquire the lock, spinning around a loop.

Assume:
- 0 → synchronization variable is free
- 1 → synchronization variable is locked and unavailable

Set a register to 1 and swap; the new value in the register determines success in getting the lock:
- 0 if you succeeded in setting the lock
- 1 if another processor had already claimed access

        mov  R2, #1
lockit: exch R2, 0(R1)     ; atomic exchange
        bnez R2, lockit    ; already locked?
User-level synchronization

Barriers: processors block until all have reached the barrier.
- Can be implemented with a counter, initially set to zero
- Every time a processor reaches the barrier, it atomically increments the counter and compares it with the total number of processors that need to reach the barrier
- If not equal → go and wait on a flag variable (set to 0 by the first processor reaching the barrier)
- If equal → set the flag variable to 1
- After the barrier, all processors continue
User-level synchronization

Centralized barrier:

Barrier(barr) {
    lock (barr.lock);
    if (barr.counter == 0)
        barr.flag = 0;            // reset flag if first to arrive
    mycount = ++barr.counter;     // pre-increment: last arriver sees mycount == P
    unlock (barr.lock);
    if (mycount == P) {           // last to arrive?
        barr.counter = 0;         // reset for next barrier
        barr.flag = 1;            // release waiting processors
    } else
        while (barr.flag == 0);   // busy wait for release
}
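A rough C11 rendering of the same centralized barrier (not from the slides). It replaces the lock-protected counter with an atomic fetch-and-add and adds per-thread sense reversal, the standard refinement that lets the barrier be reused safely in consecutive invocations:

#include <stdatomic.h>

#define P 4                              /* number of threads (assumption) */

struct barrier_t {
    atomic_int counter;                  /* arrivals in the current episode */
    atomic_int flag;                     /* release flag, toggles each episode */
};

static struct barrier_t barr = { 0, 0 }; /* each of the P threads calls barrier_wait(&barr) */
static _Thread_local int my_sense = 1;   /* private per-thread sense */

void barrier_wait(struct barrier_t *b)
{
    int sense = my_sense;
    if (atomic_fetch_add(&b->counter, 1) + 1 == P) {
        atomic_store(&b->counter, 0);    /* last to arrive: reset the counter ... */
        atomic_store(&b->flag, sense);   /* ... and release the waiters */
    } else {
        while (atomic_load(&b->flag) != sense)
            ;                            /* busy wait for release */
    }
    my_sense = 1 - sense;                /* flip sense for the next barrier */
}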
Uninterruptible Fetch and Update

Sometimes it is hard to have the read and the write in one instruction → use two instructions instead.

Load linked (or load locked) + store conditional:
- Load linked returns the initial value
- Store conditional returns 1 if it succeeds (same processor that did the preceding load) and 0 otherwise
- The memory is in charge of keeping track of the last processor that loaded a memory location
Uninterruptible Fetch and Update

Example: an atomic swap of R4 and 0(R1):

try:  mov  R3, R4       ; move exchange value
      ll   R2, 0(R1)    ; load linked
      sc   R3, 0(R1)    ; store conditional
      beqz R3, try      ; branch if the store fails (R3 = 0)
      mov  R4, R2       ; put the loaded value in R4

Example: fetch-and-increment of 0(R1) into R2:

try:  ll   R2, 0(R1)    ; load linked
      addi R3, R2, #1   ; increment (OK if reg–reg)
      sc   R3, 0(R1)    ; store conditional
      beqz R3, try      ; branch if the store fails (R3 = 0)
Caches are critical for performance

Shared cache:
- Hit and miss latency increased due to the intervening switch and the cache size
- Interference:
  - Positive: prefetching across processors, sharing of working sets
  - Negative: conflicts when replacing data in the cache
- High bandwidth needs
- No coherence problem
- Used in the first SMPs in the mid 80s to connect a couple of processors on a board; today, for multiprocessors on a chip (for small-scale systems or nodes)

[Figure: P1 … Pn connected through an interconnection network (IN) to a single shared cache and memory M.]
Caches are critical for performance

Private caches:
- Reduce average data access time (latency)
- Reduce bandwidth demands placed on the shared interconnect
- Many processors can share data efficiently
- Automatic replication closer to the processor

But private processor caches create a problem: copies of a variable can be present in multiple caches … a coherence problem appears when one processor writes.

[Figure: P1 … Pn, each with a private cache ($), connected through an interconnection network (IN) to memory M.]
Cache coherence problem

[Figure: three CPUs with private caches connected through an interconnection network to shared memory and I/O; memory initially holds X=0.]

1. CPU1 reads X: CPU1's cache now holds X=0.
2. CPU2 reads X: CPU2's cache also holds X=0.
3. CPU2 writes X=1: CPU2's cache holds X=1, while CPU1's cache and memory still hold X=0.
4. CPU1 or CPU3 read X and get an incorrect (stale) value.
Cache coherence using a bus

Snooping-based coherence. The bus is a broadcast medium:
- Transactions on the bus are visible to all processors
- Processors, or their representatives, can snoop (monitor) the bus and take action on relevant events (e.g. change state)

[Figure: CPUs, each with a cache and a snoopy cache controller (SCC), attached to the bus together with shared memory and I/O.]

Cache coherence using a bus

The cache controller now receives inputs from both sides:
- Requests from the processor (processor side)
- Bus requests/responses from the snooper (bus side)
- Dual tags (not data) or a dual-ported tag RAM

In either case, it takes zero or more actions: update state, respond with data, generate new bus transactions.

The protocol is a distributed algorithm: cooperating state machines
- Set of states, state transition diagram, actions

The granularity of coherence is typically the cache block
- Like that of allocation in the cache and transfer to/from the cache
Caches with write buffers
Need to snoop the write buffer
Two protocols to handle write operations

Write-update: the writing processor broadcasts the new value and forces all others to update their copies
- Similar to a write-through cache policy
- New data appears sooner in caches and main memory
- But higher bus traffic

High bandwidth requirements: every write from every processor goes to the shared bus and memory
- Processor: 2 GHz, 2 instructions/cycle, and 10% of instructions are 8-byte stores → each processor writes 3.2 GB of data per second (see the check below)
- Motherboard: FSB @ 400/800 MHz, 128-bit data bus → peak bandwidth = 6.4/12.8 GB/s
- Up to 2 or 4 processors per board
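The 3.2 GB/s figure is just the product of the rates in the bullet above; a one-line check (a sketch, assuming exactly those numbers):

#include <stdio.h>

int main(void)
{
    double inst_per_s = 2e9 * 2;          /* 2 GHz x 2 instructions/cycle */
    double store_frac = 0.10;             /* 10% of instructions are stores */
    double bytes_per_store = 8;
    printf("write bandwidth = %.1f GB/s\n",
           inst_per_s * store_frac * bytes_per_store / 1e9);   /* 3.2 GB/s */
    return 0;
}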
Bus bandwidth

Example: number of read and write references, shared and non-shared (all counts in millions)

Application   Instructions  FLOPS   References  Total Reads  Total Writes  Shared Reads  Shared Writes
Barnes-Hut    2002.74       239.24  720.13      406.84       313.29        225.04        93.23
LU            489.52        92.2    151.07      103.09       47.99         92.79         44.74
Raytrace      833.35        -       290.35      210.03       80.31         161.1         22.35
Ocean         376.51        101.54  99.7        81.16        18.54         76.95         16.97
Radix         14.02         -       5.27        2.9          2.37          1.34          0.81
Two protocols to handle write operations

Write-invalidate: the writing processor forces all others to invalidate their copies
- Similar to a write-back cache policy
- The dirty state in the cache now indicates exclusive ownership:
  - Exclusive: the only cache with a valid copy. Subsequent writes to the same block do not need to broadcast an invalidate → less bus traffic
  - Owner: responsible for supplying the block upon a request for it. The request is generated when accessing an invalidated cache line (which causes a miss)
- The most popular policy

Bus arbitration resolves races: e.g. two processors attempt to write the same block.
Example: write-invalidate protocol

States:
- Invalid (I) or not present (miss)
- Shared (S): one or more copies
- Dirty or Modified (M): one copy only

CPU events:
- PrRd (read)
- PrWr (write)

Bus transactions:
- BusRd: asks for a copy with no intent to modify
- BusRdX: asks for a copy with intent to modify (invalidation)
- BusWB (or Flush): updates memory

Actions taken:
- Update state, perform a bus transaction, flush the value onto the bus

[Figure: CPU, cache (one state per block i) and snoopy cache controller (SCC) attached to the bus.]
Example: write-invalidate protocol (MSI)

[Figure: MSI state transition diagram; arcs are labelled CPU event / bus transaction.]

BusUpgr instead of BusRdX: upgrade from S to M in order to reduce traffic (the cache already holds a valid copy, so no data needs to be transferred).
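Since the transition diagram itself is a figure, the processor-side MSI transitions can also be written down as a small state machine. The sketch below is an illustration (the names are mine, and the bus-side snooping transitions would be handled analogously), not the exact diagram from the slides:

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } state_t;
typedef enum { PR_RD, PR_WR } cpu_event_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_UPGR } bus_tx_t;

/* Processor-side MSI transition: given the current state of a block and a
 * CPU event, return the next state and the bus transaction to issue. */
static state_t msi_cpu_event(state_t s, cpu_event_t e, bus_tx_t *tx)
{
    *tx = BUS_NONE;
    switch (s) {
    case INVALID:
        if (e == PR_RD) { *tx = BUS_RD;  return SHARED;   }
        else            { *tx = BUS_RDX; return MODIFIED; }
    case SHARED:
        if (e == PR_RD) { return SHARED; }                   /* hit, no bus tx */
        else            { *tx = BUS_UPGR; return MODIFIED; } /* or BusRdX */
    case MODIFIED:
        return MODIFIED;                                     /* hits, no bus tx */
    }
    return s;
}

int main(void)
{
    bus_tx_t tx;
    state_t s = INVALID;
    s = msi_cpu_event(s, PR_RD, &tx);   /* I -> S, BusRd   */
    s = msi_cpu_event(s, PR_WR, &tx);   /* S -> M, BusUpgr */
    printf("final state = %d, last bus tx = %d\n", s, tx);
    return 0;
}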
Example: write-invalidate protocol (MESI)

[Figure: MESI state transition diagram; arcs are labelled CPU event / bus transaction.]

States:
– Invalid
– Exclusive (only this cache has a copy, but not modified)
– Shared (two or more caches may have copies)
– Modified (dirty)

Transactions:
– BusRd(S) means the shared line is asserted on a BusRd transaction
– Flush': if cache-to-cache sharing, only one cache flushes the data

Example: write-invalidate protocol (MESI)

In the MESI protocol we need to know whether a block is shared, i.e. whether to transition to the E or the S state on a read miss.

MESI also requires a priority scheme for cache-to-cache transfers:
- Which cache should supply the data when it is in the shared state?
- Commercial implementations allow memory to provide the data
Bandwidth requirements on shared bus

Application  Instructions (M)  FLOPS (M)  References (M)
Radix        14.02             -          5.27
Raytrace     833.35            -          290.35
Ocean        376.51            101.54     99.7

Cache: 1 MByte, 4-way set associative, LRU; 64-byte lines. State transitions per 1000 references (rows: current state, columns: next state; NP = not present):

Raytrace   M        S        E        I       NP
M          839.507  0.3125   0        0.0305  0.0219
S          0.2768   162.569  0        0.7264  0.0092
E          0.006    0.0001   0.0241   0.0003  0
I          0.0324   0.5766   0        0       0.0262
NP         0.0354   0.2581   0.00068  0       0

Radix      M        S        E        I       NP
M          906.945  1.4982   0        4.2886  0.0173
S          0.3051   84.6671  0        0.4156  0.0109
E          0        0.0001   0.0284   0.0008  0.0006
I          1.705    0.4119   0        0       0.0485
NP         5.4051   1.3153   0.003    0       0

Ocean      M        S        E        I       NP
M          843.565  2.2996   0        0.0015  2.6259
S          2.2392   134.716  0        2.4994  0.41715
E          0.9955   0.0241   4.004    0       0.204
I          0.0015   1.8676   0        0       0.6362
NP         1.6787   0.9565   1.2484   0       0
Bandwidth requirements on shared bus

Bus transaction generated by each state transition (rows: from state, columns: to state):

From \ To   NP     I      E             S      M
NP          -      -      BusRd         BusRd  BusRdX
I           -      -      BusRd         BusRd  BusRdX
E           -      -      -             -      -
S           -      -      Not possible  -      BusUpgr
M           BusWB  BusWB  Not possible  BusWB  -

Transfers:
- Command and address on the address bus lines: 6 bytes
- Data on the data bus lines: 64 bytes

Bytes moved on the bus per transition (BusRd, BusRdX and BusWB carry 6 + 64 = 70 bytes; BusUpgr only 6):

From \ To   M    S    E    I    NP
M           0    70   0    70   70
S           6    0    0    0    0
E           0    0    0    0    0
I           70   70   70   0    0
NP          70   70   70   0    0
Bandwidth requirements on shared bus

Total number of bytes per 1000 references (transition frequency x bytes per transition):

Raytrace   M       S       E       I      NP
M          0       21.875  0       2.135  1.533
S          1.6608  0       0       0      0
E          0       0       0       0      0
I          2.268   40.362  0       0      0
NP         2.478   18.067  0.0476  0      0
Total: 90.4264

Radix      M        S        E     I        NP
M          0        104.874  0     300.202  1.211
S          1.8306   0        0     0        0
E          0        0        0     0        0
I          119.35   28.833   0     0        0
NP         378.357  92.071   0.21  0        0
Total: 1026.939

Ocean      M        S        E       I      NP
M          0        160.972  0       0.105  183.813
S          13.4352  0        0       0      0
E          0        0        0       0      0
I          0.105    130.732  0       0      0
NP         117.509  66.955   87.388  0      0
Total: 761.0142
Bandwidth requirements on shared bus

Total used bandwidth, assuming a processor running at 2 GHz, 2 instructions per cycle:

Application  Instructions (M)  References (M)  Bytes per 1000 refs  BW (MB/s)
Raytrace     833.35            290.35          90.4264              126.02
Radix        14.02             5.27            1026.939             1544.07
Ocean        376.51            99.7            761.0142             806.07

So, for example, an FSB @ 400/800 MHz with a 128-bit data bus (6.4/12.8 GB/s peak bandwidth) provides enough capacity for 8/16 processors running Ocean (but only 4/8 running Radix).
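To see where the bandwidth column comes from, a small sketch of the calculation for the Ocean row (the same formula applies to the other rows):

#include <stdio.h>

int main(void)
{
    double ghz = 2.0, ipc = 2.0;            /* 2 GHz, 2 instructions/cycle */
    double instr_M = 376.51, refs_M = 99.7; /* Ocean */
    double bytes_per_1000_refs = 761.0142;

    double refs_per_sec = ghz * 1e9 * ipc * (refs_M / instr_M);
    double bw_MBps = refs_per_sec * (bytes_per_1000_refs / 1000.0) / 1e6;
    printf("Ocean bus bandwidth = %.1f MB/s\n", bw_MBps);   /* ~806 MB/s */
    return 0;
}

Dividing the 6.4/12.8 GB/s peak bus bandwidth by this per-processor figure gives the 8/16 processor estimate quoted above.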
Bandwidth requirements on shared bus

How much bandwidth does MESI save compared with an MSI protocol with no BusUpgr? … and after all:

Bus transaction per state transition without BusUpgr (every transition into M from S or E is a full BusRdX):

From \ To   NP     I      E             S      M
NP          -      -      BusRd         BusRd  BusRdX
I           -      -      BusRd         BusRd  BusRdX
E           -      -      -             -      BusRdX
S           -      -      Not possible  -      BusRdX
M           BusWB  BusWB  Not possible  BusWB  -

Application  Instructions (M)  References (M)  Bytes per 1000 refs  BW (MB/s)
Raytrace     833.35            290.35          108.5616             151.29
Radix        14.02             5.27            1046.46              1573.42
Ocean        376.51            99.7            974.008              1031.67
User-level synchronization (revisited)

What about synchronization with cache coherency?
- We want to spin on a cached copy to avoid the full memory latency
- Likely to get cache hits for such variables
- Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic

        mov  R2, #1
lockit: exch R2, 0(R1)     ; atomic exchange
        bnez R2, lockit    ; already locked?
User-level synchronization (revisited)

Use a standard load first (not a test-and-set):
- Each processor reads its own local (cached) copy until the value changes
- Hence the spin doesn't generate traffic until there is a reason to, e.g. the lock got released
- No guarantee that when you try it won't be too late; hence there is a starvation potential unless the algorithm has some fairness model built in
- But the win is clear

try:    mov  R2, #1
lockit: ld   R3, 0(R1)     ; load the lock variable
        bnez R3, lockit    ; not free → spin
        exch R2, 0(R1)     ; atomic exchange
        bnez R2, try       ; already locked?
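The same test-and-test-and-set idea in portable C11 atomics (a sketch, not the slide's code; the function and variable names are illustrative):

#include <stdatomic.h>

static atomic_int lockvar = 0;      /* 0 = free, 1 = locked; use as acquire(&lockvar) / release(&lockvar) */

void acquire(atomic_int *lk)
{
    for (;;) {
        /* spin on plain (cached) loads first: no bus traffic while locked */
        while (atomic_load(lk) != 0)
            ;
        /* the lock looks free: now try the expensive atomic exchange */
        if (atomic_exchange(lk, 1) == 0)
            return;                 /* we got it */
    }
}

void release(atomic_int *lk)
{
    atomic_store(lk, 0);            /* the unlock is a simple store */
}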
Directory-based cache coherency

Scaling to shared-memory systems with a large number of processors:
[Figure: nodes, each containing a CPU, a cache, local memory and a directory, connected by an interconnection network.]
Directory-based cache coherency

Directory:
- It tracks the state of the data stored in its memory
- If a block is shared, it must track which processors have it
- If a block is valid in only one node, it should know which one

[Figure: node memory with a per-block directory entry.] Each entry holds:
- the nodes currently having the line: a bit string, 1 bit per node, the position indicates the node
- the state:
  – I: invalid
  – E: exclusive (only this node has a copy, but not modified)
  – S: shared (two or more nodes may have copies)
  – M: modified (dirty)
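A minimal sketch of what one such directory entry could look like in C (the sizes and names are illustrative assumptions, not from the slides):

#include <stdint.h>

#define MAX_NODES 64

typedef enum { DIR_I, DIR_E, DIR_S, DIR_M } dir_state_t;

/* One directory entry per memory block, kept at the block's home node. */
typedef struct {
    dir_state_t state;              /* I, E, S or M */
    uint64_t    sharers;            /* bit vector: bit n set => node n has a copy */
} dir_entry_t;

/* Example queries the protocol needs to answer. */
static inline int has_copy(const dir_entry_t *e, int node)
{
    return (int)((e->sharers >> node) & 1u);
}

static inline void add_sharer(dir_entry_t *e, int node)
{
    e->sharers |= (uint64_t)1 << node;
}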
Directory-based cache coherency

Typically 3 processors are involved in each transaction:
- Local node: where the request originates
- Home node: where the memory location of the address resides
- Remote node: has a copy of the cache block, whether exclusive or shared

Example: get a block for read access (block clean at home):
1. Local → Home: RdReq
2. Home → Local: RdResp (data)
Directory-based cache coherency

Access to a clean copy, with two sharers, for write:
1. Local → Home: RdXReq
2. Home → Local: list of readers (sharers)
3. Local → each Reader: Invalidate
4. Each Reader → Local: done

Access to a dirty block for read:
1. Local → Home: RdReq
2. Home → Local: identity of the Owner
3. Local → Owner: Intervention
4. Owner → Home: WB_D (write back the dirty data); Owner → Local: Data
Distributed-memory multiprocessors

Distributed memory:
- Each processor can only reference its own local memory
- References to memory on other nodes must be made by sending messages across the interconnection network (not transparent to the programmer)
- Access to local memory is much faster than messaging

[Figure: nodes (P1, cache, M1) … (Pn, cache, Mn) connected by an interconnection network; each memory is private to its node.]
Message passing programming model

- Send specifies the buffer to be transmitted and the receiving process
- Receive specifies the sending process and the application storage to receive into
- Optional tag on send and matching rule on receive
- Implicit synchronization (e.g. non-blocking send and blocking receive)
- Many overheads: copying, buffer management, protection

[Figure: process P executes Send Q, X, tag and process Q executes Receive P, Y, tag; when the tags match, data is copied from address X in P's local address space to address Y in Q's local address space.]
Example: heat distribution problem

[Figure: the same 1-D heat distribution problem as before — boundary temperatures Temp=20 and Temp=100, interior initially Temp=0, time steps Time=0 … Time=100, update rule x_i(t) = ( x_{i-1}(t-1) + x_{i+1}(t-1) ) / 2.]
Heat distribution: message passing

int t /* time step */, i;
int id;                   /* my processor id */
int num_points;           /* number of points per processor */
float x[-1:(1000/P)];     /* temperatures, including shadow points */

num_points = 1000/P;
id = GetID();
if (id == 0)     x[-1] = 20;
if (id == (P-1)) x[num_points] = 100;
for (i = 0; i < num_points; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    /* exchange boundary values with the neighbours
       (send before receive, and only if the neighbour exists) */
    if (id > 0)    send(id-1, x[0]);
    if (id < P-1)  send(id+1, x[num_points-1]);
    if (id < P-1)  receive(id+1, &x[num_points]);
    if (id > 0)    receive(id-1, &x[-1]);
    for (i = 0; i < num_points; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
}
[Figure: each processor owns 1000/P consecutive points plus two shadow points holding the time t-1 boundary values of its neighbours id-1 and id+1; 1002 points, 10 processors.]
Networks for distributed memory
Uniform access time networks
Networks for distributed memory

Non-uniform access time networks: the time to reach another node in the network depends on its position.

Examples: line, ring, 2D torus, 3D torus, hypercube.
Networks for distributed memory

Routing algorithm:
- Deterministic routing: there exists a single way to reach a node; the path is determined by the identifiers of the source and target nodes.

E.g. Manhattan routing in a 2D mesh: to go from (x1, y1) to (x2, y2), first travel Δx = x2 - x1 hops in the x dimension, then Δy = y2 - y1 hops in the y dimension.

[Figure: a 4x4 2D mesh with nodes labelled 0000 … 1111 and a 3-D hypercube with nodes labelled 000 … 111.]

E.g. hypercube routing: R = X xor Y; traverse the dimensions in which the addresses differ, in order.
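A small sketch of that dimension-order (e-cube) hypercube routing rule in C; the node labels and the dimension count are illustrative:

#include <stdio.h>

#define DIMS 3                        /* 3-D hypercube: nodes 000 .. 111 */

static void print_bin(unsigned n)
{
    for (int b = DIMS - 1; b >= 0; b--)
        putchar((n >> b) & 1 ? '1' : '0');
}

/* E-cube routing: R = src xor dst; flip the differing address bits in a
 * fixed dimension order, visiting one neighbour per hop. */
static void route(unsigned src, unsigned dst)
{
    unsigned cur = src, r = src ^ dst;
    print_bin(cur);
    for (int d = 0; d < DIMS; d++) {
        if ((r >> d) & 1) {           /* addresses differ in dimension d */
            cur ^= 1u << d;           /* hop across that dimension */
            printf(" -> ");
            print_bin(cur);
        }
    }
    putchar('\n');
}

int main(void)
{
    route(1, 6);                      /* from node 001 to node 110 */
    return 0;
}

For this example call, the printed route is 001 -> 000 -> 010 -> 110, correcting the differing bits from the lowest dimension upwards.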
www.top500.org (November 2006)

[Figure: Top500 performance development chart (GFlops).]
MareNostrum: 10240 CPUs, 94.2 TFlops

- 20 Terabytes of memory
- 280+90 Terabytes of disk storage
- Area: 120 m²
MareNostrum: 10240 CPUs, 94.2 TFlops

JS21 processor blade:
- 2 PowerPC 970MP chips, each one with two cores, 2.3 GHz → symmetric multiprocessor
- 4 GB memory, shared memory
- 40 GBytes local IDE disk
- Myrinet network adapter and 2 Gigabit ports

MareNostrum: 10240 CPUs, 94.2 TFlops

[Figure: PowerPC 970MP chip.]
MareNostrum: 10240 CPUs, 94.2 TFlops

Blade Center:
- 14 blades per chassis (7U)
- 56 processors, 56 GB memory
- Gigabit ethernet switch

6 chassis in a rack (42U):
- 336 processors (3.1 TFlops)
- 336 GB memory

MareNostrum: 10240 CPUs, 94.2 TFlops

Myrinet Clos 256+256 switch
Myrinet Spine 1280 switch

- Spine switches are used to connect 3, 4 and 5 Clos 256+256 units
- Cabling spines to build larger systems

[Figure: five Clos 256x256 switches connected through two Spine 1280 switches; each Clos unit provides 256 links (one to each node, 250 MB/s in each direction) and 128 links toward the spines.]

MareNostrum: 10240 CPUs, 94.2 TFlops
IBM Blue Gene Light
BG/L compute card
IBM Blue Gene Light

- Nodes interconnected as a 64x32x32 3D torus: easy to build large systems, as each node connects only to its six nearest neighbours; full routing in hardware
- A global reduction tree supports fast global operations, such as a global max/sum in a few microseconds over 65,536 nodes
- Auxiliary networks for I/O and global operations
NASA Columbia Supercomputer

- Global shared memory across 4 CPUs, 8 Gigabytes
- 4 Itanium2 per C-Brick
NASA Columbia Supercomputer

Global shared memory across 64 CPUs, 128 Gigabytes
NASA Columbia Supercomputer

Global shared memory across 512 CPUs, 1 Terabyte
NASA Columbia Supercomputer

- 20 SGI® Altix™ 3700 superclusters, each with 512 Itanium2 processors (1.5 GHz, 6 MB cache)
- Infiniband network to connect the superclusters