Design and Evaluation of the Hamal Parallel Computer
J.P. Grossman
Project Aries
Supervisor: Tom Knight
Dec. 3, 2002
December 3, 2002 The Hamal Parallel Computer 2
Motivation
• Million node general-purpose machine
• Scalable memory system
• Support for massive multithreading
• Discarding network
• Billion transistor era
• Embedded DRAM
Talk Outline
• Overview of Hamal
• The Hamal memory system
• Thread management and synchronization
• Fault-tolerant messaging protocol
Hamal - Overview
• Distributed shared memory machine
• Multiple processor-memory nodes per die
• Fat tree interconnect
• Split-phase memory operations
• Memory consistency implemented in software
• No data caches
• Complete system cycle-accurate simulator
Hamal - Overview
[Figure: block diagram of a processor-memory node with blocks labeled Data (four of them), Code, Controller/Arbiter, Net, and Processor.]
The Hamal Processor
• 128-bit VLIW multithreaded (8 contexts)
• No register renaming, branch prediction, speculative execution, etc.
• Event-driven microkernel runs in context 0
[Figure: context 0 event loop. Memory events, processor events, and network events feed the event queue; context 0 polls the queue and handles each event.]
Talk Outline
• Overview of Hamal
• The Hamal memory system
• Threads and synchronization
• Fault-tolerant messaging protocol
Capabilities
• 128-bit pointers with embedded hardware-enforced permissions and bounds
  • 64 address bits, 64 capability bits
• Single virtual address space
  • Reduces state associated with a process
• Easy sharing of data
• Intra-process protection
• Object-based protection
• Simple lazy page allocation
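As a rough illustration of a pointer that carries its own permissions and bounds, the sketch below packs 64 address bits and 64 capability bits into one 128-bit value. The field layout (16 permission bits, a 24-bit segment length, a 24-bit offset) and all names are assumptions made for this example; the slides do not give Hamal's actual capability format.

```python
from dataclasses import dataclass

READ, WRITE = 0x1, 0x2  # hypothetical permission bits


@dataclass(frozen=True)
class Capability:
    address: int  # 64 address bits
    perms: int    # hypothetical: 16 permission bits
    length: int   # hypothetical: 24-bit segment length, in words
    offset: int   # hypothetical: 24-bit offset of `address` within its segment

    def pack(self) -> int:
        """Pack into a single 128-bit integer: capability bits above address bits."""
        cap_bits = (self.perms << 48) | (self.length << 24) | self.offset
        return (cap_bits << 64) | self.address

    @staticmethod
    def unpack(word: int) -> "Capability":
        cap_bits = word >> 64
        return Capability(
            address=word & (2**64 - 1),
            perms=cap_bits >> 48,
            length=(cap_bits >> 24) & 0xFFFFFF,
            offset=cap_bits & 0xFFFFFF,
        )

    def add(self, delta: int) -> "Capability":
        """Pointer arithmetic with a bounds check, as hardware would enforce it."""
        new_offset = self.offset + delta
        if not 0 <= new_offset < self.length:
            raise IndexError("pointer arithmetic left the segment")
        return Capability(self.address + delta, self.perms, self.length, new_offset)

    def check(self, needed: int) -> None:
        """Raise unless every permission bit in `needed` is present."""
        if self.perms & needed != needed:
            raise PermissionError("capability lacks the required permission")
```

Because the bounds travel with the pointer, a stray index fails at the faulting access rather than corrupting a neighboring object, which is the kind of intra-process, object-based protection the slide lists.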
A Haiku
Capabilities!
It is no longer a sin
to program in C
Virtual Memory
[Figure: virtual-to-physical address translation. A virtual address consists of node, page, and offset fields; virtual pages are mapped to physical pages by a TLB and by hardware page tables.]
Distributed Objects
• Hamal implements Sparsely Faceted Arrays [Brown02]
• Conceptually allocate same segment on all nodes, but actual facets are lazily allocated
• Network interface translates between global segment IDs and local facets
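A toy model may make the lazy-facet idea concrete. In the sketch below each node keeps a table from global segment ID to a local facet and creates the facet only on first access. The class and method names are hypothetical, and a Python dictionary stands in for the translation that the slide attributes to the network interface.

```python
class Node:
    """One node: translates global segment IDs to lazily allocated local facets."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.facets = {}  # global segment ID -> local facet (a plain list here)

    def facet(self, segment_id, facet_size):
        # Allocate the local facet on first touch only.
        if segment_id not in self.facets:
            self.facets[segment_id] = [0] * facet_size
        return self.facets[segment_id]


class SparselyFacetedArray:
    """A segment that conceptually exists on every node."""

    def __init__(self, segment_id, nodes, facet_size):
        self.segment_id = segment_id
        self.nodes = nodes
        self.facet_size = facet_size

    def write(self, node_id, index, value):
        self.nodes[node_id].facet(self.segment_id, self.facet_size)[index] = value

    def read(self, node_id, index):
        return self.nodes[node_id].facet(self.segment_id, self.facet_size)[index]

    def allocated_nodes(self):
        """IDs of the nodes that actually hold a facet for this segment."""
        return [n.node_id for n in self.nodes if self.segment_id in n.facets]
```

For example, after creating an array over 8 nodes and writing only on node 3, `allocated_nodes()` reports just `[3]`: the other seven facets cost nothing until they are touched.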
Talk Outline
• Overview of Hamal
• The Hamal memory system
• Threads and synchronization
• Fault-tolerant messaging protocol
Motivation
• Run time for a problem of size m on N nodes:
  $t = C_1 \frac{m}{N} + C_2 \log(N)$
• Optimal run time for fixed m:
  $\frac{C_1 m}{N} = C_2 \;\Rightarrow\; t = C_2 \left(1 + \log\frac{C_1 m}{C_2}\right)$
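The optimum can be checked numerically. Taking the model t = C1·m/N + C2·log(N) and assuming the logarithm is natural, setting dt/dN = 0 gives N = C1·m/C2 and hence t = C2·(1 + log(C1·m/C2)). The function names and sample constants below are illustrative only.

```python
import math


def run_time(m, n, c1, c2):
    """Model run time for problem size m on n nodes (natural log assumed)."""
    return c1 * m / n + c2 * math.log(n)


def optimal_nodes(m, c1, c2):
    """Node count where dt/dN = 0, i.e. where c1*m/N equals c2."""
    return c1 * m / c2


def optimal_time(m, c1, c2):
    """Run time at the optimal node count."""
    return c2 * (1 + math.log(c1 * m / c2))


if __name__ == "__main__":
    m, c1, c2 = 1_000_000, 1.0, 100.0  # illustrative constants
    n_star = optimal_nodes(m, c1, c2)
    print(n_star, optimal_time(m, c1, c2))
```

The practical reading is that the useful machine size grows linearly with m and shrinks as the per-step overhead C2 grows, so lowering C2 is what makes very large node counts worthwhile.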
Thread Creation
• fork instruction specifies:
  • Starting address
  • Destination node
  • Set of registers to copy to child
• Each node contains a hardware fork queue
• Queue is serviced by microkernel
[Figure: illustration of a fork.]
Register Dribbling
• Each thread has a swap page in memory
• Threads are loaded/unloaded in the background on unused memory cycles (register dribbling, [Soundararajan92])
  • Reduces overhead of thread swapping
• Least recently issued (LRI) context constantly dribbles
  • Reduces latency of thread swapping
Thread Suspension
• When should a blocked thread be suspended?
• Two-part strategy:
  1. Wait until
     a) No context can issue
     b) The LRI context is clean
  2. Generate a stall event and allow the microkernel to decide if the thread should be suspended
Register-Based Synchronization
• join capabilities allow one thread to write directly to another thread’s registers
• Three instructions: jcap, busy, and join
Parent Thread
    r0 = jcap r1
    r1 = busy
    fork _child, {r0}
    r1 = and r1, r1
Child Thread
    _child:
    <computation>
    join r0, 0
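The parent/child listing can be mimicked with ordinary threads. The sketch below reads the listing as follows, which is an interpretation rather than something the slide states: `jcap` creates a capability to write `r1`, `busy` marks `r1` as not yet available, the `and r1, r1` stalls until `r1` is written, and `join r0, 0` writes 0 into the parent's `r1`. The `Register` class and function names are hypothetical.

```python
import threading


class Register:
    """A register with a busy bit: reads block while the register is busy."""

    def __init__(self):
        self._value = 0
        self._ready = threading.Event()
        self._ready.set()

    def mark_busy(self):      # models `r1 = busy`
        self._ready.clear()

    def write(self, value):   # models a join arriving from another thread
        self._value = value
        self._ready.set()

    def read(self):           # models any instruction that reads the register
        self._ready.wait()
        return self._value


def child_thread(join_cap, results):
    results.append("computed")  # stands in for <computation>
    join_cap(0)                 # models `join r0, 0`


def parent(results):
    r1 = Register()
    join_cap = r1.write         # models `r0 = jcap r1`
    r1.mark_busy()
    child = threading.Thread(target=child_thread, args=(join_cap, results))
    child.start()               # models `fork _child, {r0}`
    value = r1.read()           # models `r1 = and r1, r1`: stalls until the join
    child.join()
    return value
```

Calling `parent([])` returns only after the child has run its computation, without any shared memory location being polled.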
Example - Barrier
[Figure: barrier example; the only surviving label is "doacross".]
Barrier Times
# processors    barrier time (cycles)
1               12
2               58
4               105
8               161
16              216
32              273
64              333
128             393
256             456
512             523
Benchmark - ppadd
[Figure: ppadd. log(time) (axis 11 to 20) versus # processors (1 to 512); legend entries 1024, 2048, 4096, 8192, 16384, 32768, 65536.]
UV Trap Bits
• Each memory word is tagged with two user trap bits (U, V)
• Each memory operation may optionally:
  • Trap on U high, U low, V high, V low
  • Modify U, V if successful
• Traps generate events which are handled by the microkernel on the node containing the memory word
Example – Word Locking
• Acquire:
  • load, trap on U high or V high, set U
• Release:
  • store, trap on V high, clear U

U  V  Meaning
0  0  available
1  0  unavailable
0  1  trap
1  1  busy
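The acquire and release operations can be modeled in a few lines. In this sketch a Python exception stands in for the trap event that would be delivered to the microkernel; the `Word` and `Trap` names are hypothetical, and how the microkernel sets V or queues waiting threads is outside what the slide specifies, so it is not modeled.

```python
class Trap(Exception):
    """Stands in for the trap event handled by the microkernel."""


class Word:
    """A memory word tagged with two user trap bits, U and V."""

    def __init__(self, value=0):
        self.value, self.u, self.v = value, 0, 0

    def _check(self, trap_u_high, trap_v_high):
        if (trap_u_high and self.u) or (trap_v_high and self.v):
            raise Trap(f"U={self.u} V={self.v}")

    def acquire(self):
        # load, trap on U high or V high, set U
        self._check(trap_u_high=True, trap_v_high=True)
        self.u = 1
        return self.value

    def release(self, value):
        # store, trap on V high, clear U
        self._check(trap_u_high=False, trap_v_high=True)
        self.value = value
        self.u = 0
```

An uncontended acquire/release pair touches the word once each and never involves the microkernel; only a second acquire on a held word raises `Trap`.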
Example – Word Locking
[Figure: word-locking sequence involving threads A, B, C, and D. Legend: word, lock, join capability, thread.]
Benchmark – wordcount
• Frequency count of words in [Brown02]
• Distributed hash table used to store counts
• remote version: access hash table remotely
• local version: create a thread on target node
[Figure: bar chart of execution time (axis 0 to 3.5) for the remote and local versions, with series labeled spin and UV.]
Talk Outline
• Overview of Hamal
• The Hamal memory system
• Threads and synchronization
• Fault-tolerant messaging protocol
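Motivation
Discarding vs. Non-Discarding
• Examples: Internet (discarding); most || computers (non-discarding)
• Criteria compared: performance, simplicity, reliability
[Table: the marks showing which network type is favored on each criterion are not recoverable.]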
Fault Tolerant Messaging
• Idempotence?
• Sequence numbers (e.g. TCP)
• 2^20 nodes, 32 bits ⇒ 8MB/node
[Figure: packet exchange showing repeated MSG packets and an ACK.]
Idempotent Messaging Protocol
• CONF: No more messages will be sent
• [Brown01]
[Figure: packet exchange showing repeated MSG packets, an ACK, and a CONF.]
Out of Order Packets
[Figure: sender/receiver timeline for message 7, with repeated MSG 7 packets, ACK 7, and CONF 7.]
Message Identification
• Sender generates message ID
• All packets contain source node and msg. ID
[Figure: Sender and Receiver exchanging MSG, ACK, and CONF packets, annotated as follows.]
• ID identifies MSG
• source node gives destination for CONF
• (source node, ID) identifies MSG
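The MSG/ACK/CONF exchange can be sketched as a pair of small state machines. The version below is a simplified reading of the slides, not the protocol as specified in [Brown01]: it assumes packets may be dropped but are not reordered, and the class names and tuple-based packet format are hypothetical.

```python
class Receiver:
    def __init__(self):
        self.table = set()    # (source, msg_id) pairs delivered but not yet confirmed
        self.delivered = []

    def on_msg(self, source, msg_id, payload):
        key = (source, msg_id)
        if key not in self.table:       # first copy: deliver exactly once
            self.table.add(key)
            self.delivered.append(payload)
        return ("ACK", source, msg_id)  # every copy is acknowledged

    def on_conf(self, source, msg_id):
        # The sender has promised that no more copies will be sent.
        self.table.discard((source, msg_id))


class Sender:
    def __init__(self, node):
        self.node = node
        self.next_id = 0
        self.table = {}       # msg_id -> payload still awaiting an ACK

    def send(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.table[msg_id] = payload
        return msg_id

    def packet(self, msg_id):
        # Build the packet for a first transmission or a retransmission.
        return ("MSG", self.node, msg_id, self.table[msg_id])

    def on_ack(self, msg_id):
        self.table.pop(msg_id, None)    # stop retransmitting
        return ("CONF", self.node, msg_id)
```

If the first ACK is lost and the sender retransmits, the receiver sees the same (source node, ID) pair again, acknowledges it, and does not deliver the payload a second time. State on both sides is freed once the ACK and CONF arrive.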
Can We Reduce Overhead?
• ACK/CONF packets:
    type | dest | source | message ID
• Two ideas to reduce size:
  1. Use short (4-8 bit) MSG ID
  2. Receiver assigns short secondary ID
    ACK:  type | dest | source | message ID | ID2
    CONF: type | dest | ID2
Failure of Short IDs
[Figure: sender/receiver timeline in which short ID 7 is reused, with repeated MSG 7, ACK 7, and CONF 7 packets.]
Secondary IDs
[Figure: sender/receiver timeline with messages 5 and 9 and secondary ID 2; packets shown include MSG 5, ACK 5,2, CONF 2, MSG 9, and ACK 9,2.]
How Many Bits Is Enough?
• Can’t reuse an ID if it’s still in the system
• But receivers can remember an ID for arbitrarily long periods of time
• Solution:
  • use 48 bit IDs
  • flush the network every 4-12 months when a node runs out of IDs
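A quick calculation shows what the 4-12 month figure implies. If a node exhausts a 48-bit ID space in that time, it must be generating IDs at roughly 9 to 27 million per second. This rate is an inference from the slide's numbers, not a figure the slide states, and the 30-day month is an assumption.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day months


def ids_per_second(id_bits, months):
    """Average ID consumption rate that exhausts 2**id_bits IDs in `months` months."""
    return 2**id_bits / (months * SECONDS_PER_MONTH)


if __name__ == "__main__":
    for months in (4, 12):
        print(months, "months:", round(ids_per_second(48, months)), "IDs per second")
```

At such rates a 32-bit ID space would last only a few minutes, which suggests why short IDs are not an option here.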
Experimental Results
• Trace-driven simulation of 4 microbenchmarks on 4 topologies
• Linear backoff for packet retransmission
• Small send tables (8 entries)
• Larger receive tables (64 entries)
• Buffer ~4 flits at each network node
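For readers unfamiliar with the term, linear backoff means the wait before each retransmission grows by a fixed step per attempt. The sketch below shows one common form; the `step` parameter and function names are hypothetical, and the slides do not say which constants or exact schedule the simulations used.

```python
def linear_backoff(attempt, step):
    """Delay in cycles before retransmission number `attempt` (1-based)."""
    return attempt * step


def retransmit_times(first_send, step, attempts):
    """Cycle at which each retransmission is sent if no ACK ever arrives."""
    times, t = [], first_send
    for attempt in range(1, attempts + 1):
        t += linear_backoff(attempt, step)
        times.append(t)
    return times
```

With `step=30`, a packet first sent at cycle 0 would be retransmitted at cycles 30, 90, and 180, so the gaps widen gradually rather than doubling as in exponential backoff.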
Summary
• Scalable memory system
• Capabilities → single shared virtual address space
• Hardware page tables, sparsely faceted arrays
• Low overhead for parallel programs
• Efficient thread management
• Register-based synchronization
• UV trap bits
• Discarding network
• Lightweight fault-tolerant messaging protocol
Conclusion
Yesterday – ENIAC
Today – P4
Soon – 1M nodes
Tomorrow – ???
Comparison with Non-Discarding
[Figure: four bar charts (add, reverse, quicksort, nbody) comparing a non-discarding network with a discarding network plus fault-tolerant messaging on four topologies: 2D Grid, 3D Grid, Fat Tree, Butterfly.]
Hamal Benchmarks
• ppadd – parallel-prefix addition
• quicksort – parallel quicksort
• nbody – exact N-body simulation, 256 bodies
  • Processors conceptually organized in square array
  • Communication is in rows and columns only
• wordcount – frequency count of words in [Brown02]
  • Distributed hash table used to maintain counts
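For reference, parallel-prefix addition computes every running sum of a vector. The sketch below uses a standard three-phase blocked scan and runs the phases sequentially. It is not the ppadd benchmark code, which the slides do not show, and the function name is hypothetical.

```python
from itertools import accumulate


def parallel_prefix_add(values, num_procs):
    """Inclusive prefix sums computed block by block, as num_procs workers would."""
    if not values:
        return []
    block = -(-len(values) // num_procs)  # ceiling division: elements per worker
    blocks = [values[i:i + block] for i in range(0, len(values), block)]
    # Phase 1: each worker scans its own block independently.
    local = [list(accumulate(b)) for b in blocks]
    # Phase 2: exclusive scan over the block totals (a tree reduction on real hardware).
    totals = [sums[-1] for sums in local]
    offsets = [0] + list(accumulate(totals))[:-1]
    # Phase 3: each worker adds its offset to its local results.
    return [x + off for sums, off in zip(local, offsets) for x in sums]
```

Phases 1 and 3 cost time proportional to m/N per worker, while phase 2 can be done in about log(N) steps across N workers. That is consistent with a run-time model of the form C1·m/N + C2·log(N).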
ppadd – Model vs. Simulations
[Figure: ppadd. log(time) (axis 10 to 20) versus # processors (1 to 512).]
quicksort – Execution Time
[Figure: quicksort. log(time) (axis 17 to 27) versus # processors (1 to 512); legend entries 4096, 8192, 16384, 32768, 65536, 131072, 262144.]
quicksort - Speedup
[Figure: quicksort. log(speedup) (axis 0 to 7) versus # processors (1 to 512); legend entries 4096, 8192, 16384, 32768, 65536, 131072, 262144.]
nbody – Speedup
[Figure: nbody. log(speedup) (axis 0 to 8) versus # processors (1, 4, 16, 64, 256).]
Register Dribbling
[Figure: four plots (ppadd8, quicksort, nbody, wordcount) of time (cycles) versus # contexts (4 to 16), each comparing dribble on suspend with extended dribbling.]
Network Benchmarks
• add – parallel prefix addition on 4096 nodes
• reverse – reverse the data of a 16K entry vector distributed across 1024 nodes
• quicksort – parallel quicksort of a 32K entry vector on 1024 nodes
• nbody – full N-body simulation on 256 nodes with one body per node
Network Topologies
• 2D Grid
  • Dimension-ordered routing preferred
• 3D Grid
  • Dimension-ordered routing preferred
• Fat tree
  • radix-4 (down) dilation-2 (up), randomized
• Multibutterfly
  • radix-2 dilation-2, randomized
Network Retransmission
Slowdown over optimal

                  average case    worst case
add               L5   1.003      L5   1.009
reverse           L32  1.016      L30  1.028
quicksort         Q12  1.015      Q15  1.020
nbody             L31  1.045      L31  1.085
2D grid           L28  1.026      L32  1.033
3D grid           L30  1.018      L32  1.039
fat tree          L30  1.036      L31  1.085
multibutterfly    L32  1.015      L32  1.041
overall           L30  1.028      L30  1.093
Network Send Table Size
[Figure: four plots (add, reverse, quicksort, nbody) with x-axis values 1 to 256 in powers of two.]
Network Buffering
[Figure: four plots (add, reverse, quicksort, nbody) with x-axis values 1 to 8.]
Network Receive Table Size
[Figure: four plots (add, reverse, quicksort, nbody) with x-axis values 16, 32, 64, 128.]