July 2004 Computer Architecture, Advanced Architectures Slide 1
Part VII: Advanced Architectures
About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition | Released | Revised | Revised
First | July 2003 | July 2004 | July 2005
VII Advanced Architectures
Topics in This Part
Chapter 25 Road to Higher Performance
Chapter 26 Vector and Array Processing
Chapter 27 Shared-Memory Multiprocessing
Chapter 28 Distributed Multicomputing
Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing
25 Road to Higher Performance

Review past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism
Topics in This Chapter
25.1 Past and Current Performance Trends
25.2 Performance-Driven ISA Extensions
25.3 Instruction-Level Parallelism
25.4 Speculation and Value Prediction
25.5 Special-Purpose Hardware Accelerators
25.6 Vector, Array, and Parallel Processing
25.1 Past and Current Performance Trends
Architectural method | Improvement factor
1. Pipelining (and superpipelining) | 3-8 √
2. Cache memory, 2-3 levels | 2-5 √
3. RISC and related ideas | 2-3 √
4. Multiple instruction issue (superscalar) | 2-3 √
5. ISA extensions (e.g., for multimedia) | 1-3 √
6. Multithreading (super-, hyper-) | 2-5 ?
7. Speculation and value prediction | 2-3 ?
8. Hardware acceleration | 2-10 ?
9. Vector and array processing | 2-10 ?
10. Parallel/distributed computing | 2-1000s ?

(Methods 1-5 are established and were previously discussed; methods 6-10 are newer and are covered in Part VII.)
Available computing power ca. 2000:
• GFLOPS on desktop
• TFLOPS in supercomputer center
• PFLOPS on drawing board
Computer performance grew by a factor of about 10,000 between 1980 and 2000:
• 100× due to faster technology
• 100× due to better architecture
Peak Performance of Supercomputers

[Figure: peak supercomputer performance, 1980-2010, climbing from GFLOPS through TFLOPS toward PFLOPS at roughly ×10 every 5 years; machines marked include Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and the Earth Simulator.]
Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]
Energy Consumption is Getting out of Hand
Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs.
[Figure: performance (log scale, kIPS to TIPS) vs. calendar year, 1980-2010, with three curves: absolute processor performance, GP processor performance per watt, and DSP performance per watt.]
Energy Reduction Methods
• Clock gating
  – Turn off clock signals that go to unused parts of a chip
• Asynchronous design
  – Doing away with clocking and the attendant energy waste
• Voltage reduction
  – Slower, but uses less energy
Examples
• Intel XScale processor
  – Runs at 100 MHz to 1 GHz
  – 30 mW of power
• System on a chip
  – Wearable computers
  – Augmented reality
Examples
• Power = k × voltage²
  – So use two low-voltage processors instead of one high-voltage processor
  – Chip multiprocessors
• Cell phones: two processors
  – One low-power but computationally weak general-purpose processor
  – One digital signal processor
25.2 Performance-Driven ISA Extensions
Adding instructions that do more work per cycle
Shift-add: replace two instructions with one (e.g., multiply by 5)
Multiply-add: replace two instructions with one (x := c + a × b)
Multiply-accumulate: reduce round-off error (s := s + a × b)
Conditional copy: to avoid some branches (e.g., in if-then-else)
Subword parallelism (for multimedia applications)
Intel MMX: multimedia extension 64-bit registers can hold multiple integer operands
Intel SSE: Streaming SIMD extension 128-bit registers can hold several floating-point operands
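The subword idea can be checked with a short simulation. This is a hedged sketch of the concept only; `pack`, `unpack`, and `parallel_add` are illustrative names, not MMX intrinsics, and real hardware does all lanes in one ALU pass by cutting the carry chain.

```python
# Sketch of MMX-style subword parallelism: four 16-bit lanes packed into
# one 64-bit integer, added lane by lane with carries inhibited at lane
# boundaries (each lane wraps modulo 2^16).

LANES, WIDTH = 4, 16
MASK = (1 << WIDTH) - 1

def pack(values):
    """Pack four 16-bit integers into a single 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * WIDTH)
    return word

def unpack(word):
    """Extract the four 16-bit lanes of a 64-bit word."""
    return [(word >> (i * WIDTH)) & MASK for i in range(LANES)]

def parallel_add(a, b):
    """Lane-wise add; no carry crosses a lane boundary."""
    return pack([(x + y) & MASK for x, y in zip(unpack(a), unpack(b))])

x = pack([10, 20, 0xFFFF, 40])
y = pack([1, 2, 3, 4])
print(unpack(parallel_add(x, y)))  # [11, 22, 2, 44]: the third lane wraps
```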
Multimedia Extensions
• Workload shifts from data crunching to multimedia
• New instructions
• Compatibility issues and costs lead to compromises “that prevented the full benefits of such extensions from materializing”
New Instructions
• Subword operands
  – 8-bit pixels, 16-bit audio samples
  – MMX added 57 instructions to allow parallel subword arithmetic operations
MMX
• 8 floating-point registers that “double” as the integer register file (cost)
• A 64-bit register can be viewed as holding a single 64-bit integer or a vector of 2, 4, or 8 narrower integer operands
Parallel Pack
• Unpacking (8-vectors, low)
  – xxxxabcd, xxxxefgh → aebfcgdh
• Packing does the reverse
• These operations increase precision
Saturating Arithmetic
• With ordinary (wraparound) arithmetic, overflow results in a sum that is smaller than either operand, i.e., 0b1100 + 0b1011 = 0b0111
• This could result in a bright spot in a dark region
• Saturating arithmetic instead uses special rules to deal with "infinity": out-of-range results clamp to the nearest representable extreme
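The wraparound-versus-saturation contrast is easy to verify numerically. A minimal sketch (function names are illustrative, assuming unsigned 4-bit pixel values):

```python
# Wraparound vs. saturating addition for unsigned 4-bit values,
# reproducing the slide's example 0b1100 + 0b1011 (12 + 11 = 23).

def wrap_add(a, b, bits=4):
    return (a + b) & ((1 << bits) - 1)    # overflow wraps around

def sat_add(a, b, bits=4):
    return min(a + b, (1 << bits) - 1)    # overflow clamps to the maximum

print(bin(wrap_add(0b1100, 0b1011)))  # 0b111: 7, smaller than either operand
print(bin(sat_add(0b1100, 0b1011)))   # 0b1111: pinned at the brightest value
```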
Intel MMX ISA Extension (Table 25.1)

Class | Instruction | Vector | Op type | Function or results
Copy | Register copy | 32 bits | | Integer register ↔ MMX register
Copy | Parallel pack | 4, 2 | Saturate | Convert to narrower elements
Copy | Parallel unpack low | 8, 4, 2 | | Merge lower halves of 2 vectors
Copy | Parallel unpack high | 8, 4, 2 | | Merge upper halves of 2 vectors
Arithmetic | Parallel add | 8, 4, 2 | Wrap/Saturate | Add; inhibit carry at boundaries
Arithmetic | Parallel subtract | 8, 4, 2 | Wrap/Saturate | Subtract with carry inhibition
Arithmetic | Parallel multiply low | 4 | | Multiply, keep the 4 low halves
Arithmetic | Parallel multiply high | 4 | | Multiply, keep the 4 high halves
Arithmetic | Parallel multiply-add | 4 | | Multiply, add adjacent products
Arithmetic | Parallel compare equal | 8, 4, 2 | | All 1s where equal, else all 0s
Arithmetic | Parallel compare greater | 8, 4, 2 | | All 1s where greater, else all 0s
Shift | Parallel left shift logical | 4, 2, 1 | | Shift left, respect boundaries
Shift | Parallel right shift logical | 4, 2, 1 | | Shift right, respect boundaries
Shift | Parallel right shift arith | 4, 2 | | Arithmetic shift within each (half)word
Logic | Parallel AND | 1 | Bitwise | dest ← (src1) ∧ (src2)
Logic | Parallel ANDNOT | 1 | Bitwise | dest ← (src1)′ ∧ (src2)
Logic | Parallel OR | 1 | Bitwise | dest ← (src1) ∨ (src2)
Logic | Parallel XOR | 1 | Bitwise | dest ← (src1) ⊕ (src2)
Memory access | Parallel load MMX reg | 32 or 64 bits | | Address given in integer register
Memory access | Parallel store MMX reg | 32 or 64 bits | | Address given in integer register
Control | Empty FP tag bits | | | Required for compatibility
MMX Multiplication and Multiply-Add
Figure 25.2 Parallel multiplication and multiply-add in MMX.
[Figure: (a) parallel multiply low: corresponding 16-bit elements of two vectors are multiplied and only the low half of each product is kept; (b) parallel multiply-add: the full products of adjacent element pairs are added to form two wider results.]
MMX Parallel Comparisons
Figure 25.3 Parallel comparisons in MMX.
[Figure: (a) parallel compare equal on byte vectors: each result byte is all 1s (255) where the elements are equal, else all 0s; (b) parallel compare greater on halfword vectors: each result halfword is all 1s (65,535) where the first operand is greater, else all 0s.]
25.3 ILP Under Ideal Conditions!
Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91].
[Figure: (a) fraction of cycles (0-30%) vs. number of issuable instructions per cycle (0-8); (b) speedup attained (up to about 3) vs. instruction issue width (0-8).]
ILP Inefficiencies
• Diminishing returns due to
  – Slowing down of instruction issue caused by extreme circuit complexity
  – Wasting resources on instructions that will not be executed due to branching
Instruction-Level Parallelism
Figure 25.5 A computation with inherent instruction-level parallelism.
VLIW and EPIC Architectures
Figure 25.6 Hardware organization for IA-64. General and floating-point registers are 64-bit wide. Predicates are single-bit registers.
VLIW Very long instruction word architectureEPIC Explicitly parallel instruction computing
[Figure: memory feeding 128 general registers, 128 floating-point registers, and 64 predicate registers, which in turn supply multiple parallel execution units.]
A Better Approach?
• Allow the programmer or compiler to specify which operations are capable of being executed at the same time
• More reasonable than hiding the concurrency in sequential code and then requiring the hardware to rediscover it
Itanium
• Long instructions "bundle" three conventional RISC-type instructions ("syllables") of 41 bits each, plus a 5-bit "template" holding scheduling information
• The template specifies the instruction types and indicates concurrency
41-bit Instruction
• 4-bit major opcode
• 6-bit predicate specification
• 31 bits for operands and modifiers
  – 3 registers can be specified (21 bits)
• The results are committed only if the specified predicate is 1
  – Predicated instruction execution
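The field widths above (4 + 31 + 6 = 41 bits) can be pulled apart with shifts and masks. A hedged sketch: the bit positions chosen here are for illustration only and are not claimed to match the official IA-64 encoding.

```python
# Decode a 41-bit "syllable" into (major opcode, operand/modifier field,
# predicate register number), assuming an illustrative layout: opcode in
# the top 4 bits, operands in the middle 31, predicate in the low 6.

OPCODE_BITS, OPERAND_BITS, PRED_BITS = 4, 31, 6

def decode_syllable(word):
    pred = word & ((1 << PRED_BITS) - 1)
    operands = (word >> PRED_BITS) & ((1 << OPERAND_BITS) - 1)
    opcode = (word >> (PRED_BITS + OPERAND_BITS)) & ((1 << OPCODE_BITS) - 1)
    return opcode, operands, pred

# A syllable with opcode 10, operand field 291, predicate register 5:
word = (10 << 37) | (291 << 6) | 5
print(decode_syllable(word))  # (10, 291, 5)
```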
Additional Features
• Control and data speculation
• Software pipelining for loops
• Threading– Interleaved multithreading– Blocked multithreading– Simultaneous multithreading
Interleaved Multithreading
• Instruction logic picks instructions from the different threads in a round-robin fashion
  – Spaces out the instructions from the same thread, avoiding bubbles due to data or control dependencies
Blocked multithreading
• Deals with only one thread until it hits a snag, e.g., a cache miss
• Fills the long stalls of one thread with useful instructions from another thread
Simultaneous Multithreading
• Most flexible strategy
• Allows mixing and matching of instructions from different threads to make the best use of available resources
Speculation and Value Prediction
• Itanium: control speculation
  – One load becomes two
    • A speculative load (moved earlier)
    • A checking instruction
  – We may execute a load instruction that will not be encountered due to a branch or a false predicate
  – The speculative load keeps a record of any exception
Data Speculation
• Allows a load instruction to be moved ahead of a store
• When the speculative load is performed, the address is recorded and checked against subsequent addresses for write operations
Other Forms of Speculation
• Issuing instructions from both paths of a branch instruction
• Add microarchitecture registers for temporary values
• Value prediction
  – Store the operands and results of previous divide instructions in a memo table
Continued
• Instruction reuse
• Superspeculation
  – Predict the outcome of a sequence of instructions
  – Take into account patterns
  – 50-98% accuracy
• Predict the next address in a load (r2)
25.4 Speculation and Value Prediction
Figure 25.7 Examples of software speculation in IA-64.
Before: ----  ----  ----  ----  load  ----  ----
After:  spec load  ----  ----  ----  ----  check load  ----  ----
(a) Control speculation

Before: ----  ----  store  ----  load  ----  ----
After:  spec load  ----  ----  store  ----  check load  ----  ----
(b) Data speculation
Value Prediction
Figure 25.8 Value prediction for multiplication or division via a memo table.
[Figure: a multiply/divide unit paired with a memo table; a control block checks incoming operands against the table, and a mux selects either the memoized output immediately (hit) or the unit's computed output (miss, when done).]
25.5 Special-Purpose Hardware Accelerators
Figure 25.9 General structure of a processor with configurable hardware accelerators.
[Figure: a CPU with data and program memory, plus a configuration memory that loads the configuration registers of an FPGA-like unit on which accelerators (Accel. 1, 2, 3) are formed; leftover fabric remains as unused resources.]
Graphic Processors, Network Processors, etc.
Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems’ network processor.
[Figure: packets flow from an input buffer through a grid of processing elements (PE 0-PE 15) arranged in four columns, each column backed by its own column memory, to an output buffer; a feedback path allows a packet to recirculate through the array.]
Network Processor
• Arriving traffic overwhelms a processor
• Network processors have specialized instruction sets and other capabilities
  – Instructions for bit-mapping, tree search, and cyclic redundancy check calculations
  – Instruction-level parallelism, micro engines
25.6 Vector, Array, and Parallel Processing
Figure 25.11 The Flynn-Johnson classification of computer systems.
Flynn's categories cross single/multiple instruction streams with single/multiple data streams:
• SISD (single instruction, single data stream): uniprocessors
• SIMD (single instruction, multiple data streams): array or vector processors
• MISD (multiple instructions, single data stream): rarely used
• MIMD (multiple instructions, multiple data streams): multiprocessors or multicomputers

Johnson's expansion subdivides MIMD by memory organization and communication style:
• GMSV (global memory, shared variables): shared-memory multiprocessors
• GMMP (global memory, message passing): rarely used
• DMSV (distributed memory, shared variables): distributed shared memory
• DMMP (distributed memory, message passing): distributed-memory multicomputers
SIMD Architectures
Data parallelism: executing one operation on multiple data streams
Concurrency in time – vector processing Concurrency in space – array processing
Example to provide context
Multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement in vector processing (details in the first half of Chapter 26)
One instruction is fetched and decoded for the entire operation The multiplications are known to be independent (no checking) Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar (details in the second half of Chapter 26)
MISD Architecture Example
Figure 25.12 Multiple instruction streams operating on a single data stream (MISD).
[Figure: instruction streams 1-5 operate in sequence on a single data stream flowing from "data in" to "data out".]
MIMD Architectures
Control parallelism: executing several instruction streams in parallel
GMSV: Shared global memory – symmetric multiprocessors DMSV: Shared distributed memory – asymmetric multiprocessors DMMP: Message passing – multicomputers
Figure 27.1 (centralized shared memory): p processors connect through a processor-to-memory network to m memory modules, with a separate processor-to-processor network and parallel I/O.
Figure 28.1 (distributed memory): p computing nodes, each combining a processor, local memory, and a router, communicate through an interconnection network with parallel input/output.
Amdahl’s Law Revisited
[Figure 4.4: speedup s (0 to 50) vs. enhancement factor p (0 to 50), plotted for f = 0, 0.01, 0.02, 0.05, and 0.1.]

Amdahl's law: the speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast is

s = 1 / (f + (1 – f)/p) ≤ min(p, 1/f)

f = sequential fraction
p = speedup of the rest with p processors
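The formula is worth evaluating once to see how binding the sequential fraction is. A quick sketch:

```python
# Amdahl's law: s = 1 / (f + (1 - f)/p), where f is the sequential
# fraction and p is the speedup applied to the remaining 1 - f.

def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

# Even with only 5% sequential work, 50 processors give well under
# the min(p, 1/f) = 20 bound:
print(round(amdahl_speedup(0.05, 50), 2))  # 14.49
print(amdahl_speedup(0.0, 10))             # 10.0: the f = 0 ideal
```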
26 Vector and Array Processing

Single instruction stream operating on multiple data streams:
• Data parallelism in time = vector processing
• Data parallelism in space = array processing
Topics in This Chapter
26.1 Operations on Vectors
26.2 Vector Processor Implementation
26.3 Vector Processor Performance
26.4 Shared-Control Systems
26.5 Array Processor Implementation
26.6 Array Processor Performance
26.1 Operations on Vectors
Sequential processor:

  for i = 0 to 63 do
    P[i] := W[i] × D[i]
  endfor

Vector processor:

  load W
  load D
  P := W × D
  store P

Unparallelizable (loop-carried dependence):

  for i = 0 to 63 do
    X[i+1] := X[i] + Z[i]
    Y[i+1] := X[i+1] + Y[i]
  endfor
First Example
• Overlap load and store
• Pipeline chaining
  – http://www.pcc.qub.ac.uk/tec/courses/cray/stu-notes/CRAY-completeMIF_3.html
Second Example
for (i = 0; i < 64; i++) {
  A[i] = A[i] + B[i];
  B[i+1] = C[i] + D[i];
}
• Data dependency
• Exercise
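As a worked version of the exercise (a sketch, with Python lists standing in for the arrays): the dependence is that B[i+1], written in iteration i, is read as B[i] in iteration i+1. Computing all of B first and all of A second gives the same result while making each loop independently vectorizable.

```python
# Loop-carried dependence A[i] += B[i]; B[i+1] = C[i] + D[i], and an
# equivalent restructuring into two dependence-free loops.
import random
random.seed(0)

n = 64
A = [random.random() for _ in range(n)]
B = [random.random() for _ in range(n + 1)]
C = [random.random() for _ in range(n)]
D = [random.random() for _ in range(n)]

# Original, strictly sequential form:
A1, B1 = A[:], B[:]
for i in range(n):
    A1[i] = A1[i] + B1[i]
    B1[i + 1] = C[i] + D[i]

# Restructured: B is computed one full step "ahead", then A; each loop
# is now a plain vector operation.
A2, B2 = A[:], B[:]
for i in range(n):
    B2[i + 1] = C[i] + D[i]
for i in range(n):
    A2[i] = A2[i] + B2[i]

print(A1 == A2 and B1 == B2)  # True
```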
Useful Vector Operations
• A = c * B
• A = B * B
• s = summation(A)
Three Key Features
• Deeply pipelined units for high throughput
• Plenty of long vector registers
• High-bandwidth data transfer paths between memory and vector registers
• Helpful: pipeline chaining (forwarding result values from one unit to another)
26.2 Vector Processor Implementation
Figure 26.1 Simplified generic structure of a vector processor.
[Figure: a vector register file, filled from scalar registers and by load units A and B from the memory unit, feeds three pipelined function units through forwarding muxes; results return to the register file or leave through a store unit.]
The Memory Bandwidth Problem
• The most challenging aspect of vector processor design (page 495)
• No cache; memory is highly interleaved (Section 17.4)
  – 64-256 memory banks are not unusual
  – For peak performance, a memory bank must not be accessed again until all other memory banks have been accessed
Continued
• No problem for sequential, stride-1 access
• Any stride that is relatively prime to the number of banks is no problem
• Unfortunately, …
• E.g., a 64 × 64 matrix stored in row-major order but accessed by columns (stride 64)
Continued
• Skewed storage scheme
  – Elements of row i are rotated to the right by i positions
  – Stride becomes 65
  – Adds small overhead due to a more complex address translation
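The bank-conflict arithmetic can be checked directly. A sketch (the bank-mapping functions are illustrative, assuming 64 banks and low-order interleaving):

```python
# Column access in a 64x64 row-major matrix: conventional storage maps
# every element of a column to the same bank; skewing row i right by
# i positions (effective stride 65) spreads a column over all 64 banks.

BANKS = 64

def conventional_bank(i, j):
    return (i * 64 + j) % BANKS

def skewed_bank(i, j):
    return (i * 64 + (j + i) % 64) % BANKS

conv = {conventional_bank(i, 0) for i in range(64)}   # banks hit by column 0
skew = {skewed_bank(i, 0) for i in range(64)}
print(len(conv), len(skew))  # 1 64: total conflict vs. conflict-free
```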
Conflict-Free Memory Access
Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory. Elements of column 0 are highlighted in both diagrams.
[Figure: (a) conventional row-major order: every element of a column falls in the same bank; (b) skewed row-major order: row i is rotated right by i positions, spreading the elements of column 0 (highlighted) across all 64 banks.]
Vector machine instructions
• Similar in spirit to MMX instructions
  – MMX vectors are short and fixed in length
  – Vector machine vectors are long and variable in length (set by a parameter)
• Typical instructions (non-arithmetic)
  – Vector load and store with implicit or explicitly specified stride
Continued
• Vector load and store with memory addresses specified in an index vector
• Vector comparisons on an element-by-element basis
• Data copying among registers, including vector-length and mask registers
  – Mask registers allow selectivity
Overlapped Memory Access and Computation
Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme. Solid (dashed) lines show data flow in the current (next) segment.
[Figure: six vector registers in a double-buffering scheme; load X and load Y fill one pair of registers for the next segment (dashed lines) while the pipelined adder consumes the current pair and store Z drains results (solid lines).]
Vector Processor Performance
• Ideal performance is only theoretical
• Start-up time
  – Becomes negligible only when vectors are very long
  – A few to many tens of cycles
• Half-performance vector length
  – The length at which performance is half that for infinitely long vectors
  – Tens to a hundred
Continued
• Resource idle time and contention
  – S = V × W + X × Y: with only one multiply unit, the adder will go unused
• Memory bank contention
  – Severe when vector elements are accessed in irregular positions (index vectors)
• Vector segmentation
  – Needed when the entire vector does not fit in a vector register
  – E.g., a vector of length 65!
• Conditional computations
  – A mask determines the selection
Conditional Computation
• Example: if (A(i)) … else …
  – Say the mask is 110101, i.e., true for elements 0, 1, 3, 5
  – Two stages:
    • First stage: elements 0, 1, 3, 5 execute the "if" part
    • Second stage: elements 2 and 4 execute the "else" part
  – Computation time roughly doubles
• Switch statement?
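The two-stage masked execution can be sketched as follows (function and argument names are illustrative):

```python
# Masked conditional vector computation: elements selected by the mask
# run the "if" operation in stage 1; the complement runs the "else"
# operation in stage 2, so the time roughly doubles.

def masked_if_else(x, mask, then_op, else_op):
    out = list(x)
    for i, m in enumerate(mask):       # stage 1: masked elements
        if m:
            out[i] = then_op(x[i])
    for i, m in enumerate(mask):       # stage 2: complement of the mask
        if not m:
            out[i] = else_op(x[i])
    return out

# Mask 110101: elements 0, 1, 3, 5 take the "if" path.
print(masked_if_else([1, 2, 3, 4, 5, 6], [1, 1, 0, 1, 0, 1],
                     lambda v: v * 10, lambda v: -v))
# [10, 20, -3, 40, -5, 60]
```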
Multilane Vector Processors
• 4 lanes
  – Lane 0 handles elements 0, 4, 8, …
  – Four pipelined adders operate on the lanes' registers in parallel
• Corresponds to multiple lanes on a highway
26.3 Vector Processor Performance
Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining.

[Figure: without chaining, the addition pipeline (with its own start-up) begins only after the multiplication finishes; with chaining, multiply results are forwarded directly to the adder, overlapping the two operations and shortening the total time.]
Performance as a Function of Vector Length
Figure 26.5 The per-element execution time in a vector processor as a function of the vector length.
[Figure: clock cycles per vector element (0-5) vs. vector length (0-400); the per-element time falls toward one cycle as the start-up overhead is amortized over longer vectors.]
26.4 Shared-Control Systems
Figure 26.6 From completely shared control to totally separate controls.
(a) Shared-control array processor, SIMD: a single control unit drives all processing units.
(b) Multiple shared controls, MSIMD: a few control units each drive their own group of processing units.
(c) Separate controls, MIMD: every processing unit has its own control unit.
Simple Shared-Control
• SIMD
• Array processors
  – Centralized control
  – Memory and processors are distributed in space
Example Array Processor
Figure 26.7 Array processor with 2D torus interprocessor communication network.
[Figure: a control unit broadcasts instructions to a 2D processor array whose rows and columns wrap around as a torus; switches at the row ends connect the array to parallel I/O.]
Memory Bandwidth
• Ceases to be a major bottleneck
• A vector of length m can be distributed to the local memories of p processors, with each one receiving m/p elements
• Designs range from 64-256 very fast processors to 64K simple processors
Downside
• Vector operations that require interprocessor communication
  – Summation
  – X[i] × Y[m-1-i]: processor i needs information from processor m-1-i
Pattern of Connectivity
• Linear array (ring)
  – Each processor is connected to two neighbors
• Highly sophisticated networks
• Square or rectangular 2D mesh
  – A torus additionally connects the endpoints of the rows and columns
Array Processor Implementation
• Assume 2D torus interconnection
• Switches allow data to flow around the row loops in the clockwise or counterclockwise direction
• The control unit resembles a scalar processor
  – Fetch, decode, broadcast control signals
Continued
• Register file
• ALU
• Data memory
• New: branching based on the state of the processor array
  – A single bit of information from each processor
Continued
• Array state register
  – Population count instruction
• Processors within the array
  – Bare-bones: no fetch, no decode
  – Processing elements (PEs): register file, ALU, memory
Continued
• Some local autonomy
  – Say we load using r2 as a pointer
  – Different processors may have different values in r2
26.5 Array Processor Implementation
Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding.
[Figure: each PE couples an ALU and register file with a data memory; a communication buffer, steered by CommunDir and CommunEn, exchanges data with the four NEWS (north, east, west, south) neighbors, and a PE state flip-flop feeds the array state register.]
Configuration Switches
Figure 26.9 I/O switch states in the array processor of Figure 26.7.
[Figure: the row-end switches of the Figure 26.7 array in three states: (a) torus operation, closing each row into a loop; (b) clockwise I/O, opening the loops so data streams in and out in one direction; (c) counterclockwise I/O, the same in the reverse direction.]
Array Processor Performance
• Signal travel time
• Embarrassingly parallel applications
  – Signal and image processing
  – Database management and data mining
  – Data fusion in command and control
• Data distribution
  – Which processor has what data in its memory?
26.6 Array Processor Performance
Array processors perform well for the same class of problems that are suitable for vector processors.

For embarrassingly (pleasantly) parallel problems, array processors can be faster and more energy-efficient than vector processors.

A criticism of array processing: for conditional computations, a significant part of the array remains idle while the "then" part is performed; subsequently, idle and busy processors reverse roles during the "else" part.

However: considering array processors inefficient due to idle processors is like criticizing mass transportation because many seats are unoccupied most of the time.

It's the total cost of computation that counts, not hardware utilization!
27 Shared-Memory Multiprocessing

Multiple processors sharing a memory unit seems naïve:
• Didn't we conclude that memory is the bottleneck?
• How then does it make sense to share the memory?
Topics in This Chapter
27.1 Centralized Shared Memory
27.2 Multiple Caches and Cache Coherence
27.3 Implementing Symmetric Multiprocessors
27.4 Distributed Shared Memory
27.5 Directories to Guide Data Access
27.6 Implementing Asymmetric Multiprocessors
Robert Tomasulo
• Looking forward, 1998
• Technology advances in both hardware and software have rendered most Instruction Set Architecture conflicts moot. … Computer design is focusing more on the memory bandwidth and multiprocessing interconnection problems.
Parallel Processing as a Topic of Study
Graduate course ECE 254B: Adv. Computer Architecture – Parallel Processing

An important area of study that allows us to overcome fundamental speed limits

Our treatment of the topic is quite brief (Chapters 26-27)
Memory
• It is the speed of memory, not the processor, that limits performance.
• Why should we expect a single memory unit to keep up with several processors?
• Solution: provide a private cache for each processor. But then cache consistency becomes the issue!
27.1 Centralized Shared Memory
Figure 27.1 Structure of a multiprocessor with centralized shared-memory.
[Figure: p processors connect through a processor-to-memory network to m memory modules; a separate processor-to-processor network links the processors, and parallel I/O attaches alongside.]
Two Extremes
• A single shared bus
• A permutation network that allows any processor to read or write any memory module.
• Most networks fall between these extremes.
Processor-to-memory connection
• Most critical component
• Must support conflict-free and high-speed access to memory by several processors at once
• Bandwidth is more important than latency, because multithreading and other methods allow processors to do useful work while waiting for data
Multistage Interconnection Network
• Multiple paths
• Large switches arranged in layers
• Cost/performance trade-offs
• One of the oldest and most versatile is the butterfly network
• Self-routing network
Butterfly Network
• 2^q rows and q+1 columns of switches
• Each switch in row r has a straight connection to the switch in the same row of the next column and a cross connection to row r ± 2^c, where c is the column number
• A unique path between input i and output j can be determined from i and j (in binary)
Example
• Say we want to go from row 2 to row 4, i.e., 010 to 100
• XOR the two numbers, getting the routing tag 110
• Read that routing tag backwards, one bit per column: 0 means go straight, 1 means take the cross connection
• This is not a permutation network
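The routing rule can be sketched in a few lines. This models one switch-numbering convention (column c consumes tag bit c, i.e., the tag is read least-significant bit first, matching "read the tag backwards"); real machines may order the columns the other way.

```python
# Self-routing in a 2^q-row butterfly: the routing tag is src XOR dst;
# at column c the switch goes straight if tag bit c is 0, or takes the
# cross link to row r +/- 2^c if it is 1.

def butterfly_route(src, dst, q):
    tag = src ^ dst
    row, path = src, [src]
    for c in range(q):
        if (tag >> c) & 1:
            row ^= 1 << c          # cross connection: flip bit c of the row
        path.append(row)
    return path

# The slide's example: row 2 (010) to row 4 (100), routing tag 110.
print(butterfly_route(2, 4, 3))  # [2, 2, 0, 4]
```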
Processor-to-Memory Interconnection Network
Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection network in a multiprocessor.
(a) Butterfly network (b) Beneš network
[Figure: (a) a butterfly network whose 2×2 switches, arranged in rows 0-7, connect 16 processors to 16 memories; (b) the related Beneš network, a butterfly followed by its mirror image, providing multiple paths between each processor-memory pair.]
Processor-to-Memory Interconnection Network
Figure 27.3 Interconnection of eight processors to 256 memory banks in Cray Y-MP, a supercomputer with multiple vector processors.
[Figure: the eight processors reach 256 interleaved memory banks through a hierarchy of switches; the banks are divided into 4 sections by bank number mod 4 (e.g., section 0 holds banks 0, 4, 8, …, 252), and each section into subsections of 8 banks.]
Shared-Memory Programming: Broadcasting

Copy B[0] into all B[i] so that multiple processors can read its value without memory access conflicts:

for k = 0 to ⌈log2 p⌉ – 1
  processor j, 0 ≤ j < 2^k (with j + 2^k < p), do
    B[j + 2^k] := B[j]
endfor

Recursive doubling: the number of copies doubles in each of the ⌈log2 p⌉ steps.
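The steps above can be sketched in code. This is an illustrative sketch, not the book's program: the function name is mine, and the parallel steps are simulated by a sequential inner loop.

```python
# Recursive-doubling broadcast: copy B[0] into every B[j].
# Each of the ceil(log2 p) outer steps models one parallel step, in which
# every processor already holding the value copies it 2**k positions ahead,
# doubling the number of copies.
import math

def broadcast(B):
    p = len(B)
    for k in range(math.ceil(math.log2(p))) if p > 1 else []:
        step = 2 ** k
        for j in range(step):          # processors 0 .. 2^k - 1 hold the value
            if j + step < p:
                B[j + step] = B[j]
    return B

print(broadcast([7] + [0] * 11))       # all 12 entries become 7
```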
Shared-Memory Programming: Summation

Sum reduction of vector X:

processor j, 0 ≤ j < p, do Z[j] := X[j]
s := 1
while s < p
  processor j, 0 ≤ j < p – s, do
    Z[j + s] := Z[j] + Z[j + s]
  s := 2 × s
endwhile

Recursive doubling again: after ⌈log2 p⌉ steps, Z[j] holds the prefix sum X[0] + ... + X[j] (the text illustrates the ranges 0:0, 0:1, ..., 0:9 emerging step by step for p = 10), so Z[p – 1] is the total.
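A sequential sketch of this reduction follows; it is an illustration under my own naming, and the copy of Z models all processors reading old values simultaneously within one parallel step.

```python
def sum_reduce(X):
    # Recursive-doubling sum reduction; Z[j] ends up holding X[0]+...+X[j],
    # so Z[p-1] is the total. Each while-iteration models one parallel step.
    Z = list(X)
    p = len(Z)
    s = 1
    while s < p:
        old = list(Z)                  # simulate simultaneous reads
        for j in range(p - s):
            Z[j + s] = old[j] + old[j + s]
        s *= 2
    return Z

Z = sum_reduce(list(range(10)))
print(Z[-1])                           # 45 = 0 + 1 + ... + 9
```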
Drawbacks
• Hot-spot contention: shared data that are continually being accessed by many processors
• The interconnection network becomes a bottleneck
Solution?
• Reduce the frequency of accesses to main memory by equipping each processor with a private cache
• A hit rate of 95% reduces the traffic through the processor-to-memory network to 5% of what it would otherwise be
27.2 Multiple Caches and Cache Coherence
Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems.
[Figure: p processors with private caches connected to m memory modules through a processor-to-memory network, with a separate processor-to-processor network and parallel I/O.]
Status of Data Copies
Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches.
[Figure: the processors cache copies of data blocks w, x, y, z held in the memory modules; a cached block may be multiple consistent, single consistent, single inconsistent, or invalid.]
Cache Coherence Algorithms
• Designate each cache line as exclusive or shared
• A processor may modify only an exclusive cache line
• To modify a shared line, first change its status to exclusive, which invalidates all copies in other caches
Continued
• This involves broadcasting the intention to modify the line
• Such a scheme is called a write-invalidate algorithm
Snoopy Protocol
• Particular implementation of write-invalidate
• Requires a shared bus
• Snoop control lines monitor the bus
• Each cache line is shared, exclusive, or invalid
Continued
• Transitions:
– CPU read hits cause no problem
– CPU write hit to an exclusive line: no change
– CPU write hit to a shared line: change the state to exclusive and signal a write miss to the other caches, causing them to invalidate their copies
Continued
• CPU write miss to an exclusive line: write the line back to memory and change its state to invalid
• A request for access to an exclusive cache line forces the owner to make sure that stale data in main memory are not used
A Snoopy Cache Coherence Protocol

Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches.

[Figure: three states per cache line with these transitions:
– Invalid → Shared on CPU read miss (signal read miss on bus); Invalid → Exclusive on CPU write miss (signal write miss on bus)
– Shared → Shared on CPU read hit; Shared → Exclusive on CPU write hit (signal write miss on bus); Shared → Invalid on bus write miss
– Exclusive → Exclusive on CPU read or write hit; Exclusive → Shared on bus read miss (write back cache line); Exclusive → Invalid on bus write miss (write back cache line)
The accompanying sketch shows several processor/cache (P/C) pairs sharing a bus with memory.]
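The per-line controller of Figure 27.5 can be written down as a transition table. This is a sketch under my own naming conventions (state and event names are mine); the actions are the bus signals from the figure.

```python
# Snoopy write-back coherence FSM as a (state, event) -> (next_state, action)
# table; unknown combinations raise a KeyError.
INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"

TRANSITIONS = {
    (INVALID,   "cpu_read_miss"):  (SHARED,    "signal read miss on bus"),
    (INVALID,   "cpu_write_miss"): (EXCLUSIVE, "signal write miss on bus"),
    (SHARED,    "cpu_read_hit"):   (SHARED,    None),
    (SHARED,    "cpu_write_hit"):  (EXCLUSIVE, "signal write miss on bus"),
    (SHARED,    "bus_write_miss"): (INVALID,   None),
    (EXCLUSIVE, "cpu_read_hit"):   (EXCLUSIVE, None),
    (EXCLUSIVE, "cpu_write_hit"):  (EXCLUSIVE, None),
    (EXCLUSIVE, "bus_read_miss"):  (SHARED,    "write back cache line"),
    (EXCLUSIVE, "bus_write_miss"): (INVALID,   "write back cache line"),
}

def step(state, event):
    return TRANSITIONS[(state, event)]

print(step(SHARED, "cpu_write_hit"))   # ('exclusive', 'signal write miss on bus')
```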
27.3 Implementing Symmetric Multiprocessors
Figure 27.6 Structure of a generic bus-based symmetric multiprocessor.
[Figure: computing nodes (typically 1-4 CPUs and caches per node) and interleaved memory on a very wide, high-bandwidth bus, with I/O modules attached through bus adapters and standard interfaces.]
Bus Bandwidth Limits Performance (Example 27.1)

Consider a shared-memory multiprocessor built around a single bus with a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and each instruction requires access to an average of 1.4 memory words (including the instruction itself). The combined hit rate for caches is 98%. Compute an upper bound on the multiprocessor performance in GIPS. Address lines are separate and do not affect the bus data bandwidth.

Solution

Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112 B. Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. Assuming a bus width of 32 B, no bus cycle or data going to waste, and a bus clock rate of y GHz, the performance bound becomes 286y GIPS. This bound is highly optimistic. Buses operate in the range 0.1 to 1 GHz, so a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is beyond reach with this type of architecture.
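The arithmetic of Example 27.1 is easy to reproduce; the function below is my own packaging of the example's numbers, not code from the text.

```python
# Upper bound on GIPS for a bus-based SMP: with a 98% combined hit rate,
# each instruction moves 1.4 * 0.02 words of 4 B over the bus on average.
def gips_bound(bus_gb_per_s, words_per_instr=1.4, miss_rate=0.02, word_bytes=4):
    bytes_per_instr = words_per_instr * miss_rate * word_bytes   # 0.112 B
    return bus_gb_per_s / bytes_per_instr

print(round(gips_bound(1.0), 2))   # 8.93 GIPS per GB/s of bus bandwidth
print(round(gips_bound(32.0)))     # 286 GIPS for a 32-B-wide, 1 GHz bus
```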
Split-transaction Protocol
• The bus is released between the transmission of memory address and the return of data.
• There may be several outstanding memory requests in progress at the same time.
Snoop-based Coherence
• Assume one level of cache for each processor
• Assume an atomic bus, which imposes a total order on bus transactions
Implementing Snoopy Caches
Figure 27.7 Main structure for a snoop-based cache coherence algorithm.
[Figure: the cache data array is shared between a processor-side controller, which uses the main tags and state store, and a snoop-side controller on the system bus, which uses a duplicate tags and state store; address/command buffers, comparators, and the snoop state link the two sides.]
Example
• Two processors simultaneously issue requests to write into a shared cache line, each trying to change its state from shared to exclusive
• The bus arbiter allows one of them to go first
• The other processor must then invalidate the line and issue a different request (a write miss)

Continued
• To handle this, include intermediate states in the state diagram, e.g., a shared-to-exclusive intermediate state
• The outcome of the pending request takes the line to either exclusive or invalid
Two Useful Mechanisms
• Locking of shared resources
– Requires atomic instructions; otherwise two processors may both read 0 and then both write 1
• Barrier synchronization
– Counter incremented by each arrival (stop until it reaches p); an obvious source of contention
– Alternatively, a logical AND of per-processor flags
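The counter-based barrier can be sketched with threads. This is a minimal single-use illustration of the idea (class and variable names are mine); the contention the slide mentions is on the single shared counter and its lock.

```python
# Counter-based barrier: each arriving thread increments a shared counter
# under a lock and waits until it reaches p; the last arrival wakes everyone.
import threading

class CounterBarrier:
    def __init__(self, p):
        self.p = p
        self.count = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            self.count += 1
            if self.count == self.p:
                self.cond.notify_all()             # last arrival releases all
            else:
                self.cond.wait_for(lambda: self.count >= self.p)

arrived = []
barrier = CounterBarrier(4)

def worker(i):
    barrier.wait()                                 # no thread passes early
    arrived.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(arrived))                                # 4
```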
Distributed Shared Memory
• Centralized memory is a scalability problem!
• Does sharing of a global address space require that the memory be centralized?
• No: NUMA (nonuniform memory access)
• A reference to a nonlocal address initiates a remote access, e.g., via DMA controllers
27.4 Distributed Shared Memory
Figure 27.8 Structure of a distributed shared-memory multiprocessor.
[Figure: p processors, each with local memory, connected by routers to an interconnection network with parallel input/output. The shared variables x = 0, y = 1, and z = 0 reside in different nodes' memories; one processor executes y := -1; z := 1 while another executes while z = 0 do; x := x + y; endwhile.]
Sequential Consistency
• The example on the previous slide is nondeterministic: everything depends on the order of execution
• The only expectation is that the write y := -1 will complete before z := 1
• Then any processor that sees the new value of z will also see the new value of y
Continued
• Ensuring sequential consistency is not easy and comes with a performance penalty
• It is insufficient to ensure that y changes before z changes:
– A processor with different latencies to y and z may see these events in the wrong order
– Recall the astronomer example from the text
Directories
• Problem: dynamic data distribution
• Management overhead: keeping track of the whereabouts of various data elements so as to find them when needed
Assumptions
• Each data line has a statically assigned home location within the memory modules but may have multiple cached copies at various locations
• Associated with each data element is a directory entry that specifies which caches currently hold a copy of the data (the sharing set)
Continued
• If the number of processors is small, a bit vector is quite efficient: with 32 processors, the sharing set for a line fits in one 32-bit word
• Alternatively, provide an explicit list of the caches holding a copy; efficient when sharing sets are small
• Neither method is scalable
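The bit-vector representation is a few lines of bit manipulation. A sketch, with my own class and method names, assuming at most 32 processors as in the slide:

```python
# Bit-vector sharing set: with at most 32 processors, a directory entry's
# sharing set fits in one 32-bit word, one bit per cache.
class SharingSet:
    def __init__(self):
        self.bits = 0                              # empty sharing set

    def add(self, c):
        self.bits |= (1 << c)                      # cache c holds a copy

    def remove(self, c):
        self.bits &= ~(1 << c)                     # cache c invalidated

    def holds(self, c):
        return bool(self.bits & (1 << c))

    def members(self):
        return [c for c in range(32) if self.holds(c)]

ss = SharingSet()
ss.add(3); ss.add(17)
print(ss.members())                                # [3, 17]
ss.remove(3)
print(ss.members())                                # [17]
```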
A Scalable Method
• Scalable Coherent Interface (SCI) standard
• The directory entry points to one of the caches holding a copy; each cached copy contains a pointer to the next cached copy
• A distributed directory scheme (a linked list of sharers)
27.5 Directories to Guide Data Access
Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor.
[Figure: p nodes, each with a processor and cache, a directory, and a memory module, joined through communication and memory interfaces to an interconnection network with parallel input/output.]
Triggering Events
• Messages sent by caches to the directory holding the requested line
• Cache c signals a read miss; if the line is
– Shared: data sent to c, c added to the sharing set (SS)
– Exclusive: line fetched from the owner, data sent to c, c added to SS
– Uncached: data sent to c, SS becomes {c}
In all cases the line becomes shared
• Cache c signals a write miss; if the line is
– Shared: invalidation messages sent to SS, SS becomes {c}, data sent to c
– Exclusive: fetch/invalidate sent to the owner, data sent to c, SS becomes {c}
– Uncached: data sent to c, SS becomes {c}
In all cases the line becomes exclusive
• If a writeback message arrives from a cache that needs to replace a dirty line, the state becomes uncached and SS becomes empty
Cache-only Memory Architecture
• COMA
• Data lines have no fixed homes
• The entire memory associated with each PE is organized as a cache with each line having one or more temporary copies
Shared Virtual Memory
• When a page fault occurs in a node, the page fault handler obtains the page
• CPU and OS get involved in the process
• Requires little hardware support
Directory-Based Cache Coherence

Figure 27.10 States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache).

[Figure: three states per line with these transitions:
– Uncached → Shared on read miss (return value, set sharing set to {c}); Uncached → Exclusive on write miss (return value, set sharing set to {c})
– Shared → Shared on read miss (return value, include c in sharing set); Shared → Exclusive on write miss (invalidate all cached copies, set sharing set to {c}, return value)
– Exclusive → Shared on read miss (fetch data from owner, return value, include c in sharing set); Exclusive → Exclusive on write miss (fetch data from owner, request invalidation, return value, set sharing set to {c}); Exclusive → Uncached on data write-back (set sharing set to { })]
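The directory-entry transitions of Figure 27.10 can be expressed as a function. This is a sketch: the function signature and the action strings are my wording, and c is the requesting cache while ss is the current sharing set.

```python
# Directory FSM: (state, event, sharing set, requester) -> (next state,
# new sharing set, action). Mirrors the transitions of Figure 27.10.
def directory_step(state, event, ss, c):
    if state == "uncached":
        if event == "read_miss":
            return "shared", {c}, "return value"
        if event == "write_miss":
            return "exclusive", {c}, "return value"
    if state == "shared":
        if event == "read_miss":
            return "shared", ss | {c}, "return value"
        if event == "write_miss":
            return "exclusive", {c}, "invalidate all cached copies; return value"
    if state == "exclusive":
        if event == "read_miss":
            return "shared", ss | {c}, "fetch data from owner; return value"
        if event == "write_miss":
            return "exclusive", {c}, "fetch from owner; invalidate; return value"
        if event == "write_back":
            return "uncached", set(), "update memory"
    raise ValueError((state, event))

print(directory_step("shared", "write_miss", {0, 2}, 1))
# ('exclusive', {1}, 'invalidate all cached copies; return value')
```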
27.6 Implementing Asymmetric Multiprocessors
Figure 27.11 Structure of a ring-based distributed-memory multiprocessor.
[Figure: nodes 0-3, each typically containing 1-4 CPUs and associated memory, joined by link units into a ring network; each link also connects to memory and to I/O controllers.]
Link
• The key component: allows communication with all other nodes and enforces the coherence protocol
• Contains a cache, a local directory, and interface controllers for both sides

Continued
• On a cache miss, the request is sent via the ring
• Coherence:
– Within a node, snooping ensures coherence among the processor caches, the link's cache, and main memory
– Internode coherence uses a directory-based protocol
Scalable Coherent Interface (SCI)

[Figure: four processors with caches (nodes 0-3) and their associated memories, with connections to the interconnection network.]
28 Distributed Multicomputing

Computer architects' dream: connect computers like toy blocks
• Building multicomputers from loosely connected nodes
• Internode communication is done via message passing
Topics in This Chapter
28.1 Communication by Message Passing
28.2 Interconnection Networks
28.3 Message Composition and Routing
28.4 Building and Using Multicomputers
28.5 Network-Based Distributed Computing
28.6 Grid Computing and Beyond
28.1 Communication by Message Passing
Figure 28.1 Structure of a distributed multicomputer.
[Figure: p computing nodes, each consisting of a processor with memory plus a router, connected to an interconnection network with parallel input/output.]
How Is It Different?
• From a shared-memory multiprocessor:
– Data transmission does not occur as a result of memory access requests
– Message-passing commands are explicitly activated by the processor
– There is usually more than one connection from each node to the interconnection network

Continued
• A shared medium is controlled by bus acquisition
• A direct, or point-to-point, network involves forwarding
Router Design
Figure 28.2 The structure of a generic router.
[Figure: input channels pass through link controllers (LC) into input queues (Q); a switch, directed by routing and arbitration logic, moves data to output queues and link controllers on the output channels; injection and ejection channels with a message queue serve the local node.]
Routing Decisions
• Routing tables within the router
• Routing tags carried with messages
• Packet routing: store and forward
• Wormhole routing: a small portion (a flit, or flow control digit) is forwarded with the expectation that the rest will follow
Switches
• May contain queues and routing logic that make them similar to routers
• In a circuit-switched interconnection network, a path is established first and data are then sent over that path
Crossbar Switches
• Have p² crossover points connecting p inputs to p outputs
Building Networks from Switches
The four settings of a 2 × 2 switch: straight through, crossed connection, lower broadcast, upper broadcast

Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities.
[The butterfly and Beneš networks of Figure 27.2, repeated here, are examples built from such switches.]
Interprocess Communication via Messages
Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes.
[Figure: Process A eventually executes send x; Process B reaches receive x earlier and is suspended; after the communication latency, B is awakened with the value of x and both proceed.]
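The blocking send/receive synchronization of Figure 28.4 can be sketched with a thread-safe queue standing in for the channel; this is an illustration (names and the queue-based channel are mine), not the book's primitives.

```python
# Blocking send/receive: the receiver suspends on an empty channel and is
# awakened when the message arrives, as in Figure 28.4.
import threading, queue

channel = queue.Queue()
log = []

def process_a():
    channel.put("x")               # send x
    log.append("A: sent x")

def process_b():
    x = channel.get()              # receive x (suspends until available)
    log.append(f"B: received {x}")

tb = threading.Thread(target=process_b); tb.start()
ta = threading.Thread(target=process_a); ta.start()
ta.join(); tb.join()
print("B: received x" in log)      # True
```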
Message-Passing Interface
• The MPI library provides collective operations: broadcasting, scattering, gathering, complete exchange
• It also provides persistent communication and barrier synchronization
Interconnection Networks
• Analogy of a power grid: some nodes are merely relays or voltage converters rather than producers or consumers
• Indirect network: some nodes are noncomputing nodes, i.e., switches and routers
28.2 Interconnection Networks
Figure 28.5 Examples of direct and indirect interconnection networks.
(a) Direct network (b) Indirect network
[Figure: in the direct network each node pairs with its own router; in the indirect network the nodes attach to a separate fabric of routers.]
Direct Interconnection
Networks
Figure 28.6 A sampling of common direct interconnection networks. Only routers are shown; a computing node is implicit for each router.
(a) 2D torus (b) 4D hypercube
(c) Chordal ring (d) Ring of rings
Indirect Interconnection Networks
Figure 28.7 Two commonly used indirect interconnection networks.
(a) Hierarchical buses (b) Omega network
Level-1 bus
Level-2 bus
Level-3 bus
Two Key Parameters
• Network diameter: defined in terms of the number of hops, i.e., router-to-router transfers
– For the 2D torus of Figure 28.6: 4
• A larger diameter implies higher worst-case latency
Continued
• Bisection width: the minimum number of channels that must be cut to divide the p-node network into two parts containing ⌈p/2⌉ and ⌊p/2⌋ nodes
– For the 2D torus of Figure 28.6: 8
• The greater this width, the higher the available communication bandwidth
• Important when communication patterns are random
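The diameter can be checked mechanically by breadth-first search over hops. A sketch (function names mine), assuming a k × k 2D torus with wraparound links in both dimensions:

```python
# Diameter of a k x k 2D torus computed by BFS from every node; for the
# 4 x 4 torus this gives 4, matching the slide.
from collections import deque

def torus_neighbors(k, node):
    r, c = node
    return [((r + 1) % k, c), ((r - 1) % k, c),
            (r, (c + 1) % k), (r, (c - 1) % k)]

def diameter(k):
    worst = 0
    nodes = [(r, c) for r in range(k) for c in range(k)]
    for src in nodes:
        dist = {src: 0}
        q = deque([src])
        while q:                        # BFS counts hops from src
            u = q.popleft()
            for v in torus_neighbors(k, u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

print(diameter(4))   # 4
```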
28.3 Message Composition and Routing
Figure 28.8 Messages and their parts for message passing.
[Figure: a message is divided into packets, the last of which may carry padding; each transmitted packet consists of a header, the data or payload, and a trailer, and is further divided into flow control digits (flits).]
Wormhole Switching
Figure 28.9 Concepts of wormhole switching.
[Figure: (a) two worms en route to their respective destinations, worm 1 moving while worm 2 is blocked; (b) deadlock due to circular waiting of four blocked worms, each blocked at the point of an attempted right turn.]
Routing Algorithms
• Nondeterministic
– Adaptive
– Probabilistic or random (e.g., for fairness)
• Oblivious
– The unique path can be determined at the source
Continued
• Row-first routing on 2D mesh or torus networks: the message travels along its source row and then down the appropriate column
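Row-first (dimension-order) routing is easy to make concrete. A sketch on a 2D mesh, with coordinates and the function name being my own choices:

```python
# Row-first routing on a 2D mesh: travel along the row to the destination
# column, then along the column; deterministic and oblivious, since the
# whole path follows from source and destination alone.
def row_first_path(src, dst):
    (r, c), (rd, cd) = src, dst
    path = [(r, c)]
    while c != cd:                 # first correct the column offset
        c += 1 if cd > c else -1
        path.append((r, c))
    while r != rd:                 # then correct the row offset
        r += 1 if rd > r else -1
        path.append((r, c))
    return path

print(row_first_path((0, 0), (2, 3)))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
```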
28.4 Building and Using Multicomputers
Figure 28.10 A task system and schedules on 1, 2, and 3 computers.
[Figure: (a) a static task graph in which inputs feed tasks A-H, with running times of 1, 2, or 3 units, producing outputs; (b) schedules of the eight tasks on 1, 2, and 3 computers over a time scale of 0 to 15.]
Optimal Schedule?
• Finding one is extremely difficult:
– Task running times are data-dependent
– Not all tasks are known a priori
– Communication time must be considered
– Message passing does not occur only at task boundaries
– Can every node execute every task?
Two-tiered Approach
• Assign tasks and monitor progress and communication behavior
• Reassign tasks for overloaded nodes and excessive communication
• Load balancing
Building Multicomputers from Commodity Nodes
Figure 28.11 Growing clusters using modular nodes.
[Figure: (a) current racks of modules, each module holding CPU(s), memory, and disks, with expansion slots; (b) futuristic toy-block construction with wireless connection surfaces.]
Network-based Distributed Computing
• What if we drop the assumptions of uniformity and regularity (identical or similar computing nodes, regular interconnections)?
• Mix general-purpose and special-purpose resources
• Use idle time on PCs
Clusters
• Connected by a LAN
• Limited geographic span
• Also referred to as networks of workstations
Metacomputer
• Connected by a WAN; the Internet is an extreme example
• Nodes are complete computers: end systems or platforms
• Requires a coordination strategy: client-server or peer-to-peer
Peer-to-peer
• Middleware helps coordinate the actions of the nodes; e.g., a group communication facility may be used to notify all participating processes of changes in the environment
• Sterling has argued that, per MFLOPS, a supercomputer costs 10 times as much as a PC
28.5 Network-Based Distributed Computing
Figure 28.12 Network of workstations.
[Figure: PCs attach via their system or I/O bus to fast network interfaces (NICs) with large memory, joined by a network built of high-speed wormhole switches.]
Extensions to the OS
• Maintaining status information for efficient use and recovery
• Task partitioning and scheduling
• Initial assignment of tasks to nodes and, possibly, load balancing
• Distributed file system management
Continued
• Compilation of system usage data for planning and accounting
• Enforcement of intersystem privacy and security policies
• Some or all of these functions will eventually be incorporated into standard operating systems
28.6 Grid Computing and Beyond

A computational grid is analogous to the power grid: it decouples the "production" and "consumption" of computational power. Homes don't have an electricity generator; why should they have a computer?

Advantages of a computational grid:
• Near-continuous availability of computational and related resources
• Resource requirements based on a sum of averages, rather than a sum of peaks
• Paying for services based on actual usage rather than peak demand
• Distributed data storage for higher reliability, availability, and security
• Universal access to specialized and one-of-a-kind computing resources

Still to be worked out: how to charge for computation usage
What Will It Take?
• Interaction requirements are quite different from the way systems interact today:
– More emphasis on communication capabilities and interfaces
– Hardware and OS support for cluster coordination and communication with the outside world
Continued
• Improved communication performance, using lighter-weight interaction much as in clusters, is needed
• Internets generally have no centralized control whatsoever; wide geographic distribution leads to nonnegligible signal propagation times
Continued
• There are security, interorganizational, and international issues in data exchange