CSE 160 – Lecture 2

Today’s Topics

• Flynn’s Taxonomy
• Bit-Serial, Vector, Pipelined Processors
• Interconnection Networks
  – Topologies
  – Routing
  – Embedding
• Network Bisection

Taxonomy

• Flynn (1966) classified machines by data and control streams
  – Single Instruction, Single Data (SISD)
  – Single Instruction, Multiple Data (SIMD)
  – Multiple Instruction, Single Data (MISD)
  – Multiple Instruction, Multiple Data (MIMD)

SIMD

• SIMD
  – All processors execute the same program in lockstep
  – Data that each processor sees is different
  – Single control processor
  – Individual processors can be turned on/off at each cycle
  – Illiac IV, CM-2, MasPar are some examples
  – Silicon Graphics Reality Graphics engine
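The lockstep-with-masking idea can be illustrated with a small simulation. This is only a sketch: `simd_step` and the sample data are hypothetical names and values chosen for illustration, and a Python list stands in for the processor array.

```python
def simd_step(op, data, active):
    """Apply one instruction `op` to every lane whose mask bit is on."""
    return [op(x) if on else x for x, on in zip(data, active)]

data = [1, -2, 3, -4]
# "Instruction 1": negate, but only lanes holding a negative value
# are switched on for this cycle.
mask = [x < 0 for x in data]
data = simd_step(lambda x: -x, data, mask)
# "Instruction 2": all lanes on, every processor doubles its element.
data = simd_step(lambda x: 2 * x, data, [True] * len(data))
print(data)  # [2, 4, 6, 8]
```

Every lane sees the same instruction in the same step; only the data and the on/off mask differ per lane.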

MIMD

• All processors execute their own set of instructions
• Processors operate on separate data streams
• No centralized clock implied
• SP-2, T3E, clusters, Crays, etc.

SPMD/MPMD

• Single/Multiple Program Multiple Data
• SPMD processors run the same program, but processors are not necessarily run in lockstep
• Very popular and scalable programming style
• MPMD is similar except that different processors run different programs
  – PVM distribution has some simple examples

Processor Types

• Four types
  – Bit serial
  – Vector
  – Cache-based, pipelined
  – Custom (e.g., Tera MTA or KSR-1)

Bit Serial

• Only seen in SIMD machines like CM-2 or MasPar
• Each clock cycle, one bit of the data is loaded/written
  – Simplifies memory system and memory trace count
• Popular for very dense (64K) processor arrays

Cache-based, Pipelined

• Garden-variety microprocessor
  – Sparc, Intel x86, MC68xxx, MIPS, …
  – Register-based ALUs and FPUs
  – Registers are of scalar type
• Pipelined execution to improve performance of individual chips
  – Splits up components of a basic operation like addition into stages
  – The more stages, the greater the potential speedup, but the more problems with branching and data/control hazards
• Per-processor caches make it challenging to build SMPs (coherency issues)
• Now dominates the high-end market

Vector Processors

• Very specialized (and expensive: $$$$$) machines
• Registers are true vectors with power-of-2 lengths
• Designed to efficiently perform matrix-style operations
  – Ax = b (b(I) = sum over J of A(I,J)*x(J))
  – Vector registers V1, V2, V3
    • V1 = A(I,*), V2 = x(*)
    • MULV V3(I), V1, V2
• “Chaining” to efficiently handle vectors larger than the vector registers
• Cray, Hitachi, SGI (now Cray SV-1) are examples
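The matrix-vector product on this slide can be sketched in plain Python. The sketch assumes V1 holds one row of A and V2 holds the vector x; `mulv` and `matvec` are hypothetical helper names, and ordinary lists stand in for vector registers.

```python
def mulv(v1, v2):
    """Element-wise multiply of two equal-length vector 'registers'."""
    return [a * b for a, b in zip(v1, v2)]

def matvec(A, x):
    """b(I) = sum over J of A(I,J) * x(J), one row (vector load) at a time."""
    b = []
    for row in A:            # V1 = A(I,*)
        v3 = mulv(row, x)    # V3 = V1 * V2, with V2 = x(*)
        b.append(sum(v3))    # reduce V3 to the scalar b(I)
    return b

A = [[1, 2], [3, 4]]
x = [10, 1]
print(matvec(A, x))  # [12, 34]
```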

Some Custom Processors

• Denelcor HEP/Tera MTA
  – Multiple register sets
    • Stack pointer, instruction pointer, frame pointer, etc.
    • Facilitates hardware threads
    • Switch each clock cycle to a different register set
  – Why? Stalls on the memory subsystem in one thread can be hidden by concurrency
• KSR-1
  – Cache-only memory processor
  – Basically 2 generations behind standard micros

Going Parallel

• In the late 70’s, even vector “monsters” started to go parallel
• For parallel processing to work, individual processors must synchronize
  – SIMD: synchronize every clock cycle
  – MIMD: explicit synchronization
    • Message passing
    • Semaphores, monitors, fetch-and-increment
  – Focus on interconnection networks for rest of lecture

Characterizing Networks

• Bandwidth
• Device/switch latency
• Switching types
  – Circuit switched (e.g., telephone)
  – Packet switched (e.g., Internet)
    • Store and forward
    • Virtual cut-through
    • Wormhole routed
• Topology
  – Number of connections
  – Diameter (how many hops through switches)

Latency

• Latency is the amount of time between issuing a command and seeing its first effect
  – Time from pushing on the gas pedal until the car goes forward
  – Time from entering a line until the cashier starts on your job
  – Time from the first bit leaving computer A until the first bit arrives at computer B
  OR
  – (Message latency) Time from the first bit leaving computer A until the last bit arrives at computer B
• Startup latency is the amount of time to send a zero-length message

Bandwidth

• Bits/second that can travel through a connection
• A really simple model for calculating the time to send a message of N bytes
  – Time = latency + N/bandwidth (with bandwidth expressed in bytes/second)
• Bisection is the minimum number of wires that must be cut to divide a network of machines into two equal halves
• Bisection bandwidth is the total bandwidth through the bisection
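The simple timing model, Time = latency + N/bandwidth, can be tried out with a few message sizes. This is a minimal sketch: the function name and the link parameters (50 microseconds, 1 Gbit/s) are assumptions for illustration, and the message size is converted from bytes to bits so the units match a bandwidth given in bits/second.

```python
def message_time(n_bytes, latency_s, bandwidth_bits_per_s):
    """Time = latency + N / bandwidth, converting N from bytes to bits."""
    return latency_s + (n_bytes * 8) / bandwidth_bits_per_s

# Hypothetical link: 50 microseconds latency, 1 Gbit/s bandwidth.
for n in (0, 1_000, 1_000_000):
    print(n, message_time(n, 50e-6, 1e9))
```

For the zero-length message the result is just the startup latency; for the large message the N/bandwidth term dominates.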

Interconnection Topologies

• Completely connected
  – Every node has a direct wire connection to every other node
  – (N x (N-1))/2 wires; clearly impractical
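To see why a completely connected network is impractical, the (N x (N-1))/2 wire count can be evaluated for a few sizes. The function name and the node counts are chosen only for illustration.

```python
def complete_wires(n):
    """Wires in a completely connected network: one per unordered node pair."""
    return n * (n - 1) // 2

for n in (4, 64, 1024):
    print(n, complete_wires(n))   # 6, 2016, 523776
```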

Line/Ring

[Figure: a line of nodes labeled 1 to 7]

• Simple interconnection

• First topology where routing is an issue
  – Routing is needed when no direct connection exists between nodes
  – To go from node 2 to node 4, a message has to pass through node 3
  – What happens if node 2 wants to communicate with node 3 at the same time that node 1 wants to communicate with node 4?
• What is the bisection of a line/ring?
• If the links are of bandwidth B, what is the bisection bandwidth?

• What is the aggregate bandwidth of the network?
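One way to explore the bisection questions is to brute-force them on small networks. The sketch below tries every split of the nodes into two equal halves and counts the links that cross the split; the helper names are hypothetical, and the exhaustive search is only practical for a handful of nodes.

```python
from itertools import combinations

def line(n):
    """Links of a line of n nodes numbered 0 .. n-1."""
    return [(i, i + 1) for i in range(n - 1)]

def ring(n):
    """A line with one extra link closing the loop."""
    return line(n) + [(n - 1, 0)]

def bisection_width(n, links):
    """Fewest links cut by any split of n nodes into two equal halves."""
    best = len(links)
    for half in combinations(range(n), n // 2):
        side = set(half)
        cut = sum((a in side) != (b in side) for a, b in links)
        best = min(best, cut)
    return best

print(bisection_width(8, line(8)))  # 1
print(bisection_width(8, ring(8)))  # 2
```

If every link has bandwidth B, the bisection bandwidth is this count multiplied by B.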

Mesh/Torus

• Generalization of line/ring to multiple dimensions
• More routes between nodes
• What is the bisection of this network?

[Figure: 3 x 7 mesh; the nodes in each row are labeled 1 to 7]

Hop Count

• Networks are measured by diameter
  – This is the minimum number of hops that a message must traverse between the two nodes that are furthest apart
  – Line: Diameter = N-1
  – 2D (NxM) Mesh: Diameter = N+M-2
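The diameter formulas can be checked on small cases with a breadth-first search from every node. This is a sketch with hypothetical helper names; a line is modeled as a mesh with a single column.

```python
from collections import deque

def mesh_links(n, m):
    """Adjacency lists for an n x m mesh; a line is the case m == 1."""
    adj = {(r, c): [] for r in range(n) for c in range(m)}
    for (r, c) in adj:
        for nr, nc in ((r + 1, c), (r, c + 1)):
            if (nr, nc) in adj:
                adj[(r, c)].append((nr, nc))
                adj[(nr, nc)].append((r, c))
    return adj

def diameter(adj):
    """Largest shortest-path hop count over all node pairs (BFS from each node)."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

print(diameter(mesh_links(7, 1)))  # 6 = N-1
print(diameter(mesh_links(3, 7)))  # 8 = N+M-2
```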

Tree-based Networks

• Nodes organized in a tree fashion (important for some global algorithms)

Diameter of this network?

Bisection, Bisection Bandwidth?

Hypercubes

[Figure: 1D, 2D, 3D, and 4D hypercubes]

Hypercubes 2

• A dimension-N hypercube is constructed by connecting the corresponding “corners” of two dimension-(N-1) hypercubes
• Relatively low wire count to build large networks
• Multiple routes from any source node to any destination node
• Exercise for the reader: what is the diameter of a K-dimensional hypercube?
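The recursive construction can be written down directly: build two copies of the smaller cube and join matching corners. This is a sketch with a hypothetical function name; nodes are numbered so that the second copy's labels are offset by 2^(n-1).

```python
def hypercube_links(n):
    """Links of an n-dimensional hypercube, built from two (n-1)-cubes."""
    if n == 0:
        return []                      # a single node, no links
    lower = hypercube_links(n - 1)     # first copy: nodes 0 .. 2^(n-1)-1
    offset = 1 << (n - 1)
    upper = [(a + offset, b + offset) for a, b in lower]   # second copy
    corners = [(i, i + offset) for i in range(offset)]     # join matching corners
    return lower + upper + corners

for n in range(1, 5):
    print(n, 2 ** n, len(hypercube_links(n)))   # dimension, nodes, links
```

The link counts come out as 1, 4, 12, and 32 for dimensions 1 to 4, far fewer than a completely connected network of the same size would need.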

Labeling/Routing in a Hypercube

• Nodes are labeled in Gray code
  – Connected neighbors have binary node numbers that differ by exactly one bit
• 3D cube

[Figure: 3D cube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111]
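Because connected nodes differ in exactly one bit, a node's neighbors can be listed by flipping each bit of its label in turn. A minimal sketch, with `neighbors` as a hypothetical helper name:

```python
def neighbors(node, dim):
    """Labels directly connected to `node`: flip each of the `dim` bits once."""
    return [node ^ (1 << k) for k in range(dim)]

for node in range(8):
    print(format(node, "03b"), [format(v, "03b") for v in neighbors(node, 3)])
```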

The e-cube routing algorithm

• Source address S = S0 S1 S2 … Sn
• Destination address D = D0 D1 D2 … Dn
• Let R = R0 R1 R2 … Rn = S ⊕ D (bitwise exclusive OR)
• The number of one bits in R indicates the distance between S and D
• Starting at S, go to the neighbor across the first dimension j where Rj = 1 (if Sj = 0, go to the neighbor where Sj = 1, and vice versa)
• Continue routing from this intermediate node: find the next Rk (k > j) that is one and go to that neighbor; repeat until D is reached

E-cube routing example

• 8-dimensional hypercube (256 nodes)
• S = 134 = 0x86 = 10000110
• D = 215 = 0xD7 = 11010111
• S ⊕ D = 0x51 = 01010001
  – Distance = 3
• S → 11000110 (198) → 11010110 (214) → 11010111 (215)
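The e-cube steps can be sketched in a few lines and checked against this example. The function name is hypothetical, and the sketch assumes bits are corrected starting from the leftmost (most significant) bit, which matches the order of the hops listed in the example.

```python
def ecube_route(src, dst, dim):
    """Hops from src to dst, correcting differing bits from the leftmost bit down."""
    path = [src]
    node = src
    r = src ^ dst                      # R = S xor D
    for k in reversed(range(dim)):     # leftmost (most significant) bit first
        if r & (1 << k):
            node ^= 1 << k             # cross dimension k to the neighbor
            path.append(node)
    return path

route = ecube_route(134, 215, 8)
print(route)                          # [134, 198, 214, 215]
print(bin(134 ^ 215).count("1"))      # 3
```

The number of hops equals the number of one bits in S xor D, as the routing slide states.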

Embedding

• A network is embeddable if its nodes and links can be mapped to a target network
• A mesh is embeddable in a hypercube
  – There is a mapping of mesh nodes and links onto hypercube nodes and links
• The dilation of an embedding is the number of links needed in the embedding (target) network to represent one link of the embedded network
  – Perfect embeddings have dilation 1
• Embedding a tree into a mesh has a dilation of 2 (see example in book)
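One standard way to embed a mesh in a hypercube is to label each mesh coordinate with a binary-reflected Gray code and concatenate the bits. The sketch below assumes mesh sides that are powers of 2; `embed` and `dilation` are hypothetical helper names, and the dilation is measured as the largest Hamming distance between the hypercube labels of two mesh neighbors.

```python
def gray(i):
    """Binary-reflected Gray code: consecutive integers differ in one bit."""
    return i ^ (i >> 1)

def embed(r, c, col_bits):
    """Hypothetical mapping: mesh node (r, c) -> hypercube label."""
    return (gray(r) << col_bits) | gray(c)

def dilation(rows, cols, col_bits):
    """Longest hypercube path (Hamming distance) used by any mesh link."""
    worst = 0
    for r in range(rows):
        for c in range(cols):
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if nr < rows and nc < cols:
                    diff = embed(r, c, col_bits) ^ embed(nr, nc, col_bits)
                    worst = max(worst, bin(diff).count("1"))
    return worst

print(dilation(4, 8, 3))  # 1: a 4 x 8 mesh fits a 5-D hypercube with dilation 1
```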

Modern Parallel Machines are Packet Switched

• Break message into smaller blocks and send these pieces through the network

• Network intermediate points (routers) can be store-and-forward or virtual cut-through
  – Store-and-forward buffers the whole packet at each switch, and the packet waits there if other packets are ahead of it on the outgoing port (congestion)
  – Virtual cut-through avoids the unconditional buffering of store-and-forward by “cutting through” the switch when the output port is free

Wormhole Routing

• Wormhole routing is a variation of virtual cut-through
  – Small headers (flow control digits, or flits) pass through the network
  – When a flit is allowed to cut through a switch, the original sender is guaranteed a clear path through that switch
  – A tail flit closes the “connection”
• Wormhole routing was defined by Seitz and is used in Myrinet, a very popular cluster interconnect

Latency of Circuit Switched and Virtual Cut Through

• Circuit-switched latency
  – (Lc/B) l + (L/B)
    • Lc = length of control packet
    • B = bandwidth
    • l = number of links
    • L = length of packet
• Virtual cut-through latency
  – (Lh/B) l + (L/B)
    • Lh = length of header packet

Store-Forward and Wormhole routing Latency

• Wormhole routing latency
  – (Lf/B) l + (L/B)
    • Lf = length of flit
• Store-and-forward latency
  – (L/B) l
• Store-and-forward latency can be much worse for many hops
• Virtual cut-through, wormhole, and circuit switching approach (L/B) as message length increases
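The latency formulas can be compared numerically. This is a sketch under assumed numbers: the link bandwidth, hop count, and header/flit sizes are hypothetical, and since circuit switching, virtual cut-through, and wormhole routing share the same form, one function covers all three with the per-hop unit (Lc, Lh, or Lf) passed in.

```python
def pipelined_latency(unit_bits, L, B, l):
    """(unit/B) * l + L/B, where unit is Lc, Lh, or Lf from the formulas above."""
    return (unit_bits / B) * l + L / B

def store_forward_latency(L, B, l):
    """(L/B) * l: the whole packet is retransmitted on every link."""
    return (L / B) * l

# Hypothetical numbers: 1 Gbit/s links, 10 hops, 64-bit header, 8-bit flit.
B, hops = 1e9, 10
for L in (1_000, 1_000_000):   # packet length in bits
    print(L,
          store_forward_latency(L, B, hops),
          pipelined_latency(64, L, B, hops),   # virtual cut-through
          pipelined_latency(8, L, B, hops))    # wormhole
```

For the long packet, the cut-through and wormhole values are close to L/B, while store-and-forward is about ten times larger with ten hops.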

Deadlock/Livelock

• Livelock/Deadlock is a potential problem in any network design.

• Livelock occurs in adaptive routing algorithms when a packet never reaches its destination

• Deadlock occurs when packets cannot be forwarded because they are waiting for other packets to move out of the way; the blocking packet is, in turn, waiting for the blocked packet to move
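Deadlock as a circular wait can be illustrated with a tiny "wait-for" graph. This is a simplified sketch, not a model of any particular router: it assumes each packet waits on at most one other packet, and `find_deadlock` and the packet names are hypothetical.

```python
def find_deadlock(waits_for):
    """Return a cycle in a packet wait-for graph, or None if there is none.

    waits_for maps each packet to the single packet it is blocked behind
    (or to None if it is free to move).
    """
    for start in waits_for:
        seen = []
        pkt = start
        while pkt is not None and pkt not in seen:
            seen.append(pkt)
            pkt = waits_for.get(pkt)
        if pkt is not None:
            return seen[seen.index(pkt):]   # the packets forming the cycle
    return None

print(find_deadlock({"A": "B", "B": "C", "C": "A"}))   # ['A', 'B', 'C']
print(find_deadlock({"A": "B", "B": "C", "C": None}))  # None
```

In the first case every packet is waiting on another packet in the loop, so none can ever move; in the second, C can move, which eventually frees B and then A.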

Next Time …

• All about clusters
• Introduction to PVM (and MPI)
