
1.3 Convergence of Parallel Architectures

Programming Model

• Conceptualization of the machine that the programmer uses in coding applications
  – How parts cooperate and coordinate their activities
  – Specifies communication and synchronization operations

• Multiprogramming
  – No communication or synchronization at program level

• Shared address space
  – Like a bulletin board

• Message passing
  – Like letters or phone calls; explicit point-to-point

• Data parallel
  – More regimented, global actions on data
  – Implemented with shared address space or message passing

Shared Memory => Shared Address Space

• Bottom-up engineering factors

• Programming concepts

• Why it's attractive

Adding Processing Capacity

[Figure: memory modules and an I/O controller on an interconnect; processors are added to the same interconnect]

• Memory capacity increased by adding modules

• I/O by adding controllers and devices

• Add processors for processing!
  – For higher-throughput multiprogramming, or parallel programs

Historical Development

• “Mainframe” approach
  [Figure: processors and I/O connected to memory modules through a crossbar]
  – Motivated by multiprogramming
  – Extends crossbar used for memory and I/O
  – Processor cost limited => crossbar
  – Bandwidth scales with p
  – High incremental cost
    » Use multistage interconnect instead

• “Minicomputer” approach
  [Figure: processors with caches and I/O controllers sharing a single bus to memory]
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming, TP
  – Used heavily for parallel computing
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for a uniprocessor
  – Bus is the bandwidth bottleneck
    » Caching is key: coherence problem
  – Low incremental cost

Shared Physical Memory

• Any processor can directly reference any memory location

• Any I/O controller can access any memory

• Operating system can run on any processor, or all
  – OS uses shared memory to coordinate

• Communication occurs implicitly as a result of loads and stores

• What about application processes?

Shared Virtual Address Space

• Process = address space plus thread of control

• Virtual-to-physical mapping can be established so that processes share portions of their address spaces
  – User-kernel or multiple processes

• Multiple threads of control on one address space
  – Popular approach to structuring OS's
  – Now standard application capability (e.g., POSIX threads)

• Writes to shared addresses are visible to other threads (see the sketch below)
  – Natural extension of the uniprocessor model
  – Conventional memory operations for communication
  – Special atomic operations for synchronization
    » Also loads/stores
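A minimal sketch of this model using POSIX threads (the names shared_sum and worker are invented for illustration): two threads communicate through an ordinary shared variable, and a mutex stands in for the special atomic operations mentioned above.

    /* Sketch: communication via ordinary loads/stores in a shared address space,
     * synchronization via a mutex (compile with -lpthread). */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_sum = 0;                      /* lives in the shared address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        int my_part = *(int *)arg;
        pthread_mutex_lock(&lock);                  /* atomic operation for synchronization */
        shared_sum += my_part;                      /* ordinary store communicates the result */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int a = 1, b = 2;
        pthread_create(&t1, NULL, worker, &a);
        pthread_create(&t2, NULL, worker, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_sum = %d\n", shared_sum);    /* both writes are visible here */
        return 0;
    }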

Structured Shared Address Space

[Figure: virtual address spaces for a collection of processes communicating via shared addresses; the shared portion of each address space maps to common physical addresses in the machine's physical address space, while each process keeps its own private portion]

• Ad hoc parallelism used in system code

• Most parallel applications have a structured shared address space

• Same program on each processor
  – Shared variable X means the same thing to each thread

Engineering: Intel Pentium Pro Quad

[Figure: four P-Pro modules (CPU, bus interface, MIU, 256-KB L2 cache, interrupt controller) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller to 1-, 2-, or 4-way interleaved DRAM and PCI bridges to PCI buses with PCI I/O cards]

– All coherence and multiprocessing glue in the processor module

– Highly integrated, targeted at high volume

– Low latency and bandwidth

Engineering: Sun Enterprise

[Figure: CPU/memory cards (two processors, each with L1 and L2 caches, plus a memory controller) and I/O cards (bus interface to SBUS slots, 2 FiberChannel, 100bT, SCSI), all plugged into the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

• Processor + memory cards and I/O cards
  – 16 cards of either type
  – All memory accessed over the bus, so symmetric
  – Higher bandwidth, higher latency bus

Scaling Up

[Figure: “dance hall” organization (processors with caches on one side of the network, memory modules on the other) vs. distributed-memory organization (a memory module attached to each processor node, nodes connected by the network)]

– Problem is the interconnect: cost (crossbar) or bandwidth (bus)

– Dance hall: bandwidth still scalable, but lower cost than crossbar
  » Latencies to memory uniform, but uniformly large

– Distributed memory or non-uniform memory access (NUMA)
  » Construct a shared address space out of simple message transactions across a general-purpose network (e.g., read-request, read-response); a sketch follows below

– Caching shared (particularly nonlocal) data?
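To make the read-request/read-response idea concrete, here is a toy single-process sketch (all names and sizes invented) that models two nodes' memories as arrays and turns a load to a non-local address into a request/response pair; on a real NUMA machine the memory controller does this in hardware.

    #include <stdio.h>

    #define NODES 2
    #define WORDS_PER_NODE 1024

    static int memory[NODES][WORDS_PER_NODE];       /* each node's local memory */

    typedef struct { int addr, value; } msg_t;

    /* "Network" transaction: the home node answers a read-request with a read-response. */
    static msg_t read_response(int home, msg_t req)
    {
        msg_t resp = { req.addr, memory[home][req.addr % WORDS_PER_NODE] };
        return resp;
    }

    /* A load in the shared address space: local addresses go straight to memory,
     * non-local ones become message transactions across the network. */
    static int global_load(int my_node, int global_addr)
    {
        int home = global_addr / WORDS_PER_NODE;
        if (home == my_node)
            return memory[home][global_addr % WORDS_PER_NODE];
        msg_t req = { global_addr, 0 };
        return read_response(home, req).value;
    }

    int main(void)
    {
        memory[1][5] = 42;                                      /* data homed on node 1 */
        printf("%d\n", global_load(0, 1 * WORDS_PER_NODE + 5)); /* node 0 reads it remotely */
        return 0;
    }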

Engineering: Cray T3E

[Figure: T3E node with processor, cache, local memory, and a combined memory controller / network interface connecting to a 3D torus switch (X, Y, Z) and external I/O]

– Scales up to 1024 processors, 480 MB/s links

– Memory controller generates a request message for non-local references

– No hardware mechanism for coherence
  » SGI Origin etc. provide this

[Figure: the architectural styles (SIMD, message passing, shared memory, dataflow, systolic arrays) converging toward a generic architecture: processor/cache/memory nodes on a scalable network]

Message Passing Architectures

• Complete computer as building block, including I/O
  – Communication via explicit I/O operations

• Programming model
  – Direct access only to private address space (local memory)
  – Communication via explicit messages (send/receive)

• High-level block diagram
  – Communication integration?
    » Mem, I/O, LAN, cluster
  – Easier to build and scale than SAS

• Programming model more removed from basic hardware operations
  – Library or OS intervention

[Figure: nodes of processor, cache, and memory connected by a general network]

Message-Passing Abstraction

[Figure: process P issues "Send X, Q, t" on local buffer X; process Q issues "Receive Y, P, t" into local buffer Y; the send and receive match on process and tag, copying data between the two local process address spaces]

– Send specifies the buffer to be transmitted and the receiving process (see the MPI sketch below)

– Recv specifies the sending process and the application storage to receive into

– Memory-to-memory copy, but need to name processes

– Optional tag on send and matching rule on receive

– User process names local data and entities in process/tag space too

– In simplest form, the send/recv match achieves a pairwise synchronization event
  » Other variants too

– Many overheads: copying, buffer management, protection
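A minimal sketch of this abstraction using MPI (assuming an MPI installation; the buffer names X and Y follow the figure). Run with two processes, e.g. mpirun -np 2 ./a.out: process 0 names process 1 and a tag in its send, and process 1's receive matches on (source, tag).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, tag = 7;
        double X = 3.14, Y = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Send buffer X to process 1 with tag t = 7. */
            MPI_Send(&X, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive into Y from process 0; the (source, tag) pair is the matching rule. */
            MPI_Recv(&Y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received Y = %f\n", Y);
        }

        MPI_Finalize();
        return 0;
    }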

Evolution of Message-Passing Machines

[Figure: CalTech Cosmic Cube (Seitz, CACM Jan. 1985): a 3-cube of nodes labeled 000 through 111, with FIFOs on the hypercube links]

• Early machines: FIFO on each link
  – Hardware close to the programming model
  – Synchronous ops
  – Topology central (hypercube algorithms)

Diminishing Role of Topology

• Shift to general links
  – DMA, enabling non-blocking ops
    » Buffered by system at destination until recv
  – Store-and-forward routing

• Diminishing role of topology
  – Any-to-any pipelined routing
  – Node-to-network interface dominates communication time
  – Simplifies programming
  – Allows richer design space
    » Grids vs. hypercubes

• Routing time for an n-byte message over H hops, store-and-forward vs. pipelined (T0 = overhead, B = bandwidth, Δ = per-hop delay); a numeric sketch follows below:

  H × (T0 + n/B)   vs.   T0 + H·Δ + n/B

• Intel iPSC/1 -> iPSC/2 -> iPSC/860
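A back-of-the-envelope comparison of the two expressions, with illustrative values (the 160 MB/s link bandwidth and 300 ns per-hop delay are borrowed loosely from the Berkeley NOW figures later in this deck; H, n, and T0 are made up):

    #include <stdio.h>

    int main(void)
    {
        double H = 10;          /* hops */
        double n = 1024;        /* message size, bytes */
        double B = 160e6;       /* link bandwidth, bytes/s */
        double T0 = 5e-6;       /* per-message overhead, s */
        double Delta = 300e-9;  /* per-hop routing delay, s */

        double store_and_forward = H * (T0 + n / B);   /* whole message buffered at each hop */
        double pipelined = T0 + H * Delta + n / B;     /* header sets up path, data streams behind */

        printf("store-and-forward: %.1f us\n", store_and_forward * 1e6);
        printf("pipelined:         %.1f us\n", pipelined * 1e6);
        return 0;
    }

With these numbers the store-and-forward latency is roughly an order of magnitude larger, which is why pipelined routing makes the topology (hop count) far less visible to the programmer.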

Example: Intel Paragon

[Figure: Intel Paragon node with two i860 processors (each with an L1 cache), a memory controller to 4-way interleaved DRAM, and a DMA engine with NI driver, all on a 64-bit, 50 MHz memory bus; nodes attach to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Photo: Sandia's Intel Paragon XP/S-based supercomputer]

Building on the mainstream: IBM SP-2

[Figure: IBM SP-2 node: Power 2 CPU with L2 cache and a memory controller to 4-way interleaved DRAM on the memory bus; the NIC (with i860 protocol processor, NI DMA, and its own DRAM) sits on the MicroChannel I/O bus; nodes connect through a general interconnection network formed from 8-port switches]

• Made out of essentially complete RS6000 workstations

• Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)

Berkeley NOW

• 100 Sun Ultra2 workstations

• Intelligent network interface
  – Processor + memory

• Myrinet network
  – 160 MB/s per link
  – 300 ns per hop

Toward Architectural Convergence

• Evolution and role of software have blurred the boundary
  – Send/recv supported on SAS machines via buffers
  – Can construct a global address space on MP (GA -> P | LA)
  – Page-based (or finer-grained) shared virtual memory

• Hardware organization converging too
  – Tighter NI integration even for MP (low latency, high bandwidth)
  – Hardware SAS passes messages

• Even clusters of workstations/SMPs are parallel systems
  – Emergence of fast system area networks (SAN)

• Programming models distinct, but organizations converging
  – Nodes connected by general network and communication assists
  – Implementations also converging, at least in high-end machines

Data Parallel Systems

• Programming model
  – Operations performed in parallel on each element of a data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor associated with each data element

• Architectural model
  [Figure: a control processor issuing instructions to a grid of processing elements (PEs)]
  – Array of many simple, cheap processors, each with little memory
    » Processors don't sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization

• Original motivations
  – Matches simple differential equation solvers
  – Centralize high cost of instruction fetch/sequencing

Application of Data Parallelism

– Each PE contains an employee record with his/her salary

      If salary > 100K then
          salary = salary * 1.05
      else
          salary = salary * 1.10

– Logically, the whole operation is a single step

– Some processors are enabled for the arithmetic operation, others disabled (see the sketch below)
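A minimal sketch of the salary step in conventional C (array contents invented): the conditional plays the role of the per-PE enable mask, and logically the whole loop is one data-parallel step applied to every record.

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double salary[N] = {40e3, 120e3, 95e3, 150e3, 60e3, 101e3, 99e3, 250e3};

        /* Conceptually a single global step: on a SIMD machine each element
         * sits in its own PE and the condition enables or disables the PE. */
        for (int i = 0; i < N; i++) {
            if (salary[i] > 100e3)
                salary[i] *= 1.05;
            else
                salary[i] *= 1.10;
        }

        for (int i = 0; i < N; i++)
            printf("%.0f\n", salary[i]);
        return 0;
    }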

• Other examples:
  – Finite differences, linear algebra, ...
  – Document searching, graphics, image processing, ...

• Some recent machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – MasPar MP-1 and MP-2

Connection Machine

[Photo: the Connection Machine (Tucker, IEEE Computer, Aug. 1988)]

Flynn's Taxonomy

• # instruction streams x # data streams
  – Single Instruction Single Data (SISD)
  – Single Instruction Multiple Data (SIMD)
  – Multiple Instruction Single Data (MISD)
  – Multiple Instruction Multiple Data (MIMD)

• Everything is MIMD!

Evolution and Convergence

• SIMD popular when cost savings of a centralized sequencer were high
  – 60s, when a CPU was a cabinet
  – Replaced by vectors in the mid-70s
    » More flexible w.r.t. memory layout and easier to manage
  – Revived in the mid-80s when 32-bit datapath slices just fit on a chip

• Simple, regular applications have good locality

• Programming model converges with SPMD (single program multiple data); see the sketch below
  – Need fast global synchronization
  – Structured global address space, implemented with either SAS or MP
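A minimal SPMD sketch (assuming MPI; sizes are illustrative): every process runs the same program, its rank selects the data it owns, and a barrier provides the global synchronization between phases.

    #include <mpi.h>
    #include <stdio.h>

    #define N 16

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double a[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Same program everywhere; the rank decides which elements this process updates. */
        for (int i = rank; i < N; i += nprocs)
            a[i] = 2.0 * i;

        MPI_Barrier(MPI_COMM_WORLD);        /* fast global synchronization between steps */
        if (rank == 0)
            printf("phase complete on %d processes\n", nprocs);

        MPI_Finalize();
        return 0;
    }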

CM-5

• Repackaged SparcStation
  – 4 per board

• Fat-tree network

• Control network for global synchronization


Dataflow Architectures

[Figure: dataflow graph for a = (b + 1) × (b − c), d = c × e, f = a × d; and a dataflow processing element: token store, waiting/matching, instruction fetch, execute, form token, token queue, connected to the network and the program store]

• Represent computation as a graph of essential dependences
  – Logical processor at each node, activated by availability of operands
  – Messages (tokens) carrying the tag of the next instruction are sent to the next processor
  – Tag compared with others in the matching store; a match fires execution (a toy firing example follows below)
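A toy illustration of the firing rule (the node and token structures here are invented for the sketch, not a real dataflow machine's format): an operation executes as soon as both of its operand tokens have arrived, independent of program order.

    #include <stdio.h>
    #include <stdbool.h>

    typedef struct {
        double operand[2];
        bool   present[2];
        char   op;               /* '+', '-', or '*' */
        const char *name;
    } node_t;

    /* Deliver a token to an operand slot; fire the node once both operands are present. */
    static void deliver(node_t *n, int slot, double value)
    {
        n->operand[slot] = value;
        n->present[slot] = true;
        if (n->present[0] && n->present[1]) {
            double r = (n->op == '+') ? n->operand[0] + n->operand[1]
                     : (n->op == '-') ? n->operand[0] - n->operand[1]
                     :                  n->operand[0] * n->operand[1];
            printf("%s fires: %g\n", n->name, r);
        }
    }

    int main(void)
    {
        /* Part of the graph on the slide: a = (b + 1) * (b - c), with b = 4, c = 1. */
        node_t plus  = { {0, 0}, {false, false}, '+', "b+1" };
        node_t minus = { {0, 0}, {false, false}, '-', "b-c" };
        double b = 4, c = 1;

        deliver(&plus, 0, b);   deliver(&plus, 1, 1);   /* token arrival order is arbitrary */
        deliver(&minus, 1, c);  deliver(&minus, 0, b);
        /* A full interpreter would route these results as tokens into the '*' node for a. */
        return 0;
    }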

Evolution and Convergence

• Key characteristics
  – Ability to name operations, synchronization, dynamic scheduling

• Problems
  – Operations have locality across them; useful to group together
  – Handling complex data structures like arrays
  – Complexity of matching store and memory units
  – Exposes too much parallelism (?)

• Converged to use conventional processors and memory
  – Support for large, dynamic set of threads to map to processors
  – Typically shared address space as well
  – But separation of programming model from hardware (like data-parallel)

• Lasting contributions:
  – Integration of communication with thread (handler) generation
  – Tightly integrated communication and fine-grained synchronization
  – Remained a useful concept for software (compilers etc.)

Systolic Architectures

[Figure: conventional organization (memory feeding a single PE) vs. systolic organization (memory feeding a chain of PEs)]

• VLSI enables inexpensive special-purpose chips
  – Represent algorithms directly by chips connected in a regular pattern
  – Replace single processor with an array of regular processing elements
  – Orchestrate data flow for high throughput with less memory access

• Different from pipelining
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory

• SIMD? No: each PE may do something different

Systolic Arrays (contd.)

Example: systolic array for 1-D convolution (a reference computation follows below)

  y(i) = w1 × x(i) + w2 × x(i+1) + w3 × x(i+2) + w4 × x(i+3)

[Figure: the inputs x1, x2, ..., x8 stream through a chain of four cells holding the weights w4, w3, w2, w1, and the results y1, y2, y3 stream out. Each cell holds a weight w and a register x and computes, per step:
  xout = x,   x = xin,   yout = yin + w × xin]

– Practical realizations (e.g. iWARP) use quite general processors
  » Enable a variety of algorithms on the same hardware

– But dedicated interconnect channels
  » Data transferred directly from register to register across a channel

– Specialized, and same problems as SIMD
  » General-purpose systems work well for the same algorithms (locality etc.)
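A reference sketch of the function the array computes (data values invented): here the four multiply-adds for each y(i) are folded into one loop, whereas in the systolic array each of them is performed by a different PE as the x stream slides past the resident weights.

    #include <stdio.h>

    #define K 4     /* number of weights, i.e. number of PEs */
    #define N 8     /* number of input samples */

    int main(void)
    {
        double w[K] = {0.25, 0.5, 0.75, 1.0};
        double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};

        for (int i = 0; i + K <= N; i++) {
            double y = 0.0;
            for (int k = 0; k < K; k++)
                y += w[k] * x[i + k];   /* w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
            printf("y(%d) = %.2f\n", i + 1, y);
        }
        return 0;
    }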

Architecture

• Two facets of computer architecture:
  – Defines critical abstractions
    » Especially at the HW/SW boundary
    » Set of operations and the data types these operate on
  – Organizational structure that realizes these abstractions

• Parallel Computer Architecture = Computer Architecture + Communication Architecture

• Communication architecture has the same two facets
  – Communication abstraction
  – Primitives at the user/system and HW/SW boundaries

Layered Perspective of PCA

[Figure: layers from top to bottom:
  Parallel applications: CAD, database, scientific modeling
  Programming models: multiprogramming, shared address, message passing, data parallel
  Communication abstraction (user/system boundary), via compilation or library
  Operating systems support
  (hardware/software boundary)
  Communication hardware
  Physical communication medium]

Communication Architecture

= User/System Interface + Organization

• User/system interface:
  – Communication primitives exposed to user level by hardware and system-level software

• Implementation:
  – Organizational structures that implement the primitives: HW or OS
  – How optimized are they? How integrated into the processing node?
  – Structure of the network

• Goals:
  – Performance
  – Broad applicability
  – Programmability
  – Scalability
  – Low cost

Communication Abstraction

• User-level communication primitives provided
  – Realizes the programming model
  – A mapping exists between language primitives of the programming model and these primitives

• Supported directly by HW, or via OS, or via user SW

• Lots of debate about what to support in SW and the gap between layers

• Today:
  – HW/SW interface tends to be flat, i.e., complexity roughly uniform
  – Compilers and software play important roles as bridges
  – Technology trends exert a strong influence

• Result is convergence in organizational structure
  – Relatively simple, general-purpose communication primitives

Understanding Parallel Architecture

• Traditional taxonomies not very useful

• Programming models not enough, nor hardware structures
  – The same model can be supported by radically different architectures

=> Architectural distinctions that affect software
  – Compilers, libraries, programs

• Design of user/system and hardware/software interfaces
  – Constrained from above by programming models and from below by technology

• Guiding principles provided by layers
  – What primitives are provided at the communication abstraction
  – How programming models map to these
  – How these are mapped to hardware