1.3 Convergence of Parallel Architectures
TRANSCRIPT
Programming Model
• Conceptualization of the machine that the programmer uses in coding applications
– How parts cooperate and coordinate their activities
– Specifies communication and synchronization operations
• Multiprogramming
– no communication or synch. at program level
• Shared address space
– like a bulletin board
• Message passing
– like letters or phone calls; explicit point-to-point
• Data parallel
– more regimented; global actions on data
– Implemented with shared address space or message passing
Shared Memory => Shared Addr. Space
• Bottom-up engineering factors
• Programming concepts
• Why it's attractive
Adding Processing Capacity
[Figure: processors, memory modules, and I/O controllers and devices attached to a shared interconnect]
• Memory capacity increased by adding modules
• I/O by controllers and devices
• Add processors for processing! – For higher-throughput multiprogramming, or parallel programs
Historical Development
[Figure: "mainframe" organization: processors and I/O channels connected to memory modules through a crossbar]
• “Mainframe” approach
– Motivated by multiprogramming
– Extends crossbar used for Mem and I/O
– Processor cost-limited => crossbar
– Bandwidth scales with p
– High incremental cost
» use multistage instead
[Figure: "minicomputer" organization: processors with caches and I/O controllers sharing a single bus to memory]
• “Minicomputer” approach
– Almost all microprocessor systems have bus
– Motivated by multiprogramming and transaction processing (TP)
– Used heavily for parallel computing
– Called symmetric multiprocessor (SMP)
– Latency larger than for uniprocessor
– Bus is bandwidth bottleneck
» caching is key: coherence problem
– Low incremental cost
Shared Physical Memory
• Any processor can directly reference any memory location
• Any I/O controller can access any memory location
• Operating system can run on any processor, or all
– OS uses shared memory to coordinate
• Communication occurs implicitly as result of loads and stores
• What about application processes?
Shared Virtual Address Space
• Process = address space plus thread of control
• Virtual-to-physical mappings can be established so that processes share portions of the address space
– User-kernel or multiple processes
• Multiple threads of control on one address space
– Popular approach to structuring OS's
– Now standard application capability (ex: POSIX threads)
• Writes to shared addresses are visible to other threads
– Natural extension of the uniprocessor model
– conventional memory operations for communication
– special atomic operations for synchronization
» also load/stores
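The model above can be made concrete with a tiny POSIX-threads program (POSIX threads are named on this slide; the variable and function names below are illustrative only, not from the original material). Communication is just ordinary loads and stores on a shared variable, and a mutex plays the role of the special synchronization operation. A minimal sketch:

/* Minimal sketch of the shared-address-space model with POSIX threads.
 * Communication happens through ordinary loads/stores on the shared
 * variable `shared_x`; a mutex provides the synchronization operation. */
#include <pthread.h>
#include <stdio.h>

static int shared_x = 0;                 /* visible to every thread        */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&lock);           /* synchronization operation      */
    shared_x += (int)id;                 /* ordinary store = communication */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("shared_x = %d\n", shared_x); /* every thread saw the same X    */
    return 0;
}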
Structured Shared Address Space
[Figure: virtual address spaces for a collection of processes communicating via shared addresses; each process (P0 ... Pn) has a private portion and a shared portion of its address space, and the shared portions map to common physical addresses in the machine physical address space]
• Ad hoc parallelism used in system code
• Most parallel applications have structured SAS
• Same program on each processor
– shared variable X means the same thing to each thread
Engineering: Intel Pentium Pro Quad
[Figure: four P-Pro modules (CPU, interrupt controller, 256-KB L2 cache, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), plus a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI buses with PCI I/O cards]
– All coherence and multiprocessing glue in processor module
– Highly integrated, targeted at high volume
– Low latency and bandwidth
Engineering: SUN Enterprise
[Figure: Sun Enterprise CPU/mem card: two processors, each with on-chip and external ($2) caches, plus memory controller and bus interface/switch on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]
• Proc + mem cards and I/O cards
– 16 cards of either type
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
[Figure: Sun Enterprise I/O card: bus interface, SBUS slots, 2 FibreChannel, 100bT Ethernet, and SCSI, also plugged into the Gigaplane bus]
Scaling Up
[Figure: "dance hall" organization (all memory modules on the far side of the network from the processors) vs. distributed-memory organization (a memory module attached to each processor node)]
– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
» latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
» Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?
Engineering: Cray T3E
[Figure: Cray T3E node: processor, cache, local memory, and a combined memory controller / network interface, plus external I/O; nodes connected by a 3-D torus switch (X, Y, Z links)]
– Scales up to 1024 processors, 480 MB/s links
– Memory controller generates request message for non-local references
– No hardware mechanism for coherence
» SGI Origin etc. provide this
[Figure: shared memory, message passing, data parallel (SIMD), dataflow, and systolic arrays all converging toward a generic architecture]
Message Passing Architectures
• Complete computer as building block, including I/O
– Communication via explicit I/O operations
• Programming model
– direct access only to private address space (local memory),
– communication via explicit messages (send/receive)
• High-level block diagram
– Communication integration?
» Mem, I/O, LAN, Cluster
– Easier to build and scale than SAS
• Programming model more removed from basic hardware operations
– Library or OS intervention
[Figure: complete computers (processor, cache, memory) connected by a general interconnection network]
Message-Passing Abstraction
[Figure: process P executes Send X, Q, t and process Q executes Receive Y, P, t; the tag t matches, and local address X in P's address space is copied to local address Y in Q's address space]
– Send specifies buffer to be transmitted and receiving process
– Recv specifies sending process and application storage to receive into
– Memory to memory copy, but need to name processes
– Optional tag on send and matching rule on receive
– User process names local data and entities in process/tag space too
– In simplest form, the send/recv match achieves pairwise synch event
» Other variants too
– Many overheads: copying, buffer management, protection
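As a concrete illustration of this abstraction, here is a minimal sketch in MPI (MPI is not named on the slide; it is used only because its primitives map directly onto send/recv with process naming and tags). Process 0 sends its local X to process 1, which receives it into its local Y, with tag 7 as the matching rule.

/* Sketch of the send/recv abstraction: buffer, destination/source
 * process, and a tag used as the matching rule on the receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x = 0, y = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* process P: send local X to Q   */
        x = 42;
        MPI_Send(&x, 1, MPI_INT, /*dest=*/1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* process Q: receive into local Y */
        MPI_Recv(&y, 1, MPI_INT, /*src=*/0, /*tag=*/7, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Q received %d\n", y);     /* memory-to-memory copy done      */
    }
    MPI_Finalize();
    return 0;
}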
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
– HW close to programming model
– synchronous ops
– topology central (hypercube algorithms)
[Figure: 3-D hypercube with nodes labeled 000 through 111]
CalTech Cosmic Cube (Seitz, CACM Jan 85)
Diminishing Role of Topology
• Shift to general links
– DMA, enabling non-blocking ops
» Buffered by system at destination until recv
– Store&forward routing
• Diminishing role of topology
– Any-to-any pipelined routing
– node-network interface dominates communication time
– Simplifies programming
– Allows richer design space
» grids vs hypercubes
Store-and-forward: H × (T0 + n/B)    vs.    pipelined (cut-through) routing: T0 + H·Δ + n/B
(H = number of hops, T0 = per-hop/startup overhead, n = message size, B = link bandwidth, Δ = per-hop routing delay)
Intel iPSC/1 -> iPSC/2 -> iPSC/860
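To see why pipelined routing diminishes the role of topology, plug in some illustrative numbers (these are assumptions for the example, not figures from the slide): H = 10 hops, T0 = 1 µs per-hop overhead, Δ = 0.1 µs per-hop routing delay, n = 1000 bytes, B = 100 MB/s, so n/B = 10 µs. Store-and-forward costs H × (T0 + n/B) = 10 × (1 + 10) = 110 µs, while pipelined routing costs T0 + H·Δ + n/B = 1 + 1 + 10 = 12 µs. In the second expression the hop count multiplies only the small per-hop delay Δ, so distance between nodes, and hence topology, matters far less.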
Example Intel Paragon
[Figure: Intel Paragon node: two i860 processors with L1 caches (one acting as message processor), DMA, network interface (NI), memory controller, and 4-way interleaved DRAM on a 64-bit, 50 MHz memory bus; nodes attached to a 2-D grid network with 8-bit, 175 MHz bidirectional links and a processing node at every switch]
Sandia's Intel Paragon XP/S-based supercomputer
Building on the mainstream: IBM SP-2
• Made out of essentially complete RS6000 workstations
[Figure: IBM SP-2 node: an essentially complete RS/6000 (Power 2 CPU, L2 cache, memory controller, 4-way interleaved DRAM on the memory bus) with a NIC (i860, NI, DMA, DRAM) on the MicroChannel I/O bus; nodes connected by a general interconnection network formed from 8-port switches]
• Network interface integrated in I/O bus (bw limited by I/O bus)
Berkeley NOW
• 100 Sun Ultra2 workstations
• Intelligent network interface
– proc + mem
• Myrinet network
– 160 MB/s per link
– 300 ns per hop
Toward Architectural Convergence
• Evolution and role of software have blurred the boundary
– Send/recv supported on SAS machines via buffers
– Can construct global address space on MP (global address -> processor | local address)
– Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
– Tighter NI integration even for MP (low latency, high bandwidth)
– Hardware SAS passes messages
• Even clusters of workstations/SMPs are parallel systems– Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging– Nodes connected by general network and communication assists
– Implementations also converging, at least in high-end machines
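One side of this convergence, send/recv layered on a shared address space via buffers, can be sketched with a one-slot shared mailbox. The code below is an illustrative assumption of how such a layer might look, not a description of any machine named above: the buffer and its "full" flag live in shared memory, and a blocking send/recv pair is built from ordinary stores plus an atomic flag.

/* Sketch: send/recv implemented on top of a shared address space.
 * One-slot buffer; a real implementation would use queues and avoid
 * busy-waiting, this is only to show the layering. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int mailbox;                      /* the shared buffer (one message) */
static atomic_int full = 0;              /* 1 = message waiting in mailbox  */

static void send_msg(int value)
{
    while (atomic_load(&full))           /* wait until the slot is free     */
        ;
    mailbox = value;                     /* copy message into shared buffer */
    atomic_store(&full, 1);              /* make it visible to the receiver */
}

static int recv_msg(void)
{
    while (!atomic_load(&full))          /* wait until a message arrives    */
        ;
    int value = mailbox;                 /* copy out of the shared buffer   */
    atomic_store(&full, 0);
    return value;
}

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++)
        send_msg(i);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    for (int i = 0; i < 3; i++)
        printf("received %d\n", recv_msg());
    pthread_join(t, NULL);
    return 0;
}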
Data Parallel Systems
• Programming model
– Operations performed in parallel on each element of data structure
– Logically single thread of control, performs sequential or parallel steps
– Conceptually, a processor associated with each data element
• Architectural model
[Figure: control processor broadcasting instructions to a 2-D grid of processing elements (PEs)]
– Array of many simple, cheap processors with little memory each
» Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization
• Original motivations
– Matches simple differential equation solvers
– Centralize high cost of instruction fetch/sequencing
Application of Data Parallelism
– Each PE contains an employee record with his/her salary
If salary > 100K then
    salary = salary * 1.05
else
    salary = salary * 1.10
– Logically, the whole operation is a single step
– Some processors are enabled for the arithmetic operation, others disabled (see the sketch below)
• Other examples:– Finite differences, linear algebra, ...
– Document searching, graphics, image processing, ...
• Some recent machines:
– Thinking Machines CM-1, CM-2 (and CM-5)
– MasPar MP-1 and MP-2
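A loop-level sketch of the salary example follows (the array contents and size are made up for illustration). On a real SIMD machine each PE would hold one record and the control processor would broadcast the instruction, with the if/else acting as the enable/disable mask; the sequential loop here only stands in for the PE array.

/* Data-parallel salary update: logically one step applied to every
 * element; the branch plays the role of enabling/disabling PEs. */
#include <stdio.h>

#define N 8

int main(void)
{
    double salary[N] = {120e3, 80e3, 95e3, 150e3, 60e3, 101e3, 99e3, 200e3};

    for (int i = 0; i < N; i++) {        /* logically a single parallel step */
        if (salary[i] > 100e3)           /* PEs with salary > 100K enabled   */
            salary[i] *= 1.05;
        else                             /* remaining PEs enabled for this arm */
            salary[i] *= 1.10;
    }

    for (int i = 0; i < N; i++)
        printf("%.0f\n", salary[i]);
    return 0;
}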
Flynn’s Taxonomy
• # instruction streams × # data streams
– Single Instruction Single Data (SISD)
– Single Instruction Multiple Data (SIMD)
– Multiple Instruction Single Data (MISD)
– Multiple Instruction Multiple Data (MIMD)
• Everything is MIMD!
Evolution and Convergence
• SIMD popular when the cost savings of a centralized sequencer were high
– 1960s, when a CPU was a cabinet
– Replaced by vectors in mid-70s
» More flexible w.r.t. memory layout and easier to manage
– Revived in mid-80s when 32-bit datapath slices just fit on chip
• Simple, regular applications have good locality
• Programming model converges with SPMD (single program multiple data)
– need fast global synchronization
– Structured global address space, implemented with either SAS or MP
CM-5
• Repackaged SparcStation
– 4 per board
• Fat-Tree network
• Control network for global synchronization
[Figure: dataflow graph for a = (b + 1) × (b − c), d = c × e, f = a × d, with nodes distributed across the network]
Dataflow Architectures
• Represent computation as a graph of essential dependences
– Logical processor at each node, activated by availability of operands
[Figure: dataflow processor pipeline: token queue, waiting-matching section with token store, instruction fetch from program store, execute, form token, and network ports]
– Messages (tokens) carrying the tag of the next instruction are sent to the next processor
– Tag compared with others in matching store; match fires execution
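A minimal sketch of the firing rule for the dataflow graph shown earlier (a = (b + 1) × (b − c), d = c × e, f = a × d). The node structure and helper names below are invented for illustration: an instruction fires only once both operand tokens have arrived, which is the job of the waiting-matching store described above.

/* Dataflow firing sketch: each node waits for both operand tokens,
 * then fires and forwards its result token to the consuming node. */
#include <stdio.h>
#include <stdbool.h>

struct node {
    char   op;                 /* '+', '-', '*' */
    double opnd[2];
    bool   ready[2];
};

static void send_token(struct node *n, int slot, double v)
{
    n->opnd[slot]  = v;                      /* token delivers an operand   */
    n->ready[slot] = true;
}

static bool try_fire(struct node *n, double *result)
{
    if (!(n->ready[0] && n->ready[1]))
        return false;                        /* still waiting in match store */
    switch (n->op) {
    case '+': *result = n->opnd[0] + n->opnd[1]; break;
    case '-': *result = n->opnd[0] - n->opnd[1]; break;
    default : *result = n->opnd[0] * n->opnd[1]; break;
    }
    return true;
}

int main(void)
{
    double b = 2, c = 3, e = 4, t1 = 0, t2 = 0, a = 0, d = 0, f = 0;
    struct node add = {'+'}, sub = {'-'}, mulA = {'*'}, mulD = {'*'}, mulF = {'*'};

    /* initial tokens enter the graph */
    send_token(&add, 0, b);  send_token(&add, 1, 1);
    send_token(&sub, 0, b);  send_token(&sub, 1, c);
    send_token(&mulD, 0, c); send_token(&mulD, 1, e);

    /* each result token is forwarded to the next instruction's input slot */
    if (try_fire(&add, &t1)) send_token(&mulA, 0, t1);
    if (try_fire(&sub, &t2)) send_token(&mulA, 1, t2);
    if (try_fire(&mulA, &a)) send_token(&mulF, 0, a);
    if (try_fire(&mulD, &d)) send_token(&mulF, 1, d);
    if (try_fire(&mulF, &f)) printf("f = %g\n", f);   /* (2+1)*(2-3)*3*4 = -36 */
    return 0;
}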
Evolution and Convergence
• Key characteristics
– Ability to name operations, synchronization, dynamic scheduling
• Problems
– Operations have locality across them, useful to group together
– Handling complex data structures like arrays
– Complexity of matching store and memory units
– Expose too much parallelism (?)
• Converged to use conventional processors and memory
– Support for large, dynamic set of threads to map to processors
– Typically shared address space as well
– But separation of programming model from hardware (like data-parallel)
• Lasting contributions:
– Integration of communication with thread (handler) generation
– Tightly integrated communication and fine-grained synchronization
– Remained useful concept for software (compilers etc.)
Systolic Architectures
• VLSI enables inexpensive special-purpose chips
– Represent algorithms directly by chips connected in regular pattern
– Replace single processor with array of regular processing elements
– Orchestrate data flow for high throughput with less memory access
[Figure: conventional processor fetching from memory vs. data streaming from memory through a regular array of PEs and back to memory]
• Different from pipelining
– Nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory
• SIMD? No: each PE may do something different
Systolic Arrays (contd.)
y(i) = w1 × x(i) + w2 × x(i + 1) + w3 × x(i + 2) + w4 × x(i + 3)
[Figure: systolic array for 1-D convolution: x values (x1, x2, ... x8) stream through a chain of cells holding weights w4, w3, w2, w1 while results y1, y2, y3, ... emerge]
Example: Systolic array for 1-D convolution
– Practical realizations (e.g. iWARP) use quite general processors
» Enable variety of algorithms on same hardware
– But dedicated interconnect channels
» Data transfer directly from register to register across channel
– Specialized, and same problems as SIMD
» General purpose systems work well for same algorithms (locality etc.)
Each cell holds a weight w and an x register; per step:
xout = x
x = xin
yout = yin + w × xin
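A small simulation of this array, applying exactly the three per-cell relations above in lock-step. The weight values, the input stream, and the cycle-level injection schedule (one x and one fresh zero partial sum entering at the left edge per beat) are assumptions made for the sketch; with them, x values advance one cell every two beats while partial sums advance one cell per beat, so each y(i) accumulates its four w × x terms as it flows through.

/* Systolic 1-D convolution sketch: cells hold w4 w3 w2 w1 left to right
 * and apply  yout = yin + w*xin,  xout = x,  x = xin  each beat. */
#include <stdio.h>

#define NCELL 4
#define NX    8

int main(void)
{
    double w[NCELL]    = {4, 3, 2, 1};    /* w4=4, w3=3, w2=2, w1=1          */
    double xreg[NCELL] = {0};             /* per-cell x register (the delay) */
    double xin[NCELL]  = {0};             /* link register: x entering cell  */
    double yin[NCELL]  = {0};             /* link register: y entering cell  */
    double xs[NX] = {1, 2, 3, 4, 5, 6, 7, 8};
    double xout[NCELL], yout[NCELL];

    for (int t = 0; t < NX + 2 * NCELL; t++) {
        for (int j = 0; j < NCELL; j++) { /* all cells step in lock-step     */
            yout[j] = yin[j] + w[j] * xin[j];  /* yout = yin + w * xin       */
            xout[j] = xreg[j];                 /* xout = x                   */
            xreg[j] = xin[j];                  /* x    = xin                 */
        }
        /* fully formed y(i) values leave the last cell once the pipe fills  */
        if (t >= 2 * NCELL - 1 && t <= NX + NCELL - 1)
            printf("y(%d) = %g\n", t - 2 * NCELL + 2, yout[NCELL - 1]);
        for (int j = NCELL - 1; j >= 1; j--) { /* one-beat hop between cells */
            xin[j] = xout[j - 1];
            yin[j] = yout[j - 1];
        }
        xin[0] = (t < NX) ? xs[t] : 0;    /* next x enters at the left edge  */
        yin[0] = 0;                       /* fresh zero partial sum each beat */
    }
    return 0;
}

With w1..w4 = 1..4 and x = 1..8 this prints y(1) = 30 through y(5) = 70, matching the convolution formula above.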
Architecture
• Two facets of Computer Architecture:
– Defines Critical Abstractions
» especially at HW/SW boundary
» set of operations and data types these operate on
– Organizational structure that realizes these abstractions
• Parallel Computer Arch. = Comp. Arch. + Communication Arch.
• Comm. architecture has the same two facets
– communication abstraction
– primitives at user/system and hw/sw boundary
Layered Perspective of PCA
[Figure: layered view of parallel computer architecture:
Parallel applications (CAD, database, scientific modeling)
Programming models (multiprogramming, shared address, message passing, data parallel)
Communication abstraction (user/system boundary), realized by compilation or library
Operating systems support
Communication hardware (hardware/software boundary)
Physical communication medium]
Communication Architecture
User/System Interface + Organization
• User/System Interface:
– Comm. primitives exposed to user level by HW and system-level SW
• Implementation:
– Organizational structures that implement the primitives: HW or OS
– How optimized are they? How integrated into the processing node?
– Structure of network
• Goals:
– Performance
– Broad applicability
– Programmability
– Scalability
– Low Cost
Communication Abstraction
• User-level communication primitives provided
– Realizes the programming model
– Mapping exists between language primitives of programming model and these primitives
• Supported directly by hw, or via OS, or via user sw
• Much debate about what to support in SW and where the gap between layers lies
• Today:
– HW/SW interface tends to be flat, i.e. complexity roughly uniform
– Compilers and software play important roles as bridges today
– Technology trends exert strong influence
• Result is convergence in organizational structure
– Relatively simple, general-purpose communication primitives
Understanding Parallel Architecture
• Traditional taxonomies not very useful
• Programming models not enough, nor hardware structures
– Same one can be supported by radically different architectures
=> Architectural distinctions that affect software
– Compilers, libraries, programs
• Design of user/system and hardware/software interface
– Constrained from above by programming models and below by technology
• Guiding principles provided by layers
– What primitives are provided at communication abstraction
– How programming models map to these
– How they are mapped to hardware