
Chapter 18: Parallel Processing
Yonsei University

Contents
• Multiple Processor Organizations
• Symmetric Multiprocessors
• Cache Coherence and the MESI Protocol
• Clusters
• Nonuniform Memory Access
• Vector Computation

Multiple Processor Organizations

Types of Parallel Processor
• Types of parallel processor systems
– Single instruction, single data stream - SISD
– Single instruction, multiple data stream - SIMD
– Multiple instruction, single data stream - MISD
– Multiple instruction, multiple data stream - MIMD

Taxonomy of Parallel Processors (figure)

SISD
• Single processor
• Single instruction stream
• Data stored in a single memory
• Uniprocessor

SIMD
• A single machine instruction controls simultaneous execution
• A number of processing elements, operating on a lockstep basis
• Each processing element has an associated data memory
• Each instruction is executed on a different set of data by the different processors
• Vector and array processors
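As a concrete illustration of lockstep execution (an addition to the slides, not from them), here is a minimal C sketch using the x86 SSE intrinsics: a single instruction, _mm_add_ps, performs four float additions simultaneously.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load four elements at once */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four additions in lockstep */
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++)
            printf("%g ", c[i]);         /* 11 22 33 44 */
        printf("\n");
        return 0;
    }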

SIMD
• With distributed memory (figure)

MISD
• A sequence of data is transmitted to a set of processors
• Each processor executes a different instruction sequence
• Has never been implemented

MIMD
• A set of processors
• Simultaneously execute different instruction sequences
• On different sets of data
• SMPs, clusters, and NUMA systems

MIMD
• With shared memory (figure)

MIMD
• With distributed memory (figure)

MIMD - Overview
• General-purpose processors
• Each can process all the instructions necessary
• Further classified by the method of processor communication

Tightly Coupled - SMP
• Processors share memory
• Communicate via that shared memory
• Symmetric multiprocessor (SMP)
– Share a single memory or pool
– Shared bus to access memory
– Memory access time to a given region of memory is approximately the same for each processor

Tightly Coupled - NUMA
• Nonuniform memory access
• Access times to different regions of memory may differ

Loosely Coupled - Clusters
• Collection of independent uniprocessors or SMPs
• Interconnected to form a cluster
• Communication via fixed paths or network connections

Design Issues
• Physical organization
• Interconnection structures
• Interprocessor communication
• Operating system design
• Application software techniques

Symmetric Multiprocessors

SMP Characteristics
• Two or more similar processors
• Share the same main memory and I/O facilities
• All processors share access to I/O devices
• All processors can perform the same functions
• The system is controlled by an integrated OS
– Providing interaction between processors
– Interaction at the job, task, file, and data-element levels

SMP Advantages
• Performance
– If some work can be done in parallel
• Availability
– Since all processors can perform the same functions, failure of a single processor does not halt the system
• Incremental growth
– Users can enhance performance by adding processors
• Scaling
– Vendors can offer a range of products based on the number of processors

Multiprogramming and Multiprocessing
• Interleaving (multiprogramming) (figure)

Multiprogramming and Multiprocessing
• Interleaving and overlapping (multiprocessing) (figure)

Tightly Coupled
• Time-shared or common bus
• Multiport memory
• Central control unit

Time-Shared Bus
• To facilitate DMA transfers from I/O processors, the following features are provided:
– Addressing
• Distinguishes modules on the bus, to determine the source and destination of data
– Arbitration
• An I/O module can temporarily function as “master”
– Time-sharing
• When one module is controlling the bus, other modules are locked out and must, if necessary, suspend operation until bus access is achieved

Time-Shared Bus (figure)

Time-Shared Bus: Advantages
• Simplicity
– The simplest approach to multiprocessor organization
• Flexibility
– Easy to expand the system by attaching more processors to the bus
• Reliability
– The bus is essentially a passive medium

Time-Shared Bus: Drawbacks
• Performance
– The speed of the system is limited by the bus cycle time
– To improve performance, each processor is equipped with a local cache
• Reduces the number of bus accesses
– This leads to problems with cache coherence
• Solved in hardware - see later

Multiport Memory
• Multiport memory allows direct, independent access to the main memory modules by each processor and I/O module
• Logic is required to resolve conflicts
• Little or no modification to processors or modules is required

Multiport Memory
• Advantages
– Little or no modification is needed for either processors or I/O modules to accommodate multiport memory
– Better performance than the bus approach
– Security
• Portions of memory can be configured as private to one or more processors and/or I/O modules
• Disadvantages
– Logic associated with the memory is required to resolve conflicts
– More complex than the bus approach

Central Control Unit
• Funnels separate data streams back and forth between the independent modules: processor, memory, I/O
• The controller can buffer requests and perform arbitration and timing functions
• Passes status and control messages between processors
• Performs cache-update alerting
• Advantages
– The flexibility and simplicity of interfacing of the bus approach
• Disadvantages
– The control unit is quite complex
– A potential performance bottleneck

Multiprocessor OS Design
• An SMP OS manages processor and other computer resources so that the user perceives a single OS controlling system resources
• The key design issues
– Simultaneous concurrent processes
– Scheduling
– Synchronization
– Memory management
– Reliability and fault tolerance

S/390 SMP (figure)

S/390 SMP Components
• PU (processor unit)
– CISC microprocessor
– 64-KB L1 cache
• L2 cache
– 384 KB
– Arranged in clusters of two
– Each cluster supports three PUs
• Bus-switching network adapter (BSN)
– Interconnects the L2 caches and main memory
– Includes a 2-MB level 3 (L3) cache
• Memory card
– 8 GB per card

Switched Interconnection
• A single bus becomes a bottleneck affecting scalability
• The S/390 copes with this problem in two ways
– Main memory is split into four separate cards, each with its own storage controller that can handle memory accesses at high speed
– The connection from the processors to a single memory card is not a shared bus but point-to-point links; each link connects a group of three processors via an L2 cache to a BSN

Shared L2 Caches
• The need for shared L2 caches
– In moving from G3 (generation 3) to G4, IBM doubled the speed of the microprocessors. If the G3 organization were retained, a significant increase in bus traffic would occur. At the same time, it was desired to reuse as many G3 components as possible. Without a significant bus upgrade, the BSNs would become a bottleneck
– Analysis of typical S/390 workloads revealed a large degree of sharing of instructions and data among processors
• Hence the use of one or more L2 caches, each shared by multiple processors

L3 Cache
• Each L3 cache provides a buffer between the L2 caches and one memory card
• Provides data much more quickly than a main memory access if the requested cache line is already shared by other processors but was not recently used by the requesting processor

Memory subsystem   Access penalty (PU cycles)   Cache size   Hit rate (%)
L1 cache                        1                  32 KB          89
L2 cache                        5                 256 KB           5
L3 cache                       14                   2 MB           3
Memory                         32                   8 GB           3
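Reading the table's hit rates as the fraction of all accesses satisfied at each level (they sum to 100%), the expected access penalty works out to roughly 0.89 × 1 + 0.05 × 5 + 0.03 × 14 + 0.03 × 32 ≈ 2.5 PU cycles, far below the 32-cycle penalty of going to memory; this is what the cache hierarchy buys.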

Cache Coherence and the MESI Protocol

Cache Coherence Problem
• Problem: multiple copies of the same data in different caches
• Can result in an inconsistent view of memory
• Write back
– Write operations are usually made only to the cache
– Main memory is only updated when the corresponding cache line is flushed from the cache
• Write through
– All write operations are made to main memory as well as to the cache, ensuring that main memory is always valid
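A toy C sketch of the problem (an illustration added here, not part of the slides): two processors cache the same location under a write-back policy; P0 updates only its cached copy, so P1 and main memory still hold the stale value.

    #include <stdio.h>

    int main(void) {
        int memory_x = 5;          /* main-memory copy of location x */
        int cache_p0 = memory_x;   /* P0 reads x into its cache */
        int cache_p1 = memory_x;   /* P1 reads x into its cache */

        cache_p0 = 7;              /* write back: only P0's cached copy changes */

        /* P1 reads its own cached copy and sees the stale value 5 */
        printf("P0 sees %d, P1 sees %d, memory holds %d\n",
               cache_p0, cache_p1, memory_x);
        return 0;
    }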

Software Solutions
• Compiler-based coherence mechanisms
• The compiler and operating system deal with the problem
• Overhead is transferred to compile time
• Design complexity is transferred from hardware to software
• Simplest approach
– Prevent any shared data variables from being cached
• More efficient approach
– Analyze the code to determine safe periods for caching shared variables

Hardware Solutions
• Cache coherence protocols
• Dynamic recognition at run time of potential inconsistency conditions
• More efficient use of the cache
• Hardware schemes can be divided into two categories
– Directory protocols
– Snoopy protocols

Directory Protocols
• Collect and maintain information about copies of data in caches
• The directory is stored in main memory
• Requests are checked against the directory
• Appropriate transfers are performed
• Creates a central bottleneck
• Effective in large-scale systems with complex interconnection schemes

Snoopy Protocols
• Distribute cache coherence responsibility among the cache controllers
• A cache recognizes that a line is shared
• Updates are announced to the other caches
• Suited to bus-based multiprocessors
• Increases bus traffic

Snoopy Protocols
• Two approaches to the snoopy protocol
– Write-invalidate
– Write-update (write-broadcast)
• Performance depends on the number of local caches and the pattern of memory reads and writes
• Some adaptive protocols employ both approaches

Write Invalidate
• Multiple readers, one writer
• When a write is required, all other cached copies of the line are invalidated
• The writing processor then has exclusive (cheap) access until the line is required by another processor
• Used in Pentium II and PowerPC systems
• The state of every line is marked as modified, exclusive, shared, or invalid
• Hence the name: MESI

Write Update
• Multiple readers and writers
• When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it

MESI Protocol
• Modified: the line in the cache has been modified and is available only in this cache
• Exclusive: the line in the cache is the same as that in main memory and is not present in any other cache
• Shared: the line in the cache is the same as that in main memory and may be present in another cache
• Invalid: the line in the cache does not contain valid data
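A compact C sketch of the four states (the function name and printout are invented for illustration): the transition applied to a line in a snooping cache when another processor writes to that line under write-invalidate.

    #include <stdio.h>

    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

    /* When another processor writes to a line we hold, any valid copy
       here must be invalidated; a MODIFIED copy is written back first
       so that main memory is current. */
    mesi_state on_remote_write(mesi_state s) {
        if (s == MODIFIED)
            printf("write dirty line back to memory first\n");
        return INVALID;   /* M, E, and S all become I; I stays I */
    }

    int main(void) {
        mesi_state line = SHARED;
        line = on_remote_write(line);
        printf("line is now %s\n", line == INVALID ? "INVALID" : "?");
        return 0;
    }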

MESI State Transition Diagram (figure)

MESI State Transition Diagram
(a) Line in cache at initiating processor (figure)

MESI State Transition Diagram
(b) Line in snooping cache (figure)

Read Miss
• When a read miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address
• The processor inserts a signal on the bus that alerts all other processor/cache units to snoop the transaction

Read Hit
• When a read hit occurs on a line currently in the local cache, the processor simply reads the required item
• There is no state change
• The state remains modified, shared, or exclusive

Write Miss
• When a write miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address
• The read is issued as RWITM (read-with-intent-to-modify)
• When the line is loaded, it is immediately marked modified

Write Hit
• When a write hit occurs on a line currently in the local cache, the effect depends on the current state of that line in the local cache:
• Shared
• Exclusive
• Modified
(the sketch below summarizes all four access cases)
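Pulling the four access cases together, a hedged C sketch of the next state of a line in the initiating processor's cache; the event names and the shared_elsewhere snoop hint are invented for illustration, and the write-hit case assumes that a SHARED line first signals the other caches to invalidate their copies.

    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
    typedef enum { READ_HIT, READ_MISS, WRITE_HIT, WRITE_MISS } mem_event;

    /* Next state of the line in the initiating cache.  'shared_elsewhere'
       is the answer to the bus snoop: does any other cache hold the line? */
    mesi_state initiator_next(mesi_state s, mem_event e, int shared_elsewhere) {
        switch (e) {
        case READ_HIT:   return s;                 /* no state change */
        case READ_MISS:  return shared_elsewhere ? SHARED : EXCLUSIVE;
        case WRITE_HIT:  return MODIFIED;          /* from S: invalidate others first */
        case WRITE_MISS: return MODIFIED;          /* RWITM, then mark modified */
        }
        return INVALID;                            /* not reached */
    }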

L1-L2 Cache Consistency
• Need to maintain data integrity across both levels of cache and across all caches in the SMP configuration
• Extend the MESI protocol to the L1 caches
– Each line in the L1 cache includes bits to indicate its state
• Why does this work?
– The write-through policy is adopted in the L1 cache
• If L1 used write-back, consistency would be more complex
– The write-through is to the L2 cache, not to memory

Clusters

Clustering
• Clustering is an alternative to symmetric multiprocessing (SMP)
• Cluster
– A group of interconnected whole computers working together as a unified computing resource that can create the illusion of being one machine
• Whole computer
– A system that can run on its own, apart from the cluster
• Benefits
– Absolute scalability
– Incremental scalability
– High availability
– Superior price/performance

Cluster Configurations
(a) Standby server with no shared disk (figure)

Cluster Configurations
(b) Shared disk (figure)

Clustering Methods

Passive standby
– Description: a secondary server takes over in case of primary server failure
– Benefits: easy to implement
– Limitations: high cost, because the secondary server is unavailable for other processing tasks

Active secondary
– Description: the secondary server is also used for processing tasks
– Benefits: reduced cost, because secondary servers can be used for processing
– Limitations: increased complexity

Separate servers
– Description: separate servers have their own disks; data is continuously copied from the primary to the secondary server
– Benefits: high availability
– Limitations: high network and server overhead due to copying operations

Servers connected to disks
– Description: servers are cabled to the same disks, but each server owns its disks; if one server fails, its disks are taken over by the other server
– Benefits: reduced network and server overhead due to elimination of copying operations
– Limitations: usually requires disk mirroring or RAID technology to compensate for the risk of disk failure

Servers share disks
– Description: multiple servers simultaneously share access to disks
– Benefits: low network and server overhead; reduced risk of downtime caused by disk failure
– Limitations: requires lock manager software; usually used with disk mirroring or RAID technology

Operating System Design Issues
• Failure management
– Failover
– Failback
• Load balancing
• Clusters versus SMP

Failure Management
• A highly available cluster
– Offers a high probability that all resources will be in service
– If a failure does occur, the queries in progress are lost
– Any lost query, if retried, will be serviced by a different computer in the cluster
• A fault-tolerant cluster
– All resources are always available
– Achieved by the use of redundant shared disks and mechanisms for backing out uncommitted transactions
• Failover: switching resources over from a failed system to an alternative system
• Failback: restoring resources to the original system

Load Balancing
• Incremental scalability
• Automatically include new computers in scheduling
• Middleware needs to recognize that processes may switch between machines

Parallelizing
• A single application executing in parallel on a number of machines in the cluster
– Compiler
• Determines at compile time which parts can be executed in parallel
• These parts are split off to different computers
– Application
• Application written from scratch to be parallel
• Message passing to move data between nodes (see the sketch below)
• Hard to program
• Best end result
– Parametric computing
• The problem is repeated execution of an algorithm on different sets of data
• e.g., simulation using different scenarios
• Needs effective tools to organize and run
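The slides do not prescribe a message-passing library; MPI is one common choice. A minimal sketch in C, assuming an MPI installation and a launcher such as mpirun -np 2:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I? */

        if (rank == 0) {                        /* node 0 sends one integer */
            int data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                 /* node 1 receives it */
            int data;
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }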

Cluster Computer Architecture (figure)

Cluster Middleware
• Unified image to the user
– Single system image
• Single point of entry
• Single file hierarchy
• Single control point
• Single virtual networking
• Single memory space
• Single job management system
• Single user interface
• Single I/O space
• Single process space
• Checkpointing
• Process migration

Clusters vs. SMP
• Both provide multiprocessor support to high-demand applications
• Both are available commercially
– SMP for longer
• Advantages of the SMP approach
– SMP is easier to manage and configure than a cluster
– SMP usually takes up less physical space and draws less power than a cluster
• Advantages of the cluster approach
– Likely to result in clusters dominating the high-performance server market
– Superior in terms of incremental and absolute scalability
– Superior in terms of availability
• All components of the system can readily be made highly redundant

Nonuniform Memory Access

Nonuniform Memory Access (NUMA)
• An alternative to SMP and clustering
• Uniform memory access (UMA)
– All processors have access to all parts of memory using load and store
– Access time to all regions of memory is the same
– Access time to memory is the same for all processors
– As used by SMP
• Nonuniform memory access (NUMA)
– All processors have access to all parts of memory using load and store
– A processor's access time differs depending on the region of memory
– Different processors access different regions of memory at different speeds
• Cache-coherent NUMA (CC-NUMA)
– Cache coherence is maintained among the caches of the various processors
– Significantly different from SMP and clusters

Motivation
• SMP has a practical limit on the number of processors
– Bus traffic limits it to between 16 and 64 processors
• In clusters, each node has its own memory
– Applications do not see a large global memory
– Coherence is maintained by software, not hardware
• NUMA retains the SMP flavour while allowing large-scale multiprocessing
– e.g., the Silicon Graphics Origin NUMA system supports 1024 MIPS R10000 processors
• The objective with NUMA is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes, each with its own bus or other internal interconnect system

CC-NUMA Organization (figure)

CC-NUMA Operation
• Each processor has its own L1 and L2 cache
• Each node has its own main memory
• Nodes are connected by some networking facility
• Each processor sees a single addressable memory space
• Memory request order (see the sketch below):
– L1 cache (local to processor)
– L2 cache (local to processor)
– Main memory (local to node)
– Remote memory
• Delivered to the requesting (local to processor) cache
• Automatic and transparent
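A hedged C sketch of that lookup order (every function here is an invented placeholder; the real mechanism is hardware, not software):

    #include <stdbool.h>
    #include <stdio.h>

    /* Placeholder predicates for the four levels; each returns true on a hit. */
    bool in_l1(long addr)          { (void)addr; return false; }
    bool in_l2(long addr)          { (void)addr; return false; }
    bool in_node_memory(long addr) { return addr < 0x1000; }  /* toy node-local range */

    const char *resolve(long addr) {
        if (in_l1(addr))          return "L1 cache (local to processor)";
        if (in_l2(addr))          return "L2 cache (local to processor)";
        if (in_node_memory(addr)) return "main memory (local to node)";
        return "remote memory, via the directories";          /* transparent to software */
    }

    int main(void) {
        printf("0x0798 -> %s\n", resolve(0x0798));
        printf("0x2000 -> %s\n", resolve(0x2000));
        return 0;
    }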

Organization
• Example
– Suppose P2-3 requests memory location 798, which is in the memory of node 1
1. P2-3 issues a read request on the snoopy bus of node 2
2. The directory on node 2 sees the request and recognizes that the location is in node 1
3. Node 2's directory sends a request to node 1
4. Node 1's directory requests the contents of location 798
5. Node 1's main memory responds by putting the requested data on the bus
6. Node 1's directory picks up the data from the bus
7. The value is transferred back to node 2's directory
8. Node 2's directory places the data back on node 2's bus
9. The value is picked up, placed in P2-3's cache, and delivered to P2-3

Cache Coherence
• Node 1's directory keeps a note that node 2 has a copy of the data
• If the data is modified in a cache, this is broadcast to the other nodes
• Local directories monitor and purge the local cache if necessary
• The local directory monitors changes to local data in remote caches and marks the memory invalid until writeback
• The local directory forces a writeback if the memory location is requested by another processor

NUMA Pros and Cons
• The advantage of a CC-NUMA system
– Can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes
• The disadvantages of a CC-NUMA system
– Performance can break down if there is too much access to remote memory
• Can be avoided by:
– L1 and L2 cache design reducing all memory accesses
» Needs good temporal locality in the software
– Good spatial locality in the software
– Virtual memory management moving pages to the nodes that are using them most
– Does not transparently look like an SMP
• Software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system
– Availability

Vector Computation

Vector Computation
• Mathematical problems involving physical processes present different difficulties for computation
– Aerodynamics, seismology, meteorology
– Continuous-field simulation
• High precision
• Repeated floating-point calculations on large arrays of numbers

Vector Computation
• Supercomputers handle these types of problems
– Hundreds of millions of flops
– $10-15 million
– Optimized for calculation rather than multitasking and I/O
– Limited market
• Research, government agencies, meteorology
• Array processor
– An alternative to the supercomputer
– Configured as a peripheral to a mainframe or minicomputer
– Runs just the vector portion of problems

Example of Vector Addition (figure)
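The figure is not reproduced here; as a stand-in, the six-element addition it illustrates, written as the iterative loop a general-purpose computer would use (values invented):

    #include <stdio.h>

    #define N 6

    int main(void) {
        double a[N] = {1, 2, 3, 4, 5, 6};
        double b[N] = {6, 5, 4, 3, 2, 1};
        double c[N];

        /* one addition per iteration: six calculations in all */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);   /* 7 7 7 7 7 7 */
        printf("\n");
        return 0;
    }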

Matrix Multiplication (C = A × B) (figure)
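Again in place of the figure, a minimal C version of C = A × B for small square matrices (sizes and values invented; A is the identity so the result is easy to check):

    #include <stdio.h>

    #define N 3

    int main(void) {
        double a[N][N] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
        double b[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double c[N][N];

        /* c[i][j] = sum over k of a[i][k] * b[k][j] */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }

        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                printf("%g ", c[i][j]);
            printf("\n");             /* prints B, since A is the identity */
        }
        return 0;
    }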

Approaches
• General-purpose computers rely on iteration to do vector calculations
– In the vector-addition example, this needs six calculations
• Vector processing
– Assumes it is possible to operate on a one-dimensional vector of data
– All elements in a particular row can be calculated in parallel
• Parallel processing
– Independent processors functioning in parallel
– Use FORK N to start an individual process at location N
– JOIN N causes N independent processes to join and merge following the JOIN (see the sketch after this list)
• The O/S coordinates the JOINs
• Execution is blocked until all N processes have reached the JOIN
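The FORK/JOIN idea in the pthread idiom (a sketch, not the slides' notation: pthread_create stands in for FORK and pthread_join for JOIN; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    #define N 6

    double a[N] = {1, 2, 3, 4, 5, 6};
    double b[N] = {6, 5, 4, 3, 2, 1};
    double c[N];

    /* each "forked" activity adds one element pair */
    void *add_one(void *arg) {
        long i = (long)arg;
        c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[N];

        for (long i = 0; i < N; i++)   /* FORK: start N independent activities */
            pthread_create(&t[i], NULL, add_one, (void *)i);
        for (int i = 0; i < N; i++)    /* JOIN: block until all N have finished */
            pthread_join(t[i], NULL);

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);
        printf("\n");
        return 0;
    }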

Processor Designs
• Pipelined ALU
– Within operations
– Across operations
• Parallel ALUs
• Parallel processors

Approaches to Vector Computation
(a) Pipelined ALU (figure)

Approaches to Vector Computation
(b) Parallel ALUs (figure)

Pipelined Processing (figure)

Example: Computing C = (s × A) + B
1. Vector load A → vector register VR1
2. Vector load B → VR2
3. Vector multiply s × VR1 → VR3
4. Vector add VR3 + VR2 → VR4
5. Vector store VR4 → C
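For reference, the scalar-loop equivalent of that five-step vector-register sequence (a sketch; VR1-VR4 correspond to compiler-managed temporaries, and the values are invented):

    #include <stdio.h>

    #define N 4

    int main(void) {
        double s = 2.0;
        double a[N] = {1, 2, 3, 4};
        double b[N] = {10, 20, 30, 40};
        double c[N];

        /* load A, load B, multiply s*VR1 -> VR3, add VR3+VR2 -> VR4, store -> C */
        for (int i = 0; i < N; i++)
            c[i] = s * a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);   /* 12 24 36 48 */
        printf("\n");
        return 0;
    }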

Chaining
• Used in Cray supercomputers
• A vector operation may start as soon as the first element of the operand vector is available and the functional unit is free
• The result from one functional unit is fed immediately into another
• If vector registers are used, intermediate results do not have to be stored in memory

Taxonomy of Computer Organization (figure)

Taxonomy of Computer Organization
• Parallel processor
– The multiple processors can function cooperatively on a given task
• Vector processor
– Often equated with the pipelined ALU organization
– Can also be designed with a parallel ALU organization or as a parallel processor
• Array processor
– Usually refers to a parallel ALU
– But any of the three organizations can be optimized for the processing of arrays
– Also often refers to an auxiliary processor attached to a general-purpose processor and used to perform vector computation
• The pipelined ALU organization dominates the marketplace
– Less complex

IBM 3090 Vector Facility
• The IBM facility makes use of a number of vector registers
• Each register is actually a bank of scalar registers
• To compute the vector sum C = A + B, the vectors A and B are loaded into two vector registers
• The data from these registers are passed through the ALU as fast as possible
• The overlap of computation, together with the loading of the input data into the registers in a block, results in a significant speedup over an ordinary ALU operation

Organization
• The fixed and predetermined structure of vector data permits housekeeping instructions inside the loop to be replaced by faster internal machine operations
• Data-access and arithmetic operations on several successive vector elements can proceed concurrently, by overlapping such operations in a pipelined design or by performing multiple-element operations in parallel
• The use of vector registers for intermediate results avoids additional storage references

IBM 3090 with Vector Facility (figure)

Registers
• The IBM organization is referred to as register-to-register, because the vector operands, both input and output, can be stored in vector registers
• The main disadvantage of the use of vector registers is that the programmer or compiler must take them into account

Alternative Programs (figures)
(a) Storage to storage
(b) Register to register
(c) Storage to register
(d) Compound instructions

Registers
• The vector registers can also be coupled to form eight 64-bit vector registers
• The architecture specifies that each register contains from 8 to 512 scalar elements
• Three additional registers are needed by the vector facility

Registers of the IBM 3090 (figure)

Compound Instruction
• Instruction execution can be overlapped using chaining to improve performance
• The designers of the IBM vector facility chose not to include this capability, for several reasons
• Compound instructions do not require the use of additional registers for temporary storage of intermediate results, and they require one less register

The Instruction Set
• There are memory-to-register load and register-to-memory store instructions
• Many of the instructions use a three-operand format

IBM 3090 Vector Facility (figure)