
Chapter 18: Parallel Processing
Yonsei University

Contents
• Multiple Processor Organizations
• Symmetric Multiprocessors
• Cache Coherence and the MESI Protocol
• Clusters
• Nonuniform Memory Access
• Vector Computation

Multiple Processor Organizations

Types of Parallel Processor
• Types of parallel processor systems
– Single instruction, single data stream - SISD
– Single instruction, multiple data stream - SIMD
– Multiple instruction, single data stream - MISD
– Multiple instruction, multiple data stream - MIMD

Taxonomy of Parallel Processors (figure)

SISD
• Single processor
• Single instruction stream
• Data stored in a single memory
• Uniprocessor

SIMD
• A single machine instruction controls simultaneous execution
• A number of processing elements, operating on a lockstep basis
• Each processing element has an associated data memory
• Each instruction is executed on a different set of data by the different processors
• Vector and array processors
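As a concrete illustration of lockstep execution (an addition to the slides, not from them), here is a minimal C sketch using the x86 SSE intrinsics: a single instruction, _mm_add_ps, performs four float additions simultaneously.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load four elements at once */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four additions in lockstep */
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++)
            printf("%g ", c[i]);         /* 11 22 33 44 */
        printf("\n");
        return 0;
    }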

SIMD
• With distributed memory (figure)

MISD
• A sequence of data is transmitted to a set of processors
• Each processor executes a different instruction sequence
• Has never been implemented

MIMD
• A set of processors
• Simultaneously execute different instruction sequences
• On different sets of data
• SMPs, clusters, and NUMA systems

MIMD
• With shared memory (figure)

MIMD
• With distributed memory (figure)

MIMD - Overview
• General-purpose processors
• Each can process all the instructions necessary
• Further classified by the method of processor communication

Tightly Coupled - SMP
• Processors share memory
• Communicate via that shared memory
• Symmetric multiprocessor (SMP)
– Share a single memory or pool
– Shared bus to access memory
– Memory access time to a given region of memory is approximately the same for each processor

Tightly Coupled - NUMA
• Nonuniform memory access
• Access times to different regions of memory may differ

Loosely Coupled - Clusters
• Collection of independent uniprocessors or SMPs
• Interconnected to form a cluster
• Communication via fixed paths or network connections

Design Issues
• Physical organization
• Interconnection structures
• Interprocessor communication
• Operating system design
• Application software techniques

Symmetric Multiprocessors

SMP Characteristics
• Two or more similar processors
• Share the same main memory and I/O facilities
• All processors share access to I/O devices
• All processors can perform the same functions
• The system is controlled by an integrated OS
– Providing interaction between processors
– Interaction at the job, task, file, and data-element levels

SMP Advantages
• Performance
– If some work can be done in parallel
• Availability
– Since all processors can perform the same functions, failure of a single processor does not halt the system
• Incremental growth
– Users can enhance performance by adding processors
• Scaling
– Vendors can offer a range of products based on the number of processors

Multiprogramming and Multiprocessing
• Interleaving (multiprogramming) (figure)

Multiprogramming and Multiprocessing
• Interleaving and overlapping (multiprocessing) (figure)

Tightly Coupled
• Time-shared or common bus
• Multiport memory
• Central control unit

Time-Shared Bus
• To facilitate DMA transfers from I/O processors, the following features are provided:
– Addressing
• Distinguishes modules on the bus, to determine the source and destination of data
– Arbitration
• An I/O module can temporarily function as “master”
– Time-sharing
• When one module is controlling the bus, other modules are locked out and must, if necessary, suspend operation until bus access is achieved

Time-Shared Bus (figure)

Time-Shared Bus: Advantages
• Simplicity
– The simplest approach to multiprocessor organization
• Flexibility
– Easy to expand the system by attaching more processors to the bus
• Reliability
– The bus is essentially a passive medium

Time-Shared Bus: Drawbacks
• Performance
– The speed of the system is limited by the bus cycle time
– To improve performance, each processor is equipped with a local cache
• Reduces the number of bus accesses
– This leads to problems with cache coherence
• Solved in hardware - see later

Multiport Memory
• Multiport memory allows direct, independent access to the main memory modules by each processor and I/O module
• Logic is required to resolve conflicts
• Little or no modification to processors or modules is required

Multiport Memory
• Advantages
– Little or no modification is needed for either processors or I/O modules to accommodate multiport memory
– Better performance than the bus approach
– Security
• Portions of memory can be configured as private to one or more processors and/or I/O modules
• Disadvantages
– Logic associated with the memory is required to resolve conflicts
– More complex than the bus approach

Central Control Unit
• Funnels separate data streams back and forth between the independent modules: processor, memory, I/O
• The controller can buffer requests and perform arbitration and timing functions
• Passes status and control messages between processors
• Performs cache-update alerting
• Advantages
– The flexibility and simplicity of interfacing of the bus approach
• Disadvantages
– The control unit is quite complex
– A potential performance bottleneck

Multiprocessor OS Design
• An SMP OS manages processor and other computer resources so that the user perceives a single OS controlling system resources
• The key design issues
– Simultaneous concurrent processes
– Scheduling
– Synchronization
– Memory management
– Reliability and fault tolerance

S/390 SMP (figure)

S/390 SMP Components
• PU (processor unit)
– CISC microprocessor
– 64-KB L1 cache
• L2 cache
– 384 KB
– Arranged in clusters of two
– Each cluster supports three PUs
• Bus-switching network adapter (BSN)
– Interconnects the L2 caches and main memory
– Includes a 2-MB level 3 (L3) cache
• Memory card
– 8 GB per card

Switched Interconnection
• A single bus becomes a bottleneck affecting scalability
• The S/390 copes with this problem in two ways
– Main memory is split into four separate cards, each with its own storage controller that can handle memory accesses at high speed
– The connection from the processors to a single memory card is not a shared bus but point-to-point links; each link connects a group of three processors via an L2 cache to a BSN

Shared L2 Caches
• The need for shared L2 caches
– In moving from G3 (generation 3) to G4, IBM doubled the speed of the microprocessors. If the G3 organization were retained, a significant increase in bus traffic would occur. At the same time, it was desired to reuse as many G3 components as possible. Without a significant bus upgrade, the BSNs would become a bottleneck
– Analysis of typical S/390 workloads revealed a large degree of sharing of instructions and data among processors
• Hence the use of one or more L2 caches, each shared by multiple processors

L3 Cache
• Each L3 cache provides a buffer between the L2 caches and one memory card
• Provides data much more quickly than a main memory access if the requested cache line is already shared by other processors but was not recently used by the requesting processor

Memory subsystem   Access penalty (PU cycles)   Cache size   Hit rate (%)
L1 cache                        1                  32 KB          89
L2 cache                        5                 256 KB           5
L3 cache                       14                   2 MB           3
Memory                         32                   8 GB           3
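Reading the table's hit rates as the fraction of all accesses satisfied at each level (they sum to 100%), the expected access penalty works out to roughly 0.89 × 1 + 0.05 × 5 + 0.03 × 14 + 0.03 × 32 ≈ 2.5 PU cycles, far below the 32-cycle penalty of going to memory; this is what the cache hierarchy buys.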

Cache Coherence and the MESI Protocol

Cache Coherence Problem
• Problem: multiple copies of the same data in different caches
• Can result in an inconsistent view of memory
• Write back
– Write operations are usually made only to the cache
– Main memory is only updated when the corresponding cache line is flushed from the cache
• Write through
– All write operations are made to main memory as well as to the cache, ensuring that main memory is always valid
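A toy C sketch of the problem (an illustration added here, not part of the slides): two processors cache the same location under a write-back policy; P0 updates only its cached copy, so P1 and main memory still hold the stale value.

    #include <stdio.h>

    int main(void) {
        int memory_x = 5;          /* main-memory copy of location x */
        int cache_p0 = memory_x;   /* P0 reads x into its cache */
        int cache_p1 = memory_x;   /* P1 reads x into its cache */

        cache_p0 = 7;              /* write back: only P0's cached copy changes */

        /* P1 reads its own cached copy and sees the stale value 5 */
        printf("P0 sees %d, P1 sees %d, memory holds %d\n",
               cache_p0, cache_p1, memory_x);
        return 0;
    }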

Software Solutions
• Compiler-based coherence mechanisms
• The compiler and operating system deal with the problem
• Overhead is transferred to compile time
• Design complexity is transferred from hardware to software
• Simplest approach
– Prevent any shared data variables from being cached
• More efficient approach
– Analyze the code to determine safe periods for caching shared variables

Hardware Solutions
• Cache coherence protocols
• Dynamic recognition at run time of potential inconsistency conditions
• More efficient use of the cache
• Hardware schemes can be divided into two categories
– Directory protocols
– Snoopy protocols

Directory Protocols
• Collect and maintain information about copies of data in caches
• The directory is stored in main memory
• Requests are checked against the directory
• Appropriate transfers are performed
• Creates a central bottleneck
• Effective in large-scale systems with complex interconnection schemes

Snoopy Protocols
• Distribute cache coherence responsibility among the cache controllers
• A cache recognizes that a line is shared
• Updates are announced to the other caches
• Suited to bus-based multiprocessors
• Increases bus traffic

Snoopy Protocols
• Two approaches to the snoopy protocol
– Write-invalidate
– Write-update (write-broadcast)
• Performance depends on the number of local caches and the pattern of memory reads and writes
• Some adaptive protocols employ both approaches

Write Invalidate
• Multiple readers, one writer
• When a write is required, all other cached copies of the line are invalidated
• The writing processor then has exclusive (cheap) access until the line is required by another processor
• Used in Pentium II and PowerPC systems
• The state of every line is marked as modified, exclusive, shared, or invalid
• Hence the name: MESI

Write Update
• Multiple readers and writers
• When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it

MESI Protocol
• Modified: the line in the cache has been modified and is available only in this cache
• Exclusive: the line in the cache is the same as that in main memory and is not present in any other cache
• Shared: the line in the cache is the same as that in main memory and may be present in another cache
• Invalid: the line in the cache does not contain valid data
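A compact C sketch of the four states (the function name and printout are invented for illustration): the transition applied to a line in a snooping cache when another processor writes to that line under write-invalidate.

    #include <stdio.h>

    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

    /* When another processor writes to a line we hold, any valid copy
       here must be invalidated; a MODIFIED copy is written back first
       so that main memory is current. */
    mesi_state on_remote_write(mesi_state s) {
        if (s == MODIFIED)
            printf("write dirty line back to memory first\n");
        return INVALID;   /* M, E, and S all become I; I stays I */
    }

    int main(void) {
        mesi_state line = SHARED;
        line = on_remote_write(line);
        printf("line is now %s\n", line == INVALID ? "INVALID" : "?");
        return 0;
    }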

MESI State Transition Diagram (figure)

MESI State Transition Diagram
(a) Line in cache at initiating processor (figure)

MESI State Transition Diagram
(b) Line in snooping cache (figure)

Read Miss
• When a read miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address
• The processor inserts a signal on the bus that alerts all other processor/cache units to snoop the transaction

Read Hit
• When a read hit occurs on a line currently in the local cache, the processor simply reads the required item
• There is no state change
• The state remains modified, shared, or exclusive

Write Miss
• When a write miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address
• The read is issued as RWITM (read-with-intent-to-modify)
• When the line is loaded, it is immediately marked modified

Write Hit
• When a write hit occurs on a line currently in the local cache, the effect depends on the current state of that line in the local cache:
• Shared
• Exclusive
• Modified
(the sketch below summarizes all four access cases)
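Pulling the four access cases together, a hedged C sketch of the next state of a line in the initiating processor's cache; the event names and the shared_elsewhere snoop hint are invented for illustration, and the write-hit case assumes that a SHARED line first signals the other caches to invalidate their copies.

    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
    typedef enum { READ_HIT, READ_MISS, WRITE_HIT, WRITE_MISS } mem_event;

    /* Next state of the line in the initiating cache.  'shared_elsewhere'
       is the answer to the bus snoop: does any other cache hold the line? */
    mesi_state initiator_next(mesi_state s, mem_event e, int shared_elsewhere) {
        switch (e) {
        case READ_HIT:   return s;                 /* no state change */
        case READ_MISS:  return shared_elsewhere ? SHARED : EXCLUSIVE;
        case WRITE_HIT:  return MODIFIED;          /* from S: invalidate others first */
        case WRITE_MISS: return MODIFIED;          /* RWITM, then mark modified */
        }
        return INVALID;                            /* not reached */
    }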

L1-L2 Cache Consistency
• Need to maintain data integrity across both levels of cache and across all caches in the SMP configuration
• Extend the MESI protocol to the L1 caches
– Each line in the L1 cache includes bits to indicate its state
• Why does this work?
– The write-through policy is adopted in the L1 cache
• If L1 used write-back, consistency would be more complex
– The write-through is to the L2 cache, not to memory

Clusters

Clustering
• Clustering is an alternative to symmetric multiprocessing (SMP)
• Cluster
– A group of interconnected whole computers working together as a unified computing resource that can create the illusion of being one machine
• Whole computer
– A system that can run on its own, apart from the cluster
• Benefits
– Absolute scalability
– Incremental scalability
– High availability
– Superior price/performance

Cluster Configurations
(a) Standby server with no shared disk (figure)

Cluster Configurations
(b) Shared disk (figure)

Clustering Methods

Passive standby
– Description: a secondary server takes over in case of primary server failure
– Benefits: easy to implement
– Limitations: high cost, because the secondary server is unavailable for other processing tasks

Active secondary
– Description: the secondary server is also used for processing tasks
– Benefits: reduced cost, because secondary servers can be used for processing
– Limitations: increased complexity

Separate servers
– Description: separate servers have their own disks; data is continuously copied from the primary to the secondary server
– Benefits: high availability
– Limitations: high network and server overhead due to copying operations

Servers connected to disks
– Description: servers are cabled to the same disks, but each server owns its disks; if one server fails, its disks are taken over by the other server
– Benefits: reduced network and server overhead due to elimination of copying operations
– Limitations: usually requires disk mirroring or RAID technology to compensate for the risk of disk failure

Servers share disks
– Description: multiple servers simultaneously share access to disks
– Benefits: low network and server overhead; reduced risk of downtime caused by disk failure
– Limitations: requires lock manager software; usually used with disk mirroring or RAID technology

Operating System Design Issues
• Failure management
– Failover
– Failback
• Load balancing
• Clusters versus SMP

Failure Management
• A highly available cluster
– Offers a high probability that all resources will be in service
– If a failure does occur, the queries in progress are lost
– Any lost query, if retried, will be serviced by a different computer in the cluster
• A fault-tolerant cluster
– All resources are always available
– Achieved by the use of redundant shared disks and mechanisms for backing out uncommitted transactions
• Failover: switching resources over from a failed system to an alternative system
• Failback: restoring resources to the original system

Load Balancing
• Incremental scalability
• Automatically include new computers in scheduling
• Middleware needs to recognize that processes may switch between machines

Parallelizing
• A single application executing in parallel on a number of machines in the cluster
– Compiler
• Determines at compile time which parts can be executed in parallel
• These parts are split off to different computers
– Application
• Application written from scratch to be parallel
• Message passing to move data between nodes (see the sketch below)
• Hard to program
• Best end result
– Parametric computing
• The problem is repeated execution of an algorithm on different sets of data
• e.g., simulation using different scenarios
• Needs effective tools to organize and run
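The slides do not prescribe a message-passing library; MPI is one common choice. A minimal sketch in C, assuming an MPI installation and a launcher such as mpirun -np 2:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I? */

        if (rank == 0) {                        /* node 0 sends one integer */
            int data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                 /* node 1 receives it */
            int data;
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }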

Cluster Computer Architecture (figure)

Cluster Middleware
• Unified image to the user
– Single system image
• Single point of entry
• Single file hierarchy
• Single control point
• Single virtual networking
• Single memory space
• Single job management system
• Single user interface
• Single I/O space
• Single process space
• Checkpointing
• Process migration

Clusters vs. SMP
• Both provide multiprocessor support to high-demand applications
• Both are available commercially
– SMP for longer
• Advantages of the SMP approach
– SMP is easier to manage and configure than a cluster
– SMP usually takes up less physical space and draws less power than a cluster
• Advantages of the cluster approach
– Likely to result in clusters dominating the high-performance server market
– Superior in terms of incremental and absolute scalability
– Superior in terms of availability
• All components of the system can readily be made highly redundant

Nonuniform Memory Access

Nonuniform Memory Access (NUMA)
• An alternative to SMP and clustering
• Uniform memory access (UMA)
– All processors have access to all parts of memory using load and store
– Access time to all regions of memory is the same
– Access time to memory is the same for all processors
– As used by SMP
• Nonuniform memory access (NUMA)
– All processors have access to all parts of memory using load and store
– A processor's access time differs depending on the region of memory
– Different processors access different regions of memory at different speeds
• Cache-coherent NUMA (CC-NUMA)
– Cache coherence is maintained among the caches of the various processors
– Significantly different from SMP and clusters

Motivation
• SMP has a practical limit on the number of processors
– Bus traffic limits it to between 16 and 64 processors
• In clusters, each node has its own memory
– Applications do not see a large global memory
– Coherence is maintained by software, not hardware
• NUMA retains the SMP flavour while allowing large-scale multiprocessing
– e.g., the Silicon Graphics Origin NUMA system supports 1024 MIPS R10000 processors
• The objective with NUMA is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes, each with its own bus or other internal interconnect system

CC-NUMA Organization (figure)

CC-NUMA Operation
• Each processor has its own L1 and L2 cache
• Each node has its own main memory
• Nodes are connected by some networking facility
• Each processor sees a single addressable memory space
• Memory request order (see the sketch below):
– L1 cache (local to processor)
– L2 cache (local to processor)
– Main memory (local to node)
– Remote memory
• Delivered to the requesting (local to processor) cache
• Automatic and transparent
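A hedged C sketch of that lookup order (every function here is an invented placeholder; the real mechanism is hardware, not software):

    #include <stdbool.h>
    #include <stdio.h>

    /* Placeholder predicates for the four levels; each returns true on a hit. */
    bool in_l1(long addr)          { (void)addr; return false; }
    bool in_l2(long addr)          { (void)addr; return false; }
    bool in_node_memory(long addr) { return addr < 0x1000; }  /* toy node-local range */

    const char *resolve(long addr) {
        if (in_l1(addr))          return "L1 cache (local to processor)";
        if (in_l2(addr))          return "L2 cache (local to processor)";
        if (in_node_memory(addr)) return "main memory (local to node)";
        return "remote memory, via the directories";          /* transparent to software */
    }

    int main(void) {
        printf("0x0798 -> %s\n", resolve(0x0798));
        printf("0x2000 -> %s\n", resolve(0x2000));
        return 0;
    }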

Organization
• Example
– Suppose P2-3 requests memory location 798, which is in the memory of node 1
1. P2-3 issues a read request on the snoopy bus of node 2
2. The directory on node 2 sees the request and recognizes that the location is in node 1
3. Node 2's directory sends a request to node 1
4. Node 1's directory requests the contents of location 798
5. Node 1's main memory responds by putting the requested data on the bus
6. Node 1's directory picks up the data from the bus
7. The value is transferred back to node 2's directory
8. Node 2's directory places the data back on node 2's bus
9. The value is picked up, placed in P2-3's cache, and delivered to P2-3

Cache Coherence
• Node 1's directory keeps a note that node 2 has a copy of the data
• If the data is modified in a cache, this is broadcast to the other nodes
• Local directories monitor and purge the local cache if necessary
• The local directory monitors changes to local data in remote caches and marks the memory invalid until writeback
• The local directory forces a writeback if the memory location is requested by another processor

NUMA Pros and Cons
• The advantage of a CC-NUMA system
– Can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes
• The disadvantages of a CC-NUMA system
– Performance can break down if there is too much access to remote memory
• Can be avoided by:
– L1 and L2 cache design reducing all memory accesses
» Needs good temporal locality in the software
– Good spatial locality in the software
– Virtual memory management moving pages to the nodes that are using them most
– Does not transparently look like an SMP
• Software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system
– Availability

Vector Computation

Vector Computation
• Mathematical problems involving physical processes present different difficulties for computation
– Aerodynamics, seismology, meteorology
– Continuous-field simulation
• High precision
• Repeated floating-point calculations on large arrays of numbers

Vector Computation
• Supercomputers handle these types of problems
– Hundreds of millions of flops
– $10-15 million
– Optimized for calculation rather than multitasking and I/O
– Limited market
• Research, government agencies, meteorology
• Array processor
– An alternative to the supercomputer
– Configured as a peripheral to a mainframe or minicomputer
– Runs just the vector portion of problems

Example of Vector Addition (figure)
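The figure is not reproduced here; as a stand-in, the six-element addition it illustrates, written as the iterative loop a general-purpose computer would use (values invented):

    #include <stdio.h>

    #define N 6

    int main(void) {
        double a[N] = {1, 2, 3, 4, 5, 6};
        double b[N] = {6, 5, 4, 3, 2, 1};
        double c[N];

        /* one addition per iteration: six calculations in all */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);   /* 7 7 7 7 7 7 */
        printf("\n");
        return 0;
    }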

Matrix Multiplication (C = A × B) (figure)
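Again in place of the figure, a minimal C version of C = A × B for small square matrices (sizes and values invented; A is the identity so the result is easy to check):

    #include <stdio.h>

    #define N 3

    int main(void) {
        double a[N][N] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
        double b[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double c[N][N];

        /* c[i][j] = sum over k of a[i][k] * b[k][j] */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }

        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                printf("%g ", c[i][j]);
            printf("\n");             /* prints B, since A is the identity */
        }
        return 0;
    }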

Approaches
• General-purpose computers rely on iteration to do vector calculations
– In the vector-addition example, this needs six calculations
• Vector processing
– Assumes it is possible to operate on a one-dimensional vector of data
– All elements in a particular row can be calculated in parallel
• Parallel processing
– Independent processors functioning in parallel
– Use FORK N to start an individual process at location N
– JOIN N causes N independent processes to join and merge following the JOIN (see the sketch after this list)
• The O/S coordinates the JOINs
• Execution is blocked until all N processes have reached the JOIN
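The FORK/JOIN idea in the pthread idiom (a sketch, not the slides' notation: pthread_create stands in for FORK and pthread_join for JOIN; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    #define N 6

    double a[N] = {1, 2, 3, 4, 5, 6};
    double b[N] = {6, 5, 4, 3, 2, 1};
    double c[N];

    /* each "forked" activity adds one element pair */
    void *add_one(void *arg) {
        long i = (long)arg;
        c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[N];

        for (long i = 0; i < N; i++)   /* FORK: start N independent activities */
            pthread_create(&t[i], NULL, add_one, (void *)i);
        for (int i = 0; i < N; i++)    /* JOIN: block until all N have finished */
            pthread_join(t[i], NULL);

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);
        printf("\n");
        return 0;
    }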

Processor Designs
• Pipelined ALU
– Within operations
– Across operations
• Parallel ALUs
• Parallel processors

Approaches to Vector Computation
(a) Pipelined ALU (figure)

Approaches to Vector Computation
(b) Parallel ALUs (figure)

Pipelined Processing (figure)

Example: Computing C = (s × A) + B
1. Vector load A → vector register VR1
2. Vector load B → VR2
3. Vector multiply s × VR1 → VR3
4. Vector add VR3 + VR2 → VR4
5. Vector store VR4 → C
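For reference, the scalar-loop equivalent of that five-step vector-register sequence (a sketch; VR1-VR4 correspond to compiler-managed temporaries, and the values are invented):

    #include <stdio.h>

    #define N 4

    int main(void) {
        double s = 2.0;
        double a[N] = {1, 2, 3, 4};
        double b[N] = {10, 20, 30, 40};
        double c[N];

        /* load A, load B, multiply s*VR1 -> VR3, add VR3+VR2 -> VR4, store -> C */
        for (int i = 0; i < N; i++)
            c[i] = s * a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);   /* 12 24 36 48 */
        printf("\n");
        return 0;
    }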

Chaining
• Used in Cray supercomputers
• A vector operation may start as soon as the first element of the operand vector is available and the functional unit is free
• The result from one functional unit is fed immediately into another
• If vector registers are used, intermediate results do not have to be stored in memory

Taxonomy of Computer Organization (figure)

Taxonomy of Computer Organization
• Parallel processor
– The multiple processors can function cooperatively on a given task
• Vector processor
– Often equated with the pipelined ALU organization
– Can also be designed with a parallel ALU organization or as a parallel processor
• Array processor
– Usually refers to a parallel ALU
– But any of the three organizations can be optimized for the processing of arrays
– Also often refers to an auxiliary processor attached to a general-purpose processor and used to perform vector computation
• The pipelined ALU organization dominates the marketplace
– Less complex

IBM 3090 Vector Facility
• The IBM facility makes use of a number of vector registers
• Each register is actually a bank of scalar registers
• To compute the vector sum C = A + B, the vectors A and B are loaded into two vector registers
• The data from these registers are passed through the ALU as fast as possible
• The overlap of computation, together with the loading of the input data into the registers in a block, results in a significant speedup over an ordinary ALU operation

Organization
• The fixed and predetermined structure of vector data permits housekeeping instructions inside the loop to be replaced by faster internal machine operations
• Data-access and arithmetic operations on several successive vector elements can proceed concurrently, by overlapping such operations in a pipelined design or by performing multiple-element operations in parallel
• The use of vector registers for intermediate results avoids additional storage references

IBM 3090 with Vector Facility (figure)

Registers
• The IBM organization is referred to as register-to-register, because the vector operands, both input and output, can be stored in vector registers
• The main disadvantage of the use of vector registers is that the programmer or compiler must take them into account

Alternative Programs (figures)
(a) Storage to storage
(b) Register to register
(c) Storage to register
(d) Compound instructions

Registers
• The vector registers can also be coupled to form eight 64-bit vector registers
• The architecture specifies that each register contains from 8 to 512 scalar elements
• Three additional registers are needed by the vector facility

Registers of the IBM 3090 (figure)

Compound Instruction
• Instruction execution can be overlapped using chaining to improve performance
• The designers of the IBM vector facility chose not to include this capability, for several reasons
• Compound instructions do not require the use of additional registers for temporary storage of intermediate results, and they require one less register

The Instruction Set
• There are memory-to-register load and register-to-memory store instructions
• Many of the instructions use a three-operand format

IBM 3090 Vector Facility (figure)