
1

Parallel Computing Final Exam Review


2

What is parallel computing? Parallel computing involves performing parallel tasks using more than one computer.

Example in real life with related principles -- book shelving in a library:
Single worker.
P workers, each stacking n/p books, but with an arbitration problem (many workers try to stack the next book on the same shelf).
P workers, each stacking n/p books, without an arbitration problem (each worker works on a different set of shelves).


3

Important issues in parallel computing:

Task/Program Partitioning. How to split a single task among the processors so that each processor performs the same amount of work, and all processors work collectively to complete the task.

Data Partitioning. How to split the data evenly among the processors in such a way that processor interaction is minimized.

Communication/Arbitration. How we allow communication among different processors and how we arbitrate communication-related conflicts.


4

Challenges

1. Design of parallel computers that resolve the above issues.

2. Design, analysis, and evaluation of parallel algorithms run on these machines.

3. Portability and scalability issues related to parallel programs and algorithms.

4. Tools and libraries used in such systems.


5

Units of Measure in HPC

• High Performance Computing (HPC) units are:
- Flop: floating point operation
- Flop/s: floating point operations per second
- Bytes: size of data (a double-precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions, ...
Mega  Mflop/s = 10^6 flop/sec   Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga  Gflop/s = 10^9 flop/sec   Gbyte = 2^30 ~ 10^9 bytes
Tera  Tflop/s = 10^12 flop/sec  Tbyte = 2^40 ~ 10^12 bytes
Peta  Pflop/s = 10^15 flop/sec  Pbyte = 2^50 ~ 10^15 bytes
Exa   Eflop/s = 10^18 flop/sec  Ebyte = 2^60 ~ 10^18 bytes
Zetta Zflop/s = 10^21 flop/sec  Zbyte = 2^70 ~ 10^21 bytes
Yotta Yflop/s = 10^24 flop/sec  Ybyte = 2^80 ~ 10^24 bytes
• See www.top500.org for the current list of the fastest machines
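As a quick worked example of these units (numbers chosen for illustration, not from the slides): a machine sustaining 2 Gflop/s performs 2 × 10^9 floating point operations per second, so a job requiring 10^12 flop takes about 10^12 / (2 × 10^9) = 500 seconds, while the same job at 2 Tflop/s takes about 0.5 seconds.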


6

What is a parallel computer? A parallel computer is a collection of processors that cooperatively solve computationally intensive problems faster than other computers.

Parallel algorithms allow the efficient programming of parallel computers. This way the waste of computational resources can be avoided.

Parallel computer vs. supercomputer: a supercomputer refers to a general-purpose computer that can solve computationally intensive problems faster than traditional computers.

A supercomputer may or may not be a parallel computer.


7

Flynn’s taxonomy of computer architectures (control mechanism). Depending on the instruction and data streams, computer architectures can be distinguished into the following groups.

(1) SISD (Single Instruction Single Data): This is a sequential computer.

(2) SIMD (Single Instruction Multiple Data): This is a parallel machine like the TM CM-200. SIMD machines are suited for data-parallel programs where the same set of instructions is executed on a large data set. Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines.

(3) MISD (Multiple Instruction Single Data): Some consider a systolic array a member of this group.

(4) MIMD (Multiple Instruction Multiple Data): All other parallel machines. A MIMD architecture can be MPMD or SPMD. In a Multiple Program Multiple Data organization, each processor executes its own program, as opposed to a single program that is executed by all processors in a Single Program Multiple Data organization. Examples of such platforms include current-generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.

Note: Some consider the CM-5 a combination of MIMD and SIMD, as it contains control hardware that allows it to operate in SIMD mode.


8

SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).


9

Taxonomy based on Address-Space Organization (memory distribution)

Message-Passing Architecture
In a distributed-memory machine each processor has its own memory. Each processor can access its own memory faster than it can access the memory of a remote processor (NUMA, for Non-Uniform Memory Access). This architecture is also known as a message-passing architecture, and such machines are commonly referred to as multicomputers.
Examples: Cray T3D/T3E, IBM SP1/SP2, workstation clusters.

Shared-Address-Space Architecture
Provides hardware support for read/write to a shared address space. Machines built this way are often called multiprocessors.
(1) A shared-memory machine has a single address space shared by all processors (UMA, for Uniform Memory Access). The time taken by a processor to access any memory word in the system is identical. Examples: SGI Power Challenge, SMP machines.
(2) A distributed shared-memory system is a hybrid between the two previous ones. A global address space is shared among the processors but is physically distributed among them. Example: SGI Origin 2000.

Note: The existence of caches in shared-memory parallel machines causes cache coherence problems when a cached variable is modified by a processor and the shared variable is requested by another processor. cc-NUMA stands for cache-coherent NUMA architectures (Origin 2000).


10

NUMA and UMA Shared-Address-Space Platforms

Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.


11

Message Passing vs. Shared Address Space Platforms

Message passing requires little hardware support, other than a network.

Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).


12

Taxonomy based on processor granularity

Granularity sometimes refers to the power of individual processors. Sometimes it is also used to denote the degree of parallelism.

(1) A coarse-grained architecture consists of (usually few) powerful processors (e.g. old Cray machines).
(2) A fine-grained architecture consists of (usually many) inexpensive processors (e.g. TM CM-200, CM-2).
(3) A medium-grained architecture is between the two (e.g. CM-5).

Process granularity refers to the amount of computation assigned to a particular processor of a parallel machine for a given parallel program. It also refers, within a single program, to the amount of computation performed before communication is issued. If the amount of computation is small (low degree of concurrency), a process is fine-grained; otherwise granularity is coarse.


13

Taxonomy based on processor synchronization

(1) In a fully synchronous system a global clock is used to synchronize all operations performed by the processors.

(2) An asynchronous system lacks any synchronization facilities. Processor synchronization needs to be explicit in a user’s program.

(3) A bulk-synchronous system comes in between a fully synchronous and an asynchronous system. Synchronization of processors is required only at certain parts of the execution of a parallel program.


14

Physical Organization of Parallel Platforms – ideal architecture(PRAM)

The Parallel Random Access Machine (PRAM) is one of the simplest ways to model a parallel computer.

A PRAM consists of a collection of (sequential) processors that can synchronously access a global shared memory in unit time. Each processor can thus access the shared memory as fast (and as efficiently) as it can access its own local memory.

The main advantage of the PRAM is its simplicity in capturing parallelism and abstracting away the communication and synchronization issues of parallel computing.

Processors are considered to be in abundance and unlimited in number. The resulting PRAM algorithms thus exhibit unlimited parallelism (number of processors used is a function of problem size).

The abstraction thus offered by the PRAM is a fully synchronous collection of processors and a shared memory which makes it popular for parallel algorithm design.

It is, however, this abstraction that also makes the PRAM unrealistic from a practical point of view.

Full synchronization offered by the PRAM is too expensive and time demanding in parallel machines currently in use.

Remote memory (i.e. shared memory) access is considerably more expensive in real machines than local memory access.

UMA machines with unlimited parallelism are difficult to build.


15

Four Subclasses of PRAM

Depending on how concurrent access to a single memory cell (of the shared memory) is resolved, there are various PRAM variants.

ER (Exclusive Read) or EW (Exclusive Write) PRAMs do not allow concurrent access to the shared memory. Such access is allowed, however, for CR (Concurrent Read) or CW (Concurrent Write) PRAMs.

Combining the rules for read and write access there are four PRAM variants:

EREW: Access to a memory location is exclusive. No concurrent read or write operations are allowed. Weakest PRAM model.

CREW: Multiple read accesses to a memory location are allowed. Multiple write accesses to a memory location are serialized.

ERCW: Multiple write accesses to a memory location are allowed. Multiple read accesses to a memory location are serialized. Can simulate an EREW PRAM.

CRCW: Allows multiple read and write accesses to a common memory location. Most powerful PRAM model. Can simulate both EREW PRAM and CREW PRAM.


16

Resolving concurrent write access:

(1) In the arbitrary PRAM, if multiple processors write into a single shared memory cell, then an arbitrary processor succeeds in writing into this cell.

(2) In the common PRAM, processors must write the same value into the shared memory cell.

(3) In the priority PRAM, the processor with the highest priority (smallest or largest indexed processor) succeeds in writing.

(4) In the combining PRAM, if more than one processor writes into the same memory cell, the result written into it depends on the combining operator. If it is the sum operator, the sum of the values is written; if it is the maximum operator, the maximum is written.

Note: An algorithm designed for the common PRAM can be executed on a priority or arbitrary PRAM and exhibit similar complexity. The same holds for an arbitrary PRAM algorithm when run on a priority PRAM.
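As an illustration (not from the slides), here is a minimal C sketch of what the four write-resolution rules compute for one memory cell in one step, given the values vals[0..n-1] that the n writing processors attempt to write; the function names are hypothetical.

#include <assert.h>
#include <stdlib.h>

/* vals[i] is the value processor i attempts to write to the cell; n >= 1 writers. */
int resolve_arbitrary(const int *vals, int n) { return vals[rand() % n]; } /* any one writer succeeds */
int resolve_priority(const int *vals, int n)  { (void)n; return vals[0]; } /* smallest-indexed writer wins */
int resolve_common(const int *vals, int n) {        /* all writers must write the same value */
    for (int i = 1; i < n; i++) assert(vals[i] == vals[0]);
    return vals[0];
}
int resolve_combining_sum(const int *vals, int n) { /* combining PRAM with the sum operator */
    int s = 0;
    for (int i = 0; i < n; i++) s += vals[i];
    return s;
}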


17

Interconnection Networks for Parallel Computers

Interconnection networks carry data between processors and to memory. Interconnects are made of switches and links (wires, fiber). Interconnects are classified as static or dynamic.

Static networks
Consist of point-to-point communication links among processing nodes.
Also referred to as direct networks.

Dynamic networks
Built using switches (switching elements) and links.
Communication links are connected to one another dynamically by the switches to establish paths among processing nodes and memory banks.


18

Static and Dynamic Interconnection Networks

Classification of interconnection networks: (a) a static network; and (b) a dynamic network.


19

Network Topologies: Bus-Based Networks

The simplest network, consisting of a shared medium (bus) that is common to all the nodes.
The distance between any two nodes in the network is constant (O(1)).
Ideal for broadcasting information among nodes.
Scalable in terms of cost, but not scalable in terms of performance: the bounded bandwidth of a bus places limitations on the overall performance as the number of nodes increases.
Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.


20

Network Topologies: Crossbar Networks

Employs a grid of switches (switching nodes) to connect p processors to b memory banks.
Nonblocking network: the connection of a processing node to a memory bank does not block the connection of any other processing node to other memory banks.
The total number of switching nodes required is Θ(pb). (It is reasonable to assume b >= p.)
Scalable in terms of performance, but not scalable in terms of cost.
Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.


21

Network Topologies: Multistage Networks

Multistage networks are an intermediate class of networks between bus-based networks and crossbar networks.
Blocking networks: access to a memory bank by a processor may disallow access to another memory bank by another processor.
More scalable than the bus-based network in terms of performance, more scalable than the crossbar network in terms of cost.

The schematic of a typical multistage interconnection network.


22

Network Topologies: Multistage Omega Network

Omega network
Consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks).
Each stage consists of an interconnection pattern that connects p inputs and p outputs: a perfect shuffle (left rotation of the address bits).
Each switch has two connection modes:
Pass-through connection: the inputs are sent straight through to the outputs.
Cross-over connection: the inputs to the switching node are crossed over and then sent out.
Has (p/2) * log p switching nodes.

The perfect shuffle maps input i to output j as follows:

j = 2i,          for 0 <= i <= p/2 - 1
j = 2i + 1 - p,  for p/2 <= i <= p - 1
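A small sketch (illustrative, not from the slides) of the perfect-shuffle mapping above, assuming p is a power of two:

#include <stdio.h>

/* Perfect shuffle on p inputs: j = 2i for 0 <= i <= p/2 - 1, j = 2i + 1 - p otherwise,
   i.e. a left rotation of the lg p address bits. */
int shuffle(int i, int p) {
    return (i < p / 2) ? 2 * i : 2 * i + 1 - p;
}

int main(void) {
    for (int i = 0; i < 8; i++)
        printf("%d -> %d\n", i, shuffle(i, 8));   /* 0->0 1->2 2->4 3->6 4->1 5->3 6->5 7->7 */
    return 0;
}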


23

Network Topologies: Multistage Omega Network

A complete Omega network with the perfect shuffle interconnects and switches can now be illustrated: a complete omega network connecting eight inputs and eight outputs.

An omega network has p/2 × log p switching nodes, and the cost of such a network grows as Θ(p log p).


24

Network Topologies: Multistage Omega Network – Routing

Let s be the binary representation of the source and d be that of the destination processor.

The data traverses the link to the first switching node. If the most significant bits of s and d are the same, the data is routed in pass-through mode by the switch; otherwise, it uses the crossover mode.

This process is repeated for each of the log p switching stages, using the next bit position at each stage.

Note that this is not a non-blocking network.
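A sketch of this destination-tag rule in C (illustrative only; it prints the switch setting chosen at each stage):

#include <stdio.h>

/* Route in an omega network with 2^stages inputs: at each stage compare the
   corresponding bit of s and d, most significant bit first. */
void omega_route(int s, int d, int stages) {
    for (int k = stages - 1; k >= 0; k--) {
        int sb = (s >> k) & 1, db = (d >> k) & 1;
        printf("stage %d: %s\n", stages - 1 - k, (sb == db) ? "pass-through" : "crossover");
    }
}

int main(void) {
    omega_route(2 /* 010 */, 7 /* 111 */, 3);   /* the 010 -> 111 case used on the next slide */
    return 0;
}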


25

Network Topologies: Multistage Omega Network – Routing

An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.


26

Network Topologies - Fixed Connection Networks (static):
Completely-connected Network
Star-Connected Network
Linear array
2d-array or 2d-mesh or mesh
3d-mesh
Complete Binary Tree (CBT)
2d-Mesh of Trees
Hypercube


27

Evaluating Static Interconnection Networks

One can view an interconnection network as a graph whose nodes correspond to processors and whose edges correspond to links connecting neighboring processors. The properties of these interconnection networks can be described in terms of a number of criteria.

(1) Set of processor nodes V. The cardinality of V is the number of processors p (also denoted by n).

(2) Set of edges E linking the processors. An edge e = (u, v) is represented by a pair (u, v) of nodes. If the graph G = (V, E) is directed, this means that there is a unidirectional link from processor u to v. If the graph is undirected, the link is bidirectional. In almost all networks that will be considered in this course, communication links will be bidirectional. The exceptions will be clearly distinguished.

(3) The degree d_u of node u is the number of links containing u as an endpoint. If graph G is directed, we distinguish between the out-degree of u (number of pairs (u, v) ∈ E, for any v ∈ V) and, similarly, the in-degree of u. The degree d of graph G is the maximum of the degrees of its nodes, i.e. d = max_u d_u.


28

Evaluating Static Interconnection Networks (cont.)

(4) The diameter D of graph G is the maximum of the lengths of the shortest paths linking any two nodes of G. A shortest path between u and v is the path of minimal length linking u and v. We denote the length of this shortest path by d_uv. Then, D = max_{u,v} d_uv. The diameter of a graph G denotes the maximum delay (in terms of number of links traversed) that will be incurred when a packet is transmitted from one node to the other of the pair that contributes to D (i.e. from u to v or the other way around, if D = d_uv). Of course such a delay would hold only if messages follow shortest paths (the case for most routing algorithms).

(5) Latency is the total time to send a message, including software overhead. Message latency is the time to send a zero-length message.

(6) Bandwidth is the number of bits transmitted in unit time.

(7) Bisection width is the number of links that need to be removed from G to split the nodes into two sets of about the same size (±1).


29

Network Topologies - Fixed Connection Networks (static)

Completely-connected Network
Each node has a direct communication link to every other node in the network.
Ideal in the sense that a node can send a message to another node in a single step.
Static counterpart of crossbar switching networks; nonblocking.

Star-Connected Network
One processor acts as the central processor; every other processor has a communication link connecting it to this central processor.
Similar to a bus-based network: the central processor is the bottleneck.


30

Network Topologies: Completely Connected and Star Connected Networks

(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.


31

Network Topologies - Fixed Connection Networks (static)

Linear array
In a linear array, each node (except the two nodes at the ends) has two neighbors, one each to its left and right.
Extension: ring or 1-D torus (linear array with wraparound).

2d-array or 2d-mesh or mesh
The processors are ordered to form a 2-dimensional structure (square) so that each processor is connected to its four neighbors (north, south, east, west) except perhaps for the processors on the boundary.
Extension of the linear array to two dimensions: each dimension has √p nodes, with a node identified by a two-tuple (i,j).

3d-mesh
A generalization of the 2d-mesh in three dimensions. Exercise: Find the characteristics of this network and its generalization in k dimensions (k > 2).

Complete Binary Tree (CBT)
There is only one path between any pair of nodes.
Static tree network: has a processing element at each node of the tree.
Dynamic tree network: nodes at intermediate levels are switching nodes and the leaf nodes are processing elements.
Communication bottleneck at the higher levels of the tree. Solution: increase the number of communication links and switching nodes closer to the root (fat tree).


32

Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links; (b) with wraparound link.


33

Network Topologies: Two- and Three Dimensional Meshes

Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and

(c) a 3-D mesh with no wraparound.


34

Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.


35

Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.


36

Evaluating Network Topologies

Linear array
|V| = N, |E| = N − 1, d = 2, D = N − 1, bw = 1 (bisection width).

2d-array or 2d-mesh or mesh
For |V| = N, we have a √N × √N mesh structure, with |E| <= 2N = O(N), d = 4, D = 2√N − 2, bw = √N.

3d-mesh
Exercise: Find the characteristics of this network and its generalization in k dimensions (k > 2).

Complete Binary Tree (CBT) on N = 2^n leaves
For a complete binary tree on N leaves, we define the level of a node to be its distance from the root. The root is at level 0 and the number of nodes at level i is 2^i. Then, |V| = 2N − 1, |E| <= 2N − 2 = O(N), d = 3, D = 2 lg N, bw = 1.


37

Network Topologies - Fixed Connection Networks (static)

2d-Mesh of Trees
An N^2-leaf 2d-MOT consists of N^2 nodes ordered as in a 2d-array N × N (but without the links). The N rows and N columns of the 2d-MOT form N row CBTs and N column CBTs respectively.

For such a network, |V| = N^2 + 2N(N − 1), |E| = O(N^2), d = 3, D = 4 lg N, bw = N.

The 2d-MOT possesses an interesting decomposition property: if the 2N roots of the CBTs are removed, we get four N/2 × N/2 2d-MOTs.

The 2d-Mesh of Trees (2d-MOT) combines the advantages of 2d-meshes and binary trees. A 2d-mesh has large bisection width but large diameter (√N). On the other hand, a binary tree on N leaves has small diameter but small bisection width. The 2d-MOT has small diameter and large bisection width.

A 3d-MOT can be defined similarly.


38

Network Topologies - Fixed Connection Networks (static)

Hypercube
The hypercube is the major representative of a class of networks called hypercubic networks. Other such networks are the butterfly, the shuffle-exchange graph, the de Bruijn graph, cube-connected cycles, etc.

Each vertex of an n-dimensional hypercube is represented by a binary string of length n. Therefore there are |V| = 2^n = N vertices in such a hypercube. Two vertices are connected by an edge if their strings differ in exactly one bit position.

Let u = u_1 u_2 ... u_i ... u_n. An edge is a dimension-i edge if it links two nodes that differ in the i-th bit position. This way vertex u is connected to vertex u^i = u_1 u_2 ... ū_i ... u_n by a dimension-i edge. Therefore |E| = N lg N / 2 and d = lg N = n.

The hypercube is the first network examined so far whose degree is not a constant but a (very slowly growing) function of N. The diameter of the hypercube is D = lg N. A path from node u to node v can be determined by correcting the bits of u to agree with those of v, starting from dimension 1 in a “left-to-right” fashion. The bisection width of the hypercube is bw = N/2. This is a result of the following property of the hypercube: if all edges of dimension i are removed from an n-dimensional hypercube, we get two hypercubes, each of dimension n − 1.
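A small sketch of the bit-correction routing just described (illustrative, not from the slides; here dimension i corresponds to bit i of the label):

#include <stdio.h>

/* Route from node u to node v in an n-dimensional hypercube by correcting the bits
   of u to agree with v, one dimension at a time; the number of hops is the Hamming
   distance, which is at most n = lg N. */
void hypercube_route(unsigned u, unsigned v, int n) {
    unsigned cur = u;
    for (int i = 0; i < n; i++) {
        unsigned mask = 1u << i;
        if ((cur & mask) != (v & mask)) {
            cur ^= mask;                      /* traverse a dimension-(i+1) edge */
            printf("hop to node %u\n", cur);
        }
    }
}

int main(void) {
    hypercube_route(0 /* 000 */, 5 /* 101 */, 3);   /* two hops: node 001, then node 101 */
    return 0;
}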


39

Network Topologies: Hypercubes and their Construction

Construction of hypercubes from hypercubes of lower dimension.


40

Evaluating Static Interconnection Networks

Network                    Diameter             Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected       1                    p^2/4             p − 1              p(p − 1)/2
Star                       2                    1                 1                  p − 1
Complete binary tree       2 lg((p + 1)/2)      1                 1                  p − 1
Linear array               p − 1                1                 1                  p − 1
2-D mesh, no wraparound    2(√p − 1)            √p                2                  2(p − √p)
2-D wraparound mesh        2⌊√p/2⌋              2√p               4                  2p
Hypercube                  lg p                 p/2               lg p               (p lg p)/2
Wraparound k-ary d-cube    d⌊k/2⌋               2k^(d−1)          2d                 dp


41

Performance Metrics for Parallel Systems

Number of processing elements p.

Execution Time
Parallel runtime: the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution.
Ts: serial runtime; Tp: parallel runtime.

Total Parallel Overhead T0
The total time collectively spent by all the processing elements minus the running time required by the fastest known sequential algorithm for solving the same problem on a single processing element:
T0 = pTp − Ts


42

Performance Metrics for Parallel Systems

Speedup S:
The ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processing elements: S = Ts(best)/Tp.
Example: adding n numbers: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n).

Theoretically, speedup can never exceed the number of processing elements p (S <= p).
Proof: Assume the speedup is greater than p; then each processing element spends less than time Ts/p solving the problem. In this case, a single processing element could emulate the p processing elements and solve the problem in fewer than Ts units of time. This is a contradiction because speedup, by definition, is computed with respect to the best sequential algorithm.

Superlinear speedup: In practice, a speedup greater than p is sometimes observed. This usually happens when the work performed by the serial algorithm is greater than that of its parallel formulation, or due to hardware features that put the serial implementation at a disadvantage.


43

Example of Superlinear Speedup

Example 1: Superlinear effects from caches. With a problem instance of size A and a 64KB cache, the cache hit rate is 80%. Assume a latency to cache of 2 ns and a latency to DRAM of 100 ns; then the memory access time is 2*0.8 + 100*0.2 = 21.6 ns. If the computation is memory bound and performs one FLOP per memory access, this corresponds to a processing rate of 46.3 MFLOPS. With a problem instance of size A/2 on each of two processors and a 64KB cache, the cache hit rate is higher, i.e. 90%; 8% of the data comes from local DRAM and the other 2% comes from remote DRAM with a latency of 400 ns, so the memory access time is 2*0.9 + 100*0.08 + 400*0.02 = 17.8 ns. The corresponding execution rate at each processor is 56.18 MFLOPS, and for two processors the total processing rate is 112.36 MFLOPS. The speedup is then 112.36/46.3 = 2.43!
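The arithmetic in this example can be reproduced with a few lines of C (values copied from the slide; one flop per memory access is assumed):

#include <stdio.h>

int main(void) {
    /* One processor, problem size A: 80% cache hits at 2 ns, 20% DRAM at 100 ns. */
    double t1 = 0.80 * 2 + 0.20 * 100;                 /* 21.6 ns per access */
    double rate1 = 1000.0 / t1;                        /* ~46.3 MFLOPS       */
    /* Two processors, size A/2 each: 90% hits, 8% local DRAM, 2% remote DRAM at 400 ns. */
    double t2 = 0.90 * 2 + 0.08 * 100 + 0.02 * 400;    /* 17.8 ns per access */
    double rate2 = 1000.0 / t2;                        /* ~56.2 MFLOPS per processor */
    printf("speedup = %.2f\n", (2.0 * rate2) / rate1); /* prints speedup = 2.43 */
    return 0;
}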


44

Example of Superlinear Speedup

Example 2: Superlinear effects due to exploratory decomposition. Consider exploring the leaf nodes of an unstructured tree. Each leaf has a label associated with it, and the objective is to find a node with a specified label, say 'S'. The solution node is the rightmost leaf in the tree. A serial formulation of this problem based on depth-first tree traversal explores the entire tree, i.e. all 14 nodes, taking 14 units of time. Now consider a parallel formulation in which the left subtree is explored by processing element 0 and the right subtree by processing element 1. The total work done by the parallel algorithm is only 9 nodes and the corresponding parallel time is 5 units. The speedup is then 14/5 = 2.8.


45

Performance Metrics for Parallel Systems (cont.)

Efficiency E
Ratio of speedup to the number of processing elements: E = S/p.
A measure of the fraction of time for which a processing element is usefully employed.
Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n), E = Θ(1/log n).

Cost (also called Work or processor-time product) W
Product of parallel runtime and the number of processing elements used: W = Tp * p.
Example: adding n numbers on n processing elements: W = Θ(n log n).
Cost-optimal: the cost of solving a problem on a parallel computer has the same asymptotic growth (in Θ terms), as a function of the input size, as the fastest-known sequential algorithm on a single processing element.

Problem Size W2
The number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element: W2 = Ts of the fastest known algorithm to solve the problem on a sequential computer.


46

Parallel vs Sequential Computing: Amdahl’s Law

Theorem 0.1 (Amdahl’s Law) Let f, 0 ≤ f ≤ 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup S on p processors is S ≤ 1/(f + (1 − f)/p).

Proof. Let T be the sequential running time for the named computation. fT is the time spent on the inherently sequential part of the program. On p processors the remaining computation, if fully parallelizable, would achieve a running time of at most (1 − f)T/p. This way the running time of the parallel program on p processors is the sum of the execution times of the sequential and parallel components, that is, fT + (1 − f)T/p. The maximum achievable speedup is therefore S ≤ T/(fT + (1 − f)T/p), and the result is proven.
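A short tabulation of the bound (illustrative values for f and p, not from the slides):

#include <stdio.h>

/* Amdahl's law: S <= 1 / (f + (1 - f)/p). */
int main(void) {
    double f = 0.10;                          /* inherently sequential fraction */
    int procs[] = {2, 4, 16, 64, 1024};
    for (int i = 0; i < 5; i++) {
        int p = procs[i];
        printf("p = %4d   S <= %.2f\n", p, 1.0 / (f + (1.0 - f) / p));
    }
    return 0;                                 /* as p grows, the bound approaches 1/f = 10 */
}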


47

Amdahl’s Law

Amdahl used this observation to advocate the building of even more powerful sequential machines, since one cannot gain much by using parallel machines. For example, if f = 10%, then S ≤ 10 as p → ∞. The underlying assumption in Amdahl’s Law is that the sequential component of a program is a constant fraction of the whole program. In many instances, as the problem size increases, the fraction of computation that is inherently sequential decreases. In many cases even a speedup of 10 is quite significant by itself.

In addition, Amdahl’s Law is based on the premise that parallel computing always tries to minimize parallel time. In some cases a parallel computer is used to increase the problem size that can be solved in a fixed amount of time. For example, in weather prediction this would increase the accuracy of, say, a three-day forecast, or would allow a more accurate five-day forecast.


48

Parallel vs Sequential Computing: Gustafson’s Law

Theorem 0.2 (Gustafson’s Law) Let the execution time of a parallel algorithm on p processors consist of a sequential segment fT and a parallel segment (1 − f)T, where the sequential segment is constant. The scaled speedup of the algorithm is then
S = (fT + (1 − f)Tp)/(fT + (1 − f)T) = f + (1 − f)p.

For f = 0.05 (and p = 20), we get S = 19.05, whereas Amdahl’s law gives S ≤ 10.26.

[Diagram: on 1 processor the work takes fT + (1 − f)Tp = T(f + (1 − f)p); on p processors it takes fT + (1 − f)T = T.]

Amdahl’s Law assumes that the problem size is fixed when it deals with scalability. Gustafson’s Law assumes that the running time is fixed.
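Worked numbers for the comparison above (p = 20): Gustafson gives S = f + (1 − f)p = 0.05 + 0.95 × 20 = 19.05, while Amdahl gives S ≤ 1/(f + (1 − f)/p) = 1/(0.05 + 0.95/20) = 1/0.0975 ≈ 10.26.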


49

Brent’s Scheduling Principle (Emulations)

Suppose we have an unlimited-parallelism efficient parallel algorithm, i.e. an algorithm that runs on zillions of processors. In practice zillions of processors may not be available. Suppose we have only p processors. A question that arises is what we can do to “run” the efficient zillion-processor algorithm on our limited machine.

One answer is emulation: simulate the zillion-processor algorithm on the p-processor machine.

Theorem 0.3 (Brent’s Principle) Let a parallel algorithm perform m operations in parallel time t. Then running this algorithm on a machine with only p processors requires time at most m/p + t.

Proof: Let m_i be the number of computational operations at the i-th step, so that Σ_{i=1..t} m_i = m. If we assign the p processors on the i-th step to work on these m_i operations, they conclude in time ⌈m_i/p⌉ ≤ m_i/p + 1. Thus the total running time on p processors is at most
Σ_{i=1..t} ⌈m_i/p⌉ ≤ Σ_{i=1..t} (m_i/p + 1) = m/p + t.
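A standard worked example of this principle (not from the slides): summing n numbers uses m = n − 1 additions organized in t = lg n parallel steps, so by Brent’s principle p processors can run the algorithm in about (n − 1)/p + lg n time, i.e. O(n/p + lg n); choosing p = n/lg n gives O(lg n) time with only n/lg n processors.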


50

The Message Passing Interface (MPI): Introduction

The Message-Passing Interface (MPI) is an attempt to create a standard that allows tasks executing on multiple processors to communicate through some standardized communication primitives.

It defines a standard library for message passing that one can use to develop message-passing programs using C or Fortran.

The MPI standard defines both the syntax and the semantics of these functional interfaces to message passing.

MPI comes in a variety of flavors: freely available ones such as LAM-MPI and MPICH, and also commercial versions such as Critical Software's WMPI.

It supports message passing on a variety of platforms, from Linux-based or Windows-based PCs to supercomputers and multiprocessor systems.

After the introduction of MPI, whose functionality includes a set of 125 functions, a revision of the standard took place that added C++ support, external memory accessibility, and also Remote Memory Access (similar to BSP's put and get capability) to the standard. The resulting standard is known as MPI-2 and has grown to almost 241 functions.


51

The Message Passing Interface: A Minimum Set

A minimum set of MPI functions is described below. MPI functions use the prefix MPI_, and after the prefix the remaining keyword starts with a capital letter.

A brief explanation of the primitives can be found in the textbook (beginning on page 242). A more elaborate presentation is available in the optional book.

Function Class                   Standard   Function        Operation
Initialization and Termination   MPI        MPI_Init        Start of SPMD code
                                 MPI        MPI_Finalize    End of SPMD code
Abnormal Stop                    MPI        MPI_Abort       One process halts all
Process Control                  MPI        MPI_Comm_size   Number of processes
                                 MPI        MPI_Comm_rank   Identifier of calling process
                                 MPI        MPI_Wtime       Local (wall-clock) time
Message Passing                  MPI        MPI_Send        Send message to remote proc.
                                 MPI        MPI_Recv        Receive mess. from remote proc.


52

General MPI program

#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);   /* no MPI functions called before this */
    /* ... */
    MPI_Finalize();           /* no MPI functions called after this */
    return 0;
}


53

MPI and MPI-2: Initialization and Termination

#include <mpi.h>
int MPI_Init(int *argc, char ***argv);
int MPI_Finalize(void);

Multiple processes from the same source are created by issuing the function MPI_Init, and these processes are safely terminated by issuing MPI_Finalize.

The arguments of MPI_Init are pointers to the main function's parameters argc and argv; after the call, the command-line arguments used/processed by the MPI implementation have been removed. Thus command-line processing should only be performed in the program after the execution of this function call. On success the call returns MPI_SUCCESS; otherwise an implementation-dependent error code is returned.

Definitions are available in <mpi.h>.


54

MPI and MPI-2: Abort

int MPI_Abort(MPI_Comm comm, int errcode);

MPI_Abort aborts an MPI program. The first argument is a communicator and the second argument is an integer error code.

A communicator is a collection of processes that can send messages to each other. A default communicator is MPI_COMM_WORLD, which consists of all the processes running when program execution begins.


55

MPI and MPI-2: Communicators; Process Control

Under MPI, a communication domain is a set of processors that are allowed to communicate with each other. Information about such a domain is stored in a communicator that uniquely identifies the processors that participate in a communication operation.

A default communication domain contains all the processors of a parallel execution; it is called MPI_COMM_WORLD. By using multiple communicators between possibly overlapping groups of processors, we make sure that messages do not interfere with each other.

#include <mpi.h>
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);

Thus
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
return the number of processors nprocs and the processor id pid of the calling processor.


56

MPI and MPI-2: Communicators; Process Control: Example

A hello world! program in MPI is the following one.

#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int nprocs, mypid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);
    printf("Hello world from process %d of total %d\n", mypid, nprocs);
    MPI_Finalize();
    return 0;
}


57

Submit a job to Grid Engine

(1) Create the MPI program, e.g. example.c.
(2) Compile the program: mpicc -O2 example.c -o example
(3) Type the command: vi submit
(4) Put this into the submit file:
#!/bin/bash
#$ -S /bin/sh
#$ -N example
#$ -q default
#$ -pe lammpi 4
#$ -cwd
mpirun C ./example
(5) Then run: chmod a+x submit
(6) Then run: qsub submit
(7) Run qstat to check the status of your program.


58

MPI Message-Passing primitives

#include <mpi.h>
/* Blocking send and receive */
int MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Status *stat);

/* Non-Blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status *stat);

buf      - initial address of the send/receive buffer
count    - number of elements in the send buffer (nonnegative integer), or maximum number of elements in the receive buffer
dtype    - datatype of each send/receive buffer element (handle)
dest,src - rank of destination/source (integer). Wild-card: MPI_ANY_SOURCE for recv only; no wildcard for dest.
tag      - message tag (integer), range 0...32767. Wild-card: MPI_ANY_TAG for recv only; send must specify a tag.
comm     - communicator (handle)
stat     - status object (Status); it returns the source and tag of the message that was actually received. Can be the MPI constant MPI_STATUS_IGNORE if the return status is not desired.

Fields of the status struct: status->MPI_SOURCE, status->MPI_TAG, status->MPI_ERROR.


59

Data type correspondence between MPI and C

MPI_CHAR           --> signed char
MPI_SHORT          --> signed short int
MPI_INT            --> signed int
MPI_LONG           --> signed long int
MPI_UNSIGNED_CHAR  --> unsigned char
MPI_UNSIGNED_SHORT --> unsigned short int
MPI_UNSIGNED       --> unsigned int
MPI_UNSIGNED_LONG  --> unsigned long int
MPI_FLOAT          --> float
MPI_DOUBLE         --> double
MPI_LONG_DOUBLE    --> long double
MPI_BYTE, MPI_PACKED --> no direct C equivalent


60

MPI Message-Passing primitives

The MPI_Send and MPI_Recv functions are blocking, that is, they do not return unless it is safe to modify or use the contents of the send/receive buffer.

MPI also provides non-blocking send and receive primitives. These are MPI_Isend and MPI_Irecv, where the I stands for Immediate.

These functions allow a process to post that it wants to send to or receive from another process, and then allow the process to call a function (e.g. MPI_Wait) to complete the send-receive pair.

Non-blocking send/receives allow for the overlapping of computation and communication. Thus MPI_Wait plays the role of a synchronizer: the send/receive are only advisories, and the communication is only effected at the MPI_Wait.


61

MPI Message-Passing primitives: tag

A tag is simply an integer argument that is passed to a communication function and that can be used to uniquely identify a message. For example, in MPI, if process A sends a message to process B, then in order for B to receive the message, the tag used in A's call to MPI_Send must be the same as the tag used in B's call to MPI_Recv. Thus, if the characteristics of two messages sent by A to B are identical (i.e., same count and datatype), then A and B can distinguish between the two by using different tags.

For example, suppose A is sending two floats, x and y, to B. Then the processes can be sure that the values are received correctly, regardless of the order in which A sends and B receives, provided different tags are used:

/* Assume system provides some buffering */
if (my_rank == A) {
    tag = 0;
    MPI_Send(&x, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
    . . .
    tag = 1;
    MPI_Send(&y, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
} else if (my_rank == B) {
    tag = 1;
    MPI_Recv(&y, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status);
    . . .
    tag = 0;
    MPI_Recv(&x, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status);
}


62

MPI Message-Passing primitives: Communicators

Now if one message from process A to process B is being sent by the library, and another, with identical characteristics, is being sent by the user's code, then unless the library developer insists that user programs refrain from using certain tag values, this approach cannot be made to work. Clearly, partitioning the set of possible tags is at best an inconvenience: if one wishes to modify an existing user code so that it can use a library that partitions tags, each message-passing function in the entire user code must be checked.

The solution that was ultimately decided on was the communicator.

Formally, a communicator is a pair of objects: the first is a group or ordered collection of processes, and the second is a context, which can be viewed as a unique, system-defined tag. Every communication function in MPI takes a communicator argument, and a communication can succeed only if all the processes participating in the communication use the same communicator argument. Thus, a library can either require that its functions be passed a unique library-specific communicator, or its functions can create their own unique communicator. In either case, it is straightforward for the library designer and the user to make certain that their messages are not confused.


63

For example, suppose now that the user's code is sending a float, x, from process A to process B, while the library is sending a float, y, from A to B:

/* Assume system provides some buffering */
void User_function(int my_rank, float* x) {
    MPI_Status status;
    if (my_rank == A) {
        /* MPI_COMM_WORLD is pre-defined in MPI */
        MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
    } else if (my_rank == B) {
        MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
    }
    . . .
}

void Library_function(float* y) {
    MPI_Comm library_comm;
    MPI_Status status;
    int my_rank;
    /* Create a communicator with the same group  */
    /* as MPI_COMM_WORLD, but a different context */
    MPI_Comm_dup(MPI_COMM_WORLD, &library_comm);
    /* Get process rank in new communicator */
    MPI_Comm_rank(library_comm, &my_rank);
    if (my_rank == A)
        MPI_Send(y, 1, MPI_FLOAT, B, 0, library_comm);
    else if (my_rank == B) {
        MPI_Recv(y, 1, MPI_FLOAT, A, 0, library_comm, &status);
    }
    . . .
}

int main(int argc, char* argv[]) {
    . . .
    if (my_rank == A) {
        User_function(A, &x);
        . . .
        Library_function(&y);
    } else if (my_rank == B) {
        Library_function(&y);
        . . .
        User_function(B, &x);
    }
    . . .
}


64

MPI Message-Passing primitives: User-defined datatypes

The second main innovation in MPI, user-defined datatypes, allows programmers to exploit this power and, as a consequence, to create messages consisting of logically unified sets of data rather than only physically contiguous blocks of data.

Loosely, an MPI datatype is a sequence of displacements in memory together with a collection of basic datatypes (e.g., int, float, double, and char). Thus, an MPI datatype specifies the layout in memory of data to be collected into a single message or data to be distributed from a single message.

For example, suppose we specify a sparse matrix entry with the following definition.

typedef struct {
    double entry;
    int row, col;
} mat_entry_t;

MPI provides functions for creating a variable that stores the layout in memory of a variable of type mat_entry_t. One does this by first defining an MPI datatype

MPI_Datatype mat_entry_mpi_t;

to be used in communication functions, and then calling various MPI functions to initialize mat_entry_mpi_t so that it contains the required layout. Then, if we define

mat_entry_t x;

we can send x by simply calling

MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag, comm);

and we can receive x with a similar call to MPI_Recv.
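A sketch of how mat_entry_mpi_t could be initialized using the standard MPI-2 struct-datatype calls (the slides do not show this code; the helper name is ours):

#include <mpi.h>
#include <stddef.h>

typedef struct { double entry; int row, col; } mat_entry_t;

/* Build an MPI datatype describing mat_entry_t: one double followed by two ints. */
void build_mat_entry_type(MPI_Datatype *mat_entry_mpi_t) {
    int          blocklens[2] = { 1, 2 };
    MPI_Aint     displs[2]    = { offsetof(mat_entry_t, entry), offsetof(mat_entry_t, row) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };
    MPI_Type_create_struct(2, blocklens, displs, types, mat_entry_mpi_t);
    MPI_Type_commit(mat_entry_mpi_t);   /* must be committed before use in MPI_Send/MPI_Recv */
}

After calling build_mat_entry_type(&mat_entry_mpi_t), the MPI_Send call shown above can be used as-is.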


65

MPI: An example with the blocking operations

#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be a multiple of nprocs to avoid problems.
// Parallel sum of 1, 2, 3, ..., N
int main(int argc, char **argv)
{
    int pid, nprocs, i, j;
    int sum, start, end, total;   // Note: for N this large the sum overflows a 32-bit int; a long long accumulator would be needed in practice.
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    sum = 0; total = 0;
    start = (N/nprocs)*pid + 1;   // Each processor sums its own block
    end   = (N/nprocs)*(pid+1);
    for (i = start; i <= end; i++) sum += i;
    if (pid != 0) {
        MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    } else {
        for (j = 1; j < nprocs; j++) {
            MPI_Recv(&total, 1, MPI_INT, j, 1, MPI_COMM_WORLD, &status);
            sum = sum + total;
        }
    }
    if (pid == 0) {
        printf(" The sum from 1 to %d is %d \n", N, sum);
    }
    MPI_Finalize();
    return 0;
}
// Note: Program neither compiled nor run!


66

Non-Blocking Send and Receive

/* Non-Blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status *stat);

#include "mpi.h"
int MPI_Wait(MPI_Request *request, MPI_Status *status)
Waits for an MPI send or receive to complete.
Input parameter: request (handle).
Output parameter: status object (Status). May be MPI_STATUS_IGNORE.


67

MPI: An example with the non-blocking operations

#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be a multiple of nprocs to avoid problems.
// Parallel sum of 1, 2, 3, ..., N
int main(int argc, char **argv)
{
    int pid, nprocs, i, j;
    int sum, start, end, total;
    MPI_Status status;
    MPI_Request request;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    sum = 0; total = 0;
    start = (N/nprocs)*pid + 1;   // Each processor sums its own block
    end   = (N/nprocs)*(pid+1);
    for (i = start; i <= end; i++) sum += i;
    if (pid != 0) {
        // MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Isend(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
    } else {
        for (j = 1; j < nprocs; j++) {
            MPI_Recv(&total, 1, MPI_INT, j, 1, MPI_COMM_WORLD, &status);
            sum = sum + total;
        }
    }
    if (pid == 0) {
        printf(" The sum from 1 to %d is %d \n", N, sum);
    }
    MPI_Finalize();
    return 0;
}
// Note: Program neither compiled nor run!


68

MPI Basic Collective Operations

One simple collective operation:

int MPI_Bcast(void *message, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

The routine MPI_Bcast sends data from one process (the root) to all others.


69

MPI_Bcast

[Diagram: the root (Process 0) writes its data to Processes 1, 2, and 3; before the broadcast only Process 0 has the data present, afterwards all four processes do.]


70

Simple Program that Demonstrates MPI_Bcast:

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int k, id, p, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (id == 0) k = 20;
    else         k = 10;
    for (p = 0; p < size; p++) {
        if (id == p) printf("Process %d: k= %d before\n", id, k);
    }
    // note MPI_Bcast must be put where all other processes
    // can see it.
    MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
    for (p = 0; p < size; p++) {
        if (id == p) printf("Process %d: k= %d after\n", id, k);
    }
    MPI_Finalize();
    return 0;
}


71

Simple Program that Demonstrates MPI_Bcast: the output would look like

Process 0: k= 20 before
Process 0: k= 20 after
Process 3: k= 10 before
Process 3: k= 20 after
Process 2: k= 10 before
Process 2: k= 20 after
Process 1: k= 10 before
Process 1: k= 20 after

Page 72: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

72

Parallel Algorithm Assumptions

Convention: In this subject we name processors arbitrarily either 0, 1, . . . , p − 1 or 1, 2, . . . , p.

The input to a particular problem resides in the cells of the shared memory. We assume, in order to simplify the exposition of our algorithms, that a cell is wide enough (in bits or bytes) to accommodate a single instance of the input (e.g. a key or a floating-point number). If the input is of size n, the first n cells, numbered 0, . . . , n − 1, store the input.

We assume that the number of processors of the PRAM is n or a polynomial function of the size n of the input. Processor indices are 0, 1, . . . , n − 1.

Page 73: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

73

PRAM Algorithm: Matrix Multiplication

A simple algorithm for multiplying two n × n matrices on a CREW PRAM with time complexity T = O(lg n) and P = n^3 follows. For convenience, processors are indexed as triples (i, j, k), where i, j, k = 1, . . . , n. In the first step processor (i, j, k) concurrently reads aij and bjk and performs the multiplication aij·bjk. In the following steps, for all i, k the results (i, ∗, k) are combined, using the parallel sum algorithm, to form cik = Σj aij·bjk. After lg n steps, the result cik is thus computed.

The same algorithm also works on the EREW PRAM with the same time and processor complexity; only the first step of the CREW algorithm needs to change. We avoid concurrent reads by broadcasting element aij to processors (i, j, ∗) using the broadcasting algorithm of the EREW PRAM in O(lg n) steps. Similarly, bjk is broadcast to processors (∗, j, k).

The above algorithm also shows how an n-processor EREW PRAM can simulate an n-processor CREW PRAM with an O(lg n) slowdown.

Page 74: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

74

Matrix Multiplication: CREW vs. EREW

                                              CREW      EREW
1. aij to all (i,j,*) procs                   O(1)      O(lg n)
   bjk to all (*,j,k) procs                   O(1)      O(lg n)
2. aij*bjk at proc (i,j,k)                    O(1)      O(1)
3. parallel sum Σj aij*bjk at procs (i,*,k)   O(lg n)   O(lg n)   (n procs participate)
4. cik = Σj aij*bjk                           O(1)      O(1)

T = O(lg n), P = O(n^3), W = O(n^3 lg n), W2 = O(n^3)

Page 75: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

75

PRAM Algorithm: Logical AND Operation

Problem. Let X1, . . . , Xn be binary/boolean values. Find X = X1 ∧ X2 ∧ . . . ∧ Xn.

The sequential problem accepts a P = 1, T = O(n), W = O(n) direct solution.

An EREW PRAM algorithm for this problem works the same way as the PARALLEL SUM algorithm and its performance is P = O(n), T = O(lg n), W = O(n lg n), along with the improvements in P and W mentioned for the PARALLEL SUM algorithm.

In the remainder we will investigate a CRCW PRAM algorithm. Let binary value Xi reside in shared memory location i. We can find X = X1 ∧ X2 ∧ . . . ∧ Xn in constant time on a CRCW PRAM. Processor 1 first writes a 1 in shared memory cell 0. If Xi = 0, processor i writes a 0 in memory cell 0. The result X is then stored in this memory cell.

The result stored in cell 0 is 1 (TRUE) unless some processor writes a 0 into cell 0; in that case one of the Xi is 0 (FALSE) and the result X should be FALSE, as indeed it is.

Page 76: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

76

Logical AND Operation

begin Logical AND (X1 . . . Xn)
  1. Processor 1 writes 1 into cell 0.
  2. If Xi = 0, processor i writes 0 into cell 0.
end Logical AND

Exercise Give an O(1) CRCW algorithm for LOGICAL OR.
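For reference, a minimal C/OpenMP sketch simulating the constant-time CRCW AND; the OR of the exercise is symmetric (initialize cell 0 to 0 and write a 1 whenever Xi = 1). The atomic write is only a stand-in for the PRAM's common concurrent write, not a claim about real O(1) behavior.

#include <stdio.h>
#include <omp.h>

// Simulated CRCW AND: cell0 starts at 1; every "processor" i with X[i] == 0
// concurrently writes 0 into cell 0. All concurrent writers write the same
// value, so a common-CRCW PRAM resolves this in O(1) time.
int crcw_and(const int *X, int n) {
    int cell0 = 1;                        // step 1: processor 1 writes 1
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (X[i] == 0) {
            #pragma omp atomic write
            cell0 = 0;                    // step 2: concurrent writes of 0
        }
    }
    return cell0;
}

int main(void) {
    int X[] = {1, 1, 1, 0, 1};
    printf("AND = %d\n", crcw_and(X, 5));   // prints 0
    return 0;
}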

Page 77: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

77

Parallel Operations with Multiple Outputs – Parallel Prefix

Problem definition: Given a set of n values x0, x1, . . . , xn−1 and an associative operator, say +, the parallel prefix problem is to compute the following n results/"sums":

0: x0
1: x0 + x1
2: x0 + x1 + x2
. . .
n − 1: x0 + x1 + . . . + xn−1

Parallel prefix is also called prefix sums or scan. It has many uses in parallel computing, such as load-balancing the work assigned to processors and compacting data structures such as arrays.

We shall prove that computing ALL THE SUMS is no more difficult than computing the single sum x0 + . . . + xn−1.

Page 78: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

78

Parallel Prefix Algorithm 1: divide-and-conquer

[Figure: a parallel prefix "box" for 8 inputs x0 . . . x7 is built from two parallel prefix boxes of 4 inputs each (Box 1 on x0 . . . x3, Box 2 on x4 . . . x7). The rightmost output of Box 1, x0+...+x3, is combined with each output of Box 2, giving the outputs x0, x0+x1, x0+x1+x2, x0+...+x3, x0+...+x4, x0+...+x5, x0+...+x6, x0+...+x7.]

Page 79: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

79

Parallel Prefix Algorithm 2

An algorithm for parallel prefix on an EREW PRAM requires lg n phases. In phase i (i = 1, . . . , lg n), processor j reads the contents of cells j and j − 2^(i−1) (if the latter exists), combines them, and stores the result in cell j.

The EREW PRAM algorithm that solves the parallel prefix problem has performance P = O(n), T = O(lg n), and W = O(n lg n), W2 = O(n).

Page 80: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

80

Parallel Prefix Algorithm 2: Example

For visualization purposes, steps 2 and 3 are each written on two lines. When we write x1 + . . . + x5 we mean x1 + x2 + x3 + x4 + x5.

Input:   x1   x2     x3         x4               x5               x6               x7               x8
Step 1:       x1+x2  x2+x3      x3+x4            x4+x5            x5+x6            x6+x7            x7+x8
Step 2:              x1+(x2+x3) (x1+x2)+(x3+x4)  (x2+x3)+(x4+x5)  (x3+x4)+(x5+x6)  (x4+x5)+(x6+x7)  (x5+x6)+(x7+x8)
Step 3:                                          x1+...+x5        x1+...+x6        x1+...+x7        x1+...+x8
Finally: x1   x1+x2  x1+...+x3  x1+...+x4        x1+...+x5        x1+...+x6        x1+...+x7        x1+...+x8

Page 81: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

81

Parallel Prefix Algorithm 2: Example 2

For visualization purposes, steps 1, 2 and 3 are each written on two lines. We write [i:j] to denote xi + . . . + xj; so [1:2] is x1+x2, and [i:i] is xi (NOT xi+xi). Thus [1:2][3:4] = [1:2]+[3:4] = (x1+x2) + (x3+x4) = x1+x2+x3+x4. A * indicates that the value above remains the same in subsequent steps.

Step 0:  [1:1]  [2:2]       [3:3]       [4:4]       [5:5]       [6:6]       [7:7]       [8:8]
Step 1:  *      [1:1][2:2]  [2:2][3:3]  [3:3][4:4]  [4:4][5:5]  [5:5][6:6]  [6:6][7:7]  [7:7][8:8]
      =  *      [1:2]       [2:3]       [3:4]       [4:5]       [5:6]       [6:7]       [7:8]
Step 2:  *      *           [1:1][2:3]  [1:2][3:4]  [2:3][4:5]  [3:4][5:6]  [4:5][6:7]  [5:6][7:8]
      =  *      *           [1:3]       [1:4]       [2:5]       [3:6]       [4:7]       [5:8]
Step 3:  *      *           *           *           [1:1][2:5]  [1:2][3:6]  [1:3][4:7]  [1:4][5:8]
      =  *      *           *           *           [1:5]       [1:6]       [1:7]       [1:8]
Final:   [1:1]  [1:2]       [1:3]       [1:4]       [1:5]       [1:6]       [1:7]       [1:8]
      =  x1     x1+x2       x1+x2+x3    x1+...+x4   x1+...+x5   x1+...+x6   x1+...+x7   x1+...+x8

Page 82: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

82

Parallel Prefix Algorithm 2:

// We write [1:2] to denote X[1]+X[2]
// [i:j] to denote X[i]+X[i+1]+...+X[j]
// [i:i] is X[i], NOT X[i]+X[i]
// [1:2][3:4] = [1:2]+[3:4] = (X[1]+X[2])+(X[3]+X[4]) = X[1]+X[2]+X[3]+X[4]
// Input : M[j] = X[j] = [j:j] for j = 1,...,n.
// Output: M[j] = X[1]+...+X[j] = [1:j] for j = 1,...,n.

ParallelPrefix(n)
1.  i = 1;                        // phase counter; in phase i the offset is 2^(i-1)
2.  while (2^(i-1) < n) {
3.      j = pid();
4.      if (j - 2^(i-1) > 0) {
5.          a = M[j];             // before this step M[j] = [j+1-2^(i-1) : j]
6.          b = M[j - 2^(i-1)];   // before this step M[j-2^(i-1)] = [j+1-2^i : j-2^(i-1)]
7.          M[j] = a + b;         // after this step M[j] = [j+1-2^i : j]
8.      }
9.      i = i + 1;
10. }

At line 6, memory location j − 2^(i−1) is read provided that j − 2^(i−1) ≥ 1. This holds for all phases i ≤ tj = lg(j − 1) + 1. For i > tj the test of line 4 fails and lines 5-8 are not executed.
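A minimal sequential C sketch simulating the lg n phases of this doubling scheme; the snapshot copy plays the role of the PRAM's synchronous reads, so every "processor" sees the values from the previous phase.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Prefix sums by doubling: in the phase with offset d, every cell j >= d
// gets M[j] = M[j] + M[j-d], computed from a snapshot of the previous phase.
void parallel_prefix(int *M, int n) {
    int *old = malloc(n * sizeof(int));
    for (int d = 1; d < n; d *= 2) {          // offsets 1, 2, 4, ... (= 2^(i-1))
        memcpy(old, M, n * sizeof(int));      // snapshot = synchronous read
        for (int j = d; j < n; j++)           // what processor j does in this phase
            M[j] = old[j] + old[j - d];
    }
    free(old);
}

int main(void) {
    int M[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    parallel_prefix(M, 8);
    for (int j = 0; j < 8; j++) printf("%d ", M[j]);   // 1 3 6 10 15 21 28 36
    printf("\n");
    return 0;
}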

Page 83: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

83

Parallel Prefix Algorithm based on Complete Binary Tree

Consider the following variation of parallel prefix on n inputs that works on a complete binary tree with n leaves (assume n is a power of two).

Action by nodes:
1. Non-leaf: if it receives l and r from its left and right children, it computes l + r, sends it up, and sends l down to its right child.
2. Root: as in [1], except that nothing is sent up.
3. Non-leaf: if it gets p from its parent, it transmits p to its left and right children.
4. Leaf: if it holds l and receives p from its parent, it sets l = p + l (in this order; note that p is the left argument and l is the right one, and order matters).
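A minimal recursive C rendering of the same information flow, under an assumed array-based tree layout (node v has children 2v and 2v+1, leaves are nodes n . . . 2n−1); the up pass computes the subtree sums (the l + r values), and the down pass delivers to each leaf the sum p of all leaves to its left, after which the leaf sets l = p + l.

#include <stdio.h>

static int sum[2 * 8];   // subtree sums; n = 8 leaves in this example

static int up(int *x, int v, int n) {
    if (v >= n) { sum[v] = x[v - n]; return sum[v]; }   // leaf
    sum[v] = up(x, 2 * v, n) + up(x, 2 * v + 1, n);     // l + r sent upward
    return sum[v];
}

static void down(int *x, int v, int p, int n) {
    if (v >= n) { x[v - n] = p + x[v - n]; return; }    // leaf: l = p + l
    down(x, 2 * v, p, n);                               // left child keeps the same p
    down(x, 2 * v + 1, p + sum[2 * v], n);              // right child gets p plus l
}

int main(void) {
    int x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    up(x, 1, 8);
    down(x, 1, 0, 8);
    for (int i = 0; i < 8; i++) printf("%d ", x[i]);    // 1 3 6 10 15 21 28 36
    printf("\n");
    return 0;
}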

Page 84: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

84

Parallel Prefix Algorithm based on Complete Binary Tree: Example

[Figure: complete binary tree for 8 inputs. The leaves hold x1, . . . , x8; the level above holds x1+x2, x3+x4, x5+x6, x7+x8; the next level holds x1+...+x4 and x5+...+x8; the root holds x1+...+x8. The partial sums x1+x2, x1+...+x4, x5+x6, etc. are sent down the tree, and after receiving them the leaves hold the prefix sums x1, x1+x2, x1+...+x3, x1+...+x4, x1+...+x5, x1+...+x6, x1+...+x7, x1+...+x8.]

Page 85: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

85

PRAM Algorithms: Maximum Finding

Problem. Let X1, . . . , XN be N keys. Find X = max{X1, X2, . . . , XN}.

The sequential problem accepts a P = 1, T = O(N), W = O(N) direct solution.

An EREW PRAM algorithm for this problem works the same way as the PARALLEL SUM algorithm and its performance is P = O(N), T = O(lg N), W = O(N lg N), W2 = O(N), along with the improvements in P and W mentioned for the PARALLEL SUM algorithm.

In the remainder we will investigate CRCW PRAM algorithms. Let key Xi reside in the local memory of processor i.

The CRCW PRAM algorithm MAX1, presented next, has performance T = O(1), P = O(N^2), and work W2 = W = O(N^2).

The second algorithm, presented in the following pages, utilizes what is called a doubly-logarithmic-depth tree and achieves T = O(lg lg N), P = O(N) and W = W2 = O(N lg lg N).

The third algorithm is a combination of the EREW PRAM algorithm and the CRCW doubly-logarithmic-depth-tree-based algorithm and requires T = O(lg lg N), P = O(N) and W2 = O(N).

Page 86: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

86

PRAM Algorithm - Maximum Finding

begin Max1 (X1 . . . XN)
  1. in proc (i, j): if Xi ≥ Xj then xij = 1;
  2.                 else xij = 0;
  3. Yi = xi1 ∧ . . . ∧ xiN ;
  4. Processor i reads Yi ;
  5. if Yi = 1, processor i writes i into cell 0.
end Max1

In the algorithm, we rename processors so that pair (i, j) refers to processor j × N + i. Variable Yi is equal to 1 if and only if Xi is the maximum.

The CRCW PRAM algorithm MAX1 has performance T = O(1), P = O(N^2), and work W2 = W = O(N^2).
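A small sequential C simulation of MAX1's logic, useful for checking the definition of Yi; the O(1) CRCW time of course cannot be reproduced this way, and the names are illustrative.

#include <stdio.h>

// "Processor (i,j)" compares X[i] and X[j]; Y[i] stays 1 only if X[i] >= X[j]
// for every j, i.e. X[i] is a maximum. The concurrent writes of step 5 all
// deposit the index of a maximum element into cell 0.
int max1(const int *X, int N) {
    int cell0 = -1;
    for (int i = 0; i < N; i++) {
        int Yi = 1;
        for (int j = 0; j < N; j++)          // the N "column" processors of row i
            if (X[i] < X[j]) Yi = 0;         // x_ij = 0
        if (Yi) cell0 = i;                   // step 5: write i into cell 0
    }
    return cell0;                            // index of the maximum
}

int main(void) {
    int X[] = {7, 3, 9, 4, 9, 1};
    int m = max1(X, 6);
    printf("max is X[%d] = %d\n", m, X[m]);
    return 0;
}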

Page 87: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

87

Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design

Under the PRAM model, synchronization is ignored and thus seen as free, since PRAM processors work synchronously. Communication is ignored as well, because in the PRAM the cost of accessing the shared memory is as small as the cost of accessing the local registers of the PRAM.

In reality, however, the exchange of data can significantly impact the efficiency of parallel programs by introducing interaction delays during their execution.

It takes roughly ts + m·tw time for a simple exchange of an m-word message between two processes running on different nodes of an interconnection network with cut-through routing, where:

  ts: latency, or the startup time for the data transfer
  tw: per-word transfer time, which is inversely proportional to the available bandwidth between the nodes

Page 88: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

88

Basic Communication Operations – One-to-all broadcast and all-to-one reduction

Assume that p processes participate in the operation and the data to be broadcast or reduced contains m words.

A one-to-all broadcast or all-to-one reduction involves log p point-to-point message transfers, each at a time cost of ts + m·tw. Therefore, the total time taken by the procedure is T = (ts + m·tw) log p.

This is true for all interconnection networks.

Page 89: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

89

All-to-all Broadcast and Reduction

Linear Array and Ring: p different messages circulate in the p-node ensemble. If communication is performed circularly in a single direction, then each node receives all (p − 1) pieces of information from the other nodes in (p − 1) steps. The total time is

  T = (ts + m·tw)(p − 1)

2-D Mesh: Based on the linear-array algorithm, treating the rows and columns of the mesh as linear arrays. Two phases:

  Phase one: each row of the mesh performs an all-to-all broadcast using the procedure for the linear array. In this phase, every node collects √p messages corresponding to the √p nodes of its row and consolidates this information into a single message of size m·√p. The time for this phase is

    T1 = (ts + m·tw)(√p − 1)

  Phase two: column-wise all-to-all broadcast of the consolidated messages. By the end of this phase, each node obtains all p pieces of m-word data that originally resided on different nodes. The time for this phase is

    T2 = (ts + m·√p·tw)(√p − 1)

  The time for the entire all-to-all broadcast on a p-node two-dimensional square mesh is the sum of the times spent in the individual phases:

    T = 2·ts(√p − 1) + m·tw(p − 1)

Hypercube: the exchanged message doubles in size at each of the log p steps, giving

    T = Σ_{i=1}^{log p} (ts + 2^(i−1)·m·tw) = ts·log p + m·tw(p − 1)

Page 90: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

90

Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design

As an example of how traditional PRAM algorithm design differs from architecture-independent parallel algorithm design, an example algorithm for broadcasting in a parallel machine is introduced.

Problem: In a parallel machine with p processors numbered 0, . . . , p − 1, one of them, say processor 0, holds a one-word message. The problem of broadcasting involves the dissemination of this message to the local memory of the remaining p − 1 processors.

The performance of a well-known exclusive-access PRAM algorithm for broadcasting is analyzed below in two ways, under the assumption that no concurrent operations are allowed. One follows the traditional (PRAM) analysis that minimizes parallel running time. The other takes into consideration the issues of communication and synchronization. This leads to a modification of the PRAM-based algorithm to derive an architecture-independent algorithm for broadcasting whose performance is consistent with observations of broadcasting operations on real parallel machines.

Page 91: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

91

Broadcasting: MPI Algorithm 1

Algorithm. Without loss of generality let us assume that p is a power of two. The message is broadcast in lg p rounds of communication by binary replication. In round i = 1, . . . , lg p, each processor j with index j < 2^(i−1) sends the message it currently holds to processor j + 2^(i−1) (on a shared-memory system, this may mean copying information into a cell read by this processor). The number of processors with the message at the end of round i is thus 2^i.

Analysis of Algorithm. Under the PRAM model the algorithm requires lg p communication rounds and as many parallel steps to complete. This cost, however, ignores synchronization, which is free, as PRAM processors work synchronously. It also ignores communication, as in the PRAM the cost of accessing the shared memory is as small as the cost of accessing the local registers of the PRAM.

Page 92: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

92

Broadcasting: MPI Algorithm 1

Under the MPI cost model each communication round is assigned a cost of max{ts, tw · 1}, as each processor in each round sends or receives at most one message containing the one-word message. The cost of the algorithm is lg p · max{ts, tw · 1}, as there are lg p rounds of communication.

As the information communicated by any processor is small, it is likely that latency issues dominate the transmission time (i.e. the bandwidth-based cost tw · 1 is insignificant compared to the latency/synchronization term ts).

In high-latency machines the dominant term would be ts·lg p rather than tw·lg p. Even though each communication round lasts for at least ts time units, only a small fraction tw of it is used for actual communication. The remainder is wasted.

It makes sense, then, to increase communication-round utilization so that each processor sends the one-word message to as many processors as it can accommodate within a round.

The total time is: lg p · (ts + tw)
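A minimal MPI sketch of Algorithm 1's binary replication (in practice one would simply call MPI_Bcast); the one-word message is an int here, and the loop variable half plays the role of 2^(i−1).

#include <mpi.h>
#include <stdio.h>

// Binary-replication broadcast: in round i, ranks < 2^(i-1) send the word
// to rank + 2^(i-1); ranks in [2^(i-1), 2^i) receive it.
int main(int argc, char **argv) {
    int pid, p, msg = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (pid == 0) msg = 42;                       // the one-word message

    for (int half = 1; half < p; half *= 2) {     // half = 2^(i-1)
        if (pid < half && pid + half < p)
            MPI_Send(&msg, 1, MPI_INT, pid + half, 0, MPI_COMM_WORLD);
        else if (pid >= half && pid < 2 * half)
            MPI_Recv(&msg, 1, MPI_INT, pid - half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    printf("rank %d has %d\n", pid, msg);
    MPI_Finalize();
    return 0;
}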

Page 93: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

93

Broadcasting: MPI Algorithm 2

Input: p processors numbered 0 . . .p − 1. Processor 0 holds a message of length equal to one word.

Output: The problem of broadcasting involves the dissemination of this message to the remaining p − 1 processors.

Algorithm 2. In one superstep, processor 0 sends the message to be broadcast to processors 1, . . . , p − 1 in turn (a “sequential”-looking algorithm).

Analysis of Algorithm 2. The communication time of Algorithm 2 is 1 · max{ts, (p − 1) · tw}: in a single superstep, the message is replicated p − 1 times by processor 0.

The total time is: ts + (p − 1)·tw

Page 94: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

94

Broadcasting: MPI Algorithm 3

Algorithm 3. Both Algorithm 1 and Algorithm 2 can be viewed as extreme cases of an Algorithm 3.

The main observation is that up to L/g words can be sent in a superstep at a cost of ts. It then makes sense for each processor to send L/g messages to other processors. Let k − 1 be the number of messages a processor sends to other processors in a broadcasting step. The number of processors with the message at the end of a broadcasting superstep is then k times larger than at the start. We call k the degree of replication of the broadcast operation.

Architecture-independent Algorithm 3. In each round, every processor that has the message sends it to k − 1 other processors. In round i = 0, 1, . . ., each processor j with index j < k^i sends the message to the k − 1 distinct processors numbered j + k^i·l, where l = 1, . . . , k − 1. At the end of round i (the (i+1)-st overall round), the message has been broadcast to k^i·(k − 1) + k^i = k^(i+1) processors. The number of rounds required is the minimum integer r such that k^r ≥ p; the number of rounds necessary for full dissemination is thus decreased to log_k p, and the total cost becomes log_k p · max{ts, (k − 1)·tw}.

At the end of each superstep the number of processors possessing the message is k times that of the previous superstep. During each superstep each such processor sends the message to exactly k − 1 other processors.

Algorithm 3 consists of a number of rounds between 1 (when it becomes Algorithm 2) and lg p (when it becomes Algorithm 1).

The total time is: log_k p · (ts + (k − 1)·tw)

Page 95: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

95

Broadcasting: MPI Algorithm 3

Broadcast(0, p, k)
  my_pid = pid(); mask_pid = 1;
  while (mask_pid < p) {
      if (my_pid < mask_pid) {
          for (i = 1, j = mask_pid; i < k; i++, j += mask_pid) {
              target_pid = my_pid + j;
              if (target_pid < p)
                  mpi_put(target_pid, &M, &M, 0, sizeof(M));   // or mpi_send ...
          }
      } else if ((my_pid >= mask_pid) && (my_pid < k * mask_pid)) {
          mpi_get(...);                                         // or mpi_recv ...
      }
      mask_pid = mask_pid * k;
  }

Page 96: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

96

Broadcasting n > p words: Algorithm 4

Now suppose that the message to be broadcast consists not of a single word but is of size n > p. Algorithm 4 may be a better choice than the previous algorithms, under which some processor sends or receives substantially more than n words of information (here n·tw >> ts).

There is a broadcasting algorithm, call it Algorithm 4, that requires only two communication rounds and is optimal (for the communication model abstracted by ts and tw) in terms of the amount of information (up to a constant) each processor sends or receives.

Algorithm 4: Two-phase broadcasting. The idea is to split the message into p pieces, have processor 0 send piece i to processor i in the first round, and in the second round have processor i replicate the i-th piece p − 1 times by sending one copy to each of the remaining p − 1 processors (see attached figure).

The total time is: p one-to-one transfers + one all-to-all broadcast = (ts + (n/p)·tw)(p − 1) + (ts + (n/p)·tw)(p − 1) = 2(ts + (n/p)·tw)(p − 1)
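The two rounds map naturally onto MPI_Scatter followed by MPI_Allgather; a minimal sketch, assuming the message is an array of n ints with n a multiple of p.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

// Two-phase broadcast of an n-word message from rank 0:
// round 1 scatters the p pieces; round 2 all-gathers so every rank has all pieces.
int main(int argc, char **argv) {
    int pid, p, n = 1 << 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *msg   = malloc(n * sizeof(int));        // full message (filled on rank 0)
    int *piece = malloc((n / p) * sizeof(int));  // this rank's n/p-word piece
    if (pid == 0)
        for (int i = 0; i < n; i++) msg[i] = i;

    // Round 1: processor 0 sends piece i to processor i.
    MPI_Scatter(msg, n / p, MPI_INT, piece, n / p, MPI_INT, 0, MPI_COMM_WORLD);
    // Round 2: every processor replicates its piece to all others.
    MPI_Allgather(piece, n / p, MPI_INT, msg, n / p, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: msg[n-1] = %d\n", pid, msg[n - 1]);
    free(msg); free(piece);
    MPI_Finalize();
    return 0;
}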

Page 97: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

97

Page 98: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

98

Matrix Computations

SPMD program design stipulates that processors execute a single program on different pieces of data. For matrix-related computations it makes sense to distribute a matrix evenly among the p processors of a parallel computer. Such a distribution should also take into consideration the storage of the matrix by, say, the compiler, so that locality issues are addressed (filling cache lines efficiently to speed up computation). There are various ways to divide a matrix; some of the most common ones are described below.

One way to distribute a matrix is by using block distributions. Split an array into blocks of size n/p1 × n/p2 so that p = p1 × p2, and assign the i-th block to processor i. This distribution is suitable for matrices as long as the amount of work for different elements of the matrix is the same.

The most common block distributions are:

• column-wise (block) distribution. Split the matrix into p column stripes so that n/p consecutive columns form the i-th stripe, which is stored in processor i. This is p1 = 1 and p2 = p.

• row-wise (block) distribution. Split the matrix into p row stripes so that n/p consecutive rows form the i-th stripe, which is stored in processor i. This is p1 = p and p2 = 1.

• block or square distribution. This is the case p1 = p2 = √p, i.e. the blocks are of size n/√p × n/√p, and block i is stored in processor i.

There are certain cases (e.g. LU decomposition, Cholesky factorization) where the amount of work differs for different elements of a matrix. For these cases block distributions are not suitable.

Page 99: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

99

Matrix block distributions

Page 100: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

100

Matrix-Vector Multiplication

Sequential algorithm: the running time is O(n^2), i.e. n^2 multiplications and additions.

MAT_VECT(A, x, y)
{
    for i = 0 to n-1 do {
        y[i] = 0;
        for j = 0 to n-1 do
            y[i] = y[i] + A[i][j] * x[j];
    }
}

Page 101: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

101

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

Assume p = n (p = number of processors). Steps:

  Step 1: Initial partition of matrix and vector.
    Matrix distribution: each process gets one complete row of the matrix.
    Vector distribution: the n×1 vector is distributed such that each process owns one of its elements.

  Step 2: All-to-all broadcast.
    Every process has one element of the vector, but every process needs the entire vector.

  Step 3: Computation.
    Process Pi computes  y[i] = Σ_{j=0}^{n−1} A[i][j] · x[j]

Running time:
  All-to-all broadcast: Θ(n) on any architecture.
  Multiplication of a single row of A with vector x is Θ(n).
  Total running time is Θ(n). Total work is Θ(n^2) – cost-optimal.

Page 102: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

102

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

Page 103: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

103

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

Assume p < n (p = number of processors). Three steps:

  Initial partition of matrix and vector:
    Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.

  All-to-all broadcast:
    Among p processes, with messages of size n/p.

  Computation:
    Each process multiplies its n/p rows of the matrix with the vector x to produce n/p elements of the result vector.

Running time:
  All-to-all broadcast:
    T = (ts + (n/p)·tw)(p − 1) on any architecture
    T = ts·log p + (n/p)·tw·(p − 1) on a hypercube
  Computation: T = n · n/p = Θ(n^2/p)
  Total running time: T = Θ(n^2/p + ts·log p + n·tw)
  Total work: W = Θ(n^2 + ts·p·log p + n·p·tw) – cost-optimal
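A minimal MPI sketch of this p < n rowwise formulation, assuming n is a multiple of p, that A is already distributed by rows (row-major blocks), and that the all-to-all broadcast is realized with MPI_Allgather; names are illustrative.

#include <mpi.h>
#include <stdlib.h>

// Rowwise 1-D matrix-vector multiply: each rank holds n/p rows of A and n/p
// elements of x; MPI_Allgather assembles the full vector, then each rank
// computes its n/p entries of y.
void mat_vect_1d(const double *A_local, const double *x_local,
                 double *y_local, int n, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);
    int rows = n / p;

    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_local, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE, comm);

    for (int i = 0; i < rows; i++) {          // this rank's rows
        y_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            y_local[i] += A_local[i * n + j] * x[j];
    }
    free(x);
}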

Page 104: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

104

Matrix-Vector Multiplication: Columnwise 1-D Partitioning

Similar to rowwise 1-D Partitioning

Page 105: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

105

Matrix-Vector Multiplication: 2-D Partitioning

Assume p = n^2. Steps:

  Step 1: Initial partitioning.
    Each process gets one element of the matrix.
    The vector is distributed only among the processes on the diagonal, each of which owns one element.

  Step 2: Broadcast.
    The i-th element of the vector should be available to the i-th element of each row of the matrix. So this step consists of n simultaneous one-to-all broadcast operations, one in each column of processes.

  Step 3: Computation.
    Each process multiplies its matrix element with the corresponding element of x.

  Step 4: All-to-one reduction of partial results.
    The products computed for each row must be added, leaving the sums in the last column of processes.

Running time:
  One-to-all broadcast: Θ(log n)
  Computation in each process: Θ(1)
  All-to-one reduction: Θ(log n)
  Total running time: Θ(log n)
  Total work: Θ(n^2 log n) – not cost-optimal

Page 106: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

106

Matrix-Vector Multiplication: 2-D Partitioning

Page 107: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

107

Matrix-Vector Multiplication: 2-D Partitioning

Assume p < n^2. Steps:

  Step 1: Initial partitioning.
    Each process gets an (n/√p) × (n/√p) block of the matrix.
    The vector is distributed only among the processes on the diagonal, each of which owns n/√p elements.

  Step 2: Columnwise one-to-all broadcast.
    The i-th group of vector elements should be available to the i-th group of columns of the matrix. So this step consists of √p simultaneous one-to-all broadcast operations, one in each column of processes.

  Step 3: Computation.
    Each process multiplies its block of the matrix with the corresponding portion of x.

  Step 4: All-to-one reduction of partial results.
    The products computed for each row must be added, leaving the sums in the last column of processes.

Running time:
  Columnwise one-to-all broadcast: T = (ts + (n/√p)·tw)·log √p on any architecture
  Computation in each process: T = (n/√p)·(n/√p) = n^2/p
  All-to-one reduction: T = (ts + (n/√p)·tw)·log √p on any architecture
  Total running time: T = n^2/p + 2(ts + (n/√p)·tw)·log √p on any architecture

Page 108: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

108

Matrix-Vector Multiplication: 1-D Partitioning vs. 2-D Partitioning

Matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D partitioning for the same number of processes.

If the number of processes is greater than n, then the 1-D partitioning cannot be used.

If the number of processes is less than or equal to n, 2-D partitioning is preferable.

Page 109: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

109

Matrix Distributions: Block Cyclic

In block-cyclic distributions the rows (similarly for columns) are split into q groups of n/q consecutive rows per group, where potentially q > p, and the i-th group is assigned to a processor in a cyclic fashion.

• column-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q column stripes so that n/q consecutive columns form the i-th stripe, which is stored in processor i % p. The symbol % is the mod (remainder of the division) operator. Usually q > p. Sometimes the term wrapped-around column distribution is used for the case where n/q = 1, i.e. q = n.

• row-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q row stripes so that n/q consecutive rows form the i-th stripe, which is stored in processor i % p. Usually q > p. Sometimes the term wrapped-around row distribution is used for the case where n/q = 1, i.e. q = n.

• scattered distribution. Let p = qi · qj processors be divided into qj groups, each group Pj consisting of qi processors; particularly, Pj = {j·qi + l | 0 ≤ l ≤ qi − 1}. Processor j·qi + l is called the l-th processor of group Pj. This way matrix element (i, j), 0 ≤ i, j < n, is assigned to the (i mod qi)-th processor of group P(j mod qj). A scattered distribution refers to the special case qi = qj = √p.
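As a small illustration, hypothetical owner-computes helpers matching the definitions above; all names are assumptions introduced here.

// n: matrix order, p: processors, q: number of stripes (q >= p),
// qi, qj: scattered-distribution group dimensions with p = qi*qj.

// Column-cyclic: element (i,j) lies in stripe j / (n/q),
// and stripe s is stored on processor s % p.
int owner_column_cyclic(int j, int n, int q, int p) {
    int stripe = j / (n / q);
    return stripe % p;
}

// Scattered: element (i,j) goes to the (i mod qi)-th processor of
// group P_(j mod qj), i.e. processor (j % qj) * qi + (i % qi).
int owner_scattered(int i, int j, int qi, int qj) {
    return (j % qj) * qi + (i % qi);
}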

Page 110: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

110

Block cyclic distributions

Page 111: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

111

Scattered Distribution

Page 112: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

112

Matrix Multiplication – Serial algorithm

Page 113: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

113

Matrix Multiplication

The algorithm for matrix multiplication presented below appears in the seminal work of Valiant. It works for p ≤ n^2. Three steps:

  Initial partitioning: Matrices A and B are partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size n/√p × n/√p each. These blocks are mapped onto a √p × √p logical mesh of processes. The processes are labeled from P0,0 to P√p−1,√p−1.

  All-to-all broadcasting: Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A's blocks is performed in each row of processes, and an all-to-all broadcast of matrix B's blocks is performed in each column.

  Computation: After Pi,j acquires Ai,0, Ai,1, …, Ai,√p−1 and B0,j, B1,j, …, B√p−1,j, it performs the submatrix multiplication and addition steps of lines 7 and 8 in Alg 8.3.

Running time:
  All-to-all broadcast (per row/column of √p processes, messages of size n^2/p):
    T = (ts + (n^2/p)·tw)(√p − 1) on any architecture
    T = ts·log √p + (n^2/p)·tw·(√p − 1) on a hypercube
  Computation: T = √p · (n/√p)^3 = n^3/p

Page 114: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

114

Matrix Multiplication

The input matrices A and B are divided into p block-submatrices, each of dimension m × m, where m = n/√p. We call this distribution of the input among the processors the block distribution. This way, element A(i, j), 0 ≤ i < n, 0 ≤ j < n, belongs to the (j/m)·√p + (i/m)-th block, which is subsequently assigned to the memory of the same-numbered processor.

Let Ai (respectively, Bi) denote the i-th block of A (respectively, B) stored in processor i. With these conventions the algorithm can be described in Figure 1. The following proposition describes the performance of the aforementioned algorithm.

Page 115: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

115

Matrix Multiplication

Page 116: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

116

Sorting Algorithm

Rearranging a list of numbers into increasing (strictly speaking, non-decreasing) order.

Page 117: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

117

Potential Speedup

O(n log n) is optimal for any sequential sorting algorithm that does not use special properties of the numbers.

The best we can expect based upon a sequential sorting algorithm and using n processors is

  Optimal parallel time complexity = O(n log n)/n = O(log n)

This has been obtained, but the constant hidden in the order notation is extremely large.

Page 118: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

118

Compare-and-Exchange Sorting Algorithms: Compare and Exchange

Form the basis of several, if not most, classical sequential sorting algorithms.

Two numbers, say A and B, are compared. If A > B, A and B are exchanged, i.e.:

if (A > B) {
    temp = A;
    A = B;
    B = temp;
}

Page 119: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

119

Message Passing Method

P1 sends A to P2 and P2 sends B to P1. Then both processes perform compare operations: P1 keeps the larger of A and B and P2 keeps the smaller of A and B.
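A minimal MPI sketch of this exchange, assuming one integer per process and using MPI_Sendrecv to exchange both values without deadlock; which member of the pair keeps the larger value is passed in as a flag, following the convention in the text.

#include <mpi.h>

// Compare-and-exchange between this rank and `partner`: both exchange their
// values, then one side keeps the maximum and the other keeps the minimum.
int compare_exchange(int my_val, int partner, int keep_larger, MPI_Comm comm) {
    int other;
    MPI_Sendrecv(&my_val, 1, MPI_INT, partner, 0,
                 &other,  1, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    if (keep_larger)
        return (my_val > other) ? my_val : other;
    else
        return (my_val < other) ? my_val : other;
}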

Page 120: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

120

Merging Two Sublists

Page 121: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

121

Bubble Sort

First, the largest number is moved to the end of the list by a series of compares and exchanges, starting at the opposite end.

The actions are repeated with subsequent numbers, stopping just before the previously positioned number.

In this way, the larger numbers move ("bubble") toward one end of the list.

Page 122: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

122

Bubble Sort

Page 123: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

123

Time Complexity

Number of compare-and-exchange operations: Σ_{i=1}^{n−1} i = n(n − 1)/2

This indicates a time complexity of O(n^2), given that a single compare-and-exchange operation has constant complexity, O(1).

Page 124: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

124

Parallel Bubble Sort

An iteration could start before the previous iteration has finished, provided it does not overtake the previous bubbling action:

Page 125: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

125

Odd-Even (Transposition) Sort

A variation of bubble sort. It operates in two alternating phases, an even phase and an odd phase.

Even phase: even-numbered processes exchange numbers with their right neighbor.

Odd phase: odd-numbered processes exchange numbers with their right neighbor.

Page 126: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

126

Sequential Odd-Even Transposition Sort

Page 127: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

127

Parallel Odd-Even Transposition Sort: sorting eight numbers

Page 128: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

128

Parallel Odd-Even Transposition Sort

Consider the one-item-per-processor case. There are n iterations; in each iteration, each processor does one compare-exchange, and these can all be done in parallel.

The parallel run time of this formulation is Θ(n).

This is cost-optimal with respect to the base serial algorithm (bubble sort) but not with respect to the optimal serial algorithm.

Page 129: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

129

Parallel Odd-Even Transposition Sort

Page 130: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

130

Parallel Odd-Even Transposition Sort

Consider a block of n/p elements per processor. The first step is a local sort. In each subsequent step, the compare-exchange operation is replaced by the compare-split operation.

There are p phases, each phase performing Θ(n/p) comparisons and Θ(n/p) communication.

The parallel run time of the formulation is

  T_P = Θ((n/p) log(n/p)) + Θ(n) + Θ(n)

(local sort + comparisons + communication).

The parallel formulation is cost-optimal for p = O(log n).
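A hedged MPI sketch of this block formulation; the helper names and the use of qsort for the local sort and merge are illustrative choices, not the book's code, and n is assumed to be a multiple of p.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

// Compare-split: keep the nlocal smallest (keep_small != 0) or largest keys
// of the union of my block and the partner's block.
static void compare_split(int *mine, int *theirs, int nlocal, int keep_small) {
    int *merged = malloc(2 * nlocal * sizeof(int));
    memcpy(merged, mine, nlocal * sizeof(int));
    memcpy(merged + nlocal, theirs, nlocal * sizeof(int));
    qsort(merged, 2 * nlocal, sizeof(int), cmp_int);
    memcpy(mine, keep_small ? merged : merged + nlocal, nlocal * sizeof(int));
    free(merged);
}

// Block odd-even transposition sort: local sort, then p compare-split phases.
void odd_even_sort(int *local, int nlocal, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int *recv = malloc(nlocal * sizeof(int));

    qsort(local, nlocal, sizeof(int), cmp_int);       // initial local sort

    for (int phase = 0; phase < p; phase++) {
        // Even phase pairs (0,1),(2,3),...; odd phase pairs (1,2),(3,4),...
        int partner = (phase % 2 == rank % 2) ? rank + 1 : rank - 1;
        if (partner < 0 || partner >= p) continue;    // no partner this phase
        MPI_Sendrecv(local, nlocal, MPI_INT, partner, phase,
                     recv,  nlocal, MPI_INT, partner, phase,
                     comm, MPI_STATUS_IGNORE);
        compare_split(local, recv, nlocal, rank < partner);
    }
    free(recv);
}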

Page 131: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

131

Quicksort

A very popular sequential sorting algorithm that performs well, with an average sequential time complexity of O(n log n).

First, the list is divided into two sublists. All numbers in one sublist are arranged to be smaller than all numbers in the other sublist.

This is achieved by first selecting one number, called a pivot, against which every other number is compared. If the number is less than the pivot, it is placed in one sublist; otherwise, it is placed in the other sublist.

The pivot could be any number in the list, but often the first number in the list is chosen. The pivot itself could be placed in one sublist, or it could be separated and placed in its final position.

Page 132: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

132

Quicksort

Page 133: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

133

Quicksort

Example of the quicksort algorithm sorting a sequence of size n= 8.

Page 134: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

134

Parallel Quicksort

Let's start with recursive decomposition: the list is partitioned by a single process and then each of the subproblems is handled by a different processor.

The time for this algorithm is lower-bounded by Ω(n)!

It is not cost-optimal, as the process-time product is Ω(n^2).

Can we parallelize the partitioning step? In particular, if we can use n processors to partition a list of length n around a pivot in O(1) time, we have a winner.

This is difficult to do on real machines, though.
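A minimal OpenMP sketch of the recursive-decomposition idea: the partition step is sequential, and the two sublists are then sorted by independent tasks. The cutoff of 1024 elements is an arbitrary tuning choice, not something prescribed by the slides.

#include <omp.h>

// Recursive-decomposition parallel quicksort (Lomuto partition on the last
// element). Small subproblems fall back to serial recursion via the if clause.
static void pquicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    { int t = a[i]; a[i] = a[hi]; a[hi] = t; }        // pivot to final position

    #pragma omp task shared(a) if (hi - lo > 1024)
    pquicksort(a, lo, i - 1);
    #pragma omp task shared(a) if (hi - lo > 1024)
    pquicksort(a, i + 1, hi);
    #pragma omp taskwait
}

void parallel_quicksort(int *a, int n) {
    #pragma omp parallel
    #pragma omp single nowait
    pquicksort(a, 0, n - 1);
}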

Page 135: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

135

Parallel Quicksort Using tree allocation of processes

Page 136: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

136

Parallel Quicksort With the pivot being withheld in processes:

Page 137: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

137

Analysis

A fundamental problem with all these tree constructions: the initial division is done by a single processor, which will seriously limit speed.

The tree in quicksort will not, in general, be perfectly balanced.

Pivot selection is very important to make quicksort operate fast.

Page 138: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

138

Parallelizing Quicksort: PRAM Formulation

We assume a CRCW (concurrent read, concurrent write) PRAM, with concurrent writes resolved by letting an arbitrary write succeed.

The formulation works by creating pools of processors. Every processor is assigned to the same pool initially and has one element.

Each processor attempts to write its element to a common location (for the pool).

Each processor then tries to read back the location. If the value read back is greater than the processor's value, it assigns itself to the 'left' pool; else, it assigns itself to the 'right' pool.

Each pool performs this operation recursively.

Note that the algorithm generates a tree of pivots. The depth of the tree is the expected parallel runtime; its average value is O(log n).

Page 139: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

139

Parallelizing Quicksort: PRAM Formulation

A binary tree generated by the execution of the quicksort algorithm. Each level of the tree represents a different array-partitioning iteration. If pivot selection is optimal, then the height of the tree is Θ(log n), which is also the number of iterations.

Page 140: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

140

Parallelizing Quicksort: PRAM Formulation

The execution of the PRAM algorithm on the array shown in (a).

Page 141: 1 Parallel Computing Final Exam Review. 2 What is Parallel computing? Parallel computing involves performing parallel tasks using more than one computer

141

End

Thank you!