Parallel Algorithms


Page 1: Parallel Algorithms

Algorithms

Parallel Algorithms

1

Page 2: Parallel Algorithms

2

A simple parallel algorithm

Adding n numbers in parallel

Page 3: Parallel Algorithms

3

A simple parallel algorithm

• Example for 8 numbers: we start with 4 processors, and each of them adds 2 items in the first step.

• The number of items is halved at every subsequent step. Hence, log n steps are required for adding n numbers. The processor requirement is O(n).

We have omitted many details from our description of the algorithm.

• How do we allocate tasks to processors?

• Where is the input stored?

• How do the processors access the input as well as intermediate results?

We do not ask these questions while designing sequential algorithms.
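
To make the scheme above concrete, here is a minimal sequential simulation of the halving addition; the function name, the zero-padding for odd counts, and the list-per-step representation are illustrative assumptions rather than anything from the slides.

def parallel_sum(values):
    """Tree-based parallel addition, simulated sequentially: each pass of the
    loop models one synchronous step in which every active processor adds one
    pair of items, so the number of items halves at every step."""
    items = list(values)
    while len(items) > 1:
        if len(items) % 2:
            items.append(0)                      # pad with the additive identity
        # one parallel step: processor i adds items 2i and 2i + 1
        items = [items[2 * i] + items[2 * i + 1] for i in range(len(items) // 2)]
    return items[0]

print(parallel_sum(range(1, 9)))                 # 36, after log2(8) = 3 steps

With 8 numbers the loop runs 3 times, matching the 4-, 2-, 1-processor steps described above.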

Page 4: Parallel Algorithms

4

How do we analyze a parallel algorithm?

A parallel algorithm is analyzed mainly in terms of its time, processor, and work complexities.

• Time complexity T(n): How many time steps are needed?
• Processor complexity P(n): How many processors are used?
• Work complexity W(n): What is the total work done by all the processors? Hence, W(n) ≤ P(n) × T(n).

For our example: T(n) = O(log n), P(n) = O(n), W(n) = O(n log n).

Page 5: Parallel Algorithms

5

How do we judge efficiency?

• Consider two parallel algorithms A1 and A2 for the same problem. A1 does W1(n) work in T1(n) time, and A2 does W2(n) work in T2(n) time.

• We say A1 is more efficient than A2 if W1(n) = o(W2(n)), regardless of their time complexities. For example, W1(n) = O(n) and W2(n) = O(n log n).

• If W1(n) and W2(n) are asymptotically the same, then A1 is more efficient than A2 if T1(n) = o(T2(n)). For example, W1(n) = W2(n) = O(n), but T1(n) = O(log n) and T2(n) = O(log² n).

Page 6: Parallel Algorithms

6

How do we judge efficiency?

• It is difficult to give a more formal definition of efficiency. Consider the following situation: for A1, W1(n) = O(n log n) and T1(n) = O(n); for A2, W2(n) = O(n log² n) and T2(n) = O(log n).

• It is difficult to say which one is the better algorithm. Though A1 is more efficient in terms of work, A2 runs much faster.

• Both algorithms are interesting and one may be better than the other depending on a specific parallel machine.

Page 7: Parallel Algorithms

7

Optimal parallel algorithms

• Consider a problem, and let T(n) be the worst-case upper bound on the time of a serial algorithm for an input of length n.

• Assume also that T(n) is the lower bound for solving the problem. Hence, we cannot have a better upper bound.

• Consider a parallel algorithm for the same problem that does W(n) work in Tpar(n) time.

The parallel algorithm is work-optimal if W(n) = O(T(n)).

It is work-time optimal if, in addition, Tpar(n) cannot be improved.

Page 8: Parallel Algorithms

8

A simple parallel algorithm

Adding n numbers in parallel

Page 9: Parallel Algorithms

9

A work-optimal algorithm for adding n numbers

Step 1.
• Use only n/log n processors and assign log n numbers to each processor.
• Each processor adds its log n numbers sequentially in O(log n) time.

Step 2.
• We have only n/log n numbers left. We now execute our original algorithm on these n/log n numbers.

• Now T(n) = O(log n) and W(n) = O((n/log n) × log n) = O(n).
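
A sequential simulation of these two steps is sketched below; grouping the input into blocks of roughly log2(n) numbers and the helper name work_optimal_sum are assumptions made for illustration.

import math

def work_optimal_sum(values):
    """Work-optimal parallel sum, simulated sequentially.
    Step 1: n / log n groups, each summed sequentially (one group per processor).
    Step 2: the remaining partial sums are combined by the halving algorithm."""
    n = len(values)
    group = max(1, math.ceil(math.log2(n))) if n > 1 else 1
    # Step 1: each simulated processor adds log n consecutive numbers.
    partial = [sum(values[i:i + group]) for i in range(0, n, group)]
    # Step 2: the original halving algorithm on the n / log n partial sums.
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)
        partial = [partial[2 * i] + partial[2 * i + 1] for i in range(len(partial) // 2)]
    return partial[0]

print(work_optimal_sum(list(range(1, 17))))      # 136: groups of 4, then 2 halving steps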

Page 10: Parallel Algorithms

10

Why is parallel computing important?

• We can justify the importance of parallel computing for two reasons: very large application domains, and the physical limitations of VLSI circuits.

• Though computers are getting faster and faster, user demand for solving very large problems is growing at an even faster rate.

• Some examples include weather forecasting, simulation of protein folding, computational physics, etc.

Page 11: Parallel Algorithms

11

Physical limitations of VLSI circuits

• The Pentium III processor uses 180 nanometer (nm) technology, i.e., a circuit element like a transistor can be etched within 180 × 10⁻⁹ m.

• The Pentium IV processor uses 130 nm technology.

• Intel has recently trialed processors made using 65 nm technology.

Page 12: Parallel Algorithms

12

How many transistors can we pack?

• The Pentium III has about 42 million transistors and the Pentium IV about 55 million transistors.

• The number of transistors on a chip is approximately doubling every 18 months (Moore’s Law).

• There are now 100 transistors for every ant on Earth (Moore said so in a recent lecture).

Page 13: Parallel Algorithms

13

Physical limitations of VLSI circuits

• All semiconductor devices are Si based. It is fairly safe to assume that a circuit element will take at least a single Si atom.

• The covalent bond in Si has a bond length of approximately 0.2 nm. Hence, we will reach the limit of miniaturization very soon.

• The upper bound on the speed of electronic signals is 3 × 10⁸ m/sec, the speed of light.

• Hence, communication between two adjacent transistors will take approximately 10⁻¹⁸ sec.

• If we assume that a floating point operation involves the switching of at least a few thousand transistors, such an operation will take about 10⁻¹⁵ sec in the limit.

• Hence, we are looking at 1000-teraflop machines at the peak of this technology (1 TFLOPS = 10¹² FLOPS; 1 flop = one floating point operation). This is a very optimistic scenario.

Page 14: Parallel Algorithms

14

Other Problems

• The most difficult problem is controlling power dissipation.

• 75 watts is considered the maximum power output of a processor.

• As we pack more transistors, the power output goes up and better cooling is necessary.

• Intel cooled its 8 GHz demo processor using liquid nitrogen!

Page 15: Parallel Algorithms

15

The advantages of parallel computing

• Parallel computing offers the possibility of overcoming such physical limits by solving problems in parallel.

• In principle, thousands, even millions of processors can be used to solve a problem in parallel and today’s fastest parallel computers have already reached teraflop speeds.

• Today's microprocessors already use several parallel processing techniques like instruction-level parallelism, pipelined instruction fetching, etc.

• Intel uses hyper-threading in the Pentium IV mainly because the processor is clocked at 3 GHz, but the memory bus operates only at about 400-800 MHz.

Page 16: Parallel Algorithms

16

Problems in parallel computing

• The sequential or uniprocessor computing model is based on von Neumann's stored-program model.

• A program is written, compiled, and stored in memory, and it is executed by bringing one instruction at a time to the CPU.

Page 17: Parallel Algorithms

17

Problems in parallel computing

• Programs are written keeping this model in mind. Hence, there is a close match between the software and the hardware on which it runs.

• The theoretical RAM model captures these concepts nicely.

• There are many different models of parallel computing and each model is programmed in a different way.

• Hence an algorithm designer has to keep in mind a specific model for designing an algorithm.

• Most parallel machines are suitable for solving specific types of problems.

• Designing operating systems is also a major issue.

Page 18: Parallel Algorithms

18

The PRAM model

n processors are connected to a shared memory.

Page 19: Parallel Algorithms

19

The PRAM model

• Each processor should be able to access any memory location in each clock cycle.

• Hence, there may be conflicts in memory access. Also, memory management hardware needs to be very complex.

• We need some kind of hardware to connect the processors to individual locations in the shared memory.

• The SB-PRAM developed at Saarland University by Prof. Wolfgang Paul's group is one such machine: http://www-wjp.cs.uni-sb.de/projects/sbpram/

Page 20: Parallel Algorithms

20

The PRAM model

A more realistic PRAM model

Page 21: Parallel Algorithms

An overview of Lecture 2

• Models of parallel computation

• Characteristics of SIMD models

• Design issues for network SIMD models

• The mesh and the hypercube architectures

• Classification of the PRAM model

• Matrix multiplication on the EREW PRAM

21

Page 22: Parallel Algorithms

Models of parallel computation

Parallel computational models can be broadly classified into two categories:

• Single Instruction Multiple Data (SIMD)

• Multiple Instruction Multiple Data (MIMD)

22

Page 23: Parallel Algorithms

Models of parallel computation

• SIMD models are used for solving problems which have regular structures. We will mainly study SIMD models in this course.

• MIMD models are more general and used for solving problems which lack regular structures.

23

Page 24: Parallel Algorithms

SIMD models

An N-processor SIMD computer has the following characteristics:

• Each processor can store both program and data in its local memory.

• Each processor stores an identical copy of the same program in its local memory.

24

Page 25: Parallel Algorithms

SIMD models

• At each clock cycle, each processor executes the same instruction from this program. However, the data are different in different processors.

• The processors communicate among themselves either through an interconnection network or through a shared memory.

25

Page 26: Parallel Algorithms

Design issues for network SIMD models

• A network SIMD model is a graph. The nodes of the graph are the processors and the edges are the links between the processors.

• Since each processor solves only a small part of the overall problem, it is necessary that processors communicate with each other while solving the overall problem.

26

Page 27: Parallel Algorithms

Design issues for network SIMD models

• The main design issues for network SIMD models are communication diameter, bisection width, and scalability.

• We will discuss the two most popular network models, the mesh and the hypercube, in this lecture.

27

Page 28: Parallel Algorithms

Communication diameter

• Communication diameter is the diameter of the graph that represents the network model. The diameter of a graph is the largest shortest-path distance between any pair of nodes.

• If the diameter for a model is d, the lower bound for any computation on that model is Ω(d).

28
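
As a small illustration (not part of the slides) of what "diameter" means here, the sketch below computes the diameter of a network graph by breadth-first search; the adjacency-dict representation and the example linear array are assumptions.

from collections import deque

def diameter(adj):
    """Diameter of an unweighted graph {node: [neighbours]}: the largest
    shortest-path distance over all pairs, found by a BFS from every node."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

# A 4-processor linear array has diameter 3, so any computation that must move
# data between the two end processors needs at least 3 communication steps.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter(line))                            # 3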

Page 29: Parallel Algorithms

Communication diameter

• The data can be distributed in such a way that the two furthest nodes may need to communicate.

29

Page 30: Parallel Algorithms

Communication diameter

Communication between two furthest nodes takes Ω(d) time steps.

30

Page 31: Parallel Algorithms

Bisection width

• The bisection width of a network model is the minimum number of links that must be removed to decompose the graph into two equal parts.

• If the bisection width is large, more information can be exchanged between the two halves of the graph and hence problems can be solved faster.

31

Page 32: Parallel Algorithms

Bisection width

Dividing the graph into two parts.

32

Page 33: Parallel Algorithms

Scalability

• A network model must be scalable so that more processors can be easily added when new resources are available.

• The model should be regular so that each processor has a small number of links incident on it.

33

Page 34: Parallel Algorithms

Scalability

• If the number of links is large for each processor, it is difficult to add new processors as too many new links have to be added.

• If we want to keep the diameter small, we need more links per processor. If we want our model to be scalable, we need fewer links per processor.

34

Page 35: Parallel Algorithms

Diameter and Scalability

• The best model in terms of diameter is the complete graph. The diameter is 1. However, if we need to add a new node to an n-processor machine, we need n - 1 new links.

35

Page 36: Parallel Algorithms

Diameter and Scalability

• The best model in terms of scalability is the linear array. We need to add only one link for a new processor. However, the diameter is n - 1 for a machine with n processors.

36

Page 37: Parallel Algorithms

The mesh architecture

• Each internal processor of a 2-dimensional mesh is connected to 4 neighbors.

• When we combine two different meshes, only the processors on the boundary need extra links. Hence it is highly scalable.

37

Page 38: Parallel Algorithms

The mesh architecture

• Both the diameter and the bisection width of an n-processor, 2-dimensional mesh are O(√n).

A 4 x 4 mesh

38

Page 39: Parallel Algorithms

The hypercube architecture

Hypercubes of 0, 1, 2 and 3 dimensions

39

Page 40: Parallel Algorithms

• The diameter of a d-dimensional hypercube is d, as we need to flip at most d bits (traverse d links) to reach one processor from another.

• The bisection width of a d-dimensional hypercube is 2^(d-1).

The hypercube architecture

40
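
The "flip at most d bits" claim can be checked with a tiny sketch; labelling nodes by d-bit integers and the helper names below are assumptions, not the slides' notation.

def hypercube_distance(u, v):
    """Distance between hypercube nodes u and v (d-bit labels):
    the number of bit positions in which the labels differ."""
    return bin(u ^ v).count("1")

def neighbours(u, d):
    """The d neighbours of node u, obtained by flipping each bit in turn."""
    return [u ^ (1 << i) for i in range(d)]

print(hypercube_distance(0b000, 0b111))          # 3 = diameter of the 3-dimensional cube
print(neighbours(0b000, 3))                      # [1, 2, 4]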

Page 41: Parallel Algorithms

• The hypercube is a highly scalable architecture. Two d-dimensional hypercubes can be easily combined to form a (d+1)-dimensional hypercube.

• The hypercube has several variants like the butterfly, the shuffle-exchange network, and cube-connected cycles.

The hypercube architecture

41

Page 42: Parallel Algorithms

Adding n numbers on the mesh

Adding n numbers in √n steps

42
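
The mesh figure itself is not reproduced in the transcript, but a common way to realise this bound is to reduce each row towards its first column and then reduce that column, each in about √n nearest-neighbour steps; the sketch below simulates that scheme sequentially, and its function name and phase structure are assumptions.

def mesh_sum(grid):
    """Adding n numbers on a √n x √n mesh, simulated sequentially.
    Phase 1: each row is reduced leftwards, one neighbour-to-neighbour step at a time.
    Phase 2: the first column is reduced upwards in the same way.
    Each phase takes √n - 1 steps, so the total is O(√n)."""
    rows = [list(r) for r in grid]
    size = len(rows)
    # Phase 1: size - 1 parallel steps, one per column shift.
    for step in range(size - 1, 0, -1):
        for r in range(size):
            rows[r][step - 1] += rows[r][step]
    # Phase 2: size - 1 parallel steps along the first column.
    for step in range(size - 1, 0, -1):
        rows[step - 1][0] += rows[step][0]
    return rows[0][0]

grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(mesh_sum(grid))                            # 136 on a 4 x 4 mesh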

Page 43: Parallel Algorithms

43

Page 44: Parallel Algorithms

Adding n numbers on the hypercube

Adding n numbers in log n steps

44

Page 45: Parallel Algorithms

45

Page 46: Parallel Algorithms

46

Complexity analysis: Given n processors connected via a hypercube, S_Sum_Hypercube needs log n rounds to compute the sum. Since n messages are sent and received in each round, the total number of messages is O(n log n).

1. Time complexity: O(log n).

2. Message complexity: O(n log n).
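
S_Sum_Hypercube itself is not reproduced in the transcript; the sketch below shows one standard way such a hypercube reduction proceeds, simulated sequentially, with a round structure and names that are assumptions consistent with the log n rounds counted above.

def hypercube_sum(values):
    """Sum of n = 2^d numbers on a d-dimensional hypercube, simulated sequentially.
    In round i, every node whose label is an odd multiple of 2^i sends its partial
    sum across dimension i to the neighbour with that bit cleared, which adds it.
    After d = log2 n rounds, node 0 holds the total."""
    n = len(values)
    d = n.bit_length() - 1                       # assumes n is a power of two
    val = list(values)
    for i in range(d):
        step = 1 << i
        for node in range(n):
            if node % (2 * step) == step:        # sender in round i
                val[node - step] += val[node]    # one message per sender
    return val[0]

print(hypercube_sum(list(range(1, 9))))          # 36, after log2(8) = 3 rounds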

Page 47: Parallel Algorithms

Classification of the PRAM model

• In the PRAM model, processors communicate by reading from and writing to the shared memory locations.

47

Page 48: Parallel Algorithms

Classification of the PRAM model

• The power of a PRAM depends on the kind of access to the shared memory locations.

48

Page 49: Parallel Algorithms

Classification of the PRAM model

In every clock cycle,

• In the Exclusive Read Exclusive Write (EREW) PRAM, each memory location can be accessed only by one processor.

• In the Concurrent Read Exclusive Write (CREW) PRAM, multiple processors can read from the same memory location, but only one processor can write to it.

49

Page 50: Parallel Algorithms

Classification of the PRAM model

• In the Concurrent Read Concurrent Write (CRCW) PRAM, multiple processors can read from or write to the same memory location.

50

Page 51: Parallel Algorithms

Classification of the PRAM model

• It is easy to allow concurrent reading. However, concurrent writing gives rise to conflicts.

• If multiple processors write to the same memory location simultaneously, it is not clear what is written to the memory location.

51

Page 52: Parallel Algorithms

Classification of the PRAM model

• In the Common CRCW PRAM, all the processors writing to the same location must write the same value.

• In the Arbitrary CRCW PRAM, one of the processors arbitrarily succeeds in writing.

• In the Priority CRCW PRAM, processors have priorities associated with them and the highest priority processor succeeds in writing.

52
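
A small sketch (not from the slides) of how the three concurrent-write rules differ when several processors target the same cell; the lowest-id-wins priority order and the function name are assumptions.

def resolve_concurrent_write(requests, mode):
    """Resolve simultaneous writes to one shared-memory cell.
    `requests` is a list of (processor_id, value); a lower id means higher
    priority (an assumption chosen for illustration)."""
    if mode == "common":
        values = {value for _, value in requests}
        if len(values) > 1:
            raise ValueError("Common CRCW: all writers must write the same value")
        return values.pop()
    if mode == "arbitrary":
        return requests[0][1]                    # any one writer may succeed
    if mode == "priority":
        return min(requests)[1]                  # highest-priority writer wins
    raise ValueError("unknown mode: " + mode)

writes = [(3, 7), (1, 9), (2, 7)]
print(resolve_concurrent_write(writes, "arbitrary"))   # 7 (the first request here)
print(resolve_concurrent_write(writes, "priority"))    # 9 (processor 1 wins)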

Page 53: Parallel Algorithms

Classification of the PRAM model

• The EREW PRAM is the weakest and the Priority CRCW PRAM is the strongest PRAM model.

• The relative powers of the different PRAM models are as follows: EREW ≤ CREW ≤ Common CRCW ≤ Arbitrary CRCW ≤ Priority CRCW.

53

Page 54: Parallel Algorithms

Classification of the PRAM model

• An algorithm designed for a weaker model can be executed within the same time and work complexities on a stronger model.

54

Page 55: Parallel Algorithms

Classification of the PRAM model

• We say model A is less powerful than model B if either:

• the time complexity for solving a problem is asymptotically lower in model B than in model A, or

• the time complexities are the same, but the processor or work complexity is asymptotically lower in model B than in model A.

55

Page 56: Parallel Algorithms

Classification of the PRAM model

An algorithm designed for a stronger PRAM model can be simulated on a weaker model either with asymptotically more processors (work) or with asymptotically more time.

56

Page 57: Parallel Algorithms

Adding n numbers on a PRAM

57

Page 58: Parallel Algorithms

Adding n numbers on a PRAM

• This algorithm works on the EREW PRAM model as there are no read or write conflicts.

• We will use this algorithm to design a matrix multiplication algorithm on the EREW PRAM.

58

Page 59: Parallel Algorithms

For simplicity, we assume that n = 2^p for some integer p.

Matrix multiplication

59

Page 60: Parallel Algorithms

Matrix multiplication

• Each c_{i,j}, 1 ≤ i, j ≤ n, can be computed in parallel.

• We allocate n processors for computing c_{i,j}. Suppose these processors are P1, P2, …, Pn.

• In the first time step, processor P_m, 1 ≤ m ≤ n, computes the product a_{i,m} × b_{m,j}.

• We now have n numbers, and we use the addition algorithm to sum these n numbers in log n time.

60

Page 61: Parallel Algorithms

Matrix multiplication

• Computing each c_{i,j}, 1 ≤ i, j ≤ n, takes n processors and log n time.

• Since there are n² such c_{i,j}'s, we need O(n³) processors and O(log n) time overall.

• The processor requirement can be reduced to O(n³ / log n). Exercise!

• Hence, the work complexity is O(n³).

61

Page 62: Parallel Algorithms

Matrix multiplication

• However, this algorithm requires concurrent read capability.

• Note that each element a_{i,j} (and b_{i,j}) participates in computing n elements of the matrix C.

• Hence, n different processors will try to read each a_{i,j} (and b_{i,j}) in our algorithm.

62

Page 63: Parallel Algorithms

For simplicity, we assume that n = 2^p for some integer p.

Matrix multiplication

63

Page 64: Parallel Algorithms

Matrix multiplication

• Hence our algorithm runs on the CREW PRAM and we need to avoid the read conflicts to make it run on the EREW PRAM.

• We will create n copies of each of the elements a_{i,j} (and b_{i,j}). Then one copy can be used for computing each c_{i,j}.

64

Page 65: Parallel Algorithms

Matrix multiplication

Creating n copies of a number in O(log n) time using O(n) processors on the EREW PRAM.

• In the first step, one processor reads the number and creates a copy. Hence, there are two copies now.

• In the second step, two processors read these two copies and create four copies.

65

Page 66: Parallel Algorithms

Matrix multiplication

• Since the number of copies doubles in every step, n copies are created in O(log n) steps.

• Though we need n processors, the processor requirement can be reduced to O(n / log n).

66
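
A sequential simulation of the doubling broadcast described above follows; the function name and the way the last round is capped for values of n that are not powers of two are assumptions.

def erew_broadcast(x, n):
    """Create n copies of x without read conflicts, simulated sequentially.
    In each round, every existing copy is read by exactly one new processor,
    which writes a fresh copy, so the number of copies doubles per round and
    n copies exist after ceil(log2 n) rounds."""
    copies = [x]                                 # before round 1: the original value
    rounds = 0
    while len(copies) < n:
        copies.extend(copies[: n - len(copies)]) # each copy read by one distinct processor
        rounds += 1
    return copies, rounds

copies, rounds = erew_broadcast(42, 8)
print(len(copies), rounds)                       # 8 copies after 3 rounds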

Page 67: Parallel Algorithms

Matrix multiplication

• Since there are n² elements in the matrix A (and in B), we need O(n³ / log n) processors and O(log n) time to create n copies of each element.

• After this, there are no read conflicts in our algorithm. The overall matrix multiplication algorithm now takes O(log n) time and O(n³ / log n) processors on the EREW PRAM.

67

Page 68: Parallel Algorithms

Matrix multiplication

• The memory requirement is of course much higher for the EREW PRAM.

68

Page 69: Parallel Algorithms

69

Using n³ Processors

Algorithm MatMult_CREW

/* Step 1 */
forall P_{i,j,k}, where 1 ≤ i, j, k ≤ n, do in parallel
    C[i,j,k] = A[i,k] * B[k,j]
endfor

/* Step 2 */
for l = 1 to log n do
    forall P_{i,j,k}, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2, do in parallel
        if (2k mod 2^l) = 0 then
            C[i,j,2k] = C[i,j,2k] + C[i,j,2k - 2^(l-1)]
        endif
    endfor
    /* The output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
endfor
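
For reference, here is a plain sequential Python simulation of the same two steps; the halving in step 2 mirrors the log n addition rounds of the pseudocode, and the assumption that n is a power of two carries over.

import math

def matmult_crew(A, B):
    """Sequential simulation of MatMult_CREW for n x n matrices (n a power of two).
    Step 1: form every product A[i][k] * B[k][j] (conceptually one parallel step).
    Step 2: combine the n products for each (i, j) by log n pairwise-halving rounds."""
    n = len(A)
    C = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)] for i in range(n)]
    for _ in range(int(math.log2(n))):           # the log n addition rounds
        for i in range(n):
            for j in range(n):
                row = C[i][j]
                C[i][j] = [row[2 * t] + row[2 * t + 1] for t in range(len(row) // 2)]
    return [[C[i][j][0] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmult_crew(A, B))                        # [[19, 22], [43, 50]]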

Page 70: Parallel Algorithms

70

Complexity Analysis

• In the first step, the products are computed in parallel in constant time, that is, O(1).

• These products are summed in O(log n) time during the second step. Therefore, the run time is O(log n).

• Since the number of processors used is n³, the cost is O(n³ log n).

1. Run time, T(n) = O(log n).

2. Number of processors, P(n) = n³.

3. Cost, C(n) = O(n³ log n).

Page 71: Parallel Algorithms

71

Reducing the Number of Processors

In the above algorithm, although all the processors were busy during the first step, not all of them performed addition operations during the second step.

The second step consists of log n iterations. During the first iteration, only n³/2 processors performed addition operations, only n³/4 performed additions in the second iteration, and so on.

With this understanding, we may be able to use a smaller machine with only n³/log n processors.

Page 72: Parallel Algorithms

72

Reducing the Number of Processors

1. Each processor P_{i,j,k}, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/log n, computes the sum of log n products. This step produces n³/log n partial sums.

2. The partial sums produced in step 1 are added to produce the resulting matrix, as discussed before.
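
A sequential simulation of this reduced-processor scheme is sketched below; the block size of roughly log2(n) products per simulated processor and the helper name are assumptions.

import math

def matmult_reduced(A, B):
    """Matrix product with the two-step summation above, simulated sequentially.
    For each entry (i, j): step 1 sums blocks of log n products (one block per
    simulated processor), leaving n / log n partial sums; step 2 combines the
    partial sums by pairwise halving as before."""
    n = len(A)
    block = max(1, int(math.log2(n)))
    result = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            products = [A[i][k] * B[k][j] for k in range(n)]
            partial = [sum(products[s:s + block]) for s in range(0, n, block)]
            while len(partial) > 1:              # step 2: pairwise halving
                if len(partial) % 2:
                    partial.append(0)
                partial = [partial[2 * t] + partial[2 * t + 1]
                           for t in range(len(partial) // 2)]
            result[i][j] = partial[0]
    return result

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmult_reduced(A, B))                     # [[19, 22], [43, 50]]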

Page 73: Parallel Algorithms

73

Complexity Analysis

1. Run time, T(n) = O(log n).

2. Number of processors, P(n) = n³/log n.

3. Cost, C(n) = O(n³).