PART 4: PARALLEL PATTERNS

WEEK 12:

Design of a Parallel Program

* Flynn’s Taxonomy

* Levels of Parallelism

* Principal Parallel Patterns

* Result Parallelism

* Agenda Parallelism

* Specialist Parallelism

CSC526: Parallel Processing

Fall 2016

Dr. Soha S. Zaghloul

FLYNN’S TAXONOMY

Flynn categorized computer architectures into four main classes according to the number of instruction streams and data streams. These are:

* SISD: Single Instruction, Single Data
* SIMD: Single Instruction, Multiple Data
* MISD: Multiple Instruction, Single Data
* MIMD: Multiple Instruction, Multiple Data

FLYNN’S TAXONOMY – SISD

One stream of instructions processes a single stream of data.

[Figure: SISD architecture. A control unit issues instructions to one processor, which reads input data and writes output data.]

This is the common model of single-processor computers.

FLYNN’S TAXONOMY – SIMD

A single instruction stream is broadcast to multiple processors, each with its own data stream.

[Figure: SIMD architecture. One control unit broadcasts instructions to four processors, each reading its own input data and writing its own output data.]

Array processors, vector processors, and GPUs follow this model. A symmetric multiprocessor (SMP), in which every processor fetches its own instruction stream, is classified as MIMD rather than SIMD.

FLYNN’S TAXONOMY – MISD

Multiple instruction streams operate on a single data stream.

No well-known system fits this designation. It is mentioned only for the sake of completeness.

FLYNN’S TAXONOMY – MIMD

Each processing element has its own stream of instructions operating on its own data.

[Figure: MIMD architecture. Four control units, each issuing instructions to its own processor with its own input data and output data; the processors are linked by an interconnection network.]

This is the architecture of massively parallel processors (MPP).

GRANULARITY

Granularity, or grain size, is a measure of the amount of computation involved in a software process. In other words, the granularity defines the parallelism level of a process.

Three main grain sizes are identified:

* Fine grain
* Medium grain
* Coarse grain

In general, the execution of a program may involve a combination of these levels. The actual combination depends on many factors, such as:

* Algorithm
* Language
* Compiler support
* Hardware limitations

PARALLELISM LEVELS

According to the grain size, five levels of parallelism are identified:

* Instruction Level
* Loop Level
* Procedure Level
* Subprogram Level
* Job (Program) Level

The figure in the next slide shows the correspondence of parallelism levels to grain sizes.

PARALLELISM LEVELS TO GRAIN SIZE

* Level 5: Jobs/Programs. Coarse grain. Tens of thousands of instructions.
* Level 4: Subprograms. Coarse or medium grain. Thousands of instructions.
* Level 3: Procedures. Medium grain. Less than 2000 instructions.
* Level 2: Loops. Fine grain. Less than 500 instructions.
* Level 1: Instructions. Fine grain. From 2 to thousands of instructions.

[Figure: parallelism levels against grain size, with side axes for degree of parallelism, communication frequency, and scheduling overhead, and brackets marking the levels supported by SMP and those supported by MPP.]

GRANULARITY - EXAMPLE

Consider the problem of calculating all the pixels in all the frames of a computer-animated film. This may be solved in one of two ways:

* Assign a distinct processor to calculate each pixel. Each result requires a small amount of computation. This is fine-grained parallelism.
* Assign a distinct processor to render each entire frame. Each result requires a large amount of computation. This is coarse-grained parallelism.
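
The two decompositions can be sketched in Java with parallel streams. This is only an illustration under stated assumptions: the class name GrainDemo, the film dimensions, and the shade function standing in for the real pixel computation are all hypothetical. Note also that a parallel stream describes one logical task per element; the Java runtime decides how those tasks are grouped onto worker threads.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class GrainDemo {
    static final int FRAMES = 4, WIDTH = 8, HEIGHT = 6; // hypothetical film size

    // Hypothetical stand-in for the real per-pixel computation.
    static int shade(int frame, int x, int y) {
        return (frame * 31 + x * 17 + y * 7) % 256;
    }

    // Coarse grain: one logical task per frame; each task computes a whole frame.
    static int[][] renderByFrame() {
        int[][] film = new int[FRAMES][WIDTH * HEIGHT];
        IntStream.range(0, FRAMES).parallel().forEach(f -> {
            for (int p = 0; p < WIDTH * HEIGHT; p++) {
                film[f][p] = shade(f, p % WIDTH, p / WIDTH);
            }
        });
        return film;
    }

    // Fine grain: one logical task per pixel; each task does very little work.
    static int[][] renderByPixel() {
        int[][] film = new int[FRAMES][WIDTH * HEIGHT];
        IntStream.range(0, FRAMES * WIDTH * HEIGHT).parallel().forEach(i -> {
            int f = i / (WIDTH * HEIGHT);
            int p = i % (WIDTH * HEIGHT);
            film[f][p] = shade(f, p % WIDTH, p / WIDTH);
        });
        return film;
    }

    public static void main(String[] args) {
        // Both decompositions must produce the same film.
        System.out.println(Arrays.deepEquals(renderByFrame(), renderByPixel())); // prints true
    }
}
```

Both methods fill the same array; they differ only in how much work each task carries, which is exactly what the grain size measures.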

PARALLELISM PATTERNS

Three principal patterns for designing parallel programs are identified. These are:

* Result Parallelism
* Agenda Parallelism
* Specialist Parallelism

Using the above patterns, the steps for designing a parallel program are:

* Identify the pattern that best matches the problem
* Take the pattern’s suggested design as a starting point
* Implement the pattern using appropriate constructs in a parallel programming language

RESULT PARALLELISM (1) – CONCEPT

The Result Parallelism pattern has the following characteristics:

* There is a collection of multiple results
* The individual results are all computed in parallel, each by its own processor
* Each processor is able to carry out the complete computation to produce one result

The conceptual parallel program design is as follows:

Processor 1: Compute Result 1
Processor 2: Compute Result 2
...
Processor N: Compute Result N

RESULT PARALLELISM (2) – EXAMPLE 1

Consider the problem of calculating the factorials of a set of numbers stored in an array data of size N:

* Processor 1 is assigned to compute the factorial of data[0]
* Processor 2 is assigned to compute the factorial of data[1]
* ...
* Processor N is assigned to compute the factorial of data[N-1]

The figure in the next slide illustrates the result pattern.

RESULT PARALLELISM (3) – FIGURE EXAMPLE 1

Result Parallelism is depicted in the following figure:

[Figure: Processors 1, 2, 3, ..., 8 compute the factorials of data[0], data[1], data[2], ..., data[7] respectively.]

* All processors’ results are independent of each other.
* We are concerned with the result calculated by each stand-alone processor.
* Note that there is no data sharing between processors.
* Conceptually speaking, all processors can start and finish at the same time.
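
Example 1 could be written in Java roughly as follows. This is a sketch rather than the course's own code: the class name ResultParallelFactorials and the sample data are assumptions, and a parallel stream is used in place of literally assigning one processor per element.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class ResultParallelFactorials {
    // Complete computation of one result; it reads no shared data.
    static BigInteger factorial(int n) {
        BigInteger product = BigInteger.ONE;
        for (int k = 2; k <= n; k++) {
            product = product.multiply(BigInteger.valueOf(k));
        }
        return product;
    }

    // One independent logical task per element of data; result i belongs to data[i].
    static BigInteger[] factorials(int[] data) {
        return Arrays.stream(data)
                .parallel()
                .mapToObj(ResultParallelFactorials::factorial)
                .toArray(BigInteger[]::new);
    }

    public static void main(String[] args) {
        int[] data = {0, 1, 5, 10, 3, 7, 2, 4}; // hypothetical input, N = 8
        // prints [1, 1, 120, 3628800, 6, 5040, 2, 24]
        System.out.println(Arrays.toString(factorials(data)));
    }
}
```

Because no task depends on another, the tasks may run in any order and the output array is still filled position by position.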

RESULT PARALLELISM (4) – SEQUENTIAL DEPENDENCY EXAMPLE 2

Recalculating the formulae in a spreadsheet is another example of Result Parallelism.

* Conceptually, each cell has its own processor that computes the value of the cell’s formula.
* However, if the formula for cell B1 uses the value of cell A1, then B1 must wait until A1 finishes. This is known as Result Parallelism with Sequential Dependency.

The figure in the next slide depicts this concept.

RESULT PARALLELISM (5) – SEQUENTIAL DEPENDENCY EXAMPLE 2 FIGURE

[Figure: at time t1, Processors 1 to 4 compute Results 1 to 4; at time t2, Processors 5 to 8 compute Results 5 to 8.]
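
The spreadsheet idea can be sketched with Java's CompletableFuture, where a dependent cell is expressed as a continuation of the cell it reads. The class name SpreadsheetDependency, the four cells, and their formulas are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.CompletableFuture;

public class SpreadsheetDependency {
    // Recalculates four hypothetical cells:
    //   A1 = 10 + 5      A2 = 3 * 4        (independent of each other)
    //   B1 = A1 * 2      B2 = A1 + A2      (must wait for the A cells)
    static int[] recalculate() {
        CompletableFuture<Integer> a1 = CompletableFuture.supplyAsync(() -> 10 + 5);
        CompletableFuture<Integer> a2 = CompletableFuture.supplyAsync(() -> 3 * 4);
        // B1 starts only after A1 has finished.
        CompletableFuture<Integer> b1 = a1.thenApplyAsync(v -> v * 2);
        // B2 starts only after both A1 and A2 have finished.
        CompletableFuture<Integer> b2 = a1.thenCombine(a2, (x, y) -> x + y);
        return new int[] {a1.join(), a2.join(), b1.join(), b2.join()};
    }

    public static void main(String[] args) {
        int[] cells = recalculate();
        // prints A1=15 A2=12 B1=30 B2=27
        System.out.println("A1=" + cells[0] + " A2=" + cells[1]
                + " B1=" + cells[2] + " B2=" + cells[3]);
    }
}
```

A1 and A2 correspond to the results computed at time t1, and B1 and B2 to those computed at time t2.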

AGENDA PARALLELISM (1) – CONCEPT

The Agenda Parallelism pattern has the following characteristics:

* There is a collection of multiple tasks
* We are interested in one result only, or a small number of results
* Each processor is able to carry out the complete computation to produce one result for its assigned task

The conceptual parallel program design is as follows:

Processor 1: Perform task 1
Processor 2: Perform task 2
...
Processor N: Perform task N

AGENDA PARALLELISM (2) – FIGURE

Agenda Parallelism is depicted in the following figure:

[Figure: Processors 1, 2, 3, ..., 8 perform Task 1, Task 2, Task 3, ..., Task 8 respectively.]

AGENDA PARALLELISM (3) – SEQUENTIAL DEPENDENCY EXAMPLE 3

Consider the following problem for an array of numbers data[4]:

* Phase 1: Get the factorial of each number in the array data
* Phase 2: Get the Fibonacci of each factorial
* Phase 3: Classify the results into three categories:
  * Numbers that are less than threshold1
  * Numbers between threshold1 and threshold2
  * Numbers that are greater than threshold2

The following code segment illustrates the problem:

//calculate Factorial
for (i=0; i < N; i++) factorial[i] = Facto(data[i]); //Facto is a method

//calculate Fibonacci
for (i=0; i < N; i++) fibonacci[i] = Fibo(factorial[i]); //Fibo is a method

//classify according to thresholds
x = 0; y = 0; z = 0; //next free index in class1, class2, class3
for (i=0; i < N; i++)
    if (fibonacci[i] < threshold1) {class1[x] = fibonacci[i]; x++;}
    else if (fibonacci[i] > threshold2) {class3[z] = fibonacci[i]; z++;}
    else {class2[y] = fibonacci[i]; y++;}

AGENDA PARALLELISM (4) – FIGURE EXAMPLE 3

[Figure: three phases. Phase 1: Processors 1 to 4 compute the factorials of data[0] to data[3]. Phase 2: Processors 5 to 8 compute the Fibonacci of facto[0] to facto[3]. Phase 3: Processors 9, 10, and 11 collect the values less than threshold1, between threshold1 and threshold2, and greater than threshold2.]
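
The three phases of Example 3 might be coded in Java as below. This is a sketch under assumptions: the class name AgendaPhases, the helper bodies of facto and fibo, the sample data, and the threshold values are not from the slides. Each phase is a parallel stream, and a phase starts only after the previous one has completed, which is where the sequential dependency appears.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class AgendaPhases {
    static long facto(long n) {
        long product = 1;
        for (long k = 2; k <= n; k++) product *= k;
        return product;
    }

    // Iterative Fibonacci with fibo(0) = 0 and fibo(1) = 1.
    static long fibo(long n) {
        long a = 0, b = 1;
        for (long k = 0; k < n; k++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    // 0: below threshold1, 2: above threshold2, 1: in between.
    static int category(long v, long threshold1, long threshold2) {
        if (v < threshold1) return 0;
        if (v > threshold2) return 2;
        return 1;
    }

    static long[][] classify(long[] data, long threshold1, long threshold2) {
        // Phase 1: one task per element computes its factorial.
        long[] factorial = Arrays.stream(data).parallel().map(AgendaPhases::facto).toArray();
        // Phase 2: starts only after Phase 1 has produced every factorial.
        long[] fibonacci = Arrays.stream(factorial).parallel().map(AgendaPhases::fibo).toArray();
        // Phase 3: three tasks, one per category, each scanning all Phase 2 results.
        long[][] classes = new long[3][];
        IntStream.range(0, 3).parallel().forEach(c ->
                classes[c] = Arrays.stream(fibonacci)
                        .filter(v -> category(v, threshold1, threshold2) == c)
                        .toArray());
        return classes;
    }

    public static void main(String[] args) {
        long[][] classes = classify(new long[] {1, 2, 3, 4}, 5, 100);
        // prints [[1, 1], [8], [46368]]
        System.out.println(Arrays.deepToString(classes));
    }
}
```

With data = {1, 2, 3, 4} the factorials are 1, 2, 6, 24 and their Fibonacci numbers are 1, 1, 8, 46368, so with thresholds 5 and 100 the three classes hold two, one, and one value respectively.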

AGENDA PARALLELISM (5) – REDUCTION

When the output of an agenda parallel program is a summary of the individual tasks’ results, the program follows the so-called reduction pattern.

Consider the example of finding the product of the factorials of a set of numbers stored in an array data of size N:

* Task 1: determine the factorial of data[0]
* Task 2: determine the factorial of data[1]
* ...
* Task N: determine the factorial of data[N-1]
* Task N+1: find the product of all factorials

The figure in the next slide depicts such a pattern.

AGENDA PARALLELISM (6) – REDUCTION EXAMPLE 4

[Figure: two phases. Phase 1: Processors 1 to 4 compute Factorial(data[0]) to Factorial(data[3]). Phase 2: Processor 5 computes the product of the factorials.]
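
A reduction of this kind can be sketched in Java as a parallel map followed by reduce. The class name FactorialProductReduction and the sample data are assumptions. Multiplication is associative, so the partial products may be combined in any grouping.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class FactorialProductReduction {
    static BigInteger factorial(int n) {
        BigInteger product = BigInteger.ONE;
        for (int k = 2; k <= n; k++) {
            product = product.multiply(BigInteger.valueOf(k));
        }
        return product;
    }

    // Tasks 1..N compute the factorials in parallel;
    // the reduce step (Task N+1) multiplies them into a single result.
    static BigInteger productOfFactorials(int[] data) {
        return Arrays.stream(data)
                .parallel()
                .mapToObj(FactorialProductReduction::factorial)
                .reduce(BigInteger.ONE, BigInteger::multiply);
    }

    public static void main(String[] args) {
        // 1! * 2! * 3! * 4! = 1 * 2 * 6 * 24; prints 288
        System.out.println(productOfFactorials(new int[] {1, 2, 3, 4}));
    }
}
```

Only the single product is returned; the individual factorials are intermediate values, which is what distinguishes this from Result Parallelism.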

SPECIALIST PARALLELISM (1) – CONCEPT

The Specialist Parallelism pattern has the following characteristics:

* There is a group of tasks that must be performed on a series of data items to solve the problem
* Each processor performs only one task on the series of data items

The conceptual parallel program design is as follows:

Processor 1: For each item
    Perform task 1 on the item
Processor 2: For each item
    Perform task 2 on the item
...
Processor N: For each item
    Perform task N on the item

The figure in the next slide depicts the Specialist Pattern.

SPECIALIST PARALLELISM (2) – FIGURE

Specialist Parallelism is depicted in the following figure:

[Figure: Processor 1 performs Task 1 on Items 1 to 5, Processor 2 performs Task 2 on Items 1 to 5, and Processor 3 performs Task 3 on Items 1 to 5.]

SPECIALIST PARALLELISM (3) – EXAMPLE 5

Given an array data[8], we need to:

* Count the number of positive elements (Processor 1)
* Count the number of negative elements (Processor 2)
* Count the number of zeroes (Processor 3)

A code segment of the sequential version of the above problem is shown below (N = 8):

positive = 0; negative = 0; zero = 0;
for (i=0; i < N; i++)
    if (data[i] > 0) positive++;
    else if (data[i] < 0) negative++;
    else zero++;

A code segment of the parallel version of the above problem is shown below. Each processor runs its own loop with its own copy of i and updates only its own counter:

for (i=0; i < N; i++) if (data[i] > 0) positive++; //Processor 1
for (i=0; i < N; i++) if (data[i] < 0) negative++; //Processor 2
for (i=0; i < N; i++) if (data[i] == 0) zero++; //Processor 3

The figure in the next slide illustrates Example 5.

SPECIALIST PARALLELISM (4) – FIGURE EXAMPLE 5

[Figure: all three processors read the whole array data[0] to data[7]. Processor 1 counts the positive numbers, Processor 2 counts the negative numbers, and Processor 3 counts the zeroes.]
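
Example 5 can be sketched in Java with three asynchronous tasks, one per specialist. The class name SpecialistCounters and the sample array are assumptions. Each task only reads the shared array and produces its own count, so no locking is needed.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class SpecialistCounters {
    // Returns {positives, negatives, zeroes}; each count comes from its own task.
    static long[] count(int[] data) {
        CompletableFuture<Long> positive = CompletableFuture.supplyAsync(
                () -> Arrays.stream(data).filter(v -> v > 0).count());   // Processor 1
        CompletableFuture<Long> negative = CompletableFuture.supplyAsync(
                () -> Arrays.stream(data).filter(v -> v < 0).count());   // Processor 2
        CompletableFuture<Long> zero = CompletableFuture.supplyAsync(
                () -> Arrays.stream(data).filter(v -> v == 0).count());  // Processor 3
        // join() waits for each specialist to finish its pass over the array.
        return new long[] {positive.join(), negative.join(), zero.join()};
    }

    public static void main(String[] args) {
        int[] data = {3, -1, 0, 7, -5, 0, 2, 9}; // hypothetical data[8]
        // prints [4, 2, 2]
        System.out.println(Arrays.toString(count(data)));
    }
}
```

Each specialist scans all eight elements, so the total work is three passes instead of one; the pattern pays off when the passes run at the same time.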

SPECIALIST PARALLELISM (5) – PIPELINE

When there are sequential dependencies between the tasks in a specialist parallel problem, the program follows a pipelined pattern.

The output of one processor becomes the input for the next processor.

All processors work in parallel, each taking its input from the preceding processor’s previous output.

Consider the following example in an image processing application:

* Processor 1: Calculate all pixels of a frame
* Processor 2: Render the frame
* Processor 3: Compress the frame
* Processor 4: Store the frame
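
A pipeline like this is commonly built from threads connected by queues. The following Java sketch makes several assumptions that are not in the slides: the class name FramePipeline, the END marker that signals the end of the frame stream, and the use of string labels (CA, RE, CO, ST) in place of real image operations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

public class FramePipeline {
    static final String END = "END"; // hypothetical end-of-stream marker

    // Starts one specialist thread: it repeatedly takes an item from `in`, applies
    // its single task, and passes the result to `out`. END is forwarded, then it stops.
    static Thread stage(BlockingQueue<String> in, BlockingQueue<String> out,
                        UnaryOperator<String> task) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String item = in.take();
                    if (item.equals(END)) {
                        out.put(END);
                        return;
                    }
                    out.put(task.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Pushes `frames` frames through the four stages and returns what was stored.
    static List<String> run(int frames) {
        List<BlockingQueue<String>> q = new ArrayList<>();
        for (int i = 0; i < 5; i++) q.add(new LinkedBlockingQueue<>());
        stage(q.get(0), q.get(1), f -> f + " CA"); // Processor 1: calculate pixels
        stage(q.get(1), q.get(2), f -> f + " RE"); // Processor 2: render
        stage(q.get(2), q.get(3), f -> f + " CO"); // Processor 3: compress
        stage(q.get(3), q.get(4), f -> f + " ST"); // Processor 4: store
        List<String> stored = new ArrayList<>();
        try {
            for (int i = 1; i <= frames; i++) q.get(0).put("Frame " + i);
            q.get(0).put(END);
            // Collect finished frames until the END marker leaves the last stage.
            for (String s = q.get(4).take(); !s.equals(END); s = q.get(4).take()) {
                stored.add(s);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return stored;
    }

    public static void main(String[] args) {
        run(3).forEach(System.out::println); // first line printed: Frame 1 CA RE CO ST
    }
}
```

While Processor 2 renders Frame 1, Processor 1 is already free to calculate Frame 2, which is the overlap the pipelined pattern relies on.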

SPECIALIST PARALLELISM (6) – FIGURE EXAMPLE 6

[Figure: four timelines, one per processor. Processor 1 (calculate), Processor 2 (render), Processor 3 (compress), and Processor 4 (store) each handle Frames 1 to 5 at times 1 to 5 on their own time axes (time in P1, P2, P3, and P4).]

Note that the time is relative to each processor.

The next figure depicts the example with respect to absolute time.

SPECIALIST PARALLELISM (7) – PIPELINE EXAMPLE 6: ABSOLUTE TIME

Time in cycles   1    2    3    4    5    6    7    8
Frame 1          CA   RE   CO   ST
Frame 2               CA   RE   CO   ST
Frame 3                    CA   RE   CO   ST
Frame 4                         CA   RE   CO   ST
Frame 5                              CA   RE   CO   ST

CA = calculate (P1), RE = render (P2), CO = compress (P3), ST = store (P4). The time axis on the slide extends to cycle 12; the five frames occupy cycles 1 to 8 when each stage takes one cycle.
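
The absolute-time view can be reproduced with a small calculation. Assuming every stage takes exactly one cycle (an assumption made here for illustration), frame f occupies stage s during cycle f + s - 1. The class name PipelineSchedule is hypothetical.

```java
public class PipelineSchedule {
    static final String[] STAGES = {"CA", "RE", "CO", "ST"};

    // Cycle (1-based) in which frame `frame` occupies stage `stage`,
    // assuming one cycle per stage and one new frame entering per cycle.
    static int cycle(int frame, int stage) {
        return frame + stage - 1;
    }

    // Cycle in which the last frame leaves the last stage.
    static int totalCycles(int frames) {
        return frames + STAGES.length - 1;
    }

    public static void main(String[] args) {
        int frames = 5;
        for (int f = 1; f <= frames; f++) {
            StringBuilder row = new StringBuilder("Frame " + f + ":");
            for (int s = 1; s <= STAGES.length; s++) {
                row.append(' ').append(STAGES[s - 1]).append('@').append(cycle(f, s));
            }
            System.out.println(row); // e.g. Frame 1: CA@1 RE@2 CO@3 ST@4
        }
        // prints Pipelined: 8 cycles; one frame at a time: 20 cycles
        System.out.println("Pipelined: " + totalCycles(frames)
                + " cycles; one frame at a time: " + frames * STAGES.length + " cycles");
    }
}
```

Under this assumption five frames finish in 5 + 4 - 1 = 8 cycles, compared with 5 x 4 = 20 cycles if each frame had to pass through all four stages before the next one started.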

NOTES

* A sequential program may be completely rewritten to adapt it to a parallel pattern (see Example 5).
* The difference between the parallelism patterns can be summarized as follows:
  * Result Parallelism: We are concerned with the result of each processor.
  * Agenda Parallelism: We are concerned with only a combination of results (sequential dependency), or a summary of the individual results (reduction).
  * Specialist Parallelism: We are concerned with the tasks: each processor specializes in one task and performs it on every data item.