decomposition data decomposition – dividing the data into subgroups and assigning each piece to...

Decomposition

• Data Decomposition
    – Dividing the data into subgroups and assigning each piece to a different processor
    – Example: Embarrassingly parallel applications

• Functional Decomposition
    – Dividing an algorithm into its functional pieces and executing the pieces on separate processors
    – Example: Pipelining

Pipelined Computations

• Divide a problem into a series of tasks
• A processor completes a task sequentially and pipes the results to the next processor

Example of Summing Groups of Numbers

[Figure: processors P0 to P5 in a pipeline; each Pi adds its group sum ∑A[i] to a running total that enters P0 as zero and leaves P5 as the total]

Question: Is this data or is it functional decomposition?

Where is Pipelining Applicable?

Type 1 – More than one instance of a problem
    – Example: Multiple simulations with different parameter settings

Type 2 – Series of data items with multiple operations
    – Example: Signal filter or Sieve of Eratosthenes

Type 3 – Partial results passed on while processing continues
    – Example: Solving sets of linear equations

Considerations
    – Is there a series of sequential tasks?
    – Is the processing of each task approximately equal?
    – Can items be grouped to minimize communication cost?
    – If stages exceed processors
        o Group stages
        o Wrap the last stage back to the first
    – Determine where the result will be at the end of the process

Summing Numbers Example

Process P0
    send(&number, P1);

Process Pi (0 < i < N-1)
    recv(&sum, Pi-1);
    sum += number;
    send(&sum, Pi+1);

Process PN-1
    recv(&sum, PN-2);
    sum += number;
    Save or display the result
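The three process roles above can be tried out on one machine. The following is a minimal sketch, assuming Python threads stand in for processors and `queue.Queue` objects stand in for the send/recv links; the names `pipeline_sum`, `stage`, and `links` are illustrative and not part of the original slides.

```python
import queue
import threading

def pipeline_sum(numbers):
    """Sum the numbers with one pipeline stage (thread) per number."""
    n = len(numbers)
    if n == 0:
        return 0
    # links[i] carries the partial sum leaving stage i (assumed stand-in for send/recv).
    links = [queue.Queue() for _ in range(n)]

    def stage(i):
        # Stage 0 starts the running total; every other stage receives it.
        partial = 0 if i == 0 else links[i - 1].get()
        # Add this stage's own number and pass the total down the pipeline.
        links[i].put(partial + numbers[i])

    threads = [threading.Thread(target=stage, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The last stage's output is the final result.
    return links[-1].get()
```

Calling `pipeline_sum([3, 1, 4, 1, 5, 9])` pushes a single running total through six stages. With only one instance in flight the stages still run one after another, which is why the Type 1 analysis assumes many instances.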

Application
• Remove frequencies from a signal
    – Sequential algorithm: Fourier analysis, O(N lg N)
    – Parallel: Apply filters to the signal with convolution, O(N*FilterLength)
    – Filter examples: Chebyshev, Butterworth, etc.
    – Derive a filter: Set the z-domain poles and zeroes, then perform the inverse transformation.
    – Filters can be useful to manipulate signals, detect patterns, etc.

Chebyshev Filter Design

[Figures: Chebyshev in the z-domain; Chebyshev frequency response]

Note: Depending on the placement of the poles (+) and zeroes (0), the filter will affect a signal differently

Type 1: Multiple Instances

Sequential execution: $t_1 = m\,t_m$

Parallel processing: $(m + p - 1)\,t_m/p$

Parallel communication: $(m + p - 1)(t_{start} + n\,t_{data})$

Speed-up: $t_1/t_p = \dfrac{m\,t_m}{(m + p - 1)(t_m/p + t_{start} + n\,t_{data})}$

[Figure: Space-Time Diagram. Instances 1 to 5 each pass through processors P0 to P5; the axis is time.]

Notation
1. m = instances, p = processors
2. $t_{start}$ = latency, $t_{data}$ = transmission time per data item (bandwidth)
3. n = data transmitted per instance
4. $t_m$ = total time to process an instance
5. Total pipeline cycles = m + p - 1
6. Assume: Equal processing per stage
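To get a feel for the Type 1 timing model, the speed-up expression can be evaluated directly. This is a small sketch under the slide's assumption of equal processing per stage; the function name `pipeline_speedup` and the sample values are illustrative assumptions.

```python
def pipeline_speedup(m, p, t_m, t_start, t_data, n):
    """Speed-up of a Type 1 pipeline: m instances, p equal stages."""
    t_seq = m * t_m                           # sequential execution time
    cycles = m + p - 1                        # total pipeline cycles
    per_cycle = t_m / p + t_start + n * t_data
    t_par = cycles * per_cycle                # computation plus communication
    return t_seq / t_par

# With no communication cost, many instances push the speed-up toward p.
few = pipeline_speedup(m=1, p=4, t_m=10.0, t_start=0.0, t_data=0.0, n=0)
many = pipeline_speedup(m=1000, p=4, t_m=10.0, t_start=0.0, t_data=0.0, n=0)
```

A single instance gains nothing from the pipeline, while a long stream of instances approaches a speed-up of p. Any nonzero $t_{start}$ or $t_{data}$ lowers that ceiling.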

Type 2: Multiple Data Elements

[Figure: the unfiltered signal (data items d0 to d9) enters P0, passes through filters f0 to f5 on processors P0 to P5, and leaves as the filtered signal]

Example: Signal Filter
Each process removes one or more frequencies from a digitized signal
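A Type 2 pipeline of this kind can be sketched with chained generators, where each stage consumes samples one at a time and hands its output to the next stage. The sketch below does not use a Chebyshev or Butterworth design. It assumes the simplest frequency-removing filter, a two-zero notch with zeroes on the unit circle at angle ±ω, which gives y[k] = x[k] - 2cos(ω)·x[k-1] + x[k-2]. The names `notch_stage` and `filter_pipeline` are illustrative.

```python
import math

def notch_stage(samples, omega):
    """One pipeline stage: remove the frequency omega (radians per sample)."""
    c = 2.0 * math.cos(omega)
    x1 = x2 = 0.0                     # the two previous input samples
    for x in samples:
        yield x - c * x1 + x2         # y[k] = x[k] - 2cos(omega)x[k-1] + x[k-2]
        x2, x1 = x1, x

def filter_pipeline(samples, omegas):
    """Chain one notch stage per frequency and run the signal through."""
    stream = iter(samples)
    for omega in omegas:
        stream = notch_stage(stream, omega)
    return list(stream)

# Two sinusoids in, two stages, each stage removing one of them.
w1, w2 = 0.3, 1.1
signal = [math.sin(w1 * k) + math.sin(w2 * k) for k in range(50)]
filtered = filter_pipeline(signal, [w1, w2])
```

Each stage needs two earlier samples, so the first few outputs are a start-up transient. After that, both frequencies are cancelled.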

Type 2 Timing Diagram

Type 3: Partial Processing
• The next stage receives information to continue processing
• Additional processing continues at the source processor

Question: How do we determine speed-up?

[Figures: two space-time diagrams for processors P0 to P5, titled "Linear Equations" and "A More Balanced Load"; the legend distinguishes idle time from executing time]

Operation at each processor
Types 1 and 2

• Processor with rank r = 0
    – Generate the instance (type 1) or the data (type 2) to process
    – Process appropriately
    – Send message to the processor with rank 1

• Processors with rank r = 1, 2, ..., p-2
    – Receive message from the processor with rank r-1
    – Process appropriately
    – Send message to the processor with rank r+1

• Processor with rank r = p-1
    – Receive message from the processor with rank r-1
    – Process appropriately
    – Output final results

Examples
1) Adding numbers: n1 -> n1+n2 -> n1+n2+n3 -> . . .
2) Frequency removal: f(t) -> f0; f(t-f0) -> f1; f(t-f0-f1) -> . . .

Parallel Pipeline Sort

Value held by each processor after each step (numbers enter P0 from the right end of the list):

Step  Numbers        P0  P1  P2  P3  P4
 1    4, 3, 1, 2, 5
 2    4, 3, 1, 2     5
 3    4, 3, 1        5
 4    4, 3           5   2
 5    4              5   2
 6                   5   3   1
 7                   5   4   2
 8                   5   4   3   1
 9                   5   4   3   2
10                   5   4   3   2   1

• Pseudo code
    Receive xi
    IF xi < xmax
        Send xi
    ELSE
        Send xmax
        xmax = xi

Note: Processors can hold blocks of numbers for better efficiency
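The compare-and-forward rule can be checked with a sequential simulation. This is a minimal sketch, assuming one number per stage and a plain list in place of real processors; `pipeline_insertion_sort` and `held` are illustrative names.

```python
def pipeline_insertion_sort(numbers):
    """Simulate the sorting pipeline: stage r keeps the largest value it has seen."""
    held = [None] * len(numbers)      # held[r] plays the role of xmax on stage r
    for x in numbers:                 # feed the numbers into stage 0 one at a time
        for r in range(len(held)):
            if held[r] is None:       # an empty stage simply keeps the number
                held[r] = x
                break
            if x > held[r]:           # keep the larger value, send the smaller on
                held[r], x = x, held[r]
    return held                       # stage 0 holds the maximum, and so on down

result = pipeline_insertion_sort([4, 3, 1, 2, 5])
```

The simulation pushes each number all the way down before the next one enters. In the real pipeline the stages overlap, so the next number enters while earlier ones are still travelling.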

Bi-Directional Pipeline
• Use the pipeline to return results to the master
    – Useful for line, ring, or hypercube topologies

[Figure: pipeline of processors P0 to P5; space-time diagram of P0 to P4 against time showing the sorting phase followed by the gather phase]

Phases: N (generate steps) + N-1 (propagate steps) + N-1 (return steps) = 3N-2 steps

• Sort phase
    If (myid == 0) generate number
    Else receive(&number, Pmyid-1)
    If (number > max) {
        If (myid < P-1) send(max, Pmyid+1);
        max = number;
    }
    Else If (myid < P-1) send(number, Pmyid+1);

• Gather phase
    If (myid < P-1) receive sorted numbers from Pmyid+1
    If (myid > 0) send sorted numbers to Pmyid-1

Example: Sorting

Sieve of Eratosthenes

Prime Number Generation
Sieve of Eratosthenes (Type 2 pipeline)

• Concept
    – Each processor filters blocks of non-primes from the flow of data
    – The “potential” prime numbers pass through to the next processor

• Pseudo-code
    The master processor generates an array of n odd numbers
    In a loop, after receiving a group of numbers
        Filter the group of numbers; pass the unfiltered numbers down the pipeline
    Gather all of the primes

• Notes
    – Wrapping the pipeline in a ring could help maintain load balance
    – A termination message determines when the pipeline empties

Question: What range of numbers should each processor get?

Sequential code
    for (i = 2; i < n; i++)
        prime[i] = 1;
    for (i = 2; i <= sqrt_n; i++)
        if (prime[i] == 1)
            for (j = i + i; j < n; j = j + i)
                prime[j] = 0;

Parallel code
Processor Pi, i > 0
    recv(number, rank-1);
    PRIME = TRUE;
    FOR (int x = MIN; x < MAX; x++)
        IF ((number % x) == 0)
            PRIME = FALSE and BREAK
    IF (PRIME) send(number, rank+1);

Termination
    recv(number, rank-1);
    send(number, rank+1);
    IF (number == terminator) break;

Sequential time: O(n^2) as a loose upper bound (the tight bound for the sieve is O(n log log n))

Implementation
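The filtering idea behind the parallel code can be simulated in a few lines. This is a hedged sketch that assumes the simplest case of one prime per stage rather than a block [MIN, MAX) per processor. The master feeds 2 and the odd candidates into the pipeline, and a number that survives every stage becomes the divisor of a new stage. The names `pipeline_primes` and `stage_primes` are illustrative.

```python
def pipeline_primes(n):
    """Return the primes below n by passing candidates through filter stages."""
    stage_primes = []                 # stage r filters multiples of stage_primes[r]
    # The master generates 2 followed by the odd candidates.
    candidates = [x for x in range(2, n) if x == 2 or x % 2 == 1]
    for number in candidates:
        for p in stage_primes:
            if number % p == 0:       # this stage filters the number out
                break
        else:
            # The number passed every stage, so it is prime and
            # becomes the divisor held by a new last stage.
            stage_primes.append(number)
    return stage_primes

primes = pipeline_primes(30)
```

In a real pipeline the stages run concurrently and a terminator value follows the last candidate. Here the inner loop walks the stages in order for each number.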

Upper Triangular Matrix

All entries below the diagonal are zero
Useful for solving N equations in N unknowns

Solving Sets of Linear Equations

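• Upper Triangular Form
$a_{n-1,0}x_0 + a_{n-1,1}x_1 + \dots + a_{n-1,n-1}x_{n-1} = b_{n-1}$
$a_{n-2,0}x_0 + a_{n-2,1}x_1 + \dots + a_{n-2,n-2}x_{n-2} = b_{n-2}$
$\vdots$
$a_{1,0}x_0 + a_{1,1}x_1 = b_1$
$a_{0,0}x_0 = b_0$

• Back Substitution
$x_0 = b_0/a_{0,0}$
$x_1 = (b_1 - a_{1,0}x_0)/a_{1,1}$
$x_2 = (b_2 - a_{2,0}x_0 - a_{2,1}x_1)/a_{2,2}$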

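• Parallel code for Pi where 1 <= i < n
    sum = 0;
    for (j = 0; j < i; j++) {
        receive(&x[j], Pi-1);
        sum += a[i][j] * x[j];
        send(&x[j], Pi+1);
    }
    x[i] = (b[i] - sum)/a[i][i];
    send(&x[i], Pi+1);
    (the last processor, Pn-1, has no successor and skips the sends)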

• General solution for $x_i$
$x_i = \left(b_i - \sum_{j=0}^{i-1} a_{i,j}x_j\right)/a_{i,i}$

• Sequential code
    x[0] = b[0]/a[0][0];
    for (i = 1; i < n; i++) {
        sum = 0;
        for (j = 0; j < i; j++)
            sum += a[i][j]*x[j];
        x[i] = (b[i] - sum)/a[i][i];
    }

• Parallel pseudo code
    for (j = 0; j < i; j++) {
        recv(x[j], p-1);
        send(x[j], p+1);
    }
    sum = 0;
    for (j = 0; j < i; j++)
        sum = sum + a[i][j]*x[j];
    x[i] = (b[i] - sum)/a[i][i];
    send(x[i], p+1);

This is a type 3 pipeline example

Note: ai,j and bi are constants
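The Type 3 behaviour, forwarding each received value before continuing to work on it, can be sketched with one thread per unknown. This is a minimal illustration, assuming Python threads for processors and `queue.Queue` links for send/recv; `stage`, `pipeline_back_substitute`, and `links` are illustrative names, and the 3x3 system is made-up sample data.

```python
import queue
import threading

def stage(i, a, b, inbox, outbox, x):
    """Stage i receives x[0..i-1], forwards each one, then computes x[i]."""
    total = 0.0
    for j in range(i):
        xj = inbox.get()              # receive x[j] from stage i-1
        if outbox is not None:
            outbox.put(xj)            # pass it on before using it (Type 3)
        total += a[i][j] * xj
    x[i] = (b[i] - total) / a[i][i]
    if outbox is not None:
        outbox.put(x[i])              # send the new result down the pipeline

def pipeline_back_substitute(a, b):
    n = len(b)
    links = [queue.Queue() for _ in range(n - 1)]   # links[i]: stage i -> i+1
    x = [0.0] * n
    threads = []
    for i in range(n):
        inbox = links[i - 1] if i > 0 else None
        outbox = links[i] if i < n - 1 else None
        threads.append(threading.Thread(target=stage,
                                        args=(i, a, b, inbox, outbox, x)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# Sample system: 2*x0 = 4;  x0 + x1 = 5;  x0 + 2*x1 + 4*x2 = 16
solution = pipeline_back_substitute([[2, 0, 0], [1, 1, 0], [1, 2, 4]], [4, 5, 16])
```

Stage i does i multiply-adds while stage 0 does none. That difference is the load imbalance the slides ask about.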

Pipeline Solution

DO
    IF p ≠ master, receive xj from the previous processor
    IF p ≠ P-1, send xj to the next processor
    back substitute xj
UNTIL xi evaluated
IF p ≠ P-1, send xi to the next processor

• Notes:
1. Processing continues after sending values down the pipeline
2. Is the load imbalanced?

Illustration of Type 3 Solution

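[Figure: P0 computes x0 and passes it on; P1 computes x1 and passes on x0 and x1; P2 computes x2 and passes on x0, x1, and x2; P3 computes x3. A companion space-time diagram shows processors P0 to P5 against time.]

How balanced is this load?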
