Decomposition
• Data Decomposition
– Dividing the data into subgroups and assigning each piece to a different processor
– Example: Embarrassingly parallel applications
• Functional Decomposition
– Dividing an algorithm into its functional pieces and executing the pieces on separate processors
– Example: Pipelining
Pipelined Computations
• Divide a problem into a series of tasks
• A processor completes a task sequentially and pipes the results to the next processor
Example of Summing Groups of Numbers
[Figure: pipeline of processors P0 to P5. A running sum enters P0 as zero, each processor Pk adds its group sum ∑A[ik], and the total leaves P5.]
Question: Is this data or is it functional decomposition?
Where is Pipelining Applicable?
Type 1 – More than one instance of a problem
– Example: Multiple simulations with different parameter settings
Type 2 – Series of data items with multiple operations
– Example: Signal Filter or Sieve of Eratosthenes
Type 3 – Partial results passed on while processing continues
– Example: Solving sets of linear equations
Considerations
– Is there a series of sequential tasks?
– Is the processing of each task approximately equal?
– Can items be grouped to minimize communication cost?
– If stages exceed processors
o Group stages
o Wrap last stage back to the first
– Determine where the result will be at the end of the process
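The two options for the case where stages exceed processors can be sketched as two stage-to-processor mappings. This is only an illustration under the assumption that stages are numbered from 0; the function names `group_stages` and `wrap_stages` are made up for this example and do not come from the slides.

```python
def group_stages(num_stages, num_procs):
    """Grouping: give each processor a contiguous block of stages."""
    base, extra = divmod(num_stages, num_procs)
    mapping, stage = [], 0
    for proc in range(num_procs):
        # The first `extra` processors take one additional stage.
        size = base + (1 if proc < extra else 0)
        mapping.append(list(range(stage, stage + size)))
        stage += size
    return mapping


def wrap_stages(num_stages, num_procs):
    """Wrapping: after the last processor, continue again at the first."""
    return [list(range(proc, num_stages, num_procs)) for proc in range(num_procs)]
```

Grouping keeps neighbouring stages on one processor, which removes the messages between them; wrapping sends the output of the last processor back to the first, so it needs a ring connection.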
Summing Numbers Example
Process P[i] (0 < i < N-1)
recv(&sum, P[i-1]);
sum += number;
send(&sum, P[i+1]);
Process P[0]
send(&number, P[1]);
Process P[N-1]
recv(&sum, P[N-2]);
sum += number;
Save or display result
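The message flow of the summing example can be traced with a small sequential simulation. This is a sketch only: it assumes one number per processor and replaces real message passing with a local variable, and the name `pipeline_sum` is hypothetical.

```python
def pipeline_sum(numbers):
    """Simulate the summing pipeline; processor i holds numbers[i]."""
    if not numbers:
        return 0
    # P0 starts the chain by "sending" its own number.
    message = numbers[0]
    # P1 .. PN-1: receive the partial sum, add the local number, pass it on.
    for number in numbers[1:]:
        partial_sum = message      # recv(&sum, previous processor)
        partial_sum += number      # sum += number
        message = partial_sum      # send(&sum, next processor); the last one keeps it
    return message
```

The value left after the last stage is the total that the final processor would save or display.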
Application
• Remove frequencies from a signal
– Sequential Algorithm: Fourier Analysis, O(N lg N)
– Parallel: Apply filters to the signal with convolution, O(N * FilterLength)
– Filter Examples: Chebyshev, Butterworth, etc.
– Derive filter: Set Z-domain poles and zeroes, perform inverse transformation.
– Filters can be useful to manipulate signals, detect patterns, etc.
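To make the O(N * FilterLength) cost concrete, here is a minimal sketch of direct convolution and of a chain of filters applied one after another, as successive pipeline stages would apply them. The function names are hypothetical, and the kernels in the usage note are arbitrary coefficients chosen for illustration, not Chebyshev or Butterworth designs.

```python
def convolve(signal, kernel):
    """Direct convolution: len(signal) * len(kernel) multiply-adds."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def filter_pipeline(signal, kernels):
    """Apply each filter in turn; stage k of a pipeline would hold kernels[k]."""
    for kernel in kernels:
        signal = convolve(signal, kernel)
    return signal
```

For example, `filter_pipeline([1, 0, 0], [[1, 1], [1, -1]])` passes a unit impulse through two short filters in sequence.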
Chebyshev Filter Design
[Figures: Chebyshev in the z-domain; Chebyshev Frequency Response]
Note: Depending on the placement of the poles (+) and zeroes (0), the filter will affect a signal differently
Type 1: Multiple Instances
Sequential execution: $t_1 = m \cdot t_m$
Parallel processing: $(m + p - 1) \cdot t_m / p$
Parallel communication: $(m + p - 1)(t_{start} + n \cdot t_{data})$
Parallel time: $t_p = (m + p - 1)(t_m/p + t_{start} + n \cdot t_{data})$
Speed-up: $S = t_1 / t_p = \dfrac{m \cdot t_m}{(m + p - 1)(t_m/p + t_{start} + n \cdot t_{data})}$
[Figure: Space-Time Diagram. Instances 1 to 5 pass through processors P0 to P5, each instance starting one time step after the previous one.]
Notation
1. m = instances, p = processors
2. $t_{start}$ = latency, $t_{data}$ = transfer time per data item (determined by bandwidth)
3. n = data transmitted per instance
4. $t_m$ = total time to process an instance
5. Total pipeline cycles = m + p - 1
6. Assume: equal processing per stage
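The Type 1 timing model can be evaluated numerically with a few lines of code. This is a sketch under the same equal-stage-time assumption as the notation list; the function name `type1_speedup` is hypothetical, and its parameters mirror the symbols m, p, tm, tstart, n and tdata.

```python
def type1_speedup(m, p, t_m, t_start, n, t_data):
    """Speed-up of a Type 1 pipeline, assuming equal processing per stage."""
    t_sequential = m * t_m
    cycles = m + p - 1                        # total pipeline cycles
    t_compute = cycles * t_m / p              # each cycle costs one stage: t_m / p
    t_comm = cycles * (t_start + n * t_data)  # one message per cycle
    return t_sequential / (t_compute + t_comm)
```

With 100 instances on 10 processors and no communication cost the model gives a speed-up of 1000/109, a little over 9; adding a communication cost per cycle equal to the stage time halves it.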
Type 2: Multiple Data Elements
[Figure: data items d0 to d9 of the unfiltered signal enter a pipeline of processors P0 to P5; processor Pk applies filter fk, and the filtered signal leaves P5.]
Example: Signal Filter
Each process removes one or more frequencies from a digitized signal
Type 2 Timing Diagram
Type 3: Partial Processing
• The next stage receives information to continue processing
• Additional processing continues at the source processor
Question: How do we determine speed-up?
[Figures: two space-time diagrams for processors P0 to P5, titled "Linear Equations" and "A More Balanced Load"; the legend distinguishes idle from executing time.]
Operation at each processor
Types 1 and 2
• Processor with rank r = 0
– Generate the instance (type 1) or the data (type 2) to process
– Process appropriately
– Send message to the processor with rank 1
• Processors with rank r = 1, 2, ..., p-2
– Receive message from the processor with rank r-1
– Process appropriately
– Send message to the processor with rank r+1
• Processor with rank r = p-1
– Receive message from processor with rank r-1
– Process appropriately
– Output final results
Examples
1) Adding Numbers: n1 -> n1+n2 -> n1+n2+n3 -> . . .
2) Frequency removal: f(t) -> f0; f(t-f0) -> f1; f(t-f0-f1) -> . . .
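The receive, process, send pattern of each rank can be tried out on a single machine with threads and queues standing in for processors and messages. This is a sketch, not an MPI program: the `DONE` sentinel, the `run_pipeline` name, and the choice to let the calling thread act as both data source and result collector are assumptions of the example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream (assumption of this sketch)


def run_pipeline(items, stage_funcs):
    """Run one thread per stage; stage r receives from r-1 and sends to r+1."""
    links = [queue.Queue() for _ in range(len(stage_funcs) + 1)]

    def stage(rank, func):
        while True:
            msg = links[rank].get()            # receive from the previous rank
            if msg is DONE:
                links[rank + 1].put(DONE)      # forward the termination message
                return
            links[rank + 1].put(func(msg))     # process, then send to the next rank

    threads = [threading.Thread(target=stage, args=(r, f))
               for r, f in enumerate(stage_funcs)]
    for t in threads:
        t.start()
    for item in items:                         # the source generates the data
        links[0].put(item)
    links[0].put(DONE)
    results = []
    while True:                                # collect the final results
        out = links[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Because every link is a FIFO queue and each stage is a single thread, results leave the pipeline in the order the items entered it.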
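Parallel Pipeline Sort
[Figure: insertion sort of the numbers 4, 3, 1, 2, 5 on processors P0 to P4 over steps 1 to 10. The numbers enter P0 one per step, starting with 5; each processor keeps the largest value it has seen and passes the smaller one on, so that P0 to P4 end up holding 5, 4, 3, 2, 1.]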
• Pseudo code
Receive xi
IF xi < xmax
    Send xi
ELSE
    Send xmax
    xmax = xi
Note: Processors can hold blocks of numbers for better efficiency
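The compare-and-forward rule can be checked with a short sequential simulation. It is a sketch that assumes one processor per number and pushes each number through all processors before the next one enters, whereas a real pipeline overlaps these steps; the name `pipeline_sort` is hypothetical.

```python
def pipeline_sort(numbers):
    """Simulate the insertion-sort pipeline with one processor per number."""
    held = [None] * len(numbers)      # held[k] plays the role of xmax at processor k
    for x in numbers:                 # the numbers enter P0 one at a time
        for k in range(len(held)):
            if held[k] is None:       # the first value to arrive is simply kept
                held[k] = x
                break
            if x > held[k]:           # keep the larger value, pass the smaller on
                held[k], x = x, held[k]
    return held
```

Feeding the numbers in the order 5, 2, 1, 3, 4 leaves the processors holding 5, 4, 3, 2, 1.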
Bi-Directional Pipeline
• Use the pipeline to return results to the master
– Useful for line, ring, or hypercube topologies
[Figure: processors P0 to P5 connected in a line; a space-time diagram for P0 to P4 shows the sorting phase followed by the gather phase.]
Phases
• N (generate steps)
• N-1 (propagate steps)
• N-1 (return steps)
Total: N + (N-1) + (N-1) = 3N-2 steps
• Sort phase
If (myid == 0) generate number
Else receive(&number, p[myid-1])
If (number > max) { if (myid < P-1) send(max, p[myid+1]); max = number; }
Else if (myid < P-1) send(number, p[myid+1]);
• Gather phase
If (myid < P-1) receive sorted numbers from p[myid+1]
If (myid > 0) send sorted numbers to p[myid-1]
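The N-1 return steps of the gather phase can be checked with a small synchronous simulation in which every processor forwards one value toward the master per step. This is a sketch assuming one sorted value per processor and a non-empty input; `gather_phase` is a hypothetical name.

```python
def gather_phase(held):
    """Move one value per step toward P0; return (values at P0, steps used)."""
    n = len(held)
    outbox = [[v] for v in held]      # outbox[k]: values waiting at processor k
    collected = outbox[0]             # P0 keeps its own value and everything it receives
    steps = 0
    while len(collected) < n:
        # All sends in a step happen at the same time.
        in_flight = [outbox[k].pop(0) if outbox[k] else None for k in range(1, n)]
        for k, value in enumerate(in_flight, start=1):
            if value is not None:
                outbox[k - 1].append(value)
        steps += 1
    return collected, steps
```

For five processors holding 5, 4, 3, 2, 1 the master has the full sorted list after four steps, which matches the N-1 return steps in the phase count.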
Example: Sorting
Sieve of Eratosthenes
Prime Number Generation
Sieve of Eratosthenes (Type 2 pipeline)
• Concept
– Each processor filters blocks of non-primes from the flow of data
– The "potential" prime numbers pass through to the next processor
• Pseudo-code
The master processor generates an array of n odd numbers
In a loop, after receiving a group of numbers:
    Filter the group of numbers; pass unfiltered numbers down the pipeline
Gather all of the primes
• Notes
– Wrapping the pipeline in a ring could help maintain load balance
– A termination message determines when the pipeline empties
Question: What range of numbers should each processor get?
Sequential code
for (i = 2; i < n; i++)
    prime[i] = 1;
for (i = 2; i <= sqrt_n; i++)
    if (prime[i] == 1)
        for (j = i + i; j < n; j = j + i)
            prime[j] = 0;
Parallel Code
Processor pi > 0
Recv(number, rank-1);
PRIME = TRUE;
FOR (int x = MIN; x < MAX && x < number; x++)
    IF ((number % x) == 0) { PRIME = FALSE; BREAK; }
IF (PRIME) send(number, rank+1);
Termination
recv(number, rank-1);
send(number, rank+1);
IF (number == terminator) break;
Sequential Time: O(n^2) as a loose upper bound; the sieve itself needs O(n log log n) operations
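The range-per-processor idea can be sketched as a chain of filter stages, each testing the divisors in its own range. This is a sequential simulation, not the slides' implementation: the function names are hypothetical, and it assumes the ranges together cover every divisor from 2 up to the square root of n.

```python
def sieve_stage(numbers, lo, hi):
    """One pipeline stage: drop numbers divisible by any x in [lo, hi)."""
    survivors = []
    for number in numbers:
        # Stopping below `number` keeps a prime from being removed by itself.
        if all(number % x != 0 for x in range(lo, min(hi, number))):
            survivors.append(number)
    return survivors


def pipelined_sieve(n, ranges):
    """Master generates the odd candidates below n; each stage filters its range."""
    candidates = list(range(3, n, 2))
    for lo, hi in ranges:
        candidates = sieve_stage(candidates, lo, hi)
    return [2] + candidates if n > 2 else []
```

With `ranges = [(2, 4), (4, 6), (6, 8)]` three stages cover the divisors 2 to 7, which is enough for all candidates below 50.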
Implementation
Upper Triangular Matrix
All entries below the diagonal are zero
Useful for solving N equations in N unknowns
Solving Sets of Linear Equations
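• Upper Triangular Form
$a_{n-1,0}x_0 + a_{n-1,1}x_1 + \dots + a_{n-1,n-1}x_{n-1} = b_{n-1}$
$a_{n-2,0}x_0 + a_{n-2,1}x_1 + \dots + a_{n-2,n-2}x_{n-2} = b_{n-2}$
$\vdots$
$a_{1,0}x_0 + a_{1,1}x_1 = b_1$
$a_{0,0}x_0 = b_0$
• Back Substitution
$x_0 = b_0 / a_{0,0}$
$x_1 = (b_1 - a_{1,0}x_0) / a_{1,1}$
$x_2 = (b_2 - a_{2,0}x_0 - a_{2,1}x_1) / a_{2,2}$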
• Parallel code for p[i] where 1 <= i < n
sum = 0
For (j = 0; j < i; j++) {
    receive(&x[j], p[i-1]);
    sum += a[i][j] * x[j];
    send(&x[j], p[i+1]);
}
x[i] = (b[i] - sum) / a[i][i]
send(&x[i], p[i+1])
• General solution for $x_i$
$x_i = \left(b_i - \sum_{j=0}^{i-1} a_{i,j}x_j\right) / a_{i,i}$
• Sequential code
x[0] = b[0]/a[0][0];
FOR (i = 1; i < n; i++) {
    sum = 0;
    FOR (j = 0; j < i; j++)
        sum += a[i][j] * x[j];
    x[i] = (b[i] - sum)/a[i][i];
}
• Parallel Pseudo code
for (j = 0; j < i; j++) {
    recv(x[j], p-1);
    send(x[j], p+1);
}
sum = 0;
for (j = 0; j < i; j++)
    sum = sum + a[i][j]*x[j];
x[i] = (b[i] - sum)/a[i][i];
send(x[i], p+1);
This is a type 3 pipeline example
Note: ai,j and bi are constants
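The data flow of the pipelined back substitution can be followed in a sequential simulation where a list stands in for the values travelling down the pipeline. This is a sketch assuming one equation per processor and nonzero diagonal entries; `pipelined_back_substitution` is a hypothetical name.

```python
def pipelined_back_substitution(a, b):
    """Simulate processors p_0 .. p_{n-1}; p_i computes x[i] from forwarded values."""
    n = len(b)
    x = [0.0] * n
    forwarded = []                            # values travelling down the pipeline
    for i in range(n):                        # processor p_i
        total = 0.0
        for j, xj in enumerate(forwarded):    # receive x[j] from p_{i-1} and pass it on
            total += a[i][j] * xj
        x[i] = (b[i] - total) / a[i][i]
        forwarded.append(x[i])                # send x[i] to p_{i+1}
    return x
```

For the system 2x0 = 4, x0 + 3x1 = 11, 4x0 + 5x1 + 6x2 = 59 the simulation returns x0 = 2, x1 = 3, x2 = 6.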
Pipeline Solution
DO
    IF p ≠ master, receive xj from previous processor
    IF p ≠ P-1, send xj to next processor
    back substitute xj
UNTIL xi evaluated
IF p ≠ P-1, send xi to the next processor
• Notes:
1. Processing continues after sending values down the pipeline
2. Is the load imbalanced?
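One rough way to look at the load-balance question is to count the arithmetic each processor performs. The sketch below assumes one equation per processor and counts a multiply-add and a division as one operation each, which is a simplification made for this example; both function names are hypothetical.

```python
def stage_work(n):
    """Operations at processor i: i multiply-adds plus one division."""
    return [i + 1 for i in range(n)]


def imbalance(work):
    """Busiest processor's work divided by the average (1.0 means perfectly balanced)."""
    return max(work) / (sum(work) / len(work))
```

Under this count the last processor always does the most work, and the ratio grows toward 2 as the number of equations increases.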
Illustration of Type 3 Solution
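[Figure: P0 computes x0 and sends it on; P1 receives x0 and computes x1; P2 receives x0 and x1 and computes x2; P3 receives x0, x1 and x2 and computes x3. Time runs downward.]
[Figure: space-time diagram for processors P0 to P5.]
How balanced is this load?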