
Page 1: Programming Parallel Algorithms - NESL

Programming Parallel Algorithms - NESL

Guy E. Blelloch

Presented by:

Michael Sirivianos

Barbara Theodorides

Page 2: Programming Parallel Algorithms - NESL

Problem Statement

Why design a new language specifically for programming parallel algorithms?

In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms.

Over the same period there has been much less success in developing good languages for programming parallel algorithms.

There is a large gap between languages that are too low level (details obscure the meaning of the algorithm) and languages that are too high level (performance implications are unclear).

Page 3: Programming Parallel Algorithms - NESL

NESL

Nested Data Parallel Language, useful for teaching and implementing parallel algorithms. It bridges the gap: it allows high-level descriptions of parallel algorithms but also has a straightforward mapping onto a performance model.

Goals when designing NESL:

A language-based performance model that uses work and depth, rather than a machine-based model that uses running time

Support for nested data-parallel constructs (the ability to nest parallel calls)

Page 4: Programming Parallel Algorithms - NESL

Analyzing performance

Processor-based models: performance is calculated in terms of the number of instruction cycles a computation takes (its running time), as a function of input size and number of processors.

Virtual models: higher-level models that can be mapped onto various real machines (e.g. the PRAM, Parallel Random Access Machine). They can be mapped efficiently onto more realistic machines by simulating multiple processors of the PRAM on a single processor of the host machine. Virtual models are easier to program.

Page 5: Programming Parallel Algorithms - NESL

Measuring performance: Work & Depth

Work: the total number of operations executed by a computation. It specifies the running time on a sequential processor.

Depth: the longest chain of sequential dependencies in the computation. It represents the best possible running time assuming an ideal machine with an unlimited number of processors.

Example: summing 16 numbers using a balanced binary tree, which takes 15 additions in total (the work) arranged in 4 levels (the depth).
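As an aside (not on the slides), the tree summation can be written directly in NESL; the function name sum_tree is ours, and the sketch assumes the input length is a power of two:

function sum_tree(a) =
  if (#a == 1) then a[0]
  else sum_tree({a[2*i] + a[2*i+1] : i in [0:#a/2]});

% e.g. sum_tree([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]) => 136 %

Each recursive level halves the sequence with one parallel step, so the work is O(n) and the depth is O(log n).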

Page 6: Programming Parallel Algorithms - NESL

How can work & depth be incorporated into a computational model?

Circuit model: designing a circuit of logic gates.

In the previous example, design a circuit in which the inputs are at the top, each “+” is an adder circuit, and each of the lines between adders is a bundle of wires.

Work = circuit size (number of gates)

Depth = longest path from an input to an output

Page 7: Programming Parallel Algorithms - NESL

How can work & depth be incorporated into a computational model? (cont)

Vector Machine Models: the VRAM is a sequential RAM extended with a set of instructions that operate on vectors.

Each location in memory contains a whole vector.

Vectors can vary in size during the computation.

Vector instructions include elementwise operations (e.g. adding the corresponding elements of two vectors).

Depth = the number of instructions executed by the machine

Work = the sum of the lengths of the vectors operated on. E.g. an elementwise add of two length-n vectors counts as one instruction (unit depth) but n units of work.

Page 8: Programming Parallel Algorithms - NESL

How can work & depth be incorporated into a computational model? (cont)

Vector Machine Models example: summation tree code.

Work = O(n + n/2 + n/4 + ...) = O(n)
Depth = O(log n)

Page 9: Programming Parallel Algorithms - NESL

How can work & depth be incorporated into a computational model? (cont)

Language-Based Models specify the costs of the primitive instructions and a set of rules for composing costs across program expressions.

They let us discuss the running time of algorithms without introducing a specific machine model.

Using work & depth: work and depth costs are assigned to each function and scalar primitive of the language, and rules are specified for combining parallel and sequential expressions.

Roughly speaking, when executing a set of tasks in parallel:

work = the sum of the work of the tasks
depth = the maximum of the depth of the tasks

Page 10: Programming Parallel Algorithms - NESL

Why Work & Depth?

Work & depth have been used informally for many years to describe the performance of parallel algorithms: they are easier to describe, easier to think about, and it is easier to analyze algorithms in terms of work & depth than in terms of running time and number of processors (a processor-based model).

Why are models based on work & depth better than processor-based models for programming and analyzing parallel algorithms? Performance analysis stays closely related to the code, and the code provides a clear abstraction of the parallelism.

Page 11: Programming Parallel Algorithms - NESL

Why Work & Depth? (cont)

To support this claim, consider Quicksort.

Sequential algorithm, average case: running time = O(n log n), depth of recursive calls = O(log n).

Parallel algorithm (see the NESL sketch below):
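The parallel code itself is not preserved in this transcript; the sketch below follows the NESL Quicksort from Blelloch's paper (treat it as a reconstruction, not a verbatim copy). The apply-to-each over [lesser, greater] makes both recursive calls in parallel:

function quicksort(a) =
  if (#a < 2) then a
  else
    let pivot   = a[#a/2];
        lesser  = {e in a | e < pivot};   % elements below the pivot %
        equal   = {e in a | e == pivot};
        greater = {e in a | e > pivot};   % elements above the pivot %
        result  = {quicksort(v) : v in [lesser, greater]}  % parallel recursion %
    in result[0] ++ equal ++ result[1];

Because the two subproblems are sorted simultaneously, the expected depth is O(log n) (the depth of the recursion) while the expected work remains O(n log n).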

Page 12: Programming Parallel Algorithms - NESL

Quicksort (cont.)

Code and analysis based on a processor-based model: the code would have to specify

how the sequence is partitioned across the processors

how the subselection is implemented in parallel

how the recursive calls get partitioned among the processors

how the subcalls are synchronized

In the case of Quicksort this gets even more complicated, because the recursive calls are not of equal sizes.

Page 13: Programming Parallel Algorithms - NESL

Work & Depth and running time

Running time at the two limits:

Single processor: T = work. Unlimited number of processors: T = depth.

For a given number of processors P we can place upper and lower bounds on the running time T:

W/P <= T <= W/P + D

These bounds are valid under assumptions about communication and scheduling costs. E.g. given a memory latency L:

W/P <= T <= W/P + L*D

Communication among processors does not take unit time, so D is multiplied by a latency factor. As a concrete example, with W = 10^6, D = 10^3, P = 100 and unit latency, T lies between 10^4 and 1.1 * 10^4 instruction cycles. Bandwidth is not taken into account; in the case of significantly different bandwidth, W should be divided by a large bandwidth factor and D by a small one.

Page 14: Programming Parallel Algorithms - NESL

Work & Depth and running time (cont)

Communication Bounds: work & depth do not take into account communication costs:

latency: the time between making a remote request and receiving the reply

bandwidth: the rate at which a processor can access memory

Latency can be hidden: each processor has multiple parallel tasks (threads) to execute, and therefore has plenty to do while waiting for replies.

Bandwidth cannot be hidden: while a processor is waiting for a data transfer to complete it cannot perform other operations, and therefore remains idle.

Page 15: Programming Parallel Algorithms - NESL

Nested Data-Parallelism and NESL

Data-parallelism: the ability to operate in parallel over sets of data.

Data-Parallel Languages (or Collection-Oriented Languages): languages based on data-parallelism. They can be either flat or nested.

Importance of nested parallelism: it is used to implement nested loops and divide-and-conquer algorithms in parallel. Existing languages, such as C, have no direct support for such nesting!

NESL is a nested data-parallel language, designed to express nested parallelism in a simple way with a minimal set of constructs.

Page 16: Programming Parallel Algorithms - NESL

NESL supports data-parallelism by means of operations on sequences.

An apply-to-each construct, which uses a set-like notation, e.g.

{a * a : a in [3, -4, -9, 5]};

It can be used over multiple sequences:

{a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]};

The ability to subselect elements of a sequence based on a filter, e.g.

{a * a : a in [3, -4, -9, 5] | a > 0};

Any function may be applied to each element of a sequence, e.g.

{factorial(i) : i in [3, 1, 7]};

A set of functions on sequences, each of which can be implemented in parallel (sum, reverse, write), e.g.

write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)]);

Nested parallelism: sequences may be nested and parallel functions may be used in an apply-to-each, e.g.

{sum(a) : a in [[2,3], [8,3,9], [7]]};
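To make the semantics concrete, here is what these expressions evaluate to (the results are worked out by hand here; they were not on the slide):

{a * a : a in [3, -4, -9, 5]};                          % => [9, 16, 81, 25] %
{a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]};       % => [4, -2, -6, 9] %
{a * a : a in [3, -4, -9, 5] | a > 0};                  % => [9, 25] %
{factorial(i) : i in [3, 1, 7]};                        % => [6, 1, 5040] %
write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)]);  % => [0, 0, 5, 0, -2, 9, 0, 0] %
{sum(a) : a in [[2,3], [8,3,9], [7]]};                  % => [5, 20, 7] %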

Page 17: Programming Parallel Algorithms - NESL

The Performance Model defines:

Work & depth in terms of the work and depth of the primitive operations, and

rules for composing the measures across expressions.

In most cases: W(e1 + e2) = 1 + W(e1) + W(e2), where the ei are expressions.

A similar rule is used for the depth.

Rules for the apply-to-each expression and for the if expression are reconstructed below.
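The equations on the slide did not survive the transcript; the following reconstruction is based on the rules in Blelloch's paper (a sketch of the slide's content, not a verbatim copy). For an apply-to-each, the work of the parallel bodies adds while the depth maximizes:

W(\{e_1(a) : a \in e_2\}) = 1 + W(e_2) + \sum_{a \in e_2} W(e_1(a))

D(\{e_1(a) : a \in e_2\}) = 1 + D(e_2) + \max_{a \in e_2} D(e_1(a))

For an if expression, only the branch actually taken is charged:

W(\text{if } e_1 \text{ then } e_2 \text{ else } e_3) = 1 + W(e_1) + \begin{cases} W(e_2) & \text{if } e_1 \text{ is true} \\ W(e_3) & \text{otherwise} \end{cases}

and similarly for D.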

Page 18: Programming Parallel Algorithms - NESL

The performance Model (cont)

Example: Factorial. Consider the evaluation of the expression e = {factorial(n) : n in a}, where a = [3, 1, 5, 2].

function factorial(n) = if (n == 1) then 1 else n*factorial(n-1);

Using the rules for work and depth, where W==, W*, and W- each have cost 1, and two unit constants come from the cost of the function call and of the if-then-else statement.
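The slide's derivation is missing from the transcript; applying the composition rules above gives, as a reconstruction (not the slide's exact formula):

W(e) = 1 + \sum_{n \in a} W(\mathit{factorial}(n)) = O\Big(\sum_{n \in a} n\Big), \qquad D(e) = 1 + \max_{n \in a} D(\mathit{factorial}(n)) = O\Big(\max_{n \in a} n\Big)

since each recursive level of factorial contributes constant work and depth. For a = [3, 1, 5, 2] the work is proportional to 3 + 1 + 5 + 2 = 11 and the depth to max(3, 1, 5, 2) = 5.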

Page 19: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL

Principles:

An important aspect of developing a good parallel algorithm is designing one whose work is close to the time for a good sequential algorithm that solves the same problem.

Work-efficient: Parallel algorithms are referred to as work-efficient relative to a sequential algorithm if their work is within a constant factor of the time of the sequential algorithm.

Page 20: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

Primes: the Sieve of Eratosthenes.

1 procedure PRIMES(n):

2 let A be an array of length n

3 set all but the first element of A to TRUE

4 for i from 2 to sqrt(n)

5 begin

6 if A[i] is TRUE

7 then set all multiples of i up to n to FALSE

8 end

Line 7 is implemented by looping over the multiples; the algorithm therefore takes O(n log log n) time.

Page 21: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

Primes (parallelized): parallelize line 7, “set all multiples of i up to n to FALSE”.

The multiples of a value i can be generated in parallel by [2*i:n:i],

and can be written into the array A in parallel with the write function, as sketched below.
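A hedged one-line sketch of that step, where flags is the boolean sequence playing the role of A (the name is ours, not the slide's):

flags = write(flags, {(j, false) : j in [2*i:n:i]});  % knock out all multiples of i at once %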

The depth of this algorithm is O(sqrt(n)), since each iteration of the loop has constant depth and there are sqrt(n) iterations.

The total number of multiples generated is the same as the number of operations of the sequential version. Since it does the same number of operations, the work is the same, O(n log log n).

Page 22: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

Primes: improving the depth. If we are given all the primes from 2 up to sqrt(n), we can generate all the multiples of these primes at once: {[2*p:n:p] : p in sqr_primes}

function primes(n) =
  if n == 2 then ([] int)
  else
    let sqr_primes = primes(isqrt(n));
        composites = {[2*p:n:p] : p in sqr_primes};
        flat_comps = flatten(composites);
        flags      = write(dist(true, n), {(i, false) : i in flat_comps});
        indices    = {i in [0:n]; fl in flags | fl}
    in drop(indices, 2);

Page 23: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

Primes: improving the depth. Analysis of work & depth:

Work: most of the work is clearly done at the top level of the recursion, which does O(n log log n) work, so the total work is O(n log log n).

Depth: since each recursion level has constant depth, the total depth is proportional to the number of levels. The problem size at the d-th level is n^(1/2^d), so the number of levels is log log n and the depth is O(log log n).
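The level count follows from a short calculation (added here for completeness; the slide states only the result):

n^{1/2^d} = 2 \;\Longrightarrow\; \frac{\log_2 n}{2^d} = 1 \;\Longrightarrow\; d = \log_2 \log_2 n

so the recursion bottoms out after about log log n levels.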

This algorithm remains work-efficient and greatly improves the depth.

Page 24: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

Sparse Matrix Multiplication. Sparse matrices: most elements are zero.

The 4x4 matrix

  A = [[ 2.0, -1.0,    0,    0],
       [-1.0,  2.0, -1.0,    0],
       [   0, -1.0,  2.0, -1.0],
       [   0,    0, -1.0,  2.0]]

is represented in NESL as a sequence of rows, each row a sequence of (column-index, value) pairs for its nonzero elements:

  A = [[(0, 2.0), (1, -1.0)],
       [(0, -1.0), (1, 2.0), (2, -1.0)],
       [(1, -1.0), (2, 2.0), (3, -1.0)],
       [(2, -1.0), (3, 2.0)]]

E.g. multiply the sparse matrix A with a dense vector x. The product Ax in NESL is: {sum({v * x[i] : (i,v) in row}) : row in A}; Let n be the number of nonzero elements, then

depth of the computation = the depth of the sum = O ( log n )

work = sum of the work across the elements = O (n)
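As a concrete check (our example, not the slide's), with the matrix A above and x = [1.0, 1.0, 1.0, 1.0], each row's dot product is computed in parallel:

{sum({v * x[i] : (i,v) in row}) : row in A};  % => [1.0, 0.0, 0.0, 1.0] %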

Page 25: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

Planar Convex Hull. Problem: given n points in the plane, find which of them lie on the perimeter of the smallest convex region that contains all the points.

An example of nested parallelism for divide-and-conquer algorithms.

The Quickhull algorithm (similar to Quicksort): the strategy is to pick a pivot element, split the data based on the pivot, and recurse on each of the split sets.

Worst-case work is O(n^2) and worst-case depth is O(n).

Page 26: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

The hull is hsplit(set, A, P) ++ hsplit(set, P, A), where each call to hsplit:

computes the cross product (p, (A,P)) for each point p, measuring its distance from the line A-P

picks pm: the point farthest from the line A-P

recurses in parallel: hsplit(set', A, pm) and hsplit(set', pm, P)

ignores the elements below the line
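A sketch of hsplit in NESL, along the lines of the version in Blelloch's paper; cross_product (the signed distance of a point from a line) and max_index (the index of the largest element) are assumed helpers, so treat this as illustrative rather than the slide's exact code:

function hsplit(points, p1, p2) =
  let cross  = {cross_product(p, (p1, p2)) : p in points};
      % keep only the points strictly above the line p1-p2 %
      packed = {p in points; c in cross | c > 0.0}
  in if (#packed < 2) then [p1] ++ packed
     else
       let pm = points[max_index(cross)]   % farthest point: a hull vertex %
       in flatten({hsplit(packed, p1, p2) : (p1, p2) in [(p1, pm), (pm, p2)]});

The apply-to-each on the last line runs the two recursive calls in parallel, which is exactly the nested parallelism the slide illustrates.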

Page 27: Programming Parallel Algorithms - NESL

Examples of Parallel Algorithms in NESL (cont)

Performance analysis of Quickhull:

Each level of recursion has constant depth and O(n) work. However, since many points might be deleted at each step, the work could be significantly less.

As in Quicksort, the worst-case work is O(n^2) and the worst-case depth is O(n).

For m hull points, the best-case costs are O(n) work and O(log m) depth.

Page 28: Programming Parallel Algorithms - NESL

Summary

The authors give a clear-cut, formal, language-based model for analyzing performance.

The work & depth model is defined directly in terms of a programming language, rather than a specific machine.

It can be applied to various classes of machines using mappings that account for the number of processors and for processing and communication costs.

NESL allows simple descriptions of parallel algorithms, making use of data-parallel constructs and the ability to nest such constructs.

Page 29: Programming Parallel Algorithms - NESL

Summary

NESL hides the CPU/memory allocation and inter-processor communication details by providing an abstraction of parallelism.

The current NESL implementation is based on an intermediate language (VCODE) and a library of low-level vector routines (CVL).

For more information on how the NESL compiler is implemented, see “Implementation of a Portable Nested Data-Parallel Language”, Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha.

Page 30: Programming Parallel Algorithms - NESL

Discussion

Parallel Processing - Sensor Network Analogy:

Local processing -> aggregation. Work corresponds to the total aggregation cost.

Moving levels up -> collecting aggregated results from child nodes.

Depth -> the depth of the routing tree in the sensor network; it implies the communication cost.

Latency -> the cost to transmit data between motes.

In parallel computation the goal is to reduce execution time. Sensor networks aim to reduce power consumption by minimizing communication; execution time is also an issue when real-time requirements are imposed.

Page 31: Programming Parallel Algorithms - NESL

Discussion

NESL and TAG queries?

Can latency be hidden by assigning multiple tasks to motes?

Can you perform different operations on an array's elements in parallel? Is it hard to add one more parallelism mechanism besides apply-to-each and parallel functions?