
Programming Parallel Algorithms - NESL

Guy E. Blelloch

Presented by:

Michael Sirivianos

Barbara Theodorides

Problem Statement

Why design a new language specifically for programming parallel algorithms?

In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms, but much less success in developing good languages for programming them.

There is a large gap between languages that are too low level (details obscure the meaning of the algorithm) and languages that are too high level (performance implications are unclear).

NESL

Nested Data-Parallel Language. Useful for teaching and implementing parallel algorithms.

Bridges the gap: allows high-level descriptions of parallel algorithms but also has a straightforward mapping onto a performance model.

Goals when designing NESL:

A language-based performance model that uses work and depth rather than a machine-based model that uses running time

Support for nested data-parallel constructs (the ability to nest parallel calls)

Analyzing performance

Processor-based models: performance is calculated in terms of the number of instruction cycles a computation takes (its running time), a function of input size and number of processors.

Virtual models: higher-level models that can be mapped onto various real machines (e.g. the PRAM, Parallel Random Access Machine). They can be mapped efficiently onto more realistic machines by simulating multiple processors of the PRAM on a single processor of a host machine. Virtual models are easier to program.

Measuring performance: Work & Depth

Work: the total number of operations executed by a computation; it specifies the running time on a sequential processor.

Depth: the longest chain of sequential dependencies in the computation; it represents the best possible running time assuming an ideal machine with an unlimited number of processors.

Example: Summing 16 numbers using a balanced binary tree
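The tree summation can be sketched in Python, counting work and depth alongside the result (a hypothetical `tree_sum` helper; the slides themselves use NESL):

```python
def tree_sum(xs):
    """Sum a sequence with a balanced binary tree, returning
    (result, work, depth).  Each '+' counts as one operation:
    work is the total number of additions, depth is the longest
    chain of dependent additions."""
    if len(xs) == 1:
        return xs[0], 0, 0
    mid = len(xs) // 2
    left, wl, dl = tree_sum(xs[:mid])
    right, wr, dr = tree_sum(xs[mid:])
    # The two halves are independent, so they could run in parallel:
    # work adds up, depth takes the maximum.
    return left + right, wl + wr + 1, max(dl, dr) + 1

print(tree_sum(list(range(16))))  # (120, 15, 4): 15 additions, depth log2(16) = 4
```

For 16 numbers the tree has 15 internal "+" nodes (work 15) and 4 levels (depth 4).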

How can work & depth be incorporated into a computational model?

Circuit model: designing a circuit of logic gates.

In the previous example, design a circuit in which the inputs are at the top, each "+" is an adder circuit, and each of the lines between adders is a bundle of wires.

Work = circuit size (number of gates)

Depth = longest path from an input to an output

How can work & depth be incorporated into a computational model? (cont)

Vector Machine Models: the VRAM is a sequential RAM extended with a set of instructions that operate on vectors.

Each location in memory contains a whole vector.

Vectors can vary in size during the computation.

Vector instructions include element-wise operations (e.g. adding corresponding elements).

Depth = number of instructions executed by the machine

Work = sum of the lengths of the vectors operated on

How can work & depth be incorporated into a computational model? (cont)

Vector Machine Model example: summation tree code.

Work = O(n + n/2 + n/4 + ...) = O(n)

Depth = O(log n)
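The summation tree on a VRAM can be sketched in Python, treating each element-wise vector add as a single instruction (a hypothetical `vram_sum` helper, not the paper's actual VCODE):

```python
def vram_sum(v):
    """VRAM-style summation tree: repeatedly add the two halves of
    the vector element-wise.  Each iteration is one vector
    instruction (depth 1) whose work equals the number of additions."""
    work = depth = 0
    while len(v) > 1:
        half = len(v) // 2
        # element-wise add of the two halves (odd leftover element carried along)
        v = [v[i] + v[half + i] for i in range(half)] + v[2 * half:]
        work += half
        depth += 1
    return v[0], work, depth

print(vram_sum(list(range(16))))  # (120, 15, 4)
```

The work is n/2 + n/4 + ... = O(n) additions, and the depth is the O(log n) iterations of the loop.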

How can work & depth be incorporated into a computational model? (cont)

Language-Based Models specify the costs of the primitive instructions and a set of rules for composing costs across program expressions.

They let us discuss the running time of algorithms without introducing a specific machine model.

Using work & depth: work and depth costs are assigned to each function and scalar primitive of the language, and rules are specified for combining parallel and sequential expressions.

Roughly speaking, when executing a set of tasks in parallel:

work = sum of the work of the tasks

depth = maximum of the depth of the tasks
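The composition rules can be written as two tiny combinators over (work, depth) pairs (a sketch; `seq` and `par` are hypothetical names, not NESL primitives):

```python
# A cost is a (work, depth) pair.
def seq(*costs):
    """Sequential composition: both work and depth add."""
    return (sum(w for w, d in costs), sum(d for w, d in costs))

def par(*costs):
    """Parallel composition: work adds, depth is the maximum."""
    return (sum(w for w, d in costs), max(d for w, d in costs))

# Three parallel tasks of costs (5,2), (3,4), (7,1):
print(par((5, 2), (3, 4), (7, 1)))  # (15, 4)
```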

Why Work & Depth?

Work & depth have been used informally for many years to describe the performance of parallel algorithms. They are easier to describe and to think about: it is easier to analyze algorithms in terms of work and depth than in terms of running time and number of processors (the processor-based model).

Why are models based on work and depth better than processor-based models for programming and analyzing parallel algorithms? Performance analysis stays closely related to the code, and the code provides a clear abstraction of parallelism.

Why Work & Depth? (cont)

To support this claim they consider Quicksort.

Sequential algorithm, average case: running time = O(n log n), depth of the recursive calls = O(log n).

Parallel algorithm:

Quicksort (cont.)

Code and analysis based on a processor-based model would have to specify:

how the sequence is partitioned across processors

how the subselection is implemented in parallel

how the recursive calls get partitioned among the processors

how the subcalls are synchronized

In the case of Quicksort this gets even more complicated: the recursive calls are not of equal sizes.

Work & Depth and running time

Running time at the two limits:

With a single processor: T = W.

With an unlimited number of processors: T = D.

For a given number of processors P we can place upper and lower bounds on the running time:

W/P <= T <= W/P + D

valid under assumptions about communication and scheduling costs. For example, given a memory latency L:

W/P <= T <= W/P + L*D

Communication among processors is not unit time, so D is multiplied by a latency factor. Bandwidth is not taken into account here; in the case of significantly limited bandwidth, W should be divided by a large bandwidth factor and D by a small one.
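The bound can be packaged as a small helper (a hypothetical `time_bounds` function, just restating the inequality above):

```python
def time_bounds(work, depth, p, latency=1):
    """Bounds on running time T with p processors:
    work/p <= T <= work/p + latency*depth."""
    lower = work / p
    upper = work / p + latency * depth
    return lower, upper

print(time_bounds(1000, 10, 8))     # (125.0, 135.0) with unit latency
print(time_bounds(1000, 10, 8, 5))  # (125.0, 175.0): latency stretches only the upper bound
```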

Work & Depth and running time (cont)

Communication bounds: work and depth do not take communication costs into account:

latency: the time between making a remote request and receiving the reply

bandwidth: the rate at which a processor can access memory

Latency can be hidden: each processor has multiple parallel tasks (threads) to execute, and therefore has plenty to do while waiting for replies.

Bandwidth cannot be hidden: while a processor is waiting for a data transfer to complete it cannot perform other operations, and therefore remains idle.

Nested Data-Parallelism and NESL

Data-parallelism: the ability to operate in parallel over sets of data.

Data-parallel languages (also called collection-oriented languages) are languages based on data-parallelism; they can be either flat or nested.

Importance of nested parallelism: it is used to implement nested loops and divide-and-conquer algorithms in parallel. Existing languages, such as C, do not have direct support for such nesting!

NESL is a nested data-parallel language, designed to express nested parallelism in a simple way with a minimal set of constructs.

NESL supports data-parallelism by means of operations on sequences:

The apply-to-each construct, which uses a set-like notation, e.g. {a * a : a in [3, -4, -9, 5]};

It can range over multiple sequences: {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]};

Elements of a sequence can be subselected with a filter, e.g. {a * a : a in [3, -4, -9, 5] | a > 0};

Any function may be applied to each element of a sequence, e.g. {factorial(i) : i in [3, 1, 7]};

NESL provides a set of functions on sequences, each of which can be implemented in parallel (sum, reverse, write), e.g. write([0, 0, 0, 0, 0, 0, 0, 0], [(4,-2),(2,5),(5,9)]);

Nested parallelism: sequences may be nested, and parallel functions may be used inside an apply-to-each, e.g. {sum(a) : a in [[2,3], [8,3,9], [7]]};
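The NESL constructs above map closely onto Python comprehensions; this sketch mirrors only the notation, of course, not the parallelism:

```python
# Apply-to-each: {a * a : a in [3, -4, -9, 5]}
squares = [a * a for a in [3, -4, -9, 5]]                      # [9, 16, 81, 25]

# Over multiple sequences: {a + b : a in ...; b in ...}
sums = [a + b for a, b in zip([3, -4, -9, 5], [1, 2, 3, 4])]   # [4, -2, -6, 9]

# With a filter: {a * a : a in ... | a > 0}
pos_squares = [a * a for a in [3, -4, -9, 5] if a > 0]         # [9, 25]

# Nested parallelism: {sum(a) : a in [[2,3], [8,3,9], [7]]}
row_sums = [sum(a) for a in [[2, 3], [8, 3, 9], [7]]]          # [5, 20, 7]
```

In NESL, both the outer apply-to-each and any parallel function inside it (like sum) run in parallel.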

The performance model defines work and depth in terms of the work and depth of the primitive operations, together with rules for composing the measures across expressions.

In most cases: W(e1 + e2) = 1 + W(e1) + W(e2), where the ei are expressions. A similar rule is used for depth.

For an apply-to-each expression, the work is one plus the sum of the work of the instances, and the depth is one plus the maximum of the depths of the instances.

For an if expression, the cost of the test is added to the cost of whichever branch is taken.

The performance Model (cont)

Example: factorial. Consider the evaluation of the expression e = {factorial(n) : n in a}, where a = [3, 1, 5, 2].

function factorial(n) = if (n == 1) then 1 else n*factorial(n-1);

Using the rules for work and depth: ==, *, and - each have cost 1, and two additional unit constants come from the cost of the function call and of the if-then-else statement.
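The tally can be done mechanically in Python; the constants below follow the slide's unit-cost scheme but the exact assignment is illustrative, not the paper's:

```python
def factorial_cost(n):
    """(value, work) of factorial(n), charging 1 each for ==, *, -,
    the function call, and the if-then-else (assumed constants)."""
    if n == 1:
        return 1, 3            # ==, call, if
    val, w = factorial_cost(n - 1)
    return n * val, w + 5      # ==, *, -, call, if

# {factorial(n) : n in a}: work is one plus the sum of the instance
# works; depth is one plus the maximum (each instance is sequential,
# so its depth equals its work).
a = [3, 1, 5, 2]
works = [factorial_cost(n)[1] for n in a]
apply_work = 1 + sum(works)    # 48 for these constants
apply_depth = 1 + max(works)   # 24: dominated by factorial(5)
```

The key point survives any choice of constants: the work of the apply-to-each is the sum over instances, while the depth is set by the largest one.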

Examples of Parallel Algorithms in NESL

Principles:

An important aspect of developing a good parallel algorithm is designing one whose work is close to the running time of a good sequential algorithm for the same problem.

Work-efficient: a parallel algorithm is work-efficient relative to a sequential algorithm if its work is within a constant factor of the time of the sequential algorithm.

Examples of Parallel Algorithms in NESL (cont)

Primes. Sieve of Eratosthenes:

1 procedure PRIMES(n):

2 let A be an array of length n

3 set all but the first element of A to TRUE

4 for i from 2 to sqrt(n)

5 begin

6 if A[i] is TRUE

7 then set all multiples of i up to n to FALSE

8 end

Line 7 is implemented by looping over the multiples; in total the algorithm takes O(n log log n) time.
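The pseudocode translates directly into sequential Python (a sketch):

```python
from math import isqrt

def primes(n):
    """Sequential Sieve of Eratosthenes, following the pseudocode:
    O(n log log n) time.  Returns all primes up to n."""
    A = [True] * (n + 1)
    A[0] = A[1] = False
    for i in range(2, isqrt(n) + 1):
        if A[i]:
            # line 7: "set all multiples of i up to n to FALSE"
            for m in range(2 * i, n + 1, i):
                A[m] = False
    return [i for i, flag in enumerate(A) if flag]

print(primes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```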

Examples of Parallel Algorithms in NESL (cont)

Primes (parallelized). Parallelize the line "set all multiples of i up to n to FALSE":

the multiples of a value i can be generated in parallel by [2*i:n:i],

and can be written into the array A in parallel with the write function.

The depth of this algorithm is O(sqrt(n)), since each iteration of the loop has constant depth and there are sqrt(n) iterations.

The number of multiples generated is the same as in the sequential version; since the algorithm does the same number of operations, the work is the same, O(n log log n).

Examples of Parallel Algorithms in NESL (cont)

Primes: improving depth. If we are given all the primes from 2 up to sqrt(n), we can generate all the multiples of these primes at once: {[2*p:n:p] : p in sqr_primes}

function primes(n) =
  if n == 2 then ([] int)
  else
    let sqr_primes = primes(isqrt(n));
        composites = {[2*p:n:p] : p in sqr_primes};
        flat_comps = flatten(composites);
        flags      = write(dist(true, n), {(i, false) : i in flat_comps});
        indices    = {i in [0:n]; fl in flags | fl}
    in drop(indices, 2);
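A sequential Python analogue of the recursive structure (a sketch: the loop over `sqr_primes` stands in for the parallel generate-and-write step, and the base case is widened to n <= 2 to keep the recursion well-founded):

```python
from math import isqrt

def primes(n):
    """Returns all primes less than n, mirroring the recursive NESL
    function: sieve with the primes up to sqrt(n) found recursively."""
    if n <= 2:
        return []                        # base case: ([] int) in NESL
    sqr_primes = primes(isqrt(n) + 1)    # primes up to sqrt(n) inclusive
    flags = [True] * n                   # dist(true, n)
    # {[2*p:n:p] : p in sqr_primes} followed by the parallel write:
    for p in sqr_primes:
        for m in range(2 * p, n, p):
            flags[m] = False
    return [i for i in range(n) if flags[i]][2:]   # drop(indices, 2)

print(primes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Each recursion level shrinks the problem from n to sqrt(n), which is what drives the depth analysis on the next slide.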

Examples of Parallel Algorithms in NESL (cont)

Primes: improving depth. Analysis of work and depth:

Work: most of the work is clearly done at the top level of the recursion, which does O(n log log n) work, and therefore the total work is O(n log log n).

Depth: since each recursion level has constant depth, the total depth is proportional to the number of levels. The problem size at the dth level is n^(1/2^d), so there are log log n levels, and therefore the depth is O(log log n).

This algorithm remains work-efficient and greatly improves the depth.

Examples of Parallel Algorithms in NESL (cont)

Sparse Matrix Multiplication. Sparse matrices: most elements are zero.

Representation in NESL: each row is a sequence of (column, value) pairs for the nonzero elements. For example, the matrix

     [  2.0 -1.0   0    0  ]
A =  [ -1.0  2.0 -1.0   0  ]
     [   0  -1.0  2.0 -1.0 ]
     [   0    0  -1.0  2.0 ]

is represented as

A = [[(0, 2.0), (1, -1.0)],
     [(0, -1.0), (1, 2.0), (2, -1.0)],
     [(1, -1.0), (2, 2.0), (3, -1.0)],
     [(2, -1.0), (3, 2.0)]]

E.g. multiply a sparse matrix A with a dense vector x. The product Ax in NESL is:

{sum({v * x[i] : (i,v) in row}) : row in A};

Let n be the number of nonzero elements in a row; then the depth of the computation is the depth of the sum, O(log n), and the work is the sum of the work across the elements, O(n).
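The NESL expression can be mirrored sequentially in Python (a sketch; `spmv` is a hypothetical name):

```python
# Sparse matrix as a list of rows, each row a list of
# (column, value) pairs for the nonzero elements.
A = [[(0, 2.0), (1, -1.0)],
     [(0, -1.0), (1, 2.0), (2, -1.0)],
     [(1, -1.0), (2, 2.0), (3, -1.0)],
     [(2, -1.0), (3, 2.0)]]

def spmv(A, x):
    """Sparse matrix-vector product, mirroring
    {sum({v * x[i] : (i,v) in row}) : row in A}.
    In NESL both the outer comprehension (over rows) and the
    inner sum run in parallel; here they are sequential."""
    return [sum(v * x[i] for i, v in row) for row in A]

print(spmv(A, [1.0, 1.0, 1.0, 1.0]))  # [1.0, 0.0, 0.0, 1.0]
```

This is exactly the kind of nested parallelism flat data-parallel languages cannot express: a parallel sum inside a parallel loop over rows of varying lengths.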

Examples of Parallel Algorithms in NESL (cont)

Planar Convex Hull. Problem: given n points in the plane, find which of them lie on the perimeter of the smallest convex region that contains all the points.

An example of nested parallelism for divide-and-conquer algorithms.

The Quickhull algorithm (similar to Quicksort): the strategy is to pick a pivot element, split the data based on the pivot, and recurse on each of the split sets. Worst-case work is O(n^2) and worst-case depth is O(n).

Examples of Parallel Algorithms in NESL (cont)

hsplit(set, A, P) and hsplit(set, P, A)

cross product (p, (A, P))

pm: the point farthest from the line A-P

Recurse: hsplit(set', A, pm) and hsplit(set', pm, P)

Elements below the line are ignored.
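The hsplit strategy can be sketched sequentially in Python (hypothetical `cross`/`hsplit`/`quickhull` helpers; the two recursive calls are the part NESL runs in parallel):

```python
def cross(o, a, b):
    """Signed area of the parallelogram (o, a, b); positive when b
    lies to the left of the directed line o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hsplit(points, a, p):
    """Hull vertices from a up to (but not including) p, considering
    only points strictly left of the line a -> p.  Picks pm, the point
    farthest from the line, as pivot and recurses on both sides."""
    above = [q for q in points if cross(a, p, q) > 0]
    if not above:
        return [a]
    pm = max(above, key=lambda q: cross(a, p, q))
    # The two recursive calls are independent: parallel in NESL.
    return hsplit(above, a, pm) + hsplit(above, pm, p)

def quickhull(points):
    a = min(points)   # leftmost point
    p = max(points)   # rightmost point
    return hsplit(points, a, p) + hsplit(points, p, a)

pts = [(0, 0), (4, 0), (2, 3), (2, 1), (1, 2), (3, 2), (2, -2)]
print(quickhull(pts))
```

Points on the wrong side of each line are dropped before recursing, which is why the work can shrink quickly in practice.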

Examples of Parallel Algorithms in NESL (cont)

Performance analysis of Quickhull:

Each recursive call has constant depth and O(n) work. However, since many points might be deleted at each step, the work could be significantly less.

As in Quicksort, the worst-case work is O(n^2) and the worst-case depth is O(n).

For m hull points the best-case costs are O(n) work and O(log m) depth.

Summary

They formalize a clear-cut language-based model for analyzing performance.

The work & depth model is defined directly in terms of a programming language, rather than a specific machine.

It can be applied to various classes of machines using mappings that account for the number of processors and for processing and communication costs.

NESL allows simple descriptions of parallel algorithms, making use of data-parallel constructs and the ability to nest such constructs.

Summary

NESL hides CPU/memory allocation and inter-processor communication details by providing an abstraction of parallelism.

The current NESL implementation is based on an intermediate language (VCODE) and a library of low-level vector routines (CVL).

For more information on how the NESL compiler is implemented, see "Implementation of a Portable Nested Data-Parallel Language" by Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha.

 

Discussion

Parallel processing / sensor network analogy:

Local processing -> aggregation. Work corresponds to the total aggregation cost.

Moving levels up -> collecting aggregated results from child nodes.

Depth -> depth of the routing tree in the sensor network; implies communication cost.

Latency -> cost to transmit data between motes.

In parallel computation the goal is to reduce execution time. Sensor networks aim to reduce power consumption by minimizing communication; execution time is also an issue when real-time requirements are imposed.

Discussion

NESL and TAG queries?

Can latency be hidden by assigning multiple tasks to motes?

Can you perform different operations on an array's elements in parallel? Is it hard to add one more parallelism mechanism besides apply-to-each and parallel functions?
