
Page 1: Chap. 5 Part 2

Chap. 5 Part 2

CIS*3090 Fall 2016


Page 2: Chap. 5 Part 2

Static work allocation

Where work distribution is predetermined, but based on what?

Typical scheme

Divide data of size n into P equal blocks (sketched in C below)

Assumption is that work ∝ data

But what if amount of work is not a function of the amount of data?

Some blocks take longer to compute (=hot spots)

Can’t load-balance work based on data alone!
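The typical block scheme can be sketched in a few lines of C (a minimal sketch; the name block_range and its half-open-range convention are illustrative assumptions, not from the slides):

```c
/* Block allocation: process p of P gets one contiguous slice of the n
 * elements.  The first n % P processes get one extra element, so slice
 * sizes differ by at most one. */
void block_range(int n, int P, int p, int *lo, int *hi)
{
    int base  = n / P;
    int extra = n % P;
    *lo = p * base + (p < extra ? p : extra);
    *hi = *lo + base + (p < extra ? 1 : 0);   /* half-open range [lo, hi) */
}
```

Equal-sized blocks only balance the load if work really is proportional to data, which is exactly the assumption questioned above.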


Page 3: Chap. 5 Part 2

Cyclic & Block Cyclic distribution/allocation

Idea

Instead of making just P successive equal-sized partitions, make many more, smaller partitions, and hand them out in rotation (round-robin) (Fig 5.12)

Is it really a static method?

Yes! The master notifies each worker of all the chunks it will be responsible for, and lets it process them at its own speed
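A minimal C sketch of the round-robin hand-out (p is the process id, P the number of processes; the chunk size B and the helper process_element are illustrative assumptions):

```c
/* Pure cyclic: process p handles elements p, p+P, p+2P, ... */
for (int i = p; i < n; i += P)
    process_element(i);

/* Block-cyclic with chunk size B: chunks of B consecutive elements are
 * dealt out round-robin, so process p gets chunks p, p+P, p+2P, ... */
for (int c = p; c * B < n; c += P)
    for (int i = c * B; i < (c + 1) * B && i < n; i++)
        process_element(i);
```

Each process can compute the full set of chunks it owns from p, P, and B alone, with no run-time negotiation, which is why the method still counts as static.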


Page 4: Chap. 5 Part 2

Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Figure 5.12 Illustration of a cyclic distribution of an 8 × 8 array onto five processes.

Page 5: Chap. 5 Part 2

How cyclic distribution load-balances statically

Depends on “law of averages” to spread out the “hot spots”!

Want to balance size of chunk (block):

If too large, greater likelihood that the workload will be uneven

If too small, communication overhead gets pumped up

Contrast: any dynamic method requires more logic in the master and more overhead to communicate with the workers


Page 6: Chap. 5 Part 2

Mandelbrot (Julia sets) good example: Fig 5.15

Data is static (rectangle on complex plane)

Arbitrary graphical interpretation: each (x,y) pixel has a colour = func(# of iterations for that point’s calculation to converge)

Easy to divide up the data points equally

Classic “embarrassingly parallel”

But time to compute colour of each point differs dramatically in # of iterations!

Cyclic alloc. gives each Pi every Pth pixel

better chance of achieving even workload (why?)
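A minimal C sketch of the per-pixel iteration count plus the cyclic pixel allocation (MAX_ITER, the pixel-to-point mapping, and the function names are illustrative assumptions, not the book's code):

```c
#include <complex.h>

#define MAX_ITER 1000

/* Iterations until |z| exceeds 2 (or the cap is hit) for point c; this
 * count, which becomes the pixel's colour, varies wildly across points. */
static int escape_count(double complex c)
{
    double complex z = 0;
    int k = 0;
    while (cabs(z) <= 2.0 && k < MAX_ITER) {
        z = z * z + c;
        k++;
    }
    return k;
}

/* Cyclic allocation over the flattened image: process p of P colours
 * pixels p, p+P, p+2P, ..., spreading the slow "hot spot" pixels around. */
static void colour_my_pixels(int p, int P, int w, int h,
                             double x0, double y0, double dx, double dy,
                             int *image)
{
    for (int idx = p; idx < w * h; idx += P) {
        int x = idx % w, y = idx / w;
        double complex c = (x0 + x * dx) + (y0 + y * dy) * I;
        image[idx] = escape_count(c);
    }
}
```

Neighbouring pixels tend to need similar iteration counts, so dealing pixels out cyclically gives every process a mix of cheap and expensive ones; that is the "law of averages" argument behind the better chance of an even workload.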


Page 7: Chap. 5 Part 2

Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Figure 5.15 Julia set generated from the site http://aleph0.clarku.edu/~djoyce.

Page 8: Chap. 5 Part 2

Irregular data sets/problems

Previous examples based on matrices

Localization of Pi’s “own” data (for “owner computes”) is easy to identify

With “irregular” data sets, 2 problems:

How to partition the work?

See “mesh partitioning,” geometric vs. graph-theoretic techniques

How to efficiently localize Pi’s partition?


Page 9: Chap. 5 Part 2

“Inspector/executor” technique

On each Pi…

1. Inspect its partition for non-local refs.

2. Batch those and obtain them from Px’s in bulk

3. Now that all data is localized, go ahead and execute the computation
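A minimal stand-alone C sketch of the three phases (the arrays, sizes, and the simulated single-address-space "global" data are illustrative assumptions; in a real message-passing program step 2 would be one batched exchange of messages with the owning processes):

```c
#include <stdio.h>

#define N      16            /* global elements       */
#define NPROC  4             /* processes (simulated) */
#define LOCAL  (N / NPROC)

static int owner(int g) { return g / LOCAL; }   /* block owner of global index g */

int main(void)
{
    int val[N], refs[N], need[N], ghost[N], n_need = 0;
    int p = 1;                                  /* pretend we are process 1 */

    for (int g = 0; g < N; g++) { val[g] = g * g; refs[g] = (g * 7) % N; }

    /* 1. Inspector: scan my partition, record every non-local reference. */
    for (int i = p * LOCAL; i < (p + 1) * LOCAL; i++)
        if (owner(refs[i]) != p)
            need[n_need++] = refs[i];

    /* 2. Obtain the batched values in bulk (stand-in for one bulk
     *    communication step with the owning processes). */
    for (int k = 0; k < n_need; k++)
        ghost[k] = val[need[k]];

    /* 3. Executor: everything is now local (my block plus the ghost copies),
     *    so the real computation can run without further communication. */
    printf("process %d localized %d remote values\n", p, n_need);
    return 0;
}
```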

Analogy

You have a list of parts to assemble

Some are on your shelf

Others to be purchased at one or more stores

How many trips are you going to make?


Page 10: Chap. 5 Part 2

Dynamic schemes where work is generated at run time

Fits producer/consumer pattern

Easy to put queue between P’s and C’s

P’s and C’s can compute independently of one another, perfect for scalable parallelism

In non-shared mem. system, queue has to be in some node’s memory, accessed via messages

Depending on problem, may only need one queue entry per processor = a P-length array

Peril-L: no “queue” abstraction as such

Access its global mem. inside an exclusive block
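Peril-L's exclusive block has no direct C equivalent, but a mutex plays the same role; a minimal C/pthreads sketch (names are illustrative) of global memory used as the work "queue" inside a critical section:

```c
#include <pthread.h>

/* The shared "queue" is just global memory: a next-item counter that every
 * thread touches only inside the critical section (the pthreads analogue of
 * Peril-L's exclusive block). */
static int             next_item = 0;
static pthread_mutex_t excl      = PTHREAD_MUTEX_INITIALIZER;

/* Returns the next work item index, or -1 when no work is left. */
int grab_next_item(int total_items)
{
    int mine;
    pthread_mutex_lock(&excl);       /* enter exclusive block */
    mine = (next_item < total_items) ? next_item++ : -1;
    pthread_mutex_unlock(&excl);     /* leave exclusive block */
    return mine;
}
```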


Page 11: Chap. 5 Part 2

Collatz expansion factor computation (queue example)

What’s the “conjecture” about? (p134)

Numerical oddity: Starting with any arbitrary positive integer, do iterations:

If # is odd, triple it and add 1

If # is even, halve it

The series eventually converges to 1!

“Expansion” of series

How much the original # blows up before convergence takes over, i.e., max(series)/start
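A minimal C sketch of the expansion computation for one starting value (assuming 64-bit arithmetic does not overflow for the values tested; that is an assumption, not a guarantee):

```c
#include <stdint.h>

/* Expansion factor of the Collatz series starting at n: max(series)/start.
 * The loop terminates only if the series reaches 1, which is exactly what
 * the conjecture claims. */
double collatz_expansion(uint64_t n)
{
    uint64_t x = n, max = n;
    while (x != 1) {
        x = (x & 1) ? 3 * x + 1 : x / 2;   /* odd: triple and add 1; even: halve */
        if (x > max)
            max = x;
    }
    return (double)max / (double)n;
}
```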


Page 12: Chap. 5 Part 2

Parallel design

Split up testing, test integers in parallel

The test: Does this integer converge to 1? If so, what’s its expansion factor?

Test higher ints hoping to find exception

Can test them independently of each other because they really are independent

But ignores a useful characteristic!

Once you encounter any previously found max (shown to converge), you needn’t keep generating terms (can’t possibly go higher)

Book’s scalable solution doesn’t use this fact

Page 13: Chap. 5 Part 2

Scalable queue solution

Single queue of ints still to be tested

Initialize with 1st P integers (for P threads)

As each thread completes its dequeued test, it increments its tested # by P and enqueues the new #

Allows for computing expansion(some #) to take any amount of time
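A minimal C/pthreads sketch of this scheme (the queue capacity, LIMIT cut-off, and helper names are illustrative assumptions; the book's Peril-L version is not reproduced here):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4          /* P worker threads                        */
#define QCAP     64         /* queue capacity (at most P entries live) */
#define LIMIT    100000     /* stop once candidates reach this value   */

/* Shared queue of integers still to be tested, guarded by one mutex. */
static uint64_t        q[QCAP];
static int             head, tail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

static void     enqueue(uint64_t n) { q[tail++ % QCAP] = n; }
static uint64_t dequeue(void)       { return q[head++ % QCAP]; }

static double collatz_expansion(uint64_t n)       /* as in the earlier sketch */
{
    uint64_t x = n, max = n;
    while (x != 1) { x = (x & 1) ? 3 * x + 1 : x / 2; if (x > max) max = x; }
    return (double)max / (double)n;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);          /* critical section around queue */
        uint64_t n = dequeue();
        enqueue(n + NTHREADS);               /* tested # + P goes back in     */
        pthread_mutex_unlock(&qlock);
        if (n >= LIMIT)
            break;
        collatz_expansion(n);                /* variable-length work          */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (uint64_t n = 1; n <= NTHREADS; n++) enqueue(n);   /* first P integers */
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    puts("done");
    return 0;
}
```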


Page 14: Chap. 5 Part 2

Dynamic allocation effect

For given calculation…

If fast, thread returns to queue quickly to get next item

If slow, other threads will grab items while it’s busy

Queue allows computation to proceed as fast as possible

Continuously employing all processors

More processors finish faster → scalable


Page 15: Chap. 5 Part 2

Limitations of single queue

Not really that scalable!

Shared mem bottleneck for both producers and consumers, especially critical section

So, make 1+ queue/process

Reduces contention for single queue, but causes new problem:

Load imbalances if queues not evenly populated

Can solve with work stealing, getting item from another P’s queue if own is empty
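A minimal C sketch of per-process queues with stealing (an illustrative data structure, not a production work-stealing deque; it assumes work items are non-negative ints so -1 can mean "no work"):

```c
#include <pthread.h>

#define NPROC 4
#define QCAP  1024

/* One queue per process to reduce contention on a single shared queue. */
typedef struct {
    int             item[QCAP];
    int             head, tail;        /* dequeue at head, enqueue at tail */
    pthread_mutex_t lock;
} queue_t;

static queue_t q[NPROC];

void init_queues(void)
{
    for (int i = 0; i < NPROC; i++) {
        q[i].head = q[i].tail = 0;
        pthread_mutex_init(&q[i].lock, NULL);
    }
}

/* Try my own queue first; if it is empty, scan the others and steal. */
int get_work(int me)
{
    for (int k = 0; k < NPROC; k++) {
        queue_t *src = &q[(me + k) % NPROC];   /* k == 0 is my own queue */
        int got = -1;
        pthread_mutex_lock(&src->lock);
        if (src->head < src->tail)
            got = src->item[src->head++ % QCAP];
        pthread_mutex_unlock(&src->lock);
        if (got != -1)
            return got;                        /* local work or a stolen item */
    }
    return -1;                                 /* every queue is empty */
}
```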


Page 17: Chap. 5 Part 2

Figure 1: ThreadTest performance by number of threads, from “A Comparison of Memory Allocators in Multiprocessors,” Joseph Attardi and Neelakanth Nadgir, June 2003

Page 18: Chap. 5 Part 2

malloc woes

Results in non-scalable code; the bottleneck has a worse impact as P increases

Another problem: false sharing

Happens when heap memory is allocated to different cores from the same cache line

Can be solved by minimum-size allocations (like padding solution), but can be wasteful
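A minimal C sketch of the padding/alignment idea (the 64-byte line size is a typical value and an assumption, not a portable constant):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define CACHE_LINE 64    /* typical cache-line size; assumed, not portable */

/* Pad per-thread data to a full line so two threads' hot fields never share
 * a cache line (no false sharing, at the cost of wasted space). */
typedef struct {
    long counter;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

/* Hand out heap blocks in whole cache lines so allocations that end up on
 * different cores cannot land in the same line. */
void *line_aligned_alloc(size_t bytes)
{
    void  *p       = NULL;
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return posix_memalign(&p, CACHE_LINE, rounded) == 0 ? p : NULL;
}
```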


Page 19: Chap. 5 Part 2

Two heap storage use cases

Thinking about the problem…

1. Each thread only wants to malloc/free for its internal use, won’t share pointers

2. A pointer malloc’d in one thread will be passed from thread to thread

Finally freed by last thread that needed it (task parallel processing pattern)


Page 20: Chap. 5 Part 2

Try a thread-private heap?

Each thread starts with its own pool of free storage, does own housekeeping

Good: pointers still globally valid

No contention for single shared housekeeping structure → this is scalable

Bad: malloc in Pi, free in Pj

1. Transfers i’s storage to j’s heap! (gets linked into j’s free pool)

2. Any thread can run out of heap despite there being heap globally available somewhere!


Page 21: Chap. 5 Part 2

Thread-private heap variation

Record who owns malloc’d block, so it can be freed back to owner’s heap!

But… freeing thread is the wrong one to access owner’s free list

Can solve with lock, but motive is to avoid lock

Still the same problem of local “starvation” even though heap is globally available elsewhere

Though not caused by “heap stealing” now

More odd behaviour…


Page 22: Chap. 5 Part 2

Thread-private heap variation

Pipeline processing pattern can result in chain of private allocations as item passes thru multiple queues

P0’s pointer could be passed right through pipeline of P’s and back to P0 for freeing

Ties up P0’s storage too long

So, P1 copies P0’s block into its own heap, after which P0 frees its block; repeat…

Memory footprint of one item becomes much larger than with common heap


Page 23: Chap. 5 Part 2

“Hoard” solution (www.hoard.org)

Combine private heaps with global heap

No contention → scalable performance

“7x faster than Mac built-in allocator”

Prevents local heap starvation by allowing freed pages to be “donated” to global heap which can be joined to needy local heaps

Allocates out of large blocks (avoids false sharing)

Comes with GPL

Open-source your app, or pay $$$$ for a license

Page 24: Chap. 5 Part 2

Summary re malloc/free

Something for multicore programmers to pay attention to!

Changing from the default malloc/free could be a big performance booster if the app relies on dynamic storage


Page 25: Chap. 5 Part 2

Trees: hard to share among threads

If non-shared mem, can’t build out of conventional pointers!

Why Pilot does common configuration phase on all nodes in parallel

Pilot’s internal tables

Global in effect (all nodes have same definitions of processes, channels, bundles)

Local in reality (tables are built on each node so all pointers are locally valid)


Page 26: Chap. 5 Part 2

Allocating sub-trees to processors

It’s the leaf subtrees (at some level) we mostly care about (they contain the “work”)

Allocate 1+ subtrees to Pi (Fig 5.18)

Replicate “cap” (tree above subtrees) for each Pi

Since cap is in local mem, its pointers are valid

If cap changes on Pi, needs to be sync’d with other P views

Combine with work queue for trees that grow unpredictably/irregularly
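A minimal C sketch of one way to realize Fig 5.18's layout, assuming a complete binary tree stored in heap order (node k has children 2k+1 and 2k+2) and P a power of two; this indexing convention is an illustrative assumption, not the book's code:

```c
/* Cap allocation for a complete binary tree in heap order, P a power of two.
 * The cap (nodes 0 .. P-2) is replicated on every process; the P leaf
 * subtrees are rooted at nodes P-1 .. 2P-2, one subtree per process. */

/* Root node of the subtree owned by process p. */
int subtree_root(int P, int p) { return P - 1 + p; }

/* Is node k part of the replicated cap? */
int in_cap(int P, int k) { return k < P - 1; }

/* Owner of a node below the cap (check in_cap(P, k) first): walk up to the
 * subtree-root level just under the cap, then read off the process number. */
int owner_of(int P, int k)
{
    while (k > 2 * P - 2)
        k = (k - 1) / 2;        /* parent in heap order */
    return k - (P - 1);
}
```

Because the cap is a local copy on every process, its pointers/indices stay locally valid; any change a process makes to its cap must then be synchronized with the other processes’ copies, as noted above.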


Page 27: Chap. 5 Part 2

Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Figure 5.18 Cap allocation for a binary tree on P = 8 processes. Each process is allocated one of the leaf subtrees, along with a copy of the cap (shaded).