Task and Data Parallelism: Real-World Examples

Sasha Goldshtein, CTO, Sela Group (@goldshtn, blog.sashag.net)


DESCRIPTION

This presentation begins by reviewing the Task Parallel Library APIs, introduced in .NET 4.0 and expanded in .NET 4.5 -- the Task class, Parallel.For and Parallel.ForEach, and even Parallel LINQ. Then, we look at patterns and practices for extracting concurrency and managing dependencies, with real examples like Levenshtein's edit distance algorithm, the Fast Fourier Transform, and others.

TRANSCRIPT

Page 1: Task and Data Parallelism: Real-World Examples

Sasha Goldshtein

CTO, Sela Group

@goldshtn | blog.sashag.net

Page 2: Task and Data Parallelism: Real-World Examples


AGENDA

Multicore machines have been a cheap commodity for >10 years

Adoption of concurrent programming is still slow

Patterns and best practices are scarce

We discuss the APIs first…

…and then turn to examples, best practices, and tips

Page 3: Task and Data Parallelism: Real-World Examples


TPL EVOLUTION

2008: Incubated for 3 years as “Parallel Extensions for .NET”

2010: Released in full glory with .NET 4.0

2012: DataFlow added in .NET 4.5 (NuGet); augmented with language support (await, async methods)

The Future

Page 4: Task and Data Parallelism: Real-World Examples


TASKS

A task is a unit of work

May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)

Much more than threads, and yet much cheaper

Task<string> t = Task.Factory.StartNew(() => {
    return DnaSimulation(…);
});
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

try {
    //The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}

Page 5: Task and Data Parallelism: Real-World Examples


PARALLEL LOOPS

Ideal for parallelizing work over a collection of data

Easy porting of for and foreach loops

Beware of inter-iteration dependencies!

Parallel.For(0, 100, i => {
    ...
});

Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});

Page 6: Task and Data Parallelism: Real-World Examples


PARALLEL LINQ

Mind-bogglingly easy parallelization of LINQ queries

Can introduce ordering into the pipeline, or preserve order of original elements

var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;

query.ForAll(monster => Move(monster));
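Not from the slides: when order matters, a single call opts the whole pipeline into order preservation. A small illustrative snippet (assuming some sequence words):

var firstLongWords = words.AsParallel()
                          .AsOrdered()              //preserve source order
                          .Where(w => w.Length > 3)
                          .Take(10)
                          .ToList();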

Page 7: Task and Data Parallelism: Real-World Examples


MEASURING CONCURRENCY

Visual Studio Concurrency Visualizer to the rescue

Page 8: Task and Data Parallelism: Real-World Examples


RECURSIVE PARALLELISM EXTRACTION

Divide-and-conquer algorithms are often parallelized through the recursive call

Be careful with the parallelization threshold, and watch out for dependencies

void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    }
}

Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2));
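A minimal sketch (not from the slides) of applying a size threshold to the same recursion -- below the cutoff, the two halves are processed sequentially to avoid flooding the scheduler with tiny tasks:

void FFTParallel(float[] src, float[] dst, int n, int r, int s) {
    const int Threshold = 1024; //illustrative cutoff -- tune by measurement
    if (n == 1) {
        dst[r] = src[r];
    } else if (n < Threshold) {
        FFT(src, dst, n/2, r, s*2);       //sequential below the threshold
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    } else {
        Parallel.Invoke(
            () => FFTParallel(src, dst, n/2, r, s*2),
            () => FFTParallel(src, dst, n/2, r+s, s*2));
        //Combine the two halves in O(n) time
    }
}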

Page 9: Task and Data Parallelism: Real-World Examples


SYMMETRIC DATA PROCESSING

For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution

Inter-iteration dependencies complicate things (think in-place blur)

Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});

Page 10: Task and Data Parallelism: Real-World Examples


UNEVEN WORK DISTRIBUTION

With non-uniform data items, use custom partitioning or manual distribution

Primes: 7 is easier to check than 10,320,647

var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start + chunk*n, start + chunk*(n+1))));
Task.WaitAll(work.ToArray());

VS

Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));

Page 11: Task and Data Parallelism: Real-World Examples


COMPLEX DEPENDENCY MANAGEMENT

Must extract all dependencies and incorporate them into the algorithm

Typical scenarios: 1D loops, dynamic programming algorithms

Edit distance: each task depends on 2 predecessors, wavefront computation

C = x[i-1] == y[j-1] ? 0 : 1;
D[i, j] = min(D[i-1, j] + 1,
              D[i, j-1] + 1,
              D[i-1, j-1] + C);

[Diagram: the edit-distance table is filled in a diagonal wavefront from cell (0,0) to cell (m,n)]
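A minimal C# sketch (not from the slides) of the wavefront idea: cells on the same anti-diagonal (i + j = d) do not depend on each other, so each diagonal can be swept with a parallel loop before moving on to the next one:

static int EditDistance(string x, string y) {
    int m = x.Length, n = y.Length;
    var D = new int[m + 1, n + 1];
    for (int i = 0; i <= m; ++i) D[i, 0] = i;
    for (int j = 0; j <= n; ++j) D[0, j] = j;

    for (int d = 2; d <= m + n; ++d) {            //anti-diagonal index i + j
        int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
        Parallel.For(iMin, iMax + 1, i => {
            int j = d - i;
            int c = x[i - 1] == y[j - 1] ? 0 : 1;
            D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                               D[i - 1, j - 1] + c);
        });
    }
    return D[m, n];
}

In practice the per-cell work is tiny, so a real implementation would hand each task a block of the diagonal rather than a single cell.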

Page 12: Task and Data Parallelism: Real-World Examples


SYNCHRONIZATION > AGGREGATION

Excessive synchronization brings parallel code to its knees

Try to avoid shared state, or minimize access to it

Aggregate thread- or task-local state and merge later

Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),              //initial local state
    (range, pls, localPrimes) => {      //aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i)) localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                    //combiner
        lock (primes)
            primes.AddRange(localPrimes);
    });

Page 13: Task and Data Parallelism: Real-World Examples


CREATIVE SYNCHRONIZATION

We implement a collection of stock prices, initialized with 10^5 name/price pairs

10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day

Many reader threads, many writer threads

GET(key):
    if safe contains key then return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then safe[key] = value
    else lock { unsafe[key] = value }
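A rough C# sketch of this scheme (class and member names are hypothetical, not from the deck). The “safe” dictionary's key set is frozen at construction, so lookups and in-place value updates skip the lock entirely; only the rarely-used overflow map stays behind a lock:

using System.Collections.Generic;
using System.Linq;

class StockPrices {
    //Mutable holder: updates overwrite a field, never the dictionary structure
    class PriceBox { public double Value; }

    private readonly Dictionary<string, PriceBox> _safe;    //frozen key set
    private readonly Dictionary<string, PriceBox> _unsafe =
        new Dictionary<string, PriceBox>();                  //rare additions
    private readonly object _lock = new object();

    public StockPrices(IDictionary<string, double> initialPrices) {
        _safe = initialPrices.ToDictionary(
            kv => kv.Key, kv => new PriceBox { Value = kv.Value });
    }

    public double Get(string key) {
        PriceBox box;
        if (_safe.TryGetValue(key, out box)) return box.Value;  //no lock
        lock (_lock) return _unsafe[key].Value;
    }

    public void Put(string key, double value) {
        PriceBox box;
        if (_safe.TryGetValue(key, out box)) { box.Value = value; return; }
        lock (_lock) _unsafe[key] = new PriceBox { Value = value };
    }
}

This only works because no thread ever adds or removes a key in the safe dictionary after construction -- structural changes are what make Dictionary unsafe under concurrency. (Torn or stale reads of the double field are glossed over here for brevity.)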

Page 14: Task and Data Parallelism: Real-World Examples


LOCK-FREE PATTERNS (1)

Try to avoid Windows synchronization and use hardware synchronization

Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange

Retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms

int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;
        r = t * y;
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}

[Diagram: Interlocked.CompareExchange takes a new value and a comparand, and returns the old value]

Page 15: Task and Data Parallelism: Real-World Examples


LOCK-FREE PATTERNS (2)

User-mode spinlocks (SpinLock class) can replace locks you acquire very often, which protect tiny computations

class __DontUseMe__SpinLock {
    private int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
    }
    public void Exit() {
        _lck = 0;
        Thread.MemoryBarrier();
    }
}
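For comparison, the framework's own System.Threading.SpinLock is used roughly like this (illustrative snippet, not from the deck):

var spinLock = new SpinLock();
//...
bool lockTaken = false;
try {
    spinLock.Enter(ref lockTaken);
    //tiny critical section goes here
} finally {
    if (lockTaken) spinLock.Exit();
}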

Page 16: Task and Data Parallelism: Real-World Examples


MISCELLANEOUS TIPS (1)

Don’t mix several concurrency frameworks in the same process

Some parallel work is best organized in pipelines – TPL DataFlow

BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
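A minimal sketch of wiring up that pipeline with the System.Threading.Tasks.Dataflow NuGet package (the block delegates below are illustrative placeholders, not from the deck):

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks.Dataflow;

var httpClient = new HttpClient();

var broadcast = new BroadcastBlock<Uri>(uri => uri);
var download  = new TransformBlock<Uri, byte[]>(
    uri => httpClient.GetByteArrayAsync(uri));        //Uri -> byte[]
var decode    = new TransformBlock<byte[], string>(
    bytes => Encoding.UTF8.GetString(bytes));         //byte[] -> string
var consume   = new ActionBlock<string>(
    text => Console.WriteLine(text.Length));          //sink

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
broadcast.LinkTo(download, linkOptions);
download.LinkTo(decode, linkOptions);
decode.LinkTo(consume, linkOptions);

broadcast.Post(new Uri("http://example.org"));
broadcast.Complete();
consume.Completion.Wait();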

Page 17: Task and Data Parallelism: Real-World Examples


MISCELLANEOUS TIPS (2)

Some parallel work can be offloaded to the GPU – C++ AMP

void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float,1> avX(n, x), avY(n, y);
    array_view<float,1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}

Page 18: Task and Data Parallelism: Real-World Examples


MISCELLANEOUS TIPS (3)

Invest in SIMD parallelization of heavy math or data-parallel algorithms

Make sure to take cache effects into account, especially on MP systems

START: movups xmm0, [esi+4*ecx]
       addps  xmm0, [edi+4*ecx]
       movups [ebx+4*ecx], xmm0
       sub    ecx, 4
       jns    START
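Not from the deck (which shows hand-written SSE assembly): in managed code, System.Numerics.Vector<T> exposes the same kind of element-wise SIMD operations, for example:

using System.Numerics;

//Illustrative vector add: process Vector<float>.Count lanes per iteration,
//then finish the remainder with a scalar tail loop.
static void VectorAdd(float[] x, float[] y, float[] z) {
    int i = 0, lanes = Vector<float>.Count;   //e.g. 8 floats with AVX
    for (; i <= x.Length - lanes; i += lanes) {
        var vx = new Vector<float>(x, i);
        var vy = new Vector<float>(y, i);
        (vx + vy).CopyTo(z, i);
    }
    for (; i < x.Length; ++i)
        z[i] = x[i] + y[i];                   //scalar tail
}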

Page 19: Task and Data Parallelism: Real-World Examples


SUMMARY

Avoid shared state and synchronization

Parallelize judiciously and apply thresholds

Measure and understand performance gains or losses

Concurrency and parallelism are still hard

A body of best practices, tips, patterns, examples is being built

Page 20: Task and Data Parallelism: Real-World Examples


ADDITIONAL REFERENCES

Page 21: Task and Data Parallelism: Real-World Examples


THANK YOU!

Sasha Goldshtein
@goldshtn

[email protected]