toub parallelism tour_oct2009

Page 1: Toub parallelism tour_oct2009

http://go.microsoft.com/?linkid=9692084

Page 2: Toub parallelism tour_oct2009

Parallel Programming with Visual Studio 2010 and the .NET Framework 4

Stephen Toub, Microsoft Corporation

October 2009

Page 3: Toub parallelism tour_oct2009

Agenda

− Why Parallelism, Why Now?
− Difficulties w/ Visual Studio 2008 & .NET 3.5
− Solutions w/ Visual Studio 2010 & .NET 4
    − Parallel LINQ
    − Task Parallel Library
    − New Coordination & Synchronization Primitives
    − New Parallel Debugger Windows
    − New Profiler Concurrency Visualizations

Page 4: Toub parallelism tour_oct2009

Moore’s Law

“The number of transistors incorporated in a chip will approximately double every 24 months.”

Gordon Moore, Intel Co-Founder

http://www.intel.com/pressroom/kits/events/moores_law_40th/

Page 5: Toub parallelism tour_oct2009

Moore’s Law: Alive and Well?

More than 1 billion transistors in 2006!

The number of transistors doubles every two years…

http://upload.wikimedia.org/wikipedia/commons/2/25/Transistor_Count_and_Moore%27s_Law_-_2008_1024.png

Page 6: Toub parallelism tour_oct2009

Moore’s Law: Feel the Heat!

[Chart: power density (W/cm²) on a log scale from 1 to 10,000, plotted against year (’70 to ’10) for the 8080, 386, 486, and Pentium® processors, with reference levels marked for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface.]

Source: Intel Developer Forum, Spring 2004 - Pat Gelsinger

Page 7: Toub parallelism tour_oct2009

Moore’s Law: But Different

Frequencies will NOT get much faster! Maybe 5 to 10% every year or so, a few more times… And these modest gains would make the chips A LOT hotter!

http://www.tomshw.it/cpu.php?guide=20051121

Page 8: Toub parallelism tour_oct2009

The Manycore Shift

− “[A]fter decades of single core processors, the high volume processor industry has gone from single to dual to quad-core in just the last two years. Moore’s Law scaling should easily let us hit the 80-core mark in mainstream processors within the next ten years and quite possibly even less.”

-- Justin Rattner, CTO, Intel (February 2007)

− “If you haven’t done so already, now is the time to take a hard look at the design of your application, determine what operations are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from concurrency.”

-- Herb Sutter, C++ Architect at Microsoft (March 2005)

Page 9: Toub parallelism tour_oct2009

I'm convinced… now what?

− Multithreaded programming is “hard” today
    − Doable by only a subgroup of senior specialists
    − Parallel patterns are not prevalent, well known, nor easy to implement
    − So many potential problems
− Businesses have little desire to “go deep”
    − Best devs should focus on business value, not concurrency
    − Need simple ways to allow all devs to write concurrent code

Page 10: Toub parallelism tour_oct2009

Example: “Race Car Drivers”

IEnumerable<RaceCarDriver> drivers = ...;
var results = new List<RaceCarDriver>();
foreach (var driver in drivers)
{
    if (driver.Name == queryName &&
        driver.Wins.Count >= queryWinCount)
    {
        results.Add(driver);
    }
}
results.Sort((b1, b2) => b1.Age.CompareTo(b2.Age));

Page 11: Toub parallelism tour_oct2009

Manual Parallel Solution

IEnumerable<RaceCarDriver> drivers = …;
var results = new List<RaceCarDriver>();
int partitionsCount = Environment.ProcessorCount;
int remainingCount = partitionsCount;
var enumerator = drivers.GetEnumerator();
try
{
    using (var done = new ManualResetEvent(false))
    {
        for (int i = 0; i < partitionsCount; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                while (true)
                {
                    RaceCarDriver driver;
                    lock (enumerator)
                    {
                        if (!enumerator.MoveNext()) break;
                        driver = enumerator.Current;
                    }
                    if (driver.Name == queryName &&
                        driver.Wins.Count >= queryWinCount)
                    {
                        lock (results) results.Add(driver);
                    }
                }
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Age.CompareTo(b2.Age));
    }
}
finally
{
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}

Page 12: Toub parallelism tour_oct2009

LINQ Solution

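var results = from driver in drivers
              where driver.Name == queryName &&
                    driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;

PLINQ Solution: append .AsParallel() to the source (from driver in drivers.AsParallel() …); the rest of the query is unchanged.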
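To make the one-word change concrete, here is a minimal runnable sketch of the PLINQ version. The RaceCarDriver class, the DriverQuery helper, and the sample data are hypothetical stand-ins; the slides never show the real type, so Wins is assumed to be a simple list whose Count is all that matters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the RaceCarDriver type used on the slides.
public class RaceCarDriver
{
    public string Name { get; set; }
    public int Age { get; set; }
    public List<int> Wins { get; set; }   // one entry per win; only Count is used
}

public static class DriverQuery
{
    // Same filter and sort as the slide; AsParallel() opts the query in to PLINQ.
    public static List<RaceCarDriver> Find(
        IEnumerable<RaceCarDriver> drivers, string queryName, int queryWinCount)
    {
        var results = from driver in drivers.AsParallel()
                      where driver.Name == queryName &&
                            driver.Wins.Count >= queryWinCount
                      orderby driver.Age ascending
                      select driver;
        return results.ToList();
    }

    public static void Main()
    {
        var drivers = new List<RaceCarDriver>
        {
            new RaceCarDriver { Name = "Pat", Age = 41, Wins = new List<int> { 1, 2, 3 } },
            new RaceCarDriver { Name = "Pat", Age = 29, Wins = new List<int> { 1, 2 } },
            new RaceCarDriver { Name = "Pat", Age = 35, Wins = new List<int> { 1 } },
            new RaceCarDriver { Name = "Sam", Age = 50, Wins = new List<int> { 1, 2, 3 } },
        };
        // Prints "Pat (29)" then "Pat (41)": two wins or more, youngest first.
        foreach (var d in Find(drivers, "Pat", 2))
            Console.WriteLine("{0} ({1})", d.Name, d.Age);
    }
}
```

Compared with the manual ThreadPool version, the partitioning, locking, and completion signalling are all left to the library.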

Page 13: Toub parallelism tour_oct2009

Visual Studio 2010: Tools, Programming Models, Runtimes

[Diagram: the Visual Studio 2010 parallel computing stack, colour-keyed as managed or native.
 Tools (Visual Studio IDE): Parallel Debugger Tool Windows, Profiler Concurrency Analysis.
 Programming models, managed (.NET Framework 4): Task Parallel Library, Parallel LINQ, Data Structures.
 Programming models, native (Visual C++ 10): Parallel Pattern Library, Agents Library, Data Structures.
 Runtimes: ThreadPool (managed) and Concurrency Runtime (native), each with a Task Scheduler and a Resource Manager.
 Operating system (Windows): Threads, UMS Threads.]

Page 14: Toub parallelism tour_oct2009

Parallel Extensions

− What is it?
    − Pure .NET libraries
        − No compiler changes necessary
        − mscorlib.dll, System.dll, System.Core.dll
    − Lightweight, user-mode runtime
        − Key ThreadPool enhancements
    − Supports imperative and declarative, data and task parallelism
        − Declarative data parallelism (PLINQ)
        − Imperative data and task parallelism (Task Parallel Library)
        − New coordination/synchronization constructs
− Why do we need it?
    − Supports parallelism in any .NET language
    − Delivers reduced concept count and complexity, better time to solution
    − Begins to move parallelism capabilities from concurrency experts to domain experts
− How do we get it?
    − Built into the core of .NET 4
    − Debugging and profiling support in Visual Studio 2010

Page 15: Toub parallelism tour_oct2009

Architecture

[Diagram: a .NET program is compiled by the C#, VB, C++, F#, or another .NET compiler to IL, which runs on Parallel Extensions and is scheduled onto threads across processors Proc 1 … Proc p.
 PLINQ Execution Engine: declarative queries; query analysis; data partitioning (chunk, range, hash, striped, repartitioning, custom); operator types (map, filter, sort, search, reduce, group, join, …); merging (sync and async, order preserving, buffered, inverted); parallel algorithms.
 Task Parallel Library: loop replacements, imperative task parallelism, scheduling.
 Coordination Data Structures: thread-safe collections, synchronization types, coordination types.]

Page 16: Toub parallelism tour_oct2009

Language Integrated Query (LINQ)

[Diagram: Visual Basic, C#, and other languages sit on top of the .NET Standard Query Operators, which work over LINQ-enabled data sources:
 LINQ to Objects (objects);
 LINQ to XML (XML, e.g. <book> <title/> <author/> <price/> </book>);
 LINQ-enabled ADO.NET: LINQ to Datasets, LINQ to SQL, LINQ to Entities (relational data).]

Page 17: Toub parallelism tour_oct2009

Writing a LINQ-to-Objects Query

− Two ways to write queries
    − Comprehensions
        − Syntax extensions to C# and Visual Basic
    − APIs
        − Used as extension methods on IEnumerable<T>
        − System.Linq.Enumerable class
− Compiler converts the former into the latter
− API implementation does the actual work

var q = Enumerable.Select(
            Enumerable.OrderBy(
                Enumerable.Where(Y, x => p(x)),
                x => x.f1),
            x => x.f2);

var q = Y.Where(x => p(x)).OrderBy(x => x.f1).Select(x => x.f2);

var q = from x in Y where p(x) orderby x.f1 select x.f2;

Page 18: Toub parallelism tour_oct2009

LINQ Query Operators

Operators, with the number of overloads of each in parentheses:

Aggregate(3), All(1), Any(2), AsEnumerable(1), Average(20), Cast(1), Concat(1), Contains(2), Count(2), DefaultIfEmpty(2), Distinct(2), ElementAt(1), ElementAtOrDefault(1), Empty(1), Except(2), First(2), FirstOrDefault(2),
GroupBy(8), GroupJoin(2), Intersect(2), Join(2), Last(2), LastOrDefault(2), LongCount(2), Max(22), Min(22), OfType(1), OrderBy(2), OrderByDescending(2), Range(1), Repeat(1), Reverse(1), Select(2), SelectMany(4),
SequenceEqual(2), Single(2), SingleOrDefault(2), Skip(1), SkipWhile(2), Sum(20), Take(1), TakeWhile(2), ThenBy(2), ThenByDescending(2), ToArray(1), ToDictionary(4), ToList(1), ToLookup(4), Union(2), Where(2), Zip(1)

● In .NET 4, ~50 operators w/ ~175 overloads

var operators = from method in typeof(Enumerable).GetMethods(
                    BindingFlags.Public | BindingFlags.Static | BindingFlags.DeclaredOnly)
                group method by method.Name into methods
                orderby methods.Key
                select new { Name = methods.Key, Count = methods.Count() };

Page 19: Toub parallelism tour_oct2009

Query Operators, cont.

− Tree of operators
    − Producers
        − No input
        − Examples: Range, Repeat
    − Consumer/producers
        − Transform input stream(s) into output stream
        − Examples: Select, Where, Join, Skip, Take
    − Consumers
        − Reduce to a single value
        − Examples: Aggregate, Min, Max, First
− Many are unary while others are binary

• Data-intensive bulk transformations

[Diagram: an example operator tree built from Where, Select, Where, and Join nodes.]

Page 20: Toub parallelism tour_oct2009

Implementation of a Query Operator

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    ...
}

− What might an implementation look like?

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    if (source == null || predicate == null)
        throw new ArgumentNullException();
    foreach (var item in source)
    {
        if (predicate(item)) yield return item;
    }
}

− Does it have to be this way?
    − What if we could do this in… parallel?!

Page 21: Toub parallelism tour_oct2009

Parallel LINQ (PLINQ)

− Utilizes parallel hardware for LINQ queries
− Abstracts away most parallelism details
    − Partitions and merges data intelligently
− Supports all .NET Standard Query Operators
    − Plus a few knobs
− Works for any IEnumerable<T>
    − Optimizations for other types (T[], IList<T>)
    − Supports custom partitioning (Partitioner<T>)
− Built on top of the rest of Parallel Extensions

Page 22: Toub parallelism tour_oct2009

Programming Model

− Minimal impact to existing LINQ programming model
    − AsParallel extension method

public static ParallelQuery<T> AsParallel<T>(this IEnumerable<T> source);

    − ParallelEnumerable class
        − Implements the Standard Query Operators, but for ParallelQuery<T>

public static ParallelQuery<TSource> Where<TSource>(
    this ParallelQuery<TSource> source, Func<TSource, bool> predicate)

Page 23: Toub parallelism tour_oct2009

Writing a PLINQ Query

− Two ways to write queries
    − Comprehensions
        − Syntax extensions to C# and Visual Basic
    − APIs
        − Used as extension methods on ParallelQuery<T>
        − System.Linq.ParallelEnumerable class
− Compiler converts the former into the latter
− As with serial LINQ, API implementation does the actual work

var q = ParallelEnumerable.Select(
            ParallelEnumerable.OrderBy(
                ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
                x => x.f1),
            x => x.f2);

var q = Y.AsParallel().Where(x => p(x)).
          OrderBy(x => x.f1).Select(x => x.f2);

var q = from x in Y.AsParallel() where p(x) orderby x.f1 select x.f2;

Page 24: Toub parallelism tour_oct2009

PLINQ Knobs

− Additional Extension Methods
    − WithDegreeOfParallelism

var results = from driver in drivers.AsParallel().WithDegreeOfParallelism(4)
              where driver.Name == queryName &&
                    driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;

    − AsOrdered

var results = from driver in drivers.AsParallel().AsOrdered()
              where driver.Name == queryName &&
                    driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;

    − WithCancellation
    − WithMergeOptions
    − WithExecutionMode
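As a rough illustration of how several of these knobs chain together, the sketch below combines AsOrdered, WithDegreeOfParallelism, WithCancellation, and WithExecutionMode in one query. The PlinqKnobs class and SquareEvens method are hypothetical names, and the degree of parallelism of 2 is an arbitrary choice for the example.

```csharp
using System;
using System.Linq;
using System.Threading;

public static class PlinqKnobs
{
    // Squares the even inputs. AsOrdered keeps the results in source order,
    // and the token lets a caller cancel the query while it runs.
    public static int[] SquareEvens(int[] source, CancellationToken token)
    {
        return source.AsParallel()
                     .AsOrdered()
                     .WithDegreeOfParallelism(2)
                     .WithCancellation(token)
                     .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
                     .Where(x => x % 2 == 0)
                     .Select(x => x * x)
                     .ToArray();
    }

    public static void Main()
    {
        using (var cts = new CancellationTokenSource())
        {
            int[] squares = SquareEvens(Enumerable.Range(0, 10).ToArray(), cts.Token);
            Console.WriteLine(string.Join(", ", squares));   // 0, 4, 16, 36, 64
        }
    }
}
```

Each With… method may appear at most once per query. If the token is cancelled mid-query, the query throws OperationCanceledException.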

Page 25: Toub parallelism tour_oct2009

Partitioning

• Input to a single operator is partitioned into p disjoint subsets
• Operators are replicated across the partitions
• Example: from x in A where p(x) …
• Partitions execute in (almost) complete isolation

[Diagram: input A is split across Task 1 … Task n, each running its own copy of where p(x).]

Page 26: Toub parallelism tour_oct2009

Partitioning: Load Balancing

[Diagram: two panels, Static Scheduling (Range) and Dynamic Scheduling, each showing work items 1 to 8 distributed across CPU0, CPU1, … CPUN.]

Page 27: Toub parallelism tour_oct2009

Partitioning: Algorithms

− Several partitioning schemes built-in
    − Chunk
        − Works with any IEnumerable<T>
        − Single enumerator shared; chunks handed out on-demand
    − Range
        − Works only with IList<T>
        − Input divided into contiguous regions, one per partition
    − Stripe
        − Works only with IList<T>
        − Elements handed out round-robin to each partition
    − Hash
        − Works with any IEnumerable<T>
        − Elements assigned to partition based on hash code
− Custom partitioning available through Partitioner<T>
    − Partitioner.Create available for tighter control over built-in partitioning schemes
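A small sketch of Partitioner.Create in use, assuming the goal is range partitioning over an index space: each worker receives a contiguous [from, to) range and makes one synchronized update per range instead of one per element. RangePartitioning and SumRange are hypothetical names for this example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class RangePartitioning
{
    // Sums the integers 0..n-1 (n must be at least 1, since
    // Partitioner.Create rejects an empty range).
    public static long SumRange(int n)
    {
        long total = 0;
        // Partitioner.Create(from, to) yields Tuple<int, int> ranges:
        // Item1 is inclusive, Item2 is exclusive.
        Parallel.ForEach(Partitioner.Create(0, n), range =>
        {
            long local = 0;
            for (int i = range.Item1; i < range.Item2; i++)
                local += i;
            Interlocked.Add(ref total, local);   // one synchronized update per range
        });
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(SumRange(1000));   // 499500
    }
}
```

This pattern tends to help when the loop body is tiny, because the delegate is invoked once per range and not once per element.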

Page 28: Toub parallelism tour_oct2009

Operator Fusion

• Naïve approach: partition and merge for each operator
• Example: (from x in D.AsParallel() where p(x) select x*x*x).Sum();
• Partition and merge mean synchronization => scalability bottleneck

[Diagram, naïve: D is partitioned across Task 1 … Task n for where p(x), then again for select x³, then again for Sum(), with a merge between stages and a final result (#).]

• Instead, we can fuse operators together:
• Minimizes number of partitioning/merging steps necessary

[Diagram, fused: D is partitioned once; each of Task 1 … Task n runs where p(x), select x³, and Sum() back to back, followed by a single merge into the final result (#).]

Page 29: Toub parallelism tour_oct2009

Merging

− Pipelined: separate consumer thread
    − Default for GetEnumerator()
        − And hence foreach loops
    − AutoBuffered, NotBuffered
    − Access to data as it’s available
        − But more synchronization overhead
− Stop-and-go: consumer helps
    − Sorts, ToArray, ToList, etc.
    − FullyBuffered
    − Minimizes context switches
        − But higher latency and more memory
− Inverted: no merging needed
    − ForAll extension method
    − Most efficient by far
        − But not always applicable
        − Requires side-effects

[Diagram: thread timelines for the three merge styles, showing which of Threads 1 to 4 produce and consume results in each.]
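The difference between the pipelined and inverted styles can be sketched as follows. MergeDemo and its two methods are hypothetical names; ParallelMergeOptions.NotBuffered is the .NET 4 enum member for the unbuffered pipelined option.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

public static class MergeDemo
{
    // Inverted: ForAll runs the action on the worker threads themselves,
    // so nothing is merged back; results are collected as a side effect
    // in a thread-safe collection.
    public static int CountSquaresWithForAll(int n)
    {
        var bag = new ConcurrentBag<int>();
        Enumerable.Range(0, n).AsParallel()
                  .Select(x => x * x)
                  .ForAll(sq => bag.Add(sq));
        return bag.Count;
    }

    // Pipelined: foreach consumes on the calling thread while workers
    // produce; NotBuffered asks for each result as soon as it is ready.
    public static int CountSquaresWithForeach(int n)
    {
        var query = Enumerable.Range(0, n).AsParallel()
                              .WithMergeOptions(ParallelMergeOptions.NotBuffered)
                              .Select(x => x * x);
        int count = 0;
        foreach (int sq in query)
            count++;
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountSquaresWithForAll(1000));    // 1000
        Console.WriteLine(CountSquaresWithForeach(1000));   // 1000
    }
}
```

ForAll avoids the merge entirely, which is why it requires a body that is safe to run concurrently.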

Page 30: Toub parallelism tour_oct2009

Parallelism Blockers

− Ordering not guaranteed

int[] values = new int[] { 0, 1, 2 };
var q = from x in values.AsParallel() select x * 2;
int[] scaled = q.ToArray(); // == { 0, 2, 4 }?

− Exceptions (surfaced as System.AggregateException)

object[] data = new object[] { "foo", null, null };
var q = from x in data.AsParallel() select x.ToString();

− Thread affinity

controls.AsParallel().ForAll(c => c.Size = ...);

− Operations with < 1.0 speedup

IEnumerable<int> input = …;
var doubled = from x in input.AsParallel() select x*2;

− Side effects and mutability are serious issues
    − Most queries do not use side effects, but it’s possible…

Random rand = new Random();
var q = from i in Enumerable.Range(0, 10000).AsParallel() select rand.Next();
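Two of these blockers have common workarounds, sketched below under stated assumptions: exceptions thrown inside a query can be caught as a single AggregateException, and the shared Random can be replaced by one instance per thread via ThreadLocal<Random>. BlockerDemo, its method names, and the seeding scheme are hypothetical choices for illustration, not something the slides prescribe.

```csharp
using System;
using System.Linq;
using System.Threading;

public static class BlockerDemo
{
    // The null elements make x.ToString() throw on worker threads;
    // PLINQ reports those failures wrapped in an AggregateException.
    public static bool ThrowsAggregate()
    {
        object[] data = { "foo", null, null };
        try
        {
            data.AsParallel()
                .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
                .Select(x => x.ToString())
                .ToArray();
            return false;
        }
        catch (AggregateException ae)
        {
            return ae.Flatten().InnerExceptions.Any(e => e is NullReferenceException);
        }
    }

    // Random is not thread-safe, so give each thread its own instance.
    // Distinct seeds keep the threads from producing identical sequences.
    public static int[] RandomValues(int count)
    {
        int seed = Environment.TickCount;
        using (var rand = new ThreadLocal<Random>(
            () => new Random(Interlocked.Increment(ref seed))))
        {
            return Enumerable.Range(0, count).AsParallel()
                             .Select(i => rand.Value.Next())
                             .ToArray();
        }
    }

    public static void Main()
    {
        Console.WriteLine(ThrowsAggregate());            // True
        Console.WriteLine(RandomValues(10000).Length);   // 10000
    }
}
```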

Page 31: Toub parallelism tour_oct2009

Task Parallel Library: Loops

− Loops are a common source of work

for (int i = 0; i < n; i++) work(i);
…
foreach (T e in data) work(e);

− Can be parallelized when iterations are independent
    − Body doesn’t depend on mutable state, or synchronization is used

Parallel.For(0, n, i => work(i));
…
Parallel.ForEach(data, e => work(e));

− Synchronous
    − All iterations finish, regularly or exceptionally
− Lots of knobs
    − Breaking, task-local state, custom partitioning, cancellation, scheduling, degree of parallelism
− Visual Studio 2010 profiler support (as with PLINQ)
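The "task-local state" knob is easiest to see in a reduction. In this sketch (LoopDemo and SumOfSquares are hypothetical names) each task accumulates into its own local value, and shared state is touched only once per task when the local totals are merged.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LoopDemo
{
    // Computes 0² + 1² + … + (n-1)² in parallel.
    public static long SumOfSquares(int n)
    {
        long total = 0;
        Parallel.For(0, n,
            () => 0L,                                       // initial task-local value
            (i, loopState, local) => local + (long)i * i,   // body: no shared state
            local => Interlocked.Add(ref total, local));    // merge once per task
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfSquares(10));   // 285
    }
}
```

The call returns only after every iteration and every merge step has finished, which is the "synchronous" property listed above.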

Page 32: Toub parallelism tour_oct2009

Task Parallel Library: Statements

− Sequence of statements

StatementA();
StatementB();
StatementC();

− When independent, can be parallelized

Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC());

− Synchronous (same as loops)
− Under the covers
    − May use Parallel.For, may use Tasks
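A minimal runnable version of the same idea, assuming three independent statements that each write to their own array slot (InvokeDemo and RunAll are hypothetical names):

```csharp
using System;
using System.Threading.Tasks;

public static class InvokeDemo
{
    // Each action writes a different element, so no synchronization is needed.
    public static int[] RunAll()
    {
        var results = new int[3];
        Parallel.Invoke(
            () => results[0] = 1,
            () => results[1] = 2,
            () => results[2] = 3);
        // Parallel.Invoke returns only after all three actions have completed.
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", RunAll()));   // 1, 2, 3
    }
}
```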

Page 33: Toub parallelism tour_oct2009

Task Parallel Library: Tasks

− System.Threading.Tasks
    − Task
        − Represents an asynchronous operation
        − Supports waiting, cancellation, continuations, …
        − Parent/child relationships
        − 1st-class debugging support in Visual Studio 2010
    − Task<TResult> : Task
        − Tasks that return results
    − TaskCompletionSource<TResult>
        − Create Task<TResult>s to represent other operations
    − TaskScheduler
        − Represents a scheduler that executes tasks
        − Extensible
        − TaskScheduler.Default => ThreadPool
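The sketch below touches three of these types: a Task<TResult> started through Task.Factory, a continuation chained onto it, and a TaskCompletionSource<TResult> wrapping an operation that completes elsewhere. TaskDemo and its methods are hypothetical names, and the background StartNew call merely stands in for whatever callback-based operation would really signal completion.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskDemo
{
    // Starts a task that returns a value, then chains a continuation onto it.
    public static int ComputeWithContinuation()
    {
        Task<int> compute = Task.Factory.StartNew(() => 6 * 7);
        Task<int> plusOne = compute.ContinueWith(t => t.Result + 1);
        return plusOne.Result;   // blocks until the continuation finishes
    }

    // Exposes some other operation as a Task<string>.
    public static Task<string> FromCallback()
    {
        var tcs = new TaskCompletionSource<string>();
        // Stand-in for a callback firing at some later point.
        Task.Factory.StartNew(() => tcs.SetResult("done"));
        return tcs.Task;
    }

    public static void Main()
    {
        Console.WriteLine(ComputeWithContinuation());   // 43
        Console.WriteLine(FromCallback().Result);       // done
    }
}
```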

Page 34: Toub parallelism tour_oct2009

ThreadPool in .NET 3.5

[Diagram: the program thread queues Items 1 to 6 into a single global queue, from which the worker threads take work.]

Thread Management: Starvation Detection, Idle Thread Retirement

Page 35: Toub parallelism tour_oct2009

ThreadPool in .NET 4

[Diagram: the program thread queues work into a lock-free global queue; each worker thread (Worker Thread 1 … Worker Thread p) also has its own local work-stealing queue. Tasks 1 to 6 are spread across these queues.]

Thread Management: Starvation Detection, Idle Thread Retirement, Hill-climbing

Page 36: Toub parallelism tour_oct2009

New Primitives

− Thread-safe, scalable collections
    − IProducerConsumerCollection<T>
        − ConcurrentQueue<T>
        − ConcurrentStack<T>
        − ConcurrentBag<T>
    − ConcurrentDictionary<TKey,TValue>
− Phases and work exchange
    − Barrier
    − BlockingCollection<T>
    − CountdownEvent
− Partitioning
    − {Orderable}Partitioner<T>
        − Partitioner.Create
− Exception handling
    − AggregateException
− Initialization
    − Lazy<T>
        − LazyInitializer.EnsureInitialized<T>
    − ThreadLocal<T>
− Locks
    − ManualResetEventSlim
    − SemaphoreSlim
    − SpinLock
    − SpinWait
− Cancellation
    − CancellationToken{Source}

Public, and used throughout PLINQ and TPL
Address many of today’s core concurrency issues
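As one example of these primitives working together, here is a hedged producer/consumer sketch built on BlockingCollection<T>. PipelineDemo, ProduceAndConsume, and the bounded capacity of 4 are hypothetical choices for the illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class PipelineDemo
{
    // One producer task adds 1..itemCount; the calling thread consumes and sums.
    public static int ProduceAndConsume(int itemCount)
    {
        int sum = 0;
        using (var queue = new BlockingCollection<int>(boundedCapacity: 4))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                try
                {
                    for (int i = 1; i <= itemCount; i++)
                        queue.Add(i);          // blocks while the queue is full
                }
                finally
                {
                    queue.CompleteAdding();    // lets the consumer loop end
                }
            });

            // Blocks while the queue is empty; ends after CompleteAdding.
            foreach (int item in queue.GetConsumingEnumerable())
                sum += item;

            producer.Wait();
        }
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(ProduceAndConsume(10));   // 55
    }
}
```

The bounded capacity throttles the producer so it cannot run arbitrarily far ahead of the consumer.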

Page 37: Toub parallelism tour_oct2009

What Can I Do with These Cores?

− Offload
    − Free up your UI
− Go faster whenever you can
    − Parallelize the parallelizable
− Do more
    − Use more data to get better results
    − Add more features
− Speculate
    − Pre-fetch, Pre-process
    − Evaluate multiple solutions

Page 38: Toub parallelism tour_oct2009

Performance Tips

− Compute intensive and/or large data sets
    − Work done should be at least 1,000s of cycles
− Measure, and combine/optimize as necessary
    − Use the Visual Studio concurrency profiler
    − Look for common anti-patterns: load imbalance, lock convoys, etc.
− Parallelize fine-grained but not too fine-grained
    − e.g. Parallelize outer loop, unless N is insufficiently large to offer enough parallelism
        − Consider parallelizing only inner, or both, at that point
    − Consider unrolling
− Do not be gratuitous in task creation
    − Lightweight, but still requires object allocation, etc.
− Prefer isolation & immutability over synchronization
    − Synchronization => !Scalable
    − Try to avoid shared state
− Have realistic expectations

Page 39: Toub parallelism tour_oct2009

Amdahl’s Law

[Chart: total execution time (axis 0 to 120) against number of processors (1, 2, 4, 8, 16), with each bar split into a parallel portion and a sequential portion.]

Theoretical maximum speedup determined by amount of sequential code
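The usual statement of Amdahl's Law makes the chart's point precise. The symbols here are introduced for this note and do not appear on the slide: P is the fraction of the work that can run in parallel, N is the number of processors, and S(N) is the resulting speedup.

```latex
S(N) = \frac{1}{(1 - P) + \dfrac{P}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - P}
```

For example, with P = 0.9 and N = 16 the speedup is 1 / (0.1 + 0.05625) = 6.4, and no number of processors can push it past 1 / 0.1 = 10.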

Page 40: Toub parallelism tour_oct2009

To Infinity And Beyond…

− The “Manycore Shift” is happening
− Parallelism in your code is inevitable
− Visual Studio 2010 and .NET 4 will help
− Parallel Computing Dev Center
    − http://msdn.com/concurrency
− Download Beta 2 (“go-live” license)
    − http://go.microsoft.com/?linkid=9692084
− Team Blogs
    − Managed: http://blogs.msdn.com/pfxteam
    − Native: http://blogs.msdn.com/nativeconcurrency
    − Tools: http://blogs.msdn.com/visualizeconcurrency
− Forums
    − http://social.msdn.microsoft.com/Forums/en-US/category/parallelcomputing

We love feedback!

Page 41: Toub parallelism tour_oct2009

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.