TRANSCRIPT
http://go.microsoft.com/?linkid=9692084
Parallel Programming with Visual Studio 2010 and the .NET Framework 4
Stephen Toub, Microsoft Corporation
October 2009
Agenda
− Why Parallelism, Why Now?
− Difficulties w/ Visual Studio 2008 & .NET 3.5
− Solutions w/ Visual Studio 2010 & .NET 4
− Parallel LINQ
− Task Parallel Library
− New Coordination & Synchronization Primitives
− New Parallel Debugger Windows
− New Profiler Concurrency Visualizations
Moore’s Law
“The number of transistors incorporated in a chip will approximately double every 24 months.”
Gordon Moore, Intel Co-Founder
http://www.intel.com/pressroom/kits/events/moores_law_40th/
Moore’s Law: Alive and Well?
More than 1 billion transistors in 2006!
The number of transistors doubles every two years…
http://upload.wikimedia.org/wikipedia/commons/2/25/Transistor_Count_and_Moore%27s_Law_-_2008_1024.png
Moore’s Law: Feel the Heat!
[Chart: power density (W/cm²), log scale from 1 to 10,000, versus year (’70–’10), for the 8080, 386, 486, and Pentium® processors; the trend line passes reference points labeled Hot Plate, Nuclear Reactor, Rocket Nozzle, and Sun’s Surface. Source: Pat Gelsinger, Intel Developer Forum, Spring 2004]
Moore’s Law: But Different
− Frequencies will NOT get much faster!
− Maybe 5 to 10% every year or so, a few more times…
− And these modest gains would make the chips A LOT hotter!
http://www.tomshw.it/cpu.php?guide=20051121
The Manycore Shift
− “[A]fter decades of single core processors, the high volume processor industry has gone from single to dual to quad-core in just the last two years. Moore’s Law scaling should easily let us hit the 80-core mark in mainstream processors within the next ten years and quite possibly even less.”
-- Justin Rattner, CTO, Intel (February 2007)
− “If you haven’t done so already, now is the time to take a hard look at the design of your application, determine what operations are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from concurrency.”
-- Herb Sutter, C++ Architect at Microsoft (March 2005)
I'm convinced… now what?
− Multithreaded programming is “hard” today
  − Doable by only a subgroup of senior specialists
  − Parallel patterns are not prevalent, well known, nor easy to implement
  − So many potential problems
− Businesses have little desire to “go deep”
  − Best devs should focus on business value, not concurrency
  − Need simple ways to allow all devs to write concurrent code
Example: “Race Car Drivers”
IEnumerable<RaceCarDriver> drivers = ...;
var results = new List<RaceCarDriver>();
foreach (var driver in drivers)
{
    if (driver.Name == queryName &&
        driver.Wins.Count >= queryWinCount)
    {
        results.Add(driver);
    }
}
results.Sort((b1, b2) => b1.Age.CompareTo(b2.Age));
Manual Parallel Solution

IEnumerable<RaceCarDriver> drivers = ...;
var results = new List<RaceCarDriver>();
int partitionsCount = Environment.ProcessorCount;
int remainingCount = partitionsCount;
var enumerator = drivers.GetEnumerator();
try
{
    using (var done = new ManualResetEvent(false))
    {
        for (int i = 0; i < partitionsCount; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                while (true)
                {
                    RaceCarDriver driver;
                    lock (enumerator)
                    {
                        if (!enumerator.MoveNext()) break;
                        driver = enumerator.Current;
                    }
                    if (driver.Name == queryName &&
                        driver.Wins.Count >= queryWinCount)
                    {
                        lock (results) results.Add(driver);
                    }
                }
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Age.CompareTo(b2.Age));
    }
}
finally
{
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}
LINQ Solution
var results = from driver in drivers
              where driver.Name == queryName &&
                    driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;

Adding .AsParallel() to the data source (drivers) turns this into a parallel query.
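Put together as a self-contained sketch, the parallel query looks like this. RaceCarDriver here is a minimal stand-in for the type the slides assume, and queryName/queryWinCount are sample values:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RaceCarDriver   // hypothetical stand-in for the type the slides assume
{
    public string Name;
    public int Age;
    public List<DateTime> Wins = new List<DateTime>();
}

class Program
{
    static void Main()
    {
        IEnumerable<RaceCarDriver> drivers = new List<RaceCarDriver>(); // populated elsewhere
        string queryName = "Smith";  // sample filter values
        int queryWinCount = 10;

        // The only change from the serial LINQ query is .AsParallel() on the source.
        var results = (from driver in drivers.AsParallel()
                       where driver.Name == queryName &&
                             driver.Wins.Count >= queryWinCount
                       orderby driver.Age ascending
                       select driver).ToList();
    }
}
```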
Visual Studio 2010: Tools, Programming Models, Runtimes
[Diagram, spanning managed and native stacks:
− Tooling (Visual Studio IDE): Parallel Debugger Tool Windows; Profiler Concurrency Analysis
− Programming models — managed (.NET Framework 4): Parallel LINQ, Task Parallel Library, data structures; native (Visual C++ 10): Parallel Pattern Library, Agents Library, data structures
− Runtimes — managed: ThreadPool with task scheduler and resource manager; native Concurrency Runtime: Task Scheduler, Resource Manager
− Operating System (Windows): Threads, UMS Threads]
Parallel Extensions
− What is it?
  − Pure .NET libraries
    − No compiler changes necessary
    − mscorlib.dll, System.dll, System.Core.dll
  − Lightweight, user-mode runtime
    − Key ThreadPool enhancements
  − Supports imperative and declarative, data and task parallelism
    − Declarative data parallelism (PLINQ)
    − Imperative data and task parallelism (Task Parallel Library)
    − New coordination/synchronization constructs
− Why do we need it?
  − Supports parallelism in any .NET language
  − Delivers reduced concept count and complexity, better time to solution
  − Begins to move parallelism capabilities from concurrency experts to domain experts
− How do we get it?
  − Built into the core of .NET 4
  − Debugging and profiling support in Visual Studio 2010
Architecture

[Diagram: C#, VB, C++, F#, and other .NET compilers emit IL for a .NET program. Declarative queries are handed to the PLINQ Execution Engine, which performs query analysis; data partitioning (chunk, range, hash, striped, repartitioning, custom); parallel algorithms for the operator types (map, filter, sort, search, reduce, group, join, …); and merging (sync and async, order preserving, buffered, inverted). The Task Parallel Library supplies loop replacements, imperative task parallelism, and scheduling; Coordination Data Structures supply thread-safe collections, synchronization types, and coordination types. Work executes on threads across processors 1 through p.]
Language Integrated Query (LINQ)
[Diagram: Visual Basic and C# sit atop the .NET Standard Query Operators, which target the LINQ-enabled data sources: LINQ to Objects (objects), LINQ to XML (XML), and the LINQ-enabled ADO.NET providers — LINQ to DataSets, LINQ to SQL, and LINQ to Entities (relational) — plus others.]
Writing a LINQ-to-Objects Query
− Two ways to write queries
  − Comprehensions
    − Syntax extensions to C# and Visual Basic
  − APIs
    − Used as extension methods on IEnumerable<T>
    − System.Linq.Enumerable class
− Compiler converts the former into the latter
  − API implementation does the actual work
var q = Enumerable.Select(
            Enumerable.OrderBy(
                Enumerable.Where(Y, x => p(x)),
                x => x.f1),
            x => x.f2);
var q = Y.Where(x => p(x)).OrderBy(x => x.f1).Select(x => x.f2);
var q = from x in Y
        where p(x)
        orderby x.f1
        select x.f2;
LINQ Query Operators
Aggregate (3), All (1), Any (2), AsEnumerable (1), Average (20), Cast (1), Concat (1), Contains (2), Count (2), DefaultIfEmpty (2), Distinct (2), ElementAt (1), ElementAtOrDefault (1), Empty (1), Except (2), First (2), FirstOrDefault (2),
GroupBy (8), GroupJoin (2), Intersect (2), Join (2), Last (2), LastOrDefault (2), LongCount (2), Max (22), Min (22), OfType (1), OrderBy (2), OrderByDescending (2), Range (1), Repeat (1), Reverse (1), Select (2), SelectMany (4),
SequenceEqual (2), Single (2), SingleOrDefault (2), Skip (1), SkipWhile (2), Sum (20), Take (1), TakeWhile (2), ThenBy (2), ThenByDescending (2), ToArray (1), ToDictionary (4), ToList (1), ToLookup (4), Union (2), Where (2), Zip (1)
● In .NET 4, ~50 operators w/ ~175 overloads
var operators = from method in typeof(Enumerable).GetMethods(
                    BindingFlags.Public | BindingFlags.Static | BindingFlags.DeclaredOnly)
                group method by method.Name into methods
                orderby methods.Key
                select new { Name = methods.Key, Count = methods.Count() };
Query Operators, cont.
− Tree of operators
  − Producers
    − No input
    − Examples: Range, Repeat
  − Consumer/producers
    − Transform input stream(s) into output stream
    − Examples: Select, Where, Join, Skip, Take
  − Consumers
    − Reduce to a single value
    − Examples: Aggregate, Min, Max, First
− Many are unary while others are binary
• Data-intensive bulk transformations

[Diagram: example operator tree combining Where, Select, Where, and Join nodes]
Implementation of a Query Operator
− What might an implementation look like?
− Does it have to be this way?
− What if we could do this in… parallel?!
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    if (source == null || predicate == null) throw new ArgumentNullException();
    foreach (var item in source)
    {
        if (predicate(item)) yield return item;
    }
}
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    ...
}
Parallel LINQ (PLINQ)
− Utilizes parallel hardware for LINQ queries
− Abstracts away most parallelism details
  − Partitions and merges data intelligently
− Supports all .NET Standard Query Operators
  − Plus a few knobs
− Works for any IEnumerable<T>
  − Optimizations for other types (T[], IList<T>)
  − Supports custom partitioning (Partitioner<T>)
− Built on top of the rest of Parallel Extensions
Programming Model
− Minimal impact to existing LINQ programming model
  − AsParallel extension method
− ParallelEnumerable class
  − Implements the Standard Query Operators, but for ParallelQuery<T>

public static ParallelQuery<T> AsParallel<T>(
    this IEnumerable<T> source);

public static ParallelQuery<TSource> Where<TSource>(
    this ParallelQuery<TSource> source, Func<TSource, bool> predicate);
Writing a PLINQ Query
− Two ways to write queries
  − Comprehensions
    − Syntax extensions to C# and Visual Basic
  − APIs
    − Used as extension methods on ParallelQuery<T>
    − System.Linq.ParallelEnumerable class
− Compiler converts the former into the latter
  − As with serial LINQ, API implementation does the actual work

var q = ParallelEnumerable.Select(
            ParallelEnumerable.OrderBy(
                ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
                x => x.f1),
            x => x.f2);

var q = Y.AsParallel().Where(x => p(x))
         .OrderBy(x => x.f1).Select(x => x.f2);

var q = from x in Y.AsParallel()
        where p(x)
        orderby x.f1
        select x.f2;
PLINQ Knobs
− Additional extension methods:
  − WithDegreeOfParallelism
  − AsOrdered
  − WithCancellation
  − WithMergeOptions
  − WithExecutionMode

var results = from driver in drivers.AsParallel()
                                    .WithDegreeOfParallelism(4)
              where driver.Name == queryName &&
                    driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;

var results = from driver in drivers.AsParallel().AsOrdered()
              where driver.Name == queryName &&
                    driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;
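Several knobs can be chained on one query. A hedged, self-contained sketch (the range size, degree of parallelism, and predicate are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading;

class Program
{
    static void Main()
    {
        var cts = new CancellationTokenSource();
        try
        {
            var evens = Enumerable.Range(0, 1000000)
                .AsParallel()
                .WithCancellation(cts.Token)                        // observe cancellation requests
                .WithDegreeOfParallelism(4)                         // cap the number of workers
                .WithMergeOptions(ParallelMergeOptions.NotBuffered) // stream results as produced
                .Where(i => i % 2 == 0)
                .ToArray();
            Console.WriteLine(evens.Length); // 500000
        }
        catch (OperationCanceledException)
        {
            // thrown if cts.Cancel() is called while the query runs
        }
    }
}
```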
Partitioning
• Input to a single operator is partitioned into p disjoint subsets
• Operators are replicated across the partitions
• Example:

    from x in A where p(x) …

[Diagram: source A split across Tasks 1..n, each running its own copy of "where p(x)"]

• Partitions execute in (almost) complete isolation
Partitioning: Load Balancing

[Diagram: eight work items scheduled onto CPU0..CPUN two ways. Static scheduling (range): each CPU is assigned a fixed contiguous block of items up front. Dynamic scheduling: items are handed out on demand, so faster CPUs pick up more work and the load stays balanced.]
Partitioning: Algorithms
− Several partitioning schemes built in:
  − Chunk
    − Works with any IEnumerable<T>
    − Single enumerator shared; chunks handed out on demand
  − Range
    − Works only with IList<T>
    − Input divided into contiguous regions, one per partition
  − Stripe
    − Works only with IList<T>
    − Elements handed out round-robin to each partition
  − Hash
    − Works with any IEnumerable<T>
    − Elements assigned to partition based on hash code
− Custom partitioning available through Partitioner<T>
  − Partitioner.Create available for tighter control over built-in partitioning schemes
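For instance, range partitioning can be requested explicitly via Partitioner.Create. A minimal sketch (the array size and per-element work are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        double[] data = new double[100000];

        // Partitioner.Create(0, length) yields contiguous index ranges (Tuple<int, int>),
        // so each task iterates over a slice without contending on a shared enumerator.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = Math.Sqrt(i);
        });

        Console.WriteLine(data[81]); // 9
    }
}
```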
Operator Fusion
• Naïve approach: partition and merge for each operator
• Example: (from x in D.AsParallel() where p(x) select x*x*x).Sum();
• Partition and merge mean synchronization => scalability bottleneck

[Diagram: naïve plan — D is partitioned across Tasks 1..n for "where p(x)", merged, re-partitioned for "select x³", merged again, re-partitioned for "Sum()", and merged into the final result]

• Instead, we can fuse operators together:

[Diagram: fused plan — D is partitioned once; each of Tasks 1..n runs "where p(x)", "select x³", and "Sum()" back to back, with a single final merge of the partial sums]

• Minimizes number of partitioning/merging steps necessary
Merging
− Pipelined: separate consumer thread
  − Default for GetEnumerator(), and hence foreach loops
  − AutoBuffered, NoBuffering
  − Access to data as it’s available
  − But more synchronization overhead
− Stop-and-go: consumer helps
  − Sorts, ToArray, ToList, etc.
  − FullyBuffered
  − Minimizes context switches
  − But higher latency and more memory
− Inverted: no merging needed
  − ForAll extension method
  − Most efficient by far
  − But not always applicable
  − Requires side effects

[Diagram: the three merge strategies — producer threads handing results to a separate consumer thread (pipelined); producers completing before the consumer runs (stop-and-go); and results consumed directly on the producer threads (inverted)]
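The inverted option looks like this in practice. A small sketch (the word list and predicate are made up):

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var words = new[] { "alpha", "beta", "gamma", "delta" };

        // ForAll consumes each result on the thread that produced it, so no
        // merge step (and no ordering) is performed. The delegate's side
        // effect — here, writing to the console — must be thread-safe.
        words.AsParallel()
             .Where(w => w.Length > 4)
             .ForAll(w => Console.WriteLine(w));
    }
}
```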
Parallelism Blockers
− Ordering not guaranteed

    int[] values = new int[] { 0, 1, 2 };
    var q = from x in values.AsParallel() select x * 2;
    int[] scaled = q.ToArray(); // == { 0, 2, 4 }?

− Exceptions => System.AggregateException

    object[] data = new object[] { "foo", null, null };
    var q = from x in data.AsParallel() select x.ToString();

− Thread affinity

    controls.AsParallel().ForAll(c => c.Size = ...);

− Operations with < 1.0 speedup

    IEnumerable<int> input = …;
    var doubled = from x in input.AsParallel() select x * 2;

− Side effects and mutability are serious issues
  − Most queries do not use side effects, but it’s possible…

    Random rand = new Random();
    var q = from i in Enumerable.Range(0, 10000).AsParallel()
            select rand.Next();
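When a blocker like the null-element query above does fire, PLINQ surfaces the failures as a single AggregateException. A minimal sketch of handling it:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        object[] data = { "foo", null, null };
        try
        {
            // x.ToString() throws NullReferenceException for the null elements.
            var strings = (from x in data.AsParallel()
                           select x.ToString()).ToArray();
        }
        catch (AggregateException ae)
        {
            // Exceptions from all partitions are collected into InnerExceptions.
            foreach (var inner in ae.InnerExceptions)
                Console.WriteLine(inner.GetType().Name);
        }
    }
}
```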
Task Parallel Library: Loops
− Loops are a common source of work
  − Can be parallelized when iterations are independent
    − Body doesn’t depend on mutable state / synchronization used
− Synchronous
  − All iterations finish, regularly or exceptionally
− Lots of knobs
  − Breaking, task-local state, custom partitioning, cancellation, scheduling, degree of parallelism
− Visual Studio 2010 profiler support (as with PLINQ)

for (int i = 0; i < n; i++) work(i);
…
foreach (T e in data) work(e);

Parallel.For(0, n, i => work(i));
…
Parallel.ForEach(data, e => work(e));
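Two of those knobs, degree of parallelism and breaking, can be sketched as follows (the iteration count and break condition are placeholders):

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        ParallelLoopResult result = Parallel.For(0, 1000, options, (i, state) =>
        {
            // state.Break() requests that no iterations beyond i be started;
            // iterations below i still run to completion.
            if (i == 100) state.Break();
        });

        Console.WriteLine(result.IsCompleted); // False — the loop was broken
    }
}
```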
Task Parallel Library: Statements
− Sequence of statements
  − When independent, can be parallelized
− Synchronous (same as loops)
− Under the covers
  − May use Parallel.For, may use Tasks

StatementA();
StatementB();
StatementC();

Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC());
Task Parallel Library: Tasks
− System.Threading.Tasks
− Task
  − Represents an asynchronous operation
  − Supports waiting, cancellation, continuations, …
  − Parent/child relationships
  − 1st-class debugging support in Visual Studio 2010
− Task<TResult> : Task
  − Tasks that return results
− TaskCompletionSource<TResult>
  − Creates Task<TResult>s to represent other operations
− TaskScheduler
  − Represents a scheduler that executes tasks
  − Extensible
  − TaskScheduler.Default => ThreadPool
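A minimal sketch of these types in action, in the .NET 4 style (Task.Factory.StartNew; the computation itself is illustrative):

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Task<TResult>: an asynchronous operation that produces a value.
        Task<int> sum = Task.Factory.StartNew(() =>
        {
            int total = 0;
            for (int i = 1; i <= 100; i++) total += i;
            return total;
        });

        // Continuation: runs after 'sum' completes, receiving it as 't'.
        Task print = sum.ContinueWith(t => Console.WriteLine(t.Result)); // prints 5050

        print.Wait(); // waiting is supported on any Task
    }
}
```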
ThreadPool in .NET 3.5

[Diagram: the program thread and worker threads 1..N all enqueue to and dequeue from a single global queue of work items (Item 1 … Item 6). Thread management: starvation detection, idle thread retirement.]
ThreadPool in .NET 4

[Diagram: a lock-free global queue plus a local work-stealing queue per worker thread (Worker Thread 1 … Worker Thread p). The program thread enqueues tasks to the global queue; tasks spawned on a worker thread go to that worker’s local queue, and idle workers steal from other workers’ queues. Thread management: starvation detection, idle thread retirement, hill-climbing.]
New Primitives
− Thread-safe, scalable collections
  − IProducerConsumerCollection<T>
    − ConcurrentQueue<T>
    − ConcurrentStack<T>
    − ConcurrentBag<T>
  − ConcurrentDictionary<TKey,TValue>
− Phases and work exchange
  − Barrier
  − BlockingCollection<T>
  − CountdownEvent
− Partitioning
  − {Orderable}Partitioner<T>
  − Partitioner.Create
− Exception handling
  − AggregateException
− Initialization
  − Lazy<T>
  − LazyInitializer.EnsureInitialized<T>
  − ThreadLocal<T>
− Locks
  − ManualResetEventSlim
  − SemaphoreSlim
  − SpinLock
  − SpinWait
− Cancellation
  − CancellationToken{Source}

Public, and used throughout PLINQ and TPL.
Address many of today’s core concurrency issues.
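For example, BlockingCollection<T> reduces the classic producer/consumer pattern to a few lines. A sketch with illustrative item counts and capacity:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Bounded to 4 items: Add blocks when the collection is full.
        var queue = new BlockingCollection<int>(4);

        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 10; i++) queue.Add(i);
            queue.CompleteAdding(); // tells consumers no more items are coming
        });

        var consumer = Task.Factory.StartNew(() =>
        {
            // Blocks until items arrive; ends cleanly once CompleteAdding
            // has been called and the collection drains.
            foreach (int item in queue.GetConsumingEnumerable())
                Console.WriteLine(item);
        });

        Task.WaitAll(producer, consumer);
    }
}
```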
What Can I Do with These Cores?
− Offload
  − Free up your UI
− Go faster whenever you can
  − Parallelize the parallelizable
− Do more
  − Use more data to get better results
  − Add more features
− Speculate
  − Pre-fetch, pre-process
  − Evaluate multiple solutions
Performance Tips
− Compute-intensive and/or large data sets
  − Work done should be at least 1,000s of cycles
  − Measure, and combine/optimize as necessary
− Use the Visual Studio concurrency profiler
  − Look for common anti-patterns: load imbalance, lock convoys, etc.
− Parallelize fine-grained, but not too fine-grained
  − e.g. parallelize the outer loop, unless N is insufficiently large to offer enough parallelism
    − Consider parallelizing only the inner loop, or both, at that point
    − Consider unrolling
− Do not be gratuitous in task creation
  − Lightweight, but still requires object allocation, etc.
− Prefer isolation & immutability over synchronization
  − Synchronization => !Scalable
  − Try to avoid shared state
− Have realistic expectations
Amdahl’s Law
[Chart: total execution time versus number of processors (1, 2, 4, 8, 16), with each bar split into sequential and parallel portions; the parallel portion shrinks as processors are added while the sequential portion stays fixed]

Theoretical maximum speedup determined by amount of sequential code.
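The chart above follows Amdahl's formula: with a parallelizable fraction P of the work and n processors, the speedup is

```latex
S(n) = \frac{1}{(1 - P) + \frac{P}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - P}
```

so the sequential fraction (1 − P) caps the achievable speedup no matter how many cores are added; for example, if 10% of the work is sequential, speedup can never exceed 10×.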
To Infinity And Beyond…
− The “Manycore Shift” is happening
  − Parallelism in your code is inevitable
  − Visual Studio 2010 and .NET 4 will help
− Parallel Computing Dev Center
  − http://msdn.com/concurrency
− Download Beta 2 (“go-live” license)
  − http://go.microsoft.com/?linkid=9692084
− Team Blogs
  − Managed: http://blogs.msdn.com/pfxteam
  − Native: http://blogs.msdn.com/nativeconcurrency
  − Tools: http://blogs.msdn.com/visualizeconcurrency
− Forums
  − http://social.msdn.microsoft.com/Forums/en-US/category/parallelcomputing
We love feedback!
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.