TRANSCRIPT
High Productivity Computing: Taking HPC Mainstream
Lee Grant, Technical Solutions Professional, High Performance Computing — [email protected]
Challenge: High Productivity Computing
“Make high-end computing easier and more productive to use.
Emphasis should be placed on time to solution, the major metric of value to high-end computing users…
A common software environment for scientific computation encompassing desktop to high-end systems will enhance productivity gains by promoting ease of use and manageability of systems.”
— 2004 High-End Computing Revitalization Task Force, Office of Science and Technology Policy, Executive Office of the President
X64 Server
The Data Pipeline
Data GatheringDiscovery and Browsing
Science Exploration
Domain specific analyses Scientific Output
“Raw” data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
“Raw” data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
“Science variables” and data summaries for early science exploration and hypothesis testing. Similar to discovery and browsing, but with science variables computed via gap filling, unit conversions, or simple equations.
“Science variables” combined with models, other specialized code, or statistics for deep science understanding.
Scientific results via packages such as MATLAB or R; special rendering packages such as ArcGIS.
Paper preparation.
No Free Lunch for Traditional Software

[Chart: operations per second for serial code versus additional operations per second available if code can take advantage of concurrency, across configurations of 3 GHz 1 core, 6 GHz 1 core, 12 GHz 1 core, 24 GHz 1 core, and 3 GHz with 2, 4, and 8 cores. Without highly concurrent software it won’t get any faster!]
“Provide the platform, tools and broad ecosystem to reduce the complexity of HPC by making parallelism more accessible to address future computational needs.”
Microsoft’s Vision for HPC
Reduced Complexity
• Ease deployment for larger-scale clusters
• Simplify management for clusters of all scales
• Integrate with existing infrastructure

Mainstream HPC
• Address needs of traditional supercomputing
• Address emerging cross-industry computation trends
• Enable non-technical users to harness the power of HPC

Developer Ecosystem
• Increase number of parallel applications and codes
• Offer choice of parallel development tools, languages and libraries
• Drive larger universe of developers and ISVs
Application Benefits
The most productive distributed application development environment
System Benefits
Cost-effective, reliable and high performance server operating system
Cluster Benefits
Complete HPC cluster platform integrated with the enterprise infrastructure
Microsoft HPC++ Solution
Systems Management
• Rapid large-scale deployment and built-in diagnostics suite
• Integrated monitoring, management and reporting
• Familiar UI and rich scripting interface
• Integrated security via Active Directory

Job Scheduling
• Support for batch, interactive and service-oriented applications
• High-availability scheduling
• Interoperability via OGF’s HPC Basic Profile

MPI
• MS-MPI stack based on MPICH2 reference implementation
• Performance improvements for RDMA networking and multi-core shared memory
• MS-MPI integrated with Windows Event Tracing

Storage
• Access to SQL, Windows and Unix file servers
• Key parallel file server vendor support (GPFS, Lustre, Panasas)
• In-memory caching options
Windows HPC Server 2008
Group compute nodes based on hardware, software and custom attributes; Act on groupings.
Pivoting enables correlating nodes and jobs together
Track long running operations and access operation history
Receive alerts for failures
List or Heat Map view cluster at a glance
Integrated Job Scheduling
Service-oriented HPC apps
Expanded Job Policies
Support for Job Templates
Improve interoperability with mixed IT infrastructure
Node/Socket/Core Allocation

[Diagram: two compute nodes, each with four sockets (S0–S3) of four cores (P0–P3), showing placement of three jobs — J1: /numsockets:3 /exclusive:false; J2: /numnodes:1; J3: /numcores:4 /exclusive:false.]

Windows HPC Server can help your application make the best use of multi-core systems.
Job submission: 3 methods

• Command line
– Job submit /headnode:Clus1 /numprocessors:124 /nodegroup:Matlab
– Job submit /corespernode:8 /numnodes:24
– Job submit /failontaskfailure:true /requestednodes:N1,N2,N3,N4
– Job submit /numprocessors:256 mpiexec \\share\mpiapp.exe
– [Complete PowerShell system management commands are available as well]
using Microsoft.Hpc.Scheduler;

class Program
{
    static void Main()
    {
        IScheduler store = new Scheduler();
        store.Connect("localhost");
        ISchedulerJob job = store.CreateJob();
        job.AutoCalculateMax = true;
        job.AutoCalculateMin = true;
        ISchedulerTask task = job.CreateTask();
        task.CommandLine = "ping 127.0.0.1 -n *";
        task.IsParametric = true;
        task.StartValue = 1;
        task.EndValue = 10000;
        task.IncrementValue = 1;
        task.MinimumNumberOfCores = 1;
        task.MaximumNumberOfCores = 1;
        job.AddTask(task);
        store.SubmitJob(job, @"hpc\user", "p@ssw0rd");
    }
}
• Programmatic
– Support for C++ & .NET languages
• Web interface
– Open Grid Forum: “HPC Basic Profile”
Scheduling MPI jobs

• Job submit /numprocessors:7800 mpiexec hostname
• Start time: 1 second; completion time: 27 seconds
NetworkDirect: a new RDMA networking interface built for speed and stability
• Verbs-based design for close fit with native, high-performance networking interfaces
• Equal to hardware-optimized stacks for MPI micro-benchmarks: 2 µsec latency, 2 GB/sec bandwidth on ConnectX
• OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB protocols
[Diagram: Windows RDMA networking stack, user mode vs. kernel mode. A socket-based app goes through Windows Sockets (Winsock + WSD) either down the kernel TCP/IP/NDIS/miniport-driver path over Ethernet hardware, or through the WinSock Direct provider and user-mode access layer (kernel bypass) to RDMA networking hardware. An MPI app uses MS-MPI, which reaches RDMA hardware through the NetworkDirect provider, also bypassing the kernel. Components are labeled as OS, CCP, IHV, or (ISV) app.]
November 2008 Top500

Windows Compute Cluster 2003:
• Spring 2006, NCSA, #130: 896 cores, 4.1 TF
• Spring 2007, Microsoft, #106: 2048 cores, 9 TF, 58.8% efficiency

Windows HPC Server 2008:
• Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF, 77.1% efficiency
• Spring 2008, NCSA, #23: 9472 cores, 68.5 TF, 77.7% efficiency
• Spring 2008, Umea, #40: 5376 cores, 46 TF, 85.5% efficiency
• Spring 2008, Aachen, #100: 2096 cores, 18.8 TF, 76.5% efficiency

30% efficiency improvement
“Ferrari is always looking for the most advanced technological solutions and, of course, the same applies for software and engineering. To achieve industry leading power-to-weight ratios, reduction in gear change times, and revolutionary aerodynamics, we can rely on Windows HPC Server 2008. It provides a fast, familiar, high performance computing platform for our users, engineers and administrators.”
-- Antonio Calabrese, Responsabile Sistemi Informativi (Head of Information Systems), Ferrari
“It is important that our IT environment is easy to use and support. Windows HPC is improving our performance and manageability.”
-- Dr. J.S. Hurley, Senior Manager, Head Distributed Computing, Networked Systems Technology, The Boeing Company
Customers
“Our goal is to broaden HPC availability to a wider audience than just power users. We believe that Windows HPC will make HPC accessible to more people, including engineers, scientists, financial analysts, and others, which will help us design and test products faster and reduce costs.”
-- Kevin Wilson, HPC Architect, Procter & Gamble
“We are very excited about utilizing the Cray CX1 to support our research activities,” said Rico Magsipoc, Chief Technology Officer for the Laboratory of Neuro Imaging. “The work that we do in brain research is computationally intensive but will ultimately have a huge impact on our understanding of the relationship between brain structure and function, in both health and disease. Having the power of a Cray supercomputer that is simple and compact is very attractive and necessary, considering the physical constraints we face in our data centers today.”
• Windows Subsystem for UNIX Applications
– Complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts, compilers
– Visual Studio extensions for debugging POSIX applications
– Support for 32- and 64-bit applications
• Recent port of WRF weather model
– 350K lines, Fortran 90 and C using MPI, OpenMP
– Traditionally developed for Unix HPC systems
– Two dynamical cores, full range of physics options
• Porting experience
– Fewer than 750 lines of code changed in makefiles/scripts
– Level of effort similar to a port to any new version of UNIX
– Performance on par with the Linux systems
• India Interoperability Lab, MTC Bangalore
– Industry Solutions for Interop jointly with partners
– HPC Utility Computing Architecture
– Open-source applications on HPC Server 2008 (NAMD, DL_POLY, GROMACS)
Porting Unix Applications
High Productivity Modeling
Languages/Runtimes: C++, C#, VB, F#, Python, Ruby, JScript; Fortran (Intel, PGI); OpenMP, MPI
Team Development: team portal (version control, scheduled build, bug tracking); test and stress generation; code analysis and code coverage; performance analysis
IDE: rapid application development; parallel debugging; multiprocessor builds; workflow design
.NET Framework: LINQ (language-integrated query); Dynamic Language Runtime; Fx/JIT/GC improvements; native support for Web Services
MSFT || Computing Technologies
[Chart: Microsoft parallel-computing technologies arranged along two axes — task concurrency vs. data parallelism, and local computing vs. distributed/cloud computing. Example workloads: robotics-based manufacturing assembly line; Silverlight Olympics viewer; enterprise search, OLTP, collaboration; animation/CGI rendering; weather forecasting; seismic monitoring; oil exploration; automotive control systems; Internet-based photo services; ultrasound imaging equipment; media encode/decode; image processing/enhancement; data visualization. Technologies: IFx/CCR, Maestro, TPL/PPL, Cluster-TPL, Cluster-PLINQ, MPI/MPI.NET, WCF, Cluster SOA, WF, PLINQ, CDS, OpenMP.]
[Diagram: service-oriented HPC architecture. Head nodes supply SOA functionality; WCF brokers dispatch work to compute nodes, each of which performs UDF tasks as called from the WCF broker.]
SOA Broker Performance

[Charts: low latency — round-trip latency (ms) vs. message size (bytes, 1 to 16384) for WSD, IPoIB and GigE; high throughput — messages/sec (25 ms compute time) vs. number of clients (0 to 200) for 0k, 1k, 4k and 16k ping-pong messages.]
MPI.NET
• Supports all .NET languages (C#, C++, F#, ..., even Visual Basic!)
• Natural expression of MPI in C#
• Negligible overhead (relative to C) over TCP

if (world.Rank == 0)
    world.Send("Hello, World!", 1, 0);
else
{
    string msg = world.Receive<string>(0, 0);
}

string[] hostnames = comm.Gather(MPI.Environment.ProcessorName, 0);

double pi = 4.0 * comm.Reduce(dartsInCircle, (x, y) => x + y, 0) / totalDartsThrown;
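For intuition, the Reduce call above sums per-rank dart counts into a π estimate. A rough single-process Python sketch of the same Monte Carlo reduction (no MPI; the "ranks" become seeded chunks combined with the same `(x, y) => x + y` reduction; all names are illustrative):

```python
import random
from functools import reduce

def throw_darts(n, seed):
    """Count darts landing inside the unit quarter-circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

# Each "rank" contributes a partial count; Reduce(+) combines them.
darts_per_rank = 10_000
counts = [throw_darts(darts_per_rank, seed) for seed in range(4)]
total_darts = darts_per_rank * len(counts)
pi_estimate = 4.0 * reduce(lambda x, y: x + y, counts) / total_darts
```

With 40,000 darts the estimate lands near 3.14; the point is the shape of the computation, not the precision.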
Allinea DDT VS Debugger Add-in
NetPIPE Performance

[Chart: throughput (Mbps, 0.01–100 log scale) vs. message size (1 B to 10 MB) for C (native), C# (primitive) and C# (serialized).]
Parallel Extensions to .NET

• Declarative data parallelism (PLINQ)
• Imperative data and task parallelism (TPL)
• Data structures and coordination constructs

var q = from n in names.AsParallel()
        where n.Name == queryInfo.Name &&
              n.State == queryInfo.State &&
              n.Year >= yearStart && n.Year <= yearEnd
        orderby n.Year ascending
        select n;

Parallel.For(0, n, i =>
{
    result[i] = compute(i);
});
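As a point of comparison only (not part of Parallel Extensions), the two patterns above — a declarative filtered query and an imperative parallel loop — can be sketched with Python's standard library; `query`, `compute`, and the sample data are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

names = [
    {"name": "Ada", "state": "WA", "year": 1990},
    {"name": "Ada", "state": "WA", "year": 2005},
    {"name": "Bob", "state": "OR", "year": 1998},
]

# Declarative (PLINQ-style): filter and order a data set. A real PLINQ
# query partitions the source across cores; the filter here is cheap, so
# a plain comprehension stands in for the query's shape.
def query(query_name, state, year_start, year_end):
    return sorted(
        (n for n in names
         if n["name"] == query_name and n["state"] == state
         and year_start <= n["year"] <= year_end),
        key=lambda n: n["year"])

# Imperative (Parallel.For-style): compute result[i] for each i on a
# pool of worker threads.
def compute(i):
    return i * i

with ThreadPoolExecutor() as pool:
    result = list(pool.map(compute, range(8)))
```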
static void ProcessNode<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    ProcessNode(tree.Left, action);
    ProcessNode(tree.Right, action);
    action(tree.Data);
}

Sequential

Example: Tree Walk
static void ProcessNode<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;

    Stack<Tree<T>> nodes = new Stack<Tree<T>>();
    Queue<T> data = new Queue<T>();

    nodes.Push(tree);
    while (nodes.Count > 0)
    {
        Tree<T> node = nodes.Pop();
        data.Enqueue(node.Data);
        if (node.Left != null) nodes.Push(node.Left);
        if (node.Right != null) nodes.Push(node.Right);
    }

    using (ManualResetEvent mre = new ManualResetEvent(false))
    {
        int waitCount = Environment.ProcessorCount;
        WaitCallback wc = delegate
        {
            bool gotItem;
            do
            {
                T item = default(T);
                lock (data)
                {
                    if (data.Count > 0) { item = data.Dequeue(); gotItem = true; }
                    else gotItem = false;
                }
                if (gotItem) action(item);
            } while (gotItem);
            if (Interlocked.Decrement(ref waitCount) == 0) mre.Set();
        };
        for (int i = 0; i < Environment.ProcessorCount - 1; i++)
        {
            ThreadPool.QueueUserWorkItem(wc);
        }
        wc(null);
        mre.WaitOne();
    }
}

Thread Pool
static void ProcessNode<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    Task t = Task.Create(delegate { ProcessNode(tree.Left, action); });
    ProcessNode(tree.Right, action);
    action(tree.Data);
    t.Wait();
}

Parallel Extensions (with Task)
static void ProcessNode<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    Parallel.Do(
        () => ProcessNode(tree.Left, action),
        () => ProcessNode(tree.Right, action),
        () => action(tree.Data));
}

Parallel Extensions (with Parallel)
static void ProcessNode<T>(Tree<T> tree, Action<T> action)
{
    // Assumes Tree<T> exposes its nodes as an enumerable sequence.
    tree.AsParallel().ForAll(action);
}

Parallel Extensions (with PLINQ)
Example: Tree Walk
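For readers outside .NET, the Task-based recursive walk above can be sketched with Python's `concurrent.futures`: the left subtree is submitted to a pool as a task while the current thread walks the right subtree, then the task is awaited, mirroring `Task.Create`/`t.Wait()`. The `Tree` class and `record` action are illustrative stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class Tree:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def process_node(tree, action, pool):
    """Run `action` on every node's data. The left subtree becomes a
    pool task while this thread walks the right subtree."""
    if tree is None:
        return
    t = pool.submit(process_node, tree.left, action, pool)
    process_node(tree.right, action, pool)
    action(tree.data)
    t.result()  # wait for the left-subtree task (mirrors t.Wait())

# Usage: collect every value from a small tree.
tree = Tree(1, Tree(2, Tree(4), Tree(5)), Tree(3, Tree(6), Tree(7)))
seen, lock = [], threading.Lock()

def record(x):
    with lock:
        seen.append(x)

# Workers blocked in t.result() hold pool threads, so keep the pool
# comfortably larger than the tree depth to avoid starving the tasks.
with ThreadPoolExecutor(max_workers=16) as pool:
    process_node(tree, record, pool)
```

Note the order of visited nodes is nondeterministic under concurrency; only the set of visited nodes is guaranteed.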
F# is...

...a functional, object-oriented, imperative and explorative programming language for .NET: strongly typed, succinct, scalable, interoperable and efficient, with rich libraries.
Interactive F# Shell

C:\fsharpv2>bin\fsi
MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved
F# Version 1.9.2.9, compiling for .NET Framework Version v2.0.50727

NOTE: See 'fsi --help' for flags
NOTE: Commands:
NOTE:   #r <string>;;     reference (dynamically load) the given DLL.
NOTE:   #I <string>;;     add the given search path for referenced DLLs.
NOTE:   #use <string>;;   accept input from the given file.
NOTE:   #load <string> ...<string>;;  load the given file(s) as a compilation unit.
NOTE:   #time;;           toggle timing on/off.
NOTE:   #types;;          toggle display of types on/off.
NOTE:   #quit;;           exit.
NOTE: Visit the F# website at http://research.microsoft.com/fsharp.
NOTE: Bug reports to [email protected]. Enjoy!

> let rec f x = (if x < 2 then x else f (x-1) + f (x-2));;
val f : int -> int

> f 6;;
val it : int = 8
Example: Taming Asynchronous I/O

using System;
using System.IO;
using System.Threading;

public class BulkImageProcAsync
{
    public const String ImageBaseName = "tmpImage-";
    public const int numImages = 200;
    public const int numPixels = 512 * 512;

    // ProcessImage has a simple O(N) loop, and you can vary the number
    // of times you repeat that loop to make the application more CPU-
    // bound or more IO-bound.
    public static int processImageRepeats = 20;

    // Threads must decrement NumImagesToFinish, and protect
    // their access to it through a mutex.
    public static int NumImagesToFinish = numImages;
    public static Object[] NumImagesMutex = new Object[0];
    // WaitObject is signalled when all image processing is done.
    public static Object[] WaitObject = new Object[0];

    public class ImageStateObject
    {
        public byte[] pixels;
        public int imageNum;
        public FileStream fs;
    }

    public static void ReadInImageCallback(IAsyncResult asyncResult)
    {
        ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
        Stream stream = state.fs;
        int bytesRead = stream.EndRead(asyncResult);
        if (bytesRead != numPixels)
            throw new Exception(String.Format(
                "In ReadInImageCallback, got the wrong number of " +
                "bytes from the image: {0}.", bytesRead));
        ProcessImage(state.pixels, state.imageNum);
        stream.Close();

        // Now write out the image.
        // Using asynchronous I/O here appears not to be best practice.
        // It ends up swamping the threadpool, because the threadpool
        // threads are blocked on I/O requests that were just queued to
        // the threadpool.
        FileStream fs = new FileStream(ImageBaseName + state.imageNum +
            ".done", FileMode.Create, FileAccess.Write, FileShare.None,
            4096, false);
        fs.Write(state.pixels, 0, numPixels);
        fs.Close();

        // This application model uses too much memory.
        // Releasing memory as soon as possible is a good idea,
        // especially global state.
        state.pixels = null;
        fs = null;

        // Record that an image is finished now.
        lock (NumImagesMutex)
        {
            NumImagesToFinish--;
            if (NumImagesToFinish == 0)
            {
                Monitor.Enter(WaitObject);
                Monitor.Pulse(WaitObject);
                Monitor.Exit(WaitObject);
            }
        }
    }

    public static void ProcessImagesInBulk()
    {
        Console.WriteLine("Processing images... ");
        long t0 = Environment.TickCount;
        NumImagesToFinish = numImages;
        AsyncCallback readImageCallback = new AsyncCallback(ReadInImageCallback);
        for (int i = 0; i < numImages; i++)
        {
            ImageStateObject state = new ImageStateObject();
            state.pixels = new byte[numPixels];
            state.imageNum = i;
            // Very large items are read only once, so you can make the
            // buffer on the FileStream very small to save memory.
            FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
            state.fs = fs;
            fs.BeginRead(state.pixels, 0, numPixels, readImageCallback, state);
        }

        // Determine whether all images are done being processed.
        // If not, block until all are finished.
        bool mustBlock = false;
        lock (NumImagesMutex)
        {
            if (NumImagesToFinish > 0)
                mustBlock = true;
        }
        if (mustBlock)
        {
            Console.WriteLine("All worker threads are queued. " +
                " Blocking until they complete. numLeft: {0}",
                NumImagesToFinish);
            Monitor.Enter(WaitObject);
            Monitor.Wait(WaitObject);
            Monitor.Exit(WaitObject);
        }
        long t1 = Environment.TickCount;
        Console.WriteLine("Total time processing images: {0}ms", (t1 - t0));
    }
}
Processing 200 images in parallel
let ProcessImageAsync(i) =
    async { // Open the file synchronously
            let inStream = File.OpenRead(sprintf "source%d.jpg" i)
            // Read from the file, asynchronously
            let! pixels = inStream.ReadAsync(numPixels)
            let pixels' = TransformImage(pixels,i)
            let outStream = File.OpenWrite(sprintf "result%d.jpg" i)
            // Write the result asynchronously
            do! outStream.WriteAsync(pixels')
            do Console.WriteLine "done!" }

let ProcessImagesAsync() =
    // Generate the tasks and queue them in parallel
    Async.Run (Async.Parallel
        [ for i in 1 .. numImages -> ProcessImageAsync(i) ])

Equivalent F# code (same performance).
Example: Taming Asynchronous I/O
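The same fork-join-over-asynchronous-I/O shape can be sketched in a third language with Python's `asyncio`; the in-memory `source`/`results` dictionaries and `transform_image` are illustrative stand-ins for the real file streams and image transform:

```python
import asyncio

num_images = 8
# Stand-ins for the source files and the output files.
source = {i: bytes([i]) * 4 for i in range(1, num_images + 1)}
results = {}

def transform_image(pixels, i):
    # Stand-in for the real per-image computation.
    return bytes(b ^ 0xFF for b in pixels)

async def process_image_async(i):
    # Corresponds to: let! pixels = inStream.ReadAsync(numPixels)
    pixels = source[i]
    await asyncio.sleep(0)          # yield, as a real async read would
    transformed = transform_image(pixels, i)
    # Corresponds to: do! outStream.WriteAsync(pixels')
    results[i] = transformed
    await asyncio.sleep(0)

async def process_images_async():
    # Async.Parallel: queue all image tasks and run them concurrently.
    await asyncio.gather(*(process_image_async(i)
                           for i in range(1, num_images + 1)))

asyncio.run(process_images_async())
```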
The Coming of Accelerators
Current Offerings

[Table, by vendor:
• Microsoft — Accelerator; Compute Shader; libraries: D3DX, DaVinci, FFT, Scan; target: any processor
• AMD — Brook+; CAL; library: ACML-GPU; target: AMD CPU or GPU
• nVidia — CUDA; libraries: cuFFT, cuBLAS, cuPP; target: nVidia GPU
• Intel — Ct; LRB Native; library: MKL++; targets: Intel CPU, Larrabee
• Apple — OpenCL; Grand Central; CoreImage/CoreAnim; target: any processor
RapidMind targets any processor.]
DirectX11 Compute Shader
• A new processing model for GPUs
– Integrated with Direct3D
– Supports more general constructs
– Enables more general data structures
– Enables more general algorithms
• Image/post-processing:
– Image reduction, histogram, convolution, FFT
– Video transcode, super-resolution, etc.
• Effect physics:
– Particles, smoke, water, cloth, etc.
• Ray-tracing, radiosity, etc.
• Gameplay physics, AI
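As a shape reference only, two of the kernels named above — reduction and histogram — are exactly what a compute shader spreads across thread groups; a scalar Python sketch of their logic (function names and bin layout are illustrative):

```python
def histogram(pixels, bins=4, max_val=256):
    """Each pixel increments one bin; a compute shader runs this loop
    across thousands of threads using atomic adds into shared memory."""
    counts = [0] * bins
    width = max_val // bins
    for p in pixels:
        counts[min(p // width, bins - 1)] += 1
    return counts

def reduce_sum(pixels):
    """Tree reduction: pairwise sums, log2(n) passes on a GPU."""
    vals = list(pixels)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)          # pad odd-length levels
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]
```

The GPU win comes from running each loop body in parallel; the serial form above just pins down what each thread computes.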
FFT Performance Example

• Complex 1024×1024 2-D FFT:
– Software: 42 ms, 6 GFlops
– Direct3D9: 15 ms, 17 GFlops (3×)
– CUFFT: 8 ms, 32 GFlops (5×)
– Prototype DX11: 6 ms, 42 GFlops (6×)
– Latest chips: 3 ms, 100 GFlops
• Shared register space and random-access writes enable ~2× speedups
IMSL .NET Numerical Library

• Linear Algebra
• Eigensystems
• Interpolation and Approximation
• Quadrature
• Differential Equations
• Transforms
• Nonlinear Equations
• Optimization
• Basic Statistics
• Nonparametric Tests
• Goodness of Fit
• Regression
• Variances, Covariances and Correlations
• Multivariate Analysis
• Analysis of Variance
• Time Series and Forecasting
• Distribution Functions
• Random Number Generation
Data acquisition from source systems and integration
Data transformation and synthesis
Data enrichment, with business logic, hierarchical views
Data discovery via data mining
Data presentation and distribution
Data access for the masses
Integrate Analyze Report
Research
Data Browsing with Excel
[Charts: annual mean, monthly mean, and weekly mean. Courtesy Catherine van Ingen, MSR]
Datamining with Excel
Integrated algorithms:
• Text Mining
• Neural Nets
• Naïve Bayes
• Time Series
• Sequence Clustering
• Decision Trees
• Association Rules
Workflow Design for Sharepoint
Microsoft HPC++ Labs: Academic Computational Finance Service
Taking HPC Mainstream
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of
the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.