
High-Performance Computing on the Windows Server Platform

Marvin Theimer
Software Architect, Windows Server HPC Group
hpcinfo @ microsoft.com
Microsoft Corporation

Session Outline

Brief introduction to HPC

Definitions

Market trends

Overview of the V1 release of Windows Server 2003 Compute Cluster Edition (CCE)

Features

System architecture

Key challenges for future HPC systems

Too many factors affect performance

Grid computing economics

Data management

Brief Introduction to HPC

Defining High Performance Computing (HPC)

MPPs, Vector, NUMA systems
Goal: Absolute performance
Targets: Large-scale SMPs; special purpose
Coupling: Tight
Communication: Shared memory
Network: Bus, backplane

Dedicated HPC clusters
Goal: Price/performance, SLA
Targets: Clusters
Coupling: Tight, loose
Communication: Message passing
Network: Myrinet, Infiniband, GigE

Resource scavenging
Goal: Harness unused cycles
Targets: Underutilized desktops & servers
Coupling: Loose
Communication: Messages, files
Network: TCP/IP, HTTP

HPC Definition: Using compute resources to solve computationally intensive problems

[Diagram: HPC use in technical and scientific computing. Sensors and computational modeling feed persisted data (DB, FS, ...), which is then mined and interpreted.]

HPC Role in Science: Different Platforms for Achieving Results

Cluster HPC Scenario

[Diagram: cluster HPC scenario. Users and admins submit jobs and policy (and receive reports) through a web service, web page, or command line to a head node providing user, resource, cluster, and job management. The head node's job and resource managers drive node managers on cluster nodes, where user applications run over MPI. Nodes are connected by a high-speed, low-latency interconnect (1GE, Infiniband, Myricom); data input and output go to a DB or file system, supporting sensors, workflow, computation, data mining, visualization, and remote query.]

Top 500 Supercomputer Trends

Industry usage is rising

Clusters over 50%

GigE is gaining

IA is winning

Commoditized HPC Systems are Affecting Every Vertical

Leverage Volume Markets of Industry Standard Hardware and Software.

Rapid Procurement, Installation and Integration of systems

Cluster Ready Applications Accelerating Market Growth

Engineering

Bioinformatics

Oil & Gas

Finance

Entertainment

Government/Research

The convergence of affordable high performance hardware and commercial apps is making supercomputing a mainstream market

Supercomputing: Yesterday vs. Today

1991: Cray Y-MP C916
Architecture: 16 x Vector, 4 GB, Bus
OS: UNICOS
GFlops: ~10
Top500 #: 1
Price: $40,000,000
Customers: Government labs
Applications: Classified, climate, physics research

1998: Sun HPC10000
Architecture: 24 x 333 MHz UltraSPARC II, 24 GB, SBus
OS: Solaris 2.5.1
GFlops: ~10
Top500 #: 500
Price: $1,000,000 (40x drop)
Customers: Large enterprises
Applications: Manufacturing, energy, finance, telecom

2005: Shuttle @ NewEgg.com
Architecture: 4 x 2.2 GHz Athlon64, 4 GB, GigE
OS: Windows Server 2003 SP1
GFlops: ~10
Top500 #: N/A
Price: < $4,000 (250x drop)
Customers: Every engineer & scientist
Applications: Bioinformatics, materials sciences, digital media

Cheap, Interactive HPC Systems Are Making Supercomputing Personal

Grids of personal & departmental clusters

Personal workstations & departmental servers

Minicomputers

Mainframes

The Evolving Nature of HPC

Departmental cluster (conventional scenario)
IT owns large clusters due to complexity and allocates resources on a per-job basis
Users submit batch jobs via scripts
In-house and ISV apps, many based on MPI
Focus: scheduling multiple users' applications onto scarce compute cycles; cluster systems administration; manual, batch execution

Personal/workgroup cluster (emerging scenario)
Clusters are pre-packaged OEM appliances, purchased and managed by end users
Desktop HPC applications transparently and interactively make use of cluster resources
Focus: desktop development tools integration; interactive applications; compute grids

HPC application integration (future scenario)
Multiple simulations and data sources integrated into a seamless application workflow
Network topology and latency awareness for optimal distribution of computation
Structured data storage with rich metadata
Applications and data potentially span organizational boundaries
Focus: data-centric, “whole-system” workflows; data grids; interactive computation and visualization

Windows Server HPC

Windows based HPC Today

Ecosystem

Partnerships with ISVs to develop on Windows

Partnership with Cornell Theory Center

Technical Solution

Partner-driven solution stack:
Applications: parallel applications
Development tools: Visual Studio
Middleware: MPI/Pro, MPICH-1.2, WMPI, MPI-NT
Management: LSF, PBSPro, DataSynapse, MSTI
OS: Windows
Protocol: TCP
Interconnect: Fast Ethernet, Gigabit Ethernet
Platform: Intel (32-bit & 64-bit) & AMD x64

What Windows-based HPC needs to provide

Users require:

An integrated, supported solution stack leveraging the Windows infrastructure

Simplified job submission, status and progress monitoring

Maximum compute performance and scalability

Simplified environment from desktops to HPC clusters

Administrators require:

Ease of setup and deployment

Better cluster monitoring and management for maximum resource utilization

Flexible, extensible, policy-driven job scheduling and resource allocation

High availability

Secure process startup and complete cleanup

Developers Require:

Programming environment that enables high productivity

Availability of optimized compilers (Fortran) and math libraries

Parallel debugger, profiler, and visualization tools

Parallel programming models (MPI)

V1 Plans

Introduce compute cluster solution

Windows Server 2003 Compute Cluster Edition, based on Windows Server 2003 SP1 x64 Standard Edition

Features for Job Management, IT Admin and Developers

Build a partner ecosystem around Windows Server Compute Cluster Edition from day one

Establish Microsoft credibility in the HPC community

Create worldwide Centers of Innovation

Technologies

Platform

Windows Server 2003 SP1 64-bit Edition

x64 processors (Intel EM64T & AMD Opteron)

Ethernet, Ethernet over RDMA and Infiniband support

Administration

Prescriptive, simplified cluster setup and administration

Scripted, image-based compute node management

Active Directory based security, impersonation and delegation

Cluster-wide job scheduling and resource management

Development

MPICH-2 from Argonne National Labs (see the minimal MPI sketch below)

Cluster scheduler accessible via DCOM, http, and Web Services

Visual Studio 2005 – Compilers, Parallel Debugger

Partner delivered compilers and libraries
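To make the MPI references above concrete, here is a minimal sketch of the MPI programming model that MPICH-2 implements. It uses only standard MPI calls; the launcher and cluster configuration details are left to whatever the local scheduler provides.

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI sketch: every process reports its rank and the world size. */
int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of cooperating processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut the runtime down cleanly */
    return 0;
}

Each rank runs as a separate process on a cluster node, and all interaction between ranks happens through explicit MPI calls such as MPI_Send and MPI_Recv.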

Windows HPC Environment

[Diagram: Windows HPC environment, mirroring the earlier cluster scenario. Users and admins reach user, resource, cluster, and job management on the head node via web service, web page, or command line; job and resource managers drive node managers on cluster nodes running user applications over MPI; nodes share a high-speed, low-latency interconnect (Ethernet over RDMA, Infiniband); data input and output go to a DB or file system, supporting sensors, workflow, computation, data mining, visualization, and remote query. The environment is built on Active Directory, Microsoft Operations Manager, and Windows Server 2003 Compute Cluster Edition.]

Architectural Overview

[Diagram: architectural overview. Head node: Windows Server 2003 CCE running the Job Scheduler with MSDE, WSE3, IIS6, RIS, and AD. Cluster nodes: Windows Server 2003 CCE with a Node Manager and HPC applications built on MPI-2, which runs over TCP, shared memory (SHM), or WSD/SDP across GigE/RDMA and Infiniband interconnects. Cluster-user workstations (Windows XP, x86/x64, GigE) hold applications, job scripts, and data, and talk to the scheduler through the Job Scheduler UI via WSE/HTTP and COM. Developer workstations add Visual Studio (Whidbey), WSE, SFU, third-party compilers and libraries, and the HPC SDK (MPI, scheduler web service, and policy APIs). The legend distinguishes Windows OS, Whidbey, Microsoft components, HPC components, third-party components, and applications.]

Key Challenges for Future HPC Systems

Difficult to Tune Performance

Example: tightly-coupled MPI applications

Very sensitive to network performance characteristics

Communication times measured in microseconds: O(10 usecs) for interconnects such as Infiniband; O(100 usecs) for GigE

OS network stack is a significant factor: Things like RDMA can make a big difference

Excited about the prospects of industry-standard RDMA hardware

We are working with InfiniBand and GigE vendors to ensure our stack supports them

Driver quality is an important facet

We are supporting the OpenIB initiative

Considering the creation of a WHQL program for InfiniBand

Very sensitive to mismatched node performance

Random OS activities can add millisecond delays to microsecond communication times (see the ping-pong sketch below)
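As a concrete illustration of why those microsecond numbers matter, here is a minimal ping-pong sketch of the kind commonly used to measure round-trip latency between two nodes. It uses only standard MPI calls, should be run with exactly two ranks, and is an illustration rather than a rigorous benchmark.

#include <mpi.h>
#include <stdio.h>

/* Two-rank ping-pong: rank 0 sends a small message to rank 1 and waits
 * for the reply; half the averaged round-trip time approximates the
 * one-way latency of the interconnect plus the software stack. */
int main(int argc, char *argv[])
{
    const int reps = 10000;
    char buf[8] = {0};
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("approx one-way latency: %.1f usec\n",
               (t1 - t0) / reps / 2 * 1e6);

    MPI_Finalize();
    return 0;
}

On an RDMA-capable interconnect such as Infiniband the reported number should land in the O(10 usec) range cited above; over a conventional GigE/TCP path it is typically an order of magnitude higher.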

Need self-tuning systems

Application configuration has a significant impact

Incorrect assumptions about the hardware/communications architecture can dramatically affect performance

Choice of communication strategy

Choice of communication granularity (see the sketch below)

Tuning is an end-to-end issue:

OS support

ISV library support

ISV application support
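To illustrate the granularity point, here is a small sketch (run with two ranks; the value count and data layout are invented for illustration) contrasting many fine-grained sends with a single aggregated send. Since every message pays the per-message latency discussed above, the aggregated form is usually much cheaper on a cluster interconnect.

#include <mpi.h>
#include <stdio.h>

#define N 1000   /* number of values to deliver (illustrative) */

int main(int argc, char *argv[])
{
    double vals[N], t0, t1;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++) vals[i] = i;

    /* Fine-grained: N separate messages, each paying the per-message latency. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    if (rank == 0)
        for (int i = 0; i < N; i++)
            MPI_Send(&vals[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        for (int i = 0; i < N; i++)
            MPI_Recv(&vals[i], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    t1 = MPI_Wtime();
    if (rank == 0) printf("fine-grained sends: %.3f ms (rank 0 elapsed)\n", (t1 - t0) * 1e3);

    /* Aggregated: one message carrying all N values; the latency is paid once. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    if (rank == 0)
        MPI_Send(vals, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(vals, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    t1 = MPI_Wtime();
    if (rank == 0) printf("aggregated send:    %.3f ms (rank 0 elapsed)\n", (t1 - t0) * 1e3);

    MPI_Finalize();
    return 0;
}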

Computational Grid Economics

What $1 will buy you (roughly):

Computers cost roughly $1000, so 1 CPU-day (~10 tera-ops) == $1 (roughly, assuming a 3-year use cycle)

10 TB of cluster network transfer == $1 (roughly, assuming a 1 Gbps interconnect)

Internet bandwidth costs roughly $100/Mbps/month (not including routers and management), so 1 GB of Internet transfer == $1 (roughly)

Some observations:

HPC cluster communication is roughly 10,000x cheaper than WAN communication

Break-even point for instructions computed per byte transferred (worked out below):

Cluster: O(1) instructions/byte; WAN: O(10,000) instructions/byte
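The break-even figures follow directly from the prices above; written out:

\[
\text{Cluster: } \frac{10^{13}\ \text{ops per dollar}}{10^{13}\ \text{bytes per dollar}} \approx 1\ \text{instruction per byte},
\qquad
\text{WAN: } \frac{10^{13}\ \text{ops per dollar}}{10^{9}\ \text{bytes per dollar}} \approx 10^{4}\ \text{instructions per byte}
\]

In other words, a computation is only worth distributing across the Internet if it performs on the order of 10,000 instructions for every byte it moves.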

Computational Grid Economics Implications

“Small data, high compute” applications work well across the Internet, such as SETI@home and Folding@home

MPI-style parallel, distributed applications work well in clusters and across LANs, but are uneconomic and do not work well in wide-area settings

Data analysis is usually best done by moving the programs to the data, not the data to the programs.

Move questions and answers, not petabyte-scale datasets

The Internet is NOT the cpu backplane (Internet-2 will not change this)

Exploding Data Sizes

Experimental data: TBs to PBs

Modeling data:

Today: 10s to 100s of GB is the common case

Tomorrow: TBs

Near-future example: CFD simulation of a turbine engine

10^9 mesh nodes, each containing 16 double-precision variables

128 GB / time-step

Simulate 1000s of time steps, yielding 100s of TBs per simulation (arithmetic below)

Archived for future reference
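The per-time-step figure is simply the mesh size times the state stored per node:

\[
10^{9}\ \text{mesh nodes} \times 16\ \text{variables} \times 8\ \text{bytes} = 1.28 \times 10^{11}\ \text{bytes} \approx 128\ \text{GB per time step}
\]

so a run of roughly a thousand time steps produces on the order of 128 TB, consistent with the 100s of TBs per simulation quoted above.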

Whole-System Modeling and Workflow

Today: mostly about computation

Stand-alone, static simulations of individual parts/phenomena

Mostly batch

Simple workflows: short, deterministic pipelines (though some are massively parallel)

Future: mostly about data that is produced and consumed by computational steps

Dynamic whole-system modeling via multiple, interacting simulations

More complex workflows (don't yet know how complex)

More interactive analysis

More sharing

Whole-System Modeling Example: Turbine Engine

Interacting simulations

CFD simulation of dynamic airflow through the turbine

FE stress analysis of engine & wing parts

"Impedance" issues between various simulations (time steps, meshes, ...)

Serial workflow steps

Crack analysis of engine & wing parts

Visualization of results

Interactive Workflow Example

Base CFD simulation produces huge output

Points of interest may not be easy to find

Find and then focus on important details

Data analysis/mining of output

Restart simulation at a desired point in time/space.

Visualize simulation from that point forward.

Modify simulation from that point forward (e.g. higher fidelity)

Data Analysis and Mining

Traditional approach:

Keep data in flat files

Write C or Perl programs to compute specific analysis queries

Problems with this approach:

Imposes significant development time

Scientists must reinvent DB indexing and query technologies

Results from the astronomy community:

Relational databases can yield speed-ups of one to two orders of magnitude

SQL + application/domain-specific stored procedures greatly simplify creation of analysis queries

Combining Simulation with Experimental Data: Drug Discovery

Clinical trial database describes toxicity & side effects observed for tested drugs.

Simulation searches for candidate compounds that have a desired effect on a biological system.

Clinical data is searched for drugs that contain a candidate compound or a “near neighbor”; toxicity results are retrieved and used to decide whether the candidate compound should be rejected.

Sharing

Simulations (or ensembles of simulations) mostly done in isolation

No sharing except for archival output

Some coarse-grained sharing

Check-out/check-in of large components

Example: automotive design

Check out component

CAE-based design & simulation of component

Check in with design rule checking step

Data warehouses typically only need coarse-grained update granularity

Bulk or coarse-grained updates

Modeling & simulations done in the context of particular versions of the data

Audit trails and reproducible workflows becoming increasingly important

Data Management Needs

Cluster file systems and/or parallel DBs to handle I/O bandwidth needs of large, parallel, distributed applications

Data warehouses for experimental data and archived simulation output

Coarse-grained geographic replication to accommodate distributed workforces and workflows

Indexing and query capabilities to do data mining & analysis

Audit trails, workflow recorders, etc.

Windows HPC Roadmap

V1: Introduce complete Windows-based compute cluster solution

• Complete development platform: SFU, MPI, compilers, parallel debugger
• Integrated cluster setup and management
• Secure and programmable job scheduling and management

V2: Apply Microsoft’s core innovation to HPC

• Enhanced development support
• Harnessing unused desktop cycles for HPC
• Meta-scheduling to integrate desktops and forests of clusters
• Efficient management and manipulation of large datasets and workflows

Long-term vision: Revolutionize scientific and technical computation

• Ease of parallel programming
• Exploitation of hardware features (e.g. GPUs and multi-core chips)
• Integration with high-level tools (e.g. Excel)
• Support for domain-specific platforms

Call To Action

IHVs

Develop Winsock Direct drivers for your RDMA cards so that our MPI stack can automatically take advantage of their low latency

Develop support for diskless scenarios (e.g. iSCSI)

OEMs

Offer turn-key clusters

Pre-wired for management and RDMA networks

Support “boot from net” diskless scenarios

Support WS-Management

Consider noise and power requirements for personal and workgroup configurations

Community Resources

Windows Hardware & Driver Central (WHDC): www.microsoft.com/whdc/default.mspx

Technical Communities: www.microsoft.com/communities/products/default.mspx

Non-Microsoft Community Sites: www.microsoft.com/communities/related/default.mspx

Microsoft Public Newsgroups: www.microsoft.com/communities/newsgroups

Technical Chats and Webcasts: www.microsoft.com/communities/chats/default.mspx and www.microsoft.com/webcasts

Microsoft Blogs: www.microsoft.com/communities/blogs

Related WinHEC Sessions

TWNE05005: Winsock Direct Value Proposition-Partner Concepts

TWNE05006: Implementing Convergent Networking-Partner Concepts

To Learn More

Microsoft:

Microsoft HPC website: http://www.microsoft.com/hpc/

Other sites:

CTC Activities: http://cmssrv.tc.cornell.edu/ctc/winhpc/

3rd Party Windows Cluster Resource Centre: www.windowsclusters.org

HPC related-links web site: http://www.microsoft.com/windows2000/hpc/miscresources.asp

Some useful articles & presentations:

“Supercomputing in the Third Millenium”, by George Spix: http://www.microsoft.com/windows2000/hpc/supercom.asp

Introduction to the book “Beowulf Cluster Computing with Windows” by Thomas Sterling, Gordon Bell, and Janusz Kowalik

“Distributed Computing Economics”, by Jim Gray, MSR-TR-2003-24: http://research.microsoft.com/research/pubs/view.aspx?tr_id=655

“Web Services, Large Databases, and what Microsoft is doing in the Grid Computing Space”, presentation by Jim Gray: http://research.microsoft.com/~Gray/talks/WebServices_Grid.ppt

Send questions to hpcinfo @ microsoft.com