
Page 1: The Computational Plant

9th ORAP Forum, Paris (CNRS)

Rolf Riesen
Sandia National Laboratories, Scalable Computing Systems Department

March 21, 2000

Page 2: Distributed and Parallel Systems

[Diagram: a spectrum from distributed, heterogeneous systems to massively parallel, homogeneous systems; examples placed along it include SETI@home, Legion/Globus, Internet, Berkeley NOW, Beowulf, Cplant, and ASCI Red/Tflops]

Distributed systems (heterogeneous):
• Gather (unused) resources
• Steal cycles
• System SW manages resources
• System SW adds value
• 10% - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared

Massively parallel systems (homogeneous):
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared

Tech Report SAND98-2221

Page 3: Massively Parallel Processors

• Intel Paragon
– 1,890 compute nodes
– 3,680 i860 processors
– 143/184 GFLOPS
– 175 MB/sec network
– SUNMOS lightweight kernel

• Intel TeraFLOPS
– 4,576 compute nodes
– 9,472 Pentium II processors
– 2.38/3.21 TFLOPS
– 400 MB/sec network
– Puma/Cougar lightweight kernel

Page 4: Cplant Goals

• Production system
• Multiple users
• Scalable (an easy-to-use buzzword)
• Large scale (proof of the above)
• General purpose for scientific applications (not a Beowulf dedicated to a single user)
• 1st step: Tflops look and feel for users

Page 5: Cplant Strategy

• Hybrid approach combining commodity cluster technology with MPP technology
• Build on the design of the TFLOPS:
– large systems should be built from independent building blocks
– large systems should be partitioned to provide specialized functionality
– large systems should have significant resources dedicated to system maintenance

Page 6: Why Cplant?

• Modeling and simulation, essential to stockpile stewardship, require significant computing power
• Commercial supercomputers are a dying breed
• Pooling of SMPs is expensive and more complex
• The commodity PC market is closing the performance gap
• Web services and e-commerce are driving high-performance interconnect technology

Page 7: Cplant Approach

• Emulate the ASCI Red environment
– Partition model (functional decomposition)
– Space sharing (reduce turnaround time)
– Scalable services (allocator, loader, launcher)
– Ephemeral user environment
– Complete resource dedication
• Use existing software when possible
– Red Hat distribution, Linux/Alpha
– Software developed for ASCI Red

Page 8: Conceptual Partition View

[Diagram: the machine divided into partitions: System Support (used by Sys Admin), Service (used by Users), Compute, File I/O (serving /home), and Net I/O]

Page 9: User View

[Diagram: a user runs "rlogin alaska"; a load-balancing daemon directs the login to one of the service-partition nodes alaska0 through alaska4, which share /home, File I/O, and Net I/O]
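To make the figure concrete, here is a toy C sketch of the load-balancing step: an incoming login is directed to the least-loaded service node. The load metric, refresh policy, and all names are invented for illustration; the slide does not specify how the daemon actually chooses.

/* Toy sketch of the load-balancing daemon in the figure: direct an
 * incoming "rlogin alaska" to the least-loaded service node. The
 * load metric and all names here are invented for illustration. */
#include <stdio.h>

#define NNODES 5

static const char *nodes[NNODES] =
    { "alaska0", "alaska1", "alaska2", "alaska3", "alaska4" };

/* Pick the service node with the smallest load figure; a real daemon
 * would refresh these numbers periodically. */
static const char *pick_node(const double load[NNODES])
{
    int best = 0;
    for (int i = 1; i < NNODES; i++)
        if (load[i] < load[best])
            best = i;
    return nodes[best];
}

int main(void)
{
    double load[NNODES] = { 0.9, 0.2, 0.7, 0.4, 0.8 };
    printf("rlogin alaska -> %s\n", pick_node(load));
    return 0;
}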

Page 10: System Support Hierarchy

[Diagram: sss1 (admin access) holds the master copy of the system software. Below it, each Scalable Unit has an sss0 holding an in-use copy of the system software, and the unit's nodes NFS-mount root from their SSS0]

Page 11: Scalable Unit

[Diagram: two scalable units, each containing seven compute nodes and one service node wired for power, serial, Ethernet, and Myrinet. Each unit has a 100BaseT hub, a 16-port Myrinet switch with 8 Myrinet LAN cables, a terminal server, and a power controller; an sss0 connects the units to the system support network]

Page 12: “Virtual Machines”

[Diagram: as in the System Support Hierarchy, but sss1 uses rdist to push system software down to the sss0 of each Scalable Unit; each sss0 holds the in-use copy of the system software and its nodes NFS-mount root from it. A configuration database tracks which units form the Alpha, Beta, and Production virtual machines]

Page 13: Runtime Environment

• yod - Service node parallel job launcher
• bebopd - Compute node allocator
• PCT - Process control thread, compute node daemon
• pingd - Compute node status tool
• fyod - Independent parallel I/O

Page 14: Phase I - Prototype (Hawaii)

• 128 Digital PWS 433a (Miata)
• 433 MHz 21164 Alpha CPU
• 2 MB L3 cache
• 128 MB ECC SDRAM
• 24 Myrinet dual 8-port SAN switches
• 32-bit, 33 MHz LANai-4 NIC
• Two 8-port serial cards per SSS-0 for console access
• I/O - six 9 GB disks
• Compile server - 1 DEC PWS 433a
• Integrated by SNL

Page 15: Phase II Production (Alaska)

• 400 Digital PWS 500a (Miata)
• 500 MHz Alpha 21164 CPU
• 2 MB L3 cache, 192 MB RAM
• 16-port Myrinet switch
• 32-bit, 33 MHz LANai-4 NIC
• 6 DEC AS1200, 12 RAIDs (0.75 TB) as parallel file server
• 1 DEC AS4100 compile & user file server
• Integrated by Compaq
• 125.2 GFLOPS on MPLINPACK (350 nodes)
– would place 53rd on June 1999 Top 500

Page 16: Phase III Production (Siberia)

• 624 Compaq XP1000 (Monet)
• 500 MHz Alpha 21264 CPU
• 4 MB L3 cache
• 256 MB ECC SDRAM
• 16-port Myrinet switch
• 64-bit, 33 MHz LANai-7 NIC
• 1.73 TB disk I/O
• Integrated by Compaq and Abba Technologies
• 247.6 GFLOPS on MPLINPACK (572 nodes)
– would place 40th on Nov 1999 Top 500

Page 17: Phase IV (Antarctica?)

• ~1350 DS10 Slates (NM + CA)
• 466 MHz EV6, 256 MB RAM
• Myrinet 33 MHz, 64-bit LANai 7.x
• Will be combined with Siberia for a ~1600-node system
• Red, black, green switchable

Page 18: Myrinet Switch

• Based on a 64-port Clos switch
• 8x2 16-port switches in a 12U rack-mount case
• 64 LAN cables to nodes
• 64 SAN cables (96 links) to mesh

[Diagram: the Clos arrangement of 16-port switches, with 4 nodes attached per 16-port switch]

Page 19: One Switch Rack = One Plane

• 4 Clos switches in one rack
• 256 nodes per plane (8 racks): 4 switches × 64 LAN cables = 256
• Wrap-around in x and y direction
• 128+128 links in z direction

[Diagram: the x/y/z arrangement of the plane; the 4 nodes per switch are not shown]

Page 20: Cplant 2000: “Antarctica”

[Diagram: the Antarctica system, with sections connected to the open, classified, and unclassified networks; compute nodes swing between red, black, or green. Wrap-around and z links and nodes not shown]

Page 21: Cplant 2000: “Antarctica” cont.

• 1056 + 256 + 256 nodes ≈ 1600 nodes, ~1.5 TFLOPS
• 320 “64-port” switches + 144 16-port switches from Siberia
• 40 + 16 system support stations

Page 22: Portals

• Data movement layer from SUNMOS and PUMA
• Flexible building blocks for supporting many protocols
• Elementary constructs that support MPI semantics well
• Low-level message-passing layer (not a wire protocol)
• API intended for library writers, not application programmers
• Tech report SAND99-2959

Page 23: Interface Concepts

• One-sided operations
– Put and Get
• Zero-copy message passing
– Increased bandwidth
• OS bypass
– Reduced latency
• Application bypass
– No polling, no threads
– Reduced processor utilization
– Reduced software complexity
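The toy C program below illustrates these concepts under stated assumptions: the "network" is a memcpy, the "NIC" is the caller's thread, and none of the names are the real Portals API (see SAND99-2959 for that). It shows a one-sided Put landing in a pre-armed memory region and completing through an event queue, with no receive call and no polling thread on the target.

/* Toy model of a one-sided Put with application bypass. Everything
 * here is hypothetical illustration, not the Portals API. */
#include <stdio.h>
#include <string.h>

#define EQ_SIZE 8

typedef struct { void *start; size_t length; } mem_desc; /* memory descriptor */
typedef struct { mem_desc md; size_t mlength; } event;   /* completion event  */

static event eq[EQ_SIZE];               /* target-side event queue */
static int eq_head = 0, eq_tail = 0;

/* "Deliver" a put: copy the payload into the target's pre-armed
 * region and record an event. The target application is never
 * involved until it chooses to look (application bypass). */
static void portal_put(mem_desc target_md, const void *data, size_t len)
{
    size_t n = len < target_md.length ? len : target_md.length;
    memcpy(target_md.start, data, n);   /* zero copy, in spirit */
    eq[eq_tail % EQ_SIZE] = (event){ target_md, n };
    eq_tail++;
}

/* Target: consume one event when convenient; no polling loop is
 * needed because the "NIC" fills the queue. */
static int eq_get(event *ev)
{
    if (eq_head == eq_tail) return 0;
    *ev = eq[eq_head++ % EQ_SIZE];
    return 1;
}

int main(void)
{
    double region[4] = { 0 };                 /* target memory region */
    mem_desc md = { region, sizeof region };  /* target pre-arms it   */

    double payload[4] = { 1, 2, 3, 4 };       /* initiator's data     */
    portal_put(md, payload, sizeof payload);  /* one-sided: no recv() */

    event ev;
    if (eq_get(&ev))
        printf("put of %zu bytes completed, region[3]=%g\n",
               ev.mlength, region[3]);
    return 0;
}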

Page 24: MPP Network: Paragon and Tflops

[Diagram: two processors and memory on a shared memory bus, with the network attached directly to that bus]

• Message passing or computational co-processor
• Network interface is on the memory bus

Page 25: Commodity: Myrinet

[Diagram: processor and memory on the memory bus; a bridge connects the memory bus to the PCI bus, where the NIC attaches to the network; OS bypass lets the application reach the NIC directly]

• Network is far from the memory
• OS bypass

Page 26: “Must” Requirements

• Common protocols (MPI, system protocols)
• Portability
• Scalability to 1000’s of nodes
• High performance
• Multiple process access
• Heterogeneous processes (binaries)
• Runtime independence
• Memory protection
• Reliable message delivery
• Pairwise message ordering

Page 27: “Will” Requirements

• Operational API
• Zero-copy MPI
• Myrinet
• Sockets implementation
• Unrestricted message size
• OS bypass, application bypass
• Put/Get

Page 28: “Will” Requirements (cont.)

• Packetized implementations
• Receive uses start and length
• Receiver managed
• Sender managed
• Gateways
• Asynchronous operations
• Threads

Page 29: “Should” Requirements

• No message alignment restrictions
• Striping over multiple channels
• Socket API
• Implement on ST
• Implement on VIA
• No consistency/coherency
• Ease of use
• Topology information

Page 30: Portal Addressing

[Diagram: a portal table indexes into match lists; match entries lead to memory descriptors, which reference a memory region and an event queue. The portal table, match lists, and memory descriptors live in Portal API space; the memory region and event queue live in application space, on the other side of the operational boundary]
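The figure's relationships can be pictured as linked C structures. The sketch below is a hypothetical rendering only: field and type names are invented here, not taken from the Portals specification, and the arrows of the figure simply become pointers.

/* Hypothetical data-structure sketch of the figure above; all names
 * are invented, not the Portals API. */
#include <stddef.h>
#include <stdint.h>

struct event_queue;                    /* lives in application space     */

struct memory_descriptor {
    void   *start;                     /* the application's memory       */
    size_t  length;                    /* region                         */
    int     threshold;                 /* operations left before unlink  */
    int     unlink;                    /* unlink when threshold hits 0   */
    struct event_queue *eq;            /* optional; NULL = no events     */
    struct memory_descriptor *next;
};

struct match_entry {                   /* one entry in a match list      */
    uint64_t match_bits, ignore_bits;  /* compared with message header   */
    int      src_nid, src_pid;         /* who may deposit here           */
    int      unlink;                   /* unlink entry once it is empty  */
    struct memory_descriptor *md_list; /* the first MD gets the message  */
    struct match_entry *next;
};

struct portal_table_entry {            /* indexed by the portal number   */
    struct match_entry *match_list;    /* carried in each message header */
};

/* Everything above the memory region and event queue sits in Portal
 * API space; the application sees only its region and its events. */
static struct portal_table_entry portal_table[64];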

Page 31: Portal Address Translation

[Flowchart: on entry, walk the match list. If the current entry matches and its first MD accepts the message, perform the operation; unlink the MD if requested, unlink the ME if it is now empty and marked for unlink, record an event if an event queue is attached, and exit. If the entry does not match or its MD rejects the message, get the next match entry; when no match entries remain, increment the drop count, discard the message, and exit]
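Read as code, the flowchart is a single walk down a match list. Below is a hypothetical C rendering of exactly the boxes above, continuing the structures from the previous sketch; matches() and md_accepts() stand in for the match-bits comparison and the MD's space/threshold check, and perform_operation()/record_event() are assumed helpers.

/* The flowchart above as hypothetical C, built on the structures from
 * the previous sketch; comments name the corresponding boxes. */
struct msg_header { uint64_t match_bits; size_t len; };

extern void perform_operation(struct memory_descriptor *md,
                              const struct msg_header *h);
extern void record_event(struct event_queue *eq,
                         const struct msg_header *h);

static long drop_count;

static int matches(const struct match_entry *me, const struct msg_header *h)
{   /* "Match?": compare match bits under the ignore mask
     * (source checks omitted for brevity) */
    return ((me->match_bits ^ h->match_bits) & ~me->ignore_bits) == 0;
}

static int md_accepts(const struct memory_descriptor *md,
                      const struct msg_header *h)
{   /* "First MD Accepts?": an MD exists, has threshold left, and room */
    return md && md->threshold != 0 && h->len <= md->length;
}

static void translate(struct portal_table_entry *pte,
                      const struct msg_header *h)
{
    for (struct match_entry **mep = &pte->match_list; *mep; ) {
        struct match_entry *me = *mep;
        struct memory_descriptor *md = me->md_list;
        if (!matches(me, h) || !md_accepts(md, h)) {
            mep = &me->next;                      /* "Get Next Match Entry" */
            continue;                             /* "More Match Entries?"  */
        }
        perform_operation(md, h);                 /* "Perform Operation"    */
        if (--md->threshold == 0 && md->unlink)   /* "Unlink MD?"           */
            me->md_list = md->next;               /* "Unlink MD"            */
        if (me->md_list == NULL && me->unlink)    /* "Empty & Unlink ME?"   */
            *mep = me->next;                      /* "Unlink ME"            */
        if (md->eq)                               /* "Event Queue?"         */
            record_event(md->eq, h);              /* "Record Event"         */
        return;                                   /* "Exit"                 */
    }
    drop_count++;                                 /* "Increment Drop Count" */
    /* "Discard Message", then exit */
}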

Page 32: Implementing MPI

• Short message protocol
– Send message (expect receipt)
– Unexpected messages
• Long message protocol
– Post receive
– Send message
– On ACK or Get, release message
• Event includes the memory descriptor
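A minimal sketch of this protocol split, assuming an invented cutoff constant; only the decision structure and the comments come from the slide.

/* Hypothetical sketch of the short/long protocol split above; the
 * cutoff value and all names are invented. */
#include <stddef.h>

#define SHORT_MSG_MAX 4096          /* assumed short/long cutoff */

enum mpi_protocol { SHORT_PROTOCOL, LONG_PROTOCOL };

static enum mpi_protocol choose_protocol(size_t len)
{
    /* Short protocol: send immediately and expect receipt; messages
     * with no posted receive land in unexpected-message buffers.
     * Long protocol: the receive is posted, the message is sent, and
     * the send buffer is released only on the ACK (or when the
     * receiver pulls the data with a Get). */
    return len <= SHORT_MSG_MAX ? SHORT_PROTOCOL : LONG_PROTOCOL;
}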

Page 33: Implementing MPI

[Diagram: the MPI match list: pre-posted receives at the head, a mark entry, then "match any" entries backed by short, unlink buffers for unexpected messages, and a final "match none" entry (0 length, truncate, no ACK); completions are recorded in the event queue]

Page 34: Flow Control

• Basic flow control
– Drop messages that the receiver is not prepared for
– Long messages might waste network resources
– Good performance for well-behaved MPI apps
• Managing the network - packets
– Packet size big enough to hold a short message
– First packet is an implicit RTS
– Flow control ACK can indicate that the message will be dropped
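A toy sketch of the receiver side of this scheme; the packet layout, the fixed-size drop table, and every name are invented. The point is only that the first packet doubles as an implicit RTS, so the flow-control ACK can stop a long message from wasting further network resources.

/* Hypothetical sketch of packet-level flow control; layout and names
 * are invented. A packet is big enough to hold a whole short message,
 * and the first packet of a long message acts as an implicit RTS. */
#include <stddef.h>

struct first_packet {
    unsigned msg_id;      /* identifies the message's later packets  */
    size_t   msg_len;     /* total message length                    */
    size_t   payload_len; /* bytes carried here; == msg_len if short */
};

enum fc_ack { FC_ACCEPT, FC_WILL_DROP };

#define MAX_INFLIGHT 64
static unsigned dropping[MAX_INFLIGHT]; /* ids of messages being dropped */

/* Called when the first packet of a message arrives. A receiver that
 * is not prepared (no matching buffer) drops the message; the ACK
 * tells the sender, and later packets with the same id are discarded
 * on sight. A well-behaved MPI app pre-posts and rarely hits this. */
static enum fc_ack on_first_packet(const struct first_packet *p,
                                   int have_matching_buffer)
{
    if (have_matching_buffer)
        return FC_ACCEPT;
    dropping[p->msg_id % MAX_INFLIGHT] = p->msg_id;
    return FC_WILL_DROP;
}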

Page 35: Portals 3.0 Status

• Currently testing Cplant Release 0.5
– Portals 3.0 kernel module using the RTS/CTS module over Myrinet
– Port of MPICH 1.2.0 over Portals 3.0
• TCP/IP reference implementation ready
• Port to LANai begun

Page 36:

http://www.cs.sandia.gov/cplant