The Computational Plant
9th ORAP Forum, Paris (CNRS)
Rolf Riesen
Sandia National Laboratories, Scalable Computing Systems Department
March 21, 2000
Distributed and Parallel Systems
[Diagram: example systems along the spectrum from heterogeneous distributed systems to homogeneous massively parallel systems: Internet, SETI@home, Legion/Globus, Berkeley NOW, Beowulf, Cplant, ASCI Red/Tflops]

Distributed systems (heterogeneous):
• Gather (unused) resources
• Steal cycles
• System SW manages resources
• System SW adds value
• 10% - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared
Massively parallel systems (homogeneous):
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared
Tech Report SAND98-2221
Massively Parallel Processors
Intel Paragon:
• 1,890 compute nodes
• 3,680 i860 processors
• 143/184 GFLOPS
• 175 MB/sec network
• SUNMOS lightweight kernel

Intel TeraFLOPS:
• 4,576 compute nodes
• 9,472 Pentium II processors
• 2.38/3.21 TFLOPS
• 400 MB/sec network
• Puma/Cougar lightweight kernel
Cplant Goals
• Production system
• Multiple users
• Scalable (an easy-to-use buzzword)
• Large scale (proof of the above)
• General purpose for scientific applications (not a Beowulf dedicated to a single user)
• 1st step: Tflops look and feel for users
Cplant Strategy
• Hybrid approach combining commodity cluster technology with MPP technology
• Build on the design of the TFLOPS:
  – large systems should be built from independent building blocks
  – large systems should be partitioned to provide specialized functionality
  – large systems should have significant resources dedicated to system maintenance
Why Cplant?
• Modeling and simulation, essential to stockpile stewardship, require significant computing power
• Commercial supercomputers are a dying breed
• Pooling of SMPs is expensive and more complex
• The commodity PC market is closing the performance gap
• Web services and e-commerce are driving high-performance interconnect technology
Cplant Approach
• Emulate the ASCI Red environment
  – Partition model (functional decomposition)
  – Space sharing (reduce turnaround time)
  – Scalable services (allocator, loader, launcher)
  – Ephemeral user environment
  – Complete resource dedication
• Use existing software when possible
  – Red Hat distribution, Linux/Alpha
  – Software developed for ASCI Red
Conceptual Partition View
[Diagram: conceptual partitions: Service (accessed by users), Compute, File I/O (/home), Net I/O, and System Support (accessed by the system administrator)]
User View
[Diagram: "rlogin alaska" is directed by a load-balancing daemon to one of the service-partition nodes alaska0 through alaska4; the service partition connects to /home, the File I/O partition, and the Net I/O partition]
System Support Hierarchy
[Diagram: sss1 (admin access) holds the master copy of the system software and serves several scalable units; each unit's sss0 holds the in-use copy, and the unit's nodes NFS-mount their root from sss0]
Scalable Unit
[Diagram: a scalable unit; one sss0 connects to the system support network, and each of its two halves has seven compute nodes and one service node on a 100BaseT hub and a 16-port Myrinet switch (8 Myrinet LAN cables), plus a terminal server and a power controller; power, serial, Ethernet, and Myrinet connections are shown]
“Virtual Machines”

[Diagram: sss1 uses rdist to push system software down to the sss0 of each scalable unit, whose nodes NFS-mount root from sss0; a configuration database assigns scalable units to Alpha, Beta, and Production virtual machines]
Runtime Environment
• yod - Service node parallel job launcher
• bebopd - Compute node allocator
• PCT - Process control thread, compute node daemon
• pingd - Compute node status tool
• fyod - Independent parallel I/O
Phase I - Prototype (Hawaii)
• 128 Digital PWS 433a (Miata)
• 433 MHz Alpha 21164 CPU
• 2 MB L3 cache
• 128 MB ECC SDRAM
• 24 Myrinet dual 8-port SAN switches
• 32-bit, 33 MHz LANai-4 NIC
• Two 8-port serial cards per SSS-0 for console access
• I/O - six 9 GB disks
• Compile server - 1 DEC PWS 433a
• Integrated by SNL
Phase II Production (Alaska)
• 400 Digital PWS 500a (Miata)
• 500 MHz Alpha 21164 CPU
• 2 MB L3 cache, 192 MB RAM
• 16-port Myrinet switch
• 32-bit, 33 MHz LANai-4 NIC
• 6 DEC AS1200, 12 RAID (0.75 TB) parallel file server
• 1 DEC AS4100 compile & user file server
• Integrated by Compaq
• 125.2 GFLOPS on MPLINPACK (350 nodes)
  – would place 53rd on June 1999 Top 500
Phase III Production (Siberia)
• 624 Compaq XP1000 (Monet)
• 500 MHz Alpha 21264 CPU
• 4 MB L3 cache
• 256 MB ECC SDRAM
• 16-port Myrinet switch
• 64-bit, 33 MHz LANai-7 NIC
• 1.73 TB disk I/O
• Integrated by Compaq and Abba Technologies
• 247.6 GFLOPS on MPLINPACK (572 nodes)
  – would place 40th on Nov 1999 Top 500
Phase IV (Antarctica?)
• ~1350 DS10 Slates (NM + CA)
• 466 MHz EV6, 256 MB RAM
• Myrinet 33 MHz, 64-bit LANai 7.x
• Will be combined with Siberia for a ~1600-node system
• Red, black, green switchable
Myrinet Switch
• Based on 64-port Clos switch
• 8x2 16-port switches in a 12U rack-mount case
• 64 LAN cables to nodes
• 64 SAN cables (96 links) to mesh

[Diagram: each 16-port switch connects 4 nodes]
One Switch Rack = One Plane
• 4 Clos switches in one rack
• 256 nodes per plane (8 racks)
• Wrap-around in x and y directions
• 128 + 128 links in z direction

[Diagram: plane layout in x, y, and z; the 4 nodes per switch are not shown]
Cplant 2000: “Antarctica”
[Diagram: switch planes connected to the open, classified, and unclassified networks; compute nodes swing between red, black, or green; wrap-around links, z links, and nodes are not shown]
Cplant 2000: “Antarctica” cont.
• 1056 + 256 + 256 nodes = ~1600 nodes, 1.5 TFlops
• 320 “64-port” switches + 144 16-port switches from Siberia
• 40 + 16 system support stations
Portals
• Data movement layer from SUNMOS and PUMA
• Flexible building blocks for supporting many protocols
• Elementary constructs that support MPI semantics well
• Low-level message-passing layer (not a wire protocol)
• API intended for library writers, not application programmers
• Tech report SAND99-2959
Interface Concepts
• One-sided operations
  – Put and Get
• Zero-copy message passing
  – Increased bandwidth
• OS bypass
  – Reduced latency
• Application bypass
  – No polling, no threads
  – Reduced processor utilization
  – Reduced software complexity
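To make these concepts concrete, here is a minimal, self-contained C sketch of a one-sided put. The types and the put() helper are invented for illustration and are not the Portals API: the initiator deposits data directly into an exposed memory region, and completion is recorded in an event queue the target can inspect whenever it likes, with no receive call, polling thread, or OS involvement on the target side.

```c
#include <stdio.h>
#include <string.h>

#define EQ_SLOTS 8

typedef struct {              /* memory descriptor: a region exposed for remote access */
    void   *start;
    size_t  length;
} mem_desc;

typedef struct {              /* event queue: completions deposited by the message layer */
    size_t written[EQ_SLOTS];
    int    count;
} event_queue;

/* One-sided put: copy len bytes into the target descriptor at offset and
 * record an event; the target application takes no action here. */
static int put(mem_desc *target, event_queue *eq,
               const void *src, size_t len, size_t offset)
{
    if (offset + len > target->length)
        return -1;                       /* would overrun the exposed region */
    memcpy((char *)target->start + offset, src, len);
    if (eq->count < EQ_SLOTS)
        eq->written[eq->count++] = len;  /* completion event for the target */
    return 0;
}

int main(void)
{
    char target_buf[64] = {0};
    mem_desc    md = { target_buf, sizeof target_buf };
    event_queue eq = { {0}, 0 };

    put(&md, &eq, "hello", 6, 0);        /* initiator-side call only */

    /* The target checks its event queue whenever it likes (application bypass). */
    printf("events: %d, data: %s\n", eq.count, target_buf);
    return 0;
}
```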
MPP Network: Paragon and Tflops
[Diagram: two processors and memory share a memory bus; the network interface is on the memory bus, and the second processor acts as a message-passing or computational co-processor]
Commodity: Myrinet
[Diagram: processor and memory sit on the memory bus; a bridge connects the memory bus to the PCI bus, where the NIC attaches to the network; the network is far from the memory, and OS bypass talks to the NIC directly]
“Must” Requirements
• Common protocols (MPI, system protocols)
• Portability
• Scalability to 1000’s of nodes
• High performance
• Multiple process access
• Heterogeneous processes (binaries)
• Runtime independence
• Memory protection
• Reliable message delivery
• Pairwise message ordering
“Will” Requirements
• Operational API
• Zero-copy MPI
• Myrinet
• Sockets implementation
• Unrestricted message size
• OS bypass, application bypass
• Put/Get
“Will” Requirements (cont.)
• Packetized implementations
• Receive uses start and length
• Receiver managed
• Sender managed
• Gateways
• Asynchronous operations
• Threads
“Should” Requirements
• No message alignment restrictions
• Striping over multiple channels
• Socket API
• Implement on ST
• Implement on VIA
• No consistency/coherency
• Ease of use
• Topology information
Portal Addressing
[Diagram: Portal addressing structures; on the Portal API side of the operational boundary, a portal table points to match lists and memory descriptors with an event queue, while the memory regions they describe live in application space]
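A rough C sketch of how these pieces might hang together. The struct names and fields below are assumptions made for illustration, not the actual Portals 3.0 definitions: a portal table entry selects a match list, each match entry carries match bits and a memory descriptor, and a descriptor can reference an event queue while the memory it describes lives in application space.

```c
#include <stdint.h>
#include <stddef.h>

#define PORTAL_TABLE_SIZE 8

struct event_queue;                      /* provided by the message layer */

typedef struct mem_desc {
    void               *start;           /* application memory region            */
    size_t              length;
    int                 unlink_on_use;   /* remove the descriptor after one use? */
    struct event_queue *eq;              /* where completions are recorded       */
} mem_desc;

typedef struct match_entry {
    uint64_t            match_bits;      /* compared against the incoming header */
    uint64_t            ignore_bits;     /* bits excluded from the comparison    */
    mem_desc           *md;              /* first (or only) memory descriptor    */
    struct match_entry *next;            /* next entry in the match list         */
} match_entry;

typedef struct {
    match_entry *match_list[PORTAL_TABLE_SIZE];  /* one match list per portal index */
    unsigned     drop_count;                     /* messages that found no home     */
} portal_table;

int main(void)
{
    static char         buf[1024];
    static mem_desc     md    = { buf, sizeof buf, 1, NULL };
    static match_entry  me    = { 0x2a, 0, &md, NULL };
    static portal_table table;

    table.match_list[3] = &me;    /* portal index 3 now accepts match bits 0x2a */
    return 0;
}
```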
Portal Address Translation
[Flow chart: when a message arrives, walk the match list. If there are no more match entries, increment the drop count and discard the message. If an entry matches and its first memory descriptor accepts the message, perform the operation; unlink the MD if requested, unlink the match entry if it is now empty, record an event if an event queue is attached, and exit. Otherwise move on to the next match entry.]
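The same walk written out in C, as a sketch under simplifying assumptions (one memory descriptor per match entry, a counter standing in for the event queue); the helper and field names are invented, not the Cplant implementation.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct mem_desc {
    size_t length;            /* size of the attached memory region       */
    size_t offset;            /* how much of it has been consumed         */
    bool   unlink_on_use;     /* remove the descriptor after one message? */
    bool   has_event_queue;   /* is an event queue attached?              */
} mem_desc;

typedef struct match_entry {
    uint64_t            match_bits;
    mem_desc           *md;       /* simplification: one MD per match entry */
    struct match_entry *next;
} match_entry;

/* Does the first (only) memory descriptor accept a message of this size? */
static bool md_accepts(const mem_desc *md, size_t msg_len)
{
    return md != NULL && md->offset + msg_len <= md->length;
}

/* Walk the match list for one incoming message.  Returns true if the message
 * found a home, false if it was discarded (and the drop count incremented). */
static bool translate(match_entry **list, uint64_t msg_bits, size_t msg_len,
                      unsigned *drop_count, unsigned *events)
{
    for (match_entry **me = list; *me != NULL; me = &(*me)->next) {
        if ((*me)->match_bits != msg_bits)    /* Match?  No: next match entry */
            continue;
        mem_desc *md = (*me)->md;
        if (!md_accepts(md, msg_len))         /* First MD accepts?  No: next  */
            continue;
        md->offset += msg_len;                /* perform the operation        */
        if (md->has_event_queue)
            ++*events;                        /* record an event              */
        if (md->unlink_on_use) {              /* unlink the MD, and the match */
            (*me)->md = NULL;                 /* entry too, since it is now   */
            *me = (*me)->next;                /* empty                        */
        }
        return true;
    }
    ++*drop_count;                            /* no entry accepted the message */
    return false;                             /* discard it                    */
}

int main(void)
{
    mem_desc     md   = { 128, 0, true, true };
    match_entry  me   = { 0x2a, &md, NULL };
    match_entry *list = &me;
    unsigned drops = 0, events = 0;

    translate(&list, 0x2a, 64, &drops, &events);   /* delivered; ME unlinked  */
    translate(&list, 0x2a, 64, &drops, &events);   /* list now empty; dropped */
    printf("events=%u drops=%u\n", events, drops);
    return 0;
}
```

Running the sketch delivers the first message and drops the second, mirroring the two exit paths of the flow chart.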
Implementing MPI
• Short message protocol
  – Send message (expect receipt)
  – Unexpected messages
• Long message protocol
  – Post receive
  – Send message
  – On ACK or Get, release message
• Event includes the memory descriptor
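A hedged sketch, in C, of the choice between the two protocols. The cutoff value, function names, and stubs are assumptions made for illustration; this is not the Cplant MPI port, only the short-versus-long decision it describes.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdbool.h>

#define EAGER_LIMIT 8192   /* assumed cutoff: a short message fits in one packet */

/* Stand-ins for the data-movement layer (stubs so the sketch runs). */
static void send_eager(const void *buf, size_t len, int dest)
{ (void)buf; printf("eager send of %zu bytes to node %d\n", len, dest); }

static void send_rendezvous_hdr(size_t len, int dest)
{ printf("rendezvous header for %zu bytes to node %d\n", len, dest); }

static bool ack_or_get_arrived(int dest)
{ (void)dest; return true; }   /* pretend the receiver already pulled the data */

static void mpi_style_send(const void *buf, size_t len, int dest)
{
    if (len <= EAGER_LIMIT) {
        /* Short protocol: send right away; if no receive is posted yet, the
         * receiver buffers it as an unexpected message. */
        send_eager(buf, len, dest);
        return;
    }
    /* Long protocol: announce the message, keep the buffer alive until the
     * receiver ACKs or issues a Get for the data, then release it. */
    send_rendezvous_hdr(len, dest);
    while (!ack_or_get_arrived(dest))
        ;   /* real code would drive a progress engine here, not spin */
}

int main(void)
{
    static char small_msg[64] = "hello";
    static char large_msg[65536];

    mpi_style_send(small_msg, sizeof small_msg, 1);   /* short protocol */
    mpi_style_send(large_msg, sizeof large_msg, 1);   /* long protocol  */
    return 0;
}
```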
Implementing MPI (cont.)
[Diagram: the MPI match list on a portal; pre-posted receives sit ahead of a "mark" entry (match none), followed by "match any" entries with buffers for short messages (unlink on use) that catch unexpected messages, and a final 0-length, truncate, no-ACK entry; completions are recorded in the event queue]
Flow Control
• Basic flow control
  – Drop messages that the receiver is not prepared for
• Long messages might waste network resources
  – Good performance for well-behaved MPI apps
• Managing the network – packets
  – Packet size big enough to hold a short message
  – First packet is an implicit RTS
  – Flow-control ACK can indicate that the message will be dropped
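As an illustration of the packet-level flow control, a small self-contained C sketch follows. The packet layout, payload size, and ACK values are invented for the example; the point is only that the first packet carries the total message length (an implicit RTS) and the receiver's flow-control ACK can tell the sender that the rest will be dropped.

```c
#include <stdio.h>
#include <stddef.h>

#define PACKET_PAYLOAD 2048   /* assumed: large enough to hold a short message */

typedef struct {
    size_t total_len;         /* full message length: makes packet 1 an implicit RTS */
    size_t offset;            /* where this packet's payload lands                   */
    char   payload[PACKET_PAYLOAD];
} packet;

typedef enum { ACK_SEND_REST, ACK_WILL_DROP } fc_ack;

/* Receiver-side decision on the first packet: can the whole message land? */
static fc_ack first_packet_ack(const packet *p, size_t posted_space)
{
    return (p->total_len <= posted_space) ? ACK_SEND_REST : ACK_WILL_DROP;
}

int main(void)
{
    packet p = { .total_len = 10 * PACKET_PAYLOAD, .offset = 0, .payload = {0} };

    /* The receiver only has room for one packet's worth of data, so the
     * flow-control ACK tells the sender not to waste the network on the rest. */
    fc_ack a = first_packet_ack(&p, PACKET_PAYLOAD);
    printf("%s\n", a == ACK_SEND_REST ? "send rest" : "will drop");
    return 0;
}
```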
Portals 3.0 Status
• Currently testing Cplant Release 0.5
  – Portals 3.0 kernel module using the RTS/CTS module over Myrinet
  – Port of MPICH 1.2.0 over Portals 3.0
• TCP/IP reference implementation ready
• Port to LANai begun
http://www.cs.sandia.gov/cplant