6/28/98 SPAA/PODC 1
High-Performance Clusters, Part 2: Generality
David E. Culler
Computer Science Division
U.C. Berkeley
PODC/SPAA Tutorial
Sunday, June 28, 1998
What’s Different about Clusters?
• Commodity parts?
• Communications Packaging?
• Incremental Scalability?
• Independent Failure?
• Intelligent Network Interfaces?
• Fast Scalable Communication?
=> Complete System on every node
– virtual memory
– scheduler
– file system
– ...
Topics: Part 2
• Virtual Networks
– communication meets virtual memory
• Scheduling
• Parallel I/O
• Clusters of SMPs
• VIA
General purpose requirements
• Many timeshared processes
– each with direct, protected access
• User and system
• Client/Server, parallel clients, parallel servers
– they grow, shrink, handle node failures
• Multiple packages in a process
– each may have its own internal communication layer
• Use communication as easily as memory
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with own protection domain.
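The endpoint and virtual-network abstractions above can be sketched in a few lines. This is an illustrative model, not the Berkeley implementation: the names `EndpointName`, `Endpoint`, and `VirtualNetwork` are invented here, and the protection check is reduced to simple membership in the virtual network.

```python
# Sketch (assumed structure, not the real system): endpoints are named
# communication slots; a virtual network is the set of endpoints that
# may name each other, each carrying its own protection domain tag.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EndpointName:
    node: int       # physical node hosting the endpoint
    index: int      # endpoint slot on that node

@dataclass
class Endpoint:
    name: EndpointName
    protection_key: int                 # per-endpoint protection domain
    inbox: list = field(default_factory=list)

class VirtualNetwork:
    """A collection of endpoints that can name each other."""
    def __init__(self):
        self.members: dict = {}

    def attach(self, ep: Endpoint):
        self.members[ep.name] = ep

    def send(self, src: EndpointName, dst: EndpointName, msg):
        # Only members of the same virtual network may communicate.
        if src not in self.members or dst not in self.members:
            raise PermissionError("endpoint not in this virtual network")
        self.members[dst].inbox.append((src, msg))

vn = VirtualNetwork()
a = Endpoint(EndpointName(0, 0), protection_key=7)
b = Endpoint(EndpointName(1, 0), protection_key=7)
vn.attach(a)
vn.attach(b)
vn.send(a.name, b.name, "hello")
print(b.inbox)  # [(EndpointName(node=0, index=0), 'hello')]
```

A process can hold many such endpoints, each attached to a different virtual network, which is what lets multiple packages in one process keep separate communication layers.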
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
– active portion of the large logical space is bound to physical resources
[Diagram: processes 1..n on a node, each with endpoints in paged host memory; active endpoints are bound to NIC memory on the network interface]
Endpoint Transition Diagram
• COLD: paged host memory
• WARM: paged host memory, R/O
• HOT: NIC memory, R/W
• Transitions: read, write, msg arrival, evict, swap
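The transition diagram can be written as a tiny table-driven state machine. The edge-to-state assignment below is an interpretation of the figure (the slide names the edges but the exact arrows are not fully recoverable): reads warm a cold endpoint, writes or message arrival make it hot, eviction demotes hot to warm, and swapping pages a warm endpoint back out.

```python
# Toy state machine for the endpoint residency states.
# Edge interpretation is an assumption, marked in the lead-in above.
TRANSITIONS = {
    ("COLD", "read"): "WARM",
    ("COLD", "write"): "HOT",
    ("WARM", "write"): "HOT",
    ("WARM", "msg_arrival"): "HOT",
    ("WARM", "swap"): "COLD",
    ("HOT", "evict"): "WARM",
}

def step(state: str, event: str) -> str:
    # Events with no listed edge leave residency unchanged.
    return TRANSITIONS.get((state, event), state)

s = "COLD"
for ev in ["read", "msg_arrival", "evict", "swap"]:
    s = step(s, ev)
print(s)  # COLD
```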
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to driver
– using a system endpoint
[Diagram: NIC with endpoint frames 0–7, transmit and receive queues, and an endpoint-miss path]
Solaris System Abstractions
Segment Driver
• manages portions of an address space
Device Driver
• manages an I/O device
Virtual Network Driver
LogP Performance
• Competitive latency
• Increased NIC processing
• Difference mostly
– ack processing
– protection check
– data structures
– code quality
• Virtualization cheap
[Chart: LogP parameters (g, L, Or, Os) in µs, 0 to 16 µs scale, for gam vs. AM]
Bursty Communication among Many
[Diagram: multiple clients sending message bursts to servers, with work between bursts]
[Charts: msgs/sec (0 to 18,000) vs. number of clients (0 to 16), and burst bandwidth in msgs/sec (0 to 70,000) vs. number of clients (0 to 20)]
Multiple VNs, Single-Thread Server
[Chart: aggregate msgs/s (0 to 80,000) vs. number of virtual networks (1 to 28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Multiple VNs, Multithreaded Server
[Chart: aggregate msgs/s (0 to 80,000) vs. number of virtual networks (1 to 28), for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
Perspective on Virtual Networks
• Networking abstractions are vertical stacks
– new function => new layer
– poke through for performance
• Virtual networks provide a horizontal abstraction
– basis for building new, fast services
• Open questions
– What is the communication “working set”?
– What placement, replacement, … ?
Beyond the Personal Supercomputer
• Able to timeshare parallel programs
– with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems
– parallel file system, network virtual memory, …
• Nodes have a powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel
Local Scheduling
• Schedulers act independently w/o global control
• Program waits while trying to communicate with its peers that are not running
• 10 - 100x slowdowns for fine-grain programs!
=> need coordinated scheduling
[Diagram: four processors P1–P4 timesharing jobs A, B, C under uncoordinated local schedulers, so A's processes rarely run at the same time]
Explicit Coscheduling
• Global context switch according to precomputed schedule
• How do you build it? Does it work?
[Diagram: a master node driving a global context switch across P1–P4, so all of A's processes run together, then B's and C's]
Typical Cluster Subsystem Structures
[Diagrams: master-slave structure (applications over a local service (LS) on each node, coordinated by a single master over the communication layer) vs. peer-to-peer structure (a global service (GS) component paired with each local service, coordinating with its peers)]
Ideal Cluster Subsystem Structure
• Obtain coordination without explicit subsystem interaction, only the events in the program
– very easy to build
– potentially very robust to component failures
– inherently “service on-demand”
– scalable
• Local service component can evolve.
[Diagram: peer-to-peer structure with the global service component alongside each node's local service, coordinating only through program events]
Three approaches examined in NOW
• GLUNIX: explicit master-slave (user level)
– matrix algorithm to pick PP
– uses stops & signals to try to force the desired PP to run
• Explicit peer-peer scheduling assist with VNs
– co-scheduling daemons decide on PP and kick the Solaris scheduler
• Implicit
– modify the parallel run-time library to allow it to get itself co-scheduled with the standard scheduler
[Diagrams: the three structures, GLUNIX master-slave, peer-to-peer with co-scheduling daemons, and implicit coordination inside each node]
Problems with explicit coscheduling
• Implementation complexity
• Need to identify parallel programs in advance
• Interacts poorly with interactive use and load imbalance
• Introduces new potential faults
• Scalability
Why implicit coscheduling might work
• Active message request-reply model
• Infer non-local state from local observations; react to maintain coordination
observation implication action
fast response partner scheduled spin
delayed response partner not scheduled block
[Diagram: workstations WS 1–4; WS 1 runs Job A and issues a request; peers running Job B spin on the response or sleep until Job A is scheduled everywhere]
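The observation/implication/action rule above reduces to a one-line decision. The sketch below is illustrative: the function name and the specific threshold (baseline round trip plus one context switch) are assumptions chosen to mirror the table, not the paper's tuned policy.

```python
# Implicit-coscheduling inference: a fast reply implies the partner is
# scheduled (keep spinning); a delayed reply implies it is not (block
# so the local scheduler can run someone else).
def react(observed_rtt_us: float, baseline_rtt_us: float,
          context_switch_us: float) -> str:
    # Threshold is illustrative: roughly a round trip plus a context
    # switch before concluding the partner is descheduled.
    threshold = baseline_rtt_us + context_switch_us
    return "spin" if observed_rtt_us <= threshold else "block"

print(react(12.0, baseline_rtt_us=10.0, context_switch_us=50.0))   # spin
print(react(500.0, baseline_rtt_us=10.0, context_switch_us=50.0))  # block
```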
Obvious Questions
• Does it work?
• How long do you spin?
• What are the requirements on the local scheduler?
How Long to Spin?
• Answer: round-trip time + context switch + msg processing
– round trip to stay scheduled together
– plus wake-up to get scheduled together
– keep spinning if serving messages
» interval of 3 × wake-up
[Diagram: WS 1 and WS 2 timelines, spin-wait of 2L+4o to stay coscheduled, or 2L+4o+W when waiting for a wakeup]
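The slide's spin budget written out in LogP terms, as a check on the arithmetic. The parameter values below are illustrative, not measured.

```python
# Spin budget from the slide: 2L + 4o covers a request and reply
# (each message pays latency L plus send and receive overheads o);
# add the wake-up cost W when waiting for a partner to get scheduled.
def spin_budget_us(L: float, o: float, W: float = 0.0) -> float:
    return 2 * L + 4 * o + W

rtt = spin_budget_us(L=5.0, o=3.0)                   # 2*5 + 4*3
with_wakeup = spin_budget_us(L=5.0, o=3.0, W=50.0)   # plus wake-up
print(rtt, with_wakeup)  # 22.0 72.0
```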
Does it work?
Synthetic Bulk-synchronous Apps
• Range of granularity and load imbalance
– spin-wait: 10x slowdown
With mixture of reads
• Block-immediate: 4x slowdown
Timesharing Split-C Programs
Many Questions
• What about – mix of jobs?
– sequential jobs?
– unbalanced placement?
– Fairness?
– Scalability?
• How broadly can implicit coordination be applied in the design of cluster subsystems?
• Can resource management be completely decentralized?
– Computational economies, ecologies
A look at Serious File I/O
• Traditional I/O system
• NOW I/O system
• Benchmark problem: sort a large number of 100-byte records with 10-byte keys
– start on disk, end on disk
– accessible as files (use the file system)
– Datamation sort: 1 million records
– Minute sort: quantity sorted in a minute
[Diagram: traditional I/O system vs. NOW I/O system of processor-memory (P-M) nodes, each with local disks]
NOW-Sort Algorithm
• Read
– N/P records from disk -> memory
• Distribute
– scatter keys to processors holding result buckets
– gather keys from all processors
• Sort
– partial radix sort on each bucket
• Write
– write records to disk
(2-pass version: gather data runs onto disk, then local, external merge sort)
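The one-pass read/distribute/sort/write structure can be sketched as below. This is a simplification under stated assumptions: nodes are plain Python lists, distribution hashes on the first key byte rather than the real sampled partitioning, and Python's `sort` stands in for the tuned partial radix sort.

```python
import os

# One-pass NOW-Sort skeleton: each "node" reads its records, scatters
# every record to the node owning its key range, then sorts its bucket.
# Record format follows the benchmark: 10-byte key, 90-byte payload.
def now_sort(per_node_records, num_nodes):
    buckets = [[] for _ in range(num_nodes)]
    # Distribute: scatter records to the processor holding the bucket
    # for that key range (first key byte, a simplifying assumption).
    for records in per_node_records:
        for rec in records:
            buckets[rec[0] * num_nodes // 256].append(rec)
    # Sort: each node sorts its own bucket locally.
    for b in buckets:
        b.sort(key=lambda r: r[:10])
    return buckets

recs = [[os.urandom(10) + b"x" * 90 for _ in range(100)] for _ in range(4)]
out = now_sort(recs, num_nodes=4)
flat = [r for b in out for r in b]
print(flat == sorted(flat, key=lambda r: r[:10]))  # True
```

Because the buckets partition the key space in order, concatenating the locally sorted buckets yields a globally sorted result, which is why no merge pass is needed in the one-pass variant.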
Key Implementation Techniques
• Performance Isolation: highly tuned local disk-to-disk sort
– manage local memory
– manage disk striping
– memory-mapped I/O with madvise, buffering
– manage overlap with threads
• Efficient Communication
– completely hidden under disk I/O
– competes for I/O bus bandwidth
• Self-tuning Software
– probe available memory, disk bandwidth, trade-offs
Minute Sort
[Chart: gigabytes sorted in a minute (0 to 9) vs. number of processors (0 to 100), NOW-Sort vs. SGI Power Challenge and SGI Origin]
World-Record Disk-to-Disk Sort
• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
• but only in the wee hours of the morning
Towards a Cluster File System
• Remote disk system built on a virtual network
[Charts: read/write rates of roughly 5–6 MB/s for local vs. remote disk, and client/server CPU utilization (0% to 40%)]
[Diagram: client linked with RDlib talking to an RD server over active msgs]
Streaming Transfer Experiment
[Diagrams: four streaming-transfer configurations, Local, P3FS Local, P3FS Reverse, and P3FS Remote, mapping processors P0–P3 to disks 0–3]
Results
• Data distribution affects resource utilization, not delivered bandwidth
[Chart: rate (roughly 5–6 MB/s) and client/server CPU utilization (0% to 40%) across the four access methods: Local, P3FS Local, P3FS Reverse, P3FS Remote]
I/O Bus crossings
[Diagrams: memory/processor/NIC datapaths showing I/O bus crossings for parallel scan and parallel sort, each with (a) local disk vs. (b) remote disk]
Opportunity: PDISK
• Producers dump data into an I/O river
• Consumers pull it out
• Hash data records across disks
• Match producers to consumers
• Integrated with work scheduling
[Diagram: producers and consumers joined by fast communication (remote queues) and fast I/O (streaming disk queues)]
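The "I/O river" idea above can be sketched as a set of per-disk queues that decouple producers from consumers. The class and method names are invented for the sketch, and in-memory deques stand in for streaming disk queues.

```python
from collections import deque

# Toy I/O river: producers hash records across per-disk queues;
# any consumer pulls from whichever queue has data, so producers
# and consumers are matched dynamically rather than statically.
class River:
    def __init__(self, num_disks: int):
        self.queues = [deque() for _ in range(num_disks)]

    def put(self, record: bytes):
        # Hash data records across disks.
        self.queues[hash(record) % len(self.queues)].append(record)

    def get(self):
        # Consumers pull from any non-empty queue.
        for q in self.queues:
            if q:
                return q.popleft()
        return None

river = River(num_disks=4)
for i in range(8):
    river.put(f"rec{i}".encode())
pulled = [river.get() for _ in range(8)]
print(sorted(pulled))
```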
What will be the building block?
[Diagram: SMPs with memory and network cards on a memory interconnect, joined through a network cloud]
[Chart: design space of number of nodes vs. processors per node, with SMPs, NOWs, and clusters of SMPs]
Multi-Protocol Communication
• Uniform Prog. Model is key
• Multiprotocol Messaging
– careful layout of msg queues
– concurrent objects
– polling the network hurts memory
• Shared Virtual Memory
– relies on underlying msgs
• Pooling vs. contention
[Diagram: send/write and receive/read paths through a common communication layer over both shared memory and the network]
LogP analysis of shared mem AM
[Chart: LogP parameters (latency, send overhead, receive overhead, gap, round-trip time) in microseconds for shared-memory AM under different locks: test&set, test&test&set, ticket, Anderson, Posix, lock-free]
Virtual Interface Architecture
[Diagram: applications (Sockets, MPI, legacy) over the VI User Agent (“libvia”); open, connect, and map-memory requests go through the VIA kernel driver (slow path), while descriptor reads/writes and doorbells go straight to the VI-capable NIC (fast path); each VI has send and receive queues plus a completion queue]
VIA Implementation Overview
[Diagram: host memory holds doorbell pages, descriptor queues, and Tx/Rx data buffers in kernel memory mapped to the application; send path: (1) write doorbell, (2) DMA request, (3) DMA read of descriptor, (4) DMA request, (5) DMA read of data, … (7) DMA write on completion]
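The descriptor-queue-plus-doorbell send path can be sketched as follows. All names are invented for the sketch, and a Python list stands in for the wire; the DMA steps collapse into a single queue pop on the simulated NIC side.

```python
from collections import deque

# Sketch of the VIA send path: the application posts a descriptor
# into a queue in host memory and rings a mapped doorbell; the NIC
# later notices the doorbell, fetches descriptor and data (the DMA
# steps in the figure), transmits, and posts a completion.
class VirtualInterface:
    def __init__(self):
        self.send_queue = deque()   # descriptor queue in host memory
        self.doorbell = 0           # mapped doorbell the app writes
        self.completions = deque()

    def post_send(self, buffer: bytes):
        # Application side: write descriptor, ring doorbell.
        self.send_queue.append({"buf": buffer, "len": len(buffer)})
        self.doorbell += 1

    def nic_service(self, wire: list):
        # NIC side: drain doorbells, "DMA" each descriptor + data,
        # transmit, and record a completion.
        while self.doorbell > 0:
            desc = self.send_queue.popleft()
            wire.append(desc["buf"])        # stand-in for DMA + Tx
            self.completions.append(desc)
            self.doorbell -= 1

vi = VirtualInterface()
wire = []
vi.post_send(b"hello")
vi.post_send(b"world")
vi.nic_service(wire)
print(wire, len(vi.completions))  # [b'hello', b'world'] 2
```

Even in this toy form, the per-message cost of building a descriptor and having the NIC parse it shows why the deck argues descriptor queues hinder low-latency short messages.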
Current VIA Performance
[Chart: round-trip time in µs (0 to 600) vs. message size (4 to 4060 bytes) for VIA, UNET, and AM2]
VIA ahead
• You will be able to buy decent clusters
• Virtualization in host memory is easy
– will it go beyond pinned regions?
– still need to manage active endpoints (doorbells)
• Complex descriptor queues will hinder low-latency short messages
– NICs will chew on them, but many instructions remain on the host
• Need to re-examine where error handling, flow control, retry are performed
• Interactions with scheduling, I/O, locking etc. will dominate application speed-up
– will demand new development methodologies
Conclusions
• Complete system on every node makes clusters a very powerful architecture
– can finally get serious about I/O
• Extend the system globally
– virtual memory systems,
– schedulers,
– file systems, ...
• Efficient communication enables new solutions to classic systems challenges
• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN
– where SPAA and PODC meet