TRANSCRIPT
FAST-OS: Petascale Single System Image
Presented by
Jeffrey S. Vetter (PI), Nikhil Bhatia, Collin McCurdy, Phil Roth, Weikuan Yu
Future Technologies Group
Computer Science and Mathematics Division
Petascale single system image
• What?
Make a collection of standard hardware look like one big machine,
in as many ways as feasible
• Why?
Provide a single solution to address all forms of clustering
Simultaneously address availability, scalability, manageability, and usability
• Components
Virtual cluster using contemporary hypervisor technology
Performance and scalability of parallel shared root file system
Paging behaviors of applications, i.e., reducing TLB misses
Reducing “noise” from operating systems at scale
Virtual clusters using hypervisors
• Physical nodes run the hypervisor in privileged mode.
• Virtual machines run on top of the hypervisor.
• Virtual machines act as compute nodes hosting user-level distributed-memory (e.g., MPI) tasks.
• Virtualization enables dynamic cluster management via live migration of compute nodes (see the sketch after the figure below).
• Provides capabilities for dynamic load balancing and cluster management, ensuring high availability.
[Figure: 16 physical nodes (Physical Node 0 through 15), each running a Xen hypervisor; each hypervisor hosts several VMs, and each VM runs one MPI task (Task 0 through Task 63).]
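The live-migration capability can be driven from the command line. As a minimal sketch (assuming the Xen 3.x `xm` toolstack; the domain and host names are hypothetical, and this is not the project's own tooling):

```python
import subprocess

def migrate_vm(domain: str, target_host: str) -> None:
    """Live-migrate a running Xen domain to another physical node.

    Wraps the Xen 3.x `xm migrate --live` command, which copies the
    VM's memory while it keeps running, pausing only briefly at the end.
    """
    subprocess.run(["xm", "migrate", "--live", domain, target_host],
                   check=True)

# Hypothetical names: move the VM hosting an MPI task off a busy node.
migrate_vm("compute-vm-05", "phys-node-12")
```

Because the MPI task lives inside the VM, it moves with it; no application-level checkpointing is needed for this kind of rebalancing.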
Virtual cluster management
• Single system view of the cluster
• Cluster performance diagnosis
• Dynamic node addition and removal, ensuring high availability
• Dynamic load balancing across the entire cluster
• Innovative load-balancing schemes based on distributed algorithms (see the sketch after the figure below)
[Figure: software stack, repeated across nodes: Intel x86 hardware, Xen hypervisor, virtual machine, virtual application.]
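One family of distributed schemes works by local, neighbor-only decisions, so no global coordinator is needed. A minimal sketch of a diffusion-style step (all names here are illustrative placeholders, not Virunga APIs):

```python
# Diffusion-style load-balancing step: each node looks only at its
# neighbors' loads and offloads a VM when it is well above the local
# neighborhood average. Illustrative sketch; not actual Virunga code.

THRESHOLD = 1.25  # act only when 25% above the neighborhood average

def balance_step(local_load: float,
                 neighbor_loads: dict[str, float]) -> str | None:
    """Return the least-loaded neighbor to migrate a VM to, or None."""
    if not neighbor_loads:
        return None
    avg = (local_load + sum(neighbor_loads.values())) / (1 + len(neighbor_loads))
    if local_load > THRESHOLD * avg:
        return min(neighbor_loads, key=neighbor_loads.get)
    return None
```

Each node running such a step independently tends to even out load across the cluster without a single point of failure.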
Virunga—virtual cluster manager
[Figure: Virunga architecture. Each compute node runs a Virunga daemon. The monitoring station (login node) runs the Virunga client, comprising a performance data parser, a performance data presenter, and a system administration manager.]
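The poster does not detail Virunga's protocol; purely to illustrate the daemon/client split in the figure, here is a hedged sketch of a compute-node daemon pushing samples to the monitoring station (the address, port, message format, and reporting interval are all assumptions):

```python
import json
import socket
import time

MONITOR = ("login-node", 9090)  # hypothetical monitoring-station address

def report_once() -> None:
    """Send one performance sample from this compute node."""
    sample = {
        "host": socket.gethostname(),
        "ts": time.time(),
        # 1-minute load average from the kernel
        "load1": float(open("/proc/loadavg").read().split()[0]),
    }
    with socket.create_connection(MONITOR) as s:
        s.sendall(json.dumps(sample).encode() + b"\n")

while True:  # daemon loop: sample and report periodically
    report_once()
    time.sleep(5)
```

On the monitoring station, the performance data parser would consume these messages and feed the presenter and the system administration manager.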
Paging behavior of DOE applications
1. With performance counters and simulation:
• Show that the HPCC benchmarks, meant to characterize the memory behavior of HPC applications, do not exhibit the same behavior in the presence of paging hardware as scientific applications of interest to the Office of Science
• Offer insight into why that is the case
2. Use memory system simulation to determine whether large-page performance will improve with the next generation of Opteron processors
Goal: Understand the behavior of applications in large-scale systems composed of commodity processors
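Why page size matters comes down to TLB reach: the number of TLB entries times the page size bounds how much memory can be mapped without a miss. A back-of-the-envelope calculation (the entry counts below are illustrative round numbers, not measured Opteron parameters):

```python
# TLB reach = number of TLB entries x page size.
# Entry counts are illustrative, not exact Opteron figures.
def tlb_reach(entries: int, page_bytes: int) -> int:
    return entries * page_bytes

small = tlb_reach(512, 4 * 1024)        # e.g., 512 entries of 4 KiB pages
large = tlb_reach(8, 2 * 1024 * 1024)   # e.g., 8 entries of 2 MiB pages

print(small // (1024 * 1024), "MiB reach with 4 KiB pages")   # 2 MiB
print(large // (1024 * 1024), "MiB reach with 2 MiB pages")   # 16 MiB
```

An application that streams through many large arrays concurrently can exhaust the small-page reach almost immediately, which is the kind of behavior the counter results below probe.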
Experimental results (from performance counters)
• Performance: trends for large vs. small pages are nearly opposite
• TLBs: significantly different miss rates
[Figure: hardware-counter results with large vs. small pages, one column per metric (cycles, TLB misses, L2 TLB hits; log scale, 0.0001–1.0). HPCC benchmark panels: MPIRANDOM, STREAM, RANDOMFFT, DGEMM, PTRANS, HPL, MPIFFT. Application panels: CAM, POP, HYCOM, GTC, GYRO, LMP, AMBER.]
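For reference, comparable counter data can be collected on Linux with the general-purpose `perf` tool (this is an illustration, not necessarily the instrumentation used for the figure; event availability varies by processor):

```python
import subprocess

# Measure cycles and data-TLB load misses for one benchmark run.
# "cycles" and "dTLB-load-misses" are standard perf event aliases.
subprocess.run(
    ["perf", "stat", "-e", "cycles,dTLB-load-misses", "./hpcc"],
    check=True,
)
```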
Reuse distances (simulated)
Patterns are clearly significantly different…
[Figure: simulated reuse-distance histograms (percentage of accesses vs. reuse distance), large vs. small pages. HPCC benchmark panels (RANDOM, STREAM, PTRANS, HPL) span distances 1–65536; application panels (GTC, HYCOM, POP, CAM) span 1–1024.]
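Reuse distance (stack distance) counts the distinct addresses touched between two accesses to the same address; its histogram, as simulated above, cleanly separates streaming patterns from reuse-heavy ones. A compact sketch of the measurement itself (a simple O(N·M) LRU-stack version; production simulators use tree-based structures, and for paging studies the trace would hold page numbers rather than raw addresses):

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Yield the reuse distance of each access in an address trace.

    Distance = number of distinct addresses touched since the previous
    access to the same address; first-time accesses yield infinity.
    """
    stack = OrderedDict()  # least recently used first, most recent last
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            yield len(keys) - 1 - keys.index(addr)  # entries above addr
            del stack[addr]
        else:
            yield float("inf")
        stack[addr] = None  # (re)insert as most recently used

# Streaming never repeats (all inf); a tight loop has small distances.
print(list(reuse_distances([1, 2, 3, 1, 2, 3])))
# -> [inf, inf, inf, 2, 2, 2]
```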
Paging conclusions
• HPCC benchmarks are not representative of paging
behavior of typical DOE applications.
Applications access many more arrays concurrently.
• Simulation results (not shown) indicate the following:
New paging hardware in next-generation Opteron
processors will improve large page performance.
• Performance is near that with paging turned off.
However, simulations also indicate that the mere presence
of a TLB is likely degrading performance, whether paging
hardware is on or off.
• More research into implications of paging, and of
commodity processors in general, on performance of
scientific applications is required.
Parallel root file system
• Goals of study
Use a parallel file system to implement a shared root environment
Evaluate performance of parallel root file system
Evaluate the benefits of high-speed interconnects
Understand root I/O access pattern and potential scaling limits
• Current status
RootFS implemented using NFS, PVFS-2, Lustre, and GFS
RootFS distributed using ramdisk via etherboot
• Modified the mkinitrd program locally
• Modified init scripts to mount the root file system at boot time (see the sketch after this list)
Evaluation with parallel benchmarks (IOR, b_eff_io, NPB I/O)
Evaluation with emulated loads of I/O accesses for RootFS
Evaluation of high-speed interconnects for Lustre-based RootFS
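The init-script modification amounts to mounting the parallel file system as the root before control passes to /sbin/init. A hedged sketch of that step for the Lustre case (the MGS address, file system name, and mount point are made-up examples; the actual work was done by locally modifying mkinitrd output):

```python
import subprocess

# Mount a Lustre-backed root file system, as an initrd init script would
# before switching root. Server NID, fsname, and mount point are examples.
MGS = "mgs-node@tcp0:/rootfs"  # <MGS NID>:/<fsname>
subprocess.run(["mount", "-t", "lustre", MGS, "/sysroot"], check=True)
# ...the init sequence would then switch_root to /sysroot and exec init.
```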
Example results: Parallel benchmarks
IOR read/write throughput
[Figure: IOR read/write bandwidth (MiB/s) vs. transfer size (8–256 KiB) at a 256 MB block, for 1, 4, 9, 16, and 25 PEs. Panels: NFS write (0–100 MiB/s), NFS read (0–120), PVFS2 write (0–50), PVFS2 read (0–250), Lustre write (0–50), Lustre read (0–800).]
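The plots sweep transfer size at a fixed block size. A sketch of reproducing one such curve with IOR's documented -a/-b/-t flags (the launcher arguments, process count, and binary path are assumptions):

```python
import subprocess

# Sweep IOR transfer sizes at a fixed 256 MiB block, writing then reading.
for xfer in ["8k", "16k", "32k", "64k", "128k", "256k"]:
    subprocess.run(
        ["mpirun", "-np", "16", "./IOR",
         "-a", "POSIX",   # POSIX I/O API
         "-b", "256m",    # per-process block size
         "-t", xfer,      # transfer size under test
         "-w", "-r"],     # write phase, then read phase
        check=True,
    )
```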
Performance with synthetic I/O accesses
Startup with different images
[Figure: startup execution time (sec) vs. number of processors (2–22) for 16K, 16M, and 1G images, followed by execution time (sec) and CPU utilization (%) vs. number of nodes (2–20) for four synthetic RootFS loads: "cvs co mpich2", "diff mpich2", "tar jcf linux-2.6.17.13", and "tar jxf linux-2.6.17.13.tar.bz2", each comparing Lustre over TCP vs. InfiniBand (Lustre-TCP-Time, Lustre-TCP-CPU, Lustre-IB-Time, Lustre-IB-CPU).]
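Each synthetic load is an ordinary RootFS-heavy command timed at each node count; a minimal timing wrapper (how CPU utilization was sampled is not shown on the slide, so only the wall-clock part is sketched here):

```python
import subprocess
import time

def timed(cmd: list[str]) -> float:
    """Run one synthetic RootFS load and return wall-clock seconds."""
    t0 = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - t0

# One of the loads from the plots above, run from the shared root.
print(timed(["tar", "jxf", "linux-2.6.17.13.tar.bz2"]))
```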
Contacts
Jeffrey S. Vetter
Principal Investigator
Future Technologies Group
Computer Science and Mathematics Division
(865) 356-1649

Nikhil Bhatia, (865) 241-1535, [email protected]
Phil Roth, (865) 241-1543, [email protected]
Collin McCurdy, (865) 241-6433, [email protected]
Weikuan Yu, (865) 574-7990, [email protected]