TRANSCRIPT
FAST-OS: Petascale Single System Image
Presented by
Jeffrey S. Vetter (PI), Nikhil Bhatia, Collin McCurdy, Phil Roth, Weikuan Yu
Future Technologies Group
Computer Science and Mathematics Division
Petascale single system image
• What?
Make a collection of standard hardware look like one big machine,
in as many ways as feasible
• Why?
Provide a single solution to address all forms of clustering
Simultaneously address availability, scalability, manageability, and usability
• Components
Virtual cluster using contemporary hypervisor technology
Performance and scalability of parallel shared root file system
Paging behaviors of applications, i.e., reducing TLB misses
Reducing “noise” from operating systems at scale
Virtual clusters using hypervisors
• Physical nodes run the hypervisor in privileged mode.
• Virtual machines run on top of the hypervisor.
• Virtual machines act as compute nodes hosting user-level distributed-memory (e.g., MPI) tasks.
• Virtualization enables dynamic cluster management via live migration of compute nodes (see the sketch after the figure below).
• Provides capabilities for dynamic load balancing and cluster management, ensuring high availability.
[Figure: 16 physical nodes (Physical Node 0 through 15), each running a Xen hypervisor; each hypervisor hosts several VMs, and each VM runs one MPI task (Task 0 through Task 63).]
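The live-migration capability can be driven from the command line. As a minimal sketch (assuming the Xen 3.x `xm` toolstack; the domain and host names are hypothetical, and this is not the project's own tooling):

```python
import subprocess

def migrate_vm(domain: str, target_host: str) -> None:
    """Live-migrate a running Xen domain to another physical node.

    Wraps the Xen 3.x `xm migrate --live` command, which copies the
    VM's memory while it keeps running, pausing only briefly at the end.
    """
    subprocess.run(["xm", "migrate", "--live", domain, target_host],
                   check=True)

# Hypothetical names: move the VM hosting an MPI task off a busy node.
migrate_vm("compute-vm-05", "phys-node-12")
```

Because the MPI task lives inside the VM, it moves with it; no application-level checkpointing is needed for this kind of rebalancing.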
Virtual cluster management
• Single system view of the cluster
• Cluster performance diagnosis
• Dynamic node addition and removal, ensuring high availability
• Dynamic load balancing across the entire cluster
• Innovative load-balancing schemes based on distributed algorithms (see the sketch after the figure below)
[Figure: software stack, repeated across nodes: Intel x86 hardware, Xen hypervisor, virtual machine, virtual application.]
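One family of distributed schemes works by local, neighbor-only decisions, so no global coordinator is needed. A minimal sketch of a diffusion-style step (all names here are illustrative placeholders, not Virunga APIs):

```python
# Diffusion-style load-balancing step: each node looks only at its
# neighbors' loads and offloads a VM when it is well above the local
# neighborhood average. Illustrative sketch; not actual Virunga code.

THRESHOLD = 1.25  # act only when 25% above the neighborhood average

def balance_step(local_load: float,
                 neighbor_loads: dict[str, float]) -> str | None:
    """Return the least-loaded neighbor to migrate a VM to, or None."""
    if not neighbor_loads:
        return None
    avg = (local_load + sum(neighbor_loads.values())) / (1 + len(neighbor_loads))
    if local_load > THRESHOLD * avg:
        return min(neighbor_loads, key=neighbor_loads.get)
    return None
```

Each node running such a step independently tends to even out load across the cluster without a single point of failure.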
Virunga—virtual cluster manager
[Figure: Virunga architecture. Each compute node runs a Virunga daemon. The monitoring station (login node) runs the Virunga client, comprising a performance data parser, a performance data presenter, and a system administration manager.]
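The poster does not detail Virunga's protocol; purely to illustrate the daemon/client split in the figure, here is a hedged sketch of a compute-node daemon pushing samples to the monitoring station (the address, port, message format, and reporting interval are all assumptions):

```python
import json
import socket
import time

MONITOR = ("login-node", 9090)  # hypothetical monitoring-station address

def report_once() -> None:
    """Send one performance sample from this compute node."""
    sample = {
        "host": socket.gethostname(),
        "ts": time.time(),
        # 1-minute load average from the kernel
        "load1": float(open("/proc/loadavg").read().split()[0]),
    }
    with socket.create_connection(MONITOR) as s:
        s.sendall(json.dumps(sample).encode() + b"\n")

while True:  # daemon loop: sample and report periodically
    report_once()
    time.sleep(5)
```

On the monitoring station, the performance data parser would consume these messages and feed the presenter and the system administration manager.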
Paging behavior of DOE applications
1. With performance counters and simulation:
• Show that the HPCC benchmarks, meant to characterize the memory behavior of HPC applications, do not exhibit the same behavior in the presence of paging hardware as scientific applications of interest to the Office of Science
• Offer insight into why that is the case
2. Use memory system simulation to determine whether large-page performance will improve with the next generation of Opteron processors
Goal: Understand the behavior of applications in large-scale systems composed of commodity processors
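Why page size matters comes down to TLB reach: the number of TLB entries times the page size bounds how much memory can be mapped without a miss. A back-of-the-envelope calculation (the entry counts below are illustrative round numbers, not measured Opteron parameters):

```python
# TLB reach = number of TLB entries x page size.
# Entry counts are illustrative, not exact Opteron figures.
def tlb_reach(entries: int, page_bytes: int) -> int:
    return entries * page_bytes

small = tlb_reach(512, 4 * 1024)        # e.g., 512 entries of 4 KiB pages
large = tlb_reach(8, 2 * 1024 * 1024)   # e.g., 8 entries of 2 MiB pages

print(small // (1024 * 1024), "MiB reach with 4 KiB pages")   # 2 MiB
print(large // (1024 * 1024), "MiB reach with 2 MiB pages")   # 16 MiB
```

An application that streams through many large arrays concurrently can exhaust the small-page reach almost immediately, which is the kind of behavior the counter results below probe.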
Experimental results (from performance counters)
• Performance: trends for large vs. small pages are nearly opposite
• TLBs: significantly different miss rates
[Figure: hardware-counter results with large vs. small pages, one column per metric (cycles, TLB misses, L2 TLB hits; log scale, 0.0001–1.0). HPCC benchmark panels: MPIRANDOM, STREAM, RANDOMFFT, DGEMM, PTRANS, HPL, MPIFFT. Application panels: CAM, POP, HYCOM, GTC, GYRO, LMP, AMBER.]
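For reference, comparable counter data can be collected on Linux with the general-purpose `perf` tool (this is an illustration, not necessarily the instrumentation used for the figure; event availability varies by processor):

```python
import subprocess

# Measure cycles and data-TLB load misses for one benchmark run.
# "cycles" and "dTLB-load-misses" are standard perf event aliases.
subprocess.run(
    ["perf", "stat", "-e", "cycles,dTLB-load-misses", "./hpcc"],
    check=True,
)
```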
Reuse distances (simulated)
Patterns are clearly significantly different…
[Figure: simulated reuse-distance histograms (percentage of accesses vs. reuse distance), large vs. small pages. HPCC benchmark panels (RANDOM, STREAM, PTRANS, HPL) span distances 1–65536; application panels (GTC, HYCOM, POP, CAM) span 1–1024.]
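Reuse distance (stack distance) counts the distinct addresses touched between two accesses to the same address; its histogram, as simulated above, cleanly separates streaming patterns from reuse-heavy ones. A compact sketch of the measurement itself (a simple O(N·M) LRU-stack version; production simulators use tree-based structures, and for paging studies the trace would hold page numbers rather than raw addresses):

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Yield the reuse distance of each access in an address trace.

    Distance = number of distinct addresses touched since the previous
    access to the same address; first-time accesses yield infinity.
    """
    stack = OrderedDict()  # least recently used first, most recent last
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            yield len(keys) - 1 - keys.index(addr)  # entries above addr
            del stack[addr]
        else:
            yield float("inf")
        stack[addr] = None  # (re)insert as most recently used

# Streaming never repeats (all inf); a tight loop has small distances.
print(list(reuse_distances([1, 2, 3, 1, 2, 3])))
# -> [inf, inf, inf, 2, 2, 2]
```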
Paging conclusions
• HPCC benchmarks are not representative of paging
behavior of typical DOE applications.
Applications access many more arrays concurrently.
• Simulation results (not shown) indicate the following:
New paging hardware in next-generation Opteron
processors will improve large page performance.
• Performance is near that with paging turned off.
However, simulations also indicate that the mere presence
of a TLB is likely degrading performance, whether paging
hardware is on or off.
• More research into implications of paging, and of
commodity processors in general, on performance of
scientific applications is required.
Parallel root file system
• Goals of study
Use a parallel file system to implement a shared root environment
Evaluate performance of parallel root file system
Evaluate the benefits of high-speed interconnects
Understand root I/O access pattern and potential scaling limits
• Current status
RootFS implemented using NFS, PVFS-2, Lustre, and GFS
RootFS distributed using ramdisk via etherboot
• Modified the mkinitrd program locally
• Modified init scripts to mount the root file system at boot time (see the sketch after this list)
Evaluation with parallel benchmarks (IOR, b_eff_io, NPB I/O)
Evaluation with emulated loads of I/O accesses for RootFS
Evaluation of high-speed interconnects for Lustre-based RootFS
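The init-script modification amounts to mounting the parallel file system as the root before control passes to /sbin/init. A hedged sketch of that step for the Lustre case (the MGS address, file system name, and mount point are made-up examples; the actual work was done by locally modifying mkinitrd output):

```python
import subprocess

# Mount a Lustre-backed root file system, as an initrd init script would
# before switching root. Server NID, fsname, and mount point are examples.
MGS = "mgs-node@tcp0:/rootfs"  # <MGS NID>:/<fsname>
subprocess.run(["mount", "-t", "lustre", MGS, "/sysroot"], check=True)
# ...the init sequence would then switch_root to /sysroot and exec init.
```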
Example results: Parallel benchmarks
IOR read/write throughput
[Figure: IOR read/write bandwidth (MiB/s) vs. transfer size (8–256 KiB) at a 256 MB block, for 1, 4, 9, 16, and 25 PEs. Panels: NFS write (0–100 MiB/s), NFS read (0–120), PVFS2 write (0–50), PVFS2 read (0–250), Lustre write (0–50), Lustre read (0–800).]
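The plots sweep transfer size at a fixed block size. A sketch of reproducing one such curve with IOR's documented -a/-b/-t flags (the launcher arguments, process count, and binary path are assumptions):

```python
import subprocess

# Sweep IOR transfer sizes at a fixed 256 MiB block, writing then reading.
for xfer in ["8k", "16k", "32k", "64k", "128k", "256k"]:
    subprocess.run(
        ["mpirun", "-np", "16", "./IOR",
         "-a", "POSIX",   # POSIX I/O API
         "-b", "256m",    # per-process block size
         "-t", xfer,      # transfer size under test
         "-w", "-r"],     # write phase, then read phase
        check=True,
    )
```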
Performance with synthetic I/O accesses
Startup with different images
[Figure: startup execution time (sec) vs. number of processors (2–22) for 16K, 16M, and 1G images, followed by execution time (sec) and CPU utilization (%) vs. number of nodes (2–20) for four synthetic RootFS loads: "cvs co mpich2", "diff mpich2", "tar jcf linux-2.6.17.13", and "tar jxf linux-2.6.17.13.tar.bz2", each comparing Lustre over TCP vs. InfiniBand (Lustre-TCP-Time, Lustre-TCP-CPU, Lustre-IB-Time, Lustre-IB-CPU).]
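Each synthetic load is an ordinary RootFS-heavy command timed at each node count; a minimal timing wrapper (how CPU utilization was sampled is not shown on the slide, so only the wall-clock part is sketched here):

```python
import subprocess
import time

def timed(cmd: list[str]) -> float:
    """Run one synthetic RootFS load and return wall-clock seconds."""
    t0 = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - t0

# One of the loads from the plots above, run from the shared root.
print(timed(["tar", "jxf", "linux-2.6.17.13.tar.bz2"]))
```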
Contacts
Jeffrey S. Vetter
Principal Investigator
Future Technologies Group
Computer Science and Mathematics Division
(865) 356-1649

Nikhil Bhatia, (865) 241-1535, [email protected]
Phil Roth, (865) 241-1543, [email protected]
Collin McCurdy, (865) 241-6433, [email protected]
Weikuan Yu, (865) 574-7990, [email protected]