The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
RUNTIME SUPPORT FOR EFFECTIVE MEMORY MANAGEMENT
IN LARGE-SCALE APPLICATIONS
A Thesis in
Computer Science and Engineering
by
Murali N. Vilayannur
© 2005 Murali N. Vilayannur
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
August 2005
The thesis of Murali N. Vilayannur has been reviewed and approved* by the following:
Anand Sivasubramaniam, Professor of Computer Science and Engineering, Thesis Co-Adviser, Co-Chair of Committee
Mahmut Kandemir, Associate Professor of Computer Science and Engineering, Thesis Co-Adviser, Co-Chair of Committee
Padma Raghavan, Associate Professor of Computer Science and Engineering
Natarajan Gautam, Associate Professor of Industrial and Manufacturing Engineering
Rajeev Thakur, Computer Scientist at Argonne National Laboratory, Special Member
Robert Ross, Computer Scientist at Argonne National Laboratory, Special Member
Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering
*Signatures are on file in the Graduate School.
Abstract
As processor speeds continue to advance at a rapid pace, accesses to the I/O subsystem
are increasingly becoming the bottleneck in the performance of large-scale applications. In spite
of technological advances in peripheral devices, provisioning and maintaining large buffers in
memory remains a crucial technique for achieving good performance, but it is effective only if we
can achieve good hit rates. This thesis describes runtime system support to determine what
should go into an I/O cache and when to avoid accessing it, as well as techniques to improve the
hit ratio itself by choosing appropriate candidate cache blocks for eviction/replacement. Such
techniques apply equally to explicitly and implicitly I/O intensive applications
that access data either through a file-system interface or through the virtual memory interface.
While the aforementioned techniques can boost the performance of a single I/O intensive
application, an important consideration that needs to be addressed for practical reasons is
the effect of multi-programming, where multiple applications are run simultaneously for better
resource utilization. The thesis concludes with the design and implementation of a runtime
scheduling strategy, built on top of an unmodified process scheduler in the operating system,
that ensures the performance of large-scale applications does not degrade in multi-programmed
scenarios.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2. Explicitly I/O Intensive Applications . . . . . . . . . . . . . . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 PVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Overview of System Architecture . . . . . . . . . . . . . . . . . . 15
2.3.3 Performance of Primitives and Micro-Benchmarks . . . . . . . . . 21
2.3.4 Cache Bypass Mechanisms . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Compiler-directed Cache Bypass . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Runtime Cache Bypass . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Concluding Remarks and Future Work . . . . . . . . . . . . . . . . . . . . 43
Chapter 3. Implicitly I/O Intensive Applications . . . . . . . . . . . . . . . . . . . . 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Experimental Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Experimental Platform . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Characterization Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Towards a Better Replacement Algorithm: Predictive Replacement . . . . 67
3.5.1 Estimation Techniques with Hardware Support . . . . . . . . . . . 71
3.5.2 OS-Implementable Estimation Technique . . . . . . . . . . . . . . 75
3.5.3 Results with Predictive Replacement Techniques . . . . . . . . . . 76
3.5.4 Comparison with EELRU . . . . . . . . . . . . . . . . . . . . . . 77
3.5.5 Performance of DP-Approx . . . . . . . . . . . . . . . . . . . . . 82
3.5.6 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 4. Synergistic Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3 Experimental Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.1 Experimental Platform . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Scheduling Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.1 Heuristics for task set selection . . . . . . . . . . . . . . . . . . . . 98
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5.1 Underload Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.2 Overload Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Chapter 5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
List of Tables
2.1 Read times (in ms) for different request sizes and number of IODs (|IOD|). . . 24
2.2 Write times (in ms) for different request sizes and number of IODs (|IOD|). . 26
3.1 Description of applications: The Total Memory column indicates the total/maximum
memory that is used by the application, and the Simulated Memory column in-
dicates the simulated memory size that was used for the characterization. . . . . 55
3.2 Threshold values of applications. . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1 Description of applications: The Total Memory column indicates the total/maximum
memory that is used by the application. . . . . . . . . . . . . . . . . . . . . . 97
List of Figures
1.1 Code fragment to illustrate the different I/O programming models . . . . . . . 4
2.1 System architecture. Nodes 1..n are the clients where one or more application
processes run, and have a local cache present. Upon a miss, requests are either
directed to the global cache (one such entity for a file), or are sent directly to
IOD node(s) containing the data in the disk(s). . . . . . . . . . . . . . . . . . . 22
2.2 Graph showing the minimum required hit-rate at global cache for good perfor-
mance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Micro-benchmark running on 1 node, File striped on one IOD, s = 2, (a)o = 10,
(b) o = 600. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Micro-benchmark running on 1 node, File striped on one IOD, s = 2, (a) o =
25000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Micro-benchmark running on 2 nodes, File striped on three IODs, s = 2, (a) o
= 10, (b) o = 600. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Micro-benchmark running on 2 nodes, File striped on three IODs, s = 2, (a) o
= 25000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 tomcatv: impact of problem size (a) Global cache is 20MB, (b) Global cache
size is 200MB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8 Impact of global cache size for a problem size of 1500. (a) tomcatv, (b) vpenta . 35
2.9 vpenta: impact of problem size (a) Global cache size is 20MB, (b) Global
cache size is 200MB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.10 Runtime cache bypassing (global cache size is 20 MB) (a) tomcatv. (b)
vpenta. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.11 (a) Impact of the threshold value for tomcatv. (b) vpenta: Impact of seg-
ment size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.12 tomcatv: Variation of Cache Hit rates with problem size . . . . . . . . . . . . 41
2.13 vpenta: Variation of Cache Hit rates with problem size . . . . . . . . . . . . 41
2.14 Benefits of runtime bypassing on application traces. . . . . . . . . . . . . . . . 44
2.15 Application traces: Variation of Cache Hit rates . . . . . . . . . . . . . . . . 44
3.1 Page-fault characterization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Ratio of D/L+D measured for NPB as (a) References, (b) Faults. . . . . . . . . 57
3.3 Ratio of D/L+D measured for SPEC2000 as (a) References, (b) Faults. . . . . . 58
3.4 Where does time go? (a) NPB, (b) SPEC2000 . . . . . . . . . . . . . . . . . . 58
3.5 Absolute differences between successive L distances measured as (a) NPB2.3 -
Total Memory References, (b) NPB2.3 - Faults to other pages, (c) SPEC2000 -
Total Memory References, (d) SPEC2000 - Faults to other pages. . . . . . . . . 61
3.6 Differences between successive L distances measured as (a) NPB - Total Mem-
ory References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total Memory
References, (d) SPEC2000 - Faults to other pages. . . . . . . . . . . . . . . . 62
3.7 Absolute differences of successive (L+D) distances measured as (a) NPB - Total
Memory References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total
Memory References, (d) SPEC2000 - Faults to other pages. . . . . . . . . . . . 63
3.8 NPB: CDF of L distance measured as (a) References, (b) Faults. . . . . . . . . 64
3.9 MG: Variation of L distance with time measured as (a) References, (b) Faults. . 65
3.10 SP: Variation of L distance with time measured as (a) References, (b) Faults. . . 66
3.11 Coefficient of Variance of L for each page (a) IS, (b) MG, (c) SP. . . . . . . . . 67
3.12 Coefficient of Variance of L for each page (a) FT, (b) BT, (c) LU. . . . . . . . . 68
3.13 Coefficient of Variance of L distance for each page (a) WUPWISE, (b) MCF,
(c) APSI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.14 Normalized page-fault counts of the replacement algorithm for SPEC 2000 with
respect to perfect LRU scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.15 Normalized page-fault counts of the replacement algorithm for NPB 2.3 with
respect to perfect LRU scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.16 Normalized invocation counts of the replacement algorithm for (a) SPEC 2000,
(b) NPB 2.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.17 Comparison of the best prediction-based replacement algorithm with EELRU
for SPEC 2000 using the ratio of page-faults in comparison to the perfect LRU
scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.18 Comparison of the best prediction-based replacement algorithm with EELRU
for NPB 2.3 using the ratio of page-faults in comparison to the perfect LRU
scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.19 Normalized page-fault reduction of DP-Approx algorithm in comparison to
Linux kernel 2.4.20 execution for (a) SPEC 2000, (b) NPB 2.3 . . . . . . . . . 83
3.20 SPEC 2000 (a) Sensitivity of DP-Approx to parameter “a”, (b) Prediction ac-
curacy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.21 NPB 2.3 (a) Sensitivity of DP-Approx to parameter “a”, (b) Prediction accuracy. 85
4.1 Synergy Scheduler Design Alternatives (a) Using a kernel-module based ap-
proach, (b) Using a user-level probe process based approach . . . . . . . . . . 99
4.2 Variation of working set size with simulation time for (a) IS (σ = 0.4 Million)
(b) FT (σ = 15 Million) (c) CG (σ = 14 Million) (d) MG (σ = 56 Million) (e)
SP (σ = 18 Million) (f) EP (σ = 68 Million) (g) LU (σ = 21 Million) (h) BT (σ
= 76 Million) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3 Underload: (a) Execution Time (in seconds) measured as the time taken from
job start till completion (b) Normalized execution time that is measured as the
ratio of the job completion time to the batch processing execution time. . . . . 107
4.4 Underload: (a) Overall CPU utilization (b) Context Switches (c) Overall Major
Page Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5 Overload: (a) Execution Time (seconds) measured as the time from job start till
completion (b) Normalized Slowdown measured as the ratio of the job comple-
tion time with the batch scheduling. . . . . . . . . . . . . . . . . . . . . . . . 111
4.6 Overload: (a) Overall CPU utilization (b) Context Switches (c) Overall Major
Page Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Acknowledgments
Finally, I am reaching the point when I can take a deep breath and see the light at the
end of the tunnel! The last six years at Penn State have been the most challenging, exciting
and rewarding roller-coaster ride of my life. There were several trying moments, but I am glad
I pulled through. Several people have contributed to molding me as a researcher and, more
importantly, a better human being.
I am deeply indebted to my thesis advisors, Dr. Anand Sivasubramaniam and Dr. Mah-
mut Kandemir. Both of them are dedicated researchers, genuine well-wishers for their students
and very good human beings. In the past six years, they have played several important roles
in my life and taught me numerous things on both the technical and non-technical fronts. I am
so thankful to them for placing trust in me when I lacked self-confidence and motivating me
to work harder and aim higher. Without their help, it is hard for me to imagine how my thesis
would have taken shape or how my career could have been launched. I would also like to thank
NSF, the funding agency that funded the bulk of my graduate program.
The three semesters that I spent in the Mathematics and Computer Science Division at
Argonne National Laboratory were the best thing that could have happened to me. I am extremely
grateful to my two mentors, Dr. Rajeev Thakur and Dr. Robert Ross, for encouraging me to aim
higher; their implicit trust helped boost my self-confidence tremendously. Both of them
are great researchers and excellent colleagues to work with. Our collaboration has gone well
beyond the two summers, and, as some of the well-known practitioners in this field, their input
and suggestions will hopefully bring some of the ideas that have come out of my thesis to fruition.
I also wish to thank my thesis committee members and referees: Dr. Padma Raghavan,
Dr. Natarajan Gautam, Dr. Rajeev Thakur, and Dr. Robert Ross, who took time out of their busy
schedules to write recommendation letters for me and serve on my thesis committee. Their
strong support and valuable comments have made my job search much easier and helped refine
my dissertation.
I have been very lucky to have a group of excellent senior colleagues and lab-mates,
who were also my best friends and mentors at Penn State: Mangesh Kasbekar, Shailabh Nagar,
Ning An, Yanyong Zhang, Chun Liu, Gokul Kandiraju, Sudhanva Gurumurthi, Amitayu Das,
Angshuman Parashar, Vivek Natarajan, Shiva Chaitanya, Jianyong Zhang, Partho Nath, Saurabh
Agarwal, Balaji Viswanathan, Giridhar Viswanathan and Vivek Bhanu. The list of friends that I
made at Penn State is endless, and I am positive I have missed a dozen names from the
above list, but one thing is for sure: they will always be a part of my memories. I will always
have a special place in my memories for three of my friends (Gokul Kandiraju, Birjoo Vaishnav
and Deepak Ramrakhyani) whom I regard not merely as friends but as part of my family. Their help,
encouragement, and discussions on technical and spiritual matters are something that I will cherish for
the rest of my life. Whenever I had a doubt or question about my research, our discussions gave
me new insights, and helped me solve the problem. Whenever I felt down, they were always
there to cheer me up. My life has been so enjoyable because of them.
None of the work described in this thesis would have been possible without the prompt
attention of the lab support team at Penn State. I am deeply indebted to Eric Prescott, Nate
Coraor, John Domico, David Heidrich and Barbara for their dedication and prompt attention to
administering all the machines in the department. All the secretaries in the office extended so
much warmth and help that they really made my stay in the department pleasurable, especially
Vicki, who manages to do so much work and yet remains cheerful all the time. Without her help,
I cannot imagine how things would have turned out for me.
I am grateful to my parents and my brother. Their love and care have always been with
me during these years. Their trust and encouragement pulled me through what appeared to be
some of the worst times early on.
Finally and most importantly, I am so grateful to be a part of the Art of Living Family
and come under the presence of my spiritual Guru, Sri Sri Ravi Shankar.
Chapter 1
Introduction
Many large-scale scientific applications today are increasingly data-intensive, manipulating
large disk-resident data sets ranging from megabytes to terabytes. For example, medical
imaging, data analysis and mining, video processing, global climate modeling, computational
physics and chemistry can easily involve data sets that are too large to fit in main memory
[23, 25]. Typically, such applications treat main memory simply as an intermediate stage of the
memory hierarchy and the bulk of the data that they manipulate usually resides on secondary
storage devices like disks. Recent trends in computer architecture [27] show that networks of
workstations – also referred to as clusters – are the dominant force for delivering high perfor-
mance for such challenging applications cost effectively. This has been made possible in part
due to the rapid advances in processor and networking technology (such as [5, 66]). In these ar-
chitectures, the multiple CPUs and their memories can provide processing and primary storage
parallelism, while the multiple disks (one or more at each workstation, or on a network) can pro-
vide secondary storage parallelism for both data access and data transfer. As processor speeds
continue to advance at a rapid pace, accesses to the I/O subsystem are increasingly becoming
the bottleneck in the performance of large-scale applications that manipulate huge datasets. This
gap between CPU and I/O performance is exacerbated as we move to multiprocessor and cluster
systems, where the compute power is potentially multiplied by the available number of CPUs.
Therefore, optimizing I/O performance is of critical importance.
Large buffers in memory (referred to as caches throughout this thesis) are one way of
alleviating this problem, provided we can achieve good hit rates. However, unlike the tradi-
tional instruction/data caches that are provisioned in the hardware of processor architectures, I/O
caches are implemented in software, managed in main memory and have much higher overheads.
Further, the levels of I/O caching on some of the parallel environments (including clusters) can
span machine boundaries, requiring network messages for cache accesses. A large body of work
([10, 16, 21, 24, 28, 30, 31, 45, 46, 58, 64, 71, 75, 96] and the references contained therein) has
dealt with various aspects of I/O caches (design, replacement algorithms, prefetching, sharing
and partitioning, and so on). This thesis describes the management of I/O caches,
which is critical for alleviating the I/O bottlenecks, and discusses compiler and runtime system
support for managing them. In particular, this thesis proposes compiler and runtime system
support to determine what should go into an I/O cache and when we should avoid accessing
it, apart from improving the hit rate itself by choosing appropriate candidate cache blocks for
eviction/replacement. Further, we also propose a simple runtime mechanism that ensures that
the performance of such applications does not degrade in multi-programming scenarios.
Typically, large-scale I/O intensive applications, such as those described above, are coded
to access and manipulate their data sets in one of three different ways:
• Using explicit I/O calls (e.g., the POSIX read/write/lseek interface in UNIX) to
stage data between memory and peripheral devices.
• Using the paged virtual memory system (e.g., the POSIX mmap interface in UNIX) to
handle transparent migration of data sets between main memory and disk. This model of
computation can be termed in-core since the programmer assumes that all the data fits in
main memory and the system cooperates to preserve that assumption.
• Using explicit I/O calls to manage in-core data sets rather than relying on the virtual mem-
ory system. This model of computation can be termed out-of-core since the programmer
explicitly stages the data to/from the peripheral devices. Writing programs using this
model of computation is daunting since it involves significant restructuring of the in-core
version of the same code, but it offers the best performance, since application writers know
their access patterns best.
Figure 1.1 shows an example code fragment that illustrates each of the above models of computa-
tion. While the first and the third techniques appear similar, the difference is that in the latter
scheme, once the data is brought into memory, it is manipulated as if it resided entirely in main
memory and is written to disk only when the entire data set is staged out, whereas in the former,
any manipulation of data is accomplished by a sequence of lseek and read/write calls.
The focus of this thesis is twofold:
• To improve the performance of applications that have been coded to manipulate their data
sets using the first two approaches (namely explicit, in-core) from the perspective of the
I/O caches. It is important to note here that optimizations for the third approach (out-
of-core) are beyond the scope of this thesis: writing a numerically stable out-of-core
program is a burden on the programmer, and any approach to alleviating the I/O bottlenecks
in such large-scale applications has to be automated.
• To guarantee that the performance of such applications in a multi-programming scenario
does not degrade, through the design of a runtime system implementing a scheduling
strategy on top of an unmodified operating system.
Explicitly I/O intensive:

    #define NUM_ELEM 1000000
    #define NUM_IN_MEMORY 1000
    #define SZ(x) ((x) * sizeof(double))

    double A[NUM_IN_MEMORY];
    .....
    int start_core_index = -1, end_core_index = -1;

    void read_data_set(void) {
        .....
        fd = open("A.txt", O_RDWR);
        for (i = 0; i < NUM_ELEM; i++) {
            if (i < start_core_index || i > end_core_index) {
                if (start_core_index > 0)
                    write(fd, A, SZ(NUM_IN_MEMORY));
                pread(fd, A, SZ(NUM_IN_MEMORY), SZ(i));
                start_core_index = i;
                end_core_index = (i + NUM_IN_MEMORY);
            }
            .. = A[i];
            ...
            A[i] = ...
        }
        close(fd);
        .....
    }

Implicitly I/O intensive:

    int NUM_ELEM = 0;
    double *A;
    .....

    void read_data_set(void) {
        .....
        fd = open("A.txt", O_RDWR);
        fstat(fd, &statbuf);
        NUM_ELEM = statbuf.st_size / sizeof(double);
        A = (double *) mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        for (i = 0; i < NUM_ELEM; i++) {
            .. = A[i];
            ...
            A[i] = ...
        }
        munmap(A, statbuf.st_size);
        close(fd);
        .....
    }

Out of Core Computation:

    #define NUM_ELEM 1000000
    double A[NUM_ELEM];
    .....

    void read_data_set(void) {
        .....
        fd = open("A.txt", O_RDWR);
        read(fd, A, sizeof(double) * NUM_ELEM);
        for (i = 0; i < NUM_ELEM; i++) {
            .. = A[i];
            ...
            A[i] = ...
        }
        write(fd, A, sizeof(double) * NUM_ELEM);
        close(fd);
        .....
    }
Fig. 1.1. Code fragment to illustrate the different I/O programming models
For applications that explicitly read/write data to/from peripheral devices, this thesis at-
tempts to increase the effectiveness of the buffer management system by intelligently deciding
(with the help of a compiler or the operating system) which cache blocks should go into which
I/O caches, and when we should avoid accessing certain levels of the I/O cache hierarchy to
improve hit rates across all I/O caches. A prototype implementation was completed on a popular
and widely used parallel file system (PVFS [14]) on a Linux-based cluster to demonstrate the
effectiveness of this technique.
When an application that uses the paged virtual memory system accesses a page that is
not in main memory, the hardware raises an exception (a page fault) that is handled transparently
by the operating system. The OS then arranges for the page to be brought into main memory,
potentially replacing some other page to make room for the newly accessed page. Once the
page is brought into main memory, the appropriate page-table entries are updated to reflect the
new mappings. Thus, the discretionary caching techniques that were developed for explicitly
I/O intensive applications are not directly applicable in this context, i.e., the discretionary
nature of caching certain cache blocks cannot be exploited. However, it is a well-documented
characteristic [82] that the system's replacement algorithm (usually a variant of the Least-Recently-
Used algorithm) under-performs for typical scientific application memory access characteristics.
Consequently, this thesis tries to address the shortcomings of the system’s replacement algo-
rithm and proposes a novel runtime predictive application-specific replacement algorithm that is
shown to have the potential to perform better. The implementation of this algorithm was com-
pleted on a popular and widely used execution-driven simulator (Valgrind [79]) to demonstrate
its effectiveness.
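The contrast between the system's LRU-style eviction and a prediction-based choice can be sketched as follows. This is a minimal illustration of the idea only, assuming a toy page table and a deliberately naive predictor (next reference expected one inter-reference gap after the last use); it is not the estimation technique developed in Chapter 3, and the names are hypothetical.

```c
#include <stddef.h>

/* Per-page recency state, kept in virtual time (reference count). */
struct page {
    long last_use;   /* time of the most recent reference */
    long prev_use;   /* time of the reference before that */
};

/* LRU victim: the page whose last use is furthest in the past. */
size_t lru_victim(const struct page *p, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (p[i].last_use < p[victim].last_use)
            victim = i;
    return victim;
}

/* Predictive victim: evict the page whose *predicted* next reference
 * lies furthest in the future, approximating Belady's OPT. The
 * predictor here (repeat the last inter-reference gap) is a
 * placeholder assumption. */
size_t predictive_victim(const struct page *p, size_t n)
{
    size_t victim = 0;
    long best = p[0].last_use + (p[0].last_use - p[0].prev_use);
    for (size_t i = 1; i < n; i++) {
        long next = p[i].last_use + (p[i].last_use - p[i].prev_use);
        if (next > best) {
            best = next;
            victim = i;
        }
    }
    return victim;
}
```

Note that the two policies can disagree: a page referenced long ago but with a short inter-reference gap may be due to be touched again soon, so the predictive choice can keep it while LRU would evict it.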
Lastly, this thesis explores the interplay of memory management and process scheduling,
and their impact on performance. While this thesis has attempted to show that intelligent,
application-aware memory management can boost performance, this raises the challenging and
difficult problem of ensuring that the performance of such applications does not degrade when
they are executed concurrently with other jobs. Process scheduling policies implemented by
today's operating systems cause memory-intensive applications to exhibit poor performance and
throughput when the applications' combined working sets do not fit in main memory. A primary
cause is that process scheduling algorithms do not take memory working-set sizes into account.
Consequently, this thesis attempts to alleviate these shortcomings of today's operating system
schedulers, to ensure that concurrently running memory-intensive applications do not step on
each other's working sets. To demonstrate its effectiveness, a prototype implementation of this
technique was completed as a Linux kernel module.
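A minimal sketch of the kind of working-set-aware admission such a runtime layer performs is shown below: runnable jobs are considered in order and admitted only while their combined working-set sizes fit in physical memory, with the rest held back for a later round. The greedy policy and all names here are illustrative assumptions, not the heuristics evaluated in Chapter 4.

```c
#include <stddef.h>

struct job {
    long wss;      /* estimated working-set size, in bytes */
    int runnable;  /* nonzero if the job is ready to run */
};

/* Admit jobs greedily while their combined working sets fit in
 * mem_avail; admitted[i] is set to 1 for each admitted job.
 * Returns the number of jobs admitted this round. */
size_t select_task_set(const struct job *jobs, size_t n,
                       long mem_avail, int *admitted)
{
    size_t count = 0;
    long used = 0;
    for (size_t i = 0; i < n; i++) {
        admitted[i] = 0;
        if (jobs[i].runnable && used + jobs[i].wss <= mem_avail) {
            admitted[i] = 1;
            used += jobs[i].wss;
            count++;
        }
    }
    return count;
}
```

The point of the sketch is the invariant, not the policy: no admitted set is allowed to overcommit physical memory, so the admitted jobs cannot thrash each other's working sets, which batch-style co-scheduling of arbitrary job mixes does not guarantee.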
The rest of this thesis is organized as follows: Chapter 2 evaluates the proposed buffer
management techniques that were developed for explicitly I/O intensive applications. Chapter 3
looks at the replacement algorithms that were developed for scaled versions of in-core scientific
applications that are implicitly I/O intensive. Chapter 4 proposes the design and implementation
of the scheduling algorithm framework that was developed to investigate the impact of memory-
aware process scheduling on performance. Finally, Chapter 5 concludes with pointers to future
work.
Chapter 2
Explicitly I/O Intensive Applications
2.1 Introduction
As processor speeds continue to advance at a rapid pace, accesses to the I/O subsystem
are increasingly becoming the bottleneck in the performance of large-scale applications that ma-
nipulate huge datasets. While one could argue that we can use a large number of disks in parallel
for improving I/O bandwidth, the latency of seeking to the appropriate location and performing
the disk operation in addition to the overhead of the network transfer continues to hurt perfor-
mance of many applications, especially those with non-sequential access patterns. Large buffers
in memory (referred to as caches throughout this chapter) are one way of alleviating this problem,
provided we can achieve good hit rates. However, unlike the traditional instruction/data caches
that are provisioned in the hardware of processor architectures, I/O caches are implemented in
software and have much higher overheads. Further, the levels of I/O caching on some of the
parallel environments (including clusters) can span machine boundaries, requiring network mes-
sages for cache accesses. It is thus very important to be able to determine what should go into
an I/O cache and when we should avoid accessing it, apart from improving the hit rate itself.
This chapter addresses this important problem, presenting the design, implementation, and evalu-
ation of a parallel file system's I/O subsystem that provides two levels of discretionary caching
[93], and demonstrating the benefits of such discretionary caching mechanisms with compiler and
runtime optimizations.
Clusters, put together with off-the-shelf workstations/PCs and networking hardware, are
becoming the platform of choice for demanding applications because of their cost-effectiveness,
upgradability, and widespread availability. Clusters are finding their place in a plethora of en-
vironments, from academic departments to supercomputing centers and even to the commercial
world (e.g., for database, web and e-commerce applications). These platforms can benefit not
only from the constantly improving processor/memory speeds, but also from the disk capacities
and bandwidths. The multiple CPUs and their memories on these systems can provide process-
ing and primary-storage parallelism, while the multiple disks (one or more at each workstation
or on a network) can provide secondary-storage parallelism for both data access and data trans-
fer. One could either have disks attached to each cluster node with a SCSI-like interface (the
corresponding node has to be involved in data transfers to/from such disks), or have disks
accessible by everyone over a storage area network. While many of the issues/optimizations in this
work are applicable to both environments, we specifically focus on the former, which is usually
a much cheaper, and thus more prevalent, alternative for the I/O subsystem (we intend to
investigate such issues for storage area networks in the future).
Many large-scale scientific applications are data-intensive, manipulating immense disk-
resident data sets [41]. These include applications from medical imaging, data analysis and
mining, video processing, large archive maintenance, and so on. Commercial services such
as web, multimedia, and databases on clusters are also demanding on the I/O subsystem. In
addition, many high-performance environments (particularly shared clusters within a department
or a supercomputing center) not only handle one such application, but often have to deal with
several (possibly I/O intensive) applications at the same time in a time-shared manner. All these
issues make I/O optimizations an important and challenging problem for off-the-shelf clusters.
While the parallelism offered by the numerous disks in a cluster can alleviate the I/O
bandwidth problem, it does not really address the latency issue, which is largely limited by
seek, rotational and network transfer costs. Caching data blocks in memory is a well-known
way of reducing I/O latencies, provided we can achieve good hit rates. I/O caching is typically
implemented in software (not the disk/controller caches), and the overheads of cache lookup
and maintenance can become quite high. Furthermore, it has been shown in [39] that we may
need multiple levels of caching. For instance, in PPFS [39], a local cache at each node of the
parallel system caters to the individual process requests at that node and, upon a miss, goes to
a shared global cache (running on one or more nodes of the cluster), which can possibly satisfy
requests that come from different nodes. On such systems, the cost of going to the global cache—
requiring a network message—and not finding the data there (before going to the disk) might be
quite substantial. For instance, as our performance results show, this approach turns out to be
over twice as costly as directly getting the data from disk in several situations. Consequently,
it becomes extremely important to intelligently determine what to place in the caches and when
to avoid (i.e., bypass) the cache (particularly the caches whose look-up costs are higher) on I/O
requests. This largely depends on the data-access patterns of the workload. To our knowledge,
the issue of exploiting application behavior for such I/O cache optimizations on clusters has not
been studied previously. There has been similar work (e.g., [44]) in the context of hardware data
CPU caches, but the costs for I/O caching are of a much higher magnitude.
Rather than implementing all the APIs/features of a full-fledged parallel file system to
investigate these issues, we start with a publicly available parallel file system — PVFS [14] — for
Linux/Pentium clusters. We have considerably extended this system to incorporate a kernel-level
cache module at each cluster node to cater to all the requests (possibly from different applications)
coming from that node, which we refer to as the local cache. We also have implemented a shared
global cache (between processes running on different nodes of an application, or even across
applications) that runs on one or more nodes of the cluster. This global cache receives requests
from the local cache and services them. If the lookup fails in the global cache as well, the request
is forwarded to one or more nodes whose disks are used for striping the data. The experimental
results presented in this chapter are from a Pentium/Linux-based cluster of workstations. Each
node on this cluster has an 800 MHz Intel Pentium-III (Coppermine) microprocessor with 32 KB
of L1 cache, 256 KB of L2 cache, and 128 MB of PC-133 main memory. The global cache is
run on one of the nodes that contains 384 MB of main memory. Each node is also equipped with
a 20 GB Maxtor hard disk drive and a 32-bit PCI 10/100 Mbps 3Com 3c59x network interface
card. All the nodes are connected through a Linksys Etherfast 10/100 Mbps 16 port hub. Using
this experimental system, this chapter investigates/illustrates the following issues:
• We present latency numbers for file reads and writes satisfied from different levels in the cache hierarchy, and compare them to the original PVFS implementation, which does not perform any explicit caching. The results clearly demonstrate the
benefit of caching. Even when missing from the local cache, going to the global cache
and fetching the data turns out to be better than the original PVFS in most cases. How-
ever, when we go via the global cache, only to find that the data is missing there, the
overheads are significantly worse than not performing any caching altogether (as in the
original PVFS). We present experimental data showing what hit rate is needed in the global cache to justify going through it.
• After pointing out the importance of discretionary data placement in the caches and of bypassing them when needed, we discuss the mechanisms our system provides to explicitly specify whether a read/write should go through the local/global cache. The bypass capabilities can be conveyed to our caching layers through a kernel ioctl() call and can be specified by the application itself, by the compiler, or by the runtime system.
• We show how simple compiler-based techniques are quite effective in benefiting from
the caches, without incurring extra overheads, for statically analyzable applications. We
specifically present two techniques: one that determines what files should be accessed via
the cache and what files should bypass the cache (which we refer to as coarse-grain opti-
mizations), and the other that performs such discretionary accesses at a finer granularity.
• While compile time analysis can be employed in applications with statically analyzable
code, we present a simple runtime approach for determining when to bypass the cache in
situations where the codes are not readily analyzable or the sources are not available.
• All these optimizations are extensively evaluated with several applications/traces to show
how they can be beneficial for improving cache behavior for parallel I/O.
The rest of this chapter is organized as follows. The next section discusses work related to this study. Section 2.3 describes the system architecture and implementation details
of our I/O subsystem on the Linux cluster, together with some raw performance numbers. The
compiler-based and runtime-based optimizations are evaluated and compared in Sections 2.4 and
2.5. Finally, Section 2.6 summarizes the contributions of this work and discusses directions for
future work.
2.2 Related Work
Software work on high-performance I/O can be roughly divided into three categories:
parallel file systems, runtime I/O libraries, and compiler work for out-of-core computations.
A number of groups have studied automatic detection and optimization of I/O access patterns
(e.g., see [28, 47, 48, 53] and the references therein). Others have proposed parallel file systems
and I/O runtime systems that provide users/programmers with easy-to-use APIs [17, 20, 61, 76,
87]. While these systems allow users/programmers to exploit optimizations for I/O, it is still in
general the user’s responsibility to select which optimization to apply and determine the suitable
parameters for it. Obviously, this puts a great burden on users, as in most cases it is not trivial to
select what optimization(s) to use and the accompanying parameters. Our work instead tries to
bring the advantages of I/O caching without much user effort.
Compilation of I/O-intensive codes using explicit I/O has also been the focus of some re-
search (see [6, 9, 67] for example techniques that target out-of-core datasets). Brezany et al. [9]
have developed a parallel I/O system called VIPIOS that can be used by an optimizing compiler.
Bordawekar et al. [6, 7] have focused on stencil computations that can be reordered freely due to the lack of flow dependencies. They have presented several algorithms to optimize communication and to indirectly improve the I/O performance of parallel out-of-core applications. Paleczny et al. [67] have incorporated I/O compilation techniques in Fortran D. The main philosophy behind their approach is to choreograph I/O from disks along with the corresponding computation. Many of these studies, however, have specifically targeted massively parallel processors (MPPs)
and do not deal with selective data placement in caches. DPFS [80] is a parallel file system that
collects locally distributed unused storage resources as a supplement to the internal storage of a
parallel system. In contrast, our work is targeted for cluster environments with multiple levels
of caching, and not only benefits the processes of one application, but can also benefit several
applications sharing datasets (through a global cache).
There has been a considerable amount of prior work on optimizing I/O and I/O caches
[10, 21, 31, 40, 45, 46, 51, 58, 64, 71, 75, 84], some of which has been on clusters as well. Re-
cently, [16, 96] have focused on buffer-cache management policies in a multi-level buffer cache
system. Wong et al. [96] propose primitives for maintaining exclusivity in multi-level buffer caches, while Chen et al. [16] use higher-level cache eviction information to guide the placement of blocks in lower levels. Perhaps the most closely related work to ours is the set of approaches
presented in three prior systems, namely, MPI-IO [19, 32], PVFS [14], and PPFS [39]. MPI-IO
[32] is an API for parallel I/O as part of the MPI-2 standard and contains features specifically
designed for I/O parallelism and performance. This API has been implemented for a wide variety of hardware platforms, including clusters [86]. The main optimizations in MPI-IO are for noncontiguous parallel accesses to shared data, mainly at the user level. As a result, the user needs
to have a thorough understanding of the numerous programming interfaces to invoke the appro-
priate routines. Since the MPI-IO interface itself does not specify any caching functionality, its
response time is largely determined by the caching capabilities provided by the underlying file
system or the MPI-IO implementation. PVFS [14] is a parallel file system for Linux clusters that
presents three different APIs and accommodates frequently used UNIX file tools. Its optimizations for noncontiguous data are perhaps less powerful than MPI-IO's optimizations. The work
presented in this chapter augments PVFS with a local and global caching capability, benefiting
from its rich original APIs. PPFS [39] is a user-level I/O library that has been implemented
for several parallel machines and clusters. This system differs from the other two in that it of-
fers runtime/adaptive optimizations (not just an API) for caching, prefetching, data distribution
and sharing. The differences of our work from PPFS are in that we are examining the benefits
of compiler/runtime directed cache bypassing towards optimizing the hit rates of one or more
applications running on the cluster.
2.3 System Architecture
Our system builds on the architecture of the Parallel Virtual File System (PVFS) [14], since we did not want to re-invent the APIs and mechanisms for providing a shared name space across the cluster and facilities for distributing/striping file data across the disks of the cluster nodes. PVFS also provides seamless, transparent access to several existing utilities on normal file systems. Since all these provisions already exist in a publicly distributed parallel file system, we have opted to build upon it in this work rather than re-implement these features. We briefly go over some key architectural features of PVFS and then discuss our contributions.
2.3.1 PVFS
The original PVFS is a mainly user-level implementation, i.e., there is a library (libpvfs)
linked to application programs which provides a set of interface routines (API) to distribute
and retrieve data to/from the files striped across the cluster nodes. In addition to the library,
PVFS uses two other components, both of which run as daemons on one or more nodes of
the cluster. One of these is a meta-data server (called mgr), to which libpvfs sends requests
for meta-data information (access rights, directories, file attributes, etc.). In addition, there are
several instances of a data server daemon (called IOD), one on each of the machines whose disk
is being used to store the data. This daemon (again running at the user level) listens on sockets for
requests from libpvfs functions on clients to read/write data from/to its local disk using normal
Linux file system calls. There are well-defined protocols for exchanging information between
libpvfs and IODs. For instance, when the user wants to read file data that is striped across
several IODs, libpvfs converts this request into several requests (one for each IOD involved),
sends these requests to the IODs using sockets, waits for an acknowledgment from each of them,
following which it waits for the data sent by the IODs. This data is then collated and returned to
the application process. On a write, libpvfs sends out the requests, following which the relevant
data is sent to each IOD. To check for error conditions each IOD sends back an acknowledgment
indicating how much data was actually written. The reader is referred to [14] for further details
on the functioning of PVFS.
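The request-splitting step that libpvfs performs can be sketched as follows. This is a simplified Python illustration assuming PVFS's round-robin striping of fixed-size stripes across the IODs; the function name and the per-IOD request representation are ours, not PVFS's actual protocol.

```python
# Sketch: map a contiguous file read/write onto per-IOD requests, assuming
# round-robin striping. Illustrative only; PVFS's real request format differs.

def split_request(offset, length, stripe_size, num_iods):
    """Return {iod_index: [(iod-local offset, chunk length), ...]}."""
    requests = {}
    pos = offset
    end = offset + length
    while pos < end:
        stripe = pos // stripe_size          # global stripe index
        iod = stripe % num_iods              # round-robin IOD assignment
        local_stripe = stripe // num_iods    # stripe index within that IOD
        within = pos % stripe_size           # offset inside the stripe
        chunk = min(stripe_size - within, end - pos)
        requests.setdefault(iod, []).append(
            (local_stripe * stripe_size + within, chunk))
        pos += chunk
    return requests
```

For example, a 96 KB read at offset 0 with 32 KB stripes on three IODs produces exactly one 32 KB piece for each IOD, which libpvfs would then send out in parallel and collate on return.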
2.3.2 Overview of System Architecture
As mentioned earlier, we would like to build on the existing capabilities provided by
PVFS to leverage off its rich API and features. Further, we wanted to provide our caching
infrastructure in a fairly transparent fashion so that it is not even apparent to a large part of the
PVFS implementation, let alone the application. This implies that we need to intercept all the
socket calls that libpvfs makes and provide caching at that point. It should be noted that our
cache is meant only for IOD requests, and we do not cache any metadata information at this time
(i.e., they always go to the meta-data server).
Our system provides two levels of caching: a local cache at every node of the cluster where an application process executes, and a global cache that is shared by different nodes (and possibly different applications) across the cluster, with the possibility of skipping either of them
as illustrated in Figure 2.1. The design and implementation of the local cache at each node is
described in [91], and here we describe it briefly for completeness, and then concentrate on the
global cache.
Local Cache
There are two alternatives for implementing the local cache at each node. One option
is to implement the caching within the library that is linked with the application (user-level).
However, with this approach we do not have the flexibility of sharing cache data between appli-
cation processes running on the same node. This is the reason why we opted to implement the
local cache within the Linux kernel (a dynamically-loadable module), that can be shared across
all the processes running on that node. Only when the request misses in this cache (either all
or some of the request cannot be satisfied locally), is an external request initiated out of that
node, either to the global cache or to the IODs, as explained below. This cache is implemented
using open hashing with second chance LRU replacement. There is a dirty list (which keeps
track of all the cache frames that have been modified while in cache), a free list (which keeps
track of all the unused cache frames), and a buffer hash to chain used blocks for faster retrieval
and access. The hashing function takes as parameters the inode number of the file and the block
number to index the buffer hash table. There are two kernel threads in the implementation, called flusher and harvester. Writes are normally nonblocking (except the sync write explained below), and the flusher periodically propagates dirty blocks to the global cache/IOD. The harvester
is invoked whenever the number of blocks in the free list falls below a low water mark, upon
which it frees up blocks till the free list exceeds a high water mark. A block size of 4K bytes is
used in our implementation. Note that such a kernel implementation automatically allows mul-
tiple applications/processes to share this local cache, thus making more effective use of physical
memory.
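The local cache's bookkeeping can be sketched as follows. This is a simplified user-space Python model of the structures described above (buffer hash keyed by inode and block number, free list, dirty list, and second-chance eviction between water marks); the class name, water-mark defaults, and frame representation are ours, and the real implementation is a Linux kernel module with kernel threads.

```python
# Sketch: buffer hash keyed by (inode, block), a free-frame count, a dirty
# set for the flusher, and a harvester running second-chance (clock) eviction
# until the free list exceeds a high water mark. Illustrative only.

from collections import OrderedDict

class LocalCache:
    def __init__(self, nframes, low=1, high=2):
        self.low, self.high = low, high      # free-list water marks
        self.frames = OrderedDict()          # (inode, block) -> [data, ref, dirty]
        self.free = nframes                  # frames on the free list
        self.dirty = set()                   # blocks awaiting the flusher

    def access(self, inode, block, data=None):
        key = (inode, block)
        if key in self.frames:               # hit: set the reference bit
            self.frames[key][1] = True
            if data is not None:
                self.frames[key][0] = data
                self._mark_dirty(key)
            return True
        if self.free <= self.low:            # harvester kicks in below low mark
            self._harvest()
        self.free -= 1
        self.frames[key] = [data, True, data is not None]
        if data is not None:
            self.dirty.add(key)
        return False                         # miss: caller fetches remotely

    def _mark_dirty(self, key):
        self.frames[key][2] = True
        self.dirty.add(key)

    def _harvest(self):
        # Second chance: clear the reference bit on a referenced frame and
        # give it another pass; evict the first unreferenced frame, until
        # the free list reaches the high water mark.
        while self.free < self.high and self.frames:
            key, (data, ref, dirty) = next(iter(self.frames.items()))
            self.frames.move_to_end(key)
            if ref:
                self.frames[key][1] = False
            else:
                self.dirty.discard(key)      # flusher would write back first
                del self.frames[key]
                self.free += 1
```

A referenced block thus survives one clock sweep before becoming an eviction candidate, which approximates LRU at much lower bookkeeping cost.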
Global Cache
The global cache, as explained above, adds one more level to the storage hierarchy be-
fore the disk at the IOD needs to be accessed. There are numerous questions/alternatives when
implementing the global cache and we go over them in the following discussion, explaining the
rationale behind the choices we make specifically in our implementation:
• Should there be a global cache for each file, or should all files share the same cache?
While there may be some possibility for detecting access patterns across datasets for op-
timizations, our current system uses a separate global cache for each file. If there is little
file sharing across applications, or even across parallel processes of the same application,
then the requests would automatically distribute the load more evenly with this approach.
• Should each application have its own global cache, or should we share a global cache
across applications? Since we would also like to be able to perform inter-application
optimizations based on sharing patterns, we have opted to share the global cache across
applications. This can help one application (even its cold references) benefit from data brought into the cache earlier by another. There is, however, the fear of worse miss rates if there is interference because of such sharing, and these are points that our cache-bypass mechanisms address. This feature is one key difference between our system and
PPFS [39], where the global cache is intended for optimizations within the processes of a
single application. Our system does suffer from scalability issues, and performance may
start to drop beyond a particular number of client nodes due to the centralized nature of the
global cache. However, the focus of this work is on techniques for intelligent caching of
data in file-system caches, and we are looking at scalable techniques as part of our future
work. Furthermore, providing a separate global cache for each file as explained above can
ease some of this bottleneck.
• Should we distribute the global cache across the cluster? While distribution is a good idea
in terms of alleviating contention, there are a couple of drawbacks. First, depending on
the granularity of distribution, it may be difficult to perform certain optimizations (such
as prefetching) if one node is not the repository for all the file data. Second, two levels
(one between the IODs and the global caches, and one between the global caches and local
caches) of multiplexing and demultiplexing the data may be needed. We, instead, opted
to have a centralized global cache for each file. However, since we have a separate global
cache for each file, we can have separate global caches on different cluster nodes serving
different files, and that can alleviate some of the contention problems that may arise.
• Should the global cache be implemented as a user process or as a kernel module? The
reason for a kernel level implementation for the local cache is the need for trapping all
application requests coming at that node from the different processes via the PVFS calls.
However, with the global cache, TCP/IP sockets are being explicitly used for sending
messages to it from the individual local caches regardless of which application process
is making a call. The convenience and flexibility (option of busy-waiting) of a user-level
implementation has led us to implement the global cache for a specified file as a stand-
alone, user-level daemon running on a specified node of our cluster.
Each global cache in our system is, thus, a user-level process running on a cluster node and serving requests to a specific file; explicit requests are sent to it by the local caches, and it is shared by different applications. The internal data structures and activities of the global cache are
similar to those of the local cache, described above. One could designate such global caches on
different nodes (for each file), particularly on those nodes with larger physical memory (DRAM).
Consequently, this architecture is also well suited to heterogeneous clusters where one or more
nodes may have larger amounts of memory than the others.
Reads/Writes
Figure 2.1 gives a schematic overview of our system. Let us now briefly go over a
typical read operation (there could be some differences when one or more levels of caching are
disabled as discussed below) to understand how everything works when an application process
on a node makes a read call, possibly to several blocks that span different IODs. The original
PVFS library on the client aggregates the requests to a particular IOD, before making a socket
request (kernel call) to the node running that IOD. Our local cache intercepts this call in the
kernel and checks to see if all or even a part of it can be satisfied locally. If the entire request
can be satisfied without a network message, then the data is returned to the PVFS library and
the application proceeds. Otherwise, the local cache module accumulates a list of requests that
need to be fetched. A subsequent message is sent to the global cache with these requests (note that, if the global cache is bypassed, these requests are instead sent directly to the IODs).
The multi-threaded global cache keeps listening on a dedicated socket for requests, and upon
receiving such a message looks up its data structure. If it can satisfy the requests completely
from its memory, it returns the data to the requesting local cache. Otherwise, it sends a request
message to each of the IODs holding corresponding blocks, stores the blocks in its memory
when it gets responses from the IODs, and then returns the necessary data to the requesting local
cache. A write operation works similarly except that the writes are propagated in the background
(using the flusher thread described earlier), and control is returned back as soon as the writes are
buffered.
The above read and write operations are the most common, and can benefit significantly
from spatial and temporal locality in the caches. However, with the presence of multiple copies
for data blocks, there is the issue of coherence/consistency. The above read/write mechanisms
do not worry about consistency, and a read simply returns the value in a version of the block that
it finds (i.e., the write is only propagated to the global cache and IOD — any subsequent read
to the global cache/IOD will get this value, but a read from a node that already has this block
in its local cache will not get this latest value). While this may not pose a problem for many
applications, where read-write sharing is not common (as compared to read sharing) or where
consistency is explicitly managed by the application itself, there are certain applications where
ensuring consistency is critical. Consequently, in our system, we also provide a special version
of the write, called sync write, which not only propagates the writes to the global cache/IOD, but
also invalidates the local caches which have a copy (so that subsequent reads on those nodes can
go out on the network and get the latest copy). Coherence is maintained at a block granularity,
and thus requires a directory entry per block to keep track of the local caches that have a current
copy of that block. We maintain this directory at both the global cache and the IOD. The need for the latter will become clearer when we discuss global-cache bypassing. The actual set of
local caches with a copy would involve merging these two directory entries for a block. On
a system where there is no global-cache bypassing (all requests go via the global cache), the
directory at the IOD would be empty. Local caches that bypass the global cache would update
the directory at the IOD rather than at the global cache. A sync write is thus an additional
overhead (over normal writes), involving looking up the directory entries, and invalidating any
copies, in addition to propagating the write itself. It would thus be more prudent to use the
normal writes as far as possible, and use sync write only when coherence is needed (or when
one is not sure).
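The sync write coherence action can be sketched as follows. This is an illustrative Python model of the mechanism described above; the data structures are simplifications of the per-block directories kept at the global cache and the IOD, and the function signature is ours.

```python
# Sketch: a sync write merges the per-block directory entries held at the
# global cache and at the IOD, invalidates every other sharer's local copy,
# and propagates the write. Data structures are illustrative simplifications.

def sync_write(block, node, data, local_caches, global_dir, iod_dir, store):
    # Union of both directories gives the full set of sharers, since local
    # caches that bypass the global cache register at the IOD instead.
    sharers = global_dir.get(block, set()) | iod_dir.get(block, set())
    for n in sharers - {node}:               # invalidate every other copy
        local_caches[n].pop(block, None)
    store[block] = data                      # propagate the write itself
    global_dir[block] = {node}               # writer is now the only sharer
    iod_dir[block] = set()
    local_caches[node][block] = data
```

The extra directory lookup and invalidation round is exactly the overhead that makes sync write more expensive than a normal buffered write.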
2.3.3 Performance of Primitives and Micro-Benchmarks
Before we go any further into our optimizations, we would first like to present some
latency numbers and micro-benchmark results for read and write performance in the presence/absence of local/global caches. For these experiments, the local cache size was fixed at
2 MB (500 data blocks), while the global cache size was fixed at 40 MB (10000 data blocks).
Also, a stripe size of 32 KB was used in all our experiments.
Raw Latencies for Reads/Writes
In the first set of results (see Table 2.1), we give the read latencies for a file striped over different numbers of IODs (1 to 4). In these tables, Pvfs denotes the read latency of the original
PVFS system which does not use any caching (local or global). Local Hit indicates the
latency when the access is satisfied from local cache and Local Miss is the latency when the
access misses in the local cache and is satisfied from one or more IODs. The latter case thus
captures the execution on a system without a global cache. Global Hit and Global Miss,
Fig. 2.1. System architecture. Nodes 1..n are the clients where one or more application processes run, and have a local cache present. Upon a miss, requests are either directed to the global cache (one such entity for a file), or are sent directly to IOD node(s) containing the data in the disk(s).
on the other hand, denote the cases when the access misses in the local cache (i.e., a local cache
lookup is still needed) and hits and misses, respectively, in the global cache.
From these numbers, we clearly see that the local cache hits (Local Hit) can substan-
tially lower read costs compared to the original PVFS implementation. On the other hand, if the
locality is not good, causing us to miss in the local cache (i.e., Local Miss), the performance
becomes worse than the original PVFS for all request sizes because of the overheads in looking
up the local cache. Therefore, it is not only important to improve the hit behavior of the local
cache, but it is also meaningful to bypass the local cache on certain lookups if we feel that it is
going to miss.
When we next move to the scenarios with accesses to the global cache (misses in the local cache), we first see that the global cache can lower access times (provided the data is present there) compared to the original PVFS without caching in many cases (i.e., requests larger than 4
KB). It is also better than fetching the data directly from IODs upon a local cache miss (Local
Miss). However, global-cache-miss costs are substantially higher than any of the other cases
because of the additional message hop and serialization overhead that occur in the critical path
and the associated lookup costs. This suggests that if we want to incorporate and benefit from the
global cache, it is very important to keep its hit rate quite high. In fact, the Required HR rows
in Table 2.1 give the minimum hit rates that are needed (for each request size) to tilt the balance
in favor of the global cache compared to the original PVFS. A value larger than 1 in these rows indicates that it is impossible to generate better results than the original PVFS using that request
size and the number of IODs. Figure 2.2 shows the same behavior plotted as a graph. This again
means that we need to be very careful on what to put in the global cache and when to avoid
going through it. Further, we can observe that the benefits of global caching (look at the last row
Table 2.1. Read times (in ms) for different request sizes and number of IODs (|IOD|).

Request Size −→             4K     8K     16K    32K    64K    128K   256K

|IOD| = 1
Pvfs                        1.09   2.27   4.31   9.48  19.04  38.52   54.04
Local Hit                   0.67   0.68   0.72   0.80   0.97   1.59    2.85
Local Miss                  1.25   2.28   4.61   9.54  20.77  44.23   67.54
Global Hit (Local Miss)     1.43   1.71   2.44   4.26   8.14  15.28   25.91
Global Miss (Local Miss)    2.00   2.85   5.86  11.49  23.85  50.42   94.38
Required HR                 1.59   0.50   0.45   0.27   0.30   0.33    0.58

|IOD| = 2
Pvfs                        1.12   1.99   3.82   7.84  14.16  24.09   47.79
Local Hit                   0.74   0.83   1.03   1.38   2.43   4.34    8.56
Local Miss                  1.32   2.08   4.36   8.07  18.59  36.49   52.92
Global Hit (Local Miss)     1.51   1.85   2.62   5.01   8.32  17.77   39.92
Global Miss (Local Miss)    2.05   3.31   5.93  11.91  24.78  49.06  109.36
Required HR                 1.72   0.90   0.63   0.58   0.64   0.79    0.88

|IOD| = 3
Pvfs                        1.08   1.83   3.52   6.17  12.00  20.04   36.86
Local Hit                   0.75   0.84   1.01   1.41   2.42   4.50    8.67
Local Miss                  1.31   2.35   4.48   8.19  18.96  26.90   40.63
Global Hit (Local Miss)     1.23   1.66   2.45   4.80   8.67  19.26   39.38
Global Miss (Local Miss)    1.87   3.36   6.30  12.06  30.71  54.14  100.34
Required HR                 1.23   0.90   0.72   0.81   0.84   0.97    1.04

|IOD| = 4
Pvfs                        1.08   1.63   3.33   5.32  10.64  19.06   33.79
Local Hit                   0.76   0.84   1.01   1.40   2.41   4.68    8.89
Local Miss                  1.32   2.18   4.50   8.61  14.47  21.76   38.96
Global Hit (Local Miss)     1.48   1.67   2.80   4.70   9.10  19.50   38.87
Global Miss (Local Miss)    1.88   3.87   6.10  12.33  26.07  49.62  107.63
Required HR                 2.00   1.01   0.83   0.91   0.90   1.01    1.07
showing required hit rate) are most significant when request sizes are not at either extreme. At
very small request sizes, the overhead of global caching itself is more significant. At the other
end, large amounts of data can cause more capacity misses, leading to poor temporal locality.
Another point to note is that when the number of IODs involved in the access increases, the cost
of a global cache miss becomes even more significant. This is because the global cache has to
amass the data coming in from different IODs and then send them sequentially to the requester,
while all the IODs could have potentially sent them in parallel to the requester if the global cache
was not involved.
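The Required HR rows follow from a simple break-even condition: routing requests through the global cache pays off only when hr × T_global_hit + (1 − hr) × T_global_miss ≤ T_pvfs. A small sketch (Python, using the |IOD| = 1 latencies from Table 2.1; the function name is ours):

```python
# Break-even global-cache hit rate. A result above 1 means no feasible hit
# rate can beat the original PVFS for that request size / IOD count.

def required_hit_rate(t_pvfs, t_global_hit, t_global_miss):
    return (t_global_miss - t_pvfs) / (t_global_miss - t_global_hit)

# 32 KB requests, |IOD| = 1: consistent with the ~0.27 Required HR entry.
hr_32k = required_hit_rate(9.48, 4.26, 11.49)

# 4 KB requests, |IOD| = 1: above 1, matching the infeasible 1.59 entry.
hr_4k = required_hit_rate(1.09, 1.43, 2.00)
```

Differences in the last digit against the table are attributable to rounding in the reported latencies.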
Table 2.2 gives the times for write operations to return to the application after they are issued, with different numbers of IODs involved. We compare the performance of the original
PVFS code (denoted Pvfs) with our system having a local cache (denoted Caching). We are
not separately giving the costs as in the read table (Table 2.1) for the other scenarios as they are
comparable to the scenario with a local cache (the writes are simply accumulated in the local
cache, and a background activity — flusher — propagates these writes to either the global cache
or the IOD). We do not buffer writes of an application when there is not enough space left on
the local cache. Hence the cost of writes whose sizes are greater than the local cache size is
comparable to the cost of the original PVFS implementation. We can see that write stall times
are significantly lower because of this feature, as is to be expected. It is to be noted that the
savings that will be presented later in this chapter with our optimizations are not a result of these
nonblocking writes, since we show savings even over the scenarios that cache everything in the
local/global caches (which also performs nonblocking writes).
Fig. 2.2. Graph showing the minimum required hit rate at the global cache for good performance, as a function of read block size, for files striped on 1, 2, 3, and 4 nodes (with the maximum feasible hit rate shown for reference).
Table 2.2. Write times (in ms) for different request sizes and number of IODs (|IOD|).

Request Size −→   4K     8K     16K    32K    64K    128K   256K

|IOD| = 1
Pvfs              0.68   1.03   1.97   3.95   7.83  15.94  31.09
Caching           0.55   0.56   0.60   0.96   1.05   1.76   3.15

|IOD| = 2
Pvfs              0.68   1.27   1.90   3.77   9.86  15.44  29.61
Caching           0.60   0.67   0.84   1.43   2.04   3.70   7.19

|IOD| = 3
Pvfs              0.68   1.04   1.85   3.62   8.23  15.74  29.40
Caching           0.59   0.68   0.87   1.37   2.08   4.01   7.79

|IOD| = 4
Pvfs              0.68   1.02   1.95   3.58   8.18  15.87  29.09
Caching           0.60   0.68   0.90   1.55   2.17   4.30   8.02
Micro-benchmark Results
While our later experiments will evaluate caching using real benchmarks, we wanted to
stress the system along different dimensions, and employed a micro-benchmark to do so. Our
micro-benchmark is parameterized based on s (the maximum size for a read/write operation in
blocks, where a block is defined to be the same size as the granularity of the caches) and o (the
maximum offset within a file in blocks from which the next read/write is initiated). The micro-
benchmark program iteratively goes over a number of operations, randomly picking whether it
is a read or a write with equal probability. The size of this operation is also picked randomly
between 1 and s blocks, and the starting offset within the file for the operation is picked, again
randomly, between 0 and o blocks. Note that a small value of o will automatically provide good
locality, and we can tune these parameters to mimic different access patterns.
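The access stream of this micro-benchmark can be sketched as follows; this is an illustrative reconstruction (function name and seeding are our own additions), not the original benchmark code:

```python
import random

def microbenchmark_ops(num_ops, s, o, seed=0):
    """Generate the access stream of the micro-benchmark described above:
    each operation is a read or a write with equal probability, its size is
    picked randomly between 1 and s blocks, and its starting offset is
    picked randomly between 0 and o blocks (a small o automatically gives
    good locality)."""
    rng = random.Random(seed)
    for _ in range(num_ops):
        op = rng.choice(("read", "write"))
        size = rng.randint(1, s)     # in blocks; a block equals the cache granularity
        offset = rng.randint(0, o)   # in blocks from the start of the file
        yield op, offset, size
```

For example, `list(microbenchmark_ops(15000, s=2, o=10))` generates a high-locality run, while `o=25000` generates a poor-locality one.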
Instead of presenting all the results, we discuss here one representative case with one IOD
being employed, s = 2, and for three different values of o: 10, 600, and 25000 (see Figure 2.3
and Figure 2.4). Note that the locality progressively gets worse from o = 10 to o = 25000. When
the locality is very good (o = 10), the working set is contained well within the local cache, and
the schemes that use the local cache perform much better than those without it. We also note
that the global-cache-only scheme still turns out to be better than the scheme without any
caching. Even though its hit rates are quite high, the global cache's overheads cause it
to perform much worse than the schemes with a local cache. At the other end of the spectrum,
when the locality becomes very poor (o = 25000), the working set is not well exploited by any of
the caching schemes, and their associated overheads cause them to perform worse than a scheme
without any caching. The more interesting results are those for o = 600, where the working set
overflows the local cache, but is captured by the global cache (which is larger). Consequently, the
two schemes which use a global cache provide much better performance than a scheme without
any caching or a scheme with just a local cache.
Fig. 2.3. Micro-benchmark running on 1 node, file striped on one IOD, s = 2, (a) o = 10, (b) o = 600. (Both panels plot time to complete in seconds against the number of operations for four schemes: No Caching, Only Local Caching, Only Global Caching, and Local & Global Caching.)
In the earlier experiment, the micro-benchmark is run on a single node. We have also
run the same micro-benchmark on different nodes, with the data striped across different IODs
(using a stripe size of 32KB). In Figures 2.5 and 2.6, we show the results for one such scenario
with three IODs used to distribute the data. We observe similar trends to those we saw earlier.
The only slight difference with the poor locality situation (o=25000) is that the local-cache-only
execution is not much worse than without any caches because the local cache overheads are not
too significant.
Fig. 2.4. Micro-benchmark running on 1 node, file striped on one IOD, s = 2, o = 25000. (Time to complete in seconds against the number of operations for the same four caching schemes.)
Fig. 2.5. Micro-benchmark running on 2 nodes, file striped on three IODs, s = 2, (a) o = 10, (b) o = 600. (Time to complete in seconds against the number of operations for the same four caching schemes.)
Fig. 2.6. Micro-benchmark running on 2 nodes, file striped on three IODs, s = 2, o = 25000. (Time to complete in seconds against the number of operations for the same four caching schemes.)
2.3.4 Cache Bypass Mechanisms
The results in the previous subsection indicate that it is important to provision a local and
a global cache for good performance. However, our results also show that it is equally important
to be very careful in deciding what data to place in these caches and when to avoid/bypass them.
Our system provisions mechanisms for bypassing the local and/or global caches for a
read or write. Our system does not require any different read/write calls to specify that a cache
needs to be bypassed since that can get cumbersome, and it is not clear how such a mechanism
can be effectively used by application programmers. Instead, we provide the notion of a segment
— a certain number of contiguous file blocks (unless explicitly stated otherwise, a segment
of 4 blocks was used in the experiments) — with a set of bits determining what actions to
be performed on a read/write. For each operation (read or write), we have two bits, one for
specifying whether that operation for the segment needs to go through the local cache and another
for whether it needs to go through the global cache. We thus provide a segment-level granularity
for cache bypassing.
These (segment) bits can be set via a system call that updates a data structure in the
underlying kernel module (implementing the local cache) at each node. When a read/write call
is made, this bitmap data structure is consulted to find out whether to look up the local cache, and
whether to route the request to the global cache or directly to the IOD. The system call to set these
bits can either be explicitly invoked by the application program or be invoked by instructions
inserted into the code by the compiler. These bits can also be set by the runtime system based on
previous execution characteristics. In the default configuration, all operations go via the local and
global caches for all segments. The rest of this chapter explores the benefits of cache bypassing,
and ways of initiating such bypassing with the compiler and the runtime system. While it is
also possible to adopt a user-based strategy where the application programmer sets these bits
explicitly, we believe that such an approach would be very difficult for the user (investigating
profile-based techniques and tools for doing this is part of our future research agenda). Also, we
specifically focus on bypass mechanisms for the global cache in this work, whose overheads on
a miss are much more significant than the corresponding overheads for the local cache.
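The per-segment bypass bits can be sketched as a small bitmap structure. Everything below (names, the dictionary representation) is a hypothetical user-level illustration of the kernel-module data structure described above:

```python
# Sketch of the per-segment bypass bits: each segment (a run of contiguous
# file blocks, 4 by default) carries two bits per operation -- one saying
# whether the operation goes through the local cache, and one for the
# global cache. The read/write path consults this bitmap to route requests.
SEGMENT_BLOCKS = 4

class BypassBitmap:
    def __init__(self):
        # Default configuration: all operations go via both caches
        # for all segments.
        self.bits = {}  # segment id -> {(op, level): bool}

    def set_bits(self, segment, op, use_local, use_global):
        # In the real system this is done via a system call, invoked by
        # the application, compiler-inserted code, or the runtime system.
        entry = self.bits.setdefault(segment, {})
        entry[(op, "local")] = use_local
        entry[(op, "global")] = use_global

    def route(self, block, op):
        # Consult the bitmap on a read/write: should we look up the local
        # cache, and should the request go to the global cache or
        # directly to the IOD?
        seg = block // SEGMENT_BLOCKS
        entry = self.bits.get(seg, {})
        return (entry.get((op, "local"), True),
                entry.get((op, "global"), True))
```

A bypass of the global cache for reads of one segment, for instance, leaves writes and all other segments on the default cached path.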
2.4 Compiler-directed Cache Bypass
Previous discussion emphasized the importance of careful management of the global
cache space. An optimizing compiler can help us identify what data should be brought into the
global cache. It can achieve this by using at least two different strategies. We assume here that
the data for each array corresponds to a different file. In the first strategy, the compiler adopts
a coarse-grained approach and determines the arrays that are used frequently in the program. It
achieves this by estimating (at compile time) the number of accesses to each array in the code.
More specifically, for each loop nest, the compiler counts the number of references to each array
and multiplies these counts by the trip counts (the number of iterations) of all enclosing loops.
If there is a conditional flow of control within the loop (e.g., an ‘if’ statement), the compiler
conservatively assumes that all possible branches are equally likely to be taken. Note that if
we have profile data on branch probabilities, it is straightforward to exploit it for obtaining a
more accurate estimate. Another potential problem is the compile-time-unknown loop bounds.
In such cases, the compiler can estimate the number of accesses symbolically. Note that well-
known symbolic manipulation techniques (e.g., [29, 36]) can be used here for this purpose. After
doing such analysis, the compiler uses the global cache for reads/writes to the files (arrays) with
the most references (depending on how many such files can fit in the global cache).
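The coarse-grained estimate can be sketched as follows. The representation of loop nests and all names are illustrative assumptions, and conditional-branch scaling and symbolic bounds are omitted for brevity:

```python
from math import prod

def rank_arrays(loop_nests):
    """Coarse-grained compile-time estimate sketched from the text: for
    each loop nest, multiply the number of textual references to each
    array by the product of the trip counts of all enclosing loops, then
    rank arrays by total estimated accesses. Each nest is represented as
    (trip_counts, refs) with refs mapping array name -> reference count."""
    totals = {}
    for trip_counts, refs in loop_nests:
        iterations = prod(trip_counts)  # product of enclosing trip counts
        for array, count in refs.items():
            totals[array] = totals.get(array, 0) + count * iterations
    return sorted(totals, key=totals.get, reverse=True)

def arrays_to_cache(loop_nests, sizes, cache_capacity):
    """Greedily pick the most-referenced arrays (files) that fit in the
    global cache; reads/writes to the remaining files bypass it."""
    chosen, used = [], 0
    for array in rank_arrays(loop_nests):
        if used + sizes[array] <= cache_capacity:
            chosen.append(array)
            used += sizes[array]
    return chosen
```

With profile data, the per-nest reference counts would simply be weighted by branch probabilities before ranking.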
An important drawback of this coarse-grained strategy is that it fails to capture short-term
localities. For example, in a given large, I/O-intensive application, an array might be accessed
very frequently in the first half of the application and is not accessed in the second part. However,
the strategy described above can continue to cache the segments of this array in the second part
of the application if the overall (program-wide) access count of this array is larger than those of
the others. Our second strategy tries to eliminate this drawback of the coarse-grained method by
managing the global cache space on a loop nest basis focusing on segment granularity.
Specifically, in our second strategy, the compiler determines the blocks that will be ac-
cessed in each loop nest separately. The id’s of a subset of these blocks are then recorded at the
loop header. This subset contains the most frequently used blocks in the nest. By doing this,
the second strategy tries to capture short-term localities and manages the global cache space at
a finer granularity. Then, the segments corresponding to the most frequently used blocks are
cached. Note that this approach can be expected to result in better global cache hit ratio than
the first strategy. It should also be noted that determining the blocks that will be accessed by a
loop nest is possible as in our applications there is a one-to-one correspondence between arrays
declared in the program and disk-resident files (i.e., our applications use a separate file for each
array that they manipulate). Therefore, the compiler can associate the array elements with the
blocks. Also, as in the case of coarse-grained approach, this approach can take advantage of
profile data (e.g., on branch probabilities) where available. Furthermore, again as in the previous
case, it can employ symbolic expression [29, 36] manipulation when loop trip counts are not
known at compile time.
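The per-nest selection can be sketched as below, assuming per-block access estimates are already available from the analysis above (the function name and input representation are hypothetical):

```python
def blocks_to_cache_per_nest(nest_block_counts, k):
    """Fine-grained strategy sketch: for each loop nest, record at the
    loop header the ids of the k most frequently accessed blocks; only
    the segments corresponding to these blocks go through the global
    cache while that nest executes. nest_block_counts maps a nest id to
    {block id: estimated accesses}, derivable because each array maps
    one-to-one onto a disk-resident file."""
    plan = {}
    for nest, counts in nest_block_counts.items():
        top = sorted(counts, key=counts.get, reverse=True)[:k]
        plan[nest] = set(top)
    return plan
```

Because the plan is recomputed per nest, an array that is hot only in the first half of the program stops occupying global-cache space in the second half, avoiding the drawback of the coarse-grained method.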
We implemented both these strategies by using the SUIF compiler infrastructure [94]
and evaluated them by using codes where data access patterns are statically analyzable. SUIF
consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of
the kernel. The strategies that were described above have been implemented as SUIF passes that
perform the required analysis and write the output to a file. We present here results with I/O-
intensive versions of two Spec benchmarks: tomcatv and vpenta. While the original codes
manipulate arrays directly in memory, we extended them to read/write these arrays from data
files explicitly, before manipulating them in memory. The results are shown for tomcatv in
Figures 2.7 and 2.8(a) as a function of the problem size (local cache size of 400KB, global cache
sizes of 20 MB and 200 MB) and as a function of the global cache size (keeping the problem
size fixed at 1500 – this corresponds to matrices of size 1500*1500 doubles manipulated in the
application), respectively. The corresponding results for vpenta are given in Figures 2.9 and
2.8(b). In each of these figures, we compare the performance of four different executions: (a)
a scheme with no caching (and hence no compiler optimizations for I/O); (b) a scheme with
local and global caches without any compiler optimizations for I/O; (c) a scheme with local
and global caches in conjunction with coarse-grain (file level) compiler optimizations, and (d) a
scheme with local and global caches in conjunction with fine-grain compiler optimizations.
Fig. 2.7. tomcatv: impact of problem size. (a) Global cache size is 20 MB, (b) global cache size is 200 MB. (Both panels plot time to complete in seconds against problem size for No Caching, Local & Global, Coarse Grain, and Fine Grain.)
Examining Figure 2.7(a), we find evidence confirming the earlier arguments that blindly
caching everything in the local and global caches can sometimes worsen performance. Specif-
ically, we observe that the No Caching alternative does better than the Local & Global
option (i.e., caching everything indiscriminately), especially at larger problem sizes. The over-
heads of going to the global cache and not finding the required blocks in it contribute to this
behavior. Performing compiler optimizations at the coarse (file) granularity does give better per-
formance than caching everything, but it still does worse than not caching anything. However,
we can see that the fine-grained approach gets the benefits of the global cache and is also a
Fig. 2.8. Impact of global cache size for a problem size of 1500. (a) tomcatv, (b) vpenta. (Time to complete in seconds against global cache size in blocks for the same four schemes.)
Fig. 2.9. vpenta: impact of problem size. (a) Global cache size is 20 MB, (b) global cache size is 200 MB. (Time to complete in seconds against problem size for the same four schemes.)
better alternative than not caching (because it avoids consulting the global cache when it expects
the data not to be present). This benefit improves as the problem size gets larger (relative to
the global cache size). Evidence for the last statement is further substantiated when we exam-
ine the executions with a much larger global cache in Figure 2.7(b). Here, the hit rates in the
global cache are much higher, and the always-cache option is a better idea. As the global cache
gets larger, selective caching can possibly prevent some data from benefiting from it, compared
to caching everything. All these observations are reiterated when we look at the impact of
global cache capacity for a fixed problem size in Figure 2.8(a). The benefits of selective
caching/bypassing are much more significant at small cache sizes, and the always-cache option
becomes better only with larger global caches. The results for vpenta (given in Figures 2.9
and 2.8(b)) are similar to many of those observed with tomcatv, except that the magnitude of
the differences is less pronounced because its I/O traffic is lower.
In summary, we find that discretionary caching becomes very important when the prob-
lem sizes of applications get large enough, and the working sets cause more thrashing in the
global cache. We find that a compiler-based technique for modulating what to place/bypass in
the global cache can alleviate some of these thrashing problems and help us reap the benefits
of a global cache. Of the two different policies that we tried, we find that a finer granularity of
control is a better option than file-level control. This is because not all blocks within a file may
have the same access pattern or access frequency.
2.5 Runtime Cache Bypass
So far, we have evaluated two compiler-based strategies (coarse-grain and fine-grain)
where our compiler decided what to place in the global cache and when to bypass it. There are
many cases where such a compiler-based strategy may not be desirable or even applicable. For
example, when we do not have the source code of the application, we cannot analyze the program
and determine its access pattern statically. Similarly, in some cases, the application code might
be available but the access pattern it exhibits may not be amenable to compiler analysis (e.g.,
due to array-subscripted array references, non-affine subscript functions, or pointer arithmetic).
However, in these and similar cases, it might be still possible to optimize the application using
a runtime technique. A runtime technique tries to evaluate block-access frequencies at runtime
and makes cache-bypassing decisions dynamically.
In this section, we investigate the effectiveness of a runtime strategy for managing the
global cache. Along similar lines, there has been prior work [44] in the context of processor data
caches for runtime bypassing using access counters. However, in this study, we examine a much
simpler strategy since there are some problems when implementing schemes such as [44, 89]
on our platform where we have multiple levels of caches, and a miss from the local cache may
not go through the global cache at all. Our strategy is based on the idea of having counters with
segments. Specifically, we associate a counter with each segment that keeps the number of times
the segment is accessed. These counters are called segment counters. When a block needs to
be brought into global cache, its segment counter is compared with a pre-set threshold value. If
the value of the segment counter is larger than the threshold, the block is placed into the global
cache; otherwise, the cache is bypassed. When the local cache gets this block, it is told (either
in the read response or the write acknowledgment) to avoid going through the global cache if it
needs to be bypassed subsequently. The rationale behind this approach is that when a block is
not accessed frequently enough, placing it into the global cache can cause a useful (i.e., more
frequently used than the block in question) block to be discarded. It should be noted that we do
not perform any checks when the block is accessed for the first time (counter reads zero), and
only subsequently does this scheme kick in. When a new block is accessed, the harvester on the
global cache examines all currently residing blocks to find a candidate for replacement whose
counter is below the threshold (and does some aging of counters when doing so). Finally, in our
current implementation, the decision for a block (whether to bypass or not) is made only once
and we do not re-evaluate the choice once we decide to bypass the global cache for a block.
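The segment-counter strategy can be sketched as follows; this is a user-level illustration with hypothetical names, and the counter aging performed by the harvester is omitted:

```python
# Sketch of the runtime bypass strategy described above: a counter per
# segment records how often the segment is accessed; when a block must be
# brought into the global cache, it is admitted only if its segment
# counter exceeds a preset threshold. No check is performed while the
# counter reads zero, and a bypass decision, once made, is not revisited.
class SegmentCounters:
    def __init__(self, threshold, segment_blocks=4):
        self.threshold = threshold
        self.segment_blocks = segment_blocks
        self.counters = {}       # segment id -> access count
        self.bypass_blocks = set()  # blocks already marked "bypass"

    def record_access(self, block):
        seg = block // self.segment_blocks
        self.counters[seg] = self.counters.get(seg, 0) + 1

    def admit_on_miss(self, block):
        """Called when `block` would be brought into the global cache."""
        seg = block // self.segment_blocks
        if block in self.bypass_blocks:
            return False                 # decision is made only once
        count = self.counters.get(seg, 0)
        if count == 0:                   # first access: no check performed
            return True
        if count > self.threshold:
            return True                  # frequent enough: place in global cache
        # Bypass; the local cache is told (in the read response or write
        # acknowledgment) to skip the global cache for this block from now on.
        self.bypass_blocks.add(block)
        return False
```

The rationale in the text maps directly onto `admit_on_miss`: an infrequently accessed block is kept out so that it cannot displace a more frequently used resident block.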
The results with this strategy are given in Figure 2.10 for a global cache size of 20 MB
with two different threshold values — high (20) and low (3) for the same two applications ex-
amined earlier. We find that the runtime strategy improves the performance of global caching
for both these extremes. The benefits are better at larger problem sizes where cache thrashing
becomes more significant and we need to be careful on what to put in the global cache. This is
also the reason why when we go to larger problem sizes, the more aggressive runtime approach
(i.e., the one with the higher threshold value) does better than the one with the smaller threshold.
Fig. 2.10. Runtime cache bypassing (global cache size is 20 MB). (a) tomcatv, (b) vpenta. (Time to complete in seconds against problem size for No Caching, Local & Global, Run time (low threshold), and Run time (high threshold).)
Next, we perform a sensitivity study of the runtime technique that depends on two tune-
able parameters, namely, threshold value and segment size. Figure 2.11(a) captures the perfor-
mance of the runtime strategy as a function of the threshold value for tomcatv. We observe
that typically threshold values in the range of 20-50 lead to better performance since they are
more effective in weeding out what should not be put in the global cache, without defaulting to
the No Caching strategy. Consequently, we use threshold values in this range in the next few
experiments.
Recall that so far we have fixed segment size to be four blocks. To study the sensitivity
of our runtime strategy to the segment size, we conducted another set of experiments where
we used different segment sizes ranging from 2 blocks to 64 blocks. The results are illustrated
in Figure 2.11(b) for vpenta. Note that each bar in these graphs is normalized with respect to
the 4-block segment size. These results indicate that selecting a suitable segment size is important. In particular,
working with very small or very large segment sizes may not be a good idea. When the segment
size is very large, the blocks in a given segment do not exhibit uniform locality; therefore, a
segment-wide decision might be the wrong (suboptimal) choice for many blocks in the segment.
Similarly, if the segment size is very small, we witness an increased traffic through the global
cache (which in turn hurts the performance). It should also be stressed that a small segment size
means more bookkeeping and more runtime overhead. Similar results have been obtained with
other applications as well and they are not explicitly given here.
Having examined both compiler (static) based and runtime optimizations for the same
two applications, one could ask how the two compare in terms of effectiveness. We plot the
local and global cache hit rates for different problem sizes for the same two applications under
four different execution scenarios, (a) a scheme which blindly caches everything without any
Fig. 2.11. (a) Impact of the threshold value for tomcatv (time to complete in seconds against the threshold value). (b) vpenta: impact of segment size (time to complete, normalized to 4-block segments, for segment sizes of 2 to 64 blocks).
optimization for I/O, (b) a static compiler-driven scheme that caches file blocks at a coarse
granularity (file level), (c) a static compiler driven scheme that caches file blocks at a finer
granularity (block level), and (d) a scheme which makes cache-bypassing decisions at runtime.
The results are shown in Figures 2.12 and 2.13, which plot the hit rates in the two caches
for tomcatv and vpenta. As is to be expected, in such applications where all the information
can be statically gleaned, the compiler-based techniques can be anticipated to perform better than
their runtime counterpart, since the latter requires a warm-up period before it attempts bypassing.
However, the benefits of the runtime approach will be felt more in non-analyzable applications,
or those in which we do not have source codes to perform these optimizations. We illustrate this
by studying the effectiveness of the runtime optimizations on a set of parallel I/O traces, where
this option is the only choice.
Fig. 2.12. tomcatv: variation of cache hit rates with problem size. (a) Local cache hit rate, (b) global cache hit rate, each plotted for No optimization, Compiler-Coarse, Compiler-Fine, and Runtime.
Fig. 2.13. vpenta: variation of cache hit rates with problem size. (a) Local cache hit rate, (b) global cache hit rate, each plotted for the same four schemes.
The traces used in this part of our experiments are from [90], which capture a
diverse set of application executions (scientific and commercial). We evaluated the runtime
strategy using the traces for the following six applications:
• LU: This application computes the dense LU decomposition of an out-of-core matrix. It
performs I/O using synchronous read/write operations.
• Cholesky: This application computes Cholesky decomposition for sparse, symmetric
positive-definite matrices. It stores the sparse matrix as panels. This application performs
I/O using synchronous read/write operations.
• Titan: This is a parallel scientific database for remote-sensing data.
• Mining: This application tries to extract association rules from retail data.
• Pgrep: This application is a parallelization of a grep program from the University of
Arizona.
• DB2: This is a parallel RDBMS (Relational Database Management System) from
IBM.
In the above experiment, we fixed the size of the local cache at 2 MB and the size of
the global cache at 4 MB, and the threshold values were selected between 10 and 25.
Figure 2.14 shows the execution time of the runtime optimized system normalized with respect
to the system that uses local and global caching without runtime bypass. We can see that the
optimized system benefits all but one of the six applications, with the benefits (reductions in
execution times) ranging between 4% and 48%. The benefits are particularly significant in ap-
plications with poor locality (such as DB2 and LU). These results reiterate the importance of
managing/bypassing the global cache with an effective runtime strategy.
For the above experiment, the hit rates of the local and global caches are shown in Figure 2.15.
As before, we do not see much variance in the local cache hit rates, and the
performance improvement can be attributed to improved global cache hit rates with the runtime
technique.
2.6 Concluding Remarks and Future Work
Caching for I/O is widely recognized as being critical for performance enhancements in
large codes. Such caching is typically done at multiple levels: at the client nodes, at the server
nodes, and perhaps even in between. Each has its advantages and drawbacks. This work has
shown that one must not indiscriminately cache all data at all levels of the caching hierarchy.
We have demonstrated this by extending an off-the-shelf parallel file system for clusters, with a
local cache at each node and a shared global cache. We have also provisioned mechanisms for
bypassing each of these caches for a read/write operation at a fine granularity. Such mechanisms
can either be used explicitly by the application (perhaps some profile-based tools could be
useful here) or be exploited by the compiler or the runtime system. In this work, we have
presented both compile-time and runtime strategies to exploit global-cache bypassing. Using
both statically analyzable codes, and several public-domain I/O traces from diverse domains,
we have demonstrated the benefits of discretionary caching with these techniques. It should
be noted that several of the previously proposed I/O optimizations such as prefetching, data
striping/distribution, etc., can be used in conjunction with the ideas and discussions in this work.
Fig. 2.14. Benefits of runtime bypassing on application traces. (Normalized running times for LU, Cholesky, DB2, Pgrep, Titan, and Mining.)
Fig. 2.15. Application traces: variation of cache hit rates. (a) Local cache hit rate and (b) global cache hit rate, each shown with and without the runtime technique.
There are several interesting directions for future work. As mentioned previously, the
scalability of the global cache with additional client nodes may turn out to be a problem and we
are currently looking at scalable solutions to see if we can apply the techniques presented here at
the I/O server nodes. We have only presented and evaluated a simple runtime strategy, and even
that has turned out to be quite effective. We plan to explore more sophisticated runtime schemes
with this approach. We have used a shared-nothing architecture for the experimental studies,
and it would be interesting to study the applicability and benefits to systems with a shared stor-
age architecture (perhaps including a storage-area network). An important goal of our future
optimizations is to be able to detect access patterns across different simultaneously running ap-
plications for I/O and cache optimizations. We are also interested in developing performance
monitoring and profiling tools to better determine what, when, and where to cache data blocks.
Finally, extending our compiler analysis to capture I/O access patterns inter-procedurally and
applying more aggressive (global) optimizations are interesting extensions to consider.
Chapter 3
Implicitly I/O Intensive Applications
3.1 Introduction
Many scientific applications today solve problems that access large data sets. In general,
these data sets are disk-resident and their sizes range from megabytes to terabytes. These
include applications from medical imaging, data analysis, video processing, large archive main-
tenance, space telemetry data, and so on. Typically, such applications are explicitly coded to
stage the data to be manipulated from disk or the in-core version of the same program is scaled to
handle the larger problem sizes. It has been well-documented [82] that LRU-like virtual memory
replacement algorithms under-perform for typical scientific applications that tend to cyclically
access pages of memory. Consequently, a paged virtual memory system (or the scaled version
of the in-core program) is not considered a viable option [95] for solving out-of-core problems.
When virtual memory pages are touched cyclically and the working set is slightly larger than
available physical memory, LRU chooses to evict exactly the page that would be referenced the
soonest. In the above scenario, the optimal algorithm would evict those pages that were refer-
enced recently since it has future knowledge of the reference patterns. The basic idea behind our
work is as follows: if we can predict or estimate the lifetimes of all the memory-resident pages,
then we can evict a page as soon as we know that there will be no more references to that
page. In a sense, we are trying to approximate online the optimal algorithm, which always evicts
the pages that would be referenced the furthest in time. The basic motivation behind our idea is
that the LRU replacement algorithm holds on to pages that have long been dead, and hence any
replacement algorithm should be pro-active, instead of being reactive, and anticipate candidates
for eviction [92]. In practice, however, it is quite difficult to accurately predict future access in-
formation and lifetimes of virtual memory pages. Consequently, the algorithm that we propose
uses past access-pattern information as an indicator to predict lifetime distances of pages. In this
regard, we have experimented with a few different predictors and compared and evaluated their
performance for a set of twelve different memory-intensive applications drawn from the NAS
parallel and SPEC 2000 benchmark suites. We now introduce a few notations and definitions
that will be used later on in the text:
• Denote the distance between the first access (touch) and the last access (touch) to a virtual
memory page before it gets replaced as the lifetime distance (L)
• Denote the distance between the last access (touch) to a virtual memory page and the
subsequent reuse access, during which interval the page was replaced, as the reuse distance (R)
• Denote the distance between the replacement of a page and the subsequent reuse access to
the same virtual memory page as the window distance (W)
• Denote the distance between the last access (touch) and the replacement of the page as the
dead-page distance (D)
• Denote the distance between two successive page faults to a virtual memory page as the
page-fault distance (F)
These notations are pictorially shown in Figure 3.1. Henceforth, we shall refer to these
parameters as the page-fault parameters or simply the fault parameters. Note that in the above
definitions, the notion of distance was deliberately left unspecified, since it could be measured
in many different ways (for example, in terms of memory references, page faults to other pages,
references to unique memory pages, or even time). Ideally, we would like to measure these
distances in terms of references to unique memory pages, since that directly translates to whether
a page would be retained for a given memory configuration (similar to the idea proposed in [42]
in the context of buffer caches). This is, however, fairly difficult, if not impossible, to do on an
actual system, since it would involve the unacceptable overhead of trapping on each and every
memory reference, or would require special hardware support (like an augmented MemorIES [60]
board) to store the timestamps of the last access to pages. Without hardware support, since the
only OS-visible events are page faults and replacements, it is possible to measure only W and
L + D, and only in units visible to the OS (e.g., the number of page faults incurred by other
pages). Each virtual memory page p is thus characterized by a unique 4-tuple t_i^p = (L, R, D,
W) between page faults i and i+1. Consequently, a program's execution can be visualized as a
sequence of such tuples. Note that for a deterministic replacement algorithm, every instance of
a program’s execution would yield the same sequence of tuples. Typically, operating systems
employ non-deterministic or time-dependent replacement algorithms which then give rise to
different sequences across different executions.
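These definitions can be made operational by simulating a replacement algorithm over a page-reference trace. The sketch below, which is ours rather than part of the thesis infrastructure (the function name and trace format are assumptions), simulates exact LRU and records, for each completed fault interval of each page, the L, D, and W distances measured in memory references; R is recoverable as D + W, and F as L + D + W:

```python
from collections import OrderedDict

def fault_parameters(trace, capacity):
    """Simulate exact LRU over a page-reference trace and record, for each
    fault interval of each page, (page, L, D, W) measured in memory
    references (time = index into the trace)."""
    lru = OrderedDict()            # resident pages; most recently used at the end
    first = {}                     # time of first touch since the last fault
    last = {}                      # time of most recent touch
    evicted_at = {}                # time at which the page was last evicted
    tuples = []                    # completed intervals: (page, L, D, W)
    for t, page in enumerate(trace):
        if page in lru:            # hit: refresh LRU position
            lru.move_to_end(page)
            last[page] = t
            continue
        # page fault: if the page was evicted earlier, close its interval
        if page in evicted_at:
            L = last[page] - first[page]        # lifetime distance
            D = evicted_at[page] - last[page]   # dead-page distance
            W = t - evicted_at[page]            # window distance (R = D + W)
            tuples.append((page, L, D, W))
        if len(lru) >= capacity:                # evict the LRU victim
            victim, _ = lru.popitem(last=False)
            evicted_at[victim] = t
        lru[page] = None
        first[page] = last[page] = t
    return tuples
```

For example, with `trace = [0, 1, 2, 0]` and a capacity of two pages, page 0 is evicted on the fault for page 2 and refaulted one step later.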
The objectives of this study are as follows:
• Show that LRU-like replacement algorithms hold on to pages that have long been dead,
thus losing opportunities for reducing page-faults.
• Characterize the variations of some of these parameters (L, D) in terms of memory
references and in terms of page faults, and see if there are any correlations and predictable
patterns between them, and
• Use these parameters in conjunction with simple predictors to design application-specific
replacement algorithms.

Fig. 3.1. Page-fault characterization.
If we can predict L values, then we can evict pages earlier than when an operating system might
choose to evict them, because a virtual memory page typically becomes dead well before the
system’s replacement algorithm (like LRU) decides to evict the page. However, an incorrect
prediction of L may degrade the performance of the application by increasing the number of
page-faults incurred by the application. Ideally, we would like the predicted value of L to be
as large as possible, but not larger than L + D. Likewise, if we can predict W values, we could
prefetch pages well-ahead of when they would be actually referenced. However, an incorrect
prediction of W could also degrade performance by increasing the number of page-faults incurred by
applications. Ideally, we would like the predicted value of W to be as large as possible such that
the sum of the predicted value and the time it takes to fetch the data from disk/peripheral device
is no larger than W. Consequently, this work can be considered as an application-customized
prefetching and replacement technique that is done automatically and transparently at runtime
by the operating system, and is hence more generally applicable than what was proposed in
[58]. In this study, we only explore the predictability characteristics of the lifetime distances for
effective page replacement and leave predictive prefetching as a future extension.
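The safety margins discussed above can be stated compactly. Writing $\hat{L}$ and $\hat{W}$ for the predicted values and $T_{fetch}$ for the device fetch latency (notation ours, not the thesis's), an eviction or prefetch decision avoids extra page faults when

$$ L \;\le\; \hat{L} \;\le\; L + D, \qquad \hat{W} + T_{fetch} \;\le\; W. $$

Underpredicting L evicts a page that is still live; overpredicting beyond L + D holds it past the point where the system would have reclaimed it anyway.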
The rest of this chapter is organized as follows. Section 3.2 discusses related work. Sec-
tion 3.3 describes the experimental setup and the scientific applications that we used as bench-
marks for characterization and evaluation. In Section 3.4, we illustrate page-fault characteristics
of the system replacement algorithm. Section 3.5 proposes a new replacement algorithm and
evaluates its performance. We finally conclude and discuss the scope for future work in Section
3.6.
3.2 Related Work
Over the last few decades, a lot of work [18, 42, 43, 50, 65, 73, 82] has been done to
address the shortcomings of LRU-like replacement algorithms. Some of these [34, 50, 65, 82]
try to address the shortcomings for a specific workload access pattern such as cyclical access pat-
terns of programs whose working-set sizes are larger than available physical memory. In [82],
the authors propose an adaptive replacement algorithm (EELRU) that uses the same kind of re-
cency information that is normally available to LRU and a simple online cost-benefit analysis
to guide its replacement decisions. In their approach, the system continuously monitors the per-
formance of the LRU algorithm and, if it detects the worst-case behavior, it tries to pro-actively
evict pages. Our work in this study is similar to theirs in the sense that we wish to evict pages
early if we detect that this could do better than LRU, but is different from theirs in the method-
ology that we employ to detect such situations. In the context of buffer-caches where the system
gets control for every reference/access, researchers have proposed adaptive variants of LRU such
as [50, 65], and our work is inherently different from theirs because of the different domains of
applicability. In [34], the authors propose a new replacement algorithm (SEQ) that detects a long
sequence of page-faults and resorts to a Most-Recently-Used (MRU) replacement algorithm on detecting
such sequences. In spirit, most of the proposed algorithms try to imitate the behavior of the OPT
[4] algorithm and therein lies the similarity of our proposed work with them. In [73], the authors
propose a k-order Markov chain to model the sequence of time intervals between successive ref-
erences to the same address in memory during program execution. Note that in this work, we
are only interested in the sequence of time intervals between successive page faults to the same
virtual memory page. In a more recent work [42], the authors propose an efficient buffer cache
replacement policy called LIRS (Low Interference Recency Set) that uses recency to evaluate
Inter-Reference Recency for making a replacement decision. The key insight of this technique
is that it attempts to capture the inter-reference recency (the number of unique blocks accessed
between two consecutive references to a block) to overcome the deficiencies of LRU. In [57], the
authors propose a self-tuning, low-overhead, scan-resistant algorithm (ARC) that exploits both
recency and frequency to adapt itself according to the workload characteristics. A modified ver-
sion of the algorithm, CAR, was proposed in a more recent paper [3]. It combines the benefits of
ARC [57] and CLOCK [18] and is more amenable for virtual memory replacement. Both ARC
and CAR require a significant amount of memory for recording history information, which could
prove to be expensive for the memory-intensive applications considered in this study. Additionally,
many of the proposed algorithms in the literature are usually interesting only from the theoretical
point of view and may not really be implementable in an operating system. In this work, we also
demonstrate a potential in-kernel approximation of our idea.
From a system’s perspective, a lot of work has been done [12, 15, 33, 70] for applications
that access disk-resident data sets through explicit I/O invocations. In the past, researchers
have also studied the problem of poor virtual-memory performance (implicit I/O) from an ap-
plication’s perspective [10, 22, 37, 58]. Researchers in [58] propose automatic compiler-driven
techniques to modify application codes to prefetch memory pages from disk/peripheral I/O de-
vices. In their prefetching scheme, the compiler provides the crucial information on future access
patterns, the operating system provides a simple interface for prefetching, and the run-time ac-
celerates performance by adapting to runtime behavior. In a subsequent work [10], they show
that primitives to release memory pages that are no longer needed, when used judiciously in
conjunction with their prefetching schemes, improve the response times of concurrently run-
ning interactive tasks. However, static techniques like the above require that the source code be
available and analyzable. On the OS side, a lot of work has been done towards detection of file
access patterns automatically in the file system [35, 48] or parametric specification of file access
patterns supplied by the application [13, 69]. However, much of this work involves using explicit
I/O interfaces to stage data from peripheral devices.
On the compiler side, researchers have primarily looked at reordering computation to
improve data reuse and reduce I/O time [7] or inserting explicit I/O calls into array-based codes
[8, 67]. Typically, compilers are aided by annotated source code or programming-language ex-
tensions to indicate properties of important data structures. Among the techniques developed to
improve I/O performance of applications by predicting reuse distances and dead-page distances,
the closest work is by Mowry et al. in [10, 58]. However, the method that we propose is a pure
runtime technique and does not require any source-level modifications. On the hardware side,
researchers have proposed a dead-block prediction scheme in [49] that predicts when a cache
block becomes “dead” and hence evictable. While our work shares a common goal with theirs,
prediction of lifetimes of virtual memory pages is inherently a different problem.
3.3 Experimental Framework
We now describe the applications and the simulation platform that we use in this study.
3.3.1 Applications
To evaluate the effectiveness of our approach, we measured its impact on the performance
of a selected set of memory-intensive SPEC CPU 2000 workloads [38] and seven of the memory-
intensive sequential versions of the NAS Parallel benchmark (NPB) suite [2]. Note that the
benchmarks we used in the study are among the most memory-intensive applications of the SPEC
benchmark suite. There is no inherent difficulty in running other SPEC benchmarks, but most
of the other SPEC benchmarks are CPU intensive and their working set sizes are very small and
consequently do not stress the virtual memory subsystem at all. It is for the same reason that
we also show results for 7 of the 8 NAS benchmarks whose virtual memory footprints are fairly
large in addition to the SPEC applications.
All the C benchmarks were compiled with gcc version 3.2.2 at optimization level
-O3, and the Fortran 90 benchmarks were compiled with the Intel Fortran 90 compiler at the
same optimization level. A brief description of the benchmarks and the sizes of the data sets
that they access are shown in Table 3.1. Since different applications have different working
set sizes, and since we wanted to exercise the virtual memory capabilities of the system, we
configured the memory available differently for these applications. The specific values are given
in Table 3.1. Unless otherwise mentioned, the memory configuration that we simulated for the
characterization experiments was fixed at 300 MB for most of the NAS Parallel Benchmarks,
128 MB for LU, CG, and all SPEC 2000 workloads, with the exception of GZIP and MCF, for
which we fixed it at 64 MB.
3.3.2 Experimental Platform
We characterize the virtual memory behavior of these applications and the potential of
our proposed replacement algorithms in the context of an execution-driven x86 simulator. The
simulations were executed on a Linux-2.4.20 kernel on a dual 2.7 GHz Xeon workstation with a
total of 1 GB physical memory and a 36 GB SCSI disk. The execution-driven simulator that we
used in this study is valgrind [79], which is an extensible x86 memory debugger and emulator.
Valgrind is a framework that allows for custom skins/plugins to be written that can augment the
basic blocks of the program as it executes. The skins/plugins that we implemented augmented
the basic blocks to return control to the skin after every memory-referencing instruction with the
value of the memory address that was referenced. The skins maintain data structures necessary
for implementing the techniques that we will be describing shortly and for collecting the relevant
statistics. The page fault statistics for these applications that were used for comparison with the
kernel-implementable version of our scheme were obtained on a uniprocessor Xeon workstation
running the 2.4.20 Linux kernel.

Table 3.1. Description of applications: the Total Memory column indicates the total/maximum
memory used by the application, and the Simulated Memory column indicates the simulated
memory size used for the characterization.

Name      Description                               Input Data Set                         Total    Simulated
                                                                                           Memory   Memory
IS        Integer Bucket Sort                       2^25 21-bit integers                   384 MB   300 MB
CG        Conjugate Gradient Method to solve        75000x75000 sparse matrix with         399 MB   128 MB
          an unstructured sparse matrix             15825000 non-zeroes
FT        3-D Fast-Fourier Transform of             256x256x256 matrix                     584 MB   300 MB
          complex numbers
MG        3-D Multi-Grid Solver                     256x256x256 matrix                     436 MB   300 MB
SP        Diagonalized Approximate                  102x102x102 matrices                   323 MB   300 MB
          Factorization
BT        Block Approximate Factorization           5x5x65x65x65 matrices                  400 MB   300 MB
LU        Simulated CFD using SSOR techniques       5x102x102 matrices                     178 MB   128 MB
GZIP      GNU Compression Utility                   TIFF image, webserver log, binary,     192 MB   64 MB
                                                    random data, tarball
WUPWISE   Physics/Quantum Chromodynamics            NITER=75, KAPPA=2.4E-1                 176 MB   128 MB
SWIM      Shallow Water Modeling                    1335x1335 matrix, 200 iterations       190 MB   128 MB
MCF       Combinatorial Optimization                default input                          190 MB   64 MB
APSI      Meteorology                               112x112x112 matrix for 70 iterations   191 MB   128 MB
3.4 Characterization Results
We first take a look at the page-fault characteristics of these applications. As indicated
in the previous section, we would like to characterize an application’s execution based on the
fault parameters. In particular, we want to understand the reasons as to why LRU performs
poorly from the fault parameters perspective, i.e., we want to characterize those situations in
which LRU holds on to the page long after it has been “dead”. To illustrate this, we give the
cumulative distribution plot of the ratio D/(L+D) for the LRU replacement algorithm for all the
applications for a specific memory configuration. If the replacement algorithm is really doing a
good job, then it would replace a page as soon as it became dead, or in other words, all replaced
pages would have a small dead-time distance D. In order to normalize the notion of small, we
compute the ratio D/(L+D) and plot the cumulative distribution plot of this ratio. Based on the
above explanation, we can now interpret the cumulative distribution plot of this ratio as follows
– if the plot peaks early, then the algorithm does a good job of evicting pages as soon as they
become dead. On the other hand, if it peaks late, then the algorithm is not doing a good job.
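The metric itself is simple to compute; the following is a minimal sketch of ours, assuming (page, L, D, W) records gathered from some replacement simulation (the record format and function name are illustrative, not from the thesis tooling):

```python
def dead_ratio_cdf(tuples, points=range(0, 101, 10)):
    """Given (page, L, D, W) records from a replacement simulation, compute
    the CDF of the ratio D/(L+D), expressed in percent, over all replaced
    pages: for each percentage point p, the fraction of pages with ratio <= p."""
    ratios = sorted(100.0 * d / (l + d) for _, l, d, _ in tuples if l + d > 0)
    if not ratios:
        return []
    n = len(ratios)
    return [(p, sum(r <= p for r in ratios) / n) for p in points]
```

A plot that "peaks early" corresponds to this CDF reaching values near 1.0 at small percentage points.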
In this study, we have used the following units to measure the distances: the number of
memory references and the number of page-faults to other pages. The reason we plotted two sets
of graphs was because the former is a quantity that needs hardware support and can only be
approximated in practice, while the latter can be measured by the operating system. In Figures
3.2 (a) and 3.3 (a), we find that with the exception of BT, LU and WUPWISE, the replacement
algorithm is not evicting dead pages quickly since we find that only 15-25% of all replaced pages
have their D/(L+D) ratios less than 50% when distances are measured as number of memory
references. Similarly, if we observe Figures 3.2 (b) and 3.3 (b), we find that 40% of all replaced
pages have their D/(L+D) ratios less than 50% when the distances are measured as page-faults
to other pages. This serves as a motivation for designing a better replacement algorithm that
needs to be more pro-active in choosing candidates for eviction. Thus, it is clear that the LRU
replacement algorithm performs poorly from the fault parameters perspective. In fact, Figures 3.4
(a) and (b) demonstrate the dismal performance of the LRU algorithm by plotting the execution
time profile of these applications when run on a Linux 2.4.20 kernel (averaged over 5 runs).
We next study the predictability characteristics of the fault parameters before we describe the
proposed algorithm.
Fig. 3.2. Ratio of D/(L+D) measured for NPB as (a) References, (b) Faults. [CDF plots; curves
for IS, FT, CG, MG, SP, BT and LU.]
Fig. 3.3. Ratio of D/(L+D) measured for SPEC2000 as (a) References, (b) Faults. [CDF plots;
curves for GZIP, WUPWISE, SWIM, MCF and APSI.]

Fig. 3.4. Where does time go? (a) NPB, (b) SPEC2000. [Execution-time breakdown into User
Time, System Time, Cold Pagefault Time and Warm Pagefault Time.]

As stated earlier, a pro-active replacement algorithm should evict a page as soon as it
becomes “dead”. However, in order for the system to predict which pages will become dead the
soonest, it needs to predict the lifetime distances (L) of pages. Hence, we felt that characteriz-
ing the variability, and thus the predictability, of the lifetime distance (L) was necessary, and
might even provide insights into the design of the predictors for the replacement algorithm. All the
characterization experiments were conducted using exact LRU as the replacement algorithm.
The first characterization experiment plots the cumulative distribution of the absolute dif-
ferences between successive values of lifetime distances of a virtual memory page. It is expected
that if lifetime distances are similar, the successive differences would be close to zero and hence
a CDF plot of the frequencies of occurrence of the differences would be steep in the beginning.
Yet another metric that can be gleaned from such a plot is the number of different dominant
values for the distribution. In this experiment, we compute the frequency of occurrences for
differences up to a large value (set to 50000), and all the differences greater than that are treated
identically. Yet another property that we wished to look at was whether the magnitude of the
differences had any discernible characteristic that we could take advantage of.
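The quantity characterized in this experiment can be computed directly; a small sketch of ours (the function name is an assumption), clamping at the same cutoff used in the experiment:

```python
def successive_differences(lifetimes, cap=50000):
    """Absolute differences between successive lifetime distances observed
    for one page; differences above `cap` are clamped and treated
    identically, mirroring the characterization experiment."""
    return [min(abs(b - a), cap) for a, b in zip(lifetimes, lifetimes[1:])]
```

A steep initial CDF of these values indicates that successive lifetimes are similar and hence predictable.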
A cumulative distribution plot of the frequencies of occurrences of differences in life-
time distances (absolute difference) is shown in Figures 3.5 (a) and (b) for the NAS benchmarks,
when measured in terms of (a) total memory references, (b) total number of page-faults to other
virtual memory pages. Figures 3.5 (c) and (d) plot the same for the SPEC 2000 benchmarks.
Figure 3.5 (a) indicates that for two of the NAS applications (MG, FT), less than 50% of occur-
rences have successive differences less than 10, while all the remaining applications’ successive
lifetime differences, when measured as memory references, are fairly predictable. Similarly,
Figure 3.5 (b) indicates that with the exception of the same two applications, all the remaining
applications have a fairly predictable differences of lifetime distances when measured in terms
of page faults. On the other hand, Figure 3.5 (c) indicates that four of the five SPEC 2000 appli-
cations have less than 50% occurrences of differences of less than 10 when the lifetime distances
are measured in terms of memory references, and Figure 3.5 (d) indicates that two of the five
SPEC 2000 applications have less than 55% occurrences of differences of less than 10 when
lifetime distances are measured in terms of page-faults. However, a CDF plot dilutes temporal
information and hence cannot be used for determining whether predictability exists or not. Our
subsequent experiments indicate that there is sufficient predictability that can be exploited. Fig-
ures 3.6 (a), (b) and Figures 3.6 (c), (d) plot the same with the only difference being that instead
of the absolute value, the actual difference between the successive lifetime distances is used for
the NAS and SPEC benchmarks respectively. From both these plots, it is clear that for most
of the benchmarks, differences of successive lifetime distances are fairly symmetric on either
side of zero, and a majority of them lie within a bounded range, which we surmise is because
of the applications’ structured access patterns. Figures 3.7 (a), (b) and Figures 3.7 (c), (d) plot
the cumulative distribution of the absolute differences between successive (L+D) values of a
virtual memory page. Note that a realizable implementation of any replacement algorithm in
the operating system cannot rely on accurately tracking memory references to calculate lifetime
distances. Therefore, the operating system must rely upon page-faults and page replacement
events to approximate lifetime distances of pages, or in other words the operating system needs
to approximate the lifetime distance using the measured L+D distances. It can be observed that
the CDF plots in Figures 3.7 (a), (b) and Figures 3.7 (c), (d) largely mirror the distribution of the
lifetime distances’ differences, though the exact values tend to differ.
Fig. 3.5. Absolute differences between successive L distances measured as (a) NPB2.3 - Total
Memory References, (b) NPB2.3 - Faults to other pages, (c) SPEC2000 - Total Memory References,
(d) SPEC2000 - Faults to other pages. [CDF plots with logarithmic x-axes.]
Fig. 3.6. Differences between successive L distances measured as (a) NPB - Total Memory
References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total Memory References, (d)
SPEC2000 - Faults to other pages. [CDF plots over the range -100 to 100.]
Fig. 3.7. Absolute differences of successive (L+D) distances measured as (a) NPB - Total
Memory References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total Memory References,
(d) SPEC2000 - Faults to other pages. [CDF plots with logarithmic x-axes.]
In the second set of experiments, we plot the cumulative distribution of the lifetime dis-
tance measured both in terms of the number of references and in terms of the number of page-
faults to other pages as shown in Figures 3.8 (a) and (b), respectively (for the same reasons as
stated before due to the inability of the operating system to measure distances in terms of num-
ber of memory references without hardware support). Although these graphs indicate that there
are no dominant values of lifetime distances and a seeming lack of predictability, it should be
kept in mind that a CDF plot dilutes and hides temporal information. For example, for a
highly predictable sequence 1, 2, 3, 1, 2, 3, and so on, the cumulative distribution plot would be a
45° straight line, since each unique value in the sequence occurs with equal frequency.
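This effect is easy to check mechanically; a tiny sketch (function name ours) that computes an empirical CDF over a sequence's values:

```python
from collections import Counter

def cdf(seq):
    """Empirical CDF of a sequence's values. For the periodic, highly
    predictable sequence 1,2,3,1,2,3,... every value is equally frequent,
    so the CDF rises in equal steps (a straight line), hiding the
    temporal regularity entirely."""
    counts = Counter(seq)
    n = len(seq)
    total, out = 0, []
    for value in sorted(counts):
        total += counts[value]
        out.append((value, total / n))
    return out
```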
Fig. 3.8. NPB: CDF of L distance measured as (a) References, (b) Faults.
Towards determining whether sufficient predictability exists in the sequence of lifetime
distances, we plot the variation of lifetime distances with time in our next set of experiments. It
is to be noted here that while such a characterization is relevant only for a fixed deterministic
replacement algorithm (e.g., exact LRU) and a particular memory configuration, it is important to
see if there are any patterns that can be exploited. In the interest of space, we show the time plots
only for a few applications. While there seems to be a certain degree of regularity and hence
predictability in the variation of L with time as seen in Figures 3.9 (a), (b) and 3.10 (a), (b), this
trend is not immediately evidenced in the cumulative distribution plots of L shown in Figures 3.8
(a) and (b). However, an implementation of a replacement algorithm using the lifetime distances
depends upon the predictability of L for a particular virtual memory page. Thus, as the next step,
we characterize the predictability of L for each virtual memory page.
Fig. 3.9. MG: Variation of L distance with time measured as (a) References, (b) Faults.
In order to capture this trend, we plot the coefficient of variance of L (the coefficient of
variance of a sequence is defined as the ratio of the standard deviation to the mean of the sequence)
for each virtual page. We use this metric instead of the standard deviation since the latter is
dependent upon the mean of the sequence and cannot be used as a straightforward yardstick
for comparison and/or predictability. In such a graph, points on the x-axis are the individual
virtual memory pages and the y-axis is the coefficient of variance of L computed for each page.

Fig. 3.10. SP: Variation of L distance with time measured as (a) References, (b) Faults.
Figures 3.11, 3.12 and 3.13 show the coefficient of variance plots for a few of the NPB and
SPEC 2000 applications. Since we are interested in finding out whether there is predictability
in the sequences of lifetime distances observed for each virtual memory page, we would like
the coefficient of variance of such a sequence to be as close to 0 as possible. If the coefficient
of variance is close to zero, it indicates that the lifetime distance sequence is almost constant
and hence we can easily predict the lifetimes of pages. However, a high coefficient of variance
does not necessarily indicate a lack of predictability. For instance, sequences
like those depicted in Figures 3.9 and 3.10 may have a high value of coefficient of variance but
nonetheless there is a predictable pattern that can be observed. It is also interesting to find that
there are applications for which the coefficient of variance for all pages is consistently close to
0, which indicates the underlying predictable nature of the page lifetimes (e.g., Figure 3.11(a)).
There are also applications, where we find that a certain subset of pages exhibit coefficient of
variance close to 0 and certain others for which it is very high (e.g., Figures 3.11(b), 3.12(a),
3.12(b) and 3.12(c)). There are also applications that do not exhibit either of the above (e.g.,
Figure 3.13(b)).
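The per-page metric used in these plots can be sketched as follows (using the population standard deviation, which is an implementation choice of ours rather than something the chapter specifies):

```python
import statistics

def coefficient_of_variance(lifetimes):
    """Coefficient of variance of one page's observed lifetime-distance
    sequence: standard deviation divided by the mean. Values near 0
    suggest the page's lifetime is nearly constant and easy to predict."""
    mean = statistics.fmean(lifetimes)
    return statistics.pstdev(lifetimes) / mean if mean else float("inf")
```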
Thus far, we have motivated the need for a better replacement algorithm and also looked
at the predictability characteristics of the fault parameters (lifetime distances) and observed that
there seems to be sufficient predictability for us to investigate better replacement algorithms that
will be elaborated upon in the next section.
Fig. 3.11. Coefficient of Variance of L for each page (a) IS, (b) MG, (c) SP.

Fig. 3.12. Coefficient of Variance of L for each page (a) FT, (b) BT, (c) LU.

Fig. 3.13. Coefficient of Variance of L distance for each page (a) WUPWISE, (b) MCF, (c) APSI.

3.5 Towards a Better Replacement Algorithm: Predictive Replacement

In the previous section, we observed that quite a few of the applications exhibited fairly
low coefficient of variance of lifetime distances, i.e., at the
individual page granularity there is sufficient regularity of lifetime distances that can be effec-
tively predicted. We also observed that there are also applications in which predictability of
lifetime distances for certain virtual memory pages is very low and this indicates the need for
some adaptive algorithm that dynamically enables or disables prediction based on the current
prediction accuracy. Based on the above observations, we now outline a novel page replace-
ment algorithm. In this approach, the system maintains an additional list that is “approximately”
sorted on system-predicted values of L, which we call the Dead-List (somewhat of a misnomer,
since it keeps track of when a page would become dead in the future rather than currently dead
pages!), in addition to the LRU list. Both the LRU and the Dead-Lists are lists of physical mem-
ory pages, and hence we would only need to store an extra pointer to traverse the Dead-List. At
the time of replacement of pages, the system checks if the head of the Dead-List has expired
(i.e., whether or not the system has decided/predicted that a page will not be accessed anymore
in this time-interval. The issue of how the system does this prediction is explained a little later
in this section.) and if so, it decides to replace that page. Note that since we keep the list ap-
proximately sorted on estimated lifetimes, we do not need to look through the whole list. If this
list is empty or if the page has not yet been estimated to have expired, then the system defaults
to the LRU replacement algorithm. We now present the steps for this predictive replacement
algorithm:
• When a page fault occurs on some page(X), we check if the Dead-List is empty, or if the
head of the Dead-List is not yet estimated to be dead.
• If the above condition is true, we initiate the normal LRU replacement algorithm, and
delete the page from the Dead-List as well if needed (it may not necessarily be the head
of the Dead-List).
• Otherwise, we dequeue the head of the Dead-List and choose that as a candidate for re-
placement. Note that we need to delete it from the LRU list as well.
• Once the candidate for replacement has been decided, we need to insert the currently
faulted page into the sorted Dead-List based on our estimated/predicted value of L for that
page.
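The steps above can be sketched as follows. This is a simplified model of ours, not the in-kernel implementation: the Dead-List is modeled as a dictionary scanned for the earliest predicted expiry (rather than the sorted structure discussed next), lifetimes are measured in page faults, and the predictor is a caller-supplied function; all names are illustrative.

```python
from collections import OrderedDict

class PredictiveReplacer:
    """LRU list plus a Dead-List of predicted page-death times. On a fault,
    an expired Dead-List head is evicted; otherwise we fall back to LRU."""

    def __init__(self, capacity, predict_lifetime):
        self.capacity = capacity
        self.predict = predict_lifetime   # page -> predicted L (in faults), or None
        self.lru = OrderedDict()          # resident pages, MRU at the end
        self.expiry = {}                  # page -> predicted death time
        self.now = 0                      # global page-fault counter

    def _dead_head(self):
        # head of the Dead-List: resident page with the earliest predicted expiry
        return min(self.expiry, key=self.expiry.get, default=None)

    def touch(self, page):
        """Reference a page; on a fault, return the evicted victim (or None)."""
        if page in self.lru:
            self.lru.move_to_end(page)
            return None                          # hit
        self.now += 1                            # page fault
        victim = None
        if len(self.lru) >= self.capacity:
            head = self._dead_head()
            if head is not None and self.expiry[head] <= self.now:
                victim = head                    # predicted-dead page, evict early
            else:
                victim = next(iter(self.lru))    # default to the LRU victim
            self.lru.pop(victim)
            self.expiry.pop(victim, None)        # remove from Dead-List too
        self.lru[page] = None
        life = self.predict(page)
        if life is not None:                     # insert faulted page by predicted L
            self.expiry[page] = self.now + life
        return victim
```

With capacity 2 and a predictor that gives page 'a' a lifetime of one fault, the sequence a, b, a, c evicts 'a' on the last fault even though 'b' is the least recently used page, illustrating the pro-active eviction.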
Keeping the Dead-List exactly sorted could be a source of significant overhead that increases
the page-fault service time; consequently, any reduction in page-faults might not translate into
a reduction in overall execution time. There are two possible solutions to
this problem. One possible alternative would involve using a min-heap, where the
root of the heap would hold the page that is estimated to have the least lifetime, and the other
alternative involves keeping the list approximately sorted. The latter scheme involves chaining
pages with similar values of estimated lifetime distances in a hash bucket. Note that this kind of
scheme also entails one extra pointer per physical memory page since a page cannot be in more
than one hash bucket. In the case of a heap-based implementation, the worst-case time complex-
ity of insertion of pages is O(log n), and O(1) for deletion, while the hash-chaining scheme’s
time complexity is O(1) for insertion (since pages are always inserted at the tail) and O(m) for
deletion of pages, where n is the number of physical memory pages and m is the number of hash
chains. Another possible drawback of the heap-based scheme is the need for locking the entire
heap during insertion that could prove to be expensive on multi-processor systems, whereas the
hash-chaining based scheme involves locking only the appropriate hash chain. In this study, we
have explored the hash-chaining based “approximately” sorted scheme for managing the Dead-
List with a fairly small value of m (currently set to 101). Another point to be noted in the above
description of the algorithm is the deliberate omission of the prediction mechanism, since that
is a parameter that we want to experiment with for good performance. Note that the optimal
algorithm would be a perfect predictor, and would replace a page as soon as it becomes dead.
Recall that lifetime distances can be measured accurately only with adequate hardware support, and are quite hard to measure practically in an operating system. Thus, we break
up our prediction schemes into two categories: estimation techniques with hardware support and
an operating-system-implementable estimation technique, which are explained in the next two
subsections.
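The hash-chaining scheme for the approximately sorted Dead-List can be sketched as follows; the bucket width and the bucketing function are our illustrative assumptions (the thesis only fixes the number of chains, m = 101):

```python
from collections import deque

M = 101  # number of hash chains (the value used in this study)

class ApproxDeadList:
    """Approximately sorted Dead-List: pages with similar estimated
    lifetimes are chained in one hash bucket, so insertion is O(1) at the
    tail of a chain and each page needs only one extra chain pointer."""
    def __init__(self, bucket_width=10):
        self.width = bucket_width
        self.chains = [deque() for _ in range(M)]

    def bucket(self, lifetime):
        # Pages with nearby lifetime estimates land in the same chain.
        return (lifetime // self.width) % M

    def insert(self, pid, lifetime):
        self.chains[self.bucket(lifetime)].append((pid, lifetime))  # O(1)

    def remove(self, pid, lifetime):
        chain = self.chains[self.bucket(lifetime)]  # lock only this chain
        for entry in chain:
            if entry[0] == pid:
                chain.remove(entry)
                return True
        return False
```

On a multiprocessor, only the chain touched by `remove` would need locking, which is the advantage over the heap-based alternative noted in the text.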
3.5.1 Estimation Techniques with Hardware Support
With appropriate hardware support, we can track and measure the lifetime distances of virtual memory pages, and we have experimented with simple prediction schemes that use these measurements. In all these experiments, the metric used to compare performance is the number of page-faults normalized to the base LRU scheme. The simulation framework that we have built upon Valgrind maintains two global counters: one that increments on every memory reference (G1) and another that increments on every page-fault (G2). In addition, each
virtual-memory page has a set of four counters associated with it, which record the following
information:
• Timestamp of the last access to that page (C1).
• Page-Fault Counter at the time of the last access to that page (C2).
• Timestamp of the previous page-fault to that page (C3).
• Page-Fault Counter at the time of the previous page-fault to that page (C4).
A possible concern that might arise in this regard is the storage and access efficiency costs for
these counters. A possible solution to this problem is to store these counters in the unused bits
of a page-table entry for a virtual memory page, which can also be subsequently cached in the
TLB after the first access to the page. On each access to a page (whether it be a hit or a miss)
the G1 counter is incremented, and the G2 counter is incremented only on a miss. On every
hit access to a page, counters C1 and C2 are updated to store the latest values of G1 and G2
respectively. On every page-fault/miss, counters C3 and C4 are updated to store the latest values
of G1 and G2 respectively. At the time of a page-fault (miss), the system can now measure L
both in terms of the memory references (L=C1-C3) and in terms of number of page-faults to
other pages (L=C2-C4). In all schemes described below, the system uses the measured value
of the lifetime distance to predict the next lifetime distance for the page. Once the prediction
is done, we insert the currently faulted page into the appropriate bucket of the Dead-List based
on this estimated value. For the remainder of the discussion, we only consider the lifetime
distances measured in terms of the number of page-faults to other pages. The characterization
experiments in Section 3.4 indicated the high predictability of Lifetime Distances. In particular,
Figures 3.5 (b), (d) and Figures 3.6 (b), (d) show that a majority of the differences between
successive lifetime distances of a virtual memory page is within a bounded range (around 10) ,
thus indicating that simple variants of a Last-Value predictor would be sufficient in estimating
lifetime distances fairly accurately.
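The counter bookkeeping described above, and the measurement of L at fault time, can be sketched as follows (the counters G1, G2 and C1–C4 are from the text; the class structure is ours):

```python
class PageCounters:
    def __init__(self):
        self.c1 = self.c2 = self.c3 = self.c4 = 0

class LifetimeTracker:
    """Global counters G1 (memory references) and G2 (page-faults), plus
    per-page counters C1-C4, updated as described in the text."""
    def __init__(self):
        self.g1 = 0       # increments on every memory reference
        self.g2 = 0       # increments only on a miss (page-fault)
        self.pages = {}

    def access(self, pid, miss):
        """Record one reference; on a miss, return the measured lifetime
        distance L, both in references (C1-C3) and in faults (C2-C4)."""
        self.g1 += 1
        page = self.pages.setdefault(pid, PageCounters())
        lifetime = None
        if miss:
            self.g2 += 1
            lifetime = (page.c1 - page.c3,   # L in memory references
                        page.c2 - page.c4)   # L in page-faults to other pages
            page.c3, page.c4 = self.g1, self.g2   # record this fault
        else:
            page.c1, page.c2 = self.g1, self.g2   # record this hit
        return lifetime
```

In hardware, C1–C4 could live in the unused page-table-entry bits as the text suggests; here they are ordinary fields for illustration.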
• Static variant of Last Value Prediction (Last Static k): In this scheme, if a page's L value was measured to be Li at the time of a page-fault, we predict the next lifetime of the page as Li+1 = Li + k, where k is a static constant. For the following
set of experiments, we have experimented with 5 different k values, namely -10, -5, 0, 5,
and 10. These values were selected based on the observations made in Figures 3.6 (b) and
(d). Note that a value of 0 for this constant is exactly equivalent to a Last-value predictor,
while a positive value of this constant implies a conservative approach towards estimating
lifetimes.
• Adaptive Variant of Last Value Prediction (Last-Dynamic): This scheme tries to overcome the previous technique's limitation by reducing the number of predictions made, based on the observed accuracy of past predictions. In this technique, we associate a state with
each virtual memory page that can assume three values: namely Sinit, Stransient and
Spred. The key idea in this algorithm is to disable prediction unless the state associated
with the virtual memory page is equal to Spred. The algorithm uses a simple three-state
machine and works as follows on a page fault:
– If the state associated with the currently faulted page is Sinit, and if the difference
(Ldiff ) between the last two observed lifetime distances is less than or equal to a
threshold (LVthresh), it sets the state for the currently faulted page to Stransient.
– If the state associated with the currently faulted page is Stransient, and if the differ-
ence (Ldiff ) between the last two observed lifetime distances is less than or equal
to LVthresh, it sets the state for the page to Spred, and estimates the lifetime of this
page as, Li+1 = Li + Ldiff . Otherwise, if the difference (Ldiff ) is greater than
LVthresh, it sets the state of the page to Sinit, and disables prediction.
– If the state associated with the currently faulted page is Spred, and if the difference
(Ldiff ) between the last two observed lifetime distances is less than or equal to
LVthresh, it estimates the lifetime of this page as, Li+1 = Li + Ldiff . Otherwise,
if the difference (Ldiff ) is greater than LVthresh, it sets the state of the page to
Stransient and disables prediction.
Associating multiple states allows for disabling prediction during temporary bursts and/or
sequences where predictability is poor. A critical parameter in the above algorithm is the
value of the threshold (LVthresh), since it determines how aggressive or conservative a
scheme is. Choosing a small value for this threshold will allow for very few predictions
(conservative), and choosing a large value could potentially allow more prediction-based
replacements (aggressive). We find that different applications have different ranges of
thresholds over which good performance is achieved, as shown in the next section. However, determining the best thresholds for a particular application and memory configuration, and relating them to application characteristics, is beyond the scope of this study and is the focus of our future efforts.
• EELRU: In [82], the authors propose an adaptive replacement algorithm that uses a simple online cost-benefit analysis to guide its replacement decisions; it is considered one of the state-of-the-art algorithms for addressing the performance shortcomings of LRU. Hence, we have also compared the performance of our schemes with EELRU in the
subsequent evaluations.
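The Last Static and Last-Dynamic predictors described above can be sketched as follows. This is a minimal model: the class structure is ours, and we assume Ldiff is the absolute difference between the last two observed lifetime distances (the text says only "difference").

```python
S_INIT, S_TRANSIENT, S_PRED = "init", "transient", "pred"

def last_static(l_last, k):
    """Last Static k: predict L_{i+1} = L_i + k for a static constant k."""
    return l_last + k

class LastDynamic:
    """Three-state adaptive variant: predictions are enabled only once the
    page's lifetime distances have been stable for two consecutive faults."""
    def __init__(self, lv_thresh):
        self.thresh = lv_thresh   # LVthresh from the text
        self.state = S_INIT
        self.prev = None          # previously observed lifetime distance

    def on_fault(self, l_obs):
        """Feed an observed lifetime distance; return a prediction or None."""
        if self.prev is None:
            self.prev = l_obs
            return None
        l_diff = abs(l_obs - self.prev)
        self.prev = l_obs
        if self.state == S_INIT:
            if l_diff <= self.thresh:
                self.state = S_TRANSIENT
            return None
        if self.state == S_TRANSIENT:
            if l_diff <= self.thresh:
                self.state = S_PRED
                return l_obs + l_diff       # L_{i+1} = L_i + Ldiff
            self.state = S_INIT
            return None
        # state == S_PRED
        if l_diff <= self.thresh:
            return l_obs + l_diff
        self.state = S_TRANSIENT            # disable prediction for now
        return None
```

A `None` return models "prediction disabled": the page would then be handled by the default LRU path rather than inserted into the Dead-List.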
3.5.2 OS-Implementable Estimation Technique
In this scheme, the OS needs to keep a counter (G1) that keeps track of the number of
page-faults that have been incurred by the application. On each page-fault, this counter (G1)
needs to be incremented. In addition, we need to associate a counter for each virtual page (C1),
(likewise, this can be stored in the unused bits of the page-table entry after suitable encoding)
which is updated whenever a page-fault occurs on that page, i.e., we set C1 to the latest value of
G1 at the time of a page-fault to a page. At the time of subsequent page-faults to the same page,
we can now estimate L+D as G1-C1. The in-kernel scheme that we propose uses this value to
estimate lifetime distance of the page, which we denote as DP-Approx.
DP-Approx is a novel replacement algorithm that uses exponential averaging to estimate
the lifetime distance. As mentioned earlier, since the operating system does not get control over
individual memory references, any in-kernel approximation of the replacement algorithm needs
to make use of OS-visible events like page-faults and replacements. Therefore, in this technique
we start out by estimating the lifetime distance as L+D, and we subsequently use exponential
averaging to predict the next lifetime distance as L_pred^(t+1) = a * (L+D)_measured^t + (1 - a) * L_pred^t. Unless otherwise stated, we fix the value
of the parameter “a” as 0.5, which means that we give equal weights to the current measurement
and previously estimated lifetimes. One may object that reliance on the parameter "a" reduces the chance of a successful implementation, but we believe it is possible to build a more sophisticated scheme in which the value of "a" is determined dynamically. Our intent in this study is to demonstrate a proof-of-concept strategy that can be practically realized without much overhead; determining the best "a" value automatically is itself an interesting research topic that we wish to address in the future.
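DP-Approx's estimation step can be sketched as follows; G1 and the per-page counter C1 are as defined above, while the class and method names are our illustrative choices:

```python
class DPApprox:
    """Exponentially average observed (L+D) distances to estimate the next
    lifetime distance, using only OS-visible events (page-faults)."""
    def __init__(self, a=0.5):
        self.a = a         # exponential averaging factor
        self.g1 = 0        # global page-fault counter (G1)
        self.c1 = {}       # per-page: value of G1 at the page's last fault (C1)
        self.pred = {}     # per-page: current lifetime-distance estimate

    def on_page_fault(self, pid):
        """Called on every fault; returns the updated estimate, or None on
        the page's first fault (no measurement exists yet)."""
        self.g1 += 1
        if pid in self.c1:
            measured = self.g1 - self.c1[pid]           # observed (L+D)
            old = self.pred.get(pid, measured)          # seed with first sample
            self.pred[pid] = self.a * measured + (1 - self.a) * old
        self.c1[pid] = self.g1
        return self.pred.get(pid)
```

With a = 0.5 the estimate weights the latest (L+D) measurement and the previous estimate equally, as stated in the text.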
3.5.3 Results with Predictive Replacement Techniques
Figures 3.14 and 3.15 plot the normalized page-faults for the static and adaptive variants
of the last-value-prediction-based replacement algorithms for the applications, compared with
the base LRU replacement algorithm. The results for the Last static schemes shown in Figures
3.14 and 3.15 have the static parameter set to the following values (-10, -5, 0, +5 and +10). The
results for the best performing adaptive scheme (Last dynamic at a fixed value of the thresh-
old) are also shown in Figures 3.14 and 3.15. The threshold value at which each application performed best is summarized in Table 3.2. From Table 3.2,
we observe that applications for which the Last static schemes performed well require a higher
threshold for the adaptive schemes to show benefits. This can be attributed to the fact that a
higher threshold value lends itself to an aggressive algorithm that predicts more often, which in
turn is good for such applications as demonstrated by the good performance of the static algo-
rithms. Analogously, applications that perform poorly with the Last static algorithm due to a
high number of potentially incorrect predictions require a low value of threshold that lends itself
to a conservative algorithm that predicts less often. An interesting area of research that has not
been addressed here is the design of a self-tuning, adaptive algorithm that adjusts the thresh-
olds dynamically without manual assignment, which is the focus of our future efforts. As we
pointed out earlier, any replacement algorithm that predicts lifetime distances incorrectly may
worsen the performance of the application, since it may cause more page-faults by replacing a
page ahead of when it actually became dead. From Figures 3.14 and 3.15, we can observe that
the static variants of the predictive algorithm can significantly out-perform LRU in six of the
twelve applications (IS, CG, BT, WUPWISE, MCF, SWIM), but can also degrade the perfor-
mance sometimes quite significantly in the remaining six applications. It is also clear that the
adaptive algorithm out-performs LRU in all the applications, thus indicating that the adaptation
ensures that the performance never gets degraded badly. It must also be noted that the performance of the adaptive schemes is better than that of their static counterparts in ten of the twelve applications considered and nearly as good in the remaining two (IS, MCF), which also indicates that the dynamic scheme adapts itself better and resorts to prediction judiciously. Figure 3.16 plots the normalized invocation counts of the predictive replacement algorithm and the LRU replacement algorithm, which is essentially the number of times a particular
algorithm was invoked for replacement. In this graph, the five bars for each application denote
the five different schemes for which we have shown results thus far (i.e., four static variants,
one adaptive scheme). This graph serves to reinforce the fact that the adaptive scheme resorts to
using prediction-based replacement as sparingly as possible without worsening the performance.
In summary, we find that, for all the applications considered, an adaptive prediction-based al-
gorithm always performs better than LRU in terms of reductions in the number of page-faults,
sometimes by as much as 78%.
3.5.4 Comparison with EELRU
The authors in [82] proposed an adaptive replacement algorithm (EELRU) that uses the same kind of recency information as LRU; using a simple online cost-benefit analysis, they demonstrate that their algorithm outperforms LRU in the context of virtual memory systems. The basic
intuition behind the EELRU technique is that, when the system notices that a large number of
[Figure omitted: bar chart of normalized page faults for GZIP, WUP, SWI, APSI and MCF under Last-static(-10), Last-static(-5), Last-static(0), Last-static(5), Last-static(10) and Last-dynamic(best).]
Fig. 3.14. Normalized page-fault counts of the replacement algorithm for SPEC 2000 with respect to the perfect LRU scheme.
[Figure omitted: bar chart of normalized page faults for IS, FT, CG, MG, SP, BT and LU under Last-static(-10), Last-static(-5), Last-static(0), Last-static(5), Last-static(10) and Last-dynamic(best).]
Fig. 3.15. Normalized page-fault counts of the replacement algorithm for NPB 2.3 with respect to the perfect LRU scheme.
Table 3.2. Threshold values of applications.

NPB 2.3 Application   Threshold     SPEC Application   Threshold
IS                    8000          GZIP               30
FT                    15            WUP                4000
CG                    800           SWIM               4000
MG                    30            MCF                10000
SP                    60            APSI               60
BT                    45
LU                    60
[Figure omitted: two bar charts of normalized invocation counts, (a) over GZIP, WUP, SWIM, APSI, MCF and (b) over IS, FT, CG, MG, SP, BT, LU; bars: DP-invocations, LRU-invocations.]
Fig. 3.16. Normalized invocation counts of the replacement algorithm for (a) SPEC 2000, (b) NPB 2.3.
pages are being touched in a roughly cyclic pattern that is larger than main memory, it diverges from LRU, since such a pattern is known to be a worst-case scenario for LRU. In order to detect such situations, the system needs to track the number of pages that were touched since a
page was last touched, which is exactly the same kind of information that LRU maintains, but
EELRU maintains it for both resident and non-resident pages. Once such a situation is detected,
they apply a fall-back replacement algorithm that evicts either the least-recently used page or
a page from a pre-determined recency position. In the context of virtual memory systems, the
EELRU approach is considered one of the state-of-the-art approaches to improving the
performance of LRU, and hence, we wished to compare the performance of our schemes with
EELRU. Note that EELRU is also a simulation-based approach, and no practical approximation
of it has been demonstrated thus far in an operating system. Hence, the comparison is done
only with the hardware-based simulation techniques that we have proposed thus far (see Section
3.5.1).
The first two bars for each application in Figures 3.17 and 3.18 show the normalized page fault counts for the best-performing prediction-based replacement algorithm and EELRU with respect to perfect LRU. Therefore, the taller the second bar, the better the prediction-based algorithm performs compared to EELRU. From Figures 3.17 and 3.18, it is clear
that both the prediction-based and EELRU replacement algorithms out-perform LRU for all the
applications. Further, we also notice that the best-performing predictive replacement schemes
outperform EELRU in nine of the twelve applications that we tested against, namely FT
(17.49%), CG (3.84%), MG (16.19%), SP (75.56%), BT (78.18%), LU (50.43%), WUPWISE
(35.36%), SWIM (38.03%) and APSI (33.78%), where the percentages in parentheses indicate
the reduction in number of page-faults compared to the EELRU scheme. With the exception of
GZIP, where EELRU performs dramatically better than any of our prediction schemes, we find
that on the average, our scheme generates around 15% lower page-faults than EELRU over all
the applications (around 26% lower page-faults than EELRU over all applications except GZIP).
It must be remembered that the predictive algorithm needs sufficient history to start prediction,
and in two of the three applications (IS, GZIP) where EELRU performs better, quite a few pages
are accessed exactly once, which does not allow the prediction-based replacement algorithm to
start replacing such pages proactively.
[Figure omitted: bar chart of the ratio of page faults to LRU for GZIP, WUP, SWI, APSI and MCF; bars: Best-Last-value-predictor, EELRU.]
Fig. 3.17. Comparison of the best prediction-based replacement algorithm with EELRU for SPEC 2000 using the ratio of page-faults in comparison to the perfect LRU scheme.
[Figure omitted: bar chart of the ratio of page faults to LRU for IS, FT, CG, MG, SP, BT and LU; bars: Best-Last-value-predictor, EELRU.]
Fig. 3.18. Comparison of the best prediction-based replacement algorithm with EELRU for NPB 2.3 using the ratio of page-faults in comparison to the perfect LRU scheme.
3.5.5 Performance of DP-Approx
As discussed in Section 3.5.2, DP-Approx is an in-kernel, practical, realizable implemen-
tation of the prediction-based replacement techniques. The kernel uses exponential averaging of
the measured (L+D) distances to estimate the lifetime distances of virtual memory pages. For a
fair evaluation of this scheme, we show the reduction in the number of page-faults over the actual
number of page-faults reported by the Linux kernel when running the application natively. We
augmented the Linux 2.4.20 kernel with a new system call (getrusage2), along the lines of an
existing system call (getrusage) that returns the number of cold and warm misses/faults. The
experimental data for this study was collected on a uniprocessor Xeon-based machine running the modified Linux 2.4.20 kernel, which was instructed to use a specified amount of main memory through a command-line option in the boot-loader; each data point is the average over five runs of the application. Figures 3.19 (a) and (b) plot the normalized reduction in the number of page
faults using the DP-Approx technique in comparison to the Linux 2.4.20 kernel’s page-faults.
It is clear from the figures that the DP-Approx technique outperforms the kernel’s replacement
algorithm in all but one application (APSI) by reducing page-faults by as much as 56%. We
find that on the average, the DP-Approx scheme gives around 14% lower page-faults than the
kernel’s replacement algorithm over all the applications (18% lower page-faults than the kernel’s
replacement scheme over all applications except APSI).
[Figure omitted: two bar charts of normalized page-fault reduction, (a) over GZIP, WUP, SWI, APSI, MCF and (b) over IS, FT, MG, SP, BT, LU.]
Fig. 3.19. Normalized page-fault reduction of DP-Approx algorithm in comparison to Linux kernel 2.4.20 execution for (a) SPEC 2000, (b) NPB 2.3.
3.5.6 Sensitivity Analysis
While the techniques that we have proposed thus far yielded significant reductions in
page-faults compared to LRU, it must be remembered that most of these schemes (with the
exception of DP-Approx) cannot translate to an in-kernel implementation without hardware sup-
port, due to the inability of the operating system to keep track of individual memory references.
In the DP-Approx technique, we approximate the lifetime distances by exponentially averaging
observed (L+D) distances in the operating system. Recall from the earlier discussion that this
technique depended on the exponential averaging factor “a”, and in the next set of experiments,
we wanted to study the impact on the performance by varying this parameter. In Figures 3.20 (a)
and 3.21 (a), we plot the normalized page-faults incurred by applications with the DP-Approx
replacement technique over that of LRU. Although such a comparison is not really fair, since one
is a scheme that can be implemented, and the other can at best be approximated, we wanted to
see if any trends can be observed that can help in designing a self-tuning online algorithm. We
observe that in most of the applications the value of “a” is critical for good performance. In Fig-
ures 3.20 (b) and 3.21 (b), we plot the prediction accuracy of the exponential averaging scheme
for the different values of "a", and find that in almost all cases we ended up under-predicting lifetime distances, which helps explain why the performance of this replacement algorithm is not much better than LRU: under-prediction increases the page-faults that the application incurs. Note that a large number of over-predictions may also not perform better than LRU, since we may not be aggressive enough
MG, MCF and WUPWISE where this scheme performs better than LRU, the percentage of over-
predictions is higher than the rest. This indicates that any approximation scheme needs to track
the lifetime distances quite accurately for good performance.
[Figure omitted: (a) bar chart of normalized page faults over GZI, SWI, MCF, WUP, APS for a = 0.25, 0.50, 0.75; (b) stacked prediction-accuracy bars: under-predicted, exact-predicted, over-predicted.]
Fig. 3.20. SPEC 2000 (a) Sensitivity of DP-Approx to parameter "a", (b) Prediction accuracy.
[Figure omitted: (a) bar chart of normalized page faults over IS, MG, SP, BT, FT, LU for a = 0.25, 0.50, 0.75; (b) stacked prediction-accuracy bars: under-predicted, exact-predicted, over-predicted.]
Fig. 3.21. NPB 2.3 (a) Sensitivity of DP-Approx to parameter "a", (b) Prediction accuracy.
3.6 Conclusions and Future Work
In this chapter, we have presented a novel technique for tracking an application's virtual memory access pattern in the operating system for proactive memory management, replacing virtual memory pages as soon as they become dead. The contributions of this work can be
summarized as follows:
• Demonstrating the sub-optimal performance of LRU-like replacement algorithms on scientific applications' access patterns, from the perspective of application characteristics and fault parameters, and showing that LRU-like replacement algorithms hold onto virtual memory pages long after they are "dead".
• Characterizing the predictability of the fault parameters from an application’s perspective.
• Using these parameters in conjunction with simple predictors (variants of Last-value pre-
dictors) to design a novel set of replacement algorithms.
• Evaluating the performance of these replacement algorithms on a set of 12 different memory-intensive applications drawn from the NAS and SPEC 2000 application suites, and concluding that a prediction-based replacement algorithm can significantly out-perform LRU by yielding as much as a 78% reduction in the number of page-faults. On average, a prediction-based replacement scheme yields around a 48% reduction in page-faults in comparison to LRU.
• Evaluating and comparing the performance of our techniques with EELRU, which is considered one of the state-of-the-art algorithms for improving the performance of LRU in the context of virtual memory systems, and demonstrating that our predictive replacement schemes can reduce the number of page-faults over EELRU in 9 of the 12 memory-intensive applications with which we experimented, by as much as 78%. On average, the predictive replacement schemes yield around 15% fewer page-faults than EELRU.
• Designing and implementing a novel in-kernel approximation algorithm that can estimate lifetime distances using just the parameters that an operating system can measure. On average, this algorithm yields around 14% fewer page-faults than the Linux 2.4.20 kernel's replacement algorithm, and as much as a 56% reduction in the number of page-faults. This can serve as a much better alternative to the approximate-LRU or not-recently-used replacement algorithms typically implemented in the kernel (note that schemes such as LRU or EELRU are not implementable in practice).
We do realize that many of the proposed schemes depend upon parameters that need to be tuned
for good performance. For instance, the adaptive variant of the Last-Value predictor based re-
placement algorithm relies on the threshold value, and DP-Approx relies upon the exponential
averaging constant for good performance. Consequently, an interesting avenue of research that we have not addressed in this thesis is relating application characteristics to parameter auto-selection, or a self-tuning algorithm that auto-tunes the parameters for good performance. We
also have not explored the effects of multi-programming and OS implementation issues (like
memory space needed for prediction), that are the focus of our future research agenda. Another
interesting aspect that needs investigation is a prediction-based prefetching mechanism similar
to the replacement techniques proposed here. We also plan to design and implement the optimal
predictor-based replacement algorithm and compare its performance with our prediction-based
replacement algorithm to get an idea of how much better we could perform. In the future, we
also plan to incorporate compile-time information on future access patterns that can be used in
conjunction with the run-time based prediction schemes that we have proposed.
Chapter 4
Synergistic Scheduling
4.1 Introduction
Scheduling policies implemented by today’s operating systems cause memory intensive
applications to exhibit poor performance and throughput when all the applications’ working sets
do not fit in main memory. A primary cause for this is that scheduling algorithms do not take
memory size considerations into account. Multi-programming has always been a thorn in the
development of efficient programs for non-dedicated computational platforms such as academic
or production servers. Sharing of resources such as processor, memory and I/O devices make
it hard for programmers to make assumptions on the availability of resources. Programmers typically write programs under the assumption that the entire system's resources are at their disposal, which may not hold true on non-dedicated computational clusters. The performance
penalty of memory pressure can be severe, because the operating system might be forced to page
data to and from the hard disk. Paging has a very large cost compared to the cost of accessing
memory and the slowdown experienced by the job that is being paged during its execution can
be quite unpredictable and/or large.
While there have been several attempts [52, 54, 56, 88] in the past at efficient processor allocation in multi-programming scenarios, studying the impact of the memory requirements
of jobs on scheduling policies has received far less attention. As [63] rightly states, much of
the previous work in this area has considered the problem of memory pressure as a problem
of admission control and devised scheduling strategies that permit or forbid execution of jobs
based on user-estimated memory requirements or a snapshot of memory occupancy at the time
of job submission. Both these criteria can prove to be extremely inaccurate, since memory re-
quirements of tasks can be quite unpredictable. Further, any under/over estimation of memory
requirements can lead to terrible performance and/or severe under-utilization of resources. In [63], the authors propose application-level modifications coupled with operating-system-level extensions as a solution to this problem. However, practical deployment considerations have led us to believe that any solution to this problem has to be handled entirely within the operating system or a runtime layer, rather than require application-level changes.
An intuitive and accurate model of the memory requirements of a process was formulated
by Denning in his seminal work on working set theory [26]. His proposal for a working-set based model has been the theoretical basis of many subsequent approaches. The working
set of pages associated with a process is defined to be the collection of its most recently used
pages, and provides knowledge vital to the dynamic management of paged memories. More formally, the working set W(t, τ) of a process at time t is defined to be the collection of pages referenced by the process during the process-time interval (t − τ, t). The parameter τ is defined to be the working-set parameter. Further, the model defines the working set size ω(t, τ)
as the number of pages/elements in W(t, τ). Determining the working set size of every process enables an efficient implementation of a virtual-memory-aware scheduler, since the scheduling decision essentially reduces to the well-understood 0-1 knapsack (bin-packing) problem, where the job of the scheduler is to determine a subset of processes whose overall working-set size fits within the available physical memory. However, it has been a well-acknowledged problem that
determining the working set size of a program in execution is hard and consequently approxima-
tions in software as well as hardware modifications to track working sets abound in the literature
[74, 85, 97].
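Denning's definitions translate directly into code given a reference trace, and a simple greedy packing illustrates the scheduler's selection problem. This is our illustrative sketch, not the thesis's scheduler: the greedy heuristic is one suboptimal way to approach the 0-1 knapsack problem, and the half-open interval convention is our assumption.

```python
def working_set(trace, t, tau):
    """W(t, tau): pages referenced during the process-time interval
    (t - tau, t] (we include t itself, a common convention)."""
    return {page for time, page in trace if t - tau < time <= t}

def working_set_size(trace, t, tau):
    """omega(t, tau): the number of pages in W(t, tau)."""
    return len(working_set(trace, t, tau))

def pack_processes(ws_sizes, mem_pages):
    """Greedy 0-1 packing: pick a subset of processes whose combined
    working-set sizes fit in physical memory (a heuristic; the exact
    problem is the NP-hard knapsack)."""
    chosen, used = [], 0
    for pid, size in sorted(ws_sizes.items(), key=lambda kv: kv[1]):
        if used + size <= mem_pages:
            chosen.append(pid)
            used += size
    return chosen
```

The packing step is where the hard part lies in practice: `ws_sizes` must come from an online working-set-size estimate, which is exactly what the literature cited above attempts to approximate.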
In this chapter, we are concerned with the detrimental effect that paging inflicts on the
performance of jobs running on multi-programmed machines. More specifically, we consider
the problem of paging in the case of multi-programmed systems where processes/jobs can be
spawned by multiple users in an uncoordinated fashion and without any a-priori knowledge of
the required resources such as CPU and memory. In such a scenario, we envisage that the operating
system detects when the system is not making any forward progress at any time (i.e. when the
system is thrashing) and takes appropriate actions on the jobs that are running. In this context, we
denote that the system is not making any forward progress if it incurs too many page-faults (and
therefore context-switches) which in turn results in low overall CPU utilization. The dynamics
of the workloads on such servers provides several opportunities for optimization that may not be
available in the case of strict admission control that ends up severely under-utilizing resources
due to inaccurate estimations.
We present a simple scheduling strategy called Synergistic Scheduling that attempts to
prevent paging. An important design goal of Synergistic Scheduling was to use an unmodified
scheduler core and use an external, loadable kernel module and/or an external daemon for flex-
ibility of deployment. The basic idea behind the Synergistic Scheduling framework is that the
external module and/or daemon gathers information such as CPU utilization, system-wide page faults, and the memory residency and page-fault behavior of individual tasks, and uses it to manipulate the static priorities of the tasks, so that programs that would end up paging are not scheduled as often as those that don't stress the virtual memory system. This in turn also prevents the system from thrashing due to application programs stepping on each other's working
sets. A similar approach (priority-boost) was advocated in [59], although the intended goal was
for efficient co-scheduling of communicating parallel processes of a job. While authors in [59]
consider communicating processes of a parallel job as candidates for co-scheduling, our work
considers the set of processes whose working set sizes fit in memory as candidates for being
scheduled simultaneously on the same machine (co-scheduled on the same node). Further, this
solution requires no application modification unlike [63, 62]. Moreover, we anticipate that for
such a scheme to be relevant in the context of grid-computing, where loosely coupled distributed
programs run on non-dedicated computational nodes with fluctuating CPU and memory loads,
intrusive changes to the core scheduler may not be appropriate. Consequently, the Synergis-
tic Scheduler’s design as an external, dynamically loadable kernel module and/or an external
daemon atop an un-modified process scheduler fits the bill.
The rest of this chapter is organized as follows. Section 4.2 discusses some related work
and contrasts it with our approach. Section 4.3 gives details on the experimental setup and the
applications that were used for evaluation. Section 4.4 outlines the scheduling strategy, and
Section 4.5 presents the results of our experimental evaluation. Finally, Section 4.6 concludes
with pointers to future work.
4.2 Related Work
Much of the work on scheduling policies for multi-programmed multi-processors and parallel machines has focused on how many processors to allocate to each runnable application, without considering the memory requirements of these jobs [52, 54, 56, 88]. There have
been a few attempts that studied the impact of the memory requirements of jobs on scheduling policies. Work in this area can be split into two major categories depending upon the intended target environment for which the algorithms were designed, namely distributed-memory parallel machines and shared-memory parallel machines (symmetric multiprocessors). There have also been several theoretical studies in the quest toward designing a virtual-memory-aware job/process scheduler. In particular, Denning's working-set model [26] has provided the theoretical underpinning of many approaches. However, practical schemes for determining the working set of a task have been acknowledged to be hard to realize, and approximations in software as well as hardware modifications to track memory working sets abound in the literature [74, 85, 97].
The authors of [55, 77, 72] study the trade-offs between processor and memory allocation in distributed parallel processing systems and try to design efficient scheduling strategies with minimal overheads in a multi-programmed scenario. However, all these studies assume that jobs have a minimum memory requirement that can be stated a-priori. Even if the minimum memory requirements were known a-priori, over-provisioning memory to applications is still possible, since the working set [26] of an application varies with time, as our experimental results in subsequent sections illustrate. The authors of [72] model the paging behavior of parallel jobs when operating with less memory than required. They apply this model to a real parallel job running on a parallel message-passing machine and study how the performance changes as a function of processor allocation. However, they do not consider the problem of job-scheduling per se, although their model allows for varying processor allocation.
The authors of [1, 11, 55, 63, 68, 78, 83] study this problem in the context of shared-memory parallel machines (symmetric multi-processors, simultaneous multi-threaded processors). In the context of an Intel Paragon system, McCann et al. examine how minimum processor-allocation requirements due to memory influence job scheduling [55]. They suggest ways in which processors can be allocated so that each job receives exactly the same share of processing time. The metric by which they evaluate their scheduling algorithms is the mean CPU utilization, where lower values are assumed to lead to higher processor-efficiency values. They do not take into account jobs leaving the system, and require significant computation just to ensure that each job receives the same processing time in each scheduling cycle. They also do not account for the constructive or destructive interference that processes' memory access patterns have on each other when scheduled simultaneously. Finally, researchers from Tera Computer Company describe the scheduling algorithm of the Tera MTA [1] multi-processor, in which they take into consideration the overhead of swapping jobs in and out of memory and present an optimal algorithm for minimizing paging. However, this work suffers from the drawback that users have to associate a space-time overhead with each job, which the scheduler uses to guide its memory-allocation policies, and which may not always be easy for the programmer to supply.
Our work shares the design philosophy of all the above techniques, namely that it is the job of the operating system scheduler to take virtual-memory sizes into consideration; it differs in the means used to attain this objective. Our design relies on an external kernel module or an external daemon atop an unmodified process scheduler/kernel, which makes it relevant in Grid-computing-like environments, where loosely coupled distributed programs run on non-dedicated computational farms.
More recently, researchers in [62, 63] advocate an adaptive strategy that tries to cope with the adverse effects of paging on multi-programmed multi-processors by rewriting programs, using a combination of a user-level runtime system and appropriate kernel support, so that scheduling actions are taken automatically upon detecting memory pressure. When programs detect memory pressure, they either suspend themselves or release any unneeded memory back to the operating system. This is in direct contrast to prior approaches, in which the system imposes scheduling constraints and does not expect programmers to rewrite their codes. Our work is similar to the above approach in that we wish to identify memory-pressure situations and take actions accordingly. We differ in that we do not require application programs to be rewritten to take advantage of this mechanism. Instead, the operating system chooses actions automatically on detecting memory pressure. Further, the actions chosen by the operating system are simpler, since they involve adjusting the priorities of tasks rather than suspending and restarting processes.
4.3 Experimental Framework
We now describe the applications and the experimental platform that we used in our
study.
4.3.1 Experimental Platform
The experiments were carried out on an Intel Pentium 4 CPU running at 3.00 GHz (hyper-threading turned off in the BIOS) with a Seagate 73 GB EIDE hard disk drive and 1 GB of physical memory. Our experimental prototype is based on a Fedora Core 3 distribution running a 2.6.10 kernel. The working-set graphs shown in the subsequent sections were obtained using the execution-driven simulator Valgrind [79], augmented with a plug-in that implements the working-set algorithm outlined by Denning in [26]. The Synergistic Scheduler framework has been implemented both as an external kernel module and as a user-level "probe" process for the 2.6.10 kernel.
4.3.2 Applications
To evaluate the effectiveness of our approach, we measured its impact on the performance of real-world memory-intensive applications, namely the sequential versions of the NAS Parallel Benchmark (NPB Version 2.4, Class A) suite [2] when they are all run simultaneously. All the C benchmarks were compiled with the Intel C Compiler version 7 at optimization level -O3, and the Fortran benchmarks were compiled with the Intel Fortran Compiler version 7 at the same optimization level. Since the problem class sizes used in this chapter differ from those used in the previous chapter, a brief description of the benchmarks and the sizes of the data sets that they access is given in Table 4.1.
4.4 Scheduling Strategy
As explained above, the Synergistic Scheduler has been implemented both as a kernel module and as a daemon "probe" process. There is little difference in the performance results between the two approaches, although the latter is simpler and more convenient from the point of view of deployment. The kernel module registers a callback function to be executed as part of a timer structure at periodic intervals. The interval is statically set to 2 seconds, but ideally this parameter should be tuned dynamically depending on the load on the system. A higher CPU load on the system should cause the value of this time
Table 4.1. Description of applications. The Total Memory column indicates the total/maximum memory that is used by the application.

  Name  Description                                Input Data Set              Total Memory
  IS    Integer Bucket Sort                        8388608 integers            147 MB
  CG    Conjugate Gradient Method to solve         14000x14000 sparse matrix    77 MB
        an unstructured sparse-matrix
  FT    3-D Fast-Fourier Transform of              256x256x128 matrix          464 MB
        complex numbers
  MG    3-D Multi-Grid Solver                      256x256x256 matrix          436 MB
  SP    Diagonalized Approximate Factorization     64x64x64 matrices           107 MB
  BT    Block Approximate Factorization            5x5x64x64x64 matrices       354 MB
  LU    Simulated CFD using SSOR techniques        64x64x64 matrices            66 MB
  EP    Monte-Carlo Simulation                     2^29 random numbers          17 MB
period to be set higher so that less overhead is incurred. Similarly, a higher memory load should set this time period to a smaller value so that the process scheduler can take advantage of our approach. When the callback function is executed after the specified time interval, it determines a subset of processes whose priorities need to be adjusted for the system to make forward progress, based on a set of heuristics that are outlined in the subsequent sections. The user-space daemon-based approach is very similar. The daemon sleeps for a pre-determined interval, after which it samples the system statistics and per-process page-fault rates using the /proc file-system. It then uses this information to determine a subset of processes whose priorities need to be adjusted. Note that this process has to run as root to effect the priority changes. Figure 4.1 outlines the design alternatives for the Synergistic Scheduler. Determining the candidate tasks for priority boost, and how many of them to boost, is difficult: different processes take different amounts of time to reach their steady state, and it is practically impossible to determine how close to steady state a particular process is with respect to memory utilization, since this amounts to estimating its working set size. In the next paragraphs, we outline a few heuristics that we have used to evaluate the system.
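As a concrete illustration, the sampling step of such a user-level probe can be sketched in a few lines of Python. This is only a sketch, not the thesis prototype: the field offsets follow the Linux proc(5) layout of /proc/<pid>/stat, and the function names sample_task and probe_once are our own.

```python
import os


def sample_task(pid):
    """Sample one process from /proc/<pid>/stat: return its cumulative
    major-fault count and resident set size in pages (proc(5) fields
    12 and 24, respectively)."""
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # The comm field may contain spaces; the real fields resume after ')'.
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[9]), int(fields[21])


def probe_once(pids, prev_faults, interval):
    """One sampling round: compute each task's major-fault rate over the
    last interval and return {pid: (fault_rate, rss_pages)}."""
    stats = {}
    for pid in pids:
        majflt, rss = sample_task(pid)
        rate = (majflt - prev_faults.get(pid, majflt)) / interval
        prev_faults[pid] = majflt
        stats[pid] = (rate, rss)
    return stats
```

A daemon would call probe_once every couple of seconds and then boost the chosen tasks with os.setpriority(os.PRIO_PROCESS, pid, new_nice), which, as noted above, requires root privileges for priority increases.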
4.4.1 Heuristics for task set selection
The goal of any heuristic employed in this context is to improve the overall resource utilization of the system, including CPU and memory. Furthermore, the system should favor candidates that are closer to their steady state (from the memory-utilization point of view) over those that are not. This is based on the intuition that jobs closer to steady state will not fault frequently and consequently will not lower the overall CPU efficiency. In other words, favoring such jobs should allow overall forward progress of the system (fewer
Fig. 4.1. Synergy Scheduler Design Alternatives: (a) using a kernel-module based approach, (b) using a user-level probe process based approach.
page-faults/swapping). In this study, we look at the effects of two parameters that can be gathered in the operating system without any overhead and use them to determine whether a process is close to steady state, namely:
• Page-fault rate suffered by a process
• Memory residency (RSS) of a process
Although these parameters are not really independent of each other, together they provide an indicator of whether a particular process can make progress without faulting. The whole point of boosting the priority of a certain set of tasks is that the overall system should make forward progress. If the system happens to boost the priority of processes that would block immediately, the situation resembles an artificially induced priority-inversion problem, wherein processes with a lower priority end up hogging the CPU more than the higher-priority processes. Hence, the system tries to make a judicious selection of the candidates for priority boost using the above-mentioned parameters. The input to the algorithm
is the desired multi-programming level (i.e., the number of jobs/processes whose priorities are to be incremented). Determining the optimal value of this parameter is beyond the scope of this work, since it is highly workload-specific. However, this is an interesting aspect that we wish to investigate in the near future.
The actual pseudo-code for task selection is described in the subsequent paragraphs. It is to be noted that this algorithm is executed periodically by the system (either by a user-level "probe" process or by a kernel-level thread as part of a timer list) at pre-specified intervals.
• Gather the list of user processes that are currently running on the system (excluding system
daemons, kernel threads) and sample their memory residency (RSS) and page-fault rate
incurred during the last time interval.
• Denote the number of jobs/programs in the system whose priority has been boosted as N.
• If N is less than the desired multi-programming level (MPL), then select at most (MPL-N)
tasks as candidates for priority boost. The candidates that are selected are those that have
incurred the least page-fault rate in the previous interval. Any ties are broken by selecting
candidates that have the highest resident set size.
• Note that we may select fewer than MPL processes for priority boost if the sum of the virtual-memory sizes exceeds the available physical memory. The virtual-memory size of a process is a conservative over-estimate of its working set and could lead to under-utilization of resources; however, the practical impossibility of measuring the working set size inside the operating system kernel has forced us to resort to this.
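The selection step above can be captured in a short Python sketch. This is an illustrative reconstruction of the pseudo-code, not the actual module code; the function name select_for_boost and the task-tuple layout are our own.

```python
def select_for_boost(tasks, boosted, mpl, mem_limit):
    """Select at most (MPL - N) candidates for priority boost.
    Candidates are the tasks with the lowest page-fault rate in the
    last interval, ties broken by the largest resident set size; a
    candidate is dropped if its virtual-memory size (a conservative
    over-estimate of its working set) does not fit in the remaining
    physical memory.

    tasks:     {pid: (fault_rate, rss, vm_size)} for non-boosted tasks
    boosted:   {pid: vm_size} for the N already-boosted tasks
    mem_limit: available physical memory (same units as vm_size)
    """
    budget = mem_limit - sum(boosted.values())
    # Lowest fault rate first; ties broken by highest resident set size.
    ranked = sorted(tasks, key=lambda p: (tasks[p][0], -tasks[p][1]))
    chosen = []
    for pid in ranked[:max(0, mpl - len(boosted))]:
        vm = tasks[pid][2]
        if vm <= budget:
            chosen.append(pid)
            budget -= vm
    return chosen
```

On the three-task example discussed next (A: 500 MB at 10 flt/s, B: 500 MB at 30 flt/s, C: 10 MB at 8 flt/s, 512 MB of RAM, MPL 2), this sketch selects C and A; in the second scenario (fault rates 10, 12 and 20 flt/s), it selects only A, because the other low-fault-rate candidate, B, does not fit in the remaining memory.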
To illustrate this with an example, assume there are three tasks, A (500 MB, 10 flt/s), B (500 MB, 30 flt/s) and C (10 MB, 8 flt/s), with memory requirements and page-faults per second indicated in parentheses. Assume that the system is equipped with 512 MB of RAM and that the desired MPL of the system is 2. Clearly, an optimal schedule would try to schedule A and C or B and C together. Otherwise, the system could possibly thrash, with A and B stepping on each other's toes. Based on the observed page-fault rates, our system elects to elevate the priorities of C and A (although we could also conceivably reduce the priority of B, we chose not to do so in our prototype). Once we have elevated the priorities of tasks, we do not change them until those tasks run to completion. A smarter strategy would readjust the priorities once the system stops thrashing. If, on the other hand, the scenario is A (500 MB, 10 flt/s), B (500 MB, 12 flt/s) and C (10 MB, 20 flt/s), then the system elevates the priority of only A, despite being asked to elevate the priorities of 2 tasks (otherwise A and B may step on each other). If in the next sampling interval C's fault rate is lower than B's, then the system elevates the priority of C as well, since fewer than the requested number of tasks have elevated priorities.
4.5 Results
In order to motivate the problem and to illustrate the difficulty of predicting and/or measuring the working set size from inside the operating system kernel, we plot the variation of the working set sizes, as defined by Denning [26], for our NAS 2.4 application suite, using Valgrind [79] as our execution-driven simulator. Valgrind [79] is an extensible x86 memory debugger and emulator, and also a framework that allows custom skins/plug-ins to be written that augment the basic blocks of the program as it executes. The skins/plug-ins that we implemented augment the basic blocks to return control to the skin after every memory-referencing instruction with the value of the memory address that was referenced and whether it was a load or a store. The skins maintain the data structures that implement the working-set size calculation algorithm explained in [26]. Figure 4.2 plots the variation of the working set size, measured in number of pages (4 KB page size), with simulation time, measured in terms of the number of memory-referencing instructions, using a 1-bit reference counter for the page-table entries as explained in [26]. The value of σ shown in parentheses in the caption of Figure 4.2 is the sampling interval (measured in terms of the number of memory-referencing instructions). There is nothing special about the values chosen for σ; the only constraint was that there should be a reasonable number of sampled data points. With the exception of CG and EP, whose working set sizes are fairly constant with time, most of the applications show a phase behavior (in terms of variation of working set size) that is consistent with other studies' findings [81]. However, the graphs in Figure 4.2 indicate that it is not trivial to predict working-set sizes. Further, the difficulty of choosing a good value of the sampling interval (σ) and the overheads of determining the working set size inside the operating system (without impacting the performance of jobs/tasks on the system) make this a challenging problem for operating system engineers and developers. Consequently, we use the overall virtual memory size as an indicator in our scheduling framework.
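For reference, the 1-bit sampling scheme that our Valgrind plug-in implements can be sketched as follows. This is a simplified reconstruction under stated assumptions (a flat trace of byte addresses, fixed 4 KB pages); the plug-in itself operates on instrumented Valgrind basic blocks.

```python
def working_set_sizes(trace, sigma, page_size=4096):
    """1-bit reference-bit approximation of Denning's working set [26]:
    every sigma memory references, report the number of distinct pages
    touched in the window, then clear all reference bits.

    trace: iterable of referenced byte addresses (loads and stores)
    Returns one working-set size (in pages) per complete window."""
    referenced = set()  # pages whose reference bit is currently set
    sizes = []
    for i, addr in enumerate(trace, 1):
        referenced.add(addr // page_size)
        if i % sigma == 0:          # sampling-interval boundary
            sizes.append(len(referenced))
            referenced.clear()      # emulate clearing the 1-bit counters
    return sizes
```

For example, a trace that touches three distinct pages in its first window of σ references and only one page in its second yields the per-window sizes [3, 1], which is what Figure 4.2 plots for each application over the whole run.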
We divide our experimental evaluation of the Synergistic Scheduling framework into two scenarios, namely:
• Underloaded: In this scenario, each application's working set size comfortably fits in the available physical memory when run stand-alone. Consequently, such a scenario does not cause applications to swap when they are run stand-alone, and therefore any sequential
[Figure: eight panels, each plotting working set size (pages, 1-bit reference counter) against time (memory-referencing instructions).]
Fig. 4.2. Variation of working set size with simulation time for (a) IS (σ = 0.4 Million), (b) FT (σ = 15 Million), (c) CG (σ = 14 Million), (d) MG (σ = 56 Million), (e) SP (σ = 18 Million), (f) EP (σ = 68 Million), (g) LU (σ = 21 Million), (h) BT (σ = 76 Million).
execution of applications (batch processing) is expected to be optimal. It is to be emphasized, however, that the sum of the working set sizes of all the applications together is larger than the available physical memory; therefore, any execution of all applications simultaneously is expected to swap.
• Overloaded: In this scenario, a few of the applications' working set sizes do not fit in the available physical memory when run stand-alone. Consequently, such a scenario causes applications to swap even when they are run stand-alone. Therefore, sequential execution of applications is not expected to be optimal, since it does not utilize the CPU efficiently. Such a scenario is therefore an interesting case study for the Synergistic Scheduling framework, since the primary goal of the technique is to improve overall CPU utilization, and it is therefore expected to perform better than batch processing. To create this situation, we ran the experiments on a machine whose kernel was instructed to use a specified portion of memory instead of the entire physical memory (384 MB instead of 1 GB).
The experimental results described in the next few sections plot the execution times (in
seconds) and the normalized slowdown of each application as well as the overall execution time
averaged over 5 runs of each NAS application benchmark for the following schemes:
• Sequential (SEQ): This is the base scheme, where the NAS applications are run one after another; it is virtually identical to a batch-processing system. In addition to measuring the average completion time for each benchmark, we compute the overall completion time for all the benchmarks as the sum of the completion times of each benchmark.
• Simultaneous (SIM): This is the scheme where all the NAS applications are run simultaneously on the native scheduler, which is oblivious to memory pressure and working set sizes. In this case, the overall completion time is the execution time of the slowest application.
• Prio-Simultaneous (FIX): This is the scheme where all the NAS applications are run simultaneously on the native scheduler, which has been augmented with knowledge of memory pressure and working-set-size considerations. The system determines the candidates whose priorities need to be boosted based upon resident set sizes/page-fault rates, and once the priorities are adjusted, the system does not recalculate or readjust them until one or more processes exit. The pseudo-code for this scheme was described in the previous section. In this case too, the overall completion time is the execution time of the slowest application.
• Prio-Simultaneous (RAN): This is a scheme where all the NAS applications are run simultaneously on the native scheduler, which has been augmented to select a random subset of tasks for priority boost. The overall completion time is the execution time of the slowest application.
4.5.1 Underload Scenario
Figure 4.3(a) plots the execution times (in seconds) of all the above schemes under the underload scenario for each application, as well as the overall completion time. Figure 4.3(b) plots the slowdown of each scheme normalized with respect to the Sequential (SEQ) scheme's execution time. It can be seen from the last set of columns (denoted OV) in Figure 4.3(a) that the Prio-Simultaneous (FIX) scheme performs better than both the Simultaneous (SIM) and the Prio-Simultaneous (RAN) schemes. The performance of the FIX scheme is in fact quite comparable to the Sequential scheme, which is clearly expected to be optimal when the working set sizes of all processes are guaranteed to be less than the available physical memory when run in isolation (on a uni-processor workstation). The Simultaneous (SIM) scheme is about 14% slower than the Sequential (SEQ) scheme. The Prio-Simultaneous (FIX) scheme with MPL 1, 2 and 3 is about 0.5%, 0.4% and 2.5% slower, respectively, and the Prio-Simultaneous (RAN) scheme with MPL 1, 2 and 3 is about 10%, 9% and 0.1% slower, respectively. The graphs in the subsequent sub-sections provide more detailed statistics on system CPU utilization, context switches and page faults that help us better understand how our scheme achieves what we intended at the outset.
Figure 4.4(a) plots the overall percentage CPU utilization during the course of the entire execution of the benchmarks (for all 5 iterations). This graph shows the percentage of time spent executing user code, executing system code, idling, and waiting for I/O to complete (all these statistics are exported by the /proc file-system) for all the above schemes. Figure 4.4(b) plots the total number of context switches during the course of the entire execution (for all 5 iterations) for all the above schemes. Figure 4.4(c) plots the overall number of page faults during the course of the entire execution (for all 5 iterations). This graph also shows the split between major page faults (those that require reading from the swap device) and minor page faults (those that require only the allocation of a page) for all the above schemes. We can summarize the results from Figures 4.3 and 4.4 as follows:
• SEQ scheme is the best in terms of CPU utilization, since the bulk of the time is spent executing user code, with very few idle periods during which the jobs are being
[Figure: per-application and overall (OV) execution-time bars for Sequential, Simultaneous, FIX-MPL-1/2/3, and RAN-MPL-1/2/3.]
Fig. 4.3. Underload: (a) Execution time (in seconds), measured as the time taken from job start till completion. (b) Normalized execution time, measured as the ratio of the job completion time to the batch-processing execution time.
switched. This is to be expected, since the working set sizes of all the applications fit in the available physical memory. Also, since only one application is being executed, there are no context-switching overheads, and hence the time spent in system mode is also minimal.
• SIM scheme is the worst in terms of overall CPU utilization, since a good portion of the time is spent waiting for I/O to complete. This is due to the excessive number of page-faults incurred when applications running simultaneously step on each other's working sets. A good portion of the time is also spent in system mode, since the number of context switches increases and the scheduler code is invoked more often than in the SEQ scheme. Figures 4.4(b) and (c) corroborate the increased number of context-switches and page-faults for the SIM scheme.
• All the FIX schemes are very good at ensuring that the bulk of the time is spent executing user code. As is to be expected, if the desired multi-programming level is increased, the time spent in system mode increases, since the scheduler code is invoked more often than before. Figure 4.4(b) indicates that the FIX schemes end up context-switching far fewer times than the RAN and SIM schemes, because they utilize the CPU far more efficiently. As is to be expected, the FIX schemes incur more context-switches, and a greater percentage of time spent in system mode, than the SEQ scheme, since more than one application executes concurrently. Despite this, the FIX schemes end up with overall completion times comparable to the SEQ scheme.
• The RAN schemes (MPL 1 and 2) are unfortunately not as good as the FIX schemes at ensuring that mostly user-level code gets executed. Clearly, this is to be expected, since these schemes do not use any heuristics in selecting tasks. Consequently, the system spends more time waiting for I/O, which in turn contributes to the excessive number of context switches and page-faults, and hence the performance degradation in comparison to the FIX schemes.
• With increasing multi-programming levels, the RAN scheme starts to perform better, since it does a better job of improving overall CPU utilization. In particular, the RAN scheme with MPL 3 actually performs better than the FIX schemes.
• Trends in context switches and major page-fault rates corroborate all the above observations and the performance results thereof.
4.5.2 Overload Scenario
Figure 4.5(a) plots the overall execution times (in seconds) of the SEQ and FIX (MPL 1) schemes under the overload scenario for each application, as well as the overall completion time. Figure 4.5(b) plots the slowdown of the SEQ and FIX (MPL 1) schemes normalized with respect to the SEQ scheme's execution time. It can be seen from the last set of columns (denoted OV) in Figure 4.5(a) that the Prio-Simultaneous (FIX) scheme with MPL 1 performs better than the Sequential (SEQ) scheme. The FIX scheme performs better because it manages the CPU resources much more efficiently: if a particular task faults, the scheme can schedule other processes, thus utilizing the CPU more efficiently, whereas the SEQ scheme cannot schedule any other job. The next few graphs provide more detailed statistics on overall system CPU utilization, context switches and page faults that help us better understand how our scheme achieves what we intended at the outset.
[Figure: (a) stacked bars of user/system/idle/wait CPU-utilization percentages, (b) total context switches, (c) major and minor page-fault counts, for SEQ, SIM, FIX-1/2/3 and RAN-1/2/3.]
Fig. 4.4. Underload: (a) Overall CPU utilization, (b) Context switches, (c) Overall major page faults.
[Figure: per-application and overall (OV) execution-time bars for Sequential and FIX-MPL-1.]
Fig. 4.5. Overload: (a) Execution time (seconds), measured as the time from job start till completion. (b) Normalized slowdown, measured as the ratio of the job completion time to that of batch scheduling.
Figure 4.6(a) plots the overall percentage CPU utilization during the course of the entire execution. Figure 4.6(b) plots the number of context switches, and Figure 4.6(c) the overall number of major page faults, during the course of the entire execution. The results in Figure 4.6 corroborate that the FIX scheme performs better than the SEQ scheme, since it is able to make better scheduling decisions and utilize the CPU more efficiently, whilst minimizing the overall number of page-faults.
In summary, we have shown that a priority-boost based scheduler that takes virtual-memory/working-set sizes into consideration performs as well as a batch scheduler (SEQ) (around 0.5% slower) in the underloaded scenario, where none of the applications experiences major page-faults (no paging) when run stand-alone, and performs much better than a batch scheduler (SEQ) (around 54% faster) in the overloaded scenario, where a few of the applications experience major page-faults (paging) when run stand-alone.
4.6 Conclusions
In summary, this chapter examined the adverse effects that paging has on the performance of jobs running concurrently on multi-programmed processors. Today's operating system schedulers are oblivious to the memory load of individual tasks, since estimating the working set size of programs is believed to be hard to implement practically. Contrary to other approaches, which have treated this problem as an admission-control problem or have required application modifications, our approach is unique in its design. Specifically, we have presented a simple set of heuristics, as add-ons to existing operating-system process schedulers, that attempt to minimize paging and/or thrashing.
[Figure 4.6: three bar charts comparing the SEQ and FIX-10 schemes; panel (a) plots overall percentage utilization (User/System/Idle/Wait), panel (b) plots log10 of the number of context switches, panel (c) plots log10 of the number of page faults (major and minor).]
Fig. 4.6. Overload: (a) Overall CPU utilization (b) Context Switches (c) Overall Major Page Faults
Further, we show that the set of heuristics used in this study can be obtained relatively inexpensively on many modern-day operating systems, which makes our approach portable. An important design objective of our approach, the use of an external loadable kernel module or an external "probe" daemon atop an unmodified scheduler core, enables flexibility of deployment in a distributed-system setting like the computational Grid. The Synergistic Scheduling framework presents a simple technique to couple two different sub-systems of the operating system, namely the virtual memory manager and the operating system scheduler. In other words, it is a simple technique to realize a virtual-memory-aware process scheduler. Results on an experimental prototype indicate that the Synergistic Scheduling framework performs as well as a batch-processing-based scheduler in the underloaded scenarios and much better than one in the overloaded scenarios.
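A user-level "probe" daemon of the kind described above can obtain working-set estimates cheaply from the `/proc` interface on Linux. The sketch below is an assumed approximation of such a daemon, not the thesis's actual module: it parses the `VmRSS` field of `/proc/<pid>/status` and renices any job whose resident set fits in free memory (lowering nice values requires privilege).

```python
import os
import re

def parse_vmrss(status_text):
    """Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text."""
    m = re.search(r"VmRSS:\s*(\d+)\s*kB", status_text)
    return int(m.group(1)) if m else 0

def probe_once(pids, free_kb, boost=-5):
    """One sampling pass of a hypothetical probe daemon: favor jobs whose
    resident set already fits in free physical memory (Linux-only)."""
    for pid in pids:
        with open(f"/proc/{pid}/status") as f:
            if parse_vmrss(f.read()) <= free_kb:
                os.setpriority(os.PRIO_PROCESS, pid, boost)
```

Because everything is read through `/proc` and `setpriority(2)`, the scheduler core itself stays unmodified, which is exactly the deployment flexibility argued for above.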
This work has opened a number of interesting avenues for future research. Clearly, the lack of coupling between the memory-management sub-system and the scheduler sub-system of an operating system kernel is responsible for the performance degradation. While sub-systems should be designed and evolve independently, without inter-module dependencies, there may be situations, such as those illustrated in this thesis, that require coupling between them. The solution proposed here is a workaround of this fundamental limitation; a better solution is to integrate it into the core kernel. Another possible avenue of research is the interaction of a virtual-memory-aware scheduler with memory replacement and allocation algorithms. In other words, while we have shown the need for a virtual-memory-aware process scheduler, there is also a need for a scheduler-aware virtual-memory manager. Going a step further, the ideas in this framework are a specific instance of a general-purpose ideal of coupling every sub-system in the operating system. Such techniques could lead to better consolidation of resources, which is increasingly becoming an important consideration for clusters and data centers, from both a performance and a power perspective. Another future extension that we are currently looking at is the design of a Synergistic Scheduling framework for a parallel job scheduler that can be used on clusters and/or shared-memory parallel machines.
Chapter 5
Conclusions
Recent trends indicate that data-set sizes and application working-set sizes are increasing. With the growing disparity in performance between processors and I/O peripherals, it becomes critical to optimize I/O performance. The advent of clusters and parallel file systems that harness multiple disks has alleviated the I/O bandwidth problem. Main-memory caches have been proposed in the past as an effective solution to the I/O latency problem. In this thesis, we have developed techniques that use or glean sufficient application-level information to manage such caches efficiently.
For applications that access their data sets through explicit interfaces, we exploit compile-time and run-time knowledge of application access patterns to determine which cache blocks should be cached and when cache hierarchies need to be bypassed for good performance. We have implemented such a discretionary-caching system as extensions to a parallel file system, and our results demonstrate that performance can improve by as much as 33% over indiscriminate caching. For scaled, in-core versions of applications that access their data sets through the virtual-memory interface, we have developed a novel predictive-replacement algorithm that tracks an application's runtime behavior in the operating system, and we show that it can perform significantly better than the system's default replacement algorithm (a variant of LRU). While the above two techniques may seem very different in their methodology, the underlying similarity is that not all blocks/pages are important; the system therefore tries to proactively manage the I/O caches by either not caching unimportant blocks or evicting such blocks earlier. These two techniques have looked at improving the performance of a single application by maximizing the effectiveness of caching, through either a runtime scheme or a static compiler-driven scheme. However, such techniques are not sufficient to guarantee that performance does not degrade when multiple memory-intensive applications execute simultaneously in a multi-programming scenario. The previous chapter of this thesis describes the design and implementation of the Synergistic Scheduling framework, which operates on top of an unmodified OS process scheduler and alleviates the shortcomings of today's memory-oblivious scheduling algorithms.
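The core idea of discretionary caching, not caching blocks that will not be reused rather than letting them evict useful ones, can be illustrated with a toy model. The class below is an illustrative sketch only (the thesis's implementation lives inside a parallel file system, not in Python): blocks hinted as single-use bypass an LRU cache instead of polluting it.

```python
from collections import OrderedDict

class DiscretionaryCache:
    """Toy LRU cache with hint-driven bypass of single-use blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # insertion order == LRU order
        self.hits = self.misses = 0

    def access(self, block, will_reuse):
        """Access one block; will_reuse is the (compile- or run-time) hint."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark most-recently-used
            self.hits += 1
            return
        self.misses += 1
        if not will_reuse:
            return                          # bypass: don't pollute the cache
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False) # evict LRU block
        self.blocks[block] = True
```

With a capacity of two, a single-use block streamed between two reused blocks leaves both reused blocks resident, whereas an indiscriminate cache would have evicted one of them.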
There are several interesting directions for future work. Perhaps the most important involves applying the techniques developed in this thesis to improving the performance of large-scale, I/O-intensive, high-performance applications. Such applications' performance can be improved not only by better memory management, as this thesis has demonstrated, but also by incorporating application requirements into the design of high-performance file systems and I/O middleware libraries. For instance, file-system consistency is a dimension that can be exploited to deliver higher performance to applications that do not require very strong semantics. Traditionally, file systems use variants of locking techniques to serialize accesses to shared resources, and these techniques fail to scale well in large-scale clusters. Locking is an example of a pessimistic design that works well when contention for shared resources is expected to be frequent. However, such a design may not be necessary, and may in fact prove to be overkill, for scientific applications that are usually well-structured in their data accesses. Taking a hard stance on consistency demotes throughput and scalability to second-class citizen status, having to make do with whatever leeway is available. An interesting direction for future work, therefore, is designing a file system that leaves the choice and granularity of consistency policies to the user at open/mount time, an attractive way of providing the best of all worlds.
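An interface for user-selectable consistency of the kind proposed above might look like the following. This is purely a hypothetical API sketch (the policy names, the `fs_open` function, and its signature are all assumptions, not an existing file system's interface): the caller names a consistency policy per open, and a well-structured scientific code can opt out of locking entirely.

```python
from enum import Enum

class Consistency(Enum):
    POSIX = "posix"      # strong: every access fully serialized via locks
    SESSION = "session"  # changes become visible at close/open boundaries
    NONE = "none"        # application coordinates its own accesses

def fs_open(path, mode, consistency=Consistency.POSIX):
    """Hypothetical open() variant where the caller picks the consistency
    policy; returns a handle recording the chosen policy."""
    return {"path": path, "mode": mode, "consistency": consistency}
```

A visualization code that partitions a shared file into disjoint per-process regions could pass `Consistency.NONE` and skip lock traffic, while an unmodified application defaults to the strong POSIX semantics.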
Vita
Murali Vilayannur received his schooling at Vidya Mandir Higher Secondary School in Chennai, India. He subsequently received his undergraduate B.Tech degree in Computer Science and Engineering from the Institute of Technology, Banaras Hindu University, Varanasi, India, in 1999. He expects to receive his Ph.D. degree in August 2005 from the Department of Computer Science and Engineering at the Pennsylvania State University. He spent the summers of 2002 and 2004 at the Mathematics and Computer Science Division at Argonne National Laboratory in Chicago, where he is currently a post-doctoral staff member. His research interests mainly include operating systems, high-performance computing, and parallel I/O. He is a student member of the IEEE and the ACM.