The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
RUNTIME SUPPORT FOR EFFECTIVE MEMORY MANAGEMENT
IN LARGE-SCALE APPLICATIONS
A Thesis in
Computer Science and Engineering
by
Murali N. Vilayannur
© 2005 Murali N. Vilayannur
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
August 2005
The thesis of Murali N. Vilayannur has been reviewed and approved* by the following:
Anand Sivasubramaniam, Professor of Computer Science and Engineering, Thesis Co-Adviser, Co-Chair of Committee
Mahmut Kandemir, Associate Professor of Computer Science and Engineering, Thesis Co-Adviser, Co-Chair of Committee
Padma Raghavan, Associate Professor of Computer Science and Engineering
Natarajan Gautam, Associate Professor of Industrial and Manufacturing Engineering
Rajeev Thakur, Computer Scientist at Argonne National Laboratory, Special Member
Robert Ross, Computer Scientist at Argonne National Laboratory, Special Member
Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering
*Signatures are on file in the Graduate School.
Abstract
As processor speeds continue to advance at a rapid pace, accesses to the I/O subsystem
are increasingly becoming the bottleneck in the performance of large-scale applications. In spite
of technological advances in peripheral devices, provisioning and maintaining large buffers in
memory remains a crucial technique for achieving good performance, but it is effective only if we
can achieve good hit rates. This thesis describes runtime system support to determine what
should go into an I/O cache and when to avoid accessing it, as well as techniques to improve the
hit ratio itself by choosing appropriate candidate cache blocks for eviction/replacement. Such
techniques apply equally to explicitly and implicitly I/O intensive applications
that access data either through a file-system interface or through the virtual memory interface.
While the aforementioned techniques can boost the performance of a single I/O intensive
application, an important consideration that needs to be addressed for practical reasons is
the effect of multi-programming, where multiple applications are run simultaneously for better
resource utilization. The thesis concludes with the design and implementation of a runtime
scheduling strategy, built on top of an unmodified process scheduler in the operating system,
that ensures the performance of large-scale applications does not degrade in multi-programmed
scenarios.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2. Explicitly I/O Intensive Applications . . . . . . . . . . . . . . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 PVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Overview of System Architecture . . . . . . . . . . . . . . . . . . 15
2.3.3 Performance of Primitives and Micro-Benchmarks . . . . . . . . . 21
2.3.4 Cache Bypass Mechanisms . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Compiler-directed Cache Bypass . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Runtime Cache Bypass . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Concluding Remarks and Future Work . . . . . . . . . . . . . . . . . . . . 43
Chapter 3. Implicitly I/O Intensive Applications . . . . . . . . . . . . . . . . . . . . 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Experimental Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Experimental Platform . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Characterization Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Towards a Better Replacement Algorithm: Predictive Replacement . . . . 67
3.5.1 Estimation Techniques with Hardware Support . . . . . . . . . . . 71
3.5.2 OS-Implementable Estimation Technique . . . . . . . . . . . . . . 75
3.5.3 Results with Predictive Replacement Techniques . . . . . . . . . . 76
3.5.4 Comparison with EELRU . . . . . . . . . . . . . . . . . . . . . . 77
3.5.5 Performance of DP-Approx . . . . . . . . . . . . . . . . . . . . . 82
3.5.6 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 4. Synergistic Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3 Experimental Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.1 Experimental Platform . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Scheduling Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.1 Heuristics for task set selection . . . . . . . . . . . . . . . . . . . . 98
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5.1 Underload Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.2 Overload Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Chapter 5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
List of Tables
2.1 Read times (in ms) for different request sizes and number of IODs (|IOD|). . . 24
2.2 Write times (in ms) for different request sizes and number of IODs (|IOD|). . 26
3.1 Description of applications: The Total Memory column indicates the total/maximum
memory that is used by the application, and the Simulated Memory column in-
dicates the simulated memory size that was used for the characterization. . . . . 55
3.2 Threshold values of applications. . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1 Description of applications: The Total Memory column indicates the total/maximum
memory that is used by the application. . . . . . . . . . . . . . . . . . . . . . 97
List of Figures
1.1 Code fragment to illustrate the different I/O programming models . . . . . . . 4
2.1 System architecture. Nodes 1..n are the clients where one or more application
processes run, and have a local cache present. Upon a miss, requests are either
directed to the global cache (one such entity for a file), or are sent directly to
IOD node(s) containing the data in the disk(s). . . . . . . . . . . . . . . . . . . 22
2.2 Graph showing the minimum required hit-rate at global cache for good perfor-
mance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Micro-benchmark running on 1 node, File striped on one IOD, s = 2, (a)o = 10,
(b) o = 600. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Micro-benchmark running on 1 node, File striped on one IOD, s = 2, (a) o =
25000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Micro-benchmark running on 2 nodes, File striped on three IODs, s = 2, (a) o
= 10, (b) o = 600. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Micro-benchmark running on 2 nodes, File striped on three IODs, s = 2, (a) o
= 25000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 tomcatv: impact of problem size (a) Global cache is 20MB, (b) Global cache
size is 200MB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8 Impact of global cache size for a problem size of 1500. (a) tomcatv, (b) vpenta . 35
2.9 vpenta: impact of problem size (a) Global cache size is 20MB, (b) Global
cache size is 200MB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.10 Runtime cache bypassing (global cache size is 20 MB) (a) tomcatv. (b)
vpenta. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.11 (a) Impact of the threshold value for tomcatv. (b) vpenta: Impact of seg-
ment size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.12 tomcatv: Variation of Cache Hit rates with problem size . . . . . . . . . . . . 41
2.13 vpenta: Variation of Cache Hit rates with problem size . . . . . . . . . . . . 41
2.14 Benefits of runtime bypassing on application traces. . . . . . . . . . . . . . . . 44
2.15 Application traces: Variation of Cache Hit rates . . . . . . . . . . . . . . . . 44
3.1 Page-fault characterization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Ratio of D/L+D measured for NPB as (a) References, (b) Faults. . . . . . . . . 57
3.3 Ratio of D/L+D measured for SPEC2000 as (a) References, (b) Faults. . . . . . 58
3.4 Where does time go? (a) NPB, (b) SPEC2000 . . . . . . . . . . . . . . . . . . 58
3.5 Absolute differences between successive L distances measured as (a) NPB2.3 -
Total Memory References, (b) NPB2.3 - Faults to other pages, (c) SPEC2000 -
Total Memory References, (d) SPEC2000 - Faults to other pages. . . . . . . . . 61
3.6 Differences between successive L distances measured as (a) NPB - Total Mem-
ory References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total Memory
References, (d) SPEC2000 - Faults to other pages. . . . . . . . . . . . . . . . 62
3.7 Absolute differences of successive (L+D) distances measured as (a) NPB - Total
Memory References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total
Memory References, (d) SPEC2000 - Faults to other pages. . . . . . . . . . . . 63
3.8 NPB: CDF of L distance measured as (a) References, (b) Faults. . . . . . . . . 64
3.9 MG: Variation of L distance with time measured as (a) References, (b) Faults. . 65
3.10 SP: Variation of L distance with time measured as (a) References, (b) Faults. . . 66
3.11 Coefficient of Variance of L for each page (a) IS, (b) MG, (c) SP. . . . . . . . . 67
3.12 Coefficient of Variance of L for each page (a) FT, (b) BT, (c) LU. . . . . . . . . 68
3.13 Coefficient of Variance of L distance for each page (a) WUPWISE, (b) MCF,
(c) APSI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.14 Normalized page-fault counts of the replacement algorithm for SPEC 2000 with
respect to perfect LRU scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.15 Normalized page-fault counts of the replacement algorithm for NPB 2.3 with
respect to perfect LRU scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.16 Normalized invocation counts of the replacement algorithm for (a) SPEC 2000,
(b) NPB 2.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.17 Comparison of the best prediction-based replacement algorithm with EELRU
for SPEC 2000 using the ratio of page-faults in comparison to the perfect LRU
scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.18 Comparison of the best prediction-based replacement algorithm with EELRU
for NPB 2.3 using the ratio of page-faults in comparison to the perfect LRU
scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.19 Normalized page-fault reduction of DP-Approx algorithm in comparison to
Linux kernel 2.4.20 execution for (a) SPEC 2000, (b) NPB 2.3 . . . . . . . . . 83
3.20 SPEC 2000 (a) Sensitivity of DP-Approx to parameter “a”, (b) Prediction ac-
curacy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.21 NPB 2.3 (a) Sensitivity of DP-Approx to parameter “a”, (b) Prediction accuracy. 85
4.1 Synergy Scheduler Design Alternatives (a) Using a kernel-module based ap-
proach, (b) Using a user-level probe process based approach . . . . . . . . . . 99
4.2 Variation of working set size with simulation time for (a) IS (σ = 0.4 Million)
(b) FT (σ = 15 Million) (c) CG (σ = 14 Million) (d) MG (σ = 56 Million) (e)
SP (σ = 18 Million) (f) EP (σ = 68 Million) (g) LU (σ = 21 Million) (h) BT (σ
= 76 Million) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3 Underload: (a) Execution Time (in seconds) measured as the time taken from
job start till completion (b) Normalized execution time that is measured as the
ratio of the job completion time to the batch processing execution time. . . . . 107
4.4 Underload: (a) Overall CPU utilization (b) Context Switches (c) Overall Major
Page Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5 Overload: (a) Execution Time (seconds) measured as the time from job start till
completion (b) Normalized Slowdown measured as the ratio of the job comple-
tion time with the batch scheduling. . . . . . . . . . . . . . . . . . . . . . . . 111
4.6 Overload: (a) Overall CPU utilization (b) Context Switches (c) Overall Major
Page Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Acknowledgments
Finally, I am reaching the point when I can take a deep breath and see the light at the
end of the tunnel! The last six years at Penn State have been the most challenging, exciting
and rewarding roller-coaster ride of my life. There were several trying moments, but I am glad
I pulled through. Several people have contributed to molding me as a researcher and, more
importantly, a better human being.
I am deeply indebted to my thesis advisors, Dr. Anand Sivasubramaniam and Dr. Mah-
mut Kandemir. Both of them are dedicated researchers, genuine well-wishers for their students
and very good human beings. In the past six years, they have played several important roles
in my life and taught me numerous things on both the technical and non-technical fronts. I am
so thankful to them for placing trust in me when I lacked self-confidence and motivating me
to work harder and aim higher. Without their help, it is hard for me to imagine how my thesis
would have taken shape or how my career could have been launched. I would also like to thank
NSF, the funding agency that funded the bulk of my graduate program.
The three semesters that I spent in the Mathematics and Computer Science Division at
Argonne National Laboratory were the best thing that could have happened to me. I am extremely
grateful to my two mentors, Dr. Rajeev Thakur and Dr. Robert Ross, for encouraging me to aim
higher; their implicit trust helped boost my self-confidence tremendously. Both of them
are great researchers and excellent colleagues to work with. Our collaboration has gone well
beyond the two summers, and, as some of the well-known practitioners in this field, their input
and suggestions will hopefully bring some of the ideas that have come out of my thesis to fruition.
I also wish to thank my thesis committee members and referees: Dr. Padma Raghavan,
Dr. Natarajan Gautam, Dr. Rajeev Thakur, and Dr. Robert Ross, who took time out of their busy
schedules to write recommendation letters for me and serve on my thesis committee. Their
strong support and valuable comments have made my job search much easier and helped refine
my dissertation.
I have been very lucky to have a group of excellent senior colleagues and lab-mates,
who were also my best friends and mentors at Penn State: Mangesh Kasbekar, Shailabh Nagar,
Ning An, Yanyong Zhang, Chun Liu, Gokul Kandiraju, Sudhanva Gurumurthi, Amitayu Das,
Angshuman Parashar, Vivek Natarajan, Shiva Chaitanya, Jianyong Zhang, Partho Nath, Saurabh
Agarwal, Balaji Viswanathan, Giridhar Viswanathan and Vivek Bhanu. The list of friends that I
made at Penn State is endless, and I am positive I have missed a dozen names from the
above list, but one thing is for sure: they will always be a part of my memories. I will always
have a special place in my memories for three of my friends (Gokul Kandiraju, Birjoo Vaishnav
and Deepak Ramrakhyani) whom I regard not merely as friends but as part of my family. Their help,
encouragement, and discussions on technical and spiritual matters are something that I will cherish for
the rest of my life. Whenever I had a doubt or question about my research, our discussions gave
me new insights, and helped me solve the problem. Whenever I felt down, they were always
there to cheer me up. My life has been so enjoyable because of them.
None of the work described in this thesis would have been possible without the prompt
attention of the lab support team at Penn State. I am deeply indebted to Eric Prescott, Nate
Coraor, John Domico, David Heidrich and Barbara for their dedication and prompt attention to
administering all the machines in the department. All the secretaries in the office extended so
much warmth and help that they really made my stay in the department pleasurable, especially
Vicki, who manages to do so much work and yet remains cheerful all the time. Without her help,
I cannot imagine how things would have turned out for me.
I am grateful to my parents and my brother. Their love and care have always been with
me during these years. Their trust and encouragement pulled me through what appeared to be
some of the worst times early on.
Finally and most importantly, I am so grateful to be a part of the Art of Living Family
and come under the presence of my spiritual Guru, Sri Sri Ravi Shankar.
Chapter 1
Introduction
Many large-scale scientific applications today are increasingly data-intensive, manipulating
large disk-resident data sets ranging from megabytes to terabytes. For example, medical
imaging, data analysis and mining, video processing, global climate modeling, computational
physics and chemistry can easily involve data sets that are too large to fit in main memory
[23, 25]. Typically, such applications treat main memory simply as an intermediate stage of the
memory hierarchy and the bulk of the data that they manipulate usually resides on secondary
storage devices like disks. Recent trends in computer architecture [27] show that networks of
workstations – also referred to as clusters – are the dominant force for delivering high perfor-
mance for such challenging applications cost effectively. This has been made possible in part
due to the rapid advances in processor and networking technology (such as [5, 66]). In these ar-
chitectures, the multiple CPUs and their memories can provide processing and primary storage
parallelism, while the multiple disks (one or more at each workstation, or on a network) can pro-
vide secondary storage parallelism for both data access and data transfer. As processor speeds
continue to advance at a rapid pace, accesses to the I/O subsystem are increasingly becoming
the bottleneck in the performance of large-scale applications that manipulate huge datasets. This
gap between CPU and I/O performance is exacerbated as we move to multiprocessor and cluster
systems, where the compute power is potentially multiplied by the available number of CPUs.
Therefore, optimizing I/O performance is of critical importance.
Large buffers in memory (referred to as caches throughout this thesis) are one way of
alleviating this problem, provided we can achieve good hit rates. However, unlike the tradi-
tional instruction/data caches that are provisioned in the hardware of processor architectures, I/O
caches are implemented in software, managed in main memory and have much higher overheads.
Further, the levels of I/O caching on some of the parallel environments (including clusters) can
span machine boundaries, requiring network messages for cache accesses. A large body of work
([10, 16, 21, 24, 28, 30, 31, 45, 46, 58, 64, 71, 75, 96] and the references contained therein) has
dealt with various aspects of I/O caches (design, replacement algorithms, prefetching, sharing
and partitioning, and so on). This thesis describes the management of I/O caches,
which is critical for alleviating the I/O bottlenecks, and discusses compiler and runtime system
support for managing them. In particular, this thesis proposes compiler and runtime system
support to determine what should go into an I/O cache and when we should avoid accessing
it, apart from improving the hit rate itself by choosing appropriate candidate cache blocks for
eviction/replacement. Further, we also propose a simple runtime mechanism that ensures that
the performance of such applications does not degrade in multi-programming scenarios.
Typically, large-scale I/O intensive applications, such as those described above, are coded
to access and manipulate their data sets in one of three different ways:
• Using explicit I/O calls (e.g., the POSIX read/write/lseek interface in UNIX) to
stage data between memory and peripheral devices.
• Using the paged virtual memory system (e.g., the POSIX mmap interface in UNIX) to
handle transparent migration of data sets between main memory and disk. This model of
computation can be termed in-core since the programmer assumes that all the data fits in
main memory and the system cooperates to preserve that assumption.
• Using explicit I/O calls to manage in-core data sets rather than relying on the virtual mem-
ory system. This model of computation can be termed out-of-core since the programmer
explicitly stages the data to/from the peripheral devices. Writing programs using this
model of computation is daunting since it involves significant restructuring of the in-core
version of the same code, but it offers the best performance, since application writers know
their access patterns best.
Figure 1.1 shows an example code fragment that illustrates each of the above models of computa-
tion. While the first and the third techniques appear similar, the difference is that in the latter
scheme, once the data is brought into memory, it is manipulated as if it resided entirely in main
memory and is written to disk only when the entire data set is staged out, whereas in the former,
any manipulation of data is accomplished by a sequence of lseek and read/write calls.
The focus of this thesis is twofold:
• To improve the performance of applications that have been coded to manipulate their data
sets using the first two approaches (namely explicit, in-core) from the perspective of the
I/O caches. It is important to note here that optimizations for the third approach (out-
of-core) are beyond the scope of this thesis: writing a numerically stable out-of-core
program is a burden on the programmer, and any approach to alleviating the I/O bottlenecks
in such large-scale applications has to be automated.
• To guarantee that the performance of such applications in a multi-programming scenario
does not degrade, through the design of a runtime system implementing a scheduling
strategy on top of an unmodified operating system.
Explicitly I/O intensive:

    #define NUM_ELEM 1000000
    #define NUM_IN_MEMORY 1000
    #define SZ(x) ((x) * sizeof(double))

    double A[NUM_IN_MEMORY];
    .....
    int start_core_index = -1, end_core_index = -1;

    void read_data_set(void) {
        .....
        fd = open("A.txt", O_RDWR);
        for (i = 0; i < NUM_ELEM; i++) {
            if (i < start_core_index || i > end_core_index) {
                if (start_core_index > 0)
                    write(fd, A, SZ(NUM_IN_MEMORY));
                pread(fd, A, SZ(NUM_IN_MEMORY), SZ(i));
                start_core_index = i;
                end_core_index = (i + NUM_IN_MEMORY);
            }
            .. = A[i];
            ...
            A[i] = ...
        }
        close(fd);
        .....
    }

Implicitly I/O intensive:

    int NUM_ELEM = 0;
    double *A;
    .....

    void read_data_set(void) {
        .....
        fd = open("A.txt", O_RDWR);
        fstat(fd, &statbuf);
        NUM_ELEM = statbuf.st_size / sizeof(double);
        A = (double *) mmap(NULL, statbuf.st_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        for (i = 0; i < NUM_ELEM; i++) {
            .. = A[i];
            ...
            A[i] = ...
        }
        munmap(A, statbuf.st_size);
        close(fd);
        .....
    }

Out of Core Computation:

    #define NUM_ELEM 1000000
    double A[NUM_ELEM];
    .....

    void read_data_set(void) {
        .....
        fd = open("A.txt", O_RDWR);
        read(fd, A, sizeof(double) * NUM_ELEM);
        for (i = 0; i < NUM_ELEM; i++) {
            .. = A[i];
            ...
            A[i] = ...
        }
        write(fd, A, sizeof(double) * NUM_ELEM);
        close(fd);
        .....
    }
Fig. 1.1. Code fragment to illustrate the different I/O programming models
For applications that explicitly read/write data to/from peripheral devices, this thesis at-
tempts to increase the effectiveness of the buffer management system by intelligently deciding
(with the help of a compiler or the operating system) which cache blocks should go into which
I/O caches, and when we should avoid accessing certain levels of the I/O cache hierarchy to
improve hit rates across all I/O caches. A prototype implementation was completed on a popular
and widely used parallel file system (PVFS [14]) on a Linux-based cluster to demonstrate the
effectiveness of this technique.
When an application that uses the paged virtual memory system accesses a page that is
not in main memory, the hardware raises an exception (a page fault) that is handled transparently
by the operating system. The OS then arranges for the page to be brought into main memory,
potentially replacing some other page to make room for the newly accessed page. Once the
page is brought into main memory, the appropriate page-table entries are updated to reflect the
new mappings. Thus, the discretionary caching techniques that were developed for explicitly
I/O intensive applications are not directly applicable in this context, i.e., the discretionary
nature of caching certain cache blocks cannot be exploited. However, it is a well-documented
characteristic [82] that the system's replacement algorithm (usually a variant of the Least-Recently-
Used algorithm) under-performs for typical scientific application memory access characteristics.
Consequently, this thesis tries to address the shortcomings of the system’s replacement algo-
rithm and proposes a novel runtime predictive application-specific replacement algorithm that is
shown to have the potential to perform better. The implementation of this algorithm was com-
pleted on a popular and widely used execution-driven simulator (Valgrind [79]) to demonstrate
its effectiveness.
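The contrast between the system's LRU-style eviction and a prediction-based choice can be sketched as follows. This is a minimal illustration of the idea only, assuming a toy page table and a deliberately naive predictor (next reference expected one inter-reference gap after the last use); it is not the estimation technique developed in Chapter 3, and the names are hypothetical.

```c
#include <stddef.h>

/* Per-page recency state, kept in virtual time (reference count). */
struct page {
    long last_use;   /* time of the most recent reference */
    long prev_use;   /* time of the reference before that */
};

/* LRU victim: the page whose last use is furthest in the past. */
size_t lru_victim(const struct page *p, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (p[i].last_use < p[victim].last_use)
            victim = i;
    return victim;
}

/* Predictive victim: evict the page whose *predicted* next reference
 * lies furthest in the future, approximating Belady's OPT. The
 * predictor here (repeat the last inter-reference gap) is a
 * placeholder assumption. */
size_t predictive_victim(const struct page *p, size_t n)
{
    size_t victim = 0;
    long best = p[0].last_use + (p[0].last_use - p[0].prev_use);
    for (size_t i = 1; i < n; i++) {
        long next = p[i].last_use + (p[i].last_use - p[i].prev_use);
        if (next > best) {
            best = next;
            victim = i;
        }
    }
    return victim;
}
```

Note that the two policies can disagree: a page referenced long ago but with a short inter-reference gap may be due to be touched again soon, so the predictive choice can keep it while LRU would evict it.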
Lastly, this thesis explores the interplay of memory management and process scheduling,
and their impact on performance. While this thesis has attempted to show that intelligent,
application-aware memory management can boost performance, this raises the challenging and
difficult problem of ensuring that the performance of such applications does not degrade when
they are executed concurrently with other jobs. Process scheduling policies implemented by
today's operating systems cause memory-intensive applications to exhibit poor performance and
throughput when the applications' combined working sets do not fit in main memory. A primary
cause is that process scheduling algorithms do not take memory working-set sizes into account.
Consequently, this thesis attempts to alleviate these shortcomings of today's operating system
schedulers, to ensure that concurrently running memory-intensive applications do not step on
each other's working sets. To demonstrate its effectiveness, a prototype implementation of this
technique was completed as a Linux kernel module.
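A minimal sketch of the kind of working-set-aware admission such a runtime layer performs is shown below: runnable jobs are considered in order and admitted only while their combined working-set sizes fit in physical memory, with the rest held back for a later round. The greedy policy and all names here are illustrative assumptions, not the heuristics evaluated in Chapter 4.

```c
#include <stddef.h>

struct job {
    long wss;      /* estimated working-set size, in bytes */
    int runnable;  /* nonzero if the job is ready to run */
};

/* Admit jobs greedily while their combined working sets fit in
 * mem_avail; admitted[i] is set to 1 for each admitted job.
 * Returns the number of jobs admitted this round. */
size_t select_task_set(const struct job *jobs, size_t n,
                       long mem_avail, int *admitted)
{
    size_t count = 0;
    long used = 0;
    for (size_t i = 0; i < n; i++) {
        admitted[i] = 0;
        if (jobs[i].runnable && used + jobs[i].wss <= mem_avail) {
            admitted[i] = 1;
            used += jobs[i].wss;
            count++;
        }
    }
    return count;
}
```

The point of the sketch is the invariant, not the policy: no admitted set is allowed to overcommit physical memory, so the admitted jobs cannot thrash each other's working sets, which batch-style co-scheduling of arbitrary job mixes does not guarantee.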
The rest of this thesis is organized as follows: Chapter 2 evaluates the proposed buffer
management techniques that were developed for explicitly I/O intensive applications. Chapter 3
looks at the replacement algorithms that were developed for scaled versions of in-core scientific
applications that are implicitly I/O intensive. Chapter 4 proposes the design and implementation
of the scheduling algorithm framework that was developed to investigate the impact of memory-
aware process scheduling on performance. Finally, Chapter 5 concludes with pointers to future
work.
Chapter 2
Explicitly I/O Intensive Applications
2.1 Introduction
As processor speeds continue to advance at a rapid pace, accesses to the I/O subsystem
are increasingly becoming the bottleneck in the performance of large-scale applications that ma-
nipulate huge datasets. While one could argue that we can use a large number of disks in parallel
for improving I/O bandwidth, the latency of seeking to the appropriate location and performing
the disk operation in addition to the overhead of the network transfer continues to hurt perfor-
mance of many applications, especially those with non-sequential access patterns. Large buffers
in memory (referred to as caches throughout this chapter) are one way of alleviating this problem,
provided we can achieve good hit rates. However, unlike the traditional instruction/data caches
that are provisioned in the hardware of processor architectures, I/O caches are implemented in
software and have much higher overheads. Further, the levels of I/O caching on some of the
parallel environments (including clusters) can span machine boundaries, requiring network mes-
sages for cache accesses. It is thus very important to be able to determine what should go into
an I/O cache and when we should avoid accessing it, apart from improving the hit rate itself.
This chapter addresses this important problem, presenting the design, implementation, and evalu-
ation of a parallel file system's I/O subsystem that provides two levels of discretionary caching
[93], and demonstrating the benefits of such discretionary caching mechanisms with compiler and
runtime optimizations.
Clusters, put together with off-the-shelf workstations/PCs and networking hardware, are
becoming the platform of choice for demanding applications because of their cost-effectiveness,
upgradability, and widespread availability. Clusters are finding their place in a plethora of en-
vironments, from academic departments to supercomputing centers and even to the commercial
world (e.g., for database, web and e-commerce applications). These platforms can benefit not
only from the constantly improving processor/memory speeds, but also from the disk capacities
and bandwidths. The multiple CPUs and their memories on these systems can provide process-
ing and primary-storage parallelism, while the multiple disks (one or more at each workstation
or on a network) can provide secondary-storage parallelism for both data access and data trans-
fer. One could either have disks attached to each cluster node with a SCSI-like interface (the
corresponding node has to be involved in data transfers to/from such disks), or have disks
accessible by everyone over a storage area network. While many of the issues/optimizations in this
work are applicable to both environments, we specifically focus on the former, which is usually
a much cheaper, and thus more prevalent, alternative for the I/O subsystem (we intend to
investigate such issues for storage area networks in the future).
Many large-scale scientific applications are data-intensive, manipulating immense disk-
resident data sets [41]. These include applications from medical imaging, data analysis and
mining, video processing, large archive maintenance, and so on. Commercial services such
as web, multimedia, and databases on clusters are also demanding on the I/O subsystem. In
addition, many high-performance environments (particularly shared clusters within a department
or a supercomputing center) not only handle one such application, but often have to deal with
several (possibly I/O intensive) applications at the same time in a time-shared manner. All these
issues make I/O optimizations an important and challenging problem for off-the-shelf clusters.
While the parallelism offered by the numerous disks in a cluster can alleviate the I/O
bandwidth problem, it does not really address the latency issue, which is largely limited by
seek, rotational and network transfer costs. Caching data blocks in memory is a well-known
way of reducing I/O latencies, provided we can achieve good hit rates. I/O caching is typically
implemented in software (not the disk/controller caches), and the overheads of cache lookup
and maintenance can become quite high. Furthermore, it has been shown in [39] that we may
need multiple levels of caching. For instance, in PPFS [39], a local cache at each node of the
parallel system caters to the individual process requests at that node and, upon a miss, goes to
a shared global cache (running on one or more nodes of the cluster), which can possibly satisfy
requests that come from different nodes. On such systems, the cost of going to the global cache—
requiring a network message—and not finding the data there (before going to the disk) might be
quite substantial. For instance, as our performance results show, this approach turns out to be
over twice as costly as directly getting the data from disk in several situations. Consequently,
it becomes extremely important to intelligently determine what to place in the caches and when
to avoid (i.e., bypass) the cache (particularly the caches whose look-up costs are higher) on I/O
requests. This largely depends on the data-access patterns of the workload. To our knowledge,
the issue of exploiting application behavior for such I/O cache optimizations on clusters has not
been studied previously. There has been similar work (e.g., [44]) in the context of hardware data
CPU caches, but the costs for I/O caching are of a much higher magnitude.
Rather than implementing all the APIs/features of a full-fledged parallel file system to
investigate these issues, we start with a publicly available parallel file system — PVFS [14] — for
Linux/Pentium clusters. We have considerably extended this system to incorporate a kernel-level
cache module at each cluster node to cater to all the requests (possibly from different applications)
coming from that node, which we refer to as the local cache. We also have implemented a shared
global cache (between processes running on different nodes of an application, or even across
applications) that runs on one or more nodes of the cluster. This global cache receives requests
from the local cache and services them. If the lookup fails in the global cache as well, the request
is forwarded to one or more nodes whose disks are used for striping the data. The experimental
results presented in this chapter are from a Pentium/Linux-based cluster of workstations. Each
node on this cluster has an 800 MHz Intel Pentium-III (Coppermine) microprocessor with 32 KB
of L1 cache, 256 KB of L2 cache, and 128 MB of PC-133 main memory. The global cache is
run on one of the nodes that contains 384 MB of main memory. Each node is also equipped with
a 20 GB Maxtor hard disk drive and a 32-bit PCI 10/100 Mbps 3Com 3c59x network interface
card. All the nodes are connected through a Linksys Etherfast 10/100 Mbps 16 port hub. Using
this experimental system, this chapter investigates/illustrates the following issues:
• We present latency numbers for file reads and writes satisfied from different levels in the cache hierarchy, and compare them to the original PVFS implementation, which does not perform any explicit caching. The results clearly demonstrate the
benefit of caching. Even when missing from the local cache, going to the global cache
and fetching the data turns out to be better than the original PVFS in most cases. How-
ever, when we go via the global cache, only to find that the data is missing there, the
overheads are significantly worse than not performing any caching altogether (as in the
original PVFS). We present experimental data showing what hit rate is needed in the global cache to justify going through it.
• After pointing out the importance of discretionary data placement in the caches and of bypassing them when needed, we discuss the mechanisms our system provides to explicitly specify whether a read/write should go through the local/global cache. The bypass capabilities can be conveyed to our caching layers through a kernel ioctl() call and can be specified by the application itself, by the compiler, or by the runtime system.
• We show how simple compiler-based techniques are quite effective in benefiting from
the caches, without incurring extra overheads, for statically analyzable applications. We
specifically present two techniques: one that determines what files should be accessed via
the cache and what files should bypass the cache (which we refer to as coarse-grain opti-
mizations), and the other that performs such discretionary accesses at a finer granularity.
• While compile time analysis can be employed in applications with statically analyzable
code, we present a simple runtime approach for determining when to bypass the cache in
situations where the codes are not readily analyzable or the sources are not available.
• All these optimizations are extensively evaluated with several applications/traces to show
how they can be beneficial for improving cache behavior for parallel I/O.
The rest of this chapter is organized as follows. The next section discusses work related to this study. Section 2.3 describes the system architecture and implementation details
of our I/O subsystem on the Linux cluster, together with some raw performance numbers. The
compiler-based and runtime-based optimizations are evaluated and compared in Sections 2.4 and
2.5. Finally, Section 2.6 summarizes the contributions of this work and discusses directions for
future work.
2.2 Related Work
Software work on high-performance I/O can be roughly divided into three categories:
parallel file systems, runtime I/O libraries, and compiler work for out-of-core computations.
A number of groups have studied automatic detection and optimization of I/O access patterns
(e.g., see [28, 47, 48, 53] and the references therein). Others have proposed parallel file systems
and I/O runtime systems that provide users/programmers with easy-to-use APIs [17, 20, 61, 76,
87]. While these systems allow users/programmers to exploit optimizations for I/O, it is still in
general the user’s responsibility to select which optimization to apply and determine the suitable
parameters for it. Obviously, this puts a great burden on users, as in most cases it is not trivial to
select what optimization(s) to use and the accompanying parameters. Our work instead tries to
bring the advantages of I/O caching without much user effort.
Compilation of I/O-intensive codes using explicit I/O has also been the focus of some re-
search (see [6, 9, 67] for example techniques that target out-of-core datasets). Brezany et al. [9]
have developed a parallel I/O system called VIPIOS that can be used by an optimizing compiler.
Bordawekar et al. [6, 7] have focused on stencil computations that can be reordered freely due to the lack of flow dependencies. They have presented several algorithms to optimize communication and to indirectly improve the I/O performance of parallel out-of-core applications. Paleczny et al. [67] have incorporated I/O compilation techniques in Fortran D. The main philosophy behind their approach is to choreograph I/O from disks along with the corresponding computation. Many of these studies, however, have specifically targeted massively parallel processors (MPPs)
and do not deal with selective data placement in caches. DPFS [80] is a parallel file system that
collects locally distributed unused storage resources as a supplement to the internal storage of a
parallel system. In contrast, our work is targeted for cluster environments with multiple levels
of caching, and not only benefits the processes of one application, but can also benefit several
applications sharing datasets (through a global cache).
There has been a considerable amount of prior work on optimizing I/O and I/O caches
[10, 21, 31, 40, 45, 46, 51, 58, 64, 71, 75, 84], some of which has been on clusters as well. Re-
cently, [16, 96] have focused on buffer-cache management policies in a multi-level buffer cache
system. Wong et al. [96] propose primitives for maintaining exclusivity in multi-level buffer caches, while Chen et al. [16] use higher-level cache eviction information to guide the placement of blocks in lower levels. Perhaps the most closely related work to ours is the set of approaches
presented in three prior systems, namely, MPI-IO [19, 32], PVFS [14], and PPFS [39]. MPI-IO
[32] is an API for parallel I/O as part of the MPI-2 standard and contains features specifically
designed for I/O parallelism and performance. This API has been implemented for a wide variety of hardware platforms, including clusters [86]. The main optimizations in MPI-IO are for noncontiguous parallel accesses to shared data, mainly at the user level. As a result, the user needs
to have a thorough understanding of the numerous programming interfaces to invoke the appro-
priate routines. Since the MPI-IO interface itself does not specify any caching functionality, its
response time is largely determined by the caching capabilities provided by the underlying file
system or the MPI-IO implementation. PVFS [14] is a parallel file system for Linux clusters that
presents three different APIs and accommodates frequently used UNIX file tools. Its optimizations for noncontiguous data are perhaps less powerful than MPI-IO's optimizations. The work
presented in this chapter augments PVFS with a local and global caching capability, benefiting
from its rich original APIs. PPFS [39] is a user-level I/O library that has been implemented
for several parallel machines and clusters. This system differs from the other two in that it of-
fers runtime/adaptive optimizations (not just an API) for caching, prefetching, data distribution
and sharing. The differences of our work from PPFS are in that we are examining the benefits
of compiler/runtime directed cache bypassing towards optimizing the hit rates of one or more
applications running on the cluster.
2.3 System Architecture
Our system builds on the architecture of the Parallel Virtual File System (PVFS) [14], since we did not want to re-invent the APIs and mechanisms for providing a shared name space across the cluster and facilities for distributing/striping file data across the disks of the cluster nodes. PVFS also provides seamless, transparent access to several existing utilities on normal file systems. Since all these provisions already exist in a publicly distributed parallel file system, we have opted to build upon it in this work rather than re-implement these features. We briefly go over some key architectural features of PVFS and then discuss our contributions.
2.3.1 PVFS
The original PVFS is a mainly user-level implementation, i.e., there is a library (libpvfs)
linked to application programs which provides a set of interface routines (API) to distribute
and retrieve data to/from the files striped across the cluster nodes. In addition to the library,
PVFS uses two other components, both of which run as daemons on one or more nodes of
the cluster. One of these is a meta-data server (called mgr), to which libpvfs sends requests
for meta-data information (access rights, directories, file attributes, etc.). In addition, there are
several instances of a data server daemon (called IOD), one on each of the machines whose disk
is being used to store the data. This daemon (again running at the user level) listens on sockets for
requests from libpvfs functions on clients to read/write data from/to its local disk using normal
Linux file system calls. There are well-defined protocols for exchanging information between
libpvfs and IODs. For instance, when the user wants to read file data that is striped across
several IODs, libpvfs converts this request into several requests (one for each IOD involved),
sends these requests to the IODs using sockets, waits for an acknowledgment from each of them,
following which it waits for the data sent by the IODs. This data is then collated and returned to
the application process. On a write, libpvfs sends out the requests, following which the relevant
data is sent to each IOD. To check for error conditions each IOD sends back an acknowledgment
indicating how much data was actually written. The reader is referred to [14] for further details
on the functioning of PVFS.
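The request-splitting step that libpvfs performs can be sketched as follows. This is a simplified Python illustration assuming PVFS's round-robin striping of fixed-size stripes across the IODs; the function name and the per-IOD request representation are ours, not PVFS's actual protocol.

```python
# Sketch: map a contiguous file read/write onto per-IOD requests, assuming
# round-robin striping. Illustrative only; PVFS's real request format differs.

def split_request(offset, length, stripe_size, num_iods):
    """Return {iod_index: [(iod-local offset, chunk length), ...]}."""
    requests = {}
    pos = offset
    end = offset + length
    while pos < end:
        stripe = pos // stripe_size          # global stripe index
        iod = stripe % num_iods              # round-robin IOD assignment
        local_stripe = stripe // num_iods    # stripe index within that IOD
        within = pos % stripe_size           # offset inside the stripe
        chunk = min(stripe_size - within, end - pos)
        requests.setdefault(iod, []).append(
            (local_stripe * stripe_size + within, chunk))
        pos += chunk
    return requests
```

For example, a 96 KB read at offset 0 with 32 KB stripes on three IODs produces exactly one 32 KB piece for each IOD, which libpvfs would then send out in parallel and collate on return.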
2.3.2 Overview of System Architecture
As mentioned earlier, we would like to build on the existing capabilities provided by
PVFS to leverage off its rich API and features. Further, we wanted to provide our caching
infrastructure in a fairly transparent fashion so that it is not even apparent to a large part of the
PVFS implementation, let alone the application. This implies that we need to intercept all the
socket calls that libpvfs makes and provide caching at that point. It should be noted that our
cache is meant only for IOD requests, and we do not cache any metadata information at this time
(i.e., they always go to the meta-data server).
Our system provides two levels of caching: a local cache at every node of the cluster where an application process executes, and a global cache that is shared by different nodes (and possibly different applications) across the cluster, with the possibility of skipping either of them
as illustrated in Figure 2.1. The design and implementation of the local cache at each node is
described in [91], and here we describe it briefly for completeness, and then concentrate on the
global cache.
Local Cache
There are two alternatives for implementing the local cache at each node. One option
is to implement the caching within the library that is linked with the application (user-level).
However, with this approach we do not have the flexibility of sharing cache data between appli-
cation processes running on the same node. This is the reason why we opted to implement the
local cache within the Linux kernel (a dynamically-loadable module), that can be shared across
all the processes running on that node. Only when the request misses in this cache (either all
or some of the request cannot be satisfied locally), is an external request initiated out of that
node, either to the global cache or to the IODs, as explained below. This cache is implemented
using open hashing with second chance LRU replacement. There is a dirty list (which keeps
track of all the cache frames that have been modified while in cache), a free list (which keeps
track of all the unused cache frames), and a buffer hash to chain used blocks for faster retrieval
and access. The hashing function takes as parameters the inode number of the file and the block
number to index the buffer hash table. There are two kernel threads in the implementation, called flusher and harvester. Writes are normally nonblocking (except the sync write explained below), and the flusher periodically propagates dirty blocks to the global cache/IOD. The harvester
is invoked whenever the number of blocks in the free list falls below a low water mark, upon
which it frees up blocks till the free list exceeds a high water mark. A block size of 4K bytes is
used in our implementation. Note that such a kernel implementation automatically allows mul-
tiple applications/processes to share this local cache, thus making more effective use of physical
memory.
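The local cache's bookkeeping can be sketched as follows. This is a simplified user-space Python model of the structures described above (buffer hash keyed by inode and block number, free list, dirty list, and second-chance eviction between water marks); the class name, water-mark defaults, and frame representation are ours, and the real implementation is a Linux kernel module with kernel threads.

```python
# Sketch: buffer hash keyed by (inode, block), a free-frame count, a dirty
# set for the flusher, and a harvester running second-chance (clock) eviction
# until the free list exceeds a high water mark. Illustrative only.

from collections import OrderedDict

class LocalCache:
    def __init__(self, nframes, low=1, high=2):
        self.low, self.high = low, high      # free-list water marks
        self.frames = OrderedDict()          # (inode, block) -> [data, ref, dirty]
        self.free = nframes                  # frames on the free list
        self.dirty = set()                   # blocks awaiting the flusher

    def access(self, inode, block, data=None):
        key = (inode, block)
        if key in self.frames:               # hit: set the reference bit
            self.frames[key][1] = True
            if data is not None:
                self.frames[key][0] = data
                self._mark_dirty(key)
            return True
        if self.free <= self.low:            # harvester kicks in below low mark
            self._harvest()
        self.free -= 1
        self.frames[key] = [data, True, data is not None]
        if data is not None:
            self.dirty.add(key)
        return False                         # miss: caller fetches remotely

    def _mark_dirty(self, key):
        self.frames[key][2] = True
        self.dirty.add(key)

    def _harvest(self):
        # Second chance: clear the reference bit on a referenced frame and
        # give it another pass; evict the first unreferenced frame, until
        # the free list reaches the high water mark.
        while self.free < self.high and self.frames:
            key, (data, ref, dirty) = next(iter(self.frames.items()))
            self.frames.move_to_end(key)
            if ref:
                self.frames[key][1] = False
            else:
                self.dirty.discard(key)      # flusher would write back first
                del self.frames[key]
                self.free += 1
```

A referenced block thus survives one clock sweep before becoming an eviction candidate, which approximates LRU at much lower bookkeeping cost.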
Global Cache
The global cache, as explained above, adds one more level to the storage hierarchy be-
fore the disk at the IOD needs to be accessed. There are numerous questions/alternatives when
implementing the global cache and we go over them in the following discussion, explaining the
rationale behind the choices we make specifically in our implementation:
• Should there be a global cache for each file, or should all files share the same cache?
While there may be some possibility for detecting access patterns across datasets for op-
timizations, our current system uses a separate global cache for each file. If there is little
file sharing across applications, or even across parallel processes of the same application,
then the requests would automatically distribute the load more evenly with this approach.
• Should each application have its own global cache, or should we share a global cache
across applications? Since we would also like to be able to perform inter-application
optimizations based on sharing patterns, we have opted to share the global cache across
applications. This can help one application (even its cold references) benefit from data brought into the cache earlier by another. There is, however, the fear of worse miss rates if there is interference because of such sharing, and these are points that our cache-bypass mechanisms address. This feature is one key difference between our system and
PPFS [39], where the global cache is intended for optimizations within the processes of a
single application. Our system does suffer from scalability issues, and performance may
start to drop beyond a particular number of client nodes due to the centralized nature of the
global cache. However, the focus of this work is on techniques for intelligent caching of
data in file-system caches, and we are looking at scalable techniques as part of our future
work. Furthermore, providing a separate global cache for each file as explained above can
ease some of this bottleneck.
• Should we distribute the global cache across the cluster? While distribution is a good idea
in terms of alleviating contention, there are a couple of drawbacks. First, depending on
the granularity of distribution, it may be difficult to perform certain optimizations (such
as prefetching) if one node is not the repository for all the file data. Second, two levels
(one between the IODs and the global caches, and one between the global caches and local
caches) of multiplexing and demultiplexing the data may be needed. We, instead, opted
to have a centralized global cache for each file. However, since we have a separate global
cache for each file, we can have separate global caches on different cluster nodes serving
different files, and that can alleviate some of the contention problems that may arise.
• Should the global cache be implemented as a user process or as a kernel module? The
reason for a kernel level implementation for the local cache is the need for trapping all
application requests coming at that node from the different processes via the PVFS calls.
However, with the global cache, TCP/IP sockets are being explicitly used for sending
messages to it from the individual local caches regardless of which application process
is making a call. The convenience and flexibility (option of busy-waiting) of a user-level
implementation has led us to implement the global cache for a specified file as a stand-
alone, user-level daemon running on a specified node of our cluster.
Each global cache in our system is, thus, a user-level process running on a cluster node and serving requests to a specific file; explicit requests are sent to it by the local caches, and it is shared by different applications. The internal data structures and activities of the global cache are
similar to those of the local cache, described above. One could designate such global caches on
different nodes (for each file), particularly on those nodes with larger physical memory (DRAM).
Consequently, this architecture is also well suited to heterogeneous clusters where one or more
nodes may have larger amounts of memory than the others.
Reads/Writes
Figure 2.1 gives a schematic overview of our system. Let us now briefly go over a
typical read operation (there could be some differences when one or more levels of caching are
disabled as discussed below) to understand how everything works when an application process
on a node makes a read call, possibly to several blocks that span different IODs. The original
PVFS library on the client aggregates the requests to a particular IOD, before making a socket
request (kernel call) to the node running that IOD. Our local cache intercepts this call in the
kernel and checks to see if all or even a part of it can be satisfied locally. If the entire request
can be satisfied without a network message, then the data is returned to the PVFS library and
the application proceeds. Otherwise, the local cache module accumulates a list of requests that
need to be fetched. A subsequent message is sent to the global cache with these requests (note that, if the global cache is bypassed, these requests are instead sent directly to the IODs).
The multi-threaded global cache keeps listening on a dedicated socket for requests, and upon
receiving such a message looks up its data structure. If it can satisfy the requests completely
from its memory, it returns the data to the requesting local cache. Otherwise, it sends a request
message to each of the IODs holding corresponding blocks, stores the blocks in its memory
when it gets responses from the IODs, and then returns the necessary data to the requesting local
cache. A write operation works similarly except that the writes are propagated in the background
(using the flusher thread described earlier), and control is returned back as soon as the writes are
buffered.
The above read and write operations are the most common, and can benefit significantly
from spatial and temporal locality in the caches. However, with the presence of multiple copies
for data blocks, there is the issue of coherence/consistency. The above read/write mechanisms
do not worry about consistency, and a read simply returns the value in a version of the block that
it finds (i.e., the write is only propagated to the global cache and IOD — any subsequent read
to the global cache/IOD will get this value, but a read from a node that already has this block
in its local cache will not get this latest value). While this may not pose a problem for many
applications, where read-write sharing is not common (as compared to read sharing) or where
consistency is explicitly managed by the application itself, there are certain applications where
ensuring consistency is critical. Consequently, in our system, we also provide a special version
of the write, called sync write, which not only propagates the writes to the global cache/IOD, but
also invalidates the local caches which have a copy (so that subsequent reads on those nodes can
go out on the network and get the latest copy). Coherence is maintained at a block granularity,
and thus requires a directory entry per block to keep track of the local caches that have a current
copy of that block. We maintain this directory at both the global cache and the IOD. The need for the latter will become clearer when we discuss global-cache bypassing. The actual set of
local caches with a copy would involve merging these two directory entries for a block. On
a system where there is no global-cache bypassing (all requests go via the global cache), the
directory at the IOD would be empty. Local caches that bypass the global cache would update
the directory at the IOD rather than at the global cache. A sync write is thus an additional
overhead (over normal writes), involving looking up the directory entries, and invalidating any
copies, in addition to propagating the write itself. It would thus be more prudent to use the
normal writes as far as possible, and use sync write only when coherence is needed (or when
one is not sure).
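The sync write coherence action can be sketched as follows. This is an illustrative Python model of the mechanism described above; the data structures are simplifications of the per-block directories kept at the global cache and the IOD, and the function signature is ours.

```python
# Sketch: a sync write merges the per-block directory entries held at the
# global cache and at the IOD, invalidates every other sharer's local copy,
# and propagates the write. Data structures are illustrative simplifications.

def sync_write(block, node, data, local_caches, global_dir, iod_dir, store):
    # Union of both directories gives the full set of sharers, since local
    # caches that bypass the global cache register at the IOD instead.
    sharers = global_dir.get(block, set()) | iod_dir.get(block, set())
    for n in sharers - {node}:               # invalidate every other copy
        local_caches[n].pop(block, None)
    store[block] = data                      # propagate the write itself
    global_dir[block] = {node}               # writer is now the only sharer
    iod_dir[block] = set()
    local_caches[node][block] = data
```

The extra directory lookup and invalidation round is exactly the overhead that makes sync write more expensive than a normal buffered write.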
2.3.3 Performance of Primitives and Micro-Benchmarks
Before we go any further into our optimizations, we would first like to present some
latency numbers and micro-benchmark results for read and write performance in the presence/absence of local/global caches. For these experiments, the local cache size was fixed at
2 MB (500 data blocks), while the global cache size was fixed at 40 MB (10000 data blocks).
Also, a stripe size of 32 KB was used in all our experiments.
Raw Latencies for Reads/Writes
In the first set of results (see Table 2.1), we give the read latencies for a file striped over different numbers of IODs (1 to 4). In these tables, Pvfs denotes the read latency of the original
PVFS system which does not use any caching (local or global). Local Hit indicates the
latency when the access is satisfied from local cache and Local Miss is the latency when the
access misses in the local cache and is satisfied from one or more IODs. The latter case thus
captures the execution on a system without a global cache. Global Hit and Global Miss,
Fig. 2.1. System architecture. Nodes 1..n are the clients where one or more application processes run, and have a local cache present. Upon a miss, requests are either directed to the global cache (one such entity for a file), or are sent directly to IOD node(s) containing the data in the disk(s).
on the other hand, denote the cases when the access misses in the local cache (i.e., a local cache
lookup is still needed) and hits and misses, respectively, in the global cache.
From these numbers, we clearly see that the local cache hits (Local Hit) can substan-
tially lower read costs compared to the original PVFS implementation. On the other hand, if the
locality is not good, causing us to miss in the local cache (i.e., Local Miss), the performance
becomes worse than the original PVFS for all request sizes because of the overheads in looking
up the local cache. Therefore, it is not only important to improve the hit behavior of the local
cache, but it is also meaningful to bypass the local cache on certain lookups if we feel that it is
going to miss.
When we next move to the scenarios with accesses to the global cache (misses in the local cache), we first see that the global cache can lower access times (provided the data is present there) compared to the original PVFS without caching in many cases (i.e., requests larger than 4
KB). It is also better than fetching the data directly from IODs upon a local cache miss (Local
Miss). However, global-cache-miss costs are substantially higher than any of the other cases
because of the additional message hop and serialization overhead that occur in the critical path
and the associated lookup costs. This suggests that if we want to incorporate and benefit from the
global cache, it is very important to keep its hit rate quite high. In fact, the Required HR rows
in Table 2.1 give the minimum hit rates that are needed (for each request size) to tilt the balance
in favor of the global cache compared to the original PVFS. A value larger than 1 in these rows indicates that it is impossible to generate better results than the original PVFS using that request
size and the number of IODs. Figure 2.2 shows the same behavior plotted as a graph. This again
means that we need to be very careful on what to put in the global cache and when to avoid
going through it. Further, we can observe that the benefits of global caching (look at the last row
Table 2.1. Read times (in ms) for different request sizes and number of IODs (|IOD|).

Request Size −→             4K     8K     16K    32K    64K    128K   256K

|IOD| = 1
Pvfs                        1.09   2.27   4.31   9.48  19.04  38.52   54.04
Local Hit                   0.67   0.68   0.72   0.80   0.97   1.59    2.85
Local Miss                  1.25   2.28   4.61   9.54  20.77  44.23   67.54
Global Hit (Local Miss)     1.43   1.71   2.44   4.26   8.14  15.28   25.91
Global Miss (Local Miss)    2.00   2.85   5.86  11.49  23.85  50.42   94.38
Required HR                 1.59   0.50   0.45   0.27   0.30   0.33    0.58

|IOD| = 2
Pvfs                        1.12   1.99   3.82   7.84  14.16  24.09   47.79
Local Hit                   0.74   0.83   1.03   1.38   2.43   4.34    8.56
Local Miss                  1.32   2.08   4.36   8.07  18.59  36.49   52.92
Global Hit (Local Miss)     1.51   1.85   2.62   5.01   8.32  17.77   39.92
Global Miss (Local Miss)    2.05   3.31   5.93  11.91  24.78  49.06  109.36
Required HR                 1.72   0.90   0.63   0.58   0.64   0.79    0.88

|IOD| = 3
Pvfs                        1.08   1.83   3.52   6.17  12.00  20.04   36.86
Local Hit                   0.75   0.84   1.01   1.41   2.42   4.50    8.67
Local Miss                  1.31   2.35   4.48   8.19  18.96  26.90   40.63
Global Hit (Local Miss)     1.23   1.66   2.45   4.80   8.67  19.26   39.38
Global Miss (Local Miss)    1.87   3.36   6.30  12.06  30.71  54.14  100.34
Required HR                 1.23   0.90   0.72   0.81   0.84   0.97    1.04

|IOD| = 4
Pvfs                        1.08   1.63   3.33   5.32  10.64  19.06   33.79
Local Hit                   0.76   0.84   1.01   1.40   2.41   4.68    8.89
Local Miss                  1.32   2.18   4.50   8.61  14.47  21.76   38.96
Global Hit (Local Miss)     1.48   1.67   2.80   4.70   9.10  19.50   38.87
Global Miss (Local Miss)    1.88   3.87   6.10  12.33  26.07  49.62  107.63
Required HR                 2.00   1.01   0.83   0.91   0.90   1.01    1.07
showing required hit rate) are most significant when request sizes are not at either extreme. At
very small request sizes, the overhead of global caching itself is more significant. At the other
end, large amounts of data can cause more capacity misses, leading to poor temporal locality.
Another point to note is that when the number of IODs involved in the access increases, the cost
of a global cache miss becomes even more significant. This is because the global cache has to
amass the data coming in from different IODs and then send them sequentially to the requester,
while all the IODs could have potentially sent them in parallel to the requester if the global cache
was not involved.
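The Required HR rows follow from a simple break-even condition: routing requests through the global cache pays off only when hr × T_global_hit + (1 − hr) × T_global_miss ≤ T_pvfs. A small sketch (Python, using the |IOD| = 1 latencies from Table 2.1; the function name is ours):

```python
# Break-even global-cache hit rate. A result above 1 means no feasible hit
# rate can beat the original PVFS for that request size / IOD count.

def required_hit_rate(t_pvfs, t_global_hit, t_global_miss):
    return (t_global_miss - t_pvfs) / (t_global_miss - t_global_hit)

# 32 KB requests, |IOD| = 1: consistent with the ~0.27 Required HR entry.
hr_32k = required_hit_rate(9.48, 4.26, 11.49)

# 4 KB requests, |IOD| = 1: above 1, matching the infeasible 1.59 entry.
hr_4k = required_hit_rate(1.09, 1.43, 2.00)
```

Differences in the last digit against the table are attributable to rounding in the reported latencies.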
Table 2.2 gives the times for write operations to return to the application after they are issued, with different numbers of IODs involved. We compare the performance of the original
PVFS code (denoted Pvfs) with our system having a local cache (denoted Caching). We are
not separately giving the costs as in the read table (Table 2.1) for the other scenarios as they are
comparable to the scenario with a local cache (the writes are simply accumulated in the local
cache, and a background activity — flusher — propagates these writes to either the global cache
or the IOD). We do not buffer writes of an application when there is not enough space left on
the local cache. Hence the cost of writes whose sizes are greater than the local cache size is
comparable to the cost of the original PVFS implementation. We can see that write stall times
are significantly lower because of this feature, as is to be expected. It is to be noted that the
savings that will be presented later in this chapter with our optimizations are not a result of these
nonblocking writes, since we show savings even over the scenarios that cache everything in the
local/global caches (which also performs nonblocking writes).
Fig. 2.2. Graph showing the minimum required hit rate at the global cache for good performance, as a function of read block size, for files striped on 1, 2, 3, and 4 nodes (with the maximum feasible hit rate shown for reference).
Table 2.2. Write times (in ms) for different request sizes and number of IODs (|IOD|).

Request Size −→   4K     8K     16K    32K    64K    128K   256K

|IOD| = 1
Pvfs              0.68   1.03   1.97   3.95   7.83  15.94  31.09
Caching           0.55   0.56   0.60   0.96   1.05   1.76   3.15

|IOD| = 2
Pvfs              0.68   1.27   1.90   3.77   9.86  15.44  29.61
Caching           0.60   0.67   0.84   1.43   2.04   3.70   7.19

|IOD| = 3
Pvfs              0.68   1.04   1.85   3.62   8.23  15.74  29.40
Caching           0.59   0.68   0.87   1.37   2.08   4.01   7.79

|IOD| = 4
Pvfs              0.68   1.02   1.95   3.58   8.18  15.87  29.09
Caching           0.60   0.68   0.90   1.55   2.17   4.30   8.02
Micro-benchmark Results
While our later experiments will evaluate caching using real benchmarks, we wanted to
stress the system along different dimensions, and employed a micro-benchmark to do so. Our
micro-benchmark is parameterized based on s (the maximum size for a read/write operation in
blocks, where a block is defined to be the same size as the granularity of the caches) and o (the
maximum offset within a file in blocks from which the next read/write is initiated). The micro-
benchmark program iteratively goes over a number of operations, randomly picking whether it
is a read or a write with equal probability. The size of this operation is also picked randomly
between 1 and s blocks, and the starting offset within the file for the operation is picked, again
randomly, between 0 and o blocks. Note that a small value of o will automatically provide good
locality, and we can tune these parameters to mimic different access patterns.
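The access stream of this micro-benchmark can be sketched as follows; this is an illustrative reconstruction (function name and seeding are our own additions), not the original benchmark code:

```python
import random

def microbenchmark_ops(num_ops, s, o, seed=0):
    """Generate the access stream of the micro-benchmark described above:
    each operation is a read or a write with equal probability, its size is
    picked randomly between 1 and s blocks, and its starting offset is
    picked randomly between 0 and o blocks (a small o automatically gives
    good locality)."""
    rng = random.Random(seed)
    for _ in range(num_ops):
        op = rng.choice(("read", "write"))
        size = rng.randint(1, s)     # in blocks; a block equals the cache granularity
        offset = rng.randint(0, o)   # in blocks from the start of the file
        yield op, offset, size
```

For example, `list(microbenchmark_ops(15000, s=2, o=10))` generates a high-locality run, while `o=25000` generates a poor-locality one.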
Instead of presenting all the results, we discuss here one representative case with one IOD
being employed, s = 2, and for three different values of o: 10, 600, and 25000 (see Figure 2.3
and Figure 2.4). Note that the locality progressively gets worse from o = 10 to o = 25000. When
the locality is very good (o = 10), the working set is contained well within the local cache, and
the schemes that use the local cache perform much better than those without it. We also note
that the global-cache-only scheme still turns out to be better than the scheme without any
caching. Even though its hit rates are quite high, the global cache's overheads cause it
to perform much worse than the schemes with a local cache. At the other end of the spectrum,
when the locality becomes very poor (o = 25000), the working set is not well exploited by any of
the caching schemes, and their associated overheads cause them to perform worse than a scheme
without any caching. The more interesting results are those for o = 600, where the working set
overflows the local cache, but is captured by the global cache (which is larger). Consequently, the
two schemes which use a global cache provide much better performance than a scheme without
any caching or a scheme with just a local cache.
Fig. 2.3. Micro-benchmark running on 1 node, file striped on one IOD, s = 2, (a) o = 10, (b) o = 600. (Both panels plot time to complete in seconds against the number of operations for four schemes: No Caching, Only Local Caching, Only Global Caching, and Local & Global Caching.)
In the earlier experiment, the micro-benchmark is run on a single node. We have also
run the same micro-benchmark on different nodes, with the data striped across different IODs
(using a stripe size of 32KB). In Figures 2.5 and 2.6, we show the results for one such scenario
with three IODs used to distribute the data. We observe similar trends to those we saw earlier.
The only slight difference with the poor locality situation (o=25000) is that the local-cache-only
execution is not much worse than without any caches because the local cache overheads are not
too significant.
Fig. 2.4. Micro-benchmark running on 1 node, file striped on one IOD, s = 2, o = 25000. (Time to complete in seconds against the number of operations for the same four caching schemes.)
Fig. 2.5. Micro-benchmark running on 2 nodes, file striped on three IODs, s = 2, (a) o = 10, (b) o = 600. (Time to complete in seconds against the number of operations for the same four caching schemes.)
Fig. 2.6. Micro-benchmark running on 2 nodes, file striped on three IODs, s = 2, o = 25000. (Time to complete in seconds against the number of operations for the same four caching schemes.)
2.3.4 Cache Bypass Mechanisms
The results in the previous subsection indicate that it is important to provision a local and
a global cache for good performance. However, our results also show that it is equally important
to be very careful in deciding what data to place in these caches and when to avoid/bypass them.
Our system provisions mechanisms for bypassing the local and/or global caches for a
read or write. Our system does not require any different read/write calls to specify that a cache
needs to be bypassed since that can get cumbersome, and it is not clear how such a mechanism
can be effectively used by application programmers. Instead, we provide the notion of a segment
— a certain number of contiguous file blocks (unless explicitly stated otherwise, a segment
of 4 blocks was used in the experiments) — with a set of bits determining what actions to
be performed on a read/write. For each operation (read or write), we have two bits, one for
specifying whether that operation for the segment needs to go through the local cache and another
for whether it needs to go through the global cache. We thus provide a segment-level granularity
for cache bypassing.
These (segment) bits can be set via a system call that updates a data structure in the
underlying kernel module (implementing the local cache) at each node. When a read/write call
is made, this bitmap data structure is consulted to find out whether to look up the local cache, and
whether to route the request to the global cache or directly to the IOD. The system call to set these
bits can either be explicitly invoked by the application program or be invoked by instructions
inserted into the code by the compiler. These bits can also be set by the runtime system based on
previous execution characteristics. In the default configuration, all operations go via the local and
global caches for all segments. The rest of this chapter explores the benefits of cache bypassing,
and ways of initiating such bypassing with the compiler and the runtime system. While it is
also possible to adopt a user-based strategy where the application programmer sets these bits
explicitly, we believe that such an approach would be very difficult for the user (investigating
profile-based techniques and tools for doing this is part of our future research agenda). Also, we
specifically focus on bypass mechanisms for the global cache in this work, whose overheads on
a miss are much more significant than the corresponding overheads for the local cache.
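The per-segment bypass bits can be sketched as a small bitmap structure. Everything below (names, the dictionary representation) is a hypothetical user-level illustration of the kernel-module data structure described above:

```python
# Sketch of the per-segment bypass bits: each segment (a run of contiguous
# file blocks, 4 by default) carries two bits per operation -- one saying
# whether the operation goes through the local cache, and one for the
# global cache. The read/write path consults this bitmap to route requests.
SEGMENT_BLOCKS = 4

class BypassBitmap:
    def __init__(self):
        # Default configuration: all operations go via both caches
        # for all segments.
        self.bits = {}  # segment id -> {(op, level): bool}

    def set_bits(self, segment, op, use_local, use_global):
        # In the real system this is done via a system call, invoked by
        # the application, compiler-inserted code, or the runtime system.
        entry = self.bits.setdefault(segment, {})
        entry[(op, "local")] = use_local
        entry[(op, "global")] = use_global

    def route(self, block, op):
        # Consult the bitmap on a read/write: should we look up the local
        # cache, and should the request go to the global cache or
        # directly to the IOD?
        seg = block // SEGMENT_BLOCKS
        entry = self.bits.get(seg, {})
        return (entry.get((op, "local"), True),
                entry.get((op, "global"), True))
```

A bypass of the global cache for reads of one segment, for instance, leaves writes and all other segments on the default cached path.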
2.4 Compiler-directed Cache Bypass
Previous discussion emphasized the importance of careful management of the global
cache space. An optimizing compiler can help us identify what data should be brought into the
global cache. It can achieve this by using at least two different strategies. We assume here that
the data for each array corresponds to a different file. In the first strategy, the compiler adopts
a coarse-grained approach and determines the arrays that are used frequently in the program. It
achieves this by estimating (at compile time) the number of accesses to each array in the code.
More specifically, for each loop nest, the compiler counts the number of references to each array
and multiplies these counts by the trip counts (the number of iterations) of all enclosing loops.
If there is a conditional flow of control within the loop (e.g., an ‘if’ statement), the compiler
conservatively assumes that all possible branches are equally likely to be taken. Note that if
we have profile data on branch probabilities, it is straightforward to exploit it for obtaining a
more accurate estimate. Another potential problem is the compile-time-unknown loop bounds.
In such cases, the compiler can estimate the number of accesses symbolically. Note that well-
known symbolic manipulation techniques (e.g., [29, 36]) can be used here for this purpose. After
doing such analysis, the compiler uses the global cache for reads/writes to the files (arrays) with
the most references (depending on how many such files can fit in the global cache).
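The coarse-grained estimate can be sketched as follows. The representation of loop nests and all names are illustrative assumptions, and conditional-branch scaling and symbolic bounds are omitted for brevity:

```python
from math import prod

def rank_arrays(loop_nests):
    """Coarse-grained compile-time estimate sketched from the text: for
    each loop nest, multiply the number of textual references to each
    array by the product of the trip counts of all enclosing loops, then
    rank arrays by total estimated accesses. Each nest is represented as
    (trip_counts, refs) with refs mapping array name -> reference count."""
    totals = {}
    for trip_counts, refs in loop_nests:
        iterations = prod(trip_counts)  # product of enclosing trip counts
        for array, count in refs.items():
            totals[array] = totals.get(array, 0) + count * iterations
    return sorted(totals, key=totals.get, reverse=True)

def arrays_to_cache(loop_nests, sizes, cache_capacity):
    """Greedily pick the most-referenced arrays (files) that fit in the
    global cache; reads/writes to the remaining files bypass it."""
    chosen, used = [], 0
    for array in rank_arrays(loop_nests):
        if used + sizes[array] <= cache_capacity:
            chosen.append(array)
            used += sizes[array]
    return chosen
```

With profile data, the per-nest reference counts would simply be weighted by branch probabilities before ranking.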
An important drawback of this coarse-grained strategy is that it fails to capture short-term
localities. For example, in a given large, I/O-intensive application, an array might be accessed
very frequently in the first half of the application and is not accessed in the second part. However,
the strategy described above can continue to cache the segments of this array in the second part
of the application if the overall (program-wide) access count of this array is larger than those of
the others. Our second strategy tries to eliminate this drawback of the coarse-grained method by
managing the global cache space on a loop nest basis focusing on segment granularity.
Specifically, in our second strategy, the compiler determines the blocks that will be ac-
cessed in each loop nest separately. The id’s of a subset of these blocks are then recorded at the
loop header. This subset contains the most frequently used blocks in the nest. By doing this,
the second strategy tries to capture short-term localities and manages the global cache space at
a finer granularity. Then, the segments corresponding to the most frequently used blocks are
cached. Note that this approach can be expected to result in better global cache hit ratio than
the first strategy. It should also be noted that determining the blocks that will be accessed by a
loop nest is possible as in our applications there is a one-to-one correspondence between arrays
declared in the program and disk-resident files (i.e., our applications use a separate file for each
array that they manipulate). Therefore, the compiler can associate the array elements with the
blocks. Also, as in the case of coarse-grained approach, this approach can take advantage of
profile data (e.g., on branch probabilities) where available. Furthermore, again as in the previous
case, it can employ symbolic expression [29, 36] manipulation when loop trip counts are not
known at compile time.
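The per-nest selection can be sketched as below, assuming per-block access estimates are already available from the analysis above (the function name and input representation are hypothetical):

```python
def blocks_to_cache_per_nest(nest_block_counts, k):
    """Fine-grained strategy sketch: for each loop nest, record at the
    loop header the ids of the k most frequently accessed blocks; only
    the segments corresponding to these blocks go through the global
    cache while that nest executes. nest_block_counts maps a nest id to
    {block id: estimated accesses}, derivable because each array maps
    one-to-one onto a disk-resident file."""
    plan = {}
    for nest, counts in nest_block_counts.items():
        top = sorted(counts, key=counts.get, reverse=True)[:k]
        plan[nest] = set(top)
    return plan
```

Because the plan is recomputed per nest, an array that is hot only in the first half of the program stops occupying global-cache space in the second half, avoiding the drawback of the coarse-grained method.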
We implemented both these strategies by using the SUIF compiler infrastructure [94]
and evaluated them by using codes where data access patterns are statically analyzable. SUIF
consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of
the kernel. The strategies that were described above have been implemented as SUIF passes that
perform the required analysis and write the output to a file. We present here results with I/O-
intensive versions of two Spec benchmarks: tomcatv and vpenta. While the original codes
manipulate arrays directly in memory, we extended them to read/write these arrays from data
files explicitly, before manipulating them in memory. The results are shown for tomcatv in
Figures 2.7 and 2.8(a) as a function of the problem size (local cache size of 400KB, global cache
sizes of 20 MB and 200 MB) and as a function of the global cache size (keeping the problem
size fixed at 1500 – this corresponds to matrices of size 1500*1500 doubles manipulated in the
application), respectively. The corresponding results for vpenta are given in Figures 2.9 and
2.8(b). In each of these figures, we compare the performance of four different executions: (a)
a scheme with no caching (and hence no compiler optimizations for I/O); (b) a scheme with
local and global caches without any compiler optimizations for I/O; (c) a scheme with local
and global caches in conjunction with coarse-grain (file level) compiler optimizations, and (d) a
scheme with local and global caches in conjunction with fine-grain compiler optimizations.
Fig. 2.7. tomcatv: impact of problem size. (a) Global cache size is 20 MB, (b) global cache size is 200 MB. (Both panels plot time to complete in seconds against problem size for No Caching, Local & Global, Coarse Grain, and Fine Grain.)
Examining Figure 2.7(a), we find evidence confirming the earlier arguments that blindly
caching everything in the local and global caches can sometimes worsen performance. Specif-
ically, we observe that the No Caching alternative does better than the Local & Global
option (i.e., caching everything indiscriminately), especially at larger problem sizes. The over-
heads of going to the global cache and not finding the required blocks in it contribute to this
behavior. Performing compiler optimizations at the coarse (file) granularity does give better per-
formance than caching everything, but it still does worse than not caching anything. However,
we can see that the fine-grained approach gets the benefits of the global cache and is also a
Fig. 2.8. Impact of global cache size for a problem size of 1500. (a) tomcatv, (b) vpenta. (Time to complete in seconds against global cache size in blocks for the same four schemes.)
Fig. 2.9. vpenta: impact of problem size. (a) Global cache size is 20 MB, (b) global cache size is 200 MB. (Time to complete in seconds against problem size for the same four schemes.)
better alternative than not caching (because it avoids consulting the global cache when it expects
the data not to be present). This benefit improves as the problem size gets larger (relative to
the global cache size). Evidence for the last statement is further substantiated when we exam-
ine the executions with a much larger global cache in Figure 2.7(b). Here, the hit rates in the
global cache are much higher, and the always-cache option is a better idea. As the global cache
gets larger, selective caching can possibly prevent some data from benefiting from it, compared
to caching everything. All these observations are reiterated when we look at the impact of
global cache capacity for a fixed problem size in Figure 2.8(a). The benefits of selective
caching/bypassing are much more significant at small cache sizes, and the always-cache option
becomes better only with larger global caches. The results for vpenta (given in Figures 2.9
and 2.8(b)) are similar to many of those observed with tomcatv, except that the magnitude of
the differences is less pronounced because its I/O traffic is lower.
In summary, we find that discretionary caching becomes very important when the prob-
lem sizes of applications get large enough, and the working sets cause more thrashing in the
global cache. We find that a compiler-based technique for modulating what to place/bypass in
the global cache can alleviate some of these thrashing problems and help us reap the benefits
of a global cache. Of the two different policies that we tried, we find that a finer granularity of
control is a better option than file-level control. This is because not all blocks within a file may
have the same access pattern or access frequency.
2.5 Runtime Cache Bypass
So far, we have evaluated two compiler-based strategies (coarse-grain and fine-grain)
where our compiler decided what to place in the global cache and when to bypass it. There are
many cases where such a compiler-based strategy may not be desirable or even applicable. For
example, when we do not have the source code of the application, we cannot analyze the program
and determine its access pattern statically. Similarly, in some cases, the application code might
be available but the access pattern it exhibits may not be amenable to compiler analysis (e.g.,
due to array-subscripted array references, non-affine subscript functions, or pointer arithmetic).
However, in these and similar cases, it might be still possible to optimize the application using
a runtime technique. A runtime technique tries to evaluate block-access frequencies at runtime
and makes cache-bypassing decisions dynamically.
In this section, we investigate the effectiveness of a runtime strategy for managing the
global cache. Along similar lines, there has been prior work [44] in the context of processor data
caches for runtime bypassing using access counters. However, in this study, we examine a much
simpler strategy since there are some problems when implementing schemes such as [44, 89]
on our platform where we have multiple levels of caches, and a miss from the local cache may
not go through the global cache at all. Our strategy is based on the idea of having counters with
segments. Specifically, we associate a counter with each segment that keeps the number of times
the segment is accessed. These counters are called segment counters. When a block needs to
be brought into global cache, its segment counter is compared with a pre-set threshold value. If
the value of the segment counter is larger than the threshold, the block is placed into the global
cache; otherwise, the cache is bypassed. When the local cache gets this block, it is told (either
in the read response or the write acknowledgment) to avoid going through the global cache if it
needs to be bypassed subsequently. The rationale behind this approach is that when a block is
not accessed frequently enough, placing it into the global cache can cause a useful (i.e., more
frequently used than the block in question) block to be discarded. It should be noted that we do
not perform any checks when the block is accessed for the first time (counter reads zero), and
only subsequently does this scheme kick in. When a new block is accessed, the harvester on the
global cache examines all currently residing blocks to find a candidate for replacement whose
counter is below the threshold (and does some aging of counters when doing so). Finally, in our
current implementation, the decision for a block (whether to bypass or not) is made only once
and we do not re-evaluate the choice once we decide to bypass the global cache for a block.
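The segment-counter strategy can be sketched as follows; this is a user-level illustration with hypothetical names, and the counter aging performed by the harvester is omitted:

```python
# Sketch of the runtime bypass strategy described above: a counter per
# segment records how often the segment is accessed; when a block must be
# brought into the global cache, it is admitted only if its segment
# counter exceeds a preset threshold. No check is performed while the
# counter reads zero, and a bypass decision, once made, is not revisited.
class SegmentCounters:
    def __init__(self, threshold, segment_blocks=4):
        self.threshold = threshold
        self.segment_blocks = segment_blocks
        self.counters = {}       # segment id -> access count
        self.bypass_blocks = set()  # blocks already marked "bypass"

    def record_access(self, block):
        seg = block // self.segment_blocks
        self.counters[seg] = self.counters.get(seg, 0) + 1

    def admit_on_miss(self, block):
        """Called when `block` would be brought into the global cache."""
        seg = block // self.segment_blocks
        if block in self.bypass_blocks:
            return False                 # decision is made only once
        count = self.counters.get(seg, 0)
        if count == 0:                   # first access: no check performed
            return True
        if count > self.threshold:
            return True                  # frequent enough: place in global cache
        # Bypass; the local cache is told (in the read response or write
        # acknowledgment) to skip the global cache for this block from now on.
        self.bypass_blocks.add(block)
        return False
```

The rationale in the text maps directly onto `admit_on_miss`: an infrequently accessed block is kept out so that it cannot displace a more frequently used resident block.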
The results with this strategy are given in Figure 2.10 for a global cache size of 20 MB
with two different threshold values — high (20) and low (3) for the same two applications ex-
amined earlier. We find that the runtime strategy improves the performance of global caching
for both these extremes. The benefits are better at larger problem sizes where cache thrashing
becomes more significant and we need to be careful on what to put in the global cache. This is
also the reason why when we go to larger problem sizes, the more aggressive runtime approach
(i.e., the one with the higher threshold value) does better than the one with the smaller threshold.
Fig. 2.10. Runtime cache bypassing (global cache size is 20 MB). (a) tomcatv, (b) vpenta. (Time to complete in seconds against problem size for No Caching, Local & Global, Run time (low threshold), and Run time (high threshold).)
Next, we perform a sensitivity study of the runtime technique that depends on two tune-
able parameters, namely, threshold value and segment size. Figure 2.11(a) captures the perfor-
mance of the runtime strategy as a function of the threshold value for tomcatv. We observe
that typically threshold values in the range of 20-50 lead to better performance since they are
more effective in weeding out what should not be put in the global cache, without defaulting to
the No Caching strategy. Consequently, we use threshold values in this range in the next few
experiments.
Recall that so far we have fixed segment size to be four blocks. To study the sensitivity
of our runtime strategy to the segment size, we conducted another set of experiments where
we used different segment sizes ranging from 2 blocks to 64 blocks. The results are illustrated
in Figure 2.11(b) for vpenta. Note that each bar in these graphs is normalized with respect to
the 4-block segment size. These results indicate that selecting a suitable segment size is important. In particular,
working with very small or very large segment sizes may not be a good idea. When the segment
size is very large, the blocks in a given segment do not exhibit uniform locality; therefore, a
segment-wide decision might be the wrong (suboptimal) choice for many blocks in the segment.
Similarly, if the segment size is very small, we witness an increased traffic through the global
cache (which in turn hurts the performance). It should also be stressed that a small segment size
means more bookkeeping and more runtime overhead. Similar results have been obtained with
other applications as well and they are not explicitly given here.
Having examined both compiler (static) based and runtime optimizations for the same
two applications, one could ask how the two compare in terms of effectiveness. We plot the
local and global cache hit rates for different problem sizes for the same two applications under
four different execution scenarios, (a) a scheme which blindly caches everything without any
Fig. 2.11. (a) Impact of the threshold value for tomcatv (time to complete in seconds against the threshold value). (b) vpenta: impact of segment size (time to complete, normalized to 4-block segments, for segment sizes of 2 to 64 blocks).
optimization for I/O, (b) a static compiler-driven scheme that caches file blocks at a coarse
granularity (file level), (c) a static compiler driven scheme that caches file blocks at a finer
granularity (block level), and (d) a scheme which makes cache-bypassing decisions at runtime.
The results are shown in Figures 2.12 and 2.13, which plot the hit rates in the two caches
for tomcatv and vpenta. As is to be expected, in such applications where all the information
can be statically gleaned, the compiler-based techniques can be anticipated to perform better than
their runtime counterpart, since the latter requires a warm-up period before it attempts bypassing.
However, the benefits of the runtime approach will be felt more in non-analyzable applications,
or those in which we do not have source codes to perform these optimizations. We illustrate this
by studying the effectiveness of the runtime optimizations on a set of parallel I/O traces, where
this option is the only choice.
Fig. 2.12. tomcatv: variation of cache hit rates with problem size. (a) Local cache hit rate, (b) global cache hit rate, each plotted for No optimization, Compiler-Coarse, Compiler-Fine, and Runtime.
Fig. 2.13. vpenta: variation of cache hit rates with problem size. (a) Local cache hit rate, (b) global cache hit rate, each plotted for the same four schemes.
The traces used in this part of our experiments are from [90], which capture a
diverse set of application executions (scientific and commercial). We evaluated the runtime
strategy using the traces for the following six applications:
• LU: This application computes the dense LU decomposition of an out-of-core matrix. It
performs I/O using synchronous read/write operations.
• Cholesky: This application computes Cholesky decomposition for sparse, symmetric
positive-definite matrices. It stores the sparse matrix as panels. This application performs
I/O using synchronous read/write operations.
• Titan: This is a parallel scientific database for remote-sensing data.
• Mining: This application tries to extract association rules from retail data.
• Pgrep: This application is a parallelization of a grep program from the University of
Arizona.
• DB2: This is a parallel RDBMS (Relational Database Management System) from
IBM.
In the above experiment, we fixed the size of the local cache at 2 MB and the size of
the global cache at 4 MB, and the threshold values were selected between 10 and 25.
Figure 2.14 shows the execution time of the runtime optimized system normalized with respect
to the system that uses local and global caching without runtime bypass. We can see that the
optimized system benefits all but one of the six applications, with the benefits (reductions in
execution times) ranging between 4% and 48%. The benefits are particularly significant in ap-
plications with poor locality (such as DB2 and LU). These results reiterate the importance of
managing/bypassing the global cache with an effective runtime strategy.
For the above experiment, the hit rates of the local and global caches are shown in Figure 2.15.
As before, we do not see much variance in the local cache hit rates, and the
performance improvement can be attributed to improved global cache hit rates with the runtime
technique.
2.6 Concluding Remarks and Future Work
Caching for I/O is widely recognized as being critical for performance enhancements in
large codes. Such caching is typically done at multiple levels: at the client nodes, at the server
nodes, and perhaps even in between. Each has its advantages and drawbacks. This work has
shown that one must not indiscriminately cache all data at all levels of the caching hierarchy.
We have demonstrated this by extending an off-the-shelf parallel file system for clusters, with a
local cache at each node and a shared global cache. We have also provisioned mechanisms for
bypassing each of these caches for a read/write operation at a fine granularity. Such mechanisms
can either be used explicitly by the application (perhaps some profile-based tools could be
useful here) or be exploited by the compiler or the runtime system. In this work, we have
presented both compile-time and runtime strategies to exploit global-cache bypassing. Using
both statically analyzable codes, and several public-domain I/O traces from diverse domains,
we have demonstrated the benefits of discretionary caching with these techniques. It should
be noted that several of the previously proposed I/O optimizations such as prefetching, data
striping/distribution, etc., can be used in conjunction with the ideas and discussions in this work.
Fig. 2.14. Benefits of runtime bypassing on application traces. (Normalized running times for LU, Cholesky, DB2, Pgrep, Titan, and Mining.)
Fig. 2.15. Application traces: variation of cache hit rates. (a) Local cache hit rate and (b) global cache hit rate, each shown with and without the runtime technique.
There are several interesting directions for future work. As mentioned previously, the
scalability of the global cache with additional client nodes may turn out to be a problem and we
are currently looking at scalable solutions to see if we can apply the techniques presented here at
the I/O server nodes. We have only presented and evaluated a simple runtime strategy, and even
that has turned out to be quite effective. We plan to explore more sophisticated runtime schemes
with this approach. We have used a shared-nothing architecture for the experimental studies,
and it would be interesting to study the applicability and benefits to systems with a shared stor-
age architecture (perhaps including a storage-area network). An important goal of our future
optimizations is to be able to detect access patterns across different simultaneously running ap-
plications for I/O and cache optimizations. We are also interested in developing performance
monitoring and profiling tools to better determine what, when, and where to cache data blocks.
Finally, extending our compiler analysis to capture I/O access patterns inter-procedurally and
applying more aggressive (global) optimizations are interesting extensions to consider.
Chapter 3
Implicitly I/O Intensive Applications
3.1 Introduction
Many scientific applications today solve problems that access large data sets. In general,
these data sets are disk-resident and their sizes range from megabytes to terabytes. These
include applications from medical imaging, data analysis, video processing, large archive main-
tenance, space telemetry data, and so on. Typically, such applications are explicitly coded to
stage the data to be manipulated from disk or the in-core version of the same program is scaled to
handle the larger problem sizes. It has been well-documented [82] that LRU-like virtual memory
replacement algorithms under-perform for typical scientific applications that tend to cyclically
access pages of memory. Consequently, a paged virtual memory system (or the scaled version
of the in-core program) is not considered a viable option [95] for solving out-of-core problems.
When virtual memory pages are touched cyclically and the working set is slightly larger than
available physical memory, LRU chooses to evict exactly the page that would be referenced the
soonest. In the above scenario, the optimal algorithm would evict those pages that were refer-
enced recently since it has future knowledge of the reference patterns. The basic idea behind our
work is as follows: if we can predict or estimate the lifetimes of all the memory-resident pages,
then we can evict a page as soon as we know that there will be no more references to that
page. In a sense, we are trying to approximate online the optimal algorithm, which always evicts
the pages that would be referenced the furthest in time. The basic motivation behind our idea is
that the LRU replacement algorithm holds on to pages that have long been dead, and hence any
replacement algorithm should be pro-active, instead of being reactive, and anticipate candidates
for eviction [92]. In practice, however, it is quite difficult to accurately predict future access in-
formation and lifetimes of virtual memory pages. Consequently, the algorithm that we propose
uses past access-pattern information as an indicator to predict lifetime distances of pages. In this
regard, we have experimented with a few different predictors and compared and evaluated their
performance for a set of twelve different memory-intensive applications drawn from the NAS
parallel and SPEC 2000 benchmark suites. We now introduce a few notations and definitions
that will be used later on in the text:
• Denote the distance between the first access (touch) and the last access (touch) to a virtual
memory page before it gets replaced as the lifetime distance (L)
• Denote the distance between the last access (touch) to a virtual memory page and the
subsequent reuse access, during which interval the page was replaced, as the reuse distance (R)
• Denote the distance between the replacement of a page and the subsequent reuse access to
the same virtual memory page as the window distance (W)
• Denote the distance between the last access (touch) and the replacement of the page as the
dead-page distance (D)
• Denote the distance between two successive page faults to a virtual memory page as the
page-fault distance (F)
These notations are pictorially shown in Figure 3.1. Henceforth, we shall refer to these
parameters as the page-fault parameters or simply the fault parameters. Note that in the above
definitions, the notion of distance was deliberately left unspecified, since it could be measured
in many different ways (for example, in terms of memory references, page faults to other pages,
references to unique memory pages, or even time). Ideally, we would like to measure these
distances in terms of references to unique memory pages, since that directly translates to whether
a page would be retained for a given memory configuration (similar to the idea proposed in [42]
in the context of buffer caches). This is, however, fairly difficult, if not impossible, to do on an
actual system, since it would involve the unacceptable overhead of trapping on each and every
memory reference, or would require special hardware support (like an augmented MemorIES [60]
board) to store the timestamps of the last access to pages. Without hardware support, since the
only OS-visible events are page faults and replacements, it is possible to measure only W and
L + D, and only in units visible to the OS (e.g., the number of page faults incurred by other
pages). Each virtual memory page p is thus characterized by a unique 4-tuple t_i^p = (L, R, D,
W) between page faults i and i+1. Consequently, a program's execution can be visualized as a
sequence of such tuples. Note that for a deterministic replacement algorithm, every instance of
a program’s execution would yield the same sequence of tuples. Typically, operating systems
employ non-deterministic or time-dependent replacement algorithms which then give rise to
different sequences across different executions.
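These definitions can be made operational by simulating a replacement algorithm over a page-reference trace. The sketch below, which is ours rather than part of the thesis infrastructure (the function name and trace format are assumptions), simulates exact LRU and records, for each completed fault interval of each page, the L, D, and W distances measured in memory references; R is recoverable as D + W, and F as L + D + W:

```python
from collections import OrderedDict

def fault_parameters(trace, capacity):
    """Simulate exact LRU over a page-reference trace and record, for each
    fault interval of each page, (page, L, D, W) measured in memory
    references (time = index into the trace)."""
    lru = OrderedDict()            # resident pages; most recently used at the end
    first = {}                     # time of first touch since the last fault
    last = {}                      # time of most recent touch
    evicted_at = {}                # time at which the page was last evicted
    tuples = []                    # completed intervals: (page, L, D, W)
    for t, page in enumerate(trace):
        if page in lru:            # hit: refresh LRU position
            lru.move_to_end(page)
            last[page] = t
            continue
        # page fault: if the page was evicted earlier, close its interval
        if page in evicted_at:
            L = last[page] - first[page]        # lifetime distance
            D = evicted_at[page] - last[page]   # dead-page distance
            W = t - evicted_at[page]            # window distance (R = D + W)
            tuples.append((page, L, D, W))
        if len(lru) >= capacity:                # evict the LRU victim
            victim, _ = lru.popitem(last=False)
            evicted_at[victim] = t
        lru[page] = None
        first[page] = last[page] = t
    return tuples
```

For example, with `trace = [0, 1, 2, 0]` and a capacity of two pages, page 0 is evicted on the fault for page 2 and refaulted one step later.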
The objectives of this study are as follows:
• Show that LRU-like replacement algorithms hold on to pages that have long been dead,
thus losing opportunities for reducing page-faults.
• Characterize the variations of some of these parameters (L, D) in terms of memory
references and in terms of page faults, and see if there are any correlations and predictable
patterns between them, and
• Use these parameters in conjunction with simple predictors to design application-specific
replacement algorithms.

Fig. 3.1. Page-fault characterization.
If we can predict L values, then we can evict pages earlier than when an operating system might
choose to evict them, because a virtual memory page typically becomes dead well before the
system’s replacement algorithm (like LRU) decides to evict the page. However, an incorrect
prediction of L may degrade the performance of the application by increasing the number of
page-faults incurred by the application. Ideally, we would like the predicted value of L to be
as large as possible, but not larger than L + D. Likewise, if we can predict W values, we could
prefetch pages well-ahead of when they would be actually referenced. However, an incorrect
prediction of W could also degrade performance by increasing the number of page-faults incurred by
applications. Ideally, we would like the predicted value of W to be as large as possible such that
the sum of the predicted value and the time it takes to fetch the data from disk/peripheral device
is no larger than W. Consequently, this work can be considered as an application-customized
prefetching and replacement technique that is done automatically and transparently at runtime
by the operating system, and is hence more generally applicable than what was proposed in
[58]. In this study, we only explore the predictability characteristics of the lifetime distances for
effective page replacement and leave predictive prefetching as a future extension.
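The safety margins discussed above can be stated compactly. Writing $\hat{L}$ and $\hat{W}$ for the predicted values and $T_{fetch}$ for the device fetch latency (notation ours, not the thesis's), an eviction or prefetch decision avoids extra page faults when

$$ L \;\le\; \hat{L} \;\le\; L + D, \qquad \hat{W} + T_{fetch} \;\le\; W. $$

Underpredicting L evicts a page that is still live; overpredicting beyond L + D holds it past the point where the system would have reclaimed it anyway.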
The rest of this chapter is organized as follows. Section 3.2 discusses related work. Sec-
tion 3.3 describes the experimental setup and the scientific applications that we used as bench-
marks for characterization and evaluation. In Section 3.4, we illustrate page-fault characteristics
of the system replacement algorithm. Section 3.5 proposes a new replacement algorithm and
evaluates its performance. We finally conclude and discuss the scope for future work in Section
3.6.
3.2 Related Work
Over the last few decades, a lot of work [18, 42, 43, 50, 65, 73, 82] has been done to
address the shortcomings of LRU-like replacement algorithms. Some of these [34, 50, 65, 82]
try to address the shortcomings for a specific workload access pattern such as cyclical access pat-
terns of programs whose working-set sizes are larger than available physical memory. In [82],
the authors propose an adaptive replacement algorithm (EELRU) that uses the same kind of re-
cency information that is normally available to LRU and a simple online cost-benefit analysis
to guide its replacement decisions. In their approach, the system continuously monitors the per-
formance of the LRU algorithm and, if it detects the worst-case behavior, it tries to pro-actively
evict pages. Our work in this study is similar to theirs in the sense that we wish to evict pages
early if we detect that this could do better than LRU, but is different from theirs in the method-
ology that we employ to detect such situations. In the context of buffer-caches where the system
gets control for every reference/access, researchers have proposed adaptive variants of LRU such
as [50, 65], and our work is inherently different from theirs because of the different domains of
applicability. In [34], the authors propose a new replacement algorithm (SEQ) that detects a long
sequence of page-faults and resorts to a Most-Recently-Used (MRU) replacement algorithm on detecting
such sequences. In spirit, most of the proposed algorithms try to imitate the behavior of the OPT
[4] algorithm and therein lies the similarity of our proposed work with them. In [73], the authors
propose a k-order Markov chain to model the sequence of time intervals between successive ref-
erences to the same address in memory during program execution. Note that in this work, we
are only interested in the sequence of time intervals between successive page faults to the same
virtual memory page. In a more recent work [42], the authors propose an efficient buffer cache
replacement policy called LIRS (Low Interference Recency Set) that uses recency to evaluate
Inter-Reference Recency for making a replacement decision. The key insight of this technique
is that it attempts to capture the inter-reference recency (the number of unique blocks accessed
between two consecutive references to a block) to overcome the deficiencies of LRU. In [57], the
authors propose a self-tuning, low-overhead, scan-resistant algorithm (ARC) that exploits both
recency and frequency to adapt itself according to the workload characteristics. A modified ver-
sion of the algorithm, CAR, was proposed in a more recent paper [3]. It combines the benefits of
ARC [57] and CLOCK [18] and is more amenable for virtual memory replacement. Both ARC
and CAR require a significant amount of memory for recording history information, which could
prove to be expensive for the memory-intensive applications considered in this study. Additionally,
many of the proposed algorithms in the literature are usually interesting only from the theoretical
point of view and may not really be implementable in an operating system. In this work, we also
demonstrate a potential in-kernel approximation of our idea.
From a system’s perspective, a lot of work has been done [12, 15, 33, 70] for applications
that access disk-resident data sets through explicit I/O invocations. In the past, researchers
have also studied the problem of poor virtual-memory performance (implicit I/O) from an ap-
plication’s perspective [10, 22, 37, 58]. Researchers in [58] propose automatic compiler-driven
techniques to modify application codes to prefetch memory pages from disk/peripheral I/O de-
vices. In their prefetching scheme, the compiler provides the crucial information on future access
patterns, the operating system provides a simple interface for prefetching, and the run-time ac-
celerates performance by adapting to runtime behavior. In a subsequent work [10], they show
that primitives to release memory pages that are no longer needed, when used judiciously in
conjunction with their prefetching schemes, improve the response times of concurrently run-
ning interactive tasks. However, static techniques like the above require that the source code be
available and analyzable. On the OS side, a lot of work has been done towards detection of file
access patterns automatically in the file system [35, 48] or parametric specification of file access
patterns supplied by the application [13, 69]. However, much of this work involves using explicit
I/O interfaces to stage data from peripheral devices.
On the compiler side, researchers have primarily looked at reordering computation to
improve data reuse and reduce I/O time [7] or inserting explicit I/O calls into array-based codes
[8, 67]. Typically, compilers are aided by annotated source code or programming-language ex-
tensions to indicate properties of important data structures. Among the techniques developed to
improve I/O performance of applications by predicting reuse distances and dead-page distances,
the closest work is by Mowry et al. in [10, 58]. However, the method that we propose is a pure
runtime technique and does not require any source-level modifications. On the hardware side,
researchers have proposed a dead-block prediction scheme in [49] that predicts when a cache
block becomes “dead” and hence evictable. While our work shares a common goal with theirs,
prediction of lifetimes of virtual memory pages is inherently a different problem.
3.3 Experimental Framework
We now describe the applications and the simulation platform that we use in this study.
3.3.1 Applications
To evaluate the effectiveness of our approach, we measured its impact on the performance
of a selected set of memory-intensive SPEC CPU 2000 workloads [38] and seven of the memory-
intensive sequential versions of the NAS Parallel benchmark (NPB) suite [2]. Note that the
benchmarks we used in the study are among the most memory-intensive applications of the SPEC
benchmark suite. There is no inherent difficulty in running other SPEC benchmarks, but most
of the other SPEC benchmarks are CPU intensive and their working set sizes are very small and
consequently do not stress the virtual memory subsystem at all. It is for the same reason that
we also show results for 7 of the 8 NAS benchmarks whose virtual memory footprints are fairly
large in addition to the SPEC applications.
All the C benchmarks were compiled with gcc version 3.2.2 at optimization level
-O3, and the Fortran 90 benchmarks were compiled with the Intel Fortran 90 compiler at the
same optimization level. A brief description of the benchmarks and the sizes of the data sets
that they access are shown in Table 3.1. Since different applications have different working
set sizes, and since we wanted to exercise the virtual memory capabilities of the system, we
configured the memory available differently for these applications. The specific values are given
in Table 3.1. Unless otherwise mentioned, the memory configuration that we simulated for the
characterization experiments was fixed at 300 MB for most of the NAS Parallel Benchmarks,
128 MB for LU, CG, and all SPEC 2000 workloads, with the exception of GZIP and MCF, for
which we fixed it at 64 MB.
3.3.2 Experimental Platform
We characterize the virtual memory behavior of these applications and the potential of
our proposed replacement algorithms in the context of an execution-driven x86 simulator. The
simulations were executed on a Linux-2.4.20 kernel on a dual 2.7 GHz Xeon workstation with a
total of 1 GB physical memory and a 36 GB SCSI disk. The execution-driven simulator that we
used in this study is valgrind [79], which is an extensible x86 memory debugger and emulator.
Valgrind is a framework that allows for custom skins/plugins to be written that can augment the
basic blocks of the program as it executes. The skins/plugins that we implemented augmented
the basic blocks to return control to the skin after every memory-referencing instruction with the
value of the memory address that was referenced. The skins maintain data structures necessary
for implementing the techniques that we will be describing shortly and for collecting the relevant
statistics. The page fault statistics for these applications that were used for comparison with the
kernel-implementable version of our scheme were obtained on a uniprocessor Xeon workstation
running the 2.4.20 Linux kernel.

Table 3.1. Description of applications: the Total Memory column indicates the total/maximum
memory used by the application, and the Simulated Memory column indicates the simulated
memory size used for the characterization.

Name      Description                               Input Data Set                         Total    Simulated
                                                                                           Memory   Memory
IS        Integer Bucket Sort                       2^25 21-bit integers                   384 MB   300 MB
CG        Conjugate Gradient Method to solve        75000x75000 sparse matrix with         399 MB   128 MB
          an unstructured sparse matrix             15825000 non-zeroes
FT        3-D Fast-Fourier Transform of             256x256x256 matrix                     584 MB   300 MB
          complex numbers
MG        3-D Multi-Grid Solver                     256x256x256 matrix                     436 MB   300 MB
SP        Diagonalized Approximate                  102x102x102 matrices                   323 MB   300 MB
          Factorization
BT        Block Approximate Factorization           5x5x65x65x65 matrices                  400 MB   300 MB
LU        Simulated CFD using SSOR techniques       5x102x102 matrices                     178 MB   128 MB
GZIP      GNU Compression Utility                   TIFF image, webserver log, binary,     192 MB   64 MB
                                                    random data, tarball
WUPWISE   Physics/Quantum Chromodynamics            NITER=75, KAPPA=2.4E-1                 176 MB   128 MB
SWIM      Shallow Water Modeling                    1335x1335 matrix, 200 iterations       190 MB   128 MB
MCF       Combinatorial Optimization                default input                          190 MB   64 MB
APSI      Meteorology                               112x112x112 matrix for 70 iterations   191 MB   128 MB
3.4 Characterization Results
We first take a look at the page-fault characteristics of these applications. As indicated
in the previous section, we would like to characterize an application’s execution based on the
fault parameters. In particular, we want to understand the reasons as to why LRU performs
poorly from the fault parameters perspective, i.e., we want to characterize those situations in
which LRU holds on to the page long after it has been “dead”. To illustrate this, we give the
cumulative distribution plot of the ratio D/(L+D) for the LRU replacement algorithm for all the
applications for a specific memory configuration. If the replacement algorithm is really doing a
good job, then it would replace a page as soon as it became dead, or in other words, all replaced
pages would have a small dead-time distance D. In order to normalize the notion of small, we
compute the ratio D/(L+D) and plot the cumulative distribution plot of this ratio. Based on the
above explanation, we can now interpret the cumulative distribution plot of this ratio as follows
– if the plot peaks early, then the algorithm does a good job of evicting pages as soon as they
become dead. On the other hand, if it peaks late, then the algorithm is not doing a good job.
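The metric itself is simple to compute; the following is a minimal sketch of ours, assuming (page, L, D, W) records gathered from some replacement simulation (the record format and function name are illustrative, not from the thesis tooling):

```python
def dead_ratio_cdf(tuples, points=range(0, 101, 10)):
    """Given (page, L, D, W) records from a replacement simulation, compute
    the CDF of the ratio D/(L+D), expressed in percent, over all replaced
    pages: for each percentage point p, the fraction of pages with ratio <= p."""
    ratios = sorted(100.0 * d / (l + d) for _, l, d, _ in tuples if l + d > 0)
    if not ratios:
        return []
    n = len(ratios)
    return [(p, sum(r <= p for r in ratios) / n) for p in points]
```

A plot that "peaks early" corresponds to this CDF reaching values near 1.0 at small percentage points.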
In this study, we have used the following units to measure the distances: the number of
memory references and the number of page-faults to other pages. The reason we plotted two sets
of graphs was because the former is a quantity that needs hardware support and can only be
approximated in practice, while the latter can be measured by the operating system. In Figures
3.2 (a) and 3.3 (a), we find that with the exception of BT, LU and WUPWISE, the replacement
algorithm is not evicting dead pages quickly since we find that only 15-25% of all replaced pages
have their D/(L+D) ratios less than 50% when distances are measured as number of memory
references. Similarly, if we observe Figures 3.2 (b) and 3.3 (b), we find that 40% of all replaced
pages have their D/(L+D) ratios less than 50% when the distances are measured as page-faults
to other pages. This serves as a motivation for designing a better replacement algorithm that
needs to be more pro-active in choosing candidates for eviction. Thus, it is clear that the LRU
replacement algorithm performs poorly from the fault parameters perspective. In fact, Figures 3.4
(a) and (b) demonstrate the dismal performance of the LRU algorithm by plotting the execution
time profile of these applications when run on a Linux 2.4.20 kernel (averaged over 5 runs).
We next study the predictability characteristics of the fault parameters before we describe the
proposed algorithm.
Fig. 3.2. Ratio of D/(L+D) measured for NPB as (a) References, (b) Faults. [CDF plots; curves
for IS, FT, CG, MG, SP, BT and LU.]
Fig. 3.3. Ratio of D/(L+D) measured for SPEC2000 as (a) References, (b) Faults. [CDF plots;
curves for GZIP, WUPWISE, SWIM, MCF and APSI.]

Fig. 3.4. Where does time go? (a) NPB, (b) SPEC2000. [Execution-time breakdown into User
Time, System Time, Cold Pagefault Time and Warm Pagefault Time.]

As stated earlier, a pro-active replacement algorithm should evict a page as soon as it
becomes “dead”. However, in order for the system to predict which pages will become dead the
soonest, it needs to predict the lifetime distances (L) of pages. Hence, we felt that characteriz-
ing the variability, and thus the predictability, of the lifetime distance (L) was necessary, and
might even provide insights into the design of the predictors for the replacement algorithm. All the
characterization experiments were conducted using exact LRU as the replacement algorithm.
The first characterization experiment plots the cumulative distribution of the absolute dif-
ferences between successive values of lifetime distances of a virtual memory page. It is expected
that if lifetime distances are similar, the successive differences would be close to zero and hence
a CDF plot of the frequencies of occurrence of the differences would be steep in the beginning.
Yet another metric that can be gleaned from such a plot is the number of different dominant
values for the distribution. In this experiment, we compute the frequency of occurrences for
differences up to a large value (set to 50000), and all the differences greater than that are treated
identically. Yet another property that we wished to look at was whether the magnitude of the
differences had any discernible characteristic that we could take advantage of.
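The quantity characterized in this experiment can be computed directly; a small sketch of ours (the function name is an assumption), clamping at the same cutoff used in the experiment:

```python
def successive_differences(lifetimes, cap=50000):
    """Absolute differences between successive lifetime distances observed
    for one page; differences above `cap` are clamped and treated
    identically, mirroring the characterization experiment."""
    return [min(abs(b - a), cap) for a, b in zip(lifetimes, lifetimes[1:])]
```

A steep initial CDF of these values indicates that successive lifetimes are similar and hence predictable.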
A cumulative distribution plot of the frequencies of occurrences of differences in life-
time distances (absolute difference) is shown in Figures 3.5 (a) and (b) for the NAS benchmarks,
when measured in terms of (a) total memory references, (b) total number of page-faults to other
virtual memory pages. Figures 3.5 (c) and (d) plot the same for the SPEC 2000 benchmarks.
Figure 3.5 (a) indicates that for two of the NAS applications (MG, FT), less than 50% of occur-
rences have successive differences less than 10, while all the remaining applications’ successive
lifetime differences, when measured as memory references, are fairly predictable. Similarly,
Figure 3.5 (b) indicates that with the exception of the same two applications, all the remaining
applications have a fairly predictable differences of lifetime distances when measured in terms
of page faults. On the other hand, Figure 3.5 (c) indicates that four of the five SPEC 2000 appli-
cations have less than 50% occurrences of differences of less than 10 when the lifetime distances
are measured in terms of memory references, and Figure 3.5 (d) indicates that two of the five
SPEC 2000 applications have less than 55% occurrences of differences of less than 10 when
lifetime distances are measured in terms of page-faults. However, a CDF plot dilutes temporal
information and hence cannot be used for determining whether predictability exists or not. Our
subsequent experiments indicate that there is sufficient predictability that can be exploited. Fig-
ures 3.6 (a), (b) and Figures 3.6 (c), (d) plot the same with the only difference being that instead
of the absolute value, the actual difference between the successive lifetime distances is used for
the NAS and SPEC benchmarks respectively. From both these plots, it is clear that for most
of the benchmarks, differences of successive lifetime distances are fairly symmetric on either
side of zero, and a majority of them lie within a bounded range, which we surmise is because
of the applications’ structured access patterns. Figures 3.7 (a), (b) and Figures 3.7 (c), (d) plot
the cumulative distribution of the absolute differences between successive (L+D) values of a
virtual memory page. Note that a realizable implementation of any replacement algorithm in
the operating system cannot rely on accurately tracking memory references to calculate lifetime
distances. Therefore, the operating system must rely upon page-faults and page replacement
events to approximate lifetime distances of pages, or in other words the operating system needs
to approximate the lifetime distance using the measured L+D distances. It can be observed that
the CDF plots in Figures 3.7 (a), (b) and Figures 3.7 (c), (d) largely mirror the distribution of the
lifetime distances’ differences, though the exact values tend to differ.
Fig. 3.5. Absolute differences between successive L distances measured as (a) NPB2.3 - Total
Memory References, (b) NPB2.3 - Faults to other pages, (c) SPEC2000 - Total Memory References,
(d) SPEC2000 - Faults to other pages. [CDF plots with logarithmic x-axes.]
Fig. 3.6. Differences between successive L distances measured as (a) NPB - Total Memory
References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total Memory References, (d)
SPEC2000 - Faults to other pages. [CDF plots over the range -100 to 100.]
Fig. 3.7. Absolute differences of successive (L+D) distances measured as (a) NPB - Total
Memory References, (b) NPB - Faults to other pages, (c) SPEC2000 - Total Memory References,
(d) SPEC2000 - Faults to other pages. [CDF plots with logarithmic x-axes.]
In the second set of experiments, we plot the cumulative distribution of the lifetime dis-
tance measured both in terms of the number of references and in terms of the number of page-
faults to other pages as shown in Figures 3.8 (a) and (b), respectively (for the same reasons as
stated before due to the inability of the operating system to measure distances in terms of num-
ber of memory references without hardware support). Although these graphs indicate that there
are no dominant values of lifetime distances and a seeming lack of predictability, it should be
kept in mind that a CDF plot dilutes and hides temporal information. For example, for a
highly predictable sequence 1, 2, 3, 1, 2, 3, and so on, the cumulative distribution plot would be a
45° straight line, since each unique value in the sequence occurs with equal frequency.
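This effect is easy to check mechanically; a tiny sketch (function name ours) that computes an empirical CDF over a sequence's values:

```python
from collections import Counter

def cdf(seq):
    """Empirical CDF of a sequence's values. For the periodic, highly
    predictable sequence 1,2,3,1,2,3,... every value is equally frequent,
    so the CDF rises in equal steps (a straight line), hiding the
    temporal regularity entirely."""
    counts = Counter(seq)
    n = len(seq)
    total, out = 0, []
    for value in sorted(counts):
        total += counts[value]
        out.append((value, total / n))
    return out
```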
Fig. 3.8. NPB: CDF of L distance measured as (a) References, (b) Faults.
Towards determining whether sufficient predictability exists in the sequence of lifetime
distances, we plot the variation of lifetime distances with time in our next set of experiments. It
is to be noted here that while such a characterization is relevant only for a fixed deterministic
replacement algorithm (e.g., exact LRU) and a particular memory configuration, it is important to
see if there are any patterns that can be exploited. In the interest of space, we show the time plots
only for a few applications. While there seems to be a certain degree of regularity and hence
predictability in the variation of L with time as seen in Figures 3.9 (a), (b) and 3.10 (a), (b), this
trend is not immediately evidenced in the cumulative distribution plots of L shown in Figures 3.8
(a) and (b). However, an implementation of a replacement algorithm using the lifetime distances
depends upon the predictability of L for a particular virtual memory page. Thus, as the next step,
we characterize the predictability of L for each virtual memory page.
Fig. 3.9. MG: Variation of L distance with time measured as (a) References, (b) Faults.
In order to capture this trend, we plot the coefficient of variance of L (the coefficient of
variance of a sequence is defined as the ratio of the standard deviation to the mean of the sequence)
for each virtual page. We use this metric instead of the standard deviation since the latter is
dependent upon the mean of the sequence and cannot be used as a straightforward yardstick
for comparison and/or predictability. In such a graph, points on the x-axis are the individual
virtual memory pages and the y-axis is the coefficient of variance of L computed for each page.

Fig. 3.10. SP: Variation of L distance with time measured as (a) References, (b) Faults.
Figures 3.11, 3.12 and 3.13 show the coefficient of variance plots for a few of the NPB and
SPEC 2000 applications. Since we are interested in finding out whether there is predictability
in the sequences of lifetime distances observed for each virtual memory page, we would like
the coefficient of variance of such a sequence to be as close to 0 as possible. If the coefficient
of variance is close to zero, it indicates that the lifetime distance sequence is almost constant
and hence we can easily predict the lifetimes of pages. However, a high coefficient of variance
does not necessarily indicate a lack of predictability. For instance, sequences
like those depicted in Figures 3.9 and 3.10 may have a high value of coefficient of variance but
nonetheless there is a predictable pattern that can be observed. It is also interesting to find that
there are applications for which the coefficient of variance for all pages is consistently close to
0, which indicates the underlying predictable nature of the page lifetimes (e.g., Figure 3.11(a)).
There are also applications, where we find that a certain subset of pages exhibit coefficient of
variance close to 0 and certain others for which it is very high (e.g., Figures 3.11(b), 3.12(a),
3.12(b) and 3.12(c)). There are also applications that do not exhibit either of the above (e.g.,
Figure 3.13(b)).
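The per-page metric used in these plots can be sketched as follows (using the population standard deviation, which is an implementation choice of ours rather than something the chapter specifies):

```python
import statistics

def coefficient_of_variance(lifetimes):
    """Coefficient of variance of one page's observed lifetime-distance
    sequence: standard deviation divided by the mean. Values near 0
    suggest the page's lifetime is nearly constant and easy to predict."""
    mean = statistics.fmean(lifetimes)
    return statistics.pstdev(lifetimes) / mean if mean else float("inf")
```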
Thus far, we have motivated the need for a better replacement algorithm and also looked
at the predictability characteristics of the fault parameters (lifetime distances) and observed that
there seems to be sufficient predictability for us to investigate better replacement algorithms that
will be elaborated upon in the next section.
Fig. 3.11. Coefficient of Variance of L for each page (a) IS, (b) MG, (c) SP.

Fig. 3.12. Coefficient of Variance of L for each page (a) FT, (b) BT, (c) LU.

Fig. 3.13. Coefficient of Variance of L distance for each page (a) WUPWISE, (b) MCF, (c) APSI.

3.5 Towards a Better Replacement Algorithm: Predictive Replacement

In the previous section, we observed that quite a few of the applications exhibited fairly
low coefficient of variance of lifetime distances, i.e., at the
individual page granularity there is sufficient regularity of lifetime distances that can be effec-
tively predicted. We also observed that there are also applications in which predictability of
lifetime distances for certain virtual memory pages is very low and this indicates the need for
some adaptive algorithm that dynamically enables or disables prediction based on the current
prediction accuracy. Based on the above observations, we now outline a novel page replace-
ment algorithm. In this approach, the system maintains an additional list that is “approximately”
sorted on system-predicted values of L, which we call the Dead-List (somewhat of a misnomer,
since it keeps track of when a page would become dead in the future rather than currently dead
pages!), in addition to the LRU list. Both the LRU and the Dead-Lists are lists of physical mem-
ory pages, and hence we would only need to store an extra pointer to traverse the Dead-List. At
the time of replacement of pages, the system checks if the head of the Dead-List has expired
(i.e., whether or not the system has decided/predicted that a page will not be accessed anymore
in this time-interval. The issue of how the system does this prediction is explained a little later
in this section.) and if so, it decides to replace that page. Note that since we keep the list ap-
proximately sorted on estimated lifetimes, we do not need to look through the whole list. If this
list is empty or if the page has not yet been estimated to have expired, then the system defaults
to the LRU replacement algorithm. We now present the steps for this predictive replacement
algorithm:
• When a page fault occurs on some page(X), we check if the Dead-List is empty, or if the
head of the Dead-List is not yet estimated to be dead.
• If the above condition is true, we initiate the normal LRU replacement algorithm, and
delete the page from the Dead-List as well if needed (it may not necessarily be the head
of the Dead-List).
• Otherwise, we dequeue the head of the Dead-List and choose that as a candidate for re-
placement. Note that we need to delete it from the LRU list as well.
• Once the candidate for replacement has been decided, we need to insert the currently
faulted page into the sorted Dead-List based on our estimated/predicted value of L for that
page.
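The steps above can be sketched as follows. This is a simplified model of ours, not the in-kernel implementation: the Dead-List is modeled as a dictionary scanned for the earliest predicted expiry (rather than the sorted structure discussed next), lifetimes are measured in page faults, and the predictor is a caller-supplied function; all names are illustrative.

```python
from collections import OrderedDict

class PredictiveReplacer:
    """LRU list plus a Dead-List of predicted page-death times. On a fault,
    an expired Dead-List head is evicted; otherwise we fall back to LRU."""

    def __init__(self, capacity, predict_lifetime):
        self.capacity = capacity
        self.predict = predict_lifetime   # page -> predicted L (in faults), or None
        self.lru = OrderedDict()          # resident pages, MRU at the end
        self.expiry = {}                  # page -> predicted death time
        self.now = 0                      # global page-fault counter

    def _dead_head(self):
        # head of the Dead-List: resident page with the earliest predicted expiry
        return min(self.expiry, key=self.expiry.get, default=None)

    def touch(self, page):
        """Reference a page; on a fault, return the evicted victim (or None)."""
        if page in self.lru:
            self.lru.move_to_end(page)
            return None                          # hit
        self.now += 1                            # page fault
        victim = None
        if len(self.lru) >= self.capacity:
            head = self._dead_head()
            if head is not None and self.expiry[head] <= self.now:
                victim = head                    # predicted-dead page, evict early
            else:
                victim = next(iter(self.lru))    # default to the LRU victim
            self.lru.pop(victim)
            self.expiry.pop(victim, None)        # remove from Dead-List too
        self.lru[page] = None
        life = self.predict(page)
        if life is not None:                     # insert faulted page by predicted L
            self.expiry[page] = self.now + life
        return victim
```

With capacity 2 and a predictor that gives page 'a' a lifetime of one fault, the sequence a, b, a, c evicts 'a' on the last fault even though 'b' is the least recently used page, illustrating the pro-active eviction.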
Keeping the Dead-List exactly sorted could be a source of significant overhead that increases
the page-fault service time; consequently, any reduction in page-faults might not translate into
a reduction in overall execution time. There are two possible solutions to
this problem. One possible alternative would involve using a min-heap, where the
root of the heap would hold the page that is estimated to have the least lifetime, and the other
alternative involves keeping the list approximately sorted. The latter scheme involves chaining
pages with similar values of estimated lifetime distances in a hash bucket. Note that this kind of
scheme also entails one extra pointer per physical memory page since a page cannot be in more
than one hash bucket. In the case of a heap-based implementation, the worst-case time complex-
ity of insertion of pages is O(log n), and O(1) for deletion, while the hash-chaining scheme’s
time complexity is O(1) for insertion (since pages are always inserted at the tail) and O(m) for
deletion of pages, where n is the number of physical memory pages and m is the number of hash
chains. Another possible drawback of the heap-based scheme is the need for locking the entire
heap during insertion that could prove to be expensive on multi-processor systems, whereas the
hash-chaining based scheme involves locking only the appropriate hash chain. In this study, we
have explored the hash-chaining based “approximately” sorted scheme for managing the Dead-
List with a fairly small value of m (currently set to 101). Another point to be noted in the above
description of the algorithm is the deliberate omission of the prediction mechanism, since that
is a parameter that we want to experiment with for good performance. Note that the optimal
algorithm would be a perfect predictor, and would replace a page as soon as it becomes dead.
Recall that lifetime distances can be measured accurately only with adequate hardware support, and are quite hard to measure practically in an operating system. Thus, we break
up our prediction schemes into two categories: estimation techniques with hardware support and
an operating-system-implementable estimation technique, which are explained in the next two
subsections.
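The hash-chaining scheme for the approximately sorted Dead-List can be sketched as follows; the bucket width and the bucketing function are our illustrative assumptions (the thesis only fixes the number of chains, m = 101):

```python
from collections import deque

M = 101  # number of hash chains (the value used in this study)

class ApproxDeadList:
    """Approximately sorted Dead-List: pages with similar estimated
    lifetimes are chained in one hash bucket, so insertion is O(1) at the
    tail of a chain and each page needs only one extra chain pointer."""
    def __init__(self, bucket_width=10):
        self.width = bucket_width
        self.chains = [deque() for _ in range(M)]

    def bucket(self, lifetime):
        # Pages with nearby lifetime estimates land in the same chain.
        return (lifetime // self.width) % M

    def insert(self, pid, lifetime):
        self.chains[self.bucket(lifetime)].append((pid, lifetime))  # O(1)

    def remove(self, pid, lifetime):
        chain = self.chains[self.bucket(lifetime)]  # lock only this chain
        for entry in chain:
            if entry[0] == pid:
                chain.remove(entry)
                return True
        return False
```

On a multiprocessor, only the chain touched by `remove` would need locking, which is the advantage over the heap-based alternative noted in the text.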
3.5.1 Estimation Techniques with Hardware Support
With appropriate hardware support, we can track and measure the lifetime distances of virtual memory pages, and we have experimented with simple prediction schemes that use these measurements. In all these experiments, the metric used to compare performance is the number of page-faults normalized to the base LRU scheme. The simulation framework that we have built upon Valgrind maintains two global counters: one that increments on every memory reference (G1) and another that increments on every page-fault (G2). In addition, each
virtual-memory page has a set of four counters associated with it, which record the following
information:
• Timestamp of the last access to that page (C1).
• Page-Fault Counter at the time of the last access to that page (C2).
• Timestamp of the previous page-fault to that page (C3).
• Page-Fault Counter at the time of the previous page-fault to that page (C4).
A possible concern that might arise in this regard is the storage and access efficiency costs for
these counters. A possible solution to this problem is to store these counters in the unused bits
of a page-table entry for a virtual memory page, which can also be subsequently cached in the
TLB after the first access to the page. On each access to a page (whether it be a hit or a miss)
the G1 counter is incremented, and the G2 counter is incremented only on a miss. On every
hit access to a page, counters C1 and C2 are updated to store the latest values of G1 and G2
respectively. On every page-fault/miss, counters C3 and C4 are updated to store the latest values
of G1 and G2 respectively. At the time of a page-fault (miss), the system can now measure L
both in terms of the memory references (L=C1-C3) and in terms of number of page-faults to
other pages (L=C2-C4). In all schemes described below, the system uses the measured value
of the lifetime distance to predict the next lifetime distance for the page. Once the prediction
is done, we insert the currently faulted page into the appropriate bucket of the Dead-List based
on this estimated value. For the remainder of the discussion, we only consider the lifetime
distances measured in terms of the number of page-faults to other pages. The characterization
experiments in Section 3.4 indicated the high predictability of Lifetime Distances. In particular,
Figures 3.5 (b), (d) and Figures 3.6 (b), (d) show that a majority of the differences between
successive lifetime distances of a virtual memory page is within a bounded range (around 10) ,
thus indicating that simple variants of a Last-Value predictor would be sufficient in estimating
lifetime distances fairly accurately.
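The counter bookkeeping described above, and the measurement of L at fault time, can be sketched as follows (the counters G1, G2 and C1–C4 are from the text; the class structure is ours):

```python
class PageCounters:
    def __init__(self):
        self.c1 = self.c2 = self.c3 = self.c4 = 0

class LifetimeTracker:
    """Global counters G1 (memory references) and G2 (page-faults), plus
    per-page counters C1-C4, updated as described in the text."""
    def __init__(self):
        self.g1 = 0       # increments on every memory reference
        self.g2 = 0       # increments only on a miss (page-fault)
        self.pages = {}

    def access(self, pid, miss):
        """Record one reference; on a miss, return the measured lifetime
        distance L, both in references (C1-C3) and in faults (C2-C4)."""
        self.g1 += 1
        page = self.pages.setdefault(pid, PageCounters())
        lifetime = None
        if miss:
            self.g2 += 1
            lifetime = (page.c1 - page.c3,   # L in memory references
                        page.c2 - page.c4)   # L in page-faults to other pages
            page.c3, page.c4 = self.g1, self.g2   # record this fault
        else:
            page.c1, page.c2 = self.g1, self.g2   # record this hit
        return lifetime
```

In hardware, C1–C4 could live in the unused page-table-entry bits as the text suggests; here they are ordinary fields for illustration.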
• Static variant of Last Value Prediction (Last Static k): In this scheme, if a page's L value was measured to be Li at the time of a page-fault, we predict the next lifetime of the page as Li+1 = Li + k, where k is a static constant. For the following
set of experiments, we have experimented with 5 different k values, namely -10, -5, 0, 5,
and 10. These values were selected based on the observations made in Figures 3.6 (b) and
(d). Note that a value of 0 for this constant is exactly equivalent to a Last-value predictor,
while a positive value of this constant implies a conservative approach towards estimating
lifetimes.
• Adaptive Variant of Last Value Prediction (Last-Dynamic): This scheme tries to overcome the previous technique's limitation by reducing the number of predictions made, based on the observed accuracy of past predictions. In this technique, we associate a state with
each virtual memory page that can assume three values: namely Sinit, Stransient and
Spred. The key idea in this algorithm is to disable prediction unless the state associated
with the virtual memory page is equal to Spred. The algorithm uses a simple three-state
machine and works as follows on a page fault:
– If the state associated with the currently faulted page is Sinit, and if the difference
(Ldiff ) between the last two observed lifetime distances is less than or equal to a
threshold (LVthresh), it sets the state for the currently faulted page to Stransient.
– If the state associated with the currently faulted page is Stransient, and if the differ-
ence (Ldiff ) between the last two observed lifetime distances is less than or equal
to LVthresh, it sets the state for the page to Spred, and estimates the lifetime of this
page as, Li+1 = Li + Ldiff . Otherwise, if the difference (Ldiff ) is greater than
LVthresh, it sets the state of the page to Sinit, and disables prediction.
– If the state associated with the currently faulted page is Spred, and if the difference
(Ldiff ) between the last two observed lifetime distances is less than or equal to
LVthresh, it estimates the lifetime of this page as, Li+1 = Li + Ldiff . Otherwise,
if the difference (Ldiff ) is greater than LVthresh, it sets the state of the page to
Stransient and disables prediction.
Associating multiple states allows for disabling prediction during temporary bursts and/or
sequences where predictability is poor. A critical parameter in the above algorithm is the
value of the threshold (LVthresh), since it determines how aggressive or conservative a
scheme is. Choosing a small value for this threshold will allow for very few predictions
(conservative), and choosing a large value could potentially allow more prediction-based
replacements (aggressive). We find that different applications have different ranges of
thresholds over which good performance is achieved, as shown in the next section. However, determining the best thresholds for a particular application and memory configuration, and relating them to application characteristics, is beyond the scope of this study and is the focus of our future efforts.
• EELRU: In [82], the authors propose an adaptive replacement algorithm that uses a simple online cost-benefit analysis to guide its replacement decisions; it is considered one of the state-of-the-art algorithms for addressing the performance shortcomings of LRU. Hence, we have also compared the performance of our schemes with EELRU in the
subsequent evaluations.
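The Last Static and Last-Dynamic predictors described above can be sketched as follows. This is a minimal model: the class structure is ours, and we assume Ldiff is the absolute difference between the last two observed lifetime distances (the text says only "difference").

```python
S_INIT, S_TRANSIENT, S_PRED = "init", "transient", "pred"

def last_static(l_last, k):
    """Last Static k: predict L_{i+1} = L_i + k for a static constant k."""
    return l_last + k

class LastDynamic:
    """Three-state adaptive variant: predictions are enabled only once the
    page's lifetime distances have been stable for two consecutive faults."""
    def __init__(self, lv_thresh):
        self.thresh = lv_thresh   # LVthresh from the text
        self.state = S_INIT
        self.prev = None          # previously observed lifetime distance

    def on_fault(self, l_obs):
        """Feed an observed lifetime distance; return a prediction or None."""
        if self.prev is None:
            self.prev = l_obs
            return None
        l_diff = abs(l_obs - self.prev)
        self.prev = l_obs
        if self.state == S_INIT:
            if l_diff <= self.thresh:
                self.state = S_TRANSIENT
            return None
        if self.state == S_TRANSIENT:
            if l_diff <= self.thresh:
                self.state = S_PRED
                return l_obs + l_diff       # L_{i+1} = L_i + Ldiff
            self.state = S_INIT
            return None
        # state == S_PRED
        if l_diff <= self.thresh:
            return l_obs + l_diff
        self.state = S_TRANSIENT            # disable prediction for now
        return None
```

A `None` return models "prediction disabled": the page would then be handled by the default LRU path rather than inserted into the Dead-List.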
3.5.2 OS-Implementable Estimation Technique
In this scheme, the OS needs to keep a counter (G1) that keeps track of the number of
page-faults that have been incurred by the application. On each page-fault, this counter (G1)
needs to be incremented. In addition, we need to associate a counter for each virtual page (C1),
(likewise, this can be stored in the unused bits of the page-table entry after suitable encoding)
which is updated whenever a page-fault occurs on that page, i.e., we set C1 to the latest value of
G1 at the time of a page-fault to a page. At the time of subsequent page-faults to the same page,
we can now estimate L+D as G1-C1. The in-kernel scheme that we propose uses this value to
estimate lifetime distance of the page, which we denote as DP-Approx.
DP-Approx is a novel replacement algorithm that uses exponential averaging to estimate
the lifetime distance. As mentioned earlier, since the operating system does not get control over
individual memory references, any in-kernel approximation of the replacement algorithm needs
to make use of OS-visible events like page-faults and replacements. Therefore, in this technique
we start out by estimating the lifetime distance as L+D, and we subsequently use exponential
averaging to predict the next lifetime distance as L_pred^(t+1) = a * (L+D)_measured^t + (1 - a) * L_pred^t. Unless otherwise stated, we fix the value
of the parameter “a” as 0.5, which means that we give equal weights to the current measurement
and previously estimated lifetimes. One may object that reliance on the parameter "a" reduces the chance of a successful implementation, but we believe it is possible to build a more sophisticated scheme in which the value of "a" is determined dynamically. Our intent in this study is to demonstrate a proof-of-concept strategy that can be practically realized without much overhead; determining the best "a" value automatically is itself an interesting research topic that we wish to address in the future.
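DP-Approx's estimation step can be sketched as follows; G1 and the per-page counter C1 are as defined above, while the class and method names are our illustrative choices:

```python
class DPApprox:
    """Exponentially average observed (L+D) distances to estimate the next
    lifetime distance, using only OS-visible events (page-faults)."""
    def __init__(self, a=0.5):
        self.a = a         # exponential averaging factor
        self.g1 = 0        # global page-fault counter (G1)
        self.c1 = {}       # per-page: value of G1 at the page's last fault (C1)
        self.pred = {}     # per-page: current lifetime-distance estimate

    def on_page_fault(self, pid):
        """Called on every fault; returns the updated estimate, or None on
        the page's first fault (no measurement exists yet)."""
        self.g1 += 1
        if pid in self.c1:
            measured = self.g1 - self.c1[pid]           # observed (L+D)
            old = self.pred.get(pid, measured)          # seed with first sample
            self.pred[pid] = self.a * measured + (1 - self.a) * old
        self.c1[pid] = self.g1
        return self.pred.get(pid)
```

With a = 0.5 the estimate weights the latest (L+D) measurement and the previous estimate equally, as stated in the text.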
3.5.3 Results with Predictive Replacement Techniques
Figures 3.14 and 3.15 plot the normalized page-faults for the static and adaptive variants
of the last-value-prediction-based replacement algorithms for the applications, compared with
the base LRU replacement algorithm. The results for the Last static schemes shown in Figures
3.14 and 3.15 have the static parameter set to the following values (-10, -5, 0, +5 and +10). The
results for the best performing adaptive scheme (Last dynamic at a fixed value of the thresh-
old) are also shown in Figures 3.14 and 3.15. The threshold value at which each application performed best is summarized in Table 3.2. From Table 3.2,
we observe that applications for which the Last static schemes performed well require a higher
threshold for the adaptive schemes to show benefits. This can be attributed to the fact that a
higher threshold value lends itself to an aggressive algorithm that predicts more often, which in
turn is good for such applications as demonstrated by the good performance of the static algo-
rithms. Analogously, applications that perform poorly with the Last static algorithm due to a
high number of potentially incorrect predictions require a low value of threshold that lends itself
to a conservative algorithm that predicts less often. An interesting area of research that has not
been addressed here is the design of a self-tuning, adaptive algorithm that adjusts the thresh-
olds dynamically without manual assignment, which is the focus of our future efforts. As we
pointed out earlier, any replacement algorithm that predicts lifetime distances incorrectly may
worsen the performance of the application, since it may cause more page-faults by replacing a
page ahead of when it actually became dead. From Figures 3.14 and 3.15, we can observe that
the static variants of the predictive algorithm can significantly out-perform LRU in six of the
twelve applications (IS, CG, BT, WUPWISE, MCF, SWIM), but can also degrade the perfor-
mance sometimes quite significantly in the remaining six applications. It is also clear that the
adaptive algorithm out-performs LRU in all the applications, thus indicating that the adaptation
ensures that the performance never gets degraded badly. It must also be noted that the performance of the adaptive schemes is better than that of their static counterparts in ten of the twelve applications considered and nearly as good in the remaining two (IS, MCF), which also indicates that the dynamic scheme adapts itself better and resorts to prediction judiciously. Figure 3.16 plots the normalized invocation counts of the predictive replacement algorithm and the LRU replacement algorithm, which is essentially the number of times a particular
algorithm was invoked for replacement. In this graph, the five bars for each application denote
the five different schemes for which we have shown results thus far (i.e., four static variants,
one adaptive scheme). This graph serves to reinforce the fact that the adaptive scheme resorts to
using prediction-based replacement as sparingly as possible without worsening the performance.
In summary, we find that, for all the applications considered, an adaptive prediction-based al-
gorithm always performs better than LRU in terms of reductions in the number of page-faults,
sometimes by as much as 78%.
3.5.4 Comparison with EELRU
The authors in [82] proposed an adaptive replacement algorithm (EELRU) that uses the same kind of recency information as LRU; using a simple online cost-benefit analysis, they demonstrate that their algorithm outperforms LRU in the context of virtual memory systems. The basic
intuition behind the EELRU technique is that, when the system notices that a large number of
[Figure omitted: bar chart of normalized page faults for GZIP, WUP, SWI, APSI and MCF under Last-static(-10), Last-static(-5), Last-static(0), Last-static(5), Last-static(10) and Last-dynamic(best).]
Fig. 3.14. Normalized page-fault counts of the replacement algorithm for SPEC 2000 with respect to the perfect LRU scheme.
[Figure omitted: bar chart of normalized page faults for IS, FT, CG, MG, SP, BT and LU under Last-static(-10), Last-static(-5), Last-static(0), Last-static(5), Last-static(10) and Last-dynamic(best).]
Fig. 3.15. Normalized page-fault counts of the replacement algorithm for NPB 2.3 with respect to the perfect LRU scheme.
Table 3.2. Threshold values of applications.

NPB 2.3 Application   Threshold     SPEC Application   Threshold
IS                    8000          GZIP               30
FT                    15            WUP                4000
CG                    800           SWIM               4000
MG                    30            MCF                10000
SP                    60            APSI               60
BT                    45
LU                    60
[Figure omitted: two bar charts of normalized invocation counts, (a) over GZIP, WUP, SWIM, APSI, MCF and (b) over IS, FT, CG, MG, SP, BT, LU; bars: DP-invocations, LRU-invocations.]
Fig. 3.16. Normalized invocation counts of the replacement algorithm for (a) SPEC 2000, (b) NPB 2.3.
pages are being touched in a roughly cyclic pattern that is larger than main memory, it diverges from LRU, since such a pattern is known to be a worst-case scenario for LRU. In order to detect such situations, the system needs to track the number of pages that were touched since a
page was last touched, which is exactly the same kind of information that LRU maintains, but
EELRU maintains it for both resident and non-resident pages. Once such a situation is detected,
they apply a fall-back replacement algorithm that evicts either the least-recently used page or
a page from a pre-determined recency position. In the context of virtual memory systems, the
EELRU approach is considered one of the state-of-the-art approaches to improving the
performance of LRU, and hence, we wished to compare the performance of our schemes with
EELRU. Note that EELRU is also a simulation-based approach, and no practical approximation
of it has been demonstrated thus far in an operating system. Hence, the comparison is done
only with the hardware-based simulation techniques that we have proposed thus far (see Section
3.5.1).
The first two bars for each application in Figures 3.17 and 3.18 show the normalized page fault counts for the best-performing prediction-based replacement algorithm and EELRU with respect to perfect LRU. Therefore, the taller the second bar, the better the prediction-based algorithm performs compared to EELRU. From Figures 3.17 and 3.18, it is clear
that both the prediction-based and EELRU replacement algorithms out-perform LRU for all the
applications. Further, we also notice that the best-performing predictive replacement schemes
outperform EELRU in nine of the twelve applications that we tested against, namely FT
(17.49%), CG (3.84%), MG (16.19%), SP (75.56%), BT (78.18%), LU (50.43%), WUPWISE
(35.36%), SWIM (38.03%) and APSI (33.78%), where the percentages in parentheses indicate
the reduction in number of page-faults compared to the EELRU scheme. With the exception of
GZIP, where EELRU performs dramatically better than any of our prediction schemes, we find
that on the average, our scheme generates around 15% lower page-faults than EELRU over all
the applications (around 26% lower page-faults than EELRU over all applications except GZIP).
It must be remembered that the predictive algorithm needs sufficient history to start prediction,
and in two of the three applications (IS, GZIP) where EELRU performs better, quite a few pages
are accessed exactly once, which does not allow the prediction-based replacement algorithm to
start replacing such pages proactively.
[Figure omitted: bar chart of the ratio of page faults to LRU for GZIP, WUP, SWI, APSI and MCF; bars: Best-Last-value-predictor, EELRU.]
Fig. 3.17. Comparison of the best prediction-based replacement algorithm with EELRU for SPEC 2000 using the ratio of page-faults in comparison to the perfect LRU scheme.
[Figure omitted: bar chart of the ratio of page faults to LRU for IS, FT, CG, MG, SP, BT and LU; bars: Best-Last-value-predictor, EELRU.]
Fig. 3.18. Comparison of the best prediction-based replacement algorithm with EELRU for NPB 2.3 using the ratio of page-faults in comparison to the perfect LRU scheme.
3.5.5 Performance of DP-Approx
As discussed in Section 3.5.2, DP-Approx is an in-kernel, practical, realizable implemen-
tation of the prediction-based replacement techniques. The kernel uses exponential averaging of
the measured (L+D) distances to estimate the lifetime distances of virtual memory pages. For a
fair evaluation of this scheme, we show the reduction in the number of page-faults over the actual
number of page-faults reported by the Linux kernel when running the application natively. We
augmented the Linux 2.4.20 kernel with a new system call (getrusage2), along the lines of an
existing system call (getrusage) that returns the number of cold and warm misses/faults. The
experimental data for this study was collected on a uniprocessor Xeon-based machine running the modified Linux 2.4.20 kernel, which was instructed to use a specified amount of main memory through a command-line option in the boot-loader; each data point is the average over five runs of the application. Figures 3.19 (a) and (b) plot the normalized reduction in the number of page
faults using the DP-Approx technique in comparison to the Linux 2.4.20 kernel’s page-faults.
It is clear from the figures that the DP-Approx technique outperforms the kernel’s replacement
algorithm in all but one application (APSI) by reducing page-faults by as much as 56%. We
find that on the average, the DP-Approx scheme gives around 14% lower page-faults than the
kernel’s replacement algorithm over all the applications (18% lower page-faults than the kernel’s
replacement scheme over all applications except APSI).
[Figure omitted: two bar charts of normalized page-fault reduction, (a) over GZIP, WUP, SWI, APSI, MCF and (b) over IS, FT, MG, SP, BT, LU.]
Fig. 3.19. Normalized page-fault reduction of DP-Approx algorithm in comparison to Linux kernel 2.4.20 execution for (a) SPEC 2000, (b) NPB 2.3.
3.5.6 Sensitivity Analysis
While the techniques that we have proposed thus far yielded significant reductions in
page-faults compared to LRU, it must be remembered that most of these schemes (with the
exception of DP-Approx) cannot translate to an in-kernel implementation without hardware sup-
port, due to the inability of the operating system to keep track of individual memory references.
In the DP-Approx technique, we approximate the lifetime distances by exponentially averaging
observed (L+D) distances in the operating system. Recall from the earlier discussion that this
technique depended on the exponential averaging factor “a”, and in the next set of experiments,
we wanted to study the impact on the performance by varying this parameter. In Figures 3.20 (a)
and 3.21 (a), we plot the normalized page-faults incurred by applications with the DP-Approx
replacement technique over that of LRU. Although such a comparison is not really fair, since one
is a scheme that can be implemented, and the other can at best be approximated, we wanted to
see if any trends can be observed that can help in designing a self-tuning online algorithm. We
observe that in most of the applications the value of “a” is critical for good performance. In Fig-
ures 3.20 (b) and 3.21 (b), we plot the prediction accuracy of the exponential averaging scheme
for the different values of "a", and find that in almost all cases we ended up under-predicting lifetime distances, which helps explain why the performance of this replacement algorithm is not much better than LRU: under-prediction increases the page-faults that the application incurs. Note that a large number of over-predictions may also not perform better than LRU, since we may not be aggressive enough
MG, MCF and WUPWISE where this scheme performs better than LRU, the percentage of over-
predictions is higher than the rest. This indicates that any approximation scheme needs to track
the lifetime distances quite accurately for good performance.
[Figure omitted: (a) bar chart of normalized page faults over GZI, SWI, MCF, WUP, APS for a = 0.25, 0.50, 0.75; (b) stacked prediction-accuracy bars: under-predicted, exact-predicted, over-predicted.]
Fig. 3.20. SPEC 2000 (a) Sensitivity of DP-Approx to parameter "a", (b) Prediction accuracy.
[Figure omitted: (a) bar chart of normalized page faults over IS, MG, SP, BT, FT, LU for a = 0.25, 0.50, 0.75; (b) stacked prediction-accuracy bars: under-predicted, exact-predicted, over-predicted.]
Fig. 3.21. NPB 2.3 (a) Sensitivity of DP-Approx to parameter "a", (b) Prediction accuracy.
3.6 Conclusions and Future Work
In this chapter, we have presented a novel technique for tracking an application's virtual memory access pattern in the operating system for proactive memory management, replacing virtual memory pages as soon as they become dead. The contributions of this work can be
summarized as follows:
• Demonstrating the sub-optimal performance of LRU-like replacement algorithms on scientific applications' access patterns, from the perspective of application characteristics and fault parameters, and showing that LRU-like replacement algorithms hold onto virtual memory pages long after they are "dead".
• Characterizing the predictability of the fault parameters from an application’s perspective.
• Using these parameters in conjunction with simple predictors (variants of Last-value pre-
dictors) to design a novel set of replacement algorithms.
• Evaluating the performance of these replacement algorithms on a set of 12 different memory-intensive applications drawn from the NAS and SPEC 2000 application suites, and concluding that a prediction-based replacement algorithm can significantly out-perform LRU by yielding as much as a 78% reduction in the number of page-faults. On average, a prediction-based replacement scheme yields around a 48% reduction in page-faults in comparison to LRU.
• Evaluating and comparing the performance of our techniques with EELRU, which is considered one of the state-of-the-art algorithms for improving the performance of LRU in the context of virtual memory systems, and demonstrating that our predictive replacement schemes can reduce the number of page-faults over EELRU in 9 of the 12 memory-intensive applications with which we experimented, by as much as 78%. On average, the predictive replacement schemes yield around 15% fewer page-faults than EELRU.
• Designing and implementing a novel in-kernel approximation algorithm that can estimate lifetime distances using just the parameters that an operating system can measure. On average, this algorithm yields around 14% fewer page-faults than the Linux 2.4.20 kernel's replacement algorithm, and as much as a 56% reduction in the number of page-faults. This can serve as a much better alternative to the approximate-LRU or not-recently-used replacement algorithms typically implemented in the kernel (note that schemes such as LRU or EELRU are not implementable in practice).
We do realize that many of the proposed schemes depend upon parameters that need to be tuned
for good performance. For instance, the adaptive variant of the Last-Value predictor based re-
placement algorithm relies on the threshold value, and DP-Approx relies upon the exponential
averaging constant for good performance. Consequently, an interesting avenue of research that we have not addressed in this thesis is relating application characteristics to parameter auto-selection, or a self-tuning algorithm that auto-tunes the parameters for good performance. We
also have not explored the effects of multi-programming and OS implementation issues (like
memory space needed for prediction), that are the focus of our future research agenda. Another
interesting aspect that needs investigation is a prediction-based prefetching mechanism similar
to the replacement techniques proposed here. We also plan to design and implement the optimal
predictor-based replacement algorithm and compare its performance with our prediction-based
replacement algorithm to get an idea of how much better we could perform. In the future, we
also plan to incorporate compile-time information on future access patterns that can be used in
conjunction with the run-time based prediction schemes that we have proposed.
Chapter 4
Synergistic Scheduling
4.1 Introduction
Scheduling policies implemented by today’s operating systems cause memory intensive
applications to exhibit poor performance and throughput when all the applications’ working sets
do not fit in main memory. A primary cause for this is that scheduling algorithms do not take
memory size considerations into account. Multi-programming has always been a thorn in the
development of efficient programs for non-dedicated computational platforms such as academic
or production servers. Sharing of resources such as processor, memory and I/O devices make
it hard for programmers to make assumptions on the availability of resources. Programmers typically write programs under the assumption that the entire system's resources are at their disposal, which may not hold true on non-dedicated computational clusters. The performance
penalty of memory pressure can be severe, because the operating system might be forced to page
data to and from the hard disk. Paging has a very large cost compared to the cost of accessing
memory and the slowdown experienced by the job that is being paged during its execution can
be quite unpredictable and/or large.
While there have been several attempts [52, 54, 56, 88] in the past at efficient processor allocation in multi-programming scenarios, studying the impact of the memory requirements
of jobs on scheduling policies has received far less attention. As [63] rightly states, much of
the previous work in this area has considered the problem of memory pressure as a problem
of admission control and devised scheduling strategies that permit or forbid execution of jobs
based on user-estimated memory requirements or a snapshot of memory occupancy at the time
of job submission. Both these criteria can prove to be extremely inaccurate, since memory re-
quirements of tasks can be quite unpredictable. Further, any under/over estimation of memory
requirements can lead to terrible performance and/or severe under-utilization of resources. In [63], the authors propose application-level modifications coupled with operating-system-level extensions as a solution to this problem. However, practical deployment considerations have led us to believe that any solution to this problem has to be handled entirely within the operating system or a runtime layer, rather than require application-level changes.
An intuitive and accurate model of the memory requirements of a process was formulated
by Denning in his seminal work on working set theory [26]. His proposal for a working-set based model has been the theoretical basis of many subsequent approaches. The working
set of pages associated with a process is defined to be the collection of its most recently used
pages, and provides knowledge vital to the dynamic management of paged memories. More formally, the working set W(t, τ) of a process at time t is defined to be the collection of pages referenced by the process during the process-time interval (t − τ, t). The parameter τ is defined to be the working-set parameter. Further, the model defines the working set size ω(t, τ)
as the number of pages/elements in W(t, τ). Determining the working set size of every process enables an efficient implementation of a virtual-memory-aware scheduler, since the scheduling decision essentially reduces to the well-understood 0-1 knapsack (bin-packing) problem, where the job of the scheduler is to determine a subset of processes whose overall working-set size fits within the available physical memory. However, it has been a well-acknowledged problem that
determining the working set size of a program in execution is hard and consequently approxima-
tions in software as well as hardware modifications to track working sets abound in the literature
[74, 85, 97].
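Denning's definitions translate directly into code given a reference trace, and a simple greedy packing illustrates the scheduler's selection problem. This is our illustrative sketch, not the thesis's scheduler: the greedy heuristic is one suboptimal way to approach the 0-1 knapsack problem, and the half-open interval convention is our assumption.

```python
def working_set(trace, t, tau):
    """W(t, tau): pages referenced during the process-time interval
    (t - tau, t] (we include t itself, a common convention)."""
    return {page for time, page in trace if t - tau < time <= t}

def working_set_size(trace, t, tau):
    """omega(t, tau): the number of pages in W(t, tau)."""
    return len(working_set(trace, t, tau))

def pack_processes(ws_sizes, mem_pages):
    """Greedy 0-1 packing: pick a subset of processes whose combined
    working-set sizes fit in physical memory (a heuristic; the exact
    problem is the NP-hard knapsack)."""
    chosen, used = [], 0
    for pid, size in sorted(ws_sizes.items(), key=lambda kv: kv[1]):
        if used + size <= mem_pages:
            chosen.append(pid)
            used += size
    return chosen
```

The packing step is where the hard part lies in practice: `ws_sizes` must come from an online working-set-size estimate, which is exactly what the literature cited above attempts to approximate.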
In this chapter, we are concerned with the detrimental effect that paging inflicts on the
performance of jobs running on multi-programmed machines. More specifically, we consider
the problem of paging in the case of multi-programmed systems where processes/jobs can be
spawned by multiple users in an uncoordinated fashion and without any a-priori knowledge of
the required resources such as CPU and memory. In such a scenario, we envisage that the operating
system detects when the system is not making any forward progress at any time (i.e. when the
system is thrashing) and takes appropriate actions on the jobs that are running. In this context, we
denote that the system is not making any forward progress if it incurs too many page-faults (and
therefore context-switches) which in turn results in low overall CPU utilization. The dynamics
of the workloads on such servers provides several opportunities for optimization that may not be
available in the case of strict admission control that ends up severely under-utilizing resources
due to inaccurate estimations.
We present a simple scheduling strategy called Synergistic Scheduling that attempts to
prevent paging. An important design goal of Synergistic Scheduling was to use an unmodified
scheduler core and use an external, loadable kernel module and/or an external daemon for flex-
ibility of deployment. The basic idea behind the Synergistic Scheduling framework is that the
external module and/or daemon gathers information such as CPU utilization, system-wide page faults, and the memory residency and page-fault behavior of individual tasks, and uses it to manipulate the static priorities of the tasks, so that programs that would end up paging are not scheduled as often as those that don't stress the virtual memory system. This in turn also prevents the system from thrashing due to application programs stepping on each other's working
sets. A similar approach (priority-boost) was advocated in [59], although the intended goal was
for efficient co-scheduling of communicating parallel processes of a job. While authors in [59]
consider communicating processes of a parallel job as candidates for co-scheduling, our work
considers the set of processes whose working set sizes fit in memory as candidates for being
scheduled simultaneously on the same machine (co-scheduled on the same node). Further, this
solution requires no application modification unlike [63, 62]. Moreover, we anticipate that for
such a scheme to be relevant in the context of grid-computing, where loosely coupled distributed
programs run on non-dedicated computational nodes with fluctuating CPU and memory loads,
intrusive changes to the core scheduler may not be appropriate. Consequently, the Synergis-
tic Scheduler’s design as an external, dynamically loadable kernel module and/or an external
daemon atop an un-modified process scheduler fits the bill.
The rest of this chapter is organized as follows. Section 4.2 discusses some related work
and contrasts it with our approach. Section 4.3 gives details on the experimental setup and the
applications that were used for evaluation. Section 4.4 outlines the scheduling strategy, and
Section 4.5 presents the results of our experimental evaluation. Finally, Section 4.6 concludes
with pointers to future work.
4.2 Related Work
Much of the work on scheduling policies for multi-programmed multi-processors and parallel machines has focused on how many processors to allocate to each runnable application, without considering the memory requirements of these jobs [52, 54, 56, 88]. There have
been a few attempts that studied the impact of the memory requirements of jobs on scheduling policies. Work in this area can be split into two major categories depending upon the intended target environment for which the algorithms were designed, namely distributed-memory parallel machines and shared-memory parallel machines (symmetric multiprocessors). There have also been several theoretical studies in the quest toward designing a virtual-memory-aware job/process scheduler. In particular, Denning's working-set model [26] has provided the theoretical underpinning of many approaches. However, practical schemes for determining the working set of a task have been acknowledged to be hard to realize, and approximations in software as well as hardware modifications to track memory working sets abound in the literature [74, 85, 97].
The authors of [55, 77, 72] study the trade-offs between processor and memory allocation in distributed parallel processing systems and try to design efficient scheduling strategies with minimal overheads in a multi-programmed scenario. However, all these studies assume that jobs have a minimum memory requirement that can be stated a-priori. Even if the minimum memory requirements were known a-priori, over-provisioning memory to applications is still possible, since the working set [26] of an application varies with time, as our experimental results in subsequent sections illustrate. The authors of [72] model the paging behavior of parallel jobs when operating with less memory than required. They apply this model to a real parallel job running on a parallel message-passing machine and study how the performance changes as a function of processor allocation. However, they do not consider the problem of job-scheduling per se, although their model allows for varying processor allocation.
The authors of [1, 11, 55, 63, 68, 78, 83] study this problem in the context of shared-memory parallel machines (symmetric multi-processors, simultaneous multi-threaded processors). In the context of an Intel Paragon system, McCann et al. examine how minimum processor-allocation requirements due to memory influence job scheduling [55]. They suggest ways in which processors can be allocated so that each job receives exactly the same share of processing time. The metric by which they evaluate their scheduling algorithms is the mean CPU utilization, where lower values are assumed to lead to higher processor-efficiency values. They do not take into account jobs leaving the system, and require significant computation just to ensure that each job receives the same processing time in each scheduling cycle. They also do not account for the constructive or destructive interference that processes' memory access patterns have on each other when scheduled simultaneously. Finally, researchers from Tera Computer Company describe the scheduling algorithm of the Tera MTA [1] multi-processor, in which they take into consideration the overhead of swapping jobs in and out of memory and present an optimal algorithm for minimizing paging. However, this work suffers from the drawback that users have to associate a space-time overhead with each job, which the scheduler uses to guide its memory-allocation policies, and which may not always be easy for the programmer to supply.
Our work shares the design philosophy of all the above techniques, namely that it is the job of the operating system scheduler to take virtual-memory sizes into consideration; it differs in the means used to attain this objective. Our design relies on an external kernel module or an external daemon atop an unmodified process scheduler/kernel, which makes it relevant in Grid-computing-like environments, where loosely coupled distributed programs run on non-dedicated computational farms.
More recently, researchers in [62, 63] advocate an adaptive strategy that tries to cope with the adverse effects of paging on multi-programmed multi-processors by rewriting programs, using a combination of a user-level runtime system and appropriate kernel support, so that scheduling actions are taken automatically upon detecting memory pressure. When programs detect memory pressure, they either suspend themselves or release any unneeded memory back to the operating system. This is in direct contrast to prior approaches, in which the system imposes scheduling constraints and does not expect programmers to rewrite their codes. Our work is similar to the above approach in that we wish to identify memory-pressure situations and take actions accordingly. We differ in that we do not require application programs to be rewritten to take advantage of this mechanism. Instead, the operating system chooses actions automatically on detecting memory pressure. Further, the actions chosen by the operating system are simpler, since they involve adjusting the priorities of tasks rather than suspending and restarting processes.
4.3 Experimental Framework
We now describe the applications and the experimental platform that we used in our
study.
4.3.1 Experimental Platform
The experiments were carried out on an Intel Pentium 4 CPU running at 3.00 GHz (hyper-threading turned off in the BIOS) with a Seagate 73 GB EIDE hard disk drive and 1 GB of physical memory. Our experimental prototype is based on a Fedora Core 3 distribution running a 2.6.10 kernel. The working-set graphs shown in the subsequent sections were obtained using the execution-driven simulator Valgrind [79], augmented with a plug-in that implements the working-set algorithm outlined by Denning in [26]. The Synergistic Scheduler framework has been implemented both as an external kernel module and as a user-level "probe" process for the 2.6.10 kernel.
4.3.2 Applications
To evaluate the effectiveness of our approach, we measured its impact on the performance of real-world memory-intensive applications, namely the sequential versions of the NAS Parallel Benchmark (NPB Version 2.4, Class A) suite [2] when they are all run simultaneously. All the C benchmarks were compiled with the Intel C Compiler version 7 at optimization level -O3, and the Fortran benchmarks were compiled with the Intel Fortran Compiler version 7 at the same optimization level. Since the problem class sizes used in this chapter differ from those used in the previous chapter, a brief description of the benchmarks and the sizes of the data sets that they access is given in Table 4.1.
4.4 Scheduling Strategy
As explained above, the Synergistic Scheduler has been implemented both as a kernel module and as a daemon "probe" process. There is little difference in the performance results between the two approaches, although the latter is simpler and more convenient from the point of view of deployment. The kernel module registers a callback function to be executed as part of a timer structure at periodic intervals. The interval is statically set to 2 seconds, but ideally this parameter should be tuned dynamically depending on the load on the system. A higher CPU load on the system should cause the value of this time
Table 4.1. Description of applications. The Total Memory column indicates the total/maximum memory that is used by the application.

  Name  Description                                Input Data Set              Total Memory
  IS    Integer Bucket Sort                        8388608 integers            147 MB
  CG    Conjugate Gradient Method to solve         14000x14000 sparse matrix    77 MB
        an unstructured sparse-matrix
  FT    3-D Fast-Fourier Transform of              256x256x128 matrix          464 MB
        complex numbers
  MG    3-D Multi-Grid Solver                      256x256x256 matrix          436 MB
  SP    Diagonalized Approximate Factorization     64x64x64 matrices           107 MB
  BT    Block Approximate Factorization            5x5x64x64x64 matrices       354 MB
  LU    Simulated CFD using SSOR techniques        64x64x64 matrices            66 MB
  EP    Monte-Carlo Simulation                     2^29 random numbers          17 MB
period to be set higher so that less overhead is incurred. Similarly, a higher memory load should set this time period to a smaller value so that the process scheduler can take advantage of our approach. When the callback function is executed after the specified time interval, it determines a subset of processes whose priorities need to be adjusted for the system to make forward progress, based on a set of heuristics that are outlined in the subsequent sections. The user-space daemon-based approach is very similar. The daemon sleeps for a pre-determined interval, after which it samples the system statistics and per-process page-fault rates using the /proc file-system. It then uses this information to determine a subset of processes whose priorities need to be adjusted. Note that this process has to run as root to effect the priority changes. Figure 4.1 outlines the design alternatives for the Synergistic Scheduler. Determining the candidate tasks for priority boost, and how many of them to boost, is difficult: different processes take different amounts of time to reach their steady state, and it is practically impossible to determine how close to steady state a particular process is with respect to memory utilization, since this amounts to estimating its working set size. In the next paragraphs, we outline a few heuristics that we have used to evaluate the system.
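As a concrete illustration, the sampling step of such a user-level probe can be sketched in a few lines of Python. This is only a sketch, not the thesis prototype: the field offsets follow the Linux proc(5) layout of /proc/<pid>/stat, and the function names sample_task and probe_once are our own.

```python
import os


def sample_task(pid):
    """Sample one process from /proc/<pid>/stat: return its cumulative
    major-fault count and resident set size in pages (proc(5) fields
    12 and 24, respectively)."""
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # The comm field may contain spaces; the real fields resume after ')'.
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[9]), int(fields[21])


def probe_once(pids, prev_faults, interval):
    """One sampling round: compute each task's major-fault rate over the
    last interval and return {pid: (fault_rate, rss_pages)}."""
    stats = {}
    for pid in pids:
        majflt, rss = sample_task(pid)
        rate = (majflt - prev_faults.get(pid, majflt)) / interval
        prev_faults[pid] = majflt
        stats[pid] = (rate, rss)
    return stats
```

A daemon would call probe_once every couple of seconds and then boost the chosen tasks with os.setpriority(os.PRIO_PROCESS, pid, new_nice), which, as noted above, requires root privileges for priority increases.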
4.4.1 Heuristics for task set selection
The goal of any heuristic employed in this context is to improve the overall resource utilization of the system, including CPU and memory. Furthermore, the system should favor candidates that are closer to their steady state (from the memory-utilization point of view) over those that are not. This is based on the intuition that jobs closer to steady state will not fault frequently and consequently will not lower the overall CPU efficiency. In other words, favoring such jobs should allow overall forward progress of the system (fewer
Fig. 4.1. Synergy Scheduler Design Alternatives: (a) using a kernel-module based approach, (b) using a user-level probe process based approach.
page-faults/swapping). In this study, we look at the effects of two parameters that can be gathered in the operating system without any overhead and use them to determine whether a process is close to steady state, namely:
• Page-fault rate suffered by a process
• Memory residency (RSS) of a process
Although these parameters are not really independent of each other, together they provide an indicator of whether a particular process can make progress without faulting. The whole point of boosting the priority of a certain set of tasks is that the overall system should make forward progress. If the system happens to boost the priority of processes that would block immediately, the situation resembles an artificially induced priority-inversion problem, wherein processes with a lower priority end up hogging the CPU more than the higher-priority processes. Hence, the system tries to make a judicious selection of the candidates for priority boost using the above-mentioned parameters. The input to the algorithm
is the desired multi-programming level (i.e., the number of jobs/processes whose priorities are to be incremented). Determining the optimal value of this parameter is beyond the scope of this work, since it is highly workload-specific. However, this is an interesting aspect that we wish to investigate in the near future.
The actual pseudo-code for task selection is described in the subsequent paragraphs. It is to be noted that this algorithm is executed periodically by the system (either by a user-level "probe" process or by a kernel-level thread as part of a timer list) at pre-specified intervals.
• Gather the list of user processes that are currently running on the system (excluding system
daemons, kernel threads) and sample their memory residency (RSS) and page-fault rate
incurred during the last time interval.
• Denote the number of jobs/programs in the system whose priority has been boosted as N.
• If N is less than the desired multi-programming level (MPL), then select at most (MPL-N)
tasks as candidates for priority boost. The candidates that are selected are those that have
incurred the least page-fault rate in the previous interval. Any ties are broken by selecting
candidates that have the highest resident set size.
• Note that we may select fewer than MPL processes for priority boost if the sum of the virtual-memory sizes exceeds the available physical memory. The virtual-memory size of a process is a conservative over-estimate of its working set and could lead to under-utilization of resources; however, the practical impossibility of measuring the working set size inside the operating system kernel has forced us to resort to this.
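The selection step above can be captured in a short Python sketch. This is an illustrative reconstruction of the pseudo-code, not the actual module code; the function name select_for_boost and the task-tuple layout are our own.

```python
def select_for_boost(tasks, boosted, mpl, mem_limit):
    """Select at most (MPL - N) candidates for priority boost.
    Candidates are the tasks with the lowest page-fault rate in the
    last interval, ties broken by the largest resident set size; a
    candidate is dropped if its virtual-memory size (a conservative
    over-estimate of its working set) does not fit in the remaining
    physical memory.

    tasks:     {pid: (fault_rate, rss, vm_size)} for non-boosted tasks
    boosted:   {pid: vm_size} for the N already-boosted tasks
    mem_limit: available physical memory (same units as vm_size)
    """
    budget = mem_limit - sum(boosted.values())
    # Lowest fault rate first; ties broken by highest resident set size.
    ranked = sorted(tasks, key=lambda p: (tasks[p][0], -tasks[p][1]))
    chosen = []
    for pid in ranked[:max(0, mpl - len(boosted))]:
        vm = tasks[pid][2]
        if vm <= budget:
            chosen.append(pid)
            budget -= vm
    return chosen
```

On the three-task example discussed next (A: 500 MB at 10 flt/s, B: 500 MB at 30 flt/s, C: 10 MB at 8 flt/s, 512 MB of RAM, MPL 2), this sketch selects C and A; in the second scenario (fault rates 10, 12 and 20 flt/s), it selects only A, because the other low-fault-rate candidate, B, does not fit in the remaining memory.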
To illustrate this with an example, assume there are three tasks, A (500 MB, 10 flt/s), B (500 MB, 30 flt/s) and C (10 MB, 8 flt/s), with memory requirements and page-faults per second indicated in parentheses. Assume that the system is equipped with 512 MB of RAM and that the desired MPL of the system is 2. Clearly, an optimal schedule would try to schedule A and C or B and C together. Otherwise, the system could possibly thrash, with A and B stepping on each other's toes. Based on the observed page-fault rates, our system elects to elevate the priorities of C and A (although we could also conceivably reduce the priority of B, we chose not to do so in our prototype). Once we have elevated the priorities of tasks, we do not change them until those tasks run to completion. A smarter strategy would readjust the priorities once the system stops thrashing. If, on the other hand, the scenario is A (500 MB, 10 flt/s), B (500 MB, 12 flt/s) and C (10 MB, 20 flt/s), then the system elevates the priority of only A, despite being asked to elevate the priorities of 2 tasks (otherwise A and B may step on each other). If in the next sampling interval C's fault rate is lower than B's, then the system elevates the priority of C as well, since fewer than the requested number of tasks have elevated priorities.
4.5 Results
In order to motivate the problem and to illustrate the difficulty of predicting and/or measuring the working set size from inside the operating system kernel, we plot the variation of the working set sizes, as defined by Denning [26], for our NAS 2.4 application suite, using Valgrind [79] as our execution-driven simulator. Valgrind [79] is an extensible x86 memory debugger and emulator, and also a framework that allows custom skins/plug-ins to be written that augment the basic blocks of the program as it executes. The skins/plug-ins that we implemented augment the basic blocks to return control to the skin after every memory-referencing instruction with the value of the memory address that was referenced and whether it was a load or a store. The skins maintain the data structures that implement the working-set size calculation algorithm explained in [26]. Figure 4.2 plots the variation of the working set size, measured in number of pages (4 KB page size), with simulation time, measured in terms of the number of memory-referencing instructions, using a 1-bit reference counter for the page-table entries as explained in [26]. The value of σ shown in parentheses in the caption of Figure 4.2 is the sampling interval (measured in terms of the number of memory-referencing instructions). There is nothing special about the values chosen for σ; the only constraint was that there should be a reasonable number of sampled data points. With the exception of CG and EP, whose working set sizes are fairly constant with time, most of the applications show a phase behavior (in terms of variation of working set size) that is consistent with other studies' findings [81]. However, the graphs in Figure 4.2 indicate that it is not trivial to predict working-set sizes. Further, the difficulty of choosing a good value of the sampling interval (σ) and the overheads of determining the working set size inside the operating system (without impacting the performance of jobs/tasks on the system) make this a challenging problem for operating system engineers and developers. Consequently, we use the overall virtual memory size as an indicator in our scheduling framework.
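For reference, the 1-bit sampling scheme that our Valgrind plug-in implements can be sketched as follows. This is a simplified reconstruction under stated assumptions (a flat trace of byte addresses, fixed 4 KB pages); the plug-in itself operates on instrumented Valgrind basic blocks.

```python
def working_set_sizes(trace, sigma, page_size=4096):
    """1-bit reference-bit approximation of Denning's working set [26]:
    every sigma memory references, report the number of distinct pages
    touched in the window, then clear all reference bits.

    trace: iterable of referenced byte addresses (loads and stores)
    Returns one working-set size (in pages) per complete window."""
    referenced = set()  # pages whose reference bit is currently set
    sizes = []
    for i, addr in enumerate(trace, 1):
        referenced.add(addr // page_size)
        if i % sigma == 0:          # sampling-interval boundary
            sizes.append(len(referenced))
            referenced.clear()      # emulate clearing the 1-bit counters
    return sizes
```

For example, a trace that touches three distinct pages in its first window of σ references and only one page in its second yields the per-window sizes [3, 1], which is what Figure 4.2 plots for each application over the whole run.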
We divide our experimental evaluation of the Synergistic Scheduling framework into two scenarios, namely:
• Underloaded: In this scenario, each application's working set size comfortably fits in the available physical memory when run stand-alone. Consequently, such a scenario does not cause applications to swap when they are run stand-alone, and therefore any sequential
[Figure: eight panels, each plotting working set size (pages, 1-bit reference counter) against time (memory-referencing instructions).]
Fig. 4.2. Variation of working set size with simulation time for (a) IS (σ = 0.4 Million), (b) FT (σ = 15 Million), (c) CG (σ = 14 Million), (d) MG (σ = 56 Million), (e) SP (σ = 18 Million), (f) EP (σ = 68 Million), (g) LU (σ = 21 Million), (h) BT (σ = 76 Million).
execution of applications (batch processing) is expected to be optimal. It is to be emphasized, however, that the sum of the working set sizes of all the applications together is larger than the available physical memory; therefore, any execution of all applications simultaneously is expected to swap.
• Overloaded: In this scenario, a few of the applications' working set sizes do not fit in the available physical memory when run stand-alone. Consequently, such a scenario causes applications to swap even when they are run stand-alone. Therefore, sequential execution of applications is not expected to be optimal, since it does not utilize the CPU efficiently. Such a scenario is therefore an interesting case study for the Synergistic Scheduling framework, since the primary goal of the technique is to improve overall CPU utilization, and it is therefore expected to perform better than batch processing. To create this situation, we ran the experiments on a machine whose kernel was instructed to use a specified portion of memory instead of the entire physical memory (384 MB instead of 1 GB).
The experimental results described in the next few sections plot the execution times (in
seconds) and the normalized slowdown of each application as well as the overall execution time
averaged over 5 runs of each NAS application benchmark for the following schemes:
• Sequential (SEQ): This is the base scheme, where the NAS applications are run one after another; it is virtually identical to a batch-processing system. In addition to measuring the average completion time for each benchmark, we compute the overall completion time for all the benchmarks as the sum of the completion times of each benchmark.
• Simultaneous (SIM): This is the scheme where all the NAS applications are run simultaneously on the native scheduler, which is oblivious to memory pressure and working set sizes. In this case, the overall completion time is the execution time of the slowest application.
• Prio-Simultaneous (FIX): This is the scheme where all the NAS applications are run simultaneously on the native scheduler, which has been augmented with knowledge of memory pressure and working-set-size considerations. The system determines the candidates whose priorities need to be boosted based upon resident set sizes/page-fault rates, and once the priorities are adjusted, the system does not recalculate or readjust them until one or more processes exit. The pseudo-code for this scheme was described in the previous section. In this case too, the overall completion time is the execution time of the slowest application.
• Prio-Simultaneous (RAN): This is a scheme where all the NAS applications are run simultaneously on the native scheduler, which has been augmented to select a random subset of tasks for priority boost. The overall completion time is the execution time of the slowest application.
4.5.1 Underload Scenario
Figure 4.3(a) plots the execution times (in seconds) of all the above schemes under the underload scenario for each application, as well as the overall completion time. Figure 4.3(b) plots the slowdown of each scheme normalized with respect to the Sequential (SEQ) scheme's execution time. It can be seen from the last set of columns (denoted OV) in Figure 4.3(a) that the Prio-Simultaneous (FIX) scheme performs better than both the Simultaneous (SIM) and the Prio-Simultaneous (RAN) schemes. The performance of the FIX scheme is in fact quite comparable to the Sequential scheme, which is clearly expected to be optimal when the working set sizes of all processes are guaranteed to be less than the available physical memory when run in isolation (on a uni-processor workstation). The Simultaneous (SIM) scheme is about 14% slower than the Sequential (SEQ) scheme. The Prio-Simultaneous (FIX) scheme with MPL 1, 2 and 3 is about 0.5%, 0.4% and 2.5% slower, respectively, and the Prio-Simultaneous (RAN) scheme with MPL 1, 2 and 3 is about 10%, 9% and 0.1% slower, respectively. The graphs in the subsequent sub-sections provide more detailed statistics on system CPU utilization, context switches and page faults that help us better understand how our scheme achieves what we intended at the outset.
Figure 4.4(a) plots the overall percentage CPU utilization during the course of the entire execution of the benchmarks (for all 5 iterations). This graph shows the percentage of time spent executing user code, executing system code, idling, and waiting for I/O to complete (all these statistics are exported by the /proc file-system) for all the above schemes. Figure 4.4(b) plots the total number of context switches during the course of the entire execution (for all 5 iterations) for all the above schemes. Figure 4.4(c) plots the overall number of page faults during the course of the entire execution (for all 5 iterations). This graph also shows the split between major page faults (those that require reading from the swap device) and minor page faults (those that require only the allocation of a page) for all the above schemes. We can summarize the results from Figures 4.3 and 4.4 as follows:
• SEQ scheme is the best in terms of CPU utilization, since the bulk of the time is spent executing user code, with very few idle periods during which the jobs are being
[Figure: per-application and overall (OV) execution-time bars for Sequential, Simultaneous, FIX-MPL-1/2/3, and RAN-MPL-1/2/3.]
Fig. 4.3. Underload: (a) Execution time (in seconds), measured as the time taken from job start till completion. (b) Normalized execution time, measured as the ratio of the job completion time to the batch-processing execution time.
switched. This is to be expected, since the working set sizes of all the applications fit in the available physical memory. Also, since only one application is being executed, there are no context-switching overheads, and hence the time spent in system mode is also minimal.
• SIM scheme is the worst in terms of overall CPU utilization, since a good portion of the time is spent waiting for I/O to complete. This is due to the excessive number of page-faults incurred when applications running simultaneously step on each other's working sets. A good portion of the time is also spent in system mode, since the number of context switches increases and the scheduler code is invoked more often than in the SEQ scheme. Figures 4.4(b) and (c) corroborate the increased number of context-switches and page-faults for the SIM scheme.
• All the FIX schemes are very good at ensuring that the bulk of the time is spent executing user code. As is to be expected, if the desired multi-programming level is increased, the time spent in system mode increases, since the scheduler code is invoked more often than before. Figure 4.4(b) indicates that the FIX schemes end up context-switching far fewer times than the RAN and SIM schemes, because they utilize the CPU far more efficiently. As is to be expected, the FIX schemes incur more context-switches, and a greater percentage of time spent in system mode, than the SEQ scheme, since more than one application executes concurrently. Despite this, the FIX schemes end up with overall completion times comparable to the SEQ scheme.
• The RAN schemes (MPL 1 and 2) are unfortunately not as good as the FIX schemes at ensuring that mostly user-level code gets executed. Clearly, this is to be expected, since these schemes do not use any heuristics in selecting tasks. Consequently, the system spends more time waiting for I/O, which in turn contributes to the excessive number of context switches and page-faults, and hence the performance degradation in comparison to the FIX schemes.
• With increasing multi-programming levels, the RAN scheme starts to perform better, since it does a better job of improving overall CPU utilization. In particular, the RAN scheme with MPL 3 actually performs better than the FIX schemes.
• Trends in context switches and major page-fault rates corroborate all the above observations and the performance results thereof.
4.5.2 Overload Scenario
Figure 4.5(a) plots the overall execution times (in seconds) of the SEQ and FIX (MPL 1) schemes under the overload scenario for each application, as well as the overall completion time. Figure 4.5(b) plots the slowdown of the SEQ and FIX (MPL 1) schemes normalized with respect to the SEQ scheme's execution time. It can be seen from the last set of columns (denoted OV) in Figure 4.5(a) that the Prio-Simultaneous (FIX) scheme with MPL 1 performs better than the Sequential (SEQ) scheme. The FIX scheme performs better because it manages the CPU resources much more efficiently: if a particular task faults, the scheme can schedule other processes, thus utilizing the CPU more efficiently, whereas the SEQ scheme cannot schedule any other job. The next few graphs provide more detailed statistics on overall system CPU utilization, context switches and page faults that help us better understand how our scheme achieves what we intended at the outset.
[Figure: (a) stacked bars of user/system/idle/wait CPU-utilization percentages, (b) total context switches, (c) major and minor page-fault counts, for SEQ, SIM, FIX-1/2/3 and RAN-1/2/3.]
Fig. 4.4. Underload: (a) Overall CPU utilization, (b) Context switches, (c) Overall major page faults.
[Figure: per-application and overall (OV) execution-time bars for Sequential and FIX-MPL-1.]
Fig. 4.5. Overload: (a) Execution time (seconds), measured as the time from job start till completion. (b) Normalized slowdown, measured as the ratio of the job completion time to that of batch scheduling.
Figure 4.6(a) plots the overall percentage CPU utilization during the course of the entire execution. Figure 4.6(b) plots the number of context switches, and Figure 4.6(c) the overall number of major page faults, during the course of the entire execution. The results in Figure 4.6 corroborate that the FIX scheme performs better than the SEQ scheme, since it is able to make better scheduling decisions and utilize the CPU more efficiently, whilst minimizing the overall number of page-faults.
In summary, we have shown that a priority-boost based scheduler that takes virtual-memory/working-set sizes into consideration performs as well as a batch scheduler (SEQ) (around 0.5% slower) in the underloaded scenario, where none of the applications experiences major page-faults (no paging) when run stand-alone, and performs much better than a batch scheduler (SEQ) (around 54% faster) in the overloaded scenario, where a few of the applications experience major page-faults (paging) when run stand-alone.
4.6 Conclusions
In summary, this chapter examined the adverse effects that paging has on the performance of jobs running concurrently on multi-programmed processors. Today's operating system schedulers are oblivious to the memory load of individual tasks, since estimating the working set size of programs is believed to be hard to implement practically. Contrary to other approaches, which have treated this problem as an admission-control problem or have required application modifications, our approach is unique in its design. Specifically, we have presented a simple set of heuristics, as add-ons to existing operating-system process schedulers, that attempt to minimize paging and/or thrashing.
[Figure 4.6: three bar charts comparing the SEQ and FIX-10 schemes; panel (a) plots overall percentage utilization (User/System/Idle/Wait), panel (b) plots log10 of the number of context switches, panel (c) plots log10 of the number of page faults (major and minor).]
Fig. 4.6. Overload: (a) Overall CPU utilization (b) Context Switches (c) Overall Major Page Faults
Further, we show that the set of heuristics used in this study can be obtained relatively inexpensively on many modern-day operating systems, which makes our approach portable. An important design objective of our approach, the use of an external loadable kernel module or an external "probe" daemon atop an unmodified scheduler core, enables flexibility of deployment in a distributed-system setting like the computational Grid. The Synergistic Scheduling framework presents a simple technique to couple two different sub-systems of the operating system, namely the virtual memory manager and the operating system scheduler. In other words, it is a simple technique to realize a virtual-memory-aware process scheduler. Results on an experimental prototype indicate that the Synergistic Scheduling framework performs as well as a batch-processing-based scheduler in the underloaded scenarios and much better than one in the overloaded scenarios.
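A user-level "probe" daemon of the kind described above can obtain working-set estimates cheaply from the `/proc` interface on Linux. The sketch below is an assumed approximation of such a daemon, not the thesis's actual module: it parses the `VmRSS` field of `/proc/<pid>/status` and renices any job whose resident set fits in free memory (lowering nice values requires privilege).

```python
import os
import re

def parse_vmrss(status_text):
    """Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text."""
    m = re.search(r"VmRSS:\s*(\d+)\s*kB", status_text)
    return int(m.group(1)) if m else 0

def probe_once(pids, free_kb, boost=-5):
    """One sampling pass of a hypothetical probe daemon: favor jobs whose
    resident set already fits in free physical memory (Linux-only)."""
    for pid in pids:
        with open(f"/proc/{pid}/status") as f:
            if parse_vmrss(f.read()) <= free_kb:
                os.setpriority(os.PRIO_PROCESS, pid, boost)
```

Because everything is read through `/proc` and `setpriority(2)`, the scheduler core itself stays unmodified, which is exactly the deployment flexibility argued for above.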
This work has opened a number of interesting avenues for future research. Clearly, the lack of coupling between the memory-management sub-system and the scheduler sub-system of an operating system kernel is responsible for the performance degradation. While sub-systems should be designed and evolve independently, without inter-module dependencies, there may be situations, such as those illustrated in this thesis, that require coupling between them. The solution proposed here is a workaround of this fundamental limitation; a better solution is to integrate it into the core kernel. Another possible avenue of research is the interaction of a virtual-memory-aware scheduler with memory replacement and allocation algorithms. In other words, while we have shown the need for a virtual-memory-aware process scheduler, there is also a need for a scheduler-aware virtual-memory manager. Going a step further, the ideas in this framework are a specific instance of a general-purpose ideal of coupling every sub-system in the operating system. Such techniques could lead to better consolidation of resources, which is increasingly becoming an important consideration for clusters and data centers, from both a performance and a power perspective. Another future extension that we are currently looking at is the design of a Synergistic Scheduling framework for a parallel job scheduler that can be used on clusters and/or shared-memory parallel machines.
Chapter 5
Conclusions
Recent trends indicate that data-set sizes and application working-set sizes are increasing. With the growing disparity in performance between processors and I/O peripherals, it becomes critical to optimize I/O performance. The advent of clusters and parallel file systems that harness multiple disks has alleviated the I/O bandwidth problem. Main-memory caches have been proposed in the past as an effective solution to the I/O latency problem. In this thesis, we have developed techniques that use or glean sufficient application-level information to manage such caches efficiently.
For applications that access their data sets through explicit interfaces, we exploit compile-time and run-time knowledge of application access patterns to determine which cache blocks should be cached and when cache hierarchies need to be bypassed for good performance. We have implemented such a discretionary-caching system as extensions to a parallel file system, and our results demonstrate that performance can improve by as much as 33% over indiscriminate caching. For scaled, in-core versions of applications that access their data sets through the virtual-memory interface, we have developed a novel predictive-replacement algorithm that tracks an application's runtime behavior in the operating system, and we show that it can perform significantly better than the system's default replacement algorithm (a variant of LRU). While the above two techniques may seem very different in their methodology, the underlying similarity is that not all blocks/pages are important; the system therefore tries to proactively manage the I/O caches by either not caching unimportant blocks or evicting such blocks earlier. These two techniques have looked at improving the performance of a single application by maximizing the effectiveness of caching, through either a runtime scheme or a static compiler-driven scheme. However, such techniques are not sufficient to guarantee that performance does not degrade when multiple memory-intensive applications execute simultaneously in a multi-programming scenario. The previous chapter of this thesis describes the design and implementation of the Synergistic Scheduling framework, which operates on top of an unmodified OS process scheduler and alleviates the shortcomings of today's memory-oblivious scheduling algorithms.
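The core idea of discretionary caching, not caching blocks that will not be reused rather than letting them evict useful ones, can be illustrated with a toy model. The class below is an illustrative sketch only (the thesis's implementation lives inside a parallel file system, not in Python): blocks hinted as single-use bypass an LRU cache instead of polluting it.

```python
from collections import OrderedDict

class DiscretionaryCache:
    """Toy LRU cache with hint-driven bypass of single-use blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # insertion order == LRU order
        self.hits = self.misses = 0

    def access(self, block, will_reuse):
        """Access one block; will_reuse is the (compile- or run-time) hint."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark most-recently-used
            self.hits += 1
            return
        self.misses += 1
        if not will_reuse:
            return                          # bypass: don't pollute the cache
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False) # evict LRU block
        self.blocks[block] = True
```

With a capacity of two, a single-use block streamed between two reused blocks leaves both reused blocks resident, whereas an indiscriminate cache would have evicted one of them.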
There are several interesting directions for future work. Perhaps the most important involves applying the techniques developed in this thesis to improving the performance of large-scale, I/O-intensive, high-performance applications. Such applications' performance can be improved not only by better memory management, as this thesis has demonstrated, but also by incorporating application requirements into the design of high-performance file systems and I/O middleware libraries. For instance, file-system consistency is a dimension that can be exploited to deliver higher performance to applications that do not require very strong semantics. Traditionally, file systems use variants of locking techniques to serialize accesses to shared resources, and these techniques fail to scale well in large-scale clusters. Locking is an example of a pessimistic design that works well when contention for shared resources is expected to be frequent. However, such a design may not be necessary, and may in fact prove to be overkill, for scientific applications that are usually well-structured in their data accesses. Taking a hard stance on consistency demotes throughput and scalability to second-class citizen status, having to make do with whatever leeway is available. An interesting direction for future work, therefore, is designing a file system that leaves the choice and granularity of consistency policies to the user at open/mount time, an attractive way of providing the best of all worlds.
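An interface for user-selectable consistency of the kind proposed above might look like the following. This is purely a hypothetical API sketch (the policy names, the `fs_open` function, and its signature are all assumptions, not an existing file system's interface): the caller names a consistency policy per open, and a well-structured scientific code can opt out of locking entirely.

```python
from enum import Enum

class Consistency(Enum):
    POSIX = "posix"      # strong: every access fully serialized via locks
    SESSION = "session"  # changes become visible at close/open boundaries
    NONE = "none"        # application coordinates its own accesses

def fs_open(path, mode, consistency=Consistency.POSIX):
    """Hypothetical open() variant where the caller picks the consistency
    policy; returns a handle recording the chosen policy."""
    return {"path": path, "mode": mode, "consistency": consistency}
```

A visualization code that partitions a shared file into disjoint per-process regions could pass `Consistency.NONE` and skip lock traffic, while an unmodified application defaults to the strong POSIX semantics.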
Vita
Murali Vilayannur received his schooling at Vidya Mandir Higher Secondary School in Chennai, India. He subsequently received his undergraduate B.Tech degree in Computer Science and Engineering from the Institute of Technology, Banaras Hindu University, Varanasi, India, in 1999. He expects to receive his Ph.D. degree in August 2005 from the Department of Computer Science and Engineering at the Pennsylvania State University. He spent the summers of 2002 and 2004 at the Mathematics and Computer Science Division at Argonne National Laboratory in Chicago, where he is currently a post-doctoral staff member. His research interests mainly include operating systems, high-performance computing, and parallel I/O. He is a student member of the IEEE and the ACM.