Fault-Tolerant Distributed Shared Memory on a
Broadcast-based Interconnection Architecture
A Thesis
Submitted to the Faculty
of
Drexel University
by
Diana Lynn Hecht
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
December 2002
Dedications
I dedicate this dissertation to my husband, Stephen. I would not have been able to achieve this
goal without his faith in me and his sacrifice, patience, and understanding.
Acknowledgements
I would like to give special thanks to my advisor, Dr. Constantine Katsinis, for his invaluable
guidance, encouragement, advice and friendship over many years. I am very grateful for the
impact he has had in my life. In addition, I would like to thank my parents for the love, support,
and encouragement that helped me to be where I am today. I would like to thank my husband,
Stephen, for his unwavering confidence in me and his encouragement, patience, understanding
and sacrifice throughout this endeavor.
Table of Contents
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1. INTRODUCTION
1.1. INTRODUCTION
1.2. PREVIOUS WORK
1.3. ORGANIZATION OF THE THESIS
2. THE SOME-BUS ARCHITECTURE
2.1. THE SOME-BUS ARCHITECTURE
2.2. DESIGN COMPLEXITY AND SCALABILITY
3. DISTRIBUTED SHARED MEMORY ON THE SOME-BUS
3.1. BASIC DSM OPERATION ON THE SOME-BUS
3.2. ORGANIZATION OF CACHE AND DIRECTORY CONTROLLERS
3.3. INPUT QUEUE STRUCTURE AND RESOLVER DESIGN
3.4. PERFORMANCE ANALYSIS
3.4.1. Trace-Driven Simulation
3.4.2. Decision Tree
3.4.3. Queuing Network Theoretical Model
3.4.4. Comparison Between the Simulators and the Theoretical Model
4. FAULT TOLERANCE
4.1. FAULT TOLERANCE AND DISTRIBUTED SHARED MEMORY ON THE SOME-BUS
4.2. FT0 PROTOCOL
4.3. FT1 PROTOCOL
4.3.1. Home2 Consistency
4.3.2. Using Home2 Memory to Fill Read Requests
4.3.3. Performance
4.3.4. Implementation Issues
4.4. FT2 PROTOCOL
4.4.1. Performance
4.5. FT3 PROTOCOL
4.5.1. Performance
5. CONCLUSIONS
6. LIST OF REFERENCES
7. VITA
List of Tables
3.1 Comparison of Multiprocessor Trace Files
3.2 Simulation Data, after Relocation
3.3 Decision Tree Path Distribution for Read Accesses
3.4 Decision Tree Path Distribution for Write Hit Accesses
3.5 Decision Tree Path Distribution for Write Miss Accesses
3.6 Net Cost for Paths A Through F
3.7 Net Cost for Paths H Through L
3.8 Net Cost for Paths M Through S
3.9 Decision Tree Traffic vs. Channel Utilization
3.10 Distribution of Read or Write Misses Through the Decision Tree Paths
4.1 Average Number of Cache Blocks Written Back at Checkpoint Time
4.2 Comparison of the Network Cost of the Tree Paths for FT0 and FT1
4.3 Distribution of Tree Path Usage for Cache Misses
List of Figures
2.1 Parallel Receiver Array and Output Coupler
2.2 Optical Interface
2.3 Processor Interface
2.4 Extending the SOME-Bus from N to 2N Nodes
3.1 Typical System Architecture
3.2 Computer System Architecture for DSM and Message Passing
3.3 Computer System Architecture for Message Passing
3.4 SOME-Bus Channel Controller, Cache and Directory
3.5 Single Resolver Implementation
3.6 Separate Cache and Directory Message Chains
3.7 Dual Resolver Implementation
3.8 Distribution of Address References
3.9 Distribution of Address References, after Relocation
3.10 Processor Utilization and Channel Utilization: Low Level of Locality
3.11 Processor Utilization and Channel Utilization: High Level of Locality
3.12 Decision Tree Branch for Read Accesses
3.13 Decision Tree Path Distribution for Write Accesses to Blocks in Cache
3.14 Decision Tree Path Distribution for Write Accesses to Blocks not in Cache
3.15 Processor, Cache, Directory and Channel Queues
3.16 Message Flow in Queue System
3.17 Four-node Queuing Network
3.18 Traffic Through a Single Node in the Queuing Network
3.19 Channel Utilization: Case N01
3.20 Channel Utilization: Case N10
3.21 Channel Utilization: Case S01
3.22 Channel Utilization: Case S10
3.23 Channel Queue Waiting Time: Case N01
3.24 Channel Queue Waiting Time: Case N10
3.25 Channel Queue Waiting Time: Case S01
3.26 Channel Queue Waiting Time: Case S10
3.27 Processor Utilization: Case N01
3.28 Processor Utilization: Case N10
3.29 Processor Utilization: Case S01
3.30 Processor Utilization: Case S10
4.1 FT0 Memory Organization of SOME-Bus Node
4.2 Checkpoint Intervals of 10,000 and 30,000 for the Case N01 Applications
4.3 Checkpoint Intervals of 10,000 and 30,000 for the Case N10 Applications
4.4 Checkpoint Intervals of 10,000 and 30,000 for the Case S01 Applications
4.5 Checkpoint Intervals of 10,000 and 30,000 for the Case S10 Applications
4.6 State-transition-rate Diagram
4.7 State Probabilities, Pw = .10 and Pbfr = .0003
4.8 State Probabilities, Pw = .10 and Pbfl = .9
4.9 FT1 Memory Organization of SOME-Bus Node
4.10 Local and Remote Requests for Data Block DataB
4.11 Ownership Request Message Transfer with FT1 Protocol
4.12 Message Time Line
4.13 Messages at the Receiver Queues of the Home2 Node
4.14 Decision Tree for FT1 Read Request
4.15 Decision Tree for FT1 Write Miss
4.16 Decision Tree for FT1 Write Hit
4.17 Processor Utilization, Case N01
4.18 Execution Time, Case N01
4.19 Processor Utilization, Case N10
4.20 Execution Time, Case N10
4.21 Processor Utilization, Case S01
4.22 Execution Time, Case S01
4.23 Processor Utilization, Case S10
4.24 Execution Time, Case S10
4.25 Ownership Request Scenarios
4.26 Execution Time, Case N01
4.27 Execution Time, Case S01
4.28 Execution Time, Case N10
4.29 Execution Time, Case S10
4.30 Total Execution Time, Case N01
4.31 Total Execution Time, Case N10
4.32 Total Execution Time, Case S01
4.33 Total Execution Time, Case S10
4.34 Total Execution Time, Trace Files
Abstract
Fault-Tolerant Distributed Shared Memory on a Broadcast-based Interconnection Architecture
Diana Lynn Hecht
Constantine Katsinis, Ph.D.
This thesis focuses on the issue of reliability and fault tolerance in Distributed Shared
Memory Multiprocessors, and on the performance impact of implementing fault tolerance
protocols that allow for Backward Error Recovery through the use of synchronized
checkpointing. High-performance parallel computing systems that implement Distributed Shared
Memory (DSM) require interconnection networks capable of providing low latency, high
bandwidth, and efficient support for multicast and synchronization operations. Software-based
DSM systems rely on the operating system to manage the replicated memory pages and
consequently their performance suffers due to operating system overhead, false sharing and page
thrashing. In order to obtain high levels of performance, the activities related to maintaining the
consistency of shared data in a DSM should be implemented in hardware so that latencies for data
access can be minimized. The recoverable DSM system examined in this thesis is intended for
the class of broadcast-based interconnection networks in order to provide the low latencies
required for the application workloads characteristic of DSM.
An example of this class of interconnection network is the Simultaneous Optical
Multiprocessor Exchange Bus (SOME-Bus). The unique architecture of the SOME-Bus provides
for strong integration of the transmitter, receiver, and cache controller hardware to produce a
highly integrated system-wide coherence mechanism. This thesis presents four protocols for
fault-tolerant DSM and uses simulation and theoretical analysis to examine the performance of
the protocols on the SOME-Bus multiprocessor. The proposed fault tolerance protocols exploit
the inherent data distribution operations that occur as part of the management of shared data in
DSMs in order to hide the overhead of fault tolerance. The increased availability of shared data
for the support of fault tolerance can be used to enhance the performance of the DSM by
increasing the likelihood that a request for data can be filled locally without requiring
communication with remote nodes.
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
High-performance computing is required for many applications, including the modeling
of weather patterns, atomic structure of materials and other physical phenomena as well as image
processing, simulation of integrated circuits and other applications known as “Grand Challenge”
problems. Scalable systems capable of addressing these application classes are formed by
interconnecting large numbers of microprocessor-based processing nodes in order to create
distributed-memory multiprocessor systems. The effectiveness of these types of multiprocessing
systems is determined by the interconnection network architecture, the programming model
supported by the system, and the level of reliability and fault-tolerance provided by the system.
The types of applications used with these multiprocessors also affect the level of performance that
can be provided: applications differ in their inherent parallelism, which affects the ability to
balance the workload across the processors, in their communication patterns, and in how they
implement synchronization operations.
Shared address programming, message passing, and data parallel processing are examples
of the programming models that can be supported by multiprocessor systems. The programming
model is important because it affects the amount of operating system overhead involved in
communication operations as well as the level of involvement required by the programmer to
specify the processor interaction required by the application. The message passing paradigm
requires a higher level of programmer involvement and knowledge of the details of the
underlying communication subsystem in order to explicitly direct the interprocessor
communication. Distributed Shared memory systems (DSM) offer the application programmer a
model for using shared data that is identical with that used when writing sequential programs,
thereby reducing the complexity involved in developing distributed applications.
The memory consistency model supported by a DSM has a large impact on the system
performance. Sequential consistency provides the programmer with the most intuitive memory
model but results in increased access latency and network bandwidth requirements. Relaxed
consistency models such as Release Consistency [13] and Entry Consistency [3] improve
performance by allowing hardware optimizations such as pipelining and reordering as well as
overlapping of memory accesses. However, the weaker consistency models rely heavily on
synchronization operations when accessing shared data, requiring a higher level of complexity for
the programmer. The success of DSM depends on its ability to free the programmer from any
operations whose only purpose is to support the memory model. It is
imperative that interconnection networks be designed with high bisection bandwidth and low
latency in order to provide the best possible performance in DSM systems.
As the number of interconnected nodes in a DSM system increases, the probability of
node failures also increases. For this reason, tolerating node failures becomes essential for
parallel applications with large execution times. Backward Error Recovery enables an
application that encounters an error to restart its execution from an earlier, error-free state. In
order for this to be possible, the state, or checkpoint, of a process must be periodically saved so
that it can be used to restart the application in the case of a node failure.
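The recovery cycle just described can be sketched in simulation form. The following is a minimal illustration, not the protocol developed later in this thesis: the dictionary-valued process state, the deterministic step function, and the single injected failure are assumptions made purely for demonstration.

```python
import copy

def run_with_checkpointing(state, step, total_steps, interval, fail_once_at=None):
    """Run `step` repeatedly; save a checkpoint every `interval` steps and
    perform Backward Error Recovery (rollback) on a simulated failure."""
    ckpt_state, ckpt_step = copy.deepcopy(state), 0
    i, failed = 0, False
    while i < total_steps:
        if i == fail_once_at and not failed:
            failed = True
            # Backward Error Recovery: restart from the last error-free state
            state, i = copy.deepcopy(ckpt_state), ckpt_step
            continue
        state = step(state, i)
        i += 1
        if i % interval == 0:
            ckpt_state, ckpt_step = copy.deepcopy(state), i  # new checkpoint
    return state

# A failure at step 6 is transparently absorbed: the run still sums 0..9 = 45,
# at the cost of re-executing the steps since the last checkpoint.
result = run_with_checkpointing({"sum": 0}, lambda s, i: {"sum": s["sum"] + i},
                                total_steps=10, interval=4, fail_once_at=6)
```

The checkpoint interval controls the trade-off visible here: longer intervals mean less checkpointing overhead but more re-executed work after a failure.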
1.2 PREVIOUS WORK
A survey of recoverable DSM systems is provided in [28]. In [18], ICARE, a recoverable
DSM, is presented in which the basic Write-Invalidate coherence protocol is extended in order to
manage both active data and recovery data. The system uses globally consistent checkpointing
implemented as a two-phase commit protocol. In [29], the implementation of ICARE on a COMA
architecture (hardware-based DSM using a cache line as the unit of coherence) and a Shared
Virtual Memory architecture (software-based DSM using a memory page as the unit of coherence)
is compared.
In [36] and [38], the Boundary-Restricted class of coherence protocols is presented,
which provides a mechanism to guarantee that the number of copies of a memory page remains in
a range that can be specified. In [12], several instances of the Boundary-Restricted coherence
protocols are compared with Write-Invalidate, Write-Invalidate with Downgrading, and Write-
Broadcast in terms of the level of availability and the operating costs for update or invalidation
operations.
[19] presents a recoverable DSM system based on the competitive update protocol. The
goal of the competitive update protocol is to remove copies of pages in nodes that are not actively
using them in order to reduce the system overhead of keeping the pages updated. The original
competitive update protocol was modified in order to guarantee that at least two copies of each
page exist in the system, enabling recovery from a single node failure. The system is
implemented on a page-based software DSM implementing Release Consistency.
DSMPI, a parallel library implemented on top of MPI that provides the abstraction of
globally accessible shared memory is presented in [34]. The unit of sharing is the program
variable or data structure and non-blocking coordinated checkpointing is provided. A globally
consistent state is achieved through the use of a checkpoint identifier included with every
message sent. Before processing a message, a receiver compares its own checkpoint identifier
with the one received with the message in order to determine whether the message was sent
before or after the receiver's current checkpoint round began.
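The identifier comparison can be sketched as follows. This is a hypothetical simplification of the mechanism described above; the function name and return labels are mine, and DSMPI's actual message handling is more involved.

```python
def classify_message(receiver_ckpt_id, msg_ckpt_id):
    """Relate a received message to the receiver's current checkpoint
    round using the identifier piggybacked on every message."""
    if msg_ckpt_id < receiver_ckpt_id:
        return "before"   # sent in an earlier checkpoint round
    if msg_ckpt_id > receiver_ckpt_id:
        return "after"    # sender has already started a newer round
    return "current"      # same round: process normally
```

A message classified as "after" tells the receiver that its sender has already passed a checkpoint the receiver has not yet taken, which is the information needed to keep the global state consistent.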
In [2] a recoverable shared memory (RSM) is proposed in which the memory in the
system is divided into two equal parts, the current data blocks and the recovery data blocks. The
RSM uses a dependency-tracking matrix in order to create a consistent recovery state, using a
two-phase commit protocol, without requiring a global, fully synchronized mechanism for
creating a new recovery point. Advantages of this approach are that only processors directly
affected by a failure are required to perform a rollback, processor failures are tolerated
transparently, and both sequential and relaxed consistency are supported. A disadvantage of this
approach is that write operations can be delayed, because the copy-on-write mechanism for
updating the recovery data requires that dependency tracking be performed on every write.
A major reason for the moderate performance that occurs with many modern large-scale
applications is the mismatch between interconnection architecture and application structure.
DSM systems typically have a large percentage of multicast traffic due to the invalidation
messages in Write-Invalidate cache coherence protocols and the update messages in Write-Update
cache coherence protocols. In addition to high bisection bandwidth and low latency, the
interconnection architecture should support stronger levels of memory consistency, provide
efficient multicast and broadcast operations for cache-coherence-related messages, and facilitate
hardware-based DSM through close integration with the cache and memory subsystems. The
performance of a DSM system is shown to be adversely affected by the latency in the network in
simulations performed in [21].
There is a large body of research on multicast communication in popular architectures,
including path-based broadcasting [16] and tree-based and multidestination wormhole routing
[6][23][27]. Much of this effort is devoted to algorithms that alleviate the tendency of intense
multicast communication to make wormhole routing resemble store-and-forward routing.
Barrier synchronization occurs frequently in programs for multiprocessors, especially in
those problems that can be solved by using iterative methods. The ability to support efficient
synchronization operations is important not only for application-level synchronization but also for
activities relating to fault tolerance such as globally synchronized checkpointing. Efficient
implementation of barrier synchronization in scalable multiprocessors and multicomputers has
received a lot of attention. Synchronization based on multidestination wormhole routing and
additional hardware support at the network routers is presented in [31]. Additional hardware
designs in support of efficient synchronization are presented in [33] and [40].
A network architecture that offers an alternative to present switch-based networks relies
on one-to-all broadcast, where each processor can directly communicate with any other
processor; from the point of view of any processor, all other processors appear the same. Such a
network architecture allows the user, or the compiler, to structure the data and operations in the
application code to better reflect the parallelism inherent in the applications with the resulting
benefit that messages experience smaller latencies. It also allows the operating system to perform
extensive thread placement and migration dynamically to successfully manage the level of
parallelism present in large applications. The most useful properties of such a network
architecture are high bandwidth (scaling directly with the number of workstations), low latency,
no arbitration delay, and non-blocking communication.
Broadcast-based networks can be constructed using optoelectronic devices (and multiple-
wavelength data representations), relying on sources, modulators and arrays of detectors, all
being coupled to local electronic processors. This thesis presents a set of protocols for fault-
tolerance that are optimized for a multiprocessor system interconnected via a broadcast-based
network architecture. Although the protocols presented can be applied to the general class of
broadcast-based networks, in this thesis, the protocols are described in terms of their
implementation on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) [22].
The memory architecture required to support the different protocols is also presented.
The work presented in this thesis differs from previous work in that the DSM system is
hardware-based and the communication that occurs as part of the underlying DSM as well as the
Fault Tolerance protocols is optimized to take advantage of the efficient broadcast capability of
the network. In addition, the Fault Tolerance protocols have been integrated with the existing
DSM operations in a manner that causes little or no reduction in system performance during
normal operation and eliminates most of the overhead at checkpoint creation. Under certain
conditions, data blocks which are duplicated for fault tolerance purposes can be utilized by the
basic DSM protocol, reducing network traffic and increasing the processor utilization
significantly.
1.3 ORGANIZATION OF THE THESIS
In Chapter 2, the architecture of the SOME-Bus is presented, including the design of the
Transmitter/Receiver, the optical interface, the processor interface and a discussion of the design
complexity and scalability of the SOME-Bus. Chapter 3 presents the details of Distributed
Shared Memory on the SOME-Bus including the major components in each node (processor,
cache controller, directory controller and channel controller) and the supporting input queue
structure and resolver design. In addition, three methods for evaluating the performance of the
system, two simulators and a theoretical model are described in detail. Chapter 4 presents the
details of four protocols which provide fault tolerance on the SOME-Bus DSM. A summary and
conclusions are presented in Chapter 5.
CHAPTER 2. THE SOME-BUS ARCHITECTURE
This chapter presents the architecture of the SOME-Bus including the design of the
Transmitter/Receiver, the optical interface and the processor interface. A discussion of the design
complexity and scalability of the SOME-Bus is provided as well.
2.1 THE SOME-BUS ARCHITECTURE
The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) [22] incorporates
optoelectronic devices into a very high-performance network architecture. It is a low-latency,
high-bandwidth, fiber-optic network that directly connects each processing node to all other
nodes without contention. Its properties distinguish it from other optical networks examined in
the past [14][30][32][35]. One of its key features is that each of N nodes has a dedicated
broadcast channel that can operate at several GBytes/sec, depending on the configuration. In
general, the SOME-Bus contains K fibers, each carrying M wavelengths organized in M/W
channels, where each channel is composed of W wavelengths. The total number of fibers is K =
NW/M. A simple configuration with 128 nodes (N = 128 channels) and W=1 wavelength per
channel would require K=32 fibers with M=4 wavelengths per fiber. Current or foreseeable
technology allows a configuration of 128 nodes (channels) and K=8 fibers, with W=2
wavelengths per channel and M=32 wavelengths per fiber. As discussed later, each fiber can
carry several Gbits/sec, resulting in channels with GBytes/sec bandwidth. Each of N nodes also
has an input channel interface based on an array of N receivers (each with W detectors) that
simultaneously monitors all N channels.
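The channel-packing arithmetic above can be checked directly. The following sketch simply restates the relation K = NW/M in code; the function and variable names are mine.

```python
def fiber_count(n_channels, w_per_channel, m_per_fiber):
    """K = N*W / M: number of fibers needed to carry N channels of W
    wavelengths each, when a fiber carries M wavelengths."""
    total_wavelengths = n_channels * w_per_channel
    assert total_wavelengths % m_per_fiber == 0, "channels must pack evenly into fibers"
    return total_wavelengths // m_per_fiber

# The two configurations described in the text:
simple = fiber_count(128, 1, 4)        # N=128, W=1, M=4  -> K=32 fibers
foreseeable = fiber_count(128, 2, 32)  # N=128, W=2, M=32 -> K=8 fibers
```

The second configuration shows how denser wavelength multiplexing (larger M) cuts the fiber count even while doubling the wavelengths per channel.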
The physical implementation of SOME-Bus is motivated by recent progress in optical
communication, Dense Wavelength Division Multiplexing (DWDM), and optoelectronics. Slant
Bragg gratings [5][10][11][24] are written directly into the fiber core and are used as narrow-
band, inexpensive output couplers. This coupling of the evanescent field allows the traffic to
continue and eliminates the need for regeneration. Figure 2.1 shows the parallel receiver array
and output coupler. The SOME-Bus uses amorphous silicon (a-Si) photodetectors built as
superstructures on the surface of electronic processing devices. Due to the low conductivity of the
a-Si layer, no subsequent patterning is required, and therefore the yield and cost of the receiver is
determined by the yield and cost of the CMOS device itself. Optical power budget analysis of a
system with 128 nodes, 32 wavelengths per fiber and 10 mW of power inserted into the fiber
shows that the worst case for output power, occurring where light from the first node is coupled
out by the receiver at the last node, is 21 nW, which is more than sufficient for present detectors.
Detectors have been demonstrated that have very low dark current, a carrier lifetime of a few
picoseconds, and a dynamic range of more than 4 orders of magnitude under only 5 nW of optical
power [9].
Figure 2.1: Parallel Receiver Array and Output Coupler
Since the receiver array does not need to perform any routing, its hardware complexity
(including detector, logic, and packet memory storage) is small. This organization eliminates the
need for global arbitration and provides bandwidth that scales directly with the number of nodes
in the system. No node is ever blocked from transmitting by another transmitter or due to
contention for shared switching logic. The ability to support multiple simultaneous broadcasts is
a unique feature of the SOME-Bus that efficiently supports high-speed, distributed barrier
synchronization mechanisms and cache consistency protocols and allows process group
partitioning within the receiver array.
Once the logic level signal is restored from the optical data, it is directed to the input
channel interface, which consists of two parts: the optical interface and the processor interface.
Figure 2.2 shows the optical interface, which includes physical signaling, address filtering, barrier
processing, length monitoring and type decoding. Each receiver generates a data stream that is
examined to detect the start of the packet and the packet header. The header decode circuitry
examines the header field, which includes information on the message type, destination address
(or addresses) and length, to determine whether or not the message is a synchronization message.
If the message is a synchronization message, it is handled by the barrier circuitry, otherwise the
destination address is compared to the set of valid addresses contained in the address decode
circuitry. In addition to recognizing its own individual address, a processor can recognize
multicast group addresses as well as broadcast addresses. Once a valid address has been
identified, the message is placed in a queue. If the address does not match, the message is
ignored.
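The filtering decision just described can be summarized functionally. In this sketch the field names, address representation, and return labels are illustrative assumptions, not the hardware interface.

```python
def dispatch(msg_type, dest, my_addr, group_addrs, broadcast_addr):
    """Decide what the input channel interface does with an incoming
    message, mirroring the steps described above."""
    if msg_type == "barrier":
        return "barrier-circuitry"  # synchronization messages bypass the queues
    # Address decode: own address, any multicast group, or the broadcast address
    valid = {my_addr, broadcast_addr} | set(group_addrs)
    return "enqueue" if dest in valid else "ignore"
```

Because every node receives every broadcast, this per-receiver filter is what turns the physical one-to-all medium into logical unicast, multicast, and broadcast delivery.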
Figure 2.2: Optical Interface
The optical interface shown in Figure 2.2 provides for the handling of two basic types of
messages, a data message and a barrier message. In order to support these message types, at a
minimum, the message header should contain a type field, destination address field and a length
field. It is not necessary to include a source address in the header since the originator of the
message can be determined directly by the input channel queue on which the message was
received. The type field can be used not only to indicate whether the message is a data or barrier
message but can also indicate a particular module within the node to which the message should be
directed. For example, in a DSM multiprocessor, the type field could indicate that the message
should be delivered to either the cache controller or the directory controller (in the case of
Directory-Based Cache coherence). The destination address field can contain either the address
of an individual node or a pre-determined multicast group address. Another possibility
for multicast messages would be to place a bit vector (number of bits = number of nodes) in the
header which identifies the individual nodes to be included in the multicast. The advantage of the
latter approach is that it is not necessary to set up multicast groups ahead of time. The
disadvantage of the bit vector approach is that for systems with a large number of nodes, the size
of the message header could increase significantly. The Length field is used to detect underflow
or overflow conditions when the message is being received.
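The bit-vector alternative can be sketched as follows; the encoding is the obvious one implied by the text (bit i set means node i is a destination), and the function names are mine. The header-size trade-off is visible directly: for a 128-node system the vector adds 16 bytes to every multicast header.

```python
def encode_multicast(node_ids):
    """Pack destination node indices into a bit vector:
    bit i set means node i is included in the multicast."""
    vec = 0
    for n in node_ids:
        vec |= 1 << n
    return vec

def decode_multicast(vec, num_nodes):
    """Recover the destination set, as a receiver's address-decode
    logic would when testing its own bit."""
    return [n for n in range(num_nodes) if vec & (1 << n)]

vector_bytes = 128 // 8  # per-message header overhead for a 128-node system
```

Unlike pre-configured group addresses, this scheme needs no setup phase, at the cost of a header field that grows linearly with the node count.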
Figure 2.3 [17] shows the processor interface which includes a routing network and a
queuing system. One queue is associated with each input channel, allowing messages from any
number of processors to arrive and be buffered simultaneously, until the local processor is ready
to remove them. A resolver circuit receives a request signal (Ri) from each non-empty queue and
produces the index of the next queue to be accessed under either the limited or the exhaustive
service disciplines. The local processor can force the next queue selection through the Pin input. A
straightforward implementation of the resolver as a selection tree, using logic gates to select the
next queue and multiplexers to forward the corresponding queue index, requires only several
hundred gates organized in log2(N) levels. The time required to select the next queue (polling
walk time) is consequently very small and can be overlapped with the queue access time.
Arbitration may be required only locally in a receiver array when multiple input queues contain
messages.
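A functional model of the selection tree is sketched below. For simplicity it implements a fixed left-to-right priority, which is an assumption: the actual resolver also supports the limited and exhaustive service disciplines and the Pin override described above.

```python
def resolve(requests, lo=0, hi=None):
    """Return the index of a non-empty queue, selected by a binary tree
    over the request lines R0..RN-1, or None if no queue raises a request.
    Each recursion level corresponds to one of the log2(N) mux stages."""
    if hi is None:
        hi = len(requests)
    if hi - lo == 1:                      # leaf: a single request line Ri
        return lo if requests[lo] else None
    mid = (lo + hi) // 2
    left = resolve(requests, lo, mid)     # prefer the lower half here
    return left if left is not None else resolve(requests, mid, hi)
```

In hardware the tree's depth, not this sequential traversal, determines the delay, which is why the polling walk time stays small and can overlap the queue access.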
Figure 2.3: Processor Interface
Etched micro-mirrors [25][26] can be used to insert a signal into a fiber. Each node uses
a separate mirror and a separate laser source to insert each wavelength of its channel. It is
possible to integrate the transmitter sources on the same chip with the detectors and the associated
electronic circuits.
2.2 DESIGN COMPLEXITY AND SCALABILITY
The SOME-Bus has much more functionality than traditional crossbar architectures. A
major advantage of this architecture is that, due to the multiple broadcast capability, no node is
ever blocked from transmitting by another transmitter, no arbitration is required and network
bandwidth scales directly with the number of nodes. No communication is ever blocked through
contention for shared switching logic. With N nodes, the diameter of the SOME-Bus is 1, the
time needed for all-to-all communication with distinct messages is O(N) and the time needed for
synchronization is O(1). Unlike a fully-connected point-to-point network, where the number of
transmitters and channels increases as O(N²), the number of transmitters and channels of the
SOME-Bus is O(N), considerably smaller than the number required in other popular architectures,
such as the hypercube or the torus. The number of receivers is N², which is larger than the number required in
other architectures. They are arranged so that N receivers are fabricated as amorphous silicon
structures constructed as a thin film directly on the surface of a digital CMOS device, with no
lithography required. Because of the low conductivity of the amorphous silicon layer, no
subsequent patterning is required, and therefore the yield and cost of the receiver is determined by
the yield and cost of the CMOS device itself. Since the receiver does not need to perform any
routing, its hardware complexity (including detector, logic, and packet memory storage) is small,
keeping its cost small as well. The full receiver array can be implemented on a single chip even for
large values of N (N > 128). Therefore, the total receiver cost is approximately O(N) instead of
O(N²).
Network scalability, whether based on switches or the SOME-Bus, is a very important
issue. Since achieving “supercomputing” performance requires a large number of processors, and
since the topology of the network becomes critical as the network size increases, it is important to
examine the issue of scalability in the sense of doubling the number of processing nodes, rather
than incrementally increasing the network size.
While it may be easier to incrementally expand a switch-based network by adding
switches and nodes, the resulting hot spots and/or bottlenecks in such a system would cause a
dramatic loss of performance. Only under very strong assumptions of locality is it possible to
incrementally scale up a switch-based network with no loss of performance. If the application
does not exhibit complete locality, then incremental expansion of this network adversely affects
its performance.
The SOME-Bus is by comparison easier to scale than a switch-based network. It relies on
a set of N receivers integrated on a single chip. Since most of the network cost is in the fiber optic
ribbon, a system with fewer than N nodes may be constructed by using a SOME-Bus segment
that accommodates the number of needed nodes (and would leave a portion of the receiver chip
unused). Such a system may scale up to N by optically joining additional SOME-Bus segments.
Once N is reached, it is not possible to add one more node. System expansion is then achieved by
incorporating a second N-receiver chip in each node (and using additional SOME-Bus networks
for the necessary channels). Figure 2.4 shows the expansion of such a system from N to 2N
nodes, which is accomplished by using four SOME-Bus segments to create twice the number of
channels (and each channel is twice as long to accommodate the additional nodes). Since
information flows only in one direction on each SOME-Bus between the two halves of the
network, amplifiers may be easily placed between SOME-Bus segments as Figure 2.4 shows.
Figure 2.4: Extending the SOME-Bus from N to 2N Nodes
CHAPTER 3: DISTRIBUTED SHARED MEMORY ON THE SOME-BUS
Communication between processors in a distributed system can occur via message
passing or distributed shared memory (DSM). Message-passing systems tend to have a high
level of integration between the processor and network. The user specifies the communication
operations through operating system or library calls that perform basic send and receive
operations. The message passing approach to interprocessor communication places a significant
burden on the programmer and introduces a considerable software overhead.
DSM systems can be easier to program than systems that use a message-passing model.
A DSM system hides the mechanism of interprocessor communication by presenting the
programmer with a shared memory model logically implemented in a physically distributed
memory system. The DSM programming model decreases the burden placed on the programmer
and provides a higher-level of portability for applications.
This chapter describes the implementation of Distributed Shared Memory on the SOME-
Bus. The basic DSM operation is described and then details about the Cache and Directory
controllers are provided. Several designs for the input queue structure and resolver are also
described. Finally, three mechanisms used to evaluate the performance of proposed fault
tolerance protocols (Trace-Driven Simulator, Statistical Simulator, and Queuing Network
Theoretical Model) are described in detail.
3.1 BASIC DSM OPERATION ON THE SOME-BUS
A natural implementation of a DSM interconnected by the SOME-Bus architecture is a
multiprocessor consisting of a set of nodes organized as a CC-NUMA system where the shared
virtual address space is distributed across the local memories of the processing nodes. The
SOME-Bus provides an advantage over software-based cache coherence approaches because it
supports a strong integration of the transmitter, receiver and cache controller hardware, producing
an effective system-wide hardware-based cache coherence mechanism.
There are two main approaches for maintaining cache coherence: snooping and directory-
based techniques. A disadvantage of snooping approaches is that the interconnection
network saturates quickly as the number of processors increases. The SOME-Bus
interconnection network will not saturate in this manner since every node has a dedicated output
channel and there is no contention with other nodes when sending messages. The cache
controller, however, can become saturated as the number of nodes in the system increases due to
the increase in cache consistency traffic. When directory-based techniques are used to maintain
cache coherence, it is not necessary for the cache controller in every node to receive all messages
relating to cache consistency, but only the messages that pertain to data blocks resident in that
node’s cache. The Directory controller will multicast Invalidation and Downgrade requests only
to the affected nodes rather than simply broadcasting them. In the SOME-Bus, messages are still
broadcast over the sending node's output channel, but the decision to accept or reject an input
message is performed at the receiver input rather than the cache controller of each remote node.
In order to multicast a message on the SOME-Bus, a list of destinations can be added to the
message header. The receiver logic of a particular node can determine if a message is intended
for that node by examining the header of the incoming message, and reject the message if
necessary by not placing it in the input queue.
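The accept/reject decision described above can be sketched in software as follows. The destination list is represented here as an N-bit mask in the message header; all class and field names are illustrative assumptions rather than the thesis hardware design:

```python
from dataclasses import dataclass

@dataclass
class Message:
    src: int
    dest_mask: int   # bit i set => node i should accept the message
    payload: str

class Receiver:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.input_queue = []

    def on_broadcast(self, msg: Message) -> bool:
        """Accept the message only if this node appears in the destination
        mask; otherwise reject it by not placing it in the input queue."""
        if msg.dest_mask & (1 << self.node_id):
            self.input_queue.append(msg)
            return True
        return False

# A multicast of an Invalidation to nodes 1 and 3 on a 4-node system:
receivers = [Receiver(i) for i in range(4)]
inval = Message(src=0, dest_mask=(1 << 1) | (1 << 3), payload="INVALIDATE")
accepted = [r.on_broadcast(inval) for r in receivers]
```

The message is still broadcast on the sender's channel; only receivers 1 and 3 enqueue it.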
The multiprocessor examined in this thesis consists of a set of nodes organized as a CC-
NUMA system interconnected by the SOME-Bus architecture. The shared virtual address space is
distributed across local memories that can be accessed both by the local processor and by
processors from remote nodes, with different access latencies. Each node contains a processor
with cache, memory, an output channel and a receiver that can receive messages simultaneously
on all N channels.
A sequential-consistency model is adopted and statically distributed directories are used
to enforce a Write-Invalidate protocol. Cache blocks can be in the state INVALID, SHARED, or
EXCLUSIVE. Directory entries can be in the state UNOWNED, SHARED, or EXCLUSIVE.
Each directory entry is associated with a bit vector (the copyset) that identifies the processors
with a copy of the data block corresponding to that entry. A node that contains the portion of
global memory and the corresponding directory entry for a particular data block is known as the
Home node for that block.
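The state sets and the copyset bit vector can be sketched as follows (a minimal illustration; the class and method names are assumptions, not the thesis implementation):

```python
from enum import Enum

class CacheState(Enum):
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2

class DirState(Enum):
    UNOWNED = 0
    SHARED = 1
    EXCLUSIVE = 2

class DirectoryEntry:
    """One entry per data block, held at that block's Home node."""
    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.state = DirState.UNOWNED
        self.copyset = 0            # bit i set => node i holds a copy

    def add_sharer(self, node_id: int):
        self.copyset |= 1 << node_id
        self.state = DirState.SHARED

    def sharers(self):
        return [i for i in range(self.num_nodes) if self.copyset & (1 << i)]

entry = DirectoryEntry(num_nodes=32)
entry.add_sharer(3)
entry.add_sharer(7)
```

The copyset is what allows the directory to multicast Invalidation or Downgrade messages to exactly the affected nodes.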
Multithreaded execution is assumed where each processor executes a program that
contains a set of parallel threads. A thread remains in the RUNNING state until it encounters a
cache miss. If the miss can be serviced locally without coordinating with other nodes through
Invalidation or Downgrade messages, the thread waits for the memory access to complete and
then continues running. If the cache miss requires a message to be sent to the home directory
controller, the thread is blocked and another thread is chosen from the pool of threads that are
ready to run. Each node contains four major components:
1. The processor handles all activities related to the scheduling of the threads.
2. The cache controller fills requests for data from the threads. When it encounters a cache miss
that cannot be serviced locally, the cache controller sends a Data or Ownership request
message to the home node that contains the directory entry for that block. A cache block
chosen for replacement in the cache results in a victim message to the home directory. If the
block was in EXCLUSIVE state, the victim message contains the writeback data. Otherwise
the victim message is sent so the directory controller can remove the node from the copyset.
3. A directory controller maintains the directory information for the portion of main memory
that is located at its node and receives and processes Data and Ownership requests from the
cache controllers. When processing the requests, the directory controller issues any required
Invalidation or Downgrade messages and then collects the acknowledge messages. A
waiting-message queue is used to store requests waiting for acknowledge messages so the
directory can service multiple requests simultaneously.
4. The channel controller receives messages from the cache or directory controllers and delivers
them to the destination node. If the source and destination nodes of the message are different,
the message is considered to be remote and is placed on the output queue associated with the
output channel of the source node. When the channel becomes available, the message is
transmitted and arrives at the input queue of the cache or directory controller at the
destination node. Messages that are broadcast or multicast arrive simultaneously at the
destination input queues. If the source and destination node of a message are the same, the
message is considered to be local and the channel controller places the message directly on
the input queue of the controller (directory or cache) to which the message is directed.
Messages delivered locally in this manner are not placed on the node’s output queue and do
not contribute to the channel utilization of the node.
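The local-versus-remote delivery rule in item 4 can be sketched as a small software model (names and message strings here are illustrative assumptions):

```python
class ChannelController:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.output_queue = []   # feeds this node's dedicated output channel
        self.cache_in = []       # input queue of the local cache controller
        self.directory_in = []   # input queue of the local directory controller

    def send(self, dest: int, to_directory: bool, msg: str):
        if dest == self.node_id:
            # Local message: deliver directly to the target controller's
            # input queue; no channel utilization is incurred.
            (self.directory_in if to_directory else self.cache_in).append(msg)
        else:
            # Remote message: queue for transmission on the output channel.
            self.output_queue.append((dest, msg))

cc = ChannelController(node_id=5)
cc.send(dest=5, to_directory=True, msg="DATA-REQ")   # local delivery
cc.send(dest=9, to_directory=True, msg="DATA-REQ")   # remote delivery
```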
3.2 ORGANIZATION OF CACHE AND DIRECTORY CONTROLLERS
Incoming messages may be either application-related messages destined for the processor
or coherence messages directed to either the cache or the directory controller. Achieving high
performance depends critically on the way in which messages are removed from the input queues
at the node receiver and delivered to the processor, cache controller or directory controller.
In a typical system architecture, shown in Figure 3.1, the channel interface is connected
to an I/O bus, which is connected to the processor bus through a bridge. Because the I/O bus is
designed to support several I/O devices, its bandwidth is usually relatively small compared to the
main processor-memory bus.
In the SOME-Bus architecture, in order to implement DSM as efficiently as possible, the
channel controller must be integrated as closely as possible with the cache and the directory. This
organization, shown in Figure 3.2, offers the fastest access to data in the cache and the directory
and results in smaller latencies. The flow of typical DSM messages is also shown in the figure.
Message passing can easily be done using this organization as Figure 3.3 shows.
Figure 3.4 shows the major blocks of the channel controller as well as the relating
functionality of the cache and the directory. A processor reference may result in a cache miss and
a subsequent interaction with the directory. If necessary, a request is created by the directory and
is forwarded to the channel controller, which creates a request message and enqueues it in the
output channel queue for transmission.
Figure 3.1: Typical System Architecture

Figure 3.2: Computer System Architecture for DSM and Message Passing

Figure 3.3: Computer System Architecture for Message Passing
Figure 3.4: SOME-Bus Channel Controller, Cache and Directory
To support message passing, a direct connection is created between the channel controller
and the main memory. Small messages may be read by the processor directly from the selected
queue. Large messages can be transferred from the proper queue into the regular memory in a
cut-through manner by the DMA controller. Figure 3.4 shows DMA channels connecting the
transmitter and receiver to the memory bus.
To support DSM, the channel receiver must pass incoming messages to either the cache
or the directory. As Figure 3.4 shows, incoming Data or Ownership request messages are sent to
the directory, while invalidation requests are sent to the cache. Similarly, incoming invalidation
acknowledgments are collected by the directory, and Data-Ack or Ownership-Ack messages are
sent to the cache. In addition, there is communication between the cache and directory as needed to complete the
protocol.
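The routing just described can be summarized as a small dispatch table (the message-type names follow the text; the table itself is an illustrative assumption):

```python
# Request messages and collected acknowledgments go to the directory;
# invalidations, downgrades and data/ownership acknowledgments go to the cache.
TO_DIRECTORY = {"DATA-REQ", "OWN-REQ", "INVAL-ACK", "VICTIM"}
TO_CACHE = {"DATA-ACK", "OWN-ACK", "INVALIDATE", "DOWNGRADE"}

def route(msg_type: str) -> str:
    """Return which controller's input queue should receive msg_type."""
    if msg_type in TO_DIRECTORY:
        return "directory"
    if msg_type in TO_CACHE:
        return "cache"
    raise ValueError(f"unknown message type: {msg_type}")
```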
3.3 INPUT QUEUE STRUCTURE AND RESOLVER DESIGN
Figure 3.5 shows a simple implementation of a receiver with a single resolver. The
queue structure has a status register with an attribute that indicates if there exist messages
targeted for either the cache or the directory. The resolver determines the next queue from which
a message is removed by generating the index of the next queue to be polled among the queues
with the corresponding attribute. Both cache and directory keep requesting the next message as
long as the corresponding status bit indicates that there are such messages.
The message data is placed on a bus shared by the Cache and Directory controller and
then received by the controller to which the message is destined. This implementation has several
disadvantages, however, due to contention between the cache and directory controllers.
Assume the cache controller is busy processing a previously received message and the directory
controller is idle and waiting for the next message to process. If the input buffers at the receiver
are implemented as a circular queue, a situation known as head-of-line (HOL) blocking
can occur. In this case, if the first message in all of the queues is a message destined for the
cache controller, HOL blocking occurs when the queues contain messages that could be delivered
to the directory controller immediately but must remain in the queue until the messages directed
to the cache controller are removed. As a result, the directory controller remains idle
unnecessarily because it cannot reach into the queue to pull out the required message. In
addition, only one message can be removed from the input channel queues at a time because the
Cache and Directory controller are sharing the same data path.
Figure 3.5: Single Resolver Implementation
One solution to the HOL blocking problem is to organize the input buffers as a linked list
rather than a circular queue. This approach allows messages to be removed from the middle of
the queue thereby avoiding the HOL blocking. Figure 3.6 illustrates an implementation of this
approach. In order to ensure that each controller removes its messages from the input buffers in
the order in which they were received, separate cache and directory linked lists can be
implemented on top of the linked list used to maintain the input channel buffers. Separate pointers,
one for the cache controller and one for the directory controller, are used to obtain the next
message to be removed by each controller. Unlike a FIFO queue structure, the linked list
implementation of the input message queue requires the ability to extract data from any buffer.
The separate buffers are connected to an internal queue data bus. Information from the resolver
and the Attribute network is used to select which buffer will place its data on the Internal Queue
data bus.
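The separate-chain idea can be sketched as follows; this models the logical behavior (per-controller FIFO order over one shared buffer pool) rather than the hardware linked list itself, and all names are assumptions:

```python
class ChainedInputQueue:
    def __init__(self):
        self.buffers = []   # (attribute, message) in arrival order
        self.front = {"cache": 0, "directory": 0}  # per-chain scan position

    def enqueue(self, attribute: str, message: str):
        self.buffers.append((attribute, message))

    def dequeue(self, controller: str):
        """Remove this controller's oldest message, even if messages for
        the other controller arrived ahead of it (no HOL blocking)."""
        i = self.front[controller]
        while i < len(self.buffers):
            attr, msg = self.buffers[i]
            if attr == controller and msg is not None:
                self.buffers[i] = (attr, None)   # mark the slot free
                self.front[controller] = i + 1
                return msg
            i += 1
        return None

q = ChainedInputQueue()
q.enqueue("cache", "DATA-ACK")
q.enqueue("cache", "OWN-ACK")
q.enqueue("directory", "DATA-REQ")
# The directory is not blocked behind the two cache messages:
first_dir = q.dequeue("directory")
```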
Figure 3.6: Separate Cache and Directory Message Chains
To maintain high performance, the receiver can be capable of extracting two messages
from the channel input queues simultaneously. One message is directed to the cache and one to
the directory and a demultiplexer is used to direct the message to one of two data paths, each one
dedicated to one of the controllers. In this case, two resolvers, one per controller, are required to
select the next queue from which a message will be removed and placed on the associated data
path. Figure 3.7 illustrates the dual resolver implementation.
The queue structure has a status register with separate attributes to indicate if there exist
messages targeted for the cache or for the directory. As before, both cache and directory keep
requesting the next message as long as the corresponding status bit indicates that there are such
messages. Each resolver generates the index of the next queue to be polled among the queues
with the corresponding attribute. If the two resolvers decide to read the same queue, one
controller will have to wait or the resolvers must be capable of coordinating so that the same
queue is not selected. If the resolvers are not coordinated and it is possible for both controllers to
request access to the same queue, the cache controller should have priority because it is also
serving the processor (Data-Ack or Ownership-Ack messages) and the amount of time the
processor is stalled waiting for data should be minimized.
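A minimal software model of the two resolvers and the priority rule might look like this (round-robin polling over the attribute bits is an assumed selection policy, and the function names are illustrative):

```python
def resolve(attr_bits, last_index, n):
    """Round-robin: index of the next queue whose attribute bit is set,
    starting after last_index; None if no queue has a pending message."""
    for step in range(1, n + 1):
        i = (last_index + step) % n
        if attr_bits & (1 << i):
            return i
    return None

def arbitrate(cache_choice, dir_choice):
    """If both resolvers pick the same queue, the cache controller has
    priority (it also serves the processor) and the directory waits."""
    if cache_choice is not None and cache_choice == dir_choice:
        return cache_choice, None
    return cache_choice, dir_choice

n = 4
cache_attr = 0b0110   # queues 1 and 2 hold cache-bound messages
dir_attr = 0b0010     # queue 1 holds a directory-bound message
c = resolve(cache_attr, last_index=0, n=n)
d = resolve(dir_attr, last_index=0, n=n)
c, d = arbitrate(c, d)   # conflict on queue 1: the cache wins
```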
Figure 3.7: Dual Resolver Implementation
3.4 PERFORMANCE ANALYSIS
In the remaining sections of this chapter, three distinct sets of results are presented:
1) The first set of results is obtained by the Trace-driven simulator described below. The Trace-
driven simulation provides a detailed model of the processes and memory and the DSM
operation of every node on the SOME-Bus and keeps track of every memory access by each
processor and its effect on individual data blocks.
2) A statistical simulation of a simplified queuing network model of the SOME-Bus is also
presented. In this simulation, the DSM operation is modeled in a probabilistic manner; an
event may have one or more specific outcomes, each with an associated probability. When an
event occurs, one of the related outcomes is selected according to the associated probabilities.
The statistical simulation is a simplified version of the Trace-driven simulation and does not
require maintaining the detailed state of the DSM system.
3) The final set of results is provided by a theoretical model that closely approximates the
distribution-driven simulation model, and makes additional simplifying assumptions to allow
a closed-form solution based on traditional queuing-network theory.
The three approaches described above are used so that they may validate each other. One
important result is that a system with such complex behavior can be modeled by relatively simple
and tractable models, which can provide reasonable estimates of processor utilization and
message latencies, using parameters that characterize real applications.
3.4.1 Trace-Driven Simulation
A trace-driven simulator was created in order to examine the performance of Fault
Tolerance and DSM on the SOME-Bus architecture. The parameters of the CC-NUMA
multiprocessor to be simulated are specified as inputs to the simulator. Examples of these
parameters include the number of nodes, the number of threads per node, amount of memory
allocated to each node and the cache structure of each node (cache size, number of cache blocks
and number of bytes per block).
The applications executed by the simulator consist of a sequence of memory references
and a type of access (read or write) for each address. The applications used with the simulator
come from actual multiprocessor address trace files and from artificially generated address
streams. Each thread has its own address sequence and stops running when it has finished
processing all of its memory references. The execution of the application is complete when all
threads have finished processing their respective memory references. The simulated execution of
the application is briefly described below. A more detailed description is provided later in this
chapter.
A memory reference from the currently running thread’s address sequence is examined to
see if the memory reference will hit in the cache. If the address causes a cache hit the thread
remains in the RUNNING state and the next address in the thread’s sequence will be examined on
the next simulator clock cycle.
If the memory reference causes a miss in the cache but can be obtained from the local
memory, the thread waits for the access to complete in the SUSPENDED state. When the access
completes the thread transitions back to the RUNNING state and the next address in the thread’s
sequence will be examined on the next simulator clock cycle.
If the memory reference causes a miss in the cache and the memory access cannot be
filled by the local node, the thread is placed in the BLOCKED state and another thread from the
pool of ready threads is chosen and placed in the RUNNING state. The new thread will begin
running by processing the next address in its sequence on the next simulator clock cycle.
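The three cases above can be condensed into a small state-transition sketch (the names are assumptions; the actual simulator tracks far more state per thread and per node):

```python
from collections import deque

def service_reference(running, ready, blocked, outcome):
    """outcome is 'hit', 'local_miss' or 'remote_miss'. Returns the new
    state of the current thread and the thread running next cycle."""
    if outcome == "hit":
        return "RUNNING", running       # no stall, keep running
    if outcome == "local_miss":
        return "SUSPENDED", running     # wait for local memory, then resume
    # Remote miss: block the thread and switch to a ready thread.
    blocked.append(running)
    return "BLOCKED", ready.popleft() if ready else None

ready, blocked = deque(["T1"]), []
state, nxt = service_reference("T0", ready, blocked, "remote_miss")
```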
The operation of the processor, directory controller, cache controller and channel
controller is simulated in detail in order to provide information needed to compare the
performance of each component for the DSM system before and after the proposed Fault
Tolerance approaches have been applied.
All simulation results are based on time units equal to one clock cycle. The performance
of the multiprocessor is evaluated in terms of processor and channel utilization, the number of
simulation cycles required to execute an application (address traces or synthetic workload),
average waiting times for the queues, and average round-trip times for Data and Ownership
request messages. In addition, performance characteristics relating to the checkpointing
procedures are also presented. All simulations reported in this thesis assume a SOME-Bus system
configuration with 32 nodes (2 threads per node), 64 cache blocks per node, 64 bytes per block
and either 2785 or 6828 memory blocks per node. The amount of memory per node was chosen in
order to satisfy the memory requirements of the applications for which the multiprocessor address
trace files were obtained.
There are two mechanisms for providing the simulator with multiprocessor applications:
address trace files and artificially generated address streams. A set of three multiprocessor
address trace files was obtained from the trace database supported by the Parallel Architecture
Research Laboratory (PARL) at New Mexico State University. Multiprocessor traces for the
programs called speech, simple and weather were obtained from the TraceBase website
(http://tracebase.nmsu.edu/tracebase.html). Details
about these three applications are provided in [7]. The weather application models the
atmosphere around the globe using finite difference methods to solve a set of partial differential
equations describing the state of the atmosphere. Speech uses a modified Viterbi search algorithm
to find the best match between paths through a directed graph representing a dictionary and
paths through a directed graph representing spoken input, in order to provide the lexical
decoding stage for a spoken language understanding system developed at MIT. Simple is an
application that models fluids, using finite difference methods to solve a set of behavior-
describing equations.
Table 3.1 provides a comparison between trace files weather, speech and simple. The
processor utilization is very low for all three traces. The reasons for this can be seen in Figure
3.8, which provides the distribution of the memory references by showing the number of
references that were directed to each node's local memory space for each of the three applications.
It can be seen from this figure that all three of the applications contain hotspots. An analysis of
the trace address references indicated that in both the weather and simple applications there is a
very large number of references directed to a single data byte. The cause of this behavior is a
barrier synchronization that occurs in both of the applications. Repeated accesses are made to a
barrier flag as processors spin-lock on it. In order to reduce the hot spots created by the
synchronization variable, the simulator code was modified in order to redirect accesses to these
variables to another node that was the target for no other memory references.
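The redirection can be illustrated with a simple address-to-home remapping. The addresses, block size, and block-interleaved mapping rule here are hypothetical; the simulator's actual mechanism may differ:

```python
BLOCK_SIZE = 64
NUM_NODES = 32

def home_node(addr, remap):
    """Default block-interleaved home node, overridden for remapped blocks."""
    block = addr // BLOCK_SIZE
    return remap.get(block, block % NUM_NODES)

barrier_addr = 0x4000   # hypothetical hot synchronization variable
idle_node = 31          # a node that is the target of no other references
remap = {barrier_addr // BLOCK_SIZE: idle_node}

hot_home = home_node(barrier_addr, remap)   # redirected to the idle node
other_home = home_node(0x0040, remap)       # mapping unchanged
```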
Hot spots were also observed in the Speech application trace, but for different reasons.
The hot spots were due to the frequent accesses to the data structures comprising the two
dictionaries used in the application. Instead of a single data block being continuously accessed,
the entire dictionary structure is scanned repeatedly. In this case, the hot spot can be avoided by
dividing the dictionary data structure into smaller units that can be distributed among the nodes.
Figure 3.9 provides the resulting distribution of the address references, after the highly accessed
data blocks were relocated as described above and Table 3.2 shows the corresponding
performance characteristics. The barrier synchronization prevents the processor utilization in the
weather and simple applications from improving significantly with the reorganization of the highly
accessed data blocks. The processor utilization in speech was raised by 3% but is still comparatively
low, although the total execution time was shortened by 80%. The total execution times of the
weather and simple applications were lowered by 8% and 10%, respectively.
Table 3.1: Comparison of Multiprocessor Trace Files

                                    weather     speech       simp
Address References per Node         496,313     183,932      422,345
Simulation Time                     7,129,800   76,281,918   14,077,177
Processor Utilization               11.38%      0.64%        6.16%
Channel Utilization                 31.84%      6.58%        20.28%
Avg Time Msg Spends in Output Q     103.046     575.627      222.37
Cache Read Ratio                    90.34%      78.24%       91.80%
Cache Miss Ratio                    2.46%       16.33%       3.83%
Cache Remote Miss Ratio             2.97%       18.03%       4.06%
Figure 3.8: Distribution of Address References
Figure 3.9: Distribution of Address References, after Relocation
Table 3.2: Simulation Data, after Relocation

                                    weather     speech       simp
Address References per Node         496,313     183,932      422,345
Simulation Time                     6,540,632   14,678,029   12,691,795
Processor Utilization               11.59%      3.56%        7.53%
Channel Utilization                 32.32%      41.37%       26.62%
Avg Time Msg Spends in Output Q     95.122      75.713       196.353
Cache Read Ratio                    90.34%      78.25%       91.80%
Cache Miss Ratio                    2.39%       16.22%       3.89%
Cache Remote Miss Ratio             2.89%       18.28%       4.11%
Characteristics specific to certain applications such as those just described can have a
very large impact on the performance of a system on which the application is being simulated.
As can be seen in Tables 3.1 and 3.2, applications with similar basic characteristics such as
read/write ratio and miss ratio can provide very different performance due to other contributing
factors. The behavior of multiprocessor applications is a complex interaction of read and write
memory operations, cache miss rate, memory access patterns (temporal and spatial locality) and
interprocessor sharing of data. There is no such thing as a "typical" application and therefore
system evaluation must include a wide range of different types of applications in order to assure
satisfactory performance under most conditions. The limited availability of multiprocessor traces
that cover a wide range of application behaviors is a major disadvantage to this approach.
When using simulation as a tool for evaluating new architectures or approaches, the use
of synthetically generated workloads can provide a much larger degree of flexibility than actual
traces. For this reason, there is a need for effective synthetic workload generation that can
provide reasonable coverage over a large range of application behaviors.
The behavior of real applications can be approximated by adjusting input parameters used
for the address generation scheme. Differences between the application behavior for synthetic
workloads and the address trace files can be attributed to temporal variations in the applications,
such as bursts of accesses to memory locations on the same node, to asymmetrical memory
reference patterns (some nodes get a much higher or much lower percentage of the memory
references than other nodes), or to the pattern of alternating read and write operations to a
particular memory block. These types of differences can also be found in real world applications.
Artificially generated workloads also provide an opportunity to experiment with other system
parameters such as the number of threads or number of nodes for which the address trace files
have fixed values.
For synthetic workloads used in our simulations, the memory access behavior of the
application can be specified by associating probabilities with a set of options for the next address
to be accessed. The set of available options for our synthetic workload generator was chosen to
show the differences in the proposed fault tolerance protocols described in the next chapter. The
application characteristics that are to be compared are the cache hit ratio, the degree of locality
and the degree of data sharing between the nodes.
In order to study the behavior of the system under varying degrees of loads and
application behaviors, it is necessary to characterize the traffic in the network and through the
processing nodes, in terms of the address patterns caused by the application. The miss rate (and
especially the remote miss rate) is not sufficient to characterize the application behavior. In the
analysis described here, several parameters are used to indicate the degree of locality and sharing
of data between threads at different nodes.
When a node is ready to generate a new memory reference, that memory reference can be
an address that is currently in the node’s cache, an address that falls within the node’s local
memory, or an address that falls within another node’s memory. In addition, the memory
reference could specify a block that is currently being accessed by or has recently been accessed
by another node. Selection of proper probability values associated with each of the behaviors
described above is used to control the degree of spatial locality for the memory references and
the amount of interaction between threads. Specifically, the following parameters are used:
1. Parameter L indicates the probability that the reference is to the node’s local memory.
2. Parameter N indicates the probability that the reference is to a designated neighboring
node’s memory2.
3. Parameter S indicates the probability that the reference belongs to any remote node’s
memory space and has been accessed by another thread in the system in the past.
4. The probability that the reference is to one of the blocks already in the cache is
1 − (L + N + S).
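A generator following these four parameters might be sketched as below; the candidate block lists for each category are placeholder assumptions, and the real workload generator tracks considerably more state:

```python
import random

def next_reference(rng, L, N, S, local_blocks, neighbor_blocks,
                   shared_blocks, cached_blocks):
    """Pick the next block to reference. With probability 1-(L+N+S)
    the reference hits a block already in the cache."""
    r = rng.random()
    if r < L:
        return rng.choice(local_blocks)      # node's local memory
    if r < L + N:
        return rng.choice(neighbor_blocks)   # designated neighbor's memory
    if r < L + N + S:
        return rng.choice(shared_blocks)     # remote block accessed before
    return rng.choice(cached_blocks)         # cache-resident block

rng = random.Random(42)
refs = [next_reference(rng, 0.10, 0.02, 0.05, [0], [1], [2], [3])
        for _ in range(10_000)]
local_fraction = refs.count(0) / len(refs)   # close to L = 0.10
```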
Applications of interest can be described by the following cases:
1. Threads tend to access arbitrary blocks at a designated neighboring node and the current
node. This case indicates a smaller degree of locality within the current node and a
smaller degree of sharing between threads. The parameters of interest are L=0.01, S=0.01
and N=[0.01,...,0.10]. (Case N01).
2. There is little sharing between threads. Within the current node there is a larger degree of
locality. The parameters of interest are L=0.10, S=0.02 and N=[0.01,...,0.10]. (Case N10).
3. Threads tend to access blocks at any remote node that have already been accessed by
another thread. This case indicates a smaller degree of locality within the current node
but a larger degree of sharing between threads. The parameters of interest are L=0.01,
N=0.01 and S=[0.01,...,0.10]. (Case S01).
4. Threads tend to access blocks at any remote node that have already been accessed by
another thread. Within the current node there is a larger degree of locality. The
parameters of interest are L=0.10, N=0.02 and S=[0.01,...,0.10]. (Case S10).
2 Although there are no neighbors in the SOME-Bus, a node X may be neighboring some other node Y in a functional sense. As discussed in following chapters, node Y is a neighbor to node X if node Y contains a copy of node X's recovery memory.
Figures 3.10 and 3.11 show the processor and channel utilization for these four types of
application behavior. The horizontal axis is the remote cache miss ratio, i.e. the percentage of
cache misses that result in a request to a remote node. The figures show the effect of placing
emphasis on each one of the parameters. The S01 and S10 application cases have a larger channel
utilization and a slightly lower processor utilization than the N01 and N10 application cases
causing an increase in the job queue waiting time. The reason for this is that additional messages
must be sent in a DSM system upon a write request to a block that is shared by other nodes. In
addition to the Data or Ownership request message and corresponding acknowledge message,
invalidation and invalidation-acknowledge messages would be required. If the requested block
were in the Exclusive state, a writeback message would also be required. When there is no
sharing of data, the message exchange is limited to a Data or Ownership request and the
corresponding acknowledge.
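The extra traffic described above can be sketched in a few lines. This is a minimal illustration, not the simulator's code; it assumes header-only messages of size SH and data-carrying messages of size SD (these sizes are defined later in this chapter):

```python
def write_to_shared_cost(SH, SD, n_sharers, exclusive=False):
    """Approximate network traffic for a remote write to a block cached elsewhere.

    SH is the size of a header-only message, SD the size of a message that
    carries a data block.  Sketch only; message accounting follows the
    protocol description in the text.
    """
    cost = SH + SD                      # Ownership request + acknowledge
    if exclusive:
        cost += SH + SD                 # Invalidation + writeback of the block
    else:
        cost += SH + n_sharers * SH     # Invalidation multicast + per-sharer acks
    return cost
```

With no sharers and no exclusive owner the exchange reduces to the bare request/acknowledge pair, which is why the unshared cases generate less channel traffic.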
[Figure: plot of utilization (0-100%) versus remote cache miss ratio (0-14%) for ProcUtil:N01, ProcUtil:S01, ChanUtil:N01 and ChanUtil:S01.]
Figure 3.10: Processor Utilization and Channel Utilization: Low Level of Locality
The Case S01 and S10 experiments represent the more conventional behavior of a
program. The Case N01 and N10 experiments are more specialized because they reflect direct
interaction between two nodes. The N10 case shows more local activity, which causes less
channel utilization than N01. The differences between Case S01 and S10 are due to the
probability that the memory reference can be found in the requesting node's local memory.
When the data can be read from the node's local memory, no external messages are generated.
This can be seen in the lower channel utilization of Case S10.
[Figure: plot of utilization (0-100%) versus remote cache miss ratio (0-14%) for ProcUtil:N10, ProcUtil:S10, ChanUtil:N10 and ChanUtil:S10.]
Figure 3.11: Processor Utilization and Channel Utilization: High Level of Locality
A memory reference may be processed locally (without sending messages to another
node) depending upon the type of request (Data or Ownership), the state of the block (SHARED
or EXCLUSIVE), and if the block is SHARED, on how many nodes are sharing at the time the
request for the block is made.
3.4.2 Decision Tree
A Decision Tree provides all possible sequences of events that can occur during the
processing of a memory reference for any combination of access type (read or write), location of
the memory block (local or remote) and state of the memory block (UNOWNED, SHARED,
EXCLUSIVE). The tree provides a mechanism for comparing the behavior of different
applications in terms of the network traffic generated as a result of the processing of the
address references.
Each path in the tree provides a weighted cost in terms of network traffic for all messages
generated along that path. Identifying the paths that are more frequently taken by different types
of applications allows the different fault tolerance algorithms discussed in the following chapters
to be optimized in order to reduce the additional network traffic due to fault tolerance.
As mentioned previously, each thread has associated with it a sequence of memory
references. These references can be predetermined (and read from a file) as in the case of the
multiprocessor traces, or can be generated at runtime using probabilistic values for specified
behavior (cache hit, locality, sharing of data between nodes).
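A runtime generator of this kind can be sketched as follows. The function and parameter names are illustrative, not the simulator's: p_hit is the cache-hit probability, and L, N, S correspond to the locality, neighbor and sharing parameters of the application cases described above:

```python
import random

def next_reference(n_nodes, node_id, p_hit, L, N, S):
    """Classify one synthetic memory reference for the thread on node_id.

    Sketch of a probabilistic workload generator: first decide hit vs. miss,
    then decide where the miss is served from using the L/S/N parameters.
    """
    if random.random() < p_hit:
        return ("hit", node_id)
    r = random.random()
    if r < L:
        return ("local-miss", node_id)           # served from local memory
    if r < L + S:
        return ("shared-miss", None)             # block already touched by another thread
    if r < L + S + N:
        # a fixed partner node stands in for the "neighbor" of Case N01/N10
        return ("neighbor-miss", (node_id + 1) % n_nodes)
    return ("remote-miss", random.randrange(n_nodes))
```

Reading the reference stream from a trace file instead simply replaces this function with a file iterator.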
When the thread begins to process the next memory reference, it determines whether the
memory access is a read or write operation. If the access type is a read operation, the branch of
the tree that will be traversed while processing the memory reference is shown in Figure 3.12.
This section of the tree can be exited via six separate paths identified by the letters A-F. The
particular branch taken at each fork in the tree will affect the size and number of messages that
will appear on the SOME-Bus channels as part of the processing of this memory reference.
[Figure: decision tree for read accesses by Node A, branching on cache hit, home location, and block state (modified or not), with exit paths A-F. Network costs: A = 0, B = SH + SD, C = 0, D = SH + SD, E = SH + SD, F = 2 * (SH + SD).]
Figure 3.12: Decision Tree Branch for Read Accesses
There are three major possibilities for the number and type of messages generated during
the progression through the tree. One possibility is that no messages are generated at all. If the
reference causes a cache hit, the thread will remain in the RUNNING state and will process a new
memory reference on the next clock cycle. The second possibility occurs when the reference is a
cache miss, but can be serviced from the local memory on the node without coordinating with
other nodes through Invalidation or Downgrade messages. In this case, the thread is suspended
during the reading/writing of the local memory and then returns to the RUNNING state and will
process a new memory reference on the next clock cycle. A context switch does not occur in this
situation. The third possibility occurs when the memory access causes a cache miss and the state
of the data block requires either Invalidation or Downgrade messages to be sent to nodes with a
copy of the data block. In this case a message must be sent to the home directory for the data
block. If the home node for the data block is the same node that is requesting the data, the
message will be local (does not appear on the output queue or channel); otherwise the message is
remote and will be placed on the sending node's output queue and then eventually transmitted
across the node's output channel.
The size and number of messages that are generated along a particular tree path
determines the network cost associated with that path. The network cost of sending a message is
found by multiplying the size of the message in bytes by the number of clock cycles required to
transmit a single byte. Local messages do not contribute to the network cost. There are two types
of messages in terms of message size. Messages that contain a header and no data payload, such
as request or Invalidation messages have a message size of SH. Messages that have both a header
and a data payload are of size SD. In general, SD = SH + number of bytes in a cache block. The
network cost associated with each of the paths A-F is provided in Figure 3.12.
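The cost rule above reduces to a one-line calculation. The sketch below assumes a 64-byte cache block, which is consistent with the SH = 10 and SD = 74 values used later in this chapter; the names are illustrative:

```python
def message_cost(size_bytes, cycles_per_byte=1):
    # Network cost of one remote message: bytes times cycles per byte.
    # Local messages never reach the output channel and therefore cost 0.
    return size_bytes * cycles_per_byte

BLOCK_BYTES = 64          # assumed cache-block size for this sketch
SH = 10                   # header-only message (requests, Invalidations)
SD = SH + BLOCK_BYTES     # header plus data payload (acknowledges with data)
```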
The tree branch shown in Figure 3.12 contains six possible paths for the processing of the
memory reference depending on the state of the data block and the location of the home node for
the data block. Tree Path A is taken if Node A is processing a read request for an address that is
found in the cache in either SHARED or EXCLUSIVE state. This path ends with a cache hit and
no messages are generated. For this reason, the network traffic cost of Path A is 0.
If a thread from Node A begins to process a read request for an address that misses in the
cache and Node A is the Home node for the memory block and the block is currently in the
EXCLUSIVE state (owned by Node C), Tree Path B is taken. The messages generated along this
path include a local Data request message to the directory controller of Node A (network cost =
0), a remote Downgrade request message sent to the cache controller on Node C (network cost =
SH), a remote DowngradeWB-Ack (containing the data block’s current value) sent from Node C
to the directory controller on Node A (network cost = SD), and a local Data-Ack message
containing the data block which is sent to the cache controller on Node A (network cost = 0).
Since local messages do not appear in the output queue or on the channel, the total network cost
of this path is SH + SD for the Downgrade request and DowngradeWB-Ack messages.
In the Tree Paths for C and D, the memory reference misses in the cache, but is found in
the SHARED or UNOWNED state in the memory and therefore no Downgrade request messages
are required. In Path C, the data is found in the local memory of Node A and therefore no
messages are exchanged between the cache and directory controller. The network cost of Path C
is 0. In Path D, the home for the data block is not Node A so the messages generated along this
path include a remote Data Request message and a remote Data-Ack message, both of which
contribute to the network cost of this path.
Tree Paths E and F are similar to Tree Path B except that the home node for the data block
is not Node A. In Path E, a remote Data request message, a local Downgrade message, a local
DowngradeWB message and a remote Data-Ack message are generated. Only the remote Data
request and Data-Ack contribute to the network cost. The Downgrade and DowngradeWB
messages are local in this case because the cache in the home node of the data block is the current
owner of the data block. This results in a network cost of SH + SD for Path E. In Path F, the
current owner is Node C and all messages generated along this path (Data request, Downgrade,
DowngradeWB, Data-Ack) are remote messages that contribute to the network cost of this path,
resulting in a total network cost of 2 * (SH + SD).
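The six read-path costs can be collected into a small lookup. This sketch simply restates Figure 3.12:

```python
def read_path_cost(path, SH, SD):
    """Network cost of read-access paths A-F (Figure 3.12).

    SH = header-only message size, SD = message carrying a data block.
    Only remote messages contribute; local messages cost nothing.
    """
    costs = {
        "A": 0,                 # cache hit: no messages at all
        "B": SH + SD,           # remote Downgrade request + DowngradeWB-Ack
        "C": 0,                 # local read miss: every message stays local
        "D": SH + SD,           # remote Data request + Data-Ack
        "E": SH + SD,           # remote request/ack; Downgrades are local
        "F": 2 * (SH + SD),     # all four messages are remote
    }
    return costs[path]
```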
In Table 3.3, the average number of times a particular path A-F is taken by a node is
provided for the Application Cases N01, S01, N10, and S10 described above. Columns A-F show
the number of memory references that were processed on a particular path and the column
entitled Total References shows the number of memory references executed by each node in the
application after the first 1000 clock cycles. The first 1000 clock cycles are ignored in each
experiment in order to avoid the startup transient behavior.
The previous discussion dealt with the portion of the Decision Tree dedicated to read
accesses. The write access portion of the tree has many more paths and is presented here in two
parts. The first part is shown in Figure 3.13 and contains the paths for write accesses that hit in
the cache. The second part is shown in Figure 3.14 and contains the paths for write accesses for
blocks that are not in the cache.
Table 3.3: Decision Tree Path Distribution for Read Accesses

Total References | Application       | A          | B       | C        | D         | E     | F
19629.126        | Case N01: N = .01 | 17148.250  | 0.313   | 180.469  | 315.125   | 0.031 | 11.719
19719.253        | Case N01: N = .02 | 17044.469  | 0.281   | 176.438  | 505.000   | 0.063 | 9.281
19835.565        | Case N01: N = .05 | 16595.813  | 0.313   | 179.781  | 1056.281  | 0.000 | 6.063
19899.345        | Case N01: N = .10 | 15777.063  | 0.313   | 181.875  | 1945.281  | 0.000 | 4.594
19719.253        | Case S01: S = .01 | 17044.469  | 0.281   | 176.438  | 505.000   | 0.063 | 9.281
19796.501        | Case S01: S = .02 | 16950.406  | 0.250   | 178.594  | 663.844   | 0.094 | 16.438
19888.283        | Case S01: S = .05 | 16479.469  | 0.313   | 177.594  | 1195.875  | 0.281 | 36.844
19930.939        | Case S01: S = .10 | 15623.375  | 0.344   | 182.563  | 2059.156  | 1.250 | 74.406
19726.471        | Case N10: N = .01 | 15474.188  | 1.906   | 1782.875 | 471.969   | 0.094 | 19.500
19769.784        | Case N10: N = .02 | 15326.000  | 2.844   | 1781.813 | 669.375   | 0.000 | 17.875
19852.627        | Case N10: N = .05 | 14830.750  | 3.156   | 1781.844 | 1225.531  | 0.125 | 10.406
19915.096        | Case N10: N = .10 | 14005.156  | 2.875   | 1790.750 | 2121.344  | 0.031 | 7.875
19722.22         | Case S10: S = .01 | 15458.969  | 3.219   | 1756.125 | 511.375   | 0.031 | 9.656
19769.784        | Case S10: S = .02 | 15326.000  | 2.844   | 1781.813 | 669.375   | 0.000 | 17.875
19853.876        | Case S10: S = .05 | 14857.875  | 2.719   | 1795.063 | 1196.781  | 0.125 | 24.000
19938.909        | Case S10: S = .10 | 13986.688  | 2.750   | 1804.688 | 2121.875  | 1.594 | 33.375
992166.6563      | weather           | 959301.688 | 9.406   | 1326.375 | 8937.500  | 1.094 | 618.156
367841.0313      | speech            | 228797.313 | 305.594 | 2375.219 | 27607.594 | 2.344 | 9925.156
844409.4063      | simple            | 803736.125 | 24.969  | 1934.656 | 21050.219 | 1.219 | 526.219
Figure 3.13 contains 5 paths, H-L. Path H is taken when the data block is found in the
cache in EXCLUSIVE state and results in a cache hit with a network cost of 0. Paths I and J are
taken when the block was found in the cache, but it was in the SHARED state, requiring an
upgrade to the EXCLUSIVE state. In this case, only the Ownership for the block is desired, it is
not necessary to provide the cache with the data for the block since the block is already contained
in the cache. In Path I, Node A has the only copy of the block and is also the home for the data
block. The messages generated on this path include a local Upgrade request and a local Upgrade-
Ack. The network cost of Path I is 0.
Path J is similar to Path I except that more than one copy of the data block exists in the
caches of other nodes. This requires an Invalidation message to be multicast to the sharing nodes
and the corresponding Invalidation-Ack messages to be collected. The messages generated along
this path include a local Upgrade request, a single Invalidation message multicast, multiple
remote Invalidation-Ack messages, and a local Upgrade-Ack. The network cost of this path is
related to the invalidation of the copies of the data block. There can be multiple Invalidation-Ack
messages that result from a single Invalidation request for a shared data block. The parameter
Ninv shown in Figure 3.13 is the average number of nodes found sharing a data block when an
Invalidation request is made for a block in the SHARED state. If the data block is in the state
EXCLUSIVE, there can only be one owner and therefore only one Invalidation-WB message. If
there are Ninv sharers of the data block, there will be Ninv Invalidation-Ack messages. The
difference between an Upgrade-Ack message and an Ownership-Ack message is that the latter
contains the data block (network cost of SD) and the Upgrade-Ack does not (network cost of SH).
Paths K and L are similar to paths I and J except that Node A is not the home node for the
data block. The messages generated along path K include a remote Upgrade request and a remote
Upgrade-Ack. The messages generated along path L include a remote Upgrade request, a remote
Invalidation message, multiple remote Invalidation-Ack messages and a remote Upgrade-Ack.
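The write-hit path costs of Figure 3.13 can be summarized the same way; Ninv is the average number of sharers that receive the Invalidation multicast:

```python
def write_hit_cost(path, SH, Ninv):
    """Network cost of write-access paths H-L (Figure 3.13): hits in the cache.

    All of these messages are header-only (size SH); no data payload moves
    because the requesting cache already holds the block.
    """
    costs = {
        "H": 0,                    # block already EXCLUSIVE: pure cache hit
        "I": 0,                    # local Upgrade request + local Upgrade-Ack
        "J": (1 + Ninv) * SH,      # Inv multicast + Ninv Invalidation-Acks
        "K": 2 * SH,               # remote Upgrade request + Upgrade-Ack
        "L": (3 + Ninv) * SH,      # remote req/ack + Inv + Ninv Inv-Acks
    }
    return costs[path]
```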
[Figure: decision tree for write accesses by Node A that hit in the cache, branching on block state (modified or shared), home location, and the set of sharers, with exit paths H-L. Network costs: H = 0, I = 0, J = (1 + Ninv) * SH, K = 2 * SH, L = (3 + Ninv) * SH.]
Figure 3.13: Decision Tree Path Distribution for Write Accesses to Blocks in Cache
[Figure: decision tree for write accesses by Node A that miss in the cache, branching on home location and block state (UNOWNED, SHARED, MODIFIED), with exit paths M-S. Network costs: M = SH + SD, N = (2 * SH) + (2 * SD), O = SH + SD, P = ((Ninv + 2) * SH) + SD, Q = (Ninv + 1) * SH, R = SH + SD, S = 0.]
Figure 3.14: Decision Tree Path Distribution for Write Accesses to Blocks not in Cache
Figure 3.14 contains 7 paths, M-S. Paths M-P are taken when Node A is not the home
node for the data block. In paths M and N the block is found to be in EXCLUSIVE state. Path M
generates a remote Ownership request, a local Invalidation message, a single local
InvalidationWB-Ack message, and a remote Ownership-Ack message. The network cost is due to
the Ownership request and Ownership-Ack messages. In Path N, the owner of the block is not the
home node and therefore all messages generated along this path (Ownership request, Invalidation,
InvalidationWB, and Ownership-Ack) are remote messages and contribute to the network cost of
this path.
In Path O, the data block is found to be UNOWNED so the only messages are the remote
Ownership request and Ownership-Ack. In Path P, the data block is shared by caches on nodes
other than Node A. This path generates a remote Ownership request, a single remote Invalidation
message multicast, multiple remote Invalidation-Ack messages, and a remote Ownership-Ack.
Paths Q, R, and S are similar to paths P, N, and O, respectively, except that the Ownership
request and Ownership-Ack messages are local and do not contribute to the network cost of the
path. In the case of Path S, since the data block is UNOWNED, no Invalidation messages are
necessary, and because the home for the data block is local, no Ownership request message needs
to be sent. No messages are generated at all on this path, resulting in a network cost of 0.
Tables 3.4 and 3.5 provide the average number of times a node takes a particular path H-S
for the Application Cases N01, S01, N10, and S10. Columns H-S show the number of memory
references that were processed on a particular path, and the column entitled Total References
shows the number of memory references executed by each node in the application after the first
1000 clock cycles.
The net cost for each path in the system is shown in Tables 3.6, 3.7 and 3.8. The total
network cost is obtained by multiplying the number of times that each path was taken by the
network cost associated with taking that path. The values used in the tables are SH = 10 and
SD = 74. The column Ninv contains the average number of nodes containing a copy of the data
block when an Invalidation message was sent for a block in the SHARED state.
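As an illustration of how the table entries are produced, the sketch below recomputes Total_A for Case N01, N = .01 from the path counts in Table 3.3 and the read-path costs of Figure 3.12:

```python
# Read-path costs (Figure 3.12) with the sizes used in the tables:
SH, SD = 10, 74
path_cost = {"A": 0, "B": SH + SD, "C": 0, "D": SH + SD,
             "E": SH + SD, "F": 2 * (SH + SD)}

# Average path counts for Case N01, N = .01 (first row of Table 3.3).
path_count = {"A": 17148.250, "B": 0.313, "C": 180.469,
              "D": 315.125, "E": 0.031, "F": 11.719}

# Total network cost = sum over paths of (times taken) * (cost per traversal).
total_A = sum(path_cost[p] * path_count[p] for p in path_cost)
# Matches the Total_A entry of Table 3.6 for this case: 28468.188
```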
Table 3.4: Decision Tree Path Distribution for Write Hit Accesses

Total References | Application       | H         | I      | J      | K        | L
19629.126        | Case N01: N = .01 | 1912.313  | 0.031  | 0.000  | 0.156    | 3.625
19719.253        | Case N01: N = .02 | 1904.938  | 0.000  | 0.000  | 0.313    | 2.094
19835.565        | Case N01: N = .05 | 1859.875  | 0.031  | 0.000  | 0.719    | 0.688
19899.345        | Case N01: N = .10 | 1749.563  | 0.000  | 0.000  | 1.250    | 0.656
19719.253        | Case S01: S = .01 | 1904.938  | 0.000  | 0.000  | 0.313    | 2.094
19796.501        | Case S01: S = .02 | 1888.125  | 0.000  | 0.000  | 0.313    | 3.031
19888.283        | Case S01: S = .05 | 1835.625  | 0.031  | 0.000  | 0.438    | 2.438
19930.939        | Case S01: S = .10 | 1731.188  | 0.063  | 0.000  | 1.469    | 0.688
19726.471        | Case N10: N = .01 | 1720.406  | 0.188  | 0.000  | 0.063    | 5.813
19769.784        | Case N10: N = .02 | 1695.313  | 0.188  | 0.000  | 0.281    | 3.375
19852.627        | Case N10: N = .05 | 1663.344  | 0.000  | 0.000  | 0.719    | 1.063
19915.096        | Case N10: N = .10 | 1551.594  | 0.063  | 0.000  | 1.281    | 0.375
19722.22         | Case S10: S = .01 | 1723.500  | 0.000  | 0.000  | 0.219    | 1.844
19769.784        | Case S10: S = .02 | 1695.313  | 0.188  | 0.000  | 0.281    | 3.375
19853.876        | Case S10: S = .05 | 1638.281  | 0.156  | 0.000  | 0.156    | 3.688
19938.909        | Case S10: S = .10 | 1552.250  | 0.031  | 0.000  | 0.625    | 0.781
992166.6563      | weather           | 2767.219  | 68.688 | 8.094  | 2941.406 | 553.031
367841.0313      | speech            | 69303.938 | 42.156 | 3.000  | 273.969  | 9739.688
844409.4063      | simple            | 3864.375  | 78.688 | 12.000 | 3219.063 | 233.063
Table 3.5: Decision Tree Path Distribution for Write Miss Accesses

Total References | Application       | M     | N       | O        | P       | Q     | R     | S
19629.126        | Case N01: N = .01 | 0.031 | 1.281   | 28.500   | 8.000   | 0.125 | 0.094 | 19.063
19719.253        | Case N01: N = .02 | 0.000 | 0.844   | 52.094   | 5.250   | 0.000 | 0.000 | 18.188
19835.565        | Case N01: N = .05 | 0.000 | 0.813   | 112.500  | 3.125   | 0.125 | 0.094 | 19.344
19899.345        | Case N01: N = .10 | 0.000 | 0.594   | 215.906  | 2.594   | 0.031 | 0.031 | 19.594
19719.253        | Case S01: S = .01 | 0.000 | 0.844   | 52.094   | 5.250   | 0.000 | 0.000 | 18.188
19796.501        | Case S01: S = .02 | 0.031 | 2.188   | 63.969   | 9.781   | 0.125 | 0.031 | 19.281
19888.283        | Case S01: S = .05 | 0.031 | 4.125   | 113.250  | 21.844  | 0.094 | 0.031 | 20.000
19930.939        | Case S01: S = .10 | 0.281 | 7.906   | 181.656  | 46.219  | 0.094 | 0.031 | 20.250
19726.471        | Case N10: N = .01 | 0.000 | 1.938   | 40.000   | 10.938  | 1.156 | 0.281 | 195.156
19769.784        | Case N10: N = .02 | 0.000 | 1.594   | 63.313   | 10.250  | 1.281 | 0.438 | 195.844
19852.627        | Case N10: N = .05 | 0.031 | 1.156   | 132.844  | 5.063   | 1.594 | 0.438 | 194.563
19915.096        | Case N10: N = .10 | 0.000 | 0.563   | 233.063  | 3.094   | 1.531 | 0.438 | 195.063
19722.22         | Case S10: S = .01 | 0.000 | 0.750   | 52.219   | 4.438   | 1.656 | 0.188 | 198.031
19769.784        | Case S10: S = .02 | 0.000 | 1.594   | 63.313   | 10.250  | 1.281 | 0.438 | 195.844
19853.876        | Case S10: S = .05 | 0.000 | 2.844   | 119.906  | 13.031  | 1.469 | 0.219 | 197.563
19938.909        | Case S10: S = .10 | 0.188 | 3.594   | 210.594  | 20.188  | 1.375 | 0.469 | 197.844
992166.6563      | weather           | 0.750 | 416.656 | 3567.906 | 14.844  | 0.500 | 6.281 | 150.406
367841.0313      | speech            | 1.844 | 49.313  | 376.938  | 44.313  | 0.688 | 1.719 | 156.406
844409.4063      | simple            | 1.500 | 327.875 | 4845.094 | 125.531 | 7.406 | 5.781 | 200.969
Table 3.6: Net Cost for Paths A Through F

Application       | Ninv     | A     | B         | C     | D           | E       | F           | Total_A
Case N01: N = .01 | 6.43883  | 0.000 | 26.292    | 0.000 | 26470.500   | 2.604   | 1968.792    | 28468.188
Case N01: N = .02 | 6.285106 | 0.000 | 23.604    | 0.000 | 42420.000   | 5.292   | 1559.208    | 44008.104
Case N01: N = .05 | 4.97619  | 0.000 | 26.292    | 0.000 | 88727.604   | 0.000   | 1018.584    | 89772.480
Case N01: N = .10 | 4.6      | 0.000 | 26.292    | 0.000 | 163403.604  | 0.000   | 771.792     | 164201.688
Case S01: S = .01 | 6.285106 | 0.000 | 23.604    | 0.000 | 42420.000   | 5.292   | 1559.208    | 44008.104
Case S01: S = .02 | 6.130435 | 0.000 | 21.000    | 0.000 | 55762.896   | 7.896   | 2761.584    | 58553.376
Case S01: S = .05 | 3.219231 | 0.000 | 26.292    | 0.000 | 100453.500  | 23.604  | 6189.792    | 106693.188
Case S01: S = .10 | 1.421543 | 0.000 | 28.896    | 0.000 | 172969.104  | 105.000 | 12500.208   | 185603.208
Case N10: N = .01 | 6.947644 | 0.000 | 160.104   | 0.000 | 39645.396   | 7.896   | 3276.000    | 43089.396
Case N10: N = .02 | 6.480084 | 0.000 | 238.896   | 0.000 | 56227.500   | 0.000   | 3003.000    | 59469.396
Case N10: N = .05 | 4.044534 | 0.000 | 265.104   | 0.000 | 102944.604  | 10.500  | 1748.208    | 104968.416
Case N10: N = .10 | 2.91875  | 0.000 | 241.500   | 0.000 | 178192.896  | 2.604   | 1323.000    | 179760.000
Case S10: S = .01 | 5.897638 | 0.000 | 270.396   | 0.000 | 42955.500   | 2.604   | 1622.208    | 44850.708
Case S10: S = .02 | 6.480084 | 0.000 | 238.896   | 0.000 | 56227.500   | 0.000   | 3003.000    | 59469.396
Case S10: S = .05 | 5.637457 | 0.000 | 228.396   | 0.000 | 100529.604  | 10.500  | 4032.000    | 104800.500
Case S10: S = .10 | 1.476923 | 0.000 | 231.000   | 0.000 | 178237.500  | 133.896 | 5607.000    | 184209.396
weather           | 1.026346 | 0.000 | 790.104   | 0.000 | 750750.000  | 91.896  | 103850.208  | 855482.208
speech            | 1.62919  | 0.000 | 25669.896 | 0.000 | 2319037.896 | 196.896 | 1667426.208 | 4012330.896
simple            | 1.095569 | 0.000 | 2097.396  | 0.000 | 1768218.396 | 102.396 | 88404.792   | 1858822.980
Table 3.7: Net Cost for Paths H Through L

Application       | Ninv     | H     | I     | J       | K         | L          | Total_B
Case N01: N = .01 | 6.43883  | 0.000 | 0.000 | 0.000   | 3.120     | 342.158    | 351.716
Case N01: N = .02 | 6.285106 | 0.000 | 0.000 | 0.000   | 6.260     | 194.430    | 206.975
Case N01: N = .05 | 4.97619  | 0.000 | 0.000 | 0.000   | 14.380    | 54.876     | 74.232
Case N01: N = .10 | 4.6      | 0.000 | 0.000 | 0.000   | 25.000    | 49.856     | 79.456
Case S01: S = .01 | 6.285106 | 0.000 | 0.000 | 0.000   | 6.260     | 194.430    | 206.975
Case S01: S = .02 | 6.130435 | 0.000 | 0.000 | 0.000   | 6.260     | 276.743    | 289.134
Case S01: S = .05 | 3.219231 | 0.000 | 0.000 | 0.000   | 8.760     | 151.625    | 163.604
Case S01: S = .10 | 1.421543 | 0.000 | 0.000 | 0.000   | 29.380    | 30.420     | 61.222
Case N10: N = .01 | 6.947644 | 0.000 | 0.000 | 0.000   | 1.260     | 578.257    | 586.464
Case N10: N = .02 | 6.480084 | 0.000 | 0.000 | 0.000   | 5.620     | 319.953    | 332.053
Case N10: N = .05 | 4.044534 | 0.000 | 0.000 | 0.000   | 14.380    | 74.883     | 93.308
Case N10: N = .10 | 2.91875  | 0.000 | 0.000 | 0.000   | 25.620    | 22.195     | 50.734
Case S10: S = .01 | 5.897638 | 0.000 | 0.000 | 0.000   | 4.380     | 164.072    | 174.350
Case S10: S = .02 | 6.480084 | 0.000 | 0.000 | 0.000   | 5.620     | 319.953    | 332.053
Case S10: S = .05 | 5.637457 | 0.000 | 0.000 | 0.000   | 3.120     | 318.549    | 327.307
Case S10: S = .10 | 1.476923 | 0.000 | 0.000 | 0.000   | 12.500    | 34.965     | 48.942
weather           | 1.026346 | 0.000 | 0.000 | 164.012 | 58828.120 | 22266.942  | 81260.100
speech            | 1.62919  | 0.000 | 0.000 | 78.876  | 5479.380  | 450868.663 | 456428.548
simple            | 1.095569 | 0.000 | 0.000 | 251.468 | 64381.260 | 9545.256   | 74179.080
Table 3.8: Net Cost for Paths M Through S

Application       | Ninv     | M       | N         | O          | P          | Q        | R       | S     | Total_C
Case N01: N = .01 | 6.43883  | 2.604   | 215.208   | 2394.000   | 7472.000   | 106.250  | 7.896   | 0.000 | 10195.354
Case N01: N = .02 | 6.285106 | 0.000   | 141.792   | 4375.896   | 4903.500   | 0.000    | 0.000   | 0.000 | 9421.188
Case N01: N = .05 | 4.97619  | 0.000   | 136.584   | 9450.000   | 2918.750   | 106.250  | 7.896   | 0.000 | 12619.480
Case N01: N = .10 | 4.6      | 0.000   | 99.792    | 18136.104  | 2422.796   | 26.350   | 2.604   | 0.000 | 20687.646
Case S01: S = .01 | 6.285106 | 0.000   | 141.792   | 4375.896   | 4903.500   | 0.000    | 0.000   | 0.000 | 9421.188
Case S01: S = .02 | 6.130435 | 2.604   | 367.584   | 5373.396   | 9135.454   | 106.250  | 2.604   | 0.000 | 14985.288
Case S01: S = .05 | 3.219231 | 2.604   | 693.000   | 9513.000   | 20402.296  | 79.900   | 2.604   | 0.000 | 30690.800
Case S01: S = .10 | 1.421543 | 23.604  | 1328.208  | 15259.104  | 43168.546  | 79.900   | 2.604   | 0.000 | 59838.362
Case N10: N = .01 | 6.947644 | 0.000   | 325.584   | 3360.000   | 10216.092  | 982.600  | 23.604  | 0.000 | 14907.880
Case N10: N = .02 | 6.480084 | 0.000   | 267.792   | 5318.292   | 9573.500   | 1088.850 | 36.792  | 0.000 | 16285.226
Case N10: N = .05 | 4.044534 | 2.604   | 194.208   | 11158.896  | 4728.842   | 1354.900 | 36.792  | 0.000 | 17473.638
Case N10: N = .10 | 2.91875  | 0.000   | 94.584    | 19577.292  | 2889.796   | 1301.350 | 36.792  | 0.000 | 23899.814
Case S10: S = .01 | 5.897638 | 0.000   | 126.000   | 4386.396   | 4145.092   | 1407.600 | 15.792  | 0.000 | 10080.880
Case S10: S = .02 | 6.480084 | 0.000   | 267.792   | 5318.292   | 9573.500   | 1088.850 | 36.792  | 0.000 | 16285.226
Case S10: S = .05 | 5.637457 | 0.000   | 477.792   | 10072.104  | 12170.954  | 1248.650 | 18.396  | 0.000 | 23987.896
Case S10: S = .10 | 1.476923 | 15.792  | 603.792   | 17689.896  | 18855.592  | 1168.750 | 39.396  | 0.000 | 38357.426
weather           | 1.026346 | 63.000  | 69998.208 | 299704.104 | 13864.296  | 425.000  | 527.604 | 0.000 | 384582.212
speech            | 1.62919  | 154.896 | 8284.584  | 31662.792  | 41388.342  | 584.800  | 144.396 | 0.000 | 82219.810
simple            | 1.095569 | 126.000 | 55083.000 | 406987.896 | 117245.954 | 6295.100 | 485.604 | 0.000 | 586223.554
Table 3.9 gives a summary of the network cost totals for application cases N01, S01,
N10, and S10. SimTime is the execution time of the application in simulation clock cycles.
Total_A, Total_B, and Total_C are the network cost values taken from Tables 3.6, 3.7 and 3.8.
There are two types of messages that are not taken into account by the Decision Tree and they
relate to capacity cache misses. Victim messages are sent to the home directory to remove a node
from the copyset when a shared block is removed from the cache, and VictimWB messages
contain writeback data for a cache block that was in the EXCLUSIVE state when it was chosen as
a victim by the cache controller. The VictimWB and Victim columns in the table provide the
network cost for these two types of messages. The Total column is the sum of the individual
network cost totals. The Traffic column is the value in the Total column divided by the
Simulation Time and gives a measure of the channel utilization as calculated using the Decision
Tree and the victim message information. The ChanUtil column is the channel utilization
measured in the simulation by taking the ratio of the total number of cycles the channel was busy
transmitting data over the total number of clock cycles in the simulation. Table 3.9 shows that the
Decision Tree accurately describes the contribution of each path in the decision tree to the overall
channel utilization observed for the application as a whole.
Table 3.9: Decision Tree Traffic vs. Channel Utilization

Application       | Sim Time | Total_A      | Total_B    | Total_C    | Victim_WB | Victim    | Total        | Traffic | ChanUtil
Case N01: N = .01 | 59741    | 28,468.19    | 351.72     | 10,195.35  | 14.38     | 216.81    | 39,246.45    | 66%     | 64%
Case N01: N = .02 | 82153    | 44,008.10    | 206.98     | 9,421.19   | 36.47     | 433.78    | 54,106.52    | 66%     | 74%
Case N01: N = .05 | 148445   | 89,772.48    | 74.23      | 12,619.48  | 98.88     | 1,004.91  | 103,569.97   | 70%     | 83%
Case N01: N = .10 | 256286   | 164,201.69   | 79.46      | 20,687.65  | 204.16    | 1,895.22  | 187,068.17   | 73%     | 87%
Case S01: S = .01 | 82153    | 44,008.10    | 206.98     | 9,421.19   | 36.47     | 433.78    | 54,106.52    | 66%     | 74%
Case S01: S = .02 | 108094   | 58,553.38    | 289.13     | 14,985.29  | 48.03     | 571.31    | 74,447.14    | 69%     | 72%
Case S01: S = .05 | 193752   | 106,693.19   | 163.60     | 30,690.80  | 90.13     | 1,143.56  | 138,781.28   | 72%     | 73%
Case S01: S = .10 | 333431   | 185,603.21   | 61.22      | 59,838.36  | 146.28    | 2,095.38  | 247,744.45   | 74%     | 74%
Case N10: N = .01 | 85816    | 43,089.40    | 586.46     | 14,907.88  | 23.63     | 338.00    | 58,945.37    | 69%     | 66%
Case N10: N = .02 | 108389   | 59,469.40    | 332.05     | 16,285.23  | 46.00     | 561.00    | 76,693.67    | 71%     | 72%
Case N10: N = .05 | 175771   | 104,968.42   | 93.31      | 17,473.64  | 117.50    | 1,167.44  | 123,820.30   | 70%     | 80%
Case N10: N = .10 | 290052   | 179,760.00   | 50.73      | 23,899.81  | 218.94    | 2,075.25  | 206,004.74   | 71%     | 84%
Case S10: S = .01 | 84186    | 44,850.71    | 174.35     | 10,080.88  | 35.53     | 436.81    | 55,578.28    | 66%     | 73%
Case S10: S = .02 | 108389   | 59,469.40    | 332.05     | 16,285.23  | 46.00     | 561.00    | 76,693.67    | 71%     | 72%
Case S10: S = .05 | 193622   | 104,800.50   | 327.31     | 23,987.90  | 101.00    | 1,093.13  | 130,309.83   | 67%     | 72%
Case S10: S = .10 | 328579   | 184,209.40   | 48.94      | 38,357.43  | 189.09    | 2,107.19  | 224,912.04   | 68%     | 75%
weather           | 6540632  | 855,482.21   | 81,260.10  | 384,582.21 | 6,485.13  | 6,033.00  | 1,333,842.65 | 20%     | 32%
speech            | 14678029 | 4,012,330.90 | 456,428.55 | 82,219.81  | 510.53    | 21,519.50 | 4,573,009.29 | 31%     | 41%
simple            | 12691795 | 1,858,822.98 | 74,179.08  | 586,223.55 | 7,961.22  | 18,123.06 | 2,545,309.90 | 20%     | 27%
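The Traffic column is reproduced by a single division; for example, for Case N01, N = .01:

```python
# Decision-Tree traffic estimate for Case N01, N = .01 (Table 3.9):
# total network cost in cycles divided by the simulation time in cycles
# approximates the measured channel utilization.
total_cost = 39246.45      # Total column
sim_time = 59741           # SimTime in clock cycles
traffic = total_cost / sim_time
# About 0.66, matching the 66% Traffic entry and close to the measured 64%.
```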
Table 3.10: Distribution of Read or Write Misses Through the Decision Tree Paths
(percent of misses through paths B through S; blank paths round to 0%)

Case N01: N = .01   32% 55% 2% 5% 1% 3%
Case N01: N = .02   23% 66% 1% 7% 2%
Case N01: N = .05   13% 77% 8% 1%
Case N01: N = .10   8% 82% 9%
Case S01: S = .01   23% 66% 1% 7% 2%
Case S01: S = .02   19% 69% 2% 7% 1% 2%
Case S01: S = .05   11% 76% 2% 7% 1% 1%
Case S01: S = .10   7% 80% 3% 7% 2%
Case N10: N = .01   70% 19% 2% 8%
Case N10: N = .02   65% 24% 2% 7%
Case N10: N = .05   53% 36% 4% 6%
Case N10: N = .10   41% 49% 5% 4%
Case S10: S = .01   69% 20% 2% 8%
Case S10: S = .02   65% 24% 2% 7%
Case S10: S = .05   53% 36% 4% 6%
Case S10: S = .10   41% 48% 5% 4%
weather             7% 48% 3% 16% 3% 2% 19%
speech              5% 54% 19% 19%
simple              6% 65% 2% 10% 1% 15%
Table 3.10 shows the percent of read or write misses that were processed through each of
the Tree Paths, rounded to the nearest 1%. This table provides a mechanism to compare the
behavior of the simulated applications. It also shows that our approach to generating synthetic
workloads produces application behavior that resembles that of the trace files.
Path C is traversed when the address references are local. The fact that the trace
applications have a very low degree of locality (due to the frequent remote barrier accesses and
the remote dictionary data structure access) accounts for the differences between the synthetic
and trace-based applications for Path C.
Path F handles read accesses to remote data blocks that are in the Exclusive state and
Path L handles upgrade requests for remote data blocks that are shared by nodes other than the
requesting node. Speech has a higher value in both of these paths due to the scanning of the
dictionary data and the subsequent update to a shared global database.
3.4.3 Queuing Network Theoretical Model
The system consisting of the processor, the cache, the directory and the channel
controller can be represented by a set of queues through which messages of different types flow.
Figure 3.15 shows the queue organization. Although no messages are delivered directly to the
processor, the arrival of Data-Ack and Ownership-Ack messages causes threads to become ready
for execution and therefore, affects the processor operation. Messages of other types used to
support coherence arrive at either the directory or the cache queue. Data and ownership request
messages from remote nodes arrive at the directory queue. Since the directory may have to send
and receive coherence messages in response to the arrival of a single data and ownership request
message, there is also a pending queue where such data and ownership request messages wait
until the coherence transactions are performed. Figure 3.16 shows the flow of messages through
several nodes for the case where a processor read operation causes a miss to a remote block that is
in EXCLUSIVE state.
Figure 3.15: Processor, Cache, Directory and Channel Queues
Figure 3.16: Message Flow in the Queue System
In this section, a model of a SOME-Bus system is developed and results are compared to
previous results from the Trace-driven simulation of the SOME-Bus. Theoretical analysis of the
SOME-Bus system is based on queuing network models.
In a SOME-Bus system with N nodes, each node contains a processor with cache,
memory, an output channel and a receiver that can receive messages simultaneously on all N
channels. Each node also contains a directory that maintains coherence information on the section
of the distributed memory that is implemented in that node (the home node).
A multithreaded execution model is assumed: each processor executes a program that
contains a set of M parallel threads. The node in which a group of M threads executes is called
the host or owner node. Threads can only be executed in their owner node. A thread continues
execution until it encounters a global cache miss that requires data or permission from a remote
node. Then, it becomes suspended until the required action is completed (data transferred from a
remote memory or permission received), at which time it becomes ready for execution and
eventually resumes execution. A thread is executed on the processor for runtime R before
becoming suspended on a global cache miss.
A cache miss causes a request message to be enqueued for transmission on the output
channel. After transfer time T expires, the message is enqueued in an input queue at the receiver
of the destination (remote) node, is serviced by the directory at the remote node and another
message is sent back to the originating node with data or acknowledgment. The remote node
memory requires time L to assemble the response message. Time L is spent by the directory
controller, which must perform the necessary accesses on the memory, create the response
message and enqueue it for transmission on the output channel of the remote node. As part of
servicing a message, a directory may send messages to other nodes and receive data or
acknowledgments. A global cache miss may be due to a read or write miss at the local cache.
Accordingly, a Data request or Ownership request message is sent to the directory of the home
node. When the message is received, the required data block may be in SHARED state or
52EXCLUSIVE state. The home directory may send a data message back to the requestor (in the
case of reading a shared block) or it may send Downgrade or Invalidation messages and collect
Invalidation Acknowledge (Invalidation-Ack) messages.
Since M indicates the maximum number of outstanding requests that a node may have
before it blocks, it is assumed that when a node has fewer than M outstanding requests, it generates
messages with mean interval of R time units between requests. R is the mean thread runtime. A
request message generated by a node is directed to any other node with equal probability. These
assumptions on node operation are similar to those made in [39] and to those used in [1] and
[15], which study the performance of DSM in a torus system.
Since a message is sent by a node to the home node, which responds with another
message back to the originating node, this type of operation can be represented by a multi-chain,
closed queuing network, where messages receive complicated forms of service at certain server
stations. The M threads owned by each node form a separate class of messages with population
equal to M. When m < M threads have outstanding messages, the remaining M-m messages are
served in FCFS order at the processor. A message receives service (of geometrically distributed
time with mean R) at the owner processor, is enqueued at the output channel of the owner node,
receives service (of time T) by the channel, is enqueued in a receiver input queue at a remote
node, receives service (of time L) by the directory at the remote node, and is similarly transferred
back to the owner node through the remote output channel and owner receiver input queue.
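As a rough sketch of the service sequence above, the round-trip latency of a single request can be written as the sum of the stage times. The function below is only illustrative (its name, parameters and zero-wait defaults are ours, not the model's); queuing delays are included only if supplied:

```python
def round_trip_time(T, L, w_out=0.0, w_in=0.0, w_ret=0.0):
    """Round-trip latency of one request following the sequence in the
    text: wait at the owner's output channel, transfer time T, wait in
    the remote receiver input queue, directory service time L, then the
    return transfer through the remote output channel.
    Queue waits default to zero (an uncontended system)."""
    return (w_out + T) + (w_in + L) + (w_ret + T)
```

With no contention, the round trip reduces to 2T + L, the two channel transfers plus the directory service.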
Due to the symmetry of the system, all directory controllers behave in a similar fashion.
Therefore, a system with N=M+1 nodes shows the same performance as a system with N > M+1
nodes. Since only a small number of threads per node is typically feasible, a relatively small queuing
network representing a SOME-Bus system with M+1 nodes can be used to calculate performance
measures. Node P (P=0,1,...,M) owns M regular data messages that circulate through the
processor, the directory controllers of node P and the other M nodes and the necessary channels.
These messages represent the Data request and Ownership request messages from a remote node
to the home node and the Data-Ack and Ownership-Ack messages sent back to the originating
node. Different chains are used to distinguish between messages belonging to different nodes, so
that messages in chain P belong to node P. Messages owned by node P are directed to the
directory of any node, including the owner. Messages that arrive at node P from other nodes and
do not belong to chain P are sent to the directory of node P. Figure 3.17 shows the queuing
network resulting from a system with N=4 nodes (M=3), and traces the paths of the four chains
present in the system.
Figure 3.17: Four-node Queuing Network
The complexity of the model arises from the fact that there are several other messages
that are used only to maintain coherence. These messages do not pass through the processors.
Instead they are generated by the directory controllers, go through the channels, possibly interact
with cache controllers and return to the originating directory controllers. Figure 3.18 shows the
traffic through one node.
Figure 3.18: Traffic Through a Single Node in the Queuing Network
The service time at the processor is the thread run time. The coherence traffic determines
the service time at the directory controller. We assume that a data message caused by a cache
miss is due to a read reference with probability prd, and that a block may be found in shared state
with probability psh. There are four distinct cases, each with a different action by the directory
controller at the home node:
1. Data-request to a block in shared state (probability prd * psh). The directory controller
discards the data-request message and returns a data-acknowledge message back to the
requesting node. From the queuing network point of view there is no distinction between
a data-request and a data-acknowledge message. This activity can simply be viewed as a
data-request message performing one trip from its own processor to the home directory
controller and back to its own processor. Its service time at the home directory controller
is the time L1 that the directory needs to compose the data-acknowledge message with the
copy of the requested block.
2. Data-request to a block in EXCLUSIVE state (probability prd * (1-psh)). A Downgrade
message is sent by the home directory to the node with ownership of the block. That node
sends a DowngradeWB-Ack message to the home node, which sends a Data-Ack to the
requesting node.
3. Ownership request to a block in EXCLUSIVE state (probability (1-prd) * (1-psh)). An
Invalidation message is sent by the home directory to the node with ownership of the
block. That node sends an InvalidationWB-Ack (Invalidation Acknowledge with write-back)
message to the home node, which sends an Ownership-Ack to the requesting node.
4. Ownership request to a block in shared state (probability (1-prd) * psh). The home
directory controller broadcasts invalidations to all nodes having a copy of the requested
block. It collects all invalidate-acknowledge messages and then sends an Ownership-Ack
message to the requesting node.
The average service time experienced by a data message at the home directory controller
is therefore
L = L1 * prd * psh + 2 * (L2 + L3) * (prd * (1-psh) + (1-prd) * (1-psh)) + L4 * (1-prd) * psh   (3.1)
where L1 is the time that the directory needs to compose the data-acknowledge message
with the copy of the requested block, L2 is the channel transfer time, L3 is the average channel
queue time and L4 = (L2+L3)+L5 is the mean time to send the Invalidation message and collect the
Invalidation-Ack messages.
We assume that the cache controller works at much higher speed than the network, so
that the response time of the cache controller can be ignored. If the number of blocks which must
be invalidated on each Ownership request is fixed at Ninv, then L5 is the mean value of a random
variable equal to the maximum of Ninv identical random variables, each being equal to the service
and queuing time at the channel server. In several applications and in most experiments described
here, it is observed that typically one Invalidation message is sent. In that case, L5 = L2+L3 and
L4 = 2 * (L2+L3). Then (3.1) becomes
L = L1 * prd * psh + 2 * (L2 + L3) * (prd * (1-psh) + (1-prd) * (1-psh)) + 2 * (L2 + L3) * (1-prd) * psh
L = L1 * prd * psh + 2 * (L2 + L3) * (prd * (1-psh) + (1-prd) * (1-psh) + (1-prd) * psh)
L = L1 * prd * psh + 2 * (L2 + L3) * (prd * (1-psh) + (1-prd))
L = L1 * prd * psh + 2 * (L2 + L3) * (1 - prd * psh)   (3.2)
In fact, the contribution of Invalidation messages to the directory service time is even
smaller, because it is possible that the requested block is in the UNOWNED state and no
Invalidation messages need to be sent.
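Equation (3.2) can be encoded directly. The sketch below assumes Ninv = 1 as in the text; the function and parameter names are ours:

```python
def directory_service_time(p_rd, p_sh, L1, L2, L3):
    """Mean directory service time per equation (3.2), assuming one
    Invalidation message per Ownership request (Ninv = 1), so that
    L4 = 2 * (L2 + L3).
    p_rd: probability a miss is a read; p_sh: probability the block is
    found in SHARED state; L1: time to compose a data-acknowledge
    message; L2: channel transfer time; L3: mean channel queue time."""
    return L1 * p_rd * p_sh + 2.0 * (L2 + L3) * (1.0 - p_rd * p_sh)
```

When every miss is a read to a shared block (prd = psh = 1), the service time reduces to L1, matching case 1 above.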
The coherence traffic also passes through the channels. Its interference with the data
messages can be approximated by noting that coherence messages absorb some fraction of the
service rate of the channel server, making the channel service time of the data messages
appear larger. This apparent service time is
T' = T * (KD + KC) / KD   (3.3)
where T is the real channel mean transfer time, KD=M(M+1) is the total number of data
messages in the system, and KC is the mean number of coherence traffic messages present in the
system. Since coherence messages are created due to the arrival of data messages into the
underlying coherence mechanism, KC can be calculated using Little’s result: KC = lD * L4, where
lD is the arrival rate of data messages into the underlying coherence mechanism and L4 =
2(L2+L3) is the time that a coherence message exists in the system. The arrival rate lD is lD = (1-
57prd psh) * KD / RD, where RD is the data message mean roundtrip time, and the (1- prd psh) factor is
necessary because data-request messages to a block in shared state do not result in additional
coherence traffic.
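Equation (3.3) combined with Little's result gives the apparent channel time. A minimal sketch, with names of our choosing:

```python
def apparent_channel_time(T, M, p_rd, p_sh, L2, L3, R_D):
    """Apparent channel service time T' from equation (3.3).
    K_D = M * (M + 1) is the data-message population; the coherence
    population K_C follows from Little's result with arrival rate
    lambda_D = (1 - p_rd * p_sh) * K_D / R_D and message lifetime
    L4 = 2 * (L2 + L3).  R_D is the data-message mean roundtrip time."""
    K_D = M * (M + 1)
    lam_D = (1.0 - p_rd * p_sh) * K_D / R_D
    K_C = lam_D * 2.0 * (L2 + L3)
    return T * (K_D + K_C) / K_D
```

When prd * psh = 1 no coherence traffic is generated, K_C = 0, and T' collapses to the real transfer time T.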
Assuming geometrically distributed processing times, this queuing network model can be
solved by standard techniques, with the one exception that the service time
at the directory server depends on the queuing time at the channel server queue. It is possible to
solve the model iteratively, by assuming an initial L3 value, and repeatedly solving the model and
updating L3 until the current and next values of L3 are sufficiently close. Such an iteration does
converge, because a large (or small) assumed value of L3 results in low (or large) channel
utilization and a small (or large, respectively) subsequent value of L3. However, the simulation
results show moderate channel utilization, which results in relatively small queuing times. To
keep the queuing network model simple, we set L3 = 0, and essentially ignore the dependence of
the directory service time on the channel wait time. Using the reduced channel service rate, and
the estimate of the directory service time, the closed queuing network with 3(M+1) queues,
(M+1) chains and M messages in each chain can be solved using techniques described in [4].
Comparisons with simulation results indicate that such an assumption is justified.
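The iterative solution can be sketched with a single-class mean value analysis (MVA) approximation of the network. The thesis solves the full multi-chain model with the techniques of [4]; this reduced single-chain, three-server version is only illustrative, and all names are ours:

```python
def mva_closed(service, K):
    """Exact single-class Mean Value Analysis for a closed network of
    FCFS queues with the given mean service times and K customers.
    Returns (throughput, per-queue residence times)."""
    Q = [0.0] * len(service)
    R = list(service)
    X = 0.0
    for k in range(1, K + 1):
        R = [s * (1.0 + q) for s, q in zip(service, Q)]  # arrival theorem
        X = k / sum(R)
        Q = [X * r for r in R]
    return X, R

def solve_iteratively(R_run, T, L1, p_rd, p_sh, K, tol=1e-6):
    """Fixed-point iteration on the channel queue time L3: assume L3,
    build the directory service time L from (3.2) with L2 = T, solve the
    closed network of processor, channel and directory servers, update
    L3 from the channel residence time, and repeat until convergence."""
    L3 = 0.0
    for _ in range(1000):
        L = L1 * p_rd * p_sh + 2.0 * (T + L3) * (1.0 - p_rd * p_sh)
        X, R = mva_closed([R_run, T, L], K)
        L3_new = R[1] - T  # waiting time at the channel queue
        if abs(L3_new - L3) < tol:
            break
        L3 = L3_new
    return X, L3
```

Setting tol very large (or returning after the first pass with L3 = 0) reproduces the simplification adopted in the text, where the dependence of the directory service time on the channel wait is ignored.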
To perform a comparison between simulation and theoretical results, the parameters of
the queuing network model must be related to some key measurements of the simulation.
Specifically, the inputs to the theoretical model are the mean service times at the processor,
directory and channel servers, and the probability that a miss can be satisfied by a block in the
local node memory. The directory and channel service times are calculated using (3.2) and (3.3).
The processor service time and the local probability are obtained from the simulation results.
From the Trace-driven simulation we extract a number of parameter values that are used
to characterize the application, and especially its effect on the processor and the network. In
general, the proper values of these parameters may be measured or estimated given a specific
application, or common values applicable to a wide range of applications may be used. The
relevant parameters are
PRead = probability that a memory access is a read
PMiss = probability of a cache miss
PLoca = probability of a local access
PModi = probability that a node finds one of its blocks modified by another node
PUpgr = probability that a write access finds the required block in the cache in SHARED state
PShmo = probability that a node finds one of its blocks SHARED or EXCLUSIVE and not
UNOWNED
Pmo = probability that a node performs a remote access and finds the required block in
EXCLUSIVE state
Parameter PLoca indicates the degree to which individual processes in the application
tend to access the local memory of the node on which they run. Similarly, PModi, PUpgr and Pmo
indicate the degree of interaction between processes as they read and write in shared areas of the
distributed memory. Parameter PShmo also relates to process locality and indicates the degree to
which blocks tend to be accessed repeatedly and are consequently found in SHARED or
EXCLUSIVE state rather than UNOWNED state. From these parameters, we calculate the following:
PWrit = 1 - PRead = probability that a memory access is a write
PHit = 1 - PMiss = probability of a cache hit
PRemo = 1 - PLoca = probability of a remote access
Pdl = PRead * PMiss * PLoca * PModi = probability of a Data request to the local directory
Puol = PWrit * PHit * PUpgr * PLoca = probability of an Upgrade request to the local directory
Prol = PWrit * PMiss * PLoca * PShmo = probability of an Ownership request to the local directory
Pdr = PRead * PMiss * PRemo = probability of a Data request to a remote directory
Puor = PWrit * PHit * PUpgr * PRemo = probability of an Upgrade request to a remote directory
Pror = PWrit * PMiss * PRemo = probability of an Ownership request to a remote directory
Pol = Puol + Prol = probability of an Ownership request (including upgrades) to the local directory
Por = Puor + Pror = probability of an Ownership request (including upgrades) to a remote directory
Pml = Pdl + Pol = probability of any request to the local directory
Pmr = Pdr + Por = probability of any request to a remote directory
Pam = Pml + Pmr = probability of any request
1 / Pam = mean processor run time before a message is generated
PLM = Pml / (Pml + Pmr) = probability that a request message is sent to the local directory
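The derived quantities above follow mechanically from the measured parameters. A sketch (variable names mirror the text; the dict layout is ours):

```python
def derived_probabilities(p_read, p_miss, p_loca, p_modi, p_upgr, p_shmo):
    """Derive the request probabilities listed above from the measured
    application parameters."""
    p_writ, p_hit, p_remo = 1 - p_read, 1 - p_miss, 1 - p_loca
    p_dl = p_read * p_miss * p_loca * p_modi    # Data request, local
    p_uol = p_writ * p_hit * p_upgr * p_loca    # Upgrade request, local
    p_rol = p_writ * p_miss * p_loca * p_shmo   # Ownership request, local
    p_dr = p_read * p_miss * p_remo             # Data request, remote
    p_uor = p_writ * p_hit * p_upgr * p_remo    # Upgrade request, remote
    p_ror = p_writ * p_miss * p_remo            # Ownership request, remote
    p_ml = p_dl + p_uol + p_rol                 # any request, local
    p_mr = p_dr + p_uor + p_ror                 # any request, remote
    p_am = p_ml + p_mr                          # any request at all
    return {"P_ml": p_ml, "P_mr": p_mr, "P_am": p_am,
            "mean_run_time": 1.0 / p_am, "P_LM": p_ml / p_am}
```

The mean processor run time before a message is generated is simply the reciprocal of the total request probability, 1 / Pam.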
3.4.4 Comparison Between the Simulators and the Theoretical Model
Figures 3.19 through 3.30 show the processor utilization, channel utilization and channel
queue waiting time obtained from the Trace-driven simulation, the Statistical simulation and the
queuing network model described above. A comparison is provided for each of the four
application cases (N01, N10, S01, S10).
The figures show that all three approaches provide very similar values for Processor
Utilization and Channel Utilization. The main differences between the two simulators and the
queuing model are message size, victim messages (writeback and notification-only victims), and
the number of invalidations. The theoretical model does not take into consideration the victim
messages, which will cause the values for the channel utilizations and the waiting time in the
output queue to be lower than in the Trace-driven simulation. As the probability of local access
increases, a larger percentage of the victim messages will be local and will not contribute to either the
channel utilization or the waiting time in the output queue. Figure 3.19 shows that as the
probability of data sharing between the Home and Home2 nodes increases, the channel utilization
also increases. This is due to the increase in the number of victim messages. In Figure 3.20,
when the Remote Miss ratio is low, the fact that the probability of a local access is higher has the
predominant effect and the channel utilization is lower. As the probability that data is shared
between Home and Home2 increases, so does the number of victim messages in the system.
Internode sharing of data increases for both the S01 and S10 cases and this causes the number of
upgrade messages to increase (as opposed to Ownership messages). Upgrade messages do not
carry data and therefore make a smaller contribution to the channel utilization. As the sharing
increases, data blocks in the cache tend to be invalidated rather than victimized. In Figures
3.21 and 3.22 the theoretical and Trace-driven values for the channel utilization are much closer.
The Statistical simulator has a higher channel utilization in all cases mainly because it uses a
fixed message size. For the same reason, the Statistical simulator also has a longer waiting time
at the output queue. In the theoretical model, the message size is exponentially distributed. In
the Trace-driven simulation, the messages are of two sizes, 74 bytes for a message containing
data and 10 bytes for a message that does not contain data.
The Statistical simulator and the theoretical model assume low values for Ninv. The
value of Ninv depends on the state of the memory block when a write request arrives at the
directory. If the block is UNOWNED, Ninv = 0; if the block is in the EXCLUSIVE state, Ninv =
1; and if the block is SHARED, Ninv equals the number of sharers. The Trace-driven simulator is the only one
capable of accurately determining the value for Ninv. The Decision Tree Paths that are affected
by Ninv are J, L, P, and Q. Tables 3.7 and 3.8 indicate that the paths P and L are major
contributors to the overall network cost of an application due to the higher values for Ninv.
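Since L5 is the mean of the maximum of Ninv identically distributed channel times, its growth with Ninv can be estimated by Monte Carlo. The sketch below assumes exponentially distributed channel times (the text only requires that they be identically distributed); names and defaults are ours:

```python
import random

def mean_max_channel_time(n_inv, mean_time, trials=20000, seed=1):
    """Monte Carlo estimate of L5: the mean of the maximum of n_inv
    independent channel service-plus-queue times, drawn here from an
    exponential distribution with the given mean."""
    if n_inv == 0:
        return 0.0  # UNOWNED block: no Invalidation messages at all
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0 / mean_time) for _ in range(n_inv))
    return total / trials
```

For Ninv = 1 the estimate approaches the single-message time L2 + L3, recovering L4 = 2(L2 + L3); larger Ninv (Paths P and L) inflates L5 and hence the directory service time.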
The difference in the processor utilization and the channel utilization for all three
approaches is less than 10%. The difference in output channel waiting time between the Trace-driven
simulator and the theoretical model is approximately 5-10 clock cycles. The similarities
in the results from each of the approaches tend to validate each other and provide confidence in
the results presented in this thesis. Another important result of this analysis is that a system with
such complex behavior can be modeled by relatively simple and tractable models, which can
provide reasonable estimates of processor utilization and message latencies, using parameters that
characterize real applications.
[Figures 3.19 through 3.30 plot each quantity against the Probability of Remote Miss (0 to 0.14) for the Trace-driven simulation (T_Sim_FT0), the Statistical simulation (S_Sim_FT0) and the theoretical model (TH_FT0); utilizations are in percent and waiting times in clock cycles.]
Figure 3.19: Channel Utilization: Case N01
Figure 3.20: Channel Utilization: Case N10
Figure 3.21: Channel Utilization: Case S01
Figure 3.22: Channel Utilization: Case S10
Figure 3.23: Channel Queue Waiting Time: Case N01
Figure 3.24: Channel Queue Waiting Time: Case N10
Figure 3.25: Channel Queue Waiting Time: Case S01
Figure 3.26: Channel Queue Waiting Time: Case S10
Figure 3.27: Processor Utilization: Case N01
Figure 3.28: Processor Utilization: Case N10
Figure 3.29: Processor Utilization: Case S01
Figure 3.30: Processor Utilization: Case S10
CHAPTER 4: FAULT TOLERANCE
A multiprocessor system based on the SOME-Bus can be composed of hundreds of
nodes. As the number of nodes in the system increases, the likelihood that the system will
experience temporary faults causing application errors or complete node failures also increases.
For this reason, the ability to tolerate node faults and failures becomes essential for parallel
applications with large execution times.
A popular approach to fault tolerance is known as Backward Error Recovery. Backward
Error Recovery enables an application that encounters an error to restart its execution from an
earlier, error-free state. This is achieved through the periodic saving of system information, which
is restored when an error is detected. The information that is collected represents a snapshot of
the program and processor state and is referred to as a checkpoint or recovery point. The two
main requirements of a checkpoint are that it contains a consistent system state and that it is saved
to a stable storage medium.
This chapter presents four protocols for fault tolerance based on consistent checkpointing.
The first protocol, FT0, is the traditional approach to consistent checkpointing. The remaining
three protocols, FT1, FT2 and FT3, take advantage of the inherent multicast capabilities of the
SOME-Bus in order to reduce the fault-tolerance overhead observed in FT0. In addition,
the optimized protocols are shown to improve the performance of the DSM in some applications.
4.1 FAULT TOLERANCE AND DISTRIBUTED SHARED MEMORY ON THE SOME-BUS
Typically, in order to ensure that the checkpoint is consistent over the entire system, the
processes are synchronized before the checkpoint is established and then resume normal
execution after the checkpoint has been saved. In order to avoid excessive memory requirements,
it is only necessary to keep a single checkpoint during normal operation. When a new checkpoint
is created, it replaces the old one. Since it is possible to have an error during the establishment of
a checkpoint, it is crucial that the operation be atomic. This means that the entire checkpoint must
successfully be created before replacing the old checkpoint. If the entire checkpoint cannot be
created, the previous one should be preserved.
The checkpoint must be saved to a storage medium that can be accessed after a failure
occurs. Saving large amounts of data to disk can be extremely time consuming. Another approach
that has been proposed [18] is to utilize the memory of another node to store checkpoints. This
approach satisfies the storage requirement for tolerating a single node failure, or multiple node
failures where all copies of the checkpoint do not reside on the failed nodes. Storing the
checkpoint in the memory of another node instead of writing the checkpoint to disk reduces the
amount of time that is required to save the checkpoint.
A number of recoverable distributed shared memory systems were described in chapter 1.
Many of these approaches use the built-in mechanism for data replication and transfer that exists
in a DSM to hide some of the overhead of fault tolerance. The DSM system described in this
thesis differs from those described in chapter 1 because it is hardware-based, implements the
sequential consistency model, uses the cache block as the unit of coherence, and in some
situations is capable of using the data maintained for fault tolerance to improve the performance
of the DSM. In the following sections, four fault tolerance protocols, FT0, FT1, FT2 and FT3,
are presented which exploit the capabilities of the SOME-Bus Architecture to provide a
recoverable DSM with little or no performance degradation over the basic DSM system.
4.2 FT0 PROTOCOL
In the FT0 protocol, each node keeps a full copy, or checkpoint, of its local memory,
known as its recovery memory. In addition, every node also contains a copy of the recovery
memory of another node. Node X contains its own local memory and recovery memory as well as
a copy of Node Y’s recovery memory, as shown in Figure 4.1. If Node Y encounters an
application error that does not contaminate its recovery memory, Node Y can use its own
recovery memory to restore the previous checkpoint state and then restart along with the rest of
the nodes in the system. In the case of a complete failure and replacement of Node Y, the copy of
Node Y’s recovery data that resides on Node X can be used to initialize the replacement node. In
either case, after the memory of Node Y has been restored, the system can perform the Backward
Error Recovery.
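The backup arrangement of Figure 4.1 (Node X holds Node Y's recovery data, Y holds Z's, Z holds X's) amounts to a ring mapping. The modular formulation below is our assumption of how it generalizes to N nodes; the thesis only gives the three-node example:

```python
def backup_node(i, N):
    """Node that stores the copy of node i's recovery memory, assuming
    the ring arrangement of Figure 4.1: each node holds the recovery
    data of its successor, so node i's backup is its predecessor."""
    return (i - 1) % N
```

With this mapping every node backs up exactly one other node, so a single node failure never destroys both copies of any recovery memory.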
Figure 4.1: FT0 Memory Organization of SOME-Bus Node
Initially the recovery memory is created by making a copy of the local memory.
However, it is not necessary to copy the entire contents of the local memory when subsequent
checkpoints are taken. Instead, the recovery data can be incrementally updated, since it is only
necessary to save data that has been modified since the last checkpoint [18]. The process of
creating a checkpoint involves three phases: the synchronization phase, the cache writeback phase
and the recovery memory update phase, which are described below.
The first phase is the synchronization phase. At system start up, one of the processors is
designated the coordinator for all checkpoint procedures. The coordinator sends a message to all
processors directing them to suspend normal operation and synchronize in order to begin the
checkpointing process. During the synchronization phase, the processors do not issue new
requests for data. Threads that are either in the ready or running state are suspended. If the
processor has threads that are blocked waiting for the response to a previously issued data
request, it waits for the outstanding request to complete before notifying the coordinator that the
processor is ready to begin the checkpoint. When all outstanding data requests have completed
and all processors have indicated their readiness to begin the checkpoint, the cache writeback
phase begins.
During the cache writeback phase, the cache controller in each node searches the cache
looking for blocks in the EXCLUSIVE state. The cache controller combines the data blocks that
must be written back to a particular node (the home node for the data blocks) into a single
message. The number of messages sent by the cache controller depends on the number of home
nodes for which the cache contains a data block in the EXCLUSIVE state. After each cache
controller sends the necessary writeback messages, the main memory of every node contains the
most recent value of all data blocks in preparation for the checkpoint. At the end of the cache
writeback phase, all blocks in the cache will be in the SHARED state (the choice of changing the
state to SHARED, rather than leaving blocks EXCLUSIVE, is discussed later in this section). After
the cache writeback phase is complete, the checkpointing process enters the recovery memory update phase.
In the recovery memory update phase, the directory controller in each node copies the
blocks in main memory that have been modified since the previous checkpoint to the recovery
memory. At the same time, these modified blocks are added to an update message that is sent to
the backup node so that it may update the backup copy of the recovery memory. A fault that
occurs during this phase could cause the recovery memory to become corrupted. For this reason,
the recovery memory update operation must be atomic. The first step is to create a temporary
copy of the recovery memory with the required updates. When the temporary copy of the
recovery memory on every node has been updated successfully, the processors synchronize again
and the newly created copy of the recovery memory becomes the recovery memory for the
current checkpoint and the previous recovery memory is deleted. All nodes synchronize again and
then resume normal operation. If any node has a failure while the new copy of the recovery data
is being created, the checkpointing procedure can be stopped and then restarted using the original
recovery memory that remained unmodified. The original recovery memory is not deleted until
every node indicates that it has successfully created the new recovery memory.
A “modified” attribute is added to the directory entry for each memory block to indicate
that the block has been modified in the interval between two checkpoints. After the recovery
memory has been updated, all of the “modified” attributes are cleared since all cache blocks are
in the SHARED state. When a directory controller grants ownership of a block to a requesting
node, the “modified” attribute is set so that the block will become part of the next checkpoint.
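A minimal sketch of the modified-attribute bookkeeping and the atomic recovery-memory update follows. The names (`NodeState`, `checkpoint`) and the dict-based memories are hypothetical, and the global synchronization between nodes is elided:

```python
class NodeState:
    """Illustrative model of one node's main memory, recovery memory, and
    per-block "modified" attributes (not the SOME-Bus data structures)."""
    def __init__(self, memory):
        self.memory = dict(memory)     # main memory: block id -> current value
        self.recovery = dict(memory)   # recovery memory from the last checkpoint
        self.modified = set()          # blocks modified since that checkpoint

    def grant_ownership(self, block):
        # Granting ownership marks the block for inclusion in the next checkpoint.
        self.modified.add(block)

def checkpoint(node, backup_recovery):
    """Atomic recovery-memory update: temporary copies are built first; only
    after every node reports success do they replace the old recovery memories,
    so a failure mid-checkpoint leaves the original recovery memory intact."""
    update = {b: node.memory[b] for b in node.modified}   # the update message
    temp_primary = {**node.recovery, **update}
    temp_backup = {**backup_recovery, **update}           # sent to the backup node
    # Commit step (after global synchronization in the real protocol):
    node.recovery = temp_primary
    node.modified.clear()     # all cache blocks are SHARED after the writeback
    return temp_backup        # becomes the backup node's copy

n = NodeState({"a": 0, "b": 0})
backup = dict(n.recovery)
n.memory["a"] = 5
n.grant_ownership("a")
backup = checkpoint(n, backup)
```

After the call, both the primary and backup recovery memories hold the updated value and the modified set is empty, ready for the next interval.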
Downgrading all cache blocks in the EXCLUSIVE state to the SHARED state during the
cache writeback phase can cause a rush of Ownership requests immediately following the
checkpoint creation. Another approach is to have the cache controllers perform the writeback but
keep the blocks in the EXCLUSIVE state. This optimization may require some additional
complexity when the recovery memory is updated. If the cache blocks are allowed to remain in
the EXCLUSIVE state after they have been written back to main memory during the cache
writeback phase, the “modified” attribute for these blocks should not be cleared after the recovery
memory has been updated. If the behavior of the application is such that multiple write operations
are performed to the cache blocks once ownership has been obtained, allowing the blocks to
remain in the EXCLUSIVE state after the writeback can prevent a burst of ownership (or
upgrade) requests following the checkpoint. However, if the blocks tend to only be written once
or twice before being requested by another thread, allowing a cache to maintain ownership of the
EXCLUSIVE blocks following the writeback could needlessly increase the size of the update
messages that are transmitted to the backup nodes during the recovery memory update phase.
This occurs because the “modified” attribute for any block in the EXCLUSIVE state will remain
set after the checkpoint even though the data block might not be written again by the thread that
owns it before the next checkpoint is created. The approach used in this thesis is to change all
EXCLUSIVE blocks to shared once the cache writeback operation is performed. The main reason
for this is that the experimental results show that the major contributor to the network traffic
overhead associated with fault tolerance is the size of the recovery update message that is
exchanged between each node and its assigned backup node during the recovery memory update
phase.
The FT0 protocol does not cause any overhead in terms of network traffic associated with
fault-tolerance related messages during normal execution. The overhead occurs only when a
checkpoint is being created and is caused by the messages related to the cache writeback phase
and the exchange of update messages between each node and its backup node during the recovery
memory update phase. The total execution time for an application is the combination of the time
required to perform the normal activities required by the application and the time required to
create and maintain the checkpoint information.
The amount of time required to create a checkpoint depends on the number and size of
messages that must be exchanged during the creation of the checkpoint. For this reason, the
number of cache blocks that need to be written back to main memory during the cache writeback
phase and the number of blocks whose “modified” attribute is set and therefore must be added to
the update message in the recovery memory update phase have a direct effect on the overhead
associated with the checkpoint procedure. Depending on the memory access pattern, as the
interval between consecutive checkpoints increases, the number of modified memory blocks may
also increase, resulting in larger update messages and consequently more time spent creating each
checkpoint. The number of cache blocks that must be written back also contributes to the
checkpoint time. The number of EXCLUSIVE blocks in the cache at any particular time is related
to the read probability and miss probability of the application.
Figures 4.2 through 4.5 show the effect that the checkpoint interval has on the overhead
associated with the FT0 protocol. Each figure provides results for one of the application classes
described in Chapter 3 (Case N01, Case N10, Case S01 and Case S10 respectively). Each figure
provides the proportion of the checkpoint creation time required for each of the three phases: the
synchronization phase, the cache writeback phase, and the recovery memory update phase. The
data is provided in terms of the number of simulation clock cycles required to perform the
activity.
Figure 4.2 provides results for the Case N01 class of applications described in Chapter 3.
For the Case N01 class of applications the parameter L and the parameter S are held constant and
set to .01. The parameter N takes on the values .01, .02, .05 and .10. The application type
represented by this combination of L, S, and N is one in which there is a small degree of locality
within a particular node as well as a small degree of sharing between threads. The data in Figure
4.2 is arranged in four pairs where each pair corresponds to one of the four possible values for the
parameter N. Within each pair, the results provided are for checkpoint intervals of 10,000 and
30,000 memory references. The simulated applications consist of 30,000 memory references,
resulting in 3 checkpoints for the first data column (I=10) and a single checkpoint for the second
data column (I=30) in each pair.
The data in Figure 4.3 is organized in pairs as described for the previous figure. Figure
4.3 provides results for the Case N10 class of applications, in which the parameters N and
S have the same values as in the N01 case but the L parameter is .10. The application type
represented by Case N10 is one in which there is a larger degree of locality within a particular
node and there is little sharing of data between threads.
[Stacked bar chart: simulation cycles (0 to 90,000) for N = .01, .02, .05, .10 at checkpoint
intervals I = 10 and I = 30; bar segments: Synchronization, Cache Writeback, Recovery Memory
Writeback.]
Figure 4.2: Checkpoint Intervals of 10,000 and 30,000 for the Case N01 Applications
[Stacked bar chart for Case N10: simulation cycles (0 to 90,000) for N = .01, .02, .05, .10 at
intervals I = 10 and I = 30; bar segments: Synchronization, Cache Writeback, Recovery Memory
Writeback.]
Figure 4.3: Checkpoint Intervals of 10,000 and 30,000 for the Case N10 Applications
Figure 4.4 provides results for the Case S01 class of applications in which the parameter
L and the parameter N are held constant and set to .01. The parameter S takes on the values .01,
.02, .05 and .10. The application type represented by Case S01 is one in which there is a smaller
degree of locality within a particular node and there is a larger degree of sharing of data between
threads.
[Stacked bar chart: simulation cycles (0 to 90,000) for S = .01, .02, .05, .10 at intervals
I = 10 and I = 30; bar segments: Synchronization, Cache Writeback, Recovery Memory Writeback.]
Figure 4.4: Checkpoint Intervals of 10,000 and 30,000 for the Case S01 Applications
Figure 4.5 provides results for the Case S10 class of applications, in which the parameters
N and S have the same values as in the S01 case but the L parameter is .10. The application type
represented by Case S10 is one in which there is a larger degree of locality within a particular
node and there is a larger degree of sharing of data between threads.
[Stacked bar chart for Case S10 (S group, L = .10): simulation cycles (0 to 90,000) for
S = .01, .02, .05, .10 at intervals I = 10 and I = 30; bar segments: Synchronization, Cache
Writeback, Recovery Memory Writeback.]
Figure 4.5: Checkpoint Intervals of 10,000 and 30,000 for the Case S10 Applications
Figures 4.2 through 4.5 show that in all cases, the recovery memory update phase is
responsible for the majority of the time required to create a new checkpoint. The figures also
show that the time required to perform the recovery memory update phase increases as the
checkpoint interval increases for all four of the application classes. The amount of increase in the
time to perform the recovery memory update phase is directly related to the number of data
blocks that are modified during the checkpoint interval. The number of data blocks modified
during the checkpoint interval depends on the behavior of the application, and this is the reason
for the differences in the amount of additional time required to perform the recovery memory
update in Figures 4.2 through 4.5. These application behavior characteristics are described
below.
The probability that a reference will be to a block already in the node’s cache is
1-(L+N+S). The Case N01 and S01 applications have a cache miss ratio in the range of 3% to 12%
while the Case N10 and S10 applications have a cache miss ratio in the range of 13% to 22%.
The difference in the cache miss ratio is due to the increase in the L parameter from .01 to .10.
The miss ratio is the reason for the increase in the number of modified data blocks for Case N10
over Case N01 and for Case S10 over Case S01. An increase in the number of modified data
blocks during the interval between checkpoints causes an increase in the time required to perform
the recovery memory update phase of the checkpoint procedure.
For Case N01 and N10, the probability of sharing is low and constant. As the N
parameter is increased from .01 to .10, the number of remote accesses increases, but the majority
of these accesses are to unique remote locations. This results in a larger number of data blocks
being modified as the checkpoint interval increases. The Case S01 and S10 application classes
have an increased probability of sharing. In these applications, the N parameter is held constant
and the S parameter is increased from .01 to .10. The S parameter corresponds to the probability
that a thread will choose an address that has been accessed previously. For this reason, once a
data block has been accessed, there is an increased probability that the data block will be chosen
again during the interval between the checkpoints. For the same cache miss ratio, the N01 and
N10 class applications will modify more data blocks than the S01 and S10 class applications
because the latter will tend to pick data blocks that have already been modified, resulting in fewer
modified blocks overall. For this reason, the Case S01 and Case S10 applications do not exhibit
the same level of increase in the number of modified blocks when the S parameter is increased as
the Case N01 and Case N10 applications do when the N parameter is increased.
The average number of cache blocks that must be written to main memory for each
checkpoint ranges between 12 and 15 blocks for all experiments and is not affected by the length
of the interval between the checkpoints. A method for theoretically determining these values
based on a set of parameters that describe the application is presented below.
The proposed model assumes a DSM architecture with a Write-Invalidate
cache coherency policy. The system has N nodes, M memory blocks per node, and C cache
blocks per node. The replacement strategy for the cache is the random algorithm with a uniform
distribution. All memory accesses happen with a common clock and in every clock cycle each
processor generates a memory reference. In a single clock period, a particular memory block can
only be referenced by one of the nodes.
The memoryless property of a Markov process requires that the future of the process
depends only upon the present state of the process and not upon its history. A discrete-time
Markov Chain (MC) assumes a discrete state space, with state transitions occurring at integer
units of time.
If the number of blocks in a particular cache is used to represent the state of the cache,
the system can be modeled as a discrete-time MC if it can be shown that the state of the cache at
time t+1 depends only upon the state of the cache at time t and not any previous state. Assuming
a Write-Invalidate cache coherency protocol, changes in the number of modified blocks in a
cache occur as the result of write requests by the local processor, incoming Invalidation or
Downgrade requests, or selection of EXCLUSIVE blocks as victims to be removed from the
cache. Which of these operations occurs at a particular time t depends upon the application
behavior or more specifically upon the sequence of accesses to memory generated by the
processors in the system. In order for the system to be modeled as a MC, neither the type of
operation (read or write) nor the selection of memory address can be related in any way to the
state of the cache. In other words, the program flow of the application (and consequently the
sequence of memory accesses) cannot be influenced by the state of the cache. An example of the
type of interaction that is not allowed would be for a node to purposely avoid accessing a
memory block that is in the EXCLUSIVE state in another cache. In this thesis, the memory
references are assumed to be generated by the processors without knowledge of and completely
independent of the state of any cache in the system. With this assumption, the behavior of a
cache can be approximated using a discrete-time Markov chain. The states (Si for i = 0 to C) used
in the model represent the number of cache blocks in the EXCLUSIVE state in a single cache.
Figure 4.6 shows the state-transition-rate diagram for the MC described above.
[State-transition diagram: states 0, 1, 2, ..., C-1, C, with forward transitions λ0, λ1, ..., λC-1
and backward transitions μ1, μ2, ..., μC.]
Figure 4.6: State-transition-rate Diagram
The caches are assumed to be identical in design and therefore behave identically under
the same conditions. The model is described from the point of view of a single cache, which will
be referred to as the target cache. The local node is the node that contains the target cache and
other nodes are referred to as remote. Since all N nodes generate a memory reference in a single
clock period, when a memory block is chosen, the probability that the node that accessed the
block is the local node is (1/N). The probability that the node that accessed the memory block
was one of the remote nodes in the system is 1-(1/N). The probability of a write access is Pw and
the probability of a read access is 1-Pw. The equations for the transition probabilities will be
presented with the assumption that memory blocks are randomly accessed with a uniform
distribution. The adjustments to the equations that are required when the memory access pattern
is not uniformly distributed are also discussed.
In order to move from state Si to state Si+1, there must be a write access from the local
node which causes the number of EXCLUSIVE blocks in the cache to be increased by 1. This
can be achieved in one of two ways described below.
1. the access is a write operation (Pw) from the local node (1/N) to a block that is in the
cache (C/NM) but not in the EXCLUSIVE state (1 - (i/C)).
2. the access is a write operation (Pw) from the local node (1/N) to a block that is not already
in the cache (1 - C/NM) and the victim block was not in the EXCLUSIVE state (1 - (i/C)).
Equation (4.1) provides λi, the probability of moving from state Si to state Si+1:

λi = Pw (1/N) [ (C/NM)(1 - i/C) + (1 - C/NM)(1 - i/C) ]          (4.1)
λi = Pw (1/N) (1 - i/C)
In order to move from state Si to state Si-1, there must be an access that causes the
number of EXCLUSIVE blocks in the cache to be decreased by 1. This can be achieved in one of
two ways, described below. Equation (4.2) provides μi, the probability of moving from state Si
to state Si-1.
1. The access is a read operation (1-Pw) from the local node (1/N) to a block that is not in
the cache (1 - C/NM) and the resulting victim block is in the EXCLUSIVE state (i/C).
2. The access is either a read or write operation from a remote node (1 - 1/N) to a block that
is in the cache (C/NM) and is in the EXCLUSIVE state (i/C).
μi = (1-Pw)(1/N)(1 - C/NM)(i/C) + (1 - 1/N)(C/NM)(i/C)          (4.2)
Let PBF be the probability that the memory block is found in the cache. If uniform
memory access is assumed, PBF is equal to C/NM and is independent of whether the node
accessing the memory block is the local node or one of the remote nodes. Multiprocessor
applications typically do not have uniform memory access patterns. If there is a high level of
spatial or temporal locality of reference, then there is a higher probability that the memory block
selected by the local node will fall within the local memory space of the node. If memory access
is uniform, PBF = C/NM, but if instead a node always picks a memory block within its own local
memory space (with a uniform distribution), PBF = C/M. When there are higher levels of spatial
locality in applications, PBF is not the same for the local node as for the remote nodes. If the
application exhibits a large amount of interprocessor sharing of data, PBF for the remote nodes will
increase.
Let PBF_L be the probability that the memory block is found in the cache when the block
was selected by the local node, and PBF_R be the corresponding probability when the block was
selected by one of the remote nodes. The necessary changes to the transition probabilities are
shown in (4.3) and (4.4).
λi = Pw (1/N) [ PBF_L (1 - i/C) + (1 - PBF_L)(1 - i/C) ]          (4.3)
λi = Pw (1/N) (1 - i/C)

μi = (1-Pw)(1/N)(1 - PBF_L)(i/C) + (1 - 1/N) PBF_R (i/C)          (4.4)
Once the MC is solved and the state probabilities are found, the expected value for the
number of EXCLUSIVE cache blocks can be determined using (4.5), where C is the number of
cache blocks in each node and i is the number of EXCLUSIVE cache blocks in state Si:

mean number of EXCLUSIVE cache blocks = Σ(i=0 to C) i Si          (4.5)
Figures 4.7 and 4.8 show state probabilities for a system with 32 nodes, 2868 memory
blocks per node, and 64 cache blocks per node using Equations 4.3 and 4.4. Figure 4.7 shows the
state probabilities when Pw = .10, PBF_R = C/MN = .0003 and PBF_L ranges from .0003 to 1. As PBF_L
increases, the steady-state probability of being in the higher states also increases. Figure 4.8
shows the state probabilities when Pw = .10, PBF_L = .9 and PBF_R ranges from 0 to 1. As PBF_R
increases, the steady-state probability of being in the lower states also increases. Table 4.1
provides the average number of cache blocks that were written back at checkpoint time.
[Plot: state probability (0 to 0.25) versus state number (0 to 60); one curve per PBF_L value
(.0003, .2, .4, .6, .8, .9, .95, 1.0).]
Figure 4.7: State Probabilities, Pw= .10 and Pbfr = .0003
[Plot: state probability (0 to 0.3) versus state number (0 to 50); one curve per PBF_R value
(0, .0001, .0002, .0003, .001, .01, .1).]
Figure 4.8: State Probabilities, Pw= .10 and Pbfl = .9
Table 4.1: Average Number of Cache Blocks Written Back at Checkpoint Time

                        Modified Cache Blks
    Case N01: N=.01          15
    Case N01: N=.02          15.09
    Case N01: N=.05          13.63
    Case N01: N=.10          13.28
    Case S01: S=.01          15.09
    Case S01: S=.02          14.97
    Case S01: S=.05          12.72
    Case S01: S=.10          11.19
    Case N10: N=.01          14.78
    Case N10: N=.02          13.5
    Case N10: N=.05          11.88
    Case N10: N=.10          11.63
    Case S10: S=.01          13.94
    Case S10: S=.02          13.5
    Case S10: S=.05          12.53
    Case S10: S=.10          11.69
    wthr_FT0                 28.07
    simp_FT0                 17.93
    speech_FT0               2.3
4.3 FT1 PROTOCOL
As shown previously for the FT0 protocol, the largest source of overhead from
checkpointing is the amount of time required to transfer the updates for the recovery memory to
the backup node. The FT1 protocol reduces the overhead associated with checkpointing by
eliminating the update message exchange during the recovery memory update phase of the
checkpoint creation process. Instead of exchanging large update messages at checkpoint time, the
backup node keeps a consistent backup copy of the primary node’s local memory in addition to
the backup copy of the node’s recovery memory. During the recovery memory update phase of
the checkpoint process, the backup copy of the recovery memory is updated from the
corresponding backup copy of local memory, requiring no additional network communication.
Node X keeps the backup copy of Node Y’s local memory consistent by receiving a copy of any
cache writeback messages to Node Y as they occur. This approach ensures that the backup node
has all the information necessary to update the backup recovery memory without relying on the
home node to explicitly send it at checkpoint time.
Figure 4.9 shows the organization of a node’s memory for the FT1 Protocol. Node X
contains its own local memory and recovery memory as well as a copy of Node Y’s local
memory and Node Y’s recovery memory. Node X is the Home node for data blocks located in its
own local memory (Nx_H). In addition, Node X is the Home2 node for data blocks located in the
copy of Node Y’s local memory, which resides on Node X (Nx_H2). Similarly Node Y contains
Home memory (Ny_H) and Home2 memory (Ny_H2) for data blocks located in the copy of Node
Z’s local memory, which resides on Node Y.
[Diagram: three nodes X, Y, Z; each holds a cache, its Home local memory, a Home2 copy of the
next node’s local memory, its own recovery data, and the next node’s recovery data (Nx_H,
Nx_H2, Ny_H, Ny_H2, Nz_H, Nz_H2).]
Figure 4.9: FT1 Memory Organization of SOME-Bus Node
Data blocks in the Home memory can be in the state EXCLUSIVE, SHARED, or
UNOWNED but data blocks in the Home2 memory can only be in the state SHARED or
INVALID. In Figure 4.10, Node Y is Home for the data block DataB, and Node X is Home2 for
DataB. If DataB is in the state SHARED or UNOWNED in Ny_H, the copy of DataB in Nx_H2
will be in the state SHARED. If DataB is in the EXCLUSIVE state in Ny_H, then the copy of
DataB located in Nx_H2 will be in the state INVALID.
The FT1 protocol must keep the Home2 local memory consistent with the Home local
memory in order to guarantee the correct values of each data block are used to update the backup
recovery memory during the recovery memory update phase of the checkpoint process. Any
additional messages that must be sent in order to keep the Home2 local memory consistent cause
overhead in terms of additional network traffic during normal operation. In FT0, the overhead
due to fault tolerance occurs only at checkpoint time.
[Diagram: Nodes X and Y with caches, Home memories (Nx_H, Ny_H), Home2 memories
(Nx_H2, Ny_H2), and recovery data; data block DataB resides in Ny_H with its Home2 copy in
Nx_H2.]
Figure 4.10: Local and Remote Requests for Data Block DataB
4.3.1 Home2 Consistency
In order to correctly update its copy of the Node Y recovery memory at checkpoint time,
Node X must dynamically keep its Home2 memory (Nx_H2) consistent with Node Y’s Home
memory (Ny_H). Initially Nx_H2 is an exact copy of Ny_H and all blocks in both Nx_H2 and
Ny_H are in the SHARED state. When an Ownership request for a block in Ny_H is received by
Node Y’s directory controller, an Invalidation message is sent to Node X, the corresponding
data block in Nx_H2 is changed from SHARED to INVALID, and an Invalidation-Ack message is
sent from Node X to Node Y. Upon receipt of the Invalidation-Ack message, Node Y changes the
state of the block in Ny_H to EXCLUSIVE and grants the requesting node ownership of the data
block.
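The handshake above can be sketched as a pair of controllers. Message transport, queueing, and invalidation of the other sharers are elided, and the class and method names are hypothetical:

```python
SHARED, INVALID, EXCLUSIVE, UNOWNED = "SHARED", "INVALID", "EXCLUSIVE", "UNOWNED"

class Home2Node:
    """Holds the Home2 copy (e.g. Nx_H2); blocks are only SHARED or INVALID."""
    def __init__(self, blocks):
        self.state = {b: SHARED for b in blocks}   # initially an exact copy

    def on_invalidation(self, block):
        self.state[block] = INVALID
        return "Invalidation-Ack"

class HomeNode:
    """Directory controller for the Home memory (e.g. Ny_H)."""
    def __init__(self, blocks, home2):
        self.state = {b: UNOWNED for b in blocks}
        self.home2 = home2

    def on_ownership_request(self, block, requester):
        # The Home2 copy is invalidated on every ownership grant, even when
        # no cache currently shares the block (unlike FT0).
        ack = self.home2.on_invalidation(block)
        assert ack == "Invalidation-Ack"
        self.state[block] = EXCLUSIVE       # grant only after the ack arrives
        return ("Ownership-Ack", requester)

home2 = Home2Node(["DataB"])
home = HomeNode(["DataB"], home2)
reply = home.on_ownership_request("DataB", "NodeZ")
```

After the exchange, the Home entry is EXCLUSIVE and the Home2 copy is INVALID, matching the state pairing described for DataB above.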
Keeping Home2 consistent means that every time ownership of a block is requested, the
Home2 node must receive an Invalidation request along with any other nodes sharing the block.
The Invalidation request is multicast to all nodes with a copy of the block. Individual
Invalidation-Ack messages must then be collected from all recipients of the Invalidation request.
The additional Invalidation-Ack message that must always be collected from the Home2 node
increases the time the directory controller spends invalidating all nodes.
As long as the Invalidation-Ack messages are being collected from several other nodes,
the time spent collecting the additional one from Home2 is relatively small. The time necessary to
send the Invalidation request is equal to the sum of two variables: the waiting time at the channel
queue (wch) and the transfer time of the Invalidation message (ttrans). The time necessary to collect
a single Invalidation-Ack message is the sum of the waiting time at the cache input queue (wca),
the service time at the remote cache controller(sca), the waiting time at the channel queue (wch)
and the transfer time of the Invalidation-Ack message(ttrans).
Let X1,X2,...,XNinv be random variables representing the time required to collect
Invalidation-Ack messages from Ninv nodes that received the Invalidation request. The random
variable Xi is the time required to collect an Invalidation-Ack message from the ith node, where
i = 1 to Ninv. Let Xi = wca(i) + sca(i) + wch(i) + ttrans(i).
Simulations show that sca and wca are negligible. The transfer time ttrans is constant and
small for Invalidation-Ack messages. Simulations indicate that wch can be approximated with an
exponential distribution. Let Tinv be the time to send the Invalidation request and collect all Ninv
of the Invalidation-Ack messages, so that Tinv = wch + ttrans + max(Xi : i = 1 to Ninv).
Let X1, X2, ..., Xn be mutually independent, identically distributed continuous random
variables, and let the random variables Y1, Y2, ..., Yn be obtained by arranging X1, X2, ..., Xn
in increasing order. Then Y1 = min(X1, X2, ..., Xn) and Yn = max(X1, X2, ..., Xn). The random
variable Yk is called the kth-order statistic [37]. If each Xi is exponentially distributed with
parameter λ, the order statistic Yn-k+1 is hypoexponentially distributed [37] as shown in
Equation (4.6). The density of Yn-k+1 is shown in Equation (4.7) and the means of Yn and Yn+1
are provided in Equation (4.8).
Yn-k+1 ≈ HYPO[nλ, (n-1)λ, ..., kλ]          (4.6)

f(t) = Σ(i=k to n) ai λi e^(-λi t),  where λi = iλ          (4.7)

φn = E[Yn] = Σ(i=1 to n) 1/λi = Σ(i=1 to n) 1/(iλ)          (4.8)
φn+1 = E[Yn+1] = Σ(i=1 to n+1) 1/(iλ)
Equation (4.9) shows that the increase in the mean of Y when one additional variable Xn+1
is added is 1/((n+1)λ), which becomes relatively small when n ≥ 3. This analysis indicates that
sending an Invalidation request to Home2 and collecting the Invalidation-Ack does not add
significantly to the latency of the corresponding ownership request unless the Home2
Invalidation-Ack is the only one being collected. In this case the entire time spent sending the
Invalidation request to Home2 and collecting the Invalidation-Ack directly adds to the latency of
the ownership request.
φn+1 - φn = 1/((n+1)λ)          (4.9)
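The harmonic-sum mean of Equation (4.8) and the increment of Equation (4.9) can be checked numerically; this sketch uses unit λ for the increments:

```python
# Sketch of Equations (4.8)-(4.9): the mean time to collect n
# exponential(lam) Invalidation-Acks is the harmonic sum (1/lam)*H_n,
# so one extra ack adds only 1/((n+1)*lam).
def mean_max_exponential(n, lam):
    """E[Y_n] for Y_n = max of n iid Exp(lam), per Equation (4.8)."""
    return sum(1.0 / (i * lam) for i in range(1, n + 1))

lam = 1.0
increments = [mean_max_exponential(n + 1, lam) - mean_max_exponential(n, lam)
              for n in range(1, 6)]
# increments are 1/2, 1/3, 1/4, 1/5, 1/6: quickly negligible, as the text argues
```

This supports the claim that collecting the Home2 Invalidation-Ack adds little latency unless it is the only one being collected.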
Assume that a data block DataB is located in the local memory of Node Y and in the
Home2 memory of Node X, as shown in Figure 4.10. Figure 4.11 illustrates the message transfers
required when Node Z requests ownership of DataB, which is in the shared state in the cache of
Node W. Node Z sends an Ownership Request message to Node Y, which is the Home node for
DataB. The directory controller at Node Y finds that DataB is in the shared state and sends an
Invalidation message to Node W in order to invalidate the copy of DataB in Node W’s cache.
The Invalidation message is also sent to Node X so that the copy of DataB in the Home2 memory
can also be invalidated. The directory controller waits until the Invalidation-Ack messages are
received from both Node W and Node X and then sends the Ownership-Ack message to Node Z.
If DataB was not shared by any node when the directory controller at Node Y received the
Ownership request from Node Z, an Invalidation message would still be sent to Node X so that
the copy of DataB in the Home2 memory will be invalidated. In the FT0 protocol, no
invalidation messages would be sent in this case.
[Message sequence among Nodes W, X, Y, Z: Ownership Request, Invalidation Request, two
Invalidation Acknowledgements, Ownership Acknowledgement.]
Figure 4.11: Ownership Request Message Transfer with FT1 Protocol
The Home2 node must receive a copy of any Writeback message sent to the Home node
so that it can update the data block in the Home2 memory with the most recent value and change
its state from INVALID to SHARED. The cache controllers multicast Writeback messages for a
data block to both the Home and Home2 nodes. In the SOME-Bus, a message can be multicast by
having the source node specify the two destinations (Home Node and Home2 Node) in the
message header. Alternatively, the address filters at the SOME-Bus receivers can be programmed
to accept messages of certain types even when destined for another (specific) node. The second
solution is more hardware-intensive but avoids increasing the message header size, while the
first solution provides more flexibility in terms of the ability to change the node that will be
designated the Home2 node.
In the SOME-Bus, each node has a separate receiver and associated input queue for every
channel in the system. All messages sent by a node are received by other nodes in the order in
which they were sent and are stored in the associated input queue. The order in which the
processor sees messages sent from different nodes is determined by the order in which messages
are removed from the set of receiver queues. Since there are multiple queues at each node input,
the arrival order of messages in different input queues cannot be determined by simple inspection
of the corresponding positions in the queues. Consequently, such a simple inspection cannot
determine the last value written to a data block whose ownership changes between several nodes.
A data block in the Home2 memory changes from state SHARED to INVALID when
Home2 receives an Invalidation message from the Home node. In order for Home2 to change the
state of the data block from INVALID to SHARED, it must be able to determine 1) when the data
block changes to the SHARED state in the Home node and 2) which Writeback message in the
Home2 input queues corresponds to the most recently written value for the data block.
Ownership Acknowledge messages are multicast from the Home node to the requesting
node and to Home2 in order to allow the Home2 directory controller to keep track of the transfer of
ownership of a data block, and therefore to determine the most recently written value for that data
block. Anytime there is a transfer of ownership of a data block, there must be a corresponding
InvalidateWB or VictimWB message in which the modified value of the data block is written
back to main memory. Similarly when a block changes from EXCLUSIVE to SHARED, the
resulting Data Acknowledge message can be matched with the Downgrade writeback message.
Since all messages originating from the Home node are received by the Home2 node in the same
order in which they were transmitted, Home2 determines the correct order of the InvalidateWB,
VictimWB or DowngradeWB messages received from any of the nodes by examining the
sequence of Data-Ack and Ownership-Ack messages in the receiver queue associated with the
Home node.
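A simplified sketch of this ordering rule follows. It assumes each acknowledgement observed on the Home node's channel lets Home2 identify the node that relinquished ownership; the real protocol infers this by matching Ownership-Ack and Data-Ack messages to the corresponding writebacks:

```python
from collections import deque

def order_writebacks(relinquish_sequence, writeback_queues):
    """relinquish_sequence: node ids in the order the Home node's acks show
    them giving up ownership; writeback_queues: per-node FIFO receiver queues
    of (block, value) writebacks as received at Home2. Returns the writebacks
    in true write order, which inspecting the queues alone cannot determine."""
    ordered = []
    for node in relinquish_sequence:
        ordered.append(writeback_queues[node].popleft())
    return ordered

# Ownership of DataB moved A -> B -> C; A and B each wrote back once.
queues = {"A": deque([("DataB", 1)]), "B": deque([("DataB", 2)])}
history = order_writebacks(["A", "B"], queues)
```

Because all messages from the Home node arrive at Home2 in transmission order, replaying that ack sequence yields the last value written even when the writebacks sit in different receiver queues.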
Home2 must receive a copy of all coherence-related messages so that it knows the order
in which the Data-Ack and Ownership-Ack messages were sent by Home. The FT1 protocol
does not permit Ownership requests to be filled locally because there would be no Invalidation
message (or subsequent writeback message) sent to Home2 and no Ownership-Ack message
appearing in the Home2 node for the corresponding change of ownership. Under the FT1
approach, every Ownership request must therefore result in the Ownership-Ack message being
multicast to Home2, followed by the InvalidationWB, DowngradeWB or VictimWB message
when the cache eventually gives up ownership of the data.
Writeback messages are sent by the current owner of a block for two reasons. The first
reason is that another node requests access to the block, either read access or ownership. The
second reason is that a cache may choose to replace a block that is in the EXCLUSIVE state. In
this case, the cache must send a copy of that block to the Home node.
If node A wishes to access a block for which node Z is the owner, node A sends an
Ownership or Data request to the Home node Y, and node Y sends an Invalidation or Downgrade
request to the current owner Z. Node Z responds with an Invalidation-Ack or Downgrade-Ack
message to Home node Y. The Home node Y then sends an Ownership-Ack message to the new
owner node A.
Figures 4.12a and 4.12b show an example where three nodes (A, B and C) take turns in
writing and reading a block whose Home is node H (and Home2 is node H2). Figure 4.13 shows
the messages that have arrived at the corresponding input queues of Home2 node H2 under the
assumption that H2 has been busy serving other queues during the time that those messages
arrived.
[Figure omitted: time line of the Ownership requests, Invalidation requests and acknowledgements exchanged among nodes A, B and C, the Home node H and the Home2 node H2 as ownership of the block passes between the nodes.]
Figure 4.12a: Message Time Line
[Figure omitted: continuation of the time line, including the final ownership transfer, the Data request from node A and the Downgrade of the owner node C.]
Figure 4.12b: Message Time Line (continued)
As the figures show, nodes A, B, and C have acquired ownership of the block and have
sent writeback messages to node H. All the relevant messages have also been received by node
H2. (In the following, the notation Qn is used to indicate the queue in node H2 that receives
messages from node n). As Figure 4.13 shows, three writeback messages destined for the same
memory block can be found in queues QA, QB and QC. Since these queues are distinct, it is not
possible to distinguish which writeback message arrived last just by examining the contents of
these three queues. The messages in queue QH must be used to determine the proper order. As
Figure 4.13 shows, the writeback messages can be matched with the corresponding Invalidation
Request message or Downgrade request messages sent by Home to the owner node at different
times. Since all messages in queue QH originate at the same node, they are stored in-order and
can be used to determine the order of the writeback messages. While there are writeback
messages in any input queue in Home2, its directory controller examines queue QH and dequeues
either an Invalidation or a Downgrade request message, uses the information contained in that
message to determine the queue where the corresponding writeback message is stored and
dequeues it. The Home2 directory controller extracts the data from the writeback message and
writes it in the corresponding block in the Home2 memory.
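The drain procedure just described can be sketched as follows. This is an illustrative sketch only, not the thesis implementation: the tuple-based message format and the queue representation are assumptions made for the example.

```python
from collections import deque

def drain_writebacks(q_home, node_queues, home2_memory):
    """Apply writebacks to Home2 memory in the order implied by queue QH.

    q_home       -- deque of (msg_type, target_node, block) tuples sent by
                    the Home node, in transmission order (queue QH).
    node_queues  -- dict: node id -> deque of (msg_type, sender, block,
                    data) messages from that node (queues QA, QB, ...).
    home2_memory -- dict: block id -> most recent data value.
    """
    # While any writeback is pending, walk QH in order; each Invalidation
    # or Downgrade request names the node whose queue holds the matching
    # writeback, so dequeuing in QH order serializes the writebacks.
    while any(m[0].endswith("WB") for q in node_queues.values() for m in q):
        if not q_home:
            break  # must wait for more messages from the Home node
        msg_type, target, block = q_home.popleft()
        if msg_type in ("INV_REQ", "DOWNGRADE_REQ"):
            _wb_type, _sender, wb_block, data = node_queues[target].popleft()
            home2_memory[wb_block] = data
        # Data-Ack / Ownership-Ack entries carry ordering information only.
```

With the writebacks of Figure 4.13 pending, the requests in QH are dequeued in order and each is matched against the writeback in the queue of its target node, so the value written by the last-granted owner is applied last.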
[Figure omitted: the receiver queues at the Home2 node H2. Queue QH holds, in transmission order, the Invalidation and Downgrade requests and the Ownership-Ack and Data-Ack messages sent by the Home node H; queues QA, QB and QC each hold the corresponding writeback message from nodes A, B and C.]
Figure 4.13: Messages at the Receiver Queues of the Home2 Node
Writeback messages are also generated because a cache may choose to replace a block that
is in the EXCLUSIVE state. Then the cache must send a copy of that block to the Home node. Such a
message is called victim-writeback. The existence of victim-writeback messages presents a
complication to the procedure described above, because victim-writeback messages are not
generated as a response to a message from the Home node, and therefore the directory in Home2
cannot directly use the Invalidation or Downgrade Request messages in queue QH to determine
the correct arrival order. When a node (say node Z) generates a victim-writeback message it must
have already acquired ownership of that block in one of two ways: either 1) the block was in
SHARED state, in which case Home sent an Invalidation request, collected the Invalidation-Ack
messages, and replied to node Z with an Ownership-Ack message, or 2) the block was in
EXCLUSIVE state, in which case the previous owner of the block was notified by the Home
node to return an InvalidateWB-Ack message after which Home replied to node Z with an
Ownership-Ack message.
To determine the proper order, the Home2 directory controller operates in the following
way when it encounters a victim-writeback message from node Z in queue QZ: locate in QH the
matching Ownership-Ack message from the Home node to node Z. If the subsequent
acknowledgment message in QH (either Data-Ack or Ownership-Ack) is found without an
intervening InvalidationWB or DowngradeWB request to node Z, then the VictimWB occurred in
between these two acknowledgement messages. It should be noted that it is possible that a
VictimWB message can be sent before an Invalidation or Downgrade message that is already
en route from Home has been received by node Z. In this case, the VictimWB message takes the
place of the expected InvalidationWB or DowngradeWB message in QH.
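The victim-writeback placement rule can be illustrated with a small sketch. The (msg_type, target) representation of QH entries below is an assumption for the example, not the thesis implementation.

```python
def victim_wb_slot(q_home, z):
    """Return the index in QH after which a VictimWB from node z belongs.

    q_home is a list of (msg_type, target_node) pairs in transmission
    order. The VictimWB falls between the Ownership-Ack that granted z
    ownership and the next acknowledgement; if an Invalidation or
    Downgrade request to z appears first, the VictimWB stands in for the
    expected writeback reply at that point.
    """
    own = next(i for i, (t, n) in enumerate(q_home)
               if t == "OWN_ACK" and n == z)
    for i in range(own + 1, len(q_home)):
        t, n = q_home[i]
        if t in ("INV_REQ", "DOWNGRADE_REQ") and n == z:
            return i  # VictimWB replaces the expected writeback reply
        if t in ("OWN_ACK", "DATA_ACK"):
            return i  # VictimWB occurred before this acknowledgement
    return len(q_home)
```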
4.3.2 Using Home2 Memory to Fill Read Requests
The data replication that occurs as a result of keeping the Home2 local memory
consistent can be used to enhance the performance of the DSM system. This performance
enhancement is achieved by allowing a Data request to be filled from either the Home or the
Home2 memory. When the Home2 memory is used to fill a Data request, the network traffic and
service latency for the data request are reduced since the request is filled from the local memory rather
than from memory on a remote node. As long as a block in Nx_H2 is in the SHARED state, it can
be used to fill a read request from a process at Node X locally.
An Ownership request, however, cannot be serviced locally under the FT1 Protocol.
Consider Process Py from Figure 4.10, running on Node Y. Py issues a write request for DataB
that misses in the cache. The FT0 protocol allows the request to be serviced locally if the block is
in the UNOWNED state, since Node Y is Home for DataB, traversing Path S in Figure 3.14. Since
no other node has a copy of the block, the transfer of ownership to Node Y does not require
messages to be sent to other nodes. If the block were in either the SHARED or EXCLUSIVE state,
an Invalidation message would be sent to the nodes that are sharing the data block before
ownership of the block could be given to Py. In the FT1 protocol, the Home2 node (in this case
Node X) will always have a copy of the data block and must receive the Invalidation message.
The Decision Tree described previously for FT0 requires changes to the network cost for
several paths as well as structural changes to the tree itself. Figure 4.14 shows the Decision Tree
for Read Requests under the FT1 protocol. The network cost information is provided for each
individual message transfer as well as a total cost for each tree path. The parameters used in the
network cost values include SH, the cost of sending a message without data, SD, the cost of sending
a message containing a data block, and Ninv, the average number of nodes found in the copyset
when an ownership request is received for a data block in the SHARED state.
A new path, G, is added to the Decision Tree shown in Figure 4.14. This path is taken
when the requested data block is found in the SHARED state in the Home2 memory of the
requesting node and can be used to fill the request locally. Path G provides a 0 cost path for
requests that can be filled from the Home2 memory, in the FT0 protocol, these requests would
normally have been associated with path D. In this case FT1 provides a savings of (SH + SD) over
the FT0 protocol. Other changes in network cost between the FT0 and FT1 protocols include an
increase of SD to Path B, because the Local Data Acknowledge message of FT0 is remote in FT1
because it must be multicast to the requesting node and to the Home2 node. The network cost of
Path E is also increased by SD because the Local Downgrade Writeback message of FT0 is now
remote in FT1, because it must be multicast to both the Home and Home2 nodes.
[Figure omitted: decision tree for a read reference at node A, branching on cache hit, Home/Home2/remote address and directory state, with per-message network costs. Total path costs: A, C, G: 0; B: SH + (2 * SD); D: SH + SD; E: SH + (2 * SD); F: 2 * (SH + SD).]
Figure 4.14: Decision Tree for FT1 Read Request
Figure 4.15 shows the modification to the Decision Tree for Write Requests that miss in
the cache under the FT1 protocol. In all paths M-S, an Invalidation message must be sent to the
Home2 node and the resulting Invalidation-Ack message must be collected. The additional
Invalidation-Ack message adds SH to every path in the tree. Since Invalidation messages are
multicast, the only paths which will have an added network cost for sending an extra Invalidation
message to the Home2 node are path O and path S, the ones that would not have resulted in any
Invalidation messages under the FT0 protocol. The Invalidation messages have a network cost of
SH. In paths M, N and R, the block for which ownership is being requested is in the state
EXCLUSIVE. In this case, an InvalidationWB-Ack will be multicast from the current owner to
both Home and Home2 in response to the Invalidation message. The writeback message would
have occurred locally in path M for FT0, but is remote in the FT1 protocol and adds a network
cost of SD to path M. In all paths M-S, the Ownership-Ack message must be multicast to the
requesting node and the Home2 node. This requirement adds SD to the network cost of Path Q, R
and S.
Figure 4.16 provides the FT1 protocol Decision Tree for a write request that is found in
the cache. Paths I-L result in upgrade requests for the data block. In paths I and J, the data block
is located in the local memory of the requesting node resulting in a local Upgrade request and
local Ownership Acknowledge under the FT0 protocol. In the FT1 protocol a copy of the
Ownership Acknowledge must be sent to the Home2 node. Since it is not necessary to send the
data block itself, only the notification of the Ownership Acknowledge, the additional message to
the Home2 node increases the network cost of these paths by SH. Paths I and K contain an
Invalidation message that is sent to Home2 for a data block in which the only copy of the data
was the one that was about to be upgraded. The additional network cost for the FT1 protocol for
these paths is (2 * SH) for the Invalidation message and the Invalidation-Ack message. In Paths J,
and L, an Invalidation message is multicast to all nodes on the copyset for the data block (Ninv
nodes). In this case the additional network cost for the FT1 protocol is SH due to the additional
Invalidation-Ack message that must be collected from the Home2 node.
[Figure omitted: decision tree for a write reference at node A that misses in the cache, branching on Home address and directory state, with per-message network costs. Total path costs: M, N: (3 * SH) + (2 * SD); O: (3 * SH) + SD; P: ((Ninv + 3) * SH) + SD; Q: ((Ninv + 2) * SH) + SD; R: (2 * SH) + (2 * SD); S: (2 * SH) + SD.]
Figure 4.15: Decision Tree for FT1 Write Miss
[Figure omitted: decision tree for a write reference at node A that hits in the cache on a SHARED block, with per-message network costs. Total path costs: H: 0; I: 3 * SH; J: (3 + Ninv) * SH; K: 4 * SH; L: (4 + Ninv) * SH.]
Figure 4.16: Decision Tree for FT1 Write Hit
Table 4.2 summarizes the differences in the network cost associated with all Decision Tree
paths A-S for FT0 and FT1. As mentioned previously, SH is the network cost of a message
without data, SD is the network cost of a message carrying one data block, and Ninv is the number
of nodes found on the copyset when an ownership request is made for a block in the SHARED state.
The paths that have the largest difference in network cost between FT1 and FT0 are paths B, E, G, M,
R and S. FT0 does not have a path G; any request that would be serviced via path G would have
been serviced via path D in FT0. For this reason FT1 provides a benefit of (SH + SD) on path G.
The remaining paths incur additional network cost under the FT1 protocol.
Table 4.2: Comparison of the Network Cost of the Tree Paths for FT0 and FT1

Path  FT0 network cost        FT1 network cost        Difference
A     0                       0
B     SH + SD                 SH + (2 * SD)           SD
C     0                       0
D     SH + SD                 SH + SD
E     SH + SD                 SH + (2 * SD)           SD
F     2 * (SH + SD)           2 * (SH + SD)
G     N/A                     0                       SH + SD (savings)
H     0                       0
I     0                       3 * SH                  3 * SH
J     SH * (1 + Ninv)         SH * (Ninv + 3)         2 * SH
K     2 * SH                  4 * SH                  2 * SH
L     SH * (3 + Ninv)         SH * (Ninv + 4)         SH
M     SH + SD                 (2 * SD) + (3 * SH)     SD + (2 * SH)
N     2 * (SH + SD)           (2 * SD) + (3 * SH)     SH
O     SH + SD                 (3 * SH) + SD           2 * SH
P     SD + (SH * (2 + Ninv))  SD + (SH * (Ninv + 3))  SH
Q     SH * (1 + Ninv)         SD + (SH * (Ninv + 2))  SD + SH
R     SH + SD                 (2 * SD) + (2 * SH)     SH + SD
S     0                       SD + (2 * SH)           SD + (2 * SH)
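The costs in Table 4.2 can be checked mechanically. The sketch below encodes both columns as functions and evaluates the two largest swings, using the message sizes assumed in the simulations of the next subsection (SH = 10 bytes, SD = 74 bytes); the function names are hypothetical.

```python
# Network cost per Decision Tree path, following Table 4.2.
# sh: cost of a header-only message; sd: cost of a message carrying one
# data block; ninv: copyset size for SHARED-state invalidations.
def ft0_cost(path, sh, sd, ninv):
    return {
        "B": sh + sd, "D": sh + sd, "E": sh + sd, "F": 2 * (sh + sd),
        "J": sh * (1 + ninv), "K": 2 * sh, "L": sh * (3 + ninv),
        "M": sh + sd, "N": 2 * (sh + sd), "O": sh + sd,
        "P": sd + sh * (2 + ninv), "Q": sh * (1 + ninv), "R": sh + sd,
    }.get(path, 0)  # paths A, C, H, I, S cost 0 (G does not exist in FT0)

def ft1_cost(path, sh, sd, ninv):
    return {
        "B": sh + 2 * sd, "D": sh + sd, "E": sh + 2 * sd,
        "F": 2 * (sh + sd), "I": 3 * sh, "J": sh * (ninv + 3),
        "K": 4 * sh, "L": sh * (ninv + 4), "M": 2 * sd + 3 * sh,
        "N": 2 * sd + 3 * sh, "O": 3 * sh + sd,
        "P": sd + sh * (ninv + 3), "Q": sd + sh * (ninv + 2),
        "R": 2 * sd + 2 * sh, "S": sd + 2 * sh,
    }.get(path, 0)  # paths A, C, G, H remain free in FT1

sh, sd, ninv = 10, 74, 2
extra_s = ft1_cost("S", sh, sd, ninv) - ft0_cost("S", sh, sd, ninv)  # 94 bytes
saved_g = ft0_cost("D", sh, sd, ninv) - ft1_cost("G", sh, sd, ninv)  # 84 bytes
```

With these sizes, a formerly free path-S write miss costs 94 extra bytes under FT1, while each read filled via path G instead of path D saves 84 bytes.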
The FT1 protocol removes the burst of network traffic related to updating the backup
Recovery Memory that occurs in the FT0 protocol, but, unlike FT0, produces overhead during
normal operation. This overhead is due to the network load caused by the
necessity of sending additional messages to the Home2 node including: Invalidation requests,
writeback (InvalidationWB, DowngradeWB, or VictimWB) messages, Ownership-Ack and Data-
Ack messages. For every Ownership request, an Invalidation-Ack message must be collected
from Home2 in addition to any other Invalidation-Ack messages collected from nodes on the
copyset.
4.3.3 Performance
In Chapter 3, a set of synthetically generated applications was described and compared to
each other in terms of performance for a basic DSM system that directly relates to the FT0
protocol during normal operation (no checkpointing). Differences in application behavior were
obtained by varying three parameters L, N and S.
1. Parameter L is the probability that the memory reference will reside in the node’s
local memory.
2. Parameter N is the probability that the reference will be to the local memory of the
node for which the requesting node has a backup copy. (Node X is Home for DataB,
Node Y is Home2 for DataB, N is the probability that Node Y will choose a memory
reference that resides in the local memory of Node X).
3. Parameter S is the probability that the reference belongs to a remote node’s memory
space and has been accessed by another thread in the system some time previously.
4. The probability that the reference will be to a block already in the cache is
1 - (L + N + S).
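A thread's next reference in such a synthetic application can be drawn according to these four probabilities; the sketch below is an assumption about one simple way to do it, not the thesis simulator.

```python
import random

def classify_reference(l, n, s, rng=random):
    """Draw the type of a thread's next memory reference from the
    parameters L, N and S described above. Hypothetical sketch; the
    thesis simulator itself is not reproduced here."""
    r = rng.random()
    if r < l:
        return "local"          # node's own local (Home) memory
    if r < l + n:
        return "home2"          # memory the node holds a backup copy of
    if r < l + n + s:
        return "shared-remote"  # remote block already accessed by a thread
    return "cache-hit"          # already cached, probability 1 - (L + N + S)
```

For Case N10 with N = .05, for example, one would call classify_reference(0.10, 0.05, 0.02).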
Applications of interest can be described by the following cases:
1. Threads tend to access arbitrary blocks at the Home2 node and the current node. This
case indicates a smaller degree of locality within the current node and a smaller
degree of sharing between threads. The parameters of interest are L=0.01, S=0.01 and
N=[0.01,...,0.10]. (Case N01).
2. Threads tend to access arbitrary blocks within the Home2 memory of a node. Within
the current node there is a larger degree of locality. The parameters of interest are
L=0.10, S=0.02 and N=[0.01,...,0.10]. (Case N10).
3. Threads tend to access blocks at any remote node that have already been accessed by
another thread. Within the current node, threads tend to access arbitrary blocks. This
case indicates a smaller degree of locality within the current node but a larger degree
of sharing between threads. The parameters of interest are L=0.01, N=0.01 and
S=[0.01,...,0.10]. (Case S01).
4. Threads tend to access blocks at any remote node that have already been accessed by
another thread. Within the current node there is a larger degree of locality. The
parameters of interest are L=0.10, N=0.02 and S=[0.01,...,0.10]. (Case S10).
The main advantage FT1 has over the FT0 protocol during normal operation is the ability
to fill Data requests from the Home2 memory. The main disadvantage of the FT1 protocol during
normal operation is that messages that would occur locally (with a network cost of 0) in
the FT0 protocol must be multicast to Home2 and therefore become remote (with a network cost
greater than 0) in the FT1 protocol. The synthetic applications described above were simulated for
each protocol, FT0 and FT1, in order to compare their performance over a range of application types.
These experiments assume that the size of a message header (SH) is 10 bytes and the size of a
message with one data block (SD) is 74 bytes.
Table 4.3 shows the percentage of cache misses that traverse the specified Tree path(s)
for the Case N01, Case N10, Case S01 and Case S10 sets of applications. The Tree paths included
are those through which at least 0.1% of the cache misses are handled; if fewer than 0.1% of the
cache misses use a particular path, the path is not included in the table. Tree paths G, O and S
are the most frequently used paths that have a difference in network cost between FT0
and FT1, as shown in Table 4.2.
Table 4.3: Distribution of Tree Path Usage for Cache Misses
(columns: Tree paths B, C, D, F, G, L, N, O, P, S)

Case N01: N=.01  29.82% 28.23% 1.09% 30.65% 5.96% 0.76% 3.36%
Case N01: N=.02  22.65% 21.12% 0.69% 45.58% 6.86% 0.42% 2.53%
Case N01: N=.05  12.55% 12.21% 0.27% 64.82% 8.56% 0.16% 1.40%
Case N01: N=.10   7.52%  7.15% 0.14% 75.23% 9.15% 0.78%

Case N10: N=.01  22.65% 21.12% 0.69% 45.58% 6.86% 0.42% 2.53%
Case N10: N=.02  18.54% 32.97% 1.26% 37.06% 0.18% 6.85% 0.96% 2.01%
Case N10: N=.05  11.60% 50.70% 2.95% 24.52% 0.32% 6.61% 2.04% 1.19%
Case N10: N=.10   7.12% 62.46% 4.22% 16.09% 0.11% 0.45% 5.89% 2.97% 0.77%

Case S01: S=.01  69.16% 13.21% 0.29%  7.27% 2.14% 0.15% 7.65%
Case S01: S=.02   0.10% 64.44% 11.94% 0.27% 13.29% 2.70% 0.12% 7.16%
Case S01: S=.05   0.11% 52.99% 10.05% 0.23% 26.59% 4.10% 5.91%
Case S01: S=.10  40.88%  7.93% 0.13% 41.00% 5.45% 4.58%

Case S10: S=.01   0.13% 69.11% 6.61% 0.19% 14.16% 2.24% 7.55%
Case S10: S=.02   0.10% 64.44% 11.94% 0.27% 13.29% 2.70% 0.12% 7.16%
Case S10: S=.05  52.86% 24.99% 0.52% 11.43% 3.89% 0.29% 5.90%
Case S10: S=.10  41.04% 38.76% 0.82%  9.47% 4.83% 0.47% 4.49%
Figures 4.17 and 4.18 compare the processor utilization and the execution time for the
FT0 and FT1 protocols for Case N01. In each of the four Case N01 applications (N=.01, N=.02,
N=.05 and N=.10), FT1 provides better performance than FT0. As the N parameter increases, the
difference between the FT0 and FT1 performance also increases. The reason for this is that in the
N01 applications, there is a larger probability that the thread will access a data block within its
Home2 memory. As the probability that the address is located in the Home2 memory increases,
so does the probability that the Home2 memory will be able to fill a read request locally.
[Figure omitted: bar chart of processor utilization for FT0 and FT1 at N=.01, .02, .05 and .10.]
Figure 4.17: Processor Utilization, Case N01
[Figure omitted: bar chart of execution time in simulation clock cycles for FT0 and FT1 at N=.01, .02, .05 and .10.]
Figure 4.18: Execution Time, Case N01
Figures 4.19 and 4.20 compare the processor utilization and the execution time for the
FT0 and FT1 protocols for Case N10. In the N10 applications, there is a larger probability that
the thread will access a data block within its own local memory. The FT0 protocol can support
both local read requests and local ownership requests; the FT1 algorithm, however, does not
support local ownership requests.
When the N parameter is .01 and .02, the FT0 protocol performs better due to the higher
percentage of local memory references. As the N parameter is increased to .05 and .10, the FT1
algorithm performs better due to the higher percentage of memory references that fall within the
requesting node’s Home2 memory.
[Figure omitted: bar chart of processor utilization for FT0 and FT1 at N=.01, .02, .05 and .10.]
Figure 4.19: Processor Utilization, Case N10
[Figure omitted: bar chart of execution time in simulation clock cycles for FT0 and FT1 at N=.01, .02, .05 and .10.]
Figure 4.20: Execution Time, Case N10
Figures 4.21 and 4.22 compare the processor utilization and the execution time for the
FT0 and FT1 protocols for Case S01. The Case S01 applications have a low probability that the
memory reference will be to the local memory and a higher probability that the memory reference
will be remote and to a data block that has been previously accessed. For this class of
applications, the FT1 algorithm performs better than the FT0 algorithm due to the ability to fill
read requests from the Home2 memory. The difference in performance is not as large as it is for
the Case N01 applications because the percentage of references that fall within the Home2
memory is not as high.
[Figure omitted: bar chart of processor utilization for FT0 and FT1 at S=.01, .02, .05 and .10.]
Figure 4.21: Processor Utilization, Case S01
[Figure omitted: bar chart comparing FT0 and FT1 at S=.01, .02, .05 and .10.]
Figure 4.22: Execution Time, Case S01
Figures 4.23 and 4.24 compare the processor utilization and the execution time for the
FT0 and FT1 protocols for Case S10. In the S10 applications, there is a larger probability that the
thread will access a data block within its own local memory. For this class of applications, the
performance of the FT0 and FT1 algorithms is approximately the same. In this application class,
the advantage that the FT0 algorithm has for local references balances the advantage that the FT1
algorithm gains from references that fall within the Home2 memory.
[Figure omitted: bar chart of processor utilization for FT0 and FT1 at S=.01, .02, .05 and .10.]
Figure 4.23: Processor Utilization, Case S10
[Figure omitted: bar chart of execution time in simulation clock cycles for FT0 and FT1 at S=.01, .02, .05 and .10.]
Figure 4.24: Execution Time, Case S10
4.3.4 Implementation Issues
There are several possibilities for the implementation of the node structure for the FT1
approach. Separate directory controllers could be used for the Home and Home2 memory within
a node. Alternatively, there could be a single directory controller whose capabilities are extended
so that it can maintain both Home and Home2 memories. One argument for having a single
directory controller is that a node would have one physical memory that would be logically
divided into sections serving as Home, Home2 and recovery memory sections. In this situation,
two separate controllers would have to coordinate access to the memory bus and therefore one
controller would often be waiting for the other to finish using the memory bus.
The Home directory controller receives Data or Ownership requests from the cache
controllers and issues Invalidation or Downgrade request messages if necessary, waits for
acknowledgements, and then sends the Data-Ack or Ownership-Ack messages (usually along
with the data block being requested). The Home2 directory controller receives and processes any
messages necessary to keep the Home2 copy of the memory consistent with the Home memory.
The performance of the Home directory controller directly impacts the performance of
the multiprocessor system in terms of the service latency of the Data or Ownership request. The
service latency is the length of time between the issue of the request until the receipt of the reply.
When servicing an Ownership request, the directory may have to issue a number of Invalidation
messages and wait for the resulting acknowledgments. If the acknowledgments are not received
and processed in a timely manner, the cache controller must wait longer before receiving the data.
Longer service latencies can cause the processor to spend more time in an idle state waiting for
the data it requires to continue processing.
The response time of the Home2 directory controller for handling incoming messages is
important in two cases. The first case is the ability to allow a read miss from the local cache to be
filled with a block from the Home2 memory if it is in the SHARED state. The advantage of the
Home2 memory providing the data block to the cache is the reduction in service latency for the
read request since messages do not have to be exchanged with the Home node. The performance
of the Home2 directory controller also affects the amount of time required to determine the most
recent value written to a data block when it comes time to create a new checkpoint. The most
recent value written to the data block must be determined before the recovery memory can be
updated.
The speed with which the Home messages are processed will impact the performance of
the system. The processing of the Home2 messages, however, is not as time-critical. If there were
two directory controllers (Home and Home2) and two input queues, the Home directory
controller could be given higher priority for the memory bus and therefore continue to process
messages with no reduction in performance. The Home2 directory controller could process its
messages when doing so does not interfere with the Home directory controller. This actually leaves
a reasonable amount of time, since the Home directory spends time waiting for acknowledgments
before reading or writing to the memory. The performance degradation resulting from the
increased service latencies of a single controller performing both the Home and Home2 directory
functions may warrant having separate controllers and possibly separate queues. In the
simulations, it is assumed that there are separate Home and Home2 directory controllers and
associated input queues.
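The fixed-priority arbitration argued for above can be sketched as follows; the class and queue names are hypothetical.

```python
from collections import deque

class MemoryBusArbiter:
    """Fixed-priority arbitration between the Home and Home2 directory
    controllers, as discussed above: Home always wins, so its service
    latency is unaffected, and Home2 drains its queue in the gaps
    (e.g. while Home waits for acknowledgements). Illustrative only."""

    def __init__(self):
        self.home_q = deque()   # pending memory accesses from Home
        self.home2_q = deque()  # pending memory accesses from Home2

    def next_access(self):
        if self.home_q:         # Home controller has strict priority
            return ("home", self.home_q.popleft())
        if self.home2_q:        # Home2 proceeds only when Home is idle
            return ("home2", self.home2_q.popleft())
        return None             # memory bus is idle
```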
4.4 FT2 PROTOCOL
Although the FT1 Protocol outperforms the FT0 protocol when a large number of Data
Requests can be filled by the Home2 memory, in cases where this is not possible, the FT1
Protocol exhibits an increasing loss in performance as the degree of spatial locality in the
application increases. This loss in performance is due to the fact that Ownership/Upgrade requests
cannot be performed locally because every write request must cause an Invalidation message to
be sent to the Home2 node, even if no other node has a copy of the data.
The performance loss for the FT1 Protocol is not merely due to the additional network
traffic, but also to the increase in the service latency of an ownership request. The increase in
service latency is due to the necessity of invalidating the data block in the Home2 node and then
waiting for the acknowledge to be received by the Home node. When the only Invalidation
message required is the one to Home2, the delay is significant because in FT0 there would be no
Invalidation messages and the Ownership-Ack message would be sent immediately.
If the ownership request is for a data block in either the UNOWNED state or the
SHARED state where the only sharer is the node requesting the ownership, the FT2 protocol
allows the Home node to send an Invalidation message to Home2 and then grant the requesting
node ownership of the data block, without waiting for the Invalidation-Ack message from
Home2. In either of these cases, Home2 is the only node that must be invalidated before the
ownership is transferred. If the data block was in the EXCLUSIVE state or in the SHARED state
with multiple sharers, invalidations must be sent to nodes other than the Home2 node and the
resulting Acknowledgements must be collected in order to guarantee sequential consistency.
With this optimization, the FT2 protocol decreases the service latency for Ownership or Upgrade
requests for a block that is either unowned or shared only by the node requesting ownership for
the block.
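The fast-path condition described above can be summarized in a short sketch. This is a minimal illustration in Python; the state names and the `sharers`/`requester` encoding are assumptions for clarity, not the thesis implementation.

```python
# Hypothetical sketch of the FT2 fast-path decision: ownership may be
# granted before the Home2 Invalidation-Ack arrives only when Home2 is
# the sole copy that must be invalidated.

UNOWNED, SHARED, EXCLUSIVE = "UNOWNED", "SHARED", "EXCLUSIVE"

def can_grant_immediately(state, sharers, requester):
    """True when no node other than Home2 needs to be invalidated."""
    if state == UNOWNED:
        return True
    if state == SHARED and sharers == {requester}:
        return True
    # EXCLUSIVE, or SHARED with other sharers: acks must be collected first.
    return False

assert can_grant_immediately(UNOWNED, set(), "X")
assert can_grant_immediately(SHARED, {"X"}, "X")
assert not can_grant_immediately(SHARED, {"X", "Z"}, "X")
assert not can_grant_immediately(EXCLUSIVE, {"Z"}, "X")
```

The two `True` branches correspond exactly to the UNOWNED and single-sharer cases in the text; every other combination falls back to the normal acknowledgment-gathering path.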
The FT2 protocol awards ownership to a node without waiting for the Acknowledgment
from Home2. In order to ensure that sequential consistency is not violated, three orderings of the
set of interactions between the Home and Home2 nodes must be considered. In [8], a sequentially
consistent execution is defined to be one in which the result produced is the same as that
obtained by one of the possible interleavings of operations that preserve program order within
each process.
Referring to Figure 4.25, assume that both Node X and Node Y have a copy of DataB in
the SHARED state in their respective caches. Process Px issues a write access to DataB. In this
case, the cache controller on Node X will send an Ownership request to Node Y and wait for the
corresponding Ownership-Ack before the write operation completes. The Home directory (on
Node Y) will cause the local cache (on Node Y) to invalidate the block and then send an
Invalidation request message to the Home2 directory (on Node X) and send the Ownership-Ack
to Node X. This scenario poses no problems for the FT2 protocol since the cache controller on
Node X is waiting for the data block to come from Home and will not attempt to access the
Home2 value in the meantime and sequential consistency is maintained.
Assume that caches on both Node X and Node Y again have DataB in the SHARED
state. Process Py issues a write access to DataB. In this case, the Home directory controller (on
Node Y) changes the state of the block to a transient “busy” state just as it normally does when it
is waiting to gather Invalidation-Ack or Downgrade-Ack messages, sends an Invalidation request
to the Home2 directory (on Node X) and immediately grants ownership of the block to the Node
Y cache. If another request for the data block reaches the Home node while the block is in the
“busy” state, the request is placed on the waiting queue. When the Home node receives the
Invalidation-Ack message from Home2, the state of the block is changed to EXCLUSIVE and the
Home directory controller can process any requests for the block that are on the waiting queue.
Process Py is granted ownership immediately and may begin writing to the data block before the
Invalidation-Ack message is received from the Home2 node. Whether or not the Home directory
controller waits for the acknowledgment from Home2 before granting ownership, the Home2
node will eventually receive the invalidation and stop reading DataB from its cache. The Home2
node will then invalidate the cache block and send the Invalidation-Ack message
back to the Home node. The resulting global order of operations preserves the program order of
each of the processes and is a valid sequential interleaving of the program orders of all processes,
so sequential consistency is not violated.
Figure 4.25: Ownership Request scenarios. [Diagram: Nodes X and Y each hold DataB in their
caches; each node's local memory serves as Home for its own data and as Home2 (recovery data)
for its neighbor (Nx_H, Nx_H2, Ny_H, Ny_H2), while processes Px and Py issue write accesses
to DataB.]
As before, assume that caches on both Node X and Node Y have DataB in the SHARED
state. This time both process Py and process Px decide to write to DataB at the same time. The
Home directory will see the request from Py first. Since only Home2 has a copy of the data block,
Py is allowed to proceed with the write while DataB is placed in the “busy” state. If Px issues a
write access to DataB after it receives the Invalidation message from Home, the effect is the same
as the previous case that was shown to preserve sequential consistency.
If Px issues the write access to DataB before it receives the Invalidation message from
Home, it sends an Ownership request to the Home directory and waits for the reply. When the
Home directory receives the Ownership Request from Home2, DataB is still in the “busy” state
because the Home has not received the Invalidation-Ack from Node X. The request is placed on
the waiting queue. When Home2 receives the Invalidation request, it invalidates the cache block
and sends an Invalidation-Ack to the Home node. When the Home node receives the
Invalidation-Ack, the state of the block is changed to EXCLUSIVE and the waiting queue is
checked for requests. When the Ownership Request message from Node X is retrieved from the
waiting queue, the Home directory causes the DataB block in the cache on Node Y to be
invalidated and the Ownership-Ack message is sent to Node X. Once the data block becomes
busy, it stays busy until the acknowledgment is received from Home2, even if the process has
finished writing earlier. For this reason, successive write operations appear in the same order for
all processes and the result of execution is the same as if the operations of all processes were
interleaved and program order was preserved in each process, i.e. sequential consistency is
preserved.
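The "busy until Home2 acknowledges" rule in the scenarios above can be sketched as a small state machine. The class and method names here are illustrative assumptions, not the thesis hardware design.

```python
# Sketch of the FT2 busy-state handling: the directory entry stays busy
# until the Home2 Invalidation-Ack arrives; requests received meanwhile
# are deferred to the waiting queue and drained on the ack.

class DirectoryEntry:
    def __init__(self):
        self.state = "SHARED"
        self.waiting = []            # requests queued while busy

    def grant_ownership(self, requester):
        self.state = "BUSY"          # stays busy even if the write finishes
        return requester             # Ownership-Ack sent immediately (FT2)

    def on_request(self, req):
        if self.state == "BUSY":
            self.waiting.append(req) # defer until the Home2 ack arrives
            return None
        return req                   # serviced normally otherwise

    def on_home2_invalidation_ack(self):
        self.state = "EXCLUSIVE"
        drained, self.waiting = self.waiting, []
        return drained               # now safe to process queued requests

entry = DirectoryEntry()
entry.grant_ownership("Py")
assert entry.on_request("Px") is None              # queued while busy
assert entry.on_home2_invalidation_ack() == ["Px"]
assert entry.state == "EXCLUSIVE"
```

Because the entry remains busy for the full Home–Home2 round trip, every later request observes the writes in the same global order, which is what preserves sequential consistency.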
The FT2 protocol reduces the service latency for Ownership requests for unowned data
blocks by granting Ownership of the data block before the Invalidation-Ack message has been
received from the Home2 node. In addition, the FT2 protocol reduces the network cost that is
involved in the multicasting of all Ownership and Data Acknowledge messages in FT1. The
Ownership Acknowledge message is sent to Home2 for the purposes of tracking the changes in
ownership for a data block. The data block value included in the Ownership Acknowledge
message is not used by Home2 because the data block is in the INVALID state in the Home2
memory and will be modified again before being written back by the new owner. For this reason,
it is not necessary to send the data block to Home2, only the Ownership Acknowledge
notification. If the Home node is remote to the requesting node, multicasting the Ownership
Acknowledge message to Home2 and the requesting node does not increase the network cost over
what it would be in FT0. If, however, the Home node is local to the requesting cache, the
Ownership Acknowledge would be handled locally (network cost of 0). Any message sent to
Home2 in this case will have a network cost proportional to the message size (SD). Since it is not
necessary to send the data to the Home2 node, a reduction in network cost can be achieved by
sending two Ownership Acknowledge messages: one containing the data is sent locally to the
requesting cache, while another that does not contain the data block (message size SH) is sent
remotely to Home2.
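The saving from splitting the acknowledgment can be checked with back-of-the-envelope arithmetic. SD and SH are the data-carrying and header-only message sizes from the text; the numeric byte values below are illustrative assumptions.

```python
# Local-Home case: FT1 must send one data-carrying message (size SD) to
# Home2; FT2 sends a local data-carrying ack (cost 0) plus a header-only
# remote ack (size SH). Values are assumed, e.g. 8-byte header + 64-byte block.

SD, SH = 72, 8

cost_ft1 = SD        # remote multicast carrying the data block
cost_ft2 = 0 + SH    # local ack with data (free) + remote header-only ack

assert cost_ft2 < cost_ft1
```

The saving per ownership request is SD − SH, i.e. the size of the data block itself, and it applies whenever the Home node is local to the requesting cache.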
4.4.1 Performance
Figures 4.26 through 4.29 provide the execution time for the Case N01, Case S01, Case
N10 and Case S10 application classes, respectively. In all of the experiments reported in this
section, each thread processed 10,000 memory references. These figures show that the
performance increase provided by the FT2 protocol over the FT1 protocol can be observed when
the degree of locality of the application is higher (Case N10 and Case S10), since both
enhancements are aimed at the situation where there is an Ownership or Upgrade request for a
data block that is local to the requesting node and is either unowned or shared only by the
requesting node. If the degree of locality is lower (Case N01 and Case S01), there is little
difference in the performance of the FT1 and FT2 protocols.
Figure 4.26: Execution Time, Case N01 [bar chart; y-axis: Simulation Clock Cycles, 0–350,000;
bars: FT1 and FT2 at N = .01, .02, .05, .10]
Figure 4.27: Execution Time, Case S01 [bar chart; y-axis: Simulation Clock Cycles, 0–350,000;
bars: FT1 and FT2 at S = .01, .02, .05, .10]
Figure 4.28: Execution Time, Case N10 [bar chart; y-axis: Simulation Clock Cycles, 0–350,000;
bars: FT1 and FT2 at N = .01, .02, .05, .10]
Figure 4.29: Execution Time, Case S10 [bar chart; y-axis: Simulation Clock Cycles, 0–350,000;
bars: FT1 and FT2 at S = .01, .02, .05, .10]
In general, the FT2 protocol provides better performance than the FT1 protocol when the
application exhibits a high degree of locality. However, when an application exhibits a high
degree of locality and there are multiple threads per process, it is possible for the DSM to become
unstable and the queues to fill, bringing the system to a halt. This can occur because the
cache is granted ownership of a block that is unowned or shared only by the requesting cache
without waiting for the Invalidation-Ack from Home2. As described previously, this does not
pose a problem with regard to sequential consistency; however, without a limiting condition (such as
waiting for an acknowledgment), the number of outstanding Invalidation-Ack messages in the
system can become large enough to adversely affect the performance of the DSM.
If every thread in a process issues an ownership request to a different memory block
which is resident in the local memory of that node and is unowned or only shared by the
requesting cache, ownership requests and acknowledgments can be issued one right after the
other without waiting for the corresponding Invalidation-Ack message from Home2. This results
in the output queue of the Home node being filled with Invalidation requests and Ownership
acknowledgments destined for Home2. Assuming the system is symmetric and all nodes are
behaving similarly, the queues fill up and the response to other requests that involve Invalidation-
Ack or Downgrade-Ack becomes slower and slower. This problem worsens as the number of
threads increases, the write ratio increases, or the degree of locality of the application increases.
In order to prevent this situation, a limit is set on the number of outstanding Invalidation-Ack
messages from Home2 for blocks that are unowned or shared only by the requesting cache. When
this limit is reached, all subsequent ownership requests must wait for the Home2 Invalidation-
Ack message before the cache is awarded ownership of the block.
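The limiting condition above amounts to a simple counter on outstanding Home2 acknowledgments. The sketch below is an assumed illustration; the threshold value and names are not taken from the thesis.

```python
# Sketch of the outstanding-ack throttle: fast-path ownership grants are
# allowed only while the count of Invalidation-Acks still owed by Home2
# is below a limit; beyond it, requests wait for the ack as in FT1.

class Throttle:
    def __init__(self, limit=4):       # illustrative threshold
        self.limit = limit
        self.outstanding = 0           # Home2 acks not yet received

    def may_fast_path(self):
        return self.outstanding < self.limit

    def issued(self):                  # ownership granted without waiting
        self.outstanding += 1

    def ack_received(self):            # Home2 Invalidation-Ack arrived
        self.outstanding -= 1

t = Throttle(limit=2)
assert t.may_fast_path()
t.issued()
assert t.may_fast_path()
t.issued()
assert not t.may_fast_path()           # further grants must wait for an ack
t.ack_received()
assert t.may_fast_path()
```

This bounds the number of Invalidation requests and Ownership acknowledgments that can pile up in the Home node's output queue, which is exactly the instability described above.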
4.5 FT3 PROTOCOL
Although the FT2 Protocol provides some improvement over the FT1 protocol, the
network cost for Decision Tree paths that involve transferring a data block (writeback messages
as well as Data and Ownership acknowledge messages) still causes performance loss in paths
where the data transfer would otherwise have been done locally. The FT3 Protocol reduces the
network costs incurred in FT1 and FT2 in three ways.
First, cache controllers do not send DowngradeWB or InvalidationWB messages to the
Home2 node. This is possible because all InvalidationWB or DowngradeWB messages occur
as the result of an Ownership or Data request to a block in the EXCLUSIVE state. Once the
Home directory controller receives the writeback message, it sends the Data or Ownership
Acknowledge message with the data block to the requesting cache. To reduce the overhead of
sending InvalidationWB or DowngradeWB messages for locally owned data blocks, the new
value of the memory block can instead be sent to the Home2 directory controller as part of the
subsequent Data or Ownership Acknowledge message, thereby avoiding the cost of sending local
writeback messages only to the Home2 directory, as would occur for Decision Tree paths E and
M. Victim writeback messages, however, are not the result of a request from another node, and
the next Data or Ownership request for the block could occur after the next checkpoint is taken.
For this reason, Victim writeback messages must be sent to both the Home and Home2 directory
controllers.
When a cache chooses a locally owned data block to be replaced in the cache, the Victim
writeback message would have been handled locally in FT0 but causes a remote message to be
sent in either FT1 or FT2. The second optimization in the FT3 protocol eliminates the need to
send a remote message only to Home2: the update to Home2 is postponed until a node other than
the home node requests the data block, at which point Home2 receives the update through the
multicast of the Data Acknowledge message that is sent to the remote node.
If another node does not request the data block before the next checkpoint occurs, the
Home2 node will not have all the information it requires to update its copy of the recovery
memory. This situation is handled by associating a “writeback-pending” attribute with the
directory entry for each data block. When a home node receives a local victim writeback
message, it sets the “writeback-pending” attribute to indicate that the data has not been written
back to the Home2 node. At checkpoint time, each node will go through the directory entries for
its local memory. All data blocks that have the “writeback-pending” attribute set are added to a
message that will be sent to the Home2 node providing it with the most recent value for these data
blocks. This message is similar to the recovery memory update message sent by the FT0 protocol
at checkpoint time but will contain fewer data blocks.
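The checkpoint-time scan described above can be sketched in a few lines. The directory layout and field names here are assumptions for illustration only.

```python
# Sketch of the FT3 checkpoint scan: only blocks whose "writeback-pending"
# attribute is set are added to the recovery-memory update message sent to
# Home2; the attribute is cleared once Home2 has the value.

def build_checkpoint_update(directory):
    """directory: mapping block_id -> {'wb_pending': bool, 'value': bytes}."""
    update = {b: e["value"] for b, e in directory.items() if e["wb_pending"]}
    for e in directory.values():
        e["wb_pending"] = False        # Home2 is now up to date
    return update

directory = {
    0: {"wb_pending": True,  "value": b"new"},
    1: {"wb_pending": False, "value": b"old"},
}
msg = build_checkpoint_update(directory)
assert msg == {0: b"new"}              # fewer blocks than a full FT0 update
```

Compared with the FT0 update message, which would carry every modified block, this message carries only the blocks Home2 has not yet seen.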
Finally, the FT3 protocol removes the requirement to multicast the Ownership Acknowledge
message to the requesting node and Home2. The data in an Ownership Acknowledge message is
not useful to the Home2 node because the data block in the Home2 memory remains in the
INVALID state until a Data Acknowledgement message is received by the Home2 node. As long
as a node has the ownership of a block, it is possible the block will be modified. The only way to
get the most recent value of a data block is to either receive the writeback message that occurs
when the node relinquishes ownership of the block or to get the data block as part of a future Data
Acknowledge message. Furthermore, it is only necessary to send the Home2 node a Data
Acknowledge message when the data block has just transitioned from the EXCLUSIVE state to
the SHARED state. It is not necessary to multicast the Data Acknowledge message to Home2 if
the block was already in the SHARED state when the requesting node made a read request. The
“writeback-pending” attribute is used to indicate that the Home2 node has not received the latest
value of a data block and is set anytime a writeback of any kind occurs. The attribute is cleared
when a Data Acknowledge message is multicast to the Home2 node. Each time the home
directory is ready to send a Data acknowledge to another node, it checks the “writeback-pending”
attribute. If the attribute is set, the Data Acknowledge message is multicast to the Home2 node,
otherwise it is not. A major advantage of this approach is that no additional effort is needed to
determine the correct order of updates to a data block when the updates consist of
Writeback messages received from different caches. All messages sent from the Home node are
seen in the same order in all other nodes. Therefore the most recent value of a data block is the
value received in the most recent Data Acknowledge message.
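The multicast rule above reduces to a single check of the "writeback-pending" attribute each time a Data Acknowledge is sent. The sketch below uses assumed names and is not the thesis implementation.

```python
# Sketch of the FT3 Data Acknowledge targeting rule: multicast to Home2
# only when a writeback has occurred since Home2 last saw the block, and
# clear the attribute once Home2 receives the value.

def data_ack_targets(entry, requester):
    targets = [requester]
    if entry["wb_pending"]:
        targets.append("Home2")
        entry["wb_pending"] = False   # cleared once Home2 gets the value
    return targets

entry = {"wb_pending": True}
assert data_ack_targets(entry, "NodeX") == ["NodeX", "Home2"]
assert data_ack_targets(entry, "NodeY") == ["NodeY"]   # already current
```

Since every message leaving the Home node is seen in the same order by all other nodes, the most recent Data Acknowledge always carries the most recent value, with no extra ordering logic.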
4.5.1 Performance
The total execution time, including normal execution and checkpointing time, is compared
for the FT0, FT1, FT2 and FT3 protocols for each of the four synthetic workloads
in Figures 4.30 through 4.33, and for the multiprocessor trace files in Figure 4.34. The synthetic
applications contained 30,000 memory references and created three checkpoints during the
execution of the application.
In the Case N01 applications, there is a low probability (.01) that a memory reference
will be for a block that is in the local memory of the requesting node. There is also a low level of
sharing data blocks between all the nodes. Figure 4.30 shows the performance of each of the
protocols as the probability that the backup node (Home2) will reference a data block in the
local memory of the primary node (Home) increases. Increasing the probability that data will be shared
between the Home and Home2 nodes increases the probability that read misses on the Home2
node can be filled from the Home2 memory, where they otherwise would have required a remote
access. Under these conditions the FT1, FT2 and FT3 protocols provide an improvement over FT0. The
results from these experiments show that even for very small increases in the probability of
sharing data between the Home and Home2 nodes, the performance of the system increases
significantly.
In the Case S01 and Case S10 applications, very little effort is made to encourage sharing of data
between the Home and Home2 nodes in particular (probability of .01). The probability that a
memory reference is to a remote block that has been accessed before is increased. This type of
application is one in which there is an increasing degree of sharing of data structures between all of
the nodes in the system. The Case S01 experiments shown in Figure 4.32 have a small
probability of a node accessing a block in its own local memory (.01) and the Case S10
experiments shown in Figure 4.33 have a larger probability (.10) of a node accessing a local
block. In all experiments shown in Figures 4.32 and 4.33, the probability that the Home and
Home2 nodes in particular share data is .02. The Case S01 experiments in Figure 4.32 show that
FT1, FT2 and FT3 perform better than FT0, both in terms of normal execution time and in the
time required to create checkpoints. In addition, FT1, FT2, and FT3 provide similar performance.
The reason for this is that the features of the FT2 and FT3 protocols are aimed at reducing the
penalties from write requests that are local in FT0 but must be handled as a remote request in the
FT1 protocol. Since the probability of accessing local memory is small in these experiments, the
optimizations of FT2 and FT3 are rarely needed. The higher probability of accessing local
memory in the Case S10 experiments shown in Figure 4.33 allows the optimizations of FT2 and
FT3 to be utilized. Although all three protocols, FT1, FT2, and FT3, provide better performance
than FT0, in each experiment, the FT2 and FT3 algorithms provided an improvement over the
FT1 algorithm. In these experiments, the FT3 algorithm has a lower normal execution time than
FT2, but the time required to send the update message containing the local Victim writeback
messages that were not sent to Home2 is high enough to make the total execution time higher than
that of FT2.
In order for FT3 to spend less time in the Recovery Memory Update
phase of the checkpoint procedure, the local memory accesses must be concentrated on a small
number of local data blocks. If the number of local memory accesses is high but the set of blocks
being accessed is small (e.g., ongoing updates to a particular data structure), the FT3 protocol
will not only eliminate the local Victim writeback messages that the FT2 protocol requires but
will also have a significantly smaller update message at checkpoint time. In other words, FT3
benefits when a small number of local data blocks are obtained exclusively and then written back
a large number of times.
Figure 4.30: Total Execution Time, Case N01 [bar chart; y-axis: Clock Cycles, 0–1,000,000;
bars: N01–N10 under FT0–FT3, increasing probability of neighbor access; each bar split into
Normal Execution and Total Checkpoint]
Figure 4.31: Total Execution Time, Case N10 [bar chart; y-axis: Clock Cycles, 0–1,200,000;
bars: N01–N10 under FT0–FT3, increasing probability of neighbor access; each bar split into
Normal Execution and Total Checkpoint]
Figure 4.32: Total Execution Time, Case S01 [bar chart; y-axis: Clock Cycles, 0–1,200,000;
bars: S01–S10 under FT0–FT3, increasing level of sharing; each bar split into Normal Execution
and Total Checkpoint]
Figure 4.33: Total Execution Time, Case S10 [bar chart; y-axis: Clock Cycles, 0–1,400,000;
bars: S01–S10 under FT0–FT3, increasing level of sharing; each bar split into Normal Execution
and Total Checkpoint]
Figure 4.34 provides a comparison of the protocols for the multiprocessor trace files. The
weather application provides an example of a situation where the FT3 protocol performs better
than the FT2 protocol. In this case, the FT0 protocol has a smaller normal execution time than
any of the other protocols but the time to take a checkpoint causes the total execution time to be
higher than FT1, FT2 or FT3.
In the simple application, there are a large number of local accesses to different local data
blocks and a large degree of sharing between the nodes. In this case, most of the local
ownership requests would have to be handled remotely by either FT0 or FT1 because the
requested data is rarely found in the UNOWNED state. The FT2 algorithm has a higher
execution time due to the necessity of multicasting the local victim writeback messages to the
Home2 node. Although the FT3 algorithm does not send the local victim writeback messages to
Home2, the number of blocks accessed during the interval between checkpoints is high
enough to cause the size of the update message sent during the Update Recovery Memory phase
of the checkpoint to approach that of the FT0 protocol. The FT1 algorithm performs
best in this situation because waiting for the acknowledgment from Home2 limits the
number of outstanding ownership requests in the system. If the number of outstanding ownership
requests is too high, the output queues tend to fill up, which causes the entire system to experience
higher latencies in servicing the cache requests for data.
All four protocols perform similarly in the speech application for the simple reason that a
small number of data blocks are modified during the interval between checkpoints, making the
time required to take a checkpoint small in all cases, even for FT0. In addition, there is a very low
probability of locality or sharing between the Home and Home2 nodes, which results in similar
execution times during normal operations for all protocols.
Figure 4.34: Execution Time, Trace Files [bar chart; y-axis: Clock Cycles, 0–18,000,000; bars:
wthr, simp, and speech traces under FT0–FT3; each bar split into Normal Operation and
Checkpoint]
CHAPTER 5: CONCLUSIONS
In this thesis, characteristics required for interconnection networks to provide DSM
systems with the means to achieve efficient low-latency communication were discussed. In
Chapters 2 and 3, the architecture of the SOME-Bus was presented along with possible
implementations of the internal components necessary to provide a fully hardware-supported
DSM system. A method of generating synthetic workloads was presented along with a Decision
Tree structure capable of providing detailed information on the behavior of applications and the
message traffic generated as a result of changes in application behavior. A Trace-driven
simulator was used to compare performance measures such as processor and channel utilization
as well as application runtime for the set of multiprocessor traces and synthetic workloads used to
evaluate the performance of DSM on the SOME-Bus. Also in Chapter 3, a theoretical model
was developed for the SOME-Bus and solved as a closed queuing network. A simulator based on
the network queuing model was developed in which parameters that characterize behavior
patterns of specific applications can be incorporated into the model. Values for Processor and
Channel utilization as well as waiting time in the output channel queue were compared for the
Trace-driven and Distribution-driven simulators as well as the theoretical model. The results of
the three approaches were very close, with differences of less than 10% for processor and channel
utilization, showing that the complex behavior present in DSM systems can be modeled effectively
by relatively simple queuing models using parameters that characterize real applications.
In Chapter 4, a set of protocols was presented to achieve fault tolerance with little or
no network overhead when implemented on the SOME-Bus. The advantages and disadvantages
of each protocol were presented in terms of specific application characteristics such as locality
and degree of internode sharing of data. Applications implementing the protocols were analyzed
using the Decision Tree structure in order to determine the source of higher levels of network
traffic. Several versions of the protocols were proposed in order to take advantage of features
such as locality and communication patterns that occur frequently between pairs of nodes.
Results show that for the synthetic workloads, not only was the overhead for fault tolerance
hidden, but the performance of the DSM was improved by the addition of the protocols.
Characteristics such as hot spots and a small degree of locality prevented the multiprocessor
traces from benefiting from the protocols with the exception of simple, which demonstrated a
performance increase when the FT1 protocol was applied. The speech application was successful
in hiding the fault tolerance overhead and weather allowed the overhead to be minimized.
The work described in this thesis demonstrates the benefits of interconnection network
architectures that are optimized for the types of communication inherent in distributed shared
memory, namely support for frequent, efficient multicast of relatively small messages. A DSM
implemented within such a framework has the potential to realize not only an increase in the level
of fault tolerance of the system but also a simultaneous increase in the performance of the overall
DSM system.
List of References
1. Adve, V., "Performance Analysis of Mesh Interconnected Networks With DeterministicRouting", IEEE Transactions on Parallel and Distributed Systems, Mar 1, 1994, v 5, n 3, p.225.
2. Banatre, M., Gefflaut, A., Joubert, P., Morin, C., and Lee, P. A., "An architecture fortolerating processor failures in shared-memory multiprocessors," 1996.
3. Bershad B. N. and Zekauskas, M. J., “Midway: Shared Memory Parallel Programming withEntry Consistency for Distributed Memory Multiprocessors,” Research Report CMU-CS-91-170. Dept of Computer Science, Carnegie-Mellon Univ. Sept, 1991.
4. Bolch G., Greiner S., Trivedi K. S. and Meer H. Queueing Networks and Markov Chains:Modeling and Performance Evaluation with Computer Science Applications.
5. Bouzid, A., M. A. G. Abushagur, "Thin-film approximate modeling of in-core fiber gratings."Optical Engineering, Vol. 35, No. 10, pp. 2793-2797 (1996)
6. Calvin, C., "All-To-All Broadcast in Torus with Wormhole-Like Routing", IEEE Symp.Parallel Distributed Process., 1995, pp. 130-137.
7. Chaiken, D.; Fields, C.; Kurihara, K.; Agarwal, A. “Directory-based cache coherence inlarge-scale multiprocessors “Computer , Vol.23, Iss.6, 1990 Pages: 49- 58
8. Culler, D. E., Singh, J. P. and Gupta, A., Parallel Computer Architecture, Morgan Kaufman,1999
9. Culp, J., B. Nabet, F. Castro, and A. Anwar, "Gain Enhancement of Low Temperature GaAsHeterojunction MSM Photodetector", Applied Physics Letters, March 1998.
10. Dong, L., Ortega, B., Reekie, L., "Coupling characteristics of claddding modes in tiltedoptical fiber gratings", Applied Optics, 37:(22), pp. 5099-5105, 8/1998.
11. Erdogan, T., Sipe, J., "Tilted fiber phase gratings", Journal of the Optical Society of America,13:(2), pp. 296-313, 2/1996.
12. Fleisch, B. D., Michel, H., Shah, S. K., and Theel, O. E., "Fault tolerance and configurabilityin DSM coherence protocols," 2000.
13. Gharachorloo K., Lenoski D., Laudon J., Gibbons P., Gupta A., and Hennessy J., “MemoryConsistency and Event Ordering in Scalable Shared Memory Multiprocessors,” Proc. 17th
Ann. Int’l Symp. Computer Architecture, pp. 15-26, Seattle, May 1990.
14. Gravenstreter, G., Melhem, R., "Realizing Common Communication Patterns in PartitionedOptical Passive Stars (POPS) Networks", IEEE Transactions on Computers, Vol. 47, No. 9,9/1998.
15. Grujic, A., Tomasevic, M., Milutinovic, V., "A Simulation Study of Hardware-Oriented DSMApproaches", IEEE Parallel & Distributed Technology, Sprg 1996, v 4, n 1, p. 74.
13216. Ho, C., "Optimal Broadcast in All-Port Wormhole-Routed Hypercubes", IEEE Trans.
Parallel Distr Syst., Feb. 1995 v. 6 n. 2 p. 203(5).
17. Katsinis, C., "Performance Analysis of the Simultaneous Optical Multiprocessor ExchangeBus", Parallel Computing Journal, Vol. 27, No. 8, pp. 1079-1115, July 2001.
18. Kermarrec, A-M, Morin, C., Banatre, M., “Design, implementation and evaluation ofICARE: an efficient recoverable DSM”, Software - Practice and Experience, vol. 28, no. 9,pp. 981-1010, 25 Jul 1998.
19. Kim J.H. and Vaidya N. H., “Recoverable Distributed Shared Memory Using the CompetitiveUpdate Protocol”, Proceedings of the Pacific Rim International Symposium on Fault-TolerantSystems, Newport Beach, California, USA, Dec 1995, pp. 152-157
20. Kim J.H. and Vaidya N. H., “Single fault-tolerant distributed shared memory usingcompetitive update”, Microprocessors and Microsystems, Volume 21, Issue 3, 15 December1997, Pages 183-196.
21. Kontothanassis, L. L., Scott, M. L, "Using memory-mapped network interfaces to improvethe performance of distributed shared memory", IEEE High Performance ComputerArchitecture 1996, pp. 166-177.
22. Kulick, J., Cohen, W. E., Katsinis, C., Wells, E., Thomsen, A., Gaede, R. K., Lindquist, R.G., Nordin, G. P., Abushagur, M., and Shen, D., "The simultaneous optical multiprocessorexchange bus," 1995.
23. Lan, Y., "Multicast Communication in 2-D Mesh Network", IEEE Int Conf. ParallelDistributed Syst., Icpads, 1994, pp. 63-68.
24. Lee, M. , Little, G., "Study of radiation modes for 45-deg tilted fiber phase gratings", OpticalEngineering, 37:(10), pp. 2687-2698, 10/1998.
25. Li Y., T. Wang, "Distribution of light power and optical signals using embedded mirrorsinside polymer optical fibers", IEEE Photonics Technology Letters, vol. 8, no. 10, October1996, pp. 1352-1354.
26. Li Y., T. Wang, and K. Fasanella, "Cost-Effective Side-Coupling Polymer Fiber Optics forOptical Interconnections", Journal of Lightwave Technology, vol. 16, no. 5, May 1998, pp.892-901.
27. Mckinley, P., "Collective Communications in Wormhole-Routed Massively ParallelComputing", IEEE Computer, v. 28, n. 12, pp. 39-50, 1995.
28. Morin, C. and Puaut, I., "A survey of recoverable distributed shared virtual memorysystems," 1997.
29. Morin, C., Kermarrec, A. M., Banatre, M., and Gefflaut, A., "An efficient and scalableapproach for implementing fault-tolerant DSM architectures," 2000
30. Ould-Khaoua, M., "Comparative Evaluation of Hypermesh and Multi-Stage InterconnectionNetwork", Computer Journal, 1996 v. 39 n. 3 p. 232.
31. Panda, D., "Fast Barrier Synchronization in Wormhole k-ary n-cube Network with Multidestination Worms", IEEE High Performance Computer Architecture Symp., 1995, pp. 200-209.
32. Rajasekaran, S. and Sahni, S., "Sorting, Selection, and Routing on the Array with Reconfigurable Optical Buses", IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 11, November 1997.
33. Shang, S., "Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters", IEEE Trans. Parallel Distributed Syst., vol. 6, no. 6, pp. 591-605, 1995.
34. Silva, L. M. and Silva, J. G., "Checkpointing distributed shared memory", Journal of Supercomputing, vol. 11, no. 2, pp. 137-158, 1997.
35. Szymanski, T., "Hypermeshes: Optical Interconnection Networks for Parallel Computing", Journal of Parallel and Distributed Computing, vol. 26, no. 1, p. 1, April 1995.
36. Theel, O. and Fleisch, B., "A dynamic coherence protocol for distributed shared memory enforcing high data availability at low costs", IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 9, pp. 915-930, September 1996.
37. Trivedi, K., Probability and Statistics with Reliability, Queueing and Computer Science Applications, Prentice-Hall, 1982.
38. Turk, J. G. and Fleisch, B. D., "DBRpc: a highly adaptable protocol for reliable DSM systems", 1999.
39. Willick, D. L. and Eager, D. L., "An Analytic Model of Multistage Interconnection Networks", ACM SIGMETRICS, May 1990, pp. 192-202.
40. Yang, J., "Hardware Support for Efficient Barrier Synchronization on 2-D Mesh Networks", IEEE Int. Conf. Distributed Computing Syst., 1996, pp. 233-240.
VITA
Diana Lynn Hecht was born in Tuscaloosa, Alabama, in 1964. Diana Hecht received her
B.S.E. in 1995 and M.S.E. in 1999 from the University of Alabama in Huntsville. As an
undergraduate, Diana was a Co-op Student working at NASA, where she received a U.S. Patent
for her contribution to the design of an Imaging Phosphor Scanner for application in X-Ray
Crystallography. After receiving her B.S.E., Diana was employed by VME Microsystems Inc.,
where she developed low-level software for embedded systems before returning to school to
pursue her M.S.E. and Ph.D. in Computer Engineering. While pursuing her graduate studies,
Diana held a teaching and research assistantship and has extensive teaching experience in the
areas of hardware design and real-time systems software. Diana’s past research activities
included the development of a parallel version of an existing Optical Plume Anomaly Detection
software application (sponsored by NASA). The scheduling issues that arose from the
transformation of the application became the focus of her master’s thesis. Additional research
efforts involved Fault Tolerance for Low-Bandwidth Networks aimed at the Rapid Force
Projection Initiative System Network (sponsored by the US Army Aviation and Missile
Command). Diana has also been involved in research sponsored by the National Science
Foundation for the development of a prototype of the Simultaneous Optical Multiprocessor
Exchange Bus (SOME-bus). Diana Hecht is currently working as a research engineer for Rydal
Research and Development where she is involved in research and development efforts in low
latency processor interconnects and represents Rydal Research in the RapidIO Trade Association.