

DYNAMIC LOAD BALANCING FOR CLUSTERED TIME WARP

by

Khalil M. El-Khatib

School of Computer Science

McGill University, Montreal

November 1996

A dissertation

submitted to the Faculty of Graduate Studies and Research

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Copyright © 1996 by Khalil M. El-Khatib


Abstract

In this thesis, we consider the problem of dynamic load balancing for parallel discrete event simulation. We focus on the optimistic synchronization protocol, Time Warp.

A distributed load balancing algorithm was developed, which makes use of the active process migration in Clustered Time Warp. Clustered Time Warp is a hybrid synchronization protocol; it uses an optimistic approach between the clusters and a sequential approach within the clusters. As opposed to the centralized algorithm developed by H. Avril for Clustered Time Warp, the presented load balancing algorithm is a distributed token-passing one.

We present two metrics for measuring the load: processor utilization and processor advance simulation rate. Different models were simulated and tested: VLSI models and queuing network models (pipeline and distributed networks). Results show that improving the performance of the system depends a great deal on the nature of the simulated model.

For the VLSI model, we also examined the effect of the dynamic load balancing algorithm on the total number of processed messages per unit time. Performance results show that by dynamically balancing the load, the throughput of the simulation was improved by more than 100%.

Résumé

In this thesis, we consider the problem of dynamic load balancing for discrete event simulation based on an optimistic synchronization principle.

A distributed algorithm based on the principle of process migration was developed for the TWR system (Time Warp avec Regroupement, i.e. Clustered Time Warp). TWR is a hybrid synchronization protocol: it uses an optimistic approach between the groups of processes and a sequential approach within the groups. In contrast to the centralized algorithm developed by H. Avril, our algorithm is distributed and based on the circulation of a token.

Different models were simulated and tested: VLSI circuits, assembly pipelines, and distributed networks. The results obtained show that dynamic load balancing considerably improves the performance of the simulation.

ACKNOWLEDGMENTS

I wish to place on record my profound sense of gratitude to my thesis advisor, Prof. Carl Tropper, for his advice, guidance and encouragement throughout all phases of this work.

I owe an especial debt of gratitude to Dr. Azzedine Boukerche for the vivid instructions and exchange of ideas. I have no doubt that this thesis is a better product due to his guidance.

I am very grateful to Hervé Avril for his helpful discussions and help in providing the code for CTW.

A general thanks to all my fellow consultants at the School of Computer Science. It is always a great pleasure working with them.

I would like to thank all the system staff, and especially Luc Boulianne, Matthew Sams, and Kent Tse, for keeping biernat up and running. Further thanks go to all the people working in the administrative office at our school.

I greatly appreciate the friendship I share with Rami Dalloul and Nada Tammim Dalloul, who have been with me every step of the way.

I am grateful to Jacqueline Fai Yeung and Peter Alleyne for their patience and unfailing assistance. I will never ever forget their help.

Last but not least, I remain grateful to my family, whose moral support and encouragement were responsible for my sustained progress throughout the course of this work.

Contents

1 Introduction
  1.1 Parallel Discrete Event Simulation
  1.2 Parallel Logic Simulation
      1.2.1 Logic Simulation
      1.2.2 Partitioning
      1.2.3 Fine Computation Granularity
      1.2.4 Circuit Activity

2 Parallel Discrete Event Simulation 8
  2.1 Introduction 8
  2.2 Optimistic Approach 9
      2.2.1 Local Control Mechanism 9
      2.2.2 Global Control Mechanism 10
  2.3 Conservative Approach 14
  2.4 Load Balancing 17
      2.4.1 Static Load Balancing 17
      2.4.2 Dynamic Load Balancing 20
  2.5 Clustered Time Warp (CTW) 26
      2.5.1 Cluster Structure and Behavior 27
      2.5.2 Variations of Clustered Time Warp 28
      2.5.3 Load Balancing for Clustered Time Warp 29

3 Metrics and Algorithm 30
  3.1 Metrics 30
      3.1.1 Processor Advance Simulation Rate (PASR) 30
      3.1.2 Processor Utilization (PU) 32
      3.1.3 Combination of PU and PASR 33
  3.2 Algorithm 33
      3.2.1 Pseudo Code 34
  3.3 Design Issues 36
      3.3.1 Token Period, P 36
      3.3.2 Selecting the Clusters (Processes) 37
      3.3.3 Number of Processors that Can Initiate Migration 37

4 Results and Discussion 38
  4.1 Introduction 38
  4.2 Experimental Results 39
      4.2.1 Increase in Rollbacks 44
  4.3 Pipeline Model 44
  4.4 Distributed Network Model 48

5 Conclusion

List of Figures

2.1 A Time Warp process (A) 11
2.2 Process A after message with timestamp 142 was processed 11
2.3 Rollback to time 135 12
2.4 Deadlock Situation 15
2.5 Cluster Structure 27
3.1 Processors with different PASRs 32
4.1 Simulation time for C2 40
4.2 Simulation time for C3 40
4.3 Peak memory consumption for C2 42
4.4 Peak memory consumption for C3 43
4.5 Throughput Graph for C2 45
4.6 Throughput Graph for C3 45
4.7 Manufacture Pipeline Model 46
4.8 Percentage Increase in the Number of Rollbacks 47
4.9 Percentage Reduction of the Simulation Time 47
4.10 Distributed Network Model 48
4.11 Percentage of Reduction in the Simulation Time in the Presence of Time-Varying Load 49
4.12 Percentage of Reduction in the Simulation Time in the Absence of Time-Varying Load 50


Chapter 1

Introduction

1.1 Parallel Discrete Event Simulation

Simulation is regarded as an indispensable tool for the design and verification of complex and large systems. There is hardly a discipline or a science that does not use simulation. Examples are queuing network simulation [Nico88], VLSI simulation [Bail94], and battlefield simulation [Reit90].

Discrete Event Simulation (DES) is a technique for the simulation of systems which can be modeled as discrete systems: the components of the system change states at discrete points in time. For example, gates in a logic simulation change state from 1 to 0, or vice versa.

Continuous Simulations (CS) are used to simulate models in which components change states in a continuous way with time; examples are weather and liquid flow simulation. Unlike discrete simulation, continuous simulation usually involves the use of differential equations to evaluate the change in the state value.

A simulated system is usually divided into entities which interact over time by scheduling events for one another. An event describes a change in the state of a certain entity or component in the model. Each event in the system is assigned a virtual time stamp which represents the time at which the event would happen in the real world. For instance, entities in a computer network simulation represent computer servers, and events represent real messages sent over the network.

Originally, simulations were run on sequential machines. The simulated system is modeled as follows: a state, which includes local variables, and an event queue, which contains the events scheduled for the system. The simulator continuously processes the event with the smallest timestamp, advances the simulation time, and schedules events for the future.
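A minimal sketch of this sequential event loop follows (the function names and the shape of the handle_event callback are illustrative, not from the thesis; Python is used for all code sketches in this document):

```python
import heapq, itertools

def run_sequential_des(initial_events, handle_event, end_time):
    """Classic sequential DES loop: repeatedly remove the event with the
    smallest timestamp, advance the simulation clock, and schedule any
    newly generated future events. `handle_event` is assumed to return
    an iterable of (timestamp, event) pairs to schedule."""
    tie = itertools.count()                     # breaks timestamp ties
    queue = [(ts, next(tie), ev) for ts, ev in initial_events]
    heapq.heapify(queue)
    clock = 0.0
    while queue:
        ts, _, ev = heapq.heappop(queue)
        if ts > end_time:
            break
        clock = ts                              # advance simulation time
        for new_ts, new_ev in handle_event(clock, ev):
            heapq.heappush(queue, (new_ts, next(tie), new_ev))
    return clock
```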

Because of the rapid change in the size and complexity of the models, Parallel Discrete Event Simulation (PDES) seems to be a promising tool for handling the demands of the world of simulation. In PDES, different components of the model reside on different machines. Each component has its own separate state and event queue. The main problem for PDES is how to satisfy the causality constraint: events that depend on each other have to be processed in (increasing) order of their timestamps.

Two main approaches to avoiding causality errors have been widely studied and used: the conservative approach, introduced by Chandy & Misra [Chan79], and the optimistic approach, pioneered by Jefferson [Jeff85]. The two approaches differ in the way they deal with the causality issue. In a conservative simulation, the logical process (LP) will process an event only when it knows it is safe to do so. An LP in an optimistic simulation processes events as soon as they arrive. If it later receives an event with a smaller time stamp, it rolls back and restores an old correct saved state. Before resuming the simulation, the LP cancels previously sent messages by sending anti-messages that annihilate the originals.

The focal point of this thesis is dynamic load balancing for Clustered Time Warp (CTW) [Avri95]. CTW is a hybrid approach that uses Time Warp between clusters of logical processes and a sequential mechanism inside the clusters. Although the problem of load balancing has been extensively studied over the past few years [Hac87, Wang85, Livn82, Zarg85], few results are directly related to the area of PDES. In this thesis, two metrics for dynamic load balancing are presented. A distributed dynamic load balancing algorithm is also constructed. Several associated models are simulated to evaluate the performance of the algorithm and metrics.

1.2 Parallel Logic Simulation

Logic simulation is a standard technique for the design and verification of VLSI systems before they are actually built. While the verification part can reduce the expensive prototypes and debugging time, the execution time required by simulation exceeds days and even weeks of CPU time for large circuits. Parallel logic simulation provides an alternative solution that accounts for the rapid increase in the size of circuits.

Two traditional algorithms for parallel logic simulation have been proposed in the literature: (1) compile-mode, and (2) centralized-time event-driven. In compile-mode simulation [Wang87, Wang90, Maur90], the system simulates all the elements after each time unit. Scheduling and synchronization are simplified in this approach at the expense of run time. The simplicity of this approach makes it feasible for direct implementation in hardware. While the compile-mode algorithm can result in high speed-ups for active circuits, the synchronization time is a hindrance for circuits with low activity. Event-driven simulation [Fran86, Smit87, Smit88, Soul88, Wils86, Wong86] provides an alternative solution in which only elements whose inputs have changed are simulated at each time step.

Compile-mode and centralized-time event-driven are both synchronous algorithms. Asynchronous algorithms used in parallel discrete event simulations have exhibited promise for parallel logic simulation.

Soule and Gupta [Soul92] discussed the effectiveness of the Chandy-Misra-Bryant (CMB) algorithm [Chan79, Brya77] for parallel logic simulation. For six benchmark circuits, the authors reported an effective concurrency of 42-196. Effective concurrency is defined as the amount of parallelism that is obtained when using arbitrarily many processors, in the absence of synchronization and scheduling overheads. The authors noticed that, using the CMB algorithm, 50% to 80% of the total execution time is spent on deadlock resolution. To resolve this issue, the authors experimented with many variations such as look-ahead, null messages, clustering, and a demand-driven scheme. Speed-ups of 16-32 were achieved on a 64-processor machine.

Su and Seitz [Su89] studied logic simulation for null-message conservative simulation. To reduce the overhead of null messages, two variations were examined: (1) lazy null messages, and (2) element grouping¹. The authors used Intel iPSC machines as a testbed to measure the speed-ups of a 1376-gate multiplier network. Speed-ups of approximately 2 and 10 were reported using the 16-node iPSC/2 and the 128-node iPSC/1 respectively.

¹Same idea as CTW, introduced in Section 2.5.

Time Warp techniques were also used for VLSI circuit simulation. Chung and Chung [Chun89] implemented Time Warp on the Connection Machine. The authors described the Lower Bound Rollback (LBR) technique to overcome storage problems. The LBR of an LP is equal to the local virtual time of the LP if it does not have incoming links. For an LP with incoming links, the LBR is set to the minimum of the local virtual time of the LP and of all of the LBRs of the incoming links. If the LP is part of a cycle, then the LBR is set to the GVT. Once the LBR is known, each LP can reclaim all the space used by events with timestamps up to the LBR. Two circuits from the ISCAS-85 benchmarks were simulated: a 16-bit combinational multiplier (2406 gates) and a 32-bit array multiplier (8000 gates). The authors reported that using the LBR technique can reduce storage by up to 30% compared to the original Time Warp, depending on the simulated circuit.
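As a sketch, the LBR rule reduces to a simple recurrence (the LP attributes used below are assumptions for illustration, not the interface of [Chun89]):

```python
def lower_bound_rollback(lp, gvt):
    """LBR rule as described above: the LVT for LPs with no inputs; the
    minimum over the LP's LVT and its incoming links' LBRs otherwise;
    and the GVT for LPs that are part of a cycle."""
    if lp.in_cycle:
        return gvt                      # cyclic LPs fall back to the GVT
    if not lp.incoming:
        return lp.lvt                   # no incoming links: LBR is the LVT
    return min([lp.lvt] + [link.lbr for link in lp.incoming])
```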

Briner et al. [Brin91] present another version of Time Warp for logic simulation, in which the algorithm uses incremental state saving to reduce the overhead of periodically checkpointing all events on a processor. The simulation runs on a BBN GP1000 machine, and it uses a lazy cancellation scheme. Even though the authors reported an improvement in the performance of the simulation, they noticed that circuit partitioning is important in order to reduce rollbacks in the system.

Implementations of Time Warp can also be found in [Bauc91]. The simulator was implemented on a Sequent, and also on a set of Sun Sparcstation 2's. Speed-ups from 2 to 3 were reported using a 5-processor Sequent, and 5.2 to 5.7 using a network of 8 Sun Sparcstations. Simulated circuits were selected from ISCAS-89 [Brgl89].

1.2.1 Logic Simulation

VLSI circuits can be simulated at various levels, depending on the abstraction of the simulated elements. Each level emphasizes certain aspects of the simulated circuit. Four types of logic simulators are presented in [Horb86]:

• Switch level simulators
The switch level simulator is a special type of gate simulator which can simulate MOS circuits more precisely than the conventional gate level simulators.

• Gate level simulators
The simulated elements are logic gates such as ANDs, ORs and INVERTERs. Each element has only one output terminal.

• Functional level simulators
The simulated elements are functional primitives such as latches, flip-flops, counters and registers. The elements may have more than one output terminal.

• Register transfer level simulators
Register transfer level simulators provide the ability to define the behavior of the simulated circuits with a register transfer language.

Each circuit element (gate) is represented by a logical process (LP). LPs are connected with each other as determined by the circuit connections. Each LP has its own code that simulates the function of the corresponding circuit element.

Because of the rapid increase in the size of VLSI circuits [Mukh86, Term83], large simulation times have become a serious problem. VLSI simulation poses many challenging problems for parallel discrete event simulation. The following text describes three major problems: circuit partitioning, computation granularity and locality of activity.

1.2.2 Partitioning

The distribution of LPs on the processors of a distributed computing system has a great impact on the performance of the simulation. A parallel simulation with a bad distribution might run slower than its equivalent serial simulation. VLSI circuits are hard to partition [Cart91, Spor93, Bail94] because of the lack of information on signal activity: partitioning can only be done based on the structural information of the circuit.

A number of partitioning algorithms [Spor93, Bail94] attempt to balance the load by assigning an equal number of components to each of the processors. This technique ignores the fact that all parts of the circuit might not be active at the same time [Soul87].

Another approach to partitioning is to minimize interprocessor communication, for example by placing frequently communicating components on the same processor [Nand92]. Carothers et al. [Caro95] showed that an increase in communication delays can significantly reduce the performance of Time Warp for simulations with a large number of LPs and fine-grained events, such as logic simulation.

1.2.3 Fine Computation Granularity

The fine computation granularity of events is a serious problem for optimistic logic simulation. The message service time at an LP is very small (e.g. evaluating the output of an AND gate), which makes message cancellation difficult in case of rollback; anti-messages cannot always annihilate previously sent messages quickly enough, possibly leading to a cascaded rollback. Moreover, the number of events is in the hundreds of thousands. A change in the state of a gate, from high to low or vice versa, will create an event which might trigger thousands of other events. The large number of events means that the size of the event heap, the state queue, the input queue and the output queue are very large, and any operation on these data structures becomes correspondingly expensive.

1.2.4 Circuit Activity

VLSI circuits are known to have a locality of activity property [Avri96, Bail94, Soul92], i.e. only a part of the circuit may be active at a given time, while the remaining parts are idle. This property can cause a big difference in processor utilizations: some processors might be very active while many others are idle.

Chapter 2

Parallel Discrete Event Simulation

2.1 Introduction

This chapter describes the two major techniques for parallel simulation: the optimistic approach (Section 2.2) and the conservative approach (Section 2.3). Section 2.4 gives a general description of load balancing. In Section 2.5, we introduce Clustered Time Warp, which uses Time Warp synchronization between clusters and a sequential mechanism within the clusters.

2.2 Optimistic Approach

In 1983, Jefferson [Jeff83] proposed a new paradigm for synchronizing distributed systems. This new paradigm, known as virtual time, provides a flexible abstraction of real time for distributed systems.

Time Warp was introduced as an optimistic synchronization protocol which implements the virtual time paradigm. It uses run-time detection of errors caused by out-of-order execution of dependent simulation events, and recovery using a rollback mechanism.

The following subsections describe the two main components of Time Warp: the local and the global control mechanisms. The local control mechanism is responsible for process synchronization. The global control mechanism deals with space management.

2.2.1 Local Control Mechanism

In Time Warp, each message consists of six components: (1) the sender of the message; (2) the receiver; (3) the virtual send time; (4) the virtual receive time (also known as the timestamp); (5) a sign field used to indicate whether the message is an ordinary message or an antimessage; and finally (6) the event data.
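A minimal sketch of such a message record (the field and type names are illustrative, not taken from a specific Time Warp implementation):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class TWMessage:
    sender: str        # (1) sending process
    receiver: str      # (2) receiving process
    send_time: float   # (3) virtual send time
    recv_time: float   # (4) virtual receive time (the timestamp)
    sign: int          # (5) +1 for an ordinary message, -1 for an antimessage
    data: Any          # (6) event data

    def antimessage(self) -> "TWMessage":
        """An antimessage is an exact duplicate with the sign negated."""
        return TWMessage(self.sender, self.receiver,
                         self.send_time, self.recv_time, -self.sign, self.data)
```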

Each process in Time Warp has its own local virtual time (LVT), which is the timestamp of the next message waiting in the input queue. In addition, each process keeps a virtual clock that reads the local virtual time.

Each process has a single input queue in which all arriving messages are stored in increasing order of their timestamps. A process serves all of the messages residing in its input queue, in spite of the fact that future messages might have lower timestamps.

Since the virtual clocks in the system are not synchronized, it is possible for a process to receive a message with a virtual time in the "past". Such a message is called a straggler. In order to maintain the causality constraint, the receiving process has to roll back to an earlier virtual time, cancelling all work done since that time and sending "anti-messages" to cancel all of the messages sent after the timestamp of the straggler.

In order to roll back, a process must, from time to time, save its state in a state queue. Whenever a process receives a straggler, it uses the state queue to restore itself to a state prior to the timestamp of the straggler.

An anti-message is a duplicate of the message, with a negative sign field. Each time a process sends a message, its anti-message is retained in the sender's output queue. In the case of a rollback, all of the anti-messages with timestamps higher than that of the straggler are sent out.
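Combining the state queue and the output queue, the straggler-handling step can be sketched as follows (building on the TWMessage sketch above; the process attributes and methods, state_queue, output_queue, transmit and enqueue, are assumptions for illustration):

```python
def handle_straggler(process, straggler):
    """Roll a process back to just before the straggler's timestamp.
    `state_queue` is assumed to hold (time, state) pairs saved in
    increasing time order, with an initial state at time 0."""
    t = straggler.recv_time
    # 1. Discard invalidated states and restore the last one saved before t.
    while process.state_queue and process.state_queue[-1][0] >= t:
        process.state_queue.pop()
    process.lvt, process.current_state = process.state_queue[-1]
    # 2. Transmit antimessages for messages sent after the restored time.
    for msg in [m for m in process.output_queue if m.send_time > process.lvt]:
        process.transmit(msg.antimessage())
        process.output_queue.remove(msg)
    # 3. Re-insert the straggler (the input queue keeps timestamp order).
    process.enqueue(straggler)
```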

Figures 2.1, 2.2 and 2.3 show the structure and state of a process during three consecutive snapshots. Figure 2.1 shows process A about to process the message with timestamp 142, coming from process D. After processing that message (Figure 2.2), process A saves its state and changes its local virtual time to 154, which is the timestamp of the next message waiting in the input queue. At this moment, process A receives a message (a straggler) with timestamp 135. Figure 2.3 shows process A after rolling back to a safe state: (1) the last state saved before timestamp 135 was restored; (2) antimessages in A's output queue that have a virtual timestamp larger than 135 have been transmitted to their destinations.

2.2.2 Global Control Mechanism

During the simulation, each process uses its memory for saving states, storing input messages, and keeping copies of output messages. Since each process has a limited memory, space has to be reclaimed regularly for further use.

Global virtual time (GVT) was introduced by Jefferson [Jeff83] to help solve the problem of memory management:

The GVT at real time r is the minimum of (1) all virtual times in all virtual clocks at time r, and (2) the virtual send times of all messages that have been sent but have not yet been processed at time r.

Figure 2.1: A Time Warp process (A)

Figure 2.2: Process A after message with timestamp 142 was processed

Figure 2.3: Rollback to time 135

Once the GVT is computed, the system can get rid of all but the last saved state before the GVT, since there might be no state saved exactly at the GVT. Similarly, the events prior to the GVT in the LP input queue can be removed; more precisely, only events with timestamps smaller than the timestamp of the oldest kept state can be discarded. This process is called fossil collection.

The task of finding the GVT in a distributed environment is not trivial. In a shared memory multiprocessor environment, the GVT can be easily computed [Fuji89]; however, transient messages in a distributed system make GVT computation a difficult problem. A transient message is a message that has been sent from a source process but has not yet arrived at the destination process.

A very simple way to find the GVT is to stop all processes, compute the GVT, and then resume the computation. The problem with this approach is the high cost in time, especially when the number of processes is large and when the GVT computation is done frequently.

Some GVT algorithms [Jeff83, Sama85] solve the problem of transient messages by sending acknowledgment messages. However, sending acknowledgment messages results in a large increase in network traffic, and might also degrade the performance of the distributed simulation.

Lin and Lazowska [Lin89] proposed another algorithm for GVT computation, in which an acknowledgment is sent for a group of messages, rather than for each message. The authors reported a 50% decrease in network traffic, compared to an optimized version of Samadi's [Sama85] algorithm proposed by Bellenot [Bell90].

As the number of processes in the simulation grows larger, sending acknowledgments becomes a serious impediment. In response to this problem, asynchronous token-passing algorithms [Bell90, Cou91, Prei89] have been introduced. Preiss's [Prei89] algorithm, presented next, computes the GVT for nodes arranged on a virtual ring. The algorithm runs in two phases: the START phase and the STOP phase.

Let [STARTi, STOPi] denote a real time interval for node i. Let MVTi denote the minimum of all timestamps of all messages either in transit from node i at time STARTi or sent during the time interval [STARTi, STOPi]. Let PVTi denote the smallest timestamp, at time STOPi, of all objects simulated on node i. LVTi represents the minimum of MVTi and PVTi. Then an estimate of the GVT is the minimum of all the LVTi's. The algorithm requires that all of the time intervals have at least one time point in common, during which all processors will be computing their LVTi's at the same time.

During the START phase, the START token starts at node 0 and circulates to each node on the ring. When a process receives the START token, it forwards it to its neighbor, and then sets its STARTi to the receipt time of the token.

When the START token gets back to node 0, it launches the STOP phase. Node 0 computes its LVT0, stores it inside the token, and sends the token on. When a node i receives the STOP token, it compares its LVTi to the GVT value in the token, and the minimum of the two values is kept in the token. When the token gets back to node 0, the value stored inside the token is the smallest of all of the LVTi's, which denotes the GVT. Another round is needed to pass the new GVT to all nodes.

2.3 Conservative Approach

While an optimistic simulation detects and recovers from causality errors by rolling back to a state just prior to the error, a process in a conservative simulation blocks so as to avoid the possibility of any synchronization error.

The channel clock value is defined as the timestamp of the last message received along that channel; in case no message has been received along that channel, the channel clock value is set to 0. Each process repeatedly selects the input channel with the smallest clock value, and serves the message waiting in that channel. If the selected channel is empty, the process blocks, since it does not know the timestamp of the next message to arrive on that empty channel.
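A minimal sketch of this selection-and-blocking rule (the channel representation is an assumption for illustration):

```python
def select_and_serve(channels):
    """Conservative input rule: pick the input channel with the smallest
    channel clock; serve its first message, or block (here: return None)
    if that channel is empty. Each channel is assumed to expose a `clock`
    value (0 until a message arrives) and a FIFO `queue` of messages."""
    chosen = min(channels, key=lambda ch: ch.clock)
    if not chosen.queue:
        return None                      # block: next timestamp unknown
    msg = chosen.queue.pop(0)
    chosen.clock = msg.recv_time         # channel clock = last timestamp seen
    return msg
```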

A deadlock can happen as a natural consequence of this blocking behavior. Figure 2.4 shows three processes in a deadlock situation. Peacock et al. [Peac79] presented a necessary and sufficient condition for deadlock to occur:

A deadlock exists if and only if there is a cycle of empty channels which all have the same channel clock value, and processes are blocked because of these channels.

Figure 2.4: Deadlock Situation

A number of approaches have been proposed to solve the deadlock problem. The first approach, presented independently by Chandy & Misra [Chan79] and Bryant [Brya77], uses null messages to avoid deadlock. A null message provides a lower bound on the timestamp of the next incoming message. Each time a process consumes a message, it must send a message on each of its output channels; if the simulation does not require a regular message on a channel, a null message is sent in its place. When a process receives a null message on one of its input channels, it is guaranteed that no message will arrive on that channel with a smaller timestamp than that of the null message.
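A hedged sketch of this null-message rule (the process interface, handle/send/output_channels, is assumed for illustration; the lookahead bound added here anticipates the discussion later in this section):

```python
from dataclasses import dataclass

@dataclass
class NullMessage:
    timestamp: float   # lower bound on the next real message's timestamp

def consume_and_forward(process, msg, lookahead=0.0):
    """After consuming a message, send something on every output channel;
    where no regular output is produced, send a null message carrying a
    lower bound on the timestamp of the next real message."""
    outputs = process.handle(msg)        # channel -> regular message produced
    for channel in process.output_channels:
        if channel in outputs:
            process.send(channel, outputs[channel])
        else:
            process.send(channel, NullMessage(msg.recv_time + lookahead))
```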

The null message approach suffers from two drawbacks: the fine granularity of the null messages and the minimum time stamp increment. For the simulation of systems of simple elements, such as logic gates, the time required to evaluate the output of a gate is so small that it is comparable to the time required to process a null message. In addition, the number of regular messages in logic simulation is in the millions, and if for each regular message there is only one null message sent, this would double the network traffic.

To reduce the network traffic introduced by the use of null messages, Misra [Misr86] suggested that a process refrain for a period of time from sending a null message, expecting that a real message will be sent over the same channel. After a predetermined period of time, a null message is sent in case no real message has been sent. Misra [Misr86] also suggested the demand-driven technique, in which a process sends a request to its predecessor for the lower bound on the time of its next output message.

Several versions of the previous algorithm have been suggested. Su and Seitz [Su89] suggested a lazy message technique, in which several consecutive null messages are replaced with one null message. Nicol [Nico88] proposed the use of a future list: each process, making use of its input list, can estimate the lower bound on the timestamp of its next message. This estimate, called the lookahead, depends on the service times of the messages in the input queues.

Many empirical results [Nico88, Fuji89, Fuji88] have shown that good simulation performance depends on a good lookahead estimate. Intuitively, a good lookahead reduces the processors' blocking time. For example, if an LP has a clock value t and a lookahead l, then this LP can predict that it will not receive any further message with a timestamp less than t + l. The LP is therefore safe to process all pending events with timestamps less than t + l. The problem with using lookahead is that it requires knowledge about the model, which is sometimes not available without a pre-computation.

While the previous algorithms avoid deadlocks, others detect and break them. Unlike the deadlock avoidance approach, this approach does not prohibit cycles of zero timestamp increment. The first such algorithm was presented in [Misr82]. The algorithm uses diffusion computation to detect whether a given node is part of a cycle or a knot, using special messages, called probes, that are sent along the outgoing edges of the nodes in the graph. The deadlock can be broken by observing that the smallest timestamped message in the entire system is always safe to process. The problem with this approach is that it is difficult to decide when such an algorithm should be employed: if the algorithm is used too frequently, then it might degrade the performance of the simulation. On the other hand, if a deadlock is not broken quickly, the performance of the simulation will also deteriorate.

Lubachevsky [Luba88] proposed an alternative conservative scheme based on the idea of bounded lag (BL). The BL algorithm uses a moving time window in which only events whose timestamps lie in the time window are eligible for processing.

The BL algorithm suffers from two problems: (1) the smallest timestamp of all events in the system has to be computed periodically and sent to all processes, which adds synchronization overhead to the simulation; (2) the size of the time window is critical to achieving good performance, and finding the adequate size of the window requires knowledge about the application.

2.4 Load Balancing

In a distributed system, it is likely that some processors are heavily loaded while others are lightly loaded or idle. Efficient utilization of parallel computer systems requires that the job being executed be partitioned over the processors in an efficient way.

In the general partitioning problem, one is given a parallel computer system (with a specific interconnection network) and a parallel job or task composed of modules or units that communicate with each other in a specified pattern. One is required to map the modules onto the processors in a way that minimizes the total execution time.

The mapping or assignment can be categorized as being either static or dynamic. The static approach¹ assumes a priori knowledge of the execution time and communication pattern of the simulation, and the task-to-processor mapping is decided ahead of run-time. The dynamic approach, in contrast, moves jobs or modules between processors from time to time, whenever this leads to improved efficiency.

In this section, a general review of the field of load balancing is presented with emphasis on the work done in the context of PDES. The first subsection describes some approaches and solutions to the problem of static partitioning. The second subsection outlines some algorithms for dynamic load balancing.

2.4.1 Static Load Balancing

This subsection briefly reviews some of the previous work done on static load balancing. Algorithms for this paradigm rely on a priori knowledge of the computation and communication cost of each task in the system.

¹Also known as static partitioning, compile-time partitioning, and the mapping problem [Taut85, Bouk94].

In terms of graph theory, the problem can be described as follows:

Given a weighted graph G with costs on its edges, partition the nodes of G into subsets no larger than a given maximum size, so as to minimize the total cost of the edges cut and the difference in the subset weights.

Weights on nodes and edges represent processing and communication costs.

Since the problem is known to be NP-complete [Gare79], heuristics [Bokh81, Bokh87, Ni85] have been developed to obtain suitable partitions. Kernighan and Lin [Kern70] proposed a 2-way partitioning algorithm with constraints on the subset size: given a graph of 2n vertices (of equal size), the algorithm splits the graph into two subsets of n vertices each, with minimum cost on all edges cut. The algorithm starts with an arbitrary partition A, B of the graph and tries to decrease the initial external cost by a series of interchanges of subsets of A and B. The external cost is defined as the sum of the connections between the two partitions. Using the 2-way partition as a tool, the authors extended the technique to perform k-way partitions on a set of kn objects.
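The interchange idea can be sketched compactly; the following simplified single-swap pass is illustrative only (the full Kernighan-Lin algorithm computes gains incrementally and exchanges whole subsets per pass):

```python
def external_cost(edges, part_a):
    """Sum of the edge weights crossing the cut. `edges` maps (u, v)
    pairs to weights; `part_a` is the set of vertices in subset A (all
    other endpoints are taken to be in B)."""
    return sum(w for (u, v), w in edges.items()
               if (u in part_a) != (v in part_a))

def improve_once(edges, a, b):
    """One greedy pass: try every pairwise interchange between the sets
    a and b, and keep the best cost-reducing swap, if any."""
    best, best_pair = external_cost(edges, a), None
    for u in list(a):
        for v in list(b):
            cost = external_cost(edges, (a - {u}) | {v})
            if cost < best:
                best, best_pair = cost, (u, v)
    if best_pair:
        u, v = best_pair
        a.remove(u); b.remove(v)
        a.add(v); b.add(u)
    return best
```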

Bokhari [Bokh81] studied the mapping problem on a Finite Element Machine (FEM)². Starting with a random initial mapping, the heuristic algorithm proceeds by sequences of pairwise interchanges between processors till the best mapping is found. Wei and Cheng [Wei89] proposed a partitioning approach called the ratio cut. Their algorithm runs by identifying clusters in a circuit, and avoids any cut through these clusters.

Most of the previous approaches to the partitioning problem are suitable for restricted applications, but they do not quite satisfy the requirements of parallel simulation because of the synchronization constraint among processors. Static partitioning for parallel simulation has recently become a focus of research.

²The Finite Element Machine is an array of processors used to solve structural problems.

Nandy and Loucks [Nand92] presented a partitioning and mapping algorithm for conservative PDES. The objective of the algorithm is to reduce the communication overhead and to evenly distribute the execution load among all of the processors. The algorithm starts with an initial random allocation of processes to clusters, and then iteratively moves processes between clusters until no further improvement can be found (a local optimum). A process is moved if its reallocation does not violate the size constraint of the processor. Using the proposed algorithm, three small circuits were partitioned, mapped, and simulated on a network of transputers. Results showed a 50% improvement in the simulation performance over random partitioning.

Recently, Sporrer and Bauer [Spor93] have presented a hierarchical partitioning for the Time Warp simulation of VLSI circuits called "corolla partitioning". The objective of the algorithm is to minimize the number of interconnecting signals while keeping the sizes of the partitions nearly equal. The basic idea of the algorithm is to detect strongly connected regions and use them as indivisible components, called petals, for partitioning. The algorithm uses a hierarchical method which, in the first step, creates a fine-grained clustering of the circuit with a minimum number of interconnections, regardless of the size of the partitions. In the second step, equal-sized partitions are formed at a coarse-grained level based on the connectivity matrix. To evaluate the partitioning algorithm, a comparison was made with three other partitioning algorithms [Fidu82, Hilb82, Smit87]. Six benchmark circuits were partitioned and rated with the four different algorithms. The simulation was run on a network of SunSparc 2 workstations connected via Ethernet. The achieved speed-ups were almost linear, even for large numbers of partitions.

Another approach to static partitioning, known as simulated annealing (SA), was proposed by Kirkpatrick et al. [Kirp83]. The SA algorithm draws upon an analogy with the behavior of a physical system in the presence of a heat bath, and uses an adaptive search schedule to find a good (near-optimal) partition. Starting with an initial partition, the algorithm moves processes between processors until a termination or equilibrium condition is met.

Recent work on partitioning using simulated annealing appeared in [Bouk94]. The authors proposed an objective function to evaluate the quality of the partitioning solution generated by the algorithm. The objective function depends on inter-processor communication, the distribution of loads, and the number of null messages between processors. A set of experiments was conducted to find the impact of the partitioning algorithm on the performance of a conservative simulation using null messages. The authors used an Intel iPSC/i860 hypercube as a platform for the simulation. In the experiments, a queuing network model in the form of a torus with an FCFS service distribution was used as a benchmark. Results showed a 25-35% reduction in the execution time of the simulation using the proposed algorithm over random partitioning when 2 and 4 processors are used.

2.4.2 Dynamic Load Balancing

The literature is rich in general-purpose dynamic load balancing algorithms [Lu86, Iqba86, Ande87], but only a handful of these algorithms apply to PDES.

In [Shan89], the author presents a simple dynamic load balancing scheme for conservative simulation. Three phases take place in his dynamic allocation scheme. In the first phase, the process(es) which need allocation are identified: these are the processes whose service times fall outside a predefined interval of service time. In the second phase, the process(es) which should be migrated are identified: processes identified in phase one which have predecessors or successors on different processors are considered for reallocation. The third phase identifies the processors to which the process(es) are reallocated: processors containing successors or predecessors of the selected processes are chosen first. The algorithm was tested on two logical systems (pipelines) on an iPSC Hypercube with 4 nodes. Experiments showed that when using the load balancing algorithm, the running time of the simulation is reduced only when the system exhibits a non-uniform distribution of messages.

Boukerche and Das [Bouk97] presented a dynamic load balancing algorithm for conservative simulation using the Chandy-Misra [Chan79] null message approach. Their algorithm uses the CPU queue length as a metric that indicates the workload at each processor. In their approach, a Load Balancing Facility (LBF) is implemented in two separate phases: load balancing and process migration. In the first step of the algorithm, every processing element in the simulation sends the size of its CPU queue length to a central processor, where processors are classified according to the deviation of their respective queue lengths from the mean. The second step for the load balancer is to select the heavily overloaded processes from the heavily overloaded processors. Experiments were conducted on an Intel Paragon A4. The authors chose an N x N torus queuing network model with an FCFS service discipline as a benchmark in their experiments. Their results showed approximately a 25-30% reduction in the simulation time using dynamic load balancing over a random static partitioning when 2 processors were employed. A reduction of 40% was observed when the number of processors increases from 4 to 8. Similarly, the authors observed a 50% reduction in the run time when changing the number of processors from 8 to 16. The authors also reported a reduction in the synchronization overhead³ with dynamic load balancing: when fewer than 4 processors are used, the reduction in the synchronization overhead was approximately 25-30%; when 8 and 16 processors were used, the reduction was 10-40%.

³Defined as the number of null messages processed by the simulation using the Chandy-Misra null-message approach divided by the number of real messages processed.

Burdorf and Marti [Burd93] deal with the problem of dynamic load balancing for the RAND Time Warp system. The system runs on a set of workstations shared with other users, which creates a large variation in the loads depending on the number of users and processes. Initially, the static balancer assigns objects to processors according to the load, which is gathered by running a presimulation. During the simulation, the balancer records the smallest simulation time of all objects on each machine. To minimize rollbacks, the dynamic load balancer seeks to minimize the variance between the objects' simulation times. The authors presented four schemes for load balancing: three of them use Local Virtual Time (LVT) as a metric, and the fourth uses the ratio of total processed messages to the total number of rollbacks. The algorithm was implemented on a simulation of colliding shapes moving in a constricted space. They report that using the average of processed messages was not feasible for their simulation, while the best results were achieved when objects which are furthest ahead are moved to machines which are furthest behind, and objects which are furthest behind are moved to machines which are furthest ahead. Results using the dynamic load balancing strategy showed a five- to ten-times performance improvement over simulation with only static balancing.

Schlagenhaft et al. [Schl95] describe a dynamic load balancing algorithm for Time Warp VLSI circuit simulation. The algorithm consists of three components: (1) a load sensor; (2) a load evaluator; and (3) a load adaptor. The load sensor computes the Virtual Time Progress (VTP) for each simulation process. The VTP reflects how fast a simulation process progresses in virtual time. The load sensor first calculates the Integrated Virtual Time (IVT) at the end of each simulation step. The IVT of a processor is defined as the average of all of the virtual times of the clusters residing on that processor. The VTP during a time interval [T1, T2] is then computed as the change in the IVT per unit real time. The load evaluator decides whether to launch process migration or not, depending on the ratio between the actual and the predicted VTP; if the ratio is big enough (migration is worthwhile), then load balancing is initiated. Process migration is then controlled by the load adaptor. The authors simulated one logical circuit (s13207 [Brgl89]) on two processors shared with other users. Corolla partitioning was used to partition the circuit into small clusters, with many of them mapped to each processor. The results showed that the algorithm reduces the lag between the VTPs of the processors, which resulted in an increase in the advance of the GVT, and hence a 24% reduction in the simulation time.
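Read as formulas, the two sensor quantities are simple ratios; a minimal sketch under the definitions above (the function names are mine, not from [Schl95]):

```python
def integrated_virtual_time(cluster_vts):
    """IVT of a processor: the average of the virtual times of all of
    the clusters residing on that processor."""
    return sum(cluster_vts) / len(cluster_vts)

def virtual_time_progress(ivt_start, ivt_end, t_start, t_end):
    """VTP over the real-time interval [t_start, t_end]: the change in
    the IVT per unit real time."""
    return (ivt_end - ivt_start) / (t_end - t_start)
```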

Reiher and Jefferson [Reit90] presented a dynamic load balancing algorithm for the Time Warp Operating System (TWOS). Their algorithm was tested on a battlefield simulation in which objects are created and destroyed concurrently. For this implementation, dynamic load balancing was necessary in order to ensure a good assignment of new objects to processors, and to balance the load after some objects have been destroyed. In their simulation, the life span of an object is divided into phases, which can run on different processors. Each phase is responsible for handling an object's data and variables for one interval of virtual time. The algorithm tries to balance the load using an estimate of effective work, defined as the portion of the work that will not be rolled back. The authors tested their algorithm on two simulation models: a military simulation and a two-dimensional colliding pucks simulation. Their results showed that the speedups for the pucks simulation are equal to or less than the normal speedups because of the relatively even balance of pucks. Because of the inherent imbalance in the battlefield simulation, the dynamic load balancing algorithm improved the speedup by 25%.

Glazer and Tropper [Glaz93] introduced the notion of a time slice and the simulation advance rate (SAR) in their work on dynamic load balancing. The purpose of the dynamic load balancer was to reduce the number of rollbacks, and hence reduce the simulation time. To this end, the SAR is computed at each processor and sent to one dedicated processor. The SAR is defined as the advance of the simulation clock divided by the CPU time needed for the advance. The load of a process, derived from its SAR, is a measure of the amount of CPU time it requires to advance its local simulation clock by one unit. The length of the time slice for a process is determined by its load: each process is given a time slice proportional to the ratio of its load to the mean of all of the loads. To evaluate the performance of the algorithm, three different simulation models were constructed, representing different classes of models: a pipeline model, a hierarchical network model and a distributed network model. Experiments were conducted on PARALLEX, an emulation of a parallel simulation on a uniprocessor machine. The authors reported a 16-33% decrease in rollbacks and a 19-37% increase in the simulation rate when only 8 processors were used. Results showed better improvement when the number of processors was increased: a 42-71% and 47-49% decrease in rollbacks when using 16 and 32 processors respectively, in addition to a 30-39% and 49-65% increase in the simulation rate.
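These two definitions also reduce to simple ratios; a hedged sketch (the names are illustrative, not from [Glaz93]):

```python
def simulation_advance_rate(clock_advance, cpu_time):
    """SAR: the advance of the simulation clock divided by the CPU time
    needed for that advance."""
    return clock_advance / cpu_time

def time_slices(loads, base_slice):
    """Each process receives a time slice proportional to the ratio of
    its load to the mean of all loads; a process's load is the CPU time
    it needs to advance its local clock by one unit (i.e. 1 / SAR)."""
    mean_load = sum(loads) / len(loads)
    return [base_slice * load / mean_load for load in loads]
```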

Most recently, Carothers and Fujimoto [Caro96] proposed a load distribution system for background execution of Time Warp. The system is designed to use the free cycles of a collection of heterogeneous machines to run a Time Warp simulation. Their load management policy consists of two components: the processor allocation policy and the load balancing policy. The processor allocation policy is used to determine the set of processors that can be used for a Time Warp simulation. This set is computed dynamically throughout the simulation: a processor is added to or removed from the set when an estimate of the Time Warp execution time on that processor falls below or moves above certain thresholds. As in [Glaz93], the metric used was the Processor Advance Time (PAT), which reflects the amount of real time used to make a one-unit advance in virtual time. Using the centrally collected PATs, the load balancing policy tries to distribute the load evenly over all processors. An initial implementation of the algorithm was deployed on a set of nine Silicon Graphics machines, one of which was dedicated to dynamic load management. The Time Warp application used in the experiments is a simulation of a personal communication services network. The authors experimented with two different parameters: time-varying external workloads, and changing the usable set of processors. In the first set of experiments, half of the processors experienced a monotonic increase in the external workload, which resulted in cluster migration onto other processors. Results showed an improvement in the simulation time of up to 45%. To see how the system reacts to a change in the usable set of processors, a "spike" workload was added to four of the eight workstations. The authors observed a small improvement in the simulator efficiency and in the reduction of the simulation time. This minor improvement was believed to be the result of the extra overhead of considering inactive processors in the computation of the GVT. A delay in the GVT computation would result in a memory shortage on active processors, since memory is not reclaimed fast enough.

Avril and Tropper [Avri96] studied dynamic load balancing for Clustered Time Warp. To measure the load, the authors defined the "load of a cluster" to be the number of events which were processed by the cluster since the last load balance in the simulation. The load of a processor is then computed as the sum of the loads of all of the clusters residing on the processor. The authors describe a triggering technique based on the throughput⁴ of the system: the load balancing algorithm is triggered only when the overall increase in the throughput of the system becomes larger than the cost (in terms of throughput) of moving the clusters. The algorithm was implemented, and its performance was measured using two of the largest benchmark digital circuits of the ISCAS-89 series [Brgl89]. Results showed an improvement of 40-100% in the throughput of the system when dynamic load balancing was used. The authors observed that minimizing interprocessor communication when moving clusters reduces the number of rollbacks, but does not improve the throughput.

⁴Defined as the number of non-rolled-back message events per unit time.
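The load definition and the triggering rule lend themselves to a direct sketch (the per-cluster counter and the function names are assumptions of mine, not CTW's actual interface):

```python
def processor_load(clusters, last_balance):
    """Load of a processor in [Avri96]'s sense: the sum, over all
    resident clusters, of the number of events each has processed since
    the last load balance."""
    return sum(c.events_since(last_balance) for c in clusters)

def should_move_clusters(expected_throughput_gain, moving_cost):
    """Triggering rule: move clusters only when the expected overall
    throughput gain exceeds the (throughput) cost of moving them."""
    return expected_throughput_gain > moving_cost
```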


2.5 Clustered Time Warp (CTW)

An implementation of our load balancing algorithm was developed for Clustered Time Warp (CTW) [Avri95]. CTW is a hybrid algorithm which makes use of Time Warp between clusters of LPs, and a sequential algorithm within the cluster.

Logic simulation has always been a challenge for Time Warp because of two primary problems: memory management [Fuji89] and unstable behavior [Luba89]. The memory management problem arises as a result of the large number of LPs in the simulation: a circuit can consist of hundreds of thousands of gates (LPs). Since each LP must regularly save its state in order to be able to roll back, it is possible that the simulation might run out of memory, and the fossil collection algorithm cannot reclaim space fast enough for the simulation to continue. Several schemes were developed to solve this problem, including the use of the cancelback protocol and the use of artificial rollback [Jeff90].

Logic simulation also exhibits unstable behavior because of its low computational granularity: messages are small, and the service time is in terms of nanoseconds (bit-wise operations); messages flow very quickly from one process to the other, and antimessages cannot annihilate previously sent messages quickly enough, possibly leading to cascading rollbacks [Avri96].

To reduce memory consumption and to avoid cascading rollbacks, CTW assembles many LPs into one cluster. Administration is executed at the cluster level, rather than at the LP level, which leads to less scheduling overhead. Message cancellation, executed at the level of clusters, helps to avoid cascading rollbacks. The next section describes the structure and the mechanism of CTW; the last section introduces variations of CTW.

Figure 2.5: Cluster Structure

2.5.1 Cluster Structure and Behavior

Each cluster in Clustered Time Warp is autonomous: it receives messages, processes events, saves states and messages, and rolls back in case of a straggler or anti-message. A cluster consists of one or more LPs (gates), a Cluster Environment (CE), a Timezone table, and a Cluster Output Queue (COQ). The COQ holds the output messages of the cluster and is used in case of rollback. The CE is the manager of the Timezone table and the COQ. Figure 2.5 shows the complete structure of a cluster with four LPs.

Initially, the cluster timezone table consists only of the interval [0, +∞[, and is updated over time. When a cluster receives an event with a timestamp t, it determines which timezone [t_i, t_{i+1}[ the time t fits into, and divides it into two timezones, [t_i, t[ and [t, t_{i+1}[.

Each LP in the cluster keeps its Local Simulation Time⁵ (LST), as well as the time of the last event it processed (TLE). When the LP processes an event, it determines if its timezone is different from that of the TLE; if this is the case, then the LP saves its state.

⁵Local Simulation Time is the same as local virtual time.

When an LP tries to send an event, it checks the receiving LP; if this LP resides in the same cluster, the message is inserted into the input queue of the receiving LP; if not, the message is handed to the CE, which takes care of forwarding the message to the cluster in which the receiving LP is located.

When a cluster receives a straggler with timestamp t_s, the CE creates a new timezone, and rolls back all of the LPs with a TLE greater than t_s. After the rollback, a "coast forward" is executed at the LP level. The cluster behaves similarly when it receives an anti-message, except that it does not create a new timezone.
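To make the timezone and rollback mechanism concrete, the following is a minimal, self-contained C sketch. The data layout (a sorted boundary array), the capacity limits, and the helper names (tz_insert, lp_rollback, coast_forward) are our own illustrative assumptions, not the actual CTW code; the two rollback helpers are stubs standing in for CTW's state restoration and re-execution routines.

#include <stdio.h>

#define MAX_ZONES 64
#define MAX_LPS   4

/* Sorted timezone boundaries; zone i is [bound[i], bound[i+1][, the last
   zone extending to +infinity. Initially there is one zone, [0, +inf[. */
typedef struct { double bound[MAX_ZONES]; int n; } TimezoneTable;

typedef struct { double lst, tle; } LP;   /* local simulation time, time of last event */

typedef struct { TimezoneTable tz; LP lp[MAX_LPS]; int n_lps; } Cluster;

/* An event with timestamp t splits the zone containing t into
   [t_i, t[ and [t, t_{i+1}[ by inserting t as a new boundary. */
static void tz_insert(TimezoneTable *tz, double t) {
    int i = 0;
    while (i < tz->n && tz->bound[i] < t) i++;   /* find insertion point */
    if (i < tz->n && tz->bound[i] == t) return;  /* boundary already exists */
    for (int j = tz->n; j > i; j--) tz->bound[j] = tz->bound[j - 1];
    tz->bound[i] = t;
    tz->n++;
}

/* Stub: restore the last state saved at or before time t. */
static void lp_rollback(LP *lp, double t) { lp->lst = lp->tle = t; }

/* Stub: re-execute events past t without resending output messages. */
static void coast_forward(LP *lp, double t) { (void)lp; (void)t; }

/* Straggler with timestamp ts: the CE creates a new timezone, then every LP
   whose last processed event lies beyond ts is rolled back and coasted forward. */
static void cluster_receive_straggler(Cluster *c, double ts) {
    tz_insert(&c->tz, ts);
    for (int i = 0; i < c->n_lps; i++)
        if (c->lp[i].tle > ts) {
            lp_rollback(&c->lp[i], ts);
            coast_forward(&c->lp[i], ts);
        }
}

int main(void) {
    Cluster c = { .tz = { .bound = {0.0}, .n = 1 }, .n_lps = 2,
                  .lp = { {40.0, 38.0}, {25.0, 20.0} } };
    cluster_receive_straggler(&c, 30.0);   /* rolls back LP 0 only */
    printf("LP0 LST=%.1f  LP1 LST=%.1f  zones=%d\n",
           c.lp[0].lst, c.lp[1].lst, c.tz.n);
    return 0;
}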

2.5.2 Variations of Clustered Time Warp

As mentioned previously, when a cluster receives a straggler or an antimessage, it rolls back all of the LPs which have processed an event with a timestamp larger than that of the straggler or the antimessage. This also includes LPs which are not directly affected by the rollback. The first variation of CTW [Avri95] was intended to avoid unnecessary rollbacks: only affected LPs are rolled back.

In the new strategy, when a cluster receives a straggler or an antimessage, it updates its timezone table, and forwards the event to the input queue of the receiving LP. In this way, only the receiving LP will be rolled back. This new scheme is called Local Rollback, as opposed to Clustered Rollback.

A second variation of CTW deals with checkpoint timing. Instead of saving a state when it enters a new timezone, an LP saves its state each time it receives a message from an LP located in a different cluster. Even though the new scheme decreases the number of states saved, there is an increase in the number of events an LP has to keep. These events are needed for coast forward, when an LP is rolled back to a state prior to the GVT. This new scheme was called Local Checkpointing, as opposed to Cluster Checkpointing for pure CTW.

Using the previous variations, three checkpointing algorithms were developed: Clustered Rollback Cluster Checkpointing (CRCC), Local Rollback Local Checkpointing (LRLC), and Local Rollback Cluster Checkpointing (LRCC). These algorithms exhibit a trade-off between memory and execution time. Experimental results on the performance of each algorithm are found in [Avri95].

2.5.3 Load Balancing for Clustered Time Warp

Clustered Time Warp supports process migration for dynamic load balancing [Avri96]. In the original implementation of Clustered Time Warp, a process called the pilot is dedicated to collecting load information from all of the processors; the load of each processor is piggybacked on the GVT token, and forwarded to the pilot. Depending on the distribution of the loads, the pilot decides on changing the process to processor mapping.

Our current implementation of load balancing uses a distributed algorithm, in which there is no dedicated processor. A token is passed around a virtual ring to all of the processors, and load balancing is done when all loads are inserted into the token.

Chapter 3

Metrics and Algorithm

This chapter provides details of the metrics and the algorithm for our dynamic load balancing scheme. The following section describes the two metrics employed. Section two describes the dynamic load balancing algorithm in detail. Section three describes the design issues associated with the development of the algorithm.

3.1 Metrics

Dynamic load balancing requires both a metric to determine the system load and a mechanism for controlling process migration. Ideally, the metric should not only be simple and fast to compute, but also effective. In this section, two such metrics are suggested: Processor Utilization (PU) and Processor Advance Simulation Rate (PASR).

3.1.1 Processor Advance Simulation Rate (PASR)

If several (simulation) processes are interconnected, a discrepancy in their respective virtual clocks can result in an increase in the number of messages arriving in the past, and cause rollbacks. When a process is rolled back from time t_i to time t_j, all work performed during this time period is discarded. System resources used during the corresponding real time interval could have been productively employed by other processes.

Controlling the rate at which processes advance their corresponding virtual times will minimize the difference between the virtual clocks and, as a consequence, reduce the number of rollbacks which occur in the simulation.

The virtual time of a processor is defined as the minimum virtual time of all the processes residing on that processor. A processor which has no events to process sets its virtual time to infinity.

For a system simulated in the real time interval (t_start, t_end), the Processor Advance Simulation Rate (PASR) defines the rate of advance in virtual time relative to real time. Let t_1 and t_2 be two real time values, with t_2 > t_1. Define ST_t as the simulation time at real time t. Let ΔST denote the change in the simulation time during the time interval (t_1, t_2):

    ΔST = ST_{t_2} − ST_{t_1}.

The PASR is defined as:

    PASR = ΔST / (t_2 − t_1).

A processor with a PASR higher than average is susceptible to being rolled back because it is ahead of other processors in virtual time. If it is slowed down, the frequency with which it is rolled back by other processors might well decrease. Figure 3.1 shows an example of two processors with different PASRs. A message m sent from P_1 to P_2 will force P_2 to roll back to a virtual time previous to vt_1, since the timestamp of m is vt_1. P_2 then has to cancel all the previous work done in the (t_2, t_1) real time interval. Hence, moving some load from processors with high PASRs to others with low PASRs should speed up the slow processors and slow down the fast ones.

Figure 3.1: Two processors advancing at different PASRs (vertical axis: virtual time)

3.1.2 Processor Utilization (PU)

A number of researchers [Hac87, Tant85, Wang85] feel that it is best to maximize the available parallelism in the system by keeping processor utilization as high as possible. For systems where no a priori estimates of load distribution are possible, only actual program execution can reveal how much work has been assigned to individual processors.

Let us define effective utilization [Reit90] as the proportion of work done by a processor which is not rolled back. Unfortunately, it is impossible for a processor to determine the effective utilization at a given point in the simulation, since it might roll back later and cancel all of the work that has been done. In [Reit90] an estimate of the effective utilization is used for load computation. Consequently, we make use of the processor utilization (PU), defined as the ratio of the processor's computation time (in seconds) between t_1 and t_2 to t_2 − t_1:

    PU = (computation time in (t_1, t_2)) / (t_2 − t_1).

Processor utilization allows for the fact that messages in the system might be of different sizes, and might require different service times. It also accounts for the fact that two processors might advance their virtual clocks by the same value even if their computation times are different.
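The PU measurement amounts to accumulating busy time between two wall-clock samples. A minimal sketch, assuming a hypothetical accounting structure rather than the instrumentation actually used on the Butterfly:

#include <stdio.h>

/* Accumulates the processor's event-processing (busy) time between samples. */
typedef struct { double busy, window_start; } PUAccount;

static void pu_reset(PUAccount *a, double now)     { a->busy = 0.0; a->window_start = now; }
static void pu_add_busy(PUAccount *a, double secs) { a->busy += secs; }

/* PU over (t1, t2): computation time in the interval divided by t2 - t1. */
static double pu_sample(const PUAccount *a, double now) {
    return a->busy / (now - a->window_start);
}

int main(void) {
    PUAccount acc;
    pu_reset(&acc, 0.0);
    pu_add_busy(&acc, 1.2);   /* e.g. two bursts of event processing */
    pu_add_busy(&acc, 0.6);
    printf("PU = %.0f%%\n", 100.0 * pu_sample(&acc, 3.0));  /* 1.8 / 3.0 = 60% */
    return 0;
}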

3.1.3 Combination of PU and PASR

A combination of the two metrics was also tested in our experiments. The combination was intended to increase the utilization of the processors, while maintaining a maximum advance simulation rate and minimizing the number of rollbacks.

3.2 Algorithm

A difficulty in the load balancing of a distributed application is the absence of global information on the load of the system. We employ a (distributed) algorithm to collect the relevant information about the processors' loads in our load balancing algorithm.

Our algorithm uses a token which circulates to each processor on a logical ring [Tane81]. At each processor, the PASR, the PU and the combined metric (all referred to as LOAD later) are inserted into the token.

Assume that there are n processors in the system. Initially, the token is launched by processor one (1). When it gets to processor n, data from all of the processors will be stored in the token. Processor n, called the host, will be able to identify overloaded and underloaded processors. The token will remain at processor n for a period¹ of time, after which it is launched again. After the next round, when the token reaches processor n − 1, data from all of the processors will be contained in it, and hence n − 1 then becomes the next host.

¹We determine this time period experimentally. See section 3.3.1 for a discussion.

At the end of each round, the host processor computes both the mean and the standard deviation of all of the LOADs. The new mean is then compared with the mean from the previous round (stored in the token). If the new mean is larger, this indicates that the performance of the system is improving (an increase in the PU or the PASR), and the system is left intact. The host sets a count-down timer which triggers the token again after β GVT computations². If the new mean is smaller than the mean from the previous round, the host checks the standard deviation of the LOADs. If the standard deviation is found to be larger than a certain tolerance value, α³, then a new process to processor mapping is assumed to be necessary. The host matches the processor with the largest load together with the least loaded one, and sends a message to the overloaded processor with the name of the destination processor to which it should transfer some of its load. The load to be moved is equal to half of the difference in the loads between the two matched processors. The standard deviation of the loads is computed again, and another pair of processors is matched. This process is repeated until the standard deviation is less than the value α. When all migrations are completed, the host starts the timer for the next round of the token.

²The value of the constant β is determined experimentally (10 GVT computations).
³A value of 25% for α was found feasible for our experiments.

3.2.1 Pseudo Code

This section presents the basic functions of the dynamic load balancing algorithm. The skeleton of the algorithm is presented next in a C-format language.

function Serve_Token(token)
{
    insert this processor's LOAD into the token;
    if (token contains LOADs from all processors) {
        Mean_LOAD = (Σ_{i=1..n} LOAD_i) / n;   /* n is the number of processors */
        if (Mean_LOAD >= token.previous_Mean_LOAD)
            /* The performance of the system is improving. */
            Set_Timer();
        else {
            StD = (standard deviation of the LOADs);
            if (StD <= tolerance α)
                /* α was taken to be 25%, a value determined experimentally */
                Set_Timer();
            else {
                matchProcessors();
                while (processors are still transferring processes)
                    wait();   /* Processor can proceed with other computations. */
                Set_Timer();
            }
        }
    } else
        pass the token to the successor on the virtual ring;
}

function Set_Timer()
{
    wait(β GVT computations);
    Initialize(NewToken);
    Serve_Token(NewToken);
}

function matchProcessors()
{
    StD = (standard deviation of the LOADs);
    while (StD > tolerance α) {
        sourceProc = processor with the highest LOAD;
        destinationProc = processor with the lowest LOAD;
        loadToMove = (LOAD(sourceProc) - LOAD(destinationProc)) / 2;
        update the LOADs of sourceProc and destinationProc and recompute StD;
    }
}
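To make the host's decision procedure concrete, the following self-contained C sketch runs the matching loop on a plain array of loads. It is an illustration under our own assumptions (the 25% tolerance read as relative to the mean load, migration reduced to an in-place transfer of load values), not the actual CTW implementation.

#include <math.h>
#include <stdio.h>

#define NPROC 4

static double mean_of(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s / n;
}

static double stddev_of(const double *x, int n) {
    double m = mean_of(x, n), s = 0.0;
    for (int i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
    return sqrt(s / n);
}

/* Repeatedly match the most and least loaded processors, moving half of the
   load difference, until the deviation falls under the tolerance. */
static void match_processors(double *load, int n, double alpha) {
    int rounds = 0;
    while (stddev_of(load, n) > alpha * mean_of(load, n) && rounds++ < n) {
        int hi = 0, lo = 0;
        for (int i = 1; i < n; i++) {
            if (load[i] > load[hi]) hi = i;
            if (load[i] < load[lo]) lo = i;
        }
        double move = (load[hi] - load[lo]) / 2.0;
        printf("move %.1f from processor %d to processor %d\n", move, hi, lo);
        load[hi] -= move;   /* in a real run, clusters summing to 'move' migrate */
        load[lo] += move;
    }
}

int main(void) {
    double load[NPROC] = { 90.0, 10.0, 55.0, 45.0 };
    match_processors(load, NPROC, 0.25);
    for (int i = 0; i < NPROC; i++) printf("P%d: %.1f  ", i, load[i]);
    printf("\n");
    return 0;
}

On these sample loads, one matching round moves 40 units from P0 to P1, after which the standard deviation (about 3.5) falls under the 25% tolerance (12.5) and the loop stops.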

3.3 Design Issues

Designing a process migration system involves resolving a number of issues, such as which cluster to move, when the dynamic load balancing algorithm should be invoked, and what the maximum number of processors which are allowed to transfer load (at the end of each round) should be. These issues are all interrelated, and their resolution also depends on the simulated model. We provide a discussion of these issues in the following sections.

3.3.1 Token period, β

The algorithm periodically sends out a token to collect the necessary information from all of the processors, and at the end of each round the host decides whether a process migration is needed. The time interval β between two consecutive token rounds (the token period) should be long enough to allow the system to stabilize after the previous load balance, but should be short enough to prevent the system from being unbalanced for a lengthy period of time.

During the course of the experiments, a token period of 10 GVT⁴ computations was employed. This value was found feasible for our models. It is partially determined by the system being simulated, in addition to the time it takes for a GVT computation: the longer it takes to compute the GVT, the shorter the token period should be.

⁴The GVT computation was initiated every 3 seconds.

A related issue is the determination of α. Transfer of clusters occurs only if the standard deviation of the loads exceeds α. We used a value of α = 25%, determined experimentally.

3.3.2 Selecting the Clusters (Processes)

There is a trade-off between achieving the goal of completely balancing the load and the communication costs associated with migrating processes, since transferring clusters takes time. When determining which cluster to move, our implementation chooses the cluster with the highest LOAD, so as to minimize the number of clusters moved, provided that the load of the cluster does not exceed the intended load to move.

3.3.3 Number of Processors that can initiate migration

Theoretically, the load balancing algorithm should distribute the loads over all of the processors as uniformly as possible. However, migrations from too many processors can cause the system to become unstable. In our experiments, we observed that approximately 30% of the processors can launch migrations at one time without jeopardizing the stability of the system.

Chapter 4

Results and Discussion

4.1 Introduction

This chapter presents the simulated models, experiments and results using the algorithm and metrics presented in the previous chapter. The load balancing algorithm was implemented on top of Clustered Time Warp. In our experiments, we made exclusive use of the LRCC checkpointing technique, which offers an intermediate choice in the memory vs. execution time trade-off. The simulations were executed on a BBN Butterfly¹ GP1000, a shared memory multiprocessor machine. A set of models, based on VLSI circuits, an assembly pipeline and distributed communication networks, were simulated.

¹The Butterfly is an MIMD machine with 32 processor nodes. Each node has an MC68020 processor and an MC68881 coprocessor with 4 megabytes of memory, and a high-speed multistage crossbar switch interconnects the processors. The clock speed of each processor is 25 MHz.

Circuit | Inputs | Outputs | Flip-Flops | Number of gates

Table 4.1: Circuits from ISCAS'89

Two digital circuits from the ISCAS'89 [Brgl89] benchmark suite (Table 4.1) were selected and simulated. The size of each of these circuits is approximately 24,000 gates. The circuits were partitioned into 200 clusters each, using string partitioning [Leve82]. Clusters were numbered arbitrarily and were mapped to processors as follows: if N is the number of processors, and K is the number of clusters (K > N), then the first K/N clusters were assigned to processor number 1, the second K/N clusters were assigned to processor number 2, and so forth.
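In code, this initial block mapping is plain integer division. A trivial sketch (0-based indices, and assuming for simplicity that K is a multiple of N, which the 200-cluster, 20-processor example below satisfies; the experiments themselves used 12-24 processors):

#include <stdio.h>

/* Cluster c (0-based) goes to processor c / (K / N): the first K/N clusters
   to processor 0, the next K/N to processor 1, and so on. */
static int initial_processor(int cluster, int K, int N) {
    return cluster / (K / N);
}

int main(void) {
    int K = 200, N = 20;
    printf("cluster 0 -> P%d, cluster 9 -> P%d, cluster 10 -> P%d\n",
           initial_processor(0, K, N), initial_processor(9, K, N),
           initial_processor(10, K, N));   /* P0, P0, P1 */
    return 0;
}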

The data which was collected includes the running time, peak memory usage, and effective throughput. The performance of the algorithm was evaluated using between 12 and 24 processors on the Butterfly.

4.2 Experimental Results

Figures 4.1 and 4.2 show the simulation time for the two circuits C2 and C3, plotted against the number of processors. Results are compared to the simulation without load balancing. The same number of input vectors was used for all of the simulations. Figure 4.1 depicts the performance of the metrics and the algorithm for C2: a 35-72% reduction in the simulation time when PU is used as a metric, 0-30% when PASR is used, and 0-40% when the combined metric is used. As for C3, the results are as follows: 2-21% when PU is used as a metric, (-5)-16% when PASR is used, and 0-15% when the combined metric is used.

Figure 4.1: Simulation time for C2 (PU, PASR, combined metric, and no load balancing, for 12-24 processors)

Figure 4.2: Simulation time for C3 (PU, PASR, combined metric, and no load balancing, for 12-24 processors)

We attribute the difference in the percentage of reduction in the simulation time between the two metrics to the locality of activity. When the system exhibits a high locality of activity, PU will increase on some processors. Underloaded processors can quickly process the events generated by a more heavily loaded processor, resulting in the same PASR but a different PU. This also explains why the improvements for C2 decline with an increase in the number of processors, since the activity of the circuit is spread out when more processors are used.

To understand the different operation of the load balancing algorithm between C2 and C3, and why dynamic load balancing might sometimes increase the running time of the simulation (employing the PASR with C3), we looked at two other elements: peak memory consumption and the effective throughput. The peak memory consumption is the average of the peak memory consumption over all of the processors. On each processor, it represents the maximum amount of memory used over the course of a run of the simulation. The effective throughput is defined as the number of non-rolled-back messages in the system per unit time.

Researchers on memory consumption for Time Warp [Jeff85, Lin91, Jeff90, Akyi92] have pointed out that there is a time penalty associated with large space consumption, and that it is possible for a simulation to run out of memory. Memory management is time consuming; a heavily loaded processor must spend considerable time on memory management.

By looking at figures 4.3 and 4.4, one can see that dynamic load balancing reduced the peak memory consumption for C2, but increased it for C3. Circuit C2 has a higher locality² of events than does C3; hence moving clusters and associated events from the loaded processors in C2 decreases the peak memory consumption. As for C3, the activity of the circuit is much more distributed, and hence moving clusters might create a memory problem.

²An experimental study of the activity of the same two circuits can be found in [Avri96].

Figure 4.3: Peak memory consumption for C2 (12-24 processors)

Figure 4.4: Peak memory consumption for C3 (12-24 processors)

Figure 4.5: Throughput Graph for C2 (x-axis: simulation time)

We also examined the effective throughput of the system. Figures 4.5 and 4.6 show the impact of the dynamic load balancing algorithm on the effective throughput of the system during the simulation of C2 and C3. The figures show a run of the simulation using the PU metric. Both figures show a noticeable increase in the effective throughput of the system. This increase is counter-balanced in C3 by an increase in the memory consumption.

4.2.1 Increase in rollbacks

An increase in rollbacks was observed (up to 20%) when load balancing was employed. This increase was expected, since moving clusters slowed down the source and the destination processors; when these processors resume computation they might well send messages into the "past" of processors which were not involved in moving clusters. In addition, it is sometimes the case that the virtual time of the migrated cluster is smaller than the virtual time of the receiving processor. This results from the fact that the receiving processor is lightly loaded, and is ahead in virtual time.

Figure 4.6: Throughput Graph for C3 (x-axis: simulation time)

4.3 Pipeline Model

A second model which was simulated is a manufacturing pipeline (figure 4.7) [Glaz93]. The model consists of thirty processes, including two sinks and two sources, arranged in nine stages. Each process was represented by a cluster of 625 logical processes connected in a mesh topology, and clusters at the same stage were mapped to the same processor. Messages in the system flow from the sources to the sinks, following different paths. At each stage, the message is served and forwarded to the next stage, until it gets to a sink, where it leaves the system. The service time distribution is deterministic, and the routing decision at each stage is governed by a uniform distribution. The pipeline model exhibits a large number of rollbacks, which are caused by messages starting at the same source, following different paths, and arriving at the same processor in a (possibly) different order from the one in which they were generated.
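As a rough sketch of a single stage's behavior (the fixed service delay of 5.0 time units, the branching factor, and the function names are illustrative assumptions; the event-queue plumbing of the real simulator is omitted):

#include <stdlib.h>
#include <stdio.h>

#define SERVICE_DELAY 5.0   /* deterministic service time, illustrative value */

/* On arrival at a stage, a message is served for a fixed delay and then
   forwarded to one of the next stage's processes, chosen uniformly. */
static double serve_and_route(double arrival_time, int n_next, int *next_index) {
    *next_index = rand() % n_next;          /* uniform routing decision */
    return arrival_time + SERVICE_DELAY;    /* departure timestamp */
}

int main(void) {
    srand(42);
    int dest;
    double t = serve_and_route(100.0, 4, &dest);
    printf("forward at t=%.1f to process %d of the next stage\n", t, dest);
    return 0;
}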

Figure 4.7: Manufacturing Pipeline Model

Figure 4.8: Percentage Increase in the Number of Rollbacks (pipeline model)

Figure 4.9: Percentage Reduction of the Simulation Time (pipeline model)

Figures 4.8 and 4.9 show the results for the pipeline model. When dynamic load balancing was used, the simulation showed an increase in the percentage of rollbacks, and an increase in the simulation time. Figure 4.8 shows that the number of rollbacks increased by 20% when the PU metric was used, 18% when using the PASR, and 17% when a combination of the two metrics was used. The increase in the percentage of rollbacks results from the fact that moving clusters from one stage (processor) to another will cause delays on some of the input links of the next stage (processor).

Figure 4.9 shows that the simulation time increased by 8% when using dynamic load balancing with the PU metric, and by 0.5% when using the PASR metric. When using the combination of the two metrics, the simulation time did not change. The increase in the simulation time is explained by the increase in the number of rollbacks in the system.

4.4 Distributed Network Model

Figure 4.10: Distributed Network Model

The final model is a distributed communication model (figure 4.10). Two kinds of experiments were conducted on the model. In the first experiment, messages are uniformly distributed over the network. The second experiment modeled a national communication network divided into four regions. In this model, we experimented with the reaction of dynamic load balancing to a continuous change of the loads on the processors. During the course of the simulation, messages were concentrated on different regions, one region at a time. For instance, at one point messages were concentrated in region 1, and regions 2, 3 and 4 were lightly loaded. After a period of time, region 2 became saturated with messages, and regions 1, 3 and 4 were lightly loaded.

The simulation runs on 10 processors, with 7-8 nodes mapped to each processor.

Interprocessor communication was minimized by mapping connected nodes to the same processor. On each node a message is served and, with a probability of 30%, is forwarded to a randomly selected neighbor. Nodes have service times governed by exponential distributions (with different means), and the choice of a neighbor to which to forward the message is governed by a uniform distribution.
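A sketch of a node's behavior in this model, using inverse-transform sampling for the exponential service time and the 30% forwarding probability; the function names, the mean of 2.5 time units, and the eight-neighbor fan-out are our own illustrative choices:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

/* Draw an exponentially distributed service time with the given mean
   (inverse-transform sampling). */
static double exp_service(double mean) {
    double u = (rand() + 1.0) / (RAND_MAX + 2.0);   /* u in (0, 1) */
    return -mean * log(u);
}

/* Serve a message at a node; with probability 0.3 forward it to a uniformly
   chosen neighbor, otherwise it leaves the system. Returns the neighbor
   index, or -1 if the message departs. */
static int serve_message(double *t, double mean_service, int n_neighbors) {
    *t += exp_service(mean_service);
    if ((double)rand() / RAND_MAX < 0.3)
        return rand() % n_neighbors;
    return -1;
}

int main(void) {
    srand(7);
    double t = 0.0;
    int hop, hops = 0;
    while ((hop = serve_message(&t, 2.5, 8)) != -1)
        hops++;
    printf("message left the system at t=%.2f after %d forwardings\n", t, hops);
    return 0;
}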

Figure 4.11: Percentage of Reduction in the Simulation Time in the Presence of Time-Varying Load

Figure 4.12: Percentage of Reduction in the Simulation Time in the Absence of Time-Varying Load

The results in figures 4.11 and 4.12 show a difference in the speedup percentage between the two experiments. Figure 4.11 shows a 30-35% reduction in the simulation time when dynamic load balancing was employed, for all metrics. This comes back to the fact that, in the presence of a time-varying load, the system is locally overloaded with messages, and cluster migration improves the performance of the simulation. A reduction of only 10-20% was observed in the absence of a time-varying load (figure 4.12).

Chapter 5

Conclusion

In this thesis, we examined two metrics for dynamic load balancing: processor utilization (PU) and processor advance simulation rate (PASR). A combination of the two metrics was also tested. A distributed load balancing algorithm was developed, in which a token circulating on a logical ring is used to collect the information about the loads of the processors, and to inform the processors about the new mapping. The algorithm runs in conjunction with Clustered Time Warp (CTW) [Avri95], which allows cluster migration.

To observe the performance of the algorithm with each of the metrics, several models were constructed: VLSI models, an assembly pipeline model and a distributed communication network model. Experiments were carried out on the BBN Butterfly GP1000, a 32 node shared memory multiprocessor. The simulation time, memory consumption and effective throughput were measured. The effective throughput is the number of non-rolled-back messages in the system per unit time.

Results for the VLSI circuits showed that dynamic load balancing changes the simulation time by between (-5)% and 72%. As for the pipeline model, the simulation time increased by 0.5-8.0% when load balancing was employed. Load balancing exhibited a 10-35% improvement with the distributed network model. The PU metric performed well for all models except the pipeline model. Results also showed that the throughput of the system was improved by more than 100% for the VLSI model. The PASR and the combined metric yield the same improvement as the PU for the distributed network model, and better results for the pipeline model. Perhaps the most important conclusion of this thesis is that the effectiveness of dynamic load balancing depends strongly on the nature of the model which is being simulated.

A number of extensions of this research suggest themselves. One is to implement the algorithm on a distributed memory machine. It is possible that some interesting aspects do not reveal themselves on a shared memory machine, such as the impact of the communication network connecting the processing elements.

Another extension relates to the use of string partitioning for the VLSI circuits examined. One might investigate the effect of different (initial) partitioning algorithms on dynamic load balancing.

Bibliography

[Akyi92] Akyildiz, I.F., Chen, L., Das, S.R., et al., "Performance Analysis of Time Warp with Limited Memory", Proc. of the 1992 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, May 1992.

[Ande87] Ander, E., "A Simulation of Dynamic Task Allocation in a Distributed Computer System", Proceedings of the 1987 Winter Simulation Conference, pp. 768-776, 1987.

[Avri95] Avril, Herve and Tropper, Carl, "Clustered Time Warp and Logic Simulation", Proc. of the 9th Workshop on Parallel and Distributed Simulation, pp. 112-119, June 1995.

[Avri96] Avril, Herve and Tropper, Carl, "The Dynamic Load Balancing of Clustered Time Warp for Logic Simulation", Proc. of the 10th Workshop on Parallel and Distributed Simulation, pp. 20-27, May 1996.

[Avri96a] Avril, Herve, "Clustered Time Warp and Logic Simulation", PhD Thesis, School of Computer Science, McGill University, Montreal, Canada, 1996.

[Bail92] Bailey, M.L., "How Circuit Size Affects Parallelism", IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., Vol. 12, pp. 1903-1912, 1992.

[Bail94] Bailey, M.L., Briner, J.V. and Chamberlain, R.D., "Parallel Logic Simulation of VLSI Systems", ACM Computing Surveys, Vol. 26, No. 3, pp. 255-295, September 1994.

[Baue91] Bauer, H., Sporrer, C. and Krodel, T.H., "On Distributed Logic Simulation Using Time Warp", Proc. of the International Conference on Very Large Scale Integration (VLSI), pp. 127-136, 1991.

[Bell90] Bellenot, S., "Global Virtual Time Algorithms", Distributed Simulation, pp. 122-127, 1990.

[Bokh81] Bokhari, Shahid H., "On the Mapping Problem", IEEE Transactions on Computers, Vol. 30, No. 3, pp. 207-214, March 1981.

[Bokh87] Bokhari, Shahid H., Assignment Problems in Parallel and Distributed Computing, Kluwer Academic Publishers, Boston, 1987.

[Bouk94] Boukerche, Azzedine and Tropper, Carl, "A Static Partitioning and Mapping Algorithm for Conservative Parallel Simulations", Proc. of the 8th Workshop on Parallel and Distributed Simulation, pp. 164-172, 1994.

[Bouk97] Boukerche, Azzedine and Das, Sajal, "Load Balancing for Conservative Parallel Simulation", MASCOTS, 1997.

[Brgl89] Brglez, Franc, Bryan, David and Kozminski, Krzysztof, "Combinational Profiles of Sequential Benchmark Circuits", Proceedings IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1929-1934, 1989.

[Brin91] Briner, J.V., "Fast Parallel Simulation of Digital Systems", Proc. of the 5th Workshop on Parallel and Distributed Simulation, pp. 71-77, 1991.

[Brin91a] Briner, J.V., Ellis, J.L. and Kedem, G., "Breaking the Barrier of Parallel Simulation of Digital Systems", Proc. of the 28th ACM/IEEE Design Automation Conference, pp. 223-226, 1991.

[Brya77] Bryant, R.E., "Simulation of Packet Communication Architecture Computer Systems", Tech. Rep. MIT-LCS-TR-188, Massachusetts Institute of Technology, 1977.

[Brya81] Bryant, R.E., "MOSSIM: A Switch-Level Simulator for MOS-LSI", Proceedings of the 18th Design Automation Conference, 1981.

[Burd93] Burdorf, C. and Marti, J., "Load Balancing for Time Warp on Multi-User Workstations", The Computer Journal, Vol. 36, No. 2, pp. 168-176, 1993.

[Caro95] Carothers, Christopher D., Fujimoto, Richard M. and England, Paul, "Effect of Communication Overheads on Time Warp Performance: An Experimental Study", Proc. of the 8th Workshop on Parallel and Distributed Simulation, pp. 118-125, 1995.

[Caro96] Carothers, Christopher D. and Fujimoto, Richard M., "Background Execution of Time Warp Programs", Proc. of the 10th Workshop on Parallel and Distributed Simulation, pp. 12-19, May 1996.

[Cart91] Carter, H., Vemuri, R., Wilsey, P.A., et al., "High Speed Acceleration of VHDL Simulation, Synthesis, and ATPG: Overview of the QUEST Project", Spring 1991 VHDL Users' Group, pp. 85-90, April 1991.

[Cham94] Chamberlain, R.D. and Henderson, C.D., "Evaluating the Use of Pre-Simulation of VLSI Circuits", Proc. of the 8th Workshop on Parallel and Distributed Simulation, pp. 139-146, July 1994.

[Chan79] Chandy, K. and Misra, J., "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs", IEEE Transactions on Software Engineering, September 1979.

[Chun89] Chung, M.J. and Chung, Y., "Data Parallel Simulation Using Time Warp on the Connection Machine", Proc. of the 26th ACM/IEEE Design Automation Conference, pp. 98-103, 1989.

[Conc91] Concepcion, A.I. and Kelly, S.G., "Computing Global Virtual Time Using the Multi-Level Token Passing Algorithm", Proc. of the 5th Workshop on Parallel and Distributed Simulation, pp. 63-68, 1991.

[Elma86] Elmagarmid, A.K., "A Survey of Distributed Deadlock Detection Algorithms", ACM SIGMOD Record, Vol. 15, No. 3, pp. 37-45, September 1986.

[Fidu82] Fiduccia, C.M. and Mattheyses, R.M., "A Linear-Time Heuristic for Improving Network Partitions", Proceedings of the 19th ACM/IEEE Design Automation Conference (DAC 82), pp. 175-181, 1982.

[Fran86] Frank, E., "Exploiting Parallelism in a Switch-Level Simulation Machine", Proc. of the 23rd ACM/IEEE Design Automation Conference, pp. 20-26, 1986.

[Fuji88] Fujimoto, Richard M., "Lookahead in Parallel Discrete Event Simulation", Proceedings of the International Conference on Parallel Processing, pp. 34-41, 1988.

[Fuji89] Fujimoto, Richard M., "Parallel Discrete Event Simulation", Proceedings of the 1989 Winter Simulation Conference, pp. 19-28, 1989.

[Fuji89a] Fujimoto, R.M., "Time Warp on a Shared Memory Multiprocessor", Tech. Rep. TR-UUCS-88-021a, Computer Science Department, University of Utah, January 1989.

[Gare79] Garey, M.R. and Johnson, D.S., Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Company, New York, 1979.

[Glaz93] Glazer, David W. and Tropper, Carl, "On Process Migration and Load Balancing in Time Warp", IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 3, pp. 318-327, March 1993.

[Gros88] Groselj, B. and Tropper, Carl, "The Time-of-Next-Event Algorithm", Proceedings of the 1988 Distributed Simulation Conference, SCS Simulation Series, Vol. 19, No. 3, pp. 25-29, February 1988.

[Gros91] Groselj, B. and Tropper, Carl, "The Distributed Simulation of Clustered Processes", Distributed Computing, Vol. 4, pp. 111-121, 1991.

[Hac86] Hac, A. and Johnson, T.J., "A Study of Dynamic Load Balancing in a Distributed System", ACM Symposium on Communications, Architectures and Protocols, pp. 348-356, 1986.

[Hac87] Hac, A. and Jin, X., "Dynamic Load Balancing in a Distributed System Using a Sender-Initiated Algorithm", Proceedings of the IEEE-CS and ACM SIGARCH Workshop on Instrumentation for Distributed Computing Systems, pp. 62-65, 1987.

[Hilb82] Hilberg, W., Grundprobleme der Mikroelektronik, Oldenbourg Verlag, München, 1982.

[Horb86] Horbst, E., Logic Design and Simulation, Elsevier Science Publishers B.V. (North Holland), 1986.

[Iqba86] Iqbal, A.M., Saltz, J.H. and Bokhari, S.H., "A Comparative Analysis of Static and Dynamic Load Balancing Strategies", Proceedings of the 1986 International Conference on Parallel Processing, pp. 1040-1047, 1986.

[Jeff83] Jefferson, D. and Sowizral, H., "Fast Concurrent Simulation Using the Time Warp Mechanism, Part II: Global Control", Tech. Rep. TR-83-204, Rand Corporation, August 1983.

[Jeff85] Jefferson, D.R., "Virtual Time", ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, pp. 404-425, July 1985.

[Jeff90] Jefferson, D.R., "Virtual Time II: The Cancelback Protocol for Storage Management in Distributed Simulation", Proc. of the 9th Ann. ACM Symp. on Principles of Distributed Computing, pp. 75-90, August 1990.

[Kern70] Kernighan, B.W. and Lin, S., "An Efficient Heuristic Procedure for Partitioning Graphs", Bell System Technical Journal, Vol. 49, No. 2, pp. 291-307, February 1970.

[Kirk83] Kirkpatrick, S., Gelatt, C.D. and Vecchi, M.P., "Optimization by Simulated Annealing", Science, Vol. 220, No. 4598, May 1983.

[Leve82] Levendel, Y.H., Menon, P.R. and Soffa, S.H., "Special Purpose Computer for Logic Simulation Using Distributed Processing", Bell System Technical Journal, Vol. 61, No. 10, pp. 2873-2909, December 1982.

[Lin89] Lin, Yi-Bing and Lazowska, Edward D., "Determining the Global Virtual Time in a Distributed Simulation", Tech. Rep. TR-90-01-02, Department of Computer Science and Engineering, University of Washington, Seattle, WA, December 1989.

[Lin91] Lin, Y.-B. and Preiss, B.R., "Optimal Memory Management for Time Warp Parallel Simulation", ACM Transactions on Modeling and Computer Simulation, Vol. 1, No. 4, pp. 283-307, October 1991.

[Liu90] Liu, L.Z. and Tropper, Carl, "Local Deadlock Detection in Distributed Simulation", Proceedings of the 1990 Distributed Simulation Conference, SCS Simulation Series, Vol. 22, No. 1, pp. 137-143, 1990.

[Livn82] Livny, M. and Melman, M., "Load Balancing in Homogeneous Broadcast Distributed Systems", Proc. of the ACM Computer Network Performance Symposium, pp. 47-55, 1982.

[Lu86] Lu, H. and Carey, M.J., "Load-Balanced Task Allocation in Locally Distributed Computer Systems", Proceedings of the 1986 International Conference on Parallel Processing, pp. 1037-1039, 1986.

[Luba88] Lubachevsky, Boris D., "Simulating Colliding Rigid Disks in Parallel Using Bounded Lag Without Time Warp", Distributed Simulation, SCS Simulation Series, Vol. 22, No. 1, pp. 194-202, 1988.

[Luba88a] Lubachevsky, Boris D., "Bounded Lag Distributed Discrete Event Simulation", Proceedings of the 1988 Distributed Simulation Conference, SCS Simulation Series, Vol. 19, No. 3, pp. 183-191, February 1988.

[Luba89] Lubachevsky, Boris D., Schwartz, A. and Weiss, A., "Rollback Sometimes Works... if Filtered", Proceedings of the 1989 Winter Simulation Conference, pp. 630-639, December 1989.

[Maur90] Maurer, P.M. and Wang, Z., "Techniques for Unit-Delay Compiled Simulation", Proc. of the 27th ACM/IEEE Design Automation Conference, pp. 480-484, 1990.

[Misr82] Misra, J. and Chandy, K.M., "A Distributed Graph Algorithm: Knot Detection", ACM Transactions on Programming Languages and Systems, Vol. 4, pp. 678-686, October 1982.

[Misr86] Misra, J., "Distributed Discrete Event Simulation", Computing Surveys, Vol. 18, No. 1, pp. 39-65, March 1986.

[Mukh86] Mukherjee, A., Introduction to nMOS and CMOS VLSI Systems Design, Prentice Hall International Editions, 1986.

[Nand92] Nandy, Biswajit and Loucks, Wayne M., "An Algorithm for Partitioning and Mapping Conservative Parallel Simulation onto Multicomputers", PADS'92, pp. 139-146, 1992.

[Nata86] Natarajan, N., "A Distributed Scheme for Detecting Communication Deadlocks", IEEE Transactions on Software Engineering, Vol. SE-12, No. 4, pp. 531-537, April 1986.

[Ni85] Ni, L., et al., "A Distributed Drafting Algorithm for Load Balancing", IEEE Trans. Software Engineering, Vol. 11, No. 10, 1985.

[Nico88] Nicol, D.M., "Parallel Discrete Event Simulation of FCFS Stochastic Queuing Networks", Proceedings of the ACM SIGPLAN Symposium on Parallel Programming: Environments, Applications, and Languages, pp. 124-137, July 1988.

[Peac79] Peacock, J.K., Wong, J.W. and Manning, E.G., "Distributed Simulation Using a Network of Processors", Computer Networks, Vol. 3, No. 1, pp. 44-56, February 1979.

[Prei89] Preiss, B.R., "The Yaddes Distributed Discrete Event Simulation Specification Language and Execution Environment", Proceedings of the Multiconference on Distributed Simulation, pp. 139-144, 1989.

[Reit90] Reiher, Peter L. and Jefferson, David, "Virtual Time Based Dynamic Load Management in the Time Warp Operating System", Proc. of the 4th Workshop on Parallel and Distributed Simulation, pp. 103-111, 1990.

[Sama85] Samadi, B., "Distributed Simulation, Algorithms and Performance Analysis", PhD Thesis, Computer Science Department, University of California, Los Angeles, 1985.

[Schl95] Schlagenhaft, Rolf, et al., "Dynamic Load Balancing of a Multi-Cluster Simulator on a Network of Workstations", Proc. of the 9th Workshop on Parallel and Distributed Simulation, pp. 175-180, 1995.

[Shan89] Shanker, M., et al., "Adaptive Distribution of Model Components Via Congestion", Proceedings of the 1989 Winter Simulation Conference, 1989.

[Smit86] Smith, R.J., "Fundamentals of Parallel Logic Simulation", Proc. of the 23rd ACM/IEEE Design Automation Conference, pp. 2-12, 1986.

[Smit87] Smith, R., Smith, J. and Smith, K., "Faster Architecture Simulation Through Parallelism", Proc. of the 24th ACM/IEEE Design Automation Conference, pp. 189-194, 1987.

[Smit87a] Smith, Steven P., Underwood, Bill and Mercer, M. Ray, "An Analysis of Several Approaches to Circuit Partitioning for Parallel Logic Simulation", Proceedings IEEE International Conference on Computer Design (ICCD 87), pp. 664-667, 1987.

[Soul87] Soule, L. and Blank, T., "Statistics for Parallelism and Abstraction Levels in Digital Simulation", Proc. of the 24th ACM/IEEE Design Automation Conference, pp. 588-591, 1987.

[Soul88] Soule, L. and Blank, T., "Parallel Logic Simulation on General Purpose Machines", Proc. of the 25th ACM/IEEE Design Automation Conference, pp. 166-171, 1988.

[Soul92] Soule, L. and Gupta, A., "An Evaluation of the Chandy-Misra-Bryant Algorithm for Digital Logic Simulation of VLSI Circuits", Proc. of the 6th Workshop on Parallel and Distributed Simulation, pp. 129-138, 1992.

[Spor93] Sporrer, Christian and Bauer, Herbert, "Corolla Partitioning for Distributed Logic Simulation of VLSI Circuits", PADS'93, Vol. 23, No. 1, pp. 85-92, 1993.

[Su89] Su, W.K. and Seitz, C.L., "Variants of the Chandy-Misra-Bryant Distributed Discrete-Event Simulation Algorithm", Proc. of the Workshop on Parallel and Distributed Simulation, pp. 38-43, 1989.

[Tane81] Tanenbaum, A., Computer Networks, Prentice Hall, Englewood Cliffs, NJ, 1981 (second edition 1988).

[Tant85] Tantawi, A.N. and Towsley, D., "Optimal Static Load Balancing in Distributed Computer Systems", Journal of the ACM, Vol. 32, No. 3, pp. 445-465, 1985.

[Term83] Terman, C.J., "Simulation Tools for Digital Design", PhD dissertation, Massachusetts Institute of Technology, Cambridge, Massachusetts, 1983.

[Vlad80] Vladimirescu, A. and Liu, S., The Simulation of MOS Integrated Circuits Using SPICE, Memo UCB/ERL M80/7, University of California, Berkeley, 1980.

[Wang87] Wang, L.T., Hoover, N., Porter, E. and Zasio, J., "SSIM: A Software Levelized Compiled-Code Simulator", Proc. of the 24th ACM/IEEE Design Automation Conference, pp. 2-8, 1987.

[Wang85] Wang, Y.T. and Morris, R.J.T., "Load Sharing in Distributed Systems", IEEE Transactions on Computers, pp. 204-217, 1985.

[Wang90] Wang, Z. and Maurer, P.M., "LECSIM: A Levelized Event Driven Compiled Logic Simulator", Proc. of the 27th ACM/IEEE Design Automation Conference, pp. 491-496, 1990.

[Wei89] Wei, Yen-Chuen and Cheng, Chung-Kuan, "Towards Efficient Hierarchical Designs by Ratio Cut Partitioning", IEEE, pp. 298-301, 1989.

[Wils86] Wilson, A., "Parallelization of an Event Driven Simulator on the Encore Multimax", Tech. Rep. ETR 86-005, Encore Computer, 1986.

[Wong86] Wong, F., "Statistics on Logic Simulation", Proc. of the 23rd ACM/IEEE Design Automation Conference, pp. 13-19, 1986.

[Zarg85] Zargham, M. and Pircell, R., "A Protocol for Load Balancing CSMA Networks", IEEE Trans. Parallel and Distributed Systems, 1985.
