
Efficient Response Time Approximations for Multiclass Fork and Join Queues in Open and Closed Queuing Networks

Firas Alomari and Daniel A. Menasce, Senior Member, IEEE

Abstract—Parallel and concurrent structures are widely used both as standalone components and as building blocks of larger systems. Efficient models of parallelism and concurrency are therefore necessary to understand the impact of parallelism on system performance. These models become more critical with the proliferation of adaptive systems that require solving a large number of performance models in a short amount of time to facilitate configuration decisions dynamically. This paper presents efficient response time approximations for parallel constructs modeled as fork and join queues. These approximations can be used by practitioners and performance engineers to quickly compare the performance of contending configurations. The contributions over previous work are twofold. First, this paper considers heterogeneous multiclass fork and join open and closed queuing networks. Second, the paper also presents models for fork and join where each class of jobs might fork to different queues in a probabilistic manner.

Index Terms—Performance of systems, modeling and prediction, modeling techniques, parallel systems, queuing theory, fork and join, queuing networks, approximation algorithms


1 INTRODUCTION

As modern systems become more complex, more integrated, and more essential to the conduct of business than ever before, there is an increasing need for performance models that allow practitioners and designers to study and evaluate the performance of systems to answer questions of cost and performance that arise throughout the life cycle of such systems. Moreover, features exhibited by modern systems such as parallelism, blocking, priority scheduling, and runtime reconfiguration place a greater emphasis on the need for efficient approximate analytic models.

More recently, an increased emphasis has been placed on efficient analytic models and approximations to be used in autonomic computing and self-managing systems [1], [2]. These systems dynamically adjust their configuration to meet system performance objectives at run time. Analytic models are used in such systems to evaluate a very large number of possible performance models dynamically to find an optimal or near-optimal configuration [3]. Therefore, there is an increasing need to develop techniques to efficiently predict system performance dynamically or at design stages. A structure of increasing importance is parallelism, where jobs are replicated and serviced in parallel by different processing units with possibly different processing times. Some examples of what is called fork and join (FJ) include:

• Web service composition applications [4] invoke several simultaneous parallel services and only proceed when all parallel requests have completed.

• A combination of Intrusion Detection Systems that work in tandem to protect a computer system; a request is only allowed to proceed after all types of attacks have been inspected for [5].

• A computer system with multiple redundant disk arrays (RAID) [6] where the system queries different disks in parallel and integrates their responses into a single result.

• A MapReduce framework [7] where computations that use large datasets are divided into subproblems and the results are aggregated to provide the answer to the original problem.

• A multi-core system with a program that runs two or more routines in parallel; all should be completed in order to start the next routine [8].

• A health care provider requests several tests simultaneously from several labs; the patient’s diagnosis is finalized when all test results are received [9].

There is no exact analytic solution to FJ modeling except for the case of two homogeneous queues, and using simulation is computationally expensive. For example, a simulation of 100 queues on an Intel Core i5 machine takes an average of 3 hours. Therefore, there is a need for an analytic approximation to FJ constructs that allows modelers to deal with these types of situations. The approximation must be simple and efficient enough to be used by practitioners and general enough to be used to evaluate several candidate scenarios in different domains with acceptable accuracy.

• F. Alomari is with The Volgenau School of Engineering, George Mason University, Fairfax, VA 22030 USA. E-mail: [email protected].

• D.A. Menasce is with the Department of Computer Science, George Mason University, Fairfax, VA 22030 USA. E-mail: [email protected].

Manuscript received 12 June 2012; revised 15 Feb. 2013; accepted 15 Feb. 2013. Date of publication 6 Mar. 2013; date of current version 16 May 2014. Recommended for acceptance by M. Kandemir. For information on obtaining reprints of this article, please send e-mail to [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2013.70

1045-9219 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 6, JUNE 2014 1437

The rest of the paper is organized as follows. First, we discuss queuing networks and analytic models. Second, we describe the FJ model considered here. Third, we discuss previous work related to FJ analysis and approximations. Fourth, we explain the proposed approximations and the results of the evaluation. Finally, we present concluding remarks.

2 QUEUING NETWORKS

Queuing networks (QN) are defined as a collection of interconnected queues (resource plus waiting line) that provide service to customers (requests, jobs, etc.). Customers move from one queue to another until they complete their execution. Customers with similar characteristics (service times and routing probabilities within the QN) can be grouped into one (i.e., single class) or more (i.e., multiclass) classes. A QN can be open, closed, or mixed. In an open QN, customers enter the network at a certain arrival rate, receive service, and leave the network. In a closed QN, the number of customers in the QN is fixed. In mixed networks, some customer classes are open and the others are closed. The input parameters of a QN model consist of workload intensity parameters (e.g., arrival rate, number of customers) and service demand parameters at each resource (i.e., total service time of a request at a resource during all visits to the resource). A number of techniques are available to solve QNs to obtain performance measures. However, there are some conditions that limit the use of these techniques to model parallel systems and applications. See Appendix A, available in the online supplemental material in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2013.70, for more details on QNs.

3 FORK AND JOIN

In the FJ construct shown in Fig. 1, there are $K$ branches and each branch has a single queue. Jobs are split on arrival into $J$ $(J \le K)$ tasks to be serviced in parallel by $J$ queues. Only when all $J$ parallel tasks of a job complete can the job rejoin (synchronize) and leave the system. A job can be split to every queue in the FJ $(J = K)$ (i.e., full fork) or to a certain number of queues in a probabilistic way (i.e., dynamic fork). In the dynamic fork case, jobs are first split into a random number of $J$ $(1 \le J \le K)$ tasks. Then, $J$ specific queues are selected in a probabilistic way. In multiclass models, each job class can be routed differently in the FJ (i.e., configurable fork). A single queue in the FJ operates like a standard queuing system. When the service time distributions and their moments are the same at all queues visited by the tasks of a job in a FJ construct, we have a homogeneous FJ. Otherwise, we have a heterogeneous one.
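The synchronization behavior of a full fork can be illustrated with a small simulation. The sketch below is not the simulator used in the paper; it assumes Poisson arrivals, exponential service times, and first-come first-served queues, and it tracks each queue's waiting time with the Lindley recursion. The function name and its parameters are hypothetical.

```python
import random


def simulate_full_fork(arrival_rate, mean_service, num_jobs=100_000, seed=1):
    """Estimate the mean response time of a full-fork FJ by simulation.

    arrival_rate : job arrival rate (Poisson arrivals, an assumption)
    mean_service : mean service time of each of the K queues
                   (exponential service, an assumption)
    """
    rng = random.Random(seed)
    waits = [0.0] * len(mean_service)  # waiting time of the current task per queue
    total = 0.0
    for _ in range(num_jobs):
        # Every task of the job samples its own service time.
        services = [rng.expovariate(1.0 / s) for s in mean_service]
        # The job leaves the join only when its slowest task has finished.
        total += max(w + x for w, x in zip(waits, services))
        # Lindley recursion: waiting time seen by the next job at each queue.
        gap = rng.expovariate(arrival_rate)
        waits = [max(0.0, w + x - gap) for w, x in zip(waits, services)]
    return total / num_jobs


if __name__ == "__main__":
    # One queue behaves like M/M/1, whose mean response time is s / (1 - rho).
    print(simulate_full_fork(0.5, [1.0]))
    # Two identical queues: synchronization at the join adds delay.
    print(simulate_full_fork(0.5, [1.0, 1.0]))
```

With a single queue the estimate should approach the M/M/1 value, and adding branches increases it because every job waits for its slowest sibling.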

The synchronization requirements of a FJ add significant modeling complexity. The only exact method available to solve general networks with non-product-form characteristics is to determine and solve the Markov process representation of the network. Generalized Stochastic Petri Nets (GSPN) can be used to specify a system and facilitate the automatic generation of an underlying Markov process, whose solution can be used to compute the performance metrics of interest [10]. Even if the Markov process representation of a GSPN can be found, the solution is usually too costly unless the network has a small state space. However, this is not the case for many practical problems. Therefore, approximation techniques have been proposed to estimate performance measures for FJ constructs in QNs. However, often these approximations are designed with a set of restrictive assumptions that limit their applicability for more general performance planning and design scenarios.

3.1 Related Work

Exact and efficient solutions for FJ queues are only known for two servers (see [20], [21] for open networks and [13], [15] for closed and open networks). Approximation techniques have been used to compute the average response time for FJ networks for the case of more than two servers. Table 1 compares approximations in previous work with the ones presented in this paper. Initially, [22] defined bounds for the mean response time for FJ networks, and [15] derived an approximation based on the observation that both lower and upper bounds grow at the same rate as the number of servers. The work in [14] extends this by interpolating between light and heavy traffic to approximate homogeneous FJ queues. However, the work in [14], [15] assumes homogeneous queues where customers are grouped into a single class. Furthermore, that work requires obtaining a scaling factor estimated through simulation. The approximation presented in [13] considers a dynamic fork model. However, it only considers single class homogeneous FJ constructs. The approximation presented here considers heterogeneous service demands and multiple classes of jobs. In [17], [18], maximum order statistics are used to approximate the synchronization time in FJ with heterogeneous service demands. However, their approach requires solving integrals for different numbers of servers and does not consider multiclass and closed networks or dynamic and configurable forking. In [13], an approximation is presented for open homogeneous FJ with dynamic fork. However, it only considers homogeneous queues and single class models. Ray [16] uses a bubble sort algorithmic approach to find bounds for homogeneous exponential service time FJ networks. Another algorithmic approximation for open FJ networks is given in [19] using a matrix geometric approach. However, the algorithmic solutions presented in [16] and [19] are iterative and their complexity increases with the number of servers in the FJ model. Furthermore, they do not consider multiclass or closed FJ networks.

Varki [11], [12] presents response time bounds for closed FJ constructs and extends Mean Value Analysis (MVA) [23] to obtain an approximate solution for QNs with single-class closed FJ subnetworks. Furthermore, the work in [11], [12] considers the more general case of partial branching of jobs in the FJ. The work in [12], [24] presents non-iterative geometric response time bounds for closed networks with FJ constructs. However, all work in closed networks considers a single class closed QN with homogeneous service times. This paper considers single and multiclass heterogeneous closed networks with dynamic branching (i.e., dynamic fork) and presents methods to solve them. In [25], a closed model is analyzed as described by the BCMP theorem [26] but the join point is not modeled. In [11], [27], a FJ is considered where the number of jobs at each fork is random. However, their analysis assumes single class homogeneous servers. The approximation approach presented here considers a FJ with heterogeneous servers where different classes may execute jobs on different servers in the FJ.

Fig. 1. Fork and Join (FJ) construct.

The comparison discussed in Table 1 shows that: 1) The techniques for heterogeneous FJ suffer from high computational cost or the difficulty in obtaining a closed form expression. 2) None of the heterogeneous techniques provide a solution for closed or open multiclass networks. 3) The approximation presented in this paper provides a more general solution by considering open or closed, single or multiclass heterogeneous FJ networks with full, dynamic and configurable routing in the FJ constructs.

4 FORK AND JOIN APPROXIMATION

We first address open QNs and their extensions. Then, we address closed QNs and show how MVA can be adapted to solve single class closed models. Finally, we present an algorithm to solve multiclass closed QNs with FJ subnetworks.

4.1 Open Model

Fig. 1 depicts a multiclass FJ queue with $K \ge 2$ heterogeneous queues. There are $R$ different classes of jobs numbered $1, \ldots, r, \ldots, R$. The vector $\vec{S} = (\vec{s}_1, \ldots, \vec{s}_K)$, where $\vec{s}_j = (s_{j,1}, \ldots, s_{j,r}, \ldots, s_{j,R})$, specifies the service demands (in seconds) of the tasks of jobs of classes 1 through $R$ at queue $j$ $(j = 1, \ldots, K)$ of the FJ. For the single class case we drop the class subscript and use the notation $\vec{S} = (s_1, \ldots, s_K)$. Jobs of class $r$ $(r = 1, \ldots, R)$ are assumed to arrive at a rate of $\lambda_r$ jobs/sec. We denote by $\vec{\lambda} = (\lambda_1, \ldots, \lambda_r, \ldots, \lambda_R)$ the vector of arrival rates of jobs of all classes.

Each class of jobs forks into $1 \le J \le K$ tasks based on a routing matrix $A = [K \times R]$. The $r$th column of the matrix $A$ is the vector $\vec{\epsilon}_r = (\epsilon_{1,r}, \ldots, \epsilon_{K,r})$ that prescribes how class $r$ jobs are routed to the $K$ servers in the fork and join subnetwork. For example, if $K = 3$ and $\vec{\epsilon}_2 = (1, 0, 1)$, jobs of class 2 will branch into queues 1 and 3. After receiving service by the servers at these two queues, each task proceeds to the join point and waits until all its siblings complete service. Only then does the job proceed to the next queue in the QN or leave the QN.

Intuitively, a job’s service time at a FJ construct is at least equal to the time spent at the server with maximum service time. More specifically, the time it takes to service $J$ job siblings concurrently that must synchronize after they have all completed is at least $S = \max_{i=1}^{J}\{s_i\}$. However, since it is possible for different jobs to overlap and wait to synchronize before they leave the join, the service time becomes at most $O_J \times S$, where $O_J$ is a scaling factor by which $S$ must be increased in order to compute the time it takes for the job to synchronize and leave the FJ. In probability theory, $O_J$ is known as the mean of the $J$th order statistics and the value of $O_J$ is dependent on the distribution of $s_i$ [28]. Similar observations can be made about the waiting time. Thus, the waiting time is described by the $J$th order statistics of the maximum waiting time of $J$ siblings $(W \times O_J)$. In view of the above observations about the service and waiting times, one can see that the mean response time, $T$, of jobs in the FJ is bounded by:

$$S \times O_J \le T \le O_J \times (W + S). \tag{1}$$

The lower bound in equation (1) is obtained by neglecting the waiting time. The upper bound, on the other hand, assumes that all of a job’s siblings will wait at the most utilized server.

TABLE 1. Related Fork and Join Approximations

ALOMARI AND MENASCE: RESPONSE TIME FOR MULTICLASS FORK AND JOIN QUEUES IN QUEUING NETWORKS 1439

The same intuitive observations are made, and a formal proof is provided, in [11], [15]. Furthermore, [11], [15] make use of two assumptions to derive and prove theoretical bounds on the mean response time. The first assumes that all servers in the FJ are homogeneous (i.e., $s_1 = s_2 = \cdots = s_K$). The second assumes that the service times are exponentially distributed with the same mean. Under these assumptions, one can show that the scaling factor $O_J$ equals the $J$th harmonic number $H_J$, where $H_J = \sum_{i=1}^{J} 1/i$.
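For the homogeneous exponential case, the bounds of equation (1) can be evaluated directly once $O_J$ is replaced by $H_J$. The following is a minimal sketch; it assumes that the waiting time $W$ is the M/M/1 waiting time $\rho s / (1 - \rho)$, and the function names are hypothetical.

```python
def harmonic(j):
    """J-th harmonic number H_J = sum_{i=1..J} 1/i."""
    return sum(1.0 / i for i in range(1, j + 1))


def homogeneous_fj_bounds(service_time, utilization, num_queues):
    """Bounds of equation (1) with O_J = H_J (homogeneous exponential servers).

    The waiting time W is taken from the M/M/1 formula
    W = rho * s / (1 - rho), which is an assumption of this sketch.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    h = harmonic(num_queues)
    waiting = utilization * service_time / (1.0 - utilization)
    lower = service_time * h              # waiting time neglected
    upper = h * (waiting + service_time)  # all siblings wait as long as possible
    return lower, upper


# Three identical queues, s = 1 sec, 50 percent utilization.
print(homogeneous_fj_bounds(1.0, 0.5, 3))
```

Both bounds scale with $H_J$, which is why they grow at the same rate as the number of servers.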

Let us now consider a heterogeneous case with different exponentially distributed service times. Intuitively, this case follows the same rationale as in the homogeneous case. More specifically, the time it takes to service a job will be at least $S = \max_{i=1}^{J}\{s_i\}$. However, in the heterogeneous case, jobs will take less time on average to synchronize because jobs completing from the queues with the highest service times are less likely to wait for jobs of other queues to complete since the other queues have lower service times (i.e., they are faster). In this case, it is harder to efficiently compute a scaling factor for different exponentially distributed service times. Therefore, to obtain a reasonably justified response time approximation, we make use of the following observations: 1) the mean response time is most sensitive to servers with the highest service times; 2) the mean response time grows with the number of parallel servers and with the service time at each server; 3) the mean response time of the heterogeneous case is at most that of the homogeneous case, as long as the maximum service time in the heterogeneous case is equal to that of the homogeneous case, and both have the same number of servers. In view of the previous observations, a reasonable approximation for the mean response time, $R_r$, for class $r$ requests is obtained by reordering the queue numbers in the FJ based on the service demands of each queue and on the routing associated to each queue.

Before we present the approximation for $R_r$, we need to define some notation. For each class $r$ $(r = 1, \ldots, R)$ define a sequence $V_r = (v_r(1), v_r(2), \ldots, v_r(K))$ such that:

1. $v_r(k) \in \{1, 2, \ldots, K\}\ \forall\, k = 1, \ldots, K$,
2. $v_r(k) \ne v_r(k')$ for $k \ne k'$, and
3. $s_{v_r(1),r} \times \epsilon_{v_r(1),r} \ge s_{v_r(2),r} \times \epsilon_{v_r(2),r} \ge \cdots \ge s_{v_r(K),r} \times \epsilon_{v_r(K),r}$.

In other words, $V_r$ is a reordering of the original queue numbers in the FJ sorted by the third condition above.

Then,

$$R_r = \sum_{k=1}^{K} \frac{1}{k} \times T_{v_r(k)}, \tag{2}$$

where $T_{v_r(k)}$ is the average response time of class $r$ requests at queue $v_r(k)$.

It should be noted that if all the service times are equal, equation (2) is similar to the approximations presented in [11], [15]. However, the mean response time approximation in [15] grows as a function of the number of parallel siblings, while in our approximation the mean response time grows as a function of the number of parallel siblings and the service time for each sibling. If the service times are exponentially distributed and the jobs arrive from a Poisson process, we can use the M/M/1 single queue results [29] and rewrite equation (2) as:

$$R_r = \sum_{k=1}^{K} \frac{1}{k} \times \epsilon_{v_r(k),r}\, \frac{s_{v_r(k),r}}{1 - \rho_{v_r(k)}}, \tag{3}$$

where $\rho_{v_r(k)}$ is the total utilization of FJ queue number $v_r(k)$. The utilization of queue $k$ $(k = 1, \ldots, K)$ is computed as

$$\rho_k = \sum_{r=1}^{R} \lambda_r \times s_{k,r} \times \epsilon_{k,r}. \tag{4}$$

Equation (3) gives more weight to queues with larger service demands because service times are sorted in decreasing order. The routing vector $\vec{\epsilon}_r$ accounts for the unused servers, so that a server that is not used by a certain class has no impact on the response time of that class.

For example, consider the open FJ shown in Fig. 2 with two job classes and three queues in the FJ. Jobs of class 1 fork into queues 1 and 3 only while jobs of class 2 fork into all three queues. The queue utilizations are computed assuming that the arrival rates for jobs of classes 1 and 2 are $\lambda_1 = 0.2$ jobs/sec and $\lambda_2 = 0.1$ jobs/sec, respectively. According to the products $s_{k,r} \times \epsilon_{k,r}$, $V_1 = (1, 3, 2)$ and $V_2 = (3, 1, 2)$. Therefore, according to equation (3), $R_1$ is computed as

$$R_1 = \frac{1}{1} \cdot 1 \cdot \frac{2}{1 - 0.7} + \frac{1}{2} \cdot 1 \cdot \frac{1}{1 - 0.8} + \frac{1}{3} \cdot 0 \cdot \frac{4}{1 - 0.1} = 9.17 \text{ sec} \tag{5}$$

and $R_2$ as

$$R_2 = \frac{1}{1} \cdot 1 \cdot \frac{6}{1 - 0.8} + \frac{1}{2} \cdot 1 \cdot \frac{3}{1 - 0.7} + \frac{1}{3} \cdot 1 \cdot \frac{1}{1 - 0.1} = 35.37 \text{ sec.} \tag{6}$$

The values obtained for $R_1$ and $R_2$ by using simulation with a 95 percent confidence interval are 8.32 ± 0.09 sec and 33.04 ± 0.23 sec, respectively, which represent 9.03 percent and 5.9 percent relative errors (see Appendix B.1 for more details). More about the validation and comparison with simulation is presented later in the paper.

According to equation (3), when class $r$ jobs are only forked to one server (i.e., $K = 1$), the response time for class $r$ becomes the exact response time for M/M/1 queues (i.e., $R_r = s_{k,r}/(1 - \rho_k)$).
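The open-model approximation of equations (3) and (4) can be sketched in a few lines. The function below is an illustrative implementation, not the authors' code; its name and argument layout are hypothetical, and the service demands used in the usage example are inferred from the terms of equations (5) and (6).

```python
def open_fj_response_times(arrival_rates, service, routing):
    """Approximate per-class FJ response times, equations (3) and (4).

    arrival_rates[r] : arrival rate of class r (jobs/sec)
    service[k][r]    : service demand of class r at FJ queue k (sec)
    routing[k][r]    : probability (or 0/1 flag) that class r visits queue k
    """
    num_queues = len(service)
    num_classes = len(arrival_rates)

    # Equation (4): total utilization of every FJ queue.
    rho = [
        sum(arrival_rates[r] * service[k][r] * routing[k][r]
            for r in range(num_classes))
        for k in range(num_queues)
    ]
    if any(u >= 1.0 for u in rho):
        raise ValueError("every queue utilization must be below 1")

    responses = []
    for r in range(num_classes):
        # V_r: queue indices sorted by decreasing s * epsilon (condition 3).
        order = sorted(range(num_queues),
                       key=lambda k: service[k][r] * routing[k][r],
                       reverse=True)
        # Equation (3): the k-th queue in V_r is weighted by 1/k.
        responses.append(sum(
            routing[k][r] * service[k][r] / (1.0 - rho[k]) / rank
            for rank, k in enumerate(order, start=1)
        ))
    return responses, rho


# Two-class, three-queue example; demands inferred from equations (5) and (6).
service = [[2.0, 3.0], [4.0, 1.0], [1.0, 6.0]]   # rows: queues 1..3
routing = [[1, 1], [0, 1], [1, 1]]               # class 1 skips queue 2
responses, rho = open_fj_response_times([0.2, 0.1], service, routing)
print(round(responses[0], 2), round(responses[1], 2))  # 9.17 35.37
```

When a class visits a single queue, only the first term of the sum is nonzero, which recovers the M/M/1 response time.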

Fig. 2. Fork and Join example.


4.1.1 Dynamic Fork Model

In the FJ considered earlier, jobs of each class were split into exactly $J$ $(J \le K)$ branches. However, in many parallel systems such as RAID disks, jobs may be split into any random number $J$ $(1 \le J \le K)$ of tasks that are forked (i.e., routed) to queues in the FJ in a probabilistic way. For example, in a RAID disk system, requests are distributed to different numbers of disks. Another example can be seen in distributed computing where the number of parallel job siblings and their assignment depend on available resource capacity. This kind of dynamic fork arises quite often in heterogeneous parallel systems where the degree of parallelism (i.e., number of parallel siblings) and their assigned resources are determined dynamically to take advantage of faster systems by assigning more jobs to faster queues. Each queue in the FJ is visited by a certain class of jobs with a certain probability. We assume without loss of generality that the probability of class $r$ jobs visiting queue $k$ is defined by the vector $\vec{\epsilon}_r$. In other words, $\epsilon_{k,r}$ represents the probability that queue $k$ is visited by class $r$ jobs. If queue $k$ is not visited by class $r$ jobs, $\epsilon_{k,r} = 0$. If every class $r$ job visits queue $k$, $\epsilon_{k,r} = 1$.

It should be noted that $\sum_{k=1}^{K} \epsilon_{k,r}$ is not necessarily equal to one. For example, consider a special case of dynamic fork in which the number of queues visited by class $r$ jobs is given by the random variable $\tilde{J}_r$ and the probability of selecting a set of $k$ queues (when $\tilde{J}_r = k$) is the same for all sets of $k$ queues. Then, $\epsilon_{k,r}$ can be computed as

$$\epsilon_{k,r} = \sum_{k=1}^{K} \Pr[\tilde{J}_r = k]\, \frac{\binom{K-1}{k-1}}{\binom{K}{k}} = \frac{1}{K} \sum_{k=1}^{K} k \times \Pr[\tilde{J}_r = k]. \tag{7}$$

The expression in equation (7) is equal to the ratio between the average number of queues selected by class $r$ jobs and the total number of queues in the FJ.
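The equality of the two forms of equation (7) can be checked numerically. The sketch below assumes a hypothetical probability mass function for the number of selected queues and computes the visit probability with both the binomial-ratio form and the mean-over-$K$ form; the function name is illustrative.

```python
from math import comb


def visit_probability(pmf):
    """Visit probability of a queue under uniform subset selection, equation (7).

    pmf[j - 1] is Pr[J = j] for j = 1..K, where K = len(pmf).
    Returns the value computed with both forms of equation (7).
    """
    K = len(pmf)
    if abs(sum(pmf) - 1.0) > 1e-9:
        raise ValueError("pmf must sum to 1")
    # A given queue belongs to C(K-1, j-1) of the C(K, j) subsets of size j.
    binomial_form = sum(p * comb(K - 1, j - 1) / comb(K, j)
                        for j, p in enumerate(pmf, start=1))
    # Equivalent closed form: average number of selected queues divided by K.
    mean_form = sum(j * p for j, p in enumerate(pmf, start=1)) / K
    return binomial_form, mean_form


# Hypothetical distribution over J = 1, 2, 3 for K = 3 queues.
print(visit_probability([0.2, 0.5, 0.3]))
```

Both values agree because $\binom{K-1}{j-1} / \binom{K}{j} = j/K$.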

The average response time, $R_r$, for the dynamic case is computed using the process described before for the static and configurable case, with the interpretation that $\epsilon_{k,r}$ is the probability that class $r$ jobs visit queue $k$.

Consider the FJ in Fig. 2 where the service times are the same as the ones in the example presented before. The routing policy is different and therefore the order of the queues, the utilizations, and the average response times are different: $R_1 = 5.27$ sec and $R_2 = 4.10$ sec. See Appendix B.2 for more details. The values obtained for $R_1$ and $R_2$ by using simulation with 95 percent confidence intervals are 5.52 ± 0.07 sec and 4.24 ± 0.04 sec, respectively, which represent 3.3 percent and 2.1 percent relative errors.

In general QN models, the FJ is one of the devices of the QN. Then, the QN model can be solved by composing all the queues as shown in [30], where the FJ mean response time at the FJ device is computed as described above.

4.2 Closed QN Model

In many scenarios, one may need to compute performance values for FJ constructs in closed and mixed QN models. In a closed model, there are a finite number of jobs cycling through the system. Closed QNs are quite useful for modeling batch or terminal workloads, such as a client-server system with a known number of clients, heavy load systems, and batch workloads.

The workload intensity parameters in such models are given by the number of customers $N$ in the QN. In multiclass QNs, the workload intensity is given by $\vec{N} = (N_1, \ldots, N_R)$ where $N_r$ is the number of class $r$ jobs in the QN. The next sections show how MVA can be extended to include our approximation for heterogeneous FJ subnetworks for single-class and multiclass closed QNs.

4.2.1 Heterogeneous FJ Single Class Mean Value Analysis (MVA)

The arrival theorem [23] states that in a closed QN, the average number of jobs seen by an arriving job at a device is equal to the average queue length at that device with one less customer in the QN. In MVA, this theorem is used to compute the residence time of a job at a device by adding the job service time to the time it takes to service all the jobs found at the device by the arriving job. Since jobs in FJ constructs are split between the different queues in the fork and join and then rejoined, FJ constructs can be seen as single devices in the QN. For example, if we have a QN with a 3-queue FJ and two regular queues, one can think of this as a closed QN with three devices. An extension of MVA was presented in [12] to compute approximate performance values for QNs with FJ subsystems. However, that extension only considers homogeneous FJ networks. Their approximation assumes that the queue lengths in the FJ are approximately the same. More specifically, in [12] the FJ subnetwork is considered a single device in the QN and the residence time of the FJ subnetwork is computed as $s_i[H_k + \bar{n}_i(n-1)]$, where $\bar{n}_i(n-1)$ is the average queue length at device $i$ (the FJ construct) when there is one less customer in the QN and $H_k = \sum_{j=1}^{k} 1/j$ is the $k$-th harmonic number.

When the service demands in all FJ queues are equal, it is fair to assume that the average queue lengths in the FJ queues are equal. However, in the heterogeneous case, the average queue lengths will be different. Therefore, based on our earlier observations, we propose an approximation that is different from the traditional MVA algorithm and from the extension presented in [12] for homogeneous FJ subnetworks. Our approximation starts when there are zero customers in the QN and proceeds as follows:

1. Let $M$ be the number of non-FJ queues (numbered $1, \ldots, M$) in the QN and let the queues in the FJ be numbered $M+1, \ldots, M+K$. Consider each branch of the FJ subnetwork as a regular queue. Let $s_i$ $(i = 1, \ldots, M+K)$ be the service demand of queue $i$. Let $\epsilon_i$ $(i = M+1, \ldots, M+K)$ be the routing probability of queue $i$ in the FJ. To simplify the notation below, let $\epsilon_i = 1$ for $i = 1, \ldots, M$.

2. Compute the residence time $R'_i(n)$ $(i = 1, \ldots, M+K)$ for all queues in the QN, including the $K$ queues in the FJ subnetwork, using the standard MVA equation as follows:

$$R'_i(n) = \epsilon_i \times s_i \left[ 1 + \bar{n}_i(n-1) \right], \quad i = 1, \ldots, M+K, \tag{8}$$

where $\bar{n}_i(n-1)$ is the average queue length at queue $i$ when there is one less customer in the QN. Note that by multiplying the service demand by $\epsilon_i$ we can handle the configurable and dynamic cases.

3. Reorder the residence times of the FJ devices in descending order.

4. Compute the overall FJ device residence time as follows:

$$R'(n) = \sum_{k=M+1}^{M+K} \frac{1}{k-M} R'_k(n), \tag{9}$$

where $R'_{M+1}(n) \ge R'_{M+2}(n) \ge \cdots \ge R'_{M+K}(n)$.

5. Compute the system throughput by applying Little’s Law and considering that the response time is equal to the residence time at the FJ device plus the sum of the residence times at all non-FJ queues:

$$X_0(n) = \frac{n}{R'(n) + \sum_{i=1}^{M} R'_i(n)}. \tag{10}$$

6. Compute the average queue length for each queue in the QN, including those in the FJ device, using the standard MVA equation based on the residence times of each queue and the overall throughput as follows:

$$\bar{n}_i(n) = X_0(n) \times R'_i(n), \quad i = 1, \ldots, M+K. \tag{11}$$

Because the queue length is zero when there are no customers in the network ($\bar{n}_i(0) = 0$), the previous equations can be used recursively to calculate the residence time in equation (8). The residence time can then be used to calculate the average response time in the QN. Contrary to the FJ extension to MVA presented in [12], we do not use the residence time equation for the FJ construct to obtain the queue length at the FJ. Rather, we use the residence times at each individual queue of the FJ. A detailed numerical example is given in Appendix B.2.
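Steps 1 through 6 translate into a short recursion over the customer population. The following is a minimal sketch of equations (8) to (11) for one FJ construct; the function name and the convention that the FJ branches occupy the last entries of the input lists are assumptions made for illustration.

```python
def fj_mva_single_class(num_customers, service, routing, num_fj_queues):
    """Single-class MVA with one heterogeneous FJ device, equations (8)-(11).

    service[i] : service demand of queue i; the last `num_fj_queues`
                 entries are the FJ branches (queues M+1..M+K)
    routing[i] : routing probability of queue i (1.0 for non-FJ queues)
    Returns (response_time, throughput, queue_lengths) at population n.
    """
    total = len(service)
    first_fj = total - num_fj_queues       # M, the number of non-FJ queues
    queue_len = [0.0] * total              # n_i(0) = 0
    response, throughput = 0.0, 0.0

    for n in range(1, num_customers + 1):
        # Equation (8): residence time of every queue, FJ branches included.
        residence = [routing[i] * service[i] * (1.0 + queue_len[i])
                     for i in range(total)]
        # Steps 3-4, equation (9): sorted FJ residence times weighted by 1/rank.
        fj_sorted = sorted(residence[first_fj:], reverse=True)
        fj_residence = sum(t / rank for rank, t in enumerate(fj_sorted, start=1))
        # Equation (10): Little's Law applied to the whole network.
        response = fj_residence + sum(residence[:first_fj])
        throughput = n / response
        # Equation (11): queue lengths from the individual residence times.
        queue_len = [throughput * t for t in residence]

    return response, throughput, queue_len


# Hypothetical network: two regular queues followed by a 2-branch FJ.
print(fj_mva_single_class(5, [1.0, 2.0, 3.0, 1.0], [1.0, 1.0, 1.0, 1.0], 2))
```

Note that the queue lengths of the FJ branches are updated from their own residence times, not from the combined FJ residence time, as the text above requires.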

The formulation above was presented for a closed QN with a single FJ construct. The same process applies directly when there are multiple FJ constructs in the QN.

4.2.2 Heterogeneous FJ Multiclass MVA

The algorithm presented before is for networks with a single customer class. This algorithm is very fast and its computation time grows linearly with the number of jobs and the number of queues. For multiple classes, the computational complexity of MVA grows exponentially with the number of customer classes and number of devices. Therefore, approximate MVA (AMVA) techniques have been used to reduce the computational complexity [31].

We consider a closed QN with $M$ non-FJ queues numbered $1, \ldots, M$ and a single FJ construct with $K$ queues. The queues of the FJ construct are numbered $M+1, \ldots, M+K$ in the QN. The notation $R'_r(\vec{N})$ stands for the average residence time at the FJ device for a customer population of $\vec{N}$ in the QN. Similarly to the single-class case, we define $\epsilon_{k,r}$ as the probability that class $r$ jobs visit queue $k$ of the FJ for $k = M+1, \ldots, M+K$, and to simplify the notation we define $\epsilon_{k,r} = 1$ for $k = 1, \ldots, M$.

The AMVA equations extended for the FJ case are given below. The AMVA throughput equation is

$$X_{0,r}(\vec{N}) = \frac{N_r}{Z_r + R'_r(\vec{N}) + \sum_{k=1}^{M} R'_{k,r}(\vec{N})}, \tag{12}$$

where $Z_r$ is class $r$ customers’ think time and $R'_r(\vec{N}) + \sum_{k=1}^{M} R'_{k,r}(\vec{N})$ is the average response time, $R_r(\vec{N})$, for class $r$ customers. The average number of class $r$ customers at queue $k$ is given by:

$$\bar{n}_{k,r}(\vec{N}) = X_{0,r}(\vec{N}) \times R'_{k,r}(\vec{N}). \tag{13}$$

The residence time equation becomes

$$R'_{k,r}(\vec{N}) = \epsilon_{k,r} \cdot s_{k,r} \left[ 1 + \sum_{t=1}^{R} \bar{n}_{k,t}(\vec{N} - \vec{1}_r) \right], \tag{14}$$

where $\vec{1}_r$ is a vector $(0, 0, \ldots, 1, 0, \ldots, 0)$ of zeros with a 1 in the $r$-th position.

The term $\bar{n}_{k,t}(\vec{N} - \vec{1}_r)$ is approximated as $(N_r - 1)/N_r \times \bar{n}_{k,t}(\vec{N})$ for $t = r$ and $\bar{n}_{k,t}(\vec{N})$ for $t \ne r$ using the Bard-Schweitzer approximation [32].

AMVA starts with the assumption that the number of customers is equally distributed between the devices in the network and stops when successive values of the queue length are sufficiently close. As with our single-class MVA approach, the residence times for the FJ queues are computed individually. In that computation, we multiply the service demands by the corresponding queue routing probability as we did for the single-class case. The values of the residence times in the FJ are sorted in descending order to compute the overall FJ residence time for each class as

$$R'_r(\vec{N}) = \sum_{k=M+1}^{M+K} \frac{1}{k-M} R'_{k,r}(\vec{N}), \tag{15}$$

s.t. $R'_{M+1,r}(\vec{N}) \ge R'_{M+2,r}(\vec{N}) \ge \cdots \ge R'_{M+K,r}(\vec{N})$. Then, the throughput is computed using the overall FJ residence time according to equation (12). Finally, the queue lengths are computed by using the residence times of the individual FJ queues and the system throughput. The approach for computing the performance metrics for the multiclass FJ using an extension to AMVA is described by the algorithm given in Appendix C.
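The iteration described by equations (12) to (15) can be sketched as follows. This is not the algorithm of Appendix C, which is not reproduced in this section; it is an illustrative fixed-point loop built from the equations above. The function name, the argument layout (FJ branches in the last rows), and the stopping rule (largest absolute change in any queue length below `tol`) are assumptions.

```python
def fj_amva(population, think, service, routing, num_fj_queues,
            tol=0.01, max_iter=10_000):
    """Bard-Schweitzer AMVA with one heterogeneous FJ device, eqs. (12)-(15).

    population[r] : number of class r customers, N_r
    think[r]      : think time Z_r of class r
    service[k][r] : service demand of class r at queue k; the last
                    `num_fj_queues` rows are the FJ branches
    routing[k][r] : visit probability (1.0 for non-FJ queues)
    Returns (response_times, throughputs, queue_lengths).
    """
    total, classes = len(service), len(population)
    first_fj = total - num_fj_queues
    # Start with each class spread evenly over all queues.
    n = [[population[r] / total for r in range(classes)] for _ in range(total)]
    response, thru = [0.0] * classes, [0.0] * classes

    for _ in range(max_iter):
        new_n = [[0.0] * classes for _ in range(total)]
        for r in range(classes):
            if population[r] == 0:
                continue
            shrink = (population[r] - 1) / population[r]
            # Equation (14) with the Bard-Schweitzer estimate of n(N - 1_r).
            residence = [
                routing[k][r] * service[k][r] * (1.0 + sum(
                    n[k][t] * (shrink if t == r else 1.0)
                    for t in range(classes)))
                for k in range(total)
            ]
            # Equation (15): sorted FJ residence times weighted by 1/rank.
            fj_sorted = sorted(residence[first_fj:], reverse=True)
            fj = sum(x / rank for rank, x in enumerate(fj_sorted, start=1))
            response[r] = fj + sum(residence[:first_fj])
            thru[r] = population[r] / (think[r] + response[r])   # eq. (12)
            for k in range(total):
                new_n[k][r] = thru[r] * residence[k]             # eq. (13)
        change = max(abs(new_n[k][r] - n[k][r])
                     for k in range(total) for r in range(classes))
        n = new_n
        if change < tol:
            break
    return response, thru, n


# Hypothetical two-class network: one regular queue and a 2-branch FJ.
service = [[1.0, 0.5], [2.0, 1.0], [1.0, 3.0]]
routing = [[1.0, 1.0], [1.0, 0.5], [1.0, 1.0]]
print(fj_amva([4, 3], [0.5, 0.5], service, routing, 2))
```

As in the single-class case, queue lengths are updated from the residence times of the individual FJ queues, while the throughput uses the combined FJ residence time.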

5 EVALUATION

To validate our approximation, we compared the analytic results with simulation results. Simulations for each configuration were run 30 times, with different and independent seeds in each run, and 95 percent confidence intervals were computed. The simulation was built using OMNeT++ 4.2 [33]. In each configuration we simulated $K = 4, 8, 16, 25, 50, 75, 100$ for different values of the highest utilization of the queues in the FJ: $\rho = 0.05, 0.1, 0.2, \ldots, 0.9, 0.95$ in the open case and for $N = 1, \ldots, 35$ in the closed case. We ran each experiment for 300 minutes with a 3-minute warm-up period. It should be noted that for larger values of $K$ and $\rho$, the simulation required significantly longer warm-up periods to attain statistical steady-state.

The service times for each configuration are given in Table 5 along with the queue visit probabilities for each configuration used in the dynamic case. The multiclass access probabilities are given in Table 3. The think time in all the closed QN experiments was fixed at 0.5 seconds and the AMVA algorithm tolerance was fixed at 0.01. The simulation results were compared to the approximate ones to compute the relative error. We first check if the approximation result is within the 95 percent confidence interval of the simulation. In this case, we assume that there is no error at the 95 percent confidence level. Otherwise, we compute the error as the percent difference between the approximation result and the closest bound of the 95 percent confidence interval. We ran experiments to verify the accuracy of the approximation as a function of the degree of heterogeneity of the service demands. Additionally, we ran experiments to assess the scalability of the approximation for a large number of queues and jobs. Finally, we ran experiments to compare our approximation to other approximations presented in the literature.
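The error measure described above can be written as a small helper. The sketch below assumes that the percent difference is taken relative to the closest confidence-interval bound; the text does not state the denominator, so that choice, like the function name, is an assumption.

```python
def relative_error_vs_ci(approx, sim_mean, half_width):
    """Percent error of `approx` against a simulation confidence interval.

    Returns 0.0 when the approximation lies inside the interval; otherwise
    the percent difference to the closest bound. Dividing by that bound is
    an assumption of this sketch.
    """
    lower, upper = sim_mean - half_width, sim_mean + half_width
    if lower <= approx <= upper:
        return 0.0
    closest = upper if approx > upper else lower
    return abs(approx - closest) / closest * 100.0


# Hypothetical values: approximation of 110 against a simulated 100 +/- 5.
print(relative_error_vs_ci(110.0, 100.0, 5.0))
```

An approximation that falls inside the interval is reported with zero error, which is why some points in the evaluation show no error at the 95 percent confidence level.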

5.1 Results

Fig. 3 shows the analytic results of the open model described in Section 4.1 compared to simulation results. The top plots show the quality of the approximation for the full fork case $(J = K)$. We can see that the approximation tracks the simulated response time and provides adequate estimates for the response time. The relative error between the approximate response time values and those obtained by simulation increases slightly at high utilization. However, the relative error in the worst case scenario at $\rho = 0.95$ is less than 22 percent. The results for the dynamic fork model are shown in the middle plots. One can see that the approximation matches the simulated model and provides better results, in general, than the full fork case, with a maximum relative error of 6 percent; for the most part, the approximation falls within the simulation’s 95 percent confidence interval. Moreover, the dynamic fork response time is lower than that of a similar full fork FJ. This can be explained by the fact that in the dynamic fork case the number of parallel siblings (i.e., degree of parallelism) is smaller than that of the full fork case. Consequently, the dynamic FJ case has less synchronization time at the join point, resulting in a smaller response time and a smaller relative error. One notable difference between the two cases (i.e., full fork and dynamic fork) is that the approximation for the full fork case always provides more pessimistic estimates, while in the case of the dynamic fork the estimate is more optimistic, yet closer to the simulated response time. The multiclass case is shown in the bottom plots, where jobs of classes 1 and 2 are assigned to queues according to the routing values in Table 3 (dynamic fork) and jobs of class 3 are routed based on the policy shown in the same table (configurable fork). The multiclass approximation provides accuracy similar to that of the single class case.

Fig. 4 compares the response time analytic results for the closed model described in Section 4.2 with the simulation results. All the graphs for the closed model are a function of the number of jobs in the system (i.e., the customer population). The top plots show the quality of the approximation in the full fork case. We can see that the approximation tracks the simulated results and provides accurate estimates for the response time. Similarly to the open case, the approximation tends to be more pessimistic. Furthermore, the relative error between the approximate response time and the simulation results increases until the customer population $(N)$ is equal to five and decreases with $N$ after that.

The dynamic fork model results shown in the middle plots exhibit similar trends to the open case and provide more accurate results than the full fork case, with a relative error of about 5 percent. Results for a multiclass case are shown in the bottom plots, in which the branching policy is given in Table 3. The AMVA extension presented in Section 4.2.2 provides an accurate approximation and converges at the same rate as the AMVA algorithm.

Fig. 5 compares the throughput for a single class closed QN obtained through simulation and by the approximation. The top plots are for the full fork case and the bottom ones for the dynamic case. As can be seen, in both cases, the error decreases as the customer population increases. The graphs also show that the error in the throughput is much smaller in the dynamic case than in the full fork one.

TABLE 2 Simulation Parameters

TABLE 3 Multiclass Parameters

Fig. 6 shows the relative percentage error for an open FJ with two queues as a function of the service time heterogeneity, defined as $(s_1 - s_2)/s_1$, for different values of the maximum utilization (30 percent, 45 percent, 60 percent, and 75 percent). We can observe that the relative percentage error initially increases with the heterogeneity, reaches a maximum value for some heterogeneity value, and then decreases as the heterogeneity increases. For open QNs, the maximum error occurs at different points for different values of the maximum utilization and tends to occur at lower heterogeneity values for higher utilization and at higher heterogeneity values for lower utilization. Moreover, the maximum error increases with the utilization. For example, for a maximum utilization of 75 percent, the maximum value of the relative percentage error is close to 12 percent and it occurs for a heterogeneity value equal to 0.2. However, for a utilization of 30 percent, the maximum value of the relative percentage error is 8 percent and it occurs for a heterogeneity value of 0.7. We observed a similar behavior for the closed QN case, albeit with a smaller relative error, as shown in the figure.
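As a small worked example of the heterogeneity measure (s1 - s2)/s1 for a two-queue FJ, the sketch below converts between service times and heterogeneity. It assumes s1 is the larger of the two service times, so that the measure lies in [0, 1]; the function names and values are illustrative.

```python
def heterogeneity(s1, s2):
    """Service time heterogeneity (s1 - s2) / s1, assuming s1 >= s2 >= 0."""
    if s1 <= 0 or s2 < 0 or s2 > s1:
        raise ValueError("expected s1 > 0 and 0 <= s2 <= s1")
    return (s1 - s2) / s1


def second_service_time(s1, h):
    """Service time s2 of the faster queue for a given heterogeneity h in [0, 1]."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("heterogeneity must lie in [0, 1]")
    return s1 * (1.0 - h)


# Example: with s1 = 2.0 s, a heterogeneity of 0.25 corresponds to s2 = 1.5 s.
s2 = second_service_time(2.0, 0.25)
```

A heterogeneity of 0 corresponds to two identical queues, and values close to 1 correspond to one queue whose service time is negligible compared with the other.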

Additional experiments were run to assess the scalability of the approximations (see Appendix D.1). The experiments were conducted for closed and open QNs with a large number of queues and customers. For open QNs, the results show that the relative error increases for higher values of the utilization $\rho$. Also, the relative error is relatively invariant with respect to the number $K$ of queues in the FJ. However, for $K > 80$ the actual FJ response time grows at a slower rate than that of the approximation, which in turn results in a higher error rate for $K > 80$. This is due to the fact that adding queues with relatively lower service time to the FJ has a minute impact on the synchronization time; hence, the actual FJ response time will grow slower than the approximation as a function of the number of queues $K$. In closed QNs, the error reaches a peak for a small number of jobs and then decreases as the number of jobs increases. It can also be seen that the error increases with the number of queues. However, that difference is more pronounced at lower loads. This trend is due to the fact that a higher number of jobs results in a higher FJ throughput. More specifically, the FJ queues become more utilized and it becomes more likely that there is always a job in the queue. This in turn results in less variability in a job’s service time and therefore results in a smaller error for a large number of jobs.

Fig. 3. Average response time (in seconds) vs. utilization for open QNs.

Fig. 4. Average response time (in seconds) vs. number of jobs for closed QNs.

Finally, our experiments show that the approximation presented in this paper compares favorably to approximations presented in the literature while providing solutions for more cases. In the cases where other approaches provide higher accuracy, the cost of evaluating an FJ QN model using these approaches significantly exceeds the cost of evaluating the same model using the approximations presented in this paper. The results of our comparison are given in Appendix D.2 (see supplemental file, available online, for details).

6 CONCLUSION

This paper discussed existing analytical solutions for fork-join queuing networks, presented an approximation based on harmonic numbers, and assessed it through simulation. The approximation presented in this paper provides a more general expression than the ones presented in prior work by others. The approximations presented here cover the cases of open, closed, single class, and multiclass queuing networks. Furthermore, the approximation works well with more general fork and join cases such as 1) heterogeneous service demands and 2) dynamic and configurable fork. The relative error of the approximation is between 5 and 15 percent for most of the evaluated cases. As future work, one may want to consider non-exponential service time distributions.

REFERENCES

[1] J. Kephart and D. Chess, “The Vision of Autonomic Computing,” Computer, vol. 36, no. 1, pp. 41-50, Jan. 2003.

[2] M.C. Huebscher and J.A. McCann, “A Survey of Autonomic Computing Degrees, Models, Applications,” ACM Computing Surveys, vol. 40, no. 3, pp. 7:1-7:28, Aug. 2008.

[3] D. Menasce, M. Bennani, and H. Ruan, “On the Use of Online Analytic Performance Models in Self-Managing and Self-Organizing Computer Systems,” in Self-Star Properties in Complex Information Systems. Berlin, Germany: Springer-Verlag, 2005, pp. 128-142.

[4] D. Menasce, “Response-Time Analysis of Composite Web Services,” IEEE Internet Computing, vol. 8, no. 1, pp. 90-92, Jan./Feb. 2004.

[5] F. Alomari and D. Menasce, “An Autonomic Framework for Integrating Security and Quality of Service Support in Databases,” in Proc. IEEE Sixth Int’l Conf. Software Security Reliability (SERE 2012), 2012, pp. 51-60.

[6] P.M. Chen, E.K. Lee, G.A. Gibson, R.H. Katz, and D.A. Patterson, “RAID: High-Performance, Reliable Secondary Storage,” ACM Computing Surveys, vol. 26, no. 2, pp. 145-185, Jun. 1994.

[7] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.

[8] M. Hill and M. Marty, “Amdahl’s Law in the Multicore Era,” Computer, vol. 41, no. 7, pp. 33-38, July 2008.

[9] S. Al-Momen and D. Menasce, “The Design of an Autonomic Controller for Self-Managed Emergency Departments,” in Proc. HEALTHINF, Int’l Conf. Health Informatics, Feb. 2012, pp. 174-182.

Fig. 5. Throughput (jobs per second) vs. number of jobs for a single class closed model for 4, 8, and 16 queues.

Fig. 6. Relative error vs. service time heterogeneity.


[10] D. Menasce, “A Methodology for Combining GSPNs and QNs,” in Proc. Computer Measurement Group Conf., 2011, pp. 774-783.

[11] E. Varki, “Response Time Analysis of Parallel Computer and Storage Systems,” IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 11, pp. 1146-1161, Nov. 2001.

[12] E. Varki, “Mean Value Technique for Closed Fork-Join Networks,” SIGMETRICS Perform. Eval. Rev., vol. 27, no. 1, pp. 103-112, June 1999.

[13] E. Varki, A. Merchant, and H. Chen, The M/M/1 Fork-Join Queue With Variable Sub-Tasks, 2008, Unpublished. [Online]. Available: http://www.cs.unh.edu/varki/publication/open.pdf.

[14] S. Varma and A.M. Makowski, “Interpolation Approximations for Symmetric Fork-Join Queues,” Perform. Eval., vol. 20, no. 1-3, pp. 245-265, May 1994.

[15] R. Nelson and A. Tantawi, “Approximate Analysis of Fork/Join Synchronization in Parallel Queues,” IEEE Trans. Computers, vol. 37, no. 6, pp. 739-743, June 1988.

[16] R.J. Chen, “A Hybrid Solution of Fork/Join Synchronization in Parallel Queues,” IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 8, pp. 829-845, Aug. 2001.

[17] A. Lebrecht and W. Knottenbelt, “Response Time Approximations in Fork-Join Queues,” in Proc. 23rd UK Performance Engineering Workshop (UKPEW), 2007, pp. 12-19.

[18] P. Harrison and S. Zertal, “Queueing Models With Maxima of Service Times,” in Computer Performance Evaluation. Modelling Techniques and Tools, vol. 2794, P. Kemper and W. Sanders, Eds. Berlin, Germany: Springer-Verlag, 2003, pp. 152-168.

[19] S. Balsamo, L. Donatiello, and N. Van Dijk, “Bound Performance Models of Heterogeneous Parallel Processing Systems,” IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 10, pp. 1041-1056, Oct. 1998.

[20] F. Baccelli and A. Makowski, “Simple Computable Bounds for the Fork-Join Queue,” in Proc. Johns Hopkins Conf. Information Science, 1985, pp. 436-441.

[21] F. Baccelli, “Two Parallel Queues Created by Arrivals With Two Demands: The M/G/2 Symmetrical Case,” INRIA, Paris, France, INRIA Tech. Rep. No. 426, 1985.

[22] F. Baccelli and A. Makowski, “Queueing Models for Systems With Synchronization Constraints,” Proc. IEEE, vol. 77, no. 1, pp. 138-161, Jan. 1989.

[23] M. Reiser and S.S. Lavenberg, “Mean-Value Analysis of Closed Multichain Queuing Networks,” J. ACM, vol. 27, no. 2, pp. 313-322, Apr. 1980.

[24] G. Casale, R. Muntz, and G. Serazzi, “Geometric Bounds: A Non-iterative Analysis Technique for Closed Queueing Networks,” IEEE Trans. Computers, vol. 57, no. 6, pp. 780-794, June 2008.

[25] P. Heidelberger and K. Trivedi, “Analytic Queueing Models for Programs With Internal Concurrency,” IEEE Trans. Computers, vol. C-32, no. 1, pp. 73-82, Jan. 1983.

[26] F. Baskett, K. Chandy, R. Muntz, and F. Palacios-Gomez, “Open, Closed and Mixed Networks of Queues With Different Classes of Customers,” J. ACM, vol. 22, no. 2, pp. 248-260, Apr. 1975.

[27] A. Kumar and R. Shorey, “Performance Analysis and Scheduling of Stochastic Fork-Join Jobs in a Multicomputer System,” IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 10, pp. 1147-1164, Oct. 1993.

[28] K.S. Trivedi, Probability and Statistics With Reliability, Queuing and Computer Science Applications, 2nd ed., Chichester, U.K.: Wiley, 2002.

[29] L. Kleinrock, Queueing Systems. Volume 1: Theory. Hoboken, NJ, USA: Wiley, 1975.

[30] D. Menasce, V. Almeida, and L. Dowdy, Performance by Design: Computer Capacity Planning by Example. Upper Saddle River, NJ, USA: Prentice-Hall, 2004.

[31] K. Chandy and C. Sauer, “Approximate Methods for Analyzing Queueing Network Models of Computing Systems,” ACM Computing Surveys, vol. 10, no. 3, pp. 281-317, Sept. 1978.

[32] P. Schweitzer, “Approximate Analysis of Multiclass Closed Network of Queues,” in Proc. Int’l Conf. Stochastic Control Optim., 1979, pp. 25-29.

[33] Omnet++ Discrete Event Simulation System (2008). [Online]. Available: www.omnetpp.org.

Firas Alomari received the BSc degree in computer science from Yarmouk University, Jordan, and the MSc degree in information systems from George Mason University, USA. He received the PhD degree in Information Technology at the Volgenau School of Engineering at George Mason University, in 2013. His areas of interest include information security and assurance, analysis of computer system performance, and autonomic computing.

Daniel A. Menasce is a University Professor of Computer Science and was the Senior Associate Dean from 2005 to 2012 of the Volgenau School of Engineering at George Mason University, Fairfax, VA, USA. He received a PhD from the University of California at Los Angeles, USA, in 1978 and MS in CS and BSEE degrees from the Pontifical Catholic University of Rio de Janeiro (PUC-Rio) in 1975 and 1974, respectively. Menasce is a Fellow of the ACM, a Senior Member of the IEEE, and a recipient of the 2001 A.A. Michelson Award from the Computer Measurement Group. He has published over 215 papers in refereed journals and conference proceedings and five books on capacity planning, performance modeling, and web and e-commerce performance. His research interests include autonomic computing, analytic modeling and analysis of computer systems, and software performance engineering.
