
GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks

Ram Srivatsa Kannan, University of Michigan, Ann Arbor, [email protected]
Lavanya Subramanian∗, Facebook, [email protected]
Ashwin Raju, University of Texas at Arlington, [email protected]
Jeongseob Ahn†, Ajou University, [email protected]
Jason Mars, University of Michigan, Ann Arbor, [email protected]
Lingjia Tang, University of Michigan, Ann Arbor, [email protected]

Abstract
The microservice architecture has dramatically reduced user effort in adopting and maintaining servers by providing a catalog of functions as services that can be used as building blocks to construct applications. This has enabled datacenter operators to look at managing datacenters hosting microservices quite differently from traditional infrastructures. Such a paradigm shift calls for a need to rethink the resource management strategies employed in such execution environments. We observe that the visibility enabled by a microservices execution framework can be exploited to achieve high throughput and resource utilization while still meeting Service Level Agreements, especially in multi-tenant execution scenarios.

In this study, we present GrandSLAm, a microservice execution framework that improves the utilization of datacenters hosting microservices. GrandSLAm estimates the completion time of requests propagating through the individual microservice stages within an application. It then leverages this estimate to drive a runtime system that dynamically batches and reorders requests at each microservice such that individual jobs meet their respective target latency while achieving high throughput. GrandSLAm significantly increases throughput, by up to 3× compared to our baseline, without violating SLAs for a wide range of real-world AI and ML applications.

CCS Concepts • Software and its engineering → Software as a service orchestration system;

∗This work was done while the author worked at Intel Labs.
†Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
EuroSys '19, March 25–28, 2019, Dresden, Germany
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6281-8/19/03...$15.00
https://doi.org/10.1145/3302424.3303958

Keywords Microservice, Systems and Machine Learning

ACM Reference Format:
Ram Srivatsa Kannan, Lavanya Subramanian, Ashwin Raju, Jeongseob Ahn, Jason Mars, and Lingjia Tang. 2019. GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks. In Proceedings of Fourteenth EuroSys Conference 2019 (EuroSys '19). ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3302424.3303958

1 Introduction
The microservice architecture along with cloud computing is dramatically changing the landscape of software development. A key distinguishing aspect of the microservice architecture is the availability of pre-existing, well-defined and implemented software services by cloud providers. These microservices can be leveraged by the developer to construct their applications without perturbing the underlying hardware or software requirements. The user applications can, therefore, be viewed as an amalgamation of microservices. The microservice design paradigm is widely being utilized by many cloud service providers driving technologies like Serverless Computing [3, 5, 13, 14, 19, 20].

Viewing an application as a series of microservices is helpful especially in the context of datacenters where the applications are known ahead of time. This is in stark contrast to the traditional approach where the application is viewed as one monolithic unit and instead, lends a naturally segmentable structure and composition to applications. Such behavior is clearly visible for applications constructed using artificial intelligence and machine learning (AI and ML) services, an important class of datacenter applications which has been leveraging the microservice execution framework. As a result, it opens up new research questions especially in the space of multi-tenant execution where multiple jobs, applications or tenants share common microservices.

Multi-tenant execution has been explored actively in the context of traditional datacenters and cloud computing frameworks towards improving resource utilization [10, 31, 41, 48]. Prior studies have proposed to co-locate high priority latency-sensitive applications with other low priority batch applications [31, 48].


Figure 1. Sharing the two common microservices between the Image Querying and Intelligent Personal Assistant applications: (a) Image Querying: SLA is not violated (per-stage latency vs. requests served); (b) the two application pipelines (image recognition and speech recognition front ends) sharing the NLU and Question Answering microservices; (c) IPA: SLA is violated.

However, the multi-tenant execution in a microservice based computing framework would operate on a fundamentally different set of considerations and assumptions, since resource sharing can now be viewed at a microservice granularity rather than at an entire application granularity.

Figure 1b illustrates an example scenario in which an end-to-end Intelligent Personal Assistant (IPA) application shares the Natural Language Understanding (NLU) and Question Answering (QA) microservices with an image based querying application. Each of these applications is constructed as an amalgamation of different microservices (or stages). In such a scenario, the execution load in these particular microservices increases, thereby causing the latency of query execution in stages 2 and 3 to increase. This increase in latency at specific stages affects the end-to-end latency of the IPA application, thereby violating service level agreements (SLAs). This phenomenon is illustrated by Figure 1c and Figure 1a. The x-axis represents the number of requests served while the y-axis denotes latency. Horizontal dotted lines separate individual stages. As can be seen, the SLA violation for the image querying application in Figure 1a is small, whereas the IPA application suffers heavily from SLA violation. However, our understanding of the resource contention need not stop at such an application granularity, unlike traditional private data centers. It can rather be broken down into contention at the microservice granularity, which makes resource contention management a more tractable problem.

This fundamentally different characteristic of microservice environments motivates us to rethink the design of runtime systems that drive multi-tenancy in microservice execution frameworks. Specifically, in virtualized datacenters, consolidation of multiple latency critical applications is limited, as such scenarios can be performance intrusive. In particular, the tail latency of these latency critical applications could increase significantly due to the inter-application interference from sharing the hardware resources [31, 32, 48, 51]. Even in a private datacenter, there is limited visibility into application specific behavior and SLAs, which makes it hard even to determine the existence of such performance intrusion [27]. As a result, cloud service providers would not be able to meet SLAs in such execution scenarios that co-locate multiple latency critical applications. In stark contrast, the execution flow of requests through individual microservices is much more transparent.

We observe that this visibility creates a new opportunity in a microservice-based execution framework and can enable high throughput from consolidating the execution of multiple latency critical jobs, while still employing fine-grained task management to prevent SLA violations. In this context, satisfying end-to-end SLAs merely becomes a function of meeting disaggregated partial SLAs at each microservice stage through which requests belonging to individual jobs propagate. However, focusing on each microservice stage's SLAs standalone misses a key opportunity, since we observe that there is significant variation in the request level execution slack among individual requests of multiple jobs. This stems from the variability that exists with respect to user specific SLAs, which we seek to exploit.

In this study, we propose GrandSLAm, a holistic runtime framework that enables consolidated execution of requests belonging to multiple jobs in a microservice-based computing framework. GrandSLAm does so by providing a prediction based on identifying safe consolidation to deliver satisfactory SLA (latency) while maximizing throughput simultaneously. GrandSLAm exploits the microservice execution framework and the visibility it provides, especially for AI and ML applications, to build a model that can estimate the completion time of requests at different stages of a job with high accuracy. It then leverages the prediction model to estimate per-stage SLAs using which it (1) ensures end-to-end job latency by reordering requests to prioritize those requests with low computational slack, and (2) batches multiple requests to the maximum extent possible to achieve high throughput under the user specified latency constraints. It is important to note that employing each of these techniques standalone does not yield SLA enforcement. An informed combination


of request re-ordering with a view of end-to-end latency slack and batching is what yields effective SLA enforcement, as we demonstrate later in the paper. Specifically, this paper makes the following contributions:

• Analysis of microservice execution scenarios. Our investigation observes the key differences between traditional and microservice-based computing platforms – primarily in the context of visibility into the underlying microservices that provide exposure to application specific SLA metrics.

• Accurate estimation of completion time at individual microservice stages. We build a model that estimates the completion time of individual requests at the different microservice stages and hence, the overall time of completion. We have demonstrated high accuracy in estimating completion times, especially for AI and ML microservices.

• Guaranteeing end-to-end SLAs by exploiting stage level SLAs. By utilizing the completion time predictions from the model, we derive individual stage SLAs for each microservice/stage. We then combine this per-stage SLA requirement with our understanding of end-to-end latency and slack. This enables an efficient request scheduling mechanism towards the end goal of maximizing server throughput without violating the end-to-end SLA.

Our evaluations on a real system deployment of a 6 node CPU cluster coupled with graphics processing accelerators demonstrate GrandSLAm's capability to increase the throughput of a datacenter by up to 3× over state-of-the-art request execution schemes for a broad range of real-world applications. We perform scale-out studies as well that demonstrate increased throughput while meeting SLAs.

2 Background
In this section, we first describe the software architecture of a typical microservice and its execution framework. We then describe unique opportunities a microservice framework presents as compared to a traditional datacenter, for an efficient redesign.

2.1 Microservices Software Architecture
The microservice architecture is gaining popularity among software engineers, since it enables easier application development and deployment while not having to worry about the underlying hardware and software requirements. Microservices resemble well-defined libraries that perform specific functions, which can be exposed to consumers (i.e., application developers) through simple APIs. With the microservice paradigm approach, instead of writing an application from scratch, software engineers leverage these microservices as building blocks to construct end-to-end applications. The end-to-end applications consist of a chain of microservices, many of which are furnished by the datacenter service providers. Microservice based software architectures speed up deployment cycles, foster application-level innovation by providing a rich set of primitives, and improve maintainability and scalability, for application classes where the same building blocks tend to be used in many application contexts [13].
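For illustration, consuming such a building-block microservice typically amounts to a single API call from the application code. The sketch below is a hypothetical Python example; the endpoint URL and response format are placeholders, not any specific provider's API.

import json
import urllib.request

def classify_image(image_bytes):
    # POST an image to a hypothetical image-classification microservice endpoint.
    req = urllib.request.Request(
        "http://imc.example.internal/v1/classify",      # placeholder URL
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"})
    with urllib.request.urlopen(req) as resp:
        # e.g., {"label": "...", "probability": 0.93} in this illustrative format
        return json.loads(resp.read())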

Traditional, multi-tier architectures compartmentalize application stages into different tiers based on the nature of their services. In most cases, application stages belong to either the presentation layer, which focuses on the user interface, the application processing layer, in which the actual application execution occurs, or the data management layer, which stores data and metadata belonging to the application. This is fundamentally different from the microservice architecture. Microservices, at each stage in a multi-stage application, perform part of the processing in a large application. In other words, one can imagine a chain of microservices to constitute the application processing layer.

2.2 Microservices Use Cases
With the advent of Serverless Computing design, the microservices paradigm is being viewed as a convenient solution for building and deploying applications. Several cloud service providers like Amazon (AWS Lambda [3]) and IBM (IBM Bluemix [29]) utilize the microservice paradigm to offer services to their clients. Typically, microservices hosted by cloud service providers provide the necessary functionality for each execution stage in every user's multi-stage application. In this context, a motivating class of applications that would benefit from the microservice paradigm is artificial intelligence (AI) and machine learning (ML) applications [39]. Many of the stages present in the execution pipeline of AI applications are common across other AI applications [13]. As shown in the example in Figure 1b, a speech-input based query execution application is constructed as an amalgamation of microservices that perform speech recognition, natural language understanding, and question answering. Similarly, an image-input based query system/application also uses several of these same microservices as its building blocks.

FaaS (Function-as-a-Service) or Serverless based cloud services contain APIs to define the workflow of a multi-stage application as a series of steps representing a Directed Acyclic Graph (DAG). For instance, some of the workflow types (DAGs) provided by Amazon, as defined by AWS Step Functions [4], are shown in Figure 2. Elgamal et al. discuss this in detail [11]. Figure 2(a) shows the simplest case where the DAG is sequential. From our study, we found that several real-world applications (Table 3) and customers utilizing AWS Lambda possess workflow DAGs that were sequential. Figure 2(b) shows a workflow DAG with parallel steps in which multiple functions are executed in parallel, and their outputs are aggregated before the next function starts.


Figure 2. Types of DAGs used in applications based on microservices: (a) sequential DAGs, (b) parallel DAGs, and (c) branching DAGs.

The last type of workflow DAG possesses branching steps, shown in Figure 2(c). Such workflows typically have a branch node that contains a condition deciding in which direction the branch execution proceeds. In our paper, we focus only on sequential workflows as shown in Figure 2(a). In Section 5, we discuss the limitations of our study and possible extensions for the more complex workflows.

2.3 Challenges
Although the usage and deployment of microservices are fundamentally different from traditional datacenter applications, the low resource utilization problem persists even in datacenters housing microservices [15, 16, 40]. In order to curb this, datacenter service providers could potentially allow sharing of common microservices across multiple jobs as shown in Figure 1b. However, these classes of applications, being user-facing, are required to meet strict Service Level Agreement (SLA) guarantees. Hence, sharing microservices could create contention, resulting in the violation of the end-to-end latency of individual user-facing applications, thereby violating SLAs. This is analogous to traditional datacenters where there is a tendency to actively avoid co-locating multiple user-facing applications, leading to over-provisioning of the underlying resources when optimizing for peak performance [6].

2.4 Opportunities
However, microservice execution environments fundamentally change several operating assumptions present in traditional datacenters, which enables much more efficient multi-tenancy while still achieving SLAs. First, the microservice execution framework enables a new degree of visibility into an application's structure and behavior, since an application is comprised of microservice building blocks. This is different from traditional datacenter applications where the application is viewed as one monolithic unit. Hence, in such a traditional datacenter, it becomes very challenging to even identify, let alone prevent, interference between co-running applications [27, 31, 48].

Figure 3. Increase in latency, throughput, and input size as the sharing degree increases: (a) latency (ms) vs. sharing degree, (b) throughput (QPS) vs. sharing degree, (c) latency (ms) vs. input size.

Second, the granularity of multi-tenancy and consolidation in a microservice framework is distinctively different from traditional datacenter systems. Application consolidation in microservice execution platforms is performed at a fine granularity, by batching multiple requests belonging to different tenants to the same microservice [16]. On the other hand, for traditional datacenter applications, multi-tenancy is handled at a very coarse granularity where entire applications belonging to different users are co-scheduled [27, 31, 44, 45, 48]. These observations clearly point to the need for a paradigm shift in the design of runtime systems that can enable and drive multi-tenant execution where different jobs share common microservices in a microservice design framework.

Towards rethinking runtime systems that drive multi-tenancy in microservice design frameworks, we seek to identify and exploit key new opportunities that exist in this context. First, the ability to accurately predict the time each request spends at a microservice even prior to its execution opens up a key opportunity towards performing safe consolidations without violating SLAs. This, when exploited judiciously, could enable the sharing of microservices that are employed across multiple jobs, achieving high throughput while still meeting SLAs. Second, the variability existing in SLAs when multiple latency sensitive jobs are consolidated generates a lot of request level execution slack that can be distributed across other requests. In other words, consolidated execution is avoided for requests with low execution slack and vice versa. These scenarios create new opportunities in the microservice execution framework to achieve high throughput by consolidating the execution of multiple latency sensitive jobs, while still achieving user-specific SLAs, through fine-grained task management.

3 Analysis of Microservices
This section investigates the performance characteristics of emerging AI and ML services utilizing pipelined microservices. Using that, we develop a methodology that can accurately estimate the completion time for any given request at each microservice stage prior to its execution. This information becomes beneficial towards safely enabling fine-grained request consolidation when microservices are shared among different applications under varying latency constraints.


3.1 Performance of Microservices
In this section, we analyze three critical factors that determine the execution time of a request at each microservice stage: (1) sharing degree, (2) input size, and (3) queuing delay. For this analysis, we select a microservice that performs image classification (IMC), which is part of the catalog of microservices offered by AWS Step Functions [39].

(1) Sharing degree. The sharing degree defines the granularity at which requests belonging to different jobs (or applications) are batched together for execution. A sharing degree of one means that the microservice processes only one request at a time. This situation arises when a microservice instance executing a job restricts sharing its resources simultaneously with requests belonging to other jobs. Requests under this scheme can achieve low latency at the cost of low resource utilization. On the other hand, a sharing degree of thirty indicates that the microservice merges thirty requests into a single batch to process requests belonging to different jobs simultaneously. Increasing the sharing degree has been demonstrated to increase the occupancy of the underlying computing platform (especially for GPUs) [16]. However, it has a direct impact on the latency of the executing requests, as the first request arriving at the microservice would end up waiting until the arrival of the 30th request when the sharing degree is 30.

Figures 3a and 3b illustrate the impact of the sharing degree on latency and throughput. The inputs that we have used for studying this effect are a set of images with dimension 128x128. The horizontal axes of both Figures 3a and 3b represent the sharing degree. The vertical axes of Figures 3a and 3b represent latency in milliseconds and throughput in requests per second, respectively. From Figures 3a and 3b, we can clearly see that increasing the sharing degree improves throughput. However, it affects the latency of execution of individual requests as well.

(2) Input size. Second, we observe changes in the execution time of a request by varying its input size. As the input size increases, additional amounts of computation would be performed by the microservices. Hence, input sizes play a key role in determining the execution time of requests. To study this using the image classification (IMC) microservice, we obtain request execution times for different input sizes of images from 64x64 to 256x256. The sharing degree is kept constant in this experiment. Figure 3c illustrates the findings of our experiment. We observe that as input sizes increase, the execution time of requests also increases. We also observed similar performance trends for other microservice types.

(3) Queuing delay. Queuing delay is the last factor that affects the execution time of requests. It is experienced by requests waiting on previously dispatched requests to be completed. From our analysis, we observe that there is a linear relationship between the execution time of a request and its sharing degree and input size, respectively.

Figure 4. Error (%) in predicting ETC for small (64x64), medium (128x128), and large (256x256) input sizes as the sharing degree increases (x-axis), shown for the IMC, FACED, FACER, and HS microservices.

Queuing delay can be easily calculated at runtime using the execution sequence of requests along with the estimated execution times of a request's preceding requests.

From these observations, we conclude that there is an opportunity to build a highly accurate performance model for each microservice that our execution framework can leverage to enable sharing of resources across jobs. Further, we also provide capabilities that can control the magnitude of sharing at every microservice instance. These attributes can be utilized simultaneously for preventing SLA violations due to microservice sharing while optimizing for datacenter throughput.

3.2 Execution Time Estimation Model
Accurately estimating the execution time of a request at each microservice stage is crucial as it drives the entire microservice execution framework. Towards achieving this, we build a model that calculates the estimated time of completion (ETC) for a request at each of its microservice stages. The ETC of a request is a function of its compute time on the microservice and its queuing time (time spent waiting for the completion of requests that are scheduled to be executed before the current request).

ETC = T_queuing + T_compute    (1)

We use a linear regression model to determine the T_compute of a request, for each microservice type and input size, as a function of the sharing degree.

Y = a + bX (2)

where X is the sharing degree (batch size), which is the independent variable, and Y is the dependent variable that we try to predict, the completion time of a request; b and a are the slope and intercept of the regression equation. T_queuing is determined as the sum of the execution times of the previous requests that need to be completed before the current request can be executed on the microservice, which can directly be determined at runtime. Each model obtained is specific to a single input size. Hence, we design a system where we have a model for every specific input size that can predict ETC for varying batch sizes and queuing delays.
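As an illustration of how such per-microservice, per-input-size predictors can be realized, the following minimal Python sketch fits Equation 2 and evaluates Equation 1; the profiling numbers and helper names are illustrative assumptions, not GrandSLAm's actual implementation.

import numpy as np

def fit_compute_model(batch_sizes, latencies_ms):
    # Fit Y = a + b*X (Equation 2) for one (microservice, input size) pair.
    b, a = np.polyfit(batch_sizes, latencies_ms, deg=1)  # slope, intercept
    return a, b

def estimate_etc(model, batch_size, pending_batch_sizes):
    # ETC = T_queuing + T_compute (Equation 1).
    a, b = model
    t_compute = a + b * batch_size
    # Queuing delay: sum of the estimated compute times of the batches
    # dispatched ahead of this request (known at runtime).
    t_queuing = sum(a + b * n for n in pending_batch_sizes)
    return t_queuing + t_compute

# Illustrative profile of IMC with 128x128 inputs (numbers are made up).
imc_128 = fit_compute_model([1, 2, 4, 8, 16, 32], [55, 62, 78, 110, 175, 300])
print(estimate_etc(imc_128, batch_size=8, pending_batch_sizes=[4, 4]))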


Data normalization. A commonly followed approach in machine learning is to normalize data before performing linear regression so as to achieve high accuracy. Towards this objective, we rescale the raw input data in both dimensions to the range [0, 1], normalizing with respect to the min and max, as in the equation below.

x' = (x − min(x)) / (max(x) − min(x))    (3)

We trained our model for sharing degrees following powers of two to create a predictor corresponding to every microservice and input size pair. We cross-validated our trained model by subsequently creating test beds and comparing the actual values with the estimated time of completion from our model. Figure 4 shows the error in predicting the completion time, given a sharing degree, for different input sizes. For the image based microservices, the input sizes utilized are images of dimensions 64, 128 and 256 for small, medium and large inputs, respectively. These are standardized inputs from publicly available datasets whose details are enumerated in Table 1. As can be clearly observed from the graph, the error in predicting the completion time from our model is around 4% on average. This remains consistent across other microservices as well, whose plots are omitted from the figure to avoid clutter.

The estimated time of completion (ETC) obtained from our regression models is used to drive decisions on how to distribute requests belonging to different users across microservice instances. However, satisfying application-specific SLAs becomes mandatory under such circumstances. For this purpose, we seek to exploit the variability in the SLAs of individual requests and the resulting slack towards building our request scheduling policy. Later, in Sections 4.2 and 4.3, we describe in detail the methodology by which we compute and utilize slack to undertake optimal request distribution policies.

The ETC prediction model that we have developed is specific to microservice types whose execution times can be predicted prior to execution. Based on our observations, applications belonging to the AI and ML space exhibit such execution characteristics and fit well towards being part of microservice execution frameworks hosted on Serverless Computing infrastructures. However, there exist certain microservice types whose execution times are highly unpredictable. For instance, an SQL range query's execution time and output depend both on the input query type and on the data which it is querying. Such microservice types cannot be handled by our model. We discuss this in much more detail in Section 5.

4 GrandSLAm Design
This section presents the design and architecture of GrandSLAm, our proposed runtime system for moderating request distribution at microservice execution frameworks.

Figure 5. Extracting used microservices from the given jobs in the microservice cluster: (1) submitting jobs (e.g., Job A uses IMC, NLU, QA, and TTS; Job B uses ASR, NLU, and QA) and (2) building each job's microservice DAG.

The goal of GrandSLAm is to enable high throughput at microservice instances without violating application specific SLAs. GrandSLAm leverages the execution time prediction models to estimate request completion times. Along with this, GrandSLAm utilizes application/job specific SLAs to determine the execution slack of different jobs' requests at each microservice stage. We then exploit this slack information to efficiently share microservices amongst users, maximizing throughput while meeting individual users' Service Level Agreements (SLAs).

4.1 Building Microservice Directed Acyclic Graph
The first step in GrandSLAm's execution flow is to identify the pipeline of microservices present in each job. For this purpose, our system takes the user's job written in a high-level language such as Python, Scala, etc. as an input (1 in Figure 5) and converts it into a directed acyclic graph (DAG) of microservices (2 in Figure 5). Here, each vertex represents a microservice and each edge represents communication between two microservices (e.g., an RPC call). Such DAG based execution models have been widely adopted in distributed systems frameworks like Apache Spark [50], Apache Storm [21], TensorFlow [1], etc. Building a microservice DAG is an offline step that needs to be performed once before GrandSLAm's runtime system starts distributing requests across microservice instances.
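For illustration, the output of this offline step can be represented as a simple ordered structure per job; the Python sketch below is our own simplified representation of the sequential DAGs in Figure 5, not GrandSLAm's internal data structure.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MicroserviceDAG:
    job_name: str
    stages: List[str] = field(default_factory=list)  # vertices, in call order

    def edges(self) -> List[Tuple[str, str]]:
        # Each edge corresponds to an RPC call between consecutive microservices.
        return list(zip(self.stages, self.stages[1:]))

# The sequential DAGs for Job A and Job B from Figure 5.
job_a = MicroserviceDAG("Job A", ["IMC", "NLU", "QA", "TTS"])
job_b = MicroserviceDAG("Job B", ["ASR", "NLU", "QA"])
print(job_a.edges())   # [("IMC", "NLU"), ("NLU", "QA"), ("QA", "TTS")]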

4.2 Calculating Microservice Stage Slack
The end-to-end latency of a request is a culmination of the completion times of the request at each microservice stage. Therefore, to design a runtime mechanism that provides end-to-end latency guarantees for requests, we take a disaggregated approach. We calculate the partial deadline at each microservice stage which every request needs to meet so that end-to-end latency targets are not violated.


Figure 6. Microservice stage slack (%) corresponding to the different microservices present in the Pose Estimation for Sign Language application (Activity Pose, Natural Language Understanding, Question Answering, and Sequence Learning), across batch sizes (CPU).

We define this as microservice stage slack. In other words, microservice stage slack is defined as the maximum amount of time a request can spend at a particular microservice stage. Stage slacks are allocated offline after building the microservice DAG, before the start of the GrandSLAm runtime system.

Mathematically, the slack at every stage is determined by calculating the proportion of the end-to-end latency that a request can utilize at each particular microservice stage.

slack_m = L_m / (L_a + L_b + · · · + L_m + · · ·) × SLA    (4)

where L_m is the latency of the job at stage m and L_a, L_b, ... are the latencies of the same job at the other stages a, b, ... respectively. Figure 6 illustrates the proportion of time that should be allocated at each microservice stage for varying batch sizes, for a real world application called Pose Estimation for Sign Language. We can clearly see from Figure 6 that the percentage of time a request would take to complete the Sequence Learning stage is much higher than the percentage of time the same request would take to complete the Activity Pose stage. Using this observation, requests are allocated stage level execution slacks proportionally.
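A minimal Python sketch of this proportional allocation (Equation 4) follows; the stage latency profile used in the example is hypothetical.

def allocate_stage_slack(stage_latencies_ms, end_to_end_sla_ms):
    # slack_m = L_m / (L_a + L_b + ...) * SLA (Equation 4)
    total = sum(stage_latencies_ms.values())
    return {stage: (latency / total) * end_to_end_sla_ms
            for stage, latency in stage_latencies_ms.items()}

# Hypothetical stage latency profile for Pose Estimation for Sign Language.
print(allocate_stage_slack(
    {"AP": 120.0, "NLU": 40.0, "QA": 60.0, "SL": 380.0},
    end_to_end_sla_ms=3000.0))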

4.3 Dynamic Batching with Request Reordering
GrandSLAm's final step is an online step orchestrating requests at each microservice stage based on two main objective functions: (i) meeting end-to-end latency and (ii) maximizing throughput. For this purpose, GrandSLAm tries to execute every request that is queued up at a microservice stage in a manner that simultaneously maximizes the sharing degree while meeting end-to-end latency guarantees. In this regard, GrandSLAm undertakes two key optimizations: (1) request reordering and (2) dynamic batching, as depicted in Figure 7. Through these optimizations, GrandSLAm tries to maximize throughput. However, it keeps a check on the latency of the executing job by comparing the slack possessed by each request (calculated offline as described in Section 4.2) with its execution time estimates (obtained from the model described in Section 3.2).

Figure 7. Request reordering and dynamic batching mechanism: (1) requests queued at a microservice stage are reordered based on their slack, and (2) the batch size at each stage is adjusted dynamically.

Request reordering. Slack based request reordering is performed at each microservice instance by our runtime system. The primary objective of our request reordering mechanism is to prioritize the execution of requests with lower slack, as they possess much tighter completion deadlines. Hence, our GrandSLAm runtime system reorders requests at runtime, promoting requests with lower slack to the head of the execution queue. The request reordering mechanism in Figure 7 illustrates this with an example. Each rectangle is a request present in the microservice execution queue and the number in each rectangle denotes its corresponding slack value. The left part shows the queue before reordering, and the middle part shows the queue after reordering.

Dynamic batching. At each microservice stage, once the requests have been reordered using slack, we identify the largest sharing degree (actual batch size during execution) that can be employed such that each request's execution time is within the allocated microservice stage slack. Such a safe identification of the largest sharing degree is done by comparing the allocated slack obtained by the process described in Section 4.2 with the execution time estimation model described in Section 3.2.

Algorithm 1 summarizes the dynamic batching approach that we employ. The input to the algorithm is a queue of requests sorted by their respective slack values. Starting from the request possessing the lowest slack value, we traverse through the queue increasing the batch size. We perform this until increasing the batch size violates the sub-stage SLA of individual requests present in the queue. We repeat the request reordering and dynamic batching process continuously as new incoming requests arrive from time to time. Figure 7 shows how dynamic batching is used in our system, from the middle part to the right part.

4.4 Slack Forwarding
While performing slack based request scheduling in multi-stage applications, we observed a common scenario: there is always some leftover slack that remains unused for many requests. For instance, at the first stage, if the best ETC value provided for a request is 100ms and the slack allocated for that stage is 135ms, there is 35ms (135ms - 100ms) of leftover slack. We reutilize this remaining slack by performing slack forwarding, wherein we carry forward the unused slack on to the subsequent microservice stages.


Algorithm 1 Dynamic batching algorithm
1: procedure DynBatch(Q)            ▷ Queue of requests sorted by slack
2:     startIdx = 0
3:     slackQ = 0
4:     executed = 0
5:     len = length(Q)
6:     while executed < len do       ▷ Until all requests are batched
7:         window = 0
8:         partQ = Q[startIdx : len]
9:         window = getMaxBatchSizeUnderSLA(partQ, slackQ)
10:        startIdx = startIdx + window
11:        slackQ = slackQ + latency
12:        executed = executed + window
13:    end while
14: end procedure
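The following minimal Python sketch mirrors Algorithm 1, combining slack-based reordering with dynamic batching; the request attributes and the estimate_latency interface are illustrative stand-ins for GrandSLAm's internal interfaces.

def dyn_batch(queue, estimate_latency):
    # `queue` is a list of requests, each with `slack` (remaining stage slack)
    # and `input_size` attributes; `estimate_latency(input_size, batch_size)`
    # wraps the ETC model of Section 3.2. Both are illustrative interfaces.
    queue.sort(key=lambda r: r.slack)          # step 1: lowest slack first
    batches, start, queuing_delay = [], 0, 0.0
    while start < len(queue):
        window = 1                             # always make progress
        # Step 2: grow the batch while every member still meets its slack.
        for size in range(1, len(queue) - start + 1):
            candidate = queue[start:start + size]
            latency = estimate_latency(candidate[0].input_size, size)
            if all(queuing_delay + latency <= r.slack for r in candidate):
                window = size
            else:
                break
        batch = queue[start:start + window]
        queuing_delay += estimate_latency(batch[0].input_size, window)
        batches.append(batch)
        start += window
    return batches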

Figure 8. Forwarding the unused slack in the ASR stage to the NLU stage: (1) the ASR execution time finishes within Slack_ASR, leaving leftover slack; (2) the NLU stage's budget becomes Slack_NLU plus the leftover Slack_ASR.

Figure 8 exemplifies the case where the unused slack in the ASR stage is forwarded to the next (NLU) microservice stage. This increases the overall request slack in the later stages of execution of a multi-stage application, enabling higher sharing degrees.
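A minimal Python sketch of slack forwarding, assuming the per-stage slack allocations of Section 4.2 and the ETC estimates of Section 3.2 are available; the numbers are illustrative.

def forward_slack(stage_slacks_ms, stage_etcs_ms):
    # Carry any slack left unused by a stage into the next stage's budget.
    effective_budgets, leftover = [], 0.0
    for allocated, etc in zip(stage_slacks_ms, stage_etcs_ms):
        budget = allocated + leftover          # allocated slack plus carried slack
        leftover = max(budget - etc, 0.0)      # e.g., 135ms - 100ms = 35ms forwarded
        effective_budgets.append(budget)
    return effective_budgets

# The ASR stage finishes 35ms early, so the NLU stage's budget grows by 35ms.
print(forward_slack([135.0, 200.0, 300.0], [100.0, 180.0, 290.0]))  # [135.0, 235.0, 355.0]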

5 Discussion
Our approach requires an accurate estimation of the execution time at each microservice stage. For this purpose, it becomes essential to determine the factors affecting the execution time of microservices. This motivated us to develop an execution time estimation (ETC) model based on a set of factors drawn from the application space we have considered. In this study, we analyzed the performance characteristics of AI and ML related microservices, as these applications are well suited to be hosted on the microservice architecture. In this context, we observed two distinct characteristics in the AI and ML space. First, batching multiple requests into a single large one is widely used in these microservices to improve the resource utilization of the computing devices. For this purpose, these microservices perform preprocessing of inputs (e.g., resizing images in image classification, splitting voice inputs in speech recognition, chunking words in natural language processing) to fit them into a single batch for simultaneous execution. Second, many of the AI applications exhibit pipelined execution of microservices. Image recognition, an application from AWS Step Functions [39], is one such example. Such simple linear pipelines make it much easier to design the slack-based optimizations introduced in Section 4.

Limitations. However, we anticipate that our methodology cannot be applied directly to microservice types outside the AI and ML space. For example, the simple model that we have proposed is not sufficient for other types of microservices which do not batch queries belonging to different applications. For example, the execution time of microservices executing SQL range queries will be sensitive to both the input query and the output results. In other words, similar queries executed on different datasets might possess different execution times. In such circumstances, a much more detailed analysis and investigation of application types is required for building more sophisticated models. In addition to that, complex microservice topologies such as general graphs and conditional execution have not been considered in this study. It is challenging for GrandSLAm in its existing form to calculate slacks in cases where different requests take different paths at runtime or need to execute a few microservices in parallel. These are some of the limitations of GrandSLAm which we plan to investigate in the near future.

6 Evaluation
In this section, we evaluate GrandSLAm's policy and also demonstrate its effectiveness in meeting service level agreements (SLAs), while simultaneously achieving high throughput in datacenters that house microservices.

6.1 Experimental Environments

Infrastructure. We evaluate GrandSLAm on a testbed consisting of 100 Docker containers. Each container has a single 2.4 GHz CPU core, 2GB of RAM, and runs Ubuntu 16.10. GrandSLAm is evaluated on both CPU and GPU platforms, as enumerated in Table 2. Today's datacenters house different kinds of accelerators improving the performance of AI and ML applications [16, 17, 34, 36, 38]. We set up a topology of services and processes according to that of IBM Bluemix [13]. In other words, each microservice executes in a containerized execution environment. We use Docker containers for this purpose.

Microservice types. Table 1 shows the list of microservices that we have utilized in our experiments. The POS, CHK, and NER microservices utilize kernels from the Djinn&Tonic [16] suite, which in turn uses SENNA [9]. Similarly, the ASR microservice utilizes kernels from the Djinn&Tonic suite [16], which in turn uses Kaldi [37]. The IMC, FACED, FACER, AP, HS, QA, and SL microservices are implemented using the TensorFlow framework, version 1.0 [1].


Type | Application | Input Sizes | Output | Network | Network Type | Layers | Parameters
Image Services | Image Classification (IMC) | 64x64, 128x128 and 256x256 images | Probability of an object | Alexnet | CNN | 8 | 15M
Image Services | Face Detection (FACED) | 64x64, 128x128 and 256x256 images | Facial key points | Xception | CNN | 9 | 58K
Image Services | Facial Recognition (FACER) | 64x64, 128x128 and 256x256 images | Probability of a person | VGGNet | CNN | 14 | 40M
Image Services | Human Activity Pose (AP) | 64x64, 128x128 and 256x256 images | Probability of a pose | deeppose | CNN | 8 | 40M
Image Services | Human Segmentation (HS) | 64x64, 128x128 and 256x256 images | Presence of a body part | VGG16 | CNN | 16 | 138M
Speech Services | Speech Recognition (ASR) | 52.3KB, 170.2KB audio | Raw text | NNet3 | DNN | 13 | 30M
Speech Services | Text to Speech (TTS) | 52.3KB, 170.2KB audio | Audio output | WaveNet | DNN | 15 | 12M
Text Services | Part-of-Speech Tagging (POS) | Text containing 4-70 words per sentence | Word's part of speech, e.g., noun | SENNA | DNN | 3 | 180K
Text Services | Word Chunking (CHK) | Text containing 4-70 words per sentence | Labels words as begin chunk, etc. | SENNA | DNN | 3 | 180K
Text Services | Named Entity Recognition (NER) | Text containing 4-70 words per sentence | Labels words | SENNA | DNN | 3 | 180K
Text Services | Question Answering (QA) | Text containing 4-70 words per sentence | Answer for a question | MemNN | RNN | 2 | 43K
Text Services | Sequence Learning (SL) | Text containing 4-70 words per sentence | Translated text | seq2seq | RNN | 3 | 3072
General Purpose Services | NoSQL Database (NoSQL) | Directory input | Output of query | N/A | N/A | N/A | N/A
General Purpose Services | Web Socket Programming (WS) | Text, image | Data communication | N/A | N/A | N/A | N/A

Table 1. Summary of microservices and their functionality

CPU/GPU config | Microarchitecture
Intel Xeon E5-2630 @ 2.4 GHz | Sandy Bridge-EP
Intel Xeon E3-1420 @ 3.7 GHz | Haswell
Nvidia GTX Titan X | Maxwell
GeForce GTX 1080 | Pascal

Table 2. Experimental platforms

Application | Description | Pipelined microservices
IPA-Query | Provides answers to queries that are given as input through voice. | ASR→NLP→QA
IMG-Query | Generates natural language descriptions of the images as output. | IMG→NLP→QA
POSE-Sign | Analyzes interrogative images and provides answers. | AP→NLP→QA→SL
FACE-Security | Scans images to detect the presence of identified humans. | FACED→FACER
DETECT-Fatigue | Detects in real time the onset of sleep in fatigued drivers. | HS→AP→FACED→FACER
Translation | Performs language translation. | SL→QA→NoSQL

Table 3. Applications used in evaluation

Workload | Applications | Shared microservices
WL1 | IMG-Query, FACE-Security, DETECT-Fatigue, POSE-Sign | QA, FACED, FACER, AP
WL2 | IPA-Query, POSE-Sign, Translation | NLU, QA
WL3 | I/O-IPA-Query, I/O-Sign, I/O-Translation | NLU, NoSQL

Table 4. Workload scenarios

Load generator/Input. To evaluate the effectiveness of GrandSLAm, we design a load generator that submits user requests following a Poisson distribution, which is widely used to mimic cloud workloads [33]. The performance degradation in multi-tenant execution scenarios is most pronounced at servers handling high load. Hence, our experiments are evaluated for datacenter scenarios where the load is high. Such a distribution has been used by several prior works on multi-stage applications [42, 47, 49]. The SLA that we use for each application is obtained and calculated using the methodology proposed by PowerChief [49]. Table 4 shows the workloads and the microservices that are shared when they are executed together. For each microservice request, we have evaluated our methodology using inputs that correspond to data available from open source datasets.
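For illustration, a minimal Python sketch of such a Poisson load generator follows; the rate and duration parameters are placeholders, not the values used in our experiments.

import random

def poisson_arrivals(rate_rps, duration_s, seed=0):
    # Exponentially distributed inter-arrival times yield a Poisson process.
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)         # mean inter-arrival time = 1/rate
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

# e.g., roughly 100 requests per second over a 60 second run.
timestamps = poisson_arrivals(rate_rps=100, duration_s=60)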

Figure 9. Comparing the effect of the different components present in GrandSLAm's policy (Baseline, Baseline + reordering, Baseline + dynamic batching, and GrandSLAm) for WL1, WL2, and WL3: (a) percentage of requests that violate SLA; (b) 99th percentile tail latency (ms) of each application.

6.2 Achieving Service Level Agreements (SLAs)
First, we evaluate the effectiveness of GrandSLAm in achieving Service Level Agreements (SLAs) for the workload scenarios enumerated in Table 4. For this purpose, we introduce reordering and batching incrementally over the baseline system and study their effects on the percentage of SLA violations.

6.2.1 Reducing SLA Violations
For this experiment, we deployed a Docker container instance for each microservice type. Communication across microservice instances within the cluster happens through web sockets. Under this experimental setup, we first obtain the percentage of requests violating SLAs under a baseline scheme which executes requests (i) in a first-in-first-out (FIFO) fashion and (ii) without sharing the microservices. Subsequently, we introduce a request re-ordering scheme that executes requests in an Earliest Deadline First (EDF) fashion to compare it with the baseline system. Similarly, we also execute requests in a situation where requests share microservice instances (using query batching) to see how that improves performance.


Figure 10. Comparing the cumulative distribution function of latencies for prior approaches and GrandSLAm: (a) EDF-DNB vs. GrandSLAm (IPA-WL2, Pose-WL2); (b) EDF-50 vs. GrandSLAm (IPA-WL3, Pose-WL3); (c) ED-DNB vs. GrandSLAm (Pose-WL2, Translate-WL2); (d) ED-30 vs. GrandSLAm (Face Security-WL1, Fatigue Driver-WL1).

Lastly, we compare GrandSLAm with these schemes to illustrate its effectiveness. Our experiment keeps the input load constant at a fixed Requests per Second (RPS) while comparing each policy.

Figure 9 shows the results of this experiment. From Figure 9a, we can clearly see that for a given workload, almost all of the requests violate SLAs under the baseline and reordering policies. However, the effect is much less severe when requests are grouped together in batches. This is because batching can improve the overall latency of a multitude of requests collectively [16], which is clearly evident from the percentage of requests violating SLAs under the baseline+dynamic batching policy. GrandSLAm combines the best of both policies and ends up with a low percentage of requests that violate SLAs.

6.3 Comparing with Prior Techniques
Prior approaches which try to solve this problem can be categorized based on their respective (i) batching policies for aggregating requests and (ii) slack calculation policies for reordering requests. The most relevant prior work uses a no-batching policy that does not batch multiple requests. Djinn&Tonic [16] utilizes a static batching policy with a fixed batch size for all applications. In contrast, we propose a dynamic batching technique which varies the batch size based on the slack available for each request. Again, with respect to the slack calculation policy, prior approaches [21, 50] utilize an equal division slack allocation (ED) policy which equally divides slack across individual microservice stages. Certain other approaches utilize a first-in-first-out policy, while most approaches utilize an earliest deadline first (EDF) slack allocation policy [42, 47]. However, we propose a slack calculation policy which allocates slack taking into account the intrinsic

variation present in the execution times of different computational stages. This is explained in Section 4.2.

We derive four baselines from the equal division policy: ED-NB (equal division, no batching) disables batching, ED-30 and ED-50 statically fix the batch size to 30 and 50 respectively, and ED-DNB (equal division, dynamic batching) uses the dynamic batching approach proposed by GrandSLAm along with the ED policy. We also derive four baselines using the earliest deadline first policy: EDF-NB, EDF-30, EDF-50 and EDF-DNB, respectively. GrandSLAm's policy is abbreviated as GS in our graphs.

6.3.1 Reordering Requests based on Slack
In this subsection, we quantify the effectiveness of GrandSLAm's slack calculation and reordering policy by comparing it with ED and EDF. We illustrate this using the cumulative distribution function (CDF) of latencies, as shown in Figure 10. We use the same experimental setup, where the configuration of the input load and the number of microservice instances remains constant.

Figures 10a, 10b, 10c and 10d compare the cumulative distribution functions (CDFs) of the policies EDF-DNB, EDF-50, ED-DNB, and ED-30, respectively, with GrandSLAm. The horizontal axis denotes time. The vertical axis denotes the CDF of the percentage of requests executed at a particular time. The dashed lines correspond to the target SLAs that individual applications are subjected to meet. In each figure, the graph on the left portrays the CDF of the baseline technique (EDF-DNB, EDF-50, ED-DNB, or ED-30) and the graph on the right portrays the CDF of GrandSLAm. The green shaded portion illustrates the leftover slack at the final stage when requests execute before the deadline. The red shaded portion illustrates slack violation when requests execute after the deadline has passed. In an ideal case, both the green and red portions should be minimized.


Figure 11. Comparing the latency of workloads under different policies. GrandSLAm has the lowest average and tail latency. (a) Average latency (ms) and (b) 99th percentile tail latency (ms) for WL1, WL2 and WL3 under ED-NB, ED-30, ED-50, ED-DYN, EDF-NB, EDF-30, EDF-50, EDF-DYN and GS; the stacked components show queuing delay and the compute latency of the IMC, NLU, QA, AP, SL and NoSQL stages.

In other words, requests should be reordered and batched in such a way that they neither miss the deadline nor complete far ahead of it. Completing far ahead of the deadline stalls requests with lower slack, creating a situation where those requests end up violating SLAs. Ideally, the remaining slack should be transferred to the requests that are about to violate their slack. From these graphs, we draw the following conclusion: as shown in Figures 10a, 10b, 10c and 10d, the request reordering policies proposed in prior literature create a situation where a few requests complete much earlier than the expected deadline while other requests end up violating their SLAs.

Figures 10a and 10b compare EDF with GrandSLAm. EDF's slack allocation policy is agnostic to the intrinsic variation across the microservice execution stages within an application. Hence, in many instances, it underestimates the execution times of requests and performs aggressive batching. As a result, some requests complete their execution well ahead of the latency targets while other requests end up violating SLAs. GrandSLAm, on the other hand, avoids this situation by allocating slack that is proportional to the time that would be taken at each stage. It performs judicious batching while limiting aggressive batching by introducing sub-stage SLAs. This is clearly illustrated in Figure 10a. Pose and IPA are two applications present in WL2. Under EDF's policy, requests of the Pose application complete well ahead of time (the green patch), while a substantial number of IPA requests violate SLAs (the red patch). GrandSLAm, on the other hand, carefully reallocates slack among applications: the execution of requests with abundant slack is stalled until just before the deadline, allowing requests with little slack to execute and preventing them from violating SLAs. This can clearly be seen in Figure 10a, where the green and red patches are much smaller for GrandSLAm. A similar phenomenon can be observed for EDF's static batching policy with batch size 50 in Figure 10b.

Figures 10c and 10d compare ED with GrandSLAm. The major drawback of the ED technique lies in its inability to gauge the slack that should be allocated to each stage. In many cases, it wrongly allocates more slack to requests that do not require it, while depriving other requests that actually need it. This introduces additional queuing time, thereby violating the SLA for a substantial number of requests. This could be avoided if slack were distributed judiciously across requests. GrandSLAm is cognizant of this need: it predicts the compute time required at each stage and allocates slack proportionally.
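As an illustration of the reordering step itself, the following sketch orders a microservice's queue by the remaining sub-stage slack of each request, so that requests closest to violating their deadline are served first. The Request class and its fields are our own simplification, not GrandSLAm's actual data structures.

```python
# Illustrative slack-based reordering at a single microservice queue
# (a simplification, not GrandSLAm's actual data structures).
import time
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    stage_deadline: float   # absolute time (s) by which this stage must finish

    def remaining_slack(self, now: float) -> float:
        return self.stage_deadline - now

def reorder_by_slack(queue, now=None):
    """Serve the requests closest to violating their sub-stage deadline first."""
    now = time.time() if now is None else now
    return sorted(queue, key=lambda r: r.remaining_slack(now))

# Example: request 2 has the least remaining slack and moves to the front.
queue = [Request(1, 10.0), Request(2, 3.0), Request(3, 7.0)]
print([r.req_id for r in reorder_by_slack(queue, now=0.0)])   # [2, 3, 1]
```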

6.3.2 Dynamic Batching for Latency Reduction

To study the effects of dynamic batching, we compare GrandSLAm with all our baseline policies. Figures 11 and 12 illustrate the results of this experiment. Each stacked bar in Figures 11a and 11b represents the average latency and the tail latency of the applications, respectively. The policies in each figure are ordered starting from ED-NB, followed by ED-30, ED-50, ED-DYN, EDF-NB, EDF-30, EDF-50 and EDF-DYN, and concluding with GrandSLAm as GS. GrandSLAm is distinguished from the other bars by hatching with slanted lines. Each color in the stacked graph corresponds to either the queuing latency experienced at a stage or the compute latency of an individual microservice stage. The different components of this plot are stacked


breaking down the end-to-end latency into queuing latency and compute stage delay over time (which is why there is a queuing latency stack after each stage). As can be seen in Figure 11a, GrandSLAm achieves the lowest latency across all policies. GrandSLAm is able to meet the required SLA for almost every request, whereas prior policies violate SLAs for many of these requests. We draw the following insights into why prior policies are ineffective in meeting SLAs.

No batching techniques. The latency of requests is completely dominated by queuing latency when employing techniques that do not perform batching, namely ED-NB and EDF-NB. Hence, such policies are undesirable.

Static batching techniques. In view of the clear disadvantage when requests are not batched, statically batching them is one of the simplest policies that can be employed to improve throughput. However, latencies and SLAs can be compromised if requests are not batched judiciously.

Assigning a large fixed batch size for execution can critically violate the latency of many requests within that batch. Take WL1 for example. From Figures 11a and 11b, we see that employing a fixed batch size of 50 under the EDF policy violates the SLA only by a small margin, but it does so for most requests in the workload. This can be seen in Figure 12, where the percentage of violations for WL1 under ED-50 reaches 60%. Using a large batch size creates a situation where nearly every request ends up violating the SLA, especially at the last stage of the application, because a fixed batch size is not aware of the latencies and slack of the requests executing at a given point in time. This is an unfavorable outcome, especially for applications that require strict latency targets.

To remedy this, employing smaller batch sizes could be viewed as a favorable solution. However, smaller batch sizes can be conservative, failing to exploit opportunities where aggressive batching can increase throughput while still meeting the latency constraints. Furthermore, small batch sizes can also cause excessive queuing: when requests are grouped into small batches, the first few batches might see low queuing delays, but subsequent batches end up waiting for a substantial period of time for prior batches to finish executing, thereby inflating the end-to-end latency. This increase in queuing latency at the later stages can be clearly seen for WL2 (Figures 11a and 11b), where the ED-30 and EDF-30 policies violate SLAs both in terms of average latency and tail latency. Additionally, many requests violate SLAs because queuing becomes a severe problem, as shown in Figure 12. These observations strongly motivate a dynamic batching policy where batch sizes are determined online, at runtime, depending on each application's latency constraints.
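The queuing buildup described above can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical numbers, not measured data: it computes the latency of successive fixed-size batches at a single stage when the per-batch service rate falls below the arrival rate, showing that each batch waits longer than the previous one.

```python
# Back-of-the-envelope illustration (hypothetical numbers, not measured data)
# of queuing buildup at one stage under a fixed batch size.

def batch_latencies_ms(n_requests, batch_size, batch_exec_ms, interarrival_ms):
    """Latency (queuing + compute) of each batch's first request, assuming the
    stage fills a batch, then executes batches one at a time in FIFO order."""
    latencies = []
    stage_free_at = 0.0
    for first in range(0, n_requests, batch_size):
        last = min(first + batch_size, n_requests) - 1
        ready = last * interarrival_ms          # batch is full once its last request arrives
        start = max(ready, stage_free_at)       # also wait for earlier batches to drain
        finish = start + batch_exec_ms
        latencies.append(finish - first * interarrival_ms)
        stage_free_at = finish
    return latencies

# 120 requests arriving 1 ms apart; a batch of 30 takes 50 ms to execute,
# so the stage serves 0.6 requests/ms against an arrival rate of 1 request/ms.
print(batch_latencies_ms(120, 30, 50.0, 1.0))   # [79.0, 99.0, 119.0, 139.0]
```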

Figure 12. Percentage of requests violating SLAs (%) under different schemes (ED-NB, ED-30, ED-50, ED-DNB, TT-NB, TT-30, TT-50, TT-DNB and GS) for WL1, WL2 and WL3.

Dynamic batching. Equal Division dynamic batching, Earliest Deadline First dynamic batching and GrandSLAm all determine appropriate batch sizes at runtime. The difference between these three policies is the way in which they compute slack. Once slack is computed, the largest batch size that accommodates all the requests without violating their slack is selected at runtime. For Equal Division dynamic batching, the slack for each request is a fair share of the SLA at each stage of the end-to-end pipeline. For instance, for an application consisting of 3 stages, each request of that application is estimated to have a slack of 33% of the SLA at each stage. The Earliest Deadline First approach, however, is greedy: the slack for each request at each stage is the time remaining before the request would violate its SLA. GrandSLAm is distinct from both of these mechanisms; we adopt the methodology elaborated in Section 4.2, which is cognizant of the volume of computation each individual stage performs.

In Figures 11a and 11b we clearly see that both Equal Division dynamic batching and Earliest Deadline First dynamic batching perform poorly, for the following reasons. First, the policy that Earliest Deadline First (EDF) uses to determine the appropriate batch size for a set of requests is greedy: EDF selects batch sizes aggressively as long as slack remains. Although this can be beneficial for traditional datacenter applications whose execution is single-stage and monolithic, such an approach performs poorly in a microservice execution framework with multi-stage execution pipelines. When requests reach the final stages of execution, they have little slack left, which restricts the amount of batching possible without risking SLA violations. Such a policy has two key downsides: first, it increases the queuing time for subsequent requests, thereby increasing their latency, which especially hurts the tail latency of applications, as shown in Figure 11b; second, it becomes difficult to identify the exact stage that is causing the bottleneck. As a result, the command center performs non-optimal remediation in which unneeded instances are scaled up, leading to high resource utilization.


Figure 13. Throughput gains from GrandSLAm. Throughput (QPS) normalized to GrandSLAm on the CPU and GPU platforms for ED-30, ED-50, ED-DYN, EDF-30, EDF-50, EDF-DYN and GS.

This is experimentally validated in Section 6.4.2. Second, equal division dynamic batching introduces a fair-share sub-stage SLA for each stage. This can restrict microservices from batching aggressively at a single stage, although it can identify the exact microservice instance responsible for an end-to-end SLA violation. On the one hand, such a policy can address the high tail latency caused by Earliest Deadline First's aggressive and greedy dynamic batching. On the other hand, it neglects the fact that the computation time at each stage is very different. Hence, in many scenarios, it does not exploit the full benefits of batching at stages that have high slack. For example, at the final stage of WL2 shown in Figures 11a and 11b, if all the requests had been batched, the percentage of requests violating their slack would have been much lower. The equal division policy cannot exploit this opportunity, resulting in increased request latency.

GrandSLAm. Our technique, on the other hand, takes a hybrid approach, exploiting the advantages of dynamic batching while also enforcing sub-stage cut-off slacks. GrandSLAm uses a weighted sub-stage SLA slack based on the computational requirements of each stage together with an online dynamic batching technique. As a result, GrandSLAm outperforms all prior approaches and achieves much lower average and tail latency, as shown in Figures 11a and 11b.
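A minimal sketch of the slack-aware dynamic batch-size selection shared by the DYN variants and GrandSLAm is shown below. The helper names and the linear latency model are assumptions of this sketch, not the paper's code; requests are assumed to expose the stage_deadline field from the earlier reordering sketch, and the queue is assumed to be slack-ordered already.

```python
# Minimal sketch of slack-aware dynamic batch-size selection (helper names and
# the linear latency model are assumptions of this sketch, not the paper's code).

def predicted_batch_time_ms(stage_model, batch_size):
    """Placeholder latency model: execution time as a linear function of batch size."""
    return stage_model["slope_ms"] * batch_size + stage_model["intercept_ms"]

def choose_batch_size(stage_model, queued_requests, now_ms, max_batch=64):
    """Largest batch such that every request in it still meets its sub-stage
    deadline after the batch finishes; assumes the queue is already slack-ordered."""
    if not queued_requests:
        return 0
    best = 1
    for size in range(1, min(max_batch, len(queued_requests)) + 1):
        finish = now_ms + predicted_batch_time_ms(stage_model, size)
        if all(finish <= r.stage_deadline for r in queued_requests[:size]):
            best = size
        else:
            break   # predicted execution time only grows with batch size
    return best
```

The three policies differ only in how the stage_deadline values were derived: an equal split of the SLA (ED-DYN), the full remaining time to the SLA (EDF-DYN), or a split proportional to the predicted per-stage compute time (GS).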

6.4 GrandSLAm Performance

In this section, we evaluate GrandSLAm's ability to increase datacenter throughput and server utilization while guaranteeing Service Level Agreements (SLAs) for the workload scenarios enumerated in Table 4.

6.4.1 Throughput Increase

In this section, we demonstrate the throughput benefits of GrandSLAm compared to state-of-the-art techniques in scale-out environments. We compare the different execution policies by constructing a simulation-based experimental setup modeling a 1000-node CPU- and GPU-enabled cluster. As executing AI applications on accelerator platforms is becoming more common, we evaluate our technique on both CPU and GPU platforms. For GPU-based experiments, the workloads do not utilize the CPU and execute only on the GPU device, and vice versa. Additionally, to mimic scale-out execution scenarios, we collect performance telemetry of the workload scenarios over multiple execution runs. We then extrapolate this telemetry to obtain data nearly equivalent to the amount collected at a large-scale datacenter. On top of that, we build a simulation infrastructure that mimics GrandSLAm's execution model at larger scale. We fix the application-specific SLAs, the instance counts and the server configuration across experimental runs, and ensure that every request executing across the end-to-end pipeline meets its latency constraints. Under these conditions, we observe the throughput gains of each execution policy.

Figure 13 illustrates the throughput gains of GrandSLAm compared to state-of-the-art execution policies. Each bar represents the average number of requests executed per second (RPS) across all the applications and workload scenarios enumerated in Table 4, normalized to that of GrandSLAm. We normalize with respect to GrandSLAm since the best prior technique is different for the CPU and GPU systems. We clearly see that GrandSLAm outperforms the other execution policies. The graph on the left shows the average throughput for executing the workloads on a CPU cluster, while the graph on the right illustrates the results of the same experiment on a GPU platform. An interesting observation consistent across both CPU and accelerator platforms is that the static batching techniques consistently outperform the dynamic batching techniques. This is because dynamic batching, for instance in the context of TimeTrader, aggressively batches requests initially, but requests then get stalled during the terminal stages, resulting in decreased throughput. Equal division, on the contrary, misjudges the proportion of slack to be allocated; as a result, it restricts aggressive batching even in scenarios where latency would not take a hit, which results in low throughput. Overall, we obtain up to 3× higher throughput on the GPU platform and around 2.2× on the CPU server cluster over the best prior mechanism.

6.4.2 Reduced Overheads

In this section, we illustrate the decrease in the number of microservice instances when employing GrandSLAm's execution policy. Under fixed latency and throughput constraints, we obtain the number of microservice instances of each type required for executing the workloads enumerated in Table 4 in a scale-out fashion, similar to Section 6.4.1.

Figure 14 compares the instance counts for GrandSLAm and prior works. The top graph corresponds to the CPU infrastructure, while the bottom graph corresponds to the GPU infrastructure. We can see that GrandSLAm reduces the instance count significantly on both the CPU and GPU platforms. Additionally, GrandSLAm's instance count reduction is higher on the GPU platform. This is intuitive, as GPUs are devices optimized to provide high throughput. Overall, we conclude that GrandSLAm is able to effectively meet SLAs while achieving high throughput at low instance counts.


Figure 14. Decrease in the number of servers due to GrandSLAm. The number of instances, normalized to GrandSLAm, for ED-30, ED-50, ED-dynbatch, TT-30, TT-50, TT-dynbatch and GrandSLAm on the CPU infrastructure (top) and the GPU infrastructure (bottom), for the ASR, NLU, QA, IMC, AP, FACE, HS, YCSB and SL microservices and their mean.

7 Related Work

Prior literature on guaranteeing response latency falls into two primary categories: improving QoS without violating latency constraints, and managing SLAs in multi-stage applications.

7.1 Improving QoS without Latency Violation

Prior work on addressing response latency variation and providing quality of service (QoS) guarantees has primarily been in the context of traditional datacenters [10, 31, 35, 48]. Bubble-Up [31] and Bubble-Flux [48] quantify contention for the last-level cache and memory bandwidth to enable co-location of a latency-critical application alongside batch applications. However, these techniques prioritize the latency-critical user-facing application and end up significantly hurting the performance of the co-running batch applications. Paragon [10] and Whare-Map [30] employ runtime systems that use machine learning techniques, such as collaborative filtering and sensitivity analysis, to identify the right amount of resources required for guaranteeing QoS in heterogeneous datacenters. However, these techniques are designed for traditional datacenter applications like memcached and web search. Some prior literature attempts to estimate performance under co-location in accelerator environments [2, 7, 8, 23, 28, 43]. Baymax [8] predicts the behavior of tasks executing in a GPU accelerator context. Prophet [7] models the interference across accelerator resources in co-located execution scenarios. However, neither of these techniques caters to the needs of a microservice execution framework, as they do not tackle the challenge of guaranteeing latency for applications containing multiple stages.

7.2 Managing SLAs in Multi-Stage Applications

Recent studies have identified the advantages of applications composed of multiple stages, especially their ease of deployment [12, 18, 19, 22, 24, 25, 38, 42, 46]. In such scenarios, support for multi-tenancy, as well as schemes to shield users from the impact of multi-tenancy, is critical. However, explorations in this direction by companies such as Facebook [26] and Microsoft [22, 38], as well as by academic institutions [47, 49], neglect multi-tenant execution scenarios.

The most relevant prior academic studies of multi-stage applications are as follows.

TimeTrader [47] addresses the problem of meeting application-specific latency targets for multi-request execution in Online Data Intensive (OLDI) applications. To this end, it reorders requests with varying slack using an Earliest Deadline First scheduling methodology. However, this technique assumes that applications contain a single processing stage and fails to acknowledge the intrinsic latency variance across multiple stages. Hence, it deprioritizes requests that appear to have relaxed latency constraints but are subjected to the bulk of their compute at later stages. This leads to diminished effectiveness in mitigating response latency for multi-stage applications, as we quantitatively show in Section 6.

PowerChief [49] seeks to identify the bottleneck stages in multi-stage voice- and image-based intelligent personal assistant applications and applies dynamic voltage and frequency scaling to boost individual execution stages. However, PowerChief does not strive to guarantee SLAs at the request level. Furthermore, the proposed solution focuses on a particular class of applications and is not generalized to a microservice execution framework that handles requests from multiple tenants.

8 Conclusion

The microservice execution framework is rapidly transforming the operation of datacenters. It offers significantly more transparency into the underlying application execution than monolithic applications do. Such visibility is a key enabler for co-locating multiple latency-critical applications on the same systems while still meeting SLAs. In the face of such visibility and changing opportunities, there is a clear need to rethink runtime systems and frameworks.

Towards this end, we present GrandSLAm, a runtime system that exploits this visibility and identifies the slack in individual queries of different applications. GrandSLAm enables multiple tenants to meet their SLAs while achieving high throughput and utilization, with no performance overhead and without programmer support. Therefore, we conclude that GrandSLAm can be an efficient substrate for current and future datacenter environments housing microservice execution frameworks.

Acknowledgment

We thank our anonymous reviewers for their constructive feedback and suggestions. This work was sponsored by the National Science Foundation (NSF) under NSF CAREER SHF-1553485. Jeongseob Ahn was supported by the National Research Foundation of Korea grant (NRF-2017R1C1B5075437) funded by MSIP, Korea.


References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16).
[2] Paula Aguilera, Katherine Morrow, and Nam Sung Kim. 2014. Fair share: Allocation of GPU resources for both performance and fairness. In Proceedings of the IEEE 32nd International Conference on Computer Design (ICCD 14).
[3] Amazon. 2019. What is AWS Lambda? https://docs.aws.amazon.com/lambda/latest/dg/welcome.html. (2019).
[4] Amazon. 2019. What is AWS Step Functions? http://docs.aws.amazon.com/step-functions/latest/dg/welcome.html. (2019).
[5] Microsoft Azure. 2019. Azure Functions Serverless Architecture. https://azure.microsoft.com/en-us/services/functions/. (2019).
[6] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. 2013. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture 8, 3 (2013), 1–154.
[7] Quan Chen, Hailong Yang, Minyi Guo, Ram Srivatsa Kannan, Jason Mars, and Lingjia Tang. 2017. Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 17).
[8] Quan Chen, Hailong Yang, Jason Mars, and Lingjia Tang. 2016. Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 16).
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (almost) from Scratch. CoRR abs/1103.0398 (2011). http://arxiv.org/abs/1103.0398
[10] Christina Delimitrou and Christos Kozyrakis. 2013. Paragon: QoS-aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 13).
[11] Tarek Elgamal, Atul Sandur, Klara Nahrstedt, and Gul Agha. 2018. Costless: Optimizing Cost of Serverless Computing through Function Fusion and Placement. CoRR abs/1811.09721 (2018). http://arxiv.org/abs/1811.09721
[12] Sameh Elnikety, Erich Nahum, John Tracey, and Willy Zwaenepoel. 2004. A Method for Transparent Admission Control and Request Scheduling in e-Commerce Web Sites. In Proceedings of the 13th International Conference on World Wide Web (WWW 04).
[13] A. Gheith, R. Rajamony, P. Bohrer, K. Agarwal, M. Kistler, B. L. White Eagle, C. A. Hambridge, J. B. Carter, and T. Kaplinger. 2016. IBM Bluemix Mobile Cloud Services. IBM Journal of Research and Development 60, 2-3 (March 2016), 7:1–7:12.
[14] Google. 2019. Serverless Environment to Build and Connect Cloud Services. https://cloud.google.com/functions/. (2019).
[15] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
[16] Johann Hauswald, Yiping Kang, Michael A. Laurenzano, Quan Chen, Cheng Li, Ronald Dreslinski, Trevor Mudge, Jason Mars, and Lingjia Tang. 2015. Djinn and Tonic: DNN as a Service and Its Implications for Future Warehouse Scale Computers. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 15).
[17] Johann Hauswald, Michael A. Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski, Arjun Khurana, Ron Dreslinski, Trevor Mudge, Vinicius Petrucci, Lingjia Tang, and Jason Mars. 2015. Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 15).
[18] Yuxiong He, Sameh Elnikety, James Larus, and Chenyu Yan. 2012. Zeta: Scheduling Interactive Services with Partial Execution. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC 12).
[19] Scott Hendrickson, Stephen Sturdevant, Tyler Harter, Venkateshwaran Venkataramani, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. Serverless Computation with OpenLambda. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16).
[20] IBM. 2019. IBM Cloud Functions. https://www.ibm.com/cloud/functions. (2019).
[21] Muhammad Hussain Iqbal and Tariq Rahim Soomro. 2015. Big data analysis: Apache storm perspective. International Journal of Computer Trends and Technology 19, 1 (2015), 9–14.
[22] Virajith Jalaparti, Peter Bodik, Srikanth Kandula, Ishai Menache, Mikhail Rybalkin, and Chenyu Yan. 2013. Speeding Up Distributed Request-response Workflows. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM.
[23] Adwait Jog, Evgeny Bolotin, Zvika Guz, Mike Parker, Stephen W. Keckler, Mahmut T. Kandemir, and Chita R. Das. 2014. Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications. In Proceedings of Workshop on General Purpose Processing Using GPUs (GPGPU 14).
[24] Evangelia Kalyvianaki, Marco Fiscato, Theodoros Salonidis, and Peter Pietzuch. 2016. THEMIS: Fairness in Federated Stream Processing Under Overload. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD 16).
[25] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a Warehouse-scale Computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 15).
[26] S. Kanev, K. Hazelwood, G. Y. Wei, and D. Brooks. 2014. Tradeoffs between power management and tail latency in warehouse-scale applications. In IEEE International Symposium on Workload Characterization (IISWC 14).
[27] R. S. Kannan, A. Jain, M. A. Laurenzano, L. Tang, and J. Mars. 2018. Proctor: Detecting and Investigating Interference in Shared Datacenters. In 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 18).
[28] O. Kayiran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H. Loh, O. Mutlu, and C. R. Das. 2014. Managing GPU Concurrency in Heterogeneous Architectures. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 14).
[29] Kris Kobylinski. 2015. Agile Software Development for Bluemix with IBM DevOps Services. In Proceedings of the 25th Annual International Conference on Computer Science and Software Engineering (CASCON 15).
[30] Jason Mars and Lingjia Tang. 2013. Whare-map: Heterogeneity in "Homogeneous" Warehouse-scale Computers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA 13).
[31] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. 2011. Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 11).
[32] Sean Marston, Zhi Li, Subhajyoti Bandyopadhyay, Juheng Zhang, and Anand Ghalsasi. 2011. Cloud Computing - The Business Perspective. Decis. Support Syst. 51, 1 (April 2011), 14.
[33] David Meisner and Thomas F. Wenisch. 2012. DreamWeaver: Architectural Support for Deep Sleep. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 12).
[34] V. Nagarajan, R. Hariharan, V. Srinivasan, R. S. Kannan, P. Thinakaran, V. Sankaran, B. Vasudevan, R. Mukundrajan, N. C. Nachiappan, A. Sridharan, K. P. Saravanan, V. Adhinarayanan, and V. V. Sankaranarayanan. 2012. SCOC IP Cores for Custom Built Supercomputing Nodes. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI 12).
[35] V. Nagarajan, K. Lakshminarasimhan, A. Sridhar, P. Thinakaran, R. Hariharan, V. Srinivasan, R. S. Kannan, and A. Sridharan. 2013. Performance and energy efficient cache system design: Simultaneous execution of multiple applications on heterogeneous cores. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI 13).
[36] V. Nagarajan, V. Srinivasan, R. Kannan, P. Thinakaran, R. Hariharan, B. Vasudevan, N. C. Nachiappan, K. P. Saravanan, A. Sridharan, V. Sankaran, V. Adhinarayanan, V. S. Vignesh, and R. Mukundrajan. 2012. Compilation Accelerator on Silicon. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI 12).
[37] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
[38] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-scale Datacenter Services. In Proceeding of the 41st Annual International Symposium on Computer Architecture (ISCA 14).
[39] Amazon Web Services. 2017. The Image Recognition and Processing Backend reference architecture demonstrates how to use AWS Step Functions to orchestrate a serverless processing workflow using AWS Lambda, Amazon S3, Amazon DynamoDB and Amazon Rekognition. https://github.com/aws-samples/lambda-refarch-imagerecognition. (2017).
[40] Samuel L Smith, Pieter-Jan Kindermans, and Quoc V Le. 2017. Don't Decay the Learning Rate, Increase the Batch Size. arXiv preprint arXiv:1711.00489 (2017).
[41] Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu. 2015. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-application Interference at Shared Caches and Main Memory. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO 15).
[42] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin Ciucu. 2017. Distributed Resource Management Across Process Boundaries. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC 17).
[43] Shanjiang Tang, BingSheng He, Shuhao Zhang, and Zhaojie Niu. 2016. Elastic Multi-resource Fairness: Balancing Fairness and Efficiency in Coupled CPU-GPU Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 16).
[44] Prashanth Thinakaran, Jashwant Raj Gunasekaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R Das. 2017. Phoenix: a constraint-aware scheduler for heterogeneous datacenters. In IEEE 37th International Conference on Distributed Computing Systems (ICDCS 17).
[45] Prashanth Thinakaran, Jashwant Raj, Bikash Sharma, Mahmut T Kandemir, and Chita R Das. 2018. The Curious Case of Container Orchestration and Scheduling in GPU-based Datacenters. In Proceedings of the ACM Symposium on Cloud Computing (SoCC 18).
[46] T. Ueda, T. Nakaike, and M. Ohara. 2016. Workload characterization for microservices. In IEEE International Symposium on Workload Characterization (IISWC 16).
[47] Balajee Vamanan, Hamza Bin Sohail, Jahangir Hasan, and T. N. Vijaykumar. 2015. TimeTrader: Exploiting Latency Tail to Save Datacenter Energy for Online Search. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO 15).
[48] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. 2013. Bubble-flux: Precise Online QoS Management for Increased Utilization in Warehouse Scale Computers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA 13).
[49] Hailong Yang, Quan Chen, Moeiz Riaz, Zhongzhi Luan, Lingjia Tang, and Jason Mars. 2017. PowerChief: Intelligent Power Allocation for Multi-Stage Applications to Improve Responsiveness on Power Constrained CMP. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA 17).
[50] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (Oct. 2016), 10.
[51] Yilei Zhang, Zibin Zheng, and M.R. Lyu. 2011. Exploring Latent Features for Memory-Based QoS Prediction in Cloud Computing. In IEEE Symposium on Reliable Distributed Systems (SRDS 11).