


Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon


Christina Delimitrou and Christos Kozyrakis
Stanford University

Paragon, an online, scalable datacenter scheduler, enables better cluster utilization and per-application quality-of-service guarantees by leveraging data mining techniques that find similarities between known and new applications. For a 2,500-workload scenario, Paragon preserves performance constraints for 91 percent of applications, while significantly improving utilization. In comparison, a baseline least-loaded scheduler only provides similar guarantees for 3 percent of workloads.

Efficiency is a first-class requirement and the main source of scalability concerns both for small and large systems.1,2 Achieving high efficiency is not only a matter of sensible design, but also a function of how the system is managed, which becomes essential as the hardware grows progressively heterogeneous and parallel and applications get dynamic and diverse. Architecture has traditionally been about efficient system design. As efficiency increases in importance, architecture should be about both design and management for systems of any scale.

In this article, we focus on improving efficiency while guaranteeing high performance in large-scale systems. Although an increasing amount of computing now happens in public and private clouds, such as Amazon Elastic Compute Cloud (EC2; see http://aws.amazon.com/ec2) or vSphere (www.vmware.com/products/vsphere), datacenters continue to operate at utilizations in the single digits.1,3 This lessens the two main advantages of cloud computing—flexibility and cost efficiency both for cloud operators and end users—because not only are the machines underutilized, they are also operating in a non-energy-proportional region.1,4

There can be several reasons why machines are underutilized. Two of the most prominent obstacles are interference between coscheduled applications and heterogeneity in server platforms. For more information, see the "Interference and Heterogeneity" sidebar.

In our paper presented at the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013),5 we introduced Paragon, an online and scalable datacenter scheduler that accounts for heterogeneity and interference.



Interference and Heterogeneity

Interference occurs as coscheduled applications contend in shared resources. Coscheduled applications may interfere negatively even if they run on different processor cores because they share caches, memory channels, storage, and networking devices.1,2 If unmanaged, interference can result in performance degradations of integer factors,2 especially when the application must meet tail latency guarantees apart from average performance.3 Figure A shows that an interference-oblivious scheduler will slow workloads down by 34 percent on average, with some running more than two times slower. This is undesirable for both users and operators.

Heterogeneity is the natural result of the infrastructure's evolution, as servers are gradually provisioned and replaced over the typical 15-year lifetime of a datacenter.4-7 At any point in time, a datacenter may host three to five server generations with a few hardware configurations per generation, in terms of the processor speed, memory, storage, and networking subsystems. Managing the different hardware incorrectly not only causes significant performance degradations to applications sensitive to server configuration, but also wastes resources as workloads occupy servers for significantly longer, and gives a low-quality signal to hardware vendors for the design of future platforms. Figure A shows that a heterogeneity-oblivious scheduler will slow applications down by 22 percent on average, with some running nearly 2 times slower (see the "Methodology" section in the main article).

Finally, a baseline scheduler that is oblivious to both interference and heterogeneity and which schedules applications to least-loaded servers is even worse (48 percent average slowdown), causing some workloads to crash due to resource exhaustion on the server. Unless interference and heterogeneity are managed in a coordinated fashion, the system loses both its efficiency and predictability guarantees. Previous research has identified the issues of heterogeneity6 and interference,2 but while most cloud management systems—such as Mesos8 or vSphere (www.vmware.com/products/vsphere)—have some notion of contention or interference awareness, they either use empirical rules for interference management or assume long-running workloads (for example, online services), whose repeated behavior can be progressively modeled. In this article, we target both heterogeneity and interference and assume no a priori analysis of the application. Instead, we leverage information the system already has about the large number of applications it has previously seen.

References

1. S. Govindan et al., "Cuanta: Quantifying Effects of Shared On-Chip Resource Interference for Consolidated Virtual Machines," Proc. 2nd ACM Symp. Cloud Computing, 2011, article no. 22.
2. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
3. D. Meisner et al., "Power Management of Online Data-Intensive Services," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA 11), 2011, pp. 319-330.
4. L.A. Barroso and U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool Publishers, 2009.
5. C. Kozyrakis et al., "Server Engineering Insights for Large-Scale Online Services," IEEE Micro, vol. 30, no. 4, 2010, pp. 8-19.
6. J. Mars, L. Tang, and R. Hundt, "Heterogeneity in 'Homogeneous' Warehouse-Scale Computers: A Performance Opportunity," IEEE Computer Architecture Letters, vol. 10, no. 2, 2011, pp. 29-32.
7. R. Nathuji, C. Isci, and E. Gorbatov, "Exploiting Platform Heterogeneity for Power Efficient Data Centers," Proc. 4th Int'l Conf. Autonomic Computing (ICAC 07), 2007, doi:10.1109/ICAC.2007.16.
8. B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," Proc. 8th USENIX Conf. Networked Systems Design and Implementation, 2011, article no. 22.

Figure A. Performance degradation for 5,000 applications on 1,000 Amazon Elastic Compute Cloud (EC2) servers with heterogeneity-oblivious, interference-oblivious, and baseline least-loaded schedulers compared to ideal scheduling (application runs alone on best platform). Results are ordered from worst to best-performing workload. (The plot shows speedup over running alone on the best platform, between 0.0 and 1.0, across the 5,000 workloads.)


The key feature of Paragon is its ability to quickly and accurately classify an unknown application with respect to heterogeneity (which server configurations it will perform best on) and interference (how much interference it will cause to coscheduled applications and how much interference it can tolerate itself in multiple shared resources). Unlike previous techniques that require detailed profiling of each incoming application, Paragon's classification engine exploits existing data from previously scheduled workloads and requires only a minimal signal about a new workload. Specifically, it is organized as a low-overhead recommendation system similar to the one deployed for the Netflix Challenge,6 but instead of discovering similarities in users' movie preferences, it finds similarities in applications' preferences with respect to heterogeneity and interference. It uses singular value decomposition (SVD) to perform collaborative filtering and identify similarities between incoming and previously scheduled workloads.

Once an incoming application is classified, a greedy scheduler assigns it to the server that is the best possible match in terms of platform and minimum negative interference between all coscheduled workloads. Even though the final step is greedy, the high accuracy of classification leads to schedules that achieve both fast execution time and efficient resource usage. Paragon scales to systems with tens of thousands of servers and tens of configurations, running large numbers of previously unknown workloads. We implemented Paragon and showed that it significantly improves cluster utilization, while preserving per-application quality-of-service (QoS) guarantees both for small- and large-scale systems. For more information on related work, see the "Research Related to Paragon" sidebar.

Fast and accurate classification

The key requirement for heterogeneity- and interference-aware scheduling is to quickly and accurately classify incoming applications. First, we need to know how fast an application will run on each of the tens of server configurations (SCs) available. Second, we need to know how much interference it can tolerate from other workloads in each of several shared resources without significant performance loss and how much interference it will generate itself. Our goal is to perform online scheduling for large-scale systems without any a priori knowledge about incoming applications. Most previous schemes address this issue with detailed but offline application characterization or long-term monitoring and modeling.7-9 Paragon takes a different approach. Its core idea is that, instead of learning each new workload in detail, the system leverages information it already has about applications it has seen to express the new workload as a combination of known applications. For this purpose, we use collaborative filtering techniques that combine a minimal profiling signal about the new application with the large amount of data available from previously scheduled workloads. The result is fast and accurate classification of incoming applications with respect to heterogeneity and interference. Within a minute of its arrival, an incoming workload is scheduled on a large-scale cluster.

Background on collaborative filtering

Collaborative filtering techniques are frequently used in recommendation systems. We use one of their most publicized applications, the Netflix Challenge,6 to provide a quick overview of the two analytical methods we rely on, SVD and PQ reconstruction.10 In this case, the goal is to provide valid movie recommendations for Netflix users given the ratings they have provided for various other movies.

The input to the analytical framework is a sparse matrix A, the utility matrix, with one row per user and one column per movie. The elements of A are the ratings that users have assigned to movies. Each user has rated only a small subset of movies; this is especially true for new users, who might only have a handful of ratings, or even none. Although techniques exist that address the cold-start problem (that is, providing recommendations to a completely fresh user with no ratings), we focus here on users for whom the system has some minimal input.


If we can estimate the values of the missing ratings in the sparse matrix A, we can make movie recommendations; that is, we can suggest that users watch the movies for which the recommendation system estimates, with high confidence, that they will give high ratings.

The first step is to apply SVD, a matrix factorization method used for dimensionality reduction and similarity identification.

Research Related to Paragon

We discuss work relevant to Paragon in the areas of datacenter scheduling, virtual machine (VM) management, workload rightsizing, and scheduling for heterogeneous multicore chips.

Datacenter scheduling

Recent work on datacenter scheduling has highlighted the importance of platform heterogeneity and workload interference. Mars et al. showed that the performance of Google workloads can vary by up to 40 percent because of heterogeneity, even when considering only two server configurations, and by up to 2 times because of interference, even when considering only two colocated applications.1,2 Govindan et al. also present a scheme to quantify the effects of cache interference between consolidated workloads.3 In Paragon, we extend the concepts of heterogeneity- and interference-aware scheduling by providing an online, scalable, and low-overhead methodology that accurately classifies applications for both heterogeneity and interference across multiple resources.

VM management

Systems such as vSphere (http://www.vmware.com/products/vsphere) or the VM platforms on public cloud providers can schedule diverse workloads submitted by users on the available servers. In general, these platforms account for application resource requirements that they expect the user to express or they learn over time by monitoring workload execution. Paragon can complement such systems by making scheduling decisions on the basis of heterogeneity and interference and detecting when an application should be considered for rescheduling.

Resource management and rightsizing

There has been significant work on resource allocation in virtualized and nonvirtualized large-scale datacenters. Mesos performs resource allocation between distributed computing frameworks such as Hadoop or Spark.4 Rightscale (http://www.rightscale.com) automatically scales out three-tier applications to react to changes in the load in Amazon's cloud service. DejaVu serves a similar goal by identifying a few workload classes and, based on them, reusing previous resource allocations to minimize reallocation overheads.5 In general, Paragon is complementary to rightsizing systems. Once such a system determines the amount of resources needed by an application, Paragon can classify and schedule it on the proper hardware platform in a way that minimizes interference.

Scheduling for heterogeneous multicore chips

Scheduling in heterogeneous CMPs shares some concepts and challenges with scheduling in heterogeneous datacenters; thus, some of the ideas in Paragon can be applied in heterogeneous CMP scheduling as well. Shelepov et al. present a scheduler for heterogeneous CMPs that is simple and scalable,6 whereas Craeynest et al. use performance statistics to estimate which workload-to-core mapping is likely to provide the best performance.7 Given the increasing number of cores per chip and coscheduled tasks, techniques similar to the ones used in Paragon can be applicable when deciding how to schedule applications in heterogeneous CMPs as well.

References

1. J. Mars, L. Tang, and R. Hundt, "Heterogeneity in 'Homogeneous' Warehouse-Scale Computers: A Performance Opportunity," IEEE Computer Architecture Letters, vol. 10, no. 2, 2011, pp. 29-32.
2. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
3. S. Govindan et al., "Cuanta: Quantifying Effects of Shared On-Chip Resource Interference for Consolidated Virtual Machines," Proc. 2nd ACM Symp. Cloud Computing, 2011, article no. 22.
4. B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," Proc. 8th USENIX Conf. Networked Systems Design and Implementation, 2011, article no. 22.
5. N. Vasic et al., "DejaVu: Accelerating Resource Allocation in Virtualized Environments," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 423-436.
6. D. Shelepov et al., "HASS: A Scheduler for Heterogeneous Multicore Systems," ACM SIGOPS Operating Systems Rev., vol. 43, no. 2, 2009, pp. 66-75.
7. K. Craeynest et al., "Scheduling Heterogeneous Multi-Cores through Performance Impact Estimation (PIE)," Proc. 39th Ann. Int'l Symp. Computer Architecture (ISCA 12), 2012, pp. 213-224.


Factoring A produces the decomposition to the following matrices of left (U) and right (V) singular vectors and the diagonal matrix of singular values (Σ):

\[
A_{m \times n} =
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix}
= U \cdot \Sigma \cdot V^{T}, \quad \text{where}
\]

\[
U_{m \times r} =
\begin{pmatrix}
u_{1,1} & \cdots & u_{1,r} \\
\vdots  & \ddots & \vdots  \\
u_{m,1} & \cdots & u_{m,r}
\end{pmatrix},
\quad
V_{n \times r} =
\begin{pmatrix}
v_{1,1} & \cdots & v_{1,r} \\
\vdots  & \ddots & \vdots  \\
v_{n,1} & \cdots & v_{n,r}
\end{pmatrix},
\quad
\Sigma_{r \times r} =
\begin{pmatrix}
\sigma_{1} & \cdots & 0 \\
\vdots     & \ddots & \vdots \\
0          & \cdots & \sigma_{r}
\end{pmatrix}
\]

Dimension r is the rank of matrix A, and it represents the number of similarity concepts identified by SVD. For instance, one similarity concept might be that certain movies belong to the drama category, while another might be that most users who liked the movie The Lord of the Rings: The Fellowship of the Ring also liked The Lord of the Rings: The Two Towers. Similarity concepts are represented by the singular values σ_i in matrix Σ, and the confidence in a similarity concept by the magnitude of the corresponding singular value. Singular values in Σ are ordered by decreasing magnitude. Matrix U captures the strength of the correlation between a row of A and a similarity concept. In other words, it expresses how users relate to similarity concepts such as the one about liking drama movies. Matrix V captures the strength of the correlation of a column of A to a similarity concept. In other words, to what extent does a movie fall in the drama category? The complexity of performing SVD on an m × n matrix is min(n²m, m²n). SVD is robust to missing entries and imposes relaxed sparsity constraints to provide accuracy guarantees.
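To make the SVD step concrete, here is a small numpy sketch (ours, not from the paper) that factors a toy utility matrix and keeps only the r strongest similarity concepts; the matrix values and the choice of r = 2 are made up for illustration.

```python
import numpy as np

# Toy utility matrix: rows are users (or applications), columns are movies
# (or server configurations); entries are ratings/performance scores.
A = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Each singular value corresponds to a "similarity concept"; its magnitude
# is the confidence in that concept. Keep only the r strongest concepts.
r = 2
A_approx = (U[:, :r] * sigma[:r]) @ Vt[:r, :]   # rank-r approximation of A

print(np.round(sigma, 2))      # concept strengths, in decreasing order
print(np.round(A_approx, 2))   # A reconstructed from the top-r concepts
```

Truncating to the strongest singular values is what filters out weak similarity concepts before they can influence decisions.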

Before we can make accurate score estimations using SVD, we need the full utility matrix A. To recover the missing entries in A, we use PQ reconstruction. Building from the decomposition of the initial sparse A matrix, we have Q_{m×r} = U and P^T_{r×n} = Σ · V^T. The product of Q and P^T gives matrix R, which is an approximation of A with the missing entries. To improve R, we use stochastic gradient descent (SGD), a scalable and lightweight latent-factor model that iteratively recreates A. For every element r_{ui} of the reconstructed matrix R:

\[
\epsilon_{ui} = r_{ui} - q_{i} \cdot p_{u}^{T}
\]
\[
q_{i} \leftarrow q_{i} + \eta \, (\epsilon_{ui} \, p_{u} - \lambda \, q_{i})
\]
\[
p_{u} \leftarrow p_{u} + \eta \, (\epsilon_{ui} \, q_{i} - \lambda \, p_{u})
\]

until the error \( \lVert \epsilon \rVert_{L2} = \sqrt{\sum_{u,i} \lvert \epsilon_{ui} \rvert^{2}} \) becomes marginal.

In this process, η is the learning rate and λ is the regularization factor. The complexity of PQ reconstruction is linear in the number of r_{ui} and in practice takes up to a few milliseconds for matrices whose m and n equal about 1,000. Once the dense utility matrix R is recovered, we can make movie recommendations. This involves applying SVD to R to identify which of the reconstructed entries reflect strong similarities that enable making accurate recommendations with high confidence.
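The following Python sketch shows one way to implement this SGD-based PQ reconstruction on a toy utility matrix. It is a minimal illustration under our own assumptions (matrix size, rank, learning rate, regularization, and stopping threshold are invented), not Paragon's implementation.

```python
import numpy as np

def pq_reconstruct(A, mask, rank=2, lr=0.01, reg=0.1, epochs=500, seed=0):
    """SGD-based PQ reconstruction: approximate the sparse A as Q @ P.T.

    A    : (m, n) array; values where mask is False are ignored (missing).
    mask : (m, n) boolean array, True where a rating is known.
    Q plays the role of U and P that of (Sigma V^T) in the text.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Q = rng.normal(scale=0.1, size=(m, rank))   # per-row factors
    P = rng.normal(scale=0.1, size=(n, rank))   # per-column factors
    known = np.argwhere(mask)
    for _ in range(epochs):
        rng.shuffle(known)
        for u, i in known:
            err = A[u, i] - Q[u] @ P[i]          # epsilon_ui
            qu = Q[u].copy()
            Q[u] += lr * (err * P[i] - reg * Q[u])
            P[i] += lr * (err * qu - reg * P[i])
        # stop once the L2 norm of the error over known entries is marginal
        total = np.sqrt(sum((A[u, i] - Q[u] @ P[i]) ** 2 for u, i in known))
        if total < 1e-3:
            break
    return Q @ P.T   # dense reconstruction R

# Toy utility matrix: 4 users x 3 items, 0.0 marks a missing rating.
A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 1.0, 5.0],
              [1.0, 1.0, 0.0]])
R = pq_reconstruct(A, A > 0)
print(np.round(R, 2))   # estimated ratings, including the missing entries
```

In Paragon's setting, the rows would be applications and the columns server configurations or sources of interference rather than users and movies.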

Classification for heterogeneity

We use collaborative filtering to identify how well a previously unknown workload will run on different hardware platforms. The rows in matrix A represent applications, the columns represent server configurations (SCs), and the ratings represent normalized application performance on each SC. As part of an offline step, we select a small number of applications and profile them on all the different SCs. This provides some initial information to the classification engine to address the cold-start problem that would otherwise occur. It only needs to happen once in the system.

During regular operation, when an application arrives, we profile it for 1 minute on any two SCs, insert it as a new row in matrix A, and use the process described previously to derive the missing ratings for the other server configurations.


In this case, Σ represents similarity concepts such as the fact that applications that benefit from SC1 will also benefit from SC3. U captures how an application correlates to the different similarity concepts, and V shows how an SC correlates to them. Collaborative filtering identifies similarities between new and known applications. Two applications can be similar in one characteristic (for instance, they both benefit from high clock frequency) but different in others (for example, only one benefits from a large L3 cache). This is especially common when scaling to large application spaces and hardware configurations. SVD addresses this issue by uncovering hidden similarities and filtering out the ones less likely to have an impact on the application's behavior.

As incoming applications are added in A, the density of the matrix increases and the recommendation accuracy improves. Note that online training is performed only on two SCs. This reduces the training overhead and the number of servers needed for it compared to exhaustive search. In contrast, if we attempted an exhaustive application profiling, the number of profiling runs would equal the number of SCs. For a cloud service with high workload arrival rates, this would be infeasible to support. On a production-class Xeon server, classification takes 10 to 30 milliseconds for thousands of applications and tens of SCs. We can perform classification for one application at a time or for small groups of incoming applications (batching) if the arrival rate is high without impacting accuracy or speed.
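As an illustration of this workflow, the sketch below profiles a hypothetical new application on two randomly chosen SCs, inserts it as a sparse row of an apps-by-SCs utility matrix, fills in the missing scores, and ranks the SCs. For brevity, a simple iterative low-rank (SVD-based) imputation stands in for the article's SVD + PQ reconstruction, and all names, scores, and the five-SC setup are invented.

```python
import numpy as np

def lowrank_impute(A, mask, rank=2, iters=50):
    """Fill missing entries (mask == False) using iterative rank-r SVD.

    A simple stand-in for Paragon's SVD + PQ reconstruction step.
    """
    filled = A.copy()
    col_means = A.sum(0) / np.maximum(mask.sum(0), 1)       # known values only
    filled[~mask] = np.broadcast_to(col_means, A.shape)[~mask]   # initial guess
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        filled[~mask] = approx[~mask]   # keep known scores, refine missing ones
    return filled

# Offline-profiled applications: normalized performance on each of 5 SCs.
known_apps = np.array([[1.0, 0.8, 0.6, 0.9, 0.7],
                       [0.5, 1.0, 0.9, 0.6, 0.8],
                       [0.7, 0.6, 1.0, 0.8, 0.9]])

# New application: profiled for ~1 minute on two randomly chosen SCs only.
new_row = np.full(5, np.nan)
new_row[[1, 3]] = [0.9, 0.7]            # measured (normalized) performance

A = np.vstack([known_apps, new_row])
mask = ~np.isnan(A)
A[~mask] = 0.0

estimates = lowrank_impute(A, mask)[-1]        # estimated scores for the new app
ranking = np.argsort(estimates)[::-1]          # SC indices, best to worst
print("estimated per-SC performance:", np.round(estimates, 2))
print("best server configuration: SC", ranking[0])   # 0-based SC index
```

In the real system, the reconstruction also yields confidence information; this sketch only ranks the SCs.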

Performance scores. We use the following performance metrics according to the application type:

- Single-threaded workloads: We use instructions committed per second (IPS) as the initial performance metric. Using execution time would require running applications to completion during profiling, increasing overheads. We have verified that IPS leads to similar classification accuracy as using time to completion. For multiprogrammed workloads, we use aggregate IPS.
- Multithreaded workloads: In the presence of spinlocks or other synchronization schemes, IPS can be deceptive. We address this by detecting active waiting and weighting such execution segments out of the IPS computation (a short sketch of this computation follows below). We verified that using this "useful" IPS leads to similar classification accuracy as using the full execution time.

The choice of IPS is influenced by our current evaluation, which focuses on single-node CPU-, memory-, and I/O-intensive programs. The same methodology can be extended to higher-level metrics, such as queries per second (QPS), which cover complex multitier workloads as well.
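As a sketch of the "useful" IPS idea for multithreaded workloads, the hypothetical helper below drops profiling intervals flagged as active waiting (for example, spinning on a lock) from the IPS computation; the interval format and the spin detector are assumptions, not part of Paragon.

```python
def useful_ips(intervals):
    """Compute IPS over profiling intervals, excluding active-waiting time.

    `intervals` is a list of (instructions, seconds, is_active_wait) tuples,
    e.g. produced by a hypothetical profiler sampling hardware counters.
    """
    instructions = sum(instr for instr, _, waiting in intervals if not waiting)
    seconds = sum(sec for _, sec, waiting in intervals if not waiting)
    return instructions / seconds if seconds > 0 else 0.0

# Example: the second interval is spent spinning on a lock and is ignored.
samples = [(2.0e9, 1.0, False), (1.9e9, 1.0, True), (2.2e9, 1.0, False)]
print(f"useful IPS: {useful_ips(samples):.2e}")
```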

Validation. We evaluate the accuracy of heterogeneity classification on a 40-server cluster with 10 SCs with a large set of diverse applications. The offline training set includes 20 randomly selected applications. Using the classification output for scheduling improves performance by 24 percent for single-threaded workloads, 20 percent for multithreaded workloads, 38 percent for multiprogrammed workloads, and 40 percent for I/O workloads, on average, while some applications have a 2× performance difference. Table 1 summarizes key statistics on the validation study. It is important to note that the accuracy does not depend on the SCs selected for training, which matched the top-performing configuration only for 20 percent of workloads. We also compare performance predicted by the recommendation system to performance obtained through experimentation. The deviation is 3.8 percent on average.

Classification for interference

We are interested in two types of interference: that which an application can tolerate from preexisting load on a server, and that which the application will cause on that load. We detect interference due to contention and assign a score to the sensitivity of an application to a type of interference. To derive sensitivity scores, we develop several microbenchmarks (sources of interference, or SoIs), each stressing a specific shared resource with tunable intensity.11 SoIs span the core, memory, and cache hierarchy and network and storage bandwidth.


We run an application concurrently with a microbenchmark and progressively tune up its intensity until the application violates its QoS. Applications with high tolerance to interference (for example, a sensitivity score over 60 percent) are easier to coschedule than applications with low tolerance. Similarly, we detect the sensitivity of a microbenchmark to the interference the application causes by tuning up its intensity and recording when the microbenchmark's performance degrades by 5 percent compared to its performance in isolation. In this case, high sensitivity scores correspond to applications that cause a lot of interference in the specific shared resource.
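The tolerated-interference measurement can be sketched as the loop below; run_colocated is a hypothetical hook standing in for co-running the application with an iBench-like microbenchmark and measuring its performance, and the 95 percent QoS threshold and 5 percent intensity step are illustrative. The caused-interference score is obtained with a symmetric loop that watches the microbenchmark's performance instead, as described above.

```python
def tolerated_interference(app, soi, run_colocated, qos_fraction=0.95, step=5):
    """Sensitivity of `app` to interference in the shared resource `soi`.

    run_colocated(app, soi, intensity) is a hypothetical hook that co-runs
    the application with the SoI microbenchmark at the given intensity
    (0-100%) and returns the app's performance normalized to isolation.
    Returns the highest intensity the application tolerates while still
    meeting its QoS (assumed here to be 95% of isolated performance).
    """
    tolerated = 0
    for intensity in range(step, 101, step):
        if run_colocated(app, soi, intensity) < qos_fraction:
            break            # QoS violated: the previous intensity is the score
        tolerated = intensity
    return tolerated

# Example with a fake measurement hook: the app meets QoS up to 60% intensity.
fake_hook = lambda app, soi, x: 1.0 if x <= 60 else 0.9
print(tolerated_interference("app-A", "L3-cache", fake_hook))   # -> 60
```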

Collaborative filtering for interference. We classify applications for interference tolerated and caused, using twice the process described earlier. The two utility matrices have applications as rows and SoIs as columns. The elements of the matrices are the sensitivity scores of an application to the corresponding microbenchmark. Similarly to classification for heterogeneity, we profile a few applications offline against all SoIs and insert them as dense rows in the utility matrices. In the online mode, each new application is profiled against two randomly chosen microbenchmarks for one minute, and its sensitivity scores are added in a new row in each of the matrices. Then, we use SVD and PQ reconstruction to derive the missing entries and the confidence in each similarity concept.

Validation. We evaluated the accuracy of interference classification using the same workloads and systems as before. Table 2 summarizes key statistics on the classification quality. The average error in estimating both tolerated and caused interference across SoIs is 5.3 percent. For high values of sensitivity (that is, applications that tolerate and cause a lot of interference), the error is even lower (3.4 percent).

Putting it all together

Overall, Paragon requires two short runs (approximately 1 minute) on two SCs to classify incoming applications for heterogeneity. Another two short runs against two microbenchmarks on a high-end SC are needed for interference classification.

Table 1. Validation of heterogeneity classification.

Metric                                        Single-threaded (%)   Multithreaded (%)   Multiprogrammed (%)   I/O-bound (%)
Selected best platform                                 86                  86                  83                  89
Selected platform within 5% of best                    91                  90                  89                  92
Correct platform ranking (best to worst)               67                  62                  59                  43
90% correct platform ranking                           78                  71                  63                  58
Training and best selected platform match              28                  24                  18                  22

Table 2. Validation of interference classification.

Metric                                                                    Percentage (%)
Average estimation error of sensitivity across all examined resources          5.3
Average estimation error for sensitivities > 60%                               3.4
Applications with < 5% estimation error                                       59.0
Resource with highest estimation error: L1 instruction cache                  15.8
Frequency L1 instruction cache used for training                              14.6
Resource with lowest estimation error: storage bandwidth                       0.9


Running for 1 minute provides some signal on the new workload without introducing significant profiling overheads. In our full paper,5 we discuss the issue of workload phases (that is, transient effects that do not appear in the 1-minute profiling period). Next, we use collaborative filtering to classify the application in terms of heterogeneity and interference. This requires a few milliseconds even when considering thousands of applications and several tens of SCs or SoIs. Classification for heterogeneity and interference is performed in parallel. For the applications we considered, the overall profiling and classification overheads are 1.2 and 0.09 percent on average.
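A minimal sketch of how the two classification steps could be overlapped, illustrating the "performed in parallel" point rather than Paragon's actual pipeline; both classifier functions are hypothetical stubs.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_heterogeneity(profile):
    # placeholder: would run SVD + PQ on the apps-by-SCs utility matrix
    return {"best_sc": "SC3"}

def classify_interference(profile):
    # placeholder: would run SVD + PQ on the two apps-by-SoIs matrices
    return {"tolerated": {"L3-cache": 60}, "caused": {"L3-cache": 20}}

def classify(profile):
    """Run both classifications concurrently; each takes milliseconds."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        het = pool.submit(classify_heterogeneity, profile)
        intf = pool.submit(classify_interference, profile)
        return het.result(), intf.result()

print(classify({"app": "new-workload", "ips_on_profiled_scs": [2.1e9, 1.7e9]}))
```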

Using analytical methods for classification has two benefits. First, we have strong analytical guarantees on the quality of the information used for scheduling, instead of relying mainly on empirical observation. The analytical framework provides low and tight error bounds on the accuracy of classification, statistical guarantees on the quality of colocation candidates, and detailed characterization of system behavior. Moreover, the scheduler design is workload independent, which means that the properties the scheme provides hold for any workload. Second, these methods are computationally efficient, scale well with the number of applications and SCs, and do not introduce significant scheduling overheads.

Paragon

Once an incoming application is classified with respect to heterogeneity and interference, Paragon schedules it on one of the available servers. The scheduler attempts to assign each workload to the server of the best SC and colocate it with applications so that interference is minimized for workloads running on the same server.

Scheduler design

Figure 1 presents an overview of Paragon's components and operation. The scheduler maintains per-application and per-server state. The per-application state includes the classification information; for a datacenter with 10 SCs and 10 SoIs, it is 64 bytes per application. The per-server state records the IDs of applications running on a server and the cumulative sensitivity to interference (roughly 64 bytes per server). The per-server state is updated as applications are scheduled and, later on, completed. Overall, state overheads are marginal and scale logarithmically or linearly with the number of applications (N) and servers (M). In our experiments with thousands of applications and servers, a single server could handle all processing and storage requirements of scheduling, although additional servers can be used for fault tolerance.
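The per-application and per-server bookkeeping described above might look like the following hypothetical Python dataclasses. Field names follow the text; the exact 64-byte encodings are not reproduced, and the aggregation rules (sum for caused interference, minimum for tolerated interference) anticipate the greedy selection discussed next.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AppState:
    """Per-application state: the output of the classification engine."""
    app_id: int
    sc_ranking: List[str]        # server configurations, best first
    tolerated: Dict[str, int]    # SoI -> tolerated-interference score (0-100)
    caused: Dict[str, int]       # SoI -> caused-interference score (0-100)

@dataclass
class ServerState:
    """Per-server state, updated as applications are scheduled and complete."""
    server_id: int
    sc: str                                            # this machine's configuration
    running: List[int] = field(default_factory=list)   # IDs of resident applications
    caused_sum: Dict[str, int] = field(default_factory=dict)      # sum over residents
    tolerated_min: Dict[str, int] = field(default_factory=dict)   # min over residents

    def admit(self, app: AppState) -> None:
        """Account for a newly scheduled application."""
        self.running.append(app.app_id)
        for soi, score in app.caused.items():
            self.caused_sum[soi] = self.caused_sum.get(soi, 0) + score
        for soi, score in app.tolerated.items():
            self.tolerated_min[soi] = min(self.tolerated_min.get(soi, 100), score)
```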

Greedy server selection

In examining candidates, the scheduler considers two factors: first, which assignments minimize negative interference between the new application and existing load, and second, which servers have the best SC for this workload.

Figure 1. The components of Paragon and the state maintained by each component. Overall, the state requirements are marginal and scale linearly or logarithmically with the number of applications (N), servers (M), and configurations. (PQ: PQ reconstruction; SVD: singular value decomposition; DC: datacenter.) The figure shows Step 1, application classification for heterogeneity and interference using SVD and PQ reconstruction, and Step 2, server selection (selection of colocation candidates among the datacenter servers).


The scheduler evaluates two metrics, D1 = t_server − c_newapp and D2 = t_newapp − c_server, where t is the sensitivity score for tolerated and c for caused interference for a specific SoI. The cumulative sensitivity of a server to caused interference is the sum of sensitivities of individual applications running on it, whereas the sensitivity to tolerated interference is the minimum of these values. The optimal candidate is a server for which D1 and D2 are exactly zero for all SoIs, which implies no negative impact from interference and perfect resource usage. In practice, a good selection is one where D1 and D2 are positive and small for all SoIs. Large, positive values for D1 and D2 indicate suboptimal resource utilization. Negative values for D1 or D2 imply violation of QoS.

We examine candidate servers for an application in the following way. The process is explained for interference tolerated by the server and caused by the new workload (D1) and is exactly the same for D2. We start from the resource the new application is most sensitive to. We select the server set for which D1 is non-negative for this SoI. Next, we examine the second SoI in order of decreasing sensitivity scores, filtering out any servers for which D1 is negative, until all SoIs have been examined. Then, we take the intersection of server sets for D1 and D2 and select the machine with the best SC and with min ||D1 + D2||_L1.

As we filter out servers, at some point the set of candidate servers might become empty. This implies that there is no single server for which D1 and D2 are non-negative for some SoI. Although unlikely, we support this event with backtracking and QoS relaxation. Given M servers, the worst-case complexity is O(M × SoI²), because, theoretically, backtracking might extend all the way to the first SoI. In practice, however, we observe that for a 1,000-server system, 89 percent of applications were scheduled without any backtracking. For 8 percent of the remaining applications, backtracking led to negative D1 or D2 for a single SoI (and for 3 percent for multiple SoIs). Additionally, we bound the runtime of the greedy search using a timeout mechanism, after which the best server from the ones already examined is selected.
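The sketch below illustrates the D1/D2 procedure with plain dictionaries: it filters servers whose per-SoI D1 and D2 are non-negative and then picks the candidate with the best SC for the application and the smallest L1 norm of D1 + D2. Backtracking, QoS relaxation, and the timeout are omitted, and the tie-breaking between best SC and smallest norm is one reasonable reading of the text, not Paragon's exact policy.

```python
def pick_server(app, servers):
    """Greedy server selection following the D1/D2 procedure in the text.

    app:     {"caused": {soi: s}, "tolerated": {soi: s}, "sc_pref": [sc, ...]}
             where sc_pref lists server configurations best-first (from the
             heterogeneity classification).
    servers: list of {"id": ..., "sc": ...,
                      "tolerated": {soi: s},   # min over resident apps
                      "caused":    {soi: s}}   # sum over resident apps
    """
    candidates = []
    for srv in servers:
        d1 = {soi: srv["tolerated"][soi] - app["caused"][soi]
              for soi in app["caused"]}        # interference the app causes
        d2 = {soi: app["tolerated"][soi] - srv["caused"][soi]
              for soi in app["tolerated"]}     # interference the app suffers
        if all(v >= 0 for v in d1.values()) and all(v >= 0 for v in d2.values()):
            l1 = sum(d1[soi] + d2[soi] for soi in d1)    # ||D1 + D2||_L1
            candidates.append((srv, l1))
    if not candidates:
        return None   # the full system would backtrack and relax QoS here
    # Prefer the best server configuration for this app, then the smallest
    # positive slack, which keeps utilization high without violating QoS.
    sc_rank = {sc: i for i, sc in enumerate(app["sc_pref"])}
    best, _ = min(candidates,
                  key=lambda c: (sc_rank.get(c[0]["sc"], len(sc_rank)), c[1]))
    return best

# Tiny example with one SoI and two servers of different configurations.
app = {"caused": {"L3": 20}, "tolerated": {"L3": 60}, "sc_pref": ["SC2", "SC1"]}
servers = [
    {"id": 0, "sc": "SC1", "tolerated": {"L3": 30}, "caused": {"L3": 10}},
    {"id": 1, "sc": "SC2", "tolerated": {"L3": 50}, "caused": {"L3": 40}},
]
print(pick_server(app, servers)["id"])   # -> 1 (SC2 fits and is preferred)
```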

Our full paper includes a discussion on workload phases and applicability to multitier latency-critical applications.5

Evaluation methodology

In the following paragraphs, we describe the server systems, alternative schedulers, applications, and workload scenarios used in our evaluation.

We evaluated Paragon on a 1,000-server cluster on Amazon EC2 with 14 instance types from small to extra large.12 All instances were exclusive (reserved)—that is, no other users had access to the servers. There were no external scheduling decisions or actions such as auto-scaling or workload migration during the course of the experiments.

We compared Paragon to three schedulers. The first is a baseline scheduler that assigns applications to least-loaded (LL) machines, accounting for their core and memory requirements but ignoring their heterogeneity and interference profiles. The second is a heterogeneity-oblivious (NH) scheme that uses the interference classification in Paragon to assign applications to servers without visibility into their SCs. The third is an interference-oblivious (NI) scheme that uses the heterogeneity classification but has no insight on workload interference.

We used 400 single-threaded (ST), multithreaded (MT), and multiprogrammed (MP) applications from SPEC CPU2006, several multithreaded benchmark suites,5 and SPECjbb. For multiprogrammed workloads, we created 350 mixes of four SPEC applications. We also used 26 I/O-bound workloads in Hadoop and Matlab running on a single node. Workload durations range from minutes to hours. For workload scenarios with more than 426 applications, we replicated these workloads with equal likelihoods (1/4 ST, 1/4 MT, 1/4 MP, and 1/4 I/O) and randomized their interleaving.


We used the applications listed in this section to examine the following scenarios: a low-load scenario with 2,500 randomly chosen applications submitted at 1-second intervals, a high-load scenario with 5,000 applications submitted at 1-second intervals, and an oversubscribed scenario where 7,500 workloads are submitted at 1-second intervals and an additional 1,000 applications arrive in a burst (less than 0.1-second intervals) after the first 3,750 workloads.

Evaluation

We evaluated the Paragon scheduler against the LL, NH, and NI schedulers, with respect to performance, decision quality, resource allocation, and cluster utilization.

Performance impact

Figure 2 shows the performance for the three workload scenarios on the 1,000-server EC2 cluster. The low-load scenario, in general, does not create significant performance challenges. Nevertheless, Paragon outperforms the other three schemes; it preserves QoS for 91 percent of workloads and achieves on average 96 percent of the performance of a workload running in isolation in the best SC. When moving to the high-load scenario, the difference between schedulers becomes more obvious.

Figure 2. Performance comparison between the four schedulers for three workload scenarios (low load, high load, and oversubscribed) on 1,000 Amazon Elastic Compute Cloud (EC2) servers. Performance is normalized to optimal performance in isolation (alone on the best platform), and applications are ordered from worst to best performing. (Schedulers shown: no heterogeneity (NH), no interference (NI), least loaded (LL), and Paragon (P).)


Although the heterogeneity-oblivious and interference-oblivious schemes degrade performance by an average of 22 and 34 percent and violate QoS for 96 and 97 percent of workloads, respectively, Paragon degrades performance by only 4 percent and guarantees QoS for 61 percent of workloads. The least-loaded scheduler degrades performance by 48 percent on average, with some applications not terminating successfully. The differences in performance are larger for workloads submitted when the system is heavily loaded.

Finally, for the oversubscribed case, NH, NI, and LL dramatically degrade performance for most workloads, while the number of applications that do not terminate successfully increases to 10.4 percent for LL. Paragon, on the other hand, preserves QoS guarantees for 52 percent of workloads, while the other schedulers provide similar guarantees only for 5, 1, and 0.09 percent of workloads, respectively. Additionally, it limits degradation to less than 10 percent for an additional 33 percent of applications and maintains moderate performance degradation (no cliffs in performance similar to NH for applications 1 through 1,000).

Decision quality

Figure 3 shows a breakdown of the decision quality of the different schedulers for heterogeneity (left) and interference (right) across the three scenarios. LL induces more than 20 percent performance degradation to most applications, both due to heterogeneity and interference. NH has low decision quality in terms of platform selection, whereas NI causes performance degradation by colocating unsuitable applications. The errors increase as we move to scenarios of higher load. Paragon decides optimally for 65 percent of applications for heterogeneity and 75 percent for interference, on average, significantly higher than the other schedulers. It also constrains decisions that lead to larger than 20 percent degradation to less than 8 percent of workloads.

Resource allocation

Figure 4 shows why this deviation exists. The solid black line in each graph represents the required core count based on the applications running at a snapshot of the system, while the other lines show the allocated cores by each of the schedulers. Because Paragon optimizes for increased utilization within QoS constraints, it follows the application requirements closely. It only deviates when the required core count exceeds the resources available in the system (oversubscribed case). NH has mediocre accuracy, whereas NI and LL either significantly overprovision the number of allocated cores, or oversubscribe certain servers.

Figure 3. Breakdown of decision quality for the four schedulers across the three EC2 scenarios (low load, high load, and oversubscribed). Different colors correspond to different impacts on application performance (no degradation, less than 10 percent, less than 20 percent, and more than 20 percent degradation) in terms of heterogeneity (left) and interference (right).


There are two important points in these graphs. First, as the load increases, the deviation of execution time from optimal increases for NH, NI, and LL, whereas Paragon approximates it closely. Second, for high loads, the errors in core allocation increase dramatically for the other three schedulers, whereas for Paragon the average deviation remains approximately constant, excluding the part where the system is oversubscribed.

Cluster utilization

Figure 5 shows the cluster utilization in the high-load scenario for LL and Paragon in the form of heat maps.

Figure 4. Resource allocation for the three workload scenarios (low load, high load, and oversubscribed). Each line corresponds to the number of allocated computing cores at each point during the execution of the scenario (required cores, NH, NI, LL, and Paragon). Although the heterogeneity-oblivious (NH), interference-oblivious (NI), and least-loaded (LL) schedulers under- or overestimate the required resources, Paragon closely follows the application resource requirements.

Figure 5. CPU utilization heat maps for the high-load scenario for the least-loaded scheduler and Paragon. Utilization is averaged across the cores of a server and is sampled every 5 seconds. Darker colors correspond to higher CPU utilization.


Utilization is shown for each individual server throughout the duration of the experiment and is averaged across the server's cores every 5 seconds. Whereas with LL utilization does not exceed 20 percent for the majority of time, Paragon achieves an average utilization of 52 percent. Additionally, as workloads run closer to their QoS requirements, the scenario completes in 19 percent less time.

The Paragon scheduler moves away from the traditional empirical design approach in computer architecture and systems and adopts a more data-driven approach. In the past few years, we have entered an era where data has become so vast and rich that it can provide much better (and faster) insight on design decisions than the traditional trial-and-error approach can. Applying such techniques in datacenter scheduling with significant gains is proof of the value of using data to drive system design and management decisions. There are other highly dimensional problems where similar techniques can prove effective, such as the large design-space explorations for either processors13 or memory systems, or the more general cluster management problem in cloud providers. The latter becomes increasingly challenging because many cloud applications are multitier workloads with complex dependencies and they must satisfy strict tail latency guarantees. Additionally, issues like heterogeneity and interference are not relevant only to datacenters. Systems of all scales, from low-power mobile to traditional CMPs and large-scale cloud computing facilities, face similar challenges, which makes employing techniques that work online, are fast, and can handle huge spaces a pressing need.

Determining which data can offer valuable insights in system decisions, and designing efficient techniques to collect and mine it in a way that leverages its nature and characteristics, is a significant challenge moving forward.


Acknowledgments

We sincerely thank John Ousterhout, Mendel Rosenblum, Byung-Gon Chun, Daniel Sanchez, Jacob Leverich, David Lo, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was partially supported by a Google-directed research grant on energy-proportional computing. Christina Delimitrou was supported by a Stanford Graduate Fellowship.

References

1. L.A. Barroso and U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool, 2009.
2. J. Rabaey et al., "Beyond the Horizon: The Next 10x Reduction in Power—Challenges and Solutions," Proc. IEEE Int'l Solid-State Circuits Conf., 2011, doi:10.1109/ISSCC.2011.5746206.
3. L. Barroso, "Warehouse-Scale Computing: Entering the Teenage Decade," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA 11), 2011.
4. D. Meisner et al., "Power Management of Online Data-Intensive Services," Proc. 38th Ann. Int'l Symp. Computer Architecture (ISCA 11), 2011, pp. 319-330.
5. C. Delimitrou and C. Kozyrakis, "Paragon: QoS-Aware Scheduling in Heterogeneous Datacenters," Proc. 18th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 13), 2013, pp. 77-88.
6. R.M. Bell, Y. Koren, and C. Volinsky, The BellKor 2008 Solution to the Netflix Prize, tech. report, AT&T Labs, Oct. 2007.
7. J. Mars et al., "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2011, pp. 248-259.
8. R. Nathuji, C. Isci, and E. Gorbatov, "Exploiting Platform Heterogeneity for Power Efficient Data Centers," Proc. 4th Int'l Conf. Autonomic Computing (ICAC 07), 2007, doi:10.1109/ICAC.2007.16.
9. N. Vasic et al., "DejaVu: Accelerating Resource Allocation in Virtualized Environments," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 423-436.


10. A. Rajaraman and J.D. Ullman, Mining of Massive Datasets, Cambridge Univ. Press, 2011.
11. C. Delimitrou and C. Kozyrakis, "iBench: Quantifying Interference for Datacenter Workloads," Proc. IEEE Int'l Symp. Workload Characterization, 2013, pp. 23-33.
12. C. Delimitrou and C. Kozyrakis, "QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon," ACM Trans. Computer Systems, vol. 31, no. 4, 2013, article no. 12.
13. O. Azizi et al., "Energy Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), 2010, pp. 26-36.

Christina Delimitrou is a PhD student in the Department of Electrical Engineering at Stanford University. Her research focuses on large-scale datacenters, specifically on scheduling and resource allocation techniques with quality-of-service guarantees, practical cluster management systems that improve resource efficiency, and datacenter application analysis and modeling. Delimitrou has an MS in electrical engineering from Stanford University. She is a student member of IEEE and the ACM.

Christos Kozyrakis is an associate professor in the Departments of Electrical Engineering and Computer Science at Stanford University, where he investigates hardware architectures, system software, and programming models for systems ranging from cell phones to warehouse-scale datacenters. His research focuses on resource-efficient cloud computing, energy-efficient multicore systems, and architectural support for security. Kozyrakis has a PhD in computer science from the University of California, Berkeley. He is a senior member of IEEE and the ACM.

Direct questions and comments about this article to Christina Delimitrou, Gates Hall, 353 Serra Mall, Room 316, Stanford, CA 94305; [email protected].
