
Ensuring High-performance of Mission-critical Java Applications in Multi-tenant Cloud Platforms

Zhenyun Zhuang, Cuong Tran, Haricharan Ramachandra, Badri Sridharan
{zzhuang, ctran, hramachandra, bsridharan}@linkedin.com

LinkedIn Corporation, 2029 Stierlin Court, Mountain View, CA 94043, United States

Abstract—Cloud Computing promises a cost-effective and administration-effective solution to the traditional needs of computing resources. While bringing efficiency to users thanks to shared hardware and software, the multi-tenancy characteristics also bring unique challenges to the backend cloud platforms. In particular, the JVM mechanisms used by Java applications, coupled with OS-level features, give rise to a set of problems that are not present in other deployment scenarios.

In this work, we consider the problem of ensuring high performance of mission-critical Java applications in multi-tenant cloud environments. Based on our experiences with LinkedIn's platforms, we identify and solve a set of problems caused by multi-tenancy. We share the lessons and knowledge we learned during the course of this work.

Keywords—Linux performance; multi-tenancy; THP; Java; cloud computing

I. INTRODUCTION

Cloud Computing promises a cost-effective and administration-effective solution to the traditional needs of computing resources. While bringing efficiency to users thanks to shared hardware and software, the multi-tenancy characteristics also bring unique challenges to the backend cloud platforms. Cloud platforms need to deliver high performance to all users, subject to SLAs (Service Level Agreements). However, since multiple users share the same computing resources¹, it is important to prevent different users from adversely affecting each other.

Though the general problem of isolating users and ensuring high performance belongs to the broad topic of QoS (Quality of Service), as of today there are few satisfying solutions that isolate users while maintaining the cost-effectiveness of multi-tenancy². In many of today's multi-tenant backend platforms, such as many PaaS platforms, multiple users share the same OS (and in turn the memory and CPU resources). Reasons for adopting such a setup include ease of administration and deployment. In such multi-tenancy deployments, it is critical to protect users at the OS level to ensure high performance.

¹Depending on the specific cloud solution, users may use a mix of dedicated and shared resources.

²Some solutions provide hard isolation of resources by dedicating a specific set of hardware and OS to each user (e.g., through Virtual Machines); however, this incurs additional administration cost and compromises the benefits gained by resource sharing.

Different users impact each other when sharing limited computing resources, including memory and CPU. When users' consumption of computing resources is not well aligned due to independent activities and load, an individual user's performance may be compromised. For example, a user that suddenly and aggressively consumes memory (e.g., by reading files) may cause another user's pages to be swapped out; when the latter user resumes, its performance is significantly penalized by slow swap-in activity.

Given the popularity and power of Java platforms, a significant portion of today's backend platforms run Java. Though supporting multi-tenancy inside a single JVM has been proposed [1], it is yet to be verified and deployed. As of today, effectively supporting multi-tenancy with Java requires running multiple Java applications on the same platform. Java applications' performance depends heavily on the use of the JVM and heap spaces. We have observed in our production platforms that the interaction between different Java applications can cause severe performance degradation. Briefly, competition among multiple Java applications inflates application pauses in certain scenarios. Moreover, we have found that some new features of Linux platforms, though intended to improve application performance, can significantly backfire because of complicated interactions between these features.

In this work, we consider the problem of ensuring high performance of mission-critical Java applications in multi-tenant cloud environments. Based on our experiences with LinkedIn's multi-tenant platforms, we identify and solve a set of problems caused by multi-tenancy. In a nutshell, we consider the performance of Java applications and identify a set of Linux OS features that can adversely affect Java application performance in certain scenarios. After root-causing the problems, we provide solutions to mitigate the challenges. Though the solutions have been verified on the Linux OS, on which most LinkedIn products run, they can easily be applied to similar setups on other platforms to achieve better performance. More importantly, the insights gained could be incorporated into future designs of JVM or OS features.

For the remainder of the paper: after providing the necessary technical background in Section II, we motivate the problems addressed in this paper using two scenarios in Section III.


Figure 1. Scenario 1: Application performance during startup state. (a) JavaApp throughput (K allocations/sec); (b) JavaApp GC pauses (seconds).

We then present the investigations we conducted in diagnosing the performance problems in Section IV. Based on the findings, we propose solutions in Section V. We perform a performance evaluation and show the results in Section VI. In Section VII we highlight the lessons learned during our investigations with LinkedIn's platform. We present related work in Section VIII, and finally conclude the work in Section IX.

II. BACKGROUND

We begin by providing background information on JVM heap management and GC (Garbage Collection), Linux memory management, and the newer feature of THP (Transparent Huge Pages).

A. JVM heap management and GC

Java programs run in the JVM (Java Virtual Machine), and the area of memory used by the JVM for object allocation is referred to as the heap. The JVM heap is used for dynamic Java object allocation and is divided into generations (e.g., Young and Old). Java objects are first allocated in the Young generation; when no longer used, they are collected by a mechanism called GC. When GC occurs, objects are checked for reachability starting from a set of root objects; objects that are no longer reachable are collected and the corresponding memory space is reused. Some phases of the GC process require the application to stop responding to requests, a behavior commonly referred to as an STW (Stop-The-World) pause. One of the important objectives of Java performance work is to minimize the duration of GC pauses.
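As a minimal illustration (a sketch of our own, not the JavaApp used later in this paper), the program below continuously allocates short-lived objects; overwriting a slot drops the old reference, making that object collectable. Run with era-appropriate HotSpot flags such as -verbose:gc and -XX:+PrintGCApplicationStoppedTime, it logs each collection together with the corresponding application-stopped (STW) time.

    // Minimal sketch: short-lived allocations that trigger Young-generation GC.
    // Run with, e.g.: java -Xms256m -Xmx256m -verbose:gc
    //                      -XX:+PrintGCApplicationStoppedTime GcPauseDemo
    public class GcPauseDemo {
        public static void main(String[] args) {
            byte[][] window = new byte[1024][]; // sliding window of live objects
            for (long i = 0; ; i++) {
                // Each allocation lands in the Young generation; overwriting
                // the slot drops the old reference, making it collectable.
                window[(int) (i % window.length)] = new byte[4096];
            }
        }
    }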

B. Linux memory management

Memory management is one of the critical components of the Linux OS. The virtual memory space is divided into pages of fixed size (e.g., 4 KB). Over the years, many memory-management features have been designed to increase the density and improve the performance of running processes.

Page reclaiming. The Linux OS maintains free-page lists to serve applications' memory requests. When the number of free pages drops to a certain level, the OS begins to reclaim pages and add them to the free lists. Page reclaiming requires page scanning to check the liveness of allocated pages. Both background scanning (performed by the kswapd daemon) and foreground scanning (performed by the allocating processes themselves) are used. The foreground path is often referred to as direct reclaiming or synchronous reclaiming; it represents the more severe scenario where the system is under heavy memory pressure, and the application stalls while executing it.
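Both reclaim paths can be observed from the pgscan_* counters in /proc/vmstat; the following minimal sketch (counter names vary slightly across kernel versions) sums the background and direct scan counts, and sampling them twice yields scan rates:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Sums the pgscan_kswapd_* (background) and pgscan_direct_* (foreground)
    // counters from /proc/vmstat; sampling twice gives per-interval rates.
    public class ReclaimStats {
        public static void main(String[] args) throws IOException {
            long kswapd = 0, direct = 0;
            for (String line : Files.readAllLines(Paths.get("/proc/vmstat"))) {
                String[] f = line.split("\\s+");
                if (f[0].startsWith("pgscan_kswapd")) kswapd += Long.parseLong(f[1]);
                if (f[0].startsWith("pgscan_direct")) direct += Long.parseLong(f[1]);
            }
            System.out.println("kswapd scanned: " + kswapd + " pages, direct: " + direct);
        }
    }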

Swapping. Swapping is designed to increase process density. When free memory is low, Linux swaps memory pages out to swap space to allow new processes to be invoked. When the memory corresponding to swapped-out pages becomes active again, those pages are swapped back in.

THP. Transparent huge pages [2] are a relatively new feature that aims to improve process performance. With larger page sizes (e.g., 2 MB), the number of page table entries needed by a process is reduced, so more virtual address translations can be covered by TLB (Translation Lookaside Buffer) lookups, yielding a higher TLB hit ratio [3]. Though the benefit of huge pages has long been understood, using them was not easy: huge pages had to be reserved when the OS started, and processes had to make explicit calls to allocate them. THP promises to avoid both problems and is therefore enabled by default. THP allows regular pages to be collapsed into transparent huge pages (THPs) through two types of collapsing: background collapsing by the khugepaged daemon and direct collapsing by applications requesting memory³.
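On most distributions the THP mode can be inspected and toggled at run time through sysfs; a minimal sketch follows. Note the assumptions: the mainline path is used (the RHEL 6 kernels used later in this paper place the files under /sys/kernel/mm/redhat_transparent_hugepage/ instead), the accepted values vary by kernel, and writing requires root.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Reads the current THP mode ("always", "madvise", or "never"; the active
    // value is shown in brackets) and optionally switches it. Path and value
    // set vary by kernel/distribution; writing requires root privileges.
    public class ThpMode {
        static final Path ENABLED =
            Paths.get("/sys/kernel/mm/transparent_hugepage/enabled");

        public static void main(String[] args) throws IOException {
            System.out.println("THP mode: "
                + new String(Files.readAllBytes(ENABLED)).trim());
            if (args.length > 0) {
                Files.write(ENABLED, args[0].getBytes()); // e.g. "never" to disable
            }
        }
    }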

III. MOTIVATION

We first provide two motivating scenarios to illustrate the problems that Java applications may face in multi-tenant environments.

A. Experiment setup

The machine used in the experiments is an x86_64-based Intel(R) Xeon(R) X5650 SMP (symmetric multiprocessing) NUMA two-node machine. Each node has 6 cores, and hyper-threading is enabled.

³Note that we use THP to denote the feature, while using THPs to denote the transparent huge pages themselves.


Figure 2. Scenario 2: Application performance during steady state. (a) JavaApp throughput (K allocations/sec); (b) JavaApp GC pauses (seconds).

The OS is Red Hat Enterprise Linux with kernel 2.6.32-220.13.1.el6. The machine has a total of 72 GB of physical memory. All default system configurations are used.

Two applications are used: one written in Java, the other in C++. For ease of presentation, we refer to the Java application as JavaApp and the C++ one as BackgroundApp. JavaApp represents real production applications, while BackgroundApp simply consumes memory to mimic production environments. JavaApp keeps allocating Java objects, and periodically discards objects so that they can be reclaimed during JVM GC.

In real production environments, it is common for the system to experience memory pressure. To mimic such situations, BackgroundApp is designed to allocate a fixed amount of memory such that the running JavaApps experience a certain level of memory pressure.

We consider two types of Java applications, throughput-oriented and response-time-oriented, so we examine both the application throughput and the response times experienced by client requests. For throughput-oriented applications, throughput is measured as the number of Java objects allocated per second. For response-time-oriented Java applications that serve client requests, ideally the user-experienced response time should be measured. Directly measuring the user response time is not easy, as each type of Java application has a different request-handling mechanism. We instead measure the severity of STW GC pauses, since during these pauses Java applications stop responding to clients.

JavaApp is started with varying heap sizes and with identical -Xmx and -Xms values. Other important JVM configurations are: -XX:+UseConcMarkSweepGC, -XX:+CMSScavengeBeforeRemark, and -XX:+CMSParallelRemarkEnabled.
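The source of JavaApp is not listed in this paper; the sketch below is a hypothetical reconstruction of such a workload generator. It keeps a bounded window of live objects so that older ones become garbage for the GC, and reports allocations per second.

    import java.util.ArrayDeque;

    // Hypothetical reconstruction of a JavaApp-style workload: continuously
    // allocate objects, keep a bounded window live (older objects become
    // garbage), and report throughput in K allocations/sec once per second.
    // Example launch: java -Xms20g -Xmx20g -XX:+UseConcMarkSweepGC
    //   -XX:+CMSScavengeBeforeRemark -XX:+CMSParallelRemarkEnabled AllocLoad
    public class AllocLoad {
        public static void main(String[] args) {
            ArrayDeque<byte[]> window = new ArrayDeque<byte[]>();
            long count = 0;
            long start = System.currentTimeMillis();
            while (true) {
                window.addLast(new byte[1024]);      // allocate
                if (window.size() > 1_000_000) {
                    window.removeFirst();            // discard: becomes garbage
                }
                count++;
                long now = System.currentTimeMillis();
                if (now - start >= 1000) {
                    System.out.println("K allocations/sec: " + (count / 1000));
                    count = 0;
                    start = now;
                }
            }
        }
    }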

B. Scenario 1: During startup state

In this scenario, a single JavaApp is started and running. We first start BackgroundApp to take 50 GB of memory, which leaves about 20 GB of memory unused. We then start JavaApp with a 20 GB heap.

In Figure 1(a), after JavaApp is started, it sustains a steady throughput of 12 K/sec for about 30 seconds. Then the throughput begins to drop sharply; the worst case sees almost zero throughput for about 20 seconds. Interestingly, after a while, the throughput becomes steady again.

Figure 1(b), which shows the GC pauses, exhibits a similar pattern. The GC pauses are initially well below 50 milliseconds; then they jump to hundreds of milliseconds, and we even see two pauses larger than 1 second. After about a minute, the GC pauses drop back below 50 milliseconds and become steady.

C. Scenario 2: During steady state

In this scenario, a JavaApp is started with a 20 GB heap and enters steady state. Then a BackgroundApp starts and begins to allocate 50 GB of memory.

In Figure 2(a), JavaApp achieves a steady 12 K/sec throughput in the beginning. Then the throughput drops sharply to zero for about 2 minutes. From then on, the throughput varies wildly: sometimes it comes back to 12 K/sec, at other times it drops to zero again.

In Figure 2(b), the GC pauses are almost zero in steady state; then one rises to 55 seconds! From then on, the GC pauses vary a lot but rarely return to zero; most pauses last several seconds.

D. Summary

We see that in multi-tenant environments, Java applications can experience performance problems both in the startup state and in the steady state. The symptoms as seen by users are low application throughput and increased response time. We investigate and root-cause these problems in Section IV.

IV. INVESTIGATIONS

We have just seen that Java applications can experience significant performance degradation in a multi-tenant setup. We now delve deeper into the two scenarios and identify the causes.

A. Investigations into Scenario-1: Startup state

In Section III-B we saw that JavaApp performs poorly during the startup period. Since the problem occurs right after the JVM starts, we suspect it has to do with how the JVM starts up.


Figure 3. Scenario 1: (a) JVM RES size (GB); (b) free memory (KB); (c) CPU idle usage (%).

Figure 4. Scenario 1: (a,b) number of free pages in the Normal zone of Node 0 and Node 1; (c) number of pages scanned by direct reclaim (pages per second).

We examine JavaApp's RES (resident size), which is the non-swapped physical memory a process has used. The results are shown in Figure 3(a). We see that even though we specify the JVM parameters "-Xmx20g -Xms20g", the JVM does not allocate the heap space from memory all at once. Instead, the heap is allocated on the go: as more and more objects are instantiated, the JVM allocates the corresponding memory pages to accommodate them. During this allocation process, the OS checks the free-page list; if the amount of free memory is below a certain level, the OS begins reclaiming pages, which takes CPU time. Depending on how severe the shortage of free memory is, the reclaiming process may significantly stall JavaApp. In Figure 3(b) we see that free memory drops to a very low level. The page-reclaiming process incurs CPU overhead, as can be seen in Figure 3(c).

On Linux, when available memory is low, the kswapd daemon wakes up to free pages in the background. If the memory pressure is high, the reclaiming process frees memory synchronously. The page-reclaiming mechanism of Linux is per-zone based [4] and is governed by zone watermarks (i.e., pages_min, pages_low, pages_high). User-land Java applications allocate memory in the NUMA nodes' Normal zones. For our setup there are two nodes: Node 0's pages_high is set at 21,498 pages (about 83 MB), while Node 1's is 11,277 pages (about 44 MB). We examined the free pages of the Normal zones of the two nodes based on /proc/zoneinfo and plot the amounts in Figures 4(a,b). We found that the available pages occasionally drop below the watermarks, which can trigger the direct-reclaim path. When direct reclaim occurs, Linux freezes the applications executing that code path, causing high GC pauses. In addition, direct reclaiming typically scans many pages in order to free unused ones. In Figure 4(c) we plot the number of pages scanned by the direct-reclaim path as reported by the Linux sar utility. At the peak, about 48K pages (i.e., about 200 MB) are scanned per second by direct reclaim.
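For reference, the per-zone watermark check can be reproduced with a minimal sketch over /proc/zoneinfo (field names and layout as in 2.6.32-era kernels; other versions differ slightly):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Walks /proc/zoneinfo and prints free pages vs. the "high" watermark for
    // each Normal zone; free below "high" indicates reclaim pressure, and free
    // below "min" means allocating processes can enter the direct-reclaim path.
    public class ZoneWatermarks {
        public static void main(String[] args) throws IOException {
            String zone = null;
            long free = -1;
            for (String line : Files.readAllLines(Paths.get("/proc/zoneinfo"))) {
                String t = line.trim();
                String[] f = t.split("\\s+");
                if (t.startsWith("Node")) {          // e.g. "Node 0, zone Normal"
                    zone = t;
                    free = -1;
                } else if (t.startsWith("pages free")) {
                    free = Long.parseLong(f[2]);
                } else if (f[0].equals("high") && zone != null
                           && zone.endsWith("Normal")) {
                    System.out.println(zone + ": free=" + free
                        + " pages, high watermark=" + f[1] + " pages");
                }
            }
        }
    }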

B. Investigations into Scenario-2: Steady state

In Section III-C we observed that the actions of other applications can severely impact the performance of a Java application. Since the total physical memory is only about 70 GB while the two applications together request about that amount, our first observation is that the system is under memory pressure. We found substantial swapping activity, as seen in Figure 5. In Figure 5(a) we see that many memory pages are swapped out; these swapped-out pages belong to JavaApp. Later, in Figure 5(b), many pages are swapped back in. We can see from Figure 5(c) that the occupied swap space grows to 7 GB. During this time, when JavaApp experiences GC activity, GC needs to scan objects to collect dead ones. If the scanned objects reside on pages that have been swapped out, those pages must first be brought back into memory from swap space. Swapping in takes time, as the swap space is typically on disk drives; thus JavaApp sees high GC pauses.
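That the swapped-out pages belong to JavaApp can be confirmed per process; a minimal sketch using the VmSwap field of /proc/<pid>/status (exported by newer kernels; on kernels that lack it, per-process swap must instead be derived by summing the Swap fields of /proc/<pid>/smaps):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Prints how much of a process's memory currently sits in swap, using
    // the VmSwap field of /proc/<pid>/status.
    public class ProcSwap {
        public static void main(String[] args) throws IOException {
            String pid = args[0];
            for (String line : Files.readAllLines(
                    Paths.get("/proc/" + pid + "/status"))) {
                if (line.startsWith("VmSwap")) {
                    System.out.println("PID " + pid + " " + line.trim());
                }
            }
        }
    }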

Though swapping activities affect GC pauses, we suspect that they alone cannot explain the excessive pauses we see (i.e., 55 seconds). Our suspicion is supported by our examination of the GC pauses, many of which show high sys time. In Figure 6(a) we observe that the system is also under severe CPU pressure. The high CPU usage cannot be entirely attributed to swapping, as swapping is typically not CPU intensive; there must be other activities contributing. We examined various system performance statistics and identified a particular Linux mechanism, THP, that significantly exacerbates the performance degradation.


Figure 5. (a,b) Number of pages swapped out and in per second; (c) amount of free swap space (KB).

Figure 6. (a) CPU idle usage (%); (b) number of anonymous THPs; (c) number of split THPs.

With THP enabled, when Linux applications allocate memory, preference is given to allocating transparent huge pages of 2 MB rather than regular pages of 4 KB. We can verify the allocation of transparent huge pages in Figure 6(b), which shows the instantaneous number of anonymous transparent huge pages. Since THPs are only allocated to anonymous memory, the figure effectively shows the total THPs in the system. At the peak we see about 34K THPs, or about 68 GB.

We also observed that the number of THPs begins to drop after a while. This is because some THPs are split in response to low available memory: when the system is under memory pressure, it splits THPs into regular pages in preparation for swapping, since Linux currently only supports swapping regular pages. The splitting activity is shown in Figure 6(c): during the five minutes, about 5K THPs are split, which corresponds to 10 GB of memory.

At the same time, Linux attempts to collapse regular pages into THPs, which requires page scanning and consumes CPU. There are two ways to collapse: background collapsing, performed by the khugepaged daemon, and direct collapsing, performed by applications trapping into the kernel when allocating huge pages. We occasionally observed the khugepaged daemon topping CPU usage. Direct collapsing carries an even more severe performance penalty, which can be seen from the direct page-scanning count.

An even more troublesome scenario THP can run into is that the two contradicting activities of collapsing and splitting are performed back and forth: when the system is under memory pressure, THPs are split into regular pages, while a short while later regular pages are collapsed into THPs again, and so on. We have observed such behavior severely hurting application performance in our production systems.
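The THP counters behind Figures 6(b,c) can be sampled directly; a minimal sketch follows (event-counter names such as thp_split vary across kernel versions, e.g. thp_split_page on newer kernels):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Samples THP state: AnonHugePages from /proc/meminfo (total THP-backed
    // anonymous memory) and the thp_* event counters from /proc/vmstat
    // (faults, collapses, splits).
    public class ThpStats {
        public static void main(String[] args) throws IOException {
            for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
                if (line.startsWith("AnonHugePages")) System.out.println(line.trim());
            }
            for (String line : Files.readAllLines(Paths.get("/proc/vmstat"))) {
                if (line.startsWith("thp_")) System.out.println(line.trim());
            }
        }
    }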

C. Summary

We saw that JVM mechanisms, coupled with OS-level features, give rise to unique problems that are not present in other deployment scenarios. Though the investigations and findings are Linux-specific, given the wide deployment of the Linux OS, particularly in the server market, the findings apply to a significant portion of multi-tenant backend platforms. In addition, other platforms can be expected to expose similar problems of varying severity: JVM mechanisms are largely universal across OS platforms, and most OS platforms have swapping and page-reclaiming mechanisms. These similarities, along with the identical multi-tenancy requirements in cloud environments, make us believe our findings can inform similar problems and solutions on other cloud platforms.

Note that to help illustrate the problem and understand the causes, we consider particular motivating scenarios where the memory requirement almost exactly matches the physical memory. Based on our production experience, however, such scenarios do occur frequently in real production environments, because deployments often over-commit applications to allow resource sharing and reduce cost.

V. SOLUTION

We now present solutions to protect Java applications running in multi-tenant environments from performance degradation.

A. Overview

Our solution consists of a set of three design elements, each targeting a specific aspect of the problems discussed in Section IV. Applying any individual element helps to some extent; however, all three design elements need to be applied together to obtain the full benefit.


Variables:
 1  Th_freemem: threshold of free memory;
 2  Th_dd: threshold of direct page scanning;
 3  Th_cpu: threshold of khugepaged CPU usage percentage;

Algorithm:
 4  Characterize Java applications into three sets:
 5      Set_lowpause: Java apps that require low GC pauses;
 6      Set_slowstart: Java apps that can tolerate slow startup;
 7      Set_others: other Java applications;
 8  If all Java applications need to be protected:
 9      If Java applications cannot be killed when starting others:
10          Set swappiness to 0;
11      Else:
12          Issue the swapoff command;
13  Else:
14      Turn off swapping for Set_lowpause;
15  Apply AlwaysPreTouch to Set_lowpause ∩ Set_slowstart and start the Java applications;
16  Every period T (adjustment period):
17      Perform system resource monitoring (e.g., free memory);
18      If free memory > Th_freemem:
19          Enable THP;
20      Else:
21          Disable THP;
22          Continue;
23      If average direct page scanning rate > Th_dd:
24          Disable direct THP collapsing;
25      Else:
26          Enable direct THP collapsing;
27      If khugepaged average CPU usage > Th_cpu:
28          Disable background THP collapsing;
29      Else:
30          Enable background THP collapsing;

Figure 7. Main algorithm


The first design element is pre-allocating the JVM heap space. We know that JVM heap space is not allocated until it is actually used. When new heap space is needed to accommodate new object-allocation requests, the Linux OS must allocate memory pages for the heap growth, which may trigger heavy page reclaiming and hurt performance. This design element pre-allocates all the heap space, avoiding on-the-fly page allocation by the Linux OS. To enforce heap pre-allocation, Java applications need to be started with "-XX:+AlwaysPreTouch". The side effect of this design is increased JVM startup time, which needs to be taken into account when applying the element.

The second design element is protecting JVM heap spaces from being swapped out. We learned that when GC occurs, the corresponding memory pages need to be scanned; if those pages were swapped out, they must be swapped in first, which incurs delay and hence long GC pauses. This design element protects the pages from being swapped out. Linux allows turning off swapping, but the setting applies to all applications and all memory spaces. An ideal implementation would allow fine-grained control over which applications and which memory areas may swap, for instance by using cgroups [5] to control which applications swap. However, the complexity and administration cost may make such mechanisms overkill. On the other hand, we observe that most multi-tenant platforms such as LinkedIn's are dedicated to running homogeneous Java applications⁴. In these scenarios, simply turning off swapping for applications is justified.

⁴Homogeneous applications are identical applications running the same code base, except serving different customers.

At this time, Linux has two approaches to turning off swapping: (1) issuing the "swapoff" command and (2) setting swappiness=0. The difference is that the former completely forbids any swapping activity, while the latter only discourages swapping. When the system cannot satisfy a new process's memory request, the former approach will kill another process to free memory, while the latter begins swapping anyway. So, depending on deployment scenarios and requirements, the approach must be chosen carefully.
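For reference, a minimal sketch of the softer approach, writing vm.swappiness through procfs (equivalent to "sysctl -w vm.swappiness=0"; requires root and does not persist across reboots unless also set in /etc/sysctl.conf):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Discourage swapping system-wide by setting vm.swappiness to 0.
    // Requires root; the setting is lost at reboot unless persisted.
    public class SetSwappiness {
        public static void main(String[] args) throws IOException {
            Path p = Paths.get("/proc/sys/vm/swappiness");
            Files.write(p, "0".getBytes());
            System.out.println("vm.swappiness = "
                + new String(Files.readAllBytes(p)).trim());
        }
    }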

The third design element is dynamically tuning THP. Though we have seen that enabling THP can cause critical performance penalties, THP provides performance gains in other scenarios. The bottom line is to enable THP when it brings benefits and disable it when it causes trouble. The first observation we make is that THP incurs performance penalties mostly when the system's available memory is low: when that happens, existing THPs need to be split into regular pages for swapping out. Thus, it is better to disable THP entirely when the system is under memory pressure. Since Java applications allocate heap when started (particularly when started with "-XX:+AlwaysPreTouch"), it is important to decide whether to allocate THPs to a Java application at startup. We therefore use the memory footprint size of the Java application as the memory threshold for turning THP on or off: when the available memory is significantly larger than the application's memory footprint, THP is enabled, as the system is unlikely to be under memory pressure after launching that application; otherwise, THP is disabled. Since many dedicated backend platforms like LinkedIn's host homogeneous applications, assigning applications' footprint sizes is a simple yet effective decision.



Figure 8. Impact of pre-allocating JVM heap space: JavaApp throughput (a,c) and GC pauses (b). (a) Throughput without AlwaysPreTouch; (b) GC pauses without AlwaysPreTouch; (c) throughput with AlwaysPreTouch.

Moreover, regular pages need to be collapsed into THPs before huge pages can be allocated to applications. Thus, part of this element is deciding when to allow THP collapsing. We propose to base the decision on the direct page-scanning rate and the CPU usage of khugepaged.

In summary, this design element consists of three components, which control the different collapsing types separately: (1) when available memory drops below the threshold, disable THP; (2) when direct page scanning is heavy, disable direct THP collapsing; and (3) when the khugepaged daemon appears to be hogging CPU, disable background THP collapsing.

B. Algorithm

The complete algorithm is as follows. All the Java applications that will run on a particular Linux system are first characterized into three sets based on two types of requirements: (1) strict low-pause requirements and (2) short startup delay. Specifically, as shown in lines 4-7 of Figure 7, Set_lowpause contains the Java applications that require low pauses, Set_slowstart contains those that can tolerate slow startup, and Set_others contains the rest. Note that the first two sets may have common elements.

The algorithm then adjusts the swapping configuration of the system. If all the Java applications need to be protected from swapping, it simply adjusts the system-wide swapping configuration, as described in lines 8-14 of Figure 7. Otherwise, individual Java applications can be adjusted separately with regard to swapping.

Before any Java application is started, the algorithm applies "-XX:+AlwaysPreTouch" to the Java applications that belong to the intersection of Set_lowpause and Set_slowstart. In other words, only Java applications that must guarantee low GC pauses and can tolerate slow startup are started with pre-allocated heap spaces.

After the Java applications are started, every T time units (the adjustment period) the algorithm fine-tunes THP. First, the algorithm obtains current system performance statistics, including the free memory size. It then decides whether to enable or disable THP, as shown in lines 17-22 of Figure 7. Note that once THP is turned off, the algorithm simply skips the following steps, as the finer knobs inside THP are disabled with it.

The algorithm then independently checks whether it should turn background and direct THP collapsing on or off. For direct THP collapsing, it relies on the past period's statistics of direct page scanning: if there appears to be heavy direct page-scanning activity, it turns that knob off; otherwise, direct THP collapsing is enabled. Similarly, for background THP collapsing, it relies on the activity of khugepaged as shown by CPU-monitoring utilities such as top: if khugepaged appears to be hogging CPU, that knob is turned off; otherwise it is turned on. The THP tuning is shown in lines 23-30 of Figure 7.
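A sketch of the tuning loop (lines 16-30 of Figure 7) is shown below. It simplifies under stated assumptions: the mainline sysfs layout is used (RHEL 6 kernels place these knobs under /sys/kernel/mm/redhat_transparent_hugepage/ instead), the at-fault "defrag" knob stands in for direct THP collapsing, the khugepaged CPU check (lines 27-30) is omitted for brevity, and the thresholds are illustrative placeholders.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch of the adjustment loop of Figure 7 (lines 16-30). Requires root.
    public class ThpTuner {
        static final Path THP = Paths.get("/sys/kernel/mm/transparent_hugepage");
        static final long PERIOD_MS = 2000;                // T, the adjustment period
        static final long TH_FREE_KB = 25L * 1024 * 1024;  // Th_freemem: 25 GB in KB (illustrative)
        static final long TH_DIRECT_SCAN = 10_000;         // Th_dd: pages per period (illustrative)

        public static void main(String[] args) throws Exception {
            long lastScan = directScanned();
            while (true) {
                Thread.sleep(PERIOD_MS);
                long scan = directScanned();
                long rate = scan - lastScan;   // direct pages scanned this period
                lastScan = scan;
                if (freeKb() > TH_FREE_KB) {
                    write("enabled", "always"); // lines 18-19: enable THP
                } else {
                    write("enabled", "never");  // lines 20-22: disable THP,
                    continue;                   // and skip the finer knobs
                }
                // lines 23-26: heavy direct scanning => disable direct collapsing
                write("defrag", rate > TH_DIRECT_SCAN ? "never" : "always");
            }
        }

        static long freeKb() throws IOException {
            for (String l : Files.readAllLines(Paths.get("/proc/meminfo")))
                if (l.startsWith("MemFree")) return Long.parseLong(l.split("\\s+")[1]);
            return 0;
        }

        static long directScanned() throws IOException {
            long n = 0;
            for (String l : Files.readAllLines(Paths.get("/proc/vmstat")))
                if (l.startsWith("pgscan_direct")) n += Long.parseLong(l.split("\\s+")[1]);
            return n;
        }

        static void write(String knob, String value) throws IOException {
            Files.write(THP.resolve(knob), value.getBytes());
        }
    }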

VI. EVALUATION

We now present the evaluation results for all three design elements. We use the same JavaApp as described in Section III, and use BackgroundApp to create deployment scenarios with different available memory sizes.

A. Pre-allocating JVM heap space

We begin by evaluating the design element of pre-allocating the JVM heap space. We let BackgroundApp take 50 GB of memory, and then start JavaApp with a 20 GB heap. Recall that the machine has 72 GB of memory; this setup thus creates memory pressure. For comparison, we start JavaApp in two scenarios: without and with heap pre-allocation. As shown in Figures 8(a,b), without heap pre-allocation the application throughput drops sharply to almost zero after about 30 seconds, which correlates with significant GC pauses. With a pre-allocated heap, as shown in Figure 8(c), the application performance is very stable.

Pre-allocating JVM heap space does increase the startup latency of Java applications: an application begins to respond only after all the heap space is allocated. We quantified the startup latency of JavaApp for different heap sizes and found that it takes roughly 4 seconds for every 10 GB of heap, as shown in Table I.

Table I
STARTUP DELAY OF PRE-ALLOCATING JVM HEAP SPACE

    Heap size (GB)     1    2    4    10   20   30   40
    Start delay (sec)  0.5  0.8  1.6  2.5  7.3  11   15

Table II
JAVAAPP THROUGHPUT (K ALLOCATIONS/SEC)

    Mechanism     THP OFF   THP ON   Dynamic THP
    JavaApp I     12        15       15
    JavaApp II    13        11       12


Figure 9. Impact of protecting JVM heap spaces from swapping: JavaApp throughput (a,c) and free swap space size (b). (a) Throughput when swappiness=100; (b) free swap size when swappiness=100; (c) throughput when swappiness=0.

B. Protecting JVM heap spaces from being swapped out

To evaluate the effectiveness of protecting the JVM heap space from swapping, we first run JavaApp with a pre-allocated 20 GB heap for 150 seconds. We then start BackgroundApp to take 50 GB of memory, which encourages page swapping. In the scenario where the JVM heap space is not protected (i.e., swappiness is set to 100), we see that after about 3 minutes JavaApp's performance drops sharply, as shown in Figure 9(a); Figure 9(b) confirms that some heap memory is swapped out. In the scenario where the JVM heap is protected, we observe much better performance, as shown in Figure 9(c). Note that the lower performance after 3 minutes is caused by THP activity, which the third design element addresses.

C. Dynamically tuning THP

We then evaluate the design element of dynamically tuning THP. We consider scenarios where JavaApps are started with abundant and with less-abundant available memory. Specifically, the first JavaApp is started with no other applications running; it runs for 3 minutes and stops. After that, a BackgroundApp takes 45 GB of memory, and then another JavaApp is started and runs for 3 minutes. Note that for these runs, the first two design elements are both enabled.

For the above scenario, three mechanisms are considered: THP turned off, THP turned on, and THP dynamically tuned. The adjustment period is set to 2 seconds. The results are shown in Table II. We see that when THP is off, JavaApp-I achieves the lowest throughput of 12 K/sec, while the other two mechanisms have THP enabled and hence see a higher throughput of 15 K/sec.

For JavaApp-II, since it runs under memory pressure, turning THP off gives the highest throughput. Dynamically tuning THP results in lower throughput (12 K/sec) than turning THP off, but outperforms THP-on since it turns THP off when necessary. Note that dynamically tuning THP brings the benefit of accommodating more scenarios, particularly those where system usage is unpredictable and manual THP configuration is therefore undesirable.

We also notice that for JavaApp-II, the performance benefit of Dynamic THP over THP-ON is not significant (i.e., 1 K/sec); this is because of the relatively simple scenario we considered. In other scenarios, such as those illustrated in Section III, as well as in real production environments, we have observed much more significant improvements from turning THP off appropriately. Thus, we believe the algorithm used by Dynamic THP needs to be tuned more finely, which will be our future work.

VII. LESSONS LEARNED

During the investigations into the performance problems experienced on our production platforms, we learned several lessons, which are summarized in this section.

Multi-tenant Java platforms expose extra performance challenges. Multi-tenant platforms expose unique challenges with regard to the tradeoff between cost and performance. On one hand, the goal of serving multiple tenants on a single system is to save cost by encouraging resource sharing; on the other hand, independent activities of tenants may affect each other. The interactions between tenants are often further complicated by OS features, particularly memory management. When the tenant applications run in JVMs, we also need to consider the JVM and GC mechanisms, since Java maintains its own heap. We have seen in this work that the Java heap is not entirely allocated at startup; though this design has its own benefits, it can cause performance problems in certain scenarios.

Be cautious about Linux's new features (optimizations). Linux constantly incorporates new features in the hope of optimizing the system and improving performance. Most of these mechanisms have been successful over the years; however, performance engineers can never be too careful with them. In this paper, we have shown that the new THP feature causes significant performance problems when the system is under memory pressure. We must admit that the idea of THP is fantastic, as it significantly eases the way huge pages are used; the pitfalls we have seen are mainly due to the less-mature implementation of the idea. We believe that as its implementation matures, THP will bring more performance benefits in more circumstances. As a general guideline, however, we need to deeply understand the internal mechanisms of any new OS feature, as well as the scenarios where the feature does or does not work.


Root causes can come from seemingly insignificant information. The Linux OS emits a large amount of system statistics (e.g., under /proc), and most of us, most of the time, examine only a small subset of them. Over years of performance investigations, we have encountered many scenarios where the commonly examined statistics shed little light on the performance problem. For instance, in this work we saw heavy page-scanning activity caused by direct reclaim even though free memory appeared sufficient. Only after turning to the per-zone statistics emitted by /proc/zoneinfo did we nail down the zones that were desperate for free memory and hence triggered direct page scanning.

VIII. RELATED WORK

Multi-tenant cloud platforms. The Cloud Computing model is increasingly deployed as a fast and economical solution for serving multiple users. Despite the distinctions among the three commonly accepted service models (i.e., IaaS, PaaS, and SaaS), any large-scale cloud computing environment requires the instantiation of multiple tenant applications [6] or VMs (virtual machines) [7], [8] on shared hardware and/or software. Though certain techniques such as Logical Domains [8] can isolate resources dedicated to individual tenants, to achieve the highest level of resource sharing, and hence cost saving, system resources have to be shared by multiple tenant applications. Such resource sharing, particularly memory sharing, gives rise to unique challenges.

Java performance for server applications. Bearing many advantages, including platform independence, Java is one of the top languages used to build and run server applications. Ever since its birth, extensive work has been done to improve its performance in various deployment scenarios [9], [10]. However, most performance studies focus on Java/JVM itself. Our work, on the other hand, deals with undesirable deployment scenarios caused by system-level resource shortages and multi-tenant interactions.

Linux memory management and system optimizations. Linux has long been the most widely deployed OS in the server market [11]. Its key memory-management component has seen a plethora of advanced features designed over the past years [12]. To accommodate limited RAM and the requirement of supporting multiple processes, paging and swapping are continuously optimized to improve system and application performance [2], [13], [14]. Meanwhile, to better fit particular setups, various system and configuration optimizations have been researched extensively [15].

IX. CONCLUSION

We studied the challenges faced by typical multi-tenant cloud platforms. We identified that the JVM mechanisms used by Java applications, coupled with OS-level features, give rise to a set of problems that are not present in other deployment scenarios. We proposed a solution that addresses these problems and shared the lessons we learned.

ACKNOWLEDGMENT

We sincerely thank Ritesh Maheshwari, Sharad Gandhi, and Wai Ip, who worked on tools and relevant performance issues that helped this work.

REFERENCES

[1] “Java multitenancy in the IBM Java 8 beta,” http://www.ibm.com/developerworks/java/library/j-multitenant-java/index.html.

[2] “Huge pages and transparent huge pages,” https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html.

[3] M. Talluri, S. Kong, M. D. Hill, and D. A. Patterson, “Tradeoffs in supporting two page sizes,” in Proceedings of the 19th Annual International Symposium on Computer Architecture, ser. ISCA ’92, Queensland, Australia, 1992.

[4] C. Lameter, “NUMA (Non-Uniform Memory Access): An overview,” Queue, vol. 11, no. 7, pp. 40:40–40:51, Jul. 2013.

[5] “Linux cgroups,” https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt.

[6] R. Woollen, “The internal design of salesforce.com’s multi-tenant architecture,” in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC ’10, Indianapolis, Indiana, USA, 2010.

[7] E. Bugnion, S. Devine, M. Rosenblum, J. Sugerman, and E. Y. Wang, “Bringing virtualization to the x86 architecture with the original VMware Workstation,” ACM Trans. Comput. Syst., vol. 30, no. 4, Nov. 2012.

[8] “Oracle virtualization,” http://www.oracle.com/us/technologies/virtualization/overview/index.html.

[9] C. Hunt and B. John, Java Performance, 1st ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2011.

[10] G. L. Taboada, S. Ramos, R. R. Exposito, J. Tourino, and R. Doallo, “Java in the high performance computing arena: Research, practice and experience,” Sci. Comput. Program., vol. 78, no. 5, May 2013.

[11] “Usage share of operating systems,” http://en.wikipedia.org/wiki/Usage_share_of_operating_systems.

[12] D. Bovet and M. Cesati, Understanding the Linux Kernel. O’Reilly & Associates, 2005.

[13] R. Cervera, T. Cortes, and Y. Becerra, “Improving application performance through swap compression,” in Proceedings of the USENIX Annual Technical Conference, ser. ATEC ’99, Berkeley, CA, USA, 1999.

[14] S. Oikawa, “Integrating memory management with a file system on a non-volatile main memory system,” in Proceedings of the 28th Annual ACM Symposium on Applied Computing, ser. SAC ’13, Coimbra, Portugal, 2013.

[15] S. K. Johnson, B. Hartner, B. Pulavarty, and D. J. Vianney, Linux Server Performance Tuning. Prentice Hall PTR, 2005.