
Load Balancing in the Cloud: Tools, Tips, and Techniques

A TECHNICAL WHITE PAPER

Brian Adler, Solutions Architect, RightScale, Inc.



Abstract

Load balancing is a method of distributing workload across multiple servers, network interfaces, hard drives, or other computing resources. Typical datacenter implementations rely on large, powerful (and expensive) computing hardware and network infrastructure, which are subject to the usual risks associated with any physical device, including hardware failure, power and/or network interruptions, and resource limitations in times of high demand.

Load balancing in the cloud differs from classical thinking on load-balancing architecture and implementation by using commodity servers to perform the load balancing. This provides new opportunities and economies of scale, while presenting its own unique set of challenges.

The discussion to follow details many of these architectural decision points and implementation considerations, while focusing on several of the cloud-ready load balancing solutions provided by RightScale, either directly from our core components or from resources provided by members of our comprehensive partner network.


1 Introduction

A large percentage of the systems (or "deployments" in RightScale vernacular) managed by the RightScale Cloud Management Platform employ some form of front-end load balancing. As a result of this customer need, we have encountered, developed, architected, and implemented numerous load balancing solutions. In the process we have accumulated experience with solutions that excelled in their application, and have discovered the pitfalls and shortcomings of other solutions that did not meet the desired performance criteria. Some of these solutions are open source and are fully supported by RightScale, while others are commercial applications (in some cases with a free, limited version) supported by members of the RightScale partner network.

In this discussion we will focus on the following technologies that support cloud-based load balancing: HAProxy, Amazon Web Services’ Elastic Load Balancer (ELB), Zeus Technologies’ Load Balancer (with some additional discussion of their Traffic Manager features), and aiCache’s Web Accelerator. While it may seem unusual to include a caching application in this discussion, we will describe the setup in a later section that illustrates how aiCache can be configured to perform strictly as a load balancer.

The primary goal of the load balancing tests performed in this study is to determine the maximum connection rate that the various solutions are capable of supporting. For this purpose we focused on retrieving a very small web page from backend servers via the load balancer under test. Particular use cases may see more relevance in testing for bandwidth or other metrics, but we have seen more difficulties surrounding scaling to high connection rates than any other performance criterion, hence the focus of this paper. As will be seen, the results provide insight into other operational regimes and metrics as well.

Section 2 will describe the test architecture and the method and manner of the performance tests that were executed. Application- and/or component-specific configurations will be described in each of the subsections describing the solution under test. Wherever possible, the same (or similar) configuration options were used in an attempt to maintain a compatible testing environment, with the goal being relevant and comparable test results. Section 3 will discuss the results of these tests from a pure load balancing perspective, with additional commentary on specialized configurations pertinent to each solution that may enhance its performance (with the acknowledgement that these configurations/options may not be available with the other solutions included in these evaluations). Section 4 will describe an enhanced testing scenario used to exercise the unique features of the ELB, and section 5 will summarize the results and offer suggestions with regard to best practices in the load balancing realm.

2 Test Architecture and Setup

In order to accomplish a reasonable comparison among the solutions exercised, an architecture typical of many RightScale customer deployments (and cloud-based deployments in general) was utilized. All tests were performed in the AWS EC2 US-East cloud, and all instances (application servers, server under test, and load-generation servers) were launched in a single availability zone.

A single EC2 large instance (m1.large: 2 virtual cores, 7.5GB memory, 64-bit platform) was used for the load balancer under test for each of the software appliances (HAProxy, Zeus Load Balancer, and aiCache Web Accelerator). As the ELB is not launched as an instance, we will address it as an architectural component as opposed to a server in these discussions. A RightImage (a RightScale-created and supported Machine Image) utilizing CentOS 5.2 was used as the base operating system on the HAProxy and aiCache servers, while an Ubuntu 8.04 RightImage was used with the Zeus Load Balancer. A total of five identically configured web servers were used in each test to handle the responses to the HTTP requests initiated by the load-generation server. These web servers were run on EC2 small instances (m1.small: 1 virtual core, 1.7GB memory, 32-bit platform) and utilized a CentOS 5.2 RightImage. Each web server was running Apache version 2.2.3, and the web page being requested was a simple text-only page with a size of 147 bytes. The final server involved in the test was the load-generation server. This server was run on an m1.large instance, and also used a CentOS 5.2 RightImage. The server configurations used are summarized in Table 1 below.

Table 1 – Summary of server configurations

Role                       Instance type   Cores   Memory   Platform   Operating system
Load balancer under test   m1.large        2       7.5GB    64-bit     CentOS 5.2 (HAProxy, aiCache); Ubuntu 8.04 (Zeus)
Web servers (x5)           m1.small        1       1.7GB    32-bit     CentOS 5.2, Apache 2.2.3
Load-generation server     m1.large        2       7.5GB    64-bit     CentOS 5.2

The testing tool used to generate the load was ApacheBench, and the command used during the tests was the following:

ab -k -n 100000 -c 100 http://<Public_DNS_name_of_EC2_server>

The full list of options available is described in the ApacheBench man page (http://httpd.apache.org/docs/2.2/programs/ab.html), but the options used in these tests were:

-k
    Enable the HTTP KeepAlive feature; i.e., perform multiple requests within one HTTP session. Default is no KeepAlive.

-n requests
    Number of requests to perform for the benchmarking session. The default is to perform a single request, which usually leads to non-representative benchmarking results.

-c concurrency
    Number of multiple requests to perform at a time. Default is one request at a time.

Additional tests were performed on the AWS ELB and on HAProxy using httperf as an alternative to ApacheBench. These tests are described in sections to follow.

An architectural diagram of the test setup is shown in Figure 1.


Figure 1 – Test setup architecture

In all tests, a round-robin load balancing algorithm was used. CPU utilization on all web servers was tracked during the tests to ensure this tier of the architecture was not a limiting factor on performance.

The CPU idle value for each of the five web servers was consistently between 65% and 75% during the entire timeline of all tests. The CPU utilization of the load-generating server was also monitored during all tests, and the idle value was consistently above 70% on both cores (the httperf tests more fully utilized the CPU; these configurations are discussed in detail in subsequent sections).

As an additional test, two identical load-generating servers were used to simultaneously generate load on the load balancer. In each case, the performance seen by the first load-generating server was halved as compared to the single load generator case, with the second server performing equally. Thus, the overall performance of the load balancer remained the same. As a result, the series of tests that generated the results discussed herein were run with a single load-generating server to simplify the test setup and results analysis. The load-generation process was handled differently in the ELB test to more adequately test the auto-scaling aspects of the ELB. Additional details are provided in section 4.1, which discusses this test setup and results.

The metric collected and analyzed in all tests was the number of requests per second that were processed by the server under test (referred to as responses/second hereafter). Other metrics may be more relevant for a particular application, but pure connection-based performance was the desired metric for these tests.

2.1 Additional Testing Scenarios

Due to the scaling design of the AWS ELB, adequately testing this solution requires a different and more complex test architecture. The details of this test configuration are described in section 4 below. With this more involved architecture in place, additional tests of HAProxy were performed to confirm the results seen in the simpler architecture described above. The HAProxy results were consistent between the two test architectures, lending validation to the base test architecture. Additional details on these HAProxy tests are provided in section 4.2.

3 Test Results

Each of the ApacheBench tests described in section 2 was repeated a total of ten times against each load balancer under test, with the numbers quoted being the averages of those tests. Socket states were checked between tests (via the netstat command) to ensure that all sockets closed correctly and the server had returned to a quiescent state. A summary of all test results is included in Appendix A.
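The exact commands used for this check are not given; a minimal sketch of such a check might look like this:

netstat -tan | awk '{print $6}' | sort | uniq -c
# a quiescent server shows few or no sockets lingering in TIME_WAIT or
# CLOSE_WAIT; re-run until the counts settle before the next iteration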

3.1 HAProxy

HAProxy is an open-source software application that provides high-availability and load balancing features (http://haproxy.1wt.eu/). In this test, version 1.3.19 was used and the health-check option was enabled, but no other additional features were configured. The CPU utilization was less than 50% on both cores of the HAProxy server during these tests (HAProxy does not utilize multiple cores, but monitoring was performed to ensure no other processes were active and consuming CPU cycles), and the addition of another web server neither increased the number of requests serviced nor changed the CPU utilization of the HAProxy server. Both HAProxy performance tuning and Linux kernel tuning were performed. The tuned parameters are indicated in the results below, and are summarized in Appendix B. HAProxy does not support the keep-alive mode of the HTTP transactional model, thus its response rate is equal to the TCP connection rate.

3.1.1 HAProxy Baseline

In this test, HAProxy was run with the standard configuration file (Appendix C) included with the RightScale frontend ServerTemplates (a ServerTemplate is a RightScale concept, and defines the base OS image and series of scripts to install and configure a server at boot time). The results of the initial HAProxy tests were:

Requests per second: 4982 [#/sec]

This number will be used as a baseline for comparison with the other load balancing solutions under evaluation.

3.1.2 HAProxy with nbproc Modification

The nbproc option to HAProxy is used to set the number of haproxy processes when run in daemon mode. This is not the preferred mode in which to run HAProxy, as it makes debugging more difficult, but it may result in performance improvements on certain systems. As mentioned previously, the HAProxy server was run on an m1.large instance, which has two cores, so the nbproc value was set to 2 for this test. Results:

Requests per second: 4885 [#/sec]

This is approximately a 2% performance reduction compared with the initial tests (in which nbproc was set to the default value of 1), so the difference is considered statistically insignificant, with the conclusion that in this test scenario, modifying the nbproc parameter has no effect on performance. This is most likely an indicator that user CPU load is not the limiting factor in this configuration. Additional tests described in section 4 add credence to this assumption.
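For reference, nbproc is set in the global section of the HAProxy configuration file; a minimal sketch with the value used in this test:

global
    daemon
    # run two haproxy processes, one per core on the m1.large
    nbproc 2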


3.1.3 HAProxy with Kernel Tuning

There are numerous kernel parameters that can be tuned at runtime, all of which can be found under the /proc/sys directory. The ones mentioned below are not an exhaustive list of the parameters that would positively (or negatively) affect HAProxy performance, but they have been found to be beneficial in these tests. Alternate values for these (and other) parameters may have positive performance implications depending on the traffic patterns a site encounters and the type of content being served. The following kernel parameters were modified by adding them to the /etc/sysctl.conf file and executing the 'sysctl -p' command to load them into the kernel:

net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
net.core.rmem_max = 8738000
net.core.wmem_max = 6553600
net.ipv4.tcp_rmem = 8192 873800 8738000
net.ipv4.tcp_wmem = 4096 655360 6553600
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 360000
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 30000 65535

With these modifications in place, the results of testing were:

Requests per second: 5239 [#/sec]

This represents about a 5.2% improvement over the initial HAProxy baseline tests. To ensure accuracy and repeatability of these results, the same tests (the HAProxy baseline with no application or kernel tuning, and the current test) were rerun. The 5%-6% performance improvement was consistent across these tests. Additional tuning of the above-mentioned parameters was performed, with the addition of other network- and buffer-related parameters, but no significant improvements to these results were observed. Setting the haproxy process affinity also had a positive effect on performance (and negated any further gains from kernel tuning). This process affinity modification is described in section 4.2.

It is worth noting that HAProxy can be configured for both cookie-based and IP-based session stickiness (IP-based if a single HAProxy load balancer is used). This can enhance performance, and in certain application architectures, it may be a necessity.
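A minimal sketch of cookie-based stickiness, using the same directives that appear in the configuration file in Appendix C (server names and addresses are illustrative):

listen www 0.0.0.0:80
    mode http
    balance roundrobin
    # insert a SERVERID cookie so each client returns to the same backend
    cookie SERVERID insert indirect nocache
    server web1 10.0.0.1:80 cookie web1 check
    server web2 10.0.0.2:80 cookie web2 check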


3.2 Zeus Load Balancer

Zeus Technologies (http://www.zeus.com/) is a RightScale partner that has created a cloud-ready ServerTemplate available from the RightScale Dashboard. Zeus is a fee-based software application, with different features being enabled at varying price points. The ServerTemplate used in these tests utilized version 6.0 of the Zeus Traffic Manager. This application provides many advanced features that support caching, SSL termination (including the ability to terminate SSL for multiple fully-qualified domain names on the same virtual appliance), cookie- and IP-based session stickiness, and frontend clustering, as well as numerous other intelligent load balancing features. In this test, only the Zeus Load Balancer (a feature subset of the Zeus Traffic Manager) was used, to provide more feature-compatible tests with the other solutions involved in these evaluations. By default, Zeus enables both HTTP keep-alives and TCP keep-alives on the backend (the connections to the web servers), thus avoiding the overhead of unnecessary TCP handshakes and tear-downs. With a single Zeus Load Balancer running on an m1.large (consistent with all other tests), the results were:

Requests per second: 6476 [#/sec]

This represents a 30% increase over the HAProxy baseline, and a 24% increase over the tuned HAProxy test results. As mentioned previously, the Zeus Traffic Manager is capable of many advanced load balancing and traffic management features, so depending on the needs and architecture of the application, significantly improved performance may be achieved with appropriate tuning and configuration. For example, enabling caching would increase performance dramatically in this test since a simple static text-based web page was used. We will see a use case for this in the following section discussing aiCache's Web Accelerator. However, for these tests standard load balancing with particular attention to requests served per second was the desired metric, so the Zeus Load Balancer features were exercised, and not the extended Zeus Traffic Manager capabilities.

3.3 aiCache Web Accelerator

aiCache implements a software solution to provide frontend web server caching (http://aicache.com/). aiCache is a RightScale partner that has created a ServerTemplate to deploy their application in the cloud through the RightScale platform. The aiCache Web Accelerator is also a fee-based application. While it may seem out of place to include a caching application in an evaluation of load balancers, the implementation of aiCache lends itself nicely to this discussion. If aiCache does not find the requested object in its cache, it will load the object into the cache by accessing the "origin" servers (the web servers used in these discussions) in a round-robin fashion. aiCache does not support session stickiness by default, but it can be enabled via a simple configuration file directive. In the tests run as part of this evaluation, aiCache was configured with the same five web servers on the backend as in the other tests, and no caching was enabled, thus forcing the aiCache server to request the page from a backend server every time. With this setup and configuration in place, the results were:

Requests per second: 4785 [#/sec]

This performance is comparable with that of HAProxy (it is 4% less than the HAProxy baseline, and 9% less than the tuned HAProxy results). As mentioned previously, aiCache is designed as a caching application to be placed in front of the web servers of an application, and not as a load balancer per se. But as these results show, it performs this function quite well. Although it is a bit out of scope with regard to the intent of these discussions on load balancing, a simple one-line change to the aiCache configuration file allowed caching of the simple web page being used in these tests. With this one-line change in place, the same tests were run, and the results were:


Requests per second: 15342 [#/sec]

This is a large improvement (320%) over the initial aiCache load balancing test, and similar gains were seen relative to the HAProxy tests (307% over the HAProxy baseline, and 293% over the tuned HAProxy results).

Caching is most beneficial in applications that serve primarily static content. In this simple test it was applicable in that the requested object was a static, text-based web page. As mentioned above in the discussion of the Zeus solution, depending on the needs, architecture, and traffic patterns associated with an application, significantly improved results can be obtained by selecting the correct application for the task, and tuning that application correctly.

3.4 Amazon Web Services Elastic Load Balancer (ELB)

Elastic Load Balancing facilitates distributing incoming traffic among multiple AWS instances (much like HAProxy). Where ELB differs from the other solutions discussed in this white paper is that it can span Availability Zones (AZ), and can distribute traffic to different AZs. While this is possible with HAProxy, Zeus Load Balancer, and aiCache Web Accelerator, there is a cost associated with cross-AZ traffic (traffic within the same AZ via private IPs is at no cost, while traffic between different AZs is fee-based).

However, an ELB has a cost associated with it as well (an hourly rate plus a data transfer rate), so some of this inter-AZ traffic cost may be equivalent to the ELB charges, depending on your application architecture. Multiple-AZ configurations are recommended for applications that demand high reliability and availability, but an entire application can be (and often is) run within a single AZ. AWS has not released details on how ELB is implemented, but since it is designed to scale based on load (as will be shown in sections to follow), it is most likely a software-based virtual appliance. The initial release of ELB did not support session stickiness, but cookie-based session affinity is now supported.

AWS does not currently have different sizes or versions of ELBs, so all tests executed were run with the standard ELB. Additionally, no performance tuning or configuration is currently possible on ELBs. The only configuration that was set with regard to the ELB used in these tests was that only a single AZ was enabled for traffic distribution.

Two sets of tests were run. The first was functionally equivalent to the tests run against the other load balancing solutions in that a single load-generating server was used to generate a total of 100,000 requests (and then repeated 10 times to obtain an average). The second test was designed to exercise the auto-scaling nature of ELB, and additional details are provided in section 4.1. For the first set of tests, the results were:

Requests per second: 2293 [#/sec]

This performance is about 46% of that of the HAProxy baseline tests, and approximately 43% of the tuned HAProxy results. This result is consistent with tests several of RightScale's customers have run independently. As a comparison to this simple ELB test, a test of HAProxy on an m1.small instance was conducted. The results of this HAProxy test are as follows:

Requests per second: 2794 [#/sec]

In this test scenario, the ELB performance is approximately 82% that of HAProxy running on an m1.small. However, due to the scaling design of the ELB solution discussed previously, another testing methodology is required to adequately test the true capabilities of ELB. This test is fundamentally different from all others performed in this investigation, so it will be addressed separately in section 4.

4 Enhanced Testing Architecture

In this test, a much more complex and involved testing architecture was implemented. Instead of the five backend web servers used in the previous tests, a total of 25 identical backend web servers were used, with 45 load-generating servers utilized instead of a single server. The reason for the change is that fully exercising ELB requires that requests are issued to a dynamically varying number of IP addresses returned by the DNS resolution of the ELB endpoint. In effect, this is the first stage of load balancing employed by ELB in order to distribute incoming requests across a number of IP addresses which correspond to different ELB servers. Each ELB server then in turn load balances across the registered application servers.
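This first stage can be observed directly by resolving the ELB endpoint; a sketch with a hypothetical hostname and illustrative addresses:

dig +short my-elb-1234567890.us-east-1.elb.amazonaws.com
# 203.0.113.10
# 203.0.113.24
# 203.0.113.57   (the number of A records grows as the ELB scales)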

The load-generation servers used in this test were run on AWS c1.medium servers (2 virtual cores, 1.7GB memory, 32-bit platform). As a result of observing the load-generating servers in the previous tests, it was determined that memory was not a limiting factor and the 7.5GB available was far more than was necessary for the required application. CPU utilization was high on the load generator, so the c1.medium was used to add an additional 25% of computing power. As mentioned previously, instead of a single load-generating server, up to 45 servers were used, each running the following httperf command in an endless loop:

httperf --hog --server=$ELB_IP --num-conns=50000 --rate=500 --timeout=5

In order to spread the load among the ELB IPs that were automatically added by AWS, a DNS query was made at the beginning of each loop iteration so that subsequent runs would not necessarily use the same IP address. These 45 load-generating servers were added in groups at specific intervals, which will be detailed below.
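A minimal sketch of that load-generation loop (the ELB hostname is hypothetical, and dig is assumed to be available on the load generators):

#!/bin/bash
ELB_NAME=my-elb-1234567890.us-east-1.elb.amazonaws.com
while true; do
    # re-resolve the ELB endpoint each iteration so successive runs
    # can land on different ELB IPs as AWS grows the pool
    ELB_IP=$(dig +short $ELB_NAME | head -n 1)
    httperf --hog --server=$ELB_IP --num-conns=50000 --rate=500 --timeout=5
done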

The rate of 500 requests per second (the "--rate=500" option to httperf) was determined via experimentation on the load-generating server. With rates higher than this, non-zero fd-unavail error counts were observed, which is an indication that the client has run out of file descriptors (or more accurately, TCP ports) and is thus overloaded. The number of total connections per iteration was set to 50,000 (--num-conns=50000) in order to keep each test run fairly short in duration (typically less than two minutes), such that DNS queries would occur at frequent intervals in order to spread the load as the ELB scaled.
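When fd-unavail errors do appear, two client-side adjustments can delay port exhaustion; a sketch that mirrors the kernel tuning listed in Appendix B:

# widen the ephemeral port range available for outbound connections
sysctl -w net.ipv4.ip_local_port_range="30000 65535"
# raise the per-process open-file limit in the shell running httperf
ulimit -n 65535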

4.1 ELB Performance

The first phase of the ELB test utilized all 25 backend web servers, but only three load-generating servers were launched initially (which would generate about 1500 requests/second: three servers at 500 requests/second each). Some reset/restart time was incurred between each loop iteration running the httperf commands, so a sustained 500 requests/second per load-generating server was not quite achievable. DNS queries initially showed three IPs for the ELB. As shown in Figure 2 (label (a)), an average of about 1000 requests/second was processed by the ELB at this point.

Approximately 20 minutes into the test, an additional three load-generating servers were added, resulting in a total of six, generating about 3000 requests/second (see Figure 2 (b)). The ELB scaled up to five IPs over the course of the next 20 minutes (c), and the response rate leveled out at about 3000/second at this point. The test was left to run in its current state for the next 45 minutes, with the number of ELB IPs monitored periodically, as well as the response rate. As Figure 2 shows (d), the response rate remained fairly stable at about 3000/second during this phase of the test. The number of IPs returned via DNS queries for the ELB varied between seven and 11 during this time.

At this point, an additional 19 load-generating servers were added (for a total of 25; see Figure 2 (e)), which generated about 12500 requests/second. The ELB added IPs fairly quickly in response to this load, averaging between 11 and 15 within 10 minutes. After about 20 minutes (Figure 2 (f)), an average of 10500 responses/second was realized (again, due to the restart time between iterations of the httperf loop, the theoretical maximum of 12500 requests/second was not quite realized).

The test was left to run in this state for about 20 minutes, where it remained fairly stable in terms of response rate, but the number of IPs for the ELB continued to vary between 11 and 15. An additional 20 load-generating servers (for a total of 45; see Figure 2 (g)) were added at this time. About 10 minutes were required before the ELB scaled up to accommodate this new load, with a result of between 18 and 23 IPs for the ELB. The response rate at this time averaged about 19000/second (Figure 2 (h)). The test was allowed to run for approximately another 20 minutes before all servers were terminated. The response rate during this time remained around 19000/second, and the ELB varied in the number of IPs between 19 and 22.

Figure 2 – httperf responses per second through the AWS ELB. Each color corresponds to the responses received from an individual ELB IP address. The quantization is due to the fact that each load-generating server is locked to a specific IP address for a 1-2 minute period during which it issues 500 requests/second.

To ensure that the backend servers were not overtaxed during these tests, the CPU activity of each was monitored. Figure 3 shows the CPU activity on a typical backend server. Additionally, the interface traffic on the load-generating servers and the number of Apache requests on the backend servers were monitored. Figures 4 and 5 show graphs of these metrics.


Figure 3 – CPU activity on typical backend web server

Figure 4 – Interface traffic on typical load-generating server

Figure 5 – Apache requests on typical backend web server. Peak is with 45 load-generating servers.

It would appear that the theoretical maximum response rate using an ELB is almost limitless, assuming that the backend servers can handle the load. Practically, this would be limited by the capacity of the AWS infrastructure and/or by throttles imposed by AWS with regard to an ELB. These test results were shared with members of the AWS network engineering team, who confirmed that there are activity thresholds that will trigger an inspection of traffic to ensure it is legitimate (and not a DoS/DDoS attack or similar). We assume that the tests performed here did not surpass this threshold, and that additional requests could have been generated before the alert/inspection mechanism would have been triggered. If the alert threshold is met, and after inspection the traffic is deemed to be legitimate, the threshold is lifted to allow additional AWS resources to be allocated to meet the demand. In addition, when using multiple availability zones (as opposed to the single AZ used in this test), supplemental ELB resources become available.

While the ELB does scale up to accommodate increased traffic, the ramp-up is not instantaneous, and ELB therefore may not be suitable for all applications. In a deployment that experiences a slow and steady load increase, an ELB is an extremely scalable solution, but in a flash-crowd or viral event, ELB scaling may not be rapid enough to accommodate the sudden influx of traffic, although artificial "pre-warming" of the ELB may be feasible.

4.2 Enhanced Test Configuration with HAProxy

In order to validate the previous HAProxy results, the enhanced test architecture described above was used to test a single instance running HAProxy on an m1.large (2 virtual cores, 7.5GB memory, 64-bit platform). In this test configuration, 16 load-generating servers were used, as opposed to the 45 used in the ELB tests. (No increase in performance was seen beyond 10 load generators, so the test was halted once 16 had been added.) The backend was populated with 25 web servers as in the ELB test, and the same 147-byte text-only web page was the requested object. Figure 6 shows a graph of the responses/second handled by HAProxy. The average was just above 5000, which is consistent with the results obtained in the tests described in section 3.1 above.

Figure 6 – HAProxy responses/second

The gap in the graph was the result of a restart of HAProxy once kernel parameters had been modified.

The graph tails off at the end as DNS TTLs expired, which pushed the traffic to a different HAProxy server running on an m1.xlarge. Results of this m1.xlarge test are described below.

In the initial test run in the new configuration, an average of about 5000 responses/second was observed. During this time frame, CPU-0 was above 90% utilization (see Figure 7), while CPU-1 was essentially idle. By setting the HAProxy process affinity to a single CPU (essentially moving all system-related CPU cycles to a separate CPU), performance was increased approximately 10% to the 5000 responses/second shown in Figure 6. When the affinity was set (using the 'taskset -p 2 <haproxy_pid>' command), CPU-0's utilization dropped to less than 5%, and CPU-1's changed from 0% to approximately 60% utilization, due to the fact that the HAProxy process was moved exclusively to CPU-1 (see Figure 8). Additionally, once the HAProxy process's affinity was set, tuning the kernel parameters no longer had any noticeable effect.
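The resulting affinity can be confirmed with taskset itself (PID placeholder as above; the output line is illustrative):

taskset -p <haproxy_pid>
# pid 1234's current affinity mask: 2   (mask 0x2 = CPU-1 only)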

Figure 7 – CPU-0 activity on HAProxy server

Figure 8 – CPU activity on CPU-0 and CPU-1 after HAProxy affinity is set to CPU-1

The interface on the HAProxy server averaged approximately 100 MBits/second total (in and out combined) during the test (see Figure 9). In previous tests of m1.large instances in the same availability zone, throughput in excess of 300 MBits/second has been observed, thus confirming the instance’s bandwidth was not the bottleneck in these tests.


Figure 9 – Interface utilization on HAProxy server

With unused CPU cycles on both cores, and considerable bandwidth on the interface available, the bottleneck in the HAProxy solution is not readily apparent. The HAProxy test described above was also run on an m1.xlarge (4 virtual cores, 15GB memory, 64-bit platform) with the same configuration. The results observed were identical to those of the m1.large. Since HAProxy is not memory-intensive, and does not utilize additional cores, these results are not surprising, and support the reasoning that the throttling factor may be an infrastructure- or hypervisor-related limitation.

During these HAProxy tests, it was observed that the virtual interface was peaking at approximately 110K packets per second (pps) in total throughput (input + output). As a result of this observation, the ttcp utility was run in several configurations to attempt to validate this finding. Tests accessing the instance via its internal IP and external IP, as well as with two concurrent transmit sessions, were executed (see Figure 10).

Figure 10 – Packets per second as generated by ttcp

The results of these tests were fairly consistent in that a maximum of about 125K pps was achieved, with an average of 118K-120K being more typical. These results were shared with AWS network engineering representatives, who confirmed that we were indeed hitting limits in the virtualization layer, which involves the traversal of two network stacks.
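For reference, a typical ttcp transmit/receive pair looks like the following (a sketch; option syntax varies among ttcp variants, and the receiver address is hypothetical):

# on the receiving instance: sink data arriving from the network
ttcp -r -s
# on the transmitting instance: source a data stream to the receiver
ttcp -t -s 10.0.0.1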


The takeaway from these experiments is that in high traffic applications, the network interface should be monitored and additional capacity should be added when the interface approaches 100K pps, regardless of other resources that may still be available on the instance.
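One way to implement that monitoring is to watch per-interface packet rates directly; a sketch assuming the sysstat package is installed:

# report per-interface packet rates once per second; rxpck/s + txpck/s
# approaching ~100K on the active interface signals the limit above
sar -n DEV 1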

These findings also explain why the results between HAProxy, aiCache, and Zeus are very similar. With all three appliances the practical limit is about 100K packets per second. The minor performance differences between the three are primarily due to keep-alive versus non keep-alive HTTP connections and internal buffer strategies that may distribute payloads over more or fewer packets in different situations.

5 Conclusions

At RightScale we have encountered numerous and varied customer architectures, applications, and use cases, and the vast majority of these deployments use, or can benefit from, the inclusion of front-end load balancing. As a result of assisting these customers both in a consulting capacity and by engaging with them on a professional services level, we have amassed a broad spectrum of experience with load balancing solutions. The intent of this discussion was to give a brief overview of the load balancing options currently available in the cloud via the RightScale platform, and to compare and contrast these solutions using a specific configuration and metric on which to rate them. Through these comparisons, we hope to have illustrated that there is no "one size fits all" when it comes to load balancing. Depending on the particular application's architecture, technology stack, traffic patterns, and numerous other variables, there may be one or more viable solutions, and the decision on which mechanism to put in place will often come down to a tradeoff between performance, functionality, and cost.


Appendices

[A] Summary of all tests performed

Load balancer under test                      Requests per second [#/sec]
HAProxy baseline (m1.large)                   4982
HAProxy, nbproc = 2 (m1.large)                4885
HAProxy, kernel tuning (m1.large)             5239
HAProxy baseline (m1.small)                   2794
Zeus Load Balancer (m1.large)                 6476
aiCache Web Accelerator, no caching           4785
aiCache Web Accelerator, caching enabled      15342
AWS ELB, single load generator                2293

[B] Kernel tuning parameters

net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
net.core.rmem_max = 8738000
net.core.wmem_max = 6553600
net.ipv4.tcp_rmem = 8192 873800 8738000
net.ipv4.tcp_wmem = 4096 655360 6553600
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 360000
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 30000 65535


[C] HAProxy configuration file

# Copyright (c) 2007 RightScale, Inc, All Rights Reserved Worldwide.
#
# THIS PROGRAM IS CONFIDENTIAL AND PROPRIETARY TO RIGHTSCALE
# AND CONSTITUTES A VALUABLE TRADE SECRET. Any unauthorized use,
# reproduction, modification, or disclosure of this program is
# strictly prohibited. Any use of this program by an authorized
# licensee is strictly subject to the terms and conditions,
# including confidentiality obligations, set forth in the applicable
# License Agreement between RightScale.com, Inc. and
# the licensee.

global
    stats socket /home/haproxy/status user haproxy group haproxy
    log 127.0.0.1 local2 info
    # log 127.0.0.1 local5 info
    maxconn 4096
    ulimit-n 8250
    # typically: /home/haproxy
    chroot /home/haproxy
    user haproxy
    group haproxy
    daemon
    quiet
    pidfile /home/haproxy/haproxy.pid

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 60000
    srvtimeout 60000

# Configuration for one application:
# Example: listen myapp 0.0.0.0:80
listen www 0.0.0.0:80
    mode http
    balance roundrobin
    # When acting in a reverse-proxy mode, mod_proxy from Apache adds X-Forwarded-For,
    # X-Forwarded-Host, and X-Forwarded-Server request headers in order to pass information to
    # the origin server; therefore, the following option is commented out
    #option forwardfor
    # HAProxy status page
    stats uri /haproxy-status
    #stats auth @@LB_STATS_USER@@:@@LB_STATS_PASSWORD@@
    # when cookie persistence is required
    cookie SERVERID insert indirect nocache
    # When internal servers support a status page
    #option httpchk GET @@HEALTH_CHECK_URI@@
    # Example server line (with optional cookie and check included)
    # server srv3.0 10.253.43.224:8000 cookie srv03.0 check inter 2000 rise 2 fall 3

    server i-570a243f 10.212.69.176:80 check inter 3000 rise 2 fall 3 maxconn 255
