ENERGY EFFICIENCY OF ARM ARCHITECTURES FOR CLOUD COMPUTING APPLICATIONS

Olle Svanfeldt-Winter

Master of Science Thesis
Supervisor: Prof. Johan Lilius
Advisor: Dr. Sébastien Lafond
Embedded Systems Laboratory
Department of Information Technologies
Åbo Akademi University

2011

ABSTRACT

This thesis evaluates how the energy efficiency of the ARMv7 architecture based Cortex-A9 MPCore and Cortex-A8 processors compares with that of Intel Xeon processors in applications such as a SIP-Proxy and a web server. The focus is on comparing the energy efficiency of the two architectures rather than just their performance. As the processors used in servers today have more computational power than the Cortex-A9 MPCore, several of these slower but more energy efficient processors are needed. Depending on the application, benchmarks indicate that the energy efficiency of the ARM Cortex-A9 is 3 to 11 times greater than that of the Intel Xeon. The topics of interconnects between processors and of the overhead caused by using an increasing number of processors are left for later research.

Keywords: Cloud Computing, Energy Efficiency, ARM, Erlang, SIP-Proxy, Apache


CONTENTS

Abstract
Contents
List of Figures
Glossary

1 Introduction
1.1 Purpose of this thesis
1.2 Cloud Software project
1.3 Thesis structure

2 Energy efficiency of servers
2.1 Throughput and latency
2.2 Energy
2.3 Large scale energy consumption
2.4 Reducing energy consumption
2.5 Energy proportional computing
2.6 Energy efficient low power processors
2.7 Summary

3 Evaluated computing platforms
3.1 Hardware
3.2 Software
3.3 Benchmarks
3.4 Summary

4 Performance comparison
4.1 Apache results
4.2 Emark results
4.3 SIP-Proxy results
4.4 Summary

5 Conclusions and future work
5.1 Conclusions
5.2 Future work

Bibliography

Swedish Summary

6 Energieffektivitet hos ARM-arkitektur för applikationer i datormoln
6.1 Introduktion
6.2 Energiförbrukning
6.3 Förbättring av energieffektivitet
6.4 Mätningar
6.5 Slutsatser

A Results from Erlang benchmarking


LIST OF FIGURES

2.1 Monthly costs for server, power and infrastructure [1]
2.2 CPU contribution to total server power usage for two generations of Google servers. The rightmost bar shows the newer server when idling [2]

3.1 BeagleBoard block diagram [3]
3.2 OMAP3530 block diagram [4]
3.3 Block diagram of the Versatile Express with the Motherboard Express µATX, CoreTile Express A9x4 and LogicTile Express [5]
3.4 Top level view of the main components of the CoreTile Express A9x4 and with the CA9 NEC chip [5]
3.5 Test setup for Apache test
3.6 Test setup SIP-Proxy test

4.1 Comparison between CoreTile Express, Tegra and an Intel Pentium 4 powered machine running the Apache HTTP server.
4.2 CPU utilization during test on machine with two Quad Core Intel Xeon E5430 processors
4.3 Number of requests handled for each Joule used by the CPU
4.4 Graph showing the CPU utilization for CoreTile Express during SIP-Proxy benchmark
4.5 Power consumption for the CPU in CoreTile Express during SIP-Proxy test
4.6 Graph showing performance of reference machine with two Quad Core Xeons using an increasing number of schedulers
4.7 Number of calls handled for each Joule used by the CPU

5.1 Achievable energy dissipation reduction by the usage of more efficient processors
5.2 Achievable energy dissipation reduction when moving to more efficient processors


GLOSSARY

SMP
In symmetric multiprocessing two or more processors are connected to the same main memory.

DVFS
Dynamic Voltage and Frequency Scaling is used to adjust the input voltage and clock frequency according to the momentary need in order to avoid unnecessary energy consumption.

Power gating
Cutting power to parts of a chip when the particular part is not needed.

Data center
Facility that houses computer systems.

Server farm
A server farm is a collection of servers. They are used when a single server is not capable of providing the required service.

CPU
Central processing unit.

Cloud
Platform for computational and data access services where details of the physical location of the hardware are not necessarily of concern to the end user.

Granularity
Granularity describes the extent to which a system is broken down into smaller parts.

DMIPS
Dhrystone MIPS. Obtained when dividing a Dhrystone benchmark score by 1757.


1 INTRODUCTION

Cloud computing systems often use large server farms in order to provide services. These server farms have high energy consumption. The energy is needed not only to run the servers themselves but also for the systems that keep them cool. Energy consumption is seen as both an economic and an ecological issue. Regardless of whether one wants to save money or to put less strain on the environment, the solution is the same: to reduce the energy consumption.

The approach presented in this thesis to reduce the power dissipated by server farms is to replace their processors with ones that are more energy efficient. The architecture of the processors used in smartphones and embedded systems has been designed with energy efficiency in mind from the beginning, which has not been the case with the x86 architecture usually found in servers. This makes the processors used in embedded systems interesting candidates when looking for replacements for regular x86 architecture based server processors.

Although the processors used in mobile devices use less energy to execute a single instruction, that instruction is often not directly comparable to an instruction executed on a regular server processor, due to factors such as differences in the instruction sets. The computational power of a single low power processor is also generally modest compared to traditional desktop and server processors. Moving to processors with lower individual performance increases the number of processors needed to provide the same service as before. Distributing work over a larger number of processors increases the importance of parallelism on the software side. When switching to processors with lower individual performance, applications that perform many I/O operations will suffer a smaller performance drop than those that are more computationally intensive, as the speed of I/O depends on other components than just the CPU. Applications designed to run on server farms are already designed to be distributable, in order to use the added resources of a server farm compared to those of a single server.


Applications such as gateways in telecommunication have a large number of requests to serve, but the requests are light and generally independent of each other. Tasks performed by web servers, such as serving static web pages, are also suitable candidates to be run in clouds of low power processors, as the services provided are not CPU intensive. The suitability of low power, energy efficient processors for providing such services will be evaluated using benchmarks.

General performance benchmarks for the Erlang virtual machine (VM) will be used in addition to Apache benchmarking and a SIP-Proxy running on top of Erlang. The general performance benchmarks will be used to evaluate the performance of the Erlang run-time system (erts) for a variety of tasks such as message passing and pure number crunching, thereby comparing the performance of the different architectures. The obtained results are used to evaluate why some tasks run better on some hardware than on others and to explain the performance differences for realistic applications.

The extent of this thesis is limited to how well a single Cortex-A9 MPCore and a Cortex-A8 perform in comparison to processors such as the Intel Xeon. The topics of interconnects between processors and of the overhead caused by using an increasing number of processors are left for later research.

1.1 Purpose of this thesis

The purpose of this thesis is to evaluate the energy efficiency of ARM Cortex-A9 MPCore based processors compared to x86 based processors for telecom systems and other cloud-like services. In addition to the Cortex-A9 MPCore processors, a single-core Cortex-A8 processor will also be evaluated. To make an energy efficiency comparison possible, the performance of the processors will first be evaluated. The main interest is the comparison of the energy efficiency of the two architectures rather than of particular processor models.

To achieve this, several processors based on the same underlying architecture will be evaluated. The ability to provide simple web services will also be evaluated on the same ARM based test machines. Apache 2.2 will be used to test how well the test machines perform in providing simple web services, with the focus on serving static web pages. For the telecommunication part the focus will be on a SIP-Proxy running on top of the Erlang VM. Micro benchmarks that stress different aspects of the Erlang VM are used to evaluate how efficiently the ARM based processors can handle different tasks. The important metric is how much performance is achieved compared to the amount of energy used, rather than the pure performance of the processors. The potential for energy savings from using ARMv7 based processors in servers instead of x86 processors will be evaluated. To make the comparison realistic only the efficiency of the processors will be considered. The potential for energy efficiency improvement in the rest of the components in the test machines is not considered. The impact on total data center infrastructure and power cost will be analyzed using a cost model for a hypothetical data center.

1.2 Cloud Software project

The Cloud Software Program (2010-2013) is a SHOK program financed through TEKES and coordinated by Tivit Oy. Its aim is to improve the competitive position of the Finnish software intensive industry in the global market [6]. The content of this thesis is part of the project.

The research focus for the project in the Embedded Systems Laboratory at the Department of Information Technologies at Åbo Akademi is to evaluate the potential gain in energy efficiency from using low power nodes to provide services. In addition to energy efficiency, the total cost of ownership of the cloud server infrastructure plays a central role.

1.3 Thesis structure

Chapter 2 begins by introducing concepts such as energy and energy efficiency. It then explains why energy efficiency is an issue for cloud service providers and how much and where energy is consumed in data centers. Methods to reduce energy consumption as well as the concept of energy proportional computing are also presented in Chapter 2. The chapter ends with a discussion of the motivations and theories behind why the use of energy efficient low power processors is a viable option. Chapter 3 presents the hardware and software used in the evaluations as well as the benchmarks. Chapter 4 presents the results from the benchmarks introduced in Chapter 3, together with comparisons between the results for the different test machines. Chapter 5 presents conclusions together with suggestions for future work.


2 ENERGY EFFICIENCY OF SERVERS

In this chapter concepts such as energy, energy efficiency and energy proportionality are presented. Metrics necessary to evaluate energy efficiency are also briefly discussed, followed by the reasons why energy consumption is both an economic and a practical issue. An example shows how much energy is used by server farms and the cost associated with it, covering both the direct energy cost and the cost of energy related infrastructure. The chapter ends by presenting methods used to decrease energy consumption and by discussing why using energy efficient low power nodes would be an option.

2.1 Throughput and latency

It is important to note that latency and throughput do not always correlate. Even if two systems have the same throughput, the time to serve a single request is not necessarily the same. A system that uses one node to provide a certain service has to process the individual requests faster than one consisting of several units. If the number of nodes used to provide a certain service is doubled, the throughput required from each unit is cut in half. The formula below shows the relation between throughput, latency and the number of nodes.

Throughput = AvailableNodes ∗ (1 / Latency)

The unit of throughput can vary greatly. If the performance of a web server is evaluated, the throughput can be defined as the number of requests served each second. The latency is the time taken to serve a request, and the available nodes simply indicates how many nodes are available to provide the service. The throughput can be kept at the same level even if the latency increases, provided that the number of nodes is increased to compensate. For example, if one server with the ability to serve a hundred requests per second were replaced by servers with the ability to serve 10 requests per second each, ten of the less powerful servers would be needed to achieve the same performance. It follows that the minimum latency of the service cannot be lower than the minimum latency of a node. How much latency is acceptable depends on the service being provided: a phone call is likely to have stricter requirements for the response time than a service for downloading files.
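The relation between throughput, latency and the number of nodes can be checked numerically. The following is a minimal Python sketch and not part of the measurements in this thesis. The function names `throughput` and `nodes_needed` are hypothetical, and the sketch assumes that each node serves one request at a time, so that the rate of a node is simply 1/latency.

```python
import math


def throughput(available_nodes: int, latency_s: float) -> float:
    """Requests per second: Throughput = AvailableNodes * (1 / Latency)."""
    return available_nodes * (1.0 / latency_s)


def nodes_needed(target_throughput: float, latency_s: float) -> int:
    """Smallest number of nodes whose combined throughput meets the target."""
    # Round before taking the ceiling so floating-point noise cannot add a node.
    return math.ceil(round(target_throughput * latency_s, 9))


# One fast server: 10 ms per request, i.e. about 100 requests per second.
fast_throughput = throughput(1, 0.01)

# Slower nodes: 100 ms per request, i.e. about 10 requests per second each.
slow_nodes = nodes_needed(fast_throughput, 0.1)

print(fast_throughput, slow_nodes)
```

The sketch reproduces the example in the text: the slower nodes reach the same throughput only in larger numbers, while the latency of each individual request stays at the higher value.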

2.2 Energy

Energy is generally described as the ability to perform work. It can take many forms, such as kinetic, thermal or electric. In this thesis the focus is on electric and thermal energy, as computers use electric energy and transform it into thermal energy. Power is the rate at which energy is transformed into another form. The unit of power is the watt (W). Electric energy is measured in joules (J) and is defined as power multiplied by time.

Energy = AvgPower ∗ Time

Energy efficiency is the amount of work done compared to the amount of energy used. The exact way to measure and compare energy efficiency varies depending on the particular application. For services provided by a cloud or a server it could be, for example, how many joules are needed for a transaction or for retrieving a file from a web server. The metrics can vary greatly depending on the application.
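As an illustration of such a metric, the sketch below computes energy from average power and time and then expresses efficiency as requests per joule. It is a minimal Python example under stated assumptions: the function names are hypothetical and the numbers are invented for illustration. They are not measurements from this thesis.

```python
def energy_joules(avg_power_w: float, time_s: float) -> float:
    """Energy = AvgPower * Time, in joules (watts multiplied by seconds)."""
    return avg_power_w * time_s


def requests_per_joule(requests: int, avg_power_w: float, time_s: float) -> float:
    """Work done per unit of energy; a higher value is more energy efficient."""
    return requests / energy_joules(avg_power_w, time_s)


# Hypothetical one-minute runs: a low power CPU and a faster server CPU.
low_power_cpu = requests_per_joule(requests=60_000, avg_power_w=2.0, time_s=60.0)
server_cpu = requests_per_joule(requests=600_000, avg_power_w=80.0, time_s=60.0)

print(low_power_cpu, server_cpu)
```

In this invented case the server CPU serves ten times as many requests, yet the low power CPU handles more requests for each joule, which is the kind of comparison the metric is meant to expose.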

As stated by the first law of thermodynamics, energy is never created or destroyed, only converted to other forms. All the energy that a computer, or a part of a computer, consumes is converted into heat. The more energy a component consumes, the greater the problem with heat dissipation becomes. The ability to dissipate heat depends on many factors, such as surface area and material. Heat sinks are often used to increase the ability of a component to dissipate heat. Regardless of the thermal dissipation capabilities and the amount of thermal energy dissipated, the heat has to be transferred somewhere in order to avoid overheating.


2.3 Large scale energy consumption

Small computer systems such as personal computers can generally be cooled by a few fans, as the space they are kept in is relatively large in comparison to the amount of heat that is generated. When a large number of servers is kept in the same place, the heat adds up. The modest energy consumption of a regular home computer is also often not a great economic issue, as the power needed is of the same order of magnitude as that of a few incandescent light bulbs. The more densely hardware is stacked in order to fit as much equipment as possible into the smallest possible space, the more heat is produced in that space.

The biggest consumer of energy in a server is the CPU, with approximately 45 percent of the total consumption [2]. The energy consumption ratios between different components vary depending on the configuration of the server. In servers where several disk drives are used for data storage, the energy consumption of the disk drives also becomes significant [7]. According to Schäppi et al. the total energy consumption of data centers has been increasing for years [8]. In 2006 the energy consumption of servers in Western Europe (EU 15 and Switzerland) was 14.7 TWh [8]; this does not include any energy consumed by infrastructure such as cooling, lighting and UPS. Schäppi et al. also state that the complete energy consumption of the data centers in the same region is 36.9 TWh. It is not uncommon for data center service providers to boast of high energy efficiency, both for their servers and for the data centers as a whole. Companies do not, however, generally publish exact data on the energy consumption and technical specifications of their centers, which makes accurate estimates difficult. Several different metrics for energy efficiency on a data center scale are used. Power Usage Effectiveness (PUE) and DCiE are metrics defined by Green Grid in a white paper called “The green grid power efficiency metrics: PUE & DCiE” [9]. The definitions of PUE and DCiE are shown below.

PUE = TotalFacilityPower / ITEquipmentPower [9]

DCiE = 1 / PUE = ITEquipmentPower / TotalFacilityPower ∗ 100% [9]

IT equipment power includes the servers but also network equipment and equipment used to monitor and control the data center. In addition to the IT equipment, total facility power includes cooling, UPS, lighting and distribution losses external to the IT equipment [9].

In an ideal data center the PUE would be 1, meaning that all power used by the center is used to power the IT equipment. According to “The green grid power efficiency metrics: PUE & DCiE”, preliminary data shows that many data centers have a PUE of 3.0 or greater [9].
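The two metrics are simple ratios, as the following minimal Python sketch shows. The function names are hypothetical and the power figures are invented to produce a PUE of 3.0. They do not describe any particular data center.

```python
def pue(total_facility_power_w: float, it_equipment_power_w: float) -> float:
    """Power Usage Effectiveness: total facility power per unit of IT power."""
    return total_facility_power_w / it_equipment_power_w


def dcie_percent(total_facility_power_w: float, it_equipment_power_w: float) -> float:
    """Data Center infrastructure Efficiency: share of power reaching IT equipment."""
    return it_equipment_power_w / total_facility_power_w * 100.0


# Hypothetical facility drawing 3 MW in total, of which 1 MW reaches IT equipment.
total_w = 3_000_000.0
it_w = 1_000_000.0

print(pue(total_w, it_w))           # ratio of total to IT power
print(dcie_percent(total_w, it_w))  # percentage of power used by IT equipment
print(total_w - it_w)               # power spent on cooling, UPS, lighting, losses
```

With a PUE of 3.0, two watts are spent on infrastructure for every watt delivered to the IT equipment, so any reduction in IT power is multiplied at the facility level.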

Companies that provide cloud services need large amounts of computer resources. When using cloud computing the user does not need to worry about the resources locally, and many new data centers are being built to provide the required resources. Several companies including Google and Microsoft are building data centers [10] with increasing numbers of servers. Many of the centers are so large that shipping containers are used as the basic unit instead of server racks [10] [11]. For example, Google uses shipping containers to house servers in its data centers. One container is reported to house 1160 servers, and the power consumption of a single container is reported to be up to 250 kW [11]. Using the reported values, one server would use approximately 216 W.

In 2008 Microsoft announced that it was building a data center containing 300,000 servers [12]. If the power consumption of the servers in Microsoft’s new server farm is the same as that reported by Google, the servers in the farm consume approximately 65 MW. The fact that the servers are packed tightly also means that the cooling challenges increase. The problem of large heat dissipation is being addressed in different ways; for example, Intel provides energy efficient versions of some of its Xeon processors intended especially for high density blade servers [13]. The more energy efficient versions are generally more expensive. For example, the Intel Xeon L5434 costs 562 € [14] and the E5430 costs 455 € [15]. Paying less to keep the servers running while still providing the same services makes new business opportunities possible and increases the profit in current business areas.
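The per-server and farm-level estimates follow directly from the reported figures. The short Python sketch below repeats the arithmetic. The constant names are hypothetical, and the sketch keeps the text's assumption that every server in the larger farm draws the same power as a server in the reported container.

```python
CONTAINER_POWER_W = 250_000    # up to 250 kW per container [11]
SERVERS_PER_CONTAINER = 1160   # servers per container [11]
FARM_SERVERS = 300_000         # servers in the announced data center [12]

# Average power per server if the container runs at its reported maximum.
watts_per_server = CONTAINER_POWER_W / SERVERS_PER_CONTAINER

# Total server power for the farm, assuming the same per-server consumption.
farm_power_mw = FARM_SERVERS * watts_per_server / 1_000_000

print(f"{watts_per_server:.1f} W per server")
print(f"{farm_power_mw:.1f} MW for the farm")
```

The result covers the servers only. Cooling and other infrastructure would add to it according to the PUE of the facility.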

In order to evaluate the potential savings from a reduction in energy consumption, the total cost structure of a server farm must be analyzed. Hamilton [1] presents a cost analysis for a hypothetical data center. To enable comparison between cost elements such as infrastructure, hardware and power, amortization times are defined for the investments. Hamilton’s hypothetical data center has a 15-year amortization time for the infrastructure and a 3-year amortization time for the servers. A five percent annual cost for the capital used to build the data center is assumed. The cost of power is set at $0.07/kWh for this example. The costs of the data center can be seen in the pie chart shown in Figure 2.1. The chart shows that the direct cost of power is 19 percent of the total cost. Hamilton points out that for the hypothetical data center 82 percent of the infrastructure costs consist of power and cooling infrastructure, and that the maximum power consumption of the servers is thereby reflected in the infrastructure costs. In Hamilton’s hypothetical data center the combined cost of the power and cooling infrastructure and of the actual power is 42 percent of the total cost. Hamilton writes that the power consumption contribution is 23 percent of the total cost, but the numbers in the graph do not support that statement. According to the graph, it is the power and cooling infrastructure that contributes 23 percent of the total cost, while the cost of power is 19 percent. These are the values that will be used in Chapter 5.

Figure 2.1: Monthly costs for server, power and infrastructure [1]
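To give a feeling for how these cost shares translate into savings, the following Python sketch estimates the reduction in total cost for a given reduction in server power. It is a rough illustration and not the cost model used later in this thesis. The function name is hypothetical, and the sketch assumes that the direct power cost, and optionally the power and cooling infrastructure cost, scale linearly with server power.

```python
POWER_SHARE = 0.19                # direct cost of power, from Figure 2.1
POWER_COOLING_INFRA_SHARE = 0.23  # power and cooling infrastructure, from Figure 2.1


def total_cost_reduction(power_reduction: float,
                         infra_scales_with_power: bool = True) -> float:
    """Fraction of total cost saved for a fractional reduction in server power.

    Assumption: the affected cost shares shrink in proportion to the power.
    """
    affected_share = POWER_SHARE
    if infra_scales_with_power:
        affected_share += POWER_COOLING_INFRA_SHARE
    return power_reduction * affected_share


# Halving the power, with and without a correspondingly smaller infrastructure.
print(total_cost_reduction(0.5))
print(total_cost_reduction(0.5, infra_scales_with_power=False))
```

The second call corresponds to an existing data center where the infrastructure has already been paid for, so only the power bill shrinks.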

2.4 Reducing energy consumption

In order to improve the energy efficiency of a computer, the causes of its energy consumption must be known. Knowing the contributions of the main components in a modern computer makes it possible to focus on the critical components. Modern computer systems are built using CMOS circuits. The energy consumption of a CMOS circuit is divided into static power consumption and active power consumption. The static power consumption is caused by unintended leakage currents within the circuits. It can be reduced by bringing down the number of active transistors and by turning parts of the chip off when they are not needed. Another factor that affects the static power consumption is the supply voltage. Active power consumption is caused by switching the states of the transistors and thereby depends on the usage of the circuit.

The time taken to charge and discharge a capacitor depends on the voltage used. A higher voltage allows a shorter switching time and thereby a higher clock frequency. To be energy efficient, the lowest voltage that supports the planned clock frequency should be used. Dynamic Voltage and Frequency Scaling (DVFS) is a method to reduce the energy consumption of a processor at times when it is not required to run at full capacity. DVFS works by varying both the voltage and the clock frequency of the processor, depending on the performance required at a specific time [16].
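The effect of scaling voltage and frequency together can be illustrated with a small Python sketch. The text does not give a formula. The sketch uses the common first-order approximation for active CMOS power, P = C ∗ V² ∗ f, which is an assumption introduced here, and it ignores static power. The capacitance, voltages and frequencies are hypothetical values chosen for illustration.

```python
def dynamic_power(switched_capacitance_f: float,
                  voltage_v: float,
                  frequency_hz: float) -> float:
    """First-order active power of a CMOS circuit: P = C * V^2 * f (assumed model)."""
    return switched_capacitance_f * voltage_v ** 2 * frequency_hz


CAPACITANCE_F = 1e-9  # hypothetical effective switched capacitance

full_speed = dynamic_power(CAPACITANCE_F, voltage_v=1.2, frequency_hz=1.0e9)
scaled_down = dynamic_power(CAPACITANCE_F, voltage_v=0.9, frequency_hz=0.5e9)

# Power drops with both the lower frequency and the lower voltage.
power_ratio = scaled_down / full_speed

# A fixed task takes twice as long at half the frequency, so the energy
# per task only benefits from the voltage reduction.
energy_ratio = power_ratio * (1.0e9 / 0.5e9)

print(power_ratio, energy_ratio)
```

Under this model, lowering the frequency alone would not reduce the energy needed for a fixed amount of work. The saving comes from the lower voltage that the lower frequency permits.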

The number of transistors in a CPU has increased approximately as predicted by Moore’s law for the last forty years, which means doubling roughly every two years [17]. The number of transistors is reflected in chip performance. David A. Patterson points out that the bandwidth (performance) of CPUs has improved faster than that of other components [18]. The annual improvements can be seen in Table 2.1. The values in the table are without units, as they only show the annual improvement for each type of component in comparison to similar components from previous years. While the difference in improvement per year is not huge, it has been building up for many years. Patterson points out that the bandwidth between components such as the CPU and memory can always be improved by adding more communication paths between them, but that this is costly and increases both the energy consumption and the size of the circuits. In addition to the disproportionate increase in performance for the components in Table 2.1, Patterson raises the concern that latency has improved less than bandwidth. According to Patterson, marketing has been one reason for this imbalance, as an increase in bandwidth is easier to sell than a decrease in latency. Finally, Patterson notes that certain methods created to improve bandwidth, such as buffering, have a negative effect on latency [18].
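How a modest annual difference builds up can be seen by compounding the factors. The Python sketch below uses the CPU, DRAM and HDD values from Table 2.1 and a period of 20 years, the lower end of the range given in the table caption. The function name is hypothetical.

```python
def cumulative_improvement(annual_factor: float, years: int) -> float:
    """Total improvement after compounding an annual factor over several years."""
    return annual_factor ** years


YEARS = 20  # lower end of the 20-25 year period covered by Table 2.1

cpu = cumulative_improvement(1.50, YEARS)
dram = cumulative_improvement(1.27, YEARS)
hdd = cumulative_improvement(1.28, YEARS)

# How far the CPU has pulled ahead of memory and disk over the period.
cpu_vs_dram = cpu / dram
cpu_vs_hdd = cpu / hdd

print(f"CPU: {cpu:.0f}x, DRAM: {dram:.0f}x, HDD: {hdd:.0f}x")
print(f"CPU/DRAM gap: {cpu_vs_dram:.1f}x, CPU/HDD gap: {cpu_vs_hdd:.1f}x")
```

An annual difference of roughly 0.2 thus grows into a gap of more than an order of magnitude, which is why the CPU spends an increasing share of its time waiting for other components.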

The time it takes for a computer to execute a process does not depend only on the speed of its CPU. Other components in a computer, such as the random-access memory (RAM), are not as fast as the CPU. The speed difference causes the CPU to waste many clock cycles waiting for memory transactions. If data has to be fetched from a hard disk drive (HDD) the waiting period is further increased.

CPU     DRAM    LAN     HDD
1.50    1.27    1.39    1.28

Table 2.1: Relative annual bandwidth improvement of different computer components during the last 20-25 years [18]

If there is more than one task running on the same system, and the tasks run independently of each other, the system might well be able to execute other tasks while one is waiting for I/O. This works well if most tasks do not require I/O operations and access to memory. If the purpose of a system is mainly to run tasks that are I/O intensive and require many memory accesses, much CPU time is potentially wasted.

As Hamilton [1] points out, there are at least two ways of dealing with the performance imbalance problem. One is simply to invest in better bandwidth and communication paths between the memory and the CPU. Another is to avoid the problem by using lower-powered and cheaper CPUs that do not need memory that is as fast [1]. Hamilton also points out that server hardware is more expensive than client hardware because it is built to higher quality requirements and in lower volumes. Hamilton continues: “When we replace servers well before they fail, we are effectively paying for quality that we’re not using” [1]. Energy efficiency is in general better for newer hardware, which adds to the pressure to upgrade to newer servers.

2.5 Energy proportional computing

According to Barroso and Holzle [2] a server generally operates at 10 to 50 percent of its maximum capacity but is rarely completely idle. Having data on several servers improves the availability of the data; a side effect, however, is that more servers must be online. If servers were completely idle for significant periods, powering down a part of the server farm would allow significant power savings. In practice some sort of load and task migration or management system would be needed to distribute tasks favorably between the available servers, in order to allow a larger number of servers to be powered down. Barroso and Holzle also state that even when a server is close to idle it still consumes about half of its peak power. In a completely energy proportional server no energy would be used while the server is idle. Complete energy proportionality is not feasible with the manufacturing techniques and materials of today’s processors, due to leakage currents.
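The consequence of a high idle power can be made concrete with a simple model. The Python sketch below is an illustration only. It assumes that power grows linearly from the idle level to the peak level as utilization increases, which the text does not state, and it uses the reported idle level of about half of peak power. The function names are hypothetical.

```python
def server_power(utilization: float, peak_power_w: float, idle_fraction: float) -> float:
    """Power at a given utilization (0..1), assuming a linear idle-to-peak model."""
    idle_power_w = idle_fraction * peak_power_w
    return idle_power_w + (peak_power_w - idle_power_w) * utilization


def relative_efficiency(utilization: float, idle_fraction: float) -> float:
    """Work per watt relative to the same server running at full load."""
    if utilization == 0.0:
        return 0.0  # an idle server does no work, whatever it consumes
    return utilization / server_power(utilization, 1.0, idle_fraction)


# Typical operating range of 10 to 50 percent, idle power at half of peak.
for utilization in (0.1, 0.3, 0.5):
    print(utilization, relative_efficiency(utilization, idle_fraction=0.5))

# A perfectly energy proportional server (no idle power) for comparison.
print(relative_efficiency(0.1, idle_fraction=0.0))
```

Under these assumptions a server at 10 percent utilization delivers less than a fifth of the work per watt that it delivers at full load, whereas a perfectly proportional server would be equally efficient at every load level.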

Regardless of the average power consumption during standard operation, a data center must still have the infrastructure to support the maximum power that the servers can use, or are allowed to use. Reducing the peak power consumption also reduces the demand on the power and cooling infrastructure, the part of the infrastructure that is responsible for 82 percent of the total infrastructure costs [1]. The peak power consumption of a server is not necessarily the same as the combined peak power of the components the server is built from [7]. The maximum power consumption measured for a server constructed for the example was less than 60 percent of the combined peak power consumption of its components. Fan et al. [7] further state that power consumption is application specific. From the tests performed in preparation for this thesis, it is clear that even when the system reports full CPU utilization the actual power consumption of the CPU can vary. Furthermore, Fan et al. [7] state that in the case of an actual data center the consumption is 72 percent of the actual peak power consumption.

Figure 2.2: CPU contribution to total server power usage for two generations of Google servers. The rightmost bar shows the newer server when idling [2]

Figure 2.2 from [2] shows the percentage that the CPU contributes to the total energy consumption of the server. The data are from two servers used by Google in 2005 and 2007. The graphs show that, for the newer server, the CPU contributes approximately 45 percent of the total consumption at peak power and approximately 27 percent when idle. The power saving mechanisms of the server are unknown, but the data in the graph indicates that power saving works better on the processor itself than on the server as a whole, because the processor's contribution is smaller when the server is idling. Barroso and Holzle also report that, in their experience, the dynamic power range is below 50 percent for DRAM, 25 percent for disk drives and 15 percent for networking switches [2]. These observations are in line with the results in Figure 2.2.

The authors of [7] claim that peak power consumption is the most important factor guiding server deployment in data centers, but that the power bill is determined by the average consumption. A lower peak power consumption for the servers allows a larger number of servers within the same power budget, leading to a higher utilization of the cooling and power infrastructure and thereby a more effective use of the available resources and budget. The requirements for both cooling and power, including UPS, are reduced with lower peak power consumption.
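The distinction between peak and average power can be sketched in a few lines of Python. Peak power decides how many servers fit under a given power budget, while average power decides the bill. The function names are hypothetical, the 150 W server is an invented example, and the price of $0.07/kWh is taken from the cost example in Section 2.3.

```python
def servers_within_budget(power_budget_w: int, peak_power_per_server_w: int) -> int:
    """Number of servers that can be provisioned without exceeding the budget."""
    return power_budget_w // peak_power_per_server_w


def monthly_energy_cost(servers: int,
                        avg_power_per_server_w: float,
                        price_per_kwh: float = 0.07,
                        hours: int = 30 * 24) -> float:
    """Energy bill for one 30-day month, driven by average rather than peak power."""
    total_kw = servers * avg_power_per_server_w / 1000.0
    return total_kw * hours * price_per_kwh


BUDGET_W = 250_000  # power available to one container, as reported in [11]

# Lower peak power per server means more servers under the same budget.
print(servers_within_budget(BUDGET_W, 216))
print(servers_within_budget(BUDGET_W, 150))

# The bill depends on what the servers draw on average (hypothetical 100 W).
print(monthly_energy_cost(servers=1000, avg_power_per_server_w=100.0))
```

The sketch leaves out infrastructure overhead such as cooling, which would scale the bill according to the PUE of the facility.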

2.6 Energy efficient low power processors

Servers have generally been constructed for high performance, using high performance processors rather than energy efficient ones. Processors that originate from embedded systems are, in contrast, built mainly for energy efficiency. This is due to both the thermal constraints and the power constraints of battery powered devices. Processors of this kind hardly ever need active cooling, despite their small physical size.

By replacing an energy hungry high performance processor with a set of energy efficient processors originating from battery powered embedded systems, the energy consumption can be reduced. General purpose processors for embedded systems are produced in large numbers. To get the same amount of work done, a larger number of the slower processors is needed. When the number of processors is increased, the granularity of the power consumption also increases. For the change of processors to improve energy efficiency, the processors used must be at least as energy efficient as the one being replaced. Processors used in battery powered devices where computational power is required are ideal for evaluating the energy saving potential. The ARM Cortex-A8 and the Cortex-A9 MPCore are tested for this purpose. When using DVFS to reduce the energy consumption of the processor, the server continues in an operational state. A much greater energy reduction can, however, be achieved by entering a sleep state, where a processor is turned off and thereby not able to do any calculations until it is woken up. The time to enter and return from a sleep state is generally longer than the time to change between power states using DVFS. In a server with multiple processors it could be possible to put the ones that are not needed at the moment in a sleep state, in order to reduce power consumption. In this case it is, however, important to be able to predict how long switching between different power states takes, and to know whether the service deadlines allow for such a latency.

2.7 Summary

In a perfectly power proportional server the instantaneous power consumption is proportional to the required service level. An idling server would not use any energy and a server functioning at half capacity would use half of the server’s peak power consumption. Techniques such as DVFS and power gating are used to increase energy proportionality. In a modern computer based on CMOS circuits, complete energy proportionality is not achievable due to leakage currents. While the average power consumption determines the amount of energy actually used, the peak power consumption is what defines the required capacity of the cooling and power infrastructure. The cost of power and cooling infrastructure combined with direct energy consumption amounts to 42 percent of the total cost of a data center, and approximately 45 percent of a server’s peak power consumption is caused by the server’s processor or processors.
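The idea of power proportionality can be illustrated with a small sketch. It assumes a simple linear power model between idle and peak, which is an assumption made here for illustration and not a model taken from the measurements; the wattages are hypothetical.

```python
def server_power(utilization, peak_w, idle_w=0.0):
    """Linear power model (an assumption): idle_w at no load, peak_w at full load.

    With idle_w = 0 this describes a perfectly power proportional server.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_w + (peak_w - idle_w) * utilization


def proportionality_gap(utilization, peak_w, idle_w):
    """Extra power drawn compared to a perfectly proportional server."""
    return server_power(utilization, peak_w, idle_w) - server_power(utilization, peak_w)


if __name__ == "__main__":
    # Hypothetical server: 300 W at peak, 150 W when idle.
    for u in (0.0, 0.5, 1.0):
        print(u, server_power(u, 300.0), server_power(u, 300.0, 150.0))
```

In this model the gap to the ideal server is largest when idling and shrinks to zero at full load, which is why low utilization levels are the expensive case.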

Processors intended for use in embedded devices are designed for energy efficiency, in contrast to server processors, which are designed for performance. This thesis evaluates the potential benefit of replacing the server grade x86 processors used in modern servers with more energy efficient ARM Cortex-A9 MPCore processors.

3 EVALUATED COMPUTING PLATFORMS

This chapter describes the hardware and software that have been used for the benchmarking. The evaluated hardware uses the ARMv7 architecture based Cortex-A8 and Cortex-A9 MPCore processors. The platforms used for testing are the BeagleBoard, the Versatile Express with a CoreTile Express A9 MPCore daughter board and a Tegra 250 development board. The benchmarks presented in this chapter are used to evaluate the performance of the following applications: the Apache 2 HTTP server, an Erlang based SIP-Proxy used for session management and some benchmarks testing various aspects of the Erlang virtual machine itself. SIP, the SIP-Proxy and Erlang are also introduced.

In order to evaluate how well a cluster of ARM Cortex-A8 and Cortex-A9 MPCore processors is suited to replace the processors used in today's servers, the performance of single processors must first be evaluated. It is possible to determine how the energy efficiency compares between the different architectures by running comparison benchmarks on machines that are built using processors based on the x86 architecture. The performance of the processor architecture is shown better by using two different Cortex-A9 MPCore processors than by using only one. As the two Cortex-A9 MPCore machines have different clock frequencies as well as different numbers of cores, the scaling properties of the two performance increasing options can be evaluated. How the energy efficiency is affected by these factors must also be evaluated. In practice, this means evaluating whether increasing a processor's clock frequency or adding cores is more beneficial in terms of performance per watt. If the results for a single ARM processor show worse energy efficiency than modern server processors, a cluster of the low power processors will also have worse energy efficiency.

3.1 Hardware

3.1.1 BeagleBoard

The BeagleBoard [3] is a low cost system based on the ARM Cortex-A8 processor with low power requirements. The version of the BeagleBoard that was used for the measurements is the C3. It is equipped with a TI OMAP3530 chip with an ARM Cortex-A8 processor running at 600 MHz. The main storage device is a Micro SD card and there is 256 MB of DDR RAM available. A block diagram of the BeagleBoard is shown in figure 3.1 and a block diagram of the OMAP3530 chip in figure 3.2. The BeagleBoard that was used for the benchmarking had Ångström Linux installed with kernel version 2.6.32.

The first tests were run on a BeagleBoard B5. The B5 was later replaced by a C3, as the C3 has double the amount of RAM, allowing for a wider range of tests to be run. An improvement from the B5 model to the C3 model is a USB A port in addition to the OTG mini USB port found on the B5 board. Neither model has a built-in Ethernet port. Ethernet connectivity was not built in before the new xM model, and even on the xM it is a USB based Ethernet solution [19]. A USB to Ethernet adapter was found to be the best way to get network connectivity to the BeagleBoard and was therefore added. Due to the USB to Ethernet adapter, the maximum bandwidth is limited by the speed of USB 2.0 to 480 Mbps. This was not a serious limitation, given the limited performance of the BeagleBoard in the benchmarks.

Although the BeagleBoard did give some indications of how the test programs performed on an ARM based system, the tests were not conclusive. This is because of the many differences compared to “normal” computer systems, caused not just by the processor architecture but also by the small amount of RAM and the speed of the RAM. The slow Micro SD card used for main storage also slowed down the entire system. In the test with Erlang, a non SMP version of the Erlang run time system (erts) was used, as the Cortex-A8 only has one core.

Figure 3.1: BeagleBoard block diagram [3]

Figure 3.2: OMAP3530 block diagram [4]

3.1.2 Versatile Express

The Versatile Express [5] development platform that was used consisted of the Versatile Express Motherboard (V2M-P1) with a CoreTile Express A9 MPCore [5] (V2P-CA9) daughter board. In addition to the Quad Core Cortex-A9 MPCore, the daughter board has 1 GB of DDR2 memory with a 266 MHz clock frequency [5]. A block diagram of the Versatile Express can be seen in figure 3.3. The diagram shows a second daughter board, a LogicTile Express, in addition to the CoreTile Express. The particular machine used for the benchmarking did not have a LogicTile Express installed.

Figure 3.3: Block diagram of the Versatile Express with the Motherboard Express µATX, CoreTile Express A9x4 and LogicTile Express [5]

The ARM processor on the daughter board is a CA9 NEC [5] chip clocked at 400 MHz with limited power management functions. Power gating and DVFS are not supported on the chip [5], which needs to be noted when considering the power consumption of the system. A top level view of the chip can be seen in figure 3.4. As powering cores on and off is the main power reduction technique available on this particular chip, the power consumption is not precisely matched to the required performance. In a system where power gating and DVFS are available, the possibilities for power proportional computing are better.

The Versatile Express does, however, allow monitoring of both operating voltage and power consumption. To use this functionality, a kernel module was created and loaded into the kernel on the V2P-CA9 to enable use of the registers needed for collecting voltage, current and power consumption data. The registers used are VD10_S2 and VD10_S3. VD10_S3 is the power measurement device for the Cortex-A9 system supply, cores, MPEs, SCU and PL310 logic [5], and is the most interesting supply for this comparison. VD10_S2 is the current measuring device for the PL310, L2 cache and SRAM cell supply. A program was created that read the voltage, current and power values for both supplies once every second and stored them for further use. The program allowed continuous monitoring during benchmarking. Data logging at shorter intervals was also tested, but was discontinued in order to reduce the interference caused by the data collection, and because the added value was negligible.
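As an illustration of how power samples logged once per second can be turned into an energy figure, the following minimal sketch sums the samples over the logging interval. It is not the program used on the Versatile Express; the function names and the sample values are hypothetical.

```python
def energy_joules(power_samples_w, interval_s=1.0):
    """Approximate energy by treating each power sample (W) as constant
    for one logging interval (s). 1 W sustained for 1 s equals 1 J."""
    return sum(power_samples_w) * interval_s


def average_power_w(power_samples_w):
    """Mean power over the logged period."""
    return sum(power_samples_w) / len(power_samples_w)


if __name__ == "__main__":
    # Hypothetical readings from one supply, logged once per second.
    samples = [1.0, 1.5, 2.0, 1.5]
    print(energy_joules(samples), average_power_w(samples))
```

With a one second interval the energy in joules is simply the sum of the samples, which is one reason a coarse logging interval is sufficient when the load is steady.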

Furthermore, several possibilities exist for changing settings on the chip, such as the speed of the memory and the clock frequency for the cores. Suitable frequency combinations for the different clocks must be carefully calculated in order for the system to work properly. These changes must be done while the system is offline, as the settings are stored on a memory card on the Versatile Express Motherboard and read from there on startup. A Debian installation with a 2.6.28 Linux kernel was provided with the Versatile Express. Official support for the Versatile Express was not added to the Linux kernel before version 2.6.33. To determine the reasons for unexpected performance differences between the test platforms, mainly between the CoreTile Express and the Tegra 250, the impact of different parts of the test platforms was examined. Kernel version 2.6.33 was used to evaluate whether the unexpected performance differences were caused by the kernel version previously used. The main reasons for choosing version 2.6.33 for the test are that it supports the Versatile Express and is the closest possible version to the 2.6.32 used on the Tegra. Having the Tegra 250 and the CoreTile Express use similar software is useful for finding differences caused by the hardware. The operating system was installed on a USB flash drive, as the native memory card on the Versatile Express was significantly slower than the USB flash drive.

Figure 3.4: Top level view of the main components of the CoreTile Express A9x4 with the CA9 NEC chip [5]

3.1.3 Tegra

The Tegra [20] is a Tegra 200 series developer kit with a Tegra 250 system intended to support software development. The Tegra 250 chip includes a dual core Cortex-A9 MPCore processor running at 1 GHz. The board also contains 1 GB of DDR2-667 RAM and is equipped with an SMSC LAN9514 USB hub with integrated 10/100 Ethernet. The Tegra 250 board used had an additional PCI Express Gigabit Ethernet card installed in order to avoid networking bottlenecks. Compared to the older chip on the Versatile Express, the newer Cortex-A9 has both more advanced power management features and a different networking implementation. By evaluating the performance of both the Tegra and the CoreTile Express, the aim is to identify how a varying number of cores and differences in clock frequency are reflected in the performance of different applications. Ubuntu 10.04 was installed on the board with Linux kernel version 2.6.32, which was provided by Nvidia. As the only compatible kernel version available for the Tegra was 2.6.32, and the Versatile Express was not supported before 2.6.33, the two Cortex-A9 systems could not use the same kernel version. For the benchmarks the Tegra 250 board was controlled through its serial port and the two Ethernet ports.

As information on the power consumption of neither the parts of the Tegra 250 chip nor the entire chip was available, the values used are estimates derived from the information released by ARM [21]. The Tegra 250 chip also includes several other specialized processors in addition to the Cortex-A9 MPCore. This makes measuring the energy consumption of the Cortex-A9 even more challenging. There is little information available on the exact configuration and manufacturing process of the Tegra 250. According to ARM, a Dual Core Cortex-A9 built using the TSMC (Taiwan Semiconductor Manufacturing Company) 40G process, which is a 40 nm manufacturing process, uses 1.9 W at 2 GHz in a speed optimized implementation, resulting in 10000 DMIPS. A power optimized implementation uses 0.5 W at 800 MHz, providing 4000 DMIPS [21]. In this thesis the power consumption is estimated to be 1 W for the Cortex-A9 in the Tegra 250.
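The two ARM reference points quoted above can be compared with a few lines of arithmetic. The sketch below only restates the figures from [21]; the dictionary keys and function names are chosen here for illustration.

```python
# Figures quoted in the text for a Dual Core Cortex-A9 on the TSMC 40G process [21].
IMPLEMENTATIONS = {
    "speed_optimized": {"power_w": 1.9, "clock_mhz": 2000, "dmips": 10000},
    "power_optimized": {"power_w": 0.5, "clock_mhz": 800, "dmips": 4000},
}


def dmips_per_watt(dmips, power_w):
    """Performance per watt for one implementation."""
    return dmips / power_w


def dmips_per_mhz(dmips, clock_mhz):
    """Performance per clock for one implementation (both cores combined)."""
    return dmips / clock_mhz


if __name__ == "__main__":
    for name, impl in IMPLEMENTATIONS.items():
        print(
            name,
            round(dmips_per_watt(impl["dmips"], impl["power_w"])),
            dmips_per_mhz(impl["dmips"], impl["clock_mhz"]),
        )
```

Both implementations deliver 5 DMIPS per MHz, but the power optimized one reaches 8000 DMIPS per watt against roughly 5263 for the speed optimized one. The 1 W estimate used for the 1 GHz Tegra 250 lies between the two quoted power figures.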

3.1.4 Reference and client machines

In addition to the test machines with ARMv7-A processors, test machines with x86 processors are needed to compare the energy efficiency of the processor architectures. A variety of machines were used during the benchmarking, both to make comparisons possible and to enable the benchmarking in the SIP-Proxy and the Apache HTTP server tests. The results that are presented as reference values originate mainly from three different machines. The first has a Dual Core Intel E6600 processor, the second has two Quad Core Intel E5430 processors and the third has two Quad Core Intel L5430 processors.

According to the data sheet for the 5400 series [13], there are three different sub-series within the 5400 series, targeting different markets: the X5400, E5400 and L5400 sub-series. The X5400 series is described as a performance version and the E5400 as a mainstream performance version. The L5400 is described as a lower voltage and lower power version intended specifically for dual processor server blades. The listed thermal design power (TDP) for the X5400, E5400 and L5400 series is 130 W, 80 W and 50 W, respectively.

3.1.5 Network

The machines in the benchmarks that required interaction between several machines were connected in a number of different ways, depending on the test in question and partly on the available resources. Due to the design of the test machines, Gigabit Ethernet was not always available. The BeagleBoards had the option of Ethernet over USB or a USB to Ethernet adapter. In order to make the tests more comparable, the USB to Ethernet adapter option was used. Both the Versatile Express and the Tegra 250 had 10/100 Mbps Ethernet capabilities. In addition to the built-in Fast Ethernet, the Tegra had a Gigabit Ethernet card.

At first the machines that were to be benchmarked were connected directly to the benchmarking machine without any switches. To increase the number of clients, a Fast Ethernet switch was used in order to add up to six benchmarking machines. In order to enable benchmarks using a larger number of more powerful machines, a Gigabit Ethernet LAN was used. Through this network ten client machines were controlled by an eleventh machine. These machines were used to create the necessary traffic for the benchmark. To avoid and detect problems caused by other users of the same network, the tests were performed in the evenings outside office hours. Tests were also redone later to confirm the results. As the network bandwidth was limited, the file requested in the test was small, in order to keep the bandwidth requirements as low as possible. The theoretical maximum bandwidth for the LAN is one gigabit per second, or 131 072 KBps. The bandwidth of the network was not a problem for the ARM test machines. However, for the machines used for the reference results it was a potential bottleneck, considering their ability to serve tens of thousands of requests per second.
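The bandwidth figure above can be reproduced with a short calculation, which also gives a rough upper bound on the request rate the link can carry. The sketch assumes a binary gigabit (2^30 bits per second), since that is what yields 131 072 KBps; the 512 byte response size is a hypothetical value chosen for illustration, not one measured in the tests.

```python
GIGABIT_BITS_PER_S = 2**30  # binary gigabit, assumed to match the 131 072 KBps figure


def max_kbytes_per_second(bits_per_s):
    """Convert a link speed in bits per second to kilobytes (1024 bytes) per second."""
    return bits_per_s / 8 / 1024


def max_requests_per_second(bits_per_s, bytes_per_response):
    """Upper bound on responses per second if the link carried nothing else."""
    return bits_per_s / 8 / bytes_per_response


if __name__ == "__main__":
    print(max_kbytes_per_second(GIGABIT_BITS_PER_S))
    # Hypothetical total size per response (payload plus protocol overhead).
    print(max_requests_per_second(GIGABIT_BITS_PER_S, 512))
```

Even with a few hundred bytes per response, a server handling tens of thousands of requests per second uses a noticeable share of the link, which is why the requested file was kept small.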

3.2 Software

3.2.1 Erlang

Erlang [22] is a functional programming language and a virtual machine. The Erlang syntax resembles that of Prolog, which is not surprising, as it started out as a modified version of Prolog. The first version of Erlang was created at the Ericsson Computer Science Laboratory by Joe Armstrong, Robert Virding and Mike Williams. The development of Erlang began in the eighties and Erlang is still used by Ericsson in telecommunication applications [22]. It is designed to be highly concurrent and intended for fault tolerant soft real-time systems [23]. The aim was to create a language that would be suitable for creating telecommunication systems consisting of millions of lines of code. These systems are not only large, they are also meant to run constantly. To be able to run them continuously for as long as possible, software upgrades must be possible without stopping the system [24].

Erlang/OTP is often implied when discussing Erlang. OTP is short for Open Telecom Platform. It contains tools, libraries and procedures for building Erlang applications. It provides ready-made components, such as a complete web server and FTP server. It is also useful when creating telecommunication applications. Both the Erlang VM and OTP are open source licensed.

The Erlang run time system (erts) implements its own lightweight processes and garbage collection mechanism. Erlang is run as a single process in the host operating system, and schedules the Erlang processes within it. SMP Erlang enables the use of more than one CPU core on the host machine by using multiple schedulers. Each scheduler is run as a separate operating system thread in order to enable simultaneous execution. The number of schedulers can vary, but by default it is the same as the number of available CPU cores. From the user's perspective there is no difference if the cores are on the same CPU or on different CPUs in the same SMP machine. There is no shared memory between Erlang processes, which means that all communication is done using message passing, enabling the construction of distributed systems [23].

3.2.2 SIP-Proxy

SIP is short for Session Initiation Protocol, a standard defined by the IETF [25]. The IETF, or Internet Engineering Task Force, is an organization that develops and promotes Internet standards. The IETF does not have any formal membership or membership requirements [25]. SIP is an application-layer protocol for controlling sessions with one or more participants. It is used for creating, modifying and terminating sessions. The sessions can be multimedia, including video or voice calls, and the session modification possibilities include the ability to add or remove media and participants, and to change addresses. The protocol itself can be run on top of several different transport protocols, such as Transmission Control Protocol (TCP) or User Datagram Protocol (UDP). SIP includes features such as the possibility for a user to move around in a network while maintaining a single visible identifier. It is also possible to be connected to the network from several different places, for example by using several different phones that are associated with the same identifier. The SIP protocol does not provide services on its own. It does, however, provide primitives that can be used to implement a variety of services. There is a great variety of extensions for the SIP protocol that make it usable for many use cases and environments.

SIP enables the creation of an infrastructure consisting of proxy servers that users can use to access a service. A SIP-Proxy is a server that helps route requests to the current location of the user and makes requests on behalf of the client. The proxy also authenticates and authorizes users for the provided services. The protocol allows for registration of the users' locations to be used by the proxy servers.

3.3 Benchmarks

To evaluate the performance of our hardware, several benchmarks were used. First, the performance of the Erlang run time system (erts) is benchmarked to see how well it performs on the hardware. This shows how well an application running on top of Erlang could be expected to run. By benchmarking the SIP-Proxy and the Apache 2 server, the performance of actual services is evaluated for all the hardware. More precise information about the benchmark setups is presented in the following sections, and the results are presented and analyzed in the following chapter.

3.3.1 Apache 2

The Apache 2.2 HTTP server was used to determine how well the Cortex-A9 MPCore machines can perform traditional server tasks. Apache 2.2 was chosen as it is freely available, open source and has been one of the most popular servers for a long time. The Apache HTTP server is available for many platforms, such as Linux and Mac OS X, and is used with a variety of architectures. As these benchmarks target the x86 and ARMv7-A architectures, the ability of the Apache HTTP server to run on both is crucial. These tests focus on use cases with small static files. The initial tests were run using Apache Bench (AB), a tool for quick performance testing. The use of AB was later discontinued in favor of autobench, in order to produce more reliable results.

Autobench [26] was used to measure how well Apache performs on the different machines and with different test parameters. Autobench is a tool that helps automate the use of httperf. Httperf is a program for benchmarking the performance of an HTTP server. It creates connections to a server in order to fetch a file. During one connection, one or several requests for the file are made, depending on the test parameters. By changing the rate at which the connections are created and the number of requests for each connection, the load on the server varies. By running httperf several times with different connection rates, the server's response to different loads can be evaluated. By running the test from several machines simultaneously, the load generating capabilities of the test setup can be taken beyond those of a single machine, but the instances must be controlled separately. The results from such tests must also generally be combined manually afterwards. To decrease the error caused by differences in the starting time of the test on different test machines, the testing time should be relatively long [27].

One of the most useful features of autobench is that it runs multiple tests while steadily increasing the load on the server according to the user's instructions. In order to run autobench, a number of parameters must be set. The parameters can either be set in a configuration file or given as command line arguments when starting the test. An example of benchmarking a single server, giving the arguments on the command line, looks like the following.

    autobench --single_host --host1 www.google.com --uri1 /10B \
        --quiet --low_rate 100 --high_rate 1000 --rate_step 20 --num_call 10 \
        --num_conn 10000 --timeout 5 --file benchmarkresult.tsv

The single_host option indicates that only one server is benchmarked. The host1 option is the address of the server, in this case www.google.com. The uri1 option is the file that is requested in the test; a file called 10B is used in this example. The quiet option must be used if the results are expected to be sent to STDOUT, as it restricts the amount of data that httperf produces. Too much output from httperf causes autobench to create badly formatted report tables. The benchmarking starts from the point indicated by low_rate and continues to the upper limit given by high_rate, using the step size from rate_step. In this case the test starts from 100 connections per second and is run until 1000 connections per second is reached, adding 20 connections per second for each new run. The num_call option regulates how many requests for the particular file should be made for each connection, 10 in this case. If the keep alive option is disabled on the server, the test will report a number of unsuccessful attempts at retrieving the file. The number of attempted requests per second is calculated by multiplying num_call by the connection rate. The num_conn option specifies the number of connections to create for each step. As the time taken to attempt a number of connections depends on the rate at which the connections are created, a more convenient alternative is to replace num_conn with const_test_time. The const_test_time option specifies how long the test should continue, automatically calculating the number of connections for all request rates during the testing. The timeout option sets the time in seconds that a request is allowed to last before it is considered a failure. A longer timeout value causes greater fd (file descriptor) usage than a shorter one, increasing the risk of stability issues for the system under test. The file option simply specifies the name of the file where the results are stored.
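The relationship between the parameters described above can be sketched as follows. This is an illustration of the arithmetic only, not autobench's own code, and it assumes that the sweep includes high_rate itself; the function and key names are hypothetical.

```python
def rate_steps(low_rate, high_rate, rate_step):
    """Connection rates for each run, assuming high_rate is included."""
    return list(range(low_rate, high_rate + 1, rate_step))


def sweep(low_rate, high_rate, rate_step, num_call, num_conn):
    """Derived figures for every run in a sweep."""
    rows = []
    for conn_rate in rate_steps(low_rate, high_rate, rate_step):
        rows.append({
            "conn_rate": conn_rate,                 # connections per second
            "req_rate": conn_rate * num_call,       # attempted requests per second
            "duration_s": num_conn / conn_rate,     # time needed to open num_conn connections
            "total_requests": num_conn * num_call,  # requests attempted in this run
        })
    return rows


if __name__ == "__main__":
    # Parameter values from the example command in the text.
    rows = sweep(low_rate=100, high_rate=1000, rate_step=20, num_call=10, num_conn=10000)
    print(len(rows), rows[0], rows[-1])
```

With a fixed num_conn the first run takes 100 seconds while the last takes only 10 seconds, which illustrates why const_test_time is the more convenient option.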

In theory, almost any benchmarking program could have been run on several machines simultaneously, with the results then combined manually. Autobench helps automate this process for httperf. Two programs are provided together with autobench: autobench_admin and autobenchd. Their purpose is to make the use of several machines for testing convenient. The idea is that autobenchd is run on all client machines and one of them runs autobench_admin. Autobench_admin distributes the required requests between all client machines. After the instances of autobenchd have completed their part of the test, the results are collected and combined by autobench_admin. With the basic configuration the requests are distributed equally among all autobenchd instances. This has two implications. The first is that the requested number of connections must be evenly divisible between all instances. The second is that, in order to get clean results, each client machine should only be asked to create a number of requests that it is able to produce. The results from a test where a client machine has not been able to produce enough requests due to limitations of its own look similar to results where the server being tested is not able to reply to the requests.
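The two implications of the equal distribution can be expressed as a simple check. The sketch below is not part of autobench; the function name and the max_rate_per_client parameter, meaning a separately measured limit of what one client can generate, are hypothetical.

```python
def split_rate(total_rate, num_clients, max_rate_per_client):
    """Split a connection rate equally between client machines.

    Raises ValueError if the rate is not evenly divisible or if one
    client would be asked for more than it is able to generate.
    """
    if total_rate % num_clients != 0:
        raise ValueError("rate must be evenly divisible between the clients")
    per_client = total_rate // num_clients
    if per_client > max_rate_per_client:
        raise ValueError("clients cannot generate the requested load")
    return [per_client] * num_clients


if __name__ == "__main__":
    # Ten clients, each assumed to manage at most 200 connections per second.
    print(split_rate(1000, 10, 200))
```

Rejecting a configuration up front avoids results where client side limits are mistaken for a saturated server.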

In the results file, autobench stores the number of requests that were supposed to be made, followed by the number that was actually attempted and the number of connections created. Statistics on the minimum, mean and maximum number of requests during the test are also provided. The amount of bandwidth used and the number of errors are also listed. As the tests are run several times, trends are easily detectable. Short and sudden interference in particular is generally obvious, while interference that remains constant is more difficult to detect. By running the tests several times, and with an increasing load, more information is generated on how the server responds to a varying number of connections and requests. Testing several times with similar test parameters also gives an overview and helps verify whether something has interfered with the test.

As illustrated in Figure 3.5, the test was set up so that the machines that were to be benchmarked were connected through Ethernet to the client machines that created the requests. In order to make sure that the server was the bottleneck in the final tests, i.e. that the data produced were meaningful, the test was run several times with slight variations in the test parameters. To guarantee that the client machines were able to create enough traffic, the same tests were run using an increasing number of client machines. If the results remained close to the same even when the number of client machines was increased, the load created by the client machines was sufficient. The number of served requests was set as the metric for performance. The only requirement was that the requests being counted would be served within five seconds; the rest of the responses were discarded. Other quality of service aspects, such as the number of unanswered requests, were ignored, as the focus was on maximum performance rather than quality of service.

The original plan was to leave all installations of the Apache 2 HTTP server in their default configurations to get a generic comparison. It soon became clear that the default configurations for the Apache server installations were not identical. Although the same version of Apache (Apache 2.2) was provided with all Linux distributions that were used, the configurations varied. It was expected that there would be differences such as where configuration files were located on the different distributions, but not that the default configurations would differ. The main difference was that the keep alive option was disabled in the Apache installation on the Fedora machine. The option was therefore enabled on all machines.

By running the same static page fetching test with files of different sizes and monitoring the bandwidth usage, it was determined whether a file could be used safely without running into problems with the available network bandwidth. A small file size was required, as the Versatile Express was equipped with only a 10/100 Mbps Ethernet card rather than Gigabit Ethernet. The reference machine was expected to serve a large number of requests, resulting in high bandwidth usage even with small files. In tests where small files are used, the overhead caused by the underlying protocols becomes significant.

Figure 3.5: Test setup for Apache test

3.3.2 Basic Erlang performance benchmarks

To evaluate the performance of the Erlang Virtual Machine (VM) running on the test machines, some general performance benchmarking was done. A set of micro benchmarks running on the Erlang VM was used. Some of the benchmarks were able to make use of more than one core on the host machine, while others did not gain any notable benefit compared to running on just one core. The micro benchmarks are designed to stress different parts of the VM. While these tests do not emulate a realistic service producing scenario, they do give information on the general performance levels of different parts of the systems. This information can then be used to compare the performance of the VMs running on different platforms, and provides a way to estimate how well different applications could be expected to run. The results are included as Appendix A. For reasons such as insufficient memory in the test hardware, not all benchmarks were run on all available test platforms. If a benchmark could not be run on all machines, its results were not analyzed for the remaining machines either.

3.3.3 SIP-Proxy

An Erlang based SIP-Proxy (Session Initiation Protocol Proxy) was tested to find out how well an ARM Cortex-A9 would perform in telecom applications. The performance of the SIP-Proxy was measured as the number of calls per second it could handle. The metric chosen for the energy efficiency of the proxy was the number of calls the proxy could handle for each Joule used. As the proxy runs on top of Erlang, the results from this particular proxy reflect the results from the Erlang micro benchmarks. The value of this benchmark comes from giving a performance and energy efficiency evaluation for a realistic service, rather than just for parts of the system as the micro benchmarks did.
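Since one watt equals one Joule per second, the calls per Joule metric is the sustained call rate divided by the power drawn. The following sketch shows the calculation; the function name is chosen here for illustration and the numbers are hypothetical placeholders, not measured results.

```python
def calls_per_joule(calls_per_second, power_watts):
    """Energy efficiency of the proxy: (calls/s) / (J/s) = calls/J."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return calls_per_second / power_watts


if __name__ == "__main__":
    # Hypothetical values for illustration only, not results from this thesis.
    low_power = calls_per_joule(calls_per_second=100, power_watts=1.0)
    server = calls_per_joule(calls_per_second=2000, power_watts=100.0)
    print(low_power, server, low_power / server)
```

A slower processor can thus come out ahead on this metric as long as its power draw is lower by a larger factor than its call rate.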

To measure the performance of the proxy, two other machines were used, as shown in Figure 3.6. One machine was used to create the messages to be passed and the other was used as the receiver. Both of these machines were running SIPp, an open source test and traffic generation tool made available by HP. The version used was 3.1, compiled from source. The bandwidth required by the proxy running on a machine capable of only a limited number of calls each second is not large, and bandwidth was therefore not expected to be an issue here. The proxy had a file descriptor (fd) leak, and in order to avoid issues caused by this, the maximum number of file descriptors was increased both system wide and for each user. While the proxy was running there was a constant small increase in the memory used by the underlying erts. The total memory usage of the system was, however, modest. To get more reliable results and avoid as much interference as possible from the unwanted accumulating use of system resources caused by the fd leak, the proxy was restarted between every test.

In order to evaluate the performance of the proxy, a definition of when the proxy passed a test was needed. In the reference results provided by Ericsson the proxy had been expected to run for a few minutes. The reference results used in this benchmark are from a machine with two Quad Core Intel Xeon L5430 processors and 8 GB of RAM. As the testing focused on processors with lower performance, stricter requirements were set up. This was done to enable a more accurate comparison between the different ARM Cortex-A9 MPCore processors that were to be tested.

Figure 3.6: Test setup SIP-Proxy test

The test machines were required to be able to pass all requested messages for a two minute period. The two minute requirement was the result of two factors. The first was the requirement used by Ericsson, which the new requirement needed to be in line with. The second was the way the proxy used system resources. When running the proxy, the CPU utilization level of the system increased rapidly for a while. After the rapid increase, a very slow increase was observable for the entire time the proxy was running. The rapid increase was observed to halt well before the two minute mark for all the tested ARM Cortex-A9 MPCore systems. There was little need to discuss an acceptable error rate, as a low sustainable error rate was not encountered during the testing. If the load was not significantly decreased quickly after a failed message caused by a high load, the failure rate would increase rapidly.

The erts on both the Versatile Express and the Tegra 250 was recompiled to support profiling using Gprof. Some optimizations had to be disabled in the make files for the erts in order for Gprof to work. Gprof shows the function calls executed by a specific program; Oprofile was also used, as it is able to perform system wide profiling. To enable profiling using Oprofile the kernels on both machines were recompiled. As profiling has a negative effect on system performance, the profiling enabled versions of the erts and the kernels were not used when obtaining results for maximum performance in any benchmark. The versions that supported profiling were only used to find reasons for unexpected anomalies and performance differences.

3.4 Summary

To evaluate the energy efficiency of the ARM Cortex-A8 and the Cortex-A9 MPCore compared to processors built on the x86 architecture for server tasks, a set of test hardware has been used. To evaluate the performance of the Cortex-A8, a BeagleBoard is used. The Cortex-A9 is evaluated using a Tegra 200 development kit with a Tegra 250 chip, and a Versatile Express with a CoreTile Express. An Apache 2.2 HTTP server is used to evaluate the energy efficiency of the Cortex-A9 MPCore processor compared to an Intel Xeon processor when serving static files to clients. The performance of the Erlang VM on the test machines is evaluated directly using a set of micro benchmarks, as well as an Erlang based SIP-Proxy.


4 PERFORMANCE COMPARISON

This chapter explains the individual executions of the benchmarks. After each benchmark execution the corresponding results are presented. The results are followed by performance comparisons. After the pure performance comparisons the energy efficiencies are compared. All energy efficiency comparisons in this chapter focus on the energy consumption of the processors themselves, rather than on the total consumption of the computers being benchmarked.

4.1 Apache results

After reaching its peak performance in this particular test, the performance of the Versatile Express dropped quickly, and in the end the server became unresponsive. This can be seen in Figure 4.1. This sometimes happened before full CPU utilization was reached. A reason for this could be the network implementation on the Versatile Express. If enough interrupts are generated in order to handle TCP packets, an increasing part of the runtime is eventually used up by interrupts, leaving a decreasing amount of resources available to actually provide the intended service. This has not been proven to be the case here, but is a viable possibility. To deal with the unresponsiveness the test machine was restarted between tests.

When benchmarking the machine with the two Intel Quad Core Xeon E5430 processors using the same test parameters as for the rest of the machines, the test proved not to be CPU intensive enough, as full CPU utilization was not achieved. In order to find the bottleneck, the number of clients was increased. The test was run using both five and ten client machines. As the results were the same in both tests, the performance of the client machines was not a bottleneck. The bandwidth was tested by redoing the test using a larger file than the original. The result from the test with the larger file was close to that of the original test, with the biggest difference being a higher bandwidth usage. The system reported no shortage of available memory in any of the Apache HTTP server benchmarks.

Figure 4.1: Comparison between CoreTile Express, Tegra and an Intel Pentium 4 powered machine running the Apache HTTP server.

The machine with the two Quad Core Xeons was able to serve 36000 requests per second when a hundred requests were made for each connection. For ten requests for each connection the result was only 6200 requests/s. The data transfer during the test with 36000 requests per second was reported by Autobench to be 11600 KBps. Compared to the theoretical maximum bandwidth for a gigabit network (131072 KBps) the used bandwidth was less than 10 percent. Although the theoretical bandwidth is generally not achieved in a real life network, more than 10 percent is achievable. In addition, higher data transfer rates from the same server, using the same network, were achieved using a larger file. As the network seemed an unlikely bottleneck, other parts of the test setup were inspected. The machine running the Apache server reported 60 percent CPU utilization for the test with 36000 requests and 10 percent for the test with 6200 requests. If the CPU utilization level and the web server performance continued to have the same relation to each other, the performance in both cases would be around 60000 requests per second at full CPU utilization. Figure 4.2 shows the CPU utilization at a few points during the Apache test for the dual E5430 machine. Assuming the performance of one E5430 running at 100 percent is the same as the performance of two running at 50 percent, the performance for one E5430 is 33000 requests per second.
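The extrapolation to full CPU utilization and the bandwidth check described above are simple proportional calculations. The following Python sketch reproduces them from the figures given in the text, under the same assumption the text makes, namely that throughput scales linearly with CPU utilization. The function name is chosen for illustration only.

```python
def extrapolate_to_full_load(requests_per_second, cpu_utilization_percent):
    """Estimate throughput at 100 percent CPU, assuming linear scaling."""
    return requests_per_second * 100.0 / cpu_utilization_percent


# Dual Xeon E5430 machine, 100 and 10 requests per connection.
full_load_100 = extrapolate_to_full_load(36000, 60)
full_load_10 = extrapolate_to_full_load(6200, 10)

# Share of the theoretical gigabit bandwidth used in the 36000 requests/s test.
bandwidth_share_percent = 11600 / 131072 * 100

if __name__ == "__main__":
    print(f"100 requests per connection: {full_load_100:.0f} requests/s")
    print(f"10 requests per connection:  {full_load_10:.0f} requests/s")
    print(f"bandwidth used: {bandwidth_share_percent:.1f} percent")
```

Both test configurations extrapolate to roughly the same value for the two-processor machine, which supports treating the CPU as the limiting resource once the other bottlenecks are removed.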

Figure 4.2: CPU utilization during test on machine with two Quad Core Intel Xeon E5430 processors

The results from the Apache test shown in Table 4.1 are from fetching a static file of size 10 bytes. Both 10 calls per connection and 100 calls per connection were requested in the tests, and the better of the two results was used for the comparison. The performance of the Tegra 250 was more or less the same when making ten or a hundred requests for each connection. For the comparison machine with the two Xeon processors the results differed by a factor of almost six (36000 compared to 6200 requests per second).

Machine                                        Requests / second   Requests / Joule

Quad Core Intel Xeon E5430 (2.66 GHz, 80 W)    33000               413
Pentium 4 (2.8 GHz)                            7100                80
Dual Core Cortex-A9 MPCore (1 GHz)             4600                4600
Quad Core Cortex-A9 MPCore (400 MHz)           3400                2833
Cortex-A8 (600 MHz)                            760                 760

Table 4.1: Ability of Apache 2.2 to serve a 10 byte static file using different hardware

As can be seen in Table 4.1, the Tegra 250 managed to serve 4600 requests per second and the Versatile Express 3400 requests per second. The performance difference between the Versatile Express and the Tegra 250 is likely caused by both the lower clock frequency of the CPU and the network implementation. The difference in combined clock frequency between the two processors is on its own slightly smaller than the performance difference. The combined clock frequency of the Versatile Express is 1600 MHz (4 * 400) and that of the Tegra 250 is 2000 MHz (2 * 1000). The Versatile Express thus has 80 percent of the clock ticks of the Tegra 250, while its performance is slightly lower, 74 percent of that of the Tegra 250.
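The comparison of combined clock frequency and measured throughput can be checked with a few lines of Python. This is only a restatement of the arithmetic in the text; the function name is illustrative, and the combined clock frequency is a rough proxy that ignores everything except core count and clock rate.

```python
def combined_clock_mhz(cores, clock_mhz):
    """Sum of the clock frequencies of all cores, in MHz."""
    return cores * clock_mhz


versatile_clock = combined_clock_mhz(4, 400)   # Quad Core Cortex-A9, 400 MHz
tegra_clock = combined_clock_mhz(2, 1000)      # Dual Core Cortex-A9, 1 GHz

clock_ratio = versatile_clock / tegra_clock    # share of clock ticks
performance_ratio = 3400 / 4600                # share of Apache requests/s

if __name__ == "__main__":
    print(f"clock ratio:       {clock_ratio:.0%}")
    print(f"performance ratio: {performance_ratio:.0%}")
```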

The results from the Apache tests were mainly compared against a machine with two Intel Quad Core Xeon processors running at 2.66 GHz. To provide a more comprehensive comparison and more reference points, a machine with a Pentium 4 (2.8 GHz) was also benchmarked. While the machine with the more traditional server processors outperforms the tested Cortex-A9 processors, the Cortex-A9 processors do well when their energy consumption is taken into account. The Intel Xeon processor (E5430) used in the reference machine has a reported maximum thermal design power (TDP) of 80 W, while the Quad Core Cortex-A9 has, according to the tests performed, a maximum measured power consumption of 1.2 W.

The rightmost column in Table 4.1 shows the number of requests answered per Joule used. A clear improvement in energy efficiency is visible, starting from the Pentium 4 to the Dual Core ARM Cortex-A9 MPCore. Figure 4.3 shows the energy efficiency comparison as a bar diagram.

Figure 4.3 indicates an energy efficiency, measured as performance per Joule, of about 6.9 times that of the Intel Xeon for the Versatile Express. The results can be assumed to be somewhat better for the Intel Xeon in practice when its actual power consumption, rather than its TDP, is taken into account. The actual power dissipation of the Quad Core ARM was, however, also below 1 W during the test, rather than the measured maximum of 1.2 W that was used for the calculations. As there are no actual numbers available for the power consumption of the CPU on the Tegra 250 board, an estimate of 1 W is used for its power consumption. For the Tegra 250 the energy efficiency was approximately 11.1 times that of the reference Intel Xeon processor. A clear improvement in energy efficiency is also visible between the Pentium 4 processor and the Xeon. This improvement is an indication of the energy efficiency improvement for Intel's x86 based processors. One of the major improvements from the Pentium 4 to the Xeon E5430 is the manufacturing technology, which has improved from 90 nm to 45 nm.
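The requests per Joule metric follows directly from throughput and power, since one watt is one Joule per second. The Python sketch below recomputes the values for the three processors whose power figures are stated in the text (80 W TDP for the Xeon, 1.2 W measured maximum for the Quad Core Cortex-A9, and the 1 W estimate for the Tegra 250). The function name is illustrative.

```python
def requests_per_joule(requests_per_second, power_watts):
    """Requests served per Joule; 1 W equals 1 J/s."""
    return requests_per_second / power_watts


xeon = requests_per_joule(33000, 80.0)       # Xeon E5430, TDP
quad_a9 = requests_per_joule(3400, 1.2)      # Versatile Express, measured maximum
dual_a9 = requests_per_joule(4600, 1.0)      # Tegra 250, estimated power

quad_vs_xeon = quad_a9 / xeon
dual_vs_xeon = dual_a9 / xeon

if __name__ == "__main__":
    print(f"Xeon E5430:  {xeon:.1f} requests/J")
    print(f"Quad A9:     {quad_a9:.1f} requests/J ({quad_vs_xeon:.2f} times the Xeon)")
    print(f"Dual A9:     {dual_a9:.1f} requests/J ({dual_vs_xeon:.2f} times the Xeon)")
```

The small difference between the ratio computed here for the Tegra 250 and the 11.1 given in the text comes from rounding the Xeon value to 413 requests per Joule before dividing.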

Figure 4.3: Number of requests handled for each Joule used by the CPU

4.2 Emark results

A set of benchmarks called Emark was used to evaluate the performance of the erts. The benchmarks are meant to be used for evaluating the erts and its performance on particular hardware. The test package included a set of baseline results for comparison. The benchmarks can be used either to test the performance of different erts implementations or to compare different hardware. In the results the same version of the erts was used, in order to make the results depend as much as possible on the differences in hardware rather than on differences in software. The different benchmarks return results using different metrics: some measure time, others the number of transactions, and the Stones test gives its results in "stones". All the results shown here are comparisons to the baseline results unless otherwise mentioned. Regardless of the metrics in the original benchmarks, they are to be interpreted as how many times worse the tested systems performed than the baseline. A lower score in these results is always better and should be interpreted as how many of these machines would theoretically, in a perfect world and without any overhead, be needed to replace the baseline machine in the particular test. The machine used for the baseline has a Dual Core Intel E6600. The chip is built using a 65 nm technology and has a TDP of 65 W [28].
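The normalization described above can be sketched as follows. The thesis does not show how the relative scores were computed, so this is an assumed formulation: for metrics where lower is better (such as time) the measured value is divided by the baseline, and for metrics where higher is better (such as transactions) the baseline is divided by the measured value. The function name and the `lower_is_better` flag are hypothetical.

```python
def times_worse_than_baseline(measured, baseline, lower_is_better):
    """Return how many times worse a result is than the baseline.

    A value of 1.0 means equal performance; larger values are worse.
    """
    if lower_is_better:
        return measured / baseline
    return baseline / measured


if __name__ == "__main__":
    # Time based example using the single scheduler Netio values of Table 4.4 (ms).
    print(round(times_worse_than_baseline(248217, 12998, True), 1))
    # Throughput based example with made-up numbers, for illustration only.
    print(times_worse_than_baseline(50, 200, False))
```

Since Table 4.2 reports averages over several runs with different parameters, a single run normalized this way will not necessarily match the table.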

Among the tests that were run was a message passing test called "big bang". It creates a thousand processes, and every process sends a "ping" message to every other process; every process that receives a "ping" responds with a "pong" message. An advantage over the message passing test in the Stones benchmark is that it is capable of using more than one core. The inability to effectively use more than one core at a time holds true for all the tests in the Stones benchmark. This can easily be seen in Appendix A. Table 4.2 shows how the different machines performed in the benchmarks compared to the baseline results. All results in the table were measured using as many OS processes as there are available cores on the particular test machine. The BeagleBoard uses an erts implementation without SMP support and the others run SMP enabled erts implementations. Most of the benchmarks here have been run several times with different parameters, and the results in the table are average values. If results for some test were not available, the results from the corresponding test on the other machines have also been omitted. Short explanations of what the different benchmarks test are given in Table 4.3.
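The message pattern of the "big bang" test can be illustrated with a small Python sketch. This is not the Emark benchmark, which is written in Erlang; it only mimics the ping and pong exchange using `asyncio` tasks and queues in place of Erlang processes and mailboxes, and it runs on a single core. All names are chosen for illustration.

```python
import asyncio


async def process(me, inboxes, results):
    """One participant: ping everyone else, answer pings, and collect pongs."""
    others = len(inboxes) - 1
    for target, inbox in enumerate(inboxes):
        if target != me:
            inbox.put_nowait(("ping", me))
    pings = pongs = 0
    # Every participant expects one ping and one pong from each other participant.
    while pings < others or pongs < others:
        kind, sender = await inboxes[me].get()
        if kind == "ping":
            pings += 1
            inboxes[sender].put_nowait(("pong", me))
        else:
            pongs += 1
    results[me] = (pings, pongs)


async def big_bang(process_count):
    inboxes = [asyncio.Queue() for _ in range(process_count)]
    results = [None] * process_count
    await asyncio.gather(
        *(process(i, inboxes, results) for i in range(process_count))
    )
    return results


def run_big_bang(process_count):
    return asyncio.run(big_bang(process_count))


if __name__ == "__main__":
    counts = run_big_bang(50)
    print(sum(pings + pongs for pings, pongs in counts), "messages delivered")
```

With n participants the pattern delivers 2 * n * (n - 1) messages in total, which is why the test stresses message passing rather than computation.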

Most of the benchmarks in Table 4.3 get better results when using more than one scheduler. There are, however, some exceptions. The results show that the benchmarks Codec, Containers and Msgq do not gain any benefit from using more than one scheduler.


Test             BeagleBoard   CoreTile Express   Tegra 250

Bang             17.8          14.2               6.7
Big              38.8          12.3               9.7
Chameneosredux   5.6           17.3               6.7
Codec            19.4          19.9               7.0
Containers       34.4          30.2               14.1
Ets_test         10.7          5.9                3.7
Genstress        17.6          14.4               7.6
Mbrot            40.3          12.9               8.5
Msgq             1.0           1.5                0.9
Netio            29.3          12.4               12.6

Table 4.2: Performance of ARM test machines compared to Baseline results

Test             Explanation

Bang             All to one message passing
Big              All to all message passing
Chameneosredux   Shake hands with everyone
Codec            Encode Decode binaries (test binary to term)
Containers       Adds and lookups in containers (ADT's)
Ets_test         Ets insert/lookup (also in parallel)
Genstress        Genserver test
Mbrot            Mandelbrot calculations (concurrent, number crunching)
Msgq             Message queue bashing
Netio            TCP messages

Table 4.3: Explanations on what the benchmarks evaluate


              SMP1     SMP2     SMP3     SMP4

Baseline      12998    10820
BeagleBoard   284375
V2P-CA9       248217   141691   105538   106230
Tegra 250     102404   102403

Table 4.4: Results from Netio benchmark

The benchmark named Netio, which tests TCP messages, shows an interesting behavior. The results from the benchmark are shown in Table 4.4. Unlike Table 4.2, the results in this table are the direct output from the benchmark. According to the source code for the benchmark the values represent milliseconds. A lower result is better. In this particular test run the following parameters were used: 200 connections, 1000 packets and a packet size of 10000. There is a clear increase in performance when going from one to three schedulers on the V2P-CA9. Between three and four schedulers the test shows no further improvement; it actually shows a decrease. The decrease is still small enough to be disregarded, as the test results differ slightly between test runs. For the Tegra 250 the results for one and two schedulers are basically identical. The difference between the best results for the CoreTile Express and the Tegra 250 is only three percent. The benchmark is designed to stress the packet receive and accept processes and is not affected by external factors, as it is not dependent on I/O.
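The scaling behavior described above can be quantified from the raw values in Table 4.4. The Python sketch below computes the speedup relative to one scheduler and the gap between the best results of the two Cortex-A9 machines. The dictionary layout and function names are chosen for illustration.

```python
# Netio results in milliseconds (lower is better), keyed by scheduler count.
NETIO_MS = {
    "V2P-CA9": {1: 248217, 2: 141691, 3: 105538, 4: 106230},
    "Tegra 250": {1: 102404, 2: 102403},
}


def speedup(results, schedulers):
    """Speedup over the single scheduler run for a given scheduler count."""
    return results[1] / results[schedulers]


def best_time(results):
    """Shortest run time over all scheduler counts."""
    return min(results.values())


v2p = NETIO_MS["V2P-CA9"]
tegra = NETIO_MS["Tegra 250"]
gap_percent = (best_time(v2p) - best_time(tegra)) / best_time(tegra) * 100

if __name__ == "__main__":
    for count in (2, 3, 4):
        print(f"V2P-CA9 speedup with {count} schedulers: {speedup(v2p, count):.2f}")
    print(f"Tegra 250 speedup with 2 schedulers: {speedup(tegra, 2):.2f}")
    print(f"gap between best results: {gap_percent:.1f} percent")
```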

The results from the Stones benchmark are shown in Table 4.5. The table shows that the performance of the Tegra 250 is consistently better than that of the V2P-CA9. Interestingly, the BeagleBoard outperforms not just the CoreTile Express but also the Tegra 250 in some of the benchmarks. As only the Links benchmark is able to gain any benefit from using more than one core, the BeagleBoard, with its higher clock frequency, has an advantage over the CoreTile Express in the rest of the benchmarks. In the small and medium message passing benchmarks it is even faster than the Tegra 250.


Test                               BeagleBoard   CoreTile Express   Tegra 250

List manipulation                  26            25                 11
Small messages (message passing)   10            35                 14
Medium messages                    13            34                 14
Huge messages                      17            28                 11
Pattern matching                   32            30                 14
Traverse                           35            27                 13
Work with large dataset            32            26                 12
Work with large local dataset      35            28                 13
Alloc and dealloc                  22            23                 10
Bif dispatch                       11            18                 7
Binary handling                    15            27                 11
Ets datadictionary                 27            46                 19
Generic server (with timeout)      15            37                 15
Small Integer arithmetic           30            23                 12
Float arithmetic                   31            37                 9
Function calls                     38            30                 13
Timers                             10            26                 8
Links                              18            9                  8

Table 4.5: Results from the Stones benchmark compared to the baseline


4.3 SIP-Proxy results

Ericsson provided reference results for the benchmark. The machine used for this comparison has two Quad-Core Intel Xeon L5430 processors running at 2.66 GHz. The test results for the reference machine are presented in Table 4.7. If the CPU is the bottleneck, the performance increase depends approximately on the amount of CPU resources available, in this case the number of cores. As visible in Table 4.7, this does not hold for the whole range. There is a significant performance improvement all the way from one scheduler (SMP1) to four schedulers (SMP4). When the number of schedulers is increased from four to eight, however, there is only an increase of 50 calls/s, although the number of available schedulers, and thereby cores, has doubled. This indicates that the results are dependent on something other than the pure processing power of the processors. As the focus is on processor performance and energy efficiency, results that do not depend on the processors themselves are to be avoided. Only the results from one to four cores will be used, in practice treating the reference machine as having only one Quad-Core processor and using only the energy required by one, rather than two, processors. An issue with comparing the energy efficiency in this test is that the only energy consumption data available is the TDP information provided by the manufacturer. According to the information available on Intel's web page, the maximum TDP of the L5430 is 50 W and it is manufactured using a 45 nm process [14]. The datasheet for the 5400 series likewise states, on page 87, that the TDP for the L5400 series is 50 W [13].

When evaluating the performance of the SIP-Proxy, the Versatile Express was able to handle 30 calls/s. The reference machine with its Intel Xeon L5430 was able to handle 350 calls/s. Both machines were tested using different numbers of schedulers (1-4); these results can be seen in Table 4.8. Taking into account that the CPU of the reference machine has a maximum TDP of 50 W, compared to the measured maximum consumption of 1.2 W for the Cortex-A9, the Cortex-A9 performs well. By comparing the throughputs and the power consumptions, it can be seen that the Cortex-A9 can handle 3.5 times more traffic for each watt it dissipates than the Intel Xeon. An issue in this comparison is that the energy consumption listed for the Xeon is the TDP rated by the manufacturer, and not an actual measured maximum. To compensate for this, the maximum measured energy consumption of the Quad Core Cortex-A9 is also used for the comparison. The power consumption during the benchmark was measured using the VD10_S3 register on the CoreTile Express, and the average values are shown in Table 4.6, together with the CPU utilization and the number of calls the proxy was subjected to. During this particular test, where the energy consumption was measured, SMP 8 was used. The power consumption is also presented in Figure 4.5. It is noteworthy that no DVFS is available on the CoreTile Express, which reduces the possibilities for precisely reducing energy consumption in relation to the load.

Calls/s   CPU utilization (%)   Power avg (W)

1         19                    0.54
5         38                    0.65
10        51                    0.66
15        66                    0.85
25        88                    0.91
30        98                    0.95

Table 4.6: CPU utilization and average CPU power consumption for the CoreTile Express during SIP-Proxy benchmark

                 SMP1   SMP2   SMP4   SMP8

Calls / Second   130    240    350    400

Table 4.7: Performance of reference machine with two Quad Core Xeons with different numbers of schedulers
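The calls per Joule comparison for the SIP-Proxy can be reproduced from the throughput and power figures in this section. The Python sketch below uses the 50 W TDP for the Xeon L5430, the measured maximum of 1.2 W for the Quad Core Cortex-A9, and the 1 W estimate used earlier for the Tegra 250 together with its 13 calls/s from Table 4.8. The function name is illustrative.

```python
def calls_per_joule(calls_per_second, power_watts):
    """Calls handled per Joule; 1 W equals 1 J/s."""
    return calls_per_second / power_watts


xeon_l5430 = calls_per_joule(350, 50.0)   # one Quad-Core Xeon L5430, TDP
quad_a9 = calls_per_joule(30, 1.2)        # Versatile Express, measured maximum
tegra = calls_per_joule(13, 1.0)          # Tegra 250, estimated power

quad_vs_xeon = quad_a9 / xeon_l5430
tegra_vs_xeon = tegra / xeon_l5430

if __name__ == "__main__":
    print(f"Xeon L5430: {xeon_l5430:.1f} calls/J")
    print(f"Quad A9:    {quad_a9:.1f} calls/J ({quad_vs_xeon:.2f} times the Xeon)")
    print(f"Tegra 250:  {tegra:.1f} calls/J ({tegra_vs_xeon:.2f} times the Xeon)")
```

The ratio for the Quad Core Cortex-A9 is about 3.57, which is why the text quotes it as 3.5 in some places and 3.6 in others.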

As the erts does not generally benefit from using more schedulers than there are available CPU cores on the host machine, the results for those tests are not listed here. The strange thing about these test results is that the Tegra 250 performs significantly worse than the Versatile Express. In the other benchmarks the Tegra 250 has consistently outperformed the CoreTile Express, except in this one and in the TCP message benchmark that is part of the basic Erlang benchmarking presented previously in this chapter. While the Versatile Express has the advantage of having double the number of CPU cores, the Tegra 250 has more than double the clock frequency on its cores. As visible in Table 4.8, the performance of the Versatile Express and the Tegra 250 is very similar when using the same number of cores. When using one core the difference could be due to a static overhead of running the proxy, but the results from using two cores are not as easily dismissed. The main question here is why an almost identical performance increase is achieved when adding a core running at 400 MHz and one running at 1000 MHz. If the test machine running at 1 GHz did not report almost full CPU utilization, it would be clear that the CPU is not the bottleneck, but this is not the case here.

Figure 4.4: Graph showing the CPU utilization for CoreTile Express during SIP-Proxy benchmark

As the performance of the proxy when running on the Tegra 250 was not as expected from the technical data available and our previous benchmarks, additional steps were taken to verify the results. The erts on both the Tegra 250 and the CoreTile Express was recompiled from the same source in the same way, using the same version of GCC and the same version of the libatomic library (7.2 alpha 4). As this did not cause any difference in the results, the erts was again recompiled with a few changes to support profiling using Gprof. The performance on the CoreTile Express was affected more by running Gprof than that on the Tegra 250. With profiling enabled the Tegra 250 could handle nine calls per second, while the CoreTile Express could handle six for a two minute period using SMP2, as can be seen in Table 4.9. A test was then performed using five calls per second and SMP2 for two minutes on both machines, while profiling using Gprof. The biggest difference in the number of function calls and the time spent in different functions was in functions that have to do with atomic reads. This is caused by the fact that the schedulers are frequently left without work and at that point, in an attempt to optimize, spin over a variable to check for more work. As the throughput was matched between the two test machines, the Tegra 250 was not under maximum load, causing its schedulers to be without work more often than those on the V2P-CA9. Other significant differences were not observed. In order to profile system wide rather than just the erts, the kernels on both the Tegra 250 and the CoreTile Express were recompiled to support Oprofile. Oprofile showed that when running at as high a load as possible the Tegra 250 spent 31.6 % of its time running vmlinux, while the CoreTile Express spent 20.6 %. The times spent running the erts were 65 % and 67 %.

Figure 4.5: Power consumption for the CPU in CoreTile Express during SIP-Proxy test

Figure 4.6: Graph showing performance of reference machine with two Quad Core Xeons using an increasing number of schedulers

To make sure the issue was not caused by problems with the OS, the installation on the Tegra 250 was replaced by a backup of an older version of Ubuntu that the board had originally been tested with, taken before the evaluation, and any changes possibly caused by it, had started. The results remained the same as in the previous tests. As no reason for the benchmark results could be found even after great effort, ARM agreed to redo the measurements. The same erts and the same installation of the proxy server were used. The kernels used were compiled separately, although both were of version 2.6.32. ARM reported the same results as produced earlier with the Tegra 250. The cause of the unexpected performance difference is still unknown.

The energy efficiency for the SIP-Proxy is shown in Figure 4.7. The energy efficiency difference between the Versatile Express and the Tegra 250 is close to the performance difference, around half. Compared to the Versatile Express the Xeon uses approximately 3.6 times more energy for each call. Compared to the Tegra 250 the energy consumption is approximately 1.9 times higher for the Xeon.

SMP   Intel Xeon   Quad Core Cortex-A9   Dual Core Cortex-A9
      (2.66 GHz)   MPCore (400 MHz)      MPCore (1 GHz)

1     130          5                     5
2     240          12                    13
4     350          30                    13

Table 4.8: How many calls the SIP-Proxy can handle using different hardware and number of schedulers

SMP    CoreTile Express   Tegra 250

SMP4   10                 9
SMP2   6

Table 4.9: How many calls the SIP-Proxy can handle on the CoreTile Express and Tegra 250 while profiling using Gprof

Figure 4.7: Number of calls handled for each Joule used by the CPU


4.4 Summary

The performance of the test machines with ARMv7 architecture based processors has been compared to that of machines with x86 architecture based processors. While the individual performance of the more energy efficient ARM processors is lower than that of the processors traditionally used in servers, their energy efficiency is better. In the Apache HTTP server benchmark the energy efficiency of the Dual Core Cortex-A9 processor was over 11 times that of the Intel Xeon E5430. The processor with the best energy efficiency in the SIP-Proxy benchmark was the Quad Core Cortex-A9, with 3.5 times the energy efficiency of the Intel Xeon L5430.


5 CONCLUSIONS AND FUTURE WORK

5.1 Conclusions

The Apache benchmarking showed up to eleven times the energy efficiency for the ARM Cortex-A9 compared to the Intel Xeon E5430. In addition, the SIP-Proxy benchmarking indicated up to 3.5 times better energy efficiency for the Cortex-A9 compared to the Intel Xeon L5430. In other words, the Cortex-A9 needed 28 percent of the energy needed by the Intel Xeon L5430. According to the datasheet for the Intel Xeon 5400 series [13], the energy efficiency of the L5430 is better than that of the E5430. Exact energy efficiency or performance comparisons are not given in the datasheet. Even assuming that the performance of the L5430 and the E5430 is the same, the results from the Apache HTTP server benchmark are still impressive. If the given TDP value of the E5430 (80 W) is substituted with that of the L5430 (50 W), the resulting energy efficiency improves from 413 requests per Joule to 660 requests per Joule. The energy efficiency of the Dual Core Cortex-A9 is still about seven times that of the E5430 at this lower TDP.
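The effect of substituting the TDP of the L5430 for that of the E5430 can be verified with a short calculation. The Python sketch below uses only figures given in the text and, like the text, assumes that both Xeon models deliver the same 33000 requests per second. Variable names are illustrative.

```python
XEON_REQUESTS_PER_SECOND = 33000
DUAL_A9_REQUESTS_PER_JOULE = 4600   # Tegra 250 at an estimated 1 W

efficiency_at_80_w = XEON_REQUESTS_PER_SECOND / 80.0   # E5430 TDP
efficiency_at_50_w = XEON_REQUESTS_PER_SECOND / 50.0   # L5430 TDP

advantage_at_80_w = DUAL_A9_REQUESTS_PER_JOULE / efficiency_at_80_w
advantage_at_50_w = DUAL_A9_REQUESTS_PER_JOULE / efficiency_at_50_w

if __name__ == "__main__":
    print(f"80 W: {efficiency_at_80_w:.0f} requests/J, A9 advantage {advantage_at_80_w:.1f}")
    print(f"50 W: {efficiency_at_50_w:.0f} requests/J, A9 advantage {advantage_at_50_w:.1f}")
```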

The difference in energy efficiency between the processors based on the x86 architecture and those based on the ARMv7 architecture varies between applications. In the benchmarks conducted here the energy efficiency was found to be better for the ARMv7 than for the x86. In the Apache HTTP server benchmark the difference in energy efficiency was the greatest: at best, the Dual Core Cortex-A9 delivered over eleven times the throughput of the Xeon E5430 for every Joule used. The Quad Core Cortex-A9 had clearly the best results in the SIP-Proxy benchmark in terms of energy efficiency, with 3.5 times the energy efficiency of the Xeon L5430.

The potential energy reduction for the processors in a server using Cortex-A9 MPCore processors instead of the E5430 and L5430 is shown in Figure 5.1. The top parts of the bars in Figure 5.1 represent the reduction in energy consumption that would ideally be achieved using the more efficient processors. The lower parts of the bars represent the percentage of energy that is needed by the more energy efficient processors.

Figure 5.1: Achievable energy dissipation reduction by the usage of more efficientprocessors

The impact on the energy requirement at the server level is derived from Figure 5.1 and the data in Figure 2.2, which presents the contribution of the processors to the total server energy consumption. The result is a 32 percent reduction in total server energy consumption for a server running the SIP-Proxy, and a 41 percent reduction for a server running the Apache HTTP server.
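The server level figures above can be reproduced by scaling the processor level reduction with the processor share of the server power. The Python sketch below assumes the 45 percent processor share stated later in this chapter and the energy efficiency values from chapter 4 (7 and 25 calls per Joule for the SIP-Proxy, 413 and 4600 requests per Joule for Apache). The function name is illustrative.

```python
def server_energy_reduction_percent(processor_share_percent, ee_old, ee_new):
    """Reduction in total server energy when only the processors are replaced.

    ee_old and ee_new are energy efficiencies (work per Joule) of the current
    and the replacement processor, for the same amount of work.
    """
    return processor_share_percent * (1 - ee_old / ee_new)


sip_proxy_reduction = server_energy_reduction_percent(45, 7, 25)
apache_reduction = server_energy_reduction_percent(45, 413, 4600)

if __name__ == "__main__":
    print(f"SIP-Proxy: {sip_proxy_reduction:.1f} percent")
    print(f"Apache:    {apache_reduction:.1f} percent")
```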

By combining these results with the cost structure of the hypothetical data center provided by Hamilton [1] and discussed in chapter 2, the potential savings can be calculated. Of the total monthly cost, 19 percent is for the actual energy and 23 percent for energy related infrastructure. Combined, these two contribute 42 percent of the total monthly cost. Using this number the savings from the power and cooling infrastructure can be calculated.

Figure 5.2: Achievable energy dissipation reduction when moving to more efficient processors

Assuming the cost for power and cooling infrastructure to be fully optimized to the cooling and power needs of the servers, the possible savings can be calculated. Considering only the energy consumption of the processors, the impact on the total energy consumption of the system cannot be more than what is caused by the processors. As the processors are responsible for 45 percent of the power dissipated by the server, and the impact of power on the total cost is 42 percent, the impact of processor energy consumption on the total cost is 18.9 percent. The reduction of the total cost is calculated using the following equation, where R is the achievable reduction of the total cost, RPP is the percentage of total server power contributed by the processor, EEold is the energy efficiency of the current processor and EEnew the energy efficiency of the more energy efficient replacement processor. PRC is the percentage of the cost for the entire datacenter that is power related, which includes the power cost as well as the power and cooling infrastructure according to Hamilton's model.

\[
R = \frac{RPP}{100} \times \left(1 - \frac{EE_{old}}{EE_{new}}\right) \times \frac{PRC}{100}
\]

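In the case of the SIP-Proxy the energy efficiency of the Cortex-A9 MPCore was 3.6 times that of the L5430, or 25 calls per Joule compared to 7 calls per Joule. The reduction of the total cost would be calculated as

\[
\frac{45}{100} \times \left(1 - \frac{7}{25}\right) \times \frac{42}{100} \approx 0.136
\]

that is, 13.6 percent of the total cost.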

In the Apache HTTP server benchmark the energy efficiency of the Dual Core Cortex-A9 is 11.1 times that of the Xeon E5430. The impact on the total monthly cost is thereby

\[
\frac{45}{100} \times \left(1 - \frac{413}{4600}\right) \times \frac{42}{100} \approx 0.172
\]

that is, 17.2 percent of the total cost.
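The cost reduction equation and the two worked cases above can be expressed as a small Python function. The parameter names mirror the symbols R, RPP, EEold, EEnew and PRC from the text; the function itself is an illustrative sketch and returns the result as a percentage of the total cost.

```python
def cost_reduction_percent(rpp, ee_old, ee_new, prc):
    """Achievable reduction of the total data center cost, in percent.

    rpp:    percentage of server power contributed by the processor
    ee_old: energy efficiency of the current processor
    ee_new: energy efficiency of the replacement processor
    prc:    percentage of total cost that is power related
    """
    fraction = (rpp / 100) * (1 - ee_old / ee_new) * (prc / 100)
    return fraction * 100


sip_proxy_saving = cost_reduction_percent(45, 7, 25, 42)
apache_saving = cost_reduction_percent(45, 413, 4600, 42)

# Upper bound: the processor related share of the total cost.
processor_cost_share = 45 * 42 / 100

if __name__ == "__main__":
    print(f"SIP-Proxy: {sip_proxy_saving:.1f} percent")
    print(f"Apache:    {apache_saving:.1f} percent")
    print(f"upper bound: {processor_cost_share:.1f} percent")
```

Both savings stay below the 18.9 percent upper bound, which would only be reached by a processor that consumed no energy at all.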

The total cost saving potential for the hypothetical data center using Hamilton's model is visualized in Figure 5.2. The top parts of the bars represent how much money could be saved in total cost for the entire data center.

The x86 processors that have been benchmarked are not the newest available server processors. Newer processors are in general more energy efficient than older processors, which decreases the difference between the new ARM based processors and the most energy efficient x86 based processors. The energy efficiency in the Apache HTTP server test was eleven times better for the Cortex-A9 than for the Xeon. Even with a doubled energy efficiency for the x86 processors the difference remains remarkable.

5.2 Future work

The lower individual performance of the more energy efficient processors creates a need to use a larger number of processors in a server. The benchmarks indicate that to achieve the same performance using ARM Cortex-A9 MPCore processors as with more traditional x86 based server processors, more than ten times the number of processors is needed in the worst cases. Ways to connect several of these processors together must be evaluated in order to find the optimal configuration, as a simple SMP architecture is only suitable for a limited number of processors.
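The number of processors needed to match a reference processor can be estimated from the throughput figures in chapter 4. The Python sketch below assumes ideal scaling with no interconnect or coordination overhead, which the thesis explicitly leaves for later research, so the results are lower bounds. The function name is illustrative.

```python
import math


def processors_needed(reference_throughput, replacement_throughput):
    """Smallest whole number of replacement processors matching the reference,
    assuming throughput adds up perfectly."""
    return math.ceil(reference_throughput / replacement_throughput)


# SIP-Proxy, one Xeon L5430 (350 calls/s) as reference.
sip_quad_a9 = processors_needed(350, 30)
sip_tegra = processors_needed(350, 13)

# Apache, one Xeon E5430 (33000 requests/s) as reference.
apache_quad_a9 = processors_needed(33000, 3400)
apache_tegra = processors_needed(33000, 4600)

if __name__ == "__main__":
    print("SIP-Proxy:", sip_quad_a9, "Quad Core A9 or", sip_tegra, "Tegra 250")
    print("Apache:   ", apache_quad_a9, "Quad Core A9 or", apache_tegra, "Tegra 250")
```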

A possible option is a cloud on a chip solution. For a cloud on a chip like system, design decisions such as hierarchies and communication paths need to be evaluated. Questions such as what the optimal number of processors in a node would be must also be answered. How the cloud would be controlled internally, and the abstraction level visible to the rest of the system, need to be decided.


BIBLIOGRAPHY

[1] James Hamilton. Cooperative expendable micro-slice servers (CEMS): Low cost, low power servers for internet-scale services. In Proceedings of CIDR 09, January 2009.

[2] L.A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, 40(12):33–37, December 2007.

[3] Gerald Coley. BeagleBoard system reference manual. BeagleBoard.org, December 2009.

[4] Texas Instruments Incorporated. OMAP35x Product Bulletin, 2009.

[5] CoreTile Express A9x4 Cortex-A9 MPCore (V2P-CA9) Technical Reference Manual.

[6] Cloud software project webpage. http://www.cloudsoftwareprogram.org/cloud-program. Online; accessed 20 December 2010.

[7] Xiaobo Fan, Wolf-Dietrich Weber, and Luiz André Barroso. Power provisioning for a warehouse-sized computer. In ISCA ’07: Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007.

[8] Bernd Sch˙ Energy efficient servers in Europe. Energy consumption, saving potentials, market barriers and measures. Part 1.

[9] Green Grid. The green grid power efficiency metrics: PUE & DCiE, 2008. Online; accessed 31 January 2011.

[10] Rich Miller. Microsoft embraces data center containers. http://www.datacenterknowledge.com/archives/2008/04/01/microsoft-embraces-data-center-containers, April 2008. Online; accessed 25 November 2010.

[11] Stephen Shankland. Google uncloaks once-secret server. http://news.cnet.com/8301-1001_3-10209580-92.html, April 2009. Online; accessed 25 November 2010.


[12] Rich Miller. Microsoft: 300,000 servers in container farm. http://www.datacenterknowledge.com/archives/2008/05/07/microsoft-300000-servers-in-container-farm/, May 2008. Online; accessed 12 January 2011.

[13] Intel. Quad-Core Intel Xeon Processor 5400 Series Datasheet, August 2008. Online; accessed 18 January 2011.

[14] Intel. Intel Xeon processor L5430 product specifications. http://ark.intel.com/product.aspx?id=33091. Online; accessed 18 January 2011.

[15] Intel. Intel Xeon processor L5430 product specifications. http://ark.intel.com/product.aspx?id=33081. Online; accessed 31 January 2011.

[16] T.D. Burd, T.A. Pering, A.J. Stratakos, and R.W. Brodersen. A dynamic voltage scaled microprocessor system. IEEE Journal of Solid-State Circuits, 35(11):1571–1580, November 2000.

[17] Intel. Moore’s law. http://www.intel.com/technology/mooreslaw/. Online; accessed 21 December 2010.

[18] David A. Patterson. Latency lags bandwidth. Communications of the ACM, 47, October 2004.

[19] Gerald Coley. BeagleBoard-xM system reference manual. Revision A2. BeagleBoard.org, July 2010.

[20] Nvidia. Nvidia Tegra 200 series developer kit, quick start guide, December 2009. DU-04942-001v02.

[21] ARM. Cortex-A9 processor. http://www.arm.com/products/processors/cortex-a/cortex-a9.php. Online; accessed 3 January 2011.

[22] Erlang programming language official web site. http://www.erlang.org/faq/introduction.html. Online; accessed 1 December 2010.

[23] Joe Armstrong. Programming Erlang, Software for a Concurrent World. Pragmatic Bookshelf, Raleigh, North Carolina Dallas, Texas, 2007-8-8 edition, 2007.

[24] Joe Armstrong. ICFP ’97: Proceedings of the second ACM SIGPLAN International Conference on Functional Programming, pages 196–203.

[25] IETF. IETF web site. http://www.ietf.org/. Online; accessed 1 February 2011.

[26] Autobench web site. http://www.xenoclast.org/autobench/. Online; accessed 16 January 2011.


[27] David Mosberger and Tai Jin. httperf-a tool for measuring web server performance. Performance Evaluation Review, 26(3):31–37, December 1998.

[28] Intel. Intel Core 2 Extreme Processor X6800 and Intel Core Duo Desktop Processor E6000 and E4000 Sequences, October 2007. Online; accessed 2 February 2011.


6 ENERGY EFFICIENCY OF ARM ARCHITECTURES FOR CLOUD COMPUTING APPLICATIONS

6.1 Introduction

As the demand for software, computing capacity and data storage in the cloud grows, so does the number of servers needed to meet it. The enormous data centers built to increase capacity need large amounts of power to operate. The energy consumption of the servers is reflected not only in the operating costs of the data center but also in the infrastructure that handles the servers' power supply and cooling. Power consumption has a clear impact on the total cost of a data center and is therefore an interesting area of research.

This thesis is part of the research carried out within the Cloud Software Program project. The aim of the project is to improve the global competitiveness of the software-intensive industry in Finland. The project is funded by TEKES and managed by Tivit Oy [6]. At the Embedded Systems Laboratory of the Department of Information Technologies at Åbo Akademi University, the research focus within this project is on potential improvements in energy efficiency through the use of low-power nodes.

6.2 Energy consumption

All the electrical energy that a computer consumes is converted into heat. The power consumption of a single computer is often around a few hundred watts, and if the computer is in a relatively large space, one or a few small fans are usually enough to keep it sufficiently cool. In data centers the heat-producing hardware is packed much more densely than is usual for home computers. In large new data centers, such as those built by Google and Microsoft, servers are installed in shipping containers [10] [11]. According to Google, a container holds 1160 servers and can draw up to 250 kW [11]. Large data centers contain a large number of these containers. In 2008 Microsoft reported that it was building a data center with 300,000 servers [12]. If the power consumption per server in Microsoft's large data center is the same as for Google's servers, the combined power consumption of the servers in the data center is about 65 MW.
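The 65 MW estimate can be reproduced with a short calculation. The sketch below is only an illustration with made-up function and parameter names; it assumes, as the text does, that every server draws the same share of a 250 kW container holding 1160 servers.

```python
def datacenter_server_power_mw(servers,
                               container_power_kw=250.0,
                               servers_per_container=1160):
    """Estimate the combined power draw of all servers, in megawatts."""
    # Average draw of one server, derived from the container figures.
    per_server_kw = container_power_kw / servers_per_container
    return servers * per_server_kw / 1000.0


if __name__ == "__main__":
    estimate = datacenter_server_power_mw(300_000)
    print(f"Estimated server power: {estimate:.1f} MW")
```

The figure covers the servers only; cooling and other infrastructure come on top of it.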

There are different ways to measure the energy efficiency of data centers. One metric is PUE (Power Usage Effectiveness) [9]. PUE is calculated by dividing the total amount of energy used by the data center by the part used to run the actual IT equipment. For an ideal data center the PUE value is one, but according to [9] data centers with a PUE value above three are common. In practice, a PUE value of three means that the total power consumption of the data center is three times what is needed to run the servers themselves. The power not used by servers goes to, for example, lighting, UPS and above all cooling. To keep a shipping container filled with densely packed servers from overheating, a powerful cooling system that can draw large amounts of power is needed.
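The PUE definition above translates directly into code. The following is a minimal sketch with hypothetical function names and example numbers chosen only to reproduce the PUE value of three mentioned in the text.

```python
def pue(total_facility_power_kw, it_equipment_power_kw):
    """Power Usage Effectiveness: total facility power over IT power."""
    if it_equipment_power_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_power_kw / it_equipment_power_kw


def overhead_power_kw(it_equipment_power_kw, pue_value):
    """Power spent on cooling, UPS, lighting etc. for a given PUE."""
    return it_equipment_power_kw * (pue_value - 1.0)


if __name__ == "__main__":
    # Hypothetical facility: 1000 kW of servers, 3000 kW in total.
    value = pue(3000.0, 1000.0)
    print(f"PUE = {value:.1f}, overhead = {overhead_power_kw(1000.0, value):.0f} kW")
```

With these example inputs, two thirds of the facility's power goes to something other than the IT equipment.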

According to a model developed by Hamilton [1], 19 percent of the total costs of a hypothetical data center consist of energy costs and 23 percent of costs for energy-related infrastructure, i.e. cooling and power distribution. The model assumes that the infrastructure has a lifetime of 15 years and the servers 3 years. The energy price in the model is set at $0.07/kWh, and the capital for the data center is borrowed at an annual interest rate of five percent. The model shows that 42 percent of the costs of the hypothetical data center depend on power consumption.

6.3 Improving energy efficiency

According to [2], the processor or processors in a server account for about 45 percent of the server's total energy consumption. This makes their energy efficiency an interesting research area for reducing costs caused by energy consumption. The energy consumption of a processor can be reduced by lowering its supply voltage. Because the achievable clock frequency depends on the available voltage, the clock frequency must be lowered as well. This method is called DVFS (dynamic voltage and frequency scaling) and is used to reduce the energy consumption of processors during periods when their full capacity is not needed [16].

The processors used in modern servers are mostly based on the x86 architecture and have been developed with maximum processing power as the primary goal. Processors developed for low-power, battery-driven devices have instead been designed with energy consumption as a higher priority. By using these lower-power processors in servers, the energy consumption of the servers can be reduced. This thesis examines how large an increase in energy efficiency can be achieved, compared with established server processors, by using ARMv7-based processors, more specifically the Cortex-A8 and Cortex-A9 MPCore. Three ARM-based test machines are used for the evaluation: a BeagleBoard, a Versatile Express and a Tegra 200 series developer kit. All test machines ran Linux as their operating system.
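The effect of DVFS described earlier in this section can be illustrated with the common first-order approximation for dynamic CMOS power, P ≈ C·V²·f. This is a textbook simplification rather than a model taken from the thesis, it ignores static (leakage) power, and the function names and scaling factors below are chosen purely for illustration.

```python
def dynamic_power_w(capacitance_f, voltage_v, frequency_hz):
    """First-order dynamic power of a CMOS circuit: P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


def relative_power(voltage_scale, frequency_scale):
    """Dynamic power after scaling, relative to the unscaled processor."""
    # The switched capacitance cancels out, leaving only the scale factors.
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Hypothetical operating point: voltage and frequency both reduced to 80 %.
    print(f"Relative dynamic power: {relative_power(0.8, 0.8):.3f}")
```

Under this approximation, lowering both voltage and frequency to 80 % leaves roughly half of the original dynamic power, which is why scaling the two together pays off more than lowering the frequency alone.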

The BeagleBoard [3] is a low-cost test platform with a TI OMAP3530 chip. The chip contains, among other things, a Cortex-A8 processor, which was clocked at 600 MHz for the evaluation. The BeagleBoard used in the evaluation is revision C3, which has 256 MB of RAM; a memory card was used as primary storage. A USB Ethernet adapter provided network connectivity.

The Versatile Express is a development platform consisting of a Versatile Express Motherboard (V2M-P1) and a CoreTile Express A9 MPCore (V2P-CA9) daughterboard [5]. The processor is a CA9 NEC chip [5] with four processor cores and a maximum clock frequency of 400 MHz. The CoreTile Express has 1 GB of RAM, and the operating system is installed on flash memory. The Versatile Express platform has built-in 10/100 Mbps network connectivity. By creating a kernel module that gives access to registers holding power consumption information, it is possible to observe the power consumption of the CA9 NEC chip. The maximum power consumption measured during the evaluation was 1.2 W.

The Tegra 200 developer kit has a Tegra 250 chip with a dual-core Cortex-A9 MPCore processor clocked at 1 GHz. Like the CoreTile Express, the Tegra developer kit has 1 GB of RAM and 10/100 Ethernet connectivity. To improve the board's communication capabilities, a Gigabit Ethernet add-on card was used. Power consumption data were not available either for the Cortex-A9 implementation in the Tegra 250 chip or for the chip as a whole. Instead, an estimated power consumption of 1 W was used. The estimate is based on data published by ARM. According to ARM, a power-optimized dual-core Cortex-A9 clocked at 800 MHz consumes 0.5 W, while a performance-optimized dual-core Cortex-A9 clocked at 2 GHz consumes 1.9 W, assuming both are manufactured in a 40 nm process [21].

6.4 Measurements

A number of tests were carried out to evaluate the energy efficiency of the ARM processors in the test machines. The tests evaluated the energy efficiency of the Apache 2.2 HTTP server, the Erlang virtual machine (VM) and a SIP-Proxy running on the Erlang VM.

The Apache HTTP server was chosen because it has long been one of the most popular HTTP servers and is available for all the test machines. Autobench was used to evaluate the server's performance on the different test machines. Autobench is a tool that simplifies the automated execution of a large number of httperf tests. It also helps in situations where the performance of the machine under test is high and more than one client machine is needed to run the test. The test measured how many requests for a static 10 B file the server could answer. The keep-alive feature was enabled during the test runs. The energy efficiency of the dual-core Cortex-A9 processor was 11.1 times better than that of a quad-core Intel Xeon E5430 processor.
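A comparison of this kind can be sketched as follows, assuming that energy efficiency is expressed as work per unit of energy, here requests per joule (requests per second divided by watts). The function names are hypothetical and the input numbers are invented for illustration; they are not the thesis measurements.

```python
def requests_per_joule(requests_per_second, power_w):
    """Energy efficiency as requests served per joule consumed."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return requests_per_second / power_w


def efficiency_ratio(rps_a, power_a_w, rps_b, power_b_w):
    """How many times more energy efficient machine A is than machine B."""
    return requests_per_joule(rps_a, power_a_w) / requests_per_joule(rps_b, power_b_w)


if __name__ == "__main__":
    # Invented example: a slow 1 W processor versus a fast 80 W processor.
    ratio = efficiency_ratio(1000.0, 1.0, 20000.0, 80.0)
    print(f"Machine A is {ratio:.1f} times more energy efficient")
```

The example shows how a processor with a twentieth of the throughput can still come out ahead once power draw is taken into account.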

Erlang is a functional programming language and a runtime environment that implements its own system of lightweight processes [22]. Erlang was developed to provide a platform for massive telecommunication systems with soft real-time requirements and the ability to make changes without restarting the system [24]. To examine the performance of the different parts of the virtual machine, a number of tests were run, each focusing on a different part of the virtual machine.

SIP, the Session Initiation Protocol, is a session management protocol used for sessions with one or more participants. The sessions can be, for example, multimedia such as audio or video. The protocol was developed by the IETF [25]. A SIP-Proxy is a server used to build the infrastructure needed for communication between different parties. Users log in to a SIP-Proxy, which then communicates with the rest of the system on the user's behalf. The test results show that the energy efficiency of a quad-core Cortex-A9 processor is 3.6 times better than that of a quad-core Intel Xeon L5430 processor.

6.5 Conclusions

According to the measurements, processors based on the ARMv7 architecture are more energy efficient than the evaluated x86-based processors in the applications studied. According to Hamilton's model of the hypothetical data center, the savings would be up to 17.2 percent for the Apache HTTP server and 13.6 percent for the SIP-Proxy.

Because these ARMv7-based processors have clearly better energy efficiency but lower performance than the processors currently used in servers, a larger number of them is needed to perform the same task. Further research is needed to determine how several of these more energy-efficient processors should best be connected together to reach the performance expected of a high-performance server.


A RESULTS FROM ERLANG BENCHMARKING
