

The Sharing Architecture: Sub-Core Configurability for IaaS Clouds

Yanqi Zhou
Princeton University

F210 E2 Engineering Quadrangle, Olden St., Princeton, NJ 08544
[email protected]

David Wentzlaff
Princeton University

B228 Engineering Quadrangle, Olden St., Princeton, NJ 08544

[email protected]

Abstract

Businesses and academics are increasingly turning to Infrastructure as a Service (IaaS) Clouds such as Amazon's Elastic Compute Cloud (EC2) to fulfill their computing needs. Unfortunately, current IaaS systems provide a severely restricted palette of rentable computing options which do not optimally fit the workloads that they are executing. We address this challenge by proposing and evaluating a manycore architecture, called the Sharing Architecture, specifically optimized for IaaS systems by being reconfigurable on a sub-core basis. The Sharing Architecture enables better matching of workload to micro-architecture resources by replacing static cores with Virtual Cores which can be dynamically reconfigured to have different numbers of ALUs and amounts of Cache. This reconfigurability enables many of the same benefits of heterogeneous multicores, but in a homogeneous fabric, and enables the reuse and resale of resources on a per-ALU or per-KB-of-cache basis. The Sharing Architecture leverages Distributed ILP techniques, but is designed to be independent of recompilation. In addition, we introduce an economic model which is enabled by the Sharing Architecture and show how different users who have varying needs can be better served by such a flexible architecture. We evaluate the Sharing Architecture across a benchmark suite of Apache, SPECint, and parts of PARSEC, and find that it can achieve up to a 5x more economically efficient market when compared to static-architecture multicores. We implemented the Sharing Architecture in Verilog and present area overhead results.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ASPLOS '14, March 1-5, 2014, Salt Lake City, Utah, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2305-5/14/03...$15.00. http://dx.doi.org/10.1145/2541940.2541950

Categories and Subject Descriptors C.1.3 [Processor Architectures]: Other Architecture Styles - Adaptable architectures; C.5.5 [Computer System Implementation]: Servers

Keywords Infrastructure as a Service (IaaS); Slice; Cache; Cache Banks; Virtual Core (VCore); Virtual Machine (VM); utility; market efficiency.

1. Introduction

Cloud computing and Infrastructure as a Service (IaaS) Clouds have enabled convenient, on-demand, self-service access to vast quantities of computing that can be provisioned with minimal effort. Current IaaS systems utilize commodity multicore processors as the computational substrate to service this diverse computational need. Unfortunately, the individual core, the foundation of commodity multicore processors, is optimized across a set of applications, but by definition it is not optimized for any particular application. In this work, we propose a new architecture, which we call the Sharing Architecture, specifically inspired by the needs of IaaS Cloud providers. Namely, Cloud providers require an architecture which enables efficient markets to match customers' needs with computational resources, enables the customization of micro-architecture hardware to a customer's application, and enables the reuse of idle hardware resources, thereby maximizing profit. The Sharing Architecture addresses these needs by providing a manycore architecture where the core can be reconfigured, customized, and leased on a sub-core basis.

The movement of computation and storage into the Cloud and into large data centers has occurred at an explosive rate [7, 12]. For example, the number of objects stored in Amazon's S3 storage service [3] has more than doubled every year since its inception [9]. Likewise, the number of Virtual Machine (VM) instances active in Amazon's Elastic Compute Cloud (EC2) [1, 35] has been estimated to be growing at a similar rate [62]. Many diverse applications which have traditionally been desktop applications have successfully been transitioned into the Cloud, including word processing (e.g., Google Apps [2]), spreadsheets (e.g., Google Apps), data backup clients (e.g., Box.net, Dropbox, Mozy, Backblaze), email clients (Gmail, Hotmail, Yahoo Mail), and even gaming (e.g., OnLive). The growth in data center computing has not only been fueled by desktop applications moving to the Cloud, but also by startups and established companies moving computationally intensive [38, 46] and data-serving tasks into the Cloud.

In this work, we focus specifically on IaaS Cloud systems. An IaaS Cloud provider typically leases managed computational, storage, or network resources to customers. We focus on the computational facet, which usually comes in the form of leasing Virtual Machine (VM) instances. For a fee, an IaaS Cloud provider agrees to run a client's VM on the Cloud provider's computing infrastructure. Examples of computational IaaS Public Cloud providers include Amazon's EC2 [1], Microsoft's Azure Virtual Machines [4], Terremark, Linode, Rackspace Cloud, Joyent, and Google's Compute Engine. IaaS infrastructure is also gaining traction in enterprise settings by leveraging many of the same technologies as Public Clouds, but in a Private Cloud setting [15, 43, 61].

IaaS Cloud providers typically provide a very limited palette of Virtual Machine configurations. For example, Amazon's EC2 currently supports only eighteen machine configurations. The configurations presented by Amazon do not allow the customer to choose different intra-core parameters, such as cache size, but rather only allow the customer to choose fixed combinations of core count, amount of DRAM, and amount of disk in a VM instance. The lack of intra-core, micro-architecture VM configuration options leaves the Cloud customer utilizing a VM configuration which does not match their workload's needs. This disconnect leads to Cloud customers overpaying for resources or experiencing sub-optimal application performance. The inability of the Cloud provider to customize a VM instance to a customer's needs also harms the provider: it leads to a less efficient marketplace, where the Cloud provider is unable to price fine-grain resources according to their marginal cost, thereby contributing to wasted resources and wasted revenue opportunities. A Cloud provider's inability to offer fine-grain micro-architecture choices is directly due to current commodity multicores not being divisible below the core level.

To address this challenge, we propose the IaaS-influenced Sharing Architecture, along with a modification to how IaaS markets are priced, to enable the fine-grain assignment of processor resources to Virtual Cores. The Sharing Architecture enables the assignment of fine-grain processor resources to a configurable Virtual Core, all the way down to the individual ALU, fetch unit, and KB of cache. One or more Virtual Cores are then composed into Virtual Machine instances. Figure 1 shows a conceptual view of the Sharing Architecture. The Sharing Architecture is designed to allow a nearly arbitrary set of sub-core resources to be composed into a single Virtual Core.

Figure 1. High-Level View of the Sharing Architecture showing two Virtual Machines. VM1 has two VCores; VM2 has one VCore. Each VCore groups Fetch, ALU, D-$, and Retire resources.

These different fine-grain resources are connected together by multiple, pipelined, switched interconnection networks providing connectivity for the sea of ALUs and sea of Cache Banks. While the Sharing Architecture enables nearly arbitrary configurations of resources into Virtual Cores, we do not expect all combinations to provide a performance benefit, due to the added communication latency when sharing distant resources. By enabling a fluid assignment of ALUs and cache to Virtual Cores, the Sharing Architecture enables the Cloud user to choose whether to optimize for single-stream performance at the expense of area, for throughput, or for complex mixes.

The Sharing Architecture has much of the same motivation and performance benefit as heterogeneous multicore systems [33, 34], but is built out of a large homogeneous fabric. Applications can be more area efficient, more energy efficient, and perform better if they run on properly sized cores. Unlike heterogeneous multicore systems, the Sharing Architecture does not require the mix of core sizes to be determined at chip design time. This is even more important in IaaS Cloud systems, where the mixture of applications is determined by dynamic external customers. The Sharing Architecture is inspired by Core Fusion [24], WiDGET [64], the Distributed ILP mechanisms in the TRIPS architecture [11, 17, 49, 50, 53], and how ILP is mapped across multiple cores in the Raw [63] and Tilera [65] architectures.

An important aspect of the Sharing Architecture is that it enables more efficient markets by enabling the Cloud provider to price different portions of a core on a sub-core basis. For instance, instead of simply charging $0.065 per CPU-hour, the Cloud provider is able to charge per ALU-hour, per cache-KB-hour, etc. This capability empowers a more dynamic Cloud pricing market and ultimately leads to a more economically efficient marketplace. In this paper, we explore how different IaaS Cloud users (customers) garner different utility as a function of sub-core (ALUs, Cache, etc.) and beyond-core (number of cores, VMs, DRAM, etc.) resources. Different Cloud customers can have different utility curves due to technical, price, and policy needs. Unlike fixed architectures, the Sharing Architecture does not judge which utility function is superior, but rather simply allows the Cloud user to express their preferences and have the market meet those preferences.

The Sharing Architecture is designed to enable dynamic reconfiguration of Virtual Cores and to run the same binary independent of the Virtual Core configuration (no recompilation). This dynamic nature enables the Cloud provider to price sub-core resources dynamically, based on instantaneous market demand. The programmer or IaaS Cloud user must determine, based on the market rates for sub-core resources, how many resources to utilize to best fit their demand and utility needs. While this work focuses on the hardware aspects of the Sharing Architecture and the market it can create, we also outline software implications. For instance, the IaaS Cloud user could provide a meta-program along with their Virtual Machine which decides the desired machine configuration based on the market prices for resources. Alternatively, an auto-tuner could be used if the customer lacks a detailed application performance model.

2. Economic Model

Optimizing computer architecture for IaaS Cloud systems has different criteria than traditional computer architecture. In traditional computer architecture, systems are typically optimized for performance, cost, energy, or some combination of these constraints. In contrast, when computer architecture is optimized for IaaS Clouds, the ability to enable more efficient markets dominates traditional computer architecture constraints. This is amplified by the fact that different IaaS customers can have conflicting optimization criteria, such that it is impossible to satisfy all customers with a fixed design. Economists have shown that in economically efficient markets, both the seller and the buyer are enriched; in our case, the Cloud provider can make additional revenue while the Cloud user (customer) simultaneously pays less. In this section, we describe the current IaaS marketplace, the diversity of customer needs, and how the Sharing Architecture facilitates a more economically efficient market.

2.1 Current IaaS Pricing Model

IaaS Public Cloud providers such as Amazon's Elastic Compute Cloud (EC2) and Microsoft's Azure Virtual Machines use an instance-based pricing model. In these models, customers choose from a relatively limited number of VM instance types and pay on a per-VM-instance-hour basis. For example, in EC2, for $0.065/hour, a VM instance with 1.7GB of RAM, a single core, and 160GB of local disk can be leased. In addition to these fixed-cost models, EC2 has a market, named Spot Pricing, which provides an auction for VM instances. Cloud customers use IaaS Cloud systems as a way to reduce the risk of over-provisioning or under-provisioning purchased computational resources by turning capital purchases into operational expenditures.

Figure 2. Resource Utility Functions for Different Applications: (a) Throughput Applications; (b) Single Stream Applications. Each plot shows utility (dollars) as a function of the number of resources, with separate curves for VMs, CPUs/VM, 128KB Cache Banks, and ALUs.

as VMWare’s VCloud [31] system provide a richer ability toconfigure a VM instance, but the parameters are still limitedto resources outside of the core such as core count, amountof DRAM, and disk size.

2.2 Diversity of Needs

Consumers have diverse desires. Economists use utility to measure how well a particular resource satisfies an individual's wants, and they measure total utility to determine how economically efficient a market is. A more economically efficient market will have higher global utility than a less efficient market for a fixed number of resources. In effect, an efficient market will better match resources to the consumers who are willing to pay the most for them (assuming consumers are willing to pay commensurate with the utility they gain). One way to make a market more efficient is to disentangle, disaggregate, or sub-divide resources, thereby enabling consumers to more accurately express their desires and allowing sellers to more accurately express their costs. Given these assumptions, by maximizing global utility, Cloud providers maximize profit.

In an IaaS Cloud market, different customers can have different utility functions. In this paper, we do not try to make any judgment or assumption about customers' utility functions; rather, we provide hardware that enables different utility functions to be well served. Figure 2a shows a set of utility functions for a hypothetical throughput-oriented application (e.g., web serving). In this application, the computation scales linearly with VMs and CPUs/VM, but the customer favors having more CPUs per VM to reduce administrative overhead. As more Cache per core is added, the marginal utility gained by the customer diminishes. Last, as more ALUs are added per core, the utility goes up at first, but then this customer's application begins to slow down due to the added communication cost between ALUs. It should be noted that these utility functions are simply 2D approximations of the true multi-dimensional utility space, as one resource may affect the utility gained from a second resource. A savvy customer will profile their application and use policy decisions to determine what their utility functions are.

Figure 3. Slices, Cache Banks, and Interconnection Network: an array of Slices and L2 Cache Banks.

Hence, the same customer with the same policy may have different utility functions when running a different application, due to how the application responds to different resources.

In contrast to the throughput example, a second customer running an application (potentially the same application) may favor single-stream performance. This may be because they know that their application does not parallelize well across cores, or it may simply be a policy decision. Figure 2b shows an example customer's utility functions where they favor (sequential performance)^2 as a metric. Performance^2 or Performance^3 may be very reasonable metrics if a customer favors sequential time to completion, and these metrics have much similarity to the Energy x Delay^2 and Energy x Delay^3 metrics used in energy-efficient computing research. Finally, a single customer and a single application may have different utility functions as a function of time of day or program phase [34]. Programs commonly have different phases, and if properly expressed, a per-phase machine-application fit can be made.

2.3 New Model

To create a more efficient market, we propose changing the typical IaaS market and software APIs from one which has a limited palette of VM instance types that differ at the core level and above into a market where the cloud provider auctions off all resources down to the ALU, KB of cache, fetch unit, retire unit, and even on-chip or off-chip memory bandwidth. To realize this change, a hardware fabric that is more flexible than multicore processors is needed. The Sharing Architecture is the first step in building such an architecture.
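To make the contrast concrete, here is a minimal sketch of the two pricing models. The per-ALU and per-cache-KB prices are purely hypothetical (only the $0.065/hour instance rate comes from Section 2.1); the sketch simply illustrates how a customer in the proposed market pays only for the resources their workload benefits from.

```python
# Instance pricing rents a fixed bundle; the proposed model charges
# per ALU-hour and per cache-KB-hour. All per-resource prices below
# are assumed, illustrative values, not numbers from the paper.

INSTANCE_PER_HOUR = 0.065            # fixed EC2-style bundle (Section 2.1)
PRICE_PER_ALU_HOUR = 0.010           # hypothetical auction-clearing prices
PRICE_PER_CACHE_KB_HOUR = 0.00002

def sub_core_cost(alus: int, cache_kb: int) -> float:
    """Hourly cost of a VCore priced per resource."""
    return alus * PRICE_PER_ALU_HOUR + cache_kb * PRICE_PER_CACHE_KB_HOUR

# A cache-insensitive workload might rent one ALU and 64KB of cache,
# while a cache-hungry one pays for 2MB it actually benefits from.
print(sub_core_cost(1, 64))      # 0.01128 per hour
print(sub_core_cost(2, 2048))    # 0.06096 per hour
```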

3. Architecture

The Sharing Architecture is a fine-grain composable architecture which provides the flexibility to dynamically "synthesize" a flexible core out of the correct amount of ALUs, fetch bandwidth, and cache, based on an application's demand and the need to optimize cost. Unlike multicore and manycore architectures, where the processor core is statically fixed at design and fabrication time, the Sharing Architecture enables the core to be reconfigured and to draw from a wide array of resources across a single chip.

Figure 4. Slice Overview and Inter-Slice Network. Each Slice contains Fetch, Decode, Global Rename, Local Rename, an Instruction Window, a Local RF, an ALU, a LD/ST unit, an L1 I-Cache, an L1 D-Cache, a ROB, a Scoreboard, a branch predictor and BTB, and an LS hash unit, connected to switched inter-Slice networks: the Scalar Operand Network, the LS Sorting Network, Global Rename & Scoreboard Sync, Fetch & BTB Sync, and the L1/L2 Switched Crossbar.

This level of reconfigurability is especially useful in a competitive IaaS data center setting where many applications and different customers' codes and Virtual Machines will all share a single multicore processor.

At the top level, like current-day multicore chips used for IaaS applications, the Sharing Architecture has multiple cores that can be grouped together to create multicore Virtual Machines (VMs). Unlike fixed-architecture multicore processors, the VMs in the Sharing Architecture are composed of cores which themselves are composed of a variable number of ALUs, fetch bandwidth, and cache. We call this flexible core a Virtual Core (VCore). A VCore is composed of one or more Slices and zero or more L2 Cache Banks. Figure 3 shows an example array of Slices and Cache Banks. A full chip will have hundreds of Slices and Cache Banks.

The basic unit of computation in the Sharing Architecture is a Slice, as shown in Figure 4. A Slice is effectively a simple out-of-order processor with one ALU, one Load-Store Unit, the ability to fetch two instructions per cycle, and a small Level 1 Cache. Multiple Slices can be grouped together to increase the performance of sequential programs, thereby empowering users to make decisions about trading off ILP vs. TLP vs. process-level parallelism vs. VM-level parallelism while all utilizing the same resources.

In order to reduce fragmentation, the Sharing Architecture is designed to enable very flexible sharing of resources and does not impose hierarchical grouping of Slices or Cache Banks. In order to achieve this, we use general-purpose switched interconnects wherever possible. Neither Slices nor Cache Banks need to be contiguous in a VCore for functionality, but we do impose a few limited restrictions for the purpose of performance. For instance, when Slices are joined into a single VCore, those Slices need to be contiguous to reduce operand communication cost; Cache Banks, however, do not need to be contiguous or nearby, and we model the latency as such. We do not see the need for contiguous Slices as a limitation: all Slices are interchangeable and equally connected, so fixing fragmentation problems is as simple as rescheduling Slices to VCores, as sketched below.
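The rescheduling argument can be made concrete with a small sketch. This is our own illustration, assuming a simplified one-dimensional row of Slices (the real fabric is two-dimensional); allocate_vcore is a hypothetical helper, not part of the paper's hypervisor.

```python
# Contiguity rule: Slices in a VCore must be contiguous to bound
# operand latency; if no contiguous run exists, the allocator can
# reschedule existing VCores, since all Slices are interchangeable.

def allocate_vcore(free_slices: list, want: int):
    """free_slices: sorted list of free Slice indices in one row.
    Returns a contiguous run of `want` Slices, or None if fragmented."""
    run = []
    for s in free_slices:
        run = run + [s] if run and s == run[-1] + 1 else [s]
        if len(run) == want:
            return run
    return None   # caller may reschedule existing VCores and retry

print(allocate_vcore([0, 1, 3, 4, 5], 3))   # -> [3, 4, 5]
print(allocate_vcore([0, 2, 4, 6], 2))      # -> None (needs rescheduling)
```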

The ability to group multiple Slices into a single VCore adds not only logical complexity, but also a small area and performance overhead, which we study in detail in Section 5.1.

Table 1. Replicated vs. Partitioned Structures

Replicated: BTB, Scoreboard, Global RAT
Partitioned: Branch Predictor, Issue Window, Load Queue, Store Queue, ROB, Local RAT, Physical RF

In this work, we leverage many of the ideas from Distributed ILP architectures such as TRIPS, Raw, and Tilera, as well as fusible-core ideas from Core Fusion and WiDGET, to address the challenge of partitioning one VCore across multiple Slices. The approach we take is designed for wider sharing of resources than the fusible-core approach, and hence the Sharing Architecture tries to enable more decoupling between Slices.

When grouping multiple Slices to work together to execute a single sequential program, each intra-core component (e.g., Reorder Buffer (ROB), Scoreboard, etc.) has to be considered to see if the component needs to be logically replicated, partitioned, or centralized. These decisions are made based on how tolerant a structure is to increased latency when accessing different elements in the structure. The advantage of partitioning a structure over replicating it is that when partitioned, the number of logical resources scales with the number of Slices. Conversely, replicated resources need to be sized for the largest number of Slices in a VCore configuration (8) and waste area for smaller configurations. Table 1 shows which structures are replicated or partitioned in the Sharing Architecture. The rest of this section steps through the design decisions of the Sharing Architecture.

3.1 Front-End

Each Slice has its own copy of the program counter. The replication of the program counter enables each Slice in a VCore to fetch two instructions per cycle. The fetching of instructions is interleaved such that each Slice fetches two contiguous instructions in one cycle and, on a subsequent cycle, fetches the instructions 2 x (number of Slices) and 2 x (number of Slices) + 1 later in the program. Whenever there is a stall that ripples back to the fetch stage of the pipeline, it stalls all of the Slices in a VCore.
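A minimal sketch of this interleaving, with our own helper names (the mapping follows directly from the scheme described above):

```python
# Each Slice fetches a 2-instruction pair per cycle; pairs rotate
# round-robin across the Slices of a VCore.

def fetch_slice(instr_index: int, num_slices: int) -> int:
    """Return the Slice that fetches the instruction at instr_index."""
    pair = instr_index // 2          # 2 contiguous instructions per pair
    return pair % num_slices         # pairs rotate round-robin over Slices

def next_fetch_index(instr_index: int, num_slices: int) -> int:
    """A Slice's next-cycle instruction is 2 * num_slices later in
    program order than the one it just fetched."""
    return instr_index + 2 * num_slices

# Example: a 4-Slice VCore. Slice 0 fetches instructions 0,1 then 8,9;
# Slice 1 fetches 2,3 then 10,11; and so on.
assert fetch_slice(0, 4) == 0 and fetch_slice(9, 4) == 0
assert fetch_slice(2, 4) == 1
assert next_fetch_index(0, 4) == 8
```

One consequence of this mapping, used below, is that a given PC is always fetched by the same Slice for a fixed VCore size.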

Similar to Core Fusion [24], a distributed branch predictor is used; thus the effective branch predictor capacity grows with the number of Slices. Because instructions are fetched in an interleaved fashion, the same PC is always fetched by the same Slice. Therefore branch instructions are always mapped to the same Slice and the same predictor. Unlike Core Fusion, each Slice replicates Branch Target Buffer (BTB) entries in order to enable Slices which do not execute a given branch to fetch the appropriate next instruction. These fake BTB entries are added into the BTBs of Slices in the same fetch group as a branch, are tagged by the appropriate offset, and point to the correct Slice-interleaved target address, not the actual branch target. Branch misprediction and BTB training information is transmitted over the switched interconnect.

Figure 5. Flow Chart of Rename Stage: a multi-stage global rename (destination-register rename, communication to the master Slice, source-register mapping read and RAT update) followed by local rename, dispatch into the issue window, and remote operand requests that access a remote Slice's scoreboard to find the requested local register.

When a branch mispredicts, all speculative instructions in the branching Slice are flushed, and misprediction messages are sent to Slices which contain newer speculative instructions, causing those instructions to be flushed.

The Sharing Architecture uses a local bimodal branch predictor [40], which indexes the prediction table by the program counter value. In order to utilize a global prediction scheme, such as gshare or gselect, a Global History Register would need to be composed across Slices. This could be achieved with appropriate delay across the switched interconnect.
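For concreteness, a minimal sketch of such a bimodal predictor follows; the table size is our assumption, as the paper does not specify one. Because interleaved fetch maps a given PC to a fixed Slice, each Slice's table trains only on the branches that Slice fetches.

```python
# A bimodal predictor: a table of 2-bit saturating counters indexed
# by the low bits of the program counter.

TABLE_BITS = 10  # assumed table size; not given in the paper

class BimodalPredictor:
    def __init__(self, table_bits: int = TABLE_BITS):
        self.mask = (1 << table_bits) - 1
        self.table = [2] * (1 << table_bits)   # init weakly taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask           # drop byte offset, keep low bits

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2   # counter >= 2 -> taken

    def train(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```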

3.2 Register Renaming

Renaming registers on the Sharing Architecture occurs in a two-step process, as shown in Figure 5. At a high level, architectural registers are first renamed into a global logical register space which is shared across Slices. After inter-Slice dependencies are determined, a second renaming step renames from the global logical register space into a Slice-local physical register number. Each Slice contains a Local Register File (LRF) [52], which enables the number of physical registers to increase as Slices are added. The global logical register space is sized for the maximum number of Slices in a VCore.

3.2.1 Global Rename: Local Decision with Global Help

Instructions are decoded in parallel, and renaming is first done from the architectural register space to a large, global logical register space, which eliminates false dependencies. The free-list of global logical registers is distributed across the Slices in a VCore, removing the need for a centralized free-list. Each Slice contains a global Register Alias Table (RAT) which tracks the mapping from architectural register to global logical register. Global renaming of destination registers takes place in several steps, as shown in Figure 6b. First, all destination operands are renamed in their local Slice. Second, this rename mapping (both architectural register and global logical register) is sent to a master Slice within the VCore. The master Slice broadcasts this rename information to the other Slices, which then correct their own rename operations based on instruction ordering. This correction step is done locally, but uses global information in order to handle the case where two instructions that are fetched across different Slices write the same destination or there is an inter-Slice read-after-write (RAW) dependency between instructions in a fetch group.

Figure 6. Global Rename. (a) Master Slice: the Slices of a VCore grouped around a master Slice. (b) Multi-stage Rename example: Slice 4 renames the destination register of add r1, r1, r0 locally (yielding add p33, r1, r0) and sends the mapping (p33, r1) to master Slice 0; Slice 0 receives mappings such as (p32, r1), (p33, r1), (p34, r1) and broadcasts at most 9 renamed registers to the slave Slices.

Figure 7. Local Register Files (LRF) and Scalar Operand Networks (N). IW: Issue Window; FU: Functional Unit.

This information updates each Slice's copy of the global RAT and the scoreboard, which tracks which Slice contains the most up-to-date value for a given register. Finally, source operands are renamed using the global RAT's old and new values, dependent on instruction order.

Figure 6a shows how Slices can be grouped around a master Slice for global renaming, and Figure 6b shows an example of multi-stage rename. Alternatives include centralizing the rename information and then broadcasting it using a multicast network, or choosing different Slices to handle global rename based on architectural register number.

3.2.2 Local Rename and Remote Operand Request

After global rename occurs, each global logical register operand is locally renamed into the physical register LRF namespace. By using distributed LRFs as shown in Figure 7, the number of physical registers in a VCore scales with the number of Slices in a core. Using the information from the global rename step, the scoreboard contains a mapping from global logical register to Slice number. Destination registers are always renamed into the LRF space, allocating entries from the LRF. Source operands read a local RAT which keeps a mapping from global register number to LRF physical register number. If a global logical source register is found from the scoreboard to be in a remote Slice, an operand request message is generated and sent to the remote Slice. When the operand request is sent, the destination is allocated into the LRF and marked as pending until an operand reply message is received. By renaming and allocating remote operands into the LRF, subsequent reads from a global logical register which has been previously read and renamed by that Slice do not generate remote operand requests, but rather simply read the LRF.
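A simplified sketch of this local-rename path, covering only source operands (destination allocation and reclamation are omitted); the class and method names are our own, and send_request stands in for injection into the Scalar Operand Network described next.

```python
# The scoreboard maps each global logical register to the Slice holding
# (or producing) its value; the local RAT maps global registers to
# Local Register File (LRF) entries within this Slice.

class Slice:
    def __init__(self, slice_id: int, num_lrf: int = 64):
        self.slice_id = slice_id
        self.local_rat = {}        # global logical reg -> LRF index
        self.lrf_free = list(range(num_lrf))
        self.pending = set()       # LRF entries awaiting an operand reply

    def rename_source(self, greg: int, scoreboard: dict, send_request) -> int:
        """Return the LRF index for a global-register source operand."""
        if greg in self.local_rat:
            # Already renamed locally: later reads just hit the LRF,
            # generating no further remote traffic.
            return self.local_rat[greg]
        lrf = self.lrf_free.pop()
        self.local_rat[greg] = lrf
        if scoreboard[greg] != self.slice_id:
            # Value lives (or will live) in a remote Slice: allocate the
            # LRF entry, mark it pending, and request the operand over
            # the Scalar Operand Network.
            self.pending.add(lrf)
            send_request(dst_slice=scoreboard[greg], greg=greg,
                         reply_to=(self.slice_id, lrf))
        return lrf
```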

Operand request and operand reply messages transit a 2-dimensional, switched Scalar Operand Network (SON) [59, 65] to move operand requests and responses between different Slices. Section 3.4, below, describes how the Scalar Operand Network is integrated with each Slice's execution units. When an operand request message is received by a remote Slice, the local RAT is used to determine the physical LRF number that either stores the value or will store the value. If the remote Slice's LRF already contains the operand value, the SON is used to send back the value. If the operation which generates the operand value is pending, the request is enqueued on a wait list, and the reply is generated once the value is ready. After renaming, and after operand request messages are sent (but before responses are necessarily received), instructions enter an instruction window.

3.3 Instruction Window and Dispatch

Each Slice has a separate issue window for ALU instructions and loads/stores. Instructions wait in the issue window until their dependent operands will be available the next cycle. At that point, the instructions leave the issue window, possibly out of order, and execute. Special attention must be paid to operands that come from remote Slices. In order to wake up an instruction waiting for a remote operand, the issue window is matched not only against local finishing instructions, but also against wake-up signals that indicate that an operand reply will be returning on the next cycle. It is possible to have this one-cycle head start because a wake-up signal is sent when the instruction in the remote Slice issues, which is the cycle before it executes.

3.4 Execution Units

Once an instruction is issued, it executes on the ALU or Load-Store Unit. An important aspect of the Sharing Architecture is that it uses a switched, dynamically scheduled, 2D Scalar Operand Network (SON) to move operand values between different Slices. The SON is directly tied into the bypass network where results are generated in the ALUs. This enables operand values to be sent right after they are generated, and enables values to be forwarded to a destination instruction and the LRF in a remote Slice with low latency. This design is similar to how the on-chip networks are integrated into the pipelines of the Raw and Tilera processors, and it reduces latency between ALUs in different Slices. We model a two-cycle communication cost between nearest-neighbor Slices and an additional cycle for each additional network hop, the same latency as on a Tilera processor. The LRF has an additional read port and an additional write port to process operand request and operand reply messages.
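In cycle terms, the modeled latency is simply the following (our own helper name; a one-dimensional simplification of the 2-D mesh):

```python
# Two cycles between nearest-neighbor Slices, plus one cycle for each
# additional network hop, matching the Tilera latency cited above.

def operand_latency_cycles(hops: int) -> int:
    """hops: network distance between producer and consumer Slices."""
    assert hops >= 1, "intra-Slice bypass has no network cost"
    return 2 + (hops - 1)

# Nearest neighbor costs 2 cycles; a 3-hop transfer costs 4.
assert operand_latency_cycles(1) == 2
assert operand_latency_cycles(3) == 4
```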


3.5 Cache

Each Slice contains an L1 I-Cache and an L1 D-Cache. The L1 I-Cache line size is reduced to accommodate two instructions, and a next-line predictor is used to prefetch the next instruction according to the number of Slices in a VCore. The L1 D-Cache is private to each Slice. Beyond the L1 D-Cache is a configurable L2 cache. Any L2 Cache Bank in the system can be used by any VCore. This is achieved by connecting L2 Cache Banks to a 2D switched dynamic network. In our results section, we assume each L2 Cache Bank is 64KB. On an L1 cache miss, a message is sent across the switched interconnect to retrieve data, and response data is sent back via the same network. Addresses are low-order interleaved by cache line across L2 Cache Banks, thereby simplifying the mapping of addresses to Cache Banks. Latency increases as L2 Banks are located farther from the Slice issuing the cache miss.

In order to prevent the processor cores from stalling due to long memory latencies, the Sharing cache subsystem uses non-blocking caches and a small store buffer in each Slice. Loads and stores are sorted to Slices after issue and address generation, as shown in Figure 8. A hashing function low-order interleaves accesses by cache line. Because addresses to the same cache line are always sorted to the same Slice, there is no need to keep the caches coherent between Slices in a VCore.
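A minimal sketch of this low-order interleaving, using the 64-byte line size from Table 3 (the helper names are ours): the same hash sorts loads/stores to Slices and L1 misses to L2 Cache Banks.

```python
# Low-order interleaving by cache-line address.

CACHE_LINE_BYTES = 64  # matches the block size in Table 3

def line_number(addr: int) -> int:
    return addr // CACHE_LINE_BYTES

def home_slice(addr: int, num_slices: int) -> int:
    """Slice whose LSQ bank and L1 D-Cache own this cache line."""
    return line_number(addr) % num_slices

def home_l2_bank(addr: int, num_l2_banks: int) -> int:
    """L2 Cache Bank holding this line within the VCore."""
    return line_number(addr) % num_l2_banks

# Because every access to a given line sorts to the same Slice, the L1
# D-Caches within a VCore never hold conflicting copies and need no
# intra-VCore coherence, as noted above.
assert home_slice(0x1000, 4) == home_slice(0x1004, 4)
```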

In a multi-VCore VM, caches need to be kept coherent between VCores when sharing steps beyond the VCore, such as in a multi-threaded program or with inter-program shared memory. The coherence point can be between the L1 and L2, or after the L2s. If the coherence point is before the L2, the L2 can be shared by multiple VCores; if it is after the L2, the L2s are private to a VCore. Both of these solutions enable a manner of flexible L2 cache sizing. In our presented results, we put the coherence point between the L1 and L2 caches, therefore having a shared L2 cache per VM. We model this with a detailed cache coherence protocol which has a directory in the L2 and includes switched-network cost based on distance as well as L1 invalidations. An important portion of this design is that each Slice maintains a table which maps a portion of the VM's physical address space to a particular L2 Cache Bank (i.e., a home-node mapping table). We do not currently allow inter-VM physical memory sharing, so there are currently no coherence challenges beyond the VM (L2 cache). The implication of this is that the reallocation of an L2 Cache Bank requires flushing that bank to main memory. In terms of inter-VM coherence, the L2s themselves could be kept coherent with the addition of a directory-based cache coherence system beyond the L2 caches, thereby enabling VM-level page sharing. Such a solution could leverage the Virtual Hierarchy work [39]. Last, if we only have per-VCore private L2 caches, much of this complexity goes away, as the L2 caches could simply be kept coherent with a flat directory-based cache coherence protocol.

3.6 Distributed LSQ and LS Instruction Sorting

Conventional load/store queues (LSQs) are challenging to build in wide-issue processors.

Figure 8. Distributed LSQ and LS Sorting. After issue and address generation, a hashing function routes loads/stores from the issue window through the LS router to the owning Slice's LSQ bank and L1 D-Cache; LS sorting and LS commit traffic flows to and from other Slices under ROB control.

The Sharing Architecture sorts loads and stores to a given Slice based on their addresses, therefore allowing the LSQ to operate local to a Slice, as shown in Figure 8. Unlike the Sharing Architecture, most banked-by-address LSQs, such as those in clustered processors [8] and Core Fusion [24], require LSQ bank prediction because a memory operation's effective address is unknown until the execution stage. Clustered processors and Core Fusion use this approach because they allocate into the LSQ when entering the issue window. In contrast, the Sharing Architecture uses an unordered LSQ, such as those in prior work [54], in order to reduce LSQ occupancy and power consumption. Our work is the first to utilize a general-purpose switched network to sort load/store instructions combined with an unordered LSQ.

The LSQ bank is unordered with respect to age, and an extra age-tag field is added to maintain the load/store order. The LSQ buffers all stores and only allows stores to be written out when they are non-speculative. In the Sharing Architecture, a store is determined to be non-speculative if it is the head instruction in the ROB. A search is necessary in the LSQ to match the ROB tag with the LSQ entry tag on store commit.

Loads access memory speculatively after the address is resolved. Load data values are returned to the issuing Slice and are marked as speculative until all earlier stores have committed. The LSQ reports a violation when it detects that a load that is younger in program order than a store to the same address has executed before that store. In order to achieve this, each committing store searches the LSQ bank for any issued loads younger than it to the same address. Figure 9 shows a committing store checking all younger loads in a local LSQ. Each store needs to check against all of the loads in a particular LSQ bank when it commits, but loads do not need to check against stores in this design. The address-banked nature of the LSQ across Slices enables more memory bandwidth and a large aggregate LSQ capacity as more Slices are added to a VCore, and it eliminates the need for inter-Slice LSQ alias checking.
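A minimal sketch of this commit-time check (our own simplification; real entries would also carry data, speculation state, and the original Slice ID shown in Figure 9):

```python
# At commit, a store searches its LSQ bank for younger loads to the
# same address that have already executed speculatively.

from dataclasses import dataclass

@dataclass
class LsqEntry:
    is_load: bool
    address: int
    age_tag: int        # program-order age; larger = younger
    executed: bool = False

def commit_store(store: LsqEntry, lsq_bank: list) -> list:
    """Return the younger, already-issued loads to the same address;
    any hit is an ordering violation that forces a flush and replay."""
    return [e for e in lsq_bank
            if e.is_load
            and e.executed
            and e.address == store.address
            and e.age_tag > store.age_tag]

# Loads never search against stores in this design; only committing
# stores pay the associative search, and only within their own bank,
# since all accesses to a line sort to the same Slice.
```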

3.7 Distributed Reorder Buffer

We partition and distribute Reorder Buffer (ROB) entries across Slices. We leverage the approach used by Core Fusion [24], which uses a pre-commit pointer to guarantee that all ROBs are up to date several cycles before true commit.

Figure 9. LSQ Bank. The store queue (SQ) holds data, address, and age-tag; the load queue (LQ) holds address, age-tag, and the original Slice ID & register number. A committing store's commit tag and address are checked against all younger loads (alias checking), under control from the ROB.

3.8 Reconfiguration

In order to reconfigure VCores, we rely on the hypervisor to control the connections of the Slice and Cache interconnect. We propose having the hypervisor be time-sliced on the same resources as the client VMs. But unlike client VMs, which run on reconfigurable cores, we propose having the hypervisor execute only on single-Slice VCores. By having the hypervisor execute on single Slices only, the hypervisor can locally reconfigure protection registers and interconnect state to set up and tear down client VCores. The hypervisor is designed to bypass the reconfigurable cache system or use hypervisor-dedicated L2 Banks in order to avoid having to flush the L2 Caches on every time slice.

When a VCore is reduced in its number of Slices, there is potentially register state which needs to be transmitted to the surviving Slices within the VCore. In order to carry this out, we use a Register Flush instruction which pushes all of the dirty architectural register state to other Slices in the same VCore using operand reply messages. This can be done quickly because there are only 64 local physical registers per Slice and the Scalar Operand Network is fast at transferring register data. When a VCore's L2 Cache configuration changes, we propose that all dirty state in the affected L2 Cache Banks be flushed to main memory before reconfiguration.

4. Software Implications

While this paper focuses on a new IaaS-optimized architecture, we want to address some of the software implications of using such a system in an IaaS Cloud. The first question is how the IaaS user determines their utility functions and how they might react to variable pricing. A basic solution would be for the IaaS user to provide a meta-program along with the VM workload that they want to execute. The meta-program can express the user's multi-dimensional utility function as a function of different resources and can understand how to react to changing prices. The user can profile their application on different architectures to build performance models which allow them to back-compute their utility functions.

Alternatively, they could utilize an auto-tuner [5, 6, 44, 51, 55]. The auto-tuner would slowly search the configuration space by varying the VM instance configuration and the number of VMs. The auto-tuner can then pick good configurations given a high-level goal from the user. Such an auto-tuning system would likely require the use of a heartbeat [22] or performance feedback to evaluate the success criteria. An example of this can be seen in current-day IaaS auto-scaling systems such as Amazon's Auto Scaling and RightScale. Other options include having a human in the loop to make decisions or employing a decision system to make configuration changes.
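As an illustration of the meta-program idea, the sketch below chooses a VCore configuration from current market prices and a user-supplied performance model. All names, prices, and the toy model are our own assumptions, not an API defined by the paper.

```python
# Given per-resource market prices and a budget, exhaustively search
# the valid VCore configurations for the one maximizing the user's
# utility. perf_model would come from offline profiling.

def choose_config(prices, perf_model, budget):
    """prices: dict with per-hour cost of a Slice and a 64KB L2 bank.
    perf_model: callable (slices, l2_banks) -> user-defined utility.
    Returns the (slices, l2_banks) maximizing utility within budget."""
    best, best_util = None, float("-inf")
    for slices in range(1, 9):                    # valid VCore sizes
        for l2_banks in range(0, 129):            # 0 .. 8MB of 64KB banks
            cost = slices * prices["slice"] + l2_banks * prices["l2_bank"]
            if cost > budget:
                continue
            util = perf_model(slices, l2_banks)
            if util > best_util:
                best, best_util = (slices, l2_banks), util
    return best

# Toy diminishing-returns model, purely illustrative:
toy_model = lambda s, b: s ** 0.5 + 0.1 * (b ** 0.5)
print(choose_config({"slice": 0.02, "l2_bank": 0.001}, toy_model, 0.10))
```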

We could also use a sophisticated control-theory-based or machine-learning-based software adaptation engine, such as PTRADE [23]. This software engine asks application developers to specify objective functions and asks system developers to specify the components that affect those objective functions. The software runtime system takes these specifications and tunes component usage to meet goals while reducing cost (power consumption, computation, and storage cost).

The Sharing Architecture is designed to minimize changes to the software stack, but some changes will be needed. From the IaaS client's perspective, they will still create VM instances as before, but now will have to provide a meta-program as described above, rely on auto-tuners, or fall back on the previous market solution of fixed instance types. Because all of the different micro-architecture configurations of the Sharing Architecture are binary compatible, the user does not need to recompile to use different Slice and Cache configurations. The user could further optimize by using multi-architecture (FAT) binaries optimized for different micro-architectures. Our presented results do not recompile or use this optimization.

Next we analyze the changes to the system stack. The OS should not need to change for the Sharing Architecture. The hypervisor will likely need modification in order to manage resources on a sub-core basis. Finally, the Cloud management software (the scheduler) will have to change in order to schedule the new resources. Changing the Cloud scheduler is a challenging problem, but the Sharing Architecture opens up many opportunities for interesting research in this space that can leverage prior work in large-scale scheduling [14, 19, 21, 25, 48, 56]. These Cloud-provider changes can enable improvements in economic efficiency and revenue.

5. Evaluation

5.1 Design Area

In order to assess the feasibility of the Sharing Architecture, we have implemented the Sharing Architecture Slice in synthesizable Verilog. We use this to measure the added area overhead that the Sharing Architecture adds to an out-of-order superscalar. We synthesized the design utilizing the Synopsys tool flow and took the design through Design Compiler and IC Compiler to a fully placed-and-routed netlist. We use Synopsys's TSMC 45nm cell library as the implementation technology. Cache sizes, timing, and power are generated utilizing CACTI [42] at the 45nm node. Figure 10 and Figure 11 present the area breakdown of the Sharing Architecture.

Figure 10. Area Decomposition without L2 cache. Register File 6%; BTB & Predictor 4%; 16KB 2-way L1 D-Cache 24%; 16KB 2-way L1 I-Cache 24%; Issue Window 4%; Instruction Buffer 11%; LSQ 8%; Multiplier 2%; ROB 6%; ALUs 1%; Sharing Overhead 8% (Global Rename 1%, Added Pipeline 0%, Scoreboard 2%, Waitlist 1%, Local Rename 2%, Routers 2%).

Figure 11. Area Decomposition including 64KB L2 cache. 64KB 4-way L2 D-Cache 35%; Register File 4%; BTB & Predictor 3%; 16KB 2-way L1 D-Cache 16%; 16KB 2-way L1 I-Cache 16%; Issue Window 3%; Instruction Buffer 7%; LSQ 5%; Multiplier 2%; ROB 4%; ALUs 1%; Sharing Overhead 5% (Global Rename 0%, Added Pipeline 0%, Scoreboard 1%, Waitlist 1%, Local Rename 1%, Routers 2%).

We call out the area overhead of the Sharing Architecture in the right sub-pie-charts. The total area of one Slice is 1.8 mm² (not including L2) at 1GHz. The Sharing Architecture area overhead is 8% without L2 cache and 5% with the 64KB L2 cache area included.

We found the area overhead of the inter-Slice on-chip networks to be significant, as there are three dedicated networks modeled for different purposes (operand network, load/store sorting, and global renaming). A single operand network is used for both operand request and operand reply messages, as we found that one operand network provides sufficient bandwidth. In a sensitivity study on operand communication bandwidth, we discovered that adding a second operand network would improve performance by only 1% across our applications. The overhead of the Sharing Architecture as a percentage of the total area is also affected by our choice of modestly sized structures such as the ROB, Load/Store Queue, etc. If a more complex base design were chosen, the area overhead of the Sharing Architecture as a percentage of the total would be smaller. In addition, we do not model the area of the floating-point unit in our design, but this structure does not change due to the Sharing Architecture; therefore our area estimates are conservative.

5.2 Experimental Setup

We have created a trace-based simulator, SSim, to model the Sharing Architecture and measure its effectiveness.

Table 2. Base Slice Configuration

Number of Functional Units/Slice: 2
Number of Physical Registers: 128
Number of Local Registers/Slice: 64
Issue Window Size: 32
Load/Store Queue Size: 32
ROB Size: 64
Store Buffer Size: 8
Maximum In-flight Loads: 8
Memory Delay: 100 (cycles)

Table 3. Base Cache Configurations (the L2 cache hit delay depends on the distance to the cache bank)

Level  Size (KB)  Block Size (Bytes)  Associativity  Hit Delay (cycles)
L1D    16         64                  2              3
L1I    16         64                  2              3
L2     64 x 2     64                  4              distance x 2 + 4

Figure 12. Scalability of VCore Performance for 1-8 Slices. Includes one cycle insertion delay plus one cycle per hop. Normalized to a one-Slice VCore with 128KB L2 Cache. Benchmarks: apache, bzip, gcc, astar, libquantum, perlbench, sjeng, hmmer, gobmk, mcf, omnetpp, h264ref, dedup, swaptions, ferret.

This cycle-level simulator models each subsystem of the Sharing Architecture (fetch, rename, issue, execution, memory, commit, and on-chip network) and accurately models out-of-order execution as well as inter-Slice and Slice-to-memory latency. SSim is driven by full-system traces generated by the Alpha version of GEM5 [10]. We use the complete SPEC CINT2006 benchmark suite, a static web-serving workload of Apache driven by ApacheBench, and a subset of PARSEC benchmarks to explore the Sharing Architecture. SSim is very flexible, allowing all critical micro-architecture parameters and latencies to be set from an XML configuration file. When a simulation completes, SSim reports the cycles executed for a given workload along with cache miss rates and stage-based micro-architecture stalls and statistics. In the following sections, when aggregate results are shown, we use the geometric mean (GME) of the different benchmark results, following how SPEC creates aggregate results. Throughout these results, we use an area model derived from our results in Section 5.1.
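For reference, a minimal sketch of this aggregation (the example numbers are illustrative, not results from this paper):

```python
# Geometric mean (GME) of per-benchmark results, following SPEC's
# reporting convention.

import math

def geometric_mean(values):
    """GME of positive per-benchmark scores (e.g., normalized speedups)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Example: speedups of 1.2x, 2.0x, and 0.9x aggregate to ~1.29x.
print(round(geometric_mean([1.2, 2.0, 0.9]), 2))
```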

5.3 Scalability

We begin by evaluating the Sharing Architecture's ability to increase both single-stream performance and multithreaded performance by adding Slices to a VCore. Table 2 contains the base configuration of a single Slice, which we use throughout this section. The base cache configuration is shown in Table 3. Figure 12 presents these scalability results.

Figure 13. Performance Scaling with Cache Size (L2 from 0 to 8MB, fixed 2-Slice VCore). Normalized to a 2-Slice VCore with no L2. Benchmarks as in Figure 12.

Since the default Sharing Architecture requires two base cycles to transport operands between Slices, plus one additional cycle of network delay per additional hop, performance can degrade for larger VCore configurations. For PARSEC, benchmarks use four threads on four equally configured VCores which share an L2 Cache; the number of Slices per VCore is varied in unison. Compared with SPEC, the PARSEC benchmarks have less ILP, and their speedup is bounded by 2x.

5.4 Cache Sensitivity

We study the sensitivity of performance to cache size by varying the L2 cache size from 0KB up to 8MB. As can be seen in Figure 13, some applications, such as omnetpp, are extremely sensitive to cache size, while benchmarks such as astar, libquantum, and gobmk are insensitive to it. This variable sensitivity motivates the Sharing Architecture, which can take advantage of the variable micro-architecture requirements of applications. Performance can actually decrease as more cache is added, because we model an additional 2 cycles of communication delay for each additional 256KB of cache, making the latency of a large cache significant.

5.5 Performance-Area Efficiency

This section studies VCore performance-area efficiency as a function of Slice count and L2 Cache size. We use the area model described in Section 5.1 along with an exhaustive search of performance over different Slice-count and Cache configurations. Table 4 presents the optimal configurations for three different metrics across 15 benchmarks (apache, SPEC, and PARSEC). The non-uniformity of the optimal configurations, even for the performance-per-area metric, shows that benefits can be achieved even before we take into account Cloud user utility functions.

We next turn our attention to the performance²/area metric to mimic users whose applications do not scale well across processors, or who favor single-thread performance. As can be seen in Table 4, gobmk favors 5 Slices and 1MB of cache, while under the same criteria, hmmer prefers 64KB of cache and a single Slice. For sequential-performance-oriented applications, some customers will pay significantly more for higher performance; we use performance³/area to model these needs. Under this constraint the optimal VCore configurations vary significantly across benchmarks. What is even more interesting is that within a single benchmark, the optimal configuration varies greatly depending on the efficiency metric. For instance, gcc in a throughput computation favors a 128KB L2 cache and two Slices, while under a performance metric it favors 4 Slices and a 1MB cache. gcc sees over a factor of two in performance between its optimal configurations for different metrics.
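The exhaustive search itself is straightforward; the sketch below uses placeholder models (both `perf` and `area_mm2` are hypothetical stand-ins for the simulated performance data and the Section 5.1 area model):

```python
from itertools import product

def area_mm2(cache_kb: int, slices: int) -> float:
    # Assumed illustrative area model: 1 Slice ~ 128KB of cache.
    return 0.5 * slices + 0.5 * (cache_kb / 128.0)

def perf(cache_kb: int, slices: int) -> float:
    # Placeholder performance with diminishing returns in both resources.
    return (slices ** 0.5) * (1.0 + cache_kb / 1024.0) ** 0.3

CACHES = [0, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
SLICES = range(1, 9)

def best_config(k: int):
    """Exhaustively search configs maximizing performance^k / area."""
    return max(product(CACHES, SLICES),
               key=lambda cs: perf(*cs) ** k / area_mm2(*cs))

for k in (1, 2, 3):
    print(f"performance^{k}/area optimum (cache KB, Slices):", best_config(k))
```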

5.6 Utility Optimization

As developed in Section 2.2, IaaS Cloud customers have varied needs for resources based on the characteristics of the workloads that they execute (how a workload reacts to different resources) and internal policy decisions. In the Sharing Architecture, customers decide which and how many resources to purchase in order to maximize their utility while staying within a budget constraint. In the Sharing Architecture, resources include Slices, L2 Cache banks, VCores, and VMs. For the balance of the paper, we define a Cloud user's utility function as $U_{b_i}(c, s, vc, vm)$,

where $U_{b_i}$ is the utility for benchmark $b_i$ as a function of $c$, the amount of L2 cache per VCore; $s$, the number of Slices per VCore; $vc$, the number of VCores per VM; and $vm$, the number of VMs purchased. We simplify this by combining $vc$ and $vm$ into one value $v$, the number of cores a customer uses. We can do this without losing generality because we assume that benchmarks can be replicated within a VM or between VMs, yielding a new utility of $U_{b_i}(c, s, v)$.

We define an application's single-thread performance as $P(c, s)$, a function of Cache size and Slice count. For example, one Cloud customer may run Online Data-Intensive (OLDI) workloads driven by queries that interact with massive data sets and require responsiveness at a sub-second time scale [41]. Such a workload may give substantial weight to single-stream performance to reduce latency. This can lead the OLDI customer to use a utility function such as Equation 1, where utility is proportional to the cube of single-stream performance. A utility function alone is not enough to drive the Cloud customer's decision making; the cost of resources is also an important consideration. We define $C_c$ as the cost of one cache bank, $C_s$ as the cost of one Slice, and $B$ as the overall budget that a user can spend. Combining these costs with a budget, we find the constraint in Equation 2, as well as constraints derived from valid VCore configurations in Equation 3. A Cloud customer would then maximize their utility given these constraints.

$$U_{OLDI} = \sqrt[3]{v} \cdot P^3(c, s) \qquad (1)$$

$$v = \frac{B}{C_c \cdot c + C_s \cdot s} \qquad (2)$$

$$0\,\mathrm{KB} \le c \le 8\,\mathrm{MB}, \qquad 1 \le s \le 8 \qquad (3)$$
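A sketch of this constrained maximization over Equations 1 through 3 follows; the performance model `P`, the per-KB and per-Slice prices, and the budget are all hypothetical placeholders:

```python
from itertools import product

CACHES_KB = [0, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
SLICES = range(1, 9)

def maximize_oldi_utility(perf, budget, cost_cache_kb, cost_slice):
    """Maximize U_OLDI = v^(1/3) * P(c, s)^3 (Eq. 1), where the budget
    fixes v = B / (Cc*c + Cs*s) (Eq. 2), over the valid configurations
    of Eq. 3. Here cache cost is expressed per KB for simplicity."""
    best = None
    for c, s in product(CACHES_KB, SLICES):
        v = budget / (cost_cache_kb * c + cost_slice * s)
        utility = v ** (1 / 3) * perf(c, s) ** 3
        if best is None or utility > best[0]:
            best = (utility, c, s, v)
    return best  # (utility, cache KB, Slices, cores purchased)

# Hypothetical performance model and prices.
P = lambda c, s: (s ** 0.5) * (1 + c / 1024) ** 0.3
print(maximize_oldi_utility(P, budget=100.0, cost_cache_kb=0.01, cost_slice=1.0))
```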


| Area Efficiency Metric | VCore Config | apache | astar | bzip | gcc | h264ref | libq | hmmer | sjeng | perlb | mcf | omnetpp | gobmk | dedup | swaptions | ferret | optimal | GME Dyn/Static Gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| performance/area | L2 cache size (KB) | 256 | 64 | 128 | 128 | 64 | 64 | 0 | 64 | 64 | 64 | 64 | 256 | 1024 | 256 | 512 | 64 | 5.2% |
| | Slice Number | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | |
| performance²/area | L2 cache size (KB) | 512 | 128 | 256 | 512 | 64 | 64 | 64 | 512 | 256 | 128 | 256 | 1024 | 1024 | 512 | 512 | 512 | 36.6% |
| | Slice Number | 3 | 2 | 1 | 4 | 2 | 1 | 1 | 3 | 3 | 2 | 4 | 5 | 3 | 3 | 2 | 3 | |
| performance³/area | L2 cache size (KB) | 512 | 1024 | 256 | 1024 | 512 | 64 | 128 | 1024 | 512 | 256 | 1024 | 1024 | 1024 | 512 | 512 | 1024 | 56.2% |
| | Slice Number | 3 | 5 | 2 | 4 | 3 | 1 | 2 | 4 | 4 | 2 | 5 | 6 | 3 | 3 | 2 | 3 | |

Table 4. Optimal VCore Configurations for Performance-Area Efficiency on Apache Server, SPEC CINT2006, and PARSEC.

| Utility Function | Definition | Analogy | Representative Workloads |
|---|---|---|---|
| Utility1 | $v \cdot P(c, s)$ | Energy·Delay [13] | Media Streaming, MapReduce, Database [16] |
| Utility2 | $\sqrt{v} \cdot P^2(c, s)$ | Energy·Delay² [36] | Web Search, Web Frontend, Data Serving [16] |
| Utility3 | $\sqrt[3]{v} \cdot P^3(c, s)$ | Energy·Delay³ | Mobile Interactive, Business Analytics [7] |

Table 5. Three Different User Utility Functions.

In contrast to the OLDI workload, some cloud workloads are completely throughput oriented and insensitive to additional latency. Examples include a data backup service which has to encrypt and compress large amounts of bulk data, bulk image resizing for photo-sharing sites such as Picasa or SmugMug, and many MapReduce applications which are not directly connected to a user. For these latency-tolerant applications, we define the utility function in Equation 4. In the balance of the paper, we will use the three utility functions shown in Table 5 as example Cloud user utility functions. We refer to these as Utility1, Utility2, and Utility3, sorted from favoring throughput computing to favoring increasing amounts of single-thread performance.

$$U_{LT} = v \cdot P(c, s) \qquad (4)$$
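The three functions in Table 5 form one family, $U_k = \sqrt[k]{v} \cdot P^k(c, s)$, which the following sketch makes explicit; the resource values at the end are arbitrary examples:

```python
def make_utility(k: int):
    """Table 5's family: Utility_k(v, P) = v^(1/k) * P(c, s)^k.
    k = 1 weights throughput; k = 3 strongly weights single-thread
    performance (cf. the Energy*Delay^k analogy)."""
    return lambda v, p: v ** (1 / k) * p ** k

utility1, utility2, utility3 = (make_utility(k) for k in (1, 2, 3))

# Same resources (v cores, per-core performance p), different preferences:
v, p = 8.0, 1.5
print(utility1(v, p), utility2(v, p), utility3(v, p))
```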

To provide some intuition about what a utility function looks like, we plot in Figure 14 the utility functions Utility1 and Utility2 for gcc and bzip. In Figure 14, the x-axis represents the number of Slices [0, 8] and the y-axis represents the number of 64KB L2 cache banks on a binary log scale [0, 8] (i.e., a cache size of $64\,\mathrm{KB} \cdot 2^{y} - 64\,\mathrm{KB}$). As can be seen when comparing Figure 14a (gcc Utility1) with Figure 14b (gcc Utility2), simply changing the utility function can drastically change which configuration provides peak utility. Likewise, when we compare Figure 14d (bzip Utility2) with Figure 14b (gcc Utility2), we see that holding the utility function constant but changing the workload substantially changes the configurations with peak utility. For instance, bzip favors a 256KB L2 cache and a single Slice to obtain peak utility for Utility2, while gcc favors a 512KB L2 cache and 4 Slices per VCore to maximize Utility2.

5.7 Market Efficiency

We simulated the performance of all of our benchmarks at different Slice and L2 Cache configurations, computed the peak utility for the three different utility functions (Utility1, Utility2, and Utility3 in Table 5) in three different market scenarios, and present the configurations that exhibit the highest utility for each in Table 6. Market2 is the market which sets the cost of resources equal to their area (1 Slice costs the same as 128KB of Cache).

[Figure 14. Two Utility Functions for bzip and gcc as a Function of Slice Count and L2 Cache Size: x-axis is the number of Slices, y-axis is the number of 64KB L2 banks in binary log scale. Panels: (a) gcc Utility1, (b) gcc Utility2, (c) bzip Utility1, (d) bzip Utility2.]

In Market1, we set Slices to be four times the equal-area cost of Cache, and in Market3, we set Cache to be four times the equal-area cost of a Slice. We use these different markets to show how optimal configurations change when market demand for resources does not track area costs.
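The pricing in the three markets can be sketched as follows; the absolute price unit is arbitrary and the helper names are ours, with only the 1-Slice-to-128KB equal-area exchange rate and the 4x multipliers taken from the text:

```python
def market_prices(market: str, unit: float = 1.0):
    """Per-resource prices for the three markets. Under equal-area
    pricing (Market2), one Slice costs the same as 128KB of cache;
    Market1 quadruples the Slice price, Market3 the cache price."""
    slice_mult = 4.0 if market == "market1" else 1.0
    cache_mult = 4.0 if market == "market3" else 1.0
    return unit * slice_mult, (unit / 128.0) * cache_mult

def vcore_cost(market: str, cache_kb: int, slices: int) -> float:
    cost_slice, cost_per_kb = market_prices(market)
    return cost_slice * slices + cost_per_kb * cache_kb

print(vcore_cost("market2", 1024, 4))  # e.g., a 4-Slice, 1MB-L2 VCore
```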

5.8 Efficiency Comparisons with Static Fixed Architecture and Heterogeneous Architecture

We turn our attention to how much market efficiency gain (profit gain) can be achieved when compared to a fixed multicore architecture or a heterogeneous multicore architecture in an IaaS Cloud provider setting. For this, we restrict our study to Market2, where resource cost is commensurate with the area used. To study the market efficiency gain, we pairwise choose two benchmarks and two utility functions from Table 6 and compare the utility of running those benchmarks on a fixed-resource architecture versus running them on the Sharing Architecture with configurable Slice count and Cache size.

We first compare the Sharing Architecture with the best static fixed architecture, determined across all benchmarks and the three utility functions such that it minimizes the geometric mean of the utility gained, in Figure 15. This shows how much market efficiency gain can be had when compared with current IaaS commodity multicore processors.


| Market | Utility | apache | astar | bzip | gcc | h264ref | libquantum | hmmer | sjeng | perlbench | mcf | omnetpp | gobmk | dedup | swaptions | ferret |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Market1: Higher Slice Price | Utility1 | 256, 1 | 128, 1 | 256, 1 | 512, 2 | 64, 1 | 64, 1 | 128, 1 | 512, 1 | 256, 1 | 128, 1 | 256, 1 | 512, 2 | 1024, 1 | 512, 1 | 512, 1 |
| | Utility2 | 1024, 3 | 2048, 6 | 512, 2 | 2048, 5 | 512, 3 | 256, 1 | 512, 2 | 1024, 4 | 2048, 4 | 512, 2 | 1024, 5 | 1024, 6 | 1024, 3 | 512, 3 | 512, 2 |
| | Utility3 | 1024, 4 | 2048, 7 | 512, 2 | 2048, 6 | 512, 3 | 256, 1 | 1024, 4 | 1024, 5 | 2048, 5 | 2048, 4 | 2048, 8 | 2048, 6 | 1024, 3 | 512, 3 | 512, 3 |
| Market2: Equal Area Price | Utility1 | 256, 2 | 64, 1 | 128, 1 | 128, 2 | 64, 1 | 64, 1 | 0, 1 | 64, 1 | 64, 1 | 64, 1 | 64, 1 | 256, 3 | 1024, 2 | 256, 1 | 512, 1 |
| | Utility2 | 1024, 4 | 1024, 6 | 256, 2 | 1024, 5 | 512, 3 | 128, 1 | 512, 2 | 1024, 5 | 1024, 5 | 512, 4 | 1024, 7 | 1024, 6 | 1024, 3 | 512, 3 | 512, 3 |
| | Utility3 | 1024, 4 | 2048, 8 | 512, 3 | 2048, 7 | 512, 3 | 256, 1 | 1024, 4 | 1024, 5 | 1024, 5 | 1024, 4 | 2048, 8 | 1024, 6 | 1024, 3 | 512, 3 | 512, 3 |
| Market3: Higher Cache Price | Utility1 | 64, 2 | 0, 1 | 0, 1 | 64, 3 | 64, 1 | 0, 1 | 0, 1 | 64, 2 | 64, 3 | 64, 2 | 64, 2 | 64, 3 | 0, 1 | 256, 1 | 64, 1 |
| | Utility2 | 512, 4 | 1024, 6 | 256, 2 | 1024, 6 | 64, 3 | 64, 1 | 64, 2 | 1024, 5 | 256, 5 | 128, 4 | 256, 7 | 1024, 6 | 1024, 3 | 512, 3 | 512, 3 |
| | Utility3 | 1024, 5 | 2048, 8 | 512, 3 | 2048, 7 | 512, 3 | 128, 1 | 1024, 4 | 1024, 5 | 1024, 5 | 512, 4 | 1024, 8 | 1024, 6 | 1024, 3 | 512, 3 | 512, 3 |

Table 6. Optimal VCore Configurations in Different Markets (L2 Cache Size in KB, Slice Number).

[Figure 15. Utility Gain Compared with a Static Fixed Architecture (Optimal Configuration Across the Benchmark Suite and the Three Utility Functions). x-axis: permutation number, 0 to 1000; y-axis: utility gain, 0 to 6. Each point represents the utility gain of one benchmark and utility function mixed with another benchmark and utility function.]

For example, point 1 in Figure 15 represents astar using Utility1 combined with bzip using Utility1. As can be seen, there are significant gains, up to 5x, in market efficiency from using the Sharing Architecture over an optimal fixed architecture.

We also compare the sum of the two benchmarks' utility running on the Sharing Architecture to the sum of those benchmarks' utility running on configurations that are optimal across all benchmarks for the particular utility function: $(U_{b_1}(\mathrm{sharing}) + U_{b_2}(\mathrm{sharing})) / (U_{b_1}(\mathrm{fixed}_c) + U_{b_2}(\mathrm{fixed}_d))$. This mimics what a heterogeneous multicore proposed in previous research [18] could achieve when optimized across a suite of applications. Figure 16 shows the utility gained when mixing one benchmark and utility with another benchmark and utility for all benchmarks and the three utility functions. Over 3x market efficiency gains can be achieved.
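Each point in Figures 15 and 16 can be computed as sketched below; the peak-utility values used in the example are hypothetical placeholders:

```python
from itertools import product

def pairwise_utility_gain(u_sharing, u_fixed, workloads):
    """One point per (benchmark, utility-function) pair combination:
    gain = (U_b1(sharing) + U_b2(sharing)) / (U_b1(fixed) + U_b2(fixed)),
    where each U is the peak utility at that workload's best config on
    the Sharing Architecture vs. on the fixed baseline."""
    return [(u_sharing[w1] + u_sharing[w2]) / (u_fixed[w1] + u_fixed[w2])
            for w1, w2 in product(workloads, repeat=2)]

# Hypothetical peak utilities keyed by (benchmark, utility-function index).
u_sharing = {("astar", 1): 10.0, ("bzip", 1): 6.0}
u_fixed = {("astar", 1): 4.0, ("bzip", 1): 5.0}
print(max(pairwise_utility_gain(u_sharing, u_fixed, list(u_sharing))))
```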

5.9 Datacenter Heterogeneity Comparison

Heterogeneous architectures allow a datacenter to combine the benefits of specialization with the performance guarantees of traditional high-performance servers [18]. However, a static mix of high-end and low-end processors falls short for certain applications. While computationally intensive workloads favor large server-class processors, throughput workloads, such as web search, have better power efficiency running on mobile-class processors [27]. Using Table 6's optimal configurations, hmmer favors a small core configuration with one Slice and no L2 cache,

[Figure 16. Utility Gain Compared with a Heterogeneous Architecture (Optimal Configurations Across the Benchmark Suite for the Particular Utility Functions). x-axis: permutation number, 0 to 1000; y-axis: utility gain, 0 to 6.]

[Figure 17. Normalized performance/area of hmmer and gobmk vs. big-core-to-small-core ratio and application mix ratio. x-axis: big_core:small_core ratio, 0 to 1; y-axis: hmmer:gobmk ratio, 0 to 1; z-axis: normalized performance/area, 0.2 to 1.0.]

while gobmk achieves peak Utility1 using a large core with three Slices and 256KB of L2 cache. We vary the ratio of big cores (with 3 Slices and 256KB L2 cache) to small cores (with 1 Slice and no L2 cache), along with the application ratio, and then measure the utility of hmmer and gobmk under different ratios. In Figure 17, we can see that, depending on the application mix, different ratios of big and small cores are required for optimal performance/area efficiency. A fixed mixture of big and small cores therefore cannot always optimally service heterogeneous workloads in the cloud.
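The sweep behind Figure 17 can be sketched as follows; the per-core-type performance/area values and the greedy job-placement policy are our illustrative assumptions, not the paper's measured data or methodology:

```python
def mix_perf_per_area(big_ratio, hmmer_ratio, ppa_big, ppa_small):
    """Aggregate performance/area for a datacenter with fraction
    `big_ratio` of big cores (3 Slices, 256KB L2) running a workload
    with fraction `hmmer_ratio` of hmmer jobs, the rest gobmk. Jobs
    are greedily placed on their preferred core type while capacity
    remains: hmmer prefers small cores, gobmk prefers big cores."""
    hmmer_on_small = min(hmmer_ratio, 1.0 - big_ratio)
    hmmer_on_big = hmmer_ratio - hmmer_on_small
    gobmk_ratio = 1.0 - hmmer_ratio
    gobmk_on_big = min(gobmk_ratio, big_ratio - hmmer_on_big)
    gobmk_on_small = gobmk_ratio - gobmk_on_big
    return (hmmer_on_small * ppa_small["hmmer"] +
            hmmer_on_big * ppa_big["hmmer"] +
            gobmk_on_big * ppa_big["gobmk"] +
            gobmk_on_small * ppa_small["gobmk"])

ppa_big = {"hmmer": 0.6, "gobmk": 1.0}    # hypothetical values
ppa_small = {"hmmer": 1.0, "gobmk": 0.5}  # hypothetical values
for big in (0.0, 0.5, 1.0):
    print(big, round(mix_perf_per_area(big, 0.5, ppa_big, ppa_small), 3))
```

Even in this toy model, the best big:small ratio shifts with the application mix, which is the effect Figure 17 shows.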

5.10 Dynamic Phases

The Sharing Architecture provides the flexibility to reconfigure VCores and VMs as a workload changes. It is well known that different program phases can require different resources to perform optimally; for instance, some program phases may be more CPU intensive while other phases are memory intensive.


| Metric | Core Configuration | phase1 | phase2 | phase3 | phase4 | phase5 | phase6 | phase7 | phase8 | phase9 | phase10 | optimal | GME Dyn/Static Gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| performance/area | L2 cache size (KB) | 256 | 256 | 128 | 256 | 128 | 128 | 256 | 128 | 64 | 64 | 128 | 9.1% |
| | Slice Number | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | |
| performance²/area | L2 cache size (KB) | 512 | 256 | 256 | 512 | 512 | 256 | 256 | 128 | 128 | 512 | 512 | 15.1% |
| | Slice Number | 4 | 4 | 4 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 3 | |
| performance³/area | L2 cache size (KB) | 1024 | 1024 | 1024 | 512 | 1024 | 256 | 1024 | 128 | 128 | 512 | 1024 | 19.4% |
| | Slice Number | 5 | 4 | 4 | 3 | 4 | 3 | 4 | 2 | 2 | 3 | 4 | |

Table 7. Optimal VCore Configurations for 10 Phases of the gcc Benchmark. The dynamic average is compared with the best static average; the Dyn/Static gain accounts for a 10,000-cycle reconfiguration cost after each phase that reconfigures the L2 cache and a 500-cycle reconfiguration cost after each phase that reconfigures only Slices.

We study program phases for gcc by evenly dividing gcc into 10 segments. We run these on SSim independently and find the optimal VCore configuration for different performance-area metrics, as shown in Table 7. We account for reconfiguration overhead by adding 10,000 cycles when the cache configuration changes, but only 500 cycles when only the Slice count changes, as the L2 Cache does not need to be flushed. Even within a single program and a single metric, the optimal VCore configuration changes with phase. We compare against the optimal static configuration for gcc (a more stringent baseline than the optimum across all benchmarks) and find that significant gain can be achieved, up to 19.4% for the performance³/area metric.
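A sketch of this accounting follows; the per-phase optima and the static baseline cycle count are hypothetical placeholders, while the two penalty constants match the text:

```python
CACHE_RECONF_CYCLES = 10_000  # L2 must be flushed on a cache change
SLICE_RECONF_CYCLES = 500     # Slice-only changes are much cheaper

def dynamic_cycles(phase_best):
    """Total cycles when reconfiguring to each phase's optimal
    (cache_kb, slices, cycles) config, charging the appropriate
    reconfiguration penalty at every configuration change."""
    total, prev = 0, None
    for cache_kb, slices, cycles in phase_best:
        if prev is not None and (cache_kb, slices) != prev:
            total += (CACHE_RECONF_CYCLES if cache_kb != prev[0]
                      else SLICE_RECONF_CYCLES)
        total += cycles
        prev = (cache_kb, slices)
    return total

# Hypothetical per-phase optima: (cache KB, Slices, execution cycles).
phases = [(256, 2, 9e6), (256, 2, 9e6), (128, 2, 8e6), (64, 1, 7e6)]
static_cycles = 35e6  # hypothetical: best single config for all phases
print("dynamic gain:", static_cycles / dynamic_cycles(phases) - 1)
```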

6. Related Work

The Sharing Architecture leverages many ideas from Distributed ILP work and fused-core architectures, as compared in Table 8. Distributed ILP work tries to remove large centralized structures by replacing them with distributed structures and functional units, and has been explored in Raw [59], Tilera [65], TRIPS [17, 49, 50], and WaveScalar [58]. Tiled architectures distribute portions of the register file, cache, and functional units across a switched interconnect. We adopt a similar network topology for communicating instruction operands between functional units in different Slices, but unlike Raw and Tilera, we do not expose all of the hardware to software, do not require recompilation, and use dynamic assignment of resources, dynamic transport of operands, and dynamic instruction ordering. TRIPS [11, 49] has an array of ALUs connected by a Scalar Operand Network and uses dynamic operand transport similar to the Sharing Architecture. Unlike the Sharing Architecture, TRIPS requires compiler support and a new EDGE ISA which encodes data dependencies directly. In contrast, the Sharing Architecture uses a dynamic request-and-reply, message-based, distributed local register file to exploit ILP across Slices without the need for a compiler.

The TFlex CLP micro-architecture [30] allows dynamic aggregation of any number of cores, up to 32 per thread, to target a chosen goal (performance, area efficiency, or energy efficiency) with the support of an EDGE ISA. The Sharing Architecture does not rely on compiler-level support, but provides sub-core composability for different customer utility functions in a dynamic manner.

| Features | Distributed ILP | TRIPS/CLP | Core Fusion | WiDGET | Conjoined | Clustered | Heterogeneous | SMT/Morph | Sharing Architecture |
|---|---|---|---|---|---|---|---|---|---|
| Scale up/down | Y | Y | N | Y | N | N | N | N | Y |
| Distributed | Y | Y | N | N | N | N | N | N | Y |
| Switched | Y | Y | N | N | N | N | N | N | Y |
| Symmetric | Y | Y | Y | Y | Y | Y | N | Y | Y |
| Dynamic OoO | N | N | Y | N | Y | Y | Y/N | Y/N | Y |
| ISA Compatible | Y | N | Y | Y | Y | Y | Y | Y | Y |
| Partition L2 | Y | Y | N | N | N | N | N | N | Y |
| Multi-Metric | N | Y | N | N | N | N | N | N | Y |

Table 8. Taxonomy of Differences with Related Work. Dynamic OoO stands for dynamic instruction assignment, transport, and ordering.

The dynamic assignment of instructions to ALUs and of operands to the interconnect can potentially make the Sharing Architecture more flexible than TFlex, though possibly with less potential to exploit high ILP.

The Sharing Architecture leverages many ideas from Core Fusion [24] on how to distribute resources across cores, but unlike Core Fusion, the Sharing Architecture is designed to scale to larger numbers of Slices (cores). Core Fusion requires centralized structures for coordinating fetch, steering, and commit across fused cores, while we distribute these structures. We also take a more Distributed ILP approach to operand communication and other inter-Slice communication by using a switched interconnect between Slices instead of Core Fusion's operand crossbar. In addition, the Sharing Architecture is designed as a 2D fabric created across a large manycore with hundreds of cores, which can avoid some of the fragmentation challenges that smaller configurations like Core Fusion can have. The interconnect from Slices to Cache is also designed to be flexible across the entire machine, unlike Core Fusion, which has a shared cache. This can lead to better efficiency and provides a way to partition cache traffic and working sets.

WiDGET [64] decouples computation resources to enable flexibility in provisioning execution units to optimize power-performance. WiDGET only studies trading off single-stream performance for power, while the Sharing Architecture trades ILP for TLP and enables an IaaS Cloud user to apply their own utility function. WiDGET uses a sea-of-resources design with a hierarchical operand interconnect with local broadcast, while the Sharing Architecture uses a point-to-point switched interconnect. WiDGET's hierarchical design restricts which resources can be shared, while the Sharing Architecture enables wider sharing. Unlike WiDGET's in-order design, the Sharing Architecture is Out-of-Order, including its memory system.


Conjoined-cores [32] allows resource sharing between adjacent cores of a chip multiprocessor. The degree of sharing depends on the topology, and the shared resources are allocated to different processors in a time-partitioned manner. The Sharing Architecture tears down the rigid boundary of cores, allowing more flexible sub-core configurations. Conjoined-cores has high wiring overhead and therefore shares only large structures, while we aim at finer-grained sharing.

Clustered Architectures [45] and dynamically configurable clustered architectures [8] exploit ILP when there is latency between functional units. While dynamic clustered architectures trade off parallelism for latency, they enable neither widespread sharing nor the reconfigurability of Cache resources.

SMT [37, 60] enables trade-offs between ILP and TLP. Like the Sharing Architecture, SMT does not require recompilation, but SMT does not enable fine-grained resource sharing between cores and has rigid core boundaries. Unlike SMT, the Sharing Architecture does not time-multiplex resources. MorphCore [29] transforms a high-performance OoO core into a highly-threaded in-order SMT core. Unlike the Sharing Architecture, MorphCore is unable to scale up and down flexibly.

A heterogeneous CMP [33] can combine superscalar cores for ILP with simple cores for TLP. Unlike the Sharing Architecture, heterogeneous architectures only allow selection from a few fixed architectures. Also, if a heterogeneous architecture has the incorrect mixture of core types to match the application mix, efficiency is lost. Last, the Sharing Architecture requires the design of only one replicated architecture, while a heterogeneous CMP can have higher design effort.

Building datacenters out of a heterogeneous mix of server and mobile cores can improve welfare and energy efficiency [18]. Unfortunately, statically choosing a mix of big cores and small cores in the datacenter can lead to a poor processor match when the workload deviates from the expected, for instance when the workload favors a single type of core. Also, unlike the Sharing Architecture, this approach cannot adapt the core to a customer's utility function dynamically.

Application interference is prevalent in datacenters due to contention over shared hardware resources [28]. Sharing last-level cache (LLC) and DRAM bandwidth degrades the responsiveness of workloads. Partitioning a shared LLC can mitigate the negative performance effects of co-scheduling [20, 26, 47, 57]. The Sharing Architecture builds upon this work by providing a flexible LLC along with the additive benefits of ALU configuration.

Acknowledgments

We thank Yaosheng Fu for developing the PriME cache simulator, Sharad Malik for suggesting the comparison with heterogeneous datacenters, and Princeton Research Computing for providing computation support. This work is supported by NSF Grant No. CCF-1217553.

References

[1] Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/.

[2] Google Apps. http://www.google.com/apps/business/index.html.

[3] Amazon Simple Storage Service. http://aws.amazon.com/s3/.

[4] Windows Azure Platform, 2009. http://www.microsoft.com/azure/.

[5] A. Ali-Eldin, J. Tordsson, and E. Elmroth. An adaptive hybrid elasticity controller for cloud infrastructures. In Network Operations and Management Symposium (NOMS), 2012 IEEE, pages 204–212, 2012.

[6] J. Ansel, M. Pacula, Y. L. Wong, C. Chan, M. Olszewski, U.-M. O'Reilly, and S. Amarasinghe. SiblingRivalry: online autotuning through local competitions. In Proceedings of the 2012 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES '12, pages 91–100, 2012.

[7] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, and M. Zaharia. Above the Clouds: A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, Feb. 2009.

[8] R. Balasubramonian, S. Dwarkadas, and D. Albonesi. Dynamically managing the communication-parallelism trade-off in future clustered processors. In Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, pages 275–286, June 2003.

[9] J. Barr. Amazon Web Services Blog, January 28, 2011. http://aws.typepad.com/aws/2011/01/amazon-s3-bigger-and-busier-than-ever.html.

[10] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The GEM5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Aug. 2011. ISSN 0163-5964.

[11] D. Burger, S. Keckler, K. McKinley, M. Dahlin, L. John, C. Lin, C. Moore, J. Burrill, R. McDonald, and W. Yoder. Scaling to the end of silicon with EDGE architectures. Computer, 37(7):44–55, July 2004.

[12] R. Buyya, C. S. Yeo, and S. Venugopal. Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing utilities. In Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications, 2008.

[13] J. Cong, A. Jagannathan, G. Reinman, and Y. Tamir. Understanding the energy efficiency of SMT and CMP with multiclustering. In Low Power Electronics and Design, 2005. ISLPED '05. Proceedings of the 2005 International Symposium on, pages 48–53, 2005.

[14] C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 77–88, New York, NY, USA, 2013.

[15] Eucalyptus. http://www.eucalyptus.com/.

[16] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.

[17] P. Gratz, C. Kim, R. McDonald, S. Keckler, and D. Burger. Implementation and evaluation of on-chip network architectures. In Computer Design, 2006. ICCD 2006. International Conference on, pages 477–484, Oct. 2006.

[18] M. Guevara, B. Lubin, and B. C. Lee. Navigating heterogeneous processors with market mechanisms. In High Performance Computer Architecture (HPCA 2013), 2013 IEEE 19th International Symposium on, pages 95–106, 2013.

[19] A. Gulati, G. Shanmuganathan, A. Holler, and I. Ahmad. Cloud-scale resource management: challenges and techniques. In Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing, HotCloud '11, pages 3–3, Berkeley, CA, USA, 2011.

[20] F. Guo, Y. Solihin, L. Zhao, and R. Iyer. A framework for providing quality of service in chip multi-processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 343–355, Washington, DC, USA, 2007.

[21] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, pages 22–22, 2011.

[22] H. Hoffmann, J. Eastep, M. D. Santambrogio, J. E. Miller, and A. Agarwal. Application heartbeats for software performance and health. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 347–348, 2010.

[23] H. Hoffmann, M. Maggio, M. D. Santambrogio, A. Leva, and A. Agarwal. A generalized software framework for accurate and efficient management of performance goals. In EMSOFT, 2013.

[24] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core Fusion: accommodating software diversity in chip multiprocessors. In Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 186–197, 2007.

[25] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 261–276, 2009.

[26] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '07, pages 25–36, 2007.

[27] V. Janapa Reddi, B. C. Lee, T. Chilimbi, and K. Vaid. Web search using mobile cores: quantifying and mitigating the price of efficiency. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 314–325, 2010.

[28] M. Kambadur, T. Moseley, R. Hank, and M. A. Kim. Measuring interference between live datacenter applications. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 51:1–51:12, 2012.

[29] K. Khubaib, M. Suleman, M. Hashemi, C. Wilkerson, and Y. Patt. MorphCore: An energy-efficient microarchitecture for high performance ILP and high throughput TLP. In Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pages 305–316, 2012.

[30] C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler. Composable lightweight processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 381–394, 2007. ISBN 0-7695-3047-8.

[31] O. Krieger, P. McGachey, and A. Kanevsky. Enabling a marketplace of clouds: VMware's vCloud Director. SIGOPS Oper. Syst. Rev., 44(4):103–114, Dec. 2010. ISSN 0163-5980.

[32] R. Kumar, N. P. Jouppi, and D. M. Tullsen. Conjoined-core chip multiprocessing. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37, pages 195–206, 2004. ISBN 0-7695-2126-6.

[33] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In Computer Architecture, Proceedings. International Symposium on, pages 64–75, June 2004.

[34] R. Kumar, D. Tullsen, N. Jouppi, and P. Ranganathan. Heterogeneous Chip Multiprocessors. Computer, 38(11):32–38, Nov. 2005. ISSN 0018-9162.

[35] R. M. Lerner. At the forge: Amazon web services. Linux J., 2006(143):12, 2006. ISSN 1075-3583.

[36] Y. Li, D. Brooks, Z. Hu, K. Skadron, and P. Bose. Understanding the energy efficiency of simultaneous multithreading. In Low Power Electronics and Design, 2004. ISLPED '04. Proceedings of the 2004 International Symposium on, pages 44–49, 2004.

[37] J. L. Lo, S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Trans. Comput. Syst., 15(3):322–354, Aug. 1997.

[38] D. MacAskill. SkyNet Lives! (aka EC2 at SmugMug). http://don.blogs.smugmug.com/2008/06/03/skynet-lives-aka-ec2-smugmug/.

[39] M. R. Marty and M. D. Hill. Virtual hierarchies to support server consolidation. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 46–56, 2007. ISBN 978-1-59593-706-3.

[40] S. McFarling. Combining branch predictors. Technical Report TN-36, WRL Technical Note, 1993.


[41] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch. Power management of online data-intensive services. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 319–330, 2011.

[42] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A tool to model large caches. Technical Report HPL-2009-85, HP Laboratories, 2009.

[43] OpenStack. OpenStack Cloud Software. http://www.openstack.org/.

[44] P. Padala, K.-Y. Hou, K. G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, and A. Merchant. Automated control of multiple virtualized resources. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys '09, pages 13–26, 2009.

[45] S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Proceedings of the International Symposium on Computer Architecture, pages 206–218, June 1997.

[46] M. R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel. Amazon S3 for science grids: a viable solution? In DADC '08: Proceedings of the 2008 International Workshop on Data-aware Distributed Computing, pages 55–64, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-154-5.

[47] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 423–432, 2006.

[48] A. Ranadive, A. Gavrilovska, and K. Schwan. ResourceExchange: Latency-aware scheduling in virtualized environments with high performance fabrics. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 45–53, 2011.

[49] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. Keckler, and C. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proceedings of the International Symposium on Computer Architecture, pages 422–433, 2003.

[50] K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Shariff, P. Shivakumar, S. Keckler, and D. Burger. Distributed microarchitectural protocols in the TRIPS prototype processor. In Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on, pages 480–491, Dec. 2006.

[51] M. Santambrogio, H. Hoffmann, J. Eastep, and A. Agarwal. Enabling technologies for self-aware adaptive systems. In Adaptive Hardware and Systems (AHS), 2010 NASA/ESA Conference on, pages 149–156, 2010.

[52] B. Santithorn. Fully Distributed Register Files for Heterogeneous Clustered Microarchitectures. PhD thesis, Georgia Institute of Technology, 2005.

[53] S. Sethumadhavan, R. McDonald, D. Burger, S. Keckler, and R. Desikan. Design and implementation of the TRIPS primary memory system. In Computer Design, 2006. ICCD 2006. International Conference on, pages 470–476, Oct. 2006.

[54] S. Sethumadhavan, F. Roesner, J. S. Emer, D. Burger, and S. W. Keckler. Late-binding: Enabling unordered load-store queues. In Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 347–357, 2007.

[55] Z. Shen, S. Subbiah, X. Gu, and J. Wilkes. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC '11, pages 5:1–5:14, 2011.

[56] Y. Song, H. Wang, Y. Li, B. Feng, and Y. Sun. Multi-tiered on-demand resource scheduling for VM-based data center. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, CCGRID '09, pages 148–155, Washington, DC, USA, 2009.

[57] G. Suh, S. Devadas, and L. Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In High-Performance Computer Architecture, 2002. Proceedings. Eighth International Symposium on, pages 117–128, 2002.

[58] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, pages 291–, 2003. ISBN 0-7695-2043-X.

[59] M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar operand networks: on-chip interconnect for ILP in partitioned architectures. In High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on, pages 341–353, Feb. 2003.

[60] D. Tullsen, S. Eggers, and H. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the International Symposium on Computer Architecture, ISCA '95, pages 392–403, 1995.

[61] VMware. VMware Virtual Appliance Marketplace: Virtual Applications for the Cloud. http://www.vmware.com/appliances/.

[62] T. von Eicken. Amazon Usage Estimates, RightScale Blog. http://blog.rightscale.com/2009/10/05/amazon-usage-estimates/.

[63] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. Computer, 30(9):86–93, Sept. 1997.

[64] Y. Watanabe, J. D. Davis, and D. A. Wood. WiDGET: Wisconsin decoupled grid execution tiles. SIGARCH Comput. Archit. News, 38(3):2–13, June 2010.

[65] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown III, and A. Agarwal. On-chip interconnection architecture of the Tile Processor. IEEE Micro, 27(5):15–31, Sept. 2007.