

Harnessing ISA Diversity: Design of a Heterogeneous-ISA Chip Multiprocessor

Ashish Venkat    Dean M. Tullsen
University of California, San Diego

{asvenkat,tullsen}@cs.ucsd.edu

Abstract

Heterogeneous multicore architectures have the potential for high performance and energy efficiency. These architectures may be composed of small power-efficient cores, large high-performance cores, and/or specialized cores that accelerate the performance of a particular class of computation. Architects have explored multiple dimensions of heterogeneity, both in terms of microarchitecture and specialization. While early work constrained the cores to share a single ISA, this work shows that allowing heterogeneous ISAs further extends the effectiveness of such architectures.

This work exploits the diversity offered by three modern ISAs: Thumb, x86-64, and Alpha. This architecture has the potential to outperform the best single-ISA heterogeneous architecture by as much as 21%, with 23% energy savings and a reduction of 32% in Energy Delay Product.

1. Introduction

Architects have proposed heterogeneous chip multiprocessors for both general-purpose computing and embedded applications. These architectures exploit heterogeneity in two fundamental dimensions. While some architectures make use of specialized hardware to accelerate the performance of certain workloads [1, 2, 3, 19], others employ a different set of microarchitectural parameters [4, 15, 16, 22, 23, 24] in order to create energy-efficient processors for mixed workloads. The latter constrain the cores to execute a single instruction set architecture (ISA), maximizing efficiency by allowing a thread to dynamically identify, and migrate to, the core to which it is most suited during a particular phase and under the current environmental constraints. This paper demonstrates that not only is that constraint unnecessary, but limiting an architecture to a single ISA restricts the potential heterogeneity, sacrificing performance and efficiency gains.

A critical step in the design of a heterogeneous-ISA architecture is choosing a diverse set of ISAs. While ISAs seem to converge over time (RISC ISAs adding complex operations, CISC ISAs translated to RISC µops internally), there remains sufficient diversity in existing modern ISAs to provide useful heterogeneity. We examine some key aspects that characterize ISA diversity. These include code density, decode and instruction complexity, register pressure, native floating-point arithmetic vs. emulation, and SIMD processing.

In this paper, we harness the diversity offered by three ISAs: ARM's Thumb [5], x86-64 [17], and Alpha [12]. By co-designing the hardware architectures and the ISAs to provide the best aggregate architecture, we arrive at a more effective and efficient design than one composed of homogeneous cores, or even heterogeneous cores that share a single ISA.

The design of a heterogeneous-ISA chip multiprocessor involves navigating a complex search space, made larger by the additional dimension of freedom. A major contribution of this work is such a design space exploration, geared toward finding an optimal heterogeneous-ISA CMP for general-purpose mixed workloads. Observing the results of the design space exploration, we provide architects with a set of tools to enable ISA-microarchitecture co-design and thereby better streamline their search processes.

To reap the full benefits of the heterogeneity, especially the heterogeneity available in the form of ISA diversity, it is important that an application is able to migrate freely between the cores. However, migration in a heterogeneous-ISA environment is a well-known difficult problem [14, 31, 37]. This is because the runtime state of a program is kept in ISA-specific form, and migration to a different ISA involves expensive program state transformation. DeVuyst, et al. [11] demonstrate that migration between ISAs can be achieved at acceptable cost on a CMP; however, that work does not explore the architectural advantages to multiple ISAs on a single CMP. This research employs several ideas from that work, but also several new optimizations to reduce the overhead of migration. In this paper, we present a detailed compilation methodology and an effective runtime strategy that works for a diverse set of ISAs. We observe that even a single application can gain up to 11.2% performance benefit by migrating between heterogeneous-ISA cores during different phases of its execution.

Finally, we evaluate the proposed heterogeneous-ISA CMP against both homogeneous and single-ISA heterogeneous CMPs, under varying power and area budgets. Consequently, we make the following major observations:

• Co-design of ISA and microarchitectural parameters is critical. In the optimal designs, cores employing different ISAs tend to naturally diverge, and to diverge in consistent directions.

• ISA heterogeneity is not only beneficial across applications, but also within individual applications across phases.

We find that heterogeneous-ISA CMPs can improve single-thread performance by an average of 20.8% and provide 15.8% more throughput on multi-programmed mixed workloads, as compared to a single-ISA heterogeneous CMP. Additionally, heterogeneous-ISA CMPs can help achieve an average reduction of 29.8% in Energy Delay Product.

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 evaluates the diversity offered by the ISAs chosen for this work. Section 4 lays out our design methodology. We present our compilation and runtime methodologies in Section 5. Section 6 describes our experimental methodology. Section 7 evaluates the proposed architecture, and draws lessons and guidelines from the design space exploration. Section 8 concludes.

2. Related Work

Prior research has shown that heterogeneous CMPs are capable of higher performance and energy efficiency than homogeneous processors. Single-ISA heterogeneous CMPs were introduced by Kumar, et al. [22, 24]. The motivation behind the single-ISA constraint is that it allows threads to migrate freely between cores dynamically as their behavior or operating conditions change. Recent commercial offerings that resemble this architecture include ARM's big.LITTLE processor [15] and NVidia's Kal-El processor [4].

Yet another class of heterogeneous CMPs makes use of specialized hardware to accelerate the performance of certain types of workloads. These include the integrated CPU-GPU architectures such as Intel's Sandy Bridge [1] and AMD's Fusion [2]. However, these architectures do not allow migration between core types at arbitrary places in the code. Current industry offerings of heterogeneous-ISA CMPs include MPSoCs in the embedded market [34], GPUs, and accelerators in the HPC market [3]. IBM's Cell microprocessor [19] is a heterogeneous-ISA CMP geared towards general-purpose computing, but similarly, a too-specialized SPE ISA and lack of a common address space make dynamic task migration infeasible.

Several studies [9, 18, 33] have evaluated the role of the ISA in RISC and CISC processors. They show that these ISAs are rather similar in terms of their impact on performance and energy efficiency. However, these studies focus on less diverse ISAs (e.g., PowerPC, ARM, and x86) and homogeneous hardware assumptions. This work differs in two critical ways: it examines ISAs with true diversity, and it couples heterogeneous ISAs with heterogeneous hardware. We show that there is significant synergy in combining the two, which enables the overall architecture to fully exploit the differences between the ISAs.

DeVuyst, et al. [11] first established the viability of heterogeneous-ISA CMPs by showing that migration cost could be reduced by orders of magnitude over prior approaches, exploiting the shared memory space of CMPs and eliminating the need for memory transfer. That work focuses on the migration mechanism, and does not address any of the architectural implications covered in this work. They examine migration between ARM and MIPS cores, and present compiler techniques to minimize the amount of program state kept in ISA-specific form to enable fast migration. They describe a runtime mechanism that performs binary translation until an equivalence point is reached, after which the program state can be transformed to the required ISA. The Tui system [31] describes a process migration strategy for heterogeneous-ISA machines in the context of wide-area computing. The main idea of that work is to transform the runtime program state to an intermediate form and then re-compile it to the required ISA, at the time of migration. We borrow some techniques from both these works; however, our compiler methodology and runtime strategy are geared towards a more diverse set of ISAs, which makes the problem significantly harder and requires additional techniques and optimizations presented in this paper.

Several researchers have proposed design-space exploration methodologies for heterogeneous CMPs. Strozek, et al. [32] describe a process flow for automatic synthesis and evaluation of heterogeneous CMPs based on runtime profiles of certain embedded applications, given different area and power budgets. Intel's QuickIA [10] research prototype allows researchers to explore heterogeneous architectures consisting of multiple generations of Intel processors and FPGAs. The search methodology we employ in this paper is similar to the one described by Kumar, et al. [23]. However, our goal is to not only identify the optimal heterogeneous-ISA multicore designs, but also to lay out the first principles for ISA-microarchitecture co-design in such an architecture.

There has been some early work on operating systems and runtime support for heterogeneous-ISA CMPs. Li, et al. [28] suggest multiple policies and mechanisms, at the operating system level, for symmetric multiprocessing in an overlapping-ISA CMP. For the scope of this work, we assume similar operating system support.

3. ISA Diversity

To keep both the design-space exploration and the compiler development tractable, we select our target ISAs a priori – considering more ISAs and even considering the possibility of custom ISAs would only increase the potential gains. This section describes our three target ISAs – Thumb, Alpha, and x86-64 – with respect to several axes of diversity. These include code density, dynamic instruction count, register pressure, and support for specialized operations.

Code Density. High code density reduces the number of instruction cache misses, uses less energy and memory bandwidth for instruction fetch, and conserves power by enabling the use of smaller microarchitectural structures. Weaver, et al. [38] evaluate a wide range of ISAs for code density. They find that RISC ISAs with fixed-length instructions such as Alpha and SPARC show the lowest code density, while embedded ISAs like Thumb and AVR32 exhibit the highest density owing to a technique called code compression. This technique packs two 16-bit instructions into one 32-bit instruction, which is then unpacked at the decode stage and executed as two instructions. CISC ISAs such as x86-64 and VAX are placed in the middle of the code density spectrum by virtue of variable-length instruction encoding.
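The packing just described can be pictured with a short sketch (Python is used for all illustrative sketches below; the 16-bit values here are placeholder encodings, not real Thumb instructions):

def pack(insn_lo: int, insn_hi: int) -> int:
    # Pack two 16-bit instruction encodings into one 32-bit fetch word.
    assert 0 <= insn_lo < 1 << 16 and 0 <= insn_hi < 1 << 16
    return insn_lo | (insn_hi << 16)

def unpack(word: int) -> tuple[int, int]:
    # At decode, split the 32-bit word back into two 16-bit instructions.
    return word & 0xFFFF, (word >> 16) & 0xFFFF

word = pack(0x2001, 0x4408)              # placeholder encodings
assert unpack(word) == (0x2001, 0x4408)  # one 32-bit fetch supplies two instructions

The side effect on fetch bandwidth (two instructions per 32-bit fetch) is what later lets the design space exploration treat the minimum Thumb fetch bandwidth as two instructions (Section 7.2).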

Dynamic Instruction Count. While code compression achieves about 32.5% memory savings in Thumb, it increases the dynamic instruction count by 30% [21]. This is a direct consequence of using simpler 2-operand instructions to fit Thumb's 16-bit instruction format. Thumb instructions also lack the shift-modifier and predication support that ARM instructions enjoy. Alpha employs 3-operand instructions, but is a load-store architecture, meaning that no arithmetic instruction can directly operate on memory. While x86-64 also restricts instructions to the 2-operand format, it implements a number of complex addressing modes that allow instructions to directly operate on memory. x86-64 instructions are decoded into one or more simpler RISC-like µops, thereby increasing the number of dynamic instructions (µops) by about a factor of 1.3 [8] (as compared to the native x86-64 instruction count).

Register Pressure. Thumb uses a reduced register set, allowing only eight 32-bit programmable registers for integer operations. Thus, all 64-bit integer computation is performed using software emulation. Software emulation is discussed in greater detail in Section 5. Alpha, on the other hand, has two banks of thirty-two 64-bit programmable registers, for integer and floating-point computation. x86-64 offers sixteen 64-bit registers for integer operations and sixteen 128-bit registers for floating-point and SIMD operations.

The number of programmable registers is inversely proportional to the amount of register pressure, and thus the number of register spills, for any ISA. Therefore, Thumb suffers from extremely high register pressure, while Alpha enjoys low register pressure. Interestingly, x86-64 enjoys the lowest register pressure among the three ISAs, despite the fact that it has a smaller architectural register file than Alpha. This is a manifestation of the following addressing modes and optimizations: Absolute memory addressing allows instructions to directly access memory operands, eliminating the need to allocate registers for temporary storage of loaded values. Sub-register addressing allows programmers to address 48 sub-registers to store/operate on smaller data types, which can be further exploited by aggressive sub-register coalescing strategies to reduce the number of register spills. Program counter relative addressing enables position-independent code without the overhead (both in performance and allocated registers) of a Global Offset Table. Lastly, register-to-register spills allow programmers (compilers) to spill general-purpose registers to XMM registers, thereby minimizing the number of register spills into memory.

Figure 1 shows Thumb instructions and x86-64 µops normalized to Alpha instructions, for the SPEC2006 integer benchmarks, compiled using the LLVM/Clang framework [26, 25]. The average number of dynamic instructions on Thumb is 43.4% more than that of Alpha. This is due, in large part, to the high register pressure in Thumb. In fact, Thumb makes 91.1% more memory references than Alpha. Other factors include 64-bit emulation and the use of simpler 2-operand instructions.

Figure 1: Instruction mix (normalized to Alpha) for SPEC2006

x86-64 makes 8.3% fewer memory references than Alpha due to lower register pressure. However, we observe that there is a very small (0.8%) reduction in the number of stores, while the number of loads drops by 10.8%. The compiler generally seeks to spill variables that won't be updated for a long period of time. Therefore, the number of reads from the spill area is much higher than the number of writes.

Interestingly, Alpha makes 10.9% fewer memory references than x86-64 on the high ILP benchmarks bzip2 and hmmer, in which case the compiler does not find enough opportunity to utilize the complex addressing modes and optimizations offered by x86-64. The 2-operand restriction also contributes to x86-64 executing 24.7% more arithmetic instructions, resulting in an overall increase of 15.4% in the number of dynamic instructions.

Floating-point and SIMD Support. One consequence of code compression is that floating-point instructions are not supported in Thumb. Floating-point operations are emulated in software. While emulation results in slower execution, Thumb cores don't need to include floating-point instruction windows, register files, and functional units, resulting in up to a 19.5% reduction in peak power and 30% savings in area. In a heterogeneous-ISA architecture, any program or phase with significant floating-point activity will likely quickly switch to an ISA that executes floating point natively.

x86-64 also provides SIMD support through its SSE/AVX extensions, making vectorization of loops and basic blocks possible. Alpha's MVI extension allows for only pack, unpack, max, and min operations. Due to the very primitive nature of the MVI extension, we forgo SIMD units in Alpha cores.

To illustrate the benefits of heterogeneity, even on a single application, we examine the performance of bzip2 during two different phases of its execution (in Figure 2), under varying power constraints. We identify program phases using SimPoint [29]. The detailed methodology is described in Section 6. We see two key results in this graph. First, we see that the most effective ISAs differ between phases of the same application; e.g., at 15 W, Phase 1 prefers x86 and Phase 2 prefers Alpha. Second, we see that even in a single phase, the best ISA varies depending on the design constraints or the operating condition of the processor. For example, in Phase 2, we might prefer Alpha unless we are operating unplugged or perhaps in a low battery state, in which case we'd prefer Thumb because at low power budgets, Thumb provides the highest performance.

4. Design Space Exploration

The possible design space of a heterogeneous-ISA CMP is characterized by a diverse set of ISAs and a multitude of microarchitectural parameters. Navigating such a design space is a difficult problem. That difficulty can be reduced, and the search pruned, if we understand some of the principles that govern the effective co-design of heterogeneous-ISA, heterogeneous-hardware processors. To do this, we execute an exhaustive design space exploration of an architecture with fairly limited, tractable options, and observe the characteristics of the best designs.


Figure 2: Performance comparison under different peak power budgets for two different execution phases of bzip2

Design Parameter                  | Design Choices
ISA                               | Thumb, Alpha, x86-64
Execution Semantics               | In-order, Out-of-order
Issue Width                       | 1, 2, 4
Branch Predictor                  | local, tournament
Reorder Buffer Size               | 64, 128 entries
Architectural Register File       | ISA-specific
Physical Register File (Integer)  | 96, 160
Physical Register File (FP/SIMD)  | 64, 96
Integer ALUs                      | 1, 3, 6
Integer Multiply/Divide Units     | 1, 2
Floating-point ALUs               | 1, 2, 4
FP Multiply/Divide Units          | 1, 2
SIMD Units                        | 1, 2, 4
Load/Store Queue Sizes            | 16, 32 entries
Instruction Cache                 | 32KB 4-way, 64KB 4-way
Private Data Cache                | 32KB 4-way, 64KB 4-way
Shared Last Level (L2) Cache      | 4-banked 4MB 4-way, 4-banked 8MB 8-way

Table 1: Design space of a Heterogeneous-ISA architecture


The design space we explore in this work includes the three ISAs - Thumb, Alpha, and x86-64 - along with a set of microarchitectural parameters that represent a wide range of performance and power control points. The goal of the design-space exploration is to find the optimal 4-core heterogeneous-ISA CMP for varying power and area budgets, considering all permutations of applications in the workload sharing the cores.

Table 1 enumerates the variables in our design space. The Cartesian product of this design space consists of 750 thousand single-core combinations, making an exhaustive search practically infeasible. To reduce the size of the Cartesian product, we prune the design space by establishing correlations between different variables. While some correlations can be inferred by intuition, others are dictated by specific ISA characteristics: (1) The size of the reorder buffer can be correlated with the physical register file size, as together they establish the window size. (2) The number of functional units varies with the issue width. (3) Thumb cores need not include a floating-point instruction window, retirement units, register files, or functional units. (4) Neither Alpha nor Thumb need include SIMD functional units.

Design Parameter              | Design Choices
ISA                           | Thumb, Alpha, x86-64
Execution Semantics           | In-order, Out-of-order
Branch Predictor              | local, tournament
Reorder Buffer-Register File  | 64-96-64, 128-160-96 entries
Issue Width-Functional Units  | 1-1-1-1-1-1, 1-3-2-2-2-2, 2-3-2-2-2-2, 4-3-2-2-2-2, 4-6-2-4-2-4
Load/Store Queue Sizes        | 16, 32 entries
Cache Hierarchy               | 32K/4-32K/4-4M/4, 32K/4-32K/4-8M/8, 64K/4-64K/4-4M/4, 64K/4-64K/4-8M/8

Table 2: Pruned design space for faster navigation

The resulting pruned design space, as shown in Table 2, contains 120 in-order cores and 480 out-of-order cores. However, the number of possible 4-core configurations in the pruned design space is still very high (129.6 billion configurations). To further make this problem tractable, we use the following result from prior research on single-ISA heterogeneous architectures: modeling cores with private LLCs instead of a shared n-banked LLC in an n-core configuration produces the same performance ordering across all n-core configurations [23], and the ordering is likewise preserved across different scheduling/migration strategies [35].

Therefore, we model cores using 1MB 4-way or 2MB 8-way private last-level caches, instead of a single shared 4MB or 8MB cache, respectively. Thus, the combined performance of a 4-core configuration with private LLCs can be computed as the sum of the performances of the individual cores. While the design space exploration still involves finding the optimal 4-core configuration out of 129.6 billion different configurations, we can now find the best design with 600 simulations of the single-core permutations, and a software search over the 130 billion sums.
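The structure of that software search can be sketched as follows. The per-core performance and peak power numbers, core names, and the tiny workload are hypothetical placeholders standing in for the 600 gem5/McPAT measurements, and the real search sums over roughly 130 billion candidate designs rather than the handful enumerated here:

from itertools import combinations_with_replacement, permutations

# Placeholder per-core data; in the paper these come from 600 single-core
# simulations with private last-level caches.
perf = {                                   # perf[core][application]
    "thumb_ooo_small": {"bzip2": 0.7, "mcf": 0.5},
    "alpha_ooo_big":   {"bzip2": 1.3, "mcf": 0.9},
    "x86_ooo_mid":     {"bzip2": 1.1, "mcf": 1.0},
}
peak_power = {"thumb_ooo_small": 4.0, "alpha_ooo_big": 18.0, "x86_ooo_mid": 12.0}

def throughput(design, workload):
    # With private LLCs, a design's throughput is additive over its cores;
    # take the best assignment of the workload's applications to the cores.
    return max(sum(perf[core][app] for core, app in zip(design, assignment))
               for assignment in permutations(workload))

def best_design(cores, workload, power_budget, n):
    best = None
    for design in combinations_with_replacement(cores, n):   # cores may repeat
        if sum(peak_power[c] for c in design) > power_budget:
            continue                                          # violates the budget
        score = throughput(design, workload)
        if best is None or score > best[0]:
            best = (score, design)
    return best

# Toy 2-core search under a 25 W budget.
print(best_design(sorted(perf), ["bzip2", "mcf"], power_budget=25.0, n=2))

Because repeated cores are allowed, homogeneous and single-ISA heterogeneous designs fall out of the same search as special cases.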

5. Compiler and Runtime Environment

The programming environment for a heterogeneous CMP is dictated by one of the first design choices: separate address space [3, 4] vs. unified address space [15, 22]. We contend that the full benefits of a heterogeneous multicore architecture can be reaped only through dynamic core selection, which requires process migration. Separate address space constraints impose a significant cost on process migration in terms of program state transfer. Therefore, we choose the unified address space model in our design. However, this presents a unique challenge to compilation and process migration, because the memory layout and runtime state of a program is always architecture-specific.

To attack the process migration problem, we borrow a number of compiler and runtime techniques from prior work by DeVuyst, et al. In particular, our compiler generates a fat binary with multiple target-specific code sections and common target-independent data sections. Furthermore, our runtime environment employs dynamic binary translation until we reach an equivalence point, at which program state can be successfully transformed. However, the ISAs we have chosen for this work are significantly more diverse and therefore present additional challenges that are not just limited to compiler and runtime support.

First, we deal with both 32-bit and 64-bit ISAs that each expose a different organization of the virtual address space and page table hierarchies. This necessitates the use of a common address translation scheme and long mode emulation on the 32-bit Thumb ISA. Second, the complexity of creating common target-independent data sections increases with both the number and diversity of ISAs. We take a cleaner approach here by using a common intermediate representation and encapsulating the lowest-common-denominator size and alignment rules in intermediate-level types. Third, to reap the full benefits of ISA heterogeneity, it is critical that we don't turn off any target-specific compiler optimization. Therefore, unlike prior work, we can no longer rely on a map of source-level variable names to transform every live CPU register or memory location at the time of migration. We instead take advantage of a powerful architecture-independent intermediate representation that can act as a bridge between the ISAs, and provide hints for faster and more frequent program state transformation. As a result, in this work we introduce a compilation infrastructure that is significantly more flexible (capable of working with more diverse ISAs) and provides higher performance, both in steady state (native mode execution) and at migration points.

In this section, we first present a memory management strategy to facilitate a unified address space, and then describe a compilation and runtime strategy to address additional challenges presented by heterogeneous-ISA process migration.

5.1. Memory Management

Address Translation. A common address translation mechanism is required to ensure a unified address space in a heterogeneous-ISA environment. From Table 3, Alpha emerges as the lowest common denominator due to its 8KB page size. While it is possible to use software virtualization for 8KB page management on Thumb and x86-64, it necessitates the use of multiple page table structures, one for each ISA. Furthermore, software virtualization cannot enable 64-bit virtual address translation on the 32-bit Thumb architecture. Therefore, we use a common page table structure and an MMU based on the 4-level page table walker of x86-64, for all three ISAs.

ISA                   | Thumb                               | Alpha                         | x86-64
Page Table Hierarchy  | 2-level                             | 3-level                       | 4-level
Page Size             | 4KB, 64KB                           | 8KB                           | 4KB, 2MB, 1GB
Page Table Size       | 16KB first-level, 1KB second-level  | 8KB                           | 4KB
TLB Update Mechanism  | Hardware page table walker          | Low-level firmware (PALcode)  | Hardware page table walker

Table 3: Memory Management on Thumb, Alpha and x86-64
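As a rough illustration of the shared translation scheme, the sketch below walks a 4-level page table in the style of x86-64. The 9-bit per-level indices and 4KB page size are the usual x86-64 parameters and are assumptions here; the text specifies only that the common MMU follows the x86-64 walker.

PAGE_SHIFT = 12      # 4KB pages (assumed)
LEVEL_BITS = 9       # 512 entries per table (assumed)
LEVELS = 4           # e.g. PML4 -> PDPT -> PD -> PT

def translate(page_tables, root, vaddr):
    # page_tables maps a table id to a 512-entry list; intermediate entries hold
    # the id of the next-level table, and leaf entries hold a physical frame number.
    table = root
    for level in reversed(range(LEVELS)):
        index = (vaddr >> (PAGE_SHIFT + level * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
        entry = page_tables[table][index]
        if entry is None:
            raise LookupError("page fault")
        table = entry                    # after the last level this is the frame number
    return (table << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))

# Toy mapping: virtual page 0 -> physical frame 7.
tables = {"L4": [None] * 512, "L3": [None] * 512, "L2": [None] * 512, "L1": [None] * 512}
tables["L4"][0], tables["L3"][0], tables["L2"][0], tables["L1"][0] = "L3", "L2", "L1", 7
assert translate(tables, "L4", 0x0ABC) == (7 << 12) | 0xABC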

Long mode emulation on Thumb. Long mode (64-bit) computation in Thumb is performed using software emulation. Most compilers already support this to perform arithmetic and memory operations on the "long long" data type. The general procedure is to use multiple 32-bit registers to construct 64-bit values and compute on them.

To support memory operations using 64-bit pointers (virtual addresses), we extend the Thumb ISA to include special instructions: LD64 and ST64. These instruct the MMU to look for the higher-order 32 bits of its 64-bit virtual address input in a special register R8. However, the memory footprint of most general-purpose workloads seldom exceeds 4GB. In fact, we observe that no SPEC CPU2006 benchmark acquires more than 4GB of memory. Therefore, wherever possible, we use the regular load/store instructions with 32-bit pointers, which are zero-extended by the MMU during address translation.
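A minimal sketch of the two emulation pieces just described follows; the register-pair convention, the R8 usage, and the constants are illustrative, not the actual code the compiler emits.

MASK32 = 0xFFFFFFFF

def add64(lo_a, hi_a, lo_b, hi_b):
    # 64-bit addition on two 32-bit register pairs: add the low halves,
    # then propagate the carry into the high halves.
    lo = (lo_a + lo_b) & MASK32
    carry = (lo_a + lo_b) >> 32
    hi = (hi_a + hi_b + carry) & MASK32
    return lo, hi

def effective_address(low32, r8=0):
    # LD64/ST64: the MMU takes the upper pointer half from R8; ordinary
    # loads/stores simply zero-extend their 32-bit pointer.
    return (r8 << 32) | (low32 & MASK32)

assert add64(0xFFFFFFFF, 0x0, 0x1, 0x0) == (0x0, 0x1)    # carry into the high word
assert effective_address(0x1000) == 0x1000               # zero-extended 32-bit pointer
assert effective_address(0x1000, r8=0x2) == 0x200001000  # full 64-bit address via R8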

5.2. Compilation Strategy

Common Intermediate Representation. In this work, we leverage the LLVM compiler framework [26] and the Clang front-end [25] to generate a common intermediate representation (LLVM bitcode), and perform target-specific backend compilation thereafter. To keep the front-end compilation ISA-agnostic, we make use of the target-triple functionality of Clang to specify the data types of a generic target, for all ISAs.

Target-Independent Type Legalization. To minimize the amount of program state to be transformed, we enforce common rules for promotion, truncation, expansion, and type conversion. We allow certain exceptions during type legalization that interfere with ISA diversity – e.g., vector widening/scalarizing on x86-64, and long mode/floating-point emulation on Thumb. For the most part, target-independent type legalization ensures a consistent view of global data and bitcode-level variables across all ISAs, during every stage of compilation. This is critical for generating a single version of target-independent global data sections.

Intermediate Name Propagation. Once the intermediate representation has been generated, we provide each bitcode-level variable with a unique name. During the subsequent code generation and optimization passes, we ensure that each target-level machine operand (both registers and fixed stack slots) is associated with its corresponding intermediate name, if any. This gives us the ability to distinguish between bitcode-level and target-level variables, which plays a key role at the time of transform generation.

Stack Frame Organization. To avoid handling pointer inconsistencies at the time of program state transformation, we enforce a common stack frame organization (see Figure 3). In doing so, we do not add any additional instructions, since we at most change the relative position of a stack object from the stack/frame pointer. To ensure stack consistency across ISAs, we use similar techniques as described by DeVuyst, et al. [11].

Generation of Runtime Transforms. Towards the end of the multi-ISA compilation, we generate a set of transforms that can be applied by the runtime environment at the time of migration. Each transform is a routine that reconstructs the target-specific program state of a basic block (both live registers and stack objects), in the required ISA form. Accordingly, a live register or a stack object can be transformed if one of the following conditions holds:

• Its value is known at compile time - e.g., constant literals.

• Its value can be found at a specific location - e.g., globals, immutable objects (aggregates, alloca variables, and variables whose addresses have been taken).

• Its intermediate name refers to a live register or stack object on the ISA from which we migrated.

• Its value can be computed using the already transformed live registers and stack slots. This involves a reverse traversal of the def-use chain to find a sequence of instructions that can re-compute the required value (illustrated in the sketch below).

We generate transforms at compile-time rather than run-time because performing reverse data-flow analysis to determine the value of every live register and stack object imposes a significant cost on migration. Prior research [11] suggests that we can achieve fast stack transformation by just matching the live register and stack contents between the ISAs involved, and copying in their values accordingly. We found that such a strategy results in very few transformable basic blocks if the ISAs exhibit significant diversity. Specifically, we observed that the transforms derived using the reverse data-flow analysis reduced the run-time distance to the next transformable basic block by an order of magnitude.

Theoretically, it is possible to transform any live register and stack object in the program, using such a reverse data-flow analysis. However, due to the limited amount of information available at compile-time, we cannot handle program constructs such as uncountable loops, indirect branches, and conditional branches. Therefore, we restrict this analysis to control flow graphs that do not contain φ-nodes (merge points). Note that φ-nodes could still be transformable using methods described above, other than reverse data-flow analysis.
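The last condition above - reverse traversal of the def-use chain - can be pictured with the following toy sketch. The instruction records, value names, and the flat defs map are stand-ins for the LLVM machine IR that the real transform generator works on:

def recompute_chain(value, defs, available):
    # Return a list of defining instructions that recomputes `value` from values
    # already available/transformed on the target ISA, or None if impossible.
    if value in available:
        return []                       # nothing to do, the value is already there
    insn = defs.get(value)              # the instruction that defines `value`
    if insn is None:
        return None                     # no known definition: not transformable
    chain = []
    for operand in insn["operands"]:    # walk the def-use chain backwards
        sub = recompute_chain(operand, defs, available)
        if sub is None:
            return None
        chain.extend(sub)
    return chain + [insn]

# e.g. %t2 = add %t1, 8 is transformable when %t1 is live on the source ISA.
defs = {"%t2": {"op": "add", "operands": ["%t1"], "imm": 8}}
assert recompute_chain("%t2", defs, available={"%t1"}) == [defs["%t2"]]
assert recompute_chain("%t2", defs, available=set()) is None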

Figure 3: Common Stack Frame Layout across all ISAs

Runtime Environment. Our runtime environment has two major components: a dynamic binary translation (DBT) engine and a program state transformer. The DBT engine itself consists of six individual translators, based on the tiny code generators of QEMU [6]. At the time of migration, dynamic binary translation is performed until an equivalence point is reached, after which it is safe to use the compiler-generated transforms to convert program state from one ISA to another. Each compiler-generated transform is recursive in nature. After transforming the program state of the current basic block, it walks up the stack using the return address and invokes the caller's transform. On their return path, the transforms fix up the callee-saved register spill area, which ultimately culminates in the construction of the target CPU register state.
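Schematically, the recursive transform step looks like the following; frames and per-basic-block transforms are plain Python dictionaries and functions standing in for the runtime's real data structures, so this is a shape-of-the-algorithm sketch rather than the implementation:

def transform_stack(frames, transforms, index=0):
    # Transform frame `index`, recurse into its caller (as the real transforms do by
    # following the return address), then fix the callee-saved spill area on the way
    # back down; the innermost frame is frames[0], main is frames[-1].
    if index == len(frames):
        return
    frame = frames[index]
    transforms[frame["basic_block"]](frame)          # rewrite live registers / stack objects
    transform_stack(frames, transforms, index + 1)   # invoke the caller's transform
    frame["callee_saved_fixed"] = True               # fix-up performed on the return path

# Toy example: two frames with trivial per-basic-block transforms.
frames = [
    {"basic_block": "bb_leaf", "regs": {"src_r0": 1}, "callee_saved_fixed": False},
    {"basic_block": "bb_main", "regs": {"src_r0": 2}, "callee_saved_fixed": False},
]
transforms = {
    "bb_leaf": lambda f: f["regs"].update(dst_r0=f["regs"]["src_r0"]),
    "bb_main": lambda f: f["regs"].update(dst_r0=f["regs"]["src_r0"]),
}
transform_stack(frames, transforms)
assert all(f["callee_saved_fixed"] for f in frames)
assert frames[0]["regs"]["dst_r0"] == 1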

6. Experimental Methodology

In this section, we describe our experimental methodology for the two major contributions of this work: (1) design space exploration for a heterogeneous-ISA CMP, and (2) compiler and runtime strategy for a diverse set of ISAs.

6.1. Design Space Exploration

Our four-core design space consists of 600 homogeneous processors, 1.5 billion single-ISA heterogeneous processors, and 128.3 billion heterogeneous-ISA chip multiprocessors, each of which can be assembled from the 600 distinct CPU cores.

Although prior work on ISA characterization [9, 18, 33] chooses to model cores based on actual commercial offerings for each ISA, we seek to remove any non-ISA biases and start each design with a clean slate. Thus, we assume the same basic pipeline design (number of stages and latency), based on the Alpha 21264 [20], across all ISAs (with the exception of instruction decode, which will be the primary stage(s) that depend on the ISA). Additionally, we assume a total-store-order (strictest of all) memory consistency model for all ISAs.

All cores are modeled using 32nm technology and the clock rate is fixed at 1.67GHz. Owing to ISA diversity and the multitude of micro-architectural parameters we consider in this work, the heterogeneous-ISA CMPs are distributed over significant peak power (8.32-80.69 W) and area (32.97-129.87 mm²) ranges. We use the gem5 [7] simulator to model CPU core performance, and McPAT [27] to model power and area.

Our design methodology selects the optimal multicore configuration over the entire set of workloads (all possible permutations), for different peak power and area budgets. The design space explorations are optimized for two types of workloads: (1) multi-programmed mixed workloads, and (2) single-threaded workloads. The former helps us evaluate the throughput of a conventional CMP, while the latter gives us insight into a "Dark Silicon" implementation, where it is expected that only one core (out of a heterogeneous cluster) will be powered up at once [13, 22, 36]. In the latter case, a thread will always be assigned its best core, but in the former case, it will depend on the threads with which it is co-scheduled. We will also examine both the case where threads are placed based on overall execution characteristics (assuming minimal migration), and the case where threads can migrate to other cores at phase changes. That is, in the first case we find the best assignment of applications to cores; in the second, we find the best assignment of phases to cores.

We use SimPoint [29] to identify program phases. Specifically, we obtain multiple simulation points for a program's execution on Alpha, with an interval size of 100 million dynamic instructions. We modify the atomic CPU (instruction emulation mode) of gem5 to emit the start and end basic blocks for each simulation point, and their cumulative frequency, which serve as the start and end markers for the corresponding program phase on the other two ISAs, namely Thumb and x86-64. Our workloads include a total of 72 different program phases across the 10 applications we benchmark.

6.2. Compiler Methodology

We use the SPEC CPU2006 integer and floating-point C benchmarks to evaluate the proposed architecture. We exclude h264ref and perlbench from this set because they use ISA-specific programming constructs (e.g., inline assembly), either directly or through library function calls. All benchmarks are compiled at the -O3 optimization level, using the multi-ISA compilation methodology described in Section 5. We perform all experiments that evaluate the performance of our compiler and runtime methodology on cores that are modeled after the Cortex A-15 for Thumb, the 21264 processor for Alpha, and the Core i7 for x86-64. We use the following metrics to evaluate our compilation methodology.

Steady State Performance. The primary goal of multi-ISA compilation is to enable seamless execution migration between different ISAs at minimal cost. Through consistent compilation and deterministic transform generation, we manage to be migration-safe in 45% of the basic blocks. However, since we enforce many consistency rules during the multi-ISA compilation, it is important to study its effect on steady-state performance. To evaluate the steady-state performance degradation, we simulate each program phase of a benchmark compiled for both single-ISA execution and multi-ISA execution.

Distance to Next Equivalence Point. This distance represents the number of dynamic instructions to be translated by the DBT engine, before we can transform the program state to enable native execution. We modify gem5's atomic CPU to find if a given instruction is an equivalence point, using compiler metadata. We then run each benchmark to completion and record the average distance to the next equivalence point.

Migration Cost. Migration cost consists of two major components: dynamic binary translation and program state transformation. On every ISA, we take 10 samples of the benchmark's dynamic execution state, each at a 100 million instruction interval, after fast-forwarding execution for the first one billion instructions [30]. We then simulate heterogeneous-ISA migration scenarios for each sample, by performing dynamic binary translation until an equivalence point is reached, and program state transformation at that point.

Overall Speedup Due to Migration. To compute the overall speedup due to migration, we employ a phase-based scheduling strategy, where migration happens only when phase transitions demand switching to a different core. To identify phase transitions, we rely on SimPoint metadata and profiling information from oracle experiments. This models a system where the compiler is directing migration (or at least migration preferences), or a runtime or hardware system that had been observing execution long enough to accurately detect phase behavior.

7. Results

This work seeks to identify the best heterogeneous designs for a given workload. This not only enables us to quantify the potential gains from ISA heterogeneity, but also to identify trends and insights from the actual designs that get tagged as optimal. Because the nature of the design exploration is relatively independent of the cost of migration, only results later in this section account for the specific costs of migration, and the extent to which they mitigate the potential gains.

7.1. Evaluation of the Heterogeneous-ISA Architecture

Processor designs today are as likely to be constrained by power dissipation as they are by area. Thus, in this section, we examine the top-performing designs under both area and power constraints. In addition to finding the best heterogeneous-ISA design, we also find the best homogeneous design (best single configuration for any ISA) and best single-ISA heterogeneous design (best heterogeneous design for which all cores have the same ISA). We consider designs optimized for both multi-programmed workload throughput and single-thread performance.

Multi-programmed workloads. Figure 4 compares three architectures: homogeneous, single-ISA heterogeneous, and heterogeneous-ISA CMPs, all optimized for multi-programmed workload performance under different peak power and area constraints. We make several important observations here. First, there are significant gains available from ISA heterogeneity, matching or exceeding the gains from hardware heterogeneity. Second, hardware heterogeneity alone is less effective under tight constraints (all cores have to be small) or liberal constraints (all cores free to be big), because both endpoints tend toward homogeneous designs. Heterogeneous-ISA designs, in contrast, are still effective in those regions, because we can still gain from ISA heterogeneity even when the hardware is homogeneous.

There are two reasons for this advantage. First, different code regions have a natural affinity for one ISA or another, irrespective of the hardware implementation. Second, the ISA options give the architect more opportunities to create area-effective or power-effective cores.

For instance, at a peak power budget of 20W, the single-ISA heterogeneous CMP manages to employ only 3 out-of-order cores with smaller 32KB L1 caches, while the heterogeneous-ISA CMP sports all out-of-order cores with 64KB L1 caches.


!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

(#%"

(#&"

$!" %!" &!" )*+,-,./0"

!"##$%"&

'#()&'*+#,&-./&

12-23/*/245" 6,*3+/7869" 6,*3+/7869":;,.<"-,3=>?2*5@"

!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

(#%"

(#&"

%'" &%" '!" )*+,-,./0"

!"##$%"&

'(#)&*++,-&

1/./234/*/356789:" 1/./234/*/356789:";<,.="-,42>?3*6@"

Figure 4: Multi-programmed Workload Performance comparison under different peak power and area budgets

!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

$!" %!" &!" )*+,-,./0"

!"#$

%&'()*+,-.+

.)%/+."0)#+123+

12-23/*/245" 6,*3+/7869" 1/./:23/*/2457869"

!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

%'" &%" '!" )*+,-,./0"

!"#$

%&'()*+,-.+

/#)%+0$$12+

12-23/*/245" 6,*3+/7869" 1/./:23/*/2457869"

Figure 5: Energy-Delay-Product comparison for multi-programmed workloads under different peak power and area budgets

This is possible because the area-efficient and power-efficient Thumb cores free up space that is put to good use by the other cores. We find that heterogeneous-ISA CMPs can provide 15.8% better throughput on multi-programmed workloads than the best single-ISA heterogeneous CMPs.

Furthermore, we can achieve a greater speedup if applications are allowed to migrate between the cores at phase boundaries. This is well-documented in the case of single-ISA heterogeneous CMPs [22, 24]. On a heterogeneous-ISA CMP, this effect is further enhanced due to ISA affinity. We observe an additional speedup of 11.2% due to migration alone on a heterogeneous-ISA CMP, in contrast to the 4.6% speedup due to migration on a single-ISA heterogeneous CMP. In all subsequent experiments in this paper, migrations are always enabled.

In order to evaluate energy efficiency, we instead optimize the design space exploration to find energy-efficient cores by identifying the processor configurations that minimize energy-delay product (EDP). Figure 5 compares the energy efficiency for the three architectures under different peak power and area budgets. Heterogeneous-ISA CMPs achieve an average energy savings of 21.5% and an average reduction of 27.8% in the EDP over single-ISA heterogeneous CMPs, with absolutely no loss in performance – that is, we gain performance and decrease energy simultaneously when we employ multi-ISA heterogeneity.

Thus we see that the energy efficiency gains of ISA heterogeneity actually exceed the potential performance gains (for the performance-optimized experiments). Maximizing heterogeneity in this way can be particularly effective in a power-constrained environment. In a homogeneous-ISA general-purpose processor, Thumb is not a serious candidate, because it performs so poorly for certain codes; however, as part of a heterogeneous solution, it shines for certain code regions.

Single-threaded workloads. To evaluate designs optimized for single-threaded workloads, under different peak power budgets, we assume the dynamic multicore topology described by Esmaeilzadeh, et al. [13], in which idle cores are turned off to reduce power consumption. Figure 6 shows performance and EDP measurements for the three architectures constructed when searching the design space for multicore architectures that provide optimal performance or energy efficiency over our benchmark set. We apply lower peak power constraints in this scenario, since we are optimizing for the single-powered-core execution scenario.

We observe that in a highly peak power constrained environment, the heterogeneous-ISA CMP still manages to achieve a speedup of 16.9% over a single-ISA heterogeneous CMP, and as the peak power budget becomes slightly higher (at 15 W), it provides a consistent speedup of 18.8%. However, the maximum speedup that a single-ISA heterogeneous CMP can provide over a homogeneous CMP in such a topology is just 1.5%. So again we see that ISA heterogeneity continues to provide gains in regions where hardware heterogeneity is less effective.



!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

(#%"

)" (!" ()" *+,-.-/01"

!"##$%"&

'#()&'*+#,&-./&

23.340+0356" 7-+4,0897:" 20/0;340+0356897:"

!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

)" (!" ()" *+,-.-/01"

!"#$

%&'()*+,-.+

.)%/+."0)#+123+

23.340+0356" 7-+4,0897:" 20/0;340+0356897:"

Figure 6: Single Thread Performance and EDP evaluation using the dynamic multicore topology

!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

(#%"

(#&"

(#'"

$"

%'" &%" '!" )*+,-,./0"

!"##$%"&

'(#)&*++,-&

12-23/*/245" 6,*3+/7869" 1/./:23/*/2457869"

!"

!#$"

!#%"

!#&"

!#'"

("

(#$"

%'" &%" '!" )*+,-,./0"

!"#$

%&'()*+,-.+

/#)%+0$$12+

12-23/*/245" 6,*3+/7869" 1/./:23/*/2457869"

Figure 7: Single Thread Performance and EDP evaluation under different area budgets

Figure 7 shows the performance and EDP evaluation on designs optimized for single-threaded workloads, under different area budgets. Such designs are typically composed of multiple small cores and one large core optimized to provide high single thread performance. When highly power constrained, single-ISA heterogeneous CMPs use three small Alpha in-order cores and one powerful out-of-order core. Combining again the dual benefits of ISA affinity and the area benefits of Thumb, the best heterogeneous-ISA CMPs provide more balanced cores (to better exploit ISA affinity) yet still enable the same large Alpha core as the single-ISA design. That configuration contains two small Thumb cores, the same out-of-order Alpha core, and a medium-end x86-64 core. We observe that a heterogeneous-ISA CMP can improve single-thread performance by 20.8% over a single-ISA heterogeneous CMP, or achieve 23% more energy savings and 31.8% reduction in EDP, again with no loss in performance.

7.2. Framework for ISA-Microarchitecture co-design

In this section, we present inferences from our design space exploration that can serve as a framework for future ISA-microarchitecture co-design in the context of a heterogeneous-ISA CMP. We consider the best designs from all experiments carried out in the previous section. Figure 8 shows the frequency of occurrence of different micro-architectural parameters in a heterogeneous-ISA design. We analyze the influence of ISA on each of these micro-architectural parameters.

Execution Semantics. We find that out-of-order execution semantics is favorable in general. However, due to conservative peak power budgets in some designs, we select in-order cores 10.8% of the time on Alpha, and 25% of the time on x86-64. Due to the enormous peak power and area benefits of Thumb, we always select out-of-order Thumb cores because they are so cheap. This impacts a number of our results, because it means even our smallest designs always have an out-of-order core available.

!"# $!"# %!"# &!"# '!"# (!!"#

)*+,-./-#

012+,3+,-./-#

(#

$#

%#

&%#

($'#

(&#

4$#

5,6#

789:#

5,;<=#

>,1-*<?/*2#

4$@A#

&%@A#

BC/;1D,*#

E/?<*D;F#

)FF1/#G

8.2:#

H0A#E8I/#

5EJ#E8I/#

K1*;D,*<=#

1*82F#

A-<*;:#

L-/.8;2,-#

5(#M<;:/#

E8I/#

!"#$%#&'()*+)*''%""#&'#)

,-'"*./"'0-1#"%/2)3/"/4#1#")

>:1?N# O=P:<# C'&+&%#

Figure 8: Inferences from the Design Space Exploration



Issue Width. In general, higher fetch width enables higher issue width. Since the instruction fetch units of Thumb and Alpha dissipate less power than x86-64, we seldom choose a small issue width for these ISAs. In fact, only 2.7% of our designs choose a single-issue Alpha core and none of our designs use a single-issue core for Thumb. Because code compression ensures our minimum fetch bandwidth is 2 instructions with Thumb, a scalar Thumb processor would always have fetch and issue out of balance.

ROB Size. We find that the number of ROB entries and register file size are highly influenced by the register pressure of an ISA. The low register pressure of x86-64 (see Section 3) results in small ROBs and physical register files being configured. Conversely, the high register pressure of Thumb has the opposite effect, while Alpha finds a middle ground between the two.

Load/Store Queue Size. Although we do not see a direct correlation between load/store queue sizes and ISA traits, ISAs more likely to be configured out-of-order and with wide issue, not surprisingly, also demand large LSQs.

Number of Functional Units. Because the x86-64 cores we model have more basic functional unit types (integer, floating-point, and SIMD), the cost of going from the low to the high configuration is higher, and that step is taken less often.

Branch Predictor. Interestingly, all ISAs almost always choose the tournament branch predictor. This implies that it is always worthwhile to invest transistors in the branch predictor. ISA traits such as predication support have little influence on the selection of branch predictor type.

L1 Cache Size. The size of the L1 cache largely depends on the working set of applications, rather than a specific trait of an ISA. In general, all ISAs favor larger L1 caches.

7.3. ISA Affinity

To determine the ISA affinity of each application, we simulate both single-threaded and multi-threaded workloads on two types of designs: (1) optimized for performance, and (2) optimized for EDP. In all scenarios, designs are constrained by a peak power budget of 40 W. Each scenario provides some interesting insights. The design optimized for single-threaded performance is the true indicator of ISA affinity, since only one application is in execution at a time, and each application is allowed to freely migrate between the cores. In the case of multi-programmed workloads, due to contention between applications, some applications may execute on ISAs of second preference. On designs optimized for energy efficiency, applications may choose to execute on ISAs that provide energy efficiency but don't maximize performance.

Figure 9: ISA affinity for different applications on designs optimized for (left to right): (a) single-thread performance, (b) multi-programmed workload performance, (c) single-threaded workload EDP, (d) multi-programmed workload EDP

Figure 9 shows that each application exhibits a different degree of ISA affinity, and most use all ISAs. In our experiments, benefits arise due to a combination of ISA factors, some synergistic, some interacting negatively – trying to separate those effects is difficult and not always intuitive. However, we are able to make a few high-level observations. (a) No floating-point benchmark prefers execution on Thumb, due to floating-point emulation. (b) The floating-point benchmark lbm prefers execution on Alpha instead of x86-64, because Alpha requires about 34% fewer dynamic floating-point instructions. (c) The high-ILP benchmarks bzip2, hmmer, and sjeng prefer execution on Alpha over x86-64, because Alpha offers lower register pressure during phases of high instruction-level parallelism (see Section 3). (d) bzip2 prefers execution on the 32-bit Thumb ISA during phases that involve 32-bit unsigned integer arithmetic; Alpha incurs 27% more dynamic instructions to emulate 32-bit arithmetic using 64-bit registers. In such phases, x86-64 emerges as the ISA of second preference due to sub-register addressing. (e) The benchmarks libquantum, milc, and sphinx3 take advantage of x86-64's SIMD functionality at different execution phases, and revert back to Alpha/Thumb during the scalar phases.
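For illustration, the C fragment below (a constructed example, not code taken from the paper's benchmarks) shows the kind of 32-bit unsigned, wrap-around arithmetic at issue in observation (d): the mod-2^32 semantics of uint32_t come for free on x86-64, because writing a 32-bit sub-register implicitly zeroes the upper half of the 64-bit register, whereas an ISA with only 64-bit registers typically needs extra truncation or zero-extension instructions to preserve them.

/* Illustrative only: a rolling 32-bit checksum with deliberate
 * unsigned wrap-around, loosely in the spirit of the 32-bit integer
 * phases observed in bzip2.  On x86-64, the truncation to 32 bits is
 * implicit in sub-register writes; a 64-bit-register-only ISA
 * typically spends extra instructions to keep the same semantics. */
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t checksum32(const uint32_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (sum << 5) + sum + buf[i];   /* sum = sum * 33 + buf[i], mod 2^32 */
    return sum;
}

int main(void)
{
    uint32_t data[4] = { 0xDEADBEEFu, 0x12345678u, 0xCAFEBABEu, 0x0F0F0F0Fu };
    printf("%08" PRIx32 "\n", checksum32(data, 4));
    return 0;
}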

Not immediately clear from the results so far is to what extent the gains are a result of broad differences in feature sets (e.g., SIMD vs. no SIMD support) as opposed to more subtle differences. Further experiments show that the former are a surprisingly small component. For example, if we consider x86-64 with vs. without SSE, that heterogeneity provides a gain of 1.3% over the best single-ISA configuration – significantly lower than the 15.8% speedup from a fully heterogeneous-ISA design.

Finally, we note that there is little deviation in ISA affinity due to contention amongst multi-programmed workloads, or due to optimization for EDP instead of performance.

7.4. Compiler and Runtime Evaluation

The prior results, primarily concerned with the discovery of the best core configurations, do not account for the cost of reduced compiler effectiveness (which would affect steady-state performance) or of migration itself. Each of these would degrade the performance or efficiency gains over a single-ISA design (where migration is much faster). This section evaluates those costs.

Figure 10: Number of dynamic instructions to be translated before program state can be transformed

Figure 11: Average migration costs (includes both binary translation and program state transformation)

Steady State Performance. This examines the cost of multi-ISA compilation on regular execution, compared to single-ISA compilation. Due to the long-mode emulation on Thumb, we observe a 4.6% average loss in performance; however, that does not actually impact our results – for code that incurs that emulation, we typically do not select the Thumb core for execution. x86-64 and Alpha show no performance degradation at all, up to four decimal places of the IPC. This near-zero overhead comes from the fact that we do not disable any optimization or ISA-specific behavior to enable multi-ISA compilation. This is in contrast to prior work [11], which sacrifices 2-3% of performance for multi-ISA compilation.

Fat Binary Overhead. The fat binary created by our compiler contains multiple code sections, but a core only loads its own code into its private I-cache. In fact, this can only impact a shared cache. If we change our design to have a shared 16 MB LLC, and incorporate the increased working set size, we measure zero performance loss. At a 40 W peak power budget, the best heterogeneous-ISA design achieves 47.5% savings in instruction fetch energy over the best single-ISA heterogeneous design, due to code density advantages.
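As a purely illustrative sketch (the layout, field names, and loader below are hypothetical, not our actual fat binary format), such a binary can be viewed as a header table with one text section per ISA; at load time a core maps only the section matching its own ISA, so private instruction caches never hold another ISA's code.

/* Hypothetical fat-binary layout with one text section per ISA.
 * All names and offsets are illustrative. */
#include <stdint.h>
#include <stdio.h>

enum isa { ISA_THUMB, ISA_ALPHA, ISA_X86_64, ISA_COUNT };

struct section_desc {
    enum isa isa;        /* ISA this text section targets          */
    uint64_t file_off;   /* offset of the section in the binary    */
    uint64_t size;       /* size of the section in bytes           */
    uint64_t entry;      /* ISA-specific entry point               */
};

struct fat_header {
    uint32_t magic;
    uint32_t nsections;
    struct section_desc sections[ISA_COUNT];
};

/* A core maps only the section that matches its own ISA. */
static const struct section_desc *pick_section(const struct fat_header *h,
                                               enum isa core_isa)
{
    for (uint32_t i = 0; i < h->nsections; i++)
        if (h->sections[i].isa == core_isa)
            return &h->sections[i];
    return NULL;
}

int main(void)
{
    struct fat_header h = {
        .magic = 0xFA7B1117u, .nsections = 3,
        .sections = {
            { ISA_THUMB,  0x1000,  0x08000, 0x1000  },
            { ISA_ALPHA,  0x9000,  0x10000, 0x9000  },
            { ISA_X86_64, 0x19000, 0x0C000, 0x19000 },
        },
    };
    const struct section_desc *s = pick_section(&h, ISA_ALPHA);
    if (s)
        printf("mapping %llu bytes at offset 0x%llx\n",
               (unsigned long long)s->size, (unsigned long long)s->file_off);
    return 0;
}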

Figure 12: Percentage makeup of binary translation cost in the overall migration overhead

Figure 13: Overall speedup due to migration, shown across peak power budgets (20 W, 40 W, 60 W, unlimited) for homogeneous, single-ISA heterogeneous, and heterogeneous-ISA designs, with and without migration cost

Distance to Next Equivalence Point. Figure 10 shows the average distance to the next equivalence point for all possible migration scenarios. Due to the high frequency of equivalence points for both Alpha and x86-64, fewer instructions (on the order of tens of thousands) need to be translated before an equivalence point is reached. On the other hand, migrating to Thumb can require binary translation of millions of instructions before an equivalence point is reached.
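The runtime flow behind these numbers can be sketched as follows (illustrative C with hypothetical helper names; the real runtime is considerably more involved): on a migration request, the thread is binary-translated on the target core only until it reaches the next equivalence point, at which point the program state is transformed once and native execution resumes.

/* Sketch of the migration flow described above.  All helpers are
 * hypothetical stand-ins for the binary translator and program
 * state transformer. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t pc; /* ... architectural state ... */ } context_t;

/* Stub: is the current pc a compiler-inserted equivalence point? */
static bool at_equivalence_point(const context_t *ctx) { return ctx->pc % 64 == 0; }

/* Stub: emulate one guest instruction on the target ISA. */
static void translate_one_instruction(context_t *ctx) { ctx->pc += 4; }

/* Stub: rewrite registers/stack into the target ISA's layout. */
static void transform_program_state(context_t *ctx) { (void)ctx; }

static uint64_t migrate(context_t *ctx)
{
    uint64_t translated = 0;
    /* Phase 1: binary translation until the next equivalence point
     * (tens of thousands of instructions for Alpha/x86-64, possibly
     * millions when migrating to Thumb). */
    while (!at_equivalence_point(ctx)) {
        translate_one_instruction(ctx);
        translated++;
    }
    /* Phase 2: one-time program state transformation; the thread then
     * resumes native execution on the new core. */
    transform_program_state(ctx);
    return translated;
}

int main(void)
{
    context_t ctx = { .pc = 0x400124 };
    printf("translated %llu instructions before switching\n",
           (unsigned long long)migrate(&ctx));
    return 0;
}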

Migration Cost. Figure 11 shows the total overhead per migration, assuming migration at random intervals in execution. We measure an average overhead of 4 milliseconds to switch between ISAs, but that is heavily influenced by a few outliers. In most cases, the average migration cost is insignificant if we assume we are not migrating more often than every couple hundred milliseconds. The migration overhead shows high correlation with the distance to reach an equivalence point. This implies that the migration overhead is typically dominated by binary translation. Figure 12 shows the percentage of time spent in binary translation during migration. For six out of ten benchmarks, this number is more than 90%. The rest have a very small migration cost (less than 100 microseconds), and therefore binary translation is not a significant contributor. We also note that our binary translator is not yet heavily optimized for performance.
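As a rough back-of-the-envelope check using the average figures above (illustrative arithmetic only), amortizing a 4 ms migration over a 200 ms migration interval bounds the overhead at about 2%; the common case is smaller still, since many migrations cost under 100 microseconds.

/* Amortized migration overhead: cost per migration divided by the
 * interval between migrations.  Numbers here are the averages quoted
 * in the text, used purely for illustration. */
#include <stdio.h>

int main(void)
{
    double cost_ms = 4.0;        /* average cost per migration        */
    double interval_ms = 200.0;  /* "a couple hundred milliseconds"   */
    printf("amortized overhead: %.1f%%\n", 100.0 * cost_ms / interval_ms);
    return 0;
}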

Overall Speedup due to Migration. In this experiment, we account for the cost of the actual migrations encountered in an earlier experiment. That is, we incur the cost of migration between two ISAs when a phase change causes a new core/ISA combination to be preferred. Figure 13 shows the result of this experiment. Here we see that we sacrifice negligible performance (about 0.4-0.7%) for migration, meaning that virtually all of the performance gain from heterogeneous ISAs is retained. This comes from two factors. First, in most cases, migration overhead is very low. Second, phase changes are relatively infrequent – infrequent enough that even our few cases of high migration overhead are not significant.

8. Conclusion

This research explores the design space of heterogeneous-ISA chip multiprocessors. It shows that adding an extra axis of heterogeneity by considering multiple ISAs significantly increases the performance and energy efficiency of a heterogeneous processor. Specifically, a heterogeneous design that allows cores with distinct ISAs outperforms the optimal heterogeneous single-ISA design by as much as 20.8% and improves energy efficiency over the most efficient single-ISA design by 23%.

Additionally, this paper builds on prior work in multi-ISA compilation by reducing compiled-code overhead to virtually zero, and by greatly decreasing the average distance to an equivalence point (which drives the average cost of binary translation, and thus migration).

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful insights. This research was supported in part by NSF Grants CCF-1219059 and CCF-1302682.

References

[1] 2nd Generation Intel Core vPro Processor Family. Technical report, Intel, 2008.
[2] The Future is Fusion: The Industry-Changing Impact of Accelerated Computing. Technical report, AMD, 2008.
[3] The Benefits of Multiple CPU Cores in Mobile Devices. Technical report, NVIDIA, 2010.
[4] Variable SMP – A Multi-Core CPU Architecture for Low Power and High Performance. Technical report, NVIDIA, 2011.
[5] ARM Limited. ARM7TDMI Technical Reference Manual.
[6] F. Bellard. QEMU, a Fast and Portable Dynamic Translator. In USENIX Technical Conference, Apr. 2005.
[7] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 2006.
[8] E. Blem, J. Menon, and K. Sankaralingam. A Detailed Analysis of Contemporary ARM and x86 Architectures. Technical report, University of Wisconsin-Madison, 2013.
[9] E. Blem, J. Menon, and K. Sankaralingam. Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures. In International Symposium on High Performance Computer Architecture, Feb. 2013.
[10] N. Chitlur, G. Srinivasa, S. Hahn, P. Gupta, D. Reddy, D. Koufaty, P. Brett, A. Prabhakaran, L. Zhao, N. Ijih, et al. QuickIA: Exploring Heterogeneous Architectures on Real Prototypes. In International Symposium on High Performance Computer Architecture, Feb. 2012.
[11] M. DeVuyst, A. Venkat, and D. M. Tullsen. Execution Migration in a Heterogeneous-ISA Chip Multiprocessor. In International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2012.
[12] Digital Equipment Corporation. Alpha Architecture Reference Manual.
[13] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of Multicore Scaling. In International Symposium on Computer Architecture, June 2011.
[14] A. Ferrari, S. J. Chapin, and A. Grimshaw. Heterogeneous Process State Capture and Recovery through Process Introspection. Cluster Computing, 2000.
[15] P. Greenhalgh. big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. Technical report, ARM, 2011.
[16] M. Hill and M. Marty. Amdahl's Law in the Multicore Era. Computer, July 2008.
[17] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual.
[18] C. Isen, L. K. John, and E. John. A Tale of Two Processors: Revisiting the RISC-CISC Debate. In Computer Performance Evaluation and Benchmarking, 2009.
[19] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell Multiprocessor. IBM Journal of Research and Development, July 2005.
[20] R. E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 1999.
[21] A. Krishnaswamy and R. Gupta. Efficient Use of Invisible Registers in Thumb Code. In International Symposium on Microarchitecture, Dec. 2005.
[22] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction. In International Symposium on Microarchitecture, Dec. 2003.
[23] R. Kumar, D. M. Tullsen, and N. P. Jouppi. Core Architecture Optimization for Heterogeneous Chip Multiprocessors. In International Conference on Parallel Architectures and Compilation Techniques, Sept. 2006.
[24] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA Heterogeneous Multi-core Architectures for Multithreaded Workload Performance. In International Symposium on Computer Architecture, June 2004.
[25] C. Lattner. LLVM and Clang: Next Generation Compiler Technology. In The BSD Conference, May 2008.
[26] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In International Symposium on Code Generation and Optimization, Mar. 2004.
[27] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In International Symposium on Microarchitecture, Dec. 2009.
[28] T. Li, P. Brett, R. Knauerhase, D. Koufaty, D. Reddy, and S. Hahn. Operating System Support for Overlapping-ISA Heterogeneous Multi-Core Architectures. In International Symposium on High Performance Computer Architecture, Jan. 2010.
[29] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for Accurate and Efficient Simulation. In ACM SIGMETRICS Performance Evaluation Review, June 2003.
[30] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically Characterizing Large Scale Program Behavior. In International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.
[31] P. Smith and N. C. Hutchinson. Heterogeneous Process Migration: The Tui System. Software-Practice and Experience, 1998.
[32] L. Strozek and D. Brooks. Energy- and Area-Efficient Architectures through Application Clustering and Architectural Heterogeneity. ACM Transactions on Architecture and Code Optimization, 2009.
[33] S. Terpe. Why Instruction Sets No Longer Matter. 2011.
[34] Texas Instruments Inc. OMAP5912 Multimedia Processor Device Overview and Architecture Reference Guide.
[35] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer. Scheduling Heterogeneous Multi-Cores through Performance Impact Estimation (PIE). In International Symposium on Computer Architecture, June 2012.
[36] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. Conservation Cores: Reducing the Energy of Mature Computations. In International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2010.
[37] D. G. Von Bank, C. M. Shub, and R. W. Sebesta. A Unified Model of Pointwise Equivalence of Procedural Computations. ACM Transactions on Programming Languages and Systems, 1994.
[38] V. M. Weaver and S. A. McKee. Code Density Concerns for New Architectures. In International Conference on Computer Design, Oct. 2009.