exploring heterogeneous-isa core architectures for high...

Exploring Heterogeneous-ISA Core Architectures forHigh-Performance and Energy-Efficient Mobile SoCs

Wooseok Lee1, Dam Sunwoo2, Christopher D. Emmons2,Andreas Gerstlauer1, and Lizy K. John1

1The University of Texas at Austin, 2ARM Research

ABSTRACTIn this paper, we explore opportunities to increase energyefficiency by providing cores with restricted functionality,but without necessarily impacting performance. We aim toachieve this by removing support for complex but less fre-quently executed instructions since instruction mixes usedby real-world workloads are often heavily biased. We inves-tigate which instructions are worthwhile to remove by ana-lyzing a subset of instructions in the ARM ISA and their cor-responding logic burden in the microarchitecture. We pro-pose a heterogeneous-ISA system to achieve energy efficiencywithout performance degradation using a system architec-ture that combines both full- and reduced-ISA cores. Resultsshow that by providing the flexibility of heterogeneous-ISAcores, the proposed system can improve energy efficiency by12% on average and up to 15% for applications that do notrequire NEON support, all without performance overhead.

1. INTRODUCTIONCurrent mobile systems-on-chips (SoCs) take advantage

of heterogeneity at the system level by switching betweenhigh-performance and energy-efficient cores [1]. These het-erogeneous systems match application demands with coretypes to maximize energy efficiency. In such systems, theheterogeneity lies in the microarchitecture. In bigger cores,more transistors are spent on components that improve per-formance, such as branch predictors and out-of-order pro-cessing capabilities. In smaller cores, reducing the amountof performance-relevant resources can, however, be detri-mental to some workloads.

An alternate approach is to implement energy-efficientcores by restricting functionality instead of giving up per-formance. Specifically, by reducing resources that have lessimpact on performance, power dissipation can be alleviatedwithout losing significant performance. The benefit of imple-menting certain features in the Instruction Set Architecture(ISA) is highly dependent on workloads. Not all instruc-tions are frequently used by every workload. If a particularworkload favors specific instructions that are not directlysupported by the hardware, performance dramatically de-creases. By contrast, there is little performance degradationif those instructions are not frequently used.

In this paper, we explore opportunities for systems com-prised of heterogeneous reduced-ISA cores to improve en-ergy efficiency. We identify candidate instructions that arecomplex to implement in state-of-the-art ARM-based sys-

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full cita-tion on the first page. Copyrights for components of this work owned by others thanACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-publish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected] ’17, May 10-12, 2017, Banff, AB, Canadac© 2017 ACM. ISBN 978-1-4503-4972-7/17/05. . . $15.00

DOI: http://dx.doi.org/10.1145/3060403.3060408

tems. We further evaluate performance degradations if can-didate instructions are removed. If performance degrada-tion is large, it is not worthwhile to consider removing themdespite the potential power benefit. We evaluate systemperformance with a reduced instruction set running severalbenchmarks. Results show that some subsets of instructionsare critical to performance while others are not essentialsince in most cases, their usage frequency is low enough tonot impact performance significantly.

We propose a heterogeneous-ISA architecture to obtainenergy benefits while maintaining performance across a widerange of workloads. In our proposed system architecture,reduced-ISA cores remove hardware support for complexinstructions, which increases energy efficiency but requirestrapping and emulating of non-supported instructions. Whenunsupported instructions are infrequent, a workload runs onthe reduced-ISA core to reduce energy. By contrast, whenunsupported instructions are prevalent, the workload is mi-grated to a traditional full-ISA core. This dynamic coreswitching allows the system to avoid the performance degra-dation and energy inefficiency of software-emulated instruc-tions. As long as they are not performance-critical, a com-piler can thereby optimize binaries to remove unsupportedinstructions and thus maximize residency on the reduced-ISA core. Our results show that workloads without per-formance critical instructions spend most of their executiontime on reduced-ISA cores, achieving energy savings of up to15%. Workloads with frequent use of unsupported instruc-tions execute exclusively on full-ISA cores with no changein performance or energy consumption. On average, 12%energy savings at little to no performance cost are observedacross a variety of benchmarks, where applications migratebetween reduced- and full-ISA cores depending on dynami-cally varying instruction usage.

2. REDUCED-ISA CORE DESIGNWe first identify the instructions that greatly impact logic

complexity for reduced-ISA core design and evaluate the per-formance impact after removing them. As a case study, weselect the ARM V7 ISA, and one of the performance-orientedARM processors, Cortex-A15, as a baseline. We find thatthe overall number of instructions in an ISA is not as cru-cial for logic reduction as would be expected. Our studyinstead focuses on the specific semantics required by par-ticular instructions that contribute to large logic within themicroarchitecture. Since the processor is unaware of theinstruction until decoded, we regard the fetch stage as anirrelevant block. We analyze the logic burdens of variousinstructions and provide the instruction distribution of bothAndroid applications and SPEC benchmarks. The detailsof our ISA analyses can be found in [12]. Based on theseanalyses, we identify four sets of candidate instructions forremoval as follows:

• NEON instruction set extensions including floating-point(FP) and SIMD operations.

• Load and store multiple (LDM/STM) instructions.

1.00 0.99 1.01 1.05 1.01 1.06 1.37 1.01 1.06

23.156.19 10.41 16.90 12.60

0.1

1

10

100

0%10%20%30%40%50%60%70%80%90%

100%

Norm

alized

Execu

tion T

ime (

log)

% of

1K in

terval

s with

NEON

instr

uctio

ns% NEON coverage execution time

Figure 1: Performance of ISA withoutNEON instructions.

0.99 1.00 0.99

1.56

1.05 1.06 1.05 1.01 1.01 1.07 1.00 0.99 1.03 0.97 1.00

00.20.40.60.811.21.41.61.8

0%10%20%30%40%50%60%70%80%90%

100%

Norm

alized

Execu

tion T

ime

% of

cond

itiona

l instr

uctio

ns

% conditional instructions execution time

Figure 2: Performance of ISA withoutconditional instructions.

1.00 1.01 1.00 1.02 0.98 0.96 1.01 1.03 1.00 1.00 1.00 1.02 1.07 1.01 1.03

0

0.2

0.4

0.6

0.8

1

1.2

0%

10%

20%

30%

40%

50%

60%

Norm

alized

Execu

tion T

ime

% of

ldm&p

op an

d stm

&pus

h ins

tructi

ons

%ldm&pop %stm&push execution time

Figure 3: Performance of ISA withoutLoad/Store Multiple instructions.

• Predicated instructions [2].

• DSP-like instructions (QADD, SSAT, etc.).

Reduced-ISA Performance Impact: We define re-duced instruction sets as case studies and observe the perfor-mance overhead of each with SPEC CPU2006 benchmarks(bzip2, gcc, mcf, hmmer, sjeng, libquantum, omnetpp, astar,namd, soplex, povray, and lbm). We modify LLVM to gen-erate binaries in which candidate instructions are removed.The execution time was measured on an Arndale board [3].

NEON instructions are a large source of logic to sustainperformance. In other words, the performance loss from re-moving NEON instructions is also large if they are heavilyused. To measure the performance impact of NEON instruc-tions, we compare the execution time of SPEC benchmarkswith and without NEON instructions. Note that since wecould not fully remove all NEON code from some of thepre-compiled libraries, there still exist a negligible numberof NEON instructions. Figure 1 shows the detailed perfor-mance loss and original NEON instruction usage for eachbenchmark along with NEON coverage.

For floating-point benchmarks, the performance slowdownis significant, ranging from about 6 to 23 times. When re-moving floating-point support in the ISA, there is no otherchoice for the compiler than to resort to soft-float emula-tion. By contrast, for integer benchmarks, which sporadi-cally use NEON instructions, the performance degradationis less than 7% with the exception of omnetpp. Since om-netpp contains a large number of NEON instructions, the37% performance degradation is understandable. These pro-filing result match prior work [5].

Figure 2 shows results for a reduced ISA that excludespredicated instructions. Avoiding to use of conditional in-structions is done by disabling the SimplifyCFG optimiza-tion pass in LLVM. When removing this optimization pass,the compiler also loses the chance to generate optimal code.Thus, the slowdown is not solely from removing conditionalinstructions. However, since we observed that only a verysmall number of instructions are changed compared to full-ISA binaries, we regard such effects as negligible.

Overall, the performance loss is less than 5% except forhmmer. Removing conditional instructions results in an in-creased amount of branches in the code. Results indicatethat the branch predictor of a modern high-performancemobile processor is good enough to handle the removal ofpredicated instructions. In case of hmmer, further analysisshowed that the single hot loop in which most of the execu-tion time is spent shows significantly worse branch predictorperformance, causing the severe performance degradation.

Load and store multiple instructions provide a way to re-duce a number of consecutive load/store operations. How-ever, since modern instruction prefetch/fetch and out-of-order logic can hide such an increase in instruction count, thepenalty of replacing them with separate load/store instruc-tions is somewhat mitigated. Figure 3 shows the detailedperformance with and without load/store multiple instruc-tions. The performance loss is negligible across benchmarks.

ReducedISACore

Full

ISA

Core

ReducedISACore

Full

ISA

Core

rISACore

Full ISACore

Time

Full ISACore

Running on Full-ISA core

Running on Heterogenous

System

switchingoverhead

Figure 4: Full- and reduced-ISA core heterogeneous system.

We find that current compilers hardly use DSP-like in-structions for most types of workloads. Some of them arenever used by the LLVM compiler back-end in the firstplace. Consequently, it is meaningless to conduct perfor-mance comparison with or without these DSP-like instruc-tions. However, despite the fact that we skipped the perfor-mance evaluation, a reduced ISA without DSP-like instruc-tions to reduce the logic burden is still valid.

3. HETEROGENEOUS-ISA SYSTEMWe propose a heterogeneous system that contains a combi-

nation of both full- and reduced-ISA cores (Figure 4). Tradi-tional full-ISA cores execute applications with no change inperformance or energy consumption. By contrast, by cut-ting down on the instructions supported by the underly-ing microarchitecture, the logic complexity of reduced-ISAcores is decreased. This allows such cores to achieve lowerpower consumption. At the same time, unsupported instruc-tions need to be trapped and software-emulated. A compilercan create binaries that are optimized for execution on thereduced-ISA core. Nevertheless, if removed instructions arecritical, as is the case with NEON instructions for floating-point workloads, performance suffers severely. In the re-mainder of the paper, we define rISA as the reduced ISAwith all the previously mentioned instructions removed.

For workloads with unsupported instructions that are sen-sitive to performance, a heterogeneous-ISA system can al-leviate the performance degradation. When running an ap-plication in which performance-sensitive instructions are notprevalent, it is better to run it on reduced-ISA cores. How-ever, when the application is in a phase where it includesperformance-sensitive instructions, the application switchesto the full-ISA core where it does not lose any performance.The core switching overhead depends on the granularity andsubsequent frequency of switching. We present evaluationresults for varying core switching granularities in Section 4.

Dynamically monitoring the instruction streams and switch-ing cores based on performance is the best way to maxi-mize energy savings. In this approach, the process schedulercan dynamically map applications to an appropriate ISAcore based on the information from hardware performancecounters, such as the number of NEON instructions exe-cuted during the current scheduling period. As mentionedabove, we assume that unsupported instructions are exe-cuted using emulation after undefined instruction exceptions

0.7

0.75

0.8

0.85

0.9

0.95

1

llvm.O3 nodsp noldm nocond noneon rISA

No

rmal

ize

d P

ow

er

pes normal opt

Figure 5: Power estimation of reducedISA (SPEC CPU2006 INT).

0.90.95

11.051.1

1.151.2

500 1000 2000 3000 4000 5000

Norm

alized

Instru

ction

Coun

t

NEON Threshold (instructions)

neon_overhead:20 neon_overhead:40 neon_overhead:80

Figure 6: Runlength of SPEC CPU2006bzip2 with various NEON overheads.

0

500

1000

1500

2000

2500

3000

3500

0%

20%

40%

60%

80%

100%

120%

500 1000 2000 3000 4000 5000

Co

re S

wit

ch

ing

Sc

he

du

lin

g P

eri

od

Sh

are

NEON Threshold (instructions)

fISA rISA Switching

Figure 7: Core residency of SPECCPU2006 bzip2.

in the reduced-ISA core. This will only be acceptable for in-frequent use of the software-emulated instructions. Oncethe number of software-emulated instructions exceeds a cer-tain threshold within a certain time period, the applicationshould be migrated to a full-ISA core to avoid further perfor-mance degradation. On a full-ISA core, on the other hand,the same instructions are also monitored. If the instructionsare not used during a certain time period, the application ismigrated back to the reduced-ISA core.

For optimal energy and performance balance, we imple-ment a dynamic core scheduling strategy. At the end ofeach scheduling period, the scheduler determines which corethe application should run on based on NEON instructioncounts. If there are more NEON instructions than a pre-determined threshold, it is better to run the application ona full-ISA core to reduce performance loss. In the oppositecase, it is better to run the application on a reduced-ISA corefor energy efficiency. Core switching happens when there areconsecutive periods where NEON instructions are above orbelow a pre-defined neon threshold. In this paper, we assumethat there are two cores, full- and reduced-ISA, respectively,and no other programs are running. The detailed algorithmcan be found in [12].

4. EXPERIMENTS AND RESULTSPower Estimation: We use McPAT [10] to estimate

power consumption. We estimate the logic reduction effectsresulting from the removal of the candidate instructions byadjusting numbers under certain assumptions about physicaland implementation details. The power estimation detailscan be found in [12].

As accurately quantifying power consumption is challeng-ing, we give a range of power estimates: opt (optimistic)indicates that the effect of the logic reduction is assumedto be relatively large while pes (pessimistic) means the op-posite (Figure 5). The power reduction for removing mostinstruction types is about 2% each, except for NEON in-structions, which show about 9% improvement. Due to theorthogonality of removed instructions, we assume that thelogic related to each instruction is independent, and, thus,those power reductions are additive. Our rISA core showsabout 15% power reduction compared to the full-ISA one.

Performance Evaluation: To observe the performancedegradation based on the core switching scenario, we eval-uate SPEC CPU2006 integer benchmarks using QEMU [4]while collecting a trace of all NEON instructions. We focuson NEON instructions as other removed instructions havelittle impact on performance as previously shown. We sim-ulate core switching using the previously shown schedulingalgorithm that accounts for the number of NEON instruc-tion in each scheduling period. Since switching is not free,each core switching is assumed to incur a fixed instructionoverhead. Furthermore, since emulating NEON instructionsleads to additional instructions, every time a NEON instruc-tion is executed, an associated instruction penalty is addedto the total instruction counts.

Reducing the emulation penalty is crucial for removinginstructions. Ideally, it is desirable for an application to en-tirely run on the reduced-ISA core. However, we find thatthe penalty to emulate NEON instructions is critical to per-formance (Figure 6). We estimate the NEON instructionpenalty in increments of 20 instructions corresponding tothe maximum 20x performance degradation of FP bench-marks. An increase in the penalty per NEON instructionfrom 20 to 80 directly affects overall performance.

Figure 7 details the fraction of scheduling periods bzip2 isrunning on each core and the corresponding number of coreswitches. The higher the NEON threshold for scheduling,the longer the bzip2 can run on the reduced-ISA core. This,however, results in more NEON instructions requiring emu-lation, and, thus, an increase in instruction counts. However,the increase is relatively small. In addition, the number ofcore switches is low. As such the switching overhead is notcrucial for overall performance, and a large switching over-head does not result in significant performance loss. In otherwords, lowering the penalty allows more NEON instructionsto run on the reduced-ISA core.

System Evaluation: Based on the performance and en-ergy evaluation results, we estimate the overall efficiency ofour heterogeneous system. The estimated average power ofeach benchmark is multiplied by the total number of instruc-tions and the square of total instructions to compute energyand energy delay product (EDP), respectively, assuming aconstant CPI per benchmark. Furthermore, the computedenergy is normalized to the energy and EDP for the full-ISAcore in order to observe how much energy savings our sys-tem can achieve. These experiments are conducted with a20-instruction NEON penalty and a core switching overheadof 3000 instructions.

Figure 8(a) shows the estimated performance and energyfor SPEC CPU2006 benchmarks. Among the integer bench-marks, we exclude omnetpp and sjeng ; given their high por-tion of NEON instructions. It is evident that they will runmostly on the full-ISA core. Since dynamic scheduling de-termines which core to run on next, other applications showdiverse scheduling behavior. The hmmer benchmark runs onthe full-ISA core most of the time due to having more NEONinstructions than the threshold for each period, resulting inno performance degradation with no energy benefit.

At the opposite end of the spectrum, mcf, astar, andlibquantum run most of the time on the reduced-ISA corewith negligible performance degradation, but significant en-ergy benefits. Interestingly, bzip2 and gcc show unbiased be-havior. In these workloads, due to the broad range of NEONinstructions that vary between 500 and 5000 per period, ourscheduling algorithm switches cores frequently depending onthe NEON threshold. This aggravates the performance losswith higher NEON penalties and thresholds, thus reducingenergy efficiency.

Compiler Optimizations: To optimize energy efficiency,it is better to run applications on the reduced-ISA core asmuch as possible without losing any performance. This sug-gests that reducing the amount of NEON instructions is

0.750.80.850.90.9511.05

0%20%40%60%80%

100%120%

500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000

bzip2 gcc hmmer mcf astar libquantum Norm

alized

Instr./

Energ

y

Sche

dulin

g Peri

od Sh

are

neon_threshold

fISA rISA Instr. Energy (Pwr*Instr.) EDP (Pwr*Instr.^2)

(a) SPEC CPU2006 integer benchmarks with NEON instructions. Compiler statically removes LDM/STM,DSP-like, and conditional instructions

0.750.80.850.90.9511.05

0%20%40%60%80%

100%120%

500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000 500 1000

2000

3000

4000

5000

bzip2 gcc hmmer mcf astar libquantum Norm

alized

Instr

./Ene

rgy

Sche

dulin

g Peri

od Sh

are

neon_threshold

fISA rISA Instr. Energy (Pwr*Instr.) EDP (Pwr*Instr.^2)

(b) SPEC CPU2006 integer benchmarks. Frequently used NEON load/store instructions are replaced to integerload/store instructions

Figure 8: Performance evaluation and energy estimation.

beneficial not only to maximize residency of workloads onreduced-ISA cores, but also to reduce performance degrada-tion. Thus, we further limit the generation of NEON instruc-tions by modifying the LLVM back-end. Since we observethat vector load/store instructions are frequently used in in-struction optimizations, blocking such optimizations leads toless NEON instructions. We measure the performance of thetwo binaries on full-ISA cores on the real board and confirmthat performance differences of blocking such optimizationsare negligible.

Figure 8(b) shows the performance and estimated energyof the modified binaries. The amount of time bzip2 and gccrun on the full-ISA core is reduced to zero. This leads toa total energy efficiency equivalent to running solely on thereduced-ISA core. The hmmer benchmark still incorporatesquite a few NEON instructions, causing it to stay on thefull ISA core. Experimental results show that energy sav-ings of up to 15% are achieved for benchmarks that havenegligible NEON instructions. On average, about 12% en-ergy savings are achieved across all evaluated benchmarks.Interestingly, with the manipulation of the compiler, coreswitches are completely avoided, further increasing energyefficiency. This suggests that if a system has control overthe compiler, it is possible to prefer and only use the limitedset of instructions available on the energy-efficient reduced-ISA core.5. RELATED WORKS

Kumar et al. first suggested the possibility of increas-ing energy efficiency by proposing a single-ISA heteroge-neous system [9]. ARM’s big.LITTLE architecture [1] sub-sequently implemented this approach for mobile systems.All of these approaches, however, only focused on the high-performance components in the microarchitecture.

Recently, several notable approaches investigated hetero-geneous ISA architectures. Blem et al. revisited the debateabout RISC versus CISC, comparing x86 and ARM ISAs [6].Their research quantifies each ISA and argues that there isnot much difference between x86 and ARM ISAs. DeVuystet al. [8] and Venkat et al. [7] proposed a way of harnessingheterogeneous ISAs. However, their research only focused onthe diversities of each ISA and argued the benefits when find-ing the right ISA depending on the workloads. Several priorworks [11, 5] investigated OS and software support for taskmigration between cores with overlapping ISAs. However,their work does not discuss detailed trade-offs in designingactual architectures for such heterogeneous-ISA systems.

6. SUMMARY AND CONCLUSIONSA reduced ISA core opens up a chance to run high-performance

tasks with better energy efficiency. In this paper, we exploreand demonstrate the potential with a case study of a het-erogeneous system that includes both full- and reduced-ISAcores. Results show that certain complex instructions can beremoved with little performance overhead. We argue thatin a heterogeneous system that effectively migrates appli-cations to match ISA requirements, significant benefits inenergy efficiency can be obtained. Providing heterogeneityin functionality rather than performance can improve en-ergy efficiency with virtually no performance degradation.Results show that our proposed system can improve energyby up to 15% and by 12% on average, all with little to noperformance overhead.

7. ACKNOWLEDGEMENTSThis research was supported in part by SRC grant 012-

HJ-2317. We would like to thank the anonymous reviewersfor their many insightful comments and suggestions.

8. REFERENCES[1] ARM big.LITTLE Processing. http://www.arm.com/products/

processors/technologies/biglittleprocessing.php.[2] ARMv8 Instruction Set Overview. http://infocenter.arm.com/

help/index.jsp?topic=/com.arm.doc.genc010197a/index.html.[3] Arndale Board. http://www.arndaleboard.org.[4] Qemu. http://www.qemu.org.[5] A. Aminot et al. FPU Speedup Estimation for Task Placement

Optimization on Asymmetric Multicore Designs. In MCSoC,2015.

[6] E. Blem et al. Power Strugles: Revisiting the RISC vs. CISCDebate on Contemporary ARM and x86 Architectures. InHPCA, 2013.

[7] A. V. et al. Harnessing ISA Diversity: Design of aHeterogeneous-ISA Chip Multiprocessor. In ISCA, 2014.

[8] M. D. et al. Execution Migration in a Heterogeneous-ISA ChipMultiprocessor. In ASPLOS, 2012.

[9] R. K. et al. Single-ISA Heterogeneous Multi-Core Architectures:The Potential for Processor Power Reduction. In Micro, 2003.

[10] S. Li et al. McPAT: An integrated power, area, and timingmodeling framework for multicore and manycore architectures.In Micro, 2009.

[11] T. Li et al. Operating System Support for Overlapping-ISAHeterogeneous Multi-core Architectures. In HPCA, 2010.

[12] W. Lee et al. Exploring Opportunities for Heterogeneous-ISACore Architectures in High-Performance Mobile SoCs.Technical Report UT-CERC-17-01, The University of Texas AtAustin, 2017.

exploring heterogeneous-isa core architectures for high...

Documents