
Exploring Voltage Scaling Techniques in Embedded Processors Hardware Monitors

Arman Pouraghily, Padmaja Duggisetty, Thiago Teixeira

VLSI Design Principles Final Report - ECE 658 Fall 2014

Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA, USA

Email: {apouraghily,pduggisetty,tteixeira}@umass.edu

Abstract—The Internet is a very important communication infrastructure in modern life, with applications ranging from banking transactions and the transfer of copyrighted material and companies' assets to simple web surfing. The role of the Internet is expected to grow exponentially with cloud computing and the Internet of Things. In this context, the router will continue to be the equipment carrying most Internet traffic, with an increasing number of routers' packet forwarding applications being deployed on programmable network processors. Many security algorithms have been developed for network security as a whole in order to increase the reliability of communications. The security of the network infrastructure has received less attention, even though attacks in the data plane that can trigger changes in the network processor have been successfully demonstrated. Hardware monitors are the proposed solution for securing the network infrastructure: they compare the instructions the network processor is running with the expected instructions, and reset and recover the processor if any instruction deviates from the monitoring graph. Hardware monitors have to execute at a very high speed. Our challenge in this work is to look at techniques that reduce the supply voltage, which makes the hardware monitor slower, while maintaining its functionality and hence consuming less power.

Index Terms—hardware monitor, voltage scaling.

I. INTRODUCTION

With the ubiquitous presence of the Internet in today's society, ensuring trustworthy communication is key. Financial transactions, company data flowing from one office to another, and private user data are a few examples where the Internet is required to work correctly.

An important part of the Internet infrastructure is the router. An increasing number of routers are shipped with programmable packet processing, used by vendors to extend system functionality. The packet processing applications are implemented on network processors (NPs).

The authors of [1] showed that the network processor can be exploited using an integer overflow attack. Hardware monitors are the standard solution for preventing attacks on network processors. The authors of [2] presented a solution that uses a single memory read per instruction to compare the code being executed against the expected behaviour. If the monitor detects a malicious pattern (an invalid state, for instance), the packet is dropped.

The hardware monitor is extra hardware that has to be accommodated on the chip. We want it to be small and to consume little power. Hence, power consumption is taken into account in this work. Our goal is to estimate the power consumption overhead imposed by introducing the hardware monitor and to minimize this overhead using voltage scaling techniques.

The basic idea behind our approach is that the logical complexity of the modules used in the hardware monitor is far lower than that of the modules in the main CPU, which means that the hardware monitor could run at a much higher clock rate. However, running the hardware monitor at a higher frequency than the network processor wastes resources: the monitor has to track the behaviour of the main processor, so running it faster would make it overtake the main processor, which defeats the purpose of monitoring. This gives rise to an advantage: being able to run at a higher frequency means having much more slack time within a predefined time budget (which is imposed by the critical path delay of the main processor). The main objective of our work is to make good use of this slack time. The delay of a digital circuit is inversely related to the supply voltage, so decreasing the supply voltage increases the delay of the circuit. The dynamic power consumption of a circuit, however, increases quadratically with the supply voltage, so a small change in supply voltage has a significant impact on power consumption. This work deals predominantly with the library characterization of digital CMOS logic gates at different voltage levels in order to predict delay and power consumption. The power consumption of the hardware monitor is reduced because it operates at a voltage lower than the nominal voltage. Our results show that voltage scaling has a large impact on power savings while the added circuit still satisfies the timing budget.
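As a rough illustration of the quadratic relationship described above, the following sketch uses the first-order dynamic power model P = a·C·Vdd²·f. The model, the function name, and the assumption that switching activity, capacitance, and clock frequency stay fixed are ours for illustration only; the 1.1 V nominal supply is the one used by the library in Section III.

```python
# First-order CMOS dynamic power model: P_dyn = activity * C * Vdd^2 * f.
# Assumption for this sketch: activity, capacitance and clock frequency are
# held fixed, so the ratio of dynamic power at two supply voltages depends
# only on the square of the voltage ratio. Leakage is not modelled.

NOMINAL_VDD = 1.1  # nominal supply voltage (V) of the 45 nm library


def dynamic_power_ratio(vdd: float, vdd_nominal: float = NOMINAL_VDD) -> float:
    """Return P_dyn(vdd) / P_dyn(vdd_nominal) under the quadratic model."""
    if vdd <= 0 or vdd_nominal <= 0:
        raise ValueError("supply voltages must be positive")
    return (vdd / vdd_nominal) ** 2


def report(voltages=(1.1, 0.9, 0.8, 0.7)) -> None:
    """Print the relative dynamic power for each supply voltage."""
    for vdd in voltages:
        ratio = dynamic_power_ratio(vdd)
        print(f"Vdd = {vdd:.1f} V -> {ratio:.1%} of nominal dynamic power")
```

Under this simplified model, lowering the supply from 1.1 V to 0.8 V leaves roughly 53% of the nominal dynamic power. The model says nothing about leakage or about the delay penalty, both of which are characterized by simulation in this work.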

II. LITERATURE REVIEW

This work comprises two main topics: the operation of the hardware monitor and voltage scaling. The first concerns device security, while the second concerns reducing power consumption as much as possible.

A. Hardware Monitor

Modern high-performance routers no longer use application-specific integrated circuits (ASICs); instead, they use programmable network processors [3]. Network processors are multi-core high-performance embedded systems that implement packet forwarding and other network functions programmed in software. While programmable network processors offer router vendors and network providers the flexibility to remotely reprogram the equipment, they also expose the Internet infrastructure to potential risks. Defence mechanisms against data plane attacks on network processors have been proposed, specifically hardware monitors that operate in parallel with network processors, monitoring the processor core and comparing its behaviour with a monitoring graph. If the behaviour deviates from the monitoring graph, the processor core is reset (e.g. the packet is dropped) and recovered.

An effective network processor monitoring system needs to verify every instruction executed by the processor. For this reason, the monitor needs to run at very high speed to keep up with the processor. This instruction-based monitoring can be viewed as a finite automaton with a fixed number of acceptable paths. A deterministic finite automaton (DFA) has been used to perform instruction-level monitoring, as opposed to a non-deterministic finite automaton (NFA). Using a DFA reduces the memory bandwidth requirement compared with an NFA. Previous work in embedded security had been done on Von Neumann processor architectures, whereas network processors use a Harvard architecture. An example attack was therefore presented to prove that attacks on a Harvard architecture exist and to show how the processor is protected from such an attack. These two key problems were addressed in the paper.

In [2], the authors developed a high-performance hardware monitor that takes a single memory read per instruction, operating at speeds sufficient to maintain the network data transfer rate. A deterministic monitoring graph is implemented in the form of a state machine derived from the packet processing code, where each state represents a specific processor instruction. For each instruction executed on the processor core, a hash value of the executed operation is reported to the monitor. The monitoring system uses a 4-bit hash of the next instruction to label edges in the monitoring graph. The monitor's comparison logic compares each reported hash value with the information stored in the monitoring graph. An attack changes the operation of the processor core, and this deviation produces hash values that do not match the monitoring graph. Upon detecting a deviation reported by the comparison logic, the monitor resets the processor core.
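The monitoring principle described above can be sketched in a few lines of software. This is only a toy model under stated assumptions: the hash function `toy_hash`, the straight-line (branch-free) graph, and all names are hypothetical placeholders for illustration, not the hash or graph encoding used in [2].

```python
# Toy model of instruction-level monitoring with a deterministic graph.
# Each state stands for one processor instruction; edges are labelled with
# a 4-bit hash of the next instruction. The hash function and the graph
# construction are hypothetical placeholders, not those used in [2].

from typing import Dict, Iterable, Tuple

HASH_BITS = 4
HASH_MASK = (1 << HASH_BITS) - 1

Graph = Dict[Tuple[int, int], int]


def toy_hash(instruction: int) -> int:
    """Fold a 32-bit instruction word into 4 bits by XOR-ing its nibbles."""
    h = 0
    word = instruction & 0xFFFFFFFF
    while word:
        h ^= word & HASH_MASK
        word >>= HASH_BITS
    return h


def build_graph(program: Iterable[int]) -> Graph:
    """Build a straight-line monitoring graph: (state, hash) -> next state."""
    graph: Graph = {}
    for state, instruction in enumerate(program):
        graph[(state, toy_hash(instruction))] = state + 1
    return graph


def monitor(graph: Graph, executed: Iterable[int]) -> bool:
    """Return True if every reported hash follows a valid edge.

    A False result corresponds to the monitor resetting the processor core.
    """
    state = 0
    for instruction in executed:
        key = (state, toy_hash(instruction))
        if key not in graph:
            return False  # deviation detected: no valid outgoing edge
        state = graph[key]
    return True
```

Because the edge labels are only 4 bits wide, two different instructions can share a hash value; in this sketch a deviation is detected only when the reported hash differs from every valid edge label of the current state.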

In order to match the speed of the processor core, the comparison logic needs to retrieve the information about next-state transitions in no more than one memory access per instruction. To achieve this, DFA states with varying numbers of outgoing edges are represented by encoding all the necessary information in a single table entry, and states are grouped by the number of outgoing edges and by the same previous state. The memory contains tuples and is logically divided into groups. The base address of each group is stored in a register file with 16 entries. This makes accessing the memory for the hash code of the next instruction faster. An increase in speed over previous approaches was observed with this memory organization.

Code injection attacks are feasible on a Harvard architecture processor using a return-oriented programming technique. In such attacks, the attacker takes control of return instructions on the stack to chain attack code from existing functions. Since the code is already in executable memory, the attack cannot be prevented. One such attack, made possible by an integer overflow vulnerability, is presented in the paper. An attacker sends a UDP packet with a size of 65534. This passes the maximum packet size check, since 65534 + 12 = 10 due to integer overflow. The packet payload is crafted so that the return address is overwritten and all the ports are flooded with attack packets. As a result, the system crashes. As soon as the control flow changes, the hash values reported by the processor no longer match the monitoring graph information and the system is reset. Attacks of this kind are therefore detected by the hardware monitor, since there is no valid edge between the states. All of the above work was implemented in fixed logic and prototyped on a Stratix IV GX230 FPGA located on an Altera DE4 board.
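The arithmetic behind the overflow can be reproduced with a short sketch. The 16-bit unsigned width is inferred from the result 65534 + 12 = 10; what the 12 extra bytes represent is not stated in the text, and the size limit of 1500 and the function names are hypothetical values chosen for illustration.

```python
# 16-bit unsigned wrap-around, matching the 65534 + 12 = 10 example above.
# Assumptions for this sketch: the length field is 16 bits wide (inferred
# from the result), and MAX_TOTAL is a hypothetical size limit.

UINT16_MASK = 0xFFFF
EXTRA_BYTES = 12   # value taken from the text; its meaning is not stated
MAX_TOTAL = 1500   # hypothetical limit used only for this illustration


def total_length_uint16(length: int) -> int:
    """Add EXTRA_BYTES using 16-bit unsigned arithmetic (wraps on overflow)."""
    return (length + EXTRA_BYTES) & UINT16_MASK


def passes_size_check(length: int, max_total: int = MAX_TOTAL) -> bool:
    """Vulnerable check: compares the wrapped sum against the limit."""
    return total_length_uint16(length) <= max_total
```

Here `total_length_uint16(65534)` evaluates to 10, so the oversized packet passes the check, whereas an ordinary oversized length such as 2000 is rejected.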

In [4], the authors extended the work to multi-core network processors, implemented on a field-programmable gate array (FPGA) platform.

B. Voltage Scaling Technique

Moore's law has been driving advances in the semiconductor industry since 1965. However, with the miniaturization of the transistor and the consequent increase in transistor count, power consumption has become a barrier to the continuation of Moore's law. Multi-core processors were the solution adopted by industry to increase computation power.

The most advanced systems on chip (SoCs) on the market can easily reach billions of transistors. For instance, Apple's latest A8 processor has a dual-core CPU and approximately two billion transistors, fabricated in a 20 nm process [5]. With 14 nm and smaller processes expected to reach the market very soon, these dense chips cannot turn on all their transistors at the same time. This phenomenon is known as dark silicon and stems from power constraints such as leakage and dynamic power.

The authors in [6] explored near-threshold computing (NTC), where the supply voltage is approximately equal to the threshold voltage of the transistors. By reducing the supply voltage from the super-threshold region to the near-threshold region, the authors observed a gain on the order of 10X in power consumption, with a performance degradation of 10X.

Reducing the supply voltage reduces power consumption, but there are trade-offs. When the supply voltage drops, delay increases exponentially, reducing overall performance. Therefore, the optimum point is where we gain in power consumption without violating the circuit's delay requirement. Another trade-off is performance variation, as the dependencies of drive current on Vth, Vdd, and temperature approach exponential. Hence, when the supply voltage decreases, performance uncertainty increases. Finally, NTC increases functional failures due to variations in process, temperature, and voltage. Current research to overcome these barriers is presented in [6].

III. IMPLEMENTATION

The following subsections describe the flow we followed to apply voltage scaling. We started by designing the schematics, then drew the layouts, customized the technology library, synthesized and validated the design, and evaluated power and timing. The next section shows the results.

A. Gate Characterization and Library Creation

Our primary goal is to investigate the impact of supply voltage scaling on critical path delay and power consumption. To accomplish that, we need a synthesis library characterized for each voltage level. The 45 nm NangateOpenCellLibrary [7] is a free low-power library, so we used it in our project. The library defines the behavior and characterization of its cells under a 1.1 V supply voltage, which is nominal for this technology.

Both the delay and the power consumption of a library are characterized and stored in an industry-format (Liberty) file. For delay, both pin-to-pin delays and the corresponding output slopes are typically characterized for identified timing arcs as a function of load and/or input slope. In general, this allows slews to propagate during delay and timing analysis and to be used to characterize and analyze power consumption.

For power, both static and dynamic sources of power are characterized. Dynamic power is made up of internal power and switching power. The former is dissipated by the cell in the absence of a load capacitance, and the latter is the component dissipated while charging and discharging a load capacitance. Dynamic power is measured per timing arc (as with delay). Static dissipation is due to leakage currents through OFF transistors and can be significant when the circuit is idle (there is no switching activity).

Since we did not have access to Liberty NCX [8], which was mentioned in our proposal, we had to do the characterization manually, running HSpice simulations and updating the Liberty file by hand. As the library contains more than 100 primitive cells and the characterization procedure is very time-consuming, we needed to narrow down our cell library.

We chose the 2-input NAND, 2-input NOR, inverter, and D-type flip-flop as our library, which is sufficient to implement any combinational or sequential circuit. Since our goal is to compare the behavior of the circuit synthesized with these libraries, the exact delay or power consumption is not relevant, but their ratio is. Hence, our library comprises minimum-sized gates, which makes our synthesized circuit very slow; however, the absolute value of the delay is not our main concern.

Since we did not have access to the HSpice models of the NangateOpenCellLibrary, our first step was to obtain these models. We therefore drew the layouts of these gates using Cadence Virtuoso [9] and extracted the HSpice models from them.

Figures 4-11 show the schematics and layouts of the inverter, NAND, NOR, and D flip-flop. After passing the DRC and LVS checks, we extracted the netlist from the layout of each gate. The next step was extracting the characteristics of the cells, by which we mean their power and delay behavior. The power and delay behavior of a logic gate depends on two factors: the capacitive load it drives and the slew of its inputs when they change.

By looking at the Liberty file of the NangateOpenCellLibrary, we could see that each gate has a list of ports with a defined capacitive load. We assumed that the port definitions are the same in our library and left them untouched. After the port definitions, we are required to define the leakage power behavior of the cell. To do so, we applied every possible input vector to the inputs of the cell, let the output settle, and then, without changing the inputs, measured the power consumed by the circuit in steady state.
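The exhaustive input-vector sweep used for leakage characterization can be sketched as follows. The `measure` callable is a hypothetical stand-in for one steady-state HSpice run; the pin names and the function name are likewise illustrative and not taken from our scripts.

```python
# Enumerate every static input vector of a cell and record the steady-state
# power reported for it. `measure` is a hypothetical stand-in for running
# one HSpice simulation and reading back the settled power.

from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def characterize_leakage(
    pins: Sequence[str],
    measure: Callable[[Dict[str, int]], float],
) -> Dict[Tuple[int, ...], float]:
    """Return a mapping from input vector to measured leakage power."""
    results: Dict[Tuple[int, ...], float] = {}
    for vector in product((0, 1), repeat=len(pins)):
        assignment = dict(zip(pins, vector))  # e.g. {"A1": 0, "A2": 1}
        results[vector] = measure(assignment)
    return results
```

For a two-input gate this produces four entries, one per input combination, which is the set of values that has to be filled in for the leakage section of each cell.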

After characterizing the leakage power, we need to characterize the timing behavior of the cell. Timing behavior includes the rise/fall time of the output and the tphl/tplh propagation delays. Since the delay depends on both the input slew and the output capacitive load, the delay behavior of the cell is summarized in a two-dimensional table. The row index is the input slew and the column index is the output capacitive load. Each dimension has seven index values. If the slew and the capacitive load match one of those entries, the tabulated delay value is used; otherwise, the output value is estimated by interpolating between the closest entries. We have the same tables for the rising and falling dynamic power.
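The table lookup described above can be illustrated with a small bilinear interpolation routine. This is a sketch under our own assumptions: the exact interpolation scheme used by the synthesis tool is not documented here, and the function names are hypothetical.

```python
# Bilinear interpolation in a 2-D characterization table indexed by input
# slew (rows) and output load (columns). Assumption: bilinear interpolation
# between the four surrounding entries; the tool's actual scheme may differ.

from bisect import bisect_right
from typing import Sequence


def _segment(axis: Sequence[float], x: float) -> int:
    """Return i so that axis[i]..axis[i + 1] is the segment used for x.

    Values outside the axis range reuse the first or last segment, which
    amounts to linear extrapolation.
    """
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lookup(
    slews: Sequence[float],
    loads: Sequence[float],
    table: Sequence[Sequence[float]],
    slew: float,
    load: float,
) -> float:
    """Interpolate table[row][col] at the requested slew and load."""
    r = _segment(slews, slew)
    c = _segment(loads, load)
    ts = (slew - slews[r]) / (slews[r + 1] - slews[r])
    tl = (load - loads[c]) / (loads[c + 1] - loads[c])
    # Interpolate along the load axis on the two bracketing rows first,
    # then along the slew axis between those two results.
    low = table[r][c] + tl * (table[r][c + 1] - table[r][c])
    high = table[r + 1][c] + tl * (table[r + 1][c + 1] - table[r + 1][c])
    return low + ts * (high - low)
```

When the requested slew and load coincide with index values, both interpolation weights are 0 or 1 and the routine returns the tabulated entry unchanged, which matches the exact-match case described in the text.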

Since the circuit behaves differently depending on which input changes, we have another set of these tables for the other input. As we are going to compare the impact of voltage scaling, we made four copies of the Liberty file with values extracted from HSpice simulations at four different supply voltages (1.1 V, 0.9 V, 0.8 V, and 0.7 V). After filling in all the tables, the characterization procedure is complete. In the next step, we compile the Liberty file and build the synthesis library (.db file). This was done using Design Compiler, and four libraries were created, corresponding to the four voltage levels.

B. Synthesis and Validation

RTL synthesis is one of the main parts of an ASIC design flow. It is a technology-dependent process that performs translation, optimization, and mapping of the design files. These files comprise a set of constraints, the high-level hardware design, the technology library files, and the RTL source. In our work we used Synopsys Design Compiler as the RTL synthesizer, adapting the set of constraints from [10].

The high-level hardware design files were written in Verilog. They comprise a top-level hardware monitor module (HWMonitor.v) and six lower-level modules (base_addr.v, controller.v, CurrentPointer.v, GID_Table.v, my_register.v, PID2Addr.v) that are used for synthesis. Another important set of files are the Tool Command Language scripts (.tcl). Tcl is used to drive Synopsys tools. Among the Tcl files, a few are worth mentioning.

The Setup.tcl script is used to set various parameters such as the clock name, top module, RTL directory, and gate directory. The Read.tcl script is used to read the Verilog files corresponding to each module of the hardware monitor. Constraints to optimize power are included in the Constraints.tcl file. CompileAnalyze.tcl is used to synthesize the design based on the constraints. At the end of these four steps we extracted various reports, which include the critical path delay, leakage power, and dynamic power.

Design Compiler uses a separate technology library for each voltage level. Each .lib file, which is the technology library source file, was compiled into a .db file, the Synopsys database format, using the following commands:

$read_lib \$PATH\NangateOpenCellLibrary
$write_lib \$PATH\NangateOpenCellLibrary

Synthesis was performed using the four custom libraries, one for each voltage level. For each voltage level, we extracted timing reports for the typical process corner, as well as power reports.

Customizing the Liberty file (.lib) is a time-consuming task (a two-input gate has approximately 400 entries). We extracted the values from several HSpice simulations, transferred them to a spreadsheet, and finally entered them into the Liberty file. Since Spice simulates the circuit by solving differential equations, we expect the results to be accurate. We took care to adapt the measurement points for every slope, capacitance, and voltage level.

IV. RESULTS

The results of our synthesis using Synopsys Design Compiler with our customized library are compared with a typical ARM9 embedded processor [11].

ARM processors are predominant in today's embedded systems. A typical ARM9 processor runs at a frequency of 275 MHz. Since we do not have access to the HDL code of the processor, we make a few assumptions to enable the comparison.

The synthesis using the 45 nm NangateOpenCellLibrary yielded a critical path of 2.25 ns. We assume that the results of synthesis using the TSMC library and the NangateOpenCellLibrary are close (in fact, TSMC is by far the faster). As mentioned earlier, we assume that the processor runs at 275 MHz, which means a critical path delay of about 3.6 ns.

Table 1 - Data arrival time at each voltage level.

Voltage level (V)        0.7      0.8      0.9      1.1
Data arrival time (ns)   8.8926   6.4291   5.5986   3.9002

Table 1 shows the critical path delay at each voltage level, obtained with our customized version of the NangateOpenCellLibrary, with Design Compiler targeting delay and power optimization. As seen in Figure 1, reducing Vdd increases the critical path delay, which is expected. The question, however, is how far we can reduce the voltage while the circuit remains functional.

Fig. 1. Data arrival time for different voltage levels

The synthesis of our circuit using the customized library at a Vdd of 1.1 V shows a delay of 3.9 ns, and we assume that the delay of the processor scales by the same factor (3.9/2.25). With this assumption, the critical path delay of the processor becomes 6.24 ns. Since the hardware monitor is supposed to work synchronized with the processor, its clock frequency would be the same, and it would have a delay slack of about 2.34 ns (6.24 ns − 3.9 ns). In order to save power, we can reduce the supply voltage as long as the timing requirement is met.
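The timing-budget arithmetic above can be checked with a short sketch. The constants are the figures quoted in the text; the function and constant names are ours, and the slack is simply recomputed from the scaled budget and the monitor delay at 1.1 V.

```python
# Timing-budget arithmetic using the figures quoted in the text.
# Names are illustrative; only the numeric constants come from the report.

PROCESSOR_CLOCK_MHZ = 275.0   # assumed ARM9 clock frequency
PROCESSOR_DELAY_NS = 3.6      # rounded clock period used in the text
DELAY_FULL_LIBRARY_NS = 2.25  # critical path with the full Nangate library
DELAY_CUSTOM_1V1_NS = 3.9     # critical path with the custom library, 1.1 V


def clock_period_ns(freq_mhz: float) -> float:
    """Convert a clock frequency in MHz to a period in nanoseconds."""
    return 1000.0 / freq_mhz


def scaled_budget_ns(processor_delay_ns: float = PROCESSOR_DELAY_NS) -> float:
    """Scale the processor delay by the custom/full library delay ratio."""
    return processor_delay_ns * (DELAY_CUSTOM_1V1_NS / DELAY_FULL_LIBRARY_NS)


def slack_ns(budget_ns: float, monitor_delay_ns: float) -> float:
    """Timing slack of the monitor: budget minus its own critical path."""
    return budget_ns - monitor_delay_ns
```

With these inputs, `scaled_budget_ns()` gives a budget of 6.24 ns, and `slack_ns(6.24, 3.9)` gives the monitor's slack at the nominal voltage. Using the unrounded period from `clock_period_ns(275.0)` instead of 3.6 ns would give a slightly larger budget.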

According to Table 1, if we reduce the voltage to 0.8 V, the delay of our hardware monitor would be 6.43 ns, which violates the timing requirement. Given how the delay grows as the voltage drops, the lowest usable voltage is likely to be slightly higher than 0.8 V. By decreasing the supply voltage of the hardware monitor, we could save 57 µW according to Figure 2, which translates to a 50 percent reduction of its total power consumption at 1.1 V.
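The voltage selection can be made concrete with a sketch that uses the Table 1 values and the 6.24 ns budget. Linear interpolation between the two characterized points that bracket the budget is our own simplifying assumption, used only to give a rough estimate; it is not a step performed in this work, and the function names are hypothetical.

```python
# Pick a supply voltage from the Table 1 data for a 6.24 ns timing budget.
# Assumption: delay varies linearly with voltage between two adjacent
# characterized points, which is only a rough approximation.

from typing import Dict

TABLE_1: Dict[float, float] = {0.7: 8.8926, 0.8: 6.4291, 0.9: 5.5986, 1.1: 3.9002}
BUDGET_NS = 6.24


def lowest_passing_voltage(table: Dict[float, float], budget_ns: float) -> float:
    """Lowest characterized voltage whose delay meets the budget."""
    passing = [v for v, delay in table.items() if delay <= budget_ns]
    if not passing:
        raise ValueError("no characterized voltage meets the budget")
    return min(passing)


def interpolated_min_voltage(table: Dict[float, float], budget_ns: float) -> float:
    """Estimate the voltage at which the delay equals the budget."""
    points = sorted(table.items())
    if points[0][1] <= budget_ns:
        return points[0][0]  # even the lowest voltage already passes
    for (v_lo, d_lo), (v_hi, d_hi) in zip(points, points[1:]):
        if d_lo > budget_ns >= d_hi:
            fraction = (d_lo - budget_ns) / (d_lo - d_hi)
            return v_lo + fraction * (v_hi - v_lo)
    raise ValueError("no characterized voltage meets the budget")
```

With the Table 1 values, 0.9 V is the lowest characterized level that meets the budget, and the interpolated estimate is about 0.82 V. Such an estimate would have to be confirmed by characterizing a library at that voltage.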

We ran several simulations in order to evaluate the minimum critical path in each case. Figure 3 shows the extreme cases, where leakage and dynamic power are optimized while the input voltage is kept constant. In both cases, the clock was optimized by Design Compiler. With the input voltage kept constant, dynamic power decreases at a slower pace than when the voltage level is also scaled.

Fig. 2. Static and dynamic power measures

Fig. 3. A comparison between optimized and non-optimized power

V. FUTURE WORK

The effectiveness of voltage scaling in the hardware monitor is demonstrated by the reduced power consumption: the hardware monitor consumes less power than before and still maintains its functionality.

We encountered many difficulties due to limited access to the tools proposed in the previous report. Tools such as PrimeTime, CACTI, and ModelSim were not accessible, which made the work much more challenging and required many hours of Spice simulation and library customization. We used Synopsys Design Compiler for most of our work.

In the future, we would like to integrate the processor and the monitor and measure the power consumption using the tools mentioned above to obtain more accurate results.

REFERENCES

[1] D. Chasaki and T. Wolf, “Attacks and defenses in the data plane of networks,” Dependable and Secure Computing, IEEE Transactions on, vol. 9, no. 6, pp. 798–810, Nov 2012.

[2] H. Chandrikakutty, D. Unnikrishnan, R. Tessier, and T. Wolf, “High-performance hardware monitors to protect network processors from data plane attacks,” in Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE, May 2013, pp. 1–6.

[3] W. Eatherton, “The push of network processing to the top of the pyramid,” in keynote address at Symposium on Architectures for Networking and Communications Systems, 2005, pp. 26–28.

[4] K. Hu, H. Chandrikakutty, R. Tessier, and T. Wolf, “Scalable hardware monitors to protect network processors from data plane attacks,” in Communications and Network Security (CNS), 2013 IEEE Conference on, Oct 2013, pp. 314–322.

[5] Chipworks, “Inside the iPhone 6 and iPhone 6 Plus (part 2),” http://www.chipworks.com/en/technical-competitive-analysis/resources/blog/inside-the-iphone-6-and-iphone-6-plus-part-2/?lang=en&Itemid=815, accessed: 2014-11-17.

[6] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-threshold computing: Reclaiming Moore’s law through energy efficient integrated circuits,” Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, Feb 2010.

[7] Nangate, “Nangate freepdk45 open cell library,” http://www.nangate.com/?page id=2325, accessed: 2014-12-19.

[8] Synopsys, “Tutorial:liberty ncx,” http://www.synopsys.com/Tools/Implementation/SignOff/Pages/LibertyNCX.aspx, accessed: 2014-12-19.

[9] Cadence, “Tutorial:custom ic design,” http://www.cadence.com/products/cic/Pages/default.aspx, accessed: 2014-12-19.

[10] N. C. S. University, “Tutorial:place & route tutorials,” http://www.eda.ncsu.edu/wiki/Tutorial:Place %26 Route Tutorials, accessed: 2014-12-19.

[11] ARM, “Tutorial:arm926 processor,” http://www.arm.com/products/processors/classic/arm9/arm926.php, accessed: 2014-12-19.

VI. APPENDIX

Please refer to this section for the list of figures used in the text.

Fig. 4. Inverter schematic

Fig. 5. Inverter layout

Fig. 6. NAND schematic

Fig. 7. NAND layout

Fig. 8. NOR schematic

Fig. 9. NOR layout

Fig. 10. Flip-Flop schematic

Fig. 11. Flip-Flop layout
