
Accepted Manuscript

Dynamic power estimation using Transaction Level Modeling

Amr Baher, Ahmed N. El-Zeiny, Ahmed Aly, Ahmed Khalil, Adham Hassan, AbdelRahman Saeed, Karim Abo El Makarem, Magdy El Moursy, Hassan Mostafa

PII: S0026-2692(18)30148-4

DOI: 10.1016/j.mejo.2018.08.012

Reference: MEJ 4410

To appear in: Microelectronics Journal

Received Date: 3 March 2018

Revised Date: 9 July 2018

Accepted Date: 27 August 2018

Please cite this article as: A. Baher, A.N. El-Zeiny, A. Aly, A. Khalil, A. Hassan, A. Saeed, K.A. ElMakarem, M. El Moursy, H. Mostafa, Dynamic power estimation using Transaction Level Modeling, Microelectronics Journal (2018), doi: 10.1016/j.mejo.2018.08.012.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

https://doi.org/10.1016/j.mejo.2018.08.012


Dynamic Power Estimation using Transaction Level Modeling

Amr Baher1, Ahmed N. El-Zeiny1, Ahmed Aly2, Ahmed Khalil2, Adham Hassan1, AbdelRahman Saeed1, Karim Abo El Makarem2, Magdy El Moursy1,3, and Hassan Mostafa2,4

1Mentor, A Siemens Business, Cairo, Egypt, 2Electronics and Communications Engineering Department, Cairo University, Giza 12613, Egypt,

3Electronics Research Institute, Cairo, Egypt, 4Center for Nano-electronics and Devices, American University in Cairo and Zewail City for Science and

Technology, Cairo, Egypt,


Abstract—Designing an efficient (from performance and power points of view) system on chip (SoC) is one of the main challenges nowadays. This paper introduces a methodology that uses Transaction Level Modeling (TLM) to accelerate the simulation time for power estimation allowing fast SoC evaluation. Different modeling techniques are used to develop the proposed Transaction Level Power Modeling (TLPM) methodology. The methodology exploits abstracting the design using TLM. This abstraction allows fast simulation with still accurate functionality of the developed model. The methodology enables estimating power dissipation of real applications running on the SoC with high accuracy. ZYNQ-7000 platform is implemented on RTL and TLM to validate the methodology. The validation of the functionality is obtained through identical scenarios on both TLM and RTL. Experimental results reveal the efficiency and accuracy of the TLPM. The proposed methodology speeds up the simulation time by more than two orders of magnitude over RTL while the error in power estimation is less than 3%.

Index Terms—RTL, TLM, SoC, Power dissipation, Simulation time, SystemC.

I. INTRODUCTION

SoCs are widely used nowadays in many applications such as mobile phones and digital cameras. Using Register Transfer Level (RTL) to model an SoC has some limitations. The main limitation is the simulation time. As SoC complexity and size increase, simulating the full SoC at RTL becomes impractical. Accordingly, functional high-level models should be used, but the accuracy of those models is very poor because they have little information about the underlying hardware details. Power profiling for a running operating system and real applications on the SoC requires innovative techniques. Modeling power dissipation of SoCs using TLM has a great impact on reducing the design life cycle [1], [2]. Accurate high-level estimation saves the cost of the long iterations which lengthen the design life cycle.

Previous approaches for power dissipation estimation at various abstraction levels of the design did not achieve satisfactory accuracy [3–5]. Estimating power at a high level of system abstraction has been addressed in [6]. Some important design elements, such as power management components, are missed in the previous approaches, which affects the accuracy of the resulting power numbers versus the actual on-chip ones.

The Power Kernel tool flow [7–10] adopts cycle-accurate models in TLM. This leads to very large simulation time [11]. Power parameter characterization is done using extracted gate-level parasitics as in [6] and [8]. Power equations are constructed using the developed database of parasitics. The process of parasitic extraction and gate-level simulation is time consuming and expensive. The simulation overhead is very large [12], [13]. Power dissipation is minimized at the architecture level using the Wattch framework. The accuracy is around 10% as compared to some verified tools [14]. The energy of the whole system at the architecture level is calculated using the SimplePower framework. It uses a cycle-accurate data path, which increases the simulation time [15]. Some simulation tools such as SimpleScalar are developed to provide architecture modeling. They use execution-driven simulation, which increases the model complexity and leads to large simulation time [16–19]. For large SoC designs, the previous techniques are not efficient: they suffer from inaccuracy and slowness.

In this paper, a new approach (Transaction Level Power Modeling TLPM) is introduced for dynamic power estimation as shown in Fig. 1. Approximately-timed modeling is used to achieve fast simulation with high accuracy. Estimating dynamic power dissipation on Transaction Level Modeling (TLM) is developed in the proposed methodology [20]. The overhead in simulation to determine power dissipation is minimized. The error in calculating power dissipation is also minimized. It is shown in the paper that dynamic power estimation could be achieved in two phases. The extraction of power parameters from the RTL of a design is used to build Correlation Matrix in the first phase. This phase is called power characterization. In implementation phase, the power of RTL is mapped to TLM registers/ports to build power models on TLM as shown in Fig. 1.

The paper is organized as follows. In Section II, power characterization for Transaction Level Modeling is presented. In Section III, implementation of Transaction Level Power Modeling (TLPM) is provided. Applying the methodology on a large SoC is discussed in Section IV. In Section V, some results are demonstrated. Finally, some conclusions are given in Section VI.

Fig. 1. Transaction Level Power Modeling (TLPM) Flow

II. POWER CHARACTERIZATION FOR TLM

Obtaining an estimate of the dynamic power normally requires simulating the Gate Level (GL) netlist. The netlist includes information about interconnects and gates for timing and parasitic extraction at the GL. Although power estimation at the gate level is accurate, it is not feasible for a large SoC. TLM offers faster simulation because it deals with moving transactions and does not need the details of the gate level. Those details are abstracted into simplified power models. In this paper, a bottom-up approach is used to obtain the power estimate of the SoC at the Transaction Level. A power model is built on TLM for the different components. The components are then integrated to form a model of the full SoC. This provides an easy way of estimating the power of large SoC designs in reasonable simulation time while running real SW on them.

The developed power model on TLM depends on power characterization of the design. Power Characterization represents the extraction of the power parameters of the blocks which are implemented in RTL (remember, RTL of the individual blocks is still needed, but it is not possible to run SW or OS on the RTL of the whole SoC). The power parameters enclose information about the energy exerted by different elements in the design. On RTL, simulating large SoC with reasonable simulation time is not possible. The extracted power parameters are passed to TLM model as explained in the next section.

In the power characterization phase, the "Power Analysis" flow is used [21]. The following steps are commonly used for power estimation on RTL [21–25] (they are used with TLPM with the needed modifications). Those are considered the first two steps in power characterization.

1) Synthesizing the design: This step is of great importance. The netlist of the RTL is generated. Complete information about the GL components is enclosed in the netlist.

2) Generation of switching activity of the signals: Switching activity for every signal in the design is obtained through RTL simulation of all possible inputs. Then, the activity is stored in Switching Activity Interchange Format (SAIF) file where information about the toggling of the signals is kept.

3) Calculating Power: The SAIF file is linked to the netlist. Calculating the total power of the design for the attached switching activity is performed.

In Transaction Level Power Modeling (TLPM) methodology, power characterization requires manipulating Power Analysis flow to build a correlation matrix. Calculating the exerted energy by every design element is needed to determine its share at the total power. According to the superposition concept, the total power is divided among its contributors of the design elements. The power of the contributors adds up linearly. In the implementation phase, mapping RTL elements to the corresponding TLM registers and ports is performed. Then, an estimation for the total power is performed. The following steps are performed in order to achieve the full power characterization. Fig. 2 shows the steps to build a correlation matrix as follows:

i) Create a Zero.SAIF file. Zero activity information for all RTL signals is included in this file.

ii) Apply the superposition principle to the circuit elements. The contribution of each element to the total power is determined by enabling the activity of only one element at a time.

iii) Switching activity is applied on a single element in the Zero.SAIF file. On the RTL side, the power information of every signal is evaluated, and the energy of a single toggle is determined. The energy is calculated and divided by the period of time to determine the average power dissipation.

iv) The activity of signals is added incrementally into Zero.SAIF to overcome dependencies among signals.

v) The dependency of each signal on other signals is checked according to power numbers obtained from the previous steps. Three categories of correlation factors are possible in the dependency scheme:

a) No Correlation: Each signal activity is applied separately to get its contribution in total power independent of other signals. The correlation factor of this category is ZERO. An example to illustrate this category is clock tree signals.

b) Full Correlation (total dependency): The contribution of those signals in the power is identical. Merging the signals in one bundle is applied. The correlation factor of this category is ONE. A simple example for this category is the signals on the same data path.

[Fig. 1 block labels: Design Synthesis; Extraction of Power Parameters; Building Correlation Matrix; TLM Simulation; Power Model Implementation; RTL Signals to TLM Registers Mapping]

c) Partial Correlation: Signals in this category affect each other. Applying the concept of superposition is not possible. Instead, conditional correlation is applied.

vi) After getting the dependency scheme among signals, the scheme is used to constitute the "Correlation Matrix" of the design. Then, the Correlation Matrix is passed to the TLM to develop the power model.
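The superposition assumption behind the steps above can be written compactly. This is only a sketch in notation that does not appear in the original: $N_i$ is the toggle count of contributor $i$ over a time window $T$, $E_i$ is its characterized energy per toggle, and $c$ stands for a condition on another signal in the partially correlated case.

```latex
P_{\text{avg}} = \frac{1}{T} \sum_{i} N_i \, E_i ,
\qquad
E_i =
\begin{cases}
E_i^{(0)} & \text{no correlation, or one bundle of fully correlated signals},\\[2pt]
E_i^{(c)} & \text{partial correlation, selected by the condition } c .
\end{cases}
```

Read this way, the Correlation Matrix supplies the values $E_i$, and the conditional case is where plain superposition is replaced by a lookup on the state of the correlated signal.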

Fig. 2. Building Correlation Matrix for TLPM

Creating a testbench to check the functionality of the design and applying the same scenarios on both RTL and TLM are discussed in Section V. A Value Change Dump (VCD) file is generated from the developed testbench simulation. The VCD file is converted into a SAIF file containing all signals. In the following section, passing the power parameters and the Correlation Matrix of RTL signals to the TLM is discussed. These parameters are used in the TLPM flow to create power models. RTL signals are mapped to the TLM registers and ports. The total dynamic power of the design is evaluated by running the simulation on TLM, which is extremely fast compared to RTL, as shown later in the paper.

III. TRANSACTION LEVEL POWER MODELING

In this section, the Correlation Matrix of RTL signals is mapped to the TLM registers and ports. The implementation of the TLPM methodology is achieved in two steps:

A. Mapping Signal to Registers/Ports: The design registers and ports are enclosed in the TLM. Read/Write operations are used to access the registers/ports. Programmer View/Timing (PVT) models are built for the required design, where the functional behavior of the device is represented through the Programmer View (PV) [26]. For power estimation, the extraction of the Correlation Matrix, which has been discussed earlier, reveals the power contributors on RTL. These contributors are mapped to the TLM registers and ports. Then, the signal to register mapping scheme is adopted through the signal tracking technique. This technique has a great benefit as it tracks the drivers and the receivers of signals to the design ports. The tracking technique is facilitated by using the debugging utilities as in [22].

B. Power Evaluation: There are three categories of signals in a design:

1) The design signals which exist in both RTL and TLM. These are represented in registers and ports. The power evaluation of these signals is performed on Read/Write operations of the registers and ports using the initiator or debug ports as discussed later.

2) The design signals which are implicitly implemented in TLM but explicitly exist in RTL. An example for this type of signals is the clock signals. The clock signals/nets are found in RTL. They are not explicitly found in TLM.

3) The design signals which exist in RTL, but do not exist in TLM. An example for this type of signals is power management components or internally generated control signals which are not modeled in TLM. Activities of those signals are taken into consideration in the TLPM methodology.

The signals which are explicitly or implicitly implemented in TLM are considered using additional functions. These functions are used to evaluate the power of these signals according to their contribution factors. The contribution is taken from the Correlation Matrix which is generated in power characterization step.

Two types of ports exist in TLM: initiator ports and debug ports. The TLPM methodology uses both types of ports to estimate the power, as follows:

1) Initiator Ports: The real behavior of the model is represented through these ports. They are the interface between the SoC main bus and any device model. Callback functions are used to perform the communication between TLM of the device and the SoC bus. For every Read/Write operation of each register and port, there is a callback function to implement the functionality. Power equations are added to those callback functions to build the power models. The evaluation of the power upon every Write/Read operation is done using these power equations.

The power model is implemented using the following steps after the annotation of the RTL contributors:

(a) Tracking the activity of TLM registers and ports.

(b) Calculating the power for every operation.

(c) Accumulating the power for each transaction.

As simulation time moves forward, all energy exerted across the design is gathered by the device models.
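Steps (b) and (c) can be sketched as a small accumulator that each callback reports to. The class name `PowerAccumulator`, its methods, and the choice to accumulate energy in joules and convert to average power at the end are assumptions made for illustration, not the authors' implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not from the paper: gathers the energy reported by
// register/port callbacks and converts it to average power on request.
class PowerAccumulator {
public:
    // (b) energy of one operation = toggles * energy per toggle (joules);
    // (c) the result is added to the running total.
    void add_operation(std::uint64_t toggles, double energy_per_toggle_j) {
        assert(energy_per_toggle_j >= 0.0);
        total_energy_j_ += static_cast<double>(toggles) * energy_per_toggle_j;
    }

    double total_energy_j() const { return total_energy_j_; }

    // Average power in watts over the elapsed simulated time in seconds.
    double average_power_w(double sim_time_s) const {
        assert(sim_time_s > 0.0);
        return total_energy_j_ / sim_time_s;
    }

private:
    double total_energy_j_ = 0.0;
};
```

A device model would hold one such object and call `add_operation` from every Read/Write callback, so the total is available at any point of the simulation.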

2) Debug Ports: They play the role of back-door access to the TLM, where the information of all registers is accessed by the user without interfering with the blocking transport calls on the initiator port.


For power estimation, the TLPM methodology uses both initiator and debug ports. The initiator ports track the activity of one register at a time. To get an accurate power estimate, however, multiple registers have to be tracked at the same instant. This is done by the initiator ports with the aid of the debug ports.
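The interplay of the two port types can be illustrated with a plain C++ mock that does not depend on SystemC. Everything here (`RegisterFile`, `on_write`, `debug_read`, the register names in the usage note) is a hypothetical stand-in: the front-door `write` models a bus transaction that fires a callback, while `debug_read` returns a stored value without firing anything.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Plain C++ mock (no SystemC). Names are illustrative assumptions.
class RegisterFile {
public:
    using WriteCallback =
        std::function<void(std::uint32_t old_v, std::uint32_t new_v)>;

    void on_write(const std::string& reg, WriteCallback cb) {
        assert(static_cast<bool>(cb));  // an empty callback is a usage error
        callbacks_[reg] = std::move(cb);
    }

    // Front-door access: models a bus transaction and fires the callback.
    void write(const std::string& reg, std::uint32_t v) {
        std::uint32_t old_v = regs_[reg];
        regs_[reg] = v;
        auto it = callbacks_.find(reg);
        if (it != callbacks_.end()) it->second(old_v, v);
    }

    // Back-door access: returns the stored value, no callback is fired.
    std::uint32_t debug_read(const std::string& reg) const {
        auto it = regs_.find(reg);
        return it == regs_.end() ? 0u : it->second;
    }

private:
    std::map<std::string, std::uint32_t> regs_;
    std::map<std::string, WriteCallback> callbacks_;
};
```

Inside the write callback of one register (say a hypothetical `LOAD`), the model can call `debug_read("CONTROL")` to learn the state of a second register at the same instant, which is what a conditional correlation factor needs.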

Implementing the different blocks is the first step in TLPM; power is then computed for each block. As mentioned in Section II, the Correlation Matrix, which includes the power per unit toggle per unit pin, is computed and used in TLM. Power is calculated for signals as follows:

1) Per TLM methodology, in the callback function of each register the data is stored in a variable called "value". The number of toggles is obtained by comparing the new value "newvalue" coming with each transaction with the old "value": the two are XORed and the result is loaded into a std::bitset "b", whose "b.count()" function returns the number of toggled bits. This number is then multiplied by the power per toggle that is obtained from the RTL model. This step is done after each Write/Read in every register and port in the design, and the result is accumulated in a variable for the total power.
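The toggle count described in this step can be sketched as a standalone function. The XOR and the `std::bitset` follow the paper's snippet; the function names and the `energy_per_toggle` parameter are assumptions added for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <climits>
#include <cstddef>

// Number of bits that differ between the stored and the incoming value.
inline std::size_t count_toggles(std::size_t value, std::size_t newvalue) {
    std::size_t n = newvalue ^ value;  // 1 wherever a bit flipped
    std::bitset<sizeof(std::size_t) * CHAR_BIT> b(n);
    return b.count();
}

// Contribution of one register access. energy_per_toggle comes from the
// RTL characterization; its unit is whatever that table uses.
inline double access_energy(std::size_t value, std::size_t newvalue,
                            double energy_per_toggle) {
    assert(energy_per_toggle >= 0.0);
    return static_cast<double>(count_toggles(value, newvalue)) *
           energy_per_toggle;
}
```

For example, writing 0x6 over a stored 0xA flips two bits, so the access contributes twice the per-toggle value.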

2) Counters in clock dividers, unlike registers, are not implemented in TLM, so the number of toggles cannot be obtained directly from the new and old values. The sequence of counter values is, however, known in advance (1, 2, 3, ...). The total number of toggles is obtained according to the clock divisor. The following series is used to obtain the total number of toggles of the counter:

(1)

where x is the maximum number that the counter reaches (i.e. counter sequence is 0, 1, 2, ..., x, 0, 1 ...).
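As an illustration of how such a total can be obtained, and not necessarily the series the authors use in equation (1), the sketch below counts the bit toggles of a plain binary up-counter over one period 0, 1, ..., x, 0. It assumes ordinary binary encoding: counting up from 0 to x flips bit i exactly floor(x / 2^i) times, and the wrap from x back to 0 clears every set bit of x. Both function names are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Toggles of a binary up-counter over one period 0, 1, ..., x, 0.
inline std::uint64_t counter_toggles_per_period(std::uint64_t x) {
    std::uint64_t toggles = 0;
    for (std::uint64_t q = x; q != 0; q >>= 1) {
        toggles += q;  // adds floor(x / 2^i) for i = 0, 1, 2, ...
    }
    toggles += std::bitset<64>(x).count();  // wrap from x back to 0
    return toggles;
}

// Brute-force reference used to cross-check the closed form.
inline std::uint64_t counter_toggles_brute_force(std::uint64_t x) {
    assert(x != UINT64_MAX);  // keeps the loop below finite
    std::uint64_t toggles = 0;
    for (std::uint64_t v = 0; v <= x; ++v) {
        std::uint64_t next = (v == x) ? 0 : v + 1;
        toggles += std::bitset<64>(v ^ next).count();
    }
    return toggles;
}
```

For x = 3 the period 0, 1, 2, 3, 0 flips 1 + 2 + 1 + 2 = 6 bits, which both functions return.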

The next section contains the implementation of the methodology. Two case studies are considered for simple timer and real SoC to implement TLPM. Besides, the key factors that affect the simulation time and power contributors are illustrated.

IV. METHODOLOGY IMPLEMENTATION

This section contains the implementation of the methodology. The first considered block is a standalone commercial IP timer and the second is a full ZYNQ-7000 SoC.

A. Commercial Timer SP804

ARM Dual-Timer Module (SP804) [27] is used as a commercial IP to apply the methodology on a single IP. Two identical programmable Free Running Counters (FRCs) are enclosed in the design. They have the ability to be configured to different modes (Free Running/Periodic/OneShot) for 32 or 16-bit operation. The existing two FRCs operate from a single timer clock with two enables. They have a prescaler that can divide the timer clock by 1, 16 or 256 as shown in the block diagram in Fig. 3.

Fig. 3. Block Diagram of ARM Dual-Timer Module (SP804)

Table I and Fig. 4 give an example of energy dissipation and signal correlation of the ARM Dual-Timer module. If two inputs are not correlated, the correlation factor is ZERO. On the other hand, if two inputs are fully correlated, the correlation factor is ONE. Signals LoadPeriodEn and PreScale are examples of a correlation factor of ONE, where their energies per toggle count are 10 nJ and 20 nJ, respectively. The contribution of the TimerEn signal is negligible. However, it affects another signal, NxtPreScale, which is a good example of Partial Correlation: if TimerEn = 0, the energy per toggle count of NxtPreScale is 30 nJ, and if TimerEn = 1, it is 40 nJ. The energy per unit toggle per bit is added to the Write and Read callback function of each register as explained in Section III. The number of bit toggles caused by each transaction is calculated from equation (1), so the energy of the transaction can be calculated by multiplying the number of toggles by the energy per unit toggle.
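The conditional lookup in this example can be sketched as follows. The numeric values are the example values quoted from Table I; the function names, the string-keyed interface, and treating the negligible TimerEn contribution as exactly zero are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Energy per toggle in nJ, using the example values from Table I.
inline double energy_per_toggle_nj(const std::string& signal, bool timer_en) {
    if (signal == "LoadPeriodEn") return 10.0;
    if (signal == "PreScale")     return 20.0;
    if (signal == "NxtPreScale")  return timer_en ? 40.0 : 30.0;  // partial correlation
    if (signal == "TimerEn")      return 0.0;  // negligible, taken as zero here
    assert(false && "signal not in the example table");
    return 0.0;
}

// Energy of one transaction: toggles multiplied by the per-toggle energy.
inline double transaction_energy_nj(const std::string& signal,
                                    unsigned toggles, bool timer_en) {
    return toggles * energy_per_toggle_nj(signal, timer_en);
}
```

With these values, three toggles of NxtPreScale cost 90 nJ while TimerEn is 0 and 120 nJ while it is 1.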

    size_t n; n = newvalue ^ value;
    std::bitset<sizeof(size_t) * CHAR_BIT> b(n);

[Fig. 3 block labels: two identical register sets (Load Register, Background Load Register, Value Register, Control Register, Raw Interrupt Status Register, Masked Interrupt Status Register, Interrupt Clear Register); APB Interface (R/W, Data, Register Address); Control Setting; signals TIMCLK, TIMCLKEN1, TIMCLKEN2, TIMINT1, TIMINT2]

Table I: Energy Dissipation Example in Timer SP804

Signal Name     Condition     Energy Value Per Toggle Count (nJ)
LoadPeriodEn    1             10
NxtPreScale     TimerEn==0    30
NxtPreScale     TimerEn==1    40
TimerEn         -             Negligible
PreScale        -             20

The advantage of the presented methodology cannot be fully demonstrated on a small block. The added value of TLPM is best illustrated by a large design. The speed-up in simulation for small devices could reach a single order of magnitude. For large SoCs, simulation at RTL could be impossible, and orders of magnitude of speed-up could be achieved with TLPM. A full SoC is considered in the following subsection.

Fig. 4. Block Signal Correlation Matrix Example

B. ZYNQ-7000 SoC

A large platform (ZYNQ-7000), shown in Fig. 5, is used to demonstrate the efficiency of TLPM on a full SoC. Xilinx All Programmable SoC (AP SoC) is the base for the ZYNQ-7000 family. This family consists of two systems: Processing System (PS) and Programmable Logic (PL). A dual-core ARM Cortex-A9 MPCore and some peripherals exist in the PS. The FPGA design is implemented in the PL [28]. In the following subsections, some blocks are used as examples for applying TLPM on ZYNQ. The details of these blocks of the SoC (GPIO, SPI, I2C and UART as examples), along with the modeling aspects, are discussed.

Fig. 5. Block Diagram of ZYNQ-7000

1) GPIO: The general purpose I/O (GPIO) is user programmable general purpose input/output controller. It is mainly used to implement functions that require simple output and input software controlled programmable signals. The GPIO peripheral provides access to 64 inputs from the Programmable Logic (PL) and 128 outputs to the PL through the EMIO interface. It also provides observation and control of up to 54 device pins via the MIO module. The GPIO is organized into four banks of registers that group related interface signals. Each GPIO can be dynamically and independently programmed as input, output, or interrupt sensing. Software can read all GPIO values within a bank using a single load instruction, or write data to one or more GPIOs using a single store instruction. The block diagram of the GPIO is illustrated in Fig. 6. The GPIO Channel starting from the software configured registers till the pins which describe the operation of Bank0 and Bank1 is illustrated in Fig. 7.

Fig. 6. Block Diagram of GPIO

[Fig. 6 block labels: GPIO Bank 0, GPIO Bank 1, GPIO Bank 2, GPIO Bank 3; MIO; EMIO interface to PL carrying EMIOGPIOI[31:0], EMIOGPIOO[31:0], EMIOGPIOTN[31:0] and EMIOGPIOI[63:32], EMIOGPIOO[63:32], EMIOGPIOTN[63:32]]

Fig. 7. GPIO Channel

The power dissipated in the GPIO during reading/writing comes mainly from accesses to the programming registers. The power dissipation of the Read/Write operation of the registers is common to all devices and is calculated in the same way. In the callback function of each register, the power dissipation due to Read/Write is calculated as explained in Section III. For the register operations, two types of correlation are considered.

A. Full Correlation: The bits of each register have the same power contribution. They have a correlation factor of ONE. The Energy per unit toggle per bit is added to the write and read callback function of each register as explained in section III. The Energy exerted by ARM Advanced Peripheral Bus (APB) signals (the programming port/protocol of the registers of the device) is also added and taken into consideration.

B. Partial Correlation: Each register has partial correlation with APB signals. Conditional correlation is applied. The main APB signals that affect the power of Read/Write operations are PSEL, PRESETn, PWRITE, PREAD. The appropriate activity of those signals is added to the SAIF file while calculating the power per unit toggle of the register.

2) SPI: Serial communication with many peripherals such as temperature sensors, memories, SD cards, analog converters, displays, pressure sensors and real-time clocks is performed using the Serial Peripheral Interface (SPI), which supports full duplex communication between master and slave. The SPI device supports up to three slaves, where the master drives the three slaves. It also drives the SPI reference clock that simulates the baud-rate division of the system reference clock. The system reference clock can be divided by 4, 8, 16, 32, 64, 128 or 256. A delay between the transmitted bytes can also be added. SPI can operate in two modes: Master mode and Slave mode. In Master mode, data flow from the APB bus to the Transmitter First In First Out (TXFIFO) buffer and then to the Master Output Slave Input (MOSI) data bus, while in Slave mode data move from APB to TXFIFO to Master Input Slave Output (MISO). At the same time, in Master mode, data move from the slave to MISO, then to the Receiver First In First Out (RXFIFO) buffer, and then to the APB bus. In Slave mode, data move from the slave to MOSI to RXFIFO to the APB bus. Fig. 8 contains the block diagram of the SPI module. The power dissipation of the SPI module is mainly divided into:

A. Power due to Read/Write operations on registers: It is explained in GPIO.

B. Clock power: The clock has ZERO correlation with other signals. That means its power is independent of any other signal. Clock power is added at the end of simulation.

C. Clock dividers power: The power of the clock dividers is calculated each time a byte is sent or received. The number of toggles per clock enable cycle is calculated from equation (1). The energy exerted by the clock dividers during sending/receiving a byte is calculated from equation (2).

\[ E_B = N_C \cdot N_T \cdot E_T \tag{2} \]

where $E_B$ (written EB in the text) is the energy exerted by the clock dividers per byte, $N_C$ is the number of baud rate cycles per byte, $N_T$ is the number of toggles per baud rate cycle (calculated from equation (1)), and $E_T$ is the energy per unit toggle in the counters of the clock dividers.

The calculated number (EB in equation (2)) is added every time a byte is sent or received. EB is accumulated in the Write callback function of the receiving port (DATA IN). The start of transmission can be configured to be manual or auto. In the Manual mode, the transmission starts when the software writes in a specific bit (Man start com) in the configuration register. In the Write callback function of this bit, EB is accumulated. In Auto mode, whenever a byte is written to the transmission FIFO (TXFIFO), transmission starts. In Auto mode, EB is accumulated in the Write callback of the TXFIFO.
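Equation (2) and the Auto-mode accumulation described above can be sketched as follows. The function `energy_per_byte`, the struct `SpiDividerEnergy`, and its `on_txfifo_write` hook are hypothetical names; the numbers in the usage note are arbitrary and not taken from the paper.

```cpp
#include <cassert>
#include <cstdint>

// Equation (2): energy of the clock dividers per transferred byte,
// EB = NC * NT * ET.
inline double energy_per_byte(std::uint64_t nc_cycles_per_byte,
                              std::uint64_t nt_toggles_per_cycle,
                              double et_energy_per_toggle) {
    assert(et_energy_per_toggle >= 0.0);
    return static_cast<double>(nc_cycles_per_byte) *
           static_cast<double>(nt_toggles_per_cycle) * et_energy_per_toggle;
}

// Hypothetical Auto-mode hook: each byte written to the TX FIFO adds EB.
struct SpiDividerEnergy {
    double eb = 0.0;     // energy per byte, from energy_per_byte()
    double total = 0.0;  // accumulated clock-divider energy
    void on_txfifo_write() { total += eb; }
};
```

With NC = 8, NT = 6 and ET = 0.5 (arbitrary units), EB is 24, and two bytes written to the TX FIFO accumulate 48.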

Fig. 8. Block Diagram of SPI

3) I2C: The Inter Integrated Circuit (I2C) device can be configured to run in different modes (Master Transmitter, Master Receiver, Slave Transmitter, Slave Receiver). It also supports slave monitoring mode, which can monitor the slave until it is ready. Moreover, clock stretching is supported. The address can be configured for 7 bits or 10 bits. The baud rate is also configurable: it is generated by dividing the CPU clock by two configurable values as shown in Fig. 9. Increasing the dividing value decreases the transmitting and receiving rate of I2C. In I2C, the transmitter must read an ACK from the receiver to go on with the transaction. Otherwise, the transmitter stops. For that reason, master and slave are used during testing to complete the appropriate handshaking [29]. The block diagram of the I2C module is shown in Fig. 10.

Fig. 9. Baud rate generator of the I2C

Fig. 10. Block Diagram of I2C

The power dissipation components of the I2C module are similar to those of SPI. There is independent power dissipation for the clock signal. There is also power for Read/Write from/to registers. For the power of the clock dividers, energy per byte (EB) is calculated using equation (2) as explained in the SPI subsection. The only difference is where EB is accumulated to the total energy in the TLM. In Master mode, whenever an address is written to the address register, the transaction starts. In the Write callback function of the address register, there is a for loop that performs the transaction (one iteration for each byte). EB is accumulated to the total energy at each iteration of this for loop. In Slave mode, the transactions go through 'I2C Slave' ports. EB is accumulated in the callback functions of those ports (Read callback in Slave Receiver mode and Write callback in Slave Transmitter mode).

4) UART: The Universal Asynchronous Receiver Transmitter (UART) is used for a different kind of serial communication. The design of the UART is implemented according to the block diagram in Fig. 11. It supports four modes with different baud rates. The modes are normal, automatic echo, local loopback, and remote loopback. The block encloses three divisions for the clock and a varying number of bytes to be sent or received along the different modes. The key factors that affect the simulation time and power dissipation are the number of bytes and the divisors of the clock to be used for transmitting and receiving. The reference clock can be divided by three values as shown in Fig. 12. Increasing the dividing value decreases the transmitting and receiving rate of the UART. The resulting value is the baud rate of the UART [28].

Fig. 11. Block Diagram of UART

Fig. 12. Baud rate generator of the UART

For the power of the clock dividers of the UART module, energy per byte (EB) is calculated from equation (2). EB is accumulated at the Write callback function for the receiving port. For transmission, EB is accumulated at the Write callback function for the transmission FIFO (TXFIFO).

The platform blocks are connected together through the APB bus [30]. It is used to select the required peripheral and to handle the transactions between these blocks and the APB bus. The next section contains the scenarios for both the Timer model and the ZYNQ-7000 SoC. TLPM is performed to estimate the total dynamic power.

V. EXPERIMENTAL RESULTS

Experiments are done on the Timer SP804 and the ZYNQ-7000 SoC. The verification of functionality between the TLM model, where SystemC is used, and the RTL model, where SystemVerilog is used, is performed by applying the same scenarios on both models. The comparison of the results between the two models is done using the Universal Verification Methodology (UVM) [25]. 80% code coverage is targeted. Power calculation on RTL is done using Design Compiler by Synopsys. UMC 130 nm CMOS technology with an operating voltage of 1.08 V is assumed. Simulations are done using Questa-Sim and Vista by Mentor Graphics. The machine which is used to run the scenarios has a third-generation Core i7 processor operating at 2.5 GHz.

[Baud rate generator block labels: 1 to 4 Divider; 1 to 64 Divider; CPU_1x_clock; clock_enable; Divisor_a; Divisor_b]

A. Stand-alone Timer

To achieve good coverage of the Timer functionality, five scenarios are applied using UVM:

• Scenario A: The pre-scale is 256, the Timer clock period is 2 ns, and the simulation time is 10 ms. The timer operates in One-shot mode: the counter decrements every 256 clock cycles, and the timer stops when the counter reaches 0.

• Scenario B: The Timer clock period is 2 ns, the pre-scale is 1, and the simulation time is 0.5 ms. The counter decrements every clock cycle. The timer operates in Free running mode: when the counter reaches 0, the Timer wraps to the maximum value.

• Scenario C: The pre-scale is 1, the Timer clock period is 4 ns, and the simulation time is 2 ms. The timer operates in Periodic mode: the Timer wraps to a pre-defined value when the counter reaches 0.

• Scenario D: Periodic mode is configured with a pre-scale of 1, a Timer clock period of 2 ns, and a simulation time of 5 ms.

• Scenario E: The pre-scale is 16, the Timer clock period is 2 ns, and the simulation time is 1 ms. The counter decrements every 16 clock cycles, and the model is configured in Free running mode. The two main signals on which the operation depends are the Timer clock and an internal signal for the 16-cycle check. The Load register is written with different values during the simulation. The power dissipation differs from one scenario to another.
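The three counting modes used in the scenarios above can be illustrated with a small behavioral sketch. This is not the authors' SystemC model: the class name `TimerSketch`, its method names, and the default 16-bit counter width are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimerSketch:
    """Hypothetical down-counter with a pre-scaler (illustration only)."""
    mode: str          # "one_shot", "free_running", or "periodic"
    prescale: int      # timer clock cycles per decrement, e.g. 1, 16, 256
    load: int          # initial value, also the reload value in periodic mode
    width: int = 16    # assumed counter width in bits

    def __post_init__(self):
        self.count = self.load
        self.cycles = 0
        self.running = True

    def tick(self):
        """Advance the model by one timer clock cycle."""
        if not self.running:
            return
        self.cycles += 1
        if self.cycles % self.prescale:
            return                                  # pre-scaler not expired yet
        if self.count > 0:
            self.count -= 1
            if self.count == 0 and self.mode == "one_shot":
                self.running = False                # one-shot: stop at zero
        elif self.mode == "free_running":
            self.count = (1 << self.width) - 1      # wrap to the maximum value
        elif self.mode == "periodic":
            self.count = self.load                  # wrap to the pre-defined value


# One-shot with a pre-scale of 256: two decrements take 512 clock cycles.
timer = TimerSketch(mode="one_shot", prescale=256, load=2)
for _ in range(512):
    timer.tick()
print(timer.count, timer.running)
```

In this sketch the wrap or reload happens on the first pre-scaled tick after the counter has reached 0; the exact cycle on which a real timer reloads is not specified here.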

Table II lists the power estimated for the different operating scenarios of the Timer, together with the error of the TLM estimate relative to RTL. The average error is 1.5%, and the maximum absolute error is 1.7%. Table III gives the execution time of both models for each scenario. TLPM speeds up the simulation for power estimation by up to 7 times relative to RTL simulation.

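Table II: POWER ESTIMATION FOR DIFFERENT OPERATING SCENARIOS OF ARM TIMER SP804

| Scenario | RTL Power (mW) | TLM Power (mW) | Error (%) |
|----------|----------------|----------------|-----------|
| A        | 1.64           | 1.63           | 1.10      |
| B        | 2.26           | 2.30           | -1.70     |
| C        | 1.40           | 1.42           | -1.55     |
| D        | 1.70           | 1.67           | 1.60      |
| E        | 1.75           | 1.72           | 1.50      |
| Average  | 1.75           | 1.75           | 1.50      |

Table III: SIMULATION TIME FOR DIFFERENT OPERATING SCENARIOS OF ARM TIMER SP804

| Scenario | RTL Simulation Time (sec) | TLM Simulation Time (sec) | RTL/TLM Ratio |
|----------|---------------------------|---------------------------|---------------|
| A        | 3530                      | 533                       | 6.6           |
| B        | 1775                      | 263                       | 6.7           |
| C        | 723                       | 103                       | 7.0           |
| D        | 362                       | 55                        | 6.6           |
| E        | 190                       | 30                        | 6.33          |
| Average  | 6580                      | 984                       | 6.6           |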
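The RTL/TLM ratio column of Table III follows directly from the two time columns. The short sketch below recomputes it from the tabulated values; the names `TIMES` and `speedup` are illustrative and do not come from the paper.

```python
# Simulation times in seconds, copied from Table III: scenario -> (RTL, TLM).
TIMES = {
    "A": (3530, 533),
    "B": (1775, 263),
    "C": (723, 103),
    "D": (362, 55),
    "E": (190, 30),
}


def speedup(rtl_seconds: float, tlm_seconds: float) -> float:
    """Return how many times faster the TLM run is than the RTL run."""
    return rtl_seconds / tlm_seconds


ratios = {name: speedup(rtl, tlm) for name, (rtl, tlm) in TIMES.items()}
for name, ratio in ratios.items():
    print(f"Scenario {name}: {ratio:.2f}x")
```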

B. ZYNQ-7000 Models Power Estimation

This subsection identifies the parameters that affect power in some of the ZYNQ-7000 devices and evaluates power as those parameters change. One of the main parameters affecting power is the clock division.

1) GPIO: GPIO has one main operating scenario. The MIO pins are configured by storing specific values in the GPIO registers. Two options are used to write data to output pins. Option 1: the GPIO pins are updated using the gpio.DATA_0 register. Option 2: the MASK_DATA_x_MSW/LSW registers, which mask the most/least significant half of the bank, are used to update one or more GPIO pins. Two options are also used to read from input pins. Option 1: the gpio.DATA_RO_x register of each bank is read by the APB master. Option 2: interrupt logic is used on the input pins. Power values for RTL and TLM are shown in Table IV. The error in estimating the power dissipation of GPIO is less than 2%.

Table IV: DYNAMIC POWER ESTIMATION FOR THE PRIMARY OPERATING CONDITION OF GPIO

| RTL (uW) | TLM (uW) | Error (%) |
|----------|----------|-----------|
| 258.02   | 253.23   | 1.85      |
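The masked-write option described for the GPIO can be sketched as a pure function on a 16-bit half-bank. The register layout and mask polarity used here are assumptions that should be checked against the technical reference manual [28]: the upper 16 bits of the written word are taken as the mask, the lower 16 bits as the data, and a mask bit of 0 is taken to mean "update this pin". The function name `masked_write` is hypothetical.

```python
def masked_write(pins: int, word: int) -> int:
    """Apply a 32-bit mask/data word to a 16-bit half-bank of output pins.

    Assumed layout: word[31:16] is the mask, word[15:0] is the data.
    Assumed polarity: mask bit 0 updates the pin, mask bit 1 leaves it unchanged.
    """
    mask = (word >> 16) & 0xFFFF
    data = word & 0xFFFF
    update = ~mask & 0xFFFF          # pins selected for update
    return (pins & ~update & 0xFFFF) | (data & update)


# Clear only pin 0 of a half-bank that currently reads 0x00FF.
print(hex(masked_write(0x00FF, (0xFFFE << 16) | 0x0000)))
```

Under these assumptions a single write changes only the selected pins, so the APB master does not need a read-modify-write sequence.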

2) SPI: Experiments are run while changing the clock division. Power is calculated in RTL and TLM to demonstrate how closely the power of the TLM model matches that of the RTL model. Power values for RTL and TLM are shown in Table V. Across the different operating scenarios, the absolute error in power estimation of SPI is less than 1.5%.

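Table V: DYNAMIC POWER ESTIMATION FOR DIFFERENT OPERATING CONDITIONS OF SPI

| Clock Division | RTL Power (mW) | TLM Power (mW) | Error (%) |
|----------------|----------------|----------------|-----------|
| 4              | 1.145          | 1.129          | 1.33      |
| 8              | 1.024          | 1.021          | 0.312     |
| 16             | 0.940          | 0.940          | -0.039    |
| 32             | 0.890          | 0.892          | -0.231    |
| 64             | 0.856          | 0.863          | -0.803    |
| 128            | 0.839          | 0.845          | -0.736    |
| 256            | 0.830          | 0.837          | -0.896    |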
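The error column of Table V is consistent with a signed relative error taken with respect to RTL, positive when TLM underestimates the power. That definition is inferred from the signs in the table and is an assumption, as are the names `TABLE_V` and `relative_error_percent` in the sketch below.

```python
# (clock division, RTL power in mW, TLM power in mW, reported error in %)
TABLE_V = [
    (4, 1.145, 1.129, 1.33),
    (8, 1.024, 1.021, 0.312),
    (16, 0.940, 0.940, -0.039),
    (32, 0.890, 0.892, -0.231),
    (64, 0.856, 0.863, -0.803),
    (128, 0.839, 0.845, -0.736),
    (256, 0.830, 0.837, -0.896),
]


def relative_error_percent(rtl: float, tlm: float) -> float:
    """Signed error of the TLM estimate relative to RTL (assumed definition)."""
    return (rtl - tlm) / rtl * 100.0


recomputed = {div: relative_error_percent(rtl, tlm) for div, rtl, tlm, _ in TABLE_V}
for div, rtl, tlm, reported in TABLE_V:
    print(f"div {div:>3}: recomputed {recomputed[div]:+.3f}%, reported {reported:+.3f}%")
```

Because the tabulated powers are rounded to three decimals, the recomputed values only approximate the reported column.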

3) I2C: To test the power model of I2C, two I2C modules are used: one is configured as master and the other as slave. Table VI shows the power obtained from both RTL and TLM for different configurations. Different numbers of bytes and different clock divisions are used to estimate power. Across the different operating scenarios, the error in power estimation for I2C is less than 3%.


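Table VI: DYNAMIC POWER ESTIMATION FOR DIFFERENT OPERATING CONDITIONS OF I2C

| Clock Division | Number of bytes | RTL Power (uW) | TLM Power (uW) | Error (%) |
|----------------|-----------------|----------------|----------------|-----------|
| 144            | 10              | 177.02         | 176.00         | 0.5       |
| 3              | 50              | 179.29         | 176.00         | 1.4       |
| 1              | 30              | 182.50         | 178.00         | 2.2       |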

4) UART: For the UART model, the normal transmission mode is tested with different configurations by changing the number of bytes and the clock division, as shown in Table VII. Across the different operating scenarios, the absolute error in power estimation is less than 1%. TLPM achieves high accuracy in power estimation for the different TLM models when compared with RTL, which makes it efficient for high-level power estimation.

Table VII: DYNAMIC POWER ESTIMATION FOR DIFFERENT OPERATING CONDITIONS OF UART

| Clock Division | Number of bytes | RTL Power (uW) | TLM Power (uW) | Error (%) |
|----------------|-----------------|----------------|----------------|-----------|
| 8              | 15              | 142.43         | 143.00         | -0.39     |
| 15             | 15              | 143.93         | 144.00         | -0.33     |
| 255            | 5               | 143.18         | 143.00         | 0.57      |

C. Power Estimation of ZYNQ-7000 SoC

In the previous subsection, power is calculated for each device model and error between RTL and TLM is obtained. This subsection aims to have power profiling for the full SoC on both RTL and TLM. Four scenarios are applied as follows. The GPIO operating condition which is described in subsection B of this section is used in all coming scenarios.

• Scenario I: The SPI master is configured to transmit 256 bytes, which are received by the slave. The APB clock is divided by 32. I2C is set to master-receiver and slave-transmitter mode, and the slave is configured to transmit 256 bytes. To achieve the required baud rate, the reference clock is divided by two values, 1 and 10, so the reference clock is finally divided by 1 * 10 = 10. The UART is configured in normal mode to transmit 8 bytes. Its reference clock is divided by 8 and by two configurable dividers, both set to 255 in this scenario, so the clock is finally divided by 8 * 255 * 255 = 520200.

• Scenario II: For SPI, the clock is divided by 64. The I2C master is configured to transmit 512 bytes. The clock is divided by two values, 2 and 20, so the reference clock is finally divided by 2 * 20 = 40. The UART is configured to transmit 16 bytes. The clock dividers are configured to divide the clock by 255 and 511, so the clock is finally divided by 8 * 255 * 511 = 1042440.

• Scenario III: The I2C master is configured to transmit 1024 bytes. The clock is divided by two values, 3 and 31, so the reference clock is finally divided by 3 * 31 = 93. The UART is configured to transmit 32 bytes. The clock dividers are configured to divide the clock by 255 and 1023, so the clock is finally divided by 8 * 255 * 1023 = 2086920.

• Scenario IV: For SPI, the clock is divided by 256. The I2C master is configured to transmit 2048 bytes. The clock is divided by two values, 3 and 63, so the reference clock is finally divided by 3 * 63 = 189. The UART is configured to transmit 64 bytes. The clock dividers are configured to divide the clock by 255 and 4095, so the clock is finally divided by 8 * 255 * 4095 = 8353800.
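The overall clock divisors in the four scenarios above are products of the individual divider settings. The sketch below recomputes them; the function names and the `SCENARIOS` dictionary are illustrative, and the fixed UART factor of 8 is taken from the scenario descriptions.

```python
def i2c_divisor(div_a: int, div_b: int) -> int:
    """Overall I2C reference-clock division from its two divider values."""
    return div_a * div_b


def uart_divisor(div_a: int, div_b: int, fixed: int = 8) -> int:
    """Overall UART clock division: fixed factor times two configurable dividers."""
    return fixed * div_a * div_b


# Divider settings per scenario, copied from the scenario descriptions.
SCENARIOS = {
    "I": {"i2c": (1, 10), "uart": (255, 255)},
    "II": {"i2c": (2, 20), "uart": (255, 511)},
    "III": {"i2c": (3, 31), "uart": (255, 1023)},
    "IV": {"i2c": (3, 63), "uart": (255, 4095)},
}

for name, cfg in SCENARIOS.items():
    print(name, i2c_divisor(*cfg["i2c"]), uart_divisor(*cfg["uart"]))
```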

Total power is calculated for these scenarios in RTL and TLM, as shown in Table VIII, and the simulation time is reported in Table IX. The RTL simulation time increases rapidly as the clock division increases. In TLM, the simulation time is constant because the clock is abstracted as a variable at the transaction level. In spite of the significant difference in simulation time, the error in TLM power estimation does not exceed 2%. The power of Scenario IV could not be estimated on RTL because the VCD file used for power estimation exceeds 1 TB, which could not be stored on a normal PC, and the RTL simulation time of this scenario exceeds 80 hours. Using TLPM, power estimation of Scenario IV becomes feasible with a simulation time of 66 seconds, as illustrated in Table VIII and Table IX. TLM thus provides accurate power estimation with faster simulation, which makes TLPM efficient for power profiling. TLPM achieves a simulation speed-up of up to 628 times, and almost 290 times on average across the scenarios. Whereas RTL fails to simulate complicated scenarios, TLM not only achieves a speed-up of more than two orders of magnitude in simulation but also allows heavy SW to run and the power to be estimated accurately.
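The reason the TLM run time stays flat while the RTL run time grows with the clock division can be seen in a toy comparison. This is a sketch under simplifying assumptions, not the authors' simulator: `rtl_style` evaluates every reference-clock cycle, while `tlm_style` treats the divisor as a plain variable and computes the elapsed time in one step. Both function names are hypothetical.

```python
from typing import Tuple


def rtl_style(n_ticks: int, divisor: int) -> Tuple[int, int]:
    """Step through every reference-clock cycle until n_ticks divided-clock
    ticks have occurred. Returns (elapsed_cycles, steps_evaluated)."""
    elapsed = steps = ticks = counter = 0
    while ticks < n_ticks:
        elapsed += 1
        steps += 1
        counter += 1
        if counter == divisor:       # divided clock produces one tick
            counter = 0
            ticks += 1
    return elapsed, steps


def tlm_style(n_ticks: int, divisor: int) -> Tuple[int, int]:
    """Treat the divisor as a variable and annotate the elapsed time directly.
    Returns (elapsed_cycles, steps_evaluated)."""
    return n_ticks * divisor, 1


for divisor in (32, 64, 256):
    print(divisor, rtl_style(8, divisor), tlm_style(8, divisor))
```

Both styles report the same elapsed time, but the work done in the cycle-stepping style scales with the divisor, whereas the abstracted style does a constant amount of work per transaction.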

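Table VIII: DYNAMIC POWER ESTIMATION FOR DIFFERENT OPERATING CONDITIONS FOR ZYNQ-7000

| Scenario | RTL Power (mW) | TLM Power (mW) | Error (%) |
|----------|----------------|----------------|-----------|
| I        | 1.286          | 1.260          | 2         |
| II       | 1.271          | 1.259          | 1         |
| III      | 1.265          | 1.258          | 0.5       |
| IV       | -              | 1.257          | -         |
| Average  | 1.274          | 1.259          | 1.1       |

Table IX: SIMULATION TIME FOR DIFFERENT OPERATING CONDITIONS FOR ZYNQ-7000

| Scenario | RTL Simulation Time (sec) | TLM Simulation Time (sec) | Speedup |
|----------|---------------------------|---------------------------|---------|
| I        | 1616                      | 66                        | 24.4    |
| II       | 14400                     | 66                        | 216.1   |
| III      | 41437                     | 66                        | 627.8   |
| IV       | -                         | 66                        | -       |
| Average  | 19151                     | 66                        | 289.499 |

VI. CONCLUSIONS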

A new methodology, Transaction Level Power Modeling (TLPM), is proposed for dynamic power estimation using Transaction Level Modeling (TLM). The methodology is first applied at the block level. The full SoC is then simulated with real software scenarios. The methodology is implemented on commercial tools and flows.


The efficiency of TLPM is demonstrated. The methodology speeds up the simulation of different scenarios by up to 628 times while keeping the average error in power estimation below 3%. With TLPM, a speed-up of more than two orders of magnitude is achieved in the simulation needed to estimate power for a large SoC. The product life cycle could be enhanced by using TLPM to estimate the power consumption at early design phases.

ACKNOWLEDGEMENT

The authors would like to thank Eng. Zeiad Abd El-Twab, Eng. Ahmed Abdel Haleem and Eng. Amr Hany from Mentor Graphics for their assistance and support with the tools.

REFERENCES

[1] "Transaction Level Modeling", http://www.accellera.org.

[2] Bamberg L. and Garcia-Ortiz A., "High-Level Energy Estimation for Submicrometric TSV Arrays." IEEE Transactions on Very Large Scale Integration (VLSI) Systems, v.25, no.10, p.2856-2866, October 2017.

[3] G.B. Vece, M. Conti and S. Orcioni. "Transaction-level power analysis of VLSI digital systems." INTEGRATION, the VLSI journal. pp. 116-126, 2015.

[4] Subodh Gupta and Farid N. Najm. "Power modeling for high-level power estimation." IEEE Transactions on Very Large Scale Integration (VLSI) Systems. v.8 n.1, p.18-29, Feb. 2000.

[5] G. Beltrame, D. Sciuto and C. Silvano. "Multi-Accuracy Power and Performance Transaction-Level Modeling." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. v.26 n.10, p.1830-1842, October 2007.

[6] Dhanwada, Nagu, Ing-Chao Lin, and Vijay Narayanan. "A Power Estimation Methodology for SystemC Transaction Level Models." Proceedings of IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. pp. 142-147, September 2005.

[7] "PKtool Power Kernel tool", http://pktool.sourceforge.net.

[8] Vece, Giovanni B., and Massimo Conti. "Power Estimation in Embedded Systems within a SystemC-based Design Context: the PKtool Environment." Intelligent solutions in Embedded Systems. pp. 179-184, June 2009.

[9] Greaves, D., and Yasin, M. "TLM POWER3: Power Estimation Methodology for SystemC TLM 2.0." Models, Methods, and Tools for Complex Chip Design. Springer International Publishing, pp. 53-68, August 2014.

[10] Miltos D. Grammatikakis, Stratos Politis, Jean-Pierre Schoellkopf and Constantine Papadas. "System-Level Power Estimation Methodology using Cycle- and Bit-Accurate TLM." In Design, Automation and Test in Europe Conference and Exhibition (DATE), 2011, pp. 1-2.

[11] Cai, Lukai, and Daniel Gajski. "Transaction Level Modeling: An Overview." Proceedings of IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. pp. 19-24, October 2003.

[12] Helmstetter, Claude, Tayeb Bouhadiba, Matthieu Moy, and Florence Maraninchi. "Fast and Modular Transaction-Level-Modeling." IEEE Transactions on VLSI Systems. pp. 501-513, 2006.

[13] Bouhadiba, Tayeb, Matthieu Moy, and Florence Maraninchi. "System-Level Modeling of Energy in TLM for Early Validation of Power and Thermal Management." Proceedings of the Conference on Design, Automation and Test in Europe. pp. 1609-1614, March 2013.

[14] D. Brooks, V. Tiwari, and M. Martonosi. "Wattch: A framework for architectural-level power analysis and optimizations." In Proceedings International Symposium on Computer Architecture. vol. 28, 2000.

[15] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye. "Energy-driven integrated hardware-software optimizations using SimplePower." ACM SIGARCH Computer Architecture News. vol. 28, pp. 95-106, 2000.

[16] Wei Wang; Yuan-Yuan Xu; Chua-Chin Wang, "Dynamic power estimation for ROM-less DDFS designs using switching activity analysis," in The Proceedings of IEEE International SoC Design Conference. pp. 280-281, 2017.

[17] Niar. Awais Yousaf; Shahid Masud, "Stochastic model based dynamic power estimation of microprocessor using Imperas simulator," in Proceedings of Annual IEEE Systems Conference, pp. 1-8, 2016.

[18] Y. Nasser; J.-C. Prévotet; Maryline Hélard; J. Lorandel, "Dynamic power estimation based on switching activity propagation," in The Proceedings of IEEE International Conference on Field Programmable Logic and Applications (FPL), pp. 1-2, 2017.

[19] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi. "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration." In Proceedings of the Design, Automation and Test in Europe. 2009, pp. 423-428.

[20] Amr B. Darwish, Magdy A. El-Moursy, and Mohamed Dessouky. "Transaction Level Power Modeling (TLPM) Methodology." Proceedings of International Workshop on Microprocessor and SOC Test and Verification. pp. 61-64, March 2017.

[21] "Power Compiler User Guide", http://www.synopsys.com.

[22] "QuestaSim User Guide", https://www.mentor.com.

[23] Chadha, Rakesh, and Jayaram Bhasker. An ASIC Low Power Primer: Analysis, Techniques and Specification. Springer Science and Business Media, 2012.

[24] "Design Compiler User Guide", http://www.synopsys.com.

[25] "Universal Verification Methodology", http://www.accellera.org.

[26] "Vista User Guide", https://www.mentor.com.

[27] "ARM Dual-Timer Module (SP804) Technical Reference", http://www.arm.com.

[28] "Zynq-7000 All Programmable SoC Technical Reference Manual", https://www.xilinx.com.

[29] "I2C Bus Specifications", https://www.i2c-bus.org/

[30] "AMBA Specification", http://www.arm.com/
